-
Notifications
You must be signed in to change notification settings - Fork 22
/
hw13.Rnw
99 lines (67 loc) · 7.53 KB
/
hw13.Rnw
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
\documentclass[12pt]{article}
\usepackage{fullpage,hyperref,verbatim}\setlength{\parskip}{3mm}\setlength{\parindent}{0mm}
\begin{document}
\begin{center}\bf
Homework 13. Due by 5pm on Thursday 12/2.
A workflow for reproducible statistical research: combining Latex and R with knitr
\end{center}
There are numerous advantages to writing a statistics paper in such a way that the tables, figures and other quantitative results are automatically generated from chunks of code included in the document. These include:
\begin{enumerate}
\item Changing the data. Questions like ``How stable are my conclusions? What happens to all my figures and tables if I omit the 5 smallest states from my panel of 50 states?'' can be rapidly answered. The easier it is to investigate new analyses, the more things you explore.
\item Effective collaboration. All coauthors can read, run and modify all the code that produced the figures in the current version of a circulated draft.
\item Debugging. If your adviser asks ``How exactly did this number get produced?'' you can give a rapid, precise and accurate answer.
\item Updating. If you are presenting code (e.g., a lab presentation) and you want to make changes where necessary for a new software version, you simply re-run the document.
\item Revisions. 4 months after you submitted the paper, when the referee reports come back, you will be glad if you have your work organized in this way!
\end{enumerate}
Rmarkdown and Jupyter are convenient platforms for exploratory investigations, but Rnw (a format associated with the R package knitr) is better placed to generate Latex for publication-quality pdf articles. This homework investigates a workflow, meaning a set of tools and procedures that together get research done effectively, used for the research project at \url{https://github.com/ionides/bagged_filters}
The code is somewhat complex and involves various features that may be new to you. You are welcome to ask your peers for help if you get stuck. Edit the file \texttt{810f21/hw13.Rnw} with your answers, build a pdf and submit it to Canvas.
To get started, clone the \texttt{bagged\_filters} git repository to your laptop, as in Homework 9. There are two pdf files:
\begin{itemize}
\item \texttt{ms.pdf}, the main article.
\item \texttt{si/si.pdf}, an online supplement for the article.
\end{itemize}
We will focus on ms.pdf.
\begin{enumerate}
\item The source file for ms.pdf is ms.Rnw, a file in R noweb format designed to be run by \texttt{knitr::knit()}. Rstudio runs \texttt{knit()} automatically when you ask it to build from an Rnw file, but for workflows based on text commands it is good practice to call \texttt{knit()} directly from R. Rnw files simply combine chunks of \LaTeX with chunks of R code.
Have you used Rnw format before, either through Rstudio or not? Rnw has some similarities with the Rmarkdown (Rmd) format. Rmd is slightly simpler and quicker for some tasks, but more difficult than Rnw for fine control of Latex.
YOUR ANSWER HERE
\item There are various ways to compile ms.Rnw to ms.pdf. All of them need the necessary R packages, which you may need to install on your laptop or greatlakes or anywhere else you try running the code.
Note that before installing \texttt{spatPomp} you will need to have \texttt{pomp} installed, for which you may need to consult the instructions at \url{https://kingaa.github.io/pomp/install.html}.
The installation of \texttt{pomp} is nontrivial because this package carries out compilation of C code, so you need to have a C compiler installed and talking properly to R. Time spent figuring this out is not entirely wasted.
Now, in an R session running in the \texttt{bagged\_filters} directory, you can run
<<,eval=FALSE>>=
library(knitr)
knit("ms.Rnw")
@
If all is well, this will generate a file \texttt{ms.tex} which can be used to produce \texttt{ms.pdf} by running pdflatex. Likely, issues will arise that need to be solved to get this working. Spend a reasonable amount of time trying to get this working. It is okay if you cannot get the code to run, since many of the questions do not depend on this. Report on whether you were successful, what problems you overcame, and where you got stuck.
YOUR ANSWER HERE.
\item Have you used \texttt{make} before? (\url{https://en.wikipedia.org/wiki/Make_(software)}) This is a standard tool for organizing scientific coding projects, and it is installed by default on Mac and Linux systems. The \texttt{bagged\_filters} directory has a Makefile, so you can run
<<,eval=FALSE>>=
make ms.pdf
@
at a terminal prompt to build ms.pdf from ms.Rnw. This just runs \texttt{knit} followed by \texttt{pdflatex} so it cannot work unless the separate steps are working. For debugging, it can be better to run \texttt{knit} and \texttt{pdflatex} sequentially.
Try this, and report briefly
YOUR ANSWER HERE.
\item The manuscript can also be built on greatlakes, and this is appropriate for a production version having numerical calculations too extensive for a laptop. Identify the critical lines of code to enable the program to run on either greatlakes or a laptop.
YOUR ANSWER HERE
Optionally, work on building ms.tex on greatlakes, by running
<<,eval=FALSE>>=
sbatch ms.sbat
@
This requires cloning the git repository to greatlakes and installing all necessary R packages locally in your greatlakes account, a somewhat tedious activity but maybe worthwhile practice at getting things set up correctly.
\item Writing a reasonably large reproducible document combining text and code, you cannot avoid the issue of caching. You do not want to re-run all computations each time you edit any text in the document, so you must save (i.e., cache) results that do not need to be recomputed. Ideally, when we edit code we would re-run only the partial results that have changed as a consequence of the edit. Sadly, it is intractable to automate this in a foolproof way. The knitr code chunk option \texttt{cache=TRUE} re-runs a code chunk if that particular chunk is edited. It is necessary to delete all cached files (e.g., \texttt{rm -rf cache}) occasionally to rebuild the cache correctly. In ms.Rnw, the \texttt{stew()} function from the \texttt{pomp} package is used to give additional manual control of caching the most time-consuming results.
Have you had any prior experience working with cache on reproducible documents?
YOUR ANSWER HERE.
\item For debugging, it is helpful to have a quick version of the code which can be run to check for errors before starting a long, expensive computing job. Setting \texttt{run\_level=1} in \texttt{ms.Rnw} gives a version of the code that runs in a few seconds. If you can successfully compile \texttt{ms.tex} or \texttt{ms.pdf}, try deleting all the cached results from \texttt{run\_level=1} by
<<,eval=F>>=
rm -rf *_1
@
Then if you knit the document you will run quick versions of all the computations. Did that work for you?
If you are new to Linux/Unix, work out what the parts of the command \texttt{rm -rf *\_1} do. The \texttt{f} flag may not be necessary for you.
YOUR ANSWER HERE.
\item Workflows for writing manuscripts are build up over years, borrowed, shared, and modified for different purposes and evolving technologies. Compare this workflow with the range of techniques you already use.
YOUR ANSWER HERE.
\item Notice how the random number generator seed is set to give reproducible results (e.g., search for ``seed'' in ms.Rnw). Subtle problems can arise when setting seeds for parallel computations. Can you think of any? This code attempts to deal with them via the \texttt{doRNG} R package.
YOUR ANSWER HERE.
\end{enumerate}
\end{document}