Submission of Global LMIC Reports via Orderly to the DIDE Cluster

This is a working R package for submitting global fits to the DIDE cluster.

Installation

# at a shell prompt
git clone https://github.com/mrc-ide/glodide.git
cd glodide
open glodide.Rproj

# then, in the R session opened by the project
devtools::install_deps()

Overview

The structure within the analysis directory is as follows:

analysis/
    ├── 01_submission/          # submission scripts
    └── data/
        ├── DO-NOT-EDIT-ANY-FILES-IN-HERE-BY-HAND
        ├── raw_data/           # orderly bundles provided from bundling in global-lmic-reports-orderly
        └── derived_data/       # orderly outputs produced from running the reports in raw_data

Process for submitting Jobs

This repository is used for submitting orderly reports to the DIDE cluster. An outline of the steps is below, with a minimal code sketch of the round trip after the list:

Run in global-lmic-reports-orderly

  1. Execute the bundle script in the global-lmic-reports-orderly repository to produce orderly bundles in analysis/data/raw_data in glodide

Run in glodide

  1. Submit the bundles to the DIDE HPC cluster using analysis/01_submission.R, which uses didehpc to queue the bundle runs

Run in global-lmic-reports-orderly

  1. Check fits are correct using code in global-lmic-reports-orderly and resubmit any countries that need to be rerun/run for longer
  2. Pull correctly run orderly bundles from analysis/data/derived_data into global-lmic-reports-orderly
  3. Compile gh-pages and push to mrc-ide/global-lmic-reports
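For orientation, the whole round trip looks roughly like the sketch below. This is a minimal illustration assuming the orderly (1.x) bundle functions and the ctx/config objects built in analysis/01_submission.R; the report name, parameters and paths are placeholders, and the bundle script plus analysis/01_submission.R remain the authoritative versions.

# in global-lmic-reports-orderly: pack a report into a bundle zip inside glodide
orderly::orderly_bundle_pack("path/to/glodide/analysis/data/raw_data",
                             "lmic_reports_vaccine",
                             parameters = list(iso3c = "AFG"))

# in glodide: queue every bundle on the cluster via didehpc
obj <- didehpc::queue_didehpc(ctx, config = config)
bundles <- list.files("analysis/data/raw_data", full.names = TRUE)
grp <- obj$lapply(bundles, orderly::orderly_bundle_run,
                  workdir = "analysis/data/derived_data")
grp$status()

# back in global-lmic-reports-orderly: import each completed bundle
orderly::orderly_bundle_import("path/to/glodide/analysis/data/derived_data/<id>.zip")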

The rationale for separating these steps between the two repositories is to ensure that the global-lmic-reports-orderly checkout is not on the network share, whose file backups can cause file-locking issues with the orderly database. In addition, didehpc currently requires the working directory when submitting jobs to be on the network share, so this separation seems the best approach for now.

Troubleshooting

1. LaTeX and PDF Compile Issues

2. Google Tag Pandoc Issues

3. Context 24 Core Template

4. Didehpc Errors

5. Reconcile Errors

6. Mapping Network Drives

7. Updating packages in context

8. Peculiar didehpc failed jobs


1. LaTeX and PDF Compile Issues

PDF compilation was initially tricky: the default setup did not have pandoc on a network share and could not correctly compile the PDF document from orderly runs. As a result, both TinyTeX and pandoc were installed onto the network share at L:/OJ (pandoc was installed locally and then copied over to its location on the network share):

tinytex::install_tinytex(dir = "L:/OJ/TinyTex")  # TinyTeX onto the network share
installr::install.pandoc()  # pandoc, installed locally then copied to the network share

In the global-lmic-reports-orderly lmic_reports_vaccine run script, we then had to specify the following to get Rmd to compile PDF documents correctly, with much of the guidance for the last two lines below coming from the TinyTeX debugging guide:

# pandoc linking
if(file.exists("L:\\OJ\\pandoc")) {
  rmarkdown:::set_pandoc_info("L:\\OJ\\pandoc")
  Sys.setenv(RSTUDIO_PANDOC="L:\\OJ\\pandoc")
  tinytex::use_tinytex("L:\\OJ\\TinyTex")
  tinytex::tlmgr_update()
}

2. Google Tag Pandoc Issues

Each page of the compiled website has an html header with Google Analytics set up. This was originally included as an html file that was specified to appear before the body in the compiled html files. On the previous Azure server this worked fine; however, after switching to the DIDE cluster there would occasionally be 404 errors when rendering the html page, due to a network error related to fetching the Analytics code. As a result, we swapped to including, as plain text, the html that the Google Analytics html file was fetching. Hopefully this fixes the issue, but if any jobs fail when rendering the html with similar errors then check whether this text has changed or is wrong.
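For reference, the two approaches correspond roughly to the sketch below; the file name ga.html and the render call are illustrative rather than the exact code in the reports.

# old approach (sketch): an external html file included before the body, which
# in turn pulled the Analytics code over the network when the page was built
rmarkdown::render("index.Rmd",
                  output_format = rmarkdown::html_document(
                    includes = rmarkdown::includes(before_body = "ga.html")))

# current approach: ga.html contains the gtag <script> block pasted in as plain
# text, so nothing needs to be fetched while the reports are being rendered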

3. Context 24 Core Template

When we moved from running the squire model to the nimue model we ran out of RAM when running the model. 50 draws from the mcmc chain appears to be fine on the DIDE cluster, but only when specifying the 24 Core template in the setup of the didehpc context in the cluster submit script. If any errors are returned suggesting there is insufficient memory, e.g. "cannot allocate vector of size ...", then we recommend reducing the number of trajectories drawn.
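As a sketch, the relevant part of the cluster submit script looks something like the following; the template name "24Core" is the one referred to above, while everything else here is illustrative rather than the exact configuration used.

# ask for the 24 core template so jobs land on the larger-memory 24-core nodes
config <- didehpc::didehpc_config(template = "24Core")
obj <- didehpc::queue_didehpc(ctx, config = config)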

4. Didehpc Errors

There are many types of didehpc error that may appear. The first step should be to head to https://mrc-ide.github.io/didehpc/articles/troubleshooting.html and read through the guidance there, which should help identify specific errors.
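Beyond that guide, the quickest diagnostic is usually the log of an individual failed task. A minimal sketch, assuming grp is the task bundle returned at submission:

id <- grp$ids[1]        # substitute the id of the task that failed
t <- obj$task_get(id)   # fetch the task object from the queue
t$status()
t$log()                 # scheduler and R output for that task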

5. Reconcile Errors

The troubleshooting guide above contains information on how to reconcile job statuses, i.e. checking whether the job status given by grp$status() is correct and matches what is shown at the DIDE cluster end. However, to see the status of all jobs from the DIDE cluster, the following will help (it is similar to what is run internally when reconciling errors using obj$reconcile(grp$ids)):

dat <- obj$client$status_user("*", obj$config$cluster)
# dat <- dat[which(dat$name %in% grp$ids), ]  # shows all tasks with these ids
dat <- dat[match(grp$ids, dat$name), ]  # match() keeps the most recent task run with each id

6. Mapping Network Drives

To run the cluster submit script, two network drives need to be mapped:

  • T: \\fi--didef3.dide.ic.ac.uk\tmp
  • L: \\fi--didenas5\malaria

If these are not mapped when starting the machine (in particular, the tmp drive is often not mapped), then see https://support.microsoft.com/en-us/windows/map-a-network-drive-in-windows-10-29ce55d1-34e3-a7e2-4801-131475f9557d for instructions. In overview:

File Explorer > This PC > Map network drive (in the ribbon) > map using the paths listed above
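A quick way to confirm both mappings from R before running the submit script (using the drive letters above):

# both shares must be visible from the submitting R session
stopifnot(dir.exists("T:/"), dir.exists("L:/"))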

7. Updating packages in context

If you need to update any packages that are in the context, e.g. if you have made changes to squire or nimue and pushed them to their GitHub repositories, then update these in the context by using the following in the cluster_submit script:

obj <- didehpc::queue_didehpc(ctx, config = config, provision = "lazy")

8. Peculiar didehpc failed jobs

If a string of jobs fail in a row, e.g. jobs 80 - 100 all error, and the error is not a clear R error (the jobs just seem to stop), this is likely a problem with the specific node the jobs are being run on, in which case rerunning them should work. This could be because the node is misbehaving or because other users' jobs on that node are taking too much memory and causing something strange. If it continues to be an issue, work out the dide_id of the failed task and ask Wes to check whether there is something strange with that specific node.


The R package

This repository is organized as an R package. There are a few R functions exported by this package; the majority of the R code is in the analysis directory. The R package structure is here to help manage dependencies, to take advantage of continuous integration, and to keep file and data management simple.

To download the package source as you see it on GitHub, for offline browsing, use this line at the shell prompt (assuming you have Git installed on your computer):

git clone https://github.com/mrc-ide/glodide.git

Once the download is complete, open glodide.Rproj in RStudio to begin working with the package and compendium files. We will endeavor to keep all required package dependencies listed in the DESCRIPTION. This has the advantage of allowing devtools::install_dev_deps() to install the required R packages needed to run the code in this repository.
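For example, from the project root (a minimal sketch of the usual first steps after opening the project):

# install everything listed in DESCRIPTION, then load the package functions
devtools::install_dev_deps()
devtools::load_all()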

Licenses

Code: MIT, year: 2021, copyright holder: OJ Watson

Data: CC-0, attribution requested in reuse
