This is an outline of my process for streamlining report generation, and of my current solution for localising projects that live on a network location so they can run from a local drive.
Problem: reports that need regular updating, e.g. monitoring of events and data issues
- All analyses are performed within RStudio Project folders within "pgm".
- UpdateData.bat: batch file to copy source data files to the project "data" folder
xcopy "T:\DREAM3R\SOURCE\*.sas7bdat" ".\SOURCE\" /y /S /d
xcopy "T:\DREAM3R\CLEAN\*.sas7bdat" ".\CLEAN\" /y /S /d
copy "T:\DREAM3R\_Documentation\ALS.xls" "ALS.xls" /d
- Program which loads the source data, creates most derived datasets, and builds some objects (e.g. graphs, complex tables) for easy reuse within the report files:
SetupData.R
library(tidyverse)  # attaches dplyr, tidyr, ggplot2, etc.
library(readxl)
library(haven)
library(lubridate)
library(Hmisc)
# Load raw data ----------------------------------------------------------
rm(list = ls())
data = '../../data/source'
source('load_queries.r')
files = list.files(data, pattern = "\\.sas7bdat$", full.names = FALSE)
files
metadata_fields = openxlsx::read.xlsx("../../data/ALS.xlsx",sheet="Fields")
metadata_forms = openxlsx::read.xlsx("../../data/ALS.xlsx",sheet="Forms")
excluded_dates = c("DRNOW","TANOW","TLNEWNOW","TRANSDAT","LBCOLDATTIM","LBPBMCDATFRZ","MinCreated","MaxUpdated",
"SaveTS","RecordDate",
"HUDRNOW","HUDRNOW_INT",
"DRNOW_INT","TANOW_INT",
"IECSPICVDAT","IECSPICVDAT_INT",
"RIREVDAT","RIREVDAT_INT",
"RISUBDAT","RISUBDAT_INT","BRTSSDAT","BRTSRDAT","LBSSDAT","RISCANDAT","RIORDAT","RIREVAVDAT")
f = read_sas(paste(data,files[1],sep="/"))
temp = f %>%
  select(Subject, where(is.POSIXct)) %>%
  pivot_longer(-Subject, names_to = "variable", values_to = "date") %>%
  group_by(Subject) %>%
  # values are already POSIXct, so no further conversion is needed
  summarise(maxdate = max(date, na.rm = TRUE), .groups = "drop")
datasetTimeStamp = file.info(paste(data,files[1],sep="/"))$mtime
This is followed by my processing and data-derivation steps. Finally, save the datasets with a date / git version stamp.
save.image("../../data/derived/DREAM3R_datasets.RData")
save.image(paste0("../../data/derived/DREAM3R_datasets_", as.character(today()), ".RData"))
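The git version stamp mentioned above isn't shown; a minimal sketch (assuming git is on the PATH and the working directory is inside the repository) could append the short commit hash to the filename:

```r
# Hypothetical sketch: stamp the saved datasets with the current short git hash.
# Assumes git is on the PATH and this runs from inside the repository.
git_hash = system("git rev-parse --short HEAD", intern = TRUE)
save.image(paste0("../../data/derived/DREAM3R_datasets_",
                  as.character(today()), "_", git_hash, ".RData"))
```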
- Knit.R: program to run SetupData.R and render all R Markdown reports to an output "out" directory.
# Run data computation programs
source('_setupData.r')
# Generate reports
rmarkdown::render("DREAM3R_dashboard.rmd",
output_format = "all",
output_dir="../../out/"
)
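Since the aim is to render all of the reports, not just one, a sketch that loops over every R Markdown file in the program folder (assuming the reports all sit in the same directory as the script) could replace the single render call:

```r
# Hedged sketch: render every .Rmd in the current program folder to out/.
for (rmd in list.files(pattern = "\\.rmd$", ignore.case = TRUE)) {
  rmarkdown::render(rmd, output_format = "all", output_dir = "../../out/")
}
```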
- Distribute_output.bat: batch file (saved in the output folder) to copy reports to shared network locations and cloud storage, for easy access to the latest results when in meetings.
copy "DREAM3R_dashboard.html" "X:\Oncology\ALTG\DREAM3R\11 Statistics\11.04 Report\11.04.01 Interim Statistical Reports\DREAM3R_Monitoring.html"
copy "DREAM3R_DataIssues.html" "X:\Oncology\ALTG\DREAM3R\11 Statistics\11.04 Report\11.04.01 Interim Statistical Reports\DREAM3R_DataIssues.html"
copy "DREAM3R_RECIST.html" "X:\Oncology\ALTG\DREAM3R\11 Statistics\11.04 Report\11.04.01 Interim Statistical Reports\DREAM3R_RECIST.html"
xcopy "Y:\Statistics\1 STUDIES\Oncology\ALTG\DREAM3R\8.5 Analysis\out" "C:\Users\%USERNAME%\OneDrive - The University of Sydney (Staff)\TOGA\DREAM3R\" /y /S
- Batch files used for scheduling need to be saved on a local drive (C: or D:), as Task Scheduler often can't run files from a network location.
- Update.bat: batch file to run all the above steps.
cd "Y:\Statistics\1 STUDIES\Oncology\ALTG\DREAM3R\8.5 Analysis\Data"
call _updateData.bat
cd "Y:\Statistics\1 STUDIES\Oncology\ALTG\DREAM3R\8.5 Analysis\pgm"
call update.bat
cd "Y:\Statistics\1 STUDIES\Oncology\ALTG\DREAM3R\8.5 Analysis\out"
call _copyToSMP.bat
- Configure Task Scheduler to run the report process on a regular schedule (I usually update everything daily, just after the data has been refreshed).
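The scheduled task can also be registered from the command line with schtasks; a sketch where the task name, path, and time are hypothetical:

```bat
schtasks /Create /TN "DREAM3R_DailyUpdate" /TR "C:\Scheduled\Update.bat" /SC DAILY /ST 07:30
```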
Problem: opening data and running programs stored on a network location can be slow. RStudio projects perform better when running from a local directory.
We were spoiled with a fast local network for years, but in the era of cloud storage our network drives have been moved offsite.
Solution: sync data and supporting files to the local drive via a custom batch file, with a similar batch file to return updated data / output to the network location.
- Batch file to create local copy of project folder structure
if not exist "C:\TRIAL_DATA\DREAM3R" mkdir "C:\TRIAL_DATA\DREAM3R"
if not exist "C:\TRIAL_DATA\DREAM3R\data" mkdir "C:\TRIAL_DATA\DREAM3R\data"
if not exist "C:\TRIAL_DATA\DREAM3R\out" mkdir "C:\TRIAL_DATA\DREAM3R\out"
if not exist "C:\TRIAL_DATA\DREAM3R\pgm" mkdir "C:\TRIAL_DATA\DREAM3R\pgm"
if not exist "C:\TRIAL_DATA\DREAM3R\pgm\DREAM3R_Monitoring" mkdir "C:\TRIAL_DATA\DREAM3R\pgm\DREAM3R_Monitoring"
if not exist "C:\TRIAL_DATA\DREAM3R\pgm\DREAM3R_Patients" mkdir "C:\TRIAL_DATA\DREAM3R\pgm\DREAM3R_Patients"
if not exist "C:\TRIAL_DATA\DREAM3R\pgm\DREAM3R_StudyReport" mkdir "C:\TRIAL_DATA\DREAM3R\pgm\DREAM3R_StudyReport"
xcopy "Y:\Statistics\1 STUDIES\Oncology\ALTG\DREAM3R\8.5 Analysis\data" "C:\TRIAL_DATA\DREAM3R\data" /s /d /y
copy "Y:\Statistics\1 STUDIES\Oncology\ALTG\DREAM3R\8.5 Analysis\DREAM3R_Y_drive.lnk" "C:\TRIAL_DATA\DREAM3R\DREAM3R_Y_drive.lnk" /y
copy "Y:\Statistics\1 STUDIES\Oncology\ALTG\DREAM3R\8.5 Analysis\DREAM3R_createLocal.bat" "C:\TRIAL_DATA\DREAM3R\DREAM3R_createLocal.bat" /y
copy "Y:\Statistics\1 STUDIES\Oncology\ALTG\DREAM3R\8.5 Analysis\DREAM3R_copyDataOutput_toYdrive.bat" "C:\TRIAL_DATA\DREAM3R\DREAM3R_copyDataOutput_toYdrive.bat" /y
- TIP: create a shortcut to the network location (in the local folder) and a shortcut to the local location (on the network) to make it easy to move between the two.
- All programs are version controlled and pushed to an enterprise GitHub server.
- Use GitHub Desktop to simplify syncing of the repository to the local directory.
-- It's not easy (or necessarily safe) to update the repo from within a batch file, but I'm in the habit of checking for updates when I start work and when I've finished.
- Local version of the batch file to copy source data files to the project "data" folder (not needed if the Y-drive version used relative paths)
UpdateData.bat
xcopy "T:\DREAM3R\SOURCE\*.sas7bdat" ".\SOURCE\" /y /S /d
xcopy "T:\DREAM3R\CLEAN\*.sas7bdat" ".\CLEAN\" /y /S /d
copy "T:\DREAM3R\_Documentation\ALS.xls" "ALS.xls" /d
- Batch file to copy all derived data + output files to the network folder (only if they've been updated more recently than the network copies).
copyDataOutput_toYdrive.bat
xcopy "C:\TRIAL_DATA\DREAM3R\out" "Y:\Statistics\1 STUDIES\Oncology\ALTG\DREAM3R\8.5 Analysis\out" /s /d /y
xcopy "C:\TRIAL_DATA\DREAM3R\data" "Y:\Statistics\1 STUDIES\Oncology\ALTG\DREAM3R\8.5 Analysis\data" /s /d /y
xcopy "C:\TRIAL_DATA\SHERLOCK\data" "Y:\Statistics\1 STUDIES\Oncology\ALTG\SHERLOCK\8.5 Analysis\data" /s /d /y
- Sometimes reports also generate files which are saved in the program folder, so I copy those manually, e.g.
copy "C:\TRIAL_DATA\SHERLOCK\pgm\SHERLOCK_Monitoring\SHERLOCK_interim.png" "Y:\Statistics\1 STUDIES\Oncology\ALTG\SHERLOCK\8.5 Analysis\pgm\SHERLOCK_Monitoring\SHERLOCK_interim.png"
copy "C:\TRIAL_DATA\SHERLOCK\pgm\SHERLOCK_Monitoring\Recruitment.png" "Y:\Statistics\1 STUDIES\Oncology\ALTG\SHERLOCK\8.5 Analysis\pgm\SHERLOCK_Monitoring\Recruitment.png"
-- The pgm folder could be copied like the others, but that could confuse git if the repositories aren't synced.
- UpdateLocal.bat: local version of the program to run all the components.
cd "C:\TRIAL_DATA\DREAM3R\data"
call _updateData.bat
cd "C:\TRIAL_DATA\DREAM3R\pgm"
call update.bat
cd "C:\TRIAL_DATA\DREAM3R"
call copyDataOutput_toYdrive.bat
- Ideally all programs use relative paths, but relative paths can fail when programs are triggered via scheduled tasks (the working directory is not always what you expect).
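One common cmd workaround (a general batch idiom, not from the original files) is to have each batch file switch to its own folder before doing anything else, so relative paths resolve consistently however the task is launched:

```bat
rem %~dp0 expands to the drive and folder containing this batch file
cd /d "%~dp0"
```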
- All of these tasks could be run from Python/R, negating the bulk of the batch files, but it's nice to have batch files that are easy to run manually when needed.
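As a sketch of the Python route (the function name and paths are illustrative, not from the project), the xcopy /S /D behaviour of "copy only files newer than the destination copy" is straightforward with the standard library:

```python
import shutil
from pathlib import Path

def sync_newer(src: str, dst: str) -> list[str]:
    """Copy files from src to dst recursively, like xcopy /S /D:
    only when the destination copy is missing or older than the source."""
    copied = []
    src_root, dst_root = Path(src), Path(dst)
    for f in sorted(src_root.rglob("*")):
        if not f.is_file():
            continue
        target = dst_root / f.relative_to(src_root)
        if not target.exists() or f.stat().st_mtime > target.stat().st_mtime:
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(f, target)  # copy2 preserves the timestamp
            copied.append(str(target))
    return copied
```

The same script could then call Rscript and the render step via subprocess, replacing the remaining batch glue.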
- Use a generator / template to streamline the creation of these files / links etc.
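The generator idea can be sketched in Python (the function name and folder list are hypothetical): given a trial name and its subfolders, emit the "if not exist ... mkdir" lines that build the local tree, so new studies don't need hand-written batch files:

```python
def make_create_local_bat(trial: str, subfolders: list[str],
                          root: str = r"C:\TRIAL_DATA") -> str:
    """Generate the 'if not exist ... mkdir' lines for a trial's local folder tree."""
    paths = [rf"{root}\{trial}"] + [rf"{root}\{trial}\{sub}" for sub in subfolders]
    return "\n".join(f'if not exist "{p}" mkdir "{p}"' for p in paths)

print(make_create_local_bat("DREAM3R", ["data", "out", "pgm"]))
```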