Instructions

jreps edited this page Mar 24, 2023 · 6 revisions

title: "Create Model Development Network Studies With Strategus"
output: html_document
date: "2022-11-23"

Introduction

In this document we will explain how to create a prediction model development network study using Strategus. The researcher developing the network study needs the latest PatientLevelPrediction R package and Strategus R Package.

With Strategus, the researcher developing the network study needs to create a json file with the specifications that detail the modules to use (plus their version) and the inputs into the modules. In addition, the json file needs to include all the cohorts used in the study (e.g., any target population, outcome or predictor cohorts). The cohorts are added into the Shared Resources.

Each module contains a script that wraps an analysis R package and uses the settings specified in the json file as its input (e.g., PatientLevelPredictionModule contains a script that uses the settings in the json file to figure out the inputs to call PatientLevelPrediction). Each module also contains an renv that installs all the R dependencies, at suitable versions, into a separate environment when the script runs. This ensures the settings in the json file match the inputs expected by the PatientLevelPrediction version the module installs, and it also helps if you need to run a study years in the future. Please note: modules never need to be manually downloaded.

Create Prediction Development JSON File

Requirements

To create a network json file for developing prediction models the user needs:

  • The latest PatientLevelPrediction installed (remotes::install_github('ohdsi/PatientLevelPrediction'))
  • The latest Strategus installed (remotes::install_github('ohdsi/Strategus', ref = 'develop'))
  • ROhdsiWebApi R package installed (remotes::install_github('ohdsi/ROhdsiWebApi'))
  • All the cohorts for the study in ATLAS (Note: custom cohorts can be added outside ATLAS but this is not detailed in this document)
library(Strategus)
library(PatientLevelPrediction)
library(dplyr)

Prediction module settings

To create the settings for the PatientLevelPredictionModule you simply need to create a list of model designs. A model design is created using PatientLevelPrediction::createModelDesign.

In the example below I will create two model designs. Both designs have these design settings in common:

  • covariates (age in 5-year bins, sex, condition/drug groups in the prior 1-year)
  • target cohort id of 301 (this refers to the cohort id in ATLAS)
  • outcome cohort id of 298
  • default restrict settings (no restriction when extracting the data)
  • population settings defining the time at risk as 1 day after index to 3 years after index, removing patients with the outcome prior to index and removing patients who have <1 day of follow-up (i.e., who leave the database at index)
  • no feature engineering
  • no over/under sampling
  • default preprocessing
  • default train/test/validation split
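
To make the time-at-risk setting concrete, the following base R sketch works out the risk window for a single patient. The index date is a made-up value and the names tarStart and tarEnd are illustrative only; the offsets are the ones listed above (1 day to 3 x 365 days after cohort start).

```r
# Made-up index (cohort start) date, for illustration only
indexDate <- as.Date("2020-01-01")

# Offsets from the design; both are anchored on cohort start
riskWindowStart <- 1
riskWindowEnd <- 365 * 3  # 1095 days, i.e. roughly 3 years

# First and last day on which an outcome would count for this patient
tarStart <- indexDate + riskWindowStart
tarEnd <- indexDate + riskWindowEnd

print(tarStart)
print(tarEnd)
```

Because 365 * 3 ignores leap days, the window can end slightly short of three calendar years when a leap year falls inside it.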

The two designs differ only in the classifier: the first uses a LASSO logistic regression, whereas the second uses Random Forest (with the default hyper-parameter grid search).


covariateSettings <- FeatureExtraction::createCovariateSettings(
  useDemographicsGender = TRUE,
  useDemographicsAgeGroup = TRUE, # PLP age group (5-year bins)
  useConditionGroupEraLongTerm = TRUE,
  useDrugGroupEraLongTerm = TRUE
)

modelDesignList <- list()
length(modelDesignList) <- 2

# design 1: LASSO logistic regression
modelDesignList[[1]] <- PatientLevelPrediction::createModelDesign(
  targetId = 301,
  outcomeId = 298,
  restrictPlpDataSettings = createRestrictPlpDataSettings(),
  populationSettings = createStudyPopulationSettings(
    removeSubjectsWithPriorOutcome = TRUE,
    priorOutcomeLookback = 99999,
    requireTimeAtRisk = TRUE,
    minTimeAtRisk = 1,
    riskWindowStart = 1,
    startAnchor = 'cohort start',
    riskWindowEnd = 365*3,
    endAnchor = 'cohort start'
  ),
  covariateSettings = covariateSettings,
  featureEngineeringSettings = NULL,
  sampleSettings = NULL,
  preprocessSettings = createPreprocessSettings(),
  modelSettings = PatientLevelPrediction::setLassoLogisticRegression(),
  splitSettings = createDefaultSplitSetting(),
  runCovariateSummary = TRUE
)

# design 2: identical except for the classifier (Random Forest)
modelDesignList[[2]] <- PatientLevelPrediction::createModelDesign(
  targetId = 301,
  outcomeId = 298,
  restrictPlpDataSettings = createRestrictPlpDataSettings(),
  populationSettings = createStudyPopulationSettings(
    removeSubjectsWithPriorOutcome = TRUE,
    priorOutcomeLookback = 99999,
    requireTimeAtRisk = TRUE,
    minTimeAtRisk = 1,
    riskWindowStart = 1,
    startAnchor = 'cohort start',
    riskWindowEnd = 365*3,
    endAnchor = 'cohort start'
  ),
  covariateSettings = covariateSettings,
  featureEngineeringSettings = NULL,
  sampleSettings = NULL,
  preprocessSettings = createPreprocessSettings(),
  modelSettings = PatientLevelPrediction::setRandomForest(),
  splitSettings = createDefaultSplitSetting(),
  runCovariateSummary = TRUE
)
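
Since the two designs share every setting except the classifier, the common arguments could be factored into a small helper. The sketch below is one way to do that; makeModelDesign is a hypothetical name (it is not part of PatientLevelPrediction), and every package call inside it is copied from the example above. Defining the helper does not need the package to be loaded, but calling it does.

```r
# Hypothetical helper: builds one model design with the shared settings
# from this example, varying only the classifier (modelSettings).
makeModelDesign <- function(modelSettings, covariateSettings) {
  PatientLevelPrediction::createModelDesign(
    targetId = 301,
    outcomeId = 298,
    restrictPlpDataSettings = PatientLevelPrediction::createRestrictPlpDataSettings(),
    populationSettings = PatientLevelPrediction::createStudyPopulationSettings(
      removeSubjectsWithPriorOutcome = TRUE,
      priorOutcomeLookback = 99999,
      requireTimeAtRisk = TRUE,
      minTimeAtRisk = 1,
      riskWindowStart = 1,
      startAnchor = 'cohort start',
      riskWindowEnd = 365 * 3,
      endAnchor = 'cohort start'
    ),
    covariateSettings = covariateSettings,
    featureEngineeringSettings = NULL,
    sampleSettings = NULL,
    preprocessSettings = PatientLevelPrediction::createPreprocessSettings(),
    modelSettings = modelSettings,
    splitSettings = PatientLevelPrediction::createDefaultSplitSetting(),
    runCovariateSummary = TRUE
  )
}

# Usage (requires PatientLevelPrediction and the covariateSettings object):
# modelDesignList <- lapply(
#   list(
#     PatientLevelPrediction::setLassoLogisticRegression(),
#     PatientLevelPrediction::setRandomForest()
#   ),
#   makeModelDesign,
#   covariateSettings = covariateSettings
# )
```

Writing the designs out in full, as in the example, is just as valid; the helper only reduces the risk of the shared settings drifting apart when more classifiers are added.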

We have the prediction model design settings, so now we just need to source a function that will take as input the modelDesignList we just created and output a specification object for Strategus.

# source SettingsFunctions.R from PatientLevelPredictionModule (pinned here to v0.0.8)
source("https://raw.githubusercontent.com/OHDSI/PatientLevelPredictionModule/v0.0.8/SettingsFunctions.R")

# this will load a function called createPatientLevelPredictionModuleSpecifications
# that takes as input a modelDesignList
# createPatientLevelPredictionModuleSpecifications(modelDesignList) 

# now we create a specification for the prediction module
# using the model design list we defined previously as input
patientLevelPredictionModuleSpecifications <- createPatientLevelPredictionModuleSpecifications(modelDesignList)

We now have the specifications for the prediction model development ready.

Note: if you need to add custom aspects into a network study it is possible to create your custom module rather than using the OHDSI repository module.

As the model development for the two model designs previously defined requires cohorts 298 and 301 to be generated, we will need to add these cohorts into the json file and also use Cohort Generator to run the cohort SQL for both cohorts when executing the study. This means CohortGeneratorModule is required.
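
A quick sanity check along these lines can catch a cohort that a model design references but that is never added to the study. The sketch below uses plain lists as stand-ins for the model designs, on the assumption that each design exposes targetId and outcomeId fields; the variable names are illustrative only.

```r
# Stand-ins for the two model designs: only the ID fields are mimicked.
# Assumption: a real model design exposes targetId and outcomeId like this.
designs <- list(
  list(targetId = 301, outcomeId = 298),
  list(targetId = 301, outcomeId = 298)
)

# Every cohort ID the designs depend on
requiredCohortIds <- sort(unique(unlist(
  lapply(designs, function(d) c(d$targetId, d$outcomeId))
)))

# The cohort IDs that will be added to the study specification
cohortIdsToExport <- c(301, 298)

# Any required cohort that is not going to be exported
missingCohortIds <- setdiff(requiredCohortIds, cohortIdsToExport)
if (length(missingCohortIds) > 0) {
  stop("Cohorts missing from the study: ", paste(missingCohortIds, collapse = ", "))
}
```

If cohorts are also used as predictors, their IDs would need to be added to requiredCohortIds as well.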

Cohort Generator settings

To add the cohort generation specifications we need to source a similar function that will create the specifications for the cohort generator module.

The following code uses v0.1.0 of the CohortGeneratorModule found in the OHDSI repository (the version is set by the tag in the URL). Setting incremental to TRUE means that if you generate the cohorts, then modify one cohort and rerun, cohort generator will only rerun the SQL for the modified cohort and will skip the previously generated, unmodified cohort. Setting generateStats to TRUE means extra tables with statistics about the cohorts will be generated.

# source the cohort generator settings function
source("https://raw.githubusercontent.com/OHDSI/CohortGeneratorModule/v0.1.0/SettingsFunctions.R")
# this loads a function called createCohortGeneratorModuleSpecifications that takes as
# input incremental (boolean) and generateStats (boolean)

# specify the inputs to create the cohort generator specification
cohortGeneratorModuleSpecifications <- createCohortGeneratorModuleSpecifications(
      incremental = TRUE,
      generateStats = TRUE
      )

This is the specification for the cohort generation. However, we also need to add in the cohort definitions for the analysis (cohorts 301 and 298). This is done in Strategus within the shared resources.

Shared resources

To add the cohort definitions as a shared resource, we define a helper function named createCohortSharedResource. It takes the cohort definition set and returns a list with a single named item, cohortDefinitions, holding that set.

createCohortSharedResource <- function(cohortDefinitionSet) {
  sharedResource <- list(cohortDefinitions = cohortDefinitionSet)
  class(sharedResource) <- c("CohortDefinitionSharedResources", "SharedResources")
  return(sharedResource)
}
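
To see what this helper returns, the sketch below repeats the function so that it runs on its own and applies it to two toy cohort entries. The cohort names and the "{}" definitions are placeholders, not real cohort definitions.

```r
# Same helper as above, repeated so this sketch is self-contained
createCohortSharedResource <- function(cohortDefinitionSet) {
  sharedResource <- list(cohortDefinitions = cohortDefinitionSet)
  class(sharedResource) <- c("CohortDefinitionSharedResources", "SharedResources")
  return(sharedResource)
}

# Toy entries with placeholder names and empty JSON definitions
toyCohorts <- list(
  list(cohortId = 301, cohortName = "Toy target", cohortDefinition = "{}"),
  list(cohortId = 298, cohortName = "Toy outcome", cohortDefinition = "{}")
)

sharedResource <- createCohortSharedResource(toyCohorts)

print(class(sharedResource))                     # the two class labels set by the helper
print(names(sharedResource))                     # a single item: cohortDefinitions
print(length(sharedResource$cohortDefinitions))  # one entry per cohort
```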

In this example I use ATLAS and ROhdsiWebApi to extract the cohort definitions, but then process these into the format Strategus requires:


# first define your ATLAS webapi:
baseUrl <- '<your atlas WebAPI>'

# Next - you may need to authorize the web API.
#  In this example I use windows authorization
ROhdsiWebApi::authorizeWebApi(
  baseUrl = baseUrl, 
  authMethod = 'windows', 
  webApiUsername = '<your webapi username>',
  webApiPassword = '<your webapi password>'
)

# now we extract the two cohorts
# note: if you used cohorts as predictors you need to add them here as well
cohortDefinitions <- ROhdsiWebApi::exportCohortDefinitionSet(
  baseUrl = baseUrl, 
  cohortIds = c(301, 298), 
  generateStats = FALSE # set this to TRUE if you want stats
)

# here we convert the cohort definitions into the format Strategus requires:
# one list per cohort with its id, name and JSON definition
cohortDefinitions <- lapply(seq_along(cohortDefinitions$atlasId), function(i) {
  list(
    cohortId = cohortDefinitions$cohortId[i],
    cohortName = cohortDefinitions$cohortName[i],
    cohortDefinition = cohortDefinitions$json[i]
  )
})
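
The conversion can be tried without a WebAPI connection by running it on a small mock data frame. The data frame below is a hypothetical stand-in that contains only the columns the conversion reads (atlasId, cohortId, cohortName, json); the names and JSON strings are dummy values.

```r
# Hypothetical stand-in for the exported cohort definition set;
# only the columns used by the conversion are included.
cohortDefinitionSet <- data.frame(
  atlasId = c(301, 298),
  cohortId = c(301, 298),
  cohortName = c("Example target", "Example outcome"),
  json = c("{}", "{}"),
  stringsAsFactors = FALSE
)

# One list per row, holding the id, name and JSON definition
cohortDefinitions <- lapply(seq_len(nrow(cohortDefinitionSet)), function(i) {
  list(
    cohortId = cohortDefinitionSet$cohortId[i],
    cohortName = cohortDefinitionSet$cohortName[i],
    cohortDefinition = cohortDefinitionSet$json[i]
  )
})

# Inspect the first converted entry
str(cohortDefinitions[[1]])
```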

Now we have functions for creating the prediction specification, the cohort specification and creating the required shared resources. We also have the cohorts downloaded and in the required format. Next we need to put these together to create the analysis object.

Creating the analysis object

We can use the Strategus function createEmptyAnalysisSpecificiations() to create an empty specification and then use the pipe operator %>% together with the Strategus functions addSharedResources and addModuleSpecifications to create an R object with the network study specification:

analysisSpecifications <- createEmptyAnalysisSpecificiations() %>%
  addSharedResources(createCohortSharedResource(cohortDefinitions)) %>%
  addModuleSpecifications(cohortGeneratorModuleSpecifications) %>%
  addModuleSpecifications(patientLevelPredictionModuleSpecifications)

Here we use addModuleSpecifications twice as we use cohort generator and prediction models in the analysis.

Finally, we can save this object using ParallelLogger as a json file:

ParallelLogger::saveSettingsToJson(analysisSpecifications, './example_study.json')

This file can now be shared with others (e.g., via email or GitHub) enabling them to run the model development you specified. If run, the users will develop two prediction models.

Running the study from the JSON file

To run the study a user needs to specify the location of the json file and their connection details, plus some settings for the cohort table name, output folder and min cell count. Once the inputs are filled in, the user can run the code below and the models should be developed.


library(Strategus)

##=========== START OF INPUTS ==========
# Add your json file location, the connection details for your OMOP CDM data
# and the remaining settings in this section

# load the json spec
analysisSpecifications <- ParallelLogger::loadSettingsFromJson('<location to json file>')

connectionDetailsReference <- "<database ref>"

connectionDetails <- DatabaseConnector::createConnectionDetails(
  dbms = '<dbms>',
  server ='<server>',
  user = '<user>',
  password = '<password>',
  port = '<port>'
)

workDatabaseSchema <- '<your workDatabaseSchema>'
cdmDatabaseSchema <- '<your cdmDatabaseSchema>'

outputLocation <- '<folder location to run study and output results>'
minCellCount <- 5
cohortTableName <- "strategus_example"

##=========== END OF INPUTS ==========

storeConnectionDetails(
  connectionDetails = connectionDetails,
  connectionDetailsReference = connectionDetailsReference
  )

executionSettings <- createCdmExecutionSettings(
  connectionDetailsReference = connectionDetailsReference,
  workDatabaseSchema = workDatabaseSchema,
  cdmDatabaseSchema = cdmDatabaseSchema,
  cohortTableNames = CohortGenerator::getCohortTableNames(cohortTable = cohortTableName),
  workFolder = file.path(outputLocation, "strategusWork"),
  resultsFolder = file.path(outputLocation, "strategusOutput"),
  minCellCount = minCellCount
)

# Note: this environment variable should be set once for each compute node
Sys.setenv("INSTANTIATED_MODULES_FOLDER" = file.path(outputLocation, "StrategusInstantiatedModules"))

execute(
  analysisSpecifications = analysisSpecifications,
  executionSettings = executionSettings,
  executionScriptFolder = file.path(outputLocation, "strategusExecution")
  )
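
Before running the script it can help to confirm that none of the template placeholders (the values wrapped in angle brackets) were left unedited. The sketch below shows one way to check this in base R; isPlaceholder is a hypothetical helper and the filled-in values are made up for illustration.

```r
# Hypothetical helper: TRUE when a value still looks like a "<...>" placeholder
isPlaceholder <- function(x) {
  is.character(x) && startsWith(x, "<")
}

# Example inputs: one left as a placeholder, the rest filled with made-up values
inputs <- list(
  workDatabaseSchema = "<your workDatabaseSchema>",
  cdmDatabaseSchema = "cdm_schema",
  outputLocation = "/tmp/strategus_example",
  minCellCount = 5
)

# Names of the inputs that still need to be filled in
unfilled <- names(inputs)[vapply(inputs, isPlaceholder, logical(1))]

if (length(unfilled) > 0) {
  message("Please fill in: ", paste(unfilled, collapse = ", "))
}
```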