# Example usage

These examples are based on those in two papers by Riley et al. published in Statistics in Medicine (2018). NB: the survival example is based on the Riley et al. BMJ paper (2020).

# Imports

In [1]:
from pmsampsize.pmsampsize import *

# Details

`pmsampsize` can be used to calculate the minimum sample size for the development of models with continuous, binary or survival (time-to-event) outcomes. Riley et al. lay out a series of criteria the sample size should meet. These aim to minimise overfitting and to ensure precise estimation of key parameters in the prediction model.

For continuous outcomes, there are four criteria: 
- small overfitting defined by an expected shrinkage of predictor effects by 10% or less, 
- small absolute difference (0.05 or less) between the model’s apparent and adjusted R-squared values, 
- precise estimation of the residual standard deviation, and 
- precise estimation of the average outcome value. 

The sample size calculation requires the user to pre-specify (e.g. based on previous evidence) the anticipated R-squared of the model, and the average outcome value and standard deviation of outcome values in the population of interest. 

For binary or survival (time-to-event) outcomes, there are three criteria: 
- small overfitting defined by an expected shrinkage of predictor effects by 10% or less, 
- small absolute difference (0.05 or less) between the model’s apparent and adjusted Nagelkerke’s R-squared values, and 
- precise estimation (within +/- 0.05) of the average outcome risk in the population for a key timepoint of interest for prediction.

The `pmsampsize` Python package is based on the original R package developed by Joie Ensor.

# Binary outcomes (Logistic prediction models)

Use pmsampsize to calculate the minimum sample size required to develop a multivariable prediction model for a binary outcome using 24 candidate predictor parameters. Based on previous evidence, the outcome prevalence is anticipated to be 0.174 (17.4%) and a lower bound (taken from the adjusted Cox-Snell R-squared of an existing prediction model) for the new model's R-squared value is 0.288.

In [3]:
samplesize = pmsampsize(type = "b", csrsquared = 0.288, parameters = 24, prevalence = 0.174)

NB: Assuming 0.05 acceptable difference in apparent & adjusted R-squared
NB: Assuming 0.05 margin of error in estimation of intercept
NB: Events per Predictor Parameter (EPP) assumes prevalence = 0.174 

Criteria      Sample size    Shrinkage    Parameter    CS_Rsq    Max_Rsq    Nag_Rsq    EPP
----------  -------------  -----------  -----------  --------  ---------  ---------  -----
Criteria 1            623          0.9           24     0.288      0.603      0.477   4.52
Criteria 2            662        0.905           24     0.288      0.603      0.477    4.8
Criteria 3            221        0.905           24     0.288      0.603      0.477    1.6
----------  -------------  -----------  -----------  --------  ---------  ---------  -----
Final SS              662        0.905           24     0.288      0.603      0.477    4.8
 
Minimum sample size required for new model development based on user inputs = 662
with 116 events (assuming an outcome prevalence = 0.174) and an EPP = 4.8 
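
The three criteria rows above can be checked by hand using the closed-form expressions in Riley et al. (2018, Part II). The block below is a minimal sketch of those formulas, not the package's own code: the helper names (`n_for_shrinkage`, `max_cs_rsq_binary`, `binary_criteria`) are hypothetical and introduced only for illustration, and the 1.96 multiplier assumes a 95% confidence interval for the overall risk.

```python
import math

def n_for_shrinkage(parameters, cs_rsq, shrinkage):
    """Sample size at which the expected uniform shrinkage equals `shrinkage`."""
    return math.ceil(parameters / ((shrinkage - 1) * math.log(1 - cs_rsq / shrinkage)))

def max_cs_rsq_binary(prevalence):
    """Largest Cox-Snell R-squared attainable at a given outcome prevalence."""
    ln_l_null = (prevalence * math.log(prevalence)
                 + (1 - prevalence) * math.log(1 - prevalence))
    return 1 - math.exp(2 * ln_l_null)

def binary_criteria(parameters, cs_rsq, prevalence,
                    target_shrinkage=0.9, rsq_diff=0.05, margin=0.05):
    # Criterion 1: expected shrinkage of predictor effects of 10% or less.
    n1 = n_for_shrinkage(parameters, cs_rsq, target_shrinkage)
    # Criterion 2: the shrinkage implied by a 0.05 difference between apparent
    # and adjusted Nagelkerke R-squared, then the sample size that achieves it.
    s2 = cs_rsq / (cs_rsq + rsq_diff * max_cs_rsq_binary(prevalence))
    n2 = n_for_shrinkage(parameters, cs_rsq, s2)
    # Criterion 3: overall outcome risk estimated to within +/- `margin`.
    n3 = math.ceil((1.96 / margin) ** 2 * prevalence * (1 - prevalence))
    return n1, n2, n3

n1, n2, n3 = binary_criteria(parameters=24, cs_rsq=0.288, prevalence=0.174)
final_n = max(n1, n2, n3)
events = math.ceil(final_n * 0.174)          # expected number of events
epp = round(final_n * 0.174 / 24, 2)         # events per predictor parameter
print(n1, n2, n3, final_n, events, epp)
```

With the inputs from the example, this sketch gives the same sample sizes, event count and EPP as the table; the final sample size is simply the largest of the three criteria.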


Now let's assume we could not obtain a Cox-Snell R-squared estimate from an existing prediction model, but instead had a C-statistic (0.89) reported for the existing prediction model. We can use this C-statistic along with the prevalence to approximate the Cox-Snell R-squared using the approach of Riley et al. (2020). Call pmsampsize with the `cstatistic` argument instead of the `csrsquared` argument.

In [4]:
samplesize = pmsampsize(type = "b", cstatistic = 0.89, parameters = 24, prevalence = 0.174)

Given input C-statistic = 0.89  & prevalence = 0.174
Cox-Snell R-sq = 0.2908 

NB: Assuming 0.05 acceptable difference in apparent & adjusted R-squared
NB: Assuming 0.05 margin of error in estimation of intercept
NB: Events per Predictor Parameter (EPP) assumes prevalence = 0.174 

Criteria      Sample size    Shrinkage    Parameter    CS_Rsq    Max_Rsq    Nag_Rsq    EPP
----------  -------------  -----------  -----------  --------  ---------  ---------  -----
Criteria 1            615          0.9           24    0.2908      0.603      0.482   4.46
Criteria 2            660        0.906           24    0.2908      0.603      0.482   4.78
Criteria 3            221        0.906           24    0.2908      0.603      0.482    1.6
----------  -------------  -----------  -----------  --------  ---------  ---------  -----
Final SS              660        0.906           24    0.2908      0.603      0.482   4.78
 
Minimum sample size required for new model development based on user inputs = 

# Survival outcomes (Cox prediction models)

Use pmsampsize to calculate the minimum sample size required for developing a multivariable prediction model with a survival outcome using 30 candidate predictors. We know an existing prediction model in the same field has an adjusted R-squared of 0.051. Further, in the previous study the mean follow-up was 2.07 years, and the overall event rate was 0.065. We select 2 years as the timepoint of interest for prediction with the newly developed model.

In [5]:
samplesize = pmsampsize(type = "s", csrsquared = 0.051, parameters = 30, rate = 0.065, timepoint = 2, meanfup = 2.07)


 NB: Assuming 0.05 acceptable difference in apparent & adjusted R-squared 
 NB: Assuming 0.05 margin of error in estimation of overall risk at time point = 2 
 NB: Events per Predictor Parameter (EPP) assumes overall event rate = 0.065 
 

Criteria       Sample size    Shrinkage    Parameter    CS_Rsq    Max_Rsq    Nag_Rsq    EPP
-----------  -------------  -----------  -----------  --------  ---------  ---------  -----
Criteria 1            5143          0.9           30     0.051      0.555      0.092  23.07
Criteria 2            1039        0.648           30     0.051      0.555      0.092   4.66
Criteria 3*           5143          0.9           30     0.051      0.555      0.092  23.07
-----------  -------------  -----------  -----------  --------  ---------  ---------  -----
Final SS              5143          0.9           30     0.051      0.555      0.092  23.07

Minimum sample size required for new model development based on user inputs = 5143
corresponding to 10646.0 person
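
The first two criteria rows, and the person-time and EPP figures, can be reproduced with a short calculation. This is a hedged sketch rather than the package's implementation: the helper names are hypothetical, and the null log-likelihood per participant is assumed to be (E/n)(ln(E/n) - 1) with E/n = rate × mean follow-up, an assumption chosen because it matches the Max_Rsq of 0.555 shown in the table. Criterion 3 for survival outcomes is not covered here.

```python
import math

def n_for_shrinkage(parameters, cs_rsq, shrinkage):
    """Sample size at which the expected uniform shrinkage equals `shrinkage`."""
    return math.ceil(parameters / ((shrinkage - 1) * math.log(1 - cs_rsq / shrinkage)))

def max_cs_rsq_survival(rate, meanfup):
    """Maximum Cox-Snell R-squared under the assumed null log-likelihood."""
    events_per_person = rate * meanfup  # expected events per participant (E/n)
    ln_l_null_per_person = events_per_person * (math.log(events_per_person) - 1)
    return 1 - math.exp(2 * ln_l_null_per_person)

parameters, cs_rsq, rate, meanfup = 30, 0.051, 0.065, 2.07

# Criterion 1: target shrinkage of 0.9.
n1 = n_for_shrinkage(parameters, cs_rsq, 0.9)

# Criterion 2: shrinkage implied by a 0.05 difference in Nagelkerke R-squared.
max_rsq = max_cs_rsq_survival(rate, meanfup)
s2 = cs_rsq / (cs_rsq + 0.05 * max_rsq)
n2 = n_for_shrinkage(parameters, cs_rsq, s2)

# Follow-up and events implied by the final sample size of 5143.
person_years = round(n1 * meanfup, 1)
epp = round(n1 * rate * meanfup / parameters, 2)
print(n1, n2, round(max_rsq, 3), round(s2, 3), person_years, epp)
```

Here criterion 1 dominates because the anticipated R-squared is low, so many participants are needed before the expected shrinkage reaches 0.9.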

# Continuous outcomes (Linear prediction models)

Use pmsampsize to calculate the minimum sample size required for developing a multivariable prediction model for a continuous outcome (here, FEV1), using 25 candidate predictors. We know an existing prediction model in the same field has an adjusted R-squared of 0.2, and that FEV1 values in the population have a mean of 1.9 and an SD of 0.6.

In [6]:
samplesize = pmsampsize(type = "c", rsquared = 0.2, parameters = 25, intercept = 1.9, sd = 0.6)


 NB: Assuming 0.05 acceptable difference in apparent & adjusted R-squared 
 NB: Assuming MMOE <= 1.1 in estimation of intercept & residual standard deviation 
 SPP - Subjects per Predictor Parameter 
 

Criteria      Sample size    Shrinkage    Parameter    Rsq    SPP
----------  -------------  -----------  -----------  -----  -----
Criteria 1            918          0.9           25    0.2  36.72
Criteria 2            401        0.801           25    0.2  16.04
Criteria 3            259        0.727           25    0.2  10.36
Criteria 4            918          0.9           25    0.2  36.72
----------  -------------  -----------  -----------  -----  -----
Final SS              918          0.9           25    0.2  36.72

Minimum sample size required for new model development based on user inputs = 918

* 95% CI for intercept = (1.87, 1.93), for sample size n = 918
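
Several entries in the table above follow from simple expressions in Riley et al. (2018, Part I). The sketch below is illustrative only and uses hypothetical helper names. It assumes that the Shrinkage column is computed by first converting the adjusted R-squared to an apparent R-squared at the given sample size, an assumption that agrees with the values printed for criteria 2 and 3. Criteria 1 and 4 need an iterative search and are not reproduced here.

```python
import math

def expected_shrinkage(n, parameters, adj_rsq):
    """Expected shrinkage at sample size n (assumed formula, see lead-in)."""
    # Convert the adjusted R-squared to the apparent R-squared expected at n.
    app_rsq = (adj_rsq * (n - parameters - 1) + parameters) / (n - 1)
    return 1 + (parameters - 2) / (n * math.log(1 - app_rsq))

parameters, adj_rsq = 25, 0.2

# Criterion 2: apparent and adjusted R-squared differ by at most 0.05.
n2 = math.ceil(1 + parameters * (1 - adj_rsq) / 0.05)

# Criterion 3: residual SD estimated with a multiplicative margin of error <= 1.1.
n3 = 234 + parameters

# Subjects per predictor parameter at the final sample size of 918.
spp = round(918 / parameters, 2)

print(n2, round(expected_shrinkage(n2, parameters, adj_rsq), 3))
print(n3, round(expected_shrinkage(n3, parameters, adj_rsq), 3))
print(spp)
```

The shrinkage values at these smaller sample sizes (about 0.80 and 0.73) fall short of the 0.9 target, which is why criterion 1 ends up setting the final sample size.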


# References

Riley RD, Ensor J, Snell KIE, Harrell FE, Martin GP, Reitsma JB, et al. Calculating the sample size required for developing a clinical prediction model. BMJ (Clinical research ed). 2020. doi: 10.1136/bmj.m441 

Riley RD, Snell KIE, Ensor J, Burke DL, Harrell FE, Jr., Moons KG, Collins GS. Minimum sample size required for developing a multivariable prediction model: Part I continuous outcomes. Statistics in Medicine. 2018. doi: 10.1002/sim.7993 

Riley RD, Snell KIE, Ensor J, Burke DL, Harrell FE, Jr., Moons KG, Collins GS. Minimum sample size required for developing a multivariable prediction model: Part II binary and time-to-event outcomes. Statistics in Medicine. 2018. doi: 10.1002/sim.7992 

Riley, RD, Van Calster, B, Collins, GS. A note on estimating the Cox-Snell R2 from a reported C statistic (AUROC) to inform sample size calculations for developing a prediction model with a binary outcome. Statistics in Medicine. 2020. doi: 10.1002/sim.8806