
Classification models to predict clinical trial termination


Beth526/metis_project_3


metis_project_3

Overall Survival of Clinical Trials: Classification models to predict early trial termination

Clinical trials can end early for a variety of reasons, such as low accrual, an interim analysis suggesting the intervention has low efficacy, adverse events, and loss of funding or interest. I wanted to see if I could build a classification model to predict whether a trial would be terminated early or completed.

Clinical trials in the United States are required to be reported to clinicaltrials.gov; however, many still go unreported or are missing information. I used the clinicaltrials.gov API to collect data on study design, outcome measures, eligibility, investigators/sponsor, and study locations for over 18,000 interventional cancer trials designated as 'Terminated' or 'Completed'. This data was stored in a PostgreSQL database.

I one-hot encoded the categorical data fields and engineered several new features using regex and text extraction from the free-text fields, resulting in 400 total features.
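As a rough illustration of the two steps above, here is a minimal sketch using only the Python standard library. The records, the field names (`phase`, `eligibility`), the regex patterns, and the resulting feature names are all made up for this example; the actual features are in the Features Notebook.

```python
import re

# Hypothetical records; the real data came from the clinicaltrials.gov API.
trials = [
    {"phase": "Phase 2", "eligibility": "Patients 18 years or older with ECOG status 0-1."},
    {"phase": "Phase 3", "eligibility": "Age >= 21 years. No prior chemotherapy."},
    {"phase": "Phase 2", "eligibility": "Histologically confirmed disease."},
]


def one_hot(records, field):
    """Turn one categorical field into a 0/1 column per observed category."""
    categories = sorted({r[field] for r in records})
    return [{f"{field}={c}": int(r[field] == c) for c in categories} for r in records]


# Assumed patterns, for illustration only.
AGE_PATTERN = re.compile(r"(\d{1,3})\s*years", re.IGNORECASE)
CHEMO_PATTERN = re.compile(r"prior chemotherapy", re.IGNORECASE)


def text_features(text):
    """Extract simple features from a free-text eligibility field."""
    age = AGE_PATTERN.search(text)
    return {
        # First age that appears in the text, or None if no age is mentioned.
        "first_age_mentioned": int(age.group(1)) if age else None,
        # Flags a mention only; this does not distinguish "no prior" from "prior".
        "mentions_prior_chemo": int(bool(CHEMO_PATTERN.search(text))),
    }


# Combine the encoded categorical columns with the text-derived columns.
features = [
    {**encoded, **text_features(record["eligibility"])}
    for encoded, record in zip(one_hot(trials, "phase"), trials)
]

for row in features:
    print(row)
```

Missing values such as the `None` above are the kind of gap that an imputation step would later need to fill.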

Features Notebook

Model optimization was performed using:

  • scikit-learn
  • imblearn
  • xgboost

Models tested:

  • kNN
  • Logistic Regression
  • SVC
  • Naive Bayes
  • Random Forest
  • XGBoost
  • Ensembled models

I used StandardScaler to standardize the data and kNN imputation to fill in features with missing values. Only about 1/3 of the trials in the dataset were 'Terminated', causing a class imbalance, so I used either ADASYN oversampling or balanced class weights when the model supported them. Models were optimized with grid search, and most reached similar F1 scores and AUCs for calling the 'Terminated' class of ~0.4 and ~0.65, respectively. I achieved mild class separation and a recall of 60-70% for 'Terminated' trials. Ensembling only improved scores for a combination of kNN and Logistic Regression. Overall, XGBoost performed the best.
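The preprocessing and tuning steps described above could be wired together roughly as follows. This is a sketch on synthetic data, not the code from the notebooks: the dataset, the step order (scaling before imputation), the parameter grid, and the choice of logistic regression with balanced class weights are assumptions made for illustration. ADASYN oversampling from imblearn is not shown.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the trial data: about 1/3 positive ("Terminated") class.
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=4,
    weights=[0.67, 0.33], random_state=0,
)

# Blank out roughly 5% of the values to mimic missing fields.
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.25, random_state=0
)

# StandardScaler ignores NaNs when fitting and passes them through,
# so the kNN imputer then measures distances on standardized features.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("impute", KNNImputer(n_neighbors=5)),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

# Assumed grid; tune the regularization strength against F1 on the positive class.
search = GridSearchCV(pipe, {"clf__C": [0.01, 0.1, 1.0, 10.0]}, scoring="f1", cv=5)
search.fit(X_train, y_train)

pred = search.predict(X_test)
proba = search.predict_proba(X_test)[:, 1]
test_f1 = f1_score(y_test, pred)
test_auc = roc_auc_score(y_test, proba)
print(search.best_params_, round(test_f1, 3), round(test_auc, 3))
```

Keeping the scaler and imputer inside the pipeline means they are refit on each cross-validation fold, so no information from the held-out fold leaks into preprocessing. The scores printed here come from synthetic data and say nothing about the project's results.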

Model Optimization Notebook

Model Ensembling Notebook

Model Evaluation Notebook

I made a Streamlit app to allow users to interact directly with the logistic regression model and see how almost all the features affect the predictions for trial termination.

App deployed using Heroku

App screenshot

Streamlit App Code

The final project presentation is below:

Presentation
