metis_project_3

Overall Survival of Clinical Trials: Classification models to predict early trial termination

Clinical trials can end early for a variety of reasons, such as low accrual, interim analyses suggesting the intervention has low efficacy, adverse events, and loss of funding or interest. I wanted to see whether I could build a classification model to predict whether a trial would be terminated early or completed.

Clinical trials in the United States are required to be reported to clinicaltrials.gov; however, many still go unreported or are missing information. I used the clinicaltrials.gov API to collect data on study design, outcome measures, eligibility, investigators/sponsor and study locations for over 18,000 interventional cancer trials designated as 'Terminated' or 'Completed'. This data was stored in a PostgreSQL database.
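As a rough illustration of the collection step, the sketch below pages through trial records over HTTP. It is not the code used in this project: it assumes the current ClinicalTrials.gov v2 REST endpoint and its `query.cond`, `filter.overallStatus`, `pageSize` and `pageToken` parameters, which may differ from the API version available when the data was originally collected. The helper names (`build_url`, `fetch_json`, `iter_studies`) are hypothetical, and the sketch does not restrict results to interventional studies.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the v2 studies endpoint, not necessarily what the project used.
BASE_URL = "https://clinicaltrials.gov/api/v2/studies"


def build_url(page_token=None, page_size=100):
    """Build a query URL for completed or terminated cancer trials."""
    params = {
        "query.cond": "cancer",
        "filter.overallStatus": "COMPLETED,TERMINATED",
        "pageSize": page_size,
    }
    if page_token:
        params["pageToken"] = page_token
    return BASE_URL + "?" + urlencode(params)


def fetch_json(url):
    """Download one page of results and decode the JSON body."""
    with urlopen(url) as response:
        return json.load(response)


def iter_studies(fetch=fetch_json, max_pages=None):
    """Yield study records page by page until no next-page token is returned."""
    token, pages = None, 0
    while True:
        payload = fetch(build_url(token))
        yield from payload.get("studies", [])
        token = payload.get("nextPageToken")
        pages += 1
        if not token or (max_pages is not None and pages >= max_pages):
            break
```

Passing `fetch` as an argument keeps the paging logic testable without network access. A call such as `list(iter_studies(max_pages=1))` would retrieve a single page for inspection before loading records into a database.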

I one-hot encoded the categorical data fields and engineered several new features using regex and text extraction from the free-text fields, resulting in 400 total features.
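To make the two feature-engineering ideas concrete, here is a minimal standard-library sketch of one-hot encoding a categorical field and pulling simple features out of free text with regular expressions. The field values, the patterns and the feature names (`min_age_mentioned`, `mentions_pregnancy`, `n_words`) are hypothetical examples, not the features built in the Features Notebook.

```python
import re


def one_hot(values):
    """Return (column_names, rows) with one 0/1 column per distinct category."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


# Hypothetical patterns, chosen only for illustration.
AGE_PATTERN = re.compile(r"(\d+)\s*years?", re.IGNORECASE)
PREGNANCY_PATTERN = re.compile(r"pregnan", re.IGNORECASE)


def text_features(eligibility_text):
    """Derive a few numeric features from a free-text eligibility field."""
    ages = [int(match) for match in AGE_PATTERN.findall(eligibility_text)]
    return {
        "min_age_mentioned": min(ages) if ages else None,
        "mentions_pregnancy": int(bool(PREGNANCY_PATTERN.search(eligibility_text))),
        "n_words": len(eligibility_text.split()),
    }


phases = ["Phase 1", "Phase 2", "Phase 1", "Phase 3"]
columns, encoded = one_hot(phases)

features = text_features(
    "Patients must be at least 18 years old. Pregnant women are excluded."
)
```

In practice a library encoder would usually replace the hand-written `one_hot`; the plain version is shown only to make the transformation explicit.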

Features Notebook

Model optimization was performed using:

scikit-learn
imblearn
xgboost

Models tested:

kNN
Logistic Regression
SVC
Naive Bayes
Random Forest
XGBoost
Ensembled models

I used StandardScaler to standardize the data and kNN imputation to fill in features with missing values. Only about 1/3 of the trials in the dataset were 'Terminated', causing a class imbalance, so I used either ADASYN oversampling or balanced class weights when the model supported them. Models were optimized with grid search, and most reached similar F1 scores (~0.4) and AUCs (~0.65) for calling the 'Terminated' class. I achieved mild class separation and a recall of 60-70% for 'Terminated' trials. Ensembling only improved scores for a combination of kNN and Logistic Regression. Overall, XGBoost performed the best.
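The following sketch spells out two calculations behind the paragraph above, using only the standard library. `balanced_class_weights` applies the `n_samples / (n_classes * class_count)` heuristic that scikit-learn uses for `class_weight="balanced"`, and `precision_recall_f1` computes the metrics for the positive ('Terminated') class. The toy labels are invented for illustration, and `implied_precision` is a hypothetical helper that simply rearranges the F1 formula.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    return {cls: n_samples / (len(counts) * count) for cls, count in counts.items()}


def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def implied_precision(f1, recall):
    """Solve F1 = 2PR / (P + R) for the precision P."""
    return f1 * recall / (2 * recall - f1)


# Toy data: 1 = 'Terminated', 0 = 'Completed', roughly a 1:2 class ratio.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 1, 1, 0, 0, 0]

weights = balanced_class_weights(y_true)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
approx_precision = implied_precision(0.4, 0.65)
```

Taking the approximate figures reported above at face value (F1 of about 0.4 and recall of about 0.65), `implied_precision` suggests a precision somewhere near 0.29 for the 'Terminated' class. This is a back-of-the-envelope estimate from rounded numbers, not a measured result.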

Model Optimization Notebook

Model Ensembling Notebook

Model Evaluation Notebook

I made a Streamlit app to allow users to interact directly with the logistic regression model and see how almost all the features affect the predictions for trial termination.

App deployed using Heroku

Streamlit App Code

The final project presentation is below:

Presentation

Beth526/metis_project_3