## Linear Regression Prediction on Clinical Trials Data

This notebook explores and prepares clinical trial data for predictive modeling. Starting from a study-level dataset drawn from the clinical_trials database I created with the pipeline in Part 1 of this project, it selects, transforms, and encodes study design features to support regression tasks such as predicting enrollment size or study duration. The emphasis is on avoiding data leakage, applying appropriate feature engineering, and building an analysis-ready dataset for downstream machine learning experiments.


In [65]:
import sqlite3
import pandas as pd

In [66]:
# raw string so the backslashes in the Windows path are not read as escape sequences
connection = sqlite3.connect(r'C:\jupyter notebook\Clinical Trials Data ETL Pipeline\SQL\clinical_trials.db')
cursor = connection.cursor()


In [67]:
query = "SELECT * FROM prediction_dataset"
data = pd.read_sql(query, connection)
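The load step above can be tried without the real database file. The sketch below is only an illustration: it builds an in-memory SQLite database with a two-row stand-in for `prediction_dataset` (the rows and the `NCT0001`/`NCT0002` ids are made up), reads it with `pd.read_sql`, and closes the connection when done via `contextlib.closing`.

```python
import sqlite3
from contextlib import closing

import pandas as pd

# In-memory database standing in for clinical_trials.db (toy rows, not real data).
with closing(sqlite3.connect(":memory:")) as conn:
    conn.execute("CREATE TABLE prediction_dataset (nct_id TEXT, duration_days REAL)")
    conn.executemany(
        "INSERT INTO prediction_dataset VALUES (?, ?)",
        [("NCT0001", 1119.0), ("NCT0002", None)],
    )
    conn.commit()
    # SQL NULL comes back as NaN in the numeric column.
    toy = pd.read_sql("SELECT * FROM prediction_dataset", conn)

print(toy.shape)
```

Closing the connection explicitly matters on Windows in particular, where an open handle can keep the `.db` file locked.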

In [68]:
# drop columns that add no predictive value or would cause data leakage,
# i.e. features that would not reasonably be known when the study is created or begins
data.drop(columns=['nct_id', 'brief_title', 'official_title', 'acronym', 'org_name', 'org_class', 'overall_status',
                   'start_date', 'primary_completion_date', 'completion_date', 'study_first_post_date', 'last_update_post_date',
                   'enrollment_size'], inplace=True)
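One possible way to make this step safer to re-run is to return a new frame and report any requested column that is not present, rather than dropping in place. The helper name `drop_known_columns` and the toy rows below are hypothetical and only illustrate the idea.

```python
import pandas as pd


def drop_known_columns(df, columns):
    """Return (copy of df without `columns`, list of requested names not found)."""
    present = [c for c in columns if c in df.columns]
    missing = [c for c in columns if c not in df.columns]
    return df.drop(columns=present), missing


# Toy frame with made-up values; only the column names mirror the dataset.
toy = pd.DataFrame({
    "nct_id": ["NCT0001", "NCT0002"],
    "overall_status": ["COMPLETED", "RECRUITING"],
    "masking": ["NONE", "DOUBLE"],
    "duration_days": [1119.0, 816.0],
})

features, missing = drop_known_columns(
    toy, ["nct_id", "overall_status", "completion_date"]
)
print(list(features.columns))  # ['masking', 'duration_days']
print(missing)                 # ['completion_date']
```

Because the original frame is left untouched, the cell can be run more than once without raising a `KeyError` on columns that are already gone.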

In [69]:
data.head(20)

| | study_type | allocation | intervention_model | primary_purpose | masking | enrollment_count | has_dmc | fda_regulated_drug | fda_regulated_device | duration_days | condition_count | location_count | intervention_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | INTERVENTIONAL | RANDOMIZED | PARALLEL | PREVENTION | NONE | 300.0 | 1 | 0 | 0 | 1119.0 | 4 | 1 | 2 |
| 1 | INTERVENTIONAL | | SINGLE_GROUP | TREATMENT | NONE | 1074.0 | 1 | 0 | 1 | 3887.0 | 1 | 51 | 1 |
| 2 | INTERVENTIONAL | | SINGLE_GROUP | DIAGNOSTIC | NONE | 43.0 | 1 | 0 | 0 | 816.0 | 2 | 1 | 2 |
| 3 | INTERVENTIONAL | RANDOMIZED | PARALLEL | TREATMENT | DOUBLE | 232.0 | 1 | 0 | 0 | | 1 | 5 | 1 |
| 4 | OBSERVATIONAL | | | | | 8680.0 | 0 | 1 | 0 | 2186.0 | 2 | 1 | 1 |
| 5 | INTERVENTIONAL | RANDOMIZED | PARALLEL | TREATMENT | TRIPLE | 30.0 | 0 | 0 | 1 | 1249.0 | 7 | 1 | 1 |
| 6 | INTERVENTIONAL | RANDOMIZED | PARALLEL | TREATMENT | SINGLE | 118.0 | 1 | 0 | 0 | 670.0 | 1 | 6 | 1 |
| 7 | OBSERVATIONAL | | | | | 1035.0 | 0 | 0 | 0 | 981.0 | 3 | 27 | 0 |
| 8 | INTERVENTIONAL | | SINGLE_GROUP | SUPPORTIVE_CARE | NONE | 28.0 | 0 | 0 | 0 | 354.0 | 1 | 1 | 1 |
| 9 | INTERVENTIONAL | RANDOMIZED | PARALLEL | TREATMENT | NONE | 120.0 | 1 | 0 | 0 | 624.0 | 1 | 24 | 2 |


In [70]:
data.describe(include='all')

| | study_type | allocation | intervention_model | primary_purpose | masking | enrollment_count | has_dmc | fda_regulated_drug | fda_regulated_device | duration_days | condition_count | location_count | intervention_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1000 | 676 | 672 | 672 | 675 | 994.0 | 1000.0 | 1000.0 | 1000.0 | 573.0 | 1000.0 | 1000.0 | 1000.0 |
| unique | 2 | 3 | 5 | 9 | 5 | | | | | | | | |
| top | INTERVENTIONAL | RANDOMIZED | PARALLEL | TREATMENT | NONE | | | | | | | | |
| freq | 679 | 496 | 459 | 458 | 321 | | | | | | | | |
| mean | | | | | | 4611.155 | 0.338 | 0.037 | 0.068 | 1109.116928 | 2.354 | 8.138 | 1.501 |
| std | | | | | | 68741.05 | 0.473265 | 0.188856 | 0.251872 | 1211.247699 | 10.904292 | 50.08786 | 1.148616 |
| min | | | | | | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 25% | | | | | | 30.0 | 0.0 | 0.0 | 0.0 | 368.0 | 1.0 | 1.0 | 1.0 |
| 50% | | | | | | 90.0 | 0.0 | 0.0 | 0.0 | 785.0 | 1.0 | 1.0 | 1.0 |
| 75% | | | | | | 343.0 | 1.0 | 0.0 | 0.0 | 1490.0 | 2.25 | 1.0 | 2.0 |


In [71]:
# drop rows with null values for duration_days, since that is the target we are trying to predict
# we don't want to fill the target variable with inaccurate information, so it's better to drop the row
data.dropna(subset=['duration_days'], inplace=True)
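According to the summary table, `duration_days` has 573 non-null values out of 1000 rows, so this step removes 427 rows. The toy example below (made-up frame, same column names) shows that `dropna(subset=...)` only looks at the listed column: a row with a missing `enrollment_count` but a known duration is kept.

```python
import numpy as np
import pandas as pd

# Toy frame: one row is missing the target, another is missing a feature.
toy = pd.DataFrame({
    "duration_days": [1119.0, np.nan, 816.0],
    "enrollment_count": [300.0, 232.0, np.nan],
})

rows_before = len(toy)
toy = toy.dropna(subset=["duration_days"])  # only the target column is checked
rows_dropped = rows_before - len(toy)

print(rows_dropped)                                # 1
print(int(toy["enrollment_count"].isna().sum()))   # 1
```

Logging the number of dropped rows like this makes it easy to notice if a later refresh of the dataset loses far more rows than expected.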

In [72]:
# these 'NA' strings are effectively NULL values; replace them with 'Unknown' to preserve information
# actual NULL values (None) are also replaced with 'Unknown' in a later cell
data['allocation'] = data['allocation'].replace('NA', 'Unknown')

In [73]:
# for our purposes 'OTHER' is effectively a NULL value; replace it with 'Unknown' to preserve information
data['primary_purpose'] = data['primary_purpose'].replace('OTHER', 'Unknown')

In [74]:
# fill all remaining NULL values in the categorical columns with 'Unknown' to preserve information for the model
categorical_cols = ['study_type', 'allocation', 'intervention_model', 'primary_purpose', 'masking']
data[categorical_cols] = data[categorical_cols].fillna('Unknown')

In [77]:
# running this confirms there are no more NULL values in the dataset
data.isna().sum()

study_type              0
allocation              0
intervention_model      0
primary_purpose         0
masking                 0
enrollment_count        0
has_dmc                 0
fda_regulated_drug      0
fda_regulated_device    0
duration_days           0
condition_count         0
location_count          0
intervention_count      0
dtype: int64

The data is now cleaned, formatted, and ready for model training!
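As a rough illustration of what the encoding and regression step could look like, the sketch below one-hot encodes two categorical columns with `pd.get_dummies` and fits an ordinary least squares model with `np.linalg.lstsq`. It is not the model used in this project: the seven rows are copied from the preview table earlier in the notebook (with missing `masking` already set to `'Unknown'`), only three features are used, and a real fit would use the full dataset with a train/test split.

```python
import numpy as np
import pandas as pd

# Seven rows taken from the data.head() preview, reduced to three features.
toy = pd.DataFrame({
    "study_type": ["INTERVENTIONAL", "INTERVENTIONAL", "OBSERVATIONAL",
                   "INTERVENTIONAL", "INTERVENTIONAL", "OBSERVATIONAL",
                   "INTERVENTIONAL"],
    "masking": ["NONE", "NONE", "Unknown", "TRIPLE", "SINGLE", "Unknown", "NONE"],
    "location_count": [1, 51, 1, 1, 6, 27, 1],
    "duration_days": [1119.0, 3887.0, 2186.0, 1249.0, 670.0, 981.0, 354.0],
})

# One-hot encode the categoricals; drop_first removes one level per column
# so the dummies are not redundant with the intercept.
X = pd.get_dummies(
    toy.drop(columns="duration_days"),
    columns=["study_type", "masking"],
    drop_first=True,
    dtype=float,
)
X.insert(0, "intercept", 1.0)
y = toy["duration_days"].to_numpy()

# Least squares solution; lstsq also copes with collinear columns.
coef, *_ = np.linalg.lstsq(X.to_numpy(dtype=float), y, rcond=None)
predictions = X.to_numpy(dtype=float) @ coef

print(dict(zip(X.columns, np.round(coef, 2))))
```

In this tiny sample `study_type_OBSERVATIONAL` and `masking_Unknown` are identical columns, which is a reminder to check for collinearity between study type and the design fields that only apply to interventional studies.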