# 🤖 Model Development & Comparison - Clinical Decision Support System



---

## Table of Contents
1. [Setup & Data Loading](#setup)
2. [Train-Test Split](#split)
3. [Baseline Model](#baseline)
4. [Random Forest](#rf)
5. [XGBoost](#xgboost)
6. [LightGBM](#lightgbm)
7. [Model Comparison](#comparison)
8. [Hyperparameter Tuning](#tuning)
9. [Final Model Selection](#final)

## 1. Setup & Data Loading <a id='setup'></a>

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# ML libraries
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, roc_curve, confusion_matrix, classification_report,
    precision_recall_curve, average_precision_score
)
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
import xgboost as xgb
import lightgbm as lgb
from scipy import stats
import joblib
import warnings

warnings.filterwarnings('ignore')
sns.set_style('whitegrid')

print("✅ Libraries imported successfully")

✅ Libraries imported successfully


In [2]:
# Load processed data
try:
    df = pd.read_csv('../data/processed/clinical_data_processed.csv')
    print("✅ Loaded processed data")
except FileNotFoundError:
    # Fallback to raw data when the processed file has not been generated yet
    df = pd.read_csv('../data/Clinical Data_Discovery_Cohort.csv')
    print("⚠️ Using raw data - run feature engineering notebook first for best results")

print(f"Dataset shape: {df.shape}")
df.head()

⚠️ Using raw data - run feature engineering notebook first for best results
Dataset shape: (30, 10)


|   | PatientID | Specimen date | Dead or Alive | Date of Death | Date of Last Follow Up | sex | race | Stage | Event | Time |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 3/17/2003 | Dead | 2/24/2010 | 2/24/2010 | F | B | pT2N2MX | 1 | 2536 |
| 1 | 2 | 6/17/2003 | Dead | 11/12/2004 | 11/12/2004 | M | W | T2N2MX | 1 | 514 |
| 2 | 3 | 9/9/2003 | Dead | 8/1/2009 | 8/1/2009 | F | B | T2N1MX | 1 | 2153 |
| 3 | 4 | 10/14/2003 | Dead | 12/29/2006 | 12/29/2006 | M | W | pT2NOMX | 1 | 1172 |
| 4 | 5 | 12/1/2003 | Dead | 1/31/2004 | 1/31/2004 | F | W | T2NOMX | 1 | 61 |


## 2. Train-Test Split <a id='split'></a>

In [3]:
# Identify target variable
target_cols = [col for col in df.columns if 'event' in col.lower() or 'outcome' in col.lower()]
if target_cols:
    target_col = target_cols[0]
else:
    # Use first binary column as target
    binary_cols = [col for col in df.columns if df[col].nunique() == 2]
    target_col = binary_cols[0] if binary_cols else df.columns[-1]

print(f"Target variable: {target_col}")
print(f"Target distribution:\n{df[target_col].value_counts()}")

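# Separate features and target
X = df.drop(columns=[target_col])
y = df[target_col]

# One-hot encode any remaining categorical variables.
# NOTE: on the raw fallback data this also encodes the date strings and
# 'Dead or Alive', and keeps PatientID; outcome-derived and identifier
# columns can leak the target and should be dropped or engineered first.
X = pd.get_dummies(X, drop_first=True)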

print(f"\nFeature matrix shape: {X.shape}")
print(f"Target shape: {y.shape}")

Target variable:  Event
Target distribution:
 Event
1    21
0     9
Name: count, dtype: int64

Feature matrix shape: (30, 100)
Target shape: (30,)
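
With the raw fallback file, 10 columns expand to 100 features because every distinct date string becomes its own one-hot column, and `Dead or Alive` restates the `Event` target. A minimal sketch of filtering such columns before encoding is shown below. The exclusion list and the `build_features` helper are hypothetical, the toy frame reuses only the five rows printed above, and whether `Time` belongs in the feature set is a modelling decision this sketch does not settle.

```python
import pandas as pd

# Toy frame built from the five rows shown above; the real file has 30 rows.
df_demo = pd.DataFrame({
    "PatientID": [1, 2, 3, 4, 5],
    "Specimen date": ["3/17/2003", "6/17/2003", "9/9/2003", "10/14/2003", "12/1/2003"],
    "Dead or Alive": ["Dead"] * 5,
    "Date of Death": ["2/24/2010", "11/12/2004", "8/1/2009", "12/29/2006", "1/31/2004"],
    "Date of Last Follow Up": ["2/24/2010", "11/12/2004", "8/1/2009", "12/29/2006", "1/31/2004"],
    "sex": ["F", "M", "F", "M", "F"],
    "race": ["B", "W", "B", "W", "W"],
    "Stage": ["pT2N2MX", "T2N2MX", "T2N1MX", "pT2NOMX", "T2NOMX"],
    "Event": [1, 1, 1, 1, 1],
    "Time": [2536, 514, 2153, 1172, 61],
})

# Hypothetical exclusion list: identifiers, raw date strings, and columns
# that restate the outcome.
LEAKY_OR_ID_COLS = [
    "PatientID", "Specimen date", "Dead or Alive",
    "Date of Death", "Date of Last Follow Up",
]

def build_features(frame, target="Event", exclude=LEAKY_OR_ID_COLS):
    """Drop excluded columns and the target, then one-hot encode the rest."""
    frame = frame.rename(columns=str.strip)  # tolerate padded column names
    present = [c for c in exclude if c in frame.columns]
    X = frame.drop(columns=present + [target])
    y = frame[target]
    return pd.get_dummies(X, drop_first=True), y

X_demo, y_demo = build_features(df_demo)
print(X_demo.columns.tolist())
```

Parsing the dates into durations, rather than dropping them, would be the job of the feature engineering notebook.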


In [4]:
# Split data: 60% train, 20% validation, 20% test
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp  # 0.25 of 0.8 = 0.2
)

print("Data Split:")
print(f"Training set: {X_train.shape[0]} samples ({X_train.shape[0]/len(X)*100:.1f}%)")
print(f"Validation set: {X_val.shape[0]} samples ({X_val.shape[0]/len(X)*100:.1f}%)")
print(f"Test set: {X_test.shape[0]} samples ({X_test.shape[0]/len(X)*100:.1f}%)")

# Check class balance
print(f"\nClass balance in training set:")
print(y_train.value_counts(normalize=True))

Data Split:
Training set: 18 samples (60.0%)
Validation set: 6 samples (20.0%)
Test set: 6 samples (20.0%)

Class balance in training set:
 Event
1    0.722222
0    0.277778
Name: proportion, dtype: float64
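
A 6-sample validation set holds only about two non-events, so any single split gives a very coarse estimate. The notebook imports `StratifiedKFold` and `cross_val_score` without using them; the sketch below shows one way repeated stratified cross-validation could be used instead. The data is synthetic (same size and class ratio as the cohort), and the fold and repeat counts are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic stand-in: 21 events and 9 non-events, 4 weakly informative features.
rng = np.random.default_rng(42)
y_demo = np.array([1] * 21 + [0] * 9)
X_demo = rng.normal(size=(30, 4)) + y_demo[:, None] * 0.8

# 3 folds keep 3 non-events in every held-out fold; 10 repeats give 30 scores.
cv = RepeatedStratifiedKFold(n_splits=3, n_repeats=10, random_state=42)
scores = cross_val_score(
    LogisticRegression(max_iter=1000, class_weight="balanced"),
    X_demo, y_demo, cv=cv, scoring="roc_auc",
)
print(f"AUC-ROC: {scores.mean():.3f} +/- {scores.std():.3f} over {len(scores)} folds")
```

The spread across folds is as informative as the mean here, since it shows how much a single split can mislead.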


## 3. Baseline Model - Logistic Regression <a id='baseline'></a>

In [5]:
# Train Logistic Regression
print("Training Logistic Regression (Baseline)...")

lr_model = LogisticRegression(max_iter=1000, random_state=42, class_weight='balanced')
lr_model.fit(X_train, y_train)

# Predictions
y_pred_lr = lr_model.predict(X_val)
y_pred_proba_lr = lr_model.predict_proba(X_val)[:, 1]

# Evaluation
lr_metrics = {
    'Model': 'Logistic Regression',
    'Accuracy': accuracy_score(y_val, y_pred_lr),
    'Precision': precision_score(y_val, y_pred_lr, zero_division=0),
    'Recall': recall_score(y_val, y_pred_lr, zero_division=0),
    'F1-Score': f1_score(y_val, y_pred_lr, zero_division=0),
    'AUC-ROC': roc_auc_score(y_val, y_pred_proba_lr)
}

print("\n✅ Logistic Regression Results:")
for metric, value in lr_metrics.items():
    if metric != 'Model':
        print(f"{metric}: {value:.4f}")

# Confusion Matrix
cm_lr = confusion_matrix(y_val, y_pred_lr)
print(f"\nConfusion Matrix:\n{cm_lr}")

Training Logistic Regression (Baseline)...

✅ Logistic Regression Results:
Accuracy: 0.6667
Precision: 0.6667
Recall: 1.0000
F1-Score: 0.8000
AUC-ROC: 1.0000

Confusion Matrix:
[[0 2]
 [0 4]]
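
The reported metrics follow directly from the confusion matrix: the model predicted an event for all six validation patients, so recall is perfect while both non-events are misclassified. The AUC-ROC of 1.0 does not contradict this, because it is computed from the ranking of predicted probabilities rather than from the 0.5 decision threshold. A small worked check, using a hypothetical helper and scikit-learn's `[[TN, FP], [FN, TP]]` layout:

```python
def metrics_from_confusion(tn, fp, fn, tp):
    """Derive threshold-based metrics from binary confusion-matrix counts."""
    total = tn + fp + fn + tp
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
    }

# Counts taken from the matrix above: [[0, 2], [0, 4]].
m = metrics_from_confusion(tn=0, fp=2, fn=0, tp=4)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

Specificity comes out as 0.0000, which the accuracy and F1 figures alone do not make obvious.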


## 4. Random Forest <a id='rf'></a>

In [6]:
# Train Random Forest
print("Training Random Forest...")

rf_model = RandomForestClassifier(
    n_estimators=200,
    max_depth=15,
    min_samples_split=10,
    min_samples_leaf=4,
    max_features='sqrt',
    class_weight='balanced',
    random_state=42,
    n_jobs=-1
)

rf_model.fit(X_train, y_train)

# Predictions
y_pred_rf = rf_model.predict(X_val)
y_pred_proba_rf = rf_model.predict_proba(X_val)[:, 1]

# Evaluation
rf_metrics = {
    'Model': 'Random Forest',
    'Accuracy': accuracy_score(y_val, y_pred_rf),
    'Precision': precision_score(y_val, y_pred_rf, zero_division=0),
    'Recall': recall_score(y_val, y_pred_rf, zero_division=0),
    'F1-Score': f1_score(y_val, y_pred_rf, zero_division=0),
    'AUC-ROC': roc_auc_score(y_val, y_pred_proba_rf)
}

print("\n✅ Random Forest Results:")
for metric, value in rf_metrics.items():
    if metric != 'Model':
        print(f"{metric}: {value:.4f}")

# Feature Importance
feature_importance_rf = pd.DataFrame({
    'Feature': X.columns,
    'Importance': rf_model.feature_importances_
}).sort_values('Importance', ascending=False)

print("\nTop 10 Important Features:")
print(feature_importance_rf.head(10))

Training Random Forest...

✅ Random Forest Results:
Accuracy: 0.8333
Precision: 0.8000
Recall: 1.0000
F1-Score: 0.8889
AUC-ROC: 0.8750

Top 10 Important Features:
                                            Feature  Importance
1                                              Time    0.301370
0                                         PatientID    0.260274
81                                         sex_ M      0.164384
30                     Dead or Alive_ Dead             0.109589
82                                       race_ B       0.095890
84                                       race_ W       0.068493
98                                 Stage  _ pT2pN0     0.000000
72   Date of Last Follow Up_ 6/5/2009                  0.000000
70   Date of Last Follow Up_ 5/9/2006                  0.000000
69   Date of Last Follow Up_ 5/21/2006                 0.000000


## 5. XGBoost <a id='xgboost'></a>

In [7]:
# Train XGBoost
print("Training XGBoost...")

# Calculate scale_pos_weight for imbalanced data
scale_pos_weight = (y_train == 0).sum() / (y_train == 1).sum()

xgb_model = xgb.XGBClassifier(
    n_estimators=200,
    max_depth=6,
    learning_rate=0.1,
    subsample=0.8,
    colsample_bytree=0.8,
    scale_pos_weight=scale_pos_weight,
    random_state=42,
    n_jobs=-1,
    eval_metric='logloss'
)

xgb_model.fit(X_train, y_train)

# Predictions
y_pred_xgb = xgb_model.predict(X_val)
y_pred_proba_xgb = xgb_model.predict_proba(X_val)[:, 1]

# Evaluation
xgb_metrics = {
    'Model': 'XGBoost',
    'Accuracy': accuracy_score(y_val, y_pred_xgb),
    'Precision': precision_score(y_val, y_pred_xgb, zero_division=0),
    'Recall': recall_score(y_val, y_pred_xgb, zero_division=0),
    'F1-Score': f1_score(y_val, y_pred_xgb, zero_division=0),
    'AUC-ROC': roc_auc_score(y_val, y_pred_proba_xgb)
}

print("\n✅ XGBoost Results:")
for metric, value in xgb_metrics.items():
    if metric != 'Model':
        print(f"{metric}: {value:.4f}")

Training XGBoost...

✅ XGBoost Results:
Accuracy: 1.0000
Precision: 1.0000
Recall: 1.0000
F1-Score: 1.0000
AUC-ROC: 1.0000
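
For reference, the `scale_pos_weight` passed to XGBoost is the ratio of negative to positive training samples. Here the positive class (event) is the majority, so the ratio falls below 1 and down-weights events. The arithmetic below reconstructs the counts from the training-set class balance printed earlier; the counts are inferred from those proportions, not read from the data.

```python
# 18 training rows, 72.2222% events (from the class-balance output).
n_train = 18
n_pos = round(n_train * 0.722222)  # events
n_neg = n_train - n_pos            # non-events

scale_pos_weight = n_neg / n_pos
print(f"{n_pos} events, {n_neg} non-events, scale_pos_weight = {scale_pos_weight:.3f}")
```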


## 6. LightGBM <a id='lightgbm'></a>

In [8]:
# Train LightGBM
print("Training LightGBM...")

lgb_model = lgb.LGBMClassifier(
    n_estimators=200,
    max_depth=6,
    learning_rate=0.1,
    subsample=0.8,
    colsample_bytree=0.8,
    class_weight='balanced',
    random_state=42,
    n_jobs=-1,
    verbose=-1
)

lgb_model.fit(X_train, y_train)

# Predictions
y_pred_lgb = lgb_model.predict(X_val)
y_pred_proba_lgb = lgb_model.predict_proba(X_val)[:, 1]

# Evaluation
lgb_metrics = {
    'Model': 'LightGBM',
    'Accuracy': accuracy_score(y_val, y_pred_lgb),
    'Precision': precision_score(y_val, y_pred_lgb, zero_division=0),
    'Recall': recall_score(y_val, y_pred_lgb, zero_division=0),
    'F1-Score': f1_score(y_val, y_pred_lgb, zero_division=0),
    'AUC-ROC': roc_auc_score(y_val, y_pred_proba_lgb)
}

print("\n✅ LightGBM Results:")
for metric, value in lgb_metrics.items():
    if metric != 'Model':
        print(f"{metric}: {value:.4f}")

Training LightGBM...

✅ LightGBM Results:
Accuracy: 0.6667
Precision: 0.6667
Recall: 1.0000
F1-Score: 0.8000
AUC-ROC: 0.5000
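
An AUC-ROC of exactly 0.5 is what a model produces when it assigns the same score to every sample. One plausible cause, not verified against this run, is that LightGBM's default `min_child_samples=20` exceeds the 18 training rows, which would leave the trees unable to split. The sketch below uses a hypothetical pairwise AUC helper to show the constant-score case, and also why validation AUC moves in coarse steps when there are only 2 non-events and 4 events (8 pairs). The scores are made up for illustration.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (event, non-event) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_val_demo = [0, 0, 1, 1, 1, 1]  # same class mix as the validation set

auc_constant = pairwise_auc(y_val_demo, [0.5] * 6)
auc_perfect = pairwise_auc(y_val_demo, [0.51, 0.52, 0.6, 0.7, 0.8, 0.9])
auc_one_swap = pairwise_auc(y_val_demo, [0.51, 0.65, 0.6, 0.7, 0.8, 0.9])

print(auc_constant, auc_perfect, auc_one_swap)
```

A single misordered pair already moves the score from 1.0 to 0.875, so differences between models on this validation set amount to one or two pairs.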


## 7. Model Comparison <a id='comparison'></a>

### 7.1 Performance Metrics Table

In [9]:
# Compile all metrics
results_df = pd.DataFrame([lr_metrics, rf_metrics, xgb_metrics, lgb_metrics])
results_df = results_df.set_index('Model')

print("Model Comparison:")
display(results_df.style.highlight_max(axis=0, props='background-color: lightgreen'))

# Visualize metrics
fig = go.Figure()

for metric in ['Accuracy', 'Precision', 'Recall', 'F1-Score', 'AUC-ROC']:
    fig.add_trace(go.Bar(
        name=metric,
        x=results_df.index,
        y=results_df[metric],
        text=results_df[metric].round(3),
        textposition='auto'
    ))

fig.update_layout(
    title='Model Performance Comparison',
    xaxis_title='Model',
    yaxis_title='Score',
    barmode='group',
    height=500
)
fig.show()

Model Comparison:


| Model | Accuracy | Precision | Recall | F1-Score | AUC-ROC |
|---|---|---|---|---|---|
| Logistic Regression | 0.666667 | 0.666667 | 1.0 | 0.8 | 1.0 |
| Random Forest | 0.833333 | 0.8 | 1.0 | 0.888889 | 0.875 |
| XGBoost | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| LightGBM | 0.666667 | 0.666667 | 1.0 | 0.8 | 0.5 |


### 7.2 ROC Curves

In [10]:
# Plot ROC curves for all models
fig = go.Figure()

models = [
    ('Logistic Regression', y_pred_proba_lr),
    ('Random Forest', y_pred_proba_rf),
    ('XGBoost', y_pred_proba_xgb),
    ('LightGBM', y_pred_proba_lgb)
]

for name, y_pred_proba in models:
    fpr, tpr, _ = roc_curve(y_val, y_pred_proba)
    auc = roc_auc_score(y_val, y_pred_proba)
    
    fig.add_trace(go.Scatter(
        x=fpr, y=tpr,
        name=f'{name} (AUC = {auc:.3f})',
        mode='lines'
    ))

# Add diagonal line
fig.add_trace(go.Scatter(
    x=[0, 1], y=[0, 1],
    name='Random Classifier',
    mode='lines',
    line=dict(dash='dash', color='gray')
))

fig.update_layout(
    title='ROC Curves - Model Comparison',
    xaxis_title='False Positive Rate',
    yaxis_title='True Positive Rate',
    height=600,
    width=800
)
fig.show()

### 7.3 Precision-Recall Curves

In [11]:
# Plot Precision-Recall curves
fig = go.Figure()

for name, y_pred_proba in models:
    precision, recall, _ = precision_recall_curve(y_val, y_pred_proba)
    ap = average_precision_score(y_val, y_pred_proba)
    
    fig.add_trace(go.Scatter(
        x=recall, y=precision,
        name=f'{name} (AP = {ap:.3f})',
        mode='lines'
    ))

fig.update_layout(
    title='Precision-Recall Curves - Model Comparison',
    xaxis_title='Recall',
    yaxis_title='Precision',
    height=600,
    width=800
)
fig.show()

### 7.4 Confusion Matrices

In [12]:
# Plot confusion matrices
fig = make_subplots(
    rows=2, cols=2,
    subplot_titles=['Logistic Regression', 'Random Forest', 'XGBoost', 'LightGBM']
)

predictions = [
    (y_pred_lr, 1, 1),
    (y_pred_rf, 1, 2),
    (y_pred_xgb, 2, 1),
    (y_pred_lgb, 2, 2)
]

for y_pred, row, col in predictions:
    cm = confusion_matrix(y_val, y_pred)
    
    fig.add_trace(
        go.Heatmap(
            z=cm,
            x=['Predicted 0', 'Predicted 1'],
            y=['Actual 0', 'Actual 1'],
            colorscale='Blues',
            showscale=False,
            text=cm,
            texttemplate='%{text}',
            textfont={"size": 16}
        ),
        row=row, col=col
    )

fig.update_layout(height=800, title_text="Confusion Matrices Comparison")
fig.show()

## 8. Hyperparameter Tuning <a id='tuning'></a>

### 8.1 Random Forest Tuning

In [13]:
# Hyperparameter grid for Random Forest
rf_param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 15, 20],
    'min_samples_split': [5, 10, 15],
    'min_samples_leaf': [2, 4, 6]
}

print("Tuning Random Forest (this may take a few minutes)...")

rf_grid = RandomizedSearchCV(
    RandomForestClassifier(random_state=42, class_weight='balanced', n_jobs=-1),
    param_distributions=rf_param_grid,
    n_iter=20,
    cv=3,
    scoring='roc_auc',
    random_state=42,
    n_jobs=-1
)

rf_grid.fit(X_train, y_train)

print("\n✅ Best Random Forest Parameters:")
print(rf_grid.best_params_)
print(f"\nBest CV AUC-ROC: {rf_grid.best_score_:.4f}")

# Evaluate tuned model
y_pred_rf_tuned = rf_grid.predict(X_val)
y_pred_proba_rf_tuned = rf_grid.predict_proba(X_val)[:, 1]

print(f"\nTuned Random Forest Validation AUC-ROC: {roc_auc_score(y_val, y_pred_proba_rf_tuned):.4f}")

Tuning Random Forest (this may take a few minutes)...

✅ Best Random Forest Parameters:
{'n_estimators': 200, 'min_samples_split': 5, 'min_samples_leaf': 2, 'max_depth': 15}

Best CV AUC-ROC: 0.8750

Tuned Random Forest Validation AUC-ROC: 1.0000
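
To put the search in context, the arithmetic below counts how much of the grid `RandomizedSearchCV` actually samples and how small each cross-validation fold is. It only restates the settings from the cell above; the training-set size of 18 comes from the split output.

```python
from itertools import product

rf_param_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [10, 15, 20],
    "min_samples_split": [5, 10, 15],
    "min_samples_leaf": [2, 4, 6],
}
n_iter, cv_folds, n_train = 20, 3, 18

n_combinations = len(list(product(*rf_param_grid.values())))
total_fits = n_iter * cv_folds
rows_per_held_out_fold = n_train // cv_folds

print(f"{n_iter} of {n_combinations} combinations sampled ({n_iter / n_combinations:.0%})")
print(f"{total_fits} fits, each scored on {rows_per_held_out_fold} held-out rows")
```

With 12 rows left for fitting in each fold, the larger `min_samples_split` and `min_samples_leaf` values allow very few splits, so many sampled configurations may behave almost identically.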


## 9. Final Model Selection <a id='final'></a>

In [14]:
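# Select best model based on validation AUC-ROC.
# NOTE: idxmax() returns the first row on ties, so table order decides
# between models with equal AUC-ROC.
best_model_name = results_df['AUC-ROC'].idxmax()
best_auc = results_df['AUC-ROC'].max()

print(f"🏆 Best Model: {best_model_name}")
print(f"Validation AUC-ROC: {best_auc:.4f}")

# Select the corresponding model.
# NOTE: results_df holds the untuned Random Forest metrics, while this mapping
# returns the tuned estimator whenever rf_grid exists.
model_mapping = {
    'Logistic Regression': lr_model,
    'Random Forest': rf_grid.best_estimator_ if 'rf_grid' in locals() else rf_model,
    'XGBoost': xgb_model,
    'LightGBM': lgb_model
}

final_model = model_mapping[best_model_name]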

# Evaluate on test set
print("\nEvaluating on Test Set...")
y_pred_test = final_model.predict(X_test)
y_pred_proba_test = final_model.predict_proba(X_test)[:, 1]

test_metrics = {
    'Accuracy': accuracy_score(y_test, y_pred_test),
    'Precision': precision_score(y_test, y_pred_test, zero_division=0),
    'Recall': recall_score(y_test, y_pred_test, zero_division=0),
    'F1-Score': f1_score(y_test, y_pred_test, zero_division=0),
    'AUC-ROC': roc_auc_score(y_test, y_pred_proba_test)
}

print("\n📊 Test Set Performance:")
for metric, value in test_metrics.items():
    print(f"{metric}: {value:.4f}")

# Classification report
print("\nDetailed Classification Report:")
print(classification_report(y_test, y_pred_test))

🏆 Best Model: Logistic Regression
Validation AUC-ROC: 1.0000

Evaluating on Test Set...

📊 Test Set Performance:
Accuracy: 0.5000
Precision: 0.6667
Recall: 0.5000
F1-Score: 0.5714
AUC-ROC: 0.6250

Detailed Classification Report:
              precision    recall  f1-score   support

           0       0.33      0.50      0.40         2
           1       0.67      0.50      0.57         4

    accuracy                           0.50         6
   macro avg       0.50      0.50      0.49         6
weighted avg       0.56      0.50      0.51         6
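
Logistic Regression and XGBoost both reached a validation AUC-ROC of 1.0, and Logistic Regression was selected only because it appears first in the table. The sketch below reproduces that behaviour with the validation figures from the comparison table and shows one possible tie-break. Using F1-score as the secondary key is a hypothetical choice, not the notebook's rule, and with six validation samples neither ordering is strong evidence.

```python
# Validation metrics copied from the comparison table.
validation_results = [
    {"Model": "Logistic Regression", "F1-Score": 0.8, "AUC-ROC": 1.0},
    {"Model": "Random Forest", "F1-Score": 0.888889, "AUC-ROC": 0.875},
    {"Model": "XGBoost", "F1-Score": 1.0, "AUC-ROC": 1.0},
    {"Model": "LightGBM", "F1-Score": 0.8, "AUC-ROC": 0.5},
]

# max() keeps the first maximal element, like idxmax() on the DataFrame.
first_max = max(validation_results, key=lambda r: r["AUC-ROC"])

# Hypothetical tie-break: compare (AUC-ROC, F1-Score) tuples instead.
tie_broken = max(validation_results, key=lambda r: (r["AUC-ROC"], r["F1-Score"]))

print("AUC-ROC only:      ", first_max["Model"])
print("AUC-ROC, then F1:  ", tie_broken["Model"])
```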



In [15]:
# Save final model
model_filename = '../src/models/final_patient_outcome_model.pkl'
joblib.dump(final_model, model_filename)
print(f"\n✅ Saved final model to: {model_filename}")

# Save feature names
feature_names = X.columns.tolist()
with open('../src/models/feature_names.txt', 'w') as f:
    f.write('\n'.join(feature_names))
print("✅ Saved feature names")

# Save model metadata
metadata = {
    'model_name': best_model_name,
    'test_auc_roc': test_metrics['AUC-ROC'],
    'test_accuracy': test_metrics['Accuracy'],
    'test_f1_score': test_metrics['F1-Score'],
    'n_features': len(feature_names),
    'training_samples': len(X_train)
}

import json
with open('../src/models/model_metadata.json', 'w') as f:
    json.dump(metadata, f, indent=2)
print("✅ Saved model metadata")


✅ Saved final model to: ../src/models/final_patient_outcome_model.pkl
✅ Saved feature names
✅ Saved model metadata
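
The saved feature names matter at inference time: a one-hot encoded record for a new patient will usually lack some of the training columns, and the model expects them in the original order. A minimal sketch of that alignment step follows. The `align_features` helper, the feature subset, and the patient values are all hypothetical; in practice the list would be read back from `feature_names.txt`.

```python
def align_features(row, feature_names):
    """Order values like feature_names; one-hot columns absent from row become 0."""
    unknown = sorted(set(row) - set(feature_names))
    if unknown:
        # A column the model never saw usually signals a preprocessing mismatch.
        raise KeyError(f"Unexpected feature(s): {unknown}")
    return [row.get(name, 0) for name in feature_names]

saved_names = ["Time", "sex_M", "race_B", "race_W"]   # illustrative subset
new_patient = {"Time": 730, "sex_M": 1, "race_W": 1}  # made-up record

vector = align_features(new_patient, saved_names)
print(vector)
```

Raising on unknown columns, rather than silently dropping them, is a design choice that surfaces encoding drift early.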


## Summary

### Models Trained:
1. **Logistic Regression** (Baseline)
2. **Random Forest** (with hyperparameter tuning)
3. **XGBoost**
4. **LightGBM**

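### Best Model:
- Logistic Regression, selected by validation AUC-ROC (1.0000 on 6 samples)
- Independent test set (6 samples): AUC-ROC 0.6250, accuracy 0.5000, F1-score 0.5714
- Saved to `../src/models/final_patient_outcome_model.pkl` for deployment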

### Next Steps:
1. Proceed to Model Interpretability notebook
2. Implement SHAP analysis
3. Create feature importance visualizations
4. Validate clinical relevance

---

**Notebook Version:** 1.0  
**Last Updated:** November 2025