# Oncology Clinical Trials Model Training

This notebook demonstrates the model training process for predicting oncology clinical trial outcomes. We'll train both classification models (to predict trial completion status) and regression models (to predict trial duration).

Building on the features engineered in the previous notebook, we now train predictive models and analyze their performance.

## Setup

First, let's import the necessary libraries and modules.

In [None]:
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
from datetime import datetime

# Add project root to path to import project modules
project_root = Path().resolve().parents[0]
sys.path.append(str(project_root))

# Import model training functions
from src.models.train_model import (
    load_modeling_data,
    train_classification_models,
    train_regression_models,
    save_model
)

# Import visualization functions for model analysis
from src.visualization.visualize import set_plotting_style

# Define project directories
PROJECT_DIR = project_root
PROCESSED_DATA_DIR = PROJECT_DIR / 'data' / 'processed'
MODEL_DIR = PROJECT_DIR / 'models'

# Set plotting style
set_plotting_style()

# Set random seed for reproducibility
np.random.seed(42)

## Load Modeling Data

Let's load the modeling-ready dataset that was prepared in the feature engineering notebook.

In [None]:
# Load the most recent modeling-ready dataset
df_modeling = load_modeling_data()

print(f"Loaded dataset with {df_modeling.shape[0]} rows and {df_modeling.shape[1]} columns")

# Display the first few rows
df_modeling.head()
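
`load_modeling_data` is a project function whose internals are not shown in this notebook. As a rough sketch of how a "most recent dataset" lookup could work, the helper below assumes that processed files are CSVs matching a `modeling_data_*.csv` pattern and that recency is judged by file modification time. Both the helper name and the file pattern are hypothetical.

```python
from pathlib import Path

import pandas as pd


def load_latest_csv(data_dir: Path, pattern: str = "modeling_data_*.csv") -> pd.DataFrame:
    """Hypothetical helper: load the newest CSV in data_dir matching pattern."""
    # Sort candidate files by modification time, oldest first
    candidates = sorted(data_dir.glob(pattern), key=lambda p: p.stat().st_mtime)
    if not candidates:
        raise FileNotFoundError(f"No files matching {pattern!r} in {data_dir}")
    # The last entry is the most recently modified file
    return pd.read_csv(candidates[-1])
```

Raising an explicit `FileNotFoundError` when nothing matches makes a missing feature engineering run obvious, rather than failing later with an empty DataFrame.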

## Prepare Features and Target Variables

We'll prepare data for two modeling tasks:
1. Classification models to predict trial completion status (completed vs. terminated)
2. Regression models to predict trial duration (in days)

Let's prepare the feature matrix and target variables for both tasks.

In [None]:
# Define target variables
classification_target = 'is_completed'  # Binary: 1 for completed, 0 for terminated
regression_target = 'duration_days'     # Continuous: trial duration in days

# Filter data to include only completed or terminated trials for classification
# (.copy() avoids a SettingWithCopyWarning when the target column is added below)
df_classification = df_modeling[df_modeling['overall_status'].isin(['Completed', 'Terminated'])].copy()

# Create binary target for classification (1 for completed, 0 for terminated)
df_classification[classification_target] = (df_classification['overall_status'] == 'Completed').astype(int)

# Filter data to include only completed trials for regression (duration prediction)
df_regression = df_modeling[df_modeling['overall_status'] == 'Completed']

# Identify feature columns (exclude target variables and metadata)
exclude_cols = ['nct_id', 'overall_status', classification_target, regression_target]
feature_cols = [col for col in df_modeling.columns if col not in exclude_cols]

print(f"Classification dataset: {df_classification.shape[0]} rows, {len(feature_cols)} features")
print(f"Regression dataset: {df_regression.shape[0]} rows, {len(feature_cols)} features")

# Prepare feature matrices and target vectors
X_classification = df_classification[feature_cols]
y_classification = df_classification[classification_target]

X_regression = df_regression[feature_cols]
y_regression = df_regression[regression_target]

# Check class balance for classification
print("\nClass distribution for completion status (%):")
print((y_classification.value_counts(normalize=True) * 100).round(1))
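
If the class distribution turns out to be imbalanced, it usually helps to keep the same class proportions in the training and test sets. The notebook does not show how `train_classification_models` splits the data, so the following is only a sketch of a stratified split written with NumPy. The function name and the toy labels are illustrative assumptions.

```python
import numpy as np


def stratified_split_indices(y, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with each class split in the same proportion."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    train_idx, test_idx = [], []
    for label in np.unique(y):
        # Shuffle the positions of this class, then cut off the test share
        idx = rng.permutation(np.flatnonzero(y == label))
        n_test = int(round(len(idx) * test_size))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return np.sort(train_idx), np.sort(test_idx)


# Toy labels: 80% completed (1), 20% terminated (0)
y_toy = np.array([1] * 80 + [0] * 20)
train_idx, test_idx = stratified_split_indices(y_toy, test_size=0.2)

# Both subsets keep the original share of completed trials
print(y_toy[train_idx].mean(), y_toy[test_idx].mean())
```

scikit-learn offers the same behavior through the `stratify` argument of `train_test_split`.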

## Train Classification Models

Now, let's train classification models to predict trial completion status. We'll train multiple models and compare their performance.

In [None]:
# Train classification models
classification_results = train_classification_models(X_classification, y_classification)

# Display performance metrics
classification_metrics = pd.DataFrame(classification_results['metrics'])
classification_metrics = classification_metrics.sort_values('f1_score', ascending=False)

print("\nClassification Model Performance:")
display(classification_metrics)

# Identify best model
best_classification_model_name = classification_metrics.index[0]
best_classification_model = classification_results['models'][best_classification_model_name]

print(f"\nBest classification model: {best_classification_model_name}")

# Save the best model
model_info = {
    'model': best_classification_model,
    'preprocessor': classification_results['preprocessor'],
    'feature_names': feature_cols,
    'metrics': classification_metrics.loc[best_classification_model_name].to_dict()
}

save_model(model_info, 'classification')
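
Selecting the best model with `classification_metrics.index[0]` only works if the metrics table has one row per model. The structure returned by `train_classification_models` is not shown here, so the sketch below assumes a hypothetical layout of `{model_name: {metric_name: value}}` and uses made-up numbers to illustrate how the orientation affects the result.

```python
import pandas as pd

# Hypothetical layout for results['metrics']: {model_name: {metric_name: value}}
# The numbers are invented purely to illustrate the table shape.
toy_metrics = {
    "logistic_regression": {"accuracy": 0.70, "f1_score": 0.65},
    "random_forest": {"accuracy": 0.75, "f1_score": 0.72},
}

# pd.DataFrame(toy_metrics) would place models in columns;
# orient="index" places them in rows, which sort_values and index[0] expect
metrics_df = pd.DataFrame.from_dict(toy_metrics, orient="index")
metrics_df = metrics_df.sort_values("f1_score", ascending=False)

best_name = metrics_df.index[0]
print(best_name)
```

If the project function already returns a list of records or a row-per-model mapping, the plain `pd.DataFrame(...)` call used above is sufficient.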

## Visualize Classification Results

Let's visualize the performance of our classification models.

In [None]:
# Plot model performance comparison
plt.figure(figsize=(12, 6))

# Plot accuracy, precision, recall, and F1 score
metrics_to_plot = ['accuracy', 'precision', 'recall', 'f1_score']

# Prepare data for plotting
plot_data = classification_metrics[metrics_to_plot].reset_index()
plot_data = pd.melt(plot_data, id_vars=['index'], value_vars=metrics_to_plot, 
                    var_name='Metric', value_name='Score')

# Create the plot
sns.barplot(x='index', y='Score', hue='Metric', data=plot_data)
plt.title('Classification Model Performance Comparison')
plt.xlabel('Model')
plt.ylabel('Score')
plt.xticks(rotation=45)
plt.ylim(0, 1)
plt.legend(title='Metric')
plt.tight_layout()
plt.show()
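
When reading the comparison chart, it helps to recall how the four metrics relate to each other. The sketch below computes them from confusion counts on toy labels, treating completed trials (label 1) as the positive class. The function name and data are illustrative and are not taken from the project code.

```python
import numpy as np


def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for the positive class (label 1)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    # Guard against division by zero when a class is never predicted or present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1_score": f1,
    }


# Toy case: 8 completed, 2 terminated; the "model" predicts completed for every trial
y_true = [1] * 8 + [0] * 2
y_pred = [1] * 10
print(binary_metrics(y_true, y_pred))
```

In this toy case a model that never predicts termination still reaches 0.8 accuracy and a high F1 for the completed class, which is why class balance matters when interpreting these scores.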

## Feature Importance for Classification

Let's examine which features are most important for predicting trial completion status.

In [None]:
# Get feature importance from the best model (if available)
if hasattr(best_classification_model, 'feature_importances_'):
    # Get feature importances
    importances = best_classification_model.feature_importances_
    
    # Get feature names after preprocessing
    if hasattr(classification_results['preprocessor'], 'get_feature_names_out'):
        feature_names = classification_results['preprocessor'].get_feature_names_out()
    else:
        feature_names = feature_cols
    
    # Create a DataFrame for visualization
    feature_importance = pd.DataFrame({
        'feature': feature_names,
        'importance': importances
    })
    
    # Sort by importance
    feature_importance = feature_importance.sort_values('importance', ascending=False)
    
    # Plot top 20 features
    plt.figure(figsize=(12, 8))
    sns.barplot(x='importance', y='feature', data=feature_importance.head(20))
    plt.title(f'Top 20 Important Features for {best_classification_model_name}')
    plt.xlabel('Importance')
    plt.ylabel('Feature')
    plt.tight_layout()
    plt.show()
else:
    print(f"Feature importance not available for {best_classification_model_name}")
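
When the best model has no `feature_importances_` attribute (for example, a linear model), permutation importance is one model-agnostic alternative. This is a different technique from the tree-based importances used above: it measures how much accuracy drops when a single column is shuffled. The following is a minimal NumPy sketch on toy data, with a stand-in `toy_predict` function in place of a fitted model.

```python
import numpy as np


def permutation_importance_simple(predict, X, y, n_repeats=5, seed=42):
    """Mean drop in accuracy when each column of X is shuffled in turn."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    baseline = np.mean(predict(X) == y)
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Shuffle only column j, breaking its link to the target
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            drops.append(baseline - np.mean(predict(X_perm) == y))
        importances[j] = np.mean(drops)
    return importances


# Toy data: the label depends only on column 0; column 1 is noise
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 2))
y_toy = (X_toy[:, 0] > 0).astype(int)


def toy_predict(X):
    """Stand-in for model.predict: thresholds column 0."""
    return (X[:, 0] > 0).astype(int)


print(permutation_importance_simple(toy_predict, X_toy, y_toy))
```

scikit-learn provides a fuller implementation as `sklearn.inspection.permutation_importance`, which works with fitted estimators and arbitrary scoring functions.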

## Train Regression Models

Now, let's train regression models to predict trial duration for completed trials.

In [None]:
# Train regression models
regression_results = train_regression_models(X_regression, y_regression)

# Display performance metrics
regression_metrics = pd.DataFrame(regression_results['metrics'])
regression_metrics = regression_metrics.sort_values('r2_score', ascending=False)

print("\nRegression Model Performance:")
display(regression_metrics)

# Identify best model
best_regression_model_name = regression_metrics.index[0]
best_regression_model = regression_results['models'][best_regression_model_name]

print(f"\nBest regression model: {best_regression_model_name}")

# Save the best model
model_info = {
    'model': best_regression_model,
    'preprocessor': regression_results['preprocessor'],
    'feature_names': feature_cols,
    'metrics': regression_metrics.loc[best_regression_model_name].to_dict()
}

save_model(model_info, 'regression')
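
The `model_info` dictionary bundles the model with its preprocessor, feature names, and metrics so that the same transformations can be applied at prediction time. How `save_model` persists this bundle is not shown in the notebook. One possible approach, sketched below with hypothetical function names and a hypothetical file naming scheme, is to pickle the dictionary to a timestamped file.

```python
import pickle
from datetime import datetime
from pathlib import Path


def save_model_sketch(model_info: dict, model_type: str, model_dir: Path) -> Path:
    """Hypothetical stand-in for save_model: pickle model_info to a timestamped file."""
    model_dir.mkdir(parents=True, exist_ok=True)
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    path = model_dir / f"{model_type}_model_{timestamp}.pkl"
    with path.open("wb") as f:
        pickle.dump(model_info, f)
    return path


def load_model_sketch(path: Path) -> dict:
    """Load a bundle written by save_model_sketch."""
    with path.open("rb") as f:
        return pickle.load(f)
```

Pickle files can execute arbitrary code when loaded, so bundles saved this way should only be loaded from trusted locations.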

## Visualize Regression Results

Let's visualize the performance of our regression models.

In [None]:
# Plot model performance comparison
plt.figure(figsize=(12, 6))

# Plot R² score
plt.subplot(1, 2, 1)
sns.barplot(x=regression_metrics.index, y=regression_metrics['r2_score'])
plt.title('R² Score by Model')
plt.xlabel('Model')
plt.ylabel('R² Score')
plt.xticks(rotation=45)
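# R² can be negative for poorly fitting models, so extend the lower limit if needed
plt.ylim(min(0, regression_metrics['r2_score'].min()), 1)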

# Plot RMSE
plt.subplot(1, 2, 2)
sns.barplot(x=regression_metrics.index, y=regression_metrics['rmse'])
plt.title('RMSE by Model')
plt.xlabel('Model')
plt.ylabel('RMSE (days)')
plt.xticks(rotation=45)

plt.tight_layout()
plt.show()
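
To make the regression metrics concrete, the sketch below computes R², MAE, and RMSE directly from their definitions on toy durations. The function name and the values are illustrative only. Note that R² is not bounded below by zero: a model that predicts worse than the mean of the target produces a negative score.

```python
import numpy as np


def regression_metrics_simple(y_true, y_pred):
    """R², MAE and RMSE computed directly from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    ss_res = np.sum(residuals ** 2)                   # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
    return {
        "r2_score": 1.0 - ss_res / ss_tot,
        "mae": np.mean(np.abs(residuals)),
        "rmse": np.sqrt(np.mean(residuals ** 2)),
    }


# Toy durations in days (illustrative values only)
y_true = [100, 200, 300, 400]
close = regression_metrics_simple(y_true, [110, 190, 310, 390])     # small errors
reversed_ = regression_metrics_simple(y_true, [400, 300, 200, 100])  # worse than the mean
print(close)
print(reversed_)
```

MAE and RMSE are both expressed in days, but RMSE penalizes large errors more heavily, so a gap between the two suggests a few trials with very poor duration estimates.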

## Feature Importance for Regression

Let's examine which features are most important for predicting trial duration.

In [None]:
# Get feature importance from the best model (if available)
if hasattr(best_regression_model, 'feature_importances_'):
    # Get feature importances
    importances = best_regression_model.feature_importances_
    
    # Get feature names after preprocessing
    if hasattr(regression_results['preprocessor'], 'get_feature_names_out'):
        feature_names = regression_results['preprocessor'].get_feature_names_out()
    else:
        feature_names = feature_cols
    
    # Create a DataFrame for visualization
    feature_importance = pd.DataFrame({
        'feature': feature_names,
        'importance': importances
    })
    
    # Sort by importance
    feature_importance = feature_importance.sort_values('importance', ascending=False)
    
    # Plot top 20 features
    plt.figure(figsize=(12, 8))
    sns.barplot(x='importance', y='feature', data=feature_importance.head(20))
    plt.title(f'Top 20 Important Features for {best_regression_model_name}')
    plt.xlabel('Importance')
    plt.ylabel('Feature')
    plt.tight_layout()
    plt.show()
else:
    print(f"Feature importance not available for {best_regression_model_name}")

## Summary

In this notebook, we've trained and evaluated models for predicting oncology clinical trial outcomes:

1. **Classification Models for Trial Completion Status**
   - Trained multiple models to predict whether a trial will complete or terminate early
   - Evaluated models using accuracy, precision, recall, and F1 score
   - Identified the most important features for predicting completion status

2. **Regression Models for Trial Duration**
   - Trained multiple models to predict the duration of completed trials
   - Evaluated models using R², MAE, and RMSE
   - Identified the most important features for predicting trial duration

The best models have been saved and can be used for further analysis and prediction in the next notebook.