# Supervised Learning Model Training

This notebook trains and compares four supervised learning models for heart disease classification, using the feature-selected dataset.

## Models Included:
- Logistic Regression
- Decision Tree
- Random Forest
- Support Vector Machine (SVM)

## Objectives:
1. Load and prepare the feature-selected dataset
2. Train multiple classification models
3. Evaluate model performance with cross-validation
4. Compare models and identify the best performer
5. Save trained models for future use

In [1]:
# Import required libraries
import sys
import os
sys.path.append('../src')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import classification_report, confusion_matrix
import warnings
warnings.filterwarnings('ignore')

# Import our custom trainer
from model_trainer import SupervisedTrainer

# Set style for plots
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

print("Libraries imported successfully!")

Libraries imported successfully!


## 1. Initialize the Supervised Trainer

In [2]:
# Initialize the supervised trainer with a fixed random state for reproducibility
trainer = SupervisedTrainer(random_state=42)

print("SupervisedTrainer initialized with random_state=42")
print(f"Trainer object: {trainer}")

SupervisedTrainer initialized with random_state=42
Trainer object: <model_trainer.SupervisedTrainer object at 0x0000024213DFEA70>


## 2. Load and Prepare Data

In [None]:
# Load the feature-selected dataset
data_path = '../data/processed/heart_disease_selected.csv'

# Prepare data with 80/20 train-test split
X_train, X_test, y_train, y_test = trainer.prepare_data(data_path, test_size=0.2)

print(f"\nData preparation completed:")
print(f"Training set shape: {X_train.shape}")
print(f"Test set shape: {X_test.shape}")
print(f"Feature names: {trainer.training_log['data_info']['feature_names']}")
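
The internals of `prepare_data` are not shown in this notebook. As a rough sketch of what such a step typically involves, the cell below performs a stratified 80/20 split and fits a scaler on the training portion only. It assumes scikit-learn and uses synthetic data from `make_classification` as a stand-in for the real CSV; every variable name here is hypothetical and not part of `SupervisedTrainer`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature-selected dataset (hypothetical data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5, random_state=42
)

# Stratified 80/20 split keeps the class ratio similar in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Fit the scaler on the training data only, then apply it to both subsets
scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)
X_te_scaled = scaler.transform(X_te)

print("Train shape:", X_tr_scaled.shape, "Test shape:", X_te_scaled.shape)
print("Positive-class share (train/test):",
      round(float(np.mean(y_tr)), 3), round(float(np.mean(y_te)), 3))
```

Fitting the scaler on the training split alone avoids leaking test-set statistics into training, which is presumably why the trainer exposes a separate `X_test_scaled` attribute later in the notebook.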

## 3. Visualize Data Distribution

In [None]:
# Visualize class distribution
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Training set class distribution
train_counts = pd.Series(y_train).value_counts().sort_index()
axes[0].bar(train_counts.index, train_counts.values, color=['lightcoral', 'lightblue'])
axes[0].set_title('Training Set Class Distribution')
axes[0].set_xlabel('Class (0: No Disease, 1: Disease)')
axes[0].set_ylabel('Count')
axes[0].set_xticks([0, 1])

# Test set class distribution
test_counts = pd.Series(y_test).value_counts().sort_index()
axes[1].bar(test_counts.index, test_counts.values, color=['lightcoral', 'lightblue'])
axes[1].set_title('Test Set Class Distribution')
axes[1].set_xlabel('Class (0: No Disease, 1: Disease)')
axes[1].set_ylabel('Count')
axes[1].set_xticks([0, 1])

plt.tight_layout()
plt.show()

print(f"Training set class balance: {dict(train_counts)}")
print(f"Test set class balance: {dict(test_counts)}")

## 4. Train Individual Models

Train each model individually to see its specific parameters and test-set performance.

### 4.1 Logistic Regression

In [None]:
# Train Logistic Regression with default parameters
lr_model = trainer.train_logistic_regression(C=1.0, penalty='l2')

# Evaluate on test set
lr_metrics = trainer.evaluate_model('logistic_regression', lr_model)
print(f"\nLogistic Regression Performance:")
for metric, value in lr_metrics.items():
    print(f"{metric.capitalize()}: {value:.4f}")

### 4.2 Decision Tree

In [None]:
# Train Decision Tree with pruning parameters
dt_model = trainer.train_decision_tree(max_depth=10, min_samples_split=5)

# Evaluate on test set
dt_metrics = trainer.evaluate_model('decision_tree', dt_model)
print(f"\nDecision Tree Performance:")
for metric, value in dt_metrics.items():
    print(f"{metric.capitalize()}: {value:.4f}")

### 4.3 Random Forest

In [None]:
# Train Random Forest with ensemble parameters
rf_model = trainer.train_random_forest(n_estimators=100, max_features='sqrt')

# Evaluate on test set
rf_metrics = trainer.evaluate_model('random_forest', rf_model)
print(f"\nRandom Forest Performance:")
for metric, value in rf_metrics.items():
    print(f"{metric.capitalize()}: {value:.4f}")

# Display feature importance
feature_names = trainer.training_log['data_info']['feature_names']
feature_importance = pd.DataFrame({
    'feature': feature_names,
    'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)

print(f"\nTop 5 Most Important Features (Random Forest):")
print(feature_importance.head().to_string(index=False))

### 4.4 Support Vector Machine

In [None]:
# Train SVM with RBF kernel
svm_model = trainer.train_svm(kernel='rbf', C=1.0)

# Evaluate on test set
svm_metrics = trainer.evaluate_model('svm', svm_model)
print(f"\nSVM Performance:")
for metric, value in svm_metrics.items():
    print(f"{metric.capitalize()}: {value:.4f}")

## 5. Compare Model Performance

In [None]:
# Get model performance summary
summary_df = trainer.get_model_summary()
print("Model Performance Summary:")
print("=" * 80)
print(summary_df.to_string(index=False))
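
How `get_model_summary` builds its table is not shown here. A comparable table could be assembled directly with scikit-learn along the following lines. This is only a sketch: the data is synthetic, the hyperparameters mirror the ones used in Section 4, and names such as `demo_models` and `demo_summary` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the heart disease data (hypothetical)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Same four model families and settings as in Section 4
demo_models = {
    "logistic_regression": LogisticRegression(C=1.0, max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=10, min_samples_split=5, random_state=42),
    "random_forest": RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=42),
    "svm": SVC(kernel="rbf", C=1.0),
}

rows = []
for name, model in demo_models.items():
    # Scale inside a pipeline so the scaler is fit on training data only
    y_hat = make_pipeline(StandardScaler(), model).fit(X_tr, y_tr).predict(X_te)
    rows.append({
        "model": name,
        "accuracy": accuracy_score(y_te, y_hat),
        "precision": precision_score(y_te, y_hat),
        "recall": recall_score(y_te, y_hat),
        "f1_score": f1_score(y_te, y_hat),
    })

# One row per model, best F1-score first
demo_summary = pd.DataFrame(rows).sort_values("f1_score", ascending=False)
print(demo_summary.to_string(index=False))
```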

## 6. Cross-Validation Analysis

In [None]:
# Perform 5-fold cross-validation
cv_results = trainer.cross_validate_models(cv_folds=5)

# Create cross-validation results DataFrame
cv_df = pd.DataFrame({
    'Model': [name.replace('_', ' ').title() for name in cv_results.keys()],
    'Mean CV Accuracy': [results['mean_accuracy'] for results in cv_results.values()],
    'Std CV Accuracy': [results['std_accuracy'] for results in cv_results.values()]
})

print("\nCross-Validation Results:")
print("=" * 50)
print(cv_df.to_string(index=False))
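
For readers without access to `model_trainer`, the following sketch shows one way 5-fold cross-validation could be run with scikit-learn. It assumes synthetic data and a single logistic regression pipeline. The result keys `mean_accuracy` and `std_accuracy` are chosen to match the ones read above, but whether `cross_validate_models` computes them this way is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data (hypothetical)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5, random_state=42
)

# Scaling lives inside the pipeline, so each fold fits its own scaler
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Stratified folds preserve the class ratio in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X_demo, y_demo, cv=cv, scoring="accuracy")

demo_cv = {"mean_accuracy": scores.mean(), "std_accuracy": scores.std()}
print(f"Fold accuracies: {scores.round(3)}")
print(f"Mean: {demo_cv['mean_accuracy']:.4f}, Std: {demo_cv['std_accuracy']:.4f}")
```

A small standard deviation across folds suggests the model's accuracy does not depend heavily on which samples end up in the validation fold.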

## 7. Visualize Model Performance Comparison

In [None]:
# Create performance comparison plots
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# Extract metrics for plotting
models = list(trainer.models.keys())
model_names = [name.replace('_', ' ').title() for name in models]

metrics_data = {}
for model_name in models:
    metrics = trainer.evaluate_model(model_name, trainer.models[model_name])
    for metric, value in metrics.items():
        if metric not in metrics_data:
            metrics_data[metric] = []
        metrics_data[metric].append(value)

# Plot 1: Accuracy Comparison
axes[0, 0].bar(model_names, metrics_data['accuracy'], color='skyblue')
axes[0, 0].set_title('Model Accuracy Comparison')
axes[0, 0].set_ylabel('Accuracy')
axes[0, 0].set_ylim(0, 1)
axes[0, 0].tick_params(axis='x', rotation=45)

# Plot 2: Precision vs Recall
axes[0, 1].scatter(metrics_data['precision'], metrics_data['recall'], 
                   s=100, c=['red', 'green', 'blue', 'orange'], alpha=0.7)
for i, name in enumerate(model_names):
    axes[0, 1].annotate(name, (metrics_data['precision'][i], metrics_data['recall'][i]),
                        xytext=(5, 5), textcoords='offset points')
axes[0, 1].set_xlabel('Precision')
axes[0, 1].set_ylabel('Recall')
axes[0, 1].set_title('Precision vs Recall')
axes[0, 1].grid(True, alpha=0.3)

# Plot 3: F1-Score Comparison
axes[1, 0].bar(model_names, metrics_data['f1_score'], color='lightgreen')
axes[1, 0].set_title('F1-Score Comparison')
axes[1, 0].set_ylabel('F1-Score')
axes[1, 0].set_ylim(0, 1)
axes[1, 0].tick_params(axis='x', rotation=45)

# Plot 4: ROC-AUC Comparison (only if every model reported it)
if len(metrics_data.get('roc_auc', [])) == len(model_names):
    axes[1, 1].bar(model_names, metrics_data['roc_auc'], color='coral')
    axes[1, 1].set_title('ROC-AUC Comparison')
    axes[1, 1].set_ylabel('ROC-AUC')
    axes[1, 1].set_ylim(0, 1)
    axes[1, 1].tick_params(axis='x', rotation=45)
else:
    axes[1, 1].text(0.5, 0.5, 'ROC-AUC not available\nfor all models', 
                    ha='center', va='center', transform=axes[1, 1].transAxes)
    axes[1, 1].set_title('ROC-AUC Comparison')

plt.tight_layout()
plt.show()

## 8. Cross-Validation Visualization

In [None]:
# Plot cross-validation results
plt.figure(figsize=(12, 6))

# Extract CV data for plotting
cv_means = [cv_results[model]['mean_accuracy'] for model in models]
cv_stds = [cv_results[model]['std_accuracy'] for model in models]

# Create bar plot with error bars
bars = plt.bar(model_names, cv_means, yerr=cv_stds, capsize=5, 
               color=['lightblue', 'lightgreen', 'lightcoral', 'lightyellow'],
               edgecolor='black', linewidth=1)

plt.title('5-Fold Cross-Validation Results', fontsize=16, fontweight='bold')
plt.xlabel('Models', fontsize=12)
plt.ylabel('Accuracy', fontsize=12)
plt.ylim(0, 1)
plt.xticks(rotation=45)
plt.grid(axis='y', alpha=0.3)

# Add value labels on bars
for bar, mean, std in zip(bars, cv_means, cv_stds):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + std + 0.01,
             f'{mean:.3f}±{std:.3f}', ha='center', va='bottom', fontweight='bold')

plt.tight_layout()
plt.show()

## 9. Detailed Classification Reports

In [None]:
# Generate detailed classification reports for each model
for model_name, model in trainer.models.items():
    print(f"\n{'='*60}")
    print(f"CLASSIFICATION REPORT: {model_name.upper().replace('_', ' ')}")
    print(f"{'='*60}")
    
    y_pred = model.predict(trainer.X_test_scaled)
    report = classification_report(trainer.y_test, y_pred, 
                                 target_names=['No Disease', 'Disease'])
    print(report)

## 10. Confusion Matrices

In [None]:
# Create confusion matrices for all models
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
axes = axes.ravel()

for i, (model_name, model) in enumerate(trainer.models.items()):
    y_pred = model.predict(trainer.X_test_scaled)
    cm = confusion_matrix(trainer.y_test, y_pred)
    
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
                xticklabels=['No Disease', 'Disease'],
                yticklabels=['No Disease', 'Disease'],
                ax=axes[i])
    
    axes[i].set_title(f'{model_name.replace("_", " ").title()} Confusion Matrix')
    axes[i].set_xlabel('Predicted')
    axes[i].set_ylabel('Actual')

plt.tight_layout()
plt.show()
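
The metrics reported earlier can be read straight off a confusion matrix. The small worked example below uses made-up labels (not results from this notebook) to show how precision, recall, and F1-score follow from the four cells, and checks the hand-computed values against scikit-learn.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Made-up labels for illustration only (0: No Disease, 1: Disease)
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred_demo = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

# Rows are actual classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

precision = tp / (tp + fp)   # of predicted "Disease", how many were correct: 4 / 5
recall = tp / (tp + fn)      # of actual "Disease", how many were found: 4 / 6
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean: 8 / 11

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print(f"Precision={precision:.4f}, Recall={recall:.4f}, F1={f1:.4f}")

# The hand-computed values agree with scikit-learn's metric functions
assert abs(precision - precision_score(y_true_demo, y_pred_demo)) < 1e-12
assert abs(recall - recall_score(y_true_demo, y_pred_demo)) < 1e-12
assert abs(f1 - f1_score(y_true_demo, y_pred_demo)) < 1e-12
```

In a screening context, the false negatives (bottom-left cell of each heatmap) are usually the costliest errors, so recall deserves attention alongside accuracy.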

## 11. Save Models and Training Log

In [None]:
# Save all trained models
saved_files = trainer.save_models('../models/supervised')

print(f"Saved {len(saved_files)} model files:")
for file_path in saved_files:
    print(f"  - {file_path}")

# Save training log
trainer.save_training_log('../results/training_log.json')
print(f"\nTraining log saved to: ../results/training_log.json")
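
The on-disk format used by `save_models` is not shown in this notebook. If the models are ordinary scikit-learn estimators, a save and reload round trip with `joblib` might look like the sketch below. The use of `joblib`, the file name, and the temporary directory are all assumptions for illustration.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data and a small model to persist (hypothetical)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=5, random_state=42
)
model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name; the real naming scheme is unknown
    model_path = Path(tmp_dir) / "logistic_regression_demo.joblib"
    joblib.dump(model, model_path)
    restored = joblib.load(model_path)

# A reloaded model should reproduce the original predictions exactly
same_predictions = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print("Predictions identical after reload:", same_predictions)
```

Any preprocessing fitted during training, such as a scaler, would need to be saved and reloaded alongside the model so that new inputs are transformed the same way.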

## 12. Model Recommendations

In [None]:
# Identify the best performing model based on different metrics
print("MODEL RECOMMENDATIONS")
print("=" * 50)

# Find best model for each metric
best_models = {}
for metric in ['accuracy', 'precision', 'recall', 'f1_score']:
    if metric in metrics_data:
        best_idx = np.argmax(metrics_data[metric])
        best_models[metric] = {
            'model': model_names[best_idx],
            'score': metrics_data[metric][best_idx]
        }

for metric, info in best_models.items():
    print(f"Best {metric.capitalize()}: {info['model']} ({info['score']:.4f})")

# Overall recommendation based on F1-score (balanced metric)
best_f1_idx = np.argmax(metrics_data['f1_score'])
recommended_model = model_names[best_f1_idx]

print(f"\nRECOMMENDED MODEL: {recommended_model}")
print(f"Reason: Best F1-score ({metrics_data['f1_score'][best_f1_idx]:.4f}) - balanced precision and recall")

# Cross-validation recommendation: highest mean accuracy across folds
best_cv_model = max(cv_results.keys(), key=lambda x: cv_results[x]['mean_accuracy'])
print(f"\nBEST MODEL BY CV ACCURACY: {best_cv_model.replace('_', ' ').title()}")
print(f"CV Accuracy: {cv_results[best_cv_model]['mean_accuracy']:.4f} ± {cv_results[best_cv_model]['std_accuracy']:.4f}")

## Summary

This notebook successfully demonstrated:

1. **Data Preparation**: Loaded feature-selected dataset and performed stratified train-test split
2. **Model Training**: Trained four different classification models with appropriate parameters
3. **Model Evaluation**: Evaluated models using multiple metrics (accuracy, precision, recall, F1-score, and ROC-AUC where available)
4. **Cross-Validation**: Performed 5-fold cross-validation for robust performance assessment
5. **Visualization**: Created comprehensive visualizations for model comparison
6. **Model Persistence**: Saved all trained models and training logs for future use

### Key Findings:
- The four models are compared on the held-out test set, with their metrics reported side by side in Section 5
- 5-fold cross-validation reports the mean and standard deviation of accuracy per model, indicating how stable each model is across folds
- Random Forest feature importances rank the most predictive features
- The recommended model is the one with the highest test-set F1-score, with cross-validated accuracy as a cross-check

### Next Steps:
1. Hyperparameter optimization to improve model performance
2. Ensemble methods to combine model predictions
3. Model deployment for real-time predictions
4. Performance monitoring and model updating
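
As a starting point for the first next step, hyperparameter optimization could be approached with a grid search. The sketch below tunes an RBF SVM on synthetic data. The parameter grid, the choice of F1 as the scoring metric, and all variable names are illustrative assumptions rather than settings taken from this project.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the training data (hypothetical)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5, random_state=42
)

# Keep scaling inside the pipeline so it is refit within every CV fold
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("svm", SVC(kernel="rbf")),
])

# Illustrative grid; step names prefix the parameters they belong to
param_grid = {
    "svm__C": [0.1, 1.0, 10.0],
    "svm__gamma": ["scale", 0.01, 0.1],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="f1")
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print(f"Best mean CV F1-score: {search.best_score_:.4f}")
```

Because the search selects parameters using cross-validation on the training data, the held-out test set should still be reserved for the final evaluation of the chosen configuration.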