# 🚀 CardioFusion: Advanced Machine Learning Models

## 📋 **Project Overview**
This notebook trains advanced machine learning models for cardiovascular disease prediction. It builds on the baseline models and compares the new models on the held-out test set.

### 🎯 **Advanced Models Implemented**
1. **XGBoost** - Gradient boosting with randomized hyperparameter search
2. **LightGBM** - Fast gradient boosting framework (imported below, but no LightGBM training cell appears in this notebook)
3. **Neural Network** - Deep learning with TensorFlow/Keras
4. **Hybrid Ensemble** - Weighted soft-voting combination of the baseline models and XGBoost

### 📊 **Dataset Information**
- **Source**: Preprocessed CVD_Cleaned.csv (567,606 balanced records)
- **Features**: 27 engineered and encoded features
- **Target**: Heart Disease (50% No, 50% Yes after SMOTE)
- **Split**: 80% Training (454,084), 20% Testing (113,522)
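
The split sizes above can be sanity-checked with a few lines of arithmetic. This is a minimal sketch that assumes the test share was rounded up and the training set took the remainder, which is how scikit-learn's `train_test_split` sizes a fractional `test_size`. The notebook does not say which tool produced the split, so treat that as an assumption.

```python
import math

n_total = 567_606          # balanced records reported above
test_fraction = 0.20

# Assumption: the test set size is rounded up, the training set takes the rest.
n_test = math.ceil(n_total * test_fraction)
n_train = n_total - n_test

# With a 50/50 class balance, each class holds half of the records.
n_per_class = n_total // 2

print(f"train={n_train:,} test={n_test:,} per_class={n_per_class:,}")
```

Under that assumption the computed sizes agree with the 454,084 / 113,522 figures listed above.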

---

## 📚 **Import Libraries**

In [None]:
# Core libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Traditional ML
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, cross_val_score
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report, roc_auc_score,
    roc_curve, precision_recall_curve
)

# Advanced ML Models
import xgboost as xgb
import lightgbm as lgb
from sklearn.ensemble import VotingClassifier, StackingClassifier

# Deep Learning
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, BatchNormalization
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
from tensorflow.keras.optimizers import Adam

# Model Persistence
import joblib
from datetime import datetime
import os

# Styling
plt.style.use('default')
sns.set_palette("husl")

print("✅ All libraries imported successfully!")
print(f"📅 Advanced Models Training Started: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print(f"🔧 TensorFlow Version: {tf.__version__}")
print(f"🔧 XGBoost Version: {xgb.__version__}")

## 📂 **Load Preprocessed Data**

In [None]:
print("📂 LOADING PREPROCESSED DATA")
print("="*40)

try:
    # Load training and testing sets
    train_data = pd.read_csv('train_data.csv')
    test_data = pd.read_csv('test_data.csv')
    
    # Separate features and target
    X_train = train_data.drop('Heart_Disease', axis=1)
    y_train = train_data['Heart_Disease'].map({'No': 0, 'Yes': 1})
    X_test = test_data.drop('Heart_Disease', axis=1)
    y_test = test_data['Heart_Disease'].map({'No': 0, 'Yes': 1})
    
    print(f"✅ Training data: {X_train.shape[0]:,} samples, {X_train.shape[1]} features")
    print(f"✅ Testing data: {X_test.shape[0]:,} samples, {X_test.shape[1]} features")
    print(f"\n📊 Class Distribution:")
    print(f"   Training: {(y_train==0).sum():,} No Disease, {(y_train==1).sum():,} Disease")
    print(f"   Testing:  {(y_test==0).sum():,} No Disease, {(y_test==1).sum():,} Disease")
    
except FileNotFoundError:
    print("❌ Data files not found. Please run data_preprocessing.ipynb first.")
    raise

## 🔬 **Model 1: XGBoost with Hyperparameter Tuning**

XGBoost (Extreme Gradient Boosting) is a gradient-boosted decision tree ensemble that often achieves state-of-the-art results on structured (tabular) data.

In [None]:
print("🔬 TRAINING XGBOOST MODEL")
print("="*35)

# Define parameter grid for hyperparameter tuning
xgb_param_grid = {
    'max_depth': [6, 8, 10],
    'learning_rate': [0.01, 0.05, 0.1],
    'n_estimators': [100, 200, 300],
    'subsample': [0.8, 0.9, 1.0],
    'colsample_bytree': [0.8, 0.9, 1.0]
}

# Initialize XGBoost classifier
xgb_model = xgb.XGBClassifier(
    random_state=42,
    objective='binary:logistic',
    tree_method='hist',  # Faster training
    eval_metric='logloss'
)

print("🔍 Performing randomized hyperparameter search...")
print("   This may take several minutes...")

# Randomized search for faster hyperparameter optimization
xgb_random = RandomizedSearchCV(
    xgb_model,
    param_distributions=xgb_param_grid,
    n_iter=20,
    cv=3,
    scoring='roc_auc',
    n_jobs=-1,
    random_state=42,
    verbose=1
)

# Train the model
start_time = datetime.now()
xgb_random.fit(X_train, y_train)
training_time = (datetime.now() - start_time).total_seconds()

# Get best model
best_xgb = xgb_random.best_estimator_

print(f"\n✅ Training completed in {training_time:.2f} seconds")
print(f"🏆 Best parameters: {xgb_random.best_params_}")
print(f"📊 Best CV ROC-AUC: {xgb_random.best_score_:.4f}")

# Make predictions
y_pred_xgb = best_xgb.predict(X_test)
y_pred_proba_xgb = best_xgb.predict_proba(X_test)[:, 1]

# Evaluate
xgb_metrics = {
    'accuracy': accuracy_score(y_test, y_pred_xgb),
    'precision': precision_score(y_test, y_pred_xgb),
    'recall': recall_score(y_test, y_pred_xgb),
    'f1_score': f1_score(y_test, y_pred_xgb),
    'roc_auc': roc_auc_score(y_test, y_pred_proba_xgb)
}

print(f"\n📈 XGBoost Performance:")
print(f"   Accuracy:  {xgb_metrics['accuracy']:.4f}")
print(f"   Precision: {xgb_metrics['precision']:.4f}")
print(f"   Recall:    {xgb_metrics['recall']:.4f}")
print(f"   F1-Score:  {xgb_metrics['f1_score']:.4f}")
print(f"   ROC-AUC:   {xgb_metrics['roc_auc']:.4f}")
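
To put the search budget in perspective, the sketch below enumerates the same grid and counts how much of it a 20-iteration randomized search visits. The sampler here is a plain `random.sample` stand-in for illustration, not scikit-learn's internal one, so the specific combinations drawn will differ from those in the real search.

```python
import itertools
import random

# Same grid as the search above.
param_grid = {
    "max_depth": [6, 8, 10],
    "learning_rate": [0.01, 0.05, 0.1],
    "n_estimators": [100, 200, 300],
    "subsample": [0.8, 0.9, 1.0],
    "colsample_bytree": [0.8, 0.9, 1.0],
}
n_iter, cv_folds = 20, 3

keys = list(param_grid)
all_combos = list(itertools.product(*(param_grid[k] for k in keys)))

# Draw n_iter distinct combinations without replacement (illustrative sampler).
rng = random.Random(42)
sampled = [dict(zip(keys, combo)) for combo in rng.sample(all_combos, n_iter)]

total_fits = n_iter * cv_folds
coverage = n_iter / len(all_combos)
print(f"{len(all_combos)} combinations, {n_iter} sampled ({coverage:.1%}), {total_fits} CV fits")
```

The search therefore explores under a tenth of the full grid. With the default `refit=True`, one more fit on the full training set produces `best_estimator_`.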

### 📊 XGBoost Feature Importance

In [None]:
# Feature importance analysis
feature_importance = pd.DataFrame({
    'feature': X_train.columns,
    'importance': best_xgb.feature_importances_
}).sort_values('importance', ascending=False)

# Visualize top 15 features
plt.figure(figsize=(12, 8))
top_features = feature_importance.head(15)
plt.barh(range(len(top_features)), top_features['importance'], color='#1e3a8a', alpha=0.8)
plt.yticks(range(len(top_features)), top_features['feature'])
plt.xlabel('Feature Importance', fontsize=12, fontweight='bold')
plt.title('🔬 XGBoost - Top 15 Feature Importance', fontsize=14, fontweight='bold')
plt.gca().invert_yaxis()
plt.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.show()

print("\n📊 Top 10 Most Important Features:")
print(feature_importance.head(10).to_string(index=False))
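
One way to read a ranked importance table is to ask how many features account for most of the total. The sketch below shows that calculation on made-up numbers. The feature names, the values, and the helper `features_covering` are hypothetical and are not results from this notebook.

```python
# Hypothetical importances for illustration only (not results from this notebook).
importances = {
    "feature_a": 0.40,
    "feature_b": 0.25,
    "feature_c": 0.20,
    "feature_d": 0.10,
    "feature_e": 0.05,
}

def features_covering(importances, share=0.80):
    """Return the top-ranked features whose cumulative importance reaches `share`."""
    total = sum(importances.values())
    ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
    selected, running = [], 0.0
    for name, value in ranked:
        selected.append(name)
        running += value / total
        if running >= share:
            break
    return selected

print(features_covering(importances))
```

The same loop could be applied to the `feature_importance` DataFrame by iterating over its `feature` and `importance` columns.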

## 🧠 **Model 2: Neural Network (Deep Learning)**

Multi-layer perceptron with batch normalization and dropout for regularization.

In [None]:
print("🧠 BUILDING NEURAL NETWORK")
print("="*35)

# Set random seeds for reproducibility
np.random.seed(42)
tf.random.set_seed(42)

# Build the neural network architecture
def create_neural_network(input_dim):
    """
    Create a feed-forward neural network for binary classification
    
    Architecture:
    - Input Layer: input_dim features
    - Hidden Layer 1: 128 neurons + BatchNorm + Dropout(0.3)
    - Hidden Layer 2: 64 neurons + BatchNorm + Dropout(0.3)
    - Hidden Layer 3: 32 neurons + BatchNorm + Dropout(0.2)
    - Output Layer: 1 neuron (sigmoid activation)
    """
    model = Sequential([
        # Hidden layer 1 (receives the input features)
        Dense(128, activation='relu', input_dim=input_dim, name='dense_1'),
        BatchNormalization(name='batch_norm_1'),
        Dropout(0.3, name='dropout_1'),
        
        # Hidden layer 2
        Dense(64, activation='relu', name='dense_2'),
        BatchNormalization(name='batch_norm_2'),
        Dropout(0.3, name='dropout_2'),
        
        # Hidden layer 3
        Dense(32, activation='relu', name='dense_3'),
        BatchNormalization(name='batch_norm_3'),
        Dropout(0.2, name='dropout_3'),
        
        # Output layer
        Dense(1, activation='sigmoid', name='output')
    ], name='CardioFusion_NN')
    
    # Compile model
    model.compile(
        optimizer=Adam(learning_rate=0.001),
        loss='binary_crossentropy',
        metrics=['accuracy', tf.keras.metrics.AUC(name='auc')]
    )
    
    return model

# Create model
nn_model = create_neural_network(X_train.shape[1])

# Display architecture
print("\n🏗️ Neural Network Architecture:")
nn_model.summary()

# Define callbacks
early_stopping = EarlyStopping(
    monitor='val_loss',
    patience=10,
    restore_best_weights=True,
    verbose=1
)

reduce_lr = ReduceLROnPlateau(
    monitor='val_loss',
    factor=0.5,
    patience=5,
    min_lr=0.00001,
    verbose=1
)

print("\n🚀 Training neural network...")
print("   This may take several minutes...\n")

# Train the model
start_time = datetime.now()
history = nn_model.fit(
    X_train, y_train,
    validation_split=0.2,
    epochs=50,
    batch_size=1024,
    callbacks=[early_stopping, reduce_lr],
    verbose=1
)
training_time = (datetime.now() - start_time).total_seconds()

print(f"\n✅ Training completed in {training_time:.2f} seconds")

# Make predictions
y_pred_proba_nn = nn_model.predict(X_test).flatten()
y_pred_nn = (y_pred_proba_nn >= 0.5).astype(int)

# Evaluate
nn_metrics = {
    'accuracy': accuracy_score(y_test, y_pred_nn),
    'precision': precision_score(y_test, y_pred_nn),
    'recall': recall_score(y_test, y_pred_nn),
    'f1_score': f1_score(y_test, y_pred_nn),
    'roc_auc': roc_auc_score(y_test, y_pred_proba_nn)
}

print(f"\n📈 Neural Network Performance:")
print(f"   Accuracy:  {nn_metrics['accuracy']:.4f}")
print(f"   Precision: {nn_metrics['precision']:.4f}")
print(f"   Recall:    {nn_metrics['recall']:.4f}")
print(f"   F1-Score:  {nn_metrics['f1_score']:.4f}")
print(f"   ROC-AUC:   {nn_metrics['roc_auc']:.4f}")
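
The parameter counts that `nn_model.summary()` reports for the architecture above can be reproduced by hand. This sketch assumes 27 input features, as listed in the dataset section, whereas the cell itself uses `X_train.shape[1]`. The helper names are illustrative.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

def batchnorm_params(n_units):
    """gamma and beta are trainable; moving mean and variance are not."""
    return {"trainable": 2 * n_units, "non_trainable": 2 * n_units}

def count_mlp_params(input_dim, hidden=(128, 64, 32)):
    """Count parameters for Dense + BatchNorm blocks followed by one output unit."""
    trainable = non_trainable = 0
    n_in = input_dim
    for n_out in hidden:
        trainable += dense_params(n_in, n_out)
        bn = batchnorm_params(n_out)
        trainable += bn["trainable"]
        non_trainable += bn["non_trainable"]
        n_in = n_out  # dropout layers add no parameters
    trainable += dense_params(n_in, 1)  # sigmoid output unit
    return trainable, non_trainable

# Assumption: 27 input features, per the dataset description.
trainable, non_trainable = count_mlp_params(input_dim=27)
print(trainable, non_trainable, trainable + non_trainable)
```

At roughly fifteen thousand parameters, the network is small relative to the 454,084 training rows.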

### 📊 Neural Network Training Curves

In [None]:
# Plot training history
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Loss curves
axes[0].plot(history.history['loss'], label='Training Loss', linewidth=2)
axes[0].plot(history.history['val_loss'], label='Validation Loss', linewidth=2)
axes[0].set_title('🧠 Neural Network - Loss Curves', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Loss')
axes[0].legend()
axes[0].grid(alpha=0.3)

# Accuracy curves
axes[1].plot(history.history['accuracy'], label='Training Accuracy', linewidth=2)
axes[1].plot(history.history['val_accuracy'], label='Validation Accuracy', linewidth=2)
axes[1].set_title('🧠 Neural Network - Accuracy Curves', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('Accuracy')
axes[1].legend()
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()
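
When reading these curves, it helps to know when the two callbacks fire. The sketch below is a simplified simulation of the settings used above (learning rate halved after 5 stagnant epochs, training stopped after 10). It ignores details of the real Keras callbacks such as `min_delta` and cooldown, and the validation losses are invented for illustration.

```python
def simulate_callbacks(val_losses, lr=1e-3, factor=0.5, lr_patience=5,
                       min_lr=1e-5, stop_patience=10):
    """Simplified model of ReduceLROnPlateau plus EarlyStopping on val_loss."""
    best = float("inf")
    best_epoch = -1
    lr_wait = stop_wait = 0
    history = []
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            # Improvement: remember it and reset both patience counters.
            best, best_epoch = loss, epoch
            lr_wait = stop_wait = 0
        else:
            lr_wait += 1
            stop_wait += 1
            if lr_wait >= lr_patience:
                lr = max(lr * factor, min_lr)  # never drop below min_lr
                lr_wait = 0
        history.append((epoch, loss, lr))
        if stop_wait >= stop_patience:
            break  # early stopping; the best epoch's weights would be restored
    return best_epoch, history

# Hypothetical validation losses: three improving epochs, then a plateau.
val_losses = [0.50, 0.45, 0.42] + [0.43] * 12
best_epoch, history = simulate_callbacks(val_losses)
print(f"best epoch={best_epoch}, stopped after {len(history)} epochs, final lr={history[-1][2]}")
```

In this toy run the learning rate is halved twice before training stops, and the weights from epoch 2 (zero-based) would be the ones kept.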

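## 🎯 **Model 3: Hybrid Ensemble (Soft Voting)**

Combines the three baseline models (Logistic Regression, Decision Tree, Random Forest) with the tuned XGBoost model using weighted soft voting. The neural network is evaluated separately and is not part of the ensemble.

In [None]:
print("🎯 BUILDING HYBRID ENSEMBLE")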
print("="*35)

# Load baseline models
baseline_models_dir = 'baseline_models'
lr_model = joblib.load(f'{baseline_models_dir}/logistic_regression_model.pkl')
dt_model = joblib.load(f'{baseline_models_dir}/decision_tree_model.pkl')
rf_model = joblib.load(f'{baseline_models_dir}/random_forest_model.pkl')

print("✅ Loaded baseline models")

# Create ensemble with weighted voting
# Weights based on individual model performance
ensemble_model = VotingClassifier(
    estimators=[
        ('logistic_regression', lr_model),
        ('decision_tree', dt_model),
        ('random_forest', rf_model),
        ('xgboost', best_xgb)
    ],
    voting='soft',
    weights=[0.15, 0.30, 0.25, 0.30]  # Higher weights for better performers
)

print("🔧 Ensemble configuration:")
print("   Models: Logistic Regression, Decision Tree, Random Forest, XGBoost")
print("   Voting: Soft (weighted probability averaging)")
print("   Weights: [0.15, 0.30, 0.25, 0.30]")

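# Train ensemble
# Note: VotingClassifier.fit refits a clone of every estimator on X_train,
# so the fitted state of the loaded baseline models is not reused.
print("\n🚀 Training hybrid ensemble...")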
start_time = datetime.now()
ensemble_model.fit(X_train, y_train)
training_time = (datetime.now() - start_time).total_seconds()

print(f"✅ Training completed in {training_time:.2f} seconds")

# Make predictions
y_pred_ensemble = ensemble_model.predict(X_test)
y_pred_proba_ensemble = ensemble_model.predict_proba(X_test)[:, 1]

# Evaluate
ensemble_metrics = {
    'accuracy': accuracy_score(y_test, y_pred_ensemble),
    'precision': precision_score(y_test, y_pred_ensemble),
    'recall': recall_score(y_test, y_pred_ensemble),
    'f1_score': f1_score(y_test, y_pred_ensemble),
    'roc_auc': roc_auc_score(y_test, y_pred_proba_ensemble)
}

print(f"\n📈 Hybrid Ensemble Performance:")
print(f"   Accuracy:  {ensemble_metrics['accuracy']:.4f}")
print(f"   Precision: {ensemble_metrics['precision']:.4f}")
print(f"   Recall:    {ensemble_metrics['recall']:.4f}")
print(f"   F1-Score:  {ensemble_metrics['f1_score']:.4f}")
print(f"   ROC-AUC:   {ensemble_metrics['roc_auc']:.4f}")
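
Weighted soft voting comes down to a weighted average of each model's predicted probability. The sketch below works through one prediction with the weights from the cell above. The per-model probabilities are hypothetical, and `soft_vote` is an illustrative helper rather than the scikit-learn implementation.

```python
def soft_vote(probabilities, weights):
    """Weighted average of per-model positive-class probabilities."""
    if len(probabilities) != len(weights):
        raise ValueError("one weight per model is required")
    return sum(p * w for p, w in zip(probabilities, weights)) / sum(weights)

weights = [0.15, 0.30, 0.25, 0.30]   # LR, DT, RF, XGBoost (from the cell above)
# Hypothetical P(disease) from each model for a single patient.
probabilities = [0.40, 0.70, 0.55, 0.65]

p_disease = soft_vote(probabilities, weights)
label = int(p_disease > 0.5)  # class with the larger averaged probability
print(f"ensemble P(disease)={p_disease:.4f} -> class {label}")
```

Because the average is divided by the sum of the weights, only their ratios matter. Scaling all four weights by the same constant leaves the prediction unchanged.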

## 📊 **Comprehensive Model Comparison**

In [None]:
# Create comprehensive results DataFrame
print("📊 COMPREHENSIVE MODEL COMPARISON")
print("="*50)

results_comparison = pd.DataFrame({
    'Model': ['XGBoost', 'Neural Network', 'Hybrid Ensemble'],
    'Accuracy': [xgb_metrics['accuracy'], nn_metrics['accuracy'], ensemble_metrics['accuracy']],
    'Precision': [xgb_metrics['precision'], nn_metrics['precision'], ensemble_metrics['precision']],
    'Recall': [xgb_metrics['recall'], nn_metrics['recall'], ensemble_metrics['recall']],
    'F1-Score': [xgb_metrics['f1_score'], nn_metrics['f1_score'], ensemble_metrics['f1_score']],
    'ROC-AUC': [xgb_metrics['roc_auc'], nn_metrics['roc_auc'], ensemble_metrics['roc_auc']]
})

print("\n📋 Advanced Models Performance:")
print(results_comparison.round(4).to_string(index=False))

# Find best model
best_model_idx = results_comparison['F1-Score'].idxmax()
best_model_name = results_comparison.loc[best_model_idx, 'Model']
best_f1 = results_comparison.loc[best_model_idx, 'F1-Score']

print(f"\n🏆 BEST ADVANCED MODEL: {best_model_name} (F1-Score: {best_f1:.4f})")
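
The threshold-based columns compared above (all except ROC-AUC) are derived from the four confusion-matrix counts. As a reminder of how they relate, here is a minimal sketch using invented counts. The function name and the numbers are illustrative and are not results from this notebook.

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 for the positive class."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1_score": f1,
    }

# Hypothetical counts: 80 true positives, 20 false positives,
# 10 false negatives, 90 true negatives.
m = metrics_from_counts(tp=80, fp=20, fn=10, tn=90)
for name, value in m.items():
    print(f"{name:>9}: {value:.4f}")
```

Ranking by F1-score, as the cell above does, rewards models that balance precision and recall instead of maximizing either one alone.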

### 📈 Visual Comparison Dashboard

In [None]:
# Create comprehensive visualization
fig = plt.figure(figsize=(20, 12))

# 1. Performance Metrics Comparison
plt.subplot(2, 3, 1)
metrics_df = results_comparison.set_index('Model')[['Accuracy', 'Precision', 'Recall', 'F1-Score', 'ROC-AUC']]
metrics_df.plot(kind='bar', ax=plt.gca(), width=0.8)
plt.title('🚀 Advanced Models - Performance Comparison', fontsize=14, fontweight='bold')
plt.ylabel('Score')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.xticks(rotation=45)
plt.grid(axis='y', alpha=0.3)

# 2-4. Confusion Matrices
models_pred = [
    ('XGBoost', y_pred_xgb),
    ('Neural Network', y_pred_nn),
    ('Hybrid Ensemble', y_pred_ensemble)
]

for i, (name, y_pred) in enumerate(models_pred):
    plt.subplot(2, 3, i + 2)
    cm = confusion_matrix(y_test, y_pred)
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                xticklabels=['No Disease', 'Disease'],
                yticklabels=['No Disease', 'Disease'])
    plt.title(f'{name}\nConfusion Matrix', fontsize=12, fontweight='bold')
    plt.ylabel('True Label')
    plt.xlabel('Predicted Label')

# 5. ROC Curves
plt.subplot(2, 3, 5)
models_roc = [
    ('XGBoost', y_pred_proba_xgb, '#1e3a8a'),
    ('Neural Network', y_pred_proba_nn, '#059669'),
    ('Hybrid Ensemble', y_pred_proba_ensemble, '#d97706')
]

for name, y_proba, color in models_roc:
    fpr, tpr, _ = roc_curve(y_test, y_proba)
    auc = roc_auc_score(y_test, y_proba)
    plt.plot(fpr, tpr, color=color, linewidth=2, label=f'{name} (AUC = {auc:.3f})')

plt.plot([0, 1], [0, 1], 'k--', alpha=0.3, label='Random Classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('📈 ROC Curves - Advanced Models', fontsize=14, fontweight='bold')
plt.legend(loc='lower right')
plt.grid(alpha=0.3)

# 6. Model Performance Radar Chart
plt.subplot(2, 3, 6, projection='polar')
categories = ['Accuracy', 'Precision', 'Recall', 'F1-Score', 'ROC-AUC']
N = len(categories)
angles = np.linspace(0, 2 * np.pi, N, endpoint=False).tolist()
angles += angles[:1]

for idx, row in results_comparison.iterrows():
    values = row[['Accuracy', 'Precision', 'Recall', 'F1-Score', 'ROC-AUC']].tolist()
    values += values[:1]
    plt.plot(angles, values, 'o-', linewidth=2, label=row['Model'])
    plt.fill(angles, values, alpha=0.15)

plt.xticks(angles[:-1], categories)
plt.ylim(0, 1)
plt.title('🎯 Performance Radar', fontsize=14, fontweight='bold', pad=20)
plt.legend(loc='upper right', bbox_to_anchor=(1.3, 1.1))
plt.grid(True)

plt.tight_layout()
plt.show()

print("\n✅ Visualization dashboard created!")
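
The AUC shown in the ROC legend has a direct interpretation: it is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one, with ties counting as half. The brute-force sketch below illustrates that definition on four toy samples. It is quadratic in the number of samples, so it is meant for intuition and not as a replacement for `roc_auc_score` on the full test set.

```python
def roc_auc_pairwise(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(positives) * len(negatives))

# Toy labels and scores for illustration.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc_value = roc_auc_pairwise(y_true, scores)
print(f"AUC = {auc_value:.2f}")
```

Three of the four positive/negative pairs are ordered correctly here, so the toy AUC is 0.75.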

## 💾 **Save Advanced Models**

In [None]:
print("💾 SAVING ADVANCED MODELS")
print("="*30)

# Create directory for advanced models
advanced_models_dir = 'models/advanced_models'
os.makedirs(advanced_models_dir, exist_ok=True)

# Save XGBoost
joblib.dump(best_xgb, f'{advanced_models_dir}/xgboost_model.pkl')
print("✅ Saved XGBoost model")

# Save Neural Network
nn_model.save(f'{advanced_models_dir}/neural_network_model.h5')
print("✅ Saved Neural Network model")

# Save Hybrid Ensemble
joblib.dump(ensemble_model, f'{advanced_models_dir}/hybrid_ensemble_model.pkl')
print("✅ Saved Hybrid Ensemble model")

# Save performance results
results_comparison.to_csv(f'{advanced_models_dir}/advanced_results.csv', index=False)
print("✅ Saved performance results")

# Save feature importance
feature_importance.to_csv(f'{advanced_models_dir}/xgboost_feature_importance.csv', index=False)
print("✅ Saved feature importance")

print(f"\n📁 All models saved in '{advanced_models_dir}' directory")
print("\n✅ CardioFusion advanced models training completed successfully!")
print("🚀 Ready for SHAP explainability and web application deployment!")

## 📝 **Training Summary**

### 🎯 **Key Achievements**

1. **XGBoost Model**
   - Tuned with a 20-iteration randomized search scored by 3-fold cross-validated ROC-AUC
   - Evaluated on the test set with accuracy, precision, recall, F1-score, and ROC-AUC
   - Feature importance analysis completed

2. **Neural Network**
   - Three hidden layers (128, 64, and 32 units) with a sigmoid output
   - Batch normalization and dropout for regularization
   - Early stopping and learning rate scheduling

3. **Hybrid Ensemble**
   - Combines Logistic Regression, Decision Tree, Random Forest, and XGBoost
   - Weighted soft voting over predicted probabilities
   - Compared against XGBoost and the neural network; the best model is selected by F1-score

### 🚀 **Next Steps**

1. ✅ Implement SHAP explainability
2. ✅ Build Streamlit web application
3. ✅ Create prediction widget for Jupyter
4. ✅ Deploy to production

---

*CardioFusion - Professional ML for Heart Disease Prediction* 🩺