<div style="background-color: #3D3D3A; padding: 20px; border-radius: 10px; margin-bottom: 20px;">
    <h1 style="color: white; text-align: center; margin: 0;">🤖 Diabetes Binary Classification: Baseline Models</h1>
    <p style="color: #CCCCCC; text-align: center; margin-top: 10px;">Model Training and Evaluation Pipeline</p>
</div>

<div style="background-color: #3D3D3A; padding: 15px; border-radius: 8px; margin-bottom: 20px;">
    <h2 style="color: white; margin: 0;">📋 Overview</h2>
    <p style="color: #CCCCCC; margin-top: 10px;">This notebook trains and evaluates baseline models for binary diabetes classification. It covers four steps:</p>
    <ul style="color: #CCCCCC;">
        <li>Load preprocessed data</li>
        <li>Train multiple classifier models</li>
        <li>Evaluate performance metrics</li>
        <li>Compare and visualize results</li>
    </ul>
</div>

<div style="background-color: #3D3D3A; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <h2 style="color: white; margin: 0;">📚 Import Required Libraries</h2>
</div>

In [1]:
import pandas as pd
import numpy as np
from pathlib import Path
import warnings
import sys

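# Silence library warnings to keep cell output readable
warnings.filterwarnings('ignore')

# Add the parent directory (assumed to be the project root) to sys.path
# so the `src` package can be imported
sys.path.append(str(Path.cwd().parent))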

# Import custom modules
from src.evaluation.performance_metrics import PerformanceMetrics
from src.evaluation.performance_visualization import PerformanceVisualizer
from src.data.data_versioning import DataVersioner
from src.training.Training import DiabetesModelTrainer

# Set NumPy's global random seed for reproducibility
np.random.seed(42)

<div style="background-color: #3D3D3A; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <h2 style="color: white; margin: 0;">📥 Load and Prepare Data</h2>
</div>

In [2]:
# Load the preprocessed version
data_versioner = DataVersioner(base_dir='../data')
data = data_versioner.get_version("2025_02_24_02_27_54")

# Display basic information
print("Dataset Shape:", data.shape)
print("\nFeature Names:")
print(data.columns.tolist())
print("\nClass Distribution:")
print(data['diabetes'].value_counts(normalize=True))

[2025-02-24 20:51:57] |     INFO | [data_versioning.py:  32] | data_versioning | Using existing Mlflow experiment: diabetes_classification
[2025-02-24 20:51:57] |     INFO | [data_versioning.py: 129] | data_versioning | Loading dataset from local path: ..\data\versions\diabetes_processed_2025_02_24_02_27_54\diabetes_processed.csv
Dataset Shape: (159490, 21)

Feature Names:
['gender', 'age', 'hypertension', 'heart_disease', 'smoking_history', 'bmi', 'HbA1c_level', 'blood_glucose_level', 'bmi_category', 'age_risk', 'age_bmi_interaction', 'medical_risk_score', 'metabolic_score', 'smoking_risk', 'lifestyle_score', 'age_hypertension', 'age_heart_disease', 'cardio_metabolic_risk', 'combined_risk_score', 'diabetes', 'split']

Class Distribution:
diabetes
0    0.549652
1    0.450348
Name: proportion, dtype: float64
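
The overall class balance above can differ from the balance inside each partition. The classification report later in this notebook lists 1,696 positives among 19,230 test rows (about 8.8%), well below the 45% overall share, so a per-split check is worthwhile. The following is a minimal sketch, assuming the `split` column labels each row's partition; the toy frame and its `'train'`/`'test'` labels are hypothetical stand-ins for `data`.

```python
import pandas as pd


def class_share_by_split(df, target="diabetes", split_col="split"):
    """Return the share of each target class within every split."""
    return (
        df.groupby(split_col)[target]
        .value_counts(normalize=True)
        .unstack(fill_value=0.0)
    )


# Toy frame standing in for `data`; the split labels are hypothetical.
toy = pd.DataFrame({
    "diabetes": [0, 1, 0, 1, 0, 0, 0, 1],
    "split": ["train"] * 4 + ["test"] * 4,
})

shares = class_share_by_split(toy)
print(shares)
```

On the real data, the equivalent call would be `class_share_by_split(data)`; a large gap between partitions would suggest that resampling was applied to the training rows only.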


<div style="background-color: #3D3D3A; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <h2 style="color: white; margin: 0;">🤖 Train Models</h2>
</div>

In [3]:
# Initialize trainer
trainer = DiabetesModelTrainer()

# Prepare data
X_train, X_test, y_train, y_test = trainer.prepare_data(
    data=data,
    target_column='diabetes',
    already_split=True
)

print("Training Data Shape:", X_train.shape)
print("Testing Data Shape:", X_test.shape)

[2025-02-24 20:51:57] |     INFO | [Training.py:  44] |        training | Initialized DiabetesModelTrainer with experiment: diabetes_classification
[2025-02-24 20:51:57] |     INFO | [Training.py:  68] |        training | Preparing data for training...
[2025-02-24 20:51:57] |     INFO | [Training.py: 117] |        training | Data prepared successfully. Train set: (140260, 19), Test set: (19230, 19)
Training Data Shape: (140260, 19)
Testing Data Shape: (19230, 19)
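
`prepare_data` is a project-specific method whose internals are not shown here. Since the 21-column frame yields 19 features, one plausible reading of `already_split=True` is that rows are partitioned on the `split` column and both `split` and the target are dropped from the feature matrix. The helper below is a hypothetical sketch of that behavior, not the actual `DiabetesModelTrainer` implementation; the function name and the `'train'`/`'test'` labels are assumptions.

```python
import pandas as pd


def split_by_column(df, target_column, split_column="split",
                    train_label="train", test_label="test"):
    """Partition a pre-split frame into train/test features and targets."""
    train = df[df[split_column] == train_label]
    test = df[df[split_column] == test_label]

    # Neither the target nor the split marker belongs in the feature matrix.
    drop_cols = [target_column, split_column]
    return (
        train.drop(columns=drop_cols),
        test.drop(columns=drop_cols),
        train[target_column],
        test[target_column],
    )


# Toy frame with two features, a target, and a split marker.
toy = pd.DataFrame({
    "age": [50, 61, 34, 45],
    "bmi": [27.1, 31.4, 22.0, 29.9],
    "diabetes": [0, 1, 0, 1],
    "split": ["train", "train", "train", "test"],
})

toy_X_train, toy_X_test, toy_y_train, toy_y_test = split_by_column(toy, "diabetes")
print("Train:", toy_X_train.shape, "Test:", toy_X_test.shape)
```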


In [4]:
# # Train all models using train_multiple_models

# models_config = [
#     {'name': 'logistic_regression', 'params': None},
#     {'name': 'random_forest', 'params': None},
#     {'name': 'xgboost', 'params': None},
#     {'name': 'lightgbm', 'params': None},
#     {'name': 'catboost', 'params': None}
# ]
# trained_models = trainer.train_multiple_models(
#     X_train=X_train,
#     models_config=models_config,
#     y_train=y_train,
#     cv_folds=5
# )

# print("\nTraining completed for models:", list(trained_models.keys()))

<div style="background-color: #3D3D3A; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <h2 style="color: white; margin: 0;">📊 Evaluate Models</h2>
</div>

In [5]:
# # Evaluate all models
# summary_df, evaluation_results = trainer.evaluate_models(
#     models_results=trained_models,
#     X_test=X_test,
#     y_test=y_test
# )

# # Display summary metrics
# print("\nModel Performance Summary:")
# display(summary_df.style.format({
#     'Train Accuracy': '{:.3f}',
#     'Test Accuracy': '{:.3f}',
#     'Train F1 (Macro)': '{:.3f}',
#     'Test F1 (Macro)': '{:.3f}'
# }).background_gradient(cmap='RdYlGn'))

<div style="background-color: #3D3D3A; padding: 20px; border-radius: 10px; margin: 20px 0;">
    <h2 style="color: white; margin: 0;">💡 Conclusions</h2>
    <p style="color: #CCCCCC; margin-top: 10px;">Use the evaluation results to:</p>
    <ul style="color: #CCCCCC;">
        <li>Compare model performance across metrics such as accuracy and macro F1</li>
        <li>Identify the best-performing model for each use case</li>
        <li>Analyze the confusion matrices for misclassification patterns</li>
        <li>Weigh the trade-off between precision and recall</li>
    </ul>
</div>
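
The precision and recall trade-off noted above is commonly explored by varying the decision threshold applied to predicted probabilities instead of relying on the default of 0.5. The sketch below uses made-up labels and scores, not outputs of any model in this notebook, to show how lowering the threshold raises recall at the cost of precision.

```python
def precision_recall_at(y_true, y_score, threshold):
    """Compute positive-class precision and recall at a given threshold."""
    y_pred = [1 if s >= threshold else 0 for s in y_score]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical labels and predicted probabilities, for illustration only.
toy_labels = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]
toy_scores = [0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.70, 0.85, 0.95]

results = {}
for threshold in (0.3, 0.5, 0.7):
    results[threshold] = precision_recall_at(toy_labels, toy_scores, threshold)
    p, r = results[threshold]
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")
```

With a fitted classifier, the scores would come from `predict_proba(X_test)[:, 1]`, and the threshold would be chosen on a validation set rather than on the test set.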

In [None]:
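from src.Optimization.hyperparameter_tuner import CatBoostHyperparameterTuner

# Configure the tuner: 20 trials, 3-fold cross-validation, and optimization
# weights of 0.7 for recall and 0.3 for accuracy
tuner = CatBoostHyperparameterTuner(
    experiment_name="diabetes_catboost_optimization",
    n_trials=20,
    cv_folds=3,
    recall_weight=0.7,
    accuracy_weight=0.3
)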

# Run the optimization
optimization_results = tuner.optimize(X_train, y_train)
best_params = optimization_results['best_params']

[2025-02-24 20:51:57] |     INFO | [hyperparameter_tuner.py:  60] | hyperparameter_tuner | Initialized CatBoostHyperparameterTuner with 20 trials
[2025-02-24 20:51:57] |     INFO | [hyperparameter_tuner.py:  61] | hyperparameter_tuner | Optimization weights: Recall=0.70, Accuracy=0.30
[2025-02-24 20:51:57] |     INFO | [hyperparameter_tuner.py:  75] | hyperparameter_tuner | Starting hyperparameter optimization with 20 trials


[I 2025-02-24 20:51:57,654] A new study created in memory with name: no-name-0daef5f2-8ad4-4698-9c3b-bb5258074841
[I 2025-02-24 20:53:06,023] Trial 0 finished with value: 0.9689127235057025 and parameters: {'iterations': 833, 'learning_rate': 0.059455702552193845, 'depth': 4, 'l2_leaf_reg': 2.0126270550930567e-06, 'random_strength': 0.7060800471071709, 'bagging_temperature': 6.43667170377921, 'border_count': 132, 'boosting_type': 'Ordered'}. Best is trial 0 with value: 0.9689127235057025.
[I 2025-02-24 20:53:32,792] Trial 1 finished with value: 0.9708904844660333 and parameters: {'iterations': 541, 'learning_rate': 0.0692559414864544, 'depth': 6, 'l2_leaf_reg': 5.31846696355123, 'random_strength': 0.00860556095075302, 'bagging_temperature': 2.991220547260208, 'border_count': 46, 'boosting_type': 'Plain'}. Best is trial 1 with value: 0.9708904844660333.
[I 2025-02-24 20:57:06,860] Trial 2 finished with value: 0.9694845184586871 and parameters: {'iterations': 613, 'learning_rate': 0.0674
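
The logged weights (Recall=0.70, Accuracy=0.30) suggest that each trial is scored by a weighted blend of recall and accuracy. The tuner's source is not shown here, so the function below is an assumed form of that objective, written in plain Python for illustration; `weighted_score` is a hypothetical name.

```python
def weighted_score(y_true, y_pred, recall_weight=0.7, accuracy_weight=0.3):
    """Blend positive-class recall and overall accuracy into one scalar."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)

    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = correct / len(y_true)
    return recall_weight * recall + accuracy_weight * accuracy


# Toy predictions: one missed positive and one false alarm.
toy_true = [1, 1, 1, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 0, 0, 0, 0, 0, 1]

score = weighted_score(toy_true, toy_pred)
print(f"weighted score: {score:.4f}")
```

Under this form, a missed positive lowers both recall and accuracy, while a false alarm lowers accuracy only, so the objective favors configurations that catch more diabetic cases.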

In [7]:
final_model = tuner.train_final_model(X_train, y_train, best_params)

[2025-02-24 20:40:44] |     INFO | [hyperparameter_tuner.py: 203] | hyperparameter_tuner | Training final model with best parameters
[2025-02-24 20:40:49] |     INFO | [hyperparameter_tuner.py: 216] | hyperparameter_tuner | Saved optimized model to models/optimized\catboost_optimized.pkl




In [8]:
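from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

# Hard class predictions and positive-class probabilities on the test set
y_pred = final_model.predict(X_test)
y_pred_proba = final_model.predict_proba(X_test)[:, 1]

# Print the classification report, ROC-AUC score, and confusion matrix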
print("\n--- Optimized CatBoost Model Performance ---")
print(classification_report(y_test, y_pred))
print("ROC-AUC Score:", roc_auc_score(y_test, y_pred_proba))
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))


--- Optimized CatBoost Model Performance ---
              precision    recall  f1-score   support

           0       0.97      1.00      0.98     17534
           1       0.95      0.70      0.81      1696

    accuracy                           0.97     19230
   macro avg       0.96      0.85      0.89     19230
weighted avg       0.97      0.97      0.97     19230

ROC-AUC Score: 0.974260183987552

Confusion Matrix:
[[17473    61]
 [  511  1185]]
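
The per-class figures in the report can be reproduced directly from the confusion matrix, which makes the weak spot easier to see: 511 of the 1,696 diabetic cases in the test set are predicted as negative. The short calculation below uses only the four counts printed above and relies on scikit-learn's convention that rows are actual classes and columns are predicted classes.

```python
# Counts copied from the confusion matrix above
# (rows = actual class, columns = predicted class).
tn, fp = 17473, 61
fn, tp = 511, 1185

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)
miss_rate = fn / (tp + fn)  # share of diabetic cases the model misses

print(f"precision={precision:.2f}  recall={recall:.2f}  f1={f1:.2f}")
print(f"accuracy={accuracy:.2f}  miss_rate={miss_rate:.2f}")
```

The high overall accuracy is driven largely by the majority class, so positive-class recall is the more informative number for a screening task.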
