# 🏋️ Model Training: Gradient Boosting Classifier

This section trains a **Gradient Boosting Classifier** on the selected features for multiclass diabetes prediction.

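Gradient Boosting is chosen because it often provides **high accuracy** on tabular data. It can still overfit, so a small `learning_rate` and `subsample < 1` are used as regularization, and it does not correct for **imbalanced datasets** on its own, so per-class metrics should be checked.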

In [1]:
# ===========================
# Run Preprocessing Notebook
# ===========================
%run ./Preprocessing.ipynb

Preprocessed data saved as joblib files!


## 1️⃣ Import Required Libraries

- **pandas** → data manipulation  
- **StandardScaler** → feature scaling  
- **ColumnTransformer** → apply transformers to selected columns  
- **Pipeline** → combine preprocessing and modeling  
- **GradientBoostingClassifier** → predictive model  
- **joblib** → save trained model for later deployment

In [2]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import GradientBoostingClassifier
import joblib

## 2️⃣ Define Features and Target

- **Features (`X`)** → clinically significant indicators selected from preprocessing.  
- **Target (`y`)** → the `Class` column (0=Non-diabetic, 1=Pre-diabetic, 2=Diabetic)


In [3]:
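# `df` is created by Preprocessing.ipynb, which was executed with %run above
selected_features = ['HbA1c', 'BMI', 'AGE', 'Urea', 'Chol', 'VLDL', 'TG', 'Cr', 'LDL']
X = df[selected_features]  # feature matrix (9 numeric columns)
y = df['Class']            # target labels: 0, 1, 2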
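
Before training, it can help to confirm that every selected column exists and to look at the class balance. The sketch below does this on a tiny made-up frame; `demo_df` and its values are hypothetical stand-ins for the real `df` and are not taken from the dataset.

```python
import pandas as pd

selected_features = ['HbA1c', 'BMI', 'AGE', 'Urea', 'Chol', 'VLDL', 'TG', 'Cr', 'LDL']

# Hypothetical toy frame standing in for the preprocessed `df`
demo_df = pd.DataFrame({name: [1.0, 2.0, 3.0, 4.0] for name in selected_features})
demo_df['Class'] = [0, 0, 1, 2]

# Fail early if a required column is missing
missing = [col for col in selected_features + ['Class'] if col not in demo_df.columns]
if missing:
    raise KeyError(f"Missing columns: {missing}")

# Count missing values per feature and compute the share of each class
null_counts = demo_df[selected_features].isna().sum()
class_share = demo_df['Class'].value_counts(normalize=True).sort_index()
print(null_counts.sum())
print(class_share)
```

A strongly skewed `class_share` would suggest reporting per-class metrics rather than plain accuracy.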

## 3️⃣ Column Transformer for Scaling

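Scaling puts all numeric features on a comparable scale.  
- StandardScaler transforms each feature to **zero mean and unit variance**.
- Tree-based models such as Gradient Boosting split on thresholds and are largely insensitive to feature scale, so this step mainly keeps the pipeline reusable if the classifier is later swapped for a scale-sensitive one.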


In [4]:
preprocessor = ColumnTransformer(
    transformers=[('num', StandardScaler(), selected_features)]
)
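
To see what the transformer does, the following sketch applies the same `ColumnTransformer` pattern to two synthetic columns and checks the resulting mean and standard deviation. The data is randomly generated for illustration only; `toy` and `toy_preprocessor` are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Synthetic values on two different scales (not real patient data)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'HbA1c': rng.normal(6.5, 1.5, size=200),
    'BMI': rng.normal(28.0, 5.0, size=200),
})

toy_preprocessor = ColumnTransformer(
    transformers=[('num', StandardScaler(), ['HbA1c', 'BMI'])]
)

# fit_transform learns each column's mean/std and returns a scaled NumPy array
scaled = toy_preprocessor.fit_transform(toy)
col_means = scaled.mean(axis=0)
col_stds = scaled.std(axis=0)
print(col_means.round(6), col_stds.round(6))
```

Both columns should come out centered near 0 with a standard deviation near 1, regardless of their original units.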

## 4️⃣ Build Pipeline with Gradient Boosting

Pipeline combines **preprocessing** and **model training** in a single estimator, so one `fit` call runs scaling and then training.  

Gradient Boosting hyperparameters used:

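- `n_estimators=150` → number of boosting stages  
- `max_depth=7` → maximum depth of each individual tree  
- `learning_rate=0.05` → shrinkage factor applied to each tree's contribution  
- `max_features='log2'` → consider log2(n_features) features at each split (3 of the 9 features here)  
- `subsample=0.9` → fraction of training samples drawn to fit each tree (values below 1.0 give stochastic gradient boosting)  
- `random_state=42` → reproducibility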
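
As a rough illustration of how two of these settings resolve at fit time, the sketch below fits a small classifier on synthetic 9-feature, 3-class data and inspects the fitted attributes. The data comes from `make_classification` and `n_estimators` is reduced to 20 to keep it fast; both are assumptions for the demo, not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in: 9 features and 3 classes, mirroring the shape of the task
X_demo, y_demo = make_classification(
    n_samples=200, n_features=9, n_informative=5, n_classes=3, random_state=42
)

demo_model = GradientBoostingClassifier(
    n_estimators=20,        # reduced from 150 for a quick demo
    max_depth=7,
    learning_rate=0.05,
    max_features='log2',
    subsample=0.9,
    random_state=42,
)
demo_model.fit(X_demo, y_demo)

# Number of features actually considered per split after resolving 'log2'
resolved_max_features = demo_model.max_features_
# For multiclass problems, each boosting stage fits one regression tree per class
trees_shape = demo_model.estimators_.shape
print(resolved_max_features, trees_shape)
```

With three classes, the full `n_estimators=150` setting therefore trains 450 trees in total, which is worth keeping in mind for training time and model size.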

In [6]:
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', GradientBoostingClassifier(
        n_estimators=150,
        max_depth=7,
        learning_rate=0.05,
        max_features='log2',
        subsample=0.9,
        random_state=42
    ))
])

## 5️⃣ Train the Model

In [8]:
pipeline.fit(X, y)
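
The cell above fits on the full dataset, so it gives no estimate of generalization. One possible way to get such an estimate is stratified cross-validation of the whole pipeline, sketched here on synthetic imbalanced data. The data, the reduced `n_estimators` and `max_depth`, and the choice of macro F1 are assumptions for illustration, not results from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced 3-class data standing in for (X, y)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=9, n_informative=5, n_classes=3,
    weights=[0.6, 0.3, 0.1], random_state=42
)

demo_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', GradientBoostingClassifier(
        n_estimators=30, max_depth=3, learning_rate=0.05,
        max_features='log2', subsample=0.9, random_state=42
    )),
])

# Stratified folds keep the class proportions similar in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# The scaler is re-fit inside each training fold, so no test-fold statistics leak in
scores = cross_val_score(demo_pipeline, X_demo, y_demo, cv=cv, scoring='f1_macro')
print(scores.round(3), scores.mean().round(3))
```

Macro F1 weights every class equally, which makes weak performance on a minority class visible where plain accuracy could hide it.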

## 6️⃣ Save Trained Pipeline

The trained pipeline is saved using **joblib** for **later inference or deployment** with FastAPI.


In [9]:
joblib.dump(pipeline, 'gb_diabetes_model.joblib')


['gb_diabetes_model.joblib']
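
A minimal sketch of the round trip (dump, load, predict) is shown below. It trains a small stand-in pipeline on synthetic data and writes it to a temporary file, so the file name `demo_model.joblib` and the data are hypothetical; in practice the saved `gb_diabetes_model.joblib` would be loaded instead.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

feature_names = ['HbA1c', 'BMI', 'AGE', 'Urea', 'Chol', 'VLDL', 'TG', 'Cr', 'LDL']

# Synthetic stand-in data with the same column names as the real features
X_arr, y_demo = make_classification(
    n_samples=150, n_features=9, n_informative=5, n_classes=3, random_state=42
)
X_demo = pd.DataFrame(X_arr, columns=feature_names)

demo_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', GradientBoostingClassifier(n_estimators=20, random_state=42)),
])
demo_pipeline.fit(X_demo, y_demo)

# Save to a temporary location and load it back, as a deployment service would
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'demo_model.joblib')
    joblib.dump(demo_pipeline, model_path)
    loaded = joblib.load(model_path)

# Predict for one "patient": a single-row DataFrame with the same columns
new_patient = X_demo.iloc[[0]]
predicted_class = loaded.predict(new_patient)
class_probabilities = loaded.predict_proba(new_patient)
print(predicted_class, class_probabilities.round(3))
```

The loaded object includes the fitted scaler, so raw feature values can be passed directly. The loading environment should use a scikit-learn version compatible with the one used for saving.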

## ✅ Key Points

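- Gradient Boosting is a strong baseline for **multiclass tasks** on tabular data; its performance on this dataset still needs to be confirmed on held-out data.  
- Pipeline combines **scaling + modeling** in one object, so the scaler is fit only on training data whenever the pipeline is cross-validated, which prevents data leakage.  
- The saved model can be loaded at any time for **real-time prediction** without retraining.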
