# 🧩 Data Preprocessing & Preparation

## 1️⃣ Load the Dataset

The dataset is loaded using `pandas`. Features (`X`) are separated from the target variable (`y`), which represents the diabetes class.

- `Class` contains three categories:  
  0 → Non-diabetic  
  1 → Pre-diabetic  
  2 → Diabetic

In [1]:
import pandas as pd

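# Load the raw dataset from the working directory
df = pd.read_csv("Multiclass Diabetes Dataset.csv")

# Separate the feature columns from the target label
X = df.drop("Class", axis=1)
y = df["Class"]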
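
Before splitting, it can help to confirm how many samples each class has, since both the stratified split and SMOTE depend on that distribution. The sketch below is only an illustration: it replaces the CSV with a small made-up DataFrame, and the `Glucose` and `BMI` columns and all values are assumptions, not the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for the real CSV; only the "Class" column name
# comes from the notebook, everything else is invented for illustration.
toy_df = pd.DataFrame({
    "Glucose": [5.1, 6.3, 7.8, 5.0, 9.2, 6.1],
    "BMI": [22.0, 27.5, 31.0, 21.4, 33.2, 26.8],
    "Class": [0, 1, 2, 0, 2, 0],
})

toy_X = toy_df.drop(columns="Class")  # feature matrix
toy_y = toy_df["Class"]               # target labels

# Number of samples per class, ordered by label
class_counts = toy_y.value_counts().sort_index()
print(class_counts)

# Missing values per column, worth checking before scaling or resampling
missing_per_column = toy_df.isna().sum()
print(missing_per_column)
```

Running the same two checks on the real `df` would show how imbalanced the three classes actually are.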


## 2️⃣ Train-Test Split

Data is split into **training (80%)** and **testing (20%)** sets.

- `stratify=y` keeps the class proportions approximately the same in both sets.  
- `random_state=42` fixes the shuffle seed so the split is reproducible.


In [2]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
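
One way to see the effect of `stratify` is to compare class proportions in the two subsets. The following is a minimal sketch on synthetic data. The 60/10/30 label mix, the `demo_*` names, and the `f1`..`f3` columns are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels (assumed proportions, for illustration only)
rng = np.random.default_rng(42)
demo_y = pd.Series([0] * 60 + [1] * 10 + [2] * 30)
demo_X = pd.DataFrame(rng.normal(size=(100, 3)), columns=["f1", "f2", "f3"])

demo_X_train, demo_X_test, demo_y_train, demo_y_test = train_test_split(
    demo_X, demo_y, test_size=0.2, stratify=demo_y, random_state=42
)

# With stratification, class proportions should be nearly identical
train_props = demo_y_train.value_counts(normalize=True).sort_index()
test_props = demo_y_test.value_counts(normalize=True).sort_index()
print(pd.DataFrame({"train": train_props, "test": test_props}))
```

Without `stratify`, a rare class could end up under-represented, or missing entirely, in a small test set.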


## 3️⃣ Feature Scaling

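**StandardScaler** is used to standardize the features:

- Each feature is rescaled to mean 0 and standard deviation 1, computed on the training set  
- Scaling helps models that are sensitive to feature magnitude, such as distance-based and gradient-based methods  
- The scaler is fitted only on the training data, and the test set is transformed with those training statistics, to avoid data leakage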


In [4]:
from sklearn.preprocessing import StandardScaler

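scaler = StandardScaler()
# Learn mean and standard deviation from the training set only
X_train_scaled = scaler.fit_transform(X_train)
# Reuse the training statistics on the test set (no refitting)
X_test_scaled = scaler.transform(X_test)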
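
To make the "fit on train, transform test" rule concrete, the sketch below compares `StandardScaler` against the manual formula z = (x - mean_train) / std_train. The data is randomly generated and the `demo_*` names are hypothetical. It is an illustration of the idea, not part of the original pipeline.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Randomly generated stand-ins for the training and test features
rng = np.random.default_rng(0)
demo_train = rng.normal(loc=50.0, scale=10.0, size=(200, 2))
demo_test = rng.normal(loc=50.0, scale=10.0, size=(50, 2))

demo_scaler = StandardScaler()
demo_train_scaled = demo_scaler.fit_transform(demo_train)  # fit + transform
demo_test_scaled = demo_scaler.transform(demo_test)        # transform only

# Manual equivalent: test data is scaled with the *training* mean and std
manual_test_scaled = (demo_test - demo_train.mean(axis=0)) / demo_train.std(axis=0)

print(demo_train_scaled.mean(axis=0))  # approximately 0 per feature
print(demo_train_scaled.std(axis=0))   # approximately 1 per feature
print(demo_test_scaled.mean(axis=0))   # close to 0, but generally not exact
```

The test set's mean and standard deviation are usually only close to 0 and 1, which is expected because its statistics were never used for fitting.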

## 4️⃣ Handling Class Imbalance with SMOTE

**SMOTE (Synthetic Minority Over-sampling Technique)** generates synthetic samples for minority classes to balance the dataset.

- Applied only to the training set, so the test set keeps the original class distribution
- Reduces the model's bias toward the majority class

In [5]:
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train_scaled, y_train)
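
The core idea behind SMOTE is interpolation: a synthetic point is placed somewhere on the line segment between a minority sample and one of its nearest minority-class neighbours. The sketch below illustrates only that idea, using `NearestNeighbors` from scikit-learn. It is a simplified illustration, not imbalanced-learn's implementation. The points, `k`, and the `make_synthetic` helper are all hypothetical.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(42)

# Hypothetical minority-class points in a 2-D scaled feature space
minority = np.array([
    [0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5],
])

# Nearest minority neighbours of each point; with distinct points,
# column 0 is the point itself, so it is skipped below.
k = 2
nn = NearestNeighbors(n_neighbors=k + 1).fit(minority)
_, neighbor_idx = nn.kneighbors(minority)


def make_synthetic(n_new):
    """Create n_new points by interpolating between neighbours (illustrative helper)."""
    samples = []
    for _ in range(n_new):
        i = rng.integers(len(minority))       # pick a random minority sample
        j = rng.choice(neighbor_idx[i, 1:])   # pick one of its k neighbours
        lam = rng.random()                    # interpolation factor in [0, 1)
        samples.append(minority[i] + lam * (minority[j] - minority[i]))
    return np.array(samples)


synthetic = make_synthetic(4)
print(synthetic)
```

Because every synthetic point lies between two existing minority samples, the new points stay inside the region the minority class already occupies.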


## 5️⃣ Save Preprocessed Data

Preprocessed data is saved as **Joblib files** for easy loading during model training and evaluation.

- Allows a modular workflow in which preprocessing and modeling run as separate steps

In [6]:
import joblib

joblib.dump(X_train_res, "X_train_res.joblib")
joblib.dump(X_test_scaled, "X_test_scaled.joblib")
joblib.dump(y_train_res, "y_train_res.joblib")
joblib.dump(y_test, "y_test.joblib")

print("Preprocessed data saved as joblib files!")

Preprocessed data saved as joblib files!
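
A later modeling notebook would read these files back with `joblib.load`. The round trip can be sketched as follows, using a temporary directory and a small hypothetical array (`demo_array`) so the example does not touch the real files.

```python
import os
import tempfile

import joblib
import numpy as np

# Hypothetical array standing in for one of the preprocessed matrices
demo_array = np.arange(6, dtype=float).reshape(3, 2)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_array.joblib")
    joblib.dump(demo_array, path)   # serialize to disk
    restored = joblib.load(path)    # read it back into memory

# The restored object should match the original exactly
print(np.array_equal(demo_array, restored))
```

In the actual workflow the same call would presumably be applied to each saved file, for example `joblib.load("X_train_res.joblib")`.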


## ✅ Key Insights

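- A stratified **train-test split** provides a held-out set that reflects the original class distribution, which supports a fair evaluation  
- **Scaling** puts features on a common scale, which typically helps convergence for scale-sensitive models  
- **SMOTE** mitigates class imbalance in the training data, which matters for multi-class prediction  
- **Joblib serialization** lets later steps reuse the preprocessed data without repeating the preprocessing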