# Data Preparation & Preprocessing
## Early Stage Diabetes Risk Prediction

### Objective
In this notebook, we will transform the raw data into a format suitable for machine learning models.
Based on our EDA findings, we will implement the following pipeline:

1.  **Data Splitting:** Divide data into Train, Validation, and Test sets (60/20/20).
2.  **Numerical Scaling:** Apply `StandardScaler` to the `Age` column.
3.  **Categorical Encoding:** Apply `OneHotEncoder` with `drop='first'` to `Gender` and all symptom columns, yielding one binary column per feature.
4.  **Target Encoding:** Convert `Positive`/`Negative` classes to `1`/`0`.

### 1. Imports and Setup
We start by importing necessary libraries and our custom utility functions from the `utils` folder.

In [5]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import joblib  # To save our scalers/encoders for later use
import sys

sys.path.append('..')
from utils.preprocessing import (
    load_data, 
    detect_outliers_iqr, 
    split_data, 
    create_preprocessor, 
    encode_target, 
    save_artifacts
)

print("Libraries and Utils loaded successfully!")

Libraries and Utils loaded successfully!


### 2. Load Data
We read the raw CSV file into a DataFrame.

In [6]:
# Load the raw data
df = pd.read_csv('../data/raw/diabetes_data_upload.csv')
print("Data Loaded successfully!")

Data Loaded successfully!


### 3. Outlier Detection
We check the numerical `Age` column for outliers using the Interquartile Range (IQR) method. 
*Note: In medical datasets, high age is often a valid risk factor, so we typically inspect rather than remove.*

In [8]:
# Detect outliers in 'Age'
lower, upper, count = detect_outliers_iqr(df, 'Age')

print(" Outlier Analysis for 'Age':")
print(f"   Lower Bound: {lower}")
print(f"   Upper Bound: {upper}")
print(f"   Total Outliers: {count}")

# Report the decision; no rows are removed in either case
if count > 0:
    print("   Decision: Outliers kept. High age is a valid predictor for diabetes.")
else:
    print("   Decision: No statistical outliers found.")

 Outlier Analysis for 'Age':
   Lower Bound: 12.0
   Upper Bound: 84.0
   Total Outliers: 4
   Decision: Outliers kept. High age is a valid predictor for diabetes.


### Interpretation of Outlier Analysis

We performed an outlier check on the **Age** feature using the Interquartile Range (IQR) method.

* **Statistical Bounds:** The calculated "normal" range for our dataset is between **12.0** and **84.0** years old.
* **Findings:** We identified **4** instances falling outside this range (likely patients older than 84).

> **Decision:** **Kept (No Removal)**.
> In the context of **Diabetes Risk Prediction**, advanced age is a biologically valid risk factor rather than a data entry error. Removing these patients would result in a loss of critical medical information regarding high-risk demographics.
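
The `detect_outliers_iqr` helper lives in `utils/preprocessing.py` and is not shown in this notebook. Below is a minimal sketch of the standard IQR rule it presumably applies; the function name `iqr_bounds_sketch`, the multiplier `k=1.5`, and the toy ages are assumptions for illustration. Under that rule, bounds of 12.0 and 84.0 are consistent with Q1 = 39 and Q3 = 57 (IQR = 18, so 39 - 27 = 12 and 57 + 27 = 84).

```python
import pandas as pd


def iqr_bounds_sketch(df: pd.DataFrame, column: str, k: float = 1.5):
    """Return (lower, upper, n_outliers) for `column` using the k * IQR rule."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # Count values strictly outside the [lower, upper] interval
    n_outliers = int(((df[column] < lower) | (df[column] > upper)).sum())
    return lower, upper, n_outliers


# Toy data (not the real dataset): one clearly extreme age
toy = pd.DataFrame({"Age": [30, 40, 50, 60, 70, 120]})
lower, upper, n_outliers = iqr_bounds_sketch(toy, "Age")
print(lower, upper, n_outliers)
```

The function only reports bounds and a count; it does not drop any rows, which matches the "inspect rather than remove" decision above.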

### 4. Train-Validation-Test Split
We split the data **before** any processing to prevent "Data Leakage" (information from the test set leaking into the training process).

* **Train Set (60%):** Used to teach the model.
* **Validation Set (20%):** Used to tune hyperparameters.
* **Test Set (20%):** Used for the final unbiased evaluation.

In [9]:
# Split the data
X_train, X_val, X_test, y_train, y_val, y_test = split_data(df, target_column='class')

print(f"Train Shape: {X_train.shape}")
print(f"Val Shape:   {X_val.shape}")
print(f"Test Shape:  {X_test.shape}")

Train Shape: (312, 16)
Val Shape:   (104, 16)
Test Shape:  (104, 16)
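
The `split_data` helper is defined in `utils/preprocessing.py` and is not shown here. One common way to obtain a 60/20/20 split is to call `train_test_split` twice, as sketched below. The function name `split_data_sketch`, the use of `stratify`, and the `seed` value are assumptions, not necessarily what the helper does.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_data_sketch(df, target_column, seed=42):
    """60/20/20 split via two stratified calls to train_test_split."""
    X = df.drop(columns=[target_column])
    y = df[target_column]
    # Step 1: hold out 20% of all rows as the test set
    X_temp, X_test, y_temp, y_test = train_test_split(
        X, y, test_size=0.20, stratify=y, random_state=seed
    )
    # Step 2: 25% of the remaining 80% equals 20% of the original data
    X_train, X_val, y_train, y_val = train_test_split(
        X_temp, y_temp, test_size=0.25, stratify=y_temp, random_state=seed
    )
    return X_train, X_val, X_test, y_train, y_val, y_test


# Synthetic frame with 520 rows, the total implied by the shapes above (312 + 104 + 104)
toy = pd.DataFrame({
    "Age": range(520),
    "class": ["Positive", "Negative"] * 260,
})
X_tr, X_va, X_te, y_tr, y_va, y_te = split_data_sketch(toy, "class")
print(X_tr.shape, X_va.shape, X_te.shape)
```

Stratifying on the target keeps the Positive/Negative ratio roughly equal across the three sets, which is useful when the classes are imbalanced.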


### 5. Setup Preprocessing Pipeline (Scaling & Encoding)
We create a `ColumnTransformer` that automatically applies:
* **StandardScaler** to numerical features (`Age`).
* **OneHotEncoder** to categorical features (`Gender` and the symptom columns).

In [10]:
# Initialize the preprocessor object
preprocessor = create_preprocessor()

print("Preprocessor configured.")

Preprocessor configured.


### 6. Apply Preprocessing
We fit the preprocessor on the **Training Set** only (to avoid data leakage) and then transform all three datasets.

In [11]:
# Fit on Train, Transform on All
X_train_processed = preprocessor.fit_transform(X_train)
X_val_processed = preprocessor.transform(X_val)
X_test_processed = preprocessor.transform(X_test)

print("Feature scaling and encoding complete.")

Feature scaling and encoding complete.
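
To make the "fit on train only" rule concrete, the small sketch below uses a bare `StandardScaler` on toy ages (assumed values, not taken from the dataset). The validation value is scaled with the mean and standard deviation learned from the training values, so nothing about the validation set influences the fitted statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train_age = np.array([[30.0], [40.0], [50.0]])  # toy training ages
val_age = np.array([[60.0]])                    # toy validation age

scaler = StandardScaler()
scaler.fit(train_age)                   # statistics come from the training data only
val_scaled = scaler.transform(val_age)  # reuses the training mean and std

# The scaled value equals (60 - training mean) / training std
manual = (60.0 - train_age.mean()) / train_age.std()
print(scaler.mean_, scaler.scale_, val_scaled, manual)
```

Calling `fit_transform` on the validation or test set instead would recompute the statistics from that set and leak its distribution into preprocessing.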


### 7. Target Encoding
We convert the target variable (`Positive`/`Negative`) into binary format (`1`/`0`).

In [12]:
# Encode targets
y_train_enc, y_val_enc, y_test_enc, le = encode_target(y_train, y_val, y_test)

print(f"Target classes: {le.classes_}")
print(f"Encoded shape: {y_train_enc.shape}")

Target classes: ['Negative' 'Positive']
Encoded shape: (312,)
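
`encode_target` is defined in `utils/preprocessing.py` and is not shown here. Since it returns an object with a `classes_` attribute, it plausibly wraps `LabelEncoder`; the sketch below works under that assumption, and the name `encode_target_sketch` and the toy labels are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder


def encode_target_sketch(y_train, y_val, y_test):
    """Fit a LabelEncoder on the training labels and apply it to all splits."""
    le = LabelEncoder()
    y_train_enc = le.fit_transform(y_train)  # learn the label-to-integer mapping
    y_val_enc = le.transform(y_val)
    y_test_enc = le.transform(y_test)
    return y_train_enc, y_val_enc, y_test_enc, le


# Toy labels (not the real dataset)
toy_train = pd.Series(["Negative", "Positive", "Positive"])
toy_val = pd.Series(["Positive"])
toy_test = pd.Series(["Negative"])
tr_enc, va_enc, te_enc, le_sketch = encode_target_sketch(toy_train, toy_val, toy_test)
print(list(le_sketch.classes_), list(tr_enc))
```

`LabelEncoder` sorts class labels, so `Negative` maps to `0` and `Positive` to `1`, matching the `Target classes` output above.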


### 8. Save Processed Data & Artifacts
Finally, we save the transformed datasets as CSV files and serialize the fitted preprocessor and label encoder with joblib for use in the App and Evaluation notebooks.

In [16]:
# Convert processed arrays back to DataFrames for easier saving
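# Extract feature names from the fitted OneHotEncoder (the 'cat' transformer)
ohe_feature_names = preprocessor.named_transformers_['cat'].get_feature_names_out()
# 'Age' comes first because the numeric transformer precedes 'cat' in the ColumnTransformer
all_feature_names = ['Age'] + list(ohe_feature_names)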

X_train_df = pd.DataFrame(X_train_processed, columns=all_feature_names)
X_val_df = pd.DataFrame(X_val_processed, columns=all_feature_names)
X_test_df = pd.DataFrame(X_test_processed, columns=all_feature_names)

# Save everything using the utils function
save_artifacts(
    X_train_df, X_val_df, X_test_df, 
    y_train_enc, y_val_enc, y_test_enc, 
    preprocessor, le
)

✅ All files and models saved successfully!
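
`save_artifacts` is defined in `utils/preprocessing.py`, and its output paths are not shown here. The sketch below illustrates the general pattern of writing a DataFrame to CSV and serializing a fitted object with `joblib`; the file names and the temporary directory are hypothetical stand-ins.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.preprocessing import LabelEncoder

with tempfile.TemporaryDirectory() as out_dir:  # stand-in for the real output folders
    X_df = pd.DataFrame({"Age": [-1.0, 1.0], "Gender_Male": [1.0, 0.0]})
    le_demo = LabelEncoder().fit(["Negative", "Positive"])

    csv_path = os.path.join(out_dir, "X_train.csv")           # hypothetical file name
    enc_path = os.path.join(out_dir, "label_encoder.joblib")  # hypothetical file name

    # index=False avoids an extra "Unnamed: 0" column when the CSV is read back
    X_df.to_csv(csv_path, index=False)
    joblib.dump(le_demo, enc_path)

    # Reload both artifacts to confirm the round trip
    X_back = pd.read_csv(csv_path)
    le_back = joblib.load(enc_path)
    round_trip_ok = X_back.equals(X_df) and list(le_back.classes_) == ["Negative", "Positive"]

print(round_trip_ok)
```

Persisting the fitted preprocessor and label encoder, rather than refitting them later, ensures that the App and Evaluation notebooks apply exactly the same transformations as training.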


### 9. Inspect Transformed Data
We examine the first few rows of the processed training set to verify:
1.  **Age:** Should be a float value (scaled around 0), not the original age (e.g., 30, 50).
2.  **Symptoms:** Should be binary (`0.0` or `1.0`), representing the One-Hot Encoded values.

In [None]:
print("Shape of Transformed Train Data:", X_train_df.shape)
print("\n--- First 5 Rows of Transformed Data ---")
display(X_train_df.head())

print("\n--- First 5 Target Values (Encoded) ---")
print(y_train_enc[:5]) 

Shape of Transformed Train Data: (312, 16)

--- First 5 Rows of Transformed Data ---


| | Age | Gender_Male | Polyuria_Yes | Polydipsia_Yes | sudden weight loss_Yes | weakness_Yes | Polyphagia_Yes | Genital thrush_Yes | visual blurring_Yes | Itching_Yes | Irritability_Yes | delayed healing_Yes | partial paresis_Yes | muscle stiffness_Yes | Alopecia_Yes | Obesity_Yes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1.492623 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 1.711724 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | 1.054422 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | -0.424507 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 4 | -1.656948 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |



--- First 5 Target Values (Encoded) ---
[0 1 1 0 1]


### Interpretation of Changes:
* **Age:** Now appears as a decimal (e.g., `-1.49`, `1.71`) because it is standardized: each value is the number of training-set standard deviations from the training-set mean age.
* **Gender:** Has become `Gender_Male` (1 if Male, 0 if Female).
* **Polyuria:** Has become `Polyuria_Yes` (1 if Yes, 0 if No).

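### Note: Where did the 'Female' column go?

You may have noticed that we have a `Gender_Male` column but no `Gender_Female` column. This is intentional: it results from the `drop='first'` parameter of our `OneHotEncoder`, which drops the first category of each feature (in sorted order, so `Female` for `Gender` and `No` for the symptoms).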

* **How it works:** We only need **one** column to represent two binary options.
    * If `Gender_Male` is **1**, the patient is **Male**.
    * If `Gender_Male` is **0**, the patient is **Female**.
* **Why we do this:** A second column (like `Gender_Female`) would be fully redundant, since it always equals `1 - Gender_Male`. Keeping both creates perfect multicollinearity, known as the **"Dummy Variable Trap"**, which can destabilize coefficient estimates in linear models that include an intercept. Dropping one column avoids this and keeps the feature matrix smaller.