# Data Preparation & Preprocessing
## Early Stage Diabetes Risk Prediction

### Objective
In this notebook, we will transform the raw data into a format suitable for machine learning models.
Based on our EDA findings, we will implement the following pipeline:

1.  **Data Splitting:** Divide data into Train, Validation, and Test sets (60/20/20).
2.  **Numerical Scaling:** Apply `StandardScaler` to the `Age` column.
3.  **Categorical Encoding:** Apply `OneHotEncoder` with `drop='first'` to `Gender` and all symptom columns, so each two-valued column becomes a single 0/1 column.
4.  **Target Encoding:** Convert `Positive`/`Negative` classes to `1`/`0`.

In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import joblib  # To save our scalers/encoders for later use

# Load the raw data
df = pd.read_csv('../data/raw/diabetes_data_upload.csv')

print("Data Loaded successfully!")

Data Loaded successfully!


## 1. Train-Validation-Test Split
We split the data **before** any processing to prevent "Data Leakage" (information from the test set leaking into the training process).

* **Train Set (60%):** Used to teach the model.
* **Validation Set (20%):** Used to tune hyperparameters.
* **Test Set (20%):** Used for the final unbiased evaluation.

In [None]:
# Separate Features (X) and Target (y)
X = df.drop('class', axis=1)
y = df['class']

# First Split: 80% Train+Val, 20% Test
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Second Split: Split the 80% into Train (75%) and Val (25%) -> Results in 60% Train, 20% Val overall
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp
)

print(f"Training Set:   {X_train.shape}")
print(f"Validation Set: {X_val.shape}")
print(f"Testing Set:    {X_test.shape}")

Training Set:   (312, 16)
Validation Set: (104, 16)
Testing Set:    (104, 16)
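
The arithmetic behind the two-stage split can be checked with a small sketch. The helper `two_stage_split_sizes` is hypothetical (it is not part of the notebook), and it assumes the held-out size at each stage is rounded up, which is how `train_test_split` treats a float `test_size`. The row count of 520 is simply the sum of the three shapes printed above.

```python
from fractions import Fraction
from math import ceil

def two_stage_split_sizes(n_rows, test_frac, val_frac_of_rest):
    """Return (train, val, test) sizes for a two-stage hold-out split.

    Assumption: the held-out part of each stage is rounded up.
    """
    n_test = ceil(n_rows * test_frac)
    n_rest = n_rows - n_test
    n_val = ceil(n_rest * val_frac_of_rest)
    n_train = n_rest - n_val
    return n_train, n_val, n_test

# 520 rows = 312 + 104 + 104, the shapes printed above
sizes = two_stage_split_sizes(520, Fraction(1, 5), Fraction(1, 4))
print(sizes)  # (312, 104, 104)

# Overall validation share: 25% of the remaining 80%
val_share = Fraction(1, 4) * (1 - Fraction(1, 5))
print(float(val_share))  # 0.2
```

Exact fractions are used instead of floats only to keep the rounding in this illustration unambiguous.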


## 2. Defining the Preprocessing Pipeline
We will use Scikit-Learn's `ColumnTransformer` to apply different logic to different columns simultaneously.

* **Numerical (`Age`):** Requires `StandardScaler` (Mean=0, Std=1).
* **Categorical (`Gender` and symptoms):** Requires `OneHotEncoder`. Every categorical column has exactly two values, so `drop='first'` reduces each one to a single 0/1 column (for example, `Gender` becomes `Gender_Male`) and avoids a redundant second column.
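
To make the "Mean=0, Std=1" claim concrete, here is a minimal sketch of the standardization formula, z = (x - mean) / std, using only the standard library. The ages are made up for illustration; `StandardScaler` uses the population standard deviation (no Bessel correction), which is why `pstdev` is used here.

```python
from statistics import fmean, pstdev

ages = [30, 40, 50, 60, 70]  # made-up ages, not taken from the dataset

mean_age = fmean(ages)   # 50.0
std_age = pstdev(ages)   # population standard deviation (ddof=0)

# z-score: subtract the mean, divide by the standard deviation
scaled = [(a - mean_age) / std_age for a in ages]
print([round(z, 3) for z in scaled])

# After scaling, the values have mean 0 and standard deviation 1
print(round(fmean(scaled), 6), round(pstdev(scaled), 6))
```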

In [None]:
# 1. Identify Column Types
numeric_features = ['Age']
categorical_features = [col for col in X.columns if col != 'Age']

# 2. Define Transformers
# For Age: Scale it
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])

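# For Gender and Symptoms: Encode each two-valued column as a single 0/1 column
# Note: with drop='first', handle_unknown='ignore' encodes an unseen category as all zeros,
# which is indistinguishable from the dropped category ('Female' / 'No')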
categorical_transformer = Pipeline(steps=[
    ('encoder', OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore'))
])

# 3. Combine into a Preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ],
    verbose_feature_names_out=False # Keeps column names clean (e.g., "Gender_Male" instead of "cat__Gender_Male")
)

print("✅ Pipeline defined successfully.")

✅ Pipeline defined successfully.


## 3. Applying Transformations
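We **Fit** the preprocessor ONLY on the Training data, and then **Transform** the Validation and Test data. This ensures that the scaling statistics and category lists come from the training data alone, so nothing about the validation or test sets leaks into preprocessing.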
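
The fit/transform distinction can be illustrated with a small sketch on made-up ages (the arrays below are assumptions for illustration, not rows from the dataset). The scaler learns its mean and standard deviation from the training rows only, so the transformed test rows are not necessarily centered on zero.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up ages; the test set is deliberately older than the training set
age_train = np.array([[30.0], [40.0], [50.0]])
age_test = np.array([[60.0], [70.0]])

scaler = StandardScaler()
scaler.fit(age_train)  # statistics come from the training rows only

print(scaler.mean_)    # mean of the training ages

# The test rows are scaled with the training mean and std
age_test_scaled = scaler.transform(age_test)
print(age_test_scaled.ravel())
```

Calling `fit` again on the test rows would recentre them on zero and hide the fact that they differ from the training distribution, which is exactly the leakage this section avoids.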

In [9]:
# Fit on Train, Transform Train
X_train_processed = preprocessor.fit_transform(X_train)

# Transform Val and Test (using the rules learned from Train)
X_val_processed = preprocessor.transform(X_val)
X_test_processed = preprocessor.transform(X_test)

# --- Rebuild DataFrames ---
# The output of transformers is a numpy array. We put it back into a DataFrame for readability.

# Get new column names (clean names because verbose_feature_names_out=False)
new_columns = list(preprocessor.get_feature_names_out())

X_train_df = pd.DataFrame(X_train_processed, columns=new_columns)
X_val_df = pd.DataFrame(X_val_processed, columns=new_columns)
X_test_df = pd.DataFrame(X_test_processed, columns=new_columns)

print("Preview of Processed Data (First 5 rows of Train):")
display(X_train_df.head())

Preview of Processed Data (First 5 rows of Train):


| | Age | Gender_Male | Polyuria_Yes | Polydipsia_Yes | sudden weight loss_Yes | weakness_Yes | Polyphagia_Yes | Genital thrush_Yes | visual blurring_Yes | Itching_Yes | Irritability_Yes | delayed healing_Yes | partial paresis_Yes | muscle stiffness_Yes | Alopecia_Yes | Obesity_Yes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.128102 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 1 | -0.779358 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 2 | 1.045169 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 0.713437 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 |
| 4 | 1.045169 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |


### Interpretation of Changes:
* **Age:** Now appears as a decimal (e.g., `1.128`, `-0.779`) because it is standardized using the training set's mean and standard deviation.
* **Gender:** Has become `Gender_Male` (1 if Male, 0 if Female).
* **Polyuria:** Has become `Polyuria_Yes` (1 if Yes, 0 if No).

### 💡 Note: Where did the 'Female' column go?

You may have noticed that we have a `Gender_Male` column but no `Gender_Female` column. This is intentional and caused by the parameter `drop='first'` in our `OneHotEncoder`.

* **How it works:** We only need **one** column to represent two binary options.
    * If `Gender_Male` is **1**, the patient is **Male**.
    * If `Gender_Male` is **0**, the patient is **Female**.
* **Why we do this:** A second column (like `Gender_Female`) would always equal `1 - Gender_Male`, so it adds no information. Dropping it avoids the **"Dummy Variable Trap"** (perfect multicollinearity), which mainly affects linear models with an intercept, and it halves the number of encoded columns.
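
The redundancy described above can be seen on a toy `Gender` column (the four rows below are made up for illustration). This sketch compares the full one-hot encoding with the `drop='first'` encoding; `.toarray()` is used so that it does not depend on the encoder's sparse-output setting.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

toy = pd.DataFrame({"Gender": ["Male", "Female", "Female", "Male"]})

# .toarray() converts the default sparse output to a dense NumPy array
full = OneHotEncoder().fit_transform(toy).toarray()
reduced = OneHotEncoder(drop="first").fit_transform(toy).toarray()

print(full)     # two columns: Gender_Female, Gender_Male
print(reduced)  # one column:  Gender_Male

# The two full columns always sum to 1, so either one determines the other
redundant = (full[:, 0] + full[:, 1] == 1).all()
print(redundant)
```

Categories are sorted alphabetically, so `Female` is the "first" category and is the one that gets dropped.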

## 4. Target Encoding
The target column (`class`) contains "Positive" and "Negative". We must convert this to `1` and `0`.

In [7]:
# Initialize LabelEncoder
target_encoder = LabelEncoder()

# Fit on Train, Transform everything
y_train_encoded = target_encoder.fit_transform(y_train)
y_val_encoded = target_encoder.transform(y_val)
y_test_encoded = target_encoder.transform(y_test)

# Check the mapping
print(f"Class Mapping: {dict(zip(target_encoder.classes_, target_encoder.transform(target_encoder.classes_)))}")
# Expect: {'Negative': 0, 'Positive': 1}

Class Mapping: {'Negative': np.int64(0), 'Positive': np.int64(1)}
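
Keeping the fitted encoder also makes it possible to map numeric predictions back to the original labels. The following is a small sketch with a freshly fitted encoder and a hypothetical list of model outputs; in the notebook the same call would be made on `target_encoder`.

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoder.fit(["Positive", "Negative", "Positive"])

# Classes are sorted alphabetically: Negative -> 0, Positive -> 1
print(list(encoder.classes_))

predictions = [1, 0, 1]  # hypothetical model output
labels = encoder.inverse_transform(predictions)
print(list(labels))
```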


## 5. Saving Processed Data
We save the processed feature and target sets as CSV files, along with the fitted `preprocessor` and `target_encoder` objects. Saving the `preprocessor` is crucial because we will need it later to process new real-world data in our web app.

In [11]:
import os

# Ensure the output folders exist before writing to them
os.makedirs('../data/processed', exist_ok=True)
os.makedirs('../models', exist_ok=True)

# Save DataFrames
X_train_df.to_csv('../data/processed/X_train.csv', index=False)
X_val_df.to_csv('../data/processed/X_val.csv', index=False)
X_test_df.to_csv('../data/processed/X_test.csv', index=False)

# Save Targets (as DataFrames for consistency)
pd.DataFrame(y_train_encoded, columns=['class']).to_csv('../data/processed/y_train.csv', index=False)
pd.DataFrame(y_val_encoded, columns=['class']).to_csv('../data/processed/y_val.csv', index=False)
pd.DataFrame(y_test_encoded, columns=['class']).to_csv('../data/processed/y_test.csv', index=False)

# Save the Preprocessor and Target Encoder (For the App/Inference later)

joblib.dump(preprocessor, '../models/preprocessor.joblib')
joblib.dump(target_encoder, '../models/target_encoder.joblib')

print("✅ All files saved to 'data/processed/' and 'models/'")

✅ All files saved to 'data/processed/' and 'models/'
