# Data Preprocessing

**Project:** MediGuide AI (MVP)

**Objective:**  
Convert raw healthcare data into a clean, consistent, and model-ready format based on insights from EDA.

This notebook:
- defines preprocessing rules,
- justifies each transformation,
- validates the final feature matrix.

⚠️ This notebook does NOT train models.  
⚠️ Final preprocessing logic will be implemented in `ml/src/preprocessing.py`.


In [1]:
# ============================================================
# Core libraries for data manipulation
# ============================================================

import pandas as pd
# pandas is used because healthcare datasets are structured
# in tabular (rows = patients, columns = features) format.

import numpy as np
# numpy is required for numerical operations such as:
# - handling NaN values
# - replacing invalid values
# - numerical assertions


# ============================================================
# Scikit-learn utilities used ONLY for preprocessing
# ============================================================

from sklearn.model_selection import train_test_split
# train_test_split is used here only to prevent data leakage:
# every preprocessing statistic (median, mean, std) must be
# learned from the training split alone.

from sklearn.preprocessing import StandardScaler
# StandardScaler is chosen because:
# - scale-sensitive models (e.g. logistic regression, SVM, k-NN)
#   work best when features are on comparable scales
# - the health metrics in this dataset use different units and ranges


In [2]:
# ============================================================
# Dataset path definition
# ============================================================

DATASET_PATH = "../../data/raw/mediguide-ai.csv"
# Path is stored in a variable instead of inline to:
# - improve readability
# - make future path changes trivial
# - allow reuse in scripts later


# ============================================================
# Load raw dataset
# ============================================================

raw_healthcare_df = pd.read_csv(DATASET_PATH)
# Using a domain-specific name (not just 'df') improves clarity
# when notebooks grow large or are reviewed by others.

raw_healthcare_df.head()


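|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |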


In [3]:
# ============================================================
# Create a working copy of raw data
# ============================================================

healthcare_df = raw_healthcare_df.copy()
# We NEVER preprocess the raw dataset directly.
# This protects:
# - data integrity
# - reproducibility
# - debugging capability


In [4]:
# ============================================================
# Explicitly define the target column
# ============================================================

TARGET_COLUMN = "Outcome"
# Defining the target explicitly prevents:
# - accidental inclusion in feature set
# - silent bugs when columns change


# ============================================================
# Define feature columns
# ============================================================

FEATURE_COLUMNS = [
    column_name
    for column_name in healthcare_df.columns
    if column_name != TARGET_COLUMN
]

FEATURE_COLUMNS


['Pregnancies',
 'Glucose',
 'BloodPressure',
 'SkinThickness',
 'Insulin',
 'BMI',
 'DiabetesPedigreeFunction',
 'Age']

In [5]:
# ============================================================
# Identify columns where zero is not a valid medical value
# ============================================================

INVALID_ZERO_COLUMNS = [
    "Glucose",
    "BloodPressure",
    "BMI"
]
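# In clinical datasets, a zero in these columns almost always means
# "measurement missing", not an actual recorded value of zero.
# NOTE: SkinThickness and Insulin also contain zeros (see the preview
# above) but are not in this list, so they are left unchanged here.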


# ============================================================
# Replace invalid zeros with NaN
# ============================================================

healthcare_df[INVALID_ZERO_COLUMNS] = (
    healthcare_df[INVALID_ZERO_COLUMNS]
    .replace(0, np.nan)
)
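
The zero-to-NaN rule above applies only to columns where zero cannot be a real measurement. The sketch below illustrates that distinction on a tiny frame with made-up values (not the MediGuide data); `zeros_to_nan` is a hypothetical helper name, not a function from this project.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only.
toy_df = pd.DataFrame(
    {
        "Pregnancies": [6, 0, 8],      # zero is a valid value here
        "Glucose": [148, 0, 183],      # zero means "not measured"
        "BMI": [33.6, 26.6, 0.0],      # zero means "not measured"
    }
)


def zeros_to_nan(df, columns):
    """Return a copy of df with zeros replaced by NaN in the given columns."""
    cleaned = df.copy()
    cleaned[columns] = cleaned[columns].replace(0, np.nan)
    return cleaned


# Only the columns where zero is medically impossible are converted.
cleaned_df = zeros_to_nan(toy_df, ["Glucose", "BMI"])

# Missing values per column after the conversion.
print(cleaned_df.isna().sum())
```

Because the helper works on a copy, `toy_df` keeps its original zeros, and the valid zero in `Pregnancies` is preserved in `cleaned_df`.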


In [7]:
# ============================================================
# Separate features and target
# ============================================================

X = healthcare_df[FEATURE_COLUMNS]
y = healthcare_df[TARGET_COLUMN]


# ============================================================
# Train-test split (done BEFORE imputation to prevent leakage)
# ============================================================

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,      # hold out 20% of rows for testing
    random_state=42,    # fixed seed for reproducibility
    stratify=y          # keep the Outcome class ratio in both splits
)


In [8]:
# ============================================================
# Impute missing values using the TRAINING-set median
# ============================================================

# The median is robust to outliers. It is computed on the training
# split only, so no statistic from the test split leaks into preprocessing.
train_medians = X_train[INVALID_ZERO_COLUMNS].median()

# fillna with a Series fills each column with its matching median
# and returns a new DataFrame, which avoids chained assignment issues.
X_train = X_train.fillna(train_medians)
X_test = X_test.fillna(train_medians)
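
The `stratify=y` argument in the split is meant to keep the share of positive outcomes roughly equal in the training and test sets. A minimal way to check that behaviour, using made-up labels rather than the MediGuide data, might look like this:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up labels: 65 negatives and 35 positives (illustration only).
y_demo = np.array([0] * 65 + [1] * 35)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo,
    y_demo,
    test_size=0.2,
    random_state=42,
    stratify=y_demo,
)

# With stratification the positive rate should be close in both splits.
train_positive_rate = y_tr.mean()
test_positive_rate = y_te.mean()
print(f"train: {train_positive_rate:.3f}, test: {test_positive_rate:.3f}")
```

The same comparison can be run on `y_train` and `y_test` from this notebook to confirm the split behaved as intended.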


In [9]:
# ============================================================
# Identify numeric columns only
# ============================================================

NUMERIC_COLUMNS = X_train.select_dtypes(
    include=np.number
).columns.tolist()


# ============================================================
# Initialize scaler
# ============================================================

scaler = StandardScaler()
# The scaler's mean and std are learned from the training split only
# (via fit_transform below); the test split is transformed, never fitted.


# ============================================================
# Create copies before modifying
# ============================================================

X_train_scaled = X_train.copy()
X_test_scaled = X_test.copy()


# ============================================================
# Fit on training data, transform both
# ============================================================

X_train_scaled[NUMERIC_COLUMNS] = scaler.fit_transform(
    X_train[NUMERIC_COLUMNS]
)

X_test_scaled[NUMERIC_COLUMNS] = scaler.transform(
    X_test[NUMERIC_COLUMNS]
)
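
The fit-on-train, transform-both pattern above applies z = (x - mean) / std to every split, with the mean and standard deviation taken from the training data. The sketch below checks this on made-up numbers (not the MediGuide data); all variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Made-up features on very different scales (illustration only).
train_demo = rng.normal(loc=[120.0, 30.0], scale=[30.0, 6.0], size=(200, 2))
test_demo = rng.normal(loc=[120.0, 30.0], scale=[30.0, 6.0], size=(50, 2))

demo_scaler = StandardScaler()
train_scaled = demo_scaler.fit_transform(train_demo)  # learns mean and std
test_scaled = demo_scaler.transform(test_demo)        # reuses training stats

# Manual z-score using training statistics only.
# StandardScaler uses the population std (ddof=0), which is np.std's default.
manual_test_scaled = (test_demo - train_demo.mean(axis=0)) / train_demo.std(axis=0)

print("train mean per feature:", train_scaled.mean(axis=0))
print("test mean per feature:", test_scaled.mean(axis=0))
```

The training split ends up with mean 0 and unit variance by construction. The test split generally does not, which is expected: it is scaled with statistics it did not contribute to.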


In [10]:
# ============================================================
# Verify no missing values remain
# ============================================================

assert X_train_scaled.isnull().sum().sum() == 0, \
    "Training data still contains missing values"

assert X_test_scaled.isnull().sum().sum() == 0, \
    "Test data still contains missing values"


# ============================================================
# Verify shape consistency
# ============================================================

X_train_scaled.shape, X_test_scaled.shape


((614, 8), (154, 8))

## Preprocessing Decisions Summary

- Raw data preserved using defensive copying
- Medically invalid zero values handled explicitly
- Missing values imputed with the median, which is robust to outliers
- Data split before scaling to prevent leakage
- Numerical features standardized
- Final feature count: 8

Next steps:  
➡️ Implement this logic in `ml/src/preprocessing.py`  
➡️ Proceed to model training
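
As a starting point for `ml/src/preprocessing.py`, the steps in this notebook could be collected into a single function. The sketch below is one possible layout, not the final module: the names `PreprocessedData` and `preprocess` are hypothetical, the demo frame contains made-up values, and medians are assumed to be learned from the training split only, in line with the leakage rule stated at the top of the notebook.

```python
from dataclasses import dataclass

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


@dataclass
class PreprocessedData:
    """Hypothetical container for the outputs of preprocess()."""
    X_train: pd.DataFrame
    X_test: pd.DataFrame
    y_train: pd.Series
    y_test: pd.Series
    medians: pd.Series
    scaler: StandardScaler


def preprocess(
    df,
    target_column="Outcome",
    invalid_zero_columns=("Glucose", "BloodPressure", "BMI"),
    test_size=0.2,
    random_state=42,
):
    """Zero-to-NaN, split, median imputation, and scaling in one pass."""
    data = df.copy()  # never modify the caller's frame
    zero_cols = list(invalid_zero_columns)
    data[zero_cols] = data[zero_cols].replace(0, np.nan)

    X = data.drop(columns=[target_column])
    y = data[target_column]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state, stratify=y
    )

    # Statistics are learned from the training split only.
    medians = X_train[zero_cols].median()
    X_train = X_train.fillna(medians)
    X_test = X_test.fillna(medians)

    numeric_cols = X_train.select_dtypes(include=np.number).columns.tolist()
    scaler = StandardScaler()
    X_train[numeric_cols] = scaler.fit_transform(X_train[numeric_cols])
    X_test[numeric_cols] = scaler.transform(X_test[numeric_cols])

    return PreprocessedData(X_train, X_test, y_train, y_test, medians, scaler)


# Usage on a made-up frame (illustration only, not the MediGuide data).
rng = np.random.default_rng(0)
demo_df = pd.DataFrame(
    {
        "Glucose": rng.integers(70, 200, size=40),
        "BloodPressure": rng.integers(50, 100, size=40),
        "BMI": rng.uniform(18, 45, size=40).round(1),
        "Age": rng.integers(21, 70, size=40),
        "Outcome": [0, 1] * 20,
    }
)
demo_df.loc[[3, 7], "Glucose"] = 0  # simulate two missing measurements

result = preprocess(demo_df)
print(result.X_train.shape, result.X_test.shape)
```

Returning the fitted `medians` and `scaler` alongside the splits means the same transformations can later be applied to new patient records without refitting.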
