# AI-Based Multi-Disease Risk Predictor  
## Notebook 04: Data Preprocessing & Model-Ready Data

### Objective
This notebook preprocesses the cleaned datasets by:
- Separating features and target
- Scaling numerical features
- Creating train-test splits

These steps prepare the data for machine learning models.

### Import Required Libraries

In [13]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from joblib import dump
import os

### Load Dataset 

In [2]:
heart = pd.read_csv("C://Users//JOHN//Desktop//ML_projects//ai-disease-risk-predictor//data//raw_data/heart_clean.csv")
diabetes = pd.read_csv("C://Users//JOHN//Desktop//ML_projects//ai-disease-risk-predictor//data//raw_data/diabetes_clean.csv")
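
The absolute paths above tie the notebook to one Windows machine. One possible alternative, sketched below, builds every path from a single root with `pathlib`. `PROJECT_ROOT` and the helper `data_path` are hypothetical names, and the folder layout is assumed from the paths used in this notebook.

```python
from pathlib import Path

# Hypothetical project root; in practice this would point at the repository folder.
PROJECT_ROOT = Path("ai-disease-risk-predictor")


def data_path(*parts: str) -> Path:
    """Join path components under the project's data folder."""
    return PROJECT_ROOT.joinpath("data", *parts)


heart_csv = data_path("raw_data", "heart_clean.csv")
diabetes_csv = data_path("raw_data", "diabetes_clean.csv")

print(heart_csv.as_posix())
print(diabetes_csv.as_posix())
```

`pd.read_csv` accepts `Path` objects, so `pd.read_csv(heart_csv)` could stand in for the hard-coded string.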

In [8]:
print(heart.shape,  diabetes.shape)

(1025, 6) (768, 4)


### Feature & Target Split (Heart Data)

In [9]:
X_heart = heart.drop("target", axis=1)
y_heart = heart["target"]

print("Heart X shape:", X_heart.shape)
print("Heart y shape:", y_heart.shape)

Heart X shape: (1025, 5)
Heart y shape: (1025,)


### Feature & Target Split (Diabetes Data)

In [10]:
X_diabetes = diabetes.drop("target", axis=1)
y_diabetes = diabetes["target"]

print("Diabetes X shape:", X_diabetes.shape)
print("Diabetes y shape:", y_diabetes.shape)

Diabetes X shape: (768, 3)
Diabetes y shape: (768,)


## Feature Scaling

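Medical features are measured on very different ranges (e.g., age, glucose, cholesterol).
Putting them on a common scale:
- Keeps features with large numeric ranges from dominating distance-based and gradient-based models
- Usually makes optimization converge faster and more stably
- Has little effect on tree-based models, which are insensitive to monotonic rescaling of a feature

StandardScaler is used here: it rescales each feature to zero mean and unit variance, z = (x - mean) / std.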
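
As a small illustration of what StandardScaler computes, the sketch below standardizes a toy array by hand with z = (x - mean) / std and compares the result with scikit-learn. The values are made up for illustration and are not taken from the heart or diabetes data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix (hypothetical values): three columns on very different scales.
X_toy = np.array([
    [29.0, 85.0, 180.0],
    [45.0, 140.0, 240.0],
    [61.0, 110.0, 300.0],
    [52.0, 165.0, 210.0],
])

# Manual z-score: subtract the column mean, divide by the column standard deviation.
# StandardScaler uses the population standard deviation (ddof=0), which is NumPy's default.
mean = X_toy.mean(axis=0)
std = X_toy.std(axis=0)
X_manual = (X_toy - mean) / std

toy_scaler = StandardScaler()
X_sklearn = toy_scaler.fit_transform(X_toy)

# Both approaches should agree, and each scaled column should have mean 0 and std 1.
print(np.allclose(X_manual, X_sklearn))
print(np.allclose(X_sklearn.mean(axis=0), 0.0), np.allclose(X_sklearn.std(axis=0), 1.0))
```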


### Train-Test Split (Heart Data)

In [15]:
# Split before scaling so the test rows never influence the scaler's statistics.
Xh_train, Xh_test, yh_train, yh_test = train_test_split(
    X_heart, y_heart, test_size=0.2, random_state=42
)

### Train-Test Split (Diabetes Data)

In [16]:
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_diabetes, y_diabetes, test_size=0.2, random_state=42
)

### Scaling Heart Dataset

In [11]:
# Fit on the training split only, then apply the same transform to the test split.
heart_scaler = StandardScaler()
Xh_train = heart_scaler.fit_transform(Xh_train)
Xh_test = heart_scaler.transform(Xh_test)

### Scaling Diabetes Dataset

In [12]:
diabetes_scaler = StandardScaler()
Xd_train = diabetes_scaler.fit_transform(Xd_train)
Xd_test = diabetes_scaler.transform(Xd_test)

### Save Scalers using JOBLIB

In [14]:
os.makedirs("C://Users//JOHN//Desktop//ML_projects//ai-disease-risk-predictor/model", exist_ok=True)

dump(heart_scaler, "C://Users//JOHN//Desktop//ML_projects//ai-disease-risk-predictor/model/heart_scaler.joblib")
dump(diabetes_scaler, "C://Users//JOHN//Desktop//ML_projects//ai-disease-risk-predictor/model/diabetes_scaler.joblib")

['C://Users//JOHN//Desktop//ML_projects//ai-disease-risk-predictor/model/diabetes_scaler.joblib']

### Create Preprocessed Data Folder

In [17]:
os.makedirs("C://Users//JOHN//Desktop//ML_projects//ai-disease-risk-predictor//data/preprocessed_data", exist_ok=True)

### Save Preprocessed Data

In [18]:
# Heart data
np.save("C://Users//JOHN//Desktop//ML_projects//ai-disease-risk-predictor//data/preprocessed_data/Xh_train.npy", Xh_train)
np.save("C://Users//JOHN//Desktop//ML_projects//ai-disease-risk-predictor//data/preprocessed_data/Xh_test.npy", Xh_test)
np.save("C://Users//JOHN//Desktop//ML_projects//ai-disease-risk-predictor//data/preprocessed_data/yh_train.npy", yh_train)
np.save("C://Users//JOHN//Desktop//ML_projects//ai-disease-risk-predictor//data/preprocessed_data/yh_test.npy", yh_test)

# Diabetes data
np.save("C://Users//JOHN//Desktop//ML_projects//ai-disease-risk-predictor//data/preprocessed_data/Xd_train.npy", Xd_train)
np.save("C://Users//JOHN//Desktop//ML_projects//ai-disease-risk-predictor//data/preprocessed_data/Xd_test.npy", Xd_test)
np.save("C://Users//JOHN//Desktop//ML_projects//ai-disease-risk-predictor//data/preprocessed_data/yd_train.npy", yd_train)
np.save("C://Users//JOHN//Desktop//ML_projects//ai-disease-risk-predictor//data/preprocessed_data/yd_test.npy", yd_test)
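
A quick way to confirm that saved arrays can be read back unchanged is a save/load round trip. The sketch below uses small made-up arrays and a temporary folder rather than the project's `preprocessed_data` directory; only the file names mirror the ones above.

```python
import tempfile
from pathlib import Path

import numpy as np

# Stand-in arrays (hypothetical values) shaped like a scaled feature split and its labels.
X_demo = np.array([[0.5, -1.2, 0.3], [-0.7, 0.8, 1.1]])
y_demo = np.array([1, 0])

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp)
    np.save(out_dir / "Xh_train.npy", X_demo)
    np.save(out_dir / "yh_train.npy", y_demo)

    # np.load reads the whole array into memory, so it remains usable after the folder is removed.
    X_loaded = np.load(out_dir / "Xh_train.npy")
    y_loaded = np.load(out_dir / "yh_train.npy")

print(X_loaded.shape, y_loaded.shape)
print(np.array_equal(X_demo, X_loaded), np.array_equal(y_demo, y_loaded))
```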

## Preprocessing Summary

- Features and targets separated for both diseases
- Numerical features scaled using StandardScaler
- Train-test datasets created with an 80/20 split (`random_state=42`)
- Preprocessed splits saved as separate `.npy` files
- Scalers stored using joblib so the same transformation can be reused at inference time

The data is now fully prepared for machine learning model training.
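
To show how a stored scaler might be reused later, for example when scoring a new patient record, the sketch below fits a scaler on made-up data, saves it with joblib, reloads it, and transforms one new row. The data and the temporary file location are assumptions for illustration; in the project the file would be `heart_scaler.joblib` in the `model` folder.

```python
import tempfile
from pathlib import Path

import numpy as np
from joblib import dump, load
from sklearn.preprocessing import StandardScaler

# Made-up training data with two features (hypothetical values).
X_train_demo = np.array([[50.0, 120.0], [60.0, 140.0], [40.0, 100.0]])
scaler = StandardScaler().fit(X_train_demo)

with tempfile.TemporaryDirectory() as tmp:
    scaler_file = Path(tmp) / "heart_scaler.joblib"
    dump(scaler, scaler_file)
    restored = load(scaler_file)

# The new row equals the training mean of each column, so each standardized value is 0.
new_row = np.array([[50.0, 120.0]])
scaled_row = restored.transform(new_row)

print(scaled_row)
```

The reloaded scaler carries the mean and standard deviation learned during fitting, so new inputs are transformed with the same statistics the model saw in training.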
