# Heart Disease Data Preparation

Load the cleaned data and prepare it for model training.

In [1]:
import pandas as pd
import pickle
from pathlib import Path

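# Show every column when previewing DataFrames
pd.set_option("display.max_columns", None)
# Load the cleaned dataset (absolute path specific to the author's machine)
df = pd.read_csv("C:/4TH SEM/Group Project/Code/data/cleaned_heart.csv")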

### Encode categorical features
Use one-hot encoding for categorical columns before splitting into `X` and `y` so downstream models see numeric inputs only.

In [2]:
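# Categorical columns to expand into indicator (dummy) variables
cat_cols = ["gender", "cholesterol", "gluc"]
# drop_first=True drops the first category of each column, which becomes the implicit baseline
df = pd.get_dummies(df, columns=cat_cols, prefix=cat_cols, drop_first=True)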

df.head()

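| | id | age | height | weight | systolic_bp | diastolic_bp | smoke | alco | active | target | bmi | gender_2 | cholesterol_2 | cholesterol_3 | gluc_2 | gluc_3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 50.4 | 168 | 62.0 | 110 | 80 | 0 | 0 | 1 | 0 | 21.96712 | True | False | False | False | False |
| 1 | 1 | 55.4 | 156 | 85.0 | 140 | 90 | 0 | 0 | 1 | 1 | 34.927679 | False | False | True | False | False |
| 2 | 2 | 51.6 | 165 | 64.0 | 130 | 70 | 0 | 0 | 0 | 1 | 23.507805 | False | False | True | False | False |
| 3 | 3 | 48.2 | 169 | 82.0 | 150 | 100 | 0 | 0 | 1 | 1 | 28.710479 | True | False | False | False | False |
| 4 | 4 | 47.8 | 156 | 56.0 | 100 | 60 | 0 | 0 | 0 | 0 | 23.011177 | False | False | False | False | False |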
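
To see why the preview shows `cholesterol_2` and `cholesterol_3` but no `cholesterol_1`, here is a minimal sketch on a toy frame. The values are made up for illustration and are not taken from the heart dataset; only the `pd.get_dummies` call mirrors the cell above.

```python
import pandas as pd

# Toy data (hypothetical): one categorical column with levels 1, 2, 3
toy = pd.DataFrame({"cholesterol": [1, 2, 3, 1]})

# Same options as the notebook cell: one indicator per level, first level dropped
encoded = pd.get_dummies(
    toy, columns=["cholesterol"], prefix=["cholesterol"], drop_first=True
)

# Level 1 has no column of its own; it is the row where every indicator is False
print(encoded)
```

With `drop_first=True`, a column with k levels yields k - 1 indicators, which avoids a redundant column that linear models could not distinguish from the intercept.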


### Split features and target
Divide the dataset into input features (`X`) and the target variable (`y`) to prepare it for model training.

Features (X):
All independent variables that describe patient health parameters.

Target (y):
The dependent variable that indicates the presence or absence of disease.

This separation keeps the outcome out of the feature matrix, so the model learns to predict `y` from the input features alone and never receives the actual outcome as an input.

In [3]:
X = df.drop("target", axis=1)
y = df["target"]
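
The preview of `df` lists an `id` column, and `df.drop("target", axis=1)` leaves it in `X`. If `id` is only a row identifier (an assumption, since the notebook does not say what it holds), it carries no health information and could be dropped along with the target. A minimal sketch on a hypothetical toy frame:

```python
import pandas as pd

# Toy frame (hypothetical values) with an identifier, one feature, and the target
toy = pd.DataFrame(
    {"id": [0, 1, 2], "age": [50.4, 55.4, 51.6], "target": [0, 1, 1]}
)

# Drop the target and the assumed identifier column from the features
X_toy = toy.drop(columns=["target", "id"])
y_toy = toy["target"]

print(X_toy.columns.tolist())
```

Applying the same change to the real data would reduce the feature count by one, so any shapes reported later in the notebook would change accordingly.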

### Scale the features
Standardize the features with `StandardScaler` so they are ready for modeling, then print the shapes of `X_scaled` and `y` to confirm the data is ready.

In [4]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print("Final Feature Shape:", X_scaled.shape)
print("Target Shape:", y.shape) 

Final Feature Shape: (68889, 15)
Target Shape: (68889,)
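
The cell above fits the scaler on every row. If the data is later divided into training and test sets (an assumption, since no split appears in this notebook), a common alternative is to split first and fit the scaler on the training rows only, so that test-set statistics do not influence the transformation. The sketch below uses synthetic data; the variable names, the 80/20 split, and the random seeds are illustrative choices, not part of the original workflow.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and target (not the heart dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_demo = rng.integers(0, 2, size=200)

# Hold out a test set first (the 80/20 ratio is an assumption)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Learn the mean and standard deviation from the training rows only
demo_scaler = StandardScaler()
X_train_scaled = demo_scaler.fit_transform(X_train)

# Reuse the training statistics on the held-out rows
X_test_scaled = demo_scaler.transform(X_test)

# After standardization, each training column should have mean ~0 and std ~1
print("Train mean per column:", X_train_scaled.mean(axis=0).round(6))
print("Train std per column:", X_train_scaled.std(axis=0).round(6))
```

The test columns will generally not have exactly zero mean or unit variance, which is expected: they are scaled with the training statistics.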


### The data is now ready for model training

(Scaler saving moved to the model-training notebook to avoid duplicate artifacts.)