## Day 5: Data Preprocessing and Quality Handling

The objective of this notebook is to clean and prepare the UCI Heart Disease dataset
for machine learning models.

Based on the issues identified during Day 4 (Data Understanding), this step focuses on:
- Handling invalid and missing values
- Separating numerical and categorical features
- Encoding categorical variables
- Scaling numerical features

No model training or prediction is performed at this stage.
The output of this notebook is a clean, model-ready dataset.


In [4]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import StandardScaler


In [5]:
df = pd.read_csv("../data/heart_disease.csv")
df.head()


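|   | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|-----|-----|----|----------|------|-----|---------|---------|-------|---------|-------|----|------|--------|
| 0 | 52 | 1 | 0 | 125 | 212 | 0 | 1 | 168 | 0 | 1.0 | 2 | 2 | 3 | 0 |
| 1 | 53 | 1 | 0 | 140 | 203 | 1 | 0 | 155 | 1 | 3.1 | 0 | 0 | 3 | 0 |
| 2 | 70 | 1 | 0 | 145 | 174 | 0 | 1 | 125 | 1 | 2.6 | 0 | 0 | 3 | 0 |
| 3 | 61 | 1 | 0 | 148 | 203 | 0 | 1 | 161 | 0 | 0.0 | 2 | 1 | 3 | 0 |
| 4 | 62 | 0 | 0 | 138 | 294 | 1 | 1 | 106 | 0 | 1.9 | 1 | 3 | 2 | 0 |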


The dataset is reloaded from disk before preprocessing so that
all transformations are reproducible and independent of the Day 4 analysis.


In [6]:
# Check unique values in suspected problematic columns
df['ca'].unique(), df['thal'].unique()


(array([2, 0, 1, 3, 4]), array([3, 2, 1, 0]))

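In the UCI Heart Disease dataset:
- The `ca` and `thal` columns are known to contain placeholder values that stand in for missing entries
- These values do not represent valid medical measurements

Such values must be treated as missing data before further processing.

Note that the unique values printed above do not include `-1`, so replacing `-1` in the next cell leaves this copy of the data unchanged. In some redistributed copies of the dataset, `ca = 4` and `thal = 0` are reported to be the placeholder codes; this should be confirmed against the dataset documentation before those codes are treated as valid categories.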


In [7]:
df['ca'] = df['ca'].replace(-1, np.nan)
df['thal'] = df['thal'].replace(-1, np.nan)
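
If the placeholder codes turn out to differ from `-1`, one option is to mask every code that falls outside a known set of valid values. The following is a minimal sketch on toy data, not the notebook's method: the `VALID_CODES` mapping and the `mask_invalid_codes` helper are hypothetical, and the valid sets shown are an assumption that should be checked against the dataset documentation.

```python
import numpy as np
import pandas as pd

# ASSUMPTION: hypothetical sets of valid codes for illustration only.
# Confirm the real valid codes against the dataset documentation.
VALID_CODES = {"ca": {0, 1, 2, 3}, "thal": {1, 2, 3}}


def mask_invalid_codes(frame, valid_codes):
    """Return a copy in which codes outside the valid set become NaN."""
    out = frame.copy()
    for col, valid in valid_codes.items():
        # Keep values that are in the valid set; replace the rest with NaN.
        out[col] = out[col].where(out[col].isin(valid), np.nan)
    return out


# Toy data standing in for the real columns.
demo = pd.DataFrame({"ca": [2, 0, 4, 1], "thal": [3, 0, 2, 1]})
cleaned = mask_invalid_codes(demo, VALID_CODES)
print(cleaned.isnull().sum())
```

Masking against an explicit set of valid codes makes the assumption visible in one place, and the later `isnull().sum()` check then reports how many entries were actually treated as missing.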


In [8]:
df.isnull().sum()


age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          0
thal        0
target      0
dtype: int64

### Missing Value Handling Strategy

- Numerical features are filled using the **median**
  (robust to outliers)
- Categorical features are filled using the **mode**
  (most frequent category)

This approach keeps every row in the dataset, and imputing with a central value
limits (but does not eliminate) distortion of each feature's distribution.
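
To illustrate why the median is preferred over the mean here, the sketch below imputes a toy column that contains one extreme value. The `toy` frame and its values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy data: one numerical column with an outlier, one categorical column.
toy = pd.DataFrame({
    "chol": [200.0, np.nan, 240.0, 1000.0],
    "cp": [0, 0, np.nan, 2],
})

# The mean is pulled upward by the outlier; the median is not.
mean_fill = toy["chol"].mean()      # (200 + 240 + 1000) / 3 = 480.0
median_fill = toy["chol"].median()  # middle of 200, 240, 1000 = 240.0

# Assign the filled Series back instead of using inplace=True.
toy["chol"] = toy["chol"].fillna(median_fill)
toy["cp"] = toy["cp"].fillna(toy["cp"].mode()[0])  # most frequent code: 0
print(toy)
```

In this toy case the mean would impute a value of 480, far from the typical rows, while the median imputes 240.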


In [9]:
numerical_features = [
    'age', 'trestbps', 'chol', 'thalach', 'oldpeak'
]

categorical_features = [
    'sex', 'cp', 'fbs', 'restecg', 'exang',
    'slope', 'ca', 'thal'
]


In [10]:
# Fill numerical features with the median.
# Assigning the result back avoids the chained `inplace=True` call,
# which pandas warns about and which stops working in pandas 3.0.
for col in numerical_features:
    df[col] = df[col].fillna(df[col].median())

# Fill categorical features with the mode (most frequent value)
for col in categorical_features:
    df[col] = df[col].fillna(df[col].mode()[0])


In [11]:
df.isnull().sum()


age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          0
thal        0
target      0
dtype: int64

### Encoding Strategy

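Categorical variables are encoded using **One-Hot Encoding**
to avoid introducing ordinal relationships where none exist.
The first level of each variable is dropped (`drop_first=True`) so that
the resulting indicator columns are not perfectly collinear.

This approach is suitable for most classical machine learning models.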


In [12]:
df_encoded = pd.get_dummies(df, columns=categorical_features, drop_first=True)
df_encoded.head()


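|   | age | trestbps | chol | thalach | oldpeak | target | sex_1 | cp_1 | cp_2 | cp_3 | ... | exang_1 | slope_1 | slope_2 | ca_1 | ca_2 | ca_3 | ca_4 | thal_1 | thal_2 | thal_3 |
|---|-----|----------|------|---------|---------|--------|-------|------|------|------|-----|---------|---------|---------|------|------|------|------|--------|--------|--------|
| 0 | 52 | 125 | 212 | 168 | 1.0 | 0 | True | False | False | False | ... | False | False | True | False | True | False | False | False | False | True |
| 1 | 53 | 140 | 203 | 155 | 3.1 | 0 | True | False | False | False | ... | True | False | False | False | False | False | False | False | False | True |
| 2 | 70 | 145 | 174 | 125 | 2.6 | 0 | True | False | False | False | ... | True | False | False | False | False | False | False | False | False | True |
| 3 | 61 | 148 | 203 | 161 | 0.0 | 0 | True | False | False | False | ... | False | False | True | True | False | False | False | False | False | True |
| 4 | 62 | 138 | 294 | 106 | 1.9 | 0 | False | False | False | False | ... | False | True | False | False | False | True | False | False | True | False |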
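
The effect of `drop_first=True` is easier to see on a small example. The sketch below uses a made-up `toy` frame; the `dtype=int` argument is an optional choice, not something the notebook uses, and is shown only because some downstream tools prefer 0/1 integers to booleans.

```python
import pandas as pd

# Toy data: `cp` has three levels (0, 1, 2).
toy = pd.DataFrame({"age": [52, 53, 70, 61], "cp": [0, 1, 2, 1]})

# drop_first=True drops the indicator for the first level (cp_0),
# so a row with cp == 0 is represented by zeros in cp_1 and cp_2.
encoded = pd.get_dummies(toy, columns=["cp"], drop_first=True, dtype=int)
print(encoded)
```

Three levels therefore produce two indicator columns, with the dropped level acting as the reference category.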


### Feature Scaling

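Numerical features are standardized using **StandardScaler**, which rescales
each feature to zero mean and unit variance so that features measured on
larger scales (such as `chol`) do not dominate scale-sensitive models.

Only numerical features are scaled; the one-hot encoded columns are left unchanged.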


In [13]:
scaler = StandardScaler()
df_encoded[numerical_features] = scaler.fit_transform(
    df_encoded[numerical_features]
)
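
The cell above fits the scaler on the full dataset. If a train/test split is introduced later, a common alternative is to fit the scaler on the training rows only, so that test-set statistics do not influence the transformation. The sketch below shows that pattern on synthetic data; the `X_demo` and `y_demo` arrays, the 80/20 split, and the random seeds are all assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for two numerical features and a binary target.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=130.0, scale=20.0, size=(100, 2))
y_demo = rng.integers(0, 2, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Learn the mean and standard deviation from the training rows only,
# then apply the same transformation to both splits.
scaler_demo = StandardScaler().fit(X_train)
X_train_scaled = scaler_demo.transform(X_train)
X_test_scaled = scaler_demo.transform(X_test)

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
```

With this pattern the training features have zero mean and unit variance by construction, while the test features are only approximately standardized, which is expected.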


In [14]:
X = df_encoded.drop('target', axis=1)
y = df_encoded['target']

X.shape, y.shape


((1025, 22), (1025,))

### Final Dataset Structure

- `X` contains all processed feature variables
- `y` contains the target variable (heart disease presence)
- The dataset is now clean, encoded, and scaled
- Ready for baseline model training
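
Before moving on, the "clean, encoded, and scaled" claims can be checked programmatically. The following is a minimal sketch with a hypothetical `check_model_ready` helper, run here on toy data; the same checks could be applied to `X` and `y`.

```python
import pandas as pd


def check_model_ready(features, target):
    """Raise AssertionError if the feature matrix is not ready for modeling."""
    # Features and target must describe the same rows.
    assert len(features) == len(target), "row counts differ"
    # No missing values should remain after imputation.
    assert not features.isnull().any().any(), "features contain missing values"
    # Every column should be numeric or boolean after encoding.
    non_numeric = features.select_dtypes(exclude=["number", "bool"]).columns
    assert len(non_numeric) == 0, f"non-numeric columns: {list(non_numeric)}"
    return True


# Toy data standing in for X and y.
X_demo = pd.DataFrame({"age": [0.1, -1.2, 0.7], "sex_1": [True, False, True]})
y_demo = pd.Series([0, 1, 0], name="target")
ok = check_model_ready(X_demo, y_demo)
print(ok)
```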


### Conclusion of Day 5

- Invalid and missing values have been handled
- Categorical and numerical features were treated appropriately
- Feature scaling and encoding are complete
- A clean, model-ready dataset has been prepared

Next Step:
**Day 6 – Baseline Models and Initial Evaluation**
