# 02 – Data Cleaning & Preprocessing

## 1. Introduction

This notebook cleans and preprocesses the data: it checks for duplicates, removes outliers, reviews categorical encoding, adds derived features, scales the numerical columns, and saves the result for model training.

We use the Cardiovascular Disease dataset from Kaggle.

Goal: Produce a cleaned dataset ready for ML.

## 2. Import Libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import seaborn as sns
import matplotlib.pyplot as plt

## 3. Load Dataset

In [5]:
df = pd.read_csv("../data/raw/cardio_train.csv", sep = ";")
df.head()

,id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio
0,0,18393,2,168,62.0,110,80,1,1,0,0,1,0
1,1,20228,1,156,85.0,140,90,3,1,0,0,1,1
2,2,18857,1,165,64.0,130,70,3,1,0,0,0,1
3,3,17623,2,169,82.0,150,100,1,1,0,0,1,1
4,4,17474,1,156,56.0,100,60,1,1,0,0,0,0
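
Before cleaning, it can help to confirm that the file parsed as expected; with the wrong `sep`, for instance, the frame would typically collapse into a single column. The following is a minimal sketch on a small hand-made frame (the first three rows shown above, with a subset of columns) rather than the real file. `summarize_frame` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

# Toy frame copied from the first three rows of the df.head() output above
# (only some columns, to keep the example short).
sample = pd.DataFrame({
    "id": [0, 1, 2],
    "age": [18393, 20228, 18857],
    "gender": [2, 1, 1],
    "height": [168, 156, 165],
    "weight": [62.0, 85.0, 64.0],
    "cardio": [0, 1, 1],
})


def summarize_frame(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the dtype and missing-value count of each column (hypothetical helper)."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_missing": frame.isna().sum(),
    })


summary = summarize_frame(sample)
print(summary)
```

On the real data, the same call would be `summarize_frame(df)`; a non-zero `n_missing` would call for an imputation or row-dropping step that this notebook does not currently include.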


## 4. Add Derived Feature (Age in Years)

In [6]:
df['age_years'] = (df['age'] / 365).astype(int)  # age is stored in days; astype(int) truncates to whole years
df = df.drop(columns = ['age'])  # drop the original day-based column
df.head()

,id,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio,age_years
0,0,2,168,62.0,110,80,1,1,0,0,1,0,50
1,1,1,156,85.0,140,90,3,1,0,0,1,1,55
2,2,1,165,64.0,130,70,3,1,0,0,0,1,51
3,3,2,169,82.0,150,100,1,1,0,0,1,1,48
4,4,1,156,56.0,100,60,1,1,0,0,0,0,47
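
As a worked check of the conversion, the sketch below applies the same expression to the five `age` values from the section 3 output. It shows that the cast truncates rather than rounds (18393 / 365 ≈ 50.39 becomes 50, and 18857 / 365 ≈ 51.66 becomes 51, not 52).

```python
import pandas as pd

# Ages in days, taken from the df.head() output in section 3.
age_days = pd.Series([18393, 20228, 18857, 17623, 17474])

# Dividing by 365 and casting to int truncates toward zero (no rounding).
age_years = (age_days / 365).astype(int)
print(age_years.tolist())  # [50, 55, 51, 48, 47]

# Integer floor division gives the same values for non-negative ages.
same = (age_days // 365).tolist() == age_years.tolist()
print(same)  # True
```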


## 5. Handling Duplicates

In [7]:
print("Duplicates before:", df.duplicated().sum())
df = df.drop_duplicates()
print("Duplicates after:", df.duplicated().sum())

Duplicates before: 0
Duplicates after: 0
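
Because every row carries a unique `id`, `df.duplicated()` compares the identifier as well, so it cannot flag two records that are identical in every other column. The sketch below, on made-up toy data, counts repeats with `id` excluded. This is an optional extra check, not something the notebook does, and whether such rows should be dropped is a judgment call, since two different patients can share the same measurements.

```python
import pandas as pd

# Made-up rows: ids 0 and 2 have identical measurements.
toy = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "height": [168, 156, 168, 165],
    "weight": [62.0, 85.0, 62.0, 64.0],
    "cardio": [0, 1, 0, 1],
})

# Full-row comparison: the unique id hides the repeated record.
full_row_dupes = int(toy.duplicated().sum())

# Compare on every column except the identifier.
feature_cols = [c for c in toy.columns if c != "id"]
feature_dupes = int(toy.duplicated(subset=feature_cols).sum())

print(full_row_dupes, feature_dupes)  # 0 1
```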


## 6. Handling Outliers

We remove physiologically implausible values with fixed, inclusive range rules. Rows that fall outside a range are dropped rather than clipped.

### 6.1 Height Outliers

In [8]:
df = df[(df['height'] >= 120) & (df['height'] <= 220)]

### 6.2 Weight Outliers

In [9]:
df = df[(df['weight'] >= 30) & (df['weight'] <= 200)]

### 6.3 Systolic BP (ap_hi) Outliers

In [10]:
df = df[(df['ap_hi'] >= 80) & (df['ap_hi'] <= 200)]

### 6.4 Diastolic BP (ap_lo) Outliers

In [11]:
df = df[(df['ap_lo'] >= 50) & (df['ap_lo'] <= 150)]
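
The four filters above share one pattern, so they could also be expressed as a single table of bounds. The sketch below is one possible way to do that, and it also reports how many rows each rule removes. `BOUNDS` and `filter_by_bounds` are hypothetical names, and the toy values (including the `ap_hi` of 16020) are made up for illustration.

```python
import pandas as pd

# Bounds copied from sections 6.1 to 6.4 (inclusive on both ends).
BOUNDS = {
    "height": (120, 220),
    "weight": (30, 200),
    "ap_hi": (80, 200),
    "ap_lo": (50, 150),
}


def filter_by_bounds(frame: pd.DataFrame, bounds: dict) -> pd.DataFrame:
    """Keep only rows whose values fall inside every (low, high) range."""
    kept = frame
    for col, (low, high) in bounds.items():
        before = len(kept)
        # Series.between is inclusive on both ends by default.
        kept = kept[kept[col].between(low, high)]
        print(f"{col}: removed {before - len(kept)} rows")
    return kept


# Made-up rows: only the first one satisfies all four rules.
toy = pd.DataFrame({
    "height": [168, 250, 156, 165],
    "weight": [62.0, 70.0, 10.0, 64.0],
    "ap_hi": [110, 120, 130, 16020],
    "ap_lo": [80, 80, 90, 70],
})

cleaned = filter_by_bounds(toy, BOUNDS)
print(len(cleaned))  # 1
```

Keeping the bounds in one place makes them easier to review and adjust, and the per-rule counts show which rule is responsible for most of the removed rows.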

### 6.5 Check Dataset After Outlier Removal

In [12]:
df.describe()

,id,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio,age_years
count,68614.0,68614.0,68614.0,68614.0,68614.0,68614.0,68614.0,68614.0,68614.0,68614.0,68614.0,68614.0,68614.0
mean,49972.653351,1.348661,164.414012,74.112165,126.559959,81.337657,1.364517,1.225829,0.088,0.053502,0.803378,0.494637,52.827951
std,28847.366008,0.47655,7.91316,14.288144,16.536077,9.470911,0.678857,0.571805,0.283296,0.225034,0.397447,0.499975,6.768942
min,0.0,1.0,120.0,30.0,80.0,50.0,1.0,1.0,0.0,0.0,0.0,0.0,29.0
25%,25000.5,1.0,159.0,65.0,120.0,80.0,1.0,1.0,0.0,0.0,1.0,0.0,48.0
50%,50014.5,1.0,165.0,72.0,120.0,80.0,1.0,1.0,0.0,0.0,1.0,0.0,53.0
75%,74869.75,2.0,170.0,82.0,140.0,90.0,1.0,1.0,0.0,0.0,1.0,1.0,58.0
max,99999.0,2.0,207.0,200.0,200.0,150.0,3.0,3.0,1.0,1.0,1.0,1.0,64.0


## 7. Encode Categorical Columns

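The categorical columns are already integer-coded: cholesterol and gluc take three levels (1 to 3), smoke, alco, and active are binary (0/1), and gender is coded as 1/2.

No label encoding is needed. The unique-value counts below confirm this.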

In [14]:
df.nunique()

id             68614
gender             2
height            73
weight           273
ap_hi             99
ap_lo             76
cholesterol        3
gluc               3
smoke              2
alco               2
active             2
cardio             2
age_years         28
dtype: int64
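
The integer codes can be passed to many models as they are. If a model should not treat the three `cholesterol` and `gluc` levels as evenly spaced numbers, one-hot encoding is a common alternative. The sketch below shows it on made-up toy rows using `pd.get_dummies`. This is an option to consider, not a step the notebook performs.

```python
import pandas as pd

# Made-up rows using the same integer codes as the dataset.
toy = pd.DataFrame({
    "cholesterol": [1, 3, 3, 1, 2],
    "gluc": [1, 1, 1, 2, 3],
    "smoke": [0, 0, 1, 0, 0],
})

# Expand each three-level column into three 0/1 indicator columns.
encoded = pd.get_dummies(toy, columns=["cholesterol", "gluc"], dtype=int)

print(encoded.columns.tolist())
print(encoded.head())
```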

## 8. Feature Engineering (Add BMI)

In [15]:
df['bmi'] = df['weight'] / (df['height'] / 100) ** 2
df.head()

,id,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio,age_years,bmi
0,0,2,168,62.0,110,80,1,1,0,0,1,0,50,21.96712
1,1,1,156,85.0,140,90,3,1,0,0,1,1,55,34.927679
2,2,1,165,64.0,130,70,3,1,0,0,0,1,51,23.507805
3,3,2,169,82.0,150,100,1,1,0,0,1,1,48,28.710479
4,4,1,156,56.0,100,60,1,1,0,0,0,0,47,23.011177
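
As a worked check of the formula, the sketch below recomputes BMI for the first two rows above. It assumes height is in centimetres and weight in kilograms, which the `/ 100` in the expression implies. The `bmi` function is a hypothetical wrapper around the same arithmetic.

```python
import pandas as pd


def bmi(weight_kg, height_cm):
    """Body mass index: weight in kg divided by squared height in metres."""
    return weight_kg / (height_cm / 100) ** 2


# First two rows of the output above.
rows = pd.DataFrame({"height": [168, 156], "weight": [62.0, 85.0]})
rows["bmi"] = bmi(rows["weight"], rows["height"])

# 62 / 1.68**2 and 85 / 1.56**2
print(rows["bmi"].round(2).tolist())  # [21.97, 34.93]
```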


## 9. Preprocessing Pipeline (Main Section)

### 9.1 Select Features & Target

In [16]:
X = df.drop(columns = ['cardio', 'id'])  # id is a row identifier, not a predictive feature
y = df['cardio']

### 9.2 Identify Numerical Columns for Scaling

In [17]:
num_cols = ['height', 'weight', 'ap_hi', 'ap_lo', 'bmi', 'age_years']

### 9.3 Train–Test Split

In [20]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42, stratify = y)
# random_state - ensures the split is reproducible (same split every run)
# stratify - preserves the class distribution from y in both train and test sets
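
To illustrate what `stratify` does, the sketch below splits a made-up dataset with a 30% positive rate and compares the positive rate in each part. The toy data and the names `X_toy` and `y_toy` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 1000 rows, 30% of them in the positive class.
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({"feature": rng.normal(size=1000)})
y_toy = pd.Series([1] * 300 + [0] * 700, name="cardio")

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(len(X_tr), len(X_te))  # 800 200

# With stratify, both parts keep the positive rate of the full dataset.
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))
```

Without `stratify`, the two rates would usually differ slightly by chance, which matters more for small or imbalanced datasets than for a nearly balanced target like `cardio` (mean 0.495 in the `describe()` output above).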

### 9.4 Scaling Numerical Features

In [24]:
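scaler = StandardScaler()

# Fit on the training split only, then reuse those statistics on the test split,
# so that no information from the test set leaks into the scaler.
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])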
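
An alternative to assigning scaled columns back in place is to bundle the step into a scikit-learn `ColumnTransformer`, which scales the listed columns and passes the rest through unchanged. The sketch below shows this on made-up data with only two numerical columns. It is one possible arrangement, not the notebook's method, and note that it returns a NumPy array with the scaled columns first.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Made-up data; the real notebook scales six columns.
num_cols = ["height", "weight"]
train = pd.DataFrame({
    "height": [150, 160, 170, 180],
    "weight": [50.0, 60.0, 70.0, 80.0],
    "smoke": [0, 1, 0, 0],
})
test = pd.DataFrame({"height": [165], "weight": [65.0], "smoke": [1]})

preprocessor = ColumnTransformer(
    transformers=[("num", StandardScaler(), num_cols)],
    remainder="passthrough",  # leave the other columns unscaled
)

# Statistics are learned from the training data only.
train_scaled = preprocessor.fit_transform(train)
test_scaled = preprocessor.transform(test)

# Scaled training columns have mean 0 and standard deviation 1.
print(train_scaled[:, :2].mean(axis=0))
print(train_scaled[:, :2].std(axis=0))

# The test row sits exactly at the training means, so it scales to 0.
print(test_scaled)
```

Keeping the fitted `preprocessor` object around means the same transformation can later be applied to new data without recomputing the statistics.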

## 10. Save Processed Files

### 10.1 Save Cleaned Full Dataset

In [27]:
df.to_csv("../data/processed/cleaned_data.csv", index = False)

### 10.2 Save Train–Test Parts

In [28]:
X_train.to_csv("../data/processed/X_train.csv", index = False)
X_test.to_csv("../data/processed/X_test.csv", index = False)
y_train.to_csv("../data/processed/y_train.csv", index = False)
y_test.to_csv("../data/processed/y_test.csv", index = False)
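
Two details of these calls are worth checking: `to_csv` does not create missing folders, and a target saved with `index = False` comes back from `pd.read_csv` as a one-column DataFrame rather than a Series. The sketch below round-trips a made-up target through a temporary directory that stands in for `../data/processed`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up target values, named like the notebook's target column.
y_toy = pd.Series([0, 1, 1, 0], name="cardio")

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "processed"  # stand-in for ../data/processed
    out_dir.mkdir(parents=True, exist_ok=True)  # to_csv does not create folders

    y_toy.to_csv(out_dir / "y_train.csv", index=False)

    # The file holds a single column named after the Series;
    # selecting that column returns a Series again.
    y_loaded = pd.read_csv(out_dir / "y_train.csv")["cardio"]

print(y_loaded.tolist())  # [0, 1, 1, 0]
```

Since `index = False` discards the original row labels, the saved feature and target files stay aligned by row position only, so they should be loaded without reordering.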

## 11. Preprocessing Summary

### Preprocessing Completed Successfully

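- Converted age from days to whole years
- Checked for duplicate rows (none found)
- Removed rows with implausible height, weight, or blood pressure values (68,614 rows remain)
- Confirmed the categorical columns are already integer-coded, so no label encoding was needed
- Added feature: BMI
- Selected features and target
- Applied a stratified Train–Test split (80/20)
- Scaled numerical features using StandardScaler, fitted on the training set only
- Saved cleaned & split datasets for model training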