# Step 2 — Preprocessing (Cleveland Heart Dataset)

**Goals**
- Reload the base dataset (independent of Step 1)
- Impute missing values:
  - `ca`, `thal` → mode (most frequent)
  - all other numeric features (except `target`) → median
- Scale feature columns to [0, 1] (MinMaxScaler)
- Create binary label: `target_bin = 0 if target==0 else 1`
- Build `X` (features) and `y` (label) for modeling
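
As a rough alternative to the cell-by-cell approach in this notebook, the imputation and scaling goals above could also be bundled into a single scikit-learn `ColumnTransformer`, which makes it easier to fit the preprocessing on a training split only. The sketch below is not what this notebook does; the toy frame and the names `toy`, `preprocess`, and `X_toy` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy frame standing in for two of the Cleveland features
toy = pd.DataFrame({
    "age": [40.0, 50.0, np.nan, 60.0],
    "ca":  [0.0, 0.0, np.nan, 2.0],
})

preprocess = ColumnTransformer([
    # code-like column: mode imputation, then scaling to [0, 1]
    ("mode", make_pipeline(SimpleImputer(strategy="most_frequent"), MinMaxScaler()), ["ca"]),
    # continuous column: median imputation, then scaling to [0, 1]
    ("median", make_pipeline(SimpleImputer(strategy="median"), MinMaxScaler()), ["age"]),
])

# Output columns follow transformer order: ca first, then age
X_toy = preprocess.fit_transform(toy)
print(X_toy)
```

Note that the transformer reorders columns (here `ca` before `age`), so column names would need to be tracked separately if this form were used.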


In [1]:
# Import libraries
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

In [2]:
# 2.1 — Load dataset and basic cleaning
# Path assumes this notebook lives in notebooks/ and data/ is its sibling (one level up)
csv_path = "../data/cleveland.csv"  # renamed from processed.cleveland.data

# Load raw file (no header in file)
df = pd.read_csv(csv_path, header=None)

# Assign UCI column names
df.columns = [
    "age","sex","cp","trestbps","chol","fbs","restecg",
    "thalach","exang","oldpeak","slope","ca","thal","target"
]

# Make '?' explicit NaN, then coerce all columns to numeric
df = df.replace("?", np.nan)
for c in df.columns:
    df[c] = pd.to_numeric(df[c], errors="coerce")

# Quick sanity check
print("Shape:", df.shape)
print(df.isna().sum().sort_values(ascending=False).head(5))
df.head()

Shape: (303, 14)
ca      4
thal    2
age     0
sex     0
cp      0
dtype: int64


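|   | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63.0 | 1.0 | 1.0 | 145.0 | 233.0 | 1.0 | 2.0 | 150.0 | 0.0 | 2.3 | 3.0 | 0.0 | 6.0 | 0 |
| 1 | 67.0 | 1.0 | 4.0 | 160.0 | 286.0 | 0.0 | 2.0 | 108.0 | 1.0 | 1.5 | 2.0 | 3.0 | 3.0 | 2 |
| 2 | 67.0 | 1.0 | 4.0 | 120.0 | 229.0 | 0.0 | 2.0 | 129.0 | 1.0 | 2.6 | 2.0 | 2.0 | 7.0 | 1 |
| 3 | 37.0 | 1.0 | 3.0 | 130.0 | 250.0 | 0.0 | 0.0 | 187.0 | 0.0 | 3.5 | 3.0 | 0.0 | 3.0 | 0 |
| 4 | 41.0 | 0.0 | 2.0 | 130.0 | 204.0 | 0.0 | 2.0 | 172.0 | 0.0 | 1.4 | 1.0 | 0.0 | 3.0 | 0 |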
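
The `?` placeholders could also be handled while parsing, via the `na_values` argument of `pd.read_csv`. The sketch below uses a hypothetical three-row excerpt (not real rows from the file) in the same headerless layout; `raw`, `cols`, and `toy` are illustrative names only.

```python
import io
import pandas as pd

# Hypothetical excerpt: headerless, comma-separated, '?' marks a missing value
raw = io.StringIO(
    "63.0,1.0,0.0,6.0,0\n"
    "67.0,1.0,?,3.0,2\n"
    "57.0,0.0,1.0,?,1\n"
)
cols = ["age", "sex", "ca", "thal", "target"]

# na_values="?" turns the placeholder into NaN during parsing,
# so the affected columns are read as float64 directly
toy = pd.read_csv(raw, header=None, names=cols, na_values="?")
print(toy.dtypes)
print(toy.isna().sum())
```

Unlike `pd.to_numeric(..., errors="coerce")`, this only catches the literal `?` token; any other non-numeric string would leave its column as `object`, so the coercion loop used in the cell above is the more defensive choice.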


In [3]:
# 2.2 — Impute missing values
data = df.copy()

# Mode imputation for code-like columns
mode_cols = ["ca", "thal"]
mode_imputer = SimpleImputer(strategy="most_frequent")
data[mode_cols] = mode_imputer.fit_transform(data[mode_cols])

# Median imputation for all features (exclude target); ca and thal were
# filled by the mode step above, so the median pass leaves them unchanged
num_cols = data.columns.drop(["target"])
median_imputer = SimpleImputer(strategy="median")
data[num_cols] = median_imputer.fit_transform(data[num_cols])

# Verify no missings remain
print("Remaining NaNs after imputation:\n", data.isna().sum().sum())


Remaining NaNs after imputation:
 0
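
To see which fill value an imputer actually learned, a fitted `SimpleImputer` exposes it through its `statistics_` attribute. A minimal sketch on hypothetical toy columns (the values below are made up, not taken from the dataset):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy columns with one gap each
toy = pd.DataFrame({
    "ca":   [0.0, 0.0, 1.0, np.nan],
    "chol": [200.0, 240.0, np.nan, 300.0],
})

mode_imp = SimpleImputer(strategy="most_frequent").fit(toy[["ca"]])
median_imp = SimpleImputer(strategy="median").fit(toy[["chol"]])

# statistics_ holds one learned fill value per fitted column
print("ca fill value:", mode_imp.statistics_[0])
print("chol fill value:", median_imp.statistics_[0])
```

The same attribute could be read from `mode_imputer` and `median_imputer` in the cell above to record what was filled in.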


In [4]:
# 2.3 — Scale features to [0, 1]
feature_cols = data.columns.drop(["target"])
scaler = MinMaxScaler()
data[feature_cols] = scaler.fit_transform(data[feature_cols])

# Peek
data[feature_cols].head()


|   | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.708333 | 1.0 | 0.0 | 0.481132 | 0.244292 | 1.0 | 1.0 | 0.603053 | 0.0 | 0.370968 | 1.0 | 0.0 | 0.75 |
| 1 | 0.791667 | 1.0 | 1.0 | 0.622642 | 0.365297 | 0.0 | 1.0 | 0.282443 | 1.0 | 0.241935 | 0.5 | 1.0 | 0.0 |
| 2 | 0.791667 | 1.0 | 1.0 | 0.245283 | 0.23516 | 0.0 | 1.0 | 0.442748 | 1.0 | 0.419355 | 0.5 | 0.666667 | 1.0 |
| 3 | 0.166667 | 1.0 | 0.666667 | 0.339623 | 0.283105 | 0.0 | 0.0 | 0.885496 | 0.0 | 0.564516 | 1.0 | 0.0 | 0.0 |
| 4 | 0.25 | 0.0 | 0.333333 | 0.339623 | 0.178082 | 0.0 | 1.0 | 0.770992 | 0.0 | 0.225806 | 0.0 | 0.0 | 0.0 |
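
`MinMaxScaler` with default settings maps each column through `(x - min) / (max - min)`. The sketch below checks that by hand on a hypothetical single column; the values 29 and 77 are chosen because the scaled ages shown above (63 → 0.708333, 41 → 0.25) imply that range for `age`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature column spanning 29..77
x = np.array([[29.0], [41.0], [63.0], [77.0]])

scaler = MinMaxScaler().fit(x)
scaled = scaler.transform(x)

# Same result by hand: (x - min) / (max - min), computed per column
manual = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))
print(np.allclose(scaled, manual))

# inverse_transform maps scaled values back to the original units
restored = scaler.inverse_transform(scaled)
print(np.allclose(restored, x))
```

Keeping the fitted `scaler` object around is what makes the inverse mapping, and consistent scaling of new rows, possible later.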


In [5]:
# 2.4 — Binary label + final feature/label sets
data["target_bin"] = (data["target"] > 0).astype(int)

X = data.drop(columns=["target", "target_bin"])
y = data["target_bin"]

print("X shape:", X.shape)
print("y counts:\n", y.value_counts())

# Feature range sanity (should be within [0, 1])
mins = X.min().round(3)
maxs = X.max().round(3)
range_preview = pd.DataFrame({"min": mins, "max": maxs})
print("\nFeature ranges (first 10 features):")
print(range_preview.head(10))


X shape: (303, 13)
y counts:
 0    164
1    139
Name: target_bin, dtype: int64

Feature ranges (first 10 features):
          min  max
age       0.0  1.0
sex       0.0  1.0
cp        0.0  1.0
trestbps  0.0  1.0
chol      0.0  1.0
fbs       0.0  1.0
restecg   0.0  1.0
thalach   0.0  1.0
exang     0.0  1.0
oldpeak   0.0  1.0
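
One way to double-check the binarisation in 2.4 is to cross-tabulate the original codes against the binary label. The sketch below uses a hypothetical label series covering the codes 0 to 4 (the range documented for the UCI target); `target`, `target_bin`, and `mapping` are illustrative names that do not touch the notebook's variables.

```python
import pandas as pd

# Hypothetical labels covering the codes 0..4
target = pd.Series([0, 1, 2, 3, 4, 0])

# Any non-zero code becomes the positive class
target_bin = (target > 0).astype(int)

# Rows: original code, columns: binary label
mapping = pd.crosstab(target, target_bin)
print(mapping)
```

Every non-zero row should have all of its counts in column 1, and row 0 all of its counts in column 0.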
