# Section 2: Data Preprocessing and Preparation

## Purpose of Preprocessing

This section documents the preprocessing decisions applied to the dataset prior to model training. The goal is not to maximize model performance, but to ensure a fair and consistent comparison between different classification algorithms.

All preprocessing steps are designed to reflect realistic data preparation practices commonly encountered in healthcare-related datasets.


## Target Variable Binarization

The original target variable encodes multiple levels of heart disease severity. For this project, the target is converted into a binary variable:

- 0 → No heart disease
- 1 → Presence of heart disease

This formulation reflects a clinically relevant screening task where the primary objective is to identify whether heart disease is present, rather than estimating severity.


In [9]:
# import important libraries
from ucimlrepo import fetch_ucirepo
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.impute import SimpleImputer

In [2]:
# Fetch dataset
heart_disease = fetch_ucirepo(id=45)

# data (pandas dataframes)
X = heart_disease.data.features
y = heart_disease.data.targets

### The dataset is reloaded here to keep preprocessing self-contained

In [3]:
# print dataset information
# DataFrame.info() prints its report directly and returns None,
# so it is called on its own rather than inside an f-string
X.info()
y.info()
print(f"Dataset Shape: \n{X.shape}\n {y.shape}")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 13 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       303 non-null    int64  
 1   sex       303 non-null    int64  
 2   cp        303 non-null    int64  
 3   trestbps  303 non-null    int64  
 4   chol      303 non-null    int64  
 5   fbs       303 non-null    int64  
 6   restecg   303 non-null    int64  
 7   thalach   303 non-null    int64  
 8   exang     303 non-null    int64  
 9   oldpeak   303 non-null    float64
 10  slope     303 non-null    int64  
 11  ca        299 non-null    float64
 12  thal      301 non-null    float64
dtypes: float64(3), int64(10)
memory usage: 30.9 KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   num     303 non-null    int64
dtypes: int64(1)
memory usage: 2.5 KB
Dataset Inf

### Converting Multi-class to Binary Classification

The original target variable has 5 classes (0-4):
- Class 0: No heart disease
- Classes 1-4: Varying levels of heart disease severity

We're converting this to binary classification:
- 0 → No disease
- 1 → Disease present (combining all severity levels 1-4)

This simplifies the problem and yields a more balanced class distribution than the original five-class target.

In [7]:
# Convert target to binary
y = (y > 0).astype(int)

# Check new distribution
print(y.value_counts().sort_index())

num
0      164
1      139
Name: count, dtype: int64


### Missing Value Handling

After converting to binary classification, we need to address missing values in the dataset.

The features `ca` (number of major vessels) and `thal` (thalassemia) contain missing values. Both features are clinically relevant for heart disease prediction, so we keep them rather than dropping them.

We'll use a simple, transparent imputation strategy: filling missing values with the median of each feature. This preserves the dataset size and, because only a handful of values are missing, has little effect on each feature's distribution.

In [8]:
# Check for missing values
print("Missing values per feature:")
print(X.isnull().sum())
print(f"\nTotal missing values: {X.isnull().sum().sum()}")

Missing values per feature:
age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          4
thal        2
dtype: int64

Total missing values: 6


In [10]:
# impute missing values with median
imputer = SimpleImputer(strategy='median')
X_imputed = imputer.fit_transform(X)

# convert back to DataFrame to preserve column names
X = pd.DataFrame(X_imputed, columns=X.columns)

In [11]:
# verify no missing values remain
print("Missing values after imputation:")
print(X.isnull().sum().sum())
print(f"\nDataset size preserved: {X.shape}")

Missing values after imputation:
0

Dataset size preserved: (303, 13)


### Why median rather than mode or mean?

- The median is more robust to outliers than the mean, so extreme values do not distort the imputed value
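
To make the robustness argument concrete, the sketch below compares the mean and the median on a small made-up sample containing one extreme value. The numbers are illustrative assumptions and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical values with one extreme outlier (not from the dataset)
values = pd.Series([200, 210, 220, 230, 240, 600])

mean_value = values.mean()      # pulled upward by the single outlier
median_value = values.median()  # stays near the bulk of the data

print(f"mean:   {mean_value:.1f}")
print(f"median: {median_value:.1f}")
```

For discrete codes such as `ca` and `thal`, the most frequent value (`SimpleImputer(strategy="most_frequent")`) would also be a defensible choice; with so few missing entries the difference is likely small.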

### Feature Type Identification

Before scaling or encoding, we need to identify which features are numerical and which are categorical.

Numerical features are continuous values (age, blood pressure, cholesterol) that can be scaled.

Categorical features are discrete categories (sex, chest pain type, thalassemia) that will need encoding later.

Understanding feature types is important because:
- Numerical features benefit from scaling
- Categorical features need encoding (one-hot, label, frequency, target encoding)
- Different feature types behave differently in models

In [None]:
# separate features by dtype
categorical_features = X.select_dtypes(include=["object"]).columns
numerical_features = X.select_dtypes(include=["int64", "float64"]).columns


In [14]:
# view separated features 
print(f"CATEGORICAL FEATURES: \n {categorical_features.tolist()} \n NUMERICAL FEATURES: \n {numerical_features.tolist()}")

CATEGORICAL FEATURES: 
 [] 
 NUMERICAL FEATURES: 
 ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal']


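### Outcome
- As shown, every feature is stored with a numeric dtype
- There are no object-dtype columns, so no string categories need encoding. Features such as sex, chest pain type, and thalassemia are still categorical in meaning; they are already integer-coded, which a dtype check cannot detect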
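
Since integer-coded categories pass a dtype check as numeric, counting distinct values is one rough way to flag them. The sketch below uses a small made-up frame and a hypothetical cutoff (`MAX_CATEGORIES`); neither comes from the notebook, and the cutoff would need adjusting for real data.

```python
import pandas as pd

# Made-up rows mimicking integer-coded columns (not the real data)
df = pd.DataFrame({
    "age": [63, 67, 41, 56, 62, 57],
    "chol": [233, 286, 204, 236, 268, 354],
    "sex": [1, 1, 0, 1, 0, 0],
    "cp": [1, 4, 2, 2, 4, 3],
})

# Hypothetical cutoff: columns with at most this many distinct values
# are treated as categorical-like
MAX_CATEGORIES = 4

unique_counts = df.nunique()
categorical_like = unique_counts[unique_counts <= MAX_CATEGORIES].index.tolist()
continuous_like = unique_counts[unique_counts > MAX_CATEGORIES].index.tolist()

print("categorical-like:", categorical_like)
print("continuous-like:", continuous_like)
```

This is only a heuristic. A count feature with few distinct values would be flagged too, so the result should be checked against the dataset documentation.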

In [15]:
# Display summary of numerical features
print("\n Numerical Feature Summary: \n")
print(X[numerical_features].describe())


 Numerical Feature Summary: 

              age         sex          cp    trestbps        chol         fbs  \
count  303.000000  303.000000  303.000000  303.000000  303.000000  303.000000   
mean    54.438944    0.679868    3.158416  131.689769  246.693069    0.148515   
std      9.038662    0.467299    0.960126   17.599748   51.776918    0.356198   
min     29.000000    0.000000    1.000000   94.000000  126.000000    0.000000   
25%     48.000000    0.000000    3.000000  120.000000  211.000000    0.000000   
50%     56.000000    1.000000    3.000000  130.000000  241.000000    0.000000   
75%     61.000000    1.000000    4.000000  140.000000  275.000000    0.000000   
max     77.000000    1.000000    4.000000  200.000000  564.000000    1.000000   

          restecg     thalach       exang     oldpeak       slope          ca  \
count  303.000000  303.000000  303.000000  303.000000  303.000000  303.000000   
mean     0.990099  149.607261    0.326733    1.039604    1.600660    0.663366

### Feature Type Conclusion

- Numerical features may need scaling, either standardization or min-max normalization
- Categorical features need encoding (one-hot, label, frequency, or target encoding), chosen according to what suits each feature
- Different preprocessing steps apply to each type
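
One common way to apply different steps to each feature type is scikit-learn's `ColumnTransformer`. The sketch below is a minimal illustration on made-up rows; the column grouping (`numeric_cols`, `categorical_cols`) is an assumption chosen by meaning rather than dtype, not a decision taken from this notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows for illustration (not the real data)
df = pd.DataFrame({
    "age": [63.0, 67.0, 41.0, 56.0],
    "chol": [233.0, 286.0, 204.0, 236.0],
    "cp": [1.0, 4.0, 2.0, 4.0],
})

# Assumed grouping of columns by meaning, not by dtype
numeric_cols = ["age", "chol"]
categorical_cols = ["cp"]

preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),                             # scale continuous columns
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),  # expand coded categories
])

# Fitted on the whole toy frame for brevity; with real data the
# transformer would be fitted on the training split only
transformed = preprocess.fit_transform(df)
print(transformed.shape)
```

The output has two scaled columns plus one indicator column per distinct `cp` value in the toy frame.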

### Feature Scaling

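Scaling is a critical preprocessing step, but its effect differs by model. Without it, features with large numeric ranges can dominate Euclidean distances or variance-based computations.

**Why scaling matters in this project:**
- KNN relies on distance calculations, so features with larger ranges dominate the result unless they are scaled
- SVM (Support Vector Machine) is also sensitive to feature scale, because its margins and kernels depend on distances or inner products between samples, and it generally needs scaled features to perform well
- Logistic Regression is less sensitive to scaling but can still benefit from it, for example through faster solver convergence and more evenly applied regularization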
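
The effect on distance-based models can be seen with three hypothetical patients described only by age and cholesterol. The patient values are invented for illustration; the standard deviations are rounded from the summary table above (age about 9, cholesterol about 52).

```python
import numpy as np

# Three hypothetical patients as [age in years, cholesterol in mg/dl]
a = np.array([40.0, 200.0])
b = np.array([70.0, 210.0])  # much older than a, similar cholesterol
c = np.array([41.0, 260.0])  # similar age to a, higher cholesterol

# Rounded standard deviations from the summary table (age ~9, chol ~52).
# Dividing by std is enough here: centering does not change distances.
std = np.array([9.0, 52.0])

def euclidean(u, v):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(u - v))

raw_ab, raw_ac = euclidean(a, b), euclidean(a, c)
scaled_ab, scaled_ac = euclidean(a / std, b / std), euclidean(a / std, c / std)

print(f"raw:    d(a,b)={raw_ab:.2f}  d(a,c)={raw_ac:.2f}")
print(f"scaled: d(a,b)={scaled_ab:.2f}  d(a,c)={scaled_ac:.2f}")
```

On the raw values, cholesterol dominates and `c` looks farther from `a` than `b` does. After dividing by the standard deviations the ordering flips, so a KNN model would pick different neighbours.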

**Our approach:**
We'll treat scaling as an experimental factor in this project. We'll scale the numerical features with StandardScaler (mean=0, std=1) and with min-max scaling, then compare model performance with each scaler on its own, with both, and without scaling.

**Important:** We only scale the features (X), never the target (y). We must also avoid data leakage by fitting the scaler only on training data.
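
A minimal sketch of leakage-free scaling follows, using synthetic data in place of the real features. The split settings (`test_size=0.2`, `random_state=42`) and the variable names are assumptions for illustration, not choices taken from this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic stand-in for the feature matrix and binary target
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=[54.0, 247.0], scale=[9.0, 52.0], size=(100, 2))
y_demo = rng.integers(0, 2, size=100)

# Split first, so the test rows never influence the scaler
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

scalers = {"standard": StandardScaler(), "minmax": MinMaxScaler()}
scaled = {}
for name, scaler in scalers.items():
    # Learn the statistics from the training split only,
    # then reuse them unchanged on the test split
    X_train_s = scaler.fit_transform(X_train)
    X_test_s = scaler.transform(X_test)
    scaled[name] = (X_train_s, X_test_s)

# The target arrays y_train and y_test are left untouched
print(X_train.shape, X_test.shape)
```

The same rule applies to any step that learns from the data, including the median imputer: fit it on the training split and only transform the test split.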