# Data Cleaning & Preprocessing Notebook 02
This notebook applies the cleaning and standardization steps needed before fitting ML models to predict DEATH_EVENT.  

*Note:* Some of these steps are redundant or unnecessary but are shown for 'full lifecycle coverage'. Sometimes these two notebooks (i.e., 01 and 02) are combined into one for simplicity.  

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression

# Set the style of the plots
sns.set_style('darkgrid')

In [2]:
df = pd.read_csv("../data/heart_failure_clinical_records_dataset_cleaned.csv")

In [3]:
continuous_features = ['age', 'creatinine_phosphokinase', 'ejection_fraction', 'platelets', 'serum_creatinine', 'serum_sodium', 'time']
categorical_features = ['anaemia', 'diabetes', 'high_blood_pressure', 'sex', 'smoking']

In [4]:
# Detect and handle missing values (already checked)
missing_values = df.isnull().sum()
print("Missing values:")
print(missing_values)

# Handle missing values - None


Missing values:
age                         0
anaemia                     0
creatinine_phosphokinase    0
diabetes                    0
ejection_fraction           0
high_blood_pressure         0
platelets                   0
serum_creatinine            0
serum_sodium                0
sex                         0
smoking                     0
time                        0
DEATH_EVENT                 0
dtype: int64


In [5]:
# Detect outliers with the 1.5 * IQR rule (removal was already done in the 01 EDA notebook)
quantile_1 = df[continuous_features].quantile(0.25)
quantile_3 = df[continuous_features].quantile(0.75)
iqr = quantile_3 - quantile_1
lower_bound = quantile_1 - 1.5 * iqr
upper_bound = quantile_3 + 1.5 * iqr

outliers = (df[continuous_features] < lower_bound) | (df[continuous_features] > upper_bound)

print("Outliers: ")
outliers.sum()
# Handle outliers
# Skipped here because outliers were already removed in the 01 EDA notebook
# (the 3 extra outliers are ignored: they only appear because the IQR bounds are recalculated on the cleaned dataset)

Outliers: 


age                         0
creatinine_phosphokinase    0
ejection_fraction           0
platelets                   0
serum_creatinine            3
serum_sodium                0
time                        0
dtype: int64
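
As a rough illustration of what handling the flagged rows could look like, the sketch below applies the same 1.5 * IQR rule to a small made-up frame and drops any row with a flagged value. The frame `toy` and its values are assumptions for illustration only, not data from this project, and this is not necessarily how the 01 EDA notebook removed outliers.

```python
import pandas as pd

# Toy frame standing in for df[continuous_features]; the values are made up.
toy = pd.DataFrame({
    "serum_creatinine": [0.9, 1.0, 1.1, 1.2, 1.3, 9.0],
    "age": [50, 55, 60, 65, 70, 75],
})

q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# True where a value falls outside the IQR fences of its own column
outlier_mask = (toy < lower) | (toy > upper)

# Keep only rows with no flagged value in any column
toy_filtered = toy[~outlier_mask.any(axis=1)]

print(outlier_mask.sum())
print(len(toy), "->", len(toy_filtered))
```

Because the quartiles are recomputed every time, running the rule again on already-filtered data can flag a few new points, which is consistent with the 3 extra `serum_creatinine` outliers reported above.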

### Standardization and Train Split 
Some of the ML models we will use require standardized features, so we first split the data and then standardize it.  

Process:  
- Only the continuous numerical features (age, creatinine_phosphokinase, ejection_fraction, platelets, serum_creatinine, serum_sodium, time) are standardized, using StandardScaler.  

- Binary 0/1 variables are left unchanged. 

- Standardization is applied through a scikit-learn ColumnTransformer and Pipeline: it is fit only on the training data and then applied to the held-out validation set.   

- This is necessary for scale-sensitive models such as logistic regression, SVM, k-NN, and neural networks. 

- Tree-based models do not require scaling, but fitting the scaler on the training data alone prevents data leakage, and using one consistent preprocessing pipeline keeps the models comparable.
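
The fit-on-train, apply-to-held-out idea in the list above can be sketched in isolation. The arrays `train` and `val` below are synthetic stand-ins for a single continuous feature (an assumption for illustration, not the real dataset).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-ins for one continuous feature; not the real dataset
train = rng.normal(loc=60.0, scale=12.0, size=(200, 1))
val = rng.normal(loc=60.0, scale=12.0, size=(50, 1))

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean/std from the training data only
val_scaled = scaler.transform(val)          # reuses the training statistics

print("train mean/std:", train_scaled.mean(), train_scaled.std())
print("val mean/std:  ", val_scaled.mean(), val_scaled.std())
```

The held-out data will generally not come out with a mean of exactly 0 and a standard deviation of exactly 1, and that is expected: it is scaled with the training statistics, so no information from it leaks into preprocessing.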



### Models That Require Feature Scaling

| **Model** | **Why It Needs Scaling** |
|----------|---------------------------|
| **Logistic Regression** | L2 regularization assumes all features are on the same scale; otherwise coefficients are penalized inconsistently. |
| **SVM (Linear, RBF kernels)** | Uses distance and dot products; features with larger magnitude dominate the decision boundary. |
| **k-NN** | Distance-based algorithm: unscaled features distort nearest-neighbor calculations. |
| **Neural Networks (PyTorch)** | Gradient-based optimization converges faster when inputs have similar scale; prevents exploding/vanishing gradients. |
| **PCA** | Computes variance along components; features with larger variance dominate unless scaled. |
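
To make the distance argument in the table concrete, here is a minimal sketch with four made-up patients described by two features on very different scales. The column meanings (platelets, serum_creatinine) and all values are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Columns: platelets, serum_creatinine (illustrative values only)
X = np.array([
    [200000.0, 1.0],  # query patient
    [300000.0, 1.0],  # differs only in platelets
    [200000.0, 2.0],  # differs only in creatinine (doubled)
    [300000.0, 2.0],  # differs in both
])

def distances_from_first(data):
    """Euclidean distance from row 0 to every other row."""
    return np.linalg.norm(data[1:] - data[0], axis=1)

raw_d = distances_from_first(X)
scaled_d = distances_from_first(StandardScaler().fit_transform(X))

print("raw distances:   ", raw_d)
print("scaled distances:", scaled_d)
```

On the raw values the platelet difference swamps the creatinine difference by five orders of magnitude. After standardization the two single-feature differences count equally, so a distance-based model can use both.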


In [6]:
# Train/validation split (80/20, stratified on the target)
X = df.drop('DEATH_EVENT', axis=1)
y = df['DEATH_EVENT']
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
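
A quick way to check that `stratify=y` preserved the class proportions is to compare the positive rate in each split. The sketch below uses a synthetic target (`y_toy`, made-up data) instead of the real `y`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic imbalanced target; the counts are made up for illustration
y_toy = pd.Series([0] * 160 + [1] * 60, name="DEATH_EVENT")
X_toy = pd.DataFrame({"feature": range(len(y_toy))})

X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

# With stratification, the positive rate should be nearly identical everywhere
print("full:", y_toy.mean(), "train:", y_tr.mean(), "val:", y_va.mean())
```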

In [7]:
# Column Transformer + Pipeline
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), continuous_features),
        ('cat', 'passthrough', categorical_features)
    ]
)

# Example pipeline with logistic regression, will redo this part in the ML notebook
lr_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(penalty='l2', C=0.1, solver='liblinear', class_weight='balanced')),
])

In [8]:
# Verify Feature Scaling
X_train_scaled = preprocessor.fit_transform(X_train)
print("Mean: \n", X_train_scaled.mean(axis=0))
print("Std: \n", X_train_scaled.std(axis=0))


Mean: 
 [ 2.77865874e-16  6.45045780e-17  2.36930277e-16 -2.20183896e-16
  4.69829017e-16 -3.91989358e-16  7.93902498e-17  4.91620112e-01
  4.18994413e-01  3.68715084e-01  6.48044693e-01  3.35195531e-01]
Std: 
 [1.         1.         1.         1.         1.         1.
 1.         0.49992977 0.49339446 0.4824565  0.47758012 0.47205877]


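For the continuous features (the first seven columns), the mean is ~0 and the std is ~1, so scaling is working as intended. The last five columns are the binary features, which are passed through unchanged.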

### Note:
The class imbalance in DEATH_EVENT is about 2.67:1 (survived:died). This imbalance is mild, and a resampling technique such as SMOTE may add unnecessary noise and hurt generalization. 
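
For reference, the example pipeline above passes `class_weight='balanced'` to the classifier. The sketch below shows the weights that setting would produce for a made-up label vector with a 2.67:1 majority-to-minority ratio. The counts are an assumption chosen to match the ratio, not the real class counts.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up labels: 267 majority-class rows and 100 minority-class rows
y_toy = np.array([0] * 267 + [1] * 100)
classes = np.unique(y_toy)

# 'balanced' uses n_samples / (n_classes * count_per_class)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)

print(dict(zip(classes.tolist(), weights.tolist())))
```

The minority class receives a weight 2.67 times larger than the majority class, which offsets the imbalance without generating synthetic samples.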

In [9]:
# Add data to CSVs for access in next notebook (commented out to avoid overwriting)
# X_train.to_csv("../data/X_train.csv", index=False)
# X_val.to_csv("../data/X_val.csv", index=False)
# y_train.to_csv("../data/y_train.csv", index=False)
# y_val.to_csv("../data/y_val.csv", index=False)
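
When the saved files are read back in the next notebook, note that `pd.read_csv` returns a one-column DataFrame for the targets, not a Series. The round trip below is a minimal sketch using a temporary directory and a made-up target, so it does not touch the project's `../data` folder.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up target values for illustration
y_toy = pd.Series([0, 1, 0, 0, 1], name="DEATH_EVENT")

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "y_train.csv"
    y_toy.to_csv(path, index=False)
    # read_csv yields a DataFrame; selecting the column restores a Series
    y_loaded = pd.read_csv(path)["DEATH_EVENT"]

print(type(y_loaded).__name__, y_loaded.tolist())
```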
