# Data Preprocessing


After gathering some insights in the [Exploratory Data Analysis stage](./Exploratory_Data_Analysis.ipynb),  
here is what we will do in this notebook:

1. Remove **Insignificant** features (RestingECG, RestingBP)
1. Remove outliers present in the features (RestingBP, Cholesterol, Oldpeak)
1. Deal with the large number of zero values (possible missing data) in the features (Cholesterol, Oldpeak)
1. Map the binary categorical non-numerical features (Sex, ExerciseAngina) to binary values (0, 1)
1. Use one-hot encoding on the other categorical non-numerical features (ChestPainType, RestingECG, ST_Slope)
1. Use the StandardScaler on the numerical features (Age, RestingBP, Cholesterol, MaxHR, Oldpeak)
1. Make the final preprocessed data set

In [1]:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split
import joblib

In [2]:
# Read the raw data
data = pd.read_csv('./data/cardiovascular_raw_dataset.csv')
data.head()

Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
0,42,M,NAP,120,240,1,Normal,194,N,0.8,Down,0
1,36,M,NAP,130,209,0,Normal,178,N,0.0,Up,0
2,56,M,ASY,150,213,1,Normal,125,Y,1.0,Flat,1
3,37,F,NAP,130,211,0,Normal,142,N,0.0,Up,0
4,51,M,ASY,120,0,1,Normal,104,N,0.0,Flat,1


## Remove **Insignificant** features

In [3]:
# data = data.drop(columns=['RestingECG', 'RestingBP'])
# data.head()

#### Disclaimer
I actually tried removing them, but the results (accuracy) got worse, so both features are kept and the cell above is left commented out.

## Remove Outliers

As we saw in the [Exploratory Data Analysis stage](./Exploratory_Data_Analysis.ipynb),  
the number of outliers in the features (RestingBP, Cholesterol, Oldpeak) is really small,  
only 2.2% of the data, so it is **safe** to remove them.
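
To make the "how much data do we lose" check concrete, here is a minimal sketch that measures the share of rows a threshold filter would drop. The `toy` frame and its values are made up for illustration. Only the thresholds (Cholesterol <= 460, 0 <= Oldpeak < 5) come from this notebook.

```python
import pandas as pd

# Toy stand-in for the raw data; the values are invented for illustration.
toy = pd.DataFrame({
    "Cholesterol": [240, 209, 0, 603, 211, 180, 250, 300],
    "Oldpeak": [0.8, 0.0, 1.0, 0.5, 6.2, -1.1, 2.0, 0.0],
})

# Boolean mask of the rows that survive the outlier filter.
keep = (toy["Cholesterol"] <= 460) & (toy["Oldpeak"] >= 0) & (toy["Oldpeak"] < 5)

# The mean of a boolean mask is the fraction of True values.
removed_share = 1 - keep.mean()
print(f"Rows removed: {(~keep).sum()} of {len(toy)} ({removed_share:.1%})")
```

Running the same two lines on the real `data` frame before filtering is a quick way to confirm the share reported in the EDA stage.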

In [4]:
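# Remove the outliers (thresholds come from the EDA stage):
# keep rows with Cholesterol <= 460 and 0 <= Oldpeak < 5.
# .copy() makes the result an independent frame, so adding columns to it
# later does not trigger pandas' SettingWithCopyWarning.
data = data[(data['Cholesterol'] <= 460) &
            ((data['Oldpeak'] < 5) & (data['Oldpeak'] >= 0))].copy()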
data.head()

Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
0,42,M,NAP,120,240,1,Normal,194,N,0.8,Down,0
1,36,M,NAP,130,209,0,Normal,178,N,0.0,Up,0
2,56,M,ASY,150,213,1,Normal,125,Y,1.0,Flat,1
3,37,F,NAP,130,211,0,Normal,142,N,0.0,Up,0
4,51,M,ASY,120,0,1,Normal,104,N,0.0,Flat,1


## Deal with the large number of zeros

### Oldpeak

After some research, I found that a zero value for **Oldpeak** is fairly  
normal, and since it is fairly evenly distributed between people with and without heart disease,  
I decided to keep it and add a new feature indicating whether the value of **Oldpeak**  
is zero or not, as it can be beneficial to the model.
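
One way to check how zero **Oldpeak** values are spread across the two classes is a normalized cross-tabulation. This is only a sketch on invented numbers (the `toy` frame is hypothetical), not the analysis from the EDA notebook.

```python
import pandas as pd

# Invented values, used only to show the shape of the check.
toy = pd.DataFrame({
    "Oldpeak": [0.0, 0.0, 1.5, 0.0, 2.0, 0.0, 0.8, 0.0],
    "HeartDisease": [0, 1, 1, 0, 1, 1, 0, 0],
})

# 1 where Oldpeak is exactly zero, 0 otherwise.
zero_flag = (toy["Oldpeak"] == 0).astype(int).rename("Zero_Oldpeak")

# Each row sums to 1: the share of HeartDisease = 0 / 1 within each flag value.
table = pd.crosstab(zero_flag, toy["HeartDisease"], normalize="index")
print(table)
```

If the two rows of the table differ noticeably, the flag carries information the model can use.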

In [5]:
# (Feature Engineering) Create a new feature to indicate a zero oldpeak
data['Zero_Oldpeak'] = (data['Oldpeak'] == 0).astype(int)
data.head()

Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease,Zero_Oldpeak
0,42,M,NAP,120,240,1,Normal,194,N,0.8,Down,0,0
1,36,M,NAP,130,209,0,Normal,178,N,0.0,Up,0,1
2,56,M,ASY,150,213,1,Normal,125,Y,1.0,Flat,1,0
3,37,F,NAP,130,211,0,Normal,142,N,0.0,Up,0,1
4,51,M,ASY,120,0,1,Normal,104,N,0.0,Flat,1,1


### Cholesterol

Since a cholesterol level of zero is impossible for a human,  
these values must indicate the absence of a measurement. As we observed  
in the graphs, most individuals with a cholesterol level of zero have heart disease,  
so imputing the zero values with the mean (average) could have a bad effect on the  
model (according to this dataset). The best solution is therefore to add a new feature  
that lets the model know whether this data was missing or not.

This will allow the model to learn that missing cholesterol data is  
a relevant factor in predicting heart disease.

In [6]:
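# (Feature Engineering) Create a new feature to indicate a zero cholesterol.
# Note: despite its name, Have_Cholesterol_Measurement is 1 when the value
# is zero (measurement missing) and 0 when a measurement is present.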
data['Have_Cholesterol_Measurement'] = (data['Cholesterol'] == 0).astype(int)
data.head()

Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease,Zero_Oldpeak,Have_Cholesterol_Measurement
0,42,M,NAP,120,240,1,Normal,194,N,0.8,Down,0,0,0
1,36,M,NAP,130,209,0,Normal,178,N,0.0,Up,0,1,0
2,56,M,ASY,150,213,1,Normal,125,Y,1.0,Flat,1,0,0
3,37,F,NAP,130,211,0,Normal,142,N,0.0,Up,0,1,0
4,51,M,ASY,120,0,1,Normal,104,N,0.0,Flat,1,1,1


### Disclaimer

Normally, I would have used the mean (average) to fill in this data, but since 19% of the cholesterol measurements are zeros and 88% of these zeros are associated with people having heart disease, I couldn't do that.
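
The two percentages above can be reproduced with a couple of boolean masks. The sketch below runs on an invented `toy` frame, so its numbers are not the dataset's. It also shows why mean imputation hides the signal: every missing row ends up with the same "average" value.

```python
import pandas as pd

# Invented values for illustration only.
toy = pd.DataFrame({
    "Cholesterol": [240, 0, 213, 0, 0, 211, 0, 260, 198, 305],
    "HeartDisease": [0, 1, 1, 1, 0, 0, 1, 1, 0, 0],
})

missing = toy["Cholesterol"] == 0

share_missing = missing.mean()                                # share of zero readings
hd_rate_missing = toy.loc[missing, "HeartDisease"].mean()     # heart disease rate among zeros
hd_rate_measured = toy.loc[~missing, "HeartDisease"].mean()   # rate among measured rows

# Mean imputation: every missing row receives the same value,
# so the model can no longer tell that the reading was absent.
measured_mean = toy.loc[~missing, "Cholesterol"].mean()
imputed = toy["Cholesterol"].where(~missing, measured_mean)

print(f"{share_missing:.0%} zeros, {hd_rate_missing:.0%} of them with heart disease")
```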

## Map the binary categorical non-numerical features

In [7]:
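# Map Sex(M, F) -> Sex(0, 1) and ExerciseAngina(N, Y) -> ExerciseAngina(0, 1)
# pd.factorize assigns codes in order of first appearance, so this mapping
# relies on 'M' and 'N' being the first values seen in each column.
binary_categorical_non_numerical = ['Sex', 'ExerciseAngina']

data[binary_categorical_non_numerical] = data[binary_categorical_non_numerical].apply(
    lambda x: pd.factorize(x)[0])
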
data.head()

Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease,Zero_Oldpeak,Have_Cholesterol_Measurement
0,42,0,NAP,120,240,1,Normal,194,0,0.8,Down,0,0,0
1,36,0,NAP,130,209,0,Normal,178,0,0.0,Up,0,1,0
2,56,0,ASY,150,213,1,Normal,125,1,1.0,Flat,1,0,0
3,37,1,NAP,130,211,0,Normal,142,0,0.0,Up,0,1,0
4,51,0,ASY,120,0,1,Normal,104,0,0.0,Flat,1,1,1
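
Because `pd.factorize` numbers categories by order of first appearance, the same code can produce a flipped encoding on data whose first row differs. An explicit dictionary avoids that. The sketch below is an alternative I did not use in this notebook, and the `mappings` dictionary and `toy` frame are hypothetical names.

```python
import pandas as pd

# Toy frame whose first row is F / Y, unlike the real data (M / N first).
toy = pd.DataFrame({
    "Sex": ["F", "M", "M", "F"],
    "ExerciseAngina": ["Y", "N", "N", "Y"],
})

# factorize: codes follow first appearance, so here F -> 0 and Y -> 0.
factorized = toy.apply(lambda col: pd.factorize(col)[0])

# Explicit mapping: the codes are fixed regardless of row order.
mappings = {"Sex": {"M": 0, "F": 1}, "ExerciseAngina": {"N": 0, "Y": 1}}
explicit = toy.copy()
for col, mapping in mappings.items():
    explicit[col] = explicit[col].map(mapping)

print(factorized["Sex"].tolist(), explicit["Sex"].tolist())
```

This matters most when a single new record has to be encoded later, since a single row always factorizes to 0.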


## Use One-Hot encoding and StandardScaler

Now, to preprocess the remaining features.
We will use one-hot encoding on the remaining categorical features and the  
StandardScaler on the numerical features with a large scale.
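
As a quick illustration of what these two transformers do, the sketch below scales one numeric column and one-hot encodes one categorical column. The arrays are small invented samples, not the dataset itself.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# StandardScaler: subtract the column mean and divide by the column
# standard deviation (population std, i.e. ddof=0).
ages = np.array([[42.0], [36.0], [56.0], [37.0], [51.0]])
scaled = StandardScaler().fit_transform(ages)
manual = (ages - ages.mean(axis=0)) / ages.std(axis=0)
print(np.allclose(scaled, manual))

# OneHotEncoder: one 0/1 column per category, categories sorted alphabetically.
slopes = np.array([["Down"], ["Up"], ["Flat"], ["Up"]])
encoder = OneHotEncoder()
one_hot = encoder.fit_transform(slopes).toarray()  # sparse result made dense
print(encoder.categories_[0])
print(one_hot)
```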

In [8]:
# Before using the StandardScaler, we need to split the data, so we don't
# use the validation data to fit the preprocessor
x = data.drop(['HeartDisease'], axis=1)
y = data[['HeartDisease']]

x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.2, random_state=7)
print(x_train.shape, y_train.shape, x_val.shape, y_val.shape)

(575, 13) (575, 1) (144, 13) (144, 1)


In [9]:
# Get the names of the categorical features which we will use one-hot encoding on
# and the numerical features which we will use the StandardScaler on
cat_features = data.select_dtypes(include=['O']).columns.tolist()
num_features = data.select_dtypes(exclude=['O']).columns
num_features = [col for col in num_features if data[col].max() > 1]
remaining_features = [col for col in data.columns if col not in cat_features and col not in num_features and col != 'HeartDisease']

print(cat_features)
print(num_features)
print(remaining_features)

['ChestPainType', 'RestingECG', 'ST_Slope']
['Age', 'RestingBP', 'Cholesterol', 'MaxHR', 'Oldpeak']
['Sex', 'FastingBS', 'ExerciseAngina', 'Zero_Oldpeak', 'Have_Cholesterol_Measurement']


In [10]:
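# Generate and fit the preprocessor on the training data only.
# Columns not listed below are dropped by default (remainder='drop');
# they are concatenated back in a later cell.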
preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), num_features),  # Apply StandardScaler to numerical features
        ("cat", OneHotEncoder(), cat_features),   # Apply OneHotEncoder to categorical features
    ]
)

# Fit the preprocessor (the transform itself happens in a later cell)
preprocessor.fit(x_train)

In [11]:
# Save the preprocessor to use it later
joblib.dump(preprocessor, './model/preprocessor.pkl')
joblib.dump(preprocessor, './app/app_ml/model/preprocessor.pkl')


['./model/preprocessor.pkl']
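
For completeness, this is roughly how a saved transformer can be loaded back and applied to new data. It is a sketch that uses a small `StandardScaler` and a temporary directory instead of the real preprocessor and the paths above.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Fit a small stand-in transformer on invented values.
scaler = StandardScaler().fit(np.array([[120.0], [130.0], [150.0]]))

with tempfile.TemporaryDirectory() as tmp:
    path = str(Path(tmp) / "preprocessor.pkl")
    joblib.dump(scaler, path)      # save the fitted object
    restored = joblib.load(path)   # load it back, e.g. inside the app

# The restored object reuses the statistics learned during fit.
new_row = np.array([[140.0]])
print(np.allclose(restored.transform(new_row), scaler.transform(new_row)))
```

Only `transform` should be called on the loaded object. Calling `fit` again would overwrite the statistics learned from the training data.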

In [12]:
# Transform the data using the preprocessor
processed_x_train = preprocessor.transform(x_train)
processed_x_val = preprocessor.transform(x_val)
print(processed_x_train.shape, processed_x_val.shape)

(575, 15) (144, 15)


In [13]:
# Get the new column names
ohe_feature_names = preprocessor.named_transformers_['cat'].get_feature_names_out(cat_features)
new_column_names = num_features + list(ohe_feature_names)

print("Shape before conversion:", processed_x_train.shape, processed_x_val.shape)
# Re-create data frames for the train and validation data
processed_x_train = pd.DataFrame(processed_x_train, columns=new_column_names)
processed_x_val = pd.DataFrame(processed_x_val, columns=new_column_names)
print("Shape after conversion:", processed_x_train.shape, processed_x_val.shape)
processed_x_train.head()

Shape before conversion: (575, 15) (144, 15)
Shape after conversion: (575, 15) (144, 15)


Unnamed: 0,Age,RestingBP,Cholesterol,MaxHR,Oldpeak,ChestPainType_ASY,ChestPainType_ATA,ChestPainType_NAP,ChestPainType_TA,RestingECG_LVH,RestingECG_Normal,RestingECG_ST,ST_Slope_Down,ST_Slope_Flat,ST_Slope_Up
0,-0.857863,-1.289041,0.325176,0.145543,0.082996,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
1,0.012296,-0.105705,0.807048,0.883046,-0.896941,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
2,1.426303,0.324599,0.142064,-0.553144,1.062934,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
3,-1.184172,-0.105705,0.411913,0.261991,-0.798947,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
4,-0.205244,-0.320857,-0.002497,0.766599,0.082996,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0


In [14]:
# Concatenate the rest of the data that didn't need preprocessing
print("Shape before concat:", processed_x_train.shape, processed_x_val.shape)

# Reset indices to ensure proper alignment
processed_x_train = processed_x_train.reset_index(drop=True)
x_train_remaining = x_train[remaining_features].reset_index(drop=True)
processed_x_val = processed_x_val.reset_index(drop=True)
x_val_remaining = x_val[remaining_features].reset_index(drop=True)

print("Shape before concat:", x_train_remaining.shape, x_val_remaining.shape)

# Concatenate along columns
processed_x_train = pd.concat([processed_x_train, x_train_remaining], axis=1)
processed_x_val = pd.concat([processed_x_val, x_val_remaining], axis=1)
print("Shape after concat:", processed_x_train.shape, processed_x_val.shape)
processed_x_train.head()

Shape before concat: (575, 15) (144, 15)
Shape before concat: (575, 5) (144, 5)
Shape after concat: (575, 20) (144, 20)


Unnamed: 0,Age,RestingBP,Cholesterol,MaxHR,Oldpeak,ChestPainType_ASY,ChestPainType_ATA,ChestPainType_NAP,ChestPainType_TA,RestingECG_LVH,RestingECG_Normal,RestingECG_ST,ST_Slope_Down,ST_Slope_Flat,ST_Slope_Up,Sex,FastingBS,ExerciseAngina,Zero_Oldpeak,Have_Cholesterol_Measurement
0,-0.857863,-1.289041,0.325176,0.145543,0.082996,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0,0,1,0,0
1,0.012296,-0.105705,0.807048,0.883046,-0.896941,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1,1,1,1,0
2,1.426303,0.324599,0.142064,-0.553144,1.062934,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0,0,1,0,0
3,-1.184172,-0.105705,0.411913,0.261991,-0.798947,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0,1,1,0,0
4,-0.205244,-0.320857,-0.002497,0.766599,0.082996,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0,1,1,0,0


In [15]:
# Saving the processed data to csv files
processed_x_train.to_csv('./data/features_training_data.csv', index=False)
processed_x_val.to_csv('./data/features_validation_data.csv', index=False)

y_train.to_csv('./data/target_training_data.csv', index=False)
y_val.to_csv('./data/target_validation_data.csv', index=False)