## Data Preprocessing

Before a model can be trained, the dataset has to be cleaned and converted into a form a learning algorithm can use. For this project, the preprocessing steps are:
- Handling missing values
- Encoding categorical variables
- Scaling/normalization
- Splitting data into training and test sets

In [15]:
# Load libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder

In [27]:
# Load dataset
df = pd.read_csv("../data/heart_disease.csv")
df.head()

| | Age | Gender | Blood Pressure | Cholesterol Level | Exercise Habits | Smoking | Family Heart Disease | Diabetes | BMI | High Blood Pressure | ... | High LDL Cholesterol | Alcohol Consumption | Stress Level | Sleep Hours | Sugar Consumption | Triglyceride Level | Fasting Blood Sugar | CRP Level | Homocysteine Level | Heart Disease Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 56.0 | Male | 153.0 | 155.0 | High | Yes | Yes | No | 24.991591 | Yes | ... | No | High | Medium | 7.633228 | Medium | 342.0 | NaN | 12.969246 | 12.38725 | No |
| 1 | 69.0 | Female | 146.0 | 286.0 | High | No | Yes | Yes | 25.221799 | No | ... | No | Medium | High | 8.744034 | Medium | 133.0 | 157.0 | 9.355389 | 19.298875 | No |
| 2 | 46.0 | Male | 126.0 | 216.0 | Low | No | No | No | 29.855447 | No | ... | Yes | Low | Low | 4.44044 | Low | 393.0 | 92.0 | 12.709873 | 11.230926 | No |
| 3 | 32.0 | Female | 122.0 | 293.0 | High | Yes | Yes | No | 24.130477 | Yes | ... | Yes | Low | High | 5.249405 | High | 293.0 | 94.0 | 12.509046 | 5.961958 | No |
| 4 | 60.0 | Male | 166.0 | 242.0 | Low | Yes | Yes | Yes | 20.486289 | Yes | ... | No | Low | High | 7.030971 | High | 263.0 | 154.0 | 10.381259 | 8.153887 | No |


### 1. Handling Missing Values

Missing values can bias a model and reduce its overall accuracy. For numerical columns, we impute the median, because the median is far less sensitive to outliers than the mean. For categorical features, we fill the empty cells with the most frequent value (the mode), which keeps every entry within the column's existing set of categories.
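To illustrate why the median is preferred here, the following minimal sketch compares the mean and median on a column containing one outlier, then applies median and mode imputation. The DataFrame `toy` and its values are hypothetical and are not taken from the heart disease dataset.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not from the heart disease dataset).
toy = pd.DataFrame({
    "cholesterol": [180.0, 200.0, 220.0, np.nan, 900.0],  # 900 is an outlier
    "smoking": ["Yes", "No", "No", np.nan, "No"],
})

# The mean is pulled toward the outlier; the median is not.
mean_value = toy["cholesterol"].mean()      # (180 + 200 + 220 + 900) / 4 = 375.0
median_value = toy["cholesterol"].median()  # midpoint of 200 and 220 = 210.0

# Fill the numerical gap with the median and the categorical gap with the mode.
toy["cholesterol"] = toy["cholesterol"].fillna(median_value)
toy["smoking"] = toy["smoking"].fillna(toy["smoking"].mode()[0])

print(mean_value, median_value)
print(toy)
```

In this toy column, a mean-imputed value of 375 would sit well above three of the four observed readings, while the median of 210 stays close to the typical ones.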

Most columns have 30 or fewer missing values, so imputing them should have little effect on the model. "Alcohol Consumption", however, has over 2,000 missing values, which is more than a fifth of the 10,000 rows. Imputing that many values reliably would require a more complex method, so dropping the column is the simpler way to clean the data. Alcohol's effect on heart disease also depends on many other factors, such as the type of alcohol, so dropping the column is unlikely to remove a key predictor.
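Counts like these can be obtained by tallying missing values per column. The sketch below shows one way to do that on a small toy DataFrame; the values are hypothetical, and the 25% cutoff in `threshold` is an arbitrary illustration, not a rule used in this project.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; column names mirror the dataset's style.
toy = pd.DataFrame({
    "Age": [56.0, 69.0, np.nan, 32.0, 60.0],
    "Smoking": ["Yes", "No", "No", "Yes", "Yes"],
    "Alcohol Consumption": ["High", np.nan, np.nan, "Low", np.nan],
})

# Count and fraction of missing values per column, largest first.
missing = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "fraction": toy.isnull().mean(),
}).sort_values("n_missing", ascending=False)
print(missing)

# Arbitrary illustrative cutoff: flag columns missing in more than 25% of rows.
threshold = 0.25
to_drop = missing.index[missing["fraction"] > threshold].tolist()
print(to_drop)
```

On the real data, the same summary would be computed on `df` instead of `toy`.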

In [38]:
# Drop "Alcohol Consumption", which has too many missing values to impute simply
new_df = df.drop(columns=["Alcohol Consumption"])

# Fill missing values in numerical columns with the column median
num_cols = new_df.select_dtypes(include="number").columns
for column in num_cols:
    new_df[column] = new_df[column].fillna(new_df[column].median())

# Fill missing values in categorical columns with the column mode (most frequent value)
cat_cols = new_df.select_dtypes(include="object").columns
for column in cat_cols:
    new_df[column] = new_df[column].fillna(new_df[column].mode()[0])

new_df.shape

(10000, 20)
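The shape alone does not show whether any gaps remain. A quick sanity check is to confirm that the total count of missing values is zero and that exactly one column was dropped. The sketch below repeats the same steps on a small toy DataFrame; `toy`, `cleaned`, and the values are hypothetical stand-ins for `df` and `new_df`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df (hypothetical values).
toy = pd.DataFrame({
    "Age": [56.0, np.nan, 46.0],
    "Gender": ["Male", "Female", np.nan],
    "Alcohol Consumption": [np.nan, "Medium", np.nan],
})

# Same steps as above: drop the sparse column, then impute median and mode.
cleaned = toy.drop(columns=["Alcohol Consumption"])
num_cols = cleaned.select_dtypes(include="number").columns
cleaned[num_cols] = cleaned[num_cols].fillna(cleaned[num_cols].median())
cat_cols = cleaned.columns.difference(num_cols)
for column in cat_cols:
    cleaned[column] = cleaned[column].fillna(cleaned[column].mode()[0])

# Total number of missing cells left, and the shapes before and after.
remaining = int(cleaned.isnull().sum().sum())
print(remaining)
print(toy.shape, cleaned.shape)
```

On the real data, the equivalent check would be `new_df.isnull().sum().sum()`, which should return 0 if every column was imputed.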