# 0. Data Cleaning

This [Kaggle data](https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset) contains patient records, including whether each patient has had a stroke. The raw data needs some cleanup before we can train a model on it. In this section we apply the cleaning steps the model requires.

In [75]:
# import dependencies

%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
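# The seaborn styles were renamed in matplotlib 3.6; the old name 'seaborn-darkgrid' is deprecated.
plt.style.use('seaborn-v0_8-darkgrid')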

In [76]:
df = pd.read_csv("./data/stroke-data-raw.csv")

In [77]:
df.head()

|   | id | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9046 | Male | 67.0 | 0 | 1 | Yes | Private | Urban | 228.69 | 36.6 | formerly smoked | 1 |
| 1 | 51676 | Female | 61.0 | 0 | 0 | Yes | Self-employed | Rural | 202.21 | NaN | never smoked | 1 |
| 2 | 31112 | Male | 80.0 | 0 | 1 | Yes | Private | Rural | 105.92 | 32.5 | never smoked | 1 |
| 3 | 60182 | Female | 49.0 | 0 | 0 | Yes | Private | Urban | 171.23 | 34.4 | smokes | 1 |
| 4 | 1665 | Female | 79.0 | 1 | 0 | Yes | Self-employed | Rural | 174.12 | 24.0 | never smoked | 1 |


A first look at these rows already shows several things we want to drop, encode, modify, or fill in.

To Drop:
- id => an identifier that carries no information about the patient. Its values are essentially arbitrary, so the model cannot learn anything useful from it.

To One-Hot-Encode:
Most models require numeric input and cannot work with text directly, so the categorical features below need to be one-hot-encoded. Each feature is split into one column per category, which holds "1" if the category applies and "0" if not.

- gender
- work_type
- Residence_type
- smoking_status
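
As a minimal sketch of what one-hot-encoding does, the example below encodes a small made-up frame (the rows are illustrative, not taken from the dataset). Passing `dtype=int` to `pd.get_dummies` is an assumption about how one might get 0/1 columns directly; the cells further down convert boolean columns in a separate step instead.

```python
import pandas as pd

# Toy data, made up for illustration only.
toy = pd.DataFrame({
    "age": [67.0, 61.0, 49.0],
    "smoking_status": ["formerly smoked", "never smoked", "smokes"],
})

# One new 0/1 column per category; the original text column is removed.
encoded = pd.get_dummies(toy, columns=["smoking_status"], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

Each row has exactly one "1" across the new columns, marking the category that applies to it.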

To Modify slightly:

Binary features such as "hypertension" already use "1" and "0". The feature "ever_married" uses "Yes" and "No" instead, so it needs to be converted to "1" and "0" to match the other binary features and give the model numeric input.

## 0.1 Dropping Features

In [78]:
df_dropped = df.drop(columns=['id'])

In [79]:
df_dropped.head()

|   | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Male | 67.0 | 0 | 1 | Yes | Private | Urban | 228.69 | 36.6 | formerly smoked | 1 |
| 1 | Female | 61.0 | 0 | 0 | Yes | Self-employed | Rural | 202.21 | NaN | never smoked | 1 |
| 2 | Male | 80.0 | 0 | 1 | Yes | Private | Rural | 105.92 | 32.5 | never smoked | 1 |
| 3 | Female | 49.0 | 0 | 0 | Yes | Private | Urban | 171.23 | 34.4 | smokes | 1 |
| 4 | Female | 79.0 | 1 | 0 | Yes | Self-employed | Rural | 174.12 | 24.0 | never smoked | 1 |


## 0.2 One-hot-encoding

### 0.2.1 Gender

In [80]:
df_dropped['gender'].value_counts()

gender
Female    2994
Male      2115
Other        1
Name: count, dtype: int64

Looking at gender, we can see that a single patient is recorded as "Other". One row is too few for a model to learn anything from a separate category, so we merge it into the most common category, which is "Female".
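
The fill value does not have to be hard-coded. The sketch below, on a made-up column, looks up the most frequent category with `Series.mode()` and uses it as the replacement; the variable names are illustrative and not part of the notebook.

```python
import pandas as pd

# Toy column, made up for illustration; "Other" appears only once.
gender = pd.Series(
    ["Female", "Male", "Female", "Other", "Male", "Female"], name="gender"
)

# mode() returns a Series (there can be ties), so take its first entry.
most_common = gender.mode().iloc[0]

# Replace the rare value with the most frequent category.
cleaned = gender.replace({"Other": most_common})

print(most_common)
print(cleaned.value_counts())
```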

In [81]:
df_gender = df_dropped.copy()

df_gender.loc[df_gender['gender'] == "Other", 'gender'] = "Female"

In [82]:
df_gender['gender'].value_counts()

gender
Female    2995
Male      2115
Name: count, dtype: int64

In [83]:
df_gender = pd.get_dummies(df_gender, columns=['gender'])

df_gender.columns = df_gender.columns.str.lower()

# Convert boolean columns to integers (0s and 1s).
boolean_columns = df_gender.select_dtypes(include=bool).columns
df_gender[boolean_columns] = df_gender[boolean_columns].astype(int)

In [84]:
df_gender.head()

|   | age | hypertension | heart_disease | ever_married | work_type | residence_type | avg_glucose_level | bmi | smoking_status | stroke | gender_female | gender_male |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 67.0 | 0 | 1 | Yes | Private | Urban | 228.69 | 36.6 | formerly smoked | 1 | 0 | 1 |
| 1 | 61.0 | 0 | 0 | Yes | Self-employed | Rural | 202.21 | NaN | never smoked | 1 | 1 | 0 |
| 2 | 80.0 | 0 | 1 | Yes | Private | Rural | 105.92 | 32.5 | never smoked | 1 | 0 | 1 |
| 3 | 49.0 | 0 | 0 | Yes | Private | Urban | 171.23 | 34.4 | smokes | 1 | 1 | 0 |
| 4 | 79.0 | 1 | 0 | Yes | Self-employed | Rural | 174.12 | 24.0 | never smoked | 1 | 1 | 0 |


### 0.2.2 Work Type

In [85]:
df_work = df_gender.copy()

df_work['work_type'].value_counts()

work_type
Private          2925
Self-employed     819
children          687
Govt_job          657
Never_worked       22
Name: count, dtype: int64

There are five work types, and their labels are inconsistent: most start with an uppercase letter, one is all lowercase, and one contains a dash. We would like the resulting column names to be as homogeneous as possible.
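
One way to keep the names consistent is a small helper that lowercases them and swaps dashes for underscores. The sketch below is a hypothetical helper (`normalize_columns` is not part of the notebook) applied to column names shaped like the ones `pd.get_dummies` would produce for this feature.

```python
import pandas as pd


def normalize_columns(columns: pd.Index) -> pd.Index:
    """Lowercase column names and replace dashes with underscores."""
    # regex=False treats "-" as a literal character, not a pattern.
    return columns.str.lower().str.replace("-", "_", regex=False)


# Example names, assumed to match the get_dummies output for work_type.
cols = pd.Index(
    ["work_type_Govt_job", "work_type_Self-employed", "work_type_children"]
)

print(normalize_columns(cols).tolist())
```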

In [86]:
df_work = pd.get_dummies(df_work, columns=['work_type'])

df_work.columns = df_work.columns.str.lower().str.replace('-', '_').str.replace('work_type_', 'work_')

# Convert boolean columns to integers (0s and 1s).
boolean_columns = df_work.select_dtypes(include=bool).columns
df_work[boolean_columns] = df_work[boolean_columns].astype(int)

In [87]:
df_work.head()

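|   | age | hypertension | heart_disease | ever_married | residence_type | avg_glucose_level | bmi | smoking_status | stroke | gender_female | gender_male | work_govt_job | work_never_worked | work_private | work_self_employed | work_children |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 67.0 | 0 | 1 | Yes | Urban | 228.69 | 36.6 | formerly smoked | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 1 | 61.0 | 0 | 0 | Yes | Rural | 202.21 | NaN | never smoked | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 80.0 | 0 | 1 | Yes | Rural | 105.92 | 32.5 | never smoked | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 3 | 49.0 | 0 | 0 | Yes | Urban | 171.23 | 34.4 | smokes | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | 79.0 | 1 | 0 | Yes | Rural | 174.12 | 24.0 | never smoked | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |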


## Modify ever_married slightly

We need to change "Yes" to 1 and "No" to 0 so that this feature is numeric, like the other binary features.

In [88]:
# Map to the integers 1/0 (not the strings '1'/'0') so the column is numeric.
df_work['ever_married'] = df_work['ever_married'].map({'Yes': 1, 'No': 0})

df_work.head()
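
A conversion like this can be checked before it is applied. The sketch below uses a made-up column and illustrative variable names: `Series.map` turns any value missing from the mapping into NaN, so it first looks for unexpected labels and only then converts.

```python
import pandas as pd

# Toy column, made up for illustration.
ever_married = pd.Series(["Yes", "No", "Yes"], name="ever_married")

mapping = {"Yes": 1, "No": 0}

# Any label outside the mapping would become NaN, so fail early instead.
unexpected = set(ever_married.dropna().unique()) - set(mapping)
if unexpected:
    raise ValueError(f"Unexpected values in ever_married: {unexpected}")

converted = ever_married.map(mapping).astype(int)

print(converted.tolist())
print(converted.dtype)
```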