### Stroke Prediction Dataset
# Preprocessing

Data source: https://www.kaggle.com/fedesoriano/stroke-prediction-dataset <br>
Data updated date: 2021-01-26

In [1]:
# import libraries needed

import pandas as pd
import numpy as np

# make notebook full width for better viewing

from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

In [2]:
# importing data
df = pd.read_csv(r'data/stroke.csv', index_col='id')

# Data Metadescription

| Feature | Data type | Other descriptions | Processing needed? | Missing values? | Encoding needed? |
| ------- | --------- | ------------------ | ------------------ | --------------- | ---------------- |
| gender | categorical | Female, Male, Other | T | T (Other) | T |
| age | float64 | | F | F | F |
| hypertension | categorical | 0, 1 | T | F | F |
| heart_disease | categorical | 0, 1 | T | F | F |
| ever_married | categorical | 0, 1 | T | F | F |
| work_type | categorical | Private, Self-employed, children, Govt_job, Never_worked | T | F | T |
| residence_type | categorical | Rural, Urban | T | F | T |
| avg_glucose_level | float64 | | F | F | F |
| bmi | float64 | | F | T (replaced with mean) | F |
| smoking_status | categorical | never smoked, unknown, formerly smoked, smokes | T | T (unknown) | T |
| stroke | categorical | 0, 1 | T | F | F |
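
The entries in the table can be cross-checked against the data itself. The following is a minimal sketch, assuming a small hypothetical frame `toy` in place of `df` (the values are made up for illustration and are not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df; the values are illustrative only.
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Other", "Female"],
    "age": [67.0, 61.0, 80.0, 49.0],
    "bmi": [36.6, np.nan, 32.5, 34.4],
    "smoking_status": ["formerly smoked", "never smoked", "Unknown", "smokes"],
})

# One row per feature: dtype, number of NaNs, number of distinct values.
summary = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "n_missing": toy.isna().sum(),
    "n_unique": toy.nunique(),
})
print(summary)

# Placeholder categories such as "Other" or "Unknown" are not NaN,
# so they only show up in the value counts.
print(toy["smoking_status"].value_counts())
```

Running the same two checks on `df` would show which features really contain NaNs and which only contain placeholder categories.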

# 1. Separate df into numerical and categorical

In [3]:
numerical = df.select_dtypes(include=['float64'])
categorical = df.select_dtypes(exclude=['float64'])

# 2. Binary Encoding

Why binary (dummy) encoding over full one-hot encoding:
The main reason is to avoid multicollinearity, although it may reduce interpretability later.

For example, suppose the column heart_disease were one-hot encoded as "heart_disease: 0" and "heart_disease: 1". Because the two categories are mutually exclusive, a row with "heart_disease: 0" set to True necessarily has "heart_disease: 1" set to False, so the second column adds no information and is perfectly collinear with the first.

Thus, the columns below are encoded with one fewer indicator column than they have categories (`drop_first=True`) to avoid the problem above.
- gender
- hypertension
- heart_disease
- ever_married
- work_type
- residence_type
- smoking_status
- stroke
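
The difference between the two encodings can be seen on a toy column. This is a minimal sketch, assuming a hypothetical frame `toy` whose category names mirror `work_type`; it is not the notebook's actual data:

```python
import pandas as pd

# Hypothetical toy column; the category names mirror work_type.
toy = pd.DataFrame({"work_type": ["Private", "Govt_job", "children", "Private"]})

# Full one-hot encoding: one indicator column per category.
one_hot = pd.get_dummies(toy, dtype=int)

# Dummy encoding: the first category in sort order is dropped and
# becomes the baseline, represented by all zeros.
dummy = pd.get_dummies(toy, drop_first=True, dtype=int)

# The one-hot columns always sum to 1 per row, which is the redundancy
# that causes perfect multicollinearity with an intercept term.
print(one_hot.sum(axis=1).tolist())
print(list(one_hot.columns))
print(list(dummy.columns))
```

The dropped category is not lost: a row with zeros in every remaining `work_type_*` column belongs to the baseline category.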

In [4]:
categorical_processed = pd.get_dummies(categorical, drop_first=True)

In [5]:
# double checking the first 2 rows
df.head(2)

| id | gender | age | hypertension | heart_disease | ever_married | work_type | residence_type | avg_glucose_level | bmi | smoking_status | stroke |
| -- | ------ | --- | ------------ | ------------- | ------------ | --------- | -------------- | ----------------- | --- | -------------- | ------ |
| 9046 | male | 67.0 | 0 | 1 | 1 | private | urban | 228.69 | 36.6 | formerly smoked | 1 |
| 51676 | female | 61.0 | 0 | 0 | 1 | self-employed | rural | 202.21 | 28.893237 | never smoked | 1 |


In [6]:
categorical_processed.head(2)

| id | hypertension | heart_disease | ever_married | stroke | gender_male | gender_other | work_type_govt_job | work_type_never_worked | work_type_private | work_type_self-employed | residence_type_urban | smoking_status_never smoked | smoking_status_smokes | smoking_status_unknown |
| -- | ------------ | ------------- | ------------ | ------ | ----------- | ------------ | ------------------ | ---------------------- | ----------------- | ----------------------- | -------------------- | --------------------------- | --------------------- | ---------------------- |
| 9046 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 51676 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |


# 3. Create a processed dataframe that includes the encoding above.

In [7]:
df.shape

(5110, 11)

In [8]:
df = pd.concat([numerical, categorical_processed], axis=1)

# 4. Train test split
- Train data for SMOTE
- Test data without SMOTE treatment

In [9]:
y = df['stroke']
X = df.drop(['stroke'], axis=1)

In [10]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3)

In [11]:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(3577, 16)
(3577,)
(1533, 16)
(1533,)
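
The split above is random and unseeded, so the number of stroke cases in each part can vary from run to run. One possible refinement, shown here as a sketch on hypothetical data (`X_toy` and `y_toy` are made up, and this is not what the notebook ran), is to pass `stratify` and `random_state` to `train_test_split`:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 950 negatives, 50 positives (5%).
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({"age": rng.uniform(0, 82, size=1000)})
y_toy = pd.Series([0] * 950 + [1] * 50, name="stroke")

# stratify keeps the class ratio the same in both splits;
# random_state makes the split reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=42
)

# Share of positive cases in each split.
print(y_tr.mean(), y_te.mean())
```

With a rare positive class, stratifying helps ensure the untouched test set contains a representative share of stroke cases.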


# 5. SMOTE for training dataset

In [12]:
df['stroke'].value_counts()

0    4861
1     249
Name: stroke, dtype: int64

In [13]:
from imblearn.over_sampling import SMOTE
from collections import Counter

In [14]:
# model for SMOTE
oversample = SMOTE()

# fit the SMOTE model over the training set
X_train, y_train = oversample.fit_resample(X_train, y_train)

In [15]:
counter = Counter(y_train)
print(counter)

Counter({0: 3412, 1: 3412})
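
To illustrate what SMOTE does conceptually, the sketch below generates synthetic minority points by interpolating between a point and one of its nearest minority neighbours. It is a simplified NumPy illustration on hypothetical 2-D data, not imblearn's implementation; the function name `synthesize` is made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical minority-class points in a 2-D feature space.
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.5]])

def synthesize(points, k, rng):
    """Create one synthetic point by SMOTE-style interpolation."""
    # Pick a random minority point.
    i = rng.integers(len(points))
    # Find its k nearest minority neighbours (index 0 is the point itself).
    dists = np.linalg.norm(points - points[i], axis=1)
    neighbours = np.argsort(dists)[1:k + 1]
    j = rng.choice(neighbours)
    # Place the new point at a random position on the segment i -> j.
    gap = rng.uniform(0.0, 1.0)
    return points[i] + gap * (points[j] - points[i])

synthetic = np.array([synthesize(minority, k=2, rng=rng) for _ in range(5)])
print(synthetic)
```

Because the new points are interpolated, indicator columns may receive values between 0 and 1 during resampling. For data that mixes numerical and categorical features, imblearn also provides `SMOTENC`, which may be worth exploring.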


# 6. Numerical standardization

Note: this step is done after the train/test split so that the StandardScaler is fitted on the training data only and then used to transform both X_train and X_test.

#### Why StandardScaler?

StandardScaler standardizes each feature by subtracting its mean and dividing by its standard deviation, so the scaled feature has mean 0 and unit variance. It does not change the shape of the distribution, so the result is not necessarily normally distributed.

MinMaxScaler rescales each feature to a fixed range, [0, 1] by default. Because that range is set by the minimum and maximum values, a single extreme outlier compresses all the inliers into a narrow part of the range.

In the presence of outliers, StandardScaler does not guarantee balanced feature scales either, because outliers influence the empirical mean and standard deviation.

#### Our dataset does contain outliers, so more robust scaling techniques should be explored for optimization.

- age: no outliers
- avg_glucose_level: outliers (both ends)
- bmi: outliers (especially the obese end)
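
The effect of an outlier on different scalers can be compared on a small example. This is a sketch on hypothetical glucose-like values, and `RobustScaler` is included only as one candidate for the exploration mentioned above, not as the notebook's chosen method:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Hypothetical glucose-like values: five inliers and one extreme outlier.
x = np.array([[80.0], [85.0], [90.0], [95.0], [100.0], [270.0]])

ranges = {}
for scaler in (StandardScaler(), MinMaxScaler(), RobustScaler()):
    scaled = scaler.fit_transform(x).ravel()
    # Spread of the five inliers after scaling: a small value means the
    # outlier has squeezed the inliers together.
    ranges[type(scaler).__name__] = scaled[:5].max() - scaled[:5].min()

for name, inlier_range in ranges.items():
    print(f"{name:>14}: inlier range = {inlier_range:.3f}")
```

`RobustScaler` centres on the median and scales by the interquartile range, so the outlier has much less influence on how the inliers are spread.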

In [16]:
# get only the numerical values in the X_train and X_test datasets

X_train_numerical = X_train.select_dtypes(include=['float64'])
X_test_numerical = X_test.select_dtypes(include=['float64'])
columnnames_numerical = list(X_train_numerical.columns)

In [17]:
# build the model for standard scaler
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

# fit the scaler to train data
# note: the scaler is fitted on X_train only so that no information from the test set leaks into preprocessing.
scaler.fit(X_train_numerical)

# transform both X_train and X_test
X_train_numerical = scaler.transform(X_train_numerical)
X_test_numerical = scaler.transform(X_test_numerical)

In [18]:
X_train_numerical = pd.DataFrame(X_train_numerical, columns=columnnames_numerical, index=X_train.index)
X_test_numerical = pd.DataFrame(X_test_numerical, columns=columnnames_numerical, index=X_test.index)

# 7. Create preprocessed dataframes (training and testing datasets)

In [19]:
# create dataframes of X_train and X_test categorical dtypes
X_train_categorical = X_train.select_dtypes(exclude=['float64'])
X_test_categorical = X_test.select_dtypes(exclude=['float64'])

In [20]:
# concat both numerical and categorical data types together to form the new Xs.
X_train = pd.concat([X_train_numerical, X_train_categorical], axis=1)
X_test = pd.concat([X_test_numerical, X_test_categorical], axis=1)

print('The processed training dataset contains {} rows and {} features'.format(X_train.shape[0], X_train.shape[1]))
print('The processed testing dataset contains {} rows and {} features'.format(X_test.shape[0], X_test.shape[1]))

The processed training dataset contains 6824 rows and 16 features
The processed testing dataset contains 1533 rows and 16 features


In [21]:
# count the stroke (1) and no-stroke (everything else) cases in each split
y_tr_stroke, y_tr_nostroke = (y_train == 1).sum(), (y_train != 1).sum()
y_te_stroke, y_te_nostroke = (y_test == 1).sum(), (y_test != 1).sum()

print('Training dataset: {} people with a stroke and {} people with no stroke.'.format(y_tr_stroke, y_tr_nostroke))
print('Testing dataset: {} people with a stroke and {} people with no stroke.'.format(y_te_stroke, y_te_nostroke))

Training dataset: 3412 people with a stroke and 3412 people with no stroke.
Testing dataset: 84 people with a stroke and 1449 people with no stroke.


# 8. Export the data

In [22]:
X_train.to_csv('data/X_train.csv')
X_test.to_csv('data/X_test.csv')

y_train.to_csv('data/y_train.csv')
y_test.to_csv('data/y_test.csv')
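
When the exported files are read back in a later notebook, the index column needs to be restored explicitly, and a label file comes back as a one-column DataFrame rather than a Series. The sketch below shows a round trip on a hypothetical label Series, using an in-memory buffer instead of the `data/` files:

```python
import io
import pandas as pd

# Hypothetical label Series indexed by patient id.
y_toy = pd.Series(
    [1, 0, 0],
    index=pd.Index([9046, 51676, 31112], name="id"),
    name="stroke",
)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
y_toy.to_csv(buffer)
buffer.seek(0)

# index_col=0 restores the index; selecting the column gives back a Series.
y_back = pd.read_csv(buffer, index_col=0)["stroke"]
print(y_back.equals(y_toy))
```

The same pattern (`index_col=0`, then selecting the `stroke` column) should apply when loading `data/y_train.csv` and `data/y_test.csv`.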