# Stroke Prediction Data Cleansing and Preprocessing

Let's create pipelines for cleansing and preprocessing the healthcare-dataset-stroke-data dataset, based on the findings of the Stroke_Prediction_Data_Exploration notebook.

Source: https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset

#### Acknowledgements
    (Confidential Source) - Use only for educational purposes
    If you use this dataset in your research, please credit the author.

The cleaning and preprocessing requirements identified in the Stroke_Prediction_Data_Exploration notebook are implemented below as six steps.

In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [3]:
data = pd.read_csv('Data/healthcare-dataset-stroke-data.csv', index_col='id')

In [4]:
data

id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,never smoked,1
31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1
...,...,...,...,...,...,...,...,...,...,...,...
18234,Female,80.0,1,0,Yes,Private,Urban,83.75,,never smoked,0
44873,Female,81.0,0,0,Yes,Self-employed,Urban,125.20,40.0,never smoked,0
19723,Female,35.0,0,0,Yes,Self-employed,Rural,82.99,30.6,never smoked,0
37544,Male,51.0,0,0,Yes,Private,Rural,166.29,25.6,formerly smoked,0


In [39]:
train_set, test_set = train_test_split(data, test_size=0.2, random_state=24)
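The split above is purely random. If the `stroke` target turns out to be imbalanced, a stratified split is one option to keep the class share equal in both parts. The following is a minimal sketch on a made-up toy frame (the `toy` data and its 10% positive rate are assumptions for illustration, not values from the dataset):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with a rare positive class (10 of 100 rows); values are made up.
toy = pd.DataFrame({
    "age": range(100),
    "stroke": [1] * 10 + [0] * 90,
})

# stratify keeps the share of each stroke value the same in both parts.
toy_train, toy_test = train_test_split(
    toy, test_size=0.2, random_state=24, stratify=toy["stroke"]
)

train_rate = toy_train["stroke"].mean()
test_rate = toy_test["stroke"].mean()
print(train_rate, test_rate)
```

Whether stratification is worth it here depends on the class balance reported in the exploration notebook.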

First step - drop the columns that are not useful

In [9]:
cols_drop = ['gender', 'Residence_type', 'bmi']
train_set = train_set.drop(cols_drop, axis=1).copy()

In [10]:
train_set

id,age,hypertension,heart_disease,ever_married,work_type,avg_glucose_level,smoking_status,stroke
32826,6.0,0,0,No,children,87.74,Unknown,0
45472,22.0,0,0,Yes,Private,138.55,never smoked,0
49753,34.0,0,0,No,Self-employed,81.54,formerly smoked,0
25982,24.0,0,0,No,Private,91.21,formerly smoked,0
38243,37.0,0,0,Yes,Private,101.07,Unknown,0
...,...,...,...,...,...,...,...,...
23413,26.0,0,0,No,Private,97.24,never smoked,0
22548,34.0,0,0,Yes,Private,91.02,never smoked,0
45053,64.0,0,0,Yes,Govt_job,239.64,formerly smoked,0
68438,51.0,0,0,Yes,Private,90.78,never smoked,0


Second step - encode the yes/no feature (ever_married) as 1/0

In [11]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import LabelBinarizer
from sklearn.pipeline import Pipeline

In [12]:
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names]

In [13]:
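class MyLabelBinarizer(LabelBinarizer):
    # Pipeline calls fit_transform(X, y), but LabelBinarizer.fit_transform
    # accepts a single argument, so y is accepted here and ignored.
    def fit_transform(self, X, y=None):
        # Uses TransformerMixin.fit_transform, i.e. self.fit(X).transform(X)
        return super(LabelBinarizer, self).fit_transform(X)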

In [14]:
cat_yn_pipeline = Pipeline([
        ("select_bin", DataFrameSelector(['ever_married'])),
        ("bin_encoder", MyLabelBinarizer()),
    ])
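To make the effect of the binarizer concrete, here is a small sketch on a made-up column (the `toy_married` frame is an assumption for illustration). With two classes, `LabelBinarizer` sorts the labels and returns a single 0/1 column:

```python
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

# Made-up values standing in for the ever_married column.
toy_married = pd.DataFrame({"ever_married": ["Yes", "No", "No", "Yes"]})

binarizer = LabelBinarizer()
encoded = binarizer.fit_transform(toy_married["ever_married"])

print(binarizer.classes_)  # classes are sorted, so "No" maps to 0 and "Yes" to 1
print(encoded.ravel())
```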

Third step - set smoking_status 'Unknown' to 'never smoked' where age < 10,
             and work_type 'Never_worked' to 'children' where age < 17

In [36]:
class CustomLimitedImputer(BaseEstimator, TransformerMixin):
    ''' Simple customized imputer that applies the following rules:
        smoking_status becomes "never smoked" if age < 10 and smoking_status == "Unknown"
        work_type becomes "children" if age < 17 and work_type == "Never_worked"
        The helper column age is dropped from the output. '''
    def __init__(self, attribute_names):
        assert all(attr in ['smoking_status', 'work_type'] for attr in attribute_names), 'Only smoking_status and work_type are supported'
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        X = X.copy()
        for attr in self.attribute_names:
            if attr == 'smoking_status':
                X.loc[(X.age < 10) & (X.smoking_status == 'Unknown'), 'smoking_status'] = 'never smoked'
            elif attr == 'work_type':
                X.loc[(X.age < 17) & (X.work_type == 'Never_worked'), 'work_type'] = 'children'
        X.drop(['age'], axis=1, inplace=True)
        return X.values
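The two rules can be checked on a handful of made-up rows. This sketch applies the same `.loc` assignments with plain pandas, outside the transformer class (the `toy` frame is an assumption for illustration):

```python
import pandas as pd

# Made-up rows covering both rules and the cases they must leave untouched.
toy = pd.DataFrame({
    "age": [6.0, 15.0, 30.0, 16.0],
    "smoking_status": ["Unknown", "Unknown", "Unknown", "smokes"],
    "work_type": ["children", "Never_worked", "Private", "Never_worked"],
})

fixed = toy.copy()
# Rule 1: children under 10 with unknown smoking status are treated as non-smokers.
fixed.loc[(fixed.age < 10) & (fixed.smoking_status == "Unknown"), "smoking_status"] = "never smoked"
# Rule 2: anyone under 17 who never worked is put in the "children" work type.
fixed.loc[(fixed.age < 17) & (fixed.work_type == "Never_worked"), "work_type"] = "children"

print(fixed)
```

Only the first row changes its smoking status; the 15-year-old and the adult keep "Unknown" because they do not satisfy the age condition.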

In [16]:
# cat_cust_pipeline = Pipeline([
#         ("select_cat", DataFrameSelector(['smoking_status', 'work_type', 'age'])), # age is used as a parameter
#         ("cat_encoder", CustomLimitedImputer(['smoking_status', 'work_type'])),
#     ])

Fourth step - one-hot encode the work_type and smoking_status features

In [17]:
from sklearn.preprocessing import OneHotEncoder

In [18]:
# cat_oh_pipeline = Pipeline([
#         ("select_cat", DataFrameSelector(['smoking_status', 'work_type'])),
#         ("cat_encoder", OneHotEncoder(sparse=False)),
#     ])

The two pipelines above operate on the same columns, so they need to be combined into one; otherwise the result would contain duplicated columns.

In [37]:
cat_oh_pipeline = Pipeline([
        ("select_cat", DataFrameSelector(['smoking_status', 'work_type', 'age'])), # age is used as a parameter
        ("imputer", CustomLimitedImputer(['smoking_status', 'work_type'])),
        ("cat_encoder", OneHotEncoder(sparse=False)), # scikit-learn >= 1.2 renames this parameter to sparse_output
    ])
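As a rough illustration of the encoding step, the sketch below one-hot encodes two made-up categorical columns (the `toy` frame is an assumption). It uses the encoder's default sparse output and converts it with `.toarray()`, which sidesteps the `sparse` / `sparse_output` naming difference between scikit-learn versions:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up rows; only a subset of the real category values is used.
toy = pd.DataFrame({
    "smoking_status": ["never smoked", "smokes", "Unknown"],
    "work_type": ["Private", "children", "Private"],
})

encoder = OneHotEncoder()
one_hot = encoder.fit_transform(toy).toarray()

# Categories are sorted per column; output columns follow that order,
# first all smoking_status categories, then all work_type categories.
print(encoder.categories_)
print(one_hot)
```

The column order shown by `categories_` is what determines the position of each 0/1 value in the transformed rows.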

Fifth step - standardize the numeric features

In [19]:
from sklearn.preprocessing import StandardScaler

In [20]:
num_pipeline = Pipeline([
        ("select_numeric", DataFrameSelector(['age', 'avg_glucose_level'])),
        ("scale", StandardScaler()),
    ])
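Standardization replaces each value x with (x - mean) / std, computed per column. A minimal sketch on made-up ages (the `ages` array is an assumption) compares `StandardScaler` with the manual formula:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-column input standing in for a numeric feature.
ages = np.array([[20.0], [40.0], [60.0]])

scaler = StandardScaler()
scaled = scaler.fit_transform(ages)

# Same computation by hand; StandardScaler uses the population std (ddof=0),
# which is also NumPy's default.
manual = (ages - ages.mean(axis=0)) / ages.std(axis=0)

print(scaled.ravel())
print(manual.ravel())
```

After scaling, the column has mean 0 and standard deviation 1, so features measured on different scales become comparable.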

Sixth step - build the combined pipeline

In [21]:
from sklearn.pipeline import FeatureUnion

In [38]:
preprocess_pipeline = FeatureUnion(transformer_list=[
        ('cat_yn_pipeline', cat_yn_pipeline),
        ('cat_oh_pipeline', cat_oh_pipeline),
        ('num_pipeline', num_pipeline),
    ])
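`FeatureUnion` runs each transformer on the same input and concatenates the results column-wise. The sketch below shows this with two stock scalers on a made-up array (`X_toy` and the choice of `MinMaxScaler` are assumptions for illustration only):

```python
import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up input with two numeric columns.
X_toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

union = FeatureUnion(transformer_list=[
    ("standard", StandardScaler()),
    ("minmax", MinMaxScaler()),
])
combined = union.fit_transform(X_toy)

# Two columns from each transformer, placed side by side in list order.
print(combined.shape)
print(combined)
```

The output columns follow the order of `transformer_list`, which is why the order of the sub-pipelines fixes the layout of the final feature matrix.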

In [23]:
train_set

id,age,hypertension,heart_disease,ever_married,work_type,avg_glucose_level,smoking_status,stroke
32826,6.0,0,0,No,children,87.74,Unknown,0
45472,22.0,0,0,Yes,Private,138.55,never smoked,0
49753,34.0,0,0,No,Self-employed,81.54,formerly smoked,0
25982,24.0,0,0,No,Private,91.21,formerly smoked,0
38243,37.0,0,0,Yes,Private,101.07,Unknown,0
...,...,...,...,...,...,...,...,...
23413,26.0,0,0,No,Private,97.24,never smoked,0
22548,34.0,0,0,Yes,Private,91.02,never smoked,0
45053,64.0,0,0,Yes,Govt_job,239.64,formerly smoked,0
68438,51.0,0,0,Yes,Private,90.78,never smoked,0


Now the pipeline can be fitted on the training set and used to transform the data.

In [40]:
X_train = preprocess_pipeline.fit_transform(train_set[train_set.columns[:-1]])

In [41]:
y_train = train_set.stroke.values

In [42]:
X_train[0], y_train[0]

(array([ 0.        ,  0.        ,  0.        ,  1.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  1.        ,
        -1.64739047, -0.40515978]),
 0)

In [43]:
X_test = preprocess_pipeline.transform(test_set[test_set.columns[:-1]])

In [44]:
y_test = test_set.stroke.values

In [45]:
X_test[0]

array([ 0.        ,  0.        ,  0.        ,  1.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  1.        ,
       -1.51440768, -0.66862289])
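Note that the test set goes through `transform` only, so it is scaled with the parameters learned from the training set. A small sketch with made-up values (both arrays are assumptions) shows what that means for a scaler:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up train and test values for one numeric feature.
train_ages = np.array([[20.0], [40.0], [60.0]])
test_ages = np.array([[40.0], [80.0]])

scaler = StandardScaler().fit(train_ages)   # learns mean and std from train only
test_scaled = scaler.transform(test_ages)   # reuses those statistics on test

print(scaler.mean_, scaler.scale_)
print(test_scaled.ravel())
```

A test value equal to the training mean maps to 0, regardless of what the test set's own mean is; this keeps information from the test set out of the fitted parameters.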

Let's explore some algorithms in the next notebook.