<b>Page Summary</b><br>
    This page will do all neccessary steps to have entire dataset prepared for training. Since we do have imbalanced dataset with only approximate of 5% positive cases, we have to upsample train dataset in order to have better trained model. This will be done by Smotetomek algorithm, as a result, we get train set with 50:50 negative to postivice cases, this will make sure our model will be better trained as we will see in next page<br><br>

In bellow page we split initial dataset to train set and test set (holdout set), <b>upsampling will be done ONLY on train set</b>, in following page, test set will be used for evaluation of model performance


- Load data from previous page <i>train_data.csv</i>
- Encode labels to numeric values
- Create "binned" features to get categories from numeric data
- Normalize data to have all values in the similar scale
- Use OneHotEncoding on train set, this must be done due to algorithm we use for upsampling (Smotetomek)
- Upsample train set to have balanced dataset (since it is imbalanced)
- Revert Onehotencoded train dataset back to initial state due to that onehotencoded dataframe would have poor performance with catbooster model

___________________________________________
<b>1. Import libraries and load data and split them to train test</b>

In [38]:
import pandas as pd
import numpy as np
from catboost import CatBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

In [39]:
train_data = pd.read_csv('train_data.csv', index_col='id')

print("Train data shape: ", train_data.shape)

Train data shape:  (5109, 14)


Split dataset to train and holdout set (test set)

In [40]:
from sklearn.model_selection import train_test_split  

X = train_data.drop('stroke',axis=1)
y = train_data['stroke']

X_train,X_test,y_train,y_test = train_test_split(X,y, test_size=0.2, random_state=1,stratify=y) 

In [41]:
train_df_raw = pd.concat([X_train,y_train],axis=1)
#---------
test_df_raw = pd.concat([X_test,y_test],axis=1)
#print(test_df_raw.isnull().sum())
#---------

<b>2. Data Analysis</b>

Get stroke number in train and test set

In [42]:
stroke_sum = train_df_raw.stroke.sum()
#---------
stroke_sum_test = test_df_raw.stroke.sum()
#---------
print("Test set strokes: ", stroke_sum_test)
print("Train set strokes: ",stroke_sum)

Test set strokes:  50
Train set strokes:  199


In [43]:
strokes_app = pd.DataFrame()

train_df = train_df_raw.copy()

#---------
test_df = test_df_raw.copy()
#---------

Display data

In [44]:
train_df.head(1)

Unnamed: 0_level_0,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,Age_group,avg_glucose_level_group,bmi_group,stroke
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
53923,Female,22.0,0,0,No,Private,Urban,113.11,19.8,Unknown,0,2,0,0


In [45]:
test_df.head(1)

Unnamed: 0_level_0,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,Age_group,avg_glucose_level_group,bmi_group,stroke
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
33085,Female,20.0,0,0,No,Private,Rural,102.42,18.6,never smoked,0,2,0,0


Get all null values

In [46]:
print("Train set: \n" + str(train_df.isnull().sum()) + "\n\n")
print("Test set: \n" + str(test_df.isnull().sum()))

Train set: 
gender                     0
age                        0
hypertension               0
heart_disease              0
ever_married               0
work_type                  0
Residence_type             0
avg_glucose_level          0
bmi                        0
smoking_status             0
Age_group                  0
avg_glucose_level_group    0
bmi_group                  0
stroke                     0
dtype: int64


Test set: 
gender                     0
age                        0
hypertension               0
heart_disease              0
ever_married               0
work_type                  0
Residence_type             0
avg_glucose_level          0
bmi                        0
smoking_status             0
Age_group                  0
avg_glucose_level_group    0
bmi_group                  0
stroke                     0
dtype: int64


<b>3. Create new bins</b>

We are creating new categorical features that are "binned" from numeric data that were provided (Age Group, Bmi group and Average glucose level)

In [47]:
bins = pd.DataFrame()
train_df['Age_group'], age_bins = pd.qcut(train_df['age'],q=4,labels=[0,1,2,3],retbins=True)
train_df['bmi_group'], bmi_bins = pd.qcut(train_df['bmi'],q=4,labels=[0,1,2,3],retbins=True)
train_df['avg_glucose_level_group'], glucose_bins = pd.qcut(train_df['avg_glucose_level'],q=4,labels=[0,1,2,3],retbins=True)


#---------
test_df['Age_group'] = pd.cut(test_df['age'],bins=age_bins,labels=[0,1,2,3])
test_df['bmi_group'] = pd.cut(test_df['bmi'],bins=bmi_bins,labels=[0,1,2,3])
test_df['avg_glucose_level_group'] = pd.cut(test_df['avg_glucose_level'],bins=glucose_bins,labels=[0,1,2,3])
#---------

#Get Extreme values
age_max = train_df['age'].max()
age_min = train_df['age'].min()
avg_glucose_level_max = train_df['avg_glucose_level'].max()
avg_glucose_level_min = train_df['avg_glucose_level'].min()
bmi_max = train_df['bmi'].max()
bmi_min = train_df['bmi'].min()

test_df.loc[test_df['age'] > age_max, 'Age_group'] = 3
test_df.loc[test_df['age'] < age_min, 'Age_group'] = 0

test_df.loc[test_df['bmi'] > age_max, 'bmi_group'] = 3
test_df.loc[test_df['bmi'] < age_min, 'bmi_group'] = 0

test_df.loc[test_df['avg_glucose_level'] > age_max, 'avg_glucose_level_group'] = 3
test_df.loc[test_df['avg_glucose_level'] < age_min, 'avg_glucose_level_group'] = 0


"""
test_df['Age_group'] = test_df['age'].apply(lambda x: 0 if x < age_min else test_df.Age_group)
test_df['Age_group'] = test_df['age'].apply(lambda x: 3 if x > age_max else test_df.iloc[x]['age'])

test_df['bmi_group'] = test_df['bmi'].apply(lambda x: 0 if x < bmi_min else x)
test_df['bmi_group'] = test_df['bmi'].apply(lambda x: 3 if x > bmi_max else x)

test_df['avg_glucose_level_group'] = test_df['avg_glucose_level'].apply(lambda x: 0 if x < avg_glucose_level_min else x)
test_df['avg_glucose_level_group'] = test_df['avg_glucose_level'].apply(lambda x: 3 if x > avg_glucose_level_max else x)
"""


"\ntest_df['Age_group'] = test_df['age'].apply(lambda x: 0 if x < age_min else test_df.Age_group)\ntest_df['Age_group'] = test_df['age'].apply(lambda x: 3 if x > age_max else test_df.iloc[x]['age'])\n\ntest_df['bmi_group'] = test_df['bmi'].apply(lambda x: 0 if x < bmi_min else x)\ntest_df['bmi_group'] = test_df['bmi'].apply(lambda x: 3 if x > bmi_max else x)\n\ntest_df['avg_glucose_level_group'] = test_df['avg_glucose_level'].apply(lambda x: 0 if x < avg_glucose_level_min else x)\ntest_df['avg_glucose_level_group'] = test_df['avg_glucose_level'].apply(lambda x: 3 if x > avg_glucose_level_max else x)\n"

<b>4. Use LabelEncoder to encode data</b>

Use labelencoder to encode data in numeric value for future modelling

In [48]:
train_df.head(1)

Unnamed: 0_level_0,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,Age_group,avg_glucose_level_group,bmi_group,stroke
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
53923,Female,22.0,0,0,No,Private,Urban,113.11,19.8,Unknown,0,2,0,0


In [49]:
from sklearn.preprocessing import LabelEncoder 
map_table = pd.DataFrame()

labels = ['gender','ever_married','work_type','Residence_type','smoking_status']

le = LabelEncoder()


for label in labels:
    le.fit(train_df[label])
    #print(le.classes_)
    le_name_mapping = dict(zip(le.classes_, le.transform(le.classes_)))
    train_df[label] = le.transform(train_df[label])
    #--------
    test_df[label] = le.transform(test_df[label])
    test_df[label] = test_df[label].astype('int64')
    #--------
    train_df[label] = train_df[label].astype('int64')
    print(le_name_mapping)



{'Female': 0, 'Male': 1}
{'No': 0, 'Yes': 1}
{'Govt_job': 0, 'Never_worked': 1, 'Private': 2, 'Self-employed': 3, 'children': 4}
{'Rural': 0, 'Urban': 1}
{'Unknown': 0, 'formerly smoked': 1, 'never smoked': 2, 'smokes': 3}


After LabelEncoding check null values and remove if any

In [50]:
print("Train set: \n" + str(train_df.isnull().sum()) + "\n\n")
print("Test set: \n" + str(test_df.isnull().sum()))

Train set: 
gender                     0
age                        0
hypertension               0
heart_disease              0
ever_married               0
work_type                  0
Residence_type             0
avg_glucose_level          0
bmi                        0
smoking_status             0
Age_group                  0
avg_glucose_level_group    0
bmi_group                  0
stroke                     0
dtype: int64


Test set: 
gender                     0
age                        0
hypertension               0
heart_disease              0
ever_married               0
work_type                  0
Residence_type             0
avg_glucose_level          0
bmi                        0
smoking_status             0
Age_group                  0
avg_glucose_level_group    0
bmi_group                  1
stroke                     0
dtype: int64


In [51]:
#Drop nan values
test_df = test_df.dropna()

In [52]:
print("Train set: \n" + str(train_df.isnull().sum()) + "\n\n")
print("Test set: \n" + str(test_df.isnull().sum()))

Train set: 
gender                     0
age                        0
hypertension               0
heart_disease              0
ever_married               0
work_type                  0
Residence_type             0
avg_glucose_level          0
bmi                        0
smoking_status             0
Age_group                  0
avg_glucose_level_group    0
bmi_group                  0
stroke                     0
dtype: int64


Test set: 
gender                     0
age                        0
hypertension               0
heart_disease              0
ever_married               0
work_type                  0
Residence_type             0
avg_glucose_level          0
bmi                        0
smoking_status             0
Age_group                  0
avg_glucose_level_group    0
bmi_group                  0
stroke                     0
dtype: int64


In [53]:
train_df.head()

Unnamed: 0_level_0,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,Age_group,avg_glucose_level_group,bmi_group,stroke
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
53923,0,22.0,0,0,0,2,1,113.11,19.8,0,0,2,0,0
34966,0,43.0,0,0,1,3,1,87.41,39.7,1,1,1,3,0
67711,0,18.0,0,0,0,2,0,88.85,36.2,0,0,1,3,0
6049,0,5.0,0,0,0,4,0,73.69,24.8,0,0,0,1,0
15689,1,42.0,0,0,1,0,1,68.19,31.0,2,1,0,2,0


<b>5. Get proper column types</b>

Set type of column based on data - category, int, float

In [54]:
cat_features = ['gender','ever_married','work_type','Residence_type','smoking_status','hypertension','heart_disease',
               'ever_married','work_type','Residence_type','smoking_status','Age_group','avg_glucose_level_group',
               'bmi_group']

floats = ['avg_glucose_level','bmi']


for coll in train_df.columns:
  
    if coll in cat_features:
        train_df[coll] = train_df[coll].astype('category')
    if coll in floats:
        train_df[coll] = train_df[coll].astype('float64')
    if coll not in cat_features and coll not in floats:
        train_df[coll] = train_df[coll].astype('int64')

#------------
for coll in test_df.columns:
   
    if coll in cat_features:
        test_df[coll] = test_df[coll].astype('category')
    if coll in floats:
        test_df[coll] = test_df[coll].astype('float64')
    if coll not in cat_features and coll not in floats:
        test_df[coll] = test_df[coll].astype('int64')
#------------


Get Final X and y of train and test set

In [55]:
X = train_df.drop('stroke',axis=1)
y = train_df['stroke']
#-------
X_test = test_df.drop('stroke',axis=1)
y_test_final = test_df['stroke']

#X_test = X_test.reset_index()
#X_test = X_test.drop('id',axis=1)
#-------


In [56]:
print("X_test shape: " + str(X_test.shape))
print("y_test_final: " + str(y_test_final.shape))

X_test shape: (1021, 13)
y_test_final: (1021,)


In [57]:
X_test.tail(1)

Unnamed: 0_level_0,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,Age_group,avg_glucose_level_group,bmi_group
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
62416,0,26,0,0,1,2,0,73.29,27.8,2,1,0,1


<b>6. Scale Floats in dataset to have similar scale</b>

Scale avg glucose level, bmi and age to similar scale for model in both train and test set

In [58]:
#Create Copy
X_first = X.copy()

X_floats = X_first[['avg_glucose_level','bmi','age']].copy()
X_floats.head(2)

#--------
#Create Copy
X_first_test = X_test.copy()

X_floats_test = X_first_test[['avg_glucose_level','bmi','age']].copy()
#--------


Call StandardScaller

In [59]:
#def Scale(df,type)
from sklearn.preprocessing import StandardScaler 

scaler = StandardScaler() 
scaler.fit(X_floats)
X_floats = pd.DataFrame(scaler.transform(X_floats),columns=X_floats.columns,index=X_floats.index)

#--------
X_floats_test = pd.DataFrame(scaler.transform(X_floats_test),columns=X_floats_test.columns,index=X_floats_test.index)
#--------

X_floats.head(2)

Unnamed: 0_level_0,avg_glucose_level,bmi,age
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
53923,0.157082,-1.195102,-0.928353
34966,-0.412776,1.445891,-0.005473


In [60]:
X_floats_test.head(2)

Unnamed: 0_level_0,avg_glucose_level,bmi,age
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
33085,-0.079952,-1.354357,-1.016246
26826,-0.724314,-1.68614,0.785567


In [61]:
X.head(1)

Unnamed: 0_level_0,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,Age_group,avg_glucose_level_group,bmi_group
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
53923,0,22,0,0,0,2,1,113.11,19.8,0,0,2,0


Drop initial unscalled features from dataframe

In [62]:
CollsToDrop = ['avg_glucose_level','bmi','age']

try:
    X = X.drop(labels=CollsToDrop,axis=1)
    X_test = X_test.drop(labels=CollsToDrop,axis=1)

except:
    pass

In [63]:
X.head(1)

Unnamed: 0_level_0,gender,hypertension,heart_disease,ever_married,work_type,Residence_type,smoking_status,Age_group,avg_glucose_level_group,bmi_group
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
53923,0,0,0,0,2,1,0,0,2,0


In [64]:
categories = cat_features
X_first.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4087 entries, 53923 to 45252
Data columns (total 13 columns):
 #   Column                   Non-Null Count  Dtype   
---  ------                   --------------  -----   
 0   gender                   4087 non-null   category
 1   age                      4087 non-null   int64   
 2   hypertension             4087 non-null   category
 3   heart_disease            4087 non-null   category
 4   ever_married             4087 non-null   category
 5   work_type                4087 non-null   category
 6   Residence_type           4087 non-null   category
 7   avg_glucose_level        4087 non-null   float64 
 8   bmi                      4087 non-null   float64 
 9   smoking_status           4087 non-null   category
 10  Age_group                4087 non-null   category
 11  avg_glucose_level_group  4087 non-null   category
 12  bmi_group                4087 non-null   category
dtypes: category(10), float64(2), int64(1)
memory usage: 169.0 

Get final dataframes train: X_final and test: X_final_test

In [65]:
X_final = pd.concat([X,X_floats],axis=1)

#------
X_final_test = pd.concat([X_test,X_floats_test],axis=1)
#------
X_final_test.tail(10)

Unnamed: 0_level_0,gender,hypertension,heart_disease,ever_married,work_type,Residence_type,smoking_status,Age_group,avg_glucose_level_group,bmi_group,avg_glucose_level,bmi,age
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
42412,0,0,0,0,2,1,0,0,3,1,0.89945,-0.146667,-1.10414
53217,0,0,0,0,2,0,0,0,3,1,-0.295257,-0.624435,-1.10414
52134,1,0,0,1,2,0,0,2,3,3,-0.352686,0.875224,0.433993
44142,1,0,0,0,2,0,2,0,3,1,-0.244258,-0.106853,-0.796513
737,1,0,0,0,4,1,0,0,3,2,-0.384394,0.211658,-1.455713
8719,1,0,0,0,4,1,1,0,3,0,0.226707,-1.646326,-1.36782
52173,1,0,0,0,3,1,2,1,0,3,-0.708127,1.43262,-0.225206
8521,1,0,0,1,2,0,1,3,3,2,2.702598,0.370914,1.225033
38119,1,0,0,1,0,1,2,3,3,2,-0.25601,0.304557,0.917407
62416,0,0,0,1,2,0,2,1,0,1,-0.725866,-0.133396,-0.752566


In [66]:
print("X final shape is: ",X_final.shape)
print("y shape is: ",y.shape)

X final shape is:  (4087, 13)
y shape is:  (4087,)


<b>7. Get dummies (onehotencoded dataset) and upsample</b>

Dataframe must be one hot encoded due to upcoming upsampling that is done. In following step, in order to have upsampled data by smotetomek, we have to one hot encode dataframe

In [67]:
X_dummy = pd.get_dummies(X_final)

In [68]:
X_dummy.head(1)

Unnamed: 0_level_0,avg_glucose_level,bmi,age,gender_0,gender_1,hypertension_0,hypertension_1,heart_disease_0,heart_disease_1,ever_married_0,...,Age_group_2,Age_group_3,avg_glucose_level_group_0,avg_glucose_level_group_1,avg_glucose_level_group_2,avg_glucose_level_group_3,bmi_group_0,bmi_group_1,bmi_group_2,bmi_group_3
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
53923,0.157082,-1.195102,-0.928353,1,0,1,0,1,0,1,...,0,0,0,0,1,0,1,0,0,0


In [69]:
cat_features = ['gender','ever_married','work_type','Residence_type','smoking_status','hypertension','heart_disease',
               'ever_married','work_type','Residence_type','smoking_status','Age_group','avg_glucose_level_group',
               'bmi_group']
#col = 'gen_der_2'
#value = col[-1:]
#print(value)

In [70]:
X_final.head()

Unnamed: 0_level_0,gender,hypertension,heart_disease,ever_married,work_type,Residence_type,smoking_status,Age_group,avg_glucose_level_group,bmi_group,avg_glucose_level,bmi,age
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
53923,0,0,0,0,2,1,0,0,2,0,0.157082,-1.195102,-0.928353
34966,0,0,0,1,3,1,1,1,1,3,-0.412776,1.445891,-0.005473
67711,0,0,0,0,2,0,0,0,1,3,-0.380847,0.981395,-1.10414
6049,0,0,0,0,4,0,0,0,0,1,-0.716997,-0.531536,-1.675446
15689,1,0,0,1,0,1,2,1,0,2,-0.838951,0.291286,-0.04942


Get final dummied dataframe called X_final

In [71]:
X_final_dummy = pd.get_dummies(X_final)
print(len(X_final_dummy))
X_final_dummy.head()

4087


Unnamed: 0_level_0,avg_glucose_level,bmi,age,gender_0,gender_1,hypertension_0,hypertension_1,heart_disease_0,heart_disease_1,ever_married_0,...,Age_group_2,Age_group_3,avg_glucose_level_group_0,avg_glucose_level_group_1,avg_glucose_level_group_2,avg_glucose_level_group_3,bmi_group_0,bmi_group_1,bmi_group_2,bmi_group_3
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
53923,0.157082,-1.195102,-0.928353,1,0,1,0,1,0,1,...,0,0,0,0,1,0,1,0,0,0
34966,-0.412776,1.445891,-0.005473,1,0,1,0,1,0,0,...,0,0,0,1,0,0,0,0,0,1
67711,-0.380847,0.981395,-1.10414,1,0,1,0,1,0,1,...,0,0,0,1,0,0,0,0,0,1
6049,-0.716997,-0.531536,-1.675446,1,0,1,0,1,0,1,...,0,0,1,0,0,0,0,1,0,0
15689,-0.838951,0.291286,-0.04942,0,1,1,0,1,0,0,...,0,0,1,0,0,0,0,0,1,0


<B>7. UPSAMPLE TRAINING DATASET</B>

Our train set is imbalanced due to only 5% of data are positive stroke cases, for model training we have to make dataset more balanced. Therefore we are using smotetomek to upsample data and to have balanced stroke ratio. Upsampling is done ONLY on training set, test set is untouched. In order to use smotetomek, our dataframe must be one hot encoded.

In [72]:
from imblearn.over_sampling import SMOTE
from imblearn.combine import SMOTETomek

print("Before upsampling: ",X_final_dummy.shape[0])

smt = SMOTETomek(random_state=2811)
X, y = smt.fit_resample(X_final_dummy, y)
X_final = pd.DataFrame(X)

print("After upsampling: ", X_final.shape[0])

Before upsampling:  4087
After upsampling:  7768


In [73]:
print(X_final.shape)
print(y.shape)

(7768, 34)
(7768,)


Summary of train set

In [74]:
print("total training examples: ", len(y))
print("total strokes in training set:",y.sum())
print("Ratio of strokes: " + str(y.sum()/len(y)*100) + "%")

total training examples:  7768
total strokes in training set: 3884
Ratio of strokes: 50.0%


Summary of test set

In [75]:
print("total testing examples: ",len(y_test_final))
print("total strokes in testing set",y_test_final.sum())
print("Ratio of strokes: " + str(np.round(y_test_final.sum()/len(y_test_final)*100,2)) + "%")

total testing examples:  1021
total strokes in testing set 50
Ratio of strokes: 4.9%


Training set will be used to train model with balanced data of strokes, model will be evaluated on test set which remains unbalanced

<b>8. Transform dummy dataframe back to initial one for catbooster</b>

After upsampling is done get back dataframe to initial state due to the fact that catbooster has poor performance with one hot encoded dataframe.

In [76]:
i = 0

gender = pd.concat([X_final.gender_0,X_final.gender_1],axis=1)
gender = gender.gender_0.apply(lambda x:0 if x == 1 else 1)
gender = pd.DataFrame(gender)
gender = gender.rename(columns={'gender_0':'gender'})
gender

i += 1
print(str(i) + " completed, total: " + str(len(gender)))

hypertension = pd.concat([X_final.hypertension_0,X_final.hypertension_1],axis=1)
hypertension = hypertension.hypertension_0.apply(lambda x:0 if x == 1 else 1)
hypertension = pd.DataFrame(hypertension)
hypertension = hypertension.rename(columns={'hypertension_0':'hypertension'})
hypertension

i += 1
print(str(i) + " completed, total: " + str(len(hypertension)))

heart_disease = pd.concat([X_final.heart_disease_0,X_final.heart_disease_1],axis=1)
heart_disease = heart_disease.heart_disease_0.apply(lambda x:0 if x == 1 else 1)
heart_disease = pd.DataFrame(heart_disease)
heart_disease = heart_disease.rename(columns={'heart_disease_0':'heart_disease'})
heart_disease

i += 1
print(str(i) + " completed, total: " + str(len(heart_disease)))

ever_married = pd.concat([X_final.ever_married_0,X_final.ever_married_1],axis=1)
ever_married = ever_married.ever_married_0.apply(lambda x:0 if x == 1 else 1)
ever_married = pd.DataFrame(ever_married)
ever_married = ever_married.rename(columns={'ever_married_0':'ever_married'})
ever_married

i += 1
print(str(i) + " completed, total: " + str(len(ever_married)))

residence_type = pd.concat([X_final.Residence_type_0,X_final.Residence_type_1],axis=1)
residence_type = residence_type.Residence_type_0.apply(lambda x:0 if x == 1 else 1)
residence_type = pd.DataFrame(residence_type)
residence_type = residence_type.rename(columns={'Residence_type_0':'Residence_type'})
residence_type

i += 1
print(str(i) + " completed, total: " + str(len(residence_type)))

worktype = pd.concat([X_final.work_type_0,X_final.work_type_1,X_final.work_type_2,X_final.work_type_3,X_final.work_type_4],axis=1)
worktype = worktype.reset_index()
worktype

values = []
ids = []

for row in range(worktype.shape[0]):
    flag = False
    for coll in worktype.columns:#worktype.shape[1])
        
        if coll[:-1] == 'work_type_':
            
            if worktype.loc[row][coll] == 1:
                #get value
                value = coll[-1:]
                idss = worktype.loc[row]['index']
                ids.append(idss)
                values.append(value)
                flag = True
                
    if flag == False:     
        worktype.loc[row][coll] = 1
        idss = worktype.loc[row]['index']
        ids.append(idss)
        values.append(value)
        
dicti = {'index':ids,
        'work_type':values}

work_type = pd.DataFrame(dicti)
work_type = work_type.set_index('index')
work_type.head(5)

i += 1
print(str(i) + " completed, total: " + str(len(work_type)))


smoking_status = pd.concat([X_final.smoking_status_0,X_final.smoking_status_1,X_final.smoking_status_2,X_final.smoking_status_3],axis=1)
smoking_status = smoking_status.reset_index()
smoking_status

values = []
ids = []

for row in range(smoking_status.shape[0]):
    flag = False
    for coll in smoking_status.columns:#worktype.shape[1])
        
        if coll[:-1] == 'smoking_status_':
            
            if smoking_status.loc[row][coll] == 1:
                #get value
                value = coll[-1:]
                idss = smoking_status.loc[row]['index']
                ids.append(idss)
                values.append(value)
                flag = True
    if flag == False:     
        smoking_status.loc[row][coll] = 1
        idss = smoking_status.loc[row]['index']
        ids.append(idss)
        values.append(value)
        

dicti = {'index':ids,
        'smoking_status':values}

smoking_status = pd.DataFrame(dicti)
smoking_status = smoking_status.set_index('index')
smoking_status.head(5)

i += 1
print(str(i) + " completed, total: " + str(len(smoking_status)))

avg_glucose_level_group = pd.concat([X_final.avg_glucose_level_group_0,X_final.avg_glucose_level_group_1,X_final.avg_glucose_level_group_2,X_final.avg_glucose_level_group_3],axis=1)
avg_glucose_level_group = avg_glucose_level_group.reset_index()
avg_glucose_level_group

values = []
ids = []

for row in range(avg_glucose_level_group.shape[0]):
    flag = False
    
    for coll in avg_glucose_level_group.columns:
        
        if coll[:-1] == 'avg_glucose_level_group_':
            
            if avg_glucose_level_group.loc[row][coll] == 1:
                #get value
                value = coll[-1:]
                idss = avg_glucose_level_group.loc[row]['index']
                ids.append(idss)
                values.append(value)
                flag = True
 
    if flag == False:     
        avg_glucose_level_group.loc[row][coll] = 1
        idss = avg_glucose_level_group.loc[row]['index']
        ids.append(idss)
        values.append(value)

dicti = {'index':ids,
        'avg_glucose_level_group':values}

avg_glucose_level_group = pd.DataFrame(dicti)
avg_glucose_level_group = avg_glucose_level_group.set_index('index')
avg_glucose_level_group.head(5)

i += 1
print(str(i) + " completed, total: " + str(len(avg_glucose_level_group)))

bmi_group = pd.concat([X_final.bmi_group_0,X_final.bmi_group_1,X_final.bmi_group_2,X_final.bmi_group_3],axis=1)
bmi_group = bmi_group.reset_index()
bmi_group

values = []
ids = []

for row in range(bmi_group.shape[0]):
    flag = False
    for coll in bmi_group.columns:#worktype.shape[1])
        
        if coll[:-1] == 'bmi_group_':
            
            if bmi_group.loc[row][coll] == 1:
                #get value
                value = coll[-1:]
                idss = bmi_group.loc[row]['index']
                ids.append(idss)
                values.append(value)
                flag = True

    if flag == False:     
        bmi_group.loc[row][coll] = 1
        idss = bmi_group.loc[row]['index']
        ids.append(idss)
        values.append(value)

dicti = {'index':ids,
        'bmi_group':values}

bmi_group = pd.DataFrame(dicti)
bmi_group = bmi_group.set_index('index')
bmi_group.head(5)

i += 1
print(str(i) + " completed, total: " + str(len(bmi_group)))

age_group = pd.concat([X_final.Age_group_0,X_final.Age_group_1,X_final.Age_group_2,X_final.Age_group_3],axis=1)
age_group = age_group.reset_index()
age_group

values = []
ids = []

for row in range(age_group.shape[0]):
    flag = False
    for coll in age_group.columns:#worktype.shape[1])
        
        if coll[:-1] == 'Age_group_':
            
            if age_group.loc[row][coll] == 1:
                #get value
                value = coll[-1:]
                idss = age_group.loc[row]['index']
                ids.append(idss)
                values.append(value)
                flag = True
                
    if flag == False:     
        age_group.loc[row][coll] = 1
        idss = age_group.loc[row]['index']
        ids.append(idss)
        values.append(value)           
    
dicti = {'index':ids,
        'Age_group':values}

age_group = pd.DataFrame(dicti)
age_group = age_group.set_index('index')

print(str(i) + " completed, total: " + str(len(age_group)))

#age_group.head(5)
print('DONE')

1 completed, total: 7768
2 completed, total: 7768
3 completed, total: 7768
4 completed, total: 7768
5 completed, total: 7768


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  worktype.loc[row][coll] = 1


6 completed, total: 7768


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  smoking_status.loc[row][coll] = 1


7 completed, total: 7768


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  avg_glucose_level_group.loc[row][coll] = 1


8 completed, total: 7768


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  bmi_group.loc[row][coll] = 1


9 completed, total: 7768


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  age_group.loc[row][coll] = 1


9 completed, total: 7768
DONE


In [77]:
"""
print(len(gender))
print(len(hypertension))
print(len(heart_disease))
print(len(ever_married))
print(len(work_type))
print(len(residence_type))
print(len(smoking_status))
print(len(age_group))
print(len(avg_glucose_level_group))
print(len(bmi_group))
print(len(y))
print(len(X_floats))
"""


'\nprint(len(gender))\nprint(len(hypertension))\nprint(len(heart_disease))\nprint(len(ever_married))\nprint(len(work_type))\nprint(len(residence_type))\nprint(len(smoking_status))\nprint(len(age_group))\nprint(len(avg_glucose_level_group))\nprint(len(bmi_group))\nprint(len(y))\nprint(len(X_floats))\n'

In [78]:
X_final.tail(1)

Unnamed: 0,avg_glucose_level,bmi,age,gender_0,gender_1,hypertension_0,hypertension_1,heart_disease_0,heart_disease_1,ever_married_0,...,Age_group_2,Age_group_3,avg_glucose_level_group_0,avg_glucose_level_group_1,avg_glucose_level_group_2,avg_glucose_level_group_3,bmi_group_0,bmi_group_1,bmi_group_2,bmi_group_3
7767,-0.63526,0.056693,0.687055,0,0,1,0,1,0,0,...,1,0,0,0,0,0,0,0,1,0


In [79]:
labels = ['avg_glucose_level','bmi','age']

X_floats_final = X_final[labels]
X_floats_final

Unnamed: 0,avg_glucose_level,bmi,age
0,0.157082,-1.195102,-0.928353
1,-0.412776,1.445891,-0.005473
2,-0.380847,0.981395,-1.104140
3,-0.716997,-0.531536,-1.675446
4,-0.838951,0.291286,-0.049420
...,...,...,...
7763,-0.387381,-0.164948,0.574618
7764,-0.817934,-0.379615,1.620553
7765,2.144701,-0.984783,1.001952
7766,-0.855319,1.228004,0.860064


Merge again float dataset with initial dataset together

In [80]:
X_processed = pd.DataFrame()
X_processed = pd.concat([gender,hypertension,heart_disease,ever_married,
                         work_type,residence_type,smoking_status,age_group,
                         avg_glucose_level_group,bmi_group,X_floats_final],axis=1)

X_processed.tail()

Unnamed: 0,gender,hypertension,heart_disease,ever_married,work_type,Residence_type,smoking_status,Age_group,avg_glucose_level_group,bmi_group,avg_glucose_level,bmi,age
7763,0,0,0,1,2,1,1,3,1,3,-0.387381,-0.164948,0.574618
7764,0,1,0,1,2,1,0,3,0,1,-0.817934,-0.379615,1.620553
7765,1,1,0,1,0,1,3,3,3,0,2.144701,-0.984783,1.001952
7766,0,0,0,1,0,1,2,3,0,3,-0.855319,1.228004,0.860064
7767,1,0,0,1,2,1,0,2,0,2,-0.63526,0.056693,0.687055


Change type of features

In [81]:
cat_features = ['gender','ever_married','work_type','Residence_type','smoking_status','hypertension','heart_disease',
               'ever_married','work_type','Residence_type','smoking_status','Age_group','avg_glucose_level_group',
               'bmi_group']

floats = ['avg_glucose_level','bmi','age']


for coll in X_processed.columns:
  
    if coll in cat_features:
        X_processed[coll] = X_processed[coll].astype('category')
    if coll in floats:
        X_processed[coll] = X_processed[coll].astype('float64')
    if coll not in cat_features and coll not in floats:
        X_processed[coll] = X_processed[coll].astype('int64')

for coll in X_final_test.columns:
  
    if coll in cat_features:
        X_final_test[coll] = X_final_test[coll].astype('category')
    if coll in floats:
        X_final_test[coll] = X_final_test[coll].astype('float64')
    if coll not in cat_features and coll not in floats:
        X_final_test[coll] = X_final_test[coll].astype('int64')
        
#----------------------------
try:
    X_processed = X_processed.drop('id',axis=1)
except:
    pass

In [82]:
X_processed.head(1)

Unnamed: 0,gender,hypertension,heart_disease,ever_married,work_type,Residence_type,smoking_status,Age_group,avg_glucose_level_group,bmi_group,avg_glucose_level,bmi,age
0,0,0,0,0,2,1,0,0,2,0,0.157082,-1.195102,-0.928353


In [83]:
X_final_test.info()
X_processed.info()
print(X_processed.shape)
print(y.shape)
#X_processed.head(1)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1021 entries, 33085 to 62416
Data columns (total 13 columns):
 #   Column                   Non-Null Count  Dtype   
---  ------                   --------------  -----   
 0   gender                   1021 non-null   category
 1   hypertension             1021 non-null   category
 2   heart_disease            1021 non-null   category
 3   ever_married             1021 non-null   category
 4   work_type                1021 non-null   category
 5   Residence_type           1021 non-null   category
 6   smoking_status           1021 non-null   category
 7   Age_group                1021 non-null   category
 8   avg_glucose_level_group  1021 non-null   category
 9   bmi_group                1021 non-null   category
 10  avg_glucose_level        1021 non-null   float64 
 11  bmi                      1021 non-null   float64 
 12  age                      1021 non-null   float64 
dtypes: category(10), float64(3)
memory usage: 43.3 KB
<class '

Merge together features and labels in train dataset

In [84]:
print("X processed ", X_processed.shape)
X_train_dataset = pd.DataFrame()
X_train_dataset = pd.concat([X_processed,y],axis=1)
print("X train dataset", X_train_dataset.shape)

X processed  (7768, 13)
X train dataset (7768, 14)


In [85]:
X_processed.head(2)

Unnamed: 0,gender,hypertension,heart_disease,ever_married,work_type,Residence_type,smoking_status,Age_group,avg_glucose_level_group,bmi_group,avg_glucose_level,bmi,age
0,0,0,0,0,2,1,0,0,2,0,0.157082,-1.195102,-0.928353
1,0,0,0,1,3,1,1,1,1,3,-0.412776,1.445891,-0.005473


In [86]:
try:
    y_test_final
    X_final_test = X_final_test.set_index('id')
except:
    pass

Merge together features and labels in test dataset

In [87]:
print("X test: ", X_test.shape)
y_test_dataset = pd.DataFrame()
X_test_dataset = pd.concat([X_final_test,y_test_final],axis=1)

print("y test final", y_test_final.shape)

X test:  (1021, 10)
y test final (1021,)


<B>Export final dataframes in csv format</B>

In [88]:
#Train set
X_train_dataset.to_csv('train_dataset_final.csv')
#Holdout set
X_test_dataset.to_csv('test_dataset_final.csv')