# Stroke Prediction Model Searching

Let's do some research to create a stroke prediction model based on the healthcare-dataset-stroke-data dataset.

Source: https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset

Let's gather the pipelines from the Stroke_Prediction_Cleansing_and_Preprocessing notebook.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import LabelBinarizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import FeatureUnion

In [2]:
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names]

class MyLabelBinarizer(LabelBinarizer):
    def fit_transform(self, X, y=None):
        return super(LabelBinarizer, self).fit_transform(X)

class CustomLimitedImputer(BaseEstimator, TransformerMixin):
    ''' Simple customized imputer to change the following:
        smoking_status to "never smoked" if age < 10 and smoking_status = "Unknown"
        work_type to "children" if age < 17 and work_type = "Never_worked" '''
    def __init__(self, attribute_names):
        assert all(attr in ['smoking_status', 'work_type'] for attr in attribute_names), 'Only smoking_status and work_type are supported'
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        X = X.copy()
        for attr in self.attribute_names:
            if attr == 'smoking_status':
                X.loc[(X.age < 10) & (X.smoking_status == 'Unknown'), 'smoking_status'] = 'never smoked'
            elif attr == 'work_type':
                X.loc[(X.age < 17) & (X.work_type == 'Never_worked'), 'work_type'] = 'children'
        X.drop(['age'], axis=1, inplace=True)
        return X.values

# pipelines
cat_yn_pipeline = Pipeline([
        ("select_bin", DataFrameSelector(['ever_married'])),
        ("bin_encoder", MyLabelBinarizer()),
    ])

# cat_cust_pipeline = Pipeline([
#         ("select_cat", DataFrameSelector(['smoking_status', 'work_type', 'age'])), # age is used as a parameter
#         ("cat_encoder", CustomLimitedImputer(['smoking_status', 'work_type'])),
#     ])

cat_oh_pipeline = Pipeline([
#         ("select_cat", DataFrameSelector(['smoking_status', 'work_type'])),
        ("select_cat", DataFrameSelector(['smoking_status', 'work_type', 'age'])), # age is used as a parameter
        ("imputer", CustomLimitedImputer(['smoking_status', 'work_type'])),
        ("cat_encoder", OneHotEncoder(sparse=False)),
    ])

num_pipeline = Pipeline([
        ("select_numeric", DataFrameSelector(['age', 'avg_glucose_level'])),
        ("scale", StandardScaler()),
    ])

# final preprocessing pipeline
preprocess_pipeline = FeatureUnion(transformer_list=[
        ('cat_yn_pipeline', cat_yn_pipeline),
#         ('cat_cust_pipeline', cat_cust_pipeline),
        ('cat_oh_pipeline', cat_oh_pipeline),
        ('num_pipeline', num_pipeline),
    ])

In [3]:
data = pd.read_csv('Data/healthcare-dataset-stroke-data.csv', index_col='id')

In [4]:
data

id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,never smoked,1
31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1
...,...,...,...,...,...,...,...,...,...,...,...
18234,Female,80.0,1,0,Yes,Private,Urban,83.75,,never smoked,0
44873,Female,81.0,0,0,Yes,Self-employed,Urban,125.20,40.0,never smoked,0
19723,Female,35.0,0,0,Yes,Self-employed,Rural,82.99,30.6,never smoked,0
37544,Male,51.0,0,0,Yes,Private,Rural,166.29,25.6,formerly smoked,0


In [5]:
X_train, X_test, y_train, y_test = train_test_split(data[data.columns[:-1]], data.stroke, test_size=0.2, random_state=24) #stratify=data.stroke)

In [6]:
X_train = preprocess_pipeline.fit_transform(X_train)
X_test = preprocess_pipeline.transform(X_test)

Let's start with a simple Stochastic Gradient Descent classifier.

In [7]:
from sklearn.linear_model import SGDClassifier

In [8]:
sgdc = SGDClassifier(random_state=24)
sgdc.fit(X_train, y_train)

SGDClassifier(random_state=24)

In [7]:
from sklearn.model_selection import cross_val_score

In [14]:
sgdc_scores = cross_val_score(sgdc, X_train, y_train, cv=10)
sgdc_scores.mean()

0.9505878757370919

In [11]:
np.mean(y_train == 0)

0.951320939334638

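The results are pretty similar: the cross-validated accuracy (0.9506) is about the same as the share of non-stroke rows in the training set (0.9513), so the model is no better than always predicting "no stroke".
Let's check the precision and recall metrics.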
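
The baseline used in this comparison can be reproduced by hand. This is a minimal sketch that uses only the class counts of the training set (3889 non-stroke and 199 stroke rows); the helper name `majority_baseline_accuracy` is made up for illustration.

```python
def majority_baseline_accuracy(n_negative, n_positive):
    """Accuracy of a model that always predicts the majority class."""
    total = n_negative + n_positive
    return max(n_negative, n_positive) / total


# Class counts of the training set: 3889 non-stroke, 199 stroke.
baseline = majority_baseline_accuracy(3889, 199)
print(round(baseline, 4))  # 0.9513
```

Any accuracy close to this value says little about how well strokes are detected, which is why accuracy is not a useful metric here.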

In [8]:
from sklearn.model_selection import cross_val_predict

In [18]:
sgdc_scores = cross_val_predict(sgdc, X_train, y_train, cv=10)

In [9]:
from sklearn.metrics import confusion_matrix

In [20]:
confusion_matrix(y_train, sgdc_scores)

array([[3883,    6],
       [ 196,    3]], dtype=int64)

In [10]:
from sklearn.metrics import precision_score, recall_score

In [22]:
recall_score(y_train, sgdc_scores)

0.01507537688442211

In [23]:
precision_score(y_train, sgdc_scores)

0.3333333333333333

Due to the nature of the problem, recall will be a good metric to choose. The model has to detect as many strokes as it can. Nothing bad will happen if some false positive cases are taken under observation by a doctor.
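
As a worked example, both metrics can be derived directly from the confusion matrix printed above. This is a small sketch; the helper `precision_recall` is hypothetical and assumes the layout returned by sklearn's `confusion_matrix` (rows are true classes, columns are predicted classes).

```python
def precision_recall(conf_matrix):
    """Return (precision, recall) for the positive class of a 2x2 matrix.

    Assumed layout: [[TN, FP], [FN, TP]].
    """
    (tn, fp), (fn, tp) = conf_matrix
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Confusion matrix of the SGD classifier from the cell above.
precision, recall = precision_recall([[3883, 6], [196, 3]])
print(f"precision={precision:.4f} recall={recall:.4f}")
# precision=0.3333 recall=0.0151
```

Only 3 of the 199 strokes are found, so the recall is about 1.5% even though the accuracy is above 95%.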

The data is really imbalanced, which reflects reality. Let's then focus on decision tree-based models, which are more tolerant of such data.

Decision Tree Classifier.

In [11]:
from sklearn.tree import DecisionTreeClassifier

In [25]:
dtc = DecisionTreeClassifier(random_state=24)
# dtc.fit(X_train, y_train)

DecisionTreeClassifier(random_state=24)

In [26]:
dtc_scores = cross_val_score(dtc, X_train, y_train, cv=10, scoring='recall')
dtc_scores.mean()

0.19657894736842102

Extra Trees Classifier

In [27]:
from sklearn.ensemble import ExtraTreesClassifier

In [28]:
etc = ExtraTreesClassifier(random_state=24)
# etc.fit(X_train, y_train)

ExtraTreesClassifier(random_state=24)

In [29]:
etc_scores = cross_val_score(etc, X_train, y_train, cv=10, scoring='recall')
etc_scores.mean()

0.13026315789473683

AdaBoost Classifier

In [30]:
from sklearn.ensemble import AdaBoostClassifier

In [31]:
abc = AdaBoostClassifier(random_state=24)
# abc.fit(X_train, y_train)

AdaBoostClassifier(random_state=24)

In [32]:
abc_scores = cross_val_score(abc, X_train, y_train, cv=10, scoring='recall')
abc_scores.mean()

0.005

Random Forest Classifier

In [11]:
from sklearn.ensemble import RandomForestClassifier

In [34]:
rfc = RandomForestClassifier(random_state=24)
# rfc.fit(X_train, y_train)

RandomForestClassifier(random_state=24)

In [35]:
rfc_scores = cross_val_score(rfc, X_train, y_train, cv=10, scoring='recall')
rfc_scores.mean()

0.11026315789473684

Gradient Boosting Classifier

In [45]:
from sklearn.ensemble import GradientBoostingClassifier

In [46]:
gbc = GradientBoostingClassifier(random_state=24)
# gbc.fit(X_train, y_train)

GradientBoostingClassifier(random_state=24)

In [47]:
gbc_scores = cross_val_score(gbc, X_train, y_train, cv=10, scoring='recall')
gbc_scores.mean()

0.015263157894736843

Hist Gradient Boosting Classifier

In [54]:
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingClassifier

In [55]:
hgbc = HistGradientBoostingClassifier(random_state=24)
# hgbc.fit(X_train, y_train)

HistGradientBoostingClassifier(random_state=24)

In [56]:
hgbc_scores = cross_val_score(hgbc, X_train, y_train, cv=10, scoring='recall')
hgbc_scores.mean()

0.05026315789473686

Let's use GridSearchCV to find better hyperparameters.

In [12]:
from sklearn.model_selection import GridSearchCV

Decision Tree Classifier

In [72]:
param_grid = [
    {'max_depth': [None]+[x for x in range(1, 20)],
     'min_samples_leaf': [1, 2, 3, 4],
     'max_features': [None]+['auto', 'sqrt', 'log2'],
    }]

In [73]:
dt_clf = DecisionTreeClassifier(random_state=24)
grid_search = GridSearchCV(dt_clf, param_grid, cv=10, scoring='recall')
grid_search.fit(X_train, y_train)

GridSearchCV(cv=10, estimator=DecisionTreeClassifier(random_state=24),
             param_grid=[{'max_depth': [None, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
                                        12, 13, 14, 15, 16, 17, 18, 19],
                          'max_features': [None, 'auto', 'sqrt', 'log2'],
                          'min_samples_leaf': [1, 2, 3, 4]}],
             scoring='recall')

In [74]:
grid_search.best_estimator_

DecisionTreeClassifier(max_features='auto', random_state=24)

In [71]:
dtc = DecisionTreeClassifier(max_features='auto', random_state=24)
# dtc.fit(X_train, y_train)

dtc_scores = cross_val_score(dtc, X_train, y_train, cv=10, scoring='recall')
dtc_scores.mean()

0.20131578947368425

Just slightly better.

Extra Trees Classifier

In [230]:
param_grid = [
    {'n_estimators': [x for x in range(172, 193, 10)],
     'criterion': ['gini', 'entropy'], #, 'log_loss'],
#      'max_depth': [None]+[x for x in range(1, 20)],
#      'max_depth': [None]+[20, 40],
#      'min_samples_split': [2, 3, 4],
#      'min_samples_leaf': [1, 5, 10, 15],
#      'max_features': [None]+['auto', 'sqrt', 'log2'],
     'max_features': [None, 'sqrt'],
#      'max_leaf_nodes': [None, 4, 8, 16]
#      'bootstrap': [True, False],
#      'max_samples': [None, 1000, 1500, 2000],
     'class_weight': [None, 'balanced', 'balanced_subsample']
    }]

In [231]:
et_clf = ExtraTreesClassifier(random_state=24)
grid_search = GridSearchCV(et_clf, param_grid, cv=10, scoring='recall')
grid_search.fit(X_train, y_train)

GridSearchCV(cv=10, estimator=ExtraTreesClassifier(random_state=24),
             param_grid=[{'class_weight': [None, 'balanced',
                                           'balanced_subsample'],
                          'criterion': ['gini', 'entropy'],
                          'max_features': [None, 'sqrt'],
                          'n_estimators': [172, 182, 192]}],
             scoring='recall')

In [232]:
grid_search.best_estimator_

ExtraTreesClassifier(criterion='entropy', max_features=None, n_estimators=172,
                     random_state=24)

In [145]:
# etc = ExtraTreesClassifier(max_depth=20, n_estimators=200, random_state=24)
# etc = ExtraTreesClassifier(n_estimators=188, random_state=24)
# etc = ExtraTreesClassifier(max_features=None, n_estimators=188, random_state=24)
# etc = ExtraTreesClassifier(criterion='entropy', max_features=None, n_estimators=184, random_state=24)
etc = ExtraTreesClassifier(criterion='entropy', max_features=None, n_estimators=172, random_state=24)
# etc.fit(X_train, y_train)

etc_scores = cross_val_score(etc, X_train, y_train, cv=10, scoring='recall')
etc_scores.mean()

0.16026315789473683

A lot of searching for a 23% improvement (0.13 -> 0.16).

Random Forest Classifier

In [210]:
param_grid = [
    {'n_estimators': [x for x in range(380, 391, 10)],
#      'criterion': ['gini', 'entropy', 'log_loss'],
     'criterion': ['entropy'],
#      'max_depth': [None]+[x for x in range(1, 20)],
#      'max_depth': [None]+[1, 10, 20],
#      'min_samples_split': [2, 3, 4],
#      'min_samples_leaf': [1, 2, 3, 4],
#      'max_features': [None]+['auto', 'sqrt', 'log2'],
#      'bootstrap': [True, False],
     'bootstrap': [False],
#      'max_samples': [None, 1000, 1500, 2000],
#      'class_weight': [None, 'balanced', 'balanced_subsample']
    }]

In [211]:
%%time
rf_clf = RandomForestClassifier(random_state=24)
grid_search = GridSearchCV(rf_clf, param_grid, cv=10, scoring='recall')
grid_search.fit(X_train, y_train)

Wall time: 1min 34s


GridSearchCV(cv=10, estimator=RandomForestClassifier(random_state=24),
             param_grid=[{'bootstrap': [False],
                          'class_weight': [None, 'balanced',
                                           'balanced_subsample'],
                          'criterion': ['entropy'],
                          'n_estimators': [380, 390]}],
             scoring='recall')

In [212]:
grid_search.best_estimator_

RandomForestClassifier(bootstrap=False, criterion='entropy', n_estimators=380,
                       random_state=24)

In [302]:
rfc = RandomForestClassifier(bootstrap=False, criterion='entropy', n_estimators=380, random_state=24)
rfc.fit(X_train, y_train)

rfc_scores = cross_val_score(rfc, X_train, y_train, cv=10, scoring='recall')
rfc_scores.mean()

0.16052631578947368

Let's give oversampling a chance.

In [14]:
from sklearn.utils import resample

In [12]:
# double the stroke examples
X_oversampled, y_oversampled = resample(X_train[y_train == 1],
                                        y_train[y_train == 1],
                                        replace=True,
                                        n_samples=y_train[y_train == 1].shape[0],
                                        random_state=24)

In [13]:
X_train_os = np.vstack((X_train, X_oversampled))
y_train_os = np.hstack((y_train, y_oversampled))

In [249]:
X_train_os.shape, X_train.shape

((4287, 12), (4088, 12))

In [250]:
y_train_os.shape, y_train.shape

((4287,), (4088,))

In [263]:
y_train.value_counts(), np.bincount(y_train_os)

(0    3889
 1     199
 Name: stroke, dtype: int64,
 array([3889,  398], dtype=int64))

In [16]:
dtc_os = DecisionTreeClassifier(max_features='auto', random_state=24)
# dtc_os.fit(X_train_os, y_train_os)

dtc_scores_os = cross_val_score(dtc_os, X_train_os, y_train_os, cv=10, scoring='recall')
dtc_scores_os.mean()

0.8474999999999999

In [294]:
etc_os = ExtraTreesClassifier(criterion='entropy', max_features=None, n_estimators=172, random_state=24)
# etc_os.fit(X_train_os, y_train_os)

etc_scores_os = cross_val_score(etc_os, X_train_os, y_train_os, cv=10, scoring='recall')
etc_scores_os.mean()

0.845

In [301]:
rfc_os = RandomForestClassifier(bootstrap=False, criterion='entropy', n_estimators=380, random_state=24)
rfc_os.fit(X_train_os, y_train_os)

rfc_scores_os = cross_val_score(rfc_os, X_train_os, y_train_os, cv=10, scoring='recall')
rfc_scores_os.mean()

0.8550000000000001

Wow! By doubling the stroke examples in the training set, the cross-validated recall became about 5.3 times higher (0.16 -> 0.855).

In [16]:
# triple the stroke examples
X_oversampled3, y_oversampled3 = resample(X_train[y_train == 1],
                                        y_train[y_train == 1],
                                        replace=True,
                                        n_samples=2*y_train[y_train == 1].shape[0],
                                        random_state=24)

In [17]:
X_train_os3 = np.vstack((X_train, X_oversampled3))
y_train_os3 = np.hstack((y_train, y_oversampled3))

In [18]:
X_train_os3.shape, X_train.shape

((4486, 12), (4088, 12))

In [19]:
y_train_os3.shape, y_train.shape

((4486,), (4088,))

In [276]:
dtc_os2 = DecisionTreeClassifier(max_features='auto', random_state=24)
# dtc_os2.fit(X_train_os3, y_train_os3)

dtc_scores_os2 = cross_val_score(dtc_os2, X_train_os3, y_train_os3, cv=10, scoring='recall')
dtc_scores_os2.mean()

0.9616666666666667

In [275]:
etc_os2 = ExtraTreesClassifier(criterion='entropy', max_features=None, n_estimators=172, random_state=24)
# etc_os2.fit(X_train_os3, y_train_os3)

etc_scores_os2 = cross_val_score(etc_os2, X_train_os3, y_train_os3, cv=10, scoring='recall')
etc_scores_os2.mean()

0.96

In [20]:
rfc_os2 = RandomForestClassifier(bootstrap=False, criterion='entropy', n_estimators=380, random_state=24)
rfc_os2.fit(X_train_os3, y_train_os3)

rfc_scores_os2 = cross_val_score(rfc_os2, X_train_os3, y_train_os3, cv=10, scoring='recall')
rfc_scores_os2.mean()

0.9633333333333333

It would be good to take a look at the confusion matrix.

In [21]:
rfc_pred_os = cross_val_predict(rfc_os2, X_train_os3, y_train_os3, cv=10)
confusion_matrix(y_train_os3, rfc_pred_os)

array([[3777,  112],
       [  22,  575]], dtype=int64)

Let's take a breath and verify the models on the test data.

In [22]:
y_pred = rfc_os2.predict(X_test)

In [23]:
confusion_matrix(y_test, y_pred)

array([[948,  24],
       [ 48,   2]], dtype=int64)

In [24]:
recall_score(y_test, y_pred)

0.04

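4% is not so good anymore, huh? After oversampling the training data the model became overfitted: the duplicated stroke rows end up in both the training and the validation folds, so the cross-validated recall above was inflated.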
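
One way to get a more honest cross-validation estimate is to oversample inside each training fold only, so that the validation fold never contains copies of training rows. The sketch below is an illustration under stated assumptions, not the approach used in this notebook: the data comes from `make_classification` as a synthetic stand-in, and the forest is kept small so it runs quickly.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import StratifiedKFold
from sklearn.utils import resample

# Synthetic stand-in for the preprocessed data (roughly 5% positives).
X, y = make_classification(n_samples=1000, n_features=12, weights=[0.95],
                           random_state=24)

fold_recalls = []
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=24)
for train_idx, valid_idx in cv.split(X, y):
    X_tr, y_tr = X[train_idx], y[train_idx]

    # Duplicate minority rows of the training fold only.
    X_pos = X_tr[y_tr == 1]
    X_extra = resample(X_pos, replace=True, n_samples=len(X_pos),
                       random_state=24)
    X_tr_os = np.vstack((X_tr, X_extra))
    y_tr_os = np.hstack((y_tr, np.ones(len(X_extra), dtype=y_tr.dtype)))

    model = RandomForestClassifier(n_estimators=50, random_state=24)
    model.fit(X_tr_os, y_tr_os)

    # The validation fold is left untouched by the oversampling.
    y_valid_pred = model.predict(X[valid_idx])
    fold_recalls.append(recall_score(y[valid_idx], y_valid_pred))

print("mean recall on untouched folds:", np.mean(fold_recalls))
```

With this setup the cross-validated recall should be closer to what can be expected on unseen data, because no validation row has a duplicate in the training fold.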

Let's give SMOTE a chance.

In [14]:
from imblearn.over_sampling import SMOTE

In [30]:
oversample = SMOTE(sampling_strategy=0.1, random_state=24)
X_smote, y_smote = oversample.fit_resample(X_train, y_train)

In [31]:
X_smote.shape, X_train.shape

((4277, 12), (4088, 12))

In [32]:
y_smote.shape, y_train.shape

((4277,), (4088,))

In [33]:
y_train.value_counts(), np.bincount(y_smote)

(0    3889
 1     199
 Name: stroke, dtype: int64,
 array([3889,  388], dtype=int64))

Let's see the results for "smoted" data.

In [47]:
rfc_smote = RandomForestClassifier(bootstrap=False, criterion='entropy', n_estimators=380, random_state=24)
rfc_smote.fit(X_smote, y_smote)

rfc_scores_smote = cross_val_score(rfc_smote, X_smote, y_smote, cv=10, scoring='recall')
rfc_scores_smote.mean()

0.4854925775978408

Let's see the impact of a higher minority class sample generation ratio.

In [48]:
oversample = SMOTE(sampling_strategy=0.2, random_state=24)
X_train_smote, y_train_smote = oversample.fit_resample(X_train, y_train)

In [49]:
'X', X_train_smote.shape, X_train.shape, 'y', y_train_smote.shape, y_train.shape

('X', (4666, 12), (4088, 12), 'y', (4666,), (4088,))

In [50]:
y_train.value_counts(), np.bincount(y_train_smote)

(0    3889
 1     199
 Name: stroke, dtype: int64,
 array([3889,  777], dtype=int64))

In [55]:
rfc_smote2 = RandomForestClassifier(bootstrap=False, criterion='entropy', n_estimators=380, random_state=24)
rfc_smote2.fit(X_train_smote, y_train_smote)

rfc_scores_smote2 = cross_val_score(rfc_smote2, X_train_smote, y_train_smote, cv=10, scoring='recall')
rfc_scores_smote2.mean()

0.6824841824841825

In [78]:
y_pred = rfc_smote2.predict(X_test)
recall_score(y_test, y_pred)

0.1

What about doubling it?

In [99]:
oversample = SMOTE(sampling_strategy=0.4, random_state=24)
X_train_smote2, y_train_smote2 = oversample.fit_resample(X_train, y_train)

In [100]:
'X', X_train_smote2.shape, X_train.shape, 'y', y_train_smote2.shape, y_train.shape

('X', (5444, 12), (4088, 12), 'y', (5444,), (4088,))

In [101]:
y_train.value_counts(), np.bincount(y_train_smote2)

(0    3889
 1     199
 Name: stroke, dtype: int64,
 array([3889, 1555], dtype=int64))

In [87]:
rfc_smote3 = RandomForestClassifier(bootstrap=False, criterion='entropy', n_estimators=380, random_state=24)
rfc_smote3.fit(X_train_smote2, y_train_smote2)

rfc_scores_smote3 = cross_val_score(rfc_smote3, X_train_smote2, y_train_smote2, cv=10, scoring='recall')
rfc_scores_smote3.mean()

0.8271505376344086

In [88]:
y_pred = rfc_smote3.predict(X_test)
recall_score(y_test, y_pred)

0.14

Let's jump to a 1:1 ratio.

In [94]:
oversample = SMOTE(sampling_strategy=1, random_state=24)
X_train_smote11, y_train_smote11 = oversample.fit_resample(X_train, y_train)

In [95]:
'X', X_train_smote11.shape, X_train.shape, 'y', y_train_smote11.shape, y_train.shape

('X', (7778, 12), (4088, 12), 'y', (7778,), (4088,))

In [96]:
y_train.value_counts(), np.bincount(y_train_smote11)

(0    3889
 1     199
 Name: stroke, dtype: int64,
 array([3889, 3889], dtype=int64))

In [155]:
rfc_smote11 = RandomForestClassifier(bootstrap=False, criterion='entropy', n_estimators=380, random_state=24)
rfc_smote11.fit(X_train_smote11, y_train_smote11)

rfc_scores_smote11 = cross_val_score(rfc_smote11, X_train_smote11, y_train_smote11, cv=10, scoring='recall')
rfc_scores_smote11.mean()

0.9318573927331514

In [156]:
y_pred = rfc_smote11.predict(X_test)
recall_score(y_test, y_pred)

0.16

There is not much improvement above 0.3 (checked silently). The issue may be with the hyperparameters, so let's run a GridSearchCV on the training set with newly generated stroke examples for a SMOTE sampling ratio of 0.4.

In [150]:
param_grid = [
    {'n_estimators': [x for x in range(400, 401, 10)],
#      'criterion': ['gini', 'entropy'], #, 'log_loss'],
#      'criterion': ['entropy'],
#      'max_depth': [None]+[x for x in range(1, 20)],
#      'max_depth': [None]+[1, 10, 20],
#      'max_depth': [5, 15, 25],
     'max_depth': [14, 15, 16],
#      'min_samples_split': [5, 20, 45],
     'min_samples_split': [2, 3, 4],
#      'min_samples_leaf': [1, 2, 3, 4],
#      'max_features': [None]+['auto', 'sqrt', 'log2'],
#      'bootstrap': [True, False],
     'bootstrap': [False],
#      'max_samples': [None, 1000, 1500, 2000],
#      'class_weight': [None, 'balanced', 'balanced_subsample']
    }]

In [151]:
%%time
rf_clf = RandomForestClassifier(random_state=24)
grid_search = GridSearchCV(rf_clf, param_grid, cv=10, scoring='recall')
grid_search.fit(X_train_smote2, y_train_smote2)

Wall time: 2min 54s


GridSearchCV(cv=10, estimator=RandomForestClassifier(random_state=24),
             param_grid=[{'bootstrap': [False], 'max_depth': [14, 15, 16],
                          'min_samples_split': [2, 3, 4],
                          'n_estimators': [400]}],
             scoring='recall')

In [152]:
grid_search.best_estimator_

RandomForestClassifier(bootstrap=False, max_depth=15, min_samples_split=4,
                       n_estimators=400, random_state=24)

In [153]:
# rfc_smote3 = RandomForestClassifier(n_estimators=410, random_state=24) 
# rfc_smote3 = RandomForestClassifier(criterion='entropy', n_estimators=400, random_state=24)
# rfc_smote3 = RandomForestClassifier(max_depth=20, n_estimators=400, random_state=24)
# rfc_smote3 = RandomForestClassifier(max_depth=15, n_estimators=400, random_state=24)
# rfc_smote3 = RandomForestClassifier(min_samples_split=20, n_estimators=420, random_state=24)
# rfc_smote3 = RandomForestClassifier(bootstrap=False, max_depth=15, min_samples_split=5, n_estimators=400, random_state=24)
# rfc_smote3 = RandomForestClassifier(bootstrap=False, max_depth=15, min_samples_split=3, n_estimators=400, random_state=24)
rfc_smote3 = RandomForestClassifier(bootstrap=False, max_depth=15, min_samples_split=4, n_estimators=400, random_state=24)
rfc_smote3.fit(X_train_smote2, y_train_smote2) 
# rfc_smote3.fit(X_train_smote11, y_train_smote11)

rfc_scores_smote3 = cross_val_score(rfc_smote3, X_train_smote2, y_train_smote2, cv=10, scoring='recall')
# rfc_scores_smote3 = cross_val_score(rfc_smote3, X_train_smote11, y_train_smote11, cv=10, scoring='recall')
rfc_scores_smote3.mean()

0.850281224152192

In [154]:
y_pred = rfc_smote3.predict(X_test)
recall_score(y_test, y_pred)

0.18

In [161]:
# checked for min_samples_split= 3, 4, 5
rfc_smote11 = RandomForestClassifier(bootstrap=False, max_depth=15, min_samples_split=5, n_estimators=400, random_state=24)
rfc_smote11.fit(X_train_smote11, y_train_smote11)

rfc_scores_smote11 = cross_val_score(rfc_smote11, X_train_smote11, y_train_smote11, cv=10, scoring='recall')
rfc_scores_smote11.mean()

0.9532021042588715

In [162]:
y_pred = rfc_smote11.predict(X_test)
recall_score(y_test, y_pred)

0.3

In [163]:
confusion_matrix(y_test, y_pred)

array([[852, 120],
       [ 35,  15]], dtype=int64)

30% is better than the previous 16% but still not so good.

Let's combine the SMOTE and undersampling (for majority class).

In [13]:
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline as imPipeline

In [216]:
# smote 0.4 & under 0.6 24%
# smote 0.6 & under 0.8 28%
# smote 0.6 & under 1 30%
# smote 0.4 & under 0.8 30%
# smote 0.4 & under 1 36%
# smote 0.5 & under 1 30%
# smote 0.8 & under 1 32%
# smote 0.45 & under 1 30%

oversample = SMOTE(sampling_strategy=0.4, random_state=24)
undersample = RandomUnderSampler(sampling_strategy=1, random_state=24)
resample_pipeline = imPipeline([('smote', oversample), ('undersample', undersample)])
X_train_ou, y_train_ou = resample_pipeline.fit_resample(X_train, y_train)

In [217]:
'X', X_train_ou.shape, X_train.shape, 'y', y_train_ou.shape, y_train.shape

('X', (3110, 12), (4088, 12), 'y', (3110,), (4088,))

In [218]:
y_train.value_counts(), np.bincount(y_train_ou)

(0    3889
 1     199
 Name: stroke, dtype: int64,
 array([1555, 1555], dtype=int64))
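
The class counts printed for the different sampling ratios follow from simple arithmetic on the 3889 non-stroke and 199 stroke rows. A minimal sketch: the helpers `smote_count` and `undersample_count` are hypothetical, and truncating to an integer is an assumption that happens to match the outputs shown in this notebook.

```python
def smote_count(n_majority, n_minority, ratio):
    """Minority rows after oversampling to minority/majority == ratio.

    Assumption: the target count is truncated to an integer.
    """
    return max(n_minority, int(n_majority * ratio))


def undersample_count(n_majority, n_minority, ratio):
    """Majority rows after undersampling to minority/majority == ratio."""
    return min(n_majority, int(n_minority / ratio))


n_neg, n_pos = 3889, 199
for ratio in (0.1, 0.2, 0.4, 1):
    print(ratio, smote_count(n_neg, n_pos, ratio))

# SMOTE at 0.4 followed by undersampling at 1 (the setup above).
n_pos_new = smote_count(n_neg, n_pos, 0.4)
n_neg_new = undersample_count(n_neg, n_pos_new, 1)
print(n_neg_new, n_pos_new, n_neg_new + n_pos_new)  # 1555 1555 3110
```

So with this setup 1356 of the 1555 stroke rows are synthetic, while 2334 real non-stroke rows are discarded.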

In [219]:
rfc_ou = RandomForestClassifier(bootstrap=False, max_depth=15, min_samples_split=5, n_estimators=400, random_state=24)
rfc_ou.fit(X_train_ou, y_train_ou)

rfc_scores_ou = cross_val_score(rfc_ou, X_train_ou, y_train_ou, cv=10, scoring='recall')
rfc_scores_ou.mean()

0.9169354838709678

In [220]:
y_pred = rfc_ou.predict(X_test)
recall_score(y_test, y_pred)

0.36

SMOTE has a k_neighbors argument which defines the number of nearest neighbors used for generating new samples. Let's run some tests for the previous setup.
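
To make the role of k_neighbors concrete, here is a simplified sketch of the interpolation idea behind SMOTE. It is not imblearn's implementation: the function `smote_like_samples` and the toy data are made up for illustration, and it assumes purely numeric features and more minority rows than `k_neighbors`.

```python
import numpy as np


def smote_like_samples(X_minority, n_new, k_neighbors=5, seed=24):
    """Create synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)

    # Pairwise Euclidean distances between minority rows.
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a row is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k_neighbors]

    # Pick a base row, one of its k nearest neighbours, and a random
    # position on the segment between them.
    base = rng.integers(0, len(X_minority), size=n_new)
    picked = neighbours[base, rng.integers(0, k_neighbors, size=n_new)]
    gap = rng.random((n_new, 1))
    return X_minority[base] + gap * (X_minority[picked] - X_minority[base])


X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_samples(X_min, n_new=6, k_neighbors=2)
print(synthetic.shape)  # (6, 2)
```

A larger k_neighbors lets a new sample be drawn towards more distant minority rows, which spreads the synthetic points over a wider region of the feature space.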

In [227]:
for kn_no in range(1, 14):
    print('k_neighbors:', kn_no)
    oversample = SMOTE(sampling_strategy=0.4, k_neighbors=kn_no, random_state=24)
    undersample = RandomUnderSampler(sampling_strategy=1, random_state=24)
    resample_pipeline = imPipeline([('smote', oversample), ('undersample', undersample)])
    X_train_ou, y_train_ou = resample_pipeline.fit_resample(X_train, y_train)
    
    rfc_ou = RandomForestClassifier(bootstrap=False, max_depth=15, min_samples_split=5, n_estimators=400, random_state=24)
    rfc_ou.fit(X_train_ou, y_train_ou)

    rfc_scores_ou = cross_val_score(rfc_ou, X_train_ou, y_train_ou, cv=10, scoring='recall')
    print('Train recall mean', rfc_scores_ou.mean())
    
    y_pred = rfc_ou.predict(X_test)
    print('Test recall mean', recall_score(y_test, y_pred))

The best result is for k_neighbors=10.

In [15]:
oversample = SMOTE(sampling_strategy=0.4, k_neighbors=10, random_state=24)
undersample = RandomUnderSampler(sampling_strategy=1, random_state=24)
resample_pipeline = imPipeline([('smote', oversample), ('undersample', undersample)])
X_train_ou, y_train_ou = resample_pipeline.fit_resample(X_train, y_train)

rfc_ou = RandomForestClassifier(bootstrap=False, max_depth=15, min_samples_split=5, n_estimators=400, random_state=24)
rfc_ou.fit(X_train_ou, y_train_ou)

rfc_scores_ou = cross_val_score(rfc_ou, X_train_ou, y_train_ou, cv=10, scoring='recall')
print('Train recall mean', rfc_scores_ou.mean())

y_pred = rfc_ou.predict(X_test)
print('Test recall mean', recall_score(y_test, y_pred))

Train recall mean 0.9111290322580645
Test recall mean 0.46


It is 46%. Almost 12 times more than the 4% test recall obtained with plain oversampling.

Another thing to consider is changing cv so that each split partition is stratified.

In [16]:
from sklearn.model_selection import RepeatedStratifiedKFold

In [17]:
splitter = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=1)

In [18]:
rfc_ou = RandomForestClassifier(bootstrap=False, max_depth=15, min_samples_split=5, n_estimators=400, random_state=24)
rfc_ou.fit(X_train_ou, y_train_ou)

rfc_scores_ou = cross_val_score(rfc_ou, X_train_ou, y_train_ou, cv=splitter, scoring='recall')
print('Train recall mean', rfc_scores_ou.mean())

y_pred = rfc_ou.predict(X_test)
print('Test recall mean', recall_score(y_test, y_pred))

Train recall mean 0.9120347394540943
Test recall mean 0.46


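Seems the difference is barely noticeable. That is expected: for a classifier, an integer cv already makes cross_val_score use stratified folds, so only the repetition is new.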

The best result so far is 46%. Let's take a look at the confusion matrix.

In [236]:
confusion_matrix(y_test, y_pred)

array([[821, 151],
       [ 27,  23]], dtype=int64)

Does not look good. There are 151 wrongly predicted strokes (false positives) and only 23 of the 50 actual strokes were detected.

Changing metric from "recall" to "roc_auc".

In [21]:
from sklearn.metrics import roc_auc_score

In [None]:
# splitter = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=24)

In [239]:
rfc_ou = RandomForestClassifier(bootstrap=False, max_depth=15, min_samples_split=5, n_estimators=400, random_state=24)
rfc_ou.fit(X_train_ou, y_train_ou)

rfc_scores_ou = cross_val_score(rfc_ou, X_train_ou, y_train_ou, cv=splitter, scoring='roc_auc')
print('Train roc_auc mean', rfc_scores_ou.mean())

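y_pred = rfc_ou.predict(X_test)
# Note: y_pred holds hard 0/1 labels; roc_auc_score is normally given scores such as predict_proba(X_test)[:, 1].
print('Test roc_auc mean', roc_auc_score(y_test, y_pred))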

Train roc_auc mean 0.9549619520264682
Test roc_auc mean 0.6523251028806585
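
The test value above can be reproduced from the confusion matrix printed earlier, because `roc_auc_score` received hard 0/1 predictions rather than probabilities. With a single operating point the ROC curve consists of two straight segments, and the area under it equals the mean of sensitivity and specificity. A small sketch; the helper name `auc_from_hard_labels` is made up for illustration.

```python
def auc_from_hard_labels(conf_matrix):
    """ROC AUC when the scores passed in are 0/1 predictions.

    Assumed layout: [[TN, FP], [FN, TP]].
    """
    (tn, fp), (fn, tp) = conf_matrix
    sensitivity = tp / (tp + fn)  # recall of the stroke class
    specificity = tn / (tn + fp)  # recall of the non-stroke class
    return (sensitivity + specificity) / 2


# Test-set confusion matrix of the k_neighbors=10 model.
value = auc_from_hard_labels([[821, 151], [27, 23]])
print(round(value, 4))  # 0.6523
```

The cross-validated figure is computed from continuous scores, so the two numbers are not measured in the same way. Passing `rfc_ou.predict_proba(X_test)[:, 1]` to `roc_auc_score` would make them comparable; that value was not computed in this notebook.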


Let's perform a grid search with GridSearchCV.

In [266]:
param_grid = [
    {'n_estimators': [x for x in range(370, 771, 20)],
#      'criterion': ['gini', 'entropy'], #, 'log_loss'],
     'criterion': ['entropy'],
#      'max_depth': [None]+[x for x in range(1, 20)],
#      'max_depth': [None]+[1, 10, 20],
#      'max_depth': [5, 15, 25],
#      'max_depth': [14, 15, 16],
#      'min_samples_split': [5, 20, 45],
#      'min_samples_split': [2, 3, 4],
#      'min_samples_leaf': [1, 2, 3, 4],
#      'max_features': [None]+['auto', 'sqrt', 'log2'],
#      'bootstrap': [True, False],
#      'bootstrap': [False],
#      'max_samples': [None, 1000, 1500, 2000],
#      'class_weight': [None, 'balanced', 'balanced_subsample']
    }]

In [267]:
%%time
rf_clf = RandomForestClassifier(random_state=24)
grid_search = GridSearchCV(rf_clf, param_grid, cv=splitter, scoring='roc_auc')
grid_search.fit(X_train_ou, y_train_ou)

Wall time: 35min 47s


GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=5, n_splits=10, random_state=1),
             estimator=RandomForestClassifier(random_state=24),
             param_grid=[{'criterion': ['entropy'],
                          'n_estimators': [370, 390, 410, 430, 450, 470, 490,
                                           510, 530, 550, 570, 590, 610, 630,
                                           650, 670, 690, 710, 730, 750,
                                           770]}],
             scoring='roc_auc')

In [268]:
grid_search.best_estimator_

RandomForestClassifier(criterion='entropy', n_estimators=710, random_state=24)

In [19]:
def get_train_test_roc_auc(rfc_ou):
    rfc_ou.fit(X_train_ou, y_train_ou)

    rfc_scores_ou = cross_val_score(rfc_ou, X_train_ou, y_train_ou, cv=splitter, scoring='roc_auc')
    print('Train roc_auc mean', rfc_scores_ou.mean())

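    y_pred = rfc_ou.predict(X_test)
    # Note: y_pred holds hard 0/1 labels; roc_auc_score is normally given scores such as predict_proba(X_test)[:, 1].
    print('Test roc_auc mean', roc_auc_score(y_test, y_pred))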

In [269]:
get_train_test_roc_auc(grid_search.best_estimator_)

Train roc_auc mean 0.9547220843672456
Test roc_auc mean 0.626954732510288


In [22]:
get_train_test_roc_auc(RandomForestClassifier(criterion='entropy', n_estimators=710, random_state=24))

Train roc_auc mean 0.9547220843672456
Test roc_auc mean 0.626954732510288


In [26]:
get_train_test_roc_auc(RandomForestClassifier(n_estimators=1200, max_features=0.5, max_depth=4, min_samples_leaf=5, random_state=24))

Train roc_auc mean 0.9046600496277916
Test roc_auc mean 0.7074485596707819


Let's start the search over. This time, let's separate the data into train, validation and test sets. The last one will be locked until the model and hyperparameters are chosen.
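
A three-way split can be sketched with two calls to `train_test_split`. This is only an illustration under stated assumptions: the data is a synthetic stand-in, and the 60/20/20 proportions and the use of `stratify` are choices made for the example, not something fixed by this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows, 12 features, roughly 5% positives.
rng = np.random.default_rng(24)
X = rng.normal(size=(1000, 12))
y = (rng.random(1000) < 0.05).astype(int)

# First lock away the test set, then split the rest into train/validation.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=24)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=24)

print(len(y_train), len(y_valid), len(y_test))  # 600 200 200
```

Resampling and hyperparameter choices (for example k_neighbors) would then be compared on the validation set, and the test set would be used once at the very end.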