# Stroke Prediction
Stroke is the second leading cause of death worldwide (WHO). Even when an individual survives a stroke, severe after-effects such as spasticity and cognitive problems are common.

Since stroke is a major health problem, it is crucial to know the risk factors behind it. In this notebook, we will investigate these factors in the dataset and build a model that predicts stroke.

The data set contains 11 features and 1 target variable. The target variable is the 'stroke' column, which is binary: 0 = no stroke, 1 = stroke.

Anyway, let's start digging.


#### Importing packages 

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import numpy as np
from scipy import stats
sns.set()
pd.set_option('display.max_columns', 60)


import warnings
warnings.filterwarnings("ignore")

In [None]:

# Read the csv file
df = pd.read_csv('../input/stroke-prediction-dataset/healthcare-dataset-stroke-data.csv')

In [None]:
# First and last 5 rows of the data
display(df.head())
df.tail()

In [None]:
print('The Stroke data has {0} rows and {1} columns'.format(df.shape[0],df.shape[1]))

Let's see:
The data consists of 5110 rows and 12 columns. The first column is the 'id' column, which we will probably want to remove. Most of the other columns seem to be categorical variables, except age, average glucose level and BMI. We will check the info of the data in case any column has an unsuitable type.

After that, we should check whether there are missing values and, if so, what proportion of the data they make up.

## Investigating Target Variable

In [None]:
# Since this is a classification problem, let's investigate the proportion of stroke variables.
df['stroke'].value_counts()

4861 vs 249. We have class imbalance here. From what I learnt (thanks, Google and YouTube), we can use Spread Subsampling or the Synthetic Minority Over-sampling Technique (SMOTE). Spread Subsampling means deleting some of the rows with a 0 (no stroke) value. However, I don't think this is a good idea for this dataset: only 249 participants had a stroke, so we would need to delete roughly 4,600 rows to balance the classes. I think it is better to use SMOTE and keep a decent amount of data.

We'll handle this after dealing with the missing values.
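To put numbers on the imbalance, here is a quick arithmetic sketch that uses only the two counts printed above (4861 and 249). The variable names are illustrative and not part of the notebook.

```python
# Counts taken from the value_counts() output above
no_stroke_count = 4861
stroke_count = 249
total = no_stroke_count + stroke_count

# Share of the minority class in the full data set
minority_share = stroke_count / total * 100
print(f"Stroke share: {minority_share:.2f}%")

# Rows that plain undersampling would have to drop to reach a 1:1 ratio
rows_to_drop = no_stroke_count - stroke_count
print(f"Rows dropped by 1:1 undersampling: {rows_to_drop}")
```

Dropping that many rows would leave fewer than 500 observations, which is why oversampling looks more attractive here.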

In [None]:
# remove Id columns and investigate column types
df = df.drop(['id'], axis=1)
df.info()

The data types seem good. No changes needed

In [None]:
# investigate the means, medians, min-max of the data.
df.describe()

The age column seems to have an issue: 0.08 years of age might be wrong, so we'll dive into that. A BMI of about 10 also looks suspicious, but we need to check whether it is plausible. I really wish the data had weight and height information as well.

In [None]:
missings = pd.DataFrame(columns=['Columns','Missing','Percentage'])

In [None]:
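# DataFrame.append was removed in pandas 2.0, so collect the rows in a list first
rows = []
for x in df.columns:
    n_missing = df[x].isna().sum()
    if n_missing > 0:
        rows.append({'Columns': x, 'Missing': n_missing,
                     'Percentage': n_missing / len(df[x]) * 100})
missings = pd.DataFrame(rows, columns=['Columns','Missing','Percentage'])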

In [None]:
missings

Only the BMI column has missing values, and they make up a really low proportion of the data. We could simply fill them with the mean, but I want to learn KNN imputation here, so I will use it to predict the missing values :)

In [None]:
#first import KNN imputer
from sklearn.impute import KNNImputer

In [None]:
# check the describe line for the mean of the BMI
#mean 28 std 7.8

In [None]:
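# select the columns by name and keep df's index so the result lines up with df
impute_cols = ['age','avg_glucose_level','bmi','stroke']
imputer = KNNImputer(n_neighbors=5)
imputed = pd.DataFrame(imputer.fit_transform(df[impute_cols]), columns=impute_cols, index=df.index)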



Basically, we imputed the BMI values according to their neighbourhoods: for each row with a missing BMI, the KNN imputer finds the 5 nearest rows according to age, avg_glucose_level and stroke, and fills in the mean of their BMI values.
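A minimal sketch of what the imputer does, on a hypothetical five-row array (the numbers are made up and `n_neighbors=2` is chosen only to keep the example readable):

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy data: columns are [age, glucose, bmi]; the last row has a missing bmi
toy = np.array([
    [30.0,  90.0, 22.0],
    [32.0,  95.0, 24.0],
    [60.0, 150.0, 30.0],
    [62.0, 155.0, 32.0],
    [31.0,  92.0, np.nan],
])

# With n_neighbors=2 the missing bmi becomes the mean bmi of the 2 closest rows,
# where closeness is measured on the columns that are present (age, glucose)
toy_imputer = KNNImputer(n_neighbors=2)
filled = toy_imputer.fit_transform(toy)
print(filled[-1])
```

The last row is closest to the first two rows, so its BMI is filled with the mean of 22 and 24.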

In [None]:
# lets see if the mean was changed
display(imputed.describe())
imputed.info()

We have no missing values, and the mean and std do not seem to have changed (we can ignore the changes in the decimals).
Now replace the BMI column of df with the imputed one.

In [None]:
df['bmi'] = imputed['bmi']

In [None]:
df

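#### Creating new features: has_diabetes and is_obese

According to the American Diabetes Association criteria, a fasting plasma glucose level of 126 mg/dL or above is accepted as diabetes mellitus. Since the dataset only has an average glucose level, has_diabetes is a rough proxy rather than a diagnosis.

A BMI of 30 or above is obese.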

In [None]:
df['has_diabetes'] = [(1 if i >= 126 else 0) for i in df['avg_glucose_level']]
df['is_obese'] = [1 if i>=30 else 0 for i in df['bmi']]
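The same flags can also be built with a vectorized comparison instead of a list comprehension. This is only a sketch on a hypothetical toy frame, using the thresholds stated above (126 for glucose, 30 for BMI):

```python
import pandas as pd

# Hypothetical toy rows, just to show the thresholding
toy = pd.DataFrame({
    "avg_glucose_level": [85.0, 125.9, 126.0, 210.5],
    "bmi": [18.5, 29.9, 30.0, 41.2],
})

# Boolean comparison on the whole column, then cast to 0/1
toy["has_diabetes"] = (toy["avg_glucose_level"] >= 126).astype(int)
toy["is_obese"] = (toy["bmi"] >= 30).astype(int)
print(toy)
```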

In [None]:
df

## Outlier detection 

In [None]:
# possible outliers are in Age and BMI columns as mentioned above. 

In [None]:
# Reference to Ceren İyim github, link : https://github.com/cereniyim/Tree-Classification-ML-Model
def outlier_function(df, col_name):
    ''' This function finds the first and third quartiles and the interquartile range of a given dataframe column,
    then calculates the upper and lower limits (1.5 * IQR rule) to determine outliers conservatively.
    Returns the lower limit, the upper limit and the number of outliers, in that order.
    '''
    first_quartile = np.percentile(df[col_name], 25)
    third_quartile = np.percentile(df[col_name], 75)
    IQR = third_quartile - first_quartile
                      
    upper_limit = third_quartile+(1.5*IQR)
    lower_limit = first_quartile-(1.5*IQR)
    outlier_count = 0
                      
    for value in df[col_name].tolist():
        if (value < lower_limit) | (value > upper_limit):
            outlier_count +=1
    return lower_limit, upper_limit, outlier_count
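As a quick worked example of the 1.5 * IQR rule used in this function, here is the same calculation on a small made-up array (the values are hypothetical and chosen so that exactly one point is extreme):

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

# Quartiles and interquartile range
q1 = np.percentile(values, 25)
q3 = np.percentile(values, 75)
iqr = q3 - q1

# Anything outside [q1 - 1.5*IQR, q3 + 1.5*IQR] counts as an outlier
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
mask = (values < lower) | (values > upper)

print("limits:", lower, upper)
print("outliers:", values[mask])
```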

In [None]:
numerics= df.select_dtypes(include='float64')
for column in numerics.columns:
    if outlier_function(numerics, column)[2] > 0:
        print("There are {} outliers in {}".format(outlier_function(df, column)[2], column))

In [None]:
# I want to investigate the age column additionally, although no outliers were detected. Ages below 1 may be a problem
age_invest = df[df['work_type']=='children']
display(age_invest)
age_invest.describe()

It is obvious that I was wrong about the age issue: the ages below 1 are months converted to years, and these values belong to children. So I will keep the age data as it is.

TLDR: the age column is good, back to outlier removal.

In [None]:
# I'm going to remove outliers by glucose levels in which we had 166 outliers.
df = df[(df['avg_glucose_level'] > outlier_function(df, 'avg_glucose_level')[0]) &
              (df['avg_glucose_level'] < outlier_function(df, 'avg_glucose_level')[1])]


In [None]:
df.shape

In [None]:
df.describe()

Outliers done, now it is time to do some visualization

In [None]:
#Before graphs, lets see the means of the two groups (stroke and healthy)

means = df.loc[:,['age','avg_glucose_level','bmi','stroke']]
counts = df.loc[:,['gender', 'hypertension','heart_disease','ever_married','work_type','Residence_type','smoking_status',
                  'has_diabetes','is_obese','stroke']]
display(df['stroke'].value_counts())
display(means.groupby(means['stroke']).mean())
display(counts.groupby(counts['stroke']).sum())

People who had a stroke are older, have higher blood sugar levels and have a slightly higher BMI.

In [None]:
print('the percentage of stroke in the data set is :', sum(df['stroke']==1)*100/len(df))

According to the Global Burden of Disease reports, stroke prevalence in the world is 1,180.40 per 100,000 population, which is 1.18%.

Reference: https://www.world-stroke.org/assets/downloads/WSO_Fact-sheet_15.01.2020.pdf

However, to make appropriate predictions, we need more samples of the stroke class. As mentioned above, I will use SMOTE to create more data.


## Data Visualization

### All data visuals

In [None]:
fig, (ax1,ax2,ax3) = plt.subplots(1,3, figsize=(30,10))

sns.countplot(x=df['stroke'], ax = ax1)
ax1.set_xticklabels( ['No Stroke', 'Stroke'])
ax1.set_title('Numbers of stroke and healthy individuals')
ax1.text( s = df['stroke'].value_counts()[0],
         x = 0,
         y = (df['stroke'].value_counts()[0]) * 0.8,
        fontsize = 20)

ax1.text( s = df['stroke'].value_counts()[1],
         x = 1,
         y = (df['stroke'].value_counts()[1]),
        fontsize = 20)

sns.violinplot(x=df['stroke'], y=df['age'], ax = ax2)
ax2.set_title('Age Proportions of Healthy and Individuals with Stroke')


sns.histplot(data = df, x = 'age', hue='stroke', ax = ax3)
miny_lim, y_lim = ax3.get_ylim()
ax3.axvline(df[df['stroke']==1]['age'].mean(), linestyle='--', color='r')
ax3.axvline(df[df['stroke']==0]['age'].mean(), linestyle='--')
ax3.text(s = f"Mean Age (Stroke) : \n {df[df['stroke']==1]['age'].mean():.2f}",
         y = y_lim * 0.75, x =df[df['stroke']==1]['age'].mean() )
ax3.text(s = f"Mean Age : \n {df[df['stroke']==0]['age'].mean():.2f}",
         y = y_lim * 0.9, x =df[df['stroke']==0]['age'].mean()+3 )
ax3.set_title('Age Proportions of Healthy and Individuals with Stroke')

plt.show()

It seems that stroke is more likely to occur at older ages. (Yes, I know this is too obvious.)

In [None]:
strokes = df[df['stroke']==1]
no_stroke = df[df['stroke']==0]
fig, ax = plt.subplots(5,2, figsize=(20,12))

ax[0,0].pie(strokes['hypertension'].value_counts(normalize=True),
           autopct= '%1.1f%%',
           shadow = True,
            labels = ['no hypertension', 'hypertension'],
           startangle = 90,
           explode = [0,0.2])
ax[0,0].set_title('Hypertension Percentages of Clients with a Stroke')

ax[0,1].pie(no_stroke['hypertension'].value_counts(normalize=True),
           autopct= '%1.1f%%',
           shadow = True,
           startangle = 90,
           explode = [0,0.2])
ax[0,1].set_title('Hypertension Percentages of Clients without a Stroke')

ax[1,0].pie(strokes['gender'].value_counts(normalize=True),
           autopct= '%1.1f%%',
           shadow = True,
           labels = strokes['gender'].value_counts().index.tolist(),  # labels follow the value_counts order
           startangle = 90,
           explode = [0,0.2])
ax[1,0].set_title('Gender Percentages of Clients with a Stroke')

ax[1,1].pie(no_stroke['gender'].value_counts(normalize=True),
           autopct= '%1.1f%%',
           shadow = True,
            labels = no_stroke['gender'].value_counts().index.tolist(),  # labels follow the value_counts order
           startangle = 90,
           )
ax[1,1].set_title('Gender Percentages of Clients without a Stroke')


ax[2,0].pie(strokes['ever_married'].value_counts(normalize=True),
           autopct= '%1.1f%%',
           shadow = True,
            labels = ['yes', 'no'],
           startangle = 90,
           explode = [0,0.2])
ax[2,0].set_title('Marital Status Percentages of Clients with a Stroke')


ax[2,1].pie(no_stroke['ever_married'].value_counts(normalize=True),
           autopct= '%1.1f%%',
           shadow = True,
            labels = ['yes', 'no'],
           startangle = 90,
           explode = [0,0.2])
ax[2,1].set_title('Marital Status Percentages of Clients without a Stroke')

ax[3,0].pie(strokes['smoking_status'].value_counts(normalize=True),
           autopct= '%1.1f%%',
           shadow = True,
           labels = strokes['smoking_status'].value_counts().index.tolist(),  # labels follow the value_counts order
           startangle = 90,
           #explode = [0,0.2]
           )
ax[3,0].set_title('Smoking Status Percentages of Clients with a Stroke')

ax[3,1].pie(no_stroke['smoking_status'].value_counts(normalize=True),
           autopct= '%1.1f%%',
           shadow = True,
          labels = no_stroke['smoking_status'].value_counts().index.tolist(),  # labels follow the value_counts order
           startangle = 90,
           #explode = [0,0.2]
           )
ax[3,1].set_title('Smoking Status Percentages of Clients without a Stroke')

ax[4,0].pie(strokes['is_obese'].value_counts(normalize=True),
           autopct= '%1.1f%%',
           shadow = True,
           labels = strokes['is_obese'].value_counts().index.map({0: 'not obese', 1: 'obese'}).tolist(),
           startangle = 90,
           #explode = [0,0.2]
           )
ax[4,0].set_title('Obesity Percentages of Clients with a Stroke')

ax[4,1].pie(no_stroke['is_obese'].value_counts(normalize=True),
           autopct= '%1.1f%%',
           shadow = True,
            labels = no_stroke['is_obese'].value_counts().index.map({0: 'not obese', 1: 'obese'}).tolist(),
           startangle = 90,
           #explode = [0,0.2]
           )
ax[4,1].set_title('Obesity Percentages of Clients without a Stroke')


plt.show()

## Observations of Plots

As we can see from the pie charts:

* Clients with a stroke have higher hypertension rates than clients with no stroke.

* We cannot tell anything from the gender distributions; however, according to the literature, men have a higher chance of having a stroke.

* Let's make a biased claim here: IF YOU ARE MARRIED, YOU ARE MORE LIKELY TO HAVE A STROKE :)
        
        Well, it is hard to say that.
        
        Let's recall the histogram that we plotted for the age distributions. The clients with a stroke had a higher mean age than the clients with no stroke, and people are more likely to be married once they are past a certain age. So I wouldn't say that marriage affects the chance of having a stroke.
        
* Nearly all smoking groups have similar proportions; however, the share of smokers is higher among those who had a stroke. Again, it is hard to conclude 'smoking increases your chances of having a stroke' from this dataset.

* And yet again, we cannot draw a conclusion from the obesity levels. Maybe we can create labels from the obesity data.
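The marriage point above is a confounding argument: age drives both marriage and stroke. One way to check such a claim is to compare stroke rates within age bands. The sketch below uses a hypothetical toy frame (all values are made up) in which stroke risk depends only on the age band:

```python
import pandas as pd

# Hypothetical toy data: stroke risk depends only on age band, not on marriage
toy = pd.DataFrame({
    "age_band":     ["<40"] * 6 + ["40+"] * 6,
    "ever_married": ["No", "No", "No", "No", "Yes", "Yes",
                     "No", "No", "Yes", "Yes", "Yes", "Yes"],
    "stroke":       [0, 0, 0, 0, 0, 0,
                     1, 0, 1, 0, 1, 0],
})

# Crude comparison: married people look riskier
crude = toy.groupby("ever_married")["stroke"].mean()
print(crude)

# Stratified comparison: within each age band the rates are identical
stratified = toy.groupby(["age_band", "ever_married"])["stroke"].mean()
print(stratified)
```

On the real data, the age bands could be built with `pd.cut` on the age column; the bin edges would be an assumption to choose and justify.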

In [None]:
fig, (ax1,ax2) = plt.subplots(1,2, figsize=(30,10))

sns.histplot(data = df, x= 'bmi', hue= 'stroke', ax = ax1)
ax1.set_title('BMI distributions of  stroke and healthy individuals')
ax1.text( s = 'Stroke mean (red) : ' "{:.2f}".format(strokes['bmi'].mean()),
         x = 33,
         y = 250,
       fontsize = 20)

ax1.text( s = 'No stroke mean (blue): ' "{:.2f}".format(no_stroke['bmi'].mean()),
         x = 33,
         y = 225,
       fontsize = 20)
ax1.axvline(strokes['bmi'].mean(), linestyle='--', color='r')
ax1.axvline(no_stroke['bmi'].mean(), linestyle='--', color='b')

sns.violinplot(x=df['stroke'], y=df['bmi'], ax = ax2)
ax2.set_title('BMI Distributions of Healthy and Individuals with Stroke')



plt.show()
# the clients who had a stroke gathered around 30 bmi, means are close

In [None]:
fig , (ax1,ax2) = plt.subplots(1,2, figsize=(12,6))
sns.countplot(x= 'work_type', hue= 'stroke', data=df, ax=ax1)
_ = ax1.set_xticklabels(labels = df['work_type'].unique(),rotation=60, ha='right')
ax1.set_title('Work Types Among Different Diagnosis')

sns.countplot(x= 'Residence_type', hue= 'stroke', data=df, ax = ax2)
_ = ax2.set_title('Stroke Counts among Residence Type')

plt.show()

Work type: I don't know how things work in the country where the data was gathered, but I assume that working in the private sector or running your own business is more stressful than a government job, since you have to earn your position in a competitive environment.

Residence: Where you live does not seem to matter much for having a stroke.

# Encoding and Scaling

In [None]:
from sklearn.preprocessing import LabelEncoder

In [None]:
lab_enc= LabelEncoder()

In [None]:
# Let's see (don't scroll back to top, lazy style) the columns to be encoded
for x in df.select_dtypes(include='object'):
    print(df[x].value_counts())

In [None]:
lab_enc_data= df.loc[:,['gender','ever_married','Residence_type','work_type','smoking_status']]
for x in lab_enc_data.columns:
    lab_enc_data[x]=lab_enc.fit_transform(lab_enc_data[x])

In [None]:
# I know this is the hard way, but I want to preserve the df and add the new columns afterwards.
# Overwrite these columns in df with the encoded ones from lab_enc_data, using a for loop
for x in lab_enc_data.columns:
    df[x]=lab_enc_data[x]

In [None]:
df.head()

Appending completed.

One more thing to do. It would have been better to handle this earlier, but anyway: let's check whether the numeric data is normally distributed.

In [None]:
def bayesian_dist(column, data):
    """Takes a column name (str) and a dataframe,
    shows the distribution plot of the column and prints its skewness and kurtosis"""
    # histplot with kde=True replaces the deprecated distplot
    sns.histplot(data[column], kde=True)
    plt.title(column)
    plt.show()
    print('skewness: ', stats.skew(data[column]))
    print('kurtosis: ', stats.kurtosis(data[column]))
    

def normal_visual(column, df):
    """Takes a column name (str) and a dataframe.
    Shows the distribution plot, skewness and kurtosis of the column and
    prints the Shapiro-Wilk and Kolmogorov-Smirnov test results.
    """
    bayesian_dist(column, df)
    # run each test once and reuse the (statistic, p value) pair
    shapiro_stat, shapiro_p = stats.shapiro(df[column])
    ks_stat, ks_p = stats.kstest(df[column], 'norm', args=(df[column].mean(), df[column].std()))
    print('*'* 30)
    print(column, 'Shapiro-Wilk test statistic: ', "{:.2f}".format(shapiro_stat))
    print(column, 'Shapiro-Wilk test p value: ', "{:.2f}".format(shapiro_p))
    print('*'*30)
    print(column, 'Kolmogorov-Smirnov test statistic: ', "{:.2f}".format(ks_stat))
    print(column, 'Kolmogorov-Smirnov test p value: ', "{:.2f}".format(ks_p))
    
    

In [None]:
for x in means.columns:
    if x != 'stroke':
        normal_visual(x, df)

We will accept the data as normally distributed if the skewness is in the range (-0.5, 0.5) and the kurtosis is in the range (-3, 3).

Skewness and kurtosis seem OK (except for bmi), but the histograms don't show a normal distribution.

Let's investigate the p values. p > 0.05 is accepted as consistent with a normal distribution. So age and bmi might be normally distributed, but glucose level is not. Now let's look at the Kolmogorov-Smirnov test results, which are another way to examine normality.

The KS test results also indicate that these data are not normally distributed.

We'll use Box-Cox transformation for these variables.
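To get a feel for what Box-Cox does before applying it to the real columns, here is a small sketch on a hypothetical right-skewed sample (a log-normal draw, chosen only because it is strictly positive and clearly skewed):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Hypothetical right-skewed, strictly positive sample (Box-Cox needs values > 0)
skewed = rng.lognormal(mean=0.0, sigma=0.8, size=2000)

# boxcox returns the transformed data and the lambda it fitted by maximum likelihood
transformed, fitted_lambda = stats.boxcox(skewed)

print("skewness before:", stats.skew(skewed))
print("skewness after: ", stats.skew(transformed))
print("fitted lambda:  ", fitted_lambda)
```

The skewness should move much closer to 0 after the transformation. Note that Box-Cox only accepts strictly positive values, which holds here for age, glucose level and BMI.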

In [None]:
# boxcox transformation 

for x in means.columns:
    if x != 'stroke':
        means[x] = stats.boxcox(df[x])[0]

In [None]:
for x in means.columns:
    if x != 'stroke':
        normal_visual(x, means)

They seem ok for now. Now append these new columns to the df dataframe

In [None]:
for x in means.columns:
    if x != 'stroke':
        df[x]=means[x]

# and check if it's ok
df.head()

## Dealing with Class Imbalance with SMOTE

In [None]:
from sklearn.model_selection import train_test_split
target = df['stroke']
predictors = df.drop('stroke', axis=1)
x_train, x_test, y_train, y_test = train_test_split(predictors,target, train_size=0.7,
                                                    random_state= 42, stratify = target.values)


display(x_train.shape)
display(y_train.shape)
display(x_test.shape)
display(y_test.shape)

In [None]:
# SMOTE method for class imbalance, 
X = df.drop('stroke', axis=1)
y= df['stroke']
from imblearn.over_sampling import SMOTE
oversample = SMOTE(sampling_strategy= 0.4, random_state=42)
X_sm,y_sm = oversample.fit_resample(x_train,y_train,)
print('Data shapes before oversampling were {0} and {1}'.format(x_train.shape, y_train.shape))
print('Data shapes after oversampling are {0} and {1}'.format(X_sm.shape, y_sm.shape))
y_sm.value_counts()

In [None]:
# undersampling 
from imblearn.under_sampling import NearMiss
nearmiss = NearMiss(sampling_strategy = 0.7)
X_us, y_us = nearmiss.fit_resample(X_sm, y_sm)
print('Data shapes after oversampling were {0} and {1}'.format(X_sm.shape, y_sm.shape))
print('Data shapes after undersampling are {0} and {1}'.format(X_us.shape, y_us.shape))
y_us.value_counts()

# Model Interpretation

In [None]:

from sklearn.metrics import accuracy_score

#ML algoritms
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import AdaBoostClassifier


#Performance metrics
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import cross_val_score

In [None]:
# Non-oversampled data ML
model_accuracy = pd.DataFrame(columns=['Model','Accuracy'])
models = {"LR": LogisticRegression(),
          "NB": GaussianNB(),
          "KNN" : KNeighborsClassifier(),
          "DT" : DecisionTreeClassifier(),
          'RFC' : RandomForestClassifier(),
          'ABC' : AdaBoostClassifier(),
          'GBC' : GradientBoostingClassifier(),
          'DTC' : DecisionTreeClassifier(),
          }

for test, clf in models.items():
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    acc = accuracy_score(y_test,y_pred)
    train_pred = clf.predict(x_train)
    train_acc = accuracy_score(y_train, train_pred)
    print( test + ' scores')
    print(acc)
    print(classification_report(y_test, y_pred))
    print(confusion_matrix(y_test, y_pred))
    print('*' * 100)
    # DataFrame.append was removed in pandas 2.0; pd.concat adds the row instead
    model_accuracy = pd.concat([model_accuracy, pd.DataFrame([{'Model': test, 'Accuracy': acc, 'Train_acc': train_acc}])],
                               ignore_index=True)

In [None]:
model_accuracy

Accuracy scores are good, ranging from 0.90 to 0.94. Seems OK, right?

#### NO !!

Let's investigate the confusion matrices:

All models are good at predicting 'no stroke'. On the other hand, their performance at predicting 'stroke' is very poor. However, the model needs to predict 'stroke', which makes up only about 4% of the data set. Since they mostly just produce true negatives, these models are useless.
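Why high accuracy can hide a useless model is easy to show with a made-up example: a "classifier" that always answers 'no stroke' on a hypothetical test set of 95 healthy and 5 stroke cases.

```python
from sklearn.metrics import accuracy_score, recall_score, confusion_matrix

# Hypothetical test set: 95 healthy (0) and 5 stroke (1) cases
y_true = [0] * 95 + [1] * 5
# A "model" that always predicts no stroke
y_always_negative = [0] * 100

acc = accuracy_score(y_true, y_always_negative)
rec = recall_score(y_true, y_always_negative)  # share of stroke cases that were found
cm = confusion_matrix(y_true, y_always_negative)

print("accuracy:", acc)
print("recall for stroke:", rec)
print(cm)
```

The accuracy looks great, yet not a single stroke case is detected, so recall on the stroke class is the number to watch.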

Let's try the models with oversampled data.


In [None]:
#resampled data train_test_split
resX_train, resX_test, resy_train, resy_test = train_test_split(X_us, y_us, train_size= 0.7,
                                                               random_state=42)

print(resX_train.shape, resX_test.shape)
print(resy_train.shape, resy_test.shape)

In [None]:
# oversampled data ML
model_accuracy = pd.DataFrame(columns=['Model','Accuracy'])
models = {"LR": LogisticRegression(),
          "NB": GaussianNB(),
          "KNN" : KNeighborsClassifier(),
          "DT" : DecisionTreeClassifier(),
          'RFC' : RandomForestClassifier(),
          'ABC' : AdaBoostClassifier(),
          'GBC' : GradientBoostingClassifier(),
          'DTC' : DecisionTreeClassifier(),
          }

for test, clf in models.items():
    clf.fit(resX_train, resy_train)
    y_pred = clf.predict(resX_test)
    acc = accuracy_score(resy_test,y_pred)
    train_pred = clf.predict(resX_train)
    train_acc = accuracy_score(resy_train, train_pred)
    print( test + ' scores')
    print(acc)
    print(classification_report(resy_test,y_pred))
    print(confusion_matrix(resy_test,y_pred))
    print('*' * 100)
    # DataFrame.append was removed in pandas 2.0; pd.concat adds the row instead
    model_accuracy = pd.concat([model_accuracy, pd.DataFrame([{'Model': test, 'Accuracy': acc, 'Train_acc': train_acc}])],
                               ignore_index=True)

In [None]:
model_accuracy.sort_values('Accuracy', ascending=False)

### Random Forest to go
As we can see, RFC has an accuracy score of 95%, and its precision, recall and F1-scores are also 95%. By resampling the data, we handled the class imbalance.

Let's try hyperparameter tuning for the RFC algorithm.

In [None]:
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
rfc = RandomForestClassifier()


# number of estimators , default = 100
n_estimators = [int(x) for x in np.linspace(start=10, stop=100, num=10)]
# number of features for every split; options are 'sqrt', 'log2' and None (= n_features). 'auto' is no longer accepted by recent scikit-learn versions
max_features = ['sqrt','log2']
#maximum depth for trees
max_depth = [2,4,6,8,10, None]
#min_samples_leaf
min_samples_leaf = [1,2]
#min_samples_split
min_samples_split = [2,5,10]
#bootstrap
bootstrap = [True, False]

#Create random grid
param_grid = {'n_estimators': n_estimators,
             'max_features' : max_features,
             'max_depth' : max_depth,
             'min_samples_leaf' : min_samples_leaf,
             'min_samples_split' : min_samples_split,
             'bootstrap' : bootstrap}



In [None]:
rf_grid = RandomizedSearchCV(estimator= rfc, param_distributions = param_grid, n_iter= 100,  cv=3, n_jobs=2, verbose=2)

In [None]:
rf_grid.fit(X_sm, y_sm)
rf_grid.best_params_

In [None]:
# number of estimators , default = 100
n_estimators = [29,30,31]
# number of  features for every split, default= auto, we have options sqrt(same as auto), log2, None( = n_features)
max_features = [1,3,5]
#maximum depth for trees
max_depth = [ None, 2, 5]
#min_samples_leaf
min_samples_leaf = [1]
#min_samples_split (an integer value must be at least 2)
min_samples_split = [2,3,4]
#bootstrap
bootstrap = [False]

#Create random grid
param_grid = {'n_estimators': n_estimators,
             'max_features' : max_features,
             'max_depth' : max_depth,
             'min_samples_leaf' : min_samples_leaf,
             'min_samples_split' : min_samples_split,
             'bootstrap' : bootstrap}

In [None]:
gr_grid = GridSearchCV(estimator= rfc, param_grid = param_grid,cv=3, n_jobs=-1, verbose=2)

In [None]:
gr_grid.fit(X_sm, y_sm)
gr_grid.best_params_ , gr_grid.best_score_

We can clearly see the best parameters for the Random Forest Classifier model.

In [None]:
rfc = RandomForestClassifier(bootstrap= False, max_depth=None, max_features= 3, min_samples_leaf = 1, 
                            min_samples_split=3, n_estimators= 31)
rfc.fit(resX_train, resy_train)
y_pred = rfc.predict(resX_test)
acc = accuracy_score(resy_test,y_pred)
print(acc)
print(accuracy_score(rfc.predict(resX_train), resy_train))
print(classification_report(resy_test, y_pred))
print(confusion_matrix(resy_test, y_pred))

### ROC Curves and AUC

Continuing with model evaluation.

In [None]:
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score

In [None]:
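# use predicted probabilities instead of hard labels so the curve covers many thresholds
y_score = rfc.predict_proba(resX_test)[:, 1]
fpr, tpr, threshold = roc_curve(resy_test, y_score)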
plt.plot(fpr, tpr)
plt.title('ROC Curve for Stroke Prediction')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()



In [None]:
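# AUC should be computed from the predicted probabilities, not from the hard labels
roc_auc_score(resy_test, rfc.predict_proba(resX_test)[:, 1])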
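The difference between feeding hard labels and probabilities into `roc_curve` can be seen on synthetic data. This sketch uses `make_classification` with made-up settings, not the stroke data, and all names are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, noisy, imbalanced data (assumed settings, for illustration only)
X_demo, y_demo = make_classification(n_samples=600, n_features=8, weights=[0.8, 0.2],
                                     flip_y=0.1, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3,
                                      stratify=y_demo, random_state=0)

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(Xtr, ytr)

hard_labels = clf.predict(Xte)            # only 0 or 1
scores = clf.predict_proba(Xte)[:, 1]     # probability of class 1

fpr_hard, tpr_hard, _ = roc_curve(yte, hard_labels)
fpr_soft, tpr_soft, _ = roc_curve(yte, scores)

print("ROC points from hard labels:  ", len(fpr_hard))
print("ROC points from probabilities:", len(fpr_soft))
print("AUC from probabilities:", roc_auc_score(yte, scores))
```

With hard labels the curve collapses to a single operating point between the two corners, while probabilities trace out the full curve.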

## Feature Importance evaluation

In [None]:
feature_imp = rfc.feature_importances_

In [None]:
feature_imp

In [None]:
sns.barplot(x=X_sm.columns, y=feature_imp)
plt.xticks(rotation=60)

In [None]:
features = X_sm.loc[:,['age','work_type','Residence_type','avg_glucose_level','bmi','smoking_status','is_obese']]

In [None]:
resX_train, resX_test, resy_train, resy_test = train_test_split(features, y_sm, test_size=0.3, random_state=42)

In [None]:
rfc.fit(resX_train, resy_train)
y_pred = rfc.predict(resX_test)
acc = accuracy_score(resy_test,y_pred)
print(acc)
print(classification_report(resy_test, y_pred))
print(confusion_matrix(resy_test, y_pred))

Selecting features by importance didn't change the model accuracy significantly.

# Discussion

This was a good exercise for me: I learnt about the Box-Cox transformation, class imbalance, SMOTE and the NearMiss method. I also practiced model interpretation, Gaussian (normal) distribution checks with skewness and kurtosis values, and hyperparameter tuning, and tried feature importance (way to go, btw).

Anyway,
Random Forest Classification did a good job of predicting clients with a stroke. Not only are the accuracy scores good, but the precision, recall, F1-scores, ROC curve and AUC score are good as well.

PS:
I learned that I had made a mistake by applying SMOTE to all of the data, so I changed the process as follows:
* Split the data into train and test datasets.
* Used SMOTE to increase the number of 'stroke' samples in the training data, then used NearMiss to decrease the number of 'no stroke' samples. This way I avoided generating too much synthetic stroke data.
* Trained the model and made predictions.

Also, we see that the random forest classifier model overfits.
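A rough sketch of the "split first, resample only the training part" idea, with a check for overfitting. To keep it dependency-free it uses synthetic data and plain random oversampling via `sklearn.utils.resample` as a stand-in for SMOTE and NearMiss, so it is not the exact procedure used in this notebook, and all names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic imbalanced data (assumed settings, roughly 5% positives)
X_all, y_all = make_classification(n_samples=2000, n_features=10, weights=[0.95, 0.05],
                                   flip_y=0.02, random_state=42)

# 1) Split first, so the test set keeps the original class ratio
X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, test_size=0.3,
                                          stratify=y_all, random_state=42)

# 2) Oversample the minority class in the training part only
majority = X_tr[y_tr == 0]
minority = X_tr[y_tr == 1]
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=42)
X_bal = np.vstack([majority, minority_up])
y_bal = np.array([0] * len(majority) + [1] * len(minority_up))

# 3) Train on the balanced data, score on the untouched test set
model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_bal, y_bal)
train_acc = accuracy_score(y_bal, model.predict(X_bal))
test_acc = accuracy_score(y_te, model.predict(X_te))
test_recall = recall_score(y_te, model.predict(X_te))

print("train accuracy:", train_acc)
print("test accuracy: ", test_acc)
print("test recall for the minority class:", test_recall)
```

A large gap between train and test scores, or a low minority recall on the untouched test set, is the kind of overfitting signal mentioned above.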