# Table of Contents

**What is covered in this notebook?**
* [Importing Libraries](#1)
* [Load Dataset](#2)
* [Data Cleaning and EDA](#3)
* [Data Preprocessing](#4)
* [Modeling](#5)
* [Feature Importance](#6)
* [Try To Predict](#7)
* [Conclusion](#8)

# Importing Libraries
<a id='1'></a>

We start by importing the common libraries we need to build our model: numpy, pandas, matplotlib, seaborn, and scikit-learn

In [None]:
import pandas as pd
import numpy as np

from matplotlib import colors
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import seaborn as sns

from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, classification_report

import warnings
warnings.filterwarnings('ignore')

# Load Dataset
<a id='2'></a>

In [None]:
df = pd.read_csv('../input/stroke-prediction-dataset/healthcare-dataset-stroke-data.csv')
df.head()

In [None]:
df.shape

# Data Cleaning and EDA
<a id='3'></a>

**In this part we focus on 2 things:**
1. Handling missing values (by filling or dropping them), removing irrelevant features, and getting rid of outliers
2. Getting insights and patterns by visualizing the data

So, let's jump into the process!
First of all, we need some basic information to get a better understanding of the dataset

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df.isna().sum()

**From the information above, we can summarize:**
* There are categorical features in the data, so we need to encode them into numeric features later on
* Only 1 column has missing values: the bmi column
* A small minority of the samples are infants less than 1 year old

In [None]:
#set color for data visualization
sns.set(rc={"axes.facecolor":"#EAE0D5","figure.facecolor":"#EAE0D5", "grid.color":"#C6AC8F",
            "axes.edgecolor":"#C6AC8F", "axes.labelcolor":"#0A0908", "xtick.color":"#0A0908",
            "ytick.color":"#0A0908"})

palettes = ['#9B856A', '#475962', '#598392', '#124559', '#540B0E']
cmap = colors.ListedColormap(['#9B856A', '#124559', '#475962', '#598392'])

**Additional information:**
* axes.facecolor --> set color for background inside x and y axis
* figure.facecolor --> set color for background outside x and y axis where labels placed
* grid.color --> set color for grid line
* axes.edgecolor --> set color for line x and y axis
* axes.labelcolor --> set color for labels in the plot
* xtick.color --> set color for the values of x axis
* ytick.color --> set color for the values of y axis

In [None]:
#'height' replaces the 'size' argument, which was removed from recent seaborn versions
sns.pairplot(data=df, y_vars='bmi', x_vars=['age', 'avg_glucose_level'], hue='Residence_type',
             height=5, palette=['#9B856A', '#475962'])

From the plot above, most people have a bmi between 10 and 60, and there is no visible difference between people who live in urban and rural areas. Both groups are distributed similarly

In [None]:
df['gender'].value_counts()

Oh look, there is 1 record whose gender is neither Male nor Female. Let's see if there is a relation between gender and bmi. Hopefully we can classify the Other gender as either Female or Male

In [None]:
plt.figure(figsize=(8,5))
sns.boxplot(data=df, x='gender', y='bmi', palette=palettes)

From the plot above, there's no way to tell whether the Other gender is Female or Male, because the bmi of both groups is similar. So, we drop the Other gender

In [None]:
df = df[df['gender'] != 'Other']

Next, we want to know the distribution of gender in our dataset

In [None]:
#total number of male and female in dataset
value = df['gender'].value_counts().sort_values().values

#percentage of male and female
percent = (df['gender'].value_counts()*100/len(df)).sort_values().values

idx = df['gender'].value_counts().sort_values().index.values

In [None]:
plt.figure(figsize=(8,5))
ax = sns.barplot(data=df, x=idx, y=percent, palette=palettes, edgecolor=palettes)

#set y axis to percentage
ax.yaxis.set_major_formatter(mtick.PercentFormatter())

ax.set_title('Distribution of Gender in Dataset', weight='bold', fontsize=14)
ax.set_xlabel('gender')
ax.set_ylabel('% of gender')

#place text in barplot
for i,v in enumerate(percent):
    #(x position, y position, text, ...)
    ax.text(i, v-10, 'Total:{}\n{:.2f}%'.format(value[i],v), horizontalalignment='center', weight='bold', color='white', fontsize=14)

Let's have a look at the correlation among the features (Excluding the categorical attributes at this point)

In [None]:
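#keep only numeric columns, recent pandas versions raise an error when corr() meets text columns
sns.heatmap(df.select_dtypes(include='number').corr(), annot=True, fmt='.2f', cmap=cmap)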

The id column is only an identifier, so it carries no information for predicting stroke. Furthermore, the correlation between bmi and stroke is very low (under 0.05). So, we drop both features

In [None]:
df.drop(['id', 'bmi'], axis=1, inplace=True)

Let's explore age column in our dataset

In [None]:
ax = sns.kdeplot(data=df, x='age', palette=palettes, color='#475962', fill=True)

ax.annotate('Highest Density', weight='bold', xy=(50,0.016), xytext=(-10,0.015),
            arrowprops=dict(facecolor='#475962', edgecolor='#475962', shrink=0.05))

In [None]:
sns.jointplot(data=df, x='age', y='heart_disease', hue='stroke', kind='kde', fill=False, palette=['#9B856A', '#475962'])

In [None]:
sns.jointplot(data=df, x='age', y='avg_glucose_level', hue='stroke', kind='kde', fill=False, palette=['#9B856A', '#475962'])

**What we get from each plot:**
* In this dataset, most people are between around 40 and 60 years old
* Older people are more likely to have heart disease than people under 40 years old
* People above 50 years old are more likely to have a stroke

Let's see what we can get from the plot below. Is there something interesting?

In [None]:
#'height' replaces the 'size' argument, which was removed from recent seaborn versions
g = sns.FacetGrid(data=df, row='work_type', col='Residence_type', hue='stroke',
                  height=2.5, aspect=2, palette=palettes)
g.map(plt.scatter, 'age', 'avg_glucose_level', edgecolor='#EAE0D5', lw=0.2)

**Some interesting insights:**
* Stroke doesn't seem to depend on where people live. The distribution of stroke in urban areas is similar to that in rural areas
* 2 children have stroke

In [None]:
df[df['stroke'] == 1]['age'].sort_values()

Because those 2 children are outliers, I think we can remove both of them from our dataset

In [None]:
df.drop(df.index[[162,245]], inplace=True)
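The cell above removes the two rows by position, which only works as long as the row order does not change. As a hedged alternative, here is a minimal sketch that selects the rows by condition instead. It uses hypothetical toy data (not the notebook's dataset), and the age cut-off of 18 is an assumption made for illustration.

```python
import pandas as pd

# Hypothetical toy data, not the notebook's dataset
toy = pd.DataFrame({
    'age':    [2.0, 14.0, 45.0, 67.0, 80.0, 8.0],
    'stroke': [1,   1,    0,    1,    1,    0],
})

# Assumed cut-off: treat anyone under 18 as a child
child_with_stroke = (toy['stroke'] == 1) & (toy['age'] < 18)

# Keep every row that does not match the condition
cleaned = toy[~child_with_stroke]
print(cleaned)
```

Filtering by condition states the intent directly (children with a stroke) and does not depend on hard-coded row numbers.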

In [None]:
ax = sns.violinplot(data=df, x='hypertension', y='age', hue='gender', split=True, palette=palettes)

In [None]:
ax = sns.countplot(data=df, x='stroke', hue='gender', palette=palettes, edgecolor=palettes)

for patch in ax.patches:
    clr = patch.get_facecolor()
    patch.set_edgecolor(clr)

If we look at the visualization, the total number of females who had a stroke is bigger than the number of males

Next, let's focus on the smoking status column. We will do some visualizations to get a better understanding of the effect of smoking

In [None]:
smoke = df['smoking_status'].value_counts()

In [None]:
fig, ax = plt.subplots(figsize =(8, 5))
wedges, texts, autotexts = ax.pie(x=smoke, autopct="%.2f%%", labels=smoke.index, explode=[0,0.15,0,0], colors=palettes,
        radius=1.5, wedgeprops={ 'linewidth' : 1, 'edgecolor' : '#EAE0D5' }, textprops=dict(fontsize=14))

ax.set_title('Distribution of Smoking Status in Dataset', y=1.3, weight='bold', fontsize=14)

for autotext in autotexts:
    autotext.set_color('white')
    autotext.set_weight('bold')

In [None]:
ax = sns.countplot(data=df, hue='work_type', y='smoking_status', palette=palettes, orient='h')

#to change edgecolor
for patch in ax.patches:
    clr = patch.get_facecolor()
    patch.set_edgecolor(clr)

ax.legend(bbox_to_anchor=(1.4, 0.4))
sns.despine()

In [None]:
sns.boxplot(data=df, x='smoking_status', y='age', palette=palettes)

The 3 plots above show that there is an Unknown smoking status that might influence our model. Fortunately, the plot of smoking status against work type gives us some insight: the Unknown status contains a large number of children, and we can assume that children have never smoked. With that assumption, we merge the Unknown status into the never smoked status (note that this also relabels the adults whose status is Unknown)

**In the next step, we replace the statuses in the following manner:**
* smokes and formerly smoked --> smoke
* never smoked and Unknown --> never smoked

In [None]:
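#named differently from the 'smoke' Series above so the function does not overwrite it
def group_smoking_status(text):
    #'Unknown' is treated as 'never smoked', every other status is grouped as 'smoke'
    if text in ('never smoked', 'Unknown'):
        return 'never smoked'
    return 'smoke'

In [None]:
df['smoking_status'] = df['smoking_status'].apply(group_smoking_status)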

In [None]:
ax = sns.barplot(data=df, x='smoking_status', y='heart_disease', hue='hypertension', palette=palettes, ci=None)

#to change edgecolor
for patch in ax.patches:
    clr = patch.get_facecolor()
    patch.set_edgecolor(clr)

sns.despine()

It seems that heart disease is more likely to occur in someone with hypertension, and that smokers are more likely to have hypertension

In [None]:
ax = sns.barplot(data=df, x='smoking_status', y='stroke',
            palette=palettes, edgecolor=palettes, ci=None)

ax.set_title('Chance of Getting Stroke Based on Smoking Behavior', y=1.1, weight='bold', fontsize=14)

Well, in this dataset the share of people with a stroke is around 2.5 percentage points higher among smokers than among people who never smoked
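To make the arithmetic behind that comparison explicit, here is a minimal sketch on hypothetical toy data (the numbers are made up and are not the notebook's results). Since stroke is a 0/1 column, its mean per group is the share of people with a stroke in that group.

```python
import pandas as pd

# Hypothetical toy data, not the notebook's dataset
toy = pd.DataFrame({
    'smoking_status': ['smoke'] * 4 + ['never smoked'] * 5,
    'stroke':         [1, 0, 0, 0,    0, 0, 0, 0, 1],
})

# The mean of a 0/1 column is the share of positives in each group
rate = toy.groupby('smoking_status')['stroke'].mean() * 100

# Difference between the two groups, in percentage points
diff_points = rate['smoke'] - rate['never smoked']

print(rate.round(2))
print('Difference: {:.2f} percentage points'.format(diff_points))
```

In the toy data 1 of 4 smokers (25%) and 1 of 5 non-smokers (20%) had a stroke, so the difference is 5 percentage points.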

In [None]:
data_prob = pd.crosstab(index=df['stroke'], columns=df['smoking_status'], normalize='index')
data_prob = data_prob*100
data_prob

In [None]:
ax = data_prob.plot.bar(rot=0, stacked=True, color=palettes, edgecolor='#EAE0D5')
ax.yaxis.set_major_formatter(mtick.PercentFormatter())
ax.set_title('Distribution of Smoking Status Based on Whether They Have Stroke or Not', x=0.6, y=1.1, weight='bold', fontsize=14)
ax.set_ylabel('% of smoking_status')

for i,ns in enumerate(data_prob['never smoked']):
    ax.text(i, ns/2, '{:.2f}%'.format(ns), ha='center', fontsize=13, weight='bold', color='white')

for i,s in enumerate(data_prob['smoke']):
    ax.text(i, 100-s/2, '{:.2f}%'.format(s), ha='center', fontsize=13, weight='bold', color='white')
            
ax.legend(bbox_to_anchor=(1.05, 1))
sns.despine()

# Data Preprocessing
<a id='4'></a>

In this part, we are going to encode the object features into integers with a label encoder and then apply feature scaling to the dataset

In [None]:
#get all object features
obj_feat = df.dtypes[df.dtypes == 'O'].index.values

In [None]:
le = LabelEncoder()

for i in obj_feat:
    df[i] = le.fit_transform(df[i])
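It helps to know which integer each category receives, because we need those codes when we feed values to the model by hand. The sketch below shows how LabelEncoder assigns codes: it sorts the labels and uses the position of each label as its code. The list of work_type names is an assumption about the categories in this dataset, and `le_demo` and `mapping` are names introduced only for this illustration.

```python
from sklearn.preprocessing import LabelEncoder

# Assumed category names of the work_type column
work_type = ['Private', 'Self-employed', 'Govt_job', 'children', 'Never_worked']

le_demo = LabelEncoder()
codes = le_demo.fit_transform(work_type)

# classes_ holds the sorted labels, the position of a label is its integer code
mapping = {str(label): code for code, label in enumerate(le_demo.classes_)}
print(mapping)
```

Uppercase letters sort before lowercase ones, which is why 'children' ends up with the highest code.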

In [None]:
df.head()

In [None]:
X = df.drop('stroke', axis=1)
y = df['stroke']

In [None]:
scaler = StandardScaler()

scaler.fit(X)
X_scaled = scaler.transform(X)
X = pd.DataFrame(X_scaled, columns=X.columns)

In [None]:
X.head()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=11)
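Note that the scaler above is fitted on the whole dataset before the split, so the test rows influence the mean and standard deviation used for scaling. A common alternative is to split first and fit the scaler on the training part only. The sketch below shows that order on hypothetical random data. The names `X_demo`, `y_demo`, and `demo_scaler` are made up for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical random data, not the notebook's dataset
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50, scale=10, size=(100, 2))
y_demo = rng.integers(0, 2, size=100)

# Split first
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=11)

# Fit the scaler on the training part only, then apply it to both parts
demo_scaler = StandardScaler().fit(X_tr)
X_tr_scaled = demo_scaler.transform(X_tr)
X_te_scaled = demo_scaler.transform(X_te)

# The training columns are centered at 0, the test columns only approximately
print(X_tr_scaled.mean(axis=0).round(6))
print(X_te_scaled.mean(axis=0).round(3))
```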

# Modeling
<a id='5'></a>

We will use 6 classification models. With the help of cross validation, we evaluate the performance of each model using its recall and accuracy scores. We take the mean of each score across the folds and use these means to decide which model performs best for this particular case

**Metrics we use:**
* Accuracy --> Ratio of correctly predicted observations to the total number of observations
* Recall --> Ratio of correctly predicted positive observations to all observations that are actually positive
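As a small worked example of both metrics, the sketch below counts the outcomes by hand on hypothetical labels (1 means stroke, 0 means no stroke). The values are made up for illustration only.

```python
# Hypothetical true labels and predictions
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # strokes we caught
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # strokes we missed
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # correct negatives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms

accuracy_demo = (tp + tn) / len(y_true)
recall_demo = tp / (tp + fn)

print('Accuracy: {:.2f}'.format(accuracy_demo))
print('Recall: {:.2f}'.format(recall_demo))
```

Here accuracy is 0.70 while recall is only 0.33, which shows how a model can look accurate on imbalanced data and still miss most of the positive cases.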

In [None]:
all_model = [LogisticRegression(), KNeighborsClassifier(), DecisionTreeClassifier(),
            RandomForestClassifier(), BernoulliNB(), SVC()]

In [None]:
recall = []
accuracy = []

for model in all_model:
    cv = cross_val_score(model, X_train, y_train, scoring='recall', cv=10).mean()
    recall.append(cv)

    cv = cross_val_score(model, X_train, y_train, scoring='accuracy', cv=10).mean()
    accuracy.append(cv)

In [None]:
model = ['LogisticRegression', 'KNeighborsClassifier', 'DecisionTreeClassifier',
         'RandomForestClassifier', 'BernoulliNB', 'SVC']

score = pd.DataFrame({'Model': model, 'Accuracy': accuracy, 'Recall': recall})
score.style.background_gradient(cmap=cmap,high=1,axis=0)

The performance of each model is shown in the table above. In this particular case, it's better to have a high recall, since we don't want to predict that someone has no stroke when they actually have one. Based on this consideration, we choose the Decision Tree Classifier as our final model

However, because the dataset is imbalanced, the mean recall of the models is very low, so their performance is not yet good enough for real-world use
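One common way to react to class imbalance, which this notebook does not use, is to weight the classes inversely to their frequency. The sketch below compares a plain decision tree with one that uses `class_weight='balanced'` on a synthetic imbalanced dataset. Everything here (the data, the 95/5 class ratio, the names `X_imb`, `y_imb`, `cv_demo`) is an assumption for illustration, and whether the weighting helps on the stroke data would have to be checked.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with roughly 5% positives, not the notebook's dataset
X_imb, y_imb = make_classification(n_samples=1000, n_features=6,
                                   weights=[0.95, 0.05], random_state=11)

# Stratified folds keep the class ratio similar in every fold
cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=11)

plain = cross_val_score(DecisionTreeClassifier(random_state=11),
                        X_imb, y_imb, scoring='recall', cv=cv_demo)
weighted = cross_val_score(DecisionTreeClassifier(class_weight='balanced', random_state=11),
                           X_imb, y_imb, scoring='recall', cv=cv_demo)

print('Mean recall, plain tree:    {:.3f}'.format(plain.mean()))
print('Mean recall, weighted tree: {:.3f}'.format(weighted.mean()))
```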

In [None]:
dtc = DecisionTreeClassifier()

dtc.fit(X_train, y_train)

In [None]:
pred = dtc.predict(X_test)

In [None]:
print(confusion_matrix(y_test, pred, labels=(1,0)))
print(classification_report(y_test, pred))

# Feature Importance
<a id='6'></a>

In [None]:
pd.DataFrame(dtc.feature_importances_, index=X.columns, columns=['Feature Importance']).sort_values(by='Feature Importance').plot.bar(color=palettes, edgecolor='#EAE0D5')

For this particular dataset, avg_glucose_level and age are the most important features for determining whether someone has a high or low risk of stroke. Feature importance alone does not tell us the direction of the effect, but together with the earlier plots it suggests that a higher blood glucose level and an older age go with a higher risk of stroke

# Try To Predict
<a id='7'></a>

In [None]:
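def prediction(feat_value):
    #the model was trained on scaled features, so scale the raw input before predicting
    feat_df = pd.DataFrame(feat_value, columns=X.columns)
    scaled = pd.DataFrame(scaler.transform(feat_df), columns=X.columns)
    return dtc.predict(scaled)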

**We want to try predicting the output using 2 scenarios provided below**

**Scenario 1**
* Gender: Male (1)
* Age: 65
* Hypertension: True (1)
* Heart_disease: False (0)
* Ever_married: True (1)
* Work_type: Self-employed (3)
* Residence_type: Urban (1)
* Avg_glucose_level: 200
* Smoking_status: Smoke (1)

In [None]:
prediction([[1, 65, 1, 0, 1, 3, 1, 200, 1]])

**Scenario 2**
* Gender: Female (0)
* Age: 40
* Hypertension: False (0)
* Heart_disease: False (0)
* Ever_married: True (1)
* Work_type: Govt_job (0)
* Residence_type: Rural (0)
* Avg_glucose_level: 160
* Smoking_status: Never_smoked (0)

In [None]:
prediction([[0, 40, 0, 0, 1, 0, 0, 160, 0]])

We end up getting the output 1 (high risk) for scenario 1 and output 0 (low risk) for scenario 2

# Conclusion
<a id='8'></a>

In this notebook, we built a machine learning model to predict whether someone has a high risk of stroke or not. We decided to use the Decision Tree Classifier as our final model, since it has the highest recall score and a good accuracy score as well. From the feature importance analysis of our model, glucose level and age play the main role in determining the output

**If you find this notebook useful, please upvote**

**Thanks**