# About the dataset

### Name : Stroke Prediction Dataset
### Link : https://www.kaggle.com/fedesoriano/stroke-prediction-dataset

### Context

According to the World Health Organization (WHO), stroke is the 2nd leading cause of death globally, responsible for approximately 11% of total deaths.
This dataset is used to predict whether a patient is likely to have a stroke based on input parameters such as gender, age, various diseases, and smoking status. Each row in the data provides relevant information about one patient.


### Attribute Information
1) id: unique identifier

2) gender: "Male", "Female" or "Other"

3) age: age of the patient

4) hypertension: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension

5) heart_disease: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease

6) ever_married: "No" or "Yes"

7) work_type: "children", "Govt_job", "Never_worked", "Private" or "Self-employed"

8) Residence_type: "Rural" or "Urban"

9) avg_glucose_level: average glucose level in blood

10) bmi: body mass index

11) smoking_status: "formerly smoked", "never smoked", "smokes" or "Unknown"*

12) stroke: 1 if the patient had a stroke or 0 if not

*Note: "Unknown" in smoking_status means that the information is unavailable for this patient.

## Importing the data

In [1]:
import pandas as pd

In [8]:
df=pd.read_csv("healthcare-dataset-stroke-data.csv")
df

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
1,51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,never smoked,1
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
3,60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
4,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1
...,...,...,...,...,...,...,...,...,...,...,...,...
5105,18234,Female,80.0,1,0,Yes,Private,Urban,83.75,,never smoked,0
5106,44873,Female,81.0,0,0,Yes,Self-employed,Urban,125.20,40.0,never smoked,0
5107,19723,Female,35.0,0,0,Yes,Self-employed,Rural,82.99,30.6,never smoked,0
5108,37544,Male,51.0,0,0,Yes,Private,Rural,166.29,25.6,formerly smoked,0


## Data Cleaning

In [9]:
pd.DataFrame(df.isnull().sum(), columns= ['Number of missing values'])

Unnamed: 0,Number of missing values
id,0
gender,0
age,0
hypertension,0
heart_disease,0
ever_married,0
work_type,0
Residence_type,0
avg_glucose_level,0
bmi,201


In [10]:
df.describe()

Unnamed: 0,id,age,hypertension,heart_disease,avg_glucose_level,bmi,stroke
count,5110.0,5110.0,5110.0,5110.0,5110.0,4909.0,5110.0
mean,36517.829354,43.226614,0.097456,0.054012,106.147677,28.893237,0.048728
std,21161.721625,22.612647,0.296607,0.226063,45.28356,7.854067,0.21532
min,67.0,0.08,0.0,0.0,55.12,10.3,0.0
25%,17741.25,25.0,0.0,0.0,77.245,23.5,0.0
50%,36932.0,45.0,0.0,0.0,91.885,28.1,0.0
75%,54682.0,61.0,0.0,0.0,114.09,33.1,0.0
max,72940.0,82.0,1.0,1.0,271.74,97.6,1.0


In [11]:
#Removing the id column as it will not have any effect on the target variable
df.drop('id',axis=1,inplace=True)

In [16]:
#No. of unique values per column, counting NaN as a value
df.nunique(axis=0,dropna=False)

gender                  3
age                   104
hypertension            2
heart_disease           2
ever_married            2
work_type               5
Residence_type          2
avg_glucose_level    3852
bmi                   418
smoking_status          4
stroke                  2
dtype: int64

In [12]:
#Dropping the rows with missing values
df.dropna(inplace=True)

In [13]:
df.head()

Unnamed: 0,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
2,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
3,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
4,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1
5,Male,81.0,0,0,Yes,Private,Urban,186.21,29.0,formerly smoked,1


### Dealing with Categorical columns

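Creating a new dataframe (df_1) with only the numeric attributes. Note that hypertension and heart_disease are binary 0/1 indicators rather than ratio-scaled measurements.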

In [17]:
df_1=df[['age','hypertension','heart_disease',
       'avg_glucose_level','bmi']]

In [18]:
df_1.head()

Unnamed: 0,age,hypertension,heart_disease,avg_glucose_level,bmi
0,67.0,0,1,228.69,36.6
2,80.0,0,1,105.92,32.5
3,49.0,0,0,171.23,34.4
4,79.0,1,0,174.12,24.0
5,81.0,0,0,186.21,29.0


### Categorical Datatypes

Columns = gender, ever_married, work_type, Residence_type, smoking_status

In [19]:
col=['gender', 'ever_married', 'work_type', 'Residence_type', 'smoking_status']

In [20]:
for c in col:
    print(df[c].unique())

['Male' 'Female' 'Other']
['Yes' 'No']
['Private' 'Self-employed' 'Govt_job' 'children' 'Never_worked']
['Urban' 'Rural']
['formerly smoked' 'never smoked' 'smokes' 'Unknown']


In [21]:
df['smoking_status'].value_counts()

never smoked       1852
Unknown            1483
formerly smoked     837
smokes              737
Name: smoking_status, dtype: int64

In [22]:
#one-hot encoding
for c in col:
    df_1=pd.concat([df_1,pd.get_dummies(df[c])],axis=1)
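
The loop above produces indicator columns named after the raw category values (`No`, `Yes`, `Other`, `Unknown`), so the source column is no longer visible in the name. As a minimal sketch of one alternative, `pd.get_dummies` can be given the `columns=` argument, which prefixes every indicator with its source column. The toy rows below are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy frame with a few of the notebook's columns (rows are hypothetical).
toy = pd.DataFrame({
    "age": [67.0, 49.0, 35.0],
    "gender": ["Male", "Female", "Female"],
    "ever_married": ["Yes", "Yes", "No"],
    "smoking_status": ["formerly smoked", "smokes", "Unknown"],
})

cat_cols = ["gender", "ever_married", "smoking_status"]

# With columns=, each indicator is named "<source column>_<category>",
# so 'ever_married_No' cannot be confused with any other 'No'.
encoded = pd.get_dummies(toy, columns=cat_cols, dtype=int)

print(encoded.columns.tolist())
```

The non-categorical columns are kept as they are, so this also removes the need for a separate `pd.concat` step.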

In [23]:
df_1.head()

Unnamed: 0,age,hypertension,heart_disease,avg_glucose_level,bmi,Female,Male,Other,No,Yes,...,Never_worked,Private,Self-employed,children,Rural,Urban,Unknown,formerly smoked,never smoked,smokes
0,67.0,0,1,228.69,36.6,0,1,0,0,1,...,0,1,0,0,0,1,0,1,0,0
2,80.0,0,1,105.92,32.5,0,1,0,0,1,...,0,1,0,0,1,0,0,0,1,0
3,49.0,0,0,171.23,34.4,1,0,0,0,1,...,0,1,0,0,0,1,0,0,0,1
4,79.0,1,0,174.12,24.0,1,0,0,0,1,...,0,0,1,0,1,0,0,0,1,0
5,81.0,0,0,186.21,29.0,0,1,0,0,1,...,0,1,0,0,0,1,0,1,0,0


In [24]:
pd.DataFrame(df_1.isnull().sum(), columns= ['Number of missing values'])

Unnamed: 0,Number of missing values
age,0
hypertension,0
heart_disease,0
avg_glucose_level,0
bmi,0
Female,0
Male,0
Other,0
No,0
Yes,0


### Scaling the data using zscore

In [25]:
from scipy.stats import zscore
data_scaled=df_1.apply(zscore)
data_scaled.head()

Unnamed: 0,age,hypertension,heart_disease,avg_glucose_level,bmi,Female,Male,Other,No,Yes,...,Never_worked,Private,Self-employed,children,Rural,Urban,Unknown,formerly smoked,never smoked,smokes
0,1.070138,-0.318067,4.381968,2.777698,0.981345,-1.199942,1.200447,-0.014274,-0.729484,0.729484,...,-0.067095,0.863918,-0.432978,-0.397906,-0.98564,0.98564,-0.657926,2.205673,-0.778346,-0.420302
2,1.646563,-0.318067,4.381968,0.013842,0.459269,-1.199942,1.200447,-0.014274,-0.729484,0.729484,...,-0.067095,0.863918,-0.432978,-0.397906,1.014569,-1.014569,-0.657926,-0.453376,1.284775,-0.420302
3,0.272012,-0.318067,-0.228208,1.484132,0.701207,0.833374,-0.833023,-0.014274,-0.729484,0.729484,...,-0.067095,0.863918,-0.432978,-0.397906,-0.98564,0.98564,-0.657926,-0.453376,-0.778346,2.379241
4,1.602222,3.143994,-0.228208,1.549193,-0.623083,0.833374,-0.833023,-0.014274,-0.729484,0.729484,...,-0.067095,-1.157518,2.309587,-0.397906,1.014569,-1.014569,-0.657926,-0.453376,1.284775,-0.420302
5,1.690903,-0.318067,-0.228208,1.821368,0.013595,-1.199942,1.200447,-0.014274,-0.729484,0.729484,...,-0.067095,0.863918,-0.432978,-0.397906,-0.98564,0.98564,-0.657926,2.205673,-0.778346,-0.420302
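
For reference, the z-score of a value is `(x - mean) / std`. The sketch below checks this by hand against `scipy.stats.zscore` on three `avg_glucose_level` values copied from the rows shown earlier. It assumes scipy's default of `ddof=0` (population standard deviation), which differs from the pandas default of `ddof=1`.

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# Three avg_glucose_level values from the first rows of the dataset.
s = pd.Series([228.69, 105.92, 171.23], name="avg_glucose_level")

# z = (x - mean) / std, using the population standard deviation (ddof=0)
# to match scipy; Series.std() would use ddof=1 unless told otherwise.
manual = (s - s.mean()) / s.std(ddof=0)

print(np.allclose(manual, zscore(s)))
```

Because only three values are used here, the resulting z-scores differ from those in the table above, which are computed over the whole column.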


In [26]:
X=data_scaled
y=df['stroke']

## Stratified K-fold

In [27]:
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5)
skf.get_n_splits(X, y)

5
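
Stratified splitting keeps the class proportions roughly equal across folds, which matters when one class is rare. The following sketch illustrates this with synthetic labels that reuse the class counts reported in the Target Column Distribution section (209 stroke, 4700 no stroke). The feature matrix is only a placeholder, and the `_demo` names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic labels with the notebook's class counts; features are placeholders.
y_demo = np.array([1] * 209 + [0] * 4700)
X_demo = np.zeros((len(y_demo), 1))

skf_demo = StratifiedKFold(n_splits=5)

# Count the positive (stroke) labels that land in each test fold.
positives_per_fold = [
    int(y_demo[test_idx].sum())
    for _, test_idx in skf_demo.split(X_demo, y_demo)
]

print(positives_per_fold)
```

Every test fold receives close to one fifth of the positive cases, so no fold ends up without stroke patients.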

In [28]:
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

In [29]:
def model_scoring(impurity_measure):
    #Note: X, y, skf and the four result lists are globals defined in other cells
    dt_model=DecisionTreeClassifier(criterion=impurity_measure,max_depth=100)
    foldno=1
    AccSum=0

    for train_index, test_index in skf.split(X, y):
        #Separating training and testing folds
        x_train_fold, x_test_fold = X.iloc[train_index], X.iloc[test_index]
        y_train_fold = y.iloc[train_index]
        y_test_fold = y.iloc[test_index]

        #Fitting the model
        dt_model.fit(x_train_fold, y_train_fold)
        y_predict=dt_model.predict(x_test_fold)

        #Accuracy for training set
        train_accuracy.append(dt_model.score(x_train_fold, y_train_fold))
        #Accuracy for test set
        fold_accuracy=dt_model.score(x_test_fold, y_test_fold)
        test_accuracy.append(fold_accuracy)
        AccSum=AccSum+fold_accuracy
        #Depth of the decision tree
        list_depth.append(dt_model.get_depth())

        #Confusion matrix with rows and columns ordered as [stroke, no stroke]
        cm=metrics.confusion_matrix(y_test_fold, y_predict, labels=[1, 0])
        list_accuracy.append(calcF1(cm,impurity_measure,foldno))
        foldno=foldno+1
    #Mean test accuracy over all folds
    return AccSum/skf.get_n_splits()

In [30]:
#Function for calculating model performance measures
#Note: impurity_measure and foldno are passed through unchanged
#so that each call returns one complete row of the results table.

def calcF1(cm,impurity_measure,foldno):

    #Separating the true positives, false negatives, etc. from the confusion matrix
    #(rows = actual class, columns = predicted class, both ordered as labels=[1, 0])
    TP=cm[0][0]
    FN=cm[0][1]
    TN=cm[1][1]
    FP=cm[1][0]
    
    Precision=TP/(TP+FP)
    Recall=TP/(TP+FN)
    Spf=TN/(TN+FP)
    
    F1=2*((Precision*Recall)/(Precision+Recall))
    
    return impurity_measure,foldno,Precision,Recall,Spf,F1
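
As a hedged alternative to the hand-written loop, scikit-learn's `cross_validate` can collect several metrics per fold in one call. The sketch below also places the scaler inside a pipeline so that it is refit on each training fold, whereas the notebook scales the full dataset before splitting. It runs on synthetic data from `make_classification`, not on the stroke dataset, and the names `X_demo`, `y_demo` and `pipe` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (roughly 10% positives).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.9, 0.1], random_state=0
)

# The scaler is fit on the training part of each fold only.
pipe = make_pipeline(
    StandardScaler(),
    DecisionTreeClassifier(criterion="gini", random_state=0),
)

scores = cross_validate(
    pipe, X_demo, y_demo,
    cv=StratifiedKFold(n_splits=5),
    scoring=["accuracy", "precision", "recall", "f1"],
    return_train_score=True,
)

# One array of per-fold values per metric; report the mean of each.
for name in ["test_accuracy", "test_precision", "test_recall", "test_f1"]:
    print(name, round(float(scores[name].mean()), 3))
```

Specificity is not among the built-in scorer names used here; it can be obtained as the recall of the negative class.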

In [31]:
list_accuracy = []
list_depth=[]
train_accuracy=[]
test_accuracy=[]

#Calling the model scoring function
print("Avg Accuracy for Gini",model_scoring('gini'))
print("Avg Accuracy for Entropy",model_scoring('entropy'))

Avg Accuracy for Gini 0.9162766701752856
Avg Accuracy for Entropy 0.9203501975414754


In [32]:
table=pd.DataFrame(list_accuracy,columns=["Impurity Measure","Fold No.","Precision","Recall","Specificity","F1 Score"])
table.insert(2,"TrainAccuracy",train_accuracy,True)
table.insert(3,"TestAccuracy",test_accuracy,True)
table.insert(4,"Tree Depth",list_depth,True)
table

Unnamed: 0,Impurity Measure,Fold No.,TrainAccuracy,TestAccuracy,Tree Depth,Precision,Recall,Specificity,F1 Score
0,gini,1,1.0,0.900204,19,0.1,0.166667,0.932979,0.125
1,gini,2,1.0,0.919552,20,0.122449,0.142857,0.954255,0.131868
2,gini,3,1.0,0.923625,19,0.148936,0.166667,0.957447,0.157303
3,gini,4,1.0,0.919552,18,0.163636,0.214286,0.951064,0.185567
4,gini,5,1.0,0.918451,17,0.085106,0.097561,0.954255,0.090909
5,entropy,1,1.0,0.915479,21,0.140351,0.190476,0.947872,0.161616
6,entropy,2,1.0,0.911405,17,0.040816,0.047619,0.95,0.043956
7,entropy,3,1.0,0.92668,28,0.2,0.238095,0.957447,0.217391
8,entropy,4,1.0,0.928717,20,0.15,0.142857,0.96383,0.146341
9,entropy,5,1.0,0.91947,20,0.068182,0.073171,0.956383,0.070588
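
The values in the table can be cross-checked against scikit-learn's own metric functions. The sketch below does this for the first Gini fold. The confusion-matrix counts are not printed in the notebook; they are reconstructed from the reported metrics under the assumption that this test fold holds 982 rows, 42 of them stroke cases.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Assumed counts, reconstructed from the Gini fold 1 row of the table.
TP, FN, FP, TN = 7, 35, 63, 877

# Rebuild label vectors that produce exactly this confusion matrix.
y_true = np.array([1] * (TP + FN) + [0] * (FP + TN))
y_pred = np.array([1] * TP + [0] * FN + [1] * FP + [0] * TN)

precision = precision_score(y_true, y_pred)               # TP / (TP + FP)
recall = recall_score(y_true, y_pred)                     # TP / (TP + FN)
specificity = recall_score(y_true, y_pred, pos_label=0)   # TN / (TN + FP)
f1 = f1_score(y_true, y_pred)                             # harmonic mean of precision and recall
accuracy = (TP + TN) / (TP + FN + FP + TN)

print(round(precision, 6), round(recall, 6), round(specificity, 6),
      round(f1, 6), round(accuracy, 6))
```

With these counts the model finds only 7 of the 42 stroke cases while raising 63 false alarms, which is why accuracy stays high and precision and recall stay low.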


### Observation: 
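Even though the test-set accuracy is high (about 0.92), the values of precision, recall and F1 score are very low, and the training accuracy is 1.0 in every fold. This means either that the model is overfitting or that there is a problem in the dataset, such as an imbalanced target column.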

### Target Column Distribution

In [35]:
n_1=len(df.loc[df['stroke'] == 1])
n_0=len(df.loc[df['stroke'] == 0])
print(n_1)
print(n_0)
print("Percentage of patients who had stroke: {0} ({1:2.2f}%)".format(n_1, (n_1 / (n_1 + n_0)) * 100 ))

209
4700
Percentage of patients who had stroke: 209 (4.26%)
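
These counts explain why accuracy alone is misleading here. As a small worked calculation, consider a trivial classifier that always predicts "no stroke": it is right for every negative case and wrong for every positive case.

```python
# Class counts printed above.
n_stroke, n_no_stroke = 209, 4700

# Always predicting "no stroke" gets all negatives right and no positives.
baseline_accuracy = n_no_stroke / (n_stroke + n_no_stroke)
baseline_recall = 0 / n_stroke

print(f"baseline accuracy: {baseline_accuracy:.4f}, baseline recall: {baseline_recall:.1f}")
```

This baseline reaches about 95.7% accuracy without detecting a single stroke, which is higher than the roughly 92% obtained by the decision trees earlier.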


### Downsampling

In [37]:
import numpy as np

#Getting the indices of stroke and non-stroke patients separately
no_stroke_indices = df[df['stroke'] == 0].index 
stroke_indices = df[df['stroke'] == 1].index 

#Randomly downsampling the data
random_indices = np.random.choice( no_stroke_indices, 250 , replace=False)

#Combining the stroke and the downsampled non-stroke indices to create a more balanced dataset
down_sample_indices = np.concatenate([stroke_indices,random_indices])

#Creating a new dataframe with the downsampled indices
df_downsampled=df.loc[down_sample_indices] 
df_downsampled.groupby(["stroke"]).count()

stroke,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status
0,250,250,250,250,250,250,250,250,250,250
1,209,209,209,209,209,209,209,209,209,209
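
`np.random.choice` is called without a seed above, so each run draws a different subset and the scores that follow are not exactly reproducible. A minimal sketch of seeded downsampling is shown below, assuming NumPy's `default_rng` generator. The frame `demo` is a hypothetical stand-in for `df`, and the seed value is arbitrary.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df: 20 rows, 4 of them stroke cases.
demo = pd.DataFrame({"stroke": [1] * 4 + [0] * 16, "age": np.arange(20, 40)})

rng = np.random.default_rng(seed=42)  # arbitrary seed, fixed for repeatability
pos_idx = demo.index[demo["stroke"] == 1]
neg_idx = demo.index[demo["stroke"] == 0]

# Draw as many negatives as there are positives, without replacement.
sampled_neg = rng.choice(neg_idx, size=len(pos_idx), replace=False)
balanced = demo.loc[np.concatenate([pos_idx, sampled_neg])]

print(balanced["stroke"].value_counts().to_dict())
```

Here the negative class is cut down to the same size as the positive class; the notebook instead keeps 250 negatives against 209 positives.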


In [38]:
#Repeating the feature selection and one-hot encoding for the downsampled data

In [39]:
df_ds=df_downsampled[['age','hypertension','heart_disease',
       'avg_glucose_level','bmi']]

In [40]:
col=['gender', 'ever_married', 'work_type', 'Residence_type', 'smoking_status']
for c in col:
    df_ds=pd.concat([df_ds,pd.get_dummies(df_downsampled[c])],axis=1)

In [41]:
pd.DataFrame(df_ds.isnull().sum(), columns= ['Number of missing values'])

Unnamed: 0,Number of missing values
age,0
hypertension,0
heart_disease,0
avg_glucose_level,0
bmi,0
Female,0
Male,0
No,0
Yes,0
Govt_job,0


In [42]:
#Scaling the data and separating X and y
df_ds_scaled=df_ds.apply(zscore)
X=df_ds_scaled
y=df_downsampled['stroke']

In [43]:
list_accuracy = []
list_depth=[]
train_accuracy=[]
test_accuracy=[]

#Calling the model scoring function
print("Avg Accuracy for Gini",model_scoring('gini'))
print("Avg Accuracy for Entropy",model_scoring('entropy'))

Avg Accuracy for Gini 0.6818681318681319
Avg Accuracy for Entropy 0.6777353081700908


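The Gini impurity measure gives a slightly higher average accuracy than entropy here (0.682 vs. 0.678). The gap is much smaller than the fold-to-fold spread (test accuracy ranges from 0.63 to 0.76), and the downsampling is random and unseeded, so the ranking may change between runs.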

In [44]:
table_ds=pd.DataFrame(list_accuracy,columns=["Impurity Measure","Fold No.","Precision","Recall","Specificity","F1 Score"])
table_ds.insert(2,"TrainAccuracy",train_accuracy,True)
table_ds.insert(3,"TestAccuracy",test_accuracy,True)
table_ds.insert(4,"Tree Depth",list_depth,True)
table_ds

Unnamed: 0,Impurity Measure,Fold No.,TrainAccuracy,TestAccuracy,Tree Depth,Precision,Recall,Specificity,F1 Score
0,gini,1,1.0,0.706522,13,0.727273,0.571429,0.82,0.64
1,gini,2,1.0,0.673913,14,0.62,0.738095,0.62,0.673913
2,gini,3,1.0,0.663043,13,0.617021,0.690476,0.64,0.651685
3,gini,4,1.0,0.706522,17,0.659574,0.738095,0.68,0.696629
4,gini,5,1.0,0.659341,13,0.631579,0.585366,0.72,0.607595
5,entropy,1,1.0,0.652174,13,0.631579,0.571429,0.72,0.6
6,entropy,2,1.0,0.663043,17,0.622222,0.666667,0.66,0.643678
7,entropy,3,1.0,0.684783,14,0.644444,0.690476,0.68,0.666667
8,entropy,4,1.0,0.630435,16,0.576923,0.714286,0.56,0.638298
9,entropy,5,1.0,0.758242,17,0.756757,0.682927,0.82,0.717949
