# ETE-456: Stroke Prediction on a Given Dataset Using Classification

> Objective: 
 1. *Apply various classification algorithms on a real world dataset.*

## Stroke Prediction Dataset
**Context**: According to the World Health Organization (WHO), stroke is the 2nd leading cause of death globally, responsible for approximately 11% of total deaths. This dataset is used to predict whether a patient is likely to have a stroke based on input parameters such as gender, age, various diseases, and smoking status. Each row in the data provides relevant information about one patient.

### Attribute Information
1. id: unique identifier
2. gender: "Male", "Female" or "Other"
3. age: age of the patient
4. hypertension: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension
5. heart_disease: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease
6. ever_married: "No" or "Yes"
7. work_type: "children", "Govt_job", "Never_worked", "Private" or "Self-employed"
8. Residence_type: "Rural" or "Urban"
9. avg_glucose_level: average glucose level in blood
10. bmi: body mass index
11. smoking_status: "formerly smoked", "never smoked", "smokes" or "Unknown"*
12. stroke: 1 if the patient had a stroke or 0 if not

*Note: "Unknown" in smoking_status means that the information is unavailable for this patient.

In [3]:
import warnings
warnings.filterwarnings("ignore")

**Import the Libraries**

In [4]:
import numpy as np        
import pandas as pd     
import matplotlib.pyplot as plt       

# Machine learning libraries
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

Dataset

In [None]:
# Download the data
!wget -O stroke-data.csv "https://www.dropbox.com/s/zgburk3yces5tee/healthcare-dataset-stroke-data.csv?dl=0"

In [6]:
"""importing the dataset """

dataset = pd.read_csv('stroke-data.csv')
dataset

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
1,51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,never smoked,1
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
3,60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
4,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1
...,...,...,...,...,...,...,...,...,...,...,...,...
5105,18234,Female,80.0,1,0,Yes,Private,Urban,83.75,,never smoked,0
5106,44873,Female,81.0,0,0,Yes,Self-employed,Urban,125.20,40.0,never smoked,0
5107,19723,Female,35.0,0,0,Yes,Self-employed,Rural,82.99,30.6,never smoked,0
5108,37544,Male,51.0,0,0,Yes,Private,Rural,166.29,25.6,formerly smoked,0


In [7]:
dataset.columns

Index(['id', 'gender', 'age', 'hypertension', 'heart_disease', 'ever_married',
       'work_type', 'Residence_type', 'avg_glucose_level', 'bmi',
       'smoking_status', 'stroke'],
      dtype='object')

In [51]:
feature = dataset[['age','hypertension','heart_disease', 'avg_glucose_level', 'bmi','smoking_status']]  # independent variables (features)
target = dataset[['stroke']]   # dependent variable (label)

In [10]:
feature.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5110 entries, 0 to 5109
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   age                5110 non-null   float64
 1   hypertension       5110 non-null   int64  
 2   heart_disease      5110 non-null   int64  
 3   avg_glucose_level  5110 non-null   float64
 4   bmi                4909 non-null   float64
 5   smoking_status     5110 non-null   object 
dtypes: float64(3), int64(2), object(1)
memory usage: 239.7+ KB


In [11]:
target.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5110 entries, 0 to 5109
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   stroke  5110 non-null   int64
dtypes: int64(1)
memory usage: 40.0 KB


In [27]:
dataset.isnull().value_counts()

id     gender  age    hypertension  heart_disease  ever_married  work_type  Residence_type  avg_glucose_level  bmi    smoking_status  stroke
False  False   False  False         False          False         False      False           False              False  False           False     4909
                                                                                                               True   False           False      201
dtype: int64

In [None]:
feature

In [None]:
target

# Taking care of missing values


In [54]:
from sklearn.impute import SimpleImputer

In [55]:
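# Fill missing BMI values with the column mean
mean_value = dataset['bmi'].mean()
dataset['bmi'] = dataset['bmi'].fillna(mean_value)
feature = feature.assign(bmi=dataset['bmi'])  # feature is a copy of dataset, so update it as well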

In [56]:
dataset.isnull().value_counts()

id     gender  age    hypertension  heart_disease  ever_married  work_type  Residence_type  avg_glucose_level  bmi    smoking_status  stroke
False  False   False  False         False          False         False      False           False              False  False           False     5110
dtype: int64
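
`SimpleImputer` is imported above but the cell fills the gaps with `fillna`. As a minimal sketch of the same mean imputation done through `SimpleImputer`, assuming a small toy frame (the `toy` DataFrame and its values are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame standing in for the notebook's data (illustrative values only)
toy = pd.DataFrame({"bmi": [36.6, np.nan, 32.5, 34.4, np.nan]})

# strategy="mean" replaces each NaN with the mean of the observed values
imputer = SimpleImputer(strategy="mean")
toy["bmi"] = imputer.fit_transform(toy[["bmi"]]).ravel()

print(toy["bmi"].isnull().sum())
```

One reason to prefer the imputer object is that it can be fitted on the training split alone and then reused on the test split, so the test data does not influence the fill value.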

In [None]:
dataset

In [None]:
feature

In [None]:
target

# Encoding

In [60]:
from sklearn.preprocessing import LabelEncoder , OneHotEncoder

In [61]:
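# sparse_output replaces the older `sparse` argument (scikit-learn >= 1.2)
encoder = OneHotEncoder(sparse_output=False)
encoded_labels = pd.DataFrame(encoder.fit_transform(feature[['smoking_status']]))

In [63]:
# get_feature_names_out replaces get_feature_names, which newer scikit-learn releases removed
encoded_labels.columns = encoder.get_feature_names_out(['smoking_status'])
dataset = pd.concat([feature, encoded_labels], axis=1)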

In [None]:
dataset

In [65]:
dataset.columns

Index(['age', 'hypertension', 'heart_disease', 'avg_glucose_level', 'bmi',
       'smoking_status', 'smoking_status_Unknown',
       'smoking_status_formerly smoked', 'smoking_status_never smoked',
       'smoking_status_smokes'],
      dtype='object')

In [66]:
new_features = dataset[['age', 'hypertension', 'heart_disease', 'bmi',
       'avg_glucose_level',
       'smoking_status_Unknown',
       'smoking_status_formerly smoked', 'smoking_status_never smoked',
       'smoking_status_smokes']]

# Splitting Dataset

In [67]:
from sklearn.model_selection import train_test_split

In [71]:
"""Splitting the dataset into training set and test set"""

X_train,X_test,y_train,y_test=train_test_split(new_features,target,test_size=0.2,random_state=0)

In [72]:
print(X_train.shape)
print(X_test.shape)

(4088, 9)
(1022, 9)
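
The split above is purely random. Since stroke cases are rare, one option (not used in this notebook) is to pass `stratify` so both splits keep the same class ratio. A minimal sketch on synthetic labels, where the data and the 50-in-1000 ratio are assumptions for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 50 positives out of 1000 (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = np.array([1] * 50 + [0] * 950)

# stratify=y keeps the positive rate the same in the train and test splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

print(y_train.mean(), y_test.mean())
```

Using `stratify` here would change which rows land in the test set, so the figures reported later in this notebook would not be reproduced exactly.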


In [None]:
X_test

Different types of classification algorithms


*   Logistic Regression
*   K Nearest Neighbor (KNN)
*   Decision Tree Classifier
*   Random Forest Classifier
*   Naive Bayes
*   Support Vector Machine (SVM) 
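
All of the algorithms listed above share scikit-learn's `fit`/`predict`/`score` interface, so they can be tried in one loop. The sketch below uses synthetic data from `make_classification` rather than the stroke dataset, and the hyperparameters (including `max_iter=1000` and the use of `GaussianNB` for Naive Bayes) are assumptions, not the notebook's settings:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 9 features, like the notebook's feature matrix
X, y = make_classification(n_samples=500, n_features=9, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(n_neighbors=3),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=32, random_state=0),
    "Naive Bayes": GaussianNB(),
    "SVM": SVC(kernel="linear", random_state=0),
}

# Fit each model on the training split and record test accuracy
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)

for name, acc in scores.items():
    print(f"{name}: {acc:.2f}")
```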

# Logistic Regression

In [74]:
from sklearn.linear_model import LogisticRegression

# Fitting Logistic Regression to the training dataset
lr = LogisticRegression()

lr.fit(X_train,y_train)

LogisticRegression()

In [76]:
# prediction
y_pred = lr.predict(X_test)

In [78]:
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score,f1_score,precision_score,recall_score

In [79]:
# Confusion matrix (actual, predicted)
confusion_matrix(y_test,y_pred)

array([[968,   0],
       [ 54,   0]])

In [80]:
# Classification report: per-class precision, recall and F1-score, plus overall accuracy
print(classification_report(y_test,y_pred,target_names = ['No','Yes']))

              precision    recall  f1-score   support

          No       0.95      1.00      0.97       968
         Yes       0.00      0.00      0.00        54

    accuracy                           0.95      1022
   macro avg       0.47      0.50      0.49      1022
weighted avg       0.90      0.95      0.92      1022
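
The figures in this report can be derived by hand from the confusion matrix printed earlier. The sketch below recomputes accuracy and the weighted averages with NumPy; treating the undefined precision of a never-predicted class as 0 is an assumption chosen to match the 0.00 shown in the report:

```python
import numpy as np

# Confusion matrix reported above for Logistic Regression:
# rows = actual (No, Yes), columns = predicted (No, Yes)
cm = np.array([[968, 0],
               [54, 0]])

support = cm.sum(axis=1)    # actual samples per class
predicted = cm.sum(axis=0)  # predicted samples per class
correct = np.diag(cm)       # correctly classified samples per class

accuracy = correct.sum() / cm.sum()
recall = correct / support
# A class that is never predicted has undefined precision; use 0 for it
precision = np.divide(correct, predicted, out=np.zeros(2), where=predicted > 0)

# Weighted averages weight each class by its support
weighted_precision = (precision * support).sum() / support.sum()
weighted_recall = (recall * support).sum() / support.sum()

print(round(accuracy, 2), round(weighted_precision, 2), round(weighted_recall, 2))
```

Because 968 of the 1022 test samples belong to the "No" class, the weighted averages stay high even though no stroke case is detected.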



# KNN Classifier

In [81]:
from sklearn.neighbors import KNeighborsClassifier

# Classifier Model
classifier = KNeighborsClassifier(n_neighbors=3, metric = 'minkowski')
classifier.fit(X_train,y_train)
# Prediction
y_pred = classifier.predict(X_test)

In [82]:
# Confusion matrix (actual, predicted)
print(confusion_matrix(y_test,y_pred))

[[955  13]
 [ 51   3]]


In [83]:
# Classification report: per-class precision, recall and F1-score, plus overall accuracy
print(classification_report(y_test,y_pred,target_names = ['No','Yes']))

              precision    recall  f1-score   support

          No       0.95      0.99      0.97       968
         Yes       0.19      0.06      0.09        54

    accuracy                           0.94      1022
   macro avg       0.57      0.52      0.53      1022
weighted avg       0.91      0.94      0.92      1022



# Support Vector Machine

In [84]:
from sklearn.svm import SVC

# Classifier Model
classifier = SVC(kernel = 'linear', random_state = 42)
classifier.fit(X_train,y_train)
# Prediction
y_pred = classifier.predict(X_test)

In [85]:
# Confusion matrix (actual, predicted)
print(confusion_matrix(y_test,y_pred))

[[968   0]
 [ 54   0]]


In [86]:
# Classification report: per-class precision, recall and F1-score, plus overall accuracy
print(classification_report(y_test,y_pred,target_names = ['No','Yes']))

              precision    recall  f1-score   support

          No       0.95      1.00      0.97       968
         Yes       0.00      0.00      0.00        54

    accuracy                           0.95      1022
   macro avg       0.47      0.50      0.49      1022
weighted avg       0.90      0.95      0.92      1022



# Decision Tree

In [87]:
from sklearn.tree import DecisionTreeClassifier
# Classifier Model
classifier = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
classifier.fit(X_train,y_train)

DecisionTreeClassifier(criterion='entropy', random_state=0)

In [88]:
# Prediction
y_pred = classifier.predict(X_test)

In [89]:
# Confusion matrix (actual, predicted)
print(confusion_matrix(y_test,y_pred))


[[914  54]
 [ 46   8]]


In [90]:
# Classification report: per-class precision, recall and F1-score, plus overall accuracy
print(classification_report(y_test,y_pred,target_names = ['No','Yes']))

              precision    recall  f1-score   support

          No       0.95      0.94      0.95       968
         Yes       0.13      0.15      0.14        54

    accuracy                           0.90      1022
   macro avg       0.54      0.55      0.54      1022
weighted avg       0.91      0.90      0.91      1022



# Random Forest Classifier

In [91]:
from sklearn.ensemble import RandomForestClassifier
# Classifier Model
classifier = RandomForestClassifier(n_estimators=32, criterion ='entropy', random_state = 40)
classifier.fit(X_train,y_train)

RandomForestClassifier(criterion='entropy', n_estimators=32, random_state=40)

In [93]:
# Prediction
y_pred = classifier.predict(X_test)

In [94]:
# Confusion matrix (actual, predicted)
print(confusion_matrix(y_test,y_pred))

[[966   2]
 [ 53   1]]


In [95]:
# Classification report: per-class precision, recall and F1-score, plus overall accuracy
print(classification_report(y_test,y_pred,target_names = ['No','Yes']))

              precision    recall  f1-score   support

          No       0.95      1.00      0.97       968
         Yes       0.33      0.02      0.04        54

    accuracy                           0.95      1022
   macro avg       0.64      0.51      0.50      1022
weighted avg       0.92      0.95      0.92      1022



# Result Analysis

To predict stroke, different classification algorithms were applied to the dataset. The evaluation metrics used are precision, recall, F1-score and accuracy. To compare the algorithms, the weighted averages of precision, recall and F1-score are considered together with accuracy.
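
For a side-by-side view, the confusion matrices printed in the earlier cells can be collected into one table. This is a sketch only: the matrices are copied from the outputs above, while the column names `stroke_recall` and `stroke_precision` are labels introduced here for illustration:

```python
import numpy as np
import pandas as pd

# Confusion matrices copied from the cell outputs above
# (rows = actual No/Yes, columns = predicted No/Yes)
matrices = {
    "Logistic Regression": [[968, 0], [54, 0]],
    "KNN": [[955, 13], [51, 3]],
    "SVM": [[968, 0], [54, 0]],
    "Decision Tree": [[914, 54], [46, 8]],
    "Random Forest": [[966, 2], [53, 1]],
}

rows = []
for name, m in matrices.items():
    cm = np.array(m)
    tn, fp, fn, tp = cm.ravel()
    rows.append({
        "model": name,
        "accuracy": (tp + tn) / cm.sum(),
        "stroke_recall": tp / (tp + fn),
        # Precision is undefined when nothing is predicted positive; use 0
        "stroke_precision": tp / (tp + fp) if (tp + fp) else 0.0,
    })

summary = pd.DataFrame(rows).set_index("model").round(2)
print(summary)
```

Reading accuracy next to the stroke-class columns shows how much of each model's score comes from the majority "No" class.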

# Logistic Regression

Weighted Average Precision : 90%

Weighted Average Recall : 95%

Weighted Average F1-Score : 92%

Accuracy : 95%

# KNN Classifier

Weighted Average Precision : 91%

Weighted Average Recall : 94%

Weighted Average F1-Score : 92%

Accuracy : 94%

# Support Vector Machine
Weighted Average Precision : 90%

Weighted Average Recall : 95%

Weighted Average F1-Score : 92%

Accuracy : 95%

# Decision Tree
Weighted Average Precision : 91%

Weighted Average Recall : 90%

Weighted Average F1-Score : 91%

Accuracy : 90%

# Random Forest Classifier
Weighted Average Precision : 92%

Weighted Average Recall : 95%

Weighted Average F1-Score : 92%

Accuracy : 95%

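# Discussion
In this project, several classifiers were used to predict stroke from the Stroke Prediction Dataset.

The classifiers used are Logistic Regression, K-Nearest Neighbor, Support Vector Machine, Decision Tree and Random Forest Classifier.

The necessary libraries were imported and the dataset was downloaded from Dropbox. Data pre-processing consisted of handling missing values (mean imputation of bmi) and one-hot encoding of smoking_status; StandardScaler and LabelEncoder were imported, but the cells shown do not apply feature scaling or label encoding. The dataset was then split into a training set and a test set, and each classifier was trained on the training set. The classifiers were evaluated with the metrics listed in Result Analysis, where the comparison is shown. From the comparison, the Random Forest Classifier has the highest weighted-average precision (92%) and ties Logistic Regression and Support Vector Machine for the highest accuracy (95%). Because only 54 of the 1022 test samples are stroke cases, these weighted figures mostly reflect the "No" class: recall on the stroke class ranges from 0.00 (Logistic Regression, Support Vector Machine) to 0.15 (Decision Tree).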
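
Feature scaling is mentioned in the discussion and `StandardScaler` is imported at the top, though no cell shown applies it. As a hedged sketch of how scaling could be wired in for a distance-based model such as KNN, the example below uses a scikit-learn pipeline on synthetic data (the data, the scaled-up first feature and the name `scaled_knn` are all assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; one feature is inflated to mimic mixed scales
X, y = make_classification(n_samples=400, n_features=5, random_state=0)
X[:, 0] *= 100
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# The pipeline fits the scaler on the training split only,
# then applies the same transformation at prediction time
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
scaled_knn.fit(X_train, y_train)

test_accuracy = scaled_knn.score(X_test, y_test)
print(test_accuracy)
```

Putting the scaler inside the pipeline keeps test-set statistics out of the fitted transformation, which matters most for models that rely on distances between samples.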