### Stroke Prediction Dataset
# Modeling

Data source: https://www.kaggle.com/fedesoriano/stroke-prediction-dataset <br>
Data updated date: 2021-01-26

#### Supervised Learning: Classification model to predict a binary outcome
Outcome:
- 0: no stroke
- 1: stroke

Here are the different types of models that we will employ for our prediction.
- Decision Tree Classifier
- Logistic Regression
- Random Forest
- AdaBoost Classifier
- XGBoost Classifier
- SVM Classifier

Model Evaluation:
- Confusion Matrix: Maximize True Positive rate, minimize False Negative rate.
- Recall for stroke
![title](img/ConfusionMatrix.ppm)
- Balanced accuracy
![title](img/Balanced-accuracy-formula.png)
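To make the two metrics concrete, here is a minimal sketch that derives stroke recall and balanced accuracy directly from confusion-matrix counts and compares them with scikit-learn's own functions. The toy labels are made up for illustration and are not taken from the dataset.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score, balanced_accuracy_score

# Toy labels (illustrative only): 1 = stroke, 0 = no stroke
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_hat = np.array([1, 0, 0, 1, 0, 0, 0, 1, 0, 0])

# With the default label order [0, 1], ravel() yields tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

recall_stroke = tp / (tp + fn)       # true positive rate (sensitivity)
recall_no_stroke = tn / (tn + fp)    # true negative rate (specificity)
balanced_acc = (recall_stroke + recall_no_stroke) / 2

print("Recall (stroke):", recall_stroke)
print("Recall (no stroke):", recall_no_stroke)
print("Balanced accuracy:", balanced_acc)
```

Balanced accuracy is the mean of the two per-class recalls, so a model that always predicts "no stroke" scores only 0.5 even though its plain accuracy would look high on imbalanced data.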

# 0. Sourcing, Loading, and Evaluation Model

In [1]:
# import packages

import pandas as pd
import numpy as np
from sklearn.metrics import recall_score, balanced_accuracy_score, confusion_matrix, classification_report

# make notebook full width for better viewing

from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

In [2]:
# load data

X_train = pd.read_csv('data/X_train.csv', index_col=0)
X_test = pd.read_csv('data/X_test.csv', index_col=0)

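# squeeze the single-column frame into a 1-D Series, the shape scikit-learn expects for y
y_train = pd.read_csv('data/y_train.csv', index_col=0).squeeze('columns')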
y_test = pd.read_csv('data/y_test.csv', index_col=0)

In [3]:
# create lists to store all results

result_model_name = []
result_tp = []
result_fn = []
result_fp = []
result_tn = []
result_recall_stroke = []
result_recall_nostroke = []
result_balanced_accuracy = []

In [4]:
# create an evaluation function that prints a model's performance

def evaluation(y_pred, y_test, list_name):
    # confusion matrix; with labels=[1,0] (stroke first) the flattened order is tp, fn, fp, tn
    tp, fn, fp, tn = confusion_matrix(y_test,y_pred,labels=[1,0]).reshape(-1)
    print('Outcome values : tp:{}, fn:{}, fp:{}, tn:{}'.format(tp, fn, fp, tn))

    # model evaluation metrics - recall
    recall_nostroke = round(recall_score(y_test,y_pred, pos_label = 0),2)
    recall_stroke = round(recall_score(y_test,y_pred, pos_label = 1), 2)
    print('\nRecall score for "No Stroke": ' , recall_nostroke)
    print('Recall score for "Stroke": ' , recall_stroke)

    # classification report for precision, recall f1-score and accuracy
    matrix = classification_report(y_test, y_pred, target_names=['No Stroke', 'Stroke'])
    print('\nClassification report : \n',matrix)

    # model evaluation metrics - balanced accuracy
    balanced_accuracy = round(balanced_accuracy_score(y_test,y_pred),2)
    print("Balanced accuracy:", balanced_accuracy)
    
    # store the results in the same order as the printed outcome values
    list_name.extend([tp, fn, fp, tn, recall_stroke, recall_nostroke, balanced_accuracy])

In [5]:
model_list = []

# 1. Decision Tree

In [6]:
# import tree model
from sklearn.tree import DecisionTreeClassifier

# create the model
dt_model = DecisionTreeClassifier()

# fit the data
dt_model.fit(X_train, y_train)

# make prediction
y_pred = dt_model.predict(X_test)

In [7]:
# print model performance
decision_tree = ['Decision Tree']
evaluation(y_pred, y_test, decision_tree)
model_list.append(decision_tree)

Outcome values : tp:13, fn:71, fp:110, tn:1339

Recall score for "No Stroke":  0.92
Recall score for "Stroke":  0.15

Classification report : 
               precision    recall  f1-score   support

   No Stroke       0.95      0.92      0.94      1449
      Stroke       0.11      0.15      0.13        84

    accuracy                           0.88      1533
   macro avg       0.53      0.54      0.53      1533
weighted avg       0.90      0.88      0.89      1533

Balanced accuracy: 0.54


# 2. Logistic Regression

In [8]:
#import model
from sklearn.linear_model import LogisticRegression

# create the model
lr_model = LogisticRegression()

# fit the model
lr_model.fit(X_train, y_train)

# make prediction
y_pred = lr_model.predict(X_test)



In [9]:
# print model performance
logistic_regression = ['Logistic Regression']
evaluation(y_pred, y_test, logistic_regression)
model_list.append(logistic_regression)

Outcome values : tp:30, fn:54, fp:131, tn:1318

Recall score for "No Stroke":  0.91
Recall score for "Stroke":  0.36

Classification report : 
               precision    recall  f1-score   support

   No Stroke       0.96      0.91      0.93      1449
      Stroke       0.19      0.36      0.24        84

    accuracy                           0.88      1533
   macro avg       0.57      0.63      0.59      1533
weighted avg       0.92      0.88      0.90      1533

Balanced accuracy: 0.63


### Feature importance in Logistic Regression

In [10]:
coef = lr_model.coef_[0]
coef = [abs(number) for number in coef]
print(coef)

cols = list(X_train.columns)

[2.183203330688211, 0.332529398418136, 0.2645263250878176, 0.6192779194359219, 0.9786485040822869, 0.7230226434453599, 0.7210062363794068, 0.17666757692135546, 4.957176155947972, 0.8377396537026984, 3.7926685961006097, 4.728404133535544, 0.9728299893760243, 2.079875195003335, 1.7838061524117446, 2.4047085894822864]


In [11]:
sorted_index = sorted(range(len(coef)), key = lambda k: coef[k], reverse = True)
for idx in sorted_index:
    print(cols[idx])

work_type_govt_job
work_type_self-employed
work_type_private
smoking_status_unknown
age
smoking_status_never smoked
smoking_status_smokes
heart_disease
residence_type_urban
work_type_never_worked
ever_married
gender_male
hypertension
avg_glucose_level
bmi
gender_other
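Ranking features by absolute coefficient is only meaningful when the features share a comparable scale, because a coefficient's size depends on the units of its feature. The sketch below shows one way to pair names with coefficients after standardizing, using a pandas Series. It runs on synthetic stand-in data; the column names and the data-generating process are assumptions for illustration, not the notebook's actual features.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in data; column names are hypothetical
X_demo = pd.DataFrame({
    'age': rng.uniform(20, 80, 500),
    'avg_glucose_level': rng.uniform(60, 250, 500),
    'hypertension': rng.integers(0, 2, 500),
})

# Made-up outcome that depends on age and hypertension only
logit = 0.08 * (X_demo['age'] - 50) + 0.5 * X_demo['hypertension']
y_demo = (rng.uniform(size=500) < 1 / (1 + np.exp(-logit))).astype(int)

# Standardize so coefficient magnitudes are comparable across features
X_scaled = StandardScaler().fit_transform(X_demo)
model = LogisticRegression().fit(X_scaled, y_demo)

# Pair each coefficient with its column name and rank by absolute value
ranking = (pd.Series(model.coef_[0], index=X_demo.columns)
             .abs()
             .sort_values(ascending=False))
print(ranking)
```

Keeping the names and values together in one Series avoids the separate index-sorting step and prints the magnitude next to each feature.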


# 3. Random Forest

In [12]:
# import model
from sklearn.ensemble import RandomForestClassifier

# create the model
rf_model = RandomForestClassifier()

# fit the model
rf_model.fit(X_train, y_train)

# make prediction
y_pred = rf_model.predict(X_test)

  


In [13]:
# print model performance
random_forest = ['Random Forest']
evaluation(y_pred, y_test, random_forest)
model_list.append(random_forest)

Outcome values : tp:9, fn:75, fp:47, tn:1402

Recall score for "No Stroke":  0.97
Recall score for "Stroke":  0.11

Classification report : 
               precision    recall  f1-score   support

   No Stroke       0.95      0.97      0.96      1449
      Stroke       0.16      0.11      0.13        84

    accuracy                           0.92      1533
   macro avg       0.55      0.54      0.54      1533
weighted avg       0.91      0.92      0.91      1533

Balanced accuracy: 0.54


# 4. Adaboost Classification

In [14]:
# import the model
from sklearn.ensemble import AdaBoostClassifier

# create the model
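# note: the weak-learner argument is named `estimator` in scikit-learn >= 1.2
# (older versions call it `base_estimator`)
ab_model = AdaBoostClassifier(estimator=DecisionTreeClassifier(), n_estimators=100, 
                            learning_rate=0.5, random_state=100)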

# fit the model
ab_model.fit(X_train, y_train)

# make prediction
y_pred = ab_model.predict(X_test)



In [15]:
# print model performance
adaboost = ['Adaboost Classification']
evaluation(y_pred, y_test, adaboost)
model_list.append(adaboost)

Outcome values : tp:17, fn:67, fp:108, tn:1341

Recall score for "No Stroke":  0.93
Recall score for "Stroke":  0.2

Classification report : 
               precision    recall  f1-score   support

   No Stroke       0.95      0.93      0.94      1449
      Stroke       0.14      0.20      0.16        84

    accuracy                           0.89      1533
   macro avg       0.54      0.56      0.55      1533
weighted avg       0.91      0.89      0.90      1533

Balanced accuracy: 0.56


# 5. XGboost

In [16]:
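# import model
# note: this is scikit-learn's GradientBoostingClassifier, not the xgboost library;
# the "XGboost" label is kept so the results stay consistent
from sklearn.ensemble import GradientBoostingClassifier

# create and fit the model
xg_model = GradientBoostingClassifier()
xg_model.fit(X_train, y_train)

# make prediction
y_pred = xg_model.predict(X_test)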



In [17]:
# print model performance
xgboost = ['XGboost']
evaluation(y_pred, y_test, xgboost)
model_list.append(xgboost)

Outcome values : tp:20, fn:64, fp:97, tn:1352

Recall score for "No Stroke":  0.93
Recall score for "Stroke":  0.24

Classification report : 
               precision    recall  f1-score   support

   No Stroke       0.95      0.93      0.94      1449
      Stroke       0.17      0.24      0.20        84

    accuracy                           0.89      1533
   macro avg       0.56      0.59      0.57      1533
weighted avg       0.91      0.89      0.90      1533

Balanced accuracy: 0.59


# 6. SVM

In [18]:
# import model
from sklearn.svm import SVC

# create the model
sv_model = SVC()

# fit the model
sv_model.fit(X_train, y_train)

# make prediction on test dataset
y_pred = sv_model.predict(X_test)



In [19]:
# print model performance
svm = ['SVM']
evaluation(y_pred, y_test, svm)
model_list.append(svm)

Outcome values : tp:20, fn:64, fp:108, tn:1341

Recall score for "No Stroke":  0.93
Recall score for "Stroke":  0.24

Classification report : 
               precision    recall  f1-score   support

   No Stroke       0.95      0.93      0.94      1449
      Stroke       0.16      0.24      0.19        84

    accuracy                           0.89      1533
   macro avg       0.56      0.58      0.56      1533
weighted avg       0.91      0.89      0.90      1533

Balanced accuracy: 0.58


# Evaluation

In [20]:
model_list

[['Decision Tree', 13, 71, 110, 1339, 0.15, 0.92, 0.54],
 ['Logistic Regression', 30, 54, 131, 1318, 0.36, 0.91, 0.63],
 ['Random Forest', 9, 75, 47, 1402, 0.11, 0.97, 0.54],
 ['Adaboost Classification', 17, 67, 108, 1341, 0.2, 0.93, 0.56],
 ['XGboost', 20, 64, 97, 1352, 0.24, 0.93, 0.59],
 ['SVM', 20, 64, 108, 1341, 0.24, 0.93, 0.58]]

In [34]:
column_names = ['model', 'tp', 'fn', 'fp', 'tn', 'recall_stroke', 'recall_nostroke', 'balanced_accuracy_score']

# create a table
model_performance = pd.DataFrame(model_list)

# rename columns
model_performance.columns = column_names

# set index
model_performance = model_performance.set_index('model')

model_performance

| model | tp | fn | fp | tn | recall_stroke | recall_nostroke | balanced_accuracy_score |
|---|---|---|---|---|---|---|---|
| Decision Tree | 13 | 71 | 110 | 1339 | 0.15 | 0.92 | 0.54 |
| Logistic Regression | 30 | 54 | 131 | 1318 | 0.36 | 0.91 | 0.63 |
| Random Forest | 9 | 75 | 47 | 1402 | 0.11 | 0.97 | 0.54 |
| Adaboost Classification | 17 | 67 | 108 | 1341 | 0.20 | 0.93 | 0.56 |
| XGboost | 20 | 64 | 97 | 1352 | 0.24 | 0.93 | 0.59 |
| SVM | 20 | 64 | 108 | 1341 | 0.24 | 0.93 | 0.58 |


In [35]:
# Highest Recall Score
model_performance.sort_values('recall_stroke', ascending=False).head(1)

| model | tp | fn | fp | tn | recall_stroke | recall_nostroke | balanced_accuracy_score |
|---|---|---|---|---|---|---|---|
| Logistic Regression | 30 | 54 | 131 | 1318 | 0.36 | 0.91 | 0.63 |


In [36]:
# Highest Balanced Accuracy Score
model_performance.sort_values('balanced_accuracy_score', ascending=False).head(1)

| model | tp | fn | fp | tn | recall_stroke | recall_nostroke | balanced_accuracy_score |
|---|---|---|---|---|---|---|---|
| Logistic Regression | 30 | 54 | 131 | 1318 | 0.36 | 0.91 | 0.63 |


In [37]:
y_test.shape

(1533, 1)

In [39]:
y_test.value_counts()

stroke
0         1449
1           84
dtype: int64

# Conclusion:

All models reached high overall accuracy (0.88 to 0.92) on the test data but low recall for stroke (0.11 to 0.36), most likely because the original dataset is imbalanced.

As a benchmark, suppose we guess at random that half of the people will have a stroke. We would then expect roughly the following confusion matrix:<br>
tp: 42<br>
fp: 725<br>
tn: 724<br>
fn: 42<br>

This gives a recall score for stroke of 0.5 and a balanced accuracy score of about 0.5.
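The benchmark arithmetic can be reproduced with a short sketch. It uses the test-set class counts shown earlier (84 stroke, 1449 no stroke) and computes expected counts for a coin-flip guess, so the false positive and true negative values come out as 724.5 each rather than the rounded 725 and 724.

```python
# Test-set class counts reported by y_test.value_counts()
n_stroke, n_no_stroke = 84, 1449

# Probability that the random benchmark predicts "stroke"
p = 0.5

# Expected confusion-matrix counts for the random guess
tp = n_stroke * p
fn = n_stroke * (1 - p)
fp = n_no_stroke * p
tn = n_no_stroke * (1 - p)

recall_stroke = tp / (tp + fn)
recall_no_stroke = tn / (tn + fp)
balanced_acc = (recall_stroke + recall_no_stroke) / 2

print("tp:", tp, "fn:", fn, "fp:", fp, "tn:", tn)
print("Recall (stroke):", recall_stroke)
print("Balanced accuracy:", balanced_acc)
```

Changing `p` shows the trade-off: a guess that predicts stroke more often raises stroke recall but lowers no-stroke recall by the same amount, so the balanced accuracy of a random guess stays at 0.5.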

Logistic Regression currently shows the strongest performance without any further optimization.
It has a recall score for stroke of 0.36 and a balanced accuracy score of 0.63.

The balanced accuracy score of logistic regression (0.63) is better than the benchmark (the 50% guess model, 0.5), but its recall score for stroke (0.36) is weaker than the benchmark (0.5).

### Optimization suggestions

1. Use SMOTE for the imbalanced data (oversample the stroke class) and undersample the non-stroke class.<br>
https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/

2. Use Cross Validation, Grid Search, and Random Search to tune the hyperparameters.

3. Get more features and data, especially stroke cases.
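As a rough illustration of suggestion 2, the sketch below runs a cross-validated grid search on synthetic imbalanced data and scores candidates by balanced accuracy. Everything here is an assumption for demonstration: the data comes from `make_classification`, the parameter grid is arbitrary, and `class_weight='balanced'` is used as a simple stand-in for resampling. It is not SMOTE, which is provided by the separate imbalanced-learn package.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, recall_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Synthetic imbalanced data (about 5% positives), a stand-in for the stroke data
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Hypothetical search space: regularization strength and class weighting
param_grid = {'C': [0.01, 0.1, 1, 10],
              'class_weight': [None, 'balanced']}

# Stratified folds keep the class ratio the same in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                      scoring='balanced_accuracy', cv=cv)
search.fit(X_tr, y_tr)

# Evaluate the best configuration on the held-out split
y_hat = search.predict(X_te)
test_recall = recall_score(y_te, y_hat)
test_balanced_acc = balanced_accuracy_score(y_te, y_hat)

print("Best parameters:", search.best_params_)
print("Recall (positive class):", round(test_recall, 2))
print("Balanced accuracy:", round(test_balanced_acc, 2))
```

Scoring the search on balanced accuracy (or on recall) rather than plain accuracy matters here, because plain accuracy would favor models that mostly predict the majority class.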