### Stroke Prediction Dataset
# Modeling

Data source: https://www.kaggle.com/fedesoriano/stroke-prediction-dataset <br>
Data updated date: 2021-01-26

#### Supervised Learning: Classification model to predict a binary outcome
Outcome:
- 0: no stroke
- 1: stroke

We will use the following classification models for our prediction:
- Decision Tree
- Logistic Regression
- Random Forest

Model Evaluation:
- Confusion Matrix: Maximize the True Positive rate, minimize the False Negative rate.
- Recall for stroke
![title](img/ConfusionMatrix.ppm)
- Balanced accuracy
![title](img/Balanced-accuracy-formula.png)
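
As a small worked example of the two metrics above, the sketch below computes recall and balanced accuracy by hand from the four confusion-matrix counts. The counts are the ones the decision tree reports in section 1 of this notebook; the variable names are illustrative only.

```python
# Counts taken from the decision tree output in section 1 of this notebook.
tp, fn, fp, tn = 11, 52, 73, 1397

# Recall for "Stroke": share of actual stroke cases that were predicted as stroke.
recall_stroke = tp / (tp + fn)

# Recall for "No Stroke" (specificity): share of actual non-stroke cases predicted as non-stroke.
recall_no_stroke = tn / (tn + fp)

# Balanced accuracy: the average of the two per-class recalls.
balanced_accuracy = (recall_stroke + recall_no_stroke) / 2

print("Recall (Stroke):", round(recall_stroke, 2))
print("Recall (No Stroke):", round(recall_no_stroke, 2))
print("Balanced accuracy:", round(balanced_accuracy, 2))
```

Because balanced accuracy weights both classes equally, a model that predicts "No Stroke" for everyone scores 0.5, even though its plain accuracy on this test set would be about 0.96.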

# 0. Sourcing, Loading, and Evaluation Model

In [9]:
# import packages

import pandas as pd
import numpy as np
from sklearn.metrics import recall_score, balanced_accuracy_score, confusion_matrix, classification_report

# make notebook full width for better viewing

from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

In [10]:
# load data

X_train = pd.read_csv('data/X_train.csv', index_col=0)
X_test = pd.read_csv('data/X_test.csv', index_col=0)

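# squeeze the single target column into a 1-D Series, the shape scikit-learn expects for y
y_train = pd.read_csv('data/y_train.csv', index_col=0).squeeze("columns")
y_test = pd.read_csv('data/y_test.csv', index_col=0).squeeze("columns")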

In [11]:
# define an evaluation function that prints a model's performance

def evaluation(y_pred, y_test):
    # confusion matrix with labels=[1, 0]: rows are actual, columns are predicted,
    # so the flattened order is tp, fn, fp, tn
    tp, fn, fp, tn = confusion_matrix(y_test, y_pred, labels=[1, 0]).reshape(-1)
    print('Outcome values : tp:{}, fn:{}, fp:{}, tn:{}'.format(tp, fn, fp, tn))

    # model evaluation metrics - recall
    print('\nRecall score for "No Stroke": ', round(recall_score(y_test, y_pred, pos_label=0), 2))
    print('Recall score for "Stroke": ', round(recall_score(y_test, y_pred, pos_label=1), 2))

    # classification report for precision, recall, f1-score and accuracy
    matrix = classification_report(y_test, y_pred, target_names=['No Stroke', 'Stroke'])
    print('\nClassification report : \n', matrix)

    # model evaluation metrics - balanced accuracy
    print("Balanced accuracy:", round(balanced_accuracy_score(y_test, y_pred), 2))

# 1. Decision Tree

In [12]:
# import tree model
from sklearn.tree import DecisionTreeClassifier

# create the model
dt_model = DecisionTreeClassifier()

# fit the data
dt_model.fit(X_train, y_train)

# make prediction
y_pred = dt_model.predict(X_test)

In [13]:
# print model performance
evaluation(y_pred, y_test)

Outcome values : tp:11, fn:52, fp:73, tn:1397

Recall score for "No Stroke":  0.95
Recall score for "Stroke":  0.17

Classification report : 
               precision    recall  f1-score   support

   No Stroke       0.96      0.95      0.96      1470
      Stroke       0.13      0.17      0.15        63

    accuracy                           0.92      1533
   macro avg       0.55      0.56      0.55      1533
weighted avg       0.93      0.92      0.92      1533

Balanced accuracy: 0.56


# 2. Logistic Regression

In [14]:
# import model
from sklearn.linear_model import LogisticRegression

# create the model
lr_model = LogisticRegression()

# fit the model
lr_model.fit(X_train, y_train)

# make prediction
y_pred = lr_model.predict(X_test)



In [15]:
# print model performance
evaluation(y_pred, y_test)

Outcome values : tp:0, fn:63, fp:0, tn:1470

Recall score for "No Stroke":  1.0
Recall score for "Stroke":  0.0

Classification report : 
               precision    recall  f1-score   support

   No Stroke       0.96      1.00      0.98      1470
      Stroke       0.00      0.00      0.00        63

    accuracy                           0.96      1533
   macro avg       0.48      0.50      0.49      1533
weighted avg       0.92      0.96      0.94      1533

Balanced accuracy: 0.5




# 3. Random Forest

In [16]:
# import model
from sklearn.ensemble import RandomForestClassifier

# create the model
rf_model = RandomForestClassifier()

# fit the model
rf_model.fit(X_train, y_train)

# make prediction
y_pred = rf_model.predict(X_test)

  


In [17]:
# print model performance
evaluation(y_pred, y_test)

Outcome values : tp:0, fn:63, fp:3, tn:1467

Recall score for "No Stroke":  1.0
Recall score for "Stroke":  0.0

Classification report : 
               precision    recall  f1-score   support

   No Stroke       0.96      1.00      0.98      1470
      Stroke       0.00      0.00      0.00        63

    accuracy                           0.96      1533
   macro avg       0.48      0.50      0.49      1533
weighted avg       0.92      0.96      0.94      1533

Balanced accuracy: 0.5


In [33]:
from math import ceil
sample_size = ceil(len(y_test)/2)
sample_size

767

# Conclusion:

With default settings, Decision Tree, Logistic Regression, and Random Forest all performed poorly on this dataset, most likely because the classes are heavily imbalanced (only 63 of the 1,533 test samples are stroke cases).

If we make a rough guess that half of the people in the test set (767 of 1,533) will have a stroke, we get the following confusion matrix:<br>
tp: 28<br>
fp: 739 <br>
tn: 731<br>
fn: 35<br>

This gives a recall of 0.44 for stroke (28 / 63), higher than any of the models above (Decision Tree: 0.17; Logistic Regression: 0; Random Forest: 0).
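
The 0.44 above comes from a single guess. As a rough check, the sketch below repeats that kind of guess many times, assuming the guess flags a random half of the test set as stroke. The seed, the number of repetitions, and all variable names are arbitrary choices for illustration; only the totals (1,533 samples, 63 stroke cases, 767 flagged) come from this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

n_total, n_stroke = 1533, 63             # test set size and stroke cases, from the reports above
n_flagged = int(np.ceil(n_total / 2))    # 767, the sample size computed in the previous cell

# Build a label vector with the same class counts as y_test (order does not matter here).
y_true = np.zeros(n_total, dtype=int)
y_true[:n_stroke] = 1

recalls = []
for _ in range(1000):
    # Flag a random half of the samples as "stroke".
    flagged = rng.choice(n_total, size=n_flagged, replace=False)
    y_guess = np.zeros(n_total, dtype=int)
    y_guess[flagged] = 1

    # Recall for stroke = flagged stroke cases / all stroke cases.
    tp = int(np.sum((y_guess == 1) & (y_true == 1)))
    recalls.append(tp / n_stroke)

mean_recall = float(np.mean(recalls))
print("Mean recall of a random half-guess:", round(mean_recall, 2))
```

On average a random half-guess recovers about half of the stroke cases, so a single draw of 0.44 is within the normal spread, and the default models still fall well short of this naive baseline on recall.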

### Optimization suggestions

1. Apply SMOTE to oversample the stroke class, combined with undersampling of the non-stroke class, to address the class imbalance.<br>
https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/

2. Use cross-validation with grid search or random search to tune the hyperparameters.
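
A minimal sketch of the second suggestion follows, using scikit-learn's `GridSearchCV` with stratified cross-validation and recall as the selection metric. Everything specific here is an assumption for illustration: the data is synthetic (generated with `make_classification` at roughly 5% positives) rather than the stroke dataset, and the parameter grid, including the `class_weight` option, is a hypothetical starting point, not a tuned recommendation. SMOTE itself is not shown, since it requires the separate imbalanced-learn package.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with about 5% positives (NOT the stroke dataset).
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=5,
    weights=[0.95, 0.05], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Hypothetical grid; the values are illustrative only.
param_grid = {
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 5, 20],
    "class_weight": [None, "balanced"],
}

# Stratified folds keep the minority-class share similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# scoring="recall" selects the combination with the best recall for the positive class.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    scoring="recall",
    cv=cv,
)
search.fit(X_tr, y_tr)

test_recall = recall_score(y_te, search.predict(X_te))
print("Best parameters:", search.best_params_)
print("Recall on held-out data:", round(test_recall, 2))
```

Selecting on recall alone can favor models that flag almost everyone as positive, so it is worth checking the result against balanced accuracy as well, as the `evaluation` function in this notebook does.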