### Stroke Prediction Dataset
# Modeling

Data source: https://www.kaggle.com/fedesoriano/stroke-prediction-dataset <br>
Data last updated: 2021-01-26

#### Supervised Learning: Classification model to predict a binary outcome
Outcome:
- 0: no stroke
- 1: stroke

We will employ the following classification models for our prediction:
- Decision Tree
- Logistic Regression
- Random Forest

Model Evaluation:
- Confusion Matrix: maximize the True Positive rate, minimize the False Negative rate.
- Recall for stroke
![title](img/ConfusionMatrix.ppm)
- Balanced accuracy
![title](img/Balanced-accuracy-formula.png)
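
As a minimal sketch of how these metrics follow from the confusion matrix, the snippet below counts the four outcomes from a handful of toy labels and derives recall and balanced accuracy from them. The helper names `confusion_counts` and `recall_and_balanced_accuracy` and the toy labels are illustrative assumptions; they are not part of the dataset or of scikit-learn.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count tp, fn, fp, tn for a binary problem (hypothetical helper)."""
    tp = fn = fp = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == positive and predicted == positive:
            tp += 1   # stroke correctly predicted
        elif actual == positive:
            fn += 1   # stroke missed
        elif predicted == positive:
            fp += 1   # false alarm
        else:
            tn += 1   # no stroke correctly predicted
    return tp, fn, fp, tn


def recall_and_balanced_accuracy(tp, fn, fp, tn):
    """Assumes both classes occur at least once, so no division by zero."""
    recall_pos = tp / (tp + fn)   # recall for the positive class (sensitivity)
    recall_neg = tn / (tn + fp)   # recall for the negative class (specificity)
    balanced = (recall_pos + recall_neg) / 2
    return recall_pos, recall_neg, balanced


# toy labels, made up for illustration only
toy_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
toy_pred = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0]

counts = confusion_counts(toy_true, toy_pred)
recall_pos, recall_neg, balanced = recall_and_balanced_accuracy(*counts)
print(counts, round(recall_pos, 2), round(recall_neg, 2), round(balanced, 2))
```

Balanced accuracy is the mean of the two per-class recalls, so a model cannot score well on it by favouring the majority class alone.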

# 0. Sourcing and Loading

In [2]:
# import packages

import pandas as pd
import numpy as np
from sklearn import metrics

# make notebook full width for better viewing

from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

In [3]:
# load data

X_train = pd.read_csv('data/X_train.csv', index_col=0)
X_test = pd.read_csv('data/X_test.csv', index_col=0)

y_train = pd.read_csv('data/y_train.csv', index_col=0)
y_test = pd.read_csv('data/y_test.csv', index_col=0)

# 1. Decision Tree

In [4]:
# import tree model
from sklearn.tree import DecisionTreeClassifier

# create the model
dt_model = DecisionTreeClassifier()

# fit the data
dt_model.fit(X_train, y_train)

# make prediction
y_pred = dt_model.predict(X_test)

In [6]:
# confusion matrix
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

# with labels=[1, 0] the flattened matrix is ordered tp, fn, fp, tn
tp, fn, fp, tn = confusion_matrix(y_test, y_pred, labels=[1, 0]).reshape(-1)
print('Outcome values : tp:{}, fn:{}, fp:{}, tn:{}'.format(tp, fn, fp, tn))

# model evaluation metrics - recall
print('\nRecall score for "No Stroke": ' , round(metrics.recall_score(y_test,y_pred, pos_label = 0),2))
print('Recall score for "Stroke": ' , round(metrics.recall_score(y_test,y_pred, pos_label = 1), 2))

# classification report for precision, recall, f1-score and accuracy
matrix = classification_report(y_test, y_pred, target_names=['No Stroke', 'Stroke'])
print('\nClassification report : \n',matrix)

# model evaluation metrics - balanced accuracy
print("Balanced accuracy:", round(metrics.balanced_accuracy_score(y_test,y_pred),2))

Outcome values : tp:14, fn:49, fp:72, tn:1398

Recall score for "No Stroke":  0.95
Recall score for "Stroke":  0.22

Classification report : 
               precision    recall  f1-score   support

   No Stroke       0.97      0.95      0.96      1470
      Stroke       0.16      0.22      0.19        63

    accuracy                           0.92      1533
   macro avg       0.56      0.59      0.57      1533
weighted avg       0.93      0.92      0.93      1533

Balanced accuracy: 0.59
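
The 0.92 accuracy above looks strong, but the test set is heavily imbalanced (63 stroke cases against 1470 without). As a rough check, the sketch below recomputes the decision tree's scores from the reported counts and compares them with a hypothetical baseline that always predicts "No Stroke". The baseline is an assumption introduced for comparison only; no such model is fitted in this notebook.

```python
# counts copied from the decision tree output above
tp, fn, fp, tn = 14, 49, 72, 1398
total = tp + fn + fp + tn

# decision tree scores recomputed from the counts
model_accuracy = (tp + tn) / total
model_recall_stroke = tp / (tp + fn)
model_balanced = (tp / (tp + fn) + tn / (tn + fp)) / 2

# hypothetical baseline: always predict "No Stroke"
# every true negative case is right, every stroke case is missed
baseline_accuracy = (fp + tn) / total
baseline_recall_stroke = 0.0
baseline_balanced = (0.0 + 1.0) / 2

print('Decision tree: accuracy', round(model_accuracy, 2),
      '| stroke recall', round(model_recall_stroke, 2),
      '| balanced accuracy', round(model_balanced, 2))
print('Baseline:      accuracy', round(baseline_accuracy, 2),
      '| stroke recall', baseline_recall_stroke,
      '| balanced accuracy', baseline_balanced)
```

The baseline reaches a higher plain accuracy than the tree while detecting no strokes at all, which is why recall for stroke and balanced accuracy are the metrics to watch here.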


# 2. Logistic Regression

In [None]:
# import model
from sklearn.linear_model import LogisticRegression

# create the model
lr_model = LogisticRegression()

# fit the model
lr_model.fit(X_train, y_train)

# make prediction
y_pred = lr_model.predict(X_test)
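
Since every model in this notebook is judged on the same metrics, the evaluation step could be wrapped in a small helper. The following is only a sketch: `evaluate_predictions` is a hypothetical name, and the toy labels stand in for `y_test` and `y_pred` so the block runs on its own.

```python
from sklearn import metrics


def evaluate_predictions(y_true, y_pred):
    """Return per-class recall and balanced accuracy (hypothetical helper)."""
    return {
        'recall_no_stroke': metrics.recall_score(y_true, y_pred, pos_label=0),
        'recall_stroke': metrics.recall_score(y_true, y_pred, pos_label=1),
        'balanced_accuracy': metrics.balanced_accuracy_score(y_true, y_pred),
    }


# toy labels, made up for illustration only
toy_true = [1, 1, 0, 0, 0, 0]
toy_pred = [1, 0, 0, 0, 1, 0]

scores = evaluate_predictions(toy_true, toy_pred)
for name, value in scores.items():
    print(name, round(value, 2))
```

In the notebook, the same call would take the real labels, for example `evaluate_predictions(y_test, y_pred)` after each model's prediction step.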

# 3. Random Forest