### Stroke Prediction Dataset
# Modeling

Data source: https://www.kaggle.com/fedesoriano/stroke-prediction-dataset <br>
Data updated date: 2021-01-26

#### Supervised Learning: Classification model to predict a binary outcome
Outcome:
- 0: no stroke
- 1: stroke

We will employ the following classifiers for our prediction:
- Decision Tree Classifier
- Logistic Regression
- Random Forest
- AdaBoost Classifier
- Gradient Boosting Classifier (scikit-learn's `GradientBoostingClassifier`, not the XGBoost library)
- SVM Classifier

Model Evaluation:
- Confusion Matrix: with stroke as the positive class, maximize the true positive rate (equivalently, minimize the false negative rate).
- Recall for stroke
![title](img/ConfusionMatrix.ppm)
- Balanced accuracy
![title](img/Balanced-accuracy-formula.png)
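
To make the two metrics concrete, here is a small sketch on made-up labels (not the stroke data). It computes recall for the positive class and balanced accuracy by hand, then compares them with scikit-learn's `recall_score` and `balanced_accuracy_score`. Balanced accuracy is the mean of the recall on each class, so a model that always predicts the majority class scores 0.5 no matter how high its plain accuracy is.

```python
from sklearn.metrics import balanced_accuracy_score, recall_score

# Illustrative labels only: 8 negatives (0) and 2 positives (1)
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_hat = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

# Count the four confusion-matrix cells by hand, with 1 as the positive class
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_hat))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_hat))
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_hat))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_hat))

sensitivity = tp / (tp + fn)   # recall for class 1
specificity = tn / (tn + fp)   # recall for class 0
balanced_acc = (sensitivity + specificity) / 2

# A model that always predicts the majority class
majority_only = [0] * len(y_true)
majority_bal_acc = balanced_accuracy_score(y_true, majority_only)

print("recall (class 1):", sensitivity, "sklearn:", recall_score(y_true, y_hat, pos_label=1))
print("balanced accuracy:", balanced_acc, "sklearn:", balanced_accuracy_score(y_true, y_hat))
print("balanced accuracy, majority-only model:", majority_bal_acc)
```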

# 0. Sourcing, Loading, and Evaluation Function

In [1]:
# import packages

import pandas as pd
import numpy as np
from sklearn.metrics import recall_score, balanced_accuracy_score, confusion_matrix, classification_report

# make notebook full width for better viewing

from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

In [2]:
# load data

X_train = pd.read_csv('data/X_train.csv', index_col=0)
X_test = pd.read_csv('data/X_test.csv', index_col=0)

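# flatten the single-column targets to 1-D arrays, the shape scikit-learn expects for y
y_train = pd.read_csv('data/y_train.csv', index_col=0).values.ravel()
y_test = pd.read_csv('data/y_test.csv', index_col=0).values.ravel()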

In [3]:
# define an evaluation function that prints a model's performance

def evaluation(y_pred, y_test):
    # confusion matrix with stroke (1) as the positive class;
    # labels=[0, 1] makes ravel() return tn, fp, fn, tp
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()
    print('Outcome values : tp:{}, fn:{}, fp:{}, tn:{}'.format(tp, fn, fp, tn))

    # model evaluation metrics - recall
    print('\nRecall score for "No Stroke": ' , round(recall_score(y_test,y_pred, pos_label = 0),2))
    print('Recall score for "Stroke": ' , round(recall_score(y_test,y_pred, pos_label = 1), 2))

    # classification report for precision, recall f1-score and accuracy
    matrix = classification_report(y_test, y_pred, target_names=['No Stroke', 'Stroke'])
    print('\nClassification report : \n',matrix)

    # model evaluation metrics - balanced accuracy
    print("Balanced accuracy:", round(balanced_accuracy_score(y_test,y_pred),2))
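
The order in which a flattened confusion matrix unpacks depends on the `labels` argument, which is easy to get wrong. As a quick sanity check, the sketch below uses made-up labels (not the stroke data) to show both orderings.

```python
from sklearn.metrics import confusion_matrix

# Illustrative labels only: 6 negatives (0) and 4 positives (1)
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_hat = [0, 0, 0, 0, 1, 1, 0, 1, 1, 1]

# labels=[0, 1]: negative class first, so ravel() yields tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat, labels=[0, 1]).ravel()

# labels=[1, 0]: positive class first, so ravel() yields tp, fn, fp, tn
tp2, fn2, fp2, tn2 = confusion_matrix(y_true, y_hat, labels=[1, 0]).ravel()

print('tp:{}, fn:{}, fp:{}, tn:{}'.format(tp, fn, fp, tn))
print('tp:{}, fn:{}, fp:{}, tn:{}'.format(tp2, fn2, fp2, tn2))
```

Both print statements report the same counts; only the unpacking order differs.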

# 1. Decision Tree

In [4]:
# import tree model
from sklearn.tree import DecisionTreeClassifier

# create the model
dt_model = DecisionTreeClassifier()

# fit the data
dt_model.fit(X_train, y_train)

# make prediction
y_pred = dt_model.predict(X_test)

In [5]:
# print model performance
evaluation(y_pred, y_test)

Outcome values : tp:15, fn:66, fp:58, tn:1394

Recall score for "No Stroke":  0.96
Recall score for "Stroke":  0.19

Classification report : 
               precision    recall  f1-score   support

   No Stroke       0.95      0.96      0.96      1452
      Stroke       0.21      0.19      0.19        81

    accuracy                           0.92      1533
   macro avg       0.58      0.57      0.58      1533
weighted avg       0.92      0.92      0.92      1533

Balanced accuracy: 0.57


# 2. Logistic Regression

In [6]:
# import model
from sklearn.linear_model import LogisticRegression

# create the model
lr_model = LogisticRegression()

# fit the model
lr_model.fit(X_train, y_train)

# make prediction
y_pred = lr_model.predict(X_test)


In [7]:
# print model performance
evaluation(y_pred, y_test)

Outcome values : tp:0, fn:81, fp:1, tn:1451

Recall score for "No Stroke":  1.0
Recall score for "Stroke":  0.0

Classification report : 
               precision    recall  f1-score   support

   No Stroke       0.95      1.00      0.97      1452
      Stroke       0.00      0.00      0.00        81

    accuracy                           0.95      1533
   macro avg       0.47      0.50      0.49      1533
weighted avg       0.90      0.95      0.92      1533

Balanced accuracy: 0.5


### Feature importance in Logistic Regression

In [8]:
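# use the absolute value of each coefficient as a rough importance score
coef = lr_model.coef_[0]
coef = [abs(number) for number in coef]
print(coef)

# feature names, in the same order as the coefficients
cols = list(X_train.columns)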

[1.5477730981286286, 0.23199879756941053, 0.03134308939892663, 0.3980515436231587, 0.31108743679060175, 0.0030237489741253266, 0.019673502089336783, 0.0, 0.16477366779621755, 0.036046173169635806, 0.20977405311416095, 0.5749300205501592, 0.19141722602389824, 0.25009878530617397, 0.15237428056429733, 0.08458649566589704]


In [9]:
# list the features from largest to smallest absolute coefficient
sorted_index = sorted(range(len(coef)), key = lambda k: coef[k], reverse = True)
for idx in sorted_index:
    print(cols[idx])

age
work_type_self-employed
hypertension
heart_disease
smoking_status_never smoked
avg_glucose_level
work_type_private
residence_type_urban
work_type_govt_job
smoking_status_smokes
smoking_status_unknown
work_type_never_worked
bmi
gender_male
ever_married
gender_other
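
Ranking features by absolute coefficient is only meaningful when the features are on comparable scales, and this notebook does not show whether the columns of `X_train` were scaled. The sketch below uses synthetic data to illustrate one common way to handle this: standardize inside a pipeline, then rank. The column names `small_unit`, `large_unit`, and `noise` are hypothetical and not part of the stroke dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 2000

# Hypothetical features: two equally informative signals in very different
# units, plus one pure-noise column
signal_small = rng.normal(size=n)
signal_large = rng.normal(size=n) * 1000.0
noise = rng.normal(size=n)
X_demo = pd.DataFrame({
    "small_unit": signal_small,
    "large_unit": signal_large,
    "noise": noise,
})

# Both signals contribute equally to the log-odds of the outcome
logit = signal_small + signal_large / 1000.0
y_demo = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

# Standardize first so coefficient magnitudes are comparable across columns
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_demo, y_demo)

scaled_coef = np.abs(pipe.named_steps["logisticregression"].coef_[0])
ranking = X_demo.columns[np.argsort(scaled_coef)[::-1]].tolist()
print(dict(zip(X_demo.columns, scaled_coef.round(2))))
print(ranking)
```

After standardization the two signal columns receive coefficients of similar size and the noise column ranks last, which is the behavior a coefficient-based ranking relies on.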


# 3. Random Forest

In [10]:
# import model
from sklearn.ensemble import RandomForestClassifier

# create the model
rf_model = RandomForestClassifier()

# fit the model
rf_model.fit(X_train, y_train)

# make prediction
y_pred = rf_model.predict(X_test)

  


In [11]:
# print model performance
evaluation(y_pred, y_test)

Outcome values : tp:1, fn:80, fp:4, tn:1448

Recall score for "No Stroke":  1.0
Recall score for "Stroke":  0.01

Classification report : 
               precision    recall  f1-score   support

   No Stroke       0.95      1.00      0.97      1452
      Stroke       0.20      0.01      0.02        81

    accuracy                           0.95      1533
   macro avg       0.57      0.50      0.50      1533
weighted avg       0.91      0.95      0.92      1533

Balanced accuracy: 0.5


# 4. AdaBoost Classifier

In [12]:
# import the model
from sklearn.ensemble import AdaBoostClassifier

# create the model
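# pass the base tree positionally: the keyword is `base_estimator` in older
# scikit-learn releases and `estimator` from version 1.2 onward
ab_model = AdaBoostClassifier(DecisionTreeClassifier(), n_estimators=100,
                            learning_rate=0.5, random_state=100)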

# fit the model
ab_model.fit(X_train, y_train)

# make prediction
y_pred = ab_model.predict(X_test)


In [13]:
# print model performance
evaluation(y_pred, y_test)

Outcome values : tp:13, fn:68, fp:55, tn:1397

Recall score for "No Stroke":  0.96
Recall score for "Stroke":  0.16

Classification report : 
               precision    recall  f1-score   support

   No Stroke       0.95      0.96      0.96      1452
      Stroke       0.19      0.16      0.17        81

    accuracy                           0.92      1533
   macro avg       0.57      0.56      0.57      1533
weighted avg       0.91      0.92      0.92      1533

Balanced accuracy: 0.56


# 5. Gradient Boosting (scikit-learn)

In [14]:
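# import model (scikit-learn's gradient boosting, not the XGBoost library)
from sklearn.ensemble import GradientBoostingClassifier

# create the model
xg_model = GradientBoostingClassifier()

# fit the model
xg_model.fit(X_train, y_train)

# make prediction
y_pred = xg_model.predict(X_test)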


In [15]:
# print model performance
evaluation(y_pred, y_test)

Outcome values : tp:1, fn:80, fp:6, tn:1446

Recall score for "No Stroke":  1.0
Recall score for "Stroke":  0.01

Classification report : 
               precision    recall  f1-score   support

   No Stroke       0.95      1.00      0.97      1452
      Stroke       0.14      0.01      0.02        81

    accuracy                           0.94      1533
   macro avg       0.55      0.50      0.50      1533
weighted avg       0.91      0.94      0.92      1533

Balanced accuracy: 0.5


# 6. SVM

In [16]:
# import model
from sklearn.svm import SVC

# create the model
sv_model = SVC()

# fit the model
sv_model.fit(X_train, y_train)

# make prediction on test dataset
y_pred = sv_model.predict(X_test)


In [17]:
# print model performance
evaluation(y_pred, y_test)

Outcome values : tp:0, fn:81, fp:0, tn:1452

Recall score for "No Stroke":  1.0
Recall score for "Stroke":  0.0

Classification report : 
               precision    recall  f1-score   support

   No Stroke       0.95      1.00      0.97      1452
      Stroke       0.00      0.00      0.00        81

    accuracy                           0.95      1533
   macro avg       0.47      0.50      0.49      1533
weighted avg       0.90      0.95      0.92      1533

Balanced accuracy: 0.5



# Conclusion:

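All six models performed poorly on the test data, most likely because the dataset is imbalanced: only 81 of the 1,533 test samples (about 5%) are stroke cases, so a model can reach roughly 95% accuracy by almost always predicting "No Stroke".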

If we instead guess at random, labeling about half of the people as stroke cases, we get the following confusion matrix:<br>
tp: 28<br>
fp: 739 <br>
tn: 732<br>
fn: 35<br>

This gives a recall for stroke of 28 / (28 + 35) = 0.44, higher than any of the models above (decision tree: 0.19; AdaBoost: 0.16; random forest: 0.01; gradient boosting: 0.01; logistic regression: 0.0; SVM: 0.0).
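
The arithmetic behind this baseline can be checked directly. The sketch below takes the four counts quoted above as given, with stroke as the positive class, and also computes balanced accuracy and precision for context.

```python
# Counts quoted above for the random-guess baseline (stroke is the positive class)
tp, fp, tn, fn = 28, 739, 732, 35

recall_stroke = tp / (tp + fn)       # share of actual stroke cases that were caught
recall_no_stroke = tn / (tn + fp)    # share of non-stroke cases correctly cleared
balanced_acc = (recall_stroke + recall_no_stroke) / 2
precision_stroke = tp / (tp + fp)    # share of stroke predictions that were right

print("recall (stroke):", round(recall_stroke, 2))
print("recall (no stroke):", round(recall_no_stroke, 2))
print("balanced accuracy:", round(balanced_acc, 2))
print("precision (stroke):", round(precision_stroke, 2))
```

The higher recall comes at the cost of a large number of false positives, so recall should be read together with balanced accuracy and precision.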

### Optimization suggestions

1. Address the class imbalance: oversample the stroke class with SMOTE and undersample the non-stroke class.<br>
https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/

2. Use cross-validation with grid search or random search to tune the hyperparameters.

3. Collect more features and more data.
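
As a starting point for the first two suggestions, here is a minimal sketch on synthetic data, not the stroke dataset. To stay within scikit-learn it swaps SMOTE for class weighting (`class_weight="balanced"`), a different and simpler way to counter imbalance, and tunes it with a stratified grid search scored on balanced accuracy. The data, parameter grid, and variable names are all illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, recall_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the stroke data: roughly 5% positives
X_demo, y_demo = make_classification(
    n_samples=3000, n_features=10, n_informative=5,
    weights=[0.95, 0.05], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0,
)

# Scaling and the classifier live in one pipeline so each CV fold is scaled
# using only its own training portion
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Illustrative grid: regularization strength and whether to reweight classes
param_grid = {
    "logisticregression__C": [0.01, 0.1, 1.0, 10.0],
    "logisticregression__class_weight": [None, "balanced"],
}

# Stratified folds keep the minority share similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipe, param_grid, scoring="balanced_accuracy", cv=cv)
search.fit(X_tr, y_tr)

y_hat = search.predict(X_te)
test_recall = recall_score(y_te, y_hat, pos_label=1)
test_bal_acc = balanced_accuracy_score(y_te, y_hat)

print("best parameters:", search.best_params_)
print("test recall (positive class):", round(test_recall, 2))
print("test balanced accuracy:", round(test_bal_acc, 2))
```

Scoring the search on balanced accuracy rather than plain accuracy keeps it from rewarding models that ignore the minority class. If resampling is preferred, it should be applied to the training folds only, never to the test data.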