_______________________________________________________________________________________________________________________________
Modeling Notebook. I will train six models and compare them against each other. For now I am focused on accuracy, but I will also calculate precision, sensitivity, specificity, and F1 from each model's confusion matrix. Note that the grid search below selects hyperparameters by F1 score, so accuracy is used only to compare the tuned models. This is by no means close to a production model, as more data is needed to train on. It just gives me a starting point for later iterations of this project, with hopes that a production model will be implemented in the future. The models used will be:
    
    1. Logistic Regression
    2. Gradient Boost
    3. KNN
    4. Random Forest
    5. Decision Trees
    6. Bagged Trees
With the help of Kate Skibo I was able to develop a model selection process that lets me run multiple models through GridSearchCV without having to call them separately. Note: The looped grid search might take an hour or so to run all the way through and is CPU intensive (scikit-learn runs it on all CPU cores with n_jobs=-1, not on the GPU). Be sure to close any unnecessary applications running in the background so they do not hinder performance.
_______________________________________________________________________________________________________________________________

In [17]:
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings("ignore", category=RuntimeWarning) 

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score, train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier 
from sklearn.pipeline import Pipeline
from sklearn.ensemble import  BaggingClassifier, RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.metrics import mean_squared_error
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

In [3]:
final = pd.read_csv('../data/model_df.csv')

In [4]:
final = final.drop(columns = ['Unnamed: 0', 'stroke_risk'])

In [5]:
final.head(15)

Unnamed: 0,age,hypertension,heart_disease,avg_glucose_level,bmi,stroke,diabetes,ever_married_Yes,gender_Male,gender_Other,work_type_Never_worked,work_type_Private,work_type_Self-employed,work_type_children,Residence_type_Urban,smoking_status_formerly smoked,smoking_status_never smoked,smoking_status_smokes
0,3.0,0,0,95.12,18.0,0,0,0,1,0,0,0,0,1,0,0,1,0
1,58.0,1,0,87.96,39.2,0,0,1,1,0,0,1,0,0,1,0,1,0
2,8.0,0,0,110.89,17.6,0,0,0,0,0,0,1,0,0,1,0,1,0
3,70.0,0,0,69.04,35.9,0,0,1,0,0,0,1,0,0,0,1,0,0
4,14.0,0,0,161.28,19.1,0,0,0,1,0,1,0,0,0,0,0,1,0
5,47.0,0,0,210.95,50.1,0,1,1,0,0,0,1,0,0,1,0,1,0
6,52.0,0,0,77.59,17.7,0,0,1,0,0,0,1,0,0,1,1,0,0
7,75.0,0,1,243.53,27.0,0,1,1,0,0,0,0,1,0,0,0,1,0
8,32.0,0,0,77.67,32.3,0,0,1,0,0,0,1,0,0,0,0,0,1
9,74.0,1,0,205.84,54.6,0,1,1,0,0,0,0,1,0,1,0,1,0


### Modeling: 
    1. Establish x and y variables for binary classification
    2. Establish a baseline using the y value
    3. Train-test split with train size .85
    4. Standard scale, since age, bmi, and glucose levels are on different scales
    5. Run multiple models
    6. Interpret scores

In [6]:
#Defining x, y, and the baseline
X = final.drop(columns = ['stroke','diabetes'])
y = final['stroke']
print(y.value_counts(normalize=True))

stroke
0    0.978726
1    0.021274
Name: proportion, dtype: float64
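
The proportions above mean that always predicting "no stroke" is already about 97.9% accurate while catching no strokes at all. A minimal sketch of that majority-class baseline follows. The label vector is synthetic (an assumption that only mimics the split above, not the notebook's data), and `DummyClassifier` is one possible way to express the baseline:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic stand-in for y: 979 negatives and 21 positives (assumed, roughly the split above)
y_demo = np.array([0] * 979 + [1] * 21)
X_demo = np.zeros((len(y_demo), 1))  # the dummy model ignores the features

# Always predict the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
baseline_preds = baseline.predict(X_demo)

baseline_acc = accuracy_score(y_demo, baseline_preds)
baseline_recall = recall_score(y_demo, baseline_preds)
print(f"Baseline accuracy: {baseline_acc:.3f}")
print(f"Baseline sensitivity: {baseline_recall:.3f}")
```

Any model worth keeping has to beat this accuracy while also achieving a sensitivity above zero.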


In [7]:
#train_test_split with StandardScaler
X_train, X_test, y_train, y_test = train_test_split(X,y, train_size=.85, random_state=42)
ss = StandardScaler()

#Fit and transform only on training data. Transform only (no fit) on testing data
Z_train = ss.fit_transform(X_train)
Z_test = ss.transform(X_test)
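
Because the scaler here is fit on the full training set before GridSearchCV splits it into folds, each validation fold has already contributed to the scaling statistics. One possible alternative, sketched below, is to put the scaler inside a `Pipeline` so it is refit on the training portion of every fold. The synthetic data from `make_classification`, the `*_demo` / `Xd_*` names, and the small KNN grid are assumptions for illustration, not the notebook's setup:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (assumption: not the stroke dataset)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.9, 0.1], random_state=42
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, train_size=0.85, random_state=42
)

# The scaler is a pipeline step, so each CV fold fits it on that fold's training part only
pipe = Pipeline([("ss", StandardScaler()), ("knn", KNeighborsClassifier())])

# Pipeline parameters are addressed as <step name>__<parameter name>
param_grid = {"knn__n_neighbors": [3, 5, 7]}

search = GridSearchCV(pipe, param_grid, scoring="f1", cv=5).fit(Xd_train, yd_train)
print(search.best_params_)
print(f"Testing F1 Score: {search.score(Xd_test, yd_test)}")
```

With this layout the raw `X_train` and `X_test` are passed straight to the search, and no separate `Z_train` / `Z_test` arrays are needed.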

In [15]:
# Define models and hyperparameters for grid search
models = [
    {'name': 'Logistic Regression', 'model': LogisticRegression(), 'params': {
        'penalty' : ['l2', None], 'C' : [.01, .1, 1, 10]  # None replaces the string 'none', deprecated since scikit-learn 1.2
    }},
           
    {'name': 'Gradient Boost', 'model': GradientBoostingClassifier(), 'params': {
        'n_estimators' : [50, 75, 100, 150, 250], 'learning_rate': [0.1, 0.5, 1], 'max_depth' : [1, 3, 5, 7, 10]}},
    
    {'name': 'KNeighbors' , 'model' : KNeighborsClassifier(), 'params': {
        'n_neighbors' : [3, 5, 7, 9, 11, 13, 15]}},
    
    {'name': 'Decision Tree', 'model': DecisionTreeClassifier(), 'params': {
        'max_depth' : [3, 5, 7, 10]}},
    
    {'name': 'Random Forest', 'model' : RandomForestClassifier(), 'params': {
        'n_estimators' : [100, 200, 300, 400, 500], 'max_depth' : [3, 5, 7, 10]}},
    
    {'name': 'Bagging Trees', 'model': BaggingClassifier(), 'params': {
        'n_estimators' : [100, 200, 300, 400, 500], 'max_samples' : [.1, .3, .5, .7, 1.0]  # float 1.0 = all samples; int 1 would draw a single sample
    }},
    
    
]
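
To get a feel for why the looped grid search takes so long, it helps to count how many fits it triggers. The sketch below copies the number of values per hyperparameter from the grids above and multiplies by the 5 folds used in the search. The dictionary and variable names are my own for illustration:

```python
from math import prod

# Number of values per hyperparameter, copied from the grids above
grid_sizes = {
    "Logistic Regression": [2, 4],   # penalty, C
    "Gradient Boost": [5, 3, 5],     # n_estimators, learning_rate, max_depth
    "KNeighbors": [7],               # n_neighbors
    "Decision Tree": [4],            # max_depth
    "Random Forest": [5, 4],         # n_estimators, max_depth
    "Bagging Trees": [5, 5],         # n_estimators, max_samples
}
cv_folds = 5

# Candidates per model = product of the grid dimensions
candidates = {name: prod(sizes) for name, sizes in grid_sizes.items()}
total_candidates = sum(candidates.values())
total_fits = total_candidates * cv_folds  # excludes the one final refit per model

for name, n in candidates.items():
    print(f"{name}: {n} candidates, {n * cv_folds} fits")
print(f"Total: {total_candidates} candidates, {total_fits} cross-validation fits")
```

Gradient Boost alone accounts for 75 of the 139 candidates, so trimming its grid is probably the quickest way to shorten the run.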

In [18]:
#Loop the gridsearch with all models
for model in models:
    #Gridsearch parameters
    cgs = GridSearchCV(model['model'], model['params'], scoring= 'f1',
                      cv=5, n_jobs=-1, return_train_score=True, refit=True).fit(Z_train, y_train)
    
    #Getting predictions on standard scaled testing data
    preds = cgs.predict(Z_test)
    
    #Creating a confusion matrix
    tn, fp, fn, tp = confusion_matrix(y_test, preds).ravel()
    
    #Printing out specificity, accuracy, precision, sensitivity, and F1 scores for both training and testing data, then printing the confusion matrix
    #Note: precision is nan when a model predicts no positives (tp + fp == 0)
    prec = tp / (tp + fp)
    sens = tp / (tp + fn)
    print(f"{model['name']} Best Params: {cgs.best_params_}")
    print(f"{model['name']}: Specificity: {tn / (tn + fp)}")
    print(f"{model['name']}: Accuracy: {(tn + tp) / (tn + fn + tp + fp)}")
    print("__" * 30)
    print(f"{model['name']}: Precision: {prec}")
    print(f"{model['name']}: Sensitivity: {sens}")
    print(f"{model['name']}: Training F1 Score: {cgs.score(Z_train, y_train)}")
    print(f"{model['name']}: Testing F1 Score: {cgs.score(Z_test, y_test)}")
    print(confusion_matrix(y_test, preds))
    print("==" * 30)

Logistic Regression Best Params: {'C': 0.01, 'penalty': 'l2'}
Logistic Regression: Specificity: 1.0
Logistic Regression: Accuracy: 0.9796619486051944
____________________________________________________________
Logistic Regression: Precision: nan
Logistic Regression: Sensitivity: 0.0
Logistic Regression: Training F1 Score: 0.0
Logistic Regression: Testing F1 Score: 0.0
[[7129    0]
 [ 148    0]]
Gradient Boost Best Params: {'learning_rate': 0.1, 'max_depth': 10, 'n_estimators': 250}
Gradient Boost: Specificity: 0.9985972787207182
Gradient Boost: Accuracy: 0.9869451697127938
____________________________________________________________
Gradient Boost: Precision: 0.863013698630137
Gradient Boost: Sensitivity: 0.42567567567567566
Gradient Boost: Training F1 Score: 1.0
Gradient Boost: Testing F1 Score: 0.5701357466063348
[[7119   10]
 [  85   63]]
KNeighbors Best Params: {'n_neighbors': 3}
KNeighbors: Specificity: 0.9920044887080937
KNeighbors: Accuracy: 0.9729284045623197
_________________
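
As a sanity check on the printed metrics, the sketch below recomputes them by hand from the Gradient Boost and Logistic Regression confusion matrices shown above. `safe_ratio` and `summarize` are hypothetical helpers written for this example only. The Logistic Regression case shows where the `nan` precision comes from: the model predicted no positives, so precision is 0 / 0.

```python
import math

def safe_ratio(numerator, denominator):
    """Hypothetical helper: return nan instead of raising when the denominator is 0."""
    return numerator / denominator if denominator else math.nan

def summarize(tn, fp, fn, tp):
    """Compute the metrics printed in the loop from raw confusion-matrix counts."""
    return {
        "specificity": safe_ratio(tn, tn + fp),
        "accuracy": safe_ratio(tn + tp, tn + fp + fn + tp),
        "precision": safe_ratio(tp, tp + fp),
        "sensitivity": safe_ratio(tp, tp + fn),
        "f1": safe_ratio(2 * tp, 2 * tp + fp + fn),
    }

# Counts read from the printed matrices, laid out as [[tn, fp], [fn, tp]]
gb = summarize(tn=7119, fp=10, fn=85, tp=63)
lr = summarize(tn=7129, fp=0, fn=148, tp=0)

for name, metrics in (("Gradient Boost", gb), ("Logistic Regression", lr)):
    for metric, value in metrics.items():
        print(f"{name}: {metric}: {value:.4f}")
```

The F1 score is written here as 2·tp / (2·tp + fp + fn), which is algebraically the same as the harmonic mean of precision and sensitivity.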

_______________________________________________________________________________________________________________________________
After all the models have run, we can look at the accuracy to determine which one is the best for our chosen metric of success. Remember the baseline score was .9787. Measured against that baseline, these models performed the best:
    
    1. Gradient Boost - .9869 acc
    2. Bagging Trees - .9857 acc
    3. Logistic Regression - .9797 acc
    
In all of these models there were over 7000 correctly predicted true negatives. Gradient Boost correctly predicted 63 true positives and Bagging Trees correctly predicted 45. Bagging Trees, however, had 103 false negatives as opposed to Gradient Boost's 85. Logistic Regression predicted no strokes at all (148 false negatives and a sensitivity of 0), so its accuracy only reflects the class imbalance. In order to move forward with a production model, I would like to see the false negatives be a lot lower. More data would be needed for that, and maybe more parameter tuning to mitigate these false negatives. The goal is to help predict strokes, not to give people a false sense of security when they could potentially face a life-threatening stroke.
_______________________________________________________________________________________________________________________________
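
One option I may explore in a later iteration for lowering false negatives is to move the decision threshold below the default of 0.5, trading extra false positives for fewer missed strokes. The sketch below is only an illustration under stated assumptions: the data is synthetic (`make_classification`), the 0.2 threshold is arbitrary, and the stratified split is a choice made for this example rather than something the notebook does.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (assumption: not the stroke dataset)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=8, weights=[0.95, 0.05], random_state=42
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, train_size=0.85, stratify=y_demo, random_state=42
)

clf = LogisticRegression(max_iter=1000).fit(Xd_train, yd_train)
proba = clf.predict_proba(Xd_test)[:, 1]  # predicted probability of the positive class

false_negatives = {}
for threshold in (0.5, 0.2):
    # Predict positive whenever the probability reaches the threshold
    preds = (proba >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(yd_test, preds, labels=[0, 1]).ravel()
    false_negatives[threshold] = int(fn)
    print(f"threshold={threshold}: fn={fn}, fp={fp}, tp={tp}")
```

Lowering the threshold can never increase the number of false negatives, since every case flagged at 0.5 is still flagged at 0.2. Whether the extra false positives are acceptable is a separate question that depends on the cost of a follow-up screening.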