# TITLE

## Final Project Submission

Please fill out:
* __Student name:__ Cassarra Groesbeck
* __Student pace:__ Part Time/ Flex
* __Scheduled project review date/time:__ 
* __Instructor name:__ 
* __Blog post URL:__



# 1. Introduction 
- Every year, more than 795,000 people in the United States have a stroke. About 610,000 of these are first or new strokes. About 87% of all strokes are ischemic strokes, in which blood flow to the brain is blocked. [Apr 5, 2022, CDC.gov](https://www.cdc.gov/stroke/facts.htm)

- From 1990 to 2019, the prevalence of stroke in the general population increased by about 60%. [Feb 3, 2022, newsroom.heart.org](https://newsroom.heart.org/news/u-s-stroke-rate-declining-in-adults-75-and-older-yet-rising-in-adults-49-and-younger)

- Stroke is the No. 5 cause of death and a leading cause of disability in the United States. 80% of strokes are preventable. [American Stroke Association](https://www.stroke.org/en/about-stroke)



## 1a. Objectives
Because a missed stroke is so costly, I want the models to capture as many potential stroke victims as possible, so I feel it appropriate to be very aggressive with stroke predictions. For this reason I have chosen Recall as the evaluation metric for my models. This will inevitably lead to extra false positives; however, given the health risks associated with strokes, I consider wrongly flagging a healthy individual as at risk an acceptable cost compared with declaring an at-risk individual to NOT be at risk.
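
To make the trade-off concrete, here is a minimal sketch of how recall and precision are computed from confusion-matrix counts. The two "models" and all of their counts are hypothetical, made up only to illustrate the idea; they are not results from this project.

```python
def recall(tp: int, fn: int) -> float:
    """Share of actual positives the model catches: TP / (TP + FN)."""
    return tp / (tp + fn)


def precision(tp: int, fp: int) -> float:
    """Share of positive predictions that are correct: TP / (TP + FP)."""
    return tp / (tp + fp)


# Hypothetical counts for two models scored on the same 50 stroke patients
cautious = {"tp": 20, "fn": 30, "fp": 10}
aggressive = {"tp": 45, "fn": 5, "fp": 200}

for name, c in [("cautious", cautious), ("aggressive", aggressive)]:
    print(f"{name:>10}: recall={recall(c['tp'], c['fn']):.2f}, "
          f"precision={precision(c['tp'], c['fp']):.2f}")
```

In this made-up case the aggressive model misses far fewer stroke patients but raises many more false alarms, which is the trade I am choosing to accept. In the notebook itself the same number comes from `sklearn.metrics.recall_score`.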


## 1b. Business Understanding

# 2. Data Understanding
This dataset contains 5110 observations with 12 attributes (11 clinical features) for predicting stroke events. 


### 2a. Attribute Information
| Column     | Description   |
|------------|:--------------|
| `id`               | **unique identifier**  |
| `gender`           | **"Male", "Female" or "Other"**  |
| `age`              | **age of the patient** |
| `hypertension`     | **0 if the patient doesn't have hypertension, 1 if the patient has hypertension**  |
| `heart_disease`    | **0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease**   |
| `ever_married`     | **"No" or "Yes"**  |
| `work_type`        | **"children", "Govt_job", "Never_worked", "Private" or "Self-employed"**   |
| `Residence_type`   | **"Rural" or "Urban"**  |
| `avg_glucose_level`| **average glucose level in blood**  |
| `bmi`              | **body mass index** |
| `smoking_status`   | **"formerly smoked", "never smoked", "smokes" or "Unknown"***  |
| `stroke`           | **1 if the patient had a stroke or 0 if not**  |
|    **_*Note:_**      | _"Unknown" in_ `smoking_status` _means that the information is unavailable for this patient_ |


### 2b. Acknowledgements
Data comes from the [Stroke Prediction Dataset](https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset), hosted on [Kaggle](https://www.kaggle.com).

# 3. Imports

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
#plt.style.use('seaborn')
sns.set_style('darkgrid', {'axes.facecolor': '0.9', "grid.color": ".6", "grid.linestyle": ":"})
%matplotlib inline

from imblearn.over_sampling import SMOTE, SMOTENC
from imblearn.pipeline import Pipeline as imbpipeline
from imblearn.combine import SMOTETomek

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer, SimpleImputer
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.metrics import confusion_matrix, plot_confusion_matrix,\
    precision_score, recall_score, accuracy_score, f1_score, log_loss,\
    roc_curve, roc_auc_score, classification_report, plot_roc_curve, auc

import warnings
warnings.filterwarnings('ignore')

seed = 42

# 4. Exploring the raw data

### 4a. Load and visually check the data

In [None]:
df = pd.read_csv('Data/healthcare-dataset-stroke-data.csv')

In [None]:
# Visual check of raw df
df.head()

### 4b. Drop unnecessary column, and identify target feature

In [None]:
# Drop 'id' column
df = df.drop('id', axis=1)

In [None]:
# Identify Target Feature
target = 'stroke'

### 4c. The basics

In [None]:
df.shape

In [None]:
df.info()

In [None]:
df.isnull().sum()

### 4d. Exploring nulls

In [None]:
# percent of missing `bmi` data, computed from the frame instead of hard-coded counts
f'{df["bmi"].isnull().mean()*100:.3}%'

In [None]:
# just the nulls
bmi_nulls = df[df['bmi'].isnull()]
# `stroke` patients with missing `bmi` data
bmi_nulls['stroke'].value_counts()

### 4e. Distribution of values

In [None]:
print('-'*30)
print('Distribution of Target Feature')
print('-'*30)
print('COUNTS:')
print(df[target].value_counts())
print('- '*15)
print('PERCENTAGES:')
# .items() yields (class label, share), so the printed label is the real class value
for label, share in df[target].value_counts(normalize=True).items():
    print(f'{label}\t{share*100:.4}%')
print(f'Name: {target}, dtype: int64') # just for symmetry
print('-'*30)

In [None]:
show = df.drop(target, axis=1)

print("-"*30)
print(f'Distribution of Other Features')
print("-"*30)
for column in show.columns:
    print("-"*30)
    print(f"UNIQUE VALUES: {len(show[column].unique())}")
    if len(show[column].unique()) <= 5:
        print("- "*15)
        print(show[column].value_counts())
    else:
        print("- "*15)
        print(f'\t\t  MIN: {show[column].min()}')
        print(f'\t\t  MEAN: {round(show[column].mean())}')
        print(f'\t\t  MAX: {show[column].max()}')
        print(f'Name: {column}, dtype: {show[column].dtype}')
    print("-"*30)

### 4f. Visualization of distribution of values 

In [None]:
# setup
fig, axes = plt.subplots(ncols=4, nrows=3, figsize=(12, 10))
fig.set_tight_layout(True)
# plot
for index, col in enumerate(df.columns):
    ax = axes[index//4][index%4]
    sns.histplot(data=df[col], ax=ax, linewidth=0.1, alpha=1)
    ax.tick_params(axis='x', rotation=60)

### 4g. Findings

- `gender`
 - There is only 1 'Other' value; for simplicity I will drop this 1 row.
 - About 1000 more women than men in this dataset.
- `age`
 - The youngest patient is under 1 year old; the oldest is 82.
 - Decent spread across the age range.
 - The average age of patients in this dataset is 43.
- `hypertension`
 - Binary, 1 if the patient has hypertension.
 - Very similar distribution as target feature.
- `heart_disease`
 - Binary, 1 if the patient has heart disease.
 - Almost identical distribution as target feature.
 - I am curious how `hypertension` & `heart_disease` will correlate with each other.
- `ever_married`
 - About 65/35 split with majority of patients listed as 'Yes.'
- `work_type`
 - Only 5 categories. Not surprisingly, more than half answered 'Private.'
 - Other 4 make up little more than 40%.
 - 'Never_worked' is less than 1%.
- `Residence_type`
 - Almost 50/50 split between 'Urban' and 'Rural.'
- `avg_glucose_level` & `bmi` 
 - Continuous features, both positively skewed.
 - `bmi` is missing 201 values (3.93%). Of those missing values, 40 are stroke patients.
   - Because of this missing data, an imputer is required for modeling.
- `smoking_status`
 - about 30% of patients are listed as 'Unknown.'
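
The `bmi` finding above is why an imputer is needed. As a minimal sketch of what imputation does, the snippet below fills missing values with the median of the observed ones. This is a deliberately simpler technique than the `IterativeImputer` used later in this notebook, and the sample values are made up for illustration.

```python
from statistics import median


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


bmi_sample = [22.5, None, 31.0, 27.4, None, 35.1]  # made-up values, not from the dataset
print(impute_median(bmi_sample))
```

In a modeling pipeline the fill value should be learned from the training data only and then reused on the test data, which is what placing the imputer inside the pipeline takes care of.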


### 4h. As indicated above (4g): Drop the 1 'Other' value in `gender`

In [None]:
data = df.drop(df[df['gender']=='Other'].index)

# 5. Separate and Split Data

### 5a. Separate data into features and target

In [None]:
X = data.drop(target, axis = 1)
y = data[target]

### 5b. Split data into train and test sets

In [None]:
# Hold out 20% of the rows for testing, stratified on the target
test_size = 0.20

X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y, 
                                                    stratify=y, 
                                                    test_size=test_size, 
                                                    random_state=seed)
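
Because stroke cases are rare, the split above passes `stratify=y`. The sketch below is a hand-rolled illustration of what stratification does, not scikit-learn's implementation: each class is shuffled and split separately so both sets keep the same class proportions. The function name and the demo labels are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx), splitting each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Made-up labels with a 5% positive rate
y_demo = [1] * 50 + [0] * 950
train_idx, test_idx = stratified_split(y_demo)
print(sum(y_demo[i] for i in test_idx) / len(test_idx))    # positive rate in test
print(sum(y_demo[i] for i in train_idx) / len(train_idx))  # positive rate in train
```

Without stratification, a random 20% sample of such imbalanced data could easily end up with noticeably more or fewer positives than the training set, which would make recall on the test set harder to interpret.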

# 6. Which features are better predictors?

Explain how this search relates to the business problem: why am I looking where I am looking, and what am I trying to find?
Start with a pairplot and PPS, hoping to find correlation in easy-to-identify features such as age, gender (strictly speaking "sex", but the dataset calls it `gender`), bmi, residence type, and smoking. Explain this.


### 6a. Make a visualization data frame with `X_train` and `y_train`

### 6b. Pairplot
Why a pairplot, what will this do?

### 6c. PPS 
what is a pps, link to article, what will it do?


### 6d. Explore top 3 individual features relative to stroke


# 7. Functions

### 7a. Very important words about my function
This function does all the heavy lifting for me. It can run a grid search, or bypass it, which is useful both for establishing baseline models and for testing a best-parameters model on the test data without repeating the computationally expensive grid search.

In [None]:
def test_check_model(model,
                     X_train=X_train,
                     X_test=X_test,
                     y_train=y_train,
                     y_test=y_test,
                     test_size=.20, 
                     random_state=seed, 
                     imputer=None,
                     smote=None,             
                     scaler=None,
                     grid_search=True,
                     param_dict=None,
                     use_test_data=False, 
                     show_classification_report=False, 
                     show_thresholds_table=False,  
                     show_plots=False,
                     display_labels=None):
    
    """   
    Uses sklearn.model_selection.train_test_split to divide data into train and test sets.
    
    If data has NaN values, an sklearn imputer to must be specified. 
    
    Option to use any sklearn scaler to scale data. 
    Option to use any SMOTE (must be appropiate for data, ie, categorical vs continuous, or both)
    Option to use sklearn GridSearchCV to search for best paramaeters for model.  
    Option to specify use of Test data for evaluation metrics.
    
    
    Output
    ---------- 
    (optional) Classification Report
    (optional) Thresholds, FPRs, TPRs Stats Table with AUC score
    (optional) Plots Confusion matrix and if available an ROC curve
    
    
    Returns
    ----------
    Recall score
    
    
    Parameters
    ----------
 
    model : supervised learning model to be evaluated. 
    
    pipe_grid_param_dict : dict or list of dictionaries
        sklearn.model_selection.GridSearchCV parameter: 
            Dictionary with parameters names (`str`) as keys and lists of
            parameter settings to try as values, or a list of such
            dictionaries, in which case the grids spanned by each dictionary
            in the list are explored. This enables searching over any sequence
            of parameter settings.
    
    data : pandas data frame, default=data
    
    target : string or variable set to a string, default=target
    
    test_size : float or int, default=.20
        sklearn.model_selection.train_test_split parameter:
            If float, should be between 0.0 and 1.0 and represent the proportion
            of the dataset to include in the test split. If int, represents the
            absolute number of test samples.
    
    random_state : int or RandomState instance, default=42
        Controls the shuffling applied to the data before applying the split.
        Pass an int for reproducible output across multiple function calls.

    use_test_data : boolean True or False, default=False
        determines the data used to asses model performance
    
    show_thresholds_report : boolean True or False, default=False
        Print table with AUC score, Thresholds, FPR's and TPR's
    
    show_plots : boolean True or False, default=False
        If target variable is binary, plots Confusion Matrix and ROC curve.
        Otherwise just Confusion Matrix
    
    display_labels : list or 'None', default=None
        If the target is binary 0,1 the labels can be changed to more descriptive labels. 
        example: ['Healthy', 'HeartDisease']

    """

    
    ######################################################################
    # 1. TRANSFORM-DATA PIPELINE                                         #
    ######################################################################   
    
    
    # 1a. Separate by type of data
    X_train_nums = X_train.select_dtypes('float64')
    X_train_cat = X_train.select_dtypes('object')


    # 1b. Pipeline 1 (numerical data)
    numerical_pipeline = Pipeline(steps=[
        ('scaler', scaler)])
    
    
    # 1c. Pipeline 2 (categorical data)  
    categorical_pipeline = Pipeline(steps=[
        ('ohe', OneHotEncoder(sparse=False, handle_unknown='ignore'))])


    # 1d. Converge pipelines 1 & 2
    trans = ColumnTransformer(transformers=[
        ('numerical', numerical_pipeline, X_train_nums.columns),
        ('categorical', categorical_pipeline, X_train_cat.columns)],
                             remainder='passthrough')
        
    
    # 1e. Model with converged pipeline
    model_pipe = imbpipeline(steps=[
        ('trans', trans),
        ('imputer', imputer),
        ('smote', smote),
        ('model', model)])

    
    ######################################################################
    # 2. GRID SEARCH PIPELINE                                            #
    ######################################################################   
    

    # 2a. Determine if using gridsearch
    if grid_search:
        best_model = GridSearchCV(estimator=model_pipe, 
                                  param_grid=param_dict, 
                                  scoring='recall', 
                                  cv=3)

        # fit the grid search; the refit best estimator is used for evaluation
        fit_model = best_model.fit(X_train, y_train) 

    # 2b. Otherwise fit the pipeline exactly as given
    else: 
        fit_model = model_pipe.fit(X_train, y_train) 
   

    ######################################################################
    # 3. PRINT MODEL DETAILS                                             #
    ######################################################################

    
    # 3a. Determine if using GridSearch
    if grid_search:
        
        """
        # 3a-1. Use .best_params_ dict and reformat to print the same way 
            classifiers/scalers/imputers etc are instantiated.


            Example: 
            ----------

               .best_params_ = {'model__criterion': 'gini', 
                                'model__max_depth': 6, 
                                'scaler__with_mean': True} 

                prints as: 
                    DecisionTreeClassifier(criterion='gini', max_depth=6, ..., N-param=N-value) 
                    StandardScaler(with_mean=True)
        """  
        kind_of_params = {}
        for k, v in fit_model.best_params_.items():
            step, param = k.split("__", 1)
            # repr() puts quotes around string values (e.g. solver='lbfgs',
            # criterion='gini') and leaves numbers and booleans bare
            kind_of_params.setdefault(step, []).append(f"{param}={v!r}")


        # 3a-2. Join each step's parameters with ", " (no trailing comma)
        for k, v in kind_of_params.items():
            kind_of_params[k] = ", ".join(v)


        # 3a-3. Print in copy paste format:
        # Ex. DecisionTreeClassifier(criterion='gini', ..., paramN=value)
        if 'model' in  kind_of_params.keys():
            model_text = str(model).split(")")[0]+", "+kind_of_params['model']+")"
            print(model_text) 
        else:
            print(model) # otherwise, use what was fed into function

   
        if 'imputer' in kind_of_params.keys():
            imputer_text = str(imputer).split("()")[0]+"("+kind_of_params['imputer']+")"
            print(imputer_text)
        else:
            print(imputer)


        if scaler is not None:
            if 'scaler' in kind_of_params.keys():
                scaler_text = str(scaler).split("()")[0]+"("+kind_of_params['scaler']+")"
                print(scaler_text) 
            else:
                print(scaler)  


        if smote is not None:
            if 'smote' in kind_of_params.keys():
                smote_text = str(smote).split(")")[0]+", "+kind_of_params['smote']+")"
                print(smote_text)
            else:
                print(smote)

    # 3b. If bypassing GridSearch             
    else: 
        
        # 3b-1. Print what was fed into function, example: DecisionTreeClassifier()
        # model & imputer are mandatory
        print(model)  
        print(imputer)

        # scaler and/or smote are optional (will not print if not used) 
        if scaler is not None:  
            print(scaler)
        if smote is not None:
            print(smote)
            
    # 3c. Print type of data used in model evaluation
    if use_test_data:
        data_used_text = "Test"
    else:
        data_used_text = "Training"
    
    print(f'\nModel evaluated with {data_used_text} data\n')
    

    ######################################################################
    # 4. (Optional) CLASSIFICATION REPORT                                #
    ######################################################################

    
    # 4a. Assign variables based on data using for evaluation
    if use_test_data:
        X_true = X_test
        y_true = y_test
    else: 
        X_true = X_train
        y_true = y_train
        
        
    # 4b. Make predictions and print report 
    y_preds = fit_model.predict(X_true)
    cr = classification_report(y_true, y_preds, digits=4)
    
    print('-'*54)
    print('\t\tCLASSIFICATION REPORT')
    print('-'*54)
    print(cr)
    print('-'*54)
        

        
    ######################################################################
    # 5. (Optional) THRESHOLDS TABLE                                     #
    ######################################################################   
   

    if show_thresholds_table:
        # 5a. Calculate the probability scores
        if ('LogisticReg' in str(model)):
            y_score = fit_model.decision_function(X_true) 
            fpr, tpr, thresholds = roc_curve(y_true, y_score)
        else:
            y_score = fit_model.predict_proba(X_true)
            fpr, tpr, thresholds = roc_curve(y_true, y_score[:,1]) # <-- probability of Class 1

        # 5b. Format values and print
        # To display as: THRESHOLD: value | FPR: percent%, TPR:percent%
        thresh_fp_tp = list(zip(thresholds, fpr, tpr))
        these_to_print = [f'THRESHOLD: {e[0]:.2f} | FPR: {e[1]:.2%}, TPR:{e[2]:.2%}' \
                          for e in thresh_fp_tp]        
        auc_score = auc(fpr, tpr)
        
        print('-'*54)
        print('\t\t  THRESHOLD STATS')
        print('-'*54)
        print(f'AUC: {auc_score}')
        print('- '*23)
        for element in these_to_print:
            print(element)
        print('-'*54)
        

    ######################################################################
    # 6. (Optional) VISUALIZATIONS                                       #
    ######################################################################
    
    
    if show_plots: 
        # Figure set up
        plt.style.use('fivethirtyeight')
        fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(13, 5))
        fig.suptitle(f'Model Evaluated on {data_used_text} Data', color='tab:blue', size=14)
        

        # Left  
        axes[0].set_title("Confusion Matrix", size=15)  
        axes[0].grid(False)
        plot_confusion_matrix(fit_model, X_true, y_true, 
                              cmap=plt.cm.Blues, 
                              ax=axes[0], 
                              display_labels=display_labels,
                              normalize='true')

        # Right 
        axes[1].set_title("ROC Curve", size=15)
        plot_roc_curve(fit_model, X_true, y_true, ax=axes[1]);


print("ran")

In [None]:
cat_column_indices = [X_train.columns.get_loc(c) \
                      for c in X_train.columns \
                      if c in X_train.select_dtypes('object').columns]

p_dic = {'model__criterion': ['gini', 'entropy'],
 'model__max_depth': [1, 2],
 'model__min_samples_split': [2,10],
 'model__min_samples_leaf': [1, 6]}

model = DecisionTreeClassifier(random_state=seed)

imputer=IterativeImputer(random_state=seed)
smote=SMOTENC(categorical_features=cat_column_indices, random_state=seed)             
scaler=StandardScaler()
#grid_search=True
#param_dict=None,
#use_test_data=False, 
#show_classification_report=False, 
#show_thresholds_table=False,  
#show_plots=False,
#display_labels=None):

In [None]:
test_check_model(model, imputer=imputer, grid_search=False, use_test_data=True, show_plots=True)

In [None]:
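# Weight each class by the OTHER class's frequency, so the rare
# positive class (stroke) gets the larger weight
class_freq = data[target].value_counts(normalize=True)
class_weight_dic = {0: round(class_freq[1], 6),
                    1: round(class_freq[0], 6)}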
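
The `class_weight_dic` cell above swaps the two class frequencies so that mistakes on the rare class cost more during training. Here is a small standalone sketch of the same idea on made-up labels; the function name is hypothetical and the 5% positive rate is only for illustration.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class of a binary target by the other class's frequency."""
    counts = Counter(labels)
    total = len(labels)
    freq = {cls: n / total for cls, n in counts.items()}
    return {0: round(freq[1], 6), 1: round(freq[0], 6)}


y_demo = [1] * 5 + [0] * 95  # made-up labels, 5% positive
print(inverse_frequency_weights(y_demo))
```

With these weights, each positive example counts about as much in the loss as nineteen negative ones, so the two classes contribute roughly equally overall.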

In [None]:
model3 = LogisticRegression(random_state=seed, class_weight=class_weight_dic)

In [None]:
test_check_model(model3, imputer=imputer, grid_search=False, show_plots=True)

In [None]:
test_check_model(model3, imputer=imputer, smote=smote, grid_search=False, show_plots=True)

In [None]:
test_check_model(model3, imputer=imputer, smote=smote, grid_search=False, show_plots=True, use_test_data=True)

TODO: 
- for loops for baseline models imputer with different imputers, 
- add return to function, I want scores 
- do more visualization, find corrs to start models small, one or two features then add features


# 8. Set up for modeling
have cat_cols, instantiate baseline models, imputers, and scaler


# 9. Modeling
Explain choices at each step: what did you find, and why are you moving in the direction you are?

# 10. Final Model
Why this final model? What is so great about it? How does it solve the business problem?