# TITLE

## Final Project Submission

Please fill out:
* __Student name:__ Cassarra Groesbeck
* __Student pace:__ Part Time/ Flex
* __Scheduled project review date/time:__ 
* __Instructor name:__ 
* __Blog post URL:__



# 1. Introduction 
Every year, more than 795,000 people in the United States have a stroke. About 610,000 of these are first or new strokes. About 87% of all strokes are ischemic strokes, in which blood flow to the brain is blocked ([CDC.gov](https://www.cdc.gov/stroke/facts.htm)). Strokes are the No. 5 cause of death and a leading cause of disability in the United States. 80% of strokes are preventable ([American Stroke Association](https://www.stroke.org/en/about-stroke)). From 1990 to 2019, the prevalence of stroke in the general population increased by about 60% ([newsroom.heart.org](https://newsroom.heart.org/news/u-s-stroke-rate-declining-in-adults-75-and-older-yet-rising-in-adults-49-and-younger)).


## 1a. Objectives
Because of the nature of this problem, and in order to capture as many potential stroke victims as possible, I feel it is appropriate to be very aggressive with stroke predictions. For this reason I have chosen Recall as the evaluation metric for my models. This will inevitably lead to extra false positives. However, because of the health risks associated with strokes, I feel that missing an at-risk individual is far more costly than the follow-up needed to confirm that a flagged individual is NOT at risk.
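
To make this trade-off concrete, here is a minimal sketch of how recall and precision respond to a more aggressive classifier. The counts are hypothetical and chosen only for illustration; they are not results from this project.

```python
def recall(tp, fn):
    """Share of actual stroke cases that the model flags."""
    return tp / (tp + fn)

def precision(tp, fp):
    """Share of flagged patients who actually had a stroke."""
    return tp / (tp + fp)

# Hypothetical confusion-matrix counts for a group with 50 actual stroke cases
cautious = {"tp": 20, "fn": 30, "fp": 10}     # flags few patients
aggressive = {"tp": 45, "fn": 5, "fp": 200}   # flags many patients

for name, c in [("cautious", cautious), ("aggressive", aggressive)]:
    print(f"{name:>10}: recall={recall(c['tp'], c['fn']):.2f}, "
          f"precision={precision(c['tp'], c['fp']):.2f}")
```

In this toy case the aggressive model catches far more of the true stroke cases, and pays for it with many more false alarms, which is the trade I am choosing to accept.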


## 1b. Business Understanding
Business that needs strokes predicted and why. (What are they going to do with this information?)

# 2. Data Understanding
This dataset contains 5110 observations with 12 attributes (11 clinical features) for predicting stroke events. 


### 2a. Attribute Information
| Column     | Description   |
|------------|:--------------|
| `id`               | **unique identifier**  |
| `gender`           | **"Male", "Female" or "Other"**  |
| `age`              | **age of the patient** |
| `hypertension`     | **0 if the patient doesn't have hypertension, 1 if the patient has hypertension**  |
| `heart_disease`    | **0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease**   |
| `ever_married`     | **"No" or "Yes"**  |
| `work_type`        | **"children", "Govt_job", "Never_worked", "Private" or "Self-employed"**   |
| `Residence_type`   | **"Rural" or "Urban"**  |
| `avg_glucose_level`| **average glucose level in blood**  |
| `bmi`              | **body mass index** |
| `smoking_status`   | **"formerly smoked", "never smoked", "smokes" or "Unknown"***  |
| `stroke`           | **1 if the patient had a stroke or 0 if not**  |
|    **_*Note:_**      | _"Unknown" in_ `smoking_status` _means that the information is unavailable for this patient_ |


### 2b. Acknowledgements
Data comes from the [Stroke Prediction Dataset](https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset), which can be found on [Kaggle](https://www.kaggle.com).

# 3. Imports

In [None]:
#pip install -U imbalanced-learn

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
#plt.style.use('seaborn')
sns.set_style('darkgrid', {'axes.facecolor': '0.9', "grid.color": ".6", "grid.linestyle": ":"})
%matplotlib inline

import xgboost

from imblearn.over_sampling import SMOTEN, SMOTENC
from imblearn.pipeline import Pipeline as imbpipeline

from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, BaggingClassifier, \
ExtraTreesClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer, SimpleImputer, KNNImputer
from sklearn.pipeline import Pipeline#, FeatureUnion
from sklearn.metrics import confusion_matrix, plot_confusion_matrix,\
    precision_score, recall_score, accuracy_score, f1_score, log_loss,\
    roc_curve, roc_auc_score, classification_report, plot_roc_curve, auc

import warnings
warnings.filterwarnings('ignore')

seed = 42

# 4. Exploring the Raw Data

### 4a. Load and Visually Check the Data

In [None]:
df = pd.read_csv('Data/healthcare-dataset-stroke-data.csv')

In [None]:
# Visual check of raw df
df.head()

### 4b. Drop Unnecessary `id` Column

In [None]:
df = df.drop('id', axis=1)

### 4c. Identify Target feature, `stroke`

In [None]:
target = 'stroke'

### 4d. The Basics

In [None]:
df.shape

In [None]:
df.info()

In [None]:
df.isnull().sum()

### 4e. Exploring Nulls

In [None]:
# percent of missing `bmi` data, computed from the data rather than hard-coded
f"{df['bmi'].isnull().mean()*100:.3}%"

In [None]:
# just the nulls
bmi_nulls = df[df['bmi'].isnull()]
# `stroke` patients with missing `bmi` data
bmi_nulls['stroke'].value_counts()

### 4f. Distribution of Values

In [None]:
print('-'*30)
print('Distribution of Target Feature')
print('-'*30)
print('COUNTS:')
print(df[target].value_counts())
print('- '*15)
print('PERCENTAGES:')
# iterate over (class label, fraction) pairs so the printed label is the real class value
for label, fraction in df[target].value_counts(normalize=True).items():
    print(f'{label}\t{fraction*100:.4}%')
print(f'Name: {target}, dtype: int64') # just for symmetry
print('-'*30)

In [None]:
show = df.drop(target, axis=1)

print("-"*30)
print(f'Distribution of Other Features')
print("-"*30)
for column in show.columns:
    print("-"*30)
    print(f"UNIQUE VALUES: {len(show[column].unique())}")
    if len(show[column].unique()) <= 5:
        print("- "*15)
        print(show[column].value_counts())
    else:
        print("- "*15)
        print(f'\t\t  MIN: {show[column].min()}')
        print(f'\t\t  MEAN: {round(show[column].mean())}')
        print(f'\t\t  MAX: {show[column].max()}')
        print(f'Name: {column}, dtype: {show[column].dtype}')
    print("-"*30)

### 4g. Visualizing the Distribution of Values

In [None]:
# setup
fig, axes = plt.subplots(ncols=4, nrows=3, figsize=(12, 10))
fig.set_tight_layout(True)
# plot
for index, col in enumerate(df.columns):
    ax = axes[index//4][index%4]
    sns.histplot(data=df[col], ax=ax, linewidth=0.1, alpha=1)
    ax.tick_params(axis='x', rotation=60)

### NOTES:

- `gender`
 - There is only 1 'Other' value; for simplicity I will drop this 1 row.
 - About 1000 more women than men in this dataset.
- `age`
 - The youngest patient is under 1 year old. The oldest patient is 82.
 - Decent distribution.
 - The average age of patients in this dataset is 43 years old.
- `hypertension`
 - Binary, 1 if the patient has hypertension.
 - Very similar distribution to the target feature.
- `heart_disease`
 - Binary, 1 if the patient has heart disease.
 - Almost identical distribution to the target feature.
 - I am curious how `hypertension` & `heart_disease` will correlate with each other.
- `ever_married`
 - About 65/35 split with majority of patients listed as 'Yes.'
- `work_type`
 - Only 5 categories. Not surprisingly, more than half answered 'Private.'
 - Other 4 make up little more than 40%.
 - 'Never_worked' is less than 1%.
- `Residence_type`
 - Almost 50/50 split between 'Urban' and 'Rural.'
- `avg_glucose_level` & `bmi` 
 - continuous features that are both skewed positively.
 - `bmi` is missing 201 values (3.93%). Of those missing values, 40 are stroke patients.
   - because of this missing data an imputer is required for modeling
- `smoking_status`
 - about 30% of patients are listed as 'Unknown.'
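
The missing `bmi` values noted above are why an imputer is needed. As a minimal illustration of what imputation does, here is a plain-Python median fill. This is only a sketch with hypothetical values and a hypothetical helper name, `impute_median`. It is much simpler than the scikit-learn imputers used later in this notebook, which should be fit on training data only.

```python
from statistics import median

def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

# Hypothetical BMI readings; None marks a missing value
bmi = [22.4, None, 31.0, 27.5, None, 24.1]
print(impute_median(bmi))
```

The median is used here because `bmi` is positively skewed, so a few large values would pull a mean-based fill upward.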


### 4h. As indicated above in NOTES: Drop 1 'Other' value in `gender`

In [None]:
data = df.drop(df[df['gender']=='Other'].index)

# 5. Separate and Split Data

### 5a. Separate data into features and target

In [None]:
X = data.drop(target, axis = 1)
y = data[target]

### 5b. Split data into train and test sets
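
Because the target is heavily imbalanced, the split is stratified on `y` so that both sets keep roughly the same share of stroke cases. The sketch below illustrates the idea in plain Python. The helper `stratified_split` and the toy target are hypothetical, and this is not how scikit-learn implements `train_test_split`; it only shows what stratification aims for.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with each class split in the same proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical imbalanced target: 5% positives
y_toy = [1] * 50 + [0] * 950
train_idx, test_idx = stratified_split(y_toy)

def positive_rate(idx):
    """Share of positive labels among the given row indices."""
    return sum(y_toy[i] for i in idx) / len(idx)

print(f"train positives: {positive_rate(train_idx):.3f}")
print(f"test positives:  {positive_rate(test_idx):.3f}")
```
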

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y, 
                                                    stratify=y, 
                                                    test_size=.2, 
                                                    random_state=seed)

# 6. Visualizing the `X_train` `y_train` data

### 6a. Make a visualization df

In [None]:
viz_df = pd.concat([X_train, y_train], axis=1)
viz_df.head()

### 6b. Pair Plot
I will start with a pair plot; this will help me visualize relationships between each variable.

In [None]:
sns.pairplot(viz_df, hue='stroke', markers=['*','<'], plot_kws={'linewidth':0.1});

### NOTES:
`age` appears to be an important feature. 

The pair plot is great for comparing continuous features, but it is of little use for comparing two binary features; notice the `hypertension`-`heart_disease` panels. Technically these features are categorical, with 1 and 0 representing "Yes" and "No," so it is not surprising that the pair plot reveals little about them, given they are masquerading as numeric values. It is also important to note that the pair plot excludes non-numeric features altogether: `gender`, `ever_married`, `work_type`, `Residence_type`, and `smoking_status` are absent. These features will need to be explored separately.

But first, based on the shortcomings of the pair plot, I would like to add a feature to better understand the relationship between `hypertension` and `heart_disease`; I will call this new feature `hyper_heart`. Second, based on the insight from the pair plot, I want to add a feature that separates out patients who are 50 and up. Finally, I will break down `bmi` and `avg_glucose_level` into categorical ranges, which I expect to be more conducive to machine learning on this dataset.


# 7. Adding Features
Add features to `viz_df` (remember this is X_train and y_train concatenated) then explore the categorical features. 
- `age_50+`
 - The values in this column will be 
   - 1 : patients 50 and above
   - 0 : patients 49 and below 
- `hyper_heart`
 - The values in this column will be 
   - 1 : patients with hypertension and heart disease 
   - 0 : if patients do not have BOTH hypertension and heart disease.
- `bmi_range`
 - The values in this column will be
   - 'underweight' : patients with bmi below 18.5
   - 'normal range' : patients with bmi from 18.5 up to (but not including) 25.0
   - 'overweight' : patients with bmi from 25.0 up to (but not including) 30.0
   - 'obese' : patients with bmi 30 or greater
  - Values for bmi ranges determined by the WHO classification of weight status; table found on the [National Library of Medicine](https://www.ncbi.nlm.nih.gov/books/NBK535456/figure/article-18425.image.f1/) website
- `avg_glucose_range`
 - The values in this column will depend on age
   - for children under 6 (0-5 years old) :
     - 'normal' : patients with average glucose level < 180
     - 'action suggested' : if 180 or greater
     - *note: I used 'action suggested' for children as I did not find much information on this age range and was uncomfortable labeling them 'diabetic' or 'pre-diabetic'*
   - for children aged 6 up to (but not including) 10
     - 'normal' : patients with average glucose level < 140 
     - 'action suggested' : if 140 or greater
   - for patients 10 and older
     - 'normal' : patients with average glucose level below 117
     - 'pre-diabetic' : patients with average glucose level from 117 up to (but not including) 137
     - 'diabetic' : if 137 or greater
  - Values for children's ranges determined by [Nationwide Children’s Hospital Diabetes Center Target Blood Glucose Ranges](https://www.nationwidechildrens.org/family-resources-education/health-wellness-and-safety-resources/resources-for-parents-and-kids/managing-your-diabetes/chapter-three-monitoring-blood-glucose)
  - Values for adult ranges determined by [Medical News Today](https://www.medicalnewstoday.com/articles/a1c-chart-diabetes-numbers). This was the hardest, as there is some discrepancy in these ranges. For example, the children's table groups anyone ten years and up as having a target range from 70 to 120, whereas the adult table labels anything below 117 as 'normal'. I chose to use the data from these two tables despite this discrepancy because they are from credible sources.
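
One way to encode these ranges so that every value falls into exactly one category is to use half-open intervals, where each upper bound is excluded and picked up by the next category. A minimal sketch follows; the function names `bmi_range` and `avg_glucose_range` are hypothetical helpers named after the columns, not the code used in the cells of this notebook.

```python
import math

def bmi_range(bmi):
    """Map a BMI value to a WHO-style weight category (missing stays missing)."""
    if bmi is None or math.isnan(bmi):
        return None
    if bmi < 18.5:
        return "underweight"
    if bmi < 25.0:
        return "normal range"
    if bmi < 30.0:
        return "overweight"
    return "obese"

def avg_glucose_range(age, glucose):
    """Map an average glucose level to a category that depends on age."""
    if age < 6:
        return "normal" if glucose < 180 else "action suggested"
    if age < 10:
        return "normal" if glucose < 140 else "action suggested"
    if glucose < 117:
        return "normal"
    if glucose < 137:
        return "pre-diabetic"
    return "diabetic"

# Boundary values land in exactly one category
print(bmi_range(18.5), "|", bmi_range(24.9), "|", bmi_range(25.0))
print(avg_glucose_range(4, 150), "|", avg_glucose_range(45, 120))
```

Writing the conditions as a chain of upper bounds avoids gaps between categories, so a boundary value such as a BMI of exactly 25.0 cannot fall through to the final `obese` branch by accident.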
  


### 7a. Make copies
Before I move forward I want to copy `X_train` and `X_test` in case I want them as a reference later.

In [None]:
X_train_original = X_train.copy()
X_test_original = X_test.copy()

### 7b. Add Columns 

In [None]:
dfs = [viz_df, X_train, X_test]  

for df_to_add_to in dfs:
    # print for reference
    print("\t\t\t\t\tBEFORE ADDING COLUMNS")
    print(df_to_add_to.shape)
    display(df_to_add_to.head())
    
    # create lists for columns
    hyper_heart_col = []
    age_50_plus_col = []
    bmi_ranges = []
    avg_glucose_ranges = []
    
    for i in df_to_add_to.index:        
        # Populate hyper_heart_col list
        if (df_to_add_to['hypertension'][i]==1) and (df_to_add_to['heart_disease'][i]==1):
            hyper_heart_col.append(1)
        else:
            hyper_heart_col.append(0)
            
        # Populate age_50_plus_col list
        if df_to_add_to['age'][i] >= 50:
            age_50_plus_col.append(1)
        else:
            age_50_plus_col.append(0)
        
        # Populate bmi_ranges list
        # (half-open bins, so boundary values such as 18.5 or 25.0 are not mislabeled 'obese')
        if pd.isnull(df_to_add_to['bmi'][i]):
            bmi_ranges.append(np.nan)
        elif df_to_add_to['bmi'][i] < 18.5:
            bmi_ranges.append('underweight')
        elif df_to_add_to['bmi'][i] < 25.0:
            bmi_ranges.append('normal range')
        elif df_to_add_to['bmi'][i] < 30.0:
            bmi_ranges.append('overweight')
        else:
            bmi_ranges.append('obese') 
        
        # Populate avg_glucose_ranges list
        if df_to_add_to['age'][i] < 6:
            if df_to_add_to['avg_glucose_level'][i] < 180:
                avg_glucose_ranges.append('normal')
            else:
                avg_glucose_ranges.append('action suggested')
        elif 6 <= df_to_add_to['age'][i] < 10:
            if df_to_add_to['avg_glucose_level'][i] < 140:
                avg_glucose_ranges.append('normal')
            else:
                avg_glucose_ranges.append('action suggested')
        else:
            if df_to_add_to['avg_glucose_level'][i] < 117:
                avg_glucose_ranges.append('normal')
            elif 117 <= df_to_add_to['avg_glucose_level'][i] < 137:
                avg_glucose_ranges.append('pre-diabetic')
            else:
                avg_glucose_ranges.append('diabetic')

    # Add columns to dfs
    df_to_add_to['hyper_heart'] = hyper_heart_col
    df_to_add_to['age_50+'] = age_50_plus_col
    df_to_add_to['bmi_range'] = bmi_ranges
    df_to_add_to['avg_glucose_range'] = avg_glucose_ranges

    print("\t\t\t\t\tAFTER ADDING COLUMNS")
    print(df_to_add_to.shape)
    display(df_to_add_to.head())


# 8. Visualizations of Categorical Features

### 8a. Convert to Categorical 
Before I plot, I will convert those sneaky false numeric features into their true categorical form. This will make for better looking labels and visuals.

In [None]:
nums_to_cats = ['stroke', 'hypertension', 'heart_disease', 'hyper_heart', 'age_50+']
for feat in nums_to_cats:
    viz_df[feat] = viz_df[feat].map({1:'Yes', 0:'No'})
    
viz_df.head()

### 8b. Plotting the Distribution of Categorical Features from `viz_df`

In [None]:
# Set up plot
fig, axes = plt.subplots(ncols=3, nrows=4, figsize=(15, 23))
fig.set_tight_layout(True)

# Lists for loops
feats = ['gender',
         'hypertension',
         'heart_disease',
         'ever_married',
         'work_type',
         'Residence_type',
         'smoking_status', 
         'hyper_heart',
         'age_50+', 
         'bmi_range', 
         'avg_glucose_range']

titles = ['Sex',
         'Hypertension',
         'Heart Disease',
         'Ever Married',
         'Work Type',
         'Residence Type',
         'Smoking Status',
         'Hypertension and Heart Disease', 
         '50 and above',
         'BMI Ranges', 
         'Average Glucose Ranges']

vals = ['No', 'Yes']

labels = ['Non-Stroke','Stroke']


for i in range(len(feats)):
    # Set up plots
    ax = axes[i//3][i%3]
    ax.set(xlabel=titles[i], ylabel='Number of Patients')
    
    for index in range(2):
        # Define the df 
        bar_df = viz_df[viz_df['stroke']==vals[index]]
        count_df = pd.DataFrame(bar_df.groupby([feats[i]])[feats[i]].count())
        plot_df = count_df.rename(columns={feats[i]: 'count'}).reset_index()

        # Plot
        ax.bar(plot_df[feats[i]], plot_df['count'], label=labels[index]) 
        ax.tick_params(axis='x', rotation=60)
        ax.legend();
        

### NOTES:
The added feature `age_50+` further suggests that age is perhaps the most significant feature. Other important factors seem to be `bmi_range` and `avg_glucose_range`. I am hesitant to label `ever_married` as significant simply because marriage coincides with age, but I am not ruling it out either. `hypertension`, `heart_disease`, sex (`gender`), and potentially `smoking_status` also seem to play important roles in strokes.
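
Raw counts in a bar chart can be dominated by how many patients fall in each category, so a rate per category can be a useful cross-check on impressions like these. The sketch below uses a hypothetical helper, `stroke_rate_by_category`, and made-up rows for illustration only; it is not output from this dataset.

```python
from collections import Counter

def stroke_rate_by_category(categories, strokes):
    """Return {category: share of patients in that category with stroke == 1}."""
    totals = Counter(categories)
    positives = Counter(c for c, s in zip(categories, strokes) if s == 1)
    return {c: positives[c] / totals[c] for c in totals}

# Hypothetical rows, for illustration only (not taken from the dataset)
age_50_plus = ["Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No", "No", "No"]
stroke = [1, 0, 1, 0, 0, 0, 0, 1, 0, 0]

for category, rate in stroke_rate_by_category(age_50_plus, stroke).items():
    print(f"{category}: {rate:.0%}")
```
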

Of the added features I will keep:
- `age_50+`
- `bmi_range`
- `avg_glucose_range`

And drop: 
- `hyper_heart`

I will also drop the 3 original continuous features:
- `age`
- `bmi`
- `avg_glucose_level`

 

# 9. Drop Features

In [None]:
dfs = [X_train, X_test] 

feats_to_drop = ['age', 'bmi', 'avg_glucose_level', 'hyper_heart']

for df_to_drop_from in dfs:    
    # print for reference
    print("\t\t\t\t\tBEFORE DROPPING COLUMNS")
    print(df_to_drop_from.shape)
    display(df_to_drop_from.head())
    
    # drop features 
    #df_to_drop_from = 
    df_to_drop_from.drop(feats_to_drop, axis=1, inplace=True)
    
    # reprint after drops
    print("\t\t\t\t\tAFTER DROPPING COLUMNS")
    print(df_to_drop_from.shape)
    display(df_to_drop_from.head())  


# 10. Function: `check_model` 


This function does all the heavy lifting for me. Not only can it establish baseline models, it can also handle grid search. It has the option to swap out imputers and scalers. Using SMOTE and scaling are optional, as is evaluating on test data. It can output classification reports, threshold tables, and a side-by-side plot of a confusion matrix and ROC curve. With each model check it can print the details of the classifier, imputer, scaler (if used) and SMOTE method (if used). Finally, it returns the fitted model so I can calculate and save scores to a df (vs reading the classification report).
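
The threshold table is the least familiar of these outputs, so here is a minimal plain-Python sketch of what each row means: for a given probability cut-off, what share of non-stroke patients are flagged (FPR) and what share of stroke patients are caught (TPR, i.e. recall). The helper `roc_points`, the labels, and the scores are hypothetical; the function itself relies on scikit-learn's `roc_curve` rather than this code.

```python
def roc_points(y_true, y_score, thresholds):
    """Return (threshold, FPR, TPR) tuples, predicting 1 when score >= threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    rows = []
    for t in thresholds:
        tp = sum(1 for y, s in zip(y_true, y_score) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, y_score) if s >= t and y == 0)
        rows.append((t, fp / negatives, tp / positives))
    return rows

# Hypothetical labels and predicted stroke probabilities
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
y_score = [0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.70, 0.90]

for t, fpr, tpr in roc_points(y_true, y_score, thresholds=[0.75, 0.50, 0.25]):
    print(f"THRESHOLD: {t:.2f} | FPR: {fpr:.2%}, TPR:{tpr:.2%}")
```

Lowering the threshold raises TPR and FPR together, which is why a recall-focused model ends up with extra false positives.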

In [None]:
def check_model(model,
                X_train=X_train,
                X_test=X_test,
                y_train=y_train,
                y_test=y_test, 
                random_state=seed, 
                imputer=None,
                smote=None,             
                scaler=None,
                grid_search=False,
                use_test_data=False, 
                print_model_details=False,
                show_classification_report=False, 
                show_thresholds_table=False,  
                show_plots=False,
                display_labels=None):
    
    """   
    
    Uses sklearn.pipeline.Pipeline, sklearn.compose.ColumnTransformer, & imblearn.pipeline.Pipeline 
    to scale, using any sklearn scaler (this step is optional), one hot encode, using 
    sklearn.preprocessing.OneHotEncoder, impute, using any sklearn imputer (this step is optional), 
    and smote (Synthetic Minority Over-sampling Technique), using any imblearn SMOTE technique (this step 
    is optional), to transform the data. Then, either fit that pipeline-model or feed it into 
    sklearn.model_selection.GridSearchCV then fit the gridsearch-model. By default GridSearch is bypassed
    with grid_search=False.
    
    
    Output
    ---------- 
    (optional) Print model details
    (optional) Classification Report
    (optional) Thresholds, FPRs, TPRs Stats Table with AUC score
    (optional) Plots Confusion matrix and ROC curve
    
    
    Returns
    ----------
    Trained model (model.fit(X_train, y_train))
    
    
    Parameters
    ----------
 
    model : supervised learning model to be evaluated. 
    
    X_train : pandas data frame, default=X_train
    
    X_test : pandas data frame, default=X_test
                
    y_train : pandas series, default=y_train,
                
    y_test : pandas series, default=y_test
    
    random_state : int or RandomState instance, default=42
        Controls the shuffling applied to the data before applying the split.
        Pass an int for reproducible output across multiple function calls.
        
    imputer : any sklearn imputer, default=None
        Transformers for missing value imputation
        NOTE: Because the data has NaN values, an sklearn imputer must be specified.
    
    smote : any imblearn SMOTE over-sampler (e.g. SMOTENC), default=None
        Synthetic Minority Over-sampling Technique
    
    scaler : any sklearn preprocessing scaler, default=None
    
    grid_search : False or dict or list of dictionaries, default=False
        If not set to False, this should be a dict for GridSearchCV param_grid.
            sklearn.model_selection.GridSearchCV parameter: 
                Dictionary with parameters names (`str`) as keys and lists of
                parameter settings to try as values, or a list of such
                dictionaries, in which case the grids spanned by each dictionary
                in the list are explored. This enables searching over any sequence
                of parameter settings.

    use_test_data : boolean True or False, default=False
        Determines the data used to assess model performance:
        test data if True, training data if False.
        
    print_model_details : boolean True or False, default=False
        Determines if function prints details about classifier, imputer, scaler (if used),
        and smote technique (if used).
        Example: 
            LogisticRegression(random_state=42, C=1.0, fit_intercept=True, max_iter=100, solver='lbfgs')
            KNNImputer()
            StandardScaler()
            SMOTENC(categorical_features=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9], random_state=42)
            
    
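    show_classification_report : boolean True or False, default=False
        Print sklearn.metrics.classification_report for the data being evaluated.
    
    show_thresholds_table : boolean True or False, default=False
        Print table with AUC score, Thresholds, FPR's and TPR's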
    
    show_plots : boolean True or False, default=False
        Plots Confusion Matrix and ROC curve.
    
    display_labels : list or 'None', default=None
        If the target is binary 0,1 the labels can be changed to more descriptive labels. 
        Example: ['Healthy', 'HeartDisease']

    """

    
    ######################################################################
    # 1. TRANSFORM-DATA PIPELINE                                         #
    ######################################################################   
    
    # Determine if df all categorical #TODO
    
    # 1a. Separate by type of data
    cat_col_names = []
    for col in X_train.columns:
        if (col in X_train.select_dtypes('object').columns) or (sorted(X_train[col].unique())==[0,1]):
            cat_col_names.append(col)
            
    
    X_train_cat = X_train[cat_col_names]
    X_train_nums = X_train.drop(cat_col_names, axis=1)
    
    
    #X_train_nums = X_train.select_dtypes('float64')
    #X_train_cat = X_train.select_dtypes('object')


    # 1b. Pipeline 1 (numerical data)
    numerical_pipeline = Pipeline(steps=[
        ('scaler', scaler)])
    
    
    # 1c. Pipeline 2 (categorical data)  
    categorical_pipeline = Pipeline(steps=[
        ('ohe', OneHotEncoder(sparse=False, handle_unknown='ignore'))])


    # 1d. Converge pipelines 1 & 2
    trans = ColumnTransformer(transformers=[
        ('numerical', numerical_pipeline, X_train_nums.columns),
        ('categorical', categorical_pipeline, X_train_cat.columns)],
                             remainder='passthrough')
        
    
    # 1e. Model with converged pipeline
    model_pipe = imbpipeline(steps=[
        ('trans', trans),
        ('imputer', imputer),
        ('smote', smote),
        ('model', model)])

    
    ######################################################################
    # 2. GRID SEARCH PIPELINE                                            #
    ######################################################################   
    

    # 2a. Determine if using gridsearch
    if grid_search != False:
        best_model = GridSearchCV(estimator=model_pipe, 
                           param_grid=grid_search, 
                           scoring='recall', 
                           cv=3)

        
        # fit the model to evaluate
        fit_model = best_model.fit(X_train, y_train) 
    
    # if no grid search then .fit converged pipeline
    else: 
        fit_model = model_pipe.fit(X_train, y_train) 
   

    ######################################################################
    # 3. (Optional) PRINT MODEL DETAILS                                  #
    ######################################################################

    if print_model_details: 
        
        # 3a. Determine if using GridSearch
        if grid_search:

            """
            # 3a-1. Use .best_params_ dict and reformat to print the same way 
                classifiers/scalers/imputers etc are instantiated.


                Example: 
                ----------

                   .best_params_ = {'model__criterion': 'gini', 
                                    'model__max_depth': 6, 
                                    'scaler__with_mean': True} 

                    prints as: 
                        DecisionTreeClassifier(criterion='gini', max_depth=6, ..., N-param=N-value) 
                        StandardScaler(with_mean=True)
            """  
            kind_of_params = {}
            for k,v in fit_model.best_params_.items():
                key = k.split("__")[0]
                if key not in kind_of_params.keys():
                    kind_of_params[key] = "" 
                # repr() keeps the quotes around string values (e.g. solver='lbfgs', criterion='gini')
                kind_of_params[key] += k.split("__")[1]+"="+repr(v)+", " #<-- notice comma


            # 3a-2. Remove extra comma at end of each dic value
            for k, v in kind_of_params.items():
                kind_of_params[k] = v[:-2]


            # 3a-3. Print in copy paste format:
            # Ex. DecisionTreeClassifier(criterion='gini', ..., paramN=value)
            if 'model' in  kind_of_params.keys():
                model_text = str(model).split(")")[0]+", "+kind_of_params['model']+")"
                print(model_text) 
            else:
                print(model) # otherwise, use what was fed into function


            if 'imputer' in kind_of_params.keys():
                imputer_text = str(imputer).split("()")[0]+"("+kind_of_params['imputer']+")"
                print(imputer_text)
            else:
                print(imputer)


            if scaler != None:
                if 'scaler' in kind_of_params.keys():
                    scaler_text = str(scaler).split("()")[0]+"("+kind_of_params['scaler']+")"
                    print(scaler_text) 
                else:
                    print(scaler)  


            if smote != None:
                if 'smote' in kind_of_params.keys():
                    smote_text = str(smote).split(")")[0]+", "+kind_of_params['smote']+")"
                    print(smote_text)
                else:
                    print(smote)

        # 3b. If bypassing GridSearch             
        else: 

            # 3b-1 Print what was fed into function, example: DecisionTreeClassifier()
            # model & imputer are mandatory
            print(model)  
            print(imputer)

            # scaler and/or smote are optional (will not print if not used) 
            if scaler != None:  
                print(scaler)
            if smote != None:
                print(smote)

    # 3c. Print type of data used in model evaluation
    if use_test_data:
        data_used_text = "Test"
    else:
        data_used_text = "Train"


    ######################################################################
    # 4. (Optional) CLASSIFICATION REPORT                                #
    ######################################################################

    
    # 4a. Assign variables based on data using for evaluation
    if use_test_data:
        X_true = X_test
        y_true = y_test
    else: 
        X_true = X_train
        y_true = y_train
        
    
    if show_classification_report:
        # 4b. Make predictions and print report 
        y_preds = fit_model.predict(X_true)
        cr = classification_report(y_true, y_preds, digits=4)
        print()
        print('-'*54)
        print(f'\t  CLASSIFICATION REPORT : {data_used_text} Data')
        print('-'*54)
        print(cr)
        print('-'*54)
        

        
    ######################################################################
    # 5. (Optional) THRESHOLDS TABLE                                     #
    ######################################################################   
   

    if show_thresholds_table:
        # 5a. Calculate the probability scores
        if ('LogisticReg' in str(model)):
            y_score = fit_model.decision_function(X_true) 
            fpr, tpr, thresholds = roc_curve(y_true, y_score)
        else:
            y_score = fit_model.predict_proba(X_true)
            fpr, tpr, thresholds = roc_curve(y_true, y_score[:,1]) # <-- probability of Class 1

        # 5b. Format values and print
        # To display as: THRESHOLD: value | FPR: percent%, TPR:percent%
        thresh_fp_tp = list(zip(thresholds, fpr, tpr))
        these_to_print = [f'THRESHOLD: {e[0]:.2f} | FPR: {e[1]:.2%}, TPR:{e[2]:.2%}' \
                          for e in thresh_fp_tp]        
        auc_score = auc(fpr, tpr)
        
        print('-'*54)
        print('\t\t  THRESHOLD STATS')
        print('-'*54)
        print(f'AUC: {auc_score}')
        print('- '*23)
        for element in these_to_print:
            print(element)
        print('-'*54)
        

    ######################################################################
    # 6. (Optional) VISUALIZATIONS                                       #
    ######################################################################
    
    
    if show_plots: 
        # Figure set up
        plt.style.use('fivethirtyeight')
        fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(13, 5))
        fig.suptitle(f'Model Evaluated on {data_used_text} Data', color='tab:blue', size=14)
        

        # Left  
        axes[0].set_title("Confusion Matrix", size=15)  
        axes[0].grid(False)
        plot_confusion_matrix(fit_model, X_true, y_true, 
                              cmap=plt.cm.Blues, 
                              ax=axes[0], 
                              display_labels=display_labels,
                              normalize='true')

        # Right 
        axes[1].set_title("ROC Curve", size=15)
        plot_roc_curve(fit_model, X_true, y_true, ax=axes[1]);
        
    
    return fit_model


print("ran")

In [None]:
########## don't delete yet TESTS FOR FUNCTION

d_tree_dic = {'model__criterion': ['gini', 'entropy'],
 'model__max_depth': [1, 2],
 'model__min_samples_split': [2,10],
 'model__min_samples_leaf': [1, 6]}



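# Inverse-frequency class weights: each class is weighted by the other class's share,
# so the rare stroke class (1) receives the larger weight.
zero = round(data[target].value_counts(normalize=True)[1],6)  # weight for class 0 = share of class 1
one = round(data[target].value_counts(normalize=True)[0],6)   # weight for class 1 = share of class 0
class_weight_dic={0:zero, 1: one}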

 
log_reg_dict = {'model__C' : [1.0, 1e12],
                'model__fit_intercept' : [True, False],
                'model__class_weight' : ['balanced', class_weight_dic],
                'model__solver' : ['lbfgs','liblinear'],
                'model__max_iter' : [100,500]}



imputer = IterativeImputer(random_state=seed)

# Categorical Column Names
cat_col_names = []
for col in X_train.columns:
    if (col in X_train.select_dtypes('object').columns) or (sorted(X_train[col].unique())==[0,1]):
        cat_col_names.append(col)

# Convert to indices      
cat_col_indices = [X_train.columns.get_loc(col) for col in cat_col_names]


smote=SMOTENC(categorical_features=cat_col_indices, random_state=seed)

In [None]:
model3 = LogisticRegression(random_state=seed)

In [None]:
show_model3 = check_model(model3, 
                          imputer=imputer, 
                          print_model_details=True,
                          show_classification_report=True, 
                          show_plots=True)

In [None]:
mod3_with_smotenc = check_model(model3, 
                                imputer=imputer, 
                                smote=smote, 
                                show_classification_report=True, 
                                show_plots=True)

In [None]:
mod3_with_gs = check_model(model3, 
                           imputer=imputer, 
                           scaler=StandardScaler(),
                           smote=smote,
                           grid_search=log_reg_dict,
                           print_model_details=True,
                           show_classification_report=True,
                           show_plots=True)


In [None]:
mod3_with_gs_NOsmote = check_model(model3, 
                                   imputer=imputer, 
                                   #smote=smote,
                                   grid_search=log_reg_dict,
                                   show_classification_report=True,
                                   show_plots=True)

In [None]:

model4 = LogisticRegression(random_state=42, 
                            C=1.0, 
                            class_weight={0: 0.048738, 1: 0.951262}, 
                            fit_intercept=True, 
                            max_iter=100, 
                            solver='lbfgs')

# Despite the "NOsmote" in the variable name, this run does pass smote=smote.
mod4_with_gs_NOsmote = check_model(model4, 
                                   imputer=imputer, 
                                   smote=smote,
                                   use_test_data=True,
                                   show_classification_report=True,
                                   show_plots=True)

In [None]:

##########

# 11. Set up for modeling
Below I instantiate the classifiers, imputers, scaler, and SMOTE sampler that I will loop through in step 12.

### 11a. Models

In [None]:
log_reg = LogisticRegression(random_state=seed) #class_weight

d_tree = DecisionTreeClassifier(random_state=seed) #class_weight

XGB = XGBClassifier(random_state=seed)

forest = RandomForestClassifier(random_state=seed) #class_weight

bag_tree = BaggingClassifier(DecisionTreeClassifier(random_state=seed))

abc = AdaBoostClassifier(random_state=seed)

etr = ExtraTreesClassifier(random_state=seed)

gbc = GradientBoostingClassifier(random_state=seed)

xgboost_XGB = xgboost.XGBClassifier(random_state=seed, objective='binary:logistic')

knn = KNeighborsClassifier()

### 11b. Imputers

In [None]:
iter_imputer = IterativeImputer(random_state=seed)

sim_immputer = SimpleImputer()

knn_imputer =  KNNImputer()

### 11c. Scaler

In [None]:
scaler = StandardScaler()

### 11d. Smote

In [None]:
# Categorical Column Names
cat_col_names = []
for col in X_train.columns:
    if (col in X_train.select_dtypes('object').columns) or (sorted(X_train[col].unique())==[0,1]):
        cat_col_names.append(col)

# Convert to indices      
cat_col_indices = [X_train.columns.get_loc(col) for col in cat_col_names]

if len(cat_col_names) == len(X_train.columns):
    smote = SMOTEN(random_state=seed)
else:
    smote = SMOTENC(categorical_features=cat_col_indices, random_state=seed)


# 12. Modeling without GridSearchCV
### Bypassing GridSearch for now.
A for loop runs each baseline model with and without scaling (StandardScaler when scaled) and with and without SMOTE (SMOTENC or SMOTEN, depending on whether or not all features are categorical), and stores the results in a `results` data frame. Some form of imputation is required because of NaNs in `bmi` or `bmi_ranges`, so the loop also runs through three imputers (IterativeImputer, SimpleImputer, KNNImputer). Scaling and class-imbalance techniques are optional, so each also has a None option. Note from step 11a that these models have no parameters set yet other than `random_state` (except KNeighborsClassifier, which does not have a `random_state` parameter).
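
The size of that grid can be sketched with `itertools.product`. This is only an illustration under assumptions: the string labels stand in for the real estimator objects, `planned_runs` is a hypothetical helper name, and the skip rule mirrors the loop's choice not to scale the data frames other than `original`.

```python
from itertools import product

# Placeholder labels only; the notebook loops over real estimator objects.
dfs = ['original', 'all_categorical', 'converted_cat']
model_names = ['LogisticRegression', 'DecisionTreeClassifier', 'XGBClassifier',
               'RandomForestClassifier', 'BaggingClassifier', 'AdaBoostClassifier',
               'ExtraTreesClassifier', 'GradientBoostingClassifier',
               'xgboost.XGBClassifier', 'KNeighborsClassifier']
imputer_names = ['IterativeImputer', 'SimpleImputer', 'KNNImputer']
scaler_names = ['None', 'StandardScaler']
smote_names = ['None', 'SMOTE']  # SMOTENC or SMOTEN is picked per data frame


def planned_runs():
    """Return every (df, model, imputer, scaler, smote) combination the loop fits."""
    runs = []
    for combo in product(dfs, model_names, imputer_names, scaler_names, smote_names):
        df_name, _, _, scaler_name, _ = combo
        # Scaling is skipped for every data frame other than 'original'.
        if scaler_name == 'StandardScaler' and df_name != 'original':
            continue
        runs.append(combo)
    return runs


runs = planned_runs()
print(len(runs))  # 240
```

Under these assumptions the `original` data frame accounts for 120 fits and each of the other two for 60, which gives a rough idea of how long the loop in 12c will take.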

### 12a. Create Lists for the Loop

In [None]:
dfs = ['original', 'all_categorical', 'converted_cat']

models = [log_reg,
          d_tree,
          XGB, 
          forest, 
          bag_tree, 
          abc, 
          etr, 
          gbc, 
          xgboost_XGB, 
          knn]

model_names = ['LogisticRegression',
               'DecisionTreeClassifier',
               'XGBClassifier', 
               'RandomForestClassifier', 
               'BaggingClassifier', 
               'AdaBoostClassifier', 
               'ExtraTreesClassifier', 
               'GradientBoostingClassifier', 
               'xgboost.XGBClassifier', 
               'KNeighborsClassifier']

imputers = [iter_imputer, 
            sim_immputer, 
            knn_imputer]

imputer_names = ['IterativeImputer', 
                  'SimpleImputer', 
                  'KNNImputer']

scalers = [None, scaler]

scaler_names = ['None', 'StandardScaler']


### 12b. Create Empty `results` Data Frame to Store Results

In [None]:
results = pd.DataFrame(columns = ['df_used', 'model', 'imputer', 'scaler', 'smote', \
                                  'train_recall', 'test_recall'])
results.head()

In [None]:
dfs

### 12c. Loop Through all Models, Imputers, Scaler, and Smote

In [None]:
num_of_loops = 0

# For each df
for idex in dfs:
    if idex == 'original':
        use_this_X_train = X_train_original
        use_this_X_test = X_test_original
    else:
        use_this_X_train = X_train
        use_this_X_test = X_test

        
    ################################################################
       
    # Loop Through Models
    for i in range(len(models)):
        model = models[i]
        model_name = model_names[i]
        
        num_of_loops += 1 
        print(f'{num_of_loops} of 30 loops')

        # Through Imputers
        for ind in range(len(imputers)):
            imputer = imputers[ind]
            imputer_name = imputer_names[ind]

            # With and Without Scaling
            for index in range(2):
                if (index == 1) & (idex != 'original'): # if all features categorical, skip scaling
                    continue
                else:
                    scaler = scalers[index]
                    scaler_name = scaler_names[index]

                # With and Without using SMOTE
                for smote_index in range(2):
                    if smote_index == 0: # None
                        class_imbal_option = None
                        class_imbal_name = 'None'
                    else:
                        # Categorical Column Names
                        cat_col_names = []
                        for col in use_this_X_train.columns:
                            if (col in use_this_X_train.select_dtypes('object').columns) \
                            or (sorted(use_this_X_train[col].unique())==[0,1]):
                                cat_col_names.append(col)
                                
                        # if all categorical 
                        if len(cat_col_names) == len(use_this_X_train.columns):
                            class_imbal_option = SMOTEN(random_state=seed)
                            class_imbal_name = 'SMOTEN'
                        else:
                            # Convert cat_col_names to indices      
                            cat_col_indices = [use_this_X_train.columns.get_loc(col) for col in cat_col_names]
                            class_imbal_option = SMOTENC(categorical_features=cat_col_indices, 
                                                         random_state=seed)
                            class_imbal_name = "SMOTENC"
                            

                    # Save Fitted Model
                    fitted = check_model(model,
                                         X_train=use_this_X_train,
                                         X_test=use_this_X_test,
                                         imputer=imputer, 
                                         scaler=scaler, 
                                         smote=class_imbal_option)

                    # Calculate Train and Test Recall Score
                    train_score = round(recall_score(y_train,fitted.predict(use_this_X_train)), 3)
                    test_score = round(recall_score(y_test,fitted.predict(use_this_X_test)), 3)

                    # Add to `results` df
                    results.loc[len(results.index)] = [idex, model_name, imputer_name, scaler_name, \
                                                       class_imbal_name, train_score, test_score] 


### 12d. View the Data Frame

In [None]:
# each model ranked
for model in model_names:
    display(results[results['model']==model].sort_values(['test_recall', 'train_recall'], ascending=False))

In [None]:
# best row for each model (highest test recall, ties broken by train recall)
bests = pd.DataFrame(columns = ['df_used', 'model', 'imputer', 'scaler', 'smote', \
                                  'train_recall', 'test_recall'])
for model in model_names:
    model_best = results[results['model']==model]\
    .sort_values(['test_recall', 'train_recall'], ascending=False)[:1]
    bests = pd.concat([bests,model_best])

bests

### 12e. Just the Baselines (no scaling or smote)

In [None]:
baselines = results[(results['scaler']=='None')&(results['smote']=='None')]
baselines = baselines.sort_values(['test_recall', 'train_recall'], ascending=False)
baselines

**NOTES:** Not surprisingly, there is a lot of overfitting, which is why GridSearchCV is so important. A note about the baseline imputers: no single imputer consistently outperformed the others. 

### 12f. Top Performers

In [None]:
results.sort_values(['test_recall', 'train_recall'], ascending=False)[:50]

### NOTES:
The top 5 performers without parameter tweaking are:
1. LogisticRegression
2. AdaBoostClassifier
3. GradientBoostingClassifier
4. KNeighborsClassifier
5. DecisionTreeClassifier
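
Because the baselines overfit, it can help to look at the gap between train and test recall alongside test recall itself. The sketch below is only an illustration: `recall` and `overfit_gap` are hypothetical helper names, and the labels are made-up toy values, not results from this notebook.

```python
def recall(y_true, y_pred):
    """Recall = true positives / (true positives + false negatives)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0


def overfit_gap(train_recall, test_recall):
    """Positive values mean the model does better on data it has already seen."""
    return round(train_recall - test_recall, 3)


# Toy labels (made up for illustration): 1 = stroke, 0 = no stroke.
y_train_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_train_pred = [1, 1, 1, 1, 0, 0, 1, 0]   # all 4 strokes caught
y_test_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_test_pred = [1, 0, 1, 0, 0, 0, 0, 1]    # only 2 of 4 strokes caught

train_recall = recall(y_train_true, y_train_pred)
test_recall = recall(y_test_true, y_test_pred)
print(train_recall, test_recall, overfit_gap(train_recall, test_recall))
# 1.0 0.5 0.5
```

The same subtraction could be applied to the `train_recall` and `test_recall` columns of `results` to flag configurations whose high training recall does not carry over to the test set.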

# 13. Use GridSearch

### something important to say

# 14. Final Model
Why this final model? What is so great about it? How does it solve the business problem?