<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Business-Case" data-toc-modified-id="Business-Case-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Business Case</a></span></li><li><span><a href="#The-Data" data-toc-modified-id="The-Data-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>The Data</a></span></li><li><span><a href="#Evaluation" data-toc-modified-id="Evaluation-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Evaluation</a></span></li><li><span><a href="#Imports" data-toc-modified-id="Imports-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Imports</a></span></li><li><span><a href="#Bringing-in-the-Data" data-toc-modified-id="Bringing-in-the-Data-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Bringing in the Data</a></span></li><li><span><a href="#EDA" data-toc-modified-id="EDA-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>EDA</a></span></li><li><span><a href="#Cleaning" data-toc-modified-id="Cleaning-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Cleaning</a></span></li><li><span><a href="#Train-Test-Split" data-toc-modified-id="Train-Test-Split-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>Train Test Split</a></span></li><li><span><a href="#Pipelines" data-toc-modified-id="Pipelines-9"><span class="toc-item-num">9&nbsp;&nbsp;</span>Pipelines</a></span><ul class="toc-item"><li><span><a href="#Setting-up-column-transformer" data-toc-modified-id="Setting-up-column-transformer-9.1"><span class="toc-item-num">9.1&nbsp;&nbsp;</span>Setting up column transformer</a></span></li><li><span><a href="#Logistic-Regression-Vanilla" data-toc-modified-id="Logistic-Regression-Vanilla-9.2"><span class="toc-item-num">9.2&nbsp;&nbsp;</span>Logistic Regression Vanilla</a></span><ul class="toc-item"><li><span><a href="#Classification-Report:-Logistic-Regression-Vanilla" data-toc-modified-id="Classification-Report:-Logistic-Regression-Vanilla-9.2.1"><span class="toc-item-num">9.2.1&nbsp;&nbsp;</span>Classification Report: Logistic Regression 
Vanilla</a></span></li></ul></li><li><span><a href="#Logistic-Regression-GridSearchCV" data-toc-modified-id="Logistic-Regression-GridSearchCV-9.3"><span class="toc-item-num">9.3&nbsp;&nbsp;</span>Logistic Regression GridSearchCV</a></span></li></ul></li></ul></div>

## Business Case

I have been hired by a **fictitious** version of Costco Pharmacy (**my opinions are my own and in no way reflect those of Costco Pharmacy**) to gain a better understanding of what affects personal vaccination rates. The pharmacy hopes to encourage more of its members to visit for their seasonal flu vaccines.

Because of the COVID-19 pandemic, vaccination rates have never been such a talking point. Here in the US, the decision to get vaccinated has devolved into a political divide instead of a public health effort.

I will provide Costco Pharmacy with a model that predicts which of their members have been vaccinated and which have not, so that they can target advertising at those who have not been vaccinated. The model will also show which of our features matter most in whether someone gets the seasonal flu vaccine, which will help direct the marketing team.


## The Data

The data we will be looking at today comes from another pandemic, the H1N1 influenza pandemic. It originally comes from the National 2009 H1N1 Flu Survey, which also included information on the seasonal flu. I brought in the data from a competition hosted by DrivenData. You can find the download page [here](https://www.drivendata.org/competitions/66/flu-shot-learning/data/) (note: you have to sign up and register for the competition to download the data, but it's free to do so). You can also visit their benchmark notebook [here](https://drivendata.co/blog/predict-flu-vaccine-data-benchmark/), which treats this project as a multilabel classification problem covering both the H1N1 and seasonal flu vaccines. For now, I am only trying to predict seasonal flu vaccination.

In the future I hope to look at a similar classification problem with COVID-19, but the data acquisition required for that was outside the scope of this project. This data will still help our stakeholder gain a better understanding of what affects personal vaccination rates.

I would also like to look at data streams directly from Costco Pharmacy. I could then try to determine which other factors play important roles in vaccination rates, using information they are already collecting.

## Evaluation

Overall accuracy is important for this project, but the main metric we will be looking at is recall.

The marketing team has decided they would prefer to err on the side of caution and still market to those who are possibly already vaccinated (higher recall: we would rather miss as few unvaccinated members as possible). Marketing material sent to those who are already vaccinated might cause minor annoyance, but nothing more serious. It might also let those who got their seasonal flu vaccine at a different location know that Costco Pharmacy offers the same service.
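
To make the metric concrete, here is a minimal sketch using made-up labels (not the survey data). It assumes the labels are coded 1 = vaccinated and 0 = not vaccinated, and it uses scikit-learn's `recall_score` to compute recall for each class. Recall for class 0 is the share of unvaccinated people the model correctly identifies.

```python
from sklearn.metrics import recall_score

# Made-up labels for illustration only: 1 = vaccinated, 0 = not vaccinated
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]

# Recall for a class = correctly predicted members of that class
#                      / all true members of that class
recall_unvaccinated = recall_score(y_true, y_pred, pos_label=0)
recall_vaccinated = recall_score(y_true, y_pred, pos_label=1)

print(recall_unvaccinated)  # 3 of the 4 unvaccinated people are found: 0.75
print(recall_vaccinated)    # 2 of the 4 vaccinated people are found: 0.5
```

Which class counts as "positive" changes the number, so it is worth stating the class explicitly whenever recall is reported.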

## Imports

In [None]:
from pathlib import Path

import numpy as np
import pandas as pd
import seaborn as sns
import scipy.stats as stats
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import (
    train_test_split, 
    cross_val_score
)


from sklearn.pipeline import Pipeline



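# from sklearn.metrics import roc_curve, roc_auc_score
# Note: plot_confusion_matrix was removed in scikit-learn 1.2; newer
# versions provide ConfusionMatrixDisplay.from_estimator instead
from sklearn.metrics import (
    mean_squared_error, 
    plot_confusion_matrix, 
    roc_auc_score, 
    classification_report
)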



from sklearn.model_selection import GridSearchCV
# from sklearn.model_selection import RandomizedSearchCV


from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier

RANDOM_SEED = 42    # Set a random seed for reproducibility

pd.set_option("display.max_columns", 100)

## Bringing in the Data

In [None]:
DATA_PATH = Path.cwd().parent / "data"

In [None]:
df_features = pd.read_csv(
    DATA_PATH / "training_set_features.csv", 
    index_col="respondent_id"
)
df_labels = pd.read_csv(
    DATA_PATH / "training_set_labels.csv", 
    index_col="respondent_id"
)
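
Because features and labels arrive in separate files, a quick sanity check on their alignment can be useful before modeling. The sketch below uses tiny made-up frames in place of the real CSVs (the frame names and values are hypothetical). It checks that both share the same `respondent_id` index and then looks at the class balance of the target.

```python
import pandas as pd

# Toy stand-ins for the two CSVs; the real files are indexed by respondent_id
toy_features = pd.DataFrame(
    {"health_worker": [0.0, 1.0, 0.0]},
    index=pd.Index([0, 1, 2], name="respondent_id"),
)
toy_labels = pd.DataFrame(
    {"seasonal_vaccine": [0, 1, 1]},
    index=pd.Index([0, 1, 2], name="respondent_id"),
)

# True only if both frames contain the same respondents in the same order
rows_aligned = toy_features.index.equals(toy_labels.index)
print(rows_aligned)

# Share of respondents in each class of the target
class_balance = toy_labels["seasonal_vaccine"].value_counts(normalize=True)
print(class_balance)
```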

## EDA

## Cleaning

The majority of the data cleaning (e.g., one-hot encoding and imputing NaNs) will happen in the pipeline.

In [None]:
# Drop the features related to H1N1, since we would not have this
# information for future data
select_features = ['behavioral_antiviral_meds', 
                   'behavioral_avoidance',
                   'behavioral_face_mask', 
                   'behavioral_wash_hands', 
                   'behavioral_large_gatherings', 
                   'behavioral_outside_home', 
                   'behavioral_touch_face', 
                   'doctor_recc_seasonal', 
                   'chronic_med_condition', 
                   'child_under_6_months', 
                   'health_worker', 
                   'health_insurance', 
                   'opinion_seas_vacc_effective', 
                   'opinion_seas_risk', 
                   'opinion_seas_sick_from_vacc', 
                   'age_group', 
                   'education', 
                   'race', 
                   'sex',
                   'income_poverty', 
                   'marital_status', 
                   'rent_or_own', 
                   'employment_status', 
                   'hhs_geo_region', 
                   'census_msa', 
                   'household_adults', 
                   'household_children', 
                   'employment_industry', 
                   'employment_occupation']



df_clean = df_features[select_features]
df_clean.info()

Here you can see that we dropped the six features related to H1N1:

In [None]:
df_features.shape

In [None]:
df_clean.shape
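
As an alternative to listing every column to keep, the H1N1 columns could be dropped by name. This is only a sketch on a made-up frame: it assumes that every H1N1-specific column has `h1n1` somewhere in its name, which should be verified against the real column list before relying on it.

```python
import pandas as pd

# Made-up rows; only the column names mirror the survey data
toy = pd.DataFrame({
    "h1n1_concern": [1.0, 2.0],
    "doctor_recc_h1n1": [0.0, 1.0],
    "doctor_recc_seasonal": [1.0, 0.0],
    "opinion_seas_risk": [4.0, 2.0],
    "age_group": ["18 - 34 Years", "65+ Years"],
})

# Collect every column whose name mentions H1N1, then drop them all at once
h1n1_columns = [col for col in toy.columns if "h1n1" in col]
toy_clean = toy.drop(columns=h1n1_columns)

print(h1n1_columns)
print(toy_clean.shape)
```

Comparing the shapes before and after the drop, as done above, confirms how many columns were removed.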

## Train Test Split

In [None]:
# Defining X, y
y = df_labels['seasonal_vaccine']
# Use the cleaned features so the H1N1 columns are excluded from the model
X = df_clean

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=RANDOM_SEED)
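
One optional variation, not used above, is to stratify the split on the target so that train and test keep the same class proportions. The sketch below shows the idea on synthetic arrays (the 60/40 class mix is made up), using the `stratify` argument of `train_test_split`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 60% class 0 and 40% class 1
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 60 + [1] * 40)

# stratify=y_toy asks for the same class proportions in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy)

# Share of class 1 in each split; both should be close to 0.4
print(y_tr.mean(), y_te.mean())
```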

## Pipelines

### Setting up column transformer

I will be using the mode to fill in numeric NaNs, because most of my data is either binary or an opinion ranked 1-5.

For categorical data, I will simply fill NaNs with a "missing" category.
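
The two imputation strategies can be seen in isolation on a couple of made-up columns. This sketch uses `SimpleImputer` with `strategy="most_frequent"` for a numeric column and `strategy="constant"` with `fill_value="missing"` for a categorical one; the values are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up columns with one missing value each
toy_num = pd.DataFrame({"opinion_seas_risk": [1.0, 4.0, 4.0, np.nan]})
toy_cat = pd.DataFrame(
    {"education": ["College Graduate", np.nan, "12 Years"]}, dtype=object)

# Numeric NaNs are replaced by the most common value in the column (the mode)
num_filled = SimpleImputer(strategy="most_frequent").fit_transform(toy_num)

# Categorical NaNs become their own "missing" category
cat_filled = SimpleImputer(
    strategy="constant", fill_value="missing").fit_transform(toy_cat)

print(num_filled.ravel())  # the NaN is filled with 4.0, the mode
print(cat_filled.ravel())  # the NaN is filled with the string "missing"
```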

In [None]:
# We'll throw these mini-pipelines into our ColumnTransformer.

subpipe_num = Pipeline(steps=[
    # Fills in numeric NaNs with the mode ('most_frequent')
    ('num_impute', SimpleImputer(strategy='most_frequent')),
    # Scales the data
    ('ss', StandardScaler())
])
subpipe_cat = Pipeline(steps=[
    # Fills in categorical NaNs with the string "missing"
    ('cat_impute', SimpleImputer(fill_value="missing", strategy="constant")),
    # One-hot encoding; sparse=False returns a dense array
    # (the argument is named sparse_output in scikit-learn >= 1.2)
    ('ohe', OneHotEncoder(sparse=False, handle_unknown='ignore'))
])

In [None]:
# Applying the mini-pipelines to their respective columns:
# the numeric subpipe to numeric columns, the categorical subpipe to
# categorical columns

# Selects the respective columns (numeric and categorical)
numeric_columns = df_clean.select_dtypes(exclude=object).columns
cat_columns = df_clean.select_dtypes(include=object).columns

# Applies the transformer
CT = ColumnTransformer(
    transformers=[
        ('subpipe_num', subpipe_num, numeric_columns),
        ('subpipe_cat', subpipe_cat, cat_columns)
    ],
    # remainder='passthrough' tells the ColumnTransformer to leave
    # any other columns unchanged
    remainder='passthrough'
)
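
To see what a transformer like this produces, here is a self-contained sketch on a tiny made-up frame. The names `toy`, `num_pipe`, `cat_pipe`, and `toy_ct` are hypothetical. It passes `sparse_threshold=0` to the `ColumnTransformer` to force a dense result instead of setting a flag on the encoder, purely so the sketch does not depend on the scikit-learn version.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up data: one numeric and one categorical column, each with a NaN
toy = pd.DataFrame({
    "health_worker": [0.0, 1.0, np.nan, 0.0],
    "sex": pd.Series(["Female", "Male", np.nan, "Female"], dtype=object),
})

num_pipe = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("scale", StandardScaler()),
])
cat_pipe = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="constant", fill_value="missing")),
    ("ohe", OneHotEncoder(handle_unknown="ignore")),
])

# sparse_threshold=0 makes the combined output a dense NumPy array
toy_ct = ColumnTransformer(
    transformers=[
        ("num", num_pipe, ["health_worker"]),
        ("cat", cat_pipe, ["sex"]),
    ],
    sparse_threshold=0,
)

transformed = toy_ct.fit_transform(toy)
# 1 scaled numeric column + 3 one-hot columns (Female, Male, missing)
print(transformed.shape)
```

Note that the "missing" fill value turns into its own one-hot column, so missingness itself becomes a feature the model can use.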

### Logistic Regression Vanilla

In [None]:
pipeline_lr_vanilla = Pipeline(steps=[
    ('ct', CT),
    ('lr', LogisticRegression())
])

In [None]:
pipeline_lr_vanilla.fit(X_train, y_train)

#### Classification Report: Logistic Regression Vanilla

In [None]:
# 5-fold cross-validated accuracy (the default scoring for a classifier)
scores = cross_val_score(pipeline_lr_vanilla, X_train, y_train, cv=5)
scores.mean()
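
When no `scoring` argument is given, `cross_val_score` falls back on the estimator's own score, which is accuracy for a classifier. Since recall is the main metric for this project, the sketch below shows how recall could be cross-validated instead. It runs on synthetic data from `make_classification` rather than the survey data, and `scoring="recall"` measures recall for the class labeled 1.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary classification data, for illustration only
X_syn, y_syn = make_classification(
    n_samples=300, n_features=8, random_state=42)

# scoring="recall" replaces the default accuracy with recall for class 1
recall_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_syn, y_syn, cv=5, scoring="recall")

print(recall_scores)         # one recall value per fold
print(recall_scores.mean())  # average recall across the 5 folds
```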

In [None]:
y_pred = pipeline_lr_vanilla.predict(X_test)
print(classification_report(y_test, y_pred))

In [None]:
# Training confusion matrix
plot_confusion_matrix(pipeline_lr_vanilla, X_train, y_train)

In [None]:
# Training confusion matrix (normalized)
plot_confusion_matrix(pipeline_lr_vanilla, X_train, y_train, 
                      normalize='true')

In [None]:
# Test confusion matrix
plot_confusion_matrix(pipeline_lr_vanilla, X_test, y_test)

In [None]:
# Test confusion matrix (normalized)
plot_confusion_matrix(pipeline_lr_vanilla, X_test, y_test,
                      normalize='true')
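
The normalized plots connect directly to recall: with `normalize='true'`, each row is divided by the number of true members of that class, so the diagonal holds the per-class recall. A minimal sketch with made-up labels, using `confusion_matrix` to get the numbers without plotting:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Made-up labels for illustration only
y_true = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes;
# normalize="true" makes each row sum to 1
cm = confusion_matrix(y_true, y_pred, normalize="true")
print(cm)

# The diagonal of the row-normalized matrix equals the recall of each class
per_class_recall = recall_score(y_true, y_pred, average=None)
print(np.allclose(cm.diagonal(), per_class_recall))
```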

### Logistic Regression GridSearchCV

In [None]:
pipeline_lr_vanilla.get_params().keys()

In [None]:
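# Note: not every solver supports every penalty (for example, lbfgs does
# not support 'l1', and 'elasticnet' needs saga plus an l1_ratio).
# GridSearchCV scores those failed fits as NaN by default.
param_grid = [
    {'lr__penalty': ['l1', 'l2', 'elasticnet', 'none'],
     'lr__C': np.logspace(-4, 4, 20),
     'lr__solver': ['lbfgs', 'newton-cg', 'liblinear', 'sag', 'saga'],
     'lr__max_iter': [100, 1000, 2500, 5000]
    }
]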
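
A single dictionary crosses every penalty with every solver, including pairs that `LogisticRegression` rejects. One way to avoid spending time on fits that cannot succeed is to pass a list of dictionaries, one per group of compatible settings. The sketch below is an assumption about how that could look, not the grid used in this notebook: the name `valid_param_grid` and the `l1_ratio` values are hypothetical, and the no-penalty option is left out for brevity. `ParameterGrid` is used only to count the resulting candidates.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

C_values = np.logspace(-4, 4, 20)

valid_param_grid = [
    # lbfgs, newton-cg, and sag only support the l2 penalty (or no penalty)
    {"lr__solver": ["lbfgs", "newton-cg", "sag"],
     "lr__penalty": ["l2"],
     "lr__C": C_values},
    # liblinear and saga support both l1 and l2
    {"lr__solver": ["liblinear", "saga"],
     "lr__penalty": ["l1", "l2"],
     "lr__C": C_values},
    # elasticnet is only available with saga and needs an l1_ratio
    # (these l1_ratio values are illustrative)
    {"lr__solver": ["saga"],
     "lr__penalty": ["elasticnet"],
     "lr__l1_ratio": [0.25, 0.5, 0.75],
     "lr__C": C_values},
]

# 3*1*20 + 2*2*20 + 1*1*3*20 candidate settings
n_candidates = len(ParameterGrid(valid_param_grid))
print(n_candidates)  # 200
```

For comparison, the single-dictionary grid above expands to 4 x 20 x 5 x 4 = 1600 candidates, each fitted once per fold.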

In [None]:
grid_lr = GridSearchCV(pipeline_lr_vanilla, param_grid, cv=5, verbose=2)
grid_lr.fit(X_train, y_train)
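
Once a search has been fitted, the winning settings and their cross-validated score are available as attributes. The sketch below shows this on synthetic data with a deliberately small grid so that it runs quickly. The names `toy_pipeline`, `small_grid`, and `toy_search` are hypothetical, and `scoring="recall"` is an assumption chosen to match the project's main metric (by default the search optimizes accuracy).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data, for illustration only
X_syn, y_syn = make_classification(
    n_samples=300, n_features=8, random_state=42)

toy_pipeline = Pipeline(steps=[
    ("ss", StandardScaler()),
    ("lr", LogisticRegression(max_iter=1000)),
])

# A small grid over the regularization strength only
small_grid = {"lr__C": np.logspace(-2, 2, 5)}

# scoring="recall" makes the search pick the settings with the best recall
toy_search = GridSearchCV(toy_pipeline, small_grid, cv=5, scoring="recall")
toy_search.fit(X_syn, y_syn)

print(toy_search.best_params_)  # the best value of lr__C found
print(toy_search.best_score_)   # mean cross-validated recall for it
```

After fitting, `best_estimator_` holds the pipeline refitted on all of the training data with the best settings, so it can be used directly for predictions on the test set.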