# Phase 3 Project
**Client:** Providence (Most medical centers) <br>
**Authors:** Tommy Phung

## Overview
With growing doubts about vaccine effectiveness, patients are questioning whether to take the Covid vaccine. We will be using the National 2009 H1N1 Flu Survey provided by the [United States National Center for Health Statistics](https://www.cdc.gov/nchs/index.htm).
We will build models to see whether we can predict if an individual has taken the seasonal flu vaccine, based on several features from their survey responses.


***<p style="text-align: center;">Features</p>***


| Label                       | Description                                                                                                                                                                                                                                                                                                                                     |
|:-----------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|
| h1n1_concern                | Level of concern about the H1N1 flu                                                                                                                                                                                                                                                                                                             |
| h1n1_knowledge              | Level of knowledge about H1N1 flu                                                                                                                                                                                                                                                                                                               |
| behavioral_antiviral_meds   | Has taken antiviral medications                                                                                                                                                                                                                                                                                                                 |
| behavioral_avoidance        | Has avoided close contact with others with flu-like symptoms                                                                                                                                                                                                                                                                                     |
| behavioral_face_mask        | Has bought a face mask                                                                                                                                                                                                                                                                                                                           |
| behavioral_wash_hands       | Has frequently washed hands or used hand sanitizer                                                                                                                                                                                                                                                                                               |
| behavioral_large_gatherings | Has reduced time at large gatherings                                                                                                                                                                                                                                                                                                             |
| behavioral_outside_home     | Has reduced contact with people outside of own household                                                                                                                                                                                                                                                                                         |
| behavioral_touch_face       | Has avoided touching eyes, nose, or mouth                                                                                                                                                                                                                                                                                                       |
| doctor_recc_h1n1            | H1N1 flu vaccine was recommended by doctor                                                                                                                                                                                                                                                                                                       |
| doctor_recc_seasonal        | Seasonal flu vaccine was recommended by doctor                                                                                                                                                                                                                                                                                                   |
| chronic_med_condition       | Has a chronic medical condition |
| child_under_6_months        | Has regular close contact with a child under the age of six months                                                                  |
| health_worker               | Is a healthcare worker                                                                                                                                                                                                                                                                                                                           |
| health_insurance          | Has health insurance                                                                                                                                                                                                                                                                                                                               |
| opinion_h1n1_vacc_effective | Respondent's opinion about H1N1 vaccine effectiveness                                                                                                                                                                                                                                                                                           |
| opinion_h1n1_risk           | Respondent's opinion about risk of getting sick with H1N1 flu without vaccine                                                                                                                                                                                                                                                                   |
| opinion_h1n1_sick_from_vacc | Respondent's worry of getting sick from taking H1N1 vaccine                                                                                                                                                                                                                                                                                     |
| opinion_seas_vacc_effective | Respondent's opinion about seasonal flu vaccine effectiveness                                                                                                                                                                                                                                                                                   |
| opinion_seas_risk           | Respondent's opinion about risk of getting sick with seasonal flu without vaccine                                                                                                                                                                                                                                                               |
| opinion_seas_sick_from_vacc | Respondent's worry of getting sick from taking seasonal flu vaccine                                                                                                                                                                                                                                                                             |
| age_group                   | Age group of respondent                                                                                                                                                                                                                                                                                                                         |
| education                   | Self-reported education level                                                                                                                                                                                                                                                                                                                   |
| race                        | Race of respondent                                                                                                                                                                                                                                                                                                                               |
| sex                         | Sex of respondent                                                                                                                                                                                                                                                                                                                               |
| income_poverty              | Household annual income of respondent with respect to 2008 Census poverty thresholds                                                                                                                                                                                                                                                             |
| marital_status              | Marital status of respondent                                                                                                                                                                                                                                                                                                                     |
| rent_or_own                 | Housing situation of respondent.                                                                                                                                                                                                                                                                                                                 |
| employment_status           | Employment status of respondent.                                                                                                                                                                                                                                                                                                                 |
| hhs_geo_region              | Respondent's residence using a 10-region geographic classification defined by the U.S. Dept. of Health and Human Services. |
| census_msa                  | Respondent's residence within metropolitan statistical areas (MSA) as defined by the U.S. Census                                                                                                                                                                                                                                                 |
| household_adults            | Number of other adults in household, top-coded to 3                                                                                                                                                                                                                                                                                             |
| household_children          | Number of children in household, top-coded to 3                                                                                                                                                                                                                                                                                                 |
| employment_industry         | Type of industry respondent is employed in. Values are represented as short random character strings                                                                                                                                                                                                                                             |
| employment_occupation       | Type of occupation of respondent. Values are represented as short random character strings                                                                                                                                                                                                                                                       |


***<p style="text-align: center;">Targets</p>***

| Label            | Description                                      |
|------------------|--------------------------------------------------:|
| h1n1_vaccine     | Whether respondent received H1N1 flu vaccine     |
| seasonal_vaccine | Whether respondent received seasonal flu vaccine |

For this model, we will be focusing on the **seasonal flu vaccine** label from the dataset.

## Import Libraries
The majority of the libraries used are from sklearn, to format the data and create the classification models.

In [1]:
import pandas as pd    # Read the dataset into a dataframe and general adjustments to data points
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score, KFold    # Split the dataset, run grid searches and compute cross-validation scores
from sklearn.preprocessing import MinMaxScaler, StandardScaler     # Scalers to normalize the dataset
from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score    # Classification metrics
from sklearn.tree import DecisionTreeClassifier    # The basic classification model using a decision tree
from sklearn.ensemble import RandomForestClassifier    # A more complex model using random forests
import joblib    # Save and load previously fitted models
import matplotlib.pyplot as plt    # Plotting

## Import Dataset
The dataset was separated into a training dataset and a testing dataset. Target labels were only provided for the 'training' set, so we will use only that set for our modeling. <br>
We will be focusing mainly on the seasonal vaccine. 

In [23]:
features = pd.read_csv('Data/training_set_features.csv')    # Original Dataset
labels = pd.read_csv('Data/training_set_labels.csv')    # Original Dataset
target = labels['seasonal_vaccine'].copy()    # Only the seasonal vaccine column

## Business Understanding 
Vaccines are a useful way to prevent viral infections. One of the most common viral infections is influenza, commonly known as the flu. The CDC recommends that individuals receive the vaccine during flu season every year. However, since vaccines aren't mandatory, individuals may decline them because of personal beliefs or limited knowledge of the vaccine. Unused vaccines are then wasted when they could have been used at other medical centers. <br> 

**For example, 1.1 billion Covid Vaccines were estimated to be wasted due to expired vaccines and supply chain issues.** <br>

In order to give patients the vaccines as efficiently as possible, hospitals and medical centers store vaccines so they can be administered quickly whenever requested. With the current dataset, we could potentially predict whether a patient would want a vaccine based on their answers to the survey. This way, medical centers can order an adequate amount of vaccines with minimal waste. 


## Data Exploration
1. Check for duplicates, NaNs and missing values -> missing values found in multiple columns.
2. Check column data types

In [3]:
# 1. 
# Check for duplicates
df_list = [features, labels]
if sum([dataframe.duplicated().sum() for dataframe in df_list]) > 0:
    print('Dataframes have duplicates')
else:
    print('No duplicates found')

# Check for NaN / Missing Values
if sum([dataframe.isna().sum().values.sum() for dataframe in df_list]) > 0:
    print('Dataframes have missing values')
else:
    print('Dataframes have no missing values')

drop_list = ['respondent_id']   # Columns to drop: respondent id is not used in this model; columns over the missing-value cutoff are appended below
for name, feature in zip(features.columns, features.isna().sum()):
    if(round(feature / len(features), 2) > .05):    # Cutoff filling missing values if more than 5% of the data is missing
        drop_list.append(name)
        
drop_list = set(drop_list)
print('{} columns are missing values and needs to be removed'.format(len(drop_list)))

No duplicates found
Dataframes have missing values
8 columns are missing values and needs to be removed
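
The cutoff check above can also be written with `isna().mean()`, which gives the fraction of missing values per column directly. The following is a minimal sketch on a made-up toy frame; the helper name `columns_over_threshold` and the toy values are assumptions for illustration, and unlike the cell above it does not round the fraction before comparing.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the survey features (hypothetical values)
toy = pd.DataFrame({
    "respondent_id": range(10),
    "h1n1_concern": [1, 2, np.nan, 0, 3, 1, 2, 2, 1, 0],                          # 10% missing
    "health_insurance": [1, np.nan, np.nan, 0, np.nan, 1, np.nan, 1, 0, np.nan],  # 50% missing
})

def columns_over_threshold(df, threshold=0.05):
    """Return the columns whose fraction of missing values exceeds `threshold`."""
    missing_fraction = df.isna().mean()   # mean of a boolean mask = fraction missing
    return sorted(missing_fraction[missing_fraction > threshold].index)

to_drop = columns_over_threshold(toy, threshold=0.20)
print(to_drop)
```

Keeping the threshold as a parameter makes it easy to compare how many columns would be lost at different cutoffs before committing to one.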


In [4]:
# 2. 
# Check columns data types
print('There are {} string data types that need to be converted. '.format(len(features.select_dtypes(include='object').columns)))

There are 12 string data types that need to be converted. 


## Data Understanding
The dataset consists of binary and numerical entries based on respondents' answers to the survey. Eight columns are dropped: `respondent_id`, which is not used in the model, and seven columns that exceed the 5% missing-value cutoff. The remaining missing values will be filled with an appropriate value for each feature. To model the dataset, every column must be numeric, so the string columns (12 in the raw data) need to be converted to dummy variables.  

## Data Preparation
1. The dataframe is split into a training set and a testing set using an 80/20 split.
2. The columns with a large amount of missing values are dropped.
3. In the columns with a small amount of missing values, the gaps are filled with the mode of the feature.
4. Dummy variables are created for the categorical data so the models can use them.
5. The data is normalized so that every feature is on the same scale.

In [5]:
# 1. Split the dataframe
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size= .20, random_state= 420)

In [6]:
# 2. Remove columns with too many missing values
training_prepped = X_train.drop(drop_list, axis = 1)
testing_prepped = X_test.drop(drop_list, axis = 1)

In [7]:
# 3. Fill remaining missing values 
missing_train = {}
missing_test = {}

for columns in training_prepped:
    missing_train[columns] = training_prepped[columns].mode().values[0]
for columns in testing_prepped:
    missing_test[columns] = testing_prepped[columns].mode().values[0]
    
training_prepped.fillna(missing_train, inplace= True)
testing_prepped.fillna(missing_test, inplace = True)
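
The cell above computes the fill values for each split from that split's own data. A common alternative, sketched here as an assumption rather than as what the notebook did, is to learn the modes from the training split only and reuse them for the test split, so no information from the test data enters the preprocessing. The toy frames and the names `train_modes`, `train_filled` and `test_filled` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy splits with a numeric and a string column (hypothetical values)
train = pd.DataFrame({"h1n1_concern": [2.0, 2.0, np.nan, 1.0],
                      "sex": ["Female", None, "Female", "Male"]})
test = pd.DataFrame({"h1n1_concern": [np.nan, 0.0, 0.0],
                     "sex": ["Male", "Male", None]})

# Learn one fill value per column from the training split only
train_modes = {col: train[col].mode().iloc[0] for col in train.columns}

# Apply the same training-derived values to both splits
train_filled = train.fillna(train_modes)
test_filled = test.fillna(train_modes)

print(train_modes)
print(test_filled)
```

In this toy example the test split's own modes would be `0.0` and `Male`, so the two approaches fill the gaps differently.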

In [8]:
# 4. Create dummy values for the string object datatype
training_prepped = pd.get_dummies(training_prepped)
testing_prepped = pd.get_dummies(testing_prepped)
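
Calling `pd.get_dummies` on each split separately only yields matching columns when every category appears in both splits. As a hedged sketch of one way to guard against a mismatch (not something the notebook does), the test columns can be reindexed to the training layout. The toy frames below are made up for illustration.

```python
import pandas as pd

# Toy splits: the test split is missing two of the training categories
train = pd.DataFrame({"age_group": ["18 - 34 Years", "65+ Years", "35 - 44 Years"]})
test = pd.DataFrame({"age_group": ["65+ Years", "65+ Years"]})

train_dummies = pd.get_dummies(train)
test_dummies = pd.get_dummies(test)

# Reindex the test columns to the training layout: categories absent from the
# test split become all-zero columns, and categories unseen in training are dropped
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(list(test_dummies.columns))
print(list(test_aligned.columns))
```

With the columns aligned, a scaler or model fitted on the training frame receives the test features in the same order and with the same width.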

In [15]:
# 5. Scaler to normalize the dataframe
scaler = MinMaxScaler()
X_train_scaled = pd.DataFrame(scaler.fit_transform(training_prepped), columns= training_prepped.columns)
X_test_scaled = pd.DataFrame(scaler.transform(testing_prepped), columns= testing_prepped.columns)

## Baseline Model - Logistic regression model

In [10]:
from sklearn.linear_model import LogisticRegression

# A very large C effectively disables regularization
logreg = LogisticRegression(fit_intercept=False, C=1e12, solver='liblinear')
model_log = logreg.fit(X_train_scaled, y_train)    # Use the scaled frames built in the preparation step

In [12]:
import numpy as np

y_pred_train = logreg.predict(X_train_scaled)
y_pred_test = logreg.predict(X_test_scaled)
print(logreg.score(X_test_scaled, y_test))
# Count the predictions that match the true labels
correct_train = np.sum(y_pred_train == y_train)
correct_test = np.sum(y_pred_test == y_test)
print('The model predict the training data {}% correctly.'.format(round(correct_train / len(y_train)*100,2)))
print('The model predict the testing data {}% correctly.'.format(round(correct_test / len(y_test)* 100,2)))
print(cross_val_score(logreg, X_test_scaled, y_test).mean())
print(cross_val_score(logreg, X_train_scaled, y_train).mean())

0.7600149756645451
The model predict the training data 76.01% correctly.
The model predict the testing data 76.0% correctly.
0.751778588270742
0.7582494734378658


## Analysis
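The logistic regression predicts about 76% of both the training and testing data correctly, and the cross-validation means (about 75% to 76%) agree, so the baseline generalizes consistently and shows no sign of overfitting.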
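
Accuracy alone does not show how the errors split between respondents who did and did not receive the vaccine. The notebook imports `precision_score`, `recall_score` and `f1_score` without calling them; the sketch below shows how they could be used, on made-up labels rather than the survey data.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical true labels and predictions (1 = received the seasonal vaccine)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),  # of predicted vaccinated, share that are correct
    "recall": recall_score(y_true, y_pred),        # of actually vaccinated, share that are found
    "f1": f1_score(y_true, y_pred),                # harmonic mean of precision and recall
}

for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

For the ordering use case, recall on the vaccinated class relates to under-stocking and precision to over-stocking, so reporting both may be more informative than accuracy by itself.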

## Baseline Classification Model - Decision Tree

The baseline tree-based model I used was a Decision Tree. Most of the parameters were left at their defaults, and a random state was set for reproducibility. 

In [17]:
tree_clf = DecisionTreeClassifier(criterion= 'entropy', random_state= 420)
tree_model = tree_clf.fit(X_train_scaled,y_train)

tree_scores = []
tree_scores.append(tree_clf.score(X_train_scaled, y_train))
tree_scores.append(tree_clf.score(X_test_scaled, y_test))
tree_scores.append(cross_val_score(tree_clf, X_test_scaled, y_test, scoring= 'accuracy').mean())

## Complex Classification Model - Random Forest

In [21]:
forest_clf = RandomForestClassifier(criterion= 'entropy', max_features = 20, random_state= 420)
forest_model = forest_clf.fit(X_train_scaled,y_train)

forest_scores = []
forest_scores.append(forest_clf.score(X_train_scaled, y_train))
forest_scores.append(forest_clf.score(X_test_scaled, y_test))
forest_scores.append(cross_val_score(forest_clf, X_test_scaled, y_test, scoring= 'accuracy').mean())

In [33]:
 # with open('hyperparameter_model.pkl', 'wb') as f:
 #    joblib.dump(forest_grid_search, f)

In [37]:
# Model generation was used once with these parameters but was commented out due to length of processing the model and fitting it to the dataset
# param_grid = {
#                 'n_estimators' : [10, 30, 100, 300],
#                 'criterion': ['gini', 'entropy'],
#                 'max_depth' : [None, 2,3,4,5,6],
#                 'min_samples_split' : [2,5,10],
#                 'min_samples_leaf' : [1,2,3,4,5,6],
#                 'max_features' : ['sqrt', 10, 20, 30]
#              }
# forest_clf = RandomForestClassifier()
# forest_grid_search = GridSearchCV(forest_clf, param_grid, return_train_score= True, cv= 3)
# forest_grid_search.fit(X_train_scaled, y_train)

forest_grid_search.best_params_

{'criterion': 'entropy',
 'max_depth': None,
 'max_features': 10,
 'min_samples_leaf': 6,
 'min_samples_split': 5,
 'n_estimators': 300}
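
The commented-out grid above contains 4 × 2 × 6 × 3 × 6 × 4 = 3,456 parameter combinations, which at `cv=3` means 10,368 model fits before the final refit, and that explains the long run time. The sketch below shows the same `GridSearchCV` pattern on a much smaller grid; the synthetic data from `make_classification` and the reduced `small_grid` are assumptions for illustration, not the notebook's data or settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the prepared survey features (hypothetical data)
X, y = make_classification(n_samples=200, n_features=8, random_state=420)

# A deliberately small grid: 2 x 2 = 4 combinations
small_grid = {
    "n_estimators": [10, 30],
    "max_depth": [None, 3],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=420),
    small_grid,
    cv=3,                 # 3-fold cross-validation, as in the notebook
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Starting with a coarse grid like this and refining around the best values is one way to keep the search tractable before running the full grid.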

In [34]:
# with open('hyperparameter_model.pkl', 'rb') as f:
#     forest_grid_search = joblib.load(f)
grid_scores = []
grid_scores.append(forest_grid_search.score(X_train_scaled, y_train))
grid_scores.append(forest_grid_search.score(X_test_scaled, y_test))

In [35]:
for name, score in zip(['Decision Tree', 'Random Forest', 'Random Forest with Tuning']
                       ,[tree_scores, forest_scores, grid_scores]):
    
    print(f'{name} Model')
    print('Training Accuracy Score: {}'.format(score[0]))
    print('Testing Accuracy Score: {}'.format(score[1]))
    if name in ['Decision Tree', 'Random Forest']:
        print('Cross Validation Score Mean: {} \n'.format(score[2]))

Decision Tree Model
Training Accuracy Score: 0.9999063889538966
Testing Accuracy Score: 0.6542493448146761
Cross Validation Score Mean: 0.6574268716956938 

Random Forest Model
Training Accuracy Score: 0.9998595834308448
Testing Accuracy Score: 0.7529015350056159
Cross Validation Score Mean: 0.7448539536057011 

Random Forest with Tuning Model
Training Accuracy Score: 0.8367423355956003
Testing Accuracy Score: 0.7648820666417072


In [26]:
for column in X_test_scaled.columns:
    display(X_train_scaled[column].value_counts())

0.666667    8498
0.333333    6538
1.000000    3710
0.000000    2619
Name: h1n1_concern, dtype: int64

0.5    11746
1.0     7604
0.0     2015
Name: h1n1_knowledge, dtype: int64

0.0    20319
1.0     1046
Name: behavioral_antiviral_meds, dtype: int64

1.0    15531
0.0     5834
Name: behavioral_avoidance, dtype: int64

0.0    19875
1.0     1490
Name: behavioral_face_mask, dtype: int64

1.0    17620
0.0     3745
Name: behavioral_wash_hands, dtype: int64

0.0    13732
1.0     7633
Name: behavioral_large_gatherings, dtype: int64

0.0    14184
1.0     7181
Name: behavioral_outside_home, dtype: int64

1.0    14533
0.0     6832
Name: behavioral_touch_face, dtype: int64

0.0    15507
1.0     5858
Name: chronic_med_condition, dtype: int64

0.0    19613
1.0     1752
Name: child_under_6_months, dtype: int64

0.0    19002
1.0     2363
Name: health_worker, dtype: int64

0.75    9611
1.00    5765
0.50    3770
0.25    1508
0.00     711
Name: opinion_h1n1_vacc_effective, dtype: int64

0.25    8171
0.00    6546
0.75    4336
1.00    1411
0.50     901
Name: opinion_h1n1_risk, dtype: int64

0.25    7609
0.00    7225
0.75    4687
1.00    1731
0.50     113
Name: opinion_h1n1_sick_from_vacc, dtype: int64

0.75    9692
1.00    7949
0.25    1776
0.00     979
0.50     969
Name: opinion_seas_vacc_effective, dtype: int64

0.25    7572
0.75    6085
0.00    4805
1.00    2356
0.50     547
Name: opinion_seas_risk, dtype: int64

0.00    9921
0.25    6144
0.75    3852
1.00    1378
0.50      70
Name: opinion_seas_sick_from_vacc, dtype: int64

0.333333    11726
0.000000     6482
0.666667     2257
1.000000      900
Name: household_adults, dtype: int64

0.000000    15156
0.333333     2507
0.666667     2318
1.000000     1384
Name: household_children, dtype: int64

0.0    17231
1.0     4134
Name: age_group_18 - 34 Years, dtype: int64

0.0    18276
1.0     3089
Name: age_group_35 - 44 Years, dtype: int64

0.0    17181
1.0     4184
Name: age_group_45 - 54 Years, dtype: int64

0.0    16902
1.0     4463
Name: age_group_55 - 64 Years, dtype: int64

0.0    15870
1.0     5495
Name: age_group_65+ Years, dtype: int64

0.0    16725
1.0     4640
Name: education_12 Years, dtype: int64

0.0    19456
1.0     1909
Name: education_< 12 Years, dtype: int64

0.0    12198
1.0     9167
Name: education_College Graduate, dtype: int64

0.0    15716
1.0     5649
Name: education_Some College, dtype: int64

0.0    19669
1.0     1696
Name: race_Black, dtype: int64

0.0    19958
1.0     1407
Name: race_Hispanic, dtype: int64

0.0    20094
1.0     1271
Name: race_Other or Multiple, dtype: int64

1.0    16991
0.0     4374
Name: race_White, dtype: int64

1.0    12728
0.0     8637
Name: sex_Female, dtype: int64

0.0    12728
1.0     8637
Name: sex_Male, dtype: int64

1.0    11969
0.0     9396
Name: marital_status_Married, dtype: int64

0.0    11969
1.0     9396
Name: marital_status_Not Married, dtype: int64

1.0    12027
0.0     9338
Name: employment_status_Employed, dtype: int64

0.0    13202
1.0     8163
Name: employment_status_Not in Labor Force, dtype: int64

0.0    20190
1.0     1175
Name: employment_status_Unemployed, dtype: int64

0.0    19726
1.0     1639
Name: hhs_geo_region_atmpeygn, dtype: int64

0.0    19080
1.0     2285
Name: hhs_geo_region_bhuqouqj, dtype: int64

0.0    20471
1.0      894
Name: hhs_geo_region_dqpwygqj, dtype: int64

0.0    18748
1.0     2617
Name: hhs_geo_region_fpwskwrf, dtype: int64

0.0    19074
1.0     2291
Name: hhs_geo_region_kbazzjca, dtype: int64

0.0    19713
1.0     1652
Name: hhs_geo_region_lrircsnp, dtype: int64

0.0    17894
1.0     3471
Name: hhs_geo_region_lzgpxyit, dtype: int64

0.0    19532
1.0     1833
Name: hhs_geo_region_mlyzmhmf, dtype: int64

0.0    19106
1.0     2259
Name: hhs_geo_region_oxchjgsf, dtype: int64

0.0    18941
1.0     2424
Name: hhs_geo_region_qufhixun, dtype: int64

0.0    12118
1.0     9247
Name: census_msa_MSA, Not Principle  City, dtype: int64

0.0    15015
1.0     6350
Name: census_msa_MSA, Principle City, dtype: int64

0.0    15597
1.0     5768
Name: census_msa_Non-MSA, dtype: int64

In [None]:
# y_test is already the seasonal_vaccine Series, so it is passed directly
print('Average:', cross_val_score(tree_clf, X_test_scaled, y_test, scoring= 'accuracy').mean())

In [None]:
cross_val_score(forest_clf, X_test_scaled, y_test, scoring= 'accuracy').mean()

In [None]:
cross_val_score(forest_grid_search, X_test_scaled, y_test, scoring= 'accuracy').mean()