# Phase 3 Project
Client: <br>
Authors: Tommy Phung

## Overview
With growing doubts about vaccine effectiveness, patients are questioning whether to take the Covid vaccine. We will be using the National 2009 H1N1 Flu Survey provided by the [United States National Center for Health Statistics](https://www.cdc.gov/nchs/index.htm).
We will build a model to see if we can predict whether an individual has taken the H1N1 vaccine based on several features from their survey responses.


***<p style="text-align: center;">Features</p>***


| Label                       | Description                                                                                                                                                                                                                                                                                                                                     |
|:-----------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|
| h1n1_concern                | Level of concern about the H1N1 flu                                                                                                                                                                                                                                                                                                             |
| h1n1_knowledge              | Level of knowledge about H1N1 flu                                                                                                                                                                                                                                                                                                               |
| behavioral_antiviral_meds   | Has taken antiviral medications                                                                                                                                                                                                                                                                                                                 |
| behavioral_avoidance        | Has avoided close contact with others with flu-like symptoms                                                                                                                                                                                                                                                                                     |
| behavioral_face_mask        | Has bought a face mask                                                                                                                                                                                                                                                                                                                           |
| behavioral_wash_hands       | Has frequently washed hands or used hand sanitizer                                                                                                                                                                                                                                                                                               |
| behavioral_large_gatherings | Has reduced time at large gatherings                                                                                                                                                                                                                                                                                                             |
| behavioral_outside_home     | Has reduced contact with people outside of own household                                                                                                                                                                                                                                                                                         |
| behavioral_touch_face       | Has avoided touching eyes, nose, or mouth                                                                                                                                                                                                                                                                                                       |
| doctor_recc_h1n1            | H1N1 flu vaccine was recommended by doctor                                                                                                                                                                                                                                                                                                       |
| doctor_recc_seasonal        | Seasonal flu vaccine was recommended by doctor                                                                                                                                                                                                                                                                                                   |
| chronic_med_condition       | Has a chronic medical condition |
| child_under_6_months        | Has regular close contact with a child under the age of six months                                                                  |
| health_worker               | Is a healthcare worker                                                                                                                                                                                                                                                                                                                           |
| health_insurance          | Has health insurance                                                                                                                                                                                                                                                                                                                               |
| opinion_h1n1_vacc_effective | Respondent's opinion about H1N1 vaccine effectiveness                                                                                                                                                                                                                                                                                           |
| opinion_h1n1_risk           | Respondent's opinion about risk of getting sick with H1N1 flu without vaccine                                                                                                                                                                                                                                                                   |
| opinion_h1n1_sick_from_vacc | Respondent's worry of getting sick from taking H1N1 vaccine                                                                                                                                                                                                                                                                                     |
| opinion_seas_vacc_effective | Respondent's opinion about seasonal flu vaccine effectiveness                                                                                                                                                                                                                                                                                   |
| opinion_seas_risk           | Respondent's opinion about risk of getting sick with seasonal flu without vaccine                                                                                                                                                                                                                                                               |
| opinion_seas_sick_from_vacc | Respondent's worry of getting sick from taking seasonal flu vaccine                                                                                                                                                                                                                                                                             |
| age_group                   | Age group of respondent                                                                                                                                                                                                                                                                                                                         |
| education                   | Self-reported education level                                                                                                                                                                                                                                                                                                                   |
| race                        | Race of respondent                                                                                                                                                                                                                                                                                                                               |
| sex                         | Sex of respondent                                                                                                                                                                                                                                                                                                                               |
| income_poverty              | Household annual income of respondent with respect to 2008 Census poverty thresholds                                                                                                                                                                                                                                                             |
| marital_status              | Marital status of respondent                                                                                                                                                                                                                                                                                                                     |
| rent_or_own                 | Housing situation of respondent.                                                                                                                                                                                                                                                                                                                 |
| employment_status           | Employment status of respondent.                                                                                                                                                                                                                                                                                                                 |
| hhs_geo_region              | Respondent's residence using a 10-region geographic classification defined by the U.S. Dept. of Health and Human Services |
| census_msa                  | Respondent's residence within metropolitan statistical areas (MSA) as defined by the U.S. Census                                                                                                                                                                                                                                                 |
| household_adults            | Number of other adults in household, top-coded to 3                                                                                                                                                                                                                                                                                             |
| household_children          | Number of children in household, top-coded to 3                                                                                                                                                                                                                                                                                                 |
| employment_industry         | Type of industry respondent is employed in. Values are represented as short random character strings                                                                                                                                                                                                                                             |
| employment_occupation       | Type of occupation of respondent. Values are represented as short random character strings                                                                                                                                                                                                                                                       |


***<p style="text-align: center;">Targets</p>***

| Label            | Description                                      |
|------------------|--------------------------------------------------:|
| h1n1_vaccine     | Whether respondent received H1N1 flu vaccine     |
| seasonal_vaccine | Whether respondent received seasonal flu vaccine |

## Import Dataset
The dataset is separated into a training dataset and a testing dataset. The training features and labels are loaded without modification.

In [1]:
import pandas as pd
features = pd.read_csv('Data/training_set_features.csv')
labels = pd.read_csv('Data/training_set_labels.csv')

## Business Understanding 

In order to have enough vaccines for the coming season, providers need an estimate of how many vaccines to have on hand. Using the dataset, we can create a model to predict whether a patient will take the vaccine. This model should give providers a rough estimate of how many vaccines to make for any given location.
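
One way such an estimate could be derived, as a rough sketch: average the predicted vaccination probabilities for a sample of respondents from a location and scale the result by that location's population. The function name `expected_vaccine_demand`, the probabilities, and the population size are illustrative assumptions and do not come from the survey.

```python
import numpy as np

def expected_vaccine_demand(probabilities, population_size):
    """Scale the mean predicted probability to a population-level count.

    Hypothetical helper: assumes the sampled respondents are representative
    of the location's population.
    """
    vaccination_rate = float(np.mean(probabilities))
    return int(round(vaccination_rate * population_size))

# Made-up predicted probabilities for four respondents in one location
sample_probabilities = [0.9, 0.2, 0.4, 0.5]
demand = expected_vaccine_demand(sample_probabilities, population_size=10_000)
print(demand)
```

Using probabilities rather than hard 0/1 predictions keeps the estimate from collapsing to zero when most individual probabilities fall below the decision threshold.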

## Data Exploration
1. Check for duplicates, NaNs, and missing values -> missing values found in multiple columns.
2. Check each column's data type.

In [2]:
# 1. 
# Check for duplicates
df_list = [features, labels]
if sum([dataframe.duplicated().sum() for dataframe in df_list]) > 0:
    print('Dataframes have duplicates')
else:
    print('No duplicates found')

# Check for NaN / Missing Values
if sum([dataframe.isna().sum().values.sum() for dataframe in df_list]) > 0:
    print('Dataframes have missing values')
else:
    print('Dataframes have no missing values')

# Columns to drop: respondent_id is an identifier and is not used in this model;
# any column with more than 5% of its values missing is added below
drop_list = ['respondent_id']
for name, feature in zip(features.columns, features.isna().sum()):
    if(round(feature / len(features), 2) > .05):    # Only fill missing values when at most 5% of the data is missing
        drop_list.append(name)
        
drop_list = set(drop_list)
print('{} columns are missing values and needs to be removed'.format(len(drop_list)))

No duplicates found
Dataframes have missing values
8 columns are missing values and needs to be removed
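
The per-column check above can also be written with `isna().mean()`, which returns the missing fraction of every column directly. The following is a minimal sketch on a toy frame; the helper name `columns_to_drop` and the toy values are hypothetical, and unlike the cell above it does not round the fraction before comparing it with the threshold.

```python
import numpy as np
import pandas as pd

def columns_to_drop(df, threshold=0.05, always_drop=("respondent_id",)):
    """Return sorted names of columns whose missing fraction exceeds threshold,
    plus any identifier columns listed in always_drop that exist in df."""
    missing_fraction = df.isna().mean()
    too_sparse = set(missing_fraction[missing_fraction > threshold].index)
    identifiers = set(always_drop) & set(df.columns)
    return sorted(too_sparse | identifiers)

# Toy data standing in for the survey features
toy = pd.DataFrame({
    "respondent_id": [0, 1, 2, 3],
    "h1n1_concern": [1.0, 2.0, np.nan, 3.0],       # 25% missing
    "sex": ["Male", "Female", "Female", "Male"],   # complete
})
to_drop = columns_to_drop(toy)
print(to_drop)
```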


In [3]:
# 2.
# Check columns data types
print('There are {} string data types that need to be converted. '.format(len(features.select_dtypes(object).columns)))

There are 12 string data types that need to be converted. 


1. **Solution**: Remove columns with too many missing values and fill the remaining missing values with an appropriate statistic (mode, median, or mean).
2. **Solution**: Convert the string columns using one-hot encoding or mapping.
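
The two conversion options in the second solution can be sketched as follows: an explicit mapping suits categories with a natural order, while one-hot encoding suits unordered ones. The category strings and the `education_order` mapping below are hypothetical placeholders, not the survey's actual values.

```python
import pandas as pd

# Toy frame with made-up category labels
df = pd.DataFrame({
    "education": ["low", "high", "medium"],
    "sex": ["Male", "Female", "Male"],
})

# Mapping: replace ordered categories with integers that keep the order
education_order = {"low": 0, "medium": 1, "high": 2}
encoded = df.assign(education=df["education"].map(education_order))

# One-hot encoding: one indicator column per unordered category
encoded = pd.get_dummies(encoded, columns=["sex"])
print(encoded)
```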

## Data Preparation
1. The dataframe is split into a training set and a testing set using an 80/20 split.
2. The columns with a large amount of missing values are dropped.
3. In the columns with a small amount of missing values, the missing values are replaced with the mode of the feature.
4. Dummy variables are created from the categorical data for the regression model.
5. The data is normalized so that every feature has the same scale.

In [14]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

In [None]:
## 1. Split the dataframe 
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size= .20)

In [5]:
# 2. Remove columns with too many missing values
training_prepped = X_train.drop(drop_list, axis = 1)
testing_prepped = X_test.drop(drop_list, axis = 1)

In [12]:

# 3. Find the mode of every column to fill in the remaining missing values
missing_train = {}
missing_test = {}
for columns in training_prepped:
    missing_train[columns] = training_prepped[columns].mode().values[0]
for columns in testing_prepped:
    missing_test[columns] = testing_prepped[columns].mode().values[0]

In [7]:
training_prepped.fillna(missing_train, inplace= True)
testing_prepped.fillna(missing_test, inplace = True)
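
The cells above compute the fill values for the testing set from the testing set itself. A common alternative, sketched here on toy data, is to compute the modes on the training set only and reuse them for both sets, so that no information from the testing set enters the preparation step. The frames and values below are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the prepared training and testing features
train = pd.DataFrame({
    "education": ["College", "College", np.nan, "High School"],
    "household_adults": [1.0, 1.0, 0.0, np.nan],
})
test = pd.DataFrame({
    "education": [np.nan, "High School"],
    "household_adults": [np.nan, 2.0],
})

# Modes are learned from the training set only
train_modes = {col: train[col].mode().iloc[0] for col in train.columns}

# The same fill values are applied to both sets
train_filled = train.fillna(train_modes)
test_filled = test.fillna(train_modes)
print(test_filled)
```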

In [8]:
# 4.
# Create dummy variables for the string object columns

training_prepped = pd.get_dummies(training_prepped)
testing_prepped = pd.get_dummies(testing_prepped)
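
Calling `pd.get_dummies` on each set separately can produce different columns when a category appears in only one of them, and the fitted model then rejects the testing set. A minimal sketch of one way to guard against this is to reindex the testing columns to match the training columns; the toy categories are hypothetical.

```python
import pandas as pd

train = pd.DataFrame({"sex": ["Male", "Female"], "race": ["White", "Black"]})
test = pd.DataFrame({"sex": ["Female", "Female"], "race": ["Hispanic", "White"]})

train_dummies = pd.get_dummies(train)

# Keep exactly the training columns: categories missing from the testing set
# become all-zero columns, and categories unseen in training are dropped
test_dummies = pd.get_dummies(test).reindex(columns=train_dummies.columns, fill_value=0)
print(test_dummies)
```

Dropping unseen categories discards information, so this is a convenience for a baseline rather than a complete solution.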

In [19]:
# 5. Scale the dataframe so every feature lies in the same range
scaler = MinMaxScaler()
scaled_training = pd.DataFrame(scaler.fit_transform(training_prepped), columns= training_prepped.columns)
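
The cell above scales only the training set. As a sketch of how the testing set could be handled, the scaler fitted on the training data can be reused with `transform`, and passing the original index keeps the scaled rows aligned with their labels. The toy values are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

train = pd.DataFrame({"h1n1_concern": [0.0, 1.0, 3.0],
                      "household_adults": [0.0, 2.0, 1.0]})
test = pd.DataFrame({"h1n1_concern": [2.0], "household_adults": [3.0]})

scaler = MinMaxScaler()
# Learn the min and max from the training set only
scaled_train = pd.DataFrame(scaler.fit_transform(train),
                            columns=train.columns, index=train.index)
# Reuse the fitted scaler; testing values may fall outside [0, 1]
scaled_test = pd.DataFrame(scaler.transform(test),
                           columns=test.columns, index=test.index)
print(scaled_test)
```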

## Baseline Model - Logistic regression model

In [9]:
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression(fit_intercept=False, C=1e12, solver='liblinear')
model_log = logreg.fit(training_prepped, y_train['seasonal_vaccine'])

In [10]:
import numpy as np

y_pred_train = logreg.predict(training_prepped)
y_pred_test = logreg.predict(testing_prepped)

# Count the predictions that match the true labels
correct_train = (y_pred_train == y_train['seasonal_vaccine']).sum()
correct_test = (y_pred_test == y_test['seasonal_vaccine']).sum()
print('The model predicts the training data {}% correctly.'.format(round(correct_train / len(y_train)*100,2)))
print('The model predicts the testing data {}% correctly.'.format(round(correct_test / len(y_test)* 100,2)))

The model predicts the training data 76.0% correctly.
The model predicts the testing data 75.63% correctly.
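
Accuracy is easier to interpret next to the score of a model that always predicts the most common class. The sketch below computes both on made-up labels; the helper names are hypothetical and the numbers say nothing about the survey data.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

def majority_baseline_accuracy(y_true):
    """Accuracy obtained by always predicting the most frequent class."""
    _, counts = np.unique(np.asarray(y_true), return_counts=True)
    return float(counts.max() / counts.sum())

# Made-up labels and predictions
y_true = [0, 0, 0, 1, 1]
y_pred = [0, 0, 1, 1, 1]
model_score = accuracy(y_true, y_pred)
baseline_score = majority_baseline_accuracy(y_true)
print(model_score, baseline_score)
```

A model is only adding value when its accuracy clearly exceeds the majority-class score.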


## Analysis
The baseline model predicts 76.0% of the training data and 75.63% of the testing data correctly. The two scores are close, which suggests the model is not overfitting the training set.

In [11]:
X = training_prepped.astype(float)   # cast the dummy columns to numeric for statsmodels
y = y_train['h1n1_vaccine']          # y_train comes from the train/test split above

log_model = sm.Logit(y, sm.add_constant(X))
log_results = log_model.fit()
print(log_results.summary())

In [None]:
h1n1_concern
h1n1_knowledge
behavioral_antiviral_meds
behavioral_avoidance
behavioral_face_mask 
behavioral_wash_hands
behavioral_large_gatherings
behavioral_outside_home
behavioral_touch_face
doctor_recc_h1n1
doctor_recc_seasonal
chronic_med_condition
child_under_6_months
health_worker
health_insurance
opinion_h1n1_vacc_effective
opinion_h1n1_risk
opinion_h1n1_sick_from_vacc
opinion_seas_vacc_effective
opinion_seas_risk
opinion_seas_sick_from_vacc
age_group
education
race
sex
income_poverty
marital_status
rent_or_own
employment_status
hhs_geo_region
census_msa
household_adults
household_children
employment_industry
employment_occupation