# Swine Flu and Seasonal Flu Vaccination Prediction Project

### Stakeholder: Centers for Disease Control and Prevention (CDC) or Department of Health and Human Services (DHHS)

### Task: What is the likelihood of an individual getting vaccinated based upon survey data?


Dataset curated from the National 2009 H1N1 Flu Survey, conducted by the CDC in 2009-2010. Obtained from:
https://www.drivendata.org/competitions/66/flu-shot-learning/page/211/

The data covers the H1N1 and seasonal flu vaccination status of adults and children, as well as flu-related behaviors, opinions about flu vaccine safety and effectiveness, and socioeconomic status. We will use these features to build a model that predicts the likelihood of a person getting vaccinated. Models of this kind can help direct public health policy and the response to future pandemics.

**Load packages**

In [1]:

import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline 

from sklearn.preprocessing import OneHotEncoder, StandardScaler

from sklearn.impute import MissingIndicator, SimpleImputer, KNNImputer 

from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_selection import SelectFromModel

from sklearn.metrics import plot_confusion_matrix
from sklearn.metrics import confusion_matrix
from sklearn.metrics import plot_roc_curve

**Loading Data**

In [2]:
features_df = pd.read_csv('Data/training_set_features.csv', index_col="respondent_id")
labels_df = pd.read_csv('Data/training_set_labels.csv', index_col="respondent_id")
df = features_df.join(labels_df, how = 'inner')
# check that the rows between the features and the labels match up
# np.testing.assert_array_equal(features_df.index.values, labels_df.index.values)

# # merge features_df and labels_df 
# df = labels_df.merge(features_df, how = 'inner', on='respondent_id')

# # drop duplicate 
# df.drop_duplicates(inplace=True)
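
The join above relies on both tables sharing the same respondent_id index. A minimal sketch of checking that before joining, using small made-up frames (the `features` and `labels` toy tables below are illustrative, not the survey data):

```python
import pandas as pd

# Toy stand-ins for the features and labels tables (hypothetical values).
features = pd.DataFrame(
    {"h1n1_concern": [1.0, 3.0, 2.0]},
    index=pd.Index([0, 1, 2], name="respondent_id"),
)
labels = pd.DataFrame(
    {"h1n1_vaccine": [0, 1, 0], "seasonal_vaccine": [0, 1, 1]},
    index=pd.Index([0, 1, 2], name="respondent_id"),
)

# Respondent IDs should be unique and identical in both tables.
assert features.index.is_unique and labels.index.is_unique
assert features.index.equals(labels.index)

# With matching indexes, an inner join keeps every row.
joined = features.join(labels, how="inner")
print(joined.shape)
```

If either assertion failed on the real data, an inner join would silently drop or duplicate respondents, so it is worth running the check once after loading.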

In [3]:
df.shape

(26707, 37)

In [4]:
df.info()

# health_insurance, employment_industry and employment_occupation are each missing about 50% of their values

<class 'pandas.core.frame.DataFrame'>
Int64Index: 26707 entries, 0 to 26706
Data columns (total 37 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   h1n1_concern                 26615 non-null  float64
 1   h1n1_knowledge               26591 non-null  float64
 2   behavioral_antiviral_meds    26636 non-null  float64
 3   behavioral_avoidance         26499 non-null  float64
 4   behavioral_face_mask         26688 non-null  float64
 5   behavioral_wash_hands        26665 non-null  float64
 6   behavioral_large_gatherings  26620 non-null  float64
 7   behavioral_outside_home      26625 non-null  float64
 8   behavioral_touch_face        26579 non-null  float64
 9   doctor_recc_h1n1             24547 non-null  float64
 10  doctor_recc_seasonal         24547 non-null  float64
 11  chronic_med_condition        25736 non-null  float64
 12  child_under_6_months         25887 non-null  float64
 13  health_worker   

**Check for null values**

In [5]:
df.isna().sum()

h1n1_concern                      92
h1n1_knowledge                   116
behavioral_antiviral_meds         71
behavioral_avoidance             208
behavioral_face_mask              19
behavioral_wash_hands             42
behavioral_large_gatherings       87
behavioral_outside_home           82
behavioral_touch_face            128
doctor_recc_h1n1                2160
doctor_recc_seasonal            2160
chronic_med_condition            971
child_under_6_months             820
health_worker                    804
health_insurance               12274
opinion_h1n1_vacc_effective      391
opinion_h1n1_risk                388
opinion_h1n1_sick_from_vacc      395
opinion_seas_vacc_effective      462
opinion_seas_risk                514
opinion_seas_sick_from_vacc      537
age_group                          0
education                       1407
race                               0
sex                                0
income_poverty                  4423
marital_status                  1408
r
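
Raw counts are easier to compare as shares of the table. For example, the 12274 missing health_insurance values out of 26707 rows come to roughly 46%. A small sketch of ranking columns by missing share, on a made-up frame (the `toy` data and the 0.4 threshold are assumptions chosen for illustration):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "health_insurance": [1.0, np.nan, np.nan, 0.0],
    "h1n1_concern": [2.0, 1.0, np.nan, 3.0],
    "age_group": ["18 - 34 Years", "65+ Years", "35 - 44 Years", "65+ Years"],
})

# isna().mean() gives the fraction of missing values per column.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)

# Columns above a chosen threshold (0.4 is an arbitrary example value).
high_missing = missing_share[missing_share > 0.4].index.tolist()
print(high_missing)
```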

**Functions to clean data and drop columns**

In [6]:
def basicdropna(dataframe, column_list):
    dataframe.dropna(subset=column_list, inplace=True)

In [7]:
general_dropna = ['health_worker', 'education','income_poverty', 'marital_status', 
                    'rent_or_own', 'employment_status', 'household_adults', 
                    'household_children' ]
basicdropna(df, general_dropna)
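
Dropping rows in place makes it easy to lose track of how much data is removed. In this notebook the table goes from 26707 rows to 21863, a loss of 4844 rows (about 18%). A hedged sketch of a non-mutating variant that also reports the count (`dropna_report` is a hypothetical helper, not part of the notebook):

```python
import numpy as np
import pandas as pd

def dropna_report(dataframe, column_list):
    """Return a copy without rows missing any of column_list, plus the number of rows removed."""
    cleaned = dataframe.dropna(subset=column_list)
    return cleaned, len(dataframe) - len(cleaned)

toy = pd.DataFrame({
    "education": ["College Graduate", None, "Some College", "< 12 Years"],
    "income_poverty": ["Below Poverty", "> $75,000", None, "> $75,000"],
    "h1n1_concern": [1.0, 2.0, np.nan, np.nan],
})

# Only the listed columns are checked; the NaN in h1n1_concern on the last row is kept.
cleaned, n_dropped = dropna_report(toy, ["education", "income_poverty"])
print(n_dropped, len(cleaned))
```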

In [8]:
def columndrop(dataframe, column_list):
    dataframe.drop(column_list, axis = 1, inplace=True)

In [9]:
drop_columns =  ['employment_industry',  'employment_occupation', 'hhs_geo_region']


columndrop(df, drop_columns)

**Can we use the median value to impute missing values?**

In [10]:
df.median(numeric_only=True)

h1n1_concern                   2.0
h1n1_knowledge                 1.0
behavioral_antiviral_meds      0.0
behavioral_avoidance           1.0
behavioral_face_mask           0.0
behavioral_wash_hands          1.0
behavioral_large_gatherings    0.0
behavioral_outside_home        0.0
behavioral_touch_face          1.0
doctor_recc_h1n1               0.0
doctor_recc_seasonal           0.0
chronic_med_condition          0.0
child_under_6_months           0.0
health_worker                  0.0
health_insurance               1.0
opinion_h1n1_vacc_effective    4.0
opinion_h1n1_risk              2.0
opinion_h1n1_sick_from_vacc    2.0
opinion_seas_vacc_effective    4.0
opinion_seas_risk              2.0
opinion_seas_sick_from_vacc    2.0
household_adults               1.0
household_children             0.0
h1n1_vaccine                   0.0
seasonal_vaccine               0.0
dtype: float64

**Insight: for survey items scored 1-5, fill missing values with 3 (don't know). We cannot simply use the median, because that would assign the respondent an opinion they did not give.**


In [11]:
survey_col = ['opinion_h1n1_vacc_effective',
 'opinion_h1n1_risk', 'opinion_h1n1_sick_from_vacc',
 'opinion_seas_vacc_effective', 'opinion_seas_risk',
 'opinion_seas_sick_from_vacc']

In [12]:
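def impute_missing_data(dataframe, column_list, fillvalue):
    '''Fill missing values in place. column_list can be a single column name or a list of columns.'''
    # Wrap a single column name so we do not iterate over its characters
    if isinstance(column_list, str):
        column_list = [column_list]
    for column in column_list:
        # Assign back instead of calling fillna(inplace=True) on a column selection
        dataframe[column] = dataframe[column].fillna(fillvalue)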

In [13]:
impute_missing_data(df, survey_col, 3)

In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 21863 entries, 0 to 26706
Data columns (total 34 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   h1n1_concern                 21825 non-null  float64
 1   h1n1_knowledge               21792 non-null  float64
 2   behavioral_antiviral_meds    21815 non-null  float64
 3   behavioral_avoidance         21720 non-null  float64
 4   behavioral_face_mask         21853 non-null  float64
 5   behavioral_wash_hands        21842 non-null  float64
 6   behavioral_large_gatherings  21805 non-null  float64
 7   behavioral_outside_home      21817 non-null  float64
 8   behavioral_touch_face        21774 non-null  float64
 9   doctor_recc_h1n1             20253 non-null  float64
 10  doctor_recc_seasonal         20253 non-null  float64
 11  chronic_med_condition        21701 non-null  float64
 12  child_under_6_months         21863 non-null  float64
 13  health_worker   

In [15]:
x_df = df.copy()

survey_col = ['opinion_h1n1_vacc_effective', 'opinion_h1n1_risk', 'opinion_h1n1_sick_from_vacc',
 'opinion_seas_vacc_effective', 'opinion_seas_risk','opinion_seas_sick_from_vacc']

behavior_col = ['behavioral_antiviral_meds', 'behavioral_avoidance', 'behavioral_face_mask',
                'behavioral_wash_hands','behavioral_large_gatherings','behavioral_outside_home','behavioral_touch_face']

doc_rec = ['doctor_recc_h1n1','doctor_recc_seasonal']

impute_missing_data(x_df, survey_col, 3)
impute_missing_data(x_df, ['h1n1_concern'], 2)
impute_missing_data(x_df, ['h1n1_knowledge'], 0)
impute_missing_data(x_df, behavior_col, 0)
impute_missing_data(x_df, doc_rec, 0)
impute_missing_data(x_df, ['chronic_med_condition'], 0)
impute_missing_data(x_df, ['child_under_6_months'], 0)
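
The per-column fill values above can also be kept in one place. A minimal sketch, assuming a dictionary that maps column names to fill values and pandas' `DataFrame.fillna`, which accepts such a mapping (the `fill_values` dictionary and `toy` frame are illustrative and cover only a few of the columns):

```python
import numpy as np
import pandas as pd

# Column -> fill value, mirroring the choices made above for a few columns.
fill_values = {
    "opinion_h1n1_risk": 3,      # 1-5 opinion scale: neutral / don't know
    "h1n1_concern": 2,
    "h1n1_knowledge": 0,
    "behavioral_face_mask": 0,
    "doctor_recc_h1n1": 0,
}

toy = pd.DataFrame({
    "opinion_h1n1_risk": [1.0, np.nan, 5.0],
    "h1n1_concern": [np.nan, 0.0, 3.0],
    "h1n1_knowledge": [2.0, np.nan, 1.0],
    "behavioral_face_mask": [np.nan, 1.0, 0.0],
    "doctor_recc_h1n1": [1.0, 0.0, np.nan],
    "health_insurance": [np.nan, 1.0, 0.0],
})

# Columns not listed in the mapping (health_insurance here) are left untouched.
filled = toy.fillna(value=fill_values)
print(filled.isna().sum())
```

Keeping the mapping in one dictionary makes the imputation choices easy to review, and leaves health_insurance missing for the later KNN step.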

In [16]:
x_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 21863 entries, 0 to 26706
Data columns (total 34 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   h1n1_concern                 21863 non-null  float64
 1   h1n1_knowledge               21863 non-null  float64
 2   behavioral_antiviral_meds    21863 non-null  float64
 3   behavioral_avoidance         21863 non-null  float64
 4   behavioral_face_mask         21863 non-null  float64
 5   behavioral_wash_hands        21863 non-null  float64
 6   behavioral_large_gatherings  21863 non-null  float64
 7   behavioral_outside_home      21863 non-null  float64
 8   behavioral_touch_face        21863 non-null  float64
 9   doctor_recc_h1n1             21863 non-null  float64
 10  doctor_recc_seasonal         21863 non-null  float64
 11  chronic_med_condition        21863 non-null  float64
 12  child_under_6_months         21863 non-null  float64
 13  health_worker   

**Insight: health insurance is missing many values and is usually an important indicator for treatment. We will use KNNImputer to fill in missing values in order to keep the feature.**

In [17]:
X=x_df.drop(['h1n1_vaccine','seasonal_vaccine'], axis=1)
y=x_df[['h1n1_vaccine','seasonal_vaccine']]

# Train test split, do this before OHE

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

**OHE for object data types, then impute with KNNImputer**

In [18]:
# create OHE for objects, do this before imputer

cat_col_list = [i for i in X_train.select_dtypes(include='object').columns]

nb_list_for_ohe = ['h1n1_concern', 'h1n1_knowledge', 'opinion_h1n1_vacc_effective',
'opinion_h1n1_risk', 'opinion_h1n1_sick_from_vacc', 'opinion_seas_vacc_effective',
'opinion_seas_risk', 'opinion_seas_sick_from_vacc']

# Fits OHE on a subset of columns, then reintegrates them into the
# original dataframe. Do this after initial cleaning, before
# health insurance imputation.

ohe = OneHotEncoder(drop='first', sparse=False)

def fit_trans_ohe(X_dataframe, columns):
    dums = ohe.fit_transform(X_dataframe[columns])
    dums_df = pd.DataFrame(dums,
                       columns=ohe.get_feature_names(),
                       index=X_dataframe.index)
    df_cols_dropped = X_dataframe.drop(columns, axis = 1)
    dums_df_concated = pd.concat([df_cols_dropped, dums_df], axis=1)
    return dums_df_concated

#We should end up with a fitted ohe instance called 'ohe'

In [19]:
X_train_ohe = fit_trans_ohe(X_train, cat_col_list+nb_list_for_ohe)
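
The encoded columns are named by position: when no input names are passed, the encoder labels them `x0_...`, `x1_...` in the order the columns were given, and `drop='first'` removes the first (sorted) category of each. A small sketch that rebuilds such names from the fitted encoder's `categories_` attribute on toy data (the `toy` frame and the manual name construction are illustrative; it avoids the name-lookup method because that differs between scikit-learn versions):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

toy = pd.DataFrame({
    "age_group": ["18 - 34 Years", "65+ Years", "35 - 44 Years"],
    "sex": ["Female", "Male", "Female"],
})

# The default output is a sparse matrix, so convert it to a dense array.
encoder = OneHotEncoder(drop="first")
encoded = encoder.fit_transform(toy).toarray()

# categories_ holds the sorted categories per column; drop='first' skips index 0.
names = [
    f"x{i}_{category}"
    for i, categories in enumerate(encoder.categories_)
    for category in categories[1:]
]
print(names)
print(encoded)
```

This is why a name such as "x0_35 - 44 Years" refers to the first encoded column (age_group) and has no column for the youngest age bracket.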

**Fit an imputer for health_insurance using socio-economic features taken from the dataframe that has already been one-hot encoded**

In [20]:
    
socio_economic_column_list = ["x0_35 - 44 Years","x0_45 - 54 Years","x0_55 - 64 Years","x0_65+ Years",
                              "x1_< 12 Years","x1_College Graduate","x1_Some College","x2_Hispanic",
                              "x2_Other or Multiple","x2_White","x3_Male", "x4_> $75,000", "x4_Below Poverty",
                              "x5_Not Married", "x6_Rent", "x7_Not in Labor Force","x7_Unemployed",
                              "x8_MSA, Principle City",'x8_Non-MSA', 'health_insurance']

# Fitting an imputer for Health Insurance using socio-economic features, 
# pulling from a dataframe that has already been OneHotEncoded


soc_eco_h_i_imputer_knn = KNNImputer()

def soc_eco_KNN_imputer(imputer, dataframe, column_list):
    soc_econ_base = dataframe[column_list]
    soc_econ_imputed = pd.DataFrame(imputer.fit_transform(soc_econ_base), 
                                         columns = soc_econ_base.columns,
                                        index=soc_econ_base.index)
    remainder_df = dataframe.drop(column_list, axis = 1)
    output_df = remainder_df.join(soc_econ_imputed)
    output_df.health_insurance = output_df.health_insurance.round() 

    return output_df


In [21]:
X_train_imputed = soc_eco_KNN_imputer(soc_eco_h_i_imputer_knn, X_train_ohe, socio_economic_column_list)
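
KNNImputer fills a missing value with the mean of that feature among the nearest rows, so a binary column such as health_insurance can come back fractional. That is why the function above rounds the result. A toy sketch of the effect (the `toy` values and `n_neighbors=3` are assumptions for illustration; the notebook uses the default of 5 neighbors):

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

toy = pd.DataFrame({
    "x3_Male": [0.0, 0.0, 1.0, 1.0, 0.0],
    "x6_Rent": [0.0, 0.0, 1.0, 1.0, 0.0],
    "health_insurance": [1.0, 1.0, 0.0, 0.0, np.nan],
})

imputer = KNNImputer(n_neighbors=3)
imputed = pd.DataFrame(imputer.fit_transform(toy),
                       columns=toy.columns, index=toy.index)

# The last row's 3 nearest neighbors have health_insurance 1, 1 and 0.
raw_value = imputed.loc[4, "health_insurance"]
print(raw_value)

# Round back to a 0/1 indicator, as the notebook's helper does.
imputed["health_insurance"] = imputed["health_insurance"].round()
print(imputed.loc[4, "health_insurance"])
```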

**Now we one-hot encode the test set. trans_ohe takes the X test dataframe and a list of columns to encode, and reuses the encoder already fitted on the training set.**

In [22]:

def trans_ohe(X_dataframe, columns):
    dums = ohe.transform(X_dataframe[columns])
    dums_df = pd.DataFrame(dums,
                       columns=ohe.get_feature_names(),
                       index=X_dataframe.index)
    df_cols_dropped = X_dataframe.drop(columns, axis = 1)
    dums_df_concated = pd.concat([df_cols_dropped, dums_df], axis=1)
    return dums_df_concated

In [23]:
X_test_ohe = trans_ohe(X_test, cat_col_list+nb_list_for_ohe)

In [24]:
def imputer_transform_only(imputer, dataframe, column_list):
    soc_econ_base = dataframe[column_list]
    soc_econ_imputed = pd.DataFrame(imputer.transform(soc_econ_base), 
                                         columns = soc_econ_base.columns,
                                        index=soc_econ_base.index)
    remainder_df = dataframe.drop(column_list, axis = 1)
    output_df = remainder_df.join(soc_econ_imputed)
    output_df.health_insurance = output_df.health_insurance.round()
    
    return output_df

In [25]:
X_test_imputed = imputer_transform_only(soc_eco_h_i_imputer_knn, X_test_ohe, socio_economic_column_list)
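
Because the train and test sets pass through separate helper functions, it is worth confirming that they end up with the same columns in the same order before fitting a model. A minimal sketch (`check_same_columns` is a hypothetical helper and the frames are toy data):

```python
import pandas as pd

def check_same_columns(train_df, test_df):
    """Raise ValueError if the two frames differ in column names or order."""
    missing = [c for c in train_df.columns if c not in test_df.columns]
    extra = [c for c in test_df.columns if c not in train_df.columns]
    if missing or extra:
        raise ValueError(f"missing in test: {missing}; unexpected in test: {extra}")
    if list(train_df.columns) != list(test_df.columns):
        raise ValueError("same columns, different order")
    return True

train = pd.DataFrame({"x3_Male": [1.0], "health_insurance": [1.0]})
test = pd.DataFrame({"x3_Male": [0.0], "health_insurance": [0.0]})
print(check_same_columns(train, test))
```

On the notebook's data the equivalent call would be check_same_columns(X_train_imputed, X_test_imputed).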

### We now have a working dataset

'X_train_imputed' and 'y_train' to fit models to, 'X_test_imputed' to generate predictions, and 'y_test' to validate models with.

**Save training and test data to csv files for model creation and tuning**

In [27]:
import os
cwd = os.getcwd()
path = cwd + '/Data/X_train_imputed.csv'
X_train_imputed.to_csv(path, index=False)

In [28]:

cwd = os.getcwd()
path = cwd + '/Data/X_test_imputed.csv'
X_test_imputed.to_csv(path, index=False)

In [29]:

cwd = os.getcwd()
path = cwd + '/Data/y_train.csv'
y_train.to_csv(path, index=False)

In [30]:

cwd = os.getcwd()
path = cwd + '/Data/y_test.csv'
y_test.to_csv(path, index=False)
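
The cells above write with index=False, so respondent_id is not stored and the X and y files line up only by row position. If the IDs are wanted downstream, one option is to keep the index when writing. A sketch using a temporary directory and `pathlib` (the toy `frame` and the temporary location are assumptions for illustration):

```python
import tempfile
from pathlib import Path

import pandas as pd

frame = pd.DataFrame(
    {"health_insurance": [1.0, 0.0]},
    index=pd.Index([7, 42], name="respondent_id"),
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "X_train_imputed.csv"
    # to_csv writes the index by default; its name becomes the first column header.
    frame.to_csv(path)
    restored = pd.read_csv(path, index_col="respondent_id")

print(restored.index.tolist())
```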