#PROBLEM DESCRIPTION
The goal is to predict how likely individuals are to receive their H1N1 and seasonal flu vaccines. Specifically, we have to predict two probabilities for each respondent: one for h1n1_vaccine and one for seasonal_vaccine.

Each row in the dataset represents one person who responded to the National 2009 H1N1 Flu Survey.

##Labels
There are two target variables:

- h1n1_vaccine - Whether the respondent received the H1N1 flu vaccine.
- seasonal_vaccine - Whether the respondent received the seasonal flu vaccine.

Both are binary variables: 0 = No; 1 = Yes.
Some respondents didn't get either vaccine, others got only one, and some got both. This is formulated as a multilabel (and not multiclass) problem.
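
To make the multilabel framing concrete, here is a minimal sketch. The five label rows are hypothetical toy values (not survey data), chosen only so that all four vaccine combinations appear; the names `labels`, `Y`, and `combos` are illustrative.

```python
import pandas as pd

# Hypothetical toy labels (not survey data), chosen to cover all four combinations
labels = pd.DataFrame({
    "h1n1_vaccine":     [0, 0, 1, 1, 0],
    "seasonal_vaccine": [0, 1, 0, 1, 1],
})

# Multilabel: the target is an (n_rows, 2) matrix, one binary column per vaccine
Y = labels.to_numpy()
print(Y.shape)

# Count how often each (h1n1_vaccine, seasonal_vaccine) combination occurs
combos = pd.crosstab(labels["h1n1_vaccine"], labels["seasonal_vaccine"])
print(combos)
```

A multiclass framing would instead collapse the two columns into a single four-valued label. Keeping them separate lets a model output one probability per vaccine, which is what the task asks for.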

#Basic Information -- Training dataset

In [1]:
# Mounting Drive
from google.colab import drive

drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


##Importing libraries

In [2]:
# Importing libraries
import pandas as pd
import numpy as np

##Importing Dataset

In [3]:
# Importing training set features
df_train_features = pd.read_csv('/content/drive/MyDrive/Data_Analysis/H1N1Flu-Prediction/training_set_features.csv')

# Importing training set labels
df_train_labels = pd.read_csv('/content/drive/MyDrive/Data_Analysis/H1N1Flu-Prediction/training_set_labels .csv')

##Basic Information

In [4]:
# Display feature information
display(df_train_features.head())

| | respondent_id | h1n1_concern | h1n1_knowledge | behavioral_antiviral_meds | behavioral_avoidance | behavioral_face_mask | behavioral_wash_hands | behavioral_large_gatherings | behavioral_outside_home | behavioral_touch_face | ... | income_poverty | marital_status | rent_or_own | employment_status | hhs_geo_region | census_msa | household_adults | household_children | employment_industry | employment_occupation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | Below Poverty | Not Married | Own | Not in Labor Force | oxchjgsf | Non-MSA | 0.0 | 0.0 | | |
| 1 | 1 | 3.0 | 2.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | ... | Below Poverty | Not Married | Rent | Employed | bhuqouqj | MSA, Not Principle City | 0.0 | 0.0 | pxcmvdjn | xgwztkwe |
| 2 | 2 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | <= \$75,000, Above Poverty | Not Married | Own | Employed | qufhixun | MSA, Not Principle City | 2.0 | 0.0 | rucpziij | xtkaffoo |
| 3 | 3 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | ... | Below Poverty | Not Married | Rent | Not in Labor Force | lrircsnp | MSA, Principle City | 0.0 | 0.0 | | |
| 4 | 4 | 2.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | ... | <= \$75,000, Above Poverty | Married | Own | Employed | qufhixun | MSA, Not Principle City | 1.0 | 0.0 | wxleyezf | emcorrxb |


In [5]:
# Display Label Information
display(df_train_labels.head())

| | respondent_id | h1n1_vaccine | seasonal_vaccine |
|---|---|---|---|
| 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 1 |
| 2 | 2 | 0 | 0 |
| 3 | 3 | 0 | 1 |
| 4 | 4 | 0 | 0 |


In [6]:
# Shape of Training Feature dataset
print("Training Features Shape ", df_train_features.shape)

Training Features Shape  (26707, 36)


In [7]:
df_train_features.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26707 entries, 0 to 26706
Data columns (total 36 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   respondent_id                26707 non-null  int64  
 1   h1n1_concern                 26615 non-null  float64
 2   h1n1_knowledge               26591 non-null  float64
 3   behavioral_antiviral_meds    26636 non-null  float64
 4   behavioral_avoidance         26499 non-null  float64
 5   behavioral_face_mask         26688 non-null  float64
 6   behavioral_wash_hands        26665 non-null  float64
 7   behavioral_large_gatherings  26620 non-null  float64
 8   behavioral_outside_home      26625 non-null  float64
 9   behavioral_touch_face        26579 non-null  float64
 10  doctor_recc_h1n1             24547 non-null  float64
 11  doctor_recc_seasonal         24547 non-null  float64
 12  chronic_med_condition        25736 non-null  float64
 13  child_under_6_mo
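
The non-null counts in the info() output imply missing values in most columns. As a rough sketch, the snippet below turns a few of those counts (copied from the output above) into missing-value counts and percentages. The names `n_rows`, `non_null`, and `summary` are illustrative, and only a subset of columns is included.

```python
import pandas as pd

n_rows = 26707  # from "RangeIndex: 26707 entries"

# Non-null counts copied from the info() output (subset of columns only)
non_null = pd.Series({
    "h1n1_concern": 26615,
    "h1n1_knowledge": 26591,
    "doctor_recc_h1n1": 24547,
    "doctor_recc_seasonal": 24547,
    "chronic_med_condition": 25736,
})

# Missing = total rows minus non-null entries
missing = n_rows - non_null
summary = pd.DataFrame({
    "missing": missing,
    "missing_pct": (100 * missing / n_rows).round(2),
}).sort_values("missing", ascending=False)
print(summary)
```

On the full frame, `df_train_features.isna().sum()` should give the same missing counts directly for every column.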

In [8]:
# Shape of Training labels
print("Training Labels Shape ", df_train_labels.shape)


Training Labels Shape  (26707, 3)


Double-check that the rows between the features and the labels match up.

In [9]:
np.testing.assert_array_equal(df_train_features.index.values, df_train_labels.index.values)

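The assertion ran without error, which means the two index arrays are identical; had they differed, it would have raised an AssertionError. Because both files were read with the default RangeIndex, this mainly confirms that the row counts match; comparing the respondent_id columns is a stricter check.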
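
A stricter alignment check could compare the respondent_id values themselves and then join the two frames on that column. This is a minimal sketch on toy frames built from the first three rows of the head() outputs shown earlier. The names `features`, `labels`, and `merged` are illustrative, and this is one possible approach rather than the notebook's own method.

```python
import numpy as np
import pandas as pd

# Toy frames built from the first three rows of the head() outputs
features = pd.DataFrame({
    "respondent_id": [0, 1, 2],
    "h1n1_concern": [1.0, 3.0, 1.0],
})
labels = pd.DataFrame({
    "respondent_id": [0, 1, 2],
    "h1n1_vaccine": [0, 0, 0],
    "seasonal_vaccine": [0, 1, 0],
})

# Stricter than comparing default indexes: the IDs must match row by row
np.testing.assert_array_equal(
    features["respondent_id"].to_numpy(),
    labels["respondent_id"].to_numpy(),
)

# validate="one_to_one" makes merge raise MergeError if an ID repeats on either side
merged = features.merge(labels, on="respondent_id", how="inner", validate="one_to_one")
print(merged.shape)
```

Merging on the ID also keeps features and labels together in one frame, which can be convenient for the exploration that follows.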

#Data Exploration

##Importing Libraries

In [10]:
%matplotlib inline
import matplotlib.pyplot as plt