<a href="https://colab.research.google.com/github/BhekiMabheka/Explore/blob/main/Flue_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Overview

**Can you predict whether people got H1N1 and seasonal flu vaccines using information they shared about their backgrounds, opinions, and health behaviors?**

In this challenge, we will take a look at vaccination, a key public health measure used to fight infectious diseases. Vaccines provide immunization for individuals, and enough immunization in a community can further reduce the spread of diseases through "herd immunity."

As of the launch of this competition, vaccines for the COVID-19 virus are still under development and not yet available. The competition will instead revisit the public health response to a different recent major respiratory disease pandemic. Beginning in spring 2009, a pandemic caused by the H1N1 influenza virus, colloquially named "swine flu," swept across the world. Researchers estimate that in the first year, it was responsible for between 151,000 and 575,000 deaths globally.

A vaccine for the H1N1 flu virus became publicly available in October 2009. In late 2009 and early 2010, the United States conducted the National 2009 H1N1 Flu Survey. This phone survey asked respondents whether they had received the H1N1 and seasonal flu vaccines, in conjunction with questions about themselves. These additional questions covered their social, economic, and demographic background, opinions on risks of illness and vaccine effectiveness, and behaviors towards mitigating transmission. A better understanding of how these characteristics are associated with personal vaccination patterns can provide guidance for future public health efforts.

In [34]:
# data analysis and wrangling
import pandas as pd
import numpy as np
import random as rnd

# visualization
import seaborn as sns
import matplotlib.pyplot as plt


# machine learning
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier

%matplotlib inline
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

In [35]:
training_df = pd.read_csv('https://raw.githubusercontent.com/BhekiMabheka/Data/master/training_set_features.csv')
print(training_df.shape)

(26707, 36)


In [3]:
test_set_df = pd.read_csv('https://raw.githubusercontent.com/BhekiMabheka/Data/master/test_set_features.csv')
test_set_df.head(2)

Unnamed: 0,respondent_id,h1n1_concern,h1n1_knowledge,behavioral_antiviral_meds,behavioral_avoidance,behavioral_face_mask,behavioral_wash_hands,behavioral_large_gatherings,behavioral_outside_home,behavioral_touch_face,doctor_recc_h1n1,doctor_recc_seasonal,chronic_med_condition,child_under_6_months,health_worker,health_insurance,opinion_h1n1_vacc_effective,opinion_h1n1_risk,opinion_h1n1_sick_from_vacc,opinion_seas_vacc_effective,opinion_seas_risk,opinion_seas_sick_from_vacc,age_group,education,race,sex,income_poverty,marital_status,rent_or_own,employment_status,hhs_geo_region,census_msa,household_adults,household_children,employment_industry,employment_occupation
0,26707,2.0,2.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,5.0,1.0,1.0,5.0,1.0,1.0,35 - 44 Years,College Graduate,Hispanic,Female,"> $75,000",Not Married,Rent,Employed,mlyzmhmf,"MSA, Not Principle City",1.0,0.0,atmlpfrs,hfxkjkmi
1,26708,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,1.0,1.0,4.0,1.0,1.0,18 - 34 Years,12 Years,White,Male,Below Poverty,Not Married,Rent,Employed,bhuqouqj,Non-MSA,3.0,0.0,atmlpfrs,xqwwgdyp


In [26]:
# Target labels
training_labels_df = pd.read_csv("https://raw.githubusercontent.com/BhekiMabheka/Data/master/training_set_labels.csv")

In [5]:
# Attach the target labels to the features, matching rows on respondent_id
training_df = pd.merge(left=training_df, right=training_labels_df, on='respondent_id')
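
When features and labels are joined on `respondent_id`, it can help to confirm that every respondent matches exactly one label row. The following is a minimal sketch on made-up data: the `features` and `labels` frames and their values are hypothetical stand-ins, not the competition files. It relies on the `validate` and `indicator` options of `pd.merge`.

```python
import pandas as pd

# Toy stand-ins for the feature and label tables (hypothetical values)
features = pd.DataFrame({"respondent_id": [0, 1, 2],
                         "h1n1_concern": [1.0, 3.0, 2.0]})
labels = pd.DataFrame({"respondent_id": [0, 1, 2],
                       "h1n1_vaccine": [0, 0, 1],
                       "seasonal_vaccine": [0, 1, 1]})

# validate="one_to_one" raises MergeError if respondent_id is duplicated on either side;
# indicator=True adds a _merge column recording where each row was found
merged = pd.merge(features, labels, on="respondent_id", how="left",
                  validate="one_to_one", indicator=True)

# Count feature rows that did not find a matching label row
unmatched = int((merged["_merge"] != "both").sum())
print(merged.shape, unmatched)
```

A non-zero `unmatched` count would indicate respondents without labels, which is worth resolving before any model is trained.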

In [27]:
training_labels_df.head()

Unnamed: 0,respondent_id,h1n1_vaccine,seasonal_vaccine
0,0,0,0
1,1,0,1
2,2,0,0
3,3,0,1
4,4,0,0


In [28]:
training_labels_df.head(1)

Unnamed: 0,respondent_id,h1n1_vaccine,seasonal_vaccine
0,0,0,0


In [29]:
# Keep both feature sets together so the same cleaning steps can be applied to each
combine = [training_df, test_set_df]


In [32]:
# Drop these categorical columns from both the training and test features
cols_to_drop = ['employment_industry', 'employment_occupation', 'age_group', 'education', 'marital_status',
                'rent_or_own', 'employment_status', 'census_msa', 'income_poverty']
training_df = training_df.drop(cols_to_drop, axis=1)
test_set_df = test_set_df.drop(cols_to_drop, axis=1)
combine = [training_df, test_set_df]

In [19]:
imputer = SimpleImputer(strategy='most_frequent', missing_values=np.nan)

In [18]:
# Keep only the columns that actually contain missing values
cols_missing_values = training_df.columns[training_df.isnull().any()]

In [24]:
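# Fit on the training set only, so the test set does not influence the fill values
imputer = imputer.fit(training_df[cols_missing_values])
# Reuse the fill values learned from the training set for the test set
test_set_df[cols_missing_values] = imputer.transform(test_set_df[cols_missing_values])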

In [23]:
training_df[cols_missing_values] = imputer.transform(training_df[cols_missing_values])
training_df.isnull().sum()

respondent_id                  0
h1n1_concern                   0
h1n1_knowledge                 0
behavioral_antiviral_meds      0
behavioral_avoidance           0
behavioral_face_mask           0
behavioral_wash_hands          0
behavioral_large_gatherings    0
behavioral_outside_home        0
behavioral_touch_face          0
doctor_recc_h1n1               0
doctor_recc_seasonal           0
chronic_med_condition          0
child_under_6_months           0
health_worker                  0
health_insurance               0
opinion_h1n1_vacc_effective    0
opinion_h1n1_risk              0
opinion_h1n1_sick_from_vacc    0
opinion_seas_vacc_effective    0
opinion_seas_risk              0
opinion_seas_sick_from_vacc    0
race                           0
sex                            0
hhs_geo_region                 0
household_adults               0
household_children             0
h1n1_vaccine                   0
seasonal_vaccine               0
dtype: int64
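
The imputation step can be illustrated on a toy example. The sketch below uses invented frames (`train` and `test` are hypothetical, not the survey data). It assumes the common convention of fitting `SimpleImputer` on the training rows and reusing the learned fill values for the test rows.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frames with one numeric and one string column (hypothetical values)
train = pd.DataFrame({"h1n1_concern": [1.0, 1.0, 2.0, np.nan],
                      "sex": ["Female", "Female", np.nan, "Male"]})
test = pd.DataFrame({"h1n1_concern": [np.nan, 3.0],
                     "sex": ["Male", np.nan]})

# most_frequent works for both numeric and string columns
imputer = SimpleImputer(strategy="most_frequent", missing_values=np.nan)
imputer.fit(train)  # fill values are learned from the training rows only

# transform returns a NumPy array, so wrap it back into a DataFrame
train_filled = pd.DataFrame(imputer.transform(train), columns=train.columns)
test_filled = pd.DataFrame(imputer.transform(test), columns=test.columns)
print(test_filled)
```

Because the fill values come from `train` alone, the missing entries in `test` receive the most frequent training value for each column rather than a value computed from the test rows.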