# FLU SHOT PREDICTION

## 1. Business Understanding

### a) Introduction

In 2009, the World Health Organization (W.H.O) declared the H1N1 flu a pandemic. That year, the virus caused an estimated 284,400 deaths worldwide. The pandemic was later declared over, but the H1N1 strain has since become one of the strains that cause seasonal flu.

Seasonal flu (influenza) is an acute respiratory infection caused by influenza viruses which circulate in all parts of the world. Seasonal influenza is characterized by a sudden onset of fever, cough, headache, muscle and joint pain, severe malaise, sore throat and a runny nose. 
In temperate climates, seasonal epidemics occur mainly during winter, while in tropical regions, influenza may occur throughout the year, causing outbreaks more irregularly; [World Health Organization](https://www.who.int/health-topics/influenza-seasonal#tab=tab_1).


Most people with the flu get better on their own. But flu and its complications can be deadly, especially for people at high risk like the very young, the elderly, pregnant women, health workers and those with serious medical conditions. The seasonal flu vaccine can now help protect against the H1N1 flu and other seasonal flu viruses; [Mayo Clinic](https://www.mayoclinic.org/diseases-conditions/swine-flu/symptoms-causes/syc-20378103#:~:text=Overview,infect%20pigs%2C%20birds%20and%20humans.).


### b) Problem Statement

**What is the prevailing circumstance?**

Immunization is a global health and development success story, saving millions of lives every year. Vaccines reduce risks of getting a disease by working with the body’s natural defences to build protection; [W.H.O](https://www.who.int/health-topics/vaccines-and-immunization#tab=tab_1). 


**What problem are we trying to solve?**

Despite the benefits of vaccines, many people are still reluctant to get vaccinated or to vaccinate their children. According to the [National Center for Biotechnology Information](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4869767/#:~:text=Some%20parents%20believe%20that%20the,the%20benefits%20of%20the%20vaccines.), the reasons for vaccine hesitancy fall into four categories: religious reasons, personal beliefs or philosophical reasons, safety concerns, and a desire for more information from healthcare providers.


**How does the project aim to solve the problem?**

This project aims to develop a predictive model capable of forecasting vaccine uptake by analyzing an individual's background information and behavioral patterns.
Understanding the reasons behind vaccine hesitancy will better enable healthcare personnel to provide the education and awareness that patients' families need to make responsible immunization choices.



### c) Objectives

##### **Main Objective**

The main objective of this project is to develop a predictive model capable of forecasting seasonal flu vaccine uptake by analyzing an individual's background information and behavioral patterns.

##### **Specific Objectives**

i) Identify key factors that influence individuals' decisions regarding vaccine uptake.

ii) Analyze behavioral patterns such as past vaccination history, health beliefs, and attitudes towards healthcare to determine their impact on vaccine acceptance.

iii) Develop a predictive model capable of forecasting whether or not an individual will take the vaccine based on collected data.

iv) Evaluate the performance of the predictive model and validate its accuracy against test data.


### d) Notebook Structure

1. Business Understanding
2. Reading the Data
3. Data Preparation
4. Exploratory Data Analysis
5. Modelling
6. Evaluation
7. Conclusion
8. Recommendations

### e) Data Understanding

The data used in this project was obtained from: [DRIVENDATA](https://www.drivendata.org/competitions/66/flu-shot-learning/data/).

The dataset is divided into: submission format, test set features, training set features and training set labels.

Each row in the dataset represents one person who responded to the National 2009 H1N1 Flu Survey.

The **test set features** contain: *26708 rows* and *35 columns*

The **training set features** contain: *26707 rows* and *35 columns*

The **training set labels** contain: *26707 rows* and *2 columns* (`h1n1_vaccine` and `seasonal_vaccine`)


The features data has 36 columns when `respondent_id` is counted; once `respondent_id` is set as the index, 35 feature columns remain. For all binary variables: 0 = No; 1 = Yes.
* `respondent_id` is a unique and random identifier. 

* `h1n1_concern` - Level of concern about the H1N1 flu.
0 = Not at all concerned; 1 = Not very concerned; 2 = Somewhat concerned; 3 = Very concerned.

* `h1n1_knowledge` - Level of knowledge about H1N1 flu.
0 = No knowledge; 1 = A little knowledge; 2 = A lot of knowledge.

* `behavioral_antiviral_meds` - Has taken antiviral medications. (binary)

* `behavioral_avoidance` - Has avoided close contact with others with flu-like symptoms. (binary)

* `behavioral_face_mask` - Has bought a face mask. (binary)

* `behavioral_wash_hands` - Has frequently washed hands or used hand sanitizer. (binary)

* `behavioral_large_gatherings` - Has reduced time at large gatherings. (binary)

* `behavioral_outside_home` - Has reduced contact with people outside of own household. (binary)

* `behavioral_touch_face` - Has avoided touching eyes, nose, or mouth. (binary)

* `doctor_recc_h1n1` - H1N1 flu vaccine was recommended by doctor. (binary)

* `doctor_recc_seasonal` - Seasonal flu vaccine was recommended by doctor. (binary)

* `chronic_med_condition` - Has any of the following chronic medical conditions: asthma or another lung condition, diabetes, a heart condition, a kidney condition, sickle cell anemia or other anemia, a neurological or neuromuscular condition, a liver condition, or a weakened immune system caused by a chronic illness or by medicines taken for a chronic illness. (binary)

* `child_under_6_months` - Has regular close contact with a child under the age of six months. (binary)

* `health_worker` - Is a healthcare worker. (binary)

* `health_insurance` - Has health insurance. (binary)

* `opinion_h1n1_vacc_effective` - Respondent's opinion about H1N1 vaccine effectiveness.
1 = Not at all effective; 2 = Not very effective; 3 = Don't know; 4 = Somewhat effective; 5 = Very effective.

* `opinion_h1n1_risk` - Respondent's opinion about risk of getting sick with H1N1 flu without vaccine.
1 = Very Low; 2 = Somewhat low; 3 = Don't know; 4 = Somewhat high; 5 = Very high.

* `opinion_h1n1_sick_from_vacc` - Respondent's worry of getting sick from taking H1N1 vaccine.
1 = Not at all worried; 2 = Not very worried; 3 = Don't know; 4 = Somewhat worried; 5 = Very worried.

* `opinion_seas_vacc_effective` - Respondent's opinion about seasonal flu vaccine effectiveness.
1 = Not at all effective; 2 = Not very effective; 3 = Don't know; 4 = Somewhat effective; 5 = Very effective.

* `opinion_seas_risk` - Respondent's opinion about risk of getting sick with seasonal flu without vaccine.
1 = Very Low; 2 = Somewhat low; 3 = Don't know; 4 = Somewhat high; 5 = Very high.

* `opinion_seas_sick_from_vacc` - Respondent's worry of getting sick from taking seasonal flu vaccine.
1 = Not at all worried; 2 = Not very worried; 3 = Don't know; 4 = Somewhat worried; 5 = Very worried.

* `age_group` - Age group of respondent.

* `education` - Self-reported education level.

* `race` - Race of respondent.

* `sex` - Sex of respondent.

* `income_poverty` - Household annual income of respondent with respect to 2008 Census poverty thresholds.

* `marital_status` - Marital status of respondent.

* `rent_or_own` - Housing situation of respondent.

* `employment_status` - Employment status of respondent.

* `hhs_geo_region` - Respondent's residence using a 10-region geographic classification defined by the U.S. Dept. of Health and Human Services. Values are represented as short random character strings.

* `census_msa` - Respondent's residence within metropolitan statistical areas (MSA) as defined by the U.S. Census.

* `household_adults` - Number of other adults in household, top-coded to 3.

* `household_children` - Number of children in household, top-coded to 3.

* `employment_industry` - Type of industry respondent is employed in. Values are represented as short random character strings.

* `employment_occupation` - Type of occupation of respondent. Values are represented as short random character strings.
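
The data dictionary above suggests that the features fall into a few measurement types: binary flags, ordinal scales, top-coded counts and unordered categories. The sketch below shows one way they could be grouped for later preprocessing. The list names (`binary_cols`, `ordinal_cols`, `count_cols`, `categorical_cols`) are illustrative assumptions, and treating `age_group`, `education` and `income_poverty` as plain categories is a simplification.

```python
# Hypothetical grouping of the 35 feature columns by measurement type,
# following the data dictionary; the list names are illustrative only.
binary_cols = [
    "behavioral_antiviral_meds", "behavioral_avoidance", "behavioral_face_mask",
    "behavioral_wash_hands", "behavioral_large_gatherings",
    "behavioral_outside_home", "behavioral_touch_face",
    "doctor_recc_h1n1", "doctor_recc_seasonal", "chronic_med_condition",
    "child_under_6_months", "health_worker", "health_insurance",
]
ordinal_cols = [
    "h1n1_concern", "h1n1_knowledge",
    "opinion_h1n1_vacc_effective", "opinion_h1n1_risk",
    "opinion_h1n1_sick_from_vacc", "opinion_seas_vacc_effective",
    "opinion_seas_risk", "opinion_seas_sick_from_vacc",
]
count_cols = ["household_adults", "household_children"]
categorical_cols = [
    "age_group", "education", "race", "sex", "income_poverty",
    "marital_status", "rent_or_own", "employment_status",
    "hhs_geo_region", "census_msa",
    "employment_industry", "employment_occupation",
]

all_feature_cols = binary_cols + ordinal_cols + count_cols + categorical_cols
print(len(all_feature_cols))  # 35, matching the feature frame's column count
```

Keeping the groups in explicit lists makes it easy to apply a different imputation or encoding strategy to each type later on.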

## 2. Reading the Data 

In [1]:
# importing relevant libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Reading the training_set_features data, setting the unique identifier as the index and previewing the top 5 rows.
train_features = pd.read_csv("training_set_features.csv", index_col="respondent_id")
train_features.head()

| respondent_id | h1n1_concern | h1n1_knowledge | behavioral_antiviral_meds | behavioral_avoidance | behavioral_face_mask | behavioral_wash_hands | behavioral_large_gatherings | behavioral_outside_home | behavioral_touch_face | doctor_recc_h1n1 | ... | income_poverty | marital_status | rent_or_own | employment_status | hhs_geo_region | census_msa | household_adults | household_children | employment_industry | employment_occupation |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | ... | Below Poverty | Not Married | Own | Not in Labor Force | oxchjgsf | Non-MSA | 0.0 | 0.0 | NaN | NaN |
| 1 | 3.0 | 2.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | ... | Below Poverty | Not Married | Rent | Employed | bhuqouqj | MSA, Not Principle City | 0.0 | 0.0 | pxcmvdjn | xgwztkwe |
| 2 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | ... | <= $75,000, Above Poverty | Not Married | Own | Employed | qufhixun | MSA, Not Principle City | 2.0 | 0.0 | rucpziij | xtkaffoo |
| 3 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | Below Poverty | Not Married | Rent | Not in Labor Force | lrircsnp | MSA, Principle City | 0.0 | 0.0 | NaN | NaN |
| 4 | 2.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | ... | <= $75,000, Above Poverty | Married | Own | Employed | qufhixun | MSA, Not Principle City | 1.0 | 0.0 | wxleyezf | emcorrxb |


In [3]:
# previewing the bottom 5
train_features.tail()

| respondent_id | h1n1_concern | h1n1_knowledge | behavioral_antiviral_meds | behavioral_avoidance | behavioral_face_mask | behavioral_wash_hands | behavioral_large_gatherings | behavioral_outside_home | behavioral_touch_face | doctor_recc_h1n1 | ... | income_poverty | marital_status | rent_or_own | employment_status | hhs_geo_region | census_msa | household_adults | household_children | employment_industry | employment_occupation |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 26702 | 2.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | ... | <= $75,000, Above Poverty | Not Married | Own | Not in Labor Force | qufhixun | Non-MSA | 0.0 | 0.0 | NaN | NaN |
| 26703 | 1.0 | 2.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | <= $75,000, Above Poverty | Not Married | Rent | Employed | lzgpxyit | MSA, Principle City | 1.0 | 0.0 | fcxhlnwr | cmhcxjea |
| 26704 | 2.0 | 2.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | ... | NaN | Not Married | Own | NaN | lzgpxyit | MSA, Not Principle City | 0.0 | 0.0 | NaN | NaN |
| 26705 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | 0.0 | ... | <= $75,000, Above Poverty | Married | Rent | Employed | lrircsnp | Non-MSA | 1.0 | 0.0 | fcxhlnwr | haliazsg |
| 26706 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | <= $75,000, Above Poverty | Married | Own | Not in Labor Force | mlyzmhmf | MSA, Principle City | 1.0 | 0.0 | NaN | NaN |


In [3]:
# Checking the shape of the train_features dataframe
train_features.shape

(26707, 35)

The **train_features** dataframe has *26707 rows* and *35 columns*.

In [16]:
# Checking the information of the train_features dataframe
train_features.info()

<class 'pandas.core.frame.DataFrame'>
Index: 26707 entries, 0 to 26706
Data columns (total 35 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   h1n1_concern                 26615 non-null  float64
 1   h1n1_knowledge               26591 non-null  float64
 2   behavioral_antiviral_meds    26636 non-null  float64
 3   behavioral_avoidance         26499 non-null  float64
 4   behavioral_face_mask         26688 non-null  float64
 5   behavioral_wash_hands        26665 non-null  float64
 6   behavioral_large_gatherings  26620 non-null  float64
 7   behavioral_outside_home      26625 non-null  float64
 8   behavioral_touch_face        26579 non-null  float64
 9   doctor_recc_h1n1             24547 non-null  float64
 10  doctor_recc_seasonal         24547 non-null  float64
 11  chronic_med_condition        25736 non-null  float64
 12  child_under_6_months         25887 non-null  float64
 13  health_worker        
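
The non-null counts above differ from column to column, so it can help to rank the columns by their share of missing values. The sketch below does this on a small made-up frame; `missing_share` and `toy` are hypothetical names, and with the real data `train_features` would take the place of `toy`.

```python
import numpy as np
import pandas as pd


def missing_share(df: pd.DataFrame) -> pd.Series:
    """Return the fraction of missing values per column, largest first."""
    return df.isna().mean().sort_values(ascending=False)


# Toy stand-in for train_features (values are made up for illustration).
toy = pd.DataFrame({
    "h1n1_concern": [1.0, 3.0, np.nan, 2.0],
    "doctor_recc_h1n1": [0.0, np.nan, np.nan, 1.0],
    "sex": ["Female", "Male", "Male", "Female"],
})

shares = missing_share(toy)
print(shares)
```

A ranking like this is one possible input when deciding whether a column should be imputed or dropped.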

In [5]:
# Reading the training_set_labels data, setting the unique identifier as the index and previewing the top 5 rows.
train_labels = pd.read_csv("training_set_labels.csv", index_col="respondent_id")
train_labels.head()

| respondent_id | h1n1_vaccine | seasonal_vaccine |
| --- | --- | --- |
| 0 | 0 | 0 |
| 1 | 0 | 1 |
| 2 | 0 | 0 |
| 3 | 0 | 1 |
| 4 | 0 | 0 |


In [8]:
# previewing the bottom 5
train_labels.tail()

| respondent_id | h1n1_vaccine | seasonal_vaccine |
| --- | --- | --- |
| 26702 | 0 | 0 |
| 26703 | 0 | 0 |
| 26704 | 0 | 1 |
| 26705 | 0 | 0 |
| 26706 | 0 | 0 |


In [9]:
# Checking the shape of the train_labels dataframe
train_labels.shape

(26707, 2)

In [17]:
# Checking the information of the train_labels dataframe
train_labels.info()

<class 'pandas.core.frame.DataFrame'>
Index: 26707 entries, 0 to 26706
Data columns (total 2 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   h1n1_vaccine      26707 non-null  int64
 1   seasonal_vaccine  26707 non-null  int64
dtypes: int64(2)
memory usage: 625.9 KB
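
Because the features and the labels are both indexed by `respondent_id`, they can be combined with an index join before any analysis of the targets. The following is a minimal sketch on toy frames that mimic the first four previewed rows; `features`, `labels` and `train` are placeholder names, not variables from this notebook.

```python
import pandas as pd

# Toy stand-ins for train_features and train_labels, both indexed by respondent_id.
features = pd.DataFrame(
    {"h1n1_concern": [1.0, 3.0, 1.0, 1.0]},
    index=pd.Index([0, 1, 2, 3], name="respondent_id"),
)
labels = pd.DataFrame(
    {"h1n1_vaccine": [0, 0, 0, 0], "seasonal_vaccine": [0, 1, 0, 1]},
    index=pd.Index([0, 1, 2, 3], name="respondent_id"),
)

# The join is only meaningful if both frames describe the same respondents.
assert features.index.equals(labels.index)

train = features.join(labels)

# Share of respondents in each class of the seasonal vaccine target.
class_share = train["seasonal_vaccine"].value_counts(normalize=True)
print(train.shape)
print(class_share)
```

Checking the class shares early shows whether the target is imbalanced, which may influence the choice of evaluation metric.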


In [11]:
# Reading the test_set_features data, setting the unique identifier as the index and previewing the top 5 rows.
test_features = pd.read_csv("test_set_features.csv", index_col="respondent_id")
test_features.head()

| respondent_id | h1n1_concern | h1n1_knowledge | behavioral_antiviral_meds | behavioral_avoidance | behavioral_face_mask | behavioral_wash_hands | behavioral_large_gatherings | behavioral_outside_home | behavioral_touch_face | doctor_recc_h1n1 | ... | income_poverty | marital_status | rent_or_own | employment_status | hhs_geo_region | census_msa | household_adults | household_children | employment_industry | employment_occupation |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 26707 | 2.0 | 2.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | ... | > $75,000 | Not Married | Rent | Employed | mlyzmhmf | MSA, Not Principle City | 1.0 | 0.0 | atmlpfrs | hfxkjkmi |
| 26708 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | Below Poverty | Not Married | Rent | Employed | bhuqouqj | Non-MSA | 3.0 | 0.0 | atmlpfrs | xqwwgdyp |
| 26709 | 2.0 | 2.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | ... | > $75,000 | Married | Own | Employed | lrircsnp | Non-MSA | 1.0 | 0.0 | nduyfdeo | pvmttkik |
| 26710 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | <= $75,000, Above Poverty | Married | Own | Not in Labor Force | lrircsnp | MSA, Not Principle City | 1.0 | 0.0 | NaN | NaN |
| 26711 | 3.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | ... | <= $75,000, Above Poverty | Not Married | Own | Employed | lzgpxyit | Non-MSA | 0.0 | 1.0 | fcxhlnwr | mxkfnird |


In [12]:
# previewing the bottom 5
test_features.tail()

| respondent_id | h1n1_concern | h1n1_knowledge | behavioral_antiviral_meds | behavioral_avoidance | behavioral_face_mask | behavioral_wash_hands | behavioral_large_gatherings | behavioral_outside_home | behavioral_touch_face | doctor_recc_h1n1 | ... | income_poverty | marital_status | rent_or_own | employment_status | hhs_geo_region | census_msa | household_adults | household_children | employment_industry | employment_occupation |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 53410 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | NaN | NaN | NaN | NaN | dqpwygqj | MSA, Principle City | 1.0 | 1.0 | NaN | NaN |
| 53411 | 3.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | ... | Below Poverty | Married | Rent | Employed | qufhixun | Non-MSA | 1.0 | 3.0 | fcxhlnwr | vlluhbov |
| 53412 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | Below Poverty | Not Married | Rent | Not in Labor Force | qufhixun | MSA, Not Principle City | 1.0 | 0.0 | NaN | NaN |
| 53413 | 3.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | ... | <= $75,000, Above Poverty | Married | Own | Not in Labor Force | bhuqouqj | MSA, Not Principle City | 1.0 | 0.0 | NaN | NaN |
| 53414 | 2.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | NaN | Not Married | Rent | Employed | lrircsnp | MSA, Principle City | 0.0 | 0.0 | NaN | xtkaffoo |


In [13]:
# Checking the shape of the test_features dataframe
test_features.shape

(26708, 35)

In [18]:
# Checking the information of the test_features dataframe
test_features.info()

<class 'pandas.core.frame.DataFrame'>
Index: 26708 entries, 26707 to 53414
Data columns (total 35 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   h1n1_concern                 26623 non-null  float64
 1   h1n1_knowledge               26586 non-null  float64
 2   behavioral_antiviral_meds    26629 non-null  float64
 3   behavioral_avoidance         26495 non-null  float64
 4   behavioral_face_mask         26689 non-null  float64
 5   behavioral_wash_hands        26668 non-null  float64
 6   behavioral_large_gatherings  26636 non-null  float64
 7   behavioral_outside_home      26626 non-null  float64
 8   behavioral_touch_face        26580 non-null  float64
 9   doctor_recc_h1n1             24548 non-null  float64
 10  doctor_recc_seasonal         24548 non-null  float64
 11  chronic_med_condition        25776 non-null  float64
 12  child_under_6_months         25895 non-null  float64
 13  health_worker    
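
The training and test features appear to share the same 35 columns while covering different `respondent_id` ranges. A quick consistency check along those lines might look like the sketch below; `same_schema`, `train_toy` and `test_toy` are hypothetical names and the values are made up.

```python
import pandas as pd


def same_schema(a: pd.DataFrame, b: pd.DataFrame) -> bool:
    """True when both frames have identical column names, order and dtypes."""
    return list(a.columns) == list(b.columns) and bool((a.dtypes == b.dtypes).all())


# Toy stand-ins for train_features and test_features.
train_toy = pd.DataFrame(
    {"h1n1_concern": [1.0, 3.0], "sex": ["Female", "Male"]},
    index=pd.Index([0, 1], name="respondent_id"),
)
test_toy = pd.DataFrame(
    {"h1n1_concern": [2.0, 1.0], "sex": ["Female", "Female"]},
    index=pd.Index([26707, 26708], name="respondent_id"),
)

# Respondents should not appear in both sets.
overlap = train_toy.index.intersection(test_toy.index)
print(same_schema(train_toy, test_toy), len(overlap))
```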

## 3. Data Preparation

The goal of Data Preparation is to ensure that the data is in a suitable format for further exploration and modeling.

This step involves cleaning the data, handling missing values, encoding categorical variables, scaling numerical features, and splitting the dataset into training and testing sets. 
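
As a rough illustration of these steps, the sketch below chains imputation, scaling, one-hot encoding and a stratified split using scikit-learn. It is not the notebook's actual preprocessing: the toy data, the column lists and the choice of median and most-frequent imputation are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data with one numeric and one categorical feature (values are made up).
X = pd.DataFrame({
    "h1n1_concern": [1.0, 3.0, np.nan, 2.0, 0.0, 2.0, 1.0, 3.0],
    "sex": ["Female", "Male", np.nan, "Female", "Male", "Male", "Female", "Female"],
})
y = pd.Series([0, 1, 0, 1, 0, 1, 0, 1], name="seasonal_vaccine")

numeric_cols = ["h1n1_concern"]   # assumed grouping
categorical_cols = ["sex"]        # assumed grouping

preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

# Split first, then fit the transformers on the training part only,
# so that no statistics leak from the held-out rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
X_train_prepared = preprocess.fit_transform(X_train)
X_test_prepared = preprocess.transform(X_test)
print(X_train_prepared.shape, X_test_prepared.shape)
```

Wrapping the steps in a single transformer keeps the training and held-out data on exactly the same transformations.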

## 4. Exploratory Data Analysis

The goal of EDA is to understand the relationships between variables, identify patterns, detect outliers, and gain insights into the dataset. 

This will help in selecting relevant features for the model and understanding the distribution of the data, which can guide model selection and hyperparameter tuning.
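
One simple way to look at the relationship between a feature and a 0/1 target is to compare the target's mean across the feature's groups. The sketch below does this for a made-up frame; the column pairing (`doctor_recc_seasonal` against `seasonal_vaccine`) is only an example choice.

```python
import pandas as pd

# Toy frame combining one feature with the target (values are made up).
toy = pd.DataFrame({
    "doctor_recc_seasonal": [0.0, 1.0, 1.0, 0.0, 1.0, 0.0],
    "seasonal_vaccine": [0, 1, 1, 0, 0, 1],
})

# The mean of a 0/1 target within each group is that group's vaccination rate.
rate_by_group = toy.groupby("doctor_recc_seasonal")["seasonal_vaccine"].mean()
print(rate_by_group)
```

Large differences in rate between groups hint that the feature may be informative for the model.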

## 5. Modelling

## 6. Evaluation

## 7. Pickling
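
Pickling here presumably refers to saving the fitted model with Python's `pickle` module so it can be reloaded without retraining. The sketch below shows the round trip with a plain dictionary standing in for a model; the file name `model.pkl` is hypothetical.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in for a fitted model; any picklable Python object works the same way.
model = {"name": "placeholder_model", "coef": [0.1, -0.2]}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "model.pkl"  # hypothetical file name
    with open(path, "wb") as f:
        pickle.dump(model, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == model)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.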

## 8. 