# FLU VACCINE UPTAKE PREDICTION
## Phase 3 Machine Learning Project

### 1.0 Business Understanding

**1.1 The Public Health Context:**

Seasonal influenza continues to pose significant health risks even though effective vaccines are available. Vaccination rates remain suboptimal in many populations, which points to behavioral, informational, and structural barriers to uptake.

**1.2 Stakeholder:** 

Centers for Disease Control and Prevention (CDC) Public Health Division

**1.3 Business Problem:**

Current public health vaccination campaigns are often broad and population-wide. While this approach maximizes exposure, it can lead to inefficiencies:
* Resources may be distributed across individuals who would vaccinate regardless.
* Vaccine-hesitant populations may remain unengaged.
* Outreach efforts may lack personalization based on behavioral or demographic differences.

This raises a key strategic question:
**Can we identify individuals who are least likely to receive a flu vaccine, so that outreach efforts can be more targeted and efficient?**

By answering this question, public health agencies can allocate resources more effectively and design interventions that address specific barriers to vaccination.

**1.4 Project Objective:**

The objective of this project is to build predictive classification models that estimate the probability that an individual receives:
* The H1N1 vaccine
* The seasonal flu vaccine

These predictions can support:
* More targeted communication campaigns
* Better understanding of behavioral drivers of vaccination
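
As a rough illustration of this objective, the sketch below fits one logistic regression per target and ranks respondents by predicted probability. The data is synthetic and the names `feature_a` and `feature_b` are hypothetical placeholders, not survey columns. Fitting the two targets independently is an assumption made for simplicity, not necessarily the approach used later in this project.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: NOT the survey data, just two made-up numeric features.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "feature_a": rng.normal(size=200),
    "feature_b": rng.normal(size=200),
})

# Made-up binary targets that loosely depend on the features.
y = pd.DataFrame({
    "h1n1_vaccine": (X["feature_a"] + rng.normal(size=200) > 1).astype(int),
    "seasonal_vaccine": (X["feature_b"] + rng.normal(size=200) > 0).astype(int),
})

# Fit one independent classifier per target and keep P(vaccinated = 1).
probabilities = pd.DataFrame(index=X.index)
for target in y.columns:
    model = LogisticRegression()
    model.fit(X, y[target])
    probabilities[target] = model.predict_proba(X)[:, 1]

# Respondents with the lowest predicted probability are outreach candidates.
least_likely = probabilities.sort_values("seasonal_vaccine").head(10)
print(least_likely)
```

Sorting by predicted probability rather than by a hard 0/1 prediction keeps the ranking usable when outreach budgets only cover a fixed number of people.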

### 2.0 Data Understanding
**2.1 Dataset Overview:**

The dataset used in this project originates from the National 2009 H1N1 Flu Survey (NHFS). It contains survey responses collected to understand vaccination behavior during the H1N1 influenza outbreak.

The dataset is structured into separate files:
* training_set_features.csv – Contains demographic, behavioral, and health-related features.
* training_set_labels.csv – Contains the target variables indicating vaccine uptake.
* test_set_features.csv – Contains feature data for prediction (not used for model training in this analysis).

For this project, we focus on the training datasets.

**2.2 Target Variables:**

The training_set_labels.csv file includes two binary target variables:

* h1n1_vaccine

    * 0 = Did not receive H1N1 vaccine
    * 1 = Received H1N1 vaccine

* seasonal_vaccine

    * 0 = Did not receive seasonal flu vaccine
    * 1 = Received seasonal flu vaccine

Each represents whether a respondent received the respective vaccine.
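
Because both targets are coded 0/1, the mean of each column gives the share of vaccinated respondents, and a cross-tabulation shows how the two targets co-occur. The sketch below uses a tiny made-up table with the same column names as the labels file; the values are illustrative only.

```python
import pandas as pd

# Tiny illustrative labels table (made-up values, real column names).
toy_labels = pd.DataFrame({
    "respondent_id": [0, 1, 2, 3, 4],
    "h1n1_vaccine": [0, 0, 0, 1, 0],
    "seasonal_vaccine": [1, 0, 1, 0, 0],
})

target_cols = ["h1n1_vaccine", "seasonal_vaccine"]

# Share of respondents with value 1 in each target (mean of a 0/1 column).
positive_rate = toy_labels[target_cols].mean()
print(positive_rate)

# Joint distribution of the two targets.
joint = pd.crosstab(toy_labels["h1n1_vaccine"], toy_labels["seasonal_vaccine"])
print(joint)
```

Running the same two calls on the real `labels` DataFrame would show whether either target is imbalanced, which affects the choice of evaluation metric.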

**2.3 Feature Description:**

The training_set_features.csv dataset contains:
* Demographic information (age, education level, employment status)
* Health-related conditions
* Behavioral indicators (preventive measures, doctor recommendations)
* Risk perception
* Opinion-based responses
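
One way to work with these groups in code is to split columns by dtype and by name prefix. The sketch below uses a toy frame. The numeric column names are taken from the dataset, while `example_category` is a hypothetical text-valued column added for illustration.

```python
import pandas as pd

# Toy frame with a handful of columns; the values are made up.
toy_features = pd.DataFrame({
    "respondent_id": [0, 1, 2],
    "h1n1_concern": [1.0, 3.0, 2.0],
    "behavioral_face_mask": [0.0, 1.0, 0.0],
    "behavioral_wash_hands": [1.0, 1.0, 0.0],
    "doctor_recc_h1n1": [0.0, None, 1.0],
    "example_category": ["a", "b", "a"],  # hypothetical text-valued column
})

# Numeric columns, excluding the identifier, which carries no predictive meaning.
numeric_cols = (
    toy_features.select_dtypes(include="number")
    .columns.drop("respondent_id")
    .tolist()
)

# Non-numeric columns, which would need encoding before modeling.
categorical_cols = toy_features.select_dtypes(exclude="number").columns.tolist()

# Behavioral indicators share a common name prefix.
behavioral_cols = [c for c in toy_features.columns if c.startswith("behavioral_")]

print(numeric_cols, categorical_cols, behavioral_cols)
```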

### 3.0 Data Preparation
This section outlines the preprocessing steps required to transform the raw survey data into a format suitable for machine learning modeling. The steps covered are:

* Importing necessary libraries
* Loading the feature and label datasets
* Merging datasets
* Inspecting data structure
* Identifying missing values
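
The last step, identifying missing values, can be sketched as a per-column summary of missing counts and percentages. The frame below is a made-up example that reuses a few column names from the dataset; the same pattern should apply to the real DataFrame.

```python
import numpy as np
import pandas as pd

# Made-up values; NaN marks an unanswered survey question.
toy = pd.DataFrame({
    "h1n1_concern": [1.0, np.nan, 2.0, 3.0],
    "doctor_recc_h1n1": [np.nan, np.nan, 1.0, 0.0],
    "behavioral_face_mask": [0.0, 0.0, 1.0, 0.0],
})

# Count and percentage of missing entries per column, worst first.
missing = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "pct_missing": toy.isna().mean() * 100,
}).sort_values("n_missing", ascending=False)

print(missing)
```

A summary like this makes it easier to decide, column by column, between imputing values and dropping the column.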

**3.1 Import Libraries**

In [None]:
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Sklearn tools
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import (
    confusion_matrix,
    classification_report,
    roc_curve,
    roc_auc_score
)

# Show every column when displaying wide DataFrames
pd.set_option("display.max_columns", None)


**3.2 Load Dataset**

In [6]:
features = pd.read_csv("../data/training_set_features.csv")
labels = pd.read_csv("../data/training_set_labels.csv")

In [8]:
print("Features shape:", features.shape)

Features shape: (26707, 36)


In [12]:
features.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26707 entries, 0 to 26706
Data columns (total 36 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   respondent_id                26707 non-null  int64  
 1   h1n1_concern                 26615 non-null  float64
 2   h1n1_knowledge               26591 non-null  float64
 3   behavioral_antiviral_meds    26636 non-null  float64
 4   behavioral_avoidance         26499 non-null  float64
 5   behavioral_face_mask         26688 non-null  float64
 6   behavioral_wash_hands        26665 non-null  float64
 7   behavioral_large_gatherings  26620 non-null  float64
 8   behavioral_outside_home      26625 non-null  float64
 9   behavioral_touch_face        26579 non-null  float64
 10  doctor_recc_h1n1             24547 non-null  float64
 11  doctor_recc_seasonal         24547 non-null  float64
 12  chronic_med_condition        25736 non-null  float64
 13  child_under_6_mo

In [9]:
print("Labels shape:", labels.shape)

Labels shape: (26707, 3)


In [13]:
labels.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26707 entries, 0 to 26706
Data columns (total 3 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   respondent_id     26707 non-null  int64
 1   h1n1_vaccine      26707 non-null  int64
 2   seasonal_vaccine  26707 non-null  int64
dtypes: int64(3)
memory usage: 626.1 KB


**3.3 Data Merging**

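Each respondent has both survey answers (features) and vaccination outcomes (labels).
Merging the two tables on respondent_id with an inner join produces one complete record per respondent: 26,707 rows and 38 columns (the 36 columns of the features file plus the 2 target columns).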
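
An inner join silently drops any respondent that appears in only one table, so it can be worth confirming that `respondent_id` is a unique key on both sides and that every ID matches. The sketch below shows one way to check this on toy data, using the `validate` and `indicator` parameters of `pd.merge`. The outer join is used here only for the check and is an assumption of this sketch, not part of the project's pipeline.

```python
import pandas as pd

# Toy tables with made-up values and the dataset's column names.
toy_features = pd.DataFrame({
    "respondent_id": [0, 1, 2],
    "h1n1_concern": [1.0, 3.0, 2.0],
})
toy_labels = pd.DataFrame({
    "respondent_id": [0, 1, 2],
    "h1n1_vaccine": [0, 1, 0],
    "seasonal_vaccine": [1, 1, 0],
})

# The key should identify exactly one row in each table.
assert toy_features["respondent_id"].is_unique
assert toy_labels["respondent_id"].is_unique

# validate raises MergeError if the key is duplicated on either side;
# indicator adds a "_merge" column describing where each row came from.
checked = pd.merge(
    toy_features, toy_labels,
    on="respondent_id", how="outer",
    validate="one_to_one", indicator=True,
)

# "_merge" is "both" for matched rows; anything else signals an unmatched ID.
unmatched = checked[checked["_merge"] != "both"]
print(len(unmatched))
```

If no rows are unmatched, the inner join used in this notebook loses no respondents.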

In [15]:
data = pd.merge(features, labels, on='respondent_id', how='inner')
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 26707 entries, 0 to 26706
Data columns (total 38 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   respondent_id                26707 non-null  int64  
 1   h1n1_concern                 26615 non-null  float64
 2   h1n1_knowledge               26591 non-null  float64
 3   behavioral_antiviral_meds    26636 non-null  float64
 4   behavioral_avoidance         26499 non-null  float64
 5   behavioral_face_mask         26688 non-null  float64
 6   behavioral_wash_hands        26665 non-null  float64
 7   behavioral_large_gatherings  26620 non-null  float64
 8   behavioral_outside_home      26625 non-null  float64
 9   behavioral_touch_face        26579 non-null  float64
 10  doctor_recc_h1n1             24547 non-null  float64
 11  doctor_recc_seasonal         24547 non-null  float64
 12  chronic_med_condition        25736 non-null  float64
 13  child_under_6_mo