# FLU VACCINE UPTAKE PREDICTION
## Phase 3 Machine Learning Project

### 1.0 Business Understanding

**1.1 The Public Health Context:**

Seasonal influenza continues to pose significant health risks despite the availability of effective vaccines. Vaccination rates remain suboptimal in many populations, which suggests that behavioral, informational, and structural barriers limit uptake.

**1.2 Stakeholder:** 

Centers for Disease Control and Prevention (CDC) Public Health Division

**1.3 Business Problem:**

Current public health vaccination campaigns are often broad and population-wide. While this approach maximizes exposure, it can lead to inefficiencies:
* Resources may be distributed across individuals who would vaccinate regardless.
* Vaccine-hesitant populations may remain unengaged.
* Outreach efforts may lack personalization based on behavioral or demographic differences.

This raises a key strategic question:
**Can we identify individuals who are least likely to receive a flu vaccine, so that outreach efforts can be more targeted and efficient?**

By answering this question, public health agencies can allocate resources more effectively and design interventions that address specific barriers to vaccination.

**1.4 Project Objective:**

The objective of this project is to build predictive classification models that estimate the probability that an individual receives:
* The H1N1 vaccine
* The seasonal flu vaccine

These predictions can support:
* More targeted communication campaigns
* Better understanding of behavioral drivers of vaccination

### 2.0 Data Understanding
**2.1 Dataset Overview:**

The dataset used in this project originates from the 2009 National H1N1 Flu Survey. It contains survey responses collected to understand vaccination behavior during the H1N1 influenza outbreak.

The dataset is structured into separate files:
* training_set_features.csv – Contains demographic, behavioral, and health-related features.
* training_set_labels.csv – Contains the target variables indicating vaccine uptake.
* test_set_features.csv – Contains feature data for prediction (not used for model training in this analysis).

For this project, we focus on the training datasets.

**2.2 Target Variables:**

The training_set_labels.csv file includes two binary target variables:

* h1n1_vaccine

    * 0 = Did not receive H1N1 vaccine
    * 1 = Received H1N1 vaccine

* seasonal_vaccine

    * 0 = Did not receive seasonal flu vaccine
    * 1 = Received seasonal flu vaccine

Each represents whether a respondent received the respective vaccine.
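
Because both targets are coded 0/1, the mean of each column is the share of respondents who received that vaccine, which is a quick way to gauge class balance before modeling. The sketch below illustrates this on a small made-up table; `toy_labels` and its values are assumptions for illustration only, not figures from the survey.

```python
import pandas as pd

# Hypothetical stand-in for training_set_labels.csv (values are made up).
toy_labels = pd.DataFrame({
    "respondent_id": [0, 1, 2, 3, 4],
    "h1n1_vaccine": [0, 0, 0, 1, 0],
    "seasonal_vaccine": [0, 1, 1, 0, 1],
})

# For a 0/1 column, the mean equals the proportion of 1s (vaccinated respondents).
uptake_share = toy_labels[["h1n1_vaccine", "seasonal_vaccine"]].mean()
print(uptake_share)
```

Applied to the real `labels` DataFrame, the same call would show whether either target is imbalanced enough to warrant a stratified split.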

**2.3 Feature Description**

The training_set_features.csv dataset contains:
* Demographic information (age, education level, employment status)
* Health-related conditions
* Behavioral indicators (preventive measures, doctor recommendations)
* Risk perception
* Opinion-based responses

### 3.0 Data Preparation
This section outlines the preprocessing steps required to transform the raw survey data into a format suitable for machine learning modeling. These include:

* Importing necessary libraries
* Loading the feature and label datasets
* Merging datasets
* Inspecting data structure
* Identifying missing values

**3.1 Import Libraries**

In [60]:
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Sklearn tools
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.metrics import (
    confusion_matrix,
    classification_report,
    roc_curve,
    roc_auc_score
)

# Set display
pd.set_option("display.max_columns", None)


**3.2 Load Dataset**

In [23]:
features = pd.read_csv("../data/training_set_features.csv")
labels = pd.read_csv("../data/training_set_labels.csv")
test = pd.read_csv("../data/test_set_features.csv")

In [24]:
print("Features shape:", features.shape)

Features shape: (26707, 36)


In [25]:
features.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26707 entries, 0 to 26706
Data columns (total 36 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   respondent_id                26707 non-null  int64  
 1   h1n1_concern                 26615 non-null  float64
 2   h1n1_knowledge               26591 non-null  float64
 3   behavioral_antiviral_meds    26636 non-null  float64
 4   behavioral_avoidance         26499 non-null  float64
 5   behavioral_face_mask         26688 non-null  float64
 6   behavioral_wash_hands        26665 non-null  float64
 7   behavioral_large_gatherings  26620 non-null  float64
 8   behavioral_outside_home      26625 non-null  float64
 9   behavioral_touch_face        26579 non-null  float64
 10  doctor_recc_h1n1             24547 non-null  float64
 11  doctor_recc_seasonal         24547 non-null  float64
 12  chronic_med_condition        25736 non-null  float64
 13  child_under_6_mo

In [26]:
print("Labels shape:", labels.shape)

Labels shape: (26707, 3)


In [27]:
labels.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26707 entries, 0 to 26706
Data columns (total 3 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   respondent_id     26707 non-null  int64
 1   h1n1_vaccine      26707 non-null  int64
 2   seasonal_vaccine  26707 non-null  int64
dtypes: int64(3)
memory usage: 626.1 KB


**3.3 Data Merging**

Each respondent has both survey answers (features) and vaccination outcomes (labels).
Merging creates complete profiles.

In [29]:
train = pd.merge(features, labels, on='respondent_id', how='inner')
train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 26707 entries, 0 to 26706
Data columns (total 38 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   respondent_id                26707 non-null  int64  
 1   h1n1_concern                 26615 non-null  float64
 2   h1n1_knowledge               26591 non-null  float64
 3   behavioral_antiviral_meds    26636 non-null  float64
 4   behavioral_avoidance         26499 non-null  float64
 5   behavioral_face_mask         26688 non-null  float64
 6   behavioral_wash_hands        26665 non-null  float64
 7   behavioral_large_gatherings  26620 non-null  float64
 8   behavioral_outside_home      26625 non-null  float64
 9   behavioral_touch_face        26579 non-null  float64
 10  doctor_recc_h1n1             24547 non-null  float64
 11  doctor_recc_seasonal         24547 non-null  float64
 12  chronic_med_condition        25736 non-null  float64
 13  child_under_6_mo
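
An inner join silently drops any respondent that appears in only one file, so it can help to confirm that the merge is one-to-one and loses no rows. The following is a minimal sketch on made-up data; `toy_features` and `toy_labels` are hypothetical stand-ins for the real DataFrames.

```python
import pandas as pd

# Hypothetical miniature versions of the features and labels tables.
toy_features = pd.DataFrame({"respondent_id": [0, 1, 2], "h1n1_concern": [1.0, 3.0, None]})
toy_labels = pd.DataFrame({"respondent_id": [0, 1, 2], "seasonal_vaccine": [0, 1, 0]})

# validate="one_to_one" raises pandas.errors.MergeError if respondent_id
# is duplicated in either table.
toy_train = pd.merge(
    toy_features, toy_labels,
    on="respondent_id", how="inner", validate="one_to_one",
)

# Equal row counts before and after mean that no respondent was dropped.
rows_preserved = len(toy_train) == len(toy_features) == len(toy_labels)
print(toy_train.shape, rows_preserved)
```

In this notebook the merged frame keeps all 26707 rows, which is consistent with every respondent appearing in both files.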

In [67]:
#separate target variables from predictor variables
X_train = train.drop(['respondent_id', 'h1n1_vaccine', 'seasonal_vaccine'], axis=1)
y_h1n1 = train['h1n1_vaccine']
y_seasonal = train['seasonal_vaccine']

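**3.4 Missing Data Handling**

Missing values are imputed by feature type:

* Categorical features: replace missing values with the mode (most frequent category)
* Numerical features: replace missing values with the median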
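
The two strategies can be illustrated with scikit-learn's `SimpleImputer` on a tiny made-up table. The column names mirror the survey's, but the values are assumptions chosen only to make the result easy to check by hand.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one missing value per column.
toy = pd.DataFrame({
    "h1n1_concern": [1.0, 2.0, np.nan, 3.0],
    "education": pd.Series(
        ["College Graduate", np.nan, "College Graduate", "12 Years"], dtype=object
    ),
})

num_imputer = SimpleImputer(strategy="median")         # numerical: median
cat_imputer = SimpleImputer(strategy="most_frequent")  # categorical: mode

num_filled = num_imputer.fit_transform(toy[["h1n1_concern"]])
cat_filled = cat_imputer.fit_transform(toy[["education"]])

print(num_filled.ravel())  # the missing value becomes the median of 1, 2, 3
print(cat_filled.ravel())  # the missing value becomes "College Graduate"
```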

In [68]:
# Check data types
print("\nData types in our features:")
print(X_train.dtypes.value_counts())


Data types in our features:
float64    23
object     12
dtype: int64


In [None]:
# Separate numerical and categorical columns based on data type
numerical_cols = X_train.select_dtypes(include=['float64']).columns.tolist()

In [None]:
categorical_cols = X_train.select_dtypes(include=['object']).columns.tolist()

In [73]:
#check for missing values
missing_values = X_train.isnull().sum()
missing_values


h1n1_concern                      92
h1n1_knowledge                   116
behavioral_antiviral_meds         71
behavioral_avoidance             208
behavioral_face_mask              19
behavioral_wash_hands             42
behavioral_large_gatherings       87
behavioral_outside_home           82
behavioral_touch_face            128
doctor_recc_h1n1                2160
doctor_recc_seasonal            2160
chronic_med_condition            971
child_under_6_months             820
health_worker                    804
health_insurance               12274
opinion_h1n1_vacc_effective      391
opinion_h1n1_risk                388
opinion_h1n1_sick_from_vacc      395
opinion_seas_vacc_effective      462
opinion_seas_risk                514
opinion_seas_sick_from_vacc      537
age_group                          0
education                       1407
race                               0
sex                                0
income_poverty                  4423
marital_status                  1408
r
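
Raw counts are easier to compare as proportions. For example, `health_insurance` is missing for 12274 of 26707 respondents, roughly 46%. The sketch below shows one way to compute and rank missing shares; the `toy` frame is a made-up stand-in for `X_train`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the real notebook would use X_train instead.
toy = pd.DataFrame({
    "health_insurance": [1.0, np.nan, np.nan, 0.0],
    "age_group": ["18 - 34 Years", "65+ Years", "65+ Years", "35 - 44 Years"],
    "h1n1_concern": [2.0, 1.0, np.nan, 3.0],
})

# isnull() gives booleans, so the column mean is the fraction of missing entries.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

Columns with a very high missing share may deserve separate treatment, since filling in nearly half of a column replaces a large part of its signal with a constant.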

#### CREATE TWO PREPROCESSING PIPELINES

In [80]:
# PIPELINE 1: WITH SCALING (FOR LOGISTIC REGRESSION)
# Fill missing numerical values with the column median,
# then standardize so features with large ranges do not dominate
num_transformer_scaled = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

In [None]:
# Fill missing categorical values with the most frequent category,
# then one-hot encode each category into its own 0/1 indicator column
cat_transformer_scaled = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

In [82]:
preprocessor_scaled = ColumnTransformer([
    ('numerical', num_transformer_scaled, numerical_cols),
    ('categorical', cat_transformer_scaled, categorical_cols)
])

#### PIPELINE 2: WITHOUT SCALING (FOR DECISION TREE)

In [83]:
# Numerical pipeline unscaled
numerical_unscaled = Pipeline([
    ('imputer', SimpleImputer(strategy='median'))  # Just impute, no scaling
])
# Categorical pipeline, unscaled (the sparse_output argument requires scikit-learn >= 1.2)
categorical_unscaled = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

preprocessor_unscaled = ColumnTransformer([
    ('numerical', numerical_unscaled, numerical_cols),
    ('categorical', categorical_unscaled, categorical_cols)
])


### test-train-split

In [85]:
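# Focus on seasonal_vaccine as the target for this project
y = y_seasonal

# Rebuild the full feature matrix under its own name; the earlier cell stored it
# as X_train, which the split below overwrites
X = train.drop(['respondent_id', 'h1n1_vaccine', 'seasonal_vaccine'], axis=1)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y  # Maintain class distribution
)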
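
The effect of `stratify=y` can be checked by comparing the positive-class share in each split. The following is a small sketch on synthetic data; `toy_X`, `toy_y`, and the 40/60 class mix are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 40% positive class.
toy_X = pd.DataFrame({"feature": np.arange(100)})
toy_y = pd.Series([1] * 40 + [0] * 60, name="seasonal_vaccine")

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y,
    test_size=0.2,
    random_state=42,
    stratify=toy_y,  # keep the 40/60 mix in both splits
)

# With stratification, both shares match the overall 0.4.
print(y_tr.mean(), y_te.mean())
```

Without `stratify`, the two shares would generally differ by chance, which can make test metrics harder to compare against training metrics.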

### PROCESS FOR LOGISTIC REGRESSION (WITH SCALING)

In [None]:
# Fit the preprocessor on the training data only (no model is fitted at this step)
preprocessor_scaled.fit(X_train)

In [86]:
# TRANSFORM both train and test
X_train_logistic = preprocessor_scaled.transform(X_train)
X_test_logistic = preprocessor_scaled.transform(X_test)
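
Fitting on the training split and only transforming the test split means the test set can contain a category the encoder never saw. With `handle_unknown='ignore'`, such a value is encoded as all zeros instead of raising an error. The sketch below shows this on made-up category labels; `"A"`, `"B"`, and `"C"` are hypothetical values, not survey categories.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categories: "C" appears only in the test data.
train_cat = np.array([["A"], ["B"], ["A"]], dtype=object)
test_cat = np.array([["B"], ["C"]], dtype=object)

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_cat)  # learns the categories "A" and "B" only

# The default output is a sparse matrix; convert it to a dense array for display.
encoded_test = encoder.transform(test_cat).toarray()
print(encoded_test)  # "B" -> [0, 1]; unseen "C" -> [0, 0]
```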

### PROCESS FOR DECISION TREE (WITHOUT SCALING)

In [90]:
# Fit the unscaled preprocessor on the training data only
preprocessor_unscaled.fit(X_train)

In [89]:
# Transform both train and test with the fitted unscaled preprocessor
X_train_tree = preprocessor_unscaled.transform(X_train)
X_test_tree = preprocessor_unscaled.transform(X_test)

In [92]:
# Confirm that no missing values remain after preprocessing
print(f"Logistic Train missing: {pd.DataFrame(X_train_logistic).isnull().sum().sum()}")
print(f"Logistic Test missing:  {pd.DataFrame(X_test_logistic).isnull().sum().sum()}")
print(f"Tree Train missing:     {pd.DataFrame(X_train_tree).isnull().sum().sum()}")
print(f"Tree Test missing:      {pd.DataFrame(X_test_tree).isnull().sum().sum()}")

Logistic Train missing: 0
Logistic Test missing:  0
Tree Train missing:     0
Tree Test missing:      0
