# Who Gets Vaccinated? Machine Learning Predictions of H1N1 and Seasonal Flu Vaccine Uptake





## 1. Business Understanding
The goal of this project is to develop a machine learning model that predicts how likely an individual was to receive the 2009 H1N1 vaccine and the seasonal flu vaccine, based on survey responses. Predictions are returned as probabilities rather than binary classes, which makes this a multi-label probabilistic classification task with two separate target variables.

By accurately predicting vaccine uptake probabilities, public health stakeholders can better understand the factors influencing vaccine behavior and potentially design targeted interventions to increase vaccination rates among underrepresented or hesitant populations.

A successful model will help:

1. Predict and profile vaccine-hesitant populations.

2. Guide more effective public health messaging.

3. Assist in real-time policy decisions during future outbreaks (e.g., COVID-19, RSV, Monkeypox).

4. Serve as a foundation for equity-based healthcare interventions.




### Stakeholders
The primary stakeholders for this model include:

1. Public health officials and policymakers, such as those working in the CDC or WHO, who need to identify populations at higher risk of remaining unvaccinated.

2. Healthcare providers and outreach programs, who can use this information to target specific groups (e.g., those with low health literacy or without insurance).

3. Researchers in epidemiology and behavioral science, who seek to understand the behavioral and socio-demographic factors influencing vaccine hesitancy.




### Business Goals
1. Maximize predictive performance (measured here by ROC AUC) for both H1N1 and seasonal flu vaccine uptake.

2. Identify key behavioral, attitudinal, and demographic drivers of vaccine behavior.

3. Inform public health strategy by pinpointing populations less likely to receive vaccines.

4. Enable resource prioritization, such as focused education or mobile clinics for high-risk groups.

### Key Questions
1. What individual-level characteristics (e.g., age, health status, beliefs, behavior) predict whether someone received the H1N1 or seasonal flu vaccine?

2. Are there groups with disproportionately low vaccine uptake?

3. How do recommendations from healthcare professionals influence vaccine behavior?

4. Can this model help anticipate future vaccine hesitancy for other campaigns?



## 2. Data Understanding

### Dataset Overview
The data originates from the National 2009 H1N1 Flu Survey, conducted in the United States to understand vaccination behavior during the H1N1 pandemic. The dataset contains respondent-level survey responses covering demographics, health status, behavioral practices, opinions about vaccines, and employment information.

Source: National 2009 H1N1 Flu Survey (N ≈ 26,000)

### Vaccine Dataset

| File Name               | Description                                          |
|-------------------------|------------------------------------------------------|
| training_set_features.csv | Input features (n ≈ 26,000 respondents)            |
| training_set_labels.csv   | Target labels for H1N1 and seasonal flu vaccine uptake |
| test_set_features.csv     | Features for test set                              |



### Target Variables
The task is to predict two separate binary variables:

| Target Variable  | Value | Description                              |
|------------------|-------|------------------------------------------|
| h1n1_vaccine     | 1     | Received the H1N1 vaccine                |
| h1n1_vaccine     | 0     | Did not receive the H1N1 vaccine         |
| seasonal_vaccine | 1     | Received the seasonal flu vaccine        |
| seasonal_vaccine | 0     | Did not receive the seasonal flu vaccine |


These are modeled separately as a multi-label problem. Some individuals received both vaccines, others only one, and many received neither.
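
To see how the two targets overlap, a joint frequency table is a quick check. The sketch below uses a small made-up label frame (the real labels come from `training_set_labels.csv`), so the numbers it prints are illustrative only.

```python
import pandas as pd

# Hypothetical toy labels; the real ones come from training_set_labels.csv.
labels = pd.DataFrame({
    "h1n1_vaccine":     [0, 0, 1, 1, 0, 0, 1, 0],
    "seasonal_vaccine": [0, 1, 1, 0, 0, 1, 1, 0],
})

# Share of respondents in each (h1n1, seasonal) combination.
joint = pd.crosstab(
    labels["h1n1_vaccine"], labels["seasonal_vaccine"], normalize="all"
)
print(joint)

# Pearson correlation of two binary columns (the phi coefficient).
phi = labels["h1n1_vaccine"].corr(labels["seasonal_vaccine"])
print(round(phi, 3))
```

A clearly non-zero correlation would suggest the two labels share information, even though each is predicted by its own model.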

## 3. Data Preparation and Cleaning

In [27]:
# Load Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import warnings
warnings.filterwarnings("ignore")


In [5]:
# Load the dataset
train_features = pd.read_csv('training_set_features.csv')
train_labels = pd.read_csv("training_set_labels.csv")
test_features = pd.read_csv("test_set_features.csv")


In [6]:
# Initial exploration
print(train_features.shape)
train_features.info()
train_features.isnull().sum().sort_values(ascending=False).head(10)

(26707, 36)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26707 entries, 0 to 26706
Data columns (total 36 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   respondent_id                26707 non-null  int64  
 1   h1n1_concern                 26615 non-null  float64
 2   h1n1_knowledge               26591 non-null  float64
 3   behavioral_antiviral_meds    26636 non-null  float64
 4   behavioral_avoidance         26499 non-null  float64
 5   behavioral_face_mask         26688 non-null  float64
 6   behavioral_wash_hands        26665 non-null  float64
 7   behavioral_large_gatherings  26620 non-null  float64
 8   behavioral_outside_home      26625 non-null  float64
 9   behavioral_touch_face        26579 non-null  float64
 10  doctor_recc_h1n1             24547 non-null  float64
 11  doctor_recc_seasonal         24547 non-null  float64
 12  chronic_med_condition        25736 non-null  float64
 13  chil

employment_occupation    13470
employment_industry      13330
health_insurance         12274
income_poverty            4423
doctor_recc_h1n1          2160
doctor_recc_seasonal      2160
rent_or_own               2042
employment_status         1463
marital_status            1408
education                 1407
dtype: int64
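
Raw counts are easier to judge as shares of the rows: in the output above, `employment_occupation` is missing in 13,470 of 26,707 rows, roughly half. A minimal sketch of that calculation, using a made-up toy frame in place of `train_features`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for train_features.
toy = pd.DataFrame({
    "health_insurance": [1.0, np.nan, np.nan, 0.0],
    "education": ["College Graduate", None, "12 Years", "12 Years"],
    "h1n1_concern": [1.0, 2.0, 0.0, 3.0],
})

# Fraction of missing values per column, largest first.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```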

In [15]:
# Merge labels
data = train_features.merge(train_labels, on="respondent_id")

# check merged data
data.info()

# Check for missing values
data.isnull().sum().sort_values(ascending=False).head(10)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26707 entries, 0 to 26706
Data columns (total 38 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   respondent_id                26707 non-null  int64  
 1   h1n1_concern                 26615 non-null  float64
 2   h1n1_knowledge               26591 non-null  float64
 3   behavioral_antiviral_meds    26636 non-null  float64
 4   behavioral_avoidance         26499 non-null  float64
 5   behavioral_face_mask         26688 non-null  float64
 6   behavioral_wash_hands        26665 non-null  float64
 7   behavioral_large_gatherings  26620 non-null  float64
 8   behavioral_outside_home      26625 non-null  float64
 9   behavioral_touch_face        26579 non-null  float64
 10  doctor_recc_h1n1             24547 non-null  float64
 11  doctor_recc_seasonal         24547 non-null  float64
 12  chronic_med_condition        25736 non-null  float64
 13  child_under_6_mo

employment_occupation    13470
employment_industry      13330
health_insurance         12274
income_poverty            4423
doctor_recc_h1n1          2160
doctor_recc_seasonal      2160
rent_or_own               2042
employment_status         1463
marital_status            1408
education                 1407
dtype: int64
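
The merge assumes each `respondent_id` appears exactly once in both files. One way to make that assumption explicit is the `validate` argument of `DataFrame.merge`, which raises a `MergeError` if keys are duplicated. The frames below are made-up stand-ins for the real feature and label tables.

```python
import pandas as pd

# Hypothetical toy stand-ins for train_features and train_labels.
features = pd.DataFrame({
    "respondent_id": [0, 1, 2],
    "h1n1_concern": [1.0, 3.0, 1.0],
})
labels = pd.DataFrame({
    "respondent_id": [0, 1, 2],
    "h1n1_vaccine": [0, 0, 1],
    "seasonal_vaccine": [0, 1, 1],
})

# validate="one_to_one" fails loudly if either side has duplicate keys.
merged = features.merge(
    labels, on="respondent_id", how="inner", validate="one_to_one"
)

# An inner join that keeps every feature row means no respondent lost a label.
print(len(merged) == len(features))
```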

In [None]:
# Fill missing values (simple, exploratory imputation applied to the full training data)
def clean_missing_values(data):
    data = data.copy()  # avoid mutating the caller's DataFrame

    # Fill missing values in categorical columns with the mode
    for col in data.select_dtypes(include=['object']).columns:
        data[col] = data[col].fillna(data[col].mode()[0])

    # Fill missing values in numerical columns with the mean
    # (note: the mean of a binary or ordinal column is usually not a valid category)
    for col in data.select_dtypes(include=['float64', 'int64']).columns:
        data[col] = data[col].fillna(data[col].mean())

    return data

data = clean_missing_values(data)

# Check the cleaned data
data.head()

# Summary statistics for all columns, numeric and categorical
data.describe(include='all')


|  | respondent_id | h1n1_concern | h1n1_knowledge | behavioral_antiviral_meds | behavioral_avoidance | behavioral_face_mask | behavioral_wash_hands | behavioral_large_gatherings | behavioral_outside_home | behavioral_touch_face | ... | rent_or_own | employment_status | hhs_geo_region | census_msa | household_adults | household_children | employment_industry | employment_occupation | h1n1_vaccine | seasonal_vaccine |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 26707.0 | 26707.0 | 26707.0 | 26707.0 | 26707.0 | 26707.0 | 26707.0 | 26707.0 | 26707.0 | 26707.0 | ... | 26707 | 26707 | 26707 | 26707 | 26707.0 | 26707.0 | 26707 | 26707 | 26707.0 | 26707.0 |
| unique |  |  |  |  |  |  |  |  |  |  | ... | 2 | 3 | 10 | 3 |  |  | 21 | 23 |  |  |
| top |  |  |  |  |  |  |  |  |  |  | ... | Own | Employed | lzgpxyit | MSA, Not Principle City |  |  | fcxhlnwr | xtkaffoo |  |  |
| freq |  |  |  |  |  |  |  |  |  |  | ... | 20778 | 15023 | 4297 | 11645 |  |  | 15798 | 15248 |  |  |
| mean | 13353.0 | 1.618486 | 1.262532 | 0.048844 | 0.725612 | 0.068982 | 0.825614 | 0.35864 | 0.337315 | 0.677264 | ... |  |  |  |  | 0.886499 | 0.534583 |  |  | 0.212454 | 0.465608 |
| std | 7709.791156 | 0.908741 | 0.616805 | 0.215258 | 0.444473 | 0.253339 | 0.37915 | 0.478828 | 0.472076 | 0.46641 | ... |  |  |  |  | 0.749901 | 0.923836 |  |  | 0.409052 | 0.498825 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... |  |  |  |  | 0.0 | 0.0 |  |  | 0.0 | 0.0 |
| 25% | 6676.5 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... |  |  |  |  | 0.0 | 0.0 |  |  | 0.0 | 0.0 |
| 50% | 13353.0 | 2.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | ... |  |  |  |  | 1.0 | 0.0 |  |  | 0.0 | 0.0 |
| 75% | 20029.5 | 2.0 | 2.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... |  |  |  |  | 1.0 | 1.0 |  |  | 0.0 | 1.0 |
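
The cleaning function above computes its fill values from every row, before any train/validation split, so validation rows influence the imputed values. A common alternative is to learn the statistic on the training split only and reuse it elsewhere. The sketch below shows that pattern on a made-up single-column frame; it illustrates the idea and is not what this notebook's cleaning step does.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Hypothetical toy column with gaps.
toy = pd.DataFrame(
    {"h1n1_concern": [0.0, 1.0, np.nan, 3.0, 2.0, np.nan, 1.0, 2.0]}
)
train, val = train_test_split(toy, test_size=0.25, random_state=0)

imputer = SimpleImputer(strategy="median")
train_filled = imputer.fit_transform(train)  # median learned from train rows only
val_filled = imputer.transform(val)          # same median reused, not re-estimated

print(imputer.statistics_)
```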


In [45]:
data.columns



Index(['respondent_id', 'h1n1_concern', 'h1n1_knowledge',
       'behavioral_antiviral_meds', 'behavioral_avoidance',
       'behavioral_face_mask', 'behavioral_wash_hands',
       'behavioral_large_gatherings', 'behavioral_outside_home',
       'behavioral_touch_face', 'doctor_recc_h1n1', 'doctor_recc_seasonal',
       'chronic_med_condition', 'child_under_6_months', 'health_worker',
       'health_insurance', 'opinion_h1n1_vacc_effective', 'opinion_h1n1_risk',
       'opinion_h1n1_sick_from_vacc', 'opinion_seas_vacc_effective',
       'opinion_seas_risk', 'opinion_seas_sick_from_vacc', 'age_group',
       'education', 'race', 'sex', 'income_poverty', 'marital_status',
       'rent_or_own', 'employment_status', 'hhs_geo_region', 'census_msa',
       'household_adults', 'household_children', 'employment_industry',
       'employment_occupation', 'h1n1_vaccine', 'seasonal_vaccine'],
      dtype='object')

In [46]:
# Separate the features from the two target columns
X = data.drop(columns=['respondent_id', 'seasonal_vaccine', 'h1n1_vaccine'])
y = data[['seasonal_vaccine', 'h1n1_vaccine']]


# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
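
Only about 21% of respondents received the H1N1 vaccine (the `h1n1_vaccine` mean above is 0.212), so a purely random split can shift the class balance between train and validation sets. One option is to stratify on the combination of both targets. The sketch below uses made-up data, and the `combo` key is a helper name introduced here for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy targets with four equally frequent label combinations.
toy_y = pd.DataFrame({
    "h1n1_vaccine":     [0, 0, 1, 1] * 10,
    "seasonal_vaccine": [0, 1, 0, 1] * 10,
})
toy_X = pd.DataFrame({"feature": range(len(toy_y))})

# One string key per row, e.g. "0_1", so both targets are balanced at once.
combo = toy_y["h1n1_vaccine"].astype(str) + "_" + toy_y["seasonal_vaccine"].astype(str)

X_tr, X_va, y_tr, y_va = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=combo
)

# Each label combination keeps its share in the validation split.
print(combo.loc[y_va.index].value_counts())
```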

In [47]:
# Feature types
numerical = X.select_dtypes(include=["int64", "float64"]).columns.tolist()
categorical = X.select_dtypes(include=["object"]).columns.tolist()

In [48]:
# CONVERT CATEGORICAL VARIABLES TO NUMERICAL USING ONE-HOT ENCODING
def convert_categorical_to_numerical(data):
    # Identify categorical columns
    categorical_cols = data.select_dtypes(include=['object']).columns.tolist()
    
    # Create a pipeline for categorical columns
    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))
    ])
    
    # Create a column transformer
    preprocessor = ColumnTransformer(
        transformers=[
            ('cat', categorical_transformer, categorical_cols)
        ],
        remainder='passthrough',  # Keep the rest of the columns unchanged
        sparse_threshold=0  # Always return a dense array so pd.DataFrame() below works
    )
    
    # Fit and transform the data
    data_transformed = preprocessor.fit_transform(data)
    
    # Convert to DataFrame
    data_transformed = pd.DataFrame(data_transformed, columns=preprocessor.get_feature_names_out())
    
    return data_transformed

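# Note: `data` still contains respondent_id and both targets, so they pass through as remainder__ columns
data_transformed = convert_categorical_to_numerical(data)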


data_transformed.columns

Index(['cat__age_group_18 - 34 Years', 'cat__age_group_35 - 44 Years',
       'cat__age_group_45 - 54 Years', 'cat__age_group_55 - 64 Years',
       'cat__age_group_65+ Years', 'cat__education_12 Years',
       'cat__education_< 12 Years', 'cat__education_College Graduate',
       'cat__education_Some College', 'cat__race_Black',
       ...
       'remainder__opinion_h1n1_vacc_effective',
       'remainder__opinion_h1n1_risk',
       'remainder__opinion_h1n1_sick_from_vacc',
       'remainder__opinion_seas_vacc_effective',
       'remainder__opinion_seas_risk',
       'remainder__opinion_seas_sick_from_vacc', 'remainder__household_adults',
       'remainder__household_children', 'remainder__h1n1_vaccine',
       'remainder__seasonal_vaccine'],
      dtype='object', length=108)

In [None]:



# Optional: Visualize distributions
# Example: Vaccine uptake by age group or health worker status
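
As a starting point for the visualization suggested above, the uptake rate per group is just the mean of each binary target within that group. The sketch below uses a made-up six-row frame with the dataset's column names; on the real data the same `groupby` call would run on `data`.

```python
import pandas as pd

# Hypothetical toy rows using the dataset's column names.
toy = pd.DataFrame({
    "age_group": ["18 - 34 Years", "65+ Years", "65+ Years",
                  "18 - 34 Years", "65+ Years", "18 - 34 Years"],
    "h1n1_vaccine":     [0, 1, 0, 0, 1, 1],
    "seasonal_vaccine": [0, 1, 1, 0, 1, 0],
})

# Mean of a 0/1 column within a group is that group's uptake rate.
uptake_by_age = toy.groupby("age_group")[["h1n1_vaccine", "seasonal_vaccine"]].mean()
print(uptake_by_age)

# With matplotlib available, uptake_by_age.plot(kind="bar") draws a grouped bar chart.
```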

### Preprocessing Pipeline

import numpy as np

# Split
X = data.drop(columns=["h1n1_vaccine", "seasonal_vaccine", "respondent_id"])
y = data[["h1n1_vaccine", "seasonal_vaccine"]]

# Feature types
numerical = X.select_dtypes(include=["int64", "float64"]).columns.tolist()
categorical = X.select_dtypes(include=["object"]).columns.tolist()

# Pipeline
numeric_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="median"))
])
categorical_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))
])
preprocessor = ColumnTransformer([
    ("num", numeric_transformer, numerical),
    ("cat", categorical_transformer, categorical)
])

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

## 4. Modeling
from sklearn.base import clone
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Pipeline for the H1N1 target; clone() gives each pipeline its own preprocessor
# instead of sharing (and refitting) a single object
pipe_h1n1 = Pipeline([
    ("preprocessor", clone(preprocessor)),
    ("model", RandomForestClassifier(random_state=42))
])

pipe_h1n1.fit(X_train, y_train["h1n1_vaccine"])
# Column 1 of predict_proba is the probability of the positive class
preds_h1n1 = pipe_h1n1.predict_proba(X_val)[:, 1]

roc_auc_h1n1 = roc_auc_score(y_val["h1n1_vaccine"], preds_h1n1)
print("H1N1 ROC AUC:", roc_auc_h1n1)

# Repeat for the seasonal flu target
pipe_seasonal = Pipeline([
    ("preprocessor", clone(preprocessor)),
    ("model", RandomForestClassifier(random_state=42))
])

pipe_seasonal.fit(X_train, y_train["seasonal_vaccine"])
preds_seasonal = pipe_seasonal.predict_proba(X_val)[:, 1]

roc_auc_seasonal = roc_auc_score(y_val["seasonal_vaccine"], preds_seasonal)
print("Seasonal ROC AUC:", roc_auc_seasonal)
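
The cells above fit one pipeline per target. An alternative that keeps both targets behind a single estimator is scikit-learn's `MultiOutputClassifier`, which fits one copy of a base model per label column. The sketch below uses synthetic data and a logistic regression base model as an illustration; it is not the model evaluated in this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 4))

# Two synthetic binary targets, each driven by a different feature plus noise.
Y_toy = np.column_stack([
    (X_toy[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int),
    (X_toy[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int),
])

multi = MultiOutputClassifier(LogisticRegression(max_iter=1000))
multi.fit(X_toy, Y_toy)

# predict_proba returns a list with one (n_samples, 2) array per target.
proba_per_target = multi.predict_proba(X_toy)

# Keep the positive-class column of each target: shape (n_samples, 2).
probs = np.column_stack([p[:, 1] for p in proba_per_target])
print(probs[:3])
```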

## 5. Evaluation
# Compare ROC AUCs
mean_auc = (roc_auc_h1n1 + roc_auc_seasonal) / 2
print("Mean ROC AUC:", mean_auc)

# Optional: Feature importances, SHAP, insights

## 6. Deployment
# Predict on the competition test set (the pipelines impute its missing values)
X_submit = test_features.drop(columns=["respondent_id"])
final_preds_h1n1 = pipe_h1n1.predict_proba(X_submit)[:, 1]
final_preds_seasonal = pipe_seasonal.predict_proba(X_submit)[:, 1]

# Save submission
submission = pd.DataFrame({
    "respondent_id": test_features["respondent_id"],
    "h1n1_vaccine": final_preds_h1n1,
    "seasonal_vaccine": final_preds_seasonal
})
submission.to_csv("submission.csv", index=False)
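
Before uploading, it can help to sanity-check the file's structure. The helper below, `check_submission`, is a hypothetical function written for this sketch. It assumes the expected layout is the one built above (an id column followed by two probability columns) and is demonstrated on a made-up frame.

```python
import pandas as pd

# Assumed layout, matching the submission frame built in this notebook.
EXPECTED_COLUMNS = ["respondent_id", "h1n1_vaccine", "seasonal_vaccine"]

def check_submission(df):
    """Raise ValueError if the frame does not look like a valid submission."""
    if list(df.columns) != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected columns: {list(df.columns)}")
    if df["respondent_id"].duplicated().any():
        raise ValueError("duplicate respondent_id values")
    probs = df[["h1n1_vaccine", "seasonal_vaccine"]]
    if probs.isnull().any().any():
        raise ValueError("missing probabilities")
    if ((probs < 0) | (probs > 1)).any().any():
        raise ValueError("probabilities outside [0, 1]")

# Hypothetical toy submission.
toy_submission = pd.DataFrame({
    "respondent_id": [26707, 26708, 26709],
    "h1n1_vaccine": [0.12, 0.80, 0.05],
    "seasonal_vaccine": [0.40, 0.91, 0.33],
})
check_submission(toy_submission)  # passes silently when the frame is valid
```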
