# Business Understanding 

# Problem Definition 

## Context

Vaccination is a cornerstone of public health, protecting individuals and communities through both direct immunity and herd immunity. Insights from past campaigns, such as the 2009 H1N1 influenza pandemic, can guide current and future vaccination efforts, including for emerging diseases like COVID-19. Understanding factors influencing vaccine uptake helps public health authorities design targeted interventions to improve coverage.

### Stakeholder
Public health authorities, such as the CDC, are responsible for monitoring vaccine coverage and planning immunization campaigns. Predictive insights from this analysis can help identify populations less likely to get vaccinated and guide resource allocation.

### Problem Statement
This project aims to predict whether an individual received the H1N1 or seasonal flu vaccine based on demographic, behavioral, and opinion-based survey responses. By modeling vaccination likelihood, I can:

1. Identify key factors associated with vaccine uptake.
2. Predict vaccination probability for individuals or populations.
3. Inform targeted public health messaging and interventions.

### Scope and Evaluation
The single-target models focus on one vaccine type (H1N1 or seasonal), and a final multi-output model predicts both. The analysis includes:
1. Exploratory Data Analysis (EDA) to understand distributions and relationships.
2. Feature engineering and preprocessing to prepare the data for modeling.
3. Model training using classification algorithms.
4. Evaluation using accuracy, precision, recall, F1-score, and primarily ROC-AUC to measure predictive performance.
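
As a minimal sketch of how these metrics are computed, the toy example below uses made-up labels and probabilities (not survey data). The threshold-based metrics need hard 0/1 predictions, while ROC-AUC works directly on the predicted probabilities; the 0.5 threshold is an assumption for illustration.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
)

# Toy labels and predicted probabilities (made up for illustration)
y_true = np.array([0, 0, 0, 0, 1, 1])
y_prob = np.array([0.1, 0.2, 0.3, 0.6, 0.4, 0.9])

# Threshold-based metrics need hard predictions; 0.5 is an assumed cutoff
y_hat = (y_prob >= 0.5).astype(int)

metrics = {
    "accuracy": accuracy_score(y_true, y_hat),
    "precision": precision_score(y_true, y_hat),
    "recall": recall_score(y_true, y_hat),
    "f1": f1_score(y_true, y_hat),
    # ROC-AUC is threshold-free: it ranks positives against negatives
    "roc_auc": roc_auc_score(y_true, y_prob),
}
print(metrics)
```

Because ROC-AUC does not depend on a cutoff, it is less sensitive to class imbalance than accuracy, which is one reason to treat it as the primary metric here.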

### Business Value
Predictive insights enable public health officials to:
1. Identify groups with lower vaccination rates.
2. Design more effective, data-driven vaccination campaigns.
3. Allocate resources efficiently to increase overall vaccine coverage.

## Import libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import statsmodels.api as sm
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    roc_auc_score, classification_report, confusion_matrix, ConfusionMatrixDisplay
)

# Load data

In [None]:
# Load CSV files from the data folder
train_features = pd.read_csv('data/training_set_features.csv')
print(train_features.shape)

In [None]:
train_features.head()

In [None]:
train_labels = pd.read_csv('data/training_set_labels.csv')
print(train_labels.shape)

In [None]:
train_labels.head()

In [None]:
test_features = pd.read_csv('data/test_set_features.csv')
print(test_features.shape)

In [None]:
test_features.head()

In [None]:
submission_format = pd.read_csv('data/submission_format.csv')
submission_format.head()

In [None]:
print(submission_format.shape)

# Exploratory Data Analysis (EDA)

In [None]:
# Check data types
train_features.info()

## Check missing values

In [None]:
# Check missing values
train_features.isnull().sum()

In [None]:
# Check target distribution (print both: a cell only displays its last expression)
print(train_labels['h1n1_vaccine'].value_counts())
print(train_labels['seasonal_vaccine'].value_counts())

In [None]:
train_features['household_adults'].hist()
plt.show()

In [None]:
sns.countplot(x='age_group', data=train_features)
plt.show()

In [None]:
# Quick statistics
train_features.describe(include='all')

# Merging Features and Labels

Combines survey features and vaccination outcomes into a single dataframe. 
This simplifies preprocessing and EDA because all info is in one table.
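
A small sketch of the same merge on toy frames (the rows are made up). The `validate` argument is an optional safeguard, not something the notebook uses: it raises an error if `respondent_id` is duplicated on either side.

```python
import pandas as pd

# Toy stand-ins for the features and labels tables (values are made up)
features = pd.DataFrame({
    "respondent_id": [0, 1, 2],
    "age_group": ["18 - 34 Years", "65+ Years", "35 - 44 Years"],
})
labels = pd.DataFrame({
    "respondent_id": [0, 1, 2],
    "h1n1_vaccine": [0, 1, 0],
})

# validate="one_to_one" fails loudly if an ID appears more than once
merged = features.merge(labels, on="respondent_id", how="inner", validate="one_to_one")

# An inner merge should keep every respondent when the IDs match exactly
rows_preserved = len(merged) == len(features)
print(merged.shape, rows_preserved)
```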

In [None]:
# merging training_set_features.csv and training_set_labels.csv into one dataframe for cleaning and EDA.
# Merge on respondent_id
train = train_features.merge(train_labels, on='respondent_id')

In [None]:
train.shape

In [None]:
train.head()

## Cleaning Column Names

Strips surrounding whitespace, converts names to lowercase, and replaces spaces with underscores. 
Consistent names reduce the risk of errors caused by typos or inconsistent naming.

In [None]:
#Remove whitespaces
# Clean column names
train.columns = train.columns.str.strip().str.lower().str.replace(' ', '_')
train.columns

## Cleaning Categorical Values

Converts all categorical entries to lowercase strings and strips surrounding whitespace.

In [None]:
#Cleaning String/Category Values
# Identify categorical columns (example)
categorical_cols = ['age_group', 'education', 'race', 'sex', 'income_poverty', 
                    'marital_status', 'rent_or_own', 'employment_status', 
                    'hhs_geo_region', 'census_msa', 'employment_industry', 
                    'employment_occupation']

# Clean string values
for col in categorical_cols:
    train[col] = train[col].astype(str).str.strip().str.lower()

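## Handling Missing Values in Categorical Features
The previous astype(str) step turned missing values (NaN) into the string 'nan'. This step replaces that string with "unknown", so rows with missing data are kept rather than dropped.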
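
As a sketch of a variation (not the ordering used in this notebook), missing values can be filled before casting to string, so the result does not depend on NaN being rendered as the text 'nan'. The toy series is made up.

```python
import numpy as np
import pandas as pd

# Toy categorical column with messy casing, padding, and a missing value
s = pd.Series([" Own ", "rent", np.nan])

# Fill first, then normalize: NaN never passes through astype(str)
filled = s.fillna("unknown").astype(str).str.strip().str.lower()
print(filled.tolist())
```

Filling first also avoids relabeling a respondent whose genuine answer happened to be the text "nan".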

In [None]:
#Handling Missing Values
# Fill missing categorical values
for col in categorical_cols:
    train[col] = train[col].replace('nan', 'unknown')

## Handling Missing Values in Numerical Features

In [None]:
# Fill missing numeric values with median
numeric_cols = train.select_dtypes(include='number').columns
train[numeric_cols] = train[numeric_cols].fillna(train[numeric_cols].median())

## Encoding Categorical Features
Converts categories into numerical columns (one-hot encoding). Setting drop_first=True drops one level per category, which avoids perfect collinearity among the dummy columns in models like Logistic Regression.
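
To illustrate what drop_first changes, here is a toy frame (made-up values) encoded both ways:

```python
import pandas as pd

# Toy frame with one categorical and one numeric column
toy = pd.DataFrame({"sex": ["female", "male", "female"], "opinion": [1, 3, 5]})

# Full encoding: one column per level (sex_female + sex_male always sum to 1)
full = pd.get_dummies(toy, columns=["sex"])

# drop_first=True removes the first level, which becomes the implicit baseline
reduced = pd.get_dummies(toy, columns=["sex"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

With the first level dropped, a row with all dummy columns equal to 0 represents the baseline category ("female" in this toy case).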

In [None]:
#Encoding Categorical Features
# One-hot encoding for categorical variables
# drop_first=True avoids multicollinearity for linear models
train_encoded = pd.get_dummies(train, columns=categorical_cols, drop_first=True)
train_encoded.head()

# Cleaning test_set_features.csv

In [None]:
# Clean column names
test_features.columns = test_features.columns.str.strip().str.lower().str.replace(' ', '_')

In [None]:
# Clean string values
for col in categorical_cols:
    test_features[col] = test_features[col].astype(str).str.strip().str.lower()

In [None]:
# Fill missing categorical values
for col in categorical_cols:
    test_features[col] = test_features[col].replace('nan', 'unknown')

In [None]:
# Fill numeric missing values
numeric_cols_test = test_features.select_dtypes(include='number').columns
test_features[numeric_cols_test] = test_features[numeric_cols_test].fillna(test_features[numeric_cols_test].median())
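
The cell above imputes the test set with its own medians. One common variation, shown here only as a sketch on made-up values, is to compute the medians on the training data and reuse them for the test set, so both sets are transformed identically.

```python
import numpy as np
import pandas as pd

# Toy train/test frames with missing numeric values (made up)
train_toy = pd.DataFrame({"household_adults": [0.0, 1.0, np.nan, 3.0]})
test_toy = pd.DataFrame({"household_adults": [np.nan, 2.0]})

# Learn the medians from the training data only
train_medians = train_toy.median()

# Apply the same medians to both sets
train_filled = train_toy.fillna(train_medians)
test_filled = test_toy.fillna(train_medians)

print(test_filled["household_adults"].tolist())
```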

# Preprocessing

In [None]:
# One-hot encoding
test_encoded = pd.get_dummies(test_features, columns=categorical_cols, drop_first=True)

## Aligning train and test columns

Ensures the test set has exactly the same features as the training set. 
Missing dummy columns in the test set are filled with 0.
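
A toy sketch of the alignment step (column names are made up). With join='left', the training columns define the layout: columns missing from the test frame are added and filled with 0, and columns that exist only in the test frame are dropped.

```python
import pandas as pd

# Toy encoded frames: the test set lacks region_y and has an extra region_z
train_toy = pd.DataFrame({"a": [1, 2], "region_x": [1, 0], "region_y": [0, 1]})
test_toy = pd.DataFrame({"a": [3], "region_x": [1], "region_z": [0]})

# Align on columns (axis=1), keeping the left (training) column set
train_al, test_al = train_toy.align(test_toy, join="left", axis=1, fill_value=0)

print(list(test_al.columns))
print(test_al["region_y"].iloc[0])
```

Note that the same mechanism also adds the target columns to the test frame as zero-filled placeholders, so they must be excluded before predicting.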

In [None]:
#Align train and test columns
train_encoded, test_encoded = train_encoded.align(test_encoded, join='left', axis=1, fill_value=0)

In [None]:
#Verify shapes
print(train_encoded.shape)
print(test_encoded.shape)

In [None]:
# Work on a copy of the encoded training data
# (after alignment, the test frame only has zero-filled placeholder target columns)
df = train_encoded.copy()
# Check the first few rows
df.head()

## Correlation
Visualizes the correlation between features using a heatmap, keeping only strong correlations (|r| > 0.7).

In [None]:
## Compute correlation matrix
corr = df.corr()

In [None]:
# Mask weak correlations (keep |corr| > 0.7)
strong_corr = corr.copy()
strong_corr[(corr > -0.7) & (corr < 0.7)] = 0

In [None]:
# Plot heatmap
plt.figure(figsize=(10,8))
sns.heatmap(
    strong_corr, 
    annot=True, 
    fmt=".2f", 
    cmap="coolwarm",  
    center=0,          
    linewidths=0.5,
    cbar_kws={"shrink":0.7}
)
plt.title("Strong Correlations (|r| > 0.7)")
plt.show()


In [None]:
# List feature pairs with strong correlations
corr_pairs = corr.abs().unstack()  # flatten matrix
# Remove self-correlations
corr_pairs = corr_pairs[corr_pairs < 1]
# Keep only correlations > 0.7
high_corr = corr_pairs[corr_pairs > 0.7].sort_values(ascending=False)
print("Highly correlated feature pairs:\n", high_corr)

In [None]:
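# Optional: get a set of features to drop
# (keep one feature per highly correlated pair)
to_drop = set()
for feat1, feat2 in high_corr.index:
    # Each pair appears twice, as (a, b) and (b, a), so skip pairs that are
    # already resolved; otherwise both features would end up being dropped
    if feat1 not in to_drop and feat2 not in to_drop:
        to_drop.add(feat2)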

print("\nSuggested features to drop due to high correlation:\n", to_drop)

## Summarise

In [None]:
#concise stat
df.describe().T

## Check for unique values

In [None]:
#check unique values
for i in df.columns:
    uniq_val = df[i].unique()
    print(f'Column name: {i}\n{uniq_val}\n')
    print("****"*20) 

## Selecting features and target

In [None]:
# Target variables
target = 'h1n1_vaccine'  # or 'seasonal_vaccine'
X = train_encoded.drop(['h1n1_vaccine', 'seasonal_vaccine', 'respondent_id'], axis=1)
y = train_encoded[target]

## Train Test Split data

Creates training and validation sets. 
stratify=y ensures the proportion of vaccinated vs non-vaccinated is the same in both sets, preventing bias in evaluation.
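
A toy sketch of what stratification does, using a made-up 80/20 class split rather than the survey labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced target: 80 negatives, 20 positives
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(100).reshape(-1, 1)

X_a, X_b, y_a, y_b = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Both subsets keep the original 20% positive rate
train_rate = y_a.mean()
val_rate = y_b.mean()
print(train_rate, val_rate)
```

Without stratify, the positive rate in a small validation set can drift from the overall rate by chance, which makes metrics noisier.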

In [None]:
# Split data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Target variable: {target}")
print(f"Training samples: {X_train.shape[0]}, Validation samples: {X_val.shape[0]}")

In [None]:
X_train.shape, y_train.shape

In [None]:
X_val.shape, y_val.shape

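## Scale the features

This code puts all features on the same scale (zero mean, unit variance), which improves stability and convergence for algorithms that are sensitive to feature magnitude, such as Logistic Regression. The scaler is fit on the training data only and then applied to the validation data.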
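
A minimal sketch of the fit-on-train, transform-on-validation pattern with made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy features on very different scales (made up)
train_toy = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
val_toy = np.array([[2.0, 400.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_toy)  # learns mean and std from train
val_scaled = scaler.transform(val_toy)          # reuses the training statistics

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(val_scaled)
```

The validation row is scaled with the training mean and standard deviation, so its values are not forced to zero mean; that is expected and keeps validation information out of the fit.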

In [None]:
# Initialize scaler
sc = StandardScaler()

# Scale the features
X_train_s = sc.fit_transform(X_train)  # fit on training data
X_val_s = sc.transform(X_val)          # transform validation data

In [None]:
# Optional: check shapes
print(X_train_s.shape, X_val_s.shape)

# Baseline Model: Logistic Regression

Baseline model using a simple, interpretable algorithm.
Predicts probability of vaccination.

In [None]:
log_model = LogisticRegression(max_iter=1000, random_state=42)
log_model.fit(X_train, y_train)

In [None]:
# Check training accuracy (on the same unscaled features the model was fit on)
train_score = log_model.score(X_train, y_train)
print(f'{train_score:.2f}')

In [None]:
# Predictions on the validation set
y_pred_log = log_model.predict(X_val)
y_pred_proba_log = log_model.predict_proba(X_val)[:, 1]

In [None]:
# Evaluate
roc_auc_log = roc_auc_score(y_val, y_pred_proba_log)
print("Logistic Regression ROC-AUC:", round(roc_auc_log, 3))
print(classification_report(y_val, y_pred_log))

## Check model metrics

In [None]:
# R-squared is a regression metric, so inspect the confusion matrix
# of the validation predictions instead
print(confusion_matrix(y_val, y_pred_log))

# Second Model: Random Forest (Tuned)

Random Forest is an ensemble model that handles nonlinearities and interactions.
GridSearchCV tunes hyperparameters to maximize ROC-AUC.
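
Before running a grid search it can help to estimate its cost. The sketch below counts the candidate combinations for a grid with the same shape as the one used in this section, assuming 3-fold cross-validation; `param_grid_toy` is just a local name for this illustration.

```python
from sklearn.model_selection import ParameterGrid

# Same shape as the grid tuned in this section
param_grid_toy = {
    "n_estimators": [100, 200],
    "max_depth": [10, 20, None],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
}

# Every combination of values is one candidate model
n_candidates = len(ParameterGrid(param_grid_toy))

# Each candidate is fit once per cross-validation fold
cv_folds = 3
n_fits = n_candidates * cv_folds

print(n_candidates, n_fits)
```

With the default refit=True, GridSearchCV adds one more fit of the best candidate on the full training data.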

In [None]:
# Choose your single target for the MVP
target = 'h1n1_vaccine'

# Define X and y
X = train_encoded.drop(['h1n1_vaccine', 'seasonal_vaccine', 'respondent_id'], axis=1)
y = train_encoded[target]

# Check alignment
print(X.shape, y.shape)

# Split train/test with matching rows
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, y_train.shape)

In [None]:
# Define parameter grid for tuning
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2]
}

# Initialize model
rf = RandomForestClassifier(random_state=42)

# Grid search
grid_search = GridSearchCV(
    rf, param_grid, scoring='roc_auc', cv=3, n_jobs=-1, verbose=1
)

# Fit on training data
grid_search.fit(X_train, y_train)

# Get best model
best_rf = grid_search.best_estimator_
print("Best Parameters:", grid_search.best_params_)


# Evaluate Random Forest

In [None]:
# Predictions
y_pred_rf = best_rf.predict(X_test)
y_proba_rf = best_rf.predict_proba(X_test)[:, 1]

# ROC-AUC
roc_auc_rf = roc_auc_score(y_test, y_proba_rf)
print(f"Random Forest ROC-AUC: {roc_auc_rf:.3f}\n")

# Classification Report
print("Classification Report:")
print(classification_report(y_test, y_pred_rf))

# Confusion Matrix
ConfusionMatrixDisplay.from_estimator(best_rf, X_test, y_test, display_labels=[0, 1], cmap='Blues')
plt.title("Random Forest Confusion Matrix")
plt.show()

# Feature Importance
Identifies which survey questions and demographic factors most influence vaccination prediction.
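
One-hot encoding spreads a categorical question across several dummy columns, so its importance is split as well. The sketch below sums importances back to the source column; the importance values and the `source_column` helper are made up for illustration and are not output from the model.

```python
import pandas as pd

# Made-up importances for a mix of numeric and dummy columns
imp = pd.Series({
    "doctor_recc_h1n1": 0.30,
    "age_group_65+ years": 0.10,
    "age_group_35 - 44 years": 0.05,
    "sex_male": 0.05,
})

# Categorical columns that were one-hot encoded (subset, for illustration)
cat_cols = ["age_group", "sex"]

def source_column(name):
    """Map a dummy column name back to its original column (hypothetical helper)."""
    for col in cat_cols:
        if name.startswith(col + "_"):
            return col
    return name

# Sum the importances of all dummies that share a source column
grouped = imp.groupby(imp.index.map(source_column)).sum().sort_values(ascending=False)
print(grouped)
```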

In [None]:
# Extract feature importance
feat_imp = pd.Series(best_rf.feature_importances_, index=X_train.columns).sort_values(ascending=False)

# Show top 15 features
top_features = feat_imp.head(15)
print("\nTop 15 Important Features:\n", top_features)


In [None]:
# Plot from the feature importance
top_features.plot(kind='barh', figsize=(8,6), color='teal')
plt.gca().invert_yaxis()
plt.title("Top 15 Important Features - Random Forest")
plt.xlabel("Feature Importance")
plt.show()

In [None]:
# Define features once
feature_cols = train_encoded.drop(['respondent_id','h1n1_vaccine','seasonal_vaccine'], axis=1).columns
X_train = train_encoded[feature_cols]
X_test_final = test_encoded[feature_cols]

# Targets
y_train = train_encoded[['h1n1_vaccine','seasonal_vaccine']]

# Split
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42, stratify=y_train)


## Multi-Output Logistic Regression
Simultaneously predicts H1N1 and seasonal flu vaccination probabilities.
Produces probabilities, not just 0/1 predictions → useful for risk stratification and campaign targeting.
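
The output of predict_proba for a multi-output classifier is a list with one array per target, which is why the evaluation indexes it as `[0][:, 1]` and `[1][:, 1]`. A toy sketch on random data (not the survey) shows the structure:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

# Random toy data with two binary targets
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(40, 3))
Y_toy = np.column_stack([
    (X_toy[:, 0] > 0).astype(int),
    (X_toy[:, 1] > 0).astype(int),
])

clf = MultiOutputClassifier(LogisticRegression(max_iter=500)).fit(X_toy, Y_toy)

# A list with one (n_samples, 2) array per target
proba = clf.predict_proba(X_toy)

# Column 1 of each array is the probability of the positive class
p_first = proba[0][:, 1]
p_second = proba[1][:, 1]

print(len(proba), proba[0].shape)
```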

In [None]:
# Multi-label binary classification: one Logistic Regression per target
model = MultiOutputClassifier(LogisticRegression(max_iter=500, random_state=42))
model.fit(X_tr, y_tr)

In [None]:
# Predict probabilities
y_val_pred = model.predict_proba(X_val)

In [None]:
# Calculate ROC AUC for each target
auc_h1n1 = roc_auc_score(y_val['h1n1_vaccine'], y_val_pred[0][:,1])
auc_seasonal = roc_auc_score(y_val['seasonal_vaccine'], y_val_pred[1][:,1])
print("ROC AUC H1N1:", auc_h1n1)
print("ROC AUC Seasonal:", auc_seasonal)

## Submission Preparation
Creates a properly formatted CSV for submission. Probabilities reflect likelihood of vaccination.
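
Before writing the file, a few sanity checks against the expected format can catch mistakes early. This is a sketch on made-up frames; `fmt` stands in for the submission format file and `sub` for the predictions.

```python
import pandas as pd

# Toy stand-ins for the submission format and the predicted submission
fmt = pd.DataFrame({
    "respondent_id": [10, 11],
    "h1n1_vaccine": [0.5, 0.5],
    "seasonal_vaccine": [0.7, 0.7],
})
sub = pd.DataFrame({
    "respondent_id": [10, 11],
    "h1n1_vaccine": [0.12, 0.80],
    "seasonal_vaccine": [0.45, 0.91],
})

# Probabilities as a plain array for a range check
probs = sub[["h1n1_vaccine", "seasonal_vaccine"]].to_numpy()

checks = {
    "same_columns": list(sub.columns) == list(fmt.columns),
    "same_ids": sub["respondent_id"].equals(fmt["respondent_id"]),
    "valid_probs": bool(((probs >= 0) & (probs <= 1)).all()),
}
print(checks)
```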

In [None]:
# Select only feature columns used for training (drop target and ID)
feature_cols = train_encoded.drop(['respondent_id', 'h1n1_vaccine', 'seasonal_vaccine'], axis=1).columns

# Ensure X_test_final has exactly the same columns
X_test_final = test_encoded[feature_cols]

# Now make predictions
y_test_pred = model.predict_proba(X_test_final)

# Prepare submission
submission = pd.DataFrame({
    'respondent_id': test_features['respondent_id'],  # keep ID for submission
    'h1n1_vaccine': y_test_pred[0][:,1],
    'seasonal_vaccine': y_test_pred[1][:,1]
})

submission.to_csv('h1n1_submission.csv', index=False)

In [None]:
submission.head()   # shows the first 5 rows

In [None]:
submission.tail()   # shows the last 5 rows

In [None]:
submission.shape    # check the number of rows and columns

In [None]:
# Read back the CSV to verify
pd.read_csv('h1n1_submission.csv').head()

## Visualization
Plots distribution of predicted probabilities → shows how certain the model is for each respondent.

In [None]:
# Set style
sns.set(style="whitegrid")

# Plot histograms
plt.figure(figsize=(12,5))

# h1n1_vaccine probabilities
plt.subplot(1, 2, 1)
sns.histplot(submission['h1n1_vaccine'], bins=20, kde=True, color='skyblue')
plt.title('Predicted Probabilities - H1N1 Vaccine')
plt.xlabel('Probability')
plt.ylabel('Count')

# seasonal_vaccine probabilities
plt.subplot(1, 2, 2)
sns.histplot(submission['seasonal_vaccine'], bins=20, kde=True, color='salmon')
plt.title('Predicted Probabilities - Seasonal Vaccine')
plt.xlabel('Probability')
plt.ylabel('Count')

plt.tight_layout()
plt.show()