# Vaccine Uptake Prediction Model
## DataHack, IIT Guwahati

This notebook outlines the process of predicting vaccine uptake using logistic regression.
The dataset includes demographic and health information from individuals to predict their likelihood of receiving vaccines.

GitHub link: https://github.com/Suppu2004/DataHack

In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Loading the data
# Training data
train_data = pd.read_csv('/Users/suppu/Downloads/dataset and all/training_set_features.csv')
train_labels = pd.read_csv('/Users/suppu/Downloads/dataset and all/training_set_labels.csv')

# Testing data
test_data = pd.read_csv('/Users/suppu/Downloads/dataset and all/test_set_features.csv')

# Print column names to ensure 'respondent_id' is present
print("Train Data Columns:", train_data.columns.tolist())
print("Train Labels Columns:", train_labels.columns.tolist())
print("Test Data Columns:", test_data.columns.tolist())

## DATA PREPROCESSING
Convert respondent_id to int in case it was read as float, then inspect the first rows and count the missing values in each column.

In [2]:
train_data['respondent_id'] = train_data['respondent_id'].astype(int)
train_labels['respondent_id'] = train_labels['respondent_id'].astype(int)
test_data['respondent_id'] = test_data['respondent_id'].astype(int)

# Display the first few rows of the training features and labels to understand their structure
print(train_data.head())
print(train_labels.head())

# Check for missing values in each column
print(train_data.isnull().sum())
print(train_labels.isnull().sum())
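
The cell above casts the ids but does not compare them across tables. A minimal sketch of such a check follows, using toy frames with made-up values in place of train_data and train_labels; the names features, labels, missing_labels and missing_features are illustrative and not part of the notebook.

```python
import pandas as pd

# Toy stand-ins for train_data and train_labels (hypothetical values).
features = pd.DataFrame({"respondent_id": [0.0, 1.0, 2.0, 3.0],
                         "age": [30, 41, 25, 52]})
labels = pd.DataFrame({"respondent_id": [0, 1, 2, 4],
                       "seasonal_vaccine": [0, 1, 0, 1]})

# astype(int) truncates values such as 2.7 and raises on NaN,
# so confirm the ids are whole numbers before casting.
ids = features["respondent_id"]
if ids.isna().any() or (ids != ids.round()).any():
    raise ValueError("respondent_id contains missing or non-integer values.")
features["respondent_id"] = ids.astype(int)

# Ids that appear in one table but not in the other.
feature_ids = set(features["respondent_id"])
label_ids = set(labels["respondent_id"])
missing_labels = sorted(feature_ids - label_ids)
missing_features = sorted(label_ids - feature_ids)

print("Ids without labels:", missing_labels)
print("Ids without features:", missing_features)
```

If either list is non-empty, an inner merge will silently drop those respondents, so it is worth knowing how many there are before merging.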

## Merging the data
Combining feature set and label set into a single dataframe

In [3]:
# Merge training data with labels on 'respondent_id'
train_merged = pd.merge(train_data, train_labels, on='respondent_id', how='inner')
print("Columns after merge:", train_merged.columns.tolist())
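
An inner merge discards unmatched rows without reporting them. One way to audit the join, sketched here on toy data with made-up values, is to use the validate and indicator parameters of pd.merge. The frame names features, labels, audit and unmatched are assumptions for illustration.

```python
import pandas as pd

# Toy stand-ins for the feature and label tables (hypothetical values).
features = pd.DataFrame({"respondent_id": [0, 1, 2], "age": [30, 41, 25]})
labels = pd.DataFrame({"respondent_id": [0, 1, 3], "xyz_vaccine": [0, 1, 0]})

# validate="one_to_one" raises MergeError if an id is duplicated on either side.
# indicator=True adds a "_merge" column saying which table each row came from.
audit = pd.merge(features, labels, on="respondent_id", how="outer",
                 validate="one_to_one", indicator=True)

# Rows that exist in only one of the two tables.
unmatched = audit[audit["_merge"] != "both"]
print(unmatched[["respondent_id", "_merge"]])

# Keeping only the matched rows reproduces the result of an inner merge.
merged = audit[audit["_merge"] == "both"].drop(columns="_merge")
```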

## Handling missing values
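We fill missing values in the numeric columns with each column's median from the training data and apply the same training medians to the test set. Categorical columns are not imputed in this step.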

In [4]:
train_merged.fillna(train_merged.median(numeric_only=True), inplace=True)
test_data.fillna(train_merged.median(numeric_only=True), inplace=True)
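
Because the median fill only touches numeric columns, missing categorical values are still present afterwards. The sketch below shows one possible treatment, assuming an explicit "Missing" category is acceptable for the model. The column names and values are made up, and the placeholder label is an assumption rather than something the notebook does.

```python
import numpy as np
import pandas as pd

# Toy frames with one numeric and one categorical column (hypothetical names).
train = pd.DataFrame({
    "concern_level": [1.0, np.nan, 3.0, 2.0],
    "education": ["College", None, "12 Years", "College"],
})
test = pd.DataFrame({
    "concern_level": [np.nan, 2.0],
    "education": [None, "12 Years"],
})

# Numeric columns: medians come from the training data only.
medians = train.median(numeric_only=True)
train = train.fillna(medians)
test = test.fillna(medians)

# Non-numeric columns: mark missing entries with an explicit category.
cat_cols = train.select_dtypes(exclude=["number"]).columns
train[cat_cols] = train[cat_cols].fillna("Missing")
test[cat_cols] = test[cat_cols].fillna("Missing")

print(train)
print(test)
```

With this choice, one-hot encoding later produces a separate indicator column for missing values instead of encoding them as all zeros.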

## Encoding categorical variables
Conversion of categorical variables into a suitable numerical format using one-hot encoding

In [5]:
categorical_vars = train_merged.select_dtypes(include=['object']).columns.tolist()
if 'respondent_id' in categorical_vars:
    categorical_vars.remove('respondent_id')
train_merged = pd.get_dummies(train_merged, columns=categorical_vars, drop_first=True)
test_data = pd.get_dummies(test_data, columns=categorical_vars, drop_first=True)

# Align features
train_merged, test_data = train_merged.align(test_data, join='outer', axis=1, fill_value=0)
test_data = test_data.drop(columns=['xyz_vaccine', 'seasonal_vaccine'], errors='ignore')

# Fail if any object-dtype column survived the encoding step
if not train_merged.select_dtypes(include=['object']).columns.empty:
    raise ValueError("Object-dtype columns remain in the training features after encoding.")

# Print the columns to confirm 'respondent_id' is still present
print("Test Data Columns after alignment:", test_data.columns.tolist())
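
Encoding the two sets separately with drop_first=True works when both contain the same categories. If they differ, each set may drop a different baseline level. A hedged alternative, sketched on toy data, is to encode the test set without dropping a level and then reindex it to the training columns. The region column and its values are made up for illustration.

```python
import pandas as pd

# Toy frames: the test set has a category ("west") never seen in training.
train = pd.DataFrame({"respondent_id": [0, 1, 2],
                      "region": ["north", "south", "east"]})
test = pd.DataFrame({"respondent_id": [3, 4],
                     "region": ["south", "west"]})

# The training set decides which level is the dropped baseline ("east" here).
train_enc = pd.get_dummies(train, columns=["region"], drop_first=True)

# Encode every test level, then keep exactly the training columns in order.
# Columns missing from the test set are filled with 0; unseen levels are dropped.
test_enc = pd.get_dummies(test, columns=["region"])
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(train_enc.columns.tolist())
print(test_enc)
```

The trade-off is that an unseen test category becomes indistinguishable from the baseline level, which is usually acceptable for a linear model but worth knowing about.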

## Model training
Train one logistic regression model per target (xyz_vaccine and seasonal_vaccine) on the same feature matrix

In [6]:
if 'xyz_vaccine' not in train_merged.columns or 'seasonal_vaccine' not in train_merged.columns:
    raise ValueError("One or more target columns are missing from the training data.")

# Prepare the features and targets for training
X_train = train_merged.drop(['respondent_id', 'xyz_vaccine', 'seasonal_vaccine'], axis=1, errors='ignore')
y_train_xyz = train_merged['xyz_vaccine']
y_train_seasonal = train_merged['seasonal_vaccine']

# Initialize and train logistic regression models
model_xyz = LogisticRegression(max_iter=1000)
model_xyz.fit(X_train, y_train_xyz)

model_seasonal = LogisticRegression(max_iter=1000)
model_seasonal.fit(X_train, y_train_seasonal)
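
The notebook imports roc_auc_score but fits on all labelled rows, so no score is computed. A minimal sketch of a hold-out check is shown below on synthetic data from make_classification. The 80/20 split, the random seed and the variable names are assumptions, not settings taken from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary problem standing in for one vaccine target.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Hold out 20% of the rows, keeping the class balance in both parts.
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000)
model.fit(X_fit, y_fit)

# ROC AUC is computed from predicted probabilities of the positive class.
val_auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
print(f"Validation ROC AUC: {val_auc:.3f}")
```

Applied to the real data, the same pattern would be run once per target, and the final models could then be refitted on all rows before predicting on the test set.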

## Predictions and submission file

In [7]:
# Prepare test features
X_test = test_data.drop('respondent_id', axis=1, errors='ignore')

# Predict probabilities for the test data
final_probabilities_xyz = model_xyz.predict_proba(X_test)[:, 1]
final_probabilities_seasonal = model_seasonal.predict_proba(X_test)[:, 1]

# Prepare the submission DataFrame
submission = pd.DataFrame({
    'respondent_id': test_data['respondent_id'],
    'xyz_vaccine_label': final_probabilities_xyz,
    'seasonal_vaccine_label': final_probabilities_seasonal
})

# Save the submission file
submission.to_csv('vaccine_predictions.csv', index=False)

print("The final predictions for the vaccine uptake have been saved in the file 'vaccine_predictions.csv'.")
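
Before uploading, a few cheap checks on the submission frame can catch alignment mistakes. The sketch below uses a toy frame with made-up ids and probabilities that mirrors the column names used above. The checks dictionary is illustrative and not part of the notebook.

```python
import pandas as pd

# Toy submission with the same columns as the notebook's output (hypothetical values).
submission = pd.DataFrame({
    "respondent_id": [26707, 26708, 26709],
    "xyz_vaccine_label": [0.12, 0.80, 0.33],
    "seasonal_vaccine_label": [0.45, 0.91, 0.07],
})

prob_cols = ["xyz_vaccine_label", "seasonal_vaccine_label"]
checks = {
    # Each respondent should appear exactly once.
    "ids_unique": bool(submission["respondent_id"].is_unique),
    # No missing ids or probabilities.
    "no_missing": not submission.isna().any().any(),
    # Probabilities must lie in the closed interval [0, 1].
    "probs_in_range": bool(
        submission[prob_cols].apply(lambda col: col.between(0, 1)).all().all()
    ),
}
print(checks)
```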