# H1N1 and Seasonal Flu Vaccine Prediction: Project Plan

## 1. Business Understanding

This initial phase focuses on understanding the project's objectives and requirements from a public health perspective, translating that knowledge into a data mining problem definition.

### a.) Background

Influenza, commonly known as the flu, is a contagious respiratory illness caused by influenza viruses. It can cause mild to severe illness, and at times can lead to death. Seasonal flu is a recurring annual threat, while pandemic flu, like the H1N1 swine flu in 2009, represents a novel virus to which most people have no immunity.

**Vaccination is the most effective way to prevent influenza** and its severe outcomes. 💉 Public health organizations, like the Centers for Disease Control and Prevention (CDC), invest heavily in annual vaccination campaigns. However, the success of these campaigns depends on public willingness to get vaccinated, often referred to as "vaccine uptake."

The data provided is from the **National 2009 H1N1 Flu Survey (NHFS)**, a telephone survey conducted in the United States. It captures a snapshot of public opinion, knowledge, and behavior during the H1N1 pandemic. This includes demographic information, personal health behaviors (like hand washing), opinions about vaccine effectiveness and risk, and ultimately, whether the respondent received the H1N1 and seasonal flu vaccines.

### b.) Problem Statement

Public health officials face the significant challenge of effectively allocating limited resources (vaccines, personnel, marketing funds) to maximize vaccination rates and protect the population during seasonal flu seasons and pandemics. A blanket approach to public health campaigns is often inefficient. Therefore, the core problem is: **How can we identify and characterize the segments of the population that are least likely to be vaccinated against H1N1 and seasonal flu, in order to design targeted and effective public health interventions?**

### c.) Objectives

The primary goal is to leverage machine learning to predict vaccination behavior. The specific objectives are:

1.  **Exploratory Data Analysis (EDA):** To use visualizations and statistical analysis to uncover patterns, correlations, and insights within the data. This helps in understanding the key factors influencing vaccination decisions.
2.  **Predictive Modeling:** To build and evaluate a robust machine learning model that accurately predicts the likelihood of an individual receiving the `h1n1_vaccine` and the `seasonal_vaccine` based on their survey responses.
3.  **Feature Importance Analysis:** To identify the most influential predictors of vaccination behavior. For example, is a doctor's recommendation more impactful than a person's perceived risk of getting sick?
4.  **Actionable Insights:** To provide clear, data-driven recommendations that can help public health stakeholders refine their communication strategies and outreach efforts.

### d.) Stakeholders

Several groups will benefit from the outcomes of this project:

* **Public Health Organizations (e.g., CDC, WHO, state health departments):** They can use the model's insights to create targeted awareness campaigns, address specific concerns (like vaccine safety), and allocate resources to areas or demographic groups with predicted low uptake.
* **Healthcare Providers (Hospitals, Clinics):** The findings can help doctors and nurses better understand patient hesitancy and tailor their recommendations during consultations.
* **Government Agencies:** For strategic planning, policymaking, and resource management related to public health emergencies.
* **Researchers:** To gain a deeper understanding of the social and behavioral drivers behind public health compliance.

### e.) Metrics of Success

A successful project will be measured by both its technical performance and its practical utility.

* **Model Performance Metrics:** The model's success will be evaluated using standard classification metrics. Given that we want to correctly identify people in both vaccinated and non-vaccinated groups, a balanced metric is crucial.
    * **AUC - ROC Score:** The Area Under the Receiver Operating Characteristic Curve is an excellent primary metric. It measures the model's ability to distinguish between the positive and negative classes across all thresholds. An AUC score above **0.80** would be considered a good result.
    * **Accuracy:** The overall percentage of correct predictions. An accuracy significantly better than the baseline (predicting the majority class) is a minimum requirement.
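    * **Precision and Recall:** These are important for understanding trade-offs. High **recall** for the "vaccinated" class would mean the model finds most of the people who got the vaccine. Because the problem statement targets people who are unlikely to be vaccinated, recall for the "not vaccinated" class (`0`) matters at least as much, while **precision** shows how many of the flagged respondents truly belong to the flagged class.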
* **Project Impact Metrics:** The ultimate success is the model's ability to provide **interpretable and actionable insights**. A successful project will deliver a report identifying the top 5-10 factors most strongly correlated with vaccination behavior, which can be directly used by stakeholders to inform their strategies.
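
The metrics listed above can be computed with scikit-learn as in the following minimal sketch. The labels and probabilities are toy values chosen for illustration (they are not survey data), and the 0.5 decision threshold is an assumption.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, roc_auc_score)

# Toy ground truth and predicted probabilities (illustrative values only)
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_prob = np.array([0.1, 0.2, 0.3, 0.4, 0.6, 0.2, 0.7, 0.8, 0.35, 0.9])
y_pred = (y_prob >= 0.5).astype(int)  # assumed threshold of 0.5

# Baseline accuracy: always predict the majority class
majority_class = np.bincount(y_true).argmax()
baseline_acc = accuracy_score(y_true, np.full_like(y_true, majority_class))

auc = roc_auc_score(y_true, y_prob)        # threshold-independent ranking quality
acc = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)      # recall for the "vaccinated" class (1)
recall_not_vacc = recall_score(y_true, y_pred, pos_label=0)  # recall for class 0

print(f"baseline accuracy: {baseline_acc:.2f}, model accuracy: {acc:.2f}")
print(f"AUC: {auc:.3f}, precision: {precision:.2f}, recall: {recall:.2f}")
print(f"recall (not vaccinated): {recall_not_vacc:.2f}")
```

Comparing `acc` against `baseline_acc` makes the "better than the majority class" requirement concrete, and `recall_not_vacc` tracks the group that interventions would target.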

---

## 2. Data Understanding

This phase involves becoming familiar with the data, identifying potential quality issues, and gaining initial insights. This follows the **CRISP-DM** methodology, where we move from business understanding to a deep dive into the provided dataset.

### Data Description and Structure

The dataset, `cleannew.csv`, contains **26,707 rows** and **38 columns**.

* **Each row** represents a unique survey respondent.
* **Each column** represents a feature, which is an answer to a survey question, a demographic characteristic, or one of the target variables.
* **Target Variables:** The two columns we aim to predict are `h1n1_vaccine` and `seasonal_vaccine`. Both are **binary**, where `1` means the respondent received the vaccine and `0` means they did not. This defines our task as two **binary classification problems**, one per vaccine, which can also be framed as a single multilabel problem.
* **Data Types:** The dataset is a mix of numerical and categorical data.
    * **Numerical:** Many features are encoded as numbers (e.g., `h1n1_concern` from 0-3) which represent ordinal scales.
    * **Categorical:** Features like `age_group`, `race`, and `employment_status` are represented as strings and will require encoding for use in most machine learning models.
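
Because there are two binary targets, one option is to fit one classifier per target behind a single interface. The sketch below uses scikit-learn's `MultiOutputClassifier` on synthetic data; the feature matrix, the noise level, and the choice of logistic regression are all assumptions made for illustration, not the modeling approach used later in this project.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

rng = np.random.default_rng(0)
n_samples = 200

# Synthetic stand-in for the encoded survey features
X = rng.normal(size=(n_samples, 4))

# Two synthetic binary targets, standing in for h1n1_vaccine and seasonal_vaccine
y = np.column_stack([
    (X[:, 0] + rng.normal(scale=0.5, size=n_samples) > 0).astype(int),
    (X[:, 1] + rng.normal(scale=0.5, size=n_samples) > 0).astype(int),
])

# One independent logistic regression is fitted per target column
clf = MultiOutputClassifier(LogisticRegression(max_iter=1000))
clf.fit(X, y)

# predict_proba returns one (n_samples, 2) array per target
proba = clf.predict_proba(X)
h1n1_prob = proba[0][:, 1]      # P(target 1 == 1)
seasonal_prob = proba[1][:, 1]  # P(target 2 == 1)
print(h1n1_prob[:3], seasonal_prob[:3])
```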

### Features and Description

| Feature Name | Description | Data Type / Values |
| :--- | :--- | :--- |
| **Identifiers** | | |
| `respondent_id` | Unique identifier for each respondent. | Numerical (Integer) |
| **Behavioral Features** | | |
| `h1n1_concern` | Level of concern about H1N1 flu. | Ordinal (`0`-`3`: Not at all concerned to Very concerned) |
| `h1n1_knowledge` | Level of knowledge about H1N1 flu. | Ordinal (`0`-`2`: No knowledge to A lot of knowledge) |
| `behavioral_antiviral_meds` | Has taken antiviral medications. | Binary (`0`/`1`) |
| `behavioral_avoidance` | Has avoided close contact with sick people. | Binary (`0`/`1`) |
| `behavioral_face_mask` | Has bought a face mask. | Binary (`0`/`1`) |
| `behavioral_wash_hands` | Has frequently washed hands or used hand sanitizer. | Binary (`0`/`1`) |
| `behavioral_large_gatherings`| Has reduced time at large gatherings. | Binary (`0`/`1`) |
| `behavioral_outside_home` | Has reduced contact with people outside own household. | Binary (`0`/`1`) |
| `behavioral_touch_face` | Has avoided touching eyes, nose, or mouth. | Binary (`0`/`1`) |
| **Health & Medical Features** | | |
| `doctor_recc_h1n1` | H1N1 vaccine was recommended by a doctor. | Binary (`0`/`1`) |
| `doctor_recc_seasonal` | Seasonal flu vaccine was recommended by a doctor. | Binary (`0`/`1`) |
| `chronic_med_condition` | Has a chronic medical condition. | Binary (`0`/`1`) |
| `child_under_6_months` | Has regular contact with a child under 6 months. | Binary (`0`/`1`) |
| `health_worker` | Is a healthcare worker. | Binary (`0`/`1`) |
| `health_insurance` | Has health insurance. | Binary (`0`/`1`) |
| **Opinion Features** | | |
| `opinion_h1n1_vacc_effective`| Opinion about H1N1 vaccine effectiveness. | Ordinal (`1`-`5`: Not effective at all to Very effective) |
| `opinion_h1n1_risk` | Opinion about risk of getting sick from H1N1 flu. | Ordinal (`1`-`5`: Very low to Very high) |
| `opinion_h1n1_sick_from_vacc`| Concern about getting sick from the H1N1 vaccine. | Ordinal (`1`-`5`: Not at all concerned to Very concerned) |
| `opinion_seas_vacc_effective`| Opinion about seasonal flu vaccine effectiveness. | Ordinal (`1`-`5`: Not effective at all to Very effective) |
| `opinion_seas_risk` | Opinion about risk of getting sick from seasonal flu. | Ordinal (`1`-`5`: Very low to Very high) |
| `opinion_seas_sick_from_vacc`| Concern about getting sick from the seasonal flu vaccine. | Ordinal (`1`-`5`: Not at all concerned to Very concerned) |
| **Demographic Features** | | |
| `age_group` | Age group of the respondent. | Categorical (e.g., '18 - 34 Years') |
| `education` | Education level of the respondent. | Categorical (e.g., '< 12 Years', 'College Graduate') |
| `race` | Race of the respondent. | Categorical (e.g., 'White', 'Black') |
| `sex` | Sex of the respondent. | Categorical ('Female', 'Male') |
| `income_poverty` | Annual household income in relation to poverty level. | Categorical (e.g., 'Below Poverty', '> $75,000') |
| `marital_status` | Marital status of the respondent. | Categorical ('Married', 'Not Married') |
| `rent_or_own` | Housing situation of the respondent. | Categorical ('Own', 'Rent') |
| `employment_status` | Employment status of the respondent. | Categorical (e.g., 'Employed', 'Not in Labor Force') |
| `hhs_geo_region` | HHS geographic region. | Categorical (e.g., 'oxchjgsf') |
| `census_msa` | Residence in a Metropolitan Statistical Area. | Categorical ('MSA, Principle City', 'Non-MSA') |
| `household_adults` | Number of other adults in the household. | Numerical (Integer) |
| `household_children` | Number of children in the household. | Numerical (Integer) |
| `employment_industry` | Industry of employment (if employed). | Categorical (e.g., 'fcxhlnwr') |
| `employment_occupation` | Occupation (if employed). | Categorical (e.g., 'xtkaffoo') |
| **Target Variables** | | |
| `h1n1_vaccine` | **(Target)** Whether respondent received H1N1 vaccine. | **Binary (`0`/`1`)** |
| `seasonal_vaccine` | **(Target)** Whether respondent received seasonal flu vaccine. | **Binary (`0`/`1`)** |

### 2.1 Initial Data Exploration

In [8]:
# import libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

In [10]:
# loading the train set labels
df = pd.read_csv("training_set_labels.csv")
df.head()

| | respondent_id | h1n1_vaccine | seasonal_vaccine |
| :--- | :--- | :--- | :--- |
| 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 1 |
| 2 | 2 | 0 | 0 |
| 3 | 3 | 0 | 1 |
| 4 | 4 | 0 | 0 |


In [12]:
# load trainset features
df1 = pd.read_csv("training_set_features.csv")
df1.head()

| | respondent_id | h1n1_concern | h1n1_knowledge | behavioral_antiviral_meds | behavioral_avoidance | behavioral_face_mask | behavioral_wash_hands | behavioral_large_gatherings | behavioral_outside_home | behavioral_touch_face | ... | income_poverty | marital_status | rent_or_own | employment_status | hhs_geo_region | census_msa | household_adults | household_children | employment_industry | employment_occupation |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 0 | 0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | Below Poverty | Not Married | Own | Not in Labor Force | oxchjgsf | Non-MSA | 0.0 | 0.0 | NaN | NaN |
| 1 | 1 | 3.0 | 2.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | ... | Below Poverty | Not Married | Rent | Employed | bhuqouqj | MSA, Not Principle City | 0.0 | 0.0 | pxcmvdjn | xgwztkwe |
| 2 | 2 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | <= $75,000, Above Poverty | Not Married | Own | Employed | qufhixun | MSA, Not Principle City | 2.0 | 0.0 | rucpziij | xtkaffoo |
| 3 | 3 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | ... | Below Poverty | Not Married | Rent | Not in Labor Force | lrircsnp | MSA, Principle City | 0.0 | 0.0 | NaN | NaN |
| 4 | 4 | 2.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | ... | <= $75,000, Above Poverty | Married | Own | Employed | qufhixun | MSA, Not Principle City | 1.0 | 0.0 | wxleyezf | emcorrxb |


In [4]:
df2 = pd.read_csv("test_set_features.csv")
df2.shape 

(26708, 36)

In [14]:
# check the shape

print(f"The dataset has {df1.shape[0]} records and {df1.shape[1]} columns")

The dataset has 26707 records and 36 columns
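
The features file has 36 columns and the labels file adds two targets, both keyed by `respondent_id`, so joining them yields the 38 columns quoted in the data description. The sketch below shows such a join on tiny stand-in frames; whether `cleannew.csv` was produced this way is an assumption, and the frame names `features` and `labels` are illustrative.

```python
import pandas as pd

# Tiny stand-ins for training_set_features.csv and training_set_labels.csv
features = pd.DataFrame({
    "respondent_id": [0, 1, 2],
    "h1n1_concern": [1.0, 3.0, 1.0],
    "age_group": ["55 - 64 Years", "35 - 44 Years", "18 - 34 Years"],
})
labels = pd.DataFrame({
    "respondent_id": [0, 1, 2],
    "h1n1_vaccine": [0, 0, 0],
    "seasonal_vaccine": [0, 1, 0],
})

# validate="one_to_one" raises if respondent_id is duplicated on either side
merged = features.merge(labels, on="respondent_id", how="inner",
                        validate="one_to_one")
print(merged.shape)
```

With the real files loaded as `df1` (features) and `df` (labels), the same call would be `df1.merge(df, on="respondent_id", validate="one_to_one")`.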


In [16]:
# check the summary information
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26707 entries, 0 to 26706
Data columns (total 36 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   respondent_id                26707 non-null  int64  
 1   h1n1_concern                 26615 non-null  float64
 2   h1n1_knowledge               26591 non-null  float64
 3   behavioral_antiviral_meds    26636 non-null  float64
 4   behavioral_avoidance         26499 non-null  float64
 5   behavioral_face_mask         26688 non-null  float64
 6   behavioral_wash_hands        26665 non-null  float64
 7   behavioral_large_gatherings  26620 non-null  float64
 8   behavioral_outside_home      26625 non-null  float64
 9   behavioral_touch_face        26579 non-null  float64
 10  doctor_recc_h1n1             24547 non-null  float64
 11  doctor_recc_seasonal         24547 non-null  float64
 12  chronic_med_condition        25736 non-null  float64
 13  child_under_6_mo

In [18]:
#checking the summary statistics for categorical features
df1.describe(include="O").T

| | count | unique | top | freq |
| :--- | :--- | :--- | :--- | :--- |
| age_group | 26707 | 5 | 65+ Years | 6843 |
| education | 25300 | 4 | College Graduate | 10097 |
| race | 26707 | 4 | White | 21222 |
| sex | 26707 | 2 | Female | 15858 |
| income_poverty | 22284 | 3 | <= $75,000, Above Poverty | 12777 |
| marital_status | 25299 | 2 | Married | 13555 |
| rent_or_own | 24665 | 2 | Own | 18736 |
| employment_status | 25244 | 3 | Employed | 13560 |
| hhs_geo_region | 26707 | 10 | lzgpxyit | 4297 |
| census_msa | 26707 | 3 | MSA, Not Principle City | 11645 |


In [6]:
df1.isnull().sum()

respondent_id                      0
h1n1_concern                      92
h1n1_knowledge                   116
behavioral_antiviral_meds         71
behavioral_avoidance             208
behavioral_face_mask              19
behavioral_wash_hands             42
behavioral_large_gatherings       87
behavioral_outside_home           82
behavioral_touch_face            128
doctor_recc_h1n1                2160
doctor_recc_seasonal            2160
chronic_med_condition            971
child_under_6_months             820
health_worker                    804
health_insurance               12274
opinion_h1n1_vacc_effective      391
opinion_h1n1_risk                388
opinion_h1n1_sick_from_vacc      395
opinion_seas_vacc_effective      462
opinion_seas_risk                514
opinion_seas_sick_from_vacc      537
age_group                          0
education                       1407
race                               0
sex                                0
income_poverty                  4423
m

#### Observation:
1. The dataset comprises both numerical and categorical columns.
2. Many columns contain missing values, some heavily (`health_insurance` is missing in 12,274 of 26,707 rows). These will need to be handled in the data preparation phase.
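
Raw counts from `isnull().sum()` are easier to act on when expressed as percentages and ranked. The following is a minimal sketch on a toy frame; `missing_summary` is a hypothetical helper name, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with the same kinds of gaps as the survey data (illustrative values)
toy = pd.DataFrame({
    "health_insurance": [1.0, np.nan, np.nan, 0.0],
    "h1n1_concern": [1.0, 2.0, np.nan, 3.0],
    "age_group": ["65+ Years", "18 - 34 Years", "65+ Years", "45 - 54 Years"],
})

def missing_summary(df):
    """Return missing count and percentage per column, largest first."""
    summary = pd.DataFrame({
        "n_missing": df.isnull().sum(),
        "pct_missing": (df.isnull().mean() * 100).round(1),
    })
    # Keep only columns that actually have gaps
    return summary[summary["n_missing"] > 0].sort_values(
        "n_missing", ascending=False)

report = missing_summary(toy)
print(report)
```

Calling `missing_summary(df1)` on the real features would rank the survey columns the same way, which helps decide which ones need more than simple mode imputation.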

## 3. Data Preparation

Following our data understanding phase, we now transition into the Data Preparation Stage of the CRISP-DM methodology. This phase is crucial in transforming raw data into a clean and structured format that can be used effectively in analysis and modeling.

This section includes:
1. Selection of relevant data
2. Data Cleaning
3. Feature Engineering

### 3.1 Data Cleaning

In this stage, we focus on preparing the raw dataset for analysis and modeling.  
The main goals are:  

- Handle missing and invalid values.  
- Ensure correct data types.  
- Standardize categorical values to reduce noise.  
- Remove duplicates (if any).  
- Create consistent, analysis-ready features.  

This ensures that the data used in later stages is accurate, reliable, and suitable for machine learning models.

In [18]:
# work on a deep copy so the raw features in df1 stay untouched during cleaning and feature engineering
df_1 = df1.copy(deep=True)

In [19]:
# fill columns with few missing values (roughly 200 or fewer each) using the column mode
low_missing_cols = [
    "h1n1_concern", "h1n1_knowledge", "behavioral_antiviral_meds",
    "behavioral_avoidance", "behavioral_face_mask", "behavioral_wash_hands",
    "behavioral_large_gatherings", "behavioral_outside_home", "behavioral_touch_face",
]
for col in low_missing_cols:
    df_1[col] = df_1[col].fillna(df_1[col].mode()[0])

In [20]:
# check on the filled nulls so far
df_1.isnull().sum()

respondent_id                      0
h1n1_concern                       0
h1n1_knowledge                     0
behavioral_antiviral_meds          0
behavioral_avoidance               0
behavioral_face_mask               0
behavioral_wash_hands              0
behavioral_large_gatherings        0
behavioral_outside_home            0
behavioral_touch_face              0
doctor_recc_h1n1                2160
doctor_recc_seasonal            2160
chronic_med_condition            971
child_under_6_months             820
health_worker                    804
health_insurance               12274
opinion_h1n1_vacc_effective      391
opinion_h1n1_risk                388
opinion_h1n1_sick_from_vacc      395
opinion_seas_vacc_effective      462
opinion_seas_risk                514
opinion_seas_sick_from_vacc      537
age_group                          0
education                       1407
race                               0
sex                                0
income_poverty                  4423
m

In [21]:
# create a fully mode-imputed copy of the data
# it serves as training data for the models that later predict the missing values in df_1 (regression imputation)
data1 = df_1.copy(deep=True)

In [22]:
# carry out simple imputation by filling nulls with mode
for col in data1.columns:
    data1[col] = data1[col].fillna(data1[col].mode()[0])

In [23]:
data1.isnull().sum()

respondent_id                  0
h1n1_concern                   0
h1n1_knowledge                 0
behavioral_antiviral_meds      0
behavioral_avoidance           0
behavioral_face_mask           0
behavioral_wash_hands          0
behavioral_large_gatherings    0
behavioral_outside_home        0
behavioral_touch_face          0
doctor_recc_h1n1               0
doctor_recc_seasonal           0
chronic_med_condition          0
child_under_6_months           0
health_worker                  0
health_insurance               0
opinion_h1n1_vacc_effective    0
opinion_h1n1_risk              0
opinion_h1n1_sick_from_vacc    0
opinion_seas_vacc_effective    0
opinion_seas_risk              0
opinion_seas_sick_from_vacc    0
age_group                      0
education                      0
race                           0
sex                            0
income_poverty                 0
marital_status                 0
rent_or_own                    0
employment_status              0
hhs_geo_re

In [24]:
# identify categorical data
dataO = data1.select_dtypes(object)

In [25]:
dataO.columns

Index(['age_group', 'education', 'race', 'sex', 'income_poverty',
       'marital_status', 'rent_or_own', 'employment_status', 'hhs_geo_region',
       'census_msa', 'employment_industry', 'employment_occupation'],
      dtype='object')

In [35]:
# check the unique values of each categorical column
# this helps us decide how to map and encode the data

In [9]:
df1["age_group"].unique()

array(['55 - 64 Years', '35 - 44 Years', '18 - 34 Years', '65+ Years',
       '45 - 54 Years'], dtype=object)

In [10]:
df1["race"].unique()

array(['White', 'Black', 'Other or Multiple', 'Hispanic'], dtype=object)

In [11]:
df1["income_poverty"].unique()

array(['Below Poverty', '<= $75,000, Above Poverty', '> $75,000', nan],
      dtype=object)

In [12]:
df1["marital_status"].unique()

array(['Not Married', 'Married', nan], dtype=object)

In [13]:
df1["rent_or_own"].unique()

array(['Own', 'Rent', nan], dtype=object)

In [14]:
df1["employment_status"].unique()

array(['Not in Labor Force', 'Employed', 'Unemployed', nan], dtype=object)

In [15]:
df1["census_msa"].unique()

array(['Non-MSA', 'MSA, Not Principle  City', 'MSA, Principle City'],
      dtype=object)

In [16]:
df1["education"].unique()

array(['< 12 Years', '12 Years', 'College Graduate', 'Some College', nan],
      dtype=object)

In [17]:
df1["hhs_geo_region"].unique()

array(['oxchjgsf', 'bhuqouqj', 'qufhixun', 'lrircsnp', 'atmpeygn',
       'lzgpxyit', 'fpwskwrf', 'mlyzmhmf', 'dqpwygqj', 'kbazzjca'],
      dtype=object)
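
The per-column checks above can also be collected in one pass. This is a small sketch on a toy frame; selecting columns with `exclude="number"` is a choice made here so that the snippet picks up text columns regardless of how pandas stores them.

```python
import numpy as np
import pandas as pd

# Toy frame: two text columns and one numeric column (illustrative values)
toy = pd.DataFrame({
    "sex": ["Female", "Male", "Female"],
    "rent_or_own": ["Own", np.nan, "Rent"],
    "household_adults": [0.0, 2.0, 1.0],
})

# Collect the distinct values of every non-numeric column
unique_values = {
    col: toy[col].unique().tolist()
    for col in toy.select_dtypes(exclude="number").columns
}

for col, values in unique_values.items():
    print(f"{col}: {values}")
```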

In [26]:
# drop columns judged irrelevant after review; their values are anonymized codes (e.g. 'oxchjgsf')
data1.drop(['employment_industry', 'employment_occupation','hhs_geo_region'], axis=1, inplace=True)

In [27]:
# mapping categorical columns into ordinal data
age_map = {'18 - 34 Years':0, '35 - 44 Years':1, '45 - 54 Years':2, '55 - 64 Years':3, '65+ Years':4}
inc_map = {'Below Poverty':0, '<= $75,000, Above Poverty':1, '> $75,000':2}
# note: both MSA categories map to 1, so census_msa becomes a binary MSA vs. Non-MSA flag
cen_map = {'Non-MSA':0, 'MSA, Not Principle  City':1, 'MSA, Principle City':1}
edu_map = {'< 12 Years':0, '12 Years':1, 'Some College':2, 'College Graduate':3}

In [28]:
# execution of the mapped data
data1['age_group'] = data1['age_group'].map(age_map)
data1['income_poverty'] = data1['income_poverty'].map(inc_map)
data1['census_msa'] = data1['census_msa'].map(cen_map)
data1['education'] = data1['education'].map(edu_map)
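
`Series.map` silently returns `NaN` for any value that has no key in the dictionary, so a small spelling difference (for example a single versus double space in `'MSA, Not Principle  City'`) would quietly create new missing values. The sketch below adds a guard; `map_checked` is a hypothetical helper, and the two dictionaries are copied from the cell above.

```python
import numpy as np
import pandas as pd

age_map = {'18 - 34 Years': 0, '35 - 44 Years': 1, '45 - 54 Years': 2,
           '55 - 64 Years': 3, '65+ Years': 4}
cen_map = {'Non-MSA': 0, 'MSA, Not Principle  City': 1, 'MSA, Principle City': 1}

def map_checked(series, mapping):
    """Map values through `mapping`; raise if a non-null value has no key."""
    unmapped = set(series.dropna().unique()) - set(mapping)
    if unmapped:
        raise ValueError(
            f"Unmapped categories in {series.name!r}: {sorted(unmapped)}")
    return series.map(mapping)

# Known categories map cleanly; genuine NaNs stay NaN
ages = pd.Series(['18 - 34 Years', '65+ Years', np.nan], name="age_group")
mapped = map_checked(ages, age_map)
print(mapped.tolist())

# A single-space variant is not a key of cen_map, so the guard raises
try:
    map_checked(pd.Series(['MSA, Not Principle City'], name="census_msa"), cen_map)
except ValueError as err:
    print(err)
```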

In [29]:
# encoding the remaining categorical columns using get_dummies
col2 = dataO[['race', 'sex','rent_or_own', 'employment_status','marital_status']]
col2_ohe = pd.get_dummies(col2, drop_first=True, dtype=int)
col2_ohe[:3]

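| | race_Hispanic | race_Other or Multiple | race_White | sex_Male | rent_or_own_Rent | employment_status_Not in Labor Force | employment_status_Unemployed | marital_status_Not Married |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |
| 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 |
| 2 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 |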


In [30]:
# drop the categorical columns that have already been one-hot encoded
data1.drop(columns=col2.columns, inplace=True)

In [31]:
# drop respondent_id: it is only an identifier
# and carries no predictive signal for training a model
data1.drop(["respondent_id"], axis=1, inplace=True)

In [32]:
# merge the encoded data to the original data
mergedf = pd.concat([data1, col2_ohe], axis=1)
mergedf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26707 entries, 0 to 26706
Data columns (total 35 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   h1n1_concern                          26707 non-null  float64
 1   h1n1_knowledge                        26707 non-null  float64
 2   behavioral_antiviral_meds             26707 non-null  float64
 3   behavioral_avoidance                  26707 non-null  float64
 4   behavioral_face_mask                  26707 non-null  float64
 5   behavioral_wash_hands                 26707 non-null  float64
 6   behavioral_large_gatherings           26707 non-null  float64
 7   behavioral_outside_home               26707 non-null  float64
 8   behavioral_touch_face                 26707 non-null  float64
 9   doctor_recc_h1n1                      26707 non-null  float64
 10  doctor_recc_seasonal                  26707 non-null  float64
 11  chronic_med_con

In [35]:
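# list the columns that still contain missing values; the imputation itself happens in the cells below
missing_cols = df_1.columns[df_1.isnull().any()].to_list()
for col in missing_cols:
    print(f"--- Imputing column: {col} ---")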

--- Imputing column: doctor_recc_h1n1 ---
--- Imputing column: doctor_recc_seasonal ---
--- Imputing column: chronic_med_condition ---
--- Imputing column: child_under_6_months ---
--- Imputing column: health_worker ---
--- Imputing column: health_insurance ---
--- Imputing column: opinion_h1n1_vacc_effective ---
--- Imputing column: opinion_h1n1_risk ---
--- Imputing column: opinion_h1n1_sick_from_vacc ---
--- Imputing column: opinion_seas_vacc_effective ---
--- Imputing column: opinion_seas_risk ---
--- Imputing column: opinion_seas_sick_from_vacc ---
--- Imputing column: education ---
--- Imputing column: income_poverty ---
--- Imputing column: marital_status ---
--- Imputing column: rent_or_own ---
--- Imputing column: employment_status ---
--- Imputing column: household_adults ---
--- Imputing column: household_children ---
--- Imputing column: employment_industry ---
--- Imputing column: employment_occupation ---


In [36]:
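# Define features (X) and target (y) from the fully mode-imputed training data (mergedf)
# note: y_train therefore includes mode-filled values for rows where doctor_recc_h1n1 was originally missing
x_train = mergedf.drop(columns="doctor_recc_h1n1")
y_train = mergedf["doctor_recc_h1n1"]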

In [37]:
# fit a logistic regression model that predicts doctor_recc_h1n1 from the other columns
model = LogisticRegression(max_iter=1000, random_state=42)
model.fit(x_train, y_train)

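| Parameter | Value |
| :--- | :--- |
| `penalty` | `'l2'` |
| `dual` | `False` |
| `tol` | `0.0001` |
| `C` | `1.0` |
| `fit_intercept` | `True` |
| `intercept_scaling` | `1` |
| `class_weight` | `None` |
| `random_state` | `42` |
| `solver` | `'lbfgs'` |
| `max_iter` | `1000` |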


In [38]:
# copy of df_1 in which the remaining null values will be replaced by model predictions
df_imputed = df_1.copy(deep=True)

In [39]:
# apply the same ordinal mappings that were used for data1
df_imputed['age_group'] = df_imputed['age_group'].map(age_map)
df_imputed['income_poverty'] = df_imputed['income_poverty'].map(inc_map)
df_imputed['census_msa'] = df_imputed['census_msa'].map(cen_map)
df_imputed['education'] = df_imputed['education'].map(edu_map)

In [40]:
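# one-hot encode the nominal columns with get_dummies
# encode col1 (taken from df_imputed) rather than col2; rows with a missing category become all-zero dummies
col1 = df_imputed[['race', 'sex','rent_or_own', 'employment_status','marital_status']]
col1_ohe = pd.get_dummies(col1, drop_first=True, dtype=int)
col1_ohe[:3]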

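| | race_Hispanic | race_Other or Multiple | race_White | sex_Male | rent_or_own_Rent | employment_status_Not in Labor Force | employment_status_Unemployed | marital_status_Not Married |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |
| 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 |
| 2 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 |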


In [41]:
# drop the categorical columns that have just been one-hot encoded
df_imputed.drop(columns=col1.columns, inplace=True)

In [42]:
# drop irrelevant columns
df_imputed.drop(['hhs_geo_region','employment_industry','employment_occupation', "respondent_id" ], axis=1, inplace=True)
df_imputed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26707 entries, 0 to 26706
Data columns (total 27 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   h1n1_concern                 26707 non-null  float64
 1   h1n1_knowledge               26707 non-null  float64
 2   behavioral_antiviral_meds    26707 non-null  float64
 3   behavioral_avoidance         26707 non-null  float64
 4   behavioral_face_mask         26707 non-null  float64
 5   behavioral_wash_hands        26707 non-null  float64
 6   behavioral_large_gatherings  26707 non-null  float64
 7   behavioral_outside_home      26707 non-null  float64
 8   behavioral_touch_face        26707 non-null  float64
 9   doctor_recc_h1n1             24547 non-null  float64
 10  doctor_recc_seasonal         24547 non-null  float64
 11  chronic_med_condition        25736 non-null  float64
 12  child_under_6_months         25887 non-null  float64
 13  health_worker   

In [43]:
# join encoded columns to original dataset
df_imputed_encoded = pd.concat([df_imputed, col1_ohe], axis=1)
df_imputed_encoded.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26707 entries, 0 to 26706
Data columns (total 35 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   h1n1_concern                          26707 non-null  float64
 1   h1n1_knowledge                        26707 non-null  float64
 2   behavioral_antiviral_meds             26707 non-null  float64
 3   behavioral_avoidance                  26707 non-null  float64
 4   behavioral_face_mask                  26707 non-null  float64
 5   behavioral_wash_hands                 26707 non-null  float64
 6   behavioral_large_gatherings           26707 non-null  float64
 7   behavioral_outside_home               26707 non-null  float64
 8   behavioral_touch_face                 26707 non-null  float64
 9   doctor_recc_h1n1                      24547 non-null  float64
 10  doctor_recc_seasonal                  24547 non-null  float64
 11  chronic_med_con

In [44]:
# compare and make sure both datasets match
# mergedf will be the train dataset from which we will carry out regression imputation
mergedf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26707 entries, 0 to 26706
Data columns (total 35 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   h1n1_concern                          26707 non-null  float64
 1   h1n1_knowledge                        26707 non-null  float64
 2   behavioral_antiviral_meds             26707 non-null  float64
 3   behavioral_avoidance                  26707 non-null  float64
 4   behavioral_face_mask                  26707 non-null  float64
 5   behavioral_wash_hands                 26707 non-null  float64
 6   behavioral_large_gatherings           26707 non-null  float64
 7   behavioral_outside_home               26707 non-null  float64
 8   behavioral_touch_face                 26707 non-null  float64
 9   doctor_recc_h1n1                      26707 non-null  float64
 10  doctor_recc_seasonal                  26707 non-null  float64
 11  chronic_med_con

In [45]:
# Identify rows where the current column is missing
missing_mask = df_imputed_encoded["doctor_recc_h1n1"].isnull()

In [46]:
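# Get the features for these missing rows (explicit copy avoids pandas chained-assignment warnings).
x_predict = df_imputed_encoded[missing_mask].drop(columns="doctor_recc_h1n1").copy()

In [47]:
# A model cannot predict with NaN inputs. We use the training set's mode.
for feature_col in x_predict.columns:
    mode_val = x_train[feature_col].mode()[0]
    # assign the result back; fillna(..., inplace=True) on a selected column may not update x_predict
    x_predict[feature_col] = x_predict[feature_col].fillna(mode_val)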

In [48]:
predicted_values = model.predict(x_predict)

In [49]:
# Fill the NaNs in our target dataframe with the predictions
df_imputed_encoded.loc[missing_mask, "doctor_recc_h1n1"] = predicted_values
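
A quick sanity check after each imputation is to compare the values the model filled in with the values that were actually observed. The sketch below does this for one column using toy series; `imputation_check` is a hypothetical helper. For a binary column the mean is simply the share of `1`s.

```python
import numpy as np
import pandas as pd

def imputation_check(before, after):
    """Compare observed values with the values filled in for missing rows."""
    was_missing = before.isnull()
    return {
        "n_imputed": int(was_missing.sum()),
        "observed_mean": float(before[~was_missing].mean()),
        "imputed_mean": float(after[was_missing].mean()),
    }

# Toy example: two gaps, filled with 0 and 1 (illustrative values)
before = pd.Series([1.0, 0.0, np.nan, 0.0, np.nan])
after = pd.Series([1.0, 0.0, 0.0, 0.0, 1.0])

check = imputation_check(before, after)
print(check)
```

In this notebook, `before` would correspond to `df_1["doctor_recc_h1n1"]` and `after` to `df_imputed_encoded["doctor_recc_h1n1"]`. A large gap between the two means is not necessarily wrong, but it is worth a second look.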

In [50]:
df_imputed_encoded.isnull().sum()

h1n1_concern                                0
h1n1_knowledge                              0
behavioral_antiviral_meds                   0
behavioral_avoidance                        0
behavioral_face_mask                        0
behavioral_wash_hands                       0
behavioral_large_gatherings                 0
behavioral_outside_home                     0
behavioral_touch_face                       0
doctor_recc_h1n1                            0
doctor_recc_seasonal                     2160
chronic_med_condition                     971
child_under_6_months                      820
health_worker                             804
health_insurance                        12274
opinion_h1n1_vacc_effective               391
opinion_h1n1_risk                         388
opinion_h1n1_sick_from_vacc               395
opinion_seas_vacc_effective               462
opinion_seas_risk                         514
opinion_seas_sick_from_vacc               537
age_group                         

In [37]:
# we carry out the same process for the other columns with missing values
# an attempt to wrap these steps in a reusable function ran into errors, so the steps are repeated per column
# in each block, the column being imputed serves as the target variable
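
One way the repeated steps could be wrapped in a function is sketched below. `impute_with_logreg` is a hypothetical helper, not the notebook's code, and it differs from the cells in two labeled ways: it fits only on rows where the target was actually observed (rather than on the mode-filled copy), and it fills gaps in the predictors with modes computed from the same frame. It assumes every column is already numeric, as in `df_imputed_encoded`. The data here is random toy data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def impute_with_logreg(df, target, random_state=42):
    """Return a copy of df with NaNs in df[target] replaced by predictions.

    Hypothetical helper: assumes all columns are numeric and the target
    is a categorical code (e.g. 0/1 or an ordinal scale).
    """
    out = df.copy()
    missing = out[target].isnull()
    if not missing.any():
        return out
    features = out.drop(columns=target)
    # Predictors may have gaps of their own; fill them with each column's mode
    features = features.fillna(features.mode().iloc[0])
    model = LogisticRegression(max_iter=1000, random_state=random_state)
    # Fit only on rows where the target was observed (assumption of this sketch)
    model.fit(features[~missing], out.loc[~missing, target])
    out.loc[missing, target] = model.predict(features[missing])
    return out

# Toy data standing in for the encoded survey frame
rng = np.random.default_rng(0)
n = 300
toy = pd.DataFrame({
    "h1n1_concern": rng.integers(0, 4, size=n).astype(float),
    "doctor_recc_h1n1": rng.integers(0, 2, size=n).astype(float),
    "chronic_med_condition": rng.integers(0, 2, size=n).astype(float),
})
toy.loc[rng.choice(n, 30, replace=False), "doctor_recc_h1n1"] = np.nan
toy.loc[rng.choice(n, 30, replace=False), "chronic_med_condition"] = np.nan

# Impute one column after another; each call returns a new frame
filled = toy
for col in ["doctor_recc_h1n1", "chronic_med_condition"]:
    filled = impute_with_logreg(filled, col)

print(filled.isnull().sum())
```

Because each call returns a new frame, the original `toy` keeps its gaps, which makes it easy to compare before and after.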

In [51]:
x_train = mergedf.drop(["chronic_med_condition"], axis=1)
y_train = mergedf.chronic_med_condition

model = LogisticRegression(max_iter=1000, random_state=42)
model.fit(x_train, y_train)

# Identify rows where the current column is missing
missing_mask = df_imputed_encoded["chronic_med_condition"].isnull()

# Get the features for these missing rows.
x_predict = df_imputed_encoded[missing_mask].drop(columns="chronic_med_condition")

# A model cannot predict with NaN inputs. We use the training set's mode.
for feature_col in x_predict.columns:
    mode_val = x_train[feature_col].mode()[0]
    x_predict[feature_col] = x_predict[feature_col].fillna(mode_val)

predicted_values = model.predict(x_predict)
# Fill the NaNs in our target dataframe with the predictions
df_imputed_encoded.loc[missing_mask, "chronic_med_condition"] = predicted_values

In [52]:
x_train = mergedf.drop(["child_under_6_months"], axis=1)
y_train = mergedf.child_under_6_months

model = LogisticRegression(max_iter=1000, random_state=42)
model.fit(x_train, y_train)

# Identify rows where the current column is missing
missing_mask = df_imputed_encoded["child_under_6_months"].isnull()

# Get the features for these missing rows.
x_predict = df_imputed_encoded[missing_mask].drop(columns="child_under_6_months")

# A model cannot predict with NaN inputs. We use the training set's mode.
for feature_col in x_predict.columns:
    mode_val = x_train[feature_col].mode()[0]
    x_predict[feature_col] = x_predict[feature_col].fillna(mode_val)

predicted_values = model.predict(x_predict)
# Fill the NaNs in our target dataframe with the predictions
df_imputed_encoded.loc[missing_mask, "child_under_6_months"] = predicted_values

In [53]:
x_train = mergedf.drop(["health_insurance"], axis=1)
y_train = mergedf.health_insurance

model = LogisticRegression(max_iter=1000, random_state=42)
model.fit(x_train, y_train)

# Identify rows where the current column is missing
missing_mask = df_imputed_encoded["health_insurance"].isnull()

# Get the features for these missing rows.
x_predict = df_imputed_encoded[missing_mask].drop(columns="health_insurance")

# A model cannot predict with NaN inputs. We use the training set's mode.
for feature_col in x_predict.columns:
    mode_val = x_train[feature_col].mode()[0]
    x_predict[feature_col] = x_predict[feature_col].fillna(mode_val)

predicted_values = model.predict(x_predict)
# Fill the NaNs in our target dataframe with the predictions
df_imputed_encoded.loc[missing_mask, "health_insurance"] = predicted_values

In [58]:
x_train = mergedf.drop(["health_worker"], axis=1)
y_train = mergedf.health_worker

model = LogisticRegression(max_iter=1000, random_state=42)
model.fit(x_train, y_train)

# Identify rows where the current column is missing
missing_mask = df_imputed_encoded["health_worker"].isnull()

# Get the features for these missing rows.
x_predict = df_imputed_encoded[missing_mask].drop(columns="health_worker")

# A model cannot predict with NaN inputs. We use the training set's mode.
for feature_col in x_predict.columns:
    mode_val = x_train[feature_col].mode()[0]
    x_predict[feature_col] = x_predict[feature_col].fillna(mode_val)

predicted_values = model.predict(x_predict)
# Fill the NaNs in our target dataframe with the predictions
df_imputed_encoded.loc[missing_mask, "health_worker"] = predicted_values

In [60]:
x_train = mergedf.drop(["doctor_recc_seasonal"], axis=1)
y_train = mergedf.doctor_recc_seasonal

model = LogisticRegression(max_iter=1000, random_state=42)
model.fit(x_train, y_train)

# Identify rows where the current column is missing
missing_mask = df_imputed_encoded["doctor_recc_seasonal"].isnull()

# Get the features for these missing rows.
x_predict = df_imputed_encoded[missing_mask].drop(columns="doctor_recc_seasonal")

# A model cannot predict with NaN inputs. We use the training set's mode.
for feature_col in x_predict.columns:
    mode_val = x_train[feature_col].mode()[0]
    x_predict[feature_col] = x_predict[feature_col].fillna(mode_val)

predicted_values = model.predict(x_predict)
# Fill the NaNs in our target dataframe with the predictions
df_imputed_encoded.loc[missing_mask, "doctor_recc_seasonal"] = predicted_values

In [66]:
x_train = mergedf.drop(["education"], axis=1)
y_train = mergedf.education

model = RandomForestClassifier(n_estimators=50, random_state=42)
model.fit(x_train, y_train)

# Identify rows where the current column is missing
missing_mask = df_imputed_encoded["education"].isnull()

# Get the features for these missing rows.
x_predict = df_imputed_encoded[missing_mask].drop(columns="education")

# A model cannot predict with NaN inputs. We use the training set's mode.
for feature_col in x_predict.columns:
    mode_val = x_train[feature_col].mode()[0]
    x_predict[feature_col] = x_predict[feature_col].fillna(mode_val)

predicted_values = model.predict(x_predict)
# Fill the NaNs in our target dataframe with the predictions
df_imputed_encoded.loc[missing_mask, "education"] = predicted_values

In [74]:
x_train = mergedf.drop(["income_poverty"], axis=1)
y_train = mergedf.income_poverty

model = RandomForestClassifier(n_estimators=50, random_state=42)
model.fit(x_train, y_train)

# Identify rows where the current column is missing
missing_mask = df_imputed_encoded["income_poverty"].isnull()

# Get the features for these missing rows.
x_predict = df_imputed_encoded[missing_mask].drop(columns="income_poverty")

# A model cannot predict with NaN inputs. We use the training set's mode.
for feature_col in x_predict.columns:
    mode_val = x_train[feature_col].mode()[0]
    x_predict[feature_col] = x_predict[feature_col].fillna(mode_val)

predicted_values = model.predict(x_predict)
# Fill the NaNs in our target dataframe with the predictions
df_imputed_encoded.loc[missing_mask, "income_poverty"] = predicted_values

In [86]:
x_train = mergedf.drop(["household_children"], axis=1)
y_train = mergedf.household_children

model = RandomForestClassifier(n_estimators=50, random_state=42)
model.fit(x_train, y_train)

# Identify rows where the current column is missing
missing_mask = df_imputed_encoded["household_children"].isnull()

# Get the features for these missing rows.
x_predict = df_imputed_encoded[missing_mask].drop(columns="household_children")

# A model cannot predict with NaN inputs. We use the training set's mode.
for feature_col in x_predict.columns:
    mode_val = x_train[feature_col].mode()[0]
    x_predict[feature_col] = x_predict[feature_col].fillna(mode_val)

predicted_values = model.predict(x_predict)
# Fill the NaNs in our target dataframe with the predictions
df_imputed_encoded.loc[missing_mask, "household_children"] = predicted_values

In [92]:
x_train = mergedf.drop(["household_adults"], axis=1)
y_train = mergedf.household_adults

model = RandomForestClassifier(n_estimators=50, random_state=42)
model.fit(x_train, y_train)

# Identify rows where the current column is missing
missing_mask = df_imputed_encoded["household_adults"].isnull()

# Get the features for these missing rows.
x_predict = df_imputed_encoded[missing_mask].drop(columns="household_adults")

# A model cannot predict with NaN inputs. We use the training set's mode.
for feature_col in x_predict.columns:
    mode_val = x_train[feature_col].mode()[0]
    x_predict[feature_col] = x_predict[feature_col].fillna(mode_val)

predicted_values = model.predict(x_predict)
# Fill the NaNs in our target dataframe with the predictions
df_imputed_encoded.loc[missing_mask, "household_adults"] = predicted_values
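
The per-column cells above all follow one pattern, so they could plausibly be folded into a single helper. The following is a minimal sketch, not the notebook's own code: `impute_with_model` is a hypothetical name, and it assumes the complete-case frame (`mergedf` in the notebook) shares its feature columns with the frame being imputed. The toy data only stands in for the survey.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression


def impute_with_model(df, complete_df, column, model):
    """Fill NaNs in df[column] (in place) with predictions from a model
    fit on complete_df, then return df."""
    x_train = complete_df.drop(columns=[column])
    y_train = complete_df[column]
    model.fit(x_train, y_train)

    # Identify rows where the current column is missing.
    missing_mask = df[column].isnull()
    if not missing_mask.any():
        return df  # nothing to impute

    # Select the training columns, in training order, as prediction features.
    x_predict = df.loc[missing_mask, x_train.columns].copy()
    # The model cannot take NaN inputs, so fall back to the training mode.
    x_predict = x_predict.fillna(x_train.mode().iloc[0])

    df.loc[missing_mask, column] = model.predict(x_predict)
    return df


# Tiny synthetic frame (assumption: a stand-in, not the survey data).
toy = pd.DataFrame({
    "a": [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, None, 1.0],
    "b": [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, None],
})
toy_complete = toy.dropna()
for col in ["a", "b"]:
    clf = LogisticRegression(max_iter=1000, random_state=42)
    toy = impute_with_model(toy, toy_complete, col, clf)

print(toy.isnull().sum())
```

In the notebook, a call might look like `impute_with_model(df_imputed_encoded, mergedf, "health_worker", LogisticRegression(max_iter=1000, random_state=42))`, with `RandomForestClassifier` passed instead for the multi-class columns. Selecting `x_train.columns` explicitly guards against the two frames having their columns in a different order.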

In [102]:
# Each of the remaining opinion columns is missing in roughly 2% of rows or fewer,
# so they are again imputed with the column mode.
opinion_cols = [
    "opinion_h1n1_vacc_effective",
    "opinion_h1n1_risk",
    "opinion_h1n1_sick_from_vacc",
    "opinion_seas_vacc_effective",
    "opinion_seas_risk",
    "opinion_seas_sick_from_vacc",
]
for col in opinion_cols:
    # Assign the result back instead of calling fillna(inplace=True) on a column selection.
    df_imputed_encoded[col] = df_imputed_encoded[col].fillna(df_imputed_encoded[col].mode()[0])
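
The choice of mode imputation here rests on the missing share being small. A hedged sketch of how that condition could be checked in code before filling is shown below; `mode_impute_small_gaps` and its `max_missing_share` threshold are hypothetical names and values chosen for illustration, and the data is synthetic.

```python
import pandas as pd


def mode_impute_small_gaps(df, columns, max_missing_share=0.03):
    """Mode-impute each listed column whose missing share is at or below
    the threshold; return the filled copy and the columns that were skipped."""
    filled = df.copy()
    skipped = []
    for col in columns:
        share = filled[col].isnull().mean()  # fraction of rows that are NaN
        if share > max_missing_share:
            skipped.append(col)  # too many gaps for a simple mode fill
            continue
        filled[col] = filled[col].fillna(filled[col].mode()[0])
    return filled, skipped


# Synthetic stand-in: one column with 1% missing, one with 50% missing.
toy = pd.DataFrame({
    "mostly_full": [1.0] * 60 + [2.0] * 39 + [None],
    "half_empty": [1.0] * 50 + [None] * 50,
})
toy_filled, toy_skipped = mode_impute_small_gaps(toy, ["mostly_full", "half_empty"])
print(toy_skipped)
```

Columns returned in the skipped list would be candidates for the model-based imputation used earlier rather than a mode fill.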

In [106]:
df_imputed_encoded.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26707 entries, 0 to 26706
Data columns (total 35 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   h1n1_concern                          26707 non-null  float64
 1   h1n1_knowledge                        26707 non-null  float64
 2   behavioral_antiviral_meds             26707 non-null  float64
 3   behavioral_avoidance                  26707 non-null  float64
 4   behavioral_face_mask                  26707 non-null  float64
 5   behavioral_wash_hands                 26707 non-null  float64
 6   behavioral_large_gatherings           26707 non-null  float64
 7   behavioral_outside_home               26707 non-null  float64
 8   behavioral_touch_face                 26707 non-null  float64
 9   doctor_recc_h1n1                      26707 non-null  float64
 10  doctor_recc_seasonal                  26707 non-null  float64
 11  chronic_med_con

In [116]:
# we need the respondent id to merge with train labels
train_df = df_imputed_encoded.reset_index(names=['respondent_id'])
train_df[:5]

Unnamed: 0,respondent_id,h1n1_concern,h1n1_knowledge,behavioral_antiviral_meds,behavioral_avoidance,behavioral_face_mask,behavioral_wash_hands,behavioral_large_gatherings,behavioral_outside_home,behavioral_touch_face,...,household_adults,household_children,race_Hispanic,race_Other or Multiple,race_White,sex_Male,rent_or_own_Rent,employment_status_Not in Labor Force,employment_status_Unemployed,marital_status_Not Married
0,0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,...,0.0,0.0,0,0,1,0,0,1,0,1
1,1,3.0,2.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,...,0.0,0.0,0,0,1,1,1,0,0,1
2,2,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,2.0,0.0,0,0,1,1,0,0,0,1
3,3,1.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,...,0.0,0.0,0,0,1,0,1,1,0,1
4,4,2.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,...,1.0,0.0,0,0,1,0,0,0,0,0


In [120]:
# cleaned dataset merged with training labels
merge_df = pd.merge(train_df, df, on='respondent_id')
merge_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26707 entries, 0 to 26706
Data columns (total 38 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   respondent_id                         26707 non-null  int64  
 1   h1n1_concern                          26707 non-null  float64
 2   h1n1_knowledge                        26707 non-null  float64
 3   behavioral_antiviral_meds             26707 non-null  float64
 4   behavioral_avoidance                  26707 non-null  float64
 5   behavioral_face_mask                  26707 non-null  float64
 6   behavioral_wash_hands                 26707 non-null  float64
 7   behavioral_large_gatherings           26707 non-null  float64
 8   behavioral_outside_home               26707 non-null  float64
 9   behavioral_touch_face                 26707 non-null  float64
 10  doctor_recc_h1n1                      26707 non-null  float64
 11  doctor_recc_sea
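
Because the merge relies on `respondent_id` matching exactly one row on each side, it can be worth asserting that explicitly. The sketch below uses small synthetic frames in place of `train_df` and the labels frame `df`; the label column names `h1n1_vaccine` and `seasonal_vaccine` are assumptions, since the labels file is not shown in this section.

```python
import pandas as pd

# Synthetic stand-ins for the cleaned features and the training labels.
features = pd.DataFrame({
    "respondent_id": [0, 1, 2],
    "h1n1_concern": [1.0, 3.0, 1.0],
})
labels = pd.DataFrame({
    "respondent_id": [0, 1, 2],
    "h1n1_vaccine": [0, 0, 1],       # assumed label name
    "seasonal_vaccine": [0, 1, 0],   # assumed label name
})

# validate="one_to_one" raises MergeError if respondent_id repeats on either side;
# indicator=True adds a "_merge" column showing where each row came from.
checked = pd.merge(
    features, labels, on="respondent_id",
    how="outer", validate="one_to_one", indicator=True,
)

# Rows present on only one side would show up here.
unmatched = checked[checked["_merge"] != "both"]
merged = checked.drop(columns="_merge")
print(len(unmatched), merged.shape)
```

An empty `unmatched` frame and an unchanged row count suggest that no respondents were dropped or duplicated by the merge.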

In [122]:
# the dataset is converted to csv to carry out EDA
merge_df.to_csv('cleantrain_labels.csv', index=False)