## 1. Business Understanding

Vaccination is a critical public health strategy to prevent the spread of infectious diseases like H1N1 and seasonal influenza. Despite their proven effectiveness, vaccination rates often remain below target levels due to various factors such as demographics, beliefs, and health behaviors.

In this project, our objective is to **predict whether individuals received the H1N1 and seasonal flu vaccines** using survey data from the National 2009 H1N1 Flu Survey. By understanding the factors that influence vaccination decisions, we aim to help public health agencies:

- **Identify population groups with low vaccination rates** and the characteristics that predict vaccine hesitancy.
- **Design targeted awareness campaigns** and interventions to improve immunization coverage.
- **Optimize resource allocation** by focusing efforts on communities with higher predicted non-vaccination likelihood.

This project is a **binary classification problem** for each vaccine type (H1N1 and seasonal flu). The ultimate goal is to develop accurate models that support **data-driven decision-making** to enhance public health outcomes and reduce the spread of preventable diseases through better vaccination strategies.


# 2 Data Understanding

## 2.1 Importing the Data and Libraries





In [110]:
import pandas as pd
import zipfile

In [111]:
# Load the dataset (a forward slash avoids the invalid "\H" escape and works on every OS)
df = pd.read_csv("data/H1N1_Flu_Vaccines.csv")


In [112]:
# Display the first 5 rows
df.head()

| | respondent_id | h1n1_concern | h1n1_knowledge | behavioral_antiviral_meds | behavioral_avoidance | behavioral_face_mask | behavioral_wash_hands | behavioral_large_gatherings | behavioral_outside_home | behavioral_touch_face | ... | rent_or_own | employment_status | hhs_geo_region | census_msa | household_adults | household_children | employment_industry | employment_occupation | h1n1_vaccine | seasonal_vaccine |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | Own | Not in Labor Force | oxchjgsf | Non-MSA | 0.0 | 0.0 | NaN | NaN | 0 | 0 |
| 1 | 1 | 3.0 | 2.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | ... | Rent | Employed | bhuqouqj | MSA, Not Principle City | 0.0 | 0.0 | pxcmvdjn | xgwztkwe | 0 | 1 |
| 2 | 2 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | Own | Employed | qufhixun | MSA, Not Principle City | 2.0 | 0.0 | rucpziij | xtkaffoo | 0 | 0 |
| 3 | 3 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | ... | Rent | Not in Labor Force | lrircsnp | MSA, Principle City | 0.0 | 0.0 | NaN | NaN | 0 | 1 |
| 4 | 4 | 2.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | ... | Own | Employed | qufhixun | MSA, Not Principle City | 1.0 | 0.0 | wxleyezf | emcorrxb | 0 | 0 |


In [113]:
df.columns

Index(['respondent_id', 'h1n1_concern', 'h1n1_knowledge',
       'behavioral_antiviral_meds', 'behavioral_avoidance',
       'behavioral_face_mask', 'behavioral_wash_hands',
       'behavioral_large_gatherings', 'behavioral_outside_home',
       'behavioral_touch_face', 'doctor_recc_h1n1', 'doctor_recc_seasonal',
       'chronic_med_condition', 'child_under_6_months', 'health_worker',
       'health_insurance', 'opinion_h1n1_vacc_effective', 'opinion_h1n1_risk',
       'opinion_h1n1_sick_from_vacc', 'opinion_seas_vacc_effective',
       'opinion_seas_risk', 'opinion_seas_sick_from_vacc', 'age_group',
       'education', 'race', 'sex', 'income_poverty', 'marital_status',
       'rent_or_own', 'employment_status', 'hhs_geo_region', 'census_msa',
       'household_adults', 'household_children', 'employment_industry',
       'employment_occupation', 'h1n1_vaccine', 'seasonal_vaccine'],
      dtype='object')

## 2.2 Explanation of Dataset Columns

Below is a detailed explanation of all the columns currently available in the dataset:

### H1N1-Specific Features
- **h1n1_concern**: Level of concern about the H1N1 flu.
  - 0 = Not at all concerned, 1 = Not very concerned, 2 = Somewhat concerned, 3 = Very concerned.
- **h1n1_knowledge**: Level of knowledge about the H1N1 flu.
  - 0 = No knowledge, 1 = A little knowledge, 2 = A lot of knowledge.
- **doctor_recc_h1n1**: Whether the respondent’s doctor recommended the H1N1 vaccine (0 = No, 1 = Yes).

### Behavioral Features (All Binary)
- **behavioral_antiviral_meds**: Has taken antiviral medications.
- **behavioral_avoidance**: Avoided close contact with people who have flu-like symptoms.
- **behavioral_face_mask**: Bought a face mask.
- **behavioral_wash_hands**: Frequently washed hands or used hand sanitizer.
- **behavioral_large_gatherings**: Reduced time spent at large gatherings.
- **behavioral_outside_home**: Reduced contact with people outside the household.
- **behavioral_touch_face**: Avoided touching eyes, nose, or mouth.

_All behavioral features are binary: 0 = No, 1 = Yes._

### Doctor Recommendations
- **doctor_recc_seasonal**: Whether the respondent’s doctor recommended the seasonal flu vaccine (0 = No, 1 = Yes).

### Health and Household Risk Factors
- **chronic_med_condition**: Has one or more chronic medical conditions (e.g., asthma, diabetes, heart or kidney disease, weakened immune system) — (0 = No, 1 = Yes).
- **child_under_6_months**: Has close contact with a child under six months old (0 = No, 1 = Yes).
- **health_worker**: Is a healthcare worker (0 = No, 1 = Yes).
- **health_insurance**: Has health insurance (0 = No, 1 = Yes).

### Opinion Features
- **opinion_h1n1_vacc_effective**: Opinion on the effectiveness of the H1N1 vaccine.
  - 1 = Not at all effective; 2 = Not very effective; 3 = Don't know; 4 = Somewhat effective; 5 = Very effective.
- **opinion_h1n1_risk**: Opinion on the risk of getting H1N1 flu without vaccination.
  - 1 = Very Low; 2 = Somewhat low; 3 = Don't know; 4 = Somewhat high; 5 = Very high.
- **opinion_h1n1_sick_from_vacc**: Worry about getting sick from the H1N1 vaccine.
  - 1 = Not at all worried; 2 = Not very worried; 3 = Don't know; 4 = Somewhat worried; 5 = Very worried.
- **opinion_seas_vacc_effective**: Opinion on the effectiveness of the seasonal flu vaccine.
  - Same scale as above.
- **opinion_seas_risk**: Opinion on the risk of getting seasonal flu without vaccination.
  - Same scale as above.
- **opinion_seas_sick_from_vacc**: Worry about getting sick from the seasonal flu vaccine.
  - Same scale as above.

### Demographic Features
- **age_group**: Age category of the respondent.
- **education**: Education level.
- **race**: Race of the respondent.
- **sex**: Sex of the respondent.
- **income_poverty**: Household annual income relative to 2008 Census poverty thresholds.
- **marital_status**: Marital status.
- **rent_or_own**: Housing situation (rent or own).
- **employment_status**: Employment status.
- **employment_industry**: Industry where the respondent works (encoded as random strings).
- **employment_occupation**: Occupation type (encoded as random strings).
- **hhs_geo_region**: Region of residence, using the U.S. Department of Health and Human Services 10-region classification (encoded).
- **census_msa**: Residence within a Metropolitan Statistical Area as defined by the U.S. Census.
- **household_adults**: Number of other adults in the household (top-coded at 3).
- **household_children**: Number of children in the household (top-coded at 3).


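✅ **Note:**
- All binary columns are encoded as **0 = No** and **1 = Yes**.
- The string-valued columns need encoding before modeling. Some (`age_group`, `education`, `income_poverty`) are ordinal; others (such as `race` and `hhs_geo_region`) are nominal.

The remaining columns are `respondent_id`, a unique identifier for each respondent, and the two targets, `h1n1_vaccine` and `seasonal_vaccine` (0 = not vaccinated, 1 = vaccinated).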
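The ordered categories mentioned in the note can be given an explicit order before modeling. The following is a minimal sketch on a tiny hand-made frame, not the survey data. The labels match the category strings in the dataset, while the low-to-high ordering and the `*_code` column names are assumptions made for illustration.

```python
import pandas as pd

# Toy rows for illustration only; labels match the survey's category strings.
toy = pd.DataFrame({
    "age_group": ["65+ Years", "18 - 34 Years", "45 - 54 Years"],
    "education": ["College Graduate", "< 12 Years", "Some College"],
})

# Assumed low-to-high ordering of each ordinal variable.
orders = {
    "age_group": ["18 - 34 Years", "35 - 44 Years", "45 - 54 Years",
                  "55 - 64 Years", "65+ Years"],
    "education": ["< 12 Years", "12 Years", "Some College", "College Graduate"],
}

for col, levels in orders.items():
    ordered = pd.Categorical(toy[col], categories=levels, ordered=True)
    # Integer rank within the ordering; a missing value would get the code -1.
    toy[f"{col}_code"] = ordered.codes

print(toy)
```

Nominal columns such as `race` have no natural order, so an integer rank like this would not be appropriate for them.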


✅ 1️⃣ Check dataset dimensions

In [114]:
# Number of rows and columns
print("Shape of dataset:", df.shape)


Shape of dataset: (26707, 38)


### 📏 Dataset Shape

The dataset contains **26,707 rows** (individual survey respondents) and **38 columns** (variables).
This indicates we have a substantial amount of data to train and test predictive models, with multiple features available to help explain vaccination decisions.


✅ 2️⃣ See column names & data types

In [115]:
# Data types and non-null counts
# info() prints its report directly and returns None, so wrapping it in print() is unnecessary
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26707 entries, 0 to 26706
Data columns (total 38 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   respondent_id                26707 non-null  int64  
 1   h1n1_concern                 26615 non-null  float64
 2   h1n1_knowledge               26591 non-null  float64
 3   behavioral_antiviral_meds    26636 non-null  float64
 4   behavioral_avoidance         26499 non-null  float64
 5   behavioral_face_mask         26688 non-null  float64
 6   behavioral_wash_hands        26665 non-null  float64
 7   behavioral_large_gatherings  26620 non-null  float64
 8   behavioral_outside_home      26625 non-null  float64
 9   behavioral_touch_face        26579 non-null  float64
 10  doctor_recc_h1n1             24547 non-null  float64
 11  doctor_recc_seasonal         24547 non-null  float64
 12  chronic_med_condition        25736 non-null  float64
 13  child_under_6_mo

### 🗂️ Dataset Info

The `.info()` output shows:
- The dataset has **26,707 rows** and **38 columns**.
- There are **23 floating-point columns** (`float64`), **3 integer columns** (`int64`), and **12 string columns** (`object`).
- Some columns have **missing values**, as the non-null counts are less than 26,707 (e.g., `health_insurance`, `employment_industry`).
- The two target columns, `h1n1_vaccine` and `seasonal_vaccine`, have **no missing values**.

This helps us understand the data types, confirm our target variables are complete, and identify columns that will need **cleaning or imputation**.
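The same tallies can be produced programmatically rather than read off the `.info()` printout. This is a small sketch on a made-up three-column frame; the variable names `dtype_counts` and `incomplete` are illustrative only.

```python
import pandas as pd

# Toy frame standing in for the survey data (values are made up).
toy = pd.DataFrame({
    "respondent_id": [0, 1, 2],
    "h1n1_concern": [1.0, None, 3.0],
    "sex": ["Female", "Male", "Female"],
})

# Number of columns per dtype.
dtype_counts = toy.dtypes.astype(str).value_counts()

# Names of columns that contain at least one missing value.
incomplete = toy.columns[toy.isnull().any()].tolist()

print(dtype_counts)
print("Columns with missing values:", incomplete)
```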


✅ 4️⃣ Get summary statistics (numeric)

In [116]:
# Summary for numeric columns
df.describe()


| | respondent_id | h1n1_concern | h1n1_knowledge | behavioral_antiviral_meds | behavioral_avoidance | behavioral_face_mask | behavioral_wash_hands | behavioral_large_gatherings | behavioral_outside_home | behavioral_touch_face | ... | opinion_h1n1_vacc_effective | opinion_h1n1_risk | opinion_h1n1_sick_from_vacc | opinion_seas_vacc_effective | opinion_seas_risk | opinion_seas_sick_from_vacc | household_adults | household_children | h1n1_vaccine | seasonal_vaccine |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 26707.0 | 26615.0 | 26591.0 | 26636.0 | 26499.0 | 26688.0 | 26665.0 | 26620.0 | 26625.0 | 26579.0 | ... | 26316.0 | 26319.0 | 26312.0 | 26245.0 | 26193.0 | 26170.0 | 26458.0 | 26458.0 | 26707.0 | 26707.0 |
| mean | 13353.0 | 1.618486 | 1.262532 | 0.048844 | 0.725612 | 0.068982 | 0.825614 | 0.35864 | 0.337315 | 0.677264 | ... | 3.850623 | 2.342566 | 2.35767 | 4.025986 | 2.719162 | 2.118112 | 0.886499 | 0.534583 | 0.212454 | 0.465608 |
| std | 7709.791156 | 0.910311 | 0.618149 | 0.215545 | 0.446214 | 0.253429 | 0.379448 | 0.47961 | 0.472802 | 0.467531 | ... | 1.007436 | 1.285539 | 1.362766 | 1.086565 | 1.385055 | 1.33295 | 0.753422 | 0.928173 | 0.409052 | 0.498825 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 6676.5 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 3.0 | 1.0 | 1.0 | 4.0 | 2.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 13353.0 | 2.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | ... | 4.0 | 2.0 | 2.0 | 4.0 | 2.0 | 2.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 75% | 20029.5 | 2.0 | 2.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 5.0 | 4.0 | 4.0 | 5.0 | 4.0 | 4.0 | 1.0 | 1.0 | 0.0 | 1.0 |
| max | 26706.0 | 3.0 | 2.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 3.0 | 3.0 | 1.0 | 1.0 |


### 📊 Descriptive Statistics

The `describe()` output provides summary statistics for the numeric columns in the dataset:

- **`count`** shows how many non-null values each variable has. Some variables have slightly fewer rows than the total 26,707, indicating missing data.
- **`mean`** shows the average value for each variable.
- **`std`** shows the standard deviation, which tells us how spread out the values are.
- **`min`**, **25%**, **50%** (median), **75%**, and **max** show the distribution of values for each variable.

Key observations:
- Many behavioral variables (`behavioral_*`) are binary (0 or 1) — you can see this from their min/max.
- Opinion variables (`opinion_*`) use a Likert scale (e.g., 1 to 5) — their min/max and mean reflect attitudes toward vaccines.
- The target variables `h1n1_vaccine` and `seasonal_vaccine` are binary (0 or 1), so the mean is the share of respondents who got vaccinated: about 21% for H1N1 and about 47% for seasonal flu.

These descriptive stats help identify:
- Variables with missing values (where `count` < 26,707).
- Variables with binary or ordinal scales.
- Potential outliers (not common here since most are 0–1 or Likert scale).

This summary guides the **data cleaning**, **imputation**, and **encoding** steps.
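The three readings described above (binary scale from min/max, missingness from `count`, vaccination share from the target mean) can be checked in code. The sketch below uses a four-row toy frame with made-up values; `binary_cols`, `has_missing`, and `vaccinated_share` are illustrative names.

```python
import pandas as pd

# Toy frame with one binary feature, one Likert-style feature, and one target.
toy = pd.DataFrame({
    "behavioral_face_mask": [0.0, 1.0, 0.0, None],
    "opinion_h1n1_risk": [1.0, 5.0, 3.0, 2.0],
    "h1n1_vaccine": [0, 0, 1, 0],
})

stats = toy.describe()

# A column is treated as binary when its observed values are a subset of {0, 1}.
binary_cols = [c for c in toy.columns if set(toy[c].dropna().unique()) <= {0, 1}]

# A non-null count below the number of rows signals missing data.
counts = stats.loc["count"]
has_missing = counts[counts < len(toy)].index.tolist()

# For a 0/1 target, the mean equals the share of positive cases.
vaccinated_share = toy["h1n1_vaccine"].mean()

print(binary_cols, has_missing, vaccinated_share)
```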


✅ 5️⃣ Check for missing values

In [117]:
# Count missing values per column
df.isnull().sum()


respondent_id                      0
h1n1_concern                      92
h1n1_knowledge                   116
behavioral_antiviral_meds         71
behavioral_avoidance             208
behavioral_face_mask              19
behavioral_wash_hands             42
behavioral_large_gatherings       87
behavioral_outside_home           82
behavioral_touch_face            128
doctor_recc_h1n1                2160
doctor_recc_seasonal            2160
chronic_med_condition            971
child_under_6_months             820
health_worker                    804
health_insurance               12274
opinion_h1n1_vacc_effective      391
opinion_h1n1_risk                388
opinion_h1n1_sick_from_vacc      395
opinion_seas_vacc_effective      462
opinion_seas_risk                514
opinion_seas_sick_from_vacc      537
age_group                          0
education                       1407
race                               0
sex                                0
income_poverty                  4423
m

### ⚠️ Missing Values Summary


- Some columns have **few missing values** (e.g., `behavioral_face_mask` has only 19).
- Several variables have **moderate missingness** (e.g., `doctor_recc_h1n1` and `doctor_recc_seasonal` are each missing 2,160 values, about 8%).
- A few variables have **substantial missingness**:
  - `health_insurance` (12,274 missing, ~46%)
  - `employment_industry` and `employment_occupation` (over 13,000 missing each, ~50%)

The **target variables** (`h1n1_vaccine` and `seasonal_vaccine`) have **no missing values**.
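Grouping columns by how much data they are missing can also be automated. The sketch below is one way to do it with `pd.cut` on a toy frame; the cut points (10% and 40%) and the tier labels are assumptions chosen for illustration, not thresholds taken from the survey documentation.

```python
import numpy as np
import pandas as pd

# Toy frame: 20 rows with 0, 1, 4, and 10 missing values per column.
toy = pd.DataFrame({
    "age_group": ["65+ Years"] * 20,
    "behavioral_face_mask": [np.nan] * 1 + [0.0] * 19,
    "income_poverty": [np.nan] * 4 + ["> $75,000"] * 16,
    "health_insurance": [np.nan] * 10 + [1.0] * 10,
})

# Percentage of missing values per column.
missing_pct = toy.isnull().mean() * 100

# Assumed tiers: (0, 10] low, (10, 40] moderate, (40, 100] substantial.
levels = pd.cut(
    missing_pct[missing_pct > 0],
    bins=[0, 10, 40, 100],
    labels=["low", "moderate", "substantial"],
)

print(levels)
```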





✅ 6️⃣ Check unique values in categorical columns

In [118]:
# See the unique values of every string (object) column
categorical_columns = df.select_dtypes(include=['object']).columns
for col in categorical_columns:
    print(f"{col}: {df[col].unique()}")


age_group: ['55 - 64 Years' '35 - 44 Years' '18 - 34 Years' '65+ Years'
 '45 - 54 Years']
education: ['< 12 Years' '12 Years' 'College Graduate' 'Some College' nan]
race: ['White' 'Black' 'Other or Multiple' 'Hispanic']
sex: ['Female' 'Male']
income_poverty: ['Below Poverty' '<= $75,000, Above Poverty' '> $75,000' nan]
marital_status: ['Not Married' 'Married' nan]
rent_or_own: ['Own' 'Rent' nan]
employment_status: ['Not in Labor Force' 'Employed' 'Unemployed' nan]
hhs_geo_region: ['oxchjgsf' 'bhuqouqj' 'qufhixun' 'lrircsnp' 'atmpeygn' 'lzgpxyit'
 'fpwskwrf' 'mlyzmhmf' 'dqpwygqj' 'kbazzjca']
census_msa: ['Non-MSA' 'MSA, Not Principle  City' 'MSA, Principle City']
employment_industry: [nan 'pxcmvdjn' 'rucpziij' 'wxleyezf' 'saaquncn' 'xicduogh' 'ldnlellj'
 'wlfvacwt' 'nduyfdeo' 'fcxhlnwr' 'vjjrobsf' 'arjwrbjb' 'atmlpfrs'
 'msuufmds' 'xqicxuve' 'phxvnwax' 'dotnnunm' 'mfikgejo' 'cfqqtusy'
 'mcubkhph' 'haxffmxo' 'qnlwzans']
employment_occupation: [nan 'xgwztkwe' 'xtkaffoo' 'emcorrxb' 'vlluhb

### 🔑 Categorical Variables & Unique Values

**Key categories:**  
- `age_group`: 5 clear age ranges, no missing values.
- `education`: 4 levels plus missing (`nan`).
- `race`: 4 groups, no missing.
- `sex`: Binary (`Female`, `Male`).
- `income_poverty`: 3 levels plus missing.
- `marital_status`: `Married` or `Not Married` plus missing.
- `rent_or_own`: `Own` or `Rent` plus missing.
- `employment_status`: 3 categories plus missing.
- `hhs_geo_region`: 10 region codes (nominal, encoded as text).
- `census_msa`: 3 metro status categories.
- `employment_industry` & `employment_occupation`: Many coded categories, ~50% missing.
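Beyond listing the unique labels, it can help to count how often each one occurs, including the missing responses. The sketch below uses a four-row toy frame. The final whitespace clean-up is an optional step suggested by the double space visible in one of the `census_msa` labels in the output above, not something the original analysis performs.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing education value and one label containing a double space.
toy = pd.DataFrame({
    "education": ["12 Years", np.nan, "Some College", "12 Years"],
    "census_msa": ["Non-MSA", "MSA, Not Principle  City", "Non-MSA", "Non-MSA"],
})

for col in toy.columns:
    # dropna=False keeps NaN as its own row so missing responses are counted too.
    print(toy[col].value_counts(dropna=False), end="\n\n")

# Optional tidy-up: collapse repeated whitespace inside labels.
toy["census_msa"] = toy["census_msa"].str.replace(r"\s+", " ", regex=True)
```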




✅ 7️⃣ Basic target distribution

In [119]:
# Check target variable distribution
print(df['h1n1_vaccine'].value_counts())
print(df['seasonal_vaccine'].value_counts())


h1n1_vaccine
0    21033
1     5674
Name: count, dtype: int64
seasonal_vaccine
0    14272
1    12435
Name: count, dtype: int64


### Target Variable Distribution

**H1N1 Vaccine:**  
- `0` → 21,033 respondents **did not receive** the H1N1 vaccine.
- `1` → 5,674 respondents **did receive** the H1N1 vaccine.
- This shows **class imbalance**: about **79% did not vaccinate**.

**Seasonal Flu Vaccine:**  
- `0` → 14,272 respondents **did not receive** the seasonal flu vaccine.
- `1` → 12,435 respondents **did receive** the seasonal flu vaccine.
- This is more balanced: about **53% did not vaccinate** vs **47% did**.
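The percentages quoted above follow directly from the counts. As a quick check, the sketch below rebuilds the two count tables by hand (numbers copied from the `value_counts()` output) and converts them to shares.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
h1n1 = pd.Series({0: 21033, 1: 5674}, name="h1n1_vaccine")
seasonal = pd.Series({0: 14272, 1: 12435}, name="seasonal_vaccine")

# Share of each class, rounded to three decimals.
h1n1_share = (h1n1 / h1n1.sum()).round(3)
seasonal_share = (seasonal / seasonal.sum()).round(3)

print(h1n1_share)
print(seasonal_share)
```

On the full DataFrame, `df['h1n1_vaccine'].value_counts(normalize=True)` gives the same shares in one call.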




# 3 Data Preparation
### 1️⃣ Handle Missing Values
Identify all missing values


In [120]:
# Total missing by column with count and percentage
missing_count = df.isnull().sum()
missing_count = missing_count[missing_count > 0].sort_values(ascending=False)

missing_percent = (df.isnull().sum() / len(df)) * 100
missing_percent = missing_percent[missing_percent > 0].sort_values(ascending=False)

# Combine into a DataFrame for better display
missing_data = pd.DataFrame({
    'Count': missing_count,
    'Percentage': missing_percent
})

print(missing_data)

                             Count  Percentage
employment_occupation        13470   50.436215
employment_industry          13330   49.912008
health_insurance             12274   45.957989
income_poverty                4423   16.561201
doctor_recc_h1n1              2160    8.087767
doctor_recc_seasonal          2160    8.087767
rent_or_own                   2042    7.645936
employment_status             1463    5.477965
marital_status                1408    5.272026
education                     1407    5.268282
chronic_med_condition          971    3.635751
child_under_6_months           820    3.070356
health_worker                  804    3.010447
opinion_seas_sick_from_vacc    537    2.010709
opinion_seas_risk              514    1.924589
opinion_seas_vacc_effective    462    1.729884
opinion_h1n1_sick_from_vacc    395    1.479013
opinion_h1n1_vacc_effective    391    1.464036
opinion_h1n1_risk              388    1.452803
household_adults               249    0.932340
household_chi

👉 1️⃣ Very High Missing (≈50%)
- `employment_occupation` and `employment_industry`: too many values are missing to drop the rows safely. Keep the columns and encode missing as `"Unknown"`, since the fact that a respondent did not report may itself be informative.

👉 2️⃣ Medium Missing (10–50%)
- `health_insurance` (~46%): binary (0/1). Missing most likely means the respondent did not answer. We fill with the mode for simplicity, although with nearly half the values missing this is a strong assumption.
- `income_poverty` (~17%): ordinal categorical. Either the mode or `"Unknown"` keeps the row; we use `"Unknown"` because income is a sensitive question.

👉 3️⃣ Low Missing (<10%)
- `doctor_recc_h1n1`, `doctor_recc_seasonal` (binary): fill with mode
- `rent_or_own` (categorical): fill with mode
- `employment_status` (categorical): fill with mode
- `marital_status` (categorical): fill with mode
- `education` (ordinal): fill with mode
- `chronic_med_condition` (binary): fill with mode
- `child_under_6_months` (binary): fill with mode
- `health_worker` (binary): fill with mode
- `opinion_*` (ordinal): fill with mode
- `household_*` (numeric count): fill with median
- `behavioral_*` (binary): fill with mode
- `h1n1_knowledge`, `h1n1_concern` (ordinal): fill with mode
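Because non-response may be informative, one optional variation on the plan is to record which rows were missing before filling them. This is a sketch of that idea on toy data, not a step the notebook performs; the column name `health_insurance_missing` is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy column with two missing answers.
toy = pd.DataFrame({"health_insurance": [1.0, np.nan, 0.0, 1.0, np.nan]})

# Record which rows were missing before imputing (hypothetical column name).
toy["health_insurance_missing"] = toy["health_insurance"].isnull().astype(int)

# Then fill with the most frequent observed value, as the plan describes.
mode_value = toy["health_insurance"].mode()[0]
toy["health_insurance"] = toy["health_insurance"].fillna(mode_value)

print(toy)
```

The flag lets a model distinguish a reported answer from an imputed one, at the cost of one extra column per flagged feature.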

In [121]:
# Group 1: High missing => Fill with 'Unknown'
df['employment_occupation'] = df['employment_occupation'].fillna('Unknown')
df['employment_industry'] = df['employment_industry'].fillna('Unknown')

# Group 2: Medium missing
df['health_insurance'] = df['health_insurance'].fillna(df['health_insurance'].mode()[0])
df['income_poverty'] = df['income_poverty'].fillna('Unknown')

# Group 3: Low missing
binary_cols = [
    'doctor_recc_h1n1', 'doctor_recc_seasonal',
    'chronic_med_condition', 'child_under_6_months',
    'health_worker'
]

for col in binary_cols:
    df[col] = df[col].fillna(df[col].mode()[0])

categorical_cols = [
    'rent_or_own', 'employment_status',
    'marital_status', 'education'
]

for col in categorical_cols:
    df[col] = df[col].fillna(df[col].mode()[0])

# Opinion variables => ordinal, use mode
opinion_cols = [
    'opinion_seas_sick_from_vacc', 'opinion_seas_risk',
    'opinion_seas_vacc_effective', 'opinion_h1n1_sick_from_vacc',
    'opinion_h1n1_vacc_effective', 'opinion_h1n1_risk'
]

for col in opinion_cols:
    df[col] = df[col].fillna(df[col].mode()[0])

# Household counts => numeric
df['household_adults'] = df['household_adults'].fillna(df['household_adults'].median())
df['household_children'] = df['household_children'].fillna(df['household_children'].median())

# Behavioral columns => binary, fill with mode
behavioral_cols = [
    'behavioral_avoidance', 'behavioral_touch_face',
    'behavioral_large_gatherings', 'behavioral_outside_home',
    'behavioral_antiviral_meds', 'behavioral_wash_hands',
    'behavioral_face_mask'
]

for col in behavioral_cols:
    df[col] = df[col].fillna(df[col].mode()[0])

# Remaining ordinal
df['h1n1_concern'] = df['h1n1_concern'].fillna(df['h1n1_concern'].mode()[0])
df['h1n1_knowledge'] = df['h1n1_knowledge'].fillna(df['h1n1_knowledge'].mode()[0])

# ✅ Final check
print("Remaining missing values:", df.isnull().sum().sum())


Remaining missing values: 0


1️⃣ Binary / Ordinal variables: float to int

After imputation these columns hold only whole numbers, so they can be stored as integers:

- All `behavioral_*`
- `doctor_recc_*`
- `chronic_med_condition`
- `child_under_6_months`
- `health_worker`
- `health_insurance`
- `h1n1_concern`, `h1n1_knowledge`
- All `opinion_*`
- `household_adults`, `household_children` (counts)

In [122]:
# Binary & ordinal variables
binary_ordinal_cols = [
    'h1n1_concern', 'h1n1_knowledge',
    'behavioral_antiviral_meds', 'behavioral_avoidance', 'behavioral_face_mask',
    'behavioral_wash_hands', 'behavioral_large_gatherings',
    'behavioral_outside_home', 'behavioral_touch_face',
    'doctor_recc_h1n1', 'doctor_recc_seasonal',
    'chronic_med_condition', 'child_under_6_months',
    'health_worker', 'health_insurance',
    'household_adults', 'household_children',
    'opinion_h1n1_vacc_effective', 'opinion_h1n1_risk',
    'opinion_h1n1_sick_from_vacc', 'opinion_seas_vacc_effective',
    'opinion_seas_risk', 'opinion_seas_sick_from_vacc'
]

for col in binary_ordinal_cols:
    df[col] = df[col].astype('Int64')  # allows NA handling too
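The capitalised `Int64` is pandas' nullable integer dtype, which differs from NumPy's `int64`. A short sketch on a toy Series shows the difference; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 0.0, np.nan])

# pandas nullable integer: the NaN is kept as <NA>.
nullable = s.astype("Int64")

try:
    # NumPy integer: cannot represent a missing value, so the cast fails.
    s.astype("int64")
    numpy_cast_ok = True
except (ValueError, TypeError):
    numpy_cast_ok = False

print(nullable)
print("Plain int64 cast succeeded:", numpy_cast_ok)
```

Since the notebook has already filled every missing value at this point, either dtype would work here; `Int64` is simply the more forgiving choice.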


2️⃣ Categorical variables: convert object to category

In [123]:
# Categorical features
# Convert categorical columns to 'category' dtype
categorical_cols = [
    'age_group', 'education', 'race', 'sex',
    'income_poverty', 'marital_status',
    'rent_or_own', 'employment_status',
    'employment_industry', 'employment_occupation',
    'hhs_geo_region', 'census_msa'
]

for col in categorical_cols:
    df[col] = df[col].astype('category')
# Final info to confirm the changes
df.info()



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26707 entries, 0 to 26706
Data columns (total 38 columns):
 #   Column                       Non-Null Count  Dtype   
---  ------                       --------------  -----   
 0   respondent_id                26707 non-null  int64   
 1   h1n1_concern                 26707 non-null  Int64   
 2   h1n1_knowledge               26707 non-null  Int64   
 3   behavioral_antiviral_meds    26707 non-null  Int64   
 4   behavioral_avoidance         26707 non-null  Int64   
 5   behavioral_face_mask         26707 non-null  Int64   
 6   behavioral_wash_hands        26707 non-null  Int64   
 7   behavioral_large_gatherings  26707 non-null  Int64   
 8   behavioral_outside_home      26707 non-null  Int64   
 9   behavioral_touch_face        26707 non-null  Int64   
 10  doctor_recc_h1n1             26707 non-null  Int64   
 11  doctor_recc_seasonal         26707 non-null  Int64   
 12  chronic_med_condition        26707 non-null  Int64   
 13  c

In [124]:
# Total missing by column with count and percentage
missing_count = df.isnull().sum()
missing_count = missing_count[missing_count > 0].sort_values(ascending=False)

missing_percent = (df.isnull().sum() / len(df)) * 100
missing_percent = missing_percent[missing_percent > 0].sort_values(ascending=False)

# Combine into a DataFrame for better display
missing_data = pd.DataFrame({
    'Count': missing_count,
    'Percentage': missing_percent
})

print(missing_data)

Empty DataFrame
Columns: [Count, Percentage]
Index: []


In [125]:
# Summary DataFrame of unique values
unique_summary = pd.DataFrame({
    'Unique_Count': df.nunique(),
    'Unique_Percentage': (df.nunique() / len(df)) * 100
}).round(2)

print(unique_summary)

                             Unique_Count  Unique_Percentage
respondent_id                       26707             100.00
h1n1_concern                            4               0.01
h1n1_knowledge                          3               0.01
behavioral_antiviral_meds               2               0.01
behavioral_avoidance                    2               0.01
behavioral_face_mask                    2               0.01
behavioral_wash_hands                   2               0.01
behavioral_large_gatherings             2               0.01
behavioral_outside_home                 2               0.01
behavioral_touch_face                   2               0.01
doctor_recc_h1n1                        2               0.01
doctor_recc_seasonal                    2               0.01
chronic_med_condition                   2               0.01
child_under_6_months                    2               0.01
health_worker                           2               0.01
health_insurance        

- respondent_id → unique for every row **(not useful as a feature!)**
- Many numeric columns have 2 unique values → binary **(OK as 0/1)**
- Some numeric columns have 3–5 → ordinal **(keep as int)**
- The `category` columns, with anywhere from 2 to more than 20 levels, hold text labels **(need encoding)**
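For the nominal categories, one common encoding is one-hot encoding with `pd.get_dummies`. The sketch below applies it to a three-row toy frame. It is one possible approach under that assumption, not necessarily the encoding used later in this project.

```python
import pandas as pd

# Toy frame with two nominal columns and one ordinal integer column.
toy = pd.DataFrame({
    "race": ["White", "Black", "Hispanic"],
    "sex": ["Female", "Male", "Female"],
    "h1n1_concern": [1, 3, 2],
})

# One indicator column per category; other columns pass through unchanged.
encoded = pd.get_dummies(toy, columns=["race", "sex"], dtype=int)

print(encoded.columns.tolist())
```

High-cardinality columns such as `employment_industry` would add one column per level, which is worth keeping in mind when choosing an encoder.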


Drop respondent_id

In [126]:
df = df.drop('respondent_id', axis=1)
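With the identifier removed, the frame can be split into a shared feature matrix and one target vector per vaccine, matching the two binary classification problems set out in Section 1. The sketch below shows the split on toy data; the names `X`, `y_h1n1`, and `y_seasonal` are illustrative assumptions.

```python
import pandas as pd

# Toy frame with two features and the two target columns.
toy = pd.DataFrame({
    "h1n1_concern": [1, 3, 2],
    "health_worker": [0, 1, 0],
    "h1n1_vaccine": [0, 1, 0],
    "seasonal_vaccine": [0, 1, 1],
})

target_cols = ["h1n1_vaccine", "seasonal_vaccine"]

X = toy.drop(columns=target_cols)      # shared feature matrix
y_h1n1 = toy["h1n1_vaccine"]           # target for the H1N1 classifier
y_seasonal = toy["seasonal_vaccine"]   # target for the seasonal flu classifier

print(X.shape, y_h1n1.shape, y_seasonal.shape)
```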


# 4. Modelling