# **Project Overview with Code Walkthrough: COVID-19 Vaccination Classification using KNN**

### **1. Objective**
The goal of this project is to classify individuals into *“fully vaccinated”* or *“not fully vaccinated”* categories using survey data collected in Canada during 2022. The machine learning model is developed using the K-Nearest Neighbors (KNN) algorithm, supported by robust data preprocessing and exploratory data analysis (EDA).

### **2. Data Loading and Initial Inspection**

Begin by importing the necessary libraries and loading the dataset.

In [None]:
import pandas as pd
import numpy as np
df = pd.read_csv("COVID-19BehaviorData_CAN2022.csv", low_memory=False)
df.info()

### **3. Data Cleaning and Preprocessing**
#### **Whitespace and Placeholder Cleanup**

To standardize the dataset, strip leading and trailing whitespace from string cells and convert placeholder values (empty strings and `__NA__`) to NaN:

In [None]:
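# Strip surrounding whitespace from string cells; leave other types unchanged.
df1 = df.map(lambda x: x.strip() if isinstance(x, str) else x)
# Treat empty strings and the '__NA__' placeholder as missing values.
df1 = df1.replace(['', '__NA__'], np.nan)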
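
The effect of this cleanup can be checked on a small frame. The sketch below uses made-up column names (`q1`, `q2`) and values, and it assumes that `DataFrame.map` may be unavailable on pandas versions older than 2.1, in which case it falls back to `applymap`.

```python
import numpy as np
import pandas as pd

# Toy data with padded strings and both placeholder styles (values are made up).
raw = pd.DataFrame({"q1": [" Yes ", "", "No"], "q2": ["__NA__", 3, " 4"]})

def strip_if_str(value):
    """Strip whitespace from strings; return any other type unchanged."""
    return value.strip() if isinstance(value, str) else value

# DataFrame.map was added in pandas 2.1; older versions provide applymap.
elementwise = raw.map if hasattr(raw, "map") else raw.applymap
cleaned = elementwise(strip_if_str).replace(["", "__NA__"], np.nan)

print(cleaned)
```

Non-string cells such as the integer `3` pass through untouched, so numeric columns that were read correctly are not affected.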

#### **Column Filtering**

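Columns in which more than 45% of the values are missing are dropped. The thresh argument keeps only columns with at least 55% non-missing values:

In [None]:
# thresh is a count of non-missing values, so round the 55% cutoff up to an integer.
df_cleaned = df1.dropna(axis=1, thresh=int(np.ceil(len(df1) * 0.55)))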
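
To see how the threshold behaves, here is a minimal sketch on a toy frame. The column names and values are invented for illustration; only the 55% rule is taken from the step above.

```python
import numpy as np
import pandas as pd

# Toy frame: "a" is complete, "b" is 50% missing, "c" is 75% missing.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [1, np.nan, 3, np.nan],
    "c": [np.nan, np.nan, np.nan, 4],
})

# Fraction of missing values per column.
missing_share = toy.isna().mean()

# Keep columns with at least 55% non-missing values.
min_non_missing = int(np.ceil(len(toy) * 0.55))
kept = toy.dropna(axis=1, thresh=min_non_missing)

print(missing_share.to_dict())
print(list(kept.columns))
```

Printing `missing_share` before dropping is a cheap way to confirm which columns fall on either side of the cutoff.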

### **4. Applying Metadata Instructions for Value Mapping**

Load the ins.xlsx file, which contains mappings for transforming categorical values into standardized numeric formats:

In [None]:
ins = pd.read_excel("ins.xlsx")
ins.replace(np.nan, '', inplace=True)

Locate valid value mappings and apply them: 

In [None]:
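def filter_valid_col(col, val):
    # Keep only the row indices for which (index + 5) appears in val.
    return [index for index in col if (index + 5) in val]

def find_value(col_id, value, ins_data):
    # The value mappings for a column start 5 rows below the row that names it.
    vvi = col_id + 5
    # Scan the mapping block until the first empty label (column index 2).
    while vvi < len(ins_data) and ins_data.iloc[vvi, 2] != '':
        if value == ins_data.iloc[vvi, 2]:
            return int(ins_data.iloc[vvi, 1])  # numeric code in column index 1
        vvi += 1
    return value  # no matching label: leave the value unchanged

def filler(oridata, columnidlist, ins_data):
    # Recode, in place, every data column named in the instruction sheet.
    for index in columnidlist:
        column_name = ins_data.iloc[index, 0]
        oridata[column_name] = oridata[column_name].apply(
            lambda x: find_value(index, x, ins_data) if pd.notna(x) else x)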

These functions replace string values with their corresponding numeric codes so that the columns can be used as model inputs.
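
The same recoding idea can be illustrated without the instruction sheet. The sketch below is a simplified stand-in, not the project's method: it assumes a hypothetical `label_to_code` dictionary in place of the mappings stored in ins.xlsx, and the labels and codes are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical label-to-code lookup; the real codes come from ins.xlsx.
label_to_code = {"Yes": 1, "No": 2}

answers = pd.Series(["Yes", "No", np.nan, "Maybe"])

# Map known labels to codes, keep unknown labels as they are,
# and leave missing values untouched.
encoded = answers.map(lambda v: label_to_code.get(v, v) if pd.notna(v) else v)

print(encoded.tolist())
```

As in `find_value`, a label with no mapping is returned unchanged, which makes leftover strings easy to spot afterwards.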



### **5. Additional Missing Data Handling**

After the replacements, treat remaining special answers such as "Don't know" as missing:

In [None]:
newdf_nona['household_children'] = newdf_nona['household_children'].replace("Don't know", np.nan)

Remaining NaN values are imputed using random sampling from non-missing values in the same column:

In [None]:
def NA_filler(selected_column, data):
    for name in selected_column:
        possible_values = data[name].dropna().values
        data[name] = data[name].apply(lambda x: np.random.choice(possible_values) if pd.isna(x) else x)
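
Because the imputation is random, results change from run to run unless the random state is fixed. The following is a sketch of a seeded, vectorized variant. The function name `fill_na_by_sampling` and its `seed` parameter are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def fill_na_by_sampling(data, columns, seed=0):
    """Fill NaNs in each column with random draws from its observed values."""
    rng = np.random.default_rng(seed)  # seeded for reproducible imputation
    out = data.copy()                  # leave the input frame unchanged
    for name in columns:
        missing = out[name].isna()
        observed = out.loc[~missing, name].to_numpy()
        if missing.any() and len(observed) > 0:
            # Draw one replacement per missing cell, with replacement.
            out.loc[missing, name] = rng.choice(observed, size=int(missing.sum()))
    return out

toy = pd.DataFrame({"x": [1.0, np.nan, 3.0, np.nan, 1.0]})
filled = fill_na_by_sampling(toy, ["x"])
print(filled["x"].tolist())
```

Sampling from the observed values roughly preserves each column's distribution, which is the motivation for this style of imputation over filling with a single constant.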

### **6. Region Value Normalization**

Survey responses in the region column were incorrectly tagged with UK region names. Each distinct label is therefore replaced by a 1-based integer index:

In [None]:
provinces = final_cleaned_data['region'].unique().tolist()
final_cleaned_data['region'] = final_cleaned_data['region'].apply(lambda x: provinces.index(x)+1 if isinstance(x, str) else x)
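
One detail worth checking: `unique()` includes NaN in its result, so if the column still contains missing values the integer codes may skip a number. A possible alternative, sketched here on made-up region labels, is `pd.factorize`, which assigns codes only to non-missing values.

```python
import numpy as np
import pandas as pd

# Made-up labels standing in for the region column.
regions = pd.Series(["North", "South", "North", np.nan, "East"])

# factorize gives 0-based codes in order of first appearance and -1 for NaN.
codes, labels = pd.factorize(regions)

# Convert to 1-based codes and restore NaN for the missing entries.
codes_series = pd.Series(codes, index=regions.index, dtype="float64")
region_codes = codes_series.where(codes_series >= 0) + 1

print(list(labels))
print(region_codes.tolist())
```

Note that either encoding imposes an arbitrary order on the regions, which matters for a distance-based model such as KNN if the column is used as a predictor.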

### **7. Target Variable Definition and Re-mapping**

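The vaccination status column vac is collapsed into a binary target: code 2 is merged into class 1, and code 3 becomes class 2:

In [None]:
# Assign the result back; calling replace(..., inplace=True) on a selected column
# is deprecated in recent pandas and may not modify df.
df["vac"] = df["vac"].replace({2: 1, 3: 2})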

### **8. Exploratory Data Analysis (EDA)**

Visualize class distribution:

In [None]:
import matplotlib.pyplot as plt
df["vac"].value_counts().plot(kind='bar', title="Vaccination Status")
plt.show()

Generate correlation matrix:

In [None]:
correlation_matrix = df.select_dtypes(include=['number']).corr()
response_correlation = correlation_matrix['vac'].sort_values(ascending=False)
print(response_correlation)

#### **Predictor Selection and Crosstab Analysis**

Selected predictors:

In [None]:
predictors = ['vac7','r1_8','vac_man_1','vac_man_4', 'vac_man_5', 'vac2_7','vac2_3']

Cross-tab visualization:

In [None]:
for predictor in predictors:
    crosstab = pd.crosstab(df[predictor], df['vac'])
    crosstab.div(crosstab.sum(axis=1), axis=0).plot(kind='bar', stacked=True)
    plt.title(f'Crosstab: {predictor} vs vac')
    plt.show()

### **9. Model Development with K-Nearest Neighbors (KNN)**

#### **Train-Test Split**

In [None]:
from sklearn.model_selection import train_test_split
X = df[predictors]
y = df['vac']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

#### **Feature Scaling**

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(X_train)
rescaledX_train = scaler.transform(X_train)
rescaledX_test = scaler.transform(X_test)

#### **Model Tuning and Evaluation**


Tune k with GridSearchCV and 10-fold cross-validation, trying every k from 1 to 79:

In [None]:
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neighbors import KNeighborsClassifier
k_values = np.arange(1, 80)
param_grid = {'n_neighbors': k_values}
kfold = KFold(n_splits=10, shuffle=True, random_state=42)
grid = GridSearchCV(KNeighborsClassifier(), param_grid, scoring='accuracy', cv=kfold)
grid_result = grid.fit(rescaledX_train, y_train)
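
A caveat with this setup is that the scaler was fitted on the whole training set before cross-validation, so each validation fold has already influenced the scaling. The effect is often small, but it can be avoided by placing the scaler inside a Pipeline. The sketch below uses synthetic data from `make_classification` as a stand-in for the survey predictors, and a smaller grid and fold count to keep it quick. All names ending in `_demo` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the seven survey predictors (not the real data).
X_demo, y_demo = make_classification(n_samples=200, n_features=7, random_state=42)

# Inside the pipeline, the scaler is refit on the training part of every fold.
pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])
param_grid_demo = {"knn__n_neighbors": np.arange(1, 21)}
cv_demo = KFold(n_splits=5, shuffle=True, random_state=42)

search = GridSearchCV(pipe, param_grid_demo, scoring="accuracy", cv=cv_demo)
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```

With a pipeline, the unscaled training data is passed to `fit`, and the fitted `search` object applies the same scaling automatically at prediction time.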

Plotting accuracy scores:

In [None]:
import matplotlib.pyplot as plt
plt.plot(k_values, grid_result.cv_results_['mean_test_score'])
plt.xlabel('Number of Neighbors k')
plt.ylabel('Accuracy')
plt.title('KNN Hyperparameter Tuning')
plt.show()

### **10. Final Model and Classification Report**

Train final model with best k:

In [None]:
best_k = grid_result.best_params_['n_neighbors']
model = KNeighborsClassifier(n_neighbors=best_k)
model.fit(rescaledX_train, y_train)
predictions = model.predict(rescaledX_test)

Generate classification summary:

In [None]:
from sklearn.metrics import classification_report, accuracy_score
from dmba import classificationSummary
print("Accuracy:", accuracy_score(y_test, predictions))
classificationSummary(y_test, predictions)
print("Classification report:\n", classification_report(y_test, predictions))
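
If the dmba package is not installed, a similar summary can be built with scikit-learn and pandas alone. This is a minimal sketch on made-up labels that follow the 1/2 coding of the remapped target. It is not the output of the model above.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Toy labels using the same 1/2 coding as the remapped target (made up).
y_true_demo = [1, 1, 1, 2, 2, 1]
y_pred_demo = [1, 1, 2, 2, 1, 1]

labels = [1, 2]
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=labels)

# Rows are actual classes, columns are predicted classes.
cm_table = pd.DataFrame(
    cm,
    index=[f"actual {c}" for c in labels],
    columns=[f"predicted {c}" for c in labels],
)
print(cm_table)
```

Reading the confusion matrix alongside the accuracy is useful when the classes are imbalanced, since a high accuracy can hide poor recall on the smaller class.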

### **11. Conclusion**
This project cleans and preprocesses the survey data, recodes categorical answers, imputes missing values, selects predictors based on correlation and crosstab analysis, and fits a KNN classifier to predict COVID-19 vaccination status in a Canadian population. Scaling the predictors and tuning k with 10-fold cross-validation complete the classification pipeline, which is evaluated on a held-out 30% test set.