In [2]:
from google.colab import drive
drive.mount('/content/drive')

ValueError: mount failed

In [None]:
!nvidia-smi

## Problem description
Your goal is to predict how likely individuals are to receive their H1N1 and seasonal flu vaccines. Specifically, you'll be predicting two probabilities: one for `h1n1_vaccine` and one for `seasonal_vaccine`.

Each row in the dataset represents one person who responded to the National 2009 H1N1 Flu Survey.

For this competition, there are two target variables:

- `h1n1_vaccine` - Whether respondent received H1N1 flu vaccine.
- `seasonal_vaccine` - Whether respondent received seasonal flu vaccine.

Both are binary variables: 0 = No; 1 = Yes. Some respondents didn't get either vaccine, others got only one, and some got both. This is formulated as a multilabel (and not multiclass) problem.
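
To make the multilabel framing concrete, here is a minimal sketch with made-up labels (the five rows are illustrative and not taken from the survey). Each respondent carries two independent binary targets, so all four combinations can occur, whereas a multiclass target would force exactly one class per row.

```python
import pandas as pd

# Toy labels (hypothetical values, for illustration only)
toy_labels = pd.DataFrame({
    "h1n1_vaccine":     [0, 0, 1, 1, 0],
    "seasonal_vaccine": [0, 1, 0, 1, 1],
})

# Count how often each (h1n1, seasonal) combination occurs
combo_counts = pd.crosstab(toy_labels["h1n1_vaccine"], toy_labels["seasonal_vaccine"])
print(combo_counts)
```

Running the same cross-tabulation on the real label frame should show how common each combination is in the survey.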

### The features in this dataset
You are provided a dataset with 36 columns. The first column, `respondent_id`, is a unique and random identifier. The remaining 35 features are described below.

For all binary variables: 0 = No; 1 = Yes.

- `h1n1_concern` - Level of concern about the H1N1 flu.
0 = Not at all concerned; 1 = Not very concerned; 2 = Somewhat concerned; 3 = Very concerned. -> (__categorical__ __Ordinal__)
- `h1n1_knowledge` - Level of knowledge about H1N1 flu.
0 = No knowledge; 1 = A little knowledge; 2 = A lot of knowledge. -> (__categorical__ __Ordinal__)
- `behavioral_antiviral_meds` - Has taken antiviral medications. (binary) -> (__categorical__ __Nominal__)
- `behavioral_avoidance` - Has avoided close contact with others with flu-like symptoms. (binary) -> (__categorical__ __Nominal__)
- `behavioral_face_mask` - Has bought a face mask. (binary) -> (__categorical__ __Nominal__)
- `behavioral_wash_hands` - Has frequently washed hands or used hand sanitizer. (binary) -> (__categorical__ __Nominal__)
- `behavioral_large_gatherings` - Has reduced time at large gatherings. -> (__categorical__ __Nominal__)
- `behavioral_outside_home` - Has reduced contact with people outside of own household. -> (__categorical__ __Nominal__)
- `behavioral_touch_face` - Has avoided touching eyes, nose, or mouth. -> (__categorical__ __Nominal__)
- `doctor_recc_h1n1` - H1N1 flu vaccine was recommended by doctor. -> (__categorical__ __Nominal__)
- `doctor_recc_seasonal` - Seasonal flu vaccine was recommended by doctor. -> (__categorical__ __Nominal__)
- `chronic_med_condition` - Has any of the following chronic medical conditions: asthma or another lung condition, diabetes, a heart condition, a kidney condition, sickle cell anemia or other anemia, a neurological or neuromuscular condition, a liver condition, or a weakened immune system caused by a chronic illness or by medicines taken for a chronic illness. -> (__categorical__ __Nominal__)
- `child_under_6_months` - Has regular close contact with a child under the age of six months. -> (__categorical__ __Nominal__)
- `health_worker` - Is a healthcare worker. -> (__categorical__ __Nominal__)
- `health_insurance` - Has health insurance. -> (__categorical__ __Nominal__)
- `opinion_h1n1_vacc_effective` - Respondent's opinion about H1N1 vaccine effectiveness.
1 = Not at all effective; 2 = Not very effective; 3 = Don't know; 4 = Somewhat effective; 5 = Very effective. -> (__categorical__ __Ordinal__)
- `opinion_h1n1_risk` - Respondent's opinion about risk of getting sick with H1N1 flu without vaccine.
1 = Very Low; 2 = Somewhat low; 3 = Don't know; 4 = Somewhat high; 5 = Very high. -> (__categorical__ __Ordinal__)
- `opinion_h1n1_sick_from_vacc` - Respondent's worry of getting sick from taking H1N1 vaccine.
1 = Not at all worried; 2 = Not very worried; 3 = Don't know; 4 = Somewhat worried; 5 = Very worried. -> (__categorical__ __Ordinal__)
- `opinion_seas_vacc_effective` - Respondent's opinion about seasonal flu vaccine effectiveness.
1 = Not at all effective; 2 = Not very effective; 3 = Don't know; 4 = Somewhat effective; 5 = Very effective. -> (__categorical__ __Ordinal__)
- `opinion_seas_risk` - Respondent's opinion about risk of getting sick with seasonal flu without vaccine.
1 = Very Low; 2 = Somewhat low; 3 = Don't know; 4 = Somewhat high; 5 = Very high. -> (__categorical__ __Ordinal__)
- `opinion_seas_sick_from_vacc` - Respondent's worry of getting sick from taking seasonal flu vaccine.
1 = Not at all worried; 2 = Not very worried; 3 = Don't know; 4 = Somewhat worried; 5 = Very worried. -> (__categorical__ __Ordinal__)
- `age_group` - Age group of respondent.

 ```
 array(['55 - 64 Years', '35 - 44 Years', '18 - 34 Years', '65+ Years', '45 - 54 Years'], dtype=object)
 ```
  -> (__categorical__ __Ordinal__)
- `education` - Self-reported education level. -> (__categorical__ __Ordinal__)
- `race` - Race of respondent. -> (__categorical__ __Nominal__)
- `sex` - Sex of respondent. -> (__categorical__ __Nominal__)
- `income_poverty` - Household annual income of respondent with respect to 2008 Census poverty thresholds. -> (__categorical__)
- `marital_status` - Marital status of respondent. -> (__categorical__ __Nominal__)
- `rent_or_own` - Housing situation of respondent. -> (__categorical__ __Nominal__)
- `employment_status` - Employment status of respondent. -> (__categorical__ __Nominal__)
- `hhs_geo_region` - Respondent's residence using a 10-region geographic classification defined by the U.S. Dept. of Health and Human Services. Values are represented as short random character strings. -> (__categorical__ __Nominal__)
- `census_msa` - Respondent's residence within metropolitan statistical areas (MSA) as defined by the U.S. Census. -> (__categorical__ __Nominal__)
- `household_adults` - Number of other adults in household, top-coded to 3. -> (__categorical__ __Ordinal__)
- `household_children` - Number of children in household, top-coded to 3. -> (__categorical__ __Ordinal__)
- `employment_industry` - Type of industry respondent is employed in. Values are represented as short random character strings. -> (__categorical__ __Nominal__)
- `employment_occupation` - Type of occupation of respondent. Values are represented as short random character strings. -> (__categorical__ __Nominal__)

As shown above, most predictor variables are categorical, some being _Ordinal_ and others _Nominal_.
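
The ordinal/nominal distinction can be sketched with pandas categoricals. The snippet below is a minimal illustration on toy values (the three-row series are hypothetical): an ordered categorical yields integer codes that follow the ranking, while an unordered one only labels groups.

```python
import pandas as pd

# Age brackets from the feature description, listed in ascending order
age_order = ["18 - 34 Years", "35 - 44 Years", "45 - 54 Years", "55 - 64 Years", "65+ Years"]

# Toy values (hypothetical sample)
ages = pd.Series(["55 - 64 Years", "18 - 34 Years", "65+ Years"])
sexes = pd.Series(["Female", "Male", "Female"])

# Ordinal: the category order is meaningful, so the integer codes follow the ranking
ages_ordinal = pd.Categorical(ages, categories=age_order, ordered=True)
print(ages_ordinal.codes)      # position of each value within age_order

# Nominal: categories are plain labels, so the codes carry no ranking
sexes_nominal = pd.Categorical(sexes)
print(sexes_nominal.ordered)
```

This is why ordinal features can later be mapped to integers, while nominal ones are better one-hot encoded.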

In [None]:
#  All necessary imports for data preprocessing
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
import numpy as np
from scipy.stats import skew, kurtosis
%matplotlib inline
warnings.filterwarnings("ignore")


In [None]:
# load data
path = "./drive/MyDrive/data/"
train_features_df = pd.read_csv(path + "training_set_features.csv")
train_labels_df = pd.read_csv(path + "training_set_labels.csv")

In [None]:
train_features_df.head()

In [None]:
train_labels_df.head()

In [None]:
train_features_df.shape, train_labels_df.shape

In [None]:
train_features_df.info(verbose=True, show_counts=True)

In [None]:
train_labels_df.info(verbose=True, show_counts=True)

In [None]:
# check the unique values of selected features
cols_to_inspect = [
    "age_group", "education", "race", "sex", "income_poverty",
    "marital_status", "rent_or_own", "employment_status", "hhs_geo_region",
    "household_adults", "household_children", "census_msa",
    "employment_industry", "employment_occupation",
]
for col in cols_to_inspect:
    print(f"- {col}: ", train_features_df[col].unique())
    print()

In [None]:
# combine features and labels into a single frame for preprocessing
train_data = pd.merge(train_features_df, train_labels_df, on="respondent_id")
train_data.head()

In [None]:
# Check for missing values
print("Dataframe shape: ", train_data.shape)
print("\nMissing Values in Training Features:")
print(train_data.isnull().sum())

# Preprocessing (data cleaning)



## Handle missing values
- **Step 1.** Remove the features with an excessive share of missing values (here, any column where at least half of the rows are missing).

In [None]:
# Drop columns with excessive missing values
missing_threshold = 0.5
train_data = train_data.loc[:, train_data.isnull().mean() < missing_threshold]

print("New dataframe shape: ", train_data.shape)
print("\nMissing Values in Training Features:")
print(train_data.isnull().sum())

- **Step 2. Imputation with "Unknown" or Mode**
  - **Imputation with "Unknown" Category**

    For nominal features (categorical features with no inherent order or ranking), it's common to replace missing values with a placeholder category such as **"Unknown"**. This keeps every row and preserves the categorical nature of the feature.

    > **Why?** Since nominal categories don't have a natural order, introducing a new category for missing data ensures that no assumptions are made about the missing values.

  - **Imputation with Mode**

    For ordinal features (categorical features with a natural order or ranking), we impute missing values with the **mode** (the most frequent category). This keeps the feature on its original ordered scale while filling the gaps with the most likely level.

    > **Why?** An "Unknown" level has no position on an ordered scale, whereas the mode is always a valid level, so the order of the categories is preserved.
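
The two strategies can be sketched on a toy frame before applying them to the survey data. The values below are hypothetical; only the `SimpleImputer` strategies mirror what this notebook uses.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame (hypothetical values): one nominal and one ordinal column, each with a gap
toy = pd.DataFrame({
    "race": ["White", np.nan, "Black", "White"],
    "h1n1_concern": [2.0, 2.0, np.nan, 0.0],
})

# Nominal: missing values become their own "Unknown" category
nominal_imputer = SimpleImputer(strategy="constant", fill_value="Unknown")
race_filled = nominal_imputer.fit_transform(toy[["race"]]).ravel()

# Ordinal: missing values take the most frequent observed level
ordinal_imputer = SimpleImputer(strategy="most_frequent")
concern_filled = ordinal_imputer.fit_transform(toy[["h1n1_concern"]]).ravel()

print(race_filled)
print(concern_filled)
```

In the toy frame the missing `race` becomes "Unknown", and the missing `h1n1_concern` takes the level 2, which is the most frequent one.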

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer

# Separate features and labels
train_features_df = train_data.drop(columns=["respondent_id", "h1n1_vaccine", "seasonal_vaccine"])
train_labels_df = train_data[["h1n1_vaccine", "seasonal_vaccine"]]

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(train_features_df, train_labels_df, test_size=0.2, random_state=42)

# Identify nominal and ordinal features
nominal_features = [
    'behavioral_antiviral_meds', 'behavioral_avoidance', 'behavioral_face_mask',
    'behavioral_wash_hands', 'behavioral_large_gatherings', 'behavioral_outside_home',
    'behavioral_touch_face', 'doctor_recc_h1n1', 'doctor_recc_seasonal',
    'chronic_med_condition', 'child_under_6_months', 'health_worker',
    'health_insurance', 'race', 'sex', 'marital_status', 'rent_or_own',
    'employment_status', 'hhs_geo_region', 'census_msa', 'employment_industry'
]

ordinal_features = [
    'h1n1_concern', 'h1n1_knowledge', 'opinion_h1n1_vacc_effective', 'opinion_h1n1_risk',
    'opinion_h1n1_sick_from_vacc', 'opinion_seas_vacc_effective', 'opinion_seas_risk',
    'opinion_seas_sick_from_vacc', 'age_group', 'education', 'income_poverty',
    'household_adults', 'household_children'
]

# Check for all columns
assert set(nominal_features + ordinal_features) == set(X_train.columns)

# Impute nominal features with 'Unknown'
nominal_imputer = SimpleImputer(strategy='constant', fill_value='Unknown')
X_train[nominal_features] = nominal_imputer.fit_transform(X_train[nominal_features])
X_test[nominal_features] = nominal_imputer.transform(X_test[nominal_features])

# Impute ordinal features with mode
ordinal_imputer = SimpleImputer(strategy='most_frequent')
X_train[ordinal_features] = ordinal_imputer.fit_transform(X_train[ordinal_features])
X_test[ordinal_features] = ordinal_imputer.transform(X_test[ordinal_features])

# After imputations, check the results for any missing values
print("New training set shape: ", X_train.shape)
print("New testing set shape: ", X_test.shape)
print("\nMissing Values in Training Features:")
print(X_train.isnull().sum())

print("\nMissing Values in Testing Features:")
print(X_test.isnull().sum())

# merge X_train with y_train and X_test with y_test for further processing
train_data_imputed = pd.concat([X_train, y_train], axis=1)
test_data_imputed = pd.concat([X_test, y_test], axis=1)

In [None]:
# convert all float data types to ints (they ought to be categorical && discrete)
def convert_to_int(x):
  # keep the imputation placeholder as-is (it is spelled "Unknown", with a capital U)
  if x == "Unknown":
    return x
  try:
    return int(x)
  except (ValueError, TypeError):
    # non-numeric strings (e.g. "Male") are left unchanged
    return x

train_data_imputed = train_data_imputed.applymap(convert_to_int)
test_data_imputed = test_data_imputed.applymap(convert_to_int)

# inspect once more
for col in train_data_imputed.columns:
  print(f"- {col}: ", train_data_imputed[col].unique())

## Handle Duplicates

In [None]:
# Handling Duplicates:
duplicate_rows = train_data_imputed[train_data_imputed.duplicated()]
num_duplicate_rows = len(duplicate_rows)
print(f"Number of duplicate rows (train): {num_duplicate_rows}")

duplicate_rows = test_data_imputed[test_data_imputed.duplicated()]
num_duplicate_rows = len(duplicate_rows)
print(f"Number of duplicate rows (test): {num_duplicate_rows}")

In [None]:
# drop duplicates
train_data_imputed = train_data_imputed.drop_duplicates()
test_data_imputed = test_data_imputed.drop_duplicates()

# confirm
duplicate_rows = train_data_imputed[train_data_imputed.duplicated()]
num_duplicate_rows = len(duplicate_rows)
print(f"Number of duplicate rows (train): {num_duplicate_rows}")

# Exploratory Data Analysis
With the data cleaned, we perform an Exploratory Data Analysis (EDA) to check what our data
contains and to get a feel for its basic characteristics. The EDA has two parts: a univariate
analysis of each feature, followed by a bivariate analysis of how the features relate to each other.

## Univariate Analysis


In [None]:
# Separate the features from the two target columns ('respondent_id' was already dropped)
train_features_df = train_data_imputed.drop(columns=["h1n1_vaccine", "seasonal_vaccine"])
train_labels_df = train_data_imputed[["h1n1_vaccine", "seasonal_vaccine"]]

train_feat_cols = list(train_features_df.columns)

# Function to create a pie chart for binary variables
def plot_pie_chart(data: pd.DataFrame, column: str, ax: plt.Axes, title: str) -> None:
  """
  plot pie chart for binary variables
  :param data: dataframe
  :param column: column name
  :param ax: axis
  :param title: title of the plot
  """
  # value_counts() orders by frequency, so derive each label from the value it
  # belongs to instead of assuming that the first slice is always 'No'
  counts = data[column].value_counts().sort_index()
  label_map = {0: 'No', 1: 'Yes'}
  labels = [label_map.get(value, str(value)) for value in counts.index]

  sizes = counts.values
  colors = ['#ff9999', '#66b3ff']

  ax.pie(sizes, labels=labels, colors=colors, autopct='%1.1f%%', startangle=140)
  ax.axis('equal')  # Equal aspect ratio ensures that pie chart is circular.
  ax.set_title(title)

# Set up a grid of subplots with 3 columns per row
n_cols = 3
n_rows = (len(train_feat_cols) + n_cols - 1) // n_cols

fig, axes = plt.subplots(nrows=n_rows, ncols=n_cols, figsize=(18, n_rows * 4))

# Flatten the axes array for easy iteration
axes = axes.flatten()

# Create plots for each categorical variable
for i, var in enumerate(train_feat_cols):
    if train_features_df[var].nunique() == 2:  # Check if the variable is binary
        plot_pie_chart(train_features_df, var, axes[i], f'Proportion of {var}')
    else:
        sns.countplot(data=train_features_df, x=var, ax=axes[i])
        axes[i].set_title(f'Countplot of {var}')
        axes[i].set_xlabel(var)
        axes[i].set_ylabel('Count')
        axes[i].tick_params(axis='x', rotation=45)  # Rotate x-tick labels for better readability

# Hide any unused subplots

for j in range(i + 1, len(axes)):
    fig.delaxes(axes[j])

plt.tight_layout()
plt.show()


In [None]:
train_features_df.describe()

In [None]:
# Define the mappings for ordinal features.
# Only the string-valued ordinal features need a mapping: the concern, knowledge,
# opinion and household-count features are already numeric codes (see the feature
# descriptions above), so a label-based mapping would not match any of their values.
ordinal_mappings = {
    'age_group': {
        '18 - 34 Years': 0,
        '35 - 44 Years': 1,
        '45 - 54 Years': 2,
        '55 - 64 Years': 3,
        '65+ Years': 4
    },
    'education': {
        '< 12 Years': 0,
        '12 Years': 1,
        'Some College': 2,
        'College Graduate': 3,
        'Some Post-Graduate': 4
    },
    'income_poverty': {
        'Below Poverty': 0,
        '<= $75,000, Above Poverty': 1,
        '> $75,000': 2
    }
}

# Apply the mappings using replace function
for feature, mapping in ordinal_mappings.items():
    train_data_imputed[feature] = train_data_imputed[feature].replace(mapping)
    test_data_imputed[feature] = test_data_imputed[feature].replace(mapping)

# Verify the changes
print(train_data_imputed[ordinal_features].head())
print(test_data_imputed[ordinal_features].head())


In [None]:
# One-hot encode the remaining (nominal) categorical features in the train and test datasets
train_data_encoded = pd.get_dummies(train_data_imputed, drop_first=True)
test_data_encoded = pd.get_dummies(test_data_imputed, drop_first=True)

# Ensure that both the train and test datasets have the same columns after encoding
test_data_encoded = test_data_encoded.reindex(columns=train_data_encoded.columns, fill_value=0)

print("Shape of encoded train data:", train_data_encoded.shape)
print("Shape of encoded test data:", test_data_encoded.shape)
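
The column alignment step matters whenever a category is present in the training rows but absent from the held-out rows. A minimal sketch on toy frames (hypothetical values, reusing the `rent_or_own` column name):

```python
import pandas as pd

# Toy frames (hypothetical): "Rent" never appears in the held-out rows
toy_train = pd.DataFrame({"rent_or_own": ["Own", "Rent", "Unknown"]})
toy_test = pd.DataFrame({"rent_or_own": ["Own", "Unknown"]})

train_enc = pd.get_dummies(toy_train, drop_first=True)
test_enc = pd.get_dummies(toy_test, drop_first=True)
print(list(train_enc.columns))  # columns derived from the training categories
print(list(test_enc.columns))   # fewer columns, because "Rent" was never seen

# Align the held-out frame to the training layout; missing dummies are filled with 0
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(list(test_aligned.columns))
```

One caveat: with `drop_first=True` each frame drops its own first category. If the first training category were absent from the held-out rows, a different category would be dropped there and then zero-filled by the alignment. Fitting a single encoder on the training data and reusing it avoids this.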

## Bivariate analysis

In [None]:
train_data_encoded.columns

In [None]:
# Calculate the correlation matrix
correlation_matrix = train_data_imputed[ordinal_features].corr()
correlation_matrix


In [None]:
# Set up the matplotlib figure
plt.figure(figsize=(10, 8))

# Create a heatmap
sns.heatmap(correlation_matrix, annot=True, fmt=".2f", cmap='coolwarm', square=True, cbar_kws={"shrink": .8}, linewidths=0.5)

# Title and labels
plt.title('Heatmap of Correlations Between Ordinal Features')
# plt.xticks(rotation=45)
# plt.yticks(rotation=45)
plt.tight_layout()

# Show the plot
plt.show()

__Insights from the correlation matrix:__

1. **Strongest Positive Correlations**:
   - **`opinion_h1n1_risk` vs `opinion_seas_risk` (0.566)**: Individuals who perceive a higher risk of getting sick with H1N1 flu without the vaccine also tend to perceive a higher risk of getting sick with seasonal flu without the vaccine.
   - **`opinion_h1n1_sick_from_vacc` vs `opinion_seas_sick_from_vacc` (0.486)**: A belief that one might get sick from the H1N1 vaccine is associated with a similar belief regarding the seasonal flu vaccine.
   - **`opinion_h1n1_vacc_effective` vs `opinion_seas_vacc_effective` (0.470)**: Individuals who believe the H1N1 vaccine is effective are also more likely to believe in the effectiveness of the seasonal flu vaccine.

2. **Moderate Positive Correlations**:
   - **`opinion_h1n1_risk` vs `h1n1_concern` (0.380)**: Those who are more concerned about H1N1 are also likely to believe there is a higher risk associated with it.
   - **`opinion_h1n1_sick_from_vacc` vs `opinion_h1n1_risk` (0.336)**: Those who believe they could get sick from the H1N1 vaccine are also more likely to perceive the risk of H1N1 as higher.

   - **`opinion_h1n1_risk` vs `opinion_h1n1_vacc_effective` (0.260)**: Those who perceive a higher risk from H1N1 are more likely to believe that the vaccine is effective.


3. **Negative Correlations**:
   - **`age_group` vs `household_children` (-0.438)**: Older individuals tend to have fewer children in their households.
   - **`education` vs `age_group` (-0.081)**: Higher education levels are weakly negatively correlated with age, indicating that older individuals in the dataset might have slightly lower education levels.
   - **`education` vs `opinion_h1n1_sick_from_vacc` (-0.088)**: Higher education levels are weakly associated with a lower belief that the H1N1 vaccine could make someone sick.

4. **Weak or Negligible Correlations**:
   - Many pairs, such as `household_adults` with `income_poverty` or with `opinion_seas_vacc_effective`, have very weak correlations, indicating that these features vary largely independently of each other.

5. **Educational Insights**:
   - **`education` vs `income_poverty` (0.372)**: Higher education is moderately correlated with a higher income bracket (that is, with being further above the poverty threshold), which is expected.

6. **Perception of Vaccines**:
   - The correlations between perceptions of the H1N1 vaccine and the seasonal flu vaccine (e.g., effectiveness, risk, and sickness from the vaccine) suggest that individuals’ views on one vaccine are often mirrored in their views on the other. This could be due to general attitudes towards vaccination.

7. **Household Dynamics**:
   - **`household_children` vs `age_group` (-0.438)**: As people get older, they tend to have fewer children in the household, which makes sense given typical family life cycles.

**Summary:**
- **Perception Patterns**: Views on the two vaccines move together: perceived risk, perceived effectiveness, and worry about getting sick from the vaccine each correlate at roughly 0.47 to 0.57 between the H1N1 and seasonal versions.
- **Educational and Age Factors**: Higher education is only very weakly associated with less worry about getting sick from the H1N1 vaccine (about -0.09), while age is inversely related to the number of children in the household (about -0.44).
- **Poverty and Education**: As expected, higher education correlates with higher income (about 0.37). This matrix covers the ordinal features only, so it does not by itself show how either relates to vaccination.

These insights can help in targeting public health messages, understanding vaccine hesitancy, and tailoring interventions to specific demographic groups.
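
`DataFrame.corr()` defaults to the Pearson coefficient, which treats the ordinal codes as evenly spaced numbers. As a cross-check, Spearman rank correlation uses only the ordering of the levels. The sketch below compares the two on synthetic data (the two `opinion_*` columns are made up for illustration, not survey values):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic ordinal codes (hypothetical): two related opinion scales from 1 to 5
base = rng.integers(1, 6, size=500)
noise = rng.integers(-1, 2, size=500)
toy = pd.DataFrame({
    "opinion_a": base,
    "opinion_b": np.clip(base + noise, 1, 5),
})

pearson = toy.corr(method="pearson").loc["opinion_a", "opinion_b"]
spearman = toy.corr(method="spearman").loc["opinion_a", "opinion_b"]
print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

If both coefficients agree closely on the real features, the insights above do not hinge on the equal-spacing assumption.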

# Predictive Modelling

## Proposed models
This modelling requires a multilabel classifier. Some models we can use are listed below:

1. **Random Forest with MultiOutputClassifier**

2. **XGBoost with MultiOutputClassifier**

3. **Neural Networks**
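
Since the problem statement asks for two probabilities, note that `predict` returns hard 0/1 labels. Probabilities come from `predict_proba`, which for `MultiOutputClassifier` returns one array per target. Below is a minimal sketch on synthetic data (generated with `make_multilabel_classification` as a stand-in for the survey features, so the numbers mean nothing beyond illustration):

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier

# Synthetic two-label data (purely illustrative)
X, y = make_multilabel_classification(n_samples=300, n_features=10,
                                      n_classes=2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

clf = MultiOutputClassifier(RandomForestClassifier(n_estimators=50, random_state=0))
clf.fit(X_tr, y_tr)

# predict_proba returns one (n_samples, 2) array per target; column 1 is P(label == 1)
proba_per_target = clf.predict_proba(X_te)
proba = np.column_stack([p[:, 1] for p in proba_per_target])
print(proba.shape)  # one probability column per target
```

The same pattern should apply to the wrapped models trained below, with column 0 for `h1n1_vaccine` and column 1 for `seasonal_vaccine`.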


### Random Forest Multioutput classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

SEED = 555

In [None]:
X_train_encoded = train_data_encoded.drop(columns=["h1n1_vaccine", "seasonal_vaccine"])
y_train = train_data_encoded[["h1n1_vaccine", "seasonal_vaccine"]]

X_test_encoded = test_data_encoded.drop(columns=["h1n1_vaccine", "seasonal_vaccine"])
y_test = test_data_encoded[["h1n1_vaccine", "seasonal_vaccine"]]

In [None]:
# Initialize the RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100, random_state=SEED)

# Wrap it in a MultiOutputClassifier
multi_target_rf = MultiOutputClassifier(rf, n_jobs=-1)

# Train the model
multi_target_rf.fit(X_train_encoded, y_train)

In [None]:
# Predict on the test data
y_pred = multi_target_rf.predict(X_test_encoded)

In [None]:
# Split the predictions by column position (0 = h1n1_vaccine, 1 = seasonal_vaccine)
y_pred_h1n1 = y_pred[:, 0]
y_pred_seasonal = y_pred[:, 1]

# Accuracy for each target
accuracy_h1n1 = accuracy_score(y_test['h1n1_vaccine'], y_pred_h1n1)
accuracy_seasonal = accuracy_score(y_test['seasonal_vaccine'], y_pred_seasonal)

print("Untuned RF model\n"+"="*20)
print(f'Accuracy for H1N1 vaccine prediction: {accuracy_h1n1:.2f}')
print(f'Accuracy for Seasonal vaccine prediction: {accuracy_seasonal:.2f}')

# Classification reports
print("Classification Report for H1N1 Vaccine Prediction:")
print(classification_report(y_test['h1n1_vaccine'], y_pred_h1n1))

print("\nClassification Report for Seasonal Vaccine Prediction:")
print(classification_report(y_test['seasonal_vaccine'], y_pred_seasonal))

# Confusion matrices
print("Confusion Matrix for H1N1 Vaccine Prediction:")
print(confusion_matrix(y_test['h1n1_vaccine'], y_pred_h1n1))

print("\nConfusion Matrix for Seasonal Vaccine Prediction:")
print(confusion_matrix(y_test['seasonal_vaccine'], y_pred_seasonal))


#### Tuned Random Forest classifier

In [None]:
# from sklearn.model_selection import GridSearchCV

# # Define the parameter grid
# param_grid = {
#     'n_estimators': [100, 200],
#     'max_depth': [None, 10, 20],
#     'min_samples_split': [5, 10],
#     'min_samples_leaf': [1, 2],
# }
# # Initialize the Random Forest model
# rf = RandomForestClassifier(random_state=SEED)

# # Set up the GridSearchCV with the RandomForest and the parameter grid
# grid_search = GridSearchCV(estimator=rf, param_grid=param_grid,
#                            cv=5, n_jobs=-1, verbose=2, scoring='accuracy')

# # Fit the grid search
# grid_search.fit(X_train_encoded, y_train)

# # Best parameters from the grid search
# best_params = grid_search.best_params_

# # Best score from the grid search
# best_score = grid_search.best_score_

# print("Best Parameters:", best_params)
# print("Best Cross-Validation Score:", best_score)


**Note:** Below is the output of the above grid search.
The grid search is commented out because it takes a very long time to find the best parameters for the Random Forest classifier.

>Fitting 5 folds for each of 24 candidates, totalling 120 fits\
Best Parameters: {'max_depth': 20, 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 100}\
Best Cross-Validation Score: 0.6815207916444253

In [None]:
best_params = {'max_depth': 20, 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 100}

# Train the model with the best parameters
best_rf = RandomForestClassifier(**best_params, random_state=SEED)
best_rf.fit(X_train_encoded, y_train)

In [None]:
# Evaluate the tuned model on the test set
y_pred_tuned_rf = best_rf.predict(X_test_encoded)  # predictions from the tuned model
y_test_array = y_test.values  # Convert to a NumPy array
accuracy_h1n1 = accuracy_score(y_test_array[:, 0], y_pred_tuned_rf[:, 0])
accuracy_seasonal = accuracy_score(y_test_array[:, 1], y_pred_tuned_rf[:, 1])

print("Tuned Random Forest model\n"+ "="*40)
print(f"Accuracy for H1N1 vaccine prediction: {accuracy_h1n1:.2f}")
print(f"Accuracy for Seasonal vaccine prediction: {accuracy_seasonal:.2f}")

### Extreme Gradient Boosting (XGBoost) Multioutput classifier

In [None]:
from xgboost import XGBClassifier
from sklearn.multioutput import MultiOutputClassifier

# Initialize the XGBoost model
xgb_model = XGBClassifier(use_label_encoder=False, eval_metric='logloss')

# Wrap with MultiOutputClassifier
multi_target_xgb = MultiOutputClassifier(xgb_model, n_jobs=-1)

# Train the model
multi_target_xgb.fit(X_train_encoded, y_train)

# Make predictions
y_pred_xgb = multi_target_xgb.predict(X_test_encoded)

# Evaluate the model
# Accuracy for each target
accuracy_h1n1 = accuracy_score(y_test['h1n1_vaccine'], y_pred_xgb[:, 0])
accuracy_seasonal = accuracy_score(y_test['seasonal_vaccine'], y_pred_xgb[:, 1])

print("Untuned XGBoost model\n" + "="*20)
print(f"Accuracy for H1N1 vaccine prediction: {accuracy_h1n1:.2f}")
print(f"Accuracy for Seasonal vaccine prediction: {accuracy_seasonal:.2f}")

# Classification report
print("Classification Report for H1N1 Vaccine Prediction:")
print(classification_report(y_test['h1n1_vaccine'], y_pred_xgb[:, 0]))

print("Classification Report for Seasonal Vaccine Prediction:")
print(classification_report(y_test['seasonal_vaccine'], y_pred_xgb[:, 1]))

# Confusion matrices
print("Confusion Matrix for H1N1 Vaccine Prediction:")
print(confusion_matrix(y_test['h1n1_vaccine'], y_pred_xgb[:, 0]))

print("Confusion Matrix for Seasonal Vaccine Prediction:")
print(confusion_matrix(y_test['seasonal_vaccine'], y_pred_xgb[:, 1]))


#### Tuned XGBoost model

In [None]:
# from xgboost import XGBClassifier
# from sklearn.model_selection import GridSearchCV
# from sklearn.multioutput import MultiOutputClassifier

# # Define the parameter grid
# param_grid = {
#     'estimator__n_estimators': [100, 200, 300],
#     'estimator__max_depth': [3, 5, 7],
#     'estimator__learning_rate': [0.01, 0.09],
#     'estimator__subsample': [0.8, 1.0],
#     'estimator__colsample_bytree': [0.8, 1.0],
#     # 'estimator__gamma': [0, 0.1, 0.2],
#     # 'estimator__reg_alpha': [0, 0.1, 1],
#     # 'estimator__reg_lambda': [1, 1.5, 2]
# }

# # Initialize the XGBoost model
# xgb = XGBClassifier(objective='binary:logistic', use_label_encoder=False, eval_metric='logloss')

# # Wrap XGBoost with MultiOutputClassifier
# multi_xgb = MultiOutputClassifier(xgb)

# # Set up the GridSearchCV
# grid_search = GridSearchCV(estimator=multi_xgb, param_grid=param_grid,
#                            cv=5, n_jobs=-1, verbose=2, scoring='accuracy')

# # Fit the grid search
# grid_search.fit(X_train_encoded, y_train)

# # Best parameters from the grid search
# best_params = grid_search.best_params_

# # Best score from the grid search
# best_score = grid_search.best_score_

# print("Best Parameters:", best_params)
# print("Best Cross-Validation Score:", best_score)

Output of the XGBoost grid search:

>Fitting 5 folds for each of 72 candidates, totalling 360 fits.\
Best Parameters: {'estimator__colsample_bytree': 0.8, 'estimator__learning_rate': 0.09, 'estimator__max_depth': 3, 'estimator__n_estimators': 300, 'estimator__subsample': 0.8}\
Best Cross-Validation Score: 0.6922398480661167
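
When `scoring='accuracy'` is applied to a two-column target, scikit-learn computes subset accuracy: a row counts as correct only if both labels match. The cross-validation score above is therefore not directly comparable to the per-target accuracies printed elsewhere in this notebook. A small sketch with made-up predictions shows the difference:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Toy ground truth and predictions (hypothetical); columns are (h1n1, seasonal)
y_true = np.array([[0, 1], [1, 1], [0, 0], [1, 0]])
y_hat = np.array([[0, 1], [1, 0], [0, 0], [0, 0]])

# Subset accuracy: a row counts only if both labels match
subset_acc = accuracy_score(y_true, y_hat)

# Per-label accuracy: each column is scored on its own
h1n1_acc = accuracy_score(y_true[:, 0], y_hat[:, 0])
seasonal_acc = accuracy_score(y_true[:, 1], y_hat[:, 1])

print(subset_acc, h1n1_acc, seasonal_acc)
```

Here two of the four rows are fully correct, while each label on its own is right in three of the four rows, so the subset score is the lower of the two kinds.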

In [None]:
# Train the model with the best parameters
best_params = {'estimator__colsample_bytree': 0.8, 'estimator__learning_rate': 0.09,
               'estimator__max_depth': 3, 'estimator__n_estimators': 300,
               'estimator__subsample': 0.8}

# The grid search tuned the MultiOutputClassifier wrapper, so its keys carry an
# 'estimator__' prefix; strip it before passing the values to XGBClassifier directly
xgb_params = {key.split('__', 1)[1]: value for key, value in best_params.items()}

best_xgb = MultiOutputClassifier(XGBClassifier(**xgb_params,
                                               objective='binary:logistic',
                                               use_label_encoder=False,
                                               eval_metric='logloss'))

best_xgb.fit(X_train_encoded, y_train)


# Evaluate the tuned model on the test set
y_pred = best_xgb.predict(X_test_encoded)

# Calculate accuracy for H1N1 and Seasonal predictions
accuracy_h1n1 = accuracy_score(y_test.iloc[:, 0], y_pred[:, 0])
accuracy_seasonal = accuracy_score(y_test.iloc[:, 1], y_pred[:, 1])

print("Tuned XGBoost model\n" + "="*20)
print(f"Accuracy for H1N1 vaccine prediction: {accuracy_h1n1:.2f}")
print(f"Accuracy for Seasonal vaccine prediction: {accuracy_seasonal:.2f}")

### Deep Learning Model


#### Model Design

Here's how we will design the Base Neural Network model:

- **Input Layer:** The number of nodes in the input layer should match the number of features (84 in our case).
- **Hidden Layers:** We start with a simple architecture, like 2-3 hidden layers with 64-128 neurons each, and experiment with different architectures later.
- **Output Layer:** Since we have two target variables (H1N1 and Seasonal vaccine), we use two output neurons with a sigmoid activation function.
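
To see why two sigmoid outputs fit a multilabel target, the sketch below reproduces the output activation and the binary cross-entropy loss in plain NumPy. The logits are hypothetical values, and the helper names `sigmoid` and `binary_crossentropy` are ours rather than Keras functions.

```python
import numpy as np

def sigmoid(z):
    """Map raw scores (logits) to probabilities in (0, 1), element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def binary_crossentropy(y_true, p, eps=1e-7):
    """Mean binary cross-entropy over all samples and both outputs."""
    p = np.clip(p, eps, 1 - eps)
    return float(np.mean(-(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))))

# Hypothetical logits for three respondents; columns are (h1n1, seasonal)
logits = np.array([[2.0, 1.5], [-1.0, 0.5], [-2.0, -3.0]])
y_true = np.array([[1, 1], [0, 1], [0, 0]])

probs = sigmoid(logits)
print(probs.round(3))               # each column is an independent probability
print(probs.sum(axis=1).round(3))   # rows need not sum to 1, unlike softmax
print(round(binary_crossentropy(y_true, probs), 4))
```

Because each output is squashed independently, the model can assign a high probability to both vaccines at once, which a softmax layer would not allow.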

In [None]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

In [None]:
def visualize_training_results(results):
    """Function to visualize model performance
    """
    history = results.history
    plt.figure()
    plt.plot(history['loss'], label='train_loss')
    plt.plot(history['val_loss'], label='val_loss')
    plt.legend()
    plt.title('Loss')
    plt.xlabel('Epochs')
    plt.ylabel('Loss')
    plt.show()

    plt.figure()
    plt.plot(history['accuracy'], label='train_accuracy')
    plt.plot(history['val_accuracy'], label='val_accuracy')
    plt.legend()
    plt.title('Accuracy')
    plt.xlabel('Epochs')
    plt.ylabel('Accuracy')
    plt.show()

In [None]:
from tensorflow.keras import regularizers

model = Sequential()

# Input layer and first hidden layer
model.add(Dense(128, input_dim=X_train_encoded.shape[1], activation='relu', kernel_regularizer=regularizers.l2(0.001)))
# model.add(Dropout(0.2))  # Dropout for regularization

# Second hidden layer
model.add(Dense(128, activation='relu'))

# Third hidden layer (optional)
model.add(Dense(64, activation='relu'))
# model.add(Dropout(0.2))

# Output layer
model.add(Dense(2, activation='sigmoid'))

model.summary()

In [None]:
# model compilation
model.compile(
    loss='binary_crossentropy',
    optimizer='adam',
    metrics=['accuracy']
)

In [None]:
X_train_encoded_int = X_train_encoded.astype('int')
y_train_int = y_train.astype('int')

results_nn_model_1 = model.fit(X_train_encoded_int, y_train_int,
                    validation_split=0.2,
                    epochs=100,
                    batch_size=5,
                    callbacks=[tf.keras.callbacks.EarlyStopping(patience=3)])

In [None]:
visualize_training_results(results_nn_model_1)

In [None]:
# from lightgbm import LGBMClassifier

# # Convert feature names to a DataFrame if needed
# X_train_encoded_df = pd.DataFrame(X_train_encoded, columns=[f"feature_{i}" for i in range(X_train_encoded.shape[1])])
# X_test_encoded_df = pd.DataFrame(X_test_encoded, columns=[f"feature_{i}" for i in range(X_test_encoded.shape[1])])

# # Rename columns to remove special characters (if any)
# X_train_encoded_df.columns = X_train_encoded_df.columns.str.replace('[^A-Za-z0-9_]+', '', regex=True)
# X_test_encoded_df.columns = X_test_encoded_df.columns.str.replace('[^A-Za-z0-9_]+', '', regex=True)

# # Convert back to numpy array if needed
# X_train_encoded = X_train_encoded_df.values
# X_test_encoded = X_test_encoded_df.values

# lgbm_model = MultiOutputClassifier(LGBMClassifier(
#     random_state=SEED,
#     min_data_in_bin=3,
#     min_data_in_leaf=10,
#     num_leaves=31,
#     learning_rate=0.05,
#     feature_fraction=0.9,
#     n_estimators=100))
# lgbm_model.fit(X_train_encoded, y_train)

# y_pred = lgbm_model.predict(X_test_encoded)

### Support Vector Machine

In [None]:
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier

svm_model = OneVsRestClassifier(SVC(kernel='linear', probability=True))
svm_model.fit(X_train_encoded, y_train)

y_pred = svm_model.predict(X_test_encoded)

In [None]:
# Calculate accuracy for H1N1 and Seasonal predictions
accuracy_h1n1 = accuracy_score(y_test.iloc[:, 0], y_pred[:, 0])
accuracy_seasonal = accuracy_score(y_test.iloc[:, 1], y_pred[:, 1])

print("Untuned SVM model\n" + "="*20)
print(f"Accuracy for H1N1 vaccine prediction: {accuracy_h1n1:.2f}")
print(f"Accuracy for Seasonal vaccine prediction: {accuracy_seasonal:.2f}")

In [1]:
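from sklearn.model_selection import GridSearchCV

# Illustrative search space (an assumption: the original grid for the SVM is not shown).
# The 'estimator__' prefix reaches the SVC wrapped inside OneVsRestClassifier.
param_grid = {'estimator__C': [0.1, 1, 10]}

# Initialize GridSearchCV
grid_search = GridSearchCV(svm_model, param_grid, cv=5, scoring='accuracy', n_jobs=-1, verbose=2)

# Fit GridSearchCV
grid_search.fit(X_train_encoded, y_train)

# Print the best parameters and the best score
print("Best Parameters: ", grid_search.best_params_)
print("Best Accuracy: ", grid_search.best_score_)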