## Imports

First, we import the libraries used in this task.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

## Data loading

Next, we load the **.csv** files that contain the features and labels of the training data into two pandas dataframes, `features` and `labels`.

In [None]:
# Paths to the provided training files
features_path = '../data/training_set_features.csv'
labels_path = '../data/training_set_labels.csv'

# Read the data into pandas dataframes
features = pd.read_csv(features_path)
labels = pd.read_csv(labels_path)

In [None]:
features.info()

In [None]:
labels.info()

In [None]:
features.head()

In [None]:
labels.head()

In [None]:
features.describe()

In [None]:
labels.describe()

Next, we store the two target columns, `h1n1_vaccine` and `seasonal_vaccine`, in separate pandas Series. We also drop the `respondent_id` column from the `features` dataframe, because it is only a row identifier and would act as noise in prediction.

In [None]:
# Separate labels and features
labels_h1n1 = labels['h1n1_vaccine']
labels_seasonal = labels['seasonal_vaccine']

# Drop the respondent_id column from the features
features = features.drop(columns=['respondent_id']) 

In [None]:
# Identify categorical and numeric columns
categorical_cols = features.select_dtypes(include=['object']).columns
numeric_cols = features.select_dtypes(include=['float64']).columns

In [None]:
# Impute missing numeric values with the column mean
features_numeric = features[numeric_cols].fillna(features[numeric_cols].mean())

In [None]:
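# Impute missing categorical values with the most frequent value (mode) of each column
features_categorical = features[categorical_cols].fillna(features[categorical_cols].mode().iloc[0])

# One-hot encode only the categorical columns (the numeric ones are handled above)
features_categorical = pd.get_dummies(features_categorical)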

In [None]:
features = pd.concat([features_numeric, features_categorical], axis=1)
features
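
The imputation and encoding steps above can be illustrated on a small toy frame. This is only a sketch: the frame `toy` and its columns `score` and `color` are hypothetical, and the categorical fill uses each column's mode, which is one common reading of "most frequent" imputation.

```python
import pandas as pd

# Hypothetical toy data with one numeric and one categorical column
toy = pd.DataFrame({
    "score": [1.0, None, 3.0, 5.0],
    "color": ["red", None, "blue", "red"],
})

num_cols = toy.select_dtypes(include=["number"]).columns
cat_cols = toy.select_dtypes(exclude=["number"]).columns

# Numeric gaps get the column mean (3.0 here)
toy_num = toy[num_cols].fillna(toy[num_cols].mean())

# Categorical gaps get the most frequent value of each column ("red" here)
toy_cat = toy[cat_cols].fillna(toy[cat_cols].mode().iloc[0])

# One-hot encode the categorical part only, then join both parts
toy_cat = pd.get_dummies(toy_cat)
toy_ready = pd.concat([toy_num, toy_cat], axis=1)
print(toy_ready)
```

Note that `fillna('most_frequent')` would instead insert the literal string `'most_frequent'` as a new category, which is a different imputation strategy.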

## H1N1 Vaccine

In [None]:
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(features, labels_h1n1, test_size=0.2, random_state=42)

In [None]:
param_grid = {
    "n_estimators": [500, 800, 1000],
    "max_depth": [10, 20, 30],
    "min_samples_split": [10, 25, 50],
    "min_samples_leaf": [5, 10, 20]
}

In [None]:
grid_search = GridSearchCV(
    RandomForestClassifier(),
    param_grid,
    cv=5,
    scoring='roc_auc',
    n_jobs=-1
)

grid_search.fit(X_train, y_train)
model_h1n1 = grid_search.best_estimator_
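
To get a feel for the cost of this search, the number of model fits can be counted directly. The sketch below uses only the standard library and repeats the grid from the cell above under the hypothetical name `grid`; `cv_folds` mirrors `cv=5`. It counts cross-validation fits only; with the default `refit=True`, `GridSearchCV` adds one final fit of the best configuration on the whole training split.

```python
from itertools import product

# Copy of the grid defined above (hypothetical name, same values)
grid = {
    "n_estimators": [500, 800, 1000],
    "max_depth": [10, 20, 30],
    "min_samples_split": [10, 25, 50],
    "min_samples_leaf": [5, 10, 20],
}
cv_folds = 5

# Every combination of one value per hyperparameter
combinations = list(product(*grid.values()))
n_combinations = len(combinations)

# Each combination is fitted once per cross-validation fold
n_cv_fits = n_combinations * cv_folds

print(n_combinations, n_cv_fits)
```

With forests of 500 to 1000 trees each, this many fits explains why the search is slow even with `n_jobs=-1`.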

In [None]:
# Predict the probabilities of the classes
y_pred_prob = model_h1n1.predict_proba(X_test)[:, 1]

# Calculate the ROC AUC
auc_roc = roc_auc_score(y_test, y_pred_prob)
auc_roc
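
As a reminder of what this metric measures: ROC AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The helper `pairwise_auc` below is a hypothetical, quadratic-time illustration of that definition on toy data; it is not how `roc_auc_score` is implemented.

```python
def pairwise_auc(y_true, y_score):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy labels and scores (hypothetical values)
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]

# Three of the four positive/negative pairs are ranked correctly
toy_auc = pairwise_auc(toy_labels, toy_scores)
print(toy_auc)
```

Because only the ranking matters, the metric is computed on the predicted probabilities from `predict_proba` rather than on hard class predictions.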

In [None]:
# Print the best parameters
grid_search.best_params_

## Seasonal Vaccine

In [None]:
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(features, labels_seasonal, test_size=0.2, random_state=42)

In [None]:
param_grid = {
    "n_estimators": [500, 800, 1000],
    "max_depth": [10, 20, 30],
    "min_samples_split": [10, 25, 50],
    "min_samples_leaf": [5, 10, 20]
}

In [None]:
grid_search = GridSearchCV(
    RandomForestClassifier(),
    param_grid,
    cv=5,
    scoring='roc_auc',
    n_jobs=-1
)

grid_search.fit(X_train, y_train)
model_seasonal = grid_search.best_estimator_

In [None]:
# Predict the probabilities of the classes
y_pred_prob = model_seasonal.predict_proba(X_test)[:, 1]

# Calculate the ROC AUC
auc_roc = roc_auc_score(y_test, y_pred_prob)
auc_roc

In [None]:
# Print the best parameters
grid_search.best_params_

## Submission

Next, we load *test_set_features.csv*, which contains the instances that we need to predict.

In [None]:
test = pd.read_csv('../data/test_set_features.csv')
test.head()

We create a dataframe, starting from the `respondent_id` column, to store the predicted probabilities for `H1N1 Vaccine` and `Seasonal Vaccine`.

In [None]:
submission_df = pd.DataFrame(test['respondent_id'])
submission_df.head()

We now apply the same preprocessing to the test data that we applied to the training data.

In [None]:
test_features = test.drop(columns=['respondent_id'])

In [None]:
test_categorical_cols = test_features.select_dtypes(include=['object']).columns
test_numeric_cols = test_features.select_dtypes(include=['float64']).columns

In [None]:
# Impute missing numeric values with the column mean
test_features_numeric = test_features[test_numeric_cols].fillna(test_features[test_numeric_cols].mean())

In [None]:
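# Impute missing categorical values with the most frequent value (mode) of each column
test_features_categorical = test_features[test_categorical_cols].fillna(test_features[test_categorical_cols].mode().iloc[0])

# One-hot encode only the categorical columns (the numeric ones are handled above)
test_features_categorical = pd.get_dummies(test_features_categorical)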

In [None]:
test_features = pd.concat([test_features_numeric, test_features_categorical], axis=1)
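
One thing worth checking at this point is that the encoded test frame has exactly the same columns, in the same order, as the frame the models were trained on, since `pd.get_dummies` only creates columns for the categories it sees. A minimal sketch of one way to align them with `DataFrame.reindex` follows; the toy frames `train_enc` and `test_enc` are hypothetical.

```python
import pandas as pd

# Hypothetical encoded frames: the test data lacks "green" and has an unseen "purple"
train_enc = pd.get_dummies(pd.DataFrame({"color": ["red", "blue", "green"]}))
test_enc = pd.get_dummies(pd.DataFrame({"color": ["blue", "red", "purple"]}))

# Keep the training columns in training order:
# columns missing from the test frame are added as 0, unseen ones are dropped
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(test_aligned.columns))
```

In this notebook the equivalent call would presumably be `test_features.reindex(columns=features.columns, fill_value=0)`, assuming the column labels of `test_features` are unique.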

Next, we make the predictions and save them in the `submission_df`.

In [None]:
# Predict h1n1_vaccine
h1n1_vaccine = model_h1n1.predict_proba(test_features)[:, 1]

# Predict seasonal_vaccine
seasonal_vaccine = model_seasonal.predict_proba(test_features)[:, 1]

In [None]:
submission_df['h1n1_vaccine'] = h1n1_vaccine
submission_df['seasonal_vaccine'] = seasonal_vaccine
submission_df.head()

Finally, we generate a CSV file to upload to the competition page.

In [None]:
submission_df.to_csv('submission.csv', index=False)
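
Before uploading, it can help to sanity-check the frame. The sketch below is an assumption about what a valid submission looks like (one row per `respondent_id` and two probability columns with values in [0, 1]), inferred from the columns built above rather than from the competition rules; the helper `check_submission` and the toy frame are hypothetical.

```python
import pandas as pd


def check_submission(df):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    expected = ["respondent_id", "h1n1_vaccine", "seasonal_vaccine"]
    if list(df.columns) != expected:
        problems.append(f"unexpected columns: {list(df.columns)}")
        return problems
    if df["respondent_id"].duplicated().any():
        problems.append("duplicate respondent_id values")
    for col in expected[1:]:
        if df[col].isna().any():
            problems.append(f"missing values in {col}")
        elif not df[col].between(0, 1).all():
            problems.append(f"{col} has values outside [0, 1]")
    return problems


# Toy frame with one deliberately invalid probability (1.3)
toy_submission = pd.DataFrame({
    "respondent_id": [1, 2, 3],
    "h1n1_vaccine": [0.1, 0.5, 0.9],
    "seasonal_vaccine": [0.2, 1.3, 0.4],
})
issues = check_submission(toy_submission)
print(issues)
```

On the real data this would be called as `check_submission(submission_df)` before writing the file.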

## Conclusion

We tried other techniques, such as removing outliers and imputing missing values in different ways, but all of them produced worse results.

The current solution is simple and uses straightforward techniques, but it is the one that gave us the best score.

We also noticed that the score seems to plateau at approximately 0.86: despite trying a multitude of techniques and models, the results barely improved, and only very small gains were obtained.