<a href="https://colab.research.google.com/github/Kalana-Lakshan/DrivenData_Competitions/blob/main/Flu_Shot_Learning_Predict_H1N1_and_Seasonal_Flu_Vaccines.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install pandas
!pip install numpy
import pandas as pd
import numpy as np





In [2]:
train_features_data = pd.read_csv('training_set_features.csv')
train_labels_data = pd.read_csv('training_set_labels.csv')
test_features_data = pd.read_csv('test_set_features.csv')
submission_data = pd.read_csv('submission_format.csv')



In [3]:
train_features_data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26707 entries, 0 to 26706
Data columns (total 36 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   respondent_id                26707 non-null  int64  
 1   h1n1_concern                 26615 non-null  float64
 2   h1n1_knowledge               26591 non-null  float64
 3   behavioral_antiviral_meds    26636 non-null  float64
 4   behavioral_avoidance         26499 non-null  float64
 5   behavioral_face_mask         26688 non-null  float64
 6   behavioral_wash_hands        26665 non-null  float64
 7   behavioral_large_gatherings  26620 non-null  float64
 8   behavioral_outside_home      26625 non-null  float64
 9   behavioral_touch_face        26579 non-null  float64
 10  doctor_recc_h1n1             24547 non-null  float64
 11  doctor_recc_seasonal         24547 non-null  float64
 12  chronic_med_condition        25736 non-null  float64
 13  child_under_6_mo

### Identify Missing Values
First, let's see which columns have missing values and how many there are. This will help us decide the best strategy for each column.

In [4]:
missing_values = train_features_data.isnull().sum()
missing_values = missing_values[missing_values > 0].sort_values(ascending=False)

print("Columns with missing values and their counts:")
display(missing_values)

# Also check the data types of these columns
print("\nData types of columns with missing values:")
display(train_features_data[missing_values.index].dtypes)

Columns with missing values and their counts:


column,missing_count
employment_occupation,13470
employment_industry,13330
health_insurance,12274
income_poverty,4423
doctor_recc_seasonal,2160
doctor_recc_h1n1,2160
rent_or_own,2042
employment_status,1463
marital_status,1408
education,1407



Data types of columns with missing values:


column,dtype
employment_occupation,object
employment_industry,object
health_insurance,float64
income_poverty,object
doctor_recc_seasonal,float64
doctor_recc_h1n1,float64
rent_or_own,object
employment_status,object
marital_status,object
education,object
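
Raw counts are easier to compare as a share of rows. Below is a minimal sketch of ranking columns by their fraction of missing entries. It uses a small made-up frame (`toy` and its values are illustrative, not the competition data).

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_features_data (values are made up).
toy = pd.DataFrame({
    "health_insurance": [1.0, np.nan, np.nan, 0.0],
    "education": ["College Graduate", None, "12 Years", "12 Years"],
    "respondent_id": [0, 1, 2, 3],
})

# isnull().mean() gives the fraction of missing rows per column.
missing_share = toy.isnull().mean()
missing_share = missing_share[missing_share > 0].sort_values(ascending=False)
print(missing_share)
```

Applied to the counts shown above, `employment_occupation` (13470 of 26707 rows) is missing in roughly half of the training set, which is one reason to treat its missingness as a category of its own rather than impute a frequent value.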


### Fill Missing Values in Numeric Columns

For numeric columns, common strategies include filling with the mean, median, or a specific constant (e.g., 0). The choice often depends on the distribution of the data and domain knowledge. Let's use the median for now, as it's less sensitive to outliers than the mean.

In [5]:
numeric_cols_with_missing = train_features_data[missing_values.index].select_dtypes(include=np.number).columns

print(f"Numeric columns with missing values: {list(numeric_cols_with_missing)}")

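for col in numeric_cols_with_missing:
    median_val = train_features_data[col].median()
    # Assign the result back instead of using inplace=True on a column selection,
    # which pandas warns will stop working in pandas 3.0.
    train_features_data[col] = train_features_data[col].fillna(median_val)
    print(f"Filled missing values in '{col}' with median: {median_val}")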

print("\nMissing values after numeric imputation:")
print(train_features_data[numeric_cols_with_missing].isnull().sum())

Numeric columns with missing values: ['health_insurance', 'doctor_recc_seasonal', 'doctor_recc_h1n1', 'chronic_med_condition', 'child_under_6_months', 'health_worker', 'opinion_seas_sick_from_vacc', 'opinion_seas_risk', 'opinion_seas_vacc_effective', 'opinion_h1n1_sick_from_vacc', 'opinion_h1n1_vacc_effective', 'opinion_h1n1_risk', 'household_children', 'household_adults', 'behavioral_avoidance', 'behavioral_touch_face', 'h1n1_knowledge', 'h1n1_concern', 'behavioral_large_gatherings', 'behavioral_outside_home', 'behavioral_antiviral_meds', 'behavioral_wash_hands', 'behavioral_face_mask']
Filled missing values in 'health_insurance' with median: 1.0
Filled missing values in 'doctor_recc_seasonal' with median: 0.0
Filled missing values in 'doctor_recc_h1n1' with median: 0.0
Filled missing values in 'chronic_med_condition' with median: 0.0
Filled missing values in 'child_under_6_months' with median: 0.0
Filled missing values in 'health_worker' with median: 0.0
Filled missing values in 'opi

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  train_features_data[col].fillna(median_val, inplace=True)
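
The warning above is emitted for the `df[col].fillna(value, inplace=True)` pattern. Below is a minimal sketch, on a made-up frame (`toy` is a hypothetical name), of the assign-back form it recommends.

```python
import numpy as np
import pandas as pd

# Made-up values; column names borrowed from the dataset for readability.
toy = pd.DataFrame({
    "h1n1_concern": [1.0, np.nan, 3.0, 2.0],
    "health_worker": [0.0, 0.0, np.nan, 1.0],
})

# Assign the filled column back rather than filling a temporary object in place.
for col in ["h1n1_concern", "health_worker"]:
    toy[col] = toy[col].fillna(toy[col].median())

# Equivalent one-liner: fillna accepts per-column fill values.
# toy = toy.fillna(toy.median(numeric_only=True))
print(toy)
```

The assign-back form does not rely on the column selection being a view of the original frame, so it should behave the same before and after the pandas 3.0 change the warning describes.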


### Fill Missing Values in String (Categorical) Columns

For string or categorical columns, common strategies include filling with the mode (most frequent value) or a new category like 'Unknown'. Using 'Unknown' can be beneficial as it treats missingness as a meaningful category itself, especially if the missing data is not random.

The cell below fills every categorical column with 'Unknown'. The mode-based alternative is kept as commented-out code for comparison.
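
To make the difference between the two strategies concrete, here is a minimal sketch on a made-up column (the values are illustrative only):

```python
import pandas as pd

# Made-up column standing in for one categorical survey feature.
s = pd.Series(["Own", "Rent", None, "Own", None], name="rent_or_own")

filled_mode = s.fillna(s.mode()[0])    # missing rows join the majority class
filled_unknown = s.fillna("Unknown")   # missing rows form their own class

print(filled_mode.value_counts().to_dict())
print(filled_unknown.value_counts().to_dict())
```

Mode filling inflates the most frequent class, while 'Unknown' keeps the information that the answer was missing and lets a model learn from it.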

In [6]:
categorical_cols_with_missing = train_features_data[missing_values.index].select_dtypes(include='object').columns

print(f"Categorical columns with missing values: {list(categorical_cols_with_missing)}")

for col in categorical_cols_with_missing:
    # Option 1 (disabled): fill with the mode
    # mode_val = train_features_data[col].mode()[0]
    # train_features_data[col] = train_features_data[col].fillna(mode_val)
    # print(f"Filled missing values in '{col}' with mode: '{mode_val}'")

    # Option 2 (active): fill with 'Unknown'
    train_features_data[col] = train_features_data[col].fillna('Unknown')
    print(f"Filled missing values in '{col}' with 'Unknown'")

print("\nMissing values after categorical imputation:")
print(train_features_data[categorical_cols_with_missing].isnull().sum())

Categorical columns with missing values: ['employment_occupation', 'employment_industry', 'income_poverty', 'rent_or_own', 'employment_status', 'marital_status', 'education']
Filled missing values in 'employment_occupation' with 'Unknown'
Filled missing values in 'employment_industry' with 'Unknown'
Filled missing values in 'income_poverty' with 'Unknown'
Filled missing values in 'rent_or_own' with 'Unknown'
Filled missing values in 'employment_status' with 'Unknown'
Filled missing values in 'marital_status' with 'Unknown'
Filled missing values in 'education' with 'Unknown'

Missing values after categorical imputation:
employment_occupation    0
employment_industry      0
income_poverty           0
rent_or_own              0
employment_status        0
marital_status           0
education                0
dtype: int64




### Verify All Missing Values are Handled

Finally, let's double-check if all missing values have been addressed in the `train_features_data` DataFrame.

In [7]:
total_missing_after_imputation = train_features_data.isnull().sum().sum()

if total_missing_after_imputation == 0:
    print("All missing values in 'train_features_data' have been successfully filled!")
else:
    print(f"There are still {total_missing_after_imputation} missing values remaining. Please review the imputation steps.")
    display(train_features_data.isnull().sum()[train_features_data.isnull().sum() > 0])

All missing values in 'train_features_data' have been successfully filled!


In [8]:
train_features_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26707 entries, 0 to 26706
Data columns (total 36 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   respondent_id                26707 non-null  int64  
 1   h1n1_concern                 26707 non-null  float64
 2   h1n1_knowledge               26707 non-null  float64
 3   behavioral_antiviral_meds    26707 non-null  float64
 4   behavioral_avoidance         26707 non-null  float64
 5   behavioral_face_mask         26707 non-null  float64
 6   behavioral_wash_hands        26707 non-null  float64
 7   behavioral_large_gatherings  26707 non-null  float64
 8   behavioral_outside_home      26707 non-null  float64
 9   behavioral_touch_face        26707 non-null  float64
 10  doctor_recc_h1n1             26707 non-null  float64
 11  doctor_recc_seasonal         26707 non-null  float64
 12  chronic_med_condition        26707 non-null  float64
 13  child_under_6_mo

In [9]:
columns = ["h1n1_concern","h1n1_knowledge","behavioral_antiviral_meds","behavioral_avoidance","behavioral_face_mask","behavioral_wash_hands","behavioral_large_gatherings","behavioral_outside_home","behavioral_touch_face","doctor_recc_h1n1","doctor_recc_seasonal","chronic_med_condition","child_under_6_months","health_worker","health_insurance","opinion_h1n1_vacc_effective","opinion_h1n1_risk", "opinion_h1n1_sick_from_vacc", "opinion_seas_vacc_effective","opinion_seas_risk","opinion_seas_sick_from_vacc","age_group","education","race", "sex","income_poverty","marital_status","rent_or_own", "employment_status","hhs_geo_region","census_msa","household_adults", "household_children","employment_industry","employment_occupation"]
len(columns)

35

In [10]:
train_labels_data.describe()

statistic,respondent_id,h1n1_vaccine,seasonal_vaccine
count,26707.0,26707.0,26707.0
mean,13353.0,0.212454,0.465608
std,7709.791156,0.409052,0.498825
min,0.0,0.0,0.0
25%,6676.5,0.0,0.0
50%,13353.0,0.0,0.0
75%,20029.5,0.0,1.0
max,26706.0,1.0,1.0


In [11]:
y1 = train_labels_data["h1n1_vaccine"]
y2 = train_labels_data["seasonal_vaccine"]

In [12]:
X = train_features_data[columns]

In [13]:
!pip install scikit-learn
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error




### Fix Missing Values in Test Data

In [21]:
test_features_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26708 entries, 0 to 26707
Data columns (total 36 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   respondent_id                26708 non-null  int64  
 1   h1n1_concern                 26623 non-null  float64
 2   h1n1_knowledge               26586 non-null  float64
 3   behavioral_antiviral_meds    26629 non-null  float64
 4   behavioral_avoidance         26495 non-null  float64
 5   behavioral_face_mask         26689 non-null  float64
 6   behavioral_wash_hands        26668 non-null  float64
 7   behavioral_large_gatherings  26636 non-null  float64
 8   behavioral_outside_home      26626 non-null  float64
 9   behavioral_touch_face        26580 non-null  float64
 10  doctor_recc_h1n1             24548 non-null  float64
 11  doctor_recc_seasonal         24548 non-null  float64
 12  chronic_med_condition        25776 non-null  float64
 13  child_under_6_mo

In [22]:
missing_values_test = test_features_data.isnull().sum()
missing_values_test = missing_values_test[missing_values_test > 0].sort_values(ascending=False)

print("Columns with missing values test and their counts:")
display(missing_values_test)

# Also check the data types of these columns
print("\nData types of columns with missing values test:")
display(test_features_data[missing_values_test.index].dtypes)

Columns with missing values test and their counts:


column,missing_count
employment_occupation,13426
employment_industry,13275
health_insurance,12228
income_poverty,4497
doctor_recc_seasonal,2160
doctor_recc_h1n1,2160
rent_or_own,2036
employment_status,1471
marital_status,1442
education,1407



Data types of columns with missing values test:


column,dtype
employment_occupation,object
employment_industry,object
health_insurance,float64
income_poverty,object
doctor_recc_seasonal,float64
doctor_recc_h1n1,float64
rent_or_own,object
employment_status,object
marital_status,object
education,object


In [23]:
numeric_cols_with_missing_test = test_features_data[missing_values_test.index].select_dtypes(include=np.number).columns

print(f"Numeric columns with missing values test: {list(numeric_cols_with_missing_test)}")

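for col in numeric_cols_with_missing_test:
    median_val = test_features_data[col].median()
    # Assign the result back instead of using inplace=True on a column selection,
    # which pandas warns will stop working in pandas 3.0.
    test_features_data[col] = test_features_data[col].fillna(median_val)
    print(f"Filled missing values test in '{col}' with median: {median_val}")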

print("\nMissing values after numeric imputation:")
print(test_features_data[numeric_cols_with_missing_test].isnull().sum())

Numeric columns with missing values test: ['health_insurance', 'doctor_recc_seasonal', 'doctor_recc_h1n1', 'chronic_med_condition', 'child_under_6_months', 'health_worker', 'opinion_seas_sick_from_vacc', 'opinion_seas_risk', 'opinion_seas_vacc_effective', 'opinion_h1n1_vacc_effective', 'opinion_h1n1_risk', 'opinion_h1n1_sick_from_vacc', 'household_children', 'household_adults', 'behavioral_avoidance', 'behavioral_touch_face', 'h1n1_knowledge', 'h1n1_concern', 'behavioral_outside_home', 'behavioral_antiviral_meds', 'behavioral_large_gatherings', 'behavioral_wash_hands', 'behavioral_face_mask']
Filled missing values test in 'health_insurance' with median: 1.0
Filled missing values test in 'doctor_recc_seasonal' with median: 0.0
Filled missing values test in 'doctor_recc_h1n1' with median: 0.0
Filled missing values test in 'chronic_med_condition' with median: 0.0
Filled missing values test in 'child_under_6_months' with median: 0.0
Filled missing values test in 'health_worker' with median



In [24]:
categorical_cols_with_missing_test = test_features_data[missing_values_test.index].select_dtypes(include='object').columns

print(f"Categorical columns with missing values test: {list(categorical_cols_with_missing_test)}")

for col in categorical_cols_with_missing_test:
    # Option 1 (disabled): fill with the mode
    # mode_val = test_features_data[col].mode()[0]
    # test_features_data[col] = test_features_data[col].fillna(mode_val)
    # print(f"Filled missing values test in '{col}' with mode: '{mode_val}'")

    # Option 2 (active): fill with 'Unknown'
    test_features_data[col] = test_features_data[col].fillna('Unknown')
    print(f"Filled missing values test in '{col}' with 'Unknown'")

print("\nMissing values after categorical imputation:")
print(test_features_data[categorical_cols_with_missing_test].isnull().sum())

Categorical columns with missing values test: ['employment_occupation', 'employment_industry', 'income_poverty', 'rent_or_own', 'employment_status', 'marital_status', 'education']
Filled missing values test in 'employment_occupation' with 'Unknown'
Filled missing values test in 'employment_industry' with 'Unknown'
Filled missing values test in 'income_poverty' with 'Unknown'
Filled missing values test in 'rent_or_own' with 'Unknown'
Filled missing values test in 'employment_status' with 'Unknown'
Filled missing values test in 'marital_status' with 'Unknown'
Filled missing values test in 'education' with 'Unknown'

Missing values after categorical imputation:
employment_occupation    0
employment_industry      0
income_poverty           0
rent_or_own              0
employment_status        0
marital_status           0
education                0
dtype: int64




In [25]:
total_missing_after_imputation_test = test_features_data.isnull().sum().sum()

if total_missing_after_imputation_test == 0:
    print("All missing values in 'test_features_data' have been successfully filled!")
else:
    print(f"There are still {total_missing_after_imputation_test} missing values remaining. Please review the imputation steps.")
    display(test_features_data.isnull().sum()[test_features_data.isnull().sum() > 0])

All missing values in 'test_features_data' have been successfully filled!
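
The test-set cells above compute each median from the test set itself. A commonly used alternative is to learn the fill values on the training set and reuse them on the test set, so that both sets are transformed identically. Below is a minimal sketch on made-up frames (`train`, `test`, and `fill_values` are hypothetical names, not from the notebook).

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for the training and test feature frames.
train = pd.DataFrame({
    "h1n1_concern": [0.0, 1.0, 3.0, np.nan],
    "education": ["12 Years", None, "12 Years", "College Graduate"],
})
test = pd.DataFrame({
    "h1n1_concern": [np.nan, 2.0],
    "education": [None, "College Graduate"],
})

# Learn fill values from the training frame only.
fill_values = train.median(numeric_only=True).to_dict()
for col in train.select_dtypes(exclude="number").columns:
    fill_values[col] = "Unknown"

# Apply the same mapping to both frames.
train_filled = train.fillna(fill_values)
test_filled = test.fillna(fill_values)
print(test_filled)
```

For this dataset the two approaches may well give the same medians, since the outputs above report the same values for the columns shown; the sketch only illustrates the pattern.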


In [27]:

# Identify categorical columns in X that are of object type
categorical_cols = X.select_dtypes(include='object').columns

# Apply one-hot encoding to these categorical columns
X_encoded = pd.get_dummies(X, columns=categorical_cols, drop_first=True)

# Optional hold-out split for validation (disabled):
# train_X, val_X, train_y2, val_y2 = train_test_split(X_encoded, y2, random_state=1)

# Encode the test features the same way
X_test = test_features_data[columns]
categorical_cols_test = X_test.select_dtypes(include='object').columns
X_encoded_test = pd.get_dummies(X_test, columns=categorical_cols_test, drop_first=True)

# Make the test columns match the training columns (same names, same order);
# a dummy column absent from the test set is filled with 0.
X_encoded_test = X_encoded_test.reindex(columns=X_encoded.columns, fill_value=0)

# One model per target; each is fit on 0/1 labels, so predictions fall in [0, 1]
model1 = RandomForestRegressor(random_state=1, max_leaf_nodes=604)
model1.fit(X_encoded, y1)
predictions1 = model1.predict(X_encoded_test)

model2 = RandomForestRegressor(random_state=1, max_leaf_nodes=920)
model2.fit(X_encoded, y2)
predictions2 = model2.predict(X_encoded_test)
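
The cell above fits on all training rows, so it gives no estimate of how well the models generalize. Below is a minimal sketch of a hold-out check using ROC AUC, the metric named in this notebook's task description. The data is synthetic (`X_demo`, `y_demo`) and the hyperparameters are placeholders, not the notebook's values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary data standing in for X_encoded and y1.
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=1)
train_X, val_X, train_y, val_y = train_test_split(X_demo, y_demo, random_state=1)

model = RandomForestRegressor(n_estimators=50, max_leaf_nodes=32, random_state=1)
model.fit(train_X, train_y)

# A regressor trained on 0/1 labels returns scores in [0, 1];
# ROC AUC depends only on how those scores rank the validation rows.
val_auc = roc_auc_score(val_y, model.predict(val_X))
print(f"Validation ROC AUC: {val_auc:.3f}")
```

The same pattern should apply to the real data by swapping in `X_encoded` and `y1` (or `y2`) for the synthetic arrays.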



In [28]:
submission_data.head()

Unnamed: 0,respondent_id,h1n1_vaccine,seasonal_vaccine
0,26707,0.5,0.7
1,26708,0.5,0.7
2,26709,0.5,0.7
3,26710,0.5,0.7
4,26711,0.5,0.7


In [29]:
output = pd.DataFrame({'respondent_id':test_features_data['respondent_id'],'h1n1_vaccine':predictions1,'seasonal_vaccine':predictions2})
output.to_csv('submission.csv',index = False)
print(output)

       respondent_id  h1n1_vaccine  seasonal_vaccine
0              26707      0.109635          0.224989
1              26708      0.034381          0.055151
2              26709      0.629646          0.814877
3              26710      0.672518          0.920233
4              26711      0.459812          0.812960
...              ...           ...               ...
26703          53410      0.310890          0.647225
26704          53411      0.143513          0.391784
26705          53412      0.084681          0.377470
26706          53413      0.021220          0.329258
26707          53414      0.444823          0.549045

[26708 rows x 3 columns]


# Task
Improve the existing RandomForestRegressor models for `h1n1_vaccine` and `seasonal_vaccine` by switching to `RandomForestClassifier` and optimizing their hyperparameters using `RandomizedSearchCV`. Use 'roc_auc' as the scoring metric for optimization. After finding the best hyperparameters for each target, retrain the `RandomForestClassifier` models and generate probability predictions for the test dataset. Finally, provide a summary of the model improvements, the final ROC AUC scores, and discuss potential next steps for further enhancement.

## Switch to RandomForestClassifier and Implement RandomizedSearchCV for H1N1

### Subtask:
Import RandomForestClassifier and RandomizedSearchCV. Define a parameter distribution for RandomForestClassifier and apply RandomizedSearchCV to find the best hyperparameters for predicting h1n1_vaccine (y1), using 'roc_auc' as the scoring metric. The `.predict_proba()` method will be used to obtain the required floating-point probability outputs.


**Reasoning**:
The subtask requires importing `RandomForestClassifier` and `RandomizedSearchCV`, defining a hyperparameter distribution for the classifier, and then applying `RandomizedSearchCV` to find the best hyperparameters for the `h1n1_vaccine` target variable. This step will import the necessary classes, define the parameter grid, instantiate the models, and fit the `RandomizedSearchCV` object to the data.



In [31]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform

# Define the parameter distribution for RandomForestClassifier
param_dist = {
    'n_estimators': randint(100, 500),
    'max_features': ['sqrt', 'log2', 0.6, 0.8],
    'max_depth': randint(10, 110),
    'min_samples_split': randint(2, 10),
    'min_samples_leaf': randint(1, 4),
    'bootstrap': [True, False]
}

# Instantiate a RandomForestClassifier model
rf_classifier = RandomForestClassifier(random_state=42)



In [None]:
# Instantiate RandomizedSearchCV
random_search_h1n1 = RandomizedSearchCV(
    estimator=rf_classifier,
    param_distributions=param_dist,
    n_iter=50, # Reduced n_iter for faster execution during demonstration
    cv=5,
    scoring='roc_auc',
    random_state=42,
    n_jobs=-1 # Use all available cores
)

# Fit RandomizedSearchCV to the one-hot encoded training features and the h1n1_vaccine target
random_search_h1n1.fit(X_encoded, y1)

print("RandomizedSearchCV for H1N1 vaccine completed.")
print(f"Best parameters for H1N1 vaccine: {random_search_h1n1.best_params_}")
print(f"Best ROC AUC score for H1N1 vaccine: {random_search_h1n1.best_score_}")
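
Once a search like the one above has been fitted, probability predictions can be taken from its best model. Below is a minimal sketch on synthetic data with a deliberately tiny search space so it runs quickly. The names `search`, `X_demo`, and `y_demo` and the parameter ranges are illustrative assumptions, not the notebook's settings.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic binary data standing in for X_encoded and y1.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions={"n_estimators": randint(10, 30), "max_depth": randint(2, 6)},
    n_iter=3,
    cv=3,
    scoring="roc_auc",
    random_state=42,
)
search.fit(X_demo, y_demo)

# With the default refit=True, best_estimator_ has already been retrained
# on all the data passed to fit(). Column 1 of predict_proba holds the
# probability of the positive class (label 1).
proba = search.best_estimator_.predict_proba(X_demo[:5])[:, 1]
print(search.best_params_)
print(proba)
```

On the real data, `search.best_estimator_.predict_proba(X_encoded_test)[:, 1]` would play the role that `predictions1` plays in the regressor-based cell.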

**Reasoning**:
Now that RandomizedSearchCV has been applied to the h1n1_vaccine target, the next logical step is to apply the same process to the seasonal_vaccine target (y2) to find its optimal hyperparameters using RandomForestClassifier and 'roc_auc' scoring. This completes the hyperparameter tuning for both target variables.



# Task
Implement `RandomizedSearchCV` for the `seasonal_vaccine` (`y2`) model, using `RandomForestClassifier` and 'roc_auc' scoring, to find its optimal hyperparameters. Display the best hyperparameters and the corresponding best ROC AUC score.

## Analyze Best Parameters for H1N1 Model

### Subtask:
Display the best hyperparameters found by RandomizedSearchCV for the H1N1 vaccine model, along with the corresponding best ROC AUC score.


**Reasoning**:
To display the best hyperparameters and the best ROC AUC score for the H1N1 vaccine model, I will print the `best_params_` and `best_score_` attributes of the `random_search_h1n1` object.



In [33]:
print("Best hyperparameters for H1N1 vaccine:")
print(random_search_h1n1.best_params_)
print("\nBest ROC AUC score for H1N1 vaccine:")
print(random_search_h1n1.best_score_)

Best hyperparameters for H1N1 vaccine:


AttributeError: 'RandomizedSearchCV' object has no attribute 'best_params_'

**Reasoning**:
The `AttributeError` occurs because the cell that calls `random_search_h1n1.fit()` (marked `In [None]` above) was never executed, so `best_params_` and `best_score_` do not exist yet. Run that cell to completion first, then rerun this display code.
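
A display cell can guard against this situation by checking for the fitted attribute first. Below is a minimal sketch; `search` is a hypothetical, deliberately unfitted object used only to show the check.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Constructed but never fitted, mirroring the situation in the cell above.
search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions={"max_depth": [2, 4]},
    n_iter=2,
)

# best_params_ and best_score_ are only created by fit().
is_fitted = hasattr(search, "best_params_")
if is_fitted:
    print(search.best_params_, search.best_score_)
else:
    print("Search not fitted yet: run the cell that calls .fit() first.")
```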

