# 🌳 Random Forest Classifier

### Authors:
| Name                          | Github user                                        |
|-------------------------------|----------------------------------------------------|
| Sergio Herreros Fernández     | [@SergioHerreros](https://github.com/SERGI0HERREROS)|
| Francisco Javier Luna Ortiz   | [@Lunao01](https://github.com/Lunao01)|
| Carlos Romero Navarro         | [@KarManiatic](https://github.com/KarManiatic)|
| Tatsiana Shelepen             | [@Naschkatzee](https://github.com/Naschkatzee) | 

<br>

## 1. Data

In [26]:
# Libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, roc_auc_score
warnings.filterwarnings("ignore")

# Data
training_set_features_df = pd.read_csv('gold/training_set_features_df.csv') # training set features

training_set_labels_df = pd.read_csv('data/training_set_labels.csv') # training set labels

test_set_features_df = pd.read_csv('gold/test_set_features_df.csv') # test set features

<br>

## 2. Approach

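Modelling: each target label (`h1n1_vaccine` and `seasonal_vaccine`) gets its own random forest with 100 trees and the entropy split criterion, trained on every column of the feature set except the first.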

In [27]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import KNNImputer

In [28]:
## RandomForestClassifier - h1n1_vaccine
rf_classifier_h1n1_vaccine = RandomForestClassifier(random_state = 0, n_estimators = 100, criterion = 'entropy')
rf_classifier_h1n1_vaccine.fit(training_set_features_df.iloc[:, 1:], training_set_labels_df['h1n1_vaccine'])

In [29]:
## RandomForestClassifier - seasonal_vaccine
rf_classifier_seasonal_vaccine = RandomForestClassifier(random_state = 0, n_estimators = 100, criterion = 'entropy')
rf_classifier_seasonal_vaccine.fit(training_set_features_df.iloc[:, 1:], training_set_labels_df['seasonal_vaccine'])

Prediction: `predict_proba` returns the class probabilities for every respondent in the test set.

In [30]:
rfc_y_pred_h1n1_vaccine = rf_classifier_h1n1_vaccine.predict_proba(test_set_features_df.iloc[:, 1:])

rfc_y_pred_seasonal_vaccine = rf_classifier_seasonal_vaccine.predict_proba(test_set_features_df.iloc[:, 1:])
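
The array returned by `predict_proba` has one column per class, ordered as in the fitted model's `classes_` attribute, so column 1 holds the probability of the positive class. A minimal sketch of that layout, assuming synthetic stand-in data from `make_classification` rather than the survey data (the names `X_demo`, `y_demo`, and `clf` are illustrative only):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption: NOT the survey data used in this notebook).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Same hyperparameters as the models fitted above.
clf = RandomForestClassifier(random_state=0, n_estimators=100, criterion="entropy")
clf.fit(X_demo, y_demo)

# One row per sample, one column per class, in the order given by clf.classes_.
proba = clf.predict_proba(X_demo[:5])
print(clf.classes_)
print(proba.shape)

# Column 1 is the probability of class 1 (the positive class).
positive_proba = proba[:, 1]
```

Each row of `proba` sums to 1, so keeping only column 1 loses no information for a binary label.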

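Results: the positive-class probabilities (column 1) from both models are collected into a single DataFrame keyed by `respondent_id`.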

In [31]:
y_preds = pd.DataFrame(
    {
        'respondent_id': test_set_features_df['respondent_id'],
        'h1n1_vaccine': rfc_y_pred_h1n1_vaccine[:, 1],
        'seasonal_vaccine':rfc_y_pred_seasonal_vaccine[:, 1],
    },

)
print('y_preds.shape:', y_preds.shape)
y_preds.head()

y_preds.shape: (26708, 3)


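|   | respondent_id | h1n1_vaccine | seasonal_vaccine |
|---|---------------|--------------|------------------|
| 0 | 26707         | 0.16         | 0.24             |
| 1 | 26708         | 0.00         | 0.03             |
| 2 | 26709         | 0.53         | 0.83             |
| 3 | 26710         | 0.59         | 0.86             |
| 4 | 26711         | 0.31         | 0.42             |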


<br>

## 3. Evaluation

In [33]:
X = training_set_features_df
y_h1n1_vaccine = training_set_labels_df[['h1n1_vaccine']]
y_seasonal_vaccine = training_set_labels_df[['seasonal_vaccine']]

X_train_h1n1_vaccine, X_test_h1n1_vaccine, y_train_h1n1_vaccine, y_test_h1n1_vaccine = train_test_split(X, y_h1n1_vaccine, 
                                                    test_size=0.25, 
                                                    shuffle=True,
                                                    stratify=y_h1n1_vaccine,
                                                    random_state=1)

X_train_seasonal_vaccine, X_test_seasonal_vaccine, y_train_seasonal_vaccine, y_test_seasonal_vaccine = train_test_split(X, y_seasonal_vaccine, 
                                                    test_size=0.25, 
                                                    shuffle=True,
                                                    stratify=y_seasonal_vaccine,
                                                    random_state=1)
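
Passing `stratify` makes `train_test_split` keep the class proportions of the labels in both splits, which matters when one class is much rarer than the other. A small sketch of the effect, assuming made-up labels with 20% positives (not the survey labels):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 200 positives, 800 negatives.
y_demo = np.array([1] * 200 + [0] * 800)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# Same split settings as the cell above.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.25,
    shuffle=True,
    stratify=y_demo,
    random_state=1,
)

# Both splits keep the positive rate of the full label vector.
print(len(y_tr), y_tr.mean())
print(len(y_te), y_te.mean())
```

Without `stratify`, the positive rate of each split would vary with the random seed.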

In [None]:
# Function to draw ROC curve and print score

def draw_roc_curve(test, pred_proba):
    # Generate ROC curve values: fpr, tpr, thresholds
    fpr, tpr, thresholds = roc_curve(test, pred_proba)

    # Plot ROC curve
    plt.plot([0, 1], [0, 1], 'k--')
    plt.plot(fpr, tpr)
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('ROC Curve')
    plt.show()
    
    return roc_auc_score(test, pred_proba)
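
The score returned by `roc_auc_score` can be read as the fraction of (positive, negative) pairs in which the positive sample receives the higher predicted score. The toy example below, with four hand-picked values that contain no ties, checks that reading against scikit-learn:

```python
from sklearn.metrics import roc_auc_score

# Toy labels and scores (illustrative values, not model output).
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# Split the scores by true class.
pos = [s for s, t in zip(y_score, y_true) if t == 1]
neg = [s for s, t in zip(y_score, y_true) if t == 0]

# Count the pairs where the positive sample outranks the negative one.
pairs_won = sum(p > n for p in pos for n in neg)
manual_auc = pairs_won / (len(pos) * len(neg))  # 3 of 4 pairs

auc = roc_auc_score(y_true, y_score)
print(manual_auc, auc)
```

A value of 0.5 corresponds to the dashed diagonal drawn by `draw_roc_curve`, and 1.0 to a perfect ranking.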

h1n1_vaccine ROC curve.

In [None]:
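# The test set has no labels, so its predictions cannot be scored against the training labels.
# Fit on the training split and score on the hold-out split instead.
rf_eval_h1n1_vaccine = RandomForestClassifier(random_state = 0, n_estimators = 100, criterion = 'entropy')
rf_eval_h1n1_vaccine.fit(X_train_h1n1_vaccine.iloc[:, 1:], y_train_h1n1_vaccine['h1n1_vaccine'])
draw_roc_curve(y_test_h1n1_vaccine['h1n1_vaccine'], rf_eval_h1n1_vaccine.predict_proba(X_test_h1n1_vaccine.iloc[:, 1:])[:, 1])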

seasonal_vaccine ROC curve.
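
The stratified split from this section can be used to score a model on rows it was not trained on. A minimal sketch of such a hold-out evaluation follows. `holdout_auc` is a hypothetical helper, not part of the original notebook, and it is demonstrated on synthetic data from `make_classification`:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def holdout_auc(X, y, test_size=0.25, random_state=1):
    """Fit a forest on a stratified training split and return the hold-out AUC.

    Hypothetical helper; hyperparameters are copied from the models in section 2.
    """
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=random_state
    )
    model = RandomForestClassifier(
        random_state=0, n_estimators=100, criterion="entropy"
    )
    model.fit(X_tr, y_tr)
    # Score with the positive-class probability, as in draw_roc_curve.
    return roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])


# Synthetic stand-in data (assumption: NOT the survey data).
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
demo_auc = holdout_auc(X_demo, y_demo)
print(demo_auc)
```

In this notebook it could be called as `holdout_auc(training_set_features_df.iloc[:, 1:], training_set_labels_df['seasonal_vaccine'])`, assuming the rows of both DataFrames are in the same order.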

<br>

## 4. Export results

The predictions of both models are exported as a single CSV file to the project's `results` folder.

In [None]:
# Export the CSV.
y_preds.to_csv('./results/RandomForestClassifier_results.csv', index=False)
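
`to_csv` does not create missing folders, so the export fails if `results` does not exist yet. The sketch below creates the folder first and runs two basic sanity checks before writing. It assumes a small hypothetical DataFrame, `demo_preds`, in the same layout as `y_preds` (values copied from the first rows shown above), and writes to a temporary directory instead of the project folder:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical predictions in the same layout as y_preds.
demo_preds = pd.DataFrame(
    {
        "respondent_id": [26707, 26708, 26709],
        "h1n1_vaccine": [0.16, 0.0, 0.53],
        "seasonal_vaccine": [0.24, 0.03, 0.83],
    }
)

# Sanity checks: no missing values, all probabilities within [0, 1].
values = demo_preds[["h1n1_vaccine", "seasonal_vaccine"]].to_numpy()
assert not pd.isna(values).any()
assert ((values >= 0) & (values <= 1)).all()

# Create the target folder before writing (temporary location for this demo).
out_dir = Path(tempfile.mkdtemp()) / "results"
out_dir.mkdir(parents=True, exist_ok=True)
out_path = out_dir / "RandomForestClassifier_results.csv"
demo_preds.to_csv(out_path, index=False)

# Read the file back to confirm the layout survived the round trip.
round_trip = pd.read_csv(out_path)
print(round_trip.shape)
```

In the notebook itself, the same `mkdir` call on `Path('./results')` before `y_preds.to_csv(...)` would serve the same purpose.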