# Random Forest Classifiers
---

## Introduction

The first models we use to tackle the current problem are **random forest classifiers**.

As always, the first step consists of loading the packages we need:

In [1]:
# Activate 'airbnb' environment:
!source activate airbnb
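
Note that a `!` command runs in a temporary subshell, so activating an environment this way may not change the interpreter that the notebook kernel itself uses. A minimal sketch for checking which interpreter the kernel runs on is given below. The environment name `airbnb` comes from the cell above; the helper name `kernel_environment` is hypothetical.

```python
import os
import sys


def kernel_environment():
    """Return the interpreter path and the name of its environment directory."""
    # sys.prefix points at the root of the environment the kernel runs in.
    env_name = os.path.basename(os.path.normpath(sys.prefix))
    return sys.executable, env_name


executable, env_name = kernel_environment()
print("Kernel interpreter: {}".format(executable))
print("Environment directory name: {}".format(env_name))
```

If the printed directory name is not `airbnb`, the kernel was probably started from another environment.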

In [2]:
# Needed packages:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
import joblib  # 'sklearn.externals.joblib' was removed in scikit-learn 0.23
from utils import (create_training_testing_datasets,
                   calculate_dcg,
                   calculate_ndcg,
                   clf_prediction,
                   ndcg_mean_score_calc,
                   detailed_ndcg_mean_score_calc)



---

## Create training and testing datasets

In [3]:
# Load the data:
consolidated_dataset = pd.read_csv("../data/consolidated_dataset.csv")

# Check basic info:
print("*** Some basic info:")
print("'consolidated_dataset' has {} data points with {} variables each.".format(*consolidated_dataset.shape))
print("'consolidated_dataset' counts {} missing values.".format(consolidated_dataset.isnull().sum().sum()))

# Take a look at the first rows:
print("\n*** First lines:")
display(consolidated_dataset.head())

*** Some basic info:
'consolidated_dataset' has 213451 data points with 161 variables each.
'consolidated_dataset' counts 0 missing values.

*** First lines:


| Unnamed: 0 | age | country_destination | nans | day_account_created | weekday_account_created | week_account_created | month_account_created | year_account_created | day_first_active | weekday_first_active | ... | first_browser_SeaMonkey | first_browser_Silk | first_browser_SiteKiosk | first_browser_SlimBrowser | first_browser_Sogou Explorer | first_browser_Stainless | first_browser_TenFourFox | first_browser_TheWorld Browser | first_browser_Yandex.Browser | first_browser_wOSBrowser |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1.0 | NDF | 1.225078 | 28 | 0 | 26 | 6 | 2010 | 19 | 3 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 38.0 | NDF | -0.453135 | 25 | 2 | 21 | 5 | 2011 | 23 | 5 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 56.0 | US | -0.453135 | 28 | 1 | 39 | 9 | 2010 | 9 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 42.0 | other | -0.453135 | 5 | 0 | 49 | 12 | 2011 | 31 | 5 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 41.0 | US | 0.385972 | 14 | 1 | 37 | 9 | 2010 | 8 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


In [4]:
# Create training and testing datasets:
X_train, X_test, y_train, y_test, encoding_dict = create_training_testing_datasets(consolidated_dataset)

---

## Calculate Normalized DCG scores
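
The helpers `calculate_dcg` and `calculate_ndcg` come from the project's `utils` module and are not shown in this notebook. As a rough illustration of the metric only, the sketch below assumes the common setting where each user has exactly one relevant destination. The ideal DCG is then 1, and the score reduces to 1 / log2(rank + 1) for the rank at which the true destination appears. The function name `ndcg_single_label` and the cutoff `k` are hypothetical.

```python
import math


def ndcg_single_label(true_label, ranked_predictions, k=5):
    """nDCG@k when exactly one label is relevant (relevance 1, others 0)."""
    for position, predicted in enumerate(ranked_predictions[:k], start=1):
        if predicted == true_label:
            # DCG = (2**1 - 1) / log2(position + 1); the ideal DCG is 1.
            return 1.0 / math.log2(position + 1)
    # The true label does not appear in the top k predictions.
    return 0.0


print(ndcg_single_label("US", ["US", "NDF", "FR"]))  # true label ranked first
print(ndcg_single_label("FR", ["US", "NDF", "FR"]))  # true label ranked third
print(ndcg_single_label("PT", ["US", "NDF", "FR"]))  # true label missing
```

Under these assumptions a correct first guess scores 1.0, a correct third guess scores 0.5, and a missing destination scores 0.0.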

### "Out-of-the-box" random forest classifier

In [5]:
# Count the occurrences of each class in the training set
# (these raw counts are then used as class weights):
dest_weight = {x: 0 for x in range(12)}
for country_dest in y_train:
    dest_weight[country_dest] += 1

# Initialize the classifier:
rf_clf = RandomForestClassifier(random_state=42, class_weight=dest_weight)

# Train the classifier:
print("Time info about classifier training:")
%time ootb_rf_clf = rf_clf.fit(X_train, y_train)

Time info about classifier training:




CPU times: user 6.85 s, sys: 180 ms, total: 7.04 s
Wall time: 7.07 s
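
The `dest_weight` dictionary holds raw class counts, so passing it as `class_weight` gives the most frequent classes the largest weights. For comparison, the sketch below computes the inverse-frequency weights that scikit-learn documents for `class_weight='balanced'`, namely `n_samples / (n_classes * count)`. The counts are the ones printed for `class_weight` later in this notebook; the variable names are illustrative.

```python
# Training-set class counts, as printed for `class_weight` in this notebook.
class_counts = {0: 431, 1: 1142, 2: 849, 3: 1799, 4: 4018, 5: 1859,
                6: 2268, 7: 99634, 8: 610, 9: 174, 10: 49901, 11: 8075}

n_samples = sum(class_counts.values())
n_classes = len(class_counts)

# Inverse-frequency weights: rare classes get large weights.
balanced_weight = {label: n_samples / (n_classes * count)
                   for label, count in class_counts.items()}

print("Training samples: {}".format(n_samples))
for label in sorted(class_counts):
    print("class {:2d}: count={:6d}  balanced weight={:.4f}".format(
        label, class_counts[label], balanced_weight[label]))
```

The counts sum to 170760, about 80% of the 213451 rows in the consolidated dataset. With inverse-frequency weighting, class 7 (NDF) would get a weight below 1 and class 9 (PT) a weight well above 1, the opposite ordering of the raw counts.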


In [6]:
# Perform one prediction to check classifier:
print("Time info about classifier prediction:")
%time preds_list = clf_prediction(ootb_rf_clf, X_train[0])
print("***")

# Reverse encoding dictionary:
decoding_dict = dict(map(reversed, encoding_dict.items()))

# Print result:
print("For the classifier check:")
print("- Real first booking destination country: {}".format(decoding_dict[y_train[0]]))
print("- Predictions list: {}".format([decoding_dict[x] for x in preds_list]))

Time info about classifier prediction:
CPU times: user 2.56 ms, sys: 1.37 ms, total: 3.93 ms
Wall time: 2.8 ms
***
For the classifier check:
- Real first booking destination country: US
- Predictions list: ['US', 'NDF', 'FR']
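
`clf_prediction` is a project helper whose code is not shown here. One plausible way to build such a ranked list, sketched below under that assumption, is to sort the classes by predicted probability and drop those with zero probability. That would be consistent with only three destinations being returned above. The function `rank_classes` and the sample probabilities are hypothetical.

```python
def rank_classes(classes, probabilities, max_items=None):
    """Return class labels sorted by decreasing probability, skipping zeros."""
    scored = [(prob, label) for label, prob in zip(classes, probabilities)
              if prob > 0]
    # Sort by probability, highest first; ties keep the original class order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    ranked = [label for _, label in scored]
    return ranked if max_items is None else ranked[:max_items]


# Made-up probabilities for a single sample, e.g. one row of predict_proba.
classes = ["FR", "NDF", "US", "other"]
probabilities = [0.1, 0.3, 0.6, 0.0]
print(rank_classes(classes, probabilities))
```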


In [7]:
# Calculate nDCG mean score on training dataset:
print("Time info about nDCG mean score calculation on training dataset:")
%time ndcg_mean_score = ndcg_mean_score_calc(ootb_rf_clf, X_train, y_train)
print("***")
print("On training dataset, classifier nDCG mean score is {:.6f}.".format(ndcg_mean_score))

Time info about nDCG mean score calculation on training dataset:
CPU times: user 3min 9s, sys: 1.56 s, total: 3min 10s
Wall time: 3min 9s
***
On training dataset, classifier nDCG mean score is 0.991944.


In [8]:
# Calculate nDCG mean score on testing dataset:
print("Time info about nDCG mean score calculation on testing dataset:")
%time ndcg_mean_score = ndcg_mean_score_calc(ootb_rf_clf, X_test, y_test)
print("***")
print("On testing dataset, classifier nDCG mean score is {:.6f}.".format(ndcg_mean_score))

Time info about nDCG mean score calculation on testing dataset:
CPU times: user 48.3 s, sys: 405 ms, total: 48.7 s
Wall time: 48.5 s
***
On testing dataset, classifier nDCG mean score is 0.752272.


In [9]:
# Calculate nDCG mean score for each class on testing dataset:
print("Time info about nDCG mean score calculation for each class on testing dataset:")
%time ndcg_mean_scores_list = detailed_ndcg_mean_score_calc(ootb_rf_clf, X_test, y_test, encoding_dict)
print("***")
print("Detailed results for each class on testing dataset:")
for country_dest in range(12):
    print("nDCG mean score for {}: {:.6f}".format(decoding_dict[country_dest],
                                                  ndcg_mean_scores_list[country_dest]))

Time info about nDCG mean score calculation for each class on testing dataset:
CPU times: user 52.1 s, sys: 557 ms, total: 52.7 s
Wall time: 52.8 s
***
Detailed results for each class on testing dataset:
nDCG mean score for AU: 0.003988
nDCG mean score for CA: 0.034762
nDCG mean score for DE: 0.026010
nDCG mean score for ES: 0.051628
nDCG mean score for FR: 0.113896
nDCG mean score for GB: 0.050081
nDCG mean score for IT: 0.066178
nDCG mean score for NDF: 0.891009
nDCG mean score for NL: 0.022114
nDCG mean score for PT: 0.000000
nDCG mean score for US: 0.742001
nDCG mean score for other: 0.221323
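
The per-class figures above are presumably averages of the per-sample nDCG scores, grouped by true destination; `detailed_ndcg_mean_score_calc` itself is not shown. A minimal sketch of that grouping, with a hypothetical function name and made-up scores:

```python
from collections import defaultdict


def mean_score_per_class(true_labels, scores):
    """Average per-sample scores separately for each true label."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for label, score in zip(true_labels, scores):
        totals[label] += score
        counts[label] += 1
    return {label: totals[label] / counts[label] for label in totals}


# Made-up example: two US samples, two NDF samples and one PT sample.
labels = ["US", "NDF", "US", "PT", "NDF"]
scores = [1.0, 1.0, 0.5, 0.0, 0.5]
print(mean_score_per_class(labels, scores))
```

Because the overall score averages over samples, frequent classes such as NDF and US dominate it, which is why it can be high even when rare classes score close to zero.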


In [10]:
# Save model:
joblib.dump(ootb_rf_clf, "../models/ootb_rf_clf.pkl")

['../models/ootb_rf_clf.pkl']

### "Optimized" random forest classifier

In [11]:
# Set parameters of the random grid:
n_estimators = [int(x) for x in np.linspace(10, 100, num=5)]
max_depth = [int(x) for x in np.linspace(10, 100, num=5)]
max_depth.append(None)
min_samples_split = [2, 5, 10]
min_samples_leaf = [1, 2, 4]
# Note: for classifiers, 'auto' is equivalent to 'sqrt'
# (the 'auto' option was removed in scikit-learn 1.3).
max_features = ['auto', 'sqrt']
bootstrap = [True, False]
random_state = [42]
class_weight = [dest_weight]

# Create the random grid:
random_grid = {'n_estimators': n_estimators,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'max_features': max_features,
               'bootstrap': bootstrap,
               'random_state': random_state,
               'class_weight': class_weight}

# Perform randomized search cross validation:

rf_clf = RandomForestClassifier()

rf_clf_random = RandomizedSearchCV(estimator=rf_clf,
                                   param_distributions=random_grid,
                                   n_iter=20,
                                   n_jobs=-1,
                                   random_state=42,
                                   verbose=0)

print("Time info about randomized search cross validation:")
%time rf_clf_fit = rf_clf_random.fit(X_train, y_train)

# Get the "best" classifier:
opt_rf_clf = rf_clf_fit.best_estimator_

Time info about randomized search cross validation:




CPU times: user 1min 19s, sys: 2.19 s, total: 1min 21s
Wall time: 17min 51s
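
To put `n_iter=20` in perspective, the grid above can be enumerated directly. The sketch below rebuilds the candidate values with the standard library only (a hand-written equivalent of the `np.linspace` calls) and counts the combinations. Single-valued entries such as `random_state` and `class_weight` are left out because they do not add combinations.

```python
from itertools import product

# Same five evenly spaced values as int(x) for x in np.linspace(10, 100, num=5).
steps = [int(10 + i * (100 - 10) / 4) for i in range(5)]

grid = {
    "n_estimators": steps,
    "max_depth": steps + [None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_features": ["auto", "sqrt"],
    "bootstrap": [True, False],
}

n_combinations = sum(1 for _ in product(*grid.values()))
n_iter = 20
print("Candidate values: {}".format(steps))
print("Total combinations: {}".format(n_combinations))
print("Share explored with n_iter={}: {:.2%}".format(n_iter, n_iter / n_combinations))
```

The search therefore samples 20 of 1080 combinations, a little under 2% of the grid.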


In [12]:
# Parameters of the "best" classifier:
print("*** Parameters of the 'best' classifier:")
for param, param_value in rf_clf_fit.best_params_.items():
    print("- Parameter {}: {}".format(param, param_value))

*** Parameters of the 'best' classifier:
- Parameter random_state: 42
- Parameter n_estimators: 77
- Parameter min_samples_split: 10
- Parameter min_samples_leaf: 1
- Parameter max_features: sqrt
- Parameter max_depth: None
- Parameter class_weight: {0: 431, 1: 1142, 2: 849, 3: 1799, 4: 4018, 5: 1859, 6: 2268, 7: 99634, 8: 610, 9: 174, 10: 49901, 11: 8075}
- Parameter bootstrap: False
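
Since no `scoring` argument is passed to `RandomizedSearchCV`, the search ranks candidates with the classifier's default score, which for scikit-learn classifiers is accuracy rather than nDCG. scikit-learn also accepts a callable with the signature `scorer(estimator, X, y)` as `scoring`. The sketch below shows what such a callable could look like, assuming one relevant destination per user. `ndcg_scorer` and the stub classifier are hypothetical and only make the example runnable without training a model.

```python
import math


def ndcg_scorer(estimator, X, y, k=5):
    """Mean nDCG@k with one relevant label per sample (scorer-style callable)."""
    classes = list(estimator.classes_)
    total = 0.0
    for probabilities, true_label in zip(estimator.predict_proba(X), y):
        # Rank class indices by decreasing probability and keep the top k.
        order = sorted(range(len(classes)), key=lambda j: -probabilities[j])[:k]
        ranked = [classes[j] for j in order]
        if true_label in ranked:
            total += 1.0 / math.log2(ranked.index(true_label) + 2)
    return total / len(y)


class StubClassifier:
    """Stand-in exposing the two attributes the scorer relies on."""
    classes_ = [0, 1, 2]

    def predict_proba(self, X):
        # Always ranks class 2 first, then class 0, then class 1.
        return [[0.3, 0.1, 0.6] for _ in X]


print(ndcg_scorer(StubClassifier(), X=[[0], [0]], y=[2, 0]))
```

Passed as `scoring=ndcg_scorer`, a callable like this would make the search select on the same kind of metric that is used for evaluation.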


In [13]:
# Perform one prediction to check classifier:
print("Time info about classifier prediction:")
%time preds_list = clf_prediction(opt_rf_clf, X_train[0])
print("***")

# Reverse encoding dictionary:
decoding_dict = dict(map(reversed, encoding_dict.items()))

# Print result:
print("For the classifier check:")
print("- Real first booking destination country: {}".format(decoding_dict[y_train[0]]))
print("- Predictions list: {}".format([decoding_dict[x] for x in preds_list]))

Time info about classifier prediction:
CPU times: user 12.2 ms, sys: 5.05 ms, total: 17.2 ms
Wall time: 15.9 ms
***
For the classifier check:
- Real first booking destination country: US
- Predictions list: ['US', 'NDF', 'other', 'FR', 'IT', 'NL', 'GB', 'ES', 'DE', 'CA']


In [14]:
# Calculate nDCG mean score on training dataset:
print("Time info about nDCG mean score calculation on training dataset:")
%time ndcg_mean_score = ndcg_mean_score_calc(opt_rf_clf, X_train, y_train)
print("***")
print("On training dataset, classifier nDCG mean score is {:.6f}.".format(ndcg_mean_score))

Time info about nDCG mean score calculation on training dataset:
CPU times: user 19min 56s, sys: 9.54 s, total: 20min 5s
Wall time: 20min 4s
***
On training dataset, classifier nDCG mean score is 0.906901.


In [15]:
# Calculate nDCG mean score on testing dataset:
print("Time info about nDCG mean score calculation on testing dataset:")
%time ndcg_mean_score = ndcg_mean_score_calc(opt_rf_clf, X_test, y_test)
print("***")
print("On testing dataset, classifier nDCG mean score is {:.6f}.".format(ndcg_mean_score))

Time info about nDCG mean score calculation on testing dataset:
CPU times: user 4min 45s, sys: 866 ms, total: 4min 46s
Wall time: 4min 46s
***
On testing dataset, classifier nDCG mean score is 0.816485.


In [16]:
# Calculate nDCG mean score for each class on testing dataset:
print("Time info about nDCG mean score calculation for each class on testing dataset:")
%time ndcg_mean_scores_list = detailed_ndcg_mean_score_calc(opt_rf_clf, X_test, y_test, encoding_dict)
print("***")
print("Detailed results for each class on testing dataset:")
for country_dest in range(12):
    print("nDCG mean score for {}: {:.6f}".format(decoding_dict[country_dest],
                                                  ndcg_mean_scores_list[country_dest]))

Time info about nDCG mean score calculation for each class on testing dataset:
CPU times: user 4min 57s, sys: 2.27 s, total: 5min
Wall time: 4min 59s
***
Detailed results for each class on testing dataset:
nDCG mean score for AU: 0.007164
nDCG mean score for CA: 0.035858
nDCG mean score for DE: 0.023998
nDCG mean score for ES: 0.106723
nDCG mean score for FR: 0.375889
nDCG mean score for GB: 0.111050
nDCG mean score for IT: 0.166975
nDCG mean score for NDF: 0.965516
nDCG mean score for NL: 0.013758
nDCG mean score for PT: 0.000000
nDCG mean score for US: 0.742169
nDCG mean score for other: 0.474328


In [17]:
# Save model:
joblib.dump(opt_rf_clf, "../models/opt_rf_clf.pkl")

['../models/opt_rf_clf.pkl']

---

## Conclusion

In short, the results of the "optimized" random forest classifier on the testing dataset lead to four main observations:
* Its overall nDCG mean score (0.816485) is better than the one obtained by the naive model.
* Its nDCG mean scores for Australia, Canada, Germany, Spain, Great Britain, the Netherlands and the USA are better than those obtained by the naive model.
* Its nDCG mean scores for France, Italy, "no destination found" (NDF) and "other" are worse than those obtained by the naive model.
* It is as "bad" as the naive model at predicting Portugal, with an nDCG mean score of 0.
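
The two classifiers trained in this notebook can also be compared with each other directly. The sketch below only reuses the scores printed above (no new results) and computes the train/test gap and the per-class change on the testing dataset. The variable names are illustrative.

```python
# Scores copied from the outputs above: (out-of-the-box, optimized).
overall = {"train": (0.991944, 0.906901), "test": (0.752272, 0.816485)}
per_class_test = {
    "AU": (0.003988, 0.007164), "CA": (0.034762, 0.035858),
    "DE": (0.026010, 0.023998), "ES": (0.051628, 0.106723),
    "FR": (0.113896, 0.375889), "GB": (0.050081, 0.111050),
    "IT": (0.066178, 0.166975), "NDF": (0.891009, 0.965516),
    "NL": (0.022114, 0.013758), "PT": (0.000000, 0.000000),
    "US": (0.742001, 0.742169), "other": (0.221323, 0.474328),
}

# Gap between training and testing scores, a rough indicator of overfitting.
gap_ootb = overall["train"][0] - overall["test"][0]
gap_opt = overall["train"][1] - overall["test"][1]
print("Train/test gap: out-of-the-box {:.6f}, optimized {:.6f}".format(gap_ootb, gap_opt))

# Per-class change on the testing dataset (optimized minus out-of-the-box).
for country, (ootb, opt) in per_class_test.items():
    print("{:>5}: {:+.6f}".format(country, opt - ootb))
```

The optimized classifier narrows the train/test gap from about 0.24 to about 0.09, and only DE and NL score lower than with the out-of-the-box classifier.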