# Multi-Layer Perceptron Classifiers
---

## Introduction

The fifth family of models we will try on this problem is the **multi-layer perceptron classifier**.

As always, the first step is to load the packages we need:

In [1]:
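# Activate 'airbnb' environment.
# Note: "!" runs the command in a subshell, so it does not change the
# environment of the running kernel; start Jupyter from the activated environment instead.
!source activate airbnb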

In [2]:
# Needed packages:
import numpy as np
import pandas as pd
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import RandomizedSearchCV
import joblib  # 'sklearn.externals.joblib' was removed in scikit-learn 0.23
from utils import (create_training_testing_datasets,
                   calculate_dcg,
                   calculate_ndcg,
                   clf_prediction,
                   ndcg_mean_score_calc,
                   detailed_ndcg_mean_score_calc)



---

## Create training and testing datasets

In [3]:
# Load the data:
consolidated_dataset = pd.read_csv("../data/consolidated_dataset.csv")

# Check basic info:
print("*** Some basic info:")
print("'consolidated_dataset' has {} data points with {} variables each.".format(*consolidated_dataset.shape))
print("'consolidated_dataset' counts {} missing values.".format(consolidated_dataset.isnull().sum().sum()))

# Take a look at the first lines:
print("\n*** First lines:")
display(consolidated_dataset.head())

*** Some basic info:
'consolidated_dataset' has 213451 data points with 161 variables each.
'consolidated_dataset' counts 0 missing values.

*** First lines:


| Unnamed: 0 | age | country_destination | nans | day_account_created | weekday_account_created | week_account_created | month_account_created | year_account_created | day_first_active | weekday_first_active | ... | first_browser_SeaMonkey | first_browser_Silk | first_browser_SiteKiosk | first_browser_SlimBrowser | first_browser_Sogou Explorer | first_browser_Stainless | first_browser_TenFourFox | first_browser_TheWorld Browser | first_browser_Yandex.Browser | first_browser_wOSBrowser |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1.0 | NDF | 1.225078 | 28 | 0 | 26 | 6 | 2010 | 19 | 3 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 38.0 | NDF | -0.453135 | 25 | 2 | 21 | 5 | 2011 | 23 | 5 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 56.0 | US | -0.453135 | 28 | 1 | 39 | 9 | 2010 | 9 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 42.0 | other | -0.453135 | 5 | 0 | 49 | 12 | 2011 | 31 | 5 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 41.0 | US | 0.385972 | 14 | 1 | 37 | 9 | 2010 | 8 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


In [4]:
# Create training and testing datasets:
X_train, X_test, y_train, y_test, encoding_dict = create_training_testing_datasets(consolidated_dataset)
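
`create_training_testing_datasets` comes from the project's `utils` module, which is not shown here. As a rough sketch only, and assuming the destination labels are encoded as integers in sorted order (which would be consistent with the class order printed later in this notebook), the `encoding_dict` it returns could be built as follows. The name `build_encoding_dict` is hypothetical:

```python
def build_encoding_dict(labels):
    """Map each distinct label to an integer, in sorted order (assumed convention)."""
    return {label: code for code, label in enumerate(sorted(set(labels)))}


# The twelve destination labels that appear in this notebook's outputs.
countries = ["NDF", "US", "other", "FR", "IT", "GB",
             "ES", "CA", "DE", "NL", "AU", "PT"]

encoding_dict = build_encoding_dict(countries)

# Reverse mapping, from integer code back to label.
decoding_dict = {code: label for label, code in encoding_dict.items()}

print(encoding_dict)
```

Python sorts uppercase letters before lowercase ones, so `other` ends up with the highest code under this assumption.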

---

## Calculate Normalized DCG scores
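
The scoring helpers (`calculate_dcg`, `calculate_ndcg`, `ndcg_mean_score_calc`) are imported from `utils` and their code is not shown in this notebook. The sketch below is therefore an assumption about what they compute: a standard nDCG with a single relevant label per user and a cutoff of `k=5`. The function names and the cutoff are illustrative, not taken from `utils`:

```python
import math


def dcg_at_k(relevances, k=5):
    """Discounted cumulative gain over the first k relevance values."""
    return sum((2 ** rel - 1) / math.log2(pos + 1)
               for pos, rel in enumerate(relevances[:k], start=1))


def ndcg_at_k(true_label, ranked_labels, k=5):
    """nDCG for one sample: relevance is 1 for the true label, 0 otherwise."""
    relevances = [1 if label == true_label else 0 for label in ranked_labels]
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def mean_ndcg(true_labels, rankings, k=5):
    """Average the per-sample nDCG over a dataset."""
    scores = [ndcg_at_k(t, r, k) for t, r in zip(true_labels, rankings)]
    return sum(scores) / len(scores)


# Two users who receive the same ranked list but booked different destinations.
ranking = ["US", "NDF", "other", "FR", "IT"]
print(mean_ndcg(["US", "other"], [ranking, ranking]))
```

With one relevant label, the score of a sample depends only on the position of the true destination in the ranked list: 1 at the top, decreasing with rank, and 0 beyond the cutoff.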

### "Out-of-the-box" multi-layer perceptron classifier

In [5]:
# Initialize the classifier:
mlp_clf = MLPClassifier(random_state=42)

# Train the classifier:
print("Time info about classifier training:")
%time ootb_mlp_clf = mlp_clf.fit(X_train, y_train)

Time info about classifier training:
CPU times: user 20min 38s, sys: 25.2 s, total: 21min 4s
Wall time: 10min 38s


In [6]:
# Perform one prediction to check classifier:
print("Time info about classifier prediction:")
%time preds_list = clf_prediction(ootb_mlp_clf, X_train[0])
print("***")

# Reverse encoding dictionary:
decoding_dict = dict(map(reversed, encoding_dict.items()))

# Print result:
print("For the classifier check:")
print("- Real first booking destination country: {}".format(decoding_dict[y_train[0]]))
print("- Predictions list: {}".format([decoding_dict[x] for x in preds_list]))

Time info about classifier prediction:
CPU times: user 71.6 ms, sys: 6.81 ms, total: 78.4 ms
Wall time: 179 ms
***
For the classifier check:
- Real first booking destination country: US
- Predictions list: ['US', 'NDF', 'other', 'FR', 'IT', 'ES', 'GB', 'CA', 'NL', 'DE', 'AU', 'PT']
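
The predictions list above orders the destinations from most to least likely. `clf_prediction` is defined in `utils` and not shown, so the following is only a sketch of one way such a list could be obtained: sort the class labels by decreasing predicted probability. The helper name `rank_classes` is hypothetical:

```python
def rank_classes(probabilities, classes, k=None):
    """Return class labels ordered by decreasing probability (ties keep input order)."""
    order = sorted(range(len(probabilities)),
                   key=lambda i: probabilities[i],
                   reverse=True)
    ranked = [classes[i] for i in order]
    return ranked if k is None else ranked[:k]


# Toy probabilities for four encoded classes (the codes are illustrative).
probabilities = [0.05, 0.60, 0.25, 0.10]
classes = [4, 7, 10, 11]

ranked = rank_classes(probabilities, classes)
print(ranked)

# With a fitted scikit-learn classifier, the inputs would typically be
#   clf.predict_proba(x.reshape(1, -1))[0]  and  clf.classes_
# (assuming x is a one-dimensional NumPy array holding a single sample).
```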


In [7]:
# Calculate nDCG mean score on training dataset:
print("Time info about nDCG mean score calculation on training dataset:")
%time ndcg_mean_score = ndcg_mean_score_calc(ootb_mlp_clf, X_train, y_train)
print("***")
print("On training dataset, classifier nDCG mean score is {:.6f}.".format(ndcg_mean_score))

Time info about nDCG mean score calculation on training dataset:
CPU times: user 1min 23s, sys: 1.25 s, total: 1min 24s
Wall time: 42.2 s
***
On training dataset, classifier nDCG mean score is 0.825010.


In [8]:
# Calculate nDCG mean score on testing dataset:
print("Time info about nDCG mean score calculation on testing dataset:")
%time ndcg_mean_score = ndcg_mean_score_calc(ootb_mlp_clf, X_test, y_test)
print("***")
print("On testing dataset, classifier nDCG mean score is {:.6f}.".format(ndcg_mean_score))

Time info about nDCG mean score calculation on testing dataset:
CPU times: user 21 s, sys: 332 ms, total: 21.4 s
Wall time: 10.7 s
***
On testing dataset, classifier nDCG mean score is 0.823896.


In [9]:
# Calculate nDCG mean score for each class on testing dataset:
print("Time info about nDCG mean score calculation for each class on testing dataset:")
%time ndcg_mean_scores_list = detailed_ndcg_mean_score_calc(ootb_mlp_clf, X_test, y_test, encoding_dict)
print("***")
print("Detailed results for each class on testing dataset:")
for country_dest in range(12):
    print("nDCG mean score for {}: {:.6f}".format(decoding_dict[country_dest],
                                                  ndcg_mean_scores_list[country_dest]))

Time info about nDCG mean score calculation for each class on testing dataset:
CPU times: user 20.7 s, sys: 308 ms, total: 21 s
Wall time: 10.5 s
***
Detailed results for each class on testing dataset:
nDCG mean score for AU: 0.000000
nDCG mean score for CA: 0.000000
nDCG mean score for DE: 0.000000
nDCG mean score for ES: 0.000000
nDCG mean score for FR: 0.430677
nDCG mean score for GB: 0.029950
nDCG mean score for IT: 0.358879
nDCG mean score for NDF: 0.949682
nDCG mean score for NL: 0.000000
nDCG mean score for PT: 0.000000
nDCG mean score for US: 0.790185
nDCG mean score for other: 0.500000
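
Some of these per-class values are easy to interpret if we assume a single relevant label per sample and a cutoff at rank 5: a class that is always ranked third would score exactly 1/log2(4) = 0.5, and a class that is always ranked fourth would score 1/log2(5) ≈ 0.430677. The sketch below illustrates this with a toy dataset in which every sample receives the same ranking. It is not the code of `detailed_ndcg_mean_score_calc`, which is not shown; all names are illustrative:

```python
import math
from collections import defaultdict


def ndcg_single(true_label, ranked_labels, k=5):
    """nDCG of one sample with a single relevant label and cutoff k."""
    for pos, label in enumerate(ranked_labels[:k], start=1):
        if label == true_label:
            return 1.0 / math.log2(pos + 1)
    return 0.0


def per_class_mean_ndcg(true_labels, rankings, k=5):
    """Average nDCG separately for each true class."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for true_label, ranked in zip(true_labels, rankings):
        totals[true_label] += ndcg_single(true_label, ranked, k)
        counts[true_label] += 1
    return {label: totals[label] / counts[label] for label in counts}


# Toy data: the same ranking for every sample, one sample per true class.
fixed_ranking = ["NDF", "US", "other", "FR", "IT", "PT"]
true_labels = ["NDF", "US", "other", "FR", "PT"]
rankings = [fixed_ranking] * len(true_labels)

per_class = per_class_mean_ndcg(true_labels, rankings)
for label, score in per_class.items():
    print("{}: {:.6f}".format(label, score))
```

Under these assumptions, a per-class score of 0 means the class never appears within the first five positions when it is the true destination.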


In [10]:
# Save model:
joblib.dump(ootb_mlp_clf, "../models/ootb_mlp_clf.pkl")

['../models/ootb_mlp_clf.pkl']

### "Optimized" multi-layer perceptron classifier

In [11]:
# Set parameters of the random grid:
# Note the trailing comma: (100) is the integer 100, (100,) is a one-element tuple.
hidden_layer_sizes = [(100,), (110, 60), (80, 40, 20), (120, 60, 30)]
activation = ['identity', 'logistic', 'tanh', 'relu']
solver = ['adam']
learning_rate_init = [0.1, 0.01, 0.001, 0.0001, 0.00001]
max_iter = [200]
random_state = [42]
early_stopping = [True]
validation_fraction = [0.2]
n_iter_no_change = [10]

# Create the random grid:
random_grid = {'hidden_layer_sizes': hidden_layer_sizes,
               'activation': activation,
               'solver': solver,
               'learning_rate_init': learning_rate_init,
               'max_iter': max_iter,
               'random_state': random_state,
               'early_stopping': early_stopping,
               'validation_fraction': validation_fraction,
               'n_iter_no_change': n_iter_no_change}

# Perform randomized search cross validation:

mlp_clf = MLPClassifier()

mlp_clf_random = RandomizedSearchCV(estimator=mlp_clf,
                                    param_distributions=random_grid,
                                    n_iter=20,
                                    n_jobs=-1,
                                    random_state=42,
                                    verbose=0)

print("Time info about randomized search cross validation:")
%time mlp_clf_fit = mlp_clf_random.fit(X_train, y_train)

# Get the "best" classifier:
opt_mlp_clf = mlp_clf_fit.best_estimator_

Time info about randomized search cross validation:
CPU times: user 3min 4s, sys: 8.34 s, total: 3min 13s
Wall time: 20min 28s
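
To put `n_iter=20` in perspective, we can count how many distinct parameter combinations the grid contains. The snippet below re-declares the grid from the cell above (with the single-layer size written as a tuple) so that it runs on its own:

```python
import math

random_grid = {
    "hidden_layer_sizes": [(100,), (110, 60), (80, 40, 20), (120, 60, 30)],
    "activation": ["identity", "logistic", "tanh", "relu"],
    "solver": ["adam"],
    "learning_rate_init": [0.1, 0.01, 0.001, 0.0001, 0.00001],
    "max_iter": [200],
    "random_state": [42],
    "early_stopping": [True],
    "validation_fraction": [0.2],
    "n_iter_no_change": [10],
}

# Number of distinct combinations: product of the list lengths.
n_combinations = math.prod(len(values) for values in random_grid.values())

n_iter = 20
coverage = n_iter / n_combinations

print("{} combinations, {:.0%} of them sampled".format(n_combinations, coverage))
```

Since every parameter is given as a list, `RandomizedSearchCV` samples candidates without replacement, and each sampled candidate is fitted once per cross-validation fold. Note also that the search above does not set `scoring`, so candidates are compared with the estimator's default score (accuracy for classifiers) rather than with nDCG.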


In [12]:
# Parameters of the "best" classifier:
print("*** Parameters of the 'best' classifier:")
for param, param_value in mlp_clf_fit.best_params_.items():
    print("- Parameter {}: {}".format(param, param_value))

*** Parameters of the 'best' classifier:
- Parameter validation_fraction: 0.2
- Parameter solver: adam
- Parameter random_state: 42
- Parameter n_iter_no_change: 10
- Parameter max_iter: 200
- Parameter learning_rate_init: 0.0001
- Parameter hidden_layer_sizes: (80, 40, 20)
- Parameter early_stopping: True
- Parameter activation: relu


In [13]:
# Perform one prediction to check classifier:
print("Time info about classifier prediction:")
%time preds_list = clf_prediction(opt_mlp_clf, X_train[0])
print("***")

# Reverse encoding dictionary:
decoding_dict = dict(map(reversed, encoding_dict.items()))

# Print result:
print("For the classifier check:")
print("- Real first booking destination country: {}".format(decoding_dict[y_train[0]]))
print("- Predictions list: {}".format([decoding_dict[x] for x in preds_list]))

Time info about classifier prediction:
CPU times: user 1.6 ms, sys: 307 µs, total: 1.91 ms
Wall time: 1.03 ms
***
For the classifier check:
- Real first booking destination country: US
- Predictions list: ['US', 'NDF', 'other', 'FR', 'ES', 'IT', 'GB', 'DE', 'CA', 'NL', 'AU', 'PT']


In [14]:
# Calculate nDCG mean score on training dataset:
print("Time info about nDCG mean score calculation on training dataset:")
%time ndcg_mean_score = ndcg_mean_score_calc(opt_mlp_clf, X_train, y_train)
print("***")
print("On training dataset, classifier nDCG mean score is {:.6f}.".format(ndcg_mean_score))

Time info about nDCG mean score calculation on training dataset:
CPU times: user 1min 21s, sys: 1.36 s, total: 1min 23s
Wall time: 41.6 s
***
On training dataset, classifier nDCG mean score is 0.822968.


In [15]:
# Calculate nDCG mean score on testing dataset:
print("Time info about nDCG mean score calculation on testing dataset:")
%time ndcg_mean_score = ndcg_mean_score_calc(opt_mlp_clf, X_test, y_test)
print("***")
print("On testing dataset, classifier nDCG mean score is {:.6f}.".format(ndcg_mean_score))

Time info about nDCG mean score calculation on testing dataset:
CPU times: user 21 s, sys: 454 ms, total: 21.5 s
Wall time: 11 s
***
On testing dataset, classifier nDCG mean score is 0.822270.


In [16]:
# Calculate nDCG mean score for each class on testing dataset:
print("Time info about nDCG mean score calculation for each class on testing dataset:")
%time ndcg_mean_scores_list = detailed_ndcg_mean_score_calc(opt_mlp_clf, X_test, y_test, encoding_dict)
print("***")
print("Detailed results for each class on testing dataset:")
for country_dest in range(12):
    print("nDCG mean score for {}: {:.6f}".format(decoding_dict[country_dest],
                                                  ndcg_mean_scores_list[country_dest]))

Time info about nDCG mean score calculation for each class on testing dataset:
CPU times: user 20.4 s, sys: 337 ms, total: 20.7 s
Wall time: 10.4 s
***
Detailed results for each class on testing dataset:
nDCG mean score for AU: 0.000000
nDCG mean score for CA: 0.000000
nDCG mean score for DE: 0.000000
nDCG mean score for ES: 0.413072
nDCG mean score for FR: 0.411407
nDCG mean score for GB: 0.000832
nDCG mean score for IT: 0.010234
nDCG mean score for NDF: 0.935384
nDCG mean score for NL: 0.000000
nDCG mean score for PT: 0.000000
nDCG mean score for US: 0.818191
nDCG mean score for other: 0.491116


In [17]:
# Save model:
joblib.dump(opt_mlp_clf, "../models/opt_mlp_clf.pkl")

['../models/opt_mlp_clf.pkl']

---

## Conclusion

As a quick conclusion, the results obtained with the "optimized" multi-layer perceptron classifier show four main points:
* On the testing dataset, it gets a better overall nDCG mean score than the naive model.
* On the testing dataset, it gets better nDCG mean scores than the naive model for Spain, Great Britain and the USA.
* On the testing dataset, it gets worse nDCG mean scores than the naive model for France, Italy, no destination found (NDF) and other.
* On the testing dataset, it is as "bad" as the naive model for Australia, Canada, Germany, the Netherlands and Portugal.