# Corrected Voting Classifier
---

## Introduction

Here, we are going to tackle the current problem using a **corrected voting classifier**.

This attempt consists of two steps:
* A first step where we use the voting classifier trained previously to predict the first booking destination country for a given sample.
* A second step that aims to correct the predictions made during the first step. For this step, we use a gradient boosting classifier trained on the probabilities predicted by the voting classifier.
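
The two steps above can be sketched on synthetic data as follows. This is only an illustration: the data, the estimators inside the voting classifier, and the names `voting_clf` and `corrector` are assumptions made for the example, not the models used in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier,
                              VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic 3-class data standing in for the real dataset (assumption).
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)

# Step 1: a soft-voting classifier that outputs class probabilities.
# The two base estimators are placeholders chosen for the example.
voting_clf = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("rf", RandomForestClassifier(n_estimators=20, random_state=42))],
    voting="soft",
)
voting_clf.fit(X_tr, y_tr)

# Step 2: a gradient boosting classifier trained on those probabilities.
probs_tr = voting_clf.predict_proba(X_tr)
corrector = GradientBoostingClassifier(n_estimators=20, random_state=42)
corrector.fit(probs_tr, y_tr)

# Chained prediction: features -> step-1 probabilities -> step-2 probabilities.
probs_te = corrector.predict_proba(voting_clf.predict_proba(X_te))
print(probs_te.shape)
```

The second model never sees the original features; its only inputs are the class probabilities produced by the first one.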

As always, the prerequisite step consists of loading the appropriate packages to perform our work:

In [1]:
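# Activate 'airbnb' environment.
# Note: `!` runs the command in a subshell, so this does not change the
# environment of the running kernel; start the notebook from 'airbnb' instead.
!source activate airbnb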

In [2]:
# Needed packages:
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
import joblib  # `sklearn.externals.joblib` was removed in scikit-learn 0.23
from utils import (create_training_testing_datasets,
                   calculate_dcg,
                   calculate_ndcg,
                   clf_prediction,
                   ndcg_mean_score_calc,
                   detailed_ndcg_mean_score_calc)



---

## Create training and testing datasets

In [3]:
# Load the data:
consolidated_dataset = pd.read_csv("../data/consolidated_dataset.csv")

# Check basic info:
print("*** Some basic info:")
print("'consolidated_dataset' has {} data points with {} variables each.".format(*consolidated_dataset.shape))
print("'consolidated_dataset' counts {} missing values.".format(consolidated_dataset.isnull().sum().sum()))

# Take a look at the first lines:
print("\n*** First lines:")
display(consolidated_dataset.head())

*** Some basic info:
'consolidated_dataset' has 213451 data points with 161 variables each.
'consolidated_dataset' counts 0 missing values.

*** First lines:


| Unnamed: 0 | age | country_destination | nans | day_account_created | weekday_account_created | week_account_created | month_account_created | year_account_created | day_first_active | weekday_first_active | ... | first_browser_SeaMonkey | first_browser_Silk | first_browser_SiteKiosk | first_browser_SlimBrowser | first_browser_Sogou Explorer | first_browser_Stainless | first_browser_TenFourFox | first_browser_TheWorld Browser | first_browser_Yandex.Browser | first_browser_wOSBrowser |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1.0 | NDF | 1.225078 | 28 | 0 | 26 | 6 | 2010 | 19 | 3 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 38.0 | NDF | -0.453135 | 25 | 2 | 21 | 5 | 2011 | 23 | 5 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 56.0 | US | -0.453135 | 28 | 1 | 39 | 9 | 2010 | 9 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 42.0 | other | -0.453135 | 5 | 0 | 49 | 12 | 2011 | 31 | 5 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 41.0 | US | 0.385972 | 14 | 1 | 37 | 9 | 2010 | 8 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


In [4]:
# Create training and testing datasets:
X_train, X_test, y_train, y_test, encoding_dict = create_training_testing_datasets(consolidated_dataset)

---

## Train a gradient boosting classifier on the predictions of the previous voting classifier

In [5]:
# Load previous voting classifier:
ootb_voting_clf = joblib.load("../models/ootb_voting_clf.pkl")

In [6]:
# Perform one prediction to check classifier:
print("Time info about classifier prediction:")
%time preds_list = clf_prediction(ootb_voting_clf, X_train[0])
print("***")

# Reverse encoding dictionary:
decoding_dict = dict(map(reversed, encoding_dict.items()))

# Print result:
print("For the classifier check:")
print("- Real first booking destination country: {}".format(decoding_dict[y_train[0]]))
print("- Predictions list: {}".format([decoding_dict[x] for x in preds_list]))

Time info about classifier prediction:
CPU times: user 34.3 ms, sys: 22.1 ms, total: 56.4 ms
Wall time: 568 ms
***
For the classifier check:
- Real first booking destination country: US
- Predictions list: ['US', 'NDF', 'other', 'FR', 'ES', 'IT', 'GB', 'CA', 'DE', 'NL', 'AU', 'PT']
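
The decoding step in the cell above simply inverts `encoding_dict`. The helper that builds this dictionary is not shown here, so the sketch below assumes it numbers the country labels in sorted order, which is consistent with the class order printed in the per-class results of this notebook. The `labels` list is a hypothetical reconstruction.

```python
# Hypothetical reconstruction of encoding_dict: labels sorted, then numbered.
labels = ["NDF", "US", "other", "FR", "IT", "GB", "ES", "CA", "DE", "NL", "AU", "PT"]
encoding_dict = {label: code for code, label in enumerate(sorted(labels))}

# Invert the mapping; equivalent to dict(map(reversed, encoding_dict.items())).
decoding_dict = {code: label for label, code in encoding_dict.items()}

print(encoding_dict["US"])   # 10
print(decoding_dict[0])      # AU
```

Uppercase letters sort before lowercase ones in Python, which is why `other` ends up last under this assumption.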


In [7]:
# Generate training dataset for new gradient boosting classifier.
# `predict_proba` accepts the whole 2-D array, so a single batched call
# gives the same probabilities as predicting row by row:
X_pred_probs_train = np.asarray(ootb_voting_clf.predict_proba(X_train))
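
The cell above computes probabilities for rows that the voting classifier has already seen during its own training, so these inputs may look more confident than the ones obtained later on unseen data. A common alternative, which is not what this notebook does, is to feed the second classifier out-of-fold probabilities. The sketch below shows the idea with scikit-learn's `cross_val_predict` on synthetic data; `base_clf` is a simple stand-in for the voting classifier and every name in it is an assumption.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Synthetic data and a simple stand-in for the voting classifier (assumptions).
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=42)
base_clf = LogisticRegression(max_iter=1000)

# Each row's probabilities come from a model fitted on the other folds only.
oof_probs = cross_val_predict(base_clf, X, y, cv=5, method="predict_proba")

print(oof_probs.shape)
```

With this approach the first-level model is refitted once per fold, so it costs roughly `cv` times the original training time.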

In [8]:
# Initialize the new gradient boosting classifier:
gb_clf = GradientBoostingClassifier(random_state=42)

# Train the classifier:
print("Time info about classifier training:")
%time corr_gb_clf = gb_clf.fit(X_pred_probs_train, y_train)

Time info about classifier training:
CPU times: user 12min 37s, sys: 32.7 s, total: 13min 9s
Wall time: 8min 31s


In [9]:
# Check mean accuracy on training dataset:
mean_accuracy = corr_gb_clf.score(X_pred_probs_train, y_train)
print("The mean accuracy on training dataset of 'corr_gb_clf' is: {:.6f}".format(mean_accuracy))

The mean accuracy on training dataset of 'corr_gb_clf' is: 0.885037


In [10]:
# Save model:
joblib.dump(corr_gb_clf, "../models/corr_gb_clf.pkl")

['../models/corr_gb_clf.pkl']

---

## Build a corrected voting classifier

In [11]:
# Build prediction mechanism of corrected voting classifier:

def corr_voting_clf_prediction(clf_1, clf_2, sample_data):
    """ Perform predictions with the corrected voting classifier (clf_1, then clf_2) """
    
    # Perform predictions with first level classifier:
    pred_probs_1 = clf_1.predict_proba(sample_data.reshape(1, -1)).tolist()[0]
    pred_probs_1 = np.array(pred_probs_1)
    
    # Perform predictions with second level classifier:
    pred_probs_2 = clf_2.predict_proba(pred_probs_1.reshape(1, -1)).tolist()[0]
    
    # Build predictive ordered list of first booking destination country:
    preds_list = [x[1] for x in sorted(zip(pred_probs_2, range(12)), reverse=True) if x[0] != 0.]
    
    # Return result:
    return preds_list
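
The ordering step at the end of this function can also be written with NumPy. The sketch below is an alternative formulation, not the notebook's code, and `rank_classes` is a hypothetical name. One difference to keep in mind: on exact ties, the `sorted(zip(...), reverse=True)` version puts the higher class index first, while the stable `argsort` below puts the lower index first.

```python
import numpy as np

def rank_classes(probs):
    """Return class indices ordered by decreasing probability, dropping zeros."""
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(-probs, kind="stable")  # ties keep the lower index first
    return [int(i) for i in order if probs[i] != 0.0]

print(rank_classes([0.1, 0.0, 0.6, 0.3]))  # [2, 3, 0]
```

Because the number of classes is taken from the length of `probs`, this version does not need the hard-coded `range(12)`.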

In [12]:
# Perform one prediction to check classifier:
print("Time info about classifier prediction:")
%time preds_list = corr_voting_clf_prediction(ootb_voting_clf, corr_gb_clf, X_train[0])
print("***")

# Print result:
print("For the classifier check:")
print("- Real first booking destination country: {}".format(decoding_dict[y_train[0]]))
print("- Predictions list: {}".format([decoding_dict[x] for x in preds_list]))

Time info about classifier prediction:
CPU times: user 14.1 ms, sys: 1.99 ms, total: 16.1 ms
Wall time: 13.2 ms
***
For the classifier check:
- Real first booking destination country: US
- Predictions list: ['US', 'other', 'NDF', 'FR', 'IT', 'GB', 'ES', 'CA', 'DE', 'AU', 'PT', 'NL']


In [13]:
# Adapt nDCG mean score calculators:

def ndcg_mean_score_calc_2(clf_1, clf_2, X, y):
    """ Calculate nDCG mean score on a labeled dataset """
    
    # Set nDCG scores list:
    ndcg_scores_list = []
    
    # Loop on labeled dataset:
    for i in range(len(y)):
        ndcg_score = calculate_ndcg(corr_voting_clf_prediction(clf_1, clf_2, X[i]), y[i])
        ndcg_scores_list.append(ndcg_score)
        
    # Determine nDCG mean score:
    ndcg_mean_score = np.mean(ndcg_scores_list)
    
    # Return result:
    return ndcg_mean_score

def detailed_ndcg_mean_score_calc_2(clf_1, clf_2, X, y, encoding_dict):
    """ Calculate nDCG mean score on a labeled dataset for each class """
    
    # Reverse encoding dictionary:
    decoding_dict = dict(map(reversed, encoding_dict.items()))
    
    # Set nDCG scores objects:
    ndcg_scores_dict = {country_dest: [] for country_dest in range(12)}
    ndcg_mean_scores_list = []
    
    # Loop on labeled dataset:
    for i in range(len(y)):
        ndcg_score = calculate_ndcg(corr_voting_clf_prediction(clf_1, clf_2, X[i]), y[i])
        ndcg_scores_dict[y[i]].append(ndcg_score)
        
    # Loop on country destinations:
    for country_dest in range(12):
        ndcg_mean_scores_list.append(np.mean(ndcg_scores_dict[country_dest]))
        
    # Return result:
    return ndcg_mean_scores_list
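
Both functions rely on `calculate_ndcg` from `utils`, whose implementation is not shown here. As a rough guide to what the score means, the sketch below assumes the usual nDCG@5 definition with a single relevant label per sample; `ndcg_at_k` is a hypothetical name and may differ from the actual helper.

```python
import math

def ndcg_at_k(preds_list, true_label, k=5):
    """nDCG@k when exactly one label is relevant (the ideal DCG is then 1)."""
    for position, label in enumerate(preds_list[:k], start=1):
        if label == true_label:
            # Gain of 1 discounted by the log of the position.
            return 1.0 / math.log2(position + 1)
    # True label not among the first k predictions.
    return 0.0

print(ndcg_at_k([10, 7, 11, 4, 3], 10))            # 1.0
print(round(ndcg_at_k([10, 7, 11, 4, 3], 7), 4))   # 0.6309
print(ndcg_at_k([10, 7, 11, 4, 3], 9))             # 0.0
```

Under this definition, a score of 1 means the true country is ranked first, and a score of 0 means it does not appear among the first `k` predictions.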

In [14]:
# Calculate nDCG mean score on training dataset:
print("Time info about nDCG mean score calculation on training dataset:")
%time ndcg_mean_score = ndcg_mean_score_calc_2(ootb_voting_clf, corr_gb_clf, X_train, y_train)
print("***")
print("On training dataset, classifier nDCG mean score is {:.6f}.".format(ndcg_mean_score))

Time info about nDCG mean score calculation on training dataset:
CPU times: user 50min 44s, sys: 53.1 s, total: 51min 37s
Wall time: 25min 50s
***
On training dataset, classifier nDCG mean score is 0.951813.


In [15]:
# Calculate nDCG mean score on testing dataset:
print("Time info about nDCG mean score calculation on testing dataset:")
%time ndcg_mean_score = ndcg_mean_score_calc_2(ootb_voting_clf, corr_gb_clf, X_test, y_test)
print("***")
print("On testing dataset, classifier nDCG mean score is {:.6f}.".format(ndcg_mean_score))

Time info about nDCG mean score calculation on testing dataset:
CPU times: user 12min 42s, sys: 13.5 s, total: 12min 55s
Wall time: 6min 28s
***
On testing dataset, classifier nDCG mean score is 0.780407.


In [16]:
# Calculate nDCG mean score for each class on testing dataset:
print("Time info about nDCG mean score calculation for each class on testing dataset:")
%time ndcg_mean_scores_list = detailed_ndcg_mean_score_calc_2(ootb_voting_clf, corr_gb_clf, X_test, y_test, encoding_dict)
print("***")
print("Detailed results for each class on testing dataset:")
for country_dest in range(12):
    print("nDCG mean score for {}: {:.6f}".format(decoding_dict[country_dest],
                                                  ndcg_mean_scores_list[country_dest]))

Time info about nDCG mean score calculation for each class on testing dataset:
CPU times: user 12min 44s, sys: 13.5 s, total: 12min 57s
Wall time: 6min 29s
***
Detailed results for each class on testing dataset:
nDCG mean score for AU: 0.059779
nDCG mean score for CA: 0.078116
nDCG mean score for DE: 0.100146
nDCG mean score for ES: 0.231286
nDCG mean score for FR: 0.386274
nDCG mean score for GB: 0.125668
nDCG mean score for IT: 0.136344
nDCG mean score for NDF: 0.887426
nDCG mean score for NL: 0.041941
nDCG mean score for PT: 0.000000
nDCG mean score for US: 0.772363
nDCG mean score for other: 0.441702
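
When reading these per-class scores next to the overall score, it helps to remember that the overall mean is the per-class means weighted by the number of samples in each class, so frequent classes dominate it. The numbers below are hypothetical and only illustrate the arithmetic; they are not taken from this dataset.

```python
import numpy as np

# Hypothetical per-class mean scores and sample counts (not the notebook's data).
class_means = np.array([0.9, 0.8, 0.1])
class_counts = np.array([600, 350, 50])

# Overall mean = sum(mean_c * count_c) / sum(count_c).
overall_mean = np.average(class_means, weights=class_counts)
print(round(float(overall_mean), 3))  # 0.825
```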


---

## Conclusion

As a quick conclusion, comparing the results of this corrected voting classifier with those of the gradient boosting classifier built previously (the one that reached the best results), we can note four main points:
* On the testing dataset, its nDCG mean score is (significantly) worse than the one obtained by the gradient boosting classifier.
* On the testing dataset, its nDCG mean scores for Australia, Canada, Germany, Great Britain and the Netherlands are better than the ones obtained by the gradient boosting classifier.
* On the testing dataset, its nDCG mean scores for Spain, France, Italy, no destination found (NDF), USA and other are worse than the ones obtained by the gradient boosting classifier.
* On the testing dataset, it is as "bad" as the gradient boosting classifier at predicting Portugal.