# Gradient Boosting Classifiers Pipeline
---

## Introduction

For the moment, we have not built models which reaches significantly better results than the ones obtained by the naive prediction model we use for benchmarking.

Furthermore, the results we obtained for first booking destination countries whose presence in the dataset is significantly low (all the ones except `NDF`, `USA` and `other`) are very bad.

To try to handle successfully this situation, we are going to build a **gradient boosting classifiers pipeline**, and decompose its action into 4 steps before providing predictions when a given sample data is proposed:
* A first step where a first gradient boosting classifier will have the task to determine exclusively if the fisrt booking destination country is `NDF`.
* If the prevision provided by this first gradient boosting classifier is different from `NDF`, then, a second step will be performed, where a second gradient boosting classifier will have the task to determine exclusively if the first booking destination country is `USA`.
* If the prevision provided by this second gradient boosting classifier is different from `USA`, then, a third step will be performed, where a third gradient boosting classifier will have the task to determine exclusively if the first booking destination country is `other`.
* Lastly, if the prevision provided by this third gradient boosting classifier is different from `other`, then, a fourth step will be performed, where a fourth gradient boosting classifier will have the task to determine the first booking destination country among the remaining possibilities for a first booking destination country (`FR`, `IT`, `GB`, `ES`, `CA`, `DE`, `NL`, `AU` and `PT`).

*Nota Bene:* We choose to exploit gradient boosting classifiers because they are the ones which have appeared to handle the best the data used in the current project.

As always, the prerequisite step consists on loading the appropriate packages to perform our work:

In [1]:
# Activate 'airbnb' environment:
!source activate airbnb

In [2]:
# Needed packages:
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.externals import joblib
from utils import (create_training_testing_datasets,
                   calculate_dcg,
                   calculate_ndcg,
                   clf_prediction,
                   ndcg_mean_score_calc,
                   detailed_ndcg_mean_score_calc)



---

## Construct a gradient boosting classifiers pipeline

### Create global training and testing datasets

In [3]:
# Load the data:
consolidated_dataset = pd.read_csv("../data/consolidated_dataset.csv")

# Check basic info:
print("*** Some basic info:")
print("'consolidated_dataset' has {} data points with {} variables each.".format(*consolidated_dataset.shape))
print("'consolidated_dataset' counts {} missing values.".format(consolidated_dataset.isnull().sum().sum()))

# Give a look to the first lines:
print("\n*** First lines:")
display(consolidated_dataset.head())

*** Some basic info:
'consolidated_dataset' has 213451 data points with 161 variables each.
'consolidated_dataset' counts 0 missing values.

*** First lines:


Unnamed: 0,age,country_destination,nans,day_account_created,weekday_account_created,week_account_created,month_account_created,year_account_created,day_first_active,weekday_first_active,...,first_browser_SeaMonkey,first_browser_Silk,first_browser_SiteKiosk,first_browser_SlimBrowser,first_browser_Sogou Explorer,first_browser_Stainless,first_browser_TenFourFox,first_browser_TheWorld Browser,first_browser_Yandex.Browser,first_browser_wOSBrowser
0,-1.0,NDF,1.225078,28,0,26,6,2010,19,3,...,0,0,0,0,0,0,0,0,0,0
1,38.0,NDF,-0.453135,25,2,21,5,2011,23,5,...,0,0,0,0,0,0,0,0,0,0
2,56.0,US,-0.453135,28,1,39,9,2010,9,1,...,0,0,0,0,0,0,0,0,0,0
3,42.0,other,-0.453135,5,0,49,12,2011,31,5,...,0,0,0,0,0,0,0,0,0,0
4,41.0,US,0.385972,14,1,37,9,2010,8,1,...,0,0,0,0,0,0,0,0,0,0


In [4]:
# Create training and testing datasets:
X_train, X_test, y_train, y_test, encoding_dict = create_training_testing_datasets(consolidated_dataset)

### Create a training dataset for each gradient boosting classifier

In [5]:
# Parameters of the global encoding dictionary:
print("*** Encoding dictionary:")
for country, country_code in encoding_dict.items():
    print("- Country {}: Code {}".format(country, country_code))

*** Encoding dictionary:
- Country AU: Code 0
- Country CA: Code 1
- Country DE: Code 2
- Country ES: Code 3
- Country FR: Code 4
- Country GB: Code 5
- Country IT: Code 6
- Country NDF: Code 7
- Country NL: Code 8
- Country PT: Code 9
- Country US: Code 10
- Country other: Code 11


In [6]:
# Training dataset for first gradient boosting classifier:
X_train_1 = np.copy(X_train)
y_train_1 = np.copy(y_train)
y_train_1 = np.where(y_train_1==7, 0, 1)

# Training dataset for second gradient boosting classifier:
X_train_2 = np.copy(X_train)
y_train_2 = np.copy(y_train)
idc_to_rm = np.argwhere(y_train_2==7)
X_train_2 = np.delete(X_train_2, idc_to_rm, axis=0)
y_train_2 = np.delete(y_train_2, idc_to_rm)
y_train_2 = np.where(y_train_2==10, 0, 1)

# Training dataset for third gradient boosting classifier:
X_train_3 = np.copy(X_train)
y_train_3 = np.copy(y_train)
idc_to_rm = np.argwhere(y_train_3==7)
X_train_3 = np.delete(X_train_3, idc_to_rm, axis=0)
y_train_3 = np.delete(y_train_3, idc_to_rm)
idc_to_rm = np.argwhere(y_train_3==10)
X_train_3 = np.delete(X_train_3, idc_to_rm, axis=0)
y_train_3 = np.delete(y_train_3, idc_to_rm)
y_train_3 = np.where(y_train_3==11, 0, 1)

# Training dataset for fourth gradient boosting classifier:
X_train_4 = np.copy(X_train)
y_train_4 = np.copy(y_train)
idc_to_rm = np.argwhere(y_train_4==7)
X_train_4 = np.delete(X_train_4, idc_to_rm, axis=0)
y_train_4 = np.delete(y_train_4, idc_to_rm)
idc_to_rm = np.argwhere(y_train_4==10)
X_train_4 = np.delete(X_train_4, idc_to_rm, axis=0)
y_train_4 = np.delete(y_train_4, idc_to_rm)
idc_to_rm = np.argwhere(y_train_4==11)
X_train_4 = np.delete(X_train_4, idc_to_rm, axis=0)
y_train_4 = np.delete(y_train_4, idc_to_rm)
y_train_4 = np.where(y_train_4==8, 7, y_train_4)
y_train_4 = np.where(y_train_4==9, 8, y_train_4)

### Train first gradient boosting classifier

In [7]:
# Initialize gradient boosting classifier:
gb_clf = GradientBoostingClassifier(random_state=42)

# Train the classifier:
print("Time info about classifier training:")
%time gb_clf_1 = gb_clf.fit(X_train_1, y_train_1)

Time info about classifier training:
CPU times: user 2min 58s, sys: 1.54 s, total: 2min 59s
Wall time: 2min 41s


In [8]:
# Check mean accuracy on training dataset:
mean_accuracy = gb_clf_1.score(X_train_1, y_train_1)
print("The mean accuracy on training dataset of 'gb_clf_1' is: {:.6f}".format(mean_accuracy))

The mean accuracy on training dataset of 'gb_clf_1' is: 0.697980


In [9]:
# Save model:
joblib.dump(gb_clf_1, "../models/gb_clf_1.pkl")

['../models/gb_clf_1.pkl']

### Train second gradient boosting classifier

In [10]:
# Initialize gradient boosting classifier:
gb_clf = GradientBoostingClassifier(random_state=42)

# Train the classifier:
print("Time info about classifier training:")
%time gb_clf_2 = gb_clf.fit(X_train_2, y_train_2)

Time info about classifier training:
CPU times: user 51.5 s, sys: 289 ms, total: 51.7 s
Wall time: 51.5 s


In [11]:
# Check mean accuracy on training dataset:
mean_accuracy = gb_clf_2.score(X_train_2, y_train_2)
print("The mean accuracy on training dataset of 'gb_clf_2' is: {:.6f}".format(mean_accuracy))

The mean accuracy on training dataset of 'gb_clf_2' is: 0.703231


In [12]:
# Save model:
joblib.dump(gb_clf_2, "../models/gb_clf_2.pkl")

['../models/gb_clf_2.pkl']

### Train third gradient boosting classifier

In [13]:
# Initialize gradient boosting classifier:
gb_clf = GradientBoostingClassifier(random_state=42)

# Train the classifier:
print("Time info about classifier training:")
%time gb_clf_3 = gb_clf.fit(X_train_3, y_train_3)

Time info about classifier training:
CPU times: user 12.5 s, sys: 66.3 ms, total: 12.5 s
Wall time: 12.5 s


In [14]:
# Check mean accuracy on training dataset:
mean_accuracy = gb_clf_3.score(X_train_3, y_train_3)
print("The mean accuracy on training dataset of 'gb_clf_3' is: {:.6f}".format(mean_accuracy))

The mean accuracy on training dataset of 'gb_clf_3' is: 0.643015


In [15]:
# Save model:
joblib.dump(gb_clf_3, "../models/gb_clf_3.pkl")

['../models/gb_clf_3.pkl']

### Train fourth gradient boosting classifier

In [16]:
# Initialize gradient boosting classifier:
gb_clf = GradientBoostingClassifier(random_state=42)

# Train the classifier:
print("Time info about classifier training:")
%time gb_clf_4 = gb_clf.fit(X_train_4, y_train_4)

Time info about classifier training:
CPU times: user 2min 3s, sys: 2.67 s, total: 2min 5s
Wall time: 1min 3s


In [17]:
# Check mean accuracy on training dataset:
mean_accuracy = gb_clf_4.score(X_train_4, y_train_4)
print("The mean accuracy on training dataset of 'gb_clf_3' is: {:.6f}".format(mean_accuracy))

The mean accuracy on training dataset of 'gb_clf_3' is: 0.352471


In [18]:
# Save model:
joblib.dump(gb_clf_4, "../models/gb_clf_4.pkl")

['../models/gb_clf_4.pkl']

### Gradient boosting classifiers pipeline

In [19]:
# Build prediction mechanism of gradient boosting classifiers pipeline:

def gb_clfs_pipeline_prediction(clf_1, clf_2, clf_3, clf_4, sample_data):
    """ Perform predictions thanks to gradient boosting classifiers pipeline """
    
    # Perform predictions with first classifier:
    pred_probs_1 = clf_1.predict_proba(sample_data.reshape(1, -1)).tolist()[0]
    pred_probs_1_list = [x[1] for x in sorted(zip(pred_probs_1, range(2)), reverse=True)]
    
    # Test results from first classifier:
    if pred_probs_1_list[0] == 0:
        preds_list = [7, 10, 11, 4, 6]
    else:
        
        # Perform predictions with second classifier:
        pred_probs_2 = clf_2.predict_proba(sample_data.reshape(1, -1)).tolist()[0]
        pred_probs_2_list = [x[1] for x in sorted(zip(pred_probs_2, range(2)), reverse=True)]
        
        # Test results from second classifier:
        if pred_probs_2_list[0] == 0:
            preds_list = [10, 7, 11, 4, 6]
        else:
            
            # Perform predictions with third classifier:
            pred_probs_3 = clf_3.predict_proba(sample_data.reshape(1, -1)).tolist()[0]
            pred_probs_3_list = [x[1] for x in sorted(zip(pred_probs_3, range(2)), reverse=True)]
            
            # Test results from third classifier:
            if pred_probs_3_list[0] == 0:
                preds_list = [11, 7, 10, 4, 6]
            else:
                
                # Perform predictions with fourth classifier:
                pred_probs_4 = clf_4.predict_proba(sample_data.reshape(1, -1)).tolist()[0]
                pred_probs_4_list = [x[1] for x in sorted(zip(pred_probs_4, range(9)), reverse=True) if x[0] != 0.]
                preds_list = pred_probs_4_list + [7, 10, 11]
    
    # Return result:
    return preds_list

In [20]:
# Perform one prediction to check classifier:
print("Time info about classifier prediction:")
%time preds_list = gb_clfs_pipeline_prediction(gb_clf_1, gb_clf_2, gb_clf_3, gb_clf_4, X_train[0])
print("***")

# Reverse encoding dictionary:
decoding_dict = dict(map(reversed, encoding_dict.items()))

# Print result:
print("For the classifier check:")
print("- Real first booking destination country: {}".format(decoding_dict[y_train[0]]))
print("- Predictions list: {}".format([decoding_dict[x] for x in preds_list]))

Time info about classifier prediction:
CPU times: user 1.85 ms, sys: 162 µs, total: 2.01 ms
Wall time: 986 µs
***
For the classifier check:
- Real first booking destination country: US
- Predictions list: ['US', 'NDF', 'other', 'FR', 'IT']


---

## Check performances of the gradient boosting classifiers pipeline

In [21]:
# Adapt nDCG mean score calculators:

def ndcg_mean_score_calc_3(clf_1, clf_2, clf_3, clf_4, X, y):
    """ Calculate nDCG mean score on a labeled dataset """
    
    # Set nDCG scores list:
    ndcg_scores_list = []
    
    # Loop on labeled dataset:
    for i in range(len(y)):
        ndcg_score = calculate_ndcg(gb_clfs_pipeline_prediction(clf_1, clf_2, clf_3, clf_4, X[i]), y[i])
        ndcg_scores_list.append(ndcg_score)
        
    # Determine nDCG mean score:
    ndcg_mean_score = np.mean(ndcg_scores_list)
    
    # Return result:
    return ndcg_mean_score

def detailed_ndcg_mean_score_calc_3(clf_1, clf_2, clf_3, clf_4, X, y, encoding_dict):
    """ Calculate nDCG mean score on a labeled dataset for each class """
    
    # Reverse encoding dictionary:
    decoding_dict = dict(map(reversed, encoding_dict.items()))
    
    # Set nDCG scores objects:
    ndcg_scores_dict = {country_dest: [] for country_dest in range(12)}
    ndcg_mean_scores_list = []
    
    # Loop on labeled dataset:
    for i in range(len(y)):
        ndcg_score = calculate_ndcg(gb_clfs_pipeline_prediction(clf_1, clf_2, clf_3, clf_4, X[i]), y[i])
        ndcg_scores_dict[y[i]].append(ndcg_score)
        
    # Loop on country destinations:
    for country_dest in range(12):
        ndcg_mean_scores_list.append(np.mean(ndcg_scores_dict[country_dest]))
        
    # Return result:
    return ndcg_mean_scores_list

In [22]:
# Calculate nDCG mean score on training dataset:
print("Time info about nDCG mean score calculation on training dataset:")
%time ndcg_mean_score = ndcg_mean_score_calc_3(gb_clf_1, gb_clf_2, gb_clf_3, gb_clf_4, X_train, y_train)
print("***")
print("On training dataset, classifier nDCG mean score is {:.6f}.".format(ndcg_mean_score))

Time info about nDCG mean score calculation on training dataset:
CPU times: user 60 s, sys: 483 ms, total: 1min
Wall time: 1min
***
On training dataset, classifier nDCG mean score is 0.823940.


In [23]:
# Calculate nDCG mean score on testing dataset:
print("Time info about nDCG mean score calculation on testing dataset:")
%time ndcg_mean_score = ndcg_mean_score_calc_3(gb_clf_1, gb_clf_2, gb_clf_3, gb_clf_4, X_test, y_test)
print("***")
print("On testing dataset, classifier nDCG mean score is {:.6f}.".format(ndcg_mean_score))

Time info about nDCG mean score calculation on testing dataset:
CPU times: user 14.9 s, sys: 111 ms, total: 15 s
Wall time: 15 s
***
On testing dataset, classifier nDCG mean score is 0.822982.


In [24]:
# Calculate nDCG mean score for each class on testing dataset:
print("Time info about nDCG mean score calculation for each class on testing dataset:")
%time ndcg_mean_scores_list = detailed_ndcg_mean_score_calc_3(gb_clf_1, gb_clf_2, gb_clf_3, gb_clf_4, X_test, y_test, encoding_dict)
print("***")
print("Detailed results for each class on testing dataset:")
for country_dest in range(12):
    print("nDCG mean score for {}: {:.6f}".format(decoding_dict[country_dest],
                                                  ndcg_mean_scores_list[country_dest]))

Time info about nDCG mean score calculation for each class on testing dataset:
CPU times: user 15 s, sys: 122 ms, total: 15.1 s
Wall time: 15 s
***
Detailed results for each class on testing dataset:
nDCG mean score for AU: 0.000000
nDCG mean score for CA: 0.000000
nDCG mean score for DE: 0.002976
nDCG mean score for ES: 0.004444
nDCG mean score for FR: 0.433142
nDCG mean score for GB: 0.002001
nDCG mean score for IT: 0.388995
nDCG mean score for NDF: 0.926442
nDCG mean score for NL: 0.000000
nDCG mean score for PT: 0.000000
nDCG mean score for US: 0.832884
nDCG mean score for other: 0.499009


---

## Conclusion

As a quick conclusion, for the results obtained for this gradient boosting classifiers pipeline, we can note 4 major elements in comparison with the gradient boosting classifier we have built previously (the one which reached the best results):
* On testing dataset, it gets a worse nDCG mean score than the one obtained by the gradient boosting classifier.
* On testing dataset, it gets better nDCG mean scores for predicting correctly France, Italy, USA and other than the ones obtained by the gradient boosting classifier.
* On testing dataset, it gets worse nDCG mean scores for predicting correctly Australia, Canada, Germany, Spain, Great Britain, no destination found and Netherlands USA than the ones obtained by the gradient boosting classifier.
* On testing dataset, it is as "bad" as the gradient boosting classifier for predicting correctly Portugal.