# Naive Solution For Benchmarking
---

## Introduction

In this notebook, we build a naive baseline solution. It will later serve as a benchmark for evaluating the prediction models we develop to tackle the problem studied in this project.

For that, we consider a **naive prediction model** which, for every prediction it has to perform, returns the same ranked list of candidate first booking destination countries:

`['NDF', 'US', 'other', 'FR', 'IT']`

As observed in previous notebooks, this list contains the five most frequent first booking destination countries in the consolidated dataset used in this project, in decreasing order of frequency.
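
As a reminder of how such a ranking can be obtained, here is a minimal sketch that counts class frequencies and keeps the five most frequent ones. The labels in `toy_destinations` are hypothetical toy data, and `top_k_classes` is an illustrative helper, not a function of this project.

```python
from collections import Counter

# Hypothetical toy labels; in the project, the values would come from
# the 'country_destination' column of the consolidated dataset.
toy_destinations = (
    ['NDF'] * 6 + ['US'] * 5 + ['other'] * 4 + ['FR'] * 3 + ['IT'] * 2 + ['GB']
)


def top_k_classes(labels, k=5):
    """Return the k most frequent labels, most frequent first."""
    return [label for label, _ in Counter(labels).most_common(k)]


naive_list = top_k_classes(toy_destinations)
print(naive_list)
```

With pandas, the same ranking could be read from `value_counts()` on the `country_destination` column, which sorts classes by decreasing frequency by default.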

To evaluate our models in this project, we use **Normalized Discounted Cumulative Gain** (Normalized DCG) as the evaluation metric. Classically used to measure the effectiveness of web search engine algorithms, this metric is the one chosen for the Kaggle challenge, so we use it too.

Calculating it first requires the **Discounted Cumulative Gain** (DCG). Consider a prediction for which the real first booking destination country is $d$ and the ordered list of possibilities proposed by the predictor is $\left( \hat{d}_{i} \right)_{i \in \{1, \ldots, k\}}$, with $k \in \mathbb{N}^{\star}$ (in this project, $k$ is fixed to 5). For all $i \in \{1, \ldots, k\}$, let $rel_{i} = \mathbb{1}_{d = \hat{d}_{i}}$ denote the relevance of the prediction at rank $i$. We then have:

$$DCG_{k} = \sum_{i = 1}^{k} \frac{2^{rel_{i}} - 1}{\log_{2}(i + 1)}$$

**Normalized Discounted Cumulative Gain** (Normalized DCG) can then be calculated like this:

$$nDCG_{k} = \frac{DCG_{k}}{IDCG_{k}}$$

Where $IDCG_{k}$ is the **Ideal Discounted Cumulative Gain** (Ideal DCG), the maximum DCG achievable for the prediction. Here, since exactly one destination is relevant, this maximum is reached whenever the real destination is ranked first, for example with a prediction $\left( \hat{d}_{1} \right)$ with $\hat{d}_{1} = d$, which leads to $IDCG_{k} = \frac{2^{1} - 1}{\log_{2}(2)} = 1$.

Every $nDCG_{k}$ value therefore lies in the interval $[0, 1]$.
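
The two formulas above can be sketched in a few lines of Python. This is only an illustration under the binary-relevance setting described here: `dcg_at_k` and `ndcg_at_k` are hypothetical names, and the code is not the implementation of `calculate_dcg` and `calculate_ndcg` from the project's `utils` module, which is not shown in this notebook.

```python
import math


def dcg_at_k(ranked_predictions, true_label, k=5):
    """DCG_k with binary relevance: rel_i = 1 if the i-th prediction is the true label."""
    total = 0.0
    for i, predicted in enumerate(ranked_predictions[:k], start=1):
        rel = 1 if predicted == true_label else 0
        total += (2 ** rel - 1) / math.log2(i + 1)
    return total


def ndcg_at_k(ranked_predictions, true_label, k=5):
    """nDCG_k = DCG_k / IDCG_k; the ideal ranking puts the true label first."""
    ideal_dcg = dcg_at_k([true_label], true_label, k)  # equals 1 in this setting
    return dcg_at_k(ranked_predictions, true_label, k) / ideal_dcg


# True label ranked second in a five-element list:
score = ndcg_at_k(['NDF', 'US', 'other', 'FR', 'IT'], 'US')
print(round(score, 6))  # 0.63093
```

Computing the ideal DCG explicitly, rather than hard-coding 1, keeps the sketch close to the definition of $nDCG_{k}$.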

As an application example, if the destination for a particular user is `FR`, the following predictions are scored as follows:
* (`FR`) gives: $nDCG_{5} = \frac{DCG_{5}}{IDCG_{5}} = DCG_{5} = \frac{2^{rel_{1}} - 1}{\log_{2}(1 + 1)} = \frac{2 - 1}{\log_{2}(2)} = 1$
* (`US`, `FR`) gives: $nDCG_{5} = \frac{DCG_{5}}{IDCG_{5}} = DCG_{5} = \frac{2^{rel_{1}} - 1}{\log_{2}(1 + 1)} + \frac{2^{rel_{2}} - 1}{\log_{2}(2 + 1)} = \frac{2^{0} - 1}{\log_{2}(2)} + \frac{2^{1} - 1}{\log_{2}(3)} = \frac{1}{\log_{2}(3)} = 0.630930$
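
These two examples generalize. Assuming each destination appears at most once in the predicted list (an assumption that holds for the naive list used here), at most one term of the sum is non-zero, so the score depends only on the rank $r$ at which the real destination $d$ appears:

$$nDCG_{k} = \begin{cases} \dfrac{1}{\log_{2}(r + 1)} & \text{if } \hat{d}_{r} = d \text{ for some } r \in \{1, \ldots, k\} \\ 0 & \text{otherwise} \end{cases}$$

For instance, in the naive list `['NDF', 'US', 'other', 'FR', 'IT']`, `FR` sits at rank 4, which gives $nDCG_{5} = \frac{1}{\log_{2}(5)} \approx 0.430677$.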

As always, the prerequisite step consists of loading the packages needed for our work:

In [1]:
# Activate 'airbnb' environment:
!source activate airbnb

In [2]:
# Needed packages:
import numpy as np
import pandas as pd
from utils import create_training_testing_datasets, calculate_dcg, calculate_ndcg

---

## Create training and testing datasets

In [3]:
# Load the data:
consolidated_dataset = pd.read_csv("../data/consolidated_dataset.csv")

# Check basic info:
print("*** Some basic info:")
print("'consolidated_dataset' has {} data points with {} variables each.".format(*consolidated_dataset.shape))
print("'consolidated_dataset' counts {} missing values.".format(consolidated_dataset.isnull().sum().sum()))

# Take a look at the first rows:
print("\n*** First lines:")
display(consolidated_dataset.head())

*** Some basic info:
'consolidated_dataset' has 213451 data points with 161 variables each.
'consolidated_dataset' counts 0 missing values.

*** First lines:


| Unnamed: 0 | age | country_destination | nans | day_account_created | weekday_account_created | week_account_created | month_account_created | year_account_created | day_first_active | weekday_first_active | ... | first_browser_SeaMonkey | first_browser_Silk | first_browser_SiteKiosk | first_browser_SlimBrowser | first_browser_Sogou Explorer | first_browser_Stainless | first_browser_TenFourFox | first_browser_TheWorld Browser | first_browser_Yandex.Browser | first_browser_wOSBrowser |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1.0 | NDF | 1.225078 | 28 | 0 | 26 | 6 | 2010 | 19 | 3 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 38.0 | NDF | -0.453135 | 25 | 2 | 21 | 5 | 2011 | 23 | 5 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 56.0 | US | -0.453135 | 28 | 1 | 39 | 9 | 2010 | 9 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 42.0 | other | -0.453135 | 5 | 0 | 49 | 12 | 2011 | 31 | 5 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 41.0 | US | 0.385972 | 14 | 1 | 37 | 9 | 2010 | 8 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


In [4]:
# Create training and testing datasets:
X_train, X_test, y_train, y_test, encoding_dict = create_training_testing_datasets(consolidated_dataset)

---

## Calculate Normalized DCG scores

In [5]:
# Set naive predictor response:
naive_predictor_response = ['NDF', 'US', 'other', 'FR', 'IT']
naive_predictor_encoded_response = [encoding_dict[x] for x in naive_predictor_response]
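
The content of `encoding_dict` is not displayed in this notebook. The order in which classes are printed in the detailed results further down (`AU`, `CA`, ..., `US`, `other`) is consistent with integer codes assigned in sorted label order, so the sketch below rebuilds such a mapping under that assumption. `encoding_sketch` and `decoding_sketch` are hypothetical names, and the actual dictionary returned by `create_training_testing_datasets` may be built differently.

```python
# The 12 destination classes that appear in this notebook's per-class results.
labels = ['NDF', 'US', 'other', 'FR', 'IT', 'GB', 'ES', 'CA', 'DE', 'NL', 'AU', 'PT']

# Assumption: codes follow sorted label order (uppercase sorts before lowercase,
# so 'other' gets the last code).
encoding_sketch = {label: code for code, label in enumerate(sorted(labels))}
decoding_sketch = {code: label for label, code in encoding_sketch.items()}

# Encode the naive ranked list the same way as in the cell above.
naive_response = ['NDF', 'US', 'other', 'FR', 'IT']
encoded_response = [encoding_sketch[label] for label in naive_response]
print(encoded_response)  # [7, 10, 11, 4, 6]
```

Encoding the naive list once, outside the scoring loops, avoids repeating the dictionary lookups for every sample.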

### Training dataset

In [6]:
# Calculate nDCG on training dataset:

naive_predictor_training_scores = []

for country_destination in y_train:
    ndcg_score = calculate_ndcg(naive_predictor_encoded_response, country_destination)
    naive_predictor_training_scores.append(ndcg_score)

ndcg_mean_score = np.mean(naive_predictor_training_scores)

print("On training dataset, naive predictor nDCG mean score is {:.6f}.".format(ndcg_mean_score))

On training dataset, naive predictor nDCG mean score is 0.806766.


### Testing dataset

In [7]:
# Calculate nDCG on testing dataset:

naive_predictor_testing_scores = []

for country_destination in y_test:
    ndcg_score = calculate_ndcg(naive_predictor_encoded_response, country_destination)
    naive_predictor_testing_scores.append(ndcg_score)

ndcg_mean_score = np.mean(naive_predictor_testing_scores)

print("On testing dataset, naive predictor nDCG mean score is {:.6f}.".format(ndcg_mean_score))

On testing dataset, naive predictor nDCG mean score is 0.806763.


---

## More detailed results

To go further in our analysis, we also compute the naive predictor's mean nDCG score for each class on the testing dataset.

In [8]:
# Calculate nDCG mean score for each class on the testing dataset:

n_classes = len(encoding_dict)
# Invert the label -> code mapping to print readable class names:
decoding_dict = {code: label for label, code in encoding_dict.items()}
ndcg_scores_dict = {country_dest: [] for country_dest in range(n_classes)}
ndcg_mean_scores = []

# Group the per-sample nDCG scores by true class:
for country_dest in y_test:
    ndcg_score = calculate_ndcg(naive_predictor_encoded_response, country_dest)
    ndcg_scores_dict[country_dest].append(ndcg_score)

print("*** More detailed results:")
for country_dest in range(n_classes):
    ndcg_mean_scores.append(np.mean(ndcg_scores_dict[country_dest]))
    print("nDCG mean score for {}: {:.6f}".format(decoding_dict[country_dest], ndcg_mean_scores[country_dest]))

*** More detailed results:
nDCG mean score for AU: 0.000000
nDCG mean score for CA: 0.000000
nDCG mean score for DE: 0.000000
nDCG mean score for ES: 0.000000
nDCG mean score for FR: 0.430677
nDCG mean score for GB: 0.000000
nDCG mean score for IT: 0.386853
nDCG mean score for NDF: 1.000000
nDCG mean score for NL: 0.000000
nDCG mean score for PT: 0.000000
nDCG mean score for US: 0.630930
nDCG mean score for other: 0.500000
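
These per-class values follow directly from the rank of each class in the naive list: a class at rank $r$ scores $\frac{1}{\log_{2}(r + 1)}$, and a class absent from the list scores 0. The sketch below reproduces them without any dataset. It also illustrates that, for a constant predictor, the overall mean is the class-frequency-weighted average of these values. `constant_predictor_class_scores` is a hypothetical helper and `toy_labels` is made-up data, so the weighted mean it yields is not a result of this project.

```python
import math
from collections import Counter

NAIVE_RANKING = ['NDF', 'US', 'other', 'FR', 'IT']
ALL_CLASSES = ['AU', 'CA', 'DE', 'ES', 'FR', 'GB', 'IT', 'NDF', 'NL', 'PT', 'US', 'other']


def constant_predictor_class_scores(ranking, classes):
    """nDCG obtained for each possible true class when the ranking never changes."""
    scores = {}
    for label in classes:
        if label in ranking:
            rank = ranking.index(label) + 1  # ranks start at 1
            scores[label] = 1.0 / math.log2(rank + 1)
        else:
            scores[label] = 0.0
    return scores


class_scores = constant_predictor_class_scores(NAIVE_RANKING, ALL_CLASSES)
for label in ALL_CLASSES:
    print("nDCG score for {}: {:.6f}".format(label, class_scores[label]))

# Hypothetical toy labels, only to illustrate the frequency weighting.
toy_labels = ['NDF', 'NDF', 'US', 'FR']
counts = Counter(toy_labels)
weighted_mean = sum(class_scores[c] * n for c, n in counts.items()) / len(toy_labels)
```

This is why the naive predictor's overall score is driven almost entirely by how often `NDF` and `US` occur in the data.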
