### Preamble

This notebook contains work on a [kaggle competition](https://www.kaggle.com/c/airbnb-recruiting-new-user-bookings) sponsored by AirBnB and aimed at predicting in which country a new user will make their first booking in. This competition was run for recruitment purposes a few years ago and provides a very nice example of a real-life data science challenge.

### Imports and helpers

In [1]:
# General
from pathlib import Path
import os
from functools import partial

# Data processing
import pandas as pd
import numpy as np

# Machine learning
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier

%matplotlib inline

In [2]:
def display_all(df):
    with pd.option_context("display.max_rows", 1000, "display.max_columns", 1000): 
        display(df)

In [3]:
from pandas.api.types import is_string_dtype

def make_categorical(df):
    
    categories = {}

    for c in df.columns:

        if is_string_dtype(df[c]):
            df[c] = df[c].astype("category").cat.as_ordered()
            categories[c] = df[c].cat.categories
            df[c] = df[c].cat.codes
            
    return df, categories

In [4]:
def apply_categories(df, categories):
    
    for c in df.columns:
        
        if is_string_dtype(df[c]):
            df[c] = pd.Categorical(df[c], categories=categories, ordered=True)
            df[c] = df[c].cat.codes

This implementation of the evaluation metric described on the [competition homepage](https://www.kaggle.com/c/airbnb-recruiting-new-user-bookings#evaluation) is based on work shared in the [competition discussion](https://www.kaggle.com/wendykan/ndcg-example).

In [5]:
def dcg_at_k(r, k, method=1):
    """Score is discounted cumulative gain (dcg)
    Relevance is positive real values.  Can use binary
    as the previous methods.
    Example from
    http://www.stanford.edu/class/cs276/handouts/EvaluationNew-handout-6-per.pdf
    >>> r = [3, 2, 3, 0, 0, 1, 2, 2, 3, 0]
    >>> dcg_at_k(r, 1)
    3.0
    >>> dcg_at_k(r, 1, method=1)
    3.0
    >>> dcg_at_k(r, 2)
    5.0
    >>> dcg_at_k(r, 2, method=1)
    4.2618595071429155
    >>> dcg_at_k(r, 10)
    9.6051177391888114
    >>> dcg_at_k(r, 11)
    9.6051177391888114
    Args:
        r: Relevance scores (list or numpy) in rank order
            (first element is the first item)
        k: Number of results to consider
        method: If 0 then weights are [1.0, 1.0, 0.6309, 0.5, 0.4307, ...]
                If 1 then weights are [1.0, 0.6309, 0.5, 0.4307, ...]
    Returns:
        Discounted cumulative gain
    """
    r = np.asfarray(r)[:k]
    if r.size:
        if method == 0:
            return r[0] + np.sum(r[1:] / np.log2(np.arange(2, r.size + 1)))
        elif method == 1:
            return np.sum(r / np.log2(np.arange(2, r.size + 2)))
        else:
            raise ValueError("method must be 0 or 1.")
    return 0.


def ndcg_at_k(r, k, method=1):
    """Score is normalized discounted cumulative gain (ndcg)
    Relevance is positive real values.  Can use binary
    as the previous methods.
    Example from
    http://www.stanford.edu/class/cs276/handouts/EvaluationNew-handout-6-per.pdf
    >>> r = [3, 2, 3, 0, 0, 1, 2, 2, 3, 0]
    >>> ndcg_at_k(r, 1)
    1.0
    >>> r = [2, 1, 2, 0]
    >>> ndcg_at_k(r, 4)
    0.9203032077642922
    >>> ndcg_at_k(r, 4, method=1)
    0.96519546960144276
    >>> ndcg_at_k([0], 1)
    0.0
    >>> ndcg_at_k([1], 2)
    1.0
    Args:
        r: Relevance scores (list or numpy) in rank order
            (first element is the first item)
        k: Number of results to consider
        method: If 0 then weights are [1.0, 1.0, 0.6309, 0.5, 0.4307, ...]
                If 1 then weights are [1.0, 0.6309, 0.5, 0.4307, ...]
    Returns:
        Normalized discounted cumulative gain
    """
    dcg_max = dcg_at_k(sorted(r, reverse=True), k, method)
    if not dcg_max:
        return 0.
    return dcg_at_k(r, k, method) / dcg_max

In [6]:
def format_preds_wd(m, X, index_col=False):
    """ Get top five class codes in wide format ordered by probability from left to right
    m : A fitted sklearn model implementing predict_proba
    X : Feature matrix
    """
    
    probs = m.predict_proba(X)
    res = pd.DataFrame(probs.argsort()[:, -5:][:, ::-1])
    if index_col:
        res["id"] = X.index
    else:
        res = res.set_index(X.index)
    return res

In [7]:
def format_preds_lng(m, X, mapper=None):
    res = pd.melt(format_preds_wd(m, X, index_col=True), id_vars="id", value_name="country").sort_values(by=["id", "variable"]).drop("variable", axis=1)
    if mapper: res.country = res.country.map(mapper)
    return res

In [9]:
def score_predictions_wd(preds, y_true, k=5):
    """
    preds: pd.DataFrame
      one row for each observation, one column for each prediction.
      Columns are sorted from left to right descending in order of likelihood.
    truth: pd.Series
      one row for each observation.
    """
    assert(len(preds)==len(y_true))
    r = pd.DataFrame(0, index=preds.index, columns=preds.columns, dtype=np.float64)
    for col in preds.columns:
        r[col] = (preds[col] == y_true) * 1.0

    score = pd.Series(r.apply(partial(ndcg_at_k, k=k), axis=1, result_type="reduce"), name="score")
    print(f"Average NDCG at {k}: {score.mean():.4f}")
    return score

### Data preprocessing

#### Read in the available data and check out what we have

##### Users

The `users` table is the main data source we have. There is one entry for each unique user. We have data on a total of `213.451 users` available for training. The test set contains `62.096` rows.  The users table provides basic information each user as well as the target for our prediction task, the `country_destination` column.

Our target has 12 possible values: `'NDF', 'US', 'other', 'FR', 'CA', 'GB', 'ES', 'IT', 'PT', 'NL', 'DE', 'AU'`.

Among these, there are two special values:

- `NDF` - no destination found, i.e. no booking took place yet
- `other` - the user booked, but not in one of the 10 main countries

In [11]:
DATA_PATH = Path.cwd() / "data"

In [12]:
train_users = pd.read_csv(DATA_PATH / "train_users_2.csv")

##### Sessions

The `sessions` table contains several entries for a subset of the users. Session data is available for approx. `63% of the users`. The top user has `more than 2700 sessions`.

Each session entry contains information on
- which action was taken (`action`)
- which type of action that is (`action_type`)
- a more detailed description of the action (`action_detail`) 
- the type of device that action was taken on (`device_type`)
- how many seconds have passed since the last action was taken (`secs_elapsed`)

In [13]:
sessions = pd.read_csv(DATA_PATH / "sessions.csv")

#### Prepare data for modeling

##### Add aggregated sessions data

In [14]:
sessions = (sessions
            .rename({"user_id" : "id"}, axis="columns")
            .dropna(subset=["id"])
           )

**Devices used to access AirBnB**

In [16]:
sessions.device_type.nunique()

14

In [17]:
sessions_device_type_count = (pd.DataFrame(sessions
                                             .groupby("id")
                                             .device_type
                                             .value_counts()
                                          )
                              .rename({"device_type" : "count_sessions_device"}, axis="columns")
                              .reset_index()
                              .pivot(index="id",
                                     columns="device_type",
                                     values="count_sessions_device")
                              .reset_index()
                              .fillna(-1)
                             )

In [19]:
sessions_device_type_count.head()

device_type,id,-unknown-,Android App Unknown Phone/Tablet,Android Phone,Blackberry,Chromebook,Linux Desktop,Mac Desktop,Opera Phone,Tablet,Windows Desktop,Windows Phone,iPad Tablet,iPhone,iPodtouch
0,00023iyk9l,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,36.0,-1.0,-1.0,-1.0,-1.0,-1.0,4.0,-1.0
1,0010k6l0om,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,63.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0
2,001wyh0pz8,-1.0,90.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0
3,0028jgx1x1,30.0,-1.0,1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0
4,002qnbzfs5,14.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,775.0,-1.0


In [20]:
colname_mapper_devices = {c : f"count_device_{c.replace(' ', '_').lower()}"
                          for c in sessions_device_type_count.columns
                          if c != "id"}

In [22]:
sessions_device_type_count.rename(colname_mapper_devices, axis="columns", inplace=True)

In [23]:
sessions_preferred_device = (pd.DataFrame(sessions
                                             .groupby("id")
                                             .device_type
                                             .value_counts()
                                             .groupby(level=0)
                                             .head(1))
                             .rename({"device_type" : "to_drop"}, axis="columns")
                             .reset_index()
                             .rename({"device_type" : "preferred_device"}, axis="columns")
                             .drop("to_drop", axis="columns")
                            )

**Actions taken on AirBnB**

In [25]:
sessions.action.nunique()

359

There are many distinct values for "action", let's bin the long tail into a "other" category

In [27]:
def add_others(df_in, make_others, k=50):
    df = df_in.copy()
    for col in make_others:
        cnts = df[col].value_counts()
        df.loc[df.loc[:, col].isin(list(cnts.iloc[k-1:].index)), col] = "Other"
        
    return df

In [28]:
sessions = add_others(sessions, ["action"])

In [30]:
sessions_action_count = (pd.DataFrame(sessions
                                             .groupby("id")
                                             .action
                                             .value_counts()
                                          )
                              .rename({"action" : "count_sessions_action"}, axis="columns")
                              .reset_index()
                              .pivot(index="id",
                                     columns="action",
                                     values="count_sessions_action")
                              .reset_index()
                              .fillna(-1)
                             )

In [31]:
colname_mapper_actions = {c : f"count_action_{c.replace(' ', '_').lower()}"
                          for c in sessions_action_count.columns
                          if c != "id"}

In [32]:
sessions_action_count.rename(colname_mapper_actions, axis="columns", inplace=True)

**Summary statistics of time elapsed, i.e. time spent in interaction with AirBnB**

In [34]:
sessions_agg_elapsed = (pd.DataFrame(sessions
                                          .groupby("id")
                                          .secs_elapsed
                                          .agg(["min", "median", "max"]))
                         .reset_index()
                       )

In [35]:
colname_mapper_secs_elapsed = {c : f"{c}_secs_elapsed"
                               for c in sessions_agg_elapsed.columns
                               if c != "id"}

In [36]:
sessions_agg_elapsed.rename(colname_mapper_secs_elapsed, axis="columns", inplace=True)

**Combine all sessions features into a single dataframe and merge with train data**

In [38]:
sessions_features = (sessions_agg_elapsed
                     .merge(sessions_action_count, how="outer", on="id")
                     .merge(sessions_device_type_count, how="outer", on="id")
                     .merge(sessions_preferred_device, how="outer", on="id")
                    )

Gather feature names to fill NAs later on

In [39]:
session_action_count_features = list(colname_mapper_actions.values())
session_agg_elapsed_features = list(colname_mapper_secs_elapsed.values())
session_count_device_features = list(colname_mapper_devices.values())

In [40]:
train_users_sessions = pd.merge(train_users,
                                sessions_features,
                                how="left",
                                on="id")

In [41]:
train_users.shape, train_users_sessions.shape

((213451, 16), (213451, 84))

#### Save intermediate result

In [None]:
DATA_OUTPUT_PATH = Path.cwd() / "output"

if not DATA_OUTPUT_PATH.is_dir():
    DATA_OUTPUT_PATH.mkdir()
    
train_users_sessions.to_pickle(DATA_OUTPUT_PATH / "train_users_sessions.pkl")

### Modeling

#### Read in data

In [None]:
DATA_OUTPUT_PATH = Path.cwd() / "output"

In [None]:
train_users_sessions = pd.read_pickle(DATA_OUTPUT_PATH / "train_users_sessions.pkl")

In [53]:
df = (train_users_sessions
      .set_index("id")
      .sort_values(by="date_account_created")
      .drop(["date_account_created", "date_first_booking"], axis=1))

In [55]:
df.shape

(213451, 81)

In [56]:
df.isna().sum()

timestamp_first_active                                0
gender                                                0
age                                               87990
signup_method                                         0
signup_flow                                           0
language                                              0
affiliate_channel                                     0
affiliate_provider                                    0
first_affiliate_tracked                            6065
signup_app                                            0
first_device_type                                     0
first_browser                                         0
country_destination                                   0
min_secs_elapsed                                 140820
median_secs_elapsed                              140820
max_secs_elapsed                                 140820
count_action_other                               140045
count_action_active                             

**Fill NAs**

In [57]:
df.age = df.age.fillna(-1)
df.first_affiliate_tracked = df.age.fillna("unknown")
df.preferred_device = df.preferred_device.fillna("unknown")

In [58]:
session_numerics_to_fill = session_action_count_features \
                            + session_agg_elapsed_features \
                            + session_count_device_features

df[session_numerics_to_fill] = df[session_numerics_to_fill].fillna(-1)

In [59]:
df, cats = make_categorical(df)

In [61]:
n_train = int(.7 * df.shape[0])
X = df.drop("country_destination", axis=1).copy()
y = df["country_destination"].copy()

X_train = X[:n_train]
y_train = y[:n_train]

X_valid = X[n_train:]
y_valid = y[n_train:]

In [62]:
X_train.shape, y_train.shape, X_valid.shape, y_valid.shape

((149415, 80), (149415,), (64036, 80), (64036,))

In [71]:
m = RandomForestClassifier(n_estimators=50,
                           min_samples_leaf=3,
                           n_jobs=-1,
                           random_state=42)

In [72]:
m.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=3, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=120, n_jobs=-1,
            oob_score=False, random_state=42, verbose=0, warm_start=False)

In [73]:
m.score(X_valid, y_valid)

0.6933443687925542

In [74]:
preds_train_wd = format_preds_wd(m, X_train)
preds_valid_wd = format_preds_wd(m, X_valid)

In [75]:
scores_train = score_predictions_wd(preds_train_wd, y_train)

Average NDCG at 5: 0.8639


In [76]:
scores_valid = score_predictions_wd(preds_valid_wd, y_valid)

Average NDCG at 5: 0.8489


#### Archive

In [None]:
gbm = GradientBoostingClassifier(n_estimators=50,
                                 random_state=42)

In [None]:
rf = RandomForestClassifier(n_estimators=50,
                           min_samples_leaf=3,
                           n_jobs=-1,
                           random_state=42)

In [None]:
from xgboost.sklearn import XGBClassifier

In [None]:
xgb = XGBClassifier(n_estimators=50)

In [None]:
xgb.fit(X_train, y_train)

In [None]:
rf.fit(X_train, y_train)

In [None]:
xgb.score(X_valid, y_valid)

In [None]:
m.score(X_valid, y_valid)

In [None]:
preds_train_wd = format_preds_wd(xgb, X_train)
preds_valid_wd = format_preds_wd(xgb, X_valid)

In [None]:
scores_train = score_predictions_wd(preds_train_wd, y_train)

In [None]:
scores_valid = score_predictions_wd(preds_valid_wd, y_valid)

In [None]:
preds_xgb = xgb.predict_proba(X_train)
preds_rf = rf.predict_proba(X_train)

In [None]:
preds_xgb[1:10]

In [None]:
preds_concat = np.hstack((preds_xgb, preds_rf))

In [None]:
preds_xgb.shape, preds_rf.shape, preds_concat.shape

In [None]:
gbm = GradientBoostingClassifier(n_estimators=50,
                                 random_state=42)

In [None]:
gbm.fit(preds_concat, y_train)

### Interpretation

#### Visualize feature importance

#### Use learnings to optimize model

### Final model

#### Load and transform test data

In [77]:
test_users = pd.read_csv(DATA_PATH / "test_users.csv")

In [78]:
test_users = pd.merge(test_users, sessions_features, how="left", on="id")

In [80]:
test_df = test_users.drop(["date_account_created", "date_first_booking"], axis=1)
test_df = test_df.set_index("id")

In [81]:
test_df.age = test_df.age.fillna(-1)
test_df.first_affiliate_tracked = test_df.age.fillna("unknown")
test_df.preferred_device = test_df.preferred_device.fillna("unknown")
test_df[session_numerics_to_fill] = test_df[session_numerics_to_fill].fillna(-1)

In [82]:
apply_categories(test_df, categories=cats)

In [83]:
test_df.head()

Unnamed: 0_level_0,timestamp_first_active,gender,age,signup_method,signup_flow,language,affiliate_channel,affiliate_provider,first_affiliate_tracked,signup_app,...,count_device_linux_desktop,count_device_mac_desktop,count_device_opera_phone,count_device_tablet,count_device_windows_desktop,count_device_windows_phone,count_device_ipad_tablet,count_device_iphone,count_device_ipodtouch,preferred_device
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
5uwns89zht,20140701000006,-1,35.0,-1,0,-1,-1,-1,35.0,-1,...,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,1.0,-1.0,-1
jtl0dijy2j,20140701000051,-1,-1.0,-1,0,-1,-1,-1,-1.0,-1,...,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,9.0,-1.0,-1
xx0ulgorjt,20140701000148,-1,-1.0,-1,0,-1,-1,-1,-1.0,-1,...,-1.0,-1.0,-1.0,-1.0,58.0,-1.0,-1.0,-1.0,-1.0,-1
6c6puo6ix0,20140701000215,-1,-1.0,-1,0,-1,-1,-1,-1.0,-1,...,-1.0,-1.0,-1.0,-1.0,11.0,-1.0,-1.0,-1.0,-1.0,-1
czqhjk3yfe,20140701000305,-1,-1.0,-1,0,-1,-1,-1,-1.0,-1,...,-1.0,19.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1


#### Make predictions and save for submission

In [84]:
mapper_country_destination = {i : c for i, c in enumerate(cats["country_destination"])}

In [86]:
test_preds = format_preds_lng(m, test_df, mapper=mapper_country_destination)

In [87]:
test_preds.head(10)

Unnamed: 0,id,country
25245,0010k6l0om,NDF
87341,0010k6l0om,US
149437,0010k6l0om,other
211533,0010k6l0om,FR
273629,0010k6l0om,IT
55494,0031awlkjq,NDF
117590,0031awlkjq,US
179686,0031awlkjq,other
241782,0031awlkjq,GB
303878,0031awlkjq,IT


In [88]:
OUTPUT_PATH = Path.cwd() / "submission"

if not OUTPUT_PATH.is_dir():
    OUTPUT_PATH.mkdir()

In [89]:
SUBMISSION_NAME = "sub_v8_show_case.csv"

In [90]:
test_preds.to_csv(OUTPUT_PATH / SUBMISSION_NAME, index=False)