# Expedia Hotel Recommendations: Predicting Hotel Clusters Based on Past Activity

## Introduction
The question that I hope to answer in this project is whether I can use Expedia customer usage behavior (such as the search they performed, how they interacted with the results, and the type of results returned) to predict what type of hotel customers will likely book in the future. 

This project and its data come from a past Kaggle [competition](https://www.kaggle.com/c/expedia-hotel-recommendations), whose stated goal is "to predict the booking outcome (hotel cluster) for a user event, based on their search and other attributes associated with that user event".

## Data Exploration
The [data](https://www.kaggle.com/c/expedia-hotel-recommendations/data) consists of training/testing data which provides information about the users' interaction with the search results, as well as a "destinations" file which provides descriptive data about destinations the user is searching for.

The datasets are quite large, with the training data at over 37M rows x 24 columns, the test data at 2.5M rows x 22 columns, and the destinations file at 62K rows x 150 columns.

In [12]:
import pandas as pd
train_url = '~/Documents/Expedia/train.csv'
train = pd.read_csv(train_url)
train.shape

(37670293, 24)

In [3]:
test_url = '~/Documents/Expedia/test.csv'
test = pd.read_csv(test_url)
test.shape

(2528243, 22)

In [4]:
destinations_url = '~/Documents/Expedia/destinations.csv'
dests = pd.read_csv(destinations_url)
dests.shape

(62106, 150)

With the data at this size, it will be challenging to work with. Just loading the training data into a dataframe takes several minutes on my laptop. 
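One way to keep memory use down when a file is this large is to read it in chunks and keep only the rows of interest from each chunk. The sketch below is a minimal illustration, not part of the original notebook: the helper name `load_bookings`, the chunk size, and the tiny in-memory CSV are all assumptions made for the example.

```python
import io
import pandas as pd

def load_bookings(source, chunksize=100000):
    """Read a CSV in chunks and keep only the rows where is_booking == 1."""
    kept = []
    for chunk in pd.read_csv(source, chunksize=chunksize):
        kept.append(chunk[chunk["is_booking"] == 1])
    return pd.concat(kept, ignore_index=True)

# Tiny in-memory stand-in for train.csv (hypothetical values)
sample_csv = io.StringIO(
    "user_id,is_booking,hotel_cluster\n"
    "12,0,1\n"
    "12,1,1\n"
    "93,0,80\n"
    "93,1,21\n"
)

bookings = load_bookings(sample_csv, chunksize=2)
print(bookings.shape)
```

With the real file, `source` would be the path to `train.csv`; only the filtered rows are ever held in memory at once.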

The training data consists of 24 columns:

In [13]:
train.head()

Unnamed: 0,date_time,site_name,posa_continent,user_location_country,user_location_region,user_location_city,orig_destination_distance,user_id,is_mobile,is_package,...,srch_children_cnt,srch_rm_cnt,srch_destination_id,srch_destination_type_id,is_booking,cnt,hotel_continent,hotel_country,hotel_market,hotel_cluster
0,2014-08-11 07:46:59,2,3,66,348,48862,2234.2641,12,0,1,...,0,1,8250,1,0,3,2,50,628,1
1,2014-08-11 08:22:12,2,3,66,348,48862,2234.2641,12,0,1,...,0,1,8250,1,1,1,2,50,628,1
2,2014-08-11 08:24:33,2,3,66,348,48862,2234.2641,12,0,0,...,0,1,8250,1,0,1,2,50,628,1
3,2014-08-09 18:05:16,2,3,66,442,35390,913.1932,93,0,0,...,0,1,14984,1,0,1,2,50,1457,80
4,2014-08-09 18:08:18,2,3,66,442,35390,913.6259,93,0,0,...,0,1,14984,1,0,1,2,50,1457,21


An examination of the columns shows that there are several date columns (date_time, srch_ci, and srch_co) that need to be converted into date objects. There are also a number of numeric columns that could be useful to predict which hotel cluster a user will book, such as orig_destination_distance (the distance, in an unknown unit, between the origin and destination), srch_rm_cnt (the number of rooms being booked), and columns indicating the number of adult and child guests. There are also a few 1/0 indicator columns that could be useful, such as whether the room is being booked as part of an air/hotel package and whether the user is on mobile.

However, the majority of the columns are numeric ID numbers, indicating which Expedia property is being used, the origin information about the user, and information about the geographic region of the market. The final column, hotel_cluster, is also a numeric ID number for the cluster itself. It does not seem likely that there is any linear relationship between the numeric ID columns and the hotel_cluster response column. The correlation function demonstrates this:

In [14]:
train.corr()["hotel_cluster"]

site_name                   -0.022408
posa_continent               0.014938
user_location_country       -0.010477
user_location_region         0.007453
user_location_city           0.000831
orig_destination_distance    0.007260
user_id                      0.001052
is_mobile                    0.008412
is_package                   0.038733
channel                      0.000707
srch_adults_cnt              0.012309
srch_children_cnt            0.016261
srch_rm_cnt                 -0.005954
srch_destination_id         -0.011712
srch_destination_type_id    -0.032850
is_booking                  -0.021548
cnt                          0.002944
hotel_continent             -0.013963
hotel_country               -0.024289
hotel_market                 0.034205
hotel_cluster                1.000000
Name: hotel_cluster, dtype: float64

The hotel_cluster is the column we will be predicting, so it is useful to know how many clusters there are and whether some are far more likely than others. There are only 100 clusters in total, and no single cluster dominates: the most common one (91) accounts for about 2.8% of rows, while the rarest (74) accounts for about 0.1%:

In [15]:
train["hotel_cluster"].value_counts(1)

91    0.027707
41    0.020513
48    0.020017
64    0.018708
65    0.017811
5     0.016464
98    0.015640
59    0.015139
42    0.014643
21    0.014603
70    0.014483
18    0.014475
83    0.014179
46    0.014177
25    0.014085
62    0.013772
95    0.013519
28    0.013459
68    0.013374
82    0.013373
37    0.013168
50    0.013005
30    0.012989
9     0.012963
58    0.012828
97    0.012727
16    0.012686
72    0.012144
1     0.012017
99    0.011810
        ...   
19    0.007510
84    0.007387
66    0.007260
38    0.007147
87    0.006913
23    0.006882
12    0.006876
31    0.006838
67    0.006794
43    0.006732
7     0.006701
54    0.006656
92    0.006486
89    0.006466
45    0.006408
49    0.006374
3     0.005980
80    0.005846
60    0.005785
71    0.005735
93    0.005689
86    0.005550
14    0.005105
75    0.004386
24    0.004357
35    0.003693
53    0.003579
88    0.002861
27    0.002788
74    0.001284
Name: hotel_cluster, dtype: float64

It may also be important to learn whether a given destination always maps to the same hotel cluster, or whether the cluster can vary. The "srch_destination_id" identifies the destination where the hotel search was performed (it is the key for the "destinations" file), so by pivoting the dataframe on this ID and the hotel cluster, we can see how the events for one destination spread across clusters. Based on this view, it is not at all uncommon for a single destination to be associated with several clusters.

In [16]:
pd.pivot_table(train, index=['srch_destination_id'], columns=['hotel_cluster'], values='cnt', aggfunc='count', fill_value=0)


srch_destination_id / hotel_cluster,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,0,0,0,2,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,10,0,0,0,0,0,0,...,3,0,0,21,0,0,0,0,0,18
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,2,0,0,0,0,0
8,0,0,0,0,0,0,0,420,0,0,...,0,198,0,0,79,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
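The pivot table is hard to read at a glance, so a more compact check is to count how many distinct clusters each srch_destination_id appears with. The snippet below is a small sketch on made-up rows (the `events` frame and its values are assumptions for illustration, not taken from the dataset).

```python
import pandas as pd

# Hypothetical events: destination 4 appears with three clusters,
# destination 8 with two, destination 9 with one
events = pd.DataFrame({
    "srch_destination_id": [4, 4, 4, 8, 8, 9],
    "hotel_cluster":       [3, 93, 99, 7, 91, 5],
})

# Number of distinct clusters seen for each destination
clusters_per_dest = events.groupby("srch_destination_id")["hotel_cluster"].nunique()

# Share of destinations that appear with more than one cluster
share_multi = (clusters_per_dest > 1).mean()

print(clusters_per_dest)
print(share_multi)
```

Running the same two lines on the full `train` dataframe would give a single number summarizing what the pivot shows visually.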


## Data Cleaning and Sampling

The first thing to do is to convert the date/time columns to date/time objects. This will make it easier to create features based on different parts of the date or on lengths of time (e.g. trip length, advance booking, etc.). However, when converting the "srch_ci" and "srch_co" columns, I got an "out of bounds" error due to one of the years showing up as "2557". Adding the instruction to "coerce" errors will make these values null, so I can deal with them later.

In [17]:
train["date_time"] = pd.to_datetime(train["date_time"], errors="coerce")
train["srch_ci"] = pd.to_datetime(train["srch_ci"], errors="coerce")
train["srch_co"] = pd.to_datetime(train["srch_co"], errors="coerce")
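After coercing, it can help to count how many values were turned into nulls so the loss is visible. The following is a minimal sketch on a made-up series; an unparseable string stands in for the bad value here, which is an assumption for illustration (the out-of-bounds year in the real data was handled by the same `errors="coerce"` option in the notebook).

```python
import pandas as pd

# Hypothetical check-in dates, one of which cannot be parsed
raw = pd.Series(["2014-08-11", "2014-08-15", "not-a-date"])

# Values that cannot be converted become NaT instead of raising an error
parsed = pd.to_datetime(raw, errors="coerce")

# Count how many values were coerced to null
n_bad = int(parsed.isna().sum())
print(n_bad)
```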


Next, we will sample the dataset in order to make it smaller and easier to deal with when testing different techniques. The current train dataset includes rows for both "click" activities as well as booking activities. Since the purpose of this analysis is to try to predict the hotel cluster for bookings, we can limit the dataset to booking activities only. This brings the dataset from 37M rows to just over 3M.

In [18]:
small_train = train[train['is_booking'] == 1]
small_train.shape

(3000693, 24)

Next, we'll see how many null values are in the data set and where.

In [19]:
small_train.isnull().sum(axis=0) # show columns with counts of null values

date_time                          0
site_name                          0
posa_continent                     0
user_location_country              0
user_location_region               0
user_location_city                 0
orig_destination_distance    1015179
user_id                            0
is_mobile                          0
is_package                         0
channel                            0
srch_ci                            0
srch_co                            0
srch_adults_cnt                    0
srch_children_cnt                  0
srch_rm_cnt                        0
srch_destination_id                0
srch_destination_type_id           0
is_booking                         0
cnt                                0
hotel_continent                    0
hotel_country                      0
hotel_market                       0
hotel_cluster                      0
dtype: int64

It looks like the null search dates are now gone. Since the nulls are all concentrated in one column (orig_destination_distance), I'll focus the analysis on rows where that value is present, which also shortens the dataset by about a third.

In [20]:
small_train = small_train.dropna()

small_train.shape

(1985514, 24)

Another option for shortening the dataset is to only include data from specific user IDs. The dataset has data on nearly 600K users:

In [21]:
small_train['user_id'].nunique()

596662

I want to select all bookings from 100K random users, and use that data for training. I want to make sure that I get all of the rows for each user, in case they have multiple bookings.

In [22]:
import numpy as np
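# sample 100K items from the unique user IDs; np.random.choice draws with replacement
# by default, so the number of distinct users selected is somewhat below 100K
ids = np.random.choice(small_train['user_id'].unique(), 100000)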

# filter the train dataframe to only include rows where user_ID is in the list of 100K
small_train = small_train[small_train['user_id'].isin(ids)] 
small_train.shape



(307128, 24)
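If exactly N distinct users are wanted, the IDs can be drawn without replacement. The sketch below shows one way to do that; the helper name `sample_users`, the seed, and the toy dataframe are assumptions for the example, and it is an alternative to the sampling used above rather than a record of it.

```python
import numpy as np
import pandas as pd

def sample_users(df, n_users, seed=0):
    """Keep every row belonging to n_users distinct, randomly chosen user IDs."""
    rng = np.random.default_rng(seed)
    unique_ids = df["user_id"].unique()
    # replace=False guarantees that no user ID is drawn twice
    chosen = rng.choice(unique_ids, size=n_users, replace=False)
    return df[df["user_id"].isin(chosen)]

# Hypothetical bookings for five users
toy = pd.DataFrame({
    "user_id":       [1, 1, 2, 3, 3, 3, 4, 5],
    "hotel_cluster": [91, 41, 48, 64, 65, 5, 98, 59],
})

subset = sample_users(toy, n_users=3)
print(subset["user_id"].nunique())
```

All rows for each selected user are kept, so users with multiple bookings stay intact.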

In [24]:
small_train.to_csv('small_train.csv', index=False) # save the smaller file to disk to save time

## Adding Features and Consolidating the Destinations File

Now that the data has been shortened and cleaned up, we can create some features from the remaining columns. Date information such as the trip duration, the advance booking period, and the month of check-in could be useful and predictive features of what type of hotel a customer might book.

In [1]:
# This cell can be removed later. It is used to import the shortened training file to save time.
import pandas as pd
import numpy as np

url = 'small_train.csv'
small_train = pd.read_csv(url, parse_dates=[0, 11, 12])

In [2]:
from datetime import datetime

# days booked in advance
small_train['in_advance'] = (small_train['srch_ci'] - small_train['date_time']) / np.timedelta64(1, 'D') 
# duration of trip
small_train['duration'] = (small_train['srch_co'] - small_train['srch_ci']) / np.timedelta64(1, 'D') 
# month of the check-in date
small_train['trip_month'] = pd.DatetimeIndex(small_train['srch_ci']).month 

Next, we'll take a look at the Destinations file and bring that into our data set. As we saw earlier, the Destinations file has about 60K rows and 150 columns. The values in the destinations file are somehow derived from information about each property, but to the naked eye they look like they contain a lot of redundancy. 

In [3]:
small_train

Unnamed: 0,date_time,site_name,posa_continent,user_location_country,user_location_region,user_location_city,orig_destination_distance,user_id,is_mobile,is_package,...,srch_destination_type_id,is_booking,cnt,hotel_continent,hotel_country,hotel_market,hotel_cluster,in_advance,duration,trip_month
0,2013-11-09 07:45:52,2,3,66,189,2871,2586.0222,3925,1,0,...,3,1,1,2,50,967,42,44.676481,2.0,12
1,2013-11-09 07:48:54,2,3,66,189,2871,2586.0222,3925,1,0,...,3,1,1,2,50,967,42,48.674375,2.0,12
2,2014-05-10 10:17:00,2,3,66,189,2871,1515.3055,3925,0,0,...,1,1,1,2,50,871,7,-0.428472,1.0,5
3,2014-02-20 09:37:23,2,3,66,311,33705,2035.1640,6929,1,1,...,1,1,1,4,96,201,33,38.599039,7.0,3
4,2014-09-10 19:49:58,2,3,66,174,14752,8320.7631,7071,0,0,...,5,1,1,3,48,153,59,7.173634,1.0,9
5,2014-09-11 18:55:55,2,3,66,174,14752,2592.7152,7071,0,0,...,1,1,1,2,50,690,16,4.211169,1.0,9
6,2014-05-26 12:37:08,2,3,66,174,21356,2103.3748,7523,1,1,...,1,1,1,4,8,110,65,6.474213,3.0,6
7,2014-09-22 19:51:29,2,3,66,174,21356,1533.2703,7523,1,0,...,6,1,1,4,8,121,10,39.172581,5.0,11
8,2014-09-29 16:35:56,2,3,66,174,21356,1631.9775,7523,1,0,...,6,1,1,4,8,109,63,37.308380,3.0,11
9,2014-11-16 17:40:38,2,3,66,174,45042,2455.7598,7523,1,0,...,6,1,1,2,50,676,47,3.263449,4.0,11


In [4]:
destinations_url = '~/Documents/Expedia/destinations.csv'
dests = pd.read_csv(destinations_url)


In [5]:
dests

Unnamed: 0,srch_destination_id,d1,d2,d3,d4,d5,d6,d7,d8,d9,...,d140,d141,d142,d143,d144,d145,d146,d147,d148,d149
0,0,-2.198657,-2.198657,-2.198657,-2.198657,-2.198657,-1.897627,-2.198657,-2.198657,-1.897627,...,-2.198657,-2.198657,-2.198657,-2.198657,-2.198657,-2.198657,-2.198657,-2.198657,-2.198657,-2.198657
1,1,-2.181690,-2.181690,-2.181690,-2.082564,-2.181690,-2.165028,-2.181690,-2.181690,-2.031597,...,-2.165028,-2.181690,-2.165028,-2.181690,-2.181690,-2.165028,-2.181690,-2.181690,-2.181690,-2.181690
2,2,-2.183490,-2.224164,-2.224164,-2.189562,-2.105819,-2.075407,-2.224164,-2.118483,-2.140393,...,-2.224164,-2.224164,-2.196379,-2.224164,-2.192009,-2.224164,-2.224164,-2.224164,-2.224164,-2.057548
3,3,-2.177409,-2.177409,-2.177409,-2.177409,-2.177409,-2.115485,-2.177409,-2.177409,-2.177409,...,-2.161081,-2.177409,-2.177409,-2.177409,-2.177409,-2.177409,-2.177409,-2.177409,-2.177409,-2.177409
4,4,-2.189562,-2.187783,-2.194008,-2.171153,-2.152303,-2.056618,-2.194008,-2.194008,-2.145911,...,-2.187356,-2.194008,-2.191779,-2.194008,-2.194008,-2.185161,-2.194008,-2.194008,-2.194008,-2.188037
5,5,-2.174489,-2.174489,-2.174489,-2.174489,-2.174489,-2.155473,-2.174489,-2.174489,-2.174489,...,-2.174489,-2.174489,-2.174489,-2.174489,-2.174489,-2.174489,-2.174489,-2.174489,-2.174489,-2.174489
6,6,-2.174610,-2.174610,-2.174610,-2.174610,-2.174610,-2.137590,-2.174610,-2.174610,-2.174610,...,-2.174610,-2.174610,-2.174610,-2.174610,-2.174610,-2.174610,-2.174610,-2.174610,-2.174610,-2.174610
7,7,-2.221932,-2.226591,-2.226591,-2.226591,-2.095756,-2.019335,-2.207045,-2.217996,-2.224797,...,-2.221932,-2.226591,-2.094537,-2.226591,-2.226591,-2.226591,-2.226591,-2.226591,-2.226591,-2.226591
8,8,-2.201047,-2.201047,-2.201047,-2.150858,-2.150858,-2.030768,-2.194575,-2.195658,-2.201047,...,-2.201047,-2.201047,-2.201047,-2.201047,-2.201047,-2.201047,-2.201047,-2.201047,-2.201047,-2.144392
9,9,-2.175979,-2.175979,-2.175979,-2.175979,-2.175979,-2.141488,-2.175979,-2.175979,-2.175979,...,-2.175979,-2.175979,-2.175979,-2.175979,-2.175979,-2.175979,-2.175979,-2.175979,-2.175979,-2.175979


To eliminate the redundancy and allow us to make the dataset size more manageable, I'll try using Dimensionality Reduction to consolidate the 150 columns to just a handful. On the scikit-learn website, Principal Component Analysis (PCA) is listed first, so I'll try that one with 5 columns.

In [6]:
from sklearn.decomposition import PCA

pca = PCA(n_components=5)
dests_small = pca.fit_transform(dests.iloc[:,1:])

dests_small = pd.DataFrame(dests_small) #PCA transform returns a numpy array, so convert this to a df
dests_small["srch_destination_id"] = dests["srch_destination_id"]

dests_small

Unnamed: 0,0,1,2,3,4,srch_destination_id
0,-0.044268,0.169419,0.032520,-0.014009,-0.069439,0
1,-0.440761,0.077405,-0.091572,-0.020282,0.013145,1
2,0.001033,0.020677,0.012109,0.134043,0.141953,2
3,-0.480467,-0.040345,-0.019320,-0.040051,-0.027350,3
4,-0.207253,-0.042694,-0.011745,-0.017436,-0.019774,4
5,-0.555660,-0.032220,-0.029087,-0.063348,-0.011801,5
6,-0.540659,-0.035689,-0.031810,-0.048695,-0.023824,6
7,0.325618,-0.191197,0.272795,-0.126578,0.087924,7
8,0.064419,-0.109559,0.148284,-0.110965,-0.019260,8
9,-0.525696,-0.029234,-0.025737,-0.054385,-0.002034,9
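Before settling on 5 components, it is worth checking how much of the variance they retain. The sketch below computes the explained variance ratio directly with NumPy on synthetic data (the data, the helper name, and the shapes are assumptions for illustration); scikit-learn's fitted `PCA` object exposes the same quantity as `explained_variance_ratio_`.

```python
import numpy as np

def explained_variance_ratio(X, n_components):
    """Share of total variance captured by each of the first n_components."""
    X_centered = X - X.mean(axis=0)
    # Singular values of the centered data; their squares are proportional
    # to the variance along each principal direction
    singular_values = np.linalg.svd(X_centered, compute_uv=False)
    variances = singular_values ** 2
    return variances[:n_components] / variances.sum()

# Synthetic data: 10 highly redundant columns driven by 2 latent factors
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 10))

ratios = explained_variance_ratio(X, n_components=5)
print(ratios.cumsum())
```

If the cumulative ratio for the destinations data is already close to 1 after a few components, that supports the impression that the 149 columns are highly redundant.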


Next, we'll join the destinations data with the rest of the training data and see how many nulls were introduced.

In [7]:
small_train = pd.merge(small_train, dests_small, on='srch_destination_id', how='left')

small_train.isnull().sum(axis=0)

date_time                       0
site_name                       0
posa_continent                  0
user_location_country           0
user_location_region            0
user_location_city              0
orig_destination_distance       0
user_id                         0
is_mobile                       0
is_package                      0
channel                         0
srch_ci                         0
srch_co                         0
srch_adults_cnt                 0
srch_children_cnt               0
srch_rm_cnt                     0
srch_destination_id             0
srch_destination_type_id        0
is_booking                      0
cnt                             0
hotel_continent                 0
hotel_country                   0
hotel_market                    0
hotel_cluster                   0
in_advance                      0
duration                        0
trip_month                      0
0                            1242
1                            1242
2             

Since there is a relatively small number of nulls out of the 300K+ rows, I'll simply delete them.

In [9]:
small_train = small_train.dropna()
small_train

Unnamed: 0,date_time,site_name,posa_continent,user_location_country,user_location_region,user_location_city,orig_destination_distance,user_id,is_mobile,is_package,...,hotel_market,hotel_cluster,in_advance,duration,trip_month,0,1,2,3,4
0,2013-11-09 07:45:52,2,3,66,189,2871,2586.0222,3925,1,0,...,967,42,44.676481,2.0,12,-0.132873,-0.128608,-0.054228,0.016671,0.120786
1,2013-11-09 07:48:54,2,3,66,189,2871,2586.0222,3925,1,0,...,967,42,48.674375,2.0,12,-0.132873,-0.128608,-0.054228,0.016671,0.120786
2,2014-05-10 10:17:00,2,3,66,189,2871,1515.3055,3925,0,0,...,871,7,-0.428472,1.0,5,-0.271791,-0.047116,0.110057,-0.042924,-0.035744
3,2014-02-20 09:37:23,2,3,66,311,33705,2035.1640,6929,1,1,...,201,33,38.599039,7.0,3,0.790819,1.045323,-0.094523,0.156167,-0.092824
4,2014-09-10 19:49:58,2,3,66,174,14752,8320.7631,7071,0,0,...,153,59,7.173634,1.0,9,0.332302,0.148145,0.030968,-0.095246,-0.089539
5,2014-09-11 18:55:55,2,3,66,174,14752,2592.7152,7071,0,0,...,690,16,4.211169,1.0,9,1.639595,-0.425800,-0.410201,0.049478,0.112685
6,2014-05-26 12:37:08,2,3,66,174,21356,2103.3748,7523,1,1,...,110,65,6.474213,3.0,6,1.739913,1.378982,0.180819,0.229274,-0.263395
7,2014-09-22 19:51:29,2,3,66,174,21356,1533.2703,7523,1,0,...,121,10,39.172581,5.0,11,0.503027,-0.344407,-0.299731,0.078729,-0.196648
8,2014-09-29 16:35:56,2,3,66,174,21356,1631.9775,7523,1,0,...,109,63,37.308380,3.0,11,0.040322,0.433433,-0.022538,-0.059626,-0.031747
9,2014-11-16 17:40:38,2,3,66,174,45042,2455.7598,7523,1,0,...,676,47,3.263449,4.0,11,0.500923,-0.110007,0.218703,-0.102777,-0.132509


## Testing Machine Learning Models

Now the data is ready to try different machine learning models to see how they perform. Because the hotel_cluster field that we are predicting is a categorical label that shows no linear relationship with any of the features in the dataset, linear models such as linear regression and logistic regression are unlikely to be effective.

Instead, I'll try using K Nearest Neighbors and Random Forests to see how those two models perform, starting with K Nearest Neighbors. I'll first use all of the columns in the dataset as-is, and do a simple test-train split to see how the model performs. (I excluded the date/time columns since there are features computed using those, and I'm also excluding the "cnt" column since it refers to the count of events in the user's entire session, and we've excluded the clicks from the analysis.)

In [11]:
# create feature matrix X

feature_cols = ['site_name', 'posa_continent', 'user_location_country',
                'user_location_city',  'user_id', 'hotel_market', 'srch_destination_type_id', 
                'is_booking', 'user_location_region', 'hotel_continent', 'hotel_country', 
                'is_mobile', 'is_package', 'srch_adults_cnt', 'srch_children_cnt', 
                'orig_destination_distance', 'srch_rm_cnt',   
                'duration', 'in_advance', 0, 1, 2, 3, 4]

X = small_train[feature_cols]
y = small_train['hotel_cluster']

# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)




After the test-train split is done, we'll import the K Nearest Neighbors classifier, instantiate the model, train it, and compute the accuracy of the predictions. 

In [14]:
# import the classifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

# instantiate the model
knn = KNeighborsClassifier(n_neighbors=10) # try knn with 10 first

# train the model
knn.fit(X_train, y_train)

# make the prediction
y_pred_class = knn.predict(X_test)

print(metrics.accuracy_score(y_test, y_pred_class)) # This is the accuracy

0.0650957213098


The accuracy of 6.5% is barely better than always guessing the most commonly occurring hotel_cluster in the response set y, which would be right about 4.9% of the time.

In [15]:
y.value_counts(1).head(1)

91    0.049244
Name: hotel_cluster, dtype: float64
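A slightly stricter version of this baseline picks the most common cluster from the training split only and scores it on the test split, the same way the models are scored. The sketch below uses plain Python and made-up labels; the function name and the toy lists are assumptions for illustration.

```python
from collections import Counter

def majority_baseline_accuracy(y_train, y_test):
    """Accuracy of always predicting the most frequent training label."""
    most_common_label, _ = Counter(y_train).most_common(1)[0]
    hits = sum(1 for label in y_test if label == most_common_label)
    return most_common_label, hits / len(y_test)

# Hypothetical cluster labels
y_train_toy = [91, 91, 41, 48, 91, 64]
y_test_toy = [91, 41, 91, 5]

label, acc = majority_baseline_accuracy(y_train_toy, y_test_toy)
print(label, acc)
```

Any model worth keeping should beat this number by a clear margin.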

One option for trying to improve the accuracy is to remove redundant columns. A lot of the columns include information about the origin of the user (site_name, posa_continent, user_location_country) that may not add additional value beyond the user's city of origin. There are also a number of columns that refer to the destination city. I will try the analysis again using only one origin and one destination column and repeat the process.

In [16]:
# create reduced feature matrix X
feature_cols = ['user_location_city', 'user_id', 'hotel_market', 'srch_destination_type_id', 
                'is_booking', 'is_mobile', 'is_package', 'srch_adults_cnt', 'srch_children_cnt', 
                'orig_destination_distance', 'srch_rm_cnt', 'duration', 'in_advance', 0, 1, 2, 3, 4]

X = small_train[feature_cols]
y = small_train['hotel_cluster']

# do the split again
X_train, X_test, y_train, y_test = train_test_split(X, y)

# train the model
knn.fit(X_train, y_train)

# make the prediction
y_pred_class = knn.predict(X_test)

print(metrics.accuracy_score(y_test, y_pred_class)) # This is the accuracy

0.0664295428392


This result is only slightly better and in some earlier runs was worse. The poor results could be due to many of the features being numeric IDs: K Nearest Neighbors measures distance between rows, so IDs that happen to be numerically close are treated as similar even though the numbers carry no meaning. In the next attempt, I'll try reducing the feature set further and using dummy variables for several of the features. I'll also scale the dataset since some values (like "orig_destination_distance" and "in_advance") are quite large compared to the other values.

In [16]:
# create reduced feature matrix X
feature_cols = ['is_booking', 'is_mobile', 'is_package', 'srch_adults_cnt', 'srch_children_cnt', 
                'orig_destination_distance', 'srch_rm_cnt', 'duration', 'in_advance', 0, 1, 2, 3, 4]

X = small_train[feature_cols]

# Make dummy variables
#df_market = pd.get_dummies(small_train['hotel_market'], prefix='market')
df_tripmonth = pd.get_dummies(small_train['trip_month'], prefix='month')
df_destid = pd.get_dummies(small_train['srch_destination_type_id'], prefix='destid')

#join the dummy variables to the feature matrix
#X = X.join(df_market)
X = X.join(df_tripmonth)
X = X.join(df_destid)

y = small_train['hotel_cluster']

# do the split again
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Scale the data
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# standardize X_train
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)


In [17]:
X_train_scaled.shape

(228633L, 31L)

In [18]:
# train the model
knn.fit(X_train_scaled, y_train)

# make the prediction
y_pred_class = knn.predict(X_test_scaled)

print(metrics.accuracy_score(y_test, y_pred_class)) # This is the accuracy

0.0924538452454


With 31 columns and 229K rows, this prediction took approximately 30 minutes to run to completion. While the results of 0.09 are a significant improvement over the prior cases, the predictions are not nearly good enough to be useful. 

For the next step, I decided to try some different models based on the scikit-learn "cheat sheet", which directed me to the Stochastic Gradient Descent (SGD) and kernel approximation methods. I will try SGD first.

If that does not work either, I will try a simpler approach that predicts the most popular hotel cluster for different combinations of factors. For example:

- For a given user ID and hotel market, find the most popular cluster and predict that for that particular combo
- If I don't have a match on user ID, but I do on hotel market, predict the most popular cluster for that destination.
- If I don't have a match on hotel market, but I do on user ID, predict the most popular cluster for that user ID (based on their past activity, including both clicks and bookings).
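The fallback order in the list above can be sketched with plain dictionaries of counters. This is only an illustration on made-up rows: the function names `build_tables` and `predict` and the toy data are assumptions, and ties are broken arbitrarily.

```python
from collections import Counter, defaultdict

def build_tables(rows):
    """rows: iterable of (user_id, hotel_market, hotel_cluster) tuples."""
    by_user_market = defaultdict(Counter)
    by_market = defaultdict(Counter)
    by_user = defaultdict(Counter)
    overall = Counter()
    for user_id, market, cluster in rows:
        by_user_market[(user_id, market)][cluster] += 1
        by_market[market][cluster] += 1
        by_user[user_id][cluster] += 1
        overall[cluster] += 1
    return by_user_market, by_market, by_user, overall

def predict(user_id, market, tables):
    """Try user+market first, then market, then user, then the overall favorite."""
    by_user_market, by_market, by_user, overall = tables
    for table, key in ((by_user_market, (user_id, market)),
                       (by_market, market),
                       (by_user, user_id)):
        if key in table:
            return table[key].most_common(1)[0][0]
    return overall.most_common(1)[0][0]

# Hypothetical training rows
rows = [(12, 628, 1), (12, 628, 1), (12, 628, 5),
        (93, 1457, 80), (93, 1457, 80),
        (7, 628, 5), (7, 628, 5), (7, 628, 5)]
tables = build_tables(rows)

print(predict(12, 628, tables))   # known user and market
print(predict(99, 628, tables))   # unknown user, known market
print(predict(93, 999, tables))   # known user, unknown market
print(predict(99, 999, tables))   # nothing known
```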

In [23]:
# create reduced feature matrix X
feature_cols = ['is_booking', 'is_mobile', 'is_package', 'srch_adults_cnt', 'srch_children_cnt', 
                'orig_destination_distance', 'srch_rm_cnt', 'duration', 'in_advance', 0, 1, 2, 3, 4]

X = small_train[feature_cols]

# Make dummy variables
df_market = pd.get_dummies(small_train['hotel_market'], prefix='market')
df_tripmonth = pd.get_dummies(small_train['trip_month'], prefix='month')
df_destid = pd.get_dummies(small_train['srch_destination_type_id'], prefix='destid')

#join the dummy variables to the feature matrix
X = X.join(df_market)
X = X.join(df_tripmonth)
X = X.join(df_destid)

y = small_train['hotel_cluster']

# do the split again
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Since the docs say SGD is sensitive to feature scaling, need to scale the data
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# standardize X_train
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [24]:
from sklearn.linear_model import SGDClassifier

# note: newer scikit-learn releases spell this loss "log_loss" instead of "log"
sgd = SGDClassifier(loss="log", penalty="l2", shuffle=True)

# train the model
sgd.fit(X_train_scaled, y_train)

# make the prediction
y_pred_class = sgd.predict(X_test_scaled)

print(metrics.accuracy_score(y_test, y_pred_class)) # This is the accuracy

0.110851030443


Reading the documentation for SGD led me to change the loss function from the default of "hinge" to "log" (which fits a logistic regression model), since "log" seemed better suited to multiclass classification. This improved the accuracy from approximately 0.03 to 0.11.


In [43]:
from pandasql import sqldf

pysqldf = lambda q: sqldf(q, globals())

# create reduced feature matrix X
feature_cols = ['user_id', 'hotel_market', 'is_booking', 'hotel_cluster']

X = small_train[feature_cols]
y = small_train['hotel_cluster'] # we actually don't use this since we're not feeding it into a model

# split the data so we only find the most frequent values in the split training data
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Find the most common hotel_cluster for a specific user and hotel market in the split training data
q  = """
SELECT 
sub.user_id, sub.hotel_market, sub.hotel_cluster, MAX(sub.cluster_count) as max_count 
FROM
(SELECT user_id, hotel_market, hotel_cluster, SUM(is_booking) AS cluster_count
FROM X_train
GROUP BY
user_id, hotel_market, hotel_cluster) sub
GROUP BY
sub.user_id, sub.hotel_market
"""

user_market_cluster = pysqldf(q) # Execute the query and assign the results to a new dataframe

# Now find the most common cluster for each specific user_id
q  = """
SELECT 
sub.user_id, sub.hotel_cluster, MAX(sub.cluster_count) as max_count 
FROM
(SELECT user_id, hotel_cluster, SUM(is_booking) AS cluster_count
FROM X_train
GROUP BY
user_id, hotel_cluster) sub
GROUP BY
sub.user_id
"""

user_cluster = pysqldf(q) # Execute the query and assign the results to a new dataframe

# Finally, find the most common cluster for each specific hotel_market
q  = """
SELECT 
sub.hotel_market, sub.hotel_cluster, MAX(sub.cluster_count) as max_count 
FROM
(SELECT hotel_market, hotel_cluster, SUM(is_booking) AS cluster_count
FROM X_train
GROUP BY
hotel_market, hotel_cluster) sub
GROUP BY
sub.hotel_market
"""

market_cluster = pysqldf(q) # Execute the query and assign the results to a new dataframe

user_market_cluster # show what the new dataframe looks like

Unnamed: 0,user_id,hotel_market,hotel_cluster,max_count
0,28,191,59,1
1,47,90,62,1
2,47,91,12,1
3,137,633,54,1
4,140,628,79,1
5,140,657,42,1
6,141,628,54,1
7,141,701,56,1
8,179,679,77,1
9,179,682,15,1


In [66]:
# Now we'll predict the hotel cluster on the test data by looking up the most frequent hotel
# cluster in the tables created earlier

def lookup(row):
    
    # First look in the user_id / hotel_market table for a match
    match = user_market_cluster['hotel_cluster'][(user_market_cluster.user_id == row.user_id) & (user_market_cluster.hotel_market == row.hotel_market)]
    
    if len(match):
        return match.values[0]

    # If not found, next look for the most popular cluster for a particular hotel market
    match = market_cluster['hotel_cluster'][(market_cluster.hotel_market == row.hotel_market)]

    if len(match):
        return match.values[0]

    # If still not found, look for the most popular cluster for a particular user
    match = user_cluster['hotel_cluster'][(user_cluster.user_id == row.user_id)]

    if len(match):
        return match.values[0]
 
    # If not found for any of these, simply predict the most popular cluster overall
    return X_train["hotel_cluster"].value_counts().index[0]

# make the prediction
y_pred_class = X_test.apply(lookup, axis=1)

print(metrics.accuracy_score(y_test, y_pred_class)) # This is the accuracy


0.240689402657


This is the best result by far; the results improved significantly (from ~0.15 to ~0.24) by prioritizing the most popular cluster for a market over the most popular cluster for a user (in cases where a match on both user and hotel market cannot be found). Including the user/market match also performed much better than skipping this step.



In [99]:
# create feature matrix X

#feature_cols = ['site_name', 'posa_continent', 'user_location_country',
#                'user_location_city',  'user_id', 'hotel_market', 'srch_destination_type_id', 
#                'is_booking', 'user_location_region', 'hotel_continent', 'hotel_country', 
#                'is_mobile', 'is_package', 'srch_adults_cnt', 'srch_children_cnt', 
#                'orig_destination_distance', 'srch_rm_cnt',   
#                'duration', 'in_advance', 0, 1, 2, 3, 4]

feature_cols = ['user_id', 'hotel_market', 'srch_children_cnt']

X = small_train[feature_cols]
y = small_train['hotel_cluster']

X_train, X_test, y_train, y_test = train_test_split(X, y)

# instantiate the model
knn = KNeighborsClassifier(n_neighbors=3) # try knn with 3 neighbors this time

# train the model
knn.fit(X_train, y_train)

# make the prediction
y_pred_class = knn.predict(X_test)

print(metrics.accuracy_score(y_test, y_pred_class)) # This is the accuracy

0.110001046135
