Online travel agencies are scrambling to meet the AI-driven personalization standard set by companies like Amazon and Netflix. In addition, online travel has become a highly competitive space where brands try to capture our attention (and wallets) by recommending, comparing, matching, and sharing. For this assignment, we would like to create optimal hotel recommendations for Expedia users who are searching for a hotel to book. You need to predict which “hotel cluster” a user is likely to book, given their search details.

The data set can be found on Kaggle: Expedia Hotel Recommendations. To get started, I would suggest exploring the file train.csv, which contains the logs of user behavior. There is another file named destinations.csv, which contains information related to hotel reviews made by users. There is a lot of data here, and making an accurate prediction is rather difficult; e.g., simply running a standard prediction algorithm will probably yield below 10% accuracy. Start by doing some exploratory analysis of this data to help understand how to predict the hotel cluster the user is likely to select. Then split train.csv into a training and a test set (feel free to select a smaller random subset of train.csv). Next, build at least two prediction models from the training set and report their accuracies on the test set. As I mentioned, this is a difficult problem, so be creative with your solutions. You might want to try building your own predictor rather than using a standard model, e.g., a random forest. The purpose of this project is not necessarily to get great results but to understand the nuances and challenges of such problems.

In [1]:
import numpy as np
import pandas as pd
import random
random.seed(42)

In [2]:
import os
print(os.listdir('../Week7'))

['.ipynb_checkpoints', 'expedia-hotel-recommendations', 'expedia-hotel-recommendations.zip', 'video.ipynb', 'Week7_test.ipynb']


In [3]:
file = '../Week7/expedia-hotel-recommendations/train.csv'

In [4]:
train = pd.read_csv(file, nrows=100000)

In [5]:
train.shape

(100000, 24)

In [6]:
train.head()

Unnamed: 0,date_time,site_name,posa_continent,user_location_country,user_location_region,user_location_city,orig_destination_distance,user_id,is_mobile,is_package,...,srch_children_cnt,srch_rm_cnt,srch_destination_id,srch_destination_type_id,is_booking,cnt,hotel_continent,hotel_country,hotel_market,hotel_cluster
0,2014-08-11 07:46:59,2,3,66,348,48862,2234.2641,12,0,1,...,0,1,8250,1,0,3,2,50,628,1
1,2014-08-11 08:22:12,2,3,66,348,48862,2234.2641,12,0,1,...,0,1,8250,1,1,1,2,50,628,1
2,2014-08-11 08:24:33,2,3,66,348,48862,2234.2641,12,0,0,...,0,1,8250,1,0,1,2,50,628,1
3,2014-08-09 18:05:16,2,3,66,442,35390,913.1932,93,0,0,...,0,1,14984,1,0,1,2,50,1457,80
4,2014-08-09 18:08:18,2,3,66,442,35390,913.6259,93,0,0,...,0,1,14984,1,0,1,2,50,1457,21


In [7]:
train.columns

Index(['date_time', 'site_name', 'posa_continent', 'user_location_country',
       'user_location_region', 'user_location_city',
       'orig_destination_distance', 'user_id', 'is_mobile', 'is_package',
       'channel', 'srch_ci', 'srch_co', 'srch_adults_cnt', 'srch_children_cnt',
       'srch_rm_cnt', 'srch_destination_id', 'srch_destination_type_id',
       'is_booking', 'cnt', 'hotel_continent', 'hotel_country', 'hotel_market',
       'hotel_cluster'],
      dtype='object')

In [8]:
# train.info

In [9]:
nums = train.hotel_cluster.unique()

In [10]:
nums.sort()

In [11]:
print(nums)

[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
 96 97 98 99]
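
The output above shows 100 distinct clusters (0 to 99), so a uniform random guess would be right about 1% of the time. A useful next reference point is the accuracy of always predicting the most frequent cluster. The sketch below shows one way to compute both numbers; the `clusters` series holds made-up values for illustration, not Expedia data, and would be replaced by `train.hotel_cluster`.

```python
import pandas as pd

# Toy stand-in for the hotel_cluster column (made-up values, not Expedia data).
clusters = pd.Series([1, 1, 80, 21, 1, 21, 5, 1, 80, 1], name="hotel_cluster")

# Share of rows belonging to each cluster, largest share first.
shares = clusters.value_counts(normalize=True)

# Accuracy of always predicting the single most frequent cluster.
majority_cluster = shares.index[0]
majority_accuracy = shares.iloc[0]

# Accuracy of a uniform random guess over the clusters that occur.
uniform_accuracy = 1 / clusters.nunique()

print(majority_cluster, majority_accuracy, uniform_accuracy)
```

Any model worth reporting should beat the majority-class number, which is why it is worth computing before fitting anything.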


Read in chunks (https://stackoverflow.com/questions/38818609/skip-rows-with-missing-values-in-read-csv) and remove any rows containing NaN.

In [12]:
file = '../Week7/expedia-hotel-recommendations/train.csv'

In [13]:
# # okay, this takes basically forever (at least half an hour)
# %%time
# result = pd.DataFrame()
# df = pd.read_csv(file, chunksize=1000)
# for chunk in df:
#     chunk.dropna(axis=0, inplace=True) # Dropping all rows with any NaN value
# #     chunk[colToConvert] = chunk[colToConvert].astype(np.uint32)
#     result = result.append(chunk)
# del df, chunk
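
One likely reason the commented-out loop above is so slow: `result.append(chunk)` copies the whole accumulated frame on every iteration, and `DataFrame.append` was removed in pandas 2.0. A minimal sketch of the usual alternative is to collect the cleaned chunks in a list and concatenate once at the end. The in-memory CSV and the `chunksize=2` value are placeholders for illustration; on the real file a much larger chunk size than 1000 would probably help as well, though I have not timed it.

```python
import io

import pandas as pd

# Small in-memory CSV standing in for train.csv (made-up values).
csv_text = "a,b\n1,2\n3,\n5,6\n,8\n9,10\n"

kept = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    # Drop every row in this chunk that has at least one NaN.
    kept.append(chunk.dropna(axis=0))

# Concatenate once, instead of growing a DataFrame inside the loop.
result = pd.concat(kept, ignore_index=True)
print(result.shape)
```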

In [14]:
train = pd.read_csv(file, nrows=100000)

In [15]:
train.shape

(100000, 24)

In [16]:
train.dropna(axis=0, inplace=True)

In [17]:
train.shape

(63023, 24)
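
Dropping every row with a NaN removed about 37% of the sample (100,000 rows down to 63,023), so it is worth checking which columns are responsible before discarding that much data. A sketch of that check follows; the `toy` frame reuses three column names from `train.columns`, but all of its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; only the column names come from train.csv.
toy = pd.DataFrame({
    "orig_destination_distance": [2234.2641, np.nan, 913.1932, np.nan],
    "srch_ci": ["2014-08-27", "2014-08-29", None, "2014-11-23"],
    "hotel_cluster": [1, 1, 80, 21],
})

# Number of missing values in each column, worst offender first.
missing_per_column = toy.isna().sum().sort_values(ascending=False)

# Rows that would survive dropna(axis=0).
rows_kept = len(toy.dropna(axis=0))

print(missing_per_column)
print(rows_kept)
```

If one column accounts for most of the missing values, dropping or imputing that column may keep far more rows than dropping every incomplete row.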

In [18]:
# %%time # 15.2 s
# n = sum(1 for line in open(file)) - 1

In [20]:
# n # 37,670,293 rows

Get a random sample of size *s*

In [21]:
%%time 
# 1 min 16 s for 100k row sample; same for 10k
n = 37670293 # number of records in file
s = 100000 # desired sample size
skip = sorted(random.sample(range(n), n-s))
colNames = pd.read_csv(file, nrows=1).columns
ranDF = pd.read_csv(file, skiprows=skip, names=colNames)

Wall time: 1min 44s


In [22]:
ranDF.shape

(100001, 24)

In [23]:
ranDF.head()

Unnamed: 0,date_time,site_name,posa_continent,user_location_country,user_location_region,user_location_city,orig_destination_distance,user_id,is_mobile,is_package,...,srch_children_cnt,srch_rm_cnt,srch_destination_id,srch_destination_type_id,is_booking,cnt,hotel_continent,hotel_country,hotel_market,hotel_cluster
0,2014-08-11 08:22:12,2,3,66,348,48862,2234.2641,12,0,1,...,0,1,8250,1,1,1,2,50,628,1
1,2014-04-15 12:37:26,2,3,66,260,19022,834.105,8252,0,1,...,1,1,8268,1,0,1,2,50,682,77
2,2014-09-22 14:06:53,2,3,66,153,50542,2926.2327,15632,0,1,...,1,1,13094,3,0,1,2,50,212,21
3,2014-12-28 12:31:26,2,3,66,220,43026,4703.7133,23234,0,0,...,0,1,8859,1,0,1,2,50,212,41
4,2014-11-03 15:57:09,2,3,66,254,21713,1583.6919,23532,1,0,...,0,2,8824,1,1,1,4,8,118,30


~Somehow we're losing the header, hmmm~   Fixed!
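
The shape `(100001, 24)` instead of 100,000 rows looks like an off-by-one in the skip list. The file has `n + 1` lines (one header plus `n` records), but `range(n)` covers lines 0 to `n - 1`, so the header at line 0 can be skipped (hence the lost column names) while the last record can never be skipped. A sketch of one way around this, assuming the header is always line 0, is to sample the skipped lines from `range(1, n + 1)` and let pandas read the header itself. The temporary file and its `row_id`/`value` columns are hypothetical stand-ins for train.csv.

```python
import os
import random
import tempfile

import pandas as pd

n = 20  # number of data rows in the file, excluding the header
s = 5   # desired sample size

# Write a tiny CSV (header + n data rows) to stand in for train.csv.
path = os.path.join(tempfile.mkdtemp(), "toy.csv")
pd.DataFrame({"row_id": range(n), "value": range(100, 100 + n)}).to_csv(path, index=False)

random.seed(42)
# Line 0 is the header and data rows are lines 1..n, so only sample skips from 1..n.
skip = sorted(random.sample(range(1, n + 1), n - s))

# The header is never skipped, so there is no need to pass names= separately.
sample_df = pd.read_csv(path, skiprows=skip)
print(sample_df.shape)
```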

In [24]:
ranDF.dropna(axis=0, inplace=True)

In [25]:
ranDF.shape

(63937, 24)

#### Split into train and test groups

Using pandas

In [26]:
train_pd = ranDF.sample(frac=0.8, random_state=42)
# Everything not sampled into train_pd becomes the test set
test_pd = ranDF.drop(train_pd.index)

In [None]:
train_pd.shape

In [None]:
test_pd.shape
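
With 100 clusters, a plain random split can leave rare clusters over- or under-represented in the test set. A possible variation on the pandas approach is to sample 80% within each cluster using `groupby(...).sample`, which needs pandas 1.1 or later and a unique index. The `toy` frame and the names `train_strat` and `test_strat` are made up for this sketch.

```python
import pandas as pd

# Toy frame with made-up values: three clusters of different sizes.
toy = pd.DataFrame({
    "hotel_cluster": [1] * 10 + [21] * 10 + [80] * 5,
    "user_id": range(25),
})

# Sample 80% of the rows inside each cluster, keeping the original index.
train_strat = toy.groupby("hotel_cluster").sample(frac=0.8, random_state=42)

# The remaining rows form the test set.
test_strat = toy.drop(train_strat.index)

print(train_strat["hotel_cluster"].value_counts().sort_index())
print(len(train_strat), len(test_strat))
```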

Using scikit-learn *(I like this one, it's the cleanest)*

In [None]:
from sklearn.model_selection import train_test_split

train_sk, test_sk = train_test_split(ranDF, test_size=0.2, random_state=42)

In [None]:
train_sk.shape

In [None]:
test_sk.shape

Using numpy

In [None]:
# import numpy as np
# mask = np.random.rand(len(ranDF)) <= 0.8  # `=<` is not a valid Python operator; use `<=`
# train_np = ranDF[mask]
# test_np = ranDF[~mask]

# # With `<=` this runs, but the split is only approximately 80/20; I didn't like it anyway

In [None]:
train_sk.head()

In [None]:
train_sk.describe()