# Expedia

### Introduction
Planning your dream vacation, or even a weekend escape, can be an overwhelming affair. With hundreds, even thousands, of hotels to choose from at every destination, it's difficult to know which will suit your personal preferences. Should you go with an old standby with those pillow mints you like, or risk a new hotel with a trendy pool bar? 

Expedia wants to take the proverbial rabbit hole out of hotel search by providing personalized hotel recommendations to their users. This is no small task for a site with hundreds of millions of visitors every month!

Currently, Expedia uses search parameters to adjust their hotel recommendations, but there isn't enough customer-specific data to personalize them for each user. In this competition, Expedia is challenging Kagglers to contextualize customer data and predict the likelihood that a user will stay at each of 100 different hotel groups.

The data in this competition is a random selection from Expedia and is not representative of the overall statistics. 

### Data Description
Expedia has provided you with logs of customer behavior. These include what customers searched for, how they interacted with search results (click/book), and whether or not the search result was a travel package.

Expedia is interested in predicting which hotel group a user is going to book. Expedia has in-house algorithms to form hotel clusters, where similar hotels for a search (based on historical price, customer star ratings, geographical locations relative to city center, etc.) are grouped together. These hotel clusters serve as good identifiers of which types of hotels people are going to book, while avoiding outliers such as new hotels that don't have historical data.

Your goal in this competition is to predict the booking outcome (hotel cluster) for a user event, based on their search and other attributes associated with that user event.

The train and test datasets are split based on time: the training data are from 2013 and 2014, while the test data are from 2015. The public/private leaderboard data are split based on time as well. Training data includes all the users in the logs, including both click events and booking events. Test data only includes booking events.
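
Since the test set is later in time than the training set and contains only bookings, a local holdout can be built the same way. The following is a minimal sketch on a small made-up frame; the helper name `time_based_split` and the cutoff date are illustrative assumptions, not part of the competition setup.

```python
import pandas as pd

def time_based_split(df, cutoff, time_col="date_time", booking_col="is_booking"):
    """Split df into (fit, holdout): rows before cutoff vs. bookings on/after cutoff."""
    times = pd.to_datetime(df[time_col])
    cutoff = pd.Timestamp(cutoff)
    fit = df[times < cutoff]
    # Mirror the test set: keep only booking events in the holdout part
    holdout = df[(times >= cutoff) & (df[booking_col] == 1)]
    return fit, holdout

# Hypothetical toy rows standing in for train.csv
toy = pd.DataFrame({
    "date_time": ["2013-05-01 10:00:00", "2014-03-02 09:30:00",
                  "2014-08-11 07:46:59", "2014-09-01 12:00:00"],
    "is_booking": [0, 1, 0, 1],
    "hotel_cluster": [1, 5, 7, 5],
})
fit_part, holdout_part = time_based_split(toy, "2014-07-01")
print(len(fit_part), len(holdout_part))
```

The earlier part keeps clicks as well as bookings, because clicks may still carry signal for fitting, while the holdout is restricted to bookings like the real test set.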

destinations.csv consists of features extracted from hotel review text.

Note that some srch_destination_id's in the train/test files don't exist in the destinations.csv file. This is because some hotels are new and don't have enough features in the latent space. Your algorithm should be able to handle this missing information.
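
One way to cope with destinations that have no latent features is a left join that keeps every event and flags the unmatched ones. This is only a sketch on made-up frames: the feature column names `d1` and `d2`, the flag column `dest_missing`, and the choice of 0.0 as a fill value are all assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy frames; the real files are train.csv and destinations.csv
events = pd.DataFrame({"srch_destination_id": [8250, 8250, 99999],
                       "hotel_cluster": [1, 5, 7]})
dest = pd.DataFrame({"srch_destination_id": [8250], "d1": [-2.2], "d2": [-1.9]})

# A left join keeps every event; unmatched destinations get NaN features
merged = events.merge(dest, on="srch_destination_id", how="left", indicator=True)

# Keep an explicit flag so a model can tell "missing" apart from a real value
merged["dest_missing"] = (merged["_merge"] == "left_only").astype(int)
merged = merged.drop(columns="_merge")

# Fill the absent latent features with a neutral placeholder (assumed choice)
feature_cols = ["d1", "d2"]
merged[feature_cols] = merged[feature_cols].fillna(0.0)
print(merged)
```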

### File descriptions
- train.csv - the training set
- test.csv - the test set
- destinations.csv - hotel search latent attributes
- sample_submission.csv - a sample submission file in the correct format


| # | Column name | Description | Data type |
|---|---|---|---|
| 0 | date_time | Timestamp | string |
| 1 | site_name | ID of the Expedia point of sale (i.e. Expedia.... | int |
| 2 | posa_continent | ID of continent associated with site_name | int |
| 3 | user_location_country | The ID of the country the customer is located | int |
| 4 | user_location_region | The ID of the region the customer is located | int |
| 5 | user_location_city | The ID of the city the customer is located | int |
| 6 | orig_destination_distance | Physical distance between a hotel and a custom... | double |
| 7 | user_id | ID of user | int |
| 8 | is_mobile | 1 when a user connected from a mobile device, ... | tinyint |
| 9 | is_package | 1 if the click/booking was generated as a part... | int |


In [1]:
# import libraries
import numpy as np
import pandas as pd
from datetime import datetime as dt

In [3]:
# IMPORT DATA
start = dt.now()
# Local data directory (named data_dir to avoid shadowing the built-in dir)
data_dir = 'C:\\Users\\Lenovo\\PycharmProjects\\Kaggle\\Project2_Expedia\\'
# Load only the first rows of train and test, not the whole files
train_set = pd.read_csv(data_dir + 'train.csv', nrows=1000000)
test_set = pd.read_csv(data_dir + 'test.csv', nrows=10000)
dest_set = pd.read_csv(data_dir + 'destinations.csv')
end = dt.now()
'Import lasted: {0} seconds.\nTotal number of rows imported: {1} (train), {2} (test).'\
    .format((end - start).seconds, train_set.shape[0], test_set.shape[0])

'Import lasted: 7'

In [25]:
# Train set summary:
columns = pd.read_excel('C:\\Users\\Lenovo\\PycharmProjects\\Kaggle\\Project2_Expedia\\Fields_description.xlsx')
nulls = train_set.isnull().sum()
nulls.name = 'Nulls'
uniques = train_set.nunique()  # number of distinct non-null values per column
uniques.name = 'UniqueValues'
columns.set_index('column_name').join(nulls).join(uniques)

column_name,Description,Data type,Nulls,UniqueValues
date_time,Timestamp,string,0,987199
site_name,ID of the Expedia point of sale (i.e. Expedia....,int,0,42
posa_continent,ID of continent associated with site_name,int,0,5
user_location_country,The ID of the country the customer is located,int,0,183
user_location_region,The ID of the region the customer is located,int,0,776
user_location_city,The ID of the city the customer is located,int,0,12920
orig_destination_distance,Physical distance between a hotel and a custom...,double,370247,400861
user_id,ID of user,int,0,32783
is_mobile,"1 when a user connected from a mobile device, ...",tinyint,0,2
is_package,1 if the click/booking was generated as a part...,int,0,2


date_time                         0
site_name                         0
posa_continent                    0
user_location_country             0
user_location_region              0
user_location_city                0
orig_destination_distance    370247
user_id                           0
is_mobile                         0
is_package                        0
channel                           0
srch_ci                         998
srch_co                         999
srch_adults_cnt                   0
srch_children_cnt                 0
srch_rm_cnt                       0
srch_destination_id               0
srch_destination_type_id          0
is_booking                        0
cnt                               0
hotel_continent                   0
hotel_country                     0
hotel_market                      0
hotel_cluster                     0
dtype: int64

In [5]:
train_set.head().transpose()

Unnamed: 0,0,1,2,3,4
date_time,2014-08-11 07:46:59,2014-08-11 08:22:12,2014-08-11 08:24:33,2014-08-09 18:05:16,2014-08-09 18:08:18
site_name,2,2,2,2,2
posa_continent,3,3,3,3,3
user_location_country,66,66,66,66,66
user_location_region,348,348,348,442,442
user_location_city,48862,48862,48862,35390,35390
orig_destination_distance,2234.26,2234.26,2234.26,913.193,913.626
user_id,12,12,12,93,93
is_mobile,0,0,0,0,0
is_package,1,1,0,0,0


In [7]:
test_set.head().transpose()

Unnamed: 0,0,1,2,3,4
id,0,1,2,3,4
date_time,2015-09-03 17:09:54,2015-09-24 17:38:35,2015-06-07 15:53:02,2015-09-14 14:49:10,2015-07-17 09:32:04
site_name,2,2,2,2,2
posa_continent,3,3,3,3,3
user_location_country,66,66,66,66,66
user_location_region,174,174,142,258,467
user_location_city,37449,37449,17440,34156,36345
orig_destination_distance,5539.06,5873.29,3975.98,1508.6,66.7913
user_id,1,1,20,28,50
is_mobile,1,1,0,0,0


## Let's explore both sets:
1. Common columns
2. Dimensions
3. dtypes
4. Unique values (for categorization)
5. Nulls

In [8]:
# COMMON COLUMNS
setA = pd.DataFrame({'Train': train_set.columns})
setB = pd.DataFrame({'Train': test_set.columns, 'Test': ['X' for x in range(test_set.shape[1])]})
pd.merge(setA, setB, how='outer', left_on='Train', right_on='Train')

Unnamed: 0,Train,Test
0,date_time,X
1,site_name,X
2,posa_continent,X
3,user_location_country,X
4,user_location_region,X
5,user_location_city,X
6,orig_destination_distance,X
7,user_id,X
8,is_mobile,X
9,is_package,X


In [5]:
# Dimensions
train_set.shape, test_set.shape

((1000000, 24), (10000, 22))

In [9]:
# DATA TYPES
train_set.dtypes

date_time                     object
site_name                      int64
posa_continent                 int64
user_location_country          int64
user_location_region           int64
user_location_city             int64
orig_destination_distance    float64
user_id                        int64
is_mobile                      int64
is_package                     int64
channel                        int64
srch_ci                       object
srch_co                       object
srch_adults_cnt                int64
srch_children_cnt              int64
srch_rm_cnt                    int64
srch_destination_id            int64
srch_destination_type_id       int64
is_booking                     int64
cnt                            int64
hotel_continent                int64
hotel_country                  int64
hotel_market                   int64
hotel_cluster                  int64
dtype: object

Important: the null percentages shown below for the test set are lower than in the full file, because only the first 10,000 rows were loaded and not the whole data set.

In [10]:
# % OF NULL VALUES
setA = (train_set.isnull().sum() / train_set.shape[0]) * 100
setB = (test_set.isnull().sum() / test_set.shape[0]) * 100
pd.concat([setA, setB], join='outer', axis=1, keys=('Train', 'Test'))

Unnamed: 0,Train,Test
channel,0.0,0.0
cnt,0.0,
date_time,0.0,0.0
hotel_cluster,0.0,
hotel_continent,0.0,0.0
hotel_country,0.0,0.0
hotel_market,0.0,0.0
id,,0.0
is_booking,0.0,
is_mobile,0.0,0.0
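
To get null percentages for a whole file without loading it at once, the counts can be accumulated chunk by chunk. This is a sketch that reads a small in-memory CSV in place of test.csv; the helper name `null_share` and the chunk size are assumptions.

```python
import io
import pandas as pd

def null_share(source, chunksize=100_000):
    """Return the percentage of nulls per column, reading the CSV in chunks."""
    null_counts = None
    total_rows = 0
    for chunk in pd.read_csv(source, chunksize=chunksize):
        counts = chunk.isnull().sum()
        null_counts = counts if null_counts is None else null_counts + counts
        total_rows += len(chunk)
    return null_counts / total_rows * 100

# Hypothetical in-memory CSV standing in for test.csv
csv_text = (
    "id,orig_destination_distance,srch_ci\n"
    "0,5539.06,2015-09-03\n"
    "1,,2015-09-24\n"
    "2,3975.98,\n"
    "3,,2015-07-17\n"
)
shares = null_share(io.StringIO(csv_text), chunksize=2)
print(shares)
```

With the real data, the same function could be called with a file path in place of the `StringIO` object.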


In [11]:
# UNIQUE VALUES IN EACH COLUMN
# Counts for the train set; the test set is handled in the next cell
train_set.nunique()

date_time                    987199
site_name                        42
posa_continent                    5
user_location_country           183
user_location_region            776
user_location_city            12920
orig_destination_distance    400861
user_id                       32783
is_mobile                         2
is_package                        2
channel                          11
srch_ci                        1108
srch_co                        1111
srch_adults_cnt                  10
srch_children_cnt                10
srch_rm_cnt                       9
srch_destination_id           16964
srch_destination_type_id          8
is_booking                        2
cnt                              49
hotel_continent                   7
hotel_country                   202
hotel_market                   2049
hotel_cluster                   100
dtype: int64

In [12]:
# Number of distinct non-null values per column in the test set
test_set.nunique()

id                           10000
date_time                     9996
site_name                       30
posa_continent                   5
user_location_country          107
user_location_region           427
user_location_city            2682
orig_destination_distance     5878
user_id                       4665
is_mobile                        2
is_package                       2
channel                         11
srch_ci                        583
srch_co                        576
srch_adults_cnt                 10
srch_children_cnt                7
srch_rm_cnt                      9
srch_destination_id           3372
srch_destination_type_id         8
hotel_continent                  7
hotel_country                  137
hotel_market                  1363
dtype: int64
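
The unique-value counts above hint at which columns could be treated as categorical. A rough sketch of such a conversion follows; the helper `to_categorical`, its `max_unique` threshold, and the toy frame are assumptions for illustration, and the right threshold depends on the model used later.

```python
import pandas as pd

def to_categorical(df, max_unique=50, exclude=()):
    """Return a copy where low-cardinality columns use the 'category' dtype."""
    out = df.copy()
    for col in out.columns:
        if col not in exclude and out[col].nunique() <= max_unique:
            out[col] = out[col].astype("category")
    return out

# Hypothetical toy rows with a few of the train columns
toy = pd.DataFrame({
    "site_name": [2, 2, 2, 11],
    "is_mobile": [0, 1, 0, 0],
    "orig_destination_distance": [2234.26, 913.19, 913.63, 66.79],
})
converted = to_categorical(toy, max_unique=2)
print(converted.dtypes)
```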

## Prepare Datasets for analysis:
1. Convert date columns to datetime format
2. Remove nulls from:
	- srch_ci, srch_co by deleting the affected rows

In [13]:
# CONVERT DATES TO DATETIME type
columns_to_change = ['date_time', 'srch_ci', 'srch_co']

for df in (train_set, test_set):
    for col in columns_to_change:
        # Plain column assignment replaces the column, so the datetime dtype is kept
        df[col] = pd.to_datetime(df[col])

In [162]:
# REMOVE NULL VALUES
# Drop rows where either the check-in or the check-out date is missing
train_set = train_set.dropna(subset=['srch_ci', 'srch_co'])


Because the training set is large and the number of nulls in srch_ci and srch_co is small, I decided to simply delete those observations.
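
The deletion described above can be checked on a small example to see how many rows it costs. This is a sketch with a made-up frame, not the real training data.

```python
import pandas as pd

# Hypothetical toy rows with missing check-in / check-out dates
toy = pd.DataFrame({
    "srch_ci": pd.to_datetime(["2014-08-27", None, "2014-08-29"]),
    "srch_co": pd.to_datetime(["2014-08-31", "2014-09-02", None]),
    "hotel_cluster": [1, 5, 7],
})

before = len(toy)
# Keep only rows where both dates are present
cleaned = toy.dropna(subset=["srch_ci", "srch_co"])
dropped = before - len(cleaned)
print(f"Dropped {dropped} of {before} rows ({dropped / before:.1%})")
```

Reporting the share of dropped rows makes it easy to confirm that the loss really is small before applying the same step to the full training set.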