This notebook creates a prediction for test rows whose origin-destination distance (together with the user's city) exactly matches at least one training row.

This takes advantage of a mistake made by Expedia: they gave a precise distance between the user's city and the exact hotel that the user chose in the end. In the training data set, some values of `orig_destination_distance` repeat. For about 95% of them there is a 1-to-1 correspondence between that distance and the `hotel_cluster`.
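
A figure like this can be estimated by collecting the clusters seen for each repeated distance and counting how many distances map to a single cluster. The sketch below uses invented toy rows and a hypothetical helper, `single_cluster_fraction`. It illustrates the idea only and is not the computation behind the 95% figure.

```python
from collections import defaultdict

def single_cluster_fraction(rows):
    """Among distances seen more than once, return the fraction with one cluster.

    `rows` is an iterable of (orig_destination_distance, hotel_cluster) pairs.
    """
    clusters = defaultdict(set)   # distance -> set of clusters seen
    counts = defaultdict(int)     # distance -> number of rows
    for dist, cluster in rows:
        clusters[dist].add(cluster)
        counts[dist] += 1
    repeated = [d for d, n in counts.items() if n > 1]
    if not repeated:
        return 0.0
    unique = sum(1 for d in repeated if len(clusters[d]) == 1)
    return float(unique) / len(repeated)

# Toy data (invented for illustration, not taken from train.csv)
toy_rows = [
    (2234.2641, 1), (2234.2641, 1), (2234.2641, 1),  # repeated, one cluster
    (55.7481, 73), (55.7481, 89),                    # repeated, two clusters
    (913.1932, 80),                                  # appears once, ignored
]
fraction = single_cluster_fraction(toy_rows)
print(fraction)
```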

In [1]:
import numpy as np
import pandas as pd
from math import ceil
with open('datapath.txt') as f:
    datapath=f.readlines()[0].rstrip()

In [2]:
totaltrainrows=37670293
ndestinations=62106
totaltestrows=2528243

In [3]:
testdata=pd.read_csv(datapath+'test.csv', nrows=None, usecols=[0,6,7],dtype={0:np.uint32,6:np.uint16,7:np.float64},index_col=0)

In [4]:
testdata.dropna(inplace=True)

In [46]:
testdata[['orig_destination_distance','user_location_city']]

id,orig_destination_distance,user_location_city
0,5539.0567,37449
1,5873.2923,37449
2,3975.9776,17440
3,1508.5975,34156
4,66.7913,36345
5,359.8521,48189
6,237.3465,48189
7,216.5785,24811
8,2337.6754,48189
9,2539.7995,48189


This code takes advantage of the fact that sets are referred to by reference in Python. When we pass a set to another data frame, all that is passed is something that says "look in this spot in memory for this object". This also means that `[set()]*testdata.shape[0]` will not work for our purposes: it creates a list in which every element is the exact same object in memory, so changing one element changes them all.

In [5]:
# One distinct empty set per test row (range works in both Python 2 and 3)
testdata['hotel_cluster']=[set() for _ in range(testdata.shape[0])]
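
The difference between the two ways of building the list can be checked on a tiny example. This is a minimal sketch in plain Python; the names `shared` and `separate` are made up for illustration.

```python
n = 3

# Multiplying the list copies the reference, so every slot is the same set.
shared = [set()] * n
shared[0].add(42)

# The comprehension calls set() once per element, giving n distinct sets.
separate = [set() for _ in range(n)]
separate[0].add(42)

print(shared)    # every element contains 42
print(separate)  # only the first element contains 42
```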

This creates an iterator over the training data, using just the columns we want. Each element of the iterator is a `DataFrame`. We can then iterate over it, extracting and using the info we want from each chunk, after which the chunk is discarded from memory and the next one is loaded.

In [6]:
nchunks=8
# read_csv expects an integer chunksize; ceil returns a float on Python 2
chunksize=int(ceil(float(totaltrainrows)/nchunks))
trainit=pd.read_csv(datapath+'train.csv',iterator=True,chunksize=chunksize,usecols=[5,6,23],dtype={5:np.uint16,6:np.float64,23:np.uint8})
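
The chunked reader can be tried out on a small in-memory CSV before pointing it at the full training file. This is a sketch under assumptions: the CSV text is invented, and `chunk_lengths` is a name introduced only for this example.

```python
import io
import pandas as pd

# Toy CSV standing in for train.csv (values invented for illustration)
csv_text = (
    "user_location_city,orig_destination_distance,hotel_cluster\n"
    "48862,2234.2641,1\n"
    "48862,2234.2641,1\n"
    "35390,913.1932,80\n"
    "35390,913.6259,21\n"
    "2037,55.7481,73\n"
)

# Passing chunksize makes read_csv return an iterator of DataFrames
reader = pd.read_csv(io.StringIO(csv_text), chunksize=2)

chunk_lengths = []
for chunk in reader:
    # The loop variable is rebound on each pass, so the previous chunk can be freed
    chunk_lengths.append(len(chunk))

print(chunk_lengths)
```

With 5 rows and a chunk size of 2, the last chunk is shorter than the others, which is why the notebook rounds the chunk size up with `ceil`.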

In [40]:
def add_cluster(x):
    x['hotel_cluster_tst'].add(x['hotel_cluster_tr'])

Because the sets are objects passed by reference, we can modify them in the data frame created in the for loop, and they will also get modified in `testdata`.

In [8]:
%%time
nchunk=0
for chunk in trainit:
    chunk=pd.merge(chunk,testdata.reset_index(),on=['orig_destination_distance','user_location_city'],how='inner',suffixes=['_tr','_tst'])
    chunk.apply(add_cluster,axis=1)
    print(nchunk)
    nchunk+=1
    

0
1
2
3
4
5
6
7
CPU times: user 19min 8s, sys: 15.7 s, total: 19min 24s
Wall time: 19min 20s
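
The claim that the merged frame shares its set objects with `testdata` can be checked on toy data. The sketch below uses invented values and the hypothetical names `test_df` and `train_chunk`. It loops with `iterrows` instead of `apply` purely to keep the example short.

```python
import pandas as pd

# Toy stand-ins for testdata and one training chunk (values invented)
test_df = pd.DataFrame({
    "id": [0, 1],
    "user_location_city": [48862, 2037],
    "orig_destination_distance": [2234.2641, 55.7481],
})
test_df["hotel_cluster"] = [set() for _ in range(len(test_df))]

train_chunk = pd.DataFrame({
    "user_location_city": [48862, 48862, 35390],
    "orig_destination_distance": [2234.2641, 2234.2641, 913.1932],
    "hotel_cluster": [1, 1, 80],
})

merged = pd.merge(train_chunk, test_df,
                  on=["orig_destination_distance", "user_location_city"],
                  how="inner", suffixes=("_tr", "_tst"))

# The merged column holds references to the same set objects as test_df,
# so adding to a set here also changes what test_df sees.
for _, row in merged.iterrows():
    row["hotel_cluster_tst"].add(int(row["hotel_cluster_tr"]))

print(test_df["hotel_cluster"].tolist())
```

Only the first test row matches a training row here, so its set gains cluster 1 while the second row's set stays empty.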


In [15]:
#Garbage collection doesn't work well in an interactive session, so I can save some memory like so:
chunk=0

In [31]:
settostring=lambda x: ' '.join(str(i) for i in x)

When we convert the sets to booleans, empty sets are `False`, all others are `True`.

In [32]:
prediction=testdata[testdata['hotel_cluster'].astype(bool)]['hotel_cluster'].map(settostring)
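
The filtering and string conversion can be seen in isolation on a few hand-made sets. This is a small sketch: `set_to_string` is a hypothetical variant of `settostring` that adds `sorted()` so the output order is deterministic, which the notebook's version does not do.

```python
clusters = [set(), {73, 25}, {1}]

# Empty sets are falsy, so bool() acts as a "has at least one match" filter
has_match = [bool(s) for s in clusters]

# Same idea as settostring, with sorted() added so the order is deterministic
# (the sorting is an addition for this example, not part of the notebook)
def set_to_string(s):
    return ' '.join(str(i) for i in sorted(s))

strings = [set_to_string(s) for s in clusters if s]

print(has_match)
print(strings)
```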

In [36]:
prediction.to_csv(datapath+'dist_predict.csv',index=True,header=True)

There are, however, some data points where this method doesn't work perfectly. This is probably because hotels can change their cluster seasonally, or as aspects of the hotel change. I haven't yet figured out what to do with these.

In [41]:
checklen=lambda x: len(x)

In [43]:
bads=testdata[testdata['hotel_cluster'].apply(len)>5]

In [44]:
bads.shape

(22, 3)

These 22 rows share only 8 distinct distances, though, so there are only 8 actual problems to deal with:

In [48]:
bads.groupby('orig_destination_distance').size()

orig_destination_distance
55.7481      10
61.5983       4
69.1495       2
141.5581      1
195.8534      2
1698.9480     1
2375.0267     1
3378.0289     1
dtype: int64

In [49]:
bads

id,user_location_city,orig_destination_distance,hotel_cluster
40071,2037,55.7481,"{73, 89, 25, 90, 59, 30}"
47815,27655,2375.0267,"{3, 83, 20, 85, 57, 30, 53}"
170559,28561,61.5983,"{73, 25, 89, 90, 59, 30}"
244716,2086,195.8534,"{3, 20, 53, 57, 60, 93, 30, 85}"
278646,28561,61.5983,"{73, 25, 89, 90, 59, 30}"
451279,32561,141.5581,"{40, 41, 44, 83, 86, 90, 60, 31}"
646412,35390,3378.0289,"{64, 99, 36, 9, 46, 80}"
690018,28561,61.5983,"{73, 25, 89, 90, 59, 30}"
835472,2037,55.7481,"{73, 89, 25, 90, 59, 30}"
944406,51733,1698.948,"{98, 80, 21, 25, 59, 95}"


### Scrap
Here's an example of how to take in a small number of rows for experimentation with the data:

In [37]:
chunk=pd.read_csv(datapath+'train.csv',nrows=100,usecols=[5,6,23],dtype={5:np.uint16,6:np.float64,23:np.uint8})

In [38]:
chunk.head()

,user_location_city,orig_destination_distance,hotel_cluster
0,48862,2234.2641,1
1,48862,2234.2641,1
2,48862,2234.2641,1
3,35390,913.1932,80
4,35390,913.6259,21
