## Kaggle Expedia Hotel Recommendations Competition

[Competition page](https://www.kaggle.com/c/expedia-hotel-recommendations/overview)

### Problem Background
![](./img/kaggle-expedia-hotel-recommendation.png)

### Data Description

Expedia has provided logs of customer behavior. These include what customers searched for, how they interacted with search results (click/book), and whether or not the search result was a travel package. The data in this competition are a random selection from Expedia and are not representative of the overall statistics.

Expedia is interested in predicting which hotel group a user is going to book. Expedia has in-house algorithms to form hotel clusters, where similar hotels for a search (based on historical price, customer star ratings, geographical locations relative to city center, etc.) are grouped together. These hotel clusters serve as good identifiers of which types of hotels people are going to book, while avoiding outliers such as new hotels that don't have historical data.

Your goal in this competition is to predict the booking outcome (hotel cluster) for a user event, based on their search and other attributes associated with that user event.

The train and test datasets are split by time: the training data are from 2013 and 2014, while the test data are from 2015. The public/private leaderboard data are split by time as well. The training data include all users in the logs, with both click events and booking events. The test data include only booking events.
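
Because the test set contains only later booking events, a local validation set can mimic it by holding out the most recent bookings from the training log. The sketch below is one possible way to do that, not part of the competition material. It assumes rows are dicts with `date_time` and `is_booking` string fields (as `csv.DictReader` would yield from train.csv); the function name `time_split` and the cutoff date are illustrative choices.

```python
def time_split(rows, cutoff="2014-07-01"):
    """Split log rows into (train, valid) by timestamp.

    Rows before `cutoff` go to train (clicks and bookings alike).
    Rows on or after it go to valid only if they are bookings,
    mirroring the booking-only test set.
    """
    train, valid = [], []
    for row in rows:
        # Timestamps in YYYY-MM-DD order compare correctly as strings.
        if row["date_time"] < cutoff:
            train.append(row)
        elif row["is_booking"] == "1":
            valid.append(row)
    return train, valid


# Illustrative rows, not real competition data.
rows = [
    {"date_time": "2013-05-02 10:00:00", "is_booking": "0"},
    {"date_time": "2014-08-11 07:46:59", "is_booking": "1"},
    {"date_time": "2014-09-01 12:00:00", "is_booking": "0"},
]
train, valid = time_split(rows)
```

Late clicks are dropped from the validation set in this sketch; whether to fold them back into training is a design choice left open here.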

destinations.csv data consists of features extracted from hotel reviews text. 

Note that some srch_destination_id's in the train/test files don't exist in the destinations.csv file. This is because some hotels are new and don't have enough features in the latent space. Your algorithm should be able to handle this missing information.
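
One simple way to tolerate destinations that are absent from destinations.csv is to fall back to a default feature vector. The sketch below assumes the file has a header row, the destination id in the first column, and numeric latent features in the remaining columns. The helper names and the zero-vector default are assumptions for illustration, and the sample values are made up.

```python
import csv
import io


def load_destinations(fileobj):
    """Map srch_destination_id -> list of latent feature values."""
    reader = csv.reader(fileobj)
    next(reader)  # skip the header row
    return {row[0]: [float(x) for x in row[1:]] for row in reader}


def destination_features(dest_id, table, n_features):
    """Return the latent features, or a zero vector for an unseen destination."""
    return table.get(dest_id, [0.0] * n_features)


# Tiny in-memory stand-in for destinations.csv (values are made up).
sample = io.StringIO("srch_destination_id,d1,d2\n0,-2.19,-1.05\n1,-2.18,-2.34\n")
table = load_destinations(sample)
n_features = len(next(iter(table.values())))

known = destination_features("1", table, n_features)
unknown = destination_features("99999", table, n_features)
```

A zero vector is only one option; the column-wise mean of the known destinations or an explicit "missing" indicator feature would be equally reasonable defaults.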

### File descriptions

* **train.csv** - the training set
* **test.csv** - the test set
* **destinations.csv** - hotel search latent attributes
* **sample_submission.csv** - a sample submission file in the correct format


### Data fields

**train/test.csv**

![](./img/data.png)

### Evaluation Metric and Submission Format

![](./img/eval.png)
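
To the best of my knowledge, the competition is scored with Mean Average Precision at 5 (MAP@5), with exactly one true hotel cluster per test row. Under that assumption, each row scores 1/rank of the true cluster within the first five predictions, or 0 if it is absent. The function name `map_at_5` and the sample labels below are illustrative.

```python
def map_at_5(actual, predicted):
    """Mean Average Precision at 5 with a single true label per row.

    actual: list of true clusters.
    predicted: list of ranked prediction lists, best first.
    """
    total = 0.0
    for truth, preds in zip(actual, predicted):
        for rank, p in enumerate(preds[:5], start=1):
            if p == truth:
                total += 1.0 / rank  # credit decays with the rank of the hit
                break
    return total / len(actual)


# First row hits at rank 1, second row at rank 2.
score = map_at_5(["91", "41"], [["91", "2", "3"], ["5", "41", "7"]])
```

This is why the submission code later in the notebook writes up to five space-separated clusters per `id`, ordered from most to least likely.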

### Solution Diagram

![](./img/solution.png)

## Handling the Data Leak

In [1]:
# -*- coding: utf-8 -*-

import datetime
from heapq import nlargest  # heap-based top-k selection
from operator import itemgetter
import os


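# Build the lookup tables that predictions are later matched against
def prepare_arrays_match():
    f = open("./input/train.csv", "r")  # stream the file instead of loading it with pandas
    f.readline()  # skip the header row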
    
    best_hotels_od_ulc = dict()
    best_hotels_uid_miss = dict()
    best_s00 = dict()
    best_s01 = dict()
    total = 0

    # Calc counts
    while 1:
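        line = f.readline().strip()  # strip surrounding whitespace and the newline
        total += 1

        if total % 2000000 == 0:
            print('Read {} lines...'.format(total))  # progress report

        if line == '':
            break  # end of file

        # Parse the comma-separated row
        arr = line.split(",")
        # arr[0] is the event timestamp: characters 0-3 hold the year, 5-6 the month
        book_year = int(arr[0][:4])
        book_month = int(arr[0][5:7])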
        user_location_city = arr[5]
        orig_destination_distance = arr[6]
        user_id = arr[7]
        srch_destination_id = arr[16]
        hotel_country = arr[21]
        hotel_market = arr[22]
        is_booking = float(arr[18])
        hotel_cluster = arr[23]

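        # Recency weights: append_0 counts months since Dec 2012 (Jan 2013 -> 1, Dec 2014 -> 24);
        # append_1 squares it and weights bookings far more heavily than clicks.
        append_0 = ((book_year - 2012)*12 + (book_month - 12))
        append_1 = append_0 * append_0 * (3 + 17.60*is_booking)
        append_2 = 3 + 5.56*is_booking  # not used below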

        # Key s00: (user_id, user_location_city, srch_destination_id, hotel_country, hotel_market),
        # built only for rows where the distance is present
        if user_location_city != '' and orig_destination_distance != '' and user_id !='' and srch_destination_id != '' and hotel_country != '':
            # hash the concatenated key string
            s00 = hash(str(user_id)+':'+str(user_location_city)+':'+str(srch_destination_id)+':'+str(hotel_country)+':'+str(hotel_market))
            if s00 in best_s00:
                if hotel_cluster in best_s00[s00]:
                    best_s00[s00][hotel_cluster] += append_1
                else:
                    best_s00[s00][hotel_cluster] = append_1
            else:
                best_s00[s00] = dict()
                best_s00[s00][hotel_cluster] = append_1

        # Key s01: (user_id, srch_destination_id, hotel_country, hotel_market)
        if user_location_city != '' and orig_destination_distance != '' and user_id !='' and srch_destination_id != '':
            s01 = hash(str(user_id)+':'+str(srch_destination_id)+':'+str(hotel_country)+':'+str(hotel_market))
            if s01 in best_s01:
                if hotel_cluster in best_s01[s01]:
                    best_s01[s01][hotel_cluster] += append_1
                else:
                    best_s01[s01][hotel_cluster] = append_1
            else:
                best_s01[s01] = dict()
                best_s01[s01][hotel_cluster] = append_1

        # Key s0: (user_location_city, srch_destination_id, hotel_country, hotel_market),
        # built only for rows where the distance is missing
        if user_location_city != '' and orig_destination_distance == '' and srch_destination_id != '' and hotel_country != '':
            s0 = hash(str(user_location_city)+':'+str(srch_destination_id)+':'+str(hotel_country)+':'+str(hotel_market))
            if s0 in best_hotels_uid_miss:
                if hotel_cluster in best_hotels_uid_miss[s0]:
                    best_hotels_uid_miss[s0][hotel_cluster] += append_1
                else:
                    best_hotels_uid_miss[s0][hotel_cluster] = append_1
            else:
                best_hotels_uid_miss[s0] = dict()
                best_hotels_uid_miss[s0][hotel_cluster] = append_1

        # Key s1: (user_location_city, orig_destination_distance); weighted by append_0 only
        if user_location_city != '' and orig_destination_distance != '':
            s1 = hash(str(user_location_city)+':'+str(orig_destination_distance))

            if s1 in best_hotels_od_ulc:
                if hotel_cluster in best_hotels_od_ulc[s1]:
                    best_hotels_od_ulc[s1][hotel_cluster] += append_0
                else:
                    best_hotels_od_ulc[s1][hotel_cluster] = append_0
            else:
                best_hotels_od_ulc[s1] = dict()
                best_hotels_od_ulc[s1][hotel_cluster] = append_0

    f.close()
    return best_s00,best_s01, best_hotels_od_ulc, best_hotels_uid_miss
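
To make the weighting in `prepare_arrays_match` concrete, the sketch below recomputes `append_0` and `append_1` for two single events and accumulates them per key. It is not the notebook's own code: the helper name `recency_weights` is hypothetical, `collections.defaultdict` stands in for the nested `if/else` blocks, and tuple keys replace `hash()` of a joined string (an assumed swap that avoids any risk of hash collisions). The event values are illustrative.

```python
from collections import defaultdict


def recency_weights(date_time, is_booking):
    """Return (append_0, append_1) using the same formulas as above."""
    year = int(date_time[:4])
    month = int(date_time[5:7])
    append_0 = (year - 2012) * 12 + (month - 12)  # Jan 2013 -> 1, Dec 2014 -> 24
    append_1 = append_0 * append_0 * (3 + 17.60 * is_booking)
    return append_0, append_1


# key -> hotel_cluster -> accumulated score
scores = defaultdict(lambda: defaultdict(float))

events = [
    # (key, date_time, is_booking, hotel_cluster)
    (("city_1", "dest_9"), "2013-01-15 08:00:00", 0.0, "91"),  # early click
    (("city_1", "dest_9"), "2014-12-20 09:30:00", 1.0, "41"),  # late booking
]
for key, date_time, is_booking, cluster in events:
    _, weight = recency_weights(date_time, is_booking)
    scores[key][cluster] += weight

click_w = scores[("city_1", "dest_9")]["91"]  # 1 * 1 * 3
book_w = scores[("city_1", "dest_9")]["41"]   # 24 * 24 * 20.6
```

A booking in the last training month therefore outweighs a click in the first month by several thousand times, so recent bookings dominate the ranking within each key.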

In [2]:
def gen_submission(best_s00, best_s01, best_hotels_od_ulc, best_hotels_uid_miss):
    path = './output/match_pred.csv'
    out = open(path, "w")
    f = open("./input/test.csv", "r")
    f.readline()  # skip the header row
    total = 0
    total0 = 0
    total00 = 0
    total1 = 0
    out.write("id,hotel_cluster\n")
    
    while 1:
        line = f.readline().strip()
        total += 1

        if total % 100000 == 0:
            print('Write {} lines...'.format(total))

        if line == '':
            break

        arr = line.split(",")
        row_id = arr[0]  # avoid shadowing the built-in id()
        user_location_city = arr[6]
        orig_destination_distance = arr[7]
        user_id = arr[8]
        srch_destination_id = arr[17]
        hotel_country = arr[20]
        hotel_market = arr[21]

        out.write(str(row_id) + ',')
        filled = []

        # Stage 1: exact match on (user_location_city, orig_destination_distance)
        s1 = hash(str(user_location_city)+':'+str(orig_destination_distance))
        if s1 in best_hotels_od_ulc:
            d = best_hotels_od_ulc[s1]
            topitems = nlargest(5, sorted(d.items()), key=itemgetter(1))
            for i in range(len(topitems)):
                if topitems[i][0] in filled:
                    continue
                if len(filled) == 5:
                    break
                out.write(' ' + topitems[i][0])
                filled.append(topitems[i][0])
                total1 += 1

        # Stage 2: distance missing, so match on city + destination + country + market
        if orig_destination_distance == '':
            s0 = hash(str(user_location_city)+':'+str(srch_destination_id)+':'+str(hotel_country)+':'+str(hotel_market))
            if s0 in best_hotels_uid_miss:
                d = best_hotels_uid_miss[s0]
                topitems = nlargest(4, sorted(d.items()), key=itemgetter(1))
                for i in range(len(topitems)):
                    if topitems[i][0] in filled:
                        continue
                    if len(filled) == 5:
                        break
                    out.write(' ' + topitems[i][0])
                    filled.append(topitems[i][0])
                    total0 += 1

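        # Stage 3: user-level match on s01, applied only when the more specific s00 key was never seen in training
        s00 = hash(str(user_id)+':'+str(user_location_city)+':'+str(srch_destination_id)+':'+str(hotel_country)+':'+str(hotel_market))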
        s01 = hash(str(user_id)+':'+str(srch_destination_id)+':'+str(hotel_country)+':'+str(hotel_market))
        if s01 in best_s01 and s00 not in best_s00:
            d = best_s01[s01]
            topitems = nlargest(4, sorted(d.items()), key=itemgetter(1))
            for i in range(len(topitems)):
                if topitems[i][0] in filled:
                    continue
                if len(filled) == 5:
                    break
                out.write(' ' + topitems[i][0])
                filled.append(topitems[i][0])
                total00 += 1

        out.write("\n")
    out.close()
    f.close()
    print('Total 1: {} ...'.format(total1))
    print('Total 0: {} ...'.format(total0))
    print('Total 00: {} ...'.format(total00))
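
The three lookup stages in `gen_submission` repeat one pattern: take the highest-scoring clusters from a table entry, skip any already chosen, and stop once five are collected. A minimal sketch of that pattern as a reusable helper follows. The name `fill_top` and the sample scores are hypothetical, and the helper only builds the list rather than writing to the output file.

```python
from heapq import nlargest
from operator import itemgetter


def fill_top(filled, cluster_scores, k, limit=5):
    """Append up to k best clusters from cluster_scores to filled, in place.

    Clusters already in filled are skipped, and filled never grows past limit.
    """
    # Sorting first makes ties resolve in a fixed order (by cluster id).
    for cluster, _ in nlargest(k, sorted(cluster_scores.items()), key=itemgetter(1)):
        if len(filled) == limit:
            break
        if cluster not in filled:
            filled.append(cluster)
    return filled


filled = []
fill_top(filled, {"91": 30, "41": 12, "2": 7}, k=5)               # stage 1 candidates
fill_top(filled, {"41": 99, "5": 50, "64": 40, "8": 1}, k=4)      # stage 2 candidates
```

Earlier stages keep priority: cluster "41" is already present when the second table is consulted, so it is skipped rather than promoted, and "8" is dropped because the list is full.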

In [3]:
best_s00,best_s01, best_hotels_od_ulc, best_hotels_uid_miss = prepare_arrays_match()

Read 2000000 lines...
Read 4000000 lines...
Read 6000000 lines...
Read 8000000 lines...
Read 10000000 lines...
Read 12000000 lines...
Read 14000000 lines...
Read 16000000 lines...
Read 18000000 lines...
Read 20000000 lines...
Read 22000000 lines...
Read 24000000 lines...
Read 26000000 lines...
Read 28000000 lines...
Read 30000000 lines...
Read 32000000 lines...
Read 34000000 lines...
Read 36000000 lines...


In [4]:
gen_submission(best_s00, best_s01, best_hotels_od_ulc, best_hotels_uid_miss)

Write 100000 lines...
Write 200000 lines...
Write 300000 lines...
Write 400000 lines...
Write 500000 lines...
Write 600000 lines...
Write 700000 lines...
Write 800000 lines...
Write 900000 lines...
Write 1000000 lines...
Write 1100000 lines...
Write 1200000 lines...
Write 1300000 lines...
Write 1400000 lines...
Write 1500000 lines...
Write 1600000 lines...
Write 1700000 lines...
Write 1800000 lines...
Write 1900000 lines...
Write 2000000 lines...
Write 2100000 lines...
Write 2200000 lines...
Write 2300000 lines...
Write 2400000 lines...
Write 2500000 lines...
Total 1: 1092018 ...
Total 0: 1848215 ...
Total 00: 244309 ...
