## Kaggle Expedia Hotel Recommendations Competition

[Competition page](https://www.kaggle.com/c/expedia-hotel-recommendations/overview)

### Background
![](./img/kaggle-expedia-hotel-recommendation.png)

### Data description

Expedia has provided you logs of customer behavior. These include what customers searched for, how they interacted with search results (click/book), whether or not the search result was a travel package. The data in this competition is a random selection from Expedia and is not representative of the overall statistics.

Expedia is interested in predicting which hotel group a user is going to book. Expedia has in-house algorithms to form hotel clusters, where similar hotels for a search (based on historical price, customer star ratings, geographical locations relative to city center, etc) are grouped together. These hotel clusters serve as good identifiers to which types of hotels people are going to book, while avoiding outliers such as new hotels that don't have historical data.

Your goal in this competition is to predict the booking outcome (hotel cluster) for a user event, based on their search and other attributes associated with that user event.

The train and test datasets are split based on time: training data are from 2013 and 2014, while test data are from 2015. The public/private leaderboard data are split based on time as well. Training data includes all the users in the logs, including both click events and booking events. Test data only includes booking events.

destinations.csv data consists of features extracted from hotel reviews text. 

Note that some srch_destination_id's in the train/test files don't exist in the destinations.csv file. This is because some hotels are new and don't have enough features in the latent space. Your algorithm should be able to handle this missing information.
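
One way to cope with destinations that are absent from destinations.csv is a left join followed by filling the missing latent features. The sketch below uses a tiny made-up table: the column names `d1` and `d2` follow the naming in destinations.csv, but the values, the `dest_missing` flag, and the choice of 0 as the fill value are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-ins for the event log and destinations.csv (values are made up).
events = pd.DataFrame({"srch_destination_id": [1, 2, 3]})
destinations_toy = pd.DataFrame({
    "srch_destination_id": [1, 2],
    "d1": [-2.1, -1.8],
    "d2": [-2.3, -2.0],
})

# A left join keeps every event; destinations without latent features get NaN.
merged = events.merge(destinations_toy, how="left", on="srch_destination_id")

# Flag the rows that had no match, then fill the gaps (0 is an arbitrary choice).
merged["dest_missing"] = merged["d1"].isna()
merged[["d1", "d2"]] = merged[["d1", "d2"]].fillna(0)
print(merged)
```

Keeping an explicit flag lets a model tell "feature is really 0" apart from "feature was missing".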

### File descriptions

* **train.csv** - the training set
* **test.csv** - the test set
* **destinations.csv** - hotel search latent attributes
* **sample_submission.csv** - a sample submission file in the correct format


### Data fields

**train/test.csv**

![](./img/data.png)

### Evaluation metric and submission format

![](./img/eval.png)

### Solution diagram

![](./img/solution.png)

## Data leakage handling

In [19]:
import datetime
from heapq import nlargest # heap-based top-k selection
from operator import itemgetter
import pandas as pd
import os
from sklearn import model_selection
# if xgboost fails to load on macOS: brew install libomp
import xgboost as xgb
from sklearn.ensemble import RandomForestClassifier
import numpy as np
import h5py # HDF5 storage for the prediction matrices
from scipy.sparse import csr_matrix, hstack
from sklearn.linear_model import SGDClassifier
import pickle
from sklearn.naive_bayes import BernoulliNB
from sklearn.preprocessing import normalize

### Inspect the dataset with pandas (needs about 16 GB of memory; running it is not recommended)

In [None]:
# train = pd.read_csv("./input/train.csv")

### Filter out and correct bad records

### train

In [None]:
# train[(train["srch_ci"]>"2021-01-01 00:00:00") & (train["is_booking"] ==1)]

In [None]:
# train[(train["srch_co"]>"2021-01-01 00:00:00") & (train["is_booking"] ==1)]

In [None]:
# train[(train["srch_ci"].isnull()) & (train["is_booking"] ==1)]["srch_ci"]

In [None]:
# train[(train["srch_co"].isnull()) & (train["is_booking"] ==1)]["srch_ci"]

In [None]:
# def filter_data(train,column_list):
#     '''
#     The data is from 2014-08-11 07:46:59 to 2014-09-18 08:52:42, so a reservation
#     dated many years later is probably a wrong record; drop those rows.
#     '''
#     for i in column_list:
#         data = train[train[i]>"2021-01-01 00:00:00"].index
#         train.drop(data, axis=0, inplace=True)


In [None]:
# column_list = ["srch_ci",'srch_co']
# filter_data(train,column_list)
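
The filter above compares date strings rather than parsed timestamps. A minimal sketch of why that works is shown below on made-up rows: ISO `YYYY-MM-DD` strings sort lexicographically in date order, and very distant years may fall outside the range that pandas' default nanosecond timestamps can represent. The helper name `drop_far_future` and the toy values are assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy rows; the year-2558 record imitates an obviously wrong date.
toy = pd.DataFrame({
    "srch_ci": ["2014-08-27", "2558-04-20", "2014-12-15"],
    "srch_co": ["2014-08-31", "2558-04-23", "2014-12-19"],
})

def drop_far_future(df, columns, cutoff="2021-01-01"):
    """Return a copy of df without rows whose date string is after cutoff."""
    bad = pd.Series(False, index=df.index)
    for col in columns:
        # ISO 'YYYY-MM-DD' strings compare in the same order as the dates.
        bad |= df[col] > cutoff
    return df.loc[~bad].copy()

clean = drop_far_future(toy, ["srch_ci", "srch_co"])
print(clean)
```

Returning a copy instead of dropping in place keeps the original frame available for checking what was removed.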

In [None]:
# train['srch_ci']=pd.to_datetime(train['srch_ci'])
# train['srch_co']=pd.to_datetime(train['srch_co'])
# train['date_time']=pd.to_datetime(train['date_time'])

In [None]:
# # check whether check-out is earlier than check-in
# train[train["srch_ci"]>train["srch_co"]].loc[:,["date_time","srch_ci","srch_co"]]

In [None]:
# # swap check-in and check-out when check-out is earlier than check-in
# def exchange_in_out(data):
#     index = data[data["srch_ci"]>data["srch_co"]].index
#     data["tem"] = data["srch_ci"]  # temporary copy of the original check-in dates
#     for i in index:
#         data.loc[i,"srch_ci"] = data.loc[i,'srch_co']
#         data.loc[i,'srch_co'] = data.loc[i,"tem"]
#     data.drop(["tem"],axis = 1, inplace =True)

In [None]:
# exchange_in_out(train)
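
The row-by-row swap can be slow on a table of this size. A vectorized alternative is sketched below on made-up data; it uses `Series.where` and is only one possible way to do it, not the notebook's own implementation.

```python
import pandas as pd

toy = pd.DataFrame({
    "srch_ci": pd.to_datetime(["2014-08-27", "2014-09-10"]),
    "srch_co": pd.to_datetime(["2014-08-31", "2014-09-05"]),  # second row is reversed
})

# Rows where check-out precedes check-in.
reversed_rows = toy["srch_ci"] > toy["srch_co"]

# Build both corrected columns from the originals before assigning either,
# so the second column is not computed from an already-modified first one.
ci = toy["srch_ci"].where(~reversed_rows, toy["srch_co"])
co = toy["srch_co"].where(~reversed_rows, toy["srch_ci"])
toy["srch_ci"], toy["srch_co"] = ci, co
print(toy)
```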

In [None]:
# train[train["srch_ci"]>train["srch_co"]].loc[:,["date_time","srch_ci","srch_co"]]

In [None]:
# train.info()

In [None]:
# train.head()

In [None]:
# train.tail()

### test

In [None]:
# test = pd.read_csv("./input/test.csv")

In [None]:
# # check for out-of-range records
# test[test["srch_co"]>"2021-01-01"].loc[:,["date_time","srch_ci","srch_co"]]

In [None]:
# test[test["srch_ci"]>"2021-01-01 00:00:00"].loc[:,["date_time","srch_ci","srch_co"]]

In [None]:
# test.loc[312920,"srch_ci"] = "2016-01-21"

In [None]:
# # no out-of-range records remain
# test[test["srch_ci"]>"2021-01-01"].loc[:,["date_time","srch_ci","srch_co"]]

In [None]:
# # check whether check-out is earlier than check-in
# test[test["srch_ci"]>test["srch_co"]].loc[:,["date_time","srch_ci","srch_co"]]

In [None]:
# exchange_in_out(test)

In [None]:
# test[test["srch_ci"]>test["srch_co"]].loc[:,["date_time","srch_ci","srch_co"]]

In [None]:
# test['srch_ci']=pd.to_datetime(test['srch_ci'])
# test['srch_co']=pd.to_datetime(test['srch_co'])
# test['date_time']=pd.to_datetime(test['date_time'])

In [None]:
# test.info()

In [None]:
# test.head()

### Store modified train and test set

In [None]:
# train.to_csv('./output/train.csv',index=False)
# test.to_csv('./output/test.csv',index=False)

## !! The steps above are slow and memory-hungry; running them is not recommended !!

***

### Data leakage processing

In [2]:
# -*- coding: utf-8 -*-

# build the cluster weight tables used for matching
def cluster_weight_collect():
    """
    Scan the training file once and build four dicts, each keyed by a
    different combination of fields. Every value is itself a dict that
    maps a hotel cluster to its accumulated weight.

    Returns:
        Four dicts:
        best_hotel_mainWeight   # user_id, user_location_city, srch_destination_id, hotel_country, hotel_market
        best_hotel_secWeight    # user_id, srch_destination_id, hotel_country, hotel_market
        best_hotels_od_ulc      # user_location_city, orig_destination_distance
        best_hotels_uid_miss    # user_location_city, srch_destination_id, hotel_country, hotel_market

    """
    f = open("./output/train.csv", "r")  
    f.readline()
    
    best_hotel_mainWeight = dict() # user_id, user_location_city, srch_destination_id, hotel_country, hotel_market
    best_hotel_secWeight = dict()  # user_id, srch_destination_id, hotel_country, hotel_market
    best_hotels_od_ulc = dict()    # user_location_city, orig_destination_distance
    best_hotels_uid_miss = dict()  # user_location_city, srch_destination_id, hotel_country, hotel_market

    # accumulate the weights
    while True:
        line = f.readline().strip() # strip surrounding whitespace

        if line == '':
            print("Finish reading ")
            break

        # extract the fields from the line
        arr = line.split(",") 
        
        book_year = int(arr[0][:4])           # year of the event
        book_month = int(arr[0][5:7])         # month of the event
        user_location_city = arr[5]           # city where the customer is located
        orig_destination_distance = arr[6]    # physical distance; empty means it could not be calculated
        user_id = arr[7]                      # user ID
        srch_destination_id = arr[16]         # ID of the searched destination
        hotel_country = arr[21]               # country of the hotel
        hotel_market = arr[22]                # market of the hotel
        is_booking = float(arr[18])           # 1 if booking, 0 if click
        hotel_cluster = arr[23]               # cluster of the hotel (the target)
 
        # derived weights
        # recency: months elapsed since December 2012
        append_0 = ((book_year - 2012)*12 + (book_month - 12)) 
        # squared recency times an event weight (3 for a click, 20.6 for a booking)
        append_1 = append_0 * append_0 * (3 + 17.60*is_booking)  
        
        # currently unused
        append_2 = 3 + 5.56*is_booking 
        # key: (user_id, user_location_city, srch_destination_id, hotel_country, hotel_market)
        if user_location_city != '' and orig_destination_distance != '' and user_id !='' and srch_destination_id != '' and hotel_country != '':
            # hash processing
            solution = hash(str(user_id)+':'+str(user_location_city)+':'+str(srch_destination_id)+':'+str(hotel_country)+':'+str(hotel_market))
            # found cluster, add weight; not found, add the cluster, give it initial weight
            if solution in best_hotel_mainWeight:
                if hotel_cluster in best_hotel_mainWeight[solution]:
                    best_hotel_mainWeight[solution][hotel_cluster] += append_1
                else:
                    best_hotel_mainWeight[solution][hotel_cluster] = append_1
            # if not found, create a solution and give it a weight
            else:
                best_hotel_mainWeight[solution] = dict()
                best_hotel_mainWeight[solution][hotel_cluster] = append_1

        # key: (user_id, srch_destination_id, hotel_country, hotel_market)
        if user_location_city != '' and orig_destination_distance != '' and user_id !='' and srch_destination_id != '':
            solution_sec = hash(str(user_id)+':'+str(srch_destination_id)+':'+str(hotel_country)+':'+str(hotel_market))
            # same as above 
            if solution_sec in best_hotel_secWeight:
                if hotel_cluster in best_hotel_secWeight[solution_sec]:
                    best_hotel_secWeight[solution_sec][hotel_cluster] += append_1
                else:
                    best_hotel_secWeight[solution_sec][hotel_cluster] = append_1
            else:
                best_hotel_secWeight[solution_sec] = dict()
                best_hotel_secWeight[solution_sec][hotel_cluster] = append_1

        # key: (user_location_city, srch_destination_id, hotel_country, hotel_market), used when the distance is missing
        if user_location_city != '' and orig_destination_distance == '' and srch_destination_id != '' and hotel_country != '':
            solution_thr = hash(str(user_location_city)+':'+str(srch_destination_id)+':'+str(hotel_country)+':'+str(hotel_market))
            if solution_thr in best_hotels_uid_miss:
                if hotel_cluster in best_hotels_uid_miss[solution_thr]:
                    best_hotels_uid_miss[solution_thr][hotel_cluster] += append_1
                else:
                    best_hotels_uid_miss[solution_thr][hotel_cluster] = append_1
            else:
                best_hotels_uid_miss[solution_thr] = dict()
                best_hotels_uid_miss[solution_thr][hotel_cluster] = append_1

        # key: (user_location_city, orig_destination_distance)
        if user_location_city != '' and orig_destination_distance != '':
            solution_fo = hash(str(user_location_city)+':'+str(orig_destination_distance))

            if solution_fo in best_hotels_od_ulc:
                if hotel_cluster in best_hotels_od_ulc[solution_fo]:
                    best_hotels_od_ulc[solution_fo][hotel_cluster] += append_0
                else:
                    best_hotels_od_ulc[solution_fo][hotel_cluster] = append_0
            else:
                best_hotels_od_ulc[solution_fo] = dict()
                best_hotels_od_ulc[solution_fo][hotel_cluster] = append_0

    f.close()
    return best_hotel_mainWeight,best_hotel_secWeight, best_hotels_od_ulc, best_hotels_uid_miss
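
The nested `if`/`else` bookkeeping above follows one pattern: compute a weight for the event, then add it to the entry for its key and cluster. A compact sketch of that pattern on made-up events is given below. The tuple keys, the `defaultdict`, and the helper names `month_weight` and `event_weight` are illustrative assumptions rather than the notebook's code; the weight formula itself is the one used for `append_1`.

```python
from collections import defaultdict

def month_weight(date_time):
    """Months elapsed since December 2012 for a 'YYYY-MM-DD ...' string."""
    year, month = int(date_time[:4]), int(date_time[5:7])
    return (year - 2012) * 12 + (month - 12)

def event_weight(date_time, is_booking):
    """Squared recency weight scaled by 3 for a click or 20.6 for a booking."""
    w = month_weight(date_time)
    return w * w * (3 + 17.60 * is_booking)

# Toy events: (date_time, user_id, srch_destination_id, is_booking, hotel_cluster)
events = [
    ("2013-01-05 10:00:00", "7", "8250", 0, "1"),
    ("2014-12-20 09:30:00", "7", "8250", 1, "91"),
    ("2014-12-21 11:15:00", "7", "8250", 0, "1"),
]

# key -> {hotel_cluster: accumulated weight}; missing entries start at 0.0.
weights = defaultdict(lambda: defaultdict(float))
for date_time, user_id, dest_id, is_booking, cluster in events:
    weights[(user_id, dest_id)][cluster] += event_weight(date_time, is_booking)

print(dict(weights[("7", "8250")]))
```

In this toy case a single recent booking outweighs two clicks, which is the effect the squared recency term and the booking multiplier are meant to produce.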

In [None]:
def generate_submission(best_hotel_mainWeight,best_hotel_secWeight, best_hotels_od_ulc, best_hotels_uid_miss):
    """
    Generate predictions from the weight dicts returned by cluster_weight_collect
    and write them to ./output/match_pred.csv.

    """
    path = './output/match_pred.csv'
    out = open(path, "w")
    
    # get test set and read
    f = open("./output/test.csv", "r")
    f.readline()
    total_fo = 0      # total solution_fo in the test
    total_thr = 0      # total solution_thr in the test
    total_sec = 0     # total solution_sec in the test
         
    
    # write the first line in the result file
    out.write("id,hotel_cluster\n")
    
    while True:
        
        line = f.readline().strip() # test set line with strip()
        if line == '':
            print("Finish reading")
            break

        arr = line.split(",")
        ID = arr[0]                            # test set id 
        user_location_city = arr[6]            # city where the customer is located
        orig_destination_distance = arr[7]     # physical distance; empty means it could not be calculated
        user_id = arr[8]                       # user ID
        srch_destination_id = arr[17]          # ID of the searched destination
        hotel_country = arr[20]                # country of the hotel
        hotel_market = arr[21]                 # market of the hotel
         
        out.write(str(ID) + ',')
        filled = []
        
        # solution_fo is built the same way as in cluster_weight_collect
        solution_fo = hash(str(user_location_city)+':'+str(orig_destination_distance))
        if solution_fo in best_hotels_od_ulc:
            d = best_hotels_od_ulc[solution_fo]
            # get the top 5 clusters by weight
            topitems = nlargest(5, sorted(d.items()), key=itemgetter(1))
            for i in range(len(topitems)):
                # skip clusters that were already written; stop once 5 have been written
                if topitems[i][0] in filled:
                    continue
                if len(filled) == 5:
                    break
                # write the matched cluster into result file.
                out.write(' ' + topitems[i][0])
                filled.append(topitems[i][0])
                total_fo += 1

        if orig_destination_distance == '':
            solution_thr = hash(str(user_location_city)+':'+str(srch_destination_id)+':'+str(hotel_country)+':'+str(hotel_market))
            if solution_thr in best_hotels_uid_miss:
                d = best_hotels_uid_miss[solution_thr]
                topitems = nlargest(4, sorted(d.items()), key=itemgetter(1))
                for i in range(len(topitems)):
                    if topitems[i][0] in filled:
                        continue
                    if len(filled) == 5:
                        break
                    out.write(' ' + topitems[i][0])
                    filled.append(topitems[i][0])
                    total_thr += 1

        solution = hash(str(user_id)+':'+str(user_location_city)+':'+str(srch_destination_id)+':'+str(hotel_country)+':'+str(hotel_market))
        solution_sec = hash(str(user_id)+':'+str(srch_destination_id)+':'+str(hotel_country)+':'+str(hotel_market))
        if solution_sec in best_hotel_secWeight and solution not in best_hotel_mainWeight:
            d = best_hotel_secWeight[solution_sec]
            topitems = nlargest(4, sorted(d.items()), key=itemgetter(1))
            for i in range(len(topitems)):
                if topitems[i][0] in filled:
                    continue
                if len(filled) == 5:
                    break
                out.write(' ' + topitems[i][0])
                filled.append(topitems[i][0])
                total_sec += 1

        out.write("\n")
    f.close()
    out.close()
    print('Total solution_fo: {} ...'.format(total_fo))
    print('Total solution_thr: {} ...'.format(total_thr))
    print('Total solution_sec: {} ...'.format(total_sec))
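
Each lookup in `generate_submission` repeats the same step: take the best clusters from one weight table and append those not yet chosen until five are filled. The sketch below isolates that step on made-up weights; the helper name `fill_top_clusters` and the two toy tables are assumptions for illustration.

```python
from heapq import nlargest
from operator import itemgetter

def fill_top_clusters(filled, cluster_weights, k, limit=5):
    """Append up to k best clusters from cluster_weights to filled, skipping duplicates."""
    top = nlargest(k, sorted(cluster_weights.items()), key=itemgetter(1))
    for cluster, _ in top:
        if len(filled) == limit:
            break
        if cluster not in filled:
            filled.append(cluster)
    return filled

# Toy weight tables for one test row, consulted in order of priority.
by_distance = {"91": 24.0, "41": 12.0}
by_destination = {"91": 900.0, "48": 700.0, "1": 650.0, "59": 10.0, "2": 5.0}

filled = []
fill_top_clusters(filled, by_distance, k=5)
fill_top_clusters(filled, by_destination, k=4)
print(" ".join(filled))  # 91 41 48 1 59
```

Tables consulted earlier take precedence, so the order of the lookups decides which source wins when the five slots run out.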

In [None]:
# best_hotel_mainWeight,best_hotel_secWeight, best_hotels_od_ulc, best_hotels_uid_miss = cluster_weight_collect()

In [None]:
# generate_submission(best_hotel_mainWeight,best_hotel_secWeight, best_hotels_od_ulc, best_hotels_uid_miss)

## Data shared by all models

In [4]:
if os.path.exists('./output/srch_dest_hc_hm_agg.csv'): 
    aggMod = pd.read_csv('./output/srch_dest_hc_hm_agg.csv')
else:

    # read the training file in chunks to limit memory use
    reader = pd.read_csv('./output/train.csv', parse_dates=['date_time', 'srch_ci', 'srch_co'], chunksize=200000)  # parse_dates converts the listed columns to datetime

    # per chunk: bookings (sum) and events (count) for each (srch_destination_id, hotel_country, hotel_market, hotel_cluster)
    pieces = [chunk.groupby(['srch_destination_id','hotel_country','hotel_market','hotel_cluster'])['is_booking'].agg(['sum','count']) for chunk in reader]
    agg = pd.concat(pieces).groupby(level=[0,1,2,3]).sum()

    del pieces # release memory
    agg.dropna(inplace=True) # drop rows with NaN

    # weighted blend of bookings and events
    agg['sum_and_cnt'] = 0.85*agg['sum'] + 0.15*agg['count'] 

    # normalize within each (srch_destination_id, hotel_country, hotel_market) group
    agg = agg.groupby(level=[0,1,2]).apply(lambda x: x.astype(float)/x.sum())
    agg.reset_index(inplace=True)

    # pivot so that each hotel cluster becomes a column
    aggMod = agg.pivot_table(index=['srch_destination_id','hotel_country','hotel_market'], columns='hotel_cluster', values='sum_and_cnt').reset_index()
    aggMod.to_csv('./output/srch_dest_hc_hm_agg.csv', index=False)
    # release memory
    del agg 
    
destinations = pd.read_csv('./input/destinations.csv')
submission = pd.read_csv('./input/sample_submission.csv')
aggMod

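|  | srch_destination_id | hotel_country | hotel_market | 0 | 1 | 2 | 3 | 4 | 5 | 6 | ... | 90 | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 100 | 796 |  |  |  | 1.000000 |  |  |  | ... |  |  |  |  |  |  |  |  |  |  |
| 1 | 1 | 76 | 1537 |  |  |  |  |  |  |  | ... |  |  |  |  |  |  |  |  |  |  |
| 2 | 2 | 48 | 152 |  |  |  |  |  | 0.050847 |  | ... |  |  |  |  |  |  |  |  |  |  |
| 3 | 3 | 17 | 1597 |  |  |  |  |  |  |  | ... |  |  |  |  |  |  |  |  |  |  |
| 4 | 4 | 7 | 246 |  |  |  | 0.006447 |  |  |  | ... | 0.001235 |  |  | 0.008642 |  |  |  |  |  | 0.014403 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 65608 | 65098 | 50 | 688 |  |  |  |  |  |  |  | ... |  |  |  |  | 0.666667 |  |  |  |  |  |
| 65609 | 65102 | 50 | 608 |  |  |  |  |  |  |  | ... |  |  |  |  |  |  |  |  |  |  |
| 65610 | 65103 | 50 | 608 |  |  |  |  |  |  |  | ... |  |  |  |  |  |  |  |  |  |  |
| 65611 | 65104 | 50 | 639 |  |  |  |  |  | 0.115385 |  | ... |  |  |  |  |  |  |  |  |  |  |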
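
The aggregation cell combines three ideas: partial group-by results per chunk, a second group-by to merge them, and a per-destination normalization before pivoting. The sketch below replays them on two made-up chunks. It is a simplification under stated assumptions: it uses two key columns instead of four, `transform` and `unstack` instead of `apply` and `pivot_table`, and the column names `score` and `share` are invented for the example.

```python
import pandas as pd

# Two toy "chunks" standing in for pieces of train.csv (values are made up).
chunks = [
    pd.DataFrame({"srch_destination_id": [1, 1, 1],
                  "hotel_cluster": [5, 5, 7],
                  "is_booking": [1, 0, 0]}),
    pd.DataFrame({"srch_destination_id": [1, 1],
                  "hotel_cluster": [5, 7],
                  "is_booking": [0, 1]}),
]

keys = ["srch_destination_id", "hotel_cluster"]
# Per-chunk partial sums, then a second groupby to merge the partial results.
pieces = [c.groupby(keys)["is_booking"].agg(["sum", "count"]) for c in chunks]
agg = pd.concat(pieces).groupby(level=[0, 1]).sum()

# Blend bookings and events, then normalize within each destination.
agg["score"] = 0.85 * agg["sum"] + 0.15 * agg["count"]
agg["share"] = agg["score"] / agg.groupby(level=0)["score"].transform("sum")

# One row per destination, one column per hotel cluster.
wide = agg["share"].unstack("hotel_cluster")
print(wide)
```

Because sums and counts are additive, merging per-chunk results gives the same totals as a single pass over the full file, without holding the file in memory.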


### Preprocessing (pre_process)

In [3]:
def pre_process(data):
    
    # create new features
    # length of stay in days
    data['srch_duration'] = data.srch_co-data.srch_ci
#     data['srch_duration'] = pd.to_datetime(data['srch_duration'],format = '%Y-%m-%d')
    data['srch_duration'] = data['srch_duration'].apply(lambda td: td/np.timedelta64(1, 'D')) # timedelta -> days
    
    # days from the event to check-in
    data['time_to_ci'] = data.srch_ci-data.date_time
    data['time_to_ci'] = data['time_to_ci'].apply(lambda td: td/np.timedelta64(1, 'D'))
    
    # check-in date parts
    data['ci_month'] = data['srch_ci'].apply(lambda dt: dt.month)
    data['ci_day'] = data['srch_ci'].apply(lambda dt: dt.day)
    #data['ci_year'] = data['srch_ci'].apply(lambda dt: dt.year)
    
    # event (booking) time parts
    data['bk_month'] = data['date_time'].apply(lambda dt: dt.month)
    data['bk_day'] = data['date_time'].apply(lambda dt: dt.day)
    #data['bk_year'] = data['date_time'].apply(lambda dt: dt.year)
    data['bk_hour'] = data['date_time'].apply(lambda dt: dt.hour)
    data.drop(['date_time', 'user_id', 'srch_ci', 'srch_co'], axis=1, inplace=True)
    
    data.fillna(0, inplace=True) # fill missing values with 0; the mean or the mode may be a more sensible choice
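
The same date features can be computed without `apply`, using pandas' vectorized timedelta arithmetic and the `.dt` accessor. The sketch below shows this on one made-up row; it is an alternative formulation, not the notebook's code.

```python
import pandas as pd

toy = pd.DataFrame({
    "date_time": pd.to_datetime(["2014-08-11 07:46:59"]),
    "srch_ci": pd.to_datetime(["2014-08-27"]),
    "srch_co": pd.to_datetime(["2014-08-31"]),
})

one_day = pd.Timedelta(days=1)
# Dividing a timedelta column by one day yields a float number of days.
toy["srch_duration"] = (toy["srch_co"] - toy["srch_ci"]) / one_day
toy["time_to_ci"] = (toy["srch_ci"] - toy["date_time"]) / one_day
# The .dt accessor extracts calendar parts for the whole column at once.
toy["ci_month"] = toy["srch_ci"].dt.month
toy["bk_hour"] = toy["date_time"].dt.hour
print(toy[["srch_duration", "time_to_ci", "ci_month", "bk_hour"]])
```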

### RandomForest

In [None]:
# training process

clf = RandomForestClassifier(n_estimators=0, n_jobs=-1, warm_start=True)
count = 0
chunksize = 200000
reader = pd.read_csv('./output/train.csv', parse_dates=['date_time', 'srch_ci', 'srch_co'], chunksize=chunksize)
for chunk in reader:
    try:
        chunk = chunk[chunk.is_booking==1]
        chunk = pd.merge(chunk, destinations, how='left', on='srch_destination_id') # join
        chunk = pd.merge(chunk, aggMod, how='left', on=['srch_destination_id','hotel_country','hotel_market'])
        pre_process(chunk) # pre-process
        y = chunk.hotel_cluster
        chunk.drop(['cnt', 'hotel_cluster', 'is_booking'], axis=1, inplace=True)
        
        # train only on chunks that contain all 100 clusters
        if len(y.unique()) == 100:
            clf.set_params(n_estimators=clf.n_estimators+1)
            clf.fit(chunk, y)
        
        count = count + chunksize
        print('%d rows completed' % count)
        if(count/chunksize == 300):
            break
    except Exception as e:
        print('Error: %s' % str(e))
        pass
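
The training loop grows the forest one tree per chunk by combining `warm_start=True` with an increasing `n_estimators`. A minimal sketch of that mechanism on synthetic data follows; the three-class toy generator `make_chunk` is an assumption, and every chunk is built to contain all classes, mirroring the `len(y.unique()) == 100` check above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

def make_chunk(n=60):
    """Toy chunk with 3 classes; every class appears in every chunk."""
    y = np.arange(n) % 3
    X = rng.normal(size=(n, 4)) + y[:, None]
    return X, y

forest = RandomForestClassifier(n_estimators=0, warm_start=True, random_state=0)
for _ in range(3):
    X, y = make_chunk()
    # With warm_start=True, raising n_estimators keeps the existing trees and
    # fits only the newly added one on the current chunk.
    forest.set_params(n_estimators=forest.n_estimators + 1)
    forest.fit(X, y)

print(len(forest.estimators_))
```

Each tree only ever sees its own chunk, so this is a memory-saving approximation of a forest trained on the full data rather than an equivalent of it.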



In [None]:
# prediction process
count = 0
chunksize = 10000
preds = np.empty((submission.shape[0],clf.n_classes_))
reader = pd.read_csv('./output/test.csv', parse_dates=['date_time', 'srch_ci', 'srch_co'], chunksize=chunksize)
for chunk in reader:
    try:
        chunk = pd.merge(chunk, destinations, how='left', on='srch_destination_id')
        chunk = pd.merge(chunk, aggMod, how='left', on=['srch_destination_id','hotel_country','hotel_market'])
        chunk.drop(['id'], axis=1, inplace=True)
        pre_process(chunk)

        pred = clf.predict_proba(chunk)
        preds[count:(count + chunk.shape[0]),:] = pred
        count = count + chunksize
        print('%d rows completed' % count)
    except Exception as e:
        print('Error: %s' % str(e))

### output predict result

In [None]:
del clf
print('writing current probabilities to file')
if os.path.exists('./output/probs/allpreds.h5'):
    with h5py.File('./output/probs/allpreds.h5', 'r+') as hf:
        print('reading in and combining probabilities')
        predslatesthf = hf['preds_latest']
        preds += predslatesthf[...]  # Dataset.value was removed in h5py 3.0
        print('writing latest probabilities to file')
        predslatesthf[...] = preds   # write back to the dataset that was read
else:
    with h5py.File('./output/probs/allpreds.h5', 'w') as hf:
        print('writing latest probabilities to file')
        hf.create_dataset('preds_latest', data=preds)

# print('generating submission')
# col_ind = np.argsort(-preds, axis=1)[:,:5]
# hc = [' '.join(row.astype(str)) for row in col_ind]
# sub = pd.DataFrame(data=hc, index=submission.id)
# sub.reset_index(inplace=True)
# sub.columns = submission.columns
# sub.to_csv('./output/pred_sub.csv', index=False)
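
The commented-out submission step turns a probability matrix into five space-separated cluster ids per row. The sketch below runs the same `argsort` idea on a made-up 2 x 6 matrix (the real one has 100 columns); the ids and probabilities are invented.

```python
import numpy as np
import pandas as pd

# Toy probabilities for 2 rows over 6 classes.
probs = np.array([
    [0.05, 0.30, 0.12, 0.25, 0.20, 0.08],
    [0.40, 0.05, 0.15, 0.10, 0.06, 0.24],
])
ids = [0, 1]

# Sort each row by descending probability and keep the 5 best class indices.
top5 = np.argsort(-probs, axis=1)[:, :5]
sub = pd.DataFrame({
    "id": ids,
    "hotel_cluster": [" ".join(map(str, row)) for row in top5],
})
print(sub)
```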

## GBDT modeling

### Evaluation metric

In [5]:
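# evaluation metric
def map5eval(preds, dtrain):
    actual = dtrain.get_label() # true labels
    # indices of the five most probable hotel clusters, best first
    # (-np.arange(5) would start at column 0, the least probable class)
    predicted = preds.argsort(axis=1)[:, -np.arange(1, 6)]
    metric = 0.
    for i in range(5):
        metric += np.sum(actual==predicted[:,i])/(i+1) # a hit at rank i+1 scores 1/(i+1)
    metric /= actual.shape[0]
    return 'MAP@5', -metric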
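
When every row has exactly one true cluster, MAP@5 reduces to the mean of 1/rank for rows whose true cluster appears in the top five, and 0 otherwise. The standalone sketch below works through a two-row example with made-up probabilities; the function name `map_at_5` is an assumption and the code does not depend on XGBoost.

```python
import numpy as np

def map_at_5(actual, probs):
    """Mean average precision at 5 when each row has exactly one true label."""
    # Class indices ordered from most to least probable, truncated to 5.
    top5 = np.argsort(-probs, axis=1)[:, :5]
    score = 0.0
    for rank in range(5):
        # A hit at position `rank` (0-based) contributes 1 / (rank + 1).
        score += np.sum(actual == top5[:, rank]) / (rank + 1)
    return score / actual.shape[0]

probs = np.array([
    [0.05, 0.30, 0.12, 0.25, 0.20, 0.08],   # ranking: 1, 3, 4, 2, 5
    [0.40, 0.05, 0.15, 0.10, 0.06, 0.24],   # ranking: 0, 5, 2, 3, 4
])
actual = np.array([3, 1])  # row 0: hit at rank 2; row 1: not in the top 5
print(map_at_5(actual, probs))  # 0.25
```

Row 0 contributes 1/2 and row 1 contributes 0, so the mean over two rows is 0.25.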

### GBDT

In [7]:
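clf = xgb.XGBClassifier(
                objective = 'multi:softmax', # simplest choice: multi-class softmax; one binary classifier per class is an alternative
                max_depth = 5, # maximum tree depth
                n_estimators=300, # number of trees
                learning_rate=0.01, # learning rate
                nthread=4, # number of threads
                subsample=0.7, # fraction of rows sampled per tree, to limit overfitting
                colsample_bytree=0.7, # fraction of columns sampled per tree
                min_child_weight = 3,
                silent=False) # print log messages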


if os.path.exists('rows_complete.txt'):
    with open('rows_complete.txt', 'r') as f:
        skipsize = int(f.readline())
else:
    skipsize = 0

skip = 0 if skipsize==0 else range(1, skipsize + 1) # line 0 is the header; skip data rows 1..skipsize
tchunksize = 1000000
print('%d rows will be skipped and next %d rows will be used for training' % (skipsize, tchunksize))
train = pd.read_csv('./output/train.csv', parse_dates=['date_time', 'srch_ci', 'srch_co'], skiprows=skip, nrows=tchunksize)
train = train[train.is_booking==1]
train = pd.merge(train, destinations, how='left', on='srch_destination_id')
train = pd.merge(train, aggMod, how='left', on=['srch_destination_id','hotel_country','hotel_market'])
pre_process(train)
y = train.hotel_cluster
train.drop(['cnt', 'hotel_cluster', 'is_booking'], axis=1, inplace=True)

0 rows will be skipped and next 1000000 rows will be used for training
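
Resuming from a row counter relies on how `read_csv` numbers lines: line 0 is the header, so skipping the first `n` data rows means skipping lines 1 through `n`. The sketch below checks this on an in-memory CSV; the variable names `rows_done` and `batch` and the data are made up.

```python
import io
import pandas as pd

# Header plus ten data rows whose first column is 1..10.
csv_text = "a,b\n" + "\n".join(f"{i},{i * 10}" for i in range(1, 11)) + "\n"

rows_done = 4   # data rows consumed by earlier runs
batch = 3       # rows to read now

# Line 0 is the header, so data rows 1..rows_done are the ones to skip.
skip = list(range(1, rows_done + 1)) if rows_done else None
df = pd.read_csv(io.StringIO(csv_text), skiprows=skip, nrows=batch)
print(df["a"].tolist())  # [5, 6, 7]
```

Ending the range at `rows_done` instead of `rows_done + 1` would re-read the last row of the previous batch.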


In [9]:
X_train, X_test, y_train, y_test = model_selection.train_test_split(train, y, stratify=y, test_size=0.2) # stratify keeps the class proportions the same in both splits
# eval_metric: metric to monitor; early_stopping_rounds: stop after this many rounds without improvement
clf.fit(X_train, y_train, early_stopping_rounds=50, eval_metric=map5eval, eval_set=[(X_train, y_train),(X_test, y_test)])

Parameters: { silent } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


[0]	validation_0-merror:0.79136	validation_1-merror:0.81925	validation_0-MAP@5:-0.18733	validation_1-MAP@5:-0.16417
Multiple eval metrics have been passed: 'validation_1-MAP@5' will be used for early stopping.

Will train until validation_1-MAP@5 hasn't improved in 50 rounds.
[1]	validation_0-merror:0.76981	validation_1-merror:0.79781	validation_0-MAP@5:-0.20477	validation_1-MAP@5:-0.18120
[2]	validation_0-merror:0.76486	validation_1-merror:0.79379	validation_0-MAP@5:-0.20867	validation_1-MAP@5:-0.18385
[3]	validation_0-merror:0.76115	validation_1-merror:0.79306	validation_0-MAP@5:-0.21083	validation_1-MAP@5:-0.18584
[4]	validation_0-merror:0.75917	validation_1-merror:0.79117	validation_0-MAP@5:-0.21236	valid

[66]	validation_0-merror:0.71184	validation_1-merror:0.77558	validation_0-MAP@5:-0.23522	validation_1-MAP@5:-0.19480
[67]	validation_0-merror:0.71152	validation_1-merror:0.77515	validation_0-MAP@5:-0.23539	validation_1-MAP@5:-0.19494
[68]	validation_0-merror:0.71157	validation_1-merror:0.77576	validation_0-MAP@5:-0.23555	validation_1-MAP@5:-0.19487
[69]	validation_0-merror:0.71084	validation_1-merror:0.77558	validation_0-MAP@5:-0.23578	validation_1-MAP@5:-0.19488
[70]	validation_0-merror:0.71056	validation_1-merror:0.77515	validation_0-MAP@5:-0.23594	validation_1-MAP@5:-0.19492
[71]	validation_0-merror:0.70998	validation_1-merror:0.77576	validation_0-MAP@5:-0.23611	validation_1-MAP@5:-0.19488
[72]	validation_0-merror:0.70942	validation_1-merror:0.77564	validation_0-MAP@5:-0.23630	validation_1-MAP@5:-0.19482
[73]	validation_0-merror:0.70930	validation_1-merror:0.77552	validation_0-MAP@5:-0.23647	validation_1-MAP@5:-0.19485
[74]	validation_0-merror:0.70903	validation_1-merror:0.77497	val

[136]	validation_0-merror:0.69165	validation_1-merror:0.77363	validation_0-MAP@5:-0.24565	validation_1-MAP@5:-0.19520
[137]	validation_0-merror:0.69129	validation_1-merror:0.77387	validation_0-MAP@5:-0.24579	validation_1-MAP@5:-0.19517
[138]	validation_0-merror:0.69117	validation_1-merror:0.77369	validation_0-MAP@5:-0.24595	validation_1-MAP@5:-0.19529
[139]	validation_0-merror:0.69085	validation_1-merror:0.77369	validation_0-MAP@5:-0.24604	validation_1-MAP@5:-0.19523
[140]	validation_0-merror:0.69076	validation_1-merror:0.77399	validation_0-MAP@5:-0.24612	validation_1-MAP@5:-0.19522
[141]	validation_0-merror:0.69060	validation_1-merror:0.77375	validation_0-MAP@5:-0.24627	validation_1-MAP@5:-0.19529
[142]	validation_0-merror:0.69040	validation_1-merror:0.77454	validation_0-MAP@5:-0.24636	validation_1-MAP@5:-0.19519
[143]	validation_0-merror:0.68992	validation_1-merror:0.77442	validation_0-MAP@5:-0.24653	validation_1-MAP@5:-0.19520
[144]	validation_0-merror:0.68966	validation_1-merror:0.

[206]	validation_0-merror:0.67478	validation_1-merror:0.77357	validation_0-MAP@5:-0.25376	validation_1-MAP@5:-0.19575
[207]	validation_0-merror:0.67443	validation_1-merror:0.77326	validation_0-MAP@5:-0.25389	validation_1-MAP@5:-0.19570
[208]	validation_0-merror:0.67434	validation_1-merror:0.77351	validation_0-MAP@5:-0.25399	validation_1-MAP@5:-0.19566
[209]	validation_0-merror:0.67408	validation_1-merror:0.77339	validation_0-MAP@5:-0.25413	validation_1-MAP@5:-0.19572
[210]	validation_0-merror:0.67384	validation_1-merror:0.77375	validation_0-MAP@5:-0.25424	validation_1-MAP@5:-0.19566
[211]	validation_0-merror:0.67353	validation_1-merror:0.77357	validation_0-MAP@5:-0.25440	validation_1-MAP@5:-0.19568
[212]	validation_0-merror:0.67326	validation_1-merror:0.77363	validation_0-MAP@5:-0.25455	validation_1-MAP@5:-0.19567
[213]	validation_0-merror:0.67306	validation_1-merror:0.77381	validation_0-MAP@5:-0.25464	validation_1-MAP@5:-0.19565
[214]	validation_0-merror:0.67289	validation_1-merror:0.

[276]	validation_0-merror:0.65957	validation_1-merror:0.77351	validation_0-MAP@5:-0.26090	validation_1-MAP@5:-0.19578
[277]	validation_0-merror:0.65925	validation_1-merror:0.77351	validation_0-MAP@5:-0.26100	validation_1-MAP@5:-0.19575
[278]	validation_0-merror:0.65940	validation_1-merror:0.77302	validation_0-MAP@5:-0.26105	validation_1-MAP@5:-0.19593
[279]	validation_0-merror:0.65910	validation_1-merror:0.77320	validation_0-MAP@5:-0.26118	validation_1-MAP@5:-0.19596
[280]	validation_0-merror:0.65886	validation_1-merror:0.77314	validation_0-MAP@5:-0.26129	validation_1-MAP@5:-0.19598
[281]	validation_0-merror:0.65861	validation_1-merror:0.77308	validation_0-MAP@5:-0.26138	validation_1-MAP@5:-0.19604
[282]	validation_0-merror:0.65832	validation_1-merror:0.77314	validation_0-MAP@5:-0.26146	validation_1-MAP@5:-0.19607
[283]	validation_0-merror:0.65822	validation_1-merror:0.77314	validation_0-MAP@5:-0.26155	validation_1-MAP@5:-0.19604
[284]	validation_0-merror:0.65822	validation_1-merror:0.

Error: name 'agg1' is not defined

In [12]:
# prediction process
count = 0
chunksize = 10000
preds = np.empty((submission.shape[0],clf.n_classes_))
reader = pd.read_csv('./output/test.csv', parse_dates=['date_time', 'srch_ci', 'srch_co'], chunksize=chunksize)
for chunk in reader:
    try:
        chunk = pd.merge(chunk, destinations, how='left', on='srch_destination_id')
        chunk = pd.merge(chunk, aggMod, how='left', on=['srch_destination_id','hotel_country','hotel_market'])
        chunk.drop(['id'], axis=1, inplace=True)
        pre_process(chunk)

        pred = clf.predict_proba(chunk)
        preds[count:(count + chunk.shape[0]),:] = pred
        count = count + chunksize
        print('%d rows completed' % count)
    except Exception as e:
        print('Error: %s' % str(e))

10000 rows completed
20000 rows completed
30000 rows completed
40000 rows completed
50000 rows completed
60000 rows completed
70000 rows completed
80000 rows completed
90000 rows completed
100000 rows completed
110000 rows completed
120000 rows completed
130000 rows completed
140000 rows completed
150000 rows completed
160000 rows completed
170000 rows completed
180000 rows completed
190000 rows completed
200000 rows completed
210000 rows completed
220000 rows completed
230000 rows completed
240000 rows completed
250000 rows completed
260000 rows completed
270000 rows completed
280000 rows completed
290000 rows completed
300000 rows completed
310000 rows completed
320000 rows completed
330000 rows completed
340000 rows completed
350000 rows completed
360000 rows completed
370000 rows completed
380000 rows completed
390000 rows completed
400000 rows completed
410000 rows completed
420000 rows completed
430000 rows completed
440000 rows completed
450000 rows completed
460000 rows complet

### output predict result

In [13]:
del clf
if os.path.exists('./output/probs/allpreds_xgb.h5'):
    with h5py.File('./output/probs/allpreds_xgb.h5', 'r+') as hf:
        print('reading in and combining probabilities')
        predshf = hf['preds']
        preds += predshf[...]  # Dataset.value was removed in h5py 3.0
        print('writing latest probabilities to file')
        predshf[...] = preds
else:
    with h5py.File('./output/probs/allpreds_xgb.h5', 'w') as hf:
        print('writing latest probabilities to file')
        hf.create_dataset('preds', data=preds)

# print('generating submission')
# col_ind = np.argsort(-preds, axis=1)[:,:5] # indices of the 5 highest-probability clusters
# hc = [' '.join(row.astype(str)) for row in col_ind]
# sub = pd.DataFrame(data=hc, index=submission.id)
# sub.reset_index(inplace=True)
# sub.columns = submission.columns
# sub.to_csv('./output/pred_sub.csv', index=False)


skipsize += tchunksize
with open('rows_complete.txt', 'w') as f:
    f.write(str(skipsize))

writing latest probabilities to file
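
The commented-out submission lines above rank all 100 clusters with a full `argsort` and keep the first five. As a minimal sketch (the helper name `top5_clusters` and the toy probabilities are illustrative, not from the notebook), `np.argpartition` can first select the five largest entries per row, so that only those five need sorting:

```python
import numpy as np

def top5_clusters(preds, k=5):
    """Return the column indices of the k largest values in each row, best first."""
    # argpartition moves the k largest entries of each row into the last k slots (unordered)
    part = np.argpartition(preds, -k, axis=1)[:, -k:]
    # sort just those k candidates by descending probability
    order = np.argsort(-np.take_along_axis(preds, part, axis=1), axis=1)
    return np.take_along_axis(part, order, axis=1)

# toy stand-in for the (n_rows, 100) probability matrix
rng = np.random.default_rng(0)
toy_preds = rng.random((4, 100))

top5 = top5_clusters(toy_preds)
submission_strings = [' '.join(row.astype(str)) for row in top5]
```

When no two values in a row are tied, the result matches `np.argsort(-preds, axis=1)[:, :5]`. With ties, the order among equal values may differ.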


## SGD classifier modeling

In [5]:
# categorical attributes to encode (they are hashed in pre_processSGD below)
cat_col = ['user_id', 'user_location_city',
           'srch_destination_id', 'srch_destination_type_id', 'hotel_continent',
           'hotel_country', 'hotel_market']

num_col = ['is_mobile', 'is_package']

# bin the number of days between the search and the check-in date
def bin_time(t):
    if t < 0:
        x = 0
    elif t < 2:
        x = 1
    elif t < 7:
        x = 2
    elif t < 30:
        x = 3
    else:
        x = 4    
    return x

def pre_processSGD(data):
    # NOTE: relies on the built-in hash(), which for strings is only stable within one Python process

    # interaction feature: check-in month x destination
    data['ci_month'] = data['srch_ci'].apply(lambda dt: dt.month)
    data['season_dest'] = 'season_dest' + data.ci_month.map(str) + '*' + data.srch_destination_id.map(str)
    data['season_dest'] = data['season_dest'].map(hash)
    # interaction feature: binned days until check-in x destination
    data['time_to_ci'] = data.srch_ci-data.date_time
    data['time_to_ci'] = data['time_to_ci'].apply(lambda td: td/np.timedelta64(1, 'D'))
    data['time_to_ci'] = data['time_to_ci'].map(bin_time)
    data['time_dest'] = 'time_dest' + data.time_to_ci.map(str) + '*' + data.srch_destination_id.map(str)
    data['time_dest'] = data['time_dest'].map(hash)
    data.fillna(0, inplace=True)
    
    # hash the categorical attributes, prefixed with the column name
    for col in cat_col:
        data[col] = col + data[col].map(str)
        data[col] = data[col].map(hash)

cat_col_all = cat_col + ['season_dest', 'time_dest'] 
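
One caveat with `map(hash)`: Python 3 salts string hashes per interpreter process unless `PYTHONHASHSEED` is set, so a classifier pickled in one session and reloaded in another would see different column indices for the same category. A minimal sketch of a process-independent alternative, assuming CRC32 is an acceptable hash here (the helper `stable_bucket` is hypothetical and not part of the notebook):

```python
import zlib

def stable_bucket(col, value, n_features):
    """Map a (column name, value) pair to a column index in [0, n_features)."""
    # prefix with the column name, as pre_processSGD does, so that equal ids
    # in different columns usually land in different buckets
    token = '%s%s' % (col, value)
    # crc32 depends only on the bytes, not on the interpreter's hash seed
    return zlib.crc32(token.encode('utf-8')) % n_features

n_features = 3000000
idx_a = stable_bucket('hotel_market', 628, n_features)
idx_b = stable_bucket('hotel_market', 628, n_features)
idx_c = stable_bucket('hotel_country', 628, n_features)
```

Because the index is already reduced modulo `n_features`, it could be used directly as the column index when building the sparse matrix.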

In [9]:
if os.path.exists('./output/probs/sgd.pkl'):
    with open('./output/probs/sgd.pkl', 'rb') as f:
        clf = pickle.load(f)
else:
    clf = SGDClassifier(loss='log', n_jobs=-1, alpha=0.0000025, verbose=0) # logistic loss, fitted one-vs-rest (not softmax); named 'log_loss' in scikit-learn >= 1.1
    
# for epoch in range(5):
epoch = 0 # single pass; the multi-epoch loop above is disabled
count = 0
chunksize = 200000
n_features = 3000000
print('Epoch %d started' % epoch)
reader = pd.read_csv('./output/train.csv', parse_dates=['date_time', 'srch_ci', 'srch_co'], chunksize=chunksize)
for chunk in reader:
    try:
        pre_processSGD(chunk)
        y = chunk.hotel_cluster
        sw = 1 + 4*chunk.is_booking # weight bookings 5x relative to clicks
        chunk.drop(['cnt', 'hotel_cluster', 'is_booking'], axis=1, inplace=True) # drop the target and non-feature columns

        # build the sparse design matrix: one hashed indicator column per categorical value
        XN = csr_matrix(chunk[num_col].values)
        X = csr_matrix((chunk.shape[0], n_features))
        rows = np.arange(chunk.shape[0])
        for col in cat_col_all:
            dat = np.ones(chunk.shape[0])
            cols = chunk[col] % n_features
            X += csr_matrix((dat, (rows, cols)), shape=(chunk.shape[0], n_features))
        X = hstack((XN, X)) # concatenate the numeric and hashed features
        # book_indices = sw[sw > 1].index.tolist()
        # X_test = csr_matrix(X)[book_indices]
        # y_test = y[book_indices]

        clf.partial_fit(X, y, classes=np.arange(100), sample_weight=sw)

        count = count + chunksize
        # map5 = map5eval(clf.predict_proba(X_test), y_test)
        # print('%d rows completed. MAP@5: %f' % (count, map5))
        print('%d rows completed' % count)
        if(count/chunksize == 200):
            break
    except Exception as e:
        print('Error: %s' % str(e))

Epoch 0 started
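
The commented-out evaluation lines in the training loop call `map5eval`, which is not defined in this section. A minimal sketch of the competition metric, MAP@5, assuming exactly one true cluster per row (the name `map5_score` is illustrative and is not the notebook's `map5eval`):

```python
def map5_score(probs, y_true, k=5):
    """Mean average precision at k when each row has a single correct label."""
    total = 0.0
    for row, label in zip(probs, y_true):
        # indices of the k highest-scoring classes, best first
        top = sorted(range(len(row)), key=lambda j: -row[j])[:k]
        # with one relevant item, average precision is 1/rank if it is in the top k, else 0
        if label in top:
            total += 1.0 / (top.index(label) + 1)
    return total / len(y_true)

toy_probs = [
    [0.1, 0.6, 0.3],   # true class 1 is ranked first  -> 1
    [0.5, 0.2, 0.3],   # true class 2 is ranked second -> 1/2
]
score = map5_score(toy_probs, [1, 2])   # (1 + 0.5) / 2 = 0.75
```

The function is written in plain Python for clarity. On the full prediction matrix a vectorized version would be considerably faster.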


In [10]:
# save the fitted model, then predict on the test set in chunks
with open('./output/probs/sgd.pkl', 'wb') as f:
    pickle.dump(clf, f)

count = 0
chunksize = 10000
preds = np.empty((0,100))
reader = pd.read_csv('./output/test.csv', parse_dates=['date_time', 'srch_ci', 'srch_co'], chunksize=chunksize)
for chunk in reader:
    #chunk = pd.merge(chunk, destinations, how='left', on='srch_destination_id')
    #chunk = pd.merge(chunk, agg1, how='left', on='srch_destination_id')
    chunk.drop(['id'], axis=1, inplace=True)
    pre_processSGD(chunk)
    
    XN = csr_matrix(chunk[num_col].values)
    X = csr_matrix((chunk.shape[0], n_features))
    rows = np.arange(chunk.shape[0])
    for col in cat_col_all:
        dat = np.ones(chunk.shape[0])
        cols = chunk[col] % n_features
        X += csr_matrix((dat, (rows, cols)), shape=(chunk.shape[0], n_features))
    X = hstack((XN, X))
    
    pred = clf.predict_proba(X)
    preds = np.vstack((preds, pred))
    count = count + chunksize
    print('%d rows completed' % count)

10000 rows completed
20000 rows completed
30000 rows completed
40000 rows completed
50000 rows completed
60000 rows completed
70000 rows completed
80000 rows completed
90000 rows completed
100000 rows completed
110000 rows completed
120000 rows completed
130000 rows completed
140000 rows completed
150000 rows completed
160000 rows completed
170000 rows completed
180000 rows completed
190000 rows completed
200000 rows completed
210000 rows completed
220000 rows completed
230000 rows completed
240000 rows completed
250000 rows completed
260000 rows completed
270000 rows completed
280000 rows completed
290000 rows completed
300000 rows completed
310000 rows completed
320000 rows completed
330000 rows completed
340000 rows completed
350000 rows completed
360000 rows completed
370000 rows completed
380000 rows completed
390000 rows completed
400000 rows completed
410000 rows completed
420000 rows completed
430000 rows completed
440000 rows completed
450000 rows completed
460000 rows completed
... (output truncated)

In [11]:
del clf
del reader
# save the results, adding them to any probabilities already on disk
if os.path.exists('./output/probs/allpreds_sgd.h5'):
    with h5py.File('./output/probs/allpreds_sgd.h5', 'r+') as hf:
        #print('reading in and combining probabilities')
        predshf = hf['preds']
        preds += predshf[...]  # Dataset.value was removed in h5py 3.0; [...] reads the full array
        print('writing latest probabilities to file')
        predshf[...] = preds
else:
    with h5py.File('./output/probs/allpreds_sgd.h5', 'w') as hf:
        print('writing latest probabilities to file')
        hf.create_dataset('preds', data=preds)

# col_ind = np.argsort(-preds, axis=1)[:,:5]
# hc = [' '.join(row.astype(str)) for row in col_ind]

# sub = pd.DataFrame(data=hc, index=submission.id)
# sub.reset_index(inplace=True)
# sub.columns = submission.columns
# sub.to_csv('./output/pred_sub.csv', index=False)

writing latest probabilities to file


## Naive Bayes modeling

In [15]:
if os.path.exists('./output/probs/bnb.pkl'):
    with open('./output/probs/bnb.pkl', 'rb') as f:
        clf = pickle.load(f)
else:
    clf = BernoulliNB(alpha=1.0)
#clf.sparsify()
# for epoch in range(1):
epoch = 0 # single pass; the epoch loop above is disabled
count = 0
chunksize = 200000
n_features = 1000000
print('Epoch %d started' % epoch)
reader = pd.read_csv('./output/train.csv', parse_dates=['date_time', 'srch_ci', 'srch_co'], chunksize=chunksize)
for chunk in reader:
    try:
        #chunk = chunk[chunk.is_booking==1]
        #chunk = pd.merge(chunk, destinations, how='left', on='srch_destination_id')
        #chunk = pd.merge(chunk, agg1, how='left', on='srch_destination_id')
        pre_processSGD(chunk)
        #chunk = chunk[chunk.ci_year==2014]
        y = chunk.hotel_cluster
        sw = 1 + 4*chunk.is_booking
        chunk.drop(['cnt', 'hotel_cluster', 'is_booking'], axis=1, inplace=True)

        XN = csr_matrix(chunk[num_col].values)
        X = csr_matrix((chunk.shape[0], n_features))
        rows = np.arange(chunk.shape[0])
        for col in cat_col_all:
            dat = np.ones(chunk.shape[0])
            cols = chunk[col] % n_features
            X += csr_matrix((dat, (rows, cols)), shape=(chunk.shape[0], n_features))
        X = hstack((XN, X))
#         book_indices = sw[sw > 1].index.tolist()
#         X_test = csr_matrix(X)[book_indices]
#         y_test = y[book_indices]

        clf.partial_fit(X, y, classes=np.arange(100), sample_weight=sw)

        count = count + chunksize
#             map5 = map5eval(clf.predict_proba(X_test), y_test)
#             print('%d rows completed. MAP@5: %f' % (count, map5)) # evaluation result
        if(count/chunksize == 200):
            break
    except Exception as e:
        #e = sys.exc_info()[0]
        print('Error: %s' % str(e))


Epoch 0 started


In [16]:
# save the fitted model, then predict on the test set in chunks
with open('./output/probs/bnb.pkl', 'wb') as f:
    pickle.dump(clf, f)

count = 0
chunksize = 10000
preds = np.empty((0,100))
reader = pd.read_csv('./output/test.csv', parse_dates=['date_time', 'srch_ci', 'srch_co'], chunksize=chunksize)
for chunk in reader:
    #chunk = pd.merge(chunk, destinations, how='left', on='srch_destination_id')
    #chunk = pd.merge(chunk, agg1, how='left', on='srch_destination_id')
    chunk.drop(['id'], axis=1, inplace=True)
    pre_processSGD(chunk)
    
    XN = csr_matrix(chunk[num_col].values)
    X = csr_matrix((chunk.shape[0], n_features))
    rows = np.arange(chunk.shape[0])
    for col in cat_col_all:
        dat = np.ones(chunk.shape[0])
        cols = chunk[col] % n_features
        X += csr_matrix((dat, (rows, cols)), shape=(chunk.shape[0], n_features))
    X = hstack((XN, X))
    
    pred = clf.predict_proba(X)
    preds = np.vstack((preds, pred))
    count = count + chunksize
    print('%d rows completed' % count)

10000 rows completed
20000 rows completed
30000 rows completed
40000 rows completed
50000 rows completed
60000 rows completed
70000 rows completed
80000 rows completed
90000 rows completed
100000 rows completed
110000 rows completed
120000 rows completed
130000 rows completed
140000 rows completed
150000 rows completed
160000 rows completed
170000 rows completed
180000 rows completed
190000 rows completed
200000 rows completed
210000 rows completed
220000 rows completed
230000 rows completed
240000 rows completed
250000 rows completed
260000 rows completed
270000 rows completed
280000 rows completed
290000 rows completed
300000 rows completed
310000 rows completed
320000 rows completed
330000 rows completed
340000 rows completed
350000 rows completed
360000 rows completed
370000 rows completed
380000 rows completed
390000 rows completed
400000 rows completed
410000 rows completed
420000 rows completed
430000 rows completed
440000 rows completed
450000 rows completed
460000 rows completed
... (output truncated)

In [17]:
del clf
del reader
# save the results, adding them to any probabilities already on disk
if os.path.exists('./output/probs/allpreds_bnb.h5'):
    with h5py.File('./output/probs/allpreds_bnb.h5', 'r+') as hf:
        #print('reading in and combining probabilities')
        predshf = hf['preds']
        preds += predshf[...]  # Dataset.value was removed in h5py 3.0; [...] reads the full array
        print('writing latest probabilities to file')
        predshf[...] = preds
else:
    with h5py.File('./output/probs/allpreds_bnb.h5', 'w') as hf:
        print('writing latest probabilities to file')
        hf.create_dataset('preds', data=preds)

# col_ind = np.argsort(-preds, axis=1)[:,:5]
# hc = [' '.join(row.astype(str)) for row in col_ind]

# sub = pd.DataFrame(data=hc, index=submission.id)
# sub.reset_index(inplace=True)
# sub.columns = submission.columns
# sub.to_csv('./output/pred_sub.csv', index=False)

writing latest probabilities to file


## Model ensembling (weighted blending)

In [20]:
# read in RF results
with h5py.File('./output/probs/allpreds.h5', 'r') as hf:
    predshf = hf['preds_latest']
    preds = 0.31*normalize(predshf[...], norm='l1', axis=1)

# read in XGB results
with h5py.File('./output/probs/allpreds_xgb.h5', 'r') as hf:
    predshf = hf['preds']
    preds += 0.39*normalize(predshf[...], norm='l1', axis=1)

# read in SGD results
with h5py.File('./output/probs/allpreds_sgd.h5', 'r') as hf:
    predshf = hf['preds']
    preds += 0.27*normalize(predshf[...], norm='l1', axis=1)

# read in Bernoulli NB results
with h5py.File('./output/probs/allpreds_bnb.h5', 'r') as hf:
    predshf = hf['preds']
    preds += 0.03*normalize(predshf[...], norm='l1', axis=1)

print('generating submission')
col_ind = np.argsort(-preds, axis=1)[:,:5]
hc = [' '.join(row.astype(str)) for row in col_ind]

sub = pd.DataFrame(data=hc, index=submission.id)
sub.reset_index(inplace=True)
sub.columns = submission.columns
sub.to_csv('./output/pred_sub.csv', index=False)



generating submission
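
The blend in this cell is a fixed-weight average of row-normalized probabilities from the four models, rather than a trained meta-model. Row normalization matters because each HDF5 file can hold probabilities accumulated over more than one run, so the scales may differ between models. A minimal NumPy sketch with toy inputs (the helper `l1_normalize_rows` stands in for `normalize(..., norm='l1', axis=1)` on non-negative input; all names and numbers are illustrative):

```python
import numpy as np

def l1_normalize_rows(p):
    """Scale each row to sum to 1; all-zero rows are left as zeros."""
    sums = np.abs(p).sum(axis=1, keepdims=True)
    return p / np.where(sums == 0, 1.0, sums)

# two toy "models" over 3 classes; model_b's rows are on a 2x larger scale
model_a = np.array([[0.2, 0.6, 0.2],
                    [0.6, 0.3, 0.1]])
model_b = np.array([[1.0, 0.2, 0.8],
                    [0.4, 1.4, 0.2]])

weights = {'a': 0.6, 'b': 0.4}   # illustrative weights that sum to 1
blend = (weights['a'] * l1_normalize_rows(model_a)
         + weights['b'] * l1_normalize_rows(model_b))

ranking = np.argsort(-blend, axis=1)   # best class first in each row
```

Because the weights sum to 1 and every row is normalized first, each row of `blend` is again a probability distribution.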


## Formatting the submission

In [21]:
match_pred = pd.read_csv('./output/match_pred.csv')
match_pred.fillna('', inplace=True)
match_pred = match_pred['hotel_cluster'].tolist()
match_pred = [s.split(' ') for s in match_pred]

pred_sub = pd.read_csv('./output/pred_sub.csv')
ids = pred_sub.id
pred_sub = pred_sub['hotel_cluster'].tolist()
pred_sub = [s.split(' ') for s in pred_sub]

# order-preserving de-duplication that skips empty strings; the top 5 are sliced off afterwards
def f5(seq, idfun=None): 
    if idfun is None:
        def idfun(x): return x
    seen = {}
    result = []
    for item in seq:
        marker = idfun(item)
        if (marker in seen) or (marker == ''): continue
        seen[marker] = 1
        result.append(item)
    return result
    
full_preds = [f5(match_pred[p] + pred_sub[p])[:5] for p in range(len(pred_sub))]

write_p = [" ".join([str(l) for l in p]) for p in full_preds]
write_frame = ["{0},{1}".format(ids[i], write_p[i]) for i in range(len(full_preds))]
write_frame = ["id,hotel_cluster"] + write_frame
with open("./output/predictions.csv", "w+") as f:
    f.write("\n".join(write_frame))
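
The `f5` helper is an order-preserving de-duplication that also skips empty strings. Putting `match_pred` first means the clusters from `match_pred.csv` take priority, and the blended model only fills the remaining slots. A compact sketch of the same idea using `dict.fromkeys`, which preserves insertion order in Python 3.7+ (the function name `merge_top5` and the toy lists are illustrative):

```python
def merge_top5(primary, fallback, k=5):
    """Concatenate two ranked lists, drop blanks and repeats, keep the first k."""
    merged = dict.fromkeys(item for item in primary + fallback if item != '')
    return list(merged)[:k]

# a row where the primary source offers two clusters
row_a = merge_top5(['91', '42'], ['42', '5', '91', '18', '7'])
# a row where the primary source is empty: ''.split(' ') yields ['']
row_b = merge_top5([''], ['3', '8', '1', '99', '64'])
```

Here `row_a` keeps `'91'` and `'42'` in front and fills the remaining three slots from the fallback list, while `row_b` falls back entirely to the model's ranking.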