# IDAO: expected time of orders in airports

Airports are special points for taxi service. Every day a lot of people use a taxi to get to the city centre from the airport.

One of important task is to predict how long a driver need to wait an order. It helps to understand what to do. Maybe the driver have to wait near doors, or can drink a tea, or even should drive to city center without an order.

We request you to solve a simple version of this prediction task.

**Task:** predict time of $k$ orders in airport (time since now when you get an order if you are $k$-th in queue), $k$ is one of 5 values (different for every airports).

**Data**
- train: number of order for every minutes for 6 months
- test: every test sample has datetime info + numer of order for every minutes for last 2 weeks

**Submission:** for every airport you should prepare a model which will be evaluated in submission system (code + model files). You can make different models for different airports.

**Evaluation:** for every airport for every $k$ sMAPE will be calculated and averaged. General leaderboard will be calculated via Borda count. 

## Baseline

In [2]:
!pip install catboost

Collecting catboost
[?25l  Downloading https://files.pythonhosted.org/packages/3e/62/b442e8d747e8a34ac8a981f7a4ff717c1f887aedb42c3f670660bda41af5/catboost-0.13.1-cp36-none-manylinux1_x86_64.whl (60.1MB)
[K    100% |████████████████████████████████| 60.1MB 864kB/s eta 0:00:01
Collecting enum34 (from catboost)
  Downloading https://files.pythonhosted.org/packages/af/42/cb9355df32c69b553e72a2e28daee25d1611d2c0d9c272aa1d34204205b2/enum34-1.1.6-py3-none-any.whl
[31mfastai 1.0.50.post1 requires nvidia-ml-py3, which is not installed.[0m
[31mthinc 6.12.1 has requirement msgpack<0.6.0,>=0.5.6, but you'll have msgpack 0.6.0 which is incompatible.[0m
Installing collected packages: enum34, catboost
Successfully installed catboost-0.13.1 enum34-1.1.6
[33mYou are using pip version 10.0.1, however version 19.0.3 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.[0m


In [25]:
!pip install xgboost

Collecting xgboost
[?25l  Downloading https://files.pythonhosted.org/packages/6a/49/7e10686647f741bd9c8918b0decdb94135b542fe372ca1100739b8529503/xgboost-0.82-py2.py3-none-manylinux1_x86_64.whl (114.0MB)
[K    100% |████████████████████████████████| 114.0MB 465kB/s eta 0:00:01   23% |███████▌                        | 26.5MB 47.2MB/s eta 0:00:02
[31mfastai 1.0.50.post1 requires nvidia-ml-py3, which is not installed.[0m
[31mthinc 6.12.1 has requirement msgpack<0.6.0,>=0.5.6, but you'll have msgpack 0.6.0 which is incompatible.[0m
Installing collected packages: xgboost
Successfully installed xgboost-0.82
[33mYou are using pip version 10.0.1, however version 19.0.3 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.[0m


In [28]:
!pip install lightgbm

Collecting lightgbm
[?25l  Downloading https://files.pythonhosted.org/packages/77/0f/5157e6b153b3d4a70dc5fbe2ab6f209604197590f387f03177b7a249ac60/lightgbm-2.2.3-py2.py3-none-manylinux1_x86_64.whl (1.2MB)
[K    100% |████████████████████████████████| 1.2MB 24.0MB/s ta 0:00:01
[31mfastai 1.0.50.post1 requires nvidia-ml-py3, which is not installed.[0m
[31mthinc 6.12.1 has requirement msgpack<0.6.0,>=0.5.6, but you'll have msgpack 0.6.0 which is incompatible.[0m
Installing collected packages: lightgbm
Successfully installed lightgbm-2.2.3
[33mYou are using pip version 10.0.1, however version 19.0.3 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.[0m


In [3]:
%pylab inline

import catboost
import pandas as pd
import pickle
import tqdm

Populating the interactive namespace from numpy and matplotlib


Let's prepare a model for set2.

# Load train dataset

In [81]:
set_name = 'set1'
path_train_set = '../../data/train/{}.csv'.format(set_name)

data = pd.read_csv(path_train_set)
data.datetime = data.datetime.apply(
    lambda x: datetime.datetime.strptime(x, '%Y-%m-%d %H:%M:%S'))
data = data.sort_values('datetime')
data.head()

Unnamed: 0,datetime,num_orders
0,2018-03-01 00:00:00,0
1,2018-03-01 00:01:00,0
2,2018-03-01 00:02:00,0
3,2018-03-01 00:03:00,0
4,2018-03-01 00:04:00,1


Predict position for set2.

In [82]:
target_positions = {
    'set1': [10, 30, 45, 60, 75],
    'set2': [5, 10, 15, 20, 25],
    'set3': [5, 7, 9, 11, 13]
}[set_name]

Some useful constant.

In [83]:
HOUR_IN_MINUTES = 60
DAY_IN_MINUTES = 24 * HOUR_IN_MINUTES
WEEK_IN_MINUTES = 7 * DAY_IN_MINUTES

MAX_TIME = DAY_IN_MINUTES

## Generate train samples with targets

We have only history of orders (count of orders in every minutes) but we need to predict time of k orders since current minutes. So we should calculate target for train set. Also we will make a lot of samples from all set (we can only use two weeks of history while prediction so we can use only two weeks in every train sample).

In [84]:
samples = {
    'datetime': [],
    'history': []}

for position in target_positions:
    samples['target_{}'.format(position)] = []
    
num_orders = data.num_orders.values

To calculate target (minutes before k orders) we are going to use cumulative sum of orders. 

In [85]:
target_positions

[10, 30, 45, 60, 75]

In [86]:
# start after 2 weeks because of history
# finish earlier because of target calculation
for i in range(2 * WEEK_IN_MINUTES,
               len(num_orders) - 2 * DAY_IN_MINUTES):
    
    samples['datetime'].append(data.datetime[i])
    samples['history'].append(num_orders[i-2*WEEK_IN_MINUTES:i])
    
    # cumsum not for all array because of time economy
    cumsum_num_orders = num_orders[i+1:i+1+2*DAY_IN_MINUTES].cumsum()
    for position in target_positions:
        orders_by_positions = np.where(cumsum_num_orders >= position)[0]
        if len(orders_by_positions):
            time = orders_by_positions[0] + 1
        else:
            # if no orders in last days
            time = MAX_TIME
        samples['target_{}'.format(position)].append(time)

Convert to pandas.dataframe. Now we have targets to train and predict.

In [127]:
df = pd.DataFrame.from_dict(samples)
df.head()

Unnamed: 0,datetime,history,target_10,target_30,target_45,target_60,target_75
0,2018-03-15 00:00:00,"[0, 0, 0, 0, 1, 2, 0, 1, 1, 4, 0, 1, 1, 1, 1, ...",5,18,28,32,42
1,2018-03-15 00:01:00,"[0, 0, 0, 1, 2, 0, 1, 1, 4, 0, 1, 1, 1, 1, 3, ...",5,19,27,32,42
2,2018-03-15 00:02:00,"[0, 0, 1, 2, 0, 1, 1, 4, 0, 1, 1, 1, 1, 3, 2, ...",7,20,27,33,43
3,2018-03-15 00:03:00,"[0, 1, 2, 0, 1, 1, 4, 0, 1, 1, 1, 1, 3, 2, 3, ...",7,21,26,35,42
4,2018-03-15 00:04:00,"[1, 2, 0, 1, 1, 4, 0, 1, 1, 1, 1, 3, 2, 3, 1, ...",7,20,26,35,42


# Train model

Let's generate simple features.

By time:

In [128]:
df['weekday'] = df.datetime.apply(lambda x: x.weekday())
df['hour'] = df.datetime.apply(lambda x: x.hour)
df['minute'] = df.datetime.apply(lambda x: x.minute)
df['month'] = df.datetime.apply(lambda x: x.month)
df['is_night'] = df.datetime.apply(lambda x: 1.0 if (x.hour < 6) else 0.0)
df['is_weekend'] = df.datetime.apply(lambda x: 1.0 if (x.weekday() >=5)
                              or ((x.day == 23) and (x.month == 2))
                              or ((x.day == 8) and (x.month == 3))
                              or ((x.day == 9) and (x.month == 3))
                              or ((x.day == 30) and (x.month == 4))
                              or ((x.day == 1) and (x.month == 5))
                              or ((x.day == 2) and (x.month == 5))
                              or ((x.day == 9) and (x.month == 5))
                              or ((x.day == 11) and (x.month == 6))
                              or ((x.day == 12) and (x.month == 6))
                             else 0.0)

Aggregators by order history with different shift and window size:

In [129]:
SHIFTS = [
    HOUR_IN_MINUTES // 4,
    HOUR_IN_MINUTES // 2,
    HOUR_IN_MINUTES,
    DAY_IN_MINUTES,
    DAY_IN_MINUTES * 2,
    WEEK_IN_MINUTES,
    WEEK_IN_MINUTES * 2
]
WINDOWS = [
    HOUR_IN_MINUTES // 4,
    HOUR_IN_MINUTES // 2,
    HOUR_IN_MINUTES,
    DAY_IN_MINUTES,
    DAY_IN_MINUTES * 2,
    WEEK_IN_MINUTES,
    WEEK_IN_MINUTES * 2
]

In [130]:
for shift in SHIFTS:
    for window in WINDOWS:
        if window > shift:
            continue
        df['num_orders_{}_{}'.format(shift, window)] = \
            df.history.apply(lambda x: x[-shift : -shift + window].sum())

In [160]:
list(df)

['datetime',
 'history',
 'target_10',
 'target_30',
 'target_45',
 'target_60',
 'target_75',
 'weekday',
 'hour',
 'minute',
 'month',
 'is_night',
 'is_weekend',
 'num_orders_15_15',
 'num_orders_30_15',
 'num_orders_30_30',
 'num_orders_60_15',
 'num_orders_60_30',
 'num_orders_60_60',
 'num_orders_1440_15',
 'num_orders_1440_30',
 'num_orders_1440_60',
 'num_orders_1440_1440',
 'num_orders_2880_15',
 'num_orders_2880_30',
 'num_orders_2880_60',
 'num_orders_2880_1440',
 'num_orders_2880_2880',
 'num_orders_10080_15',
 'num_orders_10080_30',
 'num_orders_10080_60',
 'num_orders_10080_1440',
 'num_orders_10080_2880',
 'num_orders_10080_10080',
 'num_orders_20160_15',
 'num_orders_20160_30',
 'num_orders_20160_60',
 'num_orders_20160_1440',
 'num_orders_20160_2880',
 'num_orders_20160_10080',
 'num_orders_20160_20160']

In [None]:
df['hh00'] = df.history.apply(lambda x: x[-1440 : -1380].sum())

In [153]:
sum_10 = df.groupby(['hour'])['target_10'].sum().reset_index()

In [155]:
temp = pd.merge(df, sum_10, how='left', on=['hour'])

In [156]:
temp.head(2)

Unnamed: 0,datetime,history,target_10_x,target_30,target_45,target_60,target_75,weekday,hour,minute,...,num_orders_10080_2880,num_orders_10080_10080,num_orders_20160_15,num_orders_20160_30,num_orders_20160_60,num_orders_20160_1440,num_orders_20160_2880,num_orders_20160_10080,num_orders_20160_20160,target_10_y
0,2018-03-15 00:00:00,"[0, 0, 0, 0, 1, 2, 0, 1, 1, 4, 0, 1, 1, 1, 1, ...",5,18,28,32,42,3,0,0,...,2735,0,13,51,124,1364,2870,9110,0,52367
1,2018-03-15 00:01:00,"[0, 0, 0, 1, 2, 0, 1, 1, 4, 0, 1, 1, 1, 1, 3, ...",5,19,27,32,42,3,0,1,...,2735,0,16,53,126,1364,2872,9111,0,52367


Train/validation split for time. Let's use last 4 weeks for validation.

In [131]:
df.datetime.min(), df.datetime.max()

(Timestamp('2018-03-15 00:00:00'), Timestamp('2018-08-29 23:59:00'))

In [142]:
df_train = df.loc[df.datetime <= df.datetime.max() - datetime.timedelta(days=28)]
df_test = df.loc[df.datetime > df.datetime.max() - datetime.timedelta(days=28)]
#from sklearn.model_selection import train_test_split
#df_train, df_test = train_test_split(df, test_size=0.2, shuffle=True)

In [143]:
target_cols = ['target_{}'.format(position) for position in target_positions]

y_train = df_train[target_cols]
y_test = df_test[target_cols]

X_all = df.drop(['datetime', 'history'] + target_cols, axis=1)
y_all = df[target_cols]

df_train = df_train.drop(['datetime', 'history'] + target_cols, axis=1)#[USE_COLUMNS]
df_test = df_test.drop(['datetime', 'history'] + target_cols, axis=1)#[USE_COLUMNS]

In [144]:
def sMAPE(y_true, y_predict, shift=0):
    return 2 * np.mean(
        np.abs(y_true - y_predict) /
        (np.abs(y_true) + np.abs(y_predict) + shift))

Also we will save models for prediction stage.

In [145]:
model_to_save = {
    'models': {}
}

What is good or bad model? We can compare our model with constant solution. For instance median (optimal solution for MAE).

# Catboost

## target_10

In [30]:
%%time
for v_depth in range(3, 12, 1):
    model = catboost.CatBoostRegressor(
        iterations=2000, learning_rate=1.0, 
        depth = v_depth,
        l2_leaf_reg = 9,
        rsm = 0.2,
        
        loss_function='MAE', 
        early_stopping_rounds = 90, random_seed=27)
   
    model.fit(
        X=df_train,
        y=y_train['target_10'],
        use_best_model=True,
        eval_set=(df_test, y_test['target_10']),        
        verbose=False
    )
    y_predict = model.predict(df_test.values)
    print('v_depth=' + str(v_depth) + ': ' + str(sMAPE(y_test['target_10'], y_predict)))

v_depth=3: 0.3407250093994297
v_depth=4: 0.3376348871430672
v_depth=5: 0.3399467150445846
v_depth=6: 0.3401358336005973
v_depth=7: 0.3421795583495161
v_depth=8: 0.3418284449381795
v_depth=9: 0.3423852705892325
v_depth=10: 0.3401323037240509
v_depth=11: 0.3430035894506455
CPU times: user 4min 58s, sys: 32.4 s, total: 5min 30s
Wall time: 3min 42s


In [31]:
%%time
for v_l2_leaf_reg in range(1, 10, 1):
    model = catboost.CatBoostRegressor(
        iterations=2000, learning_rate=1.0, 
        depth = 4,
        rsm=0.2,
        l2_leaf_reg = v_l2_leaf_reg,
        loss_function='MAE', 
        early_stopping_rounds = 90, random_seed=27)
    
    model.fit(
        X=df_train,
        y=y_train['target_10'],
        use_best_model=True,
        eval_set=(df_test, y_test['target_10']),        
        verbose=False
    )
    y_predict = model.predict(df_test.values)
    print('v_l2_leaf_reg=' + str(v_l2_leaf_reg) + ': ' + str(sMAPE(y_test['target_10'], y_predict)))

v_l2_leaf_reg=1: 0.33899264269555335
v_l2_leaf_reg=2: 0.33912953502706583
v_l2_leaf_reg=3: 0.3421593573708883
v_l2_leaf_reg=4: 0.3393027398941565
v_l2_leaf_reg=5: 0.33792551394088705
v_l2_leaf_reg=6: 0.3391940684026751
v_l2_leaf_reg=7: 0.3392409598962607
v_l2_leaf_reg=8: 0.33955300983103176
v_l2_leaf_reg=9: 0.3376348871430672
CPU times: user 3min 53s, sys: 28.8 s, total: 4min 21s
Wall time: 3min 32s


In [29]:
for v_rsm in range(1, 11, 1):
    model = catboost.CatBoostRegressor(
        iterations=2000, learning_rate=1.0, 
        depth = 4,
        l2_leaf_reg = 9,
        rsm = v_rsm/10.0,
        loss_function='MAE', 
        early_stopping_rounds = 90, random_seed=27)
    
    model.fit(
        X=df_train,
        y=y_train['target_10'],
        use_best_model=True,
        eval_set=(df_test, y_test['target_10']),        
        verbose=False
    )
    y_predict = model.predict(df_test.values)
    print('v_rsm=' + str(v_rsm) + ': ' + str(sMAPE(y_test['target_10'], y_predict)))

v_rsm=1: 0.3394996586258325
v_rsm=2: 0.3376348871430672
v_rsm=3: 0.3394945870782449
v_rsm=4: 0.33827531731932686
v_rsm=5: 0.33956122708702935
v_rsm=6: 0.34008802966161017
v_rsm=7: 0.3400869348342845
v_rsm=8: 0.33828824715886757
v_rsm=9: 0.3395698672528007
v_rsm=10: 0.3379209967795093


In [110]:
model = catboost.CatBoostRegressor(
        iterations=5000, learning_rate=0.065, 
    #iterations=2000, learning_rate=0.185, 
    #0.25 - 0.3480764023878587
    #0.2 - 0.3422943987201056
    #0.1 - 0.3430843793177033
    #0.15 - 0.34139884443162544
    #0.175 - 0.3408655647346902
    ####0.185 - 0.3402415746271159
    #0.19 - 0.34074440162565656
    ##____on 5000_______
    #0.3385978767318361 (0.065)
        #depth = 4,
        #rsm=0.2,
        #l2_leaf_reg = 9,
    depth = 6,
        #l2_leaf_reg = 9,
        #rsm = 0.5,
        loss_function='MAE', 
    thread_count = 8,
        random_seed=27)
    
model.fit(
        X=df_train,
        y=y_train['target_10'],
        #use_best_model=True,
        #eval_set=(df_test, y_test['target_10']),        
        verbose=False
    )
y_predict = model.predict(df_test)
print(sMAPE(y_test['target_10'], y_predict))

0.33084618503134244


## target_30

In [40]:
%%time
for v_depth in range(3, 12, 1):
    model = catboost.CatBoostRegressor(
        iterations=2000, learning_rate=1.0, 
        depth = v_depth,
        l2_leaf_reg = 7,
        rsm = 0.1,
        loss_function='MAE', 
        early_stopping_rounds = 90, random_seed=27)
   
    model.fit(
        X=df_train,
        y=y_train['target_30'],
        use_best_model=True,
        eval_set=(df_test, y_test['target_30']),        
        verbose=False
    )
    y_predict = model.predict(df_test.values)
    print('v_depth=' + str(v_depth) + ': ' + str(sMAPE(y_test['target_30'], y_predict)))

v_depth=3: 0.27341783822677757
v_depth=4: 0.2747439730593105
v_depth=5: 0.2734742809257496
v_depth=6: 0.2755386514275678
v_depth=7: 0.27171242838737825


KeyboardInterrupt: 

In [39]:
%%time
for v_l2_leaf_reg in range(1, 10, 1):
    model = catboost.CatBoostRegressor(
        iterations=2000, learning_rate=1.0, 
        depth = 7,
        l2_leaf_reg = v_l2_leaf_reg,
        loss_function='MAE', 
        early_stopping_rounds = 90, random_seed=27)
    
    model.fit(
        X=df_train,
        y=y_train['target_30'],
        use_best_model=True,
        eval_set=(df_test, y_test['target_30']),        
        verbose=False
    )
    y_predict = model.predict(df_test.values)
    print('v_l2_leaf_reg=' + str(v_l2_leaf_reg) + ': ' + str(sMAPE(y_test['target_30'], y_predict)))

v_l2_leaf_reg=1: 0.2748519627917496
v_l2_leaf_reg=2: 0.2731332633991016
v_l2_leaf_reg=3: 0.2732656485759227
v_l2_leaf_reg=4: 0.27251297502984184
v_l2_leaf_reg=5: 0.27307764127618683
v_l2_leaf_reg=6: 0.27327182219842244
v_l2_leaf_reg=7: 0.2721956033759729
v_l2_leaf_reg=8: 0.2736765356614582
v_l2_leaf_reg=9: 0.27379755342131284
CPU times: user 8min 3s, sys: 48.7 s, total: 8min 52s
Wall time: 6min 15s


In [41]:
for v_rsm in range(1, 11, 1):
    model = catboost.CatBoostRegressor(
        iterations=2000, learning_rate=1.0, 
        depth = 7,
        #l2_leaf_reg = 7,
        rsm = v_rsm/10.0,
        loss_function='MAE', 
        early_stopping_rounds = 90, random_seed=27)
    
    model.fit(
        X=df_train,
        y=y_train['target_30'],
        use_best_model=True,
        eval_set=(df_test, y_test['target_30']),        
        verbose=False
    )
    y_predict = model.predict(df_test.values)
    print('v_rsm=' + str(v_rsm) + ': ' + str(sMAPE(y_test['target_30'], y_predict)))

v_rsm=1: 0.27176684971429727
v_rsm=2: 0.27303525460169753
v_rsm=3: 0.2749600987957908
v_rsm=4: 0.27387804816757966
v_rsm=5: 0.27428326460132546
v_rsm=6: 0.27435605151205555
v_rsm=7: 0.27315727544583485
v_rsm=8: 0.27306134434327944
v_rsm=9: 0.2753791959293799
v_rsm=10: 0.2732656485759227


In [50]:
model = catboost.CatBoostRegressor(
        iterations=2000, learning_rate=0.185, 
    #1.0 - 0.33083758264677765
    #0.2 - 0.2728408110905381
    #0.185 - 0.2731732120163993
        depth = 7,
        rsm = 0.1,
        loss_function='MAE', 
        random_seed=27)
    
model.fit(
        X=df_train,
        y=y_train['target_30'],
        #use_best_model=True,
        #eval_set=(df_test, y_test['target_30']),        
        verbose=False
    )
y_predict = model.predict(df_test)
print(sMAPE(y_test['target_30'], y_predict))

0.2731732120163993


## target_45

In [45]:
%%time
for v_depth in range(3, 12, 1):
    model = catboost.CatBoostRegressor(
        iterations=2000, learning_rate=1.0, 
        depth = v_depth,
        l2_leaf_reg = 8,
        rsm = 1.0,
        loss_function='MAE', 
        early_stopping_rounds = 90, random_seed=27)
   
    model.fit(
        X=df_train,
        y=y_train['target_45'],
        use_best_model=True,
        eval_set=(df_test, y_test['target_45']),        
        verbose=False
    )
    y_predict = model.predict(df_test.values)
    print('v_depth=' + str(v_depth) + ': ' + str(sMAPE(y_test['target_45'], y_predict)))

v_depth=3: 0.24504100551622598
v_depth=4: 0.24524275249712338
v_depth=5: 0.24375791754140905
v_depth=6: 0.24472655392382
v_depth=7: 0.24571646513539191
v_depth=8: 0.24650731482996616
v_depth=9: 0.24930437461408872
v_depth=10: 0.2471636387737767
v_depth=11: 0.2493814228238389
CPU times: user 12min 38s, sys: 1min, total: 13min 38s
Wall time: 10min 26s


In [43]:
%%time
for v_l2_leaf_reg in range(1, 10, 1):
    model = catboost.CatBoostRegressor(
        iterations=2000, learning_rate=1.0, 
        depth = 5,
        l2_leaf_reg = v_l2_leaf_reg,
        loss_function='MAE', 
        early_stopping_rounds = 90, random_seed=27)
    
    model.fit(
        X=df_train,
        y=y_train['target_45'],
        use_best_model=True,
        eval_set=(df_test, y_test['target_45']),        
        verbose=False
    )
    y_predict = model.predict(df_test.values)
    print('v_l2_leaf_reg=' + str(v_l2_leaf_reg) + ': ' + str(sMAPE(y_test['target_45'], y_predict)))

v_l2_leaf_reg=1: 0.244370794974937
v_l2_leaf_reg=2: 0.24517408581223385
v_l2_leaf_reg=3: 0.24402074187809866
v_l2_leaf_reg=4: 0.2449731017227979
v_l2_leaf_reg=5: 0.24496754524740447
v_l2_leaf_reg=6: 0.2439829349041889
v_l2_leaf_reg=7: 0.24490330633292565
v_l2_leaf_reg=8: 0.24375791754140905
v_l2_leaf_reg=9: 0.2450167108627412
CPU times: user 7min 38s, sys: 47 s, total: 8min 25s
Wall time: 6min 24s


In [44]:
for v_rsm in range(1, 11, 1):
    model = catboost.CatBoostRegressor(
        iterations=2000, learning_rate=1.0, 
        depth = 5,
        l2_leaf_reg = 8,
        rsm = v_rsm/10.0,
        loss_function='MAE', 
        early_stopping_rounds = 90, random_seed=27)
    
    model.fit(
        X=df_train,
        y=y_train['target_45'],
        use_best_model=True,
        eval_set=(df_test, y_test['target_45']),        
        verbose=False
    )
    y_predict = model.predict(df_test.values)
    print('v_rsm=' + str(v_rsm) + ': ' + str(sMAPE(y_test['target_45'], y_predict)))

v_rsm=1: 0.24584044638024763
v_rsm=2: 0.2451200300538713
v_rsm=3: 0.24697456607563126
v_rsm=4: 0.2463391145697011
v_rsm=5: 0.24446064146747973
v_rsm=6: 0.247093625776781
v_rsm=7: 0.24559688583018044
v_rsm=8: 0.2475052744276464
v_rsm=9: 0.244835239348872
v_rsm=10: 0.24375791754140905


In [54]:
model = catboost.CatBoostRegressor(
        iterations=2000, learning_rate=0.225, 
    #0.2 - 0.2484354361121851
    #0.25 - 0.24686332460367313
    #0.3 - 0.25223371025308794
    #0.225 - 0.2455687434748517
        depth = 5,
        l2_leaf_reg = 8,
        rsm = 1.0,
        loss_function='MAE', 
        random_seed=27)
    
model.fit(
        X=df_train,
        y=y_train['target_45'],
        #use_best_model=True,
        #eval_set=(df_test, y_test['target_45']),        
        verbose=False
    )
y_predict = model.predict(df_test)
print(sMAPE(y_test['target_45'], y_predict))

0.2455687434748517


## target_60

In [49]:
%%time
for v_depth in range(3, 10, 1):
    model = catboost.CatBoostRegressor(
        iterations=2000, learning_rate=1.0, 
        depth = v_depth,
        l2_leaf_reg = 3,
        rsm = 1.0,
        loss_function='MAE', 
        early_stopping_rounds = 90, random_seed=27)
   
    model.fit(
        X=df_train,
        y=y_train['target_60'],
        use_best_model=True,
        eval_set=(df_test, y_test['target_60']),        
        verbose=False
    )
    y_predict = model.predict(df_test.values)
    print('v_depth=' + str(v_depth) + ': ' + str(sMAPE(y_test['target_60'], y_predict)))

v_depth=3: 0.22869128728277777
v_depth=4: 0.22754564768152855
v_depth=5: 0.2256039149280791
v_depth=6: 0.22723255762011438


KeyboardInterrupt: 

In [47]:
%%time
for v_l2_leaf_reg in range(1, 10, 1):
    model = catboost.CatBoostRegressor(
        iterations=2000, learning_rate=1.0, 
        depth = 5,
        l2_leaf_reg = v_l2_leaf_reg,
        loss_function='MAE', 
        early_stopping_rounds = 90, random_seed=27)
    
    model.fit(
        X=df_train,
        y=y_train['target_60'],
        use_best_model=True,
        eval_set=(df_test, y_test['target_60']),        
        verbose=False
    )
    y_predict = model.predict(df_test.values)
    print('v_l2_leaf_reg=' + str(v_l2_leaf_reg) + ': ' + str(sMAPE(y_test['target_60'], y_predict)))

v_l2_leaf_reg=1: 0.2263683885326
v_l2_leaf_reg=2: 0.22657779262747868
v_l2_leaf_reg=3: 0.2256039149280791
v_l2_leaf_reg=4: 0.22632925352542238
v_l2_leaf_reg=5: 0.2266057735462952
v_l2_leaf_reg=6: 0.2256351841995213
v_l2_leaf_reg=7: 0.22569713711432646
v_l2_leaf_reg=8: 0.2257793112691084
v_l2_leaf_reg=9: 0.22710825116276842
CPU times: user 7min 48s, sys: 49.3 s, total: 8min 38s
Wall time: 6min 26s


In [48]:
for v_rsm in range(1, 11, 1):
    model = catboost.CatBoostRegressor(
        iterations=2000, learning_rate=1.0, 
        depth = 5,
        l2_leaf_reg = 3,
        rsm = v_rsm/10.0,
        loss_function='MAE', 
        early_stopping_rounds = 90, random_seed=27)
    
    model.fit(
        X=df_train,
        y=y_train['target_60'],
        use_best_model=True,
        eval_set=(df_test, y_test['target_60']),        
        verbose=False
    )
    y_predict = model.predict(df_test.values)
    print('v_rsm=' + str(v_rsm) + ': ' + str(sMAPE(y_test['target_60'], y_predict)))

v_rsm=1: 0.22696204398304698
v_rsm=2: 0.22777610062834314
v_rsm=3: 0.22830541753892158
v_rsm=4: 0.22744102342104214
v_rsm=5: 0.22739879442500877
v_rsm=6: 0.22611171815847256
v_rsm=7: 0.22634538305488394
v_rsm=8: 0.22655252035270784
v_rsm=9: 0.22742205086290362
v_rsm=10: 0.2256039149280791


In [58]:
model = catboost.CatBoostRegressor(
        iterations=2000, learning_rate=0.3, 
    #0.2 - 0.23210230685652697
    #0.25 - 0.228305145518675
    #0.275 - 0.22777540888270587
    #0.3 - 0.22962564076077827
        depth = 5,
        l2_leaf_reg = 3,
        rsm = 1.0,
        loss_function='MAE', 
        random_seed=27)
    
model.fit(
        X=df_train,
        y=y_train['target_60'],
        #use_best_model=True,
        #eval_set=(df_test, y_test['target_60']),        
        verbose=False
    )
y_predict = model.predict(df_test)
print(sMAPE(y_test['target_60'], y_predict))

0.22962564076077827


## target_75

In [25]:
%%time
for v_depth in range(3, 12, 1):
    model = catboost.CatBoostRegressor(
        iterations=2000, learning_rate=1.0, 
        depth = v_depth,
        loss_function='MAE', 
        early_stopping_rounds = 90, random_seed=27)
   
    model.fit(
        X=df_train,
        y=y_train['target_75'],
        use_best_model=True,
        eval_set=(df_test, y_test['target_75']),        
        verbose=False
    )
    y_predict = model.predict(df_test.values)
    print('v_depth=' + str(v_depth) + ': ' + str(sMAPE(y_test['target_75'], y_predict)))

v_depth=3: 0.21488234438900966
v_depth=4: 0.2130404359832791
v_depth=5: 0.21048511431856992
v_depth=6: 0.21064588933634565
v_depth=7: 0.21049139027854166
v_depth=8: 0.21168356024376472
v_depth=9: 0.21290286532695593
v_depth=10: 0.214686699514435
v_depth=11: 0.21428716514222107
CPU times: user 14min 15s, sys: 23.4 s, total: 14min 38s
Wall time: 4min 14s


In [36]:
%%time
for v_l2_leaf_reg in range(1, 10, 1):
    model = catboost.CatBoostRegressor(
        iterations=2000, learning_rate=1.0, 
        depth = 5,
        l2_leaf_reg = v_l2_leaf_reg,
        loss_function='MAE', 
        early_stopping_rounds = 90, random_seed=27)
    
    model.fit(
        X=df_train,
        y=y_train['target_75'],
        use_best_model=True,
        eval_set=(df_test, y_test['target_75']),        
        verbose=False
    )
    y_predict = model.predict(df_test.values)
    print('v_l2_leaf_reg=' + str(v_l2_leaf_reg) + ': ' + str(sMAPE(y_test['target_75'], y_predict)))

v_l2_leaf_reg=1: 0.21070080833544702
v_l2_leaf_reg=2: 0.20983240758607208
v_l2_leaf_reg=3: 0.21048511431856992
v_l2_leaf_reg=4: 0.21130366914015386
v_l2_leaf_reg=5: 0.2100850234938446
v_l2_leaf_reg=6: 0.20958792446020888
v_l2_leaf_reg=7: 0.20944312370031354
v_l2_leaf_reg=8: 0.21033535100251263
v_l2_leaf_reg=9: 0.210379013176623
CPU times: user 8min 35s, sys: 18.3 s, total: 8min 53s
Wall time: 2min 39s


In [37]:
for v_rsm in range(1, 11, 1):
    model = catboost.CatBoostRegressor(
        iterations=2000, learning_rate=1.0, 
        depth = 5,
        l2_leaf_reg = 7,
        rsm = v_rsm/10.0,
        loss_function='MAE', 
        early_stopping_rounds = 90, random_seed=27)
    
    model.fit(
        X=df_train,
        y=y_train['target_75'],
        use_best_model=True,
        eval_set=(df_test, y_test['target_75']),        
        verbose=False
    )
    y_predict = model.predict(df_test.values)
    print('v_rsm=' + str(v_rsm) + ': ' + str(sMAPE(y_test['target_75'], y_predict)))

v_rsm=1: 0.21316833909448887
v_rsm=2: 0.2106599665469864
v_rsm=3: 0.21315959165972648
v_rsm=4: 0.21111538378120498
v_rsm=5: 0.21212455804944952
v_rsm=6: 0.2122581457496143
v_rsm=7: 0.21182466013384862
v_rsm=8: 0.2111375653948338
v_rsm=9: 0.21077742703838698
v_rsm=10: 0.20944312370031354


In [59]:
model = catboost.CatBoostRegressor(
        iterations=2000, learning_rate=0.3,
    #0.3 - 0.21159578238912014
        depth = 5,
        l2_leaf_reg = 7,
        rsm = 1.0,
        loss_function='MAE', 
        random_seed=27)
    
model.fit(
        X=df_train,
        y=y_train['target_75'],
        #use_best_model=True,
        #eval_set=(df_test, y_test['target_75']),        
        verbose=False
    )
y_predict = model.predict(df_test)
print(sMAPE(y_test['target_75'], y_predict))

0.21159578238912014


# Final model

In [146]:
for position in target_positions:
    #model = catboost.CatBoostRegressor(iterations=2000, learning_rate=0.5, 
    #    depth = 5,
    #    l2_leaf_reg = 7,
    #    rsm = 1.0,
    #    loss_function='MAE', 
    #    random_seed=27)
   
        
    #model = catboost.CatBoostRegressor(iterations=5000, learning_rate=0.065,
    #    depth = 6,
    #    l2_leaf_reg = 9,
    #    rsm = 0.5,
    #    use_best_model=True,
    #    loss_function='MAE', 
    #    random_seed=27)
    
    model = catboost.CatBoostRegressor(iterations=2000, learning_rate=1,
        depth = 6,
        l2_leaf_reg = 9,
        rsm = 0.5,
        #use_best_model=True,
        loss_function='MAE', 
        random_seed=27)
    
    model.fit(
        #X=X_all,
        #y=y_all['target_{}'.format(position)],
        X=df_train,
        y=y_train['target_{}'.format(position)],
        #use_best_model=True,
        #eval_set=(df_test, y_test['target_{}'.format(position)]),
        verbose=False)
    y_predict = model.predict(df_test)
    
    print('target_{}'.format(position))
    print('stupid:\t{}'.format(sMAPE(
        y_test['target_{}'.format(position)],
        y_train['target_{}'.format(position)].median())))
    print('model:\t{}'.format(sMAPE(
        y_test['target_{}'.format(position)],
        y_predict)))
    print()
    
    model_to_save['models'][position] = model

target_10
stupid:	0.582173151841312
model:	0.4279075154492326

target_30
stupid:	0.5388110900373055
model:	0.3415715528994696

target_45
stupid:	0.5280161445107636
model:	0.3389273268162911

target_60
stupid:	0.5090823344692388
model:	0.28821776828973544

target_75
stupid:	0.5070227037322823
model:	0.25291448293972907



In [None]:
from sklearn.model_selection import StratifiedKFold, RepeatedKFold
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from math import sqrt

folds = StratifiedKFold(n_splits = 4, shuffle = True)
train_predictions = np.zeros(len(df_train))
test_predictions = np.zeros(len(y_test))

v_round = 0
for train_index, test_index in folds.split(df_train, y_train['target_10']):   
    D_train = catboost.Pool(df_train.iloc[train_index], y_train['target_10'].iloc[train_index])
    D_val = catboost.Pool(df_train.iloc[test_index], y_train['target_10'].iloc[test_index])
    
    model = catboost.CatBoostRegressor(iterations=5000,
                                        learning_rate=0.065,
                                        depth=6,
                                        use_best_model=True,
                                        l2_leaf_reg=9,
                                        rsm = 0.5, 
                                        loss_function='MAE',
                                        early_stopping_rounds = 90,
                                        random_seed=27,
                                        verbose=False
                                        )
    model.fit(D_train, eval_set=D_val, verbose=False)
    
    v_round = v_round + 1
    print("Ending round " + str(v_round))
    
    model.save_model('model_cb_stack_00_' + str(v_round))
    nb_trees = model.tree_count_

    print('nb_trees={}'.format(nb_trees))    
    y_predict = model.predict(df_test)
    print(sMAPE(y_test['target_10'], y_predict))
    #test_predictions += -model.predict_proba(pd.concat([test[USE_COLUMNS].fillna(0.0), test_dop], axis=1).values)[:, 1] / folds.n_splits

In [46]:
target_positions

[10, 30, 45, 60, 75]

In [115]:
for position in target_positions:
    if str(position) == '10':
        print('10')
        model = catboost.CatBoostRegressor(iterations=2000, learning_rate=0.185, 
            depth = 4,
            rsm=0.2,
            l2_leaf_reg = 9,
            loss_function='MAE', 
            random_seed=27)
    else:
        print('not 10')

    if str(position) == '30':
        print('30')
        model = catboost.CatBoostRegressor(iterations=2000, learning_rate=0.2, 
            depth = 7,
            rsm = 0.1,
            loss_function='MAE', 
            random_seed=27)
    else:
        print('not 30')
        
    if str(position) == '45':
        print('45')
        model = catboost.CatBoostRegressor(iterations=2000, learning_rate=0.225, 
            depth = 5,
            l2_leaf_reg = 8,
            rsm = 1.0,
            loss_function='MAE', 
            random_seed=27)
    else:
        print('not 45')
        
    if str(position) == '60':
        print('60')
        model = catboost.CatBoostRegressor(iterations=2000, learning_rate=0.275, 
            depth = 5,
            l2_leaf_reg = 3,
            rsm = 1.0,
            loss_function='MAE', 
            random_seed=27)
    else:
        print('not 60')
        
    if str(position) == '75':
        print('75')
        model = catboost.CatBoostRegressor(iterations=2000, learning_rate=0.3, 
            depth = 5,
            l2_leaf_reg = 7,
            rsm = 1.0,
            loss_function='MAE', 
            random_seed=27)
    else:
        print('not 75')
        
        
    model.fit(
            #X_all,
            #y_all['target_{}'.format(position)],
            X=df_train,
            y=y_train['target_{}'.format(position)],
            use_best_model=True,
            eval_set=(df_test, y_test['target_{}'.format(position)]),
            verbose=False)
    y_predict = model.predict(df_test)
    
    print('target_{}'.format(position))
    print('stupid:\t{}'.format(sMAPE(
        y_test['target_{}'.format(position)],
        y_train['target_{}'.format(position)].median())))
    print('model:\t{}'.format(sMAPE(
        y_test['target_{}'.format(position)],
        y_predict)))
    print()
    
    model_to_save['models'][position] = model

10
not 30
not 45
not 60
not 75
target_10
stupid:	0.48386792223329617
model:	0.3410269814192636

not 10
30
not 45
not 60
not 75
target_30
stupid:	0.4302487056582887
model:	0.25251923306633584

not 10
not 30
45
not 60
not 75
target_45
stupid:	0.4072633534830279
model:	0.22272875166029688

not 10
not 30
not 45
60
not 75
target_60
stupid:	0.39121547448984445
model:	0.20001966635694723

not 10
not 30
not 45
not 60
75
target_75
stupid:	0.37884360367050135
model:	0.18319720840779294



Our model is better than constant solution. Saving model.

In [161]:
for i in range(len(model.feature_importances_)):
    print(model.feature_names_[i] + ': ' + str(model.feature_importances_[i]))

weekday: 4.007895525075691
hour: 17.349974066420536
minute: 1.7986234838007633
month: 2.1860958906711025
is_night: 2.2460229053723526
is_weekend: 0.8583152474301983
num_orders_15_15: 0.0
num_orders_30_15: 1.654996279103456
num_orders_30_30: 0.0
num_orders_60_15: 0.29497559093322295
num_orders_60_30: 2.216242237750835
num_orders_60_60: 0.0
num_orders_1440_15: 0.43457722866754145
num_orders_1440_30: 0.9342477137100073
num_orders_1440_60: 5.6926639063916
num_orders_1440_1440: 0.0
num_orders_2880_15: 0.2585020653584616
num_orders_2880_30: 1.2289562557584521
num_orders_2880_60: 5.369404656653815
num_orders_2880_1440: 6.092214620270421
num_orders_2880_2880: 0.0
num_orders_10080_15: 0.3390455009156572
num_orders_10080_30: 1.315381957574868
num_orders_10080_60: 7.90865182432367
num_orders_10080_1440: 5.72251556679531
num_orders_10080_2880: 6.101137452731748
num_orders_10080_10080: 0.0
num_orders_20160_15: 0.2701789984012382
num_orders_20160_30: 1.106458983794173
num_orders_20160_60: 5.36831858

In [147]:
pickle.dump(model_to_save, open('models.pkl', 'wb'))

In [None]:
# кол-во случайных наборов гиперпараметров
N_HYPEROPT_PROBES = 500

# алгоритм сэмплирования гиперпараметров
HYPEROPT_ALGO = tpe.suggest  #  tpe.suggest OR hyperopt.rand.suggest

In [None]:
colorama.init()

In [None]:
D_train = catboost.Pool(train_part.drop(['feedback'], axis=1).values, train_part.feedback.values)
D_val = catboost.Pool(validation.drop(['feedback'], axis=1).values, validation.feedback.values)

In [None]:
def get_catboost_params(space):
    params = dict()
    params['learning_rate'] = space['learning_rate']
    params['depth'] = int(space['depth'])
    params['l2_leaf_reg'] = space['l2_leaf_reg']
    params['rsm'] = space['rsm']
    return params

In [None]:
obj_call_count = 0
cur_best_loss = np.inf
log_writer = open( 'catboost-hyperopt-log.txt', 'w' )

In [None]:
def objective(space):
    global obj_call_count, cur_best_loss

    obj_call_count += 1

    print('\nCatBoost objective call #{} cur_best_loss={:7.5f}'.format(obj_call_count,cur_best_loss) )

    params = get_catboost_params(space)

    sorted_params = sorted(space.items(), key=lambda z: z[0])
    params_str = str.join(' ', ['{}={}'.format(k, v) for k, v in sorted_params])
    print('Params: {}'.format(params_str) )

    model = catboost.CatBoostClassifier(iterations=2000,
                                        learning_rate=params['learning_rate'],
                                        depth=int(params['depth']),
                                        use_best_model=True,
                                        l2_leaf_reg=params['l2_leaf_reg'],
                                        early_stopping_rounds = 10,
                                        random_seed=27,
                                        verbose=False
                                        )
    model.fit(D_train, eval_set=D_val, verbose=False)
    nb_trees = model.tree_count_

    print('nb_trees={}'.format(nb_trees))

    y_pred = model.predict_proba(validation.drop(['feedback'], axis=1).values)

    test_loss = sklearn.metrics.log_loss(validation.feedback.values, y_pred, labels=list(range(2)))
    acc = sklearn.metrics.accuracy_score(validation.feedback.values, np.argmax(y_pred, axis=1))

    log_writer.write('loss={:<7.5f} acc={} Params:{} nb_trees={}\n'.format(test_loss, acc, params_str, nb_trees ))
    log_writer.flush()

    if test_loss<cur_best_loss:
        cur_best_loss = test_loss
        print(colorama.Fore.GREEN + 'NEW BEST LOSS={}'.format(cur_best_loss) + colorama.Fore.RESET)


    return{'loss':test_loss, 'status': STATUS_OK }

In [None]:
space ={
        'depth': hp.quniform("depth", 4, 11, 1),
        'rsm': hp.uniform ('rsm', 0.75, 1.0),
        'learning_rate': hp.loguniform('learning_rate', -3.0, -0.7),
        'l2_leaf_reg': hp.uniform('l2_leaf_reg', 1, 10),
       }

In [None]:
trials = Trials()

In [None]:
%%time
best = hyperopt.fmin(fn=objective,
                     space=space,
                     algo=HYPEROPT_ALGO,
                     max_evals=N_HYPEROPT_PROBES,
                     trials=trials,verbose=False)