# IDAO: expected time of orders in airports

Airports are special points for taxi service. Every day a lot of people use a taxi to get to the city centre from the airport.

One of important task is to predict how long a driver need to wait an order. It helps to understand what to do. Maybe the driver have to wait near doors, or can drink a tea, or even should drive to city center without an order.

We request you to solve a simple version of this prediction task.

**Task:** predict time of $k$ orders in airport (time since now when you get an order if you are $k$-th in queue), $k$ is one of 5 values (different for every airports).

**Data**
- train: number of order for every minutes for 6 months
- test: every test sample has datetime info + numer of order for every minutes for last 2 weeks

**Submission:** for every airport you should prepare a model which will be evaluated in submission system (code + model files). You can make different models for different airports.

**Evaluation:** for every airport for every $k$ sMAPE will be calculated and averaged. General leaderboard will be calculated via Borda count. 

## Baseline

In [2]:
!pip install catboost

Collecting catboost
[?25l  Downloading https://files.pythonhosted.org/packages/3e/62/b442e8d747e8a34ac8a981f7a4ff717c1f887aedb42c3f670660bda41af5/catboost-0.13.1-cp36-none-manylinux1_x86_64.whl (60.1MB)
[K    100% |████████████████████████████████| 60.1MB 864kB/s eta 0:00:01
[?25hCollecting enum34 (from catboost)
  Downloading https://files.pythonhosted.org/packages/af/42/cb9355df32c69b553e72a2e28daee25d1611d2c0d9c272aa1d34204205b2/enum34-1.1.6-py3-none-any.whl
[31mfastai 1.0.50.post1 requires nvidia-ml-py3, which is not installed.[0m
[31mthinc 6.12.1 has requirement msgpack<0.6.0,>=0.5.6, but you'll have msgpack 0.6.0 which is incompatible.[0m
Installing collected packages: enum34, catboost
Successfully installed catboost-0.13.1 enum34-1.1.6
[33mYou are using pip version 10.0.1, however version 19.0.3 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.[0m


In [1]:
%pylab inline

import catboost
import pandas as pd
import pickle
import tqdm

Populating the interactive namespace from numpy and matplotlib


Let's prepare a model for set2.

# Load train dataset

In [2]:
set_name = 'set2'
path_train_set = '../../data/train/{}.csv'.format(set_name)

data = pd.read_csv(path_train_set)
data.datetime = data.datetime.apply(
    lambda x: datetime.datetime.strptime(x, '%Y-%m-%d %H:%M:%S'))
data = data.sort_values('datetime')
data.head()

Unnamed: 0,datetime,num_orders
0,2018-04-01 00:00:00,0
1,2018-04-01 00:01:00,0
2,2018-04-01 00:02:00,0
3,2018-04-01 00:03:00,0
4,2018-04-01 00:04:00,0


Predict position for set2.

In [3]:
target_positions = {
    'set1': [10, 30, 45, 60, 75],
    'set2': [5, 10, 15, 20, 25],
    'set3': [5, 7, 9, 11, 13]
}[set_name]

Some useful constant.

In [4]:
HOUR_IN_MINUTES = 60
DAY_IN_MINUTES = 24 * HOUR_IN_MINUTES
WEEK_IN_MINUTES = 7 * DAY_IN_MINUTES

MAX_TIME = DAY_IN_MINUTES

## Generate train samples with targets

We have only history of orders (count of orders in every minutes) but we need to predict time of k orders since current minutes. So we should calculate target for train set. Also we will make a lot of samples from all set (we can only use two weeks of history while prediction so we can use only two weeks in every train sample).

In [5]:
samples = {
    'datetime': [],
    'history': []}

for position in target_positions:
    samples['target_{}'.format(position)] = []
    
num_orders = data.num_orders.values

To calculate target (minutes before k orders) we are going to use cumulative sum of orders. 

In [6]:
# start after 2 weeks because of history
# finish earlier because of target calculation
for i in range(2 * WEEK_IN_MINUTES,
               len(num_orders) - 2 * DAY_IN_MINUTES):
    
    samples['datetime'].append(data.datetime[i])
    samples['history'].append(num_orders[i-2*WEEK_IN_MINUTES:i])
    
    # cumsum not for all array because of time economy
    cumsum_num_orders = num_orders[i+1:i+1+2*DAY_IN_MINUTES].cumsum()
    for position in target_positions:
        orders_by_positions = np.where(cumsum_num_orders >= position)[0]
        if len(orders_by_positions):
            time = orders_by_positions[0] + 1
        else:
            # if no orders in last days
            time = MAX_TIME
        samples['target_{}'.format(position)].append(time)

Convert to pandas.dataframe. Now we have targets to train and predict.

In [7]:
df = pd.DataFrame.from_dict(samples)
df.head()

Unnamed: 0,datetime,history,target_10,target_15,target_20,target_25,target_5
0,2018-04-15 00:00:00,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",221,247,285,297,205
1,2018-04-15 00:01:00,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",220,246,284,296,204
2,2018-04-15 00:02:00,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",219,245,283,295,203
3,2018-04-15 00:03:00,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",218,244,282,294,202
4,2018-04-15 00:04:00,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",217,243,281,293,201


# Train model

Let's generate simple features.

By time:

In [8]:
df['weekday'] = df.datetime.apply(lambda x: x.weekday())
df['hour'] = df.datetime.apply(lambda x: x.hour)
df['minute'] = df.datetime.apply(lambda x: x.minute)
df['month'] = df.datetime.apply(lambda x: x.month)
df['is_night'] = df.datetime.apply(lambda x: 1.0 if (x.hour < 6) else 0.0)
df['is_weekend'] = df.datetime.apply(lambda x: 1.0 if (x.weekday() >=5)
                              or ((x.day == 23) and (x.month == 2))
                              or ((x.day == 8) and (x.month == 3))
                              or ((x.day == 9) and (x.month == 3))
                              or ((x.day == 30) and (x.month == 4))
                              or ((x.day == 1) and (x.month == 5))
                              or ((x.day == 2) and (x.month == 5))
                              or ((x.day == 9) and (x.month == 5))
                              or ((x.day == 11) and (x.month == 6))
                              or ((x.day == 12) and (x.month == 6))
                             else 0.0)

Aggregators by order history with different shift and window size:

In [9]:
SHIFTS = [
    HOUR_IN_MINUTES // 4,
    HOUR_IN_MINUTES // 2,
    HOUR_IN_MINUTES,
    DAY_IN_MINUTES,
    DAY_IN_MINUTES * 2,
    DAY_IN_MINUTES * 3,
    WEEK_IN_MINUTES,
    WEEK_IN_MINUTES * 2]
WINDOWS = [
    HOUR_IN_MINUTES // 4,
    HOUR_IN_MINUTES // 2,
    HOUR_IN_MINUTES,
    DAY_IN_MINUTES,
    DAY_IN_MINUTES * 2,
    WEEK_IN_MINUTES,
    WEEK_IN_MINUTES * 2]

In [10]:
for shift in SHIFTS:
    for window in WINDOWS:
        if window > shift:
            continue
        df['num_orders_{}_{}'.format(shift, window)] = \
            df.history.apply(lambda x: x[-shift : -shift + window].sum())

Train/validation split for time. Let's use last 4 weeks for validation.

In [11]:
df.datetime.min(), df.datetime.max()

(Timestamp('2018-04-15 00:00:00'), Timestamp('2018-09-28 23:59:00'))

In [12]:
df_train = df.loc[df.datetime <= df.datetime.max() - datetime.timedelta(days=28)]
df_test = df.loc[df.datetime > df.datetime.max() - datetime.timedelta(days=28)]

In [13]:
target_cols = ['target_{}'.format(position) for position in target_positions]

y_train = df_train[target_cols]
y_test = df_test[target_cols]

X = df.drop(['datetime', 'history'] + target_cols, axis=1)
y = df[target_cols]

df_train = df_train.drop(['datetime', 'history'] + target_cols, axis=1)#[USE_COLUMNS]
df_test = df_test.drop(['datetime', 'history'] + target_cols, axis=1)#[USE_COLUMNS]

In [14]:
def sMAPE(y_true, y_predict, shift=0):
    return 2 * np.mean(
        np.abs(y_true - y_predict) /
        (np.abs(y_true) + np.abs(y_predict) + shift))

Also we will save models for prediction stage.

In [15]:
model_to_save = {
    'models': {}
}

What is good or bad model? We can compare our model with constant solution. For instance median (optimal solution for MAE).

# Catboost

## target_5

In [20]:
%%time
for v_depth in range(3, 12, 1):
    model = catboost.CatBoostRegressor(
        iterations=2000, learning_rate=1.0, 
        depth = v_depth,
        l2_leaf_reg = 4,
        rsm = 0.8,
        loss_function='MAE', 
        early_stopping_rounds = 90, random_seed=27)
   
    model.fit(
        X=df_train,
        y=y_train['target_5'],
        use_best_model=True,
        eval_set=(df_test, y_test['target_5']),        
        verbose=False
    )
    y_predict = model.predict(df_test.values)
    print('v_depth=' + str(v_depth) + ': ' + str(sMAPE(y_test['target_5'], y_predict)))

v_depth=3: 0.5133809975626664
v_depth=4: 0.5130358130215024
v_depth=5: 0.5061526058864672
v_depth=6: 0.5046022272737238
v_depth=7: 0.5051384627675342
v_depth=8: 0.5034897095778563
v_depth=9: 0.509571830527369


KeyboardInterrupt: 

In [18]:
%%time
for v_l2_leaf_reg in range(1, 10, 1):
    model = catboost.CatBoostRegressor(
        iterations=2000, learning_rate=1.0, 
        depth = 8,
        l2_leaf_reg = v_l2_leaf_reg,
        loss_function='MAE', 
        early_stopping_rounds = 90, random_seed=27)
    
    model.fit(
        X=df_train,
        y=y_train['target_5'],
        use_best_model=True,
        eval_set=(df_test, y_test['target_5']),        
        verbose=False
    )
    y_predict = model.predict(df_test.values)
    print('v_l2_leaf_reg=' + str(v_l2_leaf_reg) + ': ' + str(sMAPE(y_test['target_5'], y_predict)))

v_l2_leaf_reg=1: 0.5061859249094719
v_l2_leaf_reg=2: 0.5078597806320355
v_l2_leaf_reg=3: 0.505117588969297
v_l2_leaf_reg=4: 0.5049194276451837
v_l2_leaf_reg=5: 0.506473345779041
v_l2_leaf_reg=6: 0.5074714766487004
v_l2_leaf_reg=7: 0.5073509515493094
v_l2_leaf_reg=8: 0.5057991401605848
v_l2_leaf_reg=9: 0.5064977267987417
CPU times: user 35min 21s, sys: 3min 20s, total: 38min 42s
Wall time: 24min 16s


In [19]:
for v_rsm in range(1, 11, 1):
    model = catboost.CatBoostRegressor(
        iterations=2000, learning_rate=1.0, 
        depth = 8,
        l2_leaf_reg = 4,
        rsm = v_rsm/10.0,
        loss_function='MAE', 
        early_stopping_rounds = 90, random_seed=27)
    
    model.fit(
        X=df_train,
        y=y_train['target_5'],
        use_best_model=True,
        eval_set=(df_test, y_test['target_5']),        
        verbose=False
    )
    y_predict = model.predict(df_test.values)
    print('v_rsm=' + str(v_rsm) + ': ' + str(sMAPE(y_test['target_5'], y_predict)))

v_rsm=1: 0.5253536714940809
v_rsm=2: 0.5089861944944947
v_rsm=3: 0.5094590470899786
v_rsm=4: 0.5060533971527077
v_rsm=5: 0.5080096406961717
v_rsm=6: 0.506068501407763
v_rsm=7: 0.5066143044173538
v_rsm=8: 0.5034897095778563
v_rsm=9: 0.5061909641499894
v_rsm=10: 0.5049194276451837


In [246]:
model = catboost.CatBoostRegressor(
        iterations=2000, learning_rate=1.0, 
        depth = 8,
        l2_leaf_reg = 4,
        rsm = 0.8,
        loss_function='MAE', 
        early_stopping_rounds = 90, random_seed=27)
    
model.fit(
        X=df_train,
        y=y_train['target_5'],
        use_best_model=True,
        eval_set=(df_test, y_test['target_5']),        
        verbose=True
    )
y_predict = model.predict(df_test)
print(sMAPE(y_test['target_5'], y_predict))

0:	learn: 297.4982231	test: 170.3100609	best: 170.3100609 (0)	total: 24.5ms	remaining: 49s
1:	learn: 296.9986235	test: 169.8425610	best: 169.8425610 (1)	total: 49.9ms	remaining: 49.8s
2:	learn: 296.4989501	test: 169.3593320	best: 169.3593320 (2)	total: 74.6ms	remaining: 49.6s
3:	learn: 295.9992551	test: 168.8597825	best: 168.8597825 (3)	total: 98.7ms	remaining: 49.2s
4:	learn: 295.4997752	test: 168.3599119	best: 168.3599119 (4)	total: 124ms	remaining: 49.5s
5:	learn: 295.0002271	test: 167.8611783	best: 167.8611783 (5)	total: 150ms	remaining: 49.9s
6:	learn: 294.5006018	test: 167.3612524	best: 167.3612524 (6)	total: 174ms	remaining: 49.4s
7:	learn: 294.0007383	test: 166.8801663	best: 166.8801663 (7)	total: 198ms	remaining: 49.4s
8:	learn: 293.5009491	test: 166.3807462	best: 166.3807462 (8)	total: 224ms	remaining: 49.5s
9:	learn: 293.0015269	test: 165.8815031	best: 165.8815031 (9)	total: 252ms	remaining: 50.2s
10:	learn: 292.5018577	test: 165.3823514	best: 165.3823514 (10)	total: 278ms	r

90:	learn: 253.8838828	test: 134.3303121	best: 134.3303121 (90)	total: 2.35s	remaining: 49.3s
91:	learn: 253.4312839	test: 133.9241346	best: 133.9241346 (91)	total: 2.38s	remaining: 49.3s
92:	learn: 252.9794020	test: 133.5336821	best: 133.5336821 (92)	total: 2.4s	remaining: 49.3s
93:	learn: 252.5285051	test: 133.5195187	best: 133.5195187 (93)	total: 2.43s	remaining: 49.2s
94:	learn: 252.0785082	test: 133.1104782	best: 133.1104782 (94)	total: 2.46s	remaining: 49.2s
95:	learn: 251.6295234	test: 132.7116757	best: 132.7116757 (95)	total: 2.48s	remaining: 49.3s
96:	learn: 251.1812691	test: 132.5355834	best: 132.5355834 (96)	total: 2.51s	remaining: 49.3s
97:	learn: 250.7340983	test: 132.1429579	best: 132.1429579 (97)	total: 2.54s	remaining: 49.2s
98:	learn: 250.2882781	test: 131.7427214	best: 131.7427214 (98)	total: 2.56s	remaining: 49.3s
99:	learn: 249.8431022	test: 131.3529845	best: 131.3529845 (99)	total: 2.59s	remaining: 49.2s
100:	learn: 249.3983517	test: 131.3470865	best: 131.3470865 (

178:	learn: 217.9487728	test: 109.2017170	best: 109.2017170 (178)	total: 4.71s	remaining: 47.9s
179:	learn: 217.5876386	test: 108.9306767	best: 108.9306767 (179)	total: 4.74s	remaining: 47.9s
180:	learn: 217.2283247	test: 108.6652738	best: 108.6652738 (180)	total: 4.77s	remaining: 47.9s
181:	learn: 216.8695380	test: 108.3993456	best: 108.3993456 (181)	total: 4.79s	remaining: 47.9s
182:	learn: 216.5131857	test: 108.1313850	best: 108.1313850 (182)	total: 4.82s	remaining: 47.9s
183:	learn: 216.1567971	test: 107.8474875	best: 107.8474875 (183)	total: 4.85s	remaining: 47.8s
184:	learn: 215.8015336	test: 107.6055968	best: 107.6055968 (184)	total: 4.87s	remaining: 47.8s
185:	learn: 215.4468742	test: 107.3474994	best: 107.3474994 (185)	total: 4.9s	remaining: 47.8s
186:	learn: 215.0931972	test: 107.0890359	best: 107.0890359 (186)	total: 4.93s	remaining: 47.8s
187:	learn: 214.7410098	test: 106.8341460	best: 106.8341460 (187)	total: 4.96s	remaining: 47.8s
188:	learn: 214.3901173	test: 106.5692133

266:	learn: 190.3507708	test: 91.0044354	best: 91.0044354 (266)	total: 7.09s	remaining: 46s
267:	learn: 190.0843478	test: 90.8533217	best: 90.8533217 (267)	total: 7.12s	remaining: 46s
268:	learn: 189.8164011	test: 90.6978960	best: 90.6978960 (268)	total: 7.15s	remaining: 46s
269:	learn: 189.5521540	test: 90.5706390	best: 90.5706390 (269)	total: 7.18s	remaining: 46s
270:	learn: 189.2876431	test: 90.4302274	best: 90.4302274 (270)	total: 7.2s	remaining: 46s
271:	learn: 189.0252721	test: 90.2701777	best: 90.2701777 (271)	total: 7.23s	remaining: 45.9s
272:	learn: 188.7629205	test: 90.1182644	best: 90.1182644 (272)	total: 7.26s	remaining: 45.9s
273:	learn: 188.5010280	test: 89.9678957	best: 89.9678957 (273)	total: 7.29s	remaining: 45.9s
274:	learn: 188.2411132	test: 89.8498298	best: 89.8498298 (274)	total: 7.31s	remaining: 45.9s
275:	learn: 187.9795730	test: 89.7145299	best: 89.7145299 (275)	total: 7.34s	remaining: 45.8s
276:	learn: 187.7219432	test: 89.5775944	best: 89.5775944 (276)	total: 

354:	learn: 170.1988391	test: 81.0183489	best: 81.0183489 (354)	total: 9.48s	remaining: 43.9s
355:	learn: 170.0048109	test: 80.9333915	best: 80.9333915 (355)	total: 9.51s	remaining: 43.9s
356:	learn: 169.8107306	test: 80.8448661	best: 80.8448661 (356)	total: 9.54s	remaining: 43.9s
357:	learn: 169.6171931	test: 80.7608414	best: 80.7608414 (357)	total: 9.56s	remaining: 43.9s
358:	learn: 169.4276452	test: 80.7026780	best: 80.7026780 (358)	total: 9.59s	remaining: 43.8s
359:	learn: 169.2352790	test: 80.6531742	best: 80.6531742 (359)	total: 9.61s	remaining: 43.8s
360:	learn: 169.0429099	test: 80.5651404	best: 80.5651404 (360)	total: 9.64s	remaining: 43.8s
361:	learn: 168.8554070	test: 80.4833274	best: 80.4833274 (361)	total: 9.67s	remaining: 43.7s
362:	learn: 168.6684172	test: 80.4069458	best: 80.4069458 (362)	total: 9.7s	remaining: 43.7s
363:	learn: 168.4789096	test: 80.3323930	best: 80.3323930 (363)	total: 9.73s	remaining: 43.7s
364:	learn: 168.2948668	test: 80.2562319	best: 80.2562319 (36

442:	learn: 155.6983781	test: 75.7304972	best: 75.7304972 (442)	total: 11.9s	remaining: 41.8s
443:	learn: 155.5580548	test: 75.6965570	best: 75.6965570 (443)	total: 11.9s	remaining: 41.7s
444:	learn: 155.4190017	test: 75.6685638	best: 75.6685638 (444)	total: 11.9s	remaining: 41.7s
445:	learn: 155.2809885	test: 75.6135733	best: 75.6135733 (445)	total: 12s	remaining: 41.7s
446:	learn: 155.1431757	test: 75.5860326	best: 75.5860326 (446)	total: 12s	remaining: 41.7s
447:	learn: 155.0061719	test: 75.5425852	best: 75.5425852 (447)	total: 12s	remaining: 41.6s
448:	learn: 154.8703465	test: 75.4971250	best: 75.4971250 (448)	total: 12s	remaining: 41.6s
449:	learn: 154.7350744	test: 75.4519608	best: 75.4519608 (449)	total: 12.1s	remaining: 41.6s
450:	learn: 154.6019882	test: 75.4229304	best: 75.4229304 (450)	total: 12.1s	remaining: 41.6s
451:	learn: 154.4685178	test: 75.3725644	best: 75.3725644 (451)	total: 12.1s	remaining: 41.5s
452:	learn: 154.3344012	test: 75.3319955	best: 75.3319955 (452)	tota

537:	learn: 144.5308058	test: 72.7895389	best: 72.7895389 (537)	total: 14.5s	remaining: 39.4s
538:	learn: 144.4304062	test: 72.7686724	best: 72.7686724 (538)	total: 14.5s	remaining: 39.4s
539:	learn: 144.3328200	test: 72.7438190	best: 72.7438190 (539)	total: 14.6s	remaining: 39.4s
540:	learn: 144.2355166	test: 72.7207318	best: 72.7207318 (540)	total: 14.6s	remaining: 39.4s
541:	learn: 144.1404613	test: 72.6942241	best: 72.6942241 (541)	total: 14.6s	remaining: 39.3s
542:	learn: 144.0443426	test: 72.6702094	best: 72.6702094 (542)	total: 14.7s	remaining: 39.3s
543:	learn: 143.9485412	test: 72.6419225	best: 72.6419225 (543)	total: 14.7s	remaining: 39.3s
544:	learn: 143.8530548	test: 72.6200657	best: 72.6200657 (544)	total: 14.7s	remaining: 39.3s
545:	learn: 143.7557898	test: 72.5949290	best: 72.5949290 (545)	total: 14.7s	remaining: 39.3s
546:	learn: 143.6618895	test: 72.5729336	best: 72.5729336 (546)	total: 14.8s	remaining: 39.2s
547:	learn: 143.5649067	test: 72.5526600	best: 72.5526600 (5

625:	learn: 137.0905504	test: 71.4529260	best: 71.4529260 (625)	total: 17s	remaining: 37.2s
626:	learn: 137.0151291	test: 71.4436382	best: 71.4436382 (626)	total: 17s	remaining: 37.2s
627:	learn: 136.9397445	test: 71.4329903	best: 71.4329903 (627)	total: 17s	remaining: 37.2s
628:	learn: 136.8666378	test: 71.4196235	best: 71.4196235 (628)	total: 17s	remaining: 37.1s
629:	learn: 136.7934829	test: 71.4106972	best: 71.4106972 (629)	total: 17.1s	remaining: 37.1s
630:	learn: 136.7204693	test: 71.3877798	best: 71.3877798 (630)	total: 17.1s	remaining: 37.1s
631:	learn: 136.6497444	test: 71.3762384	best: 71.3762384 (631)	total: 17.1s	remaining: 37.1s
632:	learn: 136.5813444	test: 71.3635267	best: 71.3635267 (632)	total: 17.1s	remaining: 37s
633:	learn: 136.5094724	test: 71.3532368	best: 71.3532368 (633)	total: 17.2s	remaining: 37s
634:	learn: 136.4377161	test: 71.3432644	best: 71.3432644 (634)	total: 17.2s	remaining: 37s
635:	learn: 136.3719971	test: 71.3384416	best: 71.3384416 (635)	total: 17.

713:	learn: 131.5587163	test: 70.8364476	best: 70.8364476 (713)	total: 19.4s	remaining: 34.9s
714:	learn: 131.5036044	test: 70.8298171	best: 70.8298171 (714)	total: 19.4s	remaining: 34.8s
715:	learn: 131.4528564	test: 70.8297938	best: 70.8297938 (715)	total: 19.4s	remaining: 34.8s
716:	learn: 131.4018234	test: 70.8285795	best: 70.8285795 (716)	total: 19.4s	remaining: 34.8s
717:	learn: 131.3477993	test: 70.8243926	best: 70.8243926 (717)	total: 19.5s	remaining: 34.8s
718:	learn: 131.2935319	test: 70.8176908	best: 70.8176908 (718)	total: 19.5s	remaining: 34.7s
719:	learn: 131.2382107	test: 70.8141298	best: 70.8141298 (719)	total: 19.5s	remaining: 34.7s
720:	learn: 131.1833944	test: 70.8041652	best: 70.8041652 (720)	total: 19.6s	remaining: 34.7s
721:	learn: 131.1305415	test: 70.8009922	best: 70.8009922 (721)	total: 19.6s	remaining: 34.7s
722:	learn: 131.0769775	test: 70.7993193	best: 70.7993193 (722)	total: 19.6s	remaining: 34.6s
723:	learn: 131.0279864	test: 70.7948392	best: 70.7948392 (7

801:	learn: 127.3391868	test: 70.5673647	best: 70.5671222 (799)	total: 21.8s	remaining: 32.5s
802:	learn: 127.2973422	test: 70.5582407	best: 70.5582407 (802)	total: 21.8s	remaining: 32.5s
803:	learn: 127.2525209	test: 70.5523568	best: 70.5523568 (803)	total: 21.8s	remaining: 32.5s
804:	learn: 127.2130907	test: 70.5514999	best: 70.5514999 (804)	total: 21.9s	remaining: 32.5s
805:	learn: 127.1686371	test: 70.5489924	best: 70.5489924 (805)	total: 21.9s	remaining: 32.4s
806:	learn: 127.1246471	test: 70.5466701	best: 70.5466701 (806)	total: 21.9s	remaining: 32.4s
807:	learn: 127.0815862	test: 70.5448800	best: 70.5448800 (807)	total: 22s	remaining: 32.4s
808:	learn: 127.0379147	test: 70.5352248	best: 70.5352248 (808)	total: 22s	remaining: 32.4s
809:	learn: 126.9973018	test: 70.5340791	best: 70.5340791 (809)	total: 22s	remaining: 32.3s
810:	learn: 126.9586055	test: 70.5320496	best: 70.5320496 (810)	total: 22s	remaining: 32.3s
811:	learn: 126.9133775	test: 70.5418241	best: 70.5320496 (810)	tota

889:	learn: 123.9237146	test: 70.5266303	best: 70.5151568 (882)	total: 24.2s	remaining: 30.2s
890:	learn: 123.8931722	test: 70.5267615	best: 70.5151568 (882)	total: 24.2s	remaining: 30.2s
891:	learn: 123.8611494	test: 70.5271463	best: 70.5151568 (882)	total: 24.3s	remaining: 30.1s
892:	learn: 123.8269962	test: 70.5236527	best: 70.5151568 (882)	total: 24.3s	remaining: 30.1s
893:	learn: 123.7979338	test: 70.5245145	best: 70.5151568 (882)	total: 24.3s	remaining: 30.1s
894:	learn: 123.7589891	test: 70.5262512	best: 70.5151568 (882)	total: 24.4s	remaining: 30.1s
895:	learn: 123.7242187	test: 70.5280734	best: 70.5151568 (882)	total: 24.4s	remaining: 30s
896:	learn: 123.6964897	test: 70.5313027	best: 70.5151568 (882)	total: 24.4s	remaining: 30s
897:	learn: 123.6604016	test: 70.5232263	best: 70.5151568 (882)	total: 24.4s	remaining: 30s
898:	learn: 123.6258924	test: 70.5246963	best: 70.5151568 (882)	total: 24.5s	remaining: 30s
899:	learn: 123.5900273	test: 70.5231469	best: 70.5151568 (882)	tota

984:	learn: 120.9180433	test: 70.4828515	best: 70.4795307 (975)	total: 26.8s	remaining: 27.6s
985:	learn: 120.8863795	test: 70.4763079	best: 70.4763079 (985)	total: 26.9s	remaining: 27.6s
986:	learn: 120.8597051	test: 70.4733011	best: 70.4733011 (986)	total: 26.9s	remaining: 27.6s
987:	learn: 120.8319339	test: 70.4896494	best: 70.4733011 (986)	total: 26.9s	remaining: 27.6s
988:	learn: 120.8076886	test: 70.4912380	best: 70.4733011 (986)	total: 26.9s	remaining: 27.5s
989:	learn: 120.7805436	test: 70.4926431	best: 70.4733011 (986)	total: 27s	remaining: 27.5s
990:	learn: 120.7536802	test: 70.4949131	best: 70.4733011 (986)	total: 27s	remaining: 27.5s
991:	learn: 120.7320943	test: 70.4960244	best: 70.4733011 (986)	total: 27s	remaining: 27.4s
992:	learn: 120.7012498	test: 70.4933484	best: 70.4733011 (986)	total: 27s	remaining: 27.4s
993:	learn: 120.6705406	test: 70.4907776	best: 70.4733011 (986)	total: 27.1s	remaining: 27.4s
994:	learn: 120.6444209	test: 70.4922760	best: 70.4733011 (986)	tota

1072:	learn: 118.6295588	test: 70.4567387	best: 70.4498392 (1041)	total: 29.2s	remaining: 25.2s
1073:	learn: 118.6051259	test: 70.4485662	best: 70.4485662 (1073)	total: 29.2s	remaining: 25.2s
1074:	learn: 118.5804763	test: 70.4481353	best: 70.4481353 (1074)	total: 29.3s	remaining: 25.2s
1075:	learn: 118.5548951	test: 70.4449836	best: 70.4449836 (1075)	total: 29.3s	remaining: 25.2s
1076:	learn: 118.5271864	test: 70.4417336	best: 70.4417336 (1076)	total: 29.3s	remaining: 25.1s
1077:	learn: 118.5041610	test: 70.4438630	best: 70.4417336 (1076)	total: 29.4s	remaining: 25.1s
1078:	learn: 118.4729724	test: 70.4439670	best: 70.4417336 (1076)	total: 29.4s	remaining: 25.1s
1079:	learn: 118.4506117	test: 70.4445953	best: 70.4417336 (1076)	total: 29.4s	remaining: 25s
1080:	learn: 118.4275431	test: 70.4444398	best: 70.4417336 (1076)	total: 29.4s	remaining: 25s
1081:	learn: 118.4068845	test: 70.4447338	best: 70.4417336 (1076)	total: 29.5s	remaining: 25s
1082:	learn: 118.3818011	test: 70.4403048	best

1159:	learn: 116.6431546	test: 70.5484319	best: 70.4299964 (1090)	total: 31.6s	remaining: 22.9s
1160:	learn: 116.6228408	test: 70.5481402	best: 70.4299964 (1090)	total: 31.6s	remaining: 22.8s
1161:	learn: 116.5988945	test: 70.5586714	best: 70.4299964 (1090)	total: 31.6s	remaining: 22.8s
1162:	learn: 116.5802400	test: 70.5584513	best: 70.4299964 (1090)	total: 31.7s	remaining: 22.8s
1163:	learn: 116.5554967	test: 70.5686812	best: 70.4299964 (1090)	total: 31.7s	remaining: 22.8s
1164:	learn: 116.5402036	test: 70.5681651	best: 70.4299964 (1090)	total: 31.7s	remaining: 22.7s
1165:	learn: 116.5202713	test: 70.5703566	best: 70.4299964 (1090)	total: 31.7s	remaining: 22.7s
1166:	learn: 116.4962681	test: 70.5695348	best: 70.4299964 (1090)	total: 31.8s	remaining: 22.7s
1167:	learn: 116.4740661	test: 70.5819088	best: 70.4299964 (1090)	total: 31.8s	remaining: 22.6s
1168:	learn: 116.4505159	test: 70.5828158	best: 70.4299964 (1090)	total: 31.8s	remaining: 22.6s
1169:	learn: 116.4328770	test: 70.576120

## target_10

In [24]:
%%time
for v_depth in range(3, 12, 1):
    model = catboost.CatBoostRegressor(
        iterations=2000, learning_rate=1.0, 
        depth = v_depth,
        l2_leaf_reg = 5,
        rsm = 1.0,
        loss_function='MAE', 
        early_stopping_rounds = 90, random_seed=27)
   
    model.fit(
        X=df_train,
        y=y_train['target_10'],
        use_best_model=True,
        eval_set=(df_test, y_test['target_10']),        
        verbose=False
    )
    y_predict = model.predict(df_test.values)
    print('v_depth=' + str(v_depth) + ': ' + str(sMAPE(y_test['target_10'], y_predict)))

v_depth=3: 0.43016808954232116
v_depth=4: 0.4242116361489919
v_depth=5: 0.422449122242661
v_depth=6: 0.4172930141407162
v_depth=7: 0.41260044919997385
v_depth=8: 0.41639503995080046
v_depth=9: 0.41466513681064054
v_depth=10: 0.4158165486426095


KeyboardInterrupt: 

In [22]:
%%time
for v_l2_leaf_reg in range(1, 10, 1):
    model = catboost.CatBoostRegressor(
        iterations=2000, learning_rate=1.0, 
        depth = 7,
        l2_leaf_reg = v_l2_leaf_reg,
        loss_function='MAE', 
        early_stopping_rounds = 90, random_seed=27)
    
    model.fit(
        X=df_train,
        y=y_train['target_10'],
        use_best_model=True,
        eval_set=(df_test, y_test['target_10']),        
        verbose=False
    )
    y_predict = model.predict(df_test.values)
    print('v_l2_leaf_reg=' + str(v_l2_leaf_reg) + ': ' + str(sMAPE(y_test['target_10'], y_predict)))

v_l2_leaf_reg=1: 0.4162639644878752
v_l2_leaf_reg=2: 0.41375297097163083
v_l2_leaf_reg=3: 0.41297078130998666
v_l2_leaf_reg=4: 0.41436956879690545
v_l2_leaf_reg=5: 0.41260044919997385
v_l2_leaf_reg=6: 0.41397608562011456
v_l2_leaf_reg=7: 0.4136920416072595
v_l2_leaf_reg=8: 0.4146369007653902
v_l2_leaf_reg=9: 0.4142393315015643
CPU times: user 33min 26s, sys: 3min 31s, total: 36min 57s
Wall time: 26min 24s


In [23]:
for v_rsm in range(1, 11, 1):
    model = catboost.CatBoostRegressor(
        iterations=2000, learning_rate=1.0, 
        depth = 7,
        l2_leaf_reg = 5,
        rsm = v_rsm/10.0,
        loss_function='MAE', 
        early_stopping_rounds = 90, random_seed=27)
    
    model.fit(
        X=df_train,
        y=y_train['target_10'],
        use_best_model=True,
        eval_set=(df_test, y_test['target_10']),        
        verbose=False
    )
    y_predict = model.predict(df_test.values)
    print('v_rsm=' + str(v_rsm) + ': ' + str(sMAPE(y_test['target_10'], y_predict)))

v_rsm=1: 0.44381750201984116
v_rsm=2: 0.4283371487187559
v_rsm=3: 0.4212147239232923
v_rsm=4: 0.41761991737492105
v_rsm=5: 0.41844896870255993
v_rsm=6: 0.41482235081275776
v_rsm=7: 0.41289454584299023
v_rsm=8: 0.41448076924692057
v_rsm=9: 0.4136434243168032
v_rsm=10: 0.41260044919997385


In [None]:
model = catboost.CatBoostRegressor(
        iterations=2000, learning_rate=1.0, 
        depth = 7,
        l2_leaf_reg = 5,
        rsm = 1.0,
        loss_function='MAE', 
        early_stopping_rounds = 90, random_seed=27)
    
model.fit(
        X=df_train,
        y=y_train['target_10'],
        use_best_model=True,
        eval_set=(df_test, y_test['target_10']),        
        verbose=True
    )
y_predict = model.predict(df_test)
print(sMAPE(y_test['target_10'], y_predict))

## target_15

In [16]:
%%time
for v_depth in range(3, 12, 1):
    model = catboost.CatBoostRegressor(
        iterations=2000, learning_rate=1.0, 
        depth = v_depth,
        loss_function='MAE', 
        early_stopping_rounds = 90, random_seed=27)
   
    model.fit(
        X=df_train,
        y=y_train['target_15'],
        use_best_model=True,
        eval_set=(df_test, y_test['target_15']),        
        verbose=False
    )
    y_predict = model.predict(df_test.values)
    print('v_depth=' + str(v_depth) + ': ' + str(sMAPE(y_test['target_15'], y_predict)))

v_depth=3: 0.3737730767401891
v_depth=4: 0.36662019563059217
v_depth=5: 0.3640797132236947
v_depth=6: 0.36295606397195923
v_depth=7: 0.36352765753728705
v_depth=8: 0.3639341841215887
v_depth=9: 0.36462161945192545
v_depth=10: 0.3656311475988231
v_depth=11: 0.36848141229881365
CPU times: user 38min 26s, sys: 1min 11s, total: 39min 37s
Wall time: 12min 5s


In [17]:
%%time
for v_l2_leaf_reg in range(1, 10, 1):
    model = catboost.CatBoostRegressor(
        iterations=2000, learning_rate=1.0, 
        depth = 6,
        l2_leaf_reg = v_l2_leaf_reg,
        loss_function='MAE', 
        early_stopping_rounds = 90, random_seed=27)
    
    model.fit(
        X=df_train,
        y=y_train['target_15'],
        use_best_model=True,
        eval_set=(df_test, y_test['target_15']),        
        verbose=False
    )
    y_predict = model.predict(df_test.values)
    print('v_l2_leaf_reg=' + str(v_l2_leaf_reg) + ': ' + str(sMAPE(y_test['target_15'], y_predict)))

v_l2_leaf_reg=1: 0.3626143264740588
v_l2_leaf_reg=2: 0.36106523154725967
v_l2_leaf_reg=3: 0.36295606397195923
v_l2_leaf_reg=4: 0.3616354313542429
v_l2_leaf_reg=5: 0.3622727027271365
v_l2_leaf_reg=6: 0.3619459311142655
v_l2_leaf_reg=7: 0.3616966027582275
v_l2_leaf_reg=8: 0.3618939454127816
v_l2_leaf_reg=9: 0.3613483504079539
CPU times: user 26min 51s, sys: 54.8 s, total: 27min 45s
Wall time: 7min 51s


In [None]:
for v_rsm in range(1, 11, 1):
    model = catboost.CatBoostRegressor(
        iterations=2000, learning_rate=1.0, 
        depth = 6,
        l2_leaf_reg = 2,
        rsm = v_rsm/10.0,
        loss_function='MAE', 
        early_stopping_rounds = 90, random_seed=27)
    
    model.fit(
        X=df_train,
        y=y_train['target_15'],
        use_best_model=True,
        eval_set=(df_test, y_test['target_15']),        
        verbose=False
    )
    y_predict = model.predict(df_test.values)
    print('v_rsm=' + str(v_rsm) + ': ' + str(sMAPE(y_test['target_15'], y_predict)))

v_rsm=1: 0.39288257837720675
v_rsm=2: 0.37136032550628195
v_rsm=3: 0.3651053753238319
v_rsm=4: 0.3634549278813165
v_rsm=5: 0.3628052708687235
v_rsm=6: 0.3643446211182713
v_rsm=7: 0.36267305679892686
v_rsm=8: 0.36415149977661426
v_rsm=9: 0.36229232046514626
v_rsm=10: 0.36106523154725967


In [None]:
model = catboost.CatBoostRegressor(
        iterations=2000, learning_rate=1.0, 
        depth = 6,
        l2_leaf_reg = 2,
        rsm = 1.0,
        loss_function='MAE', 
        early_stopping_rounds = 90, random_seed=27)
    
model.fit(
        X=df_train,
        y=y_train['target_15'],
        use_best_model=True,
        eval_set=(df_test, y_test['target_15']),        
        verbose=True
    )
y_predict = model.predict(df_test)
print(sMAPE(y_test['target_15'], y_predict))

## target_20

In [None]:
%%time
for v_depth in range(3, 12, 1):
    model = catboost.CatBoostRegressor(
        iterations=2000, learning_rate=1.0, 
        depth = v_depth,
        loss_function='MAE', 
        early_stopping_rounds = 90, random_seed=27)
   
    model.fit(
        X=df_train,
        y=y_train['target_20'],
        use_best_model=True,
        eval_set=(df_test, y_test['target_20']),        
        verbose=False
    )
    y_predict = model.predict(df_test.values)
    print('v_depth=' + str(v_depth) + ': ' + str(sMAPE(y_test['target_20'], y_predict)))

v_depth=3: 0.33516560711090265
v_depth=4: 0.32458841046579395


In [None]:
%%time
for v_l2_leaf_reg in range(1, 10, 1):
    model = catboost.CatBoostRegressor(
        iterations=2000, learning_rate=1.0, 
        depth = ,
        l2_leaf_reg = v_l2_leaf_reg,
        loss_function='MAE', 
        early_stopping_rounds = 90, random_seed=27)
    
    model.fit(
        X=df_train,
        y=y_train['target_20'],
        use_best_model=True,
        eval_set=(df_test, y_test['target_20']),        
        verbose=False
    )
    y_predict = model.predict(df_test.values)
    print('v_l2_leaf_reg=' + str(v_l2_leaf_reg) + ': ' + str(sMAPE(y_test['target_20'], y_predict)))

In [None]:
for v_rsm in range(1, 11, 1):
    model = catboost.CatBoostRegressor(
        iterations=2000, learning_rate=1.0, 
        depth = ,
        l2_leaf_reg = ,
        rsm = v_rsm/10.0,
        loss_function='MAE', 
        early_stopping_rounds = 90, random_seed=27)
    
    model.fit(
        X=df_train,
        y=y_train['target_20'],
        use_best_model=True,
        eval_set=(df_test, y_test['target_20']),        
        verbose=False
    )
    y_predict = model.predict(df_test.values)
    print('v_rsm=' + str(v_rsm) + ': ' + str(sMAPE(y_test['target_20'], y_predict)))

In [None]:
model = catboost.CatBoostRegressor(
        iterations=2000, learning_rate=1.0, 
        depth = ,
        l2_leaf_reg = ,
        rsm = ,
        loss_function='MAE', 
        early_stopping_rounds = 90, random_seed=27)
    
model.fit(
        X=df_train,
        y=y_train['target_20'],
        use_best_model=True,
        eval_set=(df_test, y_test['target_20']),        
        verbose=True
    )
y_predict = model.predict(df_test)
print(sMAPE(y_test['target_20'], y_predict))

## target_25

In [None]:
%%time
for v_depth in range(3, 12, 1):
    model = catboost.CatBoostRegressor(
        iterations=2000, learning_rate=1.0, 
        depth = v_depth,
        loss_function='MAE', 
        early_stopping_rounds = 90, random_seed=27)
   
    model.fit(
        X=df_train,
        y=y_train['target_25'],
        use_best_model=True,
        eval_set=(df_test, y_test['target_25']),        
        verbose=False
    )
    y_predict = model.predict(df_test.values)
    print('v_depth=' + str(v_depth) + ': ' + str(sMAPE(y_test['target_25'], y_predict)))

In [None]:
%%time
for v_l2_leaf_reg in range(1, 10, 1):
    model = catboost.CatBoostRegressor(
        iterations=2000, learning_rate=1.0, 
        depth = ,
        l2_leaf_reg = v_l2_leaf_reg,
        loss_function='MAE', 
        early_stopping_rounds = 90, random_seed=27)
    
    model.fit(
        X=df_train,
        y=y_train['target_25'],
        use_best_model=True,
        eval_set=(df_test, y_test['target_25']),        
        verbose=False
    )
    y_predict = model.predict(df_test.values)
    print('v_l2_leaf_reg=' + str(v_l2_leaf_reg) + ': ' + str(sMAPE(y_test['target_25'], y_predict)))

In [None]:
for v_rsm in range(1, 11, 1):
    model = catboost.CatBoostRegressor(
        iterations=2000, learning_rate=1.0, 
        depth = ,
        l2_leaf_reg = ,
        rsm = v_rsm/10.0,
        loss_function='MAE', 
        early_stopping_rounds = 90, random_seed=27)
    
    model.fit(
        X=df_train,
        y=y_train['target_25'],
        use_best_model=True,
        eval_set=(df_test, y_test['target_25']),        
        verbose=False
    )
    y_predict = model.predict(df_test.values)
    print('v_rsm=' + str(v_rsm) + ': ' + str(sMAPE(y_test['target_25'], y_predict)))

In [None]:
model = catboost.CatBoostRegressor(
        iterations=2000, learning_rate=1.0, 
        depth = ,
        l2_leaf_reg = ,
        rsm = ,
        loss_function='MAE', 
        early_stopping_rounds = 90, random_seed=27)
    
model.fit(
        X=df_train,
        y=y_train['target_25'],
        use_best_model=True,
        eval_set=(df_test, y_test['target_25']),        
        verbose=True
    )
y_predict = model.predict(df_test)
print(sMAPE(y_test['target_25'], y_predict))

# Final model

In [19]:
for position in target_positions:
    model = catboost.CatBoostRegressor(iterations=4000, learning_rate=1.0, 
        depth = 6,
        l2_leaf_reg = 9,
        rsm = 0.5,
        loss_function='MAE', 
        random_seed=27)
    
    model.fit(
        #X=X,
        #y=y['target_{}'.format(position)],
        X=df_train,
        y=y_train['target_{}'.format(position)],
        use_best_model=True,
        eval_set=(df_test, y_test['target_{}'.format(position)]),
        verbose=False)
    y_predict = model.predict(df_test)
    
    print('target_{}'.format(position))
    print('stupid:\t{}'.format(sMAPE(
        y_test['target_{}'.format(position)],
        y_train['target_{}'.format(position)].median())))
    print('model:\t{}'.format(sMAPE(
        y_test['target_{}'.format(position)],
        y_predict)))
    print()
    
    model_to_save['models'][position] = model

target_5
stupid:	0.7296782178198603
model:	0.5033158858987168

target_10
stupid:	0.6368873894265404
model:	0.42582668628727827

target_15
stupid:	0.5810788045610309
model:	0.35872117082004323

target_20
stupid:	0.533455686705494
model:	0.3223642979061759

target_25
stupid:	0.499955747532939
model:	0.29350515608051875



In [18]:
model.get_params()

{'random_seed': 27,
 'loss_function': 'MAE',
 'rsm': 0.5,
 'l2_leaf_reg': 9,
 'depth': 6,
 'learning_rate': 1.0,
 'iterations': 2000}

Our model is better than constant solution. Saving model.

In [20]:
pickle.dump(model_to_save, open('models.pkl', 'wb'))