# IDAO: expected time of orders in airports

Airports are special points for taxi service. Every day a lot of people use a taxi to get to the city centre from the airport.

One of important task is to predict how long a driver need to wait an order. It helps to understand what to do. Maybe the driver have to wait near doors, or can drink a tea, or even should drive to city center without an order.

We request you to solve a simple version of this prediction task.

**Task:** predict time of $k$ orders in airport (time since now when you get an order if you are $k$-th in queue), $k$ is one of 5 values (different for every airports).

**Data**
- train: number of order for every minutes for 6 months
- test: every test sample has datetime info + numer of order for every minutes for last 2 weeks

**Submission:** for every airport you should prepare a model which will be evaluated in submission system (code + model files). You can make different models for different airports.

**Evaluation:** for every airport for every $k$ sMAPE will be calculated and averaged. General leaderboard will be calculated via Borda count. 

## Baseline

In [1]:
%pylab inline

import catboost
import pandas as pd
import pickle
import tqdm

Populating the interactive namespace from numpy and matplotlib


Let's prepare a model for set2.

# Load train dataset

In [2]:
set_name = 'set3'
path_train_set = '../../data/train/{}.csv'.format(set_name)

data = pd.read_csv(path_train_set)
data.datetime = data.datetime.apply(
    lambda x: datetime.datetime.strptime(x, '%Y-%m-%d %H:%M:%S'))
data = data.sort_values('datetime')
data.head()

Unnamed: 0,datetime,num_orders
0,2018-02-01 00:00:00,0
1,2018-02-01 00:01:00,0
2,2018-02-01 00:02:00,0
3,2018-02-01 00:03:00,0
4,2018-02-01 00:04:00,0


Predict position for set2.

In [3]:
target_positions = {
    'set1': [10, 30, 45, 60, 75],
    'set2': [5, 10, 15, 20, 25],
    'set3': [5, 7, 9, 11, 13]
}[set_name]

Some useful constant.

In [4]:
HOUR_IN_MINUTES = 60
DAY_IN_MINUTES = 24 * HOUR_IN_MINUTES
WEEK_IN_MINUTES = 7 * DAY_IN_MINUTES

MAX_TIME = DAY_IN_MINUTES

## Generate train samples with targets

We have only history of orders (count of orders in every minutes) but we need to predict time of k orders since current minutes. So we should calculate target for train set. Also we will make a lot of samples from all set (we can only use two weeks of history while prediction so we can use only two weeks in every train sample).

In [5]:
samples = {
    'datetime': [],
    'history': []}

for position in target_positions:
    samples['target_{}'.format(position)] = []
    
num_orders = data.num_orders.values

To calculate target (minutes before k orders) we are going to use cumulative sum of orders. 

In [6]:
# start after 2 weeks because of history
# finish earlier because of target calculation
for i in range(2 * WEEK_IN_MINUTES,
               len(num_orders) - 2 * DAY_IN_MINUTES):
    
    samples['datetime'].append(data.datetime[i])
    samples['history'].append(num_orders[i-2*WEEK_IN_MINUTES:i])
    
    # cumsum not for all array because of time economy
    cumsum_num_orders = num_orders[i+1:i+1+2*DAY_IN_MINUTES].cumsum()
    for position in target_positions:
        orders_by_positions = np.where(cumsum_num_orders >= position)[0]
        if len(orders_by_positions):
            time = orders_by_positions[0] + 1
        else:
            # if no orders in last days
            time = MAX_TIME
        samples['target_{}'.format(position)].append(time)

Convert to pandas.dataframe. Now we have targets to train and predict.

In [31]:
df = pd.DataFrame.from_dict(samples)
df.head()

Unnamed: 0,datetime,history,target_11,target_13,target_5,target_7,target_9
0,2018-02-15 00:00:00,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",820,1167,308,418,421
1,2018-02-15 00:01:00,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",819,1166,307,417,420
2,2018-02-15 00:02:00,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",818,1165,306,416,419
3,2018-02-15 00:03:00,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",817,1164,305,415,418
4,2018-02-15 00:04:00,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",816,1163,304,414,417


# Train model

Let's generate simple features.

By time:

In [32]:
df['weekday'] = df.datetime.apply(lambda x: x.weekday())
df['hour'] = df.datetime.apply(lambda x: x.hour)
df['minute'] = df.datetime.apply(lambda x: x.minute)
df['month'] = df.datetime.apply(lambda x: x.month)
df['is_night'] = df.datetime.apply(lambda x: 1.0 if (x.hour < 6) else 0.0)

In [33]:
df['is_weekend'] = df.datetime.apply(lambda x: 1.0 if (x.weekday() >=5)
                              or ((x.day == 23) and (x.month == 2))
                              or ((x.day == 8) and (x.month == 3))
                              or ((x.day == 9) and (x.month == 3))
                              or ((x.day == 30) and (x.month == 4))
                              or ((x.day == 1) and (x.month == 5))
                              or ((x.day == 2) and (x.month == 5))
                              or ((x.day == 9) and (x.month == 5))
                              or ((x.day == 11) and (x.month == 6))
                              or ((x.day == 12) and (x.month == 6))
                             else 0.0)

Aggregators by order history with different shift and window size:

In [34]:
SHIFTS = [
    HOUR_IN_MINUTES // 4,
    HOUR_IN_MINUTES // 2,
    HOUR_IN_MINUTES,
    DAY_IN_MINUTES,
    DAY_IN_MINUTES * 2,
    DAY_IN_MINUTES * 3,
    WEEK_IN_MINUTES,
    WEEK_IN_MINUTES * 2]
WINDOWS = [
    HOUR_IN_MINUTES // 4,
    HOUR_IN_MINUTES // 2,
    HOUR_IN_MINUTES,
    DAY_IN_MINUTES,
    DAY_IN_MINUTES * 2,
    WEEK_IN_MINUTES,
    WEEK_IN_MINUTES * 2]

In [35]:
for shift in SHIFTS:
    for window in WINDOWS:
        temp = 0
        if window > shift:
            continue
        if shift == window:
            df['num_orders_{}_{}'.format(shift, window)] = df.history.apply(lambda x: x[-shift : -1].sum())
        else:
            df['num_orders_{}_{}'.format(shift, window)] = df.history.apply(lambda x: x[-shift : -shift + window].sum())

Train/validation split for time. Let's use last 4 weeks for validation.

In [36]:
df.datetime.min(), df.datetime.max()

(Timestamp('2018-02-15 00:00:00'), Timestamp('2018-07-29 23:59:00'))

In [37]:
df_train = df.loc[df.datetime <= df.datetime.max() - datetime.timedelta(days=28)]
df_test = df.loc[df.datetime > df.datetime.max() - datetime.timedelta(days=28)]
#from sklearn.model_selection import train_test_split
#№df_train, df_test = train_test_split(df, test_size=0.2, shuffle=True)

In [38]:
target_cols = ['target_{}'.format(position) for position in target_positions]

y_train = df_train[target_cols]
y_test = df_test[target_cols]



df_train = df_train.drop(['datetime', 'history'] + target_cols, axis=1)
df_test = df_test.drop(['datetime', 'history'] + target_cols, axis=1)

In [39]:
X_all = df.drop(['datetime', 'history'] + target_cols, axis=1)
y_all = df[target_cols]

In [40]:
def sMAPE(y_true, y_predict, shift=0):
    return 2 * np.mean(
        np.abs(y_true - y_predict) /
        (np.abs(y_true) + np.abs(y_predict) + shift))

Also we will save models for prediction stage.

In [41]:
model_to_save = {
    'models': {}
}

What is good or bad model? We can compare our model with constant solution. For instance median (optimal solution for MAE).

In [53]:
model = catboost.CatBoostRegressor(iterations=2000, learning_rate=1.0, 
        depth = 6,
        l2_leaf_reg = 9,
        rsm = 0.5,
        loss_function='MAE', 
        random_seed=27)

In [54]:
model.fit(
        #X=X,
        #y=y['target_{}'.format(position)],
        X=df_train,
        y=y_train['target_{}'.format(position)],
        #use_best_model=True,
        #eval_set=(df_test, y_test['target_{}'.format(position)]),
        verbose=False)

<catboost.core.CatBoostRegressor at 0x7fa330541ef0>

In [50]:
for position in target_positions:
    model = catboost.CatBoostRegressor(iterations=4000, learning_rate=1.0, 
        depth = 6,
        l2_leaf_reg = 9,
        rsm = 0.5,
        loss_function='MAE', 
        random_seed=27)
    
    model.fit(
        #X=X,
        #y=y['target_{}'.format(position)],
        X=df_train,
        y=y_train['target_{}'.format(position)],
        #use_best_model=True,
        #eval_set=(df_test, y_test['target_{}'.format(position)]),
        verbose=False)
    y_predict = model.predict(df_test)
    
    print('target_{}'.format(position))
    print('stupid:\t{}'.format(sMAPE(
        y_test['target_{}'.format(position)],
        y_train['target_{}'.format(position)].median())))
    print('model:\t{}'.format(sMAPE(
        y_test['target_{}'.format(position)],
        y_predict)))
    print()
    
    model_to_save['models'][position] = model

target_5
stupid:	0.6113491702802483
model:	0.4653506710843815

target_7
stupid:	0.5761401162272479
model:	0.4152161794093717



KeyboardInterrupt: 

In [47]:
for position in target_positions:
    model = catboost.CatBoostRegressor(
        iterations=4000, 
        learning_rate=1.0, 
        loss_function='MAE',
        random_seed = 27
    )
    model.fit(
        X=df_train,
        y=y_train['target_{}'.format(position)],
        use_best_model=True,
        eval_set=(df_test, y_test['target_{}'.format(position)]),
        verbose=False)
    y_predict = model.predict(df_test)
    
    print('target_{}'.format(position))
    print('stupid:\t{}'.format(sMAPE(
        y_test['target_{}'.format(position)],
        y_train['target_{}'.format(position)].median())))
    print('model:\t{}'.format(sMAPE(
        y_test['target_{}'.format(position)],
        y_predict)))
    print()
    
    model_to_save['models'][position] = model

target_5
stupid:	0.6113491702802483
model:	0.44629006744668015

target_7
stupid:	0.5761401162272479
model:	0.40026816959020467

target_9
stupid:	0.5630557259108779
model:	0.38722877250155296

target_11
stupid:	0.5587023576311817
model:	0.35561401771114176

target_13
stupid:	0.5513765273934248
model:	0.3132603080259751



In [56]:
for position in target_positions:
    if str(position) == '5':
        print('5')
        model = catboost.CatBoostRegressor(iterations=2000, learning_rate=1.0, 
            depth = 6,
            l2_leaf_reg = 9,
            rsm = 0.5,
            loss_function='MAE', 
            random_seed=27)
    else:
        print('not 5')

    if str(position) == '7':
        print('7')
        model = catboost.CatBoostRegressor(iterations=2000, learning_rate=1.0, 
            depth = 3,
            l2_leaf_reg = 8,
            rsm = 1.0,
            loss_function='MAE', 
            random_seed=27)
    else:
        print('not 7')
        
    if str(position) == '9':
        print('9')
        model = catboost.CatBoostRegressor(iterations=2000, learning_rate=1.0, 
            depth = 3,
            l2_leaf_reg = 8,
            rsm = 0.7,
            loss_function='MAE', 
            random_seed=27)
    else:
        print('not 9')
        
    if str(position) == '11':
        print('11')
        model = catboost.CatBoostRegressor(iterations=2000, learning_rate=1.0, 
            depth = 3,
            l2_leaf_reg = 2,
            rsm = 0.2,
            loss_function='MAE', 
            random_seed=27)
    else:
        print('not 11')
        
    if str(position) == '13':
        print('13')
        model = catboost.CatBoostRegressor(iterations=2000, learning_rate=1.0, 
            depth = 4,
            l2_leaf_reg = 5,
            rsm = 1.0,
            loss_function='MAE', 
            random_seed=27)
    else:
        print('not 13')
        
        
    model.fit(
            #X=X,
            #y=y['target_{}'.format(position)],
            X=df_train,
            y=y_train['target_{}'.format(position)],
            use_best_model=True,
            eval_set=(df_test, y_test['target_{}'.format(position)]),
            verbose=False)
    y_predict = model.predict(df_test)
    
    print('target_{}'.format(position))
    print('stupid:\t{}'.format(sMAPE(
        y_test['target_{}'.format(position)],
        y_train['target_{}'.format(position)].median())))
    print('model:\t{}'.format(sMAPE(
        y_test['target_{}'.format(position)],
        y_predict)))
    print()
    
    model_to_save['models'][position] = model

5
not 7
not 9
not 11
not 13
target_5
stupid:	0.6113491702802483
model:	0.4488429575577311

not 5
7
not 9
not 11
not 13
target_7
stupid:	0.5761401162272479
model:	0.3980924615070586

not 5
not 7
9
not 11
not 13
target_9
stupid:	0.5630557259108779
model:	0.3744194734229844

not 5
not 7
not 9
11
not 13
target_11
stupid:	0.5587023576311817
model:	0.3409071918454904

not 5
not 7
not 9
not 11
13
target_13
stupid:	0.5513765273934248
model:	0.3073526761423234



Our model is better than constant solution. Saving model.

In [58]:
pickle.dump(model_to_save, open('models.pkl', 'wb'))

In [49]:
model.get_params()

{'random_seed': 27,
 'loss_function': 'MAE',
 'learning_rate': 1.0,
 'iterations': 4000}

In [55]:
for i in range(len(model.feature_importances_)):
    print(model.feature_names_[i] + ': ' + str(model.feature_importances_[i]))

weekday: 6.143948700366711
hour: 23.605474382501292
minute: 0.0006794520347450518
month: 5.585949873254162
is_night: 4.3527187575109645
is_weekend: 4.067210618177978
num_orders_15_15: 0.0005128637894288646
num_orders_30_15: 0.00014816361723875006
num_orders_30_30: 0.0006282319181745852
num_orders_60_15: 2.3369250046621812e-06
num_orders_60_30: 4.180413239997616e-05
num_orders_60_60: 0.0610304887675931
num_orders_1440_15: 4.1900446878010794e-05
num_orders_1440_30: 0.002279065253795308
num_orders_1440_60: 0.04228322857860064
num_orders_1440_1440: 3.9123441745148493
num_orders_2880_15: 0.000106718351020746
num_orders_2880_30: 0.0007061043101085461
num_orders_2880_60: 0.006973488747462156
num_orders_2880_1440: 4.091152889888406
num_orders_2880_2880: 6.319928187758897
num_orders_4320_15: 0.00018996416469133372
num_orders_4320_30: 0.0010398421727513952
num_orders_4320_60: 0.016337033623790874
num_orders_4320_1440: 4.657913849103723
num_orders_4320_2880: 5.057002696376291
num_orders_10080_15:

In [219]:
df.head(3)

Unnamed: 0,datetime,history,target_10,target_30,target_45,target_60,target_75,weekday,hour,minute,...,num_orders_10080_30,num_orders_10080_60,num_orders_10080_1440,num_orders_10080_10080,num_orders_20160_15,num_orders_20160_30,num_orders_20160_60,num_orders_20160_1440,num_orders_20160_10080,num_orders_20160_20160
0,2018-03-15 00:00:00,"[0, 0, 0, 0, 1, 2, 0, 1, 1, 4, 0, 1, 1, 1, 1, ...",5,18,28,32,42,3,0,0,...,59,143,1659,9871,13,51,124,1364,9110,18981
1,2018-03-15 00:01:00,"[0, 0, 0, 1, 2, 0, 1, 1, 4, 0, 1, 1, 1, 1, 3, ...",5,19,27,32,42,3,0,1,...,60,145,1658,9870,16,53,126,1364,9111,18981
2,2018-03-15 00:02:00,"[0, 0, 1, 2, 0, 1, 1, 4, 0, 1, 1, 1, 1, 3, 2, ...",7,20,27,33,43,3,0,2,...,64,147,1660,9874,18,54,128,1364,9111,18985
