# IDAO: expected time of orders in airports

Airports are special points for taxi service. Every day a lot of people use a taxi to get to the city centre from the airport.

One of important task is to predict how long a driver need to wait an order. It helps to understand what to do. Maybe the driver have to wait near doors, or can drink a tea, or even should drive to city center without an order.

We request you to solve a simple version of this prediction task.

**Task:** predict time of $k$ orders in airport (time since now when you get an order if you are $k$-th in queue), $k$ is one of 5 values (different for every airports).

**Data**
- train: number of order for every minutes for 6 months
- test: every test sample has datetime info + numer of order for every minutes for last 2 weeks

**Submission:** for every airport you should prepare a model which will be evaluated in submission system (code + model files). You can make different models for different airports.

**Evaluation:** for every airport for every $k$ sMAPE will be calculated and averaged. General leaderboard will be calculated via Borda count. 

## Baseline

In [1]:
%pylab inline

import catboost
import pandas as pd
import pickle
import tqdm

Populating the interactive namespace from numpy and matplotlib


Let's prepare a model for set2.

# Load train dataset

In [61]:
set_name = 'set1'
path_train_set = '../../data/train/{}.csv'.format(set_name)

data = pd.read_csv(path_train_set)
data.datetime = data.datetime.apply(
    lambda x: datetime.datetime.strptime(x, '%Y-%m-%d %H:%M:%S'))
data = data.sort_values('datetime')
data.head()

Unnamed: 0,datetime,num_orders
0,2018-03-01 00:00:00,0
1,2018-03-01 00:01:00,0
2,2018-03-01 00:02:00,0
3,2018-03-01 00:03:00,0
4,2018-03-01 00:04:00,1


Predict position for set2.

In [62]:
target_positions = {
    'set1': [10, 30, 45, 60, 75],
    'set2': [5, 10, 15, 20, 25],
    'set3': [5, 7, 9, 11, 13]
}[set_name]

Some useful constant.

In [63]:
HOUR_IN_MINUTES = 60
DAY_IN_MINUTES = 24 * HOUR_IN_MINUTES
WEEK_IN_MINUTES = 7 * DAY_IN_MINUTES

MAX_TIME = DAY_IN_MINUTES

## Generate train samples with targets

We have only history of orders (count of orders in every minutes) but we need to predict time of k orders since current minutes. So we should calculate target for train set. Also we will make a lot of samples from all set (we can only use two weeks of history while prediction so we can use only two weeks in every train sample).

In [64]:
samples = {
    'datetime': [],
    'history': []}

for position in target_positions:
    samples['target_{}'.format(position)] = []
    
num_orders = data.num_orders.values

To calculate target (minutes before k orders) we are going to use cumulative sum of orders. 

In [65]:
# start after 2 weeks because of history
# finish earlier because of target calculation
for i in range(2 * WEEK_IN_MINUTES,
               len(num_orders) - 2 * DAY_IN_MINUTES):
    
    samples['datetime'].append(data.datetime[i])
    samples['history'].append(num_orders[i-2*WEEK_IN_MINUTES:i])
    
    # cumsum not for all array because of time economy
    cumsum_num_orders = num_orders[i+1:i+1+2*DAY_IN_MINUTES].cumsum()
    for position in target_positions:
        orders_by_positions = np.where(cumsum_num_orders >= position)[0]
        if len(orders_by_positions):
            time = orders_by_positions[0] + 1
        else:
            # if no orders in last days
            time = MAX_TIME
        samples['target_{}'.format(position)].append(time)

Convert to pandas.dataframe. Now we have targets to train and predict.

In [312]:
df = pd.DataFrame.from_dict(samples)
df.head()

Unnamed: 0,datetime,history,target_10,target_30,target_45,target_60,target_75
0,2018-03-15 00:00:00,"[0, 0, 0, 0, 1, 2, 0, 1, 1, 4, 0, 1, 1, 1, 1, ...",5,18,28,32,42
1,2018-03-15 00:01:00,"[0, 0, 0, 1, 2, 0, 1, 1, 4, 0, 1, 1, 1, 1, 3, ...",5,19,27,32,42
2,2018-03-15 00:02:00,"[0, 0, 1, 2, 0, 1, 1, 4, 0, 1, 1, 1, 1, 3, 2, ...",7,20,27,33,43
3,2018-03-15 00:03:00,"[0, 1, 2, 0, 1, 1, 4, 0, 1, 1, 1, 1, 3, 2, 3, ...",7,21,26,35,42
4,2018-03-15 00:04:00,"[1, 2, 0, 1, 1, 4, 0, 1, 1, 1, 1, 3, 2, 3, 1, ...",7,20,26,35,42


# Train model

Let's generate simple features.

By time:

In [313]:
df['weekday'] = df.datetime.apply(lambda x: x.weekday())
df['hour'] = df.datetime.apply(lambda x: x.hour)
df['minute'] = df.datetime.apply(lambda x: x.minute)
df['month'] = df.datetime.apply(lambda x: x.month)
df['is_night'] = df.datetime.apply(lambda x: 1.0 if (x.hour < 6) else 0.0)

In [314]:
df['is_weekend'] = df.datetime.apply(lambda x: 1.0 if (x.weekday() >=5)
                              or ((x.day == 23) and (x.month == 2))
                              or ((x.day == 8) and (x.month == 3))
                              or ((x.day == 9) and (x.month == 3))
                              or ((x.day == 30) and (x.month == 4))
                              or ((x.day == 1) and (x.month == 5))
                              or ((x.day == 2) and (x.month == 5))
                              or ((x.day == 9) and (x.month == 5))
                              or ((x.day == 11) and (x.month == 6))
                              or ((x.day == 12) and (x.month == 6))
                             else 0.0)

Aggregators by order history with different shift and window size:

In [315]:
SHIFTS = [
    HOUR_IN_MINUTES // 4,
    HOUR_IN_MINUTES // 2,
    HOUR_IN_MINUTES,
    DAY_IN_MINUTES,
    DAY_IN_MINUTES * 2,
    DAY_IN_MINUTES * 3,
    WEEK_IN_MINUTES,
    WEEK_IN_MINUTES * 2]
WINDOWS = [
    HOUR_IN_MINUTES // 4,
    HOUR_IN_MINUTES // 2,
    HOUR_IN_MINUTES,
    DAY_IN_MINUTES,
    DAY_IN_MINUTES * 2,
    WEEK_IN_MINUTES,
    WEEK_IN_MINUTES * 2]

In [316]:
for shift in SHIFTS:
    for window in WINDOWS:
        temp = 0
        if window > shift:
            continue
        if shift == window:
            df['num_orders_{}_{}'.format(shift, window)] = df.history.apply(lambda x: x[-shift : -1].sum())
        else:
            df['num_orders_{}_{}'.format(shift, window)] = df.history.apply(lambda x: x[-shift : -shift + window].sum())

Train/validation split for time. Let's use last 4 weeks for validation.

In [317]:
df.datetime.min(), df.datetime.max()

(Timestamp('2018-03-15 00:00:00'), Timestamp('2018-08-29 23:59:00'))

In [333]:
#df_train = df.loc[df.datetime >= df.datetime.min() + datetime.timedelta(days=28)]
#df_test = df.loc[df.datetime < df.datetime.min() + datetime.timedelta(days=28)]
from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(df, test_size=0.2, stratify=df.weekday, shuffle=True)

In [334]:
target_cols = ['target_{}'.format(position) for position in target_positions]

y_train = df_train[target_cols]
y_test = df_test[target_cols]



df_train = df_train.drop(['datetime', 'history'] + target_cols, axis=1)
df_test = df_test.drop(['datetime', 'history'] + target_cols, axis=1)

In [335]:
X_all = df.drop(['datetime', 'history'] + target_cols, axis=1)
y_all = df[target_cols]

In [336]:
def sMAPE(y_true, y_predict, shift=0):
    return 2 * np.mean(
        np.abs(y_true - y_predict) /
        (np.abs(y_true) + np.abs(y_predict) + shift))

Also we will save models for prediction stage.

In [337]:
model_to_save = {
    'models': {}
}

What is good or bad model? We can compare our model with constant solution. For instance median (optimal solution for MAE).

In [338]:
for position in target_positions:
    model = catboost.CatBoostRegressor(
        iterations=5000, 
        learning_rate=1.0, 
        loss_function='MAE',
        random_seed = 27
    )
    model.fit(
        X=df_train,
        y=y_train['target_{}'.format(position)],
        #X=X_all,
        #y=y_all['target_{}'.format(position)],
        use_best_model=True,
        eval_set=(df_test, y_test['target_{}'.format(position)]),
        verbose=False)
    y_predict = model.predict(df_test)
    
    print('target_{}'.format(position))
    print('stupid:\t{}'.format(sMAPE(
        y_test['target_{}'.format(position)],
        y_train['target_{}'.format(position)].median())))
    print('model:\t{}'.format(sMAPE(
        y_test['target_{}'.format(position)],
        y_predict)))
    print()
    
    model_to_save['models'][position] = model

target_10
stupid:	0.4828611732670474
model:	0.23819738389506026

target_30
stupid:	0.4308833460266637
model:	0.13800072772073058

target_45
stupid:	0.40931313809300246
model:	0.1118790749514607

target_60
stupid:	0.39275900730095853
model:	0.09540615212987488

target_75
stupid:	0.37973470092526534
model:	0.08404532251398437



In [323]:
model.get_params()

{'random_seed': 27,
 'loss_function': 'MAE',
 'learning_rate': 1.0,
 'iterations': 5000}

In [297]:
for position in target_positions:
    if str(position) == '10':
        print('10')
        model = catboost.CatBoostRegressor(iterations=2000, learning_rate=0.185, 
            depth = 4,
            rsm=0.2,
            l2_leaf_reg = 9,
            loss_function='MAE', 
            random_seed=27)
    else:
        print('not 10')

    if str(position) == '30':
        print('30')
        model = catboost.CatBoostRegressor(iterations=2000, learning_rate=0.2, 
            depth = 7,
            rsm = 0.1,
            loss_function='MAE', 
            random_seed=27)
    else:
        print('not 30')
        
    if str(position) == '45':
        print('45')
        model = catboost.CatBoostRegressor(iterations=2000, learning_rate=0.225, 
            depth = 5,
            l2_leaf_reg = 8,
            rsm = 1.0,
            loss_function='MAE', 
            random_seed=27)
    else:
        print('not 45')
        
    if str(position) == '60':
        print('60')
        model = catboost.CatBoostRegressor(iterations=2000, learning_rate=0.275, 
            depth = 5,
            l2_leaf_reg = 3,
            rsm = 1.0,
            loss_function='MAE', 
            random_seed=27)
    else:
        print('not 60')
        
    if str(position) == '75':
        print('75')
        model = catboost.CatBoostRegressor(iterations=2000, learning_rate=0.3, 
            depth = 5,
            l2_leaf_reg = 7,
            rsm = 1.0,
            loss_function='MAE', 
            random_seed=27)
    else:
        print('not 75')

    model.fit(
            X_all,
            y_all['target_{}'.format(position)],
            #X=df_train,
            #y=y_train['target_{}'.format(position)],
            #use_best_model=True,
            #eval_set=(df_test, y_test['target_{}'.format(position)]),
            verbose=False)
    y_predict = model.predict(df_test)
    
    print('target_{}'.format(position))
    print('stupid:\t{}'.format(sMAPE(
        y_test['target_{}'.format(position)],
        y_train['target_{}'.format(position)].median())))
    print('model:\t{}'.format(sMAPE(
        y_test['target_{}'.format(position)],
        y_predict)))
    print()
    
    model_to_save['models'][position] = model

10
not 30
not 45
not 60
not 75
target_10
stupid:	0.582173151841312
model:	0.3097197086619267

not 10
30
not 45
not 60
not 75
target_30
stupid:	0.5388110900373055
model:	0.25066337596859284

not 10
not 30
45
not 60
not 75
target_45
stupid:	0.5280161445107636
model:	0.23213677351752926

not 10
not 30
not 45
60
not 75
target_60
stupid:	0.5090823344692388
model:	0.2182152603819293

not 10
not 30
not 45
not 60
75
target_75
stupid:	0.5070227037322823
model:	0.2067325321119251



Our model is better than constant solution. Saving model.

In [339]:
pickle.dump(model_to_save, open('models.pkl', 'wb'))

In [18]:
model.get_params()

{'loss_function': 'MAE', 'learning_rate': 1.0, 'iterations': 2000}

In [295]:
for i in range(len(model.feature_importances_)):
    print(model.feature_names_[i] + ': ' + str(model.feature_importances_[i]))

weekday: 1.0213571573942886
hour: 27.404556962442875
minute: 1.8617070290015951
month: 0.2624193771380556
is_night: 0.3181152711152566
is_weekend: 1.0746050814760255
num_orders_15_15: 1.4046844643747816
num_orders_30_15: 0.08945171152889983
num_orders_30_30: 0.09620485267218244
num_orders_60_15: 0.06323692840826382
num_orders_60_30: 0.14941474041207659
num_orders_60_60: 1.6951367102803234
num_orders_1440_15: 0.010007156706375793
num_orders_1440_30: 0.1713318743390005
num_orders_1440_60: 6.991810421254587
num_orders_1440_1440: 9.030463911297906
num_orders_2880_15: 0.002004196576478493
num_orders_2880_30: 0.4302611029473824
num_orders_2880_60: 6.152118529901491
num_orders_2880_1440: 1.4116724315585756
num_orders_2880_2880: 1.150300669500152
num_orders_4320_15: 0.027259094669783015
num_orders_4320_30: 0.3715284505019757
num_orders_4320_60: 4.560903082064364
num_orders_4320_1440: 2.487268305125261
num_orders_4320_2880: 1.4321479958190728
num_orders_10080_15: 0.0026604222987663432
num_order