# IDAO: expected time of orders in airports

Airports are special points for taxi service. Every day a lot of people use a taxi to get to the city centre from the airport.

One of important task is to predict how long a driver need to wait an order. It helps to understand what to do. Maybe the driver have to wait near doors, or can drink a tea, or even should drive to city center without an order.

We request you to solve a simple version of this prediction task.

**Task:** predict time of $k$ orders in airport (time since now when you get an order if you are $k$-th in queue), $k$ is one of 5 values (different for every airports).

**Data**
- train: number of order for every minutes for 6 months
- test: every test sample has datetime info + numer of order for every minutes for last 2 weeks

**Submission:** for every airport you should prepare a model which will be evaluated in submission system (code + model files). You can make different models for different airports.

**Evaluation:** for every airport for every $k$ sMAPE will be calculated and averaged. General leaderboard will be calculated via Borda count. 

## Baseline

In [1]:
%pylab inline

import catboost
import pandas as pd
import pickle
import tqdm

Populating the interactive namespace from numpy and matplotlib


Let's prepare a model for set2.

# Load train dataset

In [2]:
set_name = 'set2'
path_train_set = '../../data/train/{}.csv'.format(set_name)

data = pd.read_csv(path_train_set)
data.datetime = data.datetime.apply(
    lambda x: datetime.datetime.strptime(x, '%Y-%m-%d %H:%M:%S'))
data = data.sort_values('datetime')
data.head()

Unnamed: 0,datetime,num_orders
0,2018-04-01 00:00:00,0
1,2018-04-01 00:01:00,0
2,2018-04-01 00:02:00,0
3,2018-04-01 00:03:00,0
4,2018-04-01 00:04:00,0


Predict position for set2.

In [3]:
target_positions = {
    'set1': [10, 30, 45, 60, 75],
    'set2': [5, 10, 15, 20, 25],
    'set3': [5, 7, 9, 11, 13]
}[set_name]

Some useful constant.

In [4]:
HOUR_IN_MINUTES = 60
DAY_IN_MINUTES = 24 * HOUR_IN_MINUTES
WEEK_IN_MINUTES = 7 * DAY_IN_MINUTES

MAX_TIME = DAY_IN_MINUTES

## Generate train samples with targets

We have only history of orders (count of orders in every minutes) but we need to predict time of k orders since current minutes. So we should calculate target for train set. Also we will make a lot of samples from all set (we can only use two weeks of history while prediction so we can use only two weeks in every train sample).

In [5]:
samples = {
    'datetime': [],
    'history': []}

for position in target_positions:
    samples['target_{}'.format(position)] = []
    
num_orders = data.num_orders.values

To calculate target (minutes before k orders) we are going to use cumulative sum of orders. 

In [6]:
# start after 2 weeks because of history
# finish earlier because of target calculation
for i in range(2 * WEEK_IN_MINUTES,
               len(num_orders) - 2 * DAY_IN_MINUTES):
    
    samples['datetime'].append(data.datetime[i])
    samples['history'].append(num_orders[i-2*WEEK_IN_MINUTES:i])
    
    # cumsum not for all array because of time economy
    cumsum_num_orders = num_orders[i+1:i+1+2*DAY_IN_MINUTES].cumsum()
    for position in target_positions:
        orders_by_positions = np.where(cumsum_num_orders >= position)[0]
        if len(orders_by_positions):
            time = orders_by_positions[0] + 1
        else:
            # if no orders in last days
            time = MAX_TIME
        samples['target_{}'.format(position)].append(time)

Convert to pandas.dataframe. Now we have targets to train and predict.

In [7]:
df = pd.DataFrame.from_dict(samples)
df.head()

Unnamed: 0,datetime,history,target_10,target_15,target_20,target_25,target_5
0,2018-04-15 00:00:00,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",221,247,285,297,205
1,2018-04-15 00:01:00,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",220,246,284,296,204
2,2018-04-15 00:02:00,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",219,245,283,295,203
3,2018-04-15 00:03:00,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",218,244,282,294,202
4,2018-04-15 00:04:00,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",217,243,281,293,201


# Train model

Let's generate simple features.

By time:

In [8]:
df['weekday'] = df.datetime.apply(lambda x: x.weekday())
df['hour'] = df.datetime.apply(lambda x: x.hour)
df['minute'] = df.datetime.apply(lambda x: x.minute)
df['month'] = df.datetime.apply(lambda x: x.month)
df['is_night'] = df.datetime.apply(lambda x: 1.0 if (x.hour < 6) else 0.0)

In [9]:
df['is_weekend'] = df.datetime.apply(lambda x: 1.0 if (x.weekday() >=5)
                              or ((x.day == 23) and (x.month == 2))
                              or ((x.day == 8) and (x.month == 3))
                              or ((x.day == 9) and (x.month == 3))
                              or ((x.day == 30) and (x.month == 4))
                              or ((x.day == 1) and (x.month == 5))
                              or ((x.day == 2) and (x.month == 5))
                              or ((x.day == 9) and (x.month == 5))
                              or ((x.day == 11) and (x.month == 6))
                              or ((x.day == 12) and (x.month == 6))
                             else 0.0)

Aggregators by order history with different shift and window size:

In [10]:
SHIFTS = [
    HOUR_IN_MINUTES // 4,
    HOUR_IN_MINUTES // 2,
    HOUR_IN_MINUTES,
    DAY_IN_MINUTES,
    DAY_IN_MINUTES * 2,
    WEEK_IN_MINUTES,
    WEEK_IN_MINUTES * 2]
WINDOWS = [
    HOUR_IN_MINUTES // 4,
    HOUR_IN_MINUTES // 2,
    HOUR_IN_MINUTES,
    DAY_IN_MINUTES,
    DAY_IN_MINUTES * 2,
    WEEK_IN_MINUTES,
    WEEK_IN_MINUTES * 2]

In [11]:
for shift in SHIFTS:
    for window in WINDOWS:
        temp = 0
        if window > shift:
            continue
        if shift == window:
            df['num_orders_{}_{}'.format(shift, window)] = df.history.apply(lambda x: x[-shift : -1].sum())
        else:
            df['num_orders_{}_{}'.format(shift, window)] = df.history.apply(lambda x: x[-shift : -shift + window].sum())

Train/validation split for time. Let's use last 4 weeks for validation.

In [12]:
df.datetime.min(), df.datetime.max()

(Timestamp('2018-04-15 00:00:00'), Timestamp('2018-09-28 23:59:00'))

In [13]:
df_train = df.loc[df.datetime <= df.datetime.max() - datetime.timedelta(days=28)]
df_test = df.loc[df.datetime > df.datetime.max() - datetime.timedelta(days=28)]
#from sklearn.model_selection import train_test_split
#№df_train, df_test = train_test_split(df, test_size=0.2, shuffle=True)

In [14]:
target_cols = ['target_{}'.format(position) for position in target_positions]

y_train = df_train[target_cols]
y_test = df_test[target_cols]



df_train = df_train.drop(['datetime', 'history'] + target_cols, axis=1)
df_test = df_test.drop(['datetime', 'history'] + target_cols, axis=1)

In [15]:
X_all = df.drop(['datetime', 'history'] + target_cols, axis=1)
y_all = df[target_cols]

In [16]:
def sMAPE(y_true, y_predict, shift=0):
    return 2 * np.mean(
        np.abs(y_true - y_predict) /
        (np.abs(y_true) + np.abs(y_predict) + shift))

Also we will save models for prediction stage.

In [17]:
model_to_save = {
    'models': {}
}

What is good or bad model? We can compare our model with constant solution. For instance median (optimal solution for MAE).

In [18]:
for position in target_positions:
    model = catboost.CatBoostRegressor(
        iterations=2000, learning_rate=1.0, loss_function='MAE')
    model.fit(
        X=df_train,
        y=y_train['target_{}'.format(position)],
        use_best_model=True,
        eval_set=(df_test, y_test['target_{}'.format(position)]),
        verbose=False)
    y_predict = model.predict(df_test)
    
    print('target_{}'.format(position))
    print('stupid:\t{}'.format(sMAPE(
        y_test['target_{}'.format(position)],
        y_train['target_{}'.format(position)].median())))
    print('model:\t{}'.format(sMAPE(
        y_test['target_{}'.format(position)],
        y_predict)))
    print()
    
    model_to_save['models'][position] = model

target_5
stupid:	0.7296782178198603
model:	0.5137097089166239

target_10
stupid:	0.6368873894265404
model:	0.4332259467788421

target_15
stupid:	0.5810788045610309
model:	0.37134819575984596

target_20
stupid:	0.533455686705494
model:	0.32346051693329986

target_25
stupid:	0.499955747532939
model:	0.2933373362125351



In [237]:
for position in target_positions:
    if str(position) == '10':
        print('10')
        model = catboost.CatBoostRegressor(iterations=2000, learning_rate=0.185, 
            depth = 4,
            rsm=0.2,
            l2_leaf_reg = 9,
            loss_function='MAE', 
            random_seed=27)
    else:
        print('not 10')

    if str(position) == '30':
        print('30')
        model = catboost.CatBoostRegressor(iterations=2000, learning_rate=0.2, 
            depth = 7,
            rsm = 0.1,
            loss_function='MAE', 
            random_seed=27)
    else:
        print('not 30')
        
    if str(position) == '45':
        print('45')
        model = catboost.CatBoostRegressor(iterations=2000, learning_rate=0.225, 
            depth = 5,
            l2_leaf_reg = 8,
            rsm = 1.0,
            loss_function='MAE', 
            random_seed=27)
    else:
        print('not 45')
        
    if str(position) == '60':
        print('60')
        model = catboost.CatBoostRegressor(iterations=2000, learning_rate=0.275, 
            depth = 5,
            l2_leaf_reg = 3,
            rsm = 1.0,
            loss_function='MAE', 
            random_seed=27)
    else:
        print('not 60')
        
    if str(position) == '75':
        print('75')
        model = catboost.CatBoostRegressor(iterations=2000, learning_rate=0.3, 
            depth = 5,
            l2_leaf_reg = 7,
            rsm = 1.0,
            loss_function='MAE', 
            random_seed=27)
    else:
        print('not 75')

    model.fit(
            #X_all,
            #y_all['target_{}'.format(position)],
            X=df_train,
            y=y_train['target_{}'.format(position)],
            use_best_model=True,
            eval_set=(df_test, y_test['target_{}'.format(position)]),
            verbose=False)
    y_predict = model.predict(df_test)
    
    print('target_{}'.format(position))
    print('stupid:\t{}'.format(sMAPE(
        y_test['target_{}'.format(position)],
        y_train['target_{}'.format(position)].median())))
    print('model:\t{}'.format(sMAPE(
        y_test['target_{}'.format(position)],
        y_predict)))
    print()
    
    model_to_save['models'][position] = model

10
not 30
not 45
not 60
not 75
target_10
stupid:	0.582173151841312
model:	0.3112487728502909

not 10
30
not 45
not 60
not 75
target_30
stupid:	0.5388110900373055
model:	0.25249752087746613

not 10
not 30
45
not 60
not 75
target_45
stupid:	0.5280161445107636
model:	0.23467006985530176

not 10
not 30
not 45
60
not 75
target_60
stupid:	0.5090823344692388
model:	0.22171677830721417

not 10
not 30
not 45
not 60
75
target_75
stupid:	0.5070227037322823
model:	0.20965396338782602



Our model is better than constant solution. Saving model.

In [19]:
pickle.dump(model_to_save, open('models.pkl', 'wb'))

In [18]:
model.get_params()

{'loss_function': 'MAE', 'learning_rate': 1.0, 'iterations': 2000}

In [239]:
for i in range(len(model.feature_importances_)):
    print(model.feature_names_[i] + ': ' + str(model.feature_importances_[i]))

weekday: 0.507908532404712
hour: 29.320602007384114
minute: 1.6883773667016586
month: 0.31668455142484236
is_night: 0.37161671521052664
is_weekend: 0.8998736344526392
num_orders_15_15: 1.7667600103867425
num_orders_30_15: 0.016083145550416088
num_orders_30_30: 0.010354967671689334
num_orders_60_15: 0.017260299794472213
num_orders_60_30: 0.08401289066763444
num_orders_60_60: 1.5627701941293672
num_orders_1440_15: 0.05197090740881547
num_orders_1440_30: 0.2583489532003636
num_orders_1440_60: 8.878312898756878
num_orders_1440_1440: 8.996259902337794
num_orders_2880_15: 0.0019711478851571445
num_orders_2880_30: 0.46551097454864954
num_orders_2880_60: 8.251594831361322
num_orders_2880_1440: 1.2474298038160698
num_orders_2880_2880: 1.2523827679875634
num_orders_10080_15: 0.003501136590404999
num_orders_10080_30: 0.7512016946318609
num_orders_10080_60: 17.209364736578095
num_orders_10080_1440: 0.37454022588458885
num_orders_10080_2880: 0.6187310964004583
num_orders_10080_10080: 1.218263017603

In [219]:
df.head(3)

Unnamed: 0,datetime,history,target_10,target_30,target_45,target_60,target_75,weekday,hour,minute,...,num_orders_10080_30,num_orders_10080_60,num_orders_10080_1440,num_orders_10080_10080,num_orders_20160_15,num_orders_20160_30,num_orders_20160_60,num_orders_20160_1440,num_orders_20160_10080,num_orders_20160_20160
0,2018-03-15 00:00:00,"[0, 0, 0, 0, 1, 2, 0, 1, 1, 4, 0, 1, 1, 1, 1, ...",5,18,28,32,42,3,0,0,...,59,143,1659,9871,13,51,124,1364,9110,18981
1,2018-03-15 00:01:00,"[0, 0, 0, 1, 2, 0, 1, 1, 4, 0, 1, 1, 1, 1, 3, ...",5,19,27,32,42,3,0,1,...,60,145,1658,9870,16,53,126,1364,9111,18981
2,2018-03-15 00:02:00,"[0, 0, 1, 2, 0, 1, 1, 4, 0, 1, 1, 1, 1, 3, 2, ...",7,20,27,33,43,3,0,2,...,64,147,1660,9874,18,54,128,1364,9111,18985
