(Work in progress...)

The first part of this notebook includes exploratory analysis.

The second part will feature future prediction.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # graphs
import plotly.offline 
import plotly.graph_objs as go
plotly.offline.init_notebook_mode(connected=True)
iplot = plotly.offline.iplot

from IPython.display import display

## Data

In [None]:
cats = pd.read_csv('../input/item_categories.csv')
items = pd.read_csv('../input/items.csv')
shops = pd.read_csv('../input/shops.csv')
train = pd.read_csv('../input/sales_train.csv')
test = pd.read_csv('../input/test.csv')

print(f'''Shapes:
Item categories: {cats.shape}
Items: {items.shape}
Shops: {shops.shape}
Train set: {train.shape}
Test set: {test.shape}''')

### Item categories

There are 84 categories. They are all in Russian!

In [None]:
print(cats.info())
cats.head()

### Items

There are 22170 items

In [None]:
print(items.info())
items.head()

### Shops

There are 60 shops

In [None]:
print(shops.info())
shops.head()

### Train set
6 features: date, date_block_num, shop_id, item_id, item_price, item_cnt_day

2935849 records

In [None]:
print(train.info())
train.head()

### Test set
214200 entries

In [None]:
print(test.info())
test.head()

## Target variable: item_cnt_day

Number of products sold. We are predicting a monthly amount of this measure.

The majority of them are just 1 item.
Some of them are negative (returned items).
Some of them are over 20 items.

The maximum is 2169 items, on 28/10/2015, for shop_id 12 and item_id 11373 (Boxberry).
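
Observations like these can be checked with a few lines of pandas. A minimal sketch on a made-up four-row frame (toy stands in for train, and all of its values are invented):

```python
import pandas as pd

# Toy stand-in for the train frame (values are invented for illustration)
toy = pd.DataFrame({
    'date': ['01.01.2013', '02.01.2013', '03.01.2013', '04.01.2013'],
    'shop_id': [1, 1, 2, 2],
    'item_id': [10, 11, 10, 12],
    'item_cnt_day': [1.0, -1.0, 25.0, 3.0],
})

# Row holding the largest single-day count
max_row = toy.loc[toy['item_cnt_day'].idxmax()]

# Share of records that are returns (negative) or large orders (over 20 items)
share_returns = (toy['item_cnt_day'] < 0).mean()
share_large = (toy['item_cnt_day'] > 20).mean()

print(max_row['date'], max_row['shop_id'], max_row['item_id'])
print(share_returns, share_large)
```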


In [None]:
# train['item_cnt_day'].value_counts().sort_index()
pd.cut(train['item_cnt_day'], [-np.inf] + list(range(0, 21)) + [np.inf]).value_counts()
# items[items.item_id == train.sort_values('item_cnt_day', ascending=False).head()['item_id'].iloc[0]]

## Predictor variables

We create a new feature, item_sale, defined as item_price × item_cnt_day.

We also merge item_category_id from items into the train and test sets.

In [None]:
train['item_sale'] = train['item_price'] * train['item_cnt_day'] # a new feature to calculate total sale for the item
train = pd.merge(train, items[['item_id', 'item_category_id']], how='left', on='item_id') # merge train & items to get item_category_id
test = pd.merge(test, items[['item_id', 'item_category_id']], how='left', on='item_id')
display(train.head(), test.head())

We check that each item_id belongs to exactly one item_category_id

In [None]:
train.groupby('item_id')['item_category_id'].agg(lambda x: x.nunique()).value_counts()
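
The same check can be written as a single boolean. A small sketch on invented data (items_toy and sales_toy are hypothetical stand-ins for items and train). It also shows why the result is expected: if item_id is unique in the lookup table, a left merge cannot attach two categories to one item.

```python
import pandas as pd

# Invented stand-ins for the items lookup table and the sales records
items_toy = pd.DataFrame({'item_id': [1, 2, 3], 'item_category_id': [40, 40, 19]})
sales_toy = pd.DataFrame({'item_id': [1, 1, 2, 3, 3]})

merged = sales_toy.merge(items_toy, how='left', on='item_id')

# Number of distinct categories seen for each item
per_item = merged.groupby('item_id')['item_category_id'].nunique()
one_to_one = bool((per_item == 1).all())

# Uniqueness of the key in the lookup table implies the result above
lookup_is_unique = items_toy['item_id'].is_unique
print(one_to_one, lookup_is_unique)
```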

### date
Convert from string to datetime type and add year and month columns

In [None]:
train['date'] = pd.to_datetime(train['date'], format='%d.%m.%Y')

# add year and month as they can be useful for future prediction
train['year'] = train['date'].dt.year
train['month'] = train['date'].dt.month

# add date_block_num, year, month for test too
test['date_block_num'] = 34 # continue from 33 from train
test['year'] = 2015
test['month'] = 11
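
The hard-coded values for the test set follow from how date_block_num appears to count months. A small sketch, assuming block 0 is January 2013 and the blocks are consecutive calendar months (the helper name block_num is made up here):

```python
def block_num(year, month):
    """Month index counted from January 2013 (assumed to be block 0)."""
    return (year - 2013) * 12 + (month - 1)

# October 2015 is the last training block, November 2015 is the test block
print(block_num(2015, 10), block_num(2015, 11))
```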

Now we have date (day, month, year, week, etc) as well as item_id, item_category_id, shop_id. We want to group them and look at the item_cnt_day and item_sale


There is a clear weekly cycle, with more sales on Thu, Fri and Sat

If we group them by month, we can clearly see that:
- Dec and Jan had the highest item_cnt_day.
- Dec also had the highest item_sale (likely Xmas)
- Nov had the lowest item_cnt_day, but Jul had the lowest item_sale

There are also a few peaks over the years:
- End of Nov 2013, high revenue
- End of Dec 2013, lots of item_cnt_day, but relatively lower item_sale compared to Nov 2013
- End of Dec 2014, lots of item_cnt_day
- There are also peaks around the end of May 2014 and 2015

We also see declines in item_cnt_day and item_sale from 2013 to 2015
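
Each of these views comes from grouping the records by a date part and summing. A minimal sketch of the day-of-week case on invented data (toy and its values are made up, and dayofweek numbers Monday as 0):

```python
import pandas as pd

# Invented daily records; 05.01.2013 and 12.01.2013 are Saturdays
toy = pd.DataFrame({
    'date': pd.to_datetime(
        ['05.01.2013', '06.01.2013', '07.01.2013', '12.01.2013'],
        format='%d.%m.%Y',
    ),
    'item_cnt_day': [3.0, 1.0, 2.0, 4.0],
})

# Sum the counts per weekday (0 = Monday ... 6 = Sunday)
by_weekday = toy.groupby(toy['date'].dt.dayofweek)['item_cnt_day'].sum()
busiest = by_weekday.idxmax()
print(by_weekday)
```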

In [None]:
# this code block below plots interactive graphs
def groupby(thing,  label = ''):
    # this function builds one entry of the updatemenus buttons list further down
    tmp = train.groupby(thing)[['item_cnt_day', 'item_sale']].sum()
    return dict(
        args=[{
            'x': [tmp.index, tmp.index],
            'y': [tmp['item_cnt_day'], tmp['item_sale']],
        }],
        method='update', label=label
    )

tmp = groupby('date')
trace1 = go.Scatter(x=tmp['args'][0]['x'][0], y=tmp['args'][0]['y'][0], opacity=0.75, name='item_cnt_day')
trace2 = go.Scatter(x=tmp['args'][0]['x'][1], y=tmp['args'][0]['y'][1], opacity=0.75, name='item_sale', yaxis='y2')
data = [trace1, trace2]
layout = go.Layout(
    yaxis=dict(title='Item counts'),
    yaxis2=dict(title='Item sale', overlaying='y', side='right')
)


fig = go.Figure(data=data, layout=layout)
fig.layout.updatemenus = list([
    dict(
        buttons=[groupby(i, l) for i, l in [
            (train.date.dt.date, 'date'),
            (train.date.dt.dayofyear, 'day of year'),
            (train.date.dt.day, 'day of month'),
            (train.date.dt.dayofweek, 'day of week'),
            (train.date.dt.month, 'month'),
            (train.date.dt.isocalendar().week, 'week'), # .dt.week and .dt.weekofyear were removed in pandas 2.0
            (train.date.dt.isocalendar().week, 'week of year'),
            (train.date.dt.year, 'year')
        ]] ,
        direction = 'down',
        showactive = True,
        x = 0, xanchor = 'left',
        y = 1.25, yanchor = 'top' 
    ),
])
iplot(fig)

In total, 3.6M items were sold, with a combined revenue of \$3.39 billion.

By item, the most popular were:
- By item_cnt_day: item_id 20949, with 187642 units sold, generating \$929K
- By item_sale: item_id 6675, worth \$219M in revenue, with 10289 units sold

The worst were items 1590, 11871, 18062, 13474 and 13477: shops lost money on them.

By shop:
- shop_id 31 had the highest item_cnt_day (310777 items sold) and also the highest item_sale (\$235M)
- shop_id 36 had the lowest item_cnt_day (330), with a revenue of \$377K

By item_category_id, the most popular were:
- By item_cnt_day: item_category_id 40, with 634171 units sold, generating \$170M
- By item_sale: item_category_id 19, with 254887 units sold (\$412M)

The worst performers were item_category_id 51, with only 1 unit sold (\$129), and item_category_id 50, with 3 units sold (\$24).
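
Rankings of this kind can be read off a single grouped sum. A short sketch on invented data (toy and its numbers are made up and unrelated to the figures above):

```python
import pandas as pd

# Invented sales records; item 3 only has a return, so its revenue is negative
toy = pd.DataFrame({
    'item_id': [1, 1, 2, 3],
    'item_price': [10.0, 10.0, 500.0, 20.0],
    'item_cnt_day': [5.0, 4.0, 1.0, -1.0],
})
toy['item_sale'] = toy['item_price'] * toy['item_cnt_day']

totals = toy.groupby('item_id')[['item_cnt_day', 'item_sale']].sum()

top_by_count = totals['item_cnt_day'].idxmax()   # most units sold
top_by_sale = totals['item_sale'].idxmax()       # most revenue
loss_makers = totals.index[totals['item_sale'] < 0].tolist()
print(top_by_count, top_by_sale, loss_makers)
```

The same pattern works for shop_id and item_category_id by changing the grouping column.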

In [None]:
def groupby(thing, sort_values = 'item_cnt_day'):
    tmp = train.groupby(thing)[['item_cnt_day', 'item_sale']].sum().sort_values(sort_values).reset_index()[:-100:-1]
    return dict(
        args=[{
            'x': [tmp[thing], tmp[thing]],
            'y': [tmp['item_cnt_day'], tmp['item_sale']],
        }, {
            'xaxis': dict(type='category')
        }],
        method='update', label=thing + ' sort by ' + sort_values
    )

tmp = groupby('item_id')
trace1 = go.Bar(x=tmp['args'][0]['x'][0], y=tmp['args'][0]['y'][0], opacity=0.5, name='item_cnt_day')
trace2 = go.Bar(x=tmp['args'][0]['x'][1], y=tmp['args'][0]['y'][1], opacity=0.5, name='item_sale', yaxis='y2')
data = [trace1, trace2]
layout = go.Layout(
    xaxis=dict(type='category'),
    yaxis=dict(title='Item counts'),
    yaxis2=dict(title='Item sale', overlaying='y', side='right')
)
fig = go.Figure(data=data, layout=layout)
fig.layout.updatemenus = list([
    dict(
        buttons=[groupby(i, s) for i, s in [
            ('item_id', 'item_cnt_day'),
            ('item_id', 'item_sale'),
            ('item_category_id', 'item_cnt_day'),
            ('item_category_id', 'item_sale'),
            ('shop_id', 'item_cnt_day'),
            ('shop_id', 'item_sale'),
        ]] ,
        direction = 'down',
        showactive = True,
        x = 0, xanchor = 'left',
        y = 1.25, yanchor = 'top' 
    ),
])
iplot(fig)

We can examine the activity of individual items over time. Some products were popular at the beginning and then faded away, some became more popular near the end of the period, and some have peaks at various time points.

In [None]:
# this code block takes a long time to run, so only select the top 25 items for display
traces = []
for i in train.groupby('item_id')['item_cnt_day'].sum().sort_values(ascending=False).index[:25]:
    tmp = train[train['item_id'] == i].groupby('date_block_num')['item_cnt_day'].sum()
    traces.append(go.Scatter(x = tmp.index, y = tmp, opacity=0.5, name=str(i), visible='legendonly'))

iplot({
    'data': traces,
    'layout': {
        'xaxis': { 'title': 'date_block_num' },
        'yaxis': { 'title': 'item_cnt_day' },
        'title': 'Number of item_cnt_day per item_id',
    },
})

We can also examine the activity of each item category.

In [None]:
traces = []
for i in train.groupby('item_category_id')['item_cnt_day'].sum().sort_values(ascending=False).index:
    tmp = train[train['item_category_id'] == i].groupby('date_block_num')['item_cnt_day'].sum()
    traces.append(go.Scatter(x = tmp.index, y = tmp, opacity=0.5, name=str(i), visible='legendonly'))

iplot({
    'data': traces,
    'layout': {
        'xaxis': { 'title': 'date_block_num' },
        'yaxis': { 'title': 'item_cnt_day' },
        'title': 'Number of item_cnt_day per item_category_id',
    },
})

Similarly, we can examine the activity of individual shops over time.

In [None]:
traces = []
for i in train.groupby('shop_id')['item_cnt_day'].sum().sort_values(ascending=False).index:
    tmp = train[train['shop_id'] == i].groupby('date_block_num')['item_cnt_day'].sum()
    traces.append(go.Scatter(x = tmp.index, y = tmp, opacity=0.5, name=str(i), visible='legendonly'))

iplot({
    'data': traces,
    'layout': {
        'xaxis': { 'title': 'date_block_num' },
        'yaxis': { 'title': 'item_cnt_day' },
        'title': 'Number of item_cnt_day per shop',
    },
})

Now we want to explore the correlation between item_cnt_day and item_price. For most items, the two do not seem to be strongly correlated.

In [None]:
def count_and_corr(x):
    count = len(x)
    corr = np.corrcoef(x['item_cnt_day'], x['item_price'])[0, 1] if count > 1 else np.nan
    return pd.Series([count, corr], index=['count', 'corr'])

train.groupby('item_id').apply(count_and_corr).sort_values('corr', ascending=False)
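
A toy run makes the output easier to read. The sketch below uses the pandas Series.corr method instead of np.corrcoef, on invented data. It assumes, as the code above suggests, that Pearson correlation is intended. An item whose price never changes has no defined correlation and comes out as NaN.

```python
import numpy as np
import pandas as pd

# Invented records: item 1 sells more as its price drops, item 2 has a fixed price
toy = pd.DataFrame({
    'item_id': [1, 1, 1, 2, 2, 2],
    'item_price': [10.0, 8.0, 6.0, 5.0, 5.0, 5.0],
    'item_cnt_day': [1.0, 2.0, 3.0, 1.0, 2.0, 1.0],
})

# Silence the divide-by-zero warning raised for the constant-price item
with np.errstate(invalid='ignore', divide='ignore'):
    corr = {
        item: grp['item_cnt_day'].corr(grp['item_price'])
        for item, grp in toy.groupby('item_id')
    }
print(corr)
```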

We now want to see the distributions of items and shops in the test set
- For shop_id, all shop_ids present in the test set are also present in the train set
- All item_category_id are present in test set and train set.
- For item_id, there are 363 items present only in the test set but are not in the train set.

In [None]:
len(set(test['shop_id']).difference(set(train['shop_id']))) # all shop_id in the test set are present in the train set
len(set(test['item_category_id']).difference(set(train['item_category_id']))) # all categories are present in both
len(set(test['item_id']).difference(set(train['item_id']))) # 363 item_id in the test set are NOT present in the train set
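
A notebook cell only displays its last expression, so of the three lines above only the item_id count is shown. A small sketch that collects every comparison in one dictionary, using invented frames (train_toy and test_toy are hypothetical):

```python
import pandas as pd

# Invented stand-ins for the train and test sets
train_toy = pd.DataFrame({'shop_id': [1, 2, 3], 'item_id': [10, 11, 12]})
test_toy = pd.DataFrame({'shop_id': [2, 3], 'item_id': [12, 13]})

# For each key column, the values that occur in test but never in train
unseen = {
    col: sorted(int(v) for v in set(test_toy[col]) - set(train_toy[col]))
    for col in ['shop_id', 'item_id']
}
print(unseen)
```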

## Basic prediction

The most basic prediction would be to use the result from the previous month. In this case, data from the train set where date_block_num == 33

This will give a public score of 8.53027, which is really bad.

But if we clip the item_cnt to 0 ~ 20, the public score will be 1.16777

Clipping to -10 ~ 30 gives a public score of 1.23867

Now our goal is to beat the 1.16777
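
The size of that jump is typical of RMSE, where a single extreme prediction dominates the score. A worked sketch with invented numbers, assuming the leaderboard targets themselves lie within 0 to 20 (the scores above suggest this, but it is an assumption here):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

y_true = np.array([0.0, 1.0, 2.0, 20.0])    # invented targets, already within 0..20
y_pred = np.array([0.0, 1.0, 3.0, 500.0])   # one extreme prediction

raw = rmse(y_true, y_pred)
clipped = rmse(y_true, y_pred.clip(0, 20))
print(raw, clipped)
```

After clipping, only the error of 1 on the third row remains, so the score falls from about 240 to 0.5.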

In [None]:
# first we define some helper functions: scoring and saving to csv

from sklearn.metrics import mean_squared_error

def score(*y): return mean_squared_error(*y) ** 0.5

# save to csv for submission
def to_csv(predicted_values, filename):
    pd.DataFrame({
        'ID': test['ID'],
        'item_cnt_month': predicted_values.clip(0, 20)
    }).to_csv(filename, index=False)

In [None]:
tmp = train[train['date_block_num'] == 33].groupby(['shop_id', 'item_id'])['item_cnt_day'].sum().reset_index().rename(columns={ 'item_cnt_day': 'item_cnt_month' }) # agg item_cnt_day by shop_id and item_id
tmp = test.merge(tmp, how='left', on=['shop_id', 'item_id']) # merge test with tmp using shop_id and item_id

# use as is
to_csv(tmp['item_cnt_month'].fillna(0), 'basic.csv') # there are lots of NaN in item_cnt LB 8.53027

# clip from 0 to 20
to_csv(tmp['item_cnt_month'].fillna(0).clip(0, 20), 'basic_clipped_0_20.csv') # LB 1.16777

# fill na with median groupby shop
to_csv(tmp['item_cnt_month'].fillna(tmp.groupby('shop_id')['item_cnt_month'].transform('median')).clip(0, 20), 'basic_fillna_median.csv') # LB 1.41848

# just having a look
tmp.item_cnt_month.describe()

## Prediction with machine learning

We are going to try with: RandomForest, lightGBM, XGBoost

First, we need to restructure the data and add more features

### Trial 1, add item_cnt from previous month

In [None]:
# we reformat train to aggregate item_cnt_day into item_cnt_month
tmp = train.groupby(['date_block_num', 'year', 'month', 'shop_id', 'item_id', 'item_category_id']).agg({
    'item_price': 'median',
    'item_cnt_day': 'sum',
}).reset_index().rename(columns={ 'item_cnt_day': 'item_cnt_month' })

# then add the previous item_cnt_month
tmp = tmp.merge(
    tmp.assign(date_block_num=tmp['date_block_num'] + 1)[['date_block_num', 'shop_id', 'item_id', 'item_cnt_month']].rename(columns={ 'item_cnt_month': 'item_cnt_month_pre' }),
    how='left',
    on=['date_block_num', 'shop_id', 'item_id'],
)

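# remove the date_block_num 0 (it has no previous month); copy so that the assignment below acts on an independent frame
tmp = tmp[tmp['date_block_num'] != 0].copy()

# fillna with 0 (assign back, because inplace on a selected column may not propagate in newer pandas)
tmp['item_cnt_month_pre'] = tmp['item_cnt_month_pre'].fillna(0)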

# train.groupby(['date_block_num', 'year', 'month']).size() # this is to check date_block_num vs year / month all agree

# these are our predictor features
cols = ['date_block_num', 'year', 'month', 'shop_id', 'item_id', 'item_category_id', 'item_price', 'item_cnt_month_pre']

# then create train and val set
X_train = tmp[tmp['date_block_num'] < 33][cols]
y_train = tmp[tmp['date_block_num'] < 33]['item_cnt_month']

X_val = tmp[tmp['date_block_num'] == 33][cols]
y_val = tmp[tmp['date_block_num'] == 33]['item_cnt_month']

# show the data
display(tmp.head())

# also add extra features to test
tmp = pd.merge(
    test,
    tmp[tmp['date_block_num'] == 33][['shop_id', 'item_id', 'item_price', 'item_cnt_month']].rename(columns={ 'item_cnt_month': 'item_cnt_month_pre' }),
    how='left',
    on=['shop_id', 'item_id'],
    suffixes=['', '_y']
).fillna(0)[cols]

# then show the data
display(tmp.head())
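
The shift-and-merge used for the previous-month feature is easier to follow on a handful of rows. A minimal sketch with invented values (monthly is a hypothetical stand-in for the aggregated frame):

```python
import pandas as pd

# Invented monthly totals for two shops selling the same item
monthly = pd.DataFrame({
    'date_block_num': [0, 1, 2, 1],
    'shop_id': [1, 1, 1, 2],
    'item_id': [10, 10, 10, 10],
    'item_cnt_month': [3.0, 5.0, 2.0, 7.0],
})

# Move every row one month forward so it lines up with the month it should inform
prev = (
    monthly.assign(date_block_num=monthly['date_block_num'] + 1)
    .rename(columns={'item_cnt_month': 'item_cnt_month_pre'})
)

lagged = monthly.merge(prev, how='left', on=['date_block_num', 'shop_id', 'item_id'])
lagged['item_cnt_month_pre'] = lagged['item_cnt_month_pre'].fillna(0)
print(lagged)
```

Rows with no record in the previous month get 0, which treats a missing record as no sales.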

### RandomForest

It takes a long time to run and the results from the default parameters are not great. 

In [None]:
# %%time
# from sklearn.ensemble import RandomForestRegressor

# model = RandomForestRegressor(n_estimators=100, random_state=0, n_jobs=-1)
# model.fit(X_train, y_train)

# print(f'''RMSE
# train: {score(model.predict(X_train), y_train):.4f}
# val: {score(model.predict(X_val), y_val):.4f}''')

# # save to csv
# to_csv(model.predict(test[cols]), 'submission_rf.csv')

# # repeat with clipping y
# model = RandomForestRegressor(n_estimators=100, random_state=0, n_jobs=-1)
# model.fit(X_train, y_train.clip(0, 20))

# print(f'''RMSE with clipping in y
# train: {score(model.predict(X_train), y_train.clip(0, 20)):.4f}
# val: {score(model.predict(X_val), y_val.clip(0, 20)):.4f}''')

# to_csv(model.predict(test[cols]), 'submission_rf2.csv')

In [None]:
# run out of memory when n_estimators>=200, so not run this
# %%time
# hist = { 'train': [], 'val': [], 'i': [], }
# for i in [10, 100, 200, 300, 400, 500]:
#     print(i)
#     model = RandomForestRegressor(n_estimators=i, max_depth=10, random_state=0, n_jobs=-1)
#     model.fit(X_train, y_train)
#     hist['i'].append(i)
#     hist['train'].append(score(model.predict(X_train), y_train))
#     hist['val'].append(score(model.predict(X_val), y_val))

### lightGBM

Using the default parameters

In [None]:
%%time

import lightgbm as lgb

params = {
    'task': 'train',
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': 'rmse',
    'seed': 0,
    'nthread': 4,
}

model = lgb.train(
    params,
    lgb.Dataset(X_train, y_train),
    5000,
    valid_sets=lgb.Dataset(X_val, y_val),
    early_stopping_rounds=100,
    verbose_eval=100
)

to_csv(model.predict(tmp), 'lgb1.csv') # this gives LB 5.52450

LightGBM, clip the y_train during training.

In [None]:
%%time

model = lgb.train(
    params,
    lgb.Dataset(X_train, y_train.clip(0, 20)), # clip 0 ~ 20 here
    5000,
    valid_sets=lgb.Dataset(X_val, y_val.clip(0, 20)),
    early_stopping_rounds=100,
    verbose_eval=100
)

to_csv(model.predict(tmp), 'lgb1b.csv') # this gives LB 2.78524

### XGBoost

In [None]:
%%time

import xgboost as xgb

model = xgb.train(
    params={ 'eta': 0.15, 'silent': 1,  },
    dtrain=xgb.DMatrix(X_train, label=y_train, silent=True),
    num_boost_round=100,
    evals=[(xgb.DMatrix(X_val, label=y_val, silent=True), 'test')],
    early_stopping_rounds=10
)

to_csv(model.predict(xgb.DMatrix(tmp)), 'xgb1.csv') # LB 18.51368

In [None]:
%%time
# clipping y to 0, 20 during training

import xgboost as xgb

model = xgb.train(
    params={ 'eta': 0.15, 'silent': 1,  },
    dtrain=xgb.DMatrix(X_train, label=y_train.clip(0, 20), silent=True),
    num_boost_round=100,
    evals=[(xgb.DMatrix(X_val, label=y_val.clip(0, 20), silent=True), 'test')],
    early_stopping_rounds=10
)


to_csv(model.predict(xgb.DMatrix(tmp)), 'xgb1b.csv') # LB is 3.32154

### Trial 2, use item_cnt up to 6 previous months

Instead of using only previous month item_cnt, we can use more.

Below, we rebuild the aggregated dataframe, tmp, to use item_cnt from the previous 6 months

In [None]:
# as before, agg item_cnt_month
tmp = train.groupby(['date_block_num', 'year', 'month', 'shop_id', 'item_id', 'item_category_id']).agg({
    'item_price': 'median',
    'item_cnt_day': 'sum',
}).reset_index().rename(columns={ 'item_cnt_day': 'item_cnt_month' })

# then use item_cnt_month from the previous 6 months
for i in range(1, 7):
    tmpi = tmp[['date_block_num', 'shop_id', 'item_id', 'item_cnt_month']].copy()
    tmpi['date_block_num'] += i
    tmpi.rename(columns={ 'item_cnt_month': 'item_cnt_month_pre_' + str(i) }, inplace=True)
    tmp = tmp.merge(tmpi, how='left', on=['date_block_num', 'shop_id', 'item_id']).fillna(0)

    
# these are our predictor features
cols = ['date_block_num', 'year', 'month', 'shop_id', 'item_id', 'item_category_id', 'item_price',
        'item_cnt_month_pre_1', 'item_cnt_month_pre_2', 'item_cnt_month_pre_3', 'item_cnt_month_pre_4', 'item_cnt_month_pre_5', 'item_cnt_month_pre_6']

# clip the item_cnt_month to 0 to 20
for i in ['item_cnt_month', 'item_cnt_month_pre_1', 'item_cnt_month_pre_2', 'item_cnt_month_pre_3', 'item_cnt_month_pre_4', 'item_cnt_month_pre_5', 'item_cnt_month_pre_6']:
    tmp[i] = tmp[i].clip(0, 20)

display(tmp.head())

# create train and val set
X_train = tmp[(tmp['date_block_num'] < 33) & (tmp['date_block_num'] > 5)][cols] # remove the first 6 months in the dataset where date_block_num is in 0 to 5
y_train = tmp[(tmp['date_block_num'] < 33) & (tmp['date_block_num'] > 5)]['item_cnt_month']

X_val = tmp[tmp['date_block_num'] == 33][cols]
y_val = tmp[tmp['date_block_num'] == 33]['item_cnt_month']

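# test set features need to be updated too
# note: shifting date_block_num by 1 reuses the month-33 rows unchanged, so item_cnt_month_pre_1 here holds the month-32 count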
tmp = pd.merge(
    test[['ID', 'date_block_num', 'year', 'month', 'shop_id', 'item_id', 'item_category_id']],
    tmp.assign(date_block_num = tmp.date_block_num + 1),
    how='left',
    on=['date_block_num', 'shop_id', 'item_id'],
    suffixes=['', '_y']
).fillna(0)[cols]

print(tmp.shape)
display(tmp.head())

Now try to run lightGBM again with default parameters

In [None]:
%%time
import lightgbm as lgb

params = {
    'task': 'train',
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': 'rmse',
    'seed': 0,
    'nthread': 4,
}

model = lgb.train(
    params,
    lgb.Dataset(X_train, y_train),
    5000,
    valid_sets=lgb.Dataset(X_val, y_val),
    early_stopping_rounds=100,
    verbose_eval=100
)

to_csv(model.predict(tmp), 'lgb2.csv') # this gives LB 2.89164 sadly

Since the default parameters do not give good results, it is now time for hyperparameter tuning ☕🍵

We are going to use hyperopt, a package for automated hyperparameter tuning using Bayesian Optimization (more [details](https://github.com/WillKoehrsen/hyperparameter-optimization)).

In [None]:
from hyperopt import hp, tpe, Trials, fmin, STATUS_OK
from time import time
import lightgbm as lgb

# iterations
i = 0

# define spaces to search
space = {
    'num_leaves': hp.quniform('num_leaves', 25, 75, 1),
    'learning_rate': hp.loguniform('learning_rate', np.log(0.01), np.log(0.05)),
    'subsample_for_bin': hp.quniform('subsample_for_bin', 20000, 300000, 20000),
    'min_child_samples': hp.quniform('min_child_samples', 20, 500, 5),
    'reg_alpha': hp.uniform('reg_alpha', 0.0, 1.0),
    'reg_lambda': hp.uniform('reg_lambda', 0.0, 1.0),
    'colsample_bytree': hp.uniform('colsample_bytree', 0.6, 1.0), # label now matches the LightGBM parameter name
    'bagging_fraction': hp.uniform('bagging_fraction', 0.75, 1.0),
}

# objective function: to minimise the cost
def objective(params):
    t = time() # keep track of duration for each iteration
    global i
    i += 1 # increase iteration count
    for j in ['num_leaves', 'subsample_for_bin', 'min_child_samples']: params[j] = int(params[j]) # need to explicitly turn them into int

    model = lgb.train(
        { **params, 'objective': 'regression', 'metric': 'rmse', 'nthread': 4, 'bagging_freq': 5, 'seed': 0 }, # merge params together
        lgb.Dataset(X_train, y_train),
        5000,
        valid_sets=lgb.Dataset(X_val, y_val),
        early_stopping_rounds=100,
        verbose_eval=0
    )
    loss = model.best_score['valid_0']['rmse']
    
    t = time() - t
    print(f'{i}) {t:.1f}s, {loss:.4f}, {params}')
    return loss

# our trial history
trials = Trials()

# run the Bayesian optimisation for params tuning
# best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=1000, trials=trials, rstate=np.random.RandomState(0))

In [None]:
# because tuning takes so long and the kernel can only run for 6 hours, every now and then I save the trials to a pickle file and upload it elsewhere
# later on, I can download the pickle file again and continue the tuning where it left off.
# I use https://file.io for simple file storage.

# import pickle
# with open('trials.pickle', 'wb') as handle: pickle.dump(trials, handle, protocol=pickle.HIGHEST_PROTOCOL)
# with open('trials.pickle', 'rb') as handle: trials = pickle.load(handle)
# !curl -F "file=@trials.pickle" https://file.io

# print(len(trials.trials))
# sorted(trials, key=lambda k: k['result']['loss'] if 'loss' in k['result'] else np.inf)
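
The save-and-restore step can be tried without running any tuning. A minimal sketch that uses a plain dictionary as a hypothetical stand-in for the Trials object and writes to a temporary directory. As far as I know, a restored Trials object can be passed back to fmin with a larger max_evals to continue, but that part is not shown here.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the hyperopt Trials object
history = {'losses': [1.20, 1.10], 'n_evals': 2}

path = os.path.join(tempfile.mkdtemp(), 'trials.pickle')

# Save the tuning history to disk
with open(path, 'wb') as handle:
    pickle.dump(history, handle, protocol=pickle.HIGHEST_PROTOCOL)

# Later (for example in a new kernel session), load it back
with open(path, 'rb') as handle:
    restored = pickle.load(handle)

print(restored == history)
```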

In [None]:
%%time
# we have found the best params

params = {
    'objective': 'regression', 'metric': 'rmse', 'nthread': 4, 'bagging_freq': 5, 'seed': 0,
    'bagging_fraction': 0.9706689662242649,
    'colsample_bytree': 0.7315494493281669,
    'learning_rate': 0.016357695097584456,
    'min_child_samples': 150,
    'num_leaves': 65,
    'reg_alpha': 0.053334028461403574,
    'reg_lambda': 0.111413264877147,
    'subsample_for_bin': 240000,
}

model = lgb.train(
    params,
    lgb.Dataset(X_train, y_train),
    5000,
    valid_sets=lgb.Dataset(X_val, y_val),
    early_stopping_rounds=100,
    verbose_eval=100
)

to_csv(model.predict(tmp), 'lgb2b.csv') # this gives LB 3.00277, :-(

### XGBoost

In [None]:
%%time

import xgboost as xgb

model = xgb.train(
    params={ 'eta': 0.15, 'silent': 1,  },
    dtrain=xgb.DMatrix(X_train, label=y_train, silent=True),
    num_boost_round=100,
    evals=[(xgb.DMatrix(X_val, label=y_val, silent=True), 'test')],
    early_stopping_rounds=10
)

to_csv(model.predict(xgb.DMatrix(tmp)), 'xgb2.csv') # LB 2.35866

### Trial 3

We pivot the data so that the columns are date_block_num and the values are item_cnt

In [None]:
tmp = train.pivot_table(index=['shop_id', 'item_id'], columns=['date_block_num'], values='item_cnt_day', aggfunc=np.sum).fillna(0)

tmp.head()

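# .loc slices by label and includes both ends, so each window stops one month before its target
X_train = tmp.loc[:, :31].values # months 0 to 31
y_train = tmp[32].values

X_val = tmp.loc[:, 1:32].values # months 1 to 32
y_val = tmp[33].values

tmp = test[['shop_id', 'item_id']].merge(tmp, how='left', on=['shop_id', 'item_id']).fillna(0).loc[:, range(2, 34)].values # months 2 to 33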
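
A toy version of this sliding-window layout, assuming each window of consecutive months is meant to predict the month right after it. Note that .loc slices by label and includes the end label. The frame wide and its values are invented for illustration:

```python
import pandas as pd

# Two invented (shop, item) rows with 5 monthly columns labelled 0..4
wide = pd.DataFrame(
    [[1, 2, 3, 4, 5], [0, 0, 1, 0, 2]],
    columns=pd.RangeIndex(5, name='date_block_num'),
)
last = wide.columns.max()  # 4

X_train = wide.loc[:, :last - 2].values    # months 0..2
y_train = wide[last - 1].values            # month 3
X_val = wide.loc[:, 1:last - 1].values     # months 1..3
y_val = wide[last].values                  # month 4
X_test = wide.loc[:, 2:last].values        # months 2..4, used to predict month 5

print(X_train.shape, X_val.shape, X_test.shape)
```

All three feature matrices have the same number of columns, which the models need at prediction time.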

In [None]:
%%time

import lightgbm as lgb
params = {
    'task': 'train',
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': 'rmse',
    'seed': 0,
    'nthread': -1,
}

model = lgb.train(
    params,
    lgb.Dataset(X_train, y_train),
    5000,
    valid_sets=lgb.Dataset(X_val, y_val),
    early_stopping_rounds=100,
    verbose_eval=100
)

to_csv(model.predict(tmp), 'lgb3.csv') # LB 1.08520

In [None]:
%%time

import xgboost as xgb

model = xgb.train(
    params={ 'eta': 0.15, 'silent': 1,  },
    dtrain=xgb.DMatrix(X_train, label=y_train, silent=True),
    num_boost_round=100,
    evals=[(xgb.DMatrix(X_val, label=y_val, silent=True), 'test')],
    early_stopping_rounds=10
)

to_csv(model.predict(xgb.DMatrix(tmp)), 'xgb3.csv') # LB 1.05809

We can clip the target values to the range 0 to 20 during training

In [None]:
%%time

import lightgbm as lgb
params = {
    'task': 'train',
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': 'rmse',
    'seed': 0,
    'nthread': -1,
}

model = lgb.train(
    params,
    lgb.Dataset(X_train, y_train.clip(0, 20)),
    5000,
    valid_sets=lgb.Dataset(X_val, y_val.clip(0, 20)),
    early_stopping_rounds=100,
    verbose_eval=100
)

to_csv(model.predict(tmp), 'lgb3b.csv') # LB 1.03805

In [None]:
%%time

import xgboost as xgb

model = xgb.train(
    params={ 'eta': 0.15, 'silent': 1,  },
    dtrain=xgb.DMatrix(X_train, label=y_train.clip(0, 20), silent=True), # clip 0 ~ 20 here
    num_boost_round=100,
    evals=[(xgb.DMatrix(X_val, label=y_val.clip(0, 20), silent=True), 'test')],
    early_stopping_rounds=10
)

to_csv(model.predict(xgb.DMatrix(tmp)), 'xgb3b.csv') # LB 1.03369

### Trial 4
We clip all the values, features as well as targets, to the range 0 to 20

In [None]:
tmp = train.pivot_table(index=['shop_id', 'item_id'], columns=['date_block_num'], values='item_cnt_day', aggfunc=np.sum).fillna(0).clip(0, 20)

tmp.head()

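# .loc slices by label and includes both ends, so each window stops one month before its target
X_train = tmp.loc[:, :31].values # months 0 to 31
y_train = tmp[32].values

X_val = tmp.loc[:, 1:32].values # months 1 to 32
y_val = tmp[33].values

tmp = test[['shop_id', 'item_id']].merge(tmp, how='left', on=['shop_id', 'item_id']).fillna(0).loc[:, range(2, 34)].values # months 2 to 33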

In [None]:
%%time

import lightgbm as lgb
params = {
    'task': 'train',
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': 'rmse',
    'seed': 0,
    'nthread': -1,
}

model = lgb.train(
    params,
    lgb.Dataset(X_train, y_train),
    5000,
    valid_sets=lgb.Dataset(X_val, y_val),
    early_stopping_rounds=100,
    verbose_eval=100
)

to_csv(model.predict(tmp), 'lgb4.csv') # LB 1.03592
# to_csv(model.predict(tmp).round(), 'lgb4b.csv') # LB 1.05436

We try swapping X_train and X_val during training

In [None]:
%%time

import lightgbm as lgb
params = {
    'task': 'train',
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': 'rmse',
    'seed': 0,
    'nthread': -1,
}

model = lgb.train(
    params,
    lgb.Dataset(X_val, y_val), # swap the X_train and X_val here
    5000,
    valid_sets=lgb.Dataset(X_train, y_train),
    early_stopping_rounds=100,
    verbose_eval=100
)

to_csv(model.predict(tmp), 'lgb4c.csv') # LB 1.02844

Hyperparameter tuning for LightGBM models

In [None]:
from hyperopt import hp, tpe, Trials, fmin, STATUS_OK
from time import time
import lightgbm as lgb

# iterations
i = 0

# define spaces to search
space = {
    'num_leaves': hp.quniform('num_leaves', 25, 75, 1),
    'learning_rate': hp.loguniform('learning_rate', np.log(0.01), np.log(0.05)),
    'subsample_for_bin': hp.quniform('subsample_for_bin', 20000, 300000, 20000),
    'min_child_samples': hp.quniform('min_child_samples', 20, 500, 5),
    'reg_alpha': hp.uniform('reg_alpha', 0.0, 1.0),
    'reg_lambda': hp.uniform('reg_lambda', 0.0, 1.0),
    'colsample_bytree': hp.uniform('colsample_bytree', 0.6, 1.0), # label now matches the LightGBM parameter name
    'bagging_fraction': hp.uniform('bagging_fraction', 0.75, 1.0),
}

# objective function: to minimise the cost
def objective(params):
    t = time() # keep track of duration for each iteration
    global i
    i += 1 # increase iteration count
    for j in ['num_leaves', 'subsample_for_bin', 'min_child_samples']: params[j] = int(params[j]) # need to explicitly turn them into int

    model = lgb.train(
        { **params, 'objective': 'regression', 'metric': 'rmse', 'nthread': 4, 'bagging_freq': 5, 'seed': 0 }, # merge params together
        lgb.Dataset(X_val, y_val),
        5000,
        valid_sets=lgb.Dataset(X_train, y_train),
        early_stopping_rounds=100,
        verbose_eval=0
    )
    loss = model.best_score['valid_0']['rmse']
    
    t = time() - t
    print(f'{i}) {t:.1f}s, {loss:.4f}, {params}')
    return loss

# our trial history
trials = Trials()

# run the Bayesian optimisation for params tuning
# best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=1000, trials=trials, rstate=np.random.RandomState(0))

Having run the hyperparameter optimisation above, we obtain the best params and run them as below:

In [None]:
%%time

import lightgbm as lgb

params = {
    'task': 'train',
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': 'rmse',
    'seed': 0,
    'nthread': -1,
    'bagging_fraction': 0.8525450579226199,
    'colsample_bytree': 0.9705350658278141,
    'learning_rate': 0.04395833528778046,
    'min_child_samples': 460,
    'num_leaves': 60,
    'reg_alpha': 0.18424221349470526,
    'reg_lambda': 0.7506282086709485,
    'subsample_for_bin': 140000,
}

model = lgb.train(
    params,
    lgb.Dataset(X_train, y_train),
    5000,
    valid_sets=lgb.Dataset(X_val, y_val),
    early_stopping_rounds=100,
    verbose_eval=100
)

to_csv(model.predict(tmp), 'lgb4d.csv') # LB

params = {
    'task': 'train',
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': 'rmse',
    'seed': 0,
    'nthread': -1,
    'bagging_fraction': 0.7730681635329509,
    'colsample_bytree': 0.9440491158187898,
    'learning_rate': 0.04806809101732328,
    'min_child_samples': 400,
    'num_leaves': 69,
    'reg_alpha': 0.5385150471866715,
    'reg_lambda': 0.8089922399251372,
    'subsample_for_bin': 60000
}

model = lgb.train(
    params,
    lgb.Dataset(X_val, y_val),
    5000,
    valid_sets=lgb.Dataset(X_train, y_train),
    early_stopping_rounds=100,
    verbose_eval=100
)

to_csv(model.predict(tmp), 'lgb4e.csv') # LB 1.08228, bad

#### Xgb

In [None]:
%%time

import xgboost as xgb

model = xgb.train(
    params={ 'eta': 0.15, 'silent': 1,  },
    dtrain=xgb.DMatrix(X_train, label=y_train, silent=True),
    num_boost_round=100,
    evals=[(xgb.DMatrix(X_val, label=y_val, silent=True), 'test')],
    early_stopping_rounds=10
)

to_csv(model.predict(xgb.DMatrix(tmp)), 'xgb4.csv') # LB 1.03369

Again, we swap X_train and X_val during training

In [None]:
%%time

import xgboost as xgb

model = xgb.train(
    params={ 'eta': 0.15, 'silent': 1,  },
    dtrain=xgb.DMatrix(X_val, label=y_val, silent=True), # swap X_train & X_val
    num_boost_round=100,
    evals=[(xgb.DMatrix(X_train, label=y_train, silent=True), 'test')],
    early_stopping_rounds=10
)

to_csv(model.predict(xgb.DMatrix(tmp)), 'xgb4b.csv') # LB 1.02826

While we are here, we may as well try simpler models such as: linear, Ridge, Bayesian Ridge, and KNN.

In [None]:
# simple linear model
from sklearn import linear_model

model = linear_model.LinearRegression()
model.fit(X_train, y_train)
print(score(model.predict(X_val), y_val))
to_csv(model.predict(tmp), 'linear_regression4.csv') # LB 1.03363

# repeat again but just swapping the train and val
model = linear_model.LinearRegression()
model.fit(X_val, y_val)
print(score(model.predict(X_train), y_train))
to_csv(model.predict(tmp), 'linear_regression4b.csv') # 1.03111

In [None]:
# Ridge

model = linear_model.RidgeCV(alphas=np.logspace(-1, 0, 10), cv=5)
model.fit(X_train, y_train)
print(score(model.predict(X_val), y_val))
to_csv(model.predict(tmp), 'lasso4.csv') # LB 1.03361

model = linear_model.RidgeCV(alphas=np.logspace(-1, 0, 10), cv=5)
model.fit(X_val, y_val)
print(score(model.predict(X_train), y_train))
to_csv(model.predict(tmp), 'lasso4b.csv') # LB 1.03111

In [None]:
# Bayesian Ridge

model = linear_model.BayesianRidge()
model.fit(X_train, y_train)
print(score(model.predict(X_val), y_val))
to_csv(model.predict(tmp), 'BayesianRidge4.csv') # LB 1.03363

model = linear_model.BayesianRidge()
model.fit(X_val, y_val)
print(score(model.predict(X_train), y_train))
to_csv(model.predict(tmp), 'BayesianRidge4b.csv') # LB 1.03112

In [None]:
# KNN, but this takes too long to run, and the LB is bad
# from sklearn.neighbors import KNeighborsRegressor
# model = KNeighborsRegressor(n_neighbors=2)
# model.fit(X_train, y_train)
# print(score(model.predict(X_val), y_val))
# to_csv(model.predict(tmp), 'knn4.csv') # LB 1.12914

### Trial 5
Similar to trial 4, but X_train and X_val are split randomly by row instead of by date_block_num

In [None]:
from sklearn.model_selection import train_test_split, KFold, cross_val_score

# cv = KFold(n_splits=5, shuffle=True, random_state=0)

In [None]:
tmp = train.pivot_table(index=['shop_id', 'item_id'], columns=['date_block_num'], values='item_cnt_day', aggfunc=np.sum).fillna(0).clip(0, 20)

X_train, X_val, y_train, y_val = train_test_split(tmp[list(range(33))], tmp[33], random_state=0)

tmp = test[['shop_id', 'item_id']].merge(tmp, how='left', on=['shop_id', 'item_id']).fillna(0).loc[:, range(1, 34)].values
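
For comparison with the earlier split by month, a random split by row can be sketched with plain NumPy. This is not what train_test_split does internally, only an illustration that assumes a 25% hold-out (its default test_size). The row count is made up.

```python
import numpy as np

rng = np.random.RandomState(0)
n_rows = 8                      # invented number of (shop, item) rows

# Shuffle the row positions, then hold out the first quarter for validation
perm = rng.permutation(n_rows)
n_val = n_rows // 4
val_idx, train_idx = perm[:n_val], perm[n_val:]

print(sorted(val_idx.tolist()), sorted(train_idx.tolist()))
```

Every row keeps all of its monthly columns and the two sets differ only in which rows they contain, so train and validation now share the same target month.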

In [None]:
%%time

import lightgbm as lgb
params = {
    'task': 'train',
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': 'rmse',
    'seed': 0,
    'nthread': -1,
}

model = lgb.train(
    params,
    lgb.Dataset(X_train, y_train),
    5000,
    valid_sets=lgb.Dataset(X_val, y_val),
    early_stopping_rounds=100,
    verbose_eval=100
)

to_csv(model.predict(tmp), 'lgb5.csv') # LB 1.05852, no improvement