### Loading in the data

In [14]:
import numpy as np
import pandas as pd

In [110]:
# The files used here are generated by the first notebook.
# They are a tad too big for me to consider uploading them.
train  = pd.read_csv('newtrain.csv')
test_X = pd.read_csv('newtest.csv')

In [111]:
train = train.sample(frac = 1) # shuffle

train_X = train.drop(columns = ['sales']).copy()
train_y = train[['sales']].copy()

del train

### Trying to train it without further work

In [17]:
from lightgbm import LGBMRegressor

In [34]:
# Using default parameters
model = LGBMRegressor()

In [35]:
model = model.fit(train_X.values, train_y.values.ravel())

Wow, that took seconds...

In [36]:
preds = model.predict(test_X.values)

In [38]:
preds = preds.clip(0, 20) # clip to the required value range

In [41]:
print(sum(preds))

1776.6033909748116


Well, it's not all zeroes. Let's see how it does.

In [45]:
def writepreds(preds, filename):
    with open(filename, 'w') as file:
        file.write('ID,item_cnt_month\n')
        for i, pred in enumerate(preds):
            file.write(f'{i},{pred}\n')
        print('File written.')

In [46]:
writepreds(preds, 'lgbm1.csv')

File written.


This scored worse than our last submission. Maybe there's still more to do.

### Wrestling with the data some more

First thing is to one-hot encode the months.

In [112]:
dummies_train = pd.get_dummies(train_X['month'].astype(pd.CategoricalDtype(list(range(0, 12)))))
dummies_test = pd.get_dummies(test_X['month'].astype(pd.CategoricalDtype(list(range(0, 12)))))

In [113]:
train_X = pd.concat([train_X, dummies_train], axis=1).drop(['month'], axis=1)
test_X  = pd.concat([test_X,  dummies_test],  axis=1).drop(['month'], axis=1)
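
As a side note on why the explicit `CategoricalDtype` is worth the extra typing, here is a minimal sketch on made-up data (the `toy_*` frames are hypothetical, not the real files). If one frame only contains a few months, a plain `get_dummies` would give the two frames different columns, while the fixed category list keeps them aligned.

```python
import pandas as pd

# Hypothetical toy data: train covers several months, test only one.
toy_train = pd.DataFrame({'month': [0, 3, 11, 3]})
toy_test = pd.DataFrame({'month': [10, 10, 10]})

month_dtype = pd.CategoricalDtype(list(range(12)))

# Without a fixed category list, each frame only gets columns
# for the months it actually contains.
naive_train = pd.get_dummies(toy_train['month'])
naive_test = pd.get_dummies(toy_test['month'])

# With the fixed list, both frames get the same 12 columns in the same order.
fixed_train = pd.get_dummies(toy_train['month'].astype(month_dtype))
fixed_test = pd.get_dummies(toy_test['month'].astype(month_dtype))

print(naive_train.shape, naive_test.shape)
print(fixed_train.shape, fixed_test.shape)
```

Since the models here are fed `.values`, matching column order between train and test is what keeps each feature in the same position.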

I'll just go ahead and see if this improves anything.

In [69]:
model2 = LGBMRegressor().fit(train_X.values, train_y.values.ravel())
writepreds(model2.predict(test_X.values).clip(0, 20), 'lgbm2.csv')

File written.


This did not improve the results much. Let's add the months-since-beginning feature back into the data and add item categories (these will definitely have more impact than the item ID alone), then see what happens.

In [114]:
# Each month has 214200 entries and the index labels still follow the original row order
# (sample() shuffles rows but keeps their labels), so integer-divide the index to get the month.
train_X.insert(0, 'Month', train_X.index // 214200, True)
test_X .insert(0, 'Month', [22] * 214200, True)
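
The integer-division trick only works because the shuffle earlier on keeps each row's original index label. A small sketch on made-up data to check that, assuming (as the comment in the cell does) that the rows were originally stored month by month; `block` and `toy` are hypothetical stand-ins:

```python
import pandas as pd

block = 3  # hypothetical rows per month (214200 in the notebook)

# Rows 0-2 belong to month 0, rows 3-5 to month 1, rows 6-8 to month 2.
toy = pd.DataFrame({'value': range(9)})

# Shuffling changes the row order but not the index labels.
shuffled = toy.sample(frac=1, random_state=0)

# Integer division on the preserved labels recovers each row's original block.
shuffled.insert(0, 'Month', shuffled.index // block)

# 'value' equals the original row position, so this checks every row.
assert (shuffled['Month'] == shuffled['value'] // block).all()
print(shuffled)
```

If the index had been reset after shuffling (for example with `reset_index(drop=True)`), the same division would assign months by shuffled position instead, which would be wrong.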

In [115]:
items = list(pd.read_csv('data/items.csv').item_category_id)

In [116]:
def item_to_category(item):
    return items[item]

In [117]:
train_X.insert(3, 'category', list(map(item_to_category, train_X.item_id)), True)
test_X .insert(3, 'category', list(map(item_to_category, test_X .item_id)), True)

In [123]:
# Make yet another model and increase its size as well.
# This'll be the last thing I do.
model3 = LGBMRegressor(min_child_samples = 100, n_estimators = 1000)
model3 = model3.fit(train_X.values, train_y.values.ravel())
preds = model3.predict(test_X.values).clip(0, 20)

In [126]:
writepreds(preds, 'final.csv')

File written.


And it's absolute trash, very near baseline. Oh well. Maybe I should have made a validation set at some point and checked whether the things I was doing actually helped :^)
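
For the record, a rough sketch of what that check could have looked like: hold out the most recent month and score on it locally before writing a submission file. Everything here is an assumption for illustration. The helper names `split_last_month` and `rmse` are made up, the toy data stands in for `train_X` and `train_y`, and RMSE is just an assumed metric.

```python
import numpy as np
import pandas as pd

def split_last_month(X, y, month_col='Month'):
    """Hold out the most recent month as a validation set."""
    is_last = X[month_col] == X[month_col].max()
    return X[~is_last], y[~is_last], X[is_last], y[is_last]

def rmse(y_true, y_pred):
    """Root mean squared error between two equally sized arrays."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

# Hypothetical toy data: 4 months with 5 rows each.
rng = np.random.default_rng(0)
toy_X = pd.DataFrame({
    'Month': np.repeat(np.arange(4), 5),
    'item_id': rng.integers(0, 10, 20),
})
toy_y = pd.Series(rng.integers(0, 20, 20).astype(float), name='sales')

fit_X, fit_y, val_X, val_y = split_last_month(toy_X, toy_y)

# Simplest possible baseline: predict the training mean for every held-out row.
baseline = np.full(len(val_y), fit_y.mean())
print('baseline RMSE:', rmse(val_y, baseline))
```

Any of the models above could then be fit on `fit_X`/`fit_y` and scored on `val_X`/`val_y` with the same `rmse` call, which would show whether a new feature beats the baseline before spending a submission on it.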

Now that I am sufficiently fed up with this, I'm looking through Kaggle kernels. It seems that a similar approach with XGBoost did very well, but they had so many more features figured out than I did. At least I got some exposure to a new model, I suppose, but it seems LightGBM has a lot more to offer than I used.

It's kind of disappointing and quite annoying that the best I got was my naive model from the first file, since by that point I hadn't even done too much. Sucks to be me, I guess.