# Unsold items test
Objectives:
- Implement code to go from transactions data to recommendations
- Test whether the 995 never-sold items are part of the validation data on the leaderboard

Remark:
- While I added comments later on for clarity, I don't intend to refactor the code to match the improvements I made in later experiments.
- I used the full dataset for this experiment, which made RAM a bigger constraint than it was in later experiments.

Method:
- Train LightGBM on the entire transaction dataset with random negative samples
- Have LightGBM rank the 995 never-sold items for every customer and return the top 12 per customer as a submission

Additional Remark:
- LightGBM has learned basically nothing about the items I made it rank, so for simplicity's sake I said during my final presentation that I assigned them at random.
- While the assignment is not actually random, it is definitely not expected behaviour for a ranker.
- I believe that random assignment would have given basically the same result.
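
For comparison, the random assignment mentioned in the remark could look roughly like the sketch below. This is not what the notebook ran: `random_top12`, the toy customer and article IDs, and the fixed seed are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def random_top12(customer_ids, candidate_article_ids, k=12, seed=0):
    """Assign k distinct random candidate articles to every customer (hypothetical baseline)."""
    rng = np.random.default_rng(seed)
    candidates = np.asarray(candidate_article_ids)
    rows = []
    for customer in customer_ids:
        # Sample without replacement so a customer never gets the same article twice.
        picks = rng.choice(candidates, size=k, replace=False)
        rows.append((customer, " ".join(str(a) for a in picks)))
    return pd.DataFrame(rows, columns=["customer_id", "prediction"])

# Toy stand-ins for the real customer IDs and the 995 unsold article IDs.
toy_customers = ["c1", "c2", "c3"]
toy_unsold = [f"0{100000000 + i}" for i in range(995)]
submission = random_top12(toy_customers, toy_unsold)
```

Submitting such a file alongside the LightGBM one would be one way to check the belief that both score about the same.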

In [2]:
import numpy as np 
import pandas as pd
import random
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn import preprocessing

In [3]:
import time

In [3]:
transactions = pd.read_csv('./data/transactions_train.csv')
articles = pd.read_csv('./data/articles.csv')
customers = pd.read_csv('./data/customers.csv')

In [4]:
transactions['purchased'] = 1

I used a label encoder to save space: Pandas treats the customer ID as a string, which takes up more space than an integer. I also used one on the articles for symmetry, but this made writing the submission files more complicated for no actual gain, so it was removed from later experiments.
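
The size difference can be illustrated on toy data. In this sketch, `np.unique(..., return_inverse=True)` stands in for the label encoder (an assumption made to keep the example dependency-light), and the 64-character IDs are made up rather than taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy customer IDs: 64-character hex strings, made up for this example.
ids = pd.Series([f"{i:064x}" for i in range(1000)])
# Repeat them to mimic a transactions table in which each customer appears many times.
customer_col = ids.sample(n=50_000, replace=True, random_state=0).reset_index(drop=True)

# Sorted unique values plus one integer code per row, much like a label encoder.
classes, codes = np.unique(customer_col.to_numpy(dtype=str), return_inverse=True)
encoded_col = pd.Series(codes.astype(np.int32))

string_bytes = customer_col.memory_usage(deep=True)
int_bytes = encoded_col.memory_usage(deep=True)
print(f"strings: {string_bytes:,} bytes, int32 codes: {int_bytes:,} bytes")

# The mapping is reversible, which is what writing a submission file relies on.
restored = classes[codes]
```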

In [5]:
customer_encoder = preprocessing.LabelEncoder()
article_encoder = preprocessing.LabelEncoder()

customer_encoder.fit(customers['customer_id'])
article_encoder.fit(articles['article_id'])

transactions['customer_id'] = customer_encoder.transform(transactions['customer_id'])
transactions['article_id'] = article_encoder.transform(transactions['article_id'])

del articles
del customers

The negative sampling code below is copied from the TA notebook on feature engineering. I changed it to reduce the number of temporary copies made during operations, in order to conserve RAM.

In [6]:
positive_pairs = list(map(tuple, transactions[['customer_id', 'article_id']].drop_duplicates().values))

In [7]:
real_dates = transactions["t_dat"].unique()
real_customers = transactions["customer_id"].unique()
real_articles = transactions["article_id"].unique()
real_channels = transactions["sales_channel_id"].unique()
article_and_price = transactions[["article_id","price"]].drop_duplicates("article_id").set_index("article_id").squeeze()

In [8]:
num_neg_pos = transactions.shape[0]

In [9]:
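# Seed NumPy's generator: the draws below use np.random, which random.seed() does not affect.
np.random.seed(int(time.time()))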
num_neg_samples = int(num_neg_pos * 1.1)

neg_dates = np.random.choice(real_dates, size=num_neg_samples)
neg_articles = np.random.choice(real_articles, size=num_neg_samples)
neg_customers = np.random.choice(real_customers, size=num_neg_samples)
neg_channels = np.random.choice(real_channels, size=num_neg_samples)
ordered = np.array([0] * num_neg_samples)

neg_prices = article_and_price[neg_articles].values

In order to free up RAM, I often used the Python del keyword in this version. In later versions I used function scopes instead, which accomplish much the same.

As can be seen below, I would occasionally run a cell containing del statements twice by accident, which throws a NameError. Using functions avoids this entirely.

In [56]:
del real_dates
del real_customers
del real_articles
del real_channels
del article_and_price
del num_neg_samples

NameError: name 'real_dates' is not defined
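
A minimal sketch of the function-scope approach, assuming a simplified sampler: `build_negative_samples` and the toy frame are hypothetical and only draw customer and article IDs, not the full set of columns used in this notebook.

```python
import numpy as np
import pandas as pd

def build_negative_samples(transactions, n_samples, seed=0):
    """Draw random (customer, article) pairs; all intermediates are local to this call."""
    rng = np.random.default_rng(seed)
    real_customers = transactions["customer_id"].unique()
    real_articles = transactions["article_id"].unique()
    negatives = pd.DataFrame({
        "customer_id": rng.choice(real_customers, size=n_samples),
        "article_id": rng.choice(real_articles, size=n_samples),
        "purchased": 0,
    })
    # real_customers and real_articles go out of scope on return, so no del is needed.
    return negatives

toy = pd.DataFrame({"customer_id": [0, 0, 1, 2], "article_id": [10, 11, 10, 12], "purchased": 1})
neg = build_negative_samples(toy, n_samples=6)
# Running the call a second time is harmless, unlike re-running a cell full of del statements.
neg = build_negative_samples(toy, n_samples=6)
```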

In [22]:
t_columns = transactions.columns

In [21]:
np_frame = np.column_stack((neg_dates, neg_customers, neg_articles, neg_prices, neg_channels, ordered))

In [23]:
neg_transactions = pd.DataFrame(np_frame, columns=t_columns)

In [55]:
del t_columns
del np_frame

In [16]:
duplicate_indexes = neg_transactions[["customer_id", "article_id"]].apply(tuple, 1).isin(positive_pairs)

In [24]:
neg_transactions = neg_transactions[~duplicate_indexes]

chosen_neg_transactions = neg_transactions.sample(num_neg_pos)
del neg_transactions

               t_dat customer_id article_id     price sales_channel_id  \
0         2020-02-28      306311      80236  0.030492                1   
1         2019-12-22      475992      76399  0.167797                2   
2         2019-06-22      653787      30294  0.022864                2   
3         2019-05-30      121804      10542  0.049136                2   
4         2018-10-14      396444      54582   0.06778                2   
...              ...         ...        ...       ...              ...   
34967151  2020-04-19      487036      65823  0.020051                1   
34967152  2020-05-10      119595      36033  0.011847                1   
34967153  2020-03-12      702870      37449  0.022017                1   
34967154  2020-03-26      252703      64582  0.013542                1   
34967155  2019-11-24     1367633      56919  0.020322                1   

         purchased  
0                0  
1                0  
2                0  
3                0  
4     
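
The duplicate filter above builds a tuple per row and checks it against a list of positive pairs. One alternative, not used in this notebook, is a left anti-join via `merge` with `indicator=True`. The helper name `drop_known_pairs` and the toy frames are assumptions, and the key columns need the same dtype in both frames for the join to match.

```python
import pandas as pd

def drop_known_pairs(candidates, positives, keys=("customer_id", "article_id")):
    """Remove candidate rows whose key pair also occurs in positives (a left anti-join)."""
    keys = list(keys)
    known = positives[keys].drop_duplicates()
    merged = candidates.merge(known, on=keys, how="left", indicator=True)
    # "left_only" marks pairs that never occur among the real purchases.
    return merged.loc[merged["_merge"] == "left_only"].drop(columns="_merge")

positives = pd.DataFrame({"customer_id": [0, 0, 1], "article_id": [10, 11, 10]})
candidates = pd.DataFrame({
    "customer_id": [0, 1, 2, 1],
    "article_id": [10, 12, 10, 10],
    "purchased": 0,
})
clean = drop_known_pairs(candidates, positives)
```

Here the pairs (0, 10) and (1, 10) are dropped because they occur in `positives`, while (1, 12) and (2, 10) survive.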

In [25]:
transactions = pd.concat([transactions, chosen_neg_transactions])

In [27]:
np.save('customer_ids.npy', customer_encoder.classes_)
np.save('article_ids.npy', article_encoder.classes_)

In [30]:
pip install pyarrow

Collecting pyarrow
Note: you may need to restart the kernel to use updated packages.

  Downloading pyarrow-10.0.0-cp39-cp39-win_amd64.whl (20.0 MB)
Installing collected packages: pyarrow
Successfully installed pyarrow-10.0.0


In [36]:
transactions.reset_index(inplace=True)

I converted the t_dat string into an integer counting the number of days since a fixed reference date (2018-09-01). This saved RAM, but:

- I really should have done it before adding the negative samples, as doing it afterwards meant converting twice as many strings. (This change was made in later experiments, but regenerating the negative samples on the full dataset simply took too long to rerun, so it was left as is.)
- Days are not actually useful for this task, as we want predictions per week. I ended up disregarding the entire column anyway.

In [47]:
import datetime
def str_dat_to_days_int(datestring):
    return (datetime.datetime.strptime(datestring, "%Y-%m-%d") - datetime.datetime(2018, 9, 1)).days

print(str_dat_to_days_int("2018-09-01"))

0


In [49]:
transactions["t_dat"] = transactions["t_dat"].map(str_dat_to_days_int)
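
The row-by-row `map` above could also be written as a vectorised conversion. The sketch below is an assumption about how that might look, not code from this notebook; `dates_to_days` is a hypothetical helper and the week index at the end is only a suggestion prompted by the per-week remark.

```python
import pandas as pd

REFERENCE_DATE = pd.Timestamp("2018-09-01")  # same reference date as str_dat_to_days_int

def dates_to_days(date_strings):
    """Convert a Series of 'YYYY-MM-DD' strings to whole days since REFERENCE_DATE."""
    parsed = pd.to_datetime(date_strings, format="%Y-%m-%d")
    return (parsed - REFERENCE_DATE).dt.days

t_dat = pd.Series(["2018-09-01", "2018-09-20", "2020-03-12"])
days = dates_to_days(t_dat)
# Integer division by 7 gives a week index, which is closer to a weekly prediction task.
weeks = days // 7
```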

In [40]:
transactions.drop("index", axis=1, inplace=True)

In [52]:
transactions = pd.get_dummies(transactions, columns=['sales_channel_id'])

In [50]:
print(transactions)

          t_dat customer_id article_id     price sales_channel_id purchased
0            19           2      40179  0.050831                2         1
1            19           2      10520  0.030492                2         1
2            19           7       6387  0.015237                2         1
3            19           7      46304  0.016932                2         1
4            19           7      46305  0.016932                2         1
...         ...         ...        ...       ...              ...       ...
63576643    713      631853      47686  0.025407                2         0
63576644    436     1289675      17418  0.050831                2         0
63576645    434     1176635       5672  0.015237                1         0
63576646    538     1189258       5751  0.020322                1         0
63576647    620      718504      40287  0.016932                1         0

[63576648 rows x 6 columns]


In [53]:
transactions.to_feather('./data/negativesampled.feather')

In [39]:
del num_neg_pos
del neg_dates
del neg_articles
del neg_customers
del neg_channels
del ordered
del neg_prices
del chosen_neg_transactions
del duplicate_indexes

This is the checkpoint that is also present in future experiments: from here on, the notebook can be resumed from the saved feather file.

In [None]:
articles = pd.read_csv('./data/articles.csv')

In [4]:
transactions = pd.read_feather("./data/negativesampled.feather")

In [5]:
customers = pd.read_csv('./data/customers.csv')

In [6]:
customers["age"] = customers["age"].fillna(25)

In [9]:
customer_encoder = preprocessing.LabelEncoder()
customer_encoder.classes_ = np.load("customer_ids.npy", allow_pickle=True)

In [47]:
article_encoder = preprocessing.LabelEncoder()
article_encoder.classes_ = np.load("article_ids.npy", allow_pickle=True)

In [10]:
customers['customer_id'] = customer_encoder.transform(customers['customer_id'])

In [11]:
zip_encoder = preprocessing.LabelEncoder()
customers["postal_code"] = zip_encoder.fit_transform(customers["postal_code"])

In [12]:
customers.head()

Unnamed: 0,customer_id,FN,Active,club_member_status,fashion_news_frequency,age,postal_code
0,0,,,ACTIVE,NONE,49.0,112978
1,1,,,ACTIVE,NONE,25.0,57312
2,2,,,ACTIVE,NONE,24.0,139156
3,3,,,ACTIVE,NONE,54.0,128529
4,4,1.0,1.0,ACTIVE,Regularly,52.0,52371


In [13]:
transactions = transactions.merge(customers[["customer_id", "age", "postal_code"]], how="inner", on='customer_id')

In [14]:
transactions.head()

Unnamed: 0,t_dat,customer_id,article_id,price,purchased,sales_channel_id_1,sales_channel_id_2,age,postal_code
0,19,2,40179,0.050831,1,0,1,24.0,139156
1,19,2,10520,0.030492,1,0,1,24.0,139156
2,23,2,40179,0.050831,1,0,1,24.0,139156
3,181,2,18197,0.013542,1,0,1,24.0,139156
4,520,2,59458,0.025407,1,0,1,24.0,139156


In [15]:
del customers

In [16]:
transactions["customer_id"] = transactions["customer_id"].astype(int)
transactions["article_id"] = transactions["article_id"].astype(int)
transactions["price"] = transactions["price"].astype(float)

In [217]:
X_train, X_test, y_train, y_test = train_test_split(transactions.drop(['purchased', "price", "t_dat", "sales_channel_id_1", "sales_channel_id_2"], axis=1), transactions['purchased'], test_size=0.10, random_state=42)

In [68]:
pip install lightgbm

Collecting lightgbm
Note: you may need to restart the kernel to use updated packages.

  Downloading lightgbm-3.3.3-py3-none-win_amd64.whl (1.0 MB)
Installing collected packages: lightgbm
Successfully installed lightgbm-3.3.3


In [218]:
# copying from https://github.com/microsoft/LightGBM/blob/master/examples/python-guide/simple_example.py
# combined with https://github.com/angelotc/LightGBM-binary-classification-example/blob/master/CCData.ipynb

import lightgbm as lgb
print('Starting training...')

gbm = lgb.LGBMClassifier(learning_rate = 0.1, metric = 'l1', 
                        n_estimators = 20)
gbm.fit(X_train, y_train,
        eval_set=[(X_test, y_test)],
        eval_metric=['auc', 'binary_logloss'],
        callbacks=[lgb.early_stopping(stopping_rounds=5)])

print('Saving model...')

Starting training...
Training until validation scores don't improve for 5 rounds
Did not meet early stopping. Best iteration is:
[20]	valid_0's auc: 0.642498	valid_0's binary_logloss: 0.659975	valid_0's l1: 0.474221
Saving model...


In [219]:
# save model to file
gbm.booster_.save_model('./data/model.txt')

<lightgbm.basic.Booster at 0x1f4b535dc40>

In [20]:
customers = pd.read_csv('./data/customers.csv')

In [21]:
articles = pd.read_csv("./data/articles.csv")

In [22]:
original_transactions = pd.read_csv("./data/transactions_train.csv")

In [33]:
unsold_article_ids = articles["article_id"][~articles["article_id"].isin(original_transactions["article_id"])]

In [54]:
customers["customer_id"] = customer_encoder.transform(customers["customer_id"])

In [56]:
customers["postal_code"] = zip_encoder.transform(customers["postal_code"])

In [57]:
customers["age"] = customers["age"].fillna(25)

In [58]:
print(customers.iloc[:10][["customer_id", "age", "postal_code"]])

   customer_id   age  postal_code
0            0  49.0       112978
1            1  25.0        57312
2            2  24.0       139156
3            3  54.0       128529
4            4  52.0        52371
5            5  25.0        61034
6            6  20.0       350802
7            7  32.0       194979
8            8  20.0        61034
9            9  20.0        61034


In [67]:
sales_channels = pd.DataFrame([[0, 1], [1, 0]], columns=["sales_channel_id_1", "sales_channel_id_2"])

In [68]:
sales_channels.head()

Unnamed: 0,sales_channel_id_1,sales_channel_id_2
0,0,1
1,1,0


In [61]:
unsold_article_id_labels = pd.DataFrame(article_encoder.transform(unsold_article_ids), columns=["article_id"])

In [62]:
unsold_article_id_labels

Unnamed: 0,article_id
0,159
1,718
2,2106
3,2708
4,3682
...,...
990,105529
991,105533
992,105535
993,105540


In [223]:
test_input = customers[["customer_id", "age", "postal_code"]].merge(unsold_article_id_labels, how="cross") # .merge(sales_channels, how="cross")

MemoryError: Unable to allocate 10.2 GiB for an array with shape (1365120100,) and data type int64

I lied in my presentation: I said I assigned articles to customers at random, but I actually did try to use LightGBM.

However, LightGBM had no information about the unsold articles at all, not even the information from the articles table, which I ended up not merging into the transactions table because of RAM limits. Its behaviour on these articles is therefore entirely unpredictable, so my assumption was that random assignment would have had a similar result (perhaps better or worse, depending on the roll of the dice).

Additionally, as seen above, I tried building a pandas dataframe with all 995 unsold items for all customers, so that I could use vectorised functions on all customers at once (which I did do in future notebooks, although with fewer items per customer). This didn't work because it required too much RAM, so I split the customers table into chunks of 100 customers each. This worked, but it takes quite a while to run.
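
The numbers in the MemoryError can be checked with some quick arithmetic. The sketch below only uses the array length from the error message, the 995 items, and the chunk size of 100 implied by the `// 100` grouping in the loop; the variable names are made up.

```python
# Back-of-the-envelope size of one int64 column in the full cross join.
n_items = 995
n_rows = 1_365_120_100           # array length reported in the MemoryError
n_customers = n_rows // n_items  # number of customers implied by that length
bytes_per_value = 8              # int64

full_gib = n_rows * bytes_per_value / 2**30
print(f"{n_customers:,} customers -> {full_gib:.1f} GiB per int64 column")

# The same column for a chunk of 100 customers is tiny by comparison.
chunk_customers = 100
chunk_mib = chunk_customers * n_items * bytes_per_value / 2**20
print(f"chunk of {chunk_customers} customers -> {chunk_mib:.2f} MiB per int64 column")
```

The full cross join needs that much memory for every column, which is why chunking helps even though it runs more slowly.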

In [262]:
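# Start the submission file with its header row.
with open("./data/test_output.csv", "w") as f:
    f.write("customer_id,prediction\n")

# Process the customers in chunks of 100 so that each cross join fits in RAM.
for index, row in customers.groupby(np.arange(len(customers)) // 100):
    # Pair every unsold article with every customer in this chunk.
    curr_row_in = unsold_article_id_labels.merge(row[["customer_id", "age", "postal_code"]], how="cross")
    # predict_proba returns one column per class; column 1 is the purchase probability.
    curr_row_probs = gbm.predict_proba(curr_row_in[["customer_id", "article_id", "age", "postal_code"]], num_iteration=gbm.best_iteration_)
    curr_row_in[["probability_0", "probability_1"]] = curr_row_probs
    # Map the encoded labels back to the original IDs.
    curr_row_in["customer_id"] = customer_encoder.inverse_transform(curr_row_in["customer_id"])
    curr_row_in["article_id"] = article_encoder.inverse_transform(curr_row_in["article_id"])
    # Keep the 12 highest-scoring articles per customer; the "0" prefix restores the leading zero dropped when the IDs were read as integers.
    ordering = curr_row_in.sort_values(by=["customer_id", "probability_1"], ascending=[True, False]).set_index("customer_id").groupby("customer_id")["article_id"].apply(lambda x : " ".join(["0" + str(i) for i in x[:12]]))
    # Append this chunk's predictions to the submission file.
    ordering.to_csv("./data/test_output.csv", mode="a", header=False)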
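
As a possible alternative to sorting every chunk, the scores of a chunk could be arranged as a (customers x items) matrix and the top 12 picked per row with NumPy. This is a sketch under assumptions, not the method used here: `top_k_per_customer` is hypothetical, the scores are random stand-ins for the model's probabilities, the article IDs are toy values, and the 10-character zero-padding assumes that each ID lost exactly one leading zero.

```python
import numpy as np

def top_k_per_customer(scores, article_ids, k=12):
    """scores: (n_customers, n_items) array; returns (n_customers, k) article IDs, best first."""
    # argpartition finds the k largest scores per row without fully sorting each row.
    top_unsorted = np.argpartition(-scores, k - 1, axis=1)[:, :k]
    top_scores = np.take_along_axis(scores, top_unsorted, axis=1)
    # Sort only those k entries by descending score.
    order = np.argsort(-top_scores, axis=1)
    top_sorted = np.take_along_axis(top_unsorted, order, axis=1)
    return np.asarray(article_ids)[top_sorted]

rng = np.random.default_rng(0)
article_ids = np.arange(663713001, 663713001 + 995)  # toy IDs, not the real unsold articles
scores = rng.random((100, 995))                      # stand-in for the purchase probabilities
top12 = top_k_per_customer(scores, article_ids)

# Zero-pad to 10 characters (assumed ID width) to build one prediction string per customer.
lines = [" ".join(str(a).zfill(10) for a in row) for row in top12]
```

This only works if the rows of the score matrix are laid out one customer per row, which would require reshaping the output of the cross join accordingly.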

In [252]:
f.close()

Everything below this point consists of quick experiments to figure out how to write everything in the output format.

In [107]:
test_input = customers.iloc[:100][["customer_id", "age", "postal_code"]].merge(pd.DataFrame(article_encoder.transform(articles["article_id"].iloc[:100]), columns=["article_id"]), how="cross").merge(sales_channels, how="cross")

In [None]:
test_input

In [None]:
predictions = gbm.predict_proba(test_input[["customer_id", "article_id", "sales_channel_id_1", "sales_channel_id_2", "age", "postal_code"]], num_iteration=gbm.best_iteration_)

In [None]:
output_test = test_input
output_test[["probability_0", "probability_1"]] = predictions

In [189]:
output_test

Unnamed: 0,customer_id,age,postal_code,article_id,sales_channel_id_1,sales_channel_id_2,probability_0,probability_1
0,0,49.0,112978,159,0,1,0.338363,0.661637
1,0,49.0,112978,159,1,0,0.369113,0.630887
2,0,49.0,112978,718,0,1,0.379727,0.620273
3,0,49.0,112978,718,1,0,0.503652,0.496348
4,0,49.0,112978,2106,0,1,0.379727,0.620273
...,...,...,...,...,...,...,...,...
198995,99,25.0,298089,105535,1,0,0.775617,0.224383
198996,99,25.0,298089,105540,0,1,0.504717,0.495283
198997,99,25.0,298089,105540,1,0,0.775617,0.224383
198998,99,25.0,298089,105541,0,1,0.504717,0.495283


In [190]:
output_test.sort_values(by=["customer_id", "probability_1"], ascending=[True, False]).groupby("customer_id").head(12)

Unnamed: 0,customer_id,age,postal_code,article_id,sales_channel_id_1,sales_channel_id_2,probability_0,probability_1
0,0,49.0,112978,159,0,1,0.338363,0.661637
120,0,49.0,112978,55062,0,1,0.363310,0.636690
122,0,49.0,112978,55516,0,1,0.363310,0.636690
124,0,49.0,112978,56747,0,1,0.363310,0.636690
126,0,49.0,112978,57792,0,1,0.363310,0.636690
...,...,...,...,...,...,...,...,...
197142,99,25.0,298089,59936,0,1,0.327937,0.672063
197144,99,25.0,298089,60249,0,1,0.327937,0.672063
197146,99,25.0,298089,60303,0,1,0.327937,0.672063
197182,99,25.0,298089,67411,0,1,0.342268,0.657732


In [215]:
output_test.sort_values(by=["customer_id", "probability_1"], ascending=[True, False]).set_index("customer_id").groupby("customer_id")["article_id"].apply(lambda x : " ".join([str(i) for i in x[:12]])).to_csv("./data/test_output.csv")

In [195]:
y_pred = gbm.predict(X_test, num_iteration=gbm.best_iteration_)
# eval
from sklearn.metrics import accuracy_score

print("acc:", accuracy_score(y_test, y_pred))
print("null:", max(y_test.mean(), 1 - y_test.mean()))

acc: 0.6388222720133886
null: 0.500201410423481


In [113]:
predictions

array([[0.30441123, 0.69558877],
       [0.30441123, 0.69558877],
       [0.30441123, 0.69558877],
       ...,
       [0.30441123, 0.69558877],
       [0.30441123, 0.69558877],
       [0.30441123, 0.69558877]])

In [99]:
gbm.predict(test_input.iloc[:, 0:6], num_iteration=gbm.best_iteration_)

array([1, 1, 1, ..., 1, 1, 1], dtype=int64)

In [114]:
original_transactions.head()

Unnamed: 0,t_dat,customer_id,article_id,price,sales_channel_id
0,2018-09-20,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,663713001,0.050831,2
1,2018-09-20,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,541518023,0.030492,2
2,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,505221004,0.015237,2
3,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,685687003,0.016932,2
4,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,685687004,0.016932,2


In [141]:
test = original_transactions.iloc[0].copy()

In [145]:
test

t_dat               2018-09-20
customer_id                  2
article_id               40179
price                 0.050831
sales_channel_id             2
Name: 0, dtype: object

In [143]:
test["customer_id"] = customer_encoder.transform([test["customer_id"]])[0]

In [144]:
test["article_id"] = article_encoder.transform([test["article_id"]])[0]

In [171]:
test["sales_channel_id_1"] = 0
test["sales_channel_id_2"] = 1

In [172]:
customers.loc[:, ["customer_id", "age", "postal_code"]].merge(pd.DataFrame([test]), how="inner", on=["customer_id"])

Unnamed: 0,customer_id,age,postal_code,t_dat,article_id,price,sales_channel_id,sales_channel_id_1,sales_channel_id_2
0,2,24.0,139156,2018-09-20,40179,0.050831,2,0,1


In [176]:
gbm.predict_proba((customers.loc[:, ["customer_id", "age", "postal_code"]].merge(pd.DataFrame([test]), how="inner", on=["customer_id"]))[["customer_id", "article_id", "sales_channel_id_1", "sales_channel_id_2", "age", "postal_code"]], num_iteration=gbm.best_iteration_)

array([[0.4121713, 0.5878287]])

In [167]:
gbm.predict_proba(X_test, num_iteration=gbm.best_iteration_)

array([[0.44837599, 0.55162401],
       [0.55952529, 0.44047471],
       [0.67486458, 0.32513542],
       ...,
       [0.41193002, 0.58806998],
       [0.32793732, 0.67206268],
       [0.60403309, 0.39596691]])

In [168]:
X_test

Unnamed: 0,customer_id,article_id,sales_channel_id_1,sales_channel_id_2,age,postal_code
23488218,301642,10290,0,1,27.0,106051
10759766,152611,49595,1,0,50.0,115578
38409516,1186968,99711,0,1,29.0,61034
24628014,190384,6011,1,0,27.0,147592
46965121,1040473,67735,0,1,54.0,230454
...,...,...,...,...,...,...
30568173,843573,58880,1,0,47.0,164138
34746913,1122520,72978,1,0,27.0,125945
46100043,877043,40342,0,1,55.0,24205
5582956,1107626,58961,0,1,24.0,133015


In [175]:
gbm.predict_proba([X_test[["customer_id", "age", "postal_code", "article_id", "sales_channel_id_1", "sales_channel_id_2"]].iloc[0]], num_iteration=gbm.best_iteration_)

array([[0.30441123, 0.69558877]])