# Kaggle Submission
In this notebook, we generate candidates, rank them with LightGBM, and create a submission file that is ready to upload to Kaggle.

## Summary
- **Candidate generation**
  - Last purchased items for each user
  - The 12 bestsellers of the previous week for each user
  - The 20 most popular items of the previous week for each user (this differs slightly from the bestsellers because popularity is calculated over all preceding weeks)
- **Features**
  - All numerical features from `customers`, `articles` and `transactions_train`
  - `last_week_popularity_rank`
  - `similarity`
- **Result**
  | Private Score | Public Score  |
  |:-------------:|:-------------:|
  | 0.02024       | 0.02066       |

###### Note: The code may generate more features and candidates than the ranker actually uses. This keeps things simple: the same notebook can be reused for several experiments by changing only which features and candidates are passed to the ranker.
---

In [3]:
import pandas as pd

import sys
sys.path.append('../../final')
from helpers.utils import DATA_PATH, customer_hex_id_to_int
from candidates import (
    generate_last_purchased_candidates,
    generate_bestseller_candidates,
    generate_similar_candidates,
    generate_popularity_candidates,
)

# This file builds on the code in https://github.com/radekosmulski/personalized_fashion_recs

In [4]:
transactions = pd.read_parquet(f'{DATA_PATH}/transactions_train.parquet')
customers = pd.read_parquet(f'{DATA_PATH}/customers.parquet')
articles = pd.read_parquet(f'{DATA_PATH}/articles.parquet')

In [5]:
test_week = transactions.week.max() + 1
transactions = transactions[transactions.week > transactions.week.max() - 10]

# Generating candidates

### Last purchase candidates

In [6]:
candidates_last_purchase = generate_last_purchased_candidates(transactions, test_week)
candidates_last_purchase.head()

Unnamed: 0,t_dat,customer_id,article_id,price,sales_channel_id,week
29030503,2020-07-15,272412481300040,778064028,0.008458,1,96
29030504,2020-07-15,272412481300040,816592008,0.016932,1,96
29030505,2020-07-15,272412481300040,621381021,0.033881,1,96
29030506,2020-07-15,272412481300040,817477003,0.025407,1,96
29030507,2020-07-15,272412481300040,899088002,0.025407,1,96
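
The implementation of `generate_last_purchased_candidates` lives in the `candidates` module and is not shown here. As a rough sketch of the idea, assuming it carries each customer's purchases forward to that customer's next active week (with the most recent purchases mapped to the test week), it could look like the following. The function name `last_purchase_candidates` and the toy data are hypothetical.

```python
import pandas as pd

def last_purchase_candidates(transactions: pd.DataFrame, test_week: int) -> pd.DataFrame:
    # Hypothetical sketch, not the actual helper from `candidates`.
    # Weeks in which each customer bought something, in ascending order.
    weeks = (
        transactions[["customer_id", "week"]]
        .drop_duplicates()
        .sort_values(["customer_id", "week"])
    )
    # Next active week per customer; the last active week maps to the test week.
    weeks["next_week"] = (
        weeks.groupby("customer_id")["week"].shift(-1).fillna(test_week).astype(int)
    )
    candidates = transactions.merge(weeks, on=["customer_id", "week"])
    # Relabel each purchase with the week for which it serves as a candidate.
    candidates["week"] = candidates["next_week"]
    return candidates.drop(columns="next_week")

toy = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "article_id": [10, 11, 12],
    "week": [95, 97, 96],
})
result = last_purchase_candidates(toy, test_week=98)
print(result)
```

In the toy data, customer 1's week-95 purchase becomes a candidate for week 97, while the latest purchases of both customers become candidates for the test week.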


### Bestseller candidates

In [7]:
candidates_bestsellers, bestsellers_previous_week = generate_bestseller_candidates(transactions, test_week, n=12)
candidates_bestsellers.head()

Unnamed: 0,t_dat,customer_id,sales_channel_id,week,article_id,price
0,2020-07-22,200292573348128,2,96,760084003,0.025094
1,2020-07-22,200292573348128,2,96,866731001,0.024919
2,2020-07-22,200292573348128,2,96,600886001,0.02298
3,2020-07-22,200292573348128,2,96,706016001,0.033197
4,2020-07-22,200292573348128,2,96,372860002,0.013193
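
The `bestseller_rank` column used later comes from `bestsellers_previous_week`. The helper itself is not shown, so the sketch below only illustrates one plausible way to build such a table: count sales per week and article, rank them within each week, keep the top `n`, and shift the week by one so that each week is described by the bestsellers of the week before. The function name `weekly_bestsellers` is hypothetical.

```python
import pandas as pd

def weekly_bestsellers(transactions: pd.DataFrame, n: int = 12) -> pd.DataFrame:
    # Hypothetical sketch, not the actual helper from `candidates`.
    sales = (
        transactions.groupby(["week", "article_id"])
        .size()
        .rename("sales")
        .reset_index()
    )
    # Rank 1 is the best-selling article of the week; ties share a rank.
    sales["bestseller_rank"] = (
        sales.groupby("week")["sales"]
        .rank(method="dense", ascending=False)
        .astype(int)
    )
    top = sales[sales["bestseller_rank"] <= n].copy()
    # Shift by one week: week w is described by the bestsellers of week w - 1.
    top["week"] += 1
    return top.sort_values(["week", "bestseller_rank"]).reset_index(drop=True)

toy = pd.DataFrame({
    "week": [95, 95, 95, 95, 96, 96, 96],
    "article_id": [1, 1, 1, 2, 2, 2, 1],
})
top = weekly_bestsellers(toy, n=1)
print(top)
```

Because dense ranking lets tied articles share a rank, this sketch can return more than `n` rows per week when sales counts tie.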


### K most similar items

In [8]:
candidates_similar_items, similar_items = generate_similar_candidates(transactions, test_week, n=12)
candidates_similar_items.head()

2022-12-22 16:35:04,361 - base - recpack - INFO - Fitting TARSItemKNN complete - Took 9.22s


Unnamed: 0,customer_id,article_id,t_dat,sales_channel_id,week,price
0,272412481300040,926500001,2020-07-15,1,105,0.059305
1,272412481300040,921906002,2020-07-15,1,105,0.033881
2,272412481300040,921906001,2020-07-15,1,105,0.033881
3,272412481300040,896169005,2020-07-15,1,105,0.050831
4,272412481300040,788647002,2020-07-15,1,105,0.013542


### Time-weighted popularity candidates

In [9]:
candidates_most_popular, popular_articles_previous_week = generate_popularity_candidates(transactions, test_week, n=20)
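
How `generate_popularity_candidates` weights time is not visible in this notebook. The sketch below is therefore only an assumption about what "time-weighted popularity" could mean: every preceding week contributes its purchase count, divided by how many weeks ago it was. The function name `time_weighted_popularity` and the `1 / age` decay are both hypothetical.

```python
import pandas as pd

def time_weighted_popularity(transactions: pd.DataFrame, target_week: int) -> pd.Series:
    # Hypothetical sketch: the decay function is an assumption, not the notebook's helper.
    counts = (
        transactions.groupby(["week", "article_id"])
        .size()
        .rename("count")
        .reset_index()
    )
    # Only weeks strictly before the target week may contribute.
    past = counts[counts["week"] < target_week].copy()
    # Older weeks count less: last week has weight 1, two weeks ago 1/2, and so on.
    past["weighted"] = past["count"] / (target_week - past["week"])
    return past.groupby("article_id")["weighted"].sum().sort_values(ascending=False)

toy = pd.DataFrame({
    "week": [94] * 4 + [95] * 3,
    "article_id": [1] * 4 + [2] * 3,
})
popularity = time_weighted_popularity(toy, target_week=96)
print(popularity)
```

In the toy data, article 2 outranks article 1 even though it sold fewer units overall, because its sales are more recent.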

# Combining transactions and candidates / negative examples

In [10]:
transactions['purchased'] = 1

In [None]:
data = pd.concat([transactions, candidates_last_purchase, candidates_bestsellers, candidates_most_popular])
# Candidate rows carry no label yet; mark them as negative examples.
data['purchased'] = data['purchased'].fillna(0)

In [34]:
print(f'Percentage of positive samples: {data.purchased.mean():.2%}')

Percentage of positive samples: 7.66%


In [12]:
data.drop_duplicates(['customer_id', 'article_id', 'week'], inplace=True)
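
The order of the concatenation matters for this deduplication step. `drop_duplicates` keeps the first occurrence by default, and the real transactions come first, so a pair that is both a purchase and a candidate keeps its positive label. A minimal illustration on made-up data:

```python
import pandas as pd

# Real purchases, labelled as positives.
transactions = pd.DataFrame({
    "customer_id": [1, 1],
    "article_id": [10, 11],
    "week": [96, 96],
})
transactions["purchased"] = 1

# Candidates: article 11 was actually bought, article 12 was not.
candidates = pd.DataFrame({
    "customer_id": [1, 1],
    "article_id": [11, 12],
    "week": [96, 96],
})

data = pd.concat([transactions, candidates], ignore_index=True)
data["purchased"] = data["purchased"].fillna(0)  # unlabelled candidates become negatives
data = data.drop_duplicates(["customer_id", "article_id", "week"])
print(data)
```

Only article 12 ends up as a negative example; the duplicate candidate row for article 11 is dropped in favour of the purchase.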

### Add bestseller information

In [13]:
data = pd.merge(
    data,
    bestsellers_previous_week[['week', 'article_id', 'bestseller_rank']],
    on=['week', 'article_id'],
    how='left'
)

In [14]:
data = data[data.week != data.week.min()]

### Add item similarity information

In [15]:
data = pd.merge(
    data, 
    similar_items[['customer_id', 'article_id', 'similarity']], 
    on=['customer_id', 'article_id'], 
    how='left'
)

### Add item popularity information

In [16]:
data = pd.merge(
    data,
    popular_articles_previous_week[[
        'week', 
        'article_id', 
        'weekly_purchase_count', 
        'weekly_popularity', 
        'last_week_popularity_rank'
    ]],
    on=['week', 'article_id'],
    how='left'
)

In [17]:
data = pd.merge(data, articles, on='article_id', how='left')
data = pd.merge(data, customers, on='customer_id', how='left')

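# Articles with no recorded popularity get zero counts.
data['weekly_purchase_count'] = data['weekly_purchase_count'].fillna(0)
data['weekly_popularity'] = data['weekly_popularity'].fillna(0)
# Caution: missing similarities are filled with the label itself, which may leak the target into this feature.
data['similarity'] = data['similarity'].fillna(data['purchased'])
# Articles without a rank get one position past the worst observed rank.
data['bestseller_rank'] = data['bestseller_rank'].fillna(data.bestseller_rank.max() + 1)
data['last_week_popularity_rank'] = data['last_week_popularity_rank'].fillna(data.last_week_popularity_rank.max() + 1)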
data.head()

Unnamed: 0,t_dat,customer_id,article_id,price,sales_channel_id,week,purchased,bestseller_rank,similarity,weekly_purchase_count,...,section_name,garment_group_no,garment_group_name,detail_desc,FN,Active,club_member_status,fashion_news_frequency,age,postal_code
0,2020-07-22,200292573348128,880777001,0.025407,2,96,1.0,13.0,1.0,0.0,...,27,1018,12,9213,1,1,0,1,25,63947
1,2020-07-22,200292573348128,784332002,0.025407,2,96,1.0,13.0,1.0,0.0,...,12,1005,0,7303,1,1,0,1,25,63947
2,2020-07-22,200292573348128,827968001,0.016932,2,96,1.0,10.0,1.0,374.0,...,30,1002,2,1227,1,1,0,1,25,63947
3,2020-07-22,200292573348128,599580086,0.011847,2,96,1.0,13.0,1.0,0.0,...,22,1018,12,52,1,1,0,1,25,63947
4,2020-07-22,248294615847351,720504008,0.031458,1,96,1.0,13.0,1.0,0.0,...,46,1016,11,95,-1,-1,0,0,46,8666
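
The left merges above keep every row of `data` and leave the rank columns empty for articles that are missing from the lookup tables. Filling those gaps with the maximum rank plus one places unranked articles just behind the worst ranked one, which is consistent with the `13.0` values in the `bestseller_rank` column of the output. A small sketch on made-up data:

```python
import pandas as pd

data = pd.DataFrame({"week": [96, 96, 96], "article_id": [1, 2, 3]})
bestsellers = pd.DataFrame({
    "week": [96, 96],
    "article_id": [1, 2],
    "bestseller_rank": [1, 2],
})

# A left merge keeps article 3 even though it has no rank.
data = data.merge(bestsellers, on=["week", "article_id"], how="left")
# Unranked articles are placed one position after the worst known rank.
data["bestseller_rank"] = data["bestseller_rank"].fillna(data["bestseller_rank"].max() + 1)
print(data)
```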


### Preprocessing

In [18]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
columns_to_scale = [
    'product_type_no',
    'graphical_appearance_no',
    'colour_group_code',
    'perceived_colour_value_id',
    'perceived_colour_master_id',
    'department_no',
    'index_code',
    'index_group_no',
    'section_no',
    'garment_group_no',
    'FN',
    'Active',
    'club_member_status',
    'fashion_news_frequency',
    'age',
    'postal_code',
    'last_week_popularity_rank',
    'similarity'
]
data[columns_to_scale] = scaler.fit_transform(data[columns_to_scale])

In [19]:
data.sort_values(['week', 'customer_id'], inplace=True)
data.reset_index(drop=True, inplace=True)

In [20]:
train = data[data.week != test_week]
test = data[data.week==test_week].drop_duplicates(['customer_id', 'article_id', 'sales_channel_id']).copy()

In [21]:
train_baskets = train.groupby(['week', 'customer_id'])['article_id'].count().values
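
`train_baskets` is what the ranker later receives as its `group` argument: the sizes of consecutive query groups, here one group per customer and week. This only works because `data` was sorted by `week` and `customer_id` beforehand, so the rows of each group are contiguous and appear in the same order as the group sizes. A toy illustration:

```python
import pandas as pd

train = pd.DataFrame({
    "week": [96, 95, 95, 96, 95, 96],
    "customer_id": [1, 1, 2, 1, 1, 1],
    "article_id": [30, 10, 20, 31, 11, 32],
})

# Sort so that every (week, customer_id) group forms one contiguous block of rows.
train = train.sort_values(["week", "customer_id"]).reset_index(drop=True)

# One entry per group, in the same order as the sorted rows.
baskets = train.groupby(["week", "customer_id"])["article_id"].count().values
print(baskets)
```

The group sizes must add up to the number of training rows; a mismatch usually means the frame was filtered or reordered after the sizes were computed.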

In [22]:
columns_to_use = [
    'article_id',
    'product_type_no',
    'graphical_appearance_no',
    'colour_group_code',
    'perceived_colour_value_id',
    'perceived_colour_master_id',
    'department_no',
    'index_code',
    'index_group_no',
    'section_no',
    'garment_group_no',
    'FN',
    'Active',
    'club_member_status',
    'fashion_news_frequency',
    'age',
    'postal_code',
    'last_week_popularity_rank',
    'similarity'
]

In [23]:
train_X = train[columns_to_use]
train_y = train['purchased']

test_X = test[columns_to_use]

In [24]:
train_y.head()

0    1.0
1    0.0
2    0.0
3    0.0
4    0.0
Name: purchased, dtype: float64

# Model training

In [25]:
from lightgbm.sklearn import LGBMRanker

In [26]:
ranker = LGBMRanker(
    objective="lambdarank",
    num_leaves=200,
    metric="ndcg",
    boosting_type="dart",
    n_estimators=100,
    importance_type='gain',
    verbose=10,
)

In [27]:
ranker = ranker.fit(
    train_X,
    train_y,
    group=train_baskets,
)

[LightGBM] [Debug] Dataset::GetMultiBinFromSparseFeatures: sparse rate 0.858340
[LightGBM] [Debug] Dataset::GetMultiBinFromAllFeatures: sparse rate 0.135527
[LightGBM] [Debug] init for col-wise cost 0.111765 seconds, init for row-wise cost 0.314926 seconds
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Debug] Using Dense Multi-Val Bin
[LightGBM] [Info] Total Bins 1386
[LightGBM] [Info] Number of data points in the train set: 18461334, number of used features: 19
[LightGBM] [Debug] Trained a tree with leaves = 200 and depth = 21
[LightGBM] [Debug] Trained a tree with leaves = 200 and depth = 18
[LightGBM] [Debug] Trained a tree with leaves = 200 and depth = 21
[LightGBM] [Debug] Trained a tree with leaves = 200 and depth = 20
[LightGBM] [Debug] Trained a tree with leaves = 200 and depth = 22
[LightGBM] [Debug] Trained a tree with leaves = 200 and depth = 25
[LightGBM] [Debug] Trained a tree with leave

In [28]:
print('Feature importances:')
for i in ranker.feature_importances_.argsort()[::-1]:
    print(f'\t{columns_to_use[i]}  {ranker.feature_importances_[i] / ranker.feature_importances_.sum()}')

Feature importances:
	similarity  0.9956885470840093
	last_week_popularity_rank  0.0014074076823812468
	article_id  0.0006078033646573793
	postal_code  0.0004136308741024179
	department_no  0.000336097936006107
	age  0.0003112486857683356
	product_type_no  0.0003040500024764121
	colour_group_code  0.00021819328961787034
	section_no  0.00013504223944076397
	graphical_appearance_no  0.00013382754213479297
	garment_group_no  0.00012363185538199855
	perceived_colour_master_id  0.0001226440128191681
	perceived_colour_value_id  8.477700585460518e-05
	index_code  5.6492657031536315e-05
	index_group_no  1.852478167864285e-05
	FN  1.7057408761047172e-05
	Active  1.3783973389518211e-05
	fashion_news_frequency  6.323782050073564e-06
	club_member_status  9.158224387527501e-07


# Calculate predictions

In [29]:
test['preds'] = ranker.predict(test_X)

c_id2predicted_article_ids = test \
    .sort_values(['customer_id', 'preds'], ascending=False) \
    .groupby('customer_id')['article_id'].apply(list).to_dict()

bestsellers_last_week = \
    bestsellers_previous_week[bestsellers_previous_week.week == bestsellers_previous_week.week.max()]['article_id'].tolist()
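
The sort-then-group pattern above can be checked on a few made-up rows. Sorting by `preds` in descending order before grouping means each customer's list starts with the highest-scored article, because `groupby` preserves the row order within each group.

```python
import pandas as pd

test = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "article_id": [10, 11, 20],
    "preds": [0.1, 0.9, 0.5],
})

# Highest score first within each customer, then collect the article ids per customer.
ranked = (
    test.sort_values(["customer_id", "preds"], ascending=False)
    .groupby("customer_id")["article_id"]
    .apply(list)
    .to_dict()
)
print(ranked)
```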

# Create submission

In [30]:
sub = pd.read_csv(f'{DATA_PATH}/sample_submission.csv')

In [31]:
preds = []
for c_id in customer_hex_id_to_int(sub.customer_id):
    pred = c_id2predicted_article_ids.get(c_id, [])
    pred = pred + bestsellers_last_week
    preds.append(pred[:12])

In [32]:
preds = [' '.join(['0' + str(p) for p in ps]) for ps in preds]
sub.prediction = preds
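
The padding and formatting steps can be sketched as one small function. Note that this is a variant, not the code used above: it also removes duplicates before truncating, which the notebook does not do, so an article that is both predicted and a bestseller takes up only one of the 12 slots. The name `format_prediction` is hypothetical.

```python
def format_prediction(ranked, fallback, k=12):
    # Append the fallback articles, then drop duplicates while keeping the order.
    merged = list(dict.fromkeys(list(ranked) + list(fallback)))
    # Article ids are stored as integers; pad them back to 10 digits with a leading zero.
    return " ".join(str(article).zfill(10) for article in merged[:k])

line = format_prediction(
    ranked=[778064028, 816592008],
    fallback=[816592008, 760084003],
    k=3,
)
print(line)
```

For the 9-digit ids in this dataset, `str(article).zfill(10)` gives the same result as `'0' + str(article)`.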

In [33]:
sub_name = 'submission_KNN_candidates_similarity_feature_no_candidates'
sub.to_csv(f'{DATA_PATH}/subs/{sub_name}.csv.gz', index=False)