This code is based on [https://github.com/radekosmulski/personalized_fashion_recs](https://github.com/radekosmulski/personalized_fashion_recs).
Comments explaining the original notebook code were added by me and Arno Troch.

Original notebook Kaggle score: 0.02087
Final score (with parameters currently in the first code cell): 0.02161

Run final_preprocess.ipynb before running this notebook.

Most code for my feature generation is in helper_feature_generation.py. Ranking copied from original notebook has been moved to helper_ranking.py.

Original notebook summary:
- group transaction data by week
- use last x weeks as training data
- candidate and negative sample generation are identical
    - two methods:
        - 12 Bestselling items of last week
        - Items bought by the user in the last week when he made a purchase
    - if one of the negative samples already occurred as a sample (positive or negative), remove negatives until one sample is left
- main feature: bestseller_rank: How well did this item sell in the week before this purchase, if it was in top 12?
- LightGBM ranker

Additions (throughout the entire course):
- (optional) Different preprocessing:  make more assumptions about data, like users without age being the average age
- (optional) use last x weeks of this and previous year as training data
- Calculate bestseller_rank beyond 12
- Feature: If a transaction occurred multiple times as positive and/or negative sample, keep track of how often it occurred
- Feature: for each sample, check how often the customer bought a product with the same feature(s) (colour_group_code, garment_type_no, ...).
    - Multiple features can be added to the training data, based on different article features (e.g. training data gets a feature counting how often the same colour was purchased, and one for how often the garment type was purchased)
    - A combination of article features can be used for one training data feature (e.g. how often did this user buy articles with the same colour and garment type)
    - These features can be calculated based on the entire dataset rather than the training set
        - If a transaction occurs in week x, this only takes purchases from before week x into account
    - Additionally, new features can be added to training data by ranking article features based on the feature described above
        - Example: if a user's favourite colour is blue, then a transaction of this user buying a blue item will have value 1
        - this feature can also be generated for each article feature, and can also use combinations of article features.
- More candidate and negative sample generations: Each week, check the user's favourite value of an article feature (e.g. colour), and pick the article with that value that was bestselling last week
    - Based on the ranking feature described above: candidates/samples can be added for each article feature, and/or combinations of article features.
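
The count and rank features described above can be sketched on made-up data. This is only an illustration: the frame `purchases` and its `colour` column are hypothetical, the restriction to purchases before week x is left out, and the actual implementation in helper_feature_generation.py may differ.

```python
import pandas as pd

# Toy purchase history (hypothetical data): one row per purchase
purchases = pd.DataFrame({
    "customer_id": [1, 1, 1, 1, 2, 2],
    "colour":      ["blue", "blue", "red", "blue", "red", "green"],
})

# Count feature: how often did each customer buy each colour?
counts = (
    purchases.groupby(["customer_id", "colour"])
    .size()
    .reset_index(name="amount_of_colour")
)

# Rank feature: 1 = favourite colour of that customer (tied colours share a rank)
counts["amount_of_colour_rank"] = (
    counts.groupby("customer_id")["amount_of_colour"]
    .rank(method="dense", ascending=False)
)
print(counts)
```

Customer 1 bought blue three times, so blue gets rank 1 and red rank 2. Customer 2 bought red and green once each, so both share rank 1.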

Additions since lecture 6:
- Added options to make new features on only rank or count instead of both
- Added simplified_colour_group_name to final_preprocessing.ipynb (e.g. Light Red -> Red, Other Turquoise -> Turquoise)
- Graphs (generated in graphs.py rather than this notebook)

This notebook can need up to 32GB of RAM. If that's too much, go to the first code cell and lower transactionBackXWeeks or put fewer things in article_features and columns_to_use.

### Set global variables

In [1]:
# Lecture 6
# TODO: select better features
# Generate new features based on these article_id columns
# For each of these columns, two features will be generated:
# For each transaction (both actual transactions and negative samples), count how often a user has already bought an item with the same value for this column.
# Additionally, what rank does the feature have? e.g. If the user has bought more blue items than any other colour, the rank will be one. If his next favourite is red, transactions with a red item will have value 2.
article_features = ["product_type_no"]

# Add features based on user purchase history, based on article_features
# If True:
# how often did the user buy items where article_features[x] matches this transaction?
use_count = True
# the value of article_features[x] is the y-th favourite value of this article feature for this customer.
use_rank = True

# If True, then for each possible pair that can be made with elements from article_features, two new features will be made (as explained at article values: count and rank)
# Example: if article_features contains colour and garment type, then a transaction for blue trousers will have new features with values that say how often the user already bought any blue trousers,
# and how it ranks on his list of favourite clothing type/colour combinations.
do_combinations_of_features = False

#Lecture 6
# Should candidates be added based on user purchase history? If True, then if a user likes blue, the most popular blue item of last week will be added as negative sample
add_history_candidates = False
LGBMBoostingType = 'dart'
preprocess = '-1'  # '-1' uses preprocessing from original notebook, 'edited' uses slightly different preprocessing. You should probably use '-1'
# If a negative sample did not appear in a bestseller list, this is what the NaN is filled with. Normal values are between 1-12. If None, will use actual bestseller rank even beyond 12.
# (Bestseller is a rank of how well an item sold in a certain week, with 1 meaning it was the most sold item)
# TODO: I assumed setting bestsellerFiller to None would get better results, but it makes them much worse. Is this an implementation error?
bestsellerFiller = 999
transactionBackXWeeks = 10  # Size of training+test sets: this many weeks before test set
# Lecture 6
# TODO: featuresBackXWeeks does not actually work, features are always calculated on the entire dataset.
featuresBackXWeeks = 999  # How much data to use to calculate features based on user history. If set to more weeks than available in dataset, uses entire dataset
prevYear = ''  # if "SkipYear": uses training data as explained in transactionBackXWeeks + the same weeks of the previous year. Not recommended.
assert LGBMBoostingType in ['gbdt','dart','goss','rf']
assert preprocess in ['-1','edited']
assert prevYear in ["","SkipYear"]

# Columns to use for training
# Useful because including garment_type_name and garment_type_no would be redundant
# Later on in the code, additional columns will be added if new features are generated
columns_to_use = ['article_id', 'product_type_no', 'graphical_appearance_no', 'colour_group_code', 'perceived_colour_value_id',
'perceived_colour_master_id', 'department_no', 'index_code',
'index_group_no', 'section_no', 'garment_group_no', 'FN', 'Active',
'club_member_status', 'fashion_news_frequency', 'age', 'postal_code', 'bestseller_rank','importance',"simplified_colour_group_name"]

### Imports

In [2]:
%run helper_functions.ipynb



In [3]:
import pandas as pd
from helper_feature_generation import get_purchase_rank_df_of_attributes, get_purchase_count_df_of_attributes, add_features_to_data
import itertools
import time


### Load preprocessed data

The only preprocessing done up to this point is filling missing values and optimizing data size (e.g. customer id from string to int)
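
As an illustration of the size optimisation mentioned above, the sketch below converts a hexadecimal string id to an unsigned 64-bit integer by keeping its last 16 hex digits. This is an assumption about how such a conversion could work, not a copy of the preprocessing notebook; the helper `hex_id_to_uint64` and the example ids are hypothetical. Truncating an id can in principle cause collisions, which the sketch does not check.

```python
import pandas as pd

def hex_id_to_uint64(hex_id: str) -> int:
    # Keep the last 16 hex digits (64 bits) so the value fits in uint64
    return int(hex_id[-16:], 16)

# Made-up 64-character hex ids
ids = pd.Series(["0" * 62 + "ff", "f" * 64])
as_int = pd.Series([hex_id_to_uint64(x) for x in ids], dtype="uint64")

# Integers take 8 bytes per row, the strings considerably more
print(ids.memory_usage(deep=True), as_int.memory_usage(deep=True))
```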

In [4]:
transactions = pd.read_parquet(f'../../data/transactions_train_{preprocess}.parquet')
transactions=transactions.drop(columns="t_dat")
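# Backup is made because some features use the full dataset for calculations
# Note: the filter two lines below restricts this copy to the last transactionBackXWeeks weeks as well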
transactions_full = pd.read_parquet(f'../../data/transactions_train_{preprocess}.parquet')
transactions_full=transactions_full.drop(columns="t_dat")
transactions_full = transactions_full[transactions_full.week > transactions_full.week.max() - transactionBackXWeeks]
customers = pd.read_parquet(f'../../data/customers_{preprocess}.parquet')
articles = pd.read_parquet(f'../../data/articles_{preprocess}.parquet')

### Further process data

##### mean price per item per week

In [5]:
# mean price PER ITEM PER WEEK
mean_price = transactions \
    .groupby(['week', 'article_id'])['price'].mean()
mean_price.reset_index().head()

Unnamed: 0,week,article_id,price
0,0,108775015,0.008373
1,0,108775044,0.008374
2,0,108775051,0.005023
3,0,110065001,0.024983
4,0,110065002,0.02465


##### select training data from transactions

In [6]:
# 1 week for testing is most suitable because the competition requires a model to be good at predicting just 1 week
test_week = transactions.week.max() + 1
# Unless you really want to test training on transactionBackXWeeks and transactionBackXWeeks of last year, just read the else
if prevYear == 'SkipYear':
    # Starting from final week in dataset, select past transactionBackXWeeks weeks
    transactions3 = transactions[transactions.week > transactions.week.max() - transactionBackXWeeks]
    # Starting from final week in dataset but a year earlier, select past transactionBackXWeeks weeks
    transactions2 = transactions[(transactions.week.max()-52>=transactions.week) & (transactions.week >transactions.week.max() - transactionBackXWeeks-52)] # EDITED
    print(transactions3['week'].unique())
    print(transactions2['week'].unique())
    # training data now consists of transactionBackXWeeks of current year and last year
    transactions = pd.concat([transactions3,transactions2])
else:
    # Starting from final week in dataset, select past transactionBackXWeeks weeks as training data
    transactions = transactions[transactions.week > transactions.week.max() - transactionBackXWeeks]

min_week = transactions["week"].min()

In [7]:
transactions.head()

Unnamed: 0,customer_id,article_id,price,sales_channel_id,week
29030503,272412481300040,778064028,0.008458,1,95
29030504,272412481300040,816592008,0.016932,1,95
29030505,272412481300040,621381021,0.033881,1,95
29030506,272412481300040,817477003,0.025407,1,95
29030507,272412481300040,899088002,0.025407,1,95


##### Most common sales channel per item per week

used to get sales_channel_id for candidates and negative samples
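
A minimal sketch of the same idea on toy data, written as count, sort, and keep the first row per group. It is an alternative formulation for illustration, not the code used in the cell below, and the frame `tx` is made up.

```python
import pandas as pd

tx = pd.DataFrame({
    "week":             [95, 95, 95, 95, 95],
    "article_id":       [10, 10, 10, 20, 20],
    "sales_channel_id": [2, 2, 1, 1, 1],
})

# Count purchases per (week, article, channel)
channel_counts = (
    tx.groupby(["week", "article_id", "sales_channel_id"])
    .size()
    .reset_index(name="n")
)
# Keep the most frequent channel per (week, article)
most_common = (
    channel_counts.sort_values("n", ascending=False)
    .drop_duplicates(subset=["week", "article_id"])
    .drop(columns="n")
    .sort_values(["week", "article_id"])
    .reset_index(drop=True)
)
# Shift by one week so the value describes "last week" when merged with candidates
most_common["week"] += 1
print(most_common)
```

Article 10 was bought twice through channel 2 and once through channel 1, so channel 2 is kept for it.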

In [8]:
# Lecture 6
# Get for each week, for each article, through which sales_channel_id it was most commonly purchased
most_common_sales_channel_id_per_item_per_week = transactions\
    .groupby(['week',"article_id"])['sales_channel_id'].value_counts() \
    .groupby(['week',"article_id"]).rank(method='dense', ascending=False) \
    .groupby(['week',"article_id"]).head(1).rename('temp').astype('int64').reset_index()
most_common_sales_channel_id_per_item_per_week=most_common_sales_channel_id_per_item_per_week.drop(columns=["temp"])
# Probably not needed
most_common_sales_channel_id_per_item_per_week = most_common_sales_channel_id_per_item_per_week.drop_duplicates(subset=["week","article_id"])
print(most_common_sales_channel_id_per_item_per_week["sales_channel_id"].min())
print(most_common_sales_channel_id_per_item_per_week["sales_channel_id"].max())
# TODO: check if this is correct, I do think I need this because I also add 1 to week for some other thing that I merge this with
most_common_sales_channel_id_per_item_per_week.week += 1
most_common_sales_channel_id_per_item_per_week.head(100)

1
2


Unnamed: 0,week,article_id,sales_channel_id
0,96,108775015,1
1,96,108775044,2
2,96,110065001,1
3,96,110065002,1
4,96,111565001,1
...,...,...,...
95,96,228257001,1
96,96,228257002,1
97,96,228257003,1
98,96,228257004,1


##### Rank items by sales

bestseller_rank is calculated for each week, based on sales of only that week.
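
A small sketch of a per-week bestseller rank on made-up data, assuming a dense rank in which tied articles share a rank. The frame `tx` and the column `n_sold` are hypothetical.

```python
import pandas as pd

tx = pd.DataFrame({
    "week":       [95, 95, 95, 95, 96, 96, 96],
    "article_id": [10, 10, 20, 30, 20, 20, 10],
})

# Number of sales per (week, article)
weekly_sales = tx.groupby(["week", "article_id"]).size().rename("n_sold").reset_index()

# Dense rank within each week: 1 = most sold, tied articles share a rank
weekly_sales["bestseller_rank"] = (
    weekly_sales.groupby("week")["n_sold"]
    .rank(method="dense", ascending=False)
    .astype("int64")
)
print(weekly_sales)
```

In week 95, article 10 sold twice and gets rank 1, while articles 20 and 30 each sold once and share rank 2.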

In [9]:
# Final result of this cell contains for each week in transactions (training data, not full dataset) all articles that were sold, ranked by which ones sold best, their average price in that week, and all data normally included in the articles table.

# Ranks for each week which items sold best
# Lecture 6
sales_nohead = transactions \
    .groupby('week')['article_id'].value_counts() \
    .groupby('week').rank(method='dense', ascending=False) \
    .groupby('week').head(9999999999999).rename('bestseller_rank').astype('int64').reset_index()

# Add article columns, e.g. garment_type_name
sales_nohead = pd.merge(sales_nohead,articles,how="left",on=["article_id"])
# Add average price of product in week
sales_nohead = pd.merge(sales_nohead,mean_price,how="left",on=["week","article_id"])
sales_nohead.head(100)

Unnamed: 0,week,article_id,bestseller_rank,product_code,prod_name,product_type_no,product_type_name,product_group_name,graphical_appearance_no,graphical_appearance_name,...,index_name,index_group_no,index_group_name,section_no,section_name,garment_group_no,garment_group_name,detail_desc,simplified_colour_group_name,price
0,95,760084003,1,760084,1134,272,0,1,1010016,0,...,1,2,2,53,1,1009,5,847,0,0.025094
1,95,866731001,2,866731,3609,273,15,1,1010016,0,...,9,26,4,5,21,1005,0,3130,0,0.024919
2,95,600886001,3,600886,1424,59,20,6,1010016,0,...,7,1,0,60,22,1018,12,420,0,0.022980
3,95,706016001,4,706016,172,272,0,1,1010016,0,...,1,2,2,53,1,1009,5,30,0,0.033197
4,95,372860002,5,372860,19652,302,14,7,1010016,0,...,7,1,0,62,31,1021,13,157,2,0.013193
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,95,878013001,75,878013,1011,265,1,2,1010001,1,...,0,1,0,15,0,1013,8,3544,0,0.049460
96,95,720125040,76,720125,99,273,15,1,1010005,8,...,9,26,4,5,21,1005,0,313,0,0.023239
97,95,610776071,77,610776,46,255,3,0,1010001,1,...,0,1,0,16,30,1002,2,60,2,0.008110
98,95,852174003,77,852174,3280,306,13,4,1010016,0,...,9,26,4,5,21,1005,0,3945,2,0.024849


In [10]:
articles.columns

Index(['article_id', 'product_code', 'prod_name', 'product_type_no',
       'product_type_name', 'product_group_name', 'graphical_appearance_no',
       'graphical_appearance_name', 'colour_group_code', 'colour_group_name',
       'perceived_colour_value_id', 'perceived_colour_value_name',
       'perceived_colour_master_id', 'perceived_colour_master_name',
       'department_no', 'department_name', 'index_code', 'index_name',
       'index_group_no', 'index_group_name', 'section_no', 'section_name',
       'garment_group_no', 'garment_group_name', 'detail_desc',
       'simplified_colour_group_name'],
      dtype='object')

In [11]:
sales_nohead.columns

Index(['week', 'article_id', 'bestseller_rank', 'product_code', 'prod_name',
       'product_type_no', 'product_type_name', 'product_group_name',
       'graphical_appearance_no', 'graphical_appearance_name',
       'colour_group_code', 'colour_group_name', 'perceived_colour_value_id',
       'perceived_colour_value_name', 'perceived_colour_master_id',
       'perceived_colour_master_name', 'department_no', 'department_name',
       'index_code', 'index_name', 'index_group_no', 'index_group_name',
       'section_no', 'section_name', 'garment_group_no', 'garment_group_name',
       'detail_desc', 'simplified_colour_group_name', 'price'],
      dtype='object')

### Generate new features

In [12]:
all_new_features = add_features_to_data(article_features,do_combinations_of_features,columns_to_use,transactions_full,featuresBackXWeeks,articles,min_week,test_week,use_count=use_count,use_rank=use_rank,verbose=False)

In [13]:
try:
    all_new_features[0][0].head()
except IndexError:
    pass

In [14]:
# Example application of get_purchase_count_df_of_attributes
# Assuming output is deterministic, you can see that customer 28847241659200 bought 2 articles from garment_group_no 1010 (as seen in amount_of_garment_group_no)
temp2 = get_purchase_count_df_of_attributes(transactions,articles,["garment_group_no"],"amount_of_garment_group_no")
temp2.head()

Unnamed: 0,customer_id,garment_group_no,amount_of_garment_group_no
0,28847241659200,1005,1
1,28847241659200,1007,1
2,28847241659200,1009,1
3,28847241659200,1010,2
4,41318098387474,1013,1


In [15]:
# Example application of get_purchase_rank_df_of_attributes
# Assuming output is deterministic, you can see that customer 28847241659200 bought 2 articles from garment_group_no 1010 (as seen in amount_of_garment_group_no),
# making it his favourite garment_group_no (as seen in column amount_of_garment_group_no_rank)
temp2 = get_purchase_rank_df_of_attributes(transactions,articles,["garment_group_no"],"amount_of_garment_group_no")
temp2.head()

Unnamed: 0,customer_id,garment_group_no,amount_of_garment_group_no,amount_of_garment_group_no_rank
0,28847241659200,1005,1,2.0
1,28847241659200,1007,1,2.0
2,28847241659200,1009,1,2.0
3,28847241659200,1010,2,1.0
4,41318098387474,1013,1,1.0


In [16]:
transactions.head()

Unnamed: 0,customer_id,article_id,price,sales_channel_id,week
29030503,272412481300040,778064028,0.008458,1,95
29030504,272412481300040,816592008,0.016932,1,95
29030505,272412481300040,621381021,0.033881,1,95
29030506,272412481300040,817477003,0.025407,1,95
29030507,272412481300040,899088002,0.025407,1,95


# Generating candidates

### Last purchase candidates
From original notebook [https://github.com/radekosmulski/personalized_fashion_recs/blob/main/03c_Basic_Model_Submission.ipynb](https://github.com/radekosmulski/personalized_fashion_recs/blob/main/03c_Basic_Model_Submission.ipynb)
Candidate or negative sample for week X: the items that the customer bought in the most recent week before X in which he made a purchase
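
The mapping from a purchase week to the customer's next purchase week can also be sketched with a grouped shift. This is an alternative formulation on toy data, not the loop used in the cell below; the frame `tx`, the column `candidate_week` and the value of `test_week` are made up.

```python
import pandas as pd

test_week = 98  # hypothetical: the week after the last week in the toy data

tx = pd.DataFrame({
    "customer_id": [1, 1, 1, 2],
    "article_id":  [10, 20, 30, 40],
    "week":        [95, 95, 97, 96],
})

# One row per (customer, purchase week), ordered in time
purchase_weeks = (
    tx[["customer_id", "week"]]
    .drop_duplicates()
    .sort_values(["customer_id", "week"])
)
# Next week in which the same customer bought something; test_week if there is none
purchase_weeks["candidate_week"] = (
    purchase_weeks.groupby("customer_id")["week"].shift(-1)
    .fillna(test_week)
    .astype("int64")
)

# Items bought in one purchase week become candidates for the next purchase week
candidates = (
    tx.merge(purchase_weeks, on=["customer_id", "week"])
    .drop(columns="week")
    .rename(columns={"candidate_week": "week"})
)
print(candidates)
```

Customer 1 bought articles 10 and 20 in week 95 and next purchased in week 97, so both become candidates for week 97. Purchases in a customer's final purchase week become candidates for the test week.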

In [17]:
# Final result of cell:
# Candidate for week X: item bought in previous purchase week

c2weeks = transactions.groupby('customer_id')['week'].unique()

c2weeks2shifted_weeks = {}

for c_id, weeks in c2weeks.items():
    c2weeks2shifted_weeks[c_id] = {}
    for i in range(weeks.shape[0]-1):
        c2weeks2shifted_weeks[c_id][weeks[i]] = weeks[i+1]
    c2weeks2shifted_weeks[c_id][weeks[-1]] = test_week

candidates_last_purchase = transactions.copy()

weeks = []
for c_id, week in zip(transactions['customer_id'], transactions['week']):
    weeks.append(c2weeks2shifted_weeks[c_id][week])

# Candidate for week X: item bought in previous purchase week
candidates_last_purchase.week=weeks

In [18]:
print(candidates_last_purchase)

                   customer_id  article_id     price  sales_channel_id  week
29030503       272412481300040   778064028  0.008458                 1    96
29030504       272412481300040   816592008  0.016932                 1    96
29030505       272412481300040   621381021  0.033881                 1    96
29030506       272412481300040   817477003  0.025407                 1    96
29030507       272412481300040   899088002  0.025407                 1    96
...                        ...         ...       ...               ...   ...
31774722  18439937050817258297   891591003  0.084729                 2   105
31774723  18439937050817258297   869706005  0.084729                 2   105
31779097  18440902715633436014   918894002  0.016932                 1   105
31779098  18440902715633436014   761269001  0.016932                 1   105
31780475  18443633011701112574   914868002  0.033881                 1   105

[2762872 rows x 5 columns]


### Bestsellers candidates
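
The idea of this section, sketched on toy data with a top 2 instead of a top 12: every customer that bought anything in a week gets the bestsellers of the previous week as candidates. The frames and values below are made up for illustration.

```python
import pandas as pd

# Top sellers of week 95 (hypothetical data)
bestsellers = pd.DataFrame({
    "week":            [95, 95],
    "article_id":      [10, 20],
    "bestseller_rank": [1, 2],
})
# Shift by one week: bestsellers of week 95 are candidates for week 96
bestsellers_previous_week = bestsellers.assign(week=bestsellers["week"] + 1)

# Customers that bought anything in week 96
active_customers = pd.DataFrame({"week": [96, 96], "customer_id": [1, 2]})

# Every active customer gets every bestseller of the previous week as a candidate
candidates = active_customers.merge(bestsellers_previous_week, on="week")
print(candidates)
```

With 2 customers and 2 bestsellers this produces 4 candidate rows, which shows why the number of samples grows quickly with the real top 12.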

In [19]:
candidates_last_purchase.head()

Unnamed: 0,customer_id,article_id,price,sales_channel_id,week
29030503,272412481300040,778064028,0.008458,1,96
29030504,272412481300040,816592008,0.016932,1,96
29030505,272412481300040,621381021,0.033881,1,96
29030506,272412481300040,817477003,0.025407,1,96
29030507,272412481300040,899088002,0.025407,1,96


In [20]:
transactions.head()

Unnamed: 0,customer_id,article_id,price,sales_channel_id,week
29030503,272412481300040,778064028,0.008458,1,95
29030504,272412481300040,816592008,0.016932,1,95
29030505,272412481300040,621381021,0.033881,1,95
29030506,272412481300040,817477003,0.025407,1,95
29030507,272412481300040,899088002,0.025407,1,95


In [21]:
# Bestseller rank is important because it is merged into dataframes that need it, but after that it can be removed from this dataframe.
# For each week, list of ranked 12 bestsellers
sales = transactions \
    .groupby('week')['article_id'].value_counts() \
    .groupby('week').rank(method='dense', ascending=False) \
    .groupby('week').head(12).rename('bestseller_rank').astype('int8')
sales.head()

week  article_id
95    760084003     1
      866731001     2
      600886001     3
      706016001     4
      372860002     5
Name: bestseller_rank, dtype: int8

In [22]:
# For each week: buy bestselling items of previous week
bestsellers_previous_week = pd.merge(sales, mean_price, on=['week', 'article_id']).reset_index()
bestsellers_previous_week.week += 1
# Per week list of customers that bought ANYTHING
unique_transactions = transactions \
    .groupby(['week', 'customer_id']) \
    .head(1) \
    .drop(columns=['article_id', 'price']) \
    .copy()
unique_transactions.head()

Unnamed: 0,customer_id,sales_channel_id,week
29030503,272412481300040,1,95
29064059,1456826891333599,1,95
29067103,2133687643102426,2,95
29027487,6010692573790711,1,95
29046403,6171059100114610,2,95


In [23]:

# Merge the per-week list of customers that bought ANYTHING
# with the bestselling items of the previous week.
# Result: per week, per customer that bought anything, the 12 bestsellers from THE (general, not per customer) previous week
candidates_bestsellers = pd.merge(
    unique_transactions,
    bestsellers_previous_week,
    on='week',
)

# unique_transactions = Per week list of customers that bought ANYTHING
# For each customer that bought anything and that we want to make a prediction for, keep customer id once and set week to test_week, because that is the week we need predictions for
test_set_transactions = unique_transactions.drop_duplicates('customer_id').reset_index(drop=True)
test_set_transactions.week = test_week


# Merge the test-week customers (each customer id once, with week set to test_week)
# with the bestselling items of the previous week.
# Result: For each customer for whom we want to predict something, 12 bestselling items of testweek-1 as candidates
candidates_bestsellers_test_week = pd.merge(
    test_set_transactions,
    bestsellers_previous_week,
    on='week'
)

# Per week, per customer that bought anything, the 12 bestsellers from THE (general, not per customer) previous week
# Result: For each customer for whom we want to predict something, 12 bestselling items of testweek-1 as candidates
candidates_bestsellers = pd.concat([candidates_bestsellers, candidates_bestsellers_test_week])
candidates_bestsellers.drop(columns='bestseller_rank', inplace=True)

### history-based candidates
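
A minimal sketch of the idea on toy data: look up each customer's favourite value of an article feature, and add the bestselling article with that value from the previous week as a negative sample. The frames `favourites` and `sales_last_week` and the `colour` column are hypothetical; in the notebook the favourites come from the rank feature.

```python
import pandas as pd

# Favourite colour per customer for week 96 (hypothetical data)
favourites = pd.DataFrame({
    "customer_id": [1, 2],
    "week":        [96, 96],
    "colour":      ["blue", "red"],
})

# Articles sold in week 95, sorted from best to worst selling
sales_last_week = pd.DataFrame({
    "week":       [95, 95, 95, 95],
    "article_id": [10, 20, 30, 40],
    "colour":     ["red", "blue", "blue", "red"],
})

# Shift by one so that "week" means "popular in the week before"
shifted = sales_last_week.assign(week=sales_last_week["week"] + 1)
# Keep the first (bestselling) article per week and colour
best_per_colour = shifted.drop_duplicates(subset=["week", "colour"])

# Each customer gets the bestselling article in their favourite colour
history_candidates = favourites.merge(best_per_colour, on=["week", "colour"], how="left")
history_candidates["purchased"] = 0  # added as negative samples
print(history_candidates)
```

Customer 1 prefers blue and receives article 20, the bestselling blue article of week 95. Customer 2 prefers red and receives article 10.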

In [24]:
# Lecture 6
# Objective of this cell: for each feature based on purchase count of article with certain features, get most liked feature, look up bestselling item with that feature of last week, add it as negative sample.
all_history_based_suggestions = pd.DataFrame()
for feature_df_partial_columns in all_new_features:
    # Dataframe: per week: customer_id, article features, purchase counts of article features, ranks of article features
    feature_df = feature_df_partial_columns[0].copy(deep=True)
    # List of strings: Column names on which the new features are based
    partial_columns = feature_df_partial_columns[1]

    # Get name of column containing rank (where 1 means favourite). Probably a cleaner way to do this
    feature_df_columns = list(feature_df.columns)
    rank_column = None
    # For each column in the df, check if the column name contains "_rank"
    for column_name in feature_df_columns:
        if "_rank" in column_name:
            rank_column = column_name
            break

    # Only keep attributes that the customer actually prefers
    feature_df = feature_df[feature_df[rank_column] == 1]
    # Too many negative samples is probably bad, so if attributes are tied for favourite, break tie randomly (should still be deterministic because drop_duplicates keeps topmost rows)
    # Result: for each week, for each customer, one favourite partial_columns
    feature_df = feature_df.drop_duplicates(subset=["customer_id","week"])

    # Get all sales for relevant weeks (sales_nohead is only training set), but don't copy unused columns
    sales_selected_feature = sales_nohead[["week","article_id","price"]+partial_columns].copy(deep=True)

    # I want to give customers recommendations based on the previous week, otherwise I would be taking future data into account, including the actual purchases the user will make in the future.
    # By adding one to each week, this means that week now means "What was popular last week?"
    sales_selected_feature.week += 1

    # If the same feature value (e.g. "blue" for colour) appears multiple times in a week, drop duplicates
    # This keeps the first row where the value appears, which is the bestselling one (sales_nohead is sorted on bestselling)
    sales_selected_feature = sales_selected_feature.drop_duplicates(subset=partial_columns+["week"])

    # For each week: For each customer + his favourite value, merge with the most popular item with that value (of last week)
    feature_df = pd.merge(feature_df,sales_selected_feature,on=["week"]+partial_columns,how="left")

    # These are negative samples. If it turns out the user did actually buy it, the negative sample will be removed later on
    feature_df["purchased"] = 0

    # Add sales channel by picking most common one for that article last week
    # TODO: check if correct: The  "last week" part was week += 1, which has also been done on the second df. Is this correct?
    feature_df = pd.merge(feature_df,most_common_sales_channel_id_per_item_per_week,on=["week","article_id"],how="left")

    # Drop columns not immediately needed anymore
    feature_df = feature_df[["customer_id","article_id","price","sales_channel_id","week","purchased"]]

    if all_history_based_suggestions.empty:
        all_history_based_suggestions= feature_df.copy(deep=True)
    else:
        all_history_based_suggestions = pd.concat([all_history_based_suggestions,feature_df])

In [25]:
try:
    sales_selected_feature.head()
except NameError:
    pass

### Combine candidates

In [26]:

# Combining transactions and candidates / negative examples
transactions['purchased'] = 1

# candidates_last_purchase: Candidate for week X: item bought in previous purchase week, negative samples
# candidates_bestsellers: for each customer for whom we can predict something, the 12 bestsellers of testweek-1 are given as candidates for the test week, negative samples
# transactions: simply the transactions themselves, positive samples
# Lecture 6
if add_history_candidates:
    data = pd.concat([transactions, candidates_last_purchase, candidates_bestsellers,all_history_based_suggestions])
else:
    data = pd.concat([transactions, candidates_last_purchase, candidates_bestsellers])
# For real transactions, purchased was 1 (positive sample). This sets the value to 0 for negative samples
data['purchased'] = data['purchased'].fillna(0)

In [27]:
data.head()

Unnamed: 0,customer_id,article_id,price,sales_channel_id,week,purchased
29030503,272412481300040,778064028,0.008458,1,95,1.0
29030504,272412481300040,816592008,0.016932,1,95,1.0
29030505,272412481300040,621381021,0.033881,1,95,1.0
29030506,272412481300040,817477003,0.025407,1,95,1.0
29030507,272412481300040,899088002,0.025407,1,95,1.0


### Count duplicate samples
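
A small sketch of the duplicate count on toy data, assuming the positive sample comes first in the concatenated frame so that it survives deduplication. The rows below are made up.

```python
import pandas as pd

# First row is a real purchase, the others are candidates for the same week
data = pd.DataFrame({
    "customer_id": [1, 1, 1, 1],
    "article_id":  [10, 10, 10, 20],
    "week":        [96, 96, 96, 96],
    "purchased":   [1, 0, 0, 0],
})
keys = ["customer_id", "article_id", "week"]

# How often does each (customer, article, week) occur, as positive or negative sample?
importance = data.groupby(keys).size().reset_index(name="importance")

# Keep the first occurrence per key and attach the count
deduplicated = data.drop_duplicates(keys).merge(importance, on=keys)
print(deduplicated)
```

Article 10 occurred three times (one purchase, two candidates), so it is kept once with purchased = 1 and importance = 3.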

In [28]:
# For each week: look for every time that a customer bought OR got recommended an item (column importance), and if bought only keep row with purchased 1 (this automatically happens because of the order of concats in the previous cell)
# Note: candidates for week 105 are all purchased = 0
temp = data.groupby(['customer_id', 'article_id', 'week']).size().reset_index(name="importance")
print(temp)
data.drop_duplicates(['customer_id', 'article_id', 'week'], inplace=True)

data = pd.merge(
    data,
    temp,
    on=['customer_id', 'article_id', 'week']
)

data.purchased.mean()
print(data["importance"].isna().sum())
print(data["importance"].max())
print(data["importance"].mean())
print(data["importance"].min())

data.head()

                   customer_id  article_id  week  importance
0               28847241659200   372860002    96           1
1               28847241659200   448509014   105           1
2               28847241659200   547780003    96           1
3               28847241659200   600886001    96           1
4               28847241659200   610776002    96           1
...                        ...         ...   ...         ...
18253744  18446737527580148316   923758001   104           1
18253745  18446737527580148316   923758001   105           1
18253746  18446737527580148316   924243001   104           1
18253747  18446737527580148316   924243001   105           1
18253748  18446737527580148316   924243002   105           1

[18253749 rows x 4 columns]
0
74
1.0362430205433415
1


Unnamed: 0,customer_id,article_id,price,sales_channel_id,week,purchased,importance
0,272412481300040,778064028,0.008458,1,95,1.0,1
1,272412481300040,816592008,0.016932,1,95,1.0,1
2,272412481300040,621381021,0.033881,1,95,1.0,1
3,272412481300040,817477003,0.025407,1,95,1.0,1
4,272412481300040,899088002,0.025407,1,95,1.0,1


In [29]:
sales.head()

week  article_id
95    760084003     1
      866731001     2
      600886001     3
      706016001     4
      372860002     5
Name: bestseller_rank, dtype: int8

In [30]:
bestsellers_previous_week.head()

Unnamed: 0,week,article_id,bestseller_rank,price
0,96,760084003,1,0.025094
1,96,866731001,2,0.024919
2,96,600886001,3,0.02298
3,96,706016001,4,0.033197
4,96,372860002,5,0.013193


### Add bestseller information

In [31]:
# For real transactions: bestseller unknown, check candidates to check if there is a bestseller rank. If not, fill with fillna later on.
# Lecture 6
# Using sales_nohead is supposed to give the true bestseller rank for any item, even if not top 12
# TODO: does sales_nohead contain all info needed for this?
if bestsellerFiller is None:
    # Shift by one week so the rank describes the week BEFORE the sample, as in the else branch
    full_bestsellers_previous_week = sales_nohead.copy(deep=True)
    full_bestsellers_previous_week.week += 1
    data = pd.merge(
        data,
        full_bestsellers_previous_week[['week', 'article_id', 'bestseller_rank']],
        on=['week', 'article_id'],
        how='left'
    )
else:
    data = pd.merge(
        data,
        bestsellers_previous_week[['week', 'article_id', 'bestseller_rank']],
        on=['week', 'article_id'],
        how='left'
    )

In [32]:
data.head()

Unnamed: 0,customer_id,article_id,price,sales_channel_id,week,purchased,importance,bestseller_rank
0,272412481300040,778064028,0.008458,1,95,1.0,1,
1,272412481300040,816592008,0.016932,1,95,1.0,1,
2,272412481300040,621381021,0.033881,1,95,1.0,1,
3,272412481300040,817477003,0.025407,1,95,1.0,1,
4,272412481300040,899088002,0.025407,1,95,1.0,1,


In [33]:
# Remove first week because it lacks bestsellers_previous_week
data = data[data.week != data.week.min()]  # Presumably to make sure no data of an incomplete week is included?
# If no bestseller: sold very poorly (default bestsellerFiller is 999, which means there are 998 better selling items)
if bestsellerFiller is not None:
    data.bestseller_rank.fillna(bestsellerFiller, inplace=True)
else:
    # https://datatofish.com/count-nan-pandas-dataframe/
    print(data.isna().sum().sum())

### Merge data
Combine data from articles, customers, custom features...

In [34]:
# Per customer per week all transactions/candidates

# Add article info to each row
data = pd.merge(data, articles, on='article_id', how='left')
# Add customer info to each row
data = pd.merge(data, customers, on='customer_id', how='left')

In [35]:
# Sort by week, then customer
data.sort_values(['week', 'customer_id'], inplace=True)
data.reset_index(drop=True, inplace=True)

In [36]:
data.head()

Unnamed: 0,customer_id,article_id,price,sales_channel_id,week,purchased,importance,bestseller_rank,product_code,prod_name,...,garment_group_no,garment_group_name,detail_desc,simplified_colour_group_name,FN,Active,club_member_status,fashion_news_frequency,age,postal_code
0,28847241659200,887770001,0.016932,1,96,1.0,1,999.0,887770,727,...,1010,6,3692,0,1,1,0,1,21,57896
1,28847241659200,762846001,0.025407,1,96,0.0,1,999.0,762846,472,...,1010,6,492,2,1,1,0,1,21,57896
2,28847241659200,829308001,0.033881,1,96,0.0,1,999.0,829308,11402,...,1005,0,9082,0,1,1,0,1,21,57896
3,28847241659200,760084003,0.025094,1,96,0.0,1,1.0,760084,1134,...,1009,5,847,0,1,1,0,1,21,57896
4,28847241659200,866731001,0.024919,1,96,0.0,1,2.0,866731,3609,...,1005,0,3130,0,1,1,0,1,21,57896


In [37]:
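# lecture 6
# Each entry of all_new_features holds a feature DataFrame at [0]
# and the list of article columns it is keyed on at [1]
for feature_df_partial_columns in all_new_features:
    print(feature_df_partial_columns[1])
    # Merge the new feature into the training data
    data = pd.merge(
        data,
        feature_df_partial_columns[0],
        on=["customer_id", "week"] + feature_df_partial_columns[1],
        how="left"
    )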

['product_type_no']
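
The loop above joins each custom feature on `customer_id`, `week` and the article columns the feature was built from. Below is a minimal sketch of one such merge on toy data. The contents of `feature_df` are hypothetical; the real feature tables come from helper_feature_generation.py and may be computed differently.

```python
import pandas as pd

# Toy samples: one row per (customer, week, article) candidate (made-up data).
samples = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "week": [5, 5, 5],
    "article_id": [100, 200, 100],
    "product_type_no": [252, 306, 252],
})

# Hypothetical feature table: how often a customer bought a given product type.
feature_columns = ["product_type_no"]
feature_df = pd.DataFrame({
    "customer_id": [1],
    "week": [5],
    "product_type_no": [252],
    "amount_of_(product_type_no)": [3],
})

# Same join keys as in the loop: customer, week, plus the article columns of the feature.
merged = pd.merge(
    samples,
    feature_df,
    on=["customer_id", "week"] + feature_columns,
    how="left",
)
print(merged)
```

Samples with no matching row in the feature table keep their place in the output and receive NaN for the new column.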


In [38]:
# Don't train on last week because it is the test set
train = data[data.week != test_week]
# Test set: the final week, with duplicate candidates dropped
test = data[data.week == test_week].drop_duplicates(['customer_id', 'article_id', 'sales_channel_id']).copy()
test.head()

Unnamed: 0,customer_id,article_id,price,sales_channel_id,week,purchased,importance,bestseller_rank,product_code,prod_name,...,detail_desc,simplified_colour_group_name,FN,Active,club_member_status,fashion_news_frequency,age,postal_code,amount_of_(product_type_no),amount_of_(product_type_no)_rank
11381612,28847241659200,925246001,0.128797,2,105,0.0,1,999.0,925246,25454,...,27855,0,1,1,0,1,21,57896,,
11381613,28847241659200,924243001,0.041535,1,105,0.0,1,1.0,924243,19190,...,13007,5,1,1,0,1,21,57896,,
11381614,28847241659200,924243002,0.041877,1,105,0.0,1,2.0,924243,19190,...,13007,0,1,1,0,1,21,57896,,
11381615,28847241659200,918522001,0.041435,1,105,0.0,1,3.0,918522,26372,...,28633,2,1,1,0,1,21,57896,,
11381616,28847241659200,923758001,0.033462,1,105,0.0,1,4.0,923758,19359,...,27869,2,1,1,0,1,21,57896,,


In [39]:
# NaN values in amount_of_(<article_feature>)_rank are fine: they simply indicate that
# the user never bought anything with the same value of <article_feature>
print(train.groupby(['week', 'customer_id']).head())
# Number of rows per (week, customer) group, in the same order as the sorted data
train_baskets = train.groupby(['week', 'customer_id'])['article_id'].count().values
print(train_baskets)
print(train_baskets.min())
print(train_baskets.max())
print(len(train_baskets))

                   customer_id  article_id     price  sales_channel_id  week  \
0               28847241659200   887770001  0.016932                 1    96   
1               28847241659200   762846001  0.025407                 1    96   
2               28847241659200   829308001  0.033881                 1    96   
3               28847241659200   760084003  0.025094                 1    96   
4               28847241659200   866731001  0.024919                 1    96   
...                        ...         ...       ...               ...   ...   
11381596  18446737527580148316   547780001  0.023712                 2   104   
11381597  18446737527580148316   763988001  0.023712                 2   104   
11381598  18446737527580148316   763988003  0.023712                 2   104   
11381599  18446737527580148316   547780040  0.023712                 2   104   
11381600  18446737527580148316   909370001  0.032947                 2   104   

          purchased  importance  bestse
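
`train_baskets` is an array of group sizes, one per (week, customer) pair. LightGBM rankers expect such sizes to follow the row order of the training data, which is presumably why the data is sorted by week and customer before this point. A minimal sketch on made-up rows:

```python
import pandas as pd

# Toy training rows, deliberately unsorted (made-up data).
train = pd.DataFrame({
    "week": [2, 1, 1, 2, 1],
    "customer_id": [7, 7, 9, 7, 7],
    "article_id": [10, 11, 12, 13, 14],
})

# Sort so that all rows of one (week, customer) group are contiguous.
train = train.sort_values(["week", "customer_id"]).reset_index(drop=True)

# groupby sorts its keys by default, so the sizes follow the order of the sorted rows.
group_sizes = train.groupby(["week", "customer_id"])["article_id"].count().values
print(group_sizes)

# Sanity check: the group sizes must account for every training row.
assert group_sizes.sum() == len(train)
```

If the rows were not sorted on the same keys as the groupby, the sizes would no longer line up with the rows they describe.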

In [40]:
train_X = train[columns_to_use]
train_y = train['purchased']

test_X = test[columns_to_use]

# Model training and predicting

The cell below prints the feature importances according to the ranker.
The submission is written to ../../data/subs/submissionRobbeLauwers.csv.gz.
The generated .csv and .csv.gz files have identical contents.
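
How the helper writes the two files is not shown in this notebook. As a hedged illustration of why a .csv and a .csv.gz can hold the same contents, the sketch below writes a made-up submission frame both ways with pandas and reads both back. File names, column names and values are assumptions for the example only.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up submission rows; the column names are illustrative.
submission = pd.DataFrame({
    "customer_id": ["a", "b"],
    "prediction": ["0111 0222", "0333 0444"],
})

with tempfile.TemporaryDirectory() as tmp:
    plain_path = Path(tmp) / "submission.csv"
    gz_path = Path(tmp) / "submission.csv.gz"

    # Same frame, once as plain text and once gzip-compressed.
    submission.to_csv(plain_path, index=False)
    submission.to_csv(gz_path, index=False, compression="gzip")

    # dtype=str keeps leading zeros in the article ids.
    plain = pd.read_csv(plain_path, dtype=str)
    zipped = pd.read_csv(gz_path, dtype=str, compression="gzip")

same_contents = plain.equals(zipped)
print(same_contents)
```

The compressed file is only a smaller encoding of the same table, so either one can be uploaded.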

In [41]:
from helper_ranking import rank
rank(train_X, train_y, test_X, test, train_baskets, columns_to_use, LGBMBoostingType, bestsellers_previous_week)

[LightGBM] [Debug] Dataset::GetMultiBinFromSparseFeatures: sparse rate 0.927889
[LightGBM] [Debug] Dataset::GetMultiBinFromAllFeatures: sparse rate 0.280571
[LightGBM] [Debug] init for col-wise cost 0.145653 seconds, init for row-wise cost 0.804081 seconds
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Debug] Using Sparse Multi-Val Bin
[LightGBM] [Info] Total Bins 1152
[LightGBM] [Info] Number of data points in the train set: 11381612, number of used features: 22
[LightGBM] [Debug] Trained a tree with leaves = 31 and depth = 12
bestseller_rank 0.9604403506788891
importance 0.036971804559252246
amount_of_(product_type_no) 0.001637227192831389
article_id 0.00015870698383350083
amount_of_(product_type_no)_rank 0.0001467072690398273
product_type_no 0.00013880261359649527
age 0.00013229501127365838
index_code 9.827203708556033e-05
garment_group_no 8.471655218285679e-05
postal_code 6.0475840437066565e-05
c