# **<a id="Content">HnM RecSys Notebook 9417</a>**

## **<a id="Content">Table of Contents</a>**
* [**<span>1. Imports</span>**](#Imports)  
* [**<span>2. Helper Functions/Decorators</span>**](#Helper-Functions)
* [**<span>3. LightGBM Model</span>**](#LightGBM)

## Imports

In [28]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns
import os
import re
import warnings
import scipy.sparse as sparse
from scipy.sparse.linalg import spsolve

## Helper-Functions

In [29]:
from datetime import datetime, timedelta

# Only use the last x weeks of transactions data, since the full dataset is too large
def filter_transactions_last_x_weeks(transactions, x=10):
    # Convert date strings to datetime objects
    transactions['t_dat'] = pd.to_datetime(transactions['t_dat'])

    # Calculate the date x weeks ago from the latest transaction date
    latest_date = transactions['t_dat'].max()
    cutoff_date = latest_date - timedelta(weeks=x)

    # Filter transactions to only include those in the last x weeks
    filtered_transactions = transactions.loc[transactions['t_dat'] >= cutoff_date].copy()

    return filtered_transactions

In [30]:
def filter_customers_and_articles(customers, articles, filtered_transactions):
    # Get unique customer and article IDs from filtered transactions
    customer_ids = filtered_transactions['customer_id'].unique()
    article_ids = filtered_transactions['article_id'].unique()

    # Filter customers and articles to only include those in filtered transactions
    customers_filtered = customers.loc[customers['customer_id'].isin(customer_ids)].copy()
    articles_filtered = articles.loc[articles['article_id'].isin(article_ids)].copy()

    return customers_filtered, articles_filtered
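As a quick check of how the two helpers fit together, the sketch below restates them compactly and runs them on a tiny made-up dataset. The toy frames and their values are assumptions for illustration only; the column names follow the ones used in the helpers.

```python
from datetime import timedelta

import pandas as pd


def filter_transactions_last_x_weeks(transactions, x=10):
    # Compact restatement of the helper; works on a copy of the input
    transactions = transactions.assign(t_dat=pd.to_datetime(transactions["t_dat"]))
    cutoff_date = transactions["t_dat"].max() - timedelta(weeks=x)
    return transactions.loc[transactions["t_dat"] >= cutoff_date].copy()


def filter_customers_and_articles(customers, articles, filtered_transactions):
    customer_ids = filtered_transactions["customer_id"].unique()
    article_ids = filtered_transactions["article_id"].unique()
    return (
        customers.loc[customers["customer_id"].isin(customer_ids)].copy(),
        articles.loc[articles["article_id"].isin(article_ids)].copy(),
    )


# Toy data (hypothetical values)
transactions = pd.DataFrame({
    "t_dat": ["2020-01-01", "2020-08-30", "2020-09-20"],
    "customer_id": ["c1", "c2", "c3"],
    "article_id": [101, 102, 103],
})
customers = pd.DataFrame({"customer_id": ["c1", "c2", "c3"]})
articles = pd.DataFrame({"article_id": [101, 102, 103, 104]})

# Keep the last 4 weeks, then restrict customers and articles to what remains
recent = filter_transactions_last_x_weeks(transactions, x=4)
customers_f, articles_f = filter_customers_and_articles(customers, articles, recent)

print(recent["customer_id"].tolist())
print(articles_f["article_id"].tolist())
```

The January transaction falls outside the four-week window, so customer `c1` and article `101` drop out, as does article `104`, which was never purchased.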

## LightGBM

A comparison of three widely used GBDT libraries. Training speed depends on the dataset and settings, but LightGBM's histogram-based, leaf-wise tree growth usually makes it one of the fastest to train.

|Feature|LightGBM|XGBoost|CatBoost|
|:----|:----|:----|:----|
|Categoricals|Handles categorical features natively (`categorical_feature`); one-hot encoding is not required|Traditionally requires encoding (e.g. one-hot); newer releases add native support via `enable_categorical`|Handles categorical features natively using ordered target statistics|
|Speed|Very fast training and prediction|Fast training and prediction|Often slower to train than LightGBM and XGBoost, depending on the data|
|Class Imbalance|Handles unbalanced classes via `is_unbalance` or `scale_pos_weight`|Handles unbalanced classes via `scale_pos_weight`|Handles unbalanced classes via class weighting (`class_weights`, `auto_class_weights`, `scale_pos_weight`)|
|Handling NaNs|Handles NaN values natively|Handles NaN values natively by learning a default split direction|Handles NaN values in numerical features natively (`nan_mode`)|
|Custom Loss|Supports custom loss functions|Supports custom loss functions|Supports custom loss functions|


To use LightGBM for a ranking problem, we can treat it as a binary classification problem where the target variable indicates whether an item is relevant to the user (here, whether the user purchased it).

Alternatively, we can use LightGBM's ranking API (`LGBMRanker`), which is designed for ranking problems. Instead of optimizing a classification loss, its `lambdarank` objective optimizes a ranking loss based on NDCG. MAP is not optimized directly, although it can be tracked as an evaluation metric.
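For the ranking API, LightGBM needs to know which rows belong to the same query (here, the same user). A minimal sketch of that bookkeeping, assuming a toy frame with `user_index`, `item_index` and `target` columns (all values are made up): rows are sorted so that each user's rows are contiguous, and the per-user row counts are what would be passed as `group`.

```python
import pandas as pd

# Toy user-item rows (hypothetical values); target 1 means the item was purchased
toy = pd.DataFrame({
    "user_index": [10, 5, 10, 5, 10, 7],
    "item_index": [1, 2, 3, 4, 5, 6],
    "target": [0, 1, 1, 0, 0, 1],
})

# Rows of the same query must be contiguous, so sort by user first
toy_sorted = toy.sort_values("user_index", kind="stable").reset_index(drop=True)

# Number of rows per user, in the same order as the sorted rows
group_sizes = toy_sorted.groupby("user_index", sort=False).size().tolist()

print(toy_sorted["user_index"].tolist())  # [5, 5, 7, 10, 10, 10]
print(group_sizes)                        # [2, 1, 3]

# These sizes are what a ranking dataset would receive, for example:
# lgb.Dataset(toy_sorted[["item_index"]], label=toy_sorted["target"], group=group_sizes)
```

The group sizes must sum to the number of rows; passing row indices or per-row user IDs instead of counts is a common source of errors.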

### Feature Engineering

In [31]:
# LightGBM imports

from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.feature_selection import RFE
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import make_scorer
import lightgbm as lgb
from hyperopt import fmin, tpe, hp, STATUS_OK, Trials

In [32]:
import pickle

# open user_item_matrix_200
with open('user_item_matrix_200.pkl', 'rb') as f:
    user_item_matrix = pickle.load(f)

# open customer and article index maps
with open('lightgbm/customer_id_indices_map.pkl', 'rb') as f:
    customer_id_indices_map = pickle.load(f)

with open('lightgbm/article_id_indices_map.pkl', 'rb') as f:
    article_id_indices_map = pickle.load(f)

# load df from pickle file for time-based split
with open('lightgbm/df.pkl', 'rb') as f:
    df = pickle.load(f)

# load final_df from pickle file for clean processing
with open('lightgbm/final_df_with_binary_targets.pkl', 'rb') as f:
    final_df = pickle.load(f)

### Model Training

In [33]:
final_df.head()

Unnamed: 0,price,sales_channel_1,sales_channel_2,quantity,article_engagement_ratio,user_index,item_index,FN,Active,club_member_status,...,garment_group_no_1019.0,garment_group_no_1020.0,garment_group_no_1021.0,garment_group_no_1023.0,garment_group_no_1025.0,index_group_no_1.0,index_group_no_2.0,index_group_no_3.0,index_group_no_4.0,index_group_no_26.0
0,0.042358,False,True,1.0,1.0,5,11563,1.0,1.0,2.0,...,False,False,False,False,False,True,False,False,False,False
1,0.050842,False,True,1.0,1.0,5,9899,1.0,1.0,2.0,...,False,False,False,False,False,True,False,False,False,False
2,0.06781,False,True,1.0,1.0,5,14438,1.0,1.0,2.0,...,False,False,False,False,False,True,False,False,False,False
3,0.016937,False,True,1.0,0.5,10,10307,0.0,0.0,2.0,...,False,False,False,False,False,False,True,False,False,False
4,0.016937,False,True,1.0,0.166667,10,13608,0.0,0.0,2.0,...,False,False,False,True,False,True,False,False,False,False


In [34]:
# # target encoding
# from category_encoders import TargetEncoder
# from sklearn.model_selection import KFold

# # Define columns to target encode
# cols_to_encode = ['department_no', 'product_type_no', 'section_no', 'graphical_appearance_no']

# # Define number of folds for cross-validation
# n_splits = 5

# # Create KFold object for cross-validation
# kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)

# # Perform target encoding with cross-validation
# for col in cols_to_encode:
#     final_df[f'{col}_te'] = 0
#     te = TargetEncoder(cols=[col])
#     for train_idx, val_idx in kf.split(final_df):
#         te.fit(final_df.iloc[train_idx][[col]], final_df.iloc[train_idx]['target'])
#         final_df.loc[val_idx, f'{col}_te'] = te.transform(final_df.iloc[val_idx][[col]]).values.flatten()

In [35]:
# ---- memory optimizations -------------

# reference: https://www.kaggle.com/arjanso/reducing-dataframe-memory-size-by-65

# Iterate through all the columns of a dataframe and downcast the int and float columns
# to the smallest dtype that can hold their value range. Identifier columns such as
# customer_id must not be reduced to a dtype that cannot represent every value,
# otherwise distinct IDs would collide.

def reduce_mem_usage(df):
    """Iterate over all the columns of a DataFrame and modify the data type
    to reduce memory usage, handling ordered Categoricals"""
    
    # check the memory usage of the DataFrame
    start_mem = df.memory_usage().sum() / 1024**2
    print("Memory usage of dataframe is {:.2f} MB".format(start_mem))
    
    for col in df.columns:
        col_type = df[col].dtype
        
        if col_type == 'category':
            if df[col].cat.ordered:
                # Convert ordered Categorical to an integer
                df[col] = df[col].cat.codes.astype('int16')
            else:
                # Convert unordered Categorical to a string
                df[col] = df[col].astype('str')
        
        elif col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min >= np.iinfo(np.int8).min and c_max <= np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min >= np.iinfo(np.int16).min and c_max <= np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min >= np.iinfo(np.int32).min and c_max <= np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min >= np.iinfo(np.int64).min and c_max <= np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
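            else:
                # NOTE: bool columns also reach this branch and are upcast from
                # 1 byte to float16 (2 bytes), which can increase memory usage.
                # float16 keeps only about 3 significant decimal digits.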
                if c_min >= np.finfo(np.float16).min and c_max <= np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min >= np.finfo(np.float32).min and c_max <= np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    
    # check the memory usage after optimization
    end_mem = df.memory_usage().sum() / 1024**2
    print("Memory usage after optimization is: {:.2f} MB".format(end_mem))

    # calculate the percentage of the memory usage reduction
    mem_reduction = 100 * (start_mem - end_mem) / start_mem
    print("Memory usage decreased by {:.1f}%".format(mem_reduction))
    
    return df
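Two side effects of aggressive downcasting are worth checking before relying on this function. As written, columns that are neither categorical, object, nor integer fall through to the float branch, which appears to include boolean columns. The sketch below uses only NumPy and made-up values to show what a `float16` conversion does in those cases; it illustrates the dtype behaviour and is not a replacement for the function.

```python
import numpy as np

# A boolean column stored as float16 takes twice the memory
flags = np.zeros(1000, dtype=bool)
print(flags.nbytes, flags.astype(np.float16).nbytes)  # 1000 2000

# float16 cannot represent every integer above 2048, so IDs stored as
# floats can collide after downcasting
id_a = np.float16(11563)
id_b = np.float16(11560)
print(float(id_a), float(id_b))  # 11560.0 11560.0

# Fractional values are rounded as well
value = 0.1234567
rounding_error = abs(float(np.float16(value)) - value)
print(rounding_error)
```

A conservative variant would skip boolean columns and stop at `float32` for floats; whether the extra memory is acceptable depends on the dataset.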

In [36]:
# only get top 50 customers by total purchase quantity from final_df

# Compute the total quantity for each user_index
user_quantity = final_df.groupby('user_index')['quantity'].sum()

# Get the top 50 user_indices by total quantity
top_50_users = user_quantity.nlargest(50).index

# Filter the final_df to include only the data for the top 50 users
final_df_top_50 = final_df[final_df['user_index'].isin(top_50_users)].copy()
# print the shape of final_df_top_50
print(final_df_top_50.shape)

print(final_df_top_50['user_index'].nunique())


(1952211, 56)
50


In [37]:
def time_based_train_test_split(final_df, test_size=0.2):

    # Convert days, months, and years columns to datetime object
    final_df['date'] = pd.to_datetime(final_df[['day', 'month', 'year']])

    # Sort dataframe by date in ascending order
    final_df = final_df.sort_values(by='date')

    # Calculate cutoff index
    cutoff_index = int(len(final_df) * (1-test_size))

    # Create train and test dataframes
    train_df = final_df[:cutoff_index]
    test_df = final_df[cutoff_index:]

    # Drop date column from train and test dataframes
    train_df = train_df.drop('date', axis=1)
    test_df = test_df.drop('date', axis=1)

    # split train_df into X_train and y_train
    X_train = train_df.drop('target', axis=1)
    y_train = train_df['target']

    # split test_df into X_test and y_test
    X_test = test_df.drop('target', axis=1)
    y_test = test_df['target']

    return X_train, X_test, y_train, y_test

In [38]:
# 80/20 time-based split to curb data leakage
X_train, X_test, y_train, y_test = time_based_train_test_split(final_df, test_size=0.2)
# final_df_top_50 = final_df_top_50.drop('date', axis=1)
# from sklearn.model_selection import train_test_split
# X_train, X_test, y_train, y_test = train_test_split(final_df.drop(['target'], axis=1), final_df['target'], test_size=0.2, random_state=42)

# reduce memory usage
X_train = reduce_mem_usage(X_train)
X_test = reduce_mem_usage(X_test)

# print the shape of X_train, X_test, y_train, y_test
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

Memory usage of dataframe is 1393.06 MB
Memory usage after optimization is: 726.30 MB
Memory usage decreased by 47.9%
Memory usage of dataframe is 348.27 MB
Memory usage after optimization is: 401.84 MB
Memory usage decreased by -15.4%
(6242440, 55)
(1560611, 55)
(6242440,)
(1560611,)


In [39]:
import lightgbm as lgb
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import ndcg_score, average_precision_score
from sklearn.feature_selection import RFECV
import joblib
from sklearn.metrics import get_scorer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

In [40]:
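# Define features and target
target = 'target'
features = [col for col in final_df.columns if col != target]

# LightGBM needs to know which rows belong to each user (query) to compute ranking
# metrics. Its `group` argument expects the number of rows per query, and the rows
# of each query must be contiguous, so sort both splits by user_index first.
train_order = np.argsort(X_train['user_index'].to_numpy(), kind='stable')
test_order = np.argsort(X_test['user_index'].to_numpy(), kind='stable')
X_train_grouped, y_train_grouped = X_train.iloc[train_order], y_train.iloc[train_order]
X_test_grouped, y_test_grouped = X_test.iloc[test_order], y_test.iloc[test_order]

# Rows per user, in the same (ascending user_index) order as the sorted data
group_sizes_train = X_train_grouped.groupby('user_index').size().to_numpy()
group_sizes_test = X_test_grouped.groupby('user_index').size().to_numpy()

# Per-row group labels aligned with X_train, as expected by scikit-learn's `groups=` argument
groups_train_flat = X_train['user_index'].to_numpy()

# Create LightGBM datasets with group query information
train_data = lgb.Dataset(X_train_grouped, label=y_train_grouped, group=group_sizes_train)
test_data = lgb.Dataset(X_test_grouped, label=y_test_grouped, group=group_sizes_test, reference=train_data)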

In [55]:
# Preprocess the data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Feature selection using RFECV
selector = RFECV(
    estimator=lgb.LGBMClassifier(n_jobs=-1,
        num_leaves=31, max_depth=7, learning_rate=0.1
    ),
    cv=5, scoring=get_scorer('average_precision'),
    verbose=1, step=1
)
selector.fit(X_train_scaled, y_train)

Fitting estimator with 55 features.
Fitting estimator with 54 features.
Fitting estimator with 53 features.
Fitting estimator with 52 features.
Fitting estimator with 51 features.
Fitting estimator with 50 features.
Fitting estimator with 49 features.
Fitting estimator with 48 features.
Fitting estimator with 47 features.
Fitting estimator with 46 features.
Fitting estimator with 45 features.
Fitting estimator with 44 features.
Fitting estimator with 43 features.
Fitting estimator with 42 features.
Fitting estimator with 41 features.
Fitting estimator with 40 features.
Fitting estimator with 39 features.
Fitting estimator with 38 features.
Fitting estimator with 37 features.
Fitting estimator with 36 features.
Fitting estimator with 35 features.
Fitting estimator with 34 features.
Fitting estimator with 33 features.
Fitting estimator with 32 features.
Fitting estimator with 31 features.
Fitting estimator with 30 features.
Fitting estimator with 29 features.
Fitting estimator with 28 fe

In [56]:
# Get selected features
selected_features = X_train.columns[selector.get_support()]
print(selected_features)

ranks = selector.ranking_
feat_ranks = {feat:rank for feat, rank in zip(X_train.columns, ranks)}
sorted_ranks = sorted(feat_ranks.items(), key=lambda x: x[1])
print(sorted_ranks)

Index(['price'], dtype='object')
[('price', 1), ('user_index', 2), ('item_index', 3), ('article_engagement_ratio', 4), ('user_purchase_quant', 5), ('department_no', 6), ('mean_purchase_age', 7), ('time_diff_days', 8), ('quantity', 9), ('item_avg_price_level', 10), ('product_type_no', 11), ('club_member_status', 12), ('section_no', 13), ('age', 14), ('item_purchase_frequency', 15), ('min_purchase_age', 16), ('day', 17), ('Active', 18), ('FN', 19), ('sales_channel_2', 20), ('sales_channel_1', 21), ('fashion_news_frequency', 22), ('RFM_Score', 23), ('garment_group_no_1001.0', 24), ('garment_group_no_1002.0', 25), ('garment_group_no_1003.0', 26), ('garment_group_no_1005.0', 27), ('garment_group_no_1006.0', 28), ('garment_group_no_1007.0', 29), ('garment_group_no_1008.0', 30), ('garment_group_no_1009.0', 31), ('graphical_appearance_no', 32), ('garment_group_no_1010.0', 33), ('garment_group_no_1011.0', 34), ('garment_group_no_1012.0', 35), ('garment_group_no_1013.0', 36), ('garment_group_no_

In [50]:
# # Get selected features
# selected_features = X_train.columns[selector.get_support()]

# Keep every feature that RFECV ranked 1, 2, or 3, rather than only the
# rank-1 features returned by selector.get_support()
selected_features = X_train.columns[ranks <= 3]
print(selected_features)

Index(['price', 'quantity', 'article_engagement_ratio', 'user_index',
       'item_index', 'time_diff_days', 'user_purchase_quant', 'department_no',
       'mean_purchase_age', 'item_avg_price_level'],
      dtype='object')


In [86]:
min_date = df['t_dat'].min()
max_date = df['t_dat'].max()

print('Min date:', min_date)
print('Max date:', max_date)

Min date: 2018-09-20 00:00:00
Max date: 2020-09-22 00:00:00


In [87]:
def calculate_split_date(df, split_percentage):
    # Get min and max dates
    min_date = df['t_dat'].min()
    max_date = df['t_dat'].max()

    # Calculate split date
    split_date = min_date + pd.DateOffset(days=int((max_date - min_date).days * split_percentage))

    return split_date
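The arithmetic inside `calculate_split_date` can be checked by hand. The sketch below uses only the standard library and the minimum and maximum dates printed earlier; the variable names are chosen for illustration.

```python
from datetime import date, timedelta

min_date = date(2018, 9, 20)
max_date = date(2020, 9, 22)
split_percentage = 0.50

# Total span in days, truncated offset, and the resulting split date
span_days = (max_date - min_date).days
offset_days = int(span_days * split_percentage)
split_date = min_date + timedelta(days=offset_days)

print(span_days, offset_days, split_date)  # 733 366 2019-09-21
```

Note that this splits the calendar range in half, not the rows: if transactions are unevenly distributed over time, the two sides can differ a lot in size.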

In [117]:
split_date = calculate_split_date(df, 0.50)
print('Split date:', split_date)

Split date: 2019-09-21 00:00:00


In [118]:
# print number of unique target values in final_df on/after and before the split date
print(final_df[final_df['date'] >= split_date]['target'].nunique())
print(final_df[final_df['date'] < split_date]['target'].nunique())

# print unique target values in final_df
print(final_df['target'].unique())

1
1
[1. 0.]
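Each date-filtered subset contains a single target value even though `final_df` as a whole contains both. For any non-missing date at least one of the two comparisons is true, so this pattern is what one would see if the rows of one class had missing dates (`NaT`). That is an inference from the printed output, not something the notebook confirms. A minimal sketch on made-up data showing the effect, plus a diagnostic that could be run on `final_df`:

```python
import pandas as pd

# Toy frame (hypothetical values): the negative rows have no date
toy = pd.DataFrame({
    "date": pd.to_datetime([None, "2020-01-01", None, "2019-01-01"]),
    "target": [0.0, 1.0, 0.0, 1.0],
})
split_date = pd.Timestamp("2019-09-21")

# Comparisons against NaT are always False, so rows without a date
# fall on neither side of the split
before = toy[toy["date"] < split_date]
after = toy[toy["date"] >= split_date]
print(len(before), len(after), len(toy))  # 1 1 4

# sort_values places NaT last by default, so a row-count cutoff on the
# sorted frame would put the undated rows at the end (the test portion)
sorted_targets = toy.sort_values("date")["target"].tolist()
print(sorted_targets)  # [1.0, 1.0, 0.0, 0.0]

# Diagnostic: share of missing dates per target class
missing_share = toy["date"].isna().groupby(toy["target"]).mean().to_dict()
print(missing_share)  # {0.0: 1.0, 1.0: 0.0}
```

If the diagnostic shows the same pattern on the real data, the negative samples would need dates assigned before any time-based split can produce a test set with both classes.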


In [124]:
print(final_df['target'].value_counts())

# print len of final_df 
print("Len of final df: ", len(final_df))

# print len of final_df before split date
print("Len of final df before split: ", len(final_df[final_df['date'] < split_date]))

# print len of final_df after split date
print("Len of final df after split: ", len(final_df[final_df['date'] >= split_date]))


print('Before split date:')
print(final_df[final_df['date'] < split_date]['target'].value_counts())

print('After split date:')
print(final_df[final_df['date'] >= split_date]['target'].value_counts())


In [121]:

df = df.sort_values(by='t_dat')

df.head()

Unnamed: 0,t_dat,price,sales_channel_1,sales_channel_2,quantity,article_engagement_ratio,user_index,item_index,FN,Active,...,article_preference,item_purchase_frequency,item_avg_price_level,year,month,day,recency,frequency,monetary_value,RFM Score
0,2018-09-20,0.042358,False,True,1.0,1.0,5,11563,1.0,1.0,...,0.0,0.0,0.042358,2018,9,20,6,861,37.39135,244
134,2018-09-20,0.030487,False,True,1.0,0.5,139,6231,0.0,0.0,...,1.0,1.0,0.028198,2018,9,20,6,646,22.751427,233
135,2018-09-20,0.025406,False,True,1.0,0.052632,139,14568,0.0,0.0,...,1.0,0.583333,0.024521,2018,9,20,6,646,22.751427,233
136,2018-09-20,0.031052,False,True,1.0,0.111111,147,11460,1.0,1.0,...,1.0,1.2,0.031311,2018,9,20,10,688,29.95966,234
137,2018-09-20,0.031067,False,True,1.0,0.2,147,8942,1.0,1.0,...,1.0,1.25,0.029251,2018,9,20,10,688,29.95966,234


In [102]:
final_df['date'] = pd.to_datetime(final_df[['day', 'month', 'year']])
final_df = final_df.sort_values(by='date')

final_df.head()

Unnamed: 0,price,sales_channel_1,sales_channel_2,quantity,article_engagement_ratio,user_index,item_index,FN,Active,club_member_status,...,garment_group_no_1020.0,garment_group_no_1021.0,garment_group_no_1023.0,garment_group_no_1025.0,index_group_no_1.0,index_group_no_2.0,index_group_no_3.0,index_group_no_4.0,index_group_no_26.0,date
0,0.042358,False,True,1.0,1.0,5,11563,1.0,1.0,2.0,...,False,False,False,False,True,False,False,False,False,2018-09-20
2,0.06781,False,True,1.0,1.0,5,14438,1.0,1.0,2.0,...,False,False,False,False,True,False,False,False,False,2018-09-20
3,0.016937,False,True,1.0,0.5,10,10307,0.0,0.0,2.0,...,False,False,False,False,False,True,False,False,False,2018-09-20
4,0.016937,False,True,1.0,0.166667,10,13608,0.0,0.0,2.0,...,False,False,True,False,True,False,False,False,False,2018-09-20
5,0.016937,False,True,1.0,0.333333,10,12935,0.0,0.0,2.0,...,False,False,False,False,False,True,False,False,False,2018-09-20


In [93]:
def time_based_train_test_split3(final_df, test_date):
    # Convert days, months, and years columns to datetime object
    final_df['date'] = pd.to_datetime(final_df[['day', 'month', 'year']])

    # Sort dataframe by date in ascending order
    final_df = final_df.sort_values(by='date')

    # Split dataframe into training and testing data
    train_df = final_df[final_df['date'] < test_date]
    test_df = final_df[final_df['date'] >= test_date]

    # Drop date column from train and test dataframes
    train_df = train_df.drop('date', axis=1)
    test_df = test_df.drop('date', axis=1)

    # split train_df into X_train and y_train
    X_train = train_df.drop('target', axis=1)
    y_train = train_df['target']

    # split test_df into X_test and y_test
    X_test = test_df.drop('target', axis=1)
    y_test = test_df['target']

    return X_train, X_test, y_train, y_test

In [103]:
X_train3, X_test3, y_train3, y_test3 = time_based_train_test_split3(final_df, split_date)

In [104]:

# Split 1: row-count 80/20 split on the date-sorted data
X_train1, X_test1, y_train1, y_test1 = time_based_train_test_split(final_df, test_size=0.2)
# Split 2: random 80/20 split (ignores time)
X_train2, X_test2, y_train2, y_test2 = train_test_split(final_df.drop(['target'], axis=1), final_df['target'], test_size=0.2, random_state=42)


print("---")
print(y_train1.unique())
print(y_test1.unique())

print("---")
print(y_train2.unique())
print(y_test2.unique())

print("---")
print(y_train3.unique())
print(y_test3.unique())


---
[1. 0.]
[0.]
---
[0. 1.]
[0. 1.]
---
[1.]
[1.]


In [79]:
print(final_df['target'].value_counts())
print(y_train1.value_counts())
print(y_test1.value_counts())
print(y_train2.value_counts())
print(y_test2.value_counts())

target
0.0    7676429
1.0     126622
Name: count, dtype: int64
target
0.0    6115818
1.0     126622
Name: count, dtype: int64
target
0.0    1560611
Name: count, dtype: int64
target
0.0    6141050
1.0     101390
Name: count, dtype: int64
target
0.0    1535379
1.0      25232
Name: count, dtype: int64


In [51]:
# Get integer indices of selected features
selected_feature_indices = [X_train.columns.get_loc(col) for col in selected_features]

# Train the LGBMClassifier using the selected features
# (n_estimators is the scikit-learn API name for the number of boosting rounds)
lgbm = lgb.LGBMClassifier(n_estimators=100)
lgbm.fit(X_train_scaled[:, selected_feature_indices], y_train)

# Make predictions on the test set
y_pred = lgbm.predict_proba(X_test_scaled[:, selected_feature_indices])[:, 1]

# Evaluate the model
average_precision = average_precision_score(y_test, y_pred)
print("Average Precision Score:", average_precision)



Average Precision Score: -0.0
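A score of `-0.0` suggests a degenerate evaluation set rather than a genuinely poor ranking. Average precision averages the precision at each positive item, so it has nothing to average when the test labels contain no positives, which is what the value counts for the time-based test split show (assuming `y_test` here is that same split). The sketch below is a plain-Python version of average precision. It is an illustration, not scikit-learn's implementation; the function name and the choice to return `None` in the degenerate case are assumptions.

```python
def average_precision(y_true, y_score):
    """Average precision for binary labels; None when there are no positives."""
    n_pos = sum(1 for label in y_true if label == 1)
    if n_pos == 0:
        return None  # undefined: there is no positive item to rank
    # Rank items by descending score
    ranked = sorted(zip(y_score, y_true), key=lambda pair: pair[0], reverse=True)
    hits, precision_sum = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank  # precision at this positive
    return precision_sum / n_pos


ap_mixed = average_precision([0, 1, 0, 1], [0.1, 0.9, 0.8, 0.7])
ap_no_positives = average_precision([0, 0, 0], [0.2, 0.5, 0.1])
print(ap_mixed)         # 0.8333...
print(ap_no_positives)  # None
```

Checking `y_test.value_counts()` before scoring is a cheap guard against reporting a metric on a test set with a single class.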




In [52]:
average_precision

-0.0

In [48]:
# # Define hyperparameters space
# param_dist = {
#     'lgbm__learning_rate': [0.01, 0.05, 0.1],
#     'lgbm__num_leaves': [15, 31, 63],
#     'lgbm__bagging_fraction': [0.6, 0.8, 1.0],
#     'lgbm__feature_fraction': [0.6, 0.8, 1.0]
#     # 'n_estimators': [100],
#     # 'max_depth': [3, 5, -1]
# }

# # Perform grid search on the pipeline
# clf = GridSearchCV(estimator=pipeline, param_grid=param_dist, cv=2, scoring=get_scorer('average_precision'), n_jobs=-1, verbose=2, refit=True, error_score='raise')
# clf.fit(X_train, y_train, groups=groups_train_flat)

In [130]:
# NOTE: this cell needs `clf` from the grid search cell above, which is currently commented out
# Save the best intermediate model
joblib.dump(clf.best_estimator_, 'best_model.pkl')
print(f'Best hyperparameters: {clf.best_params_}')
print(f'Best average precision (CV): {clf.best_score_}')

# Save the selected features
joblib.dump(list(selected_features), 'lightgbm/selected_features.pkl')

# Evaluate the best model on the test set.
# sklearn's ndcg_score takes 2D arrays of shape (n_queries, n_items) and has no
# group_scores/verbose arguments, so the whole test set is scored as one query here.
y_pred = clf.best_estimator_.predict_proba(X_test)[:, 1]
ndcg = ndcg_score(y_test.to_numpy().reshape(1, -1), y_pred.reshape(1, -1))
print(f'NDCG score on test set: {ndcg}')

Once the model is trained, it can be used to predict the probability of purchase for new user-product pairs, which can be used to generate recommendations for users.

In [53]:
def select_popular_products(df, n_products=500):
    # Group the dataframe by product and sum the quantity for each product
    product_quantities = df.groupby('item_index')['quantity'].sum()
    # Sort the products by quantity in descending order and select the top n_products
    popular_products = product_quantities.sort_values(ascending=False).index.tolist()[:n_products]
    # Filter the dataframe to only include the popular products
    df = df[df['item_index'].isin(popular_products)]
    return df

In [None]:
df.head()

If we treat this as a binary classification problem, then after training the model we can predict the probability that each user purchases each item in a candidate set. Sorting the candidates by descending probability gives the top 12 products, as done below. <br>

A heuristic approach that we use to enhance the LightGBM predictions here: <br>
1. Get a candidate set of the top 500 most popular articles (by total purchase quantity). <br>
2. Add the customer's previously purchased articles to this set. <br>
3. Use LightGBM to predict the purchase probability of each candidate, and keep the top 12. <br>
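The three steps can be sketched end to end in plain Python. Everything here is a toy stand-in: the purchase log, the function names `top_popular` and `recommend`, and the `toy_scores` lookup that plays the role of the model's predicted probabilities are all hypothetical, and the candidate set and cutoff are shrunk from 500 and 12 to 2 and 2 so the result is easy to follow.

```python
from collections import Counter

# Toy purchase log of (user_id, item_id, quantity); all values are hypothetical
purchases = [
    ("u1", "a", 3), ("u1", "b", 1),
    ("u2", "a", 2), ("u2", "c", 5),
    ("u3", "d", 1),
]


def top_popular(purchases, n):
    """Step 1: items with the largest total purchased quantity."""
    totals = Counter()
    for _, item, qty in purchases:
        totals[item] += qty
    return [item for item, _ in totals.most_common(n)]


def recommend(user_id, purchases, score_fn, n_popular=2, k=2):
    """Steps 2 and 3: add the user's history, score candidates, keep the top k."""
    history = [item for user, item, _ in purchases if user == user_id]
    # dict.fromkeys removes duplicates while keeping insertion order
    candidates = list(dict.fromkeys(top_popular(purchases, n_popular) + history))
    ranked = sorted(candidates, key=lambda item: score_fn(user_id, item), reverse=True)
    return ranked[:k]


# Stand-in for the model's predicted purchase probability (hypothetical scores)
toy_scores = {("u3", "a"): 0.2, ("u3", "c"): 0.6, ("u3", "d"): 0.9}


def score_fn(user, item):
    return toy_scores.get((user, item), 0.0)


recs = recommend("u3", purchases, score_fn)
print(recs)  # ['d', 'c']
```

The result stays a ranked list; converting it to a set would lose the order that a MAP-style metric depends on.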

In [54]:
# Assume X is the input data for the LightGBM model
# X has a row for each user-product pair and a binary target indicating whether the user purchased the product or not

# Train the LightGBM model on X
# lgb_model = lgb.LGBMClassifier(**best_params)
# lgb_model.fit(X, y)

# Generate candidate products for each user from popular products plus the user's purchase history.
# ASSUMPTION: `user_products` (a dictionary mapping each user ID to a list of products they've
# purchased) and `create_user_data` (builds the model's feature rows for a user and a candidate
# list) are not defined in this notebook and must be provided before this cell can run.

# The 500 most popular products are the same for every user, so compute them once
popular_products = select_popular_products(df, 500)['item_index'].unique().tolist()

user_candidates = {}
for user_id in user_products:
    # Add user purchase history to candidate list
    user_history = user_products[user_id]
    candidate_products = list(set(popular_products + user_history))

    # Store candidate products for this user
    user_candidates[user_id] = candidate_products

# Predict probabilities of purchase for each candidate product for each user
user_scores = {}
for user_id, candidates in user_candidates.items():
    # Create input data for this user
    user_data = create_user_data(user_id, candidates)
    
    # Predict probabilities using the LightGBM model
    scores = lgbm.predict_proba(user_data)[:, 1]
    
    # Store scores for this user
    user_scores[user_id] = scores

# Rank candidate products for each user and return top 12 as recommendations
recommendations = {}
for user_id, scores in user_scores.items():
    # Sort candidate products by descending score
    candidate_products = user_candidates[user_id]
    sorted_indices = np.argsort(scores)[::-1]

    # Keep the top 12 in ranked order. The purchase history is already part of the
    # candidate set, and wrapping the result in set() would discard the ranking.
    recommendations[user_id] = [candidate_products[i] for i in sorted_indices[:12]]

NameError: name 'user_products' is not defined

Since we are using MAP as the evaluation metric, we could also use the LightGBM ranking API instead of the binary classification API. The code to rank article_ids using the lightgbm ranking API is below.

In [None]:
# import lightgbm as lgb
# from sklearn.model_selection import train_test_split, GridSearchCV
# from sklearn.metrics import make_scorer
# from sklearn.metrics import average_precision_score
# from sklearn.model_selection import ParameterGrid
# import numpy as np
# import pickle
# import os

# target = 'item_index'
# features = final_df.columns.tolist()
# features.remove(target)

# # split the data into training and test sets -- can also do time-based split
# X_train, X_test, y_train, y_test = train_test_split(final_df[features], final_df[target], test_size=0.2, random_state=42)

# # for number of items to rank for each user (group param for ordered ranking)
# num_items_per_user = 12
# user_indices = X_test.index.unique()
# query = [num_items_per_user] * len(user_indices)
# query_ids = []
# for user_index in user_indices:
#     user_indices_repeated = [user_index] * num_items_per_user
#     query_ids.extend(user_indices_repeated)

# train_data = lgb.Dataset(X_train, label=y_train, group=query_ids)

# # MAP@12 metric
# def mean_average_precision(y_true, y_score, k=12):
#     # get the indices of the top k scores
#     top_k_indices = np.argsort(y_score)[::-1][:k]

#     # calculate average precision at k
#     return average_precision_score(y_true[top_k_indices], y_score[top_k_indices])

# # define hyperparameters for tuning
# params = {
# 'objective': 'lambdarank', #using lightgbm ranking API
# 'metric': 'MAP',
# 'learning_rate': 0.05,
# 'num_leaves': 31,
# 'max_depth': 5,
# 'min_data_in_leaf': 50,
# 'feature_fraction': 0.8,
# 'bagging_fraction': 0.8,
# 'bagging_freq': 5
# }

# # create LightGBM model
# model = lgb.LGBMRanker()

# # perform grid search with cross-validation
# param_grid = {
# 'num_leaves': [31, 50, 75],
# 'max_depth': [5, 7, -1],
# 'min_data_in_leaf': [20, 50, 100],
# 'feature_fraction': [0.6, 0.8, 1],
# 'bagging_fraction': [0.6, 0.8, 1],
# 'bagging_freq': [1, 3, 5],
# 'lambda_l1': [0, 1, 2],
# 'lambda_l2': [0, 1, 2]
# }

# best_map_score = 0.0
# best_model = None

# for params_dict in ParameterGrid(param_grid):
#     params.update(params_dict)
#     model = lgb.train(params, train_data)
#     y_pred = model.predict(X_test, group=query)
#     map_score = mean_average_precision(y_test, y_pred, k=12)
#     if map_score > best_map_score:
#         best_map_score = map_score
#         best_model = model
#         with open(f"lightgbm/grid_search_model_{map_score:.4f}.pickle", 'wb') as f:
#             pickle.dump(model, f)

# # save the best model
# if not os.path.exists('lightgbm'):
#     os.makedirs('lightgbm')
# with open('lightgbm/best_model.pickle', 'wb') as f:
#     pickle.dump(best_model, f)

# # print the best MAP score
# print(f"Best mean average precision: {best_map_score}")
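For reference, MAP@12 can be computed per user and then averaged, rather than over pooled predictions as in the commented-out `mean_average_precision` helper. The sketch below is plain Python and follows one common definition that divides each user's score by `min(number of relevant items, k)`; the function names and toy data are assumptions for illustration.

```python
def average_precision_at_k(actual, predicted, k=12):
    """AP@k for one user: `actual` is a set of purchased items, `predicted` a ranked list."""
    if not actual:
        return 0.0
    hits, score = 0, 0.0
    seen = set()
    for rank, item in enumerate(predicted[:k], start=1):
        # Count each relevant item once, even if it is predicted twice
        if item in actual and item not in seen:
            hits += 1
            score += hits / rank
        seen.add(item)
    return score / min(len(actual), k)


def mean_average_precision_at_k(actual_by_user, predicted_by_user, k=12):
    """Mean of AP@k over all users that have ground-truth purchases."""
    users = list(actual_by_user)
    total = sum(
        average_precision_at_k(set(actual_by_user[u]), predicted_by_user.get(u, []), k)
        for u in users
    )
    return total / len(users)


# Toy ground truth and ranked predictions (hypothetical values)
actual_by_user = {"u1": ["a", "c"], "u2": ["x"]}
predicted_by_user = {"u1": ["a", "b", "c"], "u2": ["y", "x"]}

map_score = mean_average_precision_at_k(actual_by_user, predicted_by_user, k=12)
print(map_score)  # (5/6 + 1/2) / 2 = 0.6666...
```

Here user `u1` scores (1/1 + 2/3) / 2 and user `u2` scores 1/2, so the mean is 2/3.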