# **<a id="Content">HnM RecSys Notebook 9417</a>**

## **<a id="Content">Table of Contents</a>**
* [**<span>1. Imports</span>**](#Imports)  
* [**<span>2. Helper Functions/Decorators</span>**](#Helper-Functions)
* [**<span>5. LightGBM Model</span>**](#LightGBM-Model) 

## Imports

In [19]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns
import os
import re
import warnings
import scipy.sparse as sparse
from scipy.sparse.linalg import spsolve

## Helper-Functions

In [20]:
from datetime import datetime, timedelta

# only use last x weeks of transactions data since data is too large
def filter_transactions_last_x_weeks(transactions, x = 10):
    # Convert date strings to datetime objects
    transactions['t_dat'] = pd.to_datetime(transactions['t_dat'])

    # Calculate the date x weeks ago from the latest transaction date
    latest_date = transactions['t_dat'].max()
    cutoff_date = latest_date - timedelta(weeks=x)

    # Filter transactions to only include those in the last x weeks
    filtered_transactions = transactions.loc[transactions['t_dat'] >= cutoff_date].copy()

    return filtered_transactions

In [21]:
def filter_customers_and_articles(customers, articles, filtered_transactions):
    # Get unique customer and article IDs from filtered transactions
    customer_ids = filtered_transactions['customer_id'].unique()
    article_ids = filtered_transactions['article_id'].unique()

    # Filter customers and articles to only include those in filtered transactions
    customers_filtered = customers.loc[customers['customer_id'].isin(customer_ids)].copy()
    articles_filtered = articles.loc[articles['article_id'].isin(article_ids)].copy()

    return customers_filtered, articles_filtered

## LightGBM Model

A comparison of three widely used GBDT libraries. LightGBM is often the fastest to train, although the gap depends on the dataset and the parameter settings.

|Feature|LightGBM|XGBoost|CatBoost|
|:----|:----|:----|:----|
|Categoricals|Handles categorical features natively (integer-coded or pandas `category` columns); one-hot encoding is not required|Traditionally requires encoding (e.g. one-hot); newer versions add native support via `enable_categorical`|Handles categorical features natively using ordered target statistics|
|Speed|Very fast training and prediction|Fast training and prediction|Often slower to train than LightGBM and XGBoost, depending on data and settings|
|Class imbalance|Handles unbalanced classes via `is_unbalance` or `scale_pos_weight`|Handles unbalanced classes via `scale_pos_weight`|Handles unbalanced classes via `class_weights`, `scale_pos_weight`, or `auto_class_weights`|
|Handling NaNs|Handles NaN values natively|Handles missing values natively (learns a default split direction)|Handles NaN values in numerical features natively (`nan_mode`)|
|Custom Loss|Supports custom loss functions|Supports custom loss functions|Supports custom loss functions|


To use LightGBM for a ranking problem, we treat this as a binary classification problem where the target variable is whether an item is relevant or not to the user.
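A binary target only carries signal when both classes are present, so purchase rows (target 1) need to be paired with non-purchase rows (target 0). The sketch below shows one simple way to do that, by drawing random unpurchased items per user. It is a minimal illustration on toy data, not the sampling scheme used to build `final_df`; the function name `sample_negatives` and its parameters are hypothetical.

```python
import random

def sample_negatives(purchases, all_items, n_neg_per_pos=1, seed=42):
    """Build (user, item, target) rows: observed purchases get target 1,
    randomly drawn unpurchased items get target 0."""
    rng = random.Random(seed)
    rows = []
    for user, bought in purchases.items():
        bought = set(bought)
        # Items this user has not purchased form the pool of possible negatives
        not_bought = sorted(set(all_items) - bought)
        n_neg = min(len(not_bought), n_neg_per_pos * len(bought))
        rows += [(user, item, 1) for item in sorted(bought)]
        rows += [(user, item, 0) for item in rng.sample(not_bought, n_neg)]
    return rows

# Toy data (hypothetical): two users, five items
purchases = {0: [10, 11], 1: [12]}
all_items = [10, 11, 12, 13, 14]
rows = sample_negatives(purchases, all_items)
```

Random negatives are the simplest choice; sampling negatives from popular items is a common alternative when random items are too easy to tell apart from purchases.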

Alternatively, we can use LightGBM's ranking API, which is designed for ranking problems. Instead of optimizing a classification loss, its ranking objectives (for example `lambdarank`) optimize the ordering of items within each user's group; ranking metrics such as NDCG or MAP are then used for evaluation.
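For reference, MAP at a cutoff of 12 (matching the top-12 recommendations produced later in this notebook) can be computed as in the sketch below. The function names are illustrative and are not part of LightGBM; the normalization by `min(len(actual), k)` is one common convention and is an assumption here.

```python
def average_precision_at_k(actual, predicted, k=12):
    """Average precision of one ranked list against a set of relevant items."""
    actual = set(actual)
    if not actual:
        return 0.0
    hits = 0
    score = 0.0
    seen = set()
    for rank, item in enumerate(predicted[:k], start=1):
        # Count each relevant item once, at the rank where it first appears
        if item in actual and item not in seen:
            hits += 1
            score += hits / rank
        seen.add(item)
    return score / min(len(actual), k)

def mean_average_precision_at_k(actual_lists, predicted_lists, k=12):
    """Mean of the per-user average precision values."""
    scores = [
        average_precision_at_k(a, p, k)
        for a, p in zip(actual_lists, predicted_lists)
    ]
    return sum(scores) / len(scores) if scores else 0.0

# Toy example (hypothetical item ids)
ap_user_a = average_precision_at_k([1, 2], [1, 3, 2])
ap_user_b = average_precision_at_k([5], [1, 5])
map_score = mean_average_precision_at_k([[1, 2], [5]], [[1, 3, 2], [1, 5]])
```

In the toy example, the first user has hits at ranks 1 and 3, and the second user has a single hit at rank 2.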

### Feature Engineering

In [22]:
# LightGBM imports

from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.feature_selection import RFE
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import make_scorer
import lightgbm as lgb
from hyperopt import fmin, tpe, hp, STATUS_OK, Trials

In [23]:
import pickle

# open user_item_matrix_200
with open('user_item_matrix_200.pkl', 'rb') as f:
    user_item_matrix = pickle.load(f)

# open customer and article indices maps
with open('lightgbm/customer_id_indices_map.pkl', 'rb') as f:
    customer_id_indices_map = pickle.load(f)

with open('lightgbm/article_id_indices_map.pkl', 'rb') as f:
    article_id_indices_map = pickle.load(f)

# load df from pickle file for time-based split
with open('lightgbm/df.pkl', 'rb') as f:
    df = pickle.load(f)

# load final_df from pickle file for clean processing
with open('lightgbm/final_df_with_binary_targets.pkl', 'rb') as f:
    final_df = pickle.load(f)

### Model Training

In [24]:
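# drop all rows where target is 0 (no purchase)
# note: this leaves a single class (target == 1), so a binary classifier
# trained on final_df has no negative examples to learn from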
final_df = final_df[final_df['target'] == 1].copy()

# Convert days, months, and years columns to datetime object
final_df['date'] = pd.to_datetime(final_df[['day', 'month', 'year']])


In [25]:
# target encoding
from category_encoders import TargetEncoder
from sklearn.model_selection import KFold

# Define columns to target encode
cols_to_encode = ['department_no', 'product_type_no', 'section_no', 'graphical_appearance_no']

# Define number of folds for cross-validation
n_splits = 5

# Create KFold object for cross-validation
kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)

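# Perform out-of-fold target encoding: each row is encoded by an encoder fitted on the other folds
for col in cols_to_encode:
    final_df[f'{col}_te'] = 0.0  # float column, since the encodings are floats
    te_col_pos = final_df.columns.get_loc(f'{col}_te')
    te = TargetEncoder(cols=[col])
    for train_idx, val_idx in kf.split(final_df):
        te.fit(final_df.iloc[train_idx][[col]], final_df.iloc[train_idx]['target'])
        # kf.split yields positional indices, so write back with iloc rather than loc
        final_df.iloc[val_idx, te_col_pos] = te.transform(final_df.iloc[val_idx][[col]]).values.flatten()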

In [26]:
# ---- memory optimizations -------------

# reference: https://www.kaggle.com/arjanso/reducing-dataframe-memory-size-by-65

# Iterate through all the columns of a dataframe and downcast the int and float data types to the smallest size that still holds their values.
# For example, customer_id should not be reduced from int64 to a smaller type that cannot hold its range, as that would cause collisions.
import numpy as np
import pandas as pd

def reduce_mem_usage(df):
    """Iterate over all the columns of a DataFrame and modify the data type
    to reduce memory usage, handling ordered Categoricals"""
    
    # check the memory usage of the DataFrame
    start_mem = df.memory_usage().sum() / 1024**2
    print("Memory usage of dataframe is {:.2f} MB".format(start_mem))
    
    for col in df.columns:
        col_type = df[col].dtype
        
        if col_type == 'category':
            if df[col].cat.ordered:
                # Convert ordered Categorical to an integer
                df[col] = df[col].cat.codes.astype('int16')
            else:
                # Convert unordered Categorical to a string
                df[col] = df[col].astype('str')
        
        elif pd.api.types.is_integer_dtype(col_type):
            c_min = df[col].min()
            c_max = df[col].max()
            # Downcast to the smallest signed integer type that holds the column's range
            for int_type in (np.int8, np.int16, np.int32, np.int64):
                if c_min >= np.iinfo(int_type).min and c_max <= np.iinfo(int_type).max:
                    df[col] = df[col].astype(int_type)
                    break

        elif pd.api.types.is_float_dtype(col_type):
            c_min = df[col].min()
            c_max = df[col].max()
            # Downcast floats; note that float16 keeps only about 3 significant digits
            if c_min >= np.finfo(np.float16).min and c_max <= np.finfo(np.float16).max:
                df[col] = df[col].astype(np.float16)
            elif c_min >= np.finfo(np.float32).min and c_max <= np.finfo(np.float32).max:
                df[col] = df[col].astype(np.float32)
            else:
                df[col] = df[col].astype(np.float64)
        # Other dtypes (bool, datetime, object) are left unchanged
    
    # check the memory usage after optimization
    end_mem = df.memory_usage().sum() / 1024**2
    print("Memory usage after optimization is: {:.2f} MB".format(end_mem))

    # calculate the percentage of the memory usage reduction
    mem_reduction = 100 * (start_mem - end_mem) / start_mem
    print("Memory usage decreased by {:.1f}%".format(mem_reduction))
    
    return df

In [27]:
def time_based_train_test_split(final_df, test_size=0.2):

    # Sort dataframe by date in ascending order
    final_df = final_df.sort_values(by='date')

    # Calculate cutoff index
    cutoff_index = int(len(final_df) * (1-test_size))

    # Create train and test dataframes
    train_df = final_df[:cutoff_index]
    test_df = final_df[cutoff_index:]

    # Drop date column from train and test dataframes
    train_df = train_df.drop('date', axis=1)
    test_df = test_df.drop('date', axis=1)

    # split train_df into X_train and y_train
    X_train = train_df.drop('target', axis=1)
    y_train = train_df['target']

    # split test_df into X_test and y_test
    X_test = test_df.drop('target', axis=1)
    y_test = test_df['target']

    return X_train, X_test, y_train, y_test

In [28]:
# An 80/20 time-based split (commented out below) would curb data leakage;
# the active code uses a random 80/20 split instead.
# X_train, X_test, y_train, y_test = time_based_train_test_split(final_df, test_size=0.2)
# final_df_top_50 = final_df_top_50.drop('date', axis=1)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(final_df.drop(['target'], axis=1), final_df['target'], test_size=0.2, random_state=42)

# drop the date column from X_train, X_test
X_train = X_train.drop('date', axis=1)
X_test = X_test.drop('date', axis=1)

# reduce memory usage
X_train = reduce_mem_usage(X_train)
X_test = reduce_mem_usage(X_test)

# print the shape of X_train, X_test, y_train, y_test
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

Memory usage of dataframe is 25.70 MB
Memory usage after optimization is: 12.17 MB
Memory usage decreased by 52.6%
Memory usage of dataframe is 6.42 MB
Memory usage after optimization is: 3.04 MB
Memory usage decreased by 52.6%
(101297, 59)
(25325, 59)
(101297,)
(25325,)


In [29]:
import lightgbm as lgb
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import ndcg_score, average_precision_score
from sklearn.feature_selection import RFECV
import joblib
from sklearn.metrics import get_scorer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

In [30]:
# Define features and target
features = final_df.columns.tolist()
features.remove('target')
target = 'target'

# Group data by user so that LightGBM knows which rows belong to each user and can compute ranking metrics correctly.
# LightGBM expects `group` to hold the sizes of consecutive query groups, so rows are first sorted by user.
train_order = np.argsort(X_train['user_index'].to_numpy(), kind='stable')
test_order = np.argsort(X_test['user_index'].to_numpy(), kind='stable')
X_train_ranked, y_train_ranked = X_train.iloc[train_order], y_train.iloc[train_order]
X_test_ranked, y_test_ranked = X_test.iloc[test_order], y_test.iloc[test_order]

# Number of rows per user, in the same order as the sorted rows
groups_train = X_train_ranked.groupby('user_index', sort=False).size().to_numpy()
groups_test = X_test_ranked.groupby('user_index', sort=False).size().to_numpy()

# Create LightGBM datasets with group query information
train_data = lgb.Dataset(X_train_ranked, label=y_train_ranked, group=groups_train)
test_data = lgb.Dataset(X_test_ranked, label=y_test_ranked, group=groups_test)

In [31]:
# Define the LightGBM dataset for training and validation
train_data = lgb.Dataset(X_train, label=y_train)
val_data = lgb.Dataset(X_test, label=y_test)

# Define the parameters for the LightGBM model
params = {
    'objective': 'binary',
    'metric': 'binary_logloss',
    'boosting_type': 'gbdt',
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'verbose': 0
}

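# Train the LightGBM model with early stopping on the validation set.
# The early-stopping callback is used because the early_stopping_rounds
# argument of lgb.train() was removed in LightGBM 4.0.
num_rounds = 1000
lgb_model = lgb.train(
    params,
    train_data,
    num_boost_round=num_rounds,
    valid_sets=[train_data, val_data],
    callbacks=[lgb.early_stopping(stopping_rounds=50)],
)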



You can set `force_col_wise=true` to remove the overhead.
[1]	training's binary_logloss: 0	valid_1's binary_logloss: 0
Training until validation scores don't improve for 50 rounds
[2]	training's binary_logloss: 0	valid_1's binary_logloss: 0
[3]	training's binary_logloss: 0	valid_1's binary_logloss: 0
[4]	training's binary_logloss: 0	valid_1's binary_logloss: 0
[5]	training's binary_logloss: 0	valid_1's binary_logloss: 0
[6]	training's binary_logloss: 0	valid_1's binary_logloss: 0
[7]	training's binary_logloss: 0	valid_1's binary_logloss: 0
[8]	training's binary_logloss: 0	valid_1's binary_logloss: 0
[9]	training's binary_logloss: 0	valid_1's binary_logloss: 0
[10]	training's binary_logloss: 0	valid_1's binary_logloss: 0
[11]	training's binary_logloss: 0	valid_1's binary_logloss: 0
[12]	training's binary_logloss: 0	valid_1's binary_logloss: 0
[13]	training's binary_logloss: 0	valid_1's binary_logloss: 0
[14]	training's binary_logloss: 0	valid_1's binary_logloss: 0
[15]	training's binary

Once the model is trained, it can be used to predict the probability of purchase for new user-product pairs, which can be used to generate recommendations for users.

If we treat this as a binary classification problem, the trained model gives, for each user, the predicted probability of purchasing each item in a candidate set. Sorting these probabilities in descending order yields the top 12 products, as done below. <br>

A heuristic approach that we use to enhance LightGBM predictions here: <br>
1. Get a candidate set of the top 500 most popular articles (by total purchase quantity). <br>
2. Add the customer's purchase history to this set. <br>
3. Use LightGBM to predict the probability of purchase for each candidate, and take the top 12. <br>
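
The three steps above can be sketched in plain Python on toy data. This is a simplified illustration, not the notebook's implementation: `build_candidates` and `top_k` are hypothetical helper names, and a fixed score dictionary stands in for the trained model's predicted probabilities.

```python
from collections import Counter

def build_candidates(transactions, user_history, n_popular=500):
    """Steps 1 and 2: most popular items by total quantity, plus the user's history.

    transactions is an iterable of (user, item, quantity) tuples."""
    totals = Counter()
    for _, item, qty in transactions:
        totals[item] += qty
    popular = [item for item, _ in totals.most_common(n_popular)]
    # dict.fromkeys removes duplicates while keeping the original order
    return list(dict.fromkeys(popular + list(user_history)))

def top_k(candidates, score_fn, k=12):
    """Step 3: rank candidates by descending score and keep the first k."""
    return sorted(candidates, key=score_fn, reverse=True)[:k]

# Toy data (hypothetical): item 'a' sold 4 units, 'c' 2 units, 'b' 1 unit
transactions = [(0, 'a', 3), (1, 'b', 1), (1, 'a', 1), (2, 'c', 2)]
candidates = build_candidates(transactions, user_history=['d'], n_popular=2)

# Stand-in for model probabilities; a real pipeline would call the trained model
scores = {'a': 0.2, 'c': 0.9, 'd': 0.5}
recommended = top_k(candidates, lambda item: scores.get(item, 0.0), k=2)
```

Keeping the popular items in popularity order (rather than passing them through a `set`) makes the candidate list reproducible across runs.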

In [32]:
# dictionary 'user_products' that maps each user ID to a list of products they've purchased from the user-item matrix

user_products = {}
for user_idx in range(user_item_matrix.shape[0]):
    purchased_items = list(np.where(user_item_matrix[user_idx, :].toarray()[0] == 1)[0])
    user_products[user_idx] = purchased_items

In [33]:
# returns the set of most popular products in the catalog

def select_popular_products(df, n_products=500):
    # Group the dataframe by user and product and sum the quantity for each group
    product_quantities = df.groupby(['user_index', 'item_index'])['quantity'].sum()
    # Sort the products by quantity in descending order and select the top n_products
    popular_products = product_quantities.groupby('item_index').sum().sort_values(ascending=False).index.tolist()[:n_products]
    # return only the unique item_index values
    return list(set(popular_products))

In [34]:
# Generate candidate products for each user
# This can be done using a combination of popular products and user purchase history

popular_products = select_popular_products(final_df, 500)
print(len(popular_products))
# print first 20 popular products
print(popular_products[:20])

user_candidates = {}
for user_id in user_products:
    
    # Add user purchase history to candidate list
    user_history = user_products[user_id]
    candidate_products = list(set(popular_products + user_history))
    
    # Store candidate products (dataframes) for this user
    user_articles = final_df[final_df['item_index'].isin(candidate_products)]
    user_article_info = user_articles.groupby('item_index').first().reset_index()
    user_article_info = user_article_info.drop(['date', 'target'], axis=1)
    user_candidates[user_id] = user_article_info


500
[0, 1, 28674, 28675, 28676, 7, 6154, 14352, 14354, 6171, 2077, 45, 16432, 49, 16433, 16434, 20531, 22581, 16437, 8247]


In [35]:
# check where columns of X_test and user_candidates[0] don't match
for col in user_candidates[0].columns:
    if col not in X_test.columns:
        print(col)

In [36]:
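for user_id in range(user_item_matrix.shape[0]):
    # Reorder candidate columns to match the training features, since by default
    # LightGBM matches DataFrame columns by position when predicting
    user_candidate_pool = user_candidates[user_id][X_train.columns]
    item_probs = lgb_model.predict(user_candidate_pool)
    # Sort by descending predicted probability and keep the top 12
    top_items = user_candidate_pool.iloc[item_probs.argsort()[::-1][:12]]
    print(f"Top recommended items for user {user_id}:")
    print(top_items['item_index'].values)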

Top recommended items for user 0:
[38856 11466 12649 12643 12621 12564 12507 12493 12395 12315 12287 12257]
Top recommended items for user 1:
[37234  8419  7161  7120  7082  7080  7033  6961  6925  6895  6876  6874]
Top recommended items for user 2:
[38511 13926 13844 13840 13838 13802 13655 13610 13543 13542 13541 13418]
Top recommended items for user 3:
[38068 10366 10219 10209 10207 10087  9982  9952  9943  9758  9701  9579]
Top recommended items for user 4:
[38886 14132 14596 14568 14520 14419 14360 14354 14352 14280 14219 14197]
Top recommended items for user 5:
[38821 15100 14966 14967 14999 15024 15062 15064 15084 15142 14948 15155]
Top recommended items for user 6:
[38649 11975 11822 11821 11645 11552 11470 11280 11219 11210 11130 11105]
Top recommended items for user 7:
[38562 38072 12621 12507 12396 12394 12202 12077 12029 12014 12012 12010]
Top recommended items for user 8:
[38853 12014 12643 12621 12616 12507 12310 12279 12275 12265 12202 12062]
Top recommended items for us