**AIM of the notebook**

The aim of this notebook is to implement LightFM. I kept the EDA short because there are already many great public notebooks covering it!

IF YOU FIND THIS NOTEBOOK USEFUL THEN PLEASE UPVOTE!

**Background**

We can divide recommendation models into two categories:

1) Content-based model

2) Collaborative filtering model

A content-based model recommends items based on the similarity of items and/or users, using their descriptions or profiles. A collaborative filtering model, on the other hand, learns latent factors for users and items from their interactions. It relies on the assumption that if a group of people expressed similar opinions on one item, these people will tend to have similar opinions on other items.

**LightFM**

LightFM is a Python implementation of hybrid recommendation algorithms for both implicit and explicit feedback.

It is a hybrid content-collaborative model which represents users and items as linear combinations of their content features’ latent factors. The model learns embeddings or latent representations of the users and items in such a way that it encodes user preferences over items. These representations produce scores for every item for a given user; items scored highly are more likely to be interesting to the user.

An embedding is estimated for every user and item feature, and these feature embeddings are then added together to form the final representations of users and items.

For example, for user i, the model retrieves the i-th row of the user feature matrix to find the features with non-zero weights. The embeddings of these features are then summed to form the user representation. E.g. if user 10 has weight 1 in the 5th column of the user feature matrix and weight 3 in the 20th column, user 10's representation is the sum of the embeddings of the 5th and 20th features, each multiplied by its weight. The representation of each item is computed in the same way.
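The user 10 example can be written out with NumPy. This is only a toy sketch: the embedding matrix is random, and the names `feature_embeddings` and `user_10_features` are made up for illustration and are not LightFM attributes.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_components = 25, 4

# One embedding row per user feature (random stand-ins for learned values)
feature_embeddings = rng.normal(size=(n_features, n_components))

# Row 10 of a hypothetical user feature matrix: weight 1 in the 5th column
# and weight 3 in the 20th column (0-based indices 4 and 19)
user_10_features = np.zeros(n_features)
user_10_features[4] = 1.0
user_10_features[19] = 3.0

# Weighted sum of the embeddings of the features with non-zero weights
user_10_repr = user_10_features @ feature_embeddings
print(user_10_repr)
```

The matrix product is just a compact way of writing `1 * feature_embeddings[4] + 3 * feature_embeddings[19]`.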

**Notebook Logs:-**

**update 1:**

Added the submission pipeline and a trained model.

For now the submission takes 13 hours to run, which is why I am not committing the notebook until I find a way to cut the runtime.

**Importing Libraries**

In [None]:
import os
import cv2
import tqdm
from PIL import Image
import seaborn as sns
from matplotlib import pyplot as plt

import pandas as pd
import numpy as np
from lightfm import LightFM
from lightfm.data import Dataset

# Import LightFM's evaluation metrics
from lightfm.evaluation import precision_at_k

%matplotlib inline
SEED = 42
np.random.seed(SEED)
os.environ['PYTHONHASHSEED'] = str(SEED)

**Configs**

In [None]:
# default number of recommendations
K = 12
EPOCHS = 1

# model learning rate
LEARNING_RATE = 0.25
# no of latent factors
NO_COMPONENTS = 20

# no of threads to fit model
NO_THREADS = 32
# regularisation for both user and item features
ITEM_ALPHA=1e-6
USER_ALPHA=1e-6

**Load the data**

In [None]:
main_dir = "../input/h-and-m-personalized-fashion-recommendations"
images_dir = main_dir+"/images/" 
customers = pd.read_csv(main_dir+"/customers.csv")
articles = pd.read_csv(main_dir+"/articles.csv", dtype={'article_id': str})
sample_submission = pd.read_csv(main_dir+"/sample_submission.csv", dtype={'article_id': str})

train = pd.read_csv(main_dir+"/transactions_train.csv", dtype={'article_id': str}, parse_dates=['t_dat'])

**article_id** : A unique identifier of every article.

**product_code, prod_name** : A unique identifier of every product and its name (not the same).

**product_type_no, product_type_name** : The group of product_code and its name

**graphical_appearance_no, graphical_appearance_name** : The group of graphics and its name

**colour_group_code, colour_group_name** : The group of colour and its name

**perceived_colour_value_id, perceived_colour_value_name, perceived_colour_master_id, perceived_colour_master_name** : Additional colour information

**department_no, department_name** : A unique identifier of every department and its name

**index_code, index_name** : A unique identifier of every index and its name

**index_group_no, index_group_name** : A group of indices and its name

**section_no, section_name** : A unique identifier of every section and its name

**garment_group_no, garment_group_name** : A unique identifier of every garment group and its name

**Some Basic EDA**

Age Distribution

In [None]:
sns.set_style("darkgrid")
f, ax = plt.subplots(figsize=(10,5))
ax = sns.histplot(data=customers, x='age', bins=50, color='orange')
ax.set_xlabel('Age')
ax.set_title('Distribution of the customers age')
plt.show()

Price

In [None]:
sns.set_style("darkgrid")
f, ax = plt.subplots(figsize=(10,5))
ax = sns.boxplot(data=train, x='price', color='orange')
ax.set_xlabel('Price')
plt.show()

In product_group_name, the Lower/Upper/Full body groups have a huge price variance.

In [None]:
sns.set_style("darkgrid")
f, ax = plt.subplots(figsize=(25,18))
# Attach article metadata to each transaction so prices can be grouped by product group
train_with_articles = train[['customer_id', 'article_id', 'price', 't_dat']].merge(
    articles[['article_id', 'prod_name', 'product_type_name', 'product_group_name', 'index_name']],
    on='article_id', how='left')
ax = sns.boxplot(data=train_with_articles, x='price', y='product_group_name')
ax.set_xlabel('Price outliers', fontsize=22)
ax.set_ylabel('Product group names', fontsize=22)
ax.xaxis.set_tick_params(labelsize=22)
ax.yaxis.set_tick_params(labelsize=22)

plt.show()

Some Images

In [None]:
total_folders = 0
total_files = 0
folder_info = []
images_names = []


for base, dirs, files in tqdm.tqdm(os.walk(main_dir)):
    for directories in dirs:
        folder_info.append((directories, 
                            len(os.listdir(os.path.join(base, directories)))))
        total_folders = total_folders + 1
    
    for _files in files:
        total_files = total_files + 1
        # Collect the file name without its extension for every .jpg image
        if _files.endswith(".jpg"):
            images_names.append(os.path.splitext(_files)[0])

image_name_df = pd.DataFrame(images_names, columns = ["image_name"])
image_name_df["article_id"] = image_name_df["image_name"].apply(lambda x: int(x[1:]))
image_name_df.head().style.set_properties(**{'background-color': 'rgba(184,230,194,.5)'})

articles_df = pd.read_csv("../input/h-and-m-personalized-fashion-recommendations/articles.csv")
image_article_df = articles_df[["article_id", 
                                "product_code", 
                                "product_group_name", 
                                "product_type_name"]].merge(image_name_df, 
                                                            on=["article_id"], 
                                                            how="left")
image_article_df.head().style.set_properties(**{'background-color': 'rgba(184,230,194,.5)'})

def plot_image_samples(image_article_df, product_group_name, cols=1, rows=1):
    image_path = "../input/h-and-m-personalized-fashion-recommendations/images/"
    # Keep only articles from this product group that actually have an image file
    _df = image_article_df.loc[(image_article_df.product_group_name==product_group_name)
                               & image_article_df.image_name.notna()]
    article_ids = _df.article_id.values[0:cols*rows]
    plt.figure(figsize=(2 + 3 * cols, 2 + 4 * rows))
    # Iterate over the ids found, which may be fewer than cols * rows
    for i, raw_id in enumerate(article_ids):
        # Restore the leading zero that was dropped when the id was cast to int
        article_id = ("0" + str(raw_id))[-10:]
        plt.subplot(rows, cols, i + 1)
        plt.axis('off')
        plt.title(f"{product_group_name} {article_id[:3]}\n{article_id}.jpg")
        image = Image.open(f"{image_path}{article_id[:3]}/{article_id}.jpg")
        plt.imshow(image)
        
plot_image_samples(image_article_df, "Garment Lower body", 5, 1)
plot_image_samples(image_article_df, "Accessories", 5, 1)
plot_image_samples(image_article_df, "Swimwear", 5, 1)
plot_image_samples(image_article_df, "Bags", 5, 1)

Now let's get back to the main objective 

In [None]:
dataset = Dataset()
dataset.fit(users=customers['customer_id'], 
            items=articles['article_id'])

num_users, num_items = dataset.interactions_shape()
print(f'Number of users: {num_users}, Number of items: {num_items}.')

**Make Train and Validation Set**

We use the last week before the test week to validate our model.

As Chris Deotte pointed out, a more robust validation uses "folds" with 5 validation periods. For "fold 1", use the last week of train for validation and train the model on the weeks prior. For "fold 2", use the second-to-last week of train and train the model on the weeks prior to it, and so on for folds 3, 4 and 5. [link](https://www.kaggle.com/c/h-and-m-personalized-fashion-recommendations/discussion/307517#1697263)

Train TimeLine : start   ->   2020-09-15

Validation TimeLine : 2020-09-16   ->   2020-09-22

Test TimeLine  :  2020-09-23   ->  2020-09-29
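The fold scheme described above could be generated along these lines. This is a minimal sketch, not part of the pipeline in this notebook: the helper name `weekly_folds` is hypothetical, and it assumes every validation window is exactly 7 days and that each fold trains on everything up to the day before its window.

```python
from datetime import date, timedelta

def weekly_folds(last_train_day, n_folds=5):
    """Return (train_end, val_start, val_end) date tuples, most recent fold first."""
    folds = []
    for k in range(n_folds):
        # Each fold's validation week ends k weeks before the last day of train
        val_end = last_train_day - timedelta(weeks=k)
        val_start = val_end - timedelta(days=6)
        # Train on everything up to the day before the validation week
        train_end = val_start - timedelta(days=1)
        folds.append((train_end, val_start, val_end))
    return folds

folds = weekly_folds(date(2020, 9, 22))
for train_end, val_start, val_end in folds:
    print(f"train: start -> {train_end} | validation: {val_start} -> {val_end}")
```

The first fold reproduces the train and validation timeline listed above; the remaining folds shift both windows back one week at a time.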

In [None]:
#train_set = train[(train.t_dat>='2020-8-26')&(train.t_dat<='2020-9-15')]
train_set = train[train.t_dat<='2020-9-15']
val_set = train[(train.t_dat>='2020-9-16')&(train.t_dat<='2020-9-22')]

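# Select the (customer_id, article_id) pairs by name rather than by column position
(interactions, weights) = dataset.build_interactions(train_set[['customer_id', 'article_id']].values)
(val_interactions, val_weights) = dataset.build_interactions(val_set[['customer_id', 'article_id']].values)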
print(interactions.shape, val_interactions.shape)

LightFM works slightly differently from other packages: it expects the train and test interaction matrices to have the same dimensions, so double check the shapes printed above.
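A toy illustration of why the shapes match when the id mappings are built from the full customer and article lists rather than from each split. This sketch uses dense NumPy arrays and a hypothetical `build_matrix` helper for readability; it is not how LightFM stores interactions internally.

```python
import numpy as np

def build_matrix(pairs, user_index, item_index):
    """Dense 0/1 interaction matrix whose shape is fixed by the full id mappings."""
    mat = np.zeros((len(user_index), len(item_index)), dtype=np.int8)
    for user, item in pairs:
        mat[user_index[user], item_index[item]] = 1
    return mat

# Mappings are built once from ALL known users and items, not from each split
all_users = ["u1", "u2", "u3"]
all_items = ["a", "b", "c", "d"]
user_index = {u: i for i, u in enumerate(all_users)}
item_index = {a: i for i, a in enumerate(all_items)}

train_pairs = [("u1", "a"), ("u2", "b"), ("u2", "c")]
val_pairs = [("u3", "d")]  # a user and an item that never appear in train

train_mat = build_matrix(train_pairs, user_index, item_index)
val_mat = build_matrix(val_pairs, user_index, item_index)
print(train_mat.shape, val_mat.shape)
```

Both matrices share one shape even though the validation split contains a user and an item that the train split never sees.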

**Declare and Fit the LightFM model**

In this notebook, the LightFM model uses the Weighted Approximate-Rank Pairwise (WARP) loss.

In general, it maximises the rank of positive examples by repeatedly sampling negative examples until a rank violation is found. This approach is recommended when only positive interactions are present.

In [None]:
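# Pass the regularisation and thread settings declared in the Configs cell
model = LightFM(loss='warp', no_components=NO_COMPONENTS,
                 learning_rate=LEARNING_RATE,
                 item_alpha=ITEM_ALPHA, user_alpha=USER_ALPHA,
                 random_state=np.random.RandomState(SEED))
model.fit(interactions=interactions, epochs=EPOCHS,
          num_threads=NO_THREADS, verbose=1)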

**Weighted Approximate-Rank Pairwise loss**

WARP loss was first introduced in 2011. It was used to assign to an image the correct label from a very large set of possible labels. The original motivation for this loss, which in particular has a novel sampling technique, was memory efficiency. However, the sampling technique has additional benefits that make it well suited to training a recommender system.

**How does WARP loss work?**

At a high level, WARP loss will randomly sample output labels of a model, until it finds a pair which it knows are wrongly labelled, and will then only apply an update to these two incorrectly labelled examples.
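The sampling step can be sketched for a single positive item as follows. This is a simplified illustration and not LightFM's actual implementation: the function name `warp_sample`, the margin of 1.0 and the log-of-estimated-rank weight are assumptions chosen for the example.

```python
import numpy as np

def warp_sample(scores, positive_item, rng, margin=1.0, max_trials=None):
    """Sample negatives until one violates the margin.

    Returns (negative_item, update_weight), or None if no violation was found.
    """
    n_items = len(scores)
    max_trials = max_trials or n_items - 1
    for trial in range(1, max_trials + 1):
        # Draw a random item other than the positive one
        negative = int(rng.integers(n_items - 1))
        if negative >= positive_item:
            negative += 1
        # Rank violation: the negative scores within `margin` of the positive
        if scores[negative] > scores[positive_item] - margin:
            # Few trials needed => the positive is probably ranked low,
            # so the estimated rank and the update weight are larger
            estimated_rank = (n_items - 1) // trial
            weight = float(np.log(max(estimated_rank, 1)))
            return negative, weight
    return None

rng = np.random.default_rng(0)
scores = np.array([0.0, 1.0, 1.0, 1.0])  # item 0 is the positive, scored too low
print(warp_sample(scores, positive_item=0, rng=rng))
```

Only the positive item and the sampled violating negative would then receive a gradient update, scaled by the returned weight.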


**Train and Validation scores**

Loading the trained model (trained for 200 epochs)

In [None]:
#Load Trained Model

!pip3 install pickle5
import pickle5 as pickle
with open('../input/lightfm1/lightFM1.pickle', "rb") as fh:
    trained_model = pickle.load(fh)

In [None]:
%%time
# train_precision = precision_at_k(model, interactions, k=K).mean()  # skipped: it takes too long to run
val_precision = precision_at_k(trained_model, val_interactions, k=K).mean()

print(val_precision)

**FOR NOW IT'S JUST A MODEL TRAINED FOR 5 EPOCHS, BUT IT CAN BE HEAVILY IMPROVED, AND IN THE NEXT UPDATES I WILL BE CREATING A FULL SUBMISSION PIPELINE!**

**Submission**

In [None]:
# Get the mappings:
# uid_map = mapping from customer_id to the model's internal user id
# iid_map = mapping from article_id to the model's internal item id
uid_map, ufeature_map, iid_map, ifeature_map = dataset.mapping()

# Create inverse mappings (internal id -> original id)
inv_uid_map = {v:k for k, v in uid_map.items()}
inv_iid_map = {v:k for k, v in iid_map.items()}

# Convert the submission customer_ids to the model's internal user ids
test_X = sample_submission.customer_id.values
test_X_m = [uid_map[tx] for tx in test_X]

print(len(test_X_m))

In [None]:
customer_ids = []
preds = []

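# Build the item id array once instead of on every iteration
item_ids = np.array(list(iid_map.values()))

for usr_ in tqdm.tqdm(test_X_m, total = len(test_X_m)):
    # A single integer user id is broadcast across all item ids by predict
    m_opt = trained_model.predict(usr_, item_ids)
    # Top-K internal item ids, best score first
    pred = item_ids[np.argsort(-m_opt)[:K]]
    customer_ids.append(inv_uid_map[usr_])
    preds.append(' '.join([inv_iid_map[p] for p in pred]))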
    
In [None]:
#Create the submission
final_sub = pd.DataFrame({'customer_id': customer_ids, 'prediction': preds})
final_sub.to_csv('/kaggle/working/submission.csv', index=False)
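Scoring one user at a time is the slow part of the prediction loop. One possible way to cut the runtime is to score a whole batch of users with a single matrix product and pick the top K with `np.argpartition` instead of fully sorting every score vector. The sketch below uses random arrays as stand-ins; it assumes that for a model without side features the score is the user-item dot product plus both biases, and that the real arrays would come from the model's `get_user_representations()` and `get_item_representations()` methods (please check the LightFM docs before relying on this). The helper name `top_k_batch` is hypothetical.

```python
import numpy as np

def top_k_batch(user_emb, user_bias, item_emb, item_bias, k):
    """Top-k item indices per user, best first, for one batch of users."""
    # Score matrix of shape (n_users_in_batch, n_items)
    scores = user_emb @ item_emb.T + user_bias[:, None] + item_bias[None, :]
    # argpartition finds the k best items per row without sorting the full row
    top = np.argpartition(-scores, k - 1, axis=1)[:, :k]
    # Sort only those k candidates by descending score
    order = np.argsort(-np.take_along_axis(scores, top, axis=1), axis=1)
    return np.take_along_axis(top, order, axis=1)

# Random stand-ins for the learned representations (assumed shapes)
rng = np.random.default_rng(0)
n_users, n_items, n_components, k = 8, 50, 20, 12
user_emb = rng.normal(size=(n_users, n_components))
item_emb = rng.normal(size=(n_items, n_components))
user_bias = rng.normal(size=n_users)
item_bias = rng.normal(size=n_items)

batch_top = top_k_batch(user_emb, user_bias, item_emb, item_bias, k)
print(batch_top.shape)
```

In practice the customers would be processed in batches of a few thousand so that the score matrix fits in memory. I have not measured how much this saves on the full submission.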

**<h1>Work in progress 🚧</h1>**
1) Full Submission Pipeline

2) Optimized Model 