## In this notebook, I implement the LightGCN model for recommendation systems from scratch using PyTorch.


* LightGCN was proposed by He et al. and presented at SIGIR in July 2020. It was reported as state of the art for collaborative filtering when it was published.
* Link to the original paper -> https://paperswithcode.com/paper/lightgcn-simplifying-and-powering-graph/review/
* Microsoft's Recommenders repository on GitHub contains a LightGCN implementation in TensorFlow -> https://github.com/microsoft/recommenders/tree/efaa3d7742183dee0846877e2dc64977098e1977
* I used the above repository as a reference while recreating LightGCN in PyTorch.
* In the second part of the notebook, I also compare my results with those produced by the original TensorFlow code.


In [1]:
import torch
torch.__version__

'2.9.1+cpu'

In [2]:
import pandas as pd
#pd.set_option('display.max_colwidth', None)
from sklearn import preprocessing as pp
from sklearn.model_selection import train_test_split
import scipy.sparse as sp
import numpy as np
import random
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
import time
import matplotlib.pyplot as plt
from tqdm.notebook import tqdm

In [3]:
df = pd.read_csv("../../data/processed/current_reviews.csv")
df.head()

Unnamed: 0,ISBN,User_id,rating,review
0,015201294X,A2L8RR2B6HO24F,4.0,Hello! In am reading this story for a 6th grad...
1,015201294X,A3U3RS1HYT8BK,5.0,"Okay, we can forgive Mrs.Rinaldi for giving Em..."
2,015201294X,AJ9RZKCNVA7DK,5.0,This is one of my favorite books by Anne Rinal...
3,015201294X,AAYH3NGH1TCUT,5.0,You've heard about the body snatching. Gross r...
4,015201294X,A20A1RL7J10Y1Y,5.0,"It's the end of Civil War, 1865 in Washington ..."


In [4]:
df = df[['ISBN' , 'User_id' , 'rating']]
print(len(df))

84564


In [5]:
df.head()

Unnamed: 0,ISBN,User_id,rating
0,015201294X,A2L8RR2B6HO24F,4.0
1,015201294X,A3U3RS1HYT8BK,5.0
2,015201294X,AJ9RZKCNVA7DK,5.0
3,015201294X,AAYH3NGH1TCUT,5.0
4,015201294X,A20A1RL7J10Y1Y,5.0


In [6]:
# Split the DataFrame directly so column dtypes are preserved, then reset the index
train, test = train_test_split(df, test_size=0.2, random_state=16)
train = train.reset_index(drop=True)
test = test.reset_index(drop=True)

In [7]:
print("Train Size  : ", len(train))
print("Test Size : ", len(test))

Train Size  :  67651
Test Size :  16913


In [8]:
train_user_ids = train['User_id'].unique()
train_item_ids = train['ISBN'].unique()

print(len(train_user_ids), len(train_item_ids))

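# Filtering disabled: dropping every test row with an unseen user or item shrinks the test set too aggressively.
# Caveat: users/items that never occur in train receive no training signal, so their recommendations are unreliable.
# test = test[(test['User_id'].isin(train_user_ids)) & (test['ISBN'].isin(train_item_ids))]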
print("Test size after keeping all samples:", len(test))

42410 3630
Test size after keeping all samples: 16913
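
Because the filter above is commented out, the test set can contain users or items that never occur in train. A minimal sketch of how that share could be measured, shown on a toy frame that reuses the notebook's column names (the helper `cold_start_report` and the toy data are hypothetical, not part of the notebook):

```python
import pandas as pd

def cold_start_report(train: pd.DataFrame, test: pd.DataFrame,
                      user_col: str = "User_id", item_col: str = "ISBN") -> dict:
    """Count test rows whose user or item never appears in train."""
    known_user = test[user_col].isin(train[user_col].unique())
    known_item = test[item_col].isin(train[item_col].unique())
    return {
        "test_rows": len(test),
        "unseen_user_rows": int((~known_user).sum()),
        "unseen_item_rows": int((~known_item).sum()),
        "fully_known_rows": int((known_user & known_item).sum()),
    }

# Toy data only; replace with the notebook's train/test frames to get real counts.
toy_train = pd.DataFrame({"User_id": ["u1", "u1", "u2"], "ISBN": ["a", "b", "a"]})
toy_test = pd.DataFrame({"User_id": ["u1", "u3", "u2"], "ISBN": ["c", "a", "b"]})
report = cold_start_report(toy_train, toy_test)
print(report)
```

Running the same helper on the real `train` and `test` frames would show how many of the 16913 test rows the model can meaningfully score.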


In [9]:
n_users = train['User_id'].nunique()
n_items = train['ISBN'].nunique()
print("Number of Unique Users : ", n_users)
print("Number of unique Items : ", n_items)

Number of Unique Users :  42410
Number of unique Items :  3630


In [10]:
train.head()

Unnamed: 0,ISBN,User_id,rating
0,B00005BC14,ANZYKM0I8W84U,3.0
1,B000QEARDU,A3GWVB80CNOSN8,3.0
2,B000L3AP9C,A1IOJE0W1NXOSE,5.0
3,B00086TLE2,A3Q78GKY14MQ1,4.0
4,B000TNGU5M,A2X3BQKTMYE9KP,4.0


## In the part below, I run the original LightGCN code from Microsoft's repository, generate its metrics and losses, and compare them with my own output.
#### Steps to run the original LightGCN in TensorFlow -> https://github.com/microsoft/recommenders/blob/main/examples/07_tutorials/KDD2020-tutorial/step5_run_lightgcn.ipynb

In [11]:
# !git clone https://github.com/microsoft/recommenders.git ./recommenders_microsoft

In [12]:
import sys
import os
sys.path.insert(0, os.path.abspath('recommenders_microsoft'))
sys.path.insert(0, os.path.abspath('recommenders_microsoft/recommenders'))
sys.path.insert(0, os.path.abspath('.')) 

In [13]:
!pip install retrying pyyaml



In [14]:
import os
import pandas as pd
import numpy as np
import tensorflow as tf
from recommenders_microsoft.recommenders.utils.timer import Timer
from recommenders_microsoft.recommenders.models.deeprec.models.graphrec.lightgcn import LightGCN
from recommenders_microsoft.recommenders.models.deeprec.DataModel.ImplicitCF import ImplicitCF
from recommenders_microsoft.recommenders.datasets.python_splitters import python_stratified_split
from recommenders_microsoft.recommenders.evaluation.python_evaluation import map, ndcg_at_k, precision_at_k, recall_at_k
from recommenders_microsoft.recommenders.utils.constants import SEED as DEFAULT_SEED
from recommenders_microsoft.recommenders.models.deeprec.deeprec_utils import prepare_hparams
from recommenders_microsoft.recommenders.utils.notebook_utils import store_metadata






In [15]:
train = train.rename(columns={'User_id': 'userID', 'ISBN': 'itemID'})
test = test.rename(columns={'User_id': 'userID', 'ISBN': 'itemID'})

In [16]:
# Diagnostic: Check data before ImplicitCF
print("Train shape:", train.shape, "| columns:", train.columns.tolist())
print("Train head:\n", train.head())
print("\nTest shape:", test.shape, "| columns:", test.columns.tolist())
print("Test head:\n", test.head())
print("\nTrain rating range:", train['rating'].min(), "-", train['rating'].max())
print("Test rating range:", test['rating'].min(), "-", test['rating'].max())
print("Unique users in train:", train['userID'].nunique(), "| in test:", test['userID'].nunique())
print("Unique items in train:", train['itemID'].nunique(), "| in test:", test['itemID'].nunique())

Train shape: (67651, 3) | columns: ['itemID', 'userID', 'rating']
Train head:
        itemID          userID rating
0  B00005BC14   ANZYKM0I8W84U    3.0
1  B000QEARDU  A3GWVB80CNOSN8    3.0
2  B000L3AP9C  A1IOJE0W1NXOSE    5.0
3  B00086TLE2   A3Q78GKY14MQ1    4.0
4  B000TNGU5M  A2X3BQKTMYE9KP    4.0

Test shape: (16913, 3) | columns: ['itemID', 'userID', 'rating']
Test head:
        itemID          userID rating
0  B000EJ6M68  A1JPUMC3793X2C    4.0
1  B00086FM0Y  A3LZ4NROH4BKUS    3.0
2  0688800696  A2RG6V2HH02U0M    5.0
3  0441002137  A1AUBGENRIZODO    3.0
4  B000NDGNP0  A1II1UUTORWZ1K    4.0

Train rating range: 1.0 - 5.0
Test rating range: 1.0 - 5.0
Unique users in train: 42410 | in test: 13836
Unique items in train: 3630 | in test: 2175


In [17]:
data = ImplicitCF(
    train=train, test=test, seed=0,
    col_user='userID',
    col_item='itemID',
    col_rating='rating'
)
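
Graph models look up embeddings by integer position, so string identifiers such as the user and item IDs above have to be mapped to contiguous integers first. `ImplicitCF` appears to handle this mapping internally. The sketch below only illustrates the idea with plain dictionaries; `build_id_maps` and the toy identifiers are hypothetical and not taken from the library.

```python
def build_id_maps(pairs):
    """Map raw user/item identifiers to contiguous integer indices.

    `pairs` is an iterable of (user, item) tuples; indices are assigned
    in order of first appearance.
    """
    user2id, item2id = {}, {}
    for user, item in pairs:
        # setdefault keeps the existing index if the key was seen before
        user2id.setdefault(user, len(user2id))
        item2id.setdefault(item, len(item2id))
    return user2id, item2id

# Toy interactions (made-up identifiers)
interactions = [
    ("user_a", "book_x"),
    ("user_b", "book_y"),
    ("user_a", "book_y"),
]
user2id, item2id = build_id_maps(interactions)
encoded = [(user2id[u], item2id[i]) for u, i in interactions]
print(encoded)
```

The integer pairs in `encoded` are what would index the rows and columns of the user-item interaction matrix.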

In [18]:
yaml_file = './recommenders_microsoft/examples/07_tutorials/KDD2020-tutorial/lightgcn.yaml'


hparams = prepare_hparams(yaml_file,                          
                          learning_rate=0.005,
                          eval_epoch=1,
                          top_k=10,
                          save_model=False,
                          epochs=30,
                          save_epoch=1
                         )

In [19]:
model = LightGCN(hparams, data, seed=0)

Already create adjacency matrix.
Already normalize adjacency matrix.
Using xavier initialization.
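
The log lines above refer to the user-item adjacency matrix and its normalization. As described in the LightGCN paper, embeddings are propagated with the symmetrically normalized adjacency D^(-1/2) A D^(-1/2), without feature transforms or activations, and the per-layer embeddings are averaged. Below is a dense NumPy sketch on a toy graph. The function names are hypothetical, and a real implementation would use sparse matrices rather than dense arrays.

```python
import numpy as np

def normalized_adjacency(interactions: np.ndarray) -> np.ndarray:
    """Build D^(-1/2) A D^(-1/2) for the bipartite user-item graph.

    `interactions` is a dense (n_users, n_items) 0/1 matrix.
    """
    n_users, n_items = interactions.shape
    n = n_users + n_items
    adj = np.zeros((n, n))
    adj[:n_users, n_users:] = interactions      # user -> item edges
    adj[n_users:, :n_users] = interactions.T    # item -> user edges
    degree = adj.sum(axis=1)
    d_inv_sqrt = np.zeros_like(degree)
    nonzero = degree > 0
    d_inv_sqrt[nonzero] = degree[nonzero] ** -0.5   # isolated nodes stay at 0
    return d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]

def propagate(norm_adj: np.ndarray, emb0: np.ndarray, n_layers: int) -> np.ndarray:
    """Average the embeddings of layers 0..n_layers."""
    layers = [emb0]
    for _ in range(n_layers):
        layers.append(norm_adj @ layers[-1])
    return np.mean(layers, axis=0)

R = np.array([[1, 1, 0],
              [0, 1, 1]], dtype=float)          # 2 users x 3 items (toy data)
A_hat = normalized_adjacency(R)
rng = np.random.default_rng(0)
E0 = rng.normal(size=(5, 4))                    # 5 nodes, embedding dimension 4
E_final = propagate(A_hat, E0, n_layers=3)
print(A_hat.shape, E_final.shape)
```

The first `n_users` rows of `E_final` are the user embeddings and the remaining rows are the item embeddings. A user-item score is then the dot product of the two corresponding rows.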



In [20]:
with Timer() as train_time:
    model.fit()

print("Took {} seconds for training.".format(train_time.interval))

Epoch 1 (train)4.6s + (eval)2.2s: train loss = 0.33054 = (mf)0.33043 + (embed)0.00011, recall = 0.35466, ndcg = 0.32575, precision = 0.04902, map = 0.31276
Epoch 2 (train)4.1s + (eval)1.9s: train loss = 0.07089 = (mf)0.07056 + (embed)0.00033, recall = 0.36061, ndcg = 0.33302, precision = 0.04980, map = 0.32043
Epoch 3 (train)4.0s + (eval)1.8s: train loss = 0.03356 = (mf)0.03312 + (embed)0.00044, recall = 0.36136, ndcg = 0.33426, precision = 0.04996, map = 0.32173
Epoch 4 (train)4.2s + (eval)1.8s: train loss = 0.02003 = (mf)0.01951 + (embed)0.00052, recall = 0.36170, ndcg = 0.33636, precision = 0.05004, map = 0.32441
Epoch 5 (train)4.2s + (eval)1.8s: train loss = 0.01350 = (mf)0.01293 + (embed)0.00057, recall = 0.36243, ndcg = 0.33664, precision = 0.05015, map = 0.32454
Epoch 6 (train)4.3s + (eval)1.9s: train loss = 0.00979 = (mf)0.00918 + (embed)0.00061, recall = 0.36288, ndcg = 0.33678, precision = 0.05018, map = 0.32456
Epoch 7 (train)4.3s + (eval)1.8s: train loss = 0.00746 = (mf)0.0
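
Each log line splits the training loss into an `mf` term and an `embed` term. In the LightGCN paper the objective is the Bayesian Personalized Ranking (BPR) loss plus L2 regularization on the embeddings. Assuming the two logged terms correspond to those two parts, a scalar sketch could look like this (`bpr_loss`, `l2_penalty`, and the `decay` value are illustrative and are not the library's code):

```python
import math

def bpr_loss(pos_scores, neg_scores):
    """Mean of -log(sigmoid(pos - neg)), computed as softplus(neg - pos)."""
    losses = []
    for pos, neg in zip(pos_scores, neg_scores):
        diff = neg - pos
        # numerically stable softplus: log(1 + exp(diff))
        losses.append(max(diff, 0.0) + math.log1p(math.exp(-abs(diff))))
    return sum(losses) / len(losses)

def l2_penalty(embeddings, decay, batch_size):
    """decay * 0.5 * (sum of squared entries), averaged over the batch."""
    total = sum(x * x for vec in embeddings for x in vec)
    return decay * 0.5 * total / batch_size

# Toy scores: the first positive item outranks its negative, the second ties.
mf_term = bpr_loss(pos_scores=[2.0, 0.5], neg_scores=[0.0, 0.5])
embed_term = l2_penalty([[1.0, 2.0], [0.0, 1.0]], decay=0.1, batch_size=2)
print(mf_term + embed_term)
```

The ranking term shrinks as positive items are scored above sampled negatives, which is consistent with the `mf` values falling from epoch to epoch in the log.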

In [21]:
test.head()

Unnamed: 0,itemID,userID,rating
0,B000EJ6M68,A1JPUMC3793X2C,4.0
1,B00086FM0Y,A3LZ4NROH4BKUS,3.0
2,0688800696,A2RG6V2HH02U0M,5.0
3,0441002137,A1AUBGENRIZODO,3.0
4,B000NDGNP0,A1II1UUTORWZ1K,4.0


In [22]:
test

Unnamed: 0,itemID,userID,rating
0,B000EJ6M68,A1JPUMC3793X2C,4.0
1,B00086FM0Y,A3LZ4NROH4BKUS,3.0
2,0688800696,A2RG6V2HH02U0M,5.0
3,0441002137,A1AUBGENRIZODO,3.0
4,B000NDGNP0,A1II1UUTORWZ1K,4.0
...,...,...,...
16908,0205315119,A2E6RBSV0PAE24,5.0
16909,B0007ED2A4,A3HTEVJ63FXDMR,4.0
16910,B000G643YM,A25BVI7W6FDM98,5.0
16911,B000NDWGWY,A2NERK7NRDFX9X,2.0


In [23]:
test['userID'].value_counts()

userID
A1D2C0WDCSHUWZ    37
A14OJS0VWMOSWO    24
AFVQZQ8PW0L       14
A1X8VZWTOG8IS6    12
A1NT7ED5TATUAM    11
                  ..
A1SSIDDDCCK1E      1
A30GSU5H72LSL      1
A2WZJI4UP45JCP     1
A36S1RC0WP3RWA     1
A3U2G6PD7C8R9E     1
Name: count, Length: 13836, dtype: int64

In [24]:
# recommend_k_items only needs the user column, so drop the item column (this modifies test in place)
test.drop(labels="itemID", axis=1, inplace=True)

In [25]:
test

Unnamed: 0,userID,rating
0,A1JPUMC3793X2C,4.0
1,A3LZ4NROH4BKUS,3.0
2,A2RG6V2HH02U0M,5.0
3,A1AUBGENRIZODO,3.0
4,A1II1UUTORWZ1K,4.0
...,...,...
16908,A2E6RBSV0PAE24,5.0
16909,A3HTEVJ63FXDMR,4.0
16910,A25BVI7W6FDM98,5.0
16911,A2NERK7NRDFX9X,2.0


In [26]:
model.recommend_k_items(test,10)

Unnamed: 0,userID,itemID,prediction
0,A1JPUMC3793X2C,B000EJ6M68,7.636870
1,A1JPUMC3793X2C,1576467449,7.565510
2,A1JPUMC3793X2C,0809594528,7.515228
3,A1JPUMC3793X2C,0736641238,6.837118
4,A1JPUMC3793X2C,0192503561,5.300484
...,...,...,...
138355,A3U2G6PD7C8R9E,B000OVT49I,2.192492
138356,A3U2G6PD7C8R9E,019283357X,2.056808
138357,A3U2G6PD7C8R9E,B000N7623E,1.968076
138358,A3U2G6PD7C8R9E,B000MZWXNA,1.844643


In [27]:
model.save('../../models/lightgcn')