# Load Data
- Source: https://grouplens.org/datasets/movielens/100k/
- The MovieLens 100k dataset was collected by the GroupLens Research Project at the University of Minnesota
- Larger versions are also available: MovieLens 1M, MovieLens 10M, MovieLens 20M

In [None]:
!wget https://files.grouplens.org/datasets/movielens/ml-100k.zip

--2024-04-28 16:33:11--  https://files.grouplens.org/datasets/movielens/ml-100k.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.65.152
Connecting to files.grouplens.org (files.grouplens.org)|128.101.65.152|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4924029 (4.7M) [application/zip]
Saving to: ‘ml-100k.zip’


2024-04-28 16:33:11 (28.9 MB/s) - ‘ml-100k.zip’ saved [4924029/4924029]



In [None]:
!unzip ml-100k.zip

Archive:  ml-100k.zip
   creating: ml-100k/
  inflating: ml-100k/allbut.pl       
  inflating: ml-100k/mku.sh          
  inflating: ml-100k/README          
  inflating: ml-100k/u.data          
  inflating: ml-100k/u.genre         
  inflating: ml-100k/u.info          
  inflating: ml-100k/u.item          
  inflating: ml-100k/u.occupation    
  inflating: ml-100k/u.user          
  inflating: ml-100k/u1.base         
  inflating: ml-100k/u1.test         
  inflating: ml-100k/u2.base         
  inflating: ml-100k/u2.test         
  inflating: ml-100k/u3.base         
  inflating: ml-100k/u3.test         
  inflating: ml-100k/u4.base         
  inflating: ml-100k/u4.test         
  inflating: ml-100k/u5.base         
  inflating: ml-100k/u5.test         
  inflating: ml-100k/ua.base         
  inflating: ml-100k/ua.test         
  inflating: ml-100k/ub.base         
  inflating: ml-100k/ub.test         


In [None]:
!rm -rf ml-100k.zip

# EDA
The dataset contains:
- 100,000 ratings (1-5) from 943 users on 1682 movies.
- Each user has rated at least 20 movies.
- Basic information about each user (age, gender, occupation, zip)
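As a quick sanity check on these counts, the density of the user-movie rating matrix can be worked out directly. The snippet below is only an illustration added here; the variable names are chosen for this example and are not used elsewhere in the notebook.

```python
# Counts quoted in the dataset description above.
total_ratings = 100_000
total_users = 943
total_movies = 1682

# Fraction of the user-movie matrix that actually holds a rating.
density = total_ratings / (total_users * total_movies)

# Average number of ratings per user and per movie.
ratings_per_user = total_ratings / total_users
ratings_per_movie = total_ratings / total_movies

print(f"density: {density:.4f}")
print(f"ratings per user: {ratings_per_user:.1f}")
print(f"ratings per movie: {ratings_per_movie:.1f}")
```

Only about 6% of the matrix is filled, which is why the Collaborative Filtering section stores the ratings as a sparse matrix.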

In [None]:
import pandas as pd

In [None]:
# Reading user file:
u_cols =  ['user_id', 'age', 'sex', 'occupation', 'zip_code']
users = pd.read_csv('ml-100k/u.user', sep='|', names=u_cols, encoding='latin-1')

n_users = users.shape[0]
print('Number of users:', n_users)
users.head()

Number of users: 943


Unnamed: 0,user_id,age,sex,occupation,zip_code
0,1,24,M,technician,85711
1,2,53,F,other,94043
2,3,23,M,writer,32067
3,4,24,M,technician,43537
4,5,33,F,other,15213


In [None]:
# Reading ratings file:
r_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']

ratings_base = pd.read_csv('ml-100k/u1.base', sep='\t', names=r_cols, encoding='latin-1')
ratings_test = pd.read_csv('ml-100k/u1.test', sep='\t', names=r_cols, encoding='latin-1')

rate_train = ratings_base.to_numpy().copy()
rate_test = ratings_test.to_numpy().copy()

print('Number of training ratings:', rate_train.shape[0])
print('Number of test ratings:', rate_test.shape[0])

ratings_base.head()

Number of training ratings: 80000
Number of test ratings: 20000
Number of test rates: 20000


Unnamed: 0,user_id,movie_id,rating,unix_timestamp
0,1,1,5,874965758
1,1,2,3,876893171
2,1,3,4,878542960
3,1,4,3,876893119
4,1,5,3,889751712


In [None]:
# Reading items file:
i_cols = ['movie id', 'movie title' ,'release date','video release date', 'IMDb URL', 'unknown', 'Action', 'Adventure',
 'Animation', 'Children\'s', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy',
 'Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']

items = pd.read_csv('ml-100k/u.item', sep='|', names=i_cols, encoding='latin-1')

n_items = items.shape[0]
print('Number of items:', n_items)
items.head()

Number of items: 1682


Unnamed: 0,movie id,movie title,release date,video release date,IMDb URL,unknown,Action,Adventure,Animation,Children's,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%2...,0,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0
1,2,GoldenEye (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?GoldenEye%20(...,0,1,1,0,0,...,0,0,0,0,0,0,0,1,0,0
2,3,Four Rooms (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Four%20Rooms%...,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,4,Get Shorty (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Get%20Shorty%...,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,5,Copycat (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Copycat%20(1995),0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0


# Recommendation System
- Goal: recommend 10 movies to each user based on the users' rating history
- Approach: build a model that predicts a user's rating for movies they have not seen yet, then rank the movies by predicted rating, highest first
- Methods: Content-based System, Collaborative Filtering, Graph-based System
- Evaluation: Processing time, Precision@10, Recall@10, MAP (relevant_movie: true rating > 3)
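To make the evaluation metrics concrete, here is a minimal sketch of how precision@k, recall@k and average precision can be computed for a single user. It follows the same conventions as the evaluation helper defined in the Utils section (a movie is relevant when its true rating is above the threshold, and average precision is normalised by the number of hits in the top k), but the function name `precision_recall_ap_at_k` and the toy ratings are made up for illustration. MAP is then the mean of the per-user average precision.

```python
def precision_recall_ap_at_k(ranked_true_ratings, k=10, threshold=3):
    """Metrics for one user.

    ranked_true_ratings: true ratings of the user's candidate movies,
    ordered by predicted rating (best prediction first).
    """
    relevant = [rating > threshold for rating in ranked_true_ratings]
    total_relevant = sum(relevant)

    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevant[:k], start=1):
        if is_relevant:
            hits += 1
            # precision measured at the rank of this hit
            precision_sum += hits / rank

    precision = hits / k
    recall = hits / total_relevant if total_relevant else 0.0
    average_precision = precision_sum / hits if hits else 0.0
    return precision, recall, average_precision


# Toy example: five candidate movies, evaluated at k = 3.
p, r, ap = precision_recall_ap_at_k([5, 2, 4, 3, 5], k=3)
print(p, r, ap)
```

In the toy example two of the top three movies are relevant and three relevant movies exist in total, so precision and recall are both 2/3.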

## Packages

In [None]:
import torch
print(torch.__version__)
!pip install torch-scatter torch-sparse -f'https://data.pyg.org/whl/torch-{torch.__version__}.html'
!pip install torch-geometric

2.2.1+cu121
Looking in links: https://data.pyg.org/whl/torch-2.2.1+cu121.html
Collecting torch-scatter
  Downloading https://data.pyg.org/whl/torch-2.2.0%2Bcu121/torch_scatter-2.1.2%2Bpt22cu121-cp310-cp310-linux_x86_64.whl (10.9 MB)
Collecting torch-sparse
  Downloading https://data.pyg.org/whl/torch-2.2.0%2Bcu121/torch_sparse-0.6.18%2Bpt22cu121-cp310-cp310-linux_x86_64.whl (5.0 MB)
Installing collected packages: torch-scatter, torch-sparse
Successfully installed torch-scatter-2.1.2+pt22cu121 torch-sparse-0.6.18+pt22cu121
Collecting torch-geometric
  Downloading torch_geometric-2.5.3-py3-none-any.whl (1.1 MB)
Installing collecte

In [None]:
import copy
import time
import random
import numpy as np

from scipy import sparse
from tqdm.notebook import tqdm
from collections import defaultdict

from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import Ridge
from sklearn import preprocessing
from sklearn.metrics.pairwise import cosine_similarity

from torch import nn, optim, Tensor
import torch.nn.functional as F

from torch_sparse import SparseTensor
from torch_geometric.nn.conv.gcn_conv import gcn_norm
from torch_geometric.nn.conv import MessagePassing
from torch_geometric.typing import Adj

## Utils

In [None]:
results = [None, None, None] # one row per method: [method, processing_time, precision@10, recall@10, MAP]

In [None]:
rate_train_dict = dict()
for user, movie, rating, _ in rate_train:
  rate_train_dict[(user, movie)] = rating

rate_test_dict = dict()
for user, movie, rating, _ in rate_test:
  rate_test_dict[(user, movie)] = rating

In [None]:
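def get_result(ratings_test, rate_test_dict, database, recommend):
  """
  Mean precision@10, recall@10 and average precision over all test users.
  database[user_id]: all test movies of the user, ranked by predicted rating
  recommend[user_id]: the top 10 movies of that ranking
  A movie is relevant when its true rating is > 3.
  """
  precisions = list()
  recalls = list()
  average_precisions = list()

  for user_id in np.unique(ratings_test['user_id']):

    count_total_relevant = 0
    count_recommend_relevant = 0
    number_retrieve = 0
    precision_sum = 0

    # relevant movies among everything the user rated in the test set
    for movie_id in database[user_id]:
      if rate_test_dict[user_id, movie_id] > 3:
        count_total_relevant += 1

    # relevant movies among the recommended ones
    for movie_id in recommend[user_id]:
      number_retrieve += 1

      if rate_test_dict[user_id, movie_id] > 3:
        count_recommend_relevant += 1

        # precision at the rank of this hit
        precision_sum += (count_recommend_relevant/number_retrieve)

    # no relevant movie was recommended: all three metrics are 0 for this user
    if count_total_relevant == 0 or count_recommend_relevant == 0:
        precisions.append(0)
        recalls.append(0)
        average_precisions.append(0)
        continue

    precisions.append(count_recommend_relevant/10)
    recalls.append(count_recommend_relevant/count_total_relevant)
    average_precisions.append(precision_sum/count_recommend_relevant)

  return np.mean(precisions), np.mean(recalls), np.mean(average_precisions)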

In [None]:
def get_items_rated_by_user(rate_matrix, user_id):
    """
    Each row of rate_matrix holds: user_id, item_id, rating (score), time_stamp.
    Only the first three values are used.
    Returns (item_ids, scores) rated by user user_id, where user_id and
    the returned item_ids are 0-based.
    """
    y = rate_matrix[:,0] # all users
    # item indices rated by user_id
    # we need to +1 to user_id since in the rate_matrix, id starts from 1
    # while index in python starts from 0
    ids = np.where(y == user_id +1)[0]
    item_ids = rate_matrix[ids, 1] - 1 # index starts from 0
    scores = rate_matrix[ids, 2]
    return (item_ids, scores)

## Content-based System
  - Use the "genre" attribute of each movie to recommend movies with similar content
  - Use TF-IDF to build the feature vectors
  - Use Ridge Regression as the prediction model for each user (n models for n users)
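The notebook builds its features with scikit-learn's `TfidfTransformer(smooth_idf=True, norm='l2')`. As a rough illustration of what that transform does to a binary genre matrix, the sketch below re-implements it in plain Python, assuming the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 from the scikit-learn documentation, followed by L2 normalisation of each row. The function name `tfidf_l2` and the toy matrix are made up for this example.

```python
import math


def tfidf_l2(counts):
    """TF-IDF with smoothed IDF and L2-normalised rows."""
    n_docs = len(counts)
    n_terms = len(counts[0])

    # document frequency: number of rows in which each term (genre) appears
    df = [sum(1 for row in counts if row[t] > 0) for t in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + df[t])) + 1 for t in range(n_terms)]

    features = []
    for row in counts:
        weighted = [row[t] * idf[t] for t in range(n_terms)]
        norm = math.sqrt(sum(w * w for w in weighted))
        features.append([w / norm if norm > 0 else 0.0 for w in weighted])
    return features


# Toy genre matrix: 3 movies x 3 genres (hypothetical data).
# The first genre is shared by every movie, the other two are rare.
genres = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 0, 1],
]
features = tfidf_l2(genres)
for row in features:
    print([round(v, 3) for v in row])
```

A genre that every movie has gets the smallest IDF, so rarer genres carry more weight in the feature vector. Each user's Ridge model is then fitted on the rows of this matrix that correspond to the movies the user rated.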
  

In [None]:
start_time = time.time()

In [None]:
X0 = items.to_numpy()
print(X0)

[[1 'Toy Story (1995)' '01-Jan-1995' ... 0 0 0]
 [2 'GoldenEye (1995)' '01-Jan-1995' ... 1 0 0]
 [3 'Four Rooms (1995)' '01-Jan-1995' ... 1 0 0]
 ...
 [1680 'Sliding Doors (1998)' '01-Jan-1998' ... 0 0 0]
 [1681 'You So Crazy (1994)' '01-Jan-1994' ... 0 0 0]
 [1682 'Scream of Stone (Schrei aus Stein) (1991)' '08-Mar-1996' ... 0 0
  0]]


In [None]:
X_train_counts = X0[:, -19:]
print(X_train_counts)

[[0 0 0 ... 0 0 0]
 [0 1 1 ... 1 0 0]
 [0 0 0 ... 1 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


In [None]:
# TF-IDF
transformer = TfidfTransformer(smooth_idf=True, norm ='l2')
tfidf = transformer.fit_transform(X_train_counts.tolist()).toarray()

In [None]:
d = tfidf.shape[1] # data dimension
W = np.zeros((d, n_users))
b = np.zeros((1, n_users))

for n in range(n_users):
    ids, scores = get_items_rated_by_user(rate_train, n)
    clf = Ridge(alpha=0.01, fit_intercept  = True)
    Xhat = tfidf[ids, :]

    clf.fit(Xhat, scores)
    W[:, n] = clf.coef_
    b[0, n] = clf.intercept_

In [None]:
# predicted scores
Yhat = tfidf.dot(W) + b
print(Yhat.shape)
print(Yhat)

(1682, 943)
[[ 3.27612429  3.97999788  0.67692105 ...  4.98371109  4.78483866
   3.18528636]
 [ 2.32368086  4.62732562  3.54831582 ...  3.96443414  4.50542002
   3.67198243]
 [ 3.60852778  2.4299569  -0.66633772 ...  5.64767287  3.85108764
   3.67730187]
 ...
 [ 4.39411033  4.07591046  2.98016687 ...  5.54093867  4.6548579
   3.3272269 ]
 [ 3.50423601  3.37859899  3.30821434 ...  3.79681247  3.65574491
   2.76146867]
 [ 3.98288721  3.95081311  3.51835951 ...  4.0037233   4.47091283
   3.97127359]]


In [None]:
database_user = dict() # (user_id: [movie_id_1, movie_id_2, ...])
recommend_user = dict() # (user_id: [movie_id_1, movie_id_2, ...])

for user_id in np.unique(ratings_test['user_id']):
  movie_ids, scores = get_items_rated_by_user(rate_test, user_id - 1)
  rating_predict = Yhat[movie_ids, user_id - 1]

  database = [[id + 1, rate] for [id, rate] in zip(movie_ids, rating_predict)]
  database_sort = np.array(sorted(database, key=lambda x: x[1], reverse=True)) # ranking

  database_user[user_id] = database_sort[:,0]
  recommend_user[user_id] = database_sort [:10, 0]

print(database_user[1])
print(recommend_user[1])

[260. 258. 170.  51. 264. 189. 171. 159.  14.  20.  36. 125. 213. 253.
 175. 114.  90. 129.  96. 145.  49.  70.  81. 202. 255. 206.  23.  44.
  98. 183.  47.  65. 150. 236. 235.  72.  84. 250.   6.  60.  61.  64.
  86. 107. 113. 134. 160. 193. 196. 212. 215. 221. 224. 262. 272.  73.
 232. 185. 121.  69.  33.  39.  31. 177. 100.  54.  74. 161. 218. 241.
  10. 157. 180. 190.  92.  56.  76.  67.  85. 104. 108. 154. 163. 242.
  17. 209.  12. 155. 188.  53. 128. 226. 265. 120. 248. 208. 214.  80.
  97. 164. 252. 184. 102. 200. 219. 130. 186. 103.  62.  82. 222. 227.
 228. 229. 230. 267.  27. 233. 143. 201. 117. 118.  91.  24. 254. 225.
 243. 259. 148. 174. 210. 266. 132.  78. 151. 112. 140.]
[260. 258. 170.  51. 264. 189. 171. 159.  14.  20.]


In [None]:
end_time = time.time()
processing_time = end_time - start_time

precision, recall, map = get_result(ratings_test, rate_test_dict, database_user, recommend_user)
results[0] = ["Content-based System", processing_time, precision, recall, map]
print(results)

[['Content-based System', 1.390753984451294, 0.6028322440087146, 0.4984686586226328, 0.7254508102966838], None, None]


## Collaborative Filtering
- Use User-user Collaborative Filtering
- Use Cosine Similarity to measure how similar two rating vectors are
- Use KNN to predict a user's rating for movies they have not seen yet
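The three steps above can be sketched on a toy example: mean-center each user's ratings, compare users with cosine similarity (missing ratings count as 0), and predict from the k most similar users who rated the target movie. This is a simplified, dictionary-based illustration and not the `CF` class used in this notebook; the function name `predict_user_user` and the ratings are invented for the example.

```python
import math


def predict_user_user(ratings, user, item, k=2):
    """ratings: dict user -> dict item -> rating."""
    # mean-center every user's ratings
    means = {u: sum(r.values()) / len(r) for u, r in ratings.items()}
    centered = {u: {i: v - means[u] for i, v in r.items()}
                for u, r in ratings.items()}

    def cosine(a, b):
        # items missing from one of the vectors contribute 0 to the dot product
        dot = sum(v * b[i] for i, v in a.items() if i in b)
        norm_a = math.sqrt(sum(v * v for v in a.values()))
        norm_b = math.sqrt(sum(v * v for v in b.values()))
        return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0

    # similarity to every other user who rated the item, most similar first
    neighbours = [(cosine(centered[user], centered[v]), v)
                  for v in ratings if v != user and item in ratings[v]]
    neighbours.sort(reverse=True)
    top = neighbours[:k]

    # similarity-weighted average of the neighbours' centered ratings
    denom = sum(abs(s) for s, _ in top) + 1e-8
    offset = sum(s * centered[v][item] for s, v in top) / denom
    return means[user] + offset


ratings = {
    "alice": {"m1": 5, "m2": 4, "m3": 1},
    "bob": {"m1": 4, "m2": 5, "m3": 2, "m4": 5},
    "carol": {"m1": 1, "m2": 2, "m3": 5, "m4": 1},
    "dave": {"m1": 2, "m3": 4, "m4": 2},
}
prediction = predict_user_user(ratings, "alice", "m4", k=2)
print(round(prediction, 2))
```

Bob has tastes close to Alice's and liked `m4`, while Dave has opposite tastes and disliked it; both push the prediction above Alice's mean rating.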

### Model

In [None]:
class CF(object):
    def __init__(self, Y_data, k, dist_func = cosine_similarity, uuCF = 1):
        self.uuCF = uuCF # user-user (1) or item-item (0) CF
        self.Y_data = Y_data if uuCF else Y_data[:, [1, 0, 2]]
        self.k = k
        self.dist_func = dist_func
        self.Ybar_data = None
        # number of users and items. Remember to add 1 since id starts from 0
        self.n_users = int(np.max(self.Y_data[:, 0])) + 1
        self.n_items = int(np.max(self.Y_data[:, 1])) + 1

    def add(self, new_data):
        """
        Update Y_data matrix when new ratings come.
        For simplicity, suppose that there is no new user or item.
        """
        self.Y_data = np.concatenate((self.Y_data, new_data), axis = 0)

    def normalize_Y(self):
        users = self.Y_data[:, 0] # all users - first col of the Y_data
        self.Ybar_data = self.Y_data.copy()
        self.mu = np.zeros((self.n_users,))
        for n in range(self.n_users):
            # row indices of rating done by user n
            # since indices need to be integers, we need to convert
            ids = np.where(users == n)[0].astype(np.int32)
            # indices of all ratings associated with user n
            item_ids = self.Y_data[ids, 1]
            # and the corresponding ratings
            ratings = self.Y_data[ids, 2]
            # take mean
            m = np.mean(ratings)
            if np.isnan(m):
                m = 0 # to avoid empty array and nan value
            self.mu[n] = m
            # normalize
            self.Ybar_data[ids, 2] = ratings - self.mu[n]

        ################################################
        # form the rating matrix as a sparse matrix. Sparsity is important
        # for both memory and computing efficiency. For example, if #user = 1M,
        # #item = 100k, then shape of the rating matrix would be (100k, 1M),
        # you may not have enough memory to store this. Then, instead, we store
        # nonzeros only, and, of course, their locations.
        self.Ybar = sparse.coo_matrix((self.Ybar_data[:, 2],
            (self.Ybar_data[:, 1], self.Ybar_data[:, 0])), (self.n_items, self.n_users))
        self.Ybar = self.Ybar.tocsr()

    def similarity(self):
        # pairwise similarity between the columns of Ybar (users when uuCF = 1)
        self.S = self.dist_func(self.Ybar.T, self.Ybar.T)

    def refresh(self):
        """
        Normalize data and calculate similarity matrix again (after
        some few ratings added)
        """
        self.normalize_Y()
        self.similarity()

    def fit(self):
        self.refresh()

    def __pred(self, u, i, normalized = 1):
        """
        Predict the rating of user u for item i.
        normalized = 1: return the mean-centered prediction
        normalized = 0: add the mean rating of user u back
        """
        # Step 1: find the rows of Y_data that hold a rating for item i
        ids = np.where(self.Y_data[:, 1] == i)[0].astype(np.int32)
        # Step 2: get the users who gave those ratings
        users_rated_i = (self.Y_data[ids, 0]).astype(np.int32)
        # Step 3: find similarity btw the current user and others
        # who already rated i
        sim = self.S[u, users_rated_i]
        # Step 4: find the k most similar users
        a = np.argsort(sim)[-self.k:]
        # and the corresponding similarity levels
        nearest_s = sim[a]
        # How each of the 'near' users rated item i
        r = self.Ybar[i, users_rated_i[a]]
        if normalized:
            # add a small number, for instance, 1e-8, to avoid dividing by 0
            return (r*nearest_s)[0]/(np.abs(nearest_s).sum() + 1e-8)

        return (r*nearest_s)[0]/(np.abs(nearest_s).sum() + 1e-8) + self.mu[u]

    def pred(self, u, i, normalized = 1):
        """
        Predict the rating of user u for item i.
        Set normalized = 0 to get the rating on the original scale.
        """
        if self.uuCF: return self.__pred(u, i, normalized)
        return self.__pred(i, u, normalized)

    def recommend(self, u):
        """
        Determine all items should be recommended for user u.
        The decision is made based on all i such that:
        self.pred(u, i) > 0. Suppose we are considering items which
        have not been rated by u yet.
        """
        ids = np.where(self.Y_data[:, 0] == u)[0]
        items_rated_by_u = self.Y_data[ids, 1].tolist()
        recommended_items = []
        for i in range(self.n_items):
            if i not in items_rated_by_u:
                rating = self.__pred(u, i)
                if rating > 0:
                    recommended_items.append(i)

        return recommended_items

    def print_recommendation(self):
        """
        print all items which should be recommended for each user
        """
        print('Recommendation:')
        for u in range(self.n_users):
            recommended_items = self.recommend(u)
            if self.uuCF:
                print('Recommend item(s):', recommended_items, 'for user', u)
            else:
                print('Recommend item', u, 'for user(s) : ', recommended_items)

### Processing

In [None]:
start_time = time.time()

In [None]:
rate_train_cf = ratings_base.to_numpy().copy()
rate_test_cf = ratings_test.to_numpy().copy()

# indices start from 0
rate_train_cf[:, :2] -= 1
rate_test_cf[:, :2] -= 1

In [None]:
rs = CF(rate_train_cf, k = 30, uuCF = 1)
rs.fit()

In [None]:
database_user = dict() # (user_id: [movie_id_1, movie_id_2, ...])
recommend_user = dict() # (user_id: [movie_id_1, movie_id_2, ...])

for user_id in np.unique(ratings_test['user_id']):
  movie_ids, scores = get_items_rated_by_user(rate_test, user_id - 1)

  database = list()

  for id in movie_ids:
    pred = rs.pred(user_id - 1, id, normalized = 0)
    database.append([id + 1, pred])

  database_sort = np.array(sorted(database, key=lambda x: x[1], reverse=True)) # ranking

  database_user[user_id] = database_sort[:,0]
  recommend_user[user_id] = database_sort [:10, 0]

print(database_user[1])
print(recommend_user[1])

[ 12. 174.  64. 180. 267. 134.  98. 114. 100. 185. 150. 183. 189. 132.
 190. 272. 151.  23. 200. 262.  96. 228. 170. 265. 210.  61.  60. 258.
  69. 171. 193. 215. 157. 186. 208. 129. 209.  14.  97. 213. 242. 154.
 196. 188. 202.  56. 248.  10. 128.  70. 184.  24.  81.  82.  91.  92.
 221.  47. 241. 224. 177. 212. 175. 236.  31. 222. 164. 163. 201.  86.
   6. 125. 161. 250.  33. 206.  65. 113.  73. 143.  76. 117. 107. 253.
 218.  39. 226.  20.  62.  49. 233. 230. 219.  51. 160. 232. 227. 229.
 214.  54. 140. 159. 121. 108.  27. 255.  90.  72. 148. 235.  17.  84.
 102.  85.  44.  53. 259. 118.  80. 264.  67. 252. 266. 145. 225. 260.
 155.  74. 103.  36. 254. 130. 120. 243.  78. 112. 104.]
[ 12. 174.  64. 180. 267. 134.  98. 114. 100. 185.]


In [None]:
end_time = time.time()
processing_time = end_time - start_time

precision, recall, map = get_result(ratings_test, rate_test_dict, database_user, recommend_user)
results[1] = ["Collaborative Filtering", processing_time, precision, recall, map]
print(results)

[['Content-based System', 1.390753984451294, 0.6028322440087146, 0.4984686586226328, 0.7254508102966838], ['Collaborative Filtering', 13.84111475944519, 0.6873638344226579, 0.5391703428404729, 0.8348313756229515], None]


## Graph-based System
Reference: Xiangnan He, Kuan Deng, Xiang Wang, Yan Li, Yongdong Zhang, Meng Wang. "LightGCN: Simplifying and Powering Graph Convolution Network for Recommendation." Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 639-648, 2020.
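The core of LightGCN is a propagation rule without feature transforms or non-linearities: each layer replaces a node's embedding with the symmetrically normalised sum of its neighbours' embeddings, and the final embedding is the uniform average over all layers. The sketch below illustrates that rule in plain Python on a tiny user-item graph. It assumes unweighted edges and no self-loops, as in the paper; the model in this notebook differs in that it calls `gcn_norm` with self-loops enabled and feeds the concatenated user and item embeddings into a linear layer to regress the rating. The function name `lightgcn_embeddings` and the toy data are made up for this example.

```python
import math


def lightgcn_embeddings(interactions, n_users, n_items, emb0, num_layers=3):
    """interactions: list of (user, item) pairs.
    emb0: initial embeddings, users first, then items.
    Returns the layer-averaged embeddings in the same order."""
    n = n_users + n_items

    # Undirected bipartite graph: item nodes are offset by n_users.
    edges = []
    for u, i in interactions:
        edges.append((u, n_users + i))
        edges.append((n_users + i, u))

    degree = [0] * n
    for src, _ in edges:
        degree[src] += 1

    dim = len(emb0[0])
    layers = [emb0]
    current = emb0
    for _ in range(num_layers):
        nxt = [[0.0] * dim for _ in range(n)]
        for src, dst in edges:
            # symmetric normalisation 1 / sqrt(deg(src) * deg(dst))
            w = 1.0 / math.sqrt(degree[src] * degree[dst])
            for d in range(dim):
                nxt[dst][d] += w * current[src][d]
        layers.append(nxt)
        current = nxt

    # Final embedding: uniform average over layers 0..num_layers.
    return [[sum(layer[v][d] for layer in layers) / len(layers)
             for d in range(dim)] for v in range(n)]


# Toy graph: 2 users, 2 items, 2-dimensional embeddings.
toy = lightgcn_embeddings(
    interactions=[(0, 0), (0, 1), (1, 1)],
    n_users=2, n_items=2,
    emb0=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 2.0]],
    num_layers=2,
)
for row in toy:
    print([round(v, 3) for v in row])
```

The only trainable parameters in this scheme are the initial embeddings `emb0`, which is what keeps the model light.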

### Model

In [None]:
class LightGCN(MessagePassing):

    def __init__(self, num_users, num_items, embedding_dim=32, K=3, add_self_loops=True, dropout_rate = 0.4):
        super().__init__()
        self.dropout_rate = dropout_rate
        self.num_users = num_users
        self.num_items = num_items
        self.embedding_dim = embedding_dim
        self.K = K
        self.add_self_loops = add_self_loops

        self.users_emb = nn.Embedding(num_embeddings=self.num_users, embedding_dim=self.embedding_dim)
        self.items_emb = nn.Embedding(num_embeddings=self.num_items, embedding_dim=self.embedding_dim)


        nn.init.kaiming_normal_(self.users_emb.weight)
        nn.init.kaiming_normal_(self.items_emb.weight)

        self.out = nn.Linear(embedding_dim + embedding_dim, 1)

    def forward(self, edge_index, edge_values):

        edge_index_norm = gcn_norm(edge_index=edge_index,
                                   add_self_loops=self.add_self_loops)

        emb_0 = torch.cat([self.users_emb.weight, self.items_emb.weight])

        embs = [emb_0]

        emb_k = emb_0

        for i in range(self.K):
            emb_k = self.propagate(edge_index=edge_index_norm[0], x = emb_k, norm=edge_index_norm[1])
            embs.append(emb_k)


        embs = torch.stack(embs, dim=1)

        emb_final = torch.mean(embs, dim=1)

        users_emb_final, items_emb_final = torch.split(emb_final, [self.num_users, self.num_items])

        r_mat_edge_index, _ = convert_adj_mat_edge_index_to_r_mat_edge_index(edge_index, edge_values)

        src, dest = r_mat_edge_index[0], r_mat_edge_index[1]

        user_embeds = users_emb_final[src]
        item_embeds = items_emb_final[dest]

        output = torch.cat([user_embeds, item_embeds], dim=1)

        output = self.out(output)
        return output

    def message(self, x_j, norm):
        return norm.view(-1,1)*x_j

### Utils

In [None]:
def load_edge_csv(df,
                  src_index_col,
                  dst_index_col,
                  link_index_col,
                  rating_threshold=3):
    edge_index = None
    src = [user_id for user_id in df[src_index_col]]

    num_users = len(df[src_index_col].unique())

    dst = [(movie_id) for movie_id in df[dst_index_col]]

    link_vals = df[link_index_col].values

    edge_attr = torch.from_numpy(df[link_index_col].values).view(-1, 1).to(torch.long) >= rating_threshold

    edge_values = []

    edge_index = [[], []]

    for i in range(edge_attr.shape[0]):
        if edge_attr[i]:
            edge_index[0].append(src[i])
            edge_index[1].append(dst[i])
            edge_values.append(link_vals[i])

    return edge_index, edge_values

In [None]:
def convert_r_mat_edge_index_to_adj_mat_edge_index(input_edge_index, input_edge_values):
    R = torch.zeros((num_users, num_movies))
    for i in range(len(input_edge_index[0])):
        row_idx = input_edge_index[0][i]
        col_idx = input_edge_index[1][i]
        R[row_idx][col_idx] = input_edge_values[i]

    R_transpose = torch.transpose(R, 0, 1)
    adj_mat = torch.zeros((num_users + num_movies, num_users + num_movies))
    adj_mat[:num_users, num_users:] = R.clone()
    adj_mat[num_users:, :num_users] = R_transpose.clone()

    adj_mat_coo = adj_mat.to_sparse_coo()
    adj_mat_coo_indices = adj_mat_coo.indices()
    adj_mat_coo_values = adj_mat_coo.values()

    return adj_mat_coo_indices, adj_mat_coo_values

In [None]:
def convert_adj_mat_edge_index_to_r_mat_edge_index(input_edge_index, input_edge_values):

    sparse_input_edge_index = SparseTensor(row=input_edge_index[0],
                                           col=input_edge_index[1],
                                           value=input_edge_values,
                                           sparse_sizes=((num_users+num_movies), num_users+num_movies))
    adj_mat = sparse_input_edge_index.to_dense()
    interact_mat = adj_mat[:num_users,num_users:]

    r_mat_edge_index = interact_mat.to_sparse_coo().indices()
    r_mat_edge_values = interact_mat.to_sparse_coo().values()

    return r_mat_edge_index, r_mat_edge_values

In [None]:
def get_recall_precision_at_k(input_edge_index,
                    input_edge_values,
                    pred_ratings,
                    k=10,
                    threshold=4):
    with torch.no_grad():
        user_item_rating_list = defaultdict(list)

        for i in range(len(input_edge_index[0])):
            src = input_edge_index[0][i].item()
            dest = input_edge_index[1][i].item()
            true_rating = input_edge_values[i].item()
            pred_rating = pred_ratings[i].item()

            user_item_rating_list[src].append((pred_rating, true_rating))

        recalls = dict()
        precisions = dict()
        average_precisions = dict()

        for user_id, user_ratings in user_item_rating_list.items():

            # rank the user's items by predicted rating, best first
            user_ratings.sort(key=lambda x: x[0], reverse=True)

            # relevant items among all items of the user
            n_rel = sum((true_r >= threshold) for (_, true_r) in user_ratings)

            # relevant items among the top k
            n_rec_k = sum((true_r >= threshold) for (_, true_r) in user_ratings[:k])

            # n_rec_k = sum((est >= threshold) for (est, _) in user_ratings[:k])

            # n_rel_and_rec_k = sum(
            #     ((true_r >= threshold) and (est >= threshold)) for (est, true_r) in user_ratings[:k]
            # )

            precisions[user_id] = n_rec_k/k if n_rec_k != 0 else 0
            recalls[user_id] = n_rec_k/n_rel if n_rel != 0 else 0

            # compute AP; note that it runs over the user's full ranked list,
            # not only over the top k
            relevant = 0.0
            average_precision = 0.0
            number_retrieve = 0

            for item in user_ratings:

                number_retrieve += 1
                if item[1] < threshold:
                    continue

                relevant += 1
                average_precision += (relevant/number_retrieve)

            if relevant == 0:
                average_precisions[user_id] = 0
            else:
                average_precisions[user_id] = average_precision/relevant

        overall_recall = sum(recalls.values())/len(recalls)
        overall_precision = sum(precisions.values())/len(precisions)
        overall_map = sum(average_precisions.values())/len(average_precisions)

        return overall_recall, overall_precision, overall_map

### Processing

In [None]:
start_time = time.time()

In [None]:
""" dataset preparation """
rating_df = pd.read_csv('ml-100k/u.data', sep='\t', names=['userId', 'movieId', 'rating', 'timestamp'])
train = pd.read_csv('ml-100k/u1.base', sep='\t', names=['userId', 'movieId', 'rating', 'timestamp'])
test = pd.read_csv('ml-100k/u1.test', sep='\t', names=['userId', 'movieId', 'rating', 'timestamp'])

In [None]:
num_users = len(rating_df['userId'].unique())
num_movies = len(rating_df['movieId'].unique())

print(f'num users {num_users}, num_movies {num_movies}')

num users 943, num_movies 1682


In [None]:
""" label encoding """
lbl_user = preprocessing.LabelEncoder()
lbl_movie = preprocessing.LabelEncoder()

rating_df.userId = lbl_user.fit_transform(rating_df.userId.values)
rating_df.movieId = lbl_movie.fit_transform(rating_df.movieId.values)

train.userId = lbl_user.transform(train.userId)
train.movieId = lbl_movie.transform(train.movieId)

test.userId = lbl_user.transform(test.userId)
test.movieId = lbl_movie.transform(test.movieId)

In [None]:
""" data processing """
train_edge_index, train_edge_values = load_edge_csv(
    train,
    src_index_col='userId',
    dst_index_col='movieId',
    link_index_col='rating',
    rating_threshold=1
)

test_edge_index, test_edge_values = load_edge_csv(
    test,
    src_index_col='userId',
    dst_index_col='movieId',
    link_index_col='rating',
    rating_threshold=1
)

train_edge_index = torch.LongTensor(train_edge_index)
train_edge_values = torch.tensor(train_edge_values)

test_edge_index = torch.LongTensor(test_edge_index)
test_edge_values = torch.tensor(test_edge_values)

train_edge_index, train_edge_values = convert_r_mat_edge_index_to_adj_mat_edge_index(train_edge_index, train_edge_values)
test_edge_index, test_edge_values = convert_r_mat_edge_index_to_adj_mat_edge_index(test_edge_index, test_edge_values)

r_mat_train_edge_index, r_mat_train_edge_values = convert_adj_mat_edge_index_to_r_mat_edge_index(train_edge_index, train_edge_values)
r_mat_test_edge_index, r_mat_test_edge_values = convert_adj_mat_edge_index_to_r_mat_edge_index(test_edge_index, test_edge_values)

In [None]:
""" model configuration """
ITERATIONS = 10000
LR = 1e-3
ITER_PER_EVAL = 1000
ITERS_PER_LR_DECAY = 1000
K = 10
LAMBDA = 1e-6

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

""" setup model """
model = LightGCN(num_users=num_users, num_items=num_movies)
model = model.to(device)
model.train()

""" setup optimizer """
optimizer = optim.Adam(model.parameters(), lr = LR, weight_decay=0.01)

""" setup scheduler """
scheduler = optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)

train_edge_index = train_edge_index.to(device)
test_edge_index = test_edge_index.to(device)

""" setup loss """
loss_func = nn.MSELoss()



In [None]:
""" training part """
train_losses = []
val_losses = []
val_recall_at_ks = []
best_val_loss = torch.inf
for iter in tqdm(range(1, ITERATIONS+1)):
    pred_ratings = model.forward(train_edge_index.to(device), train_edge_values.to(device))
    train_loss = loss_func(pred_ratings, r_mat_train_edge_values.view(-1,1).to(device))

    optimizer.zero_grad()
    train_loss.backward()
    optimizer.step()

    if iter % ITER_PER_EVAL == 0:
        model.eval()

        with torch.no_grad():
            val_pred_ratings = model.forward(test_edge_index.to(device), test_edge_values.to(device))
            val_loss = loss_func(val_pred_ratings, r_mat_test_edge_values.view(-1,1).to(device)).sum()
            recall_at_k, precision_at_k, map = get_recall_precision_at_k(r_mat_test_edge_index,
                                                            r_mat_test_edge_values,
                                                            val_pred_ratings,
                                                            k=10)
            if val_loss < best_val_loss:
                best_val_loss = val_loss
                torch.save(model, 'best_weight.pt')
            val_recall_at_ks.append(round(recall_at_k, 5))
            train_losses.append(train_loss.item())
            val_losses.append(val_loss.item())

            print(f'[Iteration {iter}/{ITERATIONS}], train_loss: {round(train_loss.item(), 5)}, val_loss: {round(val_loss.item(), 5)}, recall_at_k: {round(recall_at_k, 5)}, precision_at_k: {round(precision_at_k, 5)}')

        model.train()

    if iter % ITERS_PER_LR_DECAY == 0 and iter != 0:
        scheduler.step()


[Iteration 1000/10000], train_loss: 1.38441, val_loss: 1.51314, recall_at_k: 0.52145, precision_at_k: 0.65556
[Iteration 2000/10000], train_loss: 1.19866, val_loss: 1.32983, recall_at_k: 0.52783, precision_at_k: 0.66819
[Iteration 3000/10000], train_loss: 1.11435, val_loss: 1.23946, recall_at_k: 0.53121, precision_at_k: 0.67342
[Iteration 4000/10000], train_loss: 1.06744, val_loss: 1.17686, recall_at_k: 0.53234, precision_at_k: 0.67582
[Iteration 5000/10000], train_loss: 1.04501, val_loss: 1.13714, recall_at_k: 0.53287, precision_at_k: 0.67712
[Iteration 6000/10000], train_loss: 1.04422, val_loss: 1.12462, recall_at_k: 0.53363, precision_at_k: 0.67778
[Iteration 7000/10000], train_loss: 1.06557, val_loss: 1.13988, recall_at_k: 0.53932, precision_at_k: 0.6854
[Iteration 8000/10000], train_loss: 1.1017, val_loss: 1.1731, recall_at_k: 0.53893, precision_at_k: 0.68627
[Iteration 9000/10000], train_loss: 1.12678, val_loss: 1.19761, recall_at_k: 0.53771, precision_at_k: 0.68562
[Iteration 10

In [None]:
# load pretrained model
model = torch.load('best_weight.pt')
model.eval()

# prediction
preds = model(test_edge_index, None)
preds

tensor([[3.4271],
        [3.6605],
        [4.1177],
        ...,
        [2.8620],
        [3.1731],
        [2.9968]], device='cuda:0', grad_fn=<AddmmBackward0>)

In [None]:
end_time = time.time()
processing_time = end_time - start_time

recall, precision, map = get_recall_precision_at_k(r_mat_test_edge_index, r_mat_test_edge_values, preds, threshold=4, k=10)
results[2] = ["Graph-based System", processing_time, precision, recall, map]
print(results)

[['Content-based System', 1.390753984451294, 0.6028322440087146, 0.4984686586226328, 0.7254508102966838], ['Collaborative Filtering', 13.84111475944519, 0.6873638344226579, 0.5391703428404729, 0.8348313756229515], ['Graph-based System', 93.24144864082336, 0.6777777777777789, 0.5336312096248179, 0.7642451454210383]]


# Final Result

In [None]:
final_results = pd.DataFrame(results, columns=['method', 'processing_time', 'precision@10', 'recall@10', 'map'])
final_results

Unnamed: 0,method,processing_time,precision@10,recall@10,map
0,Content-based System,1.390754,0.602832,0.498469,0.725451
1,Collaborative Filtering,13.841115,0.687364,0.53917,0.834831
2,Graph-based System,93.241449,0.677778,0.533631,0.764245


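**Remarks:**
- Collaborative Filtering gives slightly better results than the Content-based System (precision@10 0.687 vs 0.603, MAP 0.835 vs 0.725) -> combining information from other users is more effective
- The Graph-based System does not yet give convincing results (precision@10 0.678, MAP 0.764, with about 93 s of processing time) -> it needs more careful training and better-tuned hyperparameters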