## Recommender System

#### Davy

In [1]:
# Automatically reload local Python modules before each cell is executed
%load_ext autoreload
%autoreload 2

In [2]:
import json

import matplotlib.pyplot as plt
import numpy as np  # np.inf is used when selecting the best k

from Bias import *
from LatentFactor import *


In [3]:
# Read data from the "*.json" files.
# Each line of a file is one JSON object (a dictionary).
path1 = "goodreads_reviews_historybio_train.json"
path2 = "goodreads_reviews_historybio_test.json"
path3 = "goodreads_reviews_historybio_val.json"


def load_json_lines(path):
    """Return a list with one dictionary per line of the file at `path`."""
    try:
        # The `with` block closes the file, so no explicit close() is needed.
        with open(path, 'r') as f:
            return [json.loads(line) for line in f]
    except FileNotFoundError:
        print("file open failed.")
        return []


training = load_json_lines(path1)
test = load_json_lines(path2)
val = load_json_lines(path3)

### Section 1: Explore biases

##### Calculate the global bias $b_g$, user specific bias $b_i^{(user)}$ and item specific bias $b_j^{(item)}$ on the **training data**. Report:

##### (A) The global bias $b_g$

In [4]:
bg = Global_Bias(training)

print(f'The global bias is :{bg:.4f}')

The global bias is :3.7670


##### (B)The user specific bias of user id= “3913f3be1e8fadc1de34dc49dab06381”

In [5]:
usr_train = Dictionary(training, 'user_id')

In [6]:
spcific_uid = usr_train['3913f3be1e8fadc1de34dc49dab06381'] #Convert the "user id " to index.

bi  = User_Bias(training, usr_train) #Calculate the user bias for the whole dataset.
single_usr_bias = bi[spcific_uid] #Specific "user_id" bias

print(f'The user specific bias of user_id= “3913f3be1e8fadc1de34dc49dab06381” is :{single_usr_bias:.4f}')

The user specific bias of user_id= “3913f3be1e8fadc1de34dc49dab06381” is :-0.1139


##### (C) The item specific bias of book id = “16130”.

In [7]:
it_train = Dictionary(training, 'book_id')

In [8]:
specific_bid = it_train['16130'] #Convert the "book id " to index.

bj= Item_Bias(training, it_train) #Calculate the item bias for the whole dataset.
single_book_bias = bj[specific_bid]#Specific "book_id" bias

print(f'The item specific bias of book_id = “16130” is :{single_book_bias:.4f}')

The item specific bias of book_id = “16130” is :0.4563
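
The helpers `Global_Bias`, `User_Bias` and `Item_Bias` come from `Bias.py`, which is not shown here. As a minimal sketch, assuming the common mean-offset definitions (global bias = mean rating, user bias = mean of that user's ratings minus $b_g$, item bias = mean of that item's ratings minus $b_g$), the three quantities could be computed as follows. The function name `compute_biases` and the toy records are hypothetical, and the definitions in `Bias.py` may differ.

```python
from collections import defaultdict


def compute_biases(reviews):
    """Return (b_g, user_bias, item_bias) under the assumed mean-offset definitions."""
    # Global bias: the mean of all ratings.
    b_g = sum(r["rating"] for r in reviews) / len(reviews)

    # Group ratings by user and by item.
    user_ratings = defaultdict(list)
    item_ratings = defaultdict(list)
    for r in reviews:
        user_ratings[r["user_id"]].append(r["rating"])
        item_ratings[r["book_id"]].append(r["rating"])

    # Bias = group mean minus the global mean (an assumption, see the lead-in).
    user_bias = {u: sum(v) / len(v) - b_g for u, v in user_ratings.items()}
    item_bias = {i: sum(v) / len(v) - b_g for i, v in item_ratings.items()}
    return b_g, user_bias, item_bias


# Toy data, made up for illustration only.
toy = [
    {"user_id": "u1", "book_id": "b1", "rating": 5},
    {"user_id": "u1", "book_id": "b2", "rating": 3},
    {"user_id": "u2", "book_id": "b1", "rating": 4},
]
b_g, user_bias, item_bias = compute_biases(toy)
print(b_g, user_bias, item_bias)
```

On the toy data the global bias is 4.0, both user biases are 0.0, and the item biases are 0.5 for `b1` and -1.0 for `b2`.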


### Section 2: Implement the regularized latent factor model without bias using SGD

##### (A) Implement the regularized latent factor model without considering the bias. The optimization problem that needs to be solved is (see slide 8 of W9.2 lecture notes):
<center>$\min_{\mathbf{P},\mathbf{Q}} \sum_{r_{ij}\in R}(r_{ij} - \mathbf{q}^T_i \cdot \mathbf{p}_j)^2 + \lambda_1 \sum_{i \in U}\|\mathbf{q}_i\|^2_2 + \lambda_2 \sum_{j \in P}\|\mathbf{p}_j\|^2_2$</center>
<br>

The initialization of **P** and **Q** should be random, from a normal distribution. Set the number of latent factors to $k = 8$. Use Stochastic Gradient Descent (SGD) to solve the optimization problem on the **training data** (see slide 9 of W9.2 lecture notes). Run SGD for 10 iterations (also called epochs), with a fixed learning rate $\eta = 0.01$ and regularization hyperparameters $\lambda_1 = \lambda_2 = 0.3$. Remember that the regularization terms involve the L2-norms of the $q_i$ and $p_j$ vectors for each user $i$ and item $j$ respectively.
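
The notebook's `SGD_LFM` lives in `LatentFactor.py`, which is not shown. The following is only a sketch of what one SGD epoch for this objective might look like, in plain Python. It assumes the usual update rule with the factor 2 of the gradient absorbed into $\eta$, visits the ratings in their given order without shuffling, and uses hypothetical names (`init_factors`, `sgd_epoch`) and made-up toy ratings.

```python
import random


def init_factors(n_rows, k, rng):
    """Draw an n_rows x k matrix from a standard normal distribution."""
    return [[rng.gauss(0.0, 1.0) for _ in range(k)] for _ in range(n_rows)]


def sgd_epoch(ratings, Q, P, lam1, lam2, eta):
    """One pass over (user, item, rating) triples, updating Q and P in place."""
    for i, j, r in ratings:
        q_i, p_j = Q[i], P[j]
        # Prediction error for this rating, computed before any update.
        err = r - sum(a * b for a, b in zip(q_i, p_j))
        for f in range(len(q_i)):
            q_old = q_i[f]  # keep the pre-update value for the p_j update
            q_i[f] += eta * (err * p_j[f] - lam1 * q_old)
            p_j[f] += eta * (err * q_old - lam2 * p_j[f])


# Toy usage with the hyperparameters from the task (k = 8, 10 epochs).
rng = random.Random(0)
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0)]  # made-up triples
Q = init_factors(2, 8, rng)  # one row per user
P = init_factors(2, 8, rng)  # one row per item
for epoch in range(10):
    sgd_epoch(ratings, Q, P, lam1=0.3, lam2=0.3, eta=0.01)
```

Storing the old value of each $q_i$ component before overwriting it keeps the two updates simultaneous, so the $p_j$ update uses the same $q_i$ that produced the error.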

Report the RMSE on the training data for each epoch, by using the RMSE formula (see slide 36 of W8 lecture notes):

<center>$RMSE=\sqrt{\frac{1}{|R|} \sum_{r_{ij}\in R}(r_{ij}-\hat{r}_{ij})^2}$</center>
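
As an illustration of the formula, here is a small sketch that computes the RMSE for predictions $\hat{r}_{ij} = \mathbf{q}_i^T \cdot \mathbf{p}_j$. The function name `rmse`, the list-of-lists representation of **Q** and **P**, and the toy numbers are assumptions made for this example and are not taken from `LatentFactor.py`.

```python
import math


def rmse(ratings, Q, P):
    """Root-mean-square error of q_i . p_j over (user, item, rating) triples."""
    squared_error = 0.0
    for i, j, r in ratings:
        pred = sum(a * b for a, b in zip(Q[i], P[j]))
        squared_error += (r - pred) ** 2
    return math.sqrt(squared_error / len(ratings))


# Toy example: the errors are 2 and 0, so the RMSE is sqrt((4 + 0) / 2) = sqrt(2).
Q = [[1.0, 0.0], [0.0, 2.0]]
P = [[3.0, 1.0], [1.0, 1.0]]
ratings = [(0, 0, 5.0), (1, 1, 2.0)]
print(rmse(ratings, Q, P))  # approximately 1.4142
```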

In [9]:
#P, Q, R initialisation 

Factor = 8

q_train = Latent_factor(len(usr_train), Factor) 
p_train = Latent_factor(len(it_train), Factor)
r_train = Interaction_dataframe(training, usr_train, it_train )

In [10]:
print(f'Overall, the rating system looks like: \n{r_train}')

Overall, the rating system looks like: 
         user_id  book_id  rating
0          22460     3223       5
1         160446    59273       5
2         181493    66973       3
3          84669   161052       2
4           8478    55311       3
...          ...      ...     ...
1239710   194226   175926       5
1239711   174601   185828       3
1239712     6216    10967       5
1239713    55949   116015       5
1239714   194242   106531       5

[1239715 rows x 3 columns]


In [11]:
RMSE_train = SGD_LFM(r_train, q_train, p_train, Factor, 10, 0.3, 0.3, 0.01)

In [12]:
print(f'the RMSE on the training data for each epoch are : {RMSE_train}')

the RMSE on the training data for each epoch are : {1: 4.363282, 2: 3.661844, 3: 3.069793, 4: 2.557558, 5: 2.175528, 6: 1.888099, 7: 1.669095, 8: 1.500498, 9: 1.369611, 10: 1.26722}


##### (B) Use SGD to train the latent factor model on the **training data** for different values of $k$ in {4, 8, 16}. For each value of $k$, train the model for 10 epochs/iterations. Report the **RMSE** for each value of $k$ on the **validation data**. Pick the model that results in the best **RMSE** on the **validation set** and report its **RMSE** on the test data.
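
The selection protocol described in this task (fit on the training data, score on the validation data, keep the $k$ with the lowest validation RMSE) can be sketched generically as below. `select_best_k`, `fit` and `score` are hypothetical names, and the scores in the usage example are made-up placeholders, not results from this notebook.

```python
def select_best_k(candidates, fit, score):
    """Fit one model per k and return (best_k, {k: validation score})."""
    scores = {}
    for k in candidates:
        model = fit(k)            # expected to train on the training split only
        scores[k] = score(model)  # expected to evaluate on the validation split only
    # Lower RMSE is better, so pick the k with the smallest score.
    best_k = min(scores, key=scores.get)
    return best_k, scores


# Stand-in callables with made-up scores, only to show the calling convention.
fake_validation_rmse = {4: 1.2, 8: 0.9, 16: 1.0}
best_k_demo, scores_demo = select_best_k(
    [4, 8, 16],
    fit=lambda k: k,                            # a real fit would run SGD here
    score=lambda model: fake_validation_rmse[model],
)
print(best_k_demo, scores_demo)
```

In a real run, `fit` would wrap the SGD training on the training interactions and `score` would compute the RMSE of the trained factors on the validation interactions, so the test data is touched only once, by the selected model.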


In [13]:
Factor_list= [4, 8, 16]
RMSE = {}

# Index users and items over the combined training, validation and test sets
# so that every id in any split has a row in Q or P.
total = training + val + test
usr_total = Dictionary(total, 'user_id')
it_total = Dictionary(total, 'book_id')

r_val = Interaction_dataframe(val, usr_total, it_total)

for k in Factor_list:

    # Randomly initialise the latent factors Q (users) and P (items) for this k
    q_total = Latent_factor(len(usr_total), k)
    p_total = Latent_factor(len(it_total), k)

    # Run SGD on the validation interactions and record the per-epoch RMSE
    RMSE[k] = SGD_LFM(r_val, q_total, p_total, k, 10, 0.3, 0.3, 0.01)

In [14]:
#Report the RMSE for the val set

for key, value in RMSE.items():
   
    print(f'When the k is {key},the RMSE is:\n')
    print(f'{RMSE[ key ]}\n')

When the k is 4,the RMSE is:

{1: 4.254102, 2: 3.833583, 3: 3.619951, 4: 3.417427, 5: 3.183338, 6: 2.925594, 7: 2.669875, 8: 2.4321, 9: 2.218197, 10: 2.028968}

When the k is 8,the RMSE is:

{1: 4.524966, 2: 3.6585, 3: 3.303403, 4: 3.027226, 5: 2.762207, 6: 2.493667, 7: 2.232126, 8: 1.992416, 9: 1.782243, 10: 1.602799}

When the k is 16,the RMSE is:

{1: 5.041389, 2: 3.240417, 3: 2.757906, 4: 2.419278, 5: 2.120636, 6: 1.851766, 7: 1.619825, 8: 1.427947, 9: 1.273263, 10: 1.150074}



In [15]:
#Select the minimal RMSE and return k

min_error = np.inf

for key, values in RMSE.items():
  
    l_key =list(RMSE[ key ].keys())[ -1 ]
    
    if RMSE[ key ][ l_key ] < min_error:
        best_k = key
        min_error =  RMSE[ key ][ l_key ] 

print(f'The best k value is {best_k}')

The best k value is 16


In [16]:
r_test = Interaction_dataframe(test, usr_total, it_total ) 

# Run SGD with the "best_k" selected on the validation set on the test interactions
RMSE_test= SGD_LFM(r_test, q_total, p_total, best_k ,10, 0.3, 0.3, 0.01)

In [17]:
print(f'the RMSE on the test data for each epoch are : {RMSE_test}')

the RMSE on the test data for each epoch are : {1: 4.26514, 2: 2.810147, 3: 2.195511, 4: 1.795294, 5: 1.508608, 6: 1.29784, 7: 1.141721, 8: 1.025349, 9: 0.937787, 10: 0.871092}



### Section 3: Implement the regularized latent factor model with bias using SGD

##### (A) Incorporate the bias terms $b_g$, $b_i^{(user)}$ and $b_j^{(item)}$ to the latent factor model. The optimization problem that needs to be solved is (see slide 11 of W9.2 lecture notes):

<center>$\min_{\mathbf{P},\mathbf{Q},b_i,b_j} \sum_{r_{ij}\in R}(r_{ij} - \mathbf{q}^T_i \cdot \mathbf{p}_j - b_{ij})^2 + \lambda_1 \sum_{i \in U}\|\mathbf{q}_i\|^2_2 + \lambda_2 \sum_{j \in P}\|\mathbf{p}_j\|^2_2 + \lambda_3 \sum_{i \in U}(b_i^{(user)})^2 + \lambda_4 \sum_{j \in P}(b_j^{(item)})^2$</center>
<br/>

The initialization of **P** and **Q** should be random, from a normal distribution. Initialize the user bias terms $b_i^{(user)}$ and item bias terms $b_j^{(item)}$ using the values computed in Task 1. Set the number of latent factors to $k = 8$. Run SGD for 10 epochs with a fixed learning rate $\eta = 0.01$ and regularization hyperparameters $\lambda_1 = \lambda_2 = \lambda_3 = \lambda_4 = 0.3$. Report the RMSE on the training data for each epoch. After finishing all epochs, report the learned user-specific bias of the user with user_id = “3913f3be1e8fadc1de34dc49dab06381”, and the learned item-specific bias of the book with book_id = “16130”.
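
A single SGD update for the biased model might look like the sketch below. It assumes that $b_{ij} = b_g + b_i^{(user)} + b_j^{(item)}$, that $b_g$ stays fixed (the objective minimizes only over **P**, **Q**, $b_i$ and $b_j$), and that the factor 2 of the gradient is absorbed into $\eta$. The function name `sgd_step_bias` and the toy values are hypothetical, and the notebook's `SGD_LFM_bias` may be implemented differently.

```python
def sgd_step_bias(r, q_i, p_j, b_g, b_user, b_item, lams, eta):
    """Update q_i and p_j in place for one rating; return the new (b_user, b_item)."""
    lam1, lam2, lam3, lam4 = lams
    # Prediction = assumed bias decomposition + latent factor interaction.
    pred = b_g + b_user + b_item + sum(a * b for a, b in zip(q_i, p_j))
    err = r - pred
    for f in range(len(q_i)):
        q_old = q_i[f]  # keep the pre-update value for the p_j update
        q_i[f] += eta * (err * p_j[f] - lam1 * q_old)
        p_j[f] += eta * (err * q_old - lam2 * p_j[f])
    # The bias terms follow the same pattern: error signal minus shrinkage.
    new_b_user = b_user + eta * (err - lam3 * b_user)
    new_b_item = b_item + eta * (err - lam4 * b_item)
    return new_b_user, new_b_item


# Toy step with made-up values: the prediction is 3.0 + 0.5 - 0.5 + 1.0 = 4.0, so err = 1.0.
q, p = [1.0], [1.0]
b_u, b_i = sgd_step_bias(5.0, q, p, b_g=3.0, b_user=0.5, b_item=-0.5,
                         lams=(0.3, 0.3, 0.3, 0.3), eta=0.01)
print(q, p, b_u, b_i)
```

In this toy step both factors move from 1.0 to 1.007, the user bias from 0.5 to 0.5085 and the item bias from -0.5 to -0.4885, all in the direction that reduces the error.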


In [18]:
RMSE_2, Bi_trained, Bj_trained, P_trained, Q_trained = SGD_LFM_bias(r_train, q_train, p_train, bg, bi, bj, Factor, 10, 0.3, 0.3, 0.3, 0.3, 0.01)

In [19]:
print(f'the RMSE on the training data with bias terms for each epoch are : { RMSE_2 }')

the RMSE on the training data with bias terms for each epoch are : {1: 1.865964, 2: 1.111793, 3: 0.944602, 4: 0.876734, 5: 0.844411, 6: 0.827238, 7: 0.817255, 8: 0.810979, 9: 0.806748, 10: 0.803716}


In [20]:
single_usr_bias2 = Bi_trained[spcific_uid]
b_uid_trained = single_usr_bias2 

print(f'The user specific bias of user_id= “3913f3be1e8fadc1de34dc49dab06381 ” after finishing all epoches is :{b_uid_trained:.4f}')

The user specific bias of user_id= “3913f3be1e8fadc1de34dc49dab06381 ” after finishing all epoches is :0.0013


In [21]:
single_book_bias2 = Bj_trained[specific_bid]
b_itid_trained = single_book_bias2 

print(f'The item specific bias of book_id = “16130”  after finishing all epoches is :{b_itid_trained:.4f}')

The item specific bias of book_id = “16130”  after finishing all epoches is :0.3587


##### (B) Similar to Task 2 (B), find the best $k$ in {4, 8, 16} for the model you developed in Task 3 (A) on the validation set, by using **RMSE** to compare across these models, and apply the best of these models to the test data. Compare the resulting test **RMSE** with Task 2 (B). Analyse and explain your findings.


In [22]:
RMSE_2 = {}

# Initialise the bias terms bg, bi and bj from the validation set.
# P and Q were initialised in 2(B).
bi_val  = User_Bias(val, usr_total)
bj_val = Item_Bias(val, it_total)
bg_val = Global_Bias( val )
 
for k in Factor_list:

    RMSE_2[ k ] = SGD_LFM_bias(r_val, q_total, p_total, bg_val, bi_val, bj_val, k, 10, 0.3, 0.3, 0.3, 0.3, 0.01)[0]

In [23]:
#Report the RMSE for the val set

for key, value in RMSE_2.items():
   
    print(f'When the k is {key},the RMSE is:\n')
    print(f'{RMSE_2[ key ]}\n')

When the k is 4,the RMSE is:

{1: 2.606012, 2: 2.054839, 3: 1.806251, 4: 1.645682, 5: 1.530271, 6: 1.442324, 7: 1.372627, 8: 1.315729, 9: 1.26815, 10: 1.227553}

When the k is 8,the RMSE is:

{1: 1.136526, 2: 1.039368, 3: 0.992343, 4: 0.960732, 5: 0.936682, 6: 0.917033, 7: 0.900244, 8: 0.885468, 9: 0.872203, 10: 0.86013}

When the k is 16,the RMSE is:

{1: 0.724968, 2: 0.632431, 3: 0.610508, 4: 0.601093, 5: 0.595955, 6: 0.59266, 7: 0.59029, 8: 0.588447, 9: 0.586936, 10: 0.585652}



In [24]:
#Select the minimal RMSE and return k

min_error = np.inf

for key, values in RMSE_2.items():
  
    l_key =list(RMSE_2[ key ].keys())[ -1 ]
    
    if RMSE_2[ key ][ l_key ] < min_error:
        best_k_2 = key
        min_error =  RMSE_2[ key ][ l_key ] 

print(f'The best k value is {best_k_2}')

The best k value is 16


In [25]:
# Initialise the bias terms bg, bi and bj for test set.
bi_test  = User_Bias(test, usr_total)
bj_test = Item_Bias(test, it_total)
bg_test = Global_Bias( test )

# Run SGD with the "best_k_2" selected on the validation set on the test interactions
RMSE_test2 = SGD_LFM_bias(r_test, q_total, p_total, bg_test, bi_test, bj_test, best_k_2, 10, 0.3, 0.3, 0.3, 0.3, 0.01)[0]

In [26]:
print(f'the RMSE on the test data with bias terms for each epoch are : {RMSE_test2}')

the RMSE on the test data with bias terms for each epoch are : {1: 1.66526, 2: 1.021557, 3: 0.803163, 4: 0.715255, 5: 0.675585, 6: 0.655299, 7: 0.643523, 8: 0.635851, 9: 0.630354, 10: 0.626117}


In [27]:
print('Compare the resulting test RMSE with Task 2.\n')

print('\t\tNo bias term\tWith bias term \n')

for key in range(10):
    value1 = RMSE_test[key + 1]
    value2 = RMSE_test2[key + 1]
    print(f'Epoch {key + 1}: \t {value1} \t{value2}')

Compare the resulting test RMSE with Task 2.

		No bias term	With bias term 

Epoch 1: 	 4.26514 	1.66526
Epoch 2: 	 2.810147 	1.021557
Epoch 3: 	 2.195511 	0.803163
Epoch 4: 	 1.795294 	0.715255
Epoch 5: 	 1.508608 	0.675585
Epoch 6: 	 1.29784 	0.655299
Epoch 7: 	 1.141721 	0.643523
Epoch 8: 	 1.025349 	0.635851
Epoch 9: 	 0.937787 	0.630354
Epoch 10: 	 0.871092 	0.626117


##### **Comments**: 



The test RMSE in 3(B) is significantly better than in 2(B): after 10 epochs it is 0.626117 with bias terms versus 0.871092 without.

##### **Introducing bias terms** into latent factor models can improve the performance of the model. 

This is because the bias terms absorb systematic offsets in the ratings: the global average, a user's tendency to rate higher or lower than average, and an item's tendency to be rated higher or lower than average. The latent factors then only need to model the remaining user-item interaction, which enhances the model's predictive accuracy. Each user and each item has its own bias term, which is crucial for predicting a user's interest in a specific item.
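
To make the role of the bias terms concrete, the short sketch below combines the three values reported in Section 1 into a bias-only baseline prediction, assuming the additive form $b_g + b_i^{(user)} + b_j^{(item)}$. It is purely illustrative: whether this user actually rated this book is not checked here.

```python
b_g = 3.7670      # global bias reported in Section 1 (A)
b_user = -0.1139  # initial bias of user 3913f3be1e8fadc1de34dc49dab06381
b_item = 0.4563   # initial bias of book 16130

# Bias-only baseline: what the model predicts before any latent factors contribute.
baseline = b_g + b_user + b_item
print(f"{baseline:.4f}")  # prints 4.1094
```

Even without latent factors, the prediction already sits close to a plausible rating, which is consistent with the much lower first-epoch RMSE of the biased model.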

In [28]:
Bi_trained

user_id
0         0.014409
1        -0.565659
2        -0.370377
3         0.328671
4        -0.111219
            ...   
196661   -0.128537
196662    1.086705
196663    0.255969
196664    0.030883
196665   -0.075127
Length: 196666, dtype: float64

In [29]:
Bj_trained

book_id
0         0.004367
1         0.145095
2         0.930291
3        -0.353638
4         0.526738
            ...   
232098    0.084968
232099   -1.683639
232100    1.030946
232101    0.536694
232102    0.076925
Length: 232103, dtype: float64