This notebook is a worked example of how collaborative filtering works.

We have a sparse ratings matrix and want to predict the "ratings" of items that a user hasn't seen yet. Learning every entry of the matrix directly is one approach, but it becomes infeasible as the matrix grows.

A better, more traditional approach is matrix factorization: we learn two smaller matrices whose product approximates the ratings matrix. For realistically large matrices they need far fewer parameters, and their size grows more slowly than the original sparse matrix as users and items are added. We randomly initialize these smaller matrices and use gradient descent on the ratings we do know to tune them.
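
To make the size argument concrete, here is a minimal sketch that counts parameters both ways. The function names and the 1000 x 5000 example are hypothetical, chosen only for illustration.

```python
def full_matrix_params(num_users, num_items):
    """Number of entries if we learn every cell of the ratings matrix."""
    return num_users * num_items


def factorized_params(num_users, num_items, num_factors):
    """Entries in a (users x k) matrix plus a (k x items) matrix."""
    return num_factors * (num_users + num_items)


# The toy 3 x 5 matrix used in this notebook with k = 2: no saving yet.
print(full_matrix_params(3, 5), factorized_params(3, 5, 2))  # 15 16

# A hypothetical larger problem: 1000 users, 5000 items, k = 20.
print(full_matrix_params(1000, 5000), factorized_params(1000, 5000, 20))  # 5000000 120000
```

The toy matrix is too small to benefit, but the gap opens quickly: the full matrix grows with the product of users and items, the factors only with their sum.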

We are going to use L2 regularization to reduce variance and keep the parameters from growing too large. I will also show a comparison.
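
As a sketch of what an L2-regularized factorization objective looks like, up to constant factors. The notation is introduced here and is not from the notebook: $\Omega$ is the set of known (user, item) pairs, $r_{ij}$ a known rating, $u_i$ and $v_j$ the factor vectors for user $i$ and item $j$, and $\lambda$ the regularization strength.

```latex
\min_{U, V} \; \frac{1}{|\Omega|} \sum_{(i,j) \in \Omega} \left( r_{ij} - u_i^\top v_j \right)^2
\; + \; \lambda \left( \lVert U \rVert_F^2 + \lVert V \rVert_F^2 \right)
```

Setting $\lambda = 0$ gives the unregularized fit; a larger $\lambda$ pulls the factors toward zero.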


In [29]:
import numpy as np

# let's create an example sparse matrix; None marks a rating we don't have

ratings = np.array(
    [[1, None, 3, 4, 3],
     [2, 1, 2, 1, None],
     [1, None, 3, 4, 3]]
)

print(ratings)
print(ratings.shape)


[[1 None 3 4 3]
 [2 1 2 1 None]
 [1 None 3 4 3]]
(3, 5)


In [None]:
# now that we have our ratings matrix, we can plan the smaller matrices for users and items
# their dimensions need to match: (m x n) @ (n x p) = (m x p)
# so if ratings is 3 x 5, the first matrix should be 3 x k and the second should be k x 5

# we choose k = 2 for this example
k = 2

# before we can compute predicted ratings, we need to train U and V on the known ratings
# we use gradient descent for this, but first we need to filter for the known ratings
# by iterating through the ratings matrix
data = []
for i in range(ratings.shape[0]):
    for j in range(ratings.shape[1]):
        if ratings[i][j] is not None:
            data.append((i,j,ratings[i][j])) # (user_index, item_index, rating)
data = np.array(data)
print(data)

[[0 0 1]
 [0 2 3]
 [0 3 4]
 [0 4 3]
 [1 0 2]
 [1 1 1]
 [1 2 2]
 [1 3 1]
 [2 0 1]
 [2 2 3]
 [2 3 4]
 [2 4 3]]
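
The same filtering can be done without explicit loops. This is a minimal sketch under one assumption that differs from the cell above: missing ratings are stored as `np.nan` in a float array instead of `None` in an object array, which lets NumPy build a boolean mask.

```python
import numpy as np

# Same toy matrix, with np.nan (an assumption of this sketch) marking unknown ratings.
ratings_nan = np.array(
    [[1, np.nan, 3, 4, 3],
     [2, 1, 2, 1, np.nan],
     [1, np.nan, 3, 4, 3]]
)

known = ~np.isnan(ratings_nan)           # True where a rating exists
user_idx, item_idx = np.nonzero(known)   # row-major order, same as the nested loops
triples = np.column_stack([user_idx, item_idx, ratings_nan[known].astype(int)])
print(triples.shape)  # (12, 3)
```

The rows of `triples` come out in the same (user_index, item_index, rating) order as the loop version.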


In [5]:
# use scikit-learn for the train-test split; PyTorch (imported in the next cell) handles gradient descent
from sklearn.model_selection import train_test_split

train, test = train_test_split(data, test_size=0.3, random_state=42)
print("Train Data:", train)
print("Test Data:", test)

Train Data: [[1 1 1]
 [0 3 4]
 [0 2 3]
 [2 4 3]
 [1 0 2]
 [1 3 1]
 [0 4 3]
 [1 2 2]]
Test Data: [[2 3 4]
 [2 2 3]
 [0 0 1]
 [2 0 1]]
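
With so few ratings, it is worth checking how many training examples each user and item kept after the random split. A minimal sketch, with the training rows copied from the output above:

```python
import numpy as np

# Training triples from the split printed above: (user_index, item_index, rating).
train_rows = np.array([[1, 1, 1], [0, 3, 4], [0, 2, 3], [2, 4, 3],
                       [1, 0, 2], [1, 3, 1], [0, 4, 3], [1, 2, 2]])

ratings_per_user = np.bincount(train_rows[:, 0], minlength=3)
ratings_per_item = np.bincount(train_rows[:, 1], minlength=5)
print("per user:", ratings_per_user)  # per user: [3 4 1]
print("per item:", ratings_per_item)  # per item: [1 1 2 2 2]
```

User 2 keeps only one training rating while three of the four test ratings belong to that user, so its factor vector is barely constrained by the training data.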


In [None]:
import torch
import torch.nn as nn

# custom PyTorch model that holds the two factor matrices
class FactorizationModel(nn.Module):
    def __init__(self, num_users, num_items, num_factors):
        super().__init__()
        # the smaller matrices, stored as embedding tables; nn.Embedding
        # initializes its weights randomly from a standard normal distribution
        self.U = nn.Embedding(num_users, num_factors)  # 3 x 2
        self.V = nn.Embedding(num_items, num_factors)  # 5 x 2 (transposed to 2 x 5 when multiplied)


    def forward(self,user_indices,item_indices):
        #this is how our predictions are computed
        user_factors = self.U(user_indices)
        item_factors = self.V(item_indices)
        # row-wise dot product of each user/item factor pair
        return (user_factors * item_factors).sum(dim=1)
    

model = FactorizationModel(num_users=3,num_items=5,num_factors=k)
# define optimizer and loss function; weight_decay is Adam's L2 penalty
optimizer = torch.optim.Adam(model.parameters(), lr=0.02, weight_decay=1e-5)
loss_fn = nn.MSELoss()
#training loop
num_epochs = 200
for epoch in range(num_epochs):
    #set the model to training mode
    model.train()
    #zero out the gradients
    optimizer.zero_grad()

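    # `targets` avoids shadowing the `ratings` matrix defined earlier
    user_indices = torch.tensor(train[:, 0], dtype=torch.long)
    item_indices = torch.tensor(train[:, 1], dtype=torch.long)
    targets = torch.tensor(train[:, 2], dtype=torch.float32)

    predictions = model(user_indices, item_indices)
    loss = loss_fn(predictions, targets)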
    loss.backward()
    optimizer.step()

    if (epoch+1) % 20 == 0:
        print(f"Epoch {epoch+1}/{num_epochs}, Loss: {loss.item():.4f}")
    
#evaluate on test set
model.eval()
with torch.no_grad():
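    # `targets` avoids shadowing the `ratings` matrix defined earlier
    user_indices = torch.tensor(test[:, 0], dtype=torch.long)
    item_indices = torch.tensor(test[:, 1], dtype=torch.long)
    targets = torch.tensor(test[:, 2], dtype=torch.float32)

    predictions = model(user_indices, item_indices)
    loss = loss_fn(predictions, targets)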
    print(f"Test Loss: {loss.item():.4f}")



Epoch 20/200, Loss: 4.8457
Epoch 40/200, Loss: 3.3065
Epoch 60/200, Loss: 2.6665
Epoch 80/200, Loss: 2.1397
Epoch 100/200, Loss: 1.5922
Epoch 120/200, Loss: 0.9444
Epoch 140/200, Loss: 0.3497
Epoch 160/200, Loss: 0.0669
Epoch 180/200, Loss: 0.0086
Epoch 200/200, Loss: 0.0009
Test Loss: 1.7717
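
The introduction mentions a comparison with and without L2 regularization. Below is a minimal sketch of one way to run it. It is a stand-in, not this notebook's training loop: it uses plain full-batch gradient descent in NumPy instead of Adam, and `fit_factors`, its default settings, and the value `l2=0.1` are hypothetical choices made here for illustration.

```python
import numpy as np

# Training triples from the split above: (user_index, item_index, rating).
train_triples = np.array([[1, 1, 1], [0, 3, 4], [0, 2, 3], [2, 4, 3],
                          [1, 0, 2], [1, 3, 1], [0, 4, 3], [1, 2, 2]])


def fit_factors(triples, num_users, num_items, k, l2, lr=0.05, steps=2000, seed=0):
    """Gradient descent on MSE + l2 * (||U||^2 + ||V||^2); returns U, V, train MSE."""
    rng = np.random.default_rng(seed)
    U = rng.normal(scale=0.5, size=(num_users, k))
    V = rng.normal(scale=0.5, size=(num_items, k))
    users, items = triples[:, 0], triples[:, 1]
    targets = triples[:, 2].astype(float)
    scale = 2.0 / len(targets)
    for _ in range(steps):
        errors = (U[users] * V[items]).sum(axis=1) - targets
        # start from the gradient of the L2 penalty, then add the data term
        grad_U = 2.0 * l2 * U
        grad_V = 2.0 * l2 * V
        # np.add.at accumulates correctly when a user or item index repeats
        np.add.at(grad_U, users, scale * errors[:, None] * V[items])
        np.add.at(grad_V, items, scale * errors[:, None] * U[users])
        U -= lr * grad_U
        V -= lr * grad_V
    final_errors = (U[users] * V[items]).sum(axis=1) - targets
    return U, V, float(np.mean(final_errors ** 2))


for l2 in (0.0, 0.1):
    U, V, mse = fit_factors(train_triples, num_users=3, num_items=5, k=2, l2=l2)
    print(f"l2={l2}: train MSE={mse:.4f}, "
          f"|U|={np.linalg.norm(U):.3f}, |V|={np.linalg.norm(V):.3f}")
```

Comparing the printed factor norms and training errors for the two settings shows how the penalty trades fit on the known ratings against the size of the parameters.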


In [26]:
guess_matrix = torch.matmul(model.U.weight,model.V.weight.t())
print("Predicted Ratings Matrix:")
print(guess_matrix)

Predicted Ratings Matrix:
tensor([[-0.0840, -2.6346,  2.9952,  3.9801,  3.0081],
        [ 2.0092,  1.0067,  1.9492,  1.0552,  2.3200],
        [ 2.1054,  0.5644,  2.6246,  1.8672,  3.0185]], grad_fn=<MmBackward0>)
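
As a sanity check, the test loss reported earlier can be recomputed by hand from this matrix. A minimal sketch, with the predictions copied from the output above (rounded to four decimals) and the held-out triples copied from the split:

```python
import numpy as np

# Predicted matrix copied from the output above.
predicted = np.array([
    [-0.0840, -2.6346, 2.9952, 3.9801, 3.0081],
    [2.0092, 1.0067, 1.9492, 1.0552, 2.3200],
    [2.1054, 0.5644, 2.6246, 1.8672, 3.0185],
])
# Held-out triples: (user_index, item_index, rating).
test_rows = np.array([[2, 3, 4], [2, 2, 3], [0, 0, 1], [2, 0, 1]])

held_out_predictions = predicted[test_rows[:, 0], test_rows[:, 1]]
test_mse = np.mean((held_out_predictions - test_rows[:, 2]) ** 2)
print(f"{test_mse:.4f}")  # 1.7717
```

Most of the error comes from entry (2, 3), where the model predicts 1.8672 for a true rating of 4.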


In [30]:
print("Original Ratings Matrix:")
print(ratings)

Original Ratings Matrix:
[[1 None 3 4 3]
 [2 1 2 1 None]
 [1 None 3 4 3]]


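The training loss falls to 0.0009 while the test loss stays at 1.7717, so the model is fitting the eight training ratings far better than it generalizes. With more training data and adjustments to the number of epochs and the learning rate, we can improve our guesses.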

In [9]:
# query the ratings table from the SQLite database

import sqlite3
import pandas as pd

conn = sqlite3.connect('./backend/main.db')

query = "SELECT user_id,isbn,rating FROM ratings"


df = pd.read_sql_query(query,conn)

print(df.head())


   user_id        isbn  rating
0   276725  034545104X       0
1   276726  0155061224       5
2   276727  0446520802       0
3   276729  052165615X       3
4   276729  0521795028       6


In [10]:
# normalize ISBNs: cast to string, drop hyphens, trim whitespace
df["isbn"] = (
    df["isbn"]
    .astype(str)
    .str.replace("-", "", regex=False)
    .str.strip()
)
print(df.head())

   user_id        isbn  rating
0   276725  034545104X       0
1   276726  0155061224       5
2   276727  0446520802       0
3   276729  052165615X       3
4   276729  0521795028       6


In [11]:
df["isbn"].nunique()

340553
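
An embedding table looks rows up by position, so raw `user_id` values such as 276725 and ISBN strings cannot be fed to a model like `FactorizationModel` directly; each distinct value needs a contiguous integer index first. A minimal sketch in plain Python, using only the five rows printed by `df.head()` above (`build_index` is a hypothetical helper, not part of this notebook):

```python
def build_index(values):
    """Map each distinct value to a contiguous integer index in first-seen order."""
    index = {}
    for value in values:
        if value not in index:
            index[value] = len(index)
    return index


# The five rows shown by df.head(): (user_id, isbn, rating).
rows = [
    (276725, "034545104X", 0),
    (276726, "0155061224", 5),
    (276727, "0446520802", 0),
    (276729, "052165615X", 3),
    (276729, "0521795028", 6),
]

user_index = build_index(user for user, _, _ in rows)
item_index = build_index(isbn for _, isbn, _ in rows)
triples = [(user_index[u], item_index[i], r) for u, i, r in rows]
print(triples)  # [(0, 0, 0), (1, 1, 5), (2, 2, 0), (3, 3, 3), (3, 4, 6)]
```

On the full table, `pd.factorize` produces the same kind of integer codes in one call, and the 340553 distinct ISBNs would become the number of rows in the item embedding.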