# Collaborative Filtering

## Import the dataset
Dataset: https://www.kaggle.com/datasets/saurabhbagchi/books-dataset
This notebook uses the ratings.csv file from that dataset.

In [1]:
# Importing the dataset
import pandas as pd

book_ratings = pd.read_csv('./datasets/books_data/ratings.csv', sep=';', encoding='latin-1')  # latin-1 is used because reading the file as UTF-8 failed
book_ratings.columns = ["User_ID", "ISBN", "Book_Rating"]

# Number of unique users and books (items)
print(f'Amount of users: {book_ratings.User_ID.nunique()} | Amount of books: {book_ratings.ISBN.nunique()}')

Amount of users: 105283 | Amount of books: 340556


In [2]:
# Keep only the first quarter of the dataset (so my PC is able to process the data)
cut = len(book_ratings) // 4
book_ratings = book_ratings.iloc[:cut]

In [3]:
# Keep only rows whose ISBN consists purely of digits, then drop all-zero placeholder ISBNs
isbn_digits = book_ratings["ISBN"].str.isdigit()
book_ratings = book_ratings[isbn_digits]
book_ratings = book_ratings[book_ratings['ISBN'] != '0000000000']
book_ratings = book_ratings[book_ratings['ISBN'] != '00000000000']
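
One side effect of the `str.isdigit()` filter is that it also drops ISBN-10 values ending in the check character `X`. As a rough sketch of a stricter check, the helper below verifies the standard ISBN-10 checksum (the weighted sum of the ten characters must be divisible by 11). The name `is_valid_isbn10` is hypothetical and not part of the original notebook.

```python
def is_valid_isbn10(isbn: str) -> bool:
    """Return True if `isbn` has 10 characters and a valid ISBN-10 checksum."""
    if len(isbn) != 10:
        return False
    total = 0
    for position, char in enumerate(isbn):
        if char in "0123456789":
            value = int(char)
        elif char in "Xx" and position == 9:
            value = 10  # 'X' stands for 10 and is only allowed as the last character
        else:
            return False
        total += (10 - position) * value  # weights 10, 9, ..., 1
    return total % 11 == 0


print(is_valid_isbn10("0373243944"))  # the ISBN used for the predictions in this notebook
print(is_valid_isbn10("080442957X"))  # an ISBN-10 ending in the check character X
```

If used, it could be applied with `book_ratings["ISBN"].map(is_valid_isbn10)`. An all-zero ISBN also satisfies the checksum, so the explicit placeholder filters would still be needed.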

In [4]:
# Only keep users that have at least 50 book ratings (to further reduce the dataset size)
count_users = book_ratings['User_ID'].value_counts()
book_ratings = book_ratings[book_ratings['User_ID'].isin(count_users[count_users >= 50].index)]

In [5]:
book_ratings.head()

Unnamed: 0,User_ID,ISBN,Book_Rating
412,276925,6511929,0
414,276925,60520507,0
415,276925,60930934,0
416,276925,60951303,0
417,276925,99414732,7


In [6]:
# Checking the frequency of book ratings, to detect anomalies
rating_counts = book_ratings['Book_Rating'].value_counts()
print(rating_counts)

Book_Rating
0     116169
8      12934
10     12015
7       9466
9       8308
5       5741
6       4336
4        971
3        613
2        324
1        303
Name: count, dtype: int64


In [7]:
# The 0 rating is used a lot. It indicates that the user didn't rate the book, so we remove these rows from the dataset.
book_ratings = book_ratings[book_ratings['Book_Rating'] != 0]
book_ratings.head()

Unnamed: 0,User_ID,ISBN,Book_Rating
417,276925,99414732,7
418,276925,140154078,6
424,276925,194216748,5
429,276925,385504209,8
438,276925,553275283,8


## Train-test split
The dataset is split into a training set and a test set with the train_test_split function.
I chose an 80/20 split between training and test data.
The split is stratified on the book ratings, meaning the rows are grouped by rating value and each group is split separately.
This is done so that the different rating values are distributed in the same proportions across the two sets.

In [8]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(book_ratings, book_ratings.Book_Rating, test_size=0.20, stratify=book_ratings.Book_Rating, random_state=42)
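
To make the effect of stratification concrete, here is a minimal pure-Python sketch on made-up ratings. It only illustrates the idea of splitting each rating group separately; it is not scikit-learn's implementation, and the function name `stratified_split` is an assumption for this example.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_indices, test_indices), splitting each label group separately."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for index, label in enumerate(labels):
        groups[label].append(index)

    train, test = [], []
    for indices in groups.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)  # same share of every group goes to the test set
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return train, test


# Toy data (made up): 50 ratings of 8, 30 ratings of 10, 20 ratings of 5
labels = [8] * 50 + [10] * 30 + [5] * 20
train_idx, test_idx = stratified_split(labels)

train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print("train:", dict(train_counts))
print("test: ", dict(test_counts))
```

Each rating value ends up with 20% of its rows in the test set, so both sets keep the 50/30/20 ratio of the toy data.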

## User-based Collaborative Filtering

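### SVD (Singular Value Decomposition Model)
This model factorizes the user-item rating matrix. Classical SVD decomposes a matrix into three matrices: one with user factors, a diagonal one with singular values, and one with item factors. The SVD algorithm in Surprise uses a related matrix factorization that learns a latent factor vector and a bias term for every user and every item.
The algorithm identifies patterns (hidden features) in user preferences, enabling it to predict how users might rate items they haven't rated yet.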
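
As an illustration of how such a factorization turns hidden features into a rating, the sketch below applies the prediction rule that Surprise documents for its `SVD` algorithm: global mean plus user bias plus item bias plus the dot product of the user and item factor vectors, clipped to the rating scale. All numbers are made up, and `predict_rating` is a hypothetical helper, not a Surprise function.

```python
def predict_rating(global_mean, user_bias, item_bias, user_factors, item_factors,
                   low=1.0, high=10.0):
    """Estimate a rating from biases and latent factors, clipped to [low, high]."""
    interaction = sum(p * q for p, q in zip(user_factors, item_factors))
    estimate = global_mean + user_bias + item_bias + interaction
    return min(high, max(low, estimate))


# Made-up values: a slightly generous user, a slightly unpopular book,
# and two hidden features on which user and book partly agree.
estimate = predict_rating(
    global_mean=7.5,
    user_bias=0.25,
    item_bias=-0.5,
    user_factors=[0.5, 1.0],
    item_factors=[1.0, 0.25],
)
print(estimate)  # prints 8.0
```

In the real model these biases and factors are not chosen by hand but learned from the training ratings.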

#### Prediction
Issues with Surprise:
I had a lot of issues importing surprise.
I tried installing with pip -> Gave loads of different errors
I tried installing the whole git repository -> Wasn't recognized in the code
I tried changing the environment to conda -> Was the solution
Total hours spent on the problem: 4

In [9]:
# Making a prediction for 1 user with 1 book
from surprise import Reader, Dataset, SVD

reader = Reader(rating_scale=(1, 10))
data = Dataset.load_from_df(book_ratings[['User_ID', 'ISBN', 'Book_Rating']], reader)
algo = SVD()
trainingSet = data.build_full_trainset()
algo.fit(trainingSet)
predicted_rating = algo.predict(276925, '0373243944')  # arbitrarily chosen ISBN
print("Predicted rating: ", predicted_rating.est)

Predicted rating:  7.313006598955723


#### Model train and test

In the code below, I use the SVD algorithm to train a model on the training data.
After training, I test the model by predicting the ratings in the test set.
The Root Mean Squared Error (RMSE) measures the accuracy of these predictions.
The RMSE indicates how close the model's predictions are to the actual ratings in the test data; lower values are better.
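
For intuition about what the RMSE number means, here is a small hand-rolled sketch on made-up ratings. The helper names are assumptions for this example; the notebook itself relies on Surprise's `accuracy.rmse`.

```python
import math


def root_mean_squared_error(actual, predicted):
    """Square root of the mean squared difference between actual and predicted ratings."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def mean_absolute_error(actual, predicted):
    """Mean absolute difference, shown only for comparison."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


actual = [8, 10, 5, 7]          # made-up true ratings
evenly_off = [6, 8, 7, 9]       # every prediction is off by 2
one_big_miss = [8, 10, 5, 3]    # three perfect predictions, one off by 4

print(root_mean_squared_error(actual, evenly_off), mean_absolute_error(actual, evenly_off))
print(root_mean_squared_error(actual, one_big_miss), mean_absolute_error(actual, one_big_miss))
```

Both prediction lists have an RMSE of 2.0, although the second one has a mean absolute error of only 1.0. Because the errors are squared, a single large miss weighs as much as several moderate ones.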

As we will see below, the SVD model predicts the ratings relatively well.
In my testing the RMSE is around 1.60 to 1.65, which in my eyes is good for ratings between 1 and 10.

In [15]:
from surprise import accuracy

reader = Reader(rating_scale=(1, 10))
# Build the Surprise training set from the training split only
train_data = Dataset.load_from_df(X_train[['User_ID', 'ISBN', 'Book_Rating']], reader)
algo = SVD()
trainingSet = train_data.build_full_trainset()
algo.fit(trainingSet)
# Surprise expects the test set as a list of (user, item, rating) tuples
testset = list(X_test[['User_ID', 'ISBN', 'Book_Rating']].itertuples(index=False, name=None))
predictions = algo.test(testset)
rmse = accuracy.rmse(predictions)  # prints and returns the RMSE

RMSE: 1.6370


### K-Nearest Neighbors Model (KNN)

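The KNN algorithm measures the similarity between users based on their rating patterns.
A user's rating for a book is then predicted from the ratings that the most similar users (the nearest neighbors) gave to that book, which is useful for recommending books that similar users liked.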
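
The idea can be sketched in a few lines of plain Python: compute the cosine similarity between users over the books both have rated, then take a similarity-weighted average of the neighbors' ratings. This is a simplified illustration with made-up users and ratings; Surprise's `KNNBasic` follows the same principle but differs in its details, and the helper names here are assumptions.

```python
import math


def cosine_similarity(ratings_a, ratings_b):
    """Cosine similarity over the items that both users have rated."""
    common = set(ratings_a) & set(ratings_b)
    if not common:
        return 0.0
    dot = sum(ratings_a[i] * ratings_b[i] for i in common)
    norm_a = math.sqrt(sum(ratings_a[i] ** 2 for i in common))
    norm_b = math.sqrt(sum(ratings_b[i] ** 2 for i in common))
    return dot / (norm_a * norm_b)


def predict_knn(target_user, item, ratings, k=2):
    """Similarity-weighted average over the k most similar users who rated `item`."""
    neighbors = [
        (cosine_similarity(ratings[target_user], ratings[other]), ratings[other][item])
        for other in ratings
        if other != target_user and item in ratings[other]
    ]
    top_k = sorted(neighbors, key=lambda pair: pair[0], reverse=True)[:k]
    weight_sum = sum(sim for sim, _ in top_k)
    if weight_sum == 0:
        return None  # no usable neighbors
    return sum(sim * rating for sim, rating in top_k) / weight_sum


# Made-up ratings: user -> {book: rating}
ratings = {
    "alice": {"book_a": 8, "book_b": 6},
    "bob": {"book_a": 8, "book_b": 6, "book_c": 9},
    "carol": {"book_a": 4, "book_b": 3, "book_c": 7},
    "dave": {"book_a": 1, "book_b": 10, "book_c": 2},
}

prediction = predict_knn("alice", "book_c", ratings, k=2)
print(prediction)
```

In this toy data, bob and carol are both treated as perfect matches for alice, because cosine similarity only looks at the direction of the rating vectors: carol's ratings are exactly half of alice's. The prediction is therefore the plain average of their two ratings for `book_c`, and dave is ignored.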

#### Prediction

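The predicted rating (7.74) is higher than the one from the SVD model (7.31).
A single prediction does not show which model is better; for that, the RMSE on the test set has to be compared.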

In [12]:
# Making a prediction for 1 user with 1 book
from surprise import KNNBasic

reader = Reader(rating_scale=(1, 10))
data = Dataset.load_from_df(book_ratings[['User_ID', 'ISBN', 'Book_Rating']], reader)
k=40 # number of neighbors
similarity_metric = 'cosine'
algo = KNNBasic(k=k, sim_options={'name': similarity_metric})
trainingSet = data.build_full_trainset()
algo.fit(trainingSet)
predicted_rating = algo.predict(276925, '0373243944')
print("Predicted rating: ", predicted_rating.est)

Computing the cosine similarity matrix...
Done computing similarity matrix.
Predicted rating:  7.744832851611496


#### Model train and test

The results show a higher RMSE (1.96) than with SVD (1.64).
This means that the KNN model is less accurate in its predictions on this test set.

In [13]:
reader = Reader(rating_scale=(1, 10))
# Build the Surprise training set from the training split only
train_data = Dataset.load_from_df(X_train[['User_ID', 'ISBN', 'Book_Rating']], reader)
k = 40  # number of neighbors
similarity_metric = 'cosine'
algo = KNNBasic(k=k, sim_options={'name': similarity_metric})
trainingSet = train_data.build_full_trainset()
algo.fit(trainingSet)
# Surprise expects the test set as a list of (user, item, rating) tuples
testset = list(X_test[['User_ID', 'ISBN', 'Book_Rating']].itertuples(index=False, name=None))
predictions = algo.test(testset)
rmse = accuracy.rmse(predictions)  # prints and returns the RMSE

Computing the cosine similarity matrix...
Done computing similarity matrix.
RMSE: 1.9648
