Project name: Book recommender system

###

### Methodology
**Conten-based:**


**Collaborative Filtering**

Since we have users 'interaction' with books - theirs ratings, we can use collaborative filtering using these interactions, based on the idea that users who have agreed in the past will agree in the future.

In our project, Alternating Least Squares (ALS) algorithm is used to identify the patterns in both users and books. ALS can factorize the large user-item interaction matrix into two lower-dimensional matrices that capture the latent factors of users and items. The final goal of ALS is to minimize the difference between the actual ratings and the predicted ones derived from the latent factors.


1. Data preparation 

First, the user and book IDs are mapped to a continuous range of indices, which is essential for creating a sparse matrix.

In [8]:
from scipy.sparse import csr_matrix
from implicit.als import AlternatingLeastSquares
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import pandas as pd
import numpy as np

books = pd.read_csv('Books.csv')
users = pd.read_csv('Users.csv')
ratings = pd.read_csv('Ratings.csv')


# Map user and book IDs to continuous indices
user_id_mapping = {id: idx for idx, id in enumerate(ratings['User-ID'].unique())}
book_id_mapping = {id: idx for idx, id in enumerate(ratings['ISBN'].unique())}

ratings['User-ID'] = ratings['User-ID'].map(user_id_mapping)
ratings['ISBN'] = ratings['ISBN'].map(book_id_mapping)
ratings.dropna(subset=['Book-Rating'], inplace=True)
all_user_ids = ratings['User-ID'].unique()
all_book_ids = ratings['ISBN'].unique()

2. Train-Test Split

The data is split into training and test sets. To ensure that all users and books are represented in the training set, any users or books that are only present in the test set are added back to the training set.

In [9]:
train_data, test_data = train_test_split(ratings, test_size=0.2, random_state=42)

train_user_ids = set(train_data['User-ID'])
train_book_ids = set(train_data['ISBN'])
missing_users = set(all_user_ids) - train_user_ids
missing_books = set(all_book_ids) - train_book_ids

missing_data = ratings[ratings['User-ID'].isin(missing_users) | ratings['ISBN'].isin(missing_books)]
train_data = pd.concat([train_data, missing_data]).drop_duplicates()

3. Create Sparse Matrices

Sparse matrices for the training and test data are created. These matrices are used by the ALS algorithm to learn the latent factors. Here we also added a test to judge whether there is null values in the training matrix, in case during matric factorization there is error.

In [10]:
# Create sparse matrices for training and test data
n_users = ratings['User-ID'].nunique()
n_items = ratings['ISBN'].nunique()
train_matrix = csr_matrix((train_data['Book-Rating'], (train_data['User-ID'], train_data['ISBN'])), shape=(n_users, n_items))
test_matrix = csr_matrix((test_data['Book-Rating'], (test_data['User-ID'], test_data['ISBN'])), shape=(n_users, n_items))
print("-------matrix finished---------")

# Check for NaN values in the training matrix
if np.any(np.isnan(train_matrix.data)):
    print("NaN values found in training matrix")
else:
    print("No NaN values in training matrix")

-------matrix finished---------
No NaN values in training matrix


4. Train the ALS Model

An ALS model is initialized and trained using the training data. The model learns latent factors for users and books. Here we set the iteration to be 20.

In [11]:
# Initialize and train the ALS model
als_model = AlternatingLeastSquares(factors=50, regularization=0.1, iterations=20, use_gpu=False, calculate_training_loss=True)
als_model.fit(train_matrix.T, show_progress=True)


  check_blas_config()
100%|██████████| 20/20 [00:23<00:00,  1.17s/it, loss=6.2e-5] 


5. Evaluate the Model

The model is evaluated using the test data by calculating the Root Mean Square Error (RMSE) between the predicted ratings and the actual ratings in the test set.

We first extract non-zero entries, then iterate over Non-Zero User-Item pairs to calculate predictions, during this we also check index bounds. After the predictions, we first handle empty predictions, If no predictions were made when the predictions list is empty, the function returns infinity, which is a safeguard to indicate that the model could not make any predictions. Finally RMSE is calculated.

In our experiment, Test RMSE is: 7.844746002414569

In [12]:
# Evaluate the model
def evaluate_model(test_matrix, als_model):
    test_user_items = test_matrix.nonzero()
    predictions = []
    ground_truth = []
    for user, item in zip(test_user_items[0], test_user_items[1]):
        if user < als_model.user_factors.shape[0] and item < als_model.item_factors.shape[0]:
            prediction = als_model.user_factors[user, :].dot(als_model.item_factors[item, :].T)
            predictions.append(prediction)
            ground_truth.append(test_matrix[user, item])
    if len(predictions) == 0:
        return float('inf')  
    return np.sqrt(mean_squared_error(ground_truth, predictions))


6. Generate Recommendations

Since the recommendation is based on user's past action, after checking whether user is in our mapping, we retrieve the ratings given by the user from the training matrix, then use it by the recommend method to filter out books that the user has already rated. The recommendation is based on Implicit library.

In [17]:
def recommend_books_als(user_id, num_recommendations=5):
    if user_id not in user_id_mapping:
        return [] 
    user_index = user_id_mapping[user_id]
    user_ratings = train_matrix[user_index]
    recommended_books = als_model.recommend(user_index, user_ratings, N=num_recommendations, filter_already_liked_items=True)
    # recommended_book_ids = [list(book_id_mapping.keys())[list(book_id_mapping.values()).index(i)] for i, _ in recommended_books]
    # return books[books['ISBN'].isin(recommended_book_ids)]

    # Convert recommended book indices back to ISBNs
    recommended_book_isbns = [list(book_id_mapping.keys())[list(book_id_mapping.values()).index(i[0])] for i in recommended_books]
    
    # Retrieve book information from books DataFrame
    recommended_books_info = books[books['ISBN'].isin(recommended_book_isbns)]
    
    return recommended_books_info

7. Conclusion for ALS:

By using ALS, we managed to discover hidden patterns and relationships in the rating data by learning latent factors for both users and books. Also, we managed to reduce the high-dimensional user-item matrix into lower-dimensional ones. The RMSE is relatively low, and we show it effeciency in predicting unknown interactions and generate personalized recommendations for users. Also, we can use it to compute similaritied between users and items.

In [18]:
recommend_books_als(100)

ValueError: 2.85138e-12 is not in list