The *goal* of this notebook is to build a *recommender system* for books using collaborative filtering and content-based methods. <BR>

Recommendations are generated with TF-IDF, Google API similarity, BERT embeddings, and collaborative filtering (item-based and user-based), and each method is evaluated with precision and recall.<BR>
The notebook first explores the data, then builds user-item matrices, and finally implements the different recommendation techniques. The results are saved as CSV files for further analysis.  

In [None]:
# Libraries
from joblib import Parallel, delayed
import numpy as np
import pandas as pd
import sklearn
from sklearn.metrics.pairwise import cosine_similarity
import matplotlib.pyplot as plt
import seaborn as sns
import random
from sklearn.preprocessing import normalize
from sklearn.feature_extraction.text import TfidfVectorizer
from sentence_transformers import SentenceTransformer
from sklearn.model_selection import train_test_split



# Recommender Systems
| Recommender Type     | Similarity Between | Based On           | Example Statement                                      |
|----------------------|--------------------|--------------------|--------------------------------------------------------|
| CF – Item-Item       | Items              | User behavior      | “You liked A, others who liked A also liked B”         |
| CF – User-User       | Users              | User behavior      | “People like you liked B, so you might too”            |
| Content-Based        | Items              | Item text/content  | “These books are similar in description/topic”         |
| Hybrid               | Items              | Content + Behavior | “You liked A; B is similar and liked by others too”    |


## Task 1: Exploring



#### Step 1: Get the data

In [None]:
# Load the datasets

interactions = pd.read_csv('https://raw.githubusercontent.com/linneverh/MachineLearning/main/interactions_train.csv')

#FOR: Google enhanced & ISBN enhanced - author_date_title_subjects priority
items1 = pd.read_csv("https://media.githubusercontent.com/media/ML-brooowss/ML/refs/heads/main/final_items/author_date_title_subjects/embeddings_part1.csv")
items2 = pd.read_csv("https://media.githubusercontent.com/media/ML-brooowss/ML/refs/heads/main/final_items/author_date_title_subjects/embeddings_part2.csv")
items = pd.concat([items1, items2])

#rename columns
interactions = interactions.rename(columns={'u': 'user_id', 'i': 'book_id', 't': 'timestamp'})
items=items.rename(columns={'i':'book_id'})

# Display the first rows of each dataset
display(interactions.head())
display(items.head())

Unnamed: 0,user_id,book_id,timestamp
0,4456,8581,1687541000.0
1,142,1964,1679585000.0
2,362,3705,1706872000.0
3,1809,11317,1673533000.0
4,4384,1323,1681402000.0


Unnamed: 0.1,Unnamed: 0,CanonicalLink,Description,ISBN,ImageLink,Language,PublishedDate,Publisher,Subjects,Title,...,book_id,title_clean,title_description,date_title_description,author_title_description,author_date_title_description,author_date_title,author_date_title_subjects,author_title_subjects,embedding
0,723,https://books.google.com/books/about/Classific...,,9782871303336,https://images.isbndb.com/covers/2000472348298...,fr,2012.0,Ed du CEFAL,Classification décimale universelle; Indexatio...,Classification décimale universelle : édition ...,...,0,Classification décimale universelle : édition ...,Classification décimale universelle : édition ...,2012 Classification décimale universelle : édi...,UDC Consortium (The Hague) Classification déci...,UDC Consortium (The Hague) 2012 Classification...,UDC Consortium (The Hague) 2012 Classification...,UDC Consortium (The Hague) 2012 Classification...,UDC Consortium (The Hague) Classification déc...,"[-0.004826885, -0.0587869, -0.06438997, -0.007..."
1,724,https://books.google.com/books/about/Les_inter...,C'est dans l'interaction en classe que s'actua...,9782278058327,https://images.isbndb.com/covers/2384333482926...,fr,2011.0,Didier,didactique--langue étrangère - enseignement; d...,Les interactions dans l'enseignement des langu...,...,1,Les interactions dans l'enseignement des langu...,Les interactions dans l'enseignement des langu...,2011 Les interactions dans l'enseignement des ...,"Cicurel, Francine, Les interactions dans l'ens...","Cicurel, Francine, 2011 Les interactions dans ...","Cicurel, Francine, 2011 Les interactions dans ...","Cicurel, Francine, 2011 Les interactions dans ...","Cicurel, Francine, Les interactions dans l'en...","[0.0041115503, -0.012976925, 0.0044452655, 0.0..."
2,725,https://books.google.com/books/about/Histoire_...,Depuis la parution en 1918 de l'ouvrage fondat...,2343190194,http://books.google.com/books/content?id=Q2PMD...,fr,2020.0,L'Harmattan,Histoires de vie en sociologie; Sciences socia...,Histoire de vie et recherche biographique : pe...,...,2,Histoire de vie et recherche biographique : pe...,Histoire de vie et recherche biographique : pe...,2020 Histoire de vie et recherche biographique...,"Aneta Slowik, Hervé Breton, Gaston Pineau Hist...","Aneta Slowik, Hervé Breton, Gaston Pineau 2020...","Aneta Slowik, Hervé Breton, Gaston Pineau 2020...","Aneta Slowik, Hervé Breton, Gaston Pineau 2020...","Aneta Slowik, Hervé Breton, Gaston Pineau His...","[0.027354596, -0.025706276, -0.051459163, 0.00..."
3,726,https://books.google.com/books/about/Ce_livre_...,Juin 1940. Les Allemands entrent dans Paris.Pa...,9782365350020,https://images.isbndb.com/covers/1994518348298...,fr,2012.0,Vraoum!,Moyen-Orient; Bandes dessinées autobiographiqu...,Ce livre devrait me permettre de résoudre le c...,...,3,Ce livre devrait me permettre de résoudre le c...,Ce livre devrait me permettre de résoudre le c...,2012 Ce livre devrait me permettre de résoudre...,"Mazas, Sylvain, Ce livre devrait me permettre ...","Mazas, Sylvain, 2012 Ce livre devrait me perme...","Mazas, Sylvain, 2012 Ce livre devrait me perme...","Mazas, Sylvain, 2012 Ce livre devrait me perme...","Mazas, Sylvain, Ce livre devrait me permettre...","[0.036929574, -0.0399203, -0.033997424, -0.006..."
4,727,https://books.google.com/books/about/Le_grand_...,"Trois histoires d'amour, un lanceur d'alerte, ...",9782702180815,http://books.google.com/books/content?id=f5u3z...,fr,1984.0,Calmann-Lévy,France--1945-1975; Roman historique; Roman fra...,Les années glorieuses : roman /,...,4,Les années glorieuses : roman,Les années glorieuses : roman Trois histoires ...,1984 Les années glorieuses : roman Trois histo...,"Lemaitre, Pierre, Les années glorieuses : roma...","Lemaitre, Pierre, 1984 Les années glorieuses :...","Lemaitre, Pierre, 1984 Les années glorieuses :...","Lemaitre, Pierre, 1984 Les années glorieuses :...","Lemaitre, Pierre, Les années glorieuses : rom...","[0.05324783, -0.026807835, -0.009055429, 0.005..."




#### Step 2: Check the Number of interactions, users and books

In [None]:
n_users = interactions.user_id.nunique()
n_items = items.book_id.nunique()
print(f'Number of users = {n_users}, \n Number of books = {n_items} \n Number of interactions = {len(interactions)}')


Number of users = 7838, 
 Number of books = 15291 
 Number of interactions = 87047



#### Step 3: Split the Data into Training and Test Sets

In [None]:
# let's first sort the interactions by user and time stamp
interactions = interactions.sort_values(["user_id", "timestamp"])
interactions.head(100)

Unnamed: 0,user_id,book_id,timestamp,embedding
21035,0,0,1.680191e+09,"[-0.004826885, -0.0587869, -0.06438997, -0.007..."
28842,0,1,1.680783e+09,"[0.0041115503, -0.012976925, 0.0044452655, 0.0..."
3958,0,2,1.680801e+09,"[0.027354596, -0.025706276, -0.051459163, 0.00..."
29592,0,3,1.683715e+09,"[0.036929574, -0.0399203, -0.033997424, -0.006..."
6371,0,3,1.683715e+09,"[0.036929574, -0.0399203, -0.033997424, -0.006..."
...,...,...,...,...
20068,2,53,1.694861e+09,"[0.04563928, -0.053787332, -0.016430369, 0.006..."
12721,2,53,1.695226e+09,"[0.04563928, -0.053787332, -0.016430369, 0.006..."
86745,2,53,1.695226e+09,"[0.04563928, -0.053787332, -0.016430369, 0.006..."
19329,2,53,1.695226e+09,"[0.04563928, -0.053787332, -0.016430369, 0.006..."


In [None]:
interactions["pct_rank"] = interactions.groupby("user_id")["timestamp"].rank(pct=True, method='dense')
interactions.reset_index(inplace=True, drop=True)
interactions.head(10)

Unnamed: 0,user_id,book_id,timestamp,embedding,pct_rank
0,0,0,1680191000.0,"[-0.004826885, -0.0587869, -0.06438997, -0.007...",0.04
1,0,1,1680783000.0,"[0.0041115503, -0.012976925, 0.0044452655, 0.0...",0.08
2,0,2,1680801000.0,"[0.027354596, -0.025706276, -0.051459163, 0.00...",0.12
3,0,3,1683715000.0,"[0.036929574, -0.0399203, -0.033997424, -0.006...",0.16
4,0,3,1683715000.0,"[0.036929574, -0.0399203, -0.033997424, -0.006...",0.2
5,0,4,1686569000.0,"[0.05324783, -0.026807835, -0.009055429, 0.005...",0.24
6,0,5,1687014000.0,"[0.0103662815, -0.05280713, -0.029626973, -0.0...",0.28
7,0,6,1687014000.0,"[0.023781504, -0.054194607, -0.018097805, -0.0...",0.32
8,0,7,1687014000.0,"[0.00092733064, -0.02754119, -0.0001447586, 0....",0.36
9,0,8,1687260000.0,"[0.012236664, 0.005825913, -0.056410506, 0.015...",0.4


Now all that remains is to put the first 80% of each user's interactions into the training set and the rest into the test set. We can do so using the `pct_rank` column.

In [None]:
train_data = interactions[interactions["pct_rank"] < 0.8]
test_data = interactions[interactions["pct_rank"] >= 0.8]

In [None]:
print("Training set size:", train_data.shape[0])
print("Testing set size:", test_data.shape[0])

Training set size: 65419
Testing set size: 21628
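To make the chronological split above concrete, here is a minimal sketch on a toy DataFrame (the values are invented for illustration). With `method='dense'`, interactions that share a timestamp get the same rank, so they always land on the same side of the split.

```python
import pandas as pd

# Toy interactions: user 0 has five events (two share a timestamp), user 1 has two.
toy = pd.DataFrame({
    "user_id":   [0, 0, 0, 0, 0, 1, 1],
    "book_id":   [10, 11, 12, 12, 13, 20, 21],
    "timestamp": [1.0, 2.0, 3.0, 3.0, 4.0, 5.0, 6.0],
})

# Percentile rank of each timestamp within its user, ties sharing one rank.
toy["pct_rank"] = toy.groupby("user_id")["timestamp"].rank(pct=True, method="dense")

# Earlier interactions go to train, the most recent ones to test.
toy_train = toy[toy["pct_rank"] < 0.8]
toy_test = toy[toy["pct_rank"] >= 0.8]

print(toy_train[["user_id", "book_id"]])
print(toy_test[["user_id", "book_id"]])
```

In this toy case the most recent interaction of each user (books 13 and 21) ends up in the test set, and the two tied rows for book 12 stay together in the training set.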


## Task 2: Creating User-Item Matrices for Implicit Feedback


In [None]:
print('number of users =', n_users, '| number of books =', n_items)

number of users = 7838 | number of books = 15291


#### Step 1: Define the Function to Create the Data Matrix


In [None]:
# Define a function to create the data matrix
def create_data_matrix(data, n_users, n_items):
    """
    This function returns a numpy matrix with shape (n_users, n_items).
    Each entry is a binary value indicating positive interaction.
    """
    data_matrix = np.zeros((n_users, n_items))
    data_matrix[data["user_id"].values, data["book_id"].values] = 1
    return data_matrix

#### Step 2: Create the Training and Testing Matrices

Now we can use the function to create matrices for both the training and testing data. Each cell in the matrix will show a 1 if there was a positive interaction in the training or testing data, and a 0 otherwise.

In [None]:
entire_data=create_data_matrix(interactions, n_users, n_items)

In [None]:
# Create the training and testing matrices
train_data_matrix = create_data_matrix(train_data, n_users, n_items)
test_data_matrix = create_data_matrix(test_data, n_users, n_items)

# Display the matrices to understand their structure
print('train_data_matrix')
print(train_data_matrix)
print("number of non-zero values: ", np.sum(train_data_matrix))
print('test_data_matrix')
print(test_data_matrix)
print("number of non-zero values: ", np.sum(test_data_matrix))


train_data_matrix
[[1. 1. 1. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
number of non-zero values:  49689.0
test_data_matrix
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
number of non-zero values:  19409.0


In [None]:
# Print the dimensions of the matrices
print("Train data matrix dimensions:", train_data_matrix.shape)
print("Test data matrix dimensions:", test_data_matrix.shape)

Train data matrix dimensions: (7838, 15291)
Test data matrix dimensions: (7838, 15291)
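Two things are worth noting about these matrices. First, the number of non-zero entries (49689) is smaller than the number of training rows (65419) because repeated user-book pairs collapse into a single 1. Second, a dense `float64` array of shape (7838, 15291) takes roughly 0.96 GB, although only a tiny fraction of its cells are non-zero. The sketch below shows a sparse alternative; `create_sparse_data_matrix` is a hypothetical helper, not part of the notebook, and it assumes SciPy is available.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

def create_sparse_data_matrix(data, n_users, n_items):
    """Hypothetical sparse counterpart of create_data_matrix (binary entries)."""
    # csr_matrix sums duplicate (row, col) entries, so deduplicate first
    # to keep every stored value equal to 1.
    pairs = data[["user_id", "book_id"]].drop_duplicates()
    values = np.ones(len(pairs))
    rows = pairs["user_id"].to_numpy()
    cols = pairs["book_id"].to_numpy()
    return csr_matrix((values, (rows, cols)), shape=(n_users, n_items))

# Toy data: user 0 interacted with book 2 twice.
toy = pd.DataFrame({"user_id": [0, 0, 0, 1], "book_id": [0, 2, 2, 1]})
sparse_matrix = create_sparse_data_matrix(toy, n_users=2, n_items=3)
dense_matrix = sparse_matrix.toarray()
print(dense_matrix)
print("stored non-zero values:", sparse_matrix.nnz)
```

`cosine_similarity` from scikit-learn accepts sparse input, so a matrix built this way could be passed to it directly; the prediction functions in this notebook would still need a dense array or small adjustments.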


#### Basic Definitions

In [None]:
# Recommendation frame generation
def create_recommendation_table(user_predictions, top_n=10, separator=" "):
    """
    Creates a table of top-N recommendations for each user.

    Args:
        user_predictions (numpy.ndarray): Rows = users, columns = items. Predicted scores.
        top_n (int): Number of top recommendations per user.
        separator (str): Delimiter to join recommended book IDs.

    Returns:
        pandas.DataFrame: Columns = ['user_id', 'recommendation'].
    """
    recommendations = []
    num_users = user_predictions.shape[0]

    for user_id in range(num_users):
        top_items = np.argsort(user_predictions[user_id, :])[-top_n:][::-1]
        recommendations.append({
            'user_id': user_id,
            'recommendation': separator.join(map(str, top_items))
        })

    return pd.DataFrame(recommendations)
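As a quick check of the ranking logic in `create_recommendation_table`, the sketch below applies the same `argsort` pattern to a toy score matrix. The function `top_n_table` is a compact stand-in written for this example, and the scores are invented.

```python
import numpy as np
import pandas as pd

def top_n_table(scores, top_n=2, separator=" "):
    # Same ranking idea as create_recommendation_table:
    # sort ascending, keep the last top_n indices, reverse to get best-first.
    rows = []
    for user_id in range(scores.shape[0]):
        top_items = np.argsort(scores[user_id, :])[-top_n:][::-1]
        rows.append({
            "user_id": user_id,
            "recommendation": separator.join(map(str, top_items)),
        })
    return pd.DataFrame(rows)

toy_scores = np.array([
    [0.1, 0.9, 0.3, 0.5],   # user 0: best items are 1, then 3
    [0.7, 0.2, 0.8, 0.1],   # user 1: best items are 2, then 0
])
toy_table = top_n_table(toy_scores, top_n=2)
print(toy_table)
```

Note that this ranking does not exclude books the user has already interacted with; whether to filter them out depends on how the recommendations are evaluated.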

In [None]:
# Def for the precision_recall_at_k function
def precision_recall_at_k(prediction, ground_truth, k=10):
    """
    Calculates Precision@K and Recall@K for top-K recommendations.
    Parameters:
        prediction (numpy array): The predicted interaction matrix with scores.
        ground_truth (numpy array): The ground truth interaction matrix (binary).
        k (int): Number of top recommendations to consider.
    Returns:
        precision_at_k (float): The average precision@K over all users.
        recall_at_k (float): The average recall@K over all users.
    """
    num_users = prediction.shape[0]
    precision_at_k, recall_at_k = 0, 0

    for user in range(num_users):
        # Indices of the top-K items for the user, ranked by predicted score
        top_k_items = np.argsort(prediction[user, :])[-k:]

        # Number of relevant items among the top-K items
        relevant_items_in_top_k = np.isin(top_k_items, np.where(ground_truth[user, :] == 1)[0]).sum()

        # Total number of relevant items for the user
        total_relevant_items = ground_truth[user, :].sum()

        # Precision@K and Recall@K for this user
        precision_at_k += relevant_items_in_top_k / k
        recall_at_k += relevant_items_in_top_k / total_relevant_items if total_relevant_items > 0 else 0

    # Average Precision@K and Recall@K over all users
    precision_at_k /= num_users
    recall_at_k /= num_users

    return precision_at_k, recall_at_k
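The per-user computation inside `precision_recall_at_k` can be traced by hand on a single toy user. The scores and labels below are invented for illustration.

```python
import numpy as np

scores = np.array([[0.9, 0.1, 0.8, 0.3, 0.2]])  # one user, five items
truth = np.array([[1, 0, 0, 1, 1]])             # items 0, 3 and 4 are relevant
k = 2

top_k = np.argsort(scores[0])[-k:]     # the k highest-scored items: 2 and 0
hits = int(truth[0, top_k].sum())      # only item 0 is relevant, so 1 hit
precision = hits / k                   # 1 hit out of k = 2 recommendations
recall = hits / truth[0].sum()         # 1 hit out of 3 relevant items

print(f"Precision@{k} = {precision:.3f}, Recall@{k} = {recall:.3f}")
```

Because the function averages over all users, a user with no relevant items in the ground truth contributes 0 to both sums, which pulls the reported averages down.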

In [None]:
# Random per-user train/test split
def random_split_per_user(interactions_df, test_size=0.2, seed=None):
    """Randomly split each user's interactions; `seed` makes the split reproducible."""
    train_list = []
    test_list = []
    for user_id, user_df in interactions_df.groupby('user_id'):
        train_df, test_df = train_test_split(user_df, test_size=test_size, random_state=seed)
        train_list.append(train_df)
        test_list.append(test_df)
    return pd.concat(train_list), pd.concat(test_list)
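One edge case of a per-user random split: `train_test_split` raises a `ValueError` for a group with a single row, because the training side would be empty. If any user has only one interaction, a variant along the following lines may help. `seeded_split_per_user` is a hypothetical name, and keeping single-interaction users in the training set is an assumption of this sketch, not something the notebook prescribes.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def seeded_split_per_user(interactions_df, test_size=0.2, seed=None):
    """Hypothetical variant: reproducible, and single-interaction users stay in train."""
    train_parts, test_parts = [], []
    for _, user_df in interactions_df.groupby("user_id"):
        if len(user_df) < 2:
            # Nothing to hold out for this user.
            train_parts.append(user_df)
            continue
        train_df, test_df = train_test_split(
            user_df, test_size=test_size, random_state=seed
        )
        train_parts.append(train_df)
        test_parts.append(test_df)
    return pd.concat(train_parts), pd.concat(test_parts)

# Toy data: user 0 has five interactions, user 1 has one.
toy = pd.DataFrame({
    "user_id": [0, 0, 0, 0, 0, 1],
    "book_id": [10, 11, 12, 13, 14, 20],
})
train_a, test_a = seeded_split_per_user(toy, seed=0)
train_b, test_b = seeded_split_per_user(toy, seed=0)
print(len(train_a), len(test_a))
```

Calling the function twice with the same seed returns the same rows, which makes cross-validation runs comparable across methods.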

In [None]:
# Define the function to predict interactions based on item similarity
def item_based_predict(interactions, similarity, epsilon=1e-9):
    """
    Predicts user-item interactions based on item-item similarity.
    Parameters:
        interactions (numpy array): The user-item interaction matrix.
        similarity (numpy array): The item-item similarity matrix.
        epsilon (float): Small constant added to the denominator to avoid division by zero.
    Returns:
        numpy array: The predicted interaction scores for each user-item pair.
    """
    # np.dot does the matrix multiplication. Here we are calculating the
    # weighted sum of interactions based on item similarity
    pred = similarity.dot(interactions.T) / (similarity.sum(axis=1)[:, np.newaxis] + epsilon)
    return pred.T  # Transpose to get users as rows and items as columns
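Written out, the score that `item_based_predict` assigns to user $u$ and item $i$ is a similarity-weighted average of the user's interactions. Here $r_{u,j}$ is the binary interaction of user $u$ with item $j$, $s_{i,j}$ is the similarity between items $i$ and $j$, and $\varepsilon$ is the `epsilon` argument (the symbols are introduced here for illustration):

```latex
\hat{r}_{u,i} \;=\; \frac{\sum_{j} s_{i,j}\, r_{u,j}}{\sum_{j} s_{i,j} + \varepsilon}
```

The denominator depends only on the item $i$, so within one user's row the ranking is driven by how similar each candidate item is to the books that user has interacted with, relative to that item's total similarity mass.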

## Content-based

### TF-IDF
With features ['Publisher', 'Subjects', 'google_api_title', 'author_clean', 'ISBN']<br>
Mean Precision@10 = 0.0149 <br>
Mean Recall@10    = 0.091

In [None]:
#TF-IDF

# STEP 1: Build and clean the combined text feature
text_fields = ['Publisher', 'Subjects', 'google_api_title', 'author_clean', 'ISBN']
items['combined_text'] = items[text_fields].fillna('').agg(' '.join, axis=1)

# STEP 2: Align items with the columns of the interaction matrix (by book_id),
# so that the order of books in the TF-IDF matrix matches the item columns
# and similarity scores line up with item IDs.
items_ordered = items.set_index('book_id').loc[range(entire_data.shape[1])]

# STEP 3: Compute the TF-IDF matrix
tfidf = TfidfVectorizer(stop_words='english', max_features=1000)
tfidf_matrix = tfidf.fit_transform(items_ordered['combined_text'])

# Cosine similarity between item vectors
tfidf_sim = cosine_similarity(tfidf_matrix)
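To illustrate what the TF-IDF similarity captures, the following sketch runs the same two calls on three invented metadata strings. Books that share terms get a positive similarity, and books with no terms in common get 0.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented metadata strings standing in for `combined_text`.
toy_texts = [
    "roman historique france",            # book 0
    "roman historique guerre",            # book 1
    "sociologie recherche biographique",  # book 2
]

toy_tfidf = TfidfVectorizer().fit_transform(toy_texts)
toy_sim = cosine_similarity(toy_tfidf)
print(toy_sim.round(3))
```

One caveat, stated as an observation rather than a fix: the sample rows shown earlier are in French, while `stop_words='english'` only removes English function words, so frequent French words may still occupy some of the 1000 feature slots.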

In [None]:
# Calculate the item-based predictions for positive interactions
item_tfidf_prediction = item_based_predict(entire_data, tfidf_sim)
print("Predicted Interaction Matrix:")
print(item_tfidf_prediction)
print(item_tfidf_prediction.shape)

In [None]:
# Create df
item_tfidf_recommendations_df = create_recommendation_table(item_tfidf_prediction, top_n=10, separator=" ")

# Save and display
item_tfidf_recommendations_df.to_csv('item_tfidf_recommendations.csv', index=False)

print("\nItem-based Recommendations:")
display(item_tfidf_recommendations_df)

In [None]:
precision_item_k, recall_item_k = precision_recall_at_k(item_tfidf_prediction, test_data_matrix, k=10)
print('Item-based TF-IDF Precision@K:', precision_item_k)
print('Item-based TF-IDF Recall@K:', recall_item_k)

### Google API similarity
Mean Precision@K: 0.04866037254401807 <BR>
Mean Recall@K: 0.2707247031495884

In [None]:
# Select only the item IDs in the training data matrix
train_item_ids = range(entire_data.shape[1])

# Ensure correct item order by aligning to the item indices used in the train matrix
items_ordered = items.set_index('book_id').loc[train_item_ids]

# Parse the embedding strings into numpy arrays
items_ordered['embedding'] = items_ordered['embedding'].apply(lambda x: np.fromstring(x.strip('[]'), sep=','))

# Drop rows with missing or malformed embeddings (if any)
valid_items = items_ordered[items_ordered['embedding'].notna()].reset_index(drop=True)

# Stack embeddings into a matrix
embedding_matrix = np.vstack(valid_items['embedding'].values)

# Compute cosine similarity
embedding_sim = cosine_similarity(embedding_matrix)
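The parsing step above assumes every `embedding` cell is a well-formed string. A missing value (NaN) would fail at `x.strip`, before the "drop rows" step is reached, and dropping rows would leave a similarity matrix that no longer matches the number of item columns in `entire_data`. A more defensive parser could look like the sketch below; `parse_embedding` is a hypothetical helper, and it assumes the strings are JSON-style lists as in the sample output.

```python
import json
import numpy as np

def parse_embedding(text):
    """Hypothetical helper: return a 1-D float array, or None if the cell is unusable."""
    if not isinstance(text, str):
        return None  # e.g. NaN for a missing cell
    try:
        values = json.loads(text)
    except ValueError:
        return None  # malformed string (json.JSONDecodeError is a ValueError)
    if not isinstance(values, list):
        return None
    try:
        return np.asarray(values, dtype=float)
    except (ValueError, TypeError):
        return None  # list with non-numeric or ragged content

good = parse_embedding("[-0.004826885, -0.0587869, -0.06438997]")
bad = parse_embedding("[0.1, 0.2")
missing = parse_embedding(float("nan"))
print(good, bad, missing)
```

With a parser like this, the number of `None` results can be checked up front, and a decision made on how to handle those items (for example a zero vector) without changing the item order.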

In [None]:
# Calculate the item-based predictions for positive interactions
item_EMBED_prediction = item_based_predict(entire_data, embedding_sim)
print("Predicted Interaction Matrix:")
print(item_EMBED_prediction)
print(item_EMBED_prediction.shape)

Predicted Interaction Matrix:
[[0.00170382 0.00172829 0.00167158 ... 0.00162291 0.00160954 0.00168624]
 [0.0007001  0.00068107 0.00071199 ... 0.00072365 0.00076317 0.00068279]
 [0.00322614 0.00318063 0.00320925 ... 0.00355201 0.0037767  0.00315969]
 ...
 [0.00018128 0.00017142 0.00017568 ... 0.00021122 0.00021997 0.0001848 ]
 [0.0001237  0.00012176 0.00012107 ... 0.00014238 0.0001533  0.00011867]
 [0.00018541 0.00018266 0.00018177 ... 0.00020662 0.00021311 0.00018768]]
(7838, 15291)


In [None]:
# Precision and recall without cross-validation. These scores are inflated because the
# predictions were computed from entire_data, which includes the test interactions.
precision_item_k, recall_item_k = precision_recall_at_k(item_EMBED_prediction, test_data_matrix, k=10)
print('Item-based EMBED Precision@K:', precision_item_k)
print('Item-based EMBED Recall@K:', recall_item_k)

Item-based EMBED Precision@K: 0.11751722378158948
Item-based EMBED Recall@K: 0.7115535622347688


In [None]:
# Cross-validation
def evaluate_one(seed):
    # Seed NumPy's global RNG, which train_test_split falls back to
    # when no random_state is given, so each run is reproducible.
    np.random.seed(seed)
    train_df, test_df = random_split_per_user(interactions)
    train_matrix = create_data_matrix(train_df, n_users, n_items)

    # Compute item-item similarity from the interactions in the current train split
    item_sim = cosine_similarity(train_matrix.T)
    prediction_matrix = item_based_predict(train_matrix, item_sim)

    # Evaluate on the corresponding test set
    test_matrix = create_data_matrix(test_df, n_users, n_items)
    p_at_k, r_at_k = precision_recall_at_k(prediction_matrix, test_matrix, k=10)

    return p_at_k, r_at_k

# Run cross-validation
seeds = list(range(5))
results = Parallel(n_jobs=-1)(
    delayed(evaluate_one)(seed) for seed in seeds
)

# Unpack and average
precisions, recalls = zip(*results)
mean_precision = np.mean(precisions)
mean_recall = np.mean(recalls)

# Print results
print(f"Mean Precision@10 = {mean_precision:.4f}")
print(f"Mean Recall@10    = {mean_recall:.4f}")

### BERT Similarity
Mean Precision@10 = 0.0272 <br>
Mean Recall@10    = 0.1760

In [None]:
# STEP 1: Combine text features
text_fields = ['Publisher', 'Subjects', 'google_api_title', 'author_clean', 'ISBN']
items['combined_text'] = items[text_fields].fillna('').agg(' '.join, axis=1)

# STEP 2: Align with train_data_matrix
items_ordered = items.set_index('book_id').loc[range(train_data_matrix.shape[1])]

# STEP 3: Load BERT model
model = SentenceTransformer('all-MiniLM-L6-v2')

# STEP 4: Encode book texts into embeddings
bert_embeddings = model.encode(items_ordered['combined_text'].tolist(), show_progress_bar=True)

# STEP 5: Compute cosine similarity
bert_sim = cosine_similarity(bert_embeddings)


In [None]:
# Calculate the item-based predictions for positive interactions
item_bert_prediction = item_based_predict(train_data_matrix, bert_sim)
print("Predicted Interaction Matrix:")
print(item_bert_prediction)
print(item_bert_prediction.shape)

Predicted Interaction Matrix:
[[1.38409534e-03 1.49448447e-03 1.30900817e-03 ... 1.29422665e-03
  1.18864182e-03 1.39533841e-03]
 [7.21433692e-04 6.04238447e-04 5.83082873e-04 ... 7.58418202e-04
  7.91102805e-04 5.99347015e-04]
 [3.41488950e-03 2.68107736e-03 2.45631168e-03 ... 4.05583460e-03
  5.26865042e-03 2.39774866e-03]
 ...
 [1.10743656e-04 7.87106288e-05 9.56198959e-05 ... 2.20251064e-04
  2.16924020e-04 1.09405376e-04]
 [1.52506734e-04 1.09890912e-04 1.09026144e-04 ... 1.87100465e-04
  2.64724579e-04 1.08594306e-04]
 [1.44722789e-04 1.22703426e-04 9.55393984e-05 ... 1.51565298e-04
  1.94664777e-04 1.96875283e-04]]
(7838, 15291)


In [None]:
# Create recommendation
item_bert_recommendations_df = create_recommendation_table(item_bert_prediction, top_n=10, separator=" ")

# Save and display
item_bert_recommendations_df.to_csv('item_bert_recommendations.csv', index=False)

print("\nItem-based Recommendations:")
display(item_bert_recommendations_df)


Item-based Recommendations:


Unnamed: 0,user_id,recommendation
0,0,13009 5254 1886 14255 13995 14284 12648 12906 ...
1,1,9819 30 31 7154 132 9921 7123 1807 7431 14553
2,2,14559 95 14850 11379 2142 3057 15066 13952 140...
3,3,1807 11379 132 155 151 7154 2185 12109 9921 11561
4,4,14079 14130 5935 10393 12672 12007 5345 7327 1...
...,...,...
7833,7833,975 12632 13009 9238 7322 10997 400 9334 5935 ...
7834,7834,14559 95 3057 13952 14547 15081 15066 2085 711...
7835,7835,15271 3057 3055 15081 2085 14559 95 7122 14547...
7836,7836,14559 95 9052 14547 3057 14550 2085 15081 7112...


In [None]:
p_at_k, r_at_k = precision_recall_at_k(item_bert_prediction, test_data_matrix, k=10)
print(f"Precision@10 = {p_at_k:.4f}")
print(f"Recall@10 = {r_at_k:.4f}")

Precision@10 = 0.0272
Recall@10 = 0.1676


In [None]:
# Cross-validation setup
seeds = list(range(5))  # 5 random seeds for 5 train-test splits

# Evaluate precision and recall for one run
def evaluate_one(seed):
    train_df, test_df = random_split_per_user(interactions, seed=seed)
    train_matrix = create_data_matrix(train_df, n_users, n_items)
    prediction_matrix = item_based_predict(train_matrix, bert_sim)
    test_matrix = create_data_matrix(test_df, n_users, n_items)
    p_at_k, r_at_k = precision_recall_at_k(prediction_matrix, test_matrix, k=10)
    return p_at_k, r_at_k

# Run evaluations in parallel
results = Parallel(n_jobs=-1)(
    delayed(evaluate_one)(seed) for seed in seeds
)

# Extract and average
precisions, recalls = zip(*results)
mean_precision = np.mean(precisions)
mean_recall = np.mean(recalls)

# Print results
print(f"Mean Precision@10 = {mean_precision:.4f}")
print(f"Mean Recall@10    = {mean_recall:.4f}")

Mean Precision@10 = 0.0272
Mean Recall@10    = 0.1760


## Collaborative Filtering

### CF Item-based
Mean Precision@10 = 0.0585 <br>
Mean Recall@10    = 0.2823

In [None]:
# Compute the item-item similarity matrix
item_similarity = cosine_similarity(entire_data.T)
print("Item-Item Similarity Matrix:")
print(item_similarity)
print(item_similarity.shape)

Item-Item Similarity Matrix:
[[1.         0.40824829 0.33333333 ... 0.         0.         0.        ]
 [0.40824829 1.         0.40824829 ... 0.         0.         0.        ]
 [0.33333333 0.40824829 1.         ... 0.         0.         0.        ]
 ...
 [0.         0.         0.         ... 1.         0.         0.        ]
 [0.         0.         0.         ... 0.         1.         0.        ]
 [0.         0.         0.         ... 0.         0.         1.        ]]
(15291, 15291)


In [None]:
# Calculate the item-based predictions for positive interactions
item_prediction = item_based_predict(entire_data, item_similarity)
print("Predicted Interaction Matrix:")
print(item_prediction)
print(item_prediction.shape)

In [None]:
# Create recommendation
item_CF_recommendations_df = create_recommendation_table(item_prediction, top_n=10, separator=" ")

# Save and display
item_CF_recommendations_df.to_csv('item_CF_recommendations.csv', index=False)

print("\nItem-based Recommendations:")
display(item_CF_recommendations_df)


Item-based Recommendations:


Unnamed: 0,user_id,recommendation
0,0,2 3 1 0
1,1,2 3 1 0
2,2,2 3 1 0
3,3,2 3 1 0
4,4,2 3 1 0
...,...,...
87042,87042,2 3 1 0
87043,87043,2 3 1 0
87044,87044,2 3 1 0
87045,87045,2 3 1 0


In [None]:
p_at_k, r_at_k = precision_recall_at_k(item_prediction, test_data_matrix, k=10)
print(f"Precision@10 = {p_at_k:.4f}")
print(f"Recall@10 = {r_at_k:.4f}")

Precision@10 = 0.0557
Recall@10 = 0.2640


In [None]:
# Cross-validation setup
seeds = list(range(5))  # 5 random seeds for 5 train-test splits

# Evaluate precision and recall for one run
def evaluate_one(seed):
    train_df, test_df = random_split_per_user(interactions, seed=seed)
    train_matrix = create_data_matrix(train_df, n_users, n_items)
    # Compute item-item similarity from the current train split only,
    # so that no test interactions leak into the predictions
    item_sim = cosine_similarity(train_matrix.T)
    prediction_matrix = item_based_predict(train_matrix, item_sim)
    test_matrix = create_data_matrix(test_df, n_users, n_items)
    p_at_k, r_at_k = precision_recall_at_k(prediction_matrix, test_matrix, k=10)
    return p_at_k, r_at_k

# Run evaluations in parallel
results = Parallel(n_jobs=-1)(
    delayed(evaluate_one)(seed) for seed in seeds
)

# Extract and average
precisions, recalls = zip(*results)
mean_precision = np.mean(precisions)
mean_recall = np.mean(recalls)

# Print results
print(f"Mean Precision@10 = {mean_precision:.4f}")
print(f"Mean Recall@10    = {mean_recall:.4f}")
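
The cross-validation above relies on `random_split_per_user`, defined earlier in the notebook. A minimal sketch of a per-user random hold-out is shown below for orientation only. It assumes `interactions` is a DataFrame with a unique index and a `user_id` column; the `item_id` column, the `test_fraction` parameter, and the function name `random_split_per_user_sketch` are hypothetical.

```python
import numpy as np
import pandas as pd

def random_split_per_user_sketch(interactions, test_fraction=0.2, seed=0):
    """Hold out a random fraction of each user's interactions (illustrative only)."""
    rng = np.random.default_rng(seed)
    test_index = []
    for _, group in interactions.groupby("user_id"):
        n_test = int(len(group) * test_fraction)
        if n_test > 0:  # users with too few interactions stay entirely in train
            chosen = rng.choice(group.index.to_numpy(), size=n_test, replace=False)
            test_index.extend(chosen)
    test_mask = interactions.index.isin(test_index)
    return interactions[~test_mask], interactions[test_mask]

# Toy data: users 0 and 1 have five interactions each, user 2 has one.
toy_df = pd.DataFrame({
    "user_id": [0] * 5 + [1] * 5 + [2],
    "item_id": list(range(5)) + list(range(5, 10)) + [10],
})

toy_train, toy_test = random_split_per_user_sketch(toy_df, test_fraction=0.2, seed=0)
print(len(toy_train), len(toy_test))
```

Seeding the generator per call is what makes each of the five seeds a reproducible, distinct split.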

### CF User-based
Mean Precision@10 = 0.0612 <br>
Mean Recall@10    = 0.3167

In [None]:
# Compute the user-user similarity matrix
user_similarity = cosine_similarity(entire_data)
print("User-User Similarity Matrix:")
print(user_similarity)

# Check the shape as a sanity check
print("Shape of User Similarity Matrix:", user_similarity.shape)

User-User Similarity Matrix:
[[1.         0.         0.         ... 0.         0.         0.        ]
 [0.         1.         0.         ... 0.         0.         0.        ]
 [0.         0.         1.         ... 0.         0.         0.08084521]
 ...
 [0.         0.         0.         ... 1.         0.         0.        ]
 [0.         0.         0.         ... 0.         1.         0.        ]
 [0.         0.         0.08084521 ... 0.         0.         1.        ]]
Shape of User Similarity Matrix: (7838, 7838)


In [None]:
# Define the function to predict interactions based on user similarity
def user_based_predict(interactions, similarity, epsilon=1e-9):
    """
    Predicts user-item interactions based on user-user similarity.
    Parameters:
        interactions (numpy array): The user-item interaction matrix.
        similarity (numpy array): The user-user similarity matrix.
        epsilon (float): Small constant added to the denominator to avoid division by zero.
    Returns:
        numpy array: The predicted interaction scores for each user-item pair.
    """
    # Calculate the weighted sum of interactions based on user similarity
    pred = similarity.dot(interactions) / (np.abs(similarity).sum(axis=1)[:, np.newaxis] + epsilon)
    return pred

# Calculate the user-based predictions for positive interactions
user_prediction = user_based_predict(entire_data, user_similarity)
print("Predicted Interaction Matrix (User-Based):")
print(user_prediction)
print(user_prediction.shape)

Predicted Interaction Matrix (User-Based):
[[0.12083887 0.12253831 0.12798326 ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.00421191 0.         0.         ... 0.         0.         0.        ]
 ...
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]]
(7838, 15291)
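
To make the formula in `user_based_predict` concrete, the sketch below reruns the same function on a toy matrix with a hand-written similarity matrix. The toy values are made up; only the function body comes from the cell above.

```python
import numpy as np

def user_based_predict(interactions, similarity, epsilon=1e-9):
    # Same formula as in the notebook: similarity-weighted average over users.
    return similarity.dot(interactions) / (np.abs(similarity).sum(axis=1)[:, np.newaxis] + epsilon)

# Toy user-item matrix (3 users x 4 items).
toy_interactions = np.array([
    [1, 1, 0, 0],  # user 0
    [1, 0, 1, 0],  # user 1 shares item 0 with user 0
    [0, 0, 0, 1],  # user 2 shares nothing with the others
], dtype=float)

# Cosine similarity for these rows: users 0 and 1 share 1 of 2 items each -> 0.5.
toy_similarity = np.array([
    [1.0, 0.5, 0.0],
    [0.5, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])

toy_pred = user_based_predict(toy_interactions, toy_similarity)
print(np.round(toy_pred, 4))
# User 0: (1.0 * [1, 1, 0, 0] + 0.5 * [1, 0, 1, 0]) / 1.5 = [1, 0.6667, 0.3333, 0]
```

Because each user is perfectly similar to themselves, items a user has already interacted with tend to receive the highest scores. Unless `create_recommendation_table` already filters them out, a masking step before ranking may be worth considering.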


In [None]:
# Build the top-10 recommendation table from the user-based scores
user_CF_recommendations_df = create_recommendation_table(user_prediction, top_n=10, separator=" ")

# Save and display
user_CF_recommendations_df.to_csv('user_CF_recommendations.csv', index=False)

print("\nUser-based Recommendations:")
display(user_CF_recommendations_df)


User-based Recommendations:


Unnamed: 0,user_id,recommendation
0,0,13 4 12 23 15 14 11 8 5 9
1,1,38 39 30 31 34 36 37 32 33 29
2,2,46 58 49 56 53 91 64 87 45 71
3,3,149 169 163 167 128 133 143 40 139 165
4,4,203 198 207 205 195 202 193 191 199 201
...,...,...
7833,7833,7760 975 7322 7306 611 8086 1130 9610 5838 11291
7834,7834,1367 13891 7128 8999 15276 3055 101 2125 10651...
7835,7835,3055 6791 4820 11126 8369 8999 9719 53 1367 15062
7836,7836,3471 14550 14552 611 15065 3470 8999 618 14557...
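
The table above pairs each `user_id` with a separator-joined list of item indices. A minimal sketch of how such a table could be built from a score matrix is given below. It is an assumption about the behaviour of `create_recommendation_table` (defined earlier in the notebook), and the name `create_recommendation_table_sketch` and the toy scores are hypothetical.

```python
import numpy as np
import pandas as pd

def create_recommendation_table_sketch(scores, top_n=10, separator=" "):
    """Rank items per user by descending score and join the top_n indices."""
    top_items = np.argsort(-scores, axis=1, kind="stable")[:, :top_n]
    recommendations = [separator.join(str(item) for item in row) for row in top_items]
    return pd.DataFrame({
        "user_id": np.arange(scores.shape[0]),
        "recommendation": recommendations,
    })

# Toy score matrix (2 users x 3 items); values are made up.
toy_scores = np.array([[0.1, 0.9, 0.5],
                       [0.7, 0.2, 0.3]])

toy_table = create_recommendation_table_sketch(toy_scores, top_n=2)
print(toy_table)
```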


In [None]:
p_at_k, r_at_k = precision_recall_at_k(user_prediction, test_data_matrix, k=10)
print(f"Precision@10 = {p_at_k:.4f}")
print(f"Recall@10 = {r_at_k:.4f}")

Precision@10 = 0.0565
Recall@10 = 0.2905


In [None]:
# Cross-validation: evaluate user-based CF on 5 random per-user train/test splits
def evaluate_one(seed):
    train_df, test_df = random_split_per_user(interactions, seed=seed)
    train_matrix = create_data_matrix(train_df, n_users, n_items)

    # Compute similarity from current train split
    user_sim = cosine_similarity(train_matrix)
    prediction_matrix = user_based_predict(train_matrix, user_sim)

    # Evaluate on corresponding test set
    test_matrix = create_data_matrix(test_df, n_users, n_items)
    p_at_k, r_at_k = precision_recall_at_k(prediction_matrix, test_matrix, k=10)

    return p_at_k, r_at_k

# Run cross-validation
seeds = list(range(5))
results = Parallel(n_jobs=-1)(
    delayed(evaluate_one)(seed) for seed in seeds
)

# Unpack and average
precisions, recalls = zip(*results)
mean_precision = np.mean(precisions)
mean_recall = np.mean(recalls)

# Print results
print(f"Mean Precision@10 = {mean_precision:.4f}")
print(f"Mean Recall@10    = {mean_recall:.4f}")

KeyboardInterrupt: 