# DSAIT4335 Recommender Systems
# Final Project

In this project, you will work to build different recommendation models and evaluate the effectiveness of these models through offline experiments. The dataset used for the experiments is **MovieLens100K**, a movie recommendation dataset collected by GroupLens: https://grouplens.org/datasets/movielens/100k/. For more details, check the project description on Brightspace.

# Instruction

MovieLens100K has already been split into 80% training and 20% test sets. Along with the training and test sets, movie metadata is provided as content information.

**Expected file structure** for this assignment:   
   
   ```
   RecSysProject/
   ├── training.txt
   ├── test.txt
   ├── movies.txt
   └── codes.ipynb
   ```

**Note:** Be sure to run all cells in each section sequentially, so that intermediate variables and packages are properly carried over to subsequent cells.

**Note:** Be sure to run all cells so that the submitted file contains the output of each cell.

**Note:** Feel free to add cells if you need more space to answer a question.

**Submission:** Answer all the questions in this Jupyter notebook file and submit it (your answers included) to Brightspace. Rename the file to your group number, for example, group10 -> 10.ipynb.

# Setup

In [1]:
!pip install transformers torch  # For BERT

# See https://huggingface.co/docs/transformers/en/model_doc/bert for the available versions of the pre-trained BERT model



In [2]:
# For BERT embeddings (install: pip install transformers torch)
print("Check the status of BERT installation:")

try:
    from transformers import AutoTokenizer, AutoModel
    import torch
    BERT_AVAILABLE = True
    print("BERT libraries loaded successfully!")
    # is_available must be called; the bare function object is always truthy
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    print(f"Using device: {device}")
except ImportError:
    BERT_AVAILABLE = False
    print("BERT libraries not available. Install with: pip install transformers torch")

Check the status of BERT installation:
BERT libraries loaded successfully!
Using device: cuda


In [48]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.sparse import csr_matrix
from scipy.spatial.distance import cosine, correlation
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances
from sklearn.preprocessing import StandardScaler, MultiLabelBinarizer
import re
import time, math
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')

import importlib, utils
importlib.reload(utils)

# Set random seed for reproducibility
np.random.seed(42)

print("Libraries imported successfully!")

Libraries imported successfully!


# Load dataset

In [4]:
# loading the training set and test set
columns_name=['user_id','item_id','rating','timestamp']
train_data = pd.read_csv('training.txt', sep='\t', names=columns_name)
test_data = pd.read_csv('test.txt', sep='\t', names=columns_name)

print(f'The training data:')
display(train_data[['user_id','item_id','rating']].head())
print(f'The shape of the training data: {train_data.shape}')
print('--------------------------------')
print(f'The test data:')
display(test_data[['user_id','item_id','rating']].head())
print(f'The shape of the test data: {test_data.shape}')

The training data:


|   | user_id | item_id | rating |
|---|---------|---------|--------|
| 0 | 1 | 1 | 5 |
| 1 | 1 | 2 | 3 |
| 2 | 1 | 3 | 4 |
| 3 | 1 | 4 | 3 |
| 4 | 1 | 5 | 3 |


The shape of the training data: (80000, 4)
--------------------------------
The test data:


|   | user_id | item_id | rating |
|---|---------|---------|--------|
| 0 | 1 | 6 | 5 |
| 1 | 1 | 10 | 3 |
| 2 | 1 | 12 | 5 |
| 3 | 1 | 14 | 5 |
| 4 | 1 | 17 | 3 |


The shape of the test data: (20000, 4)


In [5]:
movies = pd.read_csv('movies.txt',names=['item_id','title','genres','description'],sep='\t')
movies.head()

|   | item_id | title | genres | description |
|---|---------|-------|--------|-------------|
| 0 | 1 | Toy Story (1995) | Animation, Children's, Comedy | A group of sentient toys, who pretend to be li... |
| 1 | 2 | GoldenEye (1995) | Action, Adventure, Thriller | In 1986, MI6 agents James Bond and Alec Trevel... |
| 2 | 3 | Four Rooms (1995) | Thriller | On New Year's Eve, bellhop Sam (Marc Lawrence)... |
| 3 | 4 | Get Shorty (1995) | Action, Comedy, Drama | Chili Palmer is a Miami-based loan shark and m... |
| 4 | 5 | Copycat (1995) | Crime, Drama, Thriller | After giving a guest lecture on criminal psych... |


# Task 1) Implementation of different recommendation models as well as a hybrid model combining those recommendation models

## 1.1 Content-based Recommendation

### 1.1.1 Deriving content representation with BERT

In [6]:
def create_bert_embeddings(content):
    """
    Generate BERT embeddings for movie content.

    Args:
        content: Content of items

    Returns:
        numpy.ndarray: BERT embeddings matrix
    """
    if not BERT_AVAILABLE:
        print("BERT libraries not available. Install with: pip install transformers torch")
        return None

    if content is None:
        return None

    if isinstance(content, pd.Series):
        content = content.fillna("").astype(str).tolist()
    elif isinstance(content, np.ndarray):
        content = content.astype(str).tolist()

    model_name = 'distilbert-base-uncased'

    print(f"Loading BERT model: {model_name}")

    # Load tokenizer and model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)

    # Set device (GPU if available)
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    print(f"Using cuda or cpu: {device}")
    model.to(device)
    model.eval()

    print(f"Using device: {device}")

    # Generate embeddings in batches
    batch_size = 32  # Adjust based on available memory
    emb = []

    for i in range(0, len(content), batch_size):
        if i % (batch_size * 10) == 0:
            print(f"Processing batch {i//batch_size + 1}/{len(content)//batch_size + 1}")

        batch_texts = content[i:i + batch_size]

        # Tokenize batch
        inputs = tokenizer(
            batch_texts,
            padding=True,
            truncation=True,
            max_length=512,
            return_tensors='pt'
        )

        # Move to device
        inputs = {k: v.to(device) for k, v in inputs.items()}

        # Generate embeddings
        with torch.no_grad():
            outputs = model(**inputs)

            # Use [CLS] token embedding (first token)
            batch_embeddings = outputs.last_hidden_state[:, 0, :].cpu().numpy()
            emb.extend(batch_embeddings)

    emb = np.array(emb)

    print(f"BERT embeddings generated: {emb.shape}")
    print(f"Embedding dimension: {emb.shape[1]}")

    return emb

In [7]:
# Sample content
sample_content = ['aa','ab','ac','bc']

# Generate BERT embeddings (this may take several minutes)
print("Generating BERT embeddings...")
bert_embeddings = create_bert_embeddings(sample_content)

if bert_embeddings is not None:
    print("BERT embeddings created successfully!")
else:
    print("BERT embeddings not available. Continuing with TF-IDF only.")

Generating BERT embeddings...
Loading BERT model: distilbert-base-uncased
Using cuda or cpu: cpu
Using device: cpu
Processing batch 1/1
BERT embeddings generated: (4, 768)
Embedding dimension: 768
BERT embeddings created successfully!
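
The progress message in `create_bert_embeddings` reports the batch total as `len(content)//batch_size + 1`, which overcounts by one whenever the number of texts is an exact multiple of the batch size. The following is a minimal sketch of ceiling division for that count; `num_batches` and `iter_batches` are hypothetical helpers, not part of the notebook.

```python
import math

def num_batches(n_items: int, batch_size: int) -> int:
    """Number of batches needed to cover n_items (hypothetical helper)."""
    return math.ceil(n_items / batch_size)

def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

texts = [f"movie {i}" for i in range(64)]   # toy input: exactly two full batches
batches = list(iter_batches(texts, 32))

print(len(batches), num_batches(len(texts), 32))  # actual vs. computed batch count
print(len(texts) // 32 + 1)                       # the notebook's formula, one too many here
```

For the 1682 movies used in this notebook both formulas agree (53 batches), so the discrepancy only affects the printed progress, not the embeddings.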


### 1.1.2 Deriving the representation of items for three types of content:

    1) title + genres 
    2) description 
    3) title + genres + description

In [8]:
"""
1) Content type: title + genres
"""
# Implement code to derive the content representation for title and genres. Concatenate the two fields as: title + ' ' + genres
item_emb_titlegenres = None

############# Your code here ############
item_emb_titlegenres = movies['title'] + ' ' + movies['genres']
print("Generating BERT embeddings for title and genres...")
title_genre_embeddings = create_bert_embeddings(item_emb_titlegenres)

if title_genre_embeddings is not None:
    print("BERT embeddings created successfully!")
else:
    print("BERT embeddings not available. Continuing with TF-IDF only.")
#########################################

Generating BERT embeddings for title and genres...
Loading BERT model: distilbert-base-uncased
Using cuda or cpu: cpu
Using device: cpu
Processing batch 1/53
Processing batch 11/53
Processing batch 21/53
Processing batch 31/53
Processing batch 41/53
Processing batch 51/53
BERT embeddings generated: (1682, 768)
Embedding dimension: 768
BERT embeddings created successfully!


In [9]:
"""
2) Content type: description
"""

# Implement code to derive the content representation for description.
item_emb_description = None

############# Your code here ############
item_emb_description = movies['description']
print("Generating BERT embeddings for description...")
desc_embeddings = create_bert_embeddings(item_emb_description)

if desc_embeddings is not None:
    print("BERT embeddings created successfully!")
else:
    print("BERT embeddings not available. Continuing with TF-IDF only.")
#########################################

Generating BERT embeddings for description...
Loading BERT model: distilbert-base-uncased
Using cuda or cpu: cpu
Using device: cpu
Processing batch 1/53
Processing batch 11/53
Processing batch 21/53
Processing batch 31/53
Processing batch 41/53
Processing batch 51/53
BERT embeddings generated: (1682, 768)
Embedding dimension: 768
BERT embeddings created successfully!


In [10]:
"""
3) title + genres + description
"""
# Implement code to derive the content representation for title, genres, and description. Concatenate the three fields as: title + ' ' + genres + ' ' + description
item_emb_full = None

############# Your code here ############
item_emb_full = movies['title'] + ' ' + movies['genres'] + ' ' + movies['description']
print("Generating BERT embeddings for title, genres, and description...")
full_embeddings = create_bert_embeddings(item_emb_full)

if full_embeddings is not None:
    print("BERT embeddings created successfully!")
else:
    print("BERT embeddings not available. Continuing with TF-IDF only.")
#########################################

Generating BERT embeddings for title, genres, and description...
Loading BERT model: distilbert-base-uncased
Using cuda or cpu: cpu
Using device: cpu
Processing batch 1/53
Processing batch 11/53
Processing batch 21/53
Processing batch 31/53
Processing batch 41/53
Processing batch 51/53
BERT embeddings generated: (1682, 768)
Embedding dimension: 768
BERT embeddings created successfully!


In [11]:
def get_item_emb(item_id, content_type):
    # Implement the function that, given a content type (title+genres, description, or title+genres+description), returns the embedding derived for the corresponding item_id.
    # Hint1: keep in mind that item_id in the data starts from 1, but the embedding array is indexed from 0, e.g., item_id 100 corresponds to index 99 in the embedding array.
    # Hint2: use if-else conditions to return the embedding for the requested content types.
    # Hint3: use the global variables (embeddings) already computed in previous cells.

    emb = None
    
    ############# Your code here ############

    if content_type == 'title_genres':
        emb = title_genre_embeddings[item_id - 1]

    elif content_type == 'description':
        emb = desc_embeddings[item_id - 1]

    elif content_type == 'full':
        emb = full_embeddings[item_id - 1]
    
    #########################################

    return emb
    
item_id = 100
print('Embedding representation for content type = title+genres:')
print(get_item_emb(item_id, 'title_genres'))
print('------------------------------------------')
print('Embedding representation for content type = description:')
print(get_item_emb(item_id, 'description'))
print('------------------------------------------')
print('Embedding representation for content type = full:')
print(get_item_emb(item_id, 'full'))
print('------------------------------------------')

Embedding representation for content type = title+genres:
[-1.33968905e-01 -9.01783630e-02 -1.27716556e-01 -1.31242529e-01
 -3.00812423e-02  1.16647147e-01  2.68211424e-01  1.28216147e-01
 -2.02165321e-02 -5.45811206e-02  1.50339022e-01 -5.03519289e-02
 -1.71943054e-01  3.98922533e-01  3.19539309e-01  3.08140308e-01
 -1.24051079e-01  3.27949256e-01  1.63815111e-01 -1.64276898e-01
 -1.18869416e-01 -2.46317118e-01  1.83684215e-01  9.05874521e-02
 -3.29116620e-02  1.50853410e-01 -3.94599497e-01 -9.15597677e-02
  1.46123141e-01  2.79694200e-01  2.88073067e-02  8.70803818e-02
 -1.64137289e-01 -6.84375018e-02  3.54835957e-01  1.68893710e-02
  1.22078344e-01 -1.09478347e-01 -5.38688600e-02 -8.63623321e-02
  2.87261866e-02  7.36701563e-02  2.03959405e-01 -1.56151742e-01
  8.09049010e-02 -8.71455073e-02 -2.65358353e+00 -2.31110737e-01
 -2.65957355e-01 -7.23349303e-02  2.97734737e-01  1.20156772e-01
  1.21810257e-01  1.11444123e-01  3.60954404e-01  5.76821685e-01
 -9.38287824e-02  8.00883248e-02
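
Because `get_item_emb` indexes with `item_id - 1`, an `item_id` of 0 would silently return the last row through NumPy's negative indexing rather than fail. A small sketch of a guarded lookup is shown below; `lookup_item_row` is a hypothetical helper, and it assumes (as the hint does) that row `i` of the embedding array belongs to `item_id = i + 1`.

```python
import numpy as np

def lookup_item_row(embeddings: np.ndarray, item_id: int) -> np.ndarray:
    """Return the embedding row for a 1-based item_id (hypothetical helper)."""
    n_items = embeddings.shape[0]
    if not 1 <= item_id <= n_items:
        # Without this check, item_id=0 would map to index -1 (the last row)
        raise KeyError(f"item_id {item_id} is outside 1..{n_items}")
    return embeddings[item_id - 1]

toy = np.arange(12, dtype=float).reshape(4, 3)  # 4 items, 3-dimensional embeddings
print(lookup_item_row(toy, 1))  # row for the first item
print(lookup_item_row(toy, 4))  # row for the last item
```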

### 1.1.3 User profile construction

In [12]:
def get_interacted_items_embs_rating(train_data, user_id, content_type):
    # Implement the function that, given a content type (title+genres, description, or title+genres+description), returns the embeddings and ratings of the items the given user interacted with.
    # Hint1: use train_data to retrieve the item_ids that the target user (user_id=100 in this example) interacted with, then pass these item_ids to the previously implemented function to retrieve the embeddings.

    embs, ratings = [], []
    
    ############# Your code here ############

    data_from_user = train_data[train_data['user_id'] == user_id]

    item_ids = data_from_user['item_id'].tolist()
    ratings = data_from_user['rating'].tolist()

    # Look up the embedding of every item the user rated
    # (the loop variable is named `item` to avoid shadowing the built-in `id`)
    embs = [get_item_emb(item, content_type) for item in item_ids]

    #########################################

    return embs, ratings
    
user_id = 100
print('Embeddings and ratings of interacted items by user_id=100 for content type = title+genres:')
print(get_interacted_items_embs_rating(train_data, user_id, 'title_genres'))
print('------------------------------------------')
print('Embeddings and ratings of interacted items by user_id=100 for content type = description:')
print(get_interacted_items_embs_rating(train_data, user_id, 'description'))
print('------------------------------------------')
print('Embeddings and ratings of interacted items by user_id=100 for content type = full:')
print(get_interacted_items_embs_rating(train_data, user_id, 'full'))
print('------------------------------------------')

Embeddings and ratings of interacted items by user_id=100 for content type = title+genres:
([array([-1.10550933e-01, -7.49561414e-02, -2.26197496e-01, -1.45411298e-01,
       -1.07525643e-02,  1.18971057e-01,  1.49337322e-01,  2.11917177e-01,
       -8.58951658e-02,  1.91682912e-02,  6.93990588e-02, -1.28075734e-01,
       -6.95621520e-02,  2.59627551e-01,  2.46748179e-01,  1.54867515e-01,
       -1.78490847e-01,  1.92843705e-01,  1.00374177e-01, -1.94543749e-01,
       -1.80487126e-01, -2.29439080e-01,  1.18793085e-01,  1.36365905e-01,
       -1.24469828e-02,  1.01595521e-01, -1.72738701e-01,  8.16611648e-02,
        1.84625953e-01,  1.74651161e-01,  5.71562126e-02,  1.30099282e-01,
       -1.81853950e-01, -1.11466974e-01,  2.43417323e-01, -8.62910300e-02,
        1.54848695e-01, -9.63722840e-02, -3.63760591e-02,  3.50998044e-02,
        3.31870690e-02,  1.02180831e-01,  1.62405699e-01, -1.32392168e-01,
        3.56112719e-02, -7.92589635e-02, -2.44606781e+00, -9.26957056e-02,
       

In [13]:
def get_user_emb(train_data, user_id, content_type, aggregation_method):
    # Implement the function that, given a content type (title+genres, description, or title+genres+description) and an aggregation method (avg, weighted_avg, avg_pos), returns the representation of a user.
    # Hint1: use the previously implemented functions for retrieving the ratings and representations of the items a user interacted with.

    emb = []
    
    ############# Your code here ############

    embs, ratings = get_interacted_items_embs_rating(train_data, user_id, content_type)

    if aggregation_method == 'avg':
        emb = np.mean(embs, axis=0)

    elif aggregation_method == 'weighted_avg':
        emb = np.average(embs, weights=ratings, axis=0)

    elif aggregation_method == 'avg_pos':
        # Keep only positively rated items (rating >= 4)
        emb_list = [e for e, rating in zip(embs, ratings) if rating >= 4]
        emb = np.mean(emb_list, axis=0)
    
    #########################################

    return emb
    
user_id = 100
content_type, aggregation_method = 'full', 'avg' # alternatives are content_type={title_genres,description,full} and aggregation_method={avg,weighted_avg,avg_pos}
print('Embeddings of user_id=100 for content type '+content_type+' by aggregation method '+aggregation_method+':')
print(get_user_emb(train_data, user_id, content_type, aggregation_method))

Embeddings of user_id=100 for content type full by aggregation method avg:
[-3.77594739e-01 -1.62529156e-01 -2.09185570e-01 -2.44604424e-01
  1.60393193e-01  3.55566591e-02  3.65352690e-01  2.70110629e-02
 -6.46184024e-04 -2.21369550e-01 -9.99604687e-02  2.31478326e-02
 -1.80722773e-01  2.96081394e-01 -8.29827785e-02  1.43453732e-01
  1.55283257e-01  2.35295773e-01  1.67761549e-01 -4.14667204e-02
  5.51840700e-02 -2.42448300e-01  2.64884174e-01  3.63100886e-01
  6.64624292e-03  2.56050024e-02 -1.33248776e-01 -3.82726230e-02
 -9.05934814e-03  1.32687137e-01  1.40577391e-01  5.12599535e-02
 -5.17638028e-02 -4.51556295e-01  3.91069129e-02 -1.01737857e-01
  8.39059502e-02 -4.78987619e-02  7.95425624e-02  2.78782159e-01
  2.07502278e-04  1.23829536e-01 -1.82174578e-01 -1.16067857e-01
  6.78640977e-02 -3.27166051e-01 -3.30903816e+00 -8.06847736e-02
 -9.30583030e-02 -2.10115910e-01  3.48131359e-01  1.98958904e-01
  1.32903159e-01  2.29476675e-01  3.23887020e-01  2.07131654e-01
 -3.34604740e-0
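
The three aggregation methods are easier to compare on a toy example with two-dimensional embeddings. The sketch below mirrors the logic of `get_user_emb`; the `aggregate` helper is hypothetical, and its fallback to the plain average when no item is rated 4 or higher is an assumption added here (the notebook version would return NaN in that case).

```python
import numpy as np

def aggregate(embs, ratings, method):
    """Combine item embeddings into one user vector (hypothetical helper)."""
    embs = np.asarray(embs, dtype=float)
    ratings = np.asarray(ratings, dtype=float)
    if method == "avg":
        return embs.mean(axis=0)
    if method == "weighted_avg":
        # Each item contributes in proportion to its rating
        return np.average(embs, weights=ratings, axis=0)
    if method == "avg_pos":
        liked = embs[ratings >= 4]
        # Assumption: fall back to the plain average if nothing is rated >= 4
        return liked.mean(axis=0) if len(liked) else embs.mean(axis=0)
    raise ValueError(f"unknown aggregation method: {method}")

embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ratings = [5, 1, 4]

for method in ("avg", "weighted_avg", "avg_pos"):
    print(method, aggregate(embs, ratings, method))
```

Here `weighted_avg` pulls the profile toward the first and third items (values 0.9 and 0.5), while `avg_pos` ignores the item rated 1 entirely (values 1.0 and 0.5).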

### 1.1.4 Content-based prediction

In [14]:
def get_user_item_prediction(train_data, user_id, item_id, content_type, aggregation_method):
    # Implement the function that given content type and aggregation method returns the predicted rating for a user-item pair. 
    # Hint1: use the previously implemented functions for retrieving the embeddings and then compute the dot product of user and item embeddings.

    pred_rating = 0.0
    
    ############# Your code here ############

    item_emb = get_item_emb(item_id, content_type)
    user_emb = get_user_emb(train_data, user_id, content_type, aggregation_method)
    # Raw dot product: an unbounded relevance score. It equals cosine similarity only if both vectors are L2-normalized.
    pred_rating = np.dot(item_emb, user_emb)
    
    #########################################

    return pred_rating
    
user_id, item_id = 100, 266
content_type, aggregation_method = 'full', 'avg' # alternatives are content_type={title_genres,description,full} and aggregation_method={avg,weighted_avg,avg_pos}
print('Predicted score for user_id=100 and item_id=266 for content type '+content_type+' and aggregation method '+aggregation_method+':')
print(get_user_item_prediction(train_data, user_id, item_id, content_type, aggregation_method))

Predicted score for user_id=100 and item_id=266 for content type full and aggregation method avg:
142.68709
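
The score of 142.68709 is a raw dot product, so it is unbounded and not on the 1 to 5 rating scale. It is usable for ranking items, but it is not a cosine similarity. The following sketch contrasts the two quantities on toy vectors; the `cosine` helper is illustrative and not part of the notebook.

```python
import numpy as np

def cosine(u, v) -> float:
    """Cosine similarity: dot product divided by the product of the L2 norms."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0

user = np.array([3.0, 4.0])
item = np.array([6.0, 8.0])   # same direction as `user`, twice the length

print(np.dot(user, item))  # 50.0, grows with the vector lengths
print(cosine(user, item))  # 1.0, depends only on the angle
```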


## 1.2 User-based Collaborative Filtering

### 1.2.1 Pearson correlation coefficient

In [15]:
def pearson_correlation(user1_ratings: pd.Series, user2_ratings: pd.Series) -> float:
    """
    Compute Pearson correlation coefficient between two users' rating vectors.
    
    user1_ratings, user2_ratings: Pandas Series indexed by item IDs. They may contain NaN for unrated items.
    Returns: float (correlation between -1 and 1). Returns 0 if not enough data.
    """
    result = 0.0
    
    ############# Your code here ############

    # Find set of items rated by both users
    s1, s2 = user1_ratings.align(user2_ratings, join="inner")
    mask = s1.notna() & s2.notna()
    x, y = s1[mask].astype(float), s2[mask].astype(float)

    #print("Ratings of the common items by user1:", x)
    #print("Ratings of the common items by user2:", y)

    # We need at least 2 co-rated items to compute the Pearson correlation
    # (a zero denominator is handled separately below)
    if x.size < 2:
        return 0.0

    mean_X = x.mean()
    mean_Y = y.mean()

    diff_X = x - mean_X
    diff_Y = y - mean_Y

    num = (diff_X * diff_Y).sum()
    denom = math.sqrt((diff_X * diff_X).sum()) * math.sqrt((diff_Y * diff_Y).sum())
    if denom == 0.0 or not np.isfinite(denom):
        return 0.0

    result = num / denom

    #########################################
    
    return result

user1, user2 = 1, 2
user1_ratings = train_data[train_data['user_id'] == user1].set_index('item_id')['rating']
user2_ratings = train_data[train_data['user_id'] == user2].set_index('item_id')['rating']
print(f"Pearson Correlation between users {user1} and {user2} is {pearson_correlation(user1_ratings, user2_ratings):.4f}")

Pearson Correlation between users 1 and 2 is 0.2697
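
As a sanity check on the formula, the correlation over co-rated items can be worked through by hand on two toy users and compared with `np.corrcoef`. The rating values below are made up for illustration. As in `pearson_correlation`, the means are taken over the co-rated items only.

```python
import numpy as np
import pandas as pd

# Toy users: they share items 20, 30 and 40
u1 = pd.Series({10: 5, 20: 3, 30: 4, 40: 1})
u2 = pd.Series({20: 2, 30: 5, 40: 1, 50: 4})

common = u1.index.intersection(u2.index)
x = u1.loc[common].to_numpy(dtype=float)
y = u2.loc[common].to_numpy(dtype=float)

# Mean-center each user's ratings on the co-rated items
dx, dy = x - x.mean(), y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Reference value from NumPy on the same co-rated vectors
r_numpy = np.corrcoef(x, y)[0, 1]

print(f"manual: {r_manual:.4f}, numpy: {r_numpy:.4f}")
```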


### 1.2.2 User-user similarity matrix

In [16]:
def compute_user_similarity_matrix(train_data: pd.DataFrame) -> pd.DataFrame:
    """
    Compute user-user similarity matrix using Pearson correlation.
    
    Parameters:
    - train_data: pd.DataFrame with columns ['user_id', 'item_id', 'rating']
    
    Returns:
    - pd.DataFrame: user-user similarity matrix (rows & cols = user_ids)
    """
    users = train_data['user_id'].unique()
    user_similarity_matrix = pd.DataFrame(np.zeros((len(users), len(users))), index=users, columns=users) # Create a user x user similarity matrix
    
    ############# Your code here ############

    # Build each user's rating Series once instead of re-filtering train_data inside the loops
    user_ratings = {u: g.set_index('item_id')['rating'] for u, g in train_data.groupby('user_id')}

    for i in range(len(users)):
        for j in range(i, len(users)):  # compute only upper triangle (to speed up the computation)
            user1_id = users[i]
            user2_id = users[j]

            sim = pearson_correlation(user_ratings[user1_id], user_ratings[user2_id])
            # Use .loc: chained indexing (df[a][b] = x) may not write through under pandas copy-on-write
            user_similarity_matrix.loc[user2_id, user1_id] = sim   # fill in the upper cell
            user_similarity_matrix.loc[user1_id, user2_id] = sim   # reflect on the lower cell
                        
    #########################################
    
    return user_similarity_matrix

In [17]:
from pathlib import Path
from utils import save_similarity_matrix, load_similarity_matrix

"""
Compute and save the user-user similarity matrix in the artifacts directory.
If the similarity matrix is already saved, load it from the artifacts directory.
"""

latest_path = Path("artifacts/similarity/user_sim_latest.parquet")

if latest_path.exists():
    print("Loading cached user similarity matrix…")
    user_similarity_matrix = load_similarity_matrix(str(latest_path))
else:
    start_time = time.time()
    print("Computing user similarity matrix…")
    user_similarity_matrix = compute_user_similarity_matrix(train_data)
    user_paths = save_similarity_matrix(user_similarity_matrix, name="user")
    print("Similarity matrix is saved in the directory:", user_paths)  
    end_time = time.time()
    print(f'Running time: {end_time - start_time:.4f} seconds')

Loading cached user similarity matrix…
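
The pairwise loop in `compute_user_similarity_matrix` calls `pearson_correlation` once per user pair. As a rough sketch of a faster alternative, pandas can compute the same kind of pairwise-complete Pearson correlation in one call with `DataFrame.corr` on an item-by-user pivot. The toy `ratings` frame and the name `item_by_user` are illustrative assumptions, and the result should be compared against the loop version before relying on it.

```python
import pandas as pd

# Toy ratings: users 1 and 2 share three items, user 3 shares only item 10 with them
ratings = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 2, 3, 3],
    "item_id": [10, 20, 30, 10, 20, 30, 10, 40],
    "rating":  [5, 3, 4, 4, 2, 5, 1, 5],
})

# Rows are items, columns are users; unrated cells become NaN
item_by_user = ratings.pivot(index="item_id", columns="user_id", values="rating")

# Pairwise Pearson over co-rated items; pairs with fewer than 2 common items give NaN.
# NaN (too few common items or zero variance) is mapped to 0, as in pearson_correlation.
user_sim = item_by_user.corr(method="pearson", min_periods=2).fillna(0.0)

print(user_sim.round(4))
```

Users 1 and 3 share a single item, so their similarity is set to 0 here, matching the `x.size < 2` rule in the notebook's function.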


### 1.2.3 Predicting the rating for a target user and target item using the user-user similarity matrix

In [18]:
def get_k_user_neighbors(user_similarity_matrix: pd.DataFrame, target_user, k=5):
    """
    Retrieve top-k most similar users to the target user.

    Parameters:
    - user_similarity_matrix: pd.DataFrame, user-user similarity values (indexed by user IDs)
    - target_user: user ID for whom we want neighbors
    - k: number of neighbors to retrieve

    Returns:
    - List of tuples: [(neighbor_user_id, similarity), ...] sorted by similarity descending
    """
    top_k_neighbors = []
    
    ############# Your code here ############

    sorted_neighbors = user_similarity_matrix[target_user].sort_values(ascending=False)
    top_neighbors = sorted_neighbors[:k]
    for neighbor, sim in top_neighbors.items():
        top_k_neighbors.append((neighbor, sim))
    
    #########################################
    
    return top_k_neighbors

target_user, k = 1, 10
print(f"Neighbors of user {target_user} are:")
get_k_user_neighbors(user_similarity_matrix, target_user, k)

Neighbors of user 1 are:


[(926, 1.0000000000000002),
 (289, 1.0000000000000002),
 (46, 1.0000000000000002),
 (29, 1.0000000000000002),
 (920, 1.0000000000000002),
 (656, 1.0000000000000002),
 (123, 1.0),
 (732, 1.0),
 (740, 1.0),
 (898, 1.0)]
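
Two things are worth noting about this list. First, `user_similarity_matrix[target_user]` still contains the target user's own entry, so the user can appear among their own neighbors. Second, a Pearson value of exactly 1.0 can arise from pairs with only two co-rated items, where the correlation is always +1 or -1 when both users' ratings differ. The sketch below shows one way to drop the self-entry; `top_k_excluding_self` is a hypothetical variant, not the notebook's function.

```python
import pandas as pd

def top_k_excluding_self(sim_matrix: pd.DataFrame, target, k: int = 5):
    """Top-k (neighbor_id, similarity) pairs, never including `target` itself."""
    scores = sim_matrix[target].drop(labels=target, errors="ignore")
    top = scores.sort_values(ascending=False).head(k)
    return [(neighbor, float(sim)) for neighbor, sim in top.items()]

# Toy symmetric similarity matrix for three users
sim = pd.DataFrame(
    [[1.0, 0.2, 0.9],
     [0.2, 1.0, -0.1],
     [0.9, -0.1, 1.0]],
    index=[1, 2, 3], columns=[1, 2, 3],
)

print(top_k_excluding_self(sim, target=1, k=2))
```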

In [19]:
def predict_rating_user_based(train_data: pd.DataFrame, user_similarity_matrix: pd.DataFrame, target_user, target_item, k=5):
    """
    Predict rating for target_user and target_item using mean-centered user-based CF.

    Parameters:
    - train_data: pd.DataFrame with columns ['user_id', 'item_id', 'rating']
    - user_similarity_matrix: pd.DataFrame of user-user similarities
    - target_user: user ID
    - target_item: item ID
    - k: number of neighbors to consider

    Returns:
    - float: predicted rating; falls back to the target user's mean rating when no selected neighbor rated the item
    """
    result = 0.0
    
    ############# Your code here ############

    # Define minimum and maximum possible values for rating predictions
    rmin=1.0
    rmax=5.0

    # Find the top-k neighbors of the target user and select only those who have rated the target item
    top_k_neighbors = get_k_user_neighbors(user_similarity_matrix, target_user, k)
    selected_neighbors = []
    for neighbor, sim in top_k_neighbors:
        if neighbor in train_data[train_data['item_id'] == target_item]['user_id'].unique():
            neighbor_rating = train_data[(train_data['user_id'] == neighbor) & (train_data['item_id'] == target_item)]['rating'].values[0]
            selected_neighbors.append((neighbor, sim, neighbor_rating))

    # If there are not any neighbors who rated on the same item, return the mean rating of the target user
    mean_rating = train_data[train_data['user_id'] == target_user]['rating'].mean()
    if len(selected_neighbors) == 0:
        return float(np.clip(mean_rating, rmin, rmax))

    sim_sum = sum([abs(sim) for _, sim, _ in selected_neighbors])
    if sim_sum == 0:
        return float(np.clip(mean_rating, rmin, rmax))
    
    num = 0.0
    # Calculate the weighted average rating based on the similarity scores of the neighbors
    for neighbor, sim, neighbor_rating in selected_neighbors:
        mean_neighbor_rating = train_data[(train_data['user_id'] == neighbor)]['rating'].mean()
        num += sim * (neighbor_rating - mean_neighbor_rating)

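    # Clip to the valid rating range, as is already done for the fallback predictions above
    result = float(np.clip(mean_rating + (num / sim_sum), rmin, rmax))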

    #########################################

    return result

target_user, target_item, k = 1, 17, 50
print(f"The actual rating for user {target_user} and item {target_item} is 3. The predicted rating by user-based CF for user {target_user} and item {target_item} is {predict_rating_user_based(train_data, user_similarity_matrix, target_user, target_item, k):.4f}")

The actual rating for user 1 and item 17 is 3. The predicted rating by user-based CF for user 1 and item 17 is 4.9473
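
The mean-centered prediction adds a similarity-weighted average of the neighbors' deviations from their own means to the target user's mean. A minimal numeric sketch of that arithmetic follows; `mean_centered_prediction` and the toy neighbor tuples are hypothetical, and clipping to the 1 to 5 range is included as an assumption about the desired output scale.

```python
import numpy as np

def mean_centered_prediction(target_mean, neighbors, rmin=1.0, rmax=5.0):
    """neighbors: list of (similarity, neighbor_rating, neighbor_mean) tuples."""
    sim_sum = sum(abs(s) for s, _, _ in neighbors)
    if sim_sum == 0:
        # No usable neighbors: fall back to the target user's mean
        return float(np.clip(target_mean, rmin, rmax))
    offset = sum(s * (r - m) for s, r, m in neighbors) / sim_sum
    return float(np.clip(target_mean + offset, rmin, rmax))

# Neighbor A rated the item 1 point above their mean, neighbor B 1 point below
neighbors = [(0.8, 5.0, 4.0), (0.4, 2.0, 3.0)]
print(mean_centered_prediction(3.5, neighbors))   # 3.5 + (0.8 - 0.4) / 1.2

# A large positive deviation on top of a high mean is clipped to rmax
print(mean_centered_prediction(4.8, [(1.0, 5.0, 3.0)]))
```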


## 1.3 Item-based Collaborative Filtering

### 1.3.1 Cosine similarity

In [23]:
def cosine_similarity(item1_ratings: pd.Series, item2_ratings: pd.Series) -> float:
    """
    Compute cosine similarity between two items' rating vectors.
    Only common users are considered.
    
    Parameters:
    - item1_ratings, item2_ratings: pd.Series indexed by user_id
    
    Returns:
    - float: cosine similarity between -1 and 1
    """
    result = 0.0
    
    ############# Your code here ############

    i1, i2 = item1_ratings.align(item2_ratings, join='inner')
    mask = i1.notna() & i2.notna()
    x, y = i1[mask].astype(float), i2[mask].astype(float)

    #print("Ratings for item1 by common users", x)
    #print("Ratings for item2 by common users", y)

    # Require at least 2 common users; with fewer, treat the similarity as 0
    if x.size < 2:
        return 0.0
    
    num = (x * y).sum()
    denom = math.sqrt((x * x).sum()) * math.sqrt((y * y).sum())
    if denom == 0.0 or not np.isfinite(denom):
        return 0.0

    result = num / denom
    
    #########################################

    return result

item1, item2 = 1, 2
item1_ratings = train_data[train_data['item_id'] == item1].set_index('user_id')['rating']
item2_ratings = train_data[train_data['item_id'] == item2].set_index('user_id')['rating']
print(f"Cosine similarity between items {item1} and {item2} is {cosine_similarity(item1_ratings, item2_ratings):.4f}")

Cosine similarity between items 1 and 2 is 0.9500
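
A value as high as 0.9500 is typical for raw cosine on ratings: because all ratings are positive, the cosine of two rating vectors lies between 0 and 1 and stays fairly high even when users disagree. The toy sketch below compares raw cosine with cosine computed after centering each item's co-rated ratings; the centered variant is shown only for contrast and is not what the notebook implements.

```python
import numpy as np
import pandas as pd

def cosine(x, y) -> float:
    """Cosine similarity of two aligned vectors (0.0 if either has zero norm)."""
    denom = np.sqrt((x * x).sum() * (y * y).sum())
    return float((x * y).sum() / denom) if denom > 0 else 0.0

# Toy items: users 2, 3 and 4 rated both, with roughly opposite preferences
item_a = pd.Series({1: 5, 2: 4, 3: 1, 4: 2}, dtype=float)
item_b = pd.Series({2: 1, 3: 5, 4: 3, 5: 4}, dtype=float)

a, b = item_a.align(item_b, join="inner")   # keep common users only

raw = cosine(a, b)
centered = cosine(a - a.mean(), b - b.mean())
print(f"raw cosine: {raw:.3f}, mean-centered cosine: {centered:.3f}")
```

The raw cosine comes out positive even though the common users rank the two items in opposite order, whereas the centered version is negative.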


### 1.3.2 Item-item similarity matrix

In [None]:
def compute_item_similarity_matrix(train_data: pd.DataFrame) -> pd.DataFrame:
    """
    Compute item-item similarity matrix using cosine similarity.
    
    Parameters:
    - train_data: pd.DataFrame with columns ['user_id', 'item_id', 'rating']
    
    Returns:
    - pd.DataFrame: item-item similarity matrix (rows & cols = item_ids)
    """
    items = train_data['item_id'].unique()
    item_similarity_matrix = pd.DataFrame(np.zeros((len(items), len(items))), index=items, columns=items)
    
    ############# Your code here ############

    # Build each item's rating Series once instead of re-filtering train_data inside the loops
    item_ratings = {it: g.set_index('user_id')['rating'] for it, g in train_data.groupby('item_id')}

    for i in range(len(items)):
        for j in range(i, len(items)): # compute only upper triangle (to speed up the computation)
            item1_id = items[i]
            item2_id = items[j]

            sim = cosine_similarity(item_ratings[item1_id], item_ratings[item2_id])
            # Use .loc: chained indexing (df[a][b] = x) may not write through under pandas copy-on-write
            item_similarity_matrix.loc[item2_id, item1_id] = sim   # fill in the upper cell
            item_similarity_matrix.loc[item1_id, item2_id] = sim   # reflect on the lower cell
    
    #########################################
    
    return item_similarity_matrix

In [25]:
from pathlib import Path
from utils import save_similarity_matrix, load_similarity_matrix

"""
Compute and save the item-item similarity matrix in the artifacts directory.
If the similarity matrix is already saved, load it from the artifacts directory.
"""

latest_path = Path("artifacts/similarity/item_sim_latest.parquet")

if latest_path.exists():
    print("Loading cached item similarity matrix…")
    item_similarity_matrix = load_similarity_matrix(str(latest_path))
else:
    start_time = time.time()
    print("Computing item similarity matrix…")
    item_similarity_matrix = compute_item_similarity_matrix(train_data)
    item_paths = save_similarity_matrix(item_similarity_matrix, name="item")
    print("Similarity matrix is saved in the directory:", item_paths)  
    end_time = time.time()
    print(f'Running time: {end_time - start_time:.4f} seconds')

Computing item similarity matrix…
Similarity matrix is saved in the directory: {'parquet': 'artifacts/similarity/item_sim_1650x1650_20251011-194334.parquet', 'csv': 'artifacts/similarity/item_sim_1650x1650_20251011-194334.csv.gz', 'pickle': 'artifacts/similarity/item_sim_1650x1650_20251011-194334.pkl', 'meta': 'artifacts/similarity/item_sim_1650x1650_20251011-194334.meta.json'}
Running time: 834.9603 seconds
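
Most of the 834.9603 seconds is spent in the Python-level double loop. The same co-rated cosine can be obtained from a few matrix products on a dense user-by-item array, sketched below on toy data. The function `cosine_item_similarity`, the convention that 0 marks an unrated cell, and the `min_common` parameter are assumptions of this sketch; `cosine_pair_loop` is a reference implementation used only to check the result.

```python
import numpy as np

def cosine_item_similarity(R, min_common: int = 2) -> np.ndarray:
    """Item-item cosine over common users. R: users x items, 0 = unrated."""
    R = np.asarray(R, dtype=float)
    M = (R > 0).astype(float)          # 1 where a rating exists
    num = R.T @ R                      # sum over common users of r_ui * r_uj
    sq = (R ** 2).T @ M                # sq[i, j]: sum of r_ui^2 over users who also rated j
    denom = np.sqrt(sq * sq.T)
    common = M.T @ M                   # number of users who rated both items
    sim = np.zeros_like(num)
    ok = (common >= min_common) & (denom > 0)
    sim[ok] = num[ok] / denom[ok]
    return sim

def cosine_pair_loop(R, i, j, min_common: int = 2) -> float:
    """Reference: cosine of items i and j restricted to users who rated both."""
    both = (R[:, i] > 0) & (R[:, j] > 0)
    if both.sum() < min_common:
        return 0.0
    x, y = R[both, i], R[both, j]
    return float(x @ y / np.sqrt((x @ x) * (y @ y)))

R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 2, 4, 4],
], dtype=float)

S = cosine_item_similarity(R)
print(np.round(S, 4))
```

Item 2 (third column) has a single rating, so every similarity involving it is 0, which matches the notebook's rule of requiring at least two common users.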


### 1.3.3 Predicting the rating for a target user and target item using the item-item similarity matrix

In [26]:
def get_k_item_neighbors(item_similarity_matrix: pd.DataFrame, target_item, k=5):
    """
    Retrieve top-k most similar items to the target item.
    
    Parameters:
    - item_similarity_matrix: pd.DataFrame, item-item similarity
    - target_item: item ID
    - k: number of neighbors
    
    Returns:
    - List of tuples: [(neighbor_item_id, similarity), ...]
    """
    top_k_neighbors = []

    ############# Your code here ############

    if target_item not in item_similarity_matrix.columns:
        return []

    sorted_neighbors = item_similarity_matrix[target_item].sort_values(ascending=False)
    top_neighbors = sorted_neighbors[:k]
    for neighbor, sim in top_neighbors.items():
        top_k_neighbors.append((neighbor, sim))
    
    #########################################
    
    return top_k_neighbors

target_item, k = 1, 10
print(f"Neighbors of item {target_item} are:")
get_k_item_neighbors(item_similarity_matrix, target_item, k)

Neighbors of item 1 are:


[(1, 1.0),
 (1024, 1.0),
 (1445, 1.0),
 (938, 1.0),
 (757, 1.0),
 (1150, 1.0),
 (1534, 1.0),
 (592, 0.9999999999999999),
 (1189, 0.9999999999999999),
 (954, 0.9999999999999999)]
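
The output above lists item 1 as its own nearest neighbor, because the diagonal entry of the similarity matrix takes part in the sort. Below is a minimal sketch of a variant that drops the target item before selecting the top k. The helper name `top_k_neighbors_excluding_self` and the toy matrix are illustrative assumptions, not part of the assignment template.

```python
import pandas as pd

def top_k_neighbors_excluding_self(sim: pd.DataFrame, target_item, k=5):
    """Return [(item_id, similarity), ...] for the k most similar items, skipping target_item."""
    if target_item not in sim.columns:
        return []
    # Remove the diagonal entry so the item cannot be its own neighbor
    column = sim[target_item].drop(labels=[target_item], errors="ignore")
    top = column.nlargest(k)  # sorted by similarity, descending
    return list(top.items())

# Toy 3x3 similarity matrix (hypothetical values, for illustration only)
toy_sim = pd.DataFrame(
    [[1.0, 0.8, 0.2],
     [0.8, 1.0, 0.5],
     [0.2, 0.5, 1.0]],
    index=[1, 2, 3], columns=[1, 2, 3],
)
neighbors = top_k_neighbors_excluding_self(toy_sim, target_item=1, k=2)
print(neighbors)
```

Whether the self-match matters in practice depends on the caller: a user who has not rated the target item in the training data can never contribute a rating for it, but the self-match still occupies one of the k slots.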

In [27]:
def predict_rating_item_based(train_data: pd.DataFrame, item_similarity_matrix: pd.DataFrame, target_user, target_item, k=5):
    """
    Predict a rating using item-based CF (non-mean-centered).
    
    Parameters:
    - train_data: pd.DataFrame ['user_id', 'item_id', 'rating']
    - item_similarity_matrix: item-item similarity DataFrame
    - target_user: user ID
    - target_item: item ID
    - k: number of neighbors to use
    
    Returns:
    - float: predicted rating; the user's mean rating if target_item is not in the
      similarity matrix, or 0.0 if the user rated none of the k nearest items
    """
    result = 0.0

    ############# Your code here ############

    if target_item not in item_similarity_matrix.columns:
        return train_data[train_data['user_id'] == target_user]['rating'].mean() # Return user mean rating

    top_k_neighbors = get_k_item_neighbors(item_similarity_matrix, target_item, k)
    selected_neighbors = [] # In this case, neighbors are the items
    for item_neighbor, sim in top_k_neighbors:
        # If the target user has rated the neighbor item
        if item_neighbor in train_data[train_data['user_id'] == target_user]['item_id'].unique():
            neighbor_rating = train_data[(train_data['user_id'] == target_user) & (train_data['item_id'] == item_neighbor)]['rating'].values[0]
            selected_neighbors.append((item_neighbor, sim, neighbor_rating))

    sim_sum = sum([abs(sim) for _, sim, _ in selected_neighbors])
    if sim_sum == 0:
        return 0.0
    
    numerator = 0.0
    # Calculate the weighted average rating based on the similarity scores of the neighbors
    for _, sim, neighbor_rating in selected_neighbors:
        numerator += sim * neighbor_rating

    result += numerator / sim_sum

    #########################################
    
    return result

target_user, target_item, k = 1, 17, 50
print(f"The actual rating for user {target_user} and item {target_item} is 3. The predicted rating by item-based CF for user {target_user} and item {target_item} is {predict_rating_item_based(train_data, item_similarity_matrix, target_user, target_item, k):.4f}")

The actual rating for user 1 and item 17 is 3. The predicted rating by item-based CF for user 1 and item 17 is 2.3376


## 1.4 Matrix Factorization

In [28]:
class MatrixFactorizationSGD:
    """
    Matrix Factorization for rating prediction using Stochastic Gradient Descent (SGD).
    
    Rating matrix R ≈ P x Q^T + biases
    """
    def __init__(self, n_factors=20, learning_rate=0.01, regularization=0.02, n_epochs=20, use_bias=True):
        self.n_factors = n_factors
        self.learning_rate = learning_rate
        self.regularization = regularization
        self.n_epochs = n_epochs
        self.use_bias = use_bias

        # Model parameters
        self.P = None  # User latent factors
        self.Q = None  # Item latent factors
        self.user_bias = None
        self.item_bias = None
        self.global_mean = None

    def fit(self, ratings, verbose=True):
        """
        Train the model.
        
        Args:
            ratings (pd.DataFrame): dataframe with [user_id, item_id, rating]
        """
        # Map IDs to indices
        self.user_mapping = {u: i for i, u in enumerate(ratings['user_id'].unique())}
        self.item_mapping = {i: j for j, i in enumerate(ratings['item_id'].unique())}
        self.user_inv = {i: u for u, i in self.user_mapping.items()}
        self.item_inv = {j: i for i, j in self.item_mapping.items()}

        n_users = len(self.user_mapping)
        n_items = len(self.item_mapping)

        # Initialize factors
        self.P = np.random.normal(0, 0.1, (n_users, self.n_factors))
        self.Q = np.random.normal(0, 0.1, (n_items, self.n_factors))

        if self.use_bias:
            self.user_bias = np.zeros(n_users)
            self.item_bias = np.zeros(n_items)
            self.global_mean = ratings['rating'].mean()

        # Convert to (user_idx, item_idx, rating) triples
        training_data = [(self.user_mapping[u], self.item_mapping[i], r)
                         for u, i, r in zip(ratings['user_id'], ratings['item_id'], ratings['rating'])]

        # SGD loop
        for epoch in range(self.n_epochs):
            np.random.shuffle(training_data)
            total_error = 0

            for u, i, r in training_data:
                pred = np.dot(self.P[u], self.Q[i])
                if self.use_bias:
                    pred += self.global_mean + self.user_bias[u] + self.item_bias[i]

                err = r - pred
                total_error += err ** 2

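                # Updates (copy the old factors so both updates use pre-update values;
                # without .copy(), P_u is a view that changes when self.P[u] is updated)
                P_u = self.P[u].copy()
                Q_i = self.Q[i].copy()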

                self.P[u] += self.learning_rate * (err * Q_i - self.regularization * P_u)
                self.Q[i] += self.learning_rate * (err * P_u - self.regularization * Q_i)

                if self.use_bias:
                    self.user_bias[u] += self.learning_rate * (err - self.regularization * self.user_bias[u])
                    self.item_bias[i] += self.learning_rate * (err - self.regularization * self.item_bias[i])

            rmse = np.sqrt(total_error / len(training_data))
            if verbose:
                print(f"Epoch {epoch+1}/{self.n_epochs} - RMSE: {rmse:.4f}")

        return self

    def predict_single(self, user_id, item_id):
        """Predict rating for a single (user, item) pair"""
        if user_id not in self.user_mapping or item_id not in self.item_mapping:
            return np.nan

        u = self.user_mapping[user_id]
        i = self.item_mapping[item_id]

        pred = np.dot(self.P[u], self.Q[i])
        if self.use_bias:
            pred += self.global_mean + self.user_bias[u] + self.item_bias[i]
        return pred

    def predict(self, test_data):
        """Predict ratings for a test dataframe with [user_id, item_id]"""
        preds = []
        for u, i in zip(test_data['user_id'], test_data['item_id']):
            preds.append(self.predict_single(u, i))
        return np.array(preds)

    def recommend_topk(self, user_id, train_data, n=10, exclude_seen=True):
        """
        Generate Top-K recommendations for a given user.

        Args:
            user_id (int): target user ID (original ID, not index).
            train_data (pd.DataFrame): training ratings [user_id, item_id, rating],
                                       used to exclude already-seen items.
            n (int): number of recommendations.
            exclude_seen (bool): whether to exclude items the user already rated.

        Returns:
            list of (item_id, predicted_score) sorted by score desc.
        """
        if user_id not in self.user_mapping:
            return []

        u = self.user_mapping[user_id]

        # Predict scores for all items
        scores = np.dot(self.P[u], self.Q.T)
        if self.use_bias:
            scores += self.global_mean + self.user_bias[u] + self.item_bias

        # Exclude seen items
        if exclude_seen:
            seen_items = train_data[train_data['user_id'] == user_id]['item_id'].values
            seen_idx = [self.item_mapping[i] for i in seen_items if i in self.item_mapping]
            scores[seen_idx] = -np.inf

        # Get top-K items
        top_idx = np.argsort(scores)[::-1][:n]
        top_items = [self.item_inv[i] for i in top_idx]
        top_scores = scores[top_idx]

        return list(zip(top_items, top_scores))
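
For reference, the per-rating updates in `fit` can be written out as follows. This is a restatement of the code above rather than an additional method: $\eta$ stands for `learning_rate`, $\lambda$ for `regularization`, and $\mu$ for `global_mean`. With `use_bias=False`, the terms $\mu$, $b_u$, and $b_i$ are dropped.

$$
\hat{r}_{ui} = \mu + b_u + b_i + p_u^\top q_i, \qquad e_{ui} = r_{ui} - \hat{r}_{ui}
$$

$$
\begin{aligned}
p_u &\leftarrow p_u + \eta\,(e_{ui}\, q_i - \lambda\, p_u) \\
q_i &\leftarrow q_i + \eta\,(e_{ui}\, p_u - \lambda\, q_i) \\
b_u &\leftarrow b_u + \eta\,(e_{ui} - \lambda\, b_u) \\
b_i &\leftarrow b_i + \eta\,(e_{ui} - \lambda\, b_i)
\end{aligned}
$$

These steps correspond to stochastic gradient descent on the regularized squared error $\tfrac{1}{2} e_{ui}^2 + \tfrac{\lambda}{2}\left(\lVert p_u \rVert^2 + \lVert q_i \rVert^2 + b_u^2 + b_i^2\right)$ for one observed rating at a time.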

In [29]:
"""
MF for prediction task
"""

# Train model 
# Parameters: n_factors refers to embedding size, n_epochs refers to number of epochs, learning_rate refers to learning rate, and regularization refers to lambda hyperparameter controlling the effect of regularization terms
mf = MatrixFactorizationSGD(n_factors=50, n_epochs=10, learning_rate=0.001, regularization=0.001)
mf.fit(train_data, verbose=True)

# predict the rating for a target user and a target item
target_user = 1
target_item = 17
actual_rating = 3
pred_rating = mf.predict_single(target_user, target_item)
print(f"The actual rating for user {target_user} and item {target_item} is {actual_rating}. The predicted rating by MF for user {target_user} and item {target_item} is {pred_rating:.4f}")

Epoch 1/10 - RMSE: 1.0965
Epoch 2/10 - RMSE: 1.0573
Epoch 3/10 - RMSE: 1.0312
Epoch 4/10 - RMSE: 1.0125
Epoch 5/10 - RMSE: 0.9984
Epoch 6/10 - RMSE: 0.9873
Epoch 7/10 - RMSE: 0.9783
Epoch 8/10 - RMSE: 0.9708
Epoch 9/10 - RMSE: 0.9645
Epoch 10/10 - RMSE: 0.9590
The actual rating for user 1 and item 17 is 3. The predicted rating by MF for user 1 and item 17 is 3.4983


In [30]:
"""
MF for ranking task
"""

# Train model 
# Parameters: n_factors refers to embedding size, n_epochs refers to number of epochs, learning_rate refers to learning rate, and regularization refers to lambda hyperparameter controlling the effect of regularization terms
mf = MatrixFactorizationSGD(n_factors=10, n_epochs=10, learning_rate=0.001, regularization=0.001)
mf.fit(train_data, verbose=True)

# Get top-10 recommendations for user 1
recommendations = mf.recommend_topk(user_id=1, train_data=train_data, n=10)
print("Top-10 Recommendations for user 1:")
for item, score in recommendations:
    print(f"Item {item}: {score:.4f}")

Epoch 1/10 - RMSE: 1.0944
Epoch 2/10 - RMSE: 1.0560
Epoch 3/10 - RMSE: 1.0306
Epoch 4/10 - RMSE: 1.0126
Epoch 5/10 - RMSE: 0.9992
Epoch 6/10 - RMSE: 0.9889
Epoch 7/10 - RMSE: 0.9806
Epoch 8/10 - RMSE: 0.9738
Epoch 9/10 - RMSE: 0.9682
Epoch 10/10 - RMSE: 0.9634
Top-10 Recommendations for user 1:
Item 64: 4.4136
Item 483: 4.4110
Item 318: 4.3957
Item 12: 4.3777
Item 313: 4.3221
Item 98: 4.2959
Item 427: 4.2894
Item 357: 4.2575
Item 603: 4.2419
Item 174: 4.2418


## 1.5 Evaluation Metrics

### 1.5.1 MAE, MSE, and RMSE metrics for rating prediction task

In [31]:
def MAE(actual_ratings, pred_ratings):
    # Implement a function that computes MAE error between actual ratings and predicted ratings. 
    # Note that actual_ratings and pred_ratings are lists.
    
    result = 0.0
    
    ############# Your code here ############

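    # Convert to arrays so that plain Python lists are supported as well
    actual = np.asarray(actual_ratings, dtype=float)
    pred = np.asarray(pred_ratings, dtype=float)
    result = np.mean(np.abs(actual - pred))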
    
    #########################################

    return result

def MSE(actual_rating, pred_rating):
    # Implement a function that computes MSE error between actual ratings and predicted ratings. 
    # Note that actual_ratings and pred_ratings are lists.
    
    result = 0.0
    
    ############# Your code here ############

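    # Convert to arrays so that plain Python lists are supported as well
    actual = np.asarray(actual_rating, dtype=float)
    pred = np.asarray(pred_rating, dtype=float)
    result = np.mean((actual - pred) ** 2)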
    
    #########################################

    return result

def RMSE(actual_rating, pred_rating):
    # Implement a function that computes RMSE error between actual ratings and predicted ratings. 
    # Note that actual_ratings and pred_ratings are lists.
    
    result = 0.0
    
    ############# Your code here ############

    mse = MSE(actual_rating, pred_rating)
    result = np.sqrt(mse)
    
    #########################################

    return result
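
As a quick sanity check of the three error metrics, here is a small worked example on made-up ratings. The arrays are hypothetical and the computation is written out directly with NumPy, so the block runs on its own.

```python
import numpy as np

# Hypothetical actual and predicted ratings for three user-item pairs
actual = np.array([4.0, 3.0, 5.0])
predicted = np.array([3.5, 3.0, 4.0])

errors = actual - predicted            # [0.5, 0.0, 1.0]
mae = np.mean(np.abs(errors))          # (0.5 + 0.0 + 1.0) / 3
mse = np.mean(errors ** 2)             # (0.25 + 0.0 + 1.0) / 3
rmse = np.sqrt(mse)                    # square root of the MSE

print(f"MAE={mae:.4f}, MSE={mse:.4f}, RMSE={rmse:.4f}")
```

Because RMSE squares the errors before averaging, the single error of 1.0 weighs more heavily in RMSE than in MAE.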

### 1.5.2 Precision, Recall, NDCG, MRR, and MAP metrics for ranking task

In [32]:
# Note: Ground truth items are the items the user actually liked and/or rated highly in the test set.
# I defined all metrics below on a scale of 0 to 1 to be consistent. 

def Precision(ground_truth, rec_list): 
    '''
    Definition: Among the recommended items, what percentage of them is relevant?
    '''
    # Implement a function that computes Precision across ground truth data and recommendation list generated for each user. 
    # Note that ground_truth and rec_list contain the list of items for all users, e.g., 2-dimensional arrays.
    
    precisions = []
    
    ############# Your code here ############

    # Find the precision per user
    for rated_items, recommended_items in zip(ground_truth, rec_list):

        if len(recommended_items) == 0:
            precisions.append(0.0)
            
        else:
            relevance_set = set(rated_items) 
            recommended_set = set(recommended_items)
            intersection = relevance_set.intersection(recommended_set)
            precision = len(intersection) / len(recommended_items)
            precisions.append(precision)

    result = np.mean(precisions)

    #########################################

    return result

def Recall(ground_truth, rec_list):
    '''
    Definition: Among all relevant items to the users, what percentage of them is recommended?
    '''
    # Implement a function that computes Recall across ground truth data and recommendation list generated for each user. 
    # Note that ground_truth and rec_list contain the list of items for all users, e.g., 2-dimensional arrays.
    
    recalls = []
    
    ############# Your code here ############

    for rated_items, recommended_items in zip(ground_truth, rec_list):

        if len(rated_items) == 0:
            recalls.append(0.0)
            
        else:
            relevance_set = set(rated_items) 
            recommended_set = set(recommended_items)
            intersection = relevance_set.intersection(recommended_set)
            recall = len(intersection) / len(rated_items)
            recalls.append(recall)

    result = np.mean(recalls)
    
    #########################################

    return result

def NDCG(ground_truth, rec_list):
    # Implement a function that computes NDCG across ground truth data and recommendation list generated for each user. 
    # Note that ground_truth and rec_list contain the list of items for all users, e.g., 2-dimensional arrays.
    
    ndcgs = []
    
    ############# Your code here ############

    for rated_items, recommended_items in zip(ground_truth, rec_list):
        relevance_set = set(rated_items)
        
        cumulative_dcg = 0.0
        cumulative_idcg = 0.0

        # Now loop over the recommended items (their order matters)
        for i, recommendation in enumerate(recommended_items, start=1): # Start at 1 because the rank starts at 1, not 0
            is_relevant = 1 if recommendation in relevance_set else 0
            curr_dcg = (2**is_relevant - 1) / np.log2(i + 1)
            cumulative_dcg += curr_dcg
            
        # Now compute the ndcg
        ideal_hits = min(len(relevance_set), len(recommended_items))
        if ideal_hits == 0:
            ndcgs.append(0.0)
            continue

        for i in range(1, ideal_hits + 1):
            curr_idcg = 1 / np.log2(i + 1)
            cumulative_idcg += curr_idcg

        ndcgs.append(cumulative_dcg / cumulative_idcg)

    result = np.mean(ndcgs)
    
    #########################################

    return result

def MRR(ground_truth, rec_list):
    # Implement a function that computes MRR across ground truth data and recommendation list generated for each user. 
    # Note that ground_truth and rec_list contain the list of items for all users, e.g., 2-dimensional arrays.
    
    mrrs = []
    
    ############# Your code here ############

    for rated_items, recommended_items in zip(ground_truth, rec_list):
        relevance_set = set(rated_items)
        reciprocal_rank = 0.0 # Default value if no relevant item is recommended

        # Loop over the recommended items to find the rank of the first relevant item
        for rank, recommendation in enumerate(recommended_items, start=1): # Start at 1 because the rank starts at 1, not 0
            if recommendation in relevance_set:
                reciprocal_rank = 1.0 / rank
                break
            
        mrrs.append(reciprocal_rank)

    result = np.mean(mrrs)
    
    #########################################

    return result

def MAP(ground_truth, rec_list):
    # Implement a function that computes MAP across ground truth data and recommendation list generated for each user. 
    # Note that ground_truth and rec_list contain the list of items for all users, e.g., 2-dimensional arrays.
    
    avg_precisions = []
    
    ############# Your code here ############

    for rated_items, recommended_items in zip(ground_truth, rec_list):
        relevance_set = set(rated_items)

        sum_precision = 0.0
        count_relevant_seen = 0
        for rank, recommendation in enumerate(recommended_items, start=1): # Start at 1 because the rank starts at 1, not 0
            if recommendation in relevance_set:
                count_relevant_seen += 1
                curr_precision = count_relevant_seen / rank
                sum_precision += curr_precision

        if count_relevant_seen > 0:
            avg_precisions.append(sum_precision / count_relevant_seen)
        else:
            avg_precisions.append(0.0)

    result = np.mean(avg_precisions)
    
    #########################################

    return result
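
To make the ranking metrics concrete, the following sketch computes each of them for a single hypothetical user. It follows the same conventions as the functions above (binary relevance, NDCG normalized by `min(#relevant, list length)`, average precision divided by the number of hits in the list), but it is written inline so that it runs independently. The item IDs are made up.

```python
import numpy as np

relevant = {1, 3, 5}          # hypothetical ground-truth items for one user
recommended = [1, 2, 3, 4]    # hypothetical ranked list, best first

# 1 where the recommended item is relevant, 0 otherwise
hits = [1 if item in relevant else 0 for item in recommended]

precision = sum(hits) / len(recommended)    # 2 hits out of 4 recommended
recall = sum(hits) / len(relevant)          # 2 hits out of 3 relevant

# Binary-relevance DCG: a hit at rank r contributes 1 / log2(r + 1)
dcg = sum(h / np.log2(rank + 1) for rank, h in enumerate(hits, start=1))
ideal_hits = min(len(relevant), len(recommended))
idcg = sum(1 / np.log2(rank + 1) for rank in range(1, ideal_hits + 1))
ndcg = dcg / idcg

# Reciprocal rank of the first hit (0.0 if there is none)
reciprocal_rank = next((1 / r for r, h in enumerate(hits, start=1) if h), 0.0)

# Average precision: mean of precision@r over the ranks r where a hit occurs
precisions_at_hits = [sum(hits[:r]) / r for r, h in enumerate(hits, start=1) if h]
average_precision = float(np.mean(precisions_at_hits)) if precisions_at_hits else 0.0

print(precision, recall, ndcg, reciprocal_rank, average_precision)
```

Note that dividing average precision by the number of hits in the list, rather than by the number of relevant items, is one of several conventions in use; the choice affects how lists that miss many relevant items are penalized.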

## 1.6 Hybrid Recommender 

Our *hybrid recommender* combines the predictions of the four models (content-based, user-based CF, item-based CF, and MF) using linear regression.
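
The combination step can be illustrated with a small sketch: the four base predictions form a feature vector, which is multiplied by learned weights, shifted by an intercept, and clipped to the rating scale. All numbers below are hypothetical and are not the fitted coefficients of our model.

```python
import numpy as np

# Hypothetical base predictions for one (user, item) pair, in the order
# [content-based, user-based CF, item-based CF, MF]
base_predictions = np.array([3.2, 3.8, 4.1, 3.6])

# Hypothetical weights and intercept (assumed values, not the fitted ones)
coef = np.array([0.1, 0.3, 0.2, 0.4])
intercept = 0.05

# Linear combination, then clip to the 1-5 rating scale
raw_score = intercept + np.dot(coef, base_predictions)
hybrid_rating = float(np.clip(raw_score, 1.0, 5.0))
print(round(hybrid_rating, 4))
```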

### 1.6.1 Define the HybridLinear class

In [44]:
import numpy as np
import pandas as pd

class HybridLinear:
    def __init__(self, coef, intercept,
                 train_data, user_similarity_matrix, item_similarity_matrix, mf_model,
                 content_type='full', aggregation_method='avg',
                 k_neighbors=50, rmin=1.0, rmax=5.0):
        self.coef_ = np.asarray(coef, dtype=float)
        self.intercept_ = float(intercept)
        self.train_data = train_data
        self.user_sim = user_similarity_matrix
        self.item_sim = item_similarity_matrix
        self.mf = mf_model
        self.content_type = content_type
        self.aggregation_method = aggregation_method
        self.k = k_neighbors
        self.rmin = rmin
        self.rmax = rmax

        # Precompute fallbacks
        self.global_mean = float(train_data['rating'].mean())
        self.user_mean = train_data.groupby('user_id')['rating'].mean().to_dict()
        self.item_mean = train_data.groupby('item_id')['rating'].mean().to_dict()

    def _fallback(self, u, i):
        return self.user_mean.get(u, self.item_mean.get(i, self.global_mean))

    @staticmethod
    def _set_fallback_value(x, fallback):
        x = float(x) if x is not None else np.nan
        return fallback if (np.isnan(x) or not np.isfinite(x)) else x

    def _features_for(self, user_id, item_id):
        # 1) Content-based
        try:
            content_based = get_user_item_prediction(
                self.train_data, user_id, item_id, self.content_type, self.aggregation_method
            )
        except Exception:
            content_based = np.nan

        # 2) User-based CF
        try:
            user_based = predict_rating_user_based(
                self.train_data, self.user_sim, user_id, item_id, k=self.k
            )
        except Exception:
            user_based = np.nan

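        # 3) Item-based CF
        # predict_rating_item_based (section 1.3.3) takes only k; clipping to
        # [rmin, rmax] is applied to the final hybrid score in predict().
        try:
            item_based = predict_rating_item_based(
                self.train_data, self.item_sim, user_id, item_id, k=self.k
            )
        except Exception:
            item_based = np.nan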

        # 4) Matrix Factorization
        try:
            mf = self.mf.predict_single(user_id, item_id)
        except Exception:
            mf = np.nan

        fb = self._fallback(user_id, item_id)
        content_based = self._set_fallback_value(content_based, fb)
        user_based   = self._set_fallback_value(user_based, fb)
        item_based   = self._set_fallback_value(item_based, fb)
        mf           = self._set_fallback_value(mf, fb)

        return np.array([content_based, user_based, item_based, mf], dtype=float)

    def predict(self, user_id, item_id):
        x = self._features_for(user_id, item_id)
        yhat = float(self.intercept_ + np.dot(self.coef_, x))
        return float(np.clip(yhat, self.rmin, self.rmax))

    def batch_predict(self, user_item_df: pd.DataFrame):
        preds = []
        for u, i in user_item_df[['user_id', 'item_id']].itertuples(index=False):
            preds.append(self.predict(u, i))
        return np.array(preds, dtype=float)


### 1.6.2 Training pipeline

In [45]:
from sklearn.linear_model import Ridge

"""
Main training loop for the linear regression function
that uses 4 base predictors: [ContentBased, UserCF, ItemCF, MF].
"""

def fit_linear(X, y, alpha=1.0):
    """Uses ridge regression to fit a linear model to the training data."""
    model = Ridge(alpha=alpha, fit_intercept=True, random_state=42)
    model.fit(X, y)
    return model.coef_.astype(float), float(model.intercept_)

def fit_hybrid_linear(train_data: pd.DataFrame,
                      user_similarity_matrix: pd.DataFrame,
                      item_similarity_matrix: pd.DataFrame,
                      mf_model,
                      content_type='full',
                      aggregation_method='avg',
                      k_neighbors=50,
                      alpha=1.0,
                      rmin=1.0, rmax=5.0,
                      eval_holdout=True, holdout_frac=0.2, seed=42):
    
    # Build training features
    def build_features(df):
        # A single temporary HybridLinear is enough to reuse the feature logic;
        # its coefficients are never used here.
        tmp = HybridLinear(coef=[0, 0, 0, 0], intercept=0.0,
                           train_data=train_data,
                           user_similarity_matrix=user_similarity_matrix,
                           item_similarity_matrix=item_similarity_matrix,
                           mf_model=mf_model,
                           content_type=content_type,
                           aggregation_method=aggregation_method,
                           k_neighbors=k_neighbors,
                           rmin=rmin, rmax=rmax)
        rows = [tmp._features_for(u, i)
                for u, i in df[['user_id', 'item_id']].itertuples(index=False)]
        X = np.vstack(rows)
        y = df['rating'].to_numpy(dtype=float)
        return X, y
    
    X_train, y_train = build_features(train_data)
    coef, intercept = fit_linear(X_train, y_train, alpha=alpha)

    hybrid_model = HybridLinear(coef, intercept,
                          train_data=train_data,
                          user_similarity_matrix=user_similarity_matrix,
                          item_similarity_matrix=item_similarity_matrix,
                          mf_model=mf_model,
                          content_type=content_type,
                          aggregation_method=aggregation_method,
                          k_neighbors=k_neighbors,
                          rmin=rmin, rmax=rmax)
    
    return hybrid_model

### 1.6.3 Train the hybrid model and get predictions

In [None]:
from utils import load_hybrid, save_hybrid

hybrid = load_hybrid('artifacts/hybrid_model.joblib')
if hybrid is None:
    hybrid = fit_hybrid_linear(
        train_data=train_data,
        user_similarity_matrix=user_similarity_matrix,
        item_similarity_matrix=item_similarity_matrix,
        mf_model=mf,
        content_type='full',            # or 'title_genres' / 'description'
        aggregation_method='avg',       # or 'weighted_avg', 'avg_pos'
        k_neighbors=50,
        alpha=1.0,                      # TODO: tune the alpha value (do we need to tune it?)
        eval_holdout=True
    )
    save_hybrid(hybrid, 'artifacts/hybrid_model.joblib')

# Predict the rating for target user and target item
target_user, target_item = 1, 17
print("Hybrid prediction:", hybrid.predict(target_user, target_item))

# Batch prediction across arrays of user_ids and item_ids
pairs = pd.DataFrame({'user_id':[1,1,2], 'item_id':[17,25,42]})
hyb_scores = hybrid.batch_predict(pairs)
print(hyb_scores)

[Hybrid] Loaded from artifacts/hybrid_model.joblib
Hybrid prediction: 4.977880728509721
[4.97788073 3.77355201 3.28895723]


## 1.7 Evaluate Performance (Effectiveness)

### 1.7.1 Evaluate performance of the content-based predictor using RMSE

In [55]:
def evaluate_content_based_rating_prediction(train_data, test_data, content_type, aggregation_method):
    # Implement a function that first computes the representation of users and then predicts the rating for each user-item pair. Finally, call the implemented metrics to measure the error.
    # Hint: the reason for pre-computing the representation of all users is to speed up the experiments and to avoid too many unnecessary computations.
    # Note: Make sure to map the predicted score into [1,5] interval. The prediction from content-based model may not necessarily be in 5-star rating scale.

    # here we compute the representation of users
    users_emb = []
    users_dict = {} # Create a users dict for fast lookup
    users = list(train_data['user_id'].unique())

    for user in tqdm(users):
        user_emb = get_user_emb(train_data, user, content_type, aggregation_method)
        users_emb.append(user_emb)
        users_dict[user] = user_emb

    print('Computing the representation of users is done!')
 
    actual_ratings, pred_ratings = [], []
    
    ############# Your code here ############

    # Predict rating of each user-item pair in test data
    for _, test_row in test_data.iterrows():
        user_id = test_row['user_id']
        item_id = test_row['item_id']
        actual_rating = test_row['rating']

        item_emb = get_item_emb(item_id, content_type)
        user_emb = users_dict[user_id]
        pred_rating = np.dot(item_emb, user_emb)

        actual_ratings.append(actual_rating)
        pred_ratings.append(pred_rating)

    #########################################

    # Given predicted ratings, map them into [1,5] interval using -> 1 + (pred - min_val) * (4 / (max_val - min_val))
    ############# Your code here ############

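    pred_array = np.array(pred_ratings, dtype=float)
    min_val = pred_array.min()
    max_val = pred_array.max()

    if max_val > min_val:
        pred_mapped = 1 + (pred_array - min_val) * (4 / (max_val - min_val))
    else:
        # All predictions are identical: min-max scaling is undefined, so use the scale midpoint
        pred_mapped = np.full_like(pred_array, 3.0)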
    
    #########################################

    mae_value, mse_value, rmse_value = 0.0, 0.0, 0.0

    # compute the metrics: MAE, MSE, RMSE 
    ############# Your code here ############

    mae_value = MAE(actual_ratings, pred_mapped)
    mse_value = MSE(actual_ratings, pred_mapped)
    rmse_value = RMSE(actual_ratings, pred_mapped)
    
    #########################################

    return mae_value, mse_value, rmse_value


# In the hybrid model, content_type='full' and aggregation_method='avg' are used for the content-based predictor
print('Performance of content-based recommender for content type = full and aggregation method = avg:')
mae_value, mse_value, rmse_value = evaluate_content_based_rating_prediction(train_data, test_data, 'full', 'avg')
print('RMSE='+str(round(rmse_value,5)))


Performance of content-based recommender for content type = full and aggregation method = avg:


100%|██████████| 943/943 [00:00<00:00, 3420.10it/s]


Computing the representation of users is done!
RMSE=1.21568
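
The min-max mapping used above, `1 + (pred - min_val) * (4 / (max_val - min_val))`, sends the smallest raw score to 1 and the largest to 5. A short worked example on hypothetical dot-product scores:

```python
import numpy as np

# Hypothetical raw content-based scores (dot products), not taken from the data
raw_scores = np.array([-0.2, 0.1, 0.4, 1.0])
min_val, max_val = raw_scores.min(), raw_scores.max()

if max_val > min_val:
    # Linear rescaling: min_val -> 1, max_val -> 5
    mapped = 1 + (raw_scores - min_val) * (4 / (max_val - min_val))
else:
    # Degenerate case where every score is identical (assumed fallback: midpoint)
    mapped = np.full_like(raw_scores, 3.0)

print(mapped)
```

Since the minimum and maximum are taken over the predictions being evaluated, a single extreme score stretches or compresses the mapping for all other pairs.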


### 1.7.2 Evaluate performance of the user-based CF predictor using RMSE

In [52]:
def evaluate_user_cf_rating_prediction(train_data: pd.DataFrame, test_data: pd.DataFrame, user_similarity_matrix: pd.DataFrame, k=5) -> float:
    """
    Evaluate user-based CF using RMSE on a test set.
    
    Parameters:
    - train_data: pd.DataFrame with ['user_id', 'item_id', 'rating'] used for training
    - test_data: pd.DataFrame with ['user_id', 'item_id', 'rating'] used for testing
    - user_similarity_matrix: user-user similarity matrix (computed from train set)
    - k: number of neighbors
    
    Returns:
    - float: RMSE value
    """
    result = 0.0

    ############# Your code here ############

    # Compute RMSE
    sum_diff = 0.0
    for row in test_data.itertuples():
        user = row.user_id
        target_item = row.item_id
        pred = predict_rating_user_based(train_data, user_similarity_matrix, user, target_item, k)
        sum_diff += (row.rating - pred) ** 2
    result = np.sqrt(sum_diff / len(test_data))
    
    #########################################

    return result

k = 50
start_time = time.time()
print(f"RMSE of user-based CF for k={k} is {evaluate_user_cf_rating_prediction(train_data, test_data, user_similarity_matrix, k):.4f}")
end_time = time.time()
print(f'Running time: {end_time - start_time:.4f} seconds')

RMSE of user-based CF for k=50 is 1.1255
Running time: 121.9425 seconds


### 1.7.3 Evaluate performance of the item-based CF predictor using RMSE

In [54]:
def evaluate_item_cf_rating_prediction(train_data: pd.DataFrame, test_data: pd.DataFrame, item_similarity_matrix: pd.DataFrame, k=5) -> float:
    """
    Evaluate item-based CF using RMSE on a test set.
    
    Parameters:
    - train_data: pd.DataFrame with ['user_id', 'item_id', 'rating'] used for training
    - test_data: pd.DataFrame with ['user_id', 'item_id', 'rating'] used for testing
    - item_similarity_matrix: item-item similarity matrix (computed from train set)
    - k: number of neighbors
    
    Returns:
    - float: RMSE value
    """
    result = 0.0

    ############# Your code here ############

    # Compute RMSE
    sum_diff = 0.0
    for row in test_data.itertuples():
        user = row.user_id
        target_item = row.item_id
        pred = predict_rating_item_based(train_data, item_similarity_matrix, user, target_item, k)
        sum_diff += (row.rating - pred) ** 2
    result = np.sqrt(sum_diff / len(test_data))
    
    #########################################

    return result

k = 50
start_time = time.time()
print(f"RMSE of item-based CF for k={k} is {evaluate_item_cf_rating_prediction(train_data, test_data, item_similarity_matrix, k):.4f}")
end_time = time.time()
print(f'Running time: {end_time - start_time:.4f} seconds')

RMSE of item-based CF for k=50 is 2.9661
Running time: 108.1730 seconds
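
In `predict_rating_item_based`, the k most similar items are selected first and only then filtered to those the user has rated. When none of them was rated, the function returns 0.0, which lies outside the 1 to 5 scale and may contribute to the high RMSE above. Below is a minimal sketch of a common alternative that picks neighbors among the items the user actually rated and falls back to the user's mean. The function name `predict_item_cf_rated_neighbors` and the toy data are hypothetical, and we have not evaluated this variant on MovieLens100K.

```python
import pandas as pd

def predict_item_cf_rated_neighbors(train, item_sim, user, item, k=5):
    """Item-based CF over items the user has rated, with a mean-rating fallback."""
    user_ratings = train.loc[train["user_id"] == user].set_index("item_id")["rating"]
    # Fallback: user mean if the user is known, otherwise the global mean
    fallback = user_ratings.mean() if len(user_ratings) else train["rating"].mean()
    if item not in item_sim.columns:
        return float(fallback)

    # Candidate neighbors: items this user rated, excluding the target itself
    candidates = [i for i in user_ratings.index if i != item and i in item_sim.index]
    if not candidates:
        return float(fallback)

    # Keep the k candidates most similar to the target item
    sims = item_sim.loc[candidates, item].dropna().nlargest(k)
    denom = sims.abs().sum()
    if denom == 0:
        return float(fallback)
    # Similarity-weighted average of the user's ratings on those neighbors
    return float((sims * user_ratings.loc[sims.index]).sum() / denom)

# Toy data (hypothetical values)
toy_train = pd.DataFrame({"user_id": [1, 1, 2], "item_id": [2, 3, 1], "rating": [4, 2, 5]})
toy_sim = pd.DataFrame(
    [[1.0, 0.8, 0.2], [0.8, 1.0, 0.5], [0.2, 0.5, 1.0]],
    index=[1, 2, 3], columns=[1, 2, 3],
)
pred = predict_item_cf_rated_neighbors(toy_train, toy_sim, user=1, item=1, k=2)
print(round(pred, 4))
```

Restricting the candidates to rated items means k controls how many ratings enter the average, rather than how many similar items are inspected.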


### 1.7.4 Evaluate performance of the hybrid predictor using RMSE

In [51]:
def evaluate_hybrid_rating_prediction(hybrid, test_data: pd.DataFrame, use_batch: bool = True) -> float:
    """
    Evaluate the performance of a hybrid model using RMSE on the test data.
    """
    if test_data is None or len(test_data) == 0:
        return float("nan")

    if use_batch and hasattr(hybrid, "batch_predict"):
        pairs = test_data[['user_id', 'item_id']]
        preds = hybrid.batch_predict(pairs)
    else:
        preds = np.array(
            [hybrid.predict(u, i) for u, i in test_data[['user_id', 'item_id']].itertuples(index=False)],
            dtype=float
        )

    y_true = test_data['rating'].to_numpy(dtype=float)
    rmse = float(np.sqrt(np.mean((y_true - preds) ** 2)))
    return rmse

start_time = time.time()
rmse_hybrid = evaluate_hybrid_rating_prediction(hybrid, test_data, use_batch=True)
end_time = time.time()
print(f"RMSE of Hybrid model is {rmse_hybrid:.4f}")
print(f'Running time: {end_time - start_time:.4f} seconds')

RMSE of Hybrid model is 1.1159
Running time: 117.7133 seconds


# Task 2) Experiments for both rating prediction and ranking tasks, and conducting offline evaluation

# Task 3) Implement baselines for both rating prediction and ranking tasks, and perform experiments with those baselines

# Task 4) Analysis of recommendation models: analyzing the coefficients of the hybrid model and the success of recommendation models for different user groups

# Task 5) Evaluation of beyond accuracy