# DSAIT4335 Recommender Systems
# Assignment 2: Collaborative Filtering

In this assignment, you will work to build different recommendation models under Collaborative Filtering approach, including user-based and item-based neighborhood models, and matrix factorization. Then, you will apply these recommendation models on a public dataset. The dataset is **MovieLens100K**, a movie recommendation dataset collected by GroupLens: https://grouplens.org/datasets/movielens/100k/.

By the end of this assignment, you will:
1. Understand the fundamental principles of collaborative filtering approach
2. Implement user-based and item-based neighborhood methods
3. Develop recommendation generation and prediction with SLIM and MF models
4. Perform both rating prediction and top-k recommendation tasks
5. Evaluate collaborative filtering methods to understand their strengths/limitations

# Instruction

The MovieLens100K is already splitted into 80% training and 20% test sets. 

**Expected file structure** for this assignment:   
   
   ```
   Assignment2/
   ├── training.txt
   ├── test.txt
   └── hw2.ipynb
   ```

**Note:** Be sure to run all cells in each section sequentially, so that intermediate variables and packages are properly carried over to subsequent cells.

**Submission:** Answer all the questions in this jupyter-notebook file. Submit this jupyter-notebook file (your answers included) to Brightspace. Change the name of this jupyter-notebook file to your name: firstname-lastname.ipynb.

# Setup

Import necessary libraries/packages.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.sparse import csr_matrix
from scipy.spatial.distance import cosine, correlation
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.model_selection import train_test_split
import time, math
import warnings
import pickle
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

print("Libraries imported successfully!")

Libraries imported successfully!


# 1) MovieLens100K dataset

Load the data files: training and test sets.

In [2]:
# loading the training set and test set
columns_name=['user_id','item_id','rating','timestamp']
train_data = pd.read_csv('training.txt', sep='\t', names=columns_name)
test_data = pd.read_csv('test.txt', sep='\t', names=columns_name)

print(f'The training data:')
display(train_data[['user_id','item_id','rating']].head())
print(f'The shape of the training data: {train_data.shape}')
print('--------------------------------')
print(f'The test data:')
display(test_data[['user_id','item_id','rating']].head())
print(f'The shape of the test data: {test_data.shape}')
# print(test_data.shape)

The training data:


Unnamed: 0,user_id,item_id,rating
0,1,1,5
1,1,2,3
2,1,3,4
3,1,4,3
4,1,5,3


The shape of the training data: (80000, 4)
--------------------------------
The test data:


Unnamed: 0,user_id,item_id,rating
0,1,6,5
1,1,10,3
2,1,12,5
3,1,14,5
4,1,17,3


The shape of the test data: (20000, 4)


# 2) User-based collaborative filtering

### Question 1: Implement a function that computes the Pearson Correlation between two users. 

The **Pearson correlation coefficient** between two users \(x\) and \(y\) is defined as:

$$
r_{xy} = \frac{\sum_{i \in I_{xy}} (x_i - \bar{x})(y_i - \bar{y})}
              {\sqrt{\sum_{i \in I_{xy}} (x_i - \bar{x})^2} \cdot \sqrt{\sum_{i \in I_{xy}} (y_i - \bar{y})^2}}
$$

**Where:**

- $I_{xy}$ = set of items rated by both users  
- $x_i$, $y_i$ = ratings of users \(x\) and \(y\) on item \(i\)  
- $\bar{x}$, $\bar{y}$ = mean ratings of users \(x\) and \(y\) on the common items  


In [69]:
def extract_ratings(user_id: int):
    return train_data[train_data['user_id'] == user_id].set_index('item_id')['rating']

def extract_ratings_item(user_id: int, item_id: int):
    result = train_data[(train_data['user_id'] == user_id) & (train_data['item_id'] == item_id)]['rating']
    if len(result) > 0:
        return result.iloc[0]
    else:
        return np.nan

In [None]:
def pearson_correlation(user1_ratings: pd.Series, user2_ratings: pd.Series) -> float:
    """
    Compute Pearson correlation coefficient between two users' rating vectors.
    
    user1_ratings, user2_ratings: Pandas Series indexed by item IDs. They may contain NaN for unrated items.
    Returns: float (correlation between -1 and 1). Returns 0 if not enough data.
    """
    result = 0.0
    
    ############# Your code here ############
    '''
        Data format:
        ratings = [ratings1, ratings2, ...] -> idx = item_id
    '''
    # Find common items rated by both users
    common_item_indexs = user1_ratings.index.intersection(user2_ratings.index)
    
    # Remove items with NaN values
    valid_indices = []
    for idx in common_item_indexs:
        if not (pd.isna(user1_ratings[idx]) or pd.isna(user2_ratings[idx])):
            # print(f"Item {idx} rated by both users: User1 rating = {user1_ratings[idx]}, User2 rating = {user2_ratings[idx]}")
            valid_indices.append(idx)
    
    common_item_indexs = pd.Index(valid_indices)
            
    # Check if they have common ratings
    # print(f"Common items rated by both users: {common_item_indexs.tolist()}")  # Comment out for matrix computation
    if len(common_item_indexs) <= 1:  # Need at least 2 items for correlation
        return 0.0
            
    user1_pure = user1_ratings[common_item_indexs]
    user2_pure = user2_ratings[common_item_indexs]

    user1_mean = np.mean(user1_pure)
    user2_mean = np.mean(user2_pure)
    
    user1_diff = (user1_pure - user1_mean)
    user2_diff = (user2_pure - user2_mean)

    nominator = np.sum(user1_diff * user2_diff)
    denominator_sqrt = np.sqrt(np.sum(user1_diff * user1_diff) * np.sum(user2_diff * user2_diff))
    
    # Check for zero denominator (constant ratings)
    if denominator_sqrt == 0:
        return 0.0
    #########################################
    
    return nominator / denominator_sqrt

user1, user2 = 1, 2
# user1_ratings = train_data[train_data['user_id'] == user1].set_index('item_id')['rating']
# user2_ratings = train_data[train_data['user_id'] == user2].set_index('item_id')['rating']

user1_ratings = extract_ratings(user1)
user2_ratings = extract_ratings(user2)
print(f"Pearson Correlation between users {user1} and {user2} is {pearson_correlation(user1_ratings, user2_ratings):.4f}")

Pearson Correlation between users 1 and 2 is 0.2697


### Question 2: What is the similarity value between users <8,9>. Discuss your observation.

In [62]:
user1, user2 = 9, 8
sim = 0.0

############# Your code here ############
user1_ratings = extract_ratings(user1)
user2_ratings = extract_ratings(user2)

sim = pearson_correlation(user1_ratings, user2_ratings)
#########################################


print(f"Pearson Correlation between users {user1} and {user2} is {sim:.4f}")

Pearson Correlation between users 9 and 8 is 0.0000


### Question 3: What is the similarity value between users <2,3>? Discuss your observation.

In [63]:
user1, user2 = 2, 3
sim = 0.0

############# Your code here ############
user1_ratings = extract_ratings(user1)
user2_ratings = extract_ratings(user2)

sim = pearson_correlation(user1_ratings, user2_ratings)

#########################################

print(f"Pearson Correlation between users {user1} and {user2} is {sim:.4f}")

Pearson Correlation between users 2 and 3 is 0.0000


### Note 1-3

For Question 2, there is no common ratings, so the correlation should directly return to 0.0;

For Question 3, user 2 has constant ratings, which results in the denominator being 0, thus will cause NaN error.

### Question 4: Create the user-user similarity matrix.

In [60]:
def compute_user_similarity_matrix(train_data: pd.DataFrame) -> pd.DataFrame:
    """
    Compute user-user similarity matrix using Pearson correlation.
    
    Parameters:
    - ratings: pd.DataFrame with columns ['user_id', 'item_id', 'rating']
    
    Returns:
    - pd.DataFrame: user-user similarity matrix (rows & cols = user_ids)
    """
    users = train_data['user_id'].unique()
    user_similarity_matrix = pd.DataFrame(np.zeros((len(users), len(users))), index=users, columns=users)
    
    ############# Your code here ############
    user_ratings = {user: train_data[train_data['user_id'] == user].set_index('item_id')['rating'] for user in users}
    
    for i, user1 in enumerate(users):
        for j, user2 in enumerate(users):
            if i <= j:  # Only compute upper triangle + diagonal to avoid duplicates
                if i == j:
                    # Self-similarity is always 1
                    user_similarity_matrix.loc[user1, user2] = 1.0
                else:
                    # Compute similarity between different users
                    val = pearson_correlation(user_ratings[user1], user_ratings[user2])
                    user_similarity_matrix.loc[user1, user2] = val
                    user_similarity_matrix.loc[user2, user1] = val
    #########################################
    
    return user_similarity_matrix


sim_matrix_path = "user_similarity_matrix.pkl"
if os.path.exists(sim_matrix_path):
    with open(sim_matrix_path, "rb") as f:
        user_similarity_matrix = pickle.load(f)
    print("Loaded user_similarity_matrix from file.")
else:
    start_time = time.time()
    print(f'Similarity matrix creation started! This may take around 5-10 minutes...')
    user_similarity_matrix = compute_user_similarity_matrix(train_data)
    end_time = time.time()
    print(f'Running time: {end_time - start_time:.4f} seconds')
    
    with open(sim_matrix_path, "wb") as f:
        pickle.dump(user_similarity_matrix, f)
    print("Computed and saved user_similarity_matrix.")

Loaded user_similarity_matrix from file.


### Question 5: Implement a function that returns k most similar users along with the similarity values to a target user.

In [None]:
def get_k_user_neighbors(user_similarity_matrix: pd.DataFrame, target_user, k=5):
    """
    Retrieve top-k most similar users to the target user.

    Parameters:
    - user_similarity_matrix: pd.DataFrame, user-user similarity values (indexed by user IDs)
    - target_user: user ID for whom we want neighbors
    - k: number of neighbors to retrieve

    Returns:
    - List of tuples: [(neighbor_user_id, similarity), ...] sorted by similarity descending
    """
    top_k_neighbors = []
    
    ############# Your code here ############
    target_row = user_similarity_matrix.loc[target_user]
    # print(target_row)
    target_row.drop(target_user) # Remove itself
    sorted_target_row = target_row.sort_values(ascending = False)
    k = min(k, len(sorted_target_row))
    for i in range(k):
        top_k_neighbors.append((sorted_target_row.index[i], sorted_target_row.iloc[i]))
    
    #########################################
    
    return top_k_neighbors

target_user, k = 1, 10
print(f"Neighbors of user {target_user} are:")
get_k_user_neighbors(user_similarity_matrix, target_user, k)

Neighbors of user 1 are:


[(1, 1.0),
 (238, 1.0),
 (616, 1.0),
 (29, 1.0),
 (229, 1.0),
 (898, 1.0),
 (656, 1.0),
 (724, 1.0),
 (732, 1.0),
 (289, 1.0)]

### Question 6: Implement a function that predicts the rating for a target user might give to a target item using user-user similarity matrix and the following equation.

The **predicted rating** for a target user \(u\) on item \(i\) using mean-centered user-based collaborative filtering is:

$$
\hat{r}_{u,i} = \bar{r}_u + \frac{\sum_{v \in N(u)} s(u,v) \cdot (r_{v,i} - \bar{r}_v)}
                             {\sum_{v \in N(u)} |s(u,v)|}
$$

Where:

- $\hat{r}_{u,i}$ = predicted rating for user \(u\) on item \(i\)  
- $\bar{r}_u$ = mean rating of the target user \(u\)  
- $N(u)$ = set of top-\(k\) neighbors of user \(u\) who have rated item \(i\)  
- $s(u,v)$ = similarity between users \(u\) and \(v\)  
- $r_{v,i}$ = rating of neighbor \(v\) on item \(i\)  
- $\bar{r}_v$ = mean rating of neighbor \(v\)


In [None]:
def predict_rating_user_based(train_data: pd.DataFrame, user_similarity_matrix: pd.DataFrame, target_user, target_item, k=5):
    """
    Predict rating for target_user and target_item using mean-centered user-based CF.

    Parameters:
    - ratings: pd.DataFrame with columns ['user_id', 'item_id', 'rating']
    - user_similarity_matrix: pd.DataFrame of user-user similarities
    - target_user: user ID
    - target_item: item ID
    - k: number of neighbors to consider

    Returns:
    - float: predicted rating, or np.nan if not possible
    """
    result = 0.0
    
    ############# Your code here ############
    
    target_user_ratings = train_data[train_data['user_id'] == target_user]['rating']
    r_user_mean = np.mean(target_user_ratings)
    
    
    neighbors = get_k_user_neighbors(user_similarity_matrix, target_user, k)
    valid_neighbors = []
    for neighbor_id, similarity in neighbors:
        neighbor_rating = extract_ratings_item(neighbor_id, target_item)
        if not pd.isna(neighbor_rating):
            valid_neighbors.append((neighbor_id, similarity))
    
    if len(valid_neighbors) == 0:
        return np.nan
    
    
    nominator = 0.0
    denominator = 0.0
    for neighbor_id, similarity in valid_neighbors:
        r_vi = extract_ratings_item(neighbor_id, target_item)
        
        # Get neighbor's mean rating
        neighbor_ratings = train_data[train_data['user_id'] == neighbor_id]['rating']
        r_v_mean = np.mean(neighbor_ratings)
        
        nominator += similarity * (r_v_mean - r_vi)
        denominator += abs(similarity)
        
    result = r_user_mean + nominator / denominator if denominator != 0.0 else np.nan
    #########################################

    return result

target_user, target_item, k = 1, 17, 50
print(f"The actual rating for user {target_user} and item {target_item} is 3. The predicted rating by user-based CF for user {target_user} and item {target_item} is {predict_rating_user_based(train_data, user_similarity_matrix, target_user, target_item, k):.4f}")

The actual rating for user 1 and item 17 is 3. The predicted rating by user-based CF for user 1 and item 17 is 2.4157


### Question 7: Implement a function that generates top-10 recommendation list for a target user using user-based CF method.

In [None]:
def recommend_topk_user_based(train_data, user_similarity_matrix, target_user, k=5):
    """
    Generate Top-K recommendations for a target user using User-based CF.
    
    Args:
        train_data (pd.DataFrame): ratings data with columns [user_id, item_id, rating]
        user_similarity_matrix (pd.DataFrame): precomputed user-user similarity matrix
        target_user (int): user_id of the target user
        k (int): number of most similar neighbors to consider
    
    Returns:
        list of (item_id, predicted_score) sorted by score desc
    """
    result = []
    
    ############# Your code here ############
    items = train_data['item_id'].unique()
    pred_ratings = []
    for item in items:
        pred_rating = predict_rating_user_based(train_data, user_similarity_matrix, target_user, item, k)
        if not pd.isna(pred_rating):
            pred_ratings.append((item, pred_rating))
            
    pred_ratings.sort(key=lambda x: x[1], reverse=True)
    result = pred_ratings[:10] 
    #########################################
    
    return result

start_time = time.time()
target_user, k = 1, 30
recommendations = recommend_topk_user_based(train_data, user_similarity_matrix, target_user, k)
print(f"Top-10 recommendations for user {target_user}:")
for item, score in recommendations:
    print(f"Item {item}: {score:.4f}")
end_time = time.time()
print(f'Running time: {end_time - start_time:.4f} seconds')

Top-10 recommendations for user 1:
Item 8: 6.3630
Item 29: 6.3630
Item 35: 6.3630
Item 110: 6.3630
Item 131: 6.3630
Item 138: 6.3630
Item 231: 6.3630
Item 247: 6.3630
Item 263: 6.3630
Item 353: 6.2815
Running time: 11.2233 seconds


# 3) Item-based collaborative filtering

### Question 8: Implement a function that computes the Cosine similarity between two items. 

The **cosine similarity** between two items \(i\) and \(j\) is defined as:

$$
\text{sim}(i,j) = \frac{\sum_{u \in U_{ij}} r_{u,i} \cdot r_{u,j}}
                      {\sqrt{\sum_{u \in U_{ij}} r_{u,i}^2} \cdot \sqrt{\sum_{u \in U_{ij}} r_{u,j}^2}}
$$

Where:

- \(r_{u,i}\) = rating of user \(u\) on item \(i\)  
- \(r_{u,j}\) = rating of user \(u\) on item \(j\)  
- \(U_{ij}\) = set of users who have rated both items \(i\) and \(j\)  


In [None]:
def cosine_similarity(item1_ratings: pd.Series, item2_ratings: pd.Series) -> float:
    """
    Compute cosine similarity between two items' rating vectors.
    Only common users are considered.
    
    Parameters:
    - item1_ratings, item2_ratings: pd.Series indexed by user_id
    
    Returns:
    - float: cosine similarity between -1 and 1
    """
    result = 0.0
    
    ############# Your code here ############
    
    #########################################

    return result

item1, item2 = 1, 2
item1_ratings = train_data[train_data['item_id'] == item1].set_index('user_id')['rating']
item2_ratings = train_data[train_data['item_id'] == item2].set_index('user_id')['rating']
print(f"Cosine similarity between items {item1} and {item2} is {cosine_similarity(item1_ratings, item2_ratings):.4f}")

### Question 9: Create the item-item similarity matrix.

In [None]:
def compute_item_similarity_matrix(train_data: pd.DataFrame) -> pd.DataFrame:
    """
    Compute item-item similarity matrix using cosine similarity.
    
    Parameters:
    - ratings: pd.DataFrame with columns ['user_id', 'item_id', 'rating']
    
    Returns:
    - pd.DataFrame: item-item similarity matrix (rows & cols = item_ids)
    """
    items = train_data['item_id'].unique()
    item_similarity_matrix = pd.DataFrame(np.zeros((len(items), len(items))), index=items, columns=items)
    
    ############# Your code here ############
    
    #########################################
    
    return item_similarity_matrix

start_time = time.time()
item_similarity_matrix = compute_item_similarity_matrix(train_data)  
end_time = time.time()
print(f'Running time: {end_time - start_time:.4f} seconds')

### Question 10: Implement a function that returns k most similar item along with the similarity values to a target item.

In [None]:
def get_k_item_neighbors(item_similarity_matrix: pd.DataFrame, target_item, k=5):
    """
    Retrieve top-k most similar items to the target item.
    
    Parameters:
    - item_similarity_matrix: pd.DataFrame, item-item similarity
    - target_item: item ID
    - k: number of neighbors
    
    Returns:
    - List of tuples: [(neighbor_item_id, similarity), ...]
    """
    top_k_neighbors = []

    ############# Your code here ############
    
    #########################################
    
    return top_k_neighbors

target_item, k = 1, 10
print(f"Neighbors of item {target_item} are:")
get_k_item_neighbors(item_similarity_matrix, target_item, k)

### Question 11: Implement a function that predicts the rating for a target user might give to a target item using item-item similarity matrix and the following equation.

The **predicted rating** for a target user \(u\) on a target item \(i\) using item-based collaborative filtering is:

$$
\hat{r}_{u,i} = \frac{\sum_{j \in N(i)} s(i,j) \cdot r_{u,j}}{\sum_{j \in N(i)} |s(i,j)|}
$$

Where:

- \(\hat{r}_{u,i}\) = predicted rating of user \(u\) on item \(i\)  
- \(N(i)\) = set of top-\(k\) most similar items to item \(i\) that user \(u\) has rated  
- \(s(i,j)\) = similarity between item \(i\) and item \(j\)  
- \(r_{u,j}\) = rating of user \(u\) on item \(j\)  


In [None]:
def predict_rating_item_based(train_data: pd.DataFrame, item_similarity_matrix: pd.DataFrame, target_user, target_item, k=5):
    """
    Predict rating using item-based CF (non-mean centric).
    
    Parameters:
    - ratings: pd.DataFrame ['user_id', 'item_id', 'rating']
    - item_similarity_matrix: item-item similarity DataFrame
    - target_user: user ID
    - target_item: item ID
    - k: number of neighbors to use
    
    Returns:
    - float: predicted rating, or np.nan if not enough data
    """
    result = 0.0

    ############# Your code here ############
    
    #########################################
    
    return result

target_user, target_item, k = 1, 17, 50
print(f"The actual rating for user {target_user} and item {target_item} is 3. The predicted rating by item-based CF for user {target_user} and item {target_item} is {predict_rating_item_based(train_data, item_similarity_matrix, target_user, target_item, k):.4f}")

### Question 12: Implement a function that generates top-10 recommendation list for a target user using item-based CF method.

In [None]:
def recommend_topk_item_based(train_data, item_similarity_matrix, target_user, k=5):
    """
    Generate Top-K recommendations for a target user using Item-based CF.
    
    Args:
        train_data (pd.DataFrame): ratings data with columns [user_id, item_id, rating]
        item_similarity_matrix (pd.DataFrame): precomputed item-item similarity matrix
        target_user (int): user_id of the target user
        k (int): number of items to recommend
    
    Returns:
        list of (item_id, predicted_score) sorted by score desc
    """
    result = []

    ############# Your code here ############
    
    #########################################
    
    return result

target_user, k = 1, 50
recommendations = recommend_topk_item_based(train_data, item_similarity_matrix, target_user, k)
print(f"Top-10 recommendations for user {target_user}:")
for item, score in recommendations:
    print(f"Item {item}: {score:.4f}")

# 3) Matrix Factorization (MF)

For details about matrix factorization algorithm see the lecture in week 3.

Matrix factorization algorithm is implemented in the next cell. Run the following cell and then use it to build MF model for running experiments.

In [217]:
class MatrixFactorizationSGD:
    """
    Matrix Factorization for rating prediction using Stochastic Gradient Descent (SGD).
    
    Rating matrix R ≈ P × Q^T + biases
    """
    def __init__(self, n_factors=20, learning_rate=0.01, regularization=0.02, n_epochs=20, use_bias=True):
        self.n_factors = n_factors
        self.learning_rate = learning_rate
        self.regularization = regularization
        self.n_epochs = n_epochs
        self.use_bias = use_bias

        # Model parameters
        self.P = None  # User latent factors
        self.Q = None  # Item latent factors
        self.user_bias = None
        self.item_bias = None
        self.global_mean = None

    def fit(self, ratings, verbose=True):
        """
        Train the model.
        
        Args:
            ratings (pd.DataFrame): dataframe with [user_id, item_id, rating]
        """
        # Map IDs to indices
        self.user_mapping = {u: i for i, u in enumerate(ratings['user_id'].unique())}
        self.item_mapping = {i: j for j, i in enumerate(ratings['item_id'].unique())}
        self.user_inv = {i: u for u, i in self.user_mapping.items()}
        self.item_inv = {j: i for i, j in self.item_mapping.items()}

        n_users = len(self.user_mapping)
        n_items = len(self.item_mapping)

        # Initialize factors
        self.P = np.random.normal(0, 0.1, (n_users, self.n_factors))
        self.Q = np.random.normal(0, 0.1, (n_items, self.n_factors))

        if self.use_bias:
            self.user_bias = np.zeros(n_users)
            self.item_bias = np.zeros(n_items)
            self.global_mean = ratings['rating'].mean()

        # Convert to (user_idx, item_idx, rating) triples
        training_data = [(self.user_mapping[u], self.item_mapping[i], r)
                         for u, i, r in zip(ratings['user_id'], ratings['item_id'], ratings['rating'])]

        # SGD loop
        for epoch in range(self.n_epochs):
            np.random.shuffle(training_data)
            total_error = 0

            for u, i, r in training_data:
                pred = np.dot(self.P[u], self.Q[i])
                if self.use_bias:
                    pred += self.global_mean + self.user_bias[u] + self.item_bias[i]

                err = r - pred
                total_error += err ** 2

                # Updates
                P_u = self.P[u]
                Q_i = self.Q[i]

                self.P[u] += self.learning_rate * (err * Q_i - self.regularization * P_u)
                self.Q[i] += self.learning_rate * (err * P_u - self.regularization * Q_i)

                if self.use_bias:
                    self.user_bias[u] += self.learning_rate * (err - self.regularization * self.user_bias[u])
                    self.item_bias[i] += self.learning_rate * (err - self.regularization * self.item_bias[i])

            rmse = np.sqrt(total_error / len(training_data))
            if verbose:
                print(f"Epoch {epoch+1}/{self.n_epochs} - RMSE: {rmse:.4f}")

        return self

    def predict_single(self, user_id, item_id):
        """Predict rating for a single (user, item) pair"""
        if user_id not in self.user_mapping or item_id not in self.item_mapping:
            return np.nan

        u = self.user_mapping[user_id]
        i = self.item_mapping[item_id]

        pred = np.dot(self.P[u], self.Q[i])
        if self.use_bias:
            pred += self.global_mean + self.user_bias[u] + self.item_bias[i]
        return pred

    def predict(self, test_data):
        """Predict ratings for a test dataframe with [user_id, item_id]"""
        preds = []
        for u, i in zip(test_data['user_id'], test_data['item_id']):
            preds.append(self.predict_single(u, i))
        return np.array(preds)

    def recommend_topk(self, user_id, train_data, n=10, exclude_seen=True):
        """
        Generate Top-K recommendations for a given user.

        Args:
            user_id (int): target user ID (original ID, not index).
            train_data (pd.DataFrame): training ratings [user_id, item_id, rating],
                                       used to exclude already-seen items.
            k (int): number of recommendations.
            exclude_seen (bool): whether to exclude items the user already rated.

        Returns:
            list of (item_id, predicted_score) sorted by score desc.
        """
        if user_id not in self.user_mapping:
            return []

        u = self.user_mapping[user_id]

        # Predict scores for all items
        scores = np.dot(self.P[u], self.Q.T)
        if self.use_bias:
            scores += self.global_mean + self.user_bias[u] + self.item_bias

        # Exclude seen items
        if exclude_seen:
            seen_items = train_data[train_data['user_id'] == user_id]['item_id'].values
            seen_idx = [self.item_mapping[i] for i in seen_items if i in self.item_mapping]
            scores[seen_idx] = -np.inf

        # Get top-K items
        top_idx = np.argsort(scores)[::-1][:n]
        top_items = [self.item_inv[i] for i in top_idx]
        top_scores = scores[top_idx]

        return list(zip(top_items, top_scores))

#### Sample usage for rating prediction task:

In [221]:
# Train model 
# Parameters: n_factors refers to embedding size, n_epochs refers to number of epochs, learning_rate refers to learning rate, and regularization refers to lambda hyperparameter controlling the effect of regularization terms
mf = MatrixFactorizationSGD(n_factors=50, n_epochs=10, learning_rate=0.001, regularization=0.001)
mf.fit(train_data, verbose=True)

# predict the rating for a target user and a target item
target_user = 1
target_item = 17
actual_rating = 3
pred_rating = mf.predict_single(target_user, target_item)
print(f"The actual rating for user {target_user} and item {target_item} is 3. The predicted rating by MF for user {target_user} and item {target_item} is {pred_rating:.4f}")

Epoch 1/10 - RMSE: 1.0961
Epoch 2/10 - RMSE: 1.0569
Epoch 3/10 - RMSE: 1.0308
Epoch 4/10 - RMSE: 1.0121
Epoch 5/10 - RMSE: 0.9980
Epoch 6/10 - RMSE: 0.9869
Epoch 7/10 - RMSE: 0.9779
Epoch 8/10 - RMSE: 0.9704
Epoch 9/10 - RMSE: 0.9640
Epoch 10/10 - RMSE: 0.9585
The actual rating for user 1 and item 17 is 3. The predicted rating by MF for user 1 and item 17 is 3.2838


#### Sample usage for ranking task:

In [222]:
# Train model 
# Parameters: n_factors refers to embedding size, n_epochs refers to number of epochs, learning_rate refers to learning rate, and regularization refers to lambda hyperparameter controlling the effect of regularization terms
mf = MatrixFactorizationSGD(n_factors=10, n_epochs=10, learning_rate=0.001, regularization=0.001)
mf.fit(train_data, verbose=True)

# Get top-10 recommendations for user 1
recommendations = mf.recommend_topk(user_id=1, train_data=train_data, n=10)
print("Top-10 Recommendations for user 1:")
for item, score in recommendations:
    print(f"Item {item}: {score:.4f}")

Epoch 1/10 - RMSE: 1.0943
Epoch 2/10 - RMSE: 1.0558
Epoch 3/10 - RMSE: 1.0305
Epoch 4/10 - RMSE: 1.0125
Epoch 5/10 - RMSE: 0.9991
Epoch 6/10 - RMSE: 0.9888
Epoch 7/10 - RMSE: 0.9805
Epoch 8/10 - RMSE: 0.9737
Epoch 9/10 - RMSE: 0.9680
Epoch 10/10 - RMSE: 0.9632
Top-5 Recommendations for user 1:
Item 318: 4.4215
Item 64: 4.3744
Item 483: 4.3730
Item 12: 4.3684
Item 98: 4.3340
Item 174: 4.2826
Item 313: 4.2805
Item 357: 4.2692
Item 603: 4.2425
Item 427: 4.2256


# 4) Evaluation of user-based, item-based, and MF models for rating prediction task

### Question 13: What is the performance of user-based CF in terms of RMSE for k=50?

In [None]:
def evaluate_user_cf_rating_prediction(train_data: pd.DataFrame, test_data: pd.DataFrame, user_similarity_matrix: pd.DataFrame, k=5) -> float:
    """
    Evaluate user-based CF using RMSE on a test set.
    
    Parameters:
    - train_data: pd.DataFrame with ['user_id', 'item_id', 'rating'] used for training
    - test_data: pd.DataFrame with ['user_id', 'item_id', 'rating'] used for testing
    - user_similarity_matrix: user-user similarity matrix (computed from train set)
    - k: number of neighbors
    
    Returns:
    - float: RMSE value
    """
    result = 0.0

    ############# Your code here ############
    
    #########################################

    return result

k = 50
start_time = time.time()
print(f"RMSE of user-based CF for k={k} is {evaluate_user_cf_rating_prediction(train_data, test_data, user_similarity_matrix, k):.4f}")
end_time = time.time()
print(f'Running time: {end_time - start_time:.4f} seconds')

### Question 14: Can you further improve the performance of user-based CF? Tune user-based CF with k={10, 30, 50, 70, 100} and visualize the performance with k values in x-axis and corresponding RMSE values in y-axis.

In [None]:
K = [10,30,50,70,100]
user_based_RMSEs = []

start_time = time.time()

############# Your code here ############
    
#########################################

end_time = time.time()
print(f'Running time: {end_time - start_time:.4f} seconds')

Discuss your observations.

### Question 15: What is the performance of item-based CF in terms of RMSE for k=50?

In [None]:
def evaluate_item_cf_rating_prediction(train_data: pd.DataFrame, test_data: pd.DataFrame, item_similarity_matrix: pd.DataFrame, k=5) -> float:
    """
    Evaluate item-based CF using RMSE on a test set.
    
    Parameters:
    - train_data: pd.DataFrame with ['user_id', 'item_id', 'rating'] used for training
    - test_data: pd.DataFrame with ['user_id', 'item_id', 'rating'] used for testing
    - item_similarity_matrix: item-item similarity matrix (computed from train set)
    - k: number of neighbors
    
    Returns:
    - float: RMSE value
    """
    result = 0.0

    ############# Your code here ############
    
    #########################################

    return result

k = 50
start_time = time.time()
print(f"RMSE of user-based CF for k={k} is {evaluate_item_cf_rating_prediction(train_data, test_data, item_similarity_matrix, k):.4f}")
end_time = time.time()
print(f'Running time: {end_time - start_time:.4f} seconds')

### Question 16: Can you further improve the performance of item-based CF? Tune item-based CF with k={10, 30, 50, 70, 100} and visualize the performance with k values in x-axis and corresponding RMSE values in y-axis.

In [None]:
K = [10,30,50,70,100]
item_based_RMSEs = []

start_time = time.time()

############# Your code here ############
    
#########################################

end_time = time.time()
print(f'Running time: {end_time - start_time:.4f} seconds')

Discuss your observations.

### Question 17: What is the performance of MF model in terms of RMSE for the following hyperparameter values?

    n_factors=50

    n_epochs=30

    learning_rate=0.001

    regularization=0.001

In [None]:
def evaluate_mf_rating_prediction(train_data: pd.DataFrame, test_data: pd.DataFrame, n_factors: float, n_epochs: float, learning_rate: float) -> float:
    """
    Evaluate MF using RMSE on a test set.
    
    Parameters:
    - train_data: pd.DataFrame with ['user_id', 'item_id', 'rating'] used for training
    - test_data: pd.DataFrame with ['user_id', 'item_id', 'rating'] used for testing
    - n_factors: embedding size
    - n_epochs: number of training epochs
    - learning_rate: Learning rate
    
    Returns:
    - float: RMSE value
    """
    # Hint: set the regularization hyperparameter to 0.001

    result = 0.0

    ############# Your code here ############
    
    #########################################
    
    return result

n_factors, n_epochs, learning_rate = 50, 30, 0.001
start_time = time.time()
print(f"RMSE of MF for n_factors={n_factors}, n_epochs={n_epochs}, learning_rate={learning_rate} is {evaluate_mf_rating_prediction(train_data, test_data, n_factors, n_epochs, learning_rate):.4f}")
end_time = time.time()
print(f'Running time: {end_time - start_time:.4f} seconds')

### Question 18: Tune MF model with n_factors={10, 30, 50, 70, 90} and the rest of hyperparameters as n_epochs=30, learning_rate=0.001. Visualize the performance with n_factors in x-axis and corresponding RMSE values in y-axis.

In [None]:
n_factors = [10,30,50,70,90]
n_epochs, learning_rate = 30, 0.001
mf_RMSEs = []

start_time = time.time()

############# Your code here ############
    
#########################################

end_time = time.time()
print(f'Running time: {end_time - start_time:.4f} seconds')

Discuss your observation.

### Question 19: Tune MF model with n_epochs={10, 30, 50, 70, 90} and the rest of hyperparameters as n_factors=10, learning_rate=0.001. Visualize the performance with n_factors in x-axis and corresponding RMSE values in y-axis.

In [None]:
n_epochs = [10,30,50,70,90]
n_factors, learning_rate = 10, 0.001
mf_RMSEs = []

start_time = time.time()

############# Your code here ############
    
#########################################

end_time = time.time()
print(f'Running time: {end_time - start_time:.4f} seconds')

Discuss your observation.

### Question 20: Tune MF model with learning_rate={0.001, 0.005, 0.01, 0.05} and the rest of hyperparameters as n_factors=10, n_epochs=90. Visualize the performance with n_factors in x-axis and corresponding RMSE values in y-axis.

In [None]:
learning_rate = [0.001,0.005,0.01,0.05]
n_factors, n_epochs = 10, 90
mf_RMSEs = []

start_time = time.time()

############# Your code here ############
    
#########################################

end_time = time.time()
print(f'Running time: {end_time - start_time:.4f} seconds')

Discuss your observation.

# 6) Evaluation of user-based, item-based, and MF models for ranking task

### Question 21: What is the performance of user-based CF in terms of nDCG for k=30?

In [None]:
def evaluate_user_cf_ranking(train_data, test_data, user_similarity_matrix, k=5):
    """
    Evaluate User-based CF ranking performance using NDCG@K.
    
    Args:
        train_data (pd.DataFrame): training ratings [user_id, item_id, rating]
        test_data (pd.DataFrame): test ratings [user_id, item_id, rating]
        user_similarity_matrix (pd.DataFrame): precomputed user-user similarity matrix
        k (int): number of neighbors for prediction
    
    Returns:
        float: average NDCG@K across test users
    """
    result = 0.0

    ############# Your code here ############
    
    #########################################
    
    return result

k = 30
start_time = time.time()
print(f"NDCG of user-based CF for k={k} is {evaluate_user_cf_ranking(train_data, test_data, user_similarity_matrix, k):.4f}")
end_time = time.time()
print(f'Running time: {end_time - start_time:.4f} seconds')

### Question 22: What is the performance of item-based CF in terms of nDCG for k=30?

In [None]:
def evaluate_item_cf_ranking(train_data, test_data, item_similarity_matrix, k=5):
    """
    Evaluate Item-based CF ranking performance using NDCG@K.
    
    Args:
        train_data (pd.DataFrame): training ratings [user_id, item_id, rating]
        test_data (pd.DataFrame): test ratings [user_id, item_id, rating]
        item_similarity_matrix (pd.DataFrame): precomputed item-item similarity matrix
        k (int): number of neighbors for prediction
    
    Returns:
        float: average NDCG@K across test users
    """
    result = 0.0

    ############# Your code here ############
    
    #########################################
    
    return result

k = 30
start_time = time.time()
print(f"NDCG of item-based CF for k={k} is {evaluate_item_cf_ranking(train_data, test_data, item_similarity_matrix, k):.4f}")
end_time = time.time()
print(f'Running time: {end_time - start_time:.4f} seconds')

### Question 23: What is the performance of MF model in terms of NDCG for the following hyperparameter values?

    n_factors=50

    n_epochs=30

    learning_rate=0.001

    regularization=0.001

In [None]:
def evaluate_mf_ranking(train_data: pd.DataFrame, test_data: pd.DataFrame, n_factors: float, n_epochs: float, learning_rate: float) -> float:
    """
    Evaluate MF in terms of NDCG.
    
    Parameters:
    - train_data: pd.DataFrame with ['user_id', 'item_id', 'rating'] used for training
    - test_data: pd.DataFrame with ['user_id', 'item_id', 'rating'] used for testing
    - n_factors: embedding size
    - n_epochs: number of training epochs
    - learning_rate: Learning rate
    
    Returns:
    - float: NDCG value
    """
    # Hint: set the regularization hyperparameter to 0.001

    result = 0.0

    ############# Your code here ############
    
    #########################################
    
    return result

n_factors, n_epochs, learning_rate = 50, 30, 0.001
start_time = time.time()
print(f"NDCG of MF for n_factors={n_factors}, n_epochs={n_epochs}, learning_rate={learning_rate} is {evaluate_mf_ranking(train_data, test_data, n_factors, n_epochs, learning_rate):.4f}")
end_time = time.time()
print(f'Running time: {end_time - start_time:.4f} seconds')

### Question 24: Tune MF model with n_factors={10, 30, 50, 70, 90} and the rest of hyperparameters as n_epochs=30, learning_rate=0.001. Visualize the performance with n_factors in x-axis and corresponding NDCG values in y-axis.

In [None]:
n_factors = [10,30,50,70,90]
n_epochs, learning_rate = 30, 0.001
mf_NDCGs = []

start_time = time.time()

############# Your code here ############
    
#########################################

end_time = time.time()
print(f'Running time: {end_time - start_time:.4f} seconds')

### Question 25: Tune MF model with n_epochs={10, 30, 50, 70, 90} and the rest of hyperparameters as n_factors=30, learning_rate=0.001. Visualize the performance with n_factors in x-axis and corresponding NDCG values in y-axis.

In [None]:
n_epochs = [10,30,50,70,90]
n_factors, learning_rate = 30, 0.001
mf_NDCGs = []

start_time = time.time()

############# Your code here ############
    
#########################################

end_time = time.time()
print(f'Running time: {end_time - start_time:.4f} seconds')

### Question 26: Tune MF model with learning_rate={0.001, 0.005, 0.01, 0.05} and the rest of hyperparameters as n_factors=10, n_epochs=90. Visualize the performance with n_factors in x-axis and corresponding NDCG values in y-axis.

In [None]:
learning_rate = [0.001,0.005,0.01,0.05]
n_factors, n_epochs = 10, 90
mf_NDCGs = []

start_time = time.time()

############# Your code here ############
    
#########################################

end_time = time.time()
print(f'Running time: {end_time - start_time:.4f} seconds')

# 7) Discussion

### Question 27: Compare the performance of CF methods with the content-based recommendation model developed in assignment 1. Discuss your observations.