# Appendix: User-Based Collaborative Filtering from Scratch

This notebook contains a simplified implementation of a user-based collaborative filtering algorithm.
It was created to gain a deeper understanding of similarity-based recommendation methods and the internal
mechanics of collaborative filtering.

This implementation is intended for educational purposes and is not part of the production-oriented
recommendation pipeline used in the main project.


In [1]:
import scipy.io

# Load movie metadata
movie_ids_path = 'movies/movie_ids.txt'
with open(movie_ids_path, 'r', encoding='latin-1') as file:
    movie_names = file.readlines()

# Parse movie titles: each line has the form "<id> <title>"; keep only the title
split_lines = (line.strip().split(' ', 1) for line in movie_names)
movie_list = [parts[1] for parts in split_lines if len(parts) == 2]

# Load ratings data
ratings_mat = scipy.io.loadmat('movies/movies.mat')

# Display basic dataset info
movie_list[:5], {key: type(value) for key, value in ratings_mat.items()}


(['Toy Story (1995)',
  'GoldenEye (1995)',
  'Four Rooms (1995)',
  'Get Shorty (1995)',
  'Copycat (1995)'],
 {'__header__': bytes,
  '__version__': str,
  '__globals__': list,
  'Y': numpy.ndarray,
  'R': numpy.ndarray})

## Rating normalization

We center the ratings by subtracting each movie’s mean rating, computed only over the users who rated that movie. This reduces movie-specific bias (some movies are rated higher by nearly everyone) before computing similarities and predictions.


In [2]:
import numpy as np

# Extract Y and R matrices from the loaded data
Y = ratings_mat['Y']  # Ratings matrix (movies x users)
R = ratings_mat['R']  # Indicator matrix (movies x users)

# Per-movie mean over existing ratings only (assumes unrated entries in Y are 0)
Y_mean = np.sum(Y, axis=1) / np.sum(R, axis=1)
Y_norm = Y - Y_mean[:, np.newaxis] * R  # Subtract the mean only where a rating exists (R == 1)
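
To make the masking step concrete, here is a minimal sketch on a toy matrix. The arrays `Y_toy` and `R_toy` are made up for illustration and assume, as the cell above does, that an unrated entry is stored as 0.

```python
import numpy as np

# Toy data: 3 movies x 4 users; 0 in Y_toy means "not rated"
Y_toy = np.array([[5, 4, 0, 0],
                  [0, 2, 4, 0],
                  [1, 0, 0, 3]], dtype=float)
R_toy = (Y_toy > 0).astype(float)

# Per-movie mean over rated entries only: 4.5, 3.0, 2.0
mean_toy = Y_toy.sum(axis=1) / R_toy.sum(axis=1)

# Center only the rated entries; unrated entries stay at 0
Y_toy_norm = Y_toy - mean_toy[:, np.newaxis] * R_toy

print(mean_toy)
print(Y_toy_norm)
```

Multiplying the mean by `R_toy` is what keeps unrated entries at exactly 0, so they later contribute nothing to the dot products between users.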

## Compute cosine similarity between users

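We compute user–user cosine similarity based on the normalized rating matrix. Each user is a column of `Y_norm`, so the matrix is transposed to put one user in each row before the pairwise dot products are taken.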


In [3]:
def cosine_similarity(ratings, epsilon=1e-9):
    # Pairwise dot products between rows (one row per user)
    sim = ratings.dot(ratings.T)
    # Row norms; the small epsilon avoids division by zero for all-zero rows
    norms = np.sqrt(np.diagonal(sim)) + epsilon
    return sim / norms[:, np.newaxis] / norms[np.newaxis, :]

# Compute cosine similarity
user_similarity = cosine_similarity(Y_norm.T)
user_similarity.shape


(943, 943)
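
As a sanity check on what the similarity values mean, the following sketch applies the same idea to four made-up users. The helper name `cosine_sim_rows` and the toy data are assumptions for illustration, not part of the notebook.

```python
import numpy as np

def cosine_sim_rows(X, epsilon=1e-9):
    """Cosine similarity between the rows of X (hypothetical helper)."""
    dots = X @ X.T
    norms = np.sqrt(np.diagonal(dots)) + epsilon  # epsilon guards all-zero rows
    return dots / np.outer(norms, norms)

# Toy mean-centered ratings: one row per user, one column per movie
users = np.array([[ 1.0, -1.0, 0.0],
                  [ 2.0, -2.0, 0.0],   # same taste as user 0, stronger opinions
                  [-1.0,  1.0, 0.0],   # opposite taste to user 0
                  [ 0.0,  0.0, 0.0]])  # no information at all

S = cosine_sim_rows(users)
print(np.round(S, 3))
```

Users 0 and 1 come out with similarity close to 1, users 0 and 2 close to -1, and the all-zero user has similarity 0 with everyone. Cosine similarity ignores the magnitude of the rating vectors and compares only their direction.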

## Predict ratings

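We predict a rating for every user–movie pair as a similarity-weighted average of the mean-centered ratings, then add the movie means back. Unrated entries are zero in `Y_norm`, so they add nothing to the weighted sum, but the normalizing sum of absolute similarities runs over all users, which pulls predictions toward the movie mean.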


In [4]:
def predict_ratings(similarity, ratings, movie_means):
    # Similarity-weighted sum of mean-centered ratings, shape (movies x users)
    weighted_sum = np.dot(ratings, similarity.T)

    # Sum of absolute similarities per user, taken over all users
    sim_sum = np.abs(similarity).sum(axis=1)

    # Normalize each user's column by that user's similarity sum
    pred_ratings = weighted_sum / sim_sum

    # Add back the movie mean ratings to the predictions
    pred_ratings += movie_means[:, np.newaxis]

    return pred_ratings

predicted_ratings = predict_ratings(user_similarity, Y_norm, Y_mean)
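
The function above divides by the similarity sum over all users, whether or not they rated the movie in question. A common variant restricts the denominator to the users who actually rated each movie. The sketch below shows that variant on toy inputs. The name `predict_rated_only` and all values are illustrative assumptions, and this is not the method used in this notebook.

```python
import numpy as np

def predict_rated_only(similarity, ratings_norm, rated_mask, movie_means, epsilon=1e-9):
    """Similarity-weighted average over the users who rated each movie.

    similarity:   (users x users) similarity matrix
    ratings_norm: (movies x users) mean-centered ratings, 0 where unrated
    rated_mask:   (movies x users) 1 where a rating exists, else 0
    movie_means:  (movies,) per-movie mean ratings
    """
    weighted_sum = ratings_norm @ similarity.T
    # Denominator counts only the similarities of users who rated the movie
    sim_sum = rated_mask @ np.abs(similarity).T
    return movie_means[:, np.newaxis] + weighted_sum / (sim_sum + epsilon)

# Toy inputs (illustrative values, not taken from the dataset)
sim_toy = np.array([[1.0, 0.5, 0.2],
                    [0.5, 1.0, 0.0],
                    [0.2, 0.0, 1.0]])
norm_toy = np.array([[1.0, -1.0, 0.0],
                     [0.0, 0.5, -0.5]])
mask_toy = np.array([[1.0, 1.0, 0.0],
                     [0.0, 1.0, 1.0]])
means_toy = np.array([4.0, 3.0])

pred_toy = predict_rated_only(sim_toy, norm_toy, mask_toy, means_toy)
print(pred_toy[0, 2])  # close to 5.0
```

User 2 has not rated movie 0 and is similar only to user 0, who rated it one point above its mean of 4.0, so the prediction is close to 5.0. Normalizing over all three users instead would give roughly 4.17, because user 2's self-similarity of 1.0 enters the denominator without a matching rating.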


## Generate top-5 recommendations for a user

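We rank movies by predicted rating and return the top 5 for a selected user. The cell below ranks all movies, including any the user has already rated; it does not use the indicator matrix `R` to filter them out.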


In [5]:
user_id = 0 # user #1 (0-based index)
user_ratings = predicted_ratings[:, user_id]  # Extract predictions for user #1

# Get indices of the top-5 predicted ratings
top_5_indices = np.argsort(user_ratings)[-5:][::-1]

# Print the titles of the top-5 recommended movies for user #1
top_5_movies = [movie_list[index] for index in top_5_indices]
print("Top-5 recommended movies for user #1:")
for movie in top_5_movies:
    print(movie)

Top-5 recommended movies for user #1:
Entertaining Angels: The Dorothy Day Story (1996)
They Made Me a Criminal (1939)
Marlene Dietrich: Shadow and Light (1996)
Someone Else's America (1995)
Star Kid (1997)
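
To restrict the ranking to movies the user has not rated, the rated entries can be masked out before sorting. The following is a minimal sketch on toy arrays. The helper `top_k_unseen` is a hypothetical name and not part of the notebook.

```python
import numpy as np

def top_k_unseen(pred_column, rated_column, k=5):
    """Indices of the k highest predictions among movies the user has not rated."""
    # Push rated movies to the bottom of the ranking
    scores = np.where(rated_column == 1, -np.inf, pred_column)
    order = np.argsort(scores)[::-1]
    # Drop rated movies entirely, in case fewer than k unseen movies exist
    order = order[np.isfinite(scores[order])]
    return order[:k]

# Toy predictions for one user over five movies; movies 0 and 4 are already rated
scores_toy = np.array([4.8, 3.1, 4.9, 2.0, 4.2])
rated_toy = np.array([1, 0, 0, 0, 1])

print(top_k_unseen(scores_toy, rated_toy, k=2))  # indices 2 and 1
```

With the notebook's variables, the equivalent call would be `top_k_unseen(predicted_ratings[:, user_id], R[:, user_id])`, with the returned indices mapped to titles through `movie_list`.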
