## Collaborative Filtering

Collaborative filtering is a recommendation technique that predicts a user's preferences from the preferences collected from many users (hence "collaborative"). It assumes that if user A agrees with user B on one item, A is more likely to share B's opinion on another item than the opinion of a randomly chosen user.

### 1. Import libraries

In [1]:
# Importing the libraries
import warnings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing
from scipy.spatial import distance 

### 2. Read data

In [2]:
# Read movie data
df_movies = pd.read_csv('data/movies.csv')

# Read ratings data
df_ratings = pd.read_csv('data/ratings.csv')

# Merge ratings and movies datasets
df = pd.merge(df_ratings, df_movies, on='movieId', how='inner')

df.head()

Unnamed: 0,userId,movieId,rating,timestamp,title,genres
0,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,5,1,4.0,847434962,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,7,1,4.5,1106635946,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
3,15,1,2.5,1510577970,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
4,17,1,4.5,1305696483,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy


### 3. Item-Based Collaborative Filtering

In item-based collaborative filtering, recommendations are made based on the similarity between items.
It involves creating an item-user matrix, where rows represent items, columns represent users, and the matrix entries represent item-user interactions.
Similarity measures (e.g., cosine similarity) are calculated between items, and predictions for the target user are made based on the preferences for similar items.
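
To make the similarity step concrete, here is a minimal sketch on a made-up 3-item by 4-user matrix. The ratings and the helper name `cosine_sim` are illustrative assumptions, not taken from the dataset. It computes cosine similarity between item rows directly from the definition, cos(a, b) = a·b / (‖a‖ ‖b‖).

```python
import numpy as np

# Hypothetical item-user matrix: rows are items, columns are users, 0 = no rating
toy_item_user = np.array([
    [5.0, 4.0, 0.0, 1.0],   # item A
    [4.0, 5.0, 0.0, 2.0],   # item B
    [0.0, 1.0, 5.0, 4.0],   # item C
])

def cosine_sim(a, b):
    """Cosine of the angle between two rating vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_ab = cosine_sim(toy_item_user[0], toy_item_user[1])
sim_ac = cosine_sim(toy_item_user[0], toy_item_user[2])
print(f"sim(A, B) = {sim_ab:.3f}, sim(A, C) = {sim_ac:.3f}")
```

Items A and B are rated alike by the same users, so their similarity comes out much higher than that of A and C.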

#### Item-Based Collaborative Filtering using Cosine Similarity

In [3]:
# Create a user-item matrix
user_item_matrix = df.pivot(index='userId', columns='movieId', values='rating').fillna(0)

# Transpose the matrix for item-based collaborative filtering
item_user_matrix = user_item_matrix.T

item_user_matrix.head()

movieId \ userId,1,2,3,4,5,6,7,8,9,10,...,601,602,603,604,605,606,607,608,609,610
1,4.0,0.0,0.0,0.0,4.0,0.0,4.5,0.0,0.0,0.0,...,4.0,0.0,4.0,3.0,4.0,2.5,4.0,2.5,3.0,5.0
2,0.0,0.0,0.0,0.0,0.0,4.0,0.0,4.0,0.0,0.0,...,0.0,4.0,0.0,5.0,3.5,0.0,0.0,2.0,0.0,0.0
3,4.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0


In [4]:
from sklearn.metrics.pairwise import cosine_similarity

# Create a pivot table with users as rows, movies as columns, and ratings as values
pivot_table = df.pivot_table(index='userId', columns='title', values='rating')

# Fill NaN values with 0 so cosine similarity can be computed
# (this treats an unrated movie as a rating of 0, which is a simplifying assumption)
pivot_table = pivot_table.fillna(0)

pivot_table.head()

userId \ title,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),'Tis the Season for Love (2015),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...,Zulu (2013),[REC] (2007),[REC]² (2009),[REC]³ 3 Génesis (2012),anohana: The Flower We Saw That Day - The Movie (2013),eXistenZ (1999),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
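
Filling missing ratings with 0 makes an unrated movie look like a strongly disliked one. One common alternative, sketched below on made-up data, is to subtract each user's mean rating before filling, so that 0 stands for "no information". The frame `toy_ratings` and its values are hypothetical, and this is a swapped-in variant for illustration, not the approach used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings: NaN marks a movie the user has not rated
toy_ratings = pd.DataFrame(
    {
        "Movie A": [5.0, 4.0, np.nan],
        "Movie B": [3.0, np.nan, 2.0],
        "Movie C": [np.nan, 2.0, 4.0],
    },
    index=pd.Index([1, 2, 3], name="userId"),
)

# Mean rating per user (NaN entries are skipped by default)
user_means = toy_ratings.mean(axis=1)

# Subtract each user's mean from that user's row
centered = toy_ratings.sub(user_means, axis=0)

# After centring, 0 means "no information" rather than "lowest possible rating"
centered_filled = centered.fillna(0)
print(centered_filled)
```

Cosine similarity computed on `centered_filled.T` is then driven by whether users rated a movie above or below their own average.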


In [5]:
# Calculate cosine similarity between movies
movie_similarity = cosine_similarity(pivot_table.T)

# Create a DataFrame from the similarity matrix
movie_similarity_df = pd.DataFrame(movie_similarity, index=pivot_table.columns, columns=pivot_table.columns)

movie_similarity_df.head()

title,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),'Tis the Season for Love (2015),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...,Zulu (2013),[REC] (2007),[REC]² (2009),[REC]³ 3 Génesis (2012),anohana: The Flower We Saw That Day - The Movie (2013),eXistenZ (1999),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
'71 (2014),1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.141653,0.0,...,0.0,0.342055,0.543305,0.707107,0.0,0.0,0.139431,0.327327,0.0,0.0
'Hellboy': The Seeds of Creation (2004),0.0,1.0,0.707107,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
'Round Midnight (1986),0.0,0.707107,1.0,0.0,0.0,0.0,0.176777,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
'Salem's Lot (2004),0.0,0.0,0.0,1.0,0.857493,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
'Til There Was You (1997),0.0,0.0,0.0,0.857493,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [6]:
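# Function to get top N similar movies for a given movie
def get_similar_movies(movie_title, top_n=10):
    # Similarity of every movie to the given one, excluding the movie itself
    similar_scores = movie_similarity_df[movie_title].drop(movie_title)
    # Sort from most to least similar and keep the top N titles
    similar_movies = similar_scores.sort_values(ascending=False).index[:top_n]
    return similar_movies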

In [7]:
# Get top 10 similar movies to 'Forrest Gump (1994)'
similar_movies = get_similar_movies('Forrest Gump (1994)', top_n=10)
print(f"Top 10 movies similar to 'Forrest Gump (1994)': {similar_movies}")

Top 10 movies similar to 'Forrest Gump (1994)': Index(['Shawshank Redemption, The (1994)', 'Jurassic Park (1993)',
       'Pulp Fiction (1994)', 'Braveheart (1995)',
       'Silence of the Lambs, The (1991)', 'Apollo 13 (1995)',
       'Matrix, The (1999)', 'Mrs. Doubtfire (1993)',
       'Schindler's List (1993)', 'Terminator 2: Judgment Day (1991)'],
      dtype='object', name='title')
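
A list of similar movies is not yet a prediction for a particular user. One simple way to get there, sketched below, is to take a similarity-weighted average of the ratings the user has already given. The similarity values, the ratings, and the helper name `predict_rating` are all hypothetical and only illustrate the idea.

```python
import numpy as np
import pandas as pd

# Hypothetical item-item similarity matrix
items = ["A", "B", "C"]
toy_sim = pd.DataFrame(
    [[1.0, 0.9, 0.1],
     [0.9, 1.0, 0.2],
     [0.1, 0.2, 1.0]],
    index=items, columns=items,
)

# Hypothetical ratings of one user; NaN = not rated yet
toy_user_ratings = pd.Series({"A": np.nan, "B": 5.0, "C": 2.0})

def predict_rating(item, user_ratings, sim):
    """Similarity-weighted average of the user's ratings on other items."""
    rated = user_ratings.dropna().drop(labels=item, errors="ignore")
    weights = sim.loc[item, rated.index]
    if weights.abs().sum() == 0:
        return np.nan  # no similar rated item to base a prediction on
    return float((weights * rated).sum() / weights.abs().sum())

pred_a = predict_rating("A", toy_user_ratings, toy_sim)
print(f"Predicted rating for item A: {pred_a:.2f}")
```

Because item A is far more similar to B (rated 5) than to C (rated 2), the prediction lands close to 5.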


#### Item-Based Collaborative Filtering using SVD

In [8]:
# Step 1: Matrix Decomposition with SVD for Item-Based Collaborative Filtering
item_user_matrix = df.pivot_table(index='movieId', columns='userId', values='rating', fill_value=0)
U, Sigma, Vt = np.linalg.svd(item_user_matrix.values, full_matrices=False)

U.shape, Sigma.shape, Vt.shape

((9724, 610), (610,), (610, 610))

In [9]:
# Choose the number of latent factors (adjust based on your dataset)
k = 50
U_k = U[:, :k]
Sigma_k = np.diag(Sigma[:k])
Vt_k = Vt[:k, :]
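
The value of `k` is a tuning choice. One heuristic, sketched below on a synthetic matrix, is to pick the smallest `k` whose singular values capture a chosen share of the total squared singular values (often called the "energy"). The 90% threshold and the synthetic data are assumptions for illustration only.

```python
import numpy as np

# Hypothetical low-rank-plus-noise matrix standing in for the item-user matrix
rng = np.random.default_rng(0)
toy_matrix = (
    rng.normal(size=(40, 3)) @ rng.normal(size=(3, 20))
    + 0.01 * rng.normal(size=(40, 20))
)

# Singular values only (U and Vt are not needed to choose k)
singular_values = np.linalg.svd(toy_matrix, compute_uv=False)

# Fraction of the total squared singular values captured by the first k factors
energy = np.cumsum(singular_values**2) / np.sum(singular_values**2)

# Smallest k that retains at least 90% of the energy (threshold is an assumption)
k_90 = int(np.searchsorted(energy, 0.90) + 1)
print(f"k for 90% energy: {k_90}")
```

On the real matrix the same calculation can be applied to `Sigma` from the cell above.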

In [10]:
# Step 2: Matrix Reconstruction
item_user_matrix_pred = np.dot(np.dot(U_k, Sigma_k), Vt_k)
item_user_matrix_pred

array([[ 2.18187197e+00,  2.09809067e-01,  1.33940814e-02, ...,
         2.30963539e+00,  7.83182598e-01,  5.35809290e+00],
       [ 3.93674189e-01,  4.82051887e-03,  3.47258164e-02, ...,
         2.70243898e+00,  5.30142683e-01, -2.88817350e-01],
       [ 8.38185756e-01,  3.07424005e-02,  5.05247472e-02, ...,
         2.26419696e+00,  9.79748203e-02, -9.07680249e-02],
       ...,
       [-2.49842711e-02,  1.88951263e-02, -1.61232411e-03, ...,
        -1.25165145e-02,  9.84577917e-04, -2.79227416e-02],
       [-2.49842711e-02,  1.88951263e-02, -1.61232411e-03, ...,
        -1.25165145e-02,  9.84577917e-04, -2.79227416e-02],
       [-5.89881001e-02,  3.19658766e-02, -5.29984436e-04, ...,
         9.27520866e-02, -5.49383653e-03,  3.55476113e-02]])
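
Note that the reconstructed values are not confined to the rating scale: the output above contains negative entries and one above 5. The rank-`k` product is the best rank-`k` approximation of the zero-filled matrix in the least-squares sense, and its Frobenius error equals the square root of the sum of the squared discarded singular values. The sketch below checks this on a small made-up matrix; the data and the choice `k = 2` are assumptions.

```python
import numpy as np

# Made-up ratings matrix, 0 = unrated
rng = np.random.default_rng(42)
toy = rng.integers(0, 6, size=(8, 5)).astype(float)

# Thin SVD, then keep only the first k factors
U, s, Vt = np.linalg.svd(toy, full_matrices=False)
k = 2
toy_pred = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Frobenius norm of the residual vs. the discarded singular values
residual = np.linalg.norm(toy - toy_pred, ord="fro")
discarded = np.sqrt(np.sum(s[k:] ** 2))
print(f"residual = {residual:.6f}, discarded = {discarded:.6f}")
```

The two printed numbers agree up to floating-point error.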

In [11]:
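# Step 3: Prediction Generation
# Note: the assignment below is a no-op (it sets entries that already equal 0 to 0),
# so the reconstructed matrix is used as-is and already-rated movies are not filtered out.
item_user_matrix_pred[item_user_matrix_pred == 0] = 0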
item_user_matrix_pred

array([[ 2.18187197e+00,  2.09809067e-01,  1.33940814e-02, ...,
         2.30963539e+00,  7.83182598e-01,  5.35809290e+00],
       [ 3.93674189e-01,  4.82051887e-03,  3.47258164e-02, ...,
         2.70243898e+00,  5.30142683e-01, -2.88817350e-01],
       [ 8.38185756e-01,  3.07424005e-02,  5.05247472e-02, ...,
         2.26419696e+00,  9.79748203e-02, -9.07680249e-02],
       ...,
       [-2.49842711e-02,  1.88951263e-02, -1.61232411e-03, ...,
        -1.25165145e-02,  9.84577917e-04, -2.79227416e-02],
       [-2.49842711e-02,  1.88951263e-02, -1.61232411e-03, ...,
        -1.25165145e-02,  9.84577917e-04, -2.79227416e-02],
       [-5.89881001e-02,  3.19658766e-02, -5.29984436e-04, ...,
         9.27520866e-02, -5.49383653e-03,  3.55476113e-02]])
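
Because the reconstructed matrix is used as-is, the highest predicted scores can belong to movies the user has already rated. A minimal sketch of one way to exclude them follows. The matrices, the helper name `recommend_unseen`, and the convention that 0 means "not rated" are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical item-user matrix (rows: movieId, columns: userId), 0 = not rated
toy_item_user = pd.DataFrame(
    [[5.0, 0.0], [0.0, 3.0], [0.0, 0.0], [4.0, 0.0]],
    index=pd.Index([10, 20, 30, 40], name="movieId"),
    columns=pd.Index([1, 2], name="userId"),
)
# Hypothetical predicted scores with the same shape
toy_pred = np.array([[4.8, 1.0], [3.9, 2.9], [2.5, 4.1], [4.2, 0.3]])

def recommend_unseen(user_id, n=2):
    """Top-n movie IDs by predicted score, skipping movies the user already rated."""
    col = toy_item_user.columns.get_loc(user_id)
    scores = toy_pred[:, col].copy()
    already_rated = toy_item_user.iloc[:, col].to_numpy() > 0
    scores[already_rated] = -np.inf  # push rated movies to the bottom of the ranking
    top = np.argsort(scores)[::-1][:n]
    return toy_item_user.index[top].tolist()

recs_user_1 = recommend_unseen(1)
print(recs_user_1)
```

User 1 has rated movies 10 and 40, so only movies 20 and 30 remain as candidates.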

In [14]:
def get_top_n_movie_recommendations(user_id, item_user_matrix_pred, n=10):
    # Look up the user's column by label instead of assuming IDs are contiguous from 1
    user_col = item_user_matrix.columns.get_loc(user_id)
    user_predictions = item_user_matrix_pred[:, user_col]

    # Indices of the n highest predicted ratings, best first
    top_n_indices = np.argsort(user_predictions)[::-1][:n]
    top_n_movie_ids = item_user_matrix.index[top_n_indices]

    # Map movie IDs to titles, preserving the ranking order
    titles_by_id = df_movies.set_index('movieId')['title']
    return titles_by_id.loc[top_n_movie_ids].tolist()


In [15]:
user_id = 1  
top_movie_recommendations = get_top_n_movie_recommendations(user_id, item_user_matrix_pred, n=10)
print(f"Top 10 movie recommendations for user {user_id}: {top_movie_recommendations}")


Top 10 movie recommendations for user 1: ['Star Wars: Episode V - The Empire Strikes Back (1980)', 'Star Wars: Episode IV - A New Hope (1977)', 'Star Wars: Episode VI - Return of the Jedi (1983)', 'Indiana Jones and the Last Crusade (1989)', 'Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981)', 'Fargo (1996)', 'Saving Private Ryan (1998)', 'Seven (a.k.a. Se7en) (1995)', 'Indiana Jones and the Temple of Doom (1984)', 'American Beauty (1999)']
