### 4.2 Collaborative filtering for movie recommendations

Learning goal: How to use neighbourhood-based collaborative filtering in recommender systems; problems of adjusted cosine similarity.

Table 2 presents ratings of 6 movies by 6 users. The LaTeX source
of the table is available on the course page (mratingstable.tex). The ratings
range from 1 (didn’t like at all) to 5 (fantastic movie), and 0 means a missing
rating (the user hasn’t watched the movie). The users are denoted u1, ..., u6
and the movies m1, ..., m6. The task is to predict the missing ratings
using neighbourhood-based collaborative filtering (see Aggarwal
18.5.2 and an example in the lecture).


Table 2: Movie ratings (scale 1–5) by 6 users (u1–u6) on 6 movies (m1–m6).
Special value 0 means a missing rating.

|    | m1 | m2 | m3 | m4 | m5 | m6 |
|----|----|----|----|----|----|----|
| u1 | 3  | 1  | 2  | 2  | 0  | 2  |
| u2 | 4  | 2  | 3  | 3  | 4  | 2  |
| u3 | 4  | 1  | 3  | 3  | 2  | 5  |
| u4 | 0  | 3  | 4  | 4  | 5  | 0  |
| u5 | 2  | 5  | 5  | 0  | 3  | 3  |
| u6 | 1  | 4  | 0  | 5  | 0  | 0  |


In [2]:
import pandas as pd
import numpy as np
import networkx as nx
import matplotlib.pyplot as plt

(a) Calculate mean ratings per user. Use all non-missing ratings in the calculation. These are needed in parts (b) and (c).

In [3]:
# Movie ratings data

movie_ratings = {
    'm1': [3, 4, 4, 0, 2, 1],
    'm2': [1, 2, 1, 3, 5, 4],
    'm3': [2, 3, 3, 4, 5, 0],
    'm4': [2, 3, 3, 4, 0, 5],
    'm5': [0, 4, 2, 5, 3, 0],
    'm6': [2, 2, 5, 0, 3, 0]
}

movie_ratings = pd.DataFrame(movie_ratings, index=['u1', 'u2', 'u3', 'u4', 'u5', 'u6'])
movie_ratings


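|    | m1 | m2 | m3 | m4 | m5 | m6 |
|----|----|----|----|----|----|----|
| u1 | 3  | 1  | 2  | 2  | 0  | 2  |
| u2 | 4  | 2  | 3  | 3  | 4  | 2  |
| u3 | 4  | 1  | 3  | 3  | 2  | 5  |
| u4 | 0  | 3  | 4  | 4  | 5  | 0  |
| u5 | 2  | 5  | 5  | 0  | 3  | 3  |
| u6 | 1  | 4  | 0  | 5  | 0  | 0  |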


In [7]:
# Replacing 0 with NaN for mean calculation
movie_ratings.replace(0, np.nan, inplace=True)

user_mean_ratings = movie_ratings.mean(axis=1)

# Convert to dictionary
user_mean_ratings = user_mean_ratings.to_dict()

print("The mean ratings per user are:\n")

print(user_mean_ratings)


The mean ratings per user are:

{'u1': 2.0, 'u2': 3.0, 'u3': 3.0, 'u4': 4.0, 'u5': 3.6, 'u6': 3.3333333333333335}


(b) Calculate required pairwise similarities between users using a modified Pearson correlation r ("Pearson" in Aggarwal Equation 18.12). Use the mean values calculated in part a. Remember that the correlation is calculated only over co-rated movies.

Note: the similarity between u2 and u3 is not needed, so 14 of the 15 possible pairwise similarities are required.

#### Pearson correlation formula

$$
Pearson(\overline{X},\overline{Y}) = \frac{\sum_{i=1}^{s}(x_i - \hat{x})(y_i - \hat{y})}{\sqrt{\sum_{i=1}^{s}(x_i - \hat{x})^2}\sqrt{\sum_{i=1}^{s}(y_i - \hat{y})^2}}
$$

The Pearson coefficient is computed between the target user and all the other users. The
peer group of the target user is defined as the top-k users with the highest Pearson coefficient
of correlation with her. Users with very low or negative correlations are also removed from
the peer group. The average ratings of each of the (specified) items of this peer group are
returned as the recommended ratings. To achieve greater robustness, it is also possible to
weight each rating with the Pearson correlation coefficient of its owner while computing
the average. This weighted average rating can provide a prediction for the target user. The
items with the highest predicted ratings are recommended to the user.

In [8]:
def pearson_correlation(user1, user2, user_mean_ratings, movie_ratings):
    
    # Selecting only the movies co-rated by both users
    common_movies = movie_ratings.loc[:, (movie_ratings.loc[user1].notna()) & (movie_ratings.loc[user2].notna())]

    # If there are no common movies, the similarity is undefined (set to 0)
    if common_movies.shape[1] == 0:
        return 0

    # Calculating the mean-adjusted ratings
    mean_adjusted_ratings_user1 = common_movies.loc[user1] - user_mean_ratings[user1]
    mean_adjusted_ratings_user2 = common_movies.loc[user2] - user_mean_ratings[user2]

    # Calculating the numerator and denominators of the Pearson correlation formula
    numerator = (mean_adjusted_ratings_user1 * mean_adjusted_ratings_user2).sum()
    denominator = np.sqrt((mean_adjusted_ratings_user1 ** 2).sum()) * np.sqrt((mean_adjusted_ratings_user2 ** 2).sum())

    # If the denominator is 0 (all co-rated ratings of a user equal that user's mean), return 0
    if denominator == 0:
        return 0

    # Calculating the Pearson correlation
    correlation = numerator / denominator

    return correlation

# Testing the function with an example
pearson_correlation('u1', 'u2', user_mean_ratings, movie_ratings)

0.8164965809277259
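
As a hand check of this value, the computation for u1 and u2 can be written out. This is only a worked sketch using the means from part (a): the two users co-rate m1, m2, m3, m4 and m6, so with means 2.0 and 3.0 the mean-adjusted ratings are (1, -1, 0, 0, 0) for u1 and (1, -1, 0, 0, -1) for u2.

$$
r(u_1, u_2) = \frac{(1)(1) + (-1)(-1) + (0)(0) + (0)(0) + (0)(-1)}{\sqrt{1^2 + (-1)^2}\,\sqrt{1^2 + (-1)^2 + (-1)^2}} = \frac{2}{\sqrt{2}\,\sqrt{3}} = \frac{2}{\sqrt{6}} \approx 0.8165
$$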

In [12]:
# The Pearson correlation coefficients matrix
user_similarity = pd.DataFrame(index=movie_ratings.index, columns=movie_ratings.index)

# Calculating the Pearson correlation coefficients for all pairs of users
for user1 in movie_ratings.index:
    for user2 in movie_ratings.index:
        correlation = pearson_correlation(user1, user2, user_mean_ratings, movie_ratings)
        user_similarity.loc[user1, user2] = correlation

user_similarity


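|    | u1        | u2        | u3        | u4        | u5        | u6        |
|----|-----------|-----------|-----------|-----------|-----------|-----------|
| u1 | 1.0       | 0.816497  | 0.707107  | 1.0       | -0.811107 | -0.720577 |
| u2 | 0.816497  | 1.0       | 0.0       | 1.0       | -0.559017 | -0.720577 |
| u3 | 0.707107  | 0.0       | 1.0       | 0.316228  | -0.589256 | -0.557007 |
| u4 | 1.0       | 1.0       | 0.316228  | 1.0       | -0.683586 | -0.371391 |
| u5 | -0.811107 | -0.559017 | -0.589256 | -0.683586 | 1.0       | 0.904526  |
| u6 | -0.720577 | -0.720577 | -0.557007 | -0.371391 | 0.904526  | 1.0       |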


#### Range and Interpretation:

+1: A Pearson correlation of +1 indicates a perfect positive linear relationship between variables.

-1: A Pearson correlation of -1 indicates a perfect negative linear relationship between variables.

0: A Pearson correlation of 0 indicates no linear relationship between the variables.

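Users u2 and u3 have rated all six movies, so neither of them needs a prediction and their mutual similarity is not required. It appears in the matrix only because the loop covers every pair of users, and it happens to be 0.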

(c) Predict missing ratings using two nearest neighbours (K = 2) and an
extra requirement that the similarity is r ≥ 0.5. State whether the movie is
recommended to the user (i.e. whether the user would like it more than average).
Report if some prediction cannot be made (not enough sufficiently similar neighbours with the required ratings).

In [17]:
def nearest_neighbours(user_similarity, movie_ratings, user_mean_ratings, K=2, threshold=0.5):
    # Storing predictions and recommendations
    predictions = pd.DataFrame(index=movie_ratings.index, columns=movie_ratings.columns)
    recommendations = pd.DataFrame(index=movie_ratings.index, columns=movie_ratings.columns)

    for user in movie_ratings.index:
        for movie in movie_ratings.columns:
            # Skip if the user has already rated the movie
            if not np.isnan(movie_ratings.loc[user, movie]):
                continue
            
            # Find users who have rated the movie
            users_rated_movie = movie_ratings[movie].dropna().index
            
            # Calculate similarity with the current user
            similarity_scores = user_similarity.loc[user, users_rated_movie]
            
            # Filter out users with similarity less than the threshold
            similar_users = similarity_scores[similarity_scores >= threshold]
            
            # If there are less than K similar users, skip the prediction
            if len(similar_users) < K:
                predictions.loc[user, movie] = "NEN" # Not enough neighbors
                recommendations.loc[user, movie] = "CBD" # Cannot be determined
                continue
            
            # Take the top K similar users
            top_similar_users = similar_users.nlargest(K)
            
            # Calculate the predicted rating
            numerator = sum(top_similar_users * movie_ratings.loc[top_similar_users.index, movie])
            denominator = sum(abs(top_similar_users))
            predicted_rating = numerator / denominator
            
            # Store the predicted rating
            predictions.loc[user, movie] = predicted_rating
            
            # Make a recommendation based on whether the predicted rating is higher than the user's average rating
            recommendations.loc[user, movie] = "Rec" if predicted_rating > user_mean_ratings[user] else "N-Rec"

    return predictions, recommendations

# Calling the function with the user similarity matrix, movie ratings, user mean ratings, K=2, and similarity threshold=0.5
predictions, recommendations = nearest_neighbours(user_similarity.astype(float), movie_ratings, user_mean_ratings)

print("Prediction dataframes\n")
print(predictions)


print("\nRecommendation dataframes\n")
print(recommendations)


Prediction dataframes

     m1   m2   m3   m4       m5   m6
u1  NaN  NaN  NaN  NaN  4.55051  NaN
u2  NaN  NaN  NaN  NaN      NaN  NaN
u3  NaN  NaN  NaN  NaN      NaN  NaN
u4  3.5  NaN  NaN  NaN      NaN  2.0
u5  NaN  NaN  NaN  NEN      NaN  NaN
u6  NaN  NaN  NEN  NaN      NEN  NEN

Recommendation dataframes

       m1   m2   m3   m4   m5     m6
u1    NaN  NaN  NaN  NaN  Rec    NaN
u2    NaN  NaN  NaN  NaN  NaN    NaN
u3    NaN  NaN  NaN  NaN  NaN    NaN
u4  N-Rec  NaN  NaN  NaN  NaN  N-Rec
u5    NaN  NaN  NaN  CBD  NaN    NaN
u6    NaN  NaN  CBD  NaN  CBD    CBD


Predictions DataFrame:
- NaN: The user has already rated the movie.
- Numeric Value: The predicted rating for the movie.
- "NEN" for "Not enough neighbors": There weren't enough sufficiently similar neighbors to make a prediction.

Recommendations DataFrame:
- "Rec" for "Recommended": The predicted rating is higher than the user's average rating.
- "N-Rec" for "Not Recommended": The predicted rating is not higher than the user's average rating.
- "CBD" for "Cannot be determined": It was not possible to make a prediction or recommendation.

For example:

- For user u1, movie m5 is predicted to have a rating of approximately 4.55, and it is recommended because the predicted rating is higher than the user's average rating.
- For user u4, movie m1 is predicted to have a rating of 3.5, but it is not recommended because the predicted rating is lower than the user's average rating of 4.0.
- For user u5, movie m4 could not be predicted because there are not enough sufficiently similar neighbors with the required ratings.
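
The first two examples can be reproduced by hand. The snippet below is a minimal sketch that copies the similarities and ratings from the outputs above; the helper name `weighted_average` is introduced here only for illustration and is not part of the notebook.

```python
def weighted_average(neighbours):
    """Similarity-weighted mean of ratings.

    neighbours: list of (similarity, rating) pairs for the K chosen users.
    """
    numerator = sum(sim * rating for sim, rating in neighbours)
    denominator = sum(abs(sim) for sim, _ in neighbours)
    return numerator / denominator

# (u1, m5): among the users who rated m5, the two most similar to u1 are
# u4 (r = 1.0, rating 5) and u2 (r = 0.816497, rating 4);
# u3 (r = 0.707107) passes the threshold but ranks third.
pred_u1_m5 = weighted_average([(1.0, 5), (0.816497, 4)])

# (u4, m1): the neighbours are u1 (r = 1.0, rating 3) and u2 (r = 1.0, rating 4).
pred_u4_m1 = weighted_average([(1.0, 3), (1.0, 4)])

print(round(pred_u1_m5, 4), pred_u4_m1)
```

Because both neighbours of u4 have similarity 1.0, the prediction for (u4, m1) reduces to the plain average of their two ratings.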

(d) Consider the item-based way of predicting the missing ratings of movies m3 and m4 with adjusted cosine similarity, as suggested in Aggarwal 18.5.2.2. Why is it not a good solution here? Suggest an alternative item-based solution that could be used instead (no need to calculate the actual predictions).

The main conceptual difference from the user-based approach is that peer groups are constructed in terms of items rather than users. Therefore, similarities need to be computed between items (or columns in the ratings matrix). Before computing the similarities between the columns, the ratings matrix is normalized. As in the case of user-based ratings, the average of each row in the ratings matrix is subtracted from that row. Then, the cosine similarity between the normalized ratings $\overline{U} = (u_1 \ldots u_s)$ and $\overline{V} = (v_1 \ldots v_s)$ of a pair of items (columns) defines the similarity between them:

$$
Cosine(\overline{U}, \overline{V}) = \frac{\sum_{i=1}^{s}u_iv_i}{\sqrt{\sum_{i=1}^{s}u_i^2}\sqrt{\sum_{i=1}^{s}v_i^2}}
$$

This similarity is referred to as the adjusted cosine similarity, because the ratings are normalized before computing the similarity value.

In [18]:
# Replacing NaN back to 0 for the calculations
movie_ratings_normalized = movie_ratings.fillna(0)

# Normalizing the ratings matrix
for user in movie_ratings_normalized.index:
    user_mean = user_mean_ratings[user]
    movie_ratings_normalized.loc[user] = movie_ratings_normalized.loc[user].apply(lambda x: x - user_mean if x != 0 else 0)

movie_ratings_normalized


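|    | m1        | m2       | m3  | m4       | m5   | m6   |
|----|-----------|----------|-----|----------|------|------|
| u1 | 1.0       | -1.0     | 0.0 | 0.0      | 0.0  | 0.0  |
| u2 | 1.0       | -1.0     | 0.0 | 0.0      | 1.0  | -1.0 |
| u3 | 1.0       | -2.0     | 0.0 | 0.0      | -1.0 | 2.0  |
| u4 | 0.0       | -1.0     | 0.0 | 0.0      | 1.0  | 0.0  |
| u5 | -1.6      | 1.4      | 1.4 | 0.0      | -0.6 | -0.6 |
| u6 | -2.333333 | 0.666667 | 0.0 | 1.666667 | 0.0  | 0.0  |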


In [19]:
def adjusted_cosine_similarity(movie1, movie2):
    # Calculate the numerator of the adjusted cosine similarity formula
    numerator = (movie1 * movie2).sum()
    
    # Calculate the denominators of the adjusted cosine similarity formula
    denominator = np.sqrt((movie1 ** 2).sum()) * np.sqrt((movie2 ** 2).sum())
    
    # If the denominator is 0, return 0
    if denominator == 0:
        return 0
    
    # Calculate the adjusted cosine similarity
    similarity = numerator / denominator
    
    return similarity

# Creating a DataFrame to store the adjusted cosine similarity between movies
item_similarity = pd.DataFrame(index=movie_ratings_normalized.columns, columns=movie_ratings_normalized.columns)

# Calculating the adjusted cosine similarity between each pair of movies
for movie1 in movie_ratings_normalized.columns:
    for movie2 in movie_ratings_normalized.columns:
        if movie1 != movie2:
            similarity = adjusted_cosine_similarity(movie_ratings_normalized[movie1], movie_ratings_normalized[movie2])
            item_similarity.loc[movie1, movie2] = similarity

item_similarity


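|    | m1        | m2        | m3        | m4        | m5        | m6        |
|----|-----------|-----------|-----------|-----------|-----------|-----------|
| m1 | NaN       | -0.766296 | -0.482321 | -0.703384 | 0.157877  | 0.255205  |
| m2 | -0.766296 | NaN       | 0.456522  | 0.217391  | -0.149432 | -0.540857 |
| m3 | -0.482321 | 0.456522  | NaN       | 0.0       | -0.327327 | -0.259161 |
| m4 | -0.703384 | 0.217391  | 0.0       | NaN       | 0.0       | 0.0       |
| m5 | 0.157877  | -0.149432 | -0.327327 | 0.0       | NaN       | -0.622088 |
| m6 | 0.255205  | -0.540857 | -0.259161 | 0.0       | -0.622088 | NaN       |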


In [23]:
def predict_item_based_ratings(user_ratings, item_similarity, user_mean_ratings):
    predictions = pd.DataFrame(index=user_ratings.index, columns=['m3', 'm4'])

    for user in user_ratings.index:
        for movie in ['m3', 'm4']:
            # Skip if the user has already rated the movie
            if not np.isnan(user_ratings.loc[user, movie]):
                continue

            # Find movies that the user has rated
            rated_movies = user_ratings.loc[user].dropna().index

            # Calculate similarity with the target movie
            similarity_scores = item_similarity.loc[movie, rated_movies]

            # Calculate the predicted rating as: user_mean + sum(similarity * (rating - user_mean)) / sum(|similarity|)
            numerator = sum(similarity_scores * (user_ratings.loc[user, rated_movies] - user_mean_ratings[user]))
            denominator = sum(abs(similarity_scores))
            predicted_rating = user_mean_ratings[user] + numerator / denominator if denominator != 0 else user_mean_ratings[user]

            # Store the predicted rating
            predictions.loc[user, movie] = predicted_rating

    return predictions

# Calling the function to predict the missing ratings for movies m3 and m4
predicted_ratings_item_based = predict_item_based_ratings(movie_ratings, item_similarity.astype(float), user_mean_ratings)
predicted_ratings_item_based



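|    | m3       | m4       |
|----|----------|----------|
| u1 | NaN      | NaN      |
| u2 | NaN      | NaN      |
| u3 | NaN      | NaN      |
| u4 | NaN      | NaN      |
| u5 | NaN      | 5.152781 |
| u6 | 4.856233 | NaN      |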


The other entries are NaN because those users have already rated movies m3 and m4, so predictions were not needed.

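These predictions are based on the adjusted cosine similarities between movies, where the code above treats a missing rating as 0 after mean-centering and uses all of the user's rated movies, including those with negative similarity. Note that the predicted rating for (u5, m4) is about 5.15, which lies above the maximum of the 1–5 scale, so these values should be read with caution.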

Issues with Adjusted Cosine Similarity:

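- Sparsity: The ratings matrix is small and has missing entries, so each similarity rests on very few users. The problem is acute for m3 and m4: users u1–u4 each rate both movies exactly at their own mean, so their mean-centered ratings are 0. The only non-zero entry in column m3 comes from u5 and the only one in column m4 from u6, so the numerator of every similarity involving m3 or m4 comes from a single user, and the similarity between m3 and m4 over their co-rating users (u1–u4) is 0/0, i.e. undefined.

- Limited Range of Ratings: Since the ratings are normalized by subtracting the user’s average rating, the adjusted ratings can be negative, zero, or positive. A rating equal to the user's mean becomes 0 and then contributes nothing to the cosine, even though it is a genuine rating. This can lead to counterintuitive similarity values, especially when most users rate items positively.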

- Bias Towards Popular Items: Popular items (items rated by many users) can dominate the similarity calculations, potentially leading to biased recommendations.
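
The sparsity issue can be checked directly on Table 2. The sketch below is self-contained and uses plain Python; the names `ratings`, `centered` and `adjusted_cosine_corated` are introduced here for illustration. It assumes the similarity is restricted to users who rated both items, which differs from the notebook code above, where missing ratings are filled with 0.

```python
import math

# Table 2 as {user: {movie: rating}}; missing ratings are simply left out
ratings = {
    "u1": {"m1": 3, "m2": 1, "m3": 2, "m4": 2, "m6": 2},
    "u2": {"m1": 4, "m2": 2, "m3": 3, "m4": 3, "m5": 4, "m6": 2},
    "u3": {"m1": 4, "m2": 1, "m3": 3, "m4": 3, "m5": 2, "m6": 5},
    "u4": {"m2": 3, "m3": 4, "m4": 4, "m5": 5},
    "u5": {"m1": 2, "m2": 5, "m3": 5, "m5": 3, "m6": 3},
    "u6": {"m1": 1, "m2": 4, "m4": 5},
}

# Subtract each user's mean (over that user's rated movies) from their ratings
centered = {
    user: {m: r - sum(row.values()) / len(row) for m, r in row.items()}
    for user, row in ratings.items()
}

def adjusted_cosine_corated(item_a, item_b):
    """Adjusted cosine over users who rated both items; NaN when it is 0/0."""
    pairs = [(row[item_a], row[item_b]) for row in centered.values()
             if item_a in row and item_b in row]
    dot = sum(a * b for a, b in pairs)
    norm = (math.sqrt(sum(a * a for a, _ in pairs))
            * math.sqrt(sum(b * b for _, b in pairs)))
    return dot / norm if norm > 0 else float("nan")

# Centered ratings of m3 and m4: zero for u1-u4, non-zero for one user each
for item in ("m3", "m4"):
    print(item, {u: round(row[item], 3) for u, row in centered.items() if item in row})

print("sim(m3, m4) =", adjusted_cosine_corated("m3", "m4"))
```

Under this assumption the similarity between m3 and m4 comes out as NaN, because all four co-rating users contribute a centered rating of exactly 0 to both columns.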

Alternative Item-Based Solution: Jaccard Similarity

An alternative item-based solution that could be used instead is the Jaccard similarity, which measures the similarity between two sets. In the context of collaborative filtering, it can be used to measure the similarity between two items based on the sets of users who have rated them.

Jaccard Similarity between two items A and B is calculated as:

$$ J(A,B) = \frac{|A \cap B|}{|A \cup B|} $$

where $|A \cap B|$ is the number of users who have rated both items and $|A \cup B|$ is the number of users who have rated item A, item B, or both.

Advantages of Jaccard Similarity:

- Handles Sparsity Better: Since it is based on the presence or absence of ratings rather than the rating values themselves, it can handle sparse data better.
- Not Affected by Rating Scale: It does not depend on the actual rating values, making it unaffected by the scale of the ratings.
- Simple and Intuitive: It is easy to understand and implement.
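
As an illustration only (the exercise does not require actual predictions), the Jaccard similarity can be computed from the sets of users who rated each movie in Table 2. The names `raters` and `jaccard` below are hypothetical helpers, not part of the notebook.

```python
# Users who rated each movie in Table 2 (a rating of 0 means "not rated")
raters = {
    "m1": {"u1", "u2", "u3", "u5", "u6"},
    "m2": {"u1", "u2", "u3", "u4", "u5", "u6"},
    "m3": {"u1", "u2", "u3", "u4", "u5"},
    "m4": {"u1", "u2", "u3", "u4", "u6"},
    "m5": {"u2", "u3", "u4", "u5"},
    "m6": {"u1", "u2", "u3", "u5"},
}

def jaccard(item_a, item_b):
    """|A ∩ B| / |A ∪ B| for the rater sets of two items (0.0 if both are empty)."""
    a, b = raters[item_a], raters[item_b]
    union = a | b
    return len(a & b) / len(union) if union else 0.0

print(jaccard("m3", "m4"))  # 4 shared raters out of 6 users in the union
print(jaccard("m3", "m5"))  # 4 shared raters out of 5 users in the union
```

Unlike the adjusted cosine, this value is always defined for m3 and m4, although it ignores the rating values themselves and only reflects who rated what.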