<h1>Collaborative Filtering Movie Recommendation - User Based</h1>

The user-based collaborative filtering system focuses on the ‘ratings.json’ file, which contains 28,490,116 lines of user ratings from MovieLens. In this method, given a user u, a prediction for an item i is generated from the ratings for i given by the users in u’s neighborhood. The neighborhood consists of users whose interests are similar to those of user u. The prediction is computed with the formula below:

$$
\text{pred}(u, i) = \bar{r}_u + \frac{\sum_{n \in \text{neighbors}(u)} \text{sim}(u, n) \cdot (r_{ni} - \bar{r}_n)}{\sum_{n \in \text{neighbors}(u)} \text{sim}(u, n)}
$$

where $\bar{r}_u$ is the average rating of user u, $\bar{r}_n$ is the average rating of neighbor n, and $r_{ni}$ is the rating neighbor n gave to item i.
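As a quick check of the formula, here is a minimal worked example. All numbers are hypothetical toy values, not taken from the dataset: one user with two neighbors who both rated the item.

```python
import numpy as np

# Hypothetical toy values (not from ratings.json)
user_mean = 3.5                             # average rating of user u
sims = np.array([0.9, 0.6])                 # sim(u, n) for two neighbors
neighbor_ratings = np.array([4.0, 2.0])     # r_ni: each neighbor's rating of item i
neighbor_means = np.array([3.0, 3.0])       # average rating of each neighbor

# How far each neighbor's rating of i sits above or below their own average
deviation = neighbor_ratings - neighbor_means

# Similarity-weighted average deviation, added to the user's own average
pred = user_mean + np.sum(sims * deviation) / np.sum(sims)
print(round(pred, 2))
```

The more similar neighbor liked the item more than usual and the less similar one liked it less, so the prediction ends up slightly above the user's own average.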

For $\text{sim}(u, n)$, we will use Cosine Similarity:

$$
\cos(\mathbf{d1}, \mathbf{d2}) = \frac{\mathbf{d1} \cdot \mathbf{d2}}{\|\mathbf{d1}\| \|\mathbf{d2}\|}
$$

where $\mathbf{d1}$ and $\mathbf{d2}$ are the rating vectors of the two users, $\cdot$ indicates the vector dot product, and $\|\mathbf{d}\|$ is the length (or norm) of vector $\mathbf{d}$.
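A minimal sketch of this computation with NumPy is shown below. The helper name `cosine` and the two vectors are assumptions made for illustration; unrated movies are encoded as 0, which matches the `fillna(0)` used later in this notebook.

```python
import numpy as np

def cosine(d1, d2):
    """Cosine similarity of two 1-D vectors; returns 0.0 if either vector has zero norm."""
    norm = np.linalg.norm(d1) * np.linalg.norm(d2)
    return float(np.dot(d1, d2) / norm) if norm else 0.0

# Hypothetical rating vectors over the same three movies (0 = not rated)
a = np.array([5.0, 3.0, 0.0])
b = np.array([4.0, 0.0, 2.0])

sim_ab = cosine(a, b)   # dot product 20 divided by sqrt(34) * sqrt(20)
print(sim_ab)
```

Because ratings are non-negative, the similarity computed this way always lies between 0 and 1.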

<h3>import data</h3>

In [1]:
import pandas as pd

In [2]:
# Importing ratings.json
original_ratings = pd.read_json(r'F:\DS_Dataset\genome_2021\movie_dataset_public_final\raw\ratings.json', lines=True)
original_ratings.reset_index(drop=True, inplace=True)
original_ratings.head()

Unnamed: 0,item_id,user_id,rating
0,5,997206,3.0
1,10,997206,4.0
2,13,997206,4.0
3,17,997206,5.0
4,21,997206,4.0


In [3]:
# As described in the Readme file of the original dataset, ratings.json includes 
# 28,490,116 lines of user ratings on MovieLens.
original_ratings.shape

(28490116, 3)

In [4]:
# Checking for duplicated (item_id, user_id) pairs in the ratings
duplicates = original_ratings.duplicated(subset=['item_id', 'user_id'])
num_duplicates = duplicates.sum()
num_duplicates

240925

There are 240,925 duplicate ratings in the original data.

In [5]:
# For duplicated (item_id, user_id) pairs, keep the mean of their ratings
ratings = original_ratings.groupby(['item_id', 'user_id']).rating.mean().reset_index()
ratings.shape

(28249191, 3)
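The row count drops by exactly the number of duplicates (28,490,116 − 240,925 = 28,249,191). A small sketch of the same two steps on a hypothetical three-row frame (the `toy` data is made up for illustration) shows what the averaging does:

```python
import pandas as pd

# Hypothetical data: user 10 rated item 1 twice (3.0 and 5.0)
toy = pd.DataFrame({
    "item_id": [1, 1, 2],
    "user_id": [10, 10, 10],
    "rating":  [3.0, 5.0, 4.0],
})

# Second and later occurrences of an (item_id, user_id) pair count as duplicates
n_dup = toy.duplicated(subset=["item_id", "user_id"]).sum()

# Collapse each pair to a single row holding the mean rating
deduped = toy.groupby(["item_id", "user_id"]).rating.mean().reset_index()
print(n_dup, len(deduped))
```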

In [6]:
# check unique numbers of movies
len(ratings['item_id'].unique())

67873

In [7]:
# check unique numbers of users
len(ratings['user_id'].unique())

247383

<h3>separate training set and test set</h3>

In this project, 1/100 of users will be randomly selected to test the performance of the model.

In [8]:
import numpy as np

np.random.seed(42)

# Getting the unique users
unique_users = ratings['user_id'].unique()

# Calculating the number of users to sample (1% of the unique users)
num_users_to_sample = len(unique_users) // 100

# Randomly selecting the users
sampled_users = np.random.choice(unique_users, size=num_users_to_sample, replace=False)

# Filtering the original DataFrame for the sampled users (test set)
users_test = ratings[ratings['user_id'].isin(sampled_users)]

# Filtering for the remaining users (training set)
users_train = ratings[~ratings['user_id'].isin(sampled_users)]

In [9]:
print("Unique movies in user_test: ", len(users_test['item_id'].unique()))
print("Unique users in user_test: ", len(users_test['user_id'].unique()))
print("Unique movies in user_train: ", len(users_train['item_id'].unique()))
print("Unique users in user_train: ", len(users_train['user_id'].unique()))

Unique movies in user_test:  16553
Unique users in user_test:  2473
Unique movies in user_train:  67787
Unique users in user_train:  244910
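The split is made per user, so every rating of a sampled user goes to the test set and no user appears in both sets. The sketch below repeats the idea on hypothetical toy data and checks those two properties. It uses `np.random.default_rng` rather than `np.random.seed`, which is a substitution made here for illustration, and the names `toy`, `held_out`, `test` and `train` are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical data: 200 users, each rating the same 3 items
toy = pd.DataFrame({
    "user_id": np.repeat(np.arange(200), 3),
    "item_id": np.tile([1, 2, 3], 200),
    "rating": 4.0,
})

# Hold out 1% of the unique users
users = toy["user_id"].unique()
held_out = rng.choice(users, size=len(users) // 100, replace=False)

test = toy[toy["user_id"].isin(held_out)]
train = toy[~toy["user_id"].isin(held_out)]

# No user overlaps, and no rating is lost
print(set(test["user_id"]).isdisjoint(train["user_id"]), len(test) + len(train) == len(toy))
```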


<h3>find nearest neighbours based on cosine similarity</h3>

In [10]:
# given a user ID and the ratings DataFrame, return a DataFrame containing the movies watched by the user and their ratings.
def get_user_movie_ratings(user_id, ratings_df):
    user_ratings = ratings_df[ratings_df['user_id'] == user_id][['item_id', 'rating']]
    return user_ratings

In [11]:
# example of get_user_movie_ratings
users_test.head()

Unnamed: 0,item_id,user_id,rating
44,1,577,5.0
117,1,1595,4.0
119,1,1666,4.0
256,1,3871,5.0
321,1,4752,4.0


In [12]:
input_example = get_user_movie_ratings(577, users_test)
input_example

Unnamed: 0,item_id,rating
44,1,5.0
127516,6,3.0
245927,16,4.0
327114,21,4.0
368531,24,3.0
...,...,...
17967737,4587,4.0
17974249,4605,3.0
17992051,4626,4.0
18723305,5060,5.0


In [13]:
from sklearn.metrics.pairwise import cosine_similarity

In [14]:
def find_nearest_neighbours(input_df, ratings_df, n):
    """
    Given an input dataframe containing movie id and rating by a user, and the ratings DataFrame, return a DataFrame containing
    the user IDs of at most n nearest neighbours and their cosine similarity to this user.

    Parameters:
    input_df (DataFrame): An input dataframe containing movie id and rating by a user.
    ratings_df (DataFrame): The DataFrame containing movie ratings in the database.
    n: the number of top nearest neighbours to find

    Returns:
    DataFrame: A DataFrame with columns ['user_id', 'cos_sim'] for the specified user.
    """
    # Identify movies watched by input user
    watched_movies = input_df['item_id'].unique()
    
    # Keep only the ratings for movies the input user has watched
    relevant_ratings = ratings_df[ratings_df['item_id'].isin(watched_movies)]
    
    # Pivot the relevant ratings dataframe to create a user-item matrix
    user_item_matrix = relevant_ratings.pivot_table(index='user_id', columns='item_id', values='rating').fillna(0)
    
    # Convert the input_df to a user profile vector
    user_profile = input_df.set_index('item_id').reindex(user_item_matrix.columns).fillna(0).T
    
    # Compute cosine similarity
    similarity = cosine_similarity(user_profile, user_item_matrix)
    
    similarity_df = pd.DataFrame(similarity.T, index=user_item_matrix.index, columns=['cos_sim'])

    # Sort by cosine similarity and select top n
    top_neighbours = similarity_df.sort_values(by='cos_sim', ascending=False).head(n).reset_index()
    
    return top_neighbours

In [15]:
nearest_neighbours_example = find_nearest_neighbours(input_example, users_train, 30)
nearest_neighbours_example

Unnamed: 0,user_id,cos_sim
0,741204,0.964039
1,928461,0.960161
2,347565,0.955879
3,880612,0.942304
4,168279,0.940648
5,557480,0.927374
6,586936,0.920912
7,24744,0.914967
8,176194,0.906819
9,778488,0.906427


<h3>generate ratings of the prediction</h3>

In [16]:
def predict_rating(input_df, item_id, nearest_neighbors, ratings_df):
    """
    Predict the rating for a given user and movie, based on nearest neighbors.

    Parameters:
    input_df (DataFrame): Ratings of movies of the user.
    item_id (int): The movie ID to be predicted.
    nearest_neighbors (DataFrame): DataFrame of nearest neighbors and their cosine similarity.
    ratings_df (DataFrame): DataFrame containing movie ratings.

    Returns:
    float: The predicted rating.
    """
    # Filter the ratings to include only those from the nearest neighbors
    neighbors_ratings = ratings_df[ratings_df['user_id'].isin(nearest_neighbors['user_id'])]

    # Calculate the average rating for the user and each neighbor
    user_avg_rating = input_df['rating'].mean()
    neighbors_avg_ratings = neighbors_ratings.groupby('user_id')['rating'].mean()

    # Calculate the numerator and denominator for the prediction formula
    numerator = 0
    denominator = 0
    for _, row in nearest_neighbors.iterrows():
        neighbor_id = row['user_id']
        cos_sim = row['cos_sim']

        # Check if the neighbor has rated the movie
        neighbor_ratings_for_movie = neighbors_ratings[(neighbors_ratings['user_id'] == neighbor_id) & (neighbors_ratings['item_id'] == item_id)]
        if not neighbor_ratings_for_movie.empty:
            neighbor_rating = neighbor_ratings_for_movie['rating'].iloc[0]
            neighbor_avg = neighbors_avg_ratings[neighbor_id]

            numerator += cos_sim * (neighbor_rating - neighbor_avg)
            denominator += abs(cos_sim)

    # Handle the case where the denominator is zero
    if denominator == 0:
        return user_avg_rating

    # Calculate and return the predicted rating
    predicted_rating = user_avg_rating + (numerator / denominator)
    
    # Limit the predicted rating to the range 0.5-5
    predicted_rating = max(0.5, min(predicted_rating, 5))
    return predicted_rating

In [17]:
predict_rating(input_example, 16, nearest_neighbours_example, users_train)

4.269251510631981

In this example, we predict the rating of user "577" for movie id "16" using the 30 nearest neighbours. The predicted rating is 4.27 and the actual rating is 4.0.

In [18]:
pred_errors = []
for movie_id in input_example["item_id"]:
    # Predict the rating
    pred_rating = predict_rating(input_example, movie_id, nearest_neighbours_example, users_train)

    # Get the actual rating
    actual_rating = input_example[input_example['item_id'] == movie_id]['rating'].iloc[0]

    # Calculate the prediction error
    pred_error = abs(pred_rating - actual_rating) / actual_rating
    pred_errors.append(pred_error)

In [19]:
mean_pred_error = np.mean(pred_errors)
mean_pred_error

0.2535088282990312

When using 30 nearest neighbours, the average prediction error is about 25% for user "577". The number of neighbors will be determined later.
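The error used here is the absolute error divided by the actual rating, averaged over the user's movies. A minimal vectorized sketch of that metric follows; the two arrays are hypothetical values made up for illustration.

```python
import numpy as np

# Hypothetical actual and predicted ratings for three movies
actual = np.array([4.0, 3.0, 5.0])
predicted = np.array([4.4, 2.4, 4.0])

# Absolute error relative to the actual rating, one value per movie
rel_errors = np.abs(predicted - actual) / actual

# Average over the movies, as in the loop above
mean_rel_error = rel_errors.mean()
print(mean_rel_error)
```

Because each error is divided by the actual rating, the same absolute miss counts more for a low-rated movie than for a high-rated one.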

<h3>generate movies to be predicted</h3>

In [20]:
def recommend_movies(input_df, ratings_df, n):
    """
    Recommend movies to a user by combining nearest neighbor finding, rating prediction, 
    and sorting by predicted ratings, with a limit of top 10 movies per neighbor.

    Parameters:
    input_df (DataFrame): Ratings of movies by the user.
    ratings_df (DataFrame): DataFrame containing movie ratings in the database.
    n (int): Number of top nearest neighbours to consider.

    Returns:
    DataFrame: Movies recommended to the user, sorted by predicted ratings.
    """
    # Find Nearest Neighbours
    watched_movies = input_df['item_id'].unique()
    relevant_ratings = ratings_df[ratings_df['item_id'].isin(watched_movies)]
    user_item_matrix = relevant_ratings.pivot_table(index='user_id', columns='item_id', values='rating').fillna(0)
    user_profile = input_df.set_index('item_id').reindex(user_item_matrix.columns).fillna(0).T
    similarity = cosine_similarity(user_profile, user_item_matrix)
    similarity_df = pd.DataFrame(similarity.T, index=user_item_matrix.index, columns=['cos_sim'])
    top_neighbours = similarity_df.sort_values(by='cos_sim', ascending=False).head(n).reset_index()

    # Limit to the top 10 rated movies for each neighbour
    top_movies_by_neighbour = {}
    for neighbor_id in top_neighbours['user_id']:
        top_movies = ratings_df[ratings_df['user_id'] == neighbor_id].sort_values(by='rating', ascending=False).head(10)['item_id'].tolist()
        top_movies_by_neighbour[neighbor_id] = top_movies

    # Aggregate Movies to Predict
    movies_to_predict = set()
    for movies in top_movies_by_neighbour.values():
        movies_to_predict.update(movies)
    movies_to_predict -= set(input_df['item_id'])

    # Predict Ratings
    predictions = []
    user_avg_rating = input_df['rating'].mean()
    for movie_id in movies_to_predict:
        numerator, denominator = 0, 0
        for _, row in top_neighbours.iterrows():
            neighbor_id = row['user_id']
            if movie_id in top_movies_by_neighbour[neighbor_id]:
                cos_sim = row['cos_sim']
                neighbor_ratings = ratings_df[(ratings_df['user_id'] == neighbor_id) & (ratings_df['item_id'] == movie_id)]
                if not neighbor_ratings.empty:
                    neighbor_rating = neighbor_ratings['rating'].iloc[0]
                    neighbor_avg_rating = ratings_df[ratings_df['user_id'] == neighbor_id]['rating'].mean()
                    numerator += cos_sim * (neighbor_rating - neighbor_avg_rating)
                    denominator += abs(cos_sim)

        # Calculate Predicted Rating
        pred_rating = user_avg_rating if denominator == 0 else user_avg_rating + (numerator / denominator)
        predictions.append((movie_id, pred_rating))

    # Sort and Return Recommendations
    recommendations_df = pd.DataFrame(predictions, columns=['item_id', 'pred_rating'])
    return recommendations_df.sort_values(by='pred_rating', ascending=False)

In [21]:
movie_recommended_example = recommend_movies(input_example, users_train, 10)

In [22]:
metadata = pd.read_json(r'F:\DS_Dataset\genome_2021\movie_dataset_public_final\raw\metadata.json', lines=True)

In [23]:
movie_recommended_df_example = movie_recommended_example.merge(metadata[['item_id', 'title']], on='item_id', how='left')
movie_recommended_df_example

Unnamed: 0,item_id,pred_rating,title
0,5617,6.317437,Secretary (2002)
1,3993,6.317437,Quills (2000)
2,2352,6.317437,"Big Chill, The (1983)"
3,1208,6.317437,Apocalypse Now (1979)
4,4274,6.317437,Cleopatra (1963)
...,...,...,...
56,3556,5.095212,"Virgin Suicides, The (1999)"
57,1401,4.974376,Ghosts of Mississippi (1996)
58,1393,4.974376,Jerry Maguire (1996)
59,3634,4.974376,Seven Days in May (1964)


In this example, we will recommend these movies to user "577".
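The `pred_rating` values in the table above go up to 6.317437, which is outside the 0.5 to 5 range that `predict_rating` enforces, because `recommend_movies` does not clamp its output. If the scores are shown to users as ratings, one option is to clip them for display while still ranking by the raw score. A minimal sketch, reusing three rows from the table; the column name `pred_rating_display` is an assumption:

```python
import pandas as pd

# Three rows copied from the recommendation table above
recs = pd.DataFrame({
    "item_id": [5617, 3556, 1401],
    "pred_rating": [6.317437, 5.095212, 4.974376],
})

# Clip to the rating scale for display; keep pred_rating untouched for sorting
recs["pred_rating_display"] = recs["pred_rating"].clip(lower=0.5, upper=5.0)
print(recs)
```

Sorting by the unclipped column preserves the original ranking, since clipping would turn every score above 5 into a tie.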

<h3>determine the number of nearest neighbors to be used in the system</h3>

To determine the value of N for nearest-neighbour selection, we randomly select 100 users from the users_test data, calculate each user's mean prediction error for several values of N, and then compare the error averaged over these 100 users across the different N values.

In [24]:
np.random.seed(42)

sampled_user_ids = np.random.choice(users_test['user_id'].unique(), size=100, replace=False)

results = []

for user_id in sampled_user_ids:
    input_df = get_user_movie_ratings(user_id, users_test)

    for N in range(10, 101, 10):
        nearest_neighbors = find_nearest_neighbours(input_df, users_train, N)
        pred_errors = []

        for movie_id in input_df["item_id"]:
            # Predict the rating
            pred_rating = predict_rating(input_df, movie_id, nearest_neighbors, users_train)

            # Get the actual rating
            actual_rating = input_df[input_df['item_id'] == movie_id]['rating'].iloc[0]

            # Calculate the prediction error
            pred_error = abs(pred_rating - actual_rating) / actual_rating
            pred_errors.append(pred_error)

        # Calculate mean prediction error for the user
        mean_pred_error = np.mean(pred_errors)

        # Append the result
        results.append({'user_id': user_id, 'mean_pred_error': mean_pred_error, 'N': N})

# Convert the results to a DataFrame
results_df = pd.DataFrame(results)
results_df

Unnamed: 0,user_id,mean_pred_error,N
0,634057,0.267869,10
1,634057,0.262709,20
2,634057,0.260150,30
3,634057,0.261534,40
4,634057,0.258899,50
...,...,...,...
995,554508,0.138509,60
996,554508,0.137697,70
997,554508,0.135506,80
998,554508,0.134104,90


In [25]:
results_df.to_csv('user_based_N_comparison.csv', index=False)

In [27]:
# Combine results for different users and calculate the mean error for each N value.
grouped_df = results_df.groupby('N')['mean_pred_error'].mean().reset_index()
grouped_df

Unnamed: 0,N,mean_pred_error
0,10,0.287522
1,20,0.284017
2,30,0.283262
3,40,0.282621
4,50,0.282217
5,60,0.281422
6,70,0.280778
7,80,0.279965
8,90,0.280147
9,100,0.280762


Surprisingly, the prediction error barely changes when N increases from 10 to 100: the mean error stays between about 0.280 and 0.288.

Next, we use user "577" to compare the computation time for different N values.

In [30]:
import time

# Specified user ID: 577
user_id = 577

times = []

input_df = get_user_movie_ratings(user_id, users_test)

for N in range(10, 101, 10):
    start_time = time.time()  # Record the start time

    # Find nearest neighbors and calculate prediction errors
    nearest_neighbors = find_nearest_neighbours(input_df, users_train, N)
    pred_errors = []
    
    for movie_id in input_df["item_id"]:
        # Predict the rating
        pred_rating = predict_rating(input_df, movie_id, nearest_neighbors, users_train)

        # Get the actual rating
        actual_rating = input_df[input_df['item_id'] == movie_id]['rating'].iloc[0]

        # Calculate prediction error
        pred_error = abs(pred_rating - actual_rating) / actual_rating
        pred_errors.append(pred_error)

    # Calculate the mean prediction error
    mean_pred_error = np.mean(pred_errors)

    end_time = time.time()  # Record the end time
    elapsed_time = end_time - start_time  # Calculate the elapsed time

    times.append({'user_id': user_id, 'mean_pred_error': mean_pred_error, 'N': N, 'time': elapsed_time})

times_df = pd.DataFrame(times)

times_df

Unnamed: 0,user_id,mean_pred_error,N,time
0,577,0.268966,10,116.520718
1,577,0.257074,20,120.825759
2,577,0.253509,30,176.687872
3,577,0.250622,40,222.527967
4,577,0.247911,50,163.247308
5,577,0.247377,60,194.898054
6,577,0.244011,70,223.305178
7,577,0.24332,80,246.078703
8,577,0.243031,90,257.450154
9,577,0.242811,100,202.114818


In [32]:
import time

# Specified user ID: 577
user_id = 577

times2 = []

input_df = get_user_movie_ratings(user_id, users_test)

for N in range(1, 11, 1):
    start_time = time.time()  # Record the start time

    # Find nearest neighbors and calculate prediction errors
    nearest_neighbors = find_nearest_neighbours(input_df, users_train, N)
    pred_errors = []
    
    for movie_id in input_df["item_id"]:
        # Predict the rating
        pred_rating = predict_rating(input_df, movie_id, nearest_neighbors, users_train)

        # Get the actual rating
        actual_rating = input_df[input_df['item_id'] == movie_id]['rating'].iloc[0]

        # Calculate prediction error
        pred_error = abs(pred_rating - actual_rating) / actual_rating
        pred_errors.append(pred_error)

    # Calculate the mean prediction error
    mean_pred_error = np.mean(pred_errors)

    end_time = time.time()  # Record the end time
    elapsed_time = end_time - start_time  # Calculate the elapsed time

    times2.append({'user_id': user_id, 'mean_pred_error': mean_pred_error, 'N': N, 'time': elapsed_time})

times_df2 = pd.DataFrame(times2)

times_df2

Unnamed: 0,user_id,mean_pred_error,N,time
0,577,0.282821,1,44.121829
1,577,0.275108,2,59.232289
2,577,0.275252,3,93.412805
3,577,0.275711,4,92.577237
4,577,0.273892,5,106.194109
5,577,0.271537,6,106.638918
6,577,0.271685,7,107.431589
7,577,0.276044,8,119.600437
8,577,0.27253,9,124.573483
9,577,0.268966,10,131.458186


Combining prediction error and elapsed time, we choose N = 20 as the default value for our algorithm. The following function (after optimization) will be used for our movie recommendation website:

In [None]:

def recommend_movies(input_df, ratings_df, n=20):
    """
    Recommend movies to a user by combining nearest neighbor finding, rating prediction, 
    and sorting by predicted ratings, with a limit of top 10 movies per neighbor.

    Parameters:
    input_df (DataFrame): Ratings of movies by the user.
    ratings_df (DataFrame): DataFrame containing movie ratings in the database.
    n (int): Number of top nearest neighbours to consider.

    Returns:
    DataFrame: Movies recommended to the user, sorted by predicted ratings.
    """
    # Find Nearest Neighbours
    watched_movies = input_df['item_id'].unique()
    relevant_ratings = ratings_df[ratings_df['item_id'].isin(watched_movies)]
    user_item_matrix = relevant_ratings.pivot_table(index='user_id', columns='item_id', values='rating').fillna(0)
    user_profile = input_df.set_index('item_id').reindex(user_item_matrix.columns).fillna(0).T
    similarity = cosine_similarity(user_profile, user_item_matrix)
    similarity_df = pd.DataFrame(similarity.T, index=user_item_matrix.index, columns=['cos_sim'])
    top_neighbours = similarity_df.sort_values(by='cos_sim', ascending=False).head(n).reset_index()

    # Limit to the top 10 rated movies for each neighbour
    top_movies_by_neighbour = {}
    for neighbor_id in top_neighbours['user_id']:
        top_movies = ratings_df[ratings_df['user_id'] == neighbor_id].sort_values(by='rating', ascending=False).head(10)['item_id'].tolist()
        top_movies_by_neighbour[neighbor_id] = top_movies

    # Aggregate Movies to Predict
    movies_to_predict = set()
    for movies in top_movies_by_neighbour.values():
        movies_to_predict.update(movies)
    movies_to_predict -= set(input_df['item_id'])

    # Predict Ratings
    predictions = []
    user_avg_rating = input_df['rating'].mean()
    for movie_id in movies_to_predict:
        numerator, denominator = 0, 0
        for _, row in top_neighbours.iterrows():
            neighbor_id = row['user_id']
            if movie_id in top_movies_by_neighbour[neighbor_id]:
                cos_sim = row['cos_sim']
                neighbor_ratings = ratings_df[(ratings_df['user_id'] == neighbor_id) & (ratings_df['item_id'] == movie_id)]
                if not neighbor_ratings.empty:
                    neighbor_rating = neighbor_ratings['rating'].iloc[0]
                    neighbor_avg_rating = ratings_df[ratings_df['user_id'] == neighbor_id]['rating'].mean()
                    numerator += cos_sim * (neighbor_rating - neighbor_avg_rating)
                    denominator += abs(cos_sim)

        # Calculate Predicted Rating
        pred_rating = user_avg_rating if denominator == 0 else user_avg_rating + (numerator / denominator)
        predictions.append((movie_id, pred_rating))

    # Sort and Return Recommendations
    recommendations_df = pd.DataFrame(predictions, columns=['item_id', 'pred_rating'])
    return recommendations_df.sort_values(by='pred_rating', ascending=False)
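
The nested loops in `recommend_movies` filter `ratings_df` once per movie and neighbour, which is costly on tens of millions of rows. One possible way to compute the same weighted sum with grouped operations is sketched below. It is an illustration under stated assumptions, not the function used on the website: `predict_all_items` and the toy data are hypothetical, and unlike `recommend_movies` it scores every item the neighbours rated rather than only each neighbour's top 10, and it does not remove movies the user has already rated.

```python
import pandas as pd

def predict_all_items(user_avg, neighbours, ratings_df):
    """Hypothetical helper (not part of the notebook): predict a rating for every
    item rated by at least one neighbour, without per-item DataFrame filtering."""
    # Keep only the neighbours' ratings and attach each neighbour's cos_sim
    nb = ratings_df.merge(neighbours[["user_id", "cos_sim"]], on="user_id")
    # Deviation of each rating from that neighbour's own mean rating
    nb["dev"] = nb["rating"] - nb.groupby("user_id")["rating"].transform("mean")
    nb["weighted_dev"] = nb["cos_sim"] * nb["dev"]
    nb["abs_sim"] = nb["cos_sim"].abs()
    # Per-item numerator and denominator of the prediction formula
    sums = nb.groupby("item_id")[["weighted_dev", "abs_sim"]].sum()
    pred = user_avg + sums["weighted_dev"] / sums["abs_sim"]
    # Where the denominator is zero, fall back to the user's average rating
    pred = pred.where(sums["abs_sim"] > 0, user_avg)
    return pred.rename("pred_rating").reset_index()

# Hypothetical toy data: two neighbours, three items
toy_ratings = pd.DataFrame({
    "item_id": [1, 2, 1, 3],
    "user_id": [101, 101, 102, 102],
    "rating":  [5.0, 3.0, 4.0, 2.0],
})
toy_neighbours = pd.DataFrame({"user_id": [101, 102], "cos_sim": [0.8, 0.4]})

toy_pred = predict_all_items(3.5, toy_neighbours, toy_ratings)
print(toy_pred.sort_values(by="pred_rating", ascending=False))
```

Each neighbour's mean is taken over all of that neighbour's ratings in `ratings_df`, as in `recommend_movies`. Restricting to the top 10 movies per neighbour and dropping already-rated movies would need additional filtering before the `groupby`.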