In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Data import

In [3]:
movies = pd.read_csv('../data/movies.csv')
movies.head()

In [4]:
ratings = pd.read_csv('../data/ratings.csv')
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


Let's look at the distribution of the rating values, expressed as proportions of all ratings.

In [5]:
ratings.rating.value_counts(normalize=True)

rating
4.0    0.265957
3.0    0.198808
5.0    0.131015
3.5    0.130271
4.5    0.084801
2.0    0.074884
2.5    0.055040
1.0    0.027877
1.5    0.017762
0.5    0.013586
Name: proportion, dtype: float64

Now let's look at the most popular movies by average rating.

# Popularity recommender

In [6]:
# Group by movieId, extract the ratings, then aggregate the mean rating and the count of ratings per movie
rating_count_df = ratings.groupby('movieId')['rating'].agg(['mean', 'count']).reset_index()
# Get the 5 movies with the highest average rating
rating_count_df.nlargest(5, ['mean', 'count'])

Unnamed: 0,movieId,mean,count
48,53,5.0,2
87,99,5.0,2
869,1151,5.0,2
2593,3473,5.0,2
4384,6442,5.0,2


We see that these movies have received 5 stars, but two ratings are far too few to make that average reliable, so let's swap the sort order and rank by number of ratings first.

In [7]:
# Sort the movies by number of ratings and then mean rating.
rating_count_df.nlargest(5, ['count', 'mean'])

Unnamed: 0,movieId,mean,count
314,356,4.164134,329
277,318,4.429022,317
257,296,4.197068,307
510,593,4.16129,279
1938,2571,4.192446,278
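
The effect of the column order can be seen on a small made-up table (the values below are hypothetical, not taken from the dataset). `nlargest` orders by the first column in the list and uses the later columns only to break ties, so putting `mean` first surfaces rarely rated movies with perfect scores, while putting `count` first surfaces the most-rated ones.

```python
import pandas as pd

# Hypothetical per-movie statistics, for illustration only
toy = pd.DataFrame({
    "movieId": [1, 2, 3, 4],
    "mean": [5.0, 5.0, 4.2, 4.4],
    "count": [2, 1, 300, 250],
})

# Order by mean first; count only breaks ties between equal means
by_mean = toy.nlargest(2, ["mean", "count"])

# Order by count first; mean only breaks ties between equal counts
by_count = toy.nlargest(2, ["count", "mean"])

print(by_mean["movieId"].tolist())   # [1, 2]
print(by_count["movieId"].tolist())  # [3, 4]
```

Neither ordering combines the two signals; each one simply lets one column dominate the other.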


Now let's use the Bayesian average to weight each movie's mean rating against its number of ratings.

### Bayesian Average

$$
\text{Bayesian Average} = \frac{(C \cdot M) + (N \cdot R)}{C + N}
$$

$R$: The average rating of the item (e.g., movie).

$N$: The number of ratings for the item.

$M$: The mean of all ratings across all items (the global average, used as the prior).

$C$: A constant representing the "weight" of the prior (e.g., how much influence the global average has).

If $N$ is small (few ratings), the Bayesian average stays close to the global average $M$. For example, if only two people rate a movie and both give it 5 stars, that is weak evidence of quality, so its score is pulled down towards the average.
If $N$ is large (many ratings), the movie's own average rating $R$ has more influence.

A large $C$ gives more weight to the global average, which reduces the influence of small sample sizes.
A small $C$ lets each item's own average have more influence.
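
As a quick numeric check of the formula, here is a minimal sketch. The function name `bayesian_average` and the values chosen for $M$ and $C$ are assumptions made for illustration; they are not the values used later in this notebook.

```python
def bayesian_average(R, N, M, C):
    """Shrink an item's mean rating R (based on N ratings) towards the prior mean M with weight C."""
    return (C * M + N * R) / (C + N)

M = 3.5  # hypothetical global mean rating
C = 10   # hypothetical weight of the prior

# Two 5-star ratings: the score is pulled most of the way back to M
few = bayesian_average(R=5.0, N=2, M=M, C=C)

# 300 ratings averaging 4.4: the score stays close to the movie's own mean
many = bayesian_average(R=4.4, N=300, M=M, C=C)

print(few)             # 3.75
print(round(many, 3))  # 4.371
```

With these numbers the movie with 300 ratings ends up ranked above the one with two perfect ratings, which is the behaviour described above.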

In [8]:
"""
M = global_mean_rating
C = threshold_rating
R = rating_count_df['mean']
N = rating_count_df['count']

We use the 75th percentile of the per-movie rating counts as C, the weight of the prior.
75th percentile = 75% of the movies have fewer ratings than this value.
So a movie must be in the top 25% of movies by number of ratings for its own average to outweigh the global average.
"""

def recommend_popular_movies(n, movies, ratings):
    """Return the n movies with the highest Bayesian-average rating, with titles attached."""
    # M: mean of all individual ratings
    global_mean_rating = ratings['rating'].mean()
    # R and N: mean rating and number of ratings per movie
    rating_count_df = ratings.groupby('movieId')['rating'].agg(['mean', 'count']).reset_index()
    # C: 75th percentile of the per-movie rating counts
    threshold_rating = rating_count_df['count'].quantile(0.75)

    # Calculate the Bayesian average weighted rating
    rating_count_df['weighted_rating'] = (
        (rating_count_df['count'] * rating_count_df['mean'] + threshold_rating * global_mean_rating) /
        (rating_count_df['count'] + threshold_rating)
    )

    # Sort movies by the weighted rating, best first, and keep the top n
    top_movies = rating_count_df.sort_values(by='weighted_rating', ascending=False).head(n)

    # Add the titles
    return top_movies.merge(movies[['movieId', 'title']], on='movieId', how='left')

popular_movies = recommend_popular_movies(5, movies, ratings)
popular_movies

Unnamed: 0,movieId,mean,count,weighted_rating,title
0,318,4.429022,317,4.403417,"Shawshank Redemption, The (1994)"
1,858,4.289062,192,4.253801,"Godfather, The (1972)"
2,2959,4.272936,218,4.242352,Fight Club (1999)
3,1221,4.25969,129,4.210246,"Godfather: Part II, The (1974)"
4,50,4.237745,204,4.206639,"Usual Suspects, The (1995)"
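
The choice of percentile for $C$ is a tuning decision. The sketch below shows how it can change the top result on a small made-up table. The helper `top_movie`, the movie IDs, and all numbers are hypothetical, and the global mean is computed as the count-weighted mean of the per-movie means, which matches the mean of all individual ratings.

```python
import pandas as pd

# Hypothetical per-movie statistics, for illustration only
stats = pd.DataFrame({
    "movieId": [101, 102, 103, 104],
    "mean": [5.0, 4.4, 3.0, 3.8],
    "count": [2, 300, 50, 10],
})

# Global mean over all individual ratings, recovered from the per-movie stats
global_mean = (stats["mean"] * stats["count"]).sum() / stats["count"].sum()

def top_movie(stats, global_mean, q):
    """Return the movieId with the highest Bayesian average when C is the q-quantile of the counts."""
    c = stats["count"].quantile(q)
    weighted = (stats["count"] * stats["mean"] + c * global_mean) / (stats["count"] + c)
    return stats.loc[weighted.idxmax(), "movieId"]

top_low = top_movie(stats, global_mean, 0.10)   # small C: weak prior
top_high = top_movie(stats, global_mean, 0.75)  # larger C: stronger prior

print(top_low)   # 101, the movie with two 5-star ratings
print(top_high)  # 102, the movie with 300 ratings
```

A low percentile leaves the two-rating movie on top, while the 75th percentile pulls it far enough towards the global mean for the heavily rated movie to win.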
