# 1. Simple Recommendation System

In this section, we'll explore a subset of a movie dataset, consider which metrics make sense for ranking movies, and finally implement a weighted rating for simple recommendations.

## 1A. Import statements (and setting up Kaggle API)

First, we need to load our data. We can either download it directly from this link:
* https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset/data

Or, we can use Kaggle's API to access it:
* https://www.kaggle.com/docs/api#authentication 

The dataset we are using is a subset of the Full MovieLens Dataset, which can be found here:
* https://grouplens.org/datasets/movielens/latest

In [3]:
import kagglehub
import pandas as pd
import os

path = kagglehub.dataset_download("rounakbanik/the-movies-dataset")
metadata_path = os.path.join(path, 'movies_metadata.csv')

metadata = pd.read_csv(metadata_path, low_memory=False)
metadata.head(3)

|  | adult | belongs_to_collection | budget | genres | homepage | id | imdb_id | original_language | original_title | overview | ... | release_date | revenue | runtime | spoken_languages | status | tagline | title | video | vote_average | vote_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | {'id': 10194, 'name': 'Toy Story Collection', ... | 30000000 | [{'id': 16, 'name': 'Animation'}, {'id': 35, '... | http://toystory.disney.com/toy-story | 862 | tt0114709 | en | Toy Story | Led by Woody, Andy's toys live happily in his ... | ... | 1995-10-30 | 373554033.0 | 81.0 | [{'iso_639_1': 'en', 'name': 'English'}] | Released | NaN | Toy Story | False | 7.7 | 5415.0 |
| 1 | False | NaN | 65000000 | [{'id': 12, 'name': 'Adventure'}, {'id': 14, '... | NaN | 8844 | tt0113497 | en | Jumanji | When siblings Judy and Peter discover an encha... | ... | 1995-12-15 | 262797249.0 | 104.0 | [{'iso_639_1': 'en', 'name': 'English'}, {'iso... | Released | Roll the dice and unleash the excitement! | Jumanji | False | 6.9 | 2413.0 |
| 2 | False | {'id': 119050, 'name': 'Grumpy Old Men Collect... | 0 | [{'id': 10749, 'name': 'Romance'}, {'id': 35, ... | NaN | 15602 | tt0113228 | en | Grumpier Old Men | A family wedding reignites the ancient feud be... | ... | 1995-12-22 | 0.0 | 101.0 | [{'iso_639_1': 'en', 'name': 'English'}] | Released | Still Yelling. Still Fighting. Still Ready for... | Grumpier Old Men | False | 6.5 | 92.0 |


## 1B. Metrics
If we want to build a recommendation system, we need a metric that tells us which movies to recommend. 

A basic recommendation system might directly use the rating of the movie. But this doesn't account for the number of ratings -- what if one movie has a perfect rating from 2 users, while another has a 9/10 from 10,000 users? 

Consequently, we need a weighted rating that takes into account the average rating as well as the number of votes. 

The equation can be represented as follows:
$$\text{Weighted Rating } (WR) = \left(\frac{v}{v + m} \cdot R\right) + \left(\frac{m}{v + m} \cdot C\right)$$

- v is the number of votes for the movie;
- m is the minimum votes required to be listed in the chart;
- R is the average rating of the movie;
- C is the mean vote across the whole dataset.
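
To make the formula concrete, here is a minimal sketch that applies it to the two hypothetical movies from the example above (a perfect score from 2 users versus a 9/10 from 10,000 users). The helper name `weighted_rating_scalar` is made up for this illustration, and the values of `m` and `C` are the ones this notebook computes further down (m = 160, C ≈ 5.62).

```python
def weighted_rating_scalar(v, R, m, C):
    """Blend a movie's own average R with the global mean C.

    The more votes v a movie has relative to m, the more weight R gets.
    """
    return (v / (v + m)) * R + (m / (v + m)) * C

# Values of m and C as computed later in this notebook (C rounded here)
m_example = 160.0
C_example = 5.618207

# Hypothetical movies from the text: not rows from the dataset
few_votes = weighted_rating_scalar(v=2, R=10.0, m=m_example, C=C_example)
many_votes = weighted_rating_scalar(v=10_000, R=9.0, m=m_example, C=C_example)

print(f"2 votes, R=10:     WR = {few_votes:.2f}")
print(f"10,000 votes, R=9: WR = {many_votes:.2f}")
```

The 2-vote movie ends up close to the global mean (about 5.67), while the 10,000-vote movie stays close to its own average (about 8.95), which is the behavior we want from the metric.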

The $m$ variable is a bit of a special case here. It's a hyperparameter that we're free to choose. A higher value of $m$ makes the chart more exclusive: a movie needs more votes to qualify, and the weighted rating of a movie with few votes is pulled more strongly toward $C$.

Below, I choose $m$ to be the 0.90 quantile of `vote_count`, meaning our chart will only include movies that have at least as many votes as 90% of the movies in the dataset.
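
To get a feel for how the choice of quantile affects $m$, here is a small sketch on a toy Series of vote counts. The numbers are invented for illustration and are not taken from the dataset; only `Series.quantile` and the `>=` filter mirror what the notebook does.

```python
import pandas as pd

# Toy vote counts (hypothetical, not from the dataset)
votes = pd.Series([1, 2, 3, 5, 8, 13, 40, 90, 300, 5000])

for q in (0.50, 0.75, 0.90):
    m_q = votes.quantile(q)          # vote-count threshold at this quantile
    kept = int((votes >= m_q).sum()) # movies that would stay in the chart
    print(f"quantile={q:.2f} -> m={m_q:.1f}, movies kept={kept}")
```

Raising the quantile raises the threshold and shrinks the chart, so the choice is a trade-off between coverage and confidence in each rating.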

In [8]:
# Calculate mean vote across vote average column (C)
C = metadata['vote_average'].mean()
print(f"The average rating of a movie in this dataset is {C} out of 10.")

The average rating of a movie in this dataset is 5.618207215134185 out of 10.


In [11]:
# Calculate minimum number of votes to be listed in chart (m)
m = metadata['vote_count'].quantile(0.90)
print(f"We require a minimum of {m} votes to be listed.")

We require a minimum of 160.0 votes to be listed.


In [15]:
# Keep only the movies that meet the minimum vote count
q_movies = metadata.loc[metadata['vote_count'] >= m].copy()
print(q_movies.shape)
print(metadata.shape)

(4555, 24)
(45466, 24)


### Note:
It looks like we've kept roughly the top 10% of movies by vote count (4,555 of 45,466) and filtered out the rest.
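
As a sanity check on this kind of filter, the sketch below runs the same `>=` comparison on a toy frame with invented values. It also shows one detail worth knowing: in pandas, comparing a missing value (`NaN`) with a number evaluates to `False`, so a row without a vote count would be dropped. Whether the real dataset contains such rows is not shown in this notebook, so treat that part as an assumption.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values); movie "C" has no recorded vote count
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E"],
    "vote_count": [10.0, 250.0, np.nan, 160.0, 3.0],
})
m_toy = 160.0  # threshold used only for this illustration

mask = toy["vote_count"] >= m_toy   # NaN >= 160.0 evaluates to False
kept = toy.loc[mask].copy()         # copy so later column assignments do not touch `toy`

print(kept["title"].tolist())
print(f"fraction kept: {len(kept) / len(toy):.2f}")
```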

## 1C. Creating the rating
We'll define a `weighted_rating` function below. 
* Having calculated $m$ and $C$ previously, we'll pass them in as default arguments. 
* For each row, we'll select the `vote_count` ($v$) and the `vote_average` ($R$). 
* We'll compute and return the weighted rating. 

In [17]:
# Compute the weighted rating for a single row (one movie)
def weighted_rating(data, m=m, C=C):
    v = data['vote_count']
    R = data['vote_average']
    return (v/(v+m) * R) + (m/(v+m) * C)

In [18]:
# Add this rating as a column to our data
q_movies['weighted_rating'] = q_movies.apply(weighted_rating, axis=1)

# Sort movies based on rating (descending)
q_movies = q_movies.sort_values('weighted_rating', ascending=False)
# Print the top 20 movies
q_movies[['title', 'vote_count', 'vote_average', 'weighted_rating']].head(20)

|  | title | vote_count | vote_average | weighted_rating |
|---|---|---|---|---|
| 314 | The Shawshank Redemption | 8358.0 | 8.5 | 8.445869 |
| 834 | The Godfather | 6024.0 | 8.5 | 8.425439 |
| 10309 | Dilwale Dulhania Le Jayenge | 661.0 | 9.1 | 8.421453 |
| 12481 | The Dark Knight | 12269.0 | 8.3 | 8.265477 |
| 2843 | Fight Club | 9678.0 | 8.3 | 8.256385 |
| 292 | Pulp Fiction | 8670.0 | 8.3 | 8.251406 |
| 522 | Schindler's List | 4436.0 | 8.3 | 8.206639 |
| 23673 | Whiplash | 4376.0 | 8.3 | 8.205404 |
| 5481 | Spirited Away | 3968.0 | 8.3 | 8.196055 |
| 2211 | Life Is Beautiful | 3643.0 | 8.3 | 8.187171 |
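
The row-wise `apply` used above is easy to read, but the same arithmetic can also be written directly on whole columns. The sketch below compares the two approaches on a toy frame; the frame, the constants `m_toy` and `C_toy`, and the helper name `weighted_rating_row` are all invented for this illustration and are not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for q_movies (hypothetical values)
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "vote_count": [200.0, 1000.0, 8000.0],
    "vote_average": [9.0, 8.0, 8.5],
})
m_toy, C_toy = 160.0, 5.6  # illustrative constants

def weighted_rating_row(row, m=m_toy, C=C_toy):
    # Same formula as weighted_rating, applied to one row at a time
    v, R = row["vote_count"], row["vote_average"]
    return (v / (v + m) * R) + (m / (v + m) * C)

# Row-wise: one Python function call per movie
by_apply = toy.apply(weighted_rating_row, axis=1)

# Column-wise: the same formula evaluated on whole Series at once
v, R = toy["vote_count"], toy["vote_average"]
vectorized = (v / (v + m_toy) * R) + (m_toy / (v + m_toy) * C_toy)

print(pd.concat(
    [toy["title"], by_apply.rename("apply"), vectorized.rename("vectorized")],
    axis=1,
))
```

Both columns should agree; the column-wise form avoids the per-row function call, which tends to matter more as the frame grows.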


### Conclusion
And that's it! A simple recommendation system for our movies. 

It's nothing too crazy, but it demonstrates that even a simple recommendation system has to balance vote counts against vote averages to produce good recommendations.