# Netflix recommendation engine

Based on the [Netflix Prize dataset](https://www.kaggle.com/datasets/netflix-inc/netflix-prize-data). Our
goal is to build a recommendation engine.

## Importing the libraries

In [55]:
import polars as pl
import pandas as pd
import numpy as np
from tqdm import tqdm
from time import sleep
import sqlite3
import requests
import json

## Connect to database

Here we connect to the database `netflix_dev.db`. Currently, we use only a small portion of the whole dataset: around 100,000 of roughly 100,000,000 entries. The whole dataset is too big to be processed on a normal computer, so we use the sample to test our code and to get a first impression of the data. The sample is randomly chosen, so it should be roughly representative of the whole dataset.

- `netflix_data` contains the ratings from the Netflix Prize challenge.
- `movie_titles` contains the titles corresponding to the `film` column in `netflix_data`
- `combined` is a join of `netflix_data` and `movie_titles` over the `film` column.
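
To make the relation between the three tables concrete, here is a minimal sketch on a tiny in-memory SQLite database. The schema, the column name `user_id`, and the rows are assumptions for illustration only; the real `netflix_dev.db` may look different. It uses an explicit `JOIN ... ON`, which is equivalent to a comma join with a `WHERE` clause.

```python
import sqlite3

# Tiny in-memory stand-in for netflix_dev.db (schema and rows are made up).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE netflix_data (film INTEGER, user_id INTEGER, rating INTEGER, date TEXT);
    CREATE TABLE movie_titles (film INTEGER, year INTEGER, title TEXT);
    INSERT INTO netflix_data VALUES
        (1, 10, 4, '2005-01-01'),
        (1, 11, 2, '2005-01-02'),
        (2, 10, 5, '2005-01-03');
    INSERT INTO movie_titles VALUES
        (1, 2003, 'Movie A'),
        (2, 2004, 'Movie B'),
        (3, 1999, 'Movie C');
""")

# Join the ratings with the titles over the shared `film` column.
joined = conn.execute("""
    SELECT n.user_id, n.film, n.rating, m.title
    FROM netflix_data AS n
    JOIN movie_titles AS m ON n.film = m.film
    ORDER BY n.user_id, n.film
""").fetchall()

for row in joined:
    print(row)
```

Every rating appears once with its title attached. 'Movie C' has no ratings, so it does not show up in the join.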

In [45]:
db_path = 'netflix_dev.db'
db_conn = 'sqlite://' + db_path

In [46]:
netflix_data = pl.read_database("SELECT * FROM netflix_data", db_conn)
movie_titles = pl.read_database("SELECT * FROM movie_titles", db_conn)
combined     = pl.read_database("SELECT * FROM netflix_data, movie_titles \
                                  WHERE netflix_data.film = movie_titles.film", db_conn)

In [47]:
# return the title for a given item_id (column film)
# e.g. get_title(16242) -> "Con Air"
def get_title(item_id):
    return movie_titles.filter(pl.col("film") == item_id)["title"].to_list()[0]

## Run some queries

Now we run some queries on the data.
- `most_rated` contains the 100 most rated movies.
- `best_rated` contains the 100 best rated movies that have more than 50 ratings.
- `not_rated` contains all movies that have no ratings.
- `rated` contains all movies that have at least one rating.
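
The following sketch shows the two query patterns used here (aggregation with a minimum number of ratings, and `NOT IN` for unrated movies) on a small in-memory database. The rows, the column name `user_id`, and the threshold `MIN_RATINGS` are assumptions for illustration; the notebook itself uses a threshold of 50.

```python
import sqlite3

# Tiny in-memory example data (made up for illustration).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE netflix_data (film INTEGER, user_id INTEGER, rating INTEGER);
    CREATE TABLE movie_titles (film INTEGER, title TEXT);
    INSERT INTO netflix_data VALUES (1, 10, 4), (1, 11, 2), (1, 12, 3), (2, 10, 5);
    INSERT INTO movie_titles VALUES (1, 'Movie A'), (2, 'Movie B'), (3, 'Movie C');
""")

MIN_RATINGS = 2  # hypothetical threshold

# Best rated movies among those with MORE THAN MIN_RATINGS ratings.
best = conn.execute("""
    SELECT m.film, m.title, COUNT(*) AS num_ratings, AVG(n.rating) AS avg_rating
    FROM netflix_data AS n
    JOIN movie_titles AS m ON n.film = m.film
    GROUP BY m.film, m.title
    HAVING COUNT(*) > ?
    ORDER BY avg_rating DESC
""", (MIN_RATINGS,)).fetchall()

# Movies without a single rating.
unrated = conn.execute("""
    SELECT film, title
    FROM movie_titles
    WHERE film NOT IN (SELECT film FROM netflix_data)
""").fetchall()

print(best)
print(unrated)
```

'Movie B' has the highest average but only one rating, so the `HAVING` clause filters it out. This is why the threshold matters for a "best rated" list.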

In [7]:
most_rated = pl.read_database("SELECT netflix_data.film, movie_titles.title, COUNT(*) AS 'num_ratings', AVG(netflix_data.rating) AS 'avg_rating' \
                               FROM netflix_data, movie_titles \
                               WHERE netflix_data.film = movie_titles.film \
                               GROUP BY netflix_data.film, title \
                               ORDER BY COUNT(*) DESC \
                               LIMIT 100 \
                               ", db_conn)

most_rated

film,title,num_ratings,avg_rating
i64,str,i64,f64
5317,"""Miss Congenial…",233,3.549356
15124,"""Independence D…",220,3.736364
15205,"""The Day After …",212,3.476415
11283,"""Forrest Gump""",199,4.351759
16242,"""Con Air""",196,3.377551
15582,"""Sweet Home Ala…",191,3.507853
6287,"""Pretty Woman""",190,3.884211
6972,"""Armageddon""",186,3.5
14313,"""The Patriot""",184,3.869565
1905,"""Pirates of the…",183,4.245902


In [8]:
best_rated = pl.read_database("SELECT netflix_data.film, movie_titles.title, COUNT(*) AS 'num_ratings', AVG(netflix_data.rating) AS 'avg_rating' \
                                FROM netflix_data, movie_titles \
                                WHERE netflix_data.film = movie_titles.film \
                                GROUP BY netflix_data.film, title \
                                HAVING num_ratings > 50 \
                                ORDER BY AVG(netflix_data.rating) DESC \
                                LIMIT 100", db_conn)

best_rated

film,title,num_ratings,avg_rating
i64,str,i64,f64
14961,"""Lord of the Ri…",83,4.759036
5582,"""Star Wars: Epi…",98,4.704082
7230,"""The Lord of th…",74,4.702703
7057,"""Lord of the Ri…",75,4.626667
16265,"""Star Wars: Epi…",89,4.58427
14550,"""The Shawshank …",151,4.576159
9628,"""Star Wars: Epi…",84,4.559524
14240,"""Lord of the Ri…",140,4.557143
10042,"""Raiders of the…",108,4.509259
12293,"""The Godfather""",120,4.5


In [9]:
rated = pl.read_database("SELECT movie_titles.film, movie_titles.title \
                          FROM movie_titles \
                          WHERE movie_titles.film IN (SELECT film FROM netflix_data)",  db_conn)

not_rated = pl.read_database("SELECT movie_titles.film, movie_titles.title \
                              FROM movie_titles \
                              WHERE movie_titles.film NOT IN (SELECT film FROM netflix_data)",  db_conn)

rated

film,title
i64,str
1,"""Dinosaur Plane…"
2,"""Isle of Man TT…"
3,"""Character"""
6,"""Sick"""
8,"""What the #$*! …"
10,"""Fighter"""
12,"""My Favorite Br…"
13,"""Lord of the Ri…"
15,"""Neil Diamond: …"
16,"""Screamers"""


## Getting the movie metadata

We use the [OMDb API](http://www.omdbapi.com/) to get the movie metadata. We use the `title` from the `movie_titles` table to query the metadata for each movie. We store the metadata in the `movie_metadata` table.

In [10]:
# TODO: PLEASE DELETE BEFORE PUBLIC RELEASE
api_key = "put api key here"

# get movie data from the OMDb API
#
# input: title of movie (str)
#
# output: dict with the parsed JSON movie data
#         title, year, rated, release date, runtime, genre, language, country, ratings from different sites, ...
def ombd_api(title):

    # let requests build the query string so that the title is URL-encoded
    # (titles may contain spaces, "&", "#", ...)
    url = "http://www.omdbapi.com/"
    response = requests.get(url, params={"apikey": api_key, "t": title})
    data = json.loads(response.text)

    return data
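
Movie titles often contain spaces or characters such as `&` that have a special meaning in a query string, so the title has to be URL-encoded before it is sent. A minimal sketch with the standard library's `urllib.parse.urlencode`; `build_omdb_url`, the example title, and the key `"demo"` are hypothetical and not part of the notebook:

```python
from urllib.parse import urlencode


def build_omdb_url(title, api_key):
    """Return an OMDb request URL with a safely encoded query string."""
    query = urlencode({"apikey": api_key, "t": title})
    return "http://www.omdbapi.com/?" + query


# "&" would otherwise start a new query parameter and cut the title short.
url = build_omdb_url("Fast & Furious", "demo")
print(url)
```

The `&` in the title is encoded as `%26` and the spaces as `+`, so the API receives the full title.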

Now we want to get the movie metadata for every rated movie. First, we take the titles from `rated`. Then we iterate over the titles and get the metadata for each movie. We store the metadata in a csv file.

In [62]:
fields = [
    'Year',
    'Response',
    'Rated',
    'Released',
    'Runtime',
    'Genre',
    'Director',
    'Writer',
    'Actors',
    'Plot',
    'Language',
    'Country',
    'Awards',
    'Poster',
    'Ratings',
    'Metascore',
    'imdbRating',
    'imdbVotes',
    'imdbID',
    'Type',
    'DVD',
    'BoxOffice',
    'Production',
    'Website',
]

In [67]:
# CAN BE SKIPPED
titles = rated["title"].to_list()

# get movie data for all rated movies
movie_data = []

for title in tqdm(titles):

    response = ombd_api(title)
    # always record the title we queried for
    current_data = {"Title": title}

    # copy every field from the response;
    # fields missing from the response are set to "N/A"
    for field in fields:
        current_data[field] = response.get(field, "N/A")

    movie_data.append(current_data)

    sleep(0.1)

df = pd.DataFrame(movie_data)
df.to_csv("/data/movie_data.csv")

100%|██████████| 9235/9235 [1:06:59<00:00,  2.30it/s] 


## Create a recommendation engine

Now we create a recommendation engine with the `surprise` library. We use `SVD`, a matrix factorization algorithm. We train it on the `trainset`, test it on the `testset`, and evaluate it with the RMSE and the MAE.

In [11]:
from surprise import Dataset, KNNBasic, Reader, accuracy, SVD
from surprise.model_selection import cross_validate, train_test_split

First we need to bring the ratings data into the correct format. `Surprise` expects a dataframe with exactly three columns in the order `[user_ids, item_ids, ratings]`, so we need to drop the `date` column. We also need to convert the polars dataframe to a pandas dataframe, because `Surprise` does not support polars dataframes. Finally, we create the `Dataset` and split it into the `trainset` and the `testset` (75% training, 25% testing).

In [12]:
ratings = netflix_data.drop("date")
ratings = ratings.to_pandas()

data = Dataset.load_from_df(ratings, Reader(rating_scale=(1, 5)))
trainset, testset = train_test_split(data, test_size=0.25)

trainset_data = trainset.build_testset()

We choose the SVD algorithm and fit it to the dataset. Then we predict the ratings for the testset and calculate the RMSE.

In [13]:
algo = SVD()
algo.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x1d076218a30>

In [14]:
predictions = algo.test(testset)

# Calculate RMSE
rmse = accuracy.rmse(predictions, verbose=False)
print("Root Mean Squared Error (RMSE):", rmse)

# Calculate MAE
mae = accuracy.mae(predictions, verbose=False)
print("Mean Absolute Error (MAE):", mae)

Root Mean Squared Error (RMSE): 1.0399204120378003
Mean Absolute Error (MAE): 0.8400388009536006
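
For reference, both metrics can be computed by hand from pairs of true and predicted ratings. The sketch below uses made-up pairs and plain Python. It illustrates the two formulas and is not how `surprise.accuracy` is implemented.

```python
import math


def rmse(pairs):
    """Root mean squared error over (true_rating, estimated_rating) pairs."""
    return math.sqrt(sum((true - est) ** 2 for true, est in pairs) / len(pairs))


def mae(pairs):
    """Mean absolute error over (true_rating, estimated_rating) pairs."""
    return sum(abs(true - est) for true, est in pairs) / len(pairs)


# Made-up (true, estimated) ratings for illustration.
pairs = [(4, 3.5), (2, 3.0), (5, 4.0), (3, 3.0)]

print("RMSE:", rmse(pairs))
print("MAE:", mae(pairs))
```

The RMSE squares the errors before averaging, so large errors weigh more than in the MAE. This is why the RMSE is never smaller than the MAE.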


In [15]:
def random_predict():
    # get two random indices from the trainset_data
    i_1 = np.random.randint(0, len(trainset_data))
    i_2 = np.random.randint(0, len(trainset_data))

    # take the user_id from i_1
    # take the item_id from i_2
    user_id = trainset_data[i_1][1]
    item_id = trainset_data[i_2][0]

    # predict how user_id will rate item_id
    pred = algo.predict(user_id, item_id)

    return pred

In [16]:
# print prediction information in a pretty string
def pretty_predict(pred):
    user_id = pred.uid
    item_id = pred.iid
    pred_rating = pred.est
    title = get_title(item_id)

    print("User {0} will rate {1} ({2}) with {3} stars.".format(user_id, title, item_id, pred_rating))

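Now we can predict some ratings. Right now it returns ~3.6 stars almost every time. This value is the global mean of the training ratings, which the `SVD` model falls back to when it has seen neither the user nor the item during training. The cause appears to be the column order: `Dataset.load_from_df` reads the columns positionally as user, item, rating, but the tuples in `trainset_data` carry the film id at position 0, so the model was trained with the films in the user role. `algo.predict(user_id, item_id)` then passes both ids in roles the model does not know.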

In [17]:
for i in range(10):
    pred = random_predict()
    pretty_predict(pred)

User 2519375 will rate Waiting for Guffman (16201) with 3.6043866666666666 stars.
User 647602 will rate The Princess Diaries (Widescreen) (11265) with 3.6043866666666666 stars.
User 2016590 will rate Kiss: Exposed (2092) with 3.6043866666666666 stars.
User 132059 will rate Two Weeks Notice (13050) with 3.6043866666666666 stars.
User 306199 will rate The Flamingo Kid (8254) with 3.6043866666666666 stars.
User 2509503 will rate The Notebook (14103) with 3.6043866666666666 stars.
User 2587117 will rate Jerry Maguire (13763) with 3.6043866666666666 stars.
User 2556053 will rate Big Momma's House (6347) with 3.6043866666666666 stars.
User 1284226 will rate Just Married (10921) with 3.6043866666666666 stars.
User 2366006 will rate The Verdict (12593) with 3.6043866666666666 stars.
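
The constant 3.6043866666666666 equals 270329 / 75000, which is consistent with the mean of 75,000 training ratings. The sketch below shows how a biased matrix factorization model arrives at such a value: it starts from the global mean and only adds the bias and factor terms for ids it knows. All names and numbers are made up for illustration. This is a simplified model of the estimate, not the `surprise` implementation.

```python
# Toy model state (all values are made up for illustration).
GLOBAL_MEAN = 3.6
USER_BIAS = {"user_1": 0.2}
ITEM_BIAS = {"film_1": -0.1}
USER_FACTORS = {"user_1": [0.5, 1.0]}
ITEM_FACTORS = {"film_1": [0.2, 0.1]}


def svd_estimate(uid, iid):
    """Biased matrix factorization estimate with fallbacks for unknown ids."""
    known_user = uid in USER_BIAS
    known_item = iid in ITEM_BIAS

    est = GLOBAL_MEAN
    if known_user:
        est += USER_BIAS[uid]
    if known_item:
        est += ITEM_BIAS[iid]
    if known_user and known_item:
        # the dot product of the latent factors needs both ids to be known
        est += sum(p * q for p, q in zip(USER_FACTORS[uid], ITEM_FACTORS[iid]))
    return est


print(svd_estimate("user_1", "film_1"))  # ids in the roles the model was trained with
print(svd_estimate("film_1", "user_1"))  # swapped roles: both ids are unknown
print(svd_estimate("user_9", "film_9"))  # unseen ids: global mean only
```

The last two calls return exactly `GLOBAL_MEAN`. If the user and item columns are swapped between training and prediction, almost every prediction collapses to this value.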


In [18]:
# rate all movies for a given user_id
user_id = 517756

# get all the item_ids in the trainset_data
item_ids = [x[0] for x in trainset_data]

# item_ids should only be unique values
item_ids = list(set(item_ids))

# get a rating for all item_ids from user_id
user_ratings = {}
for item_id in item_ids:
    pred = algo.predict(user_id, item_id)
    user_ratings[item_id] = pred.est

# sort the ratings from highest to lowest
user_ratings = sorted(user_ratings.items(), key=lambda x: x[1], reverse=True)

# print the top 10 movies for user_id
for item_id, rating in user_ratings[:10]:
    title = get_title(item_id)
    print("{0} ({1}) - {2} stars".format(title, item_id, rating))

Baby Shakespeare: World of Poetry (7133) - 3.8561833188464307 stars
Slums of Beverly Hills (15181) - 3.8418737428628598 stars
Latin Kings: A Street Gang Story (6663) - 3.8088334610349937 stars
Moonlight Mile (7363) - 3.8007843140571373 stars
Amelie: Bonus Material (15865) - 3.793172170972907 stars
Renegades (17587) - 3.780581995056136 stars
The SoulTaker (192) - 3.7706913354987903 stars
Lost: Season 1 (3456) - 3.7587480937264806 stars
Stargate SG-1: Season 1 (14363) - 3.7496374742696768 stars
Wiseguy: Season 1: Part 2 (6504) - 3.7459049147798718 stars
