# Netflix recommendation engine

Based on the [Netflix Prize dataset](https://www.kaggle.com/datasets/netflix-inc/netflix-prize-data). Our
goal is to build a recommendation engine.

## Importing the libraries

In [255]:
import polars as pl
import pandas as pd
import numpy as np
import sqlite3

## Connect to database

Here we connect to the database `netflix_dev.db`. It currently holds a small random sample of the full dataset, around 100,000 of roughly 100,000,000 entries, because the full dataset is too large to process on a normal computer. The sample is enough to test our code and to get a first impression of the data. Because it is drawn at random, it should reflect the overall distribution of ratings, although each user and each movie has far fewer ratings than in the full data.

- `netflix_data` contains the ratings from the Netflix Prize challenge.
- `movie_titles` contains the titles corresponding to the `film` column in `netflix_data`.
- `combined` is a join of `netflix_data` and `movie_titles` on the `film` column.
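
As a rough illustration of how such a random development sample could be drawn, the sketch below copies a random subset of rows from one SQLite table into another using `ORDER BY RANDOM()`. The in-memory databases, the `user_id` column name, the generated rows, and the sizes are assumptions made for the example; only the table name `netflix_data` and the columns `film`, `rating`, and `date` appear in this notebook.

```python
import sqlite3

SCHEMA = "CREATE TABLE netflix_data (film INTEGER, user_id INTEGER, rating INTEGER, date TEXT)"

# In-memory stand-in for the full database (hypothetical schema and data).
full = sqlite3.connect(":memory:")
full.execute(SCHEMA)
full.executemany(
    "INSERT INTO netflix_data VALUES (?, ?, ?, ?)",
    [(i % 50, i, 1 + i % 5, "2005-01-01") for i in range(1000)],
)

# Draw a uniform random sample of 100 rows.
sample = full.execute(
    "SELECT film, user_id, rating, date FROM netflix_data ORDER BY RANDOM() LIMIT 100"
).fetchall()

# Write the sample into a separate (here also in-memory) development database.
dev = sqlite3.connect(":memory:")
dev.execute(SCHEMA)
dev.executemany("INSERT INTO netflix_data VALUES (?, ?, ?, ?)", sample)
dev.commit()

n_rows = dev.execute("SELECT COUNT(*) FROM netflix_data").fetchone()[0]
print(n_rows)
```

Note that `ORDER BY RANDOM()` has to touch every row of the source table, so on the full dataset this approach may be slow.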

In [256]:
db_path = 'netflix_dev.db'
db_conn = 'sqlite://' + db_path

In [257]:
netflix_data = pl.read_database("SELECT * FROM netflix_data", db_conn)
movie_titles = pl.read_database("SELECT * FROM movie_titles", db_conn)
combined     = pl.read_database("""
    SELECT *
    FROM netflix_data
    JOIN movie_titles ON netflix_data.film = movie_titles.film
""", db_conn)

## Run some queries

Now we run some queries on the data.
- `most_rated` contains the 100 most rated movies.
- `best_rated` contains the 100 best rated movies that have more than 50 ratings.
- `not_rated` contains all movies that have no ratings.
- `rated` contains all movies that have at least one rating.

In [258]:
most_rated = pl.read_database("""
    SELECT netflix_data.film, movie_titles.title,
           COUNT(*) AS num_ratings, AVG(netflix_data.rating) AS avg_rating
    FROM netflix_data
    JOIN movie_titles ON netflix_data.film = movie_titles.film
    GROUP BY netflix_data.film, movie_titles.title
    ORDER BY num_ratings DESC
    LIMIT 100
""", db_conn)

In [259]:
best_rated = pl.read_database("""
    SELECT netflix_data.film, movie_titles.title,
           COUNT(*) AS num_ratings, AVG(netflix_data.rating) AS avg_rating
    FROM netflix_data
    JOIN movie_titles ON netflix_data.film = movie_titles.film
    GROUP BY netflix_data.film, movie_titles.title
    HAVING num_ratings > 50
    ORDER BY avg_rating DESC
    LIMIT 100
""", db_conn)

In [260]:
rated = pl.read_database("SELECT movie_titles.film, movie_titles.title \
                          FROM movie_titles \
                          WHERE movie_titles.film IN (SELECT film FROM netflix_data)",  db_conn)

not_rated = pl.read_database("SELECT movie_titles.film, movie_titles.title \
                                FROM movie_titles \
                                WHERE movie_titles.film NOT IN (SELECT film FROM netflix_data)",  db_conn)
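
To see what these queries return, here is a small sketch that runs the same kind of SQL against a tiny in-memory SQLite database with the standard `sqlite3` module. The table contents, the `user_id` column name, and the lowered threshold (`> 1` instead of `> 50`) are assumptions made so the example stays small.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE movie_titles (film INTEGER, title TEXT);
    CREATE TABLE netflix_data (film INTEGER, user_id INTEGER, rating INTEGER);
    INSERT INTO movie_titles VALUES (1, 'A'), (2, 'B'), (3, 'C');
    INSERT INTO netflix_data VALUES (1, 10, 5), (1, 11, 4), (1, 12, 3), (2, 10, 5);
""")

# Movies ordered by how many ratings they received.
most_rated = conn.execute("""
    SELECT n.film, m.title, COUNT(*) AS num_ratings, AVG(n.rating) AS avg_rating
    FROM netflix_data AS n
    JOIN movie_titles AS m ON n.film = m.film
    GROUP BY n.film, m.title
    ORDER BY num_ratings DESC
    LIMIT 100
""").fetchall()

# Movies ordered by average rating, keeping only those with more than one rating.
best_rated = conn.execute("""
    SELECT n.film, m.title, COUNT(*) AS num_ratings, AVG(n.rating) AS avg_rating
    FROM netflix_data AS n
    JOIN movie_titles AS m ON n.film = m.film
    GROUP BY n.film, m.title
    HAVING num_ratings > 1
    ORDER BY avg_rating DESC
    LIMIT 100
""").fetchall()

# Movies that never appear in the ratings table.
not_rated = conn.execute("""
    SELECT film, title
    FROM movie_titles
    WHERE film NOT IN (SELECT film FROM netflix_data)
""").fetchall()

print(most_rated)
print(best_rated)
print(not_rated)
```

Movie `B` has the highest average but only one rating, so the `HAVING` filter removes it from `best_rated`. This is the reason for requiring a minimum number of ratings.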

## Create a recommendation engine

Now we create a recommendation engine with the `surprise` library. We use `SVD`, a matrix factorization algorithm: we fit it on the `trainset`, predict ratings for the `testset`, and evaluate the predictions with the root mean squared error (RMSE).

In [261]:
from surprise import Dataset, KNNBasic, Reader, accuracy, SVD
from surprise.model_selection import cross_validate, train_test_split

First we need to bring the ratings data into the correct format. `surprise` expects a dataframe with three columns in the order `[user_ids, item_ids, ratings]`, so we need to drop the `date` column. We also need to convert the polars dataframe to a pandas dataframe, because `surprise` does not support polars dataframes. Finally, we create the `Dataset` and split it into a `trainset` and a `testset` (75% training, 25% testing).

In [262]:
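# Keep every column except "date" and convert to pandas for surprise
ratings = netflix_data.drop("date")
ratings = ratings.to_pandas()

data = Dataset.load_from_df(ratings, Reader(rating_scale=(1, 5)))
trainset, testset = train_test_split(data, test_size=0.25)

# List of (uid, iid, rating) tuples with raw ids, used later to sample ids
trainset_data = trainset.build_testset()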
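
For intuition, a 75/25 random split can be sketched with the standard library alone. This is not how `surprise` implements `train_test_split`; the toy triples and the fixed seed are assumptions chosen so the example is reproducible.

```python
import random

# Hypothetical (user, item, rating) triples standing in for the real ratings.
ratings = [(u, i, 1 + (u + i) % 5) for u in range(10) for i in range(4)]

rng = random.Random(42)   # fixed seed so the split is reproducible
shuffled = ratings[:]     # copy, so the original order is left untouched
rng.shuffle(shuffled)

# The first 25% of the shuffled rows become the test set, the rest the training set.
n_test = int(len(shuffled) * 0.25)
test_rows = shuffled[:n_test]
train_rows = shuffled[n_test:]

print(len(train_rows), len(test_rows))
```

In the notebook, passing `random_state` to `train_test_split` should make the split reproducible in a similar way.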

We choose the SVD algorithm and fit it to the trainset. Then we predict the ratings for the testset and calculate the RMSE and the MAE.

In [263]:
algo = SVD()
algo.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x2cd3c777df0>
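
For intuition about what fitting involves, the sketch below trains a small biased matrix factorization with stochastic gradient descent, using the prediction rule `mu + b_u + b_i + q_i . p_u`. It is a simplified stand-in for this family of models, not the implementation in `surprise`; the toy ratings, the hyperparameters, and the function names `train_mf`, `predict`, and `rmse` are assumptions for the example.

```python
import math
import random

# Hypothetical (user, item, rating) triples on a 1-5 scale.
ratings = [
    ("u1", "a", 5), ("u1", "b", 3), ("u2", "a", 4),
    ("u2", "c", 1), ("u3", "b", 2), ("u3", "c", 5),
]

def train_mf(ratings, n_factors=2, n_epochs=200, lr=0.01, reg=0.02, seed=0):
    """Fit global mean, user/item biases, and latent factors with plain SGD."""
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    bu = {u: 0.0 for u in users}
    bi = {i: 0.0 for i in items}
    # Small random initial factors.
    pu = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    qi = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}

    for _ in range(n_epochs):
        for u, i, r in ratings:
            dot = sum(p * q for p, q in zip(pu[u], qi[i]))
            err = r - (mu + bu[u] + bi[i] + dot)
            # Gradient steps with L2 regularization.
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            for f in range(n_factors):
                p, q = pu[u][f], qi[i][f]
                pu[u][f] += lr * (err * q - reg * p)
                qi[i][f] += lr * (err * p - reg * q)
    return mu, bu, bi, pu, qi

def predict(model, u, i):
    """Estimate the rating for a user and item that were both seen in training."""
    mu, bu, bi, pu, qi = model
    return mu + bu[u] + bi[i] + sum(p * q for p, q in zip(pu[u], qi[i]))

def rmse(model, ratings):
    """Root mean squared error of the model on the given triples."""
    return math.sqrt(sum((r - predict(model, u, i)) ** 2 for u, i, r in ratings) / len(ratings))

untrained = train_mf(ratings, n_epochs=0)
trained = train_mf(ratings)
print(rmse(untrained, ratings), rmse(trained, ratings))
```

The second number is the training error after fitting. It is lower than the first, which only uses the global mean and the random initial factors.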

In [264]:
predictions = algo.test(testset)

# Calculate RMSE
rmse = accuracy.rmse(predictions, verbose=False)
print("Root Mean Squared Error (RMSE):", rmse)

# Calculate MAE
mae = accuracy.mae(predictions, verbose=False)
print("Mean Absolute Error (MAE):", mae)

Root Mean Squared Error (RMSE): 1.0415766256190409
Mean Absolute Error (MAE): 0.8412969441935148
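
As a worked example of the two metrics, the sketch below computes RMSE and MAE by hand for a few made-up (true rating, estimated rating) pairs. The numbers are hypothetical and are not taken from the notebook's predictions.

```python
import math

# Hypothetical (true rating, estimated rating) pairs.
pairs = [(5, 4.0), (3, 3.5), (1, 2.0), (4, 4.0)]

errors = [true - est for true, est in pairs]

# RMSE: square the errors, average them, take the square root.
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))

# MAE: average of the absolute errors.
mae = sum(abs(e) for e in errors) / len(errors)

print(rmse, mae)  # 0.75 0.625
```

RMSE is never smaller than MAE because squaring gives large errors more weight. Read this way, an RMSE of about 1.04 on a 1 to 5 scale suggests predictions are typically off by roughly one star.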


In [265]:
def random_predict():
    # get two random indices from the trainset_data
    i_1 = np.random.randint(0, len(trainset_data))
    i_2 = np.random.randint(0, len(trainset_data))

    # take the user_id from i_1
    # take the item_id from i_2
    user_id = trainset_data[i_1][1]
    item_id = trainset_data[i_2][0]

    # predict how user_id will rate item_id
    pred = algo.predict(user_id, item_id)

    return pred

In [266]:
# print prediction information in a pretty string
def pretty_predict(pred):
    user_id = pred.uid
    item_id = pred.iid
    pred_rating = round(pred.est,2)
    title = movie_titles.filter(pl.col("film") == item_id)["title"].to_list()[0]

    print("User {0} will rate item {1} ({2}) with {3} stars.".format(user_id, title, item_id, pred_rating))

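Now we can predict some ratings. Right now every prediction comes out at about 3.61 stars. The most likely cause is the order of the ids. The title lookup `film == item_id` in `pretty_predict` succeeds for ids taken from index 0 of the `trainset_data` tuples, so `film` must be the first column of `ratings`, and `surprise` was therefore trained with films in the user slot. `random_predict` then calls `algo.predict` with the ids the other way round, so neither id is known to the model in that role and `SVD` falls back to the global mean rating of the trainset.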

In [268]:
for i in range(10):
    pred = random_predict()
    pretty_predict(pred)

User 1567240 will rate item Friends: Season 6 (2942) with 3.61 stars.
User 2566540 will rate item South Park: Season 5 (15609) with 3.61 stars.
User 360287 will rate item White Men Can't Jump (12015) with 3.61 stars.
User 829782 will rate item Sister Act (6386) with 3.61 stars.
User 1178358 will rate item The Muppet Movie (16567) with 3.61 stars.
User 2055183 will rate item Along Came a Spider (12299) with 3.61 stars.
User 2488360 will rate item The Legend of Bagger Vance (10241) with 3.61 stars.
User 1288043 will rate item The Patriot (14313) with 3.61 stars.
User 2240832 will rate item The Four Seasons (3026) with 3.61 stars.
User 719014 will rate item The Good Girl (12244) with 3.61 stars.
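
The constant 3.61 in the output above is consistent with a model that does not recognize either id and returns its global mean. The sketch below reproduces that effect with a deliberately simple stand-in model (global mean plus per-id biases) and shows how reordering the columns changes the result. `BaselineModel`, the three rating rows, and the `(film, user, rating)` layout are assumptions for illustration; this is not the `surprise` API.

```python
from collections import defaultdict
from statistics import mean

class BaselineModel:
    """Toy model: global mean plus a bias for each id seen during fit."""

    def fit(self, triples):
        # Like surprise, treat slot 0 as the user id and slot 1 as the item id.
        self.global_mean = mean(r for _, _, r in triples)
        by_user, by_item = defaultdict(list), defaultdict(list)
        for uid, iid, r in triples:
            by_user[uid].append(r)
            by_item[iid].append(r)
        self.user_bias = {u: mean(rs) - self.global_mean for u, rs in by_user.items()}
        self.item_bias = {i: mean(rs) - self.global_mean for i, rs in by_item.items()}
        return self

    def predict(self, uid, iid):
        # Unknown ids contribute nothing, so the estimate stays at the global mean.
        return self.global_mean + self.user_bias.get(uid, 0.0) + self.item_bias.get(iid, 0.0)

# Hypothetical rows in (film, user, rating) order.
rows = [(2942, 1567240, 4.0), (2942, 360287, 3.0), (6386, 1567240, 5.0)]

# Fitted as-is, films land in the user slot, so predict(user, film) knows neither id.
as_loaded = BaselineModel().fit(rows)
swapped = as_loaded.predict(1567240, 2942)

# Reordering to (user, film, rating) before fitting makes both ids known.
reordered = [(user, film, r) for film, user, r in rows]
fixed = BaselineModel().fit(reordered)
known = fixed.predict(360287, 2942)

print(swapped, known)  # 4.0 2.5
```

In the notebook itself, one way to apply the same idea would be to reorder the columns of `ratings` so that the user column comes first before calling `Dataset.load_from_df`, and then to read the user id from index 0 and the film id from index 1 of each `trainset_data` tuple in `random_predict`.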
