# Netflix recommendation engine

Based on the [Netflix Prize dataset](https://www.kaggle.com/datasets/netflix-inc/netflix-prize-data). Our
goal is to build a recommendation engine.

## Importing the libraries

In [1]:
import polars as pl
import sqlite3

## Connect to database

Here we connect to the database `netflix_dev.db`. For now we use only a small portion of the full dataset: about 100,000 of roughly 100,000,000 ratings. The full dataset is too large to process comfortably on an ordinary computer, so the sample lets us test the code and get a first impression of the data. The sample was drawn at random, so it should be broadly representative of the full dataset.

- `netflix_data` contains the ratings from the Netflix Prize challenge.
- `movie_titles` contains the titles corresponding to the `film` column in `netflix_data`.
- `combined` is a join of `netflix_data` and `movie_titles` on the `film` column.
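
To make the relationship between the three tables concrete, here is a minimal sketch that rebuilds them in an in-memory SQLite database with the standard-library `sqlite3` module. The rows and column types are invented for illustration; only the table and column names follow the notebook. It uses an explicit `JOIN ... USING (film)`, which returns the same rows as a join on `film` written with `WHERE`, but keeps a single `film` column in the result.

```python
import sqlite3

# In-memory database with invented rows; only table/column names follow the notebook.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE netflix_data (film INTEGER, user INTEGER, rating INTEGER, date TEXT);
    CREATE TABLE movie_titles (film INTEGER, year TEXT, title TEXT);
    INSERT INTO movie_titles VALUES (1, '1994', 'Movie A'), (2, '2001', 'Movie B');
    INSERT INTO netflix_data VALUES (1, 10, 4, '2005-01-01'),
                                    (1, 11, 5, '2005-01-02'),
                                    (2, 10, 3, '2005-01-03');
""")

# Every rating row is paired with the title row that shares its `film` id.
cursor = conn.execute(
    "SELECT * FROM netflix_data JOIN movie_titles USING (film) ORDER BY film, user"
)
columns = [col[0] for col in cursor.description]
rows = cursor.fetchall()

print(columns)
for row in rows:
    print(row)
```

Each rating keeps its own row, and the `year` and `title` of the matching movie are repeated alongside it.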

In [4]:
db_path = 'netflix_dev.db'
db_conn = 'sqlite://' + db_path

In [6]:
netflix_data = pl.read_database("SELECT * FROM netflix_data", db_conn)
movie_titles = pl.read_database("SELECT * FROM movie_titles", db_conn)
combined     = pl.read_database("SELECT * FROM netflix_data, movie_titles \
                                  WHERE netflix_data.film = movie_titles.film", db_conn)

In [7]:
combined

| film (i64) | user (i64) | rating (i64) | date (str) | year (str) | title (str) |
| --- | --- | --- | --- | --- | --- |
| 12401 | 993438 | 2 | "2005-04-15 " | "2000" | "Nico and Dani" |
| 14103 | 328402 | 4 | "2005-04-17 " | "2004" | "The Notebook" |
| 13691 | 1555660 | 4 | "2004-11-15 " | "1949" | "The Adventures…" |
| 11607 | 1281112 | 5 | "2005-05-05 " | "2005" | "Hotel Rwanda" |
| 11047 | 713840 | 4 | "2005-12-17 " | "1995" | "Outbreak" |
| 9188 | 2038403 | 4 | "2003-09-24 " | "1988" | "Cocktail" |
| 788 | 2574115 | 3 | "2005-02-04 " | "1994" | "Clerks" |
| 2905 | 2510444 | 4 | "2004-01-27 " | "1998" | "Croupier" |
| 12293 | 2269486 | 4 | "2003-09-10 " | "1972" | "The Godfather" |
| 16377 | 2000183 | 5 | "2000-12-26 " | "1999" | "The Green Mile…" |
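
In the output above, `date` and `year` are both strings, and the `date` values appear to carry a trailing space. The following is a minimal sketch of how such values could be normalised with the standard library. It assumes the `YYYY-MM-DD` format seen in the sample rows holds for every row; the helper names `parse_rating_date` and `parse_year` are hypothetical and not part of the notebook.

```python
from datetime import date, datetime
from typing import Optional


def parse_rating_date(raw: str) -> date:
    """Strip surrounding whitespace and parse a YYYY-MM-DD string."""
    return datetime.strptime(raw.strip(), "%Y-%m-%d").date()


def parse_year(raw: str) -> Optional[int]:
    """Return the release year as an int, or None if the value is not numeric."""
    cleaned = raw.strip()
    return int(cleaned) if cleaned.isdigit() else None


print(parse_rating_date("2005-04-15 "))  # 2005-04-15
print(parse_year("2000"))                # 2000
```

Returning `None` for non-numeric years is a design choice for this sketch, so that unexpected values do not raise during a bulk conversion.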


## Run some queries

Now we run some queries on the data.
- `top_100` contains the 100 most frequently rated movies.
- `best_rated` contains the 100 movies with the highest average rating among those with more than 50 ratings.
- `not_rated` contains all movies that have no ratings in the sample.
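
The first two queries share the same aggregation (count and average rating per movie) and differ only in the filter and the sort order. A minimal sketch of that pattern on invented rows, using the standard-library `sqlite3` module and an in-memory database, is shown here. The threshold and limit are passed as parameters and set to small values so that the toy data produces a visible result; the name `STATS` is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE netflix_data (film INTEGER, user INTEGER, rating INTEGER);
    CREATE TABLE movie_titles (film INTEGER, title TEXT);
    INSERT INTO movie_titles VALUES (1, 'Movie A'), (2, 'Movie B'),
                                    (3, 'Movie C'), (4, 'Movie D');
    INSERT INTO netflix_data VALUES (1, 10, 3), (1, 11, 4), (1, 12, 5),
                                    (2, 10, 5), (2, 11, 5),
                                    (3, 10, 5);
""")

# Shared aggregation: one row per movie with its rating count and mean rating.
STATS = """
    SELECT m.film, m.title, COUNT(*) AS num_ratings, AVG(n.rating) AS avg_rating
    FROM netflix_data AS n
    JOIN movie_titles AS m ON n.film = m.film
    GROUP BY m.film, m.title
"""

# Most frequently rated movies: sort by the count.
most_rated = conn.execute(
    STATS + " ORDER BY num_ratings DESC LIMIT ?", (2,)
).fetchall()

# Best-rated movies: keep only movies with more than `?` ratings, sort by the mean.
best_rated = conn.execute(
    STATS + " HAVING COUNT(*) > ? ORDER BY avg_rating DESC LIMIT ?", (1, 2)
).fetchall()

print(most_rated)  # [(1, 'Movie A', 3, 4.0), (2, 'Movie B', 2, 5.0)]
print(best_rated)  # [(2, 'Movie B', 2, 5.0), (1, 'Movie A', 3, 4.0)]
```

The `HAVING` threshold keeps 'Movie C', which has a perfect average from a single rating, out of the best-rated list. This is the reason for requiring a minimum number of ratings.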

In [8]:
top_100 = pl.read_database("SELECT netflix_data.film, movie_titles.title, COUNT(*) AS 'num_ratings', AVG(netflix_data.rating) AS 'avg_rating' \
                            FROM netflix_data, movie_titles \
                            WHERE netflix_data.film = movie_titles.film \
                            GROUP BY netflix_data.film, title \
                            ORDER BY COUNT(*) DESC \
                            LIMIT 100 \
                            ", db_conn)

top_100

| film (i64) | title (str) | num_ratings (i64) | avg_rating (f64) |
| --- | --- | --- | --- |
| 5317 | "Miss Congenial…" | 233 | 3.549356 |
| 15124 | "Independence D…" | 220 | 3.736364 |
| 15205 | "The Day After …" | 212 | 3.476415 |
| 11283 | "Forrest Gump" | 199 | 4.351759 |
| 16242 | "Con Air" | 196 | 3.377551 |
| 15582 | "Sweet Home Ala…" | 191 | 3.507853 |
| 6287 | "Pretty Woman" | 190 | 3.884211 |
| 6972 | "Armageddon" | 186 | 3.5 |
| 14313 | "The Patriot" | 184 | 3.869565 |
| 1905 | "Pirates of the…" | 183 | 4.245902 |


In [9]:
best_rated = pl.read_database("SELECT netflix_data.film, movie_titles.title, COUNT(*) AS 'num_ratings', AVG(netflix_data.rating) AS 'avg_rating' \
                                FROM netflix_data, movie_titles \
                                WHERE netflix_data.film = movie_titles.film \
                                GROUP BY netflix_data.film, title \
                                HAVING num_ratings > 50 \
                                ORDER BY AVG(netflix_data.rating) DESC \
                                LIMIT 100", db_conn)

best_rated

| film (i64) | title (str) | num_ratings (i64) | avg_rating (f64) |
| --- | --- | --- | --- |
| 14961 | "Lord of the Ri…" | 83 | 4.759036 |
| 5582 | "Star Wars: Epi…" | 98 | 4.704082 |
| 7230 | "The Lord of th…" | 74 | 4.702703 |
| 7057 | "Lord of the Ri…" | 75 | 4.626667 |
| 16265 | "Star Wars: Epi…" | 89 | 4.58427 |
| 14550 | "The Shawshank …" | 151 | 4.576159 |
| 9628 | "Star Wars: Epi…" | 84 | 4.559524 |
| 14240 | "Lord of the Ri…" | 140 | 4.557143 |
| 10042 | "Raiders of the…" | 108 | 4.509259 |
| 12293 | "The Godfather" | 120 | 4.5 |


In [10]:
not_rated = pl.read_database("SELECT movie_titles.film, movie_titles.title \
                                FROM movie_titles \
                                WHERE movie_titles.film NOT IN (SELECT film FROM netflix_data)",  db_conn)

not_rated

| film (i64) | title (str) |
| --- | --- |
| 4 | "Paula Abdul's …" |
| 5 | "The Rise and F…" |
| 7 | "8 Man" |
| 9 | "Class of Nuke …" |
| 11 | "Full Frame: Do…" |
| 14 | "Nature: Antarc…" |
| 19 | "By Dawn's Earl…" |
| 21 | "Strange Relati…" |
| 29 | "Boycott" |
| 31 | "Classic Albums…" |
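
One caveat with `NOT IN` is worth keeping in mind: if the subquery returns a `NULL`, the comparison is never true and the query returns no rows at all. Whether `netflix_data.film` contains `NULL` values is not known here, so the following is only a defensive sketch on invented rows, showing the difference and a `NOT EXISTS` variant that is not affected.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE netflix_data (film INTEGER, rating INTEGER);
    CREATE TABLE movie_titles (film INTEGER, title TEXT);
    INSERT INTO movie_titles VALUES (1, 'Movie A'), (2, 'Movie B');
    -- One rating row has no film id (NULL), which trips up NOT IN.
    INSERT INTO netflix_data VALUES (1, 4), (NULL, 3);
""")

# NOT IN: the NULL in the subquery makes the test unknown for every unrated movie.
not_in = conn.execute("""
    SELECT film, title FROM movie_titles
    WHERE film NOT IN (SELECT film FROM netflix_data)
""").fetchall()

# NOT EXISTS: checks for a matching rating row per movie, so NULLs are harmless.
not_exists = conn.execute("""
    SELECT m.film, m.title FROM movie_titles AS m
    WHERE NOT EXISTS (SELECT 1 FROM netflix_data AS n WHERE n.film = m.film)
""").fetchall()

print(not_in)      # []
print(not_exists)  # [(2, 'Movie B')]
```

If the `film` column is guaranteed to be non-null, both forms return the same movies.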
