In this notebook, we do the pre-processing necessary for our movie recommender Flask app.

This entails:
1. The loading and processing of a movie database with ratings from 600+ users.
2. The creation of the NMF base model object.
3. The creation of the database for the nbcfilter model.

Let's get started:

In [1]:
# Imports
from pathlib import Path
import os
import pandas as pd
import numpy as np
from sklearn.decomposition import NMF
from tqdm import tqdm
import dill as pickle
from sklearn.metrics.pairwise import cosine_similarity
import requests
#from ..scripts.config import OMDB_APIKEY as OMDB_APIKEY
OMDB_APIKEY = "A-very-cool-API-keys"

In [2]:
# Class and functions definitions

class ApiRequestLimitReached(Exception):
    """Custom exception to raise when request limit is reached"""

    pass

def get_movie_posters(movie_name: str, omdb_apikey: str = OMDB_APIKEY) -> str:
    """
    Returns the link to the movie poster.
    Note: Since cinemagoer (formerly IMDbPY) is extremely slow in handling requests, we use https://www.omdbapi.com/
    Only 1000 requests per day are allowed: after that, the API no longer returns poster links.
    """
    if ("poster_links" not in movies_information.columns) or (
        movies_information[movies_information.loc[:, "title"] == movie_name][
            "poster_links"
        ].values[0]
        == "http://www.interlog.com/~tfs/images/posters/TFSMoviePosterUnavailable.jpg"
    ):
        movie_id = movies_information[movies_information.loc[:, "title"] == movie_name][
            "imdbId"
        ].values[0]
        # We could also get the movie link directly (small image) with:
        # image = f"https://img.omdbapi.com/?apikey={omdb_apikey}&i=tt{movie_id:07}"
        # Get OMDB's JSON object of movie and get poster link from there (Amazon hosted and larger)
        # Note: with this many requests, our IP might be temp banned completely so that we don't receive
        # any response from the server. We should catch that error.
        try:
            get_json = requests.get(
                f"https://www.omdbapi.com/?apikey={omdb_apikey}&i=tt{movie_id:07}"
            ).json()
        except Exception as e:
            print("Get request failed: ", e)
            print(movie_name, ": left link as is because of API request error")
            return movies_information[movies_information.loc[:, "title"] == movie_name][
                "poster_links"
            ].values[0]
        # If we get a valid API response, process it
        if get_json["Response"] == "False":
            try:
                print(
                    movie_name,
                    ": left link as is b/c API returned False because: ",
                    get_json["Error"],
                )
            except KeyError:
                print(movie_name, ": left link as is b/c API returned False for reasons unknown")
            finally:
                raise ApiRequestLimitReached
        else:
            movie_poster_link = get_json.get("Poster")
            print(movie_name, ": added", movie_poster_link)
            return movie_poster_link
    else:
        print(movie_name, ": link already exists. Left as is.")
        return movies_information[movies_information.loc[:, "title"] == movie_name][
            "poster_links"
        ].values[0]


# Note: the cosine similarity functions below are not strictly necessary thanks to sklearn's cosine_similarity function.
# They just demonstrate how you would compute it manually.
def cosine_sim(vector_1, vector_2):
    """calculates cosine similarity between two vectors"""
    num = np.dot(vector_1, vector_2)
    denom = np.sqrt(np.dot(vector_1, vector_1)) * np.sqrt(np.dot(vector_2, vector_2))

    return num / denom


# Cosine table
def create_cosine_table(df):
    """Calculates the cosine similarity table (manually instead of using the sklearn method)"""
    data = []
    for user1 in tqdm(df.columns):
        row = []
        for user2 in df.columns:
            c = cosine_sim(df[user1], df[user2])
            row.append(c)
        data.append(row)
    cs = pd.DataFrame(data, index=df.columns, columns=df.columns).round(2)

    return cs
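
As a quick sanity check, the pairwise loop above can be compared against a vectorized computation. This is a minimal sketch on a made-up toy matrix; the helper name `cosine_matrix` and all values are assumptions for illustration and not part of the notebook's pipeline:

```python
import numpy as np

def cosine_sim(vector_1, vector_2):
    """Cosine similarity between two 1-D vectors."""
    num = np.dot(vector_1, vector_2)
    denom = np.sqrt(np.dot(vector_1, vector_1)) * np.sqrt(np.dot(vector_2, vector_2))
    return num / denom

def cosine_matrix(rows):
    """Hypothetical helper: pairwise cosine similarities between the rows of a 2-D array."""
    unit = rows / np.linalg.norm(rows, axis=1, keepdims=True)  # scale each row to length 1
    return unit @ unit.T

# Made-up ratings: 3 users (rows) x 3 movies (columns)
toy = np.array([[5.0, 3.0, 0.0], [4.0, 0.0, 1.0], [0.0, 2.0, 5.0]])
loop = np.array([[cosine_sim(a, b) for b in toy] for a in toy])
vectorized = cosine_matrix(toy)
print(np.allclose(loop, vectorized))
```

Both variants should agree up to floating point error; the vectorized form is what makes a library routine so much faster than the nested loop.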

We first start with some very basic EDA to ensure our dataset is ready.
The critical step here is to ensure that we get a dataset without any NaNs.

In [3]:
# Navigate out of subfolder
cwd = Path.cwd()
os.chdir(cwd.parent)

# Import datasets (start with the small one)
ratings = pd.read_csv(cwd.parent / "datasets" / "ml-latest-small" / "ratings.csv")
movies = pd.read_csv(cwd.parent / "datasets" / "ml-latest-small" / "movies.csv")

# Both dfs share a common key "movieId", let's merge on it:
merged_df = pd.merge(ratings, movies, on="movieId").sort_values(
    by=["userId", "movieId"], ascending=True
)

In [4]:
# Show some information about the dataset and clean checks
merged_df.info()
# Note: checking NaNs is very important because NMF (which we will use later) won't work with NaNs
print("Number of NaNs in the dataset: ", merged_df.rating.isna().sum())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100836 entries, 0 to 79014
Data columns (total 6 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100836 non-null  int64  
 1   movieId    100836 non-null  int64  
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  int64  
 4   title      100836 non-null  object 
 5   genres     100836 non-null  object 
dtypes: float64(1), int64(3), object(2)
memory usage: 5.4+ MB
Number of NaNs in the dataset:  0


The dataset is free of NaNs, so we do not need to impute anything yet. However, not every user has rated every movie, and for our modelling we still need a rating of some sort for each user-movie pair. We therefore impute now:

In [5]:
# We can now pivot the merged_df by userId...
df = pd.pivot_table(merged_df, index="userId", columns="title", values="rating")
# ... and now fill in the blanks
# ... with either global mean:
# df = df.fillna(df.mean().mean())
# ... or per-item averages
# df = df.fillna(df.mean())
# ... or better: per-user averages
user_mean = df.mean(axis=1)
df = df.transpose().fillna(user_mean).transpose()
# ... or even better: per-user averages per genre (i.e. average user rating for genres).
# user_genre_mean = merged_df.groupby(["userId", "genres"])["rating"].mean()
# Note: if genre unrated, we use global per-user averages
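
To make the per-user imputation more tangible, here is a small sketch on a made-up 3x3 rating table (the titles and values are assumptions). It shows the transpose-based fill next to an equivalent row-wise formulation:

```python
import numpy as np
import pandas as pd

# Toy user-by-movie rating matrix with gaps (made-up values)
toy = pd.DataFrame(
    {
        "Movie A": [4.0, np.nan, 2.0],
        "Movie B": [np.nan, 3.0, 4.0],
        "Movie C": [2.0, 5.0, np.nan],
    },
    index=pd.Index([1, 2, 3], name="userId"),
)
user_mean = toy.mean(axis=1)  # NaNs are skipped when averaging

# Transpose so that users become columns, fill each column with that user's mean, transpose back
filled_transpose = toy.transpose().fillna(user_mean).transpose()
# Equivalent row-wise formulation
filled_apply = toy.apply(lambda row: row.fillna(row.mean()), axis=1)

print(filled_transpose)
```

The transpose is needed because `fillna` with a Series matches the Series index against column labels, and the users only become columns after transposing.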

At this point we are ready to create a dataframe from which we can read general movie information. This lets us present it quickly to the user without an API request (which would require the API server to be up and our request limit not yet reached).
Luckily, our dataset has all the basic movie information, so we only need to bring it into the shape we want. However, it does not have links to the movie posters, which we want to show to the user as well. These have to be fetched via API requests to a movie database.
Since we are using the free and publicly available OMDB (an IMDB-like movie database), we are restricted to 1000 API requests per day. To work around this restriction, we create the movie information dataframe once and store it after reaching our 1000-request limit. We then reload it on another day, use up our 1000 requests again, store it, and so on.
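
The resume-on-another-day idea can be sketched without touching the real API. Everything here is hypothetical: the fetcher is a fake with a tiny request budget, the placeholder string and URLs are made up, and storing the table between "days" is left out:

```python
import pandas as pd

PLACEHOLDER = "placeholder.jpg"  # stand-in for the "poster unavailable" link

class RequestLimitReached(Exception):
    """Raised by the fake fetcher once its request budget is used up."""

def make_fake_fetcher(budget):
    """Hypothetical stand-in for the real API call; allows `budget` requests."""
    state = {"left": budget}

    def fetch(title):
        if state["left"] == 0:
            raise RequestLimitReached
        state["left"] -= 1
        return f"https://example.com/{title}.jpg"

    return fetch

def update_missing_links(table, fetch):
    """Fill placeholder links in place and stop once the limit is hit."""
    try:
        for idx, row in table.iterrows():
            if row["poster_links"] == PLACEHOLDER:  # only spend requests on missing links
                table.at[idx, "poster_links"] = fetch(row["title"])
    except RequestLimitReached:
        pass
    return table

table = pd.DataFrame({"title": ["a", "b", "c"], "poster_links": [PLACEHOLDER] * 3})
update_missing_links(table, make_fake_fetcher(budget=2))  # "day one"
remaining_day_one = int((table["poster_links"] == PLACEHOLDER).sum())
update_missing_links(table, make_fake_fetcher(budget=2))  # "day two"
remaining_day_two = int((table["poster_links"] == PLACEHOLDER).sum())
print(remaining_day_one, remaining_day_two)
```

Because rows that already have a link are skipped, each run only spends requests on what is still missing.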

##### Important: the code for storing the dataframe should only be run the very first time. Otherwise, we overwrite the already obtained movie poster links and have to start over.

In [6]:
df_links = pd.read_csv("datasets/ml-latest-small/links.csv")
movies_information = pd.merge(df_links, merged_df, on="movieId")
movies_information = movies_information.drop(
    ["userId", "movieId", "timestamp", "rating", "tmdbId"], axis=1
).drop_duplicates(subset=["imdbId"])
# Note: there are certain movieIds that have the same title. Let's remove them because the user will work with movie titles:
movies_information = movies_information.drop_duplicates(subset=["title"])
# Average ratings per item
movies_information = pd.merge(
    movies_information, pd.Series(df.mean(), name="avg_ratings_pitem"), on="title"
)
# Transform genres and extract year and title into separate columns
movies_information["genres"] = movies_information["genres"].str.split("|")
movies_information["year_only"] = movies_information["title"].str.extract(r"\((\d{4})\)")
# The lazy (.*?) keeps the whitespace before the year out of the captured title
movies_information["title_only"] = movies_information["title"].str.extract(r"(.*?)\s*\(\d{4}\)")

# Run this part only the first time!
# # Get movie poster links
# movies_information["poster_links"] = movies_information["title"].apply(get_movie_posters)
# # For all posters, we haven't gotten (yet), insert a "Movie Poster Unavailable" placeholder.
# movies_information["poster_links"] = movies_information["poster_links"].fillna(
#     "http://www.interlog.com/~tfs/images/posters/TFSMoviePosterUnavailable.jpg"
# )
# # Store it sorted by title
# movies_information = movies_information.sort_values(by="title")
# with open("model_objects/movies_information" + ".pkl", "wb") as file:
#     pickle.dump(movies_information, file)

In [7]:
# Since we are limited by the API (1000 requests per day), we will have to call the function on separate
# days to update all movie poster links:
# Load to update
with open("model_objects/movies_information" + ".pkl", "rb") as file:
    movies_information = pickle.load(file)
# Note: while this is slower than using the apply function, it lets us stop requesting links from the API
# once we have reached the daily request limit:
try:
    for index, row in movies_information.iterrows():
        movies_information.at[index, "poster_links"] = get_movie_posters(row["title"])
except ApiRequestLimitReached:
    print("API Request limit reached! Aborting the update...")

# Store updated df
with open("model_objects/movies_information" + ".pkl", "wb") as file:
    pickle.dump(movies_information, file)

'71 (2014) : link already exists. Left as is.
'Hellboy': The Seeds of Creation (2004) : link already exists. Left as is.
'Round Midnight (1986) : link already exists. Left as is.
'Salem's Lot (2004) : link already exists. Left as is.
'Til There Was You (1997) : link already exists. Left as is.
'Tis the Season for Love (2015) : link already exists. Left as is.
'burbs, The (1989) : link already exists. Left as is.
'night Mother (1986) : link already exists. Left as is.
(500) Days of Summer (2009) : link already exists. Left as is.
*batteries not included (1987) : link already exists. Left as is.
...All the Marbles (1981) : link already exists. Left as is.
...And Justice for All (1979) : link already exists. Left as is.
00 Schneider - Jagd auf Nihil Baxter (1994) : link already exists. Left as is.
1-900 (06) (1994) : link already exists. Left as is.
10 (1979) : link already exists. Left as is.
10 Cent Pistol (2015) : link already exists. Left as is.
10 Cloverfield Lane (2016) : link alrea

Now that we have the df containing the movie information, we can build our two models.
1. A non-negative matrix factorization model
2. A neighborhood-based collaborative filtering model (using cosine similarity)

Let's talk briefly about the advantages and disadvantages of each model:  

The NMF model:  
Advantages: rather fast once set up / implemented. It copes well with sparsity (gaps) in the data. NMF can work with non-ordinal data.  
Disadvantages: harder to debug (i.e., to understand what's going on).  

The NBCFilter model:  
Advantages: fast to implement, works for huge datasets, no domain knowledge necessary.  
Disadvantages: hard to include data other than ordinal data (e.g., ratings). It does not cope well with sparsity (i.e., large gaps / NaNs in the data), whereas NMF does. It is slow on larger datasets when run on weak processors.

## 1. Non-Negative Matrix Factorization model

In [9]:
# We create the model and store it for later use in our flask app.
model = NMF(n_components=20, init="random", max_iter=10000, random_state=42)
model.fit(df.values)
with open("model_objects/model_nmf" + ".pkl", "wb") as file:
    pickle.dump(model, file)

##### Example usage
Note this is not a necessary pre-processing step, since we will do this in our Flask app class.  

For the NMF we need 5 sets of information: U, D, R, P and Q  
U: the set of users  
D: the set of items (here movies)  
R: a user by items (UxD) matrix containing ratings on all the movies.  
Those we have already created earlier.  

Now, we define:  
P: a user by latent/hidden features (UxK) matrix  
Q: an items by latent/hidden features (DxK) matrix. Note that sklearn stores its transpose: `model.components_` has shape (KxD).  
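
To illustrate how R is approximated by the product of P and (the transpose of) Q, here is a minimal NumPy sketch. It uses the classic multiplicative update rule on made-up data, which is an assumption chosen for brevity and not necessarily the solver sklearn runs internally; the sizes and values are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k = 6, 8, 2  # toy sizes: U users, D items, K latent features

R = rng.uniform(1.0, 5.0, size=(n_users, n_items))  # made-up non-negative ratings (U x D)
P = rng.uniform(0.1, 1.0, size=(n_users, k))        # U x K
Qt = rng.uniform(0.1, 1.0, size=(k, n_items))       # K x D, i.e. the transpose of Q

eps = 1e-9  # avoids division by zero
initial_error = np.linalg.norm(R - P @ Qt)
for _ in range(200):
    # Multiplicative updates keep every entry non-negative
    Qt *= (P.T @ R) / (P.T @ P @ Qt + eps)
    P *= (R @ Qt.T) / (P @ Qt @ Qt.T + eps)
final_error = np.linalg.norm(R - P @ Qt)

print((P @ Qt).shape, final_error < initial_error)
```

The reconstruction `P @ Qt` has the same shape as R, so every user-movie pair gets a predicted rating, including pairs that were never rated.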

In [10]:
p_mat = model.transform(df.values)
q_mat = model.components_
predictions = np.dot(p_mat, q_mat)

This is an example of how we would **obtain our predictions for a new user**:

In [11]:
# We just make up a new user with the average ratings of all users 
# (which is our best bet when we don't know the user yet)
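# Note, both matrices have already been sorted alphabetically by movie title
# We copy the values so that the assignment further down does not overwrite movies_information in place
new_user = movies_information["avg_ratings_pitem"].to_numpy(copy=True).reshape(1, -1)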
# However, to actually get meaningful recommendations, the user needs to have watched and rated a few movies already
# Let's pick 10 movies randomly and create rather high ratings for these movies
# (np.random.randint excludes the upper bound, so the ratings are 3s and 4s):
movies_seen = df.sample(n=10, axis=1).columns
ratings_seen = np.random.randint(3, 5, 10).tolist()
print("The user has seen and rated the following: ", movies_seen, ratings_seen)
movies_seen_idx = df.columns.get_indexer(movies_seen)
new_user[0, movies_seen_idx] = ratings_seen
# Get the new user's P matrix...
new_user_p = model.transform(new_user)
# ...and the actual predictions for the ratings of all movies.
new_user_predictions = np.dot(new_user_p, q_mat)
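# It's probably better to sort them and find the highest ratings after turning them into a pd.Series.
# We reuse the index of movies_information so that pd.concat lines up the predictions row by row:
user_matrix = pd.concat(
    [
        movies_information,
        pd.Series(
            new_user_predictions.flatten(),
            index=movies_information.index,
            name="user_predictions",
        ),
    ],
    axis=1,
)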
print(
    "Top 5 predictions for our new user: ",
    user_matrix.sort_values(by="user_predictions", ascending=False)[:5],
)

The user has seen and rated the following:  Index(['Blind Fury (1989)', 'Last Days of Disco, The (1998)',
       'Ordinary People (1980)', 'Frozen (2010)', 'To End All Wars (2001)',
       'Others, The (2001)', 'Ernest & Célestine (Ernest et Célestine) (2012)',
       'Why Do Fools Fall In Love? (1998)', 'Day of the Jackal, The (1973)',
       'Bill Burr: I'm Sorry You Feel That Way (2014)'],
      dtype='object', name='title') [3, 4, 4, 4, 4, 4, 4, 4, 4, 4]
Top 5 predictions for our new user:         imdbId                                        title  \
7593  1528313  Nothing to Declare (Rien à déclarer) (2010)   
6865   824747                            Changeling (2008)   
3158   240515                   Freddy Got Fingered (2001)   
7680  1448755                          Killer Elite (2011)   
8001  2063781                               Smashed (2012)   

                       genres  avg_ratings_pitem year_only  \
7593                 [Comedy]           3.657620      2010   
686

Now we could do all sorts of recommendations, but before we do, we should remove movies the user has already seen.
We don't continue with this here because we will do this in the Flask app.
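
For orientation, removing already-seen movies before recommending could look roughly like this. It is only a sketch with made-up titles and predictions; the helper name `recommend` is hypothetical and is not the code used in the Flask app:

```python
import pandas as pd

titles = ["Movie A", "Movie B", "Movie C", "Movie D", "Movie E"]  # made-up titles
predictions = pd.Series([4.8, 3.1, 4.5, 2.0, 4.9], index=titles)   # made-up predicted ratings
movies_seen = ["Movie A", "Movie E"]                                # already rated by the user

def recommend(predictions, movies_seen, n=2):
    """Hypothetical helper: top-n predicted titles the user has not seen yet."""
    unseen = predictions.drop(labels=movies_seen, errors="ignore")
    return unseen.sort_values(ascending=False).head(n).index.tolist()

top = recommend(predictions, movies_seen)
print(top)
```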

## 2. Neighborhood-based collaborative filtering model (using cosine similarity)

For nbcfilter, we don't want to fill all NaNs with averages because we actually want to cluster unseen movies separately from those users have seen. Filling in `0` as a rating for any unseen movie will do just that.
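
A small made-up example shows why the fill value matters for cosine similarity. Two users who rated completely different movies end up with a similarity of 0 under zero-filling, but look very similar under per-user mean filling (all values are assumptions for illustration):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

nan = np.nan
# Two made-up users with no rated movie in common
user_a = np.array([5.0, 4.0, nan, nan])
user_b = np.array([nan, nan, 2.0, 1.0])

# Variant 1: unseen movies become 0
zero_a = np.nan_to_num(user_a, nan=0.0)
zero_b = np.nan_to_num(user_b, nan=0.0)
# Variant 2: unseen movies become the user's own mean rating
mean_a = np.where(np.isnan(user_a), np.nanmean(user_a), user_a)
mean_b = np.where(np.isnan(user_b), np.nanmean(user_b), user_b)

sim_zero = cosine(zero_a, zero_b)
sim_mean = cosine(mean_a, mean_b)
print(round(sim_zero, 2), round(sim_mean, 2))
```

With zeros, only movies that both users actually rated contribute to the dot product, which is the behaviour we want when looking for neighbors.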

In [12]:
df_cosim = pd.pivot_table(merged_df, index="userId", columns="title", values="rating")
df_cosim = df_cosim.fillna(0)
# We store this df because we will need it in our Flask app
with open("model_objects/nbcfilter_data" + ".pkl", "wb") as file:
    pickle.dump(df_cosim, file)

### Example usage

The approach is as follows:
- We generate a cosine similarity table for all users including our new user
- Then, we iterate through the new_user's unseen movies and obtain predictions for the new user based on the ratings given by similar users (i.e. neighbors) that have actually seen those movies.
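
The prediction step in the second bullet boils down to a similarity-weighted average. A minimal sketch, with a hypothetical helper name and made-up numbers, assuming the neighbors' ratings and similarities for one movie have already been collected:

```python
def predict_rating(neighbor_ratings, similarities):
    """Similarity-weighted average of the neighbors' ratings for one movie.

    Returns 0.0 if none of the neighbors has rated the movie.
    """
    numerator = sum(r * s for r, s in zip(neighbor_ratings, similarities))
    denominator = sum(similarities)
    return numerator / denominator if denominator > 0 else 0.0

# Made-up example: two neighbors rated the movie 5 and 3, with similarities 0.8 and 0.2
prediction = predict_rating([5.0, 3.0], [0.8, 0.2])
print(round(prediction, 1))
```

The more similar neighbor dominates the result. The notebook's own loop guards against an empty neighbor set by adding a tiny constant to the denominator instead of branching.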

In [13]:
# We first need to obtain the cosine similarity of our data matrix. There are two ways to go about that:
# The manual way with our own functions
cosim_table_manual = create_cosine_table(df_cosim.T)

# The convenient way with sklearn
cosim_table_convenient = cosine_similarity(df_cosim)
cosim_table_convenient = pd.DataFrame(
    cosim_table_convenient, index=df_cosim.index, columns=df_cosim.index
)

100%|██████████| 610/610 [00:21<00:00, 28.10it/s]


This is an example of how we would **obtain our predictions for a new user**:

In [14]:
# The approach for a new user is initially similar to the NMF model approach: we create a new user
# with a default rating for every movie except for the movies the user has already watched.
# The difference is that we don't use average ratings but 0s (see above)
new_user = np.zeros(df_cosim.shape[1])
movies_seen = movies_information["title"].sample(n=10).tolist()
ratings_seen = np.random.randint(3, 5, 10).tolist()
print("The user has seen and rated the following: ", movies_seen, ratings_seen)
# Look up the column positions in df_cosim (in the order of movies_seen) so that each rating lands on its own movie
movies_seen_idx = df_cosim.columns.get_indexer(movies_seen)
new_user[movies_seen_idx] = ratings_seen
# We now need to add the new user to the df containing all other users and create the cosine_similarity matrix
df_cosim.loc["new_user"] = new_user
cosim_table_convenient_added = cosine_similarity(df_cosim)
cosim_table_convenient_added = pd.DataFrame(
    cosim_table_convenient_added, index=df_cosim.index, columns=df_cosim.index
)
# First, we extract the movies that the user hasn't seen yet
movies_unseen = list(df_cosim.T.index[df_cosim.T["new_user"] == 0])
# Then, we find the highest ranking neighbor/s of our new user.
neighbors = list(
    cosim_table_convenient_added.loc["new_user", :].sort_values(ascending=False).index[1:3]
)  # Note: the first entry is the new_user themselves, so we skip it and take the next two

# Finally, we iterate through the new_user's unseen movies and obtain predictions for the new user
# based on the ratings given by similar users (i.e. neighbors) that have actually seen
# the unseen movies.
predicted_ratings_movies = []
for movie in movies_unseen:
    # Find people who watched the unseen movies
    others_seen = list(df_cosim.T.columns[df_cosim.T.loc[movie] > 0])
    numerator = 0
    denominator = 0.000001
    # go through users who are similar but watched the film
    for user in neighbors:
        if user in others_seen:
            # extract the ratings and similarities for similar users
            rating = df_cosim.T.loc[movie, user]
            similarity = cosim_table_convenient_added.loc["new_user", user]
            # predict rating based on the averaged rating of the neighbors
            # sum(ratings)/number of users OR
            # sum(ratings * similarity)/sum(similarities)
            numerator += rating * similarity
            denominator += similarity
    predicted_ratings = round(numerator / denominator, 1)
    predicted_ratings_movies.append([predicted_ratings, movie])
# Transform it into a df and sort it by rating for better viewing
predicted_rating_df = pd.DataFrame(
    predicted_ratings_movies, columns=["user_predictions", "title"]
).sort_values(by="user_predictions", ascending=False)
print(predicted_rating_df)

The user has seen and rated the following:  ['Pitch Perfect (2012)', 'Lost Highway (1997)', 'Better Off Dead... (1985)', 'Hook (1991)', 'Catch That Kid (2004)', 'Maniac (2012)', 'Dana Carvey: Straight White Male, 60 (2016)', 'Twelve Chairs, The (1970)', 'By the Law (1926)', 'Changeling, The (1980)'] [3, 3, 3, 4, 4, 3, 4, 4, 4, 3]
      user_predictions                                             title
5536               5.0                         Me, Myself & Irene (2000)
8701               5.0  Three Billboards Outside Ebbing, Missouri (2017)
4785               5.0                                    Kingpin (1996)
3496               5.0                    Godfather: Part II, The (1974)
7415               5.0                           Schindler's List (1993)
...                ...                                               ...
3241               0.0                               Fright Night (1985)
3242               0.0                               Fright Night (2011)
3243       

These are the predictions the nbcfilter model would return and we could now do all sorts of recommendations.