# ***Recommender Systems - Final Project***
### login: ***clovis.lechien***

## ***Summary:***

1. ***Imports***
    1. General uses
    1. Download uses
    1. Model uses
1. ***Downloading the datasets***
    1. MovieLens: ml-1m (movies, users and ratings)
    1. IMDb: title.basics and title.ratings
1. ***Merging the datasets***
    1. Both IMDbs together
    1. MovieLens items with the merged IMDb data
1. ***Preprocessing***
    1. Combining genres
    1. Dropping columns
    1. Numerical transformations
    1. Scaling
    1. One hot encoding genres
1. ***Ratings: Create the couple matrix***
    1. Creation of the matrix
    1. Utility function to find the best candidates
1. ***Creating our model***
    1. Collaborative filtering using SVD
    1. Training and prediction
1. ***Recommendations***
    1. Recommendation function
    1. Recommendation scaling
1. ***Checking our results***
    1. Pretty printer
    1. Checking each recommendation automatically for any issue
1. ***Evaluating our model***
    1. RMSE
    1. MAE
1. ***To go further***
    1. Transformers

# ***Imports***

We start by importing the necessary libraries: pandas, numpy and sklearn utilities for model building and evaluation.

The remaining libraries (requests, zipfile, io, os, gzip) are only used to download and unpack the datasets.

In [1]:
# USE CASE : General
import pandas as pd
import numpy as np

# USE CASE : Downloading the datasets
import requests
import zipfile
import io
import os
import gzip

# USE CASE : Model creation, training and evaluation
from sklearn.decomposition import TruncatedSVD
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.preprocessing import MultiLabelBinarizer, LabelEncoder

import tensorflow as tf



# ***Download the datasets***

## ***1 - MovieLens***

We then download the MovieLens dataset (ml-1m) using requests and extract it using zipfile.

We read the extracted files (ratings.dat, users.dat, movies.dat) into pandas DataFrames.

Finally, we print the first few rows of each DataFrame to verify the data loading.

In [2]:
url = "http://files.grouplens.org/datasets/movielens/ml-1m.zip"
response = requests.get(url)
zip_file = zipfile.ZipFile(io.BytesIO(response.content))

zip_file.extractall("ml-1m")

print(zip_file.namelist())

['ml-1m/', 'ml-1m/movies.dat', 'ml-1m/ratings.dat', 'ml-1m/README', 'ml-1m/users.dat']


We check the location of the downloaded files:

In [3]:
!ls ml-1m/ml-1m

README	movies.dat  ratings.dat  users.dat


We define the paths of our data, and read it into pandas dataframes.

df_rating: contains all the ratings of the users, one rating for one movie per row. \
df_users: contains all the users who have rated at least one movie. \
df_items: contains all the films, and metadata about those films.

We then create the user-item interaction matrix by pivoting our ratings dataframe. \
This matrix has users as rows, movies as columns, and ratings as values.

In [4]:
ratings_file = "ml-1m/ml-1m/ratings.dat"
users_file = "ml-1m/ml-1m/users.dat"
movies_file = "ml-1m/ml-1m/movies.dat"

df_rating = pd.read_csv(ratings_file, sep='::', engine='python', header=None, names=["UserId", "MovieId", "Rating", "Timestamp"])
df_users = pd.read_csv(users_file, sep='::', engine='python', header=None, names=["UserId", "Gender", "Age", "Occupation", "ZipCode"])
df_items = pd.read_csv(movies_file, sep='::', engine='python', header=None, names=["MovieId", "Title", "Genres"], encoding='latin-1')

df_matrix = df_rating.pivot(index="UserId", columns="MovieId", values="Rating")
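To make the pivot step concrete, here is a minimal sketch on a toy ratings table (the values are made up, and `toy_ratings`, `toy_matrix` and `density` are names introduced only for this illustration). It shows how unrated (user, movie) pairs become NaN, which is why the real matrix is so sparse.

```python
import pandas as pd

# Toy ratings in the same long format as df_rating (values are made up).
toy_ratings = pd.DataFrame({
    "UserId":  [1, 1, 2, 3],
    "MovieId": [10, 20, 10, 30],
    "Rating":  [5, 3, 4, 2],
})

# One row per user, one column per movie; unrated pairs become NaN.
toy_matrix = toy_ratings.pivot(index="UserId", columns="MovieId", values="Rating")
print(toy_matrix)

# Fraction of (user, movie) cells that actually hold a rating.
density = toy_matrix.notna().to_numpy().mean()
print(f"density: {density:.2f}")
```

Note that `pivot` raises an error if the same (UserId, MovieId) pair appears twice, so it only works when each user rates a movie at most once.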

Let's check all our new dataframes!

In [5]:
df_users

Unnamed: 0,UserId,Gender,Age,Occupation,ZipCode
0,1,F,1,10,48067
1,2,M,56,16,70072
2,3,M,25,15,55117
3,4,M,45,7,02460
4,5,M,25,20,55455
...,...,...,...,...,...
6035,6036,F,25,15,32603
6036,6037,F,45,1,76006
6037,6038,F,56,1,14706
6038,6039,F,45,0,01060


In [6]:
df_rating

Unnamed: 0,UserId,MovieId,Rating,Timestamp
0,1,1193,5,978300760
1,1,661,3,978302109
2,1,914,3,978301968
3,1,3408,4,978300275
4,1,2355,5,978824291
...,...,...,...,...
1000204,6040,1091,1,956716541
1000205,6040,1094,5,956704887
1000206,6040,562,5,956704746
1000207,6040,1096,4,956715648


In [7]:
df_items

Unnamed: 0,MovieId,Title,Genres
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
3878,3948,Meet the Parents (2000),Comedy
3879,3949,Requiem for a Dream (2000),Drama
3880,3950,Tigerland (2000),Drama
3881,3951,Two Family House (2000),Drama


In [8]:
df_matrix

UserId \ MovieId,1,2,3,4,5,6,7,8,9,10,...,3943,3944,3945,3946,3947,3948,3949,3950,3951,3952
1,5.0,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,,,,2.0,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6036,,,,2.0,,3.0,,,,,...,,,,,,,,,,
6037,,,,,,,,,,,...,,,,,,,,,,
6038,,,,,,,,,,,...,,,,,,,,,,
6039,,,,,,,,,,,...,,,,,,,,,,


## ***2 - IMDb***

We apply the same logic to the IMDb datasets. However, since they do not come in the same format (gzip-compressed TSV files instead of a zip archive), we use a different trick.

In [9]:
# UTILS

def ensure_dir(file_path):
    """Ensure the directory at file_path exists, creating it if needed."""
    os.makedirs(file_path, exist_ok=True)


def download_and_unzip_imdb(url, extract_to='.'):
    """
    Download the file at url and save it, still gzip-compressed, in extract_to.
    Despite its name, this function does not decompress anything:
    load_gzipped_tsv handles that when reading the file.
    Returns the path of the saved file.
    """
    ensure_dir(extract_to)
    response = requests.get(url)
    response.raise_for_status()  # stop early if the download failed
    tsv_path = os.path.join(extract_to, os.path.basename(url))
    
    with open(tsv_path, 'wb') as f:
        f.write(response.content)
    
    return tsv_path


def load_gzipped_tsv(file_path):
    """Load a gzip-compressed TSV file into a pandas dataframe."""
    with gzip.open(file_path, 'rt', encoding='utf-8') as f:
        return pd.read_csv(f, delimiter='\t')
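IMDb files mark missing values with `\N` (visible in the tables further down), which forces columns such as startYear and runtimeMinutes to be read as strings. As a hedged sketch of an alternative loader, the example below reads a tiny made-up sample; the sample rows and the names `sample_tsv` and `sample_df` are assumptions for illustration, not part of the notebook.

```python
import gzip
import io

import pandas as pd

# A tiny made-up sample in the IMDb layout, where "\N" marks a missing value.
sample_tsv = (
    "tconst\tstartYear\truntimeMinutes\n"
    "tt0000001\t1894\t1\n"
    "tt0000002\t\\N\t\\N\n"
)
compressed = gzip.compress(sample_tsv.encode("utf-8"))

# na_values turns "\N" into NaN at read time, so the columns parse as numbers;
# low_memory=False makes pandas infer each column's dtype from the whole file.
sample_df = pd.read_csv(
    io.BytesIO(compressed),
    sep="\t",
    compression="gzip",
    na_values="\\N",
    low_memory=False,
)
print(sample_df.dtypes)
```

With this option, the later `pd.to_numeric(..., errors='coerce')` calls would become unnecessary, at the cost of changing how missing genres appear downstream.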

First we define the URLs to download the 2 useful IMDb datasets.

Then we call our previous functions to handle the logic and transform these files into usable dataframes.

In [10]:
imdb_basics_url = 'https://datasets.imdbws.com/title.basics.tsv.gz'
imdb_ratings_url = 'https://datasets.imdbws.com/title.ratings.tsv.gz'

basics_path = download_and_unzip_imdb(imdb_basics_url, extract_to='./imdb')
ratings_path = download_and_unzip_imdb(imdb_ratings_url, extract_to='./imdb')

imdb_basics_df = load_gzipped_tsv(basics_path)
imdb_ratings_df = load_gzipped_tsv(ratings_path)



Here are our dataframes:

In [11]:
imdb_basics_df

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0,1894,\N,1,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,\N,5,"Animation,Short"
2,tt0000003,short,Pauvre Pierrot,Pauvre Pierrot,0,1892,\N,5,"Animation,Comedy,Romance"
3,tt0000004,short,Un bon bock,Un bon bock,0,1892,\N,12,"Animation,Short"
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,\N,1,"Comedy,Short"
...,...,...,...,...,...,...,...,...,...
10908179,tt9916848,tvEpisode,Episode #3.17,Episode #3.17,0,2009,\N,\N,"Action,Drama,Family"
10908180,tt9916850,tvEpisode,Episode #3.19,Episode #3.19,0,2010,\N,\N,"Action,Drama,Family"
10908181,tt9916852,tvEpisode,Episode #3.20,Episode #3.20,0,2010,\N,\N,"Action,Drama,Family"
10908182,tt9916856,short,The Wind,The Wind,0,2015,\N,27,Short


In [12]:
imdb_ratings_df

Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.7,2062
1,tt0000002,5.6,279
2,tt0000003,6.5,2030
3,tt0000004,5.4,180
4,tt0000005,6.2,2797
...,...,...,...
1453458,tt9916730,7.0,12
1453459,tt9916766,7.1,23
1453460,tt9916778,7.2,37
1453461,tt9916840,7.2,10


# ***Merging the datasets***

In anticipation of further work, we merge our dataframes together where it is possible and useful.

First we join the two IMDb dataframes on `tconst`, since together they give meaningful information and contain no personal ratings. \
We also convert the year from a string to a numerical value.

In [13]:
imdb_df = pd.merge(imdb_basics_df, imdb_ratings_df, on='tconst', how='inner')

# Here I convert years from string to float, so that I can use it later to merge with the other dataframe.
imdb_df['startYear'] = pd.to_numeric(imdb_df['startYear'], errors='coerce')
imdb_df

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,averageRating,numVotes
0,tt0000001,short,Carmencita,Carmencita,0,1894.0,\N,1,"Documentary,Short",5.7,2062
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892.0,\N,5,"Animation,Short",5.6,279
2,tt0000003,short,Pauvre Pierrot,Pauvre Pierrot,0,1892.0,\N,5,"Animation,Comedy,Romance",6.5,2030
3,tt0000004,short,Un bon bock,Un bon bock,0,1892.0,\N,12,"Animation,Short",5.4,180
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893.0,\N,1,"Comedy,Short",6.2,2797
...,...,...,...,...,...,...,...,...,...,...,...
1453458,tt9916730,movie,6 Gunn,6 Gunn,0,2017.0,\N,116,Drama,7.0,12
1453459,tt9916766,tvEpisode,Episode #10.15,Episode #10.15,0,2019.0,\N,43,"Family,Game-Show,Reality-TV",7.1,23
1453460,tt9916778,tvEpisode,Escape,Escape,0,2019.0,\N,\N,"Crime,Drama,Mystery",7.2,37
1453461,tt9916840,tvEpisode,Horrid Henry's Comic Caper,Horrid Henry's Comic Caper,0,2014.0,\N,11,"Adventure,Animation,Comedy",7.2,10


Since we will also merge the MovieLens films dataframe with our newly merged IMDb one, we need to clean the data a little so that both IMDb and MovieLens share the same format for the title.

In [14]:
# Here I extract years from the title, so that I can use it later to merge with the other dataframe.
df_items['Year'] = df_items['Title'].str.extract(r'\((\d{4})\)').astype(float)

# Here I remove the year [this format: (XXXX)] from the title so that when I merge both dfs they have the same format.
df_items['Title'] = df_items['Title'].apply(lambda x: ' '.join(x.split()[:-1]))
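Some MovieLens titles keep an alternate title in parentheses or move the article to the end (for example "'burbs, The"). The sketch below shows one possible way to normalize them further; `normalize_title` is a hypothetical helper, and whether the result matches IMDb's primaryTitle for a given film is an assumption to verify, not a guarantee.

```python
import re

import pandas as pd

titles = pd.Series([
    "Toy Story (1995)",
    "'burbs, The (1989)",
    "Zero Kelvin (Kjærlighetens kjøtere) (1995)",
])

# The year is the 4-digit group in the final pair of parentheses.
years = titles.str.extract(r"\((\d{4})\)\s*$")[0].astype(float)


def normalize_title(title):
    """Hypothetical helper: strip the year and an alternate title, move a trailing article to the front."""
    title = re.sub(r"\s*\(\d{4}\)\s*$", "", title)    # drop "(1995)"
    title = re.sub(r"\s*\([^)]*\)\s*$", "", title)    # drop "(alternate title)"
    match = re.match(r"^(.*), (The|A|An)$", title)
    if match:
        title = f"{match.group(2)} {match.group(1)}"  # "'burbs, The" -> "The 'burbs"
    return title


clean_titles = titles.map(normalize_title)
print(clean_titles.tolist())
```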

In [15]:
df_items

Unnamed: 0,MovieId,Title,Genres,Year
0,1,Toy Story,Animation|Children's|Comedy,1995.0
1,2,Jumanji,Adventure|Children's|Fantasy,1995.0
2,3,Grumpier Old Men,Comedy|Romance,1995.0
3,4,Waiting to Exhale,Comedy|Drama,1995.0
4,5,Father of the Bride Part II,Comedy,1995.0
...,...,...,...,...
3878,3948,Meet the Parents,Comedy,2000.0
3879,3949,Requiem for a Dream,Drama,2000.0
3880,3950,Tigerland,Drama,2000.0
3881,3951,Two Family House,Drama,2000.0


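We can now merge both datasets correctly.

Note that some MovieLens films find no IMDb match on title and year (for example, titles that MovieLens stores as "'burbs, The"), so we lose those entries once rows with missing values are dropped during preprocessing.

I decided this loss was acceptable.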

In [33]:
# Merge MovieLens and IMDb data on title and year
merged_df = pd.merge(df_items, imdb_df, left_on=['Title', 'Year'], right_on=['primaryTitle', 'startYear'], how='outer')
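To measure how many films find a match, a left merge with `indicator=True` can be used. This is only a sketch on made-up rows (the ratings and the `left`, `right` and `check` names are assumptions for illustration); a left merge keeps every MovieLens row, which is what the outer merge followed by the `MovieId` filter amounts to.

```python
import pandas as pd

# Made-up rows standing in for df_items and imdb_df.
left = pd.DataFrame({"Title": ["Toy Story", "'burbs, The"], "Year": [1995.0, 1989.0]})
right = pd.DataFrame({
    "primaryTitle": ["Toy Story", "The 'Burbs"],
    "startYear": [1995.0, 1989.0],
    "averageRating": [8.3, 6.9],
})

# how="left" keeps every left row; indicator=True adds a "_merge" column
# telling whether a match was found ("both") or not ("left_only").
check = pd.merge(
    left, right,
    left_on=["Title", "Year"], right_on=["primaryTitle", "startYear"],
    how="left", indicator=True,
)
print(check["_merge"].value_counts())
```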

In [34]:
merged_df = merged_df[merged_df["MovieId"].notna()].copy()
merged_df

Unnamed: 0,MovieId,Title,Genres,Year,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,averageRating,numVotes
1250,2031.0,"$1,000,000 Duck",Children's|Comedy,1971.0,,,,,,,,,,,
1959,3112.0,'Night Mother,Drama,1986.0,,,,,,,,,,,
2304,779.0,'Til There Was You,Drama|Romance,1997.0,tt0118523,movie,'Til There Was You,'Til There Was You,0,1997.0,\N,113,"Comedy,Romance",4.8,3012.0
2493,2072.0,"'burbs, The",Comedy,1989.0,,,,,,,,,,,
2832,3420.0,...And Justice for All,Drama|Thriller,1979.0,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1446633,1845.0,Zero Effect,Comedy|Thriller,1998.0,tt0120906,movie,Zero Effect,Zero Effect,0,1998.0,\N,116,"Comedy,Crime,Drama",6.9,15380.0
1446694,1364.0,Zero Kelvin (Kjærlighetens kjøtere),Action,1995.0,,,,,,,,,,,
1446986,1426.0,Zeus and Roxanne,Children's,1997.0,tt0120550,movie,Zeus and Roxanne,Zeus and Roxanne,0,1997.0,\N,98,"Adventure,Comedy,Family",5.3,3247.0
1449132,2698.0,Zone 39,Sci-Fi,1997.0,,,,,,,,,,,


# ***Preprocessing***

We already did some preprocessing while merging the datasets, but we will now do a little more so that the data is ready for any kind of work.

In [35]:
def combine_genres(genres_x, genres_y):
    """
    Take the two genre strings of a row (MovieLens uses '|' as separator, IMDb uses ',')
    and return a sorted list of all genres, to be one hot encoded later on.
    """
    set_x = set(genres_x.split('|'))
    set_y = set(genres_y.split(','))
    
    combined_genres = sorted(set_x.union(set_y))
    
    return combined_genres

In [36]:
def preprocessing(merged_df):
    """
    Apply general preprocessing on the data: drop unused columns
    and convert the remaining IMDb columns to numerical values for models.
    """
    df = merged_df.drop(['primaryTitle', 'originalTitle', 'startYear', 'endYear', 'isAdult', 'tconst', 'genres'], axis=1).copy()
    
    # '\N' and other non-numeric entries become NaN
    df['runtimeMinutes'] = pd.to_numeric(df['runtimeMinutes'], errors='coerce')
    df['averageRating'] = pd.to_numeric(df['averageRating'], errors='coerce')
    df['numVotes'] = pd.to_numeric(df['numVotes'], errors='coerce')

    # rows without IMDb data (or with a missing value) are dropped here
    df = df.dropna()
    
    return df

In [37]:
def scale_ratings(averageRating, original_scale, new_scale):
    """Rescale a rating (used because IMDb ratings go up to 10 while MovieLens ratings go up to 5)."""
    return averageRating * new_scale / original_scale

In [39]:
# We first apply our combine_genres function
df = merged_df.copy()
df["Genres"] = df["Genres"].fillna("")
df["genres"] = df["genres"].fillna("")

df['Genres'] = df.apply(lambda row: combine_genres(row['Genres'], row['genres']), axis=1)

# We then scale down the averageRating of IMDb
original_scale, new_scale = 10, 5
df['averageRating'] = df.apply(lambda row: scale_ratings(row['averageRating'], original_scale, new_scale), axis=1)

# and we finish by applying some general preprocessing
df = preprocessing(df)
df

Unnamed: 0,MovieId,Title,Genres,Year,titleType,runtimeMinutes,averageRating,numVotes
2304,779.0,'Til There Was You,"[Comedy, Drama, Romance]",1997.0,movie,113.0,2.40,3012.0
3934,889.0,1-900,"[Drama, Romance]",1994.0,movie,87.0,3.10,655.0
4404,2572.0,10 Things I Hate About You,"[Comedy, Drama, Romance]",1999.0,movie,97.0,3.65,390557.0
5239,1367.0,101 Dalmatians,"[Adventure, Children's, Comedy, Crime]",1996.0,movie,103.0,2.85,118104.0
5998,1203.0,12 Angry Men,"[Crime, Drama]",1957.0,movie,96.0,4.50,873397.0
...,...,...,...,...,...,...,...,...
1442450,2165.0,Your Friends and Neighbors,"[Comedy, Drama, Romance]",1998.0,movie,100.0,3.15,8372.0
1444739,3236.0,Zachariah,"[Comedy, Drama, Musical, Western]",1971.0,movie,93.0,2.90,786.0
1446633,1845.0,Zero Effect,"[Comedy, Crime, Drama, Thriller]",1998.0,movie,116.0,3.45,15380.0
1446986,1426.0,Zeus and Roxanne,"[Adventure, Children's, Comedy, Family]",1997.0,movie,98.0,2.65,3247.0


Below is the code to one-hot encode the genres, using the MultiLabelBinarizer from sklearn.

Once the genres are one-hot encoded, we don't need the Genres column anymore.

In [44]:
mlb = MultiLabelBinarizer()
genre_df = pd.DataFrame(mlb.fit_transform(df['Genres']), columns=mlb.classes_, index=df.index)
df = df.join(genre_df)

df = df.drop('Genres', axis=1)
df

Unnamed: 0,MovieId,Title,Year,titleType,runtimeMinutes,averageRating,numVotes,Action,Adult,Adventure,...,Mystery,Reality-TV,Romance,Sci-Fi,Short,Sport,Thriller,War,Western,\N
2304,779.0,'Til There Was You,1997.0,movie,113.0,2.40,3012.0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
3934,889.0,1-900,1994.0,movie,87.0,3.10,655.0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
4404,2572.0,10 Things I Hate About You,1999.0,movie,97.0,3.65,390557.0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
5239,1367.0,101 Dalmatians,1996.0,movie,103.0,2.85,118104.0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
5998,1203.0,12 Angry Men,1957.0,movie,96.0,4.50,873397.0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1442450,2165.0,Your Friends and Neighbors,1998.0,movie,100.0,3.15,8372.0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
1444739,3236.0,Zachariah,1971.0,movie,93.0,2.90,786.0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
1446633,1845.0,Zero Effect,1998.0,movie,116.0,3.45,15380.0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
1446986,1426.0,Zeus and Roxanne,1997.0,movie,98.0,2.65,3247.0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
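The `\N` column in the output above comes from IMDb's missing-value marker being treated as a genre. As a hedged sketch, the toy example below filters such placeholders before encoding; `PLACEHOLDERS`, `raw`, `cleaned` and `encoded` are names introduced for this illustration only.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Toy genre lists, including the two placeholder labels that can slip in.
raw = pd.Series([["Comedy", "Drama"], ["\\N", "Drama"], ["", "Action", "Comedy"]])

# Labels that do not describe a real genre (assumption: only these two occur).
PLACEHOLDERS = {"", "\\N"}
cleaned = raw.map(lambda genres: [g for g in genres if g not in PLACEHOLDERS])

# One column per genre, 1 when the film has that genre and 0 otherwise.
mlb = MultiLabelBinarizer()
encoded = pd.DataFrame(mlb.fit_transform(cleaned), columns=mlb.classes_, index=cleaned.index)
print(encoded)
```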


# ***Create the couple matrix***

We define a function to create a couple matrix by averaging the ratings of two users. \
The function takes the user-item interaction matrix and two user IDs as inputs (the couple).

It keeps only the movies rated by both users and calculates the average of their two ratings for each of these movies. \
We then demonstrate this function by creating a couple matrix for users 4169 and 1680 and displaying the result.

We chose these two users specifically because they have rated the most films.

In [45]:
def create_couple_matrix(df_matrix, user1_id, user2_id):
    # we get the existing ratings for both users
    user1_ratings = df_matrix.loc[user1_id]
    user2_ratings = df_matrix.loc[user2_id]
    
    # we keep only the movies that have been rated by both users
    common_ratings = user1_ratings.notna() & user2_ratings.notna()
    
    # Calculate the average rating for each movie rated by both users
    couple_ratings = (user1_ratings[common_ratings] + user2_ratings[common_ratings]) / 2
    
    return couple_ratings
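A quick check of the averaging logic on a toy matrix (made-up ratings; `u1` and `u2` are hypothetical user labels): only the movies rated by both users survive, and each keeps the mean of the two ratings.

```python
import numpy as np
import pandas as pd

# Two users, four movies; NaN means "not rated".
toy_matrix = pd.DataFrame(
    {10: [4.0, 2.0], 20: [5.0, np.nan], 30: [np.nan, 3.0], 40: [1.0, 5.0]},
    index=pd.Index(["u1", "u2"], name="UserId"),
)

# Movies rated by both users.
both = toy_matrix.loc["u1"].notna() & toy_matrix.loc["u2"].notna()

# Average of the two ratings on those movies only.
couple_ratings = (toy_matrix.loc["u1"][both] + toy_matrix.loc["u2"][both]) / 2
print(couple_ratings)

# Equivalent one-liner: drop columns with a missing rating, then average per movie.
same = toy_matrix.loc[["u1", "u2"]].dropna(axis=1).mean()
```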


In [46]:
def get_highest_2_couple():
    """
    Return a Series with 2 entries: the UserId as index,
    and the number of rated films as the value.
    """
    res = df_rating.groupby(['UserId']).count() \
                             .sort_values(['Rating'], ascending=False)
    return res['Rating'].head(2)

get_highest_2_couple()

UserId
4169    2314
1680    1850
Name: Rating, dtype: int64

In [47]:
# Example: we create a couple matrix for users 4169 and 1680
couple = (4169, 1680)

couple_matrix = create_couple_matrix(df_matrix, couple[0], couple[1])
couple_matrix

MovieId
2       3.5
3       2.5
5       2.5
6       2.5
7       3.5
       ... 
3930    3.0
3932    4.0
3936    4.5
3946    2.5
3948    4.0
Length: 1334, dtype: float64

# ***Create our Model***

We define a function to train an SVD model using the TruncatedSVD class from sklearn.

We fill the missing values in the user-item interaction matrix with zeros, then fit the SVD model on it.

In [48]:
def train_svd(user_item_matrix, n_components=25):
    """Create and train the SVD model (missing ratings are filled with 0)."""
    svd = TruncatedSVD(n_components=n_components)
    svd.fit(user_item_matrix.fillna(0))
    return svd


def predict_ratings(svd, user_item_matrix):
    """Use the model to predict ratings, as the low-rank reconstruction of the matrix."""
    predicted_ratings = svd.inverse_transform(svd.transform(user_item_matrix.fillna(0)))
    predicted_ratings_df = pd.DataFrame(predicted_ratings, index=user_item_matrix.index, columns=user_item_matrix.columns)
    return predicted_ratings_df

# Add the couple matrix to the user-item interaction matrix
df_matrix.loc['couple'] = couple_matrix

# Train the SVD model
svd = train_svd(df_matrix)

# Predict ratings for all movies
predicted_ratings = predict_ratings(svd, df_matrix)

# here is the unscaled rating predicted for the couple
print("unscaled rating predicted for the couple :")
predicted_ratings.loc['couple']

unscaled rating predicted for the couple :


MovieId
1      -0.103751
2       3.840868
3       1.789342
4       1.578840
5       1.840425
          ...   
3948    1.174912
3949    1.977074
3950    0.438250
3951    0.432108
3952    1.848349
Name: couple, Length: 3706, dtype: float64
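The predictions are the rank-limited reconstruction of the zero-filled matrix, which is why they may fall outside the 1 to 5 range (the zeros are treated as real ratings). Here is a minimal sketch on random made-up data; the matrix size, the sparsity and `random_state=0` are assumptions chosen for the illustration.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import TruncatedSVD

# Random made-up ratings between 1 and 5, with about 60% of the cells blanked out.
rng = np.random.default_rng(0)
ratings = rng.integers(1, 6, size=(20, 12)).astype(float)
ratings[rng.random(ratings.shape) < 0.6] = np.nan
toy_matrix = pd.DataFrame(ratings)

# Same recipe as train_svd / predict_ratings: fill with 0, project, reconstruct.
filled = toy_matrix.fillna(0)
svd = TruncatedSVD(n_components=3, random_state=0)
reconstructed = svd.inverse_transform(svd.fit_transform(filled))
predicted = pd.DataFrame(reconstructed, index=toy_matrix.index, columns=toy_matrix.columns)

# The reconstruction is not constrained to the original rating scale.
print(predicted.min().min(), predicted.max().max())
```

Fixing `random_state` also makes the result reproducible, since TruncatedSVD uses a randomized solver by default.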

In [49]:
# we will shift our results by the minimum predicted value, so that they are no longer negative
mini = np.min(predicted_ratings.loc['couple'])
mini

-0.9539045049087531

# ***Recommendation Function***

Here we create our function to recommend movies based on the predicted ratings of our model.

We drop all films already seen by at least one of the partners, and recommend only unseen ones.

In [58]:
def recommend_movies(predicted_ratings, user1_id, user2_id, df_rating, n_recommendations=5):
    """Recommend n_recommendations films for the couple, based on the predicted ratings."""
    # films already rated by at least one of the two users
    user1_seen_movies = df_rating[df_rating['UserId'] == user1_id]['MovieId']
    user2_seen_movies = df_rating[df_rating['UserId'] == user2_id]['MovieId']
    seen_movies = pd.concat([user1_seen_movies, user2_seen_movies]).unique()
    
    # best predicted films among the unseen ones
    couple_predictions = predicted_ratings.loc['couple']
    recommendations = couple_predictions.drop(seen_movies).sort_values(ascending=False)
    return recommendations.head(n_recommendations)

# Recommend movies for the couple
recommended_movies = recommend_movies(predicted_ratings, couple[0], couple[1], df_rating)
recommended_movies

MovieId
1449    3.261945
3911    2.156973
24      2.081295
1734    1.912779
2094    1.907970
Name: couple, dtype: float64

In [59]:
# We need to shift our results upwards because the SVD reconstruction gives out negative values

scaled_recommendations = recommended_movies + np.abs(mini)
scaled_recommendations

MovieId
1449    4.215850
3911    3.110877
24      3.035199
1734    2.866683
2094    2.861875
Name: couple, dtype: float64
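Adding `|mini|` is a constant shift: it moves the lowest prediction to 0 and keeps the ranking unchanged, but it does not guarantee that scores stay below 5. The sketch below compares it with clipping to the MovieLens range, which is an alternative I did not use in this notebook; the scores and labels are made up.

```python
import pandas as pd

# Made-up predicted scores; the labels stand in for MovieIds.
scores = pd.Series([3.26, 2.16, -0.10, -0.95], index=["m1", "m2", "m3", "m4"])

# Shifting by |min| moves the lowest score to 0.
shifted = scores + abs(scores.min())

# Clipping instead forces every score into the MovieLens range [1, 5].
clipped = scores.clip(lower=1, upper=5)

# A constant shift never changes the order of the recommendations.
same_order = list(shifted.sort_values(ascending=False).index) == list(scores.sort_values(ascending=False).index)
print(shifted.round(2).tolist(), clipped.tolist(), same_order)
```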

# ***Pretty Printer and checking for errors***

In [60]:
def pp_recommendations(recommended_movies):
    """
    Take the series of recommended films, transform it into a dataframe
    and add back all the metadata about the films.
    Note: the rows come from merged_df, so averageRating is the unscaled IMDb
    rating and the one hot encoded genre columns of df stay empty.
    """
    reco = pd.DataFrame(columns=df.columns)
    for movie_id in recommended_movies.index:
        reco = pd.concat([reco, merged_df[merged_df['MovieId'] == movie_id]])
    return reco

recommendations = pp_recommendations(scaled_recommendations)
recommendations



Unnamed: 0,MovieId,Title,Year,titleType,runtimeMinutes,averageRating,numVotes,Action,Adult,Adventure,...,Western,\N,Genres,tconst,primaryTitle,originalTitle,isAdult,startYear,endYear,genres
1391503,1449.0,Waiting for Guffman,1996.0,movie,84.0,7.4,30912.0,,,,...,,,Comedy,tt0118111,Waiting for Guffman,Waiting for Guffman,0.0,1996.0,\N,Comedy
143111,3911.0,Best in Show,2000.0,movie,90.0,7.5,67369.0,,,,...,,,Comedy,tt0218839,Best in Show,Best in Show,0.0,2000.0,\N,Comedy
943237,24.0,Powder,1995.0,movie,111.0,6.6,32803.0,,,,...,,,Drama|Sci-Fi,tt0114168,Powder,Powder,0.0,1995.0,\N,"Drama,Fantasy,Mystery"
835697,1734.0,My Life in Pink (Ma vie en rose),1997.0,,,,,,,,...,,,Comedy|Drama,,,,,,,
996433,2094.0,"Rocketeer, The",1991.0,,,,,,,,...,,,Action|Adventure|Sci-Fi,,,,,,,


In [63]:
df_rating['MovieId']

0          1193
1           661
2           914
3          3408
4          2355
           ... 
1000204    1091
1000205    1094
1000206     562
1000207    1096
1000208    1097
Name: MovieId, Length: 1000209, dtype: int64

In [64]:
# This function checks that the recommended film has indeed not been rated by the given user.
def checking_truthfulness(MovieId, UserId):
    # MovieId is compared as a number: df_rating['MovieId'] holds integers, not strings
    return df_rating.loc[(df_rating['MovieId'] == MovieId) & (df_rating['UserId'] == UserId)].empty


# This function applies the previous one to both members of the couple, on all recommended films.
def check_results(recommended_movies, couple_ids):
    user_1 = recommended_movies['MovieId'].apply(lambda movie_id: checking_truthfulness(movie_id, couple_ids[0])).all()
    user_2 = recommended_movies['MovieId'].apply(lambda movie_id: checking_truthfulness(movie_id, couple_ids[1])).all()
    return bool(user_1 and user_2)


check_results(recommendations, couple)

True

# ***Evaluate our Model***

In [65]:
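# Function to evaluate the model using RMSE and MAE
def evaluate_model(svd, user_item_matrix, df_rating_test):
    test_matrix = df_rating_test.pivot(index='UserId', columns='MovieId', values='Rating')
    # Align the columns with the training matrix. Caveat: movies absent from the test set
    # become all-zero columns (fill_value=0) and are scored below as true ratings of 0.
    test_matrix = test_matrix.reindex(columns=user_item_matrix.columns, fill_value=0)
    
    # Caveat: these predictions are a reconstruction of the test matrix itself
    test_predicted_ratings = predict_ratings(svd, test_matrix)
    
    y_true = test_matrix.values.flatten()
    y_pred = test_predicted_ratings.values.flatten()
    
    # keep only the cells that hold a value
    mask = ~np.isnan(y_true)
    y_true = y_true[mask]
    y_pred = y_pred[mask]
    
    # RMSE is the square root of the MSE (squared=False is deprecated since scikit-learn 1.4)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    mae = mean_absolute_error(y_true, y_pred)
    
    return rmse, mae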

In [66]:
# we split the data into train and test sets
train_data, test_data = train_test_split(df_rating, test_size=0.2, random_state=42)

# we build the user-item matrix from the training ratings
train_matrix = train_data.pivot(index='UserId', columns='MovieId', values='Rating')

# we train the SVD model on the training matrix
svd = train_svd(train_matrix)

# and finally we evaluate the model on the test data
rmse, mae = evaluate_model(svd, train_matrix, test_data)
print(f"RMSE: {rmse}")
print(f"MAE: {mae}")

RMSE: 1.1212835478416474
MAE: 0.3584361684879515
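A stricter way to evaluate would be to predict from the training matrix only and to compare the result with the held-out ratings, pair by pair. The sketch below does this on made-up data; it is not what the notebook ran, and the data sizes, `n_components=5` and the variable names are assumptions for the illustration.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

# Made-up long-format ratings standing in for df_rating.
rng = np.random.default_rng(0)
pairs = pd.MultiIndex.from_product(
    [range(30), range(15)], names=["UserId", "MovieId"]
).to_frame(index=False)
pairs = pairs.sample(frac=0.5, random_state=0)
pairs["Rating"] = rng.integers(1, 6, size=len(pairs))

train, test = train_test_split(pairs, test_size=0.2, random_state=0)

# Fit and reconstruct on the training ratings only.
train_matrix = train.pivot(index="UserId", columns="MovieId", values="Rating")
svd = TruncatedSVD(n_components=5, random_state=0)
reconstructed = svd.inverse_transform(svd.fit_transform(train_matrix.fillna(0)))
predicted = pd.DataFrame(reconstructed, index=train_matrix.index, columns=train_matrix.columns)

# Keep only test pairs whose user and movie were both seen during training.
known = test["UserId"].isin(predicted.index) & test["MovieId"].isin(predicted.columns)
test_known = test[known]

# Look up the prediction for each held-out (user, movie) pair.
rows = predicted.index.get_indexer(test_known["UserId"])
cols = predicted.columns.get_indexer(test_known["MovieId"])
y_pred = predicted.to_numpy()[rows, cols]
y_true = test_known["Rating"].to_numpy()

rmse = np.sqrt(mean_squared_error(y_true, y_pred))
mae = mean_absolute_error(y_true, y_pred)
print(f"RMSE: {rmse:.3f}  MAE: {mae:.3f}")
```

Because only real held-out ratings enter the metrics, the numbers from this approach are not comparable with the RMSE and MAE printed above.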


# ***Going further***

I tried to implement a model based on transformers, using the example from the course (unlike here, where I implemented collaborative filtering on my own).

However, I kept running into the same issue during the training of the model (all the previous steps were working fine, only the training was misbehaving).

This was my error:
> RuntimeError: CUDA error: device-side assert triggered
> CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
> For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
> Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

I am still not sure why this happened, and I could not solve the issue despite many attempts.
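One common cause of this kind of error is an index that falls outside an embedding table (or a label outside the expected range). Whether that was the cause here is only an assumption, since the code is no longer in the notebook. MovieLens ids are a plausible suspect: MovieId goes up to 3952 while only 3706 movies have ratings. The sketch below remaps ids to contiguous indices with `pd.factorize`; the toy rows and column names `user_idx` and `movie_idx` are made up for the illustration.

```python
import pandas as pd

# Made-up ratings with sparse ids, like MovieLens where MovieId reaches 3952
# although only 3706 distinct movies have ratings.
toy = pd.DataFrame({"UserId": [1, 1, 7, 42], "MovieId": [10, 3952, 10, 500]})

# factorize maps each distinct id to a contiguous code 0..n-1.
toy["user_idx"], user_ids = pd.factorize(toy["UserId"])
toy["movie_idx"], movie_ids = pd.factorize(toy["MovieId"])

n_users, n_movies = len(user_ids), len(movie_ids)

# Every index must satisfy 0 <= idx < table size before it reaches an embedding layer.
assert toy["user_idx"].between(0, n_users - 1).all()
assert toy["movie_idx"].between(0, n_movies - 1).all()
print(n_users, n_movies, toy["movie_idx"].tolist())
```

Running the same range check on the CPU before training usually gives a readable error instead of the asynchronous CUDA one.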

I removed the code so that it would not clutter the notebook, but I still have saved versions on Kaggle just in case.


I also briefly tried BPR, but without good results: whatever the id of the person rating, the recommendations remained unchanged.