# Simple Recommenders

Simple recommenders offer the same generalized recommendations to every user, based on item popularity and/or genre. The underlying idea is that items that are more popular and critically acclaimed are more likely to be liked by the average audience.\
Example application: the IMDB Top 250.\
In short, simple recommenders are basic systems that recommend the top items according to a chosen metric or score.

In this section, we will proceed as follows:

1. Load and visualize the data.

2. Decide on the metric or score to rate movies on.

3. Calculate the score for every movie.

4. Sort the movies based on the score and output the top results.

In [1]:
import numpy as np
import pandas as pd

# Load Movies Metadata
metadata = pd.read_csv('/Users/Philippine/Documents/Cours/Maths/M2/Projet_annuel/TP/movies_metadata.csv', low_memory=False)

# Print the first three rows
metadata.head(3)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0


Each movie comes with several pieces of information: the budget, the original language, the original title, the release date, the revenue, the runtime and, most importantly, the movie's average rating and its number of votes.

We now need to choose a metric to rank the movies from most to least liked. Since the rating of a movie with very few votes is much less representative of general opinion than the rating of a movie with many votes, we use the following metric:

$$\text{WeightedRating (WR)} = \frac{v}{v+m} \cdot R + \frac{m}{v+m} \cdot C$$

where $v$ is the number of votes for the movie, $m$ is the minimum number of votes required to be listed in the chart, $R$ is the movie's average rating, and $C$ is the mean rating across all movies in the dataset.
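
One way to read this metric, offered as a short derivation rather than something stated in the dataset documentation: the two weights sum to one, so the weighted rating is a blend of $R$ and $C$ that moves toward $R$ as the number of votes grows.

$$
\mathrm{WR} = \frac{v}{v+m}\,R + \frac{m}{v+m}\,C = C + \frac{v}{v+m}\,(R - C)
$$

$$
v = 0 \Rightarrow \mathrm{WR} = C, \qquad v = m \Rightarrow \mathrm{WR} = \frac{R + C}{2}, \qquad v \to \infty \Rightarrow \mathrm{WR} \to R
$$

A movie with few votes is therefore scored close to the global mean, whatever its own average rating, while a heavily voted movie keeps a score close to its own average.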

The dataset provides $v$ and $R$ directly, and $C$ is easy to compute. $m$ is a hyperparameter that we must choose. It acts as a preliminary filter that removes the movies with fewer than $m$ votes; how selective to make this filter is up to you.
In this example, we set the cutoff $m$ to the 90th percentile of the vote counts. In other words, for a movie to be featured in the chart, it must have at least as many votes as 90% of the movies in the dataset.
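
To make the percentile cutoff concrete, here is a minimal dependency-free sketch. The `percentile` helper and the vote counts are hypothetical illustrations, not part of the dataset; the helper uses linear interpolation, which is also the default interpolation of pandas' `quantile`.

```python
def percentile(values, q):
    """Return the q-th quantile (0 <= q <= 1) using linear interpolation."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q              # fractional position in the sorted list
    lower = int(pos)                          # index just below that position
    upper = min(lower + 1, len(ordered) - 1)  # index just above, clamped to the end
    frac = pos - lower
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])


# Hypothetical vote counts for ten movies (not taken from the dataset)
vote_counts = [3, 12, 7, 150, 42, 9, 880, 25, 4, 61]

# Cutoff: the 90th percentile of the vote counts
m = percentile(vote_counts, 0.90)

# Only movies with at least m votes qualify for the chart
qualified = [v for v in vote_counts if v >= m]
print(m, qualified)
```

With ten movies, only the single most-voted one passes the filter, which shows how selective a 90th-percentile cutoff is.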

In [2]:
# Calculate mean of vote average column
C = metadata['vote_average'].mean()
print('Mean rating across all movies:', C, '/10')

Mean rating across all movies: 5.618207215133889 /10


In [3]:
# Calculate the minimum number of votes required to be in the chart, m
m = metadata['vote_count'].quantile(0.90)
print('Minimum number of votes required:', m)

Minimum number of votes required: 160.0


In [4]:
# Keep only the qualified movies (at least m votes) in a new DataFrame
q_movies = metadata.loc[metadata['vote_count'] >= m].copy()
print('Number of movies after filtering:', q_movies.shape[0])

Number of movies after filtering: 4555


The next step is the most important one: computing the weighted rating of each movie in the new DataFrame. We therefore define a function `weighted_rating(x, m=m, C=C)` that uses the `vote_count` ($v$) and `vote_average` ($R$) columns of the new DataFrame `q_movies`.
Finally, we sort the movies in `q_movies` by this score and display the top results.

In [9]:
# Function that computes the weighted rating of each movie
def weighted_rating(x, m=m, C=C):
    v = x['vote_count']
    R = x['vote_average']
    # Calculation based on the IMDB formula
    return (v/(v+m) * R) + (m/(m+v) * C)

# Define a new feature 'score' and calculate its value with `weighted_rating()`
q_movies['score'] = q_movies.apply(weighted_rating, axis=1)

#Sort movies based on score calculated above
q_movies = q_movies.sort_values('score', ascending=False)

#Print the top 10 movies
q_movies[['title', 'vote_count', 'vote_average', 'score']].head(10)

| | title | vote_count | vote_average | score |
|---|---|---|---|---|
| 314 | The Shawshank Redemption | 8358.0 | 8.5 | 8.445869 |
| 834 | The Godfather | 6024.0 | 8.5 | 8.425439 |
| 10309 | Dilwale Dulhania Le Jayenge | 661.0 | 9.1 | 8.421453 |
| 12481 | The Dark Knight | 12269.0 | 8.3 | 8.265477 |
| 2843 | Fight Club | 9678.0 | 8.3 | 8.256385 |
| 292 | Pulp Fiction | 8670.0 | 8.3 | 8.251406 |
| 522 | Schindler's List | 4436.0 | 8.3 | 8.206639 |
| 23673 | Whiplash | 4376.0 | 8.3 | 8.205404 |
| 5481 | Spirited Away | 3968.0 | 8.3 | 8.196055 |
| 2211 | Life Is Beautiful | 3643.0 | 8.3 | 8.187171 |
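
As a sanity check on the ranking, the scores can be recomputed by hand from the values printed earlier in this section (`C` and `m`) and the vote figures in the table. This is a standalone sketch: its `weighted_rating` takes plain numbers rather than a DataFrame row, unlike the function used in the notebook.

```python
def weighted_rating(v, R, m, C):
    """IMDB-style weighted rating: a vote-weighted blend of R and C."""
    return (v / (v + m)) * R + (m / (v + m)) * C


# Values printed earlier in this section
C = 5.618207215133889   # mean vote_average over all movies
m = 160.0               # 90th percentile of vote_count

# Vote counts and average ratings read from the table
shawshank = weighted_rating(v=8358, R=8.5, m=m, C=C)
ddlj = weighted_rating(v=661, R=9.1, m=m, C=C)

print(round(shawshank, 6))
print(round(ddlj, 6))
```

These reproduce the 8.445869 and 8.421453 shown in the table. Dilwale Dulhania Le Jayenge has the higher average rating (9.1 against 8.5), but with only 661 votes its score is pulled further toward the global mean, so it ranks below The Shawshank Redemption.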


# Content-based Recommender

In this section, we will learn how to build a system that recommends movies that are similar to a particular movie. To achieve this, we will compute pairwise cosine similarity scores for all movies based on their plot descriptions and recommend the movies with the highest similarity scores.

The plot description is available as the overview feature in the 'metadata' dataset.

In [5]:
#Print plot overviews of the first 5 movies.
q_movies['overview'].head()

0    Led by Woody, Andy's toys live happily in his ...
1    When siblings Judy and Peter discover an encha...
4    Just when George Banks has recovered from his ...
5    Obsessive master thief, Neil McCauley leads a ...
8    International action superstar Jean Claude Van...
Name: overview, dtype: object

The problem at hand is a Natural Language Processing problem. Hence we need to extract some kind of features from the above text data before we can compute the similarity and/or dissimilarity between them. To put it simply, it is not possible to compute the similarity between any two overviews in their raw forms. To do this, we need to compute the word vectors of each overview or document, as it will be called from now on.

As the name suggests, word vectors are vectorized representations of the words in a document. Dense word embeddings carry semantic meaning: words that are used in similar contexts, such as man and king, end up with vector representations close to each other. The TF-IDF vectors used below are simpler. They only record how characteristic each word is of a document and do not capture the meaning of the words.

We will compute Term Frequency-Inverse Document Frequency (TF-IDF) vectors for each document. This will give us a matrix where each column represents a word in the overview vocabulary (all the words that appear in at least one document), and each row represents a movie.

In its essence, the TF-IDF score is the frequency of a word occurring in a document, down-weighted by the number of documents in which it occurs. This is done to reduce the importance of words that frequently occur in plot overviews and, therefore, their significance in computing the final similarity score.
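
The down-weighting can be illustrated with a small pure-Python sketch. The three overviews are made up, and the helper `tfidf` is hypothetical. It uses the smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, the variant scikit-learn documents as its default, but it skips the stop-word removal and row normalization that scikit-learn also applies.

```python
import math
from collections import Counter

# Three tiny, made-up "overviews" (hypothetical, not from the dataset)
docs = [
    "a toy cowboy meets a toy astronaut",
    "a thief plans a heist",
    "a boy finds a magic board game",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each word appear?
df = Counter(word for tokens in tokenized for word in set(tokens))


def tfidf(tokens):
    """Raw term count times a smoothed inverse document frequency."""
    tf = Counter(tokens)
    return {w: tf[w] * (math.log((1 + n_docs) / (1 + df[w])) + 1) for w in tf}


weights = tfidf(tokenized[0])
for word, weight in sorted(weights.items(), key=lambda item: -item[1]):
    print(f"{word:10s} {weight:.3f}")
```

In the first overview, "a" and "toy" both occur twice, yet "toy" receives the larger weight because it appears in only one document, whereas "a" appears in all three.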

Fortunately, scikit-learn gives us a built-in `TfidfVectorizer` class that produces the TF-IDF matrix in a couple of lines.

We will proceed as follows:

1. Import `TfidfVectorizer` from scikit-learn;
2. Remove stop words like 'the', 'an', etc. since they do not give any useful information about the topic;
3. Replace not-a-number values with a blank string;
4. Finally, construct the TF-IDF matrix on the data.

In [10]:
#Import TfidfVectorizer from scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

#Define a TF-IDF Vectorizer Object. Remove all english stop words such as 'the', 'a'
tfidf = TfidfVectorizer(stop_words='english')

#Replace NaN with an empty string
q_movies['overview'] = q_movies['overview'].fillna('')

#Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(q_movies['overview'])

#Output the shape of tfidf_matrix
print('Shape of tfidf_matrix:', tfidf_matrix.shape)

#Array mapping from feature integer indices to feature name
#(get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
tfidf.get_feature_names_out()[500:510].tolist()

Shape of tfidf_matrix: (4555, 19694)


['adaption',
 'adapts',
 'add',
 'addams',
 'added',
 'addict',
 'addicted',
 'addiction',
 'addictions',
 'adding']

From the above output, we see that the vocabulary contains 19,694 distinct words across our dataset of 4,555 movies.

With this matrix in hand, we can now compute a similarity score. There are several metrics that we can use for this, such as the Manhattan and Euclidean distances, the Pearson correlation, and the cosine similarity. There is no single right answer as to which one is best. Different metrics work well in different scenarios, and it is often a good idea to experiment with several of them and compare the results.

We will be using the cosine similarity to calculate a numeric quantity that denotes the similarity between two movies. We use the cosine similarity score since it is independent of magnitude and is relatively easy and fast to calculate (especially when used in conjunction with TF-IDF scores).
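
The two properties mentioned above can be checked with a small dependency-free sketch. The vectors and helper names are hypothetical. The sketch shows that rescaling a vector leaves the cosine similarity unchanged, and that for unit-length vectors the cosine similarity reduces to a plain dot product, which is what makes it cheap to compute on normalized TF-IDF rows.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    return dot(a, b) / (norm(a) * norm(b))


def normalize(a):
    """Rescale a non-zero vector to unit length."""
    length = norm(a)
    return [x / length for x in a]


# Hypothetical term-weight vectors over a three-word vocabulary
short = [1.0, 2.0, 0.0]
scaled = [3.0, 6.0, 0.0]    # same direction as `short`, three times longer
partial = [2.0, 1.0, 0.0]   # shares both words with `short`, different weights
disjoint = [0.0, 0.0, 5.0]  # no word in common with `short`

same_direction = cosine_similarity(short, scaled)
no_overlap = cosine_similarity(short, disjoint)

# For unit-length vectors, the dot product is the cosine similarity
unit_dot = dot(normalize(short), normalize(partial))
print(same_direction, no_overlap, unit_dot)
```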

In [11]:
# Import linear_kernel
from sklearn.metrics.pairwise import linear_kernel

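# Compute the cosine similarity matrix. TfidfVectorizer L2-normalizes each row
# by default, so the plain dot product from linear_kernel equals the cosine similarity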
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
cosine_sim[1]

array([0.01892297, 1.        , 0.        , ..., 0.        , 0.        ,
       0.        ])

We are going to define a function that takes a movie title as input and outputs a list of the 10 most similar movies. For this, we first need a reverse mapping from movie titles to DataFrame indices. In other words, we need a mechanism to identify the index of a movie in our q_movies DataFrame, given its title.

In [12]:
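#Construct a reverse map from movie titles to DataFrame index labels,
#keeping the first entry when several movies share the same title
indices = pd.Series(q_movies.index, index=q_movies['title'])
indices = indices[~indices.index.duplicated(keep='first')]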
indices[:10]

title
Toy Story                       0
Jumanji                         1
Father of the Bride Part II     4
Heat                            5
Sudden Death                    8
GoldenEye                       9
The American President         10
Dracula: Dead and Loving It    11
Balto                          12
Casino                         15
dtype: int64

We will now proceed as follows:

1. Get the index of the movie given its title.

2. Get the list of cosine similarity scores for that particular movie with all movies. Convert it into a list of tuples where the first element is its position, and the second is the similarity score.

3. Sort the aforementioned list of tuples based on the similarity scores; that is, the second element.

4. Get the top 10 elements of this list. Ignore the first element as it refers to self (the movie most similar to a particular movie is the movie itself).

5. Return the titles corresponding to the indices of the top elements.
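
The five steps above can be sketched on a toy example before applying them to the real data. The titles, the hand-written similarity matrix and the helper `most_similar` are all hypothetical. The sketch uses plain Python lists instead of the DataFrame and the TF-IDF similarity matrix.

```python
# Hypothetical titles and a hand-written symmetric similarity matrix
titles = ["Movie A", "Movie B", "Movie C", "Movie D"]
sim = [
    [1.0, 0.2, 0.7, 0.0],
    [0.2, 1.0, 0.1, 0.4],
    [0.7, 0.1, 1.0, 0.3],
    [0.0, 0.4, 0.3, 1.0],
]

# Step 1: map each title to its row position in the matrix
position = {title: i for i, title in enumerate(titles)}


def most_similar(title, k=2):
    idx = position[title]
    # Step 2: pair every position with its similarity to the query movie
    scores = list(enumerate(sim[idx]))
    # Step 3: sort by similarity, highest first
    scores.sort(key=lambda pair: pair[1], reverse=True)
    # Step 4: drop the query movie itself, then keep the k best matches
    top = [pair for pair in scores if pair[0] != idx][:k]
    # Step 5: translate positions back into titles
    return [titles[i] for i, _ in top]


result = most_similar("Movie A")
print(result)
```

Unlike slicing off the first element, this sketch removes the query movie by position. That choice still excludes the right entry if another movie happens to tie with a similarity of 1.0, for instance because of an identical overview.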

In [13]:
# Function that takes in movie title as input and outputs most similar movies

def get_recommendations(title, cosine_sim=cosine_sim):
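    # Get the index label of the movie that matches the title
    label = indices[title]

    # cosine_sim is indexed by row position, not by index label, and the two
    # differ once q_movies has been filtered, so convert the label to a position
    idx = q_movies.index.get_loc(label)

    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))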

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 10 most similar movies
    sim_scores = sim_scores[1:11]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar movies
    return q_movies['title'].iloc[movie_indices]

In [17]:
get_recommendations('Toy Story')

15348               Toy Story 3
2997                Toy Story 2
10301    The 40 Year Old Virgin
1071      Rebel Without a Cause
3057            Man on the Moon
2157          Indecent Proposal
10585               Match Point
3145       White Men Can't Jump
1884             Child's Play 3
16585      I Spit on Your Grave
Name: title, dtype: object

Given that we liked Toy Story, this recommender suggests closely related movies, such as the sequels in the series (Toy Story 2 and 3) or movies involving toys.