# Practical 8 (Part II) - Recommender System (Content-based Filtering)

Content-based recommender systems recommend items that are similar to an item the user already likes. Similarity between items is computed with a similarity measure such as Euclidean distance or cosine similarity. The system relies on item metadata (for movies: genre, director, description, actors, etc.) to make these recommendations.
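
To make the two similarity measures mentioned above concrete, here is a minimal sketch in plain Python. The three feature vectors are hypothetical and are not taken from the dataset; the function names are illustrative only.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between vectors a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical feature vectors: movie_b is movie_a scaled by 2,
# movie_c points in a different direction.
movie_a = [1.0, 2.0, 0.0]
movie_b = [2.0, 4.0, 0.0]
movie_c = [0.0, 1.0, 3.0]

print(cosine_similarity(movie_a, movie_b))   # same direction as movie_a
print(euclidean_distance(movie_a, movie_b))  # still a non-zero distance
print(cosine_similarity(movie_a, movie_c))   # different direction, lower score
```

Because `movie_b` is only a scaled copy of `movie_a`, cosine similarity treats the two as maximally similar, while Euclidean distance still reports a gap between them.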


In this practical you will build a basic content-based recommender system using the Movies Dataset, which is publicly available on Kaggle. The system recommends movies that are similar to a given movie. To achieve this, you will compute pairwise cosine similarity scores for all movies based on their plot descriptions and recommend the movies with the highest similarity scores.

Reference: 

Full dataset can be downloaded here: https://www.kaggle.com/rounakbanik/the-movies-dataset?select=movies_metadata.csv

The reference of this practical: https://www.datacamp.com/community/tutorials/recommender-systems-python


## Section 1 Data Preparation

"movies_metadata.csv" contains information on ~45,000 movies featured in the Full MovieLens dataset. Features include posters, backdrops, budget, genre, revenue, release dates, languages, production countries, and companies.

1. Let's load your movies metadata dataset into a pandas DataFrame:

In [None]:
# Import Pandas
import pandas as pd

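# Load Movies Metadata (assumes movies_metadata.csv is in the working directory)
# low_memory=False parses the file in one pass so column dtypes are inferred consistently
metadata = pd.read_csv('movies_metadata.csv', low_memory=False)

# Print the first three rows
metadata.head(3)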

In [None]:
metadata.shape

The dataset has 45,466 rows and 24 columns.

2.  Let's inspect the plots of a few movies:

In [None]:
# The plot description is stored in the 'overview' column of the metadata DataFrame.
metadata['overview'].head()

## Section 2 Features Generation

We now have a Natural Language Processing problem to solve: features must be extracted from the overview text before we can compute the similarity (or dissimilarity) between movies. To do this, we represent each overview (document) as a vector.

As the name suggests, word vectors are vectorized representations of the words in a document. Each vector summarises which words occur in the document and how distinctive they are. This section shows how to build Term Frequency-Inverse Document Frequency (TF-IDF) vectors, one per document.

TF-IDF produces a matrix in which each column represents a word in the overview vocabulary (all the words that appear in at least one document) and each row represents a movie. The TF-IDF score is the frequency of a word in a document, down-weighted by the number of documents in which that word occurs. This reduces the importance of words that occur frequently across plot overviews and, therefore, their influence on the final similarity score.
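
The weighting described above can be sketched from scratch on a tiny corpus. This is an illustration only: the three documents are hypothetical, and the score uses the plain form tf * log(N / df). scikit-learn's TfidfVectorizer applies a smoothed IDF and L2 normalisation by default, so its numbers would differ.

```python
import math
from collections import Counter

# Hypothetical three-document corpus standing in for movie overviews.
toy_docs = [
    "a hero saves the city",
    "a villain attacks the city",
    "a love story in paris",
]
tokenized = [doc.split() for doc in toy_docs]
vocab = sorted({word for tokens in tokenized for word in tokens})
n_docs = len(tokenized)

# Document frequency: number of documents that contain each word.
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

def tfidf_row(tokens):
    """TF-IDF vector for one document, using tf * log(N / df)."""
    counts = Counter(tokens)
    return [counts[word] * math.log(n_docs / doc_freq[word]) for word in vocab]

# One row per document, one column per vocabulary word.
toy_tfidf = [tfidf_row(tokens) for tokens in tokenized]

# Compare a word found in every document, in two documents, and in one.
for word in ("a", "city", "paris"):
    col = vocab.index(word)
    print(word, [round(row[col], 3) for row in toy_tfidf])
```

The word "a" appears in every document and therefore receives a weight of zero, while "paris", which appears in only one document, receives the largest weight.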

3. Now, let's use scikit-learn's built-in TfidfVectorizer class to produce the TF-IDF matrix, by following the steps below:

(i) Import TfidfVectorizer from scikit-learn;

(ii) Remove stop words like 'the', 'an', etc. since they do not give any useful information about the topic;

(iii) Replace not-a-number values with a blank string; 

(iv) Finally, construct the TF-IDF matrix on the data.

In [None]:
# Import TfidfVectorizer from scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

# Define a TF-IDF vectorizer object that removes English stop words such as 'the'
tfidf = TfidfVectorizer(stop_words='english')

#Replace NaN with an empty string
metadata['overview'] = metadata['overview'].fillna('')

#Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(metadata['overview'])

#Output the shape of tfidf_matrix
tfidf_matrix.shape

The vocabulary contains 75,827 distinct words (features) across the 45,466 movie overviews in the dataset.

4. Let's check out some of the features:

In [None]:
#Array mapping from feature integer indices to feature name.
tfidf.get_feature_names_out()[5000:5010]

With this matrix, we can use cosine similarity to compute a numeric score for the similarity between two movies. Cosine similarity is used because it is independent of vector magnitude and is relatively easy and fast to calculate.
Note that other metrics could be used instead of cosine similarity, such as Manhattan distance, Euclidean distance, or the Pearson correlation.

5. TfidfVectorizer L2-normalises each document vector by default, so the dot product between two vectors directly gives their cosine similarity score. Therefore, you will use sklearn's <i>linear_kernel()</i> instead of <i>cosine_similarity()</i>, since it is faster.
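
The equivalence between the dot product and cosine similarity for unit-length vectors can be checked with a small sketch. The vectors `u` and `v` and the helper names are hypothetical and are used only for illustration.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]

# Hypothetical unnormalised term weights for two documents.
u = [3.0, 0.0, 4.0]
v = [1.0, 2.0, 2.0]

# Cosine similarity computed from its definition.
cosine = dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Plain dot product after normalising each vector to unit length.
dot_of_unit_vectors = dot(l2_normalize(u), l2_normalize(v))

print(cosine, dot_of_unit_vectors)
```

Both values agree up to floating-point rounding, which is why a plain dot product is enough once the vectors are already normalised.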

In [None]:
# Import linear_kernel
from sklearn.metrics.pairwise import linear_kernel

# Compute the cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

In [None]:
cosine_sim.shape

The above returns a matrix of shape 45466x45466, which holds the cosine similarity score of each movie overview with every other movie overview. Hence, each movie corresponds to a 1x45466 row vector in which each column is the similarity score with one movie. A sample matrix is shown below:

![image.png](attachment:image.png)

In [None]:
#observing the first 6 rows and 6 columns
for i in range(6):
    print(cosine_sim[i][:6])
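
A matrix of this size can be demanding on memory. The following back-of-the-envelope sketch estimates its footprint, assuming the result is stored as a dense array of 64-bit floats (an assumption; check `cosine_sim.dtype` in your own environment).

```python
n_movies = 45466        # number of rows in the metadata DataFrame
bytes_per_value = 8     # assumption: one 64-bit float per similarity score

# A dense n x n matrix stores one value per pair of movies.
dense_bytes = n_movies * n_movies * bytes_per_value
print(f"{dense_bytes / 1024**3:.1f} GiB")
```

If this exceeds the memory available on your machine, working with a subset of the rows is one possible workaround.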

6. Next, we need to define a function that takes a movie title as input and outputs a list of the 10 most similar movies. For this, you first need a reverse mapping from movie titles to DataFrame indices. In other words, we need a way to look up the index of a movie given its title.

In [None]:
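# Construct a reverse map from movie titles to DataFrame indices
indices = pd.Series(metadata.index, index=metadata['title'])

# Keep only the first entry for each title so that a lookup returns a single index
indices = indices[~indices.index.duplicated(keep='first')]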

#check the first 10 indices
indices[:10]

## Section 3 Content-Based Filtering Recommender

Now let's build a content-based filtering recommender. Follow these steps:

(i) Get the index of the movie given its title.

(ii) Get the list of cosine similarity scores for that particular movie with all movies. Convert it into a list of tuples where the first element is its position, and the second is the similarity score.

(iii) Sort the aforementioned list of tuples based on the similarity scores; that is, the second element.

(iv) Get the top 10 elements of this list. Ignore the first element as it refers to self (the movie most similar to a particular movie is the movie itself).

(v) Return the titles corresponding to the indices of the top elements.

In [None]:
def get_recommendations(title, cosine_sim=cosine_sim):
    # Get the index of the movie that matches the title
    idx = indices[title]

    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 10 most similar movies
    sim_scores = sim_scores[1:11]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar movies
    return metadata['title'].iloc[movie_indices]
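
The function above sorts every similarity score even though only the top 10 are kept. As an optional variation, the sketch below compares a full sort with a partial selection using `heapq.nlargest` from the standard library. The scores are hypothetical and `top_n` is set to 2 to keep the output short.

```python
import heapq

# Hypothetical similarity scores of one movie (index 0) against five movies.
toy_scores = [1.0, 0.12, 0.57, 0.0, 0.33]
top_n = 2

# Pair each score with its position, as in step (ii).
pairs = list(enumerate(toy_scores))

# Full sort, then skip the first entry (the movie itself).
by_sort = sorted(pairs, key=lambda x: x[1], reverse=True)[1:top_n + 1]

# Partial selection: fetch top_n + 1 entries, then skip the first.
by_heap = heapq.nlargest(top_n + 1, pairs, key=lambda x: x[1])[1:]

print(by_sort)
print(by_heap)
```

Both approaches select the same movies; the partial selection simply avoids ordering the scores that are discarded anyway.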

In [None]:
get_recommendations('The Dark Knight Rises')                 # recommend movies similar to the given title

Now you may try out other movie titles.

## Exercise

Q1. <b>Building Credits, Genres, and Keywords Based Recommender</b>: You are required to build a recommender system based on the following metadata: the 3 top actors, the director, related genres, and the movie plot keywords.

Reference: https://www.datacamp.com/community/tutorials/recommender-systems-python

Q2. <b>Popularity filter</b>: Build a recommender that takes the 30 most similar movies, calculates their weighted ratings (using the IMDB formula from above), sorts the movies by this rating, and returns the top 10 movies.
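
As a starting point for Q2, here is a minimal sketch of a weighted rating helper. It assumes that "the IMDB formula" refers to the commonly cited form WR = (v / (v + m)) * R + (m / (v + m)) * C, which is not defined in this part of the practical. The function name, parameter names, and sample numbers are illustrative only.

```python
def weighted_rating(v, R, m, C):
    """IMDB-style weighted rating (assumed form of the formula).

    v: number of votes for the movie
    R: average rating of the movie
    m: minimum number of votes required to be considered
    C: mean rating across all movies
    """
    return (v / (v + m)) * R + (m / (v + m)) * C

# Hypothetical movie: 1000 votes averaging 8.0, with m = 100 and C = 6.0.
print(weighted_rating(v=1000, R=8.0, m=100, C=6.0))
```

A movie with few votes is pulled towards the overall mean `C`, while a movie with many votes keeps a score close to its own average `R`.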