# Chapter 4 - Content Based Recommenders

In this chapter, we are going to build two types of content-based recommender:

* **Plot description-based recommender**: This model compares the descriptions and taglines of different movies, and provides recommendations that have the most similar plot descriptions.
* **Metadata-based recommender**: This model takes a host of features, such as genres, keywords, cast, and crew, into consideration and provides recommendations that are the most similar with respect to the aforementioned features.

In [1]:
# Initial Libraries
import pandas as pd
import numpy as np

In [2]:
# Read data
file = 'data/metadata_clean.csv'

df = pd.read_csv(file, low_memory=False)

df.head()

Unnamed: 0,title,genres,runtime,vote_average,vote_count,year
0,Toy Story,"['animation', 'comedy', 'family']",81.0,7.7,5415.0,1995
1,Jumanji,"['adventure', 'fantasy', 'family']",104.0,6.9,2413.0,1995
2,Grumpier Old Men,"['romance', 'comedy']",101.0,6.5,92.0,1995
3,Waiting to Exhale,"['comedy', 'drama', 'romance']",127.0,6.1,34.0,1995
4,Father of the Bride Part II,['comedy'],106.0,5.7,173.0,1995


In [3]:
#Import the original file
orig_df = pd.read_csv('data/movies_metadata.csv', low_memory=False)

#Add the useful features into the cleaned dataframe
df['overview'], df['id'] = orig_df['overview'], orig_df['id']

df.head()

Unnamed: 0,title,genres,runtime,vote_average,vote_count,year,overview,id
0,Toy Story,"['animation', 'comedy', 'family']",81.0,7.7,5415.0,1995,"Led by Woody, Andy's toys live happily in his ...",862
1,Jumanji,"['adventure', 'fantasy', 'family']",104.0,6.9,2413.0,1995,When siblings Judy and Peter discover an encha...,8844
2,Grumpier Old Men,"['romance', 'comedy']",101.0,6.5,92.0,1995,A family wedding reignites the ancient feud be...,15602
3,Waiting to Exhale,"['comedy', 'drama', 'romance']",127.0,6.1,34.0,1995,"Cheated on, mistreated and stepped on, the wom...",31357
4,Father of the Bride Part II,['comedy'],106.0,5.7,173.0,1995,Just when George Banks has recovered from his ...,11862


### CountVectorizer

CountVectorizer is the simplest type of vectorizer and is best explained with the help of an example. Imagine that we have three documents, A, B, and C,
which are as follows:

    A: The sun is a star.
    B: My love is like a red, red rose
    C: Mary had a little lamb

We now have to convert these documents into their vector forms using CountVectorizer. The first step is to build the vocabulary. The
vocabulary is the set of unique words present across all documents. Therefore, the vocabulary for this set of three documents is as follows: the,
sun, is, a, star, my, love, like, red, rose, mary, had, little, lamb. Consequently, the size of the vocabulary is 14.

It is common practice to not include extremely common words such as a, the, is, had, my, and so on (also known as stop words) in the vocabulary.
Therefore, eliminating the stop words, our vocabulary, V, is as follows:

V: like, little, lamb, love, mary, red, rose, sun, star

The size of our vocabulary is now nine. Therefore, our documents will be represented as nine-dimensional vectors, and each dimension here will
represent the number of times a particular word occurs in a document. In other words, the first dimension will represent the number of times like occurs, the second will represent the number of times little occurs, and so on. Therefore, using the CountVectorizer approach, A, B, and C will now be
represented as follows:

    A: (0, 0, 0, 0, 0, 0, 0, 1, 1)
    B: (1, 0, 0, 1, 0, 2, 1, 0, 0)
    C: (0, 1, 1, 0, 1, 0, 0, 0, 0)
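
The vectors above can be reproduced with a few lines of plain Python. This is a minimal sketch, not scikit-learn's `CountVectorizer`: the stop-word set and the fixed vocabulary order are assumptions taken from this toy example, and `count_vector` is a hypothetical helper name.

```python
import re
from collections import Counter

docs = {
    "A": "The sun is a star.",
    "B": "My love is like a red, red rose",
    "C": "Mary had a little lamb",
}

# Stop words chosen by hand for this toy example (an assumption,
# not the stop-word list that scikit-learn ships with).
stop_words = {"a", "the", "is", "had", "my"}

# Vocabulary in the same order as V in the text.
vocabulary = ["like", "little", "lamb", "love", "mary", "red", "rose", "sun", "star"]

def count_vector(text):
    """Return the word counts of `text`, one entry per vocabulary word."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in stop_words)
    return [counts[word] for word in vocabulary]

count_vectors = {name: count_vector(text) for name, text in docs.items()}
for name, vec in count_vectors.items():
    print(name, vec)
```

Note that document B has a 2 in the position for red, because the word occurs twice in that document.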

<div style="text-align:center;">
    <img src='images/tfidf.jpg'>
</div>

The weight $w_{i,j}$ takes values between 0 and 1.
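
Whether the weights stay between 0 and 1 depends on how the vectors are normalized. The sketch below assumes the variant that scikit-learn's `TfidfVectorizer` uses by default: a smoothed inverse document frequency, $\ln\frac{1 + N}{1 + df_i} + 1$, followed by L2 normalization of each document vector. The toy corpus and the helper name `tfidf_vector` are hypothetical.

```python
import math
from collections import Counter

# Toy corpus (hypothetical): tokens are lower-cased with stop words removed.
corpus = [
    ["sun", "star"],
    ["love", "like", "red", "red", "rose"],
    ["mary", "little", "lamb"],
]
n_docs = len(corpus)

# Document frequency: the number of documents that contain each term.
doc_freq = Counter(term for doc in corpus for term in set(doc))

def tfidf_vector(doc):
    """Return a {term: weight} dict for one document, L2-normalized."""
    counts = Counter(doc)
    raw = {
        term: tf * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, tf in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}

weights = [tfidf_vector(doc) for doc in corpus]
for vec in weights:
    print({term: round(w, 3) for term, w in vec.items()})
```

Because every document vector is scaled to unit length and all raw weights are non-negative, each individual weight falls in the range 0 to 1.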

In [4]:
#Import TfidfVectorizer from the scikit-learn library
from sklearn.feature_extraction.text import TfidfVectorizer

#Define a TF-IDF Vectorizer Object. Remove all english stopwords
tfidf = TfidfVectorizer(stop_words='english')

#Replace NaN with an empty string
df['overview'] = df['overview'].fillna('')

#Construct the required TF-IDF matrix by applying the fit_transform method on the overview feature
tfidf_matrix = tfidf.fit_transform(df['overview'])

#Output the shape of tfidf_matrix
tfidf_matrix.shape

(45032, 75357)

<div style="text-align:center;">
    <img src='images/css.jpg' width='600'>
</div>

In [5]:
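# Import linear_kernel to compute the dot product
from sklearn.metrics.pairwise import linear_kernel

# Compute the cosine similarity matrix. TfidfVectorizer L2-normalizes each row by default,
# so the plain dot product of two rows already equals their cosine similarity
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)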
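
The reason a dot product is enough here can be checked on two small vectors. This is a minimal plain-Python sketch; the vectors `u` and `v` and the helper names are hypothetical and not part of the movie data.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(u, v):
    """Cosine similarity computed from its definition."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

u = [1.0, 2.0, 0.0]
v = [2.0, 1.0, 1.0]

cos_uv = cosine(u, v)
dot_normalized = dot(l2_normalize(u), l2_normalize(v))

# The two values agree up to floating-point rounding.
print(cos_uv, dot_normalized)
```

Once the vectors have unit length, the denominator of the cosine formula is 1, so skipping it saves work on a matrix of this size.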

In [6]:
#Construct a reverse mapping from movie titles to row indices.
#Note: drop_duplicates() compares the Series values (the row indices), which are already unique, so repeated titles are kept
indices = pd.Series(df.index, index=df['title']).drop_duplicates()

indices

title
Toy Story                          0
Jumanji                            1
Grumpier Old Men                   2
Waiting to Exhale                  3
Father of the Bride Part II        4
                               ...  
Caged Heat 3000                45027
Robin Hood                     45028
Century of Birthing            45029
Betrayal                       45030
Queerama                       45031
Length: 45032, dtype: int64

### Plot description based recommender

Our plot description-based recommender will take in a movie title as an argument and output a list of movies that are most similar based on their
plots. These are the steps we are going to perform in building this model:

1. Obtain the data required to build the model
2. Create TF-IDF vectors for the plot description (or overview) of every movie
3. Compute the pairwise cosine similarity score of every movie
4. Write the recommender function that takes in a movie title as an argument and outputs movies most similar to it based on the plot

### Recommender Function

We will perform the following steps in building the recommender function:

1. Declare the title of the movie as an argument.
2. Obtain the index of the movie from the indices reverse mapping.
3. Get the list of cosine similarity scores for that particular movie with all movies using cosine_sim. Convert this into a list of tuples where the first element is the position and the second is the similarity score.
4. Sort this list of tuples on the basis of the cosine similarity scores.
5. Get the top 10 elements of this list. Ignore the first element as it refers to the similarity score with itself (the movie most similar to a particular movie is obviously the movie itself).
6. Return the titles corresponding to the indices of the top 10 elements, excluding the first:

In [7]:
# Function that takes in movie title as input and gives recommendations 

def content_recommender(title, cosine_sim=cosine_sim, df=df, indices=indices):
    # Obtain the index of the movie that matches the title
    idx = indices[title]

    # Get the pairwise similarity scores of all movies with that movie
    # And convert it into a list of tuples as described above
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on the cosine similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 10 most similar movies. Ignore the first movie.
    sim_scores = sim_scores[1:11]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar movies
    return df['title'].iloc[movie_indices]

In [8]:
#Get recommendations for The Lion King
content_recommender('The Lion King')

34682               Manson Family Vacation
9353                    Tipping the Velvet
9115                              Thursday
42829                              Big Jet
25654    Cheech & Chong Get Out of My Room
17041                     Bird of Paradise
27933                      Little Monsters
6094                  Pauline at the Beach
37409                   The Driftless Area
3203                        Beyond the Mat
Name: title, dtype: object
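
The ranking steps inside `content_recommender` are easier to verify on a tiny, hand-made similarity matrix. The sketch below is an illustration only: the titles, the scores in `toy_sim`, and the names `title_to_idx` and `toy_recommender` are all hypothetical.

```python
# Hypothetical titles and a symmetric similarity matrix with 1.0 on the diagonal.
titles = ["Movie A", "Movie B", "Movie C", "Movie D"]
toy_sim = [
    [1.0, 0.2, 0.7, 0.0],
    [0.2, 1.0, 0.1, 0.5],
    [0.7, 0.1, 1.0, 0.3],
    [0.0, 0.5, 0.3, 1.0],
]

# Reverse mapping from title to row position.
title_to_idx = {title: i for i, title in enumerate(titles)}

def toy_recommender(title, top_n=2):
    """Return the top_n titles most similar to `title`, excluding itself."""
    idx = title_to_idx[title]
    # Pair every position with its similarity score.
    sim_scores = list(enumerate(toy_sim[idx]))
    # Sort by score, highest first.
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Skip the first entry (the movie itself) and keep the next top_n.
    sim_scores = sim_scores[1:top_n + 1]
    return [titles[i] for i, _ in sim_scores]

recs = toy_recommender("Movie A")
print(recs)  # ['Movie C', 'Movie B']
```

Skipping the first entry relies on the movie's similarity with itself being the highest score in its row, which holds here because the diagonal is 1.0.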

# Metadata Based Recommender

To build this model, we will be using the following metadata:
* The genre of the movie.
* The director of the movie. This person is part of the crew.
* The movie's three major stars. They are part of the cast.
* Sub-genres or keywords.

In [None]:
# Load the keywords and credits files
# Paths follow the 'data/' directory used by the earlier cells
cred_df = pd.read_csv('data/credits.csv')
key_df = pd.read_csv('data/keywords.csv')