# Plot Description Based Recommender

#### In the previous exercises we built, among others, an IMDB Top 250 recommender that takes a genre as input. Such a system is not very sophisticated, though. For example, if Alice likes movies such as The Dark Knight, Iron Man, and X-Men, the genre-based recommender could only recommend 'action movies'; it would not detect that all of these are superhero movies. The intended audience can also differ. The movies 'Forgetting Sarah Marshall' and 'The Hangover' were both released in the 21st century, are both comedies, and are both two hours in length. However, the former is for all audiences, while the latter is, well, just think back to their night in Vegas.

#### Obviously, we could try to find more metadata (such as subgenres), but it is probably more effective and efficient to ask users which movie they like and then find similar ones. This is, of course, the approach of businesses like Netflix. Today, we will look more closely at content-based recommenders (ignoring collaborative and hybrid approaches for now) and focus on two types: a plot description-based recommender and a metadata-based recommender.

In [1]:
#Import the relevant packages
import pandas as pd
import numpy as np

#Import data from the clean Movie metadata file - we have cleaned it in another example (the knowledge-based recommender)
df = pd.read_csv('../data/metadata_clean.csv')

#Print the head of the cleaned DataFrame
df.head()

Unnamed: 0,title,genres,runtime,vote_average,vote_count,year
0,Toy Story,"['animation', 'comedy', 'family']",81.0,7.7,5415.0,1995
1,Jumanji,"['adventure', 'fantasy', 'family']",104.0,6.9,2413.0,1995
2,Grumpier Old Men,"['romance', 'comedy']",101.0,6.5,92.0,1995
3,Waiting to Exhale,"['comedy', 'drama', 'romance']",127.0,6.1,34.0,1995
4,Father of the Bride Part II,['comedy'],106.0,5.7,173.0,1995


Two aspects are missing: the movie overview (the plot description) and the id number. We will add them below.

In [2]:
#Import the original file
orig_df = pd.read_csv('../data/movies_metadata.csv', low_memory=False)

#Add the useful features into the cleaned dataframe
df['overview'], df['id'] = orig_df['overview'], orig_df['id']

df.head()

Unnamed: 0,title,genres,runtime,vote_average,vote_count,year,overview,id
0,Toy Story,"['animation', 'comedy', 'family']",81.0,7.7,5415.0,1995,"Led by Woody, Andy's toys live happily in his ...",862
1,Jumanji,"['adventure', 'fantasy', 'family']",104.0,6.9,2413.0,1995,When siblings Judy and Peter discover an encha...,8844
2,Grumpier Old Men,"['romance', 'comedy']",101.0,6.5,92.0,1995,A family wedding reignites the ancient feud be...,15602
3,Waiting to Exhale,"['comedy', 'drama', 'romance']",127.0,6.1,34.0,1995,"Cheated on, mistreated and stepped on, the wom...",31357
4,Father of the Bride Part II,['comedy'],106.0,5.7,173.0,1995,Just when George Banks has recovered from his ...,11862


The plot descriptions would be cleaner without punctuation and quotation marks, but as we will see below this does not matter: scikit-learn's TfidfVectorizer tokenizes the text itself and, with its default settings, ignores punctuation.

In [3]:
#Import TfIdfVectorizer from the scikit-learn library
from sklearn.feature_extraction.text import TfidfVectorizer

#Define a TF-IDF Vectorizer Object. Remove all english stopwords and create useful word vectors
tfidf = TfidfVectorizer(stop_words='english')

#Replace NaN with an empty string
df['overview'] = df['overview'].fillna('')

#Construct the required TF-IDF matrix by applying the fit_transform method on the overview feature
tfidf_matrix = tfidf.fit_transform(df['overview'])

#Output the shape of tfidf_matrix
tfidf_matrix.shape

(45466, 75827)

We can't easily inspect it (it is a large sparse matrix), but it has one row per movie and one column per word in the vocabulary. Each value is the TF-IDF weight of a word in a movie's overview. In sum, we have created a 75,827-dimensional vector for every movie. Sounds quite impressive, right? Now let's compute the cosine similarity score as one measure of similarity.

We are going to create a 45,466 x 45,466 matrix in which each value represents the similarity between a pair of movies. The values on the diagonal are 1 (think about it).

Cosine similarity is, in general, a computationally expensive process. However, TfidfVectorizer L2-normalizes every row by default, so each movie vector has magnitude 1. For unit vectors, the cosine similarity equals the dot product, which is much cheaper to compute and which scikit-learn provides through 'linear_kernel'.
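
To see why the dot product is enough once the vectors have length 1, here is a minimal sketch in plain Python. The two weight vectors are made up for illustration, and the helper functions (`dot`, `l2_normalize`, `cosine`) are not part of scikit-learn; they only mirror the arithmetic.

```python
import math

def dot(u, v):
    """Plain dot product of two equally long vectors."""
    return sum(a * b for a, b in zip(u, v))

def l2_normalize(v):
    """Scale a vector so that its Euclidean length is 1."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v] if norm else list(v)

def cosine(u, v):
    """Cosine similarity: dot product divided by both vector lengths."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Two made-up weight vectors over a four-word vocabulary
a = [1.0, 2.0, 0.0, 1.0]
b = [0.0, 1.0, 3.0, 1.0]

a_unit, b_unit = l2_normalize(a), l2_normalize(b)

full = cosine(a, b)           # divides by both vector lengths
cheap = dot(a_unit, b_unit)   # plain dot product of the unit vectors
print(round(full, 6), round(cheap, 6))
```

Both numbers agree, because the division by the vector lengths has already been done during normalization.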

In [4]:
# Import linear_kernel to compute the dot product
from sklearn.metrics.pairwise import linear_kernel

# Compute the cosine similarity matrix - this may take a few minutes
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

Now we are going to build the recommender. Before doing so, we create a reverse mapping from movie titles to their indices. In other words, we create a Pandas Series with the movie title as the index and the corresponding row position in the dataframe as the value. The recommender function then follows these steps:

1. Declare the title of the movie as an argument
2. Obtain the index of the movie from the indices reverse mapping
3. Get the list of cosine similarity scores for that particular movie with all movies using cosine_sim. Convert this into a list of 'tuples' (a row of object, like an array), where the first element is the position and the second is the similarity score.
4. Sort this list of tuples on their cosine similarity scores
5. Get the top 10 elements of this list. Ignore the first element, as it refers to the similarity score of the movie with itself (The Lion King is obviously most similar to The Lion King itself, not to any other movie).
6. Return the titles corresponding to the indices of these top 10 elements.
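
The steps above can be sketched on a toy example before applying them to the full dataset. The titles and the similarity matrix below are made up, and `recommend` is a hypothetical stand-in that uses plain lists and a dictionary instead of Pandas and NumPy.

```python
# Toy data: four placeholder titles and a made-up symmetric similarity matrix
titles = ['A', 'B', 'C', 'D']
toy_sim = [
    [1.0, 0.2, 0.7, 0.1],
    [0.2, 1.0, 0.3, 0.9],
    [0.7, 0.3, 1.0, 0.4],
    [0.1, 0.9, 0.4, 1.0],
]

# Reverse mapping: title -> row position
title_to_idx = {t: i for i, t in enumerate(titles)}

def recommend(title, k=2):
    # Steps 1-2: look up the row position for the given title
    idx = title_to_idx[title]
    # Step 3: pair every position with its similarity score
    scores = list(enumerate(toy_sim[idx]))
    # Step 4: sort by score, highest first
    scores.sort(key=lambda pair: pair[1], reverse=True)
    # Step 5: skip the first entry (the movie itself) and keep the next k
    top = scores[1:k + 1]
    # Step 6: translate positions back into titles
    return [titles[i] for i, _ in top]

print(recommend('A'))
```

Skipping the first entry assumes that the movie itself ranks first, which holds as long as no other movie has an identical vector.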

In [5]:
#Construct a reverse mapping from movie titles to indices, keeping the first entry for any duplicated title
indices = pd.Series(df.index, index=df['title'])
indices = indices[~indices.index.duplicated(keep='first')]

In [6]:
# Function that takes in movie title as input and gives recommendations 
def content_recommender(title, cosine_sim=cosine_sim, df=df, indices=indices):
    # Obtain the index of the movie that matches the title
    idx = indices[title]

    # Get the pairwise similarity scores of all movies with that movie
    # And convert it into a list of tuples as described above
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on the cosine similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 10 most similar movies. Ignore the first movie.
    sim_scores = sim_scores[1:11]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar movies
    return df['title'].iloc[movie_indices]

Now it's time to start recommending. You can ask the system for recommendations similar to any movie title. One obvious drawback is that you need to reproduce the title exactly, including the right case.
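
One possible way to soften the exact-title requirement is to resolve the user's input against the known titles first. The sketch below uses only the standard library; `resolve_title` and its `cutoff` parameter are hypothetical names that are not part of this notebook.

```python
import difflib

def resolve_title(query, known_titles, cutoff=0.6):
    """Return the best-matching known title, or None if nothing is close."""
    # Map lowercased titles back to their original spelling
    lowered = {}
    for t in known_titles:
        lowered.setdefault(t.lower(), t)

    key = query.strip().lower()
    # Exact match, ignoring case
    if key in lowered:
        return lowered[key]

    # Otherwise fall back to the closest title, if it is similar enough
    close = difflib.get_close_matches(key, list(lowered), n=1, cutoff=cutoff)
    return lowered[close[0]] if close else None

known = ['The Lion King', 'The Silence of the Lambs', 'Toy Story']
print(resolve_title('the lion king', known))
print(resolve_title('Silence of the Lambs', known))
```

The resolved title could then be passed to content_recommender. On the full dataset, fuzzy matching against tens of thousands of titles is noticeably slower than a plain lookup, so this is a trade-off rather than a free improvement.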

In [7]:
#Get recommendations for The Silence of the Lambs - adapt the argument to generate other movie recommendation sets (e.g., The Lion King or Toy Story)
content_recommender('The Silence of the Lambs')

5493                  Red Dragon
4022                    Hannibal
8104                Suspect Zero
28754             Hell Is a City
11560            Hannibal Rising
31710                Dark Asylum
30721    The Silence of the Hams
13138               Surveillance
24512      Serial Killer Culture
26223       Mine Own Executioner
Name: title, dtype: object

The results can look quite good. For example, if you look for 'The Silence of the Lambs', you'll receive other movies featuring Hannibal Lecter, as well as... The Silence of the Hams?? For The Lion King, you'll see a few Disney movies in addition to the Lion King sequels.

Obviously, it would be better if our recommender took more metadata into account. Someone who likes The Lion King probably also likes other Disney movies (I have yet to find a latent feature that describes a preference for movies with cats...), so let's do that next.

# Metadata Based Recommender

#### We will use some movie metadata, which includes:
##### - The genre of the movie
##### - The director of the movie. This person is part of the crew
##### - The movie's three major stars. They are part of the cast
##### - Sub-genres or keywords.

In [8]:
# Load the keywords and credits files - additional data to work with.
cred_df = pd.read_csv('../data/credits.csv')
key_df = pd.read_csv('../data/keywords.csv')

In [9]:
#Print the head of the credit dataframe
cred_df.head()

Unnamed: 0,cast,crew,id
0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...",862
1,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...",8844
2,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...",15602
3,"[{'cast_id': 1, 'character': ""Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...",31357
4,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...",11862


The cast, crew, and the keywords are in the usual 'list of dictionaries' form. Just like genres in the other dataset, we have to reduce them to a string or a list of strings.

Before being able to do so, we need to combine the dataframes above with the original dataframe that includes the genres. Pandas can do this for us. This is done by joining tables, just like in SQL, using the movie id as the main feature.

In [10]:
#Print the head of the keywords dataframe
key_df.head()

Unnamed: 0,id,keywords
0,862,"[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,8844,"[{'id': 10090, 'name': 'board game'}, {'id': 1..."
2,15602,"[{'id': 1495, 'name': 'fishing'}, {'id': 12392..."
3,31357,"[{'id': 818, 'name': 'based on novel'}, {'id':..."
4,11862,"[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n..."


You will see that this is dirty data: not all id values are actually numbers. A simple type conversion will therefore fail, as the error below shows, and you will run into this in many other large datasets. So we will write our own conversion function.

In [11]:
#Convert the IDs of df into int
df['id'] = df['id'].astype('int')

ValueError: invalid literal for int() with base 10: '1997-08-20'

In [12]:
# Function to convert all non-integer IDs to NaN
def clean_ids(x):
    try:
        return int(x)
    except (ValueError, TypeError, OverflowError):
        return np.nan

In [13]:
#Clean the ids of df
df['id'] = df['id'].apply(clean_ids)

#Filter all rows that have a null ID; .copy() avoids a SettingWithCopyWarning when we assign to df['id'] below
df = df[df['id'].notnull()].copy()

Join the datasets.

In [14]:
# Convert IDs into integer
df['id'] = df['id'].astype('int')
key_df['id'] = key_df['id'].astype('int')
cred_df['id'] = cred_df['id'].astype('int')

# Merge keywords and credits into your main metadata dataframe
df = df.merge(cred_df, on='id')
df = df.merge(key_df, on='id')

#Display the head of df
df.head()



Unnamed: 0,title,genres,runtime,vote_average,vote_count,year,overview,id,cast,crew,keywords
0,Toy Story,"['animation', 'comedy', 'family']",81.0,7.7,5415.0,1995,"Led by Woody, Andy's toys live happily in his ...",862,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...","[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,Jumanji,"['adventure', 'fantasy', 'family']",104.0,6.9,2413.0,1995,When siblings Judy and Peter discover an encha...,8844,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...","[{'id': 10090, 'name': 'board game'}, {'id': 1..."
2,Grumpier Old Men,"['romance', 'comedy']",101.0,6.5,92.0,1995,A family wedding reignites the ancient feud be...,15602,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...","[{'id': 1495, 'name': 'fishing'}, {'id': 12392..."
3,Waiting to Exhale,"['comedy', 'drama', 'romance']",127.0,6.1,34.0,1995,"Cheated on, mistreated and stepped on, the wom...",31357,"[{'cast_id': 1, 'character': ""Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...","[{'id': 818, 'name': 'based on novel'}, {'id':..."
4,Father of the Bride Part II,['comedy'],106.0,5.7,173.0,1995,Just when George Banks has recovered from his ...,11862,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...","[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n..."


Now let's convert the variables so that they are more usable:
- Convert keywords into a list of strings where each string is a keyword, similar to genres. We will only include the top three keywords.
- Convert cast into a list of strings where each string is a star. Like keywords, we will only include the top three stars in our cast.
- Convert crew into director, which means that we need to ignore all non-director crew members!
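
Before these conversions can happen, the columns have to be turned from strings back into Python objects. As a small illustration of what `literal_eval` does, the sketch below parses a stringified list of dictionaries; the first entry matches the Toy Story keyword shown above, while the second entry is invented for the example.

```python
from ast import literal_eval

# A stringified list of dictionaries, as stored in the CSV files
# (the second entry is made up for illustration)
raw = "[{'id': 931, 'name': 'jealousy'}, {'id': 0, 'name': 'example keyword'}]"

# literal_eval safely parses Python literals without executing arbitrary code
parsed = literal_eval(raw)

# Once parsed, the names can be pulled out with a list comprehension
names = [entry['name'] for entry in parsed]
print(type(raw).__name__, type(parsed).__name__, names)
```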

In [15]:
# Convert the stringified objects into the native python objects - using 'literal_eval'
from ast import literal_eval

features = ['cast', 'crew', 'keywords', 'genres']
for feature in features:
    df[feature] = df[feature].apply(literal_eval)

In [16]:
#Print the first crew member of the first movie in df - to see how we can extract the director
df.iloc[0]['crew'][0]

{'credit_id': '52fe4284c3a36847f8024f49',
 'department': 'Directing',
 'gender': 2,
 'id': 7879,
 'job': 'Director',
 'name': 'John Lasseter',
 'profile_path': '/7EdqiNbr4FRjIhKHyPPdFfEEEFG.jpg'}

In [17]:
# Extract the director's name. If director is not listed, return NaN
# 'crew_member' is just a self-chosen loop variable name (for clarity); you can change it
def get_director(x):
    for crew_member in x:
        if crew_member['job'] == 'Director':
            return crew_member['name']
    return np.nan

In [18]:
#Define the new director feature
df['director'] = df['crew'].apply(get_director)

#Print the directors of the first five movies
df['director'].head()

0      John Lasseter
1       Joe Johnston
2      Howard Deutch
3    Forest Whitaker
4      Charles Shyer
Name: director, dtype: object

We can do the same for the dictionaries of keywords and cast. Below, you'll first see what the data looks like (as a reminder). Since these columns are structured similarly, we can write a single function for both. In both cases, we have 'agreed' to extract the top three elements.

In [19]:
#Print the first keyword of the first movie in the df
df.iloc[0]['keywords'][0]

{'id': 931, 'name': 'jealousy'}

In [20]:
#Print the first cast member of the first movie in the df
df.iloc[0]['cast'][0]

{'cast_id': 14,
 'character': 'Woody (voice)',
 'credit_id': '52fe4284c3a36847f8024f95',
 'gender': 2,
 'id': 31,
 'name': 'Tom Hanks',
 'order': 0,
 'profile_path': '/pQFoyx7rp09CJTAb932F2g8Nlho.jpg'}

In [21]:
# Another function - returns the first 3 elements of the list, or the entire list if it has 3 or fewer.
def generate_list(x):
    if isinstance(x, list):
        names = [i['name'] for i in x]
        #Check if more than 3 elements exist. If yes, return only first three. If no, return entire list.
        if len(names) > 3:
            names = names[:3]
        return names

    #Return empty list in case of missing/malformed data
    return []

In [22]:
#Apply the generate_list function to cast and keywords
df['cast'] = df['cast'].apply(generate_list)
df['keywords'] = df['keywords'].apply(generate_list)

In [23]:
#Only consider a maximum of 3 genres - These were already formatted a little more conveniently
df['genres'] = df['genres'].apply(lambda x: x[:3])

In [24]:
# Print the new features of the first 5 movies along with title
df[['title', 'cast', 'director', 'keywords', 'genres']].head()

Unnamed: 0,title,cast,director,keywords,genres
0,Toy Story,"[Tom Hanks, Tim Allen, Don Rickles]",John Lasseter,"[jealousy, toy, boy]","[animation, comedy, family]"
1,Jumanji,"[Robin Williams, Jonathan Hyde, Kirsten Dunst]",Joe Johnston,"[board game, disappearance, based on children'...","[adventure, fantasy, family]"
2,Grumpier Old Men,"[Walter Matthau, Jack Lemmon, Ann-Margret]",Howard Deutch,"[fishing, best friend, duringcreditsstinger]","[romance, comedy]"
3,Waiting to Exhale,"[Whitney Houston, Angela Bassett, Loretta Devine]",Forest Whitaker,"[based on novel, interracial relationship, sin...","[comedy, drama, romance]"
4,Father of the Bride Part II,"[Steve Martin, Diane Keaton, Martin Short]",Charles Shyer,"[baby, midlife crisis, confidence]",[comedy]


So, what happens if we vectorize the data above? You can imagine that a vectorizer cannot differentiate between first and last names. Therefore, if a user likes a movie with Ryan Gosling (e.g., Drive), the system might also recommend movies with Ryan Reynolds (e.g., Deadpool), simply because they are both named Ryan. Although both are great actors, they are clearly different 'entities' - let's not assume that all Ryans are alike.

Hence, we need to 'sanitize' the data to prevent ambiguity. We will do this by removing all spaces and converting to lowercase, resulting in ryanreynolds and ryangosling.
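
The effect of removing spaces can be checked with a rough sketch. The helpers `tokens` and `squash` are hypothetical and only imitate how a bag-of-words vectorizer splits text on whitespace; they are not the functions used in the rest of this notebook.

```python
def tokens(names):
    """Split names on whitespace, lowercased, roughly as a bag-of-words model would."""
    return {word for name in names for word in name.lower().split()}

def squash(name):
    """Lowercase a name and remove its spaces, so it becomes a single token."""
    return name.lower().replace(' ', '')

drive_cast = ['Ryan Gosling']
deadpool_cast = ['Ryan Reynolds']

# Without sanitizing, the two casts share the token 'ryan'
shared_raw = tokens(drive_cast) & tokens(deadpool_cast)

# After sanitizing, each actor is one token and nothing is shared
shared_clean = (tokens([squash(n) for n in drive_cast])
                & tokens([squash(n) for n in deadpool_cast]))

print(shared_raw, shared_clean)
```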

In [25]:
# Function to sanitize data to prevent ambiguity. It removes spaces and converts to lowercase
def sanitize(x):
    if isinstance(x, list):
        #Strip spaces and convert to lowercase
        return [str.lower(i.replace(" ", "")) for i in x]
    else:
        #Check if director exists. If not, return empty string
        if isinstance(x, str):
            return str.lower(x.replace(" ", ""))
        else:
            return ''

In [26]:
#Apply the sanitize function to cast, keywords, director and genres
for feature in ['cast', 'director', 'genres', 'keywords']:
    df[feature] = df[feature].apply(sanitize)

In contrast with our plot description-based recommendation, we now have multiple features to compute similarity on. How can we do that? By creating a soup! ... Yeah, I'm not sure who came up with this term, but it basically means that we are going to combine our three lists and one string.

In [27]:
#Function that creates a soup out of the desired metadata
def create_soup(x):
    return ' '.join(x['keywords']) + ' ' + ' '.join(x['cast']) + ' ' + x['director'] + ' ' + ' '.join(x['genres'])

In [28]:
# Create the new soup feature
df['soup'] = df.apply(create_soup, axis=1)

In [29]:
#Display the soup of the first movie
df.iloc[0]['soup']

'jealousy toy boy tomhanks timallen donrickles johnlasseter animation comedy family'

As you can see, the soup above for Toy Story is like a senseless sentence with different aspects of the movie. However, now that every movie is represented like that, we can vectorize each movie and compute the similarity between one another!

We will use the CountVectorizer to compute similarity and generate recommendations. Usually, you would pick TF-IDF, but in this case it would give less weight to actors and directors who have participated in a lot of movies. That is not appropriate in the movie context: we do not want to penalize someone for being prolific.
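
To make the down-weighting concrete, the sketch below computes the inverse document frequency with the smoothed formula that scikit-learn's TfidfVectorizer uses by default, idf = ln((1 + n) / (1 + df)) + 1. The movie counts are made up, and `smooth_idf` is a hypothetical helper rather than a scikit-learn function.

```python
import math

def smooth_idf(n_docs, doc_freq):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_movies = 1000  # made-up collection size

# A prolific actor who appears in 50 movies vs. one who appears in only 2
prolific = smooth_idf(n_movies, 50)
rare = smooth_idf(n_movies, 2)

print(round(prolific, 3), round(rare, 3))
```

The prolific actor receives the smaller weight, which is exactly the effect we want to avoid here. CountVectorizer treats both actors the same.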

In [30]:
# Import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

#Define a new CountVectorizer object and create vectors for the soup
count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(df['soup'])

It does mean that we need to use the more expensive cosine_similarity function: count vectors are not normalized, so a plain dot product (linear_kernel) would no longer equal the cosine similarity.

In [31]:
#Import cosine_similarity function
from sklearn.metrics.pairwise import cosine_similarity

#Compute the cosine similarity matrix (cosine_similarity normalizes the count vectors before taking the dot product)
cosine_sim2 = cosine_similarity(count_matrix, count_matrix)

Because rows were filtered and merged earlier, we reset the index of the dataframe and construct the reverse mapping from movie title to row position again, so that it matches the rows of cosine_sim2.

In [32]:
# Reset index of your df and construct reverse mapping again
df = df.reset_index()
indices2 = pd.Series(df.index, index=df['title'])

In [35]:
# THE RECOMMENDER! Change the name to find other movies
content_recommender('The Lion King', cosine_sim2, df, indices2)

29607                                          Cheburashka
40904                   VeggieTales: Josh and the Big Wall
40913    VeggieTales: Minnesota Cuke and the Search for...
27768                                 The Little Matchgirl
15209             Spiderman: The Ultimate Villain Showdown
16613                            Cirque du Soleil: Varekai
24654                                  The Seventh Brother
29198                                      Superstar Goofy
30244                                              My Love
31179                Pokémon: Arceus and the Jewel of Life
Name: title, dtype: object

### Is that it? There are some improvements possible, because this is still a simple setup. Here are a few ideas you could try:
##### Experiment with the number of keywords, genres, and cast: The maximum of 3 of each was an arbitrary decision. Using more could improve the recommendations, at the cost of a heavier computation.
##### Come up with more well-defined sub-genres: Some keywords may appear in only a few movies, rendering them useless in a recommendation scenario (though not for look-up... remember TF-IDF?). You could pre-define keywords based on the total set of keywords.
##### Assign more weight to the director: This could be done by repeating the director, for example. You could also repeat other keywords if you find them important. It is, of course, better to assign explicit weights in a well-defined computation, but these are just simple suggestions.
##### Experiment with other metadata: For example, think of using the production company (e.g., Pixar) to recommend animation movies.
##### Introduce a popularity filter.
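
As a rough sketch of the director-weighting idea, the soup could repeat the director's token so that CountVectorizer counts it more than once. The function `create_weighted_soup` and its `director_weight` parameter are hypothetical, and the row below is a plain dictionary standing in for a dataframe row.

```python
def create_weighted_soup(row, director_weight=3):
    """Build the soup, repeating the director token to give it more weight."""
    parts = list(row['keywords']) + list(row['cast'])
    # Skip the director entirely if it is an empty string (missing data)
    if row['director']:
        parts += [row['director']] * director_weight
    parts += list(row['genres'])
    return ' '.join(parts)

# Stand-in for one sanitized dataframe row (values taken from the Toy Story soup)
toy_row = {
    'keywords': ['jealousy', 'toy', 'boy'],
    'cast': ['tomhanks', 'timallen', 'donrickles'],
    'director': 'johnlasseter',
    'genres': ['animation', 'comedy', 'family'],
}

print(create_weighted_soup(toy_row, director_weight=2))
```

Because the function only indexes the row by column name, it could in principle be applied with df.apply(create_weighted_soup, axis=1), just like create_soup. Whether a weight of 2, 3, or more helps is something to test on actual recommendations.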