# **Recommender Systems - The Movies Dataset**

In general, recommender systems can be divided into three categories.

**Simple recommenders** offer the same general recommendations to every user, based on movie popularity and/or genre. The idea behind this system is that more popular and critically acclaimed movies are more likely to appeal to a general audience. An example is the IMDB Top 250.

**Content-based recommenders** suggest items similar to a specific item. This system uses item metadata such as director, description, and actors. The idea behind these recommender systems is that if a person likes a particular item, they will also like similar items. To make recommendations, the system relies on the metadata of items the user has already engaged with. A good example is YouTube, which recommends new videos to watch based on your history.

**Collaborative filtering systems** are widely used and attempt to predict the rating or preference a user would give to an item based on the past ratings and preferences of other users.

A simple recommender is a basic system that recommends the top items based on a specific metric or score. In this section, we'll use the movie metadata to build a simplified clone of the IMDB Top 250. This involves the following steps:

**Select a metric or score to rate the movie.**

**Calculate the score for every movie.**

**Sort the movies by score and output the top results.**

The dataset contains movies released on or before July 2017. It captures data points such as cast, crew, plot keywords, budget, revenue, posters, release dates, languages, production companies, countries, TMDB vote counts, and vote averages.
These data points can potentially be used to train content-based and collaborative filtering machine learning models.

This data set consists of the following files.

**movies_metadata.csv**: This file contains information on approximately 45,000 movies in the Full MovieLens dataset. Features include posters, backdrops, budget, genre, revenue, release dates, languages, production countries, and production companies.

**keywords.csv**: Contains plot keywords for the MovieLens movies. Available as a stringified JSON object.

**credits.csv**: Contains cast and crew information for all films. Available as stringified JSON objects.

**links.csv**: This file contains the TMDB and IMDB IDs of all movies in the Full MovieLens data set.

**links_small.csv**: Contains TMDB and IMDB IDs for a small subset of the 9,000 movies in the full dataset.

**ratings_small.csv**: A subset of 100,000 ratings from 700 users on 9,000 movies. The complete MovieLens dataset contains 26 million ratings and 750,000 tag applications from 270,000 users on all 45,000 movies in this dataset. It can be accessed from the official GroupLens website.

In [None]:
from zipfile import ZipFile
from ast import literal_eval  # parses stringified features into Python objects

import numpy as np
import pandas as pd
# Vectorizers used to build the TF-IDF matrix and the count matrix
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
# Pairwise similarity functions
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

In [None]:
# zip = ZipFile('Recommander Systems - The Movie Dataset.zip')
# zip.extractall()

In [None]:
# Load Movies Metadata
metadata = pd.read_csv('movies_metadata.csv', low_memory=False)
# Print the first three rows
metadata.head(3)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0


One of the most basic metrics we can think of is the average rating: rank the films by rating and take the top 250.

However, there are some caveats to using the rating as the metric.

First, it doesn't take into account the popularity of the movie. Thus, a film rated 9 by 10 voters is considered "better" than a film rated 8.9 by 10,000 voters.

For example, imagine we want to order Chinese food and have several options. One restaurant has a 5-star rating from 5 people, while another has a 4.5-star rating from 1,000 people. Which restaurant would we prefer? The second, right?

Of course, there could be an exception: the first restaurant may have opened just a few days ago, so fewer people have rated it, while the second has been in operation for a year.

In other words, this metric tends to favor films with a small number of voters, whose ratings are skewed or extremely high. As the number of voters increases, a movie's rating stabilizes and approaches a value that reflects the quality of the movie, giving users a much better idea of which movie to choose. It is difficult to judge the quality of a film with very few voters, and we may need to consider outside sources to reach a conclusion.
Given this shortcoming, we need a weighted rating that takes into account both the average rating and the number of votes a film has received. Such a system ensures that a film rated 9 by 100,000 voters scores (much) higher than a film with the same rating from only a few hundred voters.

Since we want to replicate the IMDB Top 250, we will use its weighted rating formula as our metric/score. Mathematically, this is expressed as:

\begin{equation} \text{Weighted Rating } (WR) = \left(\frac{v}{v + m} \cdot R\right) + \left(\frac{m}{v + m} \cdot C\right) \end{equation}

In the above equation,

**v is the number of votes for the movie**;

**m is the minimum votes required to be listed in the chart**;

**R is the average rating of the movie**;

**C is the mean vote across the whole dataset**.

We already have the values of v (vote_count) and R (vote_average) for each movie in our dataset, and C can be calculated directly from the same data. The appropriate value of m is a judgment call: we can think of it as a preliminary filter that simply removes movies with vote counts below the threshold m.

We use the cutoff m as the 90th percentile. This means that in order for a film to chart, it needs to get more votes than 90% of the films on the list. (On the other hand, choosing the 75th percentile will consider movies in the top 25% of the votes cast. As the percentile decreases, the number of movies considered increases.)
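To make the formula concrete, here is a minimal sketch that evaluates it for the two films from the earlier comparison (rated 9 by 10 voters, and 8.9 by 10,000 voters). The function name `weighted_rating_demo` and the values chosen for m and C are illustrative assumptions, not numbers computed from the dataset.

```python
def weighted_rating_demo(v, R, m, C):
    """Blend a movie's own mean rating R with the global mean C.

    The more votes v a movie has relative to the threshold m,
    the more weight its own rating receives.
    """
    return (v / (v + m)) * R + (m / (v + m)) * C


# Illustrative values only (assumptions for this sketch)
m_demo, C_demo = 160, 5.6

few_votes = weighted_rating_demo(v=10, R=9.0, m=m_demo, C=C_demo)
many_votes = weighted_rating_demo(v=10_000, R=8.9, m=m_demo, C=C_demo)

print(round(few_votes, 3), round(many_votes, 3))
```

With these inputs the sparsely rated film is pulled down to roughly 5.8, close to the global mean, while the heavily rated film stays near 8.85, so the ordering now matches intuition.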

As a first step, we use the pandas .mean() function to calculate the average rating across all movies, the value C.

In [None]:
# Calculate mean of vote average column
mean_columns = metadata['vote_average'].mean()
print(mean_columns)

5.618207215134185


From the above output, you can observe that the average rating of a movie in this dataset is around 5.6 on a scale of 10.

Next, let's calculate m, the number of votes received by a movie at the 90th percentile. The pandas .quantile() method makes this task trivial:

In [None]:
# Calculate the minimum number of votes required to be in the chart, min_votes
min_votes = metadata['vote_count'].quantile(0.90)
print(min_votes)

160.0


Now that we have m, we can use a greater-than-or-equal-to condition to keep only the movies with at least 160 votes. We call .copy() so that the new qualified_movies DataFrame is independent of the original metadata DataFrame. That is, changes to the qualified_movies DataFrame do not affect the original metadata frame.

In [None]:
# Filter out all qualified movies into a new DataFrame
# Filter first, then copy only the qualifying rows
qualified_movies = metadata.loc[metadata['vote_count'] >= min_votes].copy()
qualified_movies.shape

(4555, 24)

In [None]:
metadata.shape

(45466, 24)

From the above results, it is clear that about 10% of the films received 160 or more votes and could be included in this list.
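The thresholding step can be tried on a frame small enough to check by eye. The sketch below uses hypothetical toy data (ten made-up titles and vote counts, not rows from movies_metadata.csv) to show both the percentile cutoff and the fact that edits to the copied frame leave the original untouched.

```python
import pandas as pd

# Hypothetical toy data, not taken from movies_metadata.csv
toy = pd.DataFrame({
    'title': list('ABCDEFGHIJ'),
    'vote_count': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
})

# 90th percentile of the vote counts (pandas interpolates linearly by default)
toy_min_votes = toy['vote_count'].quantile(0.90)

# Keep rows at or above the cutoff; .copy() makes the result independent
toy_qualified = toy.loc[toy['vote_count'] >= toy_min_votes].copy()

# Mutating the copy does not change the original frame
toy_qualified['vote_count'] = 0

print(toy_min_votes)
print(toy_qualified['title'].tolist())
print(toy['vote_count'].max())
```

Here the cutoff lands at about 91, so only the single most-voted title survives, and the original frame still reports a maximum of 100 votes after the copy is modified.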

The next and most important step is to calculate the weighted rating for each qualifying film. To do this:

We define the weighted_rating() function.

Since we have already computed m and C, we pass them as default arguments to the function.

Inside the function, we select the vote_count (v) and vote_average (R) columns from the qualified_movies DataFrame.

Finally, we calculate the weighted average and return the result.
We then apply this function to the qualified_movies DataFrame to define a new feature, score, that holds the computed value.

In [None]:
# Function that computes the weighted rating of each movie
def weighted_rating(x, m=min_votes, C=mean_columns):
    v = x['vote_count']
    R = x['vote_average']
    # Calculation based on the IMDB formula
    return (v/(v+m) * R) + (m/(m+v) * C)

In [None]:
# Define a new feature 'score' and calculate its value with `weighted_rating()`
qualified_movies['score'] = qualified_movies.apply(weighted_rating, axis=1)

Finally, let's sort the DataFrame in descending order based on the score feature column and output the title, vote count, vote average, and weighted rating (score) of the top 20 movies.

In [None]:
#Sort movies based on score calculated above
qualified_movies = qualified_movies.sort_values('score', ascending=False)

#Print the top 20 movies
qualified_movies[['title', 'vote_count', 'vote_average', 'score']].head(20)

Unnamed: 0,title,vote_count,vote_average,score
314,The Shawshank Redemption,8358.0,8.5,8.445869
834,The Godfather,6024.0,8.5,8.425439
10309,Dilwale Dulhania Le Jayenge,661.0,9.1,8.421453
12481,The Dark Knight,12269.0,8.3,8.265477
2843,Fight Club,9678.0,8.3,8.256385
292,Pulp Fiction,8670.0,8.3,8.251406
522,Schindler's List,4436.0,8.3,8.206639
23673,Whiplash,4376.0,8.3,8.205404
5481,Spirited Away,3968.0,8.3,8.196055
2211,Life Is Beautiful,3643.0,8.3,8.187171


From the output above, we can see that the simple recommender did a good job.

The chart has a lot in common with the IMDB Top 250. For example, our two top movies, The Shawshank Redemption and The Godfather, are the same as the top two on IMDB.

# **Content-Based Recommender**

# **Plot Description Based Recommender**

Now let's build a system that recommends movies similar to a given movie. To do this, we compute pairwise cosine similarity scores for all movies based on their plot descriptions and recommend movies based on those scores.

In [None]:
#Print plot overviews of the first 5 movies.
metadata['overview'].head()

0    Led by Woody, Andy's toys live happily in his ...
1    When siblings Judy and Peter discover an encha...
2    A family wedding reignites the ancient feud be...
3    Cheated on, mistreated and stepped on, the wom...
4    Just when George Banks has recovered from his ...
Name: overview, dtype: object

This is a natural language processing problem. We therefore need to extract features from the text data above before we can calculate similarities or differences between overviews. Simply put, it is not possible to compute the similarity between two overviews in their raw form. To do this, we need to compute the word vectors of each overview, or document, as it will be called from now on.

As the name suggests, word vectors are vectorized representations of words in a document. The vectors carry semantic meaning. For example, man and king have vector representations that are close to each other, while man and woman have representations that are far apart.

We calculate the term frequency-inverse document frequency (TF-IDF) vector for each document. This gives us a matrix where each column represents a word from the overview vocabulary (every word that appears in at least one document) and each row represents a movie.

Essentially, the TF-IDF score is the frequency of a word in a document, down-weighted by the number of documents in which the word appears. This reduces the importance of words that occur frequently across plot overviews and, therefore, their influence on the final similarity score.

Fortunately, scikit-learn provides a built-in TfidfVectorizer class that produces the TF-IDF matrix in a few lines.

Import the TfidfVectorizer class from scikit-learn.

Remove stop words such as 'the' and 'an', which carry no useful information about the topic.

Replace not-a-number (NaN) values with an empty string.

Finally, build the TF-IDF matrix from the data.

In [None]:
#Define a TF-IDF Vectorizer Object. Remove all english stop words such as 'the', 'a'
tfidf = TfidfVectorizer(stop_words='english')
#Replace NaN with an empty string
metadata['overview'] = metadata['overview'].fillna('')
#Construct the TF-IDF matrix from the first 30,000 overviews. Slicing (rather than
#random sampling) keeps row i of the matrix aligned with row i of metadata
tfidf_matrix = tfidf.fit_transform(metadata['overview'].iloc[:30000])
#Output the shape of tfidf_matrix
tfidf_matrix.shape

(30000, 61513)

In [None]:
#Array mapping from feature integer indices to feature name.
tfidf.get_feature_names_out()[5000:5010]

array(['basle', 'basler', 'basmachi', 'basmachis', 'basner', 'basquali',
       'basque', 'basques', 'basquiat', 'basra'], dtype=object)

From the output above, we can see that 61,513 distinct words are used to describe the 30,000 movies in the subset.

This matrix can be used to calculate the similarity score. Several similarity metrics are available for this purpose: Manhattan, Euclidean, Pearson, and cosine similarity score. Again, there is no one right answer as to which score is best. Different scores work well in different scenarios and it is often useful to experiment with different metrics and see the results.

We will use cosine similarity to calculate a number representing the similarity between two movies. We use the cosine similarity score because it is independent of magnitude and is relatively easy and fast to compute (especially when used in conjunction with TF-IDF scores, as explained below).

TfidfVectorizer normalizes each row to unit length by default, so the dot product between two rows directly gives their cosine similarity score. So instead of cosine_similarity() we will use sklearn's linear_kernel(), as it is faster.

This returns a 30000x30000 matrix that holds the cosine similarity of each movie overview with every other movie overview in the subset. Each movie corresponds to a 1x30000 row vector, and each column holds its similarity to one movie.
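The equivalence between the plain dot product and cosine similarity for TF-IDF rows can be checked directly. The sketch below is a small verification on a hypothetical three-document corpus; it relies on TfidfVectorizer's default `norm='l2'` setting.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical corpus used only for this check
demo_docs = [
    "a cowboy doll is threatened by a new spaceman toy",
    "two siblings discover an enchanted board game",
    "a spaceman and a cowboy become friends",
]

# Rows are L2-normalized by default (norm='l2')
demo_matrix = TfidfVectorizer(stop_words='english').fit_transform(demo_docs)

dot_products = linear_kernel(demo_matrix, demo_matrix)
cosine_scores = cosine_similarity(demo_matrix, demo_matrix)

# The two matrices agree up to floating-point error
print(np.allclose(dot_products, cosine_scores))
```

If the vectorizer were created with `norm=None`, the rows would no longer have unit length and the two functions would give different results.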


In [None]:
# Compute the cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

In [None]:
cosine_sim.shape

(30000, 30000)

In [None]:
cosine_sim[1]

array([0.        , 1.        , 0.        , ..., 0.        , 0.00974046,
       0.        ])

We'll define a function that takes the title of a movie as input and outputs a list of the 10 most similar movies. For this we first need a reverse mapping from movie titles to DataFrame indices. In other words, we need a mechanism to look up the index of a movie in the metadata DataFrame given its title.

In [None]:
#Construct a reverse map of indices and movie titles
indices = pd.Series(metadata.index, index=metadata['title'])
indices = indices[~indices.index.duplicated(keep='first')]  # one index per title

In [None]:
indices[:10]

title
Toy Story                      0
Jumanji                        1
Grumpier Old Men               2
Waiting to Exhale              3
Father of the Bride Part II    4
Heat                           5
Sabrina                        6
Tom and Huck                   7
Sudden Death                   8
GoldenEye                      9
dtype: int64

We are now in a good position to define our recommendation function. It follows these steps:

Get the index of the movie given its title.

Get the list of cosine similarity scores of that movie with all movies. Convert it into a list of tuples, where the first element is the position and the second is the similarity score.

Sort the list of tuples by similarity score, that is, by the second element.

Get the top 10 elements of this list. Ignore the first element, which refers to the movie itself (the movie most similar to a particular movie is the movie itself).

Return the titles corresponding to the indices of the top elements.

In [None]:
# Function that takes in movie title as input and outputs most similar movies
def get_recommendations(title, cosine_sim=cosine_sim):
    # Get the index of the movie that matches the title
    idx = indices[title]

    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 10 most similar movies
    sim_scores = sim_scores[1:11]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar movies
    return metadata['title'].iloc[movie_indices]

In [None]:
get_recommendations('The Dark Knight Rises')

24475                  Blindfold
15332                   Whoopee!
3684             Mackenna's Gold
24937                 Ski Patrol
20425                   The Rage
12687       The Three Musketeers
25331                 Dead Souls
3321                    Red Dawn
12305    Things I Never Told You
28874                  Maidstone
Name: title, dtype: object

In [None]:
get_recommendations('The Godfather')

8114           Seeing Other People
4930                       M*A*S*H
26265             The Elephant Man
5632                        Ms .45
15543    Diary of a Shinjuku Thief
29656            Syndicate Sadists
24269              Happy Christmas
6766               The Magic Sword
29634                 Kill 'em All
20313                     Bad Luck
Name: title, dtype: object
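The ranking steps used by get_recommendations() can be verified by hand on a toy similarity matrix. The sketch below uses four hypothetical titles and made-up similarity scores; the helper name `top_similar` is an assumption for illustration. It also makes one implicit assumption of the approach visible: dropping the first sorted element only removes the movie itself if its self-similarity is the largest score in its row.

```python
import numpy as np
import pandas as pd

# Hypothetical titles and a hand-written symmetric similarity matrix
toy_titles = pd.Series(['A', 'B', 'C', 'D'])
toy_sim = np.array([
    [1.0, 0.2, 0.8, 0.5],
    [0.2, 1.0, 0.1, 0.3],
    [0.8, 0.1, 1.0, 0.4],
    [0.5, 0.3, 0.4, 1.0],
])

# Reverse map: title -> row position
toy_indices = pd.Series(toy_titles.index, index=toy_titles)


def top_similar(title, k=2):
    idx = toy_indices[title]
    # (position, score) pairs for every movie
    scores = list(enumerate(toy_sim[idx]))
    # Highest similarity first
    scores = sorted(scores, key=lambda pair: pair[1], reverse=True)
    # Skip the first entry (the movie itself) and keep the next k
    scores = scores[1:k + 1]
    positions = [pos for pos, _ in scores]
    return toy_titles.iloc[positions].tolist()


print(top_similar('A'))
```

For title A the row of scores is 1.0, 0.2, 0.8, 0.5, so the two nearest neighbours are C and then D.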

# **Credits, Genres, and Keywords Based Recommender**

Using better metadata and capturing more detail will improve the quality of our recommender. That's what we'll do in this section. We will build a recommender based on the following metadata: the three top-billed actors, the director, the genres, and the movie plot keywords.

The keywords, cast, and crew data are not available in the current dataset, so the first step is to load them and merge them into the main metadata DataFrame.

In [None]:
# Load keywords and credits
credits = pd.read_csv('credits.csv')
keywords = pd.read_csv('keywords.csv')

# Remove rows with bad IDs.
metadata = metadata.drop([19730, 29503, 35587])

# Convert IDs to int. Required for merging
keywords['id'] = keywords['id'].astype('int')
credits['id'] = credits['id'].astype('int')
metadata['id'] = metadata['id'].astype('int')

# Merge keywords and credits into your main metadata dataframe
metadata = metadata.merge(credits, on='id')
metadata = metadata.merge(keywords, on='id')

In [None]:
# Print the first two movies of your newly merged metadata
metadata.head(2)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,spoken_languages,status,tagline,title,video,vote_average,vote_count,cast,crew,keywords
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...","[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...","[{'id': 10090, 'name': 'board game'}, {'id': 1..."


From the new features (cast, crew, and keywords), we need to extract the three most important actors, the director, and the keywords associated with each movie.

But first, the data is stored in the form of "stringified" lists. We need to convert them into a form we can work with.

In [None]:
features = ['cast', 'crew', 'keywords', 'genres']
for feature in features:
    metadata[feature] = metadata[feature].apply(literal_eval)
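As a small illustration of what literal_eval does here, the sketch below parses one string shaped like the values in these columns. The string itself is a hypothetical example (the ids and names are made up), not a value copied from the dataset.

```python
from ast import literal_eval

# Hypothetical cell value in the same shape as the 'genres' column
raw_value = "[{'id': 1, 'name': 'GenreA'}, {'id': 2, 'name': 'GenreB'}]"

# literal_eval safely evaluates Python literals (lists, dicts, strings, numbers)
parsed_value = literal_eval(raw_value)

print(type(raw_value).__name__, '->', type(parsed_value).__name__)
print([entry['name'] for entry in parsed_value])
```

Unlike eval(), literal_eval only accepts literal structures, so it does not execute arbitrary code found in the data.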

Then, we write functions to help us extract the information we need from each feature.

We use NumPy, imported earlier, for its NaN constant. With it we can write a get_director() function.

The function gets the director's name from the crew feature and returns NaN if no director is listed.

In [None]:
def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan

Next, write a function that returns the first 3 elements of a list, or the entire list if it has 3 or fewer elements. The lists here represent cast members, keywords, and genres.

In [None]:
def get_list(x):
    if isinstance(x, list):
        names = [i['name'] for i in x]
        #Check if more than 3 elements exist. If yes, return only first three. If no, return entire list.
        if len(names) > 3:
            names = names[:3]
        return names

    #Return empty list in case of missing/malformed data
    return []

In [None]:
# Define new director, cast, genres and keywords features that are in a suitable form.
metadata['director'] = metadata['crew'].apply(get_director)

features = ['cast', 'keywords', 'genres']
for feature in features:
    metadata[feature] = metadata[feature].apply(get_list)

In [None]:
# Print the new features of the first 3 films
metadata[['title', 'cast', 'director', 'keywords', 'genres']].head(3)

Unnamed: 0,title,cast,director,keywords,genres
0,Toy Story,"[Tom Hanks, Tim Allen, Don Rickles]",John Lasseter,"[jealousy, toy, boy]","[Animation, Comedy, Family]"
1,Jumanji,"[Robin Williams, Jonathan Hyde, Kirsten Dunst]",Joe Johnston,"[board game, disappearance, based on children'...","[Adventure, Fantasy, Family]"
2,Grumpier Old Men,"[Walter Matthau, Jack Lemmon, Ann-Margret]",Howard Deutch,"[fishing, best friend, duringcreditsstinger]","[Romance, Comedy]"


The next step is to convert the names and keywords to lowercase and remove all spaces within them.

Removing spaces between words is an important preprocessing step. It ensures that our vectorizer does not treat the Tom in "Tom Hanks" and the Tom in "Tom Holland" as the same token. After this processing step, the actors above are represented as "tomhanks" and "tomholland", which the vectorizer treats as distinct tokens.

The function below does this.

In [None]:
# Function to convert all strings to lower case and strip names of spaces
def clean_data(x):
    if isinstance(x, list):
        return [str.lower(i.replace(" ", "")) for i in x]
    else:
        #Check if director exists. If not, return empty string
        if isinstance(x, str):
            return str.lower(x.replace(" ", ""))
        else:
            return ''

In [None]:
# Apply clean_data function to your features.
features = ['cast', 'keywords', 'director', 'genres']

for feature in features:
    metadata[feature] = metadata[feature].apply(clean_data)
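The effect of stripping spaces can be seen by comparing the vocabulary a vectorizer learns before and after cleaning. The sketch below uses the two actor names from the explanation above; the variable names are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

raw_names = ["Tom Hanks", "Tom Holland"]
# Same transformation as clean_data(): drop spaces, then lowercase
cleaned_names = [name.replace(" ", "").lower() for name in raw_names]

# vocabulary_ maps each learned token to its column index
raw_vocab = sorted(CountVectorizer().fit(raw_names).vocabulary_)
clean_vocab = sorted(CountVectorizer().fit(cleaned_names).vocabulary_)

print(raw_vocab)    # contains a shared 'tom' token
print(clean_vocab)  # one token per actor
```

Without cleaning, both actors contribute to the same 'tom' column, which would make their movies look more similar than they are.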

The create_soup function joins all the required columns into a single string, separated by spaces. This is the final preprocessing step, and the output of this function is passed to the vectorizer.

In [None]:
def create_soup(x):
    return ' '.join(x['keywords']) + ' ' + ' '.join(x['cast']) + ' ' + x['director'] + ' ' + ' '.join(x['genres'])

In [None]:
# Create a new soup feature
metadata['soup'] = metadata.apply(create_soup, axis=1)

In [None]:
metadata[['soup']].head(2)

Unnamed: 0,soup
0,jealousy toy boy tomhanks timallen donrickles ...
1,boardgame disappearance basedonchildren'sbook ...


The next steps are the same as for the plot description based recommender. One major difference is that we use CountVectorizer() instead of TF-IDF, because we do not want to down-weight an actor or director merely because they have acted in or directed relatively more movies. Down-weighting them in this context does not make much intuitive sense.

The main difference between CountVectorizer() and TF-IDF is the inverse document frequency (IDF) component, which is present in the latter and not in the former.
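The difference can be observed on a toy example. In the sketch below, three hypothetical soups share some tokens: tomhanks appears in two of them, drama in only one. The soups and variable names are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical soups: 'tomhanks' is more common than 'drama'
toy_soups = [
    "tomhanks drama",
    "tomhanks comedy",
    "timallen comedy",
]

count_vec = CountVectorizer()
counts = count_vec.fit_transform(toy_soups).toarray()

tfidf_vec = TfidfVectorizer()
weights = tfidf_vec.fit_transform(toy_soups).toarray()

# Compare the two tokens of the first soup under each scheme
c_cols, t_cols = count_vec.vocabulary_, tfidf_vec.vocabulary_
print(counts[0, c_cols['tomhanks']], counts[0, c_cols['drama']])
print(weights[0, t_cols['tomhanks']], weights[0, t_cols['drama']])
```

Under raw counts both tokens of the first soup weigh the same, whereas TF-IDF gives the more frequently occurring actor a smaller weight than the rarer genre token, which is exactly the down-weighting we want to avoid here.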

In [None]:
count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(metadata['soup'])

In [None]:
count_matrix.shape

(46628, 73881)

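From the output above, we can see that the soup column contains 73,881 distinct terms across 46,628 rows.

Next we use cosine_similarity to measure the similarity between the count vectors. The similarity matrix is computed only for the first 3,000 rows, so recommendations are limited to titles within that subset.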

In [None]:
count_matrix_sample = count_matrix[0:3000]
cosine_sim2 = cosine_similarity(count_matrix_sample, count_matrix_sample)

In [None]:
# Reset index of your main DataFrame and construct reverse mapping as before
metadata = metadata.reset_index()
indices = pd.Series(metadata.index, index=metadata['title'])

In [None]:
# get_recommendations('The Dark Knight Rises', cosine_sim2)

In [None]:
get_recommendations('The Godfather', cosine_sim2)

1934    The Godfather: Part III
1199     The Godfather: Part II
1186             Apocalypse Now
1648           Ill Gotten Gains
5                          Heat
426               Carlito's Way
1084        Glengarry Glen Ross
1430              Donnie Brasco
1614              The Rainmaker
1856          On the Waterfront
Name: title, dtype: object

Great! We see that our recommender produces better recommendations thanks to the richer metadata. There are, of course, several other ways to experiment with this system and improve the recommendations.

Some suggestions:

Introduce a popularity filter: the recommender could take the 30 most similar movies, calculate their weighted ratings (using the IMDB formula from above), sort the movies by this rating, and return the top 10.

Other crew members: the names of other crew members, such as screenwriters and producers, could also be included.

Increasing the weight of the director: to give more weight to the director, he or she could be mentioned multiple times in the soup to increase the similarity scores of movies with the same director.
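The popularity filter suggested above could be sketched as follows. This is one possible shape, not part of the original notebook: the helper name `rerank_by_weighted_rating` is hypothetical, and the candidate frame is made-up toy data standing in for the most similar movies returned by the recommender.

```python
import pandas as pd


def rerank_by_weighted_rating(candidates, m, C, top_n=10):
    """Re-rank candidate movies by the IMDB-style weighted rating.

    `candidates` needs 'vote_count' and 'vote_average' columns.
    """
    v = candidates['vote_count']
    R = candidates['vote_average']
    scored = candidates.assign(score=(v / (v + m)) * R + (m / (v + m)) * C)
    return scored.sort_values('score', ascending=False).head(top_n)


# Hypothetical candidates standing in for the most similar movies
toy_candidates = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'vote_count': [1000, 10, 500, 50],
    'vote_average': [8.0, 9.5, 7.0, 6.0],
})

top_candidates = rerank_by_weighted_rating(toy_candidates, m=100, C=5.0, top_n=2)
print(top_candidates['title'].tolist())
```

With these toy values the highly rated but sparsely voted title B drops out, and the two well-supported titles A and C are returned. In the notebook, the candidates would be the rows of metadata at the positions of the 30 most similar movies, with m and C taken from the simple recommender section.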

# **Collaborative Filtering with Python**

We have learned how to build a simple recommender and a content-based movie recommender. Another very popular type of recommender is known as collaborative filtering.

Collaborative filters can be divided into two types.

User-Based Filtering: The system recommends products that similar users like to the user. For example, suppose Alice and Bob are equally interested in books (i.e. they like and dislike basically the same books). Now we suppose a new book is on the market, Alice reads it and likes it. Therefore, it is very likely that Bob will also like this book, so the system recommends this book to Bob.

Item-Based Filtering: This system is very similar to the content recommendation engine we have created. These systems identify similar items based on how people have valued them in the past. For example, if Alice, Bob, and Eve gave The Lord of the Rings and The Hobbit 5 stars, the system will identify the items as similar. So if someone buys The Lord of the Rings, the system recommends them The Hobbit as well.
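The item-based idea can be sketched on a tiny ratings table. The example below reuses Alice, Bob, and Eve and the two titles from the text; the third title ("Movie X") and all of its ratings are hypothetical, and using plain cosine similarity on raw ratings is a simplifying assumption (practical systems often mean-center ratings and must handle missing values).

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Rows are users, columns are items; 'Movie X' and its ratings are made up
ratings = pd.DataFrame(
    {
        'The Lord of the Rings': [5, 5, 5],
        'The Hobbit': [5, 5, 5],
        'Movie X': [1, 5, 2],
    },
    index=['Alice', 'Bob', 'Eve'],
)

# Transpose so that each row is an item described by the ratings it received
item_sim = pd.DataFrame(
    cosine_similarity(ratings.T),
    index=ratings.columns,
    columns=ratings.columns,
)

# Items most similar to 'The Lord of the Rings', excluding itself
neighbours = (
    item_sim['The Lord of the Rings']
    .drop('The Lord of the Rings')
    .sort_values(ascending=False)
)
print(neighbours.index.tolist())
```

Because all three users rated the two Tolkien titles identically, The Hobbit comes out as the closest item to The Lord of the Rings, ahead of the hypothetical third movie.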