## Content Based Filtering
1. In this recommender system, the content of a movie (overview, cast, crew, keywords, tagline, etc.) is used to measure its similarity to other movies. The movies most similar to a given movie are then recommended.

2. We will compute pairwise similarity scores for all movies based on their plot descriptions and recommend movies based on those scores. The plot description is given in the `overview` feature of our dataset. Let's take a look at the data.

3. Anyone who has done even a bit of text processing knows that we first need to convert each overview into a word vector. Here we compute Term Frequency-Inverse Document Frequency (TF-IDF) vectors for each overview.

Term frequency (TF) is the relative frequency of a word in a document, given as (term instances / total instances). Inverse document frequency (IDF) measures how rare a term is across the documents, given as log(number of documents / documents with term). The overall importance of a word to a document in which it appears is TF * IDF.
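To make the two formulas concrete, here is a minimal sketch that applies them by hand to three toy overviews. The `docs` list and the helper names `tf`, `idf`, and `tf_idf` are illustrative assumptions, not part of the dataset or of the project's code, and the tokenization is a plain whitespace split.

```python
import math
from collections import Counter

# Toy overviews; hypothetical stand-ins for the dataset's `overview` column.
docs = [
    "a hero saves the city",
    "a villain threatens the city",
    "a hero falls in love",
]

tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: the number of documents that contain each term.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))


def tf(term, tokens):
    """Term frequency: term instances / total instances in one document."""
    return tokens.count(term) / len(tokens)


def idf(term):
    """Inverse document frequency: log(number of documents / documents with term)."""
    return math.log(n_docs / doc_freq[term])


def tf_idf(term, tokens):
    """Importance of `term` to one document: TF * IDF."""
    return tf(term, tokens) * idf(term)


# Score a few terms against the first overview.
for term in ("a", "hero", "saves"):
    print(term, round(tf_idf(term, tokenized[0]), 4))
```

In this toy corpus the word "a" appears in every overview, so its IDF is log(1) = 0 and it contributes nothing, while "saves" appears in only one overview and scores higher than "hero", which appears in two.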

This gives a matrix in which each row represents a movie and each column represents a word in the overview vocabulary (all the words that appear in at least one overview). The IDF weighting reduces the importance of words that occur in many plot overviews, and therefore their influence on the final similarity score.
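The row and column layout can be sketched with the same formulas on a toy corpus. This is only an illustration under stated assumptions: the `overviews` list is made up, tokenization is a whitespace split, and the matrix is a plain list of lists rather than whatever structure the project actually uses.

```python
import math

# Toy overviews; hypothetical stand-ins for real plot descriptions.
overviews = [
    "a hero saves the city",
    "a villain threatens the city",
    "a hero falls in love",
]
tokenized = [text.split() for text in overviews]
n_docs = len(tokenized)

# Vocabulary: every word that appears in at least one overview (one column each).
vocabulary = sorted({word for tokens in tokenized for word in tokens})

# Number of overviews containing each vocabulary word.
doc_freq = {w: sum(w in tokens for tokens in tokenized) for w in vocabulary}

# One row per overview, one column per vocabulary word, each cell TF * IDF.
matrix = [
    [
        (tokens.count(w) / len(tokens)) * math.log(n_docs / doc_freq[w])
        for w in vocabulary
    ]
    for tokens in tokenized
]

print("shape:", (len(matrix), len(vocabulary)))
print("vocabulary:", vocabulary)
```

Most cells are zero because each overview uses only a few vocabulary words, and the column for a word that occurs in every overview is zero throughout.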

4. The resulting TF-IDF matrix has shape (4802, 20977): about 21,000 different words were used to describe the 4,802 movies in our dataset.

With this matrix in hand, we can now compute a similarity score. There are several candidates, such as the Euclidean, Pearson, and cosine similarity scores. No single score is best in every case: different scores work well in different scenarios, so it is often a good idea to experiment with several metrics.
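As a rough way to experiment, the three candidates can be compared on small hand-made vectors. The sketch below assumes plain Python lists as vectors; the function names and the vectors `short`, `long`, and `other` are hypothetical. Note that the Euclidean score here is a distance (smaller means more similar), while the other two are similarities.

```python
import math


def euclidean_distance(x, y):
    """Straight-line distance between two vectors (0 means identical)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def cosine_similarity(x, y):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def pearson_correlation(x, y):
    """Pearson correlation: cosine similarity of the mean-centered vectors."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    centered_x = [a - mean_x for a in x]
    centered_y = [b - mean_y for b in y]
    return cosine_similarity(centered_x, centered_y)


short = [1.0, 2.0, 0.0, 1.0]
long = [2.0, 4.0, 0.0, 2.0]    # same direction as `short`, twice the magnitude
other = [0.0, 1.0, 3.0, 0.0]   # points in a different direction

for name, vec in (("long", long), ("other", other)):
    print(
        name,
        round(euclidean_distance(short, vec), 3),
        round(cosine_similarity(short, vec), 3),
        round(pearson_correlation(short, vec), 3),
    )
```

Scaling a vector leaves its cosine and Pearson scores against `short` at the maximum while the Euclidean distance grows, which is one reason the choice of metric depends on whether magnitude should matter.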

We will use the cosine similarity to calculate a numeric quantity that denotes the similarity between two movies. We use the cosine similarity score because it is independent of vector magnitude and is relatively easy and fast to calculate. Mathematically, for the TF-IDF vectors $x$ and $y$ of two movies, it is defined as follows:

$$\cos(\alpha) = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}$$
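A minimal sketch of how cosine scores could drive a recommendation, assuming each movie already has a TF-IDF row. The titles, the vectors, and the `recommend` helper below are hypothetical and are not the `ContentBasedFiltering.recommend` method used later in this notebook.

```python
import math

# Hypothetical movies with made-up TF-IDF rows (three vocabulary words).
titles = ["Movie A", "Movie B", "Movie C", "Movie D"]
vectors = [
    [0.5, 0.5, 0.0],
    [0.4, 0.6, 0.0],
    [0.0, 0.1, 0.9],
    [0.0, 0.0, 1.0],
]


def cosine_similarity(x, y):
    """Dot product divided by the product of the vector norms."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


# Pairwise similarity matrix: entry [i][j] compares movie i with movie j.
similarity = [[cosine_similarity(x, y) for y in vectors] for x in vectors]


def recommend(title, top_n=2):
    """Return the top_n titles most similar to `title`, excluding itself."""
    idx = titles.index(title)
    ranked = sorted(
        (i for i in range(len(titles)) if i != idx),
        key=lambda i: similarity[idx][i],
        reverse=True,
    )
    return [titles[i] for i in ranked[:top_n]]


print(recommend("Movie A"))
```

With one row per movie, the pairwise matrix is square and symmetric, which matches the (4802, 4802) shape reported for the real dataset.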


In [1]:
import os
import sys
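# Add the parent of the current working directory to the import path
# so that the local `Movie` package can be imported below.
current_dir = os.getcwd()
parent_dir = os.path.abspath(os.path.join(current_dir, '..'))
sys.path.append(parent_dir)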
from Movie.pre_processing import MovieDataset
from Movie.recommend.content_based_filtering import ContentBasedFiltering

In [2]:
credits_path = os.path.join(parent_dir, 'data', 'raw', 'tmdb_5000_credits.csv')
movies_path = os.path.join(parent_dir, 'data', 'raw', 'tmdb_5000_movies.csv')

In [3]:
dataset = MovieDataset(credits_path, movies_path)
dataset.load_data()
dataset.preprocess_data()
processed_data = dataset.data
print("Processed data sample:")
print(processed_data.head())

Starting data loading...
Starting data preprocessing...
Processed data sample:
       id                                           overview  \
0   19995  In the 22nd century, a paraplegic Marine is di...   
1     285  Captain Barbossa, long believed to be dead, ha...   
2  206647  A cryptic message from Bond’s past sends him o...   
3   49026  Following the death of District Attorney Harve...   
4   49529  John Carter is a war-weary, former military ca...   

                                          genres  \
0  [Action, Adventure, Fantasy, Science Fiction]   
1                   [Adventure, Fantasy, Action]   
2                     [Action, Adventure, Crime]   
3               [Action, Crime, Drama, Thriller]   
4           [Action, Adventure, Science Fiction]   

                                            keywords  \
0  [culture clash, future, space war, space colon...   
1  [ocean, drug abuse, exotic island, east india ...   
2  [spy, based on novel, secret agent, sequel, mi...   

In [4]:
Content_Based_filtering = ContentBasedFiltering(processed_data)

TF-IDF matrix shape: (4802, 20977)
Feature names: ['00' '000' '007' '07am' '10' '100' '1000' '101' '108' '10th']
Cosine similarity matrix shape: (4802, 4802)


In [7]:
test_movie_title = ['The Dark Knight Rises', 'Inception', 'Interstellar', 'The Matrix', 'Fight Club', 'Forrest Gump', 'The Shawshank Redemption']
recommendations = [Content_Based_filtering.recommend(movie, top_n=10) for movie in test_movie_title]

In [9]:
print(f"Top 10 movie recommendations based on content for {test_movie_title}:")
# Pair each title with its recommendation table instead of indexing by position.
for title, rec in zip(test_movie_title, recommendations):
    print(f"Recommendations for {title}:")
    print(rec[['title', 'overview', 'vote_average', 'vote_count']])

Top 10 movie recommendations based on content for ['The Dark Knight Rises', 'Inception', 'Interstellar', 'The Matrix', 'Fight Club', 'Forrest Gump', 'The Shawshank Redemption']:
Recommendations for The Dark Knight Rises:
                                        title  \
65                            The Dark Knight   
299                            Batman Forever   
428                            Batman Returns   
1359                                   Batman   
3854  Batman: The Dark Knight Returns, Part 2   
119                             Batman Begins   
2507                                Slow Burn   
9          Batman v Superman: Dawn of Justice   
1181                                      JFK   
210                            Batman & Robin   

                                               overview  vote_average  \
65    Batman raises the stakes in his war on crime. ...           8.2   
299   The Dark Knight of Gotham City confronts a das...           5.2   
428   Having defeate