# Problem: IMDB Dataset Recommender
### Problem class: Plot description-based recommender system
### Problem dataset link: https://bit.ly/33IAohl
### Problem description:
     Build a plot-based recommendation system using the cleaned IMDB dataset.

### Problem Task:
     Recommend the top 10 most similar movies.

# Importing libraries

In [1]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns
sns.set()

# Load the dataset into a pandas dataframe

In [2]:
# import data from the clean file
df = pd.read_csv("data/metadata_clean.csv", low_memory=False)

# show the head of the cleaned dataframe
df.head()

Unnamed: 0,title,genres,runtime,vote_average,vote_count,year
0,Toy Story,"['Animation', 'Comedy', 'Family']",81.0,7.7,5415.0,1995
1,Jumanji,"['Adventure', 'Fantasy', 'Family']",104.0,6.9,2413.0,1995
2,Grumpier Old Men,"['Romance', 'Comedy']",101.0,6.5,92.0,1995
3,Waiting to Exhale,"['Comedy', 'Drama', 'Romance']",127.0,6.1,34.0,1995
4,Father of the Bride Part II,['Comedy'],106.0,5.7,173.0,1995


In [3]:
# Note: the models we are building compute the pairwise similarity between bodies of text.
# Q: how do we numerically quantify the similarity between two bodies of text?
# Example scenario: consider three movies: A, B, and C. How can we show mathematically that the plot of A
# is more similar to the plot of B than to that of C (or vice versa)?

# A: the first step is to represent the bodies of text (henceforth referred to as documents) as mathematical
# quantities. This is done by representing these documents as vectors. In other words, every document is depicted
# as a series of n numbers, where each number represents a dimension and n is the size of the vocabulary of all
# the documents put together.

# Q: what are the values of these vectors?
# A: That depends on the vectorizer used. The two most popular vectorizers are CountVectorizer and TfidfVectorizer.

# CountVectorizer

In [4]:
# CountVectorizer is the simplest type of vectorizer
# Example: 
#     we have three documents, A, B, and C, which are as follows:
#     A: The sun is a star.
#     B: My love is like a red, red rose
#     C: Mary had a little lamb

# Goal: convert these documents into their vector forms using CountVectorizer
# step 1: compute the vocabulary, which is the set of unique words present across all documents.
#         The vocabulary for this set of three documents is as follows: the, sun, is, a, star, my, love, like,
#         red, rose, mary, had, little, lamb. Consequently, the size of the vocabulary is 14
# step 2: eliminate the stop words (extremely common words; here: the, is, a, my, had)
#         V: like, little, lamb, love, mary, red, rose, sun, star
# step 3: count the words. The size of our vocabulary is now nine, so our documents will be represented as
#         nine-dimensional vectors, and each dimension will hold the number of times a particular word occurs
#         in a document

## Applying CountVectorizer to A, B, and C (dimensions follow the order of V above)
#     A: (0, 0, 0, 0, 0, 0, 0, 1, 1)
#     B: (1, 0, 0, 1, 0, 2, 1, 0, 0)
#     C: (0, 1, 1, 0, 1, 0, 0, 0, 0)
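The three vectors above can be reproduced with scikit-learn. This is a minimal sketch, not the notebook's own code: the vocabulary is passed in explicitly (an assumption made here) so that the columns follow the order of V rather than scikit-learn's default alphabetical order.

```python
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "The sun is a star.",               # A
    "My love is like a red, red rose",  # B
    "Mary had a little lamb",           # C
]

# Fixing the vocabulary keeps the column order identical to V above;
# words outside this list (the stop words) are simply not counted.
vocabulary = ["like", "little", "lamb", "love", "mary", "red", "rose", "sun", "star"]
vectorizer = CountVectorizer(vocabulary=vocabulary)

# Each row is one document, each column one vocabulary word
counts = vectorizer.fit_transform(documents).toarray()
print(counts)
```

Left to its defaults, `CountVectorizer` would learn the vocabulary itself and sort it alphabetically, so the same counts would appear in a different column order.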

## TfidfVectorizer (Term Frequency-Inverse Document Frequency)

In [5]:
#     Not all words in a document carry equal weight. TfidfVectorizer takes this into consideration and
#     assigns a weight to each word according to the following formula. For every word i in document j:
#                                     w_ij = tf_ij * log(N / df_i)
#             w_ij is the weight of word i in document j
#             tf_ij is the number of times word i occurs in document j
#             df_i is the number of documents that contain word i
#             N is the total number of documents
# Note: the weight of a word in a document is greater if it occurs more frequently in that document and is
#      present in fewer documents.
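The formula can be worked through by hand on documents A, B, and C from the CountVectorizer section, after stop-word removal. The sketch below applies the plain textbook formula; the helper name `tfidf_weights` is made up for illustration. scikit-learn's `TfidfVectorizer` uses a smoothed idf and L2-normalizes each row by default, so its numbers will differ from these.

```python
import math
from collections import Counter

# Tokens of each document after stop-word removal
docs = {
    "A": ["sun", "star"],
    "B": ["love", "like", "red", "red", "rose"],
    "C": ["mary", "little", "lamb"],
}

def tfidf_weights(docs):
    """Return {document: {word: tf * log(N / df)}} using the textbook formula."""
    n_docs = len(docs)
    # df: number of documents that contain each word
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))
    weights = {}
    for name, tokens in docs.items():
        tf = Counter(tokens)  # tf: occurrences of each word in this document
        weights[name] = {word: count * math.log(n_docs / df[word]) for word, count in tf.items()}
    return weights

weights = tfidf_weights(docs)
print(weights["B"]["red"])   # tf = 2, df = 1, N = 3
print(weights["B"]["rose"])  # tf = 1, df = 1, N = 3
```

The word "red" occurs twice in B and in no other document, so its weight is twice that of "rose". A word present in all three documents would get log(3 / 3) = 0.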

# Cosine similarity score

In [6]:
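# cosine similarity score between two documents, x and y
#     cosine(x, y) = (x . y) / (||x|| * ||y||)
#     where x . y is the dot product of the two vectors and ||x|| is the Euclidean norm of x

# Note: In general, the cosine score can take any value between -1 and 1. Count and TF-IDF vectors have no
# negative entries, so for documents the score lies between 0 and 1. The higher the cosine score, the more
# similar the documents are to each other.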

# Plot description-based recommender system

In [7]:
"""
Description: take in a movie title as an argument and output a list of movies that are most similar based 
on their plots
"""
# TODO:
# 1. Obtain the data required to build the model
# 2. Create TF-IDF vectors for the plot description (or overview) of every movie
# 3. Compute the pairwise cosine similarity score of every movie
# 4. Write the recommender function that takes in a movie title as an argument and outputs movies most
# similar to it based on the plot


In [8]:
# The features required by the plot description-based recommender are available in the original metadata file.
# Import the original file
original_df = pd.read_csv("data/movies_metadata.csv", low_memory=False)

# Add the useful features into the cleaned dataframe
df['overview'], df['id'] = original_df['overview'], original_df['id']

df.head()

Unnamed: 0,title,genres,runtime,vote_average,vote_count,year,overview,id
0,Toy Story,"['Animation', 'Comedy', 'Family']",81.0,7.7,5415.0,1995,"Led by Woody, Andy's toys live happily in his ...",862
1,Jumanji,"['Adventure', 'Fantasy', 'Family']",104.0,6.9,2413.0,1995,When siblings Judy and Peter discover an encha...,8844
2,Grumpier Old Men,"['Romance', 'Comedy']",101.0,6.5,92.0,1995,A family wedding reignites the ancient feud be...,15602
3,Waiting to Exhale,"['Comedy', 'Drama', 'Romance']",127.0,6.1,34.0,1995,"Cheated on, mistreated and stepped on, the wom...",31357
4,Father of the Bride Part II,['Comedy'],106.0,5.7,173.0,1995,Just when George Banks has recovered from his ...,11862


In [9]:
df.shape

(45466, 8)

# Creating the TF-IDF matrix

In [10]:
# Import TfidfVectorizer from the scikit-learn library
from sklearn.feature_extraction.text import TfidfVectorizer

# Define a TF-IDF vectorizer object. Remove all English stop words
tfidf = TfidfVectorizer(stop_words='english')

# Replace NaN with an empty string
df['overview'] = df['overview'].fillna('')

# Construct the required TF-IDF matrix by applying the fit_transform method on the overview feature
tfidf_matrix = tfidf.fit_transform(df['overview'])

# Output the shape of tfidf_matrix
tfidf_matrix.shape

# Insight: vectorizer has created a 75,827-dimensional vector for the overview of every movie

(45466, 75827)
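To make the shape easier to read, the same steps can be run on a tiny corpus. The three overviews below are invented for illustration. The sketch shows that the matrix has one row per document and one column per vocabulary term, and that it is stored sparsely.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical overviews, standing in for df['overview']
toy_overviews = [
    "A cowboy doll feels threatened by a new spaceman toy",
    "Two siblings discover an enchanted board game",
    "A spaceman crash-lands on a planet of toys",
]

toy_tfidf = TfidfVectorizer(stop_words="english")
toy_matrix = toy_tfidf.fit_transform(toy_overviews)

n_docs, n_terms = toy_matrix.shape
print(n_docs, n_terms)               # rows = documents, columns = vocabulary size
print(sorted(toy_tfidf.vocabulary_)) # the terms that survived stop-word removal
print(toy_matrix.nnz, "non-zero entries out of", n_docs * n_terms)
```

Each overview uses only a handful of the vocabulary terms, which is why a sparse matrix is practical even at 45,466 rows by 75,827 columns.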

# Computing the cosine similarity score

In [None]:
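# Calculating the pairwise cosine similarity of every movie is a computationally expensive process.
# TfidfVectorizer L2-normalizes every row by default, so the dot product of two rows already equals their
# cosine similarity. That is why the cheaper linear_kernel (a plain dot product) is used here.
# Note: the result is a dense 45466 x 45466 matrix; at 8 bytes per float64 value that is roughly 16.5 GB.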

# Import linear_kernel to compute the dot product
from sklearn.metrics.pairwise import linear_kernel

# Compute the cosine similarity matrix
cosine_similarity = linear_kernel(tfidf_matrix, tfidf_matrix)
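The reason a dot product is enough can be checked on a small corpus. A minimal sketch, assuming `TfidfVectorizer` is left at its default `norm='l2'`; the overviews are invented, and scikit-learn's `cosine_similarity` function is imported under the alias `sk_cosine_similarity` purely to avoid clashing with the notebook's variable of the same name.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity as sk_cosine_similarity
from sklearn.metrics.pairwise import linear_kernel

# Hypothetical overviews for illustration
toy_overviews = [
    "A lion cub flees his kingdom after his father dies",
    "A young prince must reclaim his kingdom from his uncle",
    "Two chefs compete in a heated cooking contest",
]

toy_matrix = TfidfVectorizer(stop_words="english").fit_transform(toy_overviews)

# Euclidean norm of every row; the default settings should make each one equal to 1
row_norms = np.sqrt(np.asarray(toy_matrix.multiply(toy_matrix).sum(axis=1)).ravel())

dot_products = linear_kernel(toy_matrix, toy_matrix)
cosines = sk_cosine_similarity(toy_matrix, toy_matrix)

print(np.allclose(row_norms, 1.0))
print(np.allclose(dot_products, cosines))
```

If the vectorizer were created with `norm=None`, the rows would no longer have unit length and the two matrices would differ.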

# Building the recommender

In [None]:
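# Construct a reverse mapping from movie title to row index.
# Series.drop_duplicates() compares the values (row indices, which are all unique), so it would not remove
# repeated titles; filter on the index instead and keep the first row for each title.
indices = pd.Series(df.index, index=df['title'])
indices = indices[~indices.index.duplicated(keep='first')]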
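How duplicate titles behave in a title-to-index mapping can be checked on a toy frame. The titles below are hypothetical. `Series.drop_duplicates` looks at the values (the row positions), while `Index.duplicated` looks at the labels (the titles).

```python
import pandas as pd

# Hypothetical titles with one repeat
toy_df = pd.DataFrame({"title": ["Hamlet", "Heat", "Hamlet"]})
mapping = pd.Series(toy_df.index, index=toy_df["title"])

# Deduplicates on the values 0, 1, 2, which are already unique
by_value = mapping.drop_duplicates()

# Deduplicates on the titles, keeping the first occurrence
by_title = mapping[~mapping.index.duplicated(keep="first")]

print(len(by_value))        # both "Hamlet" rows are still present
print(by_title["Hamlet"])   # a single integer row index
```

With a repeated title still in the mapping, a lookup such as `mapping["Hamlet"]` returns a Series of several row indices rather than one integer, which breaks code that expects a single row.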

In [None]:
# TODO:
# 1. Declare the title of the movie as an argument.
# 2. Obtain the index of the movie from the indices reverse mapping.
# 3. Get the list of cosine similarity scores for that particular movie with all movies using cosine_similarity.
# Convert this into a list of tuples where the first element is the position and the second is the
# similarity score.
# 4. Sort this list of tuples on the basis of the cosine similarity scores.
# 5. Get the top 10 elements of this list. Ignore the first element as it refers to the similarity score with
# itself (the movie most similar to a particular movie is obviously the movie itself).
# 6. Return the titles corresponding to the indices of the top 10 elements.

In [None]:
# Function that takes in a movie title as input and returns recommendations
def content_based_recommender(title: str, cosine_similarity: np.ndarray = cosine_similarity, df: pd.DataFrame = df, indices: pd.Series = indices) -> pd.Series:
    # Obtain the index of the movie that matches the title
    idx = indices[title]

    # Get the pairwise similarity scores of all movies with that movie
    # and convert them into a list of (position, score) tuples as described above
    similarity_scores = list(enumerate(cosine_similarity[idx]))

    # Sort the movies by cosine similarity score, highest first
    similarity_scores = sorted(similarity_scores, key=lambda x: x[1], reverse=True)

    # Keep the 10 most similar movies. Skip the first entry, which is the movie itself.
    similarity_scores = similarity_scores[1:11]

    # Get the movie indices
    movie_indices = [i[0] for i in similarity_scores]

    # Return the titles of the top 10 most similar movies
    return df['title'].iloc[movie_indices]

In [None]:
# Get recommendations for The Lion King
content_based_recommender('The Lion King')
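The whole pipeline can be tried end to end without the full dataset. The sketch below is self-contained; every title and overview in it is invented, and the helper `recommend` is a hypothetical compact variant of the recommender above, not the notebook's function.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Hypothetical toy data standing in for the cleaned metadata
toy_df = pd.DataFrame({
    "title": ["Space Rescue", "Moon Mission", "Kitchen Wars", "Baking Battle"],
    "overview": [
        "Astronauts attempt a daring rescue on a damaged space station",
        "Astronauts travel to the moon after their space station fails",
        "Two chefs compete in a heated cooking contest",
        "Amateur chefs compete in a televised baking contest",
    ],
})

# TF-IDF vectors, pairwise similarity, and the title-to-row mapping
toy_matrix = TfidfVectorizer(stop_words="english").fit_transform(toy_df["overview"])
toy_similarity = linear_kernel(toy_matrix, toy_matrix)
toy_indices = pd.Series(toy_df.index, index=toy_df["title"])

def recommend(title, top_n=1):
    """Return the top_n titles most similar to the given title, excluding the title itself."""
    idx = toy_indices[title]
    scores = pd.Series(toy_similarity[idx], index=toy_df["title"])
    return scores.drop(title).nlargest(top_n).index.tolist()

print(recommend("Space Rescue"))
print(recommend("Baking Battle"))
```

Dropping the query title by label, rather than skipping the first sorted entry, avoids relying on the movie itself always sorting first when several movies tie at the top score.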