<h3>IMPORT PANDAS LIBRARY AND READ DATA</h3>

In [1]:
# Import Pandas
import pandas as pd

# Load Movies Metadata
metadata = pd.read_csv('movies_metadata.csv',low_memory=False)



In [2]:
# Print the first three rows

metadata.head(3)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0


<h3> SIMPLE RECOMMENDER </h3>

Simple recommenders are basic systems that recommend the top items based on a certain metric or score. This section builds a simplified clone of the IMDB Top 250 Movies chart using metadata collected from IMDB.

**The following are the steps involved:** 
<ul>
    <li>Decide on the metric or score to rate movies on.</li>
    <li>Calculate the score for every movie.</li>
    <li>Sort the movies based on the score and output the top results.</li>
</ul>

Since we are trying to build a clone of IMDB's Top 250, let's use its weighted rating formula as a metric/score. Mathematically, it is presented as follows:
\begin{equation}
\text{Weighted Rating } (\mathbf{WR}) = \left(\frac{\mathbf{v}}{\mathbf{v} + \mathbf{m}} \cdot R\right) + \left(\frac{\mathbf{m}}{\mathbf{v} + \mathbf{m}} \cdot C\right)
\end{equation}
<ul>
    <li> <b>v</b> is the number of votes for the movie.</li>
    <li> <b>m</b> is minimum votes required to be listed in the chart.</li>
    <li> <b>R</b> is the average rating of the movie.</li>
    <li> <b>C</b> is the mean vote across the whole report.</li>
</ul>
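
To get a feel for how the formula behaves, here is a minimal sketch in plain Python. The values of `m`, `C`, `v`, and `R` below are hypothetical and chosen only for illustration; they are not taken from the dataset.

```python
def weighted_rating(v, R, m, C):
    """IMDB-style weighted rating: blends a movie's own rating R with the global mean C."""
    return (v / (v + m)) * R + (m / (v + m)) * C

# Hypothetical values, for illustration only
m, C = 50, 6.0

# Same average rating, very different numbers of votes
few_votes = weighted_rating(v=5, R=9.0, m=m, C=C)      # pulled strongly toward C
many_votes = weighted_rating(v=5000, R=9.0, m=m, C=C)  # stays close to R

print(round(few_votes, 3), round(many_votes, 3))
```

With few votes the score is pulled toward the global mean `C`; with many votes it approaches the movie's own rating `R`. This is why a movie with a handful of perfect votes does not automatically top the chart.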

In [3]:
# Calculate mean of vote average column
C = metadata['vote_average'].mean()
print(C)

5.618207215134185


In [4]:
# Calculate the minimum number of votes required to be in the chart, m

m = metadata['vote_count'].quantile(0.9)
print(m)

160.0


<b>Explanation: </b>
<ul>
    <li>The <b>quantile()</b> function is used to calculate the 90th percentile of the <b>vote_count</b> column in the <b>metadata</b> data frame.</li>
    <li>This means that 90% of the movies have a vote count lower than the value returned by this function.</li>
</ul>
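
As a small illustration of the cutoff, the sketch below applies `quantile(0.9)` to a hypothetical Series of ten vote counts (made-up numbers, not the movie data) and keeps only the entries at or above it.

```python
import pandas as pd

# Hypothetical vote counts for ten movies
votes = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# 90th percentile, using pandas' default linear interpolation
cutoff = votes.quantile(0.9)

# Movies at or above the cutoff would qualify for the chart
qualified = votes[votes >= cutoff]

print(cutoff, list(qualified))
```

Nine of the ten values fall below the cutoff, so only the most-voted entry qualifies. A higher quantile gives a stricter chart, a lower one a more inclusive chart.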

In [5]:
# Filter out all qualified movies into a new DataFrame
q_movies = metadata.copy().loc[metadata['vote_count']>=m]

# Show dimension of q_movies
q_movies.shape

(4555, 24)

In [6]:
# Define Function that computes the weighted rating of each movie

def weighted_rating(x, m=m, C=C):
    v = x['vote_count'] # Number of votes for the movie
    R = x['vote_average'] # The average rating of the movie
    # Calculate based on the IMDB formula
    return (v/(v+m)*R) + (m/(m+v)*C)

In [7]:
# Define a new feature 'score' and calculate its value with 'weighted_rating'

q_movies['score'] = q_movies.apply(weighted_rating,axis = 1)

In [8]:
# Sort the movies based on score calculated above
q_movies = q_movies.sort_values('score', ascending = False)

# Print the top 20 movies
q_movies[['title','vote_count','vote_average','score']].head(20)

Unnamed: 0,title,vote_count,vote_average,score
314,The Shawshank Redemption,8358.0,8.5,8.445869
834,The Godfather,6024.0,8.5,8.425439
10309,Dilwale Dulhania Le Jayenge,661.0,9.1,8.421453
12481,The Dark Knight,12269.0,8.3,8.265477
2843,Fight Club,9678.0,8.3,8.256385
292,Pulp Fiction,8670.0,8.3,8.251406
522,Schindler's List,4436.0,8.3,8.206639
23673,Whiplash,4376.0,8.3,8.205404
5481,Spirited Away,3968.0,8.3,8.196055
2211,Life Is Beautiful,3643.0,8.3,8.187171


**CONCLUSION :**
According to the output above, we can infer that the simple recommender did a good job. The chart is led by popular movies that many viewers love, which suggests that the recommender makes sense.

<h3> CONTENT-BASED RECOMMENDER </h3>

<h4>PLOT DESCRIPTION BASED RECOMMENDER</h4>
<div>
    In this section, we build a system that recommends movies that are similar to a particular movie. To achieve this, we compute the pairwise <b>cosine</b> similarity scores for all movies based on their plot descriptions and recommend the movies with the highest similarity scores.
</div>

The plot description is available as the overview feature in the <b>metadata</b> dataset.
<br>
Let's inspect the plots of a few movies.

In [9]:
# Print plot overviews of the first 5 movies.
metadata['overview'].head(5)

0    Led by Woody, Andy's toys live happily in his ...
1    When siblings Judy and Peter discover an encha...
2    A family wedding reignites the ancient feud be...
3    Cheated on, mistreated and stepped on, the wom...
4    Just when George Banks has recovered from his ...
Name: overview, dtype: object

<div>
    <p>
        The problem at hand is a Natural Language Processing problem. Hence, we need to extract some kind of features from the above text data before we can compute the similarity or dissimilarity between overviews. To put it simply, it is not possible to compute the similarity between any two overviews in their raw forms. To do this, we need to compute the word vectors of each overview, or document, as it will be called from now on.
    </p>
    <p>
        As the name suggests, word vectors are vectorized representations of the words in a document. The vectors carry semantic meaning. For example, man & king will have vector representations close to each other, while man & woman would have representations far from each other.
    </p>
    <p>
        We will compute <b>Term Frequency-Inverse Document Frequency (TF-IDF)</b> vectors for each document. This will give you a matrix where each <b>column
        </b> represents a word in the overview vocabulary (all the words that appear in at least one document), and each <b>row</b> represents a movie, as before.
    </p>
    <p>
        In essence, the TF-IDF score is the frequency of a word occurring in a document, down-weighted by the
        number of documents in which it occurs. This is done to reduce the importance of words that occur frequently across plot overviews and, therefore, their significance in computing the final similarity score.
    </p>
    <p>
        Fortunately, scikit-learn gives you a built-in <b>TfidfVectorizer</b> class that produces the TF-IDF matrix in a couple of lines.
    </p>
    <ul>
        <li>Import <b>TfidfVectorizer</b> from scikit-learn.</li>
        <li>Remove stop words like 'the', 'an', etc., since they do not give any useful information.</li>
        <li>Replace not-a-number values with a blank string.</li>
        <li>Finally, construct the TF-IDF matrix on the data.</li>
    </ul>
</div>
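
To make the down-weighting idea concrete, here is a minimal hand-rolled sketch in plain Python. It uses the textbook formula, term count times log(N / document frequency), on three hypothetical one-line "overviews". This is an illustrative assumption rather than scikit-learn's exact computation: <b>TfidfVectorizer</b> by default applies a smoothed IDF and then scales each row to unit length.

```python
import math
from collections import Counter

# Three hypothetical one-line "overviews"
docs = [
    "the toys come alive",
    "the boy finds the magical game",
    "the boy and the toys",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each word
df = Counter(word for doc in tokenized for word in set(doc))

def tfidf(doc):
    """Textbook TF-IDF: term count times log(N / document frequency)."""
    counts = Counter(doc)
    return {w: c * math.log(n_docs / df[w]) for w, c in counts.items()}

scores = tfidf(tokenized[1])
print(scores["the"], scores["boy"], scores["magical"])
```

The word "the" appears in every document and gets a score of zero, "boy" appears in two documents and gets a small score, and "magical" appears in only one document and gets the largest score.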

In [10]:
# Import TfidfVectorizer from scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

# Define a TF-IDF Vectorizer Object. Remove all English stop words such as 'the'
tfidf = TfidfVectorizer(stop_words='english')

# Replace NaN with an empty string
metadata['overview'] = metadata['overview'].fillna('')

# Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(metadata['overview'])

# Output the shape of tfidf_matrix
tfidf_matrix.shape

(45466, 75827)

In [11]:
# Array mapping from feature integer indices to feature name.

tfidf.get_feature_names_out()[5000:5010]

# Print every word - just for fun
# for i in tfidf.get_feature_names_out():
#     print(i)




array(['avails', 'avaks', 'avalanche', 'avalanches', 'avallone', 'avalon',
       'avant', 'avanthika', 'avanti', 'avaracious'], dtype=object)

<b>Explanation: </b>
<div>
    <ul>
        <li>The <b>get_feature_names_out()</b> method of the <b>tfidf</b> object returns an array of feature names in the order they appear in the feature matrix.</li>
        <li> In this code snippet, the <b>[5000:5010]</b> slice is used to get the feature names at indices 5000 through 5009 (the end index, 5010, is exclusive).</li>
    </ul>
</div>

<div>
    <p>
        From the above output, we observe that 75,827 distinct words are used to describe the 45,466 movies in the dataset.
    </p>
    <p>
        With this matrix in hand, we can now compute a similarity score. There are several similarity metrics that we can use for this, such as Manhattan, Euclidean, Pearson, and the <b>Cosine similarity scores</b>.
    </p>
    <p>
        We use cosine similarity here to calculate a numeric quantity that denotes the similarity between two movies. We choose it because it is independent of magnitude and is relatively easy and fast to calculate (especially when used in conjunction with TF-IDF scores, which will be explained later). Mathematically, it is defined as follows:
    </p>
    <img src="https://images.datacamp.com/image/upload/f_auto,q_auto:best/v1590782185/cos_aalkpq.png" />
</div>
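
Cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below implements that definition in plain Python on three hypothetical vectors; the helper name `cosine` is an assumption made for this example and is not part of any library.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

u = [1.0, 2.0, 0.0]
v = [2.0, 4.0, 0.0]   # same direction as u, twice the length
w = [0.0, 0.0, 3.0]   # shares no non-zero component with u

print(cosine(u, v), cosine(u, w))
```

Doubling the length of a vector does not change the score, which is what "independent of magnitude" means here: a long overview and a short overview that use words in the same proportions are treated as equally similar.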

<div>
    <p>
        Since you have used the TF-IDF vectorizer, which by default scales each document vector to unit length, calculating the dot product between each pair of vectors directly gives you the cosine similarity score. Therefore, you will use sklearn's <b>linear_kernel()</b> instead of <b>cosine_similarity()</b>, since it is faster.
    </p>
    <p>
        This returns a matrix of shape 45466x45466, which holds the cosine similarity score of each movie <b>overview</b> with every other movie overview. Hence, each movie is represented by a 1x45466 row vector in which each column is its similarity score with one movie.
    </p>
</div>
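
The claim that the dot product equals the cosine similarity for TF-IDF vectors can be checked on a small scale. The sketch below uses a hypothetical three-document corpus in place of the movie overviews and relies on the default `norm='l2'` setting of `TfidfVectorizer`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical mini-corpus standing in for the movie overviews
docs = [
    "a cowboy doll is threatened by a new spaceman toy",
    "two kids find a magical board game",
    "a toy cowboy and a spaceman become friends",
]

# TfidfVectorizer scales each row to unit length by default (norm='l2')
X = TfidfVectorizer(stop_words='english').fit_transform(docs)

dot = linear_kernel(X, X)        # plain dot products
cos = cosine_similarity(X, X)    # normalizes the rows again, then dot products

print(np.allclose(dot, cos))
```

Because the rows already have unit length, the extra normalization inside `cosine_similarity` is redundant work, which is why `linear_kernel` is the cheaper choice here. If the vectorizer were created with `norm=None`, the two results would generally differ.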

In [12]:
# Import linear_kernel
from sklearn.metrics.pairwise import linear_kernel

# Compute the cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

<div><b>Explanation:</b></div>
<div>
    <ul>
        <li>
            This code imports the <b>linear_kernel</b> function from the <b>sklearn.metrics.pairwise</b> module.
        </li>
        <li>
            The <b>linear_kernel</b> function is used to compute the dot product of two matrices.
        </li>
        <li>
            In the next line, the <b>cosine_sim</b> variable is assigned the result of calling the <b>linear_kernel</b> function with the <b>tfidf_matrix</b> as both arguments.
        </li>
        <li>
            This computes the cosine similarity matrix of the <b>tfidf_matrix</b>.
        </li>
        <li>
            The <b>tfidf_matrix</b> is a matrix that represents the text data in a numerical form using the term frequency-inverse document frequency (TF-IDF) method.
        </li>
        <li>
            The cosine similarity matrix is a measure of similarity between each pair of documents in the <b>tfidf_matrix</b>.
        </li>
        <li>
            Overall, this code computes the cosine similarity matrix of the <b>tfidf_matrix</b> using the <b>linear_kernel</b> function.
        </li>
    </ul>
</div>

In [13]:
cosine_sim.shape

(45466, 45466)

In [14]:
cosine_sim[1]

array([0.01504121, 1.        , 0.04681953, ..., 0.        , 0.02198641,
       0.00929411])

In [15]:
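# Construct a reverse map from movie title to row index.
# Keep only the first row for each title so that a lookup always returns a single integer.
indices = pd.Series(metadata.index, index=metadata['title'])
indices = indices[~indices.index.duplicated(keep='first')]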

In [16]:
indices[:10]

title
Toy Story                      0
Jumanji                        1
Grumpier Old Men               2
Waiting to Exhale              3
Father of the Bride Part II    4
Heat                           5
Sabrina                        6
Tom and Huck                   7
Sudden Death                   8
GoldenEye                      9
dtype: int64
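
One thing to watch for in a title-to-index map is repeated titles. `Series.drop_duplicates()` compares values rather than index labels, so it leaves repeated titles in place when the values are unique row numbers. The sketch below shows the effect on three hypothetical titles and one possible way to keep a single entry per title.

```python
import pandas as pd

# Hypothetical titles; 'Heat' appears twice
titles = pd.Series(['Heat', 'Sabrina', 'Heat'])
lookup = pd.Series(titles.index, index=titles)

# drop_duplicates() looks at the values (0, 1, 2), so nothing is removed
print(len(lookup.drop_duplicates()))

# A repeated title returns a Series of two row numbers instead of one integer
print(type(lookup['Heat']).__name__)

# Keep only the first row for each title
unique_lookup = lookup[~lookup.index.duplicated(keep='first')]
print(unique_lookup['Heat'])
```

A lookup that returns several row numbers would break code that expects a single integer index into the similarity matrix, so deduplicating on the index is the safer option.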

<div>
    <p>
        We are now in good shape to define the recommendation function. The next steps are as follows:
    </p>
    <ul>
        <li>
            Get the index of the movie given its title
        </li>
        <li>
            Get the list of cosine similarity scores for that particular movie with all movies. Convert it into a list of tuples where the first element is its position, and the second is the similarity score.</li>
        <li>
            Sort the aforementioned list of tuples based on the similarity scores; that is, the second element.
        </li>
        <li>
            Get the top 10 elements of this list. Ignore the first element as it refers to self (the movie most similar to a particular movie is the movie itself).
        </li>
        <li>
            Return the titles corresponding to the indices of the top elements.
        </li>
    </ul>
</div>
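
The steps above can be sketched on a toy example before applying them to the full dataset. The titles and the hand-made similarity matrix below are hypothetical, and `recommend` and `index_of` are names invented for this illustration.

```python
# Hypothetical titles and a hand-made symmetric similarity matrix
titles = ["A", "B", "C", "D"]
sim = [
    [1.0, 0.2, 0.7, 0.0],
    [0.2, 1.0, 0.1, 0.5],
    [0.7, 0.1, 1.0, 0.3],
    [0.0, 0.5, 0.3, 1.0],
]
index_of = {t: i for i, t in enumerate(titles)}

def recommend(title, top_n=2):
    # Step 1: index of the movie that matches the title
    idx = index_of[title]
    # Step 2: (position, score) tuples for this movie against all movies
    scored = list(enumerate(sim[idx]))
    # Step 3: sort by the similarity score, highest first
    scored.sort(key=lambda pair: pair[1], reverse=True)
    # Step 4: skip the first entry (the movie itself) and keep the next top_n
    top = scored[1:top_n + 1]
    # Step 5: map positions back to titles
    return [titles[i] for i, _ in top]

print(recommend("A"))
```

Skipping the first element works here because each movie has a similarity of 1.0 with itself and no other entry in its row reaches that value.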

In [17]:
# Function that takes a movie title as input and outputs the most similar movies

def get_recommendations(title, cosine_sim=cosine_sim):
        # Get the index of the movie that matches the title
        idx = indices[title]

        # Get the pairwise similarity scores of all movies with that movie
        sim_scores = list(enumerate(cosine_sim[idx]))

        # Sort the movies based on the similarity scores
        sim_scores = sorted(sim_scores, key=lambda x:x[1], reverse=True)

        # Get the scores of the 10 most similar movies
        sim_scores = sim_scores[1:11]

        # Get the movie indices
        movie_indices = [i[0] for i in sim_scores]

        # Return the top 10 most similar movies
        return metadata['title'].iloc[movie_indices]


In [18]:
for i in get_recommendations('Two Bits'):
    print(i)

Gone Baby Gone
Side Street Story
Full Speed
Byelorussian Station
Dean
We're No Angels
Best Boy
Camp
La luna
Merlin's Shop of Mystical Wonders
