## CONTENT-BASED RECOMMENDATION MODEL

In this walkthrough, you will learn how to build a recommendation model that suggests movies similar to a particular movie, based on the content that describes it.

In the dataset provided, this content is in the 'overview' feature, which describes each movie in plain text of about 20 to 30 words.

Here, the main job is to process the overview text of the movies using a 'word vectors' approach, that is, by representing each overview as a numeric vector. Movies whose vectors are close to each other are treated as similar, and this similarity is the basis for recommending movies for a given movie.

In simple words, we compute a similarity matrix, which tells us which movies are most similar to a given input movie title.

In [None]:
# First explore the 'overview' feature of dataset
import pandas as pd

#Load the dataset
md = pd.read_csv("movies_metadata.csv", engine='python')

# Let's have a look at the 'overview' feature
md['overview'].head()

Since the output is textual content, the problem at hand is a Natural Language Processing problem. Hence, you need to extract some kind of features from the text above before you can compute the similarity or dissimilarity between overviews.

We will use a 'word vectors' technique to represent each overview in vectorized form. To do this vectorization, we compute a TF-IDF (Term Frequency-Inverse Document Frequency) vector for each document, where a document is one movie overview. These vectors capture how important each word is to an overview rather than any deeper semantic meaning.
TF-IDF reduces the weight of words that occur frequently across plot overviews and, therefore, their influence on the final similarity score.

scikit-learn provides a built-in TfidfVectorizer class that produces the TF-IDF matrix from the data.
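To make the weighting concrete, here is a minimal sketch on a hypothetical three-overview corpus (the texts and the `toy_` names are invented for illustration and are not part of the dataset). It shows that a term appearing in more documents receives a lower inverse document frequency.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical overviews, used only for illustration
toy_overviews = [
    "a cowboy toy and a space ranger toy",
    "a space crew explores a distant planet",
    "a detective hunts a killer in the city",
]

toy_tfidf = TfidfVectorizer(stop_words="english")
toy_matrix = toy_tfidf.fit_transform(toy_overviews)

# One row per overview, one column per distinct non-stop-word term
print(toy_matrix.shape)

# 'space' occurs in two overviews and 'cowboy' in only one,
# so 'space' gets the lower IDF weight
vocab = toy_tfidf.vocabulary_
idf = toy_tfidf.idf_
print(idf[vocab["space"]], idf[vocab["cowboy"]])
```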

In [None]:
#Import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

#Define a TF-IDF vectorizer object. Remove all English stop words
tfidf = TfidfVectorizer(stop_words='english')

#Replace NaN with an empty string
md['overview'] = md['overview'].fillna('')

#Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(md['overview'])

#Output the shape of tfidf_matrix
tfidf_matrix.shape

From this output, we can observe that the 45,000+ movie overviews in the dataset contain more than 75,000 distinct words.

Next, we need a similarity measure, such as Euclidean distance, Manhattan distance, Pearson correlation, or cosine similarity. Here we use cosine similarity.
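As a rough illustration of how these measures differ, the sketch below evaluates all four on two hypothetical vectors that point in the same direction but differ in length. The vectors are invented for this example. The distance measures report a gap, while cosine similarity and Pearson correlation depend only on direction and linear relationship.

```python
import numpy as np

# Two hypothetical vectors: b is a scaled copy of a
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])

euclidean = np.linalg.norm(a - b)   # straight-line distance
manhattan = np.abs(a - b).sum()     # sum of absolute differences
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
pearson = np.corrcoef(a, b)[0, 1]

print(euclidean, manhattan, cosine, pearson)
```

Because overviews vary in length, a measure that ignores vector magnitude is a natural fit for comparing them.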

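Let's use linear_kernel() instead of cosine_similarity(), since it is faster and gives the same result here: TfidfVectorizer L2-normalizes each row by default, so the dot product of two rows equals their cosine similarity.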
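The equivalence can be checked on a small example. This sketch uses a hypothetical three-document corpus and relies on the default `norm='l2'` setting of TfidfVectorizer; with a different `norm` the two functions would no longer agree.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical documents, used only for illustration
docs = [
    "a space crew explores a distant planet",
    "a crew of toys explores the house",
    "a detective hunts a killer in the city",
]

# Rows are L2-normalized by default (norm='l2')
X = TfidfVectorizer().fit_transform(docs)

dot_products = linear_kernel(X, X)
cosines = cosine_similarity(X, X)

# True when both matrices agree up to floating-point error
same = np.allclose(dot_products, cosines)
print(same)
```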

In [15]:
from sklearn.metrics.pairwise import linear_kernel

# Compute the cosine similarity matrix
cosine_sim_mat = linear_kernel(tfidf_matrix, tfidf_matrix, dense_output=True)
cosine_sim_mat.shape
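One practical caveat: a dense 45,000 x 45,000 matrix of 64-bit floats needs roughly 16 GB of memory. If that does not fit, one possible workaround is to compute a single row of similarities on demand. The sketch below is an assumption-laden illustration on toy data; `similarity_row` is a hypothetical helper, not part of scikit-learn.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Hypothetical overviews, used only for illustration
toy_overviews = [
    "a cowboy toy and a space ranger toy",
    "a space crew explores a distant planet",
    "a detective hunts a killer in the city",
]
toy_matrix = TfidfVectorizer(stop_words="english").fit_transform(toy_overviews)

def similarity_row(matrix, row_idx):
    """Return the similarity of one document to every document."""
    # Slicing with row_idx:row_idx + 1 keeps the query two-dimensional,
    # which is the shape linear_kernel expects
    return linear_kernel(matrix[row_idx:row_idx + 1], matrix).ravel()

row = similarity_row(toy_matrix, 0)
print(row)
```

This trades one large up-front computation for a small one per query, at the cost of recomputing rows that are requested repeatedly.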

Now we define a function that takes a movie title as input and outputs a list of the 10 most similar movies.
For this, you need a mechanism to identify the index of a movie in your metadata DataFrame, given its title.

In [None]:
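indices = pd.Series(md.index, index=md['title'])
# Keep only the first row for each title so that indices[title] returns a single position
indices = indices[~indices.index.duplicated(keep='first')]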

indices[:15]
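Repeated titles need some care in this mapping. The sketch below, on a hypothetical toy DataFrame, shows that `Series.drop_duplicates()` compares values (here the row positions) rather than index labels, so a repeated title is only removed by filtering on the index itself.

```python
import pandas as pd

# Hypothetical data with one repeated title
toy = pd.DataFrame({"title": ["Heat", "Alien", "Heat"]})
lookup = pd.Series(toy.index, index=toy["title"])

# The values 0, 1, 2 are all distinct, so nothing is dropped
# and both 'Heat' entries remain
print(len(lookup.drop_duplicates()))

# Filtering on the index keeps only the first row for each title
unique_lookup = lookup[~lookup.index.duplicated(keep="first")]
print(unique_lookup["Heat"])
```

Keeping the first occurrence is an arbitrary choice. Disambiguating by release year would be another option if the dataset provides one.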

Now, define the recommendation function. It follows these steps:

1. Get the index of the movie given its title.

2. Get the list of cosine similarity scores for that particular movie with all movies. Convert it into a list of tuples where the first element is the movie's position and the second is the similarity score.

3. Sort this list of tuples by similarity score, that is, by the second element.

4. Get the top 10 elements of this list. Ignore the first element, as it refers to the movie itself (the movie most similar to a particular movie is the movie itself).

5. Return the titles corresponding to the indices of the top elements.


In [None]:
def get_recommendations(title, cosine_sim_mat=cosine_sim_mat):
    # Get the index of the movie that matches the title
    idx = indices[title]

    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim_mat[idx]))

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 10 most similar movies
    sim_scores = sim_scores[1:11]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar movies
    return md['title'].iloc[movie_indices]


Now we can call get_recommendations(title) to get the list of recommended movies for a given movie title.

In [None]:
# Get the list of recommendations for the movie "Toy Story"
get_recommendations("Toy Story")
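To try the whole pipeline without the CSV file, here is a self-contained sketch on a hypothetical four-movie dataset. The titles, overviews, and the `recommend` helper are invented for illustration. Unlike the function above, it excludes the queried movie by index instead of assuming it sorts first, which can matter when an overview is empty and every score is zero.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Hypothetical movies, used only for illustration
toy_md = pd.DataFrame({
    "title": ["Space Toys", "Star Crew", "City Detective", "Toy Ranger"],
    "overview": [
        "a cowboy toy and a space ranger toy become friends",
        "a space crew explores a distant planet",
        "a detective hunts a killer in the city",
        "a ranger toy is lost in space",
    ],
})

toy_matrix = TfidfVectorizer(stop_words="english").fit_transform(toy_md["overview"])
toy_sim = linear_kernel(toy_matrix, toy_matrix)
toy_indices = pd.Series(toy_md.index, index=toy_md["title"])

def recommend(title, top_n=2):
    """Return the top_n titles most similar to the given title."""
    idx = toy_indices[title]
    scores = sorted(enumerate(toy_sim[idx]), key=lambda x: x[1], reverse=True)
    # Exclude the queried movie itself, then keep the best top_n
    top = [i for i, _ in scores if i != idx][:top_n]
    return toy_md["title"].iloc[top]

print(recommend("Space Toys").tolist())
```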