This system measures how similar movies are to one another and recommends the movies most similar to a given movie.
Similarity is based on the content of a movie, i.e. its overview/plot, cast, crew, keywords and tagline.

This can be done in two ways.
1. Plot description Based Recommendation.
2. Credits, Keywords Based Recommendation.

Plot description Based Recommendation:
      We compare a movie with every other movie using the plot description given in the 'overview' feature of our dataset.
Each comparison gives a similarity score, and we recommend the movies with the highest scores.

Credits, Keywords Based Recommendation:
      We are going to build a recommender based on the following metadata: the top 3 actors, the director, the related genres
and the movie plot keywords.

Let's take a look at the overviews of the first 10 movies.

In [135]:
import numpy as np
import pandas as pd
from preprocessing import Preprocess
df = Preprocess().df
df['overview'].head(10)

0    In the 22nd century, a paraplegic Marine is di...
1    Captain Barbossa, long believed to be dead, ha...
2    A cryptic message from Bond’s past sends him o...
3    Following the death of District Attorney Harve...
4    John Carter is a war-weary, former military ca...
5    The seemingly invincible Spider-Man goes up ag...
6    When the kingdom's most wanted-and most charmi...
7    When Tony Stark tries to jumpstart a dormant p...
8    As Harry begins his sixth year at Hogwarts, he...
9    Fearing the actions of a god-like Super Hero l...
Name: overview, dtype: object

To compare overviews numerically, we use a text-processing method called Term Frequency-Inverse Document Frequency (TF-IDF).
We need to convert each overview into a vector of TF-IDF weights, one weight per word.

Term frequency is the relative frequency of a word in a document and is given as (term instances/total instances).
Inverse Document Frequency measures how rare a term is across all documents and is given as
      log(number of documents/documents with term).
The overall importance of each word to the document in which it appears is equal to TF * IDF.
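
As a minimal sketch of the two formulas above, the following computes TF, IDF and TF * IDF by hand. The three short overviews are made up for illustration and are not taken from the dataset. Note that scikit-learn's TfidfVectorizer applies a smoothed IDF and normalizes each row by default, so its numbers will differ from this textbook version.

```python
import math

# Toy corpus (hypothetical overviews, not taken from the dataset)
docs = [
    "a marine is sent to a distant moon",
    "a spy receives a cryptic message",
    "a pirate sails to a distant island",
]
tokenized = [d.split() for d in docs]

def tf(term, tokens):
    # Relative frequency of the term within one document
    return tokens.count(term) / len(tokens)

def idf(term, corpus):
    # log(number of documents / number of documents containing the term)
    containing = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / containing)

def tfidf(term, tokens, corpus):
    return tf(term, tokens) * idf(term, corpus)

# "a" appears in every document, "marine" in only one
common = tfidf("a", tokenized[0], tokenized)
rare = tfidf("marine", tokenized[0], tokenized)
print(common, rare)
```

The word that occurs in every overview gets a weight of zero, while the word unique to one overview keeps a positive weight.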

This gives us a matrix where each column represents a word in the overview vocabulary (all the words that appear in at least one document) and each row represents a movie. The IDF weighting reduces the importance of words that occur frequently across plot overviews and, therefore, their influence on the final similarity score.

To compute the TF-IDF vectors we use the TfidfVectorizer class from the scikit-learn module.

In [136]:
from sklearn.feature_extraction.text import TfidfVectorizer

wordVectors = TfidfVectorizer(stop_words='english')

df['overview'] = df['overview'].fillna('')

wordMatrix = wordVectors.fit_transform(df['overview'])

wordMatrix.shape

(4803, 20978)

To measure similarity we use cosine similarity, as it is independent of magnitude and is relatively easy and fast to calculate.

scikit-learn provides the function cosine_similarity() for this.
But here the data is wordMatrix, whose rows are already L2-normalized by TfidfVectorizer. On L2-normalized data the cosine similarity is equal to the plain dot product, which is what linear_kernel() computes.

So we use the linear_kernel() function from sklearn, as it skips the redundant normalization step and is therefore faster.
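
The equivalence can be checked on a small example. This is a pure-Python sketch with two made-up vectors; the helper names dot, l2_normalize and cosine are only for illustration and are not scikit-learn functions.

```python
import math

def dot(u, v):
    # Plain dot product, which is what a linear kernel computes
    return sum(a * b for a, b in zip(u, v))

def l2_normalize(v):
    # Scale the vector so that its Euclidean length is 1
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(u, v):
    # Dot product divided by the product of the vector lengths
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Two arbitrary example vectors
u = [3.0, 0.0, 4.0]
v = [1.0, 2.0, 2.0]

cos_uv = cosine(u, v)
dot_normalized = dot(l2_normalize(u), l2_normalize(v))
print(cos_uv, dot_normalized)
```

Both values agree up to floating-point rounding, so once the rows are normalized the division inside the cosine formula is unnecessary.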

In [137]:
from sklearn.metrics.pairwise import linear_kernel

similarityMatrix = linear_kernel(wordMatrix, wordMatrix)

Now we take the indices and titles from our data frame and construct a series that maps each title to its index, so that we can look up a movie's row from its title.

In [138]:
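# Map each title to its row position; keep the first entry when a title repeats
retIndexSeries = pd.Series(df.index, index=df['title'])
retIndexSeries = retIndexSeries[~retIndexSeries.index.duplicated(keep='first')]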
retIndexSeries.head(5)

title
Avatar                                      0
Pirates of the Caribbean: At World's End    1
Spectre                                     2
The Dark Knight Rises                       3
John Carter                                 4
dtype: int64

Now we have similarity scores based on overview. Based on those scores we need to recommend movies.

To do this we will define a function "recommendMovie()".

This function does the following:
1. Take the title of a movie as a parameter and find its index.
2. Use that index to get the movie's row of similarity scores from similarityMatrix and turn it into a list of tuples, where the first element is the position and the second is the similarity score.
3. Sort the list according to similarity score in descending order.
4. Skip the first entry, as it is the movie itself, and keep the next 10 movies.
5. Return the titles corresponding to those indices.
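
The steps above can be traced on a tiny example. This is only a sketch: the four titles and the 4x4 similarity matrix are invented, and top_similar is a hypothetical stand-in that uses plain lists instead of the data frame.

```python
# Hypothetical titles and a hand-made symmetric similarity matrix
titles = ["Movie A", "Movie B", "Movie C", "Movie D"]
sim = [
    [1.0, 0.2, 0.7, 0.0],
    [0.2, 1.0, 0.1, 0.5],
    [0.7, 0.1, 1.0, 0.3],
    [0.0, 0.5, 0.3, 1.0],
]

def top_similar(title, k=2):
    # Step 1: find the position of the requested title
    idx = titles.index(title)
    # Step 2: pair every position with its similarity score
    scored = list(enumerate(sim[idx]))
    # Step 3: sort by score, highest first
    scored.sort(key=lambda pair: pair[1], reverse=True)
    # Step 4: skip the first entry (the movie itself), keep the next k
    best = scored[1:k + 1]
    # Step 5: translate positions back into titles
    return [titles[i] for i, _ in best]

result_a = top_similar("Movie A")
result_d = top_similar("Movie D")
print(result_a, result_d)
```

Skipping the first entry relies on a movie being most similar to itself, which holds here because the diagonal of the matrix is 1.0.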

In [139]:
def recommendMovie(title, sim_matrix=similarityMatrix):
    # Row position of the requested movie
    index = retIndexSeries[title]

    # Pair every movie position with its similarity to the requested movie
    sim_scores1 = list(enumerate(sim_matrix[index]))

    # Most similar first
    sim_scores2 = sorted(sim_scores1, key=lambda x: x[1], reverse=True)

    # Skip the first entry (the movie itself) and keep the next 10
    sim_scores3 = sim_scores2[1:11]

    movie_indices = [i[0] for i in sim_scores3]

    return df['title'].iloc[movie_indices]

In [140]:
recommendMovie('Batman Begins')

299                              Batman Forever
1359                                     Batman
428                              Batman Returns
3                         The Dark Knight Rises
65                              The Dark Knight
3854    Batman: The Dark Knight Returns, Part 2
423                              Bruce Almighty
225                                 Speed Racer
9            Batman v Superman: Dawn of Justice
68                                     Iron Man
Name: title, dtype: object

In [141]:
recommendMovie('Avatar')

3604                       Apollo 18
2130                    The American
634                       The Matrix
1341            The Inhabited Island
529                 Tears of the Sun
1610                           Hanna
311     The Adventures of Pluto Nash
847                         Semi-Pro
775                        Supernova
2628             Blood and Chocolate
Name: title, dtype: object

Right now, the cast, crew, keywords and genres columns hold "stringified" lists. We need to convert them into a safe and usable structure.
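
As a small illustration of what this conversion does, the sketch below parses one made-up cell value. The string is only assumed to resemble the dataset's format. literal_eval accepts Python literals only, so a string containing a function call is rejected rather than executed.

```python
from ast import literal_eval

# Hypothetical cell value in the assumed "stringified list" format
raw = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

parsed = literal_eval(raw)          # now a real list of dicts
names = [item["name"] for item in parsed]
print(type(parsed).__name__, names)

# Anything that is not a plain literal is refused instead of being run
try:
    literal_eval("__import__('os')")
    rejected = False
except (ValueError, SyntaxError):
    rejected = True
print(rejected)
```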

In [142]:

from ast import literal_eval
features = ['cast', 'crew', 'keywords', 'genres']

for feature in features:
    df[feature] = df[feature].apply(literal_eval)

Now we will extract the required information from each feature.

In [143]:
def retDirector(x):
    if isinstance(x,list):
        for i in x:
            if i['job'] == 'Director':
                return i['name']
    return np.nan

In [144]:

def retList(x):
    if isinstance(x, list):
        names = [i['name'] for i in x]
        if len(names) > 3:
            names = names[:3]
        return names
    return []

In [145]:
df['director'] = df['crew'].apply(retDirector)

features = ['cast', 'keywords', 'genres']
for feature in features:
    df[feature] = df[feature].apply(retList)

In [146]:
df[['title', 'cast', 'director', 'keywords', 'genres']].head(3)

   title                                     cast                                              director        keywords                             genres
0  Avatar                                    [Sam Worthington, Zoe Saldana, Sigourney Weaver]  James Cameron   [culture clash, future, space war]   [Action, Adventure, Fantasy]
1  Pirates of the Caribbean: At World's End  [Johnny Depp, Orlando Bloom, Keira Knightley]     Gore Verbinski  [ocean, drug abuse, exotic island]   [Adventure, Fantasy, Action]
2  Spectre                                   [Daniel Craig, Christoph Waltz, Léa Seydoux]      Sam Mendes      [spy, based on novel, secret agent]  [Action, Adventure, Crime]


The next step is to convert the names and keyword instances into lowercase and strip all the spaces within them.
This is done so that our vectorizer doesn't count the Johnny of "Johnny Depp" and "Johnny Galecki" as the same token.
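
The effect can be seen in a short sketch. Splitting on whitespace is used here as a simplified stand-in for the vectorizer's tokenizer, and the helper clean is hypothetical.

```python
def clean(name):
    # Remove spaces and lowercase, so a full name becomes one token
    return name.replace(" ", "").lower()

cast_a = ["Johnny Depp"]
cast_b = ["Johnny Galecki"]

# Without cleaning, both casts contain the separate token "johnny"
raw_a = set(" ".join(cast_a).lower().split())
raw_b = set(" ".join(cast_b).lower().split())
shared_raw = raw_a & raw_b

# After cleaning, each actor is a single distinct token
clean_a = {clean(n) for n in cast_a}
clean_b = {clean(n) for n in cast_b}
shared_clean = clean_a & clean_b

print(shared_raw, shared_clean)
```

Before cleaning the two casts share a token even though the actors differ; after cleaning they share none.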

In [147]:
def clean_data(x):
    if isinstance(x, list):
        return [str.lower(i.replace(" ", "")) for i in x]
    else:
        #Check if director exists. If not, return empty string
        if isinstance(x, str):
            return str.lower(x.replace(" ", ""))
        else:
            return ''

In [148]:
features = ['cast', 'keywords', 'director', 'genres']

for feature in features:
    df[feature] = df[feature].apply(clean_data)

Now we build the metadata string that we want to feed to our vectorizer, namely keywords, actors, director and genres.

In [149]:
def metadata(x):
    return ' '.join(x['keywords']) + ' ' + ' '.join(x['cast']) + ' ' + x['director'] + ' ' + ' '.join(x['genres'])
df['meta'] = df.apply(metadata, axis=1)

We use the CountVectorizer() instead of TF-IDF. This is because we do not want to down-weight the presence of an actor/director if he or she has acted in or directed relatively more movies.
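
To make the count-based comparison concrete, here is a pure-Python sketch that builds raw count vectors with collections.Counter and compares them with cosine similarity. The three metadata strings are invented examples in the format produced by metadata(), and count_vector and cosine are hypothetical helpers, not scikit-learn functions.

```python
import math
from collections import Counter

def count_vector(text):
    # Raw token counts, with no down-weighting of frequent tokens
    return Counter(text.split())

def cosine(c1, c2):
    # Cosine similarity between two sparse count vectors
    shared = set(c1) & set(c2)
    num = sum(c1[t] * c2[t] for t in shared)
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    return num / (norm1 * norm2) if norm1 and norm2 else 0.0

# Hypothetical metadata strings
soup_a = "dccomics christianbale christophernolan action"
soup_b = "dccomics christianbale christophernolan crime"
soup_c = "spy danielcraig sammendes action"

sim_ab = cosine(count_vector(soup_a), count_vector(soup_b))
sim_ac = cosine(count_vector(soup_a), count_vector(soup_c))
print(sim_ab, sim_ac)
```

Every shared token contributes equally here, so a shared actor or director counts just as much as a shared keyword, no matter how many other movies they appear in.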

In [150]:
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer(stop_words='english')
creditMatrix = count.fit_transform(df['meta'])

In [151]:
from sklearn.metrics.pairwise import cosine_similarity

cosine_sim = cosine_similarity(creditMatrix, creditMatrix)

In [152]:
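df = df.reset_index()
# Rebuild the title-to-index lookup that recommendMovie() reads
retIndexSeries = pd.Series(df.index, index=df['title'])
retIndexSeries = retIndexSeries[~retIndexSeries.index.duplicated(keep='first')]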

In [153]:
recommendMovie('The Dark Knight',cosine_sim)

3          The Dark Knight Rises
119                Batman Begins
4638    Amidst the Devil's Wings
2398                      Hitman
1720                    Kick-Ass
1740                  Kick-Ass 2
3326              Black November
1503                      Takers
1986                      Faster
303                     Catwoman
Name: title, dtype: object