Name: Keara Hayes

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
import numpy as np

Using a CSV of the Netflix catalog, we can use TFIDF, or "term frequency–inverse document frequency," to recommend what to watch next based on what you've just finished. TFIDF weights each word by how often it appears in an entry and how rare it is across the whole catalog, and those weights are used to compute similarity scores between entries. This notebook expands on [this code](https://colab.research.google.com/drive/1NkUvrLdIQ_QoSl4kfP6XQvWC4xsU-2Ic#sandboxMode=true), which is based on a workshop given by Rounak Banik. The example code is simplistic and does not give sophisticated recommendations; the version here takes the ideas and techniques presented there and builds on them.
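As a quick illustration of the idea, here is a minimal sketch on a made-up three-entry library (the descriptions are invented for this example and are not from the Netflix CSV). Entries that share distinctive words get a higher similarity score than entries that share none.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Hypothetical mini-library, invented for illustration only
docs = [
    "A starship crew explores deep space",
    "A documentary about a starship captain in space",
    "Two chefs compete in a baking contest",
]

# Turn each description into a vector of TFIDF weights
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)

# Pairwise similarity between every pair of descriptions
sim = linear_kernel(tfidf, tfidf)

space_vs_space = sim[0, 1]    # both mention "starship" and "space"
space_vs_baking = sim[0, 2]   # no words in common
print(round(space_vs_space, 3), round(space_vs_baking, 3))
```

The first score is positive and the second is zero, since the baking description has no tokens in common with the starship one.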

To start, input the name of the movie/show you've just watched.

In [2]:
# input a movie or TV show title
# this has to be pretty precise to work, as misspellings and capitalization errors will
# cause the input to not be understood

title = 'Star Trek'

Next, the database we've been using needs to be loaded in. That database can be found [here](https://www.kaggle.com/datasets/shivamb/netflix-shows). 

In [3]:
# load in the Netflix library
netflix = pd.read_csv("netflix_titles.csv")

The database then needs some cleaning. Less important columns (the date a title was added to Netflix, its release year, and its duration) are dropped. NaNs are filled with empty strings to prevent errors.

In [4]:
# take out some of the things we care less about
netflix_simple = netflix.drop(['date_added', 'release_year', 'duration'], axis = 1)

# fill in any NaNs with empty strings so the vectorizing doesn't break
netflix_simple.fillna("", inplace=True)

Here we begin pulling our sorting categories. The simplest ones to use with sklearn's TFIDF are the title and description, which are merged together into a single category.

In [5]:
# things that don't need extra manipulation
# (joined with a space so the end of the description can't fuse with the start of the title)
titledesc = netflix_simple['description'] + " " + netflix_simple['title']

Sklearn struggles with some of the categories, though. In the cast column, for example, actors' names are made up of multiple words that need to stay together; if the words are tokenized separately, you lose a lot of information. For example, Gates McFadden is an actress in *Star Trek: The Next Generation*, and Bill Gates appears in the documentary *Inside Bill's Brain: Decoding Bill Gates*. If sklearn is allowed to run unaided, watching *TNG* will recommend the Bill Gates documentary based solely on the fact that *Gates* appears in both of their names.

To solve this, for both the director column and the cast column, all the whitespace is removed so that actors' names are one "word," which means they are tokenized properly.

In [6]:
# things that do need some help, since they contain multiword tokens
# what we're doing here is removing all the whitespace so that when we tokenize, there's 
# no worry about individual words being tokens instead of full names

# strip whitespace column-wise; this avoids chained assignment, which newer pandas may ignore
netflix_simple['cast'] = netflix_simple['cast'].str.replace(" ", "", regex=False)
netflix_simple['director'] = netflix_simple['director'].str.replace(" ", "", regex=False)
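To see why the whitespace removal matters, here is a small sketch using the analyzer that `TfidfVectorizer` builds internally. The cast strings are simplified samples written for this example, and the default unigram setting is used to keep the output short.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# build_analyzer() returns the function the vectorizer uses to turn text into tokens
analyze = TfidfVectorizer(analyzer="word", stop_words="english").build_analyzer()

cast_a = "Patrick Stewart, Gates McFadden"   # sample cast strings for illustration
cast_b = "Bill Gates"

# With spaces, first and last names become separate tokens
shared_with_spaces = set(analyze(cast_a)) & set(analyze(cast_b))

# Without spaces, each full name is a single token
joined_a = set(analyze(cast_a.replace(" ", "")))
joined_b = set(analyze(cast_b.replace(" ", "")))
shared_without_spaces = joined_a & joined_b

print(shared_with_spaces)      # {'gates'}
print(shared_without_spaces)   # set()
```

Commas still separate the names after the spaces are removed, so each actor ends up as exactly one token.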

In [7]:
# like before, assign the columns to variables so things are cleaner
cast = netflix_simple['cast']
director = netflix_simple['director']

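Then the columns need to be "vectorized." Each entry becomes a vector in which every token (for example, the word "star") is its own direction, and the magnitude along that direction is the token's TFIDF weight, which grows with how often the token appears in the entry and shrinks the more common it is across the catalog.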

In [8]:
# Create tfidf vectorizer (min_df=1 keeps every term; recent scikit-learn versions may reject min_df=0):
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, stop_words='english')

# Use vectorizer to create tfidf matrix.
tfidf_titledesc = tf.fit_transform(titledesc)
tfidf_cast = tf.fit_transform(cast)
tfidf_director = tf.fit_transform(director)
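To make the vectors less abstract, the following sketch fits a vectorizer on two toy titles (chosen for this example only) and inspects the result. With `ngram_range=(1, 2)`, both single words and adjacent word pairs become tokens.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["star trek", "star wars"]  # toy titles for illustration
vec = TfidfVectorizer(analyzer="word", ngram_range=(1, 2))
matrix = vec.fit_transform(docs)

# vocabulary_ maps each token (single word or word pair) to its column in the matrix
vocab = vec.vocabulary_
print(sorted(vocab))   # ['star', 'star trek', 'star wars', 'trek', 'wars']
print(matrix.shape)    # (2, 5): one row per title, one column per token

# "star" appears in both titles, so it is weighted lower than "trek" in the first row
weight_star = matrix[0, vocab["star"]]
weight_trek = matrix[0, vocab["trek"]]
print(weight_star < weight_trek)   # True
```

Note that every call to `fit_transform` refits the vectorizer, so after fitting one object on several columns in turn, its `vocabulary_` only reflects the last column it saw.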

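Based on those vectors, titles are given similarity scores relative to each other. Each result is a square matrix in which entry (i, j) shows how similar titles i and j are in one category. Because `TfidfVectorizer` scales every row to unit length by default, the dot product computed by `linear_kernel` is the same as the cosine similarity.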

In [9]:
# Compute the similarity matrix so we know how similar the given movie is to the others available
sim_titledesc = linear_kernel(tfidf_titledesc, tfidf_titledesc)
sim_cast = linear_kernel(tfidf_cast, tfidf_cast)
sim_director = linear_kernel(tfidf_director, tfidf_director)
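As a sanity check on what these matrices contain, this sketch compares `linear_kernel` with `cosine_similarity` on toy descriptions (invented for this example). With the vectorizer's default normalization, the two should agree, and every non-empty entry should score 1 against itself.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Toy descriptions, invented for illustration only
docs = [
    "starship crew explores space",
    "documentary about a starship captain",
    "chefs compete in a baking contest",
]

tfidf = TfidfVectorizer().fit_transform(docs)

dot = linear_kernel(tfidf, tfidf)        # plain dot products
cos = cosine_similarity(tfidf, tfidf)    # dot products after explicit normalization

same = np.allclose(dot, cos)
self_scores = np.diag(dot)               # similarity of each entry with itself
print(same)
print(self_scores)
```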

We'll set aside what categories we intend to use.

In [10]:
# all the things we'll be using for the ranking
metrics = [sim_titledesc, sim_cast, sim_director]

Here, we reset the index and build a lookup from each title to its row number to make searching easier.

In [11]:
netflix_simple.reset_index(inplace=True)

# reindex according to title so that the indexing later is easier
indices = pd.Series(netflix_simple.index, index=netflix_simple['title'])
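Since the lookup needs an exact title, a more forgiving search can help. The sketch below is one possible approach, not part of the original workflow: `find_title` is a hypothetical helper that uses the standard library's `difflib` to pick the closest known title, ignoring capitalization. The three titles are a toy stand-in for the real column.

```python
import difflib
import pandas as pd

# Toy stand-in for netflix_simple['title']
titles = pd.Series(["Star Trek", "Stuart Little", "The Evil Dead"])
indices = pd.Series(titles.index, index=titles)

def find_title(query, known_titles):
    """Hypothetical helper: return the closest known title, or None if nothing is close."""
    lowered = {t.lower(): t for t in known_titles}
    match = difflib.get_close_matches(query.lower(), list(lowered), n=1, cutoff=0.6)
    return lowered[match[0]] if match else None

best = find_title("star trk", indices.index)
row_number = int(indices[best])
print(best, row_number)   # Star Trek 0
```

One more thing to watch for: if a title appears more than once in the CSV, `indices[title]` returns several row numbers rather than a single integer, so the result may need to be narrowed down before it is used.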

And here is the real meat of the recommendations. In short, each category (title + description, cast, and director) is used to create a list of contenders, along with their similarity scores. These similarity scores are added up for each contender, and the movie/show with the highest overall score is recommended.

In [47]:
# this loop will take all the rating categories and combine them into a final 
# recommendation list

# this is the title you input at the top of the notebook
# it is used on the reindexed database to pull the information for that entry
index = indices[title]

# an empty array to append to in the loop
contenders = np.zeros((9,2))

for i in metrics:
    
    # for a given category, find the input movie
    row = i[index]
    
    # pull the similarity scores for that movie
    sim_scores = list(enumerate(row))
    
    # sort the scores, highest score first
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    
    # keep the top 10 entries; entry 0 is excluded because that will nearly always
    # be the input title
    closest_matches = sim_scores[1:11]
    
    # get the numerical indices for the movies
    movie_indices = [i[0] for i in closest_matches]
    
    # turn the closest matches into an array
    contenders2 = np.array(closest_matches)
    
    # concatenate to the blank array so we can compile the similarity scores later
    contenders = np.concatenate((contenders, contenders2))
    
# exclude the 9 zero rows from when we made the "empty" array earlier;
# keeping them will break the recommendations, and dropping more would lose a real match
contenders = contenders[9:,:]

# these are the row indices of the films/shows; column 1, excluded here, holds the similarity scores
net = contenders[:,0]

finalists = np.zeros((len(contenders), 2))

# a loop to skim through the contenders and sum their overall similarity scores
for i in range(len(net)):
    ind = np.where(contenders[:,0] == net[i])
    finalists[i,0] = net[i]
    finalists[i,1] = np.sum(contenders[ind,1])

finalists = np.unique(finalists, axis = 0)

ind = finalists[np.where(finalists[:,1] == max(finalists[:,1]))]

ind = int(ind[0,0])

print(netflix_simple.iloc[ind]['title'])

For the Love of Spock
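The summing step can also be sketched with a dictionary instead of arrays. This is only an illustration of the idea: the (row index, similarity) pairs below are made-up numbers standing in for the top matches from each category.

```python
from collections import defaultdict

# Toy (row index, similarity) pairs, standing in for the top matches of each category
title_matches = [(3, 0.40), (7, 0.25)]
cast_matches = [(7, 0.30), (5, 0.10)]
director_matches = [(7, 0.20), (3, 0.05)]

# Add up every score a row index received across the categories
totals = defaultdict(float)
for matches in (title_matches, cast_matches, director_matches):
    for row_index, score in matches:
        totals[row_index] += score

# The recommendation is the row index with the largest combined score
best = max(totals, key=totals.get)
print(best, round(totals[best], 2))   # 7 0.75
```

Row 7 is never the top match in the title category, but it wins overall because it appears in all three lists.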


Finally, we can jam all of this into a function:

In [13]:
def movie_rec(title):
    
    '''Take the name of a film/TV show as a string and return a single
    recommended title using TFIDF.'''
    
    # load in the Netflix library
    netflix = pd.read_csv("netflix_titles.csv")
    
    # take out some of the things we care less about
    netflix_simple = netflix.drop(['date_added', 'release_year', 'duration'], axis = 1)

    # fill in any NaNs with empty strings so the vectorizing doesn't break
    netflix_simple.fillna("", inplace=True)
    
    # things that don't need extra manipulation
    # (joined with a space so the end of the description can't fuse with the start of the title)
    titledesc = netflix_simple['description'] + " " + netflix_simple['title']
    
    # things that do need some help, since they contain multiword tokens
    # what we're doing here is removing all the whitespace so that when we tokenize, there's 
    # no worry about individual words being tokens instead of full names

    # strip whitespace column-wise; this avoids chained assignment, which newer pandas may ignore
    netflix_simple['cast'] = netflix_simple['cast'].str.replace(" ", "", regex=False)
    netflix_simple['director'] = netflix_simple['director'].str.replace(" ", "", regex=False)
        
    # like before, assign the columns to variables so things are cleaner
    cast = netflix_simple['cast']
    director = netflix_simple['director']
    
    # Create tfidf vectorizer (min_df=1 keeps every term; recent scikit-learn versions may reject min_df=0):
    tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, stop_words='english')

    # Use vectorizer to create tfidf matrix.
    tfidf_titledesc = tf.fit_transform(titledesc)
    tfidf_cast = tf.fit_transform(cast)
    tfidf_director = tf.fit_transform(director)
    
    # Compute the similarity matrix so we know how similar the given movie is to the others available
    sim_titledesc = linear_kernel(tfidf_titledesc, tfidf_titledesc)
    sim_cast = linear_kernel(tfidf_cast, tfidf_cast)
    sim_director = linear_kernel(tfidf_director, tfidf_director)
    
    # all the things we'll be using for the ranking
    metrics = [sim_titledesc, sim_cast, sim_director]
    
    netflix_simple.reset_index(inplace=True)

    # reindex according to title so that the indexing later is easier
    indices = pd.Series(netflix_simple.index, index=netflix_simple['title'])
    
    # this loop will take all the rating categories and combine them into a final 
    # recommendation list

    # this is the title passed in as the function argument
    # it is used on the reindexed database to pull the row number for that entry
    index = indices[title]

    # an empty array to append to in the loop
    contenders = np.zeros((9,2))

    for i in metrics:

        # for a given category, find the input movie
        row = i[index]

        # pull the similarity scores for that movie
        sim_scores = list(enumerate(row))

        # sort the scores, highest score first
        sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

        # keep the top 10 entries; entry 0 is excluded because that will almost always
        # be the input title
        closest_matches = sim_scores[1:11]

        # get the numerical indices for the movies
        movie_indices = [i[0] for i in closest_matches]

        # turn the closest matches into an array
        contenders2 = np.array(closest_matches)

        # concatenate to the blank array so we can compile the similarity scores later
        contenders = np.concatenate((contenders, contenders2))

    # exclude the 9 zero rows from when we made the "empty" array earlier;
    # keeping them will break the recommendations, and dropping more would lose a real match
    contenders = contenders[9:,:]

    # this is just the indices of the movies
    net = contenders[:,0]

    finalists = np.zeros((len(contenders), 2))

    # a loop to skim through the contenders and sum their overall similarity scores
    for i in range(len(net)):
        ind = np.where(contenders[:,0] == net[i])
        finalists[i,0] = net[i]  #index of the movie
        finalists[i,1] = np.sum(contenders[ind,1])   #movie's similarity score

    finalists = np.unique(finalists, axis = 0)

    ind = finalists[np.where(finalists[:,1] == max(finalists[:,1]))]

    ind = int(ind[0,0])

    rec = netflix_simple.iloc[ind]['title']
    
    return rec

Some demonstrations:

In [14]:
rec = movie_rec('Stuart Little')
print(rec)

# Stuart Little 2 is the sequel to Stuart Little, so this is reasonable

Stuart Little 2


In [15]:
rec = movie_rec('Star Trek')
print(rec)

# 'For the Love of Spock' is a documentary about Star Trek and Leonard Nimoy, so this is
# a reasonable recommendation if you've just finished the JJ Abrams Star Trek reboot movie

For the Love of Spock


In [39]:
rec = movie_rec('Star Trek: The Next Generation')
print(rec)

# While Deep Space Nine is probably a better recommendation, getting in the right franchise
# is a good sign, and it picked up on which starship to focus on

Star Trek: Enterprise


In [38]:
rec = movie_rec('Ash vs. Evil Dead')
print(rec)

# 'Ash vs. Evil Dead' is a spinoff of 'The Evil Dead' movie franchise, so this is
# definitely a good recommendation.

The Evil Dead


Some difficulties/limitations:

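* The code assumes the input title is always its own top match and drops entry 0 of each sorted list. That is not guaranteed: if a field is empty (for example, no listed director), every similarity score in that category is 0, so the input title isn't excluded from the recommendation list and may accidentally be recommended to the user.
* Sometimes the descriptions in the CSV used for this aren't very good, leading to erroneous recommendations.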
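
The first limitation can be sidestepped by excluding the input title by its row index rather than by its position in the sorted list. Below is a minimal sketch of that idea, assuming a hypothetical helper `top_matches` that is not part of the code above. The all-zero row mimics a category where the input title has an empty field.

```python
import numpy as np

def top_matches(row, index, n=10):
    """Hypothetical helper: top-n (row index, score) pairs, never including `index`."""
    scores = np.asarray(row, dtype=float).copy()
    scores[index] = -np.inf   # force the input title to sort last
    order = np.argsort(scores, kind="stable")[::-1][:n]
    # drop the input title if it was only picked up because there were fewer than n others
    return [(int(j), float(scores[j])) for j in order if np.isfinite(scores[j])]

# An all-zero similarity row for the input title at row 2
row = np.zeros(4)
index = 2

# Position-based exclusion drops whatever happens to sort first (row 0 here)
by_position = sorted(enumerate(row), key=lambda x: x[1], reverse=True)[1:]
# Index-based exclusion always drops the input title itself
by_index = top_matches(row, index, n=3)

print([j for j, _ in by_position])   # [1, 2, 3]
print([j for j, _ in by_index])      # [3, 1, 0]
```

With the position-based slice, the input title (row 2) is still among the contenders; with the index-based version it never is.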