Name: Keara Hayes

# Due 4/26/2022

In [77]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
import numpy as np

Using a Netflix database CSV, we can use TFIDF, or "term frequency–inverse document frequency," to recommend what to watch next based on what you've just finished. TFIDF weights each word by how often it appears in one entry and how rare it is across all entries; comparing those weights gives similarity scores between titles. This code expands and improves [this code](https://colab.research.google.com/drive/1NkUvrLdIQ_QoSl4kfP6XQvWC4xsU-2Ic#sandboxMode=true), which is based on a workshop given by Rounak Banik. The example code (found by Sania) is simplistic and does not give sophisticated recommendations, so this version takes the ideas and techniques presented there and builds on them.
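To make the idea concrete, here is a minimal hand-rolled sketch of the weighting on three made-up descriptions. It uses one common textbook form (word count times log(N / document frequency)) rather than scikit-learn's exact smoothed formula, and the toy corpus and helper names (`tfidf`, `cosine`) are assumptions for illustration only.

```python
import math
from collections import Counter

# Toy corpus (made up for illustration; not taken from the Netflix data)
docs = {
    "A": "starship crew explores deep space",
    "B": "starship captain leads crew into space battle",
    "C": "baking contest crowns the best cake",
}

tokens = {name: text.split() for name, text in docs.items()}
n_docs = len(tokens)

# document frequency: in how many descriptions does each word appear?
df = Counter(word for words in tokens.values() for word in set(words))

def tfidf(words):
    """Weight each word by its count times log(N / document frequency)."""
    counts = Counter(words)
    return {w: c * math.log(n_docs / df[w]) for w, c in counts.items()}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

vectors = {name: tfidf(words) for name, words in tokens.items()}
sim_ab = cosine(vectors["A"], vectors["B"])  # share "starship", "crew", "space"
sim_ac = cosine(vectors["A"], vectors["C"])  # share no words
print(sim_ab, sim_ac)
```

The two space-themed descriptions get a positive score, while the baking description shares no words with them and scores zero.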

To start, input the name of the movie/show you've just watched.

In [115]:
# input a movie or TV show title

title = 'Star Trek'

Next, the database we've been using needs to be loaded in. That database can be found [here](https://www.kaggle.com/datasets/shivamb/netflix-shows). 

In [116]:
# load in the Netflix library
netflix = pd.read_csv("netflix_titles.csv")

Next, the database needs some cleaning. Less important columns, such as the date a title was added to Netflix, its release year, and its duration, are dropped. NaNs are filled with empty strings to prevent errors.

In [117]:
# take out some of the things we care less about
netflix_simple = netflix.drop(['date_added', 'release_year', 'duration'], axis = 1)

# fill in any NaNs with empty strings so the vectorizing doesn't break
netflix_simple.fillna("", inplace=True)

Here we begin pulling our sorting categories. The simplest ones to use with sklearn's TFIDF are the title and description, which are merged together into a single category.

In [125]:
# things that don't need extra manipulation
# join with a space so the last word of the description doesn't fuse with the title
titledesc = netflix_simple['description'] + " " + netflix_simple['title']

Sklearn struggles with some of the categories, though. In the case of the cast, for example, actors' names are made up of multiple words that need to stay together. If those words are tokenized separately, you lose a lot of information. For example, Gates McFadden is an actress in *Star Trek: The Next Generation*, and Bill Gates appears in the documentary *Inside Bill's Brain: Decoding Bill Gates*. If sklearn is allowed to run unaided, watching *TNG* will recommend the Bill Gates documentary based solely on the fact that *Gates* appears in both of their names.

To solve this, for both the director column and the cast column, all the whitespace is removed so that actors' names are one "word," which means they are tokenized properly.
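The effect can be sketched with the regular expression that scikit-learn's vectorizers use as their default token pattern. The cast strings below are shortened for illustration, and the `tokenize` helper is a hypothetical stand-in that ignores stop words and two-word n-grams.

```python
import re

# scikit-learn's default token pattern: runs of two or more word characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and return its set of word tokens."""
    return set(TOKEN_PATTERN.findall(text.lower()))

# shortened cast lists, for illustration only
tng_cast = "Patrick Stewart, Gates McFadden"
documentary_cast = "Bill Gates"

# with spaces, the two cast lists share the token "gates"
shared_before = tokenize(tng_cast) & tokenize(documentary_cast)

# with whitespace removed, each full name becomes a single token
shared_after = tokenize(tng_cast.replace(" ", "")) & tokenize(
    documentary_cast.replace(" ", "")
)

print(shared_before)
print(shared_after)
```

Once the spaces are gone, the only way two cast lists can overlap is by sharing a complete name.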

In [118]:
# things that do need some help, since they contain multiword tokens
# what we're doing here is removing all the whitespace so that when we tokenize, there's 
# no worry about individual words being tokens instead of full names

# do the replacement on the whole column at once; assigning through .iloc[i] on a
# selected column is slow and may not write back to the dataframe
netflix_simple['cast'] = netflix_simple['cast'].str.replace(" ", "", regex=False)
netflix_simple['director'] = netflix_simple['director'].str.replace(" ", "", regex=False)

In [119]:
# like before, assign the columns to variables so things are cleaner
cast = netflix_simple['cast']
director = netflix_simple['director']

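Then the columns need to be "vectorized." Each entry becomes a vector in which every token (for example, the word "star") is its own direction, and the magnitude along that direction is the token's TFIDF weight: how often it appears in the entry, scaled down for tokens that are common across the whole library.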

In [120]:
# Create tfidf vectorizer:
# min_df=1 (the default) keeps every term that appears in at least one entry
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, stop_words='english')

# Use vectorizer to create tfidf matrix.
tfidf_titledesc = tf.fit_transform(titledesc)
tfidf_cast = tf.fit_transform(cast)
tfidf_director = tf.fit_transform(director)

Based on those vectors, titles are given similarity scores relative to each other. These are matrices that show how similar entries are based on the different categories.
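By default, `TfidfVectorizer` scales every row to unit length (`norm='l2'`), so the plain dot products that `linear_kernel` computes work out to cosine similarities. The sketch below shows that relationship with NumPy alone; the weight matrix is made up for illustration and is not real TFIDF output.

```python
import numpy as np

# made-up TFIDF-style weights: 3 titles (rows) x 4 tokens (columns)
weights = np.array([
    [1.0, 2.0, 0.0, 0.0],
    [2.0, 4.0, 0.0, 0.0],   # same direction as row 0, twice as long
    [0.0, 0.0, 3.0, 1.0],   # shares no tokens with rows 0 and 1
])

# scale every row to unit length (L2 normalization)
unit = weights / np.linalg.norm(weights, axis=1, keepdims=True)

# dot products between unit-length rows are cosine similarities
similarity = unit @ unit.T
print(np.round(similarity, 3))
```

Rows that point the same way score 1 regardless of length, rows with no tokens in common score 0, and every title scores 1 against itself, which is why the input title normally sorts to the top of its own row.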

In [121]:
# Compute the similarity matrix so we know how similar the given movie is to the others available
sim_titledesc = linear_kernel(tfidf_titledesc, tfidf_titledesc)
sim_cast = linear_kernel(tfidf_cast, tfidf_cast)
sim_director = linear_kernel(tfidf_director, tfidf_director)

We'll set aside what categories we intend to use.

In [122]:
# all the things we'll be using for the ranking
metrics = [sim_titledesc, sim_cast, sim_director]

Here, we reset the index and build a lookup from each title to its row number, which makes searching easier.

In [123]:
netflix_simple.reset_index(inplace=True)

# reindex according to title so that the indexing later is easier
indices = pd.Series(netflix_simple.index, index=netflix_simple['title'])
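One thing to watch with a title-indexed Series: looking up a title that appears more than once returns a Series of row numbers rather than a single number, and a missing title raises a `KeyError`. The sketch below uses a made-up three-row list of titles and a hypothetical helper, `first_row_for`, that always returns a single row number; whether the real library contains repeated titles is not checked here.

```python
import pandas as pd

# made-up titles with one deliberate repeat, for illustration only
titles = pd.Series(["Star Trek", "Porto", "Star Trek"])
lookup = pd.Series(titles.index, index=titles)

def first_row_for(title):
    """Return the first row number for a title (hypothetical helper)."""
    if title not in lookup.index:
        raise KeyError(f"{title!r} is not in the library")
    match = lookup[title]
    # a repeated title comes back as a Series; a unique one as a single number
    if isinstance(match, pd.Series):
        return int(match.iloc[0])
    return int(match)

print(first_row_for("Star Trek"), first_row_for("Porto"))
```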

And here is the real meat of the recommendations. In short, each category (title + description, cast, and director) produces a list of its ten closest contenders, along with their similarity scores. The scores are added up for every title that appears in these lists, and the movie/show with the highest overall score is recommended.
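The combining step can be sketched on its own with a dictionary of running totals. This is an alternative illustration of the same idea, not the notebook's NumPy implementation, and the row numbers and scores are hypothetical.

```python
from collections import defaultdict

# hypothetical (row number, similarity score) pairs from each category
top_by_category = {
    "title + description": [(12, 0.25), (40, 0.125)],
    "cast": [(40, 0.5), (7, 0.125)],
    "director": [(7, 0.25)],
}

# add up the scores of every title that appears in any category
totals = defaultdict(float)
for matches in top_by_category.values():
    for row_number, score in matches:
        totals[row_number] += score

# the title with the highest combined score is the recommendation
best_row = max(totals, key=totals.get)
print(best_row, totals[best_row])
```

A title that shows up in more than one category (row 40 here) collects a score from each, which is how it can beat a title that ranks first in only one category.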

In [126]:
# this loop will take all the rating categories and combine them into a final 
# recommendation list

# this is the title you input at the top of the notebook
# it is used on the reindexed database to pull the information for that entry
index = indices[title]

# an empty array to append to in the loop
contenders = np.zeros((9,2))

for i in metrics:
    
    # for a given category, find the input movie
    row = i[index]
    
    # pull the similarity scores for that movie
    sim_scores = list(enumerate(row))
    
    # sort the scores, highest score first
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    
    # keep the top 10 entries; entry 0 is excluded because that will always
    # be the input title
    closest_matches = sim_scores[1:11]
    
    # get the numerical indices for the movies
    movie_indices = [match[0] for match in closest_matches]
    
    # turn the closest matches into an array
    contenders2 = np.array(closest_matches)
    
    # concatenate to the blank array so we can compile the similarity scores later
    contenders = np.concatenate((contenders, contenders2))
    
# exclude the 0 entries from when we made the "empty" array earlier;
# keeping them will break the recommendations
contenders = contenders[9:,:]
    

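# the row index of every contender (repeats mean a title matched in several categories)
net = contenders[:,0]

# one row per contender: [row index, combined similarity score]
finalists = np.zeros((len(contenders), 2))

for i in range(len(net)):
    # find every appearance of this title and add up its scores
    ind = np.where(contenders[:,0] == net[i])
    finalists[i,0] = net[i]
    finalists[i,1] = np.sum(contenders[ind,1])

# titles that appeared more than once now have identical rows; keep one of each
finalists = np.unique(finalists, axis = 0)

# pick the row(s) with the highest combined score
ind = finalists[np.where(finalists[:,1] == max(finalists[:,1]))]

# take the first of those and convert its row index to an integer
ind = int(ind[0,0])

# print the recommended title
print(netflix_simple.iloc[ind]['title'])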

# print(net)

[5244. 5650. 5693. 4946.  592. 5245. 6902. 2005. 4060. 3994. 5693. 6749.
 3517. 7757. 8354. 5252. 3100. 5848. 6017. 7891.    0.    1.    2.    3.
    4.    5.    6.    7.    8.    9.]


# Under here is testing code that can be deleted once everything else is finalized

In [69]:
# Get the movie's index based on its title:
index = indices["Star Trek"]

# Use that index to get the similarities matrix row
# that gives a similarity score for this movie
# compared to each other movie:
row = sim_cast[index]

In [70]:
# Convert that row to a list of (movie_row, similarity_score) pairs:
sim_scores = list(enumerate(row))

In [71]:
# Sort the (movie_row, similarity_score) pairs by similarity score:
sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

In [72]:
# Get the top 9 (movie_row, similarity_score) pairs, skipping entry 0 (the input title):
closest_matches = sim_scores[1:10]

In [76]:
np.set_printoptions(suppress=True)

movie_indices = [i[0] for i in closest_matches]

contenders = np.array(closest_matches)

contenders.shape

(9, 2)

In [74]:
netflix_simple.iloc[movie_indices]

| | index | show_id | type | title | director | cast | country | rating | listed_in | description |
|---|---|---|---|---|---|---|---|---|---|---|
| 5693 | 5693 | s5694 | Movie | For the Love of Spock | AdamNimoy | LeonardNimoy,WilliamShatner,GeorgeTakei,Nichel... | Canada, United States | TV-14 | Documentaries | The son of actor Leonard Nimoy directs this mo... |
| 6749 | 6749 | s6750 | Movie | Figures of Speech | AriLevinson | ChrisPine | United States | TV-PG | Documentaries | In this documentary, passionate high schoolers... |
| 3517 | 3517 | s3518 | Movie | The Crystal Calls Making the Dark Crystal: Age... | RandallLobb | TaronEgerton,NatalieDormer,SimonPegg,JasonIsaacs | United States | TV-PG | Documentaries, International Movies | Go behind the scenes with stars, puppeteers an... |
| 7757 | 7757 | s7758 | Movie | Porto | GabeKlinger | AntonYelchin,LucieLucas,FrançoiseLebrun,PauloC... | Portugal, France, Poland, United States | TV-MA | Dramas, Independent Movies, Romantic Movies | In a coastal Portuguese city, an erotic encoun... |
| 8354 | 8354 | s8355 | Movie | The Hurricane Heist | RobCohen | TobyKebbell,MaggieGrace,RyanKwanten,RalphIneso... | United Kingdom, United States | PG-13 | Action & Adventure | A deadly hurricane with mile-high waves provid... |
| 5252 | 5252 | s5253 | Movie | Gerald's Game | MikeFlanagan | CarlaGugino,BruceGreenwood,HenryThomas,CarelSt... | United States | TV-MA | Horror Movies, Thrillers | When her husband's sex game goes wrong, Jessie... |
| 3100 | 3100 | s3101 | Movie | Jarhead: Law of Return | DonMichaelPaul | AmauryNolasco,DevonSawa,RobertPatrick,JeffPier... | Israel, United States | R | Action & Adventure, Dramas | When a U.S. senator’s son is held captive by H... |
| 5848 | 5848 | s5849 | Movie | Special Correspondents | RickyGervais | EricBana,RickyGervais,VeraFarmiga,KellyMacdona... | Canada, United Kingdom, United States | TV-MA | Comedies | When they lose their passports, a bickering ra... |
| 6017 | 6017 | s6018 | Movie | 5 to 7 | VictorLevin | AntonYelchin,BéréniceMarlohe,OliviaThirlby,Lam... | United States | R | Comedies, Dramas, Romantic Movies | A young novelist's life is turned upside down ... |


In [48]:
row2 = sim_titledesc[index]

sim_scores = list(enumerate(row2))

sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

closest_matches2 = sim_scores[1:10]
                             
movie_indices = [i[0] for i in closest_matches2]

contenders2 = np.array(closest_matches2)
                             
netflix_simple.iloc[movie_indices]

| | level_0 | index | show_id | type | title | director | cast | country | rating | listed_in | description |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 5244 | 5244 | 5244 | s5245 | TV Show | Star Trek: Enterprise | | ScottBakula,JohnBillingsley,JoleneBlalock,Domi... | United States | TV-14 | Classic & Cult TV, TV Action & Adventure, TV S... | Capt. Archer and his crew explore space and di... |
| 5650 | 5650 | 5650 | s5651 | TV Show | Star Trek: Deep Space Nine | | AveryBrooks,NanaVisitor,ReneAuberjonois,Cirroc... | United States | TV-14 | TV Action & Adventure, TV Sci-Fi & Fantasy | In this "Star Trek" spin-off, Commander Sisko ... |
| 5693 | 5693 | 5693 | s5694 | Movie | For the Love of Spock | Adam Nimoy | LeonardNimoy,WilliamShatner,GeorgeTakei,Nichel... | Canada, United States | TV-14 | Documentaries | The son of actor Leonard Nimoy directs this mo... |
| 4946 | 4946 | 4946 | s4947 | TV Show | Star Trek: The Next Generation | | PatrickStewart,JonathanFrakes,LeVarBurton,Mich... | United States | TV-PG | TV Action & Adventure, TV Sci-Fi & Fantasy | Decades after the adventures of the original E... |
| 592 | 592 | 592 | s593 | Movie | She's Out of My League | Jim Field Smith | JayBaruchel,AliceEve,T.J.Miller,MikeVogel,Nate... | United States | R | Comedies, Romantic Movies | Kirk's a 5. His new girlfriend Molly's a 10. N... |
| 5245 | 5245 | 5245 | s5246 | TV Show | Star Trek: Voyager | | KateMulgrew,RobertBeltran,RoxannDawson,Jennife... | United States | TV-PG | TV Action & Adventure, TV Sci-Fi & Fantasy | On Voyager's 75-year journey back to Earth, th... |
| 6902 | 6902 | 6902 | s6903 | Movie | Guy Martin: Last Flight of the Vulcan Bomber | James Woodroffe | GuyMartin,KevinStone,ShaunDooley | United Kingdom | TV-G | Documentaries, International Movies | Guy Martin assists in preparing the last airwo... |
| 2005 | 2005 | 2005 | s2006 | Movie | Lara and the Beat | Tosin Coker | SeyiShay,SomkeleIyamah,Vector,ChiomaChukwukaAk... | Nigeria | TV-MA | Dramas, International Movies, Music & Musicals | When their glamorous, fast-paced lifestyle com... |
| 4060 | 4060 | 4060 | s4061 | Movie | Radio Rebel | Peter Howitt | DebbyRyan,SarenaParmar,AdamDiMarco,MerrittPatt... | United States | TV-G | Children & Family Movies, Comedies | Shy student Tara has a secret identity: She is... |


In [49]:
contenders

array([[5693.        ,    0.22678751],
       [6749.        ,    0.18664712],
       [3517.        ,    0.0656334 ],
       [7757.        ,    0.05603469],
       [8354.        ,    0.05340816],
       [5252.        ,    0.05125346],
       [3100.        ,    0.04897567],
       [5848.        ,    0.04729247],
       [6017.        ,    0.04690411]])

In [105]:
contenders2

array([[5244.        ,    0.20736046],
       [5650.        ,    0.11901829],
       [ 594.        ,    0.10358936],
       [5245.        ,    0.08648629],
       [8586.        ,    0.065304  ],
       [2005.        ,    0.05835767],
       [ 956.        ,    0.05614262],
       [5693.        ,    0.05493923],
       [4652.        ,    0.05028207]])

In [123]:
test = np.concatenate((contenders, contenders2))

test

array([[ 908.        ,    1.        ],
       [1779.        ,    1.        ],
       [2405.        ,    1.        ],
       [2470.        ,    1.        ],
       [3615.        ,    1.        ],
       [4946.        ,    1.        ],
       [5245.        ,    1.        ],
       [5650.        ,    1.        ],
       [3674.        ,    0.92786103],
       [5244.        ,    0.20736046],
       [5650.        ,    0.11901829],
       [ 594.        ,    0.10358936],
       [5245.        ,    0.08648629],
       [8586.        ,    0.065304  ],
       [2005.        ,    0.05835767],
       [ 956.        ,    0.05614262],
       [5693.        ,    0.05493923],
       [4652.        ,    0.05028207]])

In [163]:
net = test[:,0]

finalists = np.zeros((len(test), 2))

for i in range(len(net)):
    ind = np.where(test[:,0] == net[i])
    finalists[i,0] = net[i]
    finalists[i,1] = np.sum(test[ind,1])
    
finalists = np.unique(finalists, axis = 0)

In [164]:
finalists

array([[ 594.        ,    0.10358936],
       [ 908.        ,    1.        ],
       [ 956.        ,    0.05614262],
       [1779.        ,    1.        ],
       [2005.        ,    0.05835767],
       [2405.        ,    1.        ],
       [2470.        ,    1.        ],
       [3615.        ,    1.        ],
       [3674.        ,    0.92786103],
       [4652.        ,    0.05028207],
       [4946.        ,    1.        ],
       [5244.        ,    0.20736046],
       [5245.        ,    1.08648629],
       [5650.        ,    1.11901829],
       [5693.        ,    0.05493923],
       [8586.        ,    0.065304  ]])

In [165]:
print(np.where(finalists[:,1] == max(finalists[:,1])))

(array([13], dtype=int64),)


In [184]:
ind = finalists[np.where(finalists[:,1] == max(finalists[:,1]))]

ind = int(ind[0,0])

ind

5650