# Term Frequency - Inverse Document Frequency

### Create a recommender system for a database of movies and movie descriptions, using tf-idf
  - Assume the query is always an existing movie in the db
  - If query = "Scream 3", then print out the 5 closest movies
  - Get tf-idf representation of Scream 3
  - Compute similarity between vector of Scream 3 and all other vectors
  - Sort by similarity and print top 5

In [177]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
import matplotlib.pyplot as plt
import numpy  as np
import pandas as pd
import json

In [178]:
# Download dataset
# https://www.kaggle.com/tmdb/tmdb-movie-metadata
#!wget https://lazyprogrammer.me/course_files/nlp/tmdb_5000_movies.csv

In [179]:
df = pd.read_csv('../datasets/movies/tmdb_5000_movies.csv')

In [180]:
df.head()

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-07-16,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-03-07,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124


Key step: how do we combine the data into a single string? tf-idf expects one string per document.
Some columns are stored as JSON, so we need to flatten them and store everything as one string.

The JSON values all seem to be arrays of objects shaped like {"id": ..., "name": ...}, so we read every name field and put it into our final string.

In [181]:
x = json.loads(df['genres'][0])
x[0]['name']

'Action'

In [182]:
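def extract_names_from_json_array(j):
  # Parse a JSON array of objects and concatenate their 'name' fields,
  # each followed by a single space.
  x = json.loads(j)
  string = ""
  for i in x:
    string += i['name'] + ' '
  return string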
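A slightly more compact variant, sketched here on a hand-written sample string rather than on the real data: `str.join` builds the result in one pass and leaves no trailing space. The name `names_from_json_array` and the sample value are illustrative only.

```python
import json

def names_from_json_array(raw):
    """Return the 'name' fields of a JSON array of objects, space-separated."""
    return " ".join(item["name"] for item in json.loads(raw))

# Hand-written sample in the same shape as the 'genres' column.
sample = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'
print(names_from_json_array(sample))      # Action Adventure
print(repr(names_from_json_array("[]")))  # '' for an empty array
```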

In [183]:
# Flatten every JSON-encoded column into a plain string of names.
# Note: DataFrame.applymap is deprecated since pandas 2.1; newer versions use DataFrame.map.
json_columns = ['genres', 'keywords', 'production_companies', 'production_countries', 'spoken_languages']
df[json_columns] = df[json_columns].applymap(extract_names_from_json_array)


In [184]:
df.head()

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,Action Adventure Fantasy Science Fiction,http://www.avatarmovie.com/,19995,culture clash future space war space colony so...,en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,Ingenious Film Partners Twentieth Century Fox ...,United States of America United Kingdom,2009-12-10,2787965087,162.0,English Español,Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,Adventure Fantasy Action,http://disney.go.com/disneypictures/pirates/,285,ocean drug abuse exotic island east india trad...,en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,Walt Disney Pictures Jerry Bruckheimer Films S...,United States of America,2007-05-19,961000000,169.0,English,Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500
2,245000000,Action Adventure Crime,http://www.sonypictures.com/movies/spectre/,206647,spy based on novel secret agent sequel mi6 bri...,en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,Columbia Pictures Danjaq B24,United Kingdom United States of America,2015-10-26,880674609,148.0,Français English Español Italiano Deutsch,Released,A Plan No One Escapes,Spectre,6.3,4466
3,250000000,Action Crime Drama Thriller,http://www.thedarkknightrises.com/,49026,dc comics crime fighter terrorist secret ident...,en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,Legendary Pictures Warner Bros. DC Entertainme...,United States of America,2012-07-16,1084939099,165.0,English,Released,The Legend Ends,The Dark Knight Rises,7.6,9106
4,260000000,Action Adventure Science Fiction,http://movies.disney.com/john-carter,49529,based on novel mars medallion space travel pri...,en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,Walt Disney Pictures,United States of America,2012-03-07,284139100,132.0,English,Released,"Lost in our world, found in another.",John Carter,6.1,2124


Let's delete the columns that look like noise.

In [185]:
df = df.drop(columns=['id','status','vote_count','revenue','popularity','vote_average','runtime','homepage','budget','original_language', 'release_date', 'title'])
# Alternative: keep only genres, keywords and original_title
#df = df[['genres', 'keywords', 'original_title']]
df.head()

Unnamed: 0,genres,keywords,original_title,overview,production_companies,production_countries,spoken_languages,tagline
0,Action Adventure Fantasy Science Fiction,culture clash future space war space colony so...,Avatar,"In the 22nd century, a paraplegic Marine is di...",Ingenious Film Partners Twentieth Century Fox ...,United States of America United Kingdom,English Español,Enter the World of Pandora.
1,Adventure Fantasy Action,ocean drug abuse exotic island east india trad...,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",Walt Disney Pictures Jerry Bruckheimer Films S...,United States of America,English,"At the end of the world, the adventure begins."
2,Action Adventure Crime,spy based on novel secret agent sequel mi6 bri...,Spectre,A cryptic message from Bond’s past sends him o...,Columbia Pictures Danjaq B24,United Kingdom United States of America,Français English Español Italiano Deutsch,A Plan No One Escapes
3,Action Crime Drama Thriller,dc comics crime fighter terrorist secret ident...,The Dark Knight Rises,Following the death of District Attorney Harve...,Legendary Pictures Warner Bros. DC Entertainme...,United States of America,English,The Legend Ends
4,Action Adventure Science Fiction,based on novel mars medallion space travel pri...,John Carter,"John Carter is a war-weary, former military ca...",Walt Disney Pictures,United States of America,English,"Lost in our world, found in another."


Now that the JSON is gone, we need to reshape df into two columns, 'label' and 'text'. 'label' is the original title; 'text' is the concatenation of all the remaining columns into a single string.

In [186]:
df_text = df.loc[:, df.columns != 'original_title'].apply( lambda x: " ".join([str(i) for i in x]) , axis = 1)
df_label = df['original_title']

df = pd.concat([df_label, df_text], axis=1)
df = df.rename(columns={"original_title": "label", 0: "text"})
df.head()

Unnamed: 0,label,text
0,Avatar,Action Adventure Fantasy Science Fiction cult...
1,Pirates of the Caribbean: At World's End,Adventure Fantasy Action ocean drug abuse exo...
2,Spectre,Action Adventure Crime spy based on novel sec...
3,The Dark Knight Rises,Action Crime Drama Thriller dc comics crime f...
4,John Carter,Action Adventure Science Fiction based on nov...
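One caveat with `str(i)`: pandas represents a missing value as a float NaN, and `str(float('nan'))` is the literal string `'nan'`, so a row with an empty `overview` or `tagline` would contribute a spurious `nan` token. We have not checked above whether this dataset has such rows. A minimal sketch of a join helper that skips missing values; `join_non_missing` is a hypothetical name, not part of the notebook, and the sample row is made up.

```python
import math

def join_non_missing(values):
    """Join values into one string, skipping None and float NaN."""
    kept = []
    for v in values:
        if v is None or (isinstance(v, float) and math.isnan(v)):
            continue
        kept.append(str(v))
    return " ".join(kept)

# Made-up row with two missing entries.
row = ["Action Crime", float("nan"), "The Legend Ends", None]
print(" ".join(str(v) for v in row))  # Action Crime nan The Legend Ends None
print(join_non_missing(row))          # Action Crime The Legend Ends
```

It could be passed to `apply(join_non_missing, axis=1)` in place of the lambda. Doing so may change the vocabulary, and therefore the numbers in the cells that follow.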


Now that we have our dataset, let's use tf-idf!

In [187]:
vectorizer = TfidfVectorizer()

In [188]:
X = vectorizer.fit_transform(df['text'])
X.shape

(4803, 26568)
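To make the weighting concrete, here is a from-scratch sketch of what the vectorizer computes with its default settings, as far as we understand them: raw term counts, a smoothed idf of ln((1 + n) / (1 + df)) + 1, and L2 normalisation of each row. Tokenisation here is a plain lowercase `split()`, which is a simplification of the vectorizer's own tokenizer, and the three-document corpus is made up.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocabulary, rows) with one L2-normalised tf-idf row per document."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        row = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in row)) or 1.0
        rows.append([w / norm for w in row])
    return vocab, rows

docs = ["masked killer returns", "masked killer stalks teens", "secret agent returns"]
vocab, rows = tfidf(docs)
# Print the non-zero weights of the second document.
for term, weight in zip(vocab, rows[1]):
    if weight > 0:
        print(f"{term}: {weight:.3f}")
```

In the second document, "stalks" and "teens" appear nowhere else, so they receive a higher weight than "masked" and "killer", which also appear in the first document.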

Now let's find the 5 movies closest to "Scream 3". First, we look up its row index:

In [189]:
print(df.loc[df['label'] == "Scream 3"].index.values[0])

1164


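We can compute one big kernel matrix K holding the cosine similarity of every row of X with every other row, which gives us all the results at once. To find the top 5 movies, we look up the index i of the query movie in df, sort row K[i, :] with np.argsort, and take the indices with the highest similarity. The query itself ranks at the top, since its similarity with itself is 1.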
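For reference, the cosine similarity of two vectors is their dot product divided by the product of their norms. A small pure-Python sketch on made-up vectors (the helper name `cosine` is ours, not a library function):

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

a = [1.0, 2.0, 0.0]
b = [2.0, 4.0, 0.0]  # same direction as a, so similarity is (numerically close to) 1
c = [0.0, 0.0, 3.0]  # orthogonal to a
print(cosine(a, b))
print(cosine(a, c))  # 0.0
```

Because the rows of a tf-idf matrix built with the default settings are already L2-normalised, each entry of K reduces to a plain dot product between two rows.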

In [190]:
from sklearn.metrics.pairwise import cosine_similarity

In [191]:
K = cosine_similarity(X,X)

In [192]:
K

array([[1.        , 0.05101507, 0.04003499, ..., 0.0389315 , 0.02058338,
        0.01570893],
       [0.05101507, 1.        , 0.05218709, ..., 0.05583301, 0.03851455,
        0.03720884],
       [0.04003499, 0.05218709, 1.        , ..., 0.03491026, 0.01879809,
        0.02177843],
       ...,
       [0.0389315 , 0.05583301, 0.03491026, ..., 1.        , 0.05268888,
        0.04498346],
       [0.02058338, 0.03851455, 0.01879809, ..., 0.05268888, 1.        ,
        0.04553488],
       [0.01570893, 0.03720884, 0.02177843, ..., 0.04498346, 0.04553488,
        1.        ]])

In [193]:
np.argsort(K[1164,:])[::-1][:5]

array([1164, 2282, 1961, 1042, 2194])

In [194]:
df['label'][2282]

'Scream'

In [195]:
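def find_similar_movies(title):
    # Row index of the query movie (raises IndexError if the title is not in df).
    title_index = df.loc[df['label'] == title].index.values[0]
    # Take the 6 highest similarities; the first is the query itself, so skip it.
    similar_titles = np.argsort(K[title_index,:])[::-1][:6]
    for i in similar_titles[1:]:
        print(df['label'][i])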

In [196]:
find_similar_movies("Spectre")

Never Say Never Again
From Russia with Love
Skyfall
Die Another Day
Quantum of Solace
