#### Prompt
Build a recommendation system using all or some of movie info
Get only dataset link from notebook

#### Hints
 - Combine movie data into one string since TFidf only takes one string as a individual document
 - Use TF-IDF to transform strings into vectors. 
 - Get the TF-IDF of a query movie, compute similarity between query and other vectors
 - Sort by similarity then return the top 5 closest movies
 - Test on movies in other genres to test if code works. 

#### Importing Packages and Data

In [97]:
import re

import nltk

import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer
from scipy import spatial


In [98]:
nltk.download("wordnet")
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\seohy\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\seohy\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\seohy\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

#### Download & check raw data

In [99]:
# https://www.kaggle.com/tmdb/tmdb-movie-metadata
!wget https://lazyprogrammer.me/course_files/nlp/tmdb_5000_movies.csv

--2023-10-07 18:58:53--  https://lazyprogrammer.me/course_files/nlp/tmdb_5000_movies.csv
Resolving lazyprogrammer.me (lazyprogrammer.me)... 172.67.213.166, 104.21.23.210
Connecting to lazyprogrammer.me (lazyprogrammer.me)|172.67.213.166|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5698602 (5.4M) [text/csv]
Saving to: 'tmdb_5000_movies.csv.1'

     0K .......... .......... .......... .......... ..........  0% 1.50M 4s
    50K .......... .......... .......... .......... ..........  1% 6.34M 2s
   100K .......... .......... .......... .......... ..........  2% 2.32M 2s
   150K .......... .......... .......... .......... ..........  3% 4.83M 2s
   200K .......... .......... .......... .......... ..........  4% 4.79M 2s
   250K .......... .......... .......... .......... ..........  5% 2.83M 2s
   300K .......... .......... .......... .......... ..........  6% 8.06M 2s
   350K .......... .......... .......... .......... ..........  7% 1.20M 2s
   400K ..........

In [100]:
df = pd.read_csv('tmdb_5000_movies.csv')

In [101]:
df.head(2)


Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500


#### Organizing DF to pass into TF-IDF

Goal: Turning meaningful columns for recommendation system (genre, keywords, overview, tagline) into a single single string to be passed into TF-IDF. 

Step-by-step plan:
1. Extract genre names from 'genres' column, store into a variable. 
2. For each keyword from 'keywords' column, concat them together, store into a variable
3. Store overview from 'overview column' into a varible
4. Store tagline from 'tagline' into a variable
5. Add variable in steps 1~5 into one variable
6. Add the resultant string into a list.
7. The resultant list is a list of all train + test data. 

In [102]:
split_genres = df['genres'].str[1:-1]
split_genres = split_genres.str.split(', ', expand=True)

# Extract genre using regex
# Iterating over each row is slow and inefficient. Use applymap or apply instead.
def extract_name(cell):
    if pd.isnull(cell):
        return ''
    match = re.search(r'"name":\s*"([^"]*)"', cell) # cannot search through None
    if match:
        return match.group(1)
    return None

# Apply extract_genre function
genre_extracted_df = split_genres.applymap(extract_name)
genre_extracted_df = genre_extracted_df.fillna('')

genre_extracted_df = genre_extracted_df.agg(' '. join, axis=1)

genre_extracted_df

0        Action  Adventure  Fantasy  Science Fiction  ...
1                      Adventure  Fantasy  Action        
2                        Action  Adventure  Crime        
3                    Action  Crime  Drama  Thriller      
4              Action  Adventure  Science Fiction        
                              ...                        
4798                      Action  Crime  Thriller        
4799                            Comedy  Romance          
4800               Comedy  Drama  Romance  TV Movie      
4801                                                     
4802                              Documentary            
Length: 4803, dtype: object

In [103]:
split_keywords = df['keywords'].str[1:-1]
split_keywords = split_keywords.str.split(', ', expand=True)

keywords_extracted_df = split_keywords.applymap(extract_name)
keywords_extracted_df = keywords_extracted_df.fillna('')

keywords_extracted_df = keywords_extracted_df.agg(' '. join, axis=1)

keywords_extracted_df 

0        culture clash  future  space war  space colon...
1        ocean  drug abuse  exotic island  east india ...
2        spy  based on novel  secret agent  sequel  mi...
3        dc comics  crime fighter  terrorist  secret i...
4        based on novel  mars  medallion  space travel...
                              ...                        
4798     united states\u2013mexico barrier  legs  arms...
4799                                                  ...
4800     date  love at first sight  narration  investi...
4801                                                  ...
4802     obsession  camcorder  crush  dream girl      ...
Length: 4803, dtype: object

In [104]:
overview_df = df['overview']
tagline_df = df['tagline']

In [105]:
total_df = pd.concat([genre_extracted_df, keywords_extracted_df, overview_df, tagline_df], axis=1)
total_df = total_df.fillna('')

total_df = total_df.agg(' '. join, axis=1)
total_df

0        Action  Adventure  Fantasy  Science Fiction  ...
1        Adventure  Fantasy  Action          ocean  dr...
2        Action  Adventure  Crime          spy  based ...
3        Action  Crime  Drama  Thriller        dc comi...
4        Action  Adventure  Science Fiction          b...
                              ...                        
4798     Action  Crime  Thriller          united state...
4799     Comedy  Romance                              ...
4800     Comedy  Drama  Romance  TV Movie        date ...
4801                                                  ...
4802     Documentary              obsession  camcorder...
Length: 4803, dtype: object

### Training using TF-IDF

No need to split the data into train and test set, because there is no testing to be done. 
Rather, we are calculating the vector distance between two vectors.

In [106]:
train_texts = total_df 

tfidf = TfidfVectorizer() # instantiate TfidfVectorizer class
                          # try out other 변수 like stopwords
tfidf_matrix = tfidf.fit_transform(train_texts) # fit vectorizer onto data, transform into vector


#### Get query, calculate the closest 5 vectors 

In [107]:
# Get the TF-IDF of a query movie, compute similarity between query and other vectors
query_movie = input('Which movie do you wish to watch: ')

In [108]:
# take query_movie find corresponding vector in df
query_movie_index = df[df['original_title'] == query_movie].index[0]
query_vector = tfidf_matrix[query_movie_index]

# calculate cosine distance between corresponding vector and all other vectors using df
cosine_similarity_list = []
for i in range(1, 4803):
    query_vector_1D = query_vector.toarray().flatten()
    tfidf_matrix_1D = tfidf_matrix[i].toarray().flatten()
    cosine_similarity = 1 - spatial.distance.cosine(query_vector_1D, tfidf_matrix_1D)
    cosine_similarity_list.append(cosine_similarity)

# rank the distance, select the 5 closest vectors
index = np.argpartition(cosine_similarity_list, -5)[-5:]

# return the movies that correspond with those 5 closest vectors using df
print('Based on your query, I would recommended: {0}, {1}, {2}, {3}, {4}'.format(
    df.loc[index[0],'original_title'], 
    df.loc[index[1],'original_title'], 
    df.loc[index[2],'original_title'], 
    df.loc[index[3],'original_title'], 
    df.loc[index[4],'original_title']
    ))

Based on your query, I would recommended: The Avengers, Semi-Pro, R.I.P.D., Quantum of Solace, Avatar
