#### Prompt
Build a recommendation system using all or some of movie info
Get only dataset link from notebook

#### Hints
 - Combine movie data into one string since TFidf only takes one string as a individual document
 - Use TF-IDF to transform strings into vectors. 
 - Get the TF-IDF of a query movie, compute similarity between query and other vectors
 - Sort by similarity then return the top 5 closest movies
 - Test on movies in other genres to test if code works. 

#### Importing Packages and Data

In [2]:
import re

import nltk

import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer
from scipy import spatial


In [3]:
nltk.download("wordnet")
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\seohy\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\seohy\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\seohy\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

#### Download & check raw data

In [4]:
# https://www.kaggle.com/tmdb/tmdb-movie-metadata
!wget https://lazyprogrammer.me/course_files/nlp/tmdb_5000_movies.csv

--2023-10-08 16:44:59--  https://lazyprogrammer.me/course_files/nlp/tmdb_5000_movies.csv
Resolving lazyprogrammer.me (lazyprogrammer.me)... 172.67.213.166, 104.21.23.210
Connecting to lazyprogrammer.me (lazyprogrammer.me)|172.67.213.166|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5698602 (5.4M) [text/csv]
Saving to: 'tmdb_5000_movies.csv.1'

     0K .......... .......... .......... .......... ..........  0%  314K 18s
    50K .......... .......... .......... .......... ..........  1% 8.02M 9s
   100K .......... .......... .......... .......... ..........  2%  126K 20s
   150K .......... .......... .......... .......... ..........  3% 3.12M 16s
   200K .......... .......... .......... .......... ..........  4%  204K 18s
   250K .......... .......... .......... .......... ..........  5% 13.7M 15s
   300K .......... .......... .......... .......... ..........  6%  727K 13s
   350K .......... .......... .......... .......... ..........  7% 2.27M 12s
   400K ...

In [5]:
df = pd.read_csv('tmdb_5000_movies.csv')

In [6]:
df.head(2)


Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500


#### Organizing DF to pass into TF-IDF

Goal: Turning meaningful columns for recommendation system (genre, keywords, overview, tagline) into a single single string to be passed into TF-IDF. 

Step-by-step plan:
1. Extract genre names from 'genres' column, store into a variable. 
2. For each keyword from 'keywords' column, concat them together, store into a variable
3. Store overview from 'overview column' into a varible
4. Store tagline from 'tagline' into a variable
5. Add variable in steps 1~5 into one variable
6. Add the resultant string into a list.
7. The resultant list is a list of all train + test data. 

In [7]:
split_genres = df['genres'].str[1:-1] # take genres column, remove []
                                      # 이후 재사용성을 위해 함수 정의 = refactoring의 커다란 부분

split_genres = split_genres.str.split(', ', expand=True) # split by , to split by ids and names

def extract_name(cell):
    if pd.isnull(cell): # if cell is none, put in empty string
                        # needed for concat with other data
        return ''
    match = re.search(r'"name":\s*"([^"]*)"', cell) # cannot search through None, used regex 
    if match:
        return match.group(1)
    return '' # enter none if no "name"

# Apply extract_genre function
# Use applymap or apply instead of interating. Iterating is the last resort.

genre_extracted_df = split_genres.applymap(extract_name)

genre_extracted_df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13
0,,Action,,Adventure,,Fantasy,,Science Fiction,,,,,,
1,,Adventure,,Fantasy,,Action,,,,,,,,
2,,Action,,Adventure,,Crime,,,,,,,,
3,,Action,,Crime,,Drama,,Thriller,,,,,,
4,,Action,,Adventure,,Science Fiction,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4798,,Action,,Crime,,Thriller,,,,,,,,
4799,,Comedy,,Romance,,,,,,,,,,
4800,,Comedy,,Drama,,Romance,,TV Movie,,,,,,
4801,,,,,,,,,,,,,,


In [8]:
split_keywords = df['keywords'].str[1:-1]
split_keywords = split_keywords.str.split(', ', expand=True) # 반복 부분, split까지 함수로 만들어보기 
                                                             # 최소 단위로 쪼개서 함수로 만들기

keywords_extracted_df = split_keywords.applymap(extract_name)

keywords_extracted_df 

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,184,185,186,187,188,189,190,191,192,193
0,,culture clash,,future,,space war,,space colony,,society,...,,,,,,,,,,
1,,ocean,,drug abuse,,exotic island,,east india trading company,,love of one's life,...,,,,,,,,,,
2,,spy,,based on novel,,secret agent,,sequel,,mi6,...,,,,,,,,,,
3,,dc comics,,crime fighter,,terrorist,,secret identity,,burglar,...,,,,,,,,,,
4,,based on novel,,mars,,medallion,,space travel,,princess,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4798,,united states\u2013mexico barrier,,legs,,arms,,paper knife,,guitar case,...,,,,,,,,,,
4799,,,,,,,,,,,...,,,,,,,,,,
4800,,date,,love at first sight,,narration,,investigation,,team,...,,,,,,,,,,
4801,,,,,,,,,,,...,,,,,,,,,,


In [9]:
overview_df = df['overview'] # take overview and tagline columns from dataframe

tagline_df = df['tagline']

In [10]:
# combine genre names, keyword names, overview, tagline dataframe into a single total_df

total_df = pd.concat([genre_extracted_df, keywords_extracted_df, overview_df, tagline_df], axis=1)

# replace all none with '' for overview_df and tagline_df
total_df = total_df.fillna('')

total_df = total_df.agg(' '. join, axis=1)
total_df

0        Action  Adventure  Fantasy  Science Fiction  ...
1        Adventure  Fantasy  Action          ocean  dr...
2        Action  Adventure  Crime          spy  based ...
3        Action  Crime  Drama  Thriller        dc comi...
4        Action  Adventure  Science Fiction          b...
                              ...                        
4798     Action  Crime  Thriller          united state...
4799     Comedy  Romance                              ...
4800     Comedy  Drama  Romance  TV Movie        date ...
4801                                                  ...
4802     Documentary              obsession  camcorder...
Length: 4803, dtype: object

### Training using TF-IDF

No need to split the data into train and test set, because there is no testing to be done. 
Rather, we are calculating the vector distance between two vectors.

In [11]:
train_texts = total_df 

tfidf = TfidfVectorizer() # instantiate TfidfVectorizer class
                          # try out other 변수 like stopwords

tfidf_matrix = tfidf.fit_transform(train_texts) # fit vectorizer onto data, transform into vector
# fit_transform 했을 때 document 순서가 바뀌었을까?

#### Get query, calculate the closest 5 vectors 

In [12]:
# Get the TF-IDF of a query movie, compute similarity between query and other vectors
query_movie = input('Which movie do you wish to watch: ')

In [17]:
# take query_movie find corresponding vector in df
query_index = df[df['original_title'] == query_movie].index[0]
query_vector = tfidf_matrix[query_index]

# calculate cosine distance between corresponding vector and all other vectors using df
cosine_similarity_list = []

query_vector_1D = query_vector.toarray().flatten()
for i in range(0, len(df)):
    tfidf_matrix_1D = tfidf_matrix[i].toarray().flatten()
    cosine_similarity = 1 - spatial.distance.cosine(query_vector_1D, tfidf_matrix_1D)
    cosine_similarity_list.append(cosine_similarity)



# return the movies that correspond with those 5 closest vectors using df
# rank the distance, select 6 closest vectors
# exclude most simliar movive b/c will be query_movie
# most similar movie from left to right
index = np.argpartition(cosine_similarity_list, 5)[-5:]


print('''
      Based on your query, I would recommended: 
      {},
      {},
      {},
      {},
      {}
      '''.format(
          df.loc[index[4],'original_title'], 
          df.loc[index[3],'original_title'], 
          df.loc[index[2],'original_title'], 
          df.loc[index[1],'original_title'], 
          df.loc[index[0],'original_title']))




      Based on your query, I would recommended: 
      My Date with Drew,
      Shanghai Calling,
      Signed, Sealed, Delivered,
      Spectre,
      El Mariachi
      
