# Content-Based Filtering
## TMDB 5000 Movie Dataset

The TMDB 5000 Movie Dataset contains processed metadata for roughly 5,000 major movies, sourced from The Movie Database (TMDB).
It is available on Kaggle:

https://www.kaggle.com/tmdb/tmdb-movie-metadata

- tmdb_5000_credits.csv
- tmdb_5000_movies.csv

----

## Content-based filtering using genre properties

Genre is one of the most important factors when choosing a movie, so let's build a content-based filtering recommendation system on the genre attribute.
We compare the similarity of the genres column values between movies and, among the similar movies, recommend those with high ratings.

## Data Loading and Processing

In [18]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

In [19]:
movies = pd.read_csv('./data/TMDB_5000_Movie_Dataset/tmdb_5000_movies.csv')
print(movies.shape)
movies.head()

(4803, 20)


| | budget | genres | homepage | id | keywords | original_language | original_title | overview | popularity | production_companies | production_countries | release_date | revenue | runtime | spoken_languages | status | tagline | title | vote_average | vote_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 237000000 | [{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {... | http://www.avatarmovie.com/ | 19995 | [{"id": 1463, "name": "culture clash"}, {"id": 2964, "name": "future"}, {"id": 3386, "name": "sp... | en | Avatar | In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, ... | 150.437577 | [{"name": "Ingenious Film Partners", "id": 289}, {"name": "Twentieth Century Fox Film Corporatio... | [{"iso_3166_1": "US", "name": "United States of America"}, {"iso_3166_1": "GB", "name": "United ... | 2009-12-10 | 2787965087 | 162.0 | [{"iso_639_1": "en", "name": "English"}, {"iso_639_1": "es", "name": "Espa\u00f1ol"}] | Released | Enter the World of Pandora. | Avatar | 7.2 | 11800 |
| 1 | 300000000 | [{"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 28, "name": "Action"}] | http://disney.go.com/disneypictures/pirates/ | 285 | [{"id": 270, "name": "ocean"}, {"id": 726, "name": "drug abuse"}, {"id": 911, "name": "exotic is... | en | Pirates of the Caribbean: At World's End | Captain Barbossa, long believed to be dead, has come back to life and is headed to the edge of t... | 139.082615 | [{"name": "Walt Disney Pictures", "id": 2}, {"name": "Jerry Bruckheimer Films", "id": 130}, {"na... | [{"iso_3166_1": "US", "name": "United States of America"}] | 2007-05-19 | 961000000 | 169.0 | [{"iso_639_1": "en", "name": "English"}] | Released | At the end of the world, the adventure begins. | Pirates of the Caribbean: At World's End | 6.9 | 4500 |
| 2 | 245000000 | [{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 80, "name": "Crime"}] | http://www.sonypictures.com/movies/spectre/ | 206647 | [{"id": 470, "name": "spy"}, {"id": 818, "name": "based on novel"}, {"id": 4289, "name": "secret... | en | Spectre | A cryptic message from Bond’s past sends him on a trail to uncover a sinister organization. Whil... | 107.376788 | [{"name": "Columbia Pictures", "id": 5}, {"name": "Danjaq", "id": 10761}, {"name": "B24", "id": ... | [{"iso_3166_1": "GB", "name": "United Kingdom"}, {"iso_3166_1": "US", "name": "United States of ... | 2015-10-26 | 880674609 | 148.0 | [{"iso_639_1": "fr", "name": "Fran\u00e7ais"}, {"iso_639_1": "en", "name": "English"}, {"iso_639... | Released | A Plan No One Escapes | Spectre | 6.3 | 4466 |
| 3 | 250000000 | [{"id": 28, "name": "Action"}, {"id": 80, "name": "Crime"}, {"id": 18, "name": "Drama"}, {"id": ... | http://www.thedarkknightrises.com/ | 49026 | [{"id": 849, "name": "dc comics"}, {"id": 853, "name": "crime fighter"}, {"id": 949, "name": "te... | en | The Dark Knight Rises | Following the death of District Attorney Harvey Dent, Batman assumes responsibility for Dent's c... | 112.31295 | [{"name": "Legendary Pictures", "id": 923}, {"name": "Warner Bros.", "id": 6194}, {"name": "DC E... | [{"iso_3166_1": "US", "name": "United States of America"}] | 2012-07-16 | 1084939099 | 165.0 | [{"iso_639_1": "en", "name": "English"}] | Released | The Legend Ends | The Dark Knight Rises | 7.6 | 9106 |
| 4 | 260000000 | [{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 878, "name": "Science Fic... | http://movies.disney.com/john-carter | 49529 | [{"id": 818, "name": "based on novel"}, {"id": 839, "name": "mars"}, {"id": 1456, "name": "medal... | en | John Carter | John Carter is a war-weary, former military captain who's inexplicably transported to the myster... | 43.926995 | [{"name": "Walt Disney Pictures", "id": 2}] | [{"iso_3166_1": "US", "name": "United States of America"}] | 2012-03-07 | 284139100 | 132.0 | [{"iso_639_1": "en", "name": "English"}] | Released | Lost in our world, found in another. | John Carter | 6.1 | 2124 |


### - Extract the main columns needed for content-based filtering
### - Store them in a new DataFrame

In [20]:
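# .copy() gives an independent DataFrame, so later column assignments do not act on a slice of movies
movies_df = movies[['id','title','genres','vote_average','vote_count','popularity','keywords','overview']].copy()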
movies_df.sample()

| | id | title | genres | vote_average | vote_count | popularity | keywords | overview |
|---|---|---|---|---|---|---|---|---|
| 2895 | 12479 | Breakfast of Champions | [{"id": 35, "name": "Comedy"}] | 5.2 | 28 | 2.695896 | [{"id": 1157, "name": "wife husband relationship"}, {"id": 3836, "name": "success"}, {"id": 1081... | A portrait of a fictional town in the mid west that is home to a group of idiosyncratic and slig... |


The values in the 'genres' and 'keywords' columns look like Python lists that contain several dictionaries, which is a way of packing multiple values into a single cell. Because the data is loaded as str, each value must be converted back into a list of dictionaries before it can be used.

In [21]:
pd.set_option('max_colwidth', 100)
movies_df[['genres','keywords']].sample()

| | genres | keywords |
|---|---|---|
| 3951 | [{"id": 35, "name": "Comedy"}] | [{"id": 1946, "name": "restaurant"}, {"id": 1994, "name": "wolf"}, {"id": 6490, "name": "shoppin... |


- Both genres and keywords use id and name as dictionary keys, and the label itself can be read through the 'name' key.
- Let's parse these two columns into Python list objects.

In [22]:
from ast import literal_eval
movies_df['genres'] = movies_df['genres'].apply(literal_eval)
movies_df['keywords'] = movies_df['keywords'].apply(literal_eval)
movies_df[['genres','keywords']].sample()

| | genres | keywords |
|---|---|---|
| 2317 | [{'id': 878, 'name': 'Science Fiction'}, {'id': 28, 'name': 'Action'}, {'id': 12, 'name': 'Adven... | [] |


- The output looks almost unchanged, but each value is now a list of dictionaries rather than a string.
- Next, let's extract only the names into a list.

In [23]:
movies_df['genres'] = movies_df['genres'].apply(lambda x : [y['name'] for y in x])
movies_df['keywords'] = movies_df['keywords'].apply(lambda x : [y['name'] for y in x])
movies_df[['genres','keywords']].sample()

| | genres | keywords |
|---|---|---|
| 4357 | [Drama] | [adoption, college] |
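
As a quick check of the two steps above, the sketch below applies them to a single value instead of the whole column. The string is the genres value shown for "Pirates of the Caribbean: At World's End" in the first output; the variable names `raw`, `parsed`, and `names` are made up for this illustration.

```python
from ast import literal_eval

# A cell value as it arrives from read_csv: one plain string
raw = '[{"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 28, "name": "Action"}]'
print(type(raw).__name__)

# literal_eval turns the string into real Python objects (a list of dicts)
parsed = literal_eval(raw)
print(type(parsed).__name__, type(parsed[0]).__name__)

# Keep only the value stored under the 'name' key of each dict
names = [d['name'] for d in parsed]
print(names)
```

The same list comprehension is what the lambda in the cell above runs for every row.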


----

## Measure genre content similarity

There are several ways to measure how similar two movies are in genre; one of the simplest is cosine similarity.
The genres are converted to strings, vectorized with CountVectorizer, and the rows of the resulting feature matrix are compared, in the following steps:
1. Convert the genres column to str and feature-vectorize it with a count-based vectorizer.
2. Compute the cosine similarity between the rows of the resulting feature matrix.
3. Among the movies with high genre similarity, recommend those with the highest ratings.
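
For reference, the standard definition of cosine similarity between two feature vectors $A$ and $B$ is given below. It is background material rather than something stated in the notebook itself.

$$
\text{similarity}(A, B) = \cos\theta = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} = \frac{\sum_i A_i B_i}{\sqrt{\sum_i A_i^2}\,\sqrt{\sum_i B_i^2}}
$$

Count vectors have no negative components, so here the value always lies between 0 (no shared features) and 1 (same direction).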

In [24]:
from sklearn.feature_extraction.text import CountVectorizer


# Join each genre list into one space-separated string so CountVectorizer can tokenize it
movies_df['genres_literal'] = movies_df['genres'].apply(lambda x: ' '.join(x))

# min_df / max_df: ignore tokens whose document frequency is below min_df or above max_df.
# min_df=1 keeps every token; newer scikit-learn releases require an integer min_df to be at least 1.
# ngram_range=(min_n, max_n): n-gram sizes used to build the bag of words; here unigrams and bigrams.
count_vect = CountVectorizer(min_df=1, ngram_range=(1, 2))

# fit_transform returns a sparse matrix in CSR format (csr_matrix)
genre_mat = count_vect.fit_transform(movies_df['genres_literal'])
genre_mat.shape

(4803, 276)
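
To get a feel for what these feature columns are, the sketch below fits the same kind of vectorizer on two toy genre strings and lists the learned vocabulary. The names `toy_genres`, `toy_vect`, and `toy_mat` are made up for this illustration, and `min_df` is left at its default.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two toy genre strings, joined with spaces as in the cell above
toy_genres = ['Adventure Fantasy Action', 'Science Fiction Action']

toy_vect = CountVectorizer(ngram_range=(1, 2))
toy_mat = toy_vect.fit_transform(toy_genres)

# vocabulary_ maps each learned token (unigram or bigram) to its column index
features = sorted(toy_vect.vocabulary_)
print(features)
print(toy_mat.shape)   # (number of strings, number of distinct unigrams and bigrams)
```

Tokens are lowercased and split on whitespace, so a two-word genre such as "Science Fiction" contributes `science`, `fiction`, and the bigram `science fiction`. Bigrams also form across neighbouring genres (`fiction action`), which makes the features depend on the order in which genres are listed.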

- CountVectorizer produces a feature vector matrix with 4,803 records and 276 features (unigrams and bigrams of the genre words).
- Next, compute the cosine similarity on this matrix with scikit-learn's cosine_similarity() function.

In [25]:
from sklearn.metrics.pairwise import cosine_similarity

genre_sim = cosine_similarity(genre_mat, genre_mat)
print(genre_sim.shape)
print(genre_sim)

(4803, 4803)
[[1.         0.59628479 0.4472136  ... 0.         0.         0.        ]
 [0.59628479 1.         0.4        ... 0.         0.         0.        ]
 [0.4472136  0.4        1.         ... 0.         0.         0.        ]
 ...
 [0.         0.         0.         ... 1.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         1.        ]]
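
To see where an individual entry comes from, the sketch below recomputes the similarity between rows 1 and 2 of `movies.head()` ("Pirates of the Caribbean: At World's End" and "Spectre") from their genre strings alone. The third, empty string is a hypothetical movie with no genres, and `toy_genres`, `toy_mat`, and `toy_sim` are illustrative names. Fitting on only these strings gives a smaller vocabulary than the full 276 features, but the pairwise values are unaffected, because features that neither movie contains add nothing to the dot product or the norms.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

toy_genres = [
    'Adventure Fantasy Action',   # genres of row 1 in movies.head()
    'Action Adventure Crime',     # genres of row 2 in movies.head()
    '',                           # hypothetical movie with an empty genre list
]

toy_mat = CountVectorizer(ngram_range=(1, 2)).fit_transform(toy_genres)
toy_sim = cosine_similarity(toy_mat)
print(toy_sim.round(4))
```

The first two strings share the features `action` and `adventure` and each has five features in total (three unigrams, two bigrams), so their similarity is 2 / (√5 · √5) = 0.4, matching the entry for rows 1 and 2 in the matrix above. The empty string gives an all-zero vector whose similarity is 0 even with itself, which is presumably why one diagonal entry near the end of the printed matrix is 0 rather than 1.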
