## Exercise 2. Similarity-Based Recommendation

#### Example data: Kaggle Movies metadata dataset

- Movie metadata with 45,466 rows and 24 columns
- Data file: movies_metadata.csv
- Source: https://www.kaggle.com/rounakbanik/the-movies-dataset

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt 
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

#### 1. Prepare the data

- Download the example data file and extract it

#### 2. Load into a DataFrame

In [2]:
data = pd.read_csv('data/MoviesDataset/movies_metadata.csv', low_memory = False)
data.head(2)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 24 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   adult                  45466 non-null  object 
 1   belongs_to_collection  4494 non-null   object 
 2   budget                 45466 non-null  object 
 3   genres                 45466 non-null  object 
 4   homepage               7782 non-null   object 
 5   id                     45466 non-null  object 
 6   imdb_id                45449 non-null  object 
 7   original_language      45455 non-null  object 
 8   original_title         45466 non-null  object 
 9   overview               44512 non-null  object 
 10  popularity             45461 non-null  object 
 11  poster_path            45080 non-null  object 
 12  production_companies   45463 non-null  object 
 13  production_countries   45463 non-null  object 
 14  release_date           45379 non-null  object 
 15  re

In [4]:
data.budget.unique()

array(['30000000', '65000000', '0', ..., '3417000', '25868826', '1254040'],
      dtype=object)

#### 3. Clean the data

- `overview` column: the movie synopsis

In [5]:
data.overview.isnull().sum()

954

- Replace missing values in the `overview` column with the empty string `''`

In [6]:
data['overview'] = data['overview'].fillna('')

#### 4. TF-IDF feature vectorization

- Vectorize the `overview` column with TF-IDF, using the vectorizer's `stop_words` option to remove English stop words

In [15]:
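# Remove English stop words while building the TF-IDF vocabulary
tfidf_vect = TfidfVectorizer(stop_words='english')
# Sparse matrix: one row per movie, one column per vocabulary term
ftr_mat = tfidf_vect.fit_transform(data['overview'])


# print(f'Vocabulary:\n{tfidf_vect.vocabulary_}')
# print(f'Feature matrix:\n{ftr_mat.toarray()}')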
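
To make the effect of `stop_words='english'` concrete, here is a minimal sketch on a toy corpus. The three sentences are made up for illustration and do not come from the dataset; the empty string stands in for an overview whose missing value was filled with `''`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus (not taken from movies_metadata.csv)
docs = [
    "the toys live happily in the room",
    "the siblings discover an enchanted board game",
    "",  # an empty overview, like a filled-in missing value
]

vect = TfidfVectorizer(stop_words='english')
mat = vect.fit_transform(docs)  # sparse matrix, one row per document

# Stop words such as 'the' never enter the vocabulary
print(sorted(vect.vocabulary_))
print('the' in vect.vocabulary_)

# An empty document becomes an all-zero row
print(mat.shape, mat[2].sum())
```

An all-zero row has cosine similarity 0 with every other row, so movies without an overview are effectively never recommended on the basis of text.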

#### 5. Cosine similarity

In [19]:
# Note: the result is a dense 45,466 x 45,466 matrix (about 16.5 GB as float64)
cosine_sim = cosine_similarity(ftr_mat, ftr_mat)
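
A dense 45,466 x 45,466 similarity matrix may not fit in memory on a typical machine. One possible workaround, sketched below under that assumption, is to compute only the row for the selected movie. The helper name `similarity_row` and the toy corpus are hypothetical; the check confirms that the single row matches the corresponding row of the full matrix.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def similarity_row(matrix, idx):
    """Cosine similarity between row `idx` and every row of `matrix` (1-D array)."""
    return cosine_similarity(matrix[idx], matrix).ravel()

# Hypothetical toy corpus standing in for the overview column
docs = [
    "toys come alive when the boy leaves the room",
    "a board game traps two children in a jungle",
    "two children find a magical board game",
]
mat = TfidfVectorizer(stop_words='english').fit_transform(docs)

full = cosine_similarity(mat, mat)   # n x n, feasible only for small n
row = similarity_row(mat, 1)         # length n, cheap even for large n

print(np.allclose(full[1], row))
```

A row computed this way can play the role of `cosine_sim[idx]` without ever materializing the full matrix.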


#### 6. Build a dictionary that maps movie titles to row indices

In [9]:
title_to_index = dict(zip(data['title'], data.index))
title_to_index['Father of the Bride Part II']

4
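
One caveat with `dict(zip(...))`: if a title occurs more than once, the later row silently overwrites the earlier one. The sketch below uses made-up titles to show that behaviour and one way to keep the first occurrence instead; whether first or last is preferable for this dataset is a judgment call.

```python
import pandas as pd

# Hypothetical titles with one duplicate
titles = pd.Series(['Heat', 'Sabrina', 'Heat'])

# dict(zip(...)) keeps the LAST index seen for a repeated key
last_wins = dict(zip(titles, titles.index))

# setdefault keeps the FIRST index seen for a repeated key
first_wins = {}
for i, t in zip(titles.index, titles):
    first_wins.setdefault(t, i)

print(last_wins)
print(first_wins)
```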

In [10]:
data['title'][:10]

0                      Toy Story
1                        Jumanji
2               Grumpier Old Men
3              Waiting to Exhale
4    Father of the Bride Part II
5                           Heat
6                        Sabrina
7                   Tom and Huck
8                   Sudden Death
9                      GoldenEye
Name: title, dtype: object

#### 7. Find the 10 movies whose overview is most similar to that of the selected title

In [16]:
def get_recommend(title, cosine_sim):
    # Row index of the selected movie
    idx = title_to_index[title]
    # (index, similarity) pairs for every movie
    sim_scores = list(enumerate(cosine_sim[idx]))
    # Sort by similarity, descending; skip position 0 (the movie itself)
    sim_scores10 = sorted(sim_scores, key=lambda x: x[1], reverse=True)[1:11]
    indices = [item[0] for item in sim_scores10]
    return data['title'].iloc[indices]

In [None]:
title = 'Father of the Bride Part II'
get_recommend(title, cosine_sim)
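
The `[1:11]` slice assumes the selected movie itself ranks first in the sorted list. As a hedged alternative, the sketch below excludes the query index explicitly and uses `np.argsort` rather than sorting Python tuples. The function name `top_k_similar` and the score values are hypothetical.

```python
import numpy as np

def top_k_similar(sim_row, idx, k=10):
    """Indices of the k highest scores in sim_row, excluding idx itself."""
    order = np.argsort(sim_row)[::-1]   # indices sorted by score, descending
    order = order[order != idx]         # drop the query movie explicitly
    return order[:k]

# Hypothetical similarity row for movie 1 (its self-similarity is 1.0)
scores = np.array([0.1, 1.0, 0.4, 0.9, 0.0])
print(top_k_similar(scores, idx=1, k=2))  # [3 2]
```

The returned array can be passed to `data['title'].iloc[...]` in the same way as `indices` in `get_recommend`.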

#### 8. Visualize the cosine similarity between the selected movie and its 10 most similar movies

In [None]:
title = 'Jumanji'
my_idx = title_to_index[title]
print(my_idx)
top10_idx = get_recommend(title, cosine_sim).index
print(top10_idx)

In [None]:
mv_idx = [my_idx] + list(top10_idx)
cos_sim_10 = cosine_similarity(ftr_mat[mv_idx], ftr_mat[mv_idx])
cos_sim_10

#### Heatmap

In [None]:
plt.figure(figsize=(12,12))
sns.heatmap(cos_sim_10, annot=True, fmt='.2f', xticklabels=mv_idx, yticklabels=mv_idx)
plt.show()

- Reorder the movie indices in descending order of cosine similarity

In [None]:
cos_sim_10[0]
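
The reordering itself can be sketched with `np.argsort` on the first row of the similarity matrix. The 3 x 3 matrix and the index values below are made up for illustration; with real data they would be `cos_sim_10` and `mv_idx`.

```python
import numpy as np

# Hypothetical stand-ins for cos_sim_10 and mv_idx
cos = np.array([
    [1.0, 0.2, 0.5],
    [0.2, 1.0, 0.1],
    [0.5, 0.1, 1.0],
])
movie_idx = [10, 20, 30]

# Positions sorted by similarity to the first (selected) movie, descending
order = np.argsort(-cos[0])

sorted_idx = [movie_idx[i] for i in order]
# Apply the same permutation to rows and columns so the matrix stays symmetric
reordered = cos[np.ix_(order, order)]

print(sorted_idx)     # [10, 30, 20]
print(reordered[0])   # [1.  0.5 0.2]
```

Passing `reordered` and `sorted_idx` to `sns.heatmap` would then show the most similar movies next to the selected one.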

#### barplot

In [None]:
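title = 'The Dark Knight Rises'
my_idx = title_to_index[title]
mv_info = get_recommend(title, cosine_sim)
mv_title = mv_info.values
# Put the selected movie first so that row 0 holds its similarities
mv_idx = [my_idx] + list(mv_info.index)
cos_sim = cosine_similarity(ftr_mat[mv_idx], ftr_mat[mv_idx])
# Drop the first entry (the movie's similarity to itself)
sim = cos_sim[0][1:]
df = pd.DataFrame(sim, index=mv_title, columns=['similarity'])
df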

In [None]:
sns.barplot(data=df, x='similarity', y=df.index)
plt.title(f'{title}')
plt.show()

=> The movie whose overview is most similar to that of The Dark Knight Rises is The Dark Knight (0.32667).

------------