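### [ Recommendation System ]
- Build a recommendation system with TF-IDF and cosine similarity
- Principle: content-based recommendation, i.e. find the items most similar to a given item
    * Find movies whose plot summary (overview) is similar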
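
The principle above can be sketched on a tiny corpus before touching the real data. The three overviews below are made up for illustration; the idea is only that texts sharing more weighted terms end up with a higher cosine similarity.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical mini-corpus standing in for movie overviews
overviews = [
    "a boy and his toys go on an adventure",
    "toys come alive and go on a big adventure",
    "a detective hunts a bank robber in the city",
]

# One TF-IDF row per overview, one column per term
toy_matrix = TfidfVectorizer().fit_transform(overviews)

# Pairwise cosine similarity between all rows
toy_sim = cosine_similarity(toy_matrix)

# The two toy-themed overviews share terms; the crime overview shares none with the first
print(toy_sim.round(2))
```

Row `i`, column `j` of `toy_sim` is the similarity between overview `i` and overview `j`, so the diagonal is each text compared with itself.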

Movie Recommendation <hr>

- [1] Data preparation

In [4]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [5]:
datafile = '../data/movies_metadata.csv'

dataDF1 = pd.read_csv(datafile, low_memory=False)
dataDF1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 24 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   adult                  45466 non-null  object 
 1   belongs_to_collection  4494 non-null   object 
 2   budget                 45466 non-null  object 
 3   genres                 45466 non-null  object 
 4   homepage               7782 non-null   object 
 5   id                     45466 non-null  object 
 6   imdb_id                45449 non-null  object 
 7   original_language      45455 non-null  object 
 8   original_title         45466 non-null  object 
 9   overview               44512 non-null  object 
 10  popularity             45461 non-null  object 
 11  poster_path            45080 non-null  object 
 12  production_companies   45463 non-null  object 
 13  production_countries   45463 non-null  object 
 14  release_date           45379 non-null  object 
 15  re

In [16]:
# Use only the first 10,000 rows; .copy() gives an independent frame rather than a view of dataDF1
dataDF2 = dataDF1.head(10000)[['id','title','overview']].copy()
dataDF2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        10000 non-null  object
 1   title     10000 non-null  object
 2   overview  9971 non-null   object
dtypes: object(3)
memory usage: 234.5+ KB


In [17]:
dataDF2['overview'].isnull().sum()

29

In [19]:
# Drop rows with a missing overview; reassign instead of mutating a slice in place
dataDF2 = dataDF2.dropna(subset=['overview'])
dataDF2['overview'].isnull().sum()

0
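
Dropping rows is one option. A possible alternative, sketched here on a hypothetical three-row frame named `demo`, is to replace missing overviews with empty strings so that every row is kept and row positions stay aligned with the original frame. Such a row would get an all-zero TF-IDF vector and therefore a similarity of 0 to every other movie.

```python
import pandas as pd

# Hypothetical frame standing in for dataDF2
demo = pd.DataFrame({
    "title": ["A", "B", "C"],
    "overview": ["some plot", None, "another plot"],
})

# Keep every row; a missing overview becomes an empty document
filled = demo.copy()
filled["overview"] = filled["overview"].fillna("")

print(len(filled), filled["overview"].isnull().sum())
```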

[3] TF-IDF and cosine_similarity <hr>

In [9]:
# TF-IDF: compute a weight for every term in every overview
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(dataDF2['overview'])

# Cosine similarity: compare every row of the matrix with every row of the same matrix
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

In [14]:
tfidf_matrix.toarray()[0]

array([0., 0., 0., ..., 0., 0., 0.])

In [15]:
# Shape of the TF-IDF matrix / shape of the cosine similarity result
tfidf_matrix.shape, cosine_sim.shape

((10000, 32644), (10000, 10000))
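
The similarity matrix is dense, so its size grows with the square of the number of movies. The sketch below, on made-up documents, estimates that size and also checks a common shortcut: because `TfidfVectorizer` L2-normalizes each row by default (`norm='l2'`), a plain dot product via `linear_kernel` should give the same values as `cosine_similarity`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical documents
docs = ["toys on an adventure", "a jungle board game", "old men and a feud"]
m = TfidfVectorizer().fit_transform(docs)  # rows are unit length by default

sim_cos = cosine_similarity(m)
sim_dot = linear_kernel(m, m)  # dot product only, no extra normalization
print(np.allclose(sim_cos, sim_dot))

# Bytes needed for a dense float64 matrix of n x n similarities
n = 10000
size_bytes = n * n * 8
print(size_bytes / 1e6, "MB")
```

For 10,000 movies this comes to roughly 800 MB, which is one reason the notebook limits itself to a subset of the data.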

In [10]:
# Similarity scores between each movie and every other movie
cosine_sim[:10], dataDF2.loc[:10, 'title']

(array([[1.        , 0.03267362, 0.01329921, ..., 0.0235086 , 0.02621655,
         0.00774935],
        [0.03267362, 1.        , 0.05808681, ..., 0.03867369, 0.03732402,
         0.04147409],
        [0.01329921, 0.05808681, 1.        , ..., 0.01755207, 0.03922927,
         0.00790429],
        ...,
        [0.00885688, 0.01758338, 0.02996405, ..., 0.04767291, 0.02163666,
         0.01004134],
        [0.01774436, 0.11065301, 0.01584074, ..., 0.02460528, 0.02839019,
         0.0085201 ],
        [0.01415605, 0.02124652, 0.01905343, ..., 0.01593843, 0.05205464,
         0.01860393]]),
 0                       Toy Story
 1                         Jumanji
 2                Grumpier Old Men
 3               Waiting to Exhale
 4     Father of the Bride Part II
 5                            Heat
 6                         Sabrina
 7                    Tom and Huck
 8                    Sudden Death
 9                       GoldenEye
 10         The American President
 Name: title, dtype: obj

In [12]:
## Movie title in => row position of that movie out
(dataDF2['title'] == 'Father of the Bride Part II').argmax()  # position of the movie itself

4

In [13]:
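# Map each movie title to its row position in dataDF2 (the same position as in cosine_sim)
title_to_index = dict(zip(dataDF2['title'], range(len(dataDF2))))

# Look up the position of the movie we want
select_idx = title_to_index['Father of the Bride Part II']
# A dict lookup: passing the title alone returns the position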
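
One caveat with a title-to-position dict is repeated titles: a dict built with `zip` silently keeps the last occurrence. The sketch below uses a hypothetical frame `demo` to show that behavior next to a `pd.Series` variant that keeps the first occurrence instead. Which one is preferable depends on the data; this is only an illustration.

```python
import pandas as pd

# Hypothetical frame with a repeated title
demo = pd.DataFrame({"title": ["Heat", "Sabrina", "Heat"]})

# Title -> row position, keeping only the first occurrence of each title
positions = pd.Series(range(len(demo)), index=demo["title"])
first_position = positions[~positions.index.duplicated(keep="first")]

# A plain dict keeps the last occurrence of a repeated key
as_dict = dict(zip(demo["title"], range(len(demo))))

print(first_position["Heat"], as_dict["Heat"])
```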

In [25]:
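# Similarity of the selected movie to every movie,
# as (movie position, similarity) pairs
sim_scores = list(enumerate(cosine_sim[select_idx]))
print(sim_scores)

# Sort by similarity, highest first
sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
print(sim_scores)

# Positions of the 10 most similar movies (entry 0 is the movie itself, so skip it)
sim_scores = sim_scores[1:11]
movie_indices = [idx[0] for idx in sim_scores]
print(movie_indices)

# Titles of the 10 most similar movies. The positions refer to rows of dataDF2,
# so map them back to dataDF1 labels. genres is still a raw string; only 'name' is needed
dataDF1.loc[dataDF2.index[movie_indices], ['title', 'genres']]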

[(0, 0.03119800623767369), (1, 0.03804439652571363), (2, 0.057699884569226985), (3, 0.02839007385287428), (4, 1.0000000000000002), (5, 0.021895620970223222), (6, 0.04805314175428012), (7, 0.022352208074922553), (8, 0.04438184426657858), (9, 0.02721291840762366), (10, 0.04145545330282145), (11, 0.03952379859891834), (12, 0.016146216710918735), (13, 0.047551194301208514), (14, 0.030083448871842805), (15, 0.01729707163718076), (16, 0.033090877293772775), (17, 0.0382973272594593), (18, 0.06347786602044246), (19, 0.013758477641952004), (20, 0.05096619500206895), (21, 0.01567734880227021), (22, 0.031419912523511645), (23, 0.04571746831099373), (24, 0.027218780230518717), (25, 0.04374771047839839), (26, 0.024989335786980822), (27, 0.03265241651516776), (28, 0.048389185799026055), (29, 0.01612060043790749), (30, 0.038466714195725504), (31, 0.05854888071003641), (32, 0.0), (33, 0.0991001448754636), (34, 0.0517836452928998), (35, 0.031061072683077805), (36, 0.05247165638018701), (37, 0.016833155

| | title | genres |
|---|---|---|
| 6793 | Father of the Bride | [{'id': 35, 'name': 'Comedy'}, {'id': 10749, '... |
| 6571 | Kuffs | [{'id': 28, 'name': 'Action'}, {'id': 35, 'nam... |
| 5005 | Wendigo | [{'id': 27, 'name': 'Horror'}] |
| 6306 | North to Alaska | [{'id': 37, 'name': 'Western'}] |
| 926 | It's a Wonderful Life | [{'id': 18, 'name': 'Drama'}, {'id': 10751, 'n... |
| 5571 | All Night Long | [{'id': 35, 'name': 'Comedy'}, {'id': 10749, '... |
| 4112 | Blow | [{'id': 80, 'name': 'Crime'}, {'id': 18, 'name... |
| 7097 | The Out of Towners | [{'id': 35, 'name': 'Comedy'}] |
| 1516 | George of the Jungle | [{'id': 12, 'name': 'Adventure'}, {'id': 35, '... |
| 5749 | Another You | [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam... |
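
The lookup, sort, and top-10 steps can be wrapped in one function, and the `genres` strings can be reduced to plain names. The following is a minimal sketch rather than the notebook's own code: `recommend`, `genre_names`, and the three-row `movies` frame are hypothetical, and the function assumes row `i` of the frame corresponds to row `i` of the similarity matrix.

```python
import ast

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def genre_names(raw):
    """Parse a string like "[{'id': 35, 'name': 'Comedy'}]" into ['Comedy']."""
    return [g["name"] for g in ast.literal_eval(raw)]


def recommend(title, frame, sim, top_n=10):
    """Return the top_n rows of `frame` most similar to `title`.

    Assumes row i of `frame` corresponds to row i of `sim`.
    """
    matches = (frame["title"] == title).to_numpy().nonzero()[0]
    if len(matches) == 0:
        raise KeyError(title)
    pos = matches[0]  # first row with that title

    # Positions sorted by similarity, highest first, without the movie itself
    order = [i for i in sim[pos].argsort()[::-1] if i != pos][:top_n]

    result = frame.iloc[order][["title", "genres"]].copy()
    result["genres"] = result["genres"].apply(genre_names)
    result["score"] = sim[pos][order]
    return result


# Hypothetical data in the same layout as the movies metadata
movies = pd.DataFrame({
    "title": ["Wedding Day", "Wedding Night", "Space War"],
    "overview": [
        "a father plans his daughter's wedding",
        "a daughter plans a surprise wedding for her father",
        "rebels fight an empire among the stars",
    ],
    "genres": [
        "[{'id': 35, 'name': 'Comedy'}]",
        "[{'id': 35, 'name': 'Comedy'}, {'id': 10749, 'name': 'Romance'}]",
        "[{'id': 878, 'name': 'Science Fiction'}]",
    ],
})
sim = cosine_similarity(TfidfVectorizer().fit_transform(movies["overview"]))

top = recommend("Wedding Day", movies, sim, top_n=1)
print(top)
```

`ast.literal_eval` is used because the `genres` column stores Python-style literals with single quotes, which `json.loads` would reject.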
