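### [Recommender System]
- Implement a recommender system with TF-IDF and cosine similarity
- Principle: content-based recommendation, i.e. find the items with the highest similarity ==> recommend movies whose plot summary (overview) is similar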
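
As a minimal sketch of this principle, the toy corpus below (three made-up sentences, not taken from the dataset) shows that texts sharing vocabulary get a higher cosine similarity than unrelated ones. The variable names `toy_docs`, `toy_matrix`, and `toy_sim` are illustrative only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical mini corpus, used only for illustration
toy_docs = [
    "a cowboy toy and a space ranger toy compete",
    "a space ranger toy gets lost",
    "two lawyers argue in a courtroom",
]

# Each document becomes a row of TF-IDF weights
toy_matrix = TfidfVectorizer(stop_words="english").fit_transform(toy_docs)

# Pairwise cosine similarity between all rows
toy_sim = cosine_similarity(toy_matrix)
print(toy_sim.round(2))
```

The first two documents share the terms "toy", "space", and "ranger", so their similarity is positive, while the third shares no terms with either and scores zero against both.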

#### Movie Recommendation <hr>

In [2]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [4]:
# Prepare the data
data_file = '../DATA/movies_metadata.csv'

dataDF1 = pd.read_csv(data_file, low_memory = False)
dataDF1.info()
dataDF1.head(n = 2)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 24 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   adult                  45466 non-null  object 
 1   belongs_to_collection  4494 non-null   object 
 2   budget                 45466 non-null  object 
 3   genres                 45466 non-null  object 
 4   homepage               7782 non-null   object 
 5   id                     45466 non-null  object 
 6   imdb_id                45449 non-null  object 
 7   original_language      45455 non-null  object 
 8   original_title         45466 non-null  object 
 9   overview               44512 non-null  object 
 10  popularity             45461 non-null  object 
 11  poster_path            45080 non-null  object 
 12  production_companies   45463 non-null  object 
 13  production_countries   45463 non-null  object 
 14  release_date           45379 non-null  object 
 15  re

| | adult | belongs_to_collection | budget | genres | homepage | id | imdb_id | original_language | original_title | overview | ... | release_date | revenue | runtime | spoken_languages | status | tagline | title | video | vote_average | vote_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | {'id': 10194, 'name': 'Toy Story Collection', ... | 30000000 | [{'id': 16, 'name': 'Animation'}, {'id': 35, '... | http://toystory.disney.com/toy-story | 862 | tt0114709 | en | Toy Story | Led by Woody, Andy's toys live happily in his ... | ... | 1995-10-30 | 373554033.0 | 81.0 | [{'iso_639_1': 'en', 'name': 'English'}] | Released | NaN | Toy Story | False | 7.7 | 5415.0 |
| 1 | False | NaN | 65000000 | [{'id': 12, 'name': 'Adventure'}, {'id': 14, '... | NaN | 8844 | tt0113497 | en | Jumanji | When siblings Judy and Peter discover an encha... | ... | 1995-12-15 | 262797249.0 | 104.0 | [{'iso_639_1': 'en', 'name': 'English'}, {'iso... | Released | Roll the dice and unleash the excitement! | Jumanji | False | 6.9 | 2413.0 |


In [5]:
# Use only the first 10,000 rows; .copy() avoids SettingWithCopyWarning on later assignments
dataDF2 = dataDF1.head(n = 10000)[['id', 'title', 'overview']].copy()
dataDF2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        10000 non-null  object
 1   title     10000 non-null  object
 2   overview  9971 non-null   object
dtypes: object(3)
memory usage: 234.5+ KB


[2] Data preprocessing <hr>

[2-1] Basic data preprocessing

In [6]:
# Count missing values
dataDF2['overview'].isnull().sum()

29

In [7]:
dataDF2.loc[:, 'overview'] = dataDF2['overview'].fillna('')
dataDF2['overview'].isnull().sum()

0
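
The sketch below illustrates one consequence of filling missing overviews with an empty string: such a row has no TF-IDF weights at all, so its cosine similarity with every movie (including itself) is 0. The data is hypothetical and the names `overviews`, `matrix`, and `sim` are illustrative.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical overviews; the middle one is missing
overviews = pd.Series([
    "a toy cowboy meets a space ranger",
    None,
    "a space ranger saves the galaxy",
])
overviews = overviews.fillna("")

matrix = TfidfVectorizer(stop_words="english").fit_transform(overviews)
sim = cosine_similarity(matrix)

# Number of non-zero TF-IDF weights in the empty overview
empty_row_nnz = matrix[1].nnz
print(empty_row_nnz)
print(sim[1])
```

A movie without an overview therefore cannot receive meaningful recommendations from this method, and it might be reasonable to drop such rows instead of filling them.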

[3] TF-IDF and cosine_similarity

In [20]:
# TF-IDF: compute a weight for every term in every overview
tfidf = TfidfVectorizer(stop_words = 'english')
tfidf_matrix = tfidf.fit_transform(dataDF2['overview'])

# Cosine similarity: compare every row of the matrix with every other row
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

In [21]:
tfidf_matrix.toarray()[0]

array([0., 0., 0., ..., 0., 0., 0.])

In [22]:
print(f'Shape of the TF-IDF matrix : {tfidf_matrix.shape}')
print(f'Shape of the cosine similarity matrix : {cosine_sim.shape}')

Shape of the TF-IDF matrix : (10000, 32350)
Shape of the cosine similarity matrix : (10000, 10000)
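
Two side notes on these shapes, sketched on a hypothetical toy corpus. First, `TfidfVectorizer` L2-normalizes each row by default, so a plain dot product (`linear_kernel`) should give the same values as `cosine_similarity`. Second, a dense 10000 x 10000 matrix of 8-byte floats takes roughly 800 MB, which is one practical reason to limit the row count.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical mini corpus, used only for illustration
toy_docs = [
    "a cowboy toy and a space ranger toy compete",
    "a space ranger toy gets lost",
    "two lawyers argue in a courtroom",
]
toy_matrix = TfidfVectorizer(stop_words="english").fit_transform(toy_docs)

# With unit-length rows, the dot product equals the cosine similarity
same = np.allclose(cosine_similarity(toy_matrix), linear_kernel(toy_matrix, toy_matrix))
print(same)

# Memory needed for a dense float64 similarity matrix of 10,000 movies
n_movies = 10000
dense_bytes = n_movies * n_movies * 8
print(dense_bytes / 1e6, "MB")
```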


In [23]:
cosine_sim[:10], dataDF2.loc[:10, 'title']

(array([[1.        , 0.01682915, 0.        , ..., 0.        , 0.        ,
         0.        ],
        [0.01682915, 1.        , 0.04871976, ..., 0.        , 0.01200997,
         0.        ],
        [0.        , 0.04871976, 1.        , ..., 0.        , 0.00735515,
         0.        ],
        ...,
        [0.        , 0.        , 0.00686749, ..., 0.0193363 , 0.        ,
         0.        ],
        [0.        , 0.10718403, 0.        , ..., 0.        , 0.        ,
         0.        ],
        [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ]]),
 0                       Toy Story
 1                         Jumanji
 2                Grumpier Old Men
 3               Waiting to Exhale
 4     Father of the Bride Part II
 5                            Heat
 6                         Sabrina
 7                    Tom and Huck
 8                    Sudden Death
 9                       GoldenEye
 10         The American President
 Name: title, dtype: object)

In [24]:
# Movie title in ==> index of that movie out
(dataDF2['title'] == 'Father of the Bride Part II').argmax()

4

In [25]:
# Map each movie title to its index
title_to_index = dict(zip(dataDF2['title'], dataDF2.index))

# Look up the index of the chosen movie
select_idx = title_to_index['Father of the Bride Part II']
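
Both lookup approaches have an edge case worth knowing, sketched below on hypothetical data. `argmax` on an all-False mask returns 0, which looks like a valid hit, and `dict(zip(...))` silently keeps the last index when a title is repeated. Whether the real dataset contains repeated titles is not checked here. One possible alternative is a `Series` that keeps the first occurrence explicitly.

```python
import pandas as pd

# Hypothetical titles with one repeat
toy = pd.DataFrame({"title": ["Heat", "Sabrina", "Heat"]})

# argmax on an all-False mask returns position 0 even though nothing matched
missing_hit = (toy["title"] == "No Such Movie").argmax()

# dict(zip(...)) keeps the last index of a repeated title
last_wins = dict(zip(toy["title"], toy.index))["Heat"]

# Alternative: a title-indexed Series that keeps the first occurrence
title_to_index = pd.Series(toy.index, index=toy["title"])
title_to_index = title_to_index[~title_to_index.index.duplicated(keep="first")]

first_heat = title_to_index["Heat"]
has_missing = "No Such Movie" in title_to_index.index

print(missing_hit, last_wins, first_heat, has_missing)
```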

In [28]:
# Similarity of every movie to the selected one, as (index, score) pairs
sim_scores = list(enumerate(cosine_sim[select_idx]))
print(sim_scores)

# Sort by similarity, highest first
sim_scores = sorted(sim_scores, key = lambda x : x[1], reverse = True)
print(sim_scores)

# Top 10 most similar => skip the selected movie itself (first entry), keep the next 10
sim_scores = sim_scores[1 : 11]
movie_indices = [idx[0] for idx in sim_scores]

# Titles of the 10 most similar movies
dataDF2['title'].iloc[movie_indices]

[(0, 0.0), (1, 0.0), (2, 0.025780216078509725), (3, 0.0), (4, 1.0000000000000002), (5, 0.0), (6, 0.034216525297532024), (7, 0.0), (8, 0.03283297047979882), (9, 0.0), (10, 0.0), (11, 0.0), (12, 0.0), (13, 0.0), (14, 0.0), (15, 0.0), (16, 0.016322383439969928), (17, 0.0), (18, 0.023801243803199312), (19, 0.0), (20, 0.0), (21, 0.0), (22, 0.0), (23, 0.0), (24, 0.0), (25, 0.0), (26, 0.0), (27, 0.010339242454258954), (28, 0.0), (29, 0.0), (30, 0.0), (31, 0.0), (32, 0.0), (33, 0.013070376200862292), (34, 0.0), (35, 0.0), (36, 0.0090080051154292), (37, 0.0), (38, 0.0), (39, 0.0), (40, 0.0), (41, 0.013340149835643038), (42, 0.005233911281801586), (43, 0.0), (44, 0.0), (45, 0.0), (46, 0.0), (47, 0.0270376466142291), (48, 0.0), (49, 0.0), (50, 0.030661836164739956), (51, 0.00897570613787078), (52, 0.0), (53, 0.0), (54, 0.0), (55, 0.0), (56, 0.017905163961910473), (57, 0.0), (58, 0.019138874257217973), (59, 0.0), (60, 0.016226458465929634), (61, 0.013727520026312277), (62, 0.0), (63, 0.01313878820

6793      Father of the Bride
6571                    Kuffs
6306          North to Alaska
5005                  Wendigo
7097       The Out of Towners
926     It's a Wonderful Life
5571           All Night Long
5749              Another You
1516     George of the Jungle
6813     Journeys with George
Name: title, dtype: object

In [27]:
sim_scores

[(6793, 0.3076138818369588),
 (6571, 0.28382906197662683),
 (6306, 0.2791645153540653),
 (5005, 0.24789093506699436),
 (7097, 0.23210137181082088),
 (926, 0.22047982740851552),
 (5571, 0.20212741579046697),
 (5749, 0.18171170753554033),
 (1516, 0.18052252436838906),
 (6813, 0.18014815775256995)]
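
The steps above could be wrapped into a single helper, sketched below. The function name `recommend`, its parameters, and the toy data are hypothetical. Unlike the `[1 : 11]` slice, this version removes the query movie by index, so it does not rely on the query always sorting first.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def recommend(title, titles, sim_matrix, top_n=10):
    """Return the top_n titles most similar to `title` (hypothetical helper)."""
    matches = np.flatnonzero(titles.to_numpy() == title)
    if matches.size == 0:
        raise KeyError(f"title not found: {title!r}")
    idx = matches[0]                      # first occurrence of the title
    scores = sim_matrix[idx]
    order = np.argsort(-scores, kind="stable")  # highest similarity first
    order = order[order != idx][:top_n]   # drop the query movie itself
    return titles.iloc[order]

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({
    "title": ["Wedding Day", "Wedding Plans", "Bank Job", "Treasure Hunt"],
    "overview": [
        "a father prepares for the wedding of his daughter",
        "a father learns his daughter is getting married after the wedding plans",
        "a detective hunts a gang of bank robbers",
        "pirates search for buried treasure",
    ],
})
toy_sim = cosine_similarity(
    TfidfVectorizer(stop_words="english").fit_transform(toy["overview"])
)

result = recommend("Wedding Day", toy["title"], toy_sim, top_n=2)
print(result)
```

With the notebook's data, the equivalent call would be `recommend('Father of the Bride Part II', dataDF2['title'], cosine_sim)`.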