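### Recommender system
- Build a recommender system with TF-IDF and cosine similarity
- Principle: content-based recommendation, i.e. find the items whose content has the highest similarity to the selected item
    * Find movies whose plot summary (the `overview` column) is similar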
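
To make the principle concrete, here is a minimal sketch in plain Python. It assumes a textbook TF-IDF weighting (term count × log(N / df)); the three toy overviews and the helper names `tfidf_vector` and `cosine` are made up for illustration. scikit-learn's `TfidfVectorizer` applies a smoothed IDF and L2 normalization by default, so its numbers will differ, but the ranking idea is the same.

```python
import math
from collections import Counter

# Hypothetical toy overviews (not taken from movies_metadata.csv)
docs = [
    "a cowboy toy and a space toy become friends",
    "a space ranger toy fights an evil emperor",
    "two old men argue about fishing",
]

tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
n_docs = len(tokenized)

# Document frequency: number of documents that contain each word
df = {w: sum(w in doc for doc in tokenized) for w in vocab}

def tfidf_vector(tokens):
    """Textbook weighting: term count * log(N / df)."""
    counts = Counter(tokens)
    return [counts[w] * math.log(n_docs / df[w]) for w in vocab]

def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

vectors = [tfidf_vector(t) for t in tokenized]
sim_01 = cosine(vectors[0], vectors[1])  # both overviews mention a space toy
sim_02 = cosine(vectors[0], vectors[2])  # no words in common
print(f"doc0 vs doc1: {sim_01:.3f}")
print(f"doc0 vs doc2: {sim_02:.3f}")
```

The first pair shares several words and gets a positive score, while the second pair shares none and scores exactly zero. The recommender works the same way: it ranks every other overview by this score.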

Movie recommendation <hr>

In [2]:
## 1. Prepare the data

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [3]:
datafile = '../datas/movies_metadata.csv'

dataDF1 = pd.read_csv(datafile, low_memory=False)
dataDF1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 24 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   adult                  45466 non-null  object 
 1   belongs_to_collection  4494 non-null   object 
 2   budget                 45466 non-null  object 
 3   genres                 45466 non-null  object 
 4   homepage               7782 non-null   object 
 5   id                     45466 non-null  object 
 6   imdb_id                45449 non-null  object 
 7   original_language      45455 non-null  object 
 8   original_title         45466 non-null  object 
 9   overview               44512 non-null  object 
 10  popularity             45461 non-null  object 
 11  poster_path            45080 non-null  object 
 12  production_companies   45463 non-null  object 
 13  production_countries   45463 non-null  object 
 14  release_date           45379 non-null  object 
 15  re

In [5]:
# Use only the first 10,000 rows; .copy() avoids a SettingWithCopyWarning on later assignment
dataDF2 = dataDF1.head(10000)[['id','title','overview']].copy()
dataDF2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        10000 non-null  object
 1   title     10000 non-null  object
 2   overview  9971 non-null   object
dtypes: object(3)
memory usage: 234.5+ KB


In [6]:
## 2. Preprocess the data

In [8]:
dataDF2['overview'].isnull().sum()

29

In [9]:
dataDF2.loc[:, 'overview'] = dataDF2['overview'].fillna('')
dataDF2['overview'].isnull().sum()

0

3. TF-IDF and cosine similarity <hr>

In [11]:
# TF-IDF: compute a weight for each word in each overview
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(dataDF2['overview'])

# Cosine similarity: compare every row of the matrix with every other row
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

In [12]:
print(f'[TF-IDF matrix shape]: {tfidf_matrix.shape}')
print(f'[Cosine similarity matrix shape]: {cosine_sim.shape}')

[TF-IDF matrix shape]: (10000, 32350)
[Cosine similarity matrix shape]: (10000, 10000)
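
As a side note on the similarity step, `TfidfVectorizer` L2-normalizes each row by default (`norm='l2'`), so the cosine similarity of two rows is simply their dot product. The sketch below checks this on a hypothetical four-overview corpus (not the real dataset) by comparing `cosine_similarity` with `linear_kernel`. It also shows what happens to an empty overview, like the rows filled with `''` earlier.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical toy overviews standing in for dataDF2['overview']
overviews = [
    "A cowboy doll feels threatened by a new spaceman toy.",
    "Two children find a magical board game in the jungle.",
    "A spaceman toy and a cowboy doll get lost in the city.",
    "",  # an empty overview, like the rows filled with ''
]

tfidf_matrix = TfidfVectorizer(stop_words="english").fit_transform(overviews)

# Non-empty rows have unit length; the empty row is all zeros
row_norms = np.linalg.norm(tfidf_matrix.toarray(), axis=1)

cos = cosine_similarity(tfidf_matrix)
dot = linear_kernel(tfidf_matrix)  # plain dot products, no extra normalization

print(row_norms)
print(np.allclose(cos, dot))
print(cos[3, 3])  # similarity of the empty overview with itself
```

On data of this shape the two matrices agree, so `linear_kernel` can stand in for `cosine_similarity` and skip the redundant normalization. The empty overview has similarity 0 with every row, including itself, which matters when the top result of a sorted list is assumed to be the query movie.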


In [13]:
# First 10 rows of the similarity matrix; .loc[:10] is label-inclusive, so 11 titles are shown
cosine_sim[:10], dataDF2.loc[:10, 'title']

(array([[1.        , 0.01682915, 0.        , ..., 0.        , 0.        ,
         0.        ],
        [0.01682915, 1.        , 0.04871976, ..., 0.        , 0.01200997,
         0.        ],
        [0.        , 0.04871976, 1.        , ..., 0.        , 0.00735515,
         0.        ],
        ...,
        [0.        , 0.        , 0.00686749, ..., 0.0193363 , 0.        ,
         0.        ],
        [0.        , 0.10718403, 0.        , ..., 0.        , 0.        ,
         0.        ],
        [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ]]),
 0                       Toy Story
 1                         Jumanji
 2                Grumpier Old Men
 3               Waiting to Exhale
 4     Father of the Bride Part II
 5                            Heat
 6                         Sabrina
 7                    Tom and Huck
 8                    Sudden Death
 9                       GoldenEye
 10         The American President
 Name: title, dtype: obj

In [15]:
# Movie title as input --> look up that movie's index
(dataDF2['title']=='Father of the Bride Part II').argmax()

4

In [17]:
title_to_index = dict(zip(dataDF2['title'], dataDF2.index))

In [18]:
# Look up the index of the chosen movie
select_idx = title_to_index['Father of the Bride Part II']
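
Both lookup styles used here have an edge case worth knowing. `argmax()` on an all-`False` mask returns 0, so a misspelled title silently resolves to the first movie, and `dict(zip(...))` keeps the last index when a title appears more than once. The sketch below demonstrates both on a hypothetical four-title Series; `find_index` is an illustrative helper, not part of the notebook.

```python
import pandas as pd

# Hypothetical titles; 'Sabrina' appears twice on purpose
titles = pd.Series(["Toy Story", "Sabrina", "Heat", "Sabrina"], name="title")

# dict(zip(...)) keeps the LAST index for a repeated title
last_wins = dict(zip(titles, titles.index))

# Dropping duplicates first keeps the FIRST occurrence instead
unique_titles = titles.drop_duplicates(keep="first")
first_wins = dict(zip(unique_titles, unique_titles.index))

# argmax on an all-False mask returns position 0 without any error
missing_pos = (titles == "No Such Movie").argmax()

def find_index(title):
    """Hypothetical helper: fail loudly when the title is unknown."""
    if title not in first_wins:
        raise KeyError(f"unknown title: {title!r}")
    return first_wins[title]

print(last_wins["Sabrina"], first_wins["Sabrina"], missing_pos)
```

Which occurrence to keep for a duplicated title is a design choice; the important part is making that choice explicit rather than relying on dictionary overwrite order.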

In [19]:
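# Similarity of the chosen movie to every movie, as (index, score) pairs
sim_scores = list(enumerate(cosine_sim[select_idx]))

# Sort by similarity, highest first
sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

# Top 10 most similar, skipping position 0 (normally the movie itself)
sim_scores = sim_scores[1:11]
movie_indices = [idx[0] for idx in sim_scores]

# Titles of the 10 most similar movies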
dataDF2['title'].iloc[movie_indices]

6793      Father of the Bride
6571                    Kuffs
6306          North to Alaska
5005                  Wendigo
7097       The Out of Towners
926     It's a Wonderful Life
5571           All Night Long
5749              Another You
1516     George of the Jungle
6813     Journeys with George
Name: title, dtype: object
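
The steps above can be collected into one reusable function. The following is a minimal sketch under stated assumptions: the `movies` frame and its titles are hypothetical toy data standing in for `dataDF2`, and `recommend` is an illustrative name. Rather than assuming the query movie sorts into position 0, it excludes the query index explicitly, which also covers movies with duplicate or empty overviews.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy data standing in for dataDF2
movies = pd.DataFrame({
    "title": ["Toy A", "Jungle Game", "Toy B", "Old Men"],
    "overview": [
        "A cowboy doll feels threatened by a spaceman toy.",
        "Two children discover a magical board game.",
        "A spaceman toy and a cowboy doll get lost.",
        "Two old neighbors argue about fishing.",
    ],
})

tfidf_matrix = TfidfVectorizer(stop_words="english").fit_transform(movies["overview"])
cosine_sim = cosine_similarity(tfidf_matrix)
title_to_index = dict(zip(movies["title"], movies.index))

def recommend(title, top_n=10):
    """Return up to top_n titles most similar to `title` (hypothetical helper)."""
    idx = title_to_index[title]
    scores = cosine_sim[idx].copy()
    scores[idx] = -1.0  # push the query movie below every real score
    top_n = min(top_n, len(scores) - 1)  # never reach the excluded query
    top = np.argsort(scores)[::-1][:top_n]
    return movies["title"].iloc[top]

print(recommend("Toy A", top_n=2))
```

With the notebook's real objects, the same function body should work after replacing `movies` with `dataDF2`, since it relies only on `cosine_sim`, `title_to_index`, and positional `iloc` access.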