### [Recommender System]

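-   Implement a recommender system with TF-IDF and cosine similarity
-   Principle: content-based recommendation, i.e. recommend the items whose content has the highest similarity to the chosen item
    -   Here: find movies whose plot summary (overview) is similar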
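
The idea in the list above can be tried on a few sentences before loading the real data. This is a minimal sketch: the three overviews are invented for illustration and do not come from the dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy overviews, not taken from movies_metadata.csv
toy_overviews = [
    "a cowboy toy and a space ranger toy become friends",
    "two children find a magical board game in the attic",
    "a space ranger toy gets lost far from home",
]

# One TF-IDF row per overview, then all pairwise cosine similarities
toy_tfidf = TfidfVectorizer(stop_words="english").fit_transform(toy_overviews)
toy_sim = cosine_similarity(toy_tfidf)  # shape (3, 3), symmetric

# Rank the other overviews by similarity to overview 0 (highest first)
ranked = [int(i) for i in toy_sim[0].argsort()[::-1] if i != 0]
print(ranked)
```

Overviews 0 and 2 share the terms "toy", "space" and "ranger", so overview 2 ranks ahead of overview 1, which shares no term with overview 0.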


#### Movie Recommendation <hr>

-   [1] Data preparation


In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


In [2]:
### ===> Data preparation
datafile = "../data/movies_metadata.csv"

dataDF1 = pd.read_csv(datafile, low_memory=False)
dataDF1.info()
dataDF1.head(2)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 24 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   adult                  45466 non-null  object 
 1   belongs_to_collection  4494 non-null   object 
 2   budget                 45466 non-null  object 
 3   genres                 45466 non-null  object 
 4   homepage               7782 non-null   object 
 5   id                     45466 non-null  object 
 6   imdb_id                45449 non-null  object 
 7   original_language      45455 non-null  object 
 8   original_title         45466 non-null  object 
 9   overview               44512 non-null  object 
 10  popularity             45461 non-null  object 
 11  poster_path            45080 non-null  object 
 12  production_companies   45463 non-null  object 
 13  production_countries   45463 non-null  object 
 14  release_date           45379 non-null  object 
 15  re

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0


In [3]:
## Use only the first 10,000 rows
dataDF2 = dataDF1.head(10000)[["id", "title", "overview"]]
dataDF2.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        10000 non-null  object
 1   title     10000 non-null  object
 2   overview  9971 non-null   object
dtypes: object(3)
memory usage: 234.5+ KB


-   [2] Data preprocessing <hr>
-   [2-1] Basic preprocessing


In [4]:
### ==> Missing values
dataDF2["overview"].isnull().sum()


29

In [5]:
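# Reassign instead of inplace=True: dataDF2 was sliced from dataDF1, so an in-place drop may raise SettingWithCopyWarning
dataDF2 = dataDF2.dropna(subset=["overview"])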
dataDF2.info()


<class 'pandas.core.frame.DataFrame'>
Index: 9971 entries, 0 to 9999
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        9971 non-null   object
 1   title     9971 non-null   object
 2   overview  9971 non-null   object
dtypes: object(3)
memory usage: 311.6+ KB
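
The `Index: 9971 entries, 0 to 9999` line above shows that `dropna` keeps the original row labels, so labels and row positions no longer coincide. The sketch below reproduces this on a hypothetical four-row frame (the names `toy` and `toy_reset` are illustrative only); `reset_index(drop=True)` is one way to bring the two numberings back in line.

```python
import pandas as pd

# Hypothetical miniature of dataDF2: the row labelled 1 has no overview
toy = pd.DataFrame(
    {"title": ["A", "B", "C", "D"], "overview": ["x", None, "y", "z"]}
)
toy = toy.dropna(subset=["overview"])

print(list(toy.index))        # labels keep their old values, leaving a gap
print(toy.index.get_loc(3))   # row position of the row labelled 3

# Optional: renumber the rows so that labels and positions agree again
toy_reset = toy.reset_index(drop=True)
print(list(toy_reset.index))
```

This matters later because any matrix built from the frame (TF-IDF, cosine similarity) is addressed by row position, not by label.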


-   [3] TF-IDF and cosine_similarity <hr>


In [6]:
### TF-IDF: compute a weight for every term in every overview
tfidf = TfidfVectorizer(stop_words="english")
tfidf_matrix = tfidf.fit_transform(dataDF2["overview"])

### Cosine similarity: compare every row of the matrix with every row (here, the matrix with itself)
cosin_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)


In [7]:
tfidf_matrix.toarray()[0]


array([0., 0., 0., ..., 0., 0., 0.])

In [8]:
print(f"TF-IDF matrix shape : {tfidf_matrix.shape}")
print(f"Cosine similarity matrix shape : {cosin_sim.shape}")


TF-IDF matrix shape : (9971, 32350)
Cosine similarity matrix shape : (9971, 9971)
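
A dense 9971 x 9971 matrix of float64 values takes roughly 0.8 GB. When only one movie is queried at a time, a single row of similarities is enough. The sketch below uses hypothetical documents and relies on `TfidfVectorizer` L2-normalising each row by default, in which case the plain dot product from `linear_kernel` equals the cosine similarity.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical documents standing in for the overviews
docs = [
    "a father plans a wedding",
    "a wedding goes wrong",
    "rebels fight an empire",
    "the father of the bride returns",
]
mat = TfidfVectorizer(stop_words="english").fit_transform(docs)

full = cosine_similarity(mat)             # (n, n) dense matrix
one_row = linear_kernel(mat[1], mat)[0]   # (n,) similarities for document 1 only

print(np.allclose(full[1], one_row))
```

Computing one row per query trades a little time at lookup for not holding the full matrix in memory.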


In [9]:
cosin_sim[:10], dataDF2.loc[:10, "title"]


(array([[1.        , 0.01682702, 0.        , ..., 0.        , 0.        ,
         0.        ],
        [0.01682702, 1.        , 0.04871756, ..., 0.        , 0.01200553,
         0.        ],
        [0.        , 0.04871756, 1.        , ..., 0.        , 0.00734935,
         0.        ],
        ...,
        [0.        , 0.        , 0.00686199, ..., 0.01933384, 0.        ,
         0.        ],
        [0.        , 0.1071591 , 0.        , ..., 0.        , 0.        ,
         0.        ],
        [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ]]),
 0                       Toy Story
 1                         Jumanji
 2                Grumpier Old Men
 3               Waiting to Exhale
 4     Father of the Bride Part II
 5                            Heat
 6                         Sabrina
 7                    Tom and Huck
 8                    Sudden Death
 9                       GoldenEye
 10         The American President
 Name: title, dtype: obj

In [10]:
## Movie title as input ==> extract that movie's row position (argmax returns the first match)
(dataDF2["title"] == "Father of the Bride Part II").argmax()


4

In [11]:
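### ===> Build a title lookup
### movie title : row position
### cosin_sim rows follow row positions in dataDF2, not its index labels (which have gaps after dropna)
title_to_index = dict(zip(dataDF2["title"], range(len(dataDF2))))

### Look up the position of the movie we want
select_idx = title_to_index["Father of the Bride Part II"]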
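
Two details of a title lookup are easy to miss: the similarity matrix is addressed by row position rather than by index label, and `dict(zip(...))` silently keeps the last entry when a title occurs more than once. A minimal sketch on a hypothetical frame (the name `title_to_pos` is illustrative) that maps each title to the position of its first occurrence:

```python
import pandas as pd

# Hypothetical frame whose labels have gaps (as after dropna) and one repeated title
movies = pd.DataFrame({"title": ["A", "C", "D", "C"]}, index=[0, 2, 3, 7])

# Map each title to its row position (0..n-1), keeping the first occurrence
title_to_pos = {}
for pos, title in enumerate(movies["title"]):
    title_to_pos.setdefault(title, pos)

print(title_to_pos["D"])   # row position, which differs from the label 3
print(title_to_pos["C"])   # first occurrence wins
```

Whether the first or the last duplicate should win is a design choice; the point is to make it explicit.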


In [12]:
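# Similarity of the selected movie to every movie, as (position, score) pairs
sim_scores = list(enumerate(cosin_sim[select_idx]))
# print(sim_scores)

# Sort by similarity, highest first
sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
# print(sim_scores)

# The 10 most similar movies (entry 0 is skipped, assuming it is the movie itself)
sim_scores = sim_scores[1:11]
movie_indices = [idx[0] for idx in sim_scores]

# Titles of the 10 most similar movies
# Caution: movie_indices are row positions in dataDF2 (after dropna), while dataDF1.iloc
# counts rows of the unfiltered frame, so rows after a dropped one are offset;
# dataDF1.loc[dataDF2.index[movie_indices]] would avoid the offset
dataDF1[["title", "genres"]].iloc[movie_indices]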


Unnamed: 0,title,genres
6769,The Hired Hand,"[{'id': 37, 'name': 'Western'}]"
6547,Bollywood/Hollywood,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam..."
6282,Ziggy Stardust and the Spiders from Mars,"[{'id': 99, 'name': 'Documentary'}, {'id': 104..."
4984,Cast a Giant Shadow,"[{'id': 10752, 'name': 'War'}, {'id': 28, 'nam..."
7073,No Good Deed,"[{'id': 28, 'name': 'Action'}, {'id': 80, 'nam..."
914,The Mark of Zorro,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam..."
5550,Take Care of My Cat,"[{'id': 18, 'name': 'Drama'}]"
5728,They All Laughed,"[{'id': 18, 'name': 'Drama'}, {'id': 35, 'name..."
1500,When the Cat's Away,"[{'id': 35, 'name': 'Comedy'}]"
6789,In My Skin,"[{'id': 18, 'name': 'Drama'}, {'id': 27, 'name..."
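
The steps above can also be wrapped in one function that keeps positions and rows aligned. This is a sketch under stated assumptions, not the notebook's own method: `recommend` is a hypothetical helper, it resets the index after dropping missing overviews, it uses `linear_kernel` on a single row, and it removes the query movie by position instead of skipping the first sorted entry. The four-row frame is invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel


def recommend(frame, title, top_n=10):
    """Return the top_n rows of `frame` whose overview is most similar to `title`."""
    # Drop missing overviews and renumber so labels equal row positions
    frame = frame.dropna(subset=["overview"]).reset_index(drop=True)
    matrix = TfidfVectorizer(stop_words="english").fit_transform(frame["overview"])

    # Position of the first movie with this title
    pos = int(np.flatnonzero(frame["title"].to_numpy() == title)[0])

    # Rows are L2-normalised by default, so the dot product is the cosine similarity
    scores = linear_kernel(matrix[pos], matrix)[0]
    order = np.argsort(-scores, kind="stable")
    order = order[order != pos][:top_n]   # exclude the query movie itself
    return frame.iloc[order].assign(score=scores[order])


# Hypothetical data, including one row without an overview
toy = pd.DataFrame({
    "title": ["Wedding One", "No Plot", "Wedding Two", "Space War"],
    "overview": [
        "a father plans his daughter's wedding",
        None,
        "the father faces another wedding and a new baby",
        "rebels fight an empire in space",
    ],
})
result = recommend(toy, "Wedding One", top_n=2)
print(result[["title", "score"]])
```

Because the lookup and the result both use the same re-indexed frame, a dropped row cannot shift the titles that are returned.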


In [37]:
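import ast

# literal_eval parses Python literals only, so it is safer than eval on text read from a file
ast.literal_eval(
    dataDF1["genres"]
    .iloc[movie_indices]
    .str.replace(r"[\[\]]", "", regex=True)
    .values[1]
)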


({'id': 35, 'name': 'Comedy'},
 {'id': 18, 'name': 'Drama'},
 {'id': 10402, 'name': 'Music'},
 {'id': 10749, 'name': 'Romance'})
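
If only the genre names are needed, the bracket stripping can be skipped. A small sketch, assuming each `genres` value is the string form of a list of dictionaries, as the outputs above suggest; the sample value is assembled from the output just shown.

```python
import ast

# One value in the style of the `genres` column
raw = (
    "[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}, "
    "{'id': 10402, 'name': 'Music'}, {'id': 10749, 'name': 'Romance'}]"
)

# Parse the string into a list of dicts, then keep only the names
genre_names = [genre["name"] for genre in ast.literal_eval(raw)]
print(genre_names)
```

Applied to a whole column, the same expression could be passed to `Series.apply`.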