# Recommender Systems

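- Content-based
 - Recommends based on user and item information and expert knowledge
 - Not used much these days
- Collaborative filtering based
 - Uses only purchase behavior data (no human intervention); used by Amazon, YouTube, and Netflix
  > User-based, item-based
 - Memory-based (kNN)
 - Latent-vector based (SVD, a matrix factorization method; reduces memory use)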
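
As a rough illustration of the latent-vector idea, the sketch below factorizes a small made-up rating matrix with a plain truncated SVD from NumPy and keeps only `k` factors per user and per item. The matrix values, `k = 2`, and the variable names are assumptions for this example. Recommender libraries typically fit a regularized factorization on observed ratings only, which this sketch does not attempt.

```python
import numpy as np

# Hypothetical 4-user x 5-item rating matrix (values invented for illustration).
R = np.array([
    [5.0, 4.0, 1.0, 1.0, 2.0],
    [4.0, 5.0, 1.0, 2.0, 1.0],
    [1.0, 1.0, 5.0, 4.0, 5.0],
    [2.0, 1.0, 4.0, 5.0, 4.0],
])

k = 2  # number of latent factors to keep (an arbitrary choice here)
U, s, Vt = np.linalg.svd(R, full_matrices=False)

user_factors = U[:, :k] * s[:k]   # shape (4, k): one latent vector per user
item_factors = Vt[:k, :]          # shape (k, 5): one latent vector per item
R_approx = user_factors @ item_factors  # low-rank reconstruction of R

print(np.round(R_approx, 1))
print("stored values:", user_factors.size + item_factors.size, "instead of", R.size)
```

On a matrix this small the saving is negligible, but the factor matrices grow with (users + items) rather than (users × items), which is where the memory reduction comes from on larger data.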

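## Collaborative Filtering (CF)

- Makes recommendations by drawing (collaboratively) on the behavior history, such as purchases, of the user and of other people
 - First find people who are similar to me
 - Then predict my rating for new items from their past ratings, and recommend the items with the highest predicted scores
- Types of behavior (ratings, reactions)
 - explicit (the user gives a rating)
 - implicit (inferred from searches, adding to cart, watch time, etc.)
- CF does not use user attributes such as age, gender, or region at all
 - A hybrid model can use this information to make the final recommendation (as done in practice)
- UBCF (User-Based CF): recommends items based on users with similar purchase patterns
- IBCF (Item-Based CF): recommends items that are related to a given item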
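
The two steps above (find similar people, then predict from their ratings) can be sketched on a toy matrix as follows. This is only a minimal user-based example under stated assumptions: the ratings are invented, `0` stands for "not rated", `k = 2` is arbitrary, and every neighbor happens to have rated every item, so missing neighbor ratings are not handled.

```python
import numpy as np

# Hypothetical ratings: rows = users, columns = items, 0 = not rated yet.
ratings = np.array([
    [5.0, 4.0, 0.0, 0.0],   # target user (index 0)
    [5.0, 5.0, 4.0, 1.0],   # similar taste
    [4.0, 4.0, 5.0, 2.0],   # similar taste
    [1.0, 1.0, 2.0, 5.0],   # different taste
])

def cosine_sim(u, v):
    """Cosine similarity between two rating vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

target = 0
k = 2  # number of neighbors (arbitrary for this sketch)

# Step 1: find the k users most similar to the target user.
sims = np.array([cosine_sim(ratings[target], ratings[u]) for u in range(len(ratings))])
sims[target] = -np.inf                     # never pick the target as its own neighbor
neighbors = np.argsort(sims)[::-1][:k]

# Step 2: predict scores for unseen items as a similarity-weighted average.
unseen = np.where(ratings[target] == 0)[0]
weights = sims[neighbors]
predicted = {int(i): float(weights @ ratings[neighbors, i] / weights.sum()) for i in unseen}

# Step 3: recommend the unseen item with the highest predicted score.
best_item = max(predicted, key=predicted.get)
print(neighbors, predicted, best_item)
```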

## Reference blogs
- (Concepts) https://lsjsj92.tistory.com/563
- (Code) https://lsjsj92.tistory.com/568
- (Data) Kaggle's **The Movies Dataset** (https://www.kaggle.com/rounakbanik/the-movies-dataset)
- Sample [datasets](https://github.com/caserec/Datasets-for-Recommender-Systems) used for recommender models

# import

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

from sklearn.metrics.pairwise import cosine_similarity

## Building the user-item-score table

- User rating table: ratings
- Movie information table: movies
- Uses pivot_table
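
Before running this on the real data, it may help to see the same merge-then-pivot idea on a tiny made-up example. The frames `toy_ratings` and `toy_movies` and their values are hypothetical; only the column names mirror the real tables.

```python
import pandas as pd

# Hypothetical long-format ratings (one row per user-movie pair).
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 3, 3],
    "movieId": [10, 20, 10, 20, 30],
    "rating":  [4.0, 3.0, 5.0, 2.5, 4.5],
})
toy_movies = pd.DataFrame({
    "movieId": [10, 20, 30],
    "title":   ["Movie A", "Movie B", "Movie C"],
})

# Attach titles, then spread the ratings into a user x movie grid.
toy = pd.merge(toy_ratings, toy_movies, on="movieId")
toy_matrix = toy.pivot_table(values="rating", index="userId", columns="title")
print(toy_matrix)
```

Cells for user-movie pairs without a rating come out as NaN. Note that `pivot_table` aggregates with the mean by default, so duplicate user-title pairs would be averaged.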

In [2]:
ratings = pd.read_csv('https://raw.githubusercontent.com/StillWork/data/master/ratings.csv')
movies = pd.read_csv('https://raw.githubusercontent.com/StillWork/data/master/movies.csv')

In [3]:
print(ratings.shape)
ratings[:3]

(100004, 4)


,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182


In [4]:
# Number of ratings per rater

ratings.userId.value_counts()

547    2391
564    1868
624    1735
15     1700
73     1610
       ... 
444      20
438      20
583      20
249      20
399      20
Name: userId, Length: 671, dtype: int64

In [5]:
print(movies.shape)
movies[:3]

(9125, 3)


,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance


In [7]:
# Add the movie title information to the ratings table
# by merging the two tables on movieId

ratings_movies = pd.merge(ratings, movies, on = 'movieId')
print(ratings_movies.shape)
ratings_movies[:3]

(100004, 6)


,userId,movieId,rating,timestamp,title,genres
0,1,31,2.5,1260759144,Dangerous Minds (1995),Drama
1,7,31,3.0,851868750,Dangerous Minds (1995),Drama
2,31,31,4.0,1273541953,Dangerous Minds (1995),Drama


## Building the user-item-score table with a pivot table

- Rows are users, columns are movies
- Item-based collaborative filtering looks for similar movies

In [10]:
user_movie_rating = ratings_movies.pivot_table(values='rating', index='userId', columns='title')
print(user_movie_rating.shape)
user_movie_rating

(671, 9064)


title,"""Great Performances"" Cats (1998)",$9.99 (2008),'Hellboy': The Seeds of Creation (2004),'Neath the Arizona Skies (1934),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),...,Zulu (1964),Zulu (2013),[REC] (2007),eXistenZ (1999),loudQUIETloud: A Film About the Pixies (2006),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931),İtirazım Var (2014)
userId,,,,,,,,,,,,,,,,,,,,,
1,,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
667,,,,,,,,,,,...,,,,,,,,,,
668,,,,,,,,,,,...,,,,,,,,,,
669,,,,,,,,,,,...,,,,,,,,,,
670,,,,,,,,,,,...,,,,,,,,,,


In [12]:
# Build the item-user rating table (rows: movies, columns: users)
movie_user_rating = ratings_movies.pivot_table(values='rating', index='title', columns='userId')
# Alternative: movie_user_rating = user_movie_rating.T

print(movie_user_rating.shape)
movie_user_rating[:5]

(9064, 671)


userId,1,2,3,4,5,6,7,8,9,10,...,662,663,664,665,666,667,668,669,670,671
title,,,,,,,,,,,,,,,,,,,,,
"""Great Performances"" Cats (1998)",,,,,,,,,,,...,,,,,,,,,,
$9.99 (2008),,,,,,,,,,,...,,,,,,,,,,
'Hellboy': The Seeds of Creation (2004),,,,,,,,,,,...,,,,,,,,,,
'Neath the Arizona Skies (1934),,,,,,,,,,,...,,,,,,,,,,
'Round Midnight (1986),,,,,,,,,,,...,,,,,,,,,,


In [13]:
movie_user_rating.fillna(0, inplace = True)
print(movie_user_rating.shape)
movie_user_rating.head(3)

(9064, 671)


userId,1,2,3,4,5,6,7,8,9,10,...,662,663,664,665,666,667,668,669,670,671
title,,,,,,,,,,,,,,,,,,,,,
"""Great Performances"" Cats (1998)",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
$9.99 (2008),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
'Hellboy': The Seeds of Creation (2004),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


- Rows are movies, columns are users
- Item-based collaborative filtering looks for similar movies
- Similar movies are found with cosine similarity (computed row by row, i.e. between movies)

# Similarity

- Example: ratings from 4 people for 2 movies

<img src="https://files.realpython.com/media/euclidean-distance.74e8c9d0be22.jpg" width="500"  align="left">


In [37]:
import pandas as pd
from scipy import spatial
from surprise import Dataset, accuracy
from surprise import Reader
from surprise import KNNWithMeans,SVD

In [2]:
# Similarity between the 4 raters measured by Euclidean distance
# Check who c is closest to


# Each rater's scores for movies 1 and 2
a = [1, 2]
b = [2, 4]
c = [2.5, 4]
d = [4.5, 5]

print(spatial.distance.euclidean(c, a))
print(spatial.distance.euclidean(c, b))
print(spatial.distance.euclidean(c, d))

2.5
0.5
2.23606797749979


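- c is closest to b
- Between a and d, which one is c closer to?
 - By straight-line distance, as the result above shows, c is closer to d (2.236 vs 2.5)
 - But in terms of how they rate movie 1 relative to movie 2, c can be seen as closer to a (d gives the two movies almost the same score)
 - In cases like this, cosine similarity should be used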
 

<img src="https://files.realpython.com/media/cosine-similarity.76bcd5413eb8.jpg" width="400"  align="center">

## Cosine similarity

<img src="https://www.tyrrell4innovation.ca/wp-content/uploads/2021/06/rsz_jenny_du_miword.png" width="450"  align="center">

In [4]:
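# spatial.distance.cosine returns the cosine distance (1 - cosine similarity),
# so a smaller value means the two raters are more similar
print(spatial.distance.cosine(c, a).round(4))
print(spatial.distance.cosine(c, b).round(4))
print(spatial.distance.cosine(c, d).round(4))
print(spatial.distance.cosine(a, b).round(4))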

0.0045
0.0045
0.0151
0.0


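- Find the k people most similar by cosine similarity, then average their ratings for a given item (movie)
 - Use a similarity-weighted average (give more weight to people who are closer)
- Raters are biased, some generous and some harsh, so each rater's mean rating has to be taken into account (shift the ratings so that each rater's mean becomes 0, a kind of scaling)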
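
The bias adjustment can be sketched as follows. The two raters and their scores are invented, and the single-neighbor prediction with `similarity = 1.0` is a simplifying assumption. The general form adds the similarity-weighted average of the neighbors' deviations to the target rater's own mean.

```python
import numpy as np

# Hypothetical ratings for three movies; both raters rank the movies the same way.
generous = np.array([5.0, 4.0, 4.5])   # tends to score high
harsh = np.array([2.5, 1.5, 2.0])      # tends to score low

def center(r):
    """Subtract the rater's own mean so that the ratings average to 0."""
    return r - r.mean()

# After centering, the two raters look identical despite the different scales.
print(center(generous))
print(center(harsh))

# Predict the harsh rater's score for a new movie that the generous rater scored 5.0:
# take the neighbor's deviation from their own mean and add it to the target's mean.
similarity = 1.0                       # assumed weight for this single neighbor
neighbor_rating = 5.0
deviation = neighbor_rating - generous.mean()
prediction = harsh.mean() + (similarity * deviation) / abs(similarity)
print(prediction)
```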

## Similarity measures

- Euclidean distance: distance-based similarity
- Cosine similarity: the angle between two vectors
- Correlation coefficient similarity: uses the Pearson correlation coefficient
- Jaccard similarity: similarity for binary data
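
A minimal side-by-side sketch of the four measures, written with NumPy only. The rating vectors `u` and `v` and the watched-movie sets are made up for illustration.

```python
import numpy as np

# Hypothetical ratings from two users for the same four movies.
u = np.array([5.0, 3.0, 4.0, 1.0])
v = np.array([4.0, 2.0, 5.0, 1.0])

# Euclidean distance: smaller means more similar.
euclidean = np.linalg.norm(u - v)

# Cosine similarity: depends only on the angle between the vectors.
cosine = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Pearson correlation: cosine similarity of the mean-centered vectors.
pearson = np.corrcoef(u, v)[0, 1]

# Jaccard similarity: overlap of binary data, e.g. which movies each user watched.
watched_u = {"A", "B", "C"}
watched_v = {"B", "C", "D"}
jaccard = len(watched_u & watched_v) / len(watched_u | watched_v)

print(round(euclidean, 4), round(cosine, 4), round(pearson, 4), jaccard)
```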

# Memory-based CF

- kNN (k Nearest Neighbors)

## User-Based vs Item-Based CF

- User-Based CF: finds the k most similar users
- Item-Based CF: finds the k most similar items
 - Used by Amazon: it is faster when there are many users or the data is sparse, and the ratings of an item do not change much
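
The only structural difference between the two variants is which axis of the rating matrix is compared. The sketch below assumes a small invented user x item matrix. The helpers `cosine_matrix` and `top_k` are hypothetical names introduced for this example.

```python
import numpy as np

# Hypothetical user x item ratings (0 = not rated).
R = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [0.0, 1.0, 5.0, 4.0],
])

def cosine_matrix(X):
    """Cosine similarity between all pairs of rows of X (assumes no all-zero row)."""
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    return unit @ unit.T

def top_k(sim, idx, k):
    """Indices of the k rows most similar to row idx, excluding idx itself."""
    order = np.argsort(sim[idx])[::-1]
    return [int(j) for j in order if j != idx][:k]

user_sim = cosine_matrix(R)      # user-based: compare rows (users)
item_sim = cosine_matrix(R.T)    # item-based: compare columns (items)

print(user_sim.shape, item_sim.shape)
print(top_k(user_sim, 0, 1))     # user most similar to user 0
print(top_k(item_sim, 2, 1))     # item most similar to item 2
```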

# Recommending similar movies

In [14]:
item_based_collabor = cosine_similarity(movie_user_rating)
print(item_based_collabor.shape)
item_based_collabor[:3]

(9064, 9064)


array([[1.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 1.        , 0.        , ..., 0.05821787, 0.        ,
        0.        ],
       [0.        , 0.        , 1.        , ..., 0.        , 0.        ,
        0.        ]])

In [25]:
item_based_collabor = pd.DataFrame(data = item_based_collabor, index = movie_user_rating.index, columns = movie_user_rating.index)
item_based_collabor[:3]

title,"""Great Performances"" Cats (1998)",$9.99 (2008),'Hellboy': The Seeds of Creation (2004),'Neath the Arizona Skies (1934),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),...,Zulu (1964),Zulu (2013),[REC] (2007),eXistenZ (1999),loudQUIETloud: A Film About the Pixies (2006),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931),İtirazım Var (2014)
title,,,,,,,,,,,,,,,,,,,,,
"""Great Performances"" Cats (1998)",1.0,0.0,0.0,0.164399,0.020391,0.0,0.014046,0.0,0.0,0.003166,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
$9.99 (2008),0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.079474,0.0,0.15633,...,0.0,0.0,0.0,0.0,0.0,0.013899,0.0,0.058218,0.0,0.0
'Hellboy': The Seeds of Creation (2004),0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.217357,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [18]:
print(movie_user_rating.shape)
print(item_based_collabor.shape)

(9064, 671)
(9064, 9064)


- A function that returns the items most similar to a given item, in descending order of similarity
- The index is the title

In [29]:
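def get_item_based_collabor(title):
    # Drop the movie itself by label rather than assuming it sorts first (ties at 1.0 can occur)
    return item_based_collabor.loc[title].drop(title).sort_values(ascending=False).head(10)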

In [30]:
get_item_based_collabor('Godfather, The (1972)')

title
Godfather: Part II, The (1974)                                                    0.773685
Goodfellas (1990)                                                                 0.620349
One Flew Over the Cuckoo's Nest (1975)                                            0.568244
American Beauty (1999)                                                            0.557997
Star Wars: Episode IV - A New Hope (1977)                                         0.546750
Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981)    0.538185
Saving Private Ryan (1998)                                                        0.534684
Apocalypse Now (1979)                                                             0.534347
Reservoir Dogs (1992)                                                             0.531713
Usual Suspects, The (1995)                                                        0.530727
Name: Godfather, The (1972), dtype: float64

In [31]:
get_item_based_collabor('American Beauty (1999)')

title
Fight Club (1999)                         0.604583
Sixth Sense, The (1999)                   0.584172
Being John Malkovich (1999)               0.560610
Pulp Fiction (1994)                       0.558401
Godfather, The (1972)                     0.557997
Silence of the Lambs, The (1991)          0.557522
Shakespeare in Love (1998)                0.550611
One Flew Over the Cuckoo's Nest (1975)    0.549126
Matrix, The (1999)                        0.548587
Memento (2000)                            0.544260
Name: American Beauty (1999), dtype: float64