
# Recommender Systems - Latent Vector Approach

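- A model-based approach
- Uses embedding vectors
- Reduces the number of dimensions used to represent users and items (computation grows rapidly in high dimensions)
- Makes it possible to predict ratings for content a user has not yet rated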

## References

- https://lsjsj92.tistory.com/569
- https://lsjsj92.tistory.com/564
- https://lsjsj92.tistory.com/570

- https://github.com/SurhanZahid/Recommendation-System-Using-Matrix-Factorization/blob/master/Recommender%20System%20With%20Matrix%20Factorization%20.ipynb
- https://github.com/nikitaa30/Recommender-Systems/blob/master/matrix_factorisation_svd.py

- Data: https://www.kaggle.com/sengzhaotoo/movielens-small (Kaggle)


## Matrix Factorization

- An example of factorizing an n x m rating matrix into two matrices of size n x 2 and 2 x m (the embedding vector size is 2)

<img src="https://files.realpython.com/media/dimensionality-reduction.f8686dd52b9c.jpg" width="450"  align="center">

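- In the 2-dimensional case, the two components might represent, for example, (taste for action, taste for sentiment). In practice, vectors with around 100 dimensions are used, and it is not possible to say what each component means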
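
As a minimal sketch of the n x 2 and 2 x m factorization described above, the snippet below multiplies a hand-made user-factor matrix by a hand-made item-factor matrix. The names `P`, `Q`, and `R_hat` and all numbers are assumptions chosen for illustration; in a real model the factors are learned from the observed ratings.

```python
import numpy as np

# Hypothetical latent factors, hand-picked for illustration (not learned from data).
# Each row of P is one user's (action taste, sentiment taste).
P = np.array([
    [1.0, 0.0],   # user 0: likes action only
    [0.0, 1.0],   # user 1: likes sentimental films only
    [0.5, 0.5],   # user 2: mixed taste
])

# Each column of Q is one item's (action content, sentiment content).
Q = np.array([
    [5.0, 1.0, 3.0, 0.0],
    [0.0, 4.0, 3.0, 5.0],
])

# The (n x 2) @ (2 x m) product is a full n x m matrix of predicted ratings,
# including entries for items a user has never rated.
R_hat = P @ Q
print(R_hat.shape)
print(R_hat)
```

Every cell of `R_hat` is filled, which is what allows a rating to be predicted for a user and item pair that was never observed.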

### Types of matrix factorization

- singular value decomposition (SVD)
- PCA
- NMF(Non-negative matrix factorization)
- Autoencoders 

## SVD (Singular Value Decomposition)

- Decomposes an m x n data matrix A as shown below

![15](https://user-images.githubusercontent.com/24634054/73115129-93138c00-3f65-11ea-9a10-80abc59a8494.JPG)


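- Source: https://ratsgo.github.io/from%20frequency%20to%20semantics/2017/04/06/pcasvdlsa/

- The column vectors of U and V are singular vectors, and the singular vectors within each matrix are mutually orthogonal
- The sigma matrix is a diagonal matrix: its diagonal entries are the singular values of A and all other entries are 0
- The truncated SVD provided by scikit-learn is a variant of this SVD
  - It keeps only the largest `n_components` singular values on the diagonal of the sigma matrix (yielding an approximation of A)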
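
The properties listed above can be checked numerically. The following is a small sketch using `np.linalg.svd` on a random toy matrix; the matrix, the seed, and the choice `k = 2` are assumptions for illustration and are unrelated to the MovieLens data.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))          # toy m x n data matrix (m=6, n=4)

# Thin SVD: U is 6x4, s holds the 4 singular values, Vt is 4x4.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Singular vectors are orthonormal: U^T U = I and V^T V = I.
print(np.allclose(U.T @ U, np.eye(4)))
print(np.allclose(Vt @ Vt.T, np.eye(4)))

# Multiplying the three factors back together recovers A.
print(np.allclose(U @ np.diag(s) @ Vt, A))

# Truncation: keep only the k largest singular values
# (NumPy returns them sorted in descending order).
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# The Frobenius error of the rank-k approximation is set by the discarded singular values.
err = np.linalg.norm(A - A_k)
print(np.isclose(err, np.sqrt(np.sum(s[k:] ** 2))))
```

The last check shows why truncation is an approximation: the smaller the discarded singular values, the closer `A_k` is to `A`.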

# import

In [1]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")
from sklearn.decomposition import TruncatedSVD
from scipy.sparse.linalg import svds

In [2]:
ratings = pd.read_csv('https://raw.githubusercontent.com/StillWork/data/master/ratings.csv')
movies = pd.read_csv('https://raw.githubusercontent.com/StillWork/data/master/movies.csv')

In [3]:
print(movies.shape)
ratings.head()

(9125, 3)


| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 31 | 2.5 | 1260759144 |
| 1 | 1 | 1029 | 3.0 | 1260759179 |
| 2 | 1 | 1061 | 3.0 | 1260759182 |
| 3 | 1 | 1129 | 2.0 | 1260759185 |
| 4 | 1 | 1172 | 4.0 | 1260759205 |


In [4]:
print(ratings.shape)
movies.head()

(100004, 4)


| | movieId | title | genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama\|Romance |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |


# Joining movie titles

In [5]:
# Merge the two tables on movieId to attach titles to the ratings

ratings_movies = pd.merge(ratings, movies, on = 'movieId')
print(ratings_movies.shape)
ratings_movies[:3]

(100004, 6)


| | userId | movieId | rating | timestamp | title | genres |
|---|---|---|---|---|---|---|
| 0 | 1 | 31 | 2.5 | 1260759144 | Dangerous Minds (1995) | Drama |
| 1 | 7 | 31 | 3.0 | 851868750 | Dangerous Minds (1995) | Drama |
| 2 | 31 | 31 | 4.0 | 1273541953 | Dangerous Minds (1995) | Drama |
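
`pd.merge` performs an inner join by default, so a rating whose `movieId` is missing from the movies table would be dropped without warning. In the notebook the merged table has the same number of rows as `ratings` (100004), so nothing was lost here. The sketch below shows the difference on made-up toy tables; `ratings_toy` and `movies_toy` are hypothetical names used only for this illustration.

```python
import pandas as pd

# Toy tables; the values are made up for illustration.
ratings_toy = pd.DataFrame({
    "userId":  [1, 1, 2, 3],
    "movieId": [10, 20, 10, 99],     # movieId 99 has no entry in movies_toy
    "rating":  [4.0, 3.5, 5.0, 2.0],
})
movies_toy = pd.DataFrame({
    "movieId": [10, 20, 30],
    "title":   ["Movie A", "Movie B", "Movie C"],
})

# Default how="inner": the rating for movieId 99 is dropped.
inner = pd.merge(ratings_toy, movies_toy, on="movieId")

# how="left" keeps every rating; unmatched titles become NaN.
# validate="many_to_one" raises an error if movieId is duplicated in movies_toy.
left = pd.merge(ratings_toy, movies_toy, on="movieId",
                how="left", validate="many_to_one")

print(len(inner), len(left))
print(left["title"].isna().sum())
```

Comparing row counts before and after the merge, as the notebook does with `shape`, is a quick way to confirm that no ratings were discarded.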


## Rating table

In [6]:
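# Pivot to a title x userId matrix (pivot_table averages duplicate title/user pairs by default)
movie_user_rating = ratings_movies.pivot_table('rating', index = 'title', columns='userId')
# Fill missing ratings with 0
movie_user_rating.fillna(0, inplace = True)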
print(movie_user_rating.shape)
movie_user_rating.head(3)

(9064, 671)


| title / userId | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 662 | 663 | 664 | 665 | 666 | 667 | 668 | 669 | 670 | 671 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| "Great Performances" Cats (1998) | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| $9.99 (2008) | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 'Hellboy': The Seeds of Creation (2004) | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
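
The rating table is very sparse: at most 100004 of its 9064 x 671 = 6,081,944 cells can hold a rating, so more than 98% of the cells are empty before filling. The sketch below measures this on a made-up toy table and shows what filling with 0 does; `long_toy` and its values are assumptions for illustration.

```python
import pandas as pd

# Toy ratings in long format; values are made up for illustration.
long_toy = pd.DataFrame({
    "title":  ["Movie A", "Movie A", "Movie B", "Movie C"],
    "userId": [1, 2, 2, 3],
    "rating": [4.0, 5.0, 3.0, 1.0],
})

# Same pivot as in the notebook: rows are titles, columns are users.
pivot = long_toy.pivot_table(values="rating", index="title", columns="userId")

# Fraction of (movie, user) pairs without a rating, measured before filling.
sparsity = pivot.isna().to_numpy().mean()
print(pivot.shape, round(sparsity, 3))

# After fillna(0), "not rated" is stored as the number 0.
filled = pivot.fillna(0)
print(filled)
```

One consequence of the zero fill is that the decomposition cannot distinguish "not rated" from an actual value of 0; this is a simplification that keeps the example short.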


## Dimensionality reduction with SVD

- Returns the reduced-dimension embedding vectors

In [7]:
trunc = TruncatedSVD(n_components=20)
reduced = trunc.fit_transform(movie_user_rating)
reduced.shape

(9064, 20)

In [8]:
reduced[0]

array([ 1.22749118e-02,  2.50765864e-03,  1.55480010e-02, -3.39749637e-02,
       -1.44646341e-02,  3.61019704e-03, -2.23356567e-03,  4.50219490e-02,
       -1.61348348e-02, -2.12741134e-02,  1.11380907e-02, -1.04308162e-02,
        3.99329440e-03,  9.28386259e-03,  5.32601013e-02, -2.21822059e-02,
        7.38391744e-03, -2.05939892e-02,  1.93310370e-03,  4.90389550e-05])
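
To make it clearer what `fit_transform` returns, here is a small sketch on a random toy matrix. It checks that each embedding is the projection of a row onto the retained right singular vectors stored in `components_`. The toy matrix, the seed, and the use of `algorithm="arpack"` with `random_state=0` are assumptions made so the check is precise and repeatable; the notebook itself uses the default randomized solver.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
X = rng.random((8, 5))               # toy matrix: 8 items x 5 users

# arpack is used here only for a precise, repeatable toy check.
svd = TruncatedSVD(n_components=2, algorithm="arpack", random_state=0)
emb = svd.fit_transform(X)

print(emb.shape)                     # one 2-dimensional embedding per row of X
print(svd.components_.shape)         # 2 right singular vectors, each of length 5

# The embedding is the projection of X onto the retained right singular vectors.
print(np.allclose(emb, X @ svd.components_.T))

# Share of the variance in X captured by the retained components.
print(svd.explained_variance_ratio_.sum())
```

Inspecting `explained_variance_ratio_` is one way to judge whether a given `n_components` (20 in the notebook) keeps enough of the signal.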

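- The Pearson correlation coefficient equals the cosine similarity computed after subtracting each vector's own mean, that is, with each vector's mean shifted to the origin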
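
The relationship between Pearson correlation and cosine similarity can be verified on two short vectors. This is a minimal sketch; the helper `cosine` and the vectors `x` and `y` are assumptions introduced for illustration.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 1.0, 4.0, 3.0])

pearson = float(np.corrcoef(x, y)[0, 1])
cos_raw = cosine(x, y)                             # measured from the origin
cos_centered = cosine(x - x.mean(), y - y.mean())  # measured from each vector's mean

print(pearson, cos_raw, cos_centered)
print(np.isclose(pearson, cos_centered))
```

The raw cosine similarity differs from the Pearson coefficient here; the two agree only once each vector has been mean-centered.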

In [9]:
# Pearson correlation between every pair of movie embeddings (rows of reduced)
corr = np.corrcoef(reduced)
df_corr = pd.DataFrame(corr, index=movie_user_rating.index,
                      columns =movie_user_rating.index)
df_corr[:5]

| title | "Great Performances" Cats (1998) | $9.99 (2008) | 'Hellboy': The Seeds of Creation (2004) | 'Neath the Arizona Skies (1934) | 'Round Midnight (1986) | 'Salem's Lot (2004) | 'Til There Was You (1997) | 'burbs, The (1989) | 'night Mother (1986) | (500) Days of Summer (2009) | ... | Zulu (1964) | Zulu (2013) | [REC] (2007) | eXistenZ (1999) | loudQUIETloud: A Film About the Pixies (2006) | xXx (2002) | xXx: State of the Union (2005) | ¡Three Amigos! (1986) | À nous la liberté (Freedom for Us) (1931) | İtirazım Var (2014) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| "Great Performances" Cats (1998) | 1.0 | 0.184411 | -0.007879 | 0.162699 | -0.095595 | -0.007879 | 0.379422 | 0.180065 | 0.301184 | 0.326697 | ... | 0.327875 | 0.027334 | 0.24513 | 0.075318 | -0.020296 | 0.335242 | 0.045903 | 0.295973 | -0.217822 | 0.028016 |
| $9.99 (2008) | 0.184411 | 1.0 | 0.228217 | -0.094895 | 0.090928 | 0.228217 | -0.006729 | 0.352731 | 0.066504 | 0.426026 | ... | 0.243026 | 0.020328 | 0.515677 | 0.184741 | 0.493375 | 0.103194 | -0.027395 | 0.267257 | 0.65469 | 0.005069 |
| 'Hellboy': The Seeds of Creation (2004) | -0.007879 | 0.228217 | 1.0 | -0.124019 | 0.010308 | 1.0 | -0.020543 | 0.424395 | 0.021571 | 0.193846 | ... | 0.160562 | -0.187448 | 0.46751 | 0.271947 | 0.615576 | 0.049802 | -0.047048 | 0.270722 | -0.021164 | 0.172558 |
| 'Neath the Arizona Skies (1934) | 0.162699 | -0.094895 | -0.124019 | 1.0 | -0.043092 | -0.124019 | -0.020674 | -0.014366 | 0.021534 | 0.168413 | ... | -0.140493 | 0.301426 | -0.03993 | 0.014061 | -0.022977 | 0.072284 | 0.020564 | 0.159486 | -0.042783 | 0.24875 |
| 'Round Midnight (1986) | -0.095595 | 0.090928 | 0.010308 | -0.043092 | 1.0 | 0.010308 | -0.019076 | 0.260482 | -0.016993 | 0.248386 | ... | 0.011535 | 0.012813 | 0.007814 | 0.202608 | -0.085222 | -0.015213 | 0.040108 | 0.033184 | -0.029238 | 0.009005 |


- Using the Pearson correlation coefficient, we can find the movies most highly correlated with a given movie

In [89]:
# Show the first 10 movie titles
movie_user_rating.index[:10]

Index(['"Great Performances" Cats (1998)', '$9.99 (2008)',
       ''Hellboy': The Seeds of Creation (2004)',
       ''Neath the Arizona Skies (1934)', ''Round Midnight (1986)',
       ''Salem's Lot (2004)', ''Til There Was You (1997)',
       ''burbs, The (1989)', ''night Mother (1986)',
       '(500) Days of Summer (2009)'],
      dtype='object', name='title')

In [11]:
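# Top 10 most correlated titles; position 0 is skipped because it is normally the movie itself
df_corr["Godfather, The (1972)"].sort_values(ascending=False).iloc[1:11]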

title
Godfather: Part II, The (1974)            0.980021
Goodfellas (1990)                         0.961160
One Flew Over the Cuckoo's Nest (1975)    0.936132
Apocalypse Now (1979)                     0.921939
L.A. Confidential (1997)                  0.898836
American Beauty (1999)                    0.893820
Taxi Driver (1976)                        0.886442
Saving Private Ryan (1998)                0.886194
Reservoir Dogs (1992)                     0.881146
Platoon (1986)                            0.872421
Name: Godfather, The (1972), dtype: float64

In [88]:
# Same lookup for another title, again skipping position 0
df_corr['Shawshank Redemption, The (1994)'].sort_values(ascending=False).iloc[1:11]

title
Pulp Fiction (1994)                 0.932304
Schindler's List (1993)             0.928808
Silence of the Lambs, The (1991)    0.927919
Forrest Gump (1994)                 0.913426
Seven (a.k.a. Se7en) (1995)         0.891959
Usual Suspects, The (1995)          0.890117
Philadelphia (1993)                 0.872496
Apollo 13 (1995)                    0.866006
Braveheart (1995)                   0.856251
Dances with Wolves (1990)           0.847034
Name: Shawshank Redemption, The (1994), dtype: float64
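
The two lookups above can be wrapped in a small helper. The sketch below is one possible version, not part of the original notebook: `top_similar` is a hypothetical function name, and the toy embeddings are made up. It removes the query title by label rather than by skipping position 0, which matters when another movie has a correlation of exactly 1.0 with the query (the correlation table above shows such a pair).

```python
import numpy as np
import pandas as pd

def top_similar(df_corr, title, n=10):
    """Return the n titles most correlated with `title`, excluding `title` itself."""
    scores = df_corr[title].drop(labels=title)   # drop by label, not by position
    return scores.sort_values(ascending=False).head(n)

# Toy embeddings, made up for illustration.
titles = ["A", "B", "C", "D"]
emb = np.array([
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 6.0],    # same direction as A, so its correlation with A is 1
    [3.0, 2.0, 1.0],    # opposite trend, so its correlation with A is -1
    [1.0, 3.0, 2.0],
])
toy_corr = pd.DataFrame(np.corrcoef(emb), index=titles, columns=titles)

result = top_similar(toy_corr, "A", n=2)
print(result)
```

With the notebook's data, a call such as `top_similar(df_corr, "Godfather, The (1972)")` would be the equivalent of the cells above.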