## 3. Collaborative Filtering
- Recommends items by matching users who have similar tastes
- Recommendations are driven by each user's own history (for example, past ratings)

<br/>

<br/>

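#### User-Based Filtering
- Finds users similar to the target user and recommends the items those users rated highly
- Similarity between two users is measured with the Pearson correlation coefficient or cosine similarity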
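To make the two similarity measures concrete, here is a minimal pure-Python sketch that computes both over the items two users have rated in common. The function names and the toy ratings for `alice`, `bob`, and `carol` are hypothetical and are not part of the notebook's dataset or of Surprise.

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two users, computed over co-rated items only."""
    common = sorted(set(a) & set(b))
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    norm_a = math.sqrt(sum(a[i] ** 2 for i in common))
    norm_b = math.sqrt(sum(b[i] ** 2 for i in common))
    return dot / (norm_a * norm_b)

def pearson_sim(a, b):
    """Pearson correlation between two users, computed over co-rated items only."""
    common = sorted(set(a) & set(b))
    if len(common) < 2:
        return 0.0
    mean_a = sum(a[i] for i in common) / len(common)
    mean_b = sum(b[i] for i in common) / len(common)
    cov = sum((a[i] - mean_a) * (b[i] - mean_b) for i in common)
    var_a = sum((a[i] - mean_a) ** 2 for i in common)
    var_b = sum((b[i] - mean_b) ** 2 for i in common)
    if var_a == 0 or var_b == 0:
        return 0.0  # a user who gave identical ratings has no defined correlation
    return cov / math.sqrt(var_a * var_b)

# Hypothetical toy ratings: {movieId: rating}
alice = {1: 5.0, 2: 3.0, 3: 4.0}
bob = {1: 4.0, 2: 2.0, 3: 3.0, 4: 5.0}   # same ordering of preferences as alice
carol = {1: 1.0, 2: 5.0, 3: 2.0}         # roughly opposite preferences

print(pearson_sim(alice, bob), pearson_sim(alice, carol))
print(cosine_sim(alice, bob), cosine_sim(alice, carol))
```

Pearson subtracts each user's mean before comparing, so it ignores the fact that `bob` rates everything one point lower than `alice`; cosine similarity on raw ratings does not apply that correction.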

<br/>

<br/>

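#### Item-Based Collaborative Filtering
- Instead of measuring similarity between users, recommends items that are similar to the ones the target user has already rated
- As above, similarity is computed with the Pearson correlation coefficient or cosine similarity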

https://surprise.readthedocs.io/en/stable/
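As a rough illustration of the item-based idea, the sketch below predicts a rating as the similarity-weighted mean of the user's ratings on the most similar items. It assumes the item-to-item similarities are already available; the function `predict_item_based` and the toy values are hypothetical, and this is not how Surprise implements its KNN algorithms internally.

```python
def predict_item_based(user_ratings, item_sims, target, k=2):
    """Similarity-weighted mean of the user's ratings on the k items
    most similar to `target`. Returns None when no neighbor is available."""
    sims_to_target = item_sims.get(target, {})
    neighbors = [
        (sims_to_target[item], rating)
        for item, rating in user_ratings.items()
        if sims_to_target.get(item, 0) > 0
    ]
    neighbors.sort(key=lambda pair: pair[0], reverse=True)
    top = neighbors[:k]
    if not top:
        return None
    weight_sum = sum(sim for sim, _ in top)
    return sum(sim * rating for sim, rating in top) / weight_sum

# Hypothetical similarities between item "D" and the items the user has rated
item_sims = {"D": {"A": 0.9, "B": 0.1, "C": 0.5}}
user_ratings = {"A": 4.0, "B": 2.0, "C": 3.0}

prediction = predict_item_based(user_ratings, item_sims, "D", k=2)
print(prediction)  # uses only the two nearest items, "A" and "C"
```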

<br/>

In [33]:
import pandas as pd
import numpy as np
import sklearn
import surprise
from surprise import Reader, Dataset
from surprise import KNNBasic, SVD, SVDpp, NMF
from surprise.model_selection import cross_validate

In [31]:
ratings = pd.read_csv("/content/ratings_small.csv")
ratings.head(3)

|   | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 31 | 2.5 | 1260759144 |
| 1 | 1 | 1029 | 3.0 | 1260759179 |
| 2 | 1 | 1061 | 3.0 | 1260759182 |


In [32]:
ratings.shape, ratings.rating.min(), ratings.rating.max()

((100004, 4), 0.5, 5.0)

<br/>

### ```surprise.Reader(name, line_format, sep, rating_scale, skip_lines, ...)```
- ```rating_scale``` : the (minimum, maximum) range of rating values; here it matches the observed range of 0.5 to 5.0


In [14]:
reader = Reader(rating_scale = (0.5, 5))
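One practical effect of `rating_scale` is that Surprise's `predict` clips its estimates to this range by default (`clip=True`). The helper below is a hypothetical stand-in that only mimics that clipping step, to show what the bounds do to an out-of-range estimate.

```python
def clip_to_scale(estimate, rating_scale=(0.5, 5)):
    """Clamp an estimated rating into the [low, high] rating scale."""
    low, high = rating_scale
    return min(high, max(low, estimate))

for raw_estimate in (5.7, 0.1, 3.2):
    print(raw_estimate, "->", clip_to_scale(raw_estimate))
```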

<br/>

### ```Dataset.load_from_df()``` 
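- **Loads ratings from a DataFrame such as `ratings`. The DataFrame passed in must have exactly three columns, in this order: user ID, item ID, rating (here `userId`, `movieId`, `rating`), so `timestamp` is dropped**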

In [29]:
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader = reader)
data.raw_ratings[:10]

[(1, 31, 2.5, None),
 (1, 1029, 3.0, None),
 (1, 1061, 3.0, None),
 (1, 1129, 2.0, None),
 (1, 1172, 4.0, None),
 (1, 1263, 2.0, None),
 (1, 1287, 2.0, None),
 (1, 1293, 2.0, None),
 (1, 1339, 3.5, None),
 (1, 1343, 2.0, None)]

<br/>

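### ```surprise.SVD()``` : SVD
### ```surprise.KNNBasic()``` : K-nearest neighbors
### ```surprise.SVDpp()``` : SVD++
### ```surprise.NMF()``` : non-negative matrix factorization

- SVD, SVD++, and NMF are latent factor models that capture the affinity between users and items; KNNBasic is a neighborhood-based method
- To turn recommendation into an optimization problem, a model is scored by how well it predicts the ratings that users gave to items
- The lower the RMSE, the better the model
- SVD reduces dimensionality by extracting latent factors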
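The optimization view can be sketched in a few lines of plain Python: predict each rating as a global mean plus user and item biases plus a dot product of latent factors, and reduce the squared error with stochastic gradient descent. This is a simplified illustration, not Surprise's implementation; the function names, the toy ratings, and the hyperparameter values are all assumptions chosen for readability.

```python
import math
import random

def train_mf(ratings, n_factors=2, n_epochs=200, lr=0.01, reg=0.02, seed=0):
    """Fit r_hat(u, i) = mu + b_u + b_i + p_u . q_i with SGD; returns a predict function."""
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    b_u = {u: 0.0 for u in users}
    b_i = {i: 0.0 for i in items}
    # Small random latent factors for every user and item
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}

    def predict(u, i):
        dot = sum(pf * qf for pf, qf in zip(p[u], q[i]))
        return mu + b_u[u] + b_i[i] + dot

    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - predict(u, i)
            # Gradient step on the regularized squared error
            b_u[u] += lr * (err - reg * b_u[u])
            b_i[i] += lr * (err - reg * b_i[i])
            for f in range(n_factors):
                pu, qi = p[u][f], q[i][f]
                p[u][f] += lr * (err * qi - reg * pu)
                q[i][f] += lr * (err * pu - reg * qi)
    return predict

def rmse(predict, ratings):
    """Root mean squared error of `predict` on (user, item, rating) triples."""
    return math.sqrt(sum((r - predict(u, i)) ** 2 for u, i, r in ratings) / len(ratings))

# Hypothetical toy data: (userId, itemId, rating)
toy = [(1, 10, 5.0), (1, 20, 3.0), (2, 10, 4.0), (2, 30, 1.0), (3, 20, 2.0), (3, 30, 1.5)]
toy_mean = sum(r for _, _, r in toy) / len(toy)

baseline_rmse = rmse(lambda u, i: toy_mean, toy)   # always predict the mean
model_rmse = rmse(train_mf(toy), toy)              # training error after SGD
print(baseline_rmse, model_rmse)
```

Both numbers here are training errors on six ratings, so they only show that the optimization reduces the loss; they say nothing about generalization, which is what cross-validation measures.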


In [34]:
svd = SVD(random_state = 0)
knn = KNNBasic()  # KNNBasic has no random_state parameter; its fit is deterministic for a given trainset
nmf = NMF(random_state = 0)
# svdpp = SVDpp(random_state = 0)

<br/>

### ```surprise.model_selection.cross_validate(algo, data, measures, cv, ...)``` : cross-validation of a model
- ```measures``` : the evaluation metrics to compute (for example `'RMSE'` and `'MAE'`)
- ```cv``` : the number of folds
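To show what the folds and the two metrics mean, here is a small stdlib-only sketch of k-fold evaluation. It scores a trivial predictor that always returns the training-fold mean rating. The helper names are hypothetical, and the ten ratings are the first ten values shown in `data.raw_ratings` above; Surprise's own splitting logic may differ in detail.

```python
import math
import random

def k_fold_indices(n_samples, n_splits=5, seed=0):
    """Shuffle the indices and split them into n_splits nearly equal test folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[fold::n_splits] for fold in range(n_splits)]

def rmse(errors):
    """Root mean squared error: penalizes large errors more heavily."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

def mae(errors):
    """Mean absolute error: the average size of an error."""
    return sum(abs(e) for e in errors) / len(errors)

def cross_validate_mean_baseline(ratings, n_splits=5):
    """For each fold, predict the training-set mean and score it on the test fold."""
    scores = []
    for test_idx in k_fold_indices(len(ratings), n_splits):
        held_out = set(test_idx)
        train = [r for idx, r in enumerate(ratings) if idx not in held_out]
        mean_rating = sum(train) / len(train)
        errors = [ratings[idx] - mean_rating for idx in test_idx]
        scores.append((rmse(errors), mae(errors)))
    return scores

sample_ratings = [2.5, 3.0, 3.0, 2.0, 4.0, 2.0, 2.0, 2.0, 3.5, 2.0]
folds = k_fold_indices(len(sample_ratings))
scores = cross_validate_mean_baseline(sample_ratings)
for fold_rmse, fold_mae in scores:
    print(round(fold_rmse, 4), round(fold_mae, 4))
```

Every rating lands in exactly one test fold, so each model is always scored on data it was not fitted on.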

In [18]:
cross_validate(svd, data, measures = ['RMSE', 'MAE'], cv=5, verbose = True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9002  0.8900  0.8951  0.8997  0.8940  0.8958  0.0038  
MAE (testset)     0.6923  0.6851  0.6879  0.6932  0.6884  0.6894  0.0030  
Fit time          4.99    4.99    4.94    5.93    5.50    5.27    0.39    
Test time         0.14    0.30    0.14    0.30    0.14    0.21    0.08    


{'test_rmse': array([0.9001988 , 0.88997073, 0.89514338, 0.89972415, 0.89398649]),
 'test_mae': array([0.69229826, 0.68507757, 0.68794114, 0.69317012, 0.68842066]),
 'fit_time': (4.9947190284729,
  4.987050533294678,
  4.9385764598846436,
  5.934472560882568,
  5.497990608215332),
 'test_time': (0.1420001983642578,
  0.3000760078430176,
  0.14197587966918945,
  0.3037290573120117,
  0.1396498680114746)}

In [37]:
cross_validate(knn, data, measures = ['RMSE', 'MAE'], cv=5, verbose = True)

Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNBasic on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9697  0.9614  0.9711  0.9570  0.9767  0.9672  0.0071  
MAE (testset)     0.7446  0.7387  0.7486  0.7367  0.7483  0.7434  0.0049  
Fit time          0.21    0.17    0.18    0.17    0.18    0.18    0.02    
Test time         1.56    1.58    1.55    1.56    1.57    1.56    0.01    


{'test_rmse': array([0.96971418, 0.96139859, 0.97107164, 0.95697625, 0.9766564 ]),
 'test_mae': array([0.74457058, 0.73872365, 0.748592  , 0.73673324, 0.74834888]),
 'fit_time': (0.2118213176727295,
  0.1668715476989746,
  0.17515993118286133,
  0.17276263236999512,
  0.1752300262451172),
 'test_time': (1.557586431503296,
  1.5770461559295654,
  1.5495104789733887,
  1.5622377395629883,
  1.5688910484313965)}

In [39]:
cross_validate(nmf, data, measures = ['RMSE', 'MAE'], cv=5, verbose = True)
# cross_validate(svdpp, data, measures = ['RMSE', 'MAE'], cv=5, verbose = True)

Evaluating RMSE, MAE of algorithm NMF on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9401  0.9569  0.9539  0.9590  0.9456  0.9511  0.0071  
MAE (testset)     0.7223  0.7335  0.7343  0.7350  0.7273  0.7305  0.0049  
Fit time          7.21    7.22    6.41    7.06    6.30    6.84    0.40    
Test time         0.34    0.30    0.13    0.12    0.13    0.20    0.10    


{'test_rmse': array([0.94007423, 0.95686136, 0.9538772 , 0.95903248, 0.94563315]),
 'test_mae': array([0.72229086, 0.73351509, 0.73433738, 0.73503027, 0.7272663 ]),
 'fit_time': (7.213628768920898,
  7.2220070362091064,
  6.408390522003174,
  7.055392503738403,
  6.303618669509888),
 'test_time': (0.3387868404388428,
  0.29956579208374023,
  0.13003158569335938,
  0.12351584434509277,
  0.12628650665283203)}
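The three runs can be compared directly from the `test_rmse` arrays printed above. The snippet below copies those values by hand into a dictionary (the variable names are hypothetical) and ranks the algorithms by mean RMSE, lowest first.

```python
from statistics import mean

# Per-fold test RMSE values copied from the cross_validate outputs above
test_rmse = {
    "SVD": [0.9001988, 0.88997073, 0.89514338, 0.89972415, 0.89398649],
    "KNNBasic": [0.96971418, 0.96139859, 0.97107164, 0.95697625, 0.9766564],
    "NMF": [0.94007423, 0.95686136, 0.9538772, 0.95903248, 0.94563315],
}

mean_rmse = {name: mean(scores) for name, scores in test_rmse.items()}
ranking = sorted(mean_rmse, key=mean_rmse.get)  # lowest (best) RMSE first

for name in ranking:
    print(f"{name}: {mean_rmse[name]:.4f}")
```

On this dataset and with default hyperparameters, SVD has the lowest mean RMSE, followed by NMF and then KNNBasic, although KNNBasic is by far the fastest to fit.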

<br/>

- Fit each model on the full dataset

In [40]:
trainset = data.build_full_trainset()
knn.fit(trainset)
svd.fit(trainset)
nmf.fit(trainset)
# svdpp.fit(trainset)

Computing the msd similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.matrix_factorization.NMF at 0x7f98af451850>

<br/>

- **Predict the rating that the user with userId 1 would give to the movie with movieId 31**

In [48]:
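# Combine both conditions into one boolean mask; chaining two [] filters triggers a pandas reindexing warning
ratings[(ratings['userId'] == 1) & (ratings['movieId'] == 31)]

|   | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 31 | 2.5 | 1260759144 |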


In [133]:
knn.predict(1, 31).est

2.986320319817485

In [134]:
svd.predict(1, 31).est

2.4162799702909346

In [136]:
nmf.predict(1, 31).est
# svdpp.predict(1, 31).est

2.5061117792103396

In [158]:
n_movies = len(np.unique(ratings['movieId']))
n_movies

9066

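In [155]:
# movieIds are not contiguous, so loop over the actual raw ids rather than range(n_movies)
movie_ids = np.unique(ratings['movieId'])
knn_predict = [knn.predict(1, movie_id).est for movie_id in movie_ids]

In [156]:
svd_predict = [svd.predict(1, movie_id).est for movie_id in movie_ids]

In [159]:
nmf_predict = [nmf.predict(1, movie_id).est for movie_id in movie_ids]

In [191]:
# Each row is labeled with the movieId its predictions were computed for
predict_df = pd.DataFrame({'knn_predict' : knn_predict,
                           'svd_predict' : svd_predict,
                           'nmf_predict' : nmf_predict},
                          index = movie_ids)

In [192]:
predict_df.head(3)

The output below was captured from a run that looped over `range(n_movies)`, so its rows are offset by one id: the row labeled 1 holds the prediction for raw id 0, which is not in the data, and KNNBasic and NMF therefore return the same fallback value, 3.543608 (Surprise's default prediction is the global mean rating).

|   | knn_predict | svd_predict | nmf_predict |
|---|---|---|---|
| 1 | 3.543608 | 2.693693 | 3.543608 |
| 2 | 3.722981 | 3.148624 | 2.705488 |
| 3 | 3.258439 | 2.373448 | 2.250783 |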
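A table of predicted ratings like this one is usually turned into recommendations by dropping the movies the user has already rated and keeping the highest estimates. The sketch below shows that step on a plain dictionary; the function `top_n` and the toy predictions are hypothetical and do not come from the fitted models.

```python
def top_n(predictions, already_rated, n=3):
    """Return the n movieIds with the highest estimates that the user has not rated."""
    candidates = {
        movie_id: est
        for movie_id, est in predictions.items()
        if movie_id not in already_rated
    }
    return sorted(candidates, key=candidates.get, reverse=True)[:n]

# Hypothetical predicted ratings for one user: {movieId: estimated rating}
predictions = {1: 3.7, 2: 3.1, 3: 2.4, 31: 2.5, 1029: 4.2, 50: 4.0}
already_rated = {31, 1029}

recommended = top_n(predictions, already_rated, n=3)
print(recommended)
```

With the real data, `already_rated` could be built from the rows of `ratings` where `userId == 1`, and `predictions` from one column of `predict_df`.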
