### Recommendation System: SVD-Based Recommendation Engine
Let's build an SVD-based recommendation engine with the Surprise library.  
We train on the MovieLens dataset, then test the model by predicting ratings.

------

1. Install the surprise module

In [2]:
!pip install surprise



In [3]:
!wget "https://grepp-reco-test.s3.ap-northeast-2.amazonaws.com/movielens/movies.csv"
!wget "https://grepp-reco-test.s3.ap-northeast-2.amazonaws.com/movielens/ratings.csv"

--2022-09-10 13:45:22--  https://grepp-reco-test.s3.ap-northeast-2.amazonaws.com/movielens/movies.csv
Resolving grepp-reco-test.s3.ap-northeast-2.amazonaws.com (grepp-reco-test.s3.ap-northeast-2.amazonaws.com)... 52.219.60.11
Connecting to grepp-reco-test.s3.ap-northeast-2.amazonaws.com (grepp-reco-test.s3.ap-northeast-2.amazonaws.com)|52.219.60.11|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 458390 (448K) [text/csv]
Saving to: 'movies.csv.1'


2022-09-10 13:45:22 (8.15 MB/s) - 'movies.csv.1' saved [458390/458390]

--2022-09-10 13:45:22--  https://grepp-reco-test.s3.ap-northeast-2.amazonaws.com/movielens/ratings.csv
Resolving grepp-reco-test.s3.ap-northeast-2.amazonaws.com (grepp-reco-test.s3.ap-northeast-2.amazonaws.com)... 52.219.60.11
Connecting to grepp-reco-test.s3.ap-northeast-2.amazonaws.com (grepp-reco-test.s3.ap-northeast-2.amazonaws.com)|52.219.60.11|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2438266 (2.3M) [text/csv

### Loading the data

In [4]:
import numpy as np
import pandas as pd 

In [5]:
movies = pd.read_csv("movies.csv")
ratings= pd.read_csv("ratings.csv")

In [9]:
movies.shape

(9125, 3)

In [7]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205


Pivot the ratings into a user-by-movie matrix and check for NaN values (user-movie pairs with no rating).

In [8]:
itemRatings = ratings.pivot_table(index=['userId'], columns=['movieId'], values='rating')
itemRatings.head()

movieId,1,2,3,4,5,6,7,8,9,10,...,161084,161155,161594,161830,161918,161944,162376,162542,162672,163949
userId
1,,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,4.0,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,4.0,...,,,,,,,,,,
5,,,4.0,,,,,,,,...,,,,,,,,,,
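
The NaN entries above are user-movie pairs that have no rating. As a minimal sketch of how that sparsity could be quantified, the example below applies the same pivot to a tiny made-up frame. `toy_ratings` and its values are hypothetical stand-ins for `ratings`, not data from the file.

```python
import pandas as pd

# Hypothetical toy data standing in for ratings.csv
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 3],
    "movieId": [10, 20, 10, 30],
    "rating":  [4.0, 3.5, 5.0, 2.0],
})

# Same pivot as above: rows are users, columns are movies
toy_matrix = toy_ratings.pivot_table(index="userId", columns="movieId", values="rating")

n_cells = toy_matrix.shape[0] * toy_matrix.shape[1]  # 3 users x 3 movies
n_missing = int(toy_matrix.isna().sum().sum())       # cells with no rating
sparsity = n_missing / n_cells

print(f"{n_missing} of {n_cells} cells are NaN ({sparsity:.1%})")
```

Running the same three lines on `itemRatings` would give the share of missing ratings in the real matrix.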


In [9]:
movie_ratings = pd.merge(movies, ratings, on='movieId')

In [10]:
movie_ratings.head()

Unnamed: 0,movieId,title,genres,userId,rating,timestamp
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,7,3.0,851866703
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,9,4.0,938629179
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,13,5.0,1331380058
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,15,2.0,997938310
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,19,3.0,855190091


In [11]:
movies.shape

(9125, 3)

In [12]:
ratings.shape

(100004, 4)

In [13]:
movie_ratings.shape

(100004, 6)

In [14]:
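def getMovieName(movie_ratings, movieID):
    # Return [title, genres] of the first row matching movieID (raises IndexError if there is no match)
    return movie_ratings[movie_ratings["movieId"] == movieID][["title", "genres"]].values[0]

def getMovieID(movie_ratings, movieName):
    # Return [movieId, genres] of the first row whose title matches exactly (raises IndexError if there is no match)
    return movie_ratings[movie_ratings["title"] == movieName][["movieId", "genres"]].values[0]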
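
The two helpers above index `.values[0]` directly, so an unknown ID or title raises `IndexError`. The sketch below shows one possible variant that returns `None` instead. `get_movie_info` and the two-row `toy` frame are hypothetical and only illustrate the lookup pattern.

```python
import pandas as pd

# Hypothetical two-row stand-in for the merged movie_ratings frame
toy = pd.DataFrame({
    "movieId": [1, 2],
    "title": ["Toy Story (1995)", "Jumanji (1995)"],
    "genres": ["Adventure|Animation|Children|Comedy|Fantasy", "Adventure|Children|Fantasy"],
})

def get_movie_info(df, movie_id):
    """Return (title, genres) for movie_id, or None when the ID is absent."""
    match = df.loc[df["movieId"] == movie_id, ["title", "genres"]]
    if match.empty:
        return None
    return tuple(match.iloc[0])

print(get_movie_info(toy, 1))
print(get_movie_info(toy, 999))
```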

## Loading the movie data through the surprise module
User-item rating matrix: $A = U \Sigma V^T$ \\
  where
  * $U$: user matrix ($u \times r$)
  * $\Sigma$: scaling matrix of singular values ($r \times r$)
  * $V^T$: item matrix ($r \times n$)
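
To make the shapes in the factorization concrete, here is a small NumPy sketch on a made-up, fully observed matrix. The matrix `A` and the rank `r = 2` are assumptions chosen for illustration. Note that the real rating matrix is mostly NaN, so a plain SVD like this cannot be applied to it directly; Surprise's `SVD` instead learns the latent factors by stochastic gradient descent over the observed ratings only.

```python
import numpy as np

# Hypothetical 4-user x 3-item rating matrix (fully observed, for illustration only)
A = np.array([
    [5.0, 4.0, 1.0],
    [4.0, 5.0, 1.0],
    [1.0, 1.0, 5.0],
    [2.0, 1.0, 4.0],
])

# Thin SVD: U is (u x k), s holds the singular values, Vt is (k x n), with k = min(u, n)
U, s, Vt = np.linalg.svd(A, full_matrices=False)

r = 2  # keep only the r largest singular values
A_r = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]

# Shapes of the truncated factors: (u x r), (r x r), (r x n)
print(U[:, :r].shape, np.diag(s[:r]).shape, Vt[:r, :].shape)
print(np.round(A_r, 2))  # rank-r approximation of A
```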


<br><br>

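`GridSearchCV` from `surprise.model_selection`
- Runs a grid search over the hyperparameters to find the best values \\
  -> very similar to scikit-learn's GridSearchCV!

- `n_factors`: number of latent factors (the dimension $r$ of $\Sigma$)
- `n_epochs`: number of passes over the full training set
- `lr_all`: learning rate applied to all parameters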
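
A grid search tries every combination of the listed values. The sketch below only counts those combinations with `itertools.product`, using the grid values from this notebook; it does not call Surprise, and the variable names `combos` and `cv` are chosen here for illustration.

```python
from itertools import product

# Same grid values as used in this notebook
param_grid = {
    "n_epochs": [20, 30],
    "lr_all": [0.005, 0.010],
    "n_factors": [50, 100],
}

# Cartesian product of all value lists: one dict per candidate setting
combos = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]

cv = 3  # number of cross-validation folds
print(len(combos), "combinations,", len(combos) * cv, "model fits")
for c in combos[:2]:
    print(c)
```

With 2 x 2 x 2 values and 3 folds, the search trains 24 models, which is why larger grids get expensive quickly.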


In [15]:
import surprise
from surprise import Dataset
from surprise import Reader
from surprise import KNNBasic
from surprise import SVD
from surprise import NormalPredictor
from surprise.model_selection import GridSearchCV

import heapq

from collections import defaultdict
from operator import itemgetter

In [16]:
# Hyperparameter grid to search
param_grid = {
    'n_epochs': [20,30],
    'lr_all': [0.005, 0.010],
    'n_factors' : [50,100]
}

#### ML training 1: k-fold cross-validation

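In [28]:
# Load the ratings first so that `data` exists before the grid search runs.
# Note: Reader's rating_scale defaults to (1, 5); pass rating_scale=(0.5, 5) if the file contains 0.5-star ratings.
reader = Reader(line_format='user item rating timestamp', sep=',', skip_lines=1)
data = Dataset.load_from_file("ratings.csv", reader=reader)

In [29]:
# 3-fold cross-validation over every combination in param_grid
gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=3)
gs.fit(data)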

In [30]:
# RMSE
print("Best RMSE score attained: ", gs.best_score['rmse'])
print("Best RMSE params: ", gs.best_params['rmse'])

Best RMSE score attained:  0.8983535578247839
Best RMSE params:  {'n_epochs': 20, 'lr_all': 0.005, 'n_factors': 50}


In [31]:
# MAE
print("Best MAE score attained: ", gs.best_score['mae'])
print("Best MAE params: ", gs.best_params['mae'])

Best MAE score attained:  0.6923276520712629
Best MAE params:  {'n_epochs': 20, 'lr_all': 0.005, 'n_factors': 50}


### Train the model with the best-performing parameters and make predictions

In [32]:
svd= gs.best_estimator['rmse']

In [33]:
trainset= data.build_full_trainset()
svd.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7f6ad16d1a50>

In [34]:
uid = str(196)  # raw user id (as in the ratings file). They are **strings**!
iid = str(302)  # raw item id (as in the ratings file). They are **strings**!

# get a prediction for a specific user and item
pred = svd.predict(uid, iid, verbose=True)  # optionally pass r_ui=4 (the true rating) to display it

user: 196        item: 302        r_ui = None   est = 3.78   {'was_impossible': False}
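
A single `predict` call scores one (user, item) pair. To turn scores into recommendations, one would typically rank many predictions per user. The sketch below does that with `heapq` and `defaultdict` on any objects exposing `uid`, `iid`, and `est`, which are the field names of Surprise's `Prediction` namedtuple. The `Pred` stand-in, the `top_n` helper, and all scores are hypothetical.

```python
import heapq
from collections import defaultdict, namedtuple

# Stand-in with the same field names as surprise's Prediction namedtuple
Pred = namedtuple("Pred", ["uid", "iid", "r_ui", "est", "details"])

def top_n(predictions, n=2):
    """Group predictions by user and keep the n highest estimated ratings."""
    by_user = defaultdict(list)
    for p in predictions:
        by_user[p.uid].append((p.iid, p.est))
    return {
        uid: heapq.nlargest(n, items, key=lambda pair: pair[1])
        for uid, items in by_user.items()
    }

# Made-up scores, for illustration only
preds = [
    Pred("196", "302", None, 3.78, {}),
    Pred("196", "50", None, 4.40, {}),
    Pred("196", "7", None, 2.10, {}),
    Pred("518", "2250", None, 3.51, {}),
]
print(top_n(preds))
```

With a fitted model, the input list could come from something like `svd.test(trainset.build_anti_testset())`, which scores every user-item pair missing from the training set; on the full dataset that list is large, so this is only one possible approach.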


#### ML training 2: Train-Test Split
Split the data into training and test sets at a 75:25 ratio (trn:tst).

In [37]:
from surprise import accuracy
from surprise.model_selection import train_test_split

trn, tst = train_test_split(data, test_size=.25)
svd=SVD()
svd.fit(trn)
predictions= svd.test(tst)
accuracy.rmse(predictions)

RMSE: 0.9015


0.9015377273863754
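
For reference, the two accuracy measures used in this notebook can be computed by hand. The sketch below uses made-up (true, estimated) rating pairs rather than the actual `predictions` list, so its numbers are unrelated to the RMSE reported above.

```python
import numpy as np

# Hypothetical (true rating, estimated rating) pairs
true = np.array([3.0, 4.0, 5.0, 2.5])
est = np.array([3.5, 4.0, 4.0, 3.0])

err = true - est                    # per-rating errors: [-0.5, 0.0, 1.0, -0.5]
rmse = np.sqrt(np.mean(err ** 2))   # squaring penalizes large errors more
mae = np.mean(np.abs(err))          # average absolute error

print(f"RMSE: {rmse:.4f}  MAE: {mae:.4f}")
```

Because RMSE squares each error before averaging, it is never smaller than MAE on the same predictions, which matches the gap between the two scores in the grid search results.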

In [39]:
tst[:10]

[('518', '2250', 3.0),
 ('564', '1460', 4.0),
 ('564', '1483', 5.0),
 ('213', '3000', 2.5),
 ('4', '1374', 4.0),
 ('342', '898', 5.0),
 ('532', '2953', 3.5),
 ('615', '1213', 4.0),
 ('624', '223', 4.0),
 ('299', '26231', 5.0)]

In [43]:
pred= svd.predict("518","2250", verbose=True)

user: 518        item: 2250       r_ui = None   est = 3.51   {'was_impossible': False}
