# Python Recommender System Package: Surprise

- docs: http://surpriselib.com/
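- Key features: implements recommendation algorithm classes such as SVD, KNNBasic, and BaselineOnly for building recommender systems
    - `SVD`: SVD algorithm for latent factor collaborative filtering via matrix factorization
    - `KNNBasic`: KNN algorithm for nearest-neighbor collaborative filtering
    - `BaselineOnly`: baseline algorithm that accounts for user bias and item bias (the baselines are fit with ALS by default, or optionally with SGD)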

In [1]:
! pip install surprise

Collecting surprise
  Downloading https://files.pythonhosted.org/packages/61/de/e5cba8682201fcf9c3719a6fdda95693468ed061945493dea2dd37c5618b/surprise-0.1-py2.py3-none-any.whl
Collecting scikit-surprise
  Downloading https://files.pythonhosted.org/packages/f5/da/b5700d96495fb4f092be497f02492768a3d96a3f4fa2ae7dea46d4081cfa/scikit-surprise-1.1.0.tar.gz (6.4MB)
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (setup.py) ... done
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.0-cp36-cp36m-linux_x86_64.whl size=1675382 sha256=06bfb621fb19833a665ec7d7cc3a8ef6c47c9edb96cd67a1ba088fc939f6dd39
  Stored in directory: /root/.cache/pip/wheels/cc/fa/8c/16c93fccce688ae1bde7d979ff102f7bee980d9cfeb8641bcf
Successfully built scikit-surprise
Installing collected packages: scikit-surprise, surprise
Successfully installed scikit-surprise-1.1.0 surprise-0.1


In [0]:
from surprise import SVD, accuracy
from surprise.model_selection import train_test_split
import pandas as pd
from surprise import Reader, Dataset

## Loading the Data

- The Surprise package only processes data in `user_id, item_id, rating` form.
- It works with `pandas.DataFrame`, provided the columns are in the order `user_id, item_id, rating`.
- Download: [MovieLens latest](http://files.grouplens.org/datasets/movielens/ml-latest-small.zip)
    - Other datasets are available at https://grouplens.org/datasets/
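
To make the required `(user_id, item_id, rating)` shape concrete, here is a minimal sketch in plain Python that reduces MovieLens-style rows to such triples. It only illustrates the data shape; it is not how Surprise loads data internally, and the variable names `raw` and `triples` are chosen for this example.

```python
import csv
import io

# A few rows in the MovieLens ratings.csv layout
# (userId, movieId, rating, timestamp).
raw = """userId,movieId,rating,timestamp
1,1,4.0,964982703
1,3,4.0,964981247
1,6,4.0,964982224
"""

# Keep only three fields, in (user, item, rating) order; timestamp is dropped.
triples = [
    (row["userId"], row["movieId"], float(row["rating"]))
    for row in csv.DictReader(io.StringIO(raw))
]
print(triples[0])
```

With pandas, the same reduction is the column selection `ratings[['userId', 'movieId', 'rating']]` used when the dataset is built in this notebook.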

In [0]:
#data = Dataset.load_builtin('ml-100k')

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [0]:
file_path = '/content/drive/My Drive/Data Mining/ml-latest-small/ratings.csv'
ratings = pd.read_csv(file_path)

reader = Reader(rating_scale=(0.5, 5.0)) # (min, max)

# The columns of the ratings DataFrame must be in the order: user id, item id, rating
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)

In [5]:
ratings.head()

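|   | userId | movieId | rating | timestamp |
|---|--------|---------|--------|-----------|
| 0 | 1 | 1 | 4.0 | 964982703 |
| 1 | 1 | 3 | 4.0 | 964981247 |
| 2 | 1 | 6 | 4.0 | 964982224 |
| 3 | 1 | 47 | 5.0 | 964983815 |
| 4 | 1 | 50 | 5.0 | 964982931 |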


In [6]:
data

<surprise.dataset.DatasetAutoFolds at 0x7fa3897d5358>

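## Implementing the SVD Recommendation Algorithm
- **parameters**
    - n_factors: number of latent factors. Larger values can improve accuracy but may lead to overfitting.
    - n_epochs: number of iterations of the SGD procedure.
    - biased (bool): whether to use baselines (user and item biases).

- Baseline rating
    - Reflects an individual's rating tendency: does the person rate items generously or harshly?
    - Usually computed as `global mean rating + user bias + item bias`
        - user bias = the user's mean rating across items - global mean rating
        - item bias = the item's mean rating - global mean rating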
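
The baseline formula above can be worked through by hand. The following is a minimal sketch in plain Python on made-up toy ratings (the users, items, and values are hypothetical). It uses the simple mean-difference definition of the biases given above, whereas Surprise fits regularized baselines, so its numbers will generally differ.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy ratings as (user_id, item_id, rating) triples.
ratings = [
    ("u1", "i1", 5.0), ("u1", "i2", 4.0),
    ("u2", "i1", 3.0), ("u2", "i2", 2.0),
    ("u3", "i1", 4.0),
]

global_mean = mean(r for _, _, r in ratings)

# Group ratings per user and per item.
by_user = defaultdict(list)
by_item = defaultdict(list)
for user, item, rating in ratings:
    by_user[user].append(rating)
    by_item[item].append(rating)

# Bias = (mean rating of that user or item) - (global mean rating).
user_bias = {u: mean(rs) - global_mean for u, rs in by_user.items()}
item_bias = {i: mean(rs) - global_mean for i, rs in by_item.items()}

def baseline(user, item):
    """Global mean + user bias + item bias; unknown ids contribute 0."""
    return global_mean + user_bias.get(user, 0.0) + item_bias.get(item, 0.0)

# u3 has not rated i2, so the baseline serves as a first estimate.
print(round(baseline("u3", "i2"), 2))
```

Here `u2` rates below the global mean and therefore gets a negative user bias, which lowers every baseline estimate for that user.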

In [0]:
trainset, testset = train_test_split(data, test_size=.25, random_state=1234)

In [8]:
algo = SVD(n_factors=50, biased=True, random_state=0)
algo.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7fa36ab0b6d8>

In [0]:
predictions = algo.test(testset)

In [10]:
predictions[:5]

[Prediction(uid=68, iid=166528, r_ui=4.5, est=3.511001567930796, details={'was_impossible': False}),
 Prediction(uid=432, iid=1221, r_ui=4.0, est=4.069237755418807, details={'was_impossible': False}),
 Prediction(uid=325, iid=1968, r_ui=3.0, est=3.6533169253788427, details={'was_impossible': False}),
 Prediction(uid=133, iid=150, r_ui=3.0, est=3.0813843316562664, details={'was_impossible': False}),
 Prediction(uid=187, iid=2502, r_ui=4.5, est=3.9946182607992675, details={'was_impossible': False})]

In [12]:
# RMSE (Root Mean Square Error)
# MAE (Mean Absolute Error)
accuracy.rmse(predictions)
accuracy.mae(predictions)

RMSE: 0.8714
MAE:  0.6673


0.6672632948996802
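
For reference, both metrics can be computed by hand from `(r_ui, est)` pairs. The sketch below uses the five predictions printed earlier, with `est` rounded to four decimals. Because it covers only five of the test ratings, the values will not match the full-test-set scores reported by `accuracy`.

```python
import math

# (true rating r_ui, estimate est) pairs taken from predictions[:5],
# with est rounded to four decimals.
pairs = [
    (4.5, 3.5110),
    (4.0, 4.0692),
    (3.0, 3.6533),
    (3.0, 3.0814),
    (4.5, 3.9946),
]

errors = [r_ui - est for r_ui, est in pairs]

# RMSE: square root of the mean squared error.
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
# MAE: mean of the absolute errors.
mae = sum(abs(e) for e in errors) / len(errors)

print(f"RMSE: {rmse:.4f}")
print(f"MAE:  {mae:.4f}")
```

RMSE is never smaller than MAE on the same errors, and the gap widens when a few errors are large, as with the first prediction here.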

## Cross Validation

In [13]:
from surprise.model_selection import cross_validate
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8740  0.8652  0.8761  0.8732  0.8670  0.8711  0.0042  
MAE (testset)     0.6713  0.6623  0.6713  0.6728  0.6671  0.6689  0.0038  
Fit time          2.97    2.88    2.96    2.85    2.96    2.92    0.05    
Test time         0.13    0.21    0.12    0.21    0.12    0.16    0.04    


{'fit_time': (2.970444917678833,
  2.883467674255371,
  2.9563026428222656,
  2.8486170768737793,
  2.956969738006592),
 'test_mae': array([0.67129753, 0.66225643, 0.67125379, 0.67276255, 0.66710184]),
 'test_rmse': array([0.87400302, 0.86524739, 0.87611232, 0.87320059, 0.86695301]),
 'test_time': (0.13410449028015137,
  0.2137620449066162,
  0.11996102333068848,
  0.20944881439208984,
  0.12348628044128418)}
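
The `Mean` and `Std` columns of the printed table can be recovered from the per-fold scores in the returned dictionary. The sketch below copies those scores into plain lists and assumes the table reports the population standard deviation, which is consistent with the numbers shown.

```python
from statistics import mean, pstdev

# Per-fold scores copied from the cross_validate output above.
results = {
    "test_rmse": [0.87400302, 0.86524739, 0.87611232, 0.87320059, 0.86695301],
    "test_mae": [0.67129753, 0.66225643, 0.67125379, 0.67276255, 0.66710184],
}

# Summarize each metric across the five folds.
for name, scores in results.items():
    print(f"{name}: mean={mean(scores):.4f} std={pstdev(scores):.4f}")
```

Since `cross_validate` returns the scores as NumPy arrays, calling `.mean()` and `.std()` on them directly should give the same summary.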

## Hyperparameter Tuning

In [0]:
from surprise.model_selection import GridSearchCV
# Specify the parameters to test as a dictionary
param_grid = {'n_epochs': [20, 40, 60], 'n_factors': [50, 100, 200]}

# Configure GridSearchCV with 3-fold CV, evaluating with rmse and mae
gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=3)
gs.fit(data)

In [15]:
# Best RMSE score and the hyperparameters that achieved it
print(gs.best_score['rmse'])
print(gs.best_params['rmse'])

0.876702268735101
{'n_epochs': 20, 'n_factors': 50}
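
An exhaustive grid search evaluates every combination in `param_grid`, so its cost grows multiplicatively with each added parameter. The sketch below shows how the grid used here expands. It is an illustration in plain Python, not Surprise's internal code, and `combos` is a name chosen for this example.

```python
from itertools import product

param_grid = {'n_epochs': [20, 40, 60], 'n_factors': [50, 100, 200]}
cv = 3  # number of folds used in the search above

# Cartesian product of all parameter values, one dict per combination.
names = list(param_grid)
combos = [dict(zip(names, values)) for values in product(*param_grid.values())]

print(len(combos))       # number of parameter combinations
print(len(combos) * cv)  # number of model fits during the search
for params in combos[:3]:
    print(params)
```

Each combination is fit once per fold, which is why even this small grid takes noticeably longer than a single `cross_validate` call.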
