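# **08. Surprise: A Python Recommender System Package**
- Provides a range of recommendation algorithms: item-based and user-based nearest-neighbor collaborative filtering, and latent factor collaborative filtering based on SVD and NMF
- Its API resembles scikit-learn's core API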

In [2]:
! pip install scikit-surprise

Collecting scikit-surprise
  Downloading scikit-surprise-1.1.1.tar.gz (11.8 MB)
     |████████████████████████████████| 11.8 MB 4.3 MB/s 
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (setup.py) ... done
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.1-cp37-cp37m-linux_x86_64.whl size=1633719 sha256=451cc499a75379f7bca309096f5818e4dc9560883a29468b316bd3aed54409cc
  Stored in directory: /root/.cache/pip/wheels/76/44/74/b498c42be47b2406bd27994e16c5188e337c657025ab400c1c
Successfully built scikit-surprise
Installing collected packages: scikit-surprise
Successfully installed scikit-surprise-1.1.1


In [3]:
import surprise 

print(surprise.__version__)

1.1.1


### **Building a Recommender System with Surprise**

In [None]:
from surprise import SVD
from surprise import Dataset 
from surprise import accuracy 
from surprise.model_selection import train_test_split

In [None]:
data = Dataset.load_builtin('ml-100k') 
trainset, testset = train_test_split(data, test_size=.25, random_state=0) 

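- `ml-100k` is an older version of the MovieLens dataset
- Surprise must be given row-level user-item rating data as is, in the same layout as the data file downloaded from the MovieLens site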
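
As a small illustration of what "row-level" means here, the sketch below flattens a pivoted user-item table into one (user, item, rating) row per rating. The `pivoted` dictionary and the `to_row_level` helper are hypothetical names made up for this example; they are not part of Surprise.

```python
# Hypothetical pivoted user-item table: user -> {item: rating}.
# None marks a movie the user has not rated.
pivoted = {
    "196": {"242": 3.0, "302": None},
    "186": {"242": None, "302": 3.0},
}


def to_row_level(table):
    """Flatten a pivoted table into (user, item, rating) rows, skipping missing ratings."""
    rows = []
    for user, item_ratings in table.items():
        for item, rating in item_ratings.items():
            if rating is not None:
                rows.append((user, item, rating))
    return rows


rows = to_row_level(pivoted)
print(rows)
```

Each resulting tuple corresponds to one line of a ratings file, which is the shape the loaders in this notebook work with.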

**Latent factor collaborative filtering with SVD**

In [None]:
algo = SVD()
algo.fit(trainset) 

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7fd0b0bd6710>

- `test`: predicts ratings for every entry in the data set passed in
- `predict`: returns the predicted rating for a single user and movie

In [None]:
predictions = algo.test( testset )
print('prediction type :',type(predictions), ' size:',len(predictions))
print('First 5 predictions:')
predictions[:5]

prediction type : <class 'list'>  size: 25000
First 5 predictions:


[Prediction(uid='120', iid='282', r_ui=4.0, est=3.801646740991072, details={'was_impossible': False}),
 Prediction(uid='882', iid='291', r_ui=4.0, est=3.7066561356427687, details={'was_impossible': False}),
 Prediction(uid='535', iid='507', r_ui=5.0, est=4.127505924101476, details={'was_impossible': False}),
 Prediction(uid='697', iid='244', r_ui=5.0, est=3.440707431002478, details={'was_impossible': False}),
 Prediction(uid='751', iid='385', r_ui=4.0, est=3.778689173230423, details={'was_impossible': False})]

In [None]:
[ (pred.uid, pred.iid, pred.est) for pred in predictions[:3] ]

[('120', '282', 3.801646740991072),
 ('882', '291', 3.7066561356427687),
 ('535', '507', 4.127505924101476)]

In [None]:
uid = str(196)
iid = str(302)
pred = algo.predict(uid, iid)
print(pred)

user: 196        item: 302        r_ui = None   est = 4.46   {'was_impossible': False}


In [None]:
accuracy.rmse(predictions)

RMSE: 0.9488


0.9487699487523649
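
To make the metric concrete, here is a minimal sketch that computes RMSE by hand from the `r_ui` and `est` values of the first five predictions listed earlier in this section. The `rmse` helper is written for this illustration and is not Surprise's own implementation.

```python
import math

# (actual rating r_ui, estimated rating est) for the first five predictions
pairs = [
    (4.0, 3.801646740991072),
    (4.0, 3.7066561356427687),
    (5.0, 4.127505924101476),
    (5.0, 3.440707431002478),
    (4.0, 3.778689173230423),
]


def rmse(pairs):
    """Root mean squared error over (actual, estimated) pairs."""
    squared_errors = [(actual - est) ** 2 for actual, est in pairs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


sample_rmse = rmse(pairs)
print(f"RMSE over 5 samples: {sample_rmse:.4f}")
```

The value differs from the 0.9488 reported above because it covers only five of the 25,000 test predictions.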

### **Key Surprise Modules**
- Dataset

In [4]:
from google.colab import drive 
drive.mount('/content/drive/')

Mounted at /content/drive/


In [None]:
import pandas as pd

ratings = pd.read_csv("/content/drive/MyDrive/ESAA-OB/ratings.csv")


In [None]:
ratings.to_csv("/content/drive/MyDrive/ESAA-OB/ratings_noh.csv", index=False, header=False)

In [None]:
from surprise import Reader

reader = Reader(line_format='user item rating timestamp', sep=',', rating_scale=(0.5, 5))
data=Dataset.load_from_file("/content/drive/MyDrive/ESAA-OB/ratings_noh.csv",reader=reader)
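
Conceptually, `line_format` names the fields of each line and `sep` says how to split them. The sketch below mimics that idea in plain Python; `parse_line` and the sample line are hypothetical, and Surprise's actual parsing code may differ.

```python
# Conceptual sketch only; not Surprise's implementation.
def parse_line(line, line_format="user item rating timestamp", sep=","):
    """Split one header-less line and label its fields according to line_format."""
    names = line_format.split()
    values = line.strip().split(sep)
    record = dict(zip(names, values))
    # Keep user and item IDs as strings; convert only the rating to a number.
    return record["user"], record["item"], float(record["rating"])


# A made-up line in the same layout as ratings_noh.csv
parsed = parse_line("1,31,2.5,1260759144\n")
print(parsed)
```

IDs read from a file this way stay strings, which is consistent with the `str(...)` conversions used when calling `algo.predict` in this notebook.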

In [None]:
trainset, testset = train_test_split(data, test_size=.25, random_state=0)

algo = SVD(n_factors=50, random_state=0)

In [None]:
algo.fit(trainset) 
predictions = algo.test( testset )
accuracy.rmse(predictions)

In [None]:
import pandas as pd
from surprise import Reader, Dataset

ratings = pd.read_csv("/content/drive/MyDrive/ESAA-OB/ratings.csv") 
reader = Reader(rating_scale=(0.5, 5.0))

data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)
trainset, testset = train_test_split(data, test_size=.25, random_state=0)

algo = SVD(n_factors=50, random_state=0)
algo.fit(trainset) 
predictions = algo.test( testset )
accuracy.rmse(predictions)

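### **Surprise Recommendation Algorithm Classes**
- Baseline rating  
: a rating that incorporates bias terms so that individual tendencies are reflected when items are rated  
- Baseline rating = overall mean rating + user bias + item bias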
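
The formula can be sketched on toy data as follows. This is a simplified illustration that takes each bias as a plain mean deviation from the overall mean; Surprise's baseline estimators fit regularized biases instead, so its numbers will generally differ. All data and names here are invented.

```python
from collections import defaultdict

# Toy (user, item, rating) rows, invented for this illustration.
toy_ratings = [
    ("A", "m1", 5.0), ("A", "m2", 4.0),
    ("B", "m1", 3.0), ("B", "m2", 2.0),
]


def mean(values):
    return sum(values) / len(values)


# Overall mean rating
mu = mean([rating for _, _, rating in toy_ratings])

# Collect each rating's deviation from the overall mean, per user and per item
user_devs = defaultdict(list)
item_devs = defaultdict(list)
for user, item, rating in toy_ratings:
    user_devs[user].append(rating - mu)
    item_devs[item].append(rating - mu)

user_bias = {user: mean(devs) for user, devs in user_devs.items()}
item_bias = {item: mean(devs) for item, devs in item_devs.items()}


def baseline(user, item):
    """Overall mean + user bias + item bias; unknown users or items get a bias of 0."""
    return mu + user_bias.get(user, 0.0) + item_bias.get(item, 0.0)


print(baseline("A", "m1"), baseline("B", "m2"))
```

User A rates above average and movie m1 is rated above average, so both biases push the baseline for that pair upward.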

**Cross-validation and hyperparameter tuning**

In [None]:
from surprise.model_selection import cross_validate 

ratings = pd.read_csv("/content/drive/MyDrive/ESAA-OB/ratings.csv")
reader = Reader(rating_scale=(0.5, 5.0))
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)

algo = SVD(random_state=0) 
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True) 
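
For intuition about what `cv=5` implies, the sketch below splits sample indices into five folds so that each sample is used for testing exactly once. `k_fold_indices` is a hypothetical helper; it does not shuffle, and it is not how Surprise builds its folds internally.

```python
def k_fold_indices(n_samples, n_splits):
    """Yield (train_indices, test_indices) for each fold, without shuffling."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [
        n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
        for i in range(n_splits)
    ]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(10, 5))
for train, test in folds:
    print("test:", test, "train size:", len(train))
```

The model is trained and scored once per fold, and the per-fold RMSE and MAE are then averaged.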

In [None]:
from surprise.model_selection import GridSearchCV

param_grid = {'n_epochs': [20, 40, 60], 'n_factors': [50, 100, 200] }

gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=3)
gs.fit(data)

print(gs.best_score['rmse'])
print(gs.best_params['rmse'])
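
A quick way to see the cost of this grid is to enumerate the combinations it implies. The sketch below uses only the standard library and assumes the usual grid-search behavior of evaluating every combination on every fold.

```python
from itertools import product

param_grid = {'n_epochs': [20, 40, 60], 'n_factors': [50, 100, 200]}
cv = 3

# Cartesian product of all parameter values, one dict per combination
names = list(param_grid)
combinations = [dict(zip(names, values)) for values in product(*param_grid.values())]

print(len(combinations), 'combinations,', len(combinations) * cv, 'model fits')
for combo in combinations[:3]:
    print(combo)
```

With three values for each of two parameters and three folds, the search trains the model 27 times, which is worth keeping in mind before widening the grid.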

### **Building a Personalized Movie Recommender with Surprise**

In [None]:
from surprise.dataset import DatasetAutoFolds

reader = Reader(line_format='user item rating timestamp', sep=',', rating_scale=(0.5, 5))
 
data_folds = DatasetAutoFolds(ratings_file="/content/drive/MyDrive/ESAA-OB/ratings_noh.csv", reader=reader)

trainset = data_folds.build_full_trainset()

In [None]:
algo = SVD(n_epochs=20, n_factors=50, random_state=0)
algo.fit(trainset)

In [None]:
movies = pd.read_csv('/content/drive/MyDrive/ESAA-OB/movies.csv')
 
movieIds = ratings[ratings['userId']==9]['movieId']
if movieIds[movieIds==42].count() == 0:
    print('User ID 9 has no rating for movie ID 42')

print(movies[movies['movieId']==42])

In [None]:
uid = str(9)
iid = str(42)

pred = algo.predict(uid, iid, verbose=True)

In [None]:
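def get_unseen_surprise(ratings, movies, userId):
    # Movie IDs the given user has already rated
    seen_movies = ratings[ratings['userId']== userId]['movieId'].tolist()
    
    # All movie IDs in the catalog
    total_movies = movies['movieId'].tolist()
    
    # Movies the user has not rated yet are the recommendation candidates;
    # a set keeps the membership test fast
    seen_set = set(seen_movies)
    unseen_movies= [movie for movie in total_movies if movie not in seen_set]
    print('Rated movies:',len(seen_movies), 'Candidate movies:',len(unseen_movies), \
          'Total movies:',len(total_movies))
    
    return unseen_movies

unseen_movies = get_unseen_surprise(ratings, movies, 9)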

In [None]:
def recomm_movie_by_surprise(algo, userId, unseen_movies, top_n=10):
    # Predict a rating for every movie the user has not rated yet
    predictions = [algo.predict(str(userId), str(movieId)) for movieId in unseen_movies]

    # Sort by estimated rating, highest first, and keep the top_n
    predictions.sort(key=lambda pred: pred.est, reverse=True)
    top_predictions = predictions[:top_n]

    top_movie_ids = [ int(pred.iid) for pred in top_predictions]
    top_movie_rating = [ pred.est for pred in top_predictions]
    # Look up each title by ID so titles stay in the same order as the sorted predictions;
    # filtering with isin() would return titles in the row order of movies instead
    title_by_id = movies.set_index('movieId')['title']
    top_movie_titles = [ title_by_id[movie_id] for movie_id in top_movie_ids]

    top_movie_preds = list(zip(top_movie_ids, top_movie_titles, top_movie_rating))

    return top_movie_preds

unseen_movies = get_unseen_surprise(ratings, movies, 9)
top_movie_preds = recomm_movie_by_surprise(algo, 9, unseen_movies, top_n=10)

print('##### Top-10 recommended movies #####')
for top_movie in top_movie_preds:
    print(top_movie[1], ":", top_movie[2])
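
The top-N step can also be sketched on toy data without sorting the full list. The example below is a minimal illustration using `heapq.nlargest`; `Pred` is a stand-in namedtuple carrying only the fields used here, and the titles are invented.

```python
import heapq
from collections import namedtuple

# Stand-in for prediction objects; only uid, iid and est are modeled here.
Pred = namedtuple("Pred", ["uid", "iid", "est"])

preds = [
    Pred("9", "1", 3.2),
    Pred("9", "2", 4.7),
    Pred("9", "3", 4.1),
    Pred("9", "4", 2.9),
]

# Hypothetical movie ID -> title table
titles = {1: "Movie A", 2: "Movie B", 3: "Movie C", 4: "Movie D"}

# Pick the two highest estimates, then pair each with its title by movie ID
top = heapq.nlargest(2, preds, key=lambda p: p.est)
top_with_titles = [(int(p.iid), titles[int(p.iid)], p.est) for p in top]

for movie_id, title, est in top_with_titles:
    print(title, ":", est)
```

Looking each title up by ID keeps the title and its estimated rating paired regardless of the order in which movies are stored.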