## Basic Types of Recommender Systems
- Content-Based Filtering
- Collaborative Filtering
  + Memory-Based
  + Latent Factor collaborative filtering

## Content-Based Filtering
- Example: YouTube
- A user watches Avatar 2
  + Recommend items with similar content: Avatar 1, Marvel movies
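As a rough illustration of the idea above, the sketch below ranks movies by how similar their content features are to the one just watched. The genre vectors, the titles other than Avatar 1 and 2, and the helper names `cosine` and `recommend_similar` are all made up for this example.

```python
import numpy as np

# Hypothetical genre features (sci-fi, action, romance); the values are invented.
items = {
    "Avatar 2": np.array([1.0, 1.0, 0.0]),
    "Avatar 1": np.array([1.0, 1.0, 0.0]),
    "Marvel movie": np.array([0.5, 1.0, 0.0]),
    "Romance movie": np.array([0.0, 0.0, 1.0]),
}

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def recommend_similar(watched, items, top_n=2):
    """Rank every other item by the similarity of its features to the watched item."""
    scores = {name: cosine(items[watched], vec)
              for name, vec in items.items() if name != watched}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

recommended = recommend_similar("Avatar 2", items)
print(recommended)
```

Only item features are used here; no other user's ratings are involved, which is what separates this approach from collaborative filtering.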

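## Memory-Based Collaborative Filtering
- When a new movie comes out
  + People check other viewers' ratings, comments, and reviews before choosing a movie
- Recommendations are based on users' behavior patterns
  + The goal is to predict, in the user-movie matrix, the ratings of movies a user has not yet rated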

In [1]:
import pandas as pd 
dict_df = {
    '영화1' : [4, 3, None], 
    '영화2' : [None, 2, 3], 
    '영화3' : [3, None, 2], 
    '영화4' : [None, 5, 2]
}

data = pd.DataFrame(dict_df, index = ['User1', 'User2', 'User3'])
data

Unnamed: 0,영화1,영화2,영화3,영화4
User1,4.0,,3.0,
User2,3.0,2.0,,5.0
User3,,3.0,2.0,2.0
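Rating files usually come in long format (one row per user, item, rating) rather than as a matrix. The sketch below, which assumes column names `user`, `item`, and `rating`, shows one way to pivot such rows into the same user-movie matrix as above.

```python
import pandas as pd

# Long-format ratings: one row per (user, item, rating).
long_df = pd.DataFrame({
    "user":   ["User1", "User1", "User2", "User2", "User2", "User3", "User3", "User3"],
    "item":   ["영화1", "영화3", "영화1", "영화2", "영화4", "영화2", "영화3", "영화4"],
    "rating": [4, 3, 3, 2, 5, 3, 2, 2],
})

# Rows become users and columns become items; unrated pairs show up as NaN.
matrix = long_df.pivot(index="user", columns="item", values="rating")
print(matrix)

# The NaN cells are the ratings a recommender tries to predict.
n_missing = int(matrix.isna().sum().sum())
print("missing ratings:", n_missing)
```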


- User-based: customers with similar tastes watched such-and-such movies.
- Item (movie)-based: customers (= users) who watched movie X will also buy movie Y.

In [2]:
# User-based collaborative filtering example
dict_df = {
    '영화A' : [3, 4, 1], 
    '영화B' : [4, 4, 1], 
    '영화C' : [4, 4, 2], 
    '영화D' : [None, 3, 5], 
    '영화E' : [1, None, None]
}

data = pd.DataFrame(dict_df, index = ['User1', 'User2', 'User3'])
data
# Recommendations can draw on the ratings of other users with similar tastes

Unnamed: 0,영화A,영화B,영화C,영화D,영화E
User1,3,4,4,,1.0
User2,4,4,4,3.0,
User3,1,1,2,5.0,


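- User1 and User2 gave similar ratings to 영화A through 영화C (Movies A to C)
- So User1 and User2 likely have similar tastes, and User2's rating of 영화D can inform a prediction for User1.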
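One way to put a number on "similar ratings" is cosine similarity between the users' rating vectors. The sketch below uses only 영화A to 영화C, which all three users rated; the helper name `cosine` is an assumption of this example.

```python
import numpy as np
import pandas as pd

# Ratings for the three movies that every user rated (values from the table above).
rated = pd.DataFrame(
    {"영화A": [3, 4, 1], "영화B": [4, 4, 1], "영화C": [4, 4, 2]},
    index=["User1", "User2", "User3"],
)

def cosine(a, b):
    """Cosine similarity between two rating vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

u1 = rated.loc["User1"].to_numpy(dtype=float)
u2 = rated.loc["User2"].to_numpy(dtype=float)
u3 = rated.loc["User3"].to_numpy(dtype=float)

sim_12 = cosine(u1, u2)
sim_13 = cosine(u1, u3)
print(f"User1-User2: {sim_12:.3f}, User1-User3: {sim_13:.3f}")
```

Because all ratings are positive, raw cosine values are high for every pair; what matters here is that User2 comes out closer to User1 than User3 does.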

In [3]:
# Item (movie)-based collaborative filtering example
dict_df = {
    'User1' : [5, 4, 1], 
    'User2' : [4, 4, 1], 
    'User3' : [4, 4, 2], 
    'User4' : [None, 5, 3], 
    'User5' : [5, None, None]
}

data = pd.DataFrame(dict_df, index = ['영화1', '영화2', '영화3'])
data
# A movie that is rated well overall is more likely to be recommended

Unnamed: 0,User1,User2,User3,User4,User5
영화1,5,4,4,,5.0
영화2,4,4,4,5.0,
영화3,1,1,2,3.0,


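- 영화1 and 영화2 (Movies 1 and 2) have similar rating distributions across users
- So 영화1 is recommended to User4, who rated 영화2 highly
- Item-based filtering is generally reported to be somewhat more accurate
  + It is hard to infer people's tastes from demographic characteristics alone
  + People tend to like items similar to ones they already like, so item-based recommendations work well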
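The item-based reasoning above can be sketched as a similarity-weighted average. Item similarities are computed from User1 to User3 (who rated all three movies), and User4's missing rating for 영화1 is estimated from User4's ratings of 영화2 and 영화3. The weighting scheme is one common choice, not necessarily the one intended in these notes.

```python
import numpy as np

# Ratings by User1..User3 for each movie (values from the table above).
movie1 = np.array([5.0, 4.0, 4.0])
movie2 = np.array([4.0, 4.0, 4.0])
movie3 = np.array([1.0, 1.0, 2.0])

def cosine(a, b):
    """Cosine similarity between two item rating vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_12 = cosine(movie1, movie2)  # how alike 영화1 and 영화2 are
sim_13 = cosine(movie1, movie3)  # how alike 영화1 and 영화3 are

# User4 rated 영화2 = 5 and 영화3 = 3; weight each rating by its similarity to 영화1.
pred = (sim_12 * 5.0 + sim_13 * 3.0) / (sim_12 + sim_13)
print(f"predicted rating of 영화1 for User4: {pred:.2f}")
```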

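## Latent Factor Collaborative Filtering
- Uses cosine similarity
- Example: Avatar is rated 10 points
  + The algorithm decomposes the rating into factors (the real reasons are unknown, so they are estimated)
    - User 1: screenplay quality 8 points, acting 2 points
    - User 2: screenplay 7 points, acting 2 points, liked the theater 1 point
- Matrix factorization: analogous to factoring a number
  + 12 = 1 x 12 = 2 x 6 = 3 x 4
  + The factors are learned with stochastic gradient descent (SGD)
- Latent factors hidden in the user-item matrix are extracted and used to predict ratings for recommendation.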
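To make the factorization idea concrete, here is a minimal NumPy sketch that learns user factors `P` and item factors `Q` with SGD on a toy matrix, using only the observed ratings. The hyperparameters (`k`, `lr`, `reg`, number of epochs) are arbitrary choices for illustration, and the bias terms that Surprise's `SVD` also learns are left out for brevity.

```python
import numpy as np

# Toy user-item matrix (rows: users, columns: items); NaN marks an unrated pair.
R = np.array([
    [4.0, np.nan, 3.0, np.nan],
    [3.0, 2.0, np.nan, 5.0],
    [np.nan, 3.0, 2.0, 2.0],
])
observed = [(u, i, R[u, i])
            for u in range(R.shape[0])
            for i in range(R.shape[1])
            if not np.isnan(R[u, i])]

def rmse(P, Q):
    """RMSE over the observed ratings only."""
    errs = [r - P[u] @ Q[i] for u, i, r in observed]
    return float(np.sqrt(np.mean(np.square(errs))))

rng = np.random.default_rng(0)
k = 2                                       # number of latent factors
P = rng.normal(0, 0.1, (R.shape[0], k))     # user factors
Q = rng.normal(0, 0.1, (R.shape[1], k))     # item factors
lr, reg = 0.01, 0.02                        # learning rate, L2 regularization

rmse_before = rmse(P, Q)
for epoch in range(1000):
    for u, i, r in observed:
        err = r - P[u] @ Q[i]
        p_u = P[u].copy()                   # keep the old value for the Q update
        P[u] += lr * (err * Q[i] - reg * P[u])
        Q[i] += lr * (err * p_u - reg * Q[i])
rmse_after = rmse(P, Q)

R_hat = P @ Q.T                             # fills in the NaN cells with estimates
print(f"RMSE before: {rmse_before:.3f}, after: {rmse_after:.3f}")
print(np.round(R_hat, 2))
```

The product `P @ Q.T` yields a value for every cell, including the ones that were NaN, and those filled-in values are the predicted ratings.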

## Installing the Library
- Use the Surprise library
- It is available as the scikit-surprise package

In [4]:
!pip install scikit-surprise

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting scikit-surprise
  Downloading scikit-surprise-1.1.3.tar.gz (771 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (setup.py) ... done
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.3-cp38-cp38-linux_x86_64.whl size=3366436 sha256=2f53ffda0b97c231053727d924ab06abbac4930d39e424c23aea362d178333ee
  Stored in directory: /root/.cache/pip/wheels/af/db/86/2c18183a80ba05da35bf0fb7417aac5cddbd93bcb1b92fd3ea
Successfully built scikit-surprise
Installing collected packages: scikit-surprise
Successfully installed scikit-surprise-1.1.3


In [5]:
from surprise import Reader
from surprise import SVD
from surprise import Dataset
from surprise import accuracy
from surprise.model_selection import train_test_split
from surprise.dataset import DatasetAutoFolds

import pandas as pd
import os

## Loading the Data

In [6]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [7]:
DATA_PATH = '/content/drive/MyDrive/Colab Notebooks/2023/python/data/data_230125/'

# on_bad_lines='skip' (pandas >= 1.3) replaces the deprecated error_bad_lines=False
ratings = pd.read_csv(DATA_PATH + 'BX-Book-Ratings.csv', sep=';', on_bad_lines='skip', encoding="latin-1")
books = pd.read_csv(DATA_PATH + 'BX_Books.csv', sep=';', on_bad_lines='skip', encoding="latin-1")
bookratings = pd.merge(ratings, books, on='ISBN')

bookratings.head()




Unnamed: 0,User-ID,ISBN,Book-Rating,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L
0,276725,034545104X,0,Flesh Tones: A Novel,M. J. Rose,2002,Ballantine Books,http://images.amazon.com/images/P/034545104X.0...,http://images.amazon.com/images/P/034545104X.0...,http://images.amazon.com/images/P/034545104X.0...
1,2313,034545104X,5,Flesh Tones: A Novel,M. J. Rose,2002,Ballantine Books,http://images.amazon.com/images/P/034545104X.0...,http://images.amazon.com/images/P/034545104X.0...,http://images.amazon.com/images/P/034545104X.0...
2,6543,034545104X,0,Flesh Tones: A Novel,M. J. Rose,2002,Ballantine Books,http://images.amazon.com/images/P/034545104X.0...,http://images.amazon.com/images/P/034545104X.0...,http://images.amazon.com/images/P/034545104X.0...
3,8680,034545104X,5,Flesh Tones: A Novel,M. J. Rose,2002,Ballantine Books,http://images.amazon.com/images/P/034545104X.0...,http://images.amazon.com/images/P/034545104X.0...,http://images.amazon.com/images/P/034545104X.0...
4,10314,034545104X,9,Flesh Tones: A Novel,M. J. Rose,2002,Ballantine Books,http://images.amazon.com/images/P/034545104X.0...,http://images.amazon.com/images/P/034545104X.0...,http://images.amazon.com/images/P/034545104X.0...


- Data preprocessing

In [33]:
bookratings = bookratings.rename(columns={'ISBN':'item'})
bookratings.head()

Unnamed: 0,User-ID,item,Book-Rating,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L
0,276725,034545104X,0,Flesh Tones: A Novel,M. J. Rose,2002,Ballantine Books,http://images.amazon.com/images/P/034545104X.0...,http://images.amazon.com/images/P/034545104X.0...,http://images.amazon.com/images/P/034545104X.0...
1,2313,034545104X,5,Flesh Tones: A Novel,M. J. Rose,2002,Ballantine Books,http://images.amazon.com/images/P/034545104X.0...,http://images.amazon.com/images/P/034545104X.0...,http://images.amazon.com/images/P/034545104X.0...
2,6543,034545104X,0,Flesh Tones: A Novel,M. J. Rose,2002,Ballantine Books,http://images.amazon.com/images/P/034545104X.0...,http://images.amazon.com/images/P/034545104X.0...,http://images.amazon.com/images/P/034545104X.0...
3,8680,034545104X,5,Flesh Tones: A Novel,M. J. Rose,2002,Ballantine Books,http://images.amazon.com/images/P/034545104X.0...,http://images.amazon.com/images/P/034545104X.0...,http://images.amazon.com/images/P/034545104X.0...
4,10314,034545104X,9,Flesh Tones: A Novel,M. J. Rose,2002,Ballantine Books,http://images.amazon.com/images/P/034545104X.0...,http://images.amazon.com/images/P/034545104X.0...,http://images.amazon.com/images/P/034545104X.0...


In [9]:
books = books.rename(columns = {
    'Book-Title':'title', 'ISBN':'item'
})
books.head()

Unnamed: 0,item,title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...


In [10]:
books.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271379 entries, 0 to 271378
Data columns (total 8 columns):
 #   Column               Non-Null Count   Dtype 
---  ------               --------------   ----- 
 0   item                 271379 non-null  object
 1   title                271379 non-null  object
 2   Book-Author          271378 non-null  object
 3   Year-Of-Publication  271379 non-null  int64 
 4   Publisher            271377 non-null  object
 5   Image-URL-S          271379 non-null  object
 6   Image-URL-M          271379 non-null  object
 7   Image-URL-L          271379 non-null  object
dtypes: int64(1), object(7)
memory usage: 16.6+ MB


- Rename columns in bookratings
  + User-ID --> user
  + Book-Rating --> rating

In [11]:
bookratings2 = bookratings.rename(
    columns={'User-ID':'user', 'Book-Rating':'rating'}
)

bookratings2.head()

Unnamed: 0,user,item,rating,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L
0,276725,034545104X,0,Flesh Tones: A Novel,M. J. Rose,2002,Ballantine Books,http://images.amazon.com/images/P/034545104X.0...,http://images.amazon.com/images/P/034545104X.0...,http://images.amazon.com/images/P/034545104X.0...
1,2313,034545104X,5,Flesh Tones: A Novel,M. J. Rose,2002,Ballantine Books,http://images.amazon.com/images/P/034545104X.0...,http://images.amazon.com/images/P/034545104X.0...,http://images.amazon.com/images/P/034545104X.0...
2,6543,034545104X,0,Flesh Tones: A Novel,M. J. Rose,2002,Ballantine Books,http://images.amazon.com/images/P/034545104X.0...,http://images.amazon.com/images/P/034545104X.0...,http://images.amazon.com/images/P/034545104X.0...
3,8680,034545104X,5,Flesh Tones: A Novel,M. J. Rose,2002,Ballantine Books,http://images.amazon.com/images/P/034545104X.0...,http://images.amazon.com/images/P/034545104X.0...,http://images.amazon.com/images/P/034545104X.0...
4,10314,034545104X,9,Flesh Tones: A Novel,M. J. Rose,2002,Ballantine Books,http://images.amazon.com/images/P/034545104X.0...,http://images.amazon.com/images/P/034545104X.0...,http://images.amazon.com/images/P/034545104X.0...


In [12]:
bookratings2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1031175 entries, 0 to 1031174
Data columns (total 10 columns):
 #   Column               Non-Null Count    Dtype 
---  ------               --------------    ----- 
 0   user                 1031175 non-null  int64 
 1   item                 1031175 non-null  object
 2   rating               1031175 non-null  int64 
 3   Book-Title           1031175 non-null  object
 4   Book-Author          1031174 non-null  object
 5   Year-Of-Publication  1031175 non-null  int64 
 6   Publisher            1031173 non-null  object
 7   Image-URL-S          1031175 non-null  object
 8   Image-URL-M          1031175 non-null  object
 9   Image-URL-L          1031175 non-null  object
dtypes: int64(3), object(7)
memory usage: 86.5+ MB


In [14]:
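# The Reader below only parses user, item, rating, so write just those three columns
bookratings2[['user', 'item', 'rating']].to_csv(DATA_PATH + 'Book-Ratings_noh.csv', index=False, header=False)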

- Load the data into a Dataset object with load_from_file()
- Use the Reader class to define how each line of the data file is parsed

In [15]:
col = 'user item rating'
reader = Reader(line_format=col, sep=',', rating_scale=(1, 10))
data=Dataset.load_from_file(DATA_PATH + 'Book-Ratings_noh.csv', reader=reader)

In [16]:
data

<surprise.dataset.DatasetAutoFolds at 0x7faa037943a0>

- rating_scale = (1, 10): sets the minimum rating to 1 and the maximum to 10
  + The default is (1, 5)
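Before fixing a `rating_scale`, it can help to look at the range the data actually covers. The `head()` output earlier shows ratings of 0, which fall outside (1, 10). The sketch below uses a toy frame built from those five rows and, as an assumption, treats 0 as implicit feedback (which is how the Book-Crossing data is usually described) and drops it. Whether to drop those rows is a modeling choice, not something these notes prescribe.

```python
import pandas as pd

# Toy stand-in for the ratings table, using the five rows shown by head().
toy = pd.DataFrame({
    "user":   [276725, 2313, 6543, 8680, 10314],
    "item":   ["034545104X"] * 5,
    "rating": [0, 5, 0, 5, 9],
})

# Inspect the actual range before choosing rating_scale.
print("min:", toy["rating"].min(), "max:", toy["rating"].max())

# Assumption: 0 marks an implicit interaction rather than a score, so keep 1..10 only.
explicit = toy[toy["rating"] > 0]
print(len(toy), "rows ->", len(explicit), "explicit ratings")
```

Since predictions are clipped to the rating scale, rows with a true rating of 0 can never be predicted exactly under (1, 10), which tends to inflate the RMSE.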

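## Modeling Code
- Very similar to modeling with scikit-learn
- Method 1: split the data into training and test sets
- Method 2: use the entire dataset as training data

- Splitting into training / test sets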

In [17]:
%%time 
trainset, testset = train_test_split(data, test_size=.25, random_state=0)

# Set random_state so that every run produces the same result
# Use a latent factor collaborative filtering algorithm (SVD)
algo = SVD(n_factors=50, random_state=0)

# Fit on the training set, then predict ratings on the test set and evaluate with RMSE
algo.fit(trainset)
predictions = algo.test(testset)
accuracy.rmse(predictions)

RMSE: 3.5413
CPU times: user 26 s, sys: 489 ms, total: 26.5 s
Wall time: 27.6 s


3.541278910004836
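For reference, RMSE is the square root of the mean squared difference between true and predicted ratings. The sketch below computes it by hand on four made-up pairs; the numbers are purely illustrative and unrelated to the result above.

```python
import numpy as np

# Hypothetical (true, predicted) ratings, invented for this example.
true_r = np.array([5.0, 0.0, 9.0, 7.0])
pred_r = np.array([4.0, 3.0, 8.0, 7.0])

# Square the errors, average them, then take the square root.
rmse = float(np.sqrt(np.mean((true_r - pred_r) ** 2)))
print(f"RMSE: {rmse:.4f}")  # prints RMSE: 1.6583
```

Because errors are squared, the single miss of 3 points contributes 9 of the 11 units of squared error, so a few large misses dominate the metric.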

- Building the training set from the entire dataset

In [18]:
col = 'user item rating'
reader = Reader(line_format=col, sep=',', rating_scale=(1, 10))

# Load the file through the DatasetAutoFolds class
data_folds = DatasetAutoFolds(ratings_file=DATA_PATH + 'Book-Ratings_noh.csv', reader=reader)

# Build a training set from the entire dataset
trainset = data_folds.build_full_trainset()

- Train with SVD. To recommend books to an arbitrarily chosen user (userId=9), first check which books that user has not rated yet

In [19]:
%%time
algo = SVD(n_epochs=20, n_factors=50, random_state=0)
algo.fit(trainset)

CPU times: user 19.4 s, sys: 28.2 ms, total: 19.4 s
Wall time: 19.3 s


<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7faa03843cd0>

In [23]:
# Extract the ISBNs of the books rated by userId=9
# ISBNs = bookratings2[bookratings2['user']==9]
ISBNs = bookratings2[bookratings2['user']==9]['item']
ISBNs

if ISBNs[ISBNs == '034545104X'].count() == 0:
  print('No rating for 034545104X')

No rating for 034545104X


In [24]:
books[books['item'] == '034545104X']

Unnamed: 0,item,title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L
2966,034545104X,Flesh Tones: A Novel,M. J. Rose,2002,Ballantine Books,http://images.amazon.com/images/P/034545104X.0...,http://images.amazon.com/images/P/034545104X.0...,http://images.amazon.com/images/P/034545104X.0...


In [25]:
uid = str(9)
iid = str('034545104X')
pred = algo.predict(uid, iid, verbose=True)

user: 9          item: 034545104X r_ui = None   est = 2.87   {'was_impossible': False}


### Book Recommendation
- Extract the candidate books to recommend to an arbitrarily chosen user, userId = 276847.

In [26]:
def get_unread_surprise(bookratings, books, userId):
  # All books that the given userId has rated
  read_books = bookratings[bookratings['user']==userId]['item'].tolist()

  # ISBNs of all books
  total_books = books['item'].tolist()

  # All books the user has not rated; a set makes each membership check O(1)
  read_set = set(read_books)
  unread_books = [book for book in total_books if book not in read_set]
  print('Books rated: ', len(read_books), 'Candidate books: ', len(unread_books),
        'Total books: ', len(total_books))
  
  return unread_books

In [27]:
unread_books = get_unread_surprise(bookratings2, books, 276847)

Books rated:  47 Candidate books:  271332 Total books:  271379


- Build the recommendation function
- Return the top-N ISBNs with their predicted ratings; book titles are looked up from books afterwards
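Selecting the top N by predicted rating does not require sorting the whole list. A minimal sketch with `heapq.nlargest` is shown here. `Pred` is a hypothetical stand-in that carries only the `iid` and `est` fields of a Surprise prediction, and the values are made up.

```python
import heapq
from collections import namedtuple

# Stand-in for Surprise's Prediction objects; only the two fields used here.
Pred = namedtuple("Pred", ["iid", "est"])

predictions = [Pred("A", 6.9), Pred("B", 9.1), Pred("C", 7.4), Pred("D", 8.8)]

# Keep the two predictions with the highest estimated rating, best first.
top2 = heapq.nlargest(2, predictions, key=lambda p: p.est)
top_ids = [p.iid for p in top2]
print(top_ids)
```

With hundreds of thousands of candidate books and a small N, this avoids a full sort, although a plain `sort()` gives the same result.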

In [28]:
unread_books[:10]

['0195153448',
 '0002005018',
 '0060973129',
 '0374157065',
 '0393045218',
 '0399135782',
 '0425176428',
 '0671870432',
 '0679425608',
 '074322678X']

In [29]:
# pred = algo.predict(str(userId), str(item), verbose=True)
pred = algo.predict(str(276847), str('0002005018'), verbose=True)

user: 276847     item: 0002005018 r_ui = None   est = 6.91   {'was_impossible': False}


In [None]:
# Values to extract for each recommendation: ids, ratings, titles

In [30]:
def recomm_book_by_surprise(algo, userId, unread_books, top_n=10):

  # Run the algorithm's predict() on every unrated book and collect the results in a list
  predictions = [algo.predict(str(userId), str(item)) for item in unread_books]
  
  # Each element of the predictions list is a Surprise Prediction object, e.g.
  # [Prediction(uid='276847', iid='1', est=3.69), Prediction(uid='276847', iid='2', est=2.98), ...]

  # sortkey_est is used as the key of list.sort() to order predictions by est
  def sortkey_est(pred):
    return pred.est # predicted rating
    
  # Sort by est in descending order and keep the top_n predictions
  predictions.sort(key=sortkey_est, reverse=True)
  top_predictions = predictions[:top_n]

  # Collect the ISBN and predicted rating of each top_n book
  # (titles are looked up from books afterwards)
  top_book_ids = [pred.iid for pred in top_predictions]
  top_book_rating = [pred.est for pred in top_predictions]

  top_book_preds = list(zip(top_book_ids, top_book_rating))
  return top_book_preds

unread_books = get_unread_surprise(bookratings2, books, 276847)
top_book_preds = recomm_book_by_surprise(algo, 276847, unread_books, top_n=10)

print('  ')
print('  ')
print('***************************')
print('*** Top 10 Recommended Books ***')
print('  ')
for top_book in top_book_preds :
  print(top_book[0], ":", top_book[1])

Books rated:  47 Candidate books:  271332 Total books:  271379
  
  
***************************
*** Top 10 Recommended Books ***
  
0399135782 : 10
0316769487 : 10
0385509456 : 10
0515128554 : 10
1878424319 : 10
0375726403 : 10
0515134279 : 10
0399144463 : 10
193156146X : 10
0743406176 : 10


In [32]:
lists = ['0399135782', '0316769487', '0385509456'] # top ISBNs copied by hand in predicted-rating order

results = books[books['item'].isin(lists)]
results

Unnamed: 0,item,title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L
5,399135782,The Kitchen God's Wife,Amy Tan,1991,Putnam Pub Group,http://images.amazon.com/images/P/0399135782.0...,http://images.amazon.com/images/P/0399135782.0...,http://images.amazon.com/images/P/0399135782.0...
90,316769487,The Catcher in the Rye,J.D. Salinger,1991,"Little, Brown",http://images.amazon.com/images/P/0316769487.0...,http://images.amazon.com/images/P/0316769487.0...,http://images.amazon.com/images/P/0316769487.0...
220,385509456,The Curious Incident of the Dog in the Night-T...,MARK HADDON,2003,Doubleday,http://images.amazon.com/images/P/0385509456.0...,http://images.amazon.com/images/P/0385509456.0...,http://images.amazon.com/images/P/0385509456.0...
