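Item-based recommendation rarely delivers results as accurate as one might expect.  
One way to improve on it is to recommend according to each individual's tastes, which is what the collaborative filtering (CF) algorithm does.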

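### 3.1 How Collaborative Filtering Works

The idea is to find people whose tastes resemble those of the target user and to recommend the products or services that those people like in common.

The collaborative filtering process:
1. Compute the similarity between users and group users with similar tastes into a neighbor group.
2. Average the ratings that the members of the group gave to each item, and recommend the items with the highest averages.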

### 3.2 Similarity Measures

Correlation coefficient
+ An easy-to-understand similarity measure when the rating data is continuous.
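
As a minimal sketch of the correlation coefficient, the snippet below computes the Pearson correlation between two users who rated the same five items. The rating vectors and the `pearson` helper are made up for illustration and are not part of the MovieLens code used later.

```python
import numpy as np

# Hypothetical ratings by two users on the same five items (toy data)
user_a = np.array([5.0, 3.0, 4.0, 1.0, 2.0])
user_b = np.array([4.0, 2.0, 5.0, 1.0, 3.0])

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    numerator = np.dot(x_centered, y_centered)
    denominator = np.sqrt(np.dot(x_centered, x_centered) * np.dot(y_centered, y_centered))
    return float(numerator / denominator)

pearson_ab = pearson(user_a, user_b)
print(round(pearson_ab, 3))  # 0.8
```

Because each user's mean is subtracted first, the measure compares how ratings move around each user's own average rather than their absolute level.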

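Cosine similarity
+ The correlation coefficient is easy to understand, but it does not always give good results.
+ Cosine similarity is another measure widely used in collaborative filtering.
+ Treat each item as one dimension and each rating as a coordinate. Each user's ratings then form a vector, and the angle between two users' vectors can be computed.
+ The more similar two users' ratings are, the larger the cosine of that angle.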
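
The vector view above can be sketched in a few lines. The three rating vectors are hypothetical, with 0 standing in for an unrated item, which mirrors the `fillna(0)` step used later in this chapter.

```python
import numpy as np

def cosine(x, y):
    """Cosine of the angle between two rating vectors."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

# Hypothetical ratings on four items; 0 means "not rated"
user_a = np.array([5.0, 3.0, 0.0, 1.0])
user_b = np.array([4.0, 2.0, 0.0, 1.0])  # rates the same items much like user_a
user_c = np.array([0.0, 0.0, 5.0, 0.0])  # shares no rated item with user_a

sim_ab = cosine(user_a, user_b)
sim_ac = cosine(user_a, user_c)
print(sim_ab > sim_ac)  # True
```

Users with no rated item in common get a similarity of exactly 0 under this encoding, which is why zeros appear in the similarity matrix computed later.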

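Tanimoto coefficient
+ When the data is binary, the correlation coefficient and cosine similarity are a poor fit.
+ In that case the Tanimoto coefficient is used.
+ It gives good results in collaborative filtering on binary data.
+ The closely related Jaccard coefficient is also used. On binary data the two coincide.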
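
For binary data, the Tanimoto coefficient is the number of items both users marked divided by the number of items at least one of them marked. A minimal sketch follows, using made-up binary vectors (1 = purchased or liked, 0 = not). The `tanimoto` helper is hypothetical, and returning 0.0 when neither user marked anything is a choice made here, not something the source specifies.

```python
import numpy as np

def tanimoto(x, y):
    """Tanimoto (Jaccard) coefficient for two binary vectors."""
    x = np.asarray(x, dtype=bool)
    y = np.asarray(y, dtype=bool)
    both = np.logical_and(x, y).sum()    # items marked by both users
    either = np.logical_or(x, y).sum()   # items marked by at least one user
    return float(both / either) if either > 0 else 0.0

# Hypothetical binary interaction vectors over five items
user_a = [1, 1, 0, 1, 0]
user_b = [1, 0, 0, 1, 1]

tanimoto_ab = tanimoto(user_a, user_b)
print(tanimoto_ab)  # 0.5
```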

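### 3.3 Basic CF Algorithm

Treat every user as a neighbor and predict from all users' ratings.
1. Compute the rating similarity between every pair of users (correlation coefficient, cosine similarity, etc.).
2. Extract the similarities between the target user and every other user.
3. For every item the current user has not rated, compute the user's predicted rating.  
The predicted rating is the average of the other users' ratings for that item, weighted by each user's similarity to the current user.
4. Recommend the N items with the highest predicted ratings.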
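
The weighted average in step 3 can be written as below. The notation is an assumption introduced here for illustration: $a$ is the current user, $i$ the item, $U_i$ the set of users who rated item $i$, $r_{u,i}$ the rating user $u$ gave item $i$, and $\mathrm{sim}(a,u)$ the similarity between users $a$ and $u$.

```latex
\hat{r}_{a,i} = \frac{\sum_{u \in U_i} \mathrm{sim}(a,u)\, r_{u,i}}{\sum_{u \in U_i} \mathrm{sim}(a,u)}
```

The expression is only well defined when the similarities in the denominator sum to a nonzero value.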

We first load and run the code from Chapter 2, then look at what comes after it.

In [22]:
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings(action='ignore')

# Load the data
u_cols = ['user_id', 'age', 'sex', 'occupation', 'zip_code']
users = pd.read_csv('./data/u.user', sep='|', names=u_cols, encoding='latin-1')
i_cols = ['movie_id', 'title', 'release date', 'video release date', 'IMDB URL', 'unknown', 
          'Action', 'Adventure', 'Animation', 'Children\'s', 'Comedy', 'Crime', 'Documentary', 
          'Drama', 'Fantasy', 'Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 
          'Thriller', 'War', 'Western']
movies = pd.read_csv('./data/u.item', sep='|', names=i_cols, encoding='latin-1')
r_cols = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_csv('./data/u.data', sep='\t', names=r_cols, encoding='latin-1')

# Drop the timestamp column
ratings = ratings.drop('timestamp', axis=1)
# Keep only the movie ID and title
movies = movies[['movie_id', 'title']]

# Split into train and test sets (stratified by user_id)
from sklearn.model_selection import train_test_split
x = ratings.copy()
y = ratings['user_id']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, stratify=y)

# Function that computes accuracy (RMSE)
def RMSE(y_true, y_pred):
    return np.sqrt(np.mean((np.array(y_true) - np.array(y_pred))**2))

# Function that computes the RMSE of a given model
def score(model):
    id_pairs = zip(x_test['user_id'], x_test['movie_id'])
    y_pred = np.array([model(user, movie) for (user, movie) in id_pairs])
    y_true = np.array(x_test['rating'])
    return RMSE(y_true, y_pred)

# Build the full user x movie rating matrix from the train data
rating_matrix = x_train.pivot(index='user_id', columns='movie_id', values='rating')

In [23]:
rating_matrix

user_id \ movie_id,1,2,3,4,5,6,7,8,9,10,...,1670,1672,1673,1674,1675,1677,1678,1679,1681,1682
1,5.0,,,,3.0,5.0,,1.0,5.0,3.0,...,,,,,,,,,,
2,4.0,,,,,,,,,2.0,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,4.0,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
939,,,,,,,,,5.0,,...,,,,,,,,,,
940,,,,2.0,,,4.0,5.0,3.0,,...,,,,,,,,,,
941,5.0,,,,,,4.0,,,,...,,,,,,,,,,
942,,,,,,,,,,,...,,,,,,,,,,


In [24]:
# Compute cosine similarities for every possible pair of users in the train set
from sklearn.metrics.pairwise import cosine_similarity
matrix_dummy = rating_matrix.copy().fillna(0)
user_similarity = cosine_similarity(matrix_dummy, matrix_dummy)
user_similarity = pd.DataFrame(user_similarity, index = rating_matrix.index, columns = rating_matrix.index)
user_similarity

user_id,1,2,3,4,5,6,7,8,9,10,...,934,935,936,937,938,939,940,941,942,943
1,1.000000,0.113221,0.039499,0.025729,0.300284,0.331638,0.326776,0.258087,0.037908,0.257341,...,0.293213,0.057250,0.198556,0.152950,0.147553,0.113128,0.219667,0.117452,0.173225,0.278053
2,0.113221,1.000000,0.067666,0.129080,0.060459,0.168528,0.085291,0.059501,0.084767,0.104463,...,0.118190,0.210629,0.287073,0.341043,0.182310,0.166115,0.196554,0.123463,0.166594,0.091937
3,0.039499,0.067666,1.000000,0.220107,0.000000,0.058761,0.050802,0.024981,0.061771,0.073175,...,0.009597,0.000000,0.136406,0.035000,0.070953,0.035407,0.095254,0.025519,0.097848,0.000000
4,0.025729,0.129080,0.220107,1.000000,0.000000,0.039858,0.045473,0.116454,0.063449,0.038655,...,0.024644,0.050816,0.109710,0.138274,0.076271,0.000000,0.147591,0.144165,0.111854,0.037202
5,0.300284,0.060459,0.000000,0.000000,1.000000,0.190406,0.264903,0.170426,0.038001,0.174171,...,0.249688,0.067209,0.079165,0.071773,0.114199,0.085717,0.182086,0.153717,0.101134,0.260604
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
939,0.113128,0.166115,0.035407,0.000000,0.085717,0.096848,0.082103,0.055036,0.055286,0.029940,...,0.077305,0.379019,0.203629,0.233258,0.329632,1.000000,0.092478,0.153025,0.040680,0.112065
940,0.219667,0.196554,0.095254,0.147591,0.182086,0.291890,0.260160,0.223593,0.085306,0.256173,...,0.220659,0.068321,0.151247,0.196192,0.169696,0.092478,1.000000,0.138717,0.218879,0.159629
941,0.117452,0.123463,0.025519,0.144165,0.153717,0.140316,0.064042,0.162875,0.204407,0.058261,...,0.027858,0.103395,0.156886,0.209447,0.260562,0.153025,0.138717,1.000000,0.102617,0.090112
942,0.173225,0.166594,0.097848,0.111854,0.101134,0.221642,0.204286,0.165583,0.049679,0.187363,...,0.212256,0.079575,0.082796,0.143839,0.082468,0.040680,0.218879,0.102617,1.000000,0.150099


In [25]:
# Function that computes the similarity-weighted average rating for a given movie
def CF_simple(user_id, movie_id):
    if movie_id in rating_matrix:
        # Similarities between the current user and all other users (user_similarity)
        sim_scores = user_similarity[user_id].copy()
        # Ratings from all users for the current movie
        movie_ratings = rating_matrix[movie_id].copy()
        # Index of users who have not rated the current movie
        none_rating_idx = movie_ratings[movie_ratings.isnull()].index
        # Drop the (null) ratings of users who have not rated the current movie
        movie_ratings = movie_ratings.dropna()
        # Drop the similarities of users who have not rated the current movie
        sim_scores = sim_scores.drop(none_rating_idx)
        # Weighted average over all users who rated the current movie
        mean_rating = np.dot(sim_scores, movie_ratings) / sim_scores.sum()
    else:
        # Movie not in the train set: fall back to a default rating of 3.0
        mean_rating = 3.0
    return mean_rating

# Compute accuracy (RMSE)
score(CF_simple)

1.013492545046893

Exercise
1. Modify the code above to use the Pearson correlation coefficient instead of cosine similarity, and compute the RMSE.

In [20]:
# Pearson correlation between users via pandas DataFrame.corr; unrated entries are filled with 0 first
matrix_dummy = rating_matrix.copy().fillna(0)
user_similarity_pearson = matrix_dummy.T.corr(method='pearson')
user_similarity_pearson = pd.DataFrame(user_similarity_pearson, index = rating_matrix.index, columns = rating_matrix.index)
user_similarity_pearson

user_id,1,2,3,4,5,6,7,8,9,10,...,934,935,936,937,938,939,940,941,942,943
1,1.000000,0.111037,-0.021900,0.003777,0.213347,0.228205,0.220141,0.140829,0.028158,0.254115,...,0.232562,0.036162,0.172518,0.130127,0.130658,0.053043,0.187266,0.105237,0.114071,0.286845
2,0.111037,1.000000,0.096679,0.121567,0.005404,0.164713,0.012538,0.000808,0.066002,0.076622,...,0.073194,0.229745,0.160818,0.396104,0.283168,0.151924,0.208489,0.095388,0.081087,0.039865
3,-0.021900,0.096679,1.000000,0.232734,-0.039289,0.038716,-0.016404,0.016138,0.063283,0.023400,...,-0.008762,0.037294,0.142737,0.034851,0.057627,-0.022093,0.070806,0.095740,0.064182,-0.003460
4,0.003777,0.121567,0.232734,1.000000,-0.027323,0.009880,0.027307,0.070610,-0.010322,0.010094,...,-0.003679,0.037302,0.110520,0.145555,0.092962,0.025622,0.137876,0.121566,0.136723,-0.010898
5,0.213347,0.005404,-0.039289,-0.027323,1.000000,0.154471,0.223142,0.092397,0.011753,0.138914,...,0.198127,-0.034802,-0.020883,0.014657,0.016095,-0.023125,0.143566,0.002088,0.108872,0.203047
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
939,0.053043,0.151924,-0.022093,0.025622,-0.023125,0.065351,0.008569,0.072888,-0.014912,0.003927,...,-0.021733,0.314814,0.105566,0.188203,0.242218,1.000000,0.086833,0.137396,0.006922,0.058270
940,0.187266,0.208489,0.070806,0.137876,0.143566,0.228752,0.159351,0.089046,0.060239,0.260895,...,0.164242,0.083863,0.097145,0.164432,0.139179,0.086833,1.000000,0.176113,0.194635,0.104963
941,0.105237,0.095388,0.095740,0.121566,0.002088,0.096442,0.004166,0.113080,0.044120,0.027035,...,0.024057,0.147121,0.183707,0.186753,0.258718,0.137396,0.176113,1.000000,0.083847,0.086153
942,0.114071,0.081087,0.064182,0.136723,0.108872,0.179828,0.151454,0.031313,0.007598,0.126171,...,0.106958,0.003231,0.042165,0.088552,0.035104,0.006922,0.194635,0.083847,1.000000,0.101890


In [21]:
# Same weighted-average function, now using the Pearson similarities
def CF_simple(user_id, movie_id):
    if movie_id in rating_matrix:
        # Similarities between the current user and all other users (user_similarity_pearson)
        sim_scores = user_similarity_pearson[user_id].copy()
        # Ratings from all users for the current movie
        movie_ratings = rating_matrix[movie_id].copy()
        # Index of users who have not rated the current movie
        none_rating_idx = movie_ratings[movie_ratings.isnull()].index
        # Drop the (null) ratings of users who have not rated the current movie
        movie_ratings = movie_ratings.dropna()
        # Drop the similarities of users who have not rated the current movie
        sim_scores = sim_scores.drop(none_rating_idx)
        # Weighted average over all users who rated the current movie
        mean_rating = np.dot(sim_scores, movie_ratings) / sim_scores.sum()
    else:
        # Movie not in the train set: fall back to a default rating of 3.0
        mean_rating = 3.0
    return mean_rating

# Compute accuracy (RMSE)
score(CF_simple)

1.012600403378509
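
One detail of the exercise solution is worth noting: filling missing ratings with 0 before calling `corr` makes every unrated movie count as a rating of 0. pandas `DataFrame.corr` can instead skip missing values pair by pair when the NaNs are left in place. The sketch below contrasts the two on a tiny made-up matrix. The `toy` data and the `min_periods=2` setting are assumptions for illustration, and no claim is made here about which variant gives the better RMSE on MovieLens.

```python
import numpy as np
import pandas as pd

# Hypothetical 3-user x 5-movie rating matrix with missing ratings (toy data)
toy = pd.DataFrame(
    {10: [5.0, 4.0, np.nan],
     20: [3.0, np.nan, 2.0],
     30: [4.0, 5.0, 1.0],
     40: [1.0, 2.0, 5.0],
     50: [np.nan, 1.0, 4.0]},
    index=pd.Index([1, 2, 3], name='user_id'),
)

# Variant used in the exercise: unrated entries become 0 before correlating
pearson_zero_filled = toy.fillna(0).T.corr(method='pearson')

# Alternative: keep NaN so each pair of users is compared on co-rated movies only
pearson_pairwise = toy.T.corr(method='pearson', min_periods=2)

print(pearson_zero_filled.loc[1, 2], pearson_pairwise.loc[1, 2])
```

With real data, pairs of users who share very few movies can produce unstable or missing correlations in the pairwise variant, so a larger `min_periods` may be needed.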

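### 3.4 CF with Neighbors

One way to improve the simple CF algorithm is to stop treating every user as a neighbor and instead select only highly similar users, which shrinks the neighborhood.

There are two main ways to choose the neighbors.
1. KNN: fix the neighborhood size in advance and pick the K users most similar to the target user.
2. Thresholding: fix a similarity cutoff instead of a size, and treat every user who meets the cutoff as a neighbor.

Thresholding can be more accurate than KNN, but it often fails to produce a recommendation because no user clears the cutoff, so KNN is the common, safe choice.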
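
The difference between the two selection rules can be sketched on a toy similarity vector. The similarity values, user IDs, and helper names below are hypothetical and separate from the `cf_knn` function that follows.

```python
import pandas as pd

# Hypothetical similarities between the target user and five other users
sim_to_target = pd.Series([0.91, 0.15, 0.62, 0.40, 0.78], index=[11, 12, 13, 14, 15])

def knn_neighbors(sim, k):
    """KNN: keep the k users with the highest similarity."""
    return sim.sort_values(ascending=False).head(k)

def threshold_neighbors(sim, min_sim):
    """Thresholding: keep every user whose similarity meets the cutoff (may be empty)."""
    return sim[sim >= min_sim]

top2 = knn_neighbors(sim_to_target, k=2)
strict = threshold_neighbors(sim_to_target, min_sim=0.95)

print(list(top2.index))  # [11, 15]
print(strict.empty)      # True: no user clears the 0.95 cutoff
```

KNN always returns neighbors as long as other users exist, while a strict cutoff can leave the neighborhood empty, which is the failure case described above.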

In [None]:
# Function that computes the prediction for a given neighbor size
def cf_knn(user_id, movie_id, neighbor_size=0):
    if movie_id in rating_matrix:
        # Similarities between the current user and all other users
        sim_scores = user_similarity[user_id].copy()