# Item-Based Nearest Neighbor Collaborative Filtering
## MovieLens Dataset

최근접 이웃 협업 필터링은 사용자 기반과 아이템 기반으로 나뉘는데, 이 중 일반적으로 아이템 기반의 추천 정확도가 뛰어나므로 아이템 기반 방식을 이용해 추천 시스템 구현 실습을 수행해보겠다.
협업 필터링 기반의 영화 추천을 위해서는 사용자가 영화에 대해 작성한 평점 데이터 행렬 세트가 필요한데, 이를 위해 Grouplens의 MovieLens 데이터셋을 사용하도록 하겠다.

https://grouplens.org/datasets/movielens/latest/

### - ml-latest-small.zip (size: 1 MB)

100,000 ratings and 3,600 tag applications applied to 9,000 movies by 600 users. Last updated 9/2018.

- links.csv
- movies.csv
- ratings.csv
-tags.csv

---

In [41]:
import cpuinfo
cpuinfo.get_cpu_info()

{'python_version': '3.9.7.final.0 (64 bit)',
 'cpuinfo_version': [8, 0, 0],
 'cpuinfo_version_string': '8.0.0',
 'arch': 'ARM_8',
 'bits': 64,
 'count': 8,
 'arch_string_raw': 'arm64',
 'brand_raw': 'Apple M1'}

----

## Data Loading and Processing

In [42]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

In [43]:
movies = pd.read_csv('./data/MovieLens_Dataset/ml-latest-small/movies.csv')
ratings = pd.read_csv('./data/MovieLens_Dataset/ml-latest-small/ratings.csv')

print(movies.shape)
print(ratings.shape)

(9742, 3)
(100836, 4)


In [44]:
movies.head(3)

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance


In [45]:
ratings.head(3)

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224


- DataFrame의 pivot_table() 함수 이용하여 ratings 데이터프레임을 User-Items 행렬로 변환

In [46]:
ratings = ratings[['userId', 'movieId','rating']]
ratings_matrix = ratings.pivot_table('rating',index='userId', columns='movieId')
ratings_matrix.head(3)

movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,4.0,,4.0,,,4.0,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,


- pivot_table() 적용 이후 movieId 값이 모두 column 이름으로 변경되었다.
- NaN 값이 많은 Sparse matrix인데, 편의상 NaN은 모두 0점으로 변환하도록 하자. (최소 평점 0.5)
- 가독성을 위해 column 명을 movieId에서 title로 변경하자. (title은 movies dataset에 존재)

In [47]:
# title column을 얻기 위한 movies와 join
rating_movies = pd.merge(ratings, movies, on='movieId')

# columns='title'로 title columnㅇ로 pivot 수행
ratings_matrix = rating_movies.pivot_table('rating',index='userId', columns='title')

# NaN -> 0
ratings_matrix = ratings_matrix.fillna(0)

ratings_matrix.head(3)

title,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),'Tis the Season for Love (2015),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...,Zulu (2013),[REC] (2007),[REC]² (2009),[REC]³ 3 Génesis (2012),anohana: The Flower We Saw That Day - The Movie (2013),eXistenZ (1999),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Get Simmilarities between Movies

변환된 User-Item 행렬을 이용하여 영화 간의 유사도를 측정해보자.
영화 간의 유사도는 이전에서와 같이 코사인 유사도를 사용하겠다.
코사인 유사도를 구하기 위한 함수인 cosine_similarity()는 행을 기준으로 서로 다른 행을 비교해 유사도를 산출하므로, 영화를 기준으로 코사인 유사도를 계산하기 위해 ratings_matrix의 행과 열을 바꿔 전치행렬을 만들어준다.

In [48]:
ratings_matrix_T = ratings_matrix.T
ratings_matrix_T.head(3)

userId,1,2,3,4,5,6,7,8,9,10,...,601,602,603,604,605,606,607,608,609,610
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
'71 (2014),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0
'Hellboy': The Seeds of Creation (2004),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
'Round Midnight (1986),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


ratings_matrix의 전치행렬을 구했으니, 이를 기반으로 코사인 유사도를 구해보자.
좀 더 직관적인 영화의 유사도 값을 표현하기 위해 cosine_similarity()로 반환된 Numpy 행렬에 영화명을 mapping해 DataFrame으로 변환해보자.

In [49]:
from sklearn.metrics.pairwise import cosine_similarity

item_sim = cosine_similarity(ratings_matrix_T, ratings_matrix_T)

# Numpy matrix to DataFrame
item_sim_df = pd.DataFrame(data=item_sim, index=ratings_matrix.columns,\
                          columns=ratings_matrix.columns)

print(item_sim_df.shape)
item_sim_df.head(3)

(9719, 9719)


title,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),'Tis the Season for Love (2015),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...,Zulu (2013),[REC] (2007),[REC]² (2009),[REC]³ 3 Génesis (2012),anohana: The Flower We Saw That Day - The Movie (2013),eXistenZ (1999),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
'71 (2014),1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.141653,0.0,...,0.0,0.342055,0.543305,0.707107,0.0,0.0,0.139431,0.327327,0.0,0.0
'Hellboy': The Seeds of Creation (2004),0.0,1.0,0.707107,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
'Round Midnight (1986),0.0,0.707107,1.0,0.0,0.0,0.0,0.176777,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


- item_sim_df를 이용해서 '대부(Godfather, The (1972))'와 유사도가 높은 상위 5개의 영화를 추출해보자. (자신 제외)

In [50]:
item_sim_df['Godfather, The (1972)'].sort_values(ascending=False)[1:6]

title
Godfather: Part II, The (1974)               0.821773
Goodfellas (1990)                            0.664841
One Flew Over the Cuckoo's Nest (1975)       0.620536
Star Wars: Episode IV - A New Hope (1977)    0.595317
Fargo (1996)                                 0.588614
Name: Godfather, The (1972), dtype: float64

## Personalized Movie Recommend System Using Item-based NNCF

앞에서 만들어진 아이템 기잔 영화 유사도 데이터는 모든 사용자의 평점을 기준으로 영화의 유사도를 생성했고, 이를 이용해 추천을 진행하였다. 나름 훌룡한 성능을 보이긴 했으나 개인적인 취향을 반영하지 않았기 때문에 추천 시스템으로 사용하기에는 다소 부족함이 있다.
이번에는 최근접 이웃 방식을 적용해 개인 취향을 고려한 추천 시스템의 구축을 진행해보자. 개인화된 영화 추천의 가장 큰 특징은 개인이 아직 관람하지 않은 영화를 추천한다는 점으로, 아직 관람하지 않은 영화의 아이템 유사도와 관람한 영화의 평점 데이터를 기반으로 새롭게 예측 평점을 계산하고 추천을 진행한다.

이러한 아이템 기반의 협업 필터링에서 개인화된 예측 평점을 구하는 공식은 다음과 같다.

${\hat R}_{\mu, i} = \Sigma^N(S_{i,N} * (R_{\mu,N}) / \Sigma^N(|S_{i,N}|)$

- ${\hat R}_{\mu,i}$: 최종적으로 구하고자 하는 점수. 사용자 $\mu$, 아이템 i의 개인화된 예측 평점 값
- $S_{i,N}$: 아이템 i와 가장 유사도가 높은 Top-N개 아이템의 유사도 벡터
- $R_{\mu,N}$: 사용자 $\mu$의 아이템 i와 가장 유사도가 높은 Top-N개 아이템에 대한 실제 평점 벡터
- N: 아이템의 최근접 이웃 범위 계수(item neighbors)

먼저 N 범위에 제약을 두지 않고 모든 아이템으로 가정하고 예측 평점을 구하는 로직을 작성한 뒤에 Top-N 아이템을 기반으로 협업 필터링을 수행하는 로직으로 변경하겠다.

In [51]:
def predict_rating(ratings_arr, item_sim_arr):
    ratings_pred = ratings_arr.dot(item_sim_arr)/np.array([np.abs(item_sim_arr).sum(axis=1)])
    return ratings_pred

In [52]:
ratings_pred = predict_rating(ratings_matrix.values, item_sim_df.values)
ratings_pred_matrix = pd.DataFrame(data = ratings_pred, index = ratings_matrix.index, columns = ratings_matrix.columns)
ratings_pred_matrix.head(3)

title,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),'Tis the Season for Love (2015),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...,Zulu (2013),[REC] (2007),[REC]² (2009),[REC]³ 3 Génesis (2012),anohana: The Flower We Saw That Day - The Movie (2013),eXistenZ (1999),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,0.070345,0.577855,0.321696,0.227055,0.206958,0.194615,0.249883,0.102542,0.157084,0.178197,...,0.113608,0.181738,0.133962,0.128574,0.006179,0.21207,0.192921,0.136024,0.292955,0.720347
2,0.01826,0.042744,0.018861,0.0,0.0,0.035995,0.013413,0.002314,0.032213,0.014863,...,0.01564,0.020855,0.020119,0.015745,0.049983,0.014876,0.021616,0.024528,0.017563,0.0
3,0.011884,0.030279,0.064437,0.003762,0.003749,0.002722,0.014625,0.002085,0.005666,0.006272,...,0.006923,0.011665,0.0118,0.012225,0.0,0.008194,0.007017,0.009229,0.01042,0.084501


예측 평점이 사용자별 영화의 실제 평점과 영화의 코사인 유사도를 내적(dot)한 값이므로 관람하지 않아 NaN, 즉 0으로 표시된 평점에 예측 평점이 부여되는 경우가 존재함을 확인할 수 있다. 이때 예측 평점은 실제 평점 대비 다소 작을 수 있는데, 이는 내적 결과를 코사인 유사도 벡터 합으로 나누었기 때문이다.

이 예측 결과가 실제 평점과 얼마나 차이 나는지 MSE를 예측 평가지표로 사용해 확인해보자.
MSE를 측정할 때는 관람하지 않은 영화에 0이 부여된 기존 데이터와 관람하지 않은 데이터에 대해서도 예측 평점을 기입한 예측 데이터의 차이를 고려하여 기존에 NaN이 아닌 점수가 부여된 데이터에 대해서만 오차 정도를 측정하도록 하자.

In [53]:
from sklearn.metrics import mean_squared_error

# 사용자가 평점을 부여한 영화에 대해서만 예측 성능 평가 MSE 측정
def get_mse(pred, actual):
    # 평점이 있는 실제 영화만 추출
    pred = pred[actual.nonzero()].flatten()
    actual = actual[actual.nonzero()].flatten()
    
    return mean_squared_error(pred, actual)

print('아이템 기반 모든 최근점 이웃 MSE: ', get_mse(ratings_pred, ratings_matrix.values))

아이템 기반 모든 최근점 이웃 MSE:  9.895354759094706


MSE가 약 9.9인데, 더 나은 추천 시스템의 구축을 위해서는 이를 줄이는 방향으로 개선을 수행해야한다.

앞에서의 predict_raing() 함수는 사용자별 영화의 예측 평점을 계산하기 위해 해당 영화와 다른 모든 영화 간의 유사도 벡터를 적용한 것이므로 상대적으로 평점 예측의 정확도가 낮았다. 따라서 이번에는 특정 영화와 가장 유사도가 높은 영화에 대해서만 유사도 벡터를 적용하는 함수를 이용해보겠다.

In [54]:
def predict_rating_topsim(ratings_arr, item_sim_arr, n=20):
    # User-Item 행렬 크기의 초기화된 행렬
    pred = np.zeros(ratings_arr.shape)
    
    # User-Item 행렬의 열 크기 만큼 loop
    for col in range(ratings_arr.shape[1]):
        # 유사도 행렬에서 유사도가 큰 순으로 n개 데이터 행렬의 인덱스 반환
        top_n_items = [np.argsort(item_sim_arr[:, col])[:-n-1:-1]]
        # 개인화된 예측 평점 계산
        for row in range(ratings_arr.shape[0]):
            pred[row,col] = item_sim_arr[col, :][top_n_items].dot(ratings_arr[row,:][top_n_items].T)
            pred[row,col] /= np.sum(np.abs(item_sim_arr[col,:][top_n_items]))
            
        return pred

predict_rating_topsim()으로 예측 평점을 계산하고, 실제 평점과의 MSE를 구해보자. 계산된 예측 평점 Numpy 행렬은 DataFrame으로 재생성하겠다.

In [55]:
%timeit predict_rating_topsim(ratings_matrix.values, item_sim_df.values, n=20) 

4.53 ms ± 23.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


* 현재 사용하고 있는 데이터셋은 크기가 작아 개인 노트북에서도 빠르게 계산되지만, 해당 로직은 데이터의 크기에 따라 수행시간이 급격히 늘어나는 로직으로, 데이터가 증가할 경우에는 조정이 필요하다

In [56]:
ratings_pred = predict_rating_topsim(ratings_matrix.values, item_sim_df.values, n=20)

In [57]:
print('아이템 기반 최근접 이웃 Top-20 MSE: ', get_mse(ratings_pred, ratings_matrix.values))

아이템 기반 최근접 이웃 Top-20 MSE:  13.347676357703904


### 왜 MSE가 증가했지..? 내일 수정하자...