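### Item-based Collaborative Filtering

- Recommends items by finding other items that are similar to a given item.
- Computes item-to-item similarity from users' past behavior data and generates recommendations from it.

**Process**
1. Compute the similarity between items
2. Identify the user's preferences
3. Predict weighted ratings
4. Provide recommendations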
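
The four steps above can be sketched end to end on a tiny rating matrix. This is a minimal illustration under stated assumptions, not the notebook's pipeline: the `ratings` array is hypothetical toy data, and cosine similarity is written out with NumPy.

```python
import numpy as np

# Hypothetical toy data: rows are users, columns are items, 0 means "not rated".
ratings = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [1.0, 0.0, 5.0, 4.0],
    [0.0, 1.0, 4.0, 5.0],
])

# 1. Item-item cosine similarity (each column is one item's rating vector).
norms = np.linalg.norm(ratings, axis=0)
item_sim = (ratings.T @ ratings) / np.outer(norms, norms)

# 2. Take the preferences (ratings) of a target user.
user = 0
user_ratings = ratings[user]

# 3. Predict a similarity-weighted rating for every item.
pred = item_sim @ user_ratings / np.abs(item_sim).sum(axis=1)

# 4. Recommend the unrated item with the highest predicted rating.
unseen = np.flatnonzero(user_ratings == 0)
best = unseen[np.argmax(pred[unseen])]
print(best, pred.round(3))
```

In this toy matrix user 0 has a single unrated item, so step 4 is trivial; with real data the same ranking is taken over every unrated item.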

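**Advantages**
- Similarity is computed between items, so the cost of the similarity step grows relatively slowly as the number of users increases.
- Relies only on rating data rather than item attributes, so it works even when item metadata is scarce.

**Disadvantages**
- Because only item-to-item similarity is considered, shifts in a user's preferences and individual tastes can be hard to reflect.
- Without enough rating data, similarity cannot be computed accurately (Cold Start).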

In [108]:
import numpy as np
import pandas as pd

In [109]:
# Load the data
movies_df = pd.read_csv('./data/ml-latest-small/movies.csv')
ratings_df = pd.read_csv('./data/ml-latest-small/ratings.csv')

movies_df.shape, ratings_df.shape

((9742, 3), (100836, 4))

In [110]:
movies_df.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [111]:
ratings_df.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [112]:
# Merge the DataFrames
movies_ratings_df = pd.merge(ratings_df, movies_df, on='movieId', how='inner')

print(movies_ratings_df.shape)
movies_ratings_df.head()

(100836, 6)


Unnamed: 0,userId,movieId,rating,timestamp,title,genres
0,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,1,3,4.0,964981247,Grumpier Old Men (1995),Comedy|Romance
2,1,6,4.0,964982224,Heat (1995),Action|Crime|Thriller
3,1,47,5.0,964983815,Seven (a.k.a. Se7en) (1995),Mystery|Thriller
4,1,50,5.0,964982931,"Usual Suspects, The (1995)",Crime|Mystery|Thriller


### Computing Item (Movie) Similarity from User Ratings

In [113]:
# User-by-movie rating DataFrame (pivoting); unrated movies are filled with 0
users_movies_df = movies_ratings_df.pivot_table('rating', index='userId', columns='title', fill_value=0)

print(users_movies_df.shape)
users_movies_df.head()

(610, 9719)


title,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),'Tis the Season for Love (2015),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...,Zulu (2013),[REC] (2007),[REC]² (2009),[REC]³ 3 Génesis (2012),anohana: The Flower We Saw That Day - The Movie (2013),eXistenZ (1999),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
userId
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [114]:
# Look up the movie ratings of a specific user (row position 555)
users_movies_df.iloc[555].sort_values(ascending=False)[:30]

title
Harry Potter and the Sorcerer's Stone (a.k.a. Harry Potter and the Philosopher's Stone) (2001)    5.0
How to Train Your Dragon (2010)                                                                   5.0
Guardians of the Galaxy (2014)                                                                    5.0
Aladdin (1992)                                                                                    5.0
Harry Potter and the Chamber of Secrets (2002)                                                    4.5
Harry Potter and the Deathly Hallows: Part 2 (2011)                                               4.5
Eragon (2006)                                                                                     4.5
Lord of the Rings: The Fellowship of the Ring, The (2001)                                         4.5
Harry Potter and the Prisoner of Azkaban (2004)                                                   4.0
Into the Woods (2014)                                                       

In [115]:
# Number of ratings per user
(users_movies_df != 0).sum(axis=1).describe()

count     610.000000
mean      165.298361
std       269.466692
min        20.000000
25%        35.000000
50%        70.500000
75%       168.000000
max      2698.000000
dtype: float64

In [116]:
from sklearn.metrics.pairwise import cosine_similarity

movies_sim = cosine_similarity(users_movies_df.T, users_movies_df.T)    # cosine similarity between movies, based on ratings
movies_sim_df = pd.DataFrame(movies_sim, index=users_movies_df.columns, columns=users_movies_df.columns)

In [117]:
movies_sim_df['Aladdin (1992)'].sort_values(ascending=False)[:10]

title
Aladdin (1992)                       1.000000
Beauty and the Beast (1991)          0.747056
Lion King, The (1994)                0.717909
Jurassic Park (1993)                 0.613485
True Lies (1994)                     0.599906
Batman (1989)                        0.596721
Ace Ventura: Pet Detective (1994)    0.583814
Mrs. Doubtfire (1993)                0.575423
Die Hard: With a Vengeance (1995)    0.568496
Batman Forever (1995)                0.566384
Name: Aladdin (1992), dtype: float64

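### Weighted Rating Prediction
1. Start from the user-by-item (movie) rating matrix.
2. Compute the rating-based item (movie) similarity.
3. Predict a weighted rating for every user and every item (movie).
4. Recommend movies to each user in descending order of predicted rating.
    - Only movies the user has not seen (no rating, i.e. a rating of 0) are recommended.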

**Weighted Rating Sum**

The predicted rating of user $ u $ for item $ i $ is the similarity-weighted sum of the ratings that user $ u $ gave to the $ N $ items similar to item $ i $, divided by the sum of the absolute similarities.

$
\hat{R}_{u,i} = \frac{\sum_{N} (S_{i,N} \times R_{u,N})}{\sum_{N} (|S_{i,N}|)}
$

- $\hat{R}_{u,i}$: the rating user $ u $ is predicted to give item $ i $
- $S_{i,N}$: the similarities between item $ i $ and the $ N $ similar items
- $R_{u,N}$: the ratings user $ u $ gave to those similar items


**Ratings by user $ u $ for the similar items**

| Item | j | k | **i** | m | n |
|------|---|---|---|---|---|
| Rating | 5 | 4 | **1** | 3 | 2 |


**Similarity**

|   | (i,j) | (i,k) | (i,i) | (i,m) | (i,n) |
|---|-------|-------|-------|-------|-------|
| **$S_{i,N}$** | 0.2 | 0.1 | **0.4** | 0.1 | 0.2 |


**Calculation:** Take the dot product of the two rows above, then divide by the sum of the similarities.

$
5 \times 0.2 + 4 \times 0.1 + 1 \times 0.4 + 3 \times 0.1 + 2 \times 0.2 = 2.5
$
<br>
<br>
$
\hat{R}_{u,i} = \frac{(5 \times 0.2) + (4 \times 0.1) + (1 \times 0.4) + (3 \times 0.1) + (2 \times 0.2)}{0.2 + 0.1 + 0.4 + 0.1 + 0.2}
= \frac{1 + 0.4 + 0.4 + 0.3 + 0.4}{1} = \frac{2.5}{1} = 2.5
$

As a result, the rating that user $ u $ is predicted to give item $ i $ is **2.5**.
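
The same arithmetic can be checked in a few lines of NumPy. This is only a sketch of the worked example: the array names `ratings_u` and `sim_i` are illustrative and hold the values from the two tables.

```python
import numpy as np

# Values from the tables, in the item order j, k, i, m, n.
ratings_u = np.array([5, 4, 1, 3, 2], dtype=float)   # ratings by user u
sim_i = np.array([0.2, 0.1, 0.4, 0.1, 0.2])          # similarity of each item to item i

# Similarity-weighted sum, normalized by the sum of absolute similarities.
pred = sim_i @ ratings_u / np.abs(sim_i).sum()
print(round(pred, 2))  # 2.5
```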

- Weighted rating prediction for all users and items

In [118]:
def predict_ratings(users_movies_df, movies_sim_df):
    return users_movies_df.dot(movies_sim_df) / np.abs(movies_sim_df).sum(axis=1)

In [119]:
ratings_pred_df = predict_ratings(users_movies_df, movies_sim_df)

print(ratings_pred_df.shape)
ratings_pred_df.head(1).T

(610, 9719)


userId,1
title
'71 (2014),0.070345
'Hellboy': The Seeds of Creation (2004),0.577855
'Round Midnight (1986),0.321696
'Salem's Lot (2004),0.227055
'Til There Was You (1997),0.206958
...,...
eXistenZ (1999),0.212070
xXx (2002),0.192921
xXx: State of the Union (2005),0.136024
¡Three Amigos! (1986),0.292955


In [120]:
from sklearn.metrics import mean_squared_error

# Compare the error between actual and predicted ratings
def get_mse(actual, pred):
    non_zero_idx = actual.nonzero()     # indices of rated entries: (row_idx array, col_idx array)
    actual = actual[non_zero_idx]
    pred = pred[non_zero_idx]
    return mean_squared_error(actual, pred)

In [121]:
get_mse(users_movies_df.values, ratings_pred_df.values)

9.895354759094705
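
One plausible contributor to an error this large is that unrated entries are stored as 0: they add nothing to the numerator, while the denominator still sums the similarity to every item, so predictions are pulled toward 0. Note also that the error is measured on the same ratings used to build the predictions. The sketch below illustrates the shrinkage on hypothetical toy data; `masked_mse` and the "rated-only" normalization are assumptions introduced for illustration, not part of the notebook.

```python
import numpy as np

def masked_mse(actual, pred):
    """MSE over the entries where `actual` is non-zero (i.e. rated)."""
    mask = actual != 0
    return float(np.mean((actual[mask] - pred[mask]) ** 2))

# Hypothetical toy data: 2 users x 3 items, 0 means "not rated".
actual = np.array([[5.0, 0.0, 3.0],
                   [0.0, 4.0, 0.0]])
sim = np.array([[1.0, 0.5, 0.3],
                [0.5, 1.0, 0.2],
                [0.3, 0.2, 1.0]])

# Dense prediction: normalizes by the similarity to all items.
dense_pred = actual @ sim / np.abs(sim).sum(axis=1)

# Variant (an assumption): normalize only by the similarities of the
# items each user actually rated.
rated = (actual != 0).astype(float)
denom = rated @ np.abs(sim)
rated_pred = np.divide(actual @ sim, denom,
                       out=np.zeros_like(actual), where=denom > 0)

print(masked_mse(actual, dense_pred), masked_mse(actual, rated_pred))
```

Both variants still include each item's similarity to itself, so neither number is a held-out estimate of accuracy.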

- Predicting a single user's rating for a single movie

In [122]:
users_movies_df.iloc[176, 35]   # rating at row position 176 (user) and column position 35 (movie)

np.float64(5.0)

In [123]:
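# Series.argsort() keeps the original index labels, so in the output only the
# integer values (column positions, most similar first) are meaningful
topn_sim_idx = movies_sim_df.iloc[35].argsort()[::-1]
topn_sim_idx = topn_sim_idx[:20]
topn_sim_idx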

title
À nous la liberté (Freedom for Us) (1931)                 9169
¡Three Amigos! (1986)                                     5855
xXx: State of the Union (2005)                            4412
xXx (2002)                                                5658
eXistenZ (1999)                                             35
anohana: The Flower We Saw That Day - The Movie (2013)    6814
[REC]³ 3 Génesis (2012)                                   1098
[REC]² (2009)                                             4426
[REC] (2007)                                              8656
Zulu (2013)                                               3523
Zulu (1964)                                               8416
Zootopia (2016)                                           3522
Zoom (2015)                                               4129
Zoom (2006)                                               9491
Zoolander 2 (2016)                                        8495
Zoolander (2001)                                 
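
When reading an output like the one above, it helps to remember that `Series.argsort()` returns integer positions but keeps the original index labels, so the titles printed next to each position are not the titles of the similar movies. The sketch below shows this on a hypothetical three-element Series and one way to recover the matching labels.

```python
import pandas as pd

# Hypothetical similarity scores for three items.
s = pd.Series([0.2, 0.9, 0.5], index=['A', 'B', 'C'])

# Positions sorted by score, highest first. The index is just the
# original labels in reversed order, unrelated to the values.
pos = s.argsort()[::-1]
print(pos)

# Map the positions back to labels to get the actual ranking.
top_labels = s.index[pos.to_numpy()]
print(list(top_labels))
```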

In [124]:
users_movies_df.iloc[176, topn_sim_idx]

title
Very Potter Sequel, A (2010)                               5.0
Mr. Skeffington (1944)                                     5.0
Intolerance: Love's Struggle Throughout the Ages (1916)    3.5
Mildred Pierce (2011)                                      4.5
12 Angry Men (1997)                                        5.0
Princess and the Pirate, The (1944)                        4.0
Birth of a Nation, The (1915)                              2.0
Invisible Man Returns, The (1940)                          2.5
Thief of Bagdad, The (1924)                                3.5
Gold Diggers of 1935 (1935)                                3.5
The Blue Lagoon (1949)                                     3.0
Gold Diggers of 1933 (1933)                                3.5
Hunchback of Notre Dame, The (1923)                        3.0
Winchester '73 (1950)                                      4.0
The Great Train Robbery (1903)                             4.0
Snake Pit, The (1948)                            

In [125]:
# Predict the weighted rating for one user and one movie
def predict_ratings_by_user_movie(user_idx, movie_idx, topn_sim_idx):
    topn_sim = movies_sim[movie_idx, :][topn_sim_idx]                   # similarities to the top-n similar movies
    topn_rating = users_movies_df.values[user_idx, :][topn_sim_idx]     # the user's ratings for those top-n movies
    return topn_sim.dot(topn_rating) / np.abs(topn_sim).sum()

In [126]:
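# Predicted rating for the user at row position 176 and the movie at column position 35
# Note: the slice [:(-10 + 1):-1] keeps the 8 most similar positions;
# [:-(10 + 1):-1] would keep the top 10
topn_sim_idx = movies_sim_df.iloc[35].argsort()[:(-10 + 1):-1]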

predict_ratings_by_user_movie(176, 35, topn_sim_idx)

np.float64(3.9375)
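
The reversed-slice idiom used to pick the top-n positions is easy to get wrong by a sign. The sketch below, on a hypothetical score array, compares `[:-(n + 1):-1]` with `[:(-n + 1):-1]`.

```python
import numpy as np

# Hypothetical similarity scores.
scores = np.array([0.1, 0.9, 0.3, 0.7, 0.5, 0.2])
order = scores.argsort()            # ascending positions: [0 5 2 4 3 1]

n = 3
top_n = order[:-(n + 1):-1]         # last n positions, highest score first
too_few = order[:(-n + 1):-1]       # stops early: only n - 2 positions

print(top_n)    # [1 3 4]
print(too_few)  # [1]
```

With `n = 10` the second form yields 8 positions rather than 10.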

- Weighted rating prediction for all users and items, using only the top-n most similar movies for each movie

In [127]:
def predict_ratings(topn=20):
    pred = np.zeros(users_movies_df.shape)

    for movie_idx in range(pred.shape[1]):
        topn_sim_idx = movies_sim_df.iloc[movie_idx].argsort()[:-(topn + 1):-1]

        for user_idx in range(pred.shape[0]):
            pred[user_idx, movie_idx] = predict_ratings_by_user_movie(user_idx, movie_idx, topn_sim_idx)
            
    return pred
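
The nested loop calls the per-pair function once for every user and movie, which can be slow at this matrix size. A possible alternative, sketched below on hypothetical toy arrays, zeroes out everything except each item's top-n similarities and then uses a single matrix product. `predict_topn` is an illustrative name, and ties in similarity may be ordered differently from the loop version.

```python
import numpy as np

def predict_topn(ratings, sim, topn=2):
    """Top-n weighted rating prediction without the per-user loop."""
    n_items = sim.shape[0]
    # Positions of the `topn` most similar items for each item (row-wise).
    top_idx = np.argsort(sim, axis=1)[:, :-(topn + 1):-1]
    # Keep only those similarities; everything else becomes 0.
    sim_topn = np.zeros_like(sim)
    rows = np.arange(n_items)[:, None]
    sim_topn[rows, top_idx] = sim[rows, top_idx]
    # pred[u, i] = sum_j ratings[u, j] * sim_topn[i, j] / sum_j |sim_topn[i, j]|
    return ratings @ sim_topn.T / np.abs(sim_topn).sum(axis=1)

# Hypothetical toy data: 2 users x 4 items, symmetric similarity matrix.
ratings = np.array([[5.0, 0.0, 3.0, 0.0],
                    [0.0, 4.0, 0.0, 2.0]])
sim = np.array([[1.0, 0.8, 0.1, 0.3],
                [0.8, 1.0, 0.2, 0.4],
                [0.1, 0.2, 1.0, 0.9],
                [0.3, 0.4, 0.9, 1.0]])

pred = predict_topn(ratings, sim, topn=2)
print(pred.round(3))
```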

In [128]:
ratings_pred = predict_ratings(10)
ratings_pred.shape

(610, 9719)

In [129]:
ratings_pred_df = pd.DataFrame(ratings_pred, index=users_movies_df.index, columns=users_movies_df.columns)

get_mse(users_movies_df.values, ratings_pred_df.values)

2.9201587326345124

In [130]:
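# Caution: ratings_pred columns follow users_movies_df.columns (sorted titles), not movies_df row order, so these rows are not the top predictions
movies_df.iloc[(ratings_pred[170].argsort()[:-11:-1])]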

Unnamed: 0,movieId,title,genres
5780,31431,Boogeyman (2005),Drama|Horror|Mystery|Thriller
6808,60885,"Zone, The (La Zona) (2007)",Drama|Thriller
2940,3943,Bamboozled (2000),Comedy
1227,1629,"MatchMaker, The (1997)",Comedy|Romance
9485,169984,Alien: Covenant (2017),Action|Horror|Sci-Fi|Thriller
8001,97194,"Thing: Terror Takes Shape, The (1998)",Documentary
327,369,Mrs. Parker and the Vicious Circle (1994),Drama
8002,97225,Hotel Transylvania (2012),Animation|Children|Comedy
6944,65261,Ponyo (Gake no ue no Ponyo) (2008),Adventure|Animation|Children|Fantasy
9119,145839,Concussion (2015),Drama
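
Column positions taken from a prediction matrix refer to the columns of the pivoted table, which `pivot_table` sorts by title, not to the rows of the original movie table. The sketch below uses hypothetical toy frames (`movies`, `ratings`, `pivot`, `pred`) to show how the two lookups can disagree.

```python
import numpy as np
import pandas as pd

# Hypothetical toy tables; movie rows are in movieId order.
movies = pd.DataFrame({'movieId': [1, 2, 3],
                       'title': ['Zulu', 'Aladdin', 'Heat']})
ratings = pd.DataFrame({'userId': [1, 1, 2, 2],
                        'movieId': [1, 2, 2, 3],
                        'rating': [4.0, 5.0, 3.0, 2.0]})

pivot = ratings.merge(movies, on='movieId').pivot_table(
    values='rating', index='userId', columns='title', fill_value=0)
print(list(pivot.columns))              # titles in sorted order

# Hypothetical predictions with the same column order as `pivot`.
pred = np.array([[0.1, 0.9, 0.2],
                 [0.5, 0.3, 0.8]])
top_pos = pred[0].argsort()[::-1][:1]   # position of the best column

matching = pivot.columns[top_pos]           # title that the position refers to
mismatch = movies.iloc[top_pos]['title']    # row of movies at the same position
print(list(matching), list(mismatch))
```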


- Looking up the movies a user has not rated

In [131]:
def get_unseen_movies(user_idx):
    user_ratings_df = users_movies_df.iloc[user_idx]
    return user_ratings_df[user_ratings_df == 0]

'Boogeyman (2005)' in get_unseen_movies(170).index

True

- Recommending movies per user

In [132]:
def recommend_movies(user_idx, topn=10):
    unseen = get_unseen_movies(user_idx).index
    # .loc is label-based: row position user_idx corresponds to userId user_idx + 1
    temp_movie = ratings_pred_df.loc[user_idx + 1, unseen].sort_values(ascending=False)[:topn]
    user_rating_df = users_movies_df.loc[user_idx + 1, temp_movie.index]

    return pd.DataFrame({
        'title': temp_movie.index,
        'pred_rating': temp_movie.values,
        'user_rating': user_rating_df.values
    })
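
Mixing row positions (`iloc`) with `userId` labels (`loc`) relies on the assumption that user IDs are contiguous and start at 1. A label-only variant avoids that assumption. The sketch below is a hypothetical helper, `recommend_unseen`, run on toy frames; it is not the notebook's function.

```python
import pandas as pd

def recommend_unseen(ratings_df, pred_df, user_id, topn=2):
    """Top-n predicted ratings among the items `user_id` has not rated."""
    user_ratings = ratings_df.loc[user_id]
    unseen = user_ratings[user_ratings == 0].index
    return pred_df.loc[user_id, unseen].sort_values(ascending=False).head(topn)

# Hypothetical toy data: user IDs 10 and 20, four items, 0 means "not rated".
items = ['A', 'B', 'C', 'D']
toy_ratings = pd.DataFrame([[5.0, 0.0, 0.0, 3.0],
                            [0.0, 4.0, 2.0, 0.0]], index=[10, 20], columns=items)
toy_pred = pd.DataFrame([[4.0, 2.5, 3.5, 3.0],
                         [1.0, 3.0, 2.0, 4.5]], index=[10, 20], columns=items)

rec = recommend_unseen(toy_ratings, toy_pred, user_id=10)
print(rec)
```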

In [133]:
# Top 10 movies the user is predicted to rate highly
recommend_movies(343)

Unnamed: 0,title,pred_rating,user_rating
0,Men in Black (a.k.a. MIB) (1997),2.68864,0.0
1,Good Will Hunting (1997),2.370637,0.0
2,Life Is Beautiful (La Vita è bella) (1997),2.318414,0.0
3,Schindler's List (1993),2.193334,0.0
4,Star Wars: Episode V - The Empire Strikes Back...,2.121892,0.0
5,Titanic (1997),1.920122,0.0
6,Raiders of the Lost Ark (Indiana Jones and the...,1.890737,0.0
7,Apollo 13 (1995),1.833234,0.0
8,Toy Story (1995),1.827937,0.0
9,"Princess Bride, The (1987)",1.819425,0.0


In [134]:
# Top 10 movies by actual rating for userId 343 (note: recommend_movies(343) looks up userId 344 via user_idx + 1)
users_movies_df.loc[343].sort_values(ascending=False)[:10]

title
Goodfellas (1990)                                                     5.0
40-Year-Old Virgin, The (2005)                                        5.0
Memento (2000)                                                        5.0
Matrix, The (1999)                                                    5.0
Grave of the Fireflies (Hotaru no haka) (1988)                        5.0
Superbad (2007)                                                       5.0
Fight Club (1999)                                                     5.0
Assassination of Jesse James by the Coward Robert Ford, The (2007)    5.0
Infernal Affairs (Mou gaan dou) (2002)                                5.0
City of God (Cidade de Deus) (2002)                                   5.0
Name: 343, dtype: float64