- Item-based nearest-neighbor collaborative filtering
    - In addition to content information, rating information is required
    - The idea: "people who used this item also used these items"

# Loading the data
- User-movie rating matrix data
- The MovieLens dataset published by GroupLens (the reduced version is used here)
    - ml-latest-small.zip (size: 1 MB)
- The full-size version is available at https://grouplens.org/datasets/movielens/latest
    - ml-latest.zip (size: 265 MB)

In [2]:
import pandas as pd
import numpy as np

movies = pd.read_csv('./ml-latest-small/movies.csv')
ratings = pd.read_csv('./ml-latest-small/ratings.csv')
print(movies.shape, ratings.shape)
# 9742 movie records
# 100836 rating records (this does not mean 100836 users; one user can leave many ratings)

(9742, 3) (100836, 4)


In [3]:
movies.head()

,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [4]:
ratings.head()

,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [5]:
ratings['userId'].unique().size
# There are about 100,000 ratings, but they were written by only 610 distinct users.

610

# Data preprocessing

In [6]:
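# The data is currently in long format, one row per rating (user / movie ID / rating).
# -> It needs to be reshaped into a user-item rating matrix (rows: users, columns: movies).
# The user-item rating matrix is sparse: with 610 users and up to 9742 movies it has 610 rows and up to 9742 columns, and since no user has rated every movie, most cells will be missing (NaN).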

## Long-format data -> user-item (movie) rating matrix
- Movies a user has not rated are assigned NaN

In [7]:
# Drop the timestamp column from the ratings DataFrame
ratings = ratings[['userId', 'movieId', 'rating']]

# Build the user-item rating matrix by pivoting (rows: userId, columns: movieId, values: rating)
ratings_matrix = pd.pivot_table(ratings, index='userId', columns='movieId', values='rating')
print(ratings_matrix.shape) # (610, 9724)
# 610 users rated 9724 distinct movies (fewer than the 9742 in movies.csv, because only movies with at least one rating become columns)
ratings_matrix.head(3)

(610, 9724)


movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
userId,,,,,,,,,,,,,,,,,,,,,
1,4.0,,4.0,,,4.0,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
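
To put a number on how sparse such a matrix is, the sketch below computes the fraction of cells that actually hold a rating. It uses a tiny hypothetical `toy_ratings` frame rather than the real file, so the names and values are illustrative only.

```python
import pandas as pd

# Hypothetical toy data standing in for ratings.csv (3 users, 4 movies).
toy_ratings = pd.DataFrame({
    'userId':  [1, 1, 2, 3, 3],
    'movieId': [10, 20, 20, 30, 40],
    'rating':  [4.0, 3.0, 5.0, 2.0, 4.5],
})

toy_matrix = toy_ratings.pivot_table(index='userId', columns='movieId', values='rating')

# Fraction of (user, movie) cells that actually hold a rating.
density = toy_matrix.notna().sum().sum() / toy_matrix.size
print(toy_matrix.shape, round(density, 3))
```

Applied to the real data, 100836 ratings spread over 610 x 9724 cells works out to a density of roughly 1.7%.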


## Renaming columns: movie ID -> movie title

In [8]:
# Rename columns from movie ID to movie title
# Option 1: assign the title values from the movies DataFrame directly to the columns
# Option 2: merge (join the movies and ratings DataFrames on their shared key column, movieId)

In [25]:
rating_movies = pd.merge(ratings, movies, on='movieId')
ratings_matrix = pd.pivot_table(rating_movies, index='userId', columns='title', values='rating')
print(ratings_matrix.shape) # (610, 9719)
ratings_matrix.head(3)

(610, 9719)


title,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),'Tis the Season for Love (2015),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...,Zulu (2013),[REC] (2007),[REC]² (2009),[REC]³ 3 Génesis (2012),anohana: The Flower We Saw That Day - The Movie (2013),eXistenZ (1999),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
userId,,,,,,,,,,,,,,,,,,,,,
1,,,,,,,,,,,...,,,,,,,,,4.0,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
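
The column count drops from 9724 to 9719 once titles replace IDs. A likely explanation (an assumption, not checked in this notebook) is that a few distinct `movieId` values share the same title, and `pivot_table` collapses them into one column using its default `mean` aggregation. The sketch below reproduces that effect on hypothetical data.

```python
import pandas as pd

# Hypothetical data: movieId 1 and 2 share the same title.
toy_movies = pd.DataFrame({
    'movieId': [1, 2, 3],
    'title': ['Movie A (2000)', 'Movie A (2000)', 'Movie B (2001)'],
})
toy_ratings = pd.DataFrame({
    'userId':  [1, 1, 2],
    'movieId': [1, 2, 3],
    'rating':  [4.0, 2.0, 5.0],
})

merged = pd.merge(toy_ratings, toy_movies, on='movieId')
by_id = merged.pivot_table(index='userId', columns='movieId', values='rating')
by_title = merged.pivot_table(index='userId', columns='title', values='rating')

print(by_id.shape, by_title.shape)         # one column fewer when pivoting by title
print(by_title.loc[1, 'Movie A (2000)'])   # mean of 4.0 and 2.0 (default aggfunc)
```

On the real data, something like `movies['title'].duplicated().sum()` would be one way to check this assumption.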


## Replacing NaN with 0

In [10]:
ratings_matrix = ratings_matrix.fillna(0)
ratings_matrix.head(3)

title,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),'Tis the Season for Love (2015),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...,Zulu (2013),[REC] (2007),[REC]² (2009),[REC]³ 3 Génesis (2012),anohana: The Flower We Saw That Day - The Movie (2013),eXistenZ (1999),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
userId,,,,,,,,,,,,,,,,,,,,,
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


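# Building rating similarity from the rating matrix
- Goal: find movies whose **rating patterns** are similar
     - (rows: movies / columns: movies)
- Caution: the current user-item rating matrix (ratings_matrix) has one row per userId, so applying cosine similarity to it as-is yields **similarity between users**, not similarity between movies
     - (rows: users / columns: movies)
- Approach: transpose ratings_matrix so that movies become the rows, then apply cosine_similarity() to get movie-to-movie similarity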
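
The point about rows can be illustrated with a small sketch. The helper `cosine_sim_rows` is a hypothetical, hand-written stand-in for `cosine_similarity()` (NumPy only), and the 2 x 3 matrix is made up; the only thing to notice is how the output shape changes with the orientation of the input.

```python
import numpy as np

def cosine_sim_rows(x):
    """Cosine similarity between every pair of rows of a 2-D array."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    unit = x / norms            # assumes no all-zero rows
    return unit @ unit.T

# Hypothetical user-item matrix: 2 users (rows) x 3 movies (columns).
toy = np.array([[5.0, 4.0, 0.0],
                [4.0, 5.0, 1.0]])

user_sim = cosine_sim_rows(toy)      # rows are users  -> (2, 2)
item_sim = cosine_sim_rows(toy.T)    # rows are movies -> (3, 3)
print(user_sim.shape, item_sim.shape)
```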

In [11]:
ratings_matrix_T = ratings_matrix.transpose()
ratings_matrix_T.head(3)

userId,1,2,3,4,5,6,7,8,9,10,...,601,602,603,604,605,606,607,608,609,610
title,,,,,,,,,,,,,,,,,,,,,
'71 (2014),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0
'Hellboy': The Seeds of Creation (2004),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
'Round Midnight (1986),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [12]:
from sklearn.metrics.pairwise import cosine_similarity

# Item-to-item similarity
item_sim = cosine_similarity(ratings_matrix_T, ratings_matrix_T)
item_sim_df = pd.DataFrame(item_sim, index=ratings_matrix.columns, columns=ratings_matrix.columns)

print(item_sim_df.shape) # (9719, 9719), where 9719 is the number of movies
item_sim_df.head(3)

(9719, 9719)


title,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),'Tis the Season for Love (2015),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...,Zulu (2013),[REC] (2007),[REC]² (2009),[REC]³ 3 Génesis (2012),anohana: The Flower We Saw That Day - The Movie (2013),eXistenZ (1999),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
title,,,,,,,,,,,,,,,,,,,,,
'71 (2014),1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.141653,0.0,...,0.0,0.342055,0.543305,0.707107,0.0,0.0,0.139431,0.327327,0.0,0.0
'Hellboy': The Seeds of Creation (2004),0.0,1.0,0.707107,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
'Round Midnight (1986),0.0,0.707107,1.0,0.0,0.0,0.0,0.176777,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [13]:
# Extract the 5 movies whose ratings are most similar to Inception
item_sim_df['Inception (2010)'].sort_values(ascending=False)[1:6]
# [1:6]: position 0 is the movie itself (similarity 1.0), so it is skipped

title
Dark Knight, The (2008)          0.727263
Inglourious Basterds (2009)      0.646103
Shutter Island (2010)            0.617736
Dark Knight Rises, The (2012)    0.617504
Fight Club (1999)                0.615417
Name: Inception (2010), dtype: float64
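
The `[1:6]` slice relies on the movie itself sorting into position 0. If another movie happens to tie at similarity 1.0, that is not guaranteed. A minimal alternative sketch, using a hypothetical helper `top_similar` and a made-up 3 x 3 similarity matrix, drops the movie by label instead:

```python
import pandas as pd

def top_similar(sim_df, title, k=5):
    """Return the k titles most similar to `title`, excluding `title` itself."""
    return sim_df[title].drop(title).nlargest(k)

# Hypothetical similarity matrix in which 'A' and 'B' tie at 1.0.
titles = ['A', 'B', 'C']
toy_sim = pd.DataFrame(
    [[1.0, 1.0, 0.2],
     [1.0, 1.0, 0.5],
     [0.2, 0.5, 1.0]],
    index=titles, columns=titles,
)
result = top_similar(toy_sim, 'B', k=2)
print(result)
```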

In [14]:
# Goal: look at what other people who used the same items as me have also used, and from the items I have not used yet, recommend the ones with the highest ratings

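# Personalized movie recommendation with item-based nearest-neighbor collaborative filtering

## Computing personalized predicted ratings
- rating_arr : user-movie rating matrix (rows: users) -> shape (610, 9719) after the merge
- item_sim_arr : movie-to-movie rating similarity matrix (cosine similarity) -> shape (9719, 9719)
- n : number of most-similar movies to use (Top-n)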
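
Written as a formula, the prediction computed from these inputs is a similarity-weighted average. The notation is introduced here for illustration and does not appear in the original notebook: $R_{u,j}$ is the rating user $u$ gave movie $j$ (an entry of rating_arr), $S_{i,j}$ is the similarity between movies $i$ and $j$ (an entry of item_sim_arr), and $N_n(i)$ is the set of the $n$ movies most similar to movie $i$.

```latex
\hat{R}_{u,i} = \frac{\sum_{j \in N_n(i)} S_{i,j}\, R_{u,j}}{\sum_{j \in N_n(i)} \lvert S_{i,j} \rvert}
```

Because a movie's similarity to itself is 1.0, $N_n(i)$ normally includes $i$ itself.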

In [15]:
n = 2
a = [1,2,3,4,5,6,7,8,9,10]
np.argsort(a)[:-n-1:-1]

array([9, 8], dtype=int64)
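
The slice `[:-n-1:-1]` can be hard to read. The sketch below, on made-up scores, shows that it gives the same indices as reversing the sorted order and then taking the first n entries.

```python
import numpy as np

scores = np.array([0.1, 0.9, 0.4, 0.7])   # hypothetical similarity scores
n = 2

top_a = np.argsort(scores)[:-n-1:-1]       # the notebook's form
top_b = np.argsort(scores)[::-1][:n]       # reverse, then take the first n
print(top_a, top_b)
```

Both expressions return the indices of the n largest values, largest first.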

In [16]:
def predict_rating_topsim(rating_arr, item_sim_arr, n=20):
    
    # Initialize the prediction matrix with zeros, same shape as the user-item rating matrix (rating_arr)
    # zeros_like: returns an array of zeros with the same shape and dtype as the given array
    pred = np.zeros_like(rating_arr)
    
    # Loop over the columns of the user-item rating matrix (one iteration per movie)
    for col in range(rating_arr.shape[1]):
        # item_sim_arr: similarity matrix (rows: movies, columns: movies)
        # [:, col]: similarities between movie `col` and every movie
        # argsort: indices that would sort those similarities in ascending order
        # [:-n-1:-1]: walk backwards from the last index and take n entries, i.e. the n largest (for n=20, [:-21:-1])
        # top_n_items: indices of the n movies most similar to the current movie
        top_n_items = np.argsort(item_sim_arr[:, col])[:-n-1:-1]
        
        ## Compute the personalized predicted rating
        for row in range(rating_arr.shape[0]):
            # rating_arr: user-movie rating matrix; shape[0] is the number of users
            # item_sim_arr[col, :][top_n_items]: similarities between movie `col` and its Top-n movies
            # rating_arr[row, :][top_n_items]: the ratings user `row` gave those Top-n movies (0 where the user has not rated)
            # .dot: inner product of the two 1-D vectors (.T is a no-op on a 1-D array)
            pred[row,col] = item_sim_arr[col, :][top_n_items].dot(rating_arr[row,:][top_n_items].T)
            
            # Normalize by the sum of absolute similarities (abs: absolute value), making this a weighted average
            pred[row,col] /= np.sum(np.abs(item_sim_arr[col,:][top_n_items]))
            
    return pred
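
The nested loop visits every (user, movie) pair, which is slow for a 610 x 9719 matrix. One possible speed-up, sketched below under the hypothetical name `predict_rating_topsim_vec`, keeps the loop over movies but replaces the per-user loop with a single matrix-vector product. It is meant as an illustration of the same Top-n weighted average, not as the notebook's own implementation, and the toy inputs are made up.

```python
import numpy as np

def predict_rating_topsim_vec(rating_arr, item_sim_arr, n=20):
    """Top-n weighted average with the per-user loop replaced by a matrix product."""
    pred = np.zeros(rating_arr.shape, dtype=float)
    for col in range(rating_arr.shape[1]):
        # Indices of the n movies most similar to movie `col`.
        top_n_items = np.argsort(item_sim_arr[:, col])[:-n-1:-1]
        weights = item_sim_arr[col, top_n_items]
        # (users x n) @ (n,) -> one predicted rating per user for this movie.
        pred[:, col] = rating_arr[:, top_n_items] @ weights / np.sum(np.abs(weights))
    return pred

# Hypothetical inputs: 2 users x 3 movies, and a 3 x 3 similarity matrix.
toy_ratings = np.array([[5.0, 0.0, 3.0],
                        [0.0, 4.0, 0.0]])
toy_sim = np.array([[1.0, 0.2, 0.8],
                    [0.2, 1.0, 0.1],
                    [0.8, 0.1, 1.0]])
toy_pred = predict_rating_topsim_vec(toy_ratings, toy_sim, n=2)
print(toy_pred.round(3))
```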

In [17]:
ratings_pred = predict_rating_topsim(ratings_matrix.values, item_sim_df.values, n=20)
# ratings_matrix : user-item rating matrix
# item_sim_df : item-to-item rating similarity matrix

ratings_pred_matrix = pd.DataFrame(ratings_pred, index=ratings_matrix.index, columns=ratings_matrix.columns)

In [18]:
ratings_pred_matrix.shape
# 610 rows: users
# 9719 columns: movies

(610, 9719)

In [19]:
np.max(ratings_pred_matrix, axis=1)[:1], np.min(ratings_pred_matrix, axis=1)[:1]

# Maximum / minimum of each row (only the first user is shown)

(userId
 1    4.595776
 dtype: float64,
 userId
 1    0.0
 dtype: float64)

In [20]:
# Of the first user's 9719 predicted ratings, count how many are non-zero
user_0 = ratings_pred_matrix.values[0]
len(user_0[user_0 != 0])

1133

## Returning the list of unrated movies
- Personalized recommendation via nearest-neighbor collaborative filtering recommends movies the user has not yet watched

In [21]:
def get_unseen_movies(ratings_matrix, userId):
    # ratings_matrix: the user-item rating matrix built by pivoting
    # (rows: userId, columns: movie title, values: rating), with NaN already filled with 0
    
    # Select this user's ratings
    user_rating = ratings_matrix.loc[userId, :]
    # .loc is label-based: the index of ratings_matrix holds userId values,
    # so the userId itself is the lookup key (not a 0-based position)
    
    # Titles the user has already rated (no real rating is 0, so > 0 means rated)
    already_seen = user_rating[user_rating > 0].index.tolist()
    # user_rating is a Series whose index holds the movie titles
    
    # All movie titles
    movies_list = ratings_matrix.columns.tolist()
    
    # Titles the user has not rated yet, as a list in column order
    # (a list rather than a set, because recent pandas versions do not accept a set as a .loc indexer)
    seen = set(already_seen)
    unseen_list = [movie for movie in movies_list if movie not in seen]
    
    # Note: the same list can be obtained in one step with
    # user_rating[user_rating == 0].index.tolist()
    
    return unseen_list

## Recommending from the predicted ratings of a user's unwatched movies

In [22]:
def recomm_movie_by_userId(pred_df, userId, unseen_list, top_n=10):
    recomm_movies = pred_df.loc[userId, unseen_list].sort_values(ascending=False)[:top_n]
    return recomm_movies
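
To see the two steps (filter to unrated titles, then rank by predicted rating) in isolation, here is a minimal sketch on hypothetical data. The frames `toy_actual` and `toy_pred` and their values are made up for illustration.

```python
import pandas as pd

# Hypothetical actual ratings (0 = not rated) and predicted ratings for user 9.
titles = ['A', 'B', 'C', 'D']
toy_actual = pd.DataFrame([[4.0, 0.0, 0.0, 0.0]], index=[9], columns=titles)
toy_pred = pd.DataFrame([[3.9, 0.4, 2.5, 1.1]], index=[9], columns=titles)

user_row = toy_actual.loc[9]
unseen = user_row[user_row == 0].index.tolist()    # titles the user has not rated
top2 = toy_pred.loc[9, unseen].sort_values(ascending=False).head(2)
print(top2)
```

Title 'A' has the highest predicted rating but is excluded because the user already rated it.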

In [23]:
unseen_list = get_unseen_movies(ratings_matrix, 9)
recomm_movies = recomm_movie_by_userId(ratings_pred_matrix, 9, unseen_list, top_n=10)

recomm_movies = pd.DataFrame(recomm_movies.values, index=recomm_movies.index, columns=['pred_score'])
recomm_movies

# Among the movies watched by other users who watched the same movies as user 9, recommend those with the highest predicted rating for user 9, in descending order



title,pred_score
Shrek (2001),0.866202
Spider-Man (2002),0.857854
"Last Samurai, The (2003)",0.817473
Indiana Jones and the Temple of Doom (1984),0.816626
"Matrix Reloaded, The (2003)",0.80099
Harry Potter and the Sorcerer's Stone (a.k.a. Harry Potter and the Philosopher's Stone) (2001),0.765159
Gladiator (2000),0.740956
"Matrix, The (1999)",0.732693
Pirates of the Caribbean: The Curse of the Black Pearl (2003),0.689591
"Lord of the Rings: The Return of the King, The (2003)",0.676711
