## Ch02 Basic Recommender Systems

[2.1 Reading the Data](#2.1-데이터-읽기)  
[2.2 Popular-Item Approach (Best-Seller Recommendation)](#2.2-best-seller-추천)  
[2.3 Measuring Recommender Accuracy](#2.3-추천-시스템의-정확도-측정)  
[2.4 Recommendation by User Group](#2.4-사용자-집단별-추천)  
[2.5 Content-Based Filtering Recommendation](#2.5-내용-기반-필터링-추천)

### 2.1 Reading the Data

This chapter uses the publicly available MovieLens 100K dataset.
> u.user : user data  
u.item : movie data  
u.data : movie rating data


In [9]:
import pandas as pd

# u.user data
u_cols = ['user_id', 'age', 'sex', 'occupation', 'zip_code']
users = pd.read_csv('./data/u.user', sep = '|', names = u_cols, encoding='latin-1')
users = users.set_index('user_id')
users.head()

| user_id | age | sex | occupation | zip_code |
|---|---|---|---|---|
| 1 | 24 | M | technician | 85711 |
| 2 | 53 | F | other | 94043 |
| 3 | 23 | M | writer | 32067 |
| 4 | 24 | M | technician | 43537 |
| 5 | 33 | F | other | 15213 |


In [10]:
# u.item data
i_cols = ['movie_id', 'title', 'release date', 'video release date', 'IMDB URL', 
          'unknown', 'Action', 'Adventure', 'Animation', 'Children\'s', 'Comedy', 
          'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror', 
          'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']
movies = pd.read_csv('./data/u.item', sep='|', names=i_cols, encoding='latin-1')
movies = movies.set_index('movie_id')
movies.head()

| movie_id | title | release date | video release date | IMDB URL | unknown | Action | Adventure | Animation | Children's | Comedy | ... | Fantasy | Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi | Thriller | War | Western |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Toy Story (1995) | 01-Jan-1995 | NaN | http://us.imdb.com/M/title-exact?Toy%20Story%2... | 0 | 0 | 0 | 1 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | GoldenEye (1995) | 01-Jan-1995 | NaN | http://us.imdb.com/M/title-exact?GoldenEye%20(... | 0 | 1 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 3 | Four Rooms (1995) | 01-Jan-1995 | NaN | http://us.imdb.com/M/title-exact?Four%20Rooms%... | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | Get Shorty (1995) | 01-Jan-1995 | NaN | http://us.imdb.com/M/title-exact?Get%20Shorty%... | 0 | 1 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | Copycat (1995) | 01-Jan-1995 | NaN | http://us.imdb.com/M/title-exact?Copycat%20(1995) | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |


In [11]:
# u.data data
r_cols = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_csv('./data/u.data', sep='\t', names=r_cols, encoding='latin-1') 
ratings = ratings.set_index('user_id')
ratings.head()

| user_id | movie_id | rating | timestamp |
|---|---|---|---|
| 196 | 242 | 3 | 881250949 |
| 186 | 302 | 3 | 891717742 |
| 22 | 377 | 1 | 878887116 |
| 244 | 51 | 2 | 880606923 |
| 166 | 346 | 1 | 886397596 |


### 2.2 Best-Seller Recommendation

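The simplest recommendation is to recommend the same items to everyone.

In that case, recommending the most popular items is the most reasonable choice.

To recommend popular items, average the ratings each item received and recommend the items in descending order of mean rating.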

In [13]:
# Best-seller recommendation: return the titles of the n_items movies
# with the highest mean rating
def recom_movie1(n_items=5):
    movie_sort = movie_mean.sort_values(ascending=False)[:n_items]
    recom_movies = movies.loc[movie_sort.index]
    recommendations = recom_movies['title']
    return recommendations

# Mean rating of each movie over all users
movie_mean = ratings.groupby(['movie_id'])['rating'].mean()
recom_movie1(5)

movie_id
814                         Great Day in Harlem, A (1994)
1599                        Someone Else's America (1995)
1201           Marlene Dietrich: Shadow and Light (1996) 
1122                       They Made Me a Criminal (1939)
1653    Entertaining Angels: The Dorothy Day Story (1996)
Name: title, dtype: object
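
To see what ranking by mean rating does on a small scale, here is a minimal sketch on made-up data. The names `toy_ratings` and `toy_titles` and all the values are illustrative assumptions, not part of MovieLens.

```python
import pandas as pd

# Made-up ratings: movie 10 has three ratings, movie 30 has a single 5-star rating
toy_ratings = pd.DataFrame({
    'user_id':  [1, 2, 3, 1, 2, 3],
    'movie_id': [10, 10, 10, 20, 20, 30],
    'rating':   [5, 4, 4, 3, 2, 5],
})
toy_titles = pd.Series({10: 'Movie A', 20: 'Movie B', 30: 'Movie C'}, name='title')

# Mean rating per movie, highest first
toy_mean = toy_ratings.groupby('movie_id')['rating'].mean()
top = toy_mean.sort_values(ascending=False).head(2)

# Map the top movie ids back to titles
top_titles = toy_titles.loc[top.index]
print(top_titles.tolist())  # ['Movie C', 'Movie A']
```

In this toy data the movie with a single 5-star rating ranks first, because the ranking looks only at the mean and not at how many ratings each movie received.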

### 2.3 Measuring Recommender Accuracy

There are several ways to measure accuracy; a representative one is RMSE (root mean squared error).  
(See Section 3.9 for details.)
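
As a small worked illustration, RMSE is commonly defined as $\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$, where $y_i$ is an actual rating and $\hat{y}_i$ the predicted one. The sketch below evaluates it in plain Python on made-up ratings; the helper name `rmse` and the numbers are assumptions for illustration.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up actual and predicted ratings
actual = [3, 4, 5]
predicted = [2, 4, 7]

# Errors are 1, 0, -2; squared errors are 1, 0, 4; their mean is 5/3
toy_rmse = rmse(actual, predicted)
print(round(toy_rmse, 4))  # 1.291
```

Because the errors are squared before averaging, the single error of 2 contributes four times as much as the error of 1.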

The code below computes the RMSE of the predictions obtained with the best-seller method.

In [14]:
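# Compute accuracy (RMSE)
import numpy as np
def RMSE(y_true, y_pred):
    return np.sqrt(np.mean((np.array(y_true) - np.array(y_pred))**2))

# RMSE of the best-seller prediction for each user, then averaged over users
rmse = []
for user in set(ratings.index):
    # The user's actual ratings
    y_true = ratings.loc[user]['rating']
    # Mean rating of each movie the user rated
    y_pred = movie_mean[ratings.loc[user]['movie_id']]
    accuracy = RMSE(y_true, y_pred)
    rmse.append(accuracy)

print(np.mean(rmse))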

0.996007224010567


### 2.4 Recommendation by User Group

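A slightly more advanced approach than best-seller is to divide users into subgroups with similar characteristics and recommend based on each group's mean ratings.

This rests on the assumption that people of the same gender, occupation, or age group have similar tastes, so group-level averages should give more accurate recommendations.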
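
The idea can be sketched on made-up data before applying it to MovieLens. The sketch below groups a tiny ratings table by movie and gender and predicts the group mean, falling back to a default when the group has no ratings. The names `toy`, `group_mean`, and `predict_by_group` and the default of 3.0 are assumptions for illustration.

```python
import pandas as pd

# Made-up training ratings, each joined with the rater's sex
toy = pd.DataFrame({
    'movie_id': [1, 1, 1, 1, 2],
    'sex':      ['F', 'F', 'M', 'M', 'M'],
    'rating':   [5, 4, 2, 3, 4],
})

# Mean rating for every (movie, sex) group
group_mean = toy.groupby(['movie_id', 'sex'])['rating'].mean()

def predict_by_group(movie_id, sex, default=3.0):
    """Return the group's mean rating, or a default when the group is empty."""
    key = (movie_id, sex)
    if key in group_mean.index:
        return float(group_mean.loc[key])
    return default

print(predict_by_group(1, 'F'))  # 4.5
print(predict_by_group(1, 'M'))  # 2.5
print(predict_by_group(2, 'F'))  # 3.0 (no female ratings for movie 2)
```

Finer groups follow a user's characteristics more closely, but each group then contains fewer ratings, so the fallback is used more often.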

In [53]:
# Read the data
u_cols = ['user_id', 'age', 'sex', 'occupation', 'zip_code']
users = pd.read_csv('./data/u.user', sep='|', names=u_cols, encoding='latin-1')

i_cols = ['movie_id', 'title', 'release date', 'video release date', 'IMDB URL', 'unknown', 
          'Action', 'Adventure', 'Animation', 'Children\'s', 'Comedy', 'Crime', 'Documentary', 
          'Drama', 'Fantasy', 'Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 
          'Thriller', 'War', 'Western']
movies = pd.read_csv('./data/u.item', sep='|', names=i_cols, encoding='latin-1')

r_cols = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_csv('./data/u.data', sep='\t', names=r_cols, encoding='latin-1')

# Drop the timestamp column
ratings = ratings.drop('timestamp', axis=1)

# Keep only the movie ID and title columns
movies = movies[['movie_id', 'title']]

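To evaluate the predictions more reliably, split the data into train and test sets. Stratifying on `user_id` splits each user's ratings in the same 75/25 proportion.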

In [54]:
from sklearn.model_selection import train_test_split
x = ratings.copy()
y = ratings['user_id']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.25, stratify=y)

In [55]:
# Function that computes accuracy (RMSE)
def RMSE(y_true, y_pred):
    return np.sqrt(np.mean((np.array(y_true) - np.array(y_pred))**2))

# Function that computes the RMSE of a given model on the test set
def score(model):
    id_pairs = zip(x_test['user_id'], x_test['movie_id'])
    y_pred = np.array([model(user, movie) for (user, movie) in id_pairs])
    y_true = np.array(x_test['rating'])
    return RMSE(y_true, y_pred)

# Build the full user-by-movie rating matrix from the train data
rating_matrix = x_train.pivot(index = 'user_id', columns = 'movie_id', values='rating')
rating_matrix

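| user_id | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 1669 | 1670 | 1672 | 1675 | 1676 | 1677 | 1679 | 1680 | 1681 | 1682 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | NaN | 3.0 | NaN | 3.0 | 3.0 | 5.0 | 4.0 | 1.0 | NaN | 3.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 4.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 5 | NaN | 3.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 939 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 5.0 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 940 | NaN | NaN | NaN | 2.0 | NaN | NaN | 4.0 | NaN | 3.0 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 941 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 942 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |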


In [56]:
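# Baseline model: predict each movie's mean rating over all users in the train set
def best_seller(user_id, movie_id):
    try:
        rating = train_mean[movie_id]
    except KeyError:
        # The movie has no ratings in the train set, so fall back to 3.0
        rating = 3.0
    return rating

train_mean = x_train.groupby(['movie_id'])['rating'].mean()
score(best_seller)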

1.0297239149048725

In [60]:
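# Merge the train ratings with the user data (joined on user_id)
merged_ratings = pd.merge(x_train, users)
# Index users by user_id so that users.loc[user_id] works in the functions below
users = users.set_index('user_id')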

In [62]:
# Mean rating per movie and gender
g_mean = merged_ratings[['movie_id', 'sex', 'rating']].groupby(['movie_id', 'sex'])['rating'].mean()
g_mean

movie_id  sex
1         F      3.923077
          M      3.890688
2         F      3.235294
          M      3.238095
3         F      2.909091
                   ...   
1677      F      3.000000
1679      M      3.000000
1680      M      2.000000
1681      M      3.000000
1682      M      3.000000
Name: rating, Length: 3030, dtype: float64

In [63]:
### Gender-based recommendation ###
# Return the mean rating of the user's gender group as the prediction
def cf_gender(user_id, movie_id):
    if movie_id in rating_matrix:
        gender = users.loc[user_id]['sex']
        if gender in g_mean[movie_id]:
            gender_rating = g_mean[movie_id][gender]
        else:
            gender_rating = 3.0
    else:
        gender_rating = 3.0
    return gender_rating

score(cf_gender)

1.0380771133556892

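Exercises
1. Modify the gender-based recommendation code to write a function that predicts ratings by grouping users by occupation, and write code that computes its accuracy.

2. Write a function that predicts ratings by grouping users by both gender and occupation, and write code that computes its accuracy.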

In [68]:
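# Exercise 1
o_mean = merged_ratings[['movie_id', 'occupation', 'rating']].groupby(['movie_id', 'occupation'])['rating'].mean()

### Occupation-based recommendation ###
# Return the mean rating of the user's occupation group as the prediction
def cf_occupation(user_id, movie_id):
    if movie_id in rating_matrix:
        occupation = users.loc[user_id]['occupation']
        # Check membership in o_mean, the occupation-level means
        if occupation in o_mean[movie_id]:
            occupation_rating = o_mean[movie_id][occupation]
        else:
            occupation_rating = 3.0
    else:
        occupation_rating = 3.0
    return occupation_rating

score(cf_occupation)

1.2480544859900948

Note: this recorded value comes from a run in which the membership test used `g_mean` (whose keys are 'F' and 'M') instead of `o_mean`, so every prediction fell back to 3.0. With the check on `o_mean` as shown above, the score will differ.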

In [117]:
# Exercise 2
g_o_mean = merged_ratings[['movie_id', 'occupation', 'rating', 'sex']].groupby(['movie_id', 'occupation', 'sex'])['rating'].mean()

### Recommendation using both gender and occupation ###
# Return the mean rating of the user's (occupation, gender) group as the prediction
def cf_occu_gen(user_id, movie_id):
    if movie_id in rating_matrix:
        occu_gen = tuple(users.loc[user_id][['occupation', 'sex']].values)
        if occu_gen in g_o_mean[movie_id].index:
            occu_gen_rating = g_o_mean[movie_id][occu_gen[0]][occu_gen[1]]
        else:
            occu_gen_rating = 3.0
    else:
        occu_gen_rating = 3.0
    return occu_gen_rating


score(cf_occu_gen)

1.1488130155809948

### 2.5 Content-Based Filtering Recommendation

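Content-based filtering (CB) analyzes the content of items, computes the similarity between items, and recommends on that basis.
1. Compute the similarity between every pair of items.
2. Select the items that the target user prefers.
3. Find the N items most similar to the items selected above.
4. Recommend these N items to the user.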
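
The four steps can be sketched on a hand-made similarity matrix. In the sketch below, step 1 is assumed to be already done (the similarity values are made up), and the names `sim` and `recommend_similar` are hypothetical.

```python
import pandas as pd

# Step 1 (assumed already done): a hand-made item-item similarity matrix
items = ['A', 'B', 'C', 'D']
sim = pd.DataFrame(
    [[1.0, 0.8, 0.1, 0.3],
     [0.8, 1.0, 0.2, 0.4],
     [0.1, 0.2, 1.0, 0.7],
     [0.3, 0.4, 0.7, 1.0]],
    index=items, columns=items,
)

def recommend_similar(liked_item, n):
    # Step 3: rank the other items by similarity to the liked item
    scores = sim[liked_item].drop(liked_item).sort_values(ascending=False)
    # Step 4: return the top n items as the recommendation
    return scores.head(n).index.tolist()

# Step 2: suppose the user's preferred item is 'C'
recs = recommend_similar('C', 2)
print(recs)  # ['D', 'B']
```

Dropping the liked item before ranking keeps an item from being recommended as most similar to itself.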

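Issues that arise at each step  

__Step 1__
+ How should the similarity between items be computed?
    + For text, tf-idf vectors can be used (this chapter compares them with cosine similarity)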
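
To make the tf-idf idea concrete, here is a minimal pure-Python sketch on made-up descriptions. It uses the textbook weighting (term count times log(N / df)), which is an assumption for illustration and not the exact smoothed and normalized formula that scikit-learn's `TfidfVectorizer` applies. The names `docs`, `tfidf_vector`, and `cosine` are hypothetical.

```python
import math
from collections import Counter

# Made-up item descriptions
docs = {
    'A': 'lion cub king jungle',
    'B': 'lion king pride',
    'C': 'space robot war',
}

tokens = {name: text.split() for name, text in docs.items()}
n_docs = len(tokens)
# Document frequency: the number of descriptions each word appears in
df = Counter(word for words in tokens.values() for word in set(words))

def tfidf_vector(words):
    """Textbook tf-idf: term count times log(N / df), stored as a sparse dict."""
    counts = Counter(words)
    return {w: c * math.log(n_docs / df[w]) for w, c in counts.items()}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

vectors = {name: tfidf_vector(words) for name, words in tokens.items()}
sim_ab = cosine(vectors['A'], vectors['B'])
sim_ac = cosine(vectors['A'], vectors['C'])
print(sim_ab > sim_ac)  # True: A and B share 'lion' and 'king', A and C share nothing
```

Words that appear in fewer descriptions get a larger weight, so rare shared words raise the similarity more than common ones.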

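__Step 2__
+ How many of the user's preferred items should be selected?
    + If several items are selected, what number is optimal?
+ If several items are selected, each has its own set of highly similar items; how should these be combined?
    + Merge all the similar items into a single list
    + Order the items by their scores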
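
The two ways of combining candidates can be sketched as follows. The similarity values are made up, and the details (taking the top `k` per liked item in the first option, pooling by the best score in the second) are assumptions chosen for illustration; other choices, such as summing scores, are equally possible.

```python
# Made-up similarity scores of candidate items to two items the user likes
similar_to = {
    'liked_1': {'X': 0.9, 'Y': 0.4, 'Z': 0.2},
    'liked_2': {'W': 0.95, 'Y': 0.7, 'Z': 0.6},
}

def merge_top_lists(similar_to, k):
    """Option 1: take each liked item's top-k list and concatenate, skipping repeats."""
    merged = []
    for scores in similar_to.values():
        top_k = sorted(scores, key=scores.get, reverse=True)[:k]
        for item in top_k:
            if item not in merged:
                merged.append(item)
    return merged

def rank_by_score(similar_to, n):
    """Option 2: pool all candidates and order them by their best score."""
    best = {}
    for scores in similar_to.values():
        for item, score in scores.items():
            best[item] = max(score, best.get(item, 0.0))
    return sorted(best, key=best.get, reverse=True)[:n]

merged = merge_top_lists(similar_to, 2)
ranked = rank_by_score(similar_to, 3)
print(merged)  # ['X', 'Y', 'W']
print(ranked)  # ['W', 'X', 'Y']
```

The first option keeps the candidates grouped by the liked item they came from, while the second lets a very similar candidate from any liked item rise to the top.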

The usual approach is to test the alternatives on real data and choose the one that gives the best results.  
The simple CB recommender implemented below uses only one preferred item per user.

In [133]:
# Read the data
movies = pd.read_csv('./data/movies_metadata.csv', low_memory=False)
movies = movies[['id', 'title', 'overview']]
print(movies.shape)
movies.head()

(45466, 3)


|  | id | title | overview |
|---|---|---|---|
| 0 | 862 | Toy Story | Led by Woody, Andy's toys live happily in his ... |
| 1 | 8844 | Jumanji | When siblings Judy and Peter discover an encha... |
| 2 | 15602 | Grumpier Old Men | A family wedding reignites the ancient feud be... |
| 3 | 31357 | Waiting to Exhale | Cheated on, mistreated and stepped on, the wom... |
| 4 | 11862 | Father of the Bride Part II | Just when George Banks has recovered from his ... |


In [134]:
# Preprocessing: drop rows with missing values
movies = movies.dropna()
movies.shape

(44506, 3)

In [136]:
# Compute tf-idf with English stop words removed
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(movies['overview'])

# Compute cosine similarity
# Note: this builds a dense 44506 x 44506 matrix, which needs a large amount of memory
from sklearn.metrics.pairwise import cosine_similarity
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
cosine_sim = pd.DataFrame(cosine_sim, index = movies.index, columns = movies.index)
cosine_sim

|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 45456 | 45457 | 45458 | 45459 | 45460 | 45461 | 45462 | 45463 | 45464 | 45465 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.000000 | 0.015021 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | ... | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.005933 | 0.000000 |
| 1 | 0.015021 | 1.000000 | 0.046798 | 0.000000 | 0.000000 | 0.050093 | 0.000000 | 0.000000 | 0.102340 | 0.00000 | ... | 0.0 | 0.0 | 0.0 | 0.011230 | 0.0 | 0.000000 | 0.066913 | 0.0 | 0.021957 | 0.009242 |
| 2 | 0.000000 | 0.046798 | 1.000000 | 0.000000 | 0.025070 | 0.000000 | 0.000000 | 0.006362 | 0.000000 | 0.00000 | ... | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.014013 | 0.000000 |
| 3 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.007149 | 0.000000 | 0.008905 | 0.000000 | 0.00000 | ... | 0.0 | 0.0 | 0.0 | 0.021466 | 0.0 | 0.026269 | 0.000000 | 0.0 | 0.009493 | 0.016349 |
| 4 | 0.000000 | 0.000000 | 0.025070 | 0.000000 | 1.000000 | 0.000000 | 0.030209 | 0.000000 | 0.032657 | 0.00000 | ... | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.006974 | 0.000000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 45461 | 0.000000 | 0.000000 | 0.000000 | 0.026269 | 0.000000 | 0.025293 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | ... | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 1.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 |
| 45462 | 0.000000 | 0.066913 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.04974 | ... | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 1.000000 | 0.0 | 0.000000 | 0.000000 |
| 45463 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.038274 | 0.00000 | ... | 0.0 | 0.0 | 0.0 | 0.031389 | 0.0 | 0.000000 | 0.000000 | 1.0 | 0.000000 | 0.000000 |
| 45464 | 0.005933 | 0.021957 | 0.014013 | 0.009493 | 0.006974 | 0.000000 | 0.011487 | 0.005248 | 0.000000 | 0.00000 | ... | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 1.000000 | 0.000000 |


In [141]:
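# Build a reverse lookup from title to row index
indices = pd.Series(movies.index, index = movies['title'])

# Return recommended movies for a given movie title
def content_recommender(title, n_of_recomm):
    # Look up the movie's index from its title (assumes the title is unique)
    idx = indices[title]
    # Similarity of the given movie to every movie
    sim_scores = cosine_sim[idx]
    # Sort in descending order and skip the first entry, which is the movie itself
    sim_scores = sim_scores.sort_values(ascending=False)[1:n_of_recomm+1]
    # Return the movie titles
    return movies.loc[sim_scores.index]['title']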

In [143]:
# Get recommendations
print(content_recommender('The Lion King', 5))

34682    How the Lion Cub and the Turtle Sang a Song
9353                                The Lion King 1½
9115                  The Lion King 2: Simba's Pride
42829                                           Prey
25654                                 Fearless Fagan
Name: title, dtype: object


In [144]:
print(content_recommender('The Dark Knight Rises', 10))

12481                                      The Dark Knight
150                                         Batman Forever
1328                                        Batman Returns
15511                           Batman: Under the Red Hood
585                                                 Batman
21194    Batman Unmasked: The Psychology of the Dark Kn...
9230                    Batman Beyond: Return of the Joker
18035                                     Batman: Year One
19792              Batman: The Dark Knight Returns, Part 1
3095                          Batman: Mask of the Phantasm
Name: title, dtype: object
