# 01. 추천 시스템의 개요와 배경
### * 추천시스템 개요
### * 온라인 스토어의 필수 요소, 추천 시스템
### * 추천 시스템 유형
- 콘텐츠 기반 필터링
- 협업 필터링: Nearest Neighbor/ Latent Factor

# 02. 콘텐츠 기반 필터링 추천 시스템
- 특정한 아이템을 매우 선호하는 경우, 그 아이템과 비슷한 콘텐츠를 가진 다른 아이템을 추천하는 방식

# 03. 최근접 이웃 협업 필터링
- 축척된 사용자 행동 데이터를 기반으로 사용자가 아직 평가하지 않은 아이템을 __예측평가__해 추천하는 방식
- 사용자-아이템 평점 매트릭스: 다차원 행렬, 희소 행렬
- 메모리 협업 필터링: 사용자 기반/ 아이템 기반 -> __주로 아이템 기반 협업 필터링 사용__ _(유사도 측정 방법: 코사인 유사도)_ 

# 04. 잠재 요인 협업 필터링
### * 잠재 요인 협업 필터링 이해
- 사용자-아이템 평점 매트릭스 속에 숨어 있는 잠재 요인을 추출해 추천 예측을 할 수 있게 하는 기법
- SVD 등 차원 감소 기법으로 분해 하는 과정에서 잠재 요인 추출 -> __## 행렬 분해 ##__

### * 행렬 분해의 이해
### * 확률적 경사 하강법을 이용한 행렬 분해
1. P와 Q를 임의의 값을 가진 행렬로 설정
2. P와 Q.T값을 곱해 예측 R 행렬을 계산하고 예측 R 행렬과 실제 R 행렬에 해당하는 오류 값 계산
3. 이 오류 값을 최소화할 수 있도록 P와 Q 행렬을 적적할 값으로 업데이트
4. 만족할 만한 오류 값을 가질 때까지 반복해 P와 Q값을 업데이트해 근사화

In [27]:
import numpy as np

R=np.array([[4, np.NaN, np.NaN, 2, np.NaN],
           [np.NaN, 5, np.NaN, 3, 1],
          [np.NaN, np.NaN, 3, 4, 4],
          [5, 2, 1, 2, np.NaN]])
num_users, num_items = R.shape
K=3

np.random.seed(1)
P = np.random.normal(scale=1./K, size=(num_users, K))
Q = np.random.normal(scale=1./K, size=(num_items, K))

In [28]:
from sklearn.metrics import mean_squared_error

def get_rmse(R, P, Q, non_zeros):
    error=0
    full_pred_matrix=np.dot(P, Q.T)
    x_non_zero_ind=[non_zero[0] for non_zero in non_zeros]
    y_non_zero_ind=[non_zero[1] for non_zero in non_zeros]
    R_non_zeros=R[x_non_zero_ind, y_non_zero_ind]
    full_pred_matrix_non_zeros=full_pred_matrix[x_non_zero_ind, y_non_zero_ind]
    mse=mean_squared_error(R_non_zeros, full_pred_matrix_non_zeros)
    rmse=np.sqrt(mse)
    
    return(rmse)

In [5]:
non_zeros=[(i,j,R[i,j]) for i in range(num_users) for j in range(num_items) if R[i,j]>0]

steps=1000
learning_rate=0.01
r_lambda=0.01

for step in range(steps):
    for i, j, r in non_zeros:
        eji=r-np.dot(P[i,:], Q[j,:].T)
        P[i,:]=P[i,:]+learning_rate*(eji*Q[j,:]-r_lambda*P[i,:])
        Q[j,:]=Q[j,:]+learning_rate*(eji*P[i,:]-r_lambda*Q[j,:])
        rmse=get_rmse(R, P, Q, non_zeros)
        if(step%50)==0:
            print('### iteration step :', step, 'rmse :', rmse)

### iteration step : 0 rmse : 0.01645276471505709
### iteration step : 0 rmse : 0.016540891411674398
### iteration step : 0 rmse : 0.01639737822768701
### iteration step : 0 rmse : 0.016327165877973376
### iteration step : 0 rmse : 0.016442184813351968
### iteration step : 0 rmse : 0.016456959294961895
### iteration step : 0 rmse : 0.016281771945551853
### iteration step : 0 rmse : 0.016258601686471676
### iteration step : 0 rmse : 0.01593521639847534
### iteration step : 0 rmse : 0.016070867276654854
### iteration step : 0 rmse : 0.016156558845351162
### iteration step : 0 rmse : 0.016410959588961445
### iteration step : 50 rmse : 0.016415626400065848
### iteration step : 50 rmse : 0.016502576268650006
### iteration step : 50 rmse : 0.016358274578054596
### iteration step : 50 rmse : 0.01628913640069477
### iteration step : 50 rmse : 0.016404419353426576
### iteration step : 50 rmse : 0.016418985195750684
### iteration step : 50 rmse : 0.016244452027802333
### iteration step : 50 rmse

### iteration step : 800 rmse : 0.01581426845704762
### iteration step : 800 rmse : 0.0158964402677436
### iteration step : 800 rmse : 0.01574130861829952
### iteration step : 800 rmse : 0.01568686098083087
### iteration step : 800 rmse : 0.015805965905985098
### iteration step : 800 rmse : 0.01581917348006171
### iteration step : 800 rmse : 0.015647745454688534
### iteration step : 800 rmse : 0.01562101743539861
### iteration step : 800 rmse : 0.015294845761463498
### iteration step : 800 rmse : 0.01543887517340491
### iteration step : 800 rmse : 0.015530352151403897
### iteration step : 800 rmse : 0.015768866818295302
### iteration step : 850 rmse : 0.015777056015593188
### iteration step : 850 rmse : 0.0158592156887103
### iteration step : 850 rmse : 0.015703465907807943
### iteration step : 850 rmse : 0.015649873608618043
### iteration step : 850 rmse : 0.015769212908607635
### iteration step : 850 rmse : 0.015782397140419416
### iteration step : 850 rmse : 0.015611030164867548
###

In [6]:
pred_matrix=np.dot(P, Q.T)
print('Predict Matrix:\n', np.round(pred_matrix, 3))

Predict Matrix:
 [[3.991 1.171 1.266 2.    1.649]
 [6.288 4.978 0.896 2.983 1.003]
 [6.401 0.866 2.987 3.978 3.986]
 [4.969 2.005 1.006 2.013 1.258]]


# 05. 콘텐츠 기반 필터링 실습 - TMDB 5000 영화 데이터 세트
### * 장르 속성을 이용한 영화 콘텐츠 기반 필터링
### * 데이터 로딩 및 가공

In [2]:
import pandas as pd
import numpy as np
import warnings; warnings.filterwarnings('ignore')

movies=pd.read_csv('./tmdb_5000_movies.csv')
print(movies.shape)
movies.head()

(4803, 20)


Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-07-16,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-03-07,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124


In [3]:
movies_df=movies[['id','title','genres','vote_average','vote_count','popularity','keywords','overview']]

In [4]:
pd.set_option('max_colwidth', 100)
movies_df[['genres', 'keywords']][:1]

Unnamed: 0,genres,keywords
0,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""name"": ""Fantasy""}, {...","[{""id"": 1463, ""name"": ""culture clash""}, {""id"": 2964, ""name"": ""future""}, {""id"": 3386, ""name"": ""sp..."


In [5]:
from ast import literal_eval
movies_df['genres']=movies_df['genres'].apply(literal_eval)
movies_df['keywords']=movies_df['keywords'].apply(literal_eval)

In [6]:
movies_df['genres']=movies_df['genres'].apply(lambda x : [y['name'] for y in x])
movies_df['keywords']=movies_df['keywords'].apply(lambda x : [y['name'] for y in x])
movies_df[['genres', 'keywords']][:1]

Unnamed: 0,genres,keywords
0,"[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colony, society, space travel, futuristic, romance, spa..."


### * 장르 콘텐츠 유시도 측정

In [7]:
from sklearn.feature_extraction.text import CountVectorizer

movies_df['genres_literal']=movies_df['genres'].apply(lambda x : (' ').join(x))
count_vect=CountVectorizer(min_df=0, ngram_range=(1,2))
genre_mat=count_vect.fit_transform(movies_df['genres_literal'])
print(genre_mat.shape)

(4803, 276)


In [8]:
from sklearn.metrics.pairwise import cosine_similarity

genre_sim=cosine_similarity(genre_mat, genre_mat)
print(genre_sim.shape)
print(genre_sim[:1])

(4803, 4803)
[[1.         0.59628479 0.4472136  ... 0.         0.         0.        ]]


In [9]:
genre_sim_sorted_ind=genre_sim.argsort()[:,::-1]
print(genre_sim_sorted_ind[:1])

[[   0 3494  813 ... 3038 3037 2401]]


### * 장르 콘텐츠 필터링을 이용한 영화 추천

In [10]:
def find_sim_movie(df, sorted_ind, title_name, top_n=10):
    title_movie=df[df['title']==title_name]
    
    title_index=title_movie.index.values
    similar_indexes=sorted_ind[title_index, :(top_n)]
    
    print(similar_indexes)
    similar_indexes=similar_indexes.reshape(-1)
    
    return df.iloc[similar_indexes]

In [11]:
similar_movies=find_sim_movie(movies_df, genre_sim_sorted_ind, 'The Godfather', 10)
similar_movies[['title', 'vote_average']]

[[2731 1243 3636 1946 2640 4065 1847 4217  883 3866]]


Unnamed: 0,title,vote_average
2731,The Godfather: Part II,8.3
1243,Mean Streets,7.2
3636,Light Sleeper,5.7
1946,The Bad Lieutenant: Port of Call - New Orleans,6.0
2640,Things to Do in Denver When You're Dead,6.7
4065,Mi America,0.0
1847,GoodFellas,8.2
4217,Kids,6.8
883,Catch Me If You Can,7.7
3866,City of God,8.1


In [12]:
movies_df[['title', 'vote_average', 'vote_count']].sort_values('vote_average', ascending=False)[:10]

Unnamed: 0,title,vote_average,vote_count
3519,Stiff Upper Lips,10.0,1
4247,Me You and Five Bucks,10.0,2
4045,"Dancer, Texas Pop. 81",10.0,1
4662,Little Big Top,10.0,1
3992,Sardaarji,9.5,2
2386,One Man's Hero,9.3,2
2970,There Goes My Baby,8.5,2
1881,The Shawshank Redemption,8.5,8205
2796,The Prisoner of Zenda,8.4,11
3337,The Godfather,8.4,5893


In [13]:
C=movies_df['vote_average'].mean()
m=movies_df['vote_count'].quantile(0.6)
print('C:', round(C,3), 'm:', round(m,3))

C: 6.092 m: 370.2


In [26]:
percentile=0.6
C=movies_df['vote_average'].mean()
m=movies_df['vote_count'].quantile(percentile)

def weighted_vote_average(record):
    v=record['vote_count']
    R=record['vote_average']
    
    return ((v/(v+m))*R)+((m/(m+v))*C)

movies_df['weighted_vote']=movies_df.apply(weighted_vote_average, axis=1)

In [27]:
movies_df[['title','vote_average', 'weighted_vote', 'vote_count']].sort_values('weighted_vote', ascending=False)[:10]

Unnamed: 0,title,vote_average,weighted_vote,vote_count
1881,The Shawshank Redemption,8.5,8.396052,8205
3337,The Godfather,8.4,8.263591,5893
662,Fight Club,8.3,8.216455,9413
3232,Pulp Fiction,8.3,8.207102,8428
65,The Dark Knight,8.2,8.13693,12002
1818,Schindler's List,8.3,8.126069,4329
3865,Whiplash,8.3,8.123248,4254
809,Forrest Gump,8.2,8.105954,7927
2294,Spirited Away,8.3,8.105867,3840
2731,The Godfather: Part II,8.3,8.079586,3338


In [28]:
def find_sim_movie(df, sorted_ind, title_name, top_n=10):
    title_movie=df[df['title']==title_name]
    title_index=title_movie.index.values
    
    similar_indexes=sorted_ind[title_index, :(top_n*2)]
    similar_indexes=similar_indexes.reshape(-1)
    similar_indexes=similar_indexes[similar_indexes != title_index]
    
    return df.iloc[similar_indexes].sort_values('weighted_vote', ascending=False)[:top_n]

similar_movies=find_sim_movie(movies_df, genre_sim_sorted_ind, 'The Godfather', 10)
similar_movies[['title','vote_average','weighted_vote']]

Unnamed: 0,title,vote_average,weighted_vote
2731,The Godfather: Part II,8.3,8.079586
1847,GoodFellas,8.2,7.976937
3866,City of God,8.1,7.759693
1663,Once Upon a Time in America,8.2,7.657811
883,Catch Me If You Can,7.7,7.557097
281,American Gangster,7.4,7.141396
4041,This Is England,7.4,6.739664
1149,American Hustle,6.8,6.717525
1243,Mean Streets,7.2,6.626569
2839,Rounders,6.9,6.530427


# 06. 아이템 기반 최근접 이웃 협업 필터링 실습
### * 데이터 가공 및 변환

In [1]:
import pandas as pd
import numpy as np

movies=pd.read_csv('./ml-latest-small/movies.csv')
ratings=pd.read_csv('./ml-latest-small/ratings.csv')
print(movies.shape)
print(ratings.shape)

(9742, 3)
(100836, 4)


In [2]:
ratings=ratings[['userId', 'movieId','rating']]
ratings_matrix=ratings.pivot_table('rating', index='userId', columns='movieId')
ratings_matrix.head()

movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,4.0,,4.0,,,4.0,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,4.0,,,,,,,,,,...,,,,,,,,,,


In [3]:
rating_movies=pd.merge(ratings, movies, on='movieId')
ratings_matrix=rating_movies.pivot_table('rating', index='userId', columns='title')

ratings_matrix=ratings_matrix.fillna(0)
ratings_matrix.head()

title,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),'Tis the Season for Love (2015),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...,Zulu (2013),[REC] (2007),[REC]² (2009),[REC]³ 3 Génesis (2012),anohana: The Flower We Saw That Day - The Movie (2013),eXistenZ (1999),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### * 영화 간 유사도 산출

In [4]:
ratings_matrix_T=ratings_matrix.transpose()
ratings_matrix_T.head()

userId,1,2,3,4,5,6,7,8,9,10,...,601,602,603,604,605,606,607,608,609,610
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
'71 (2014),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0
'Hellboy': The Seeds of Creation (2004),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
'Round Midnight (1986),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
'Salem's Lot (2004),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
'Til There Was You (1997),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [5]:
from sklearn.metrics.pairwise import cosine_similarity

item_sim=cosine_similarity(ratings_matrix_T, ratings_matrix_T)
item_sim_df=pd.DataFrame(data=item_sim, index=ratings_matrix.columns, columns=ratings_matrix.columns)
print(item_sim_df.shape)
item_sim_df.head()

(9719, 9719)


title,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),'Tis the Season for Love (2015),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...,Zulu (2013),[REC] (2007),[REC]² (2009),[REC]³ 3 Génesis (2012),anohana: The Flower We Saw That Day - The Movie (2013),eXistenZ (1999),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
'71 (2014),1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.141653,0.0,...,0.0,0.342055,0.543305,0.707107,0.0,0.0,0.139431,0.327327,0.0,0.0
'Hellboy': The Seeds of Creation (2004),0.0,1.0,0.707107,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
'Round Midnight (1986),0.0,0.707107,1.0,0.0,0.0,0.0,0.176777,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
'Salem's Lot (2004),0.0,0.0,0.0,1.0,0.857493,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
'Til There Was You (1997),0.0,0.0,0.0,0.857493,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [6]:
item_sim_df['Godfather, The (1972)'].sort_values(ascending=False)[:6]

title
Godfather, The (1972)                        1.000000
Godfather: Part II, The (1974)               0.821773
Goodfellas (1990)                            0.664841
One Flew Over the Cuckoo's Nest (1975)       0.620536
Star Wars: Episode IV - A New Hope (1977)    0.595317
Fargo (1996)                                 0.588614
Name: Godfather, The (1972), dtype: float64

In [7]:
item_sim_df['Inception (2010)'].sort_values(ascending=False)[:6]

title
Inception (2010)                 1.000000
Dark Knight, The (2008)          0.727263
Inglourious Basterds (2009)      0.646103
Shutter Island (2010)            0.617736
Dark Knight Rises, The (2012)    0.617504
Fight Club (1999)                0.615417
Name: Inception (2010), dtype: float64

### * 아이템 기반 최근접 이웃 협업 필터링으로 개인화된 영화 추천

In [8]:
def predict_rating(ratings_arr, item_sim_arr):
    ratings_pred=ratings_arr.dot(item_sim_arr)/np.array([np.abs(item_sim_arr).sum(axis=1)])
    return ratings_pred

In [10]:
ratings_pred=predict_rating(ratings_matrix.values, item_sim_df.values)
ratings_pred_matrix=pd.DataFrame(data=ratings_pred, index=ratings_matrix.index, columns=ratings_matrix.columns)
ratings_pred_matrix.head()

title,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),'Tis the Season for Love (2015),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...,Zulu (2013),[REC] (2007),[REC]² (2009),[REC]³ 3 Génesis (2012),anohana: The Flower We Saw That Day - The Movie (2013),eXistenZ (1999),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,0.070345,0.577855,0.321696,0.227055,0.206958,0.194615,0.249883,0.102542,0.157084,0.178197,...,0.113608,0.181738,0.133962,0.128574,0.006179,0.21207,0.192921,0.136024,0.292955,0.720347
2,0.01826,0.042744,0.018861,0.0,0.0,0.035995,0.013413,0.002314,0.032213,0.014863,...,0.01564,0.020855,0.020119,0.015745,0.049983,0.014876,0.021616,0.024528,0.017563,0.0
3,0.011884,0.030279,0.064437,0.003762,0.003749,0.002722,0.014625,0.002085,0.005666,0.006272,...,0.006923,0.011665,0.0118,0.012225,0.0,0.008194,0.007017,0.009229,0.01042,0.084501
4,0.049145,0.277628,0.160448,0.206892,0.309632,0.042337,0.130048,0.116442,0.099785,0.097432,...,0.051269,0.076051,0.055563,0.054137,0.008343,0.159242,0.100941,0.062253,0.146054,0.231187
5,0.007278,0.066951,0.041879,0.01388,0.024842,0.01824,0.026405,0.018673,0.021591,0.018841,...,0.009689,0.022246,0.01336,0.012378,0.0,0.025839,0.023712,0.018012,0.028133,0.052315


In [12]:
from sklearn.metrics import mean_squared_error

def get_mse(pred, actual):
    pred=pred[actual.nonzero()].flatten()
    actual=actual[actual.nonzero()].flatten()
    return mean_squared_error(pred, actual)

print('아이템 기반 모든 최근접 이웃 MSE:', get_mse(ratings_pred, ratings_matrix.values))

아이템 기반 모든 최근접 이웃 MSE: 9.895354759094706


In [18]:
def predict_rating_topsim(ratings_arr, item_sim_arr, n=20):
    pred=np.zeros(ratings_arr.shape)
    
    for col in range(ratings_arr.shape[1]):
        top_n_items=[np.argsort(item_sim_arr[:,col])[:-n-1:-1]]
        for row in range(ratings_arr.shape[0]):
            pred[row, col]=item_sim_arr[col,:][top_n_items].dot(ratings_arr[row,:][top_n_items].T)
            pred[row, col] /= np.sum(np.abs(item_sim_arr[col,:][top_n_items]))
    return pred

In [19]:
ratings_pred=predict_rating_topsim(ratings_matrix.values, item_sim_df.values, n=20)
print('아이템 기반 최근접 TOP-20 이웃 MSE:', get_mse(ratings_pred, ratings_matrix.values))
ratings_pred_matirx=pd.DataFrame(data=ratings_pred, index=ratings_matrix.index, columns=ratings_matrix.columns)

  pred[row, col]=item_sim_arr[col,:][top_n_items].dot(ratings_arr[row,:][top_n_items].T)
  pred[row, col] /= np.sum(np.abs(item_sim_arr[col,:][top_n_items]))


아이템 기반 최근접 TOP-20 이웃 MSE: 3.6949696289030554


In [21]:
user_rating_id=ratings_matrix.loc[9,:]
user_rating_id[user_rating_id > 0].sort_values(ascending=False)[:10]

title
Adaptation (2002)                                                                 5.0
Austin Powers in Goldmember (2002)                                                5.0
Lord of the Rings: The Fellowship of the Ring, The (2001)                         5.0
Lord of the Rings: The Two Towers, The (2002)                                     5.0
Producers, The (1968)                                                             5.0
Citizen Kane (1941)                                                               5.0
Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981)    5.0
Back to the Future (1985)                                                         5.0
Glengarry Glen Ross (1992)                                                        4.0
Sunset Blvd. (a.k.a. Sunset Boulevard) (1950)                                     4.0
Name: 9, dtype: float64

In [22]:
def get_unseen_movies(ratings_matrix, userId):
    user_rating=ratings_matrix.loc[userId,:]
    already_seen=user_rating[user_rating > 0].index.tolist()
    movies_list=ratings_matrix.columns.tolist()
    
    unseen_list=[movie for movie in movies_list if movie not in already_seen]
    
    return unseen_list

In [25]:
def recomm_movie_by_userid(pred_df, userId, unseen_list, top_n=10):
    recomm_movies=pred_df.loc[userId, unseen_list].sort_values(ascending=False)[:top_n]
    return recomm_movies

unseen_list=get_unseen_movies(ratings_matrix, 9)
recomm_movies=recomm_movie_by_userid(ratings_pred_matrix, 9, unseen_list, top_n=10)
recomm_movies=pd.DataFrame(data=recomm_movies.values, index=recomm_movies.index, columns=['pred_scroe'])

recomm_movies

Unnamed: 0_level_0,pred_scroe
title,Unnamed: 1_level_1
Venom (1982),0.303278
Dr. Goldfoot and the Bikini Machine (1965),0.258705
Frankie and Johnny (1966),0.234754
English Vinglish (2012),0.214774
"Harmonists, The (1997)",0.169338
"Story of Women (Affaire de femmes, Une) (1988)",0.163884
3:10 to Yuma (1957),0.163884
"Passenger, The (Professione: reporter) (1975)",0.163884
"Child, The (L'enfant) (2005)",0.163884
Cassandra's Dream (2007),0.163884


# 07. 행렬 분해를 이용한 잠재 요인 협업 필터링 실습

In [32]:
def matrix_factorization(R, K, steps=200, learning_rate=0.01, r_lambda=0.01):
    num_users, num_items = R.shape
    np.random.seed(1)
    P = np.random.normal(scale=1./K, size=(num_users, K))
    Q = np.random.normal(scale=1./K, size=(num_items, K))
    
    prev_rmse=10000
    break_count=0

    non_zeros=[(i,j,R[i,j]) for i in range(num_users) for j in range(num_items) if R[i,j]>0]
    for step in range(steps):
        for i, j, r in non_zeros:
            eji=r-np.dot(P[i,:], Q[j,:].T)
            P[i,:]=P[i,:]+learning_rate*(eji*Q[j,:]-r_lambda*P[i,:])
            Q[j,:]=Q[j,:]+learning_rate*(eji*P[i,:]-r_lambda*Q[j,:])
        
        rmse=get_rmse(R, P, Q, non_zeros)
        if(step%10)==0:
            print('### iteration step :', step, 'rmse :', rmse)
    return P, Q

In [33]:
P, Q = matrix_factorization(ratings_matrix.values, K=50, steps=200, learning_rate=0.01, r_lambda=0.01)
pred_matrix=np.dot(P,Q.T)

### iteration step : 0 rmse : 2.9023619751336867
### iteration step : 10 rmse : 0.7335768591017927
### iteration step : 20 rmse : 0.5115539026853442
### iteration step : 30 rmse : 0.37261628282537446
### iteration step : 40 rmse : 0.2960818299181014
### iteration step : 50 rmse : 0.2520353192341642
### iteration step : 60 rmse : 0.2248750327526985
### iteration step : 70 rmse : 0.20685455302331537
### iteration step : 80 rmse : 0.19413418783028683
### iteration step : 90 rmse : 0.184700820027204
### iteration step : 100 rmse : 0.17742927527209104
### iteration step : 110 rmse : 0.1716522696470749
### iteration step : 120 rmse : 0.1669518194687172
### iteration step : 130 rmse : 0.1630529219199754
### iteration step : 140 rmse : 0.1597669192967964
### iteration step : 150 rmse : 0.1569598699945732
### iteration step : 160 rmse : 0.15453398186715425
### iteration step : 170 rmse : 0.15241618551077643
### iteration step : 180 rmse : 0.15055080739628307
### iteration step : 190 rmse : 0.14

In [34]:
ratings_pred_matrix=pd.DataFrame(data=pred_matrix, index=ratings_matrix.index, columns=ratings_matrix.columns)
ratings_pred_matrix.head()

title,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),'Tis the Season for Love (2015),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...,Zulu (2013),[REC] (2007),[REC]² (2009),[REC]³ 3 Génesis (2012),anohana: The Flower We Saw That Day - The Movie (2013),eXistenZ (1999),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,3.055084,4.092018,3.56413,4.502167,3.981215,1.271694,3.603274,2.333266,5.091749,3.972454,...,1.402608,4.208382,3.705957,2.720514,2.787331,3.475076,3.253458,2.161087,4.010495,0.859474
2,3.170119,3.657992,3.308707,4.166521,4.31189,1.275469,4.237972,1.900366,3.392859,3.647421,...,0.973811,3.528264,3.361532,2.672535,2.404456,4.232789,2.911602,1.634576,4.135735,0.725684
3,2.307073,1.658853,1.443538,2.208859,2.229486,0.78076,1.997043,0.924908,2.9707,2.551446,...,0.520354,1.709494,2.281596,1.782833,1.635173,1.323276,2.88758,1.042618,2.29389,0.396941
4,2.628629,3.03555,2.575746,3.706912,3.430636,0.706441,3.33028,1.978826,4.560368,2.77571,...,1.046116,2.912178,2.479592,2.231915,1.888629,2.211364,0.645603,1.585734,3.542892,0.59154
5,2.116148,3.084761,2.747679,3.78349,3.94699,0.883259,1.958953,1.757317,2.054312,2.775258,...,0.956159,3.893975,2.717024,2.002443,2.053337,3.983639,2.099626,1.423718,2.490428,0.531403


In [35]:
unseen_list=get_unseen_movies(ratings_matrix, 9)
recomm_movies=recomm_movie_by_userid(ratings_pred_matrix, 9, unseen_list, top_n=10)
recomm_movies=pd.DataFrame(data=recomm_movies.values, index=recomm_movies.index, columns=['pred_scroe'])

recomm_movies

Unnamed: 0_level_0,pred_scroe
title,Unnamed: 1_level_1
Rear Window (1954),5.704612
"South Park: Bigger, Longer and Uncut (1999)",5.4511
Rounders (1998),5.298393
Blade Runner (1982),5.244951
Roger & Me (1989),5.191962
Gattaca (1997),5.183179
Ben-Hur (1959),5.130463
Rosencrantz and Guildenstern Are Dead (1990),5.087375
"Big Lebowski, The (1998)",5.03869
Star Wars: Episode V - The Empire Strikes Back (1980),4.989601


# 08. 파이썬 추천 시스템 패키지 - Surprise
### * Surprise package
### * Surprise를 이용한 추천 시스템 구축

In [1]:
from surprise import SVD
from surprise import Dataset
from surprise import accuracy
from surprise.model_selection import train_test_split

In [2]:
data=Dataset.load_builtin('ml-100k')
trainset, testset = train_test_split(data, test_size=0.25, random_state=0)

Dataset ml-100k could not be found. Do you want to download it? [Y/n] Y
Trying to download dataset from http://files.grouplens.org/datasets/movielens/ml-100k.zip...
Done! Dataset ml-100k has been saved to /Users/jangseojin/.surprise_data/ml-100k


In [3]:
algo=SVD()
algo.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7f9c584c0d00>

In [5]:
predictions=algo.test(testset)
print('Prediction Type:', type(predictions), 'Size:', len(predictions))
print('\nPrediction result top5')
predictions[:5]

Prediction Type: <class 'list'> Size: 25000

Prediction result top5


[Prediction(uid='120', iid='282', r_ui=4.0, est=3.6927823728344644, details={'was_impossible': False}),
 Prediction(uid='882', iid='291', r_ui=4.0, est=3.8269753253558125, details={'was_impossible': False}),
 Prediction(uid='535', iid='507', r_ui=5.0, est=4.223373453858056, details={'was_impossible': False}),
 Prediction(uid='697', iid='244', r_ui=5.0, est=3.7769838693095084, details={'was_impossible': False}),
 Prediction(uid='751', iid='385', r_ui=4.0, est=3.1497538925296107, details={'was_impossible': False})]

In [7]:
[(pred.uid, pred.iid, pred.est) for pred in predictions[:3]]

[('120', '282', 3.6927823728344644),
 ('882', '291', 3.8269753253558125),
 ('535', '507', 4.223373453858056)]

In [8]:
uid=str(196)
iid=str(302)
pred=algo.predict(uid, iid)
print(pred)

user: 196        item: 302        r_ui = None   est = 4.28   {'was_impossible': False}


In [9]:
accuracy.rmse(predictions)

RMSE: 0.9507


0.9506525960609317

### * Surprise 주요 모듈 소개
1. Dataset
2. OS 파일 데이터를 Surprise 데이터 세트로 로딩

In [10]:
import pandas as pd

ratings=pd.read_csv('./ml-latest-small/ratings.csv')
ratings.to_csv('./ml-latest-small/ratings_noh.csv', index=False, header=False)

In [12]:
from surprise import Reader
reader=Reader(line_format='user item rating timestamp', sep=',', rating_scale=(0.5,5))
data=Dataset.load_from_file('./ml-latest-small/ratings_noh.csv', reader=reader)

In [14]:
trainset, testset = train_test_split(data, test_size=0.25, random_state=0)
algo=SVD(n_factors=50, random_state=0)
algo.fit(trainset)
predictions=algo.test(testset)
accuracy.rmse(predictions)

RMSE: 0.8682


0.8681952927143516

3. 판다스 DataFrame에서 Surprise 데이터 세트 로딩

In [15]:
import pandas as pd
from surprise import Reader, Dataset

ratings=pd.read_csv('./ml-latest-small/ratings.csv')
reader=Reader(rating_scale=(0.5,5))

data=Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']],reader)
trainset, testset = train_test_split(data, test_size=0.25, random_state=0)

algo=SVD(n_factors=50, random_state=0)
algo.fit(trainset)
predictions=algo.test(testset)
accuracy.rmse(predictions)

RMSE: 0.8682


0.8681952927143516

### * Surprise 추천 알고리즘 클래스
- SVD, k-NN Baseline
- Baseline: 각 개인이 평점을 부여하는 성향을 반영해 평점을 계산하는 방식

### * Baseline Rating
- 전체 평균 평점 + 사용자 편향 점수 + 아이템 편향 점수 = 베이스라인 평점

### * 교차 검증과 하이퍼 파라미터 튜닝

In [2]:
import pandas as pd
from surprise import Reader, Dataset
from surprise import SVD
from surprise.model_selection import cross_validate

ratings=pd.read_csv('./ml-latest-small/ratings.csv')
reader=Reader(rating_scale=(0.5,5))
data=Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']],reader)

algo=SVD(random_state=0)
cross_validate(algo, data, measures=['RMSE','MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8715  0.8779  0.8747  0.8686  0.8724  0.8730  0.0031  
MAE (testset)     0.6699  0.6749  0.6724  0.6661  0.6716  0.6710  0.0029  
Fit time          3.59    3.65    3.64    3.65    3.66    3.64    0.02    
Test time         0.11    0.11    0.10    0.10    0.08    0.10    0.01    


{'test_rmse': array([0.87151418, 0.87785868, 0.87468044, 0.86861775, 0.87241542]),
 'test_mae': array([0.6699428 , 0.67494002, 0.67244403, 0.66607592, 0.671621  ]),
 'fit_time': (3.591003179550171,
  3.652466058731079,
  3.6364362239837646,
  3.648545026779175,
  3.660881996154785),
 'test_time': (0.10600113868713379,
  0.10549473762512207,
  0.10437417030334473,
  0.1049339771270752,
  0.07735300064086914)}

In [4]:
from surprise.model_selection import GridSearchCV

param_grid={'n_epochs':[20,40,60], 'n_factors':[50,100,200]}
gs=GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=3)
gs.fit(data)
print(gs.best_score['rmse'])
print(gs.best_params['rmse'])

0.8770112949336206
{'n_epochs': 20, 'n_factors': 50}


### * Surprise를 이용한 개인화 영화 추천 시스템 구추구

In [5]:
from surprise.dataset import DatasetAutoFolds

reader=Reader(line_format='user item rating timestamp', sep=',', rating_scale=(0.5,5))
data_folds=DatasetAutoFolds(ratings_file='./ml-latest-small/ratings_noh.csv', reader=reader)
trainset=data_folds.build_full_trainset()

In [6]:
algo=SVD(n_epochs=20, n_factors=50, random_state=0)
algo.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7ffc411b9f70>

In [9]:
movies=pd.read_csv('./ml-latest-small/movies.csv')
movieIds=ratings[ratings['userId'] == 9]['movieId']
if movieIds[movieIds == 42].count() == 0:
    print('movieId 42 rating value of userId 9 is None')
    
print(movies[movies['movieId'] == 42])

movieId 42 rating value of userId 9 is None
    movieId                   title              genres
38       42  Dead Presidents (1995)  Action|Crime|Drama


In [10]:
uid = str(9)
iid= str(42)
pred=algo.predict(uid, iid, verbose=True)

user: 9          item: 42         r_ui = None   est = 3.13   {'was_impossible': False}


In [11]:
def get_unseen_surprise(ratings, movies, userId):
    seen_movies=ratings[ratings['userId'] == userId]['movieId'].tolist()
    total_movies=movies['movieId'].tolist()
    unseen_movies=[movie for movie in total_movies if movie not in seen_movies]
    print('Rating Movies:', len(seen_movies), 'Recomm Movies:', len(unseen_movies),
         'Total Movies:', len(total_movies))
    return unseen_movies
unseen_movies = get_unseen_surprise(ratings, movies, 9)

Rating Movies: 46 Recomm Movies: 9696 Total Movies: 9742


In [15]:
def recomm_movie_by_surprise(algo, userId, unseen_movies, top_n=10):
    predictions=[algo.predict(str(userId), str(movieId)) for movieId in unseen_movies]
    
    def sortkey_est(pred):
        return pred.est
    predictions.sort(key=sortkey_est, reverse=True)
    top_predictions=predictions[:top_n]
    
    top_movie_ids = [int(pred.iid) for pred in top_predictions]
    top_movie_rating=[pred.est for pred in top_predictions]
    top_movie_titles=movies[movies.movieId.isin(top_movie_ids)]['title']
    top_movie_preds=[(id, title, rating) for id, title, rating in
                    zip(top_movie_ids, top_movie_titles, top_movie_rating)]
    
    return top_movie_preds

unseen_movies = get_unseen_surprise(ratings, movies, 9)
top_movie_preds=recomm_movie_by_surprise(algo, 9, unseen_movies, top_n=10)

print('\n##### Top 10 Recommendation Movie list #####\n')
for top_movie in top_movie_preds:
    print(top_movie[1], ':', top_movie[2])

Rating Movies: 46 Recomm Movies: 9696 Total Movies: 9742

##### Top 10 Recommendation Movie list #####

Usual Suspects, The (1995) : 4.306302135700814
Star Wars: Episode IV - A New Hope (1977) : 4.281663842987387
Pulp Fiction (1994) : 4.278152632122758
Silence of the Lambs, The (1991) : 4.226073566460876
Godfather, The (1972) : 4.1918097904381995
Streetcar Named Desire, A (1951) : 4.154746591122658
Star Wars: Episode V - The Empire Strikes Back (1980) : 4.122016128534504
Star Wars: Episode VI - Return of the Jedi (1983) : 4.108009609093436
Goodfellas (1990) : 4.083464936588478
Glory (1989) : 4.07887165526957
