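#### Introduction to Recommender Systems
- Origins of recommender systems: the internet appeared -> email became widespread -> existing services were moved online (e.g. online news)
-> With a news subscription, most people do not read every one of the many articles unless they have a specific purpose -> articles had to be filtered before delivery (candidates were filtered by whether they contained particular keywords)
- When a user registers several keywords, the list is shown in order of keyword priority -> a recommender system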
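The keyword-and-priority idea described above can be sketched in a few lines. This is only an illustration under assumptions: the articles, the keyword list, and the scoring rule (earlier keywords weigh more) are all invented here and are not taken from any real system.

```python
# Hypothetical articles and subscriber keywords, invented for illustration.
articles = [
    {"title": "Central bank raises rates",
     "text": "The central bank raised interest rates to fight inflation."},
    {"title": "Local team wins final",
     "text": "The football team won the cup final last night."},
    {"title": "Inflation hits football tickets",
     "text": "Inflation is pushing up football ticket prices."},
]

# Keywords in priority order: earlier entries matter more (an assumed rule).
keywords = ["inflation", "football"]


def score(article, keywords):
    """Sum a weight for each keyword found; higher-priority keywords weigh more."""
    text = article["text"].lower()
    n = len(keywords)
    return sum(n - rank for rank, kw in enumerate(keywords) if kw in text)


def recommend(articles, keywords):
    """Keep articles that match at least one keyword, best score first."""
    scored = [(score(a, keywords), a["title"]) for a in articles]
    matched = [(s, t) for s, t in scored if s > 0]
    return [t for s, t in sorted(matched, key=lambda pair: -pair[0])]


ranked = recommend(articles, keywords)
print(ranked)
```

Articles that match no keyword are dropped (the filtering step), and the rest are ordered by score (the ranking step).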


#### What criterion should be used for ranking?
- CB (Content-Based) -> centered on the keywords that describe the content (e.g. snail cream: snail, cream, moisturizing, etc.)
- KB (Knowledge-Based) -> recommends based on accumulated knowledge (domain knowledge)
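As a rough sketch of the content-based idea, items can be compared by how many descriptive keywords they share. The item names and keyword sets below are hypothetical, and Jaccard similarity is just one possible overlap measure chosen for the example.

```python
# Hypothetical items described by keyword sets (invented for illustration).
item_keywords = {
    "snail cream": {"snail", "cream", "moisturizing"},
    "aloe gel": {"aloe", "gel", "moisturizing"},
    "snail serum": {"snail", "serum", "moisturizing"},
    "lipstick": {"lip", "color"},
}


def jaccard(a, b):
    """Share of keywords two items have in common (0 = none, 1 = identical)."""
    return len(a & b) / len(a | b)


def most_similar(target, item_keywords):
    """Rank every other item by keyword overlap with the target item."""
    scores = {
        name: jaccard(item_keywords[target], kws)
        for name, kws in item_keywords.items()
        if name != target
    }
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


ranking = most_similar("snail cream", item_keywords)
print(ranking)
```

Note that this approach needs a keyword description for every item, which is the cost the next section points out.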

#### Drawbacks of CB and KB (always assuming big data + cloud)
- There is far too much content (data) to process all of it
- As the number of item types or customers **grows**, it becomes hard to keep up

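#### Collaborative Filtering (filtering by connections)
- Based on the connections between customers and items: uses information about tastes and preferences collected from users
- Assumes that past tendencies will continue into the future -> recommends items to me based on what people who made choices similar to mine chose in the past
- User-Based: recommends to a given customer the choices of other people who made similar choices
- Item-Based: recommends to a customer who chose a given item the other choices of customers who chose that same item; e.g. most customers who pick product A also buy product B, so B is recommended to customers who bought A (ex. Amazon)
- Finding the **connections** well is the key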
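The "customers who bought A also bought B" pattern can be illustrated by counting how often two items appear in the same customer's history. The baskets below are made up, and plain co-occurrence counting is a simplification chosen for the sketch; it is not a description of how any particular retailer implements it.

```python
from collections import Counter
from itertools import permutations

# Hypothetical purchase histories: customer id -> set of purchased items.
baskets = {
    "u1": {"A", "B"},
    "u2": {"A", "B", "C"},
    "u3": {"A", "C"},
    "u4": {"B", "D"},
}

# Count ordered pairs (a, b): how many customers who have a also have b.
co_counts = Counter()
for items in baskets.values():
    for a, b in permutations(sorted(items), 2):
        co_counts[(a, b)] += 1


def also_bought(item, co_counts, n=2):
    """Items most often bought together with `item` (ties broken by name)."""
    pairs = [(other, c) for (first, other), c in co_counts.items() if first == item]
    return sorted(pairs, key=lambda p: (-p[1], p[0]))[:n]


result = also_bought("A", co_counts)
print(result)
```

No item description is used at all: the recommendation comes purely from the customer-item connections.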

#### Cross / Up / Down Sell
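- Cross Sell: selling add-ons (buy a hamburger, get a side menu recommended)
- Up Sell: selling up (add 500 won to upgrade to a large) -> goal achieved + higher margin
- Down Sell: selling down (recommending a cheaper product, a substitute, to a customer who came in looking at an expensive car)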

In [2]:
import pandas as pd

In [3]:
df_movies = pd.read_csv('movies.csv')
df_ratings = pd.read_csv('ratings.csv')

In [4]:
df_movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [5]:
df_ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [6]:
print(df_movies.shape)
print(df_ratings.shape)

(9742, 3)
(100836, 4)


In [7]:
# Join on the shared movieId column (stated explicitly rather than inferred)
df_combined = pd.merge(df_ratings, df_movies, on='movieId')
df_combined

Unnamed: 0,userId,movieId,rating,timestamp,title,genres
0,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,1,3,4.0,964981247,Grumpier Old Men (1995),Comedy|Romance
2,1,6,4.0,964982224,Heat (1995),Action|Crime|Thriller
3,1,47,5.0,964983815,Seven (a.k.a. Se7en) (1995),Mystery|Thriller
4,1,50,5.0,964982931,"Usual Suspects, The (1995)",Crime|Mystery|Thriller
...,...,...,...,...,...,...
100831,610,166534,4.0,1493848402,Split (2017),Drama|Horror|Thriller
100832,610,168248,5.0,1493850091,John Wick: Chapter Two (2017),Action|Crime|Thriller
100833,610,168250,5.0,1494273047,Get Out (2017),Horror
100834,610,168252,5.0,1493846352,Logan (2017),Action|Sci-Fi


In [8]:
print(df_combined['userId'].nunique())  # number of distinct users
print(df_combined['title'].nunique())   # number of distinct titles

610
9719


In [9]:
df_pvt = df_combined.pivot_table(index='userId', columns='title', values='rating').fillna(0)
df_pvt

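# pivot_table is similar to groupby, but instead of only reporting a sum or mean it
# reshapes the data: one column (title) becomes the column headers and another (rating) fills the cells
# (duplicate user/title pairs are averaged by default; fillna(0) marks unrated movies as 0)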

userId,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),'Tis the Season for Love (2015),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...,Zulu (2013),[REC] (2007),[REC]² (2009),[REC]³ 3 Génesis (2012),anohana: The Flower We Saw That Day - The Movie (2013),eXistenZ (1999),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
606,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
607,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
608,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,4.5,3.5,0.0,0.0,0.0
609,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
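To make the reshaping step concrete, here is a small sketch on a hypothetical three-user table (the values are invented). It also shows what `fillna(0)` changes: a missing rating becomes a literal 0, which is a modeling assumption rather than an observed score.

```python
import numpy as np
import pandas as pd

# Hypothetical long-format ratings, one row per (user, title) pair.
toy = pd.DataFrame({
    "userId": [1, 1, 2, 2, 3],
    "title": ["A", "B", "A", "C", "B"],
    "rating": [4.0, 5.0, 3.0, 2.0, 1.0],
})

# Wide format: one row per user, one column per title, NaN where no rating exists.
wide = toy.pivot_table(index="userId", columns="title", values="rating")

# Same table with "not rated" replaced by 0, as in the notebook cell above.
filled = wide.fillna(0)

print(wide)
print(filled)
```

`DataFrame.corr` skips missing values pairwise, so leaving the NaNs in place is an alternative when treating "not rated" as a rating of 0 is undesirable; which choice works better depends on the data.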


#### Item-Based
- Compute the similarity (correlation) between items

In [14]:
item_corr = df_pvt.corr()
item_corr

# Interpretation: if customers who rated one movie highly also rated another movie highly, and many users do so, the correlation is high -> the movies are similar (recommend)

title,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),'Tis the Season for Love (2015),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...,Zulu (2013),[REC] (2007),[REC]² (2009),[REC]³ 3 Génesis (2012),anohana: The Flower We Saw That Day - The Movie (2013),eXistenZ (1999),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
'71 (2014),1.000000,-0.001642,-0.002324,-0.001642,-0.002254,-0.001642,-0.006407,-0.001642,0.135943,-0.004325,...,-0.001642,0.339935,0.542247,0.706526,-0.001642,-0.007675,0.134327,0.325287,-0.008185,-0.001642
'Hellboy': The Seeds of Creation (2004),-0.001642,1.000000,0.706526,-0.001642,-0.002254,-0.001642,-0.006407,-0.001642,-0.010568,-0.004325,...,-0.001642,-0.004589,-0.002808,-0.002324,-0.001642,-0.007675,-0.007744,-0.003594,-0.008185,-0.001642
'Round Midnight (1986),-0.002324,0.706526,1.000000,-0.002324,-0.003191,-0.002324,0.170199,-0.002324,-0.014958,-0.006121,...,-0.002324,-0.006495,-0.003975,-0.003289,-0.002324,-0.010863,-0.010961,-0.005087,-0.011585,-0.002324
'Salem's Lot (2004),-0.001642,-0.001642,-0.002324,1.000000,0.857269,-0.001642,-0.006407,-0.001642,-0.010568,-0.004325,...,-0.001642,-0.004589,-0.002808,-0.002324,-0.001642,-0.007675,-0.007744,-0.003594,-0.008185,-0.001642
'Til There Was You (1997),-0.002254,-0.002254,-0.003191,0.857269,1.000000,-0.002254,-0.008797,-0.002254,-0.014510,-0.005938,...,-0.002254,-0.006301,-0.003856,-0.003191,-0.002254,-0.010538,-0.010632,-0.004935,-0.011238,-0.002254
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
eXistenZ (1999),-0.007675,-0.007675,-0.010863,-0.007675,-0.010538,-0.007675,0.187953,0.212646,0.053614,0.115396,...,-0.007675,-0.021449,-0.013126,-0.010863,-0.007675,1.000000,0.163022,-0.016800,0.138611,-0.007675
xXx (2002),0.134327,-0.007744,-0.010961,-0.007744,-0.010632,-0.007744,0.062174,-0.007744,0.241092,-0.000060,...,0.063291,0.291410,0.163464,0.240394,-0.007744,0.163022,1.000000,0.259049,0.065673,-0.007744
xXx: State of the Union (2005),0.325287,-0.003594,-0.005087,-0.003594,-0.004935,-0.003594,-0.014025,-0.003594,0.139511,-0.009467,...,-0.003594,0.376455,0.172818,0.227658,-0.003594,-0.016800,0.259049,1.000000,-0.017917,-0.003594
¡Three Amigos! (1986),-0.008185,-0.008185,-0.011585,-0.008185,-0.011238,-0.008185,0.353194,0.175610,0.125905,0.234514,...,0.175610,-0.022876,-0.013999,-0.011585,-0.008185,0.138611,0.065673,-0.017917,1.000000,-0.008185
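`DataFrame.corr()` computes the Pearson correlation between every pair of columns by default. The sketch below recomputes one entry by hand on a hypothetical 4-user, 3-movie table (values invented) to show what the number means.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings: rows are users, columns are movies.
ratings = pd.DataFrame(
    {
        "A": [5.0, 4.0, 0.0, 1.0],
        "B": [4.0, 5.0, 1.0, 0.0],
        "C": [0.0, 1.0, 5.0, 4.0],
    },
    index=[1, 2, 3, 4],
)


def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))


manual_ab = pearson(ratings["A"], ratings["B"])
pandas_ab = ratings.corr().loc["A", "B"]

print(manual_ab, pandas_ab)
```

In this toy table, A and B are rated alike by the same users, so their correlation is high, while A and C are rated in opposite ways, so their correlation is negative.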


In [15]:
target = 'Harry Potter'

for title in item_corr.columns:
    if target in title:
        print(title)

Harry Potter and the Chamber of Secrets (2002)
Harry Potter and the Deathly Hallows: Part 1 (2010)
Harry Potter and the Deathly Hallows: Part 2 (2011)
Harry Potter and the Goblet of Fire (2005)
Harry Potter and the Half-Blood Prince (2009)
Harry Potter and the Order of the Phoenix (2007)
Harry Potter and the Prisoner of Azkaban (2004)
Harry Potter and the Sorcerer's Stone (a.k.a. Harry Potter and the Philosopher's Stone) (2001)


In [18]:
interested = 'Harry Potter and the Chamber of Secrets (2002)'
item_corr.sort_values(by=interested, ascending=False)[interested].head()

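# Even without any information about the movies (genre, director, cast, etc.),
# we can find similar items that a user is likely to enjoy
# The recommendation relies only on the connections between movies and users
# Recommendation works without metadata -> with metadata it could be more refined
# For a Harry Potter movie, the closest matches are the other Harry Potter movies, because the same viewers rated them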

title
Harry Potter and the Chamber of Secrets (2002)                                                    1.000000
Harry Potter and the Sorcerer's Stone (a.k.a. Harry Potter and the Philosopher's Stone) (2001)    0.736992
Harry Potter and the Prisoner of Azkaban (2004)                                                   0.727898
Harry Potter and the Goblet of Fire (2005)                                                        0.673761
Harry Potter and the Order of the Phoenix (2007)                                                  0.630222
Name: Harry Potter and the Chamber of Secrets (2002), dtype: float64
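The top row of the result is the movie itself (correlation 1.0), which is not useful as a recommendation. A small helper can drop it. The function name `similar_items` and the toy correlation matrix are assumptions made for this sketch.

```python
import pandas as pd


def similar_items(item_corr, title, n=5):
    """Return the n items most correlated with `title`, excluding `title` itself."""
    scores = item_corr[title].drop(labels=title)
    return scores.sort_values(ascending=False).head(n)


# Hypothetical 3x3 item-item correlation matrix, invented for illustration.
toy_corr = pd.DataFrame(
    [[1.0, 0.7, -0.1], [0.7, 1.0, 0.2], [-0.1, 0.2, 1.0]],
    index=["A", "B", "C"],
    columns=["A", "B", "C"],
)

top = similar_items(toy_corr, "A", n=2)
print(top)
```

With the notebook's data, a call such as `similar_items(item_corr, interested)` would return the same ranking as the cell above, minus the first row.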

#### User-Based
- Compute how similar users are to one another
- If user B is similar to user A (e.g. correlation 0.7), recommend to user A the movies that B rated but A has not seen

In [19]:
df_pvt2 = df_pvt.T
df_pvt2

# Transpose so that users become the columns: corr() correlates columns, so this yields user-user similarity

title,1,2,3,4,5,6,7,8,9,10,...,601,602,603,604,605,606,607,608,609,610
'71 (2014),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0
'Hellboy': The Seeds of Creation (2004),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
'Round Midnight (1986),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
'Salem's Lot (2004),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
'Til There Was You (1997),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
eXistenZ (1999),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,5.0,0.0,0.0,0.0,0.0,4.5,0.0,0.0
xXx (2002),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.5,0.0,2.0
xXx: State of the Union (2005),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.5
¡Three Amigos! (1986),4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [20]:
user_corr = df_pvt2.corr()
user_corr

userId,1,2,3,4,5,6,7,8,9,10,...,601,602,603,604,605,606,607,608,609,610
1,1.000000,0.019396,0.053052,0.176911,0.120862,0.104406,0.143785,0.128542,0.055263,-0.000307,...,0.066248,0.149934,0.186959,0.056523,0.134402,0.121958,0.254192,0.262225,0.085430,0.098693
2,0.019396,1.000000,-0.002595,-0.003808,0.013181,0.016252,0.021564,0.023748,-0.003450,0.061877,...,0.198547,0.010885,-0.004038,-0.005348,-0.007923,0.011290,0.005809,0.032723,0.024371,0.089321
3,0.053052,-0.002595,1.000000,-0.004559,0.001886,-0.004581,-0.005637,0.001701,-0.003112,-0.005504,...,0.000148,-0.000588,0.011203,-0.004824,0.003674,-0.003255,0.012881,0.008089,-0.002964,0.015953
4,0.176911,-0.003808,-0.004559,1.000000,0.121014,0.065707,0.100595,0.054231,0.002412,0.015607,...,0.072841,0.114280,0.281852,0.039692,0.065483,0.164812,0.115109,0.116843,0.023926,0.062498
5,0.120862,0.013181,0.001886,0.121014,1.000000,0.294134,0.101721,0.426575,-0.004187,0.023468,...,0.061908,0.414929,0.095386,0.254115,0.141073,0.090149,0.145760,0.122600,0.258288,0.040361
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
606,0.121958,0.011290,-0.003255,0.164812,0.090149,0.047476,0.172484,0.081904,0.057979,0.054858,...,0.153879,0.084190,0.224593,0.035234,0.106729,1.000000,0.115978,0.188312,0.052375,0.093788
607,0.254192,0.005809,0.012881,0.115109,0.145760,0.142158,0.173287,0.178130,0.003252,-0.004817,...,0.080027,0.187581,0.173008,0.126261,0.101129,0.115978,1.000000,0.258232,0.142529,0.098496
608,0.262225,0.032723,0.008089,0.116843,0.122600,0.137932,0.305429,0.175906,0.086221,0.048357,...,0.136304,0.174056,0.164440,0.133722,0.144878,0.188312,0.258232,1.000000,0.109556,0.248902
609,0.085430,0.024371,-0.002964,0.023926,0.258288,0.207121,0.084491,0.421626,-0.003940,0.014980,...,0.029660,0.331051,0.045991,0.232113,0.089806,0.052375,0.142529,0.109556,1.000000,0.033702


In [129]:
interested = 126
sorted_corr = user_corr.sort_values(by=interested, ascending=False)[interested]
print(sorted_corr.index[1])
print(sorted_corr.iloc[1])

379
0.8127372496932684
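The lookup above finds the nearest neighbor of one user. The sketch below wraps the whole user-based step (find the most similar user, then list what that user rated and the target has not) into one function. The function name and the toy table are hypothetical; as in the notebook, 0 is assumed to mean "not rated".

```python
import pandas as pd

# Hypothetical ratings: rows are titles, columns are user ids; 0 means "not rated".
toy = pd.DataFrame(
    {
        1: [5.0, 4.0, 0.0, 0.0],
        2: [5.0, 4.0, 3.0, 0.0],
        3: [0.0, 1.0, 0.0, 5.0],
    },
    index=["A", "B", "C", "D"],
)


def recommend_from_neighbor(ratings, user, n=5):
    """Return the most similar user and that user's top-rated titles unseen by `user`."""
    corr = ratings.corr()                              # user-user correlation
    neighbor = corr[user].drop(labels=user).idxmax()   # skip the user itself
    unseen = ratings.loc[(ratings[user] == 0) & (ratings[neighbor] > 0), neighbor]
    return neighbor, unseen.sort_values(ascending=False).head(n)


neighbor, recs = recommend_from_neighbor(toy, user=1)
print(neighbor)
print(recs)
```

Dropping the user's own column before taking the maximum avoids relying on positional indexing into the sorted result.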


In [None]:
# user 88 = user_id1
# user 247 = user_id2

In [140]:
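# Find movies that user2 rated but user1 has not seen
user1 = 126
# index[0] is user1 itself and index[1] is the closest neighbor (379); index[2] picks the next one (94)
user2 = user_corr.sort_values(by=user1, ascending=False)[user1].index[2]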

movie2_rated = df_pvt2.loc[(df_pvt2[user1] == 0) & (df_pvt2[user2] > 0)][[user1, user2]]
movie2_rated[[user2]].sort_values(by=user2, ascending=False).head()

title,94
Twelve Monkeys (a.k.a. 12 Monkeys) (1995),5.0
Jumanji (1995),4.0
Ghost (1990),3.0
Get Shorty (1995),3.0
"Santa Clause, The (1994)",3.0


In [108]:
potential_recommendations

title
101 Dalmatians (One Hundred and One Dalmatians) (1961)    3.5
300 (2007)                                                5.0
3:10 to Yuma (2007)                                       3.0
Aladdin (1992)                                            5.0
American Beauty (1999)                                    4.0
                                                         ... 
WALL·E (2008)                                             3.0
Willy Wonka & the Chocolate Factory (1971)                4.0
Wizard of Oz, The (1939)                                  4.0
Wolf of Wall Street, The (2013)                           4.0
Zootopia (2016)                                           4.0
Name: 247, Length: 123, dtype: float64

In [109]:
recommendations.head()

title
Harry Potter and the Deathly Hallows: Part 2 (2011)    5.0
Jungle Book, The (1967)                                5.0
300 (2007)                                             5.0
Lion King, The (1994)                                  5.0
John Wick (2014)                                       5.0
Name: 247, dtype: float64

In [131]:
interested = 126
user_corr.sort_values(by=interested, ascending=False)[interested].head()

userId
126    1.000000
379    0.812737
94     0.705266
507    0.689818
512    0.683637
Name: 126, dtype: float64

In [137]:
# Goal: personalized recommendations for user 126
# Do not recommend titles in the intersection of user 126 and user 94
# In this domain, showing something new is more meaningful than a rewatch

user_1, user_2 = 126, 94
u1 = set(df_combined.loc[df_combined['userId'] == user_1]['title'])
u2 = set(df_combined.loc[df_combined['userId'] == user_2]['title'])

diff = u2.difference(u1)
diff

{'Ace Ventura: When Nature Calls (1995)',
 'American President, The (1995)',
 'Beverly Hills Cop III (1994)',
 'Broken Arrow (1996)',
 "City Slickers II: The Legend of Curly's Gold (1994)",
 'Clueless (1995)',
 'Coneheads (1993)',
 'Congo (1995)',
 'Disclosure (1994)',
 'Four Weddings and a Funeral (1994)',
 'Get Shorty (1995)',
 'Ghost (1990)',
 'GoldenEye (1995)',
 'Home Alone (1990)',
 'Jumanji (1995)',
 'Legends of the Fall (1994)',
 'Mortal Kombat (1995)',
 'Nell (1994)',
 'Pretty Woman (1990)',
 'Santa Clause, The (1994)',
 'Sense and Sensibility (1995)',
 'Twelve Monkeys (a.k.a. 12 Monkeys) (1995)',
 "What's Eating Gilbert Grape (1993)"}

In [138]:
u2_all = df_combined.loc[df_combined['userId'] == user_2]
# Build the mask from u2_all itself so its index matches the rows being filtered
filtered = u2_all.loc[u2_all['title'].isin(diff)]
filtered.sort_values(by='rating', ascending=False).head()['title']

14823    Twelve Monkeys (a.k.a. 12 Monkeys) (1995)
14817                               Jumanji (1995)
14818                             GoldenEye (1995)
14866                                 Ghost (1990)
14865                            Home Alone (1990)
Name: title, dtype: object
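The cells above rely on a single neighbor. A common extension, shown here only as a sketch and not as what this notebook does, is to combine several similar users and weight each one's rating by its correlation with the target user. The function name, the parameters `k` and `n`, and the toy data are all assumptions for illustration.

```python
import pandas as pd

# Hypothetical ratings: rows are titles, columns are user ids; 0 means "not rated".
toy = pd.DataFrame(
    {
        1: [5.0, 4.0, 0.0, 0.0, 0.0],
        2: [5.0, 4.0, 4.0, 0.0, 1.0],
        3: [4.0, 5.0, 5.0, 2.0, 0.0],
        4: [0.0, 1.0, 0.0, 5.0, 5.0],
    },
    index=["A", "B", "C", "D", "E"],
)


def weighted_recommendations(ratings, user, k=2, n=3):
    """Score unseen titles by a similarity-weighted average over the k nearest users."""
    sims = ratings.corr()[user].drop(labels=user)
    neighbors = sims[sims > 0].nlargest(k)            # keep positively correlated users only
    unseen = ratings.index[ratings[user] == 0]
    block = ratings.loc[unseen, neighbors.index]      # neighbors' ratings of unseen titles
    rated = (block > 0).astype(float)                 # 1.0 where a neighbor actually rated
    weighted_sum = (block * neighbors).sum(axis=1)
    weight_total = (rated * neighbors).sum(axis=1)
    scores = (weighted_sum / weight_total).dropna()   # drop titles no neighbor rated
    return scores.sort_values(ascending=False).head(n)


scores = weighted_recommendations(toy, user=1)
print(scores)
```

Only neighbors who actually rated a title contribute to its weight total, so a title rated by a single neighbor keeps that neighbor's score instead of being diluted by zeros.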