This example project shows that a recommendation engine for large-scale data can be designed on a single laptop, without a distributed environment such as Hadoop or Spark.

### _Objective_ 
* To work as a real recommendation server, the system must respond quickly even under a heavy request load.
* To that end, we look at an approach that combines the `MinHash` algorithm with a `Redis` cache server.

In [1]:
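import json
import os
import numpy as np
import pandas as pd
from tqdm import tqdm
import tensorflow as tf
from datasketch import MinHash  # MinHash class used below (its API matches the datasketch package)
from google_drive_downloader import GoogleDriveDownloader as gdd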

tqdm.pandas()



In [2]:
gdd.download_file_from_google_drive(file_id="1uPjBuhv96mJP9oFi-KNyVzNkSlS7U2xY", dest_path="./movies.csv")
movies_df = pd.read_csv("movies.csv", index_col=0)

gdd.download_file_from_google_drive(file_id="1hik_RSV0e5r_P3iYe4sK8B-eNWxmIWOa", dest_path="./genres.csv")
genres_df = pd.read_csv("genres.csv", index_col=0)

gdd.download_file_from_google_drive(file_id="15vsm-VWAC3Y-7jr7ROL_xy0ufkIfqSke", dest_path="./ratings.csv")
ratings_df = pd.read_csv("ratings.csv", index_col=0)



In [3]:
ratings_df

index,user_id,movie_id,rating,rated_at
0,1,2,3.5,1112486027
1,1,29,3.5,1112484676
2,1,32,3.5,1112484819
3,1,47,3.5,1112484727
4,1,50,3.5,1112484580
...,...,...,...,...
20000258,138493,68954,4.5,1258126920
20000259,138493,69526,4.5,1259865108
20000260,138493,69644,3.0,1260209457
20000261,138493,70286,5.0,1258126944


In [4]:
# Downcast the dtypes to reduce memory use as much as possible.
# Ratings come in 0.5 steps, so doubling them gives integers that fit in uint8.
ratings_df["rating"] = (ratings_df["rating"]*2).astype(np.uint8)
ratings_df["movie_id"] = ratings_df["movie_id"].astype(np.uint32)
ratings_df["user_id"] = ratings_df["user_id"].astype(np.uint32)

# Drop the column we do not need.
ratings_df = ratings_df.drop(columns="rated_at")

In [5]:
ratings_df

index,user_id,movie_id,rating
0,1,2,7
1,1,29,7
2,1,32,7
3,1,47,7
4,1,50,7
...,...,...,...
20000258,138493,68954,9
20000259,138493,69526,9
20000260,138493,69644,6
20000261,138493,70286,10


### (2) Sampling the data

To keep the exercise fast, we keep only the 1,000 movies with the most ratings.

In [6]:
n_ratings_top1000 = ratings_df["movie_id"].value_counts()[:1000].index

ratings_df_top1000 = ratings_df[ratings_df["movie_id"].isin(n_ratings_top1000)]

In [7]:
ratings_df_top1000

index,user_id,movie_id,rating
0,1,2,7
1,1,29,7
2,1,32,7
3,1,47,7
4,1,50,7
...,...,...,...
20000243,138493,53996,9
20000249,138493,59315,8
20000252,138493,60069,8
20000258,138493,68954,9


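- In production, the target response time is usually within 300 ms. Anything slower makes the web page feel sluggish, which can give customers a poor service experience.
- So when designing a recommendation system, the key questions are how quickly it responds and how many requests it can handle.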

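### 1. Too much computation
- The collaborative filtering we have designed so far computes the similarity between every pair of items at once. If the number of items grows 10x, the number of similarity computations grows 100x.
- With just 1 million items, the full Item Similarity Matrix has 10^12 entries, which is about 931 GiB even at one byte per entry. That is beyond the RAM of an ordinary computer, and at that scale the computation has to move to a distributed system such as Hadoop.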
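The 931GB figure can be checked with a few lines of arithmetic. The sketch below assumes a dense square matrix; the function name and the two entry widths (one byte, and four bytes for float32) are illustrative choices, not something the original notebook defines.

```python
def dense_similarity_bytes(n_items: int, bytes_per_entry: int) -> int:
    """Bytes needed to hold a dense n_items x n_items similarity matrix."""
    return n_items * n_items * bytes_per_entry


GIB = 2 ** 30

for n_items in (1_000, 10_000, 1_000_000):
    for label, width in (("1-byte", 1), ("float32", 4)):
        size_gib = dense_similarity_bytes(n_items, width) / GIB
        print(f"{n_items:>9,} items, {label:>7} entries: {size_gib:12.3f} GiB")
```

Going from 1,000 to 10,000 items multiplies the size by 100, and 1 million items at one byte per entry lands at roughly 931 GiB.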

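### 2. Hard to reflect changes in real time
- Computation at that scale is hard to repeat on every request. It is therefore usually run periodically, for example once a day or once an hour, and the Item Similarity Matrix is updated each time. For content recommendation such as news feeds, freshness is essential, so an approach that cannot refresh recommendations in real time is hard to apply.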

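## 1. The similarity measure: Jaccard similarity

Jaccard similarity measures how similar two sets are.

$$
\mathrm{jac\_sim}(A,B) = \frac{|A \cap B|}{|A \cup B|}
$$

It measures how much two sets overlap. Unlike the cosine similarity and Euclidean distance covered earlier, it is used for Boolean data (purchased or not, watched or not, clicked or not).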
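As a minimal sketch of the formula, the function below computes the Jaccard similarity of two Python sets. The function name, the toy viewer sets, and the choice to return 0.0 for two empty sets are assumptions made for illustration.

```python
def jaccard_similarity(a: set, b: set) -> float:
    """|A ∩ B| / |A ∪ B|; defined here as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Hypothetical viewers of two movies
viewers_movie_x = {"u1", "u2", "u3", "u4"}
viewers_movie_y = {"u3", "u4", "u5"}

# 2 shared viewers out of 5 distinct viewers
print(jaccard_similarity(viewers_movie_x, viewers_movie_y))  # 0.4
```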

## 2. A hash with unusual properties: MinHash

We focus on the properties of MinHash here. For an explanation of the concept itself, see the reference below.

reference : [MinHash algorithm explained simply](http://blog.haandol.com/2019/05/25/minhash-algorithm-explained.html)

### (1) An algorithm that hashes sets

MinHash is, at its core, an algorithm for hashing a set. It represents a set as multiple hash values, called a signature.

In [290]:
# The set we want to hash
set_A = {"A","B","C","D"}

minhash = MinHash(num_perm=128)

for value in set_A:
    # Feed the elements into the MinHash one at a time
    minhash.update(str(value).encode('utf8'))
    
minhash.hashvalues

array([ 187542028,  206934951,  383765683,  257361572,   14019373,
         98127798,  510146389,  158207789, 2211499638,  236986188,
       1430055357, 1069935458,  622631458,  859502047,  304814259,
       2930336844,   82639309, 1462000340, 1259992472, 1462270518,
         35851626,   62567127,  669040041,  734884339,  640828744,
       1638357194,  104131353,  338442154,  826472273,  251592307,
       2872577173, 1624068580, 1915339881, 1075083221,  145452920,
        141861766,  565557948,  109692850, 1037588332,  232800860,
         71174338,  904892082,  126924591,    6559914, 1344550122,
       1751405721,  136141014,  469736690,  718739130, 1591066330,
       2073693511,  156225272,  172995981, 1829169708,   48017838,
       1191568394,   59197654,   49810303,   94627355,  345970473,
       1306477605,  502945878,   23067506,   30293773, 1446449111,
       1446167980,  666285169,  522965222,   99415839,   14356784,
         84889254,  952242489, 1614147919,  515649169, 1677435

The steps above can be written as a function:

In [318]:
def get_hash(target_set, sig_size=128):
    minhash = MinHash(sig_size)
    for value in target_set:
        minhash.update(str(value).encode('utf8'))
    return minhash.hashvalues

A MinHash is made up of several different hash functions (128 above). To look at how the algorithm works, let's build a simple example.

In [319]:
hash_a = get_hash({'A'},4)
hash_a

array([2155541700, 3497910862, 1404536498,  257361572], dtype=uint64)

In [320]:
hash_b = get_hash({'B'},4)
hash_b

array([1660848309, 2866125906,  383765683,  690308850], dtype=uint64)

In [321]:
hash_c = get_hash({'C'},4)
hash_c

array([ 432770662, 3839306110,  701789493, 2250697747], dtype=uint64)

MinHash computes a signature (4 values here) for each element, then keeps the smallest value in each signature slot. It is called MinHash because it stores the minimum.

In [322]:
hash_abc = get_hash({"A","B", "C"},4)
hash_abc

array([ 432770662, 2866125906,  383765683,  257361572], dtype=uint64)

In [325]:
# The per-slot minimum across the element signatures is the signature of the set
np.minimum.reduce([hash_a, hash_b, hash_c])

array([ 432770662, 2866125906,  383765683,  257361572], dtype=uint64)
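The "keep the minimum per slot" rule can be reproduced without any library. The sketch below is a toy version that uses one salted SHA-1 hash per signature slot; this is an assumption made for readability and is not how the MinHash class used above derives its hash functions. The name `toy_minhash` is hypothetical, and the input must be non-empty.

```python
import hashlib


def toy_minhash(items, num_hashes=4):
    """Toy MinHash: one salted SHA-1 per signature slot, keep the minimum per slot."""
    signature = []
    for slot in range(num_hashes):
        slot_hashes = [
            # The slot number acts as a salt, giving a different hash function per slot
            int.from_bytes(
                hashlib.sha1(f"{slot}:{item}".encode("utf8")).digest()[:4], "big"
            )
            for item in items
        ]
        signature.append(min(slot_hashes))
    return signature


sig_a = toy_minhash({"A"})
sig_b = toy_minhash({"B"})
sig_c = toy_minhash({"C"})
sig_abc = toy_minhash({"A", "B", "C"})

# The signature of the union equals the slot-wise minimum of the parts.
merged = [min(values) for values in zip(sig_a, sig_b, sig_c)]
print(merged == sig_abc)  # True
```

Because each slot only keeps a minimum, feeding the same element twice or feeding elements in a different order cannot change the result.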

#### Property 1: Duplicate elements give the same result.

As with a set, adding an element that the MinHash already contains leaves its values unchanged.

In [97]:
minhash.update("A".encode('utf8'))

minhash.hashvalues

array([ 187542028,  206934951,  383765683,  257361572,   14019373,
         98127798,  510146389,  158207789, 2211499638,  236986188,
       1430055357, 1069935458,  622631458,  859502047,  304814259,
       2930336844,   82639309, 1462000340, 1259992472, 1462270518,
         35851626,   62567127,  669040041,  734884339,  640828744,
       1638357194,  104131353,  338442154,  826472273,  251592307,
       2872577173, 1624068580, 1915339881, 1075083221,  145452920,
        141861766,  565557948,  109692850, 1037588332,  232800860,
         71174338,  904892082,  126924591,    6559914, 1344550122,
       1751405721,  136141014,  469736690,  718739130, 1591066330,
       2073693511,  156225272,  172995981, 1829169708,   48017838,
       1191568394,   59197654,   49810303,   94627355,  345970473,
       1306477605,  502945878,   23067506,   30293773, 1446449111,
       1446167980,  666285169,  522965222,   99415839,   14356784,
         84889254,  952242489, 1614147919,  515649169, 1677435

#### Property 2: Order does not matter.

Also as with a set, updating with the elements in a different order returns the same result.

In [98]:
minhash = MinHash()

for value in ["A","B","C","D"]:
    minhash.update(str(value).encode('utf8'))
    
minhash.hashvalues

array([ 187542028,  206934951,  383765683,  257361572,   14019373,
         98127798,  510146389,  158207789, 2211499638,  236986188,
       1430055357, 1069935458,  622631458,  859502047,  304814259,
       2930336844,   82639309, 1462000340, 1259992472, 1462270518,
         35851626,   62567127,  669040041,  734884339,  640828744,
       1638357194,  104131353,  338442154,  826472273,  251592307,
       2872577173, 1624068580, 1915339881, 1075083221,  145452920,
        141861766,  565557948,  109692850, 1037588332,  232800860,
         71174338,  904892082,  126924591,    6559914, 1344550122,
       1751405721,  136141014,  469736690,  718739130, 1591066330,
       2073693511,  156225272,  172995981, 1829169708,   48017838,
       1191568394,   59197654,   49810303,   94627355,  345970473,
       1306477605,  502945878,   23067506,   30293773, 1446449111,
       1446167980,  666285169,  522965222,   99415839,   14356784,
         84889254,  952242489, 1614147919,  515649169, 1677435

In [99]:
minhash = MinHash()

for value in ["D","C","B","A"]:
    minhash.update(str(value).encode('utf8'))
    
minhash.hashvalues

array([ 187542028,  206934951,  383765683,  257361572,   14019373,
         98127798,  510146389,  158207789, 2211499638,  236986188,
       1430055357, 1069935458,  622631458,  859502047,  304814259,
       2930336844,   82639309, 1462000340, 1259992472, 1462270518,
         35851626,   62567127,  669040041,  734884339,  640828744,
       1638357194,  104131353,  338442154,  826472273,  251592307,
       2872577173, 1624068580, 1915339881, 1075083221,  145452920,
        141861766,  565557948,  109692850, 1037588332,  232800860,
         71174338,  904892082,  126924591,    6559914, 1344550122,
       1751405721,  136141014,  469736690,  718739130, 1591066330,
       2073693511,  156225272,  172995981, 1829169708,   48017838,
       1191568394,   59197654,   49810303,   94627355,  345970473,
       1306477605,  502945878,   23067506,   30293773, 1446449111,
       1446167980,  666285169,  522965222,   99415839,   14356784,
         84889254,  952242489, 1614147919,  515649169, 1677435

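#### Property 3: Similar sets give similar hash values.

This is one of MinHash's most important properties: the IoU (Jaccard similarity) of two sets and the fraction of matching slots in their signatures come out close to each other.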

In [116]:
set_A = {"A","B","C","D","E","F","G","H"}
set_B = {"D","E","F","G","H","I","J","K"}

#### Computing the IoU of the two sets

In [125]:
intersection = len(set_A & set_B)
union = len(set_A | set_B)
iou = intersection / union 
iou

0.45454545454545453

#### Computing the MinHash of the two sets

In [126]:
minhash_A = get_hash(set_A)
minhash_B = get_hash(set_B)

In [127]:
minhash_iou = np.mean(minhash_A == minhash_B)
minhash_iou

0.5

As the MinHash signature size grows, the fraction of matching slots tends to get closer to the IoU of the sets.

In [128]:
sig_size = 256

minhash_A = get_hash(set_A, sig_size)
minhash_B = get_hash(set_B, sig_size)

In [129]:
np.mean(minhash_A == minhash_B)

0.44921875
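One way to reason about how large the signature should be: if each slot is treated as an independent trial that matches with probability equal to the true Jaccard similarity (an idealized assumption about the hash functions), the estimate behaves like a binomial proportion. The sketch below computes the resulting standard error; the function name is hypothetical.

```python
import math


def minhash_standard_error(jaccard: float, sig_size: int) -> float:
    """Standard error of the match-rate estimate under the independent-slot assumption."""
    return math.sqrt(jaccard * (1.0 - jaccard) / sig_size)


true_jaccard = 5 / 11  # the IoU of set_A and set_B above

for sig_size in (128, 256, 1024):
    print(sig_size, round(minhash_standard_error(true_jaccard, sig_size), 4))
```

Under this assumption the error is about 0.044 with 128 slots and about 0.031 with 256, so quadrupling the signature size halves the expected error.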

## 3. In-Memory DB, Redis

Redis is a widely used in-memory key-value database. Put simply, it stores data as keys and values much like a Python dict, but because it keeps the data in RAM, unlike a traditional RDBMS, reads and writes are much faster. Redis runs on Linux and macOS, and on Windows through WSL, so it is easy to install on a local machine.

reference : [Redis on Wikipedia](https://en.wikipedia.org/wiki/Redis)

In [130]:
from redis import Redis

In [131]:
db = Redis()

#### Writing data

In [132]:
db.set("A",1)

True

#### Reading data

In [133]:
db.get('A')

b'1'

#### Updating data

In [139]:
db.set('A',2)

True

#### Deleting data

In [140]:
db.delete('A')

1

In [141]:
db.get('A')

# \[ Building a Recommendation Engine with MinHash and an In-Memory Cache DB \]
---

## 1. Building a click stream per item

You can think of a click stream as the same thing as the shopping basket covered in the previous session.

In [142]:
click_stream = (
    ratings_df_top1000  # from the sampled rating data,
    .groupby('movie_id')  # group by item,
    ['user_id']  # collect the user_ids,
    .apply(set)  # and turn each group into a set
)

## 2. Converting to MinHash hash values

Let's compute the MinHash of each movie.

In [143]:
minhash_per_item = click_stream.progress_apply(get_hash)
minhash_per_item

100%|██████████| 1000/1000 [03:12<00:00,  5.19it/s]


movie_id
1        [219544, 2701, 91343, 60078, 399183, 8658, 958...
2        [5838, 571359, 339908, 96928, 519416, 54054, 2...
3        [324981, 571359, 562296, 438775, 392841, 8658,...
5        [5838, 206413, 339908, 320381, 1656504, 8658, ...
6        [391396, 69615, 154141, 96928, 399183, 174070,...
                               ...                        
78499    [1313292, 231996, 416848, 283498, 2997276, 267...
79132    [62857, 69615, 339908, 283498, 590106, 267979,...
80463    [1381294, 231996, 369410, 320381, 608509, 2679...
81591    [62857, 231996, 369410, 320381, 608509, 176937...
81845    [62857, 69615, 369410, 2746276, 1842589, 26797...
Name: user_id, Length: 1000, dtype: object

In [147]:
minhv_df = pd.DataFrame(np.stack(minhash_per_item.values))
minhv_df.index = minhash_per_item.index
columns = [f"sig{col}" for col in minhv_df.columns]
minhv_df.columns = columns
minhv_df

movie_id,sig0,sig1,sig2,sig3,sig4,sig5,sig6,sig7,sig8,sig9,...,sig118,sig119,sig120,sig121,sig122,sig123,sig124,sig125,sig126,sig127
1,219544,2701,91343,60078,399183,8658,9587,78203,276205,186778,...,23419,59397,115212,106014,31411,147210,39891,73507,40856,11594
2,5838,571359,339908,96928,519416,54054,29607,428337,522674,101776,...,23419,228680,188830,696179,31411,147210,186694,159780,454859,199075
3,324981,571359,562296,438775,392841,8658,29607,78203,675778,381341,...,110941,632098,192118,252534,189484,251638,74487,159780,409530,219264
5,5838,206413,339908,320381,1656504,8658,29607,78203,889257,422515,...,906144,59397,192118,297082,31411,734722,74487,159780,142912,542472
6,391396,69615,154141,96928,399183,174070,195534,21124,522674,222732,...,38599,366954,188830,94105,13220,852718,39891,159780,142912,317167
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
78499,1313292,231996,416848,283498,2997276,267979,810065,157034,363895,618052,...,341332,2166067,174107,1108468,1462250,630262,232626,638116,1636785,1461658
79132,62857,69615,339908,283498,590106,267979,13311,21124,363895,421406,...,23419,366954,174107,1013415,15473,147210,232626,76713,1598624,138745
80463,1381294,231996,369410,320381,608509,267979,29607,21124,683372,762393,...,1218622,2585154,188830,1251259,470601,852718,232626,638116,1646759,2070256
81591,62857,231996,369410,320381,608509,1769372,1142150,1662733,683372,762393,...,3321438,366954,188830,1251259,1499643,852718,240072,76713,1598624,138745


From these signatures, the Item Similarity Matrix between all items is computed as follows.

In [158]:
hash_ious = np.mean(minhv_df.values[:,None,:] == minhv_df.values[None,:,:],axis=-1)
item_sim_df = pd.DataFrame(hash_ious, columns=minhv_df.index,index=minhv_df.index)
item_sim_df

movie_id,1,2,3,5,6,7,10,11,14,16,...,70286,71535,72998,73017,74458,78499,79132,80463,81591,81845
1,1.000000,0.250000,0.171875,0.195312,0.281250,0.187500,0.242188,0.156250,0.070312,0.195312,...,0.062500,0.062500,0.070312,0.054688,0.054688,0.070312,0.078125,0.046875,0.046875,0.039062
2,0.250000,1.000000,0.140625,0.125000,0.164062,0.125000,0.203125,0.171875,0.046875,0.156250,...,0.085938,0.085938,0.054688,0.039062,0.062500,0.054688,0.070312,0.031250,0.054688,0.039062
3,0.171875,0.140625,1.000000,0.359375,0.125000,0.335938,0.062500,0.093750,0.164062,0.070312,...,0.015625,0.007812,0.007812,0.007812,0.007812,0.015625,0.007812,0.015625,0.000000,0.007812
5,0.195312,0.125000,0.359375,1.000000,0.140625,0.289062,0.062500,0.125000,0.156250,0.070312,...,0.031250,0.046875,0.031250,0.031250,0.023438,0.039062,0.023438,0.039062,0.031250,0.023438
6,0.281250,0.164062,0.125000,0.140625,1.000000,0.156250,0.296875,0.171875,0.070312,0.273438,...,0.125000,0.101562,0.062500,0.093750,0.078125,0.070312,0.117188,0.109375,0.093750,0.085938
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
78499,0.070312,0.054688,0.015625,0.039062,0.070312,0.031250,0.039062,0.031250,0.007812,0.062500,...,0.234375,0.234375,0.210938,0.242188,0.218750,1.000000,0.320312,0.242188,0.187500,0.242188
79132,0.078125,0.070312,0.007812,0.023438,0.117188,0.046875,0.054688,0.023438,0.015625,0.078125,...,0.382812,0.242188,0.375000,0.257812,0.281250,0.320312,1.000000,0.328125,0.289062,0.250000
80463,0.046875,0.031250,0.015625,0.039062,0.109375,0.023438,0.054688,0.031250,0.007812,0.085938,...,0.273438,0.203125,0.234375,0.242188,0.304688,0.242188,0.328125,1.000000,0.367188,0.312500
81591,0.046875,0.054688,0.000000,0.031250,0.093750,0.023438,0.039062,0.031250,0.000000,0.054688,...,0.242188,0.210938,0.210938,0.250000,0.304688,0.187500,0.289062,0.367188,1.000000,0.257812


## 3. Storing the signatures in Redis as a secondary index

Building the full Item Similarity DataFrame as above becomes infeasible when the number of items is large. Storing the signatures in Redis as a secondary index is a simpler way to handle it.

In [273]:
from redis import Redis

In [274]:
# Connect to Redis
db = Redis('localhost', port=6379)    

for sig_name in tqdm(minhv_df.columns):
    signature_series = minhv_df[sig_name]
    
    for sig_value, grouped in (
        signature_series.groupby(signature_series)):
        # Build the key, e.g. "sig0-219544"
        key_string = "{}-{}".format(sig_name, sig_value)
        # Serialize the movie ids sharing this value; for a list of ints, str() yields valid JSON
        value_string = str(grouped.index.values.tolist())
        db.set(key_string, value_string)

100%|██████████| 128/128 [00:08<00:00, 15.84it/s]
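To make the key layout concrete, here is a small sketch that builds the same kind of secondary index in a plain Python dict instead of Redis. The toy signatures, item ids, and function name are hypothetical; only the `sig<slot>-<value>` key format follows the cell above.

```python
from collections import defaultdict

# Hypothetical toy signatures: item id -> MinHash signature (3 slots for brevity)
toy_signatures = {
    101: [7, 42, 13],
    102: [7, 99, 13],
    103: [5, 42, 88],
}


def build_secondary_index(signatures):
    """Map each "sig<slot>-<value>" key to the list of item ids sharing that value."""
    index = defaultdict(list)
    for item_id, signature in sorted(signatures.items()):
        for slot, value in enumerate(signature):
            index[f"sig{slot}-{value}"].append(item_id)
    return dict(index)


index = build_secondary_index(toy_signatures)
print(index["sig0-7"])   # [101, 102]
print(index["sig1-42"])  # [101, 103]
```

Each key answers the question "which items have this value in this slot", so finding candidates for an item takes one lookup per slot rather than a comparison against every other item.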


## 4. Running the recommendation system

### (1) Recommending items

In [275]:
# Recommendations for someone who watched Harry Potter and the Goblet of Fire
movies_df[movies_df.id==40815]

index,id,title,release_year
10600,40815,Harry Potter and the Goblet of Fire,2005


#### Fetching the MinHash signature for Goblet of Fire

In [276]:
target_item = minhv_df.loc[40815]
target_item

sig0      320552
sig1      231996
sig2      339908
sig3       96928
sig4      490371
           ...  
sig123    147210
sig124    232626
sig125    883389
sig126    123689
sig127    199075
Name: 40815, Length: 128, dtype: uint64

#### Sending the query to the DB

In [277]:
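%%time

# Build one lookup key per signature slot, e.g. "sig0-320552"
queries = []
for k, v in target_item.items():
    queries.append(f'{k}-{v}')
    
# Fetch every bucket in a single round trip and flatten the movie ids
intersected_movie_ids = np.concatenate(
    [json.loads(row) for row in db.mget(queries)])
    
# Count how many signature slots each movie shares with the target
items, counts = np.unique(intersected_movie_ids,
                          return_counts=True)    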

CPU times: user 4.7 ms, sys: 1.04 ms, total: 5.74 ms
Wall time: 6.66 ms
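The query step boils down to counting how often each item id appears across the looked-up buckets. The sketch below shows that logic with a plain dict standing in for Redis; the toy index contents and the function name are hypothetical.

```python
from collections import Counter

# Hypothetical secondary index: "sig<slot>-<value>" -> item ids sharing that value
toy_index = {
    "sig0-7": [101, 102], "sig0-5": [103],
    "sig1-42": [101, 103], "sig1-99": [102],
    "sig2-13": [101, 102], "sig2-88": [103],
}


def similar_items(index, target_signature, top_n=10):
    """Rank items by the fraction of signature slots they share with the target."""
    keys = [f"sig{slot}-{value}" for slot, value in enumerate(target_signature)]
    hits = Counter()
    for key in keys:
        hits.update(index.get(key, []))  # a missing key contributes nothing
    return [(item_id, count / len(keys)) for item_id, count in hits.most_common(top_n)]


# Item 101 has signature [7, 42, 13], so it matches itself in every slot
print(similar_items(toy_index, [7, 42, 13]))
```

The score is the estimated Jaccard similarity, which is why the queried item itself always comes back with a score of 1.0.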


#### Viewing the results

In [278]:
result = pd.Series(counts,index=items).sort_values(ascending=False)/128
result.index = result.index.map(dict(zip(movies_df.id,movies_df.title)))
result.iloc[:10]

Harry Potter and the Goblet of Fire                                                        1.000000
Harry Potter and the Prisoner of Azkaban                                                   0.562500
Harry Potter and the Order of the Phoenix                                                  0.507812
Harry Potter and the Chamber of Secrets                                                    0.460938
Harry Potter and the Sorcerer's Stone (a.k.a. Harry Potter and the Philosopher's Stone)    0.429688
Chronicles of Narnia: The Lion, the Witch and the Wardrobe, The                            0.382812
Spider-Man 2                                                                               0.367188
Pirates of the Caribbean: Dead Man's Chest                                                 0.351562
Shrek 2                                                                                    0.328125
Spider-Man                                                                                 0.289062


### (2) Updating the data

#### Example scenario
> User 1551 watches Harry Potter and the Goblet of Fire (40815).

In [279]:
old_minhash = minhv_df.loc[40815]
old_minhash

sig0      320552
sig1      231996
sig2      339908
sig3       96928
sig4      490371
           ...  
sig123    147210
sig124    232626
sig125    883389
sig126    123689
sig127    199075
Name: 40815, Length: 128, dtype: uint64

In [280]:
new_minhash = get_hash([1551])
new_minhash

array([1048627272, 1513800844, 1051346822,  243397122, 4254558996,
       3910579161, 2714463644, 3226051397, 3405622008, 2489970010,
       3882600967,  270858047, 1815424979, 1556920697, 2342500043,
       1145634247, 1456631817, 2486033195,  956985939, 4270610901,
        177317663, 3844107406, 3106019867,  628449413, 3096728226,
       4092191400,  271992335, 2413857880, 3222240607, 2151898277,
       2497695043, 3707997569,  299063558, 1292431369, 3652634318,
       4153710162,  192991774, 2328127284, 1368331266,  968917771,
       3089515145, 2743966941, 2006832071,  885742492,  537916551,
       4253004388, 2695105263, 2353687496, 1063046712, 2555458223,
       4291586588,  312503410, 1110517264, 3452261949, 1499534944,
        302670327,  918593656, 3718382526,  545465343, 2022429752,
       1280982525,   33216477, 1374618577, 2153898073, 3045751088,
        610197203, 2994775872, 1838069218, 2243730264,  701002237,
       3719154976,  599137673,  174320139, 1853719633, 1586564

To update a MinHash, compare the new MinHash with the existing one and keep the minimum in each signature slot.

In [281]:
updated_minhash = np.minimum(old_minhash, new_minhash)
updated_minhash

sig0      320552
sig1      231996
sig2      339908
sig3       96928
sig4      490371
           ...  
sig123    147210
sig124    232626
sig125    883389
sig126    123689
sig127    199075
Name: 40815, Length: 128, dtype: uint64

#### Secondary index entries to remove

Only the signature slots that changed need updating. The value below is the previous MinHash value, which has to be removed from the secondary index.

In [282]:
old_kv = old_minhash[updated_minhash!=old_minhash]
old_kv

sig122    31411
Name: 40815, dtype: uint64

That secondary index entry currently holds this movie's id, so the id has to be removed from it.

In [283]:
k = old_kv.index[0]
v = old_kv.values[0]

old_index_list = json.loads(db.get(f"{k}-{v}"))
print(old_index_list)

[1, 2, 5, 16, 17, 22, 25, 48, 58, 70, 105, 110, 111, 141, 151, 168, 186, 194, 198, 208, 222, 224, 231, 253, 261, 266, 273, 292, 293, 318, 339, 342, 345, 355, 356, 368, 372, 376, 410, 420, 432, 440, 497, 509, 515, 520, 527, 535, 539, 541, 551, 552, 553, 586, 587, 594, 719, 724, 762, 778, 780, 785, 788, 832, 838, 852, 910, 912, 919, 920, 969, 1027, 1036, 1073, 1080, 1084, 1092, 1093, 1094, 1097, 1100, 1101, 1120, 1127, 1136, 1183, 1193, 1197, 1198, 1199, 1210, 1214, 1219, 1221, 1225, 1246, 1247, 1258, 1259, 1263, 1265, 1270, 1271, 1275, 1278, 1291, 1296, 1307, 1339, 1345, 1356, 1370, 1377, 1380, 1387, 1391, 1394, 1407, 1408, 1409, 1485, 1500, 1527, 1544, 1562, 1569, 1573, 1580, 1608, 1635, 1641, 1653, 1673, 1680, 1721, 1747, 1777, 1784, 1801, 1876, 1917, 1923, 1958, 1961, 1968, 2000, 2001, 2003, 2012, 2020, 2054, 2100, 2115, 2124, 2125, 2134, 2144, 2145, 2160, 2161, 2167, 2174, 2193, 2289, 2291, 2294, 2321, 2336, 2340, 2355, 2369, 2394, 2396, 2405, 2406, 2407, 2420, 2455, 2467, 2470, 254

Remove 40815, the movie id of Goblet of Fire, from the list above.

In [286]:
movie_id = 40815  # Harry Potter and the Goblet of Fire
old_index_list.remove(movie_id)
db.set(f"{k}-{v}", str(old_index_list))

True

#### Secondary index entries to add

In [287]:
new_kv = updated_minhash[updated_minhash!=old_minhash]
new_kv

sig122    15473
Name: 40815, dtype: uint64

The new secondary index entry needs to store this movie's id.

In [288]:
k = new_kv.index[0]
v = new_kv.values[0]

new_index_list = json.loads(db.get(f"{k}-{v}"))
print(new_index_list)

[50, 260, 858, 1079, 1175, 1196, 1200, 1213, 1240, 1682, 1732, 2571, 2959, 2997, 3911, 4226, 4973, 4993, 4995, 5418, 5669, 5995, 6016, 7143, 30707, 30749, 44195, 49272, 54286, 60684, 63082, 68954, 70286, 79132, 81845]


Add 40815, the movie id of Goblet of Fire, to this list.

In [289]:
movie_id = 40815  # Harry Potter and the Goblet of Fire
new_index_list.append(movie_id)
db.set(f"{k}-{v}", str(new_index_list))

True
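The walkthrough above patches a single changed slot by hand. As a sketch of how the whole update could be wrapped up, the function below handles every changed slot in one pass. It swaps in a different storage layout from the notebook: each key is assumed to hold a Redis set (written with `sadd`/`srem`) rather than a serialized list. `InMemorySetStore` is a stand-in written only for this demo, and all names are hypothetical.

```python
def apply_signature_update(db, item_id, old_signature, item_hash):
    """Merge one new element's hash into an item's signature and patch the index.

    `db` is any client exposing redis-py style `sadd`/`srem`; each
    "sig<slot>-<value>" key is assumed to hold a set of item ids.
    """
    new_signature = [min(old, new) for old, new in zip(old_signature, item_hash)]
    for slot, (old, new) in enumerate(zip(old_signature, new_signature)):
        if old != new:  # only changed slots touch the index
            db.srem(f"sig{slot}-{old}", item_id)
            db.sadd(f"sig{slot}-{new}", item_id)
    return new_signature


class InMemorySetStore:
    """Stand-in for a Redis client, only for this demo (not part of redis-py)."""

    def __init__(self):
        self.data = {}

    def sadd(self, key, *members):
        self.data.setdefault(key, set()).update(members)

    def srem(self, key, *members):
        self.data.get(key, set()).difference_update(members)


store = InMemorySetStore()
old_signature = [30, 50, 20]
for slot, value in enumerate(old_signature):
    store.sadd(f"sig{slot}-{value}", 40815)

new_signature = apply_signature_update(store, 40815, old_signature, [40, 10, 20])
print(new_signature)  # [30, 10, 20]
```

With sets, adding or removing one id does not require reading and rewriting the whole bucket, which matters once buckets hold hundreds of ids as in the output above.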


---

    Copyright(c) 2020 by Public AI. All rights reserved.
    Written by PAI, SeonYoul Choi ( best10@publicai.co.kr )  last updated on 2020/06/06


---