# What is product recommendation system?

## Use Case 
今晚有奧運羽球金牌戰，中華隊戴資穎超棒一定要幫她加油！看比賽配啤酒最對味！趁比賽前趕快買啤酒。可能的行為：

- Step 1：去家裡附近的 7-11/全家/全聯/家樂福，直接走向冷藏櫃找到放**「啤酒」**的區域，比較了幾個品牌、口味、促銷活動和價格，想想今天的心情與想喝的口味，做好決定後拿起選定的啤酒放進購物籃裡

<img src="https://thumbs.dreamstime.com/z/%E5%95%A4%E9%85%92%E5%93%81%E7%A7%8D%E5%9C%A8-eleven%E4%BE%BF%E5%88%A9%E5%95%86%E5%BA%97%E7%9A%84-96069550.jpg" alt="drawing" width="1000"/>

- Step 2：光喝啤酒胃有點空虛，想順便帶點下酒**「零食」**。走到零食區，比較了幾個品牌、口味、促銷活動和價格，選定了喜歡的零食也放進購物藍裡
- Step 3：走去櫃檯結帳時，登入會員後，店員看了你的消費記錄說：「現在XX牌衛生棉有買一送一的優惠，需要一起帶嗎？」妳想起了生理期快到了，剛好沒有存貨而且是妳愛用的牌子就一起結帳買了


## Decision Making
1. 結帳前
    > 相同分類：類似商品，藉由比較不同規格，找到符合需求的商品 </br>
    > 不同分類：常一起被搭配購買的商品
2. 結帳時
    > 針對個人消費習慣與偏好，客製化推薦商品


## Main Objective
http://www.woshipm.com/pd/2707270.html
1. 顧客可能想買什麼？
1. 想讓顧客接續買什麼？
1. 更好的挖掘长尾商品：主流商品代表大多數顧客的需求，而长尾商品则代表小众用户个性化需求


## Solutions 
https://itw01.com/D8Q2EBJ.html
1. Content-Based 基于内容
1. Association Rules 關聯規則學習
1. FP-Growth
1. word2vec
1. Collaborative Filtering (CF) 協同過濾
    1. item-based
    1. user-based


# 1.Association Rules


1. 關聯分析是什麼? </br>
在大量數據中找尋資料彼此之間的關聯，它是透過兩種主要的方式來進行分析: 頻繁項集、關聯規則

    - 頻繁項集 (Frequent Itemsets): 經常一起出現的物品集合  
    - 關聯規則 (Association Rules): 表達數據之間的可能存在很強關聯姓


 2. 主要透過 3 個指標來挖掘數據間關聯的強弱
<img src="https://miro.medium.com/max/2134/1*--iUPe_DtzKdongjqZ2lOg.png" alt="drawing" width="1200"/>

```
Lift(A>D) = (2/5) / [(3/5)*(3/5)] = 10/9
```



- **支持度(Support)**：
支持度表示為 item-set 在整個資料集中出現的頻率：<br>
`Support(X,Y) = P(X∩Y) = Count(X∪Y) / Count(ALL_DATA)`
<br>
<br>
- **置信度(Confidence)**：
置性度表示當事件X發生的情況下，有多少顧客也買了 Y： <br>
`Confidence(X→Y) = P(Y|X) = P(X∩Y) / P(X)`
<br>
<br>
- **提升度(Lift)**：
計算 X 與 Y 之間的關聯性，若值接近 1 表示互相獨立，愈高表示關聯性愈強：<br> 
`Lift(X→Y) = P(X∩Y) / (P(X) * P(Y))`


## by Query

In [None]:
#@title Query #2.5 GB

from google.colab import auth
auth.authenticate_user()
print('Authenticated')

from datetime import datetime
from google.cloud import bigquery

client = bigquery.Client( 'shopline-142003')
print( 'Query Start : ', datetime.now())

raw_data = client.query('''


WITH events_dedup AS (
  SELECT * EXCEPT(rn)
  FROM (
    SELECT
      merchant_id, http_cookie, created_at, product_id, event_name,
      ROW_NUMBER() OVER (PARTITION BY id ORDER BY created_at) AS rn
    FROM `datawarehouse.events_hourly`
    WHERE
      DATE(created_at)= '2021-08-10'
      AND merchant_id = '55d3ecabe37ec6fbbf00003f' #生活倉庫
  )
  WHERE rn = 1
)
,session_split AS (
  SELECT
    merchant_id, http_cookie, product_id, event_name, created_at,
    -- Use created_at as session_start on first and idle over 30 mins records, then filldown
    MAX(CASE WHEN idle_sec IS NULL OR idle_sec > 1800 THEN created_at ELSE NULL END)
      OVER(PARTITION BY merchant_id, http_cookie ORDER BY created_at) AS session_start
  FROM (
    SELECT *,
      -- Difference between created_at and previous created_at in seconds
      TIMESTAMP_DIFF(created_at, LAG(created_at) OVER(PARTITION BY merchant_id, http_cookie ORDER BY created_at), SECOND) AS idle_sec
    FROM events_dedup
  )
),
product_view_session AS (
  SELECT *
  FROM session_split
  WHERE
    IFNULL(product_id, '') != '' AND
    event_name = 'View'
),
merchants_view AS (
  SELECT merchant_id, count(*) AS merchant_total_view
  FROM product_view_session
  GROUP BY merchant_id
),
products_view AS (
  SELECT merchant_id, product_id, count(*) AS product_total_view
  FROM product_view_session
  GROUP BY merchant_id, product_id
),
products_view_probability AS (
  SELECT products_view.merchant_id, product_id, (product_total_view / merchant_total_view) AS product_view_pr
  FROM products_view
  INNER JOIN merchants_view
    ON products_view.merchant_id = merchants_view.merchant_id
),
products_session_grouping AS (
  SELECT merchant_id, product_list
  FROM (
    SELECT merchant_id, http_cookie, session_start, ARRAY_AGG(DISTINCT product_id) AS product_list
    FROM product_view_session
    GROUP BY merchant_id, http_cookie, session_start
  )
  WHERE ARRAY_LENGTH(product_list) >= 2 AND ARRAY_LENGTH(product_list) <= 100
),
products_view_association AS (
  SELECT merchant_id, product_1, product_2, count(*) AS pattern_frq
  FROM products_session_grouping
  CROSS JOIN UNNEST(product_list) AS product_1
  CROSS JOIN UNNEST(product_list) AS product_2
  WHERE product_1 > product_2 -- deduplicate record
  GROUP BY merchant_id, product_1, product_2
  HAVING pattern_frq > 1
),
association_result AS (
  SELECT merchant_id, product_1, product_2, pattern_frq FROM products_view_association
  UNION ALL 
  SELECT merchant_id, product_2, product_1, pattern_frq FROM products_view_association
) 


SELECT
  association.merchant_id,
  product_1,
  product_2,
  (pattern_frq/merchant_total_view) AS support,
  -- Support(X,Y) = P(X∩Y) = Count(X∪Y) / Count(ALL_DATA)
  (pattern_frq/product_total_view) AS confidence,
  -- Confidence(X→Y) = P(Y|X) = P(X∩Y) / P(X)
  (pattern_frq/merchant_total_view) / (probability_1.product_view_pr * probability_2.product_view_pr) AS lift
  -- Lift(X→Y) = P(X∩Y) / (P(X) * P(Y))
FROM 
    association_result AS association
    INNER JOIN merchants_view
    ON association.merchant_id = merchants_view.merchant_id
    INNER JOIN products_view
    ON association.merchant_id = products_view.merchant_id
        AND association.product_1 = products_view.product_id  
    INNER JOIN products_view_probability AS probability_1
    ON association.merchant_id = probability_1.merchant_id
        AND association.product_1 = probability_1.product_id  
    INNER JOIN products_view_probability AS probability_2
    ON association.merchant_id = probability_2.merchant_id
        AND association.product_2 = probability_2.product_id


''' ).to_dataframe()

print( 'Query Done : ', datetime.now())
print(raw_data.shape)

raw_data.head(5)

Authenticated
Query Start :  2021-08-11 04:20:53.901170
Query Done :  2021-08-11 04:20:59.420406
(6968, 6)


Unnamed: 0,merchant_id,product_1,product_2,support,confidence,lift
0,55d3ecabe37ec6fbbf00003f,60ac69f998470f001a8eb132,5ebe5e224f4d79003c751094,0.000293,0.103448,18.773573
1,55d3ecabe37ec6fbbf00003f,602e28e0b53acb0014a69a0d,5e6f24c55299bc001b8f8ae2,0.000634,0.043046,0.844738
2,55d3ecabe37ec6fbbf00003f,602e28e0b53acb0014a69a0d,58b6479cd4e3952f68001bdc,0.000195,0.013245,5.779062
3,55d3ecabe37ec6fbbf00003f,60fe82e01b3c3b0011b61932,602e28e0b53acb0014a69a0d,0.000634,0.004438,0.301383
4,55d3ecabe37ec6fbbf00003f,60fe82e01b3c3b0011b61932,5e6f24c55299bc001b8f8ae2,0.001707,0.011949,0.234496


## by python

Reference
- https://pypi.org/project/apyori/

```python
from apyori import apriori

transactions = [
    ['beer', 'nuts'],
    ['beer', 'cheese'],
]
results = list(apriori(transactions))
```

Fields of output mean:
- Base item
- Appended item
- Support
- Confidence
- Lift


In [None]:
#@title import apriori

!pip install apyori
from apyori import apriori

Collecting apyori
  Downloading apyori-1.1.2.tar.gz (8.6 kB)
Building wheels for collected packages: apyori
  Building wheel for apyori (setup.py) ... [?25l[?25hdone
  Created wheel for apyori: filename=apyori-1.1.2-py3-none-any.whl size=5975 sha256=f63dbbde25eb4b1b33e6aa8af32d4aab931c1e7199ab262a83bb8cbc7885b7dd
  Stored in directory: /root/.cache/pip/wheels/cb/f6/e1/57973c631d27efd1a2f375bd6a83b2a616c4021f24aab84080
Successfully built apyori
Installing collected packages: apyori
Successfully installed apyori-1.1.2


## Simple example

In [None]:
#@title Code

data = [
['Milk', 'Onion', 'Potato', 'Beans', 'Eggs', 'Yogurt'],
['Beef', 'Onion', 'Potato', 'Beans', 'Eggs', 'Yogurt'],
['Milk', 'Apple', 'Beans', 'Eggs'],
['Milk', 'Unicorn', 'Corn', 'Beans', 'Yogurt'],
['Corn', 'Onion', 'Onion', 'Beans', 'Ice cream', 'Eggs']]

association_rules = apriori(data, min_support=0.0001, min_confidence=0.01, min_lift=2, max_length=2) 
# min_support = 0.0001 , 只要該組合在每一萬次的 session 裡有出現一次即可
# min_confidence = 0.01 , 同一個 Session 看過 A 商品，同時出現 B 商品的機率
# min_lift = 2 , 當事情X發生的情況下，同時發生Y的可能性，若值接近 1 表示互相獨立，愈高表示關聯性愈強
# max_length = 2, 組合內含的最大商品個數
association_results = list(association_rules)
print( "combinations: " , len(association_results))

print("=====================================")
for item in association_results:
    print(item)

    pair = item[0] 
    items = [x for x in pair]
    print("Rule: " + items[0] + " -> " + items[1])
    print("Support: " + str(item[1]))
    print("Confidence: " + str(item[2][0][2]))
    print("Lift: " + str(item[2][0][3]))
    print("=====================================")

combinations:  3
RelationRecord(items=frozenset({'Potato', 'Beef'}), support=0.2, ordered_statistics=[OrderedStatistic(items_base=frozenset({'Beef'}), items_add=frozenset({'Potato'}), confidence=1.0, lift=2.5), OrderedStatistic(items_base=frozenset({'Potato'}), items_add=frozenset({'Beef'}), confidence=0.5, lift=2.5)])
Rule: Potato -> Beef
Support: 0.2
Confidence: 1.0
Lift: 2.5
RelationRecord(items=frozenset({'Corn', 'Ice cream'}), support=0.2, ordered_statistics=[OrderedStatistic(items_base=frozenset({'Corn'}), items_add=frozenset({'Ice cream'}), confidence=0.5, lift=2.5), OrderedStatistic(items_base=frozenset({'Ice cream'}), items_add=frozenset({'Corn'}), confidence=1.0, lift=2.5)])
Rule: Corn -> Ice cream
Support: 0.2
Confidence: 0.5
Lift: 2.5
RelationRecord(items=frozenset({'Corn', 'Unicorn'}), support=0.2, ordered_statistics=[OrderedStatistic(items_base=frozenset({'Corn'}), items_add=frozenset({'Unicorn'}), confidence=0.5, lift=2.5), OrderedStatistic(items_base=frozenset({'Unicorn

注意事項：
1. apyori library 要求 dataset 格式為 list of list，所以要 pandas dataframe 格式轉變為 list of list

## popola - Session

In [None]:
#@title events #6.5 GB

from google.colab import auth
auth.authenticate_user()
print('Authenticated')

from datetime import datetime
from google.cloud import bigquery

client = bigquery.Client( 'shopline-142003')
print( 'Query Start : ', datetime.now())

raw_data = client.query('''

WITH beta_merchants AS (

SELECT
  handle,
  _id AS merchant_id
FROM 
  `shopline-test.looker_prod.merchants_hourly` 
WHERE 
  handle IN ('popola')

)
, base AS (
  SELECT
    merchant_id,
    http_cookie,
    event_name,
    product_id,
    created_at,
    DENSE_RANK() OVER ( ORDER BY merchant_id, http_cookie ) AS dense_rk,
    COUNT( DISTINCT product_id ) OVER (PARTITION BY merchant_id, http_cookie) AS cn
  FROM
    `shopline-142003.datawarehouse.events_hourly`
  WHERE
    DATE(created_at, 'Asia/Taipei') >= "2021-08-01"
    AND DATE(created_at, 'Asia/Taipei') <= "2021-08-10"
    AND product_id IS NOT NULL
    AND event_name = 'View'  
)
, agg AS (
  SELECT
    handle,
    base.merchant_id	,
    http_cookie,
    event_name,
    product_id,
  FROM 
    base
    JOIN beta_merchants ON beta_merchants.merchant_id = base.merchant_id
  WHERE 
    cn > 1
)
, product_name AS (
  SELECT
    owner_id,
    _id,
    REPLACE( JSON_EXTRACT(title_translations, '$.zh-hant'), '"','') AS zh_hant,
    title_translations
  FROM 
    `shopline-test.looker_prod.products_hourly` AS products
    JOIN beta_merchants 
      ON products.owner_id = beta_merchants.merchant_id
)
SELECT
  handle,
  merchant_id	,
  http_cookie,
  event_name,
  product_id,
  product_name.zh_hant
FROM 
  agg
  JOIN product_name ON product_name.owner_id = agg.merchant_id
    AND product_name._id = agg.product_id


''' ).to_dataframe()

print( 'Query Done : ', datetime.now())
print(raw_data.shape)

raw_data.head(5)

Authenticated
Query Start :  2021-08-13 05:26:06.805237
Query Done :  2021-08-13 05:26:16.875074
(11610, 6)


Unnamed: 0,handle,merchant_id,http_cookie,event_name,product_id,zh_hant
0,popola,5a2773a04926783c0d00030c,188b55f7-06f5-4480-8909-3bd680757fbe,View,604ae25bc985b4002975b7af,【開鍋祭】溫補蔬菜羊肉獨享小鍋(1-2人份)
1,popola,5a2773a04926783c0d00030c,1eb90e3d-acc3-43d7-8f8e-a4d185af4152,View,5f5b12bf45132800485d95af,POPOLA水感淨透卸妝油(150ml)
2,popola,5a2773a04926783c0d00030c,3772753f-62cd-406f-bd66-65f3711a463d,View,5ffbe5fd9173d500358dd629,洋蔥辣椒拌醬
3,popola,5a2773a04926783c0d00030c,3772753f-62cd-406f-bd66-65f3711a463d,View,5dd7d42890137900129d3b52,好營養凍乾系列(氣冷雞肉丁)
4,popola,5a2773a04926783c0d00030c,4eac9a89-4c3d-4c5f-95c5-0eee40c5eed1,View,5b961d675528e2001491186f,七分卡-海鮮餃(24顆/盒)


In [None]:
#@title Create list of list
import pandas as pd
import numpy as np

df = raw_data.copy()

df_list = []
for cookie in df.http_cookie.unique():
    prodcut_id = df[df.http_cookie == cookie]['zh_hant'].values.tolist()
    df_list.append(prodcut_id)

df_list

[['【開鍋祭】溫補蔬菜羊肉獨享小鍋(1-2人份)', '香菇雞腿湯麵', '味噌豚骨叉燒湯麵'],
 ['POPOLA水感淨透卸妝油(150ml)',
  'POPOLA玻尿酸保濕噴霧(100ml)',
  'POPOLA玻尿酸保濕噴霧(100ml)',
  '胺基酸潔面霜(120g)'],
 ['洋蔥辣椒拌醬', '好營養凍乾系列(氣冷雞肉丁)', '【開鍋祭】台東濃郁剝皮辣椒雞獨享小鍋(1-2人份)'],
 ['七分卡-海鮮餃(24顆/盒)',
  '七分卡-塔香雞肉餃(24顆/盒)',
  '七分卡-海鮮餃(24顆/盒)',
  '七分卡-塔香雞肉餃(24顆/盒)',
  '七分卡-塔香雞肉餃(24顆/盒)',
  '七分卡-海鮮餃(24顆/盒)'],
 ['【母湯家族試飲組】圓圓+腫腫+累累各3瓶', '【頭好壯壯補元氣】滴雞精(30入/盒)3盒『送POPOLA訂製泰國碗1入』'],
 ['培根乳酪貝果',
  '夯吉乳酪貝果',
  '職人手作鹽之花',
  '芋泥肉鬆貝果',
  '芋泥肉鬆貝果',
  '芋泥乳酪貝果',
  '芋泥乳酪貝果',
  '芋泥乳酪貝果',
  '芋泥乳酪貝果',
  '芋泥乳酪貝果',
  '芋泥乳酪貝果',
  '栗子乳酪貝果',
  '栗子乳酪貝果',
  '義式乳酪貝果'],
 ['腫腫母湯 (20瓶/箱)', '圓圓母湯 (20瓶/箱)'],
 ['蒲燒鰻', '蒲燒鰻', 'POPOLAの酵', 'POPOLAの晶(10ml/瓶)', '七分卡-經典牛肉湯麵'],
 ['【瑕疵OUT透亮起來】姬透飲3盒',
  '【瑕疵OUT透亮起來】姬透飲3盒',
  '【瑕疵OUT透亮起來】姬透飲3盒',
  '【解封跑跳不NG組】姬透飲+明明美各1盒',
  '老母雞滴雞精禮盒',
  '【頭好壯壯補元氣】滴雞精(30入/盒)3盒『送POPOLA訂製泰國碗1入』',
  '【頭好壯壯補元氣】滴雞精(30入/盒)3盒『送POPOLA訂製泰國碗1入』',
  '【頭好壯壯補元氣】滴雞精(30入/盒)3盒『送POPOLA訂製泰國碗1入』',
  '【頭好壯壯補元氣】滴雞精(30入/盒)3盒『送POPOLA訂製泰國碗1入』',
  '【頭好壯壯補元氣】滴雞精(30入/盒)3盒『送POPOLA訂製泰國碗1入』',
  '【趕走怪物

In [None]:
#@title Association
association_rules = apriori(df_list, 
                            min_support=0.0001, 
                            min_confidence=0.01, 
                            min_lift=2, 
                            max_length=2) 

association_results = list(association_rules)

print(len(association_results))

5700


In [None]:
#@title Result in datafram
df_result = []

for item in association_results:
   pair = item[0] 
   items = [x for x in pair]
   df_result.append([ items[0], items[1] , round(item[1],4) , round(item[2][0][2],4) , round(item[2][0][3],4)])

import pandas as pd
df_validation = pd.DataFrame(df_result, columns=['product', 'recommend', 'Support', 'Confidence', 'Lift']).sort_values(by = ['product','Confidence'] ,ascending = False).reset_index()

size = df_validation['product'].nunique()
print( 'No. of products = ', size)

df_validation.sort_values(by = 'product',ascending = False)


No. of products =  2


Unnamed: 0,index,product,recommend,Support,Confidence,Lift
0,0,Potato,Beef,0.2,1.0,2.5
1,1,Corn,Ice cream,0.2,0.5,2.5
2,2,Corn,Unicorn,0.2,0.5,2.5


In [None]:
#@title check
id = np.random.randint(0, size-1)

item = df_validation['product'].unique()[id]
# item ='寵物滴雞精'
item = '狗狗泌尿道保健主食罐-羊肉鮪魚(3罐1組)'
df_validation[ df_validation['product'] == item ].sort_values(by = ['Support', 'Confidence','Lift'] , ascending = [False, False , False]).head(20)


Unnamed: 0,index,product,recommend,Support,Confidence,Lift
1896,4795,狗狗泌尿道保健主食罐-羊肉鮪魚(3罐1組),狗狗主食罐-鴕鳥火雞肉(3罐1組),0.0014,0.2308,60.5192
1897,4778,狗狗泌尿道保健主食罐-羊肉鮪魚(3罐1組),狗狗主食罐-火雞肉(3罐1組),0.0005,0.125,32.7812
1898,4806,狗狗泌尿道保健主食罐-羊肉鮪魚(3罐1組),狗狗主食罐-鹿肉火雞肉(3罐1組),0.0005,0.125,32.7812
1899,4906,狗狗泌尿道保健主食罐-羊肉鮪魚(3罐1組),狗狗湯罐-鮮菇起司佐鯛魚(3罐1組),0.0005,0.125,32.7812
1900,4907,狗狗泌尿道保健主食罐-羊肉鮪魚(3罐1組),狗狗火雞肉主食包(700g),0.0005,0.125,26.225
1901,4911,狗狗泌尿道保健主食罐-羊肉鮪魚(3罐1組),狗狗鹿肉火雞肉主食包(700g),0.0005,0.125,18.7321
1902,4912,狗狗泌尿道保健主食罐-羊肉鮪魚(3罐1組),褐藻魚子精華飲,0.0005,0.125,8.7417
1903,4913,狗狗泌尿道保健主食罐-羊肉鮪魚(3罐1組),貼貼的女兒紅奶茶,0.0005,0.125,5.0433
1904,4816,狗狗泌尿道保健主食罐-羊肉鮪魚(3罐1組),狗狗心血管保健主食包-雞肉鮭魚(700g),0.0005,0.0714,18.7321
1905,4739,狗狗泌尿道保健主食罐-羊肉鮪魚(3罐1組),犬的腸胃保健益生菌,0.0005,0.0208,5.4635


## popola - Order Items

In [None]:
#@title order_items #4.4 GB
# DATE(order_items.created_at) >= "2021-05-01" 

from google.colab import auth
auth.authenticate_user()
print('Authenticated')

from datetime import datetime
from google.cloud import bigquery

client = bigquery.Client( 'shopline-142003')
print( 'Query Start : ', datetime.now())

raw_data_cart = client.query('''

WITH base AS(
SELECT 
  handle,
  item_owner_id AS merchant_id,
  order_id,
  item_id,
  object_data.title_translations_zh_hant,
  DENSE_RANK() OVER ( ORDER BY item_owner_id, order_id ) AS dense_rk,
  COUNT( DISTINCT item_id ) OVER (PARTITION BY item_owner_id, order_id) AS cn
FROM 
  `shopline-test.looker_prod.order_items_hourly`  AS order_items
  JOIN `shopline-test.looker_prod.merchants_hourly` AS merchants
     ON merchants._id = order_items.item_owner_id
WHERE 
  DATE(order_items.created_at) >= "2021-08-01" 
  AND DATE(order_items.created_at) <= "2021-08-10" 
  AND item_type = 'Product'
  AND merchants.email NOT LIKE '%@shoplineapp%'
  AND current_plan_key != 'locked'
  AND handle IN ('popola' )

)
SELECT
  *
FROM 
  base
WHERE 
  cn >= 2

''' ).to_dataframe()

print( 'Query Done : ', datetime.now())
print(raw_data_cart.shape)

raw_data_cart.head(5)

Authenticated
Query Start :  2021-08-13 05:28:50.137734
Query Done :  2021-08-13 05:28:54.082981
(1916, 7)


Unnamed: 0,handle,merchant_id,order_id,item_id,title_translations_zh_hant,dense_rk,cn
0,popola,5a2773a04926783c0d00030c,6107b64685a32400268ca018,60fe84d73cd95d00265217e5,【POPOLA xE.V.O菲媽是我】職人手作貝果/鹽之花,58,2
1,popola,5a2773a04926783c0d00030c,6107b64685a32400268ca018,60fe84d73cd95d00265217e5,【POPOLA xE.V.O菲媽是我】職人手作貝果/鹽之花,58,2
2,popola,5a2773a04926783c0d00030c,6107b64685a32400268ca018,60fe82998ed24e003b73f65f,【POPOLA xE.V.O菲媽是我】獨享小鍋/經典湯麵,58,2
3,popola,5a2773a04926783c0d00030c,6107b64685a32400268ca018,60fe84d73cd95d00265217e5,【POPOLA xE.V.O菲媽是我】職人手作貝果/鹽之花,58,2
4,popola,5a2773a04926783c0d00030c,6108d6aeb70c690011ad6f6f,60fe84d73cd95d00265217e5,【POPOLA xE.V.O菲媽是我】職人手作貝果/鹽之花,112,2


In [None]:
#@title Create list of list
import pandas as pd
import numpy as np

df_cart = raw_data_cart.copy()

df_cart_list = []
for order in df_cart['order_id'].unique():
    prodcut_id = df_cart[df_cart['order_id'] == order]['title_translations_zh_hant'].values.tolist()
    df_cart_list.append(prodcut_id)

df_cart_list

[['【POPOLA xE.V.O菲媽是我】職人手作貝果/鹽之花',
  '【POPOLA xE.V.O菲媽是我】職人手作貝果/鹽之花',
  '【POPOLA xE.V.O菲媽是我】獨享小鍋/經典湯麵',
  '【POPOLA xE.V.O菲媽是我】職人手作貝果/鹽之花'],
 ['【POPOLA xE.V.O菲媽是我】職人手作貝果/鹽之花',
  '【POPOLA xE.V.O菲媽是我】職人手作貝果/鹽之花',
  '【POPOLA xE.V.O菲媽是我】獨享小鍋/經典湯麵',
  '【POPOLA xE.V.O菲媽是我】獨享小鍋/經典湯麵',
  '【POPOLA xE.V.O菲媽是我】獨享小鍋/經典湯麵',
  '【POPOLA xE.V.O菲媽是我】獨享小鍋/經典湯麵',
  '【POPOLA xE.V.O菲媽是我】獨享小鍋/經典湯麵',
  '【POPOLA xE.V.O菲媽是我】獨享小鍋/經典湯麵',
  '【POPOLA xE.V.O菲媽是我】獨享小鍋/經典湯麵',
  '【POPOLA xE.V.O菲媽是我】獨享小鍋/經典湯麵',
  '【POPOLA xE.V.O菲媽是我】獨享小鍋/經典湯麵',
  '【POPOLA xE.V.O菲媽是我】職人手作貝果/鹽之花',
  '【POPOLA xE.V.O菲媽是我】獨享小鍋/經典湯麵'],
 ['【美肌up新客必入組】B5面膜(5片/盒)+美顏器各1入→結帳再折580，每帳號限折一次',
  '【彈滑肌密入門組】藻針霜+胜肽精華各1罐',
  '【補好補滿水水多入組】玻尿酸保濕噴霧3入組/6入組'],
 ['焦糖可可貝果',
  '泰式酸辣雞肉湯麵',
  '酸菜白肉湯麵',
  '野生小藍莓貝果',
  '味噌豚骨叉燒湯麵',
  '義式乳酪貝果',
  '栗子乳酪貝果',
  '【NEW】雜炊湯粥-芋頭油蔥',
  '芋泥肉鬆貝果'],
 ['POPOLA玻尿酸保濕噴霧(100ml)', '牛肉/羊肉/雞肉棒棒糖(20g/支)', 'POPOLAの晶(10ml/瓶)'],
 ['捲捲小魚雞肉乾(50g/份)', '寵物口服玻尿酸(8瓶裝)'],
 ['POPOLAの酵', '凝結豆腐松木貓砂(櫻花味/紫鈴蘭味)-6包一箱(7L/包)'],
 ['POPOLAの晶(10ml/瓶)', '賊寶的果香阿薩

In [None]:
# Reference https://www.maxlist.xyz/2018/11/03/python_apriori/

from apyori import apriori

association_cart = apriori(df_cart_list, min_support=0.0001, min_confidence=0.01, min_lift=2, max_length=2) 

ar_result_cart = list(association_cart)

print(len(ar_result_cart))


1014


In [None]:
df_result = []
for item in ar_result_cart:
   pair = item[0] 
   items = [x for x in pair]
   df_result.append([ items[0], items[1] , round(item[1],4) , round(item[2][0][2],4) , round(item[2][0][3],4)])

import pandas as pd
df_cart_validation = pd.DataFrame(df_result, columns=['product', 'recommend', 'Support', 'Confidence', 'Lift']).sort_values(by = ['product','Confidence'] ,ascending = False).reset_index()

size_cart = df_cart_validation['product'].nunique()
print( 'No. of products = ', size_cart)

df_cart_validation


No. of products =  144


Unnamed: 0,index,product,recommend,Support,Confidence,Lift
0,134,鮮蝦豬肉燒賣(12顆/盒),【NEW】阿根廷魷魚(200g),0.0024,1.0000,206.5000
1,931,鮮蝦豬肉燒賣(12顆/盒),老班愛的油雞腿,0.0024,0.5000,103.2500
2,429,鮮蝦豬肉燒賣(12顆/盒),【開鍋祭】芋頭貢丸（4顆/包）,0.0024,0.2500,51.6250
3,967,鮮蝦豬肉燒賣(12顆/盒),芋見極致貓咪生吐司,0.0024,0.2500,51.6250
4,664,鮮蝦豬肉燒賣(12顆/盒),培根乳酪貝果,0.0024,0.1667,34.4167
...,...,...,...,...,...,...
1009,16,POPOLA 100%PURE純離胺酸(30粒裝),貓貓湯罐-牛三寶番茄(3罐1組),0.0024,0.5000,206.5000
1010,17,POPOLA 100%PURE純離胺酸(30粒裝),貓貓湯罐-白醬松子雞湯(3罐1組),0.0024,0.5000,206.5000
1011,9,HEALTHPIT目元美顏器,POPOLAの晶(10ml/瓶),0.0024,1.0000,20.6500
1012,10,HEALTHPIT目元美顏器,【水嫩美顏upup】B5面膜+美顏器+保濕噴霧各1入,0.0024,1.0000,82.6000


In [None]:
#@title check

# id_cart = np.random.randint(0, size_cart-1)
# 
# item = df_validation['product'].unique()[id_cart]
item = '狗狗泌尿道保健主食罐-羊肉鮪魚(3罐1組)'

df_cart_validation[ df_cart_validation['product'] == item ].sort_values(by = ['Support', 'Confidence','Lift'] , ascending = [False, False , False]).head(20)

Unnamed: 0,index,product,recommend,Support,Confidence,Lift
355,877,狗狗泌尿道保健主食罐-羊肉鮪魚(3罐1組),狗狗主食罐-鴕鳥火雞肉(3罐1組),0.0048,0.2857,29.5
357,897,狗狗泌尿道保健主食罐-羊肉鮪魚(3罐1組),狗狗湯罐-鮮菇起司佐鯛魚(3罐1組),0.0024,0.25,103.25
356,895,狗狗泌尿道保健主食罐-羊肉鮪魚(3罐1組),狗狗湯罐-三寶海鮮(3罐1組),0.0024,0.25,34.4167
358,899,狗狗泌尿道保健主食罐-羊肉鮪魚(3罐1組),黑豆明明美精華飲(30罐/盒),0.0024,0.25,25.8125
359,883,狗狗泌尿道保健主食罐-羊肉鮪魚(3罐1組),狗狗主食罐-鹿肉火雞肉(3罐1組),0.0024,0.2,20.65
360,860,狗狗泌尿道保健主食罐-羊肉鮪魚(3罐1組),犬的腸胃保健益生菌,0.0024,0.0435,4.4891


## SHOPLINE Recommender

### Project


<img src="https://drive.google.com/uc?id=1OSUKXoCMVINRYc_xgtkikRFz5TyzulZw" alt="drawing" width="600"/>

### Limitation
- Only 4 items can be recommended on storefront
</br>
</br>

### OKR
- Improve **CTR** of Related Product
</br>
</br>
storefront conversion Funnel (Date : 2021/05/01 - 2021/05/31)
</br>

|event_name|All products||Category Page||Related Product||Search||Other||Total|
|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|
|impression | 8,264,119 (4.64%) || 72,178,911 (40.54%)|| 91,957,166 (51.65%)||5,627,706 (3.16%)||1,883||178,029,785|
|click |30.37% /2,510,032||33.38% / 24,092,471||**1.94%** / 1,787,157||38.12% / 2,145,256||10,330,001||22.95% / 40,864,917|

</br>
</br>

### Main Objective
- 幫助 shoppers 快速找到可比較規格的商品
</br>
</br>

### Logic
- 用 **Association Rule** 找出在同一個 session 最常被一起看/比較的商品 (based on users who have similar preference)
- Cold start : Based on product name & product category that most closely match the the current product


### Performance Review

In [None]:
#@title Load Data

from google.colab import auth
auth.authenticate_user()
print('Authenticated')

from datetime import datetime
from google.cloud import bigquery

client = bigquery.Client( 'shopline-142003')
print( 'Query Start : ', datetime.now())


raw_data = client.query('''

DECLARE StartDate DATE DEFAULT '2021-07-01';
DECLARE EndDate DATE DEFAULT '2021-08-01';

WITH existing_merchants AS (

SELECT
  DISTINCT merchants._id AS merchant_id,
  handle
FROM 
  `shopline-test.looker_prod.merchants_hourly` AS merchants
  JOIN `shopline-test.looker_prod.merchant_features_hourly` AS merchant_features
    ON merchant_features.owner_id = merchants._id
WHERE 
  merchants.email NOT LIKE '%@shoplineapp%'
  AND feature_key = 'smart_recommended_related_products'
  AND merchant_features.status = 'active'

)
, events_base AS (

SELECT
  handle,
  events_hourly.merchant_id,
  http_cookie,
  DATE(created_at) AS x_date,
  event_name,
  product_id,
  data
FROM
  `shopline-142003.datawarehouse.events_hourly` AS events_hourly
  JOIN existing_merchants ON events_hourly.merchant_id = existing_merchants.merchant_id
WHERE
  DATE(created_at) >= StartDate
  AND DATE(created_at) <= EndDate 
  AND event_name IN ('View', 'ProductClick', 'RecommendItem')

)
, agg AS (

SELECT
  handle,
  merchant_id,
  http_cookie,
  x_date,
  SUM( IF(event_name = 'View' AND product_id IS NOT NULL, 1 , 0) ) AS product_page_views,
  SUM( IF(event_name = 'ProductClick' AND REPLACE(JSON_EXTRACT( data, '$.page_type') , '"', '') = 'product', 1 , 0) ) AS product_page_clicks,
FROM 
  events_base
GROUP BY 
  handle,
  merchant_id,
  http_cookie,
  x_date
)
SELECT
  handle,
  merchant_id,
  x_date,
  COUNT( DISTINCT IF( product_page_views > 0, http_cookie, NULL ) ) AS no_viewed_cookies,
  COUNT( DISTINCT IF( product_page_clicks > 0, http_cookie, NULL ) ) AS no_click_cookies,
  ROUND( COUNT( DISTINCT IF( product_page_clicks > 0, http_cookie, NULL ) ) / COUNT( DISTINCT http_cookie ) *100 ,2 ) AS clicks_percentage
FROM 
  agg
WHERE 
  product_page_views > 0
GROUP BY 
  handle,
  merchant_id,
  x_date
ORDER BY 
  handle,
  merchant_id,
  x_date
    
''' ).to_dataframe()

print( 'Query Done : ', datetime.now())
print(raw_data.shape)

raw_data.head(5)

Authenticated
Query Start :  2021-08-13 05:50:47.872987
Query Done :  2021-08-13 05:51:05.781823
(18537, 6)


Unnamed: 0,handle,merchant_id,x_date,no_viewed_cookies,no_click_cookies,clicks_percentage
0,0306brian416,5b6f18197f30a4001796de4f,2021-07-01,335,8,2.39
1,0306brian416,5b6f18197f30a4001796de4f,2021-07-02,406,3,0.74
2,0306brian416,5b6f18197f30a4001796de4f,2021-07-03,450,4,0.89
3,0306brian416,5b6f18197f30a4001796de4f,2021-07-04,465,9,1.94
4,0306brian416,5b6f18197f30a4001796de4f,2021-07-05,467,5,1.07


In [None]:
#@title *CTR by handle

import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

df = raw_data.copy()

handle_name = 'scheminggg341'

print(handle_name)

df = df[df.handle == handle_name]

# Create figure with secondary y-axis
fig = make_subplots(specs=[[{"secondary_y": True}]])


# Add traces
fig.add_trace(go.Bar(
    name="No. of cookies",
    x=df["x_date"], y=df["no_viewed_cookies"],
    ),
    secondary_y=True)

fig.update_traces(marker_color='rgb(158,202,225)', 
                #   marker_line_color='rgb(8,48,107)',
                  marker_line_width=1.5, opacity=0.6)


fig.add_trace(
    go.Scatter(x= df.x_date, 
               y= df['clicks_percentage'], 
               name= 'CTR', 
               line=dict(color='salmon')),
    secondary_y=False,
)


# Add shapes
fig.add_shape(type="line",
    x0='2021-07-21', y0 = df['clicks_percentage'].min(), x1='2021-07-21', y1=df['clicks_percentage'].max(),
    line=dict(color="lightseagreen",width=3, dash="dot")
)

fig.add_shape(
    type="rect",
    x0='2021-07-25',
    x1='2021-07-31',
    y0=df['clicks_percentage'].min(),
    y1=df['clicks_percentage'].max(),
    # y1=df.cart_ratio.max(),
    opacity=0.5,
    layer="below", 
    fillcolor="pink",
    line=dict(
        color="pink",
        width=1,
    ),
)

fig.add_shape(
    type="rect",
    x0='2021-07-11',
    x1='2021-07-17',
    y0=df['clicks_percentage'].min(),
    y1=df['clicks_percentage'].max(),
    # y1=df.cart_ratio.max(),
    opacity=0.5,
    layer="below", 
    fillcolor="pink",
    line=dict(
        color="pink",
        width=1,
    ),
)

## Layout
fig.update_layout(
    width = 800,
    height = 350,
    title_text = handle_name,
    # title_text= "CTR% by time, monthly clicks = "+ str(df.total_click.unique().astype(int).sum()),
    yaxis=dict(tickformat=".2f")
    )

fig.update_layout(legend=dict(
    yanchor="top",
    # y= df['CTR%'].max(),
    y=1.25,
    xanchor="left",
    x= 0
))


# # Set y-axes titles
fig.update_yaxes(title_text="CTR", secondary_y=False)
fig.update_yaxes(title_text="No. of Cookies", 
                 secondary_y=True,
                 )

fig.show()

scheminggg341


In [None]:
#@title Tag Before & After

import pandas as pd

df = raw_data.copy()

df['x_date'] = pd.to_datetime(df['x_date'], format='%Y-%m-%d')

df = df[ ((df.x_date >= '2021-07-25') & (df.x_date <= '2021-07-31')) | ((df.x_date >= '2021-07-11') & (df.x_date <= '2021-07-17 ')) ] 

result = []

for i in range(df.shape[0]):
    tx_date = df.iloc[i, df.columns.get_loc("x_date") ]
    if tx_date >= datetime.strptime( '2021-07-21', "%Y-%m-%d") :
        tag = 'after'
    else:
        tag = 'before'

    result.append(tag)

df['tag'] = result
df

Unnamed: 0,handle,merchant_id,x_date,no_viewed_cookies,no_click_cookies,clicks_percentage,tag
10,0306brian416,5b6f18197f30a4001796de4f,2021-07-11,444,11,2.48,before
11,0306brian416,5b6f18197f30a4001796de4f,2021-07-12,533,6,1.13,before
12,0306brian416,5b6f18197f30a4001796de4f,2021-07-13,471,13,2.76,before
13,0306brian416,5b6f18197f30a4001796de4f,2021-07-14,490,6,1.22,before
14,0306brian416,5b6f18197f30a4001796de4f,2021-07-15,479,7,1.46,before
...,...,...,...,...,...,...,...
18499,zxc7024255264,5d18fee073de5900010a59a8,2021-07-27,41,0,0.00,after
18500,zxc7024255264,5d18fee073de5900010a59a8,2021-07-28,34,0,0.00,after
18501,zxc7024255264,5d18fee073de5900010a59a8,2021-07-29,51,3,5.88,after
18502,zxc7024255264,5d18fee073de5900010a59a8,2021-07-30,130,11,8.46,after


### Z-test for Conversion Rate

Reference 
- https://towardsdatascience.com/the-art-of-a-b-testing-5a10c9bb70a4


<img src="https://drive.google.com/uc?id=1qFsd86GWE4p15yPCfOysFXKHzBmrs7xn" alt="drawing" width="600"/>

</br>

The Z-test could be adapted to conversion rate by modelling conversion as an random value which realisations are in Bernoulli:
- 1 for a conversion
- 0 else

#### The hypothesis to test are:
- H₀: “the conversion rate is the same for the two versions”
- H₁: “the conversion rate is higher for version B”

#### The first step is to model H₀
Under H₀, μ(A) = μ(B) and we have

<img src="https://miro.medium.com/max/4800/1*CuGs8iRqG0Ufa1LhiaHHyw.jpeg" alt="drawing" width="600"/>

</br>

The corresponding test statistic

<img src="https://miro.medium.com/max/1400/1*FCAkTCjZtmuADgbSNwYudA.jpeg" alt="drawing" width="600"/>

</br>

This time, with binary random values, it can be shown that the estimators for the standard deviations are functions of the expectations:

</br>

<img src="https://miro.medium.com/max/4800/1*XN0imuj5hHFnb_6riQNQ1w.jpeg" alt="drawing" width="600"/>

</br>
</br>

#### The second step is to see how likely our samples are under H₀
To this end, we compute the Z-score and the corresponding right-tailed p-value:

```python
import numpy as np
from scipy.stats import norm

mu_B = 0.02
mu_A = 0.015

var_B = mu_B * (1-mu_B)
var_A = mu_A * (1-mu_A)

n_B = 4000
n_A = 6000

Z = (mu_B - mu_A)/np.sqrt(var_B/n_B + var_A/n_A)
pvalue = norm.sf(Z)

print("Z-score: {0}\np-value: {1}".format(Z,pvalue))
```
```
Z-score: 1.8427115179918694
p-value: 0.03268557071858785
```
https://abtestguide.com/calc/?ua=6000&ub=4000&ca=90&cb=80

</br>

With the α=0.05 criterion (p value < 0.05), **we would have rejected the null hypothesis**.
This difference may be explained by a slight weakness of the Z-test, which does not acknowledge here the binary nature of the random value: μ(B)-μ(A) is actually bounded in [-1,1] and the observation is therefore attributed a lower p-value.


p-value、顯著水準、Type I error, Type 2 error
- https://blog.xuite.net/metafun/life/82541806-p-value%E3%80%81%E9%A1%AF%E8%91%97%E6%B0%B4%E6%BA%96%E3%80%81Type+I+error%2C+Type+2+error

In [None]:
#@title CTR Before & After

df_groupy = df.groupby(by = ['handle', 'tag']).agg({'no_viewed_cookies': 'sum', 'no_click_cookies':'sum' }).reset_index()
df_groupy['CTR'] = df_groupy['no_click_cookies'] / df_groupy['no_viewed_cookies']
df_groupy

Unnamed: 0,handle,tag,no_viewed_cookies,no_click_cookies,CTR
0,0306brian416,after,2943,88,0.029901
1,0306brian416,before,3369,65,0.019294
2,1010apothecarytw64,after,32242,1682,0.052168
3,1010apothecarytw64,before,33680,1105,0.032809
4,13bstore,after,2039,73,0.035802
...,...,...,...,...,...
1164,zita367,before,15511,343,0.022113
1165,zombieshop,after,1171,34,0.029035
1166,zombieshop,before,1412,35,0.024788
1167,zxc7024255264,after,537,34,0.063315


In [None]:
#@title invalid merchants

df_exclude = df.groupby(by = ['handle']).agg({'tag': 'nunique', 'no_viewed_cookies':'min' , 'no_click_cookies':'sum'}).reset_index()
df_exclude['CTR'] = df_exclude.no_click_cookies / df_exclude.no_viewed_cookies

excldue_list = df_exclude[ (df_exclude.tag == 1) | (df_exclude.no_click_cookies == 0) | (df_exclude.no_viewed_cookies == 0) ]['handle'].unique()
excldue_list


array(['979857120229', 'amica1391', 'babydust', 'bfflifestyle',
       'centuryfashiontp575', 'chickbaby', 'deltaonehk', 'eunice',
       'happiness6888393', 'imestyle', 'jenniferhuang17', 'juicestore',
       'lzs15', 'mionfashion', 'moremallhk', 'overthewall', 'pd167',
       'qiaobangzhu', 'scheminggg341', 'sidefame'], dtype=object)

In [None]:
#@title Hypothesis Test
# Reference # https://towardsdatascience.com/the-art-of-a-b-testing-5a10c9bb70a4

import numpy as np
from scipy.stats import norm

handles = df_groupy.handle.unique()

handles = [x for x in handles if x not in excldue_list]

test_result = []
handle_name = handles[1]

for i in range(len(handles)):
    handle_name = handles[i]
    # print( handle_name )

    df = df_groupy[df_groupy.handle == handle_name].reset_index()
    mu_B = df.iloc[0, df.columns.get_loc("CTR") ]  # 0 after
    mu_A = df.iloc[1, df.columns.get_loc("CTR") ]  # 1 before

    var_B = mu_B * (1-mu_B)
    var_A = mu_A * (1-mu_A)

    n_B = df.iloc[0, df.columns.get_loc("no_viewed_cookies") ] 
    n_A = df.iloc[1, df.columns.get_loc("no_viewed_cookies") ] 

    Z = (mu_B - mu_A)/np.sqrt(var_B/n_B + var_A/n_A)
    pvalue = norm.sf(Z)

    test_result.append( [handle_name, mu_B, mu_A,  n_B, n_A, Z, pvalue ])


In [None]:
#@title Test Result

import pandas as pd

df_result_cookie = pd.DataFrame( test_result ,columns=['handle', 'ctr_after', 'ctr_before', 'size_after', 'size_before','Z','pvalue' ])

result = []

for i in range(df_result_cookie.shape[0]):
    pvalue = df_result_cookie.iloc[i, df_result_cookie.columns.get_loc("pvalue") ]
    if pvalue > 0.05 :
        tag = 'accept'
    else:
        tag = 'reject'

    result.append(tag)

df_result_cookie['tag'] = result


print( df_result_cookie[( df_result_cookie.ctr_before > 0)].shape)
print( df_result_cookie[ ( df_result_cookie.Z > 1.645) & ( df_result_cookie.ctr_before > 0)].shape)

df_result_cookie[( df_result_cookie.ctr_before > 0)].sort_values(by = 'Z')

(567, 8)
(254, 8)


Unnamed: 0,handle,ctr_after,ctr_before,size_after,size_before,Z,pvalue,tag
268,jojochris765,0.020158,0.048318,2282,3270,-5.908220,1.000000e+00,accept
529,vike1006188,0.005255,0.016362,11227,3117,-4.681340,9.999986e-01,accept
281,kickstage2007255,0.014075,0.019541,24511,20930,-4.490657,9.999964e-01,accept
272,jplivingstyle,0.014863,0.055556,2153,684,-4.452929,9.999958e-01,accept
514,twinko,0.026786,0.034926,18629,14688,-4.235249,9.999886e-01,accept
...,...,...,...,...,...,...,...,...
236,huangnana368156,0.101425,0.042636,25891,11047,21.886613,1.742304e-106,reject
282,kiiwio,0.087223,0.046377,37983,40731,22.898338,2.413379e-116,reject
418,ronin,0.077983,0.027937,21620,19150,22.977622,3.902229e-117,reject
312,lifewedo,0.036283,0.017436,101646,78058,25.107609,2.053672e-139,reject


In [None]:
print('size', df_result_cookie.shape )

print('Accept (same):', df_result_cookie[ (df_result_cookie.tag == 'accept') ].shape )
print('Reject (Better):', df_result_cookie[ (df_result_cookie.tag == 'reject') ].shape )


size (569, 9)
Accept (same): (314, 9)
Reject (Better): (255, 9)


In [None]:
#@title Validation List
df_result_cookie['diff'] =  df_result_cookie.ctr_after - df_result_cookie.ctr_before

print( 'FN', df_result_cookie[ (df_result_cookie.tag == 'accept') & (df_result_cookie['diff'] > 0)].shape , 'type II error (b) ?')
print( 'TN', df_result_cookie[ (df_result_cookie.tag == 'accept') & (df_result_cookie['diff'] <= 0)].shape)
print( 'TP', df_result_cookie[ (df_result_cookie.tag == 'reject') & (df_result_cookie['diff'] > 0)].shape)
print( 'FP', df_result_cookie[ (df_result_cookie.tag == 'reject') & (df_result_cookie['diff'] <= 0)].shape, 'type I error (a)?')

df_result_cookie[ (df_result_cookie.tag == 'accept') & (df_result_cookie['diff'] > 0)].sort_values(by = 'pvalue')

FN (165, 9) type II error (b) ?
TN (149, 9)
TP (255, 9)
FP (0, 9) type I error (a)?


Unnamed: 0,handle,ctr_after,ctr_before,size_after,size_before,Z,pvalue,tag,diff
554,xwysiblings,0.038522,0.026801,1272,1194,1.641920,0.050303,accept,0.011721
100,birdyhousehk,0.088832,0.058524,394,393,1.630243,0.051525,accept,0.030308
118,catslave,0.023952,0.013253,835,830,1.617477,0.052888,accept,0.010699
381,owenchang744,0.009676,0.005795,2687,2416,1.591153,0.055788,accept,0.003882
166,douuob,0.034665,0.029090,5308,4572,1.577866,0.057298,accept,0.005575
...,...,...,...,...,...,...,...,...,...
221,hhoversea,0.009618,0.009563,13204,12026,0.045329,0.481922,accept,0.000056
186,forestoutdoortw597,0.042020,0.041780,2713,1843,0.039725,0.484156,accept,0.000240
205,gonoww1,0.009554,0.009368,314,427,0.025890,0.489673,accept,0.000186
345,mrsblue,0.021341,0.021164,984,756,0.025446,0.489850,accept,0.000177


In [None]:
#@title Poor List
df_result_cookie['diff'] =  df_result_cookie.ctr_after - df_result_cookie.ctr_before

print( df_result_cookie[ (df_result_cookie.tag == 'accept') & (df_result_cookie['diff'] < 0)].shape)
df_result_cookie[ (df_result_cookie.tag == 'accept') & (df_result_cookie['diff'] < 0)].sort_values(by = 'pvalue')

(149, 9)


Unnamed: 0,handle,ctr_after,ctr_before,size_after,size_before,Z,pvalue,tag,diff
51,anyshoptw197,0.012170,0.012182,493,903,-0.001829,0.500730,accept,-0.000011
263,jeyubin,0.022863,0.022899,8223,8472,-0.015665,0.506249,accept,-0.000036
162,doinggood88578,0.038062,0.038202,1734,1780,-0.021660,0.508640,accept,-0.000140
372,olivmewstudio,0.034863,0.034920,12334,11283,-0.023764,0.509480,accept,-0.000057
64,asiayogies,0.029197,0.029605,274,304,-0.029012,0.511572,accept,-0.000408
...,...,...,...,...,...,...,...,...,...
514,twinko,0.026786,0.034926,18629,14688,-4.235249,0.999989,accept,-0.008140
272,jplivingstyle,0.014863,0.055556,2153,684,-4.452929,0.999996,accept,-0.040693
281,kickstage2007255,0.014075,0.019541,24511,20930,-4.490657,0.999996,accept,-0.005466
529,vike1006188,0.005255,0.016362,11227,3117,-4.681340,0.999999,accept,-0.011107


#============

# 2.Content-Based (TBC)

[第 11 屆 iT 邦幫忙鐵人賽 / 初探推薦系統(Recommendation System)](https://ithelp.ithome.com.tw/articles/10219033)

- 依據一件瀏覽或已購買的商品，推薦屬性相似的商品
- 結合使用者評價與商品屬性，推薦使用者偏好的商品

#============

# 3.Collaborative Filtering

協同過濾 (Collaborative Filtering) ：當已知 Shopper 過去的消費經驗，利用與該 Shopper 類似經驗之群體的所顯示的偏好，來預測該 Shopper「未知」的偏好資訊。

==> 適用「已登入」的「舊客」

<img src="https://miro.medium.com/max/1400/1*QvhetbRjCr1vryTch_2HZQ.jpeg" alt="drawing" width="1000"/>


優點：

1. 不需進行內容分析。
1. 能夠對複雜的、難以表述的概念（如資訊品質、個人品味）進行推薦
1. 有推薦新資訊的能力，可以發現使用者潛在的的偏好
1. 能做個人化推薦
1. 自動化程度高

缺點：
1. Cold start 問題。新使用者及新商品剛出現時，CF系統的推薦品質較差
1. 稀疏性問題（Sparsity）
1. 系統延伸性問題（Scalability）：新加User或者Item時，系統需要增加計算負荷量大

### Sparsity (Long Tail Plot)

This plot is used to explore popularity patterns in user-item interaction data such as clicks, ratings, or purchases. Typically, only a small percentage of items have a high volume of interactions, and this is referred to as the “head”. Most items are in the “long tail”, but they only make up a small percentage of interactions.

[Evaluation Metrics for Recommender Systems](https://towardsdatascience.com/evaluation-metrics-for-recommender-systems-df56c6611093)

<img src="https://miro.medium.com/max/1400/1*SIPN2FOu440_bhDvAkWdow.png" alt="drawing" width="800"/>



## Simple Example

假設有 7 部 Netflix 影片、4 個觀眾評論，要猜 Chelsey 會不會喜歡「社內相親」？

Reference : [Recommendation Systems](https://medium.com/x8-the-ai-community/recommendation-system-db51c868f13d)


|No| Video / User | Alan | ChingChing | Ian | Chelsey |
|:--|:----------:|:------:|:------------:|:-----:|:---------:|
|1|AV帝王   |      | v          | v   |     v    |
|2|華燈初上 | v    |            | v   |         |
|3|鬼滅之刃 | v    |            | v  |     v    |
|4|少年法庭 |     |            |    |     v    |
|5|矽谷群瞎轉 |     |   v         |    |         |
|6|Tinder 大騙徒 | v    | v          |   v  |       |
|7|社內相親 | v    | v          |     |    ??     |

解法很簡單：

<img src="https://miro.medium.com/max/780/1*zOvNQxXHFB_cPTS85svDDQ.png" alt="drawing" width="500"/>



1. Chelsey 與 Alan 相同意見有 1 個，即 1/6 = 1/6
1. Chelsey 與 ChingChing 相同意見有 1 個，即 1/6 = 1/6 
1. Chelsey 與 Ian 相同意見有 2 個，即 2/6 = 2/6</br>

Chelsey 與 Ian 看片的品味比較像，故猜 Chelsey 不會喜歡 「社內相親」


## User Based CF


<img src="http://lh3.ggpht.com/_D--7P2OZHMo/S78IvzIQ6RI/AAAAAAAAFjc/wdEeq-MC3Qc/image_thumb60.png?imgmax=800" alt="drawing" width="1000"/>

計算 user u 對於 item i 的偏好

1. 要考量所有行為像是 u 的其他 user v ( ex. 共同 rating 過 item i)

1. 再除上所有相似度的加總，意義是將算出來的結果 mapping 到 user  u 平常的 rating 區間

## Cosine similarities

Cosine similarities 通過測量兩個向量的夾角的餘弦值 ( cosine of the angle between them) 來度量它們之間的相似性


<img src="https://media-exp1.licdn.com/dms/image/C4E12AQF9Q-nIc3bkhQ/article-inline_image-shrink_1500_2232/0/1535343574742?e=2147483647&v=beta&t=KVRA1-gjCJQwiFAARiHUqEsZ6C8QxFOgVz5lYizoJlE" alt="drawing" width="500"/>

公式可由基本的三角函數運算推導成內積表達形式，數學推導如下

<img src="https://miro.medium.com/max/1400/1*jesx2NJwzF0_dPHCWeJmOw.png" alt="drawing" width="800"/>



|No| Video / User | Alan | ChingChing | Ian | Chelsey |
|:--|:----------:|:------:|:------------:|:-----:|:---------:|
|1. |AV帝王       |      |           | 1   |     1    |
|2. |華燈初上      |     |            | 2   |     2    |
|3. |鬼滅之刃      |     |            | 2  |     2    |
|4. |少年法庭      |     |            |  1  |     1    |
|5. |矽谷群瞎轉    |     |            |  1  |     1    |
|6. |Tinder大騙徒 |     |           |   1  |    2   |
|7. |社內相親     |     |           |   0  |    1     |



使用這個公式，我們就可以得到，Ian 與 Chelsey 的 Cosine similarities

<img src="http://www.ruanyifeng.com/blogimg/asset/201303/bg2013032008.png" alt="drawing" width="800"/>

夾角大約為20.3度

 


<img src="https://www.doka.ch/3dscatterplotrotateanim.gif" alt="drawing" width="500"/>

## 綠藤生機

Smart Segmentation 上線了，分群效果很好，準備 EDM 針對特定對象客製化推薦商品
- 鎖定有回購紀錄的老客人

- python reference : https://medium.com/sfu-cspmp/recommendation-systems-user-based-collaborative-filtering-using-n-nearest-neighbors-bf7361dc24e0

In [1]:
#@title Load Data

from google.colab import auth
auth.authenticate_user()
print('Authenticated')

from datetime import datetime
from google.cloud import bigquery

client = bigquery.Client( 'shopline-142003')

print( 'Query Start : ', datetime.now())

raw_data = client.query('''

WITH orders_hourly AS (

    SELECT 
        customer_id,
        _id AS order_id,
        total_cents_in_usd,  
        created_at,
    FROM 
        `shopline-test.looker_prod.orders_shopline_unremoved_hourly` 
    WHERE 
        seller_id = '5ceb666ffea1260001377cdb'
        AND DATE(created_at) >= '2021-01-01'
        AND DATE(created_at) < '2021-07-01'
)
, order_items AS (

  SELECT 
    item_owner_id AS merchant_id,
    order_id,
    item_id,
    object_data.title_translations_zh_hant AS product_name,
    quantity,
  FROM 
    `shopline-test.looker_prod.order_items_hourly`
  WHERE 
    item_owner_id = '5ceb666ffea1260001377cdb'
    AND item_type IN ('Product')
    AND status != 'removed'
    AND DATE(created_at) >= '2021-01-01'
    AND DATE(created_at) < '2021-07-01'
),
base AS (
SELECT 
    order_items.merchant_id ,
    orders_hourly.customer_id,
    order_items.item_id,
    order_items.product_name ,
    DATE( orders_hourly.created_at, "Asia/Taipei") AS created_date,
    SUM( order_items.quantity ) AS quantity
FROM 
    order_items 
LEFT JOIN 
    orders_hourly 
ON 
    orders_hourly.order_id = order_items.order_id
GROUP BY 
    customer_id,
    item_id,
    merchant_id,
    product_name,
    created_date
)
SELECT
    *EXCEPT(product_name),
    MAX(product_name) OVER(PARTITION BY item_id ORDER BY created_date) AS product_name,
    COUNT(DISTINCT created_date) OVER(PARTITION BY customer_id ) AS order_count
FROM
    base

''' ).to_dataframe()

print( 'Query Done : ', datetime.now())
print(raw_data.shape)

raw_data.sample(5)


Authenticated
Query Start :  2022-04-22 00:38:34.411803
Query Done :  2022-04-22 00:39:09.182676
(166040, 7)


Unnamed: 0,merchant_id,customer_id,item_id,created_date,quantity,product_name,order_count
7722,5ceb666ffea1260001377cdb,5ff98d3b78a273001281575f,5d5e614c622a5609855e9525,2021-04-06,1,【最新補貨到】奇蹟辣木潤髮乳,5
156806,5ceb666ffea1260001377cdb,5f37b780edce900025761619,607073bcc287690017232398,2021-04-24,1,R試用包-極境雙藻復原精華1ml,4
138783,5ceb666ffea1260001377cdb,5d6970c51038081466699433,605209a4b1f829001a812dc4,2021-04-04,1,R-母親節會員專屬自選禮,2
11225,5ceb666ffea1260001377cdb,5eecabe568dbd10001133870,5de099cf631d050033472adf,2021-03-30,1,酵一個,33
83852,5ceb666ffea1260001377cdb,5f0c47a22aa7c200010890a3,5e6b8623622a565cb8c0059e,2021-04-29,1,R-實驗者,1


In [2]:
#@title create a USER-ITEM matix
import pandas as pd
import numpy as np

df = raw_data.copy()

# 只計算有回購紀錄的客人
df = df[ df.order_count >= 2]

# 轉換成 customer vs item 的 vector
table = pd.pivot_table(df, 
                       columns = ['item_id'], 
                       index = ['customer_id'], 
                       values= ['quantity'],
                       aggfunc = np.sum, 
                       fill_value=0
                       ).droplevel(0,axis=1)
table

item_id,5cf73a43cb2d86002379f58e,5cf73c169b311a00118c2f34,5cf73cb752d13c003eb88d53,5cf73d87cb2d86003579f96f,5cf73e6cec7c13003e1ecb47,5cf73eb7cb2d86466379e43c,5cf73ef7bf03790032e98161,5cf73f4ab6a0038c76d0480b,5d5e614c622a5609855e9525,5d5e615a622a5609855e9541,...,60b83e8a1e13e10001cb9722,60b83e8a1e13e10001cb9723,60b83e8a1e13e10001cb9724,60b8406e5ae6fb002fcd2d35,60b9ef3e1aaee40017ccdc4b,60bb20caac78d4002cc98323,60bd87e27a10900038ab931f,60c70e4cc5e3e500170c8c3f,60cffd3c6719f50020b882f8,60d05472f2cb0a0021f157a3
customer_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
5d64cd4deaab2300012e2d2d,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5d68a7041038080745bf8892,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5d68a7041038080745bf88a9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5d68a7051038080745bf88f0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5d68a7051038080745bf8990,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
60d5f9ac3924a0001872f03e,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
60d7bd8ee3a4d1000f259dc9,0,0,0,2,0,0,2,0,0,0,...,0,0,0,0,0,0,0,0,0,0
60f033bfeecaa4000c3c8f1d,0,0,0,2,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6157e744154b47568e38349c,0,0,0,2,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [4]:
# 計算 user 間的相似程度 
from sklearn.metrics.pairwise import cosine_similarity


b = cosine_similarity(table)
print(b)

# 對角線補 0
np.fill_diagonal(b, 0 )

# 把 similarity 轉成 table 的形式
similarity_with_user = pd.DataFrame(b,index=table.index)
similarity_with_user.columns=table.index
similarity_with_user.head()

[[1.         0.         0.13130643 ... 0.         0.         0.        ]
 [0.         1.         0.         ... 0.         0.         0.        ]
 [0.13130643 0.         1.         ... 0.         0.         0.        ]
 ...
 [0.         0.         0.         ... 1.         0.35777088 0.53333333]
 [0.         0.         0.         ... 0.35777088 1.         0.2981424 ]
 [0.         0.         0.         ... 0.53333333 0.2981424  1.        ]]


customer_id,5d64cd4deaab2300012e2d2d,5d68a7041038080745bf8892,5d68a7041038080745bf88a9,5d68a7051038080745bf88f0,5d68a7051038080745bf8990,5d68a7051038080745bf89b1,5d68a7061038080745bf8a22,5d68a7071038080745bf8b72,5d68aa7e1038080991f36dfd,5d68aa821038080991f36f51,...,60d4a665fe745f00185cb903,60d4abfb72bd860016fdc3d7,60d5412450ce0d001524df3c,60d55ad6f52554001bc72769,60d5e1438b8233001b8e3e75,60d5f9ac3924a0001872f03e,60d7bd8ee3a4d1000f259dc9,60f033bfeecaa4000c3c8f1d,6157e744154b47568e38349c,61c3f4b56d16d20019a1b329
customer_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
5d64cd4deaab2300012e2d2d,0.0,0.0,0.131306,0.167812,0.0,0.032963,0.029361,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5d68a7041038080745bf8892,0.0,0.0,0.0,0.0,0.0,0.30796,0.033903,0.0,0.0,0.081044,...,0.0,0.043769,0.0,0.0,0.0,0.0,0.071474,0.0,0.0,0.0
5d68a7041038080745bf88a9,0.131306,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5d68a7051038080745bf88f0,0.167812,0.0,0.0,0.0,0.3849,0.091667,0.244949,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5d68a7051038080745bf8990,0.0,0.0,0.0,0.3849,0.0,0.0,0.0,0.0,0.0,0.0,...,0.186248,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
# 針對每一位user, 找出前 10 名最相似的 user
def find_n_neighbours(df,n):
    order = np.argsort(df.values, axis=1)[:, :n]
    df = df.apply(lambda x: pd.Series(x.sort_values(ascending=False)
           .iloc[:n].index, 
          index=['top{}'.format(i) for i in range(1, n+1)]), axis=1)
    return df


In [53]:
# top 10 neighbours for each user
sim_10_user = find_n_neighbours(similarity_with_user, 10)
sim_10_user.sample(10)

Unnamed: 0_level_0,top1,top2,top3,top4,top5,top6,top7,top8,top9,top10
customer_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
5f87bcfebe90f8002001da54,6080a2ac5128bb001b5670fe,5d68c42b1038081f45ae64b6,5eb25d3190e5290daecf898f,5d68cf9b1038081f45ae8f28,5dcd37fc90e52952242616e6,5d68da0e1038082fb7ca57f9,5fe5de81802b32000cc5d081,5fe9b10d004ea2000f84f694,5d696164103808046ad84c5d,5d6970071038081466699124
5ff9a54c9c38680018716cfb,5d6947a61038087964efa4c9,5fc438da4f22b1001b9b3ed8,609d21a36d246c000cbdd1ee,5d68b48c1038080991f3a707,5fb7ac3e4c179f764b127166,609297ddf4ba330027ef2f8d,5f2002905ffe10003304d8d0,5ff1d46b8579760027270f00,607c1f5282bd43064bc2af8f,607be550ecc5a44877d7f69d
5fc876e4c604ce00245a235c,5f981f602b379600240fa7bd,5f2acf0a76cc2800369542ce,5d6961741038080b8b32ace0,5f0567459f8ef00021f60e71,5f6bee749bccf400134866e8,5eb502ccbf733800010257b4,5fdaf65281ba330014b34799,5e9ad769f63942002971b5f4,5d6923e4103808612765ba2b,5dce15b490e52972a1acc3df
5d69726210380814666999cc,5d69220b10380861276576e2,5f95098ceeedf000215f6a83,600a7156c2546b00278e22ed,5d692d251038086844f592d8,604c5036c217a700139c6ec7,6013ac5f0e687e0025aabdab,60081e7d0fe25a5bd74ee85a,5f3380af1c48cb0013f15b87,5fb7587521e14b0024afff4e,5f9574ecc4497b0014f6b41f
5e2521fa3d6818001ccadeef,5f8431f3ab5fa4000fa42595,600d2ea76fee720026d8bb77,604d97e792acf80020da124e,6059ae272887230023b80f0c,5fdf0fa30e08b00014a05193,6044446583481c001edfee52,5df09b80e1ca2c001fd5a281,5ff9697bcdc034002625b86b,604b4d158fd0c10015ce5649,60052074e2f1930015b508e5
5d68ebce10380834777e8341,5d68f8b010380842b41333dc,5e429dc7e755250013b15d57,5d691a5e10380857f97188b1,5d69444b103808713cf37c79,5dce1b1a90e52972a1acd86f,5d692f9d1038086844f5f040,5d68e6461038082fb7ca923b,5d69710a103808146669954a,5d6913331038084dee1838cb,5df1cc0f444ce70019ce0d29
5e7acad37f1218002f937f6b,604d9218f50980000dc6efb6,6042fe27738190001b0298ca,5fe4265c83783c002a813913,5fd647286ddfa147a4026600,5f9ad8c52ae5b0001327b323,60c2fdc8b892b300183a5df2,607fbf48ee23fe001b216b8c,5dd74950f183a40031553f21,5f9799afca0cc8000c5c2326,5d68e6d11038082fb7ca938e
6038ef5899e2b400138701b3,604108235b9b5a0018e6a230,604239a30ec48b7c95f02c9b,604416ea72a1a4000f2816ea,604f15c9833fcc000c25c76a,5fa172d455f9f600186653c6,604d862542b36b002a20ae3a,6048502569f077000fbbd473,5d6960b9103808046ad846c8,604b17478fd0c10021ce5132,60531ca5e0ee1b2d6c8f736f
5fdae44ca79d080021b8f204,5f7f025d2146c2001374b7b1,5de37397bdb7776569858ef5,5dce1d6690e529759d3c21dc,5d68f1e71038083f294cdb96,5f8533117eb0bd0013791803,5f7bfe984b94fe00136410bf,5f1038298a6a4e0021ce3bd2,5e9cef53f64dbe001d1e357d,5df5c0040e0d88000a966ab3,5d697017103808146669916c
609c8cd7c8cd6f001e4e1fa6,609bb55d84417d0021fe938b,60920a2afb61d4001e82e383,607c03425ec57111cfeb1b85,6053f6552b637f0021b6dfe0,606ad2d9e20a6f000f76bc26,5d691a2f10380857f97187ee,5d691cb610380857f9719002,605e75e1e4c28a00241ee575,609a17d30f33eb000cb1a059,607aff663716d20018077257


In [18]:
users = df.groupby(by= ["customer_id", 'item_id', 'product_name'],as_index=False)['quantity'].sum()
users.head()

Unnamed: 0,customer_id,item_id,product_name,quantity
0,5d64cd4deaab2300012e2d2d,5dbbfa11622a5632ee3849c6,R-COSMOS修護承諾護髮精華,1
1,5d64cd4deaab2300012e2d2d,5dbbfa15622a5632ee3849e4,R-活萃洗面乳100ml,2
2,5d64cd4deaab2300012e2d2d,5dbbfa1f622a5632ee384a20,R-活萃修護精華油30ml,1
3,5d64cd4deaab2300012e2d2d,5dbbfa23622a5632ee384a3e,R-純粹保濕精華液30ml,3
4,5d64cd4deaab2300012e2d2d,5dbbfa28622a5632ee384a5c,R-第三選擇防曬-無潤色版30ml,1


In [31]:
def get_user_similar_products( user1, user2 ):
    common_movies = users[users.customer_id == user1].merge(
    users[users.customer_id == user2],
    on = ["product_name", 'item_id'],
    how = "left" )
    return common_movies

In [55]:
user1 = '609c8cd7c8cd6f001e4e1fa6'
top_5=['609bb55d84417d0021fe938b', '60920a2afb61d4001e82e383', '607c03425ec57111cfeb1b85', '6053f6552b637f0021b6dfe0', '606ad2d9e20a6f000f76bc26']
			

for user2 in top_5:
    a = get_user_similar_products( user1 , user2) 
    a = a.loc[ : , ['quantity_x','quantity_y','product_name']]
    print(a)
    print('')

a = get_user_similar_products(user1 ,user2) #Top5
a = a.loc[ : , ['quantity_x','quantity_y','product_name', 'item_id']]
a


   quantity_x  quantity_y      product_name
0           2         2.0    專心護唇油（透明 / 莓紅）
1           1         1.0     R試用包-活萃洗面乳1ml
2           1         1.0   R試用包-活萃修護化妝水1ml
3           1         NaN     R試用包-奇蹟辣木油1ml
4           1         1.0           頭皮淨化紓壓組
5           1         2.0  R試用包-極境雙藻復原精華1ml

   quantity_x  quantity_y      product_name
0           2         2.0    專心護唇油（透明 / 莓紅）
1           1         NaN     R試用包-活萃洗面乳1ml
2           1         1.0   R試用包-活萃修護化妝水1ml
3           1         NaN     R試用包-奇蹟辣木油1ml
4           1         1.0           頭皮淨化紓壓組
5           1         1.0  R試用包-極境雙藻復原精華1ml

   quantity_x  quantity_y      product_name
0           2         4.0    專心護唇油（透明 / 莓紅）
1           1         NaN     R試用包-活萃洗面乳1ml
2           1         NaN   R試用包-活萃修護化妝水1ml
3           1         NaN     R試用包-奇蹟辣木油1ml
4           1         2.0           頭皮淨化紓壓組
5           1         2.0  R試用包-極境雙藻復原精華1ml

   quantity_x  quantity_y      product_name
0           2         5.0    

Unnamed: 0,quantity_x,quantity_y,product_name,item_id
0,2,1.0,專心護唇油（透明 / 莓紅）,5d7a29f5f5f15b87ac788be5
1,1,,R試用包-活萃洗面乳1ml,5f0427f407ec3167ffbc4478
2,1,,R試用包-活萃修護化妝水1ml,5f0427f707ec3167ffbc44b8
3,1,,R試用包-奇蹟辣木油1ml,5f0427fc07ec3167ffbc4518
4,1,1.0,頭皮淨化紓壓組,606d60c1817714002c97d75b
5,1,,R試用包-極境雙藻復原精華1ml,607073bcc287690017232398


In [44]:
def user_item_Q(user, item):
    a = sim_10_user[sim_10_user.index==user].values
    b = a.squeeze().tolist()
    c = table.loc[:,item]
    d = c[c.index.isin(b)]
    f = d[d.notnull()]
    avg_user = users.loc[users['customer_id'] == user, 'quantity'].values[0]
    index = f.index.values.squeeze().tolist()
    corr = similarity_with_user.loc[user,index]
    fin = pd.concat([f, corr], axis=1)
    fin.columns = ['adg_score','correlation']
    fin['score']=fin.apply(lambda x:x['adg_score'] * x['correlation'],axis=1)
    nume = fin['score'].sum()
    deno = fin['correlation'].sum()
    final_score = avg_user + (nume/deno)
    return final_score

In [56]:
user = '606ad2d9e20a6f000f76bc26' #top5
item = '607073bcc287690017232398' #R-頭頭-頭皮淨化雙入組
score = user_item_Q( user , item)
item_name = product[product.item_id == item].product_name.unique()
print(user, "user buying power of", item_name,  "is", score )



606ad2d9e20a6f000f76bc26 user buying power of ['R試用包-極境雙藻復原精華1ml'] is 1.1926039284332255


[20 Popular Machine Learning Evaluating Metrics](https://towardsdatascience.com/20-popular-machine-learning-metrics-part-1-classification-regression-evaluation-metrics-1ca3e282a2ce)

1. **Classification** : Accuracy, Precision, Recall, F1-score, ROC, AUC, …etc
2. **Regression** : MSE, MAE
3. **Ranking** : MRR, DCG, NDCG
4. **Statistical** : Correlation
5. **Computer Vision** : PSNR, SSIM, IoU
6. **NLP** : Perplexity, BLEU score
7. **Deep Learning Related** : Inception score, Frechet Inception distance


