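## Market Basket Analysis
- Overview
    - In almost any domain, shoppers who buy product A are often more likely to also buy product B. Retailers design sales strategies around such patterns in the hope of raising revenue or profit.
    - In physical stores, related products can also be shelved together to improve the shopping experience.
- Principles
    - Market basket analysis is built on probability. The key concepts are:
    1. Support = P(A): the share of all transactions that contain A. It shows how often A occurs and allows a rough comparison across items or itemsets, but it says nothing about how items relate to each other, which is why Confidence is needed next.
        - A can be a single item or a set of items, such as (bread) or (bread, milk).
    2. Confidence = P(A|B): among shoppers who bought B, the proportion who also bought A.
    3. Lift = Confidence / P(A): how many times more likely A is to be bought when B is bought. Expanding the formula gives P(A&B) / (P(A) \* P(B)), where P(A) \* P(B) is the joint probability if the two items were purchased independently. Lift > 1 means purchases of A and B are positively associated; Lift = 1 means they are independent; Lift < 1 means they are negatively associated.
- Applications
    - Computing Lift for all item combinations identifies the combinations with the highest lift, which are the best candidates for bundled sales. This is common in e-commerce recommendation systems.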

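The three measures above can be checked by hand on a tiny example. The sketch below uses five made-up baskets (not taken from the Groceries dataset) and hypothetical helper functions `support`, `confidence`, and `lift` that follow the definitions in this section, with Confidence read as P(A|B).

```python
# Toy transactions (hypothetical data, not from the Groceries dataset).
transactions = [
    {"bread", "milk"},
    {"bread"},
    {"milk"},
    {"bread", "milk"},
    {"eggs"},
]


def support(itemset, baskets):
    """Share of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(a, b, baskets):
    """P(A | B): among baskets containing B, the share that also contain A."""
    return support(set(a) | set(b), baskets) / support(b, baskets)


def lift(a, b, baskets):
    """Confidence divided by P(A); equals P(A and B) / (P(A) * P(B))."""
    return confidence(a, b, baskets) / support(a, baskets)


p_bread = support({"bread"}, transactions)            # bread is in 3 of 5 baskets
conf = confidence({"bread"}, {"milk"}, transactions)  # 2 of the 3 milk baskets
lift_value = lift({"bread"}, {"milk"}, transactions)  # above 1: positive association
print(p_bread, conf, lift_value)
```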
---
- References
    - [365 Data Science](https://365datascience.com/tutorials/python-tutorials/market-basket-analysis/)
    - [Microsoft MSDN Magazine](https://learn.microsoft.com/zh-tw/archive/msdn-magazine/2018/december/artificially-intelligent-market-basket-analysis)

### 1. Getting the Data

In [1]:
import numpy as np
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

In [2]:
df = pd.read_csv('Groceries_dataset.csv')
df.head()

Unnamed: 0,Member_number,Date,itemDescription
0,1808,21-07-2015,tropical fruit
1,2552,05-01-2015,whole milk
2,2300,19-09-2015,pip fruit
3,1187,12-12-2015,other vegetables
4,3037,01-02-2015,whole milk


In [3]:
df.shape

(38765, 3)

### 2. Data Preparation

In [4]:
## Build a transaction ID: one member's purchases on one date form one basket

df['unique_transaction_code'] = df['Member_number'].astype(str) + '_' + df['Date']
df.head()

Unnamed: 0,Member_number,Date,itemDescription,unique_transaction_code
0,1808,21-07-2015,tropical fruit,1808_21-07-2015
1,2552,05-01-2015,whole milk,2552_05-01-2015
2,2300,19-09-2015,pip fruit,2300_19-09-2015
3,1187,12-12-2015,other vegetables,1187_12-12-2015
4,3037,01-02-2015,whole milk,3037_01-02-2015


In [5]:
df['itemDescription'].nunique()  ## number of distinct products

167

In [6]:
## Count how many times each product appears in each transaction

df_cross_tab = pd.crosstab(
    index=df['unique_transaction_code'],
    columns=df['itemDescription']
)
print(df_cross_tab.shape)
df_cross_tab.head()

(14963, 167)


unique_transaction_code,Instant food products,UHT-milk,abrasive cleaner,artif. sweetener,baby cosmetics,bags,baking powder,bathroom cleaner,beef,berries,...,turkey,vinegar,waffles,whipped/sour cream,whisky,white bread,white wine,whole milk,yogurt,zwieback
1000_15-03-2015,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,1,0
1000_24-06-2014,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
1000_24-07-2015,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1000_25-11-2015,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1000_27-05-2015,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [7]:
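## Market basket analysis only cares which products were bought in a transaction, not how many
# So convert every count > 0 to True and 0 to False, the one-hot input the library expects
# (DataFrame.applymap is deprecated in recent pandas versions; a comparison gives the same mask)

bask_input = df_cross_tab > 0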

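To make the two preparation steps concrete, here is a minimal sketch on a made-up purchase log. The frame `toy` and its column names are hypothetical. It shows why the counts need to be collapsed: the same item can appear more than once in a transaction.

```python
import pandas as pd

# Hypothetical toy purchase log: transaction "t1" contains "milk" twice.
toy = pd.DataFrame(
    {
        "transaction": ["t1", "t1", "t1", "t2", "t3"],
        "item": ["milk", "milk", "bread", "bread", "eggs"],
    }
)

# Count of each item per transaction (rows: transactions, columns: items).
counts = pd.crosstab(index=toy["transaction"], columns=toy["item"])

# Presence/absence: True when the item occurs at least once in the basket.
basket = counts > 0

print(counts)
print(basket)
```

The `counts` table holds a 2 for milk in `t1`, while `basket` only records that milk was present, which is all that support, confidence, and lift need.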
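### 3. Modeling: Market Basket Analysis Algorithm
- Hand-written implementations are usually not computationally optimized, so a mature library (here `mlxtend`) is used for speed.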

In [8]:
freq_itemsets = apriori(
    df=bask_input,
    min_support=0.5  # return only itemsets with support >= 0.5
)
freq_itemsets

Unnamed: 0,support,itemsets


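> With this dataset, no itemset reaches a support of 0.5. With so many distinct products, 0.5 is a very high threshold. That said, with a suitable threshold this algorithm directly yields the `popular items`.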

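One way to pick a workable threshold is to look at single-item supports first, since no itemset can be more frequent than its most frequent member. The sketch below is an illustration on a made-up one-hot matrix `toy_basket` (hypothetical data). On the notebook's data the same idea would presumably be `bask_input.mean()`.

```python
import pandas as pd

# Hypothetical one-hot basket matrix: rows are transactions, columns are items.
toy_basket = pd.DataFrame(
    {
        "milk": [True, True, False, True],
        "bread": [True, False, False, False],
        "eggs": [False, False, True, False],
    }
)

# The mean of a boolean column is the share of transactions containing the item,
# which is that item's support.
item_support = toy_basket.mean().sort_values(ascending=False)
print(item_support)

# No itemset can have a higher support than the most frequent single item,
# so a min_support above this value yields an empty result.
max_support = item_support.max()
print(max_support)
```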
In [9]:
## Try a lower threshold
freq_itemsets = apriori(
    df=bask_input,
    min_support=0.001,  # return only itemsets with support >= 0.001
    use_colnames=True   # show product names instead of column indices, for readability
)
freq_itemsets.sort_values('support', ascending=False)

Unnamed: 0,support,itemsets
146,0.157923,(whole milk)
90,0.122101,(other vegetables)
109,0.110005,(rolls/buns)
123,0.097106,(soda)
147,0.085879,(yogurt)
...,...,...
344,0.001002,"(chicken, margarine)"
201,0.001002,"(bottled beer, chicken)"
202,0.001002,"(bottled beer, chocolate)"
516,0.001002,"(pastry, hamburger meat)"


> The `most popular item` is `whole milk`, with a support of about 0.158.

#### Use association rules to compute support, confidence, and lift in one pass

In [10]:
rules = association_rules(
    df=freq_itemsets,
    metric='lift',
    min_threshold=1
)
rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(tropical fruit),(UHT-milk),0.067767,0.021386,0.001537,0.022682,1.060617,8.785064e-05,1.001326
1,(UHT-milk),(tropical fruit),0.021386,0.067767,0.001537,0.071875,1.060617,8.785064e-05,1.004426
2,(brown bread),(beef),0.037626,0.033950,0.001537,0.040853,1.203301,2.597018e-04,1.007196
3,(beef),(brown bread),0.033950,0.037626,0.001537,0.045276,1.203301,2.597018e-04,1.008012
4,(beef),(citrus fruit),0.033950,0.053131,0.001804,0.053150,1.000349,6.297697e-07,1.000020
...,...,...,...,...,...,...,...,...,...
235,"(whole milk, yogurt)",(sausage),0.011161,0.060349,0.001470,0.131737,2.182917,7.967480e-04,1.082219
236,"(sausage, yogurt)",(whole milk),0.005748,0.157923,0.001470,0.255814,1.619866,5.626300e-04,1.131541
237,(whole milk),"(sausage, yogurt)",0.157923,0.005748,0.001470,0.009310,1.619866,5.626300e-04,1.003596
238,(sausage),"(whole milk, yogurt)",0.060349,0.011161,0.001470,0.024363,2.182917,7.967480e-04,1.013532

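The columns of the rules table can be cross-checked against each other. The sketch below recomputes confidence, lift, leverage, and conviction for rule 235, using the rounded values printed above, so the results match the table only approximately. In the table, confidence is P(consequent | antecedent), which corresponds to P(A|B) in the overview with B as the antecedent. The leverage and conviction formulas are the standard definitions and are assumed here to be what the library reports.

```python
# Rounded values copied from rule 235 above: (whole milk, yogurt) -> (sausage).
antecedent_support = 0.011161  # P(whole milk and yogurt)
consequent_support = 0.060349  # P(sausage)
support = 0.001470             # P(whole milk and yogurt and sausage)

# Confidence: share of antecedent baskets that also contain the consequent.
confidence = support / antecedent_support

# Lift: confidence relative to the consequent's baseline frequency.
lift = confidence / consequent_support

# Leverage: observed joint support minus the support expected under independence.
leverage = support - antecedent_support * consequent_support

# Conviction: (1 - consequent support) / (1 - confidence).
conviction = (1 - consequent_support) / (1 - confidence)

print(confidence, lift, leverage, conviction)
```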

In [11]:
## Rules are usually judged by lift, so sort by lift in descending order

rules.sort_values('lift', ascending=False)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
238,(sausage),"(whole milk, yogurt)",0.060349,0.011161,0.001470,0.024363,2.182917,7.967480e-04,1.013532
235,"(whole milk, yogurt)",(sausage),0.011161,0.060349,0.001470,0.131737,2.182917,7.967480e-04,1.082219
234,"(whole milk, sausage)",(yogurt),0.008955,0.085879,0.001470,0.164179,1.911760,7.012151e-04,1.093681
239,(yogurt),"(whole milk, sausage)",0.085879,0.008955,0.001470,0.017121,1.911760,7.012151e-04,1.008307
87,(specialty chocolate),(citrus fruit),0.015973,0.053131,0.001403,0.087866,1.653762,5.548137e-04,1.038081
...,...,...,...,...,...,...,...,...,...
145,(grapes),(soda),0.014436,0.097106,0.001403,0.097222,1.001195,1.674919e-06,1.000129
5,(citrus fruit),(beef),0.053131,0.033950,0.001804,0.033962,1.000349,6.297697e-07,1.000012
4,(beef),(citrus fruit),0.033950,0.053131,0.001804,0.053150,1.000349,6.297697e-07,1.000020
138,(rolls/buns),(fruit/vegetable juice),0.110005,0.034017,0.003743,0.034022,1.000136,5.091755e-07,1.000005


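> For the rules ranked highest by lift, combining antecedents and consequents gives the full product bundle. The top rule, for example, gives (sausage, yogurt, whole milk). The team can then discuss whether to package it as a promotional bundle and check whether, given price elasticity, it raises profit.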
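
Because lift is symmetric, each bundle shows up more than once in the sorted table (rules 238 and 235 describe the same three products). A minimal sketch of collapsing rules into unique bundles follows. It runs on a hypothetical `toy_rules` frame with rounded, illustrative lift values, and assumes the `antecedents` and `consequents` columns hold frozensets as in the output above.

```python
import pandas as pd

# Hypothetical rules table shaped like the output above (illustrative values).
toy_rules = pd.DataFrame(
    {
        "antecedents": [
            frozenset({"sausage"}),
            frozenset({"whole milk", "yogurt"}),
            frozenset({"specialty chocolate"}),
            frozenset({"citrus fruit"}),
        ],
        "consequents": [
            frozenset({"whole milk", "yogurt"}),
            frozenset({"sausage"}),
            frozenset({"citrus fruit"}),
            frozenset({"specialty chocolate"}),
        ],
        "lift": [2.18, 2.18, 1.65, 1.65],
    }
)

# The bundle is the union of both sides of the rule.
toy_rules["bundle"] = [
    a | c for a, c in zip(toy_rules["antecedents"], toy_rules["consequents"])
]

# Sort by lift and keep one row per bundle.
bundles = (
    toy_rules.sort_values("lift", ascending=False)
    .drop_duplicates(subset="bundle")
    .reset_index(drop=True)
)
print(bundles[["bundle", "lift"]])
```

Applied to the real `rules` frame, the same union-and-deduplicate step would give one candidate bundle per itemset instead of one row per rule direction.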