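## Association Rules (Association Analysis)
- Used to find useful rules in the relationships between products or services
- Applied mainly in retail, where it is also called market basket analysis
- Drawbacks: it is hard to account for business-critical factors, and the computation is expensive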
<br>
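- **Key evaluation metrics**
 - Support: the share of all transactions in which items X and Y are bought together; indicates how significant the rule is
 - Confidence: the share of transactions containing X that also contain Y, i.e. the conditional probability P(Y | X); indicates how reliable the rule is
 - Lift: the confidence of X -> Y divided by the overall share of transactions containing Y; indicates how strongly X and Y are associated<br>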
 
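- Interpreting lift
 - lift > 1: positive association between the items (complements)
 - lift = 1: the items are independent of each other
 - lift < 1: negative association between the items (substitutes)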
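
The three metrics can be computed by hand on a handful of baskets. The sketch below uses plain Python and a made-up set of five transactions (the items and the helper names `support`, `confidence`, and `lift` are illustrative assumptions, not part of the notebook's data or of mlxtend).

```python
# Hypothetical toy transactions, not taken from the notebook's dataset.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "eggs"},
    {"milk", "eggs"},
    {"bread"},
    {"milk", "bread"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(x, y, transactions):
    """P(Y | X): support of X and Y together divided by support of X."""
    return support(set(x) | set(y), transactions) / support(x, transactions)


def lift(x, y, transactions):
    """Confidence of X -> Y relative to the baseline frequency of Y."""
    return confidence(x, y, transactions) / support(y, transactions)


s = support({"milk", "bread"}, transactions)
c = confidence({"milk"}, {"bread"}, transactions)
l = lift({"milk"}, {"bread"}, transactions)
print(f"support={s:.3f}, confidence={c:.3f}, lift={l:.3f}")
```

In this toy data the lift of milk -> bread comes out slightly below 1, so the two items co-occur a little less often than independence would predict.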
 
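- `apriori()` (from `mlxtend.frequent_patterns`): computes how frequently each combination of purchased items occurs
 - The input data set must be one-hot encoded (dummy variables) by purchased item
 - `min_support` sets the minimum support, and `max_len` sets the maximum number of items in a combination
 - **use_colnames = True** returns item names in the result instead of column indices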

- `association_rules()` (from `mlxtend.frequent_patterns`)
 - Derives association rules from the frequent itemsets
 - `metric` sets the measure used for filtering, and `min_threshold` sets its cutoff value
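
To make the roles of the minimum support, the maximum itemset length, and the metric threshold concrete, here is a minimal pure-Python sketch. It enumerates itemsets by brute force rather than using Apriori's candidate pruning, so it is only an illustration of what the parameters control, not how mlxtend is implemented. The baskets and the function names `frequent_itemsets` and `rules_by_confidence` are hypothetical.

```python
from itertools import combinations

# Hypothetical toy baskets; not taken from the notebook's data.
baskets = [
    {"milk", "bread"},
    {"milk", "bread", "eggs"},
    {"milk", "eggs"},
    {"bread"},
    {"milk", "bread"},
]


def frequent_itemsets(baskets, min_support, max_len):
    """Brute-force search: keep every itemset of up to `max_len` items
    whose support is at least `min_support`."""
    items = sorted(set().union(*baskets))
    n = len(baskets)
    result = {}
    for size in range(1, max_len + 1):
        for combo in combinations(items, size):
            sup = sum(set(combo) <= b for b in baskets) / n
            if sup >= min_support:
                result[frozenset(combo)] = sup
    return result


def rules_by_confidence(itemsets, min_threshold):
    """Split each frequent itemset into antecedent -> consequent rules
    and keep those whose confidence reaches `min_threshold`."""
    rules = []
    for itemset, sup in itemsets.items():
        for k in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), k):
                antecedent = frozenset(antecedent)
                # Every subset of a frequent itemset is itself frequent,
                # so its support is already in the dictionary.
                conf = sup / itemsets[antecedent]
                if conf >= min_threshold:
                    rules.append((antecedent, itemset - antecedent, conf))
    return rules


itemsets = frequent_itemsets(baskets, min_support=0.4, max_len=2)
rules = rules_by_confidence(itemsets, min_threshold=0.7)
for antecedent, consequent, conf in rules:
    print(set(antecedent), "->", set(consequent), round(conf, 2))
```

Lowering `min_support` or raising `max_len` grows the number of candidate itemsets quickly, which is why the real analysis can take a while on a full transaction table.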

**Preprocessing: reshape the data with `pivot_table` so that it is one-hot encoded before running the analysis.**
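
As a small illustration of that reshaping step, the sketch below turns a long-format purchase log (one row per ID and item) into a boolean basket matrix. The toy frame `toy` and its values are assumptions made up for this example; only the `pivot_table` pattern mirrors the notebook.

```python
import pandas as pd

# Hypothetical long-format purchase log: one row per (ID, Item) pair.
# Customer "b" bought milk twice, which the "max" aggregation collapses.
toy = pd.DataFrame({
    "ID":   ["a", "a", "b", "b", "b", "c"],
    "Item": ["milk", "bread", "milk", "eggs", "milk", "bread"],
})
toy["purchase"] = True

# Rows become customers, columns become items, cells say whether the item was bought.
basket = toy.pivot_table(index="ID", columns="Item", values="purchase",
                         aggfunc="max", fill_value=False).astype(bool)
print(basket)
```

The trailing `astype(bool)` is a defensive choice so that the matrix is strictly boolean regardless of how the fill value affects the column dtype.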

In [1]:
!pip install mlxtend # install mlxtend

Collecting mlxtend
  Downloading mlxtend-0.19.0-py2.py3-none-any.whl (1.3 MB)
Installing collected packages: mlxtend
Successfully installed mlxtend-0.19.0


In [2]:
import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

In [5]:
df = pd.read_csv('C:/Data/association_rules_mart.csv')
df.head(4)

Unnamed: 0,Date,ID,Item
0,2014-01-01,1249in804,citrus fruit
1,2014-01-01,1249in804,coffee
2,2014-01-01,1381ht273,curd
3,2014-01-01,1381ht273,soda


In [6]:
df['purchase'] = True

In [8]:
df_pivot = df.pivot_table(index= "ID", columns = "Item", values = "purchase", aggfunc = "max",
                         fill_value = False)
df_pivot.head(5)

ID,Instant food products,UHT-milk,abrasive cleaner,artif. sweetener,baby cosmetics,bags,baking powder,bathroom cleaner,beef,berries,...,turkey,vinegar,waffles,whipped/sour cream,whisky,white bread,white wine,whole milk,yogurt,zwieback
1000ol738,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,True,False
1001sf480,False,False,False,False,False,False,False,False,True,False,...,False,False,False,True,False,True,False,True,False,False
1002nj599,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False
1003cq947,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1004jh583,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False


In [10]:
item_freq = apriori(df_pivot, min_support= 0.01, use_colnames = True) # tune min_support as needed; this step takes a while to run
item_freq.head(4)

Unnamed: 0,support,itemsets
0,0.015393,(Instant food products)
1,0.078502,(UHT-milk)
2,0.031042,(baking powder)
3,0.119548,(beef)


In [11]:
df_rules = association_rules(item_freq, metric = 'lift', min_threshold = 1.5)
df_rules.head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(UHT-milk),(cream cheese),0.078502,0.088507,0.010518,0.133987,1.513858,0.00357,1.052517
1,(cream cheese),(UHT-milk),0.088507,0.078502,0.010518,0.118841,1.513858,0.00357,1.045779
2,(fruit/vegetable juice),(berries),0.124936,0.079785,0.015649,0.125257,1.569937,0.005681,1.051983
3,(berries),(fruit/vegetable juice),0.079785,0.124936,0.015649,0.196141,1.569937,0.005681,1.08858
4,(beverages),(white bread),0.062083,0.088763,0.010518,0.169421,1.908685,0.005008,1.097111
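
The output columns are related by simple formulas, and it can help to check one row by hand. The sketch below recomputes confidence, lift, leverage, and conviction for row 0 from its three support values, using the standard definitions of these measures. Because the displayed inputs are rounded, the results should only be expected to match the table to a few decimal places.

```python
# Values copied from row 0 of the output above: (UHT-milk) -> (cream cheese)
antecedent_support = 0.078502
consequent_support = 0.088507
support = 0.010518

# confidence = P(consequent | antecedent)
confidence = support / antecedent_support
# lift = confidence relative to the consequent's baseline frequency
lift = confidence / consequent_support
# leverage = observed joint support minus the support expected under independence
leverage = support - antecedent_support * consequent_support
# conviction = (1 - consequent support) / (1 - confidence)
conviction = (1 - consequent_support) / (1 - confidence)

print(confidence, lift, leverage, conviction)
```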


In [12]:
df.nunique()

Date        1039
ID          3898
Item         167
purchase       1
dtype: int64

In [23]:
df = df.iloc[:, 1:].drop_duplicates() # drop the Date column, then remove duplicate (ID, Item) rows
len(df)

34766

In [25]:
df['purchase'] = True
df_pivot = pd.pivot_table(df, index = "ID", columns = "Item", values = "purchase", aggfunc = "max", fill_value = False)
df_pivot.head(2)

ID,Instant food products,UHT-milk,abrasive cleaner,artif. sweetener,baby cosmetics,bags,baking powder,bathroom cleaner,beef,berries,...,turkey,vinegar,waffles,whipped/sour cream,whisky,white bread,white wine,whole milk,yogurt,zwieback
1000ol738,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,True,False
1001sf480,False,False,False,False,False,False,False,False,True,False,...,False,False,False,True,False,True,False,True,False,False


In [26]:
item_sets = apriori(df = df_pivot,
                   min_support = 0.005, use_colnames = True)
item_sets.head(2)

Unnamed: 0,support,itemsets
0,0.015393,(Instant food products)
1,0.078502,(UHT-milk)


In [30]:
rules = association_rules(item_sets, metric = 'confidence', min_threshold = 0.005)
rules_sub = rules.loc[rules['support'] >= 0.1]
rules_sub.head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
758,(whole milk),(bottled water),0.458184,0.213699,0.112365,0.245241,1.147597,0.014452,1.04179
759,(bottled water),(whole milk),0.213699,0.458184,0.112365,0.52581,1.147597,0.014452,1.142615
3442,(rolls/buns),(other vegetables),0.349666,0.376603,0.146742,0.419663,1.114335,0.015056,1.074197
3443,(other vegetables),(rolls/buns),0.376603,0.349666,0.146742,0.389646,1.114335,0.015056,1.065502
3460,(soda),(other vegetables),0.313494,0.376603,0.124166,0.396072,1.051695,0.006103,1.032237


In [31]:
len(rules_sub)

26

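#### Run the association analysis with both minimum support and minimum confidence set to 0.005.
#### Among the rules with support of at least 0.01, which item is unrelated to the rule with the highest lift? (Remove duplicates beforehand and set max_len = 3.)
#### Answer: beer, the item that does not appear in the resulting top rule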

In [32]:
df['purchase'] = True
df_pivot = pd.pivot_table(df, index = "ID", columns = "Item", values = "purchase", aggfunc = "max", fill_value = False)

In [33]:
item_sets = apriori(df = df_pivot,
                   min_support = 0.005, use_colnames = True, max_len = 3)
item_sets.head(4)

Unnamed: 0,support,itemsets
0,0.015393,(Instant food products)
1,0.078502,(UHT-milk)
2,0.005644,(abrasive cleaner)
3,0.00744,(artif. sweetener)


In [35]:
rules = association_rules(item_sets, metric = 'confidence', min_threshold = 0.005)
rules_sub = rules.loc[rules['support'] >= 0.01]
rules_sub.head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
8,(UHT-milk),(beef),0.078502,0.119548,0.010518,0.133987,1.120775,0.001133,1.016672
9,(beef),(UHT-milk),0.119548,0.078502,0.010518,0.087983,1.120775,0.001133,1.010396
14,(UHT-milk),(bottled beer),0.078502,0.158799,0.014879,0.189542,1.193597,0.002413,1.037933
15,(bottled beer),(UHT-milk),0.158799,0.078502,0.014879,0.0937,1.193597,0.002413,1.016769
16,(UHT-milk),(bottled water),0.078502,0.213699,0.021293,0.271242,1.269268,0.004517,1.07896


In [38]:
rules_sub.sort_values("lift", ascending = False)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
22842,"(whole milk, domestic eggs)",(meat),0.070292,0.063622,0.010262,0.145985,2.294561,0.005789,1.096442
22843,(meat),"(whole milk, domestic eggs)",0.063622,0.070292,0.010262,0.161290,2.294561,0.005789,1.108497
22844,(domestic eggs),"(meat, whole milk)",0.133145,0.034890,0.010262,0.077071,2.208999,0.005616,1.045704
22841,"(meat, whole milk)",(domestic eggs),0.034890,0.133145,0.010262,0.294118,2.208999,0.005616,1.228044
18053,"(whole milk, fruit/vegetable juice)",(chocolate),0.062340,0.086455,0.010775,0.172840,1.999194,0.005385,1.104435
...,...,...,...,...,...,...,...,...,...
27291,(long life bakery product),"(whole milk, other vegetables)",0.065418,0.191380,0.011031,0.168627,0.881112,-0.001488,0.972632
28850,(newspapers),"(sausage, other vegetables)",0.139815,0.092868,0.011288,0.080734,0.869340,-0.001697,0.986800
28847,"(sausage, other vegetables)",(newspapers),0.092868,0.139815,0.011288,0.121547,0.869340,-0.001697,0.979204
1648,(citrus fruit),(cream cheese),0.185480,0.088507,0.014110,0.076072,0.859502,-0.002306,0.986541
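
Reading the answer off the sorted table can also be done programmatically. The sketch below rebuilds three rows of the output above in a small frame named `sample` (an illustrative stand-in for `rules_sub`), takes the rule with the highest lift, and checks which candidate items it does not contain. The `candidates` list is an assumption made for this example, with "bottled beer" standing in for the quiz's beer option.

```python
import pandas as pd

# Three rows copied from the sorted output above; antecedents and consequents
# are stored as frozensets, as in the association rules frame.
sample = pd.DataFrame({
    "antecedents": [frozenset({"whole milk", "domestic eggs"}),
                    frozenset({"meat"}),
                    frozenset({"citrus fruit"})],
    "consequents": [frozenset({"meat"}),
                    frozenset({"whole milk", "domestic eggs"}),
                    frozenset({"cream cheese"})],
    "lift": [2.294561, 2.294561, 0.859502],
})

# Rule with the highest lift, and every item it mentions on either side.
top = sample.sort_values("lift", ascending=False).iloc[0]
top_items = set(top["antecedents"]) | set(top["consequents"])

# Hypothetical answer choices; "bottled beer" is an assumed stand-in for "beer".
candidates = ["whole milk", "domestic eggs", "meat", "bottled beer"]
unrelated = [item for item in candidates if item not in top_items]
print(unrelated)
```

The two top rows tie on lift, but they involve the same three items, so the result does not depend on which of them is picked.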


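#### Using only the top 30 items by sales, run the association analysis with minimum support and minimum confidence set to 0.005.
#### Among the rules with support of at least 3%, what is the highest lift? => 1.54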

In [40]:
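# Name the columns explicitly: the default names from value_counts().reset_index() differ across pandas versions
df_item_cnt = df['Item'].value_counts().rename_axis("index").reset_index(name = "Item")
df_item_cnt = df_item_cnt.sort_values("Item", ascending = False)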
df_item_cnt.head()

Unnamed: 0,index,Item
0,whole milk,1786
1,other vegetables,1468
2,rolls/buns,1363
3,soda,1222
4,yogurt,1103


In [42]:
df_item_cnt = df_item_cnt.iloc[:30] # top 30 items by sales (purchase count)

In [43]:
df_sub = df.loc[df['Item'].isin(df_item_cnt['index'])]
df_sub.head()

Unnamed: 0,ID,Item,purchase
0,1249in804,citrus fruit,True
1,1249in804,coffee,True
2,1381ht273,curd,True
3,1381ht273,soda,True
4,1440kn258,other vegetables,True


In [45]:
df_sub_pivot = pd.pivot_table(df_sub, index = "ID", columns = "Item", values = "purchase", aggfunc = "max", fill_value = False)

In [49]:
item_sets = apriori(df = df_sub_pivot,
                   min_support = 0.005, use_colnames = True)
item_sets.head(4)

Unnamed: 0,support,itemsets
0,0.120538,(beef)
1,0.160114,(bottled beer)
2,0.215468,(bottled water)
3,0.137093,(brown bread)


In [48]:
rules = association_rules(item_sets, metric = 'confidence', min_threshold = 0.005)
rules_sub = rules.loc[rules['support'] >= 0.03]
rules_sub = rules_sub.sort_values('lift', ascending = False)
rules_sub.head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
16065,(sausage),"(rolls/buns, yogurt)",0.207708,0.112261,0.035954,0.173101,1.541954,0.012637,1.073576
16064,"(rolls/buns, yogurt)",(sausage),0.112261,0.207708,0.035954,0.320276,1.541954,0.012637,1.165609
16062,"(rolls/buns, sausage)",(yogurt),0.083032,0.285308,0.035954,0.433022,1.517736,0.012265,1.260529
16067,(yogurt),"(rolls/buns, sausage)",0.285308,0.083032,0.035954,0.12602,1.517736,0.012265,1.049187
14960,"(other vegetables, yogurt)",(sausage),0.121314,0.207708,0.037506,0.309168,1.488475,0.012309,1.146867
