## Keywords
### Recommender systems, association rules, market basket analysis, mlxtend, apriori, association_rules

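### Characteristics of association analysis
- Used to find useful rules in the relationships between products or services
- Mainly applied in retail, where it is also known as `market basket analysis`
- It is hard to take business-critical factors into account, and the computational cost is high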

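### Key evaluation metrics
- Support: the proportion of transactions in which items X and Y are purchased together, `importance of the rule`
- Confidence: the proportion of transactions containing X that also contain Y (a conditional probability), `reliability of the rule`
- Lift: the confidence of X → Y divided by the overall proportion of transactions containing Y, i.e. how much more often Y appears given X than at baseline, `correlation of the rule`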
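
As a rough illustration of the three metrics above, the sketch below computes them by hand on five made-up baskets. The basket contents and the helper names (`support`, `confidence`, `lift`) are assumptions for this example only; they are not part of the data set or of mlxtend.

```python
# Toy baskets (hypothetical data, not taken from association_rules_mart.csv)
baskets = [
    {"milk", "bread"},
    {"milk", "yogurt"},
    {"milk", "bread", "yogurt"},
    {"bread"},
    {"milk"},
]

def support(items):
    """Share of baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in baskets) / len(baskets)

def confidence(x, y):
    """P(Y | X): support of X and Y together divided by the support of X."""
    return support(set(x) | set(y)) / support(x)

def lift(x, y):
    """Confidence of X -> Y relative to the baseline share of Y."""
    return confidence(x, y) / support(y)

sup = support({"bread", "milk"})
conf = confidence({"bread"}, {"milk"})
lft = lift({"bread"}, {"milk"})
print(f"support={sup:.3f} confidence={conf:.3f} lift={lft:.3f}")
```

In this toy data, bread buyers pick milk a little less often than shoppers overall, so the lift comes out below 1.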

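### Interpreting lift
- Lift > 1: positive association between the items (complementary goods)
- Lift = 1: the items are mutually independent
- Lift < 1: negative association between the items (substitute goods)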
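
The three cases can be written as a small helper. This is only a sketch: the function name `interpret_lift` and the tolerance used to treat a value as "equal to 1" are assumptions, since computed lifts are floating-point numbers and rarely hit 1 exactly.

```python
import math

def interpret_lift(lift, tol=1e-9):
    """Map a lift value to one of the three cases (tol is an assumed tolerance)."""
    if math.isclose(lift, 1.0, abs_tol=tol):
        return "independent"
    return "positive association" if lift > 1 else "negative association"

for value in (1.95, 1.0, 0.83):
    print(value, "->", interpret_lift(value))
```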

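### mlxtend - apriori()
- The mlxtend function that computes `frequent itemsets` (how often each item combination is purchased)
- The input data set must be `dummy-encoded (OHE: one-hot encoding)` by purchased item
- The min_support and max_len arguments set the `minimum support` and the `maximum number of items per combination`
- Always set use_colnames to True so that itemsets are reported by item name rather than by column index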
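
To make the roles of `min_support` and `max_len` concrete, here is a minimal level-wise (Apriori-style) search in plain Python. It is a simplified sketch for intuition, not mlxtend's implementation; the function name `frequent_itemsets` and the toy transactions are assumptions.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support, max_len=None):
    """Return {itemset: support} for itemsets whose support >= min_support."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)
    # Level 1 candidates: every single item seen in the data
    candidates = [frozenset([i]) for i in sorted({i for b in baskets for i in b})]
    found = {}
    k = 1
    while candidates and (max_len is None or k <= max_len):
        level = {}
        for cand in candidates:
            sup = sum(cand <= b for b in baskets) / n
            if sup >= min_support:
                level[cand] = sup
        found.update(level)
        # Build (k+1)-item candidates; keep one only if all of its
        # k-item subsets were frequent (the Apriori pruning step)
        k += 1
        next_candidates = set()
        for a, b in combinations(level, 2):
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in level for sub in combinations(union, k - 1)
            ):
                next_candidates.add(union)
        candidates = list(next_candidates)
    return found

# Hypothetical transactions
transactions = [
    {"milk", "bread"},
    {"milk", "yogurt"},
    {"milk", "bread", "yogurt"},
    {"bread"},
]
result = frequent_itemsets(transactions, min_support=0.5, max_len=2)
singles_only = frequent_itemsets(transactions, min_support=0.5, max_len=1)
for itemset, sup in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)
```

Raising `min_support` shrinks the result, and lowering `max_len` stops the search before larger combinations are even generated, which is where most of the computational cost lies.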

In [1]:
!pip install mlxtend



In [2]:
import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

In [3]:
df = pd.read_csv("association_rules_mart.csv")
df.head(2)

Unnamed: 0,Date,ID,Item
0,2014-01-01,1249in804,citrus fruit
1,2014-01-01,1249in804,coffee


Instead of `get_dummies`, we will build the one-hot table with `pivot_table` here.

In [4]:
df["purchase"] = True

In [5]:
df.head(5)

Unnamed: 0,Date,ID,Item,purchase
0,2014-01-01,1249in804,citrus fruit,True
1,2014-01-01,1249in804,coffee,True
2,2014-01-01,1381ht273,curd,True
3,2014-01-01,1381ht273,soda,True
4,2014-01-01,1440kn258,other vegetables,True


In [6]:
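# One row per customer ID, one boolean column per item; the string "max" avoids the pandas warning raised for the builtin max
df_pivot = df.pivot_table(index="ID", columns="Item", values="purchase", aggfunc="max", fill_value=False)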
df_pivot.head(2)

ID,Instant food products,UHT-milk,abrasive cleaner,artif. sweetener,baby cosmetics,bags,baking powder,bathroom cleaner,beef,berries,...,turkey,vinegar,waffles,whipped/sour cream,whisky,white bread,white wine,whole milk,yogurt,zwieback
1000ol738,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,True,False
1001sf480,False,False,False,False,False,False,False,False,True,False,...,False,False,False,True,False,True,False,True,False,False
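
As a cross-check of the pivot step, the sketch below builds the same kind of one-hot table in two ways on a tiny made-up frame: once with `pivot_table` and once with `get_dummies` followed by a per-customer `groupby`. The sample rows and the variable names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical transactions; customer "b" bought milk twice
tx = pd.DataFrame({
    "ID": ["a", "a", "b", "b", "b", "c"],
    "Item": ["milk", "bread", "milk", "milk", "soda", "bread"],
})

# Route 1: pivot_table on a constant True flag
basket_pivot = (
    tx.assign(purchase=True)
      .pivot_table(index="ID", columns="Item", values="purchase",
                   aggfunc="max", fill_value=False)
      .astype(bool)
)

# Route 2: one dummy row per transaction, then collapse rows per customer
basket_dummies = pd.get_dummies(tx["Item"]).groupby(tx["ID"]).max().astype(bool)

print(basket_pivot)
```

Both routes give one row per customer and one boolean column per item; repeated purchases of the same item collapse to a single True.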


In [7]:
item_freq = apriori(df_pivot, min_support = 0.005, use_colnames= True)
item_freq.head()

Unnamed: 0,support,itemsets
0,0.015393,(Instant food products)
1,0.078502,(UHT-milk)
2,0.005644,(abrasive cleaner)
3,0.00744,(artif. sweetener)
4,0.031042,(baking powder)


### mlxtend - association_rules()
- The mlxtend function that derives association rules from the frequent itemsets
- metric sets the `metric used for filtering`, and min_threshold sets `its cutoff value`
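
The idea behind `metric` and `min_threshold` can be sketched in plain Python: split every frequent itemset into an antecedent and a consequent, compute the metrics, and keep the rules whose chosen metric is at least the threshold. The function `rules_from_itemsets` and the toy supports are assumptions for illustration, not mlxtend's code.

```python
from itertools import combinations

def rules_from_itemsets(supports, metric="confidence", min_threshold=0.8):
    """supports: {frozenset: support}, including every subset of each itemset."""
    rules = []
    for itemset, sup in supports.items():
        # Single-item sets produce no split, so they yield no rules
        for r in range(1, len(itemset)):
            for ante in combinations(sorted(itemset), r):
                antecedent = frozenset(ante)
                consequent = itemset - antecedent
                conf = sup / supports[antecedent]
                row = {
                    "antecedents": antecedent,
                    "consequents": consequent,
                    "support": sup,
                    "confidence": conf,
                    "lift": conf / supports[consequent],
                }
                if row[metric] >= min_threshold:
                    rules.append(row)
    return rules

# Hypothetical frequent itemsets with their supports
supports = {
    frozenset({"milk"}): 0.75,
    frozenset({"bread"}): 0.75,
    frozenset({"yogurt"}): 0.5,
    frozenset({"milk", "bread"}): 0.5,
    frozenset({"milk", "yogurt"}): 0.5,
}
by_lift = rules_from_itemsets(supports, metric="lift", min_threshold=1.0)
by_conf = rules_from_itemsets(supports, metric="confidence", min_threshold=0.7)
print(len(by_lift), len(by_conf))
```

Filtering on lift keeps both directions of the milk/yogurt pair, while filtering on confidence keeps only yogurt → milk, because confidence depends on which side is the antecedent.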

In [8]:
df_rules = association_rules(item_freq, metric = 'lift',
                             min_threshold=1.5)
df_rules.head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(root vegetables),(Instant food products),0.230631,0.015393,0.006927,0.030033,1.951168,0.003377,1.015094
1,(Instant food products),(root vegetables),0.015393,0.230631,0.006927,0.45,1.951168,0.003377,1.398853
2,(Instant food products),(soda),0.015393,0.313494,0.007953,0.516667,1.648091,0.003127,1.420357
3,(soda),(Instant food products),0.313494,0.015393,0.007953,0.025368,1.648091,0.003127,1.010235
4,(candy),(UHT-milk),0.053874,0.078502,0.00744,0.138095,1.759135,0.003211,1.069142
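
The derived columns in this output can be recomputed from the three support columns. The sketch below does so for row 1, (Instant food products) → (root vegetables), using the standard definitions of confidence, lift, leverage and conviction; the helper name `rule_metrics` is an assumption, and small differences are expected because the printed supports are rounded.

```python
def rule_metrics(antecedent_support, consequent_support, support):
    """Recompute the derived rule metrics from the three support values."""
    confidence = support / antecedent_support
    return {
        "confidence": confidence,
        "lift": confidence / consequent_support,
        "leverage": support - antecedent_support * consequent_support,
        # Undefined when confidence is exactly 1 (division by zero)
        "conviction": (1 - consequent_support) / (1 - confidence),
    }

# Values copied from row 1 of the output above
m = rule_metrics(antecedent_support=0.015393,
                 consequent_support=0.230631,
                 support=0.006927)
print({name: round(value, 4) for name, value in m.items()})
```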


### 1. With the minimum support and minimum confidence both set to 0.005, how many rules have a support of at least 0.1?
- association_rules_mart.csv
- Remove duplicates beforehand

Answer: 26

In [9]:
df = pd.read_csv("association_rules_mart.csv")
df.head(2)

Unnamed: 0,Date,ID,Item
0,2014-01-01,1249in804,citrus fruit
1,2014-01-01,1249in804,coffee


In [10]:
len(df)

40000

In [11]:
df = df.iloc[:, 1:].drop_duplicates() # drop the first column, then remove duplicate rows
len(df)

34766

In [12]:
df["purchase"] = True
df_pivot = df.pivot_table(index="ID", columns="Item", values="purchase", aggfunc="max", fill_value=False)
df_pivot.head(2)

ID,Instant food products,UHT-milk,abrasive cleaner,artif. sweetener,baby cosmetics,bags,baking powder,bathroom cleaner,beef,berries,...,turkey,vinegar,waffles,whipped/sour cream,whisky,white bread,white wine,whole milk,yogurt,zwieback
1000ol738,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,True,False
1001sf480,False,False,False,False,False,False,False,False,True,False,...,False,False,False,True,False,True,False,True,False,False


In [13]:
item_sets = apriori(df = df_pivot, min_support=0.005, use_colnames=True)
item_sets.head(2)

Unnamed: 0,support,itemsets
0,0.015393,(Instant food products)
1,0.078502,(UHT-milk)


In [14]:
rules = association_rules(item_sets, metric = 'confidence',
                             min_threshold=0.005)
rules.head(2)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(Instant food products),(rolls/buns),0.015393,0.349666,0.005387,0.35,1.000954,5e-06,1.000513
1,(rolls/buns),(Instant food products),0.349666,0.015393,0.005387,0.015407,1.000954,5e-06,1.000015


In [15]:
rules_sub = rules.loc[rules["support"] >= 0.1, ]  # "at least 0.1", as the question asks
rules_sub = rules_sub.sort_values("lift", ascending = False)
rules_sub.head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
4193,(yogurt),(whole milk),0.282966,0.458184,0.15059,0.532185,1.16151,0.02094,1.158185
4192,(whole milk),(yogurt),0.458184,0.282966,0.15059,0.328667,1.16151,0.02094,1.068076
759,(bottled water),(whole milk),0.213699,0.458184,0.112365,0.52581,1.147597,0.014452,1.142615
758,(whole milk),(bottled water),0.458184,0.213699,0.112365,0.245241,1.147597,0.014452,1.04179
3969,(sausage),(whole milk),0.206003,0.458184,0.106978,0.519303,1.133394,0.012591,1.127146


In [16]:
len(rules_sub)

26
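
Note that a two-item itemset produces two mirrored rules (X → Y and Y → X) with the same support and lift, so the rule count is larger than the number of distinct item combinations. The sketch below collapses the five rules printed by `rules_sub.head()` above into their underlying itemsets; the variable names are assumptions.

```python
# (antecedents, consequents, support) copied from the rules_sub.head() output
shown_rules = [
    ({"yogurt"}, {"whole milk"}, 0.15059),
    ({"whole milk"}, {"yogurt"}, 0.15059),
    ({"bottled water"}, {"whole milk"}, 0.112365),
    ({"whole milk"}, {"bottled water"}, 0.112365),
    ({"sausage"}, {"whole milk"}, 0.106978),
]

# Mirrored rules share the same union of items, so they map to one key
itemsets = {}
for antecedents, consequents, sup in shown_rules:
    itemsets[frozenset(antecedents | consequents)] = sup

print(len(shown_rules), "rules ->", len(itemsets), "distinct itemsets")
```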

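### 2. With the minimum support and minimum confidence both set to 0.005, which item is unrelated to the rule with the highest lift among the rules whose support is at least 0.01?
- association_rules_mart.csv
- Remove duplicates beforehand
- Set max_len = 3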

In [17]:
df = pd.read_csv("association_rules_mart.csv")
df.head(2)

Unnamed: 0,Date,ID,Item
0,2014-01-01,1249in804,citrus fruit
1,2014-01-01,1249in804,coffee


In [18]:
df = df.iloc[:, 1:].drop_duplicates()
len(df)

34766

In [19]:
df["purchase"] = True
df_pivot = df.pivot_table(index="ID", columns="Item", values="purchase", aggfunc="max", fill_value=False)
df_pivot.head(2)

ID,Instant food products,UHT-milk,abrasive cleaner,artif. sweetener,baby cosmetics,bags,baking powder,bathroom cleaner,beef,berries,...,turkey,vinegar,waffles,whipped/sour cream,whisky,white bread,white wine,whole milk,yogurt,zwieback
1000ol738,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,True,False
1001sf480,False,False,False,False,False,False,False,False,True,False,...,False,False,False,True,False,True,False,True,False,False


In [20]:
item_sets = apriori(df = df_pivot, min_support=0.005, use_colnames=True, max_len = 3)
item_sets.head(2)

Unnamed: 0,support,itemsets
0,0.015393,(Instant food products)
1,0.078502,(UHT-milk)


In [21]:
item_sets.tail(2)

Unnamed: 0,support,itemsets
7213,0.016419,"(white bread, whole milk, yogurt)"
7214,0.009749,"(whole milk, white wine, yogurt)"


In [22]:
rules = association_rules(item_sets, metric = 'confidence',
                             min_threshold=0.005)
rules.head(2)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(Instant food products),(rolls/buns),0.015393,0.349666,0.005387,0.35,1.000954,5e-06,1.000513
1,(rolls/buns),(Instant food products),0.349666,0.015393,0.005387,0.015407,1.000954,5e-06,1.000015


In [23]:
rules_sub = rules.loc[rules["support"] >= 0.01, ]
rules_sub = rules_sub.sort_values("lift", ascending = False)
rules_sub.head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
22845,(meat),"(domestic eggs, whole milk)",0.063622,0.070292,0.010262,0.16129,2.294561,0.005789,1.108497
22840,"(domestic eggs, whole milk)",(meat),0.070292,0.063622,0.010262,0.145985,2.294561,0.005789,1.096442
22842,"(whole milk, meat)",(domestic eggs),0.03489,0.133145,0.010262,0.294118,2.208999,0.005616,1.228044
22843,(domestic eggs),"(whole milk, meat)",0.133145,0.03489,0.010262,0.077071,2.208999,0.005616,1.045704
18056,(chocolate),"(whole milk, fruit/vegetable juice)",0.086455,0.06234,0.010775,0.124629,1.999194,0.005385,1.071158
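
To answer the question, collect every item on either side of the top-lift rule and test the candidate items against that set. The rule below is copied from the first row of the output above; the candidate list is made up for illustration, because the original answer options are not included here.

```python
# Top-lift rule from the output above: (meat) -> (domestic eggs, whole milk)
top_rule = {
    "antecedents": {"meat"},
    "consequents": {"domestic eggs", "whole milk"},
}
rule_items = top_rule["antecedents"] | top_rule["consequents"]

# Hypothetical answer options (not from the source)
candidates = ["meat", "whole milk", "domestic eggs", "chocolate"]
unrelated = [item for item in candidates if item not in rule_items]
print(unrelated)
```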


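### 3. Using only the top 30 items by sales, run the association analysis with the minimum support and minimum confidence set to 0.005. Among the rules with a support of at least 3%, what is the highest lift?
- association_rules_mart.csv
- Sales are measured by count, with each row counted as one unit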

In [24]:
df = pd.read_csv("association_rules_mart.csv")
df.head(2)

Unnamed: 0,Date,ID,Item
0,2014-01-01,1249in804,citrus fruit
1,2014-01-01,1249in804,coffee


In [27]:
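# Name the columns explicitly so the result is the same across pandas versions
df_item_cnt = df["Item"].value_counts().rename_axis("index").reset_index(name="Item")
df_item_cnt = df_item_cnt.sort_values("Item", ascending=False)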
df_item_cnt.head()

Unnamed: 0,index,Item
0,whole milk,2570
1,other vegetables,1951
2,rolls/buns,1778
3,soda,1558
4,yogurt,1373


In [29]:
df_item_cnt = df_item_cnt.iloc[:30, ] # top 30 items
df_item_cnt.head(2)

Unnamed: 0,index,Item
0,whole milk,2570
1,other vegetables,1951


In [30]:
df_sub = df.loc[df["Item"].isin(df_item_cnt["index"]), ] # keep only the top 30 items
df_sub.head()

Unnamed: 0,Date,ID,Item
0,2014-01-01,1249in804,citrus fruit
1,2014-01-01,1249in804,coffee
2,2014-01-01,1381ht273,curd
3,2014-01-01,1381ht273,soda
4,2014-01-01,1440kn258,other vegetables
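
The count-then-filter step can also be sketched without pandas. The example below counts rows per item with `collections.Counter`, keeps the top N items, and filters the rows; the toy rows and the choice of `top_n = 2` are assumptions. Duplicate rows are deliberately kept, since each row counts as one unit sold.

```python
from collections import Counter

# Hypothetical (ID, Item) rows; the repeated ("c1", "milk") row counts twice
rows = [
    ("c1", "milk"), ("c1", "milk"),
    ("c2", "bread"), ("c2", "milk"),
    ("c3", "soda"), ("c3", "bread"),
    ("c4", "curd"),
]

counts = Counter(item for _, item in rows)
top_n = 2
top_items = {item for item, _ in counts.most_common(top_n)}

# Keep only the rows whose item is among the top sellers
filtered = [row for row in rows if row[1] in top_items]
print(counts.most_common(top_n), len(filtered))
```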


In [33]:
len(df_sub)

25979

In [47]:
df_sub = df_sub.assign(purchase=True)  # assign returns a new frame, which avoids the SettingWithCopyWarning
df_pivot = pd.pivot_table(data=df_sub, index="ID", columns="Item", values="purchase", aggfunc="max", fill_value=False)
df_pivot.head(2)

ID,beef,bottled beer,bottled water,brown bread,butter,canned beer,chicken,citrus fruit,coffee,curd,...,rolls/buns,root vegetables,sausage,shopping bags,soda,tropical fruit,whipped/sour cream,white bread,whole milk,yogurt
1000ol738,False,False,False,False,False,True,False,False,False,False,...,False,False,True,False,True,False,False,False,True,True
1001sf480,True,False,False,False,False,False,False,False,False,True,...,True,False,True,False,True,False,True,True,True,False


In [37]:
item_sets = apriori(df = df_pivot, min_support=0.005, use_colnames=True, max_len = 3)
item_sets.head(2)

Unnamed: 0,support,itemsets
0,0.120538,(beef)
1,0.160114,(bottled beer)


In [42]:
rules = association_rules(item_sets, 
                          metric = 'confidence', min_threshold=0.005)
rules.head(2)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(bottled beer),(beef),0.160114,0.120538,0.020952,0.130856,1.085601,0.001652,1.011872
1,(beef),(bottled beer),0.120538,0.160114,0.020952,0.17382,1.085601,0.001652,1.01659


In [46]:
rules_sub = rules.loc[rules["support"] >= 0.03, ]  # "at least 3%", as the question asks
rules_sub = rules_sub.sort_values("lift", ascending = False)
rules_sub.head(2)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
16066,(sausage),"(yogurt, rolls/buns)",0.207708,0.112261,0.035954,0.173101,1.541954,0.012637,1.073576
16063,"(yogurt, rolls/buns)",(sausage),0.112261,0.207708,0.035954,0.320276,1.541954,0.012637,1.165609
