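# Session 09 Recommendation: Association Rule Analysis

## 01 Overview of Association Rule Analysis

### Characteristics of Association Rule Analysis

- Used to find useful rules in the relationships between products or services
- Applied mainly in retail, where it is also known as Market Basket Analysis
- Has difficulty accounting for business-critical factors, and is computationally expensive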

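### Key Evaluation Metrics

- Support: the proportion of baskets in which products X and Y are purchased together; indicates the importance of the rule
- Confidence: the proportion of X purchases that also include Y (a conditional probability); indicates the reliability of the rule
- Lift: the confidence of X -> Y divided by the baseline proportion of baskets containing Y; indicates how strongly the items are correlated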
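
As a quick illustration of the three metrics, the sketch below computes them by hand for one rule X -> Y on five made-up baskets. The baskets and the helper names (`support`, `confidence`, `lift`) are hypothetical; they are not part of the course data or of mlxtend.

```python
# Toy baskets (hypothetical data, for illustration only)
baskets = [
    {"milk", "yogurt"},
    {"milk", "bread"},
    {"milk", "yogurt", "bread"},
    {"bread"},
    {"yogurt"},
]

def support(items):
    """Share of baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= b for b in baskets) / len(baskets)

def confidence(x, y):
    """P(Y | X): support of X and Y together divided by support of X."""
    return support(set(x) | set(y)) / support(x)

def lift(x, y):
    """Confidence of X -> Y relative to the baseline support of Y."""
    return confidence(x, y) / support(y)

x, y = {"milk"}, {"yogurt"}
print(support(x | y))    # 2 of the 5 baskets contain both items
print(confidence(x, y))  # 2 of the 3 milk baskets also contain yogurt
print(lift(x, y))        # above 1: the pair co-occurs more than chance predicts
```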

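### Interpreting Lift

- Lift > 1: positive correlation between the items (complementary goods)
- Lift = 1: the items are mutually independent
- Lift < 1: negative correlation between the items (substitute goods)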
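
The three cases follow from the definition of lift. As a sketch, writing $P(X)$ for the proportion of baskets containing X and $P(X \cap Y)$ for the proportion containing both X and Y (this notation is an assumption, not taken from the course material):

$$
\mathrm{Lift}(X \rightarrow Y) = \frac{\mathrm{Confidence}(X \rightarrow Y)}{\mathrm{Support}(Y)} = \frac{P(X \cap Y) / P(X)}{P(Y)} = \frac{P(X \cap Y)}{P(X)\,P(Y)}
$$

If X and Y are purchased independently, then $P(X \cap Y) = P(X)\,P(Y)$ and the lift equals 1. A lift above 1 means the pair appears together more often than independence would predict, and a lift below 1 means it appears together less often.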


## 02 Data

### Product Purchase Data - association_rules_mart.csv

- 40,000 product purchase records from anonymized customers


## 03 Key Functions and Methods

### mlxtend - apriori()

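- The mlxtend function that finds frequent itemsets, i.e. item combinations and their support
- The input dataset must be one-hot encoded (OHE, dummy variables), with one column per purchased item
- The min_support and max_len arguments set the minimum support and the maximum number of items in an itemset
- Setting use_colnames to True is recommended, so that itemsets show item names rather than column indices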
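
To make the roles of min_support and max_len concrete, here is a minimal pure-Python sketch of a level-wise frequent-itemset search on a toy one-hot table. It is a simplified illustration of the idea, not mlxtend's implementation; the data and the function name `frequent_itemsets` are hypothetical.

```python
from itertools import combinations

# Toy one-hot table: one row per customer, one key per item (hypothetical data)
rows = [
    {"milk": True,  "yogurt": True,  "bread": False},
    {"milk": True,  "yogurt": False, "bread": True},
    {"milk": True,  "yogurt": True,  "bread": True},
    {"milk": False, "yogurt": False, "bread": True},
    {"milk": False, "yogurt": True,  "bread": False},
]

def frequent_itemsets(rows, min_support, max_len):
    """Return {itemset: support} for itemsets with at most max_len items
    whose support is at least min_support."""
    result = {}
    current = [frozenset([item]) for item in sorted(rows[0])]  # size-1 candidates
    size = 1
    while current and size <= max_len:
        kept = []
        for cand in current:
            # Support = share of rows in which every item of the candidate is True
            sup = sum(all(r[i] for i in cand) for r in rows) / len(rows)
            if sup >= min_support:
                result[cand] = sup
                kept.append(cand)
        # A frequent itemset of the next size can only be a union of two
        # frequent itemsets of the current size, so only those are tried next
        size += 1
        current = list({a | b for a, b in combinations(kept, 2) if len(a | b) == size})
    return result

freq = frequent_itemsets(rows, min_support=0.4, max_len=2)
for itemset, sup in sorted(freq.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)
```

Raising min_support or lowering max_len shrinks the set of candidates that has to be counted, which is the main way to control the computational cost.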

In [1]:
!pip install mlxtend

Defaulting to user installation because normal site-packages is not writeable
Collecting mlxtend
  Downloading mlxtend-0.23.1-py3-none-any.whl (1.4 MB)
     ---------------------------------------- 1.4/1.4 MB 3.0 MB/s eta 0:00:00
Installing collected packages: mlxtend
Successfully installed mlxtend-0.23.1


In [2]:
import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

In [3]:
df = pd.read_csv("실습파일/association_rules_mart.csv")
df.head()

Unnamed: 0,Date,ID,Item
0,2014-01-01,1249in804,citrus fruit
1,2014-01-01,1249in804,coffee
2,2014-01-01,1381ht273,curd
3,2014-01-01,1381ht273,soda
4,2014-01-01,1440kn258,other vegetables


In [4]:
df["purchase"] = True

In [5]:
df_pivot = df.pivot_table(index = "ID", columns = "Item", values = "purchase",
                          aggfunc = "max", fill_value = False)
df_pivot.head()

ID,Instant food products,UHT-milk,abrasive cleaner,artif. sweetener,baby cosmetics,bags,baking powder,bathroom cleaner,beef,berries,...,turkey,vinegar,waffles,whipped/sour cream,whisky,white bread,white wine,whole milk,yogurt,zwieback
1000ol738,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,True,False
1001sf480,False,False,False,False,False,False,False,False,True,False,...,False,False,False,True,False,True,False,True,False,False
1002nj599,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False
1003cq947,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1004jh583,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False


In [6]:
item_freq = apriori(df_pivot, min_support = 0.005, use_colnames = True)
item_freq.head()

Unnamed: 0,support,itemsets
0,0.015393,(Instant food products)
1,0.078502,(UHT-milk)
2,0.005644,(abrasive cleaner)
3,0.00744,(artif. sweetener)
4,0.031042,(baking powder)


### mlxtend - association_rules()

- The mlxtend function that derives association rules from the frequent itemsets
- metric sets the metric used for filtering, and min_threshold sets its cutoff value
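
The sketch below shows, in plain Python, how rules can be derived from frequent itemsets and then filtered by a metric and a threshold. It is a simplified illustration rather than mlxtend's implementation; the support values and the function name `make_rules` are hypothetical.

```python
from itertools import combinations

# Frequent itemsets with their support (hypothetical values)
freq = {
    frozenset({"milk"}): 0.6,
    frozenset({"yogurt"}): 0.6,
    frozenset({"bread"}): 0.8,
    frozenset({"milk", "yogurt"}): 0.4,
    frozenset({"milk", "bread"}): 0.4,
}

def make_rules(freq, metric, min_threshold):
    """Split each itemset into antecedent -> consequent and keep the rules
    whose chosen metric is at least min_threshold."""
    rules = []
    for itemset, sup in freq.items():
        # Single-item sets yield no rules because range(1, 1) is empty
        for k in range(1, len(itemset)):
            for ante in map(frozenset, combinations(sorted(itemset), k)):
                cons = itemset - ante
                conf = sup / freq[ante]
                scores = {"support": sup, "confidence": conf, "lift": conf / freq[cons]}
                if scores[metric] >= min_threshold:
                    rules.append((set(ante), set(cons), scores))
    return rules

rules = make_rules(freq, metric="lift", min_threshold=1.1)
for ante, cons, scores in rules:
    print(ante, "->", cons, round(scores["lift"], 3))
```

Each pair of items produces two rules, one per direction. Both directions share the same support and lift, but their confidence differs because the antecedent differs.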

In [7]:
df_rules = association_rules(item_freq, metric = "lift",
                             min_threshold = 1.5)
df_rules.head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,(Instant food products),(root vegetables),0.015393,0.230631,0.006927,0.45,1.951168,0.003377,1.398853,0.495107
1,(root vegetables),(Instant food products),0.230631,0.015393,0.006927,0.030033,1.951168,0.003377,1.015094,0.633619
2,(Instant food products),(soda),0.015393,0.313494,0.007953,0.516667,1.648091,0.003127,1.420357,0.399385
3,(soda),(Instant food products),0.313494,0.015393,0.007953,0.025368,1.648091,0.003127,1.010235,0.57281
4,(candy),(UHT-milk),0.053874,0.078502,0.00744,0.138095,1.759135,0.003211,1.069142,0.456111


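## Q1 With the minimum support and minimum confidence both set to 0.005, how many rules from the association analysis have a support of at least 0.1?
1) Remove duplicate rows beforehand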

In [9]:
Q1 = pd.read_csv("실습파일/association_rules_mart.csv")
Q1["purchase"] = True
Q1.head()

Unnamed: 0,Date,ID,Item,purchase
0,2014-01-01,1249in804,citrus fruit,True
1,2014-01-01,1249in804,coffee,True
2,2014-01-01,1381ht273,curd,True
3,2014-01-01,1381ht273,soda,True
4,2014-01-01,1440kn258,other vegetables,True


In [10]:
len(Q1)

40000

In [11]:
Q1 = Q1.iloc[:, 1:].drop_duplicates()
len(Q1)

34766

In [12]:
Q1_pivot = Q1.pivot_table(index = "ID", columns = "Item", values = "purchase",
                          aggfunc = "max", fill_value = False)
Q1_pivot.head()

ID,Instant food products,UHT-milk,abrasive cleaner,artif. sweetener,baby cosmetics,bags,baking powder,bathroom cleaner,beef,berries,...,turkey,vinegar,waffles,whipped/sour cream,whisky,white bread,white wine,whole milk,yogurt,zwieback
1000ol738,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,True,False
1001sf480,False,False,False,False,False,False,False,False,True,False,...,False,False,False,True,False,True,False,True,False,False
1002nj599,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False
1003cq947,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1004jh583,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False


In [13]:
item_freq = apriori(Q1_pivot, min_support = 0.005, use_colnames = True)
item_freq.head()

Unnamed: 0,support,itemsets
0,0.015393,(Instant food products)
1,0.078502,(UHT-milk)
2,0.005644,(abrasive cleaner)
3,0.00744,(artif. sweetener)
4,0.031042,(baking powder)


In [14]:
Q1_rules = association_rules(item_freq, metric = "confidence",
                             min_threshold = 0.005)
Q1_rules.head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,(Instant food products),(rolls/buns),0.015393,0.349666,0.005387,0.35,1.000954,5e-06,1.000513,0.000968
1,(rolls/buns),(Instant food products),0.349666,0.015393,0.005387,0.015407,1.000954,5e-06,1.000015,0.001465
2,(Instant food products),(root vegetables),0.015393,0.230631,0.006927,0.45,1.951168,0.003377,1.398853,0.495107
3,(root vegetables),(Instant food products),0.230631,0.015393,0.006927,0.030033,1.951168,0.003377,1.015094,0.633619
4,(Instant food products),(soda),0.015393,0.313494,0.007953,0.516667,1.648091,0.003127,1.420357,0.399385


In [15]:
rules_sub = Q1_rules.loc[Q1_rules["support"] >= 0.1]  # "at least 0.1", as asked in Q1
rules_sub = rules_sub.sort_values("lift", ascending = False)
rules_sub.head()


Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
4193,(whole milk),(yogurt),0.458184,0.282966,0.15059,0.328667,1.16151,0.02094,1.068076,0.25664
4192,(yogurt),(whole milk),0.282966,0.458184,0.15059,0.532185,1.16151,0.02094,1.158185,0.193926
759,(whole milk),(bottled water),0.458184,0.213699,0.112365,0.245241,1.147597,0.014452,1.04179,0.237376
758,(bottled water),(whole milk),0.213699,0.458184,0.112365,0.52581,1.147597,0.014452,1.142615,0.163569
3969,(whole milk),(sausage),0.458184,0.206003,0.106978,0.233483,1.133394,0.012591,1.03585,0.217222


In [16]:
len(rules_sub)

26