<font color = "#CC3D3D"><b>
# (DW Practice #3) Market Basket Analysis

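- Market basket analysis is a technique for finding associations, or rules, among the products customers buy, based on transaction records.  
  - (How an association rule is written) A customer who buys `item A` and `item B` also buys `item C`: *(item A) & (item B) => (item C)*
- It is mainly used for cross-selling, product placement, fraud detection, and product catalog design.  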
<img align='left' src='https://blog.rsquaredacademy.com/img/mba_steps.png' style='width: 80%; height: auto;'>

- Market basket analysis produces a large number of association rules, so evaluation criteria such as the ones below are needed to pick out the useful rules.  
<img align='left' src='http://drive.google.com/uc?export=view&id=191LWlu63r0T3GIv-FX-x7Ds4bezBfxfU' style='width: 80%; height: auto;'>

Confidence: the probability that the consequent appears, given that the antecedent appears.
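
To make the evaluation criteria concrete, here is a minimal sketch that computes support, confidence, and lift for one rule by hand. The toy baskets and the helper names `support` and `rule_metrics` are hypothetical and only serve the illustration; the formulas are the standard definitions, not code from this notebook.

```python
# Toy transactions (hypothetical): each set holds the items one customer bought.
baskets = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"C"},
    {"A", "B", "C"},
]

def support(itemset, baskets):
    """Share of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def rule_metrics(antecedent, consequent, baskets):
    """Support, confidence and lift of the rule antecedent => consequent."""
    both = support(set(antecedent) | set(consequent), baskets)
    confidence = both / support(antecedent, baskets)  # P(consequent | antecedent)
    lift = confidence / support(consequent, baskets)  # > 1 means positive association
    return {"support": both, "confidence": confidence, "lift": lift}

# Rule: (A) & (B) => (C)
metrics = rule_metrics({"A", "B"}, {"C"}, baskets)
print(metrics)
```

In this toy data the lift comes out below 1, meaning that buying A and B together makes C slightly less likely than its baseline frequency, even though the confidence looks moderately high.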

#### Data preparation

In [1]:
import pandas as pd
import numpy as np

In [2]:
# read raw data
cs = pd.read_csv('L사_고객정보.csv')
gd = pd.read_csv('L사_상품정보.csv')
tr = pd.read_csv('L사_거래정보.csv')

# merge data 
gd.pd_c = gd.pd_c.astype(str) 
df = pd.merge(tr, cs).merge(gd, on='pd_c')
df.de_dt = pd.to_datetime(df.de_dt.astype(str))  # astype('datetime64') without a unit fails on recent pandas

In [4]:
# transform data
store_data = pd.pivot_table(df, index='clnt_id', columns='clac_nm2', values='buy_ct', aggfunc=np.size, fill_value=0)\
            .applymap(lambda x: 1 if x>=1 else 0).reset_index() # 1 if the customer has at least one purchase in the category, otherwise 0
transactions = store_data.iloc[:,1:]
transactions

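# apply operates on a Series (one row or column at a time); applymap operates element-wise on every value in the DataFrame.
# (In pandas 2.1 and later, applymap is deprecated in favor of DataFrame.map.)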

clac_nm2,Arts / Crafts Supplies,Audios,Bikes,Biscuits,Body Care,Boy's Toys,Breads,Business Paper Products,Cameras / Camcorders,Camping,...,Women's Lower Bodywear / Bottoms,Women's Outwear,Women's Socks and Hosiery,Women's Special Materials Clothing,Women's Special Use Clothing,Women's Sport Shoes,Women's Underwear,Women's Upper Bodywear / Tops,Writing Pads,Writing Supplies
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,1,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10093,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10094,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
10095,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10096,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
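
The pivot step above turns the purchase log into one row per customer (`clnt_id`) and one 0/1 column per category (`clac_nm2`). The sketch below shows the same idea on a hypothetical six-row log; the values are made up, and `pd.crosstab` is used here as an alternative way to count rows, not as the notebook's own method.

```python
import pandas as pd

# Hypothetical mini purchase log using the same column names as `df` above.
toy = pd.DataFrame(
    {
        "clnt_id": [1, 1, 1, 2, 2, 3],
        "clac_nm2": ["Snacks", "Biscuits", "Snacks", "Snacks", "Breads", "Biscuits"],
        "buy_ct": [2, 1, 1, 3, 1, 1],
    }
)

# Count purchase rows per (customer, category), then flag presence as 1/0.
counts = pd.crosstab(toy["clnt_id"], toy["clac_nm2"])
basket = (counts > 0).astype(int)
print(basket)
```

Because the index is the customer rather than a single store visit, each "basket" here is everything a customer bought over the whole period.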


In [5]:
transactions.sum().sort_values(ascending=False).head(20)

clac_nm2
Instant Noodles        5231
Snacks                 5186
Tofu / Bean Sprouts    4749
Leaf Vegetables        4361
Fruit Vegetables       4251
Biscuits               4034
Retort Pouches         3625
Root Vegetables        3193
Mushrooms              3041
Sauces                 3008
Western Vegetables     2692
Instant Cup Noodles    2519
Mature Sauces          2399
Seasonings             2334
Candies                2226
Dried Noodles          2009
Cooking Oils           1924
Pies                   1830
Cereals                1680
Restaurants            1675
dtype: int64

#### Frequent itemset extraction - Apriori

In [6]:
pip install --upgrade pip

Requirement already up-to-date: pip in /Users/seongyoon/programming/program/anaconda3/lib/python3.7/site-packages (20.1.1)
Note: you may need to restart the kernel to use updated packages.


To run Apriori, a widely used association rule mining algorithm, the mlxtend package must be installed:
%pip install mlxtend

In [8]:
from mlxtend.frequent_patterns import apriori, association_rules

In [9]:
# Extract only the frequent itemsets with support of at least 20% (min_support=0.2) and show them in descending order of support
freq_items = apriori(transactions, min_support=0.2, use_colnames=True)
freq_items.sort_values(by='support', ascending=False)

Unnamed: 0,support,itemsets
4,0.518023,(Instant Noodles)
12,0.513567,(Snacks)
13,0.470291,(Tofu / Bean Sprouts)
5,0.431868,(Leaf Vegetables)
2,0.420974,(Fruit Vegetables)
...,...,...
85,0.202119,"(Snacks, Tofu / Bean Sprouts, Biscuits, Instan..."
56,0.201921,"(Instant Noodles, Fruit Vegetables, Biscuits)"
45,0.201822,"(Western Vegetables, Leaf Vegetables)"
79,0.201327,"(Instant Noodles, Root Vegetables, Tofu / Bean..."
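
To show what `min_support` does, here is a simplified level-wise search in the spirit of Apriori, written in plain Python. It is an illustrative sketch, not mlxtend's implementation: the function name `frequent_itemsets` and the toy baskets are hypothetical, and the candidate generation is deliberately naive.

```python
from itertools import combinations

def frequent_itemsets(baskets, min_support):
    """Level-wise search: only extend itemsets that are already frequent."""
    n = len(baskets)
    items = sorted({item for basket in baskets for item in basket})
    result = {}
    current = [frozenset([item]) for item in items]
    while current:
        # Keep the candidates whose support reaches the threshold.
        frequent = {}
        for candidate in current:
            support = sum(candidate <= basket for basket in baskets) / n
            if support >= min_support:
                frequent[candidate] = support
        result.update(frequent)
        # Join frequent k-itemsets into (k+1)-item candidates.
        size = len(next(iter(frequent))) + 1 if frequent else 0
        current = list(
            {a | b for a, b in combinations(frequent, 2) if len(a | b) == size}
        )
    return result

# Hypothetical baskets; category names borrowed from the data above.
baskets = [
    {"Snacks", "Biscuits"},
    {"Snacks", "Biscuits", "Breads"},
    {"Snacks"},
    {"Breads"},
]
found = frequent_itemsets(baskets, min_support=0.5)
for itemset, support in sorted(found.items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), support)
```

Any itemset containing an infrequent subset is never generated as a candidate, which is the pruning idea that keeps the search tractable on real data.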


In [10]:
freq_items['length'] = freq_items['itemsets'].apply(lambda x: len(x))
freq_items.query('length >= 2')

Unnamed: 0,support,itemsets,length
15,0.245593,"(Fruit Vegetables, Biscuits)",2
16,0.300753,"(Instant Noodles, Biscuits)",2
17,0.247178,"(Biscuits, Leaf Vegetables)",2
18,0.220539,"(Retort Pouches, Biscuits)",2
19,0.334819,"(Snacks, Biscuits)",2
...,...,...,...
86,0.200733,"(Instant Noodles, Fruit Vegetables, Leaf Veget...",4
87,0.217271,"(Instant Noodles, Fruit Vegetables, Leaf Veget...",4
88,0.206278,"(Instant Noodles, Fruit Vegetables, Tofu / Bea...",4
89,0.211032,"(Snacks, Tofu / Bean Sprouts, Leaf Vegetables,...",4


#### Deriving association rules

In [11]:
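# Show only the association rules with confidence of at least 85%
# (association_rules keeps rules with confidence >= 0.8 by default; the query narrows this to 0.85)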
rules = association_rules(freq_items, metric='confidence')
rules.query('confidence >= 0.85')

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
6,"(Fruit Vegetables, Biscuits)",(Snacks),0.245593,0.513567,0.216181,0.880242,1.713977,0.090053,4.061797
9,"(Instant Noodles, Biscuits)",(Snacks),0.300753,0.513567,0.263319,0.875535,1.704812,0.108863,3.908193
11,"(Leaf Vegetables, Biscuits)",(Snacks),0.247178,0.513567,0.219746,0.889022,1.731074,0.092804,4.383165
13,"(Biscuits, Tofu / Bean Sprouts)",(Snacks),0.274807,0.513567,0.240741,0.876036,1.705787,0.099609,3.923987
17,"(Mushrooms, Fruit Vegetables)",(Leaf Vegetables),0.235591,0.431868,0.202416,0.859185,1.989462,0.100672,4.034587
18,"(Fruit Vegetables, Root Vegetables)",(Leaf Vegetables),0.245692,0.431868,0.21331,0.868198,2.010334,0.107203,4.310508
24,"(Mushrooms, Fruit Vegetables)",(Tofu / Bean Sprouts),0.235591,0.470291,0.207467,0.880622,1.872504,0.09667,4.437244
25,"(Fruit Vegetables, Root Vegetables)",(Tofu / Bean Sprouts),0.245692,0.470291,0.213211,0.867795,1.84523,0.097664,4.006731
33,"(Instant Noodles, Root Vegetables)",(Tofu / Bean Sprouts),0.235195,0.470291,0.201327,0.856,1.820149,0.090717,3.678534
36,"(Mushrooms, Leaf Vegetables)",(Tofu / Bean Sprouts),0.239156,0.470291,0.206278,0.862526,1.834025,0.093805,3.853153
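
As a sanity check on how the columns relate, the sketch below recomputes confidence, lift, leverage, and conviction for the first rule from its three support values. The inputs are copied from the table above; the formulas are the standard definitions, and the results agree with the table up to the rounding of the inputs.

```python
# Values copied from the first rule above: (Fruit Vegetables, Biscuits) => (Snacks).
antecedent_support = 0.245593
consequent_support = 0.513567
support = 0.216181

# Share of antecedent buyers who also bought the consequent.
confidence = support / antecedent_support
# Confidence relative to the consequent's baseline frequency.
lift = confidence / consequent_support
# Observed joint support minus the support expected under independence.
leverage = support - antecedent_support * consequent_support
# How much more often the rule would fail if the two sides were independent.
conviction = (1 - consequent_support) / (1 - confidence)

print(confidence, lift, leverage, conviction)
```

A lift of about 1.7 says that customers who bought both Fruit Vegetables and Biscuits were roughly 1.7 times as likely to buy Snacks as customers overall.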


<font color = "#CC3D3D"><b>
# End