# 🛒 **연관 분석 (Association Rule Analysis) 과제**

### <span style="color:black; background-color:#F5F5F5;"> **Q1. 연관 규칙 {우유} → {쿠키}가 도출되었을 때, 다음 용어들이 각각 무엇을 의미하는지 설명하시오.**
#### <span style="color:black; background-color:#F5F5F5;">**① 지지도(support) ② 신뢰도(confidence) ③ 향상도(lift)** </span>

답:
- 지지도 - 전체 거래에서 우유와 쿠키가 함께 구매된 거래의 비율
- 신뢰도 - 우유가 포함된 거래 중에서 쿠키가 함께 구매된 거래의 비율
- 향상도 - 신뢰도를 지지도로 나눈 것, 우유를 산 고객이 쿠키를 살 상대적 확률


### <span style="color:black; background-color:#F5F5F5;"> **Q2. Apriori 알고리즘이 처리해야 할 후보 항목 수가 기하급수적으로 증가하는 이유와, FP-Growth가 이를 어떻게 해결하는지 설명하시오.**

답:
Apriori알고리즘은 처리할 후보 항목 수의 모든 부분집합을 상위부터 구해가면서 작업을 해야 하기 때문에, 후보 항목 수가 증가하면 처리해야할 케이스가 지수적으로 증가하기 때문이다.

FP-Growth는 먼저 데이터를 한번에 훑어 모든 항목 중 최소지지도 이상의 후보만 남긴 후, 빈도 순으로 정렬한 뒤 트리를 생성하는 방식으로 연관 규칙을 도출한다. 후보 항목을 미리 조합하지 않고 작업을 한다

# <span style="color:black; background-color:#F5F5F5;"> 💸 **연관 분석을 활용한 잉마트(Ing-Mart) 고객 장바구니 패턴 분석 및 비즈니스 전략 수립** </span>

<strong>죽지도 않고 다시 돌아온 잉마트..! 🤣🫥😫🙃<br>
이번 연관 분석 심화 세션의 과제는 잉마트의 고객 장바구니 분석과 전략 수립입니다~ <strong>



<span style="color:black; background-color:#E6E6FA; padding:2px 4px; border-radius:4px">
<strong> 🤓 지난 심화 세션에서 배운 개념과 실습 내용을 바탕으로 아래 빈칸을 채워주시고, 해당 장바구니 결과를 분석하여 이에 적합한 전략을 제시해주시면 됩니다! <strong>
</span>

## **1️⃣ 데이터 불러오기 및 전처리**

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
from mlxtend.frequent_patterns import fpgrowth
import warnings
warnings.simplefilter(action='ignore', category=DeprecationWarning)

In [5]:
# !pip install mlxtend

- 데이터 불러오기
  - 1차 인사이콘 때 활용하셨던 데이터 원본을 활용해주시면 됩니다!
  - 알맞게 경로 지정해주세요~

In [7]:
transaction_data = pd.read_csv('transaction_data.csv')
product = pd.read_csv('product.csv')

In [11]:
transaction_data.head()

Unnamed: 0,Household_ID,Basket_ID,Product_ID,Store_ID,Day,Quantity,Sales_Value,Trans_time,Week_no,Disc(retail),Disc(coupon),Disc(coupon_match)
0,1803,30780785930,1065887,338,252,1,8.99,1419,37,-1.0,0.0,0.0
1,2299,33768622588,1073244,446,456,1,1.0,2030,66,-1.59,0.0,0.0
2,158,30202616809,7025114,343,225,1,1.5,1246,33,-0.5,-1.0,0.0
3,2347,42076926172,1064299,438,695,1,2.5,1430,100,-2.89,0.0,0.0
4,1430,31625201009,1040197,31742,312,1,2.29,1423,45,0.0,0.0,0.0


- 이상치 처리
  - 예시로 제시해드리지만 추가적으로 필요한 부분은 전처리 처리해주세요!

In [13]:
# Quantity 열에 대해 Z-score를 계산한 뒤, 절댓값을 취해 새로운 열 'z_score'에 저장
transaction_data["z_score"] = np.abs(stats.zscore(transaction_data["Quantity"]))

# Z-score가 3을 초과하는 이상치(즉, 평균에서 3표준편차 이상 벗어난 값)를 추출
outliers_zscore = transaction_data[transaction_data["z_score"] > 3]

- 고객 - 상품 행렬 생성

In [15]:
# 고객-상품 pivot_table 생성 (행: 고객, 열: 상품, 값: 총 구매금액)
user_item_matrix = transaction_data.pivot_table(
    index='Household_ID',     # 가구 ID 기준
    columns='Product_ID',     # 상품 ID 기준
    values='Sales_Value',     # 구매 금액
    aggfunc='sum',            # 상품별 총 구매금액
    fill_value=0              # 구매 이력 없으면 0
)
user_item_matrix

Product_ID,25671,26081,26093,26190,26355,26426,26540,26601,26636,26691,...,18273019,18273051,18273115,18273133,18292005,18293142,18293439,18293696,18294080,18316298
Household_ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2496,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2497,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2498,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2499,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [17]:
# 행렬의 크기 확인 (고객 수 × 상품 수)
user_item_matrix.shape

(2500, 92339)

- 구매가 적은 사용자/상품 필터링

In [19]:
# 필터링 기준 정의
min_product_purchases = 10   # 최소 10명 이상이 구매한 상품만 사용
min_user_purchases = 2       # 최소 2개 이상 상품 구매한 사용자만 사용

# 상품별 구매된 고객 수 계산
product_purchase_count = (user_item_matrix > 0).sum()

# 고객별 구매한 상품 수 계산
user_purchase_count = (user_item_matrix > 0).sum(axis=1)

# 기준을 만족하는 상품과 사용자 필터링
filtered_products = product_purchase_count[product_purchase_count >= min_product_purchases].index
filtered_users = user_purchase_count[user_purchase_count >= min_user_purchases].index

# 필터링된 행렬 추출
filtered_matrix = user_item_matrix.loc[filtered_users, filtered_products]
print(f"\n2. Filtered Matrix Shape: {filtered_matrix.shape}")


2. Filtered Matrix Shape: (2500, 23326)


- 이상치 및 음수 제거한 트랜잭션 데이터 생성

In [21]:
# Z-score 기준으로 이상치 제거 (±3 이상) + 구매 수량이 양수인 데이터만 남김
transaction_data_cleaned = transaction_data[
    (transaction_data['z_score'] < 3) & 
    (transaction_data['z_score'] > -3) & 
    (transaction_data['z_score'] > 0)
]
print(transaction_data_cleaned.shape)

# 데이터 샘플 확인
transaction_data_cleaned.head()

(2573787, 13)


Unnamed: 0,Household_ID,Basket_ID,Product_ID,Store_ID,Day,Quantity,Sales_Value,Trans_time,Week_no,Disc(retail),Disc(coupon),Disc(coupon_match),z_score
0,1803,30780785930,1065887,338,252,1,8.99,1419,37,-1.0,0.0,0.0,0.086202
1,2299,33768622588,1073244,446,456,1,1.0,2030,66,-1.59,0.0,0.0,0.086202
2,158,30202616809,7025114,343,225,1,1.5,1246,33,-0.5,-1.0,0.0,0.086202
3,2347,42076926172,1064299,438,695,1,2.5,1430,100,-2.89,0.0,0.0,0.086202
4,1430,31625201009,1040197,31742,312,1,2.29,1423,45,0.0,0.0,0.0,0.086202


- 정제된 데이터로 다시 사용자-상품 행렬 생성 및 필터링

In [23]:
# 더 정확한 연관 규칙 도출을 위해 이상치 제거 후 재생성
# 정제된 데이터를 기반으로 사용자-상품 매트릭스 다시 생성
user_item_matrix = transaction_data_cleaned.pivot_table(
    index='Household_ID',     # 가구 ID 기준
    columns='Product_ID',     # 상품 ID 기준
    values='Sales_Value',     # 구매 금액
    aggfunc='sum',            # 상품별 총 구매금액
    fill_value=0              # 구매 이력 없으면 0
)

# 필터링 기준 재사용
min_product_purchases = 10  
min_user_purchases = 2     

# 상품/사용자별 구매 횟수 계산
product_purchase_count = (user_item_matrix > 0).sum()
user_purchase_count = (user_item_matrix > 0).sum(axis=1)

# 조건에 맞는 상품과 사용자 필터
filtered_products = product_purchase_count[product_purchase_count >= min_product_purchases].index
filtered_users = user_purchase_count[user_purchase_count >= min_user_purchases].index

# 최종 필터링된 행렬 생성
filtered_matrix = user_item_matrix.loc[filtered_users, filtered_products]
filtered_matrix

Product_ID,27658,34873,43020,43871,59666,138619,197681,201704,215923,244960,...,18005913,18005929,18022252,18055205,18055329,18105264,18106286,18119016,18147612,18203921
Household_ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.19,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2496,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2497,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2498,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2499,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


- 상품 정보 조인하여 제품 타입 단위로 분석 준비

In [25]:
# Product_ID을 기준으로 데이터 product와 inner join하여 Product_type 정보 추가
merged_data = pd.merge(transaction_data_cleaned, product, on='Product_ID', how='inner')
merged_data

Unnamed: 0,Household_ID,Basket_ID,Product_ID,Store_ID,Day,Quantity,Sales_Value,Trans_time,Week_no,Disc(retail),Disc(coupon),Disc(coupon_match),z_score,Manufacturer,Brand,Category,Subcategory,Product_type,Curr_Size_of_Product
0,1803,30780785930,1065887,338,252,1,8.99,1419,37,-1.00,0.0,0.0,0.086202,69,Private,DRUG GM,ANALGESICS,ADULT ANALGESICS,
1,2299,33768622588,1073244,446,456,1,1.00,2030,66,-1.59,0.0,0.0,0.086202,2110,National,GROCERY,ICE CREAM/MILK/SHERBTS,PREMIUM PINTS,PT
2,158,30202616809,7025114,343,225,1,1.50,1246,33,-0.50,-1.0,0.0,0.086202,1838,National,GROCERY,BAKED BREAD/BUNS/ROLLS,MAINSTREAM WHEAT/MULTIGRAIN BR,20 OZ
3,2347,42076926172,1064299,438,695,1,2.50,1430,100,-2.89,0.0,0.0,0.086202,2193,National,GROCERY,ICE CREAM/MILK/SHERBTS,PREMIUM,48 OZ
4,1430,31625201009,1040197,31742,312,1,2.29,1423,45,0.00,0.0,0.0,0.086202,869,National,GROCERY,TEAS,TEA BAGS HERBAL & FLAVORED,20 CT
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2573782,225,33655617762,6463742,31642,447,1,2.29,1942,65,-1.70,0.0,0.0,0.086202,69,Private,GROCERY,CHEESE,SHREDDED CHEESE,16 OZ
2573783,266,40097402794,968269,367,551,1,1.99,2255,79,0.00,0.0,0.0,0.086202,1091,National,GROCERY,HOUSEHOLD CLEANG NEEDS,TOILET BOWL MANUAL,24 OZ
2573784,2364,30443976140,2041688,673,230,2,1.50,1757,34,-0.62,0.0,0.0,0.085335,69,Private,GROCERY,MEAT - SHELF STABLE,CHILI: CANNED,15 OZ
2573785,1048,31145396229,1082185,436,280,1,0.74,1445,41,0.00,0.0,0.0,0.086202,2,National,PRODUCE,TROPICAL FRUIT,BANANAS,40 LB


- 장바구니(Basket_ID)별로 구매한 상품타입 목록 정리

In [27]:
# 각 거래(Basket_ID)마다 구매한 상품 유형(Product_type)의 리스트 생성
transactions = merged_data.groupby('Basket_ID')['Product_type'].unique().reset_index()
transactions

Unnamed: 0,Basket_ID,Product_type
0,26984851472,"[POTATOES RUSSET (BULK&BAG), CELERY, ONIONS SW..."
1,26984851516,"[HAMBURGER BUNS, TRAY PACK/CHOC CHIP COOKIES, ..."
2,26984896261,"[GRANOLA BARS, EGGS - X-LARGE, LINKS - RAW, SN..."
3,26984905972,"[RAMEN NOODLES/RAMEN CUPS, MAINSTREAM WHITE BR..."
4,26984945254,"[SEASONAL CANDY BOX NON-CHOCOLA, INSIDE FROST ..."
...,...,...
255331,42302712006,"[TORTILLA/NACHO CHIPS, SFT DRNK 2 LITER BTL CA..."
255332,42302712189,"[PLSTC CTLRYTBLCLTHSTTHPKSST, REFRIG DIPS, PAP..."
255333,42302712298,"[BUTTER, HAIR CONDITIONERS AND RINSES, SHAMPOO..."
255334,42305362497,"[SEASONAL MISCELLANEOUS, SFT DRNK 2 LITER BTL ..."


- 트랜잭션 리스트로 변환

In [29]:
transaction_list = transactions['Product_type'].tolist()
transaction_list = [list(item) for item in transaction_list]
print(transaction_list[:5])

[['POTATOES RUSSET (BULK&BAG)', 'CELERY', 'ONIONS SWEET (BULK&BAG)', 'ORGANIC CARROTS', 'BANANAS'], ['HAMBURGER BUNS', 'TRAY PACK/CHOC CHIP COOKIES', 'PEANUT BUTTER', 'SPONGES: BATH HOUSEHOLD', 'GRAHAM CRACKERS'], ['GRANOLA BARS', 'EGGS - X-LARGE', 'LINKS - RAW', 'SNACK CRACKERS', 'GRND/PATTY - ROUND'], ['RAMEN NOODLES/RAMEN CUPS', 'MAINSTREAM WHITE BREAD'], ['SEASONAL CANDY BOX NON-CHOCOLA', 'INSIDE FROST BULBS', 'CHEWING GUM']]


- 트랜잭션 통계

In [31]:
transactions['num_products'] = transactions['Product_type'].apply(len)
average_products_per_order = transactions['num_products'].mean()
max_products_per_order = transactions['num_products'].max()
min_products_per_order = transactions['num_products'].min()

print(f"Average number of products per order: {average_products_per_order}")
print(f"Maximum number of products per order: {max_products_per_order}")
print(f"Minimum number of products per order: {min_products_per_order}")

Average number of products per order: 8.548778863928314
Maximum number of products per order: 129
Minimum number of products per order: 1


<span style="color:black; background-color:#E6E6FA; padding:2px 4px; border-radius:4px">
<strong> 🤓 지금부턴 지난 실습 때 했던 과정의 반복! <strong>
</span>

## **2️⃣ 연관 분석 - TransactionEncoder로 이진 행렬로 변환**

In [33]:
from mlxtend.preprocessing import TransactionEncoder

te = TransactionEncoder()
te_ary = te.fit(transaction_list).transform(transaction_list)  # 학습과 변환을 따로따로!

df_encoded = pd.DataFrame(te_ary, columns=te.columns_).astype(int)
df_encoded.head()

Unnamed: 0,Unnamed: 1,*ATH ACCES:TOWEL BARS/SOAP D,*ATTERIES:CAMERA/FLASH/WATCH,*BOYS/GIRLS MISC TOYS,*GOURMET/UPSCALE,*MISC. LOBBY ITEMS,*PURSES UMBRELLAS,*SCRAPBOOK,*SLEDS-WINTER TOYS,*SPORT NOVELTIES,...,WRITING INSTRUMENTS,XMAS PLUSH,YARDLEY,YEAST: DRY,YELLOW JACKET,YELLOW SUMMER SQUASH,YNG MEN SCREEN PRINT T-SHIRTS,YOGURT,YOGURT MULTI-PACKS,YOGURT NOT MULTI-PACKS
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## **3️⃣ 연관 분석 - Apriori 알고리즘 활용**

- Apriori 알고리즘으로 빈발 항목 집합 도출 (지지도 0.05% 이상)

In [39]:
from mlxtend.frequent_patterns import apriori

filtered_onehot = df_encoded.loc[:, df_encoded.sum(axis=0) > 20]

# apriori로 frequent_itemsets 추출 (최소지지도는 0.005, use_colnames=True, low_memory=True)
frequent_itemsets = apriori(filtered_onehot, min_support=0.005, use_colnames=True, low_memory=True)
frequent_itemsets.head()

Unnamed: 0,support,itemsets
0,0.02895,( )
1,0.011922,(ADULT ANALGESICS)
2,0.029988,(ADULT CEREAL)
3,0.005389,(AIR CARE - CONTINUOUS ACTION)
4,0.008033,(ALKALINE BATTERIES)


In [46]:
# 공집합이 생겨버려 제거해야겠어...

frequent_itemsets = frequent_itemsets[frequent_itemsets['itemsets'].apply(lambda x: len(x) > 0)]
frequent_itemsets.head()

Unnamed: 0,support,itemsets
0,0.02895,( )
1,0.011922,(ADULT ANALGESICS)
2,0.029988,(ADULT CEREAL)
3,0.005389,(AIR CARE - CONTINUOUS ACTION)
4,0.008033,(ALKALINE BATTERIES)


- `association_rules`로 연관 규칙 도출 및 필터링(신뢰도 40% 이상)

In [48]:
num_itemsets = len(frequent_itemsets)

# confidence 기준으로 연관 규칙(rules)을 추출하세요.
# 조건: min_threshold=0.4, metric="***",num_itemsets = num_itemsets
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.4, num_itemsets = num_itemsets)

rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']].head()

Unnamed: 0,antecedents,consequents,support,confidence,lift
0,( ),(FLUID MILK WHITE ONLY),0.011777,0.406791,1.692137
1,(ADULT CEREAL),(BANANAS),0.012102,0.403552,3.397792
2,(ADULT CEREAL),(FLUID MILK WHITE ONLY),0.019128,0.637848,2.653267
3,(ALL FAMILY CEREAL),(FLUID MILK WHITE ONLY),0.026111,0.64892,2.699323
4,(APPLE JUICE & CIDER (OVER 50%),(FLUID MILK WHITE ONLY),0.006873,0.588729,2.448945


In [50]:
print(rules.shape)

(447, 14)


- 연관 분석용 리스트 구조 정리

In [52]:
transaction_list = [list(t) if isinstance(t, (list, np.ndarray)) else [t] for t in transaction_list]
transaction_list = [t.tolist() if isinstance(t, np.ndarray) else list(t) if isinstance(t, list) else [t] for t in transaction_list]

- 불필요한 지표 제거

In [54]:
apriori = rules.drop(columns=[
    "antecedent support", 
    "consequent support", 
    "representativity", 
    "conviction", 
    "zhangs_metric", 
    "jaccard", 
    "certainty", 
    "kulczynski"
])

- 유의미한 규칙 필터링(향상도 1 이상)

In [56]:
apriori = apriori[apriori['lift'] >= 1]

- 결과 확인

In [59]:
apriori

Unnamed: 0,antecedents,consequents,support,confidence,lift,leverage
0,( ),(FLUID MILK WHITE ONLY),0.011777,0.406791,1.692137,0.004817
1,(ADULT CEREAL),(BANANAS),0.012102,0.403552,3.397792,0.008540
2,(ADULT CEREAL),(FLUID MILK WHITE ONLY),0.019128,0.637848,2.653267,0.011919
3,(ALL FAMILY CEREAL),(FLUID MILK WHITE ONLY),0.026111,0.648920,2.699323,0.016438
4,(APPLE JUICE & CIDER (OVER 50%),(FLUID MILK WHITE ONLY),0.006873,0.588729,2.448945,0.004067
...,...,...,...,...,...,...
442,"(KIDS CEREAL, SHREDDED CHEESE)",(MAINSTREAM WHITE BREAD),0.005076,0.442623,4.209848,0.003870
443,"(SOFT DRINKS 12/18&15PK CAN CAR, KIDS CEREAL)",(MAINSTREAM WHITE BREAD),0.005150,0.455648,4.333730,0.003962
444,"(POTATO CHIPS, SHREDDED CHEESE)",(MAINSTREAM WHITE BREAD),0.005604,0.406419,3.865504,0.004155
445,"(SOFT DRINKS 12/18&15PK CAN CAR, SNACK CAKE - ...",(MAINSTREAM WHITE BREAD),0.005440,0.479959,4.564952,0.004248


## **[ 참고 ] 연관 분석 - FP-Growth 알고리즘 활용**

- 코드를 돌릴 때 조심해주세요!

In [61]:
filtered_onehot

Unnamed: 0,Unnamed: 1,ABRASIVES,ACNE MEDICATIONS,ACTIVITY,ADDITIVES/FLUIDS,ADHESIVES/CAULK,ADULT ANALGESICS,ADULT CEREAL,ADULT INCONTINENCE BRIEFS,ADULT INCONTINENCE MISC PRODUC,...,WRAP,WREATHS,WREATHS/TINSEL/GARLAND,WRITING INSTRUMENTS,XMAS PLUSH,YEAST: DRY,YELLOW SUMMER SQUASH,YOGURT,YOGURT MULTI-PACKS,YOGURT NOT MULTI-PACKS
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
255331,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
255332,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
255333,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
255334,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


<span style="color:black; background-color:#E6E6FA; padding:2px 4px; border-radius:4px">
<strong> 🥹 실습 때 다루었던 FP-Growth 알고리즘이 참고 자료가 된 이유는 다음 셀 때문이에요... 돌릴 때 조심해주세요... 30분씩 걸릴 때도 있거든요... <strong>
</span>

<span style="color:black; background-color:#fff5b1; padding:2px 4px; border-radius:4px">
<strong> 🤔 "엥? 근데 FP-Growth 알고리즘이 Apriori 알고리즘보다 계산이 빠르다고 하지 않았나?"<strong>
</span>

<span style="color:black; background-color:#E6E6FA; padding:2px 4px; border-radius:4px">
<strong> 🤓 네! 공부를 열심히 하셨군요!? 맞습니다! 이론상 FP-Growth 알고리즘이 Apriori 알고리즘이 계산이 더 빠릅니다!<strong>
</span>

<span style="color:black; background-color:#fff5b1; padding:2px 4px; border-radius:4px">
<strong> 🤔 엥 그럼 왜...? <strong>
</span>

<span style="color:black; background-color:#E6E6FA; padding:2px 4px; border-radius:4px">
<strong> 🤓 사용한 라이브러리의 차이입니다! 저희는 mlxtend 라이브러리를 사용했습니다! <strong>
</span>

<span style="color:black; background-color:#E6E6FA; padding:2px 4px; border-radius:4px">
<strong> 🤓 mlxtend 라이브러리의 경우 Apriori는 Cython으로 최적화 되어 매우 빠르게 작동하지만, FP-Growth는 순수 Python으로 구현하기 때문에 오히려 느릴 수 있습니다. 이 경우 fpgrowth_py 라이브러리를 활용한다면 더 빠르게 작동할 수 있어요~ <strong>
</span>

- FP-Growth 알고리즘으로 빈발 항목 집합 도출 (지지도 0.05% 이상)

In [63]:
from mlxtend.frequent_patterns import fpgrowth

# FP-Growth 알고리즘을 사용해 frequent_itemsets_fp을 생성하세요.
# 조건: 최소지지도: 0.005, use_colnames=True, 입력 데이터는 boolean 타입(astype(bool))
frequent_itemsets_fp = fpgrowth(filtered_onehot.astype(bool), min_support=0.005, use_colnames=True)
frequent_itemsets_fp.head()

Unnamed: 0,support,itemsets
0,0.118769,(BANANAS)
1,0.038075,(POTATOES RUSSET (BULK&BAG))
2,0.023757,(ONIONS SWEET (BULK&BAG))
3,0.018846,(CELERY)
4,0.040057,(HAMBURGER BUNS)


- `association_rules`로 연관 규칙 도출 및 필터링(신뢰도 40% 이상)

In [64]:
# confidence 기준으로 연관 규칙(rules)을 추출하세요.
# 조건: min_threshold=0.4, metric="***"
rules_fp = association_rules(frequent_itemsets_fp, metric="confidence", min_threshold=0.4)

rules_fp.head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
0,(BANANAS),(FLUID MILK WHITE ONLY),0.118769,0.240401,0.061339,0.516455,2.148305,1.0,0.032787,1.570895,0.606557,0.205952,0.36342,0.385803
1,(POTATOES RUSSET (BULK&BAG)),(FLUID MILK WHITE ONLY),0.038075,0.240401,0.019253,0.505657,2.103392,1.0,0.0101,1.536584,0.545341,0.074272,0.349206,0.292872
2,"(POTATOES RUSSET (BULK&BAG), BANANAS)",(FLUID MILK WHITE ONLY),0.010798,0.240401,0.006779,0.627856,2.611706,1.0,0.004184,2.041145,0.623844,0.027736,0.510079,0.328028
3,"(POTATOES RUSSET (BULK&BAG), MAINSTREAM WHITE ...",(FLUID MILK WHITE ONLY),0.011487,0.240401,0.007586,0.660416,2.747144,1.0,0.004825,2.236852,0.643376,0.031052,0.552943,0.345986
4,"(POTATOES RUSSET (BULK&BAG), SHREDDED CHEESE)",(FLUID MILK WHITE ONLY),0.008534,0.240401,0.005608,0.657182,2.733693,1.0,0.003557,2.215752,0.639653,0.023048,0.548686,0.340256


In [65]:
from IPython.display import display

print("🔥 Frequent Itemsets:")
display(frequent_itemsets_fp)

print("\n🔥 Association Rules:")
display(rules_fp)

🔥 Frequent Itemsets:


Unnamed: 0,support,itemsets
0,0.118769,(BANANAS)
1,0.038075,(POTATOES RUSSET (BULK&BAG))
2,0.023757,(ONIONS SWEET (BULK&BAG))
3,0.018846,(CELERY)
4,0.040057,(HAMBURGER BUNS)
...,...,...
1477,0.005209,"(PAPER NAPKINS, FLUID MILK WHITE ONLY)"
1478,0.007261,"(STRING CHEESE, FLUID MILK WHITE ONLY)"
1479,0.006865,"(CANDY BARS (MULTI PACK), FLUID MILK WHITE ONLY)"
1480,0.006564,"(ISOTONIC DRINKS SINGLE SERVE, FLUID MILK WHIT..."



🔥 Association Rules:


Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
0,(BANANAS),(FLUID MILK WHITE ONLY),0.118769,0.240401,0.061339,0.516455,2.148305,1.0,0.032787,1.570895,0.606557,0.205952,0.363420,0.385803
1,(POTATOES RUSSET (BULK&BAG)),(FLUID MILK WHITE ONLY),0.038075,0.240401,0.019253,0.505657,2.103392,1.0,0.010100,1.536584,0.545341,0.074272,0.349206,0.292872
2,"(POTATOES RUSSET (BULK&BAG), BANANAS)",(FLUID MILK WHITE ONLY),0.010798,0.240401,0.006779,0.627856,2.611706,1.0,0.004184,2.041145,0.623844,0.027736,0.510079,0.328028
3,"(POTATOES RUSSET (BULK&BAG), MAINSTREAM WHITE ...",(FLUID MILK WHITE ONLY),0.011487,0.240401,0.007586,0.660416,2.747144,1.0,0.004825,2.236852,0.643376,0.031052,0.552943,0.345986
4,"(POTATOES RUSSET (BULK&BAG), SHREDDED CHEESE)",(FLUID MILK WHITE ONLY),0.008534,0.240401,0.005608,0.657182,2.733693,1.0,0.003557,2.215752,0.639653,0.023048,0.548686,0.340256
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
442,(FACIAL TISSUE & PAPER HANDKE),(FLUID MILK WHITE ONLY),0.018113,0.240401,0.008910,0.491892,2.046132,1.0,0.004555,1.494956,0.520705,0.035696,0.331084,0.264477
443,(PAPER NAPKINS),(FLUID MILK WHITE ONLY),0.010704,0.240401,0.005209,0.486645,2.024305,1.0,0.002636,1.479675,0.511478,0.021183,0.324176,0.254156
444,(STRING CHEESE),(FLUID MILK WHITE ONLY),0.014115,0.240401,0.007261,0.514428,2.139877,1.0,0.003868,1.564340,0.540310,0.029367,0.360753,0.272316
445,(ISOTONIC DRINKS SINGLE SERVE),(FLUID MILK WHITE ONLY),0.015756,0.240401,0.006564,0.416605,1.732958,1.0,0.002776,1.302031,0.429722,0.026298,0.231969,0.221954


In [76]:
fp_growth = rules_fp.drop(columns=[
    "antecedent support", 
    "consequent support", 
    "representativity", 
    "conviction", 
    "zhangs_metric", 
    "jaccard", 
    "certainty", 
    "kulczynski"
])

- 유의미한 규칙 필터링(향상도 1 이상)

In [78]:
fp_growth = fp_growth[fp_growth['lift'] >= 1]
fp_growth

Unnamed: 0,antecedents,consequents,support,confidence,lift,leverage
0,(BANANAS),(FLUID MILK WHITE ONLY),0.061339,0.516455,2.148305,0.032787
1,(POTATOES RUSSET (BULK&BAG)),(FLUID MILK WHITE ONLY),0.019253,0.505657,2.103392,0.010100
2,"(POTATOES RUSSET (BULK&BAG), BANANAS)",(FLUID MILK WHITE ONLY),0.006779,0.627856,2.611706,0.004184
3,"(POTATOES RUSSET (BULK&BAG), MAINSTREAM WHITE ...",(FLUID MILK WHITE ONLY),0.007586,0.660416,2.747144,0.004825
4,"(POTATOES RUSSET (BULK&BAG), SHREDDED CHEESE)",(FLUID MILK WHITE ONLY),0.005608,0.657182,2.733693,0.003557
...,...,...,...,...,...,...
442,(FACIAL TISSUE & PAPER HANDKE),(FLUID MILK WHITE ONLY),0.008910,0.491892,2.046132,0.004555
443,(PAPER NAPKINS),(FLUID MILK WHITE ONLY),0.005209,0.486645,2.024305,0.002636
444,(STRING CHEESE),(FLUID MILK WHITE ONLY),0.007261,0.514428,2.139877,0.003868
445,(ISOTONIC DRINKS SINGLE SERVE),(FLUID MILK WHITE ONLY),0.006564,0.416605,1.732958,0.002776


## **4️⃣ 연관 분석 - 결과 해석**

<span style="color:black; background-color:#E6E6FA; padding:2px 4px; border-radius:4px">
<strong> 🤓 결과를 해석하고 전략을 세우는 게 해당 과제의 핵심이니 꼭!!!! 성의있게 깊게 고민한 흔적을 남겨주세요! <strong>
</span>

### **[연관분석] 지지도 0.9% 이상, 신뢰도 55% 이상, 향상도 1 이상 연관 분석**

In [82]:
results = apriori[(apriori['confidence']>0.55)&(apriori['support']>0.009)&(apriori['lift']>1)]
results.sort_values(by='lift', ascending=False)

Unnamed: 0,antecedents,consequents,support,confidence,lift,leverage
358,"(MAINSTREAM WHITE BREAD, KIDS CEREAL)",(FLUID MILK WHITE ONLY),0.012783,0.753637,3.134916,0.008705
320,"(DAIRY CASE 100% PURE JUICE - O, MAINSTREAM WH...",(FLUID MILK WHITE ONLY),0.010006,0.735463,3.059321,0.006736
221,"(ALL FAMILY CEREAL, BANANAS)",(FLUID MILK WHITE ONLY),0.010707,0.716082,2.978698,0.007113
256,"(KIDS CEREAL, BANANAS)",(FLUID MILK WHITE ONLY),0.009944,0.710607,2.955926,0.00658
331,"(MAINSTREAM WHITE BREAD, EGGS - LARGE)",(FLUID MILK WHITE ONLY),0.009035,0.705936,2.936496,0.005958
245,"(BANANAS, EGGS - LARGE)",(FLUID MILK WHITE ONLY),0.010527,0.699454,2.90953,0.006909
242,"(DAIRY CASE 100% PURE JUICE - O, BANANAS)",(FLUID MILK WHITE ONLY),0.014416,0.68637,2.855104,0.009367
279,"(BANANAS, SHREDDED CHEESE)",(FLUID MILK WHITE ONLY),0.014005,0.681532,2.834983,0.009065
395,"(MAINSTREAM WHITE BREAD, SHREDDED CHEESE)",(FLUID MILK WHITE ONLY),0.013535,0.679379,2.826025,0.008746
260,"(MAINSTREAM WHITE BREAD, BANANAS)",(FLUID MILK WHITE ONLY),0.016492,0.663986,2.761995,0.010521


- 위 결과를 지지도, 신뢰도, 향상도 값을 바탕으로 해석해주세요! (두 가지 이상)

<span style="color:black; background-color:#E6E6FA; padding:2px 4px; border-radius:4px">  
<strong> 🤓 [ 예시 ] 'FLUID MILK WHITE ONLY'(우유)는 다양한 품목과 높은 결합 구매 패턴을 보이며, 30개 이상의 제품과 lift 2 이상으로 강한 연관성을 나타낸다. <br> 특히 시리얼 계열인 'ALL FAMILY CEREAL', 'KIDS CEREAL', 'ADULT CEREAL'은 각각 65% 이상의 신뢰도와 2.6~2.7의 lift를 기록해 눈에 띄는 결합 소비가 확인된다. <br> 전반적으로 우유는 시리얼, 과일, 베이커리, 아침 식사류 제품들과 자주 함께 구매되며, 이는 소비자의 식사 준비 맥락과 밀접하게 연결된 구매 경향을 보여준다. <strong>  
</span>

- 해석 1: 우유는 식빵, 계란, 토스트, 바나나 등 아침 또는 브런치로 주로 소비되는 식재료와 높은 결합 구매 패턴을 보인다.

- 해석 2: 우유는 다양한 품목과 높은 향상도를 보이며, 주로 간편하게 먹을 수 있는 상품과 함께 결합한다

## **5️⃣ 연관 분석 - 비즈니스 전략 수립**

- 위 결과해석에 따라 비즈니스 전략을 수립해주세요! (2가지 이상) -> 냅다 GPT만 패서 쓰지 말아주세요 . . .. . 

<span style="color:black; background-color:#E6E6FA; padding:2px 4px; border-radius:4px">  
<strong> 🤓 [ 예시 ] 우유와 함께하는 식탁 큐레이션 존 구성 <strong>  
</span>

**🎯 목적**

* 우유와 자주 함께 구매되는 상품을 모아 **직관적인 구매 유도**
* **객단가 상승**, **편리한 쇼핑 경험 제공**

**🛒 구성 품목**

* **시리얼류**: KIDS / ALL FAMILY / ADULT CEREAL
* **베이커리**: 식빵, 비스킷, 토스터 페이스트리
* **과일류**: 바나나, 컵과일
* **유제품/간편식**: 요거트, 달걀, 마카로니 등

**📍 운영 방법**

* 우유 냉장고 인근에 **"우유와 최고의 궁합!"** 테마존 설치
* POP/QR코드로 **추천 식단**이나 **할인 쿠폰** 제공
* 계절별 테마 구성 (예: 여름=냉과일, 겨울=오트밀)

**📈 기대 효과**

* 우유 결합 구매율 상승
* 연관 제품 매출 증가
* ‘고민 없는 조합’으로 **고객 만족도 향상**



- 전략 1:
우유라는 품목의 특성상 소비기한이 그리 길지 않다, 또한 아침 식사로 주로 사용되는 품목과 결합소비를 하는 경우가 무척 많으므로, 이 점을 활용하고자 한다

매일 저녁마다, 우유와 함께 다음 날 아침식사로 쓸 거 같은 품목들을 함께 구매하면 할인하는 프로모션을 진행한다

바쁜 하루를 살 가족에게 줄 아침을 준비하세요! 이런 느낌으로다가 저녁에 할인 프로모션을 진행하는 것이다

왜냐면 아침에 먹는 음식이긴 해도, 아침에 마트에 와서 구매하고 차리기까지는 시간이 부족할테니 말이다.

유통기한이 짧은 우유의 재고를 줄이는 방법으로도 쓸 수 있을 것 같다



- 전략 2:
쿠팡 로켓배송처럼 아침마다 우유와 함께 먹을 아침 식사를 배달해주는 일종의 구독 서비스를 만든다

전문 큐레이터와 영양사가 붙어서 짠 우리 가족 아침 건강식! 이런 슬로건으로다가

우유화 계란, 토스트, 시리얼, 바나나 등등을 요일별로 해서 다채롭게 구성하여

고객들에게 발송하는 서비스를 고안한다




# **🤓 기가 막힌 전략을 제시하는 분께는 행운이 찾아옵니다~🍀**