## Dram Shop Example

As discussed in the lecture and the reading, association analysis is a technique for finding sets of items that frequently co-occur within a data set.

--- 

### Data Format for mlxtend's Association Analysis

`mlxtend` requires the input data to be in a specific format for association rule mining:

1. Transactional Data: Each transaction is a list of items purchased together. For the Wedge, we can define a "transaction" as one trip to the grocery store. For the Dram Shop, it makes more sense to treat a "transaction" as a customer's full purchase history, since many customers buy only one or two items on a visit.
1. One-Hot Encoded DataFrame: A Pandas DataFrame where each row represents a transaction and each column represents an item. Each cell holds a binary value indicating the presence (True/1) or absence (False/0) of that item in that transaction.

`mlxtend.frequent_patterns.apriori` takes the one-hot encoded DataFrame as its input. `association_rules` then works on the frequent itemsets that `apriori` returns, not on the one-hot data directly.

#### Transactional Data

Each transaction in your dataset should be represented as a list of items purchased together.
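
For the Dram Shop, building those lists means collapsing purchase rows into one basket per customer. The sketch below shows the idea in plain Python on made-up rows. The customer identifiers and the helper name `build_baskets` are illustrative assumptions, not part of the course data or of `mlxtend`. Repeat purchases are collapsed, since association analysis only tracks whether an item is present.

```python
from collections import defaultdict


def build_baskets(rows):
    """Group (customer, item) rows into one sorted, de-duplicated item list per customer."""
    baskets = defaultdict(set)
    for customer, item in rows:
        baskets[customer].add(item)
    return {customer: sorted(items) for customer, items in baskets.items()}


# Made-up purchase rows: (customer identifier, item name)
rows = [
    ("cust_1", "Pilsner"),
    ("cust_1", "Pear Cider"),
    ("cust_2", "Pilsner"),
    ("cust_1", "Pilsner"),       # repeat purchase, collapses into the same basket entry
    ("cust_2", "Helles Lager"),
]

baskets = build_baskets(rows)
transactions = list(baskets.values())  # the list-of-lists shape described above
print(transactions)
```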

#### One-Hot Encoded DataFrame

- **Rows**: Each row corresponds to a transaction.
- **Columns**: Each column represents an item in the dataset.
- **Values**: Binary indicators (`True`/`1` or `False`/`0`) showing whether an item is present in a transaction.

##### Example of One-Hot Encoded DataFrame

| Transaction_ID | Item_A | Item_B | Item_C | Item_D |
|----------------|--------|--------|--------|--------|
| 1              | 1      | 0      | 1      | 0      |
| 2              | 0      | 1      | 1      | 1      |
| 3              | 1      | 1      | 0      | 0      |
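
The table above can be produced from three item lists. The following is a minimal plain-Python sketch of the encoding step, with the item lists read off the table rows. In practice `TransactionEncoder` from `mlxtend.preprocessing` does this job; `one_hot_encode` is a hypothetical helper name used only for illustration.

```python
def one_hot_encode(transactions):
    """Return (sorted item columns, list of boolean rows) for a list of item lists."""
    columns = sorted({item for basket in transactions for item in basket})
    rows = [[item in set(basket) for item in columns] for basket in transactions]
    return columns, rows


# Item lists matching transactions 1 to 3 in the example table
transactions = [
    ["Item_A", "Item_C"],
    ["Item_B", "Item_C", "Item_D"],
    ["Item_A", "Item_B"],
]

columns, rows = one_hot_encode(transactions)
print(columns)
for transaction_id, row in enumerate(rows, start=1):
    # Print 1/0 instead of True/False to match the table
    print(transaction_id, [int(flag) for flag in row])
```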



In [6]:
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder
from google.cloud import bigquery
from pandas_gbq import read_gbq

In [None]:
project_id = "umt-msba"
client = bigquery.Client(project=project_id)

In [18]:

query = """
    SELECT
        COALESCE(o.customer_id, o.square_unique_id) AS customer_identifier,
        oi.catalog_object_id,
        id.clean_item_name,
        id.clean_category_name,
        COUNT(*) AS item_count,
        ROUND(SUM(oi.total_money)/100, 2) AS total_item_spend
    FROM `umt-msba.dram_shop.orders_*` o
    JOIN `umt-msba.dram_shop.order_items_*` oi ON o.order_id = oi.order_id
    JOIN `umt-msba.dram_shop.item_data` id ON oi.catalog_object_id = id.variant_id
    WHERE COALESCE(o.customer_id, o.square_unique_id) != "" 
      AND id.clean_item_name IN (
          SELECT clean_item_name
          FROM `umt-msba.dram_shop.vw_item_year_month`
          WHERE year IN (2022, 2023, 2024)
          GROUP BY clean_item_name
          ORDER BY SUM(total_sales) DESC
          LIMIT 1000
      )
    GROUP BY customer_identifier, oi.catalog_object_id, id.clean_item_name, id.clean_category_name
"""

In [19]:
query_job = client.query(query)
results = query_job.result()

df = results.to_dataframe()

In [21]:
df.sample(n=5)

|        | customer_identifier        | catalog_object_id        | clean_item_name           | clean_category_name | item_count | total_item_spend |
|--------|----------------------------|--------------------------|---------------------------|---------------------|------------|------------------|
| 78695  | CYM077H8GN57Q73S5XRP4PW1B8 | T7NO5UZDIQVQLIYVX2RKS7PE | All Day IPA               | IPA - Draught       | 1          | 4.5              |
| 6474   | AYKWT8YNA56N91H6DNHT5PTZNW | 6QNEEKTXGODAYDYSO3TGICDD | Robot Panda Hazy IPA      | IPA - Draught       | 1          | 25.68            |
| 106247 | 5NQ5JWKRJD4CHECCEZB8BQCJS4 | 6XXI6244F3SN3ADC4KIA5CEU | Flathead Cherry           | Cider - Draught     | 1          | 5.0              |
| 219262 | JEN4EQ9A015EF3N3ZWHK5HX8BG | KHUMFL2FOTLOJHJMTDQE6RFW | Pinot Noir                | Wine - Draught      | 1          | 9.5              |
| 104895 | F847MQR3154J4XXFNQRQGFCZV0 | Q3FV3BZOAR4PEZXSG3XL74GS | Blackfoot Single Malt IPA | IPA - Draught       | 1          | 5.0              |


In [22]:
df.groupby('clean_item_name')['item_count'].sum().sort_values(ascending=False).head(10)


clean_item_name
Blackfoot Single Malt IPA    36074
Super Pils                   10394
Pilsner                       6930
Grazing Clouds Hazy IPA       6064
IPA                           4596
Pear Cider                    4547
Helles Lager                  4278
Robot Panda Hazy IPA          3029
Grapefruit Radler             2814
All Day IPA                   2729
Name: item_count, dtype: Int64

In [24]:
df.groupby('clean_category_name')['total_item_spend'].sum().sort_values(ascending=False).head(20)

clean_category_name
IPA - Draught                  885972.25
Lagers/Pils/Wheat - Draught    453889.83
Cider - Draught                226976.02
Wine - Draught                 226194.89
Ambers/Pales - Draught         170888.91
Porters/Stouts - Draught       131037.22
Sour - Draught                 114139.03
Growlers                        46939.49
Belgian - Draught               35279.91
Hard Seltzer - Draught          21108.23
Red Wine - Bottled              19337.09
Seltzer                         18503.03
IPA - Bottled                   15921.75
Sparkling Wine - Bottled        15579.65
Seasonal/Event                  15440.66
White Wine - Bottled            14071.30
Lagers/Pils/Wheat - Bottled     13895.78
Rosé Wine - Bottled             11653.65
Swag                             8612.85
Soda - Draught                   6179.83
Name: total_item_spend, dtype: float64

In [25]:
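# One basket per customer: every item name in that customer's purchase history
transactions = df.groupby('customer_identifier')['clean_item_name'].apply(list).tolist()

# One-hot encoding: one row per customer, one boolean column per item
te = TransactionEncoder()
te_data = te.fit(transactions).transform(transactions)
df_encoded = pd.DataFrame(te_data, columns=te.columns_)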



In [26]:

df_encoded.head()

|   | \$5 | \$7 | 100 Degrees Shandy Inspired Farmhouse Ale | 1000 Petals Pilsner | 10G Pumpkin Spice Latte | 14C Who's Whoo | 1664 | 20" Brown | 20,000 Leguas' Amber Wine Chardonnay | 2424 Hazy Double | ... | ZC Up Stream | ZM Breezy Does It | ZP 56 Counties | ZP Spruce Tip IPA | Zinfandel | Zip Hoody | Zymopunk Pilsner | \|Z Populis Sauvignon Blanc - 2022\| Sauvignon Blanc \| Populis \| 2022 \| | ¡Viva La Pineapple! | ‘River Sand’ Fiano |
|---|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
| 0 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 1 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 2 | False | False | False | False | False | False | False | False | True | False | ... | False | False | False | False | False | False | False | False | False | False |
| 3 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 4 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |


In [39]:
frequent_itemsets = apriori(df_encoded, min_support=0.005, use_colnames=True)

# Sort by support
frequent_itemsets = frequent_itemsets.sort_values(by='support', ascending=False)
print(frequent_itemsets.head())


      support                     itemsets
19   0.207251  (Blackfoot Single Malt IPA)
182  0.068734                 (Super Pils)
143  0.060235                    (Pilsner)
66   0.051430    (Grazing Clouds Hazy IPA)
137  0.045866                 (Pear Cider)
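
The `support` column is the fraction of transactions that contain every item in the itemset, so a `min_support` of 0.005 keeps itemsets found in at least 0.5% of customer baskets. As a rough illustration of that definition (not of the Apriori algorithm itself, which prunes candidates rather than counting everything), the brute-force sketch below counts itemsets of size 1 and 2 on made-up baskets. The function name `itemset_support`, the baskets, and the 0.5 threshold are assumptions chosen for illustration.

```python
from collections import Counter
from itertools import combinations


def itemset_support(transactions, max_size=2):
    """Return {itemset: fraction of transactions containing it} for small itemsets."""
    counts = Counter()
    for basket in transactions:
        items = sorted(set(basket))  # de-duplicate so each basket counts an itemset once
        for size in range(1, max_size + 1):
            for itemset in combinations(items, size):
                counts[frozenset(itemset)] += 1
    n_transactions = len(transactions)
    return {itemset: count / n_transactions for itemset, count in counts.items()}


# Made-up baskets, one per customer
transactions = [
    ["Pilsner", "Pear Cider"],
    ["Pilsner", "Helles Lager"],
    ["Pilsner"],
    ["Pear Cider"],
]

support = itemset_support(transactions)

# Keep only itemsets at or above the threshold, highest support first
min_support = 0.5
frequent = {itemset: value for itemset, value in support.items() if value >= min_support}
for itemset, value in sorted(frequent.items(), key=lambda pair: -pair[1]):
    print(sorted(itemset), value)
```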


In [40]:
frequent_itemsets

|     | support  | itemsets                                   |
|-----|----------|--------------------------------------------|
| 19  | 0.207251 | (Blackfoot Single Malt IPA)                |
| 182 | 0.068734 | (Super Pils)                               |
| 143 | 0.060235 | (Pilsner)                                  |
| 66  | 0.051430 | (Grazing Clouds Hazy IPA)                  |
| 137 | 0.045866 | (Pear Cider)                               |
| ... | ...      | ...                                        |
| 49  | 0.005080 | (El Dorado Hopped Cider)                   |
| 235 | 0.005080 | (Lagunitas IPA, Blackfoot Single Malt IPA) |
| 210 | 0.005064 | (Blackfoot Single Malt IPA, 60 Minute IPA) |
| 172 | 0.005032 | (Snowmelt Pomegranate and Acai)            |


In [42]:

# Generate association rules
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.5)

# Print the association rules
print(rules)

TypeError: association_rules() missing 1 required positional argument: 'num_itemsets'
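
This error suggests that the installed `mlxtend` release treats `num_itemsets` as a required argument. That appears to be true only of some recent versions; others make it optional or do not accept it at all. The argument is documented as the number of transactions in the original data, so `len(df_encoded)` would be the natural value here. One hedged way to write a call that works either way is to inspect the function signature first. The wrapper name `rules_with_num_itemsets` is hypothetical, and the two stand-in functions exist only so the sketch runs without `mlxtend` installed.

```python
import inspect


def rules_with_num_itemsets(rules_func, frequent_itemsets, n_transactions, **kwargs):
    """Call rules_func, passing num_itemsets only if its signature accepts it."""
    if "num_itemsets" in inspect.signature(rules_func).parameters:
        kwargs["num_itemsets"] = n_transactions
    return rules_func(frequent_itemsets, **kwargs)


# Stand-ins that mimic the two calling conventions (NOT mlxtend code)
def rules_requiring_count(df, num_itemsets, metric="confidence", min_threshold=0.8):
    return ("with count", num_itemsets, metric, min_threshold)


def rules_without_count(df, metric="confidence", min_threshold=0.8):
    return ("without count", metric, min_threshold)


print(rules_with_num_itemsets(rules_requiring_count, None, 1000, metric="lift", min_threshold=1.0))
print(rules_with_num_itemsets(rules_without_count, None, 1000, metric="lift", min_threshold=1.0))
```

In the notebook, the equivalent call would be `rules_with_num_itemsets(association_rules, frequent_itemsets, len(df_encoded), metric="confidence", min_threshold=0.5)`.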

In [41]:
# Generate association rules
# Note: support_only=True computes only rule support; the other metric columns are filled with NaN
rules = association_rules(frequent_itemsets, metric="lift", support_only=True, num_itemsets=1)

# Sort by lift
rules = rules.sort_values(by='lift', ascending=False)
print(rules.head())


Empty DataFrame
Columns: [antecedents, consequents, antecedent support, consequent support, support, confidence, lift, representativity, leverage, conviction, zhangs_metric, jaccard, certainty, kulczynski]
Index: []
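
The empty result is most likely a consequence of `support_only=True`. That option computes only rule support, so the threshold filter cannot be met: the largest support reported above is about 0.21, well below the default `min_threshold` of 0.8. The exact filtering behavior is an assumption about the installed `mlxtend` version. To see what rule metrics to expect from this data, confidence and lift can be worked out by hand from the supports. The sketch below uses two supports from the `frequent_itemsets` output plus one made-up value (marked in the comments), since the support of Lagunitas IPA on its own is not shown.

```python
def confidence(support_both, support_antecedent):
    """Share of transactions containing the antecedent that also contain the consequent."""
    return support_both / support_antecedent


def lift(support_both, support_antecedent, support_consequent):
    """Observed co-occurrence relative to what independence would predict."""
    return support_both / (support_antecedent * support_consequent)


# Supports reported in the frequent_itemsets output
support_blackfoot = 0.207251
support_pair = 0.005080  # (Lagunitas IPA, Blackfoot Single Malt IPA)

# HYPOTHETICAL value: the support of Lagunitas IPA alone is not shown in the output
support_lagunitas = 0.010

conf_blackfoot_to_lagunitas = confidence(support_pair, support_blackfoot)
conf_lagunitas_to_blackfoot = confidence(support_pair, support_lagunitas)  # depends on the made-up value
pair_lift = lift(support_pair, support_blackfoot, support_lagunitas)       # depends on the made-up value

print(round(conf_blackfoot_to_lagunitas, 4))
print(round(conf_lagunitas_to_blackfoot, 4))
print(round(pair_lift, 2))
```

Only the first number is grounded in the notebook output: about 2.5% of customers who bought Blackfoot Single Malt IPA also bought Lagunitas IPA, far below the 0.5 confidence threshold used in the earlier call.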
