# Association Rules and Market Basket Analysis

This notebook covers fundamental concepts in association rule mining and market basket analysis, including key metrics, algorithms, and practical applications.

In [1]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

## 1. Introduction to Association Rules

Association rule mining is a rule-based machine learning method for discovering interesting relations between variables in large databases. It is particularly useful for market basket analysis.

### Key Concepts:
- **Itemset:** A collection of one or more items
- **Association Rule:** An implication of the form X → Y, where X and Y are itemsets
- **Support:** How frequently an itemset appears in the dataset
- **Confidence:** How often transactions that contain X also contain Y
- **Lift:** The ratio of observed support to expected support if X and Y were independent

## 2. Basic Metrics

Let's define the key metrics mathematically:

1. **Support(X)**: Probability that a transaction contains itemset X  
   $\text{support}(X) = \frac{\text{count}(X)}{N}$  
   where N is the total number of transactions

2. **Confidence(X → Y)**: Probability that a transaction having X also contains Y  
   $\text{confidence}(X \rightarrow Y) = \frac{\text{support}(X \cup Y)}{\text{support}(X)}$

3. **Lift(X → Y)**: Measures how much more often X and Y occur together than expected if they were statistically independent  
   $\text{lift}(X \rightarrow Y) = \frac{\text{support}(X \cup Y)}{\text{support}(X) \times \text{support}(Y)}$

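4. **Conviction(X → Y)**: Compares how often X would occur without Y if the two were independent with how often the rule actually fails. Values above 1 mean the rule fails less often than chance would predict, and the value is infinite when confidence is 1  
   $\text{conviction}(X \rightarrow Y) = \frac{1 - \text{support}(Y)}{1 - \text{confidence}(X \rightarrow Y)}$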
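
A short worked example may make the four formulas concrete. The counts are hypothetical and chosen only for easy arithmetic: N = 10 transactions, 4 of which contain X, 5 contain Y, and 3 contain both X and Y.

$$
\begin{aligned}
\text{support}(X) &= \frac{4}{10} = 0.4, \qquad \text{support}(Y) = \frac{5}{10} = 0.5, \qquad \text{support}(X \cup Y) = \frac{3}{10} = 0.3 \\
\text{confidence}(X \rightarrow Y) &= \frac{0.3}{0.4} = 0.75 \\
\text{lift}(X \rightarrow Y) &= \frac{0.3}{0.4 \times 0.5} = 1.5 \\
\text{conviction}(X \rightarrow Y) &= \frac{1 - 0.5}{1 - 0.75} = 2
\end{aligned}
$$

In this hypothetical case the lift of 1.5 says X and Y co-occur 1.5 times as often as independence would predict, and the conviction of 2 says the rule fails half as often as it would by chance.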

Let's implement these metrics in Python:

In [2]:
import numpy as np
from itertools import combinations

transactions = [
    ['milk', 'bread', 'eggs'],
    ['milk', 'bread'],
    ['bread', 'eggs'],
    ['milk', 'eggs'],
    ['milk', 'bread', 'eggs', 'cheese'],
    ['cheese', 'eggs'],
    ['cheese'],
    ['milk', 'cheese', 'eggs']
]

In [3]:
transactions = [list(set(t)) for t in transactions]
transactions

[['eggs', 'milk', 'bread'],
 ['milk', 'bread'],
 ['eggs', 'bread'],
 ['eggs', 'milk'],
 ['eggs', 'milk', 'bread', 'cheese'],
 ['eggs', 'cheese'],
 ['cheese'],
 ['eggs', 'milk', 'cheese']]

In [4]:
[item for transaction in transactions for item in transaction]

['eggs',
 'milk',
 'bread',
 'milk',
 'bread',
 'eggs',
 'bread',
 'eggs',
 'milk',
 'eggs',
 'milk',
 'bread',
 'cheese',
 'eggs',
 'cheese',
 'cheese',
 'eggs',
 'milk',
 'cheese']

In [5]:
unique_items = list(set(item for transaction in transactions for item in transaction))

### Assignment 1 (0.5 pt): Metric Calculations
For the given transaction dataset, calculate the following manually (show your work):
   - support for {'milk', 'bread'}
   - confidence for {'milk'} → {'bread'}
   - lift for {'bread'} → {'eggs'}
   - conviction for {'eggs'} → {'cheese'}

In [6]:
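# 1) support({milk, bread}): milk and bread appear together in 3 of the 8 transactions, so support = 3/8 = 0.375
# 2) confidence({milk} → {bread}) = support({milk, bread}) / support({milk}) = (3/8) / (5/8) = 3/5 = 0.6
# 3) lift({bread} → {eggs}) = support({bread, eggs}) / (support({bread}) * support({eggs})) = (3/8) / ((4/8) * (6/8)) = 1
# 4) conviction({eggs} → {cheese}) = (1 - support({cheese})) / (1 - confidence({eggs} → {cheese})) = (1 - 4/8) / (1 - 3/6) = 1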
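
The manual answers can be cross-checked with exact arithmetic. The sketch below is only a verification aid: the helper name `support` and the result variable names are made up for this check, and `fractions.Fraction` is used so that no rounding hides a mistake.

```python
from fractions import Fraction

# The eight transactions from the notebook, stored as sets.
transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"bread", "eggs"},
    {"milk", "eggs"},
    {"milk", "bread", "eggs", "cheese"},
    {"cheese", "eggs"},
    {"cheese"},
    {"milk", "cheese", "eggs"},
]


def support(itemset):
    """Exact fraction of transactions that contain every item in itemset (hypothetical helper)."""
    hits = sum(1 for t in transactions if set(itemset) <= t)
    return Fraction(hits, len(transactions))


sup_milk_bread = support({"milk", "bread"})
conf_milk_bread = sup_milk_bread / support({"milk"})
lift_bread_eggs = support({"bread", "eggs"}) / (support({"bread"}) * support({"eggs"}))
conf_eggs_cheese = support({"eggs", "cheese"}) / support({"eggs"})
conv_eggs_cheese = (1 - support({"cheese"})) / (1 - conf_eggs_cheese)

print("support({milk, bread})        =", sup_milk_bread)
print("confidence({milk} -> {bread}) =", conf_milk_bread)
print("lift({bread} -> {eggs})       =", lift_bread_eggs)
print("conviction({eggs} -> {cheese}) =", conv_eggs_cheese)
```

A lift and a conviction of exactly 1 both indicate that, in this small dataset, the antecedent carries no information about the consequent.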

In [7]:
transactions

[['eggs', 'milk', 'bread'],
 ['milk', 'bread'],
 ['eggs', 'bread'],
 ['eggs', 'milk'],
 ['eggs', 'milk', 'bread', 'cheese'],
 ['eggs', 'cheese'],
 ['cheese'],
 ['eggs', 'milk', 'cheese']]

## 3. Create functions for Basic Metrics

Assignment 2 (0.5 pt): Define functions that calculate:

1. **Support(X)**

2. **Confidence(X → Y)**

3. **Lift(X → Y)**

4. **Conviction(X → Y)**

In [8]:
'eggs' in transactions[0]

True

In [9]:
itemset = ['bread', 'eggs']
[item in transactions[0] for item in itemset]

[True, True]

In [10]:
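def calculate_support(itemset, transactions):
    """Return the fraction of transactions that contain every item in itemset."""
    # A transaction matches only if none of the itemset's items is missing from it.
    matches = sum(1 for transaction in transactions
                  if all(item in transaction for item in itemset))
    return matches / len(transactions)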


def calculate_confidence(X, Y, transactions):
    """confidence(X → Y) = support(X ∪ Y) / support(X); returns 0 if X never occurs."""
    support_x = calculate_support(X, transactions)
    return calculate_support(X + Y, transactions) / support_x if support_x != 0 else 0


def calculate_lift(X, Y, transactions):
    """lift(X → Y) = support(X ∪ Y) / (support(X) * support(Y)); returns 0 if X or Y never occurs."""
    expected = calculate_support(X, transactions) * calculate_support(Y, transactions)
    return calculate_support(X + Y, transactions) / expected if expected != 0 else 0


def calculate_conviction(X, Y, transactions):
    """conviction(X → Y) = (1 - support(Y)) / (1 - confidence(X → Y)); inf if confidence is 1."""
    confidence = calculate_confidence(X, Y, transactions)
    return (1 - calculate_support(Y, transactions)) / (1 - confidence) if confidence != 1 else float('inf')

In [11]:
X = ['milk']
Y = ['bread']

print(f"Support({X}): {calculate_support(X, transactions):.3f}")
print(f"Support({Y}): {calculate_support(Y, transactions):.3f}")
print(f"Support({X + Y}): {calculate_support(X + Y, transactions):.3f}")
print(f"Confidence({X} → {Y}): {calculate_confidence(X, Y, transactions):.3f}")
print(f"Lift({X} → {Y}): {calculate_lift(X, Y, transactions):.3f}")
print(f"Conviction({X} → {Y}): {calculate_conviction(X, Y, transactions):.3f}")

Support(['milk']): 0.625
Support(['bread']): 0.500
Support(['milk', 'bread']): 0.375
Confidence(['milk'] → ['bread']): 0.600
Lift(['milk'] → ['bread']): 1.200
Conviction(['milk'] → ['bread']): 1.250
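
One property worth noticing in these numbers is that confidence depends on the direction of the rule while lift does not. The sketch below illustrates this on the same eight transactions; it redefines compact helpers (`support`, `confidence`, `lift`, names chosen for this example only) so that it runs on its own.

```python
transactions = [
    ['milk', 'bread', 'eggs'],
    ['milk', 'bread'],
    ['bread', 'eggs'],
    ['milk', 'eggs'],
    ['milk', 'bread', 'eggs', 'cheese'],
    ['cheese', 'eggs'],
    ['cheese'],
    ['milk', 'cheese', 'eggs'],
]


def support(itemset, transactions):
    """Fraction of transactions containing every item in itemset."""
    hits = sum(1 for t in transactions if all(item in t for item in itemset))
    return hits / len(transactions)


def confidence(X, Y, transactions):
    """support(X ∪ Y) / support(X); assumes X occurs at least once."""
    return support(X + Y, transactions) / support(X, transactions)


def lift(X, Y, transactions):
    """support(X ∪ Y) / (support(X) * support(Y)); assumes both occur at least once."""
    return support(X + Y, transactions) / (
        support(X, transactions) * support(Y, transactions)
    )


milk, bread = ['milk'], ['bread']

conf_forward = confidence(milk, bread, transactions)   # milk -> bread
conf_backward = confidence(bread, milk, transactions)  # bread -> milk
lift_forward = lift(milk, bread, transactions)
lift_backward = lift(bread, milk, transactions)

print(f"confidence(milk -> bread) = {conf_forward:.3f}")
print(f"confidence(bread -> milk) = {conf_backward:.3f}")
print(f"lift(milk -> bread)       = {lift_forward:.3f}")
print(f"lift(bread -> milk)       = {lift_backward:.3f}")
```

The two confidences differ because they divide the same joint support by different antecedent supports, whereas the lift formula is symmetric in X and Y.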


## 4. Market Basket Analysis

Market basket analysis examines customers' purchasing habits by finding associations between different items that customers place in their "shopping baskets."

### Applications:
- Product placement in stores
- Cross-selling strategies
- Catalog design
- Online recommendation systems

Let's perform a complete market basket analysis using the Apriori algorithm.

In [13]:
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
import pandas as pd

def create_ohe_matrix(transactions, unique_items):
    ohe_matrix = []
    for transaction in transactions:
        row = [1 if item in transaction else 0 for item in unique_items]
        ohe_matrix.append(row)
    return pd.DataFrame(ohe_matrix, columns=unique_items)

In [14]:
ohe_df = create_ohe_matrix(transactions, unique_items)
ohe_df

Unnamed: 0,eggs,milk,bread,cheese
0,1,1,1,0
1,0,1,1,0
2,1,0,1,0
3,1,1,0,0
4,1,1,1,1
5,1,0,0,1
6,0,0,0,1
7,1,1,0,1
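
The column order above comes from `list(set(...))`, which is not guaranteed to be the same from one interpreter session to the next. A possible variant, sketched here with the standard library only, sorts the item names for a reproducible layout and stores booleans instead of 0/1; recent mlxtend releases appear to prefer boolean input, which is presumably why `DeprecationWarning` is silenced at the top of the notebook. The names `columns` and `rows` are illustrative.

```python
transactions = [
    ['milk', 'bread', 'eggs'],
    ['milk', 'bread'],
    ['bread', 'eggs'],
    ['milk', 'eggs'],
    ['milk', 'bread', 'eggs', 'cheese'],
    ['cheese', 'eggs'],
    ['cheese'],
    ['milk', 'cheese', 'eggs'],
]

# Sorting the unique items fixes the column order across runs.
columns = sorted({item for transaction in transactions for item in transaction})

# One boolean per (transaction, item) pair: True if the item is in the basket.
rows = [[item in transaction for item in columns] for transaction in transactions]

print(columns)
for row in rows:
    print(row)

# Assumption: with pandas available, pd.DataFrame(rows, columns=columns)
# would give a boolean one-hot frame equivalent to ohe_df.
```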


In [15]:
frequent_itemsets = apriori(ohe_df, min_support=0.2, use_colnames=True)

rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.5)

print("Association Rules:")
rules.sort_values('lift', ascending=False)

Association Rules:




Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
15,"(milk, cheese)",(eggs),0.25,0.75,0.25,1.0,1.333333,1.0,0.0625,inf,0.333333,0.333333,1.0,0.666667
6,(milk),(bread),0.625,0.5,0.375,0.6,1.2,1.0,0.0625,1.25,0.444444,0.5,0.2,0.675
7,(bread),(milk),0.5,0.625,0.375,0.75,1.2,1.0,0.0625,1.5,0.333333,0.5,0.333333,0.675
0,(eggs),(milk),0.75,0.625,0.5,0.666667,1.066667,1.0,0.03125,1.125,0.25,0.571429,0.111111,0.733333
1,(milk),(eggs),0.625,0.75,0.5,0.8,1.066667,1.0,0.03125,1.25,0.166667,0.571429,0.2,0.733333
10,"(eggs, bread)",(milk),0.375,0.625,0.25,0.666667,1.066667,1.0,0.015625,1.125,0.1,0.333333,0.111111,0.533333
14,"(eggs, cheese)",(milk),0.375,0.625,0.25,0.666667,1.066667,1.0,0.015625,1.125,0.1,0.333333,0.111111,0.533333
4,(eggs),(cheese),0.75,0.5,0.375,0.5,1.0,1.0,0.0,1.0,0.0,0.428571,0.0,0.625
2,(eggs),(bread),0.75,0.5,0.375,0.5,1.0,1.0,0.0,1.0,0.0,0.428571,0.0,0.625
13,"(eggs, milk)",(cheese),0.5,0.5,0.25,0.5,1.0,1.0,0.0,1.0,0.0,0.333333,0.0,0.5


## 5. The Naive Algorithm for Association Rule Mining

The brute-force baseline that algorithms like Apriori improve on enumerates every candidate directly:

1. Generate all possible itemsets (the power set of all items)
2. Calculate support for each itemset
3. Prune itemsets that don't meet the minimum support threshold
4. Generate all possible rules from frequent itemsets
5. Calculate confidence for each rule and keep those above the threshold

### Problems with the Naive Approach:
- Computationally expensive (d items yield 2^d − 1 non-empty candidate itemsets)
- Requires a pass over the database for every candidate itemset
- Memory intensive for large datasets
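
To get a feel for how quickly the search space grows, the sketch below counts the non-empty itemsets (2^d − 1) and the possible rules (3^d − 2^(d+1) + 1) for d items, and checks the rule count against a brute-force enumeration. The function names are illustrative.

```python
from itertools import combinations


def count_itemsets(d):
    """Number of non-empty itemsets over d distinct items."""
    return 2 ** d - 1


def count_rules(d):
    """Closed-form number of rules X -> Y with X, Y non-empty and disjoint."""
    return 3 ** d - 2 ** (d + 1) + 1


def count_rules_brute_force(d):
    """Enumerate every itemset of size >= 2 and every non-empty proper antecedent."""
    total = 0
    for k in range(2, d + 1):
        for itemset in combinations(range(d), k):
            for i in range(1, k):
                total += sum(1 for _ in combinations(itemset, i))
    return total


for d in (4, 6, 10, 20):
    print(f"d={d:>2}: {count_itemsets(d):>9,} itemsets, {count_rules(d):>13,} rules")

# The closed form agrees with direct enumeration for small d.
assert all(count_rules(d) == count_rules_brute_force(d) for d in range(2, 9))
```

Even the four-item toy dataset already admits 15 itemsets and 50 candidate rules, so the approach only stays practical for very small item catalogs.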

Assignment 3 (2 pt): Let's implement a simplified version of the naive algorithm:

In [16]:
items = list(set(item for transaction in transactions for item in transaction))
print(items)

generic_list = []
r = 2
generic_list.extend(combinations(items, r))
print(generic_list)

['eggs', 'milk', 'bread', 'cheese']
[('eggs', 'milk'), ('eggs', 'bread'), ('eggs', 'cheese'), ('milk', 'bread'), ('milk', 'cheese'), ('bread', 'cheese')]


In [17]:
gen_list = ['a', 'b', 'c']
gen_dict = {}
c = 0
for el in gen_list:
    gen_dict[el] = c
    c+=1
print(gen_dict)

for el in gen_dict: 
    print(el, gen_dict[el])

{'a': 0, 'b': 1, 'c': 2}
a 0
b 1
c 2


In [18]:
gen_list2 = []
random_words = ['apple', 'banana', 'hero', 'mouse', 'trip', 'bowl']
for i in range(3):
    random_support = np.random.uniform(0,1)
    random_confidence = np.random.uniform(0,1)
    random_lift = np.random.uniform(0,10)
    random_conviction = np.random.uniform(0,10)
    gen_list2.append({'antecedent': random_words[i],
        'consequent': random_words[i+3],
        'support': random_support,
        'confidence': random_confidence,
        'lift': random_lift,
        'conviction': random_conviction
        })
    
gen_pd_df = pd.DataFrame(gen_list2)
gen_pd_df

Unnamed: 0,antecedent,consequent,support,confidence,lift,conviction
0,apple,mouse,0.054435,0.918591,4.906565,6.969042
1,banana,trip,0.752556,0.579955,0.20247,0.941564
2,hero,bowl,0.565818,0.663944,7.188596,0.710726


In [19]:
def naive_association_rule_mining(transactions, min_support=0.2, min_confidence=0.5):
    items = list(set(item for transaction in transactions for item in transaction))

    # Steps 1-3: enumerate every non-empty itemset and keep those meeting min_support.
    # Itemsets are stored as sorted tuples so they can serve as dictionary keys.
    itemset_sups = {
        itemset: support
        for r in range(1, len(items)+1)
        for itemset in map(tuple, map(sorted, combinations(items, r)))
        if (support := calculate_support(itemset, transactions)) >= min_support
    }
    # Steps 4-5: split each frequent itemset into antecedent and consequent,
    # keeping only the rules whose confidence reaches min_confidence.
    rules = [
        {
            'antecedent': tuple(sorted(antecedent)),
            'consequent': tuple(sorted(set(itemset) - set(antecedent))),
            'support': itemset_sups[itemset],
            'confidence': itemset_sups[itemset] / itemset_sups[antecedent],
            'lift': itemset_sups[itemset] / (itemset_sups[antecedent] * itemset_sups[tuple(sorted(set(itemset) - set(antecedent)))]),
        }
        for itemset in itemset_sups if len(itemset) > 1
        for i in range(1, len(itemset))
        for antecedent in combinations(itemset, i)
        if itemset_sups[itemset] / itemset_sups[antecedent] >= min_confidence
    ]
    return pd.DataFrame(rules)

In [20]:
naive_rules = naive_association_rule_mining(transactions)
naive_rules.sort_values('lift', ascending=False)

Unnamed: 0,antecedent,consequent,support,confidence,lift
15,"(cheese, milk)","(eggs,)",0.25,1.0,1.333333
6,"(bread,)","(milk,)",0.375,0.75,1.2
7,"(milk,)","(bread,)",0.375,0.6,1.2
0,"(eggs,)","(milk,)",0.5,0.666667,1.066667
1,"(milk,)","(eggs,)",0.5,0.8,1.066667
10,"(bread, eggs)","(milk,)",0.25,0.666667,1.066667
14,"(cheese, eggs)","(milk,)",0.25,0.666667,1.066667
4,"(cheese,)","(eggs,)",0.375,0.75,1.0
2,"(bread,)","(eggs,)",0.375,0.75,1.0
13,"(cheese,)","(eggs, milk)",0.25,0.5,1.0
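
For contrast with the exhaustive enumeration, here is a minimal sketch of the level-wise idea behind Apriori: a (k+1)-itemset is only counted if all of its k-item subsets were frequent. This is a simplified illustration, not mlxtend's implementation; the function name `apriori_frequent_itemsets` is made up for this example.

```python
from itertools import combinations


def apriori_frequent_itemsets(transactions, min_support=0.2):
    """Return {frozenset: support} for all frequent itemsets, built level by level."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def support(itemset):
        return sum(1 for b in baskets if itemset <= b) / n

    # Level 1 candidates: every single item.
    current = {frozenset([item]) for b in baskets for item in b}
    frequent = {}
    k = 1
    while current:
        # Count only the surviving candidates of this level.
        level = {c: s for c in current if (s := support(c)) >= min_support}
        frequent.update(level)
        k += 1
        # Join step: merge frequent itemsets into candidates of size k.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: drop candidates that have an infrequent (k-1)-subset.
        current = {
            c for c in candidates
            if all(frozenset(sub) in level for sub in combinations(c, k - 1))
        }
    return frequent


transactions = [
    ['milk', 'bread', 'eggs'],
    ['milk', 'bread'],
    ['bread', 'eggs'],
    ['milk', 'eggs'],
    ['milk', 'bread', 'eggs', 'cheese'],
    ['cheese', 'eggs'],
    ['cheese'],
    ['milk', 'cheese', 'eggs'],
]

frequent = apriori_frequent_itemsets(transactions, min_support=0.2)
for itemset, sup in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)
```

On this dataset the pair {bread, cheese} falls below the threshold, so every larger itemset containing it is discarded without ever being counted against the transactions.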


### Bonus assignment (1 pt): Real-world Application
1. Download the Groceries dataset from the arules package in R (or find a similar market basket dataset).
2. Perform market basket analysis using mlxtend or another Python library.
3. Find the top 5 rules by lift and interpret what they mean for a grocery store manager.

In [21]:
from mlxtend.preprocessing import TransactionEncoder

with open("groceries.csv", "r") as f:
    transactions = [line.strip().split(',') for line in f if line.strip()]

t_encoder = TransactionEncoder()
te_data = t_encoder.fit(transactions).transform(transactions)
df_encoded = pd.DataFrame(te_data, columns=t_encoder.columns_)

df_encoded


Unnamed: 0,Instant food products,UHT-milk,abrasive cleaner,artif. sweetener,baby cosmetics,baby food,bags,baking powder,bathroom cleaner,beef,...,turkey,vinegar,waffles,whipped/sour cream,whisky,white bread,white wine,whole milk,yogurt,zwieback
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,True,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,True,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9830,False,False,False,False,False,False,False,False,False,True,...,False,False,False,True,False,False,False,True,False,False
9831,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
9832,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,True,False
9833,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [22]:
frequent_itemsets = apriori(df_encoded, min_support=0.01, use_colnames=True)

rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1.0)

top_rules = rules.sort_values(by='lift', ascending=False).head(5)
print(top_rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']])

                          antecedents                       consequents  \
440                            (curd)              (yogurt, whole milk)   
437              (yogurt, whole milk)                            (curd)   
422                 (root vegetables)  (other vegetables, citrus fruit)   
419  (other vegetables, citrus fruit)                 (root vegetables)   
538        (other vegetables, yogurt)              (whipped/sour cream)   

      support  confidence      lift  
440  0.010066    0.188931  3.372304  
437  0.010066    0.179673  3.372304  
422  0.010371    0.095149  3.295045  
419  0.010371    0.359155  3.295045  
538  0.010168    0.234192  3.267062  


**Interpretation**
1) (whole milk, yogurt) → (curd)  
About 18% of baskets containing both whole milk and yogurt also contain curd, which is 3.37 times the rate expected if the purchases were independent (lift 3.37).

2) (curd) → (whole milk, yogurt)  
About 19% of baskets containing curd also contain both whole milk and yogurt. The lift is the same 3.37, because lift does not depend on the direction of the rule.

3) (root vegetables) → (other vegetables, citrus fruit)  
About 9.5% of baskets containing root vegetables also contain both other vegetables and citrus fruit, with a lift of 3.30.

4) (other vegetables, citrus fruit) → (root vegetables)  
About 36% of baskets containing other vegetables and citrus fruit also contain root vegetables, with a lift of 3.30.

5) (other vegetables, yogurt) → (whipped/sour cream)  
About 23% of baskets containing other vegetables and yogurt also contain whipped/sour cream, with a lift of 3.27.
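
Because lift equals confidence divided by the support of the consequent, the baseline purchase rate of each consequent can be recovered from the printed table as confidence / lift. The sketch below does that arithmetic on the five rows; the numbers are copied from the output above, and the variable names are illustrative.

```python
# (antecedent, consequent, confidence, lift) copied from the printed top-5 table.
top_rules = [
    ("curd", "yogurt, whole milk", 0.188931, 3.372304),
    ("yogurt, whole milk", "curd", 0.179673, 3.372304),
    ("root vegetables", "other vegetables, citrus fruit", 0.095149, 3.295045),
    ("other vegetables, citrus fruit", "root vegetables", 0.359155, 3.295045),
    ("other vegetables, yogurt", "whipped/sour cream", 0.234192, 3.267062),
]

baselines = []
for antecedent, consequent, confidence, lift in top_rules:
    # lift = confidence / support(consequent)  =>  support(consequent) = confidence / lift
    baseline = confidence / lift
    baselines.append(baseline)
    print(f"({antecedent}) -> ({consequent}): "
          f"{baseline:.1%} of all baskets vs {confidence:.1%} given the antecedent")
```

Read this way, a rule with modest confidence can still be useful: root vegetables show up in roughly one basket in nine overall, but in more than one in three baskets that already hold other vegetables and citrus fruit.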