# Discovery of Frequent Itemsets and Association Rules

1. Frequent itemsets with support at least s
2. Association rules with confidence at least c


A priori:
C1
    - read transactions and count every single item
L1
    - keep the candidates in C1 whose count is at least s

Ck
    - generate candidates Ck from L(k-1), using monotonicity: every (k-1)-subset of a candidate must itself be frequent
    - read transactions and get counts
Lk
    - keep the candidates in Ck whose count is at least s



Keep in mind
- if a file is too large to fit into memory, it should be read in pieces, for example with a data loader that streams the transactions from disk
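
The point about large files can be sketched with a generator that yields one transaction at a time, so that only the counts are held in memory. This is a minimal sketch, not the loader used in this notebook: the names `iter_transactions` and `count_singletons` are hypothetical, and it assumes the same file layout as above (one transaction per line, items as whitespace-separated integers).

```python
from collections import Counter
from typing import Iterator, Set


def iter_transactions(filepath: str) -> Iterator[Set[int]]:
    """Yield one transaction at a time instead of building a full list."""
    with open(filepath, "r") as f:
        for line in f:
            if line.strip():  # skip blank lines
                yield set(map(int, line.split()))


def count_singletons(filepath: str) -> Counter:
    """Single pass over the file; only the item counts stay in memory."""
    counts = Counter()
    for transaction in iter_transactions(filepath):
        counts.update(transaction)  # each item counted once per transaction
    return counts
```

With this approach each pass of A priori (one per k) re-reads the file from disk, trading run time for memory.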

In [44]:
def load_transactions(filepath):
    transactions = []
    
    with open(filepath, 'r') as f:
        for line in f:
            # Convert each line to a set of integers
            transaction = set(map(int, line.strip().split()))
            transactions.append(transaction)
    
    return transactions

transactions = load_transactions('data/T10I4D100K.dat')
print(f"Number of transactions: {len(transactions)}")
print(f"First few transactions: {transactions[:3]}")

all_items = set().union(*transactions)
print(f"Total unique items: {len(all_items)}")

Number of transactions: 100000
First few transactions: [{448, 834, 164, 775, 328, 687, 240, 368, 274, 561, 52, 630, 825, 25, 538, 730}, {704, 834, 581, 39, 205, 814, 401, 120, 825, 124}, {674, 35, 712, 854, 759, 950, 249, 733}]
Total unique items: 870


# A Priori algorithm

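The idea is to incrementally build larger itemsets by combining frequently occurring smaller itemsets. This works because support is monotone: an itemset can only be frequent if all of its subsets are frequent.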


### Frequent singletons:

    1. find singletons
    2. c1 <- count singletons
    3. l1 <- filter frequent c1

### Frequent 2-itemsets

    1. candidates <- combine frequent singletons (l1) into 2-itemsets
    2. c2 <- count candidates
    3. l2 <- filter frequent c2

### Frequent k-itemsets

    1. candidates <- combine each frequent (k-1)-itemset in l(k-1) with the frequent singletons in l1
    2. prune candidates - keep only those whose (k-1)-subsets are all in l(k-1)
    3. ck <- count candidates
    4. lk <- filter ck
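
Steps 1 and 2 for k-itemsets can be sketched as one helper. Note that this sketch uses a common variant of the join step, pairing two frequent (k-1)-itemsets with each other rather than extending them with singletons; the prune step is the same monotonicity check. The name `generate_candidates` is hypothetical.

```python
from itertools import combinations
from typing import FrozenSet, Iterable, Set


def generate_candidates(prev_frequent: Iterable[FrozenSet[int]],
                        k: int) -> Set[FrozenSet[int]]:
    """Build size-k candidates from the frequent (k-1)-itemsets."""
    prev = set(prev_frequent)
    candidates = set()
    for a, b in combinations(prev, 2):
        union = a | b
        # Join step: keep unions of two (k-1)-itemsets that have exactly k items
        if len(union) != k:
            continue
        # Prune step (monotonicity): every (k-1)-subset must be frequent
        if all(frozenset(sub) in prev for sub in combinations(union, k - 1)):
            candidates.add(union)
    return candidates


# Four frequent pairs; {1, 2, 5} and {2, 3, 5} are pruned because
# {1, 5} and {3, 5} are not among the frequent pairs.
l2 = [frozenset({1, 2}), frozenset({1, 3}), frozenset({2, 3}), frozenset({2, 5})]
print(generate_candidates(l2, 3))  # {frozenset({1, 2, 3})}
```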



    




In [45]:
from tqdm import tqdm
import itertools

def A_priori(transactions, s, max_k=3):
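    # get counts for singletons (k=1)
    # NOTE: `all_items` is the global set built in the loading cell above, so C1
    # always has one entry per item of that dataset, whatever `transactions` holds
    c1 = {frozenset([item]): 0 for item in all_items}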
    for transaction in transactions:
        for item in transaction:
            c1[frozenset([item])] += 1
    
    # filter singletons with support s  
    l1 = {itemset: count for itemset, count in c1.items() if count >= s}
    
    # initialize result with L1
    L = [l1]
    C = [c1]
    
    print(f"C1: {len(c1)} - {list(c1)[:5]}..")
    print(f"L1: {len(l1)} - {list(l1)[:5]}..")
    
    # iterate for k=2 to max_k
    for k in range(2, max_k + 1):
        # generate candidates from previous frequent itemsets
        ck = {}
        prev_l = L[k-2]
        
        # generate candidates by combining previous frequent itemsets
        for itemset1 in prev_l:
            for singleton in l1:
                union = itemset1.union(singleton)
                if len(union) == k:
                    
                    # subset frequency check
                    all_subsets_frequent = True
                    for item in union:
                        subset = frozenset(union - {item})
                        if subset not in prev_l:
                            all_subsets_frequent = False
                            break
                    
                    if all_subsets_frequent:
                        ck[union] = 0
        
        print(f"C{k} candidates: {len(ck)} - {list(ck)[:5]}..")
 
        # count occurrences of candidates
        for transaction in tqdm(transactions):
            transaction_set = frozenset(transaction)
            # Enumerate the k-sized subsets of the transaction
            # (combinations of a set are already unique, so no extra set() is needed)
            transaction_subsets = itertools.combinations(transaction_set, k)
            # Check which candidates appear in these subsets
            for subset in transaction_subsets:
                subset = frozenset(subset)
                if subset in ck:
                    ck[subset] += 1
        
        print(f"C{k} counts: {len(ck)} - {list(ck)[:5]}..")
        C.append(ck)
        
        # filter candidates with minimum support
        lk = {itemset: count for itemset, count in ck.items() if count >= s}
        
        print(f"L{k}: {len(lk)} - {lk}")
        
        # if no frequent itemsets found, break
        if not lk:
            break
        L.append(lk)
            
    
    return L, C
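
The counting loop above enumerates every k-subset of every transaction, and the number of subsets grows quickly with transaction length. One possible refinement, sketched here as an assumption rather than as part of the function above, is to drop the items that occur in no candidate before enumerating subsets. The helper name `count_candidates` is hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, Set


def count_candidates(transactions: Iterable[Set[int]],
                     candidates: Iterable[FrozenSet[int]],
                     k: int) -> Dict[FrozenSet[int], int]:
    """Count how many transactions contain each size-k candidate."""
    counts = {c: 0 for c in candidates}
    # Items that occur in at least one candidate; all others cannot contribute
    relevant = set().union(*counts) if counts else set()
    for transaction in transactions:
        kept = relevant & transaction
        if len(kept) < k:
            continue  # too few relevant items to contain any candidate
        for subset in combinations(kept, k):
            key = frozenset(subset)
            if key in counts:
                counts[key] += 1
    return counts
```

The counts are the same as with the plain loop; only the number of enumerated subsets changes.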



In [46]:
s = 0.05*len(transactions)
print(f"s: {s}")
L, C = A_priori(transactions, s, 3)


s: 5000.0
C1: 870 - [frozenset({0}), frozenset({1}), frozenset({2}), frozenset({3}), frozenset({4})]..
L1: 10 - [frozenset({217}), frozenset({354}), frozenset({368}), frozenset({419}), frozenset({494})]..
C2 candidates: 45 - [frozenset({217, 354}), frozenset({368, 217}), frozenset({217, 419}), frozenset({217, 494}), frozenset({217, 529})]..


100%|██████████| 100000/100000 [00:01<00:00, 83307.51it/s]

C2 counts: 45 - [frozenset({217, 354}), frozenset({368, 217}), frozenset({217, 419}), frozenset({217, 494}), frozenset({217, 529})]..
L2: 0 - {}





In [47]:
L

[{frozenset({217}): 5375,
  frozenset({354}): 5835,
  frozenset({368}): 7828,
  frozenset({419}): 5057,
  frozenset({494}): 5102,
  frozenset({529}): 7057,
  frozenset({684}): 5408,
  frozenset({722}): 5845,
  frozenset({766}): 6265,
  frozenset({829}): 6810}]

In [48]:
c2 = sorted([(v,c) for v,c in C[1].items()], key=lambda x: x[1], reverse=True)
print(c2)


[(frozenset({368, 829}), 1194), (frozenset({368, 494}), 860), (frozenset({368, 529}), 640), (frozenset({684, 766}), 613), (frozenset({529, 829}), 584), (frozenset({722, 354}), 566), (frozenset({368, 766}), 504), (frozenset({217, 722}), 498), (frozenset({722, 684}), 443), (frozenset({217, 529}), 403), (frozenset({368, 722}), 392), (frozenset({368, 684}), 387), (frozenset({722, 419}), 366), (frozenset({368, 419}), 355), (frozenset({684, 829}), 349), (frozenset({217, 419}), 344), (frozenset({529, 684}), 334), (frozenset({354, 766}), 329), (frozenset({722, 766}), 328), (frozenset({829, 766}), 321), (frozenset({368, 354}), 319), (frozenset({529, 766}), 317), (frozenset({368, 217}), 303), (frozenset({529, 354}), 301), (frozenset({722, 829}), 294), (frozenset({529, 722}), 283), (frozenset({217, 354}), 280), (frozenset({217, 766}), 276), (frozenset({217, 829}), 275), (frozenset({829, 494}), 267), (frozenset({354, 419}), 263), (frozenset({354, 829}), 259), (frozenset({419, 829}), 259), (frozens

# Association Rules

1. frequent itemset I
2. rule generation
    - for each non-empty proper subset A of I:
    
        1. conf(A -> I - A) = supp(I) / supp(A)              given A, how likely we are to also observe the rest of I

        2. if conf >= c:

            We've found a rule!
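
As a small worked illustration of the confidence formula, the sketch below computes it from a dictionary of support counts. The function name `confidence` and the counts in the usage lines are made up for the illustration.

```python
from typing import Dict, FrozenSet


def confidence(support: Dict[FrozenSet[int], int],
               itemset: FrozenSet[int],
               antecedent: FrozenSet[int]) -> float:
    """conf(A -> I - A) = supp(I) / supp(A) for a non-empty proper subset A of I."""
    if not antecedent or not antecedent < itemset:
        raise ValueError("antecedent must be a non-empty proper subset of itemset")
    return support[itemset] / support[antecedent]


# Made-up counts: item 1 occurs in 4 transactions, items 1 and 2 together in 3.
support = {frozenset({1}): 4, frozenset({1, 2}): 3}
print(confidence(support, frozenset({1, 2}), frozenset({1})))  # 0.75
```

With a threshold of c = 0.75 this rule would be kept; with c = 0.8 it would be dropped.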

In [49]:
transactions = [
    {1, 2, 3},    # B1 {m, c, b}
    {1, 4, 5},    # B2 {m, p, j}
    {1, 2, 3, 6}, # B3 {m, c, b, n}
    {2, 5},       # B4 {c, j}
    {1, 4, 3},    # B5 {m, p, b}
    {1, 2, 3, 5}, # B6 {m, c, b, j}
    {2, 3, 5},    # B7 {c, b, j}
    {3, 2}        # B8 {b, c}
]


L, C = A_priori(transactions, s=3, max_k=3)

C1: 870 - [frozenset({0}), frozenset({1}), frozenset({2}), frozenset({3}), frozenset({4})]..
L1: 4 - [frozenset({1}), frozenset({2}), frozenset({3}), frozenset({5})]..
C2 candidates: 6 - [frozenset({1, 2}), frozenset({1, 3}), frozenset({1, 5}), frozenset({2, 3}), frozenset({2, 5})]..


100%|██████████| 8/8 [00:00<00:00, 11362.83it/s]


C2 counts: 6 - [frozenset({1, 2}), frozenset({1, 3}), frozenset({1, 5}), frozenset({2, 3}), frozenset({2, 5})]..
L2: 4 - {frozenset({1, 2}): 3, frozenset({1, 3}): 4, frozenset({2, 3}): 5, frozenset({2, 5}): 3}
C3 candidates: 1 - [frozenset({1, 2, 3})]..


100%|██████████| 8/8 [00:00<00:00, 88069.38it/s]

C3 counts: 1 - [frozenset({1, 2, 3})]..
L3: 1 - {frozenset({1, 2, 3}): 3}





In [50]:

import itertools
from typing import List, Dict, FrozenSet

def find_association_rules(L:List[Dict[FrozenSet[int], int]], c=0.5):
    '''
    L: list of frequent itemsets (L1, L2, ... Lk), where Lk is a dict of itemsets and their counts
    '''
    
    # get all frequent itemsets
    frequent_itemsets = {}
    for level in L:
        frequent_itemsets.update(level)
    
    # filter out singletons
    frequent_itemsets = list(k for k in frequent_itemsets.keys() if len(k) > 1)
    
    # support counts
    counts = {}
    for level in L:
        counts.update(level)
    
    rules = []  # list of rule dicts; sorted by confidence below
    for itemset in frequent_itemsets:
        n = len(itemset)
        for i in range(1, n):
            for X in map(frozenset, itertools.combinations(itemset, i)):
                Y = itemset - X
                if len(Y) > 0:
                    conf = counts[itemset] / counts[X]
                    if conf >= c:
                        rules.append({
                            'X': sorted(list(X)),
                            'Y': sorted(list(Y)),
                            'confidence': conf
                        })
    
    # Sort rules by confidence
    rules.sort(key=lambda x: x['confidence'], reverse=True)
    
    # Print rules in a formatted way
    print(f"{'X':<20} {'→':<5} {'Y':<20} {'Confidence':<10}")
    print("-" * 55)
    for rule in rules:
        x_str = str(rule['X'])
        y_str = str(rule['Y'])
        conf_str = f"{rule['confidence']:.3f}"
        print(f"{x_str:<20} {'→':<5} {y_str:<20} {conf_str:<10}")
    
    return rules


In [51]:
find_association_rules(L, c=0.75)

X                    →     Y                    Confidence
-------------------------------------------------------
[1, 2]               →     [3]                  1.000     
[2]                  →     [3]                  0.833     
[3]                  →     [2]                  0.833     
[1]                  →     [3]                  0.800     
[5]                  →     [2]                  0.750     
[1, 3]               →     [2]                  0.750     


[{'X': [1, 2], 'Y': [3], 'confidence': 1.0},
 {'X': [2], 'Y': [3], 'confidence': 0.8333333333333334},
 {'X': [3], 'Y': [2], 'confidence': 0.8333333333333334},
 {'X': [1], 'Y': [3], 'confidence': 0.8},
 {'X': [5], 'Y': [2], 'confidence': 0.75},
 {'X': [1, 3], 'Y': [2], 'confidence': 0.75}]

# Putting it all together

1. Find frequent itemsets
2. Find association rules

In [54]:
import time
start = time.perf_counter()
transactions = load_transactions('data/T10I4D100K.dat')

s = 0.01*len(transactions)
print(f"s: {s}")
L, C = A_priori(transactions, s, 3)
end = time.perf_counter()
print(f"Time taken: {end - start} seconds")


s: 1000.0
C1: 870 - [frozenset({0}), frozenset({1}), frozenset({2}), frozenset({3}), frozenset({4})]..
L1: 375 - [frozenset({1}), frozenset({4}), frozenset({5}), frozenset({6}), frozenset({8})]..
C2 candidates: 70125 - [frozenset({1, 4}), frozenset({1, 5}), frozenset({1, 6}), frozenset({8, 1}), frozenset({1, 10})]..


100%|██████████| 100000/100000 [00:02<00:00, 40861.18it/s]


C2 counts: 70125 - [frozenset({1, 4}), frozenset({1, 5}), frozenset({1, 6}), frozenset({8, 1}), frozenset({1, 10})]..
L2: 9 - {frozenset({704, 39}): 1107, frozenset({825, 39}): 1187, frozenset({217, 346}): 1336, frozenset({227, 390}): 1049, frozenset({368, 682}): 1193, frozenset({368, 829}): 1194, frozenset({722, 390}): 1042, frozenset({704, 825}): 1102, frozenset({829, 789}): 1194}
C3 candidates: 1 - [frozenset({704, 825, 39})]..


100%|██████████| 100000/100000 [00:04<00:00, 22787.47it/s]

C3 counts: 1 - [frozenset({704, 825, 39})]..
L3: 1 - {frozenset({704, 825, 39}): 1035}
Time taken: 7.598735417239368 seconds





In [55]:
find_association_rules(L, c=0.5)

X                    →     Y                    Confidence
-------------------------------------------------------
[704, 825]           →     [39]                 0.939     
[39, 704]            →     [825]                0.935     
[39, 825]            →     [704]                0.872     
[704]                →     [39]                 0.617     
[704]                →     [825]                0.614     
[227]                →     [390]                0.577     
[704]                →     [39, 825]            0.577     


[{'X': [704, 825], 'Y': [39], 'confidence': 0.9392014519056261},
 {'X': [39, 704], 'Y': [825], 'confidence': 0.9349593495934959},
 {'X': [39, 825], 'Y': [704], 'confidence': 0.8719460825610783},
 {'X': [704], 'Y': [39], 'confidence': 0.617056856187291},
 {'X': [704], 'Y': [825], 'confidence': 0.6142697881828316},
 {'X': [227], 'Y': [390], 'confidence': 0.577007700770077},
 {'X': [704], 'Y': [39, 825], 'confidence': 0.5769230769230769}]

# Conclusion
The association rules suggest that if 704 and 825 are bought, then 39 is also bought with a confidence of almost 94%. There seems to be a strong association between these three products, although 39 and 825 are bought together without 704 more often than the other two pairs are bought without the third product (confidence of about 87% versus 93-94%).

If 704 is bought, the customer also buys 39 with a confidence of about 62%, and 825 with a confidence of about 61%.
If 704 is bought together with either 39 or 825, then with a confidence of more than 93% the customer buys all three.


It would be clever to arrange these products close to each other in the store.

227 and 390 also seem to be associated, though with a lower confidence of about 58% (227 -> 390).