# Discovery of Frequent Itemsets and Association Rules

1. Find frequent itemsets with support at least s
2. Generate association rules with confidence at least c
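
The two measures can be sketched on a toy example. This is only an illustration: it assumes support is an absolute transaction count (as in the code further down) and that the confidence of a rule X -> Y is support(X ∪ Y) / support(X). The helper names `support` and `confidence` and the toy transactions are made up for this sketch.

```python
def support(itemset, transactions):
    """Number of transactions that contain every item of itemset."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset.issubset(t))


def confidence(antecedent, consequent, transactions):
    """support(antecedent ∪ consequent) / support(antecedent)."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0  # the rule never applies, so report zero confidence
    both = support(frozenset(antecedent) | frozenset(consequent), transactions)
    return both / base


# Hypothetical toy data, not taken from the dataset used below
toy = [{1, 2, 3}, {1, 2}, {1, 3}, {2, 3}, {1, 2, 3}]
print(support({1, 2}, toy))       # 3
print(confidence({1}, {2}, toy))  # 0.75
```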


A priori:
C1
    - read transactions and count every single item
L1
    - keep the candidates in C1 with support >= s

Ck
    - generate candidates Ck from L(k-1); by monotonicity, every (k-1)-subset of a candidate must itself be in L(k-1)
    - read transactions and count the candidates
Lk
    - keep the candidates in Ck with support >= s



Keep in mind
- if the files are too large to fit into memory, they should be split and read piece by piece, e.g. with a dataloader
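
One way to avoid holding the whole file in memory is to stream it. The sketch below is an assumption about how such a loader could look, not the loader used in this notebook. `iter_transactions` and `count_singletons` are hypothetical names. Each A-priori pass would reread the file, so only the counts stay in memory.

```python
def iter_transactions(filepath):
    """Yield one transaction at a time as a frozenset of ints."""
    with open(filepath, 'r') as f:
        for line in f:
            fields = line.split()
            if fields:  # skip blank lines
                yield frozenset(map(int, fields))


def count_singletons(filepath):
    """First pass: count single items without storing the transactions."""
    counts = {}
    for transaction in iter_transactions(filepath):
        for item in transaction:
            counts[item] = counts.get(item, 0) + 1
    return counts
```

Calling `count_singletons('data/T10I4D100K.dat')` would perform the C1 pass. The later passes can iterate over `iter_transactions(...)` in the same way.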

In [11]:
def load_transactions(filepath):
    transactions = []
    
    with open(filepath, 'r') as f:
        for line in f:
            # Convert each line to a set of integers
            transaction = set(map(int, line.strip().split()))
            transactions.append(transaction)
    
    return transactions

# Example usage:
transactions = load_transactions('data/T10I4D100K.dat')
print(f"Number of transactions: {len(transactions)}")
print(f"First few transactions: {transactions[:3]}")

# If you need to know the item universe:
all_items = set().union(*transactions)
print(f"Total unique items: {len(all_items)}")

Number of transactions: 100000
First few transactions: [{448, 834, 164, 775, 328, 687, 240, 368, 274, 561, 52, 630, 825, 25, 538, 730}, {704, 834, 581, 39, 205, 814, 401, 120, 825, 124}, {674, 35, 712, 854, 759, 950, 249, 733}]
Total unique items: 870


# A Priori algorithm

The idea is to incrementally build larger itemsets by combining frequently occurring smaller itemsets.


### Frequent singletons

    1. find singletons
    2. c1 <- count singletons
    3. l1 <- filter frequent c1

### Frequent 2-itemsets

    1. candidates <- combine singletons to 2-itemsets
    2. c2 <- count candidates
    3. l2 <- filter c2

### Frequent k-itemsets

    1. candidates <- combine frequent singletons with frequent (k-1)-itemsets
    2. prune candidates - keep a candidate only if all of its (k-1)-subsets are in l(k-1)
    3. ck <- count candidates
    4. lk <- filter ck
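
The candidate and pruning steps for a general k can be sketched in isolation. This is a minimal illustration of the steps listed above, with a hypothetical helper name `generate_candidates` and made-up toy inputs. It extends each frequent (k-1)-itemset by one frequent singleton and keeps the result only if every (k-1)-subset is frequent.

```python
def generate_candidates(prev_frequent, frequent_singletons):
    """Build k-itemset candidates from frequent (k-1)-itemsets and singletons."""
    candidates = set()
    for itemset in prev_frequent:
        for singleton in frequent_singletons:
            union = itemset | singleton
            if len(union) != len(itemset) + 1:
                continue  # the singleton was already part of the itemset
            # monotonicity: every (k-1)-subset must itself be frequent
            if all(union - {item} in prev_frequent for item in union):
                candidates.add(union)
    return candidates


# Hypothetical toy input: frequent singletons and frequent pairs
l1_toy = {frozenset([i]) for i in (1, 2, 3, 4)}
l2_toy = {frozenset(p) for p in [(1, 2), (1, 3), (2, 3), (2, 4)]}
print(generate_candidates(l2_toy, l1_toy))  # {frozenset({1, 2, 3})}
```

Here {1, 2, 4} is pruned because {1, 4} is not frequent, and {2, 3, 4} is pruned because {3, 4} is not frequent.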



    




In [31]:
from tqdm import tqdm

def A_priori(transactions, s, max_k=3):
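    # get counts for singletons (k=1); derive the item universe from the
    # transactions instead of relying on the global all_items
    items = set().union(*transactions)
    c1 = {frozenset([item]): 0 for item in items}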
    for transaction in transactions:
        for item in transaction:
            c1[frozenset([item])] += 1
    
    # filter singletons with support s  
    l1 = {itemset for itemset, count in c1.items() if count >= s}
    
    # initialize result with L1
    L = [l1]
    C = [c1]
    
    print(f"C1: {len(c1)} - {c1}")
    print(f"L1: {len(l1)} - {l1}")
    
    # iterate for k=2 to max_k
    for k in range(2, max_k + 1):
        # generate candidates from previous frequent itemsets
        ck = {}
        prev_l = L[k-2]
        
        # generate candidates by combining previous frequent itemsets
        for itemset1 in prev_l:
            for singleton in l1:
                union = itemset1.union(singleton)
                if len(union) == k:
                    
                    # subset frequency check
                    all_subsets_frequent = True
                    for item in union:
                        subset = frozenset(union - {item})
                        if subset not in prev_l:
                            all_subsets_frequent = False
                            break
                    
                    if all_subsets_frequent:
                        ck[union] = 0
        
        print(f"C{k} candidates: {len(ck)} - {ck}")
 
        # count occurrences of candidates
        for transaction in tqdm(transactions):
            transaction_set = frozenset(transaction)
            for candidate in ck:
                if candidate.issubset(transaction_set):
                    ck[candidate] += 1
        
        print(f"C{k} counts: {len(ck)} - {ck}")
        C.append(ck)
        
        # filter candidates with minimum support
        lk = {itemset for itemset, count in ck.items() if count >= s}
        
        print(f"L{k}: {len(lk)} - {lk}")
        
        # if no frequent itemsets found, break
        if not lk:
            break
        L.append(lk)
            
    
    return L, C



In [43]:
s = 0.06*len(transactions)
print(f"s: {s}")
L, C = A_priori(transactions, s, 3)


s: 6000.0
C1: 870 - {frozenset({0}): 594, frozenset({1}): 1535, frozenset({2}): 673, frozenset({3}): 531, frozenset({4}): 1394, frozenset({5}): 1094, frozenset({6}): 2149, frozenset({7}): 997, frozenset({8}): 3090, frozenset({10}): 1351, frozenset({11}): 525, frozenset({12}): 3415, frozenset({13}): 35, frozenset({14}): 197, frozenset({15}): 458, frozenset({16}): 150, frozenset({17}): 1683, frozenset({18}): 813, frozenset({19}): 121, frozenset({20}): 40, frozenset({21}): 2666, frozenset({22}): 397, frozenset({23}): 128, frozenset({24}): 191, frozenset({25}): 1395, frozenset({26}): 527, frozenset({27}): 2165, frozenset({28}): 1454, frozenset({29}): 171, frozenset({31}): 1666, frozenset({32}): 4248, frozenset({33}): 1460, frozenset({34}): 56, frozenset({35}): 1984, frozenset({36}): 528, frozenset({37}): 1249, frozenset({38}): 2402, frozenset({39}): 4258, frozenset({40}): 457, frozenset({41}): 1353, frozenset({42}): 119, frozenset({43}): 1721, frozenset({44}): 903, frozenset({45}): 1728, f

100%|██████████| 100000/100000 [00:00<00:00, 1578574.49it/s]

C2 counts: 6 - {frozenset({368, 829}): 1194, frozenset({368, 766}): 504, frozenset({368, 529}): 640, frozenset({829, 766}): 321, frozenset({529, 829}): 584, frozenset({529, 766}): 317}
L2: 0 - set()





In [44]:
c2 = sorted(C[1].items(), key=lambda x: x[1], reverse=True)
print(c2)


[(frozenset({368, 829}), 1194), (frozenset({368, 529}), 640), (frozenset({529, 829}), 584), (frozenset({368, 766}), 504), (frozenset({829, 766}), 321), (frozenset({529, 766}), 317)]
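
With s = 6000 the run above finds no frequent pairs, so there is nothing to build rules from yet. Once some Lk with k >= 2 is non-empty (for example with a lower s), the second step from the overview could look roughly like the sketch below. It assumes the `L, C` structure returned by `A_priori` and the usual definition confidence(X -> Y) = count(X ∪ Y) / count(X). The function name `association_rules` and the toy inputs are hypothetical.

```python
from itertools import combinations


def association_rules(L, C, c):
    """Return (antecedent, consequent, confidence) for rules with confidence >= c."""
    # merge the per-level count tables into one lookup
    counts = {}
    for ck in C:
        counts.update(ck)

    rules = []
    for lk in L[1:]:  # singletons cannot be split into a rule
        for itemset in lk:
            for r in range(1, len(itemset)):
                for left in combinations(itemset, r):
                    antecedent = frozenset(left)
                    conf = counts[itemset] / counts[antecedent]
                    if conf >= c:
                        rules.append((antecedent, itemset - antecedent, conf))
    return rules


# Hypothetical toy counts, not results from the dataset above
fs = frozenset
L_toy = [{fs([1]), fs([2])}, {fs([1, 2])}]
C_toy = [{fs([1]): 4, fs([2]): 5}, {fs([1, 2]): 3}]
print(association_rules(L_toy, C_toy, 0.7))
```

In the toy data only {1} -> {2} passes the threshold (3 / 4 = 0.75), while {2} -> {1} has confidence 3 / 5 = 0.6 and is dropped.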
