# Homework 2: Discovery of frequent itemsets and association rules
Homework Group 54: Xu Wang

Discovering association rules between itemsets in a sales transaction database (a set of baskets) consists of two sub-problems:

1. Finding frequent itemsets with support at least s;

2. Generating association rules with confidence at least c from the itemsets found in the first step.

In [1]:
import itertools

# Read in the dataset
The dataset is provided with the assignment.
Each transaction is preprocessed and stored as a basket, and the support threshold is set to 1% of the number of baskets.
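
As a small illustration of this preprocessing, the sketch below parses a few made-up transaction lines from an in-memory string instead of the assignment file and derives the 1% threshold. The helper name `parse_baskets` and the sample data are assumptions for illustration only; sorting the items after converting them to integers is also a choice made for this sketch.

```python
import io

def parse_baskets(lines):
    """Turn whitespace-separated item ids into one sorted list of ints per line."""
    baskets = []
    for line in lines:
        items = line.split()      # split() with no argument also drops the trailing newline
        if items:                 # skip blank lines
            baskets.append(sorted(map(int, items)))
    return baskets

# Hypothetical sample data, laid out as one transaction per line
sample = io.StringIO("25 52 164 \n39 120 124 205 \n\n100 25 \n")
toy_baskets = parse_baskets(sample)
toy_threshold = len(toy_baskets) * 0.01   # 1% of the number of baskets

print(toy_baskets)
print(toy_threshold)
```

Converting to `int` before sorting gives a numeric order (25 before 100), whereas sorting the raw strings would place `'100'` before `'25'`.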

In [2]:
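def read_data(name):  # read the dataset as a list of baskets (one list of item ids per transaction)
    baskets = []
    with open(name, 'r') as f:
        for line in f:
            items = line.split()  # split on any whitespace; this also drops the trailing newline
            items.sort()  # note: sorts the item strings lexicographically, e.g. '100' < '25'
            baskets.append(list(map(int, items)))
    s = len(baskets) * 0.01  # support threshold: 1% of the number of baskets
    return baskets, s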

In [3]:
baskets, s = read_data("T10I4D100K.dat")
frequent_itemsets = [] # A list of dictionaries
print('Number of transactions', len(baskets))
print('Support threshold', s)

Number of transactions 100000
Support threshold 1000.0


# Task 1: Implement the A-priori algorithm
The implementation below follows this general structure:
a. scan the dataset, count each item, and keep the frequent singletons
b. generate candidate k-itemsets
c. count the candidates and keep those whose support reaches the support threshold as the frequent k-itemsets
d. repeat steps b and c until no more frequent k-itemsets are found

In [4]:
### step a: find frequent singletons
def get_freq_singletons(baskets, support_threshold):
    freq_dict = {}
    for basket in baskets:
        for item in basket: # pick out single item from basket
            if item in freq_dict:
                freq_dict[item] += 1
            else:
                freq_dict[item] = 1
    # Keys are 1-tuples such as (item,) so that they have the same type as the keys for pairs, triples, etc.
    singletons_dict = {(item,): count for item, count in freq_dict.items() if count >= support_threshold}
    singletons = list(singletons_dict)
    return singletons, singletons_dict

In [5]:
L1, L1_counts = get_freq_singletons(baskets, s)  # not named `dict`, which would shadow the built-in type
print('Number of frequent singletons: ', len(L1))
print(L1)
frequent_itemsets.append(L1_counts)

Number of frequent singletons:  375
[(240,), (25,), (274,), (368,), (448,), (52,), (538,), (561,), (630,), (687,), (775,), (825,), (834,), (120,), (205,), (39,), (401,), (581,), (704,), (814,), (35,), (674,), (733,), (854,), (950,), (422,), (449,), (857,), (895,), (937,), (964,), (229,), (283,), (294,), (381,), (708,), (738,), (766,), (853,), (883,), (966,), (978,), (104,), (143,), (569,), (620,), (798,), (185,), (214,), (350,), (529,), (658,), (682,), (782,), (809,), (947,), (970,), (227,), (390,), (192,), (208,), (279,), (280,), (496,), (530,), (597,), (618,), (675,), (71,), (720,), (914,), (932,), (183,), (217,), (276,), (653,), (706,), (878,), (161,), (175,), (177,), (424,), (490,), (571,), (623,), (795,), (910,), (960,), (125,), (130,), (392,), (461,), (862,), (27,), (78,), (900,), (921,), (147,), (411,), (572,), (579,), (778,), (803,), (266,), (290,), (458,), (523,), (614,), (888,), (944,), (204,), (334,), (43,), (480,), (513,), (70,), (874,), (151,), (504,), (890,), (310,), (419

In [6]:
### step b: extend each frequent itemset with every frequent singleton it does not already contain
def generate_candidates(itemsets, freq_singletons):
    candidates = {}
    for itemset in itemsets:
        for singleton in freq_singletons:
            if singleton[0] not in itemset:
                candidate = tuple(sorted(itemset + singleton))
                if candidate not in candidates:
                    candidates[candidate] = 0
    return candidates
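
The classic A-priori candidate generation also includes a pruning step: a k-itemset can only be frequent if all of its (k-1)-subsets are frequent, so candidates with an infrequent subset can be dropped before counting. The function below is a minimal sketch of that idea, not part of the submitted solution; the name `prune_candidates` is hypothetical, and it assumes itemsets are sorted tuples, as produced by `generate_candidates`.

```python
import itertools

def prune_candidates(candidates, prev_frequent):
    """Keep only candidates whose (k-1)-subsets are all in prev_frequent."""
    prev = set(prev_frequent)
    pruned = {}
    for candidate in candidates:
        k = len(candidate)
        # combinations() of a sorted tuple yields sorted tuples, matching the keys in prev
        if all(sub in prev for sub in itertools.combinations(candidate, k - 1)):
            pruned[candidate] = 0
    return pruned

# Toy example with made-up frequent pairs
frequent_pairs = [(1, 2), (1, 3), (2, 3), (2, 4)]
toy_candidates = {(1, 2, 3): 0, (1, 2, 4): 0, (2, 3, 4): 0}
print(prune_candidates(toy_candidates, frequent_pairs))
```

Here only `(1, 2, 3)` survives, because `(1, 4)` and `(3, 4)` are not among the frequent pairs. Pruning does not change the final result; it only reduces the number of candidates that have to be counted.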

In [7]:
### step c: count the occurrences of each candidate across all baskets and keep those with count >= support_threshold
# Checking whether each candidate is a subset of each basket turned out to be too slow: there are too many candidates, especially pairs.
# This is implemented the other way around: generate the k-item subsets of each basket and check whether each one is a candidate.
def filter_candidates(baskets, candidates, candidate_length, support_threshold):
    for basket in baskets:
        # Candidate keys are numerically sorted tuples, so the basket is sorted numerically
        # before generating subsets; otherwise some matching subsets would be missed.
        sub_ks = itertools.combinations(sorted(basket), candidate_length)
        for subk in sub_ks:
            if subk in candidates:
                candidates[subk] += 1
    freq_counts = {candidate: count for candidate, count in candidates.items() if count >= support_threshold}
    freq_candidates = list(freq_counts)
    return freq_candidates, freq_counts
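
A dictionary lookup on a tuple compares the items position by position, so a subset generated from a basket only matches a candidate key if its items are in the same order as in the key. The toy example below, with made-up baskets and candidates, shows the counting idea in isolation; all names in it are illustrative.

```python
import itertools

toy_baskets = [[1, 2, 3], [3, 1], [2, 3]]
toy_candidates = {(1, 2): 0, (1, 3): 0, (2, 3): 0}   # keys are sorted tuples

for basket in toy_baskets:
    # Sorting first makes every generated pair a sorted tuple as well
    for pair in itertools.combinations(sorted(basket), 2):
        if pair in toy_candidates:
            toy_candidates[pair] += 1

print(toy_candidates)
```

Without the `sorted(...)` call, the basket `[3, 1]` would produce the pair `(3, 1)`, which is not a key of `toy_candidates`, and that occurrence of `(1, 3)` would go uncounted.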

In [8]:
### step d: iterate over steps b and c until no more frequent itemsets are found
L = L1
while L:
    C = generate_candidates(L, L1)
    L, counts = filter_candidates(baskets, C, len(L[0]) + 1, s)
    if L:
        print('Number of frequent {}-itemsets: {}'.format(len(L[0]), len(L)))
        print(L)
        frequent_itemsets.append(counts)

Number of frequent 2-itemsets: 9
[(368, 682), (368, 829), (39, 825), (704, 825), (39, 704), (227, 390), (390, 722), (217, 346), (789, 829)]
Number of frequent 3-itemsets: 1
[(39, 704, 825)]


# Task 2: Get association rules
The support of a rule X → Y is the number of transactions that contain X ∪ Y.
The confidence of a rule X → Y is the fraction of the transactions containing X that also contain Y, i.e. support(X ∪ Y) / support(X).
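
To make the two definitions concrete, here is a small worked example on made-up baskets. The helper `support` and the toy data are assumptions for illustration and are separate from the functions used in the cells that follow.

```python
toy_baskets = [{1, 2, 3}, {1, 2}, {1, 3}, {2, 3}, {1, 2, 3}]

def support(itemset, baskets):
    """Number of baskets that contain every item of itemset."""
    return sum(1 for basket in baskets if set(itemset) <= basket)

X, Y = {1}, {2}
rule_support = support(X | Y, toy_baskets)                 # baskets containing both 1 and 2
rule_confidence = rule_support / support(X, toy_baskets)   # share of baskets with 1 that also have 2

print(rule_support, rule_confidence)
```

Three of the five baskets contain both items, and four contain item 1, so the rule {1} → 2 has support 3 and confidence 3/4.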

In [9]:
# frequent_itemsets[0] # singletons
# frequent_itemsets[1] # doubletons
# frequent_itemsets[2] # tripletons

In [10]:
def conf(XUY, X, frequent_itemsets):
    XUY_support = frequent_itemsets[len(XUY) - 1][XUY]
    X_support = frequent_itemsets[len(X) - 1][X]
    return XUY_support / X_support

In [11]:
def print_rule(left, right, rule_conf):  # parameter renamed so it does not shadow the conf() function
    print('{' + ','.join(map(str, left)) + '} -> ' + str(right[0]) + ' conf: ' + str(rule_conf))

In [12]:
### Build rules only from frequent itemsets, since the support of a rule must reach the support threshold
confidence = 0.5
for k_itemsets in frequent_itemsets[1:]:
    for k_itemset in k_itemsets:
        for i in range(len(k_itemset)):
            right = (k_itemset[i],)  # pick one item as the right-hand side
            left = tuple(sorted(set(k_itemset) - set(right)))
            rule_conf = conf(k_itemset, left, frequent_itemsets)  # compute once and reuse
            if rule_conf >= confidence:
                print_rule(left, right, rule_conf)

{704} -> 825 conf: 0.6142697881828316
{704} -> 39 conf: 0.617056856187291
{227} -> 390 conf: 0.577007700770077
{704,825} -> 39 conf: 0.9392014519056261
{39,825} -> 704 conf: 0.8719460825610783
{39,704} -> 825 conf: 0.9349593495934959
