# HW2--Group6

kaidx@kth.se & zhenlinz@kth.se

Task

You are to solve the first sub-problem: implement the A-Priori algorithm for finding frequent itemsets with support at least s in a dataset of sales transactions. Recall that the support of an itemset is the number of transactions containing the itemset. To test and evaluate your implementation, write a program that uses your A-Priori implementation to discover frequent itemsets with support at least s in a given dataset of sales transactions.
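
As a small illustration of the support definition, the sketch below counts how many toy baskets contain a given itemset. The basket contents and the helper name `support_of` are made up for this example and are not part of the assignment data.

```python
def support_of(itemset, baskets):
    """Number of baskets that contain every item of `itemset`."""
    target = set(itemset)
    return sum(1 for basket in baskets if target.issubset(basket))


# Hypothetical toy baskets, not taken from T10I4D100K.dat
toy_baskets = [[1, 2, 3], [1, 2], [2, 3], [1, 3, 4]]
print(support_of((1, 2), toy_baskets))  # baskets containing both 1 and 2
print(support_of((3,), toy_baskets))    # baskets containing 3
```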

The implementation can be done using any big data processing framework, such as Apache Spark, Apache Flink, or no framework, e.g., in Java, Python, etc.  

Optional task for extra bonus

Solve the second sub-problem, i.e., develop and implement an algorithm for generating association rules between frequent itemsets discovered by using the A-Priori algorithm in a dataset of sales transactions. The rules must have support at least s and confidence at least c, where s and c are given as input parameters.



# Implementation

In [19]:
import itertools
import time

Read data and create a baskets list.

In [20]:
# Create the baskets list: each line of the file is one basket, and its items are stored as ints
def read_baskets():
    baskets = []
    with open('T10I4D100K.dat') as f:
        for line in f:
            items = line.split()  # split on any whitespace; this also drops the trailing newline
            baskets.append(list(map(int, items)))
    return baskets

For each k, we construct two sets of k-tuples (sets of size k): 
1. Ck = candidate k-tuples = those that might be frequent sets (support ≥ s) based on information from the pass for k–1
2. Lk = the set of truly frequent k-tuples, i.e. only those k-tuples from Ck that have support at least s
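
The relationship between Ck and L(k-1) can be sketched on toy data as follows. This version uses the classic monotonicity check (a k-tuple is kept as a candidate only if all of its (k-1)-subsets are frequent), which is an alternative to the singleton-extension approach used in this notebook. The function name `candidates_from` and the toy input are hypothetical.

```python
import itertools


def candidates_from(prev_frequent, k):
    """Build C_k from L_{k-1}: keep k-tuples whose (k-1)-subsets are all frequent."""
    prev = set(prev_frequent)
    # All items that appear in at least one frequent (k-1)-tuple
    items = sorted({item for itemset in prev for item in itemset})
    candidates = set()
    for combo in itertools.combinations(items, k):
        # combinations() of a sorted sequence yields sorted tuples
        if all(sub in prev for sub in itertools.combinations(combo, k - 1)):
            candidates.add(combo)
    return candidates


# Hypothetical frequent pairs
l2 = {(1, 2), (1, 3), (2, 3), (2, 4)}
print(sorted(candidates_from(l2, 3)))
```

Only (1, 2, 3) survives here, because every other triple has at least one pair that is missing from `l2`.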

In [21]:
# Count the support of items in C1 (to generate singletons later)
def count_singletons(baskets):
    count = {}
    for basket in baskets:
        for item in basket:
            if item in count:
                count[item] += 1
            else:
                count[item] = 1
    return count


# Keep only the itemsets whose support is at least the threshold, together with their true support counts.
def filter_frequent_items(items_count, support):
    return {item: items_count[item] for item in items_count if items_count[item] >= support}



In [22]:
# Construct candidate k-tuples from the previous (k-1) frequent itemsets and the frequent singletons
def generate_candidates(items, singletons):
    candidates = {}
    for item in items:
        for singleton in singletons:
            if singleton[0] not in item:
                # Extend the itemset with each frequent singleton it does not already contain
                candidate = tuple(sorted(item + singleton))
                if candidate not in candidates:
                    candidates[candidate] = 0
    return candidates




# FAST IMPLEMENTATION: generate all k-item combinations of each basket and check whether they are candidates
def count_candidates(baskets, candidates, candidate_length):
    for basket in baskets:
        # Candidates are stored as sorted tuples, so sort the basket to produce matching keys
        basket_variations = itertools.combinations(sorted(basket), candidate_length)
        for combination in basket_variations:
            if combination in candidates:
                candidates[combination] += 1
    return candidates


In [35]:
# Compute the rule confidence: support(whole itemset) / support(items before the arrow)
def conf(k_tuple, arrow_position, frequent_itemsets):
    before_arrow = k_tuple[:arrow_position]  # antecedent: the items to the left of the arrow (-->)
    union_support = get_support(k_tuple, frequent_itemsets)
    single_support = get_support(before_arrow, frequent_itemsets)
    return union_support / single_support
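
`conf` relies on a `get_support` helper that is not defined in the cells shown here. A minimal sketch of what it could look like follows, assuming `frequent_itemsets` has the shape built in `main()`: a list where index k-1 holds a dict mapping sorted k-tuples to their support. The body is an assumption, not the authors' code.

```python
def get_support(itemset, frequent_itemsets):
    """Look up the stored support of `itemset` (hypothetical helper)."""
    key = tuple(sorted(itemset))  # keys are stored as sorted tuples
    return frequent_itemsets[len(key) - 1][key]


# Toy data in the same shape as `frequent_itemsets` in main()
toy = [{(1,): 4, (2,): 5}, {(1, 2): 3}]
print(get_support((2, 1), toy))                           # support of {1, 2}
print(get_support((2, 1), toy) / get_support((1,), toy))  # confidence of 1 -> 2
```

Sorting the key matters because `main()` passes permutations of each itemset, while the dictionaries are keyed by sorted tuples. The lookup should not fail for antecedents of a frequent itemset, since every subset of a frequent itemset is itself frequent.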


# Evaluation and extra bonus task

Here we define main() to run the A-Priori algorithm on the given dataset.
We set the support threshold to 1000 (i.e. 1% of all the baskets) and the confidence threshold to 0.5.
We first generate and count C1 and keep only the singletons with support at least 1000, which are shown in the first part of the output.
Then, for k>=1, a loop generates the candidate (k+1)-tuples from the frequent itemsets of the previous pass (index k-1 in the results list) and the frequent singletons (index 0). We get nine 2-tuples and one 3-tuple; there are no 4-tuples or larger.

Finally, for k>=2, we find association rules by moving the arrow through each frequent itemset and computing the confidence, keeping the rules whose confidence is at least the threshold c.
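
As an illustration of rule generation for a single frequent itemset, the sketch below enumerates every split into antecedent and consequent with `itertools.combinations` rather than permutations, so each rule is visited only once. The names `rules_for` and `supports`, and the toy support counts, are assumptions made for this example.

```python
import itertools


def rules_for(itemset, supports, min_conf):
    """Return (antecedent, consequent, confidence) for rules meeting min_conf."""
    rules = []
    for size in range(1, len(itemset)):
        for antecedent in itertools.combinations(itemset, size):
            consequent = tuple(i for i in itemset if i not in antecedent)
            # confidence = support(whole itemset) / support(antecedent)
            c = supports[itemset] / supports[antecedent]
            if c >= min_conf:
                rules.append((antecedent, consequent, c))
    return rules


# Hypothetical support counts keyed by sorted tuples
supports = {(1,): 10, (2,): 4, (1, 2): 3}
print(rules_for((1, 2), supports, 0.5))
```

With these toy counts, 1 -> 2 has confidence 3/10 and is dropped, while 2 -> 1 has confidence 3/4 and is kept.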

In [36]:
def main():
    support = 1000          # 1% of the baskets
    confidence = 0.5
    frequent_itemsets = []  # Returned results
    associations = set()    # Generated associations rules

    baskets = read_baskets()                                                 # Read data file
    singletons_count = count_singletons(baskets)                             # Find and count singletons
    filtered_items = filter_frequent_items(singletons_count, support)        # Filter frequent singletons
    frequent_singletons = {(i,): filtered_items[i] for i in filtered_items}   # Wrap singletons in tuple to use the same data structure for pairs, triplets, etc.
    frequent_itemsets.append(frequent_singletons)
    print("Frequent singletons with true support:\n", frequent_singletons)

    k = 1                                                                                   # For each k >= 1
    while len(frequent_itemsets[k - 1]) > 0:                                                # While new frequent elements are found
        candidates = generate_candidates(frequent_itemsets[k - 1], frequent_itemsets[0])    # Generate candidates from previous frequent itemset(k-1) and singletons(k=0)
        candidates_count = count_candidates(baskets, candidates, k + 1)                     # Count candidates frequency
        frequent_itemset = filter_frequent_items(candidates_count, support)                 # Filter frequent items
        frequent_itemsets.append(frequent_itemset)
        print("Frequent " + str(k + 1) + "-tuples:\n", frequent_itemsets[k])
        k += 1

    for frequent_itemset in frequent_itemsets[1:]:
        for k_tuple in frequent_itemset:
            for tuple_permutation in itertools.permutations(k_tuple, len(k_tuple)):
                for arrow_position in reversed(range(1, len(tuple_permutation))):  # arrow_position = 1: A -> B,C,D; arrow_position = 2: A,B -> C,D; etc.
                    c = conf(tuple_permutation, arrow_position, frequent_itemsets)
                    if c >= confidence:
                        associations.add((', '.join(map(str, sorted(tuple_permutation[:arrow_position]))) + ' -> ' + ', '.join(map(str, sorted(tuple_permutation[arrow_position:]))), c))
                    else:
                        break  # If A,B,C -> D is below the confidence threshold, then so is A,B -> C,D, so there is no need to try further arrow positions

    print("Association rules with true confidence:\n", associations)


start_time = time.time()
main()
print("--- %s seconds ---" % (time.time() - start_time))

Frequent singletons with true support:
 {(25,): 1395, (52,): 1983, (240,): 1399, (274,): 2628, (368,): 7828, (448,): 1370, (538,): 3982, (561,): 2783, (630,): 1523, (687,): 1762, (775,): 3771, (825,): 3085, (834,): 1373, (39,): 4258, (120,): 4973, (205,): 3605, (401,): 3667, (581,): 2943, (704,): 1794, (814,): 1672, (35,): 1984, (674,): 2527, (733,): 1141, (854,): 2847, (950,): 1463, (422,): 1255, (449,): 1890, (857,): 1588, (895,): 3385, (937,): 4681, (964,): 1518, (229,): 2281, (283,): 4082, (294,): 1445, (381,): 2959, (708,): 1090, (738,): 2129, (766,): 6265, (853,): 1804, (883,): 4902, (966,): 3921, (978,): 1141, (104,): 1158, (143,): 1417, (569,): 2835, (620,): 2100, (798,): 3103, (185,): 1529, (214,): 1893, (350,): 3069, (529,): 7057, (658,): 1881, (682,): 4132, (782,): 2767, (809,): 2163, (947,): 3690, (970,): 2086, (227,): 1818, (390,): 2685, (71,): 3507, (192,): 2004, (208,): 1483, (279,): 3014, (280,): 2108, (496,): 1428, (530,): 1263, (597,): 2883, (618,): 1337, (675,): 2976