# Homework 2 -  Discovery of Frequent Itemsets and Association Rules

**Authors: Sherly Sherly and Anna Martignano**


## 1. Introduction
The problem of discovering association rules between itemsets in a sales transaction database (a set of baskets) includes the following two sub-problems [R. Agrawal and R. Srikant, VLDB '94]:

1. Finding frequent itemsets with support at least $s$;
2. Generating association rules with confidence at least $c$ from the itemsets found in the first step.

Recall that an association rule is an implication X → Y, where X and Y are itemsets such that X∩Y=∅. The support of the rule X → Y is the number of transactions that contain X∪Y. The confidence of the rule X → Y is the fraction of the transactions containing X that also contain X∪Y.
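
The two definitions above can be checked on a tiny example. This is a minimal sketch, assuming four toy baskets invented for illustration; `support` and `confidence` are hypothetical helper names and are not part of the homework code.

```python
# Toy baskets, invented for illustration only (not the homework dataset)
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk"},
]

def support(itemset, baskets):
    """Number of baskets that contain every item of itemset."""
    return sum(1 for basket in baskets if set(itemset) <= basket)

def confidence(X, Y, baskets):
    """Fraction of the baskets containing X that also contain Y."""
    return support(set(X) | set(Y), baskets) / support(X, baskets)

# Rule {bread} -> {milk}: two baskets contain both items,
# three baskets contain bread
rule_support = support({"bread", "milk"}, baskets)
rule_confidence = confidence({"bread"}, {"milk"}, baskets)
print(rule_support, rule_confidence)
```

Here the rule {bread} → {milk} has support 2 and confidence 2/3.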

### 1.1 Task
You are to solve the first sub-problem: implement the Apriori algorithm for finding frequent itemsets with support at least $s$ in a dataset of sales transactions. Recall that the support of an itemset is the number of transactions containing the itemset. To test and evaluate your implementation, write a program that uses it to discover frequent itemsets with support at least $s$ in a given dataset of sales transactions.

The implementation can be done using any big data processing framework, such as Apache Spark or Apache Flink, or without a framework, e.g., in plain Java or Python.

### 1.2 Optional task for extra bonus
Solve the second sub-problem, i.e., develop and implement an algorithm for generating association rules between frequent itemsets discovered by using the Apriori algorithm in a dataset of sales transactions. The rules must have support at least s and confidence at least c, where s and c are given as input parameters.

## 2. Implementations
### 2.1 A-Priori Algorithm
A-Priori is a two-pass approach that limits the memory demand.

Key idea: monotonicity of support (the downward closure property of frequent patterns)
- If a set of items appears at least $s$ times, so does every subset of it, i.e., the support of a subset is at least as large as the support of its superset.
- Equivalently, any subset of a frequent itemset must be frequent. Contrapositive for pairs: if item $i$ does not appear in $s$ baskets, then no pair including $i$ can appear in $s$ baskets.

A-Priori follows a candidate generation-and-test approach. The A-Priori pruning principle: if an itemset is infrequent, none of its supersets should be generated or tested, because they are infrequent as well [Agrawal & Srikant, VLDB '94; Mannila et al., KDD '94].
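
The monotonicity property can be verified by brute force on a small example. This is only an illustrative sketch: the baskets and the threshold `s = 3` are assumptions made up for the example, and `support` is a hypothetical helper.

```python
from itertools import combinations

# Toy baskets and threshold, invented for illustration only
baskets = [
    {1, 2, 3},
    {1, 2},
    {1, 3},
    {2, 3},
    {1, 2, 3, 4},
]
s = 3  # hypothetical support threshold (absolute count)

def support(itemset):
    """Number of baskets that contain every item of itemset."""
    return sum(1 for basket in baskets if set(itemset) <= basket)

items = sorted(set().union(*baskets))

# Monotonicity: removing one item from an itemset never lowers its support
for size in range(2, len(items) + 1):
    for itemset in combinations(items, size):
        for subset in combinations(itemset, size - 1):
            assert support(subset) >= support(itemset)

# Contrapositive for pairs: a pair containing an infrequent item
# cannot be frequent, so it never needs to be counted
infrequent_items = [i for i in items if support((i,)) < s]
pairs_with_infrequent = [
    p for p in combinations(items, 2) if set(p) & set(infrequent_items)
]
frequent_pairs = [p for p in combinations(items, 2) if support(p) >= s]
print(infrequent_items, frequent_pairs)
```

In this example item 4 is the only infrequent item, and none of the pairs that contain it reaches the threshold.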



**Pass 1**: Read the baskets and count in main memory the occurrences of each individual item.
- Requires $O(n)$ memory, where $n$ is the number of distinct items. Items that appear at least $s$ times are the frequent items, where $s$ is the support threshold. A typical choice is $s=1\%$ of the baskets, under which many singletons are already infrequent.


**Pass 2**: Read the baskets again and count only those pairs where both elements are frequent (as discovered in Pass 1).
- Requires memory proportional to the square of the number of frequent items only (for the counts), i.e. on the order of $m^2$ rather than $n^2$, where $m$ is the number of frequent items, plus a list of the frequent items (so you know what must be counted).
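
The two passes can be sketched for pairs as follows. This is a minimal illustration under assumed inputs: the baskets and the threshold `s = 3` are made up, and the sketch uses `collections.Counter` rather than the dictionaries of the notebook code.

```python
from collections import Counter
from itertools import combinations

# Toy baskets and threshold, invented for illustration only
baskets = [[1, 2, 3], [1, 2], [1, 3, 4], [2, 3], [1, 2]]
s = 3  # hypothetical support threshold (absolute count)

# Pass 1: count every individual item, then keep the frequent ones
item_counts = Counter(item for basket in baskets for item in basket)
frequent_items = {item for item, count in item_counts.items() if count >= s}

# Pass 2: count only pairs whose two items are both frequent
pair_counts = Counter()
for basket in baskets:
    # sorting makes every pair a canonical (smaller, larger) tuple
    kept = sorted(item for item in set(basket) if item in frequent_items)
    pair_counts.update(combinations(kept, 2))

frequent_pairs = {
    pair: count for pair, count in pair_counts.items() if count >= s
}
print(frequent_items, frequent_pairs)
```

Because item 4 is dropped after the first pass, no counter is ever allocated for a pair containing it, which is where the memory saving comes from.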

In [108]:
def parse_data():
    data = {}
    baskets = []
    
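    # read the baskets and count the occurrences of each individual item
    with open('T10I4D100K.dat', 'r') as f:
        for line in f:
            # split() without arguments tolerates trailing or repeated spaces
            basket = [int(item) for item in line.split()]
            for i in basket:
                # items are stored as 1-tuples so that keys of every
                # itemset size share the same format
                data[(i,)] = data.get((i,), 0) + 1

            baskets.append(basket)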

    # typically, s=1%
    s = int (0.01 * len (baskets))

    # keep only the items whose count reaches the support threshold
    data = {k: v for k, v in data.items() if v >= s}

    return data, baskets, s

**Pipeline of the A-Priori Algorithm**
<img src="apriori_algo.png">

For each k, we construct two sets of k-tuples (sets of size k):
- $C_{k}$ = candidate k-tuples = those that might be frequent sets (support $\geq s$) based on information from the pass for k–1
- $L_{k}$ = the set of truly frequent k-tuples, i.e. only those k-tuples from $C_{k}$ that have support at least $s$
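
One common way to build $C_{k}$ from $L_{k-1}$ is to extend each frequent (k–1)-tuple by one item and prune every candidate that has an infrequent (k–1)-subset. The sketch below illustrates that variant under stated assumptions: `generate_candidates` is a hypothetical helper, itemsets are assumed to be sorted tuples, and this is not the per-basket enumeration used in `get_frequent_Lset`.

```python
from itertools import combinations

def generate_candidates(prev_frequent, k):
    """Build C_k from L_{k-1}; itemsets are assumed to be sorted tuples."""
    prev = set(prev_frequent)
    # every item that still occurs in some frequent (k-1)-tuple
    items = sorted({item for itemset in prev for item in itemset})
    candidates = set()
    for itemset in prev:
        for item in items:
            # extend only with larger items so the tuple stays sorted
            if item > itemset[-1]:
                candidate = itemset + (item,)
                # prune: every (k-1)-subset must itself be frequent
                if all(sub in prev for sub in combinations(candidate, k - 1)):
                    candidates.add(candidate)
    return candidates

# Hypothetical L_2, invented for illustration only
L2 = {(1, 2), (1, 3), (2, 3), (2, 4)}
C3 = generate_candidates(L2, 3)
print(C3)
```

In this example only (1, 2, 3) survives: (1, 2, 4) is pruned because (1, 4) is not in $L_{2}$, and (2, 3, 4) because (3, 4) is not. The surviving candidates still have to be counted against the baskets to obtain $L_{3}$.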

In [124]:
import itertools

flatten = lambda l: [item for sublist in l for item in sublist]

def get_frequent_Lset(previous_Lset, baskets, s, k):
    new_Lset = {}

    # Generate the set of items that is frequent based on last L
    Lprev = set(flatten(previous_Lset.keys()))

    """
    Second Pass:
    (1) For each basket, look in the frequent-items
        table to see which of its items are frequent.
    (2) In a double loop, generate all pairs of frequent
        items in that basket.
    (3) For each such pair, add one to its count
        in the data structure used to store counts.
    """
    for basket in baskets:
        # keep only frequent items in Lprev
        valid_basket = list(Lprev.intersection(basket))
        valid_basket.sort()

        """
        For a candidate in Ck to be a frequent itemset,
        all its subsets must be frequent, not only the
        itemsets from Lk-1 and L1 that the candidate is
        constructed from, i.e., each of its subsets should
        be in the corresponding Lm, m = 1,…, k-1
        """
        candidates = list(itertools.combinations(valid_basket, k))

        for key in candidates:
            # key is a sorted tuple, so its (k-1)-subsets are sorted too
            # and match the key format used in previous_Lset
            prev_candidates = itertools.combinations(key, k - 1)

            # count the candidate only if every (k-1)-subset is frequent
            if all(sub in previous_Lset for sub in prev_candidates):
                new_Lset[key] = new_Lset.get(key, 0) + 1

    # Keep only the candidates whose count reaches the support threshold
    new_Lset = {itemset: count for itemset, count in new_Lset.items()
                if count >= s}

    return new_Lset

def apriori_algo():
    # read L1, the baskets and compute the support threshold
    L1, baskets, s = parse_data()

    # index 0 is an unused placeholder, so that generated_set[k]
    # holds the frequent itemsets of size k
    generated_set = [{}]
    
    # initialize the original set for L
    Lset = L1
    k = 2

    # Generate pruned Lsets until it empties out
    while (len(Lset) > 0):
        generated_set.append(Lset)
        Lset = get_frequent_Lset(Lset, baskets, s, k)
        k += 1

    return generated_set

In [128]:
apriori = apriori_algo()

Support threshold (1% of baskets-length): 1000


### 2.2 Association Rules
This section generates association rules between the frequent itemsets discovered by the Apriori algorithm in a dataset of sales transactions. The rules must have support at least $s$ and confidence at least $c$, where $s$ and $c$ are given as input parameters.


**Mining Association Rules**
1. Find all frequent itemsets I, i.e. those with at least the given support
2. Rule generation
    - For every non-empty proper subset A of I, generate a rule A → I \ A
         - Since I is frequent, so is A
         - Variant 1: Single pass to compute the rule confidence
            - conf(A,B→C,D) = supp(A,B,C,D)/supp(A,B)
         - Variant 2:
            - Observation: If A,B,C→D is below the confidence threshold, so is A,B→C,D, because supp(A,B) ≥ supp(A,B,C)
            - Rules with bigger consequents therefore only need to be generated from rules with smaller consequents that passed the threshold

Output the rules whose confidence reaches the threshold.
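
Variant 1 can be sketched on a single frequent itemset as follows. The support counts and the threshold `c = 0.75` are hypothetical numbers chosen for the example, and `rules_from` is an illustrative helper rather than the function used in this notebook.

```python
from itertools import combinations

# Hypothetical support counts for one frequent itemset and its subsets
support = {
    ("A",): 8, ("B",): 6, ("C",): 5,
    ("A", "B"): 5, ("A", "C"): 4, ("B", "C"): 4,
    ("A", "B", "C"): 4,
}
c = 0.75  # hypothetical confidence threshold

def rules_from(itemset, support, c):
    """Return {(lhs, rhs): confidence} for rules lhs -> rhs within itemset."""
    rules = {}
    for size in range(1, len(itemset)):
        for lhs in combinations(itemset, size):
            # the right-hand side is whatever is left of the itemset
            rhs = tuple(item for item in itemset if item not in lhs)
            conf = support[itemset] / support[lhs]
            if conf >= c:
                rules[(lhs, rhs)] = conf
    return rules

rules = rules_from(("A", "B", "C"), support, c)
for (lhs, rhs), conf in rules.items():
    print(lhs, "->", rhs, conf)
```

With these counts, conf(A→B,C) = 4/8 is no larger than conf(A,B→C) = 4/5, which is the direction the observation in Variant 2 relies on: moving an item from the left-hand side to the right-hand side can only keep or lower the confidence.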

In [169]:
def get_confidence(item_sets, s1, s2):
    """
    conf(I -> j) = support(I union j) / support(I)
    """
    union = list(set(s1).union(set(s2)))
    # since the keys are sorted when we generate the itemsets
    union.sort()

    confidence = float(item_sets[len(union)][tuple(union)]) / item_sets[len(s1)][s1]

    return confidence

def generate_associations(item_sets, c):
    """
    We are looking for rules I → j with reasonably high support
    and confidence
    """
    associations = {}
    
    # rules need at least 2 items, so start from the frequent pairs
    for k in range(2, len(item_sets)):
        # count is the number of items on the left-hand side I
        for count in range(1, k):
            for items in item_sets[k]:
                # I holds every candidate left-hand side of size count
                I = set(itertools.combinations(list(items), count))
                for i in I:
                    # j is the right-hand side of the rule i -> j: the items
                    # of the itemset that are not in i (i is a subset of
                    # items, so the symmetric difference is items minus i)
                    j = set(items).symmetric_difference(i)
                    confidence = get_confidence(item_sets, i, j)
                    # keep rules with confidence at least c
                    if confidence >= c:
                        associations[tuple([tuple(i), tuple(j)])] = confidence

    return associations

In [175]:
c = 0.5
associations = generate_associations(apriori, c)

In [181]:
for k, v in associations.items():
    print("{} → {} with confidence of {}".format(k[0], k[1], v))

(704,) → (39,) with confidence of 0.617056856187291
(704,) → (825,) with confidence of 0.6142697881828316
(227,) → (390,) with confidence of 0.577007700770077
(704,) → (825, 39) with confidence of 0.5769230769230769
(704, 825) → (39,) with confidence of 0.9392014519056261
(39, 704) → (825,) with confidence of 0.9349593495934959
(39, 825) → (704,) with confidence of 0.8719460825610783
