# Week 2B: Frequent Itemsets (Basic)

## Question 1

Suppose we have transactions that satisfy the following assumptions: 

<ol>
<li>s, the support threshold, is 10,000.
<li>There are one million items, which are represented by the integers 0,1,...,999999.
<li>There are N frequent items, that is, items that occur 10,000 times or more.
<li>There are one million pairs that occur 10,000 times or more.
<li>There are 2M pairs that occur exactly once. M of these pairs consist of two frequent items, the other M each have at least one nonfrequent item.
<li>No other pairs occur at all.
<li>Integers are always represented by 4 bytes.
</ol>

Suppose we run the a-priori algorithm to find frequent pairs and can choose on the second pass between the triangular-matrix method for counting candidate pairs (a triangular array count[i][j] that holds an integer count for each pair of items (i, j) where i < j) and a hash table of item-item-count triples. Neglect in the first case the space needed to translate between original item numbers and numbers for the frequent items, and in the second case neglect the space needed for the hash table. Assume that item numbers and counts are always 4-byte integers. 
As a function of N and M, what is the minimum number of bytes of main memory needed to execute the a-priori algorithm on this data? Demonstrate that you have the correct formula by selecting, from the choices below, the triple consisting of values for N, M, and the approximate (i.e., to within 10%) minimum number of bytes of main memory, S, needed for the a-priori algorithm to execute with this data.

<ol>
<li>N = 100,000; M = 100,000,000; S = 1,200,000,000
<li>N = 50,000; M = 80,000,000; S = 1,500,000,000
<li>N = 30,000; M = 100,000,000; S = 500,000,000
<li>N = 100,000; M = 50,000,000; S = 5,000,000,000
</ol>

In [4]:
NUM_ITEMS = 1000000
LOWER, UPPER = 0.9, 1.1

def triangular_count(n):
    # Triangular-matrix method: one 4-byte count per pair of frequent items.
    # There are n(n-1)/2 such pairs, so about 4 * n^2 / 2 = 2n^2 bytes.
    # The item-number translation table is neglected, as the question allows.
    return 2 * n**2

def triple_method(m):
    # Triples method: 3 integers (12 bytes) per candidate pair that occurs.
    # The candidates are the 1,000,000 frequent pairs plus the m pairs of
    # two frequent items that occur exactly once, so roughly 12*m bytes.
    return 12 * (m + 1000000)

test = [
    [100000, 100000000, 1200000000],
    [50000, 80000000, 1500000000],
    [30000, 100000000, 500000000],
    [100000, 50000000, 5000000000]
]

winner = 0

for t in test:
    bytez = min(triangular_count(t[0]), triple_method(t[1]))
    interval_min, interval_max = t[2] * LOWER, t[2] * UPPER
    if interval_min <= bytez <= interval_max:
        winner = t
        
print("Winner")
print(winner)
print() 

for i, t in enumerate(test):
    print(str(i+1) + ":", winner == t)

Winner
[100000, 100000000, 1200000000]

1: True
2: False
3: False
4: False
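
The byte counts behind the two methods can be written out explicitly. This is a sketch of the arithmetic under the question's stated assumptions (4-byte integers, translation table and hash-table overhead neglected). It also assumes the first pass, which needs only about 4 × 10^6 bytes for the item counts, is not the bottleneck. On the second pass, only pairs of two frequent items are candidates: the 10^6 frequent pairs plus the M once-occurring pairs made of two frequent items.

```latex
\begin{align*}
S_{\text{triangular}} &= 4\binom{N}{2} = 2N(N-1) \approx 2N^2 \\
S_{\text{triples}}    &= 12\,(10^6 + M) \approx 12M \\
S &= \min\bigl(S_{\text{triangular}},\ S_{\text{triples}}\bigr) \\[6pt]
\text{For } N = 10^5,\ M = 10^8:\quad
S_{\text{triangular}} &= 2 \cdot 10^5 \cdot 99{,}999 \approx 2 \times 10^{10} \\
S_{\text{triples}}    &= 12 \cdot 1.01 \times 10^8 = 1.212 \times 10^9 \\
S &= 1.212 \times 10^9
\end{align*}
```

The result 1.212 × 10^9 is within 10% of the stated S = 1,200,000,000, which agrees with the first choice.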



## Question 2
Imagine there are 100 baskets, numbered 1,2,...,100, and 100 items, similarly numbered. Item i is in basket j if and only if i divides j evenly. For example, basket 24 is the set of items {1,2,3,4,6,8,12,24}. Describe all the association rules that have 100% confidence. Which of the following rules has 100% confidence?

<ol>
<li>{4,6} → 12
<li>{1,2} → 4
<li>{1,3,6} → 12
<li>{1} → 2
</ol>

In [28]:
# Basket j contains every item i (1 <= i <= num_items) that divides j evenly.
def generate_baskets(num_baskets, num_items):
    g = lambda j: set([x for x in range(1, num_items + 1) if j % x == 0])
    return [g(z) for z in range(1, num_baskets + 1)]
        
rules = [
        {"if": set([1,2]), "then": 4},
        {"if": set([4,6]), "then": 12},
        {"if": set([1,3,6]), "then": 12},
        {"if": set([1]), "then": 2}
]

baskets = generate_baskets(100, 100)

winner = 0

for rule in rules:
    if_, then = 0, 0
    for b in baskets:
        # If the if part is a subset of b
        if rule["if"].issubset(b):
            if_ += 1
            if rule["then"] in b:
                then += 1
    # if then is always present when if is a subset
    if if_ == then:
        winner = rule
        
print("Winner")
print(winner)

Winner
{'then': 12, 'if': {4, 6}}
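
The brute-force check can be cross-checked against a closed-form description of the 100%-confidence rules. The baskets containing every item of a set I are exactly the multiples of lcm(I), and basket lcm(I) is one of them, so the rule I → j holds in every such basket exactly when j divides lcm(I). The sketch below assumes that reasoning. The function name `has_full_confidence` and its `None` return value for an antecedent that appears in no basket are illustrative choices, not part of the original exercise.

```python
from functools import reduce
from math import gcd


def lcm(a, b):
    # Least common multiple of two positive integers.
    return a * b // gcd(a, b)


def has_full_confidence(antecedent, consequent, num_baskets=100):
    # Baskets that contain every item of the antecedent are the multiples
    # of its least common multiple.
    l = reduce(lcm, antecedent, 1)
    if l > num_baskets:
        # Assumption: no basket contains the antecedent, so the
        # confidence is undefined; None signals that case.
        return None
    # The rule holds in basket l (and hence in all its multiples)
    # exactly when the consequent divides l.
    return l % consequent == 0


print(has_full_confidence({4, 6}, 12))     # True
print(has_full_confidence({1, 2}, 4))      # False
print(has_full_confidence({1, 3, 6}, 12))  # False
print(has_full_confidence({1}, 2))         # False
```

Only {4,6} → 12 passes, which matches the winner printed by the brute-force loop.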


## Question 3

Suppose ABC is a frequent itemset and BCDE is NOT a frequent itemset. Given this information, we can be sure that certain other itemsets are frequent and sure that certain itemsets are NOT frequent. Other itemsets may be either frequent or not. Which of the following is a correct classification of an itemset?

<ol>
<li>B can be either frequent or not frequent.
<li>BCF is not frequent.
<li>AB can be either frequent or not frequent.
<li>BC is frequent.
</ol>

In [56]:
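# Answer: option 4. Every subset of a frequent itemset is frequent, and
# every superset of a non-frequent itemset is not frequent.
# 1: False. B is a subset of ABC, so B is frequent.
# 2: False. BCF is neither a subset of ABC nor a superset of BCDE, so it can be either.
# 3: False. AB is a subset of ABC, so AB is frequent.
# 4: True. BC is a subset of ABC, so BC is frequent.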
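
The monotonicity argument can be sketched as a small classifier. It assumes the only facts available are the two from the question (ABC is frequent, BCDE is not). The function name `classify` and its string labels are hypothetical, chosen for illustration.

```python
def classify(itemset,
             known_frequent=frozenset("ABC"),
             known_infrequent=frozenset("BCDE")):
    s = frozenset(itemset)
    # Every subset of a frequent itemset is frequent.
    if s <= known_frequent:
        return "frequent"
    # Every superset of a non-frequent itemset is not frequent.
    if s >= known_infrequent:
        return "not frequent"
    # Nothing else follows from the two given facts.
    return "either"


for candidate in ["B", "BCF", "AB", "BC"]:
    print(candidate, "->", classify(candidate))
```

Running it labels B, AB, and BC as frequent and BCF as undetermined, so only the fourth statement in the list is a correct classification.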