# Frequent Items

## Terminology
- Given a support threshold $s$, sets of items that occur in at least $s$ baskets are called **frequent itemsets**. The code below expresses support, and therefore the threshold, as a fraction of all baskets.
- For a rule $L \to R$, **confidence** is the probability of $R$ given $L$: $$confidence(L \to R) = P(R|L) = \frac{support(L \cup R)}{support(L)}$$ Rules with greater confidence are more interesting.
- **Lift** measures how much better an association rule predicts its consequent than the frequency of the consequent in the entire data does:
    * For a rule $L \to R$
    $$lift(L \to R) = \frac{confidence(L \to R)}{support(R)} = \frac{support(L \cup R)}{support(L) \cdot support(R)}$$
    * We typically want rules with high lift, certainly greater than 1. A lift of 1 means that $L$ and $R$ occur together exactly as often as they would if they were independent.



In [3]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons, make_blobs
import random

X = [
    set("mcb"),
    set("mpj"),
    set("mb"),
    set("cj"),
    set("mpb"),
    set("mcbj"),
    set("cbj"),
    set("bc")
]

## Support

In [4]:
def support(L, X) -> float:
  return np.sum([1 if np.all([j in x for j in L]) else 0 for x in X])/len(X)

## Confidence

In [5]:
def confidence(L, R, X) -> float:
  # support(L ∪ R) / support(L); set() lets L and R be sets, strings, or lists of items.
  return support(set(L) | set(R), X) / support(L, X)

## Lift

In [6]:
def lift(L, R, X) -> float:
  return confidence(L,R,X)/support(R,X)
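
As a quick check of the three measures, the following self-contained sketch evaluates the rule {m} -> {b} on the eight baskets above. The names `baskets`, `supp`, `conf`, and `lift_of` are illustrative and deliberately differ from the notebook's own helpers; they are assumptions of this example, not part of the original code.

```python
# Illustrative re-implementation with hypothetical names, kept separate from the notebook helpers.
baskets = [set("mcb"), set("mpj"), set("mb"), set("cj"),
           set("mpb"), set("mcbj"), set("cbj"), set("bc")]

def supp(itemset, baskets):
    """Fraction of baskets that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= b for b in baskets) / len(baskets)

def conf(left, right, baskets):
    """Estimate of P(right | left)."""
    return supp(set(left) | set(right), baskets) / supp(left, baskets)

def lift_of(left, right, baskets):
    """Confidence of the rule divided by the baseline support of `right`."""
    return conf(left, right, baskets) / supp(right, baskets)

print(supp("m", baskets))          # 0.625 (5 of 8 baskets)
print(supp("mb", baskets))         # 0.5   (4 of 8 baskets)
print(conf("m", "b", baskets))     # 0.8
print(lift_of("m", "b", baskets))  # roughly 1.067
```

A lift only slightly above 1 suggests that baskets containing m are just marginally more likely to contain b than baskets in general.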

## Association Rules
Given itemsets L and R, an association rule L -> R has L as its antecedent (left-hand side) and R as its consequent (right-hand side). It states that baskets containing L are likely to contain R as well.
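
One way to obtain rules from a frequent itemset is to try every split of it into a non-empty antecedent and consequent and keep the splits whose confidence clears a chosen bar. The sketch below does this for a single itemset on the basket data; `rules_from_itemset` and its `min_conf` parameter are hypothetical names introduced for illustration.

```python
from itertools import combinations

baskets = [set("mcb"), set("mpj"), set("mb"), set("cj"),
           set("mpb"), set("mcbj"), set("cbj"), set("bc")]

def supp(itemset, baskets):
    """Fraction of baskets that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= b for b in baskets) / len(baskets)

def rules_from_itemset(itemset, baskets, min_conf):
    """Return (L, R, confidence) for every split of `itemset` with confidence >= min_conf.

    Assumes `itemset` occurs in at least one basket, so no antecedent has zero support.
    """
    itemset = frozenset(itemset)
    rules = []
    for size in range(1, len(itemset)):
        for left in combinations(sorted(itemset), size):
            left = frozenset(left)
            right = itemset - left
            c = supp(itemset, baskets) / supp(left, baskets)
            if c >= min_conf:
                rules.append((left, right, c))
    return rules

for left, right, c in rules_from_itemset("cj", baskets, min_conf=0.7):
    print(sorted(left), "->", sorted(right), c)   # ['j'] -> ['c'] 0.75
```

The reverse rule c -> j has confidence 0.6 and is dropped, which shows that confidence is not symmetric in L and R.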

## Itemset Lattice
The itemset lattice is a directed acyclic graph over all possible itemsets. Each node represents an itemset, and an edge connects an itemset to each superset that contains exactly one more item. The root node represents the empty set, and each level of the graph holds the itemsets of one size. The lattice is the search space in which frequent itemsets and association rules are found, and the frequent itemsets are separated from the infrequent ones by a border.

![Itemset Lattice](../assets/itemset.png)

- **Maximal frequent itemsets** are frequent itemsets that have no frequent superset (they lie on the border).
- **Closed frequent itemsets** are frequent itemsets that have no superset with the same support.

The itemset lattice can be used to efficiently compute both maximal and closed frequent itemsets.
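
To make the two definitions concrete, the sketch below enumerates the lattice of the basket example by brute force and filters out the maximal and closed frequent itemsets. The threshold `MIN_SUPPORT = 0.3` and all names are assumptions of this example, and exhaustive enumeration is only viable for a handful of items.

```python
from itertools import combinations

baskets = [set("mcb"), set("mpj"), set("mb"), set("cj"),
           set("mpb"), set("mcbj"), set("cbj"), set("bc")]
MIN_SUPPORT = 0.3  # assumed threshold, as a fraction of baskets

def supp(itemset, baskets):
    """Fraction of baskets that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= b for b in baskets) / len(baskets)

# Walk the lattice level by level and record the support of each frequent itemset.
items = sorted(set().union(*baskets))
frequent = {}
for size in range(1, len(items) + 1):
    for combo in combinations(items, size):
        s = supp(combo, baskets)
        if s >= MIN_SUPPORT:
            frequent[frozenset(combo)] = s

# Maximal: no frequent proper superset. Closed: no proper superset with equal support
# (such a superset would itself be frequent, so searching `frequent` is enough).
maximal = [f for f in frequent if not any(f < g for g in frequent)]
closed = [f for f in frequent
          if not any(f < g and frequent[g] == frequent[f] for g in frequent)]

print(sorted(sorted(f) for f in maximal))  # [['b', 'c'], ['b', 'm'], ['c', 'j']]
print(len(closed), len(frequent))          # 7 7
```

On this small dataset every frequent itemset happens to be closed, while only three of the seven are maximal.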

## Downward Closure
The downward closure property states that if an itemset is frequent, then all of its subsets must also be frequent. If an itemset appears in at least s baskets, every one of those baskets also contains each of its subsets, so each subset appears in at least s baskets as well. Equivalently, if an itemset is infrequent, all of its supersets are infrequent. This property is used to prune infrequent itemsets and to reduce the search space for frequent itemsets and association rules.
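
The property can be checked directly on the basket example. The sketch below, with hypothetical helper names, verifies that no subset of an itemset is less frequent than the itemset itself, and shows the pruning direction: once p is infrequent, a superset such as {m, p} cannot be frequent either.

```python
from itertools import combinations

baskets = [set("mcb"), set("mpj"), set("mb"), set("cj"),
           set("mpb"), set("mcbj"), set("cbj"), set("bc")]

def supp(itemset, baskets):
    """Fraction of baskets that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= b for b in baskets) / len(baskets)

def check_downward_closure(itemset, baskets):
    """True when every non-empty proper subset is at least as frequent as `itemset`."""
    items = sorted(set(itemset))
    whole = supp(items, baskets)
    return all(supp(sub, baskets) >= whole
               for size in range(1, len(items))
               for sub in combinations(items, size))

print(supp("bcm", baskets))                      # 0.25 (2 of 8 baskets)
print(check_downward_closure("bcm", baskets))    # True
print(supp("p", baskets), supp("mp", baskets))   # 0.25 0.25
```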

## Apriori Algorithm
The Apriori algorithm is a method for mining frequent itemsets and generating association rules. It works level by level, iteratively constructing an enumeration tree that contains all frequent itemsets: frequent k-itemsets already in the tree are joined in pairs to form candidate (k+1)-itemsets, and each candidate whose support meets the threshold is added to the tree. The algorithm uses the downward closure property to discard candidates that have an infrequent subset before their support is counted, which reduces the search space. Apriori is designed to limit the need for main memory and is a key algorithm for mining frequent itemsets in large datasets.

### Pseudocode
```
Algorithm Apriori(D, s)
    Input: D - a dataset of transactions
           s - a support threshold
    Output: L - a list of all frequent itemsets
    L = []
    C1 = generate candidate 1-itemsets from D
    L1 = generate frequent 1-itemsets from C1
    L = L union L1
    k = 2
    while Lk-1 != {}
        Ck = generate candidate k-itemsets from Lk-1
        Lk = generate frequent k-itemsets from Ck
        L = L union Lk
        k = k + 1
    return L
```

In [8]:
import functools
from itertools import chain, combinations
from collections import defaultdict

# Join step: take the frequent k-itemsets from the previous level and union every
# pair that differs in exactly one item, giving candidate (k+1)-itemsets without duplicates.
def generate_unions(previous):
  candidates = []
  for i in previous:
    for j in previous:
      union = i.union(j)
      if len(union) == len(i) + 1 and union not in candidates:
        candidates.append(union)
  return candidates

# Prune check: returns True when every subset one item smaller than new_item is
# frequent, i.e. the candidate survives downward-closure pruning.
def prune_subset(previous, new_item):
  subsets = combinations(new_item, len(new_item)-1)
  return all(set(i) in previous for i in subsets)

def apriori(X, threshold: float) -> list:

  counts_for_1_item_set = defaultdict(lambda: 0)

  for i in X:
    for j in i:
      counts_for_1_item_set[j]+=1

  # We get the frequent 1-itemsets.
  F = [
      [{i} for i in counts_for_1_item_set if counts_for_1_item_set[i]/len(X) >= threshold]
  ]


  k = 0
  while len(F[k]) != 0:

    # We generate the candidate itemsets that are one item larger.
    generated = generate_unions(F[k])

    # We keep only the candidates whose subsets are all frequent.
    pruned = [i for i in generated if prune_subset(F[k], i)]

    # We remove those with small support.
    with_support = [i for i in pruned if support(i, X) >= threshold]

    # We add the new frequent itemsets as the next level.
    F.append(with_support)
    k += 1
  # Flatten the per-level lists into a single list.
  return functools.reduce(lambda a,b: [*a, *b], F)
apriori(X, 0.3)

[{'b'},
 {'m'},
 {'c'},
 {'j'},
 {'b', 'm'},
 {'b', 'c'},
 {'c', 'j'}]

## Enumeration Tree
The enumeration tree arranges the frequent itemsets as a tree by fixing an order on the items (for example, alphabetical). The root is the empty set, and each itemset is a child of the itemset obtained by removing its last item in that order, so every frequent itemset appears exactly once. Apriori grows this tree level by level: it joins pairs of nodes that share all but their last item to form candidates one item larger, and it uses the downward closure property to discard candidates that have an infrequent subset.
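
A minimal sketch of how such an enumeration tree could be represented, assuming items are ordered alphabetically and each itemset hangs under the itemset obtained by dropping its last item. The function name `build_enumeration_tree` and the dictionary layout are illustrative choices; the input is the list of frequent itemsets of the basket example at threshold 0.3.

```python
from collections import defaultdict

# Frequent itemsets of the basket example, written as alphabetically sorted tuples.
frequent = [("b",), ("c",), ("j",), ("m",), ("b", "c"), ("b", "m"), ("c", "j")]

def build_enumeration_tree(frequent):
    """Map each node to the list of its children; the root is the empty tuple."""
    children = defaultdict(list)
    # Visit shorter itemsets first so every level is filled in order.
    for itemset in sorted(frequent, key=lambda t: (len(t), t)):
        parent = itemset[:-1]  # drop the last item in the ordering
        children[parent].append(itemset)
    return dict(children)

tree = build_enumeration_tree(frequent)
print(tree[()])      # [('b',), ('c',), ('j',), ('m',)]
print(tree[("b",)])  # [('b', 'c'), ('b', 'm')]
print(tree[("c",)])  # [('c', 'j')]
```

Because each itemset has exactly one parent, the tree lists every frequent itemset once, whereas the lattice reaches the same itemset along several paths.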

## FP-Tree
An FP-Tree is a prefix tree used in the FP-growth algorithm for mining frequent itemsets and generating association rules. It compresses the dataset and avoids the need for repeated database scans. Infrequent items are removed from each transaction, the remaining items are sorted by descending frequency, and the transaction is inserted as a path from the root, so transactions that share a prefix share nodes. Each node stores an item and a count of the transactions whose path passes through it, and pointers link all nodes that hold the same item. The FP-Tree is used to efficiently mine frequent itemsets by recursively creating conditional subtrees from the prefix paths of each item.
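
The compression comes from shared prefixes, which the following sketch makes visible without building explicit node objects: every tree node corresponds to one distinct prefix of a reordered transaction. The five transactions are the ones used in the FP-growth cell further down; `MIN_COUNT = 3`, the alphabetical tie-breaking, and the helper names are assumptions of this example.

```python
from collections import Counter

transactions = ["EKMNOY", "DEKNOY", "AEKM", "CKMUY", "CEIKOO"]
MIN_COUNT = 3  # assumed minimum support count (3 of 5 transactions)

# Pass 1: count in how many transactions each item occurs.
item_counts = Counter(item for t in transactions for item in set(t))
frequent_items = {i for i, c in item_counts.items() if c >= MIN_COUNT}

def reorder(transaction):
    """Keep frequent items only, most frequent first (ties broken alphabetically)."""
    kept = set(transaction) & frequent_items
    return sorted(kept, key=lambda i: (-item_counts[i], i))

# Pass 2: each node of the tree is identified by its path from the root, and its
# count is the number of transactions that share that prefix.
node_counts = Counter()
for t in transactions:
    path = reorder(t)
    for depth in range(1, len(path) + 1):
        node_counts[tuple(path[:depth])] += 1

print(reorder("EKMNOY"))             # ['K', 'E', 'M', 'O', 'Y']
print(node_counts[("K",)])           # 5
print(node_counts[("K", "E")])       # 4
print(node_counts[("K", "E", "M")])  # 2
print(len(node_counts))              # 9
```

The 18 frequent-item occurrences in the data collapse into 9 nodes, because most transactions start with the same K, E prefix.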

## FP-growth Algorithm
The FP-growth algorithm is an alternative method for mining frequent itemsets and generating association rules. It uses a prefix-tree data structure called an FP-tree to compress the dataset and avoid the need for repeated database scans. The algorithm recursively creates conditional subtrees, each of which projects the data onto the transactions that contain a given suffix itemset, and extracts frequent patterns from them. Unlike Apriori, it does not generate and test candidate itemsets level by level, which makes it a useful algorithm for mining large datasets.

### Pseudocode
```
Algorithm FP-growth(FPT, minsup, P)
    Input: FPT - FP-Tree of frequent items
           minsup - minimum support
           P - current itemset suffix
    Output: Frequent itemsets

    if FPT is a single path
        then determine all combinations C of nodes on the path and report C \cup P as frequent;
    else
        for each item i in FPT do begin
            report itemset P_i = {i} \cup P as frequent;
            Use pointers to extract conditional prefix paths from FPT containing item i;
            Readjust counts of prefix paths and remove i;
            Remove infrequent items from prefix paths and reconstruct conditional FP-Tree FPT_i;
            if (FPT_i is not empty)
                then FP-growth(FPT_i, minsup, P_i);
        end
```


### Implementation

In [None]:
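def powerset(iterable):
    # All subsets of `iterable` as tuples, from the empty tuple up to the full set.
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(len(s)+1))

class Tree:
  # One FP-tree node: an item, the number of transactions passing through it, and its parent.
  def __init__(self):
    self.children = []
    self.character = ''
    self.count = 0
    self.father = None

class FPGrowth:
  def __init__(self, data, threshold):
    self.root = Tree()
    self.pointers = dict()  # item -> all tree nodes that hold this item
    self.threshold = threshold
    self.n = len(data)
    # Support of every single item, then keep only the frequent ones.
    count_items = {j: support(j, data) for i in data for j in set(i)}
    count_items = {j: s for j, s in count_items.items() if s >= threshold}
    # Drop infrequent items and sort each transaction by descending support
    # (ties broken by item name so that the tree is deterministic).
    self.data = [sorted({j for j in i if j in count_items}, key=lambda x: (-count_items[x], x)) for i in data]
    for i in self.data:
      self.addR(i, self.root)

  def addR(self, s, root):
    # Insert the remaining items `s` of one transaction below `root`.
    root.count += 1
    if len(s) == 0:
      return
    continuation = [i for i in root.children if i.character == s[0]]
    if len(continuation) == 0:
      # New nodes start at count 0; the recursive call counts this transaction.
      newNode = Tree()
      newNode.character = s[0]
      newNode.father = root
      root.children.append(newNode)
      self.pointers.setdefault(s[0], []).append(newNode)
      self.addR(s[1:], newNode)
    else:
      self.addR(s[1:], continuation[0])

  def get_path(self, node):
    # Items on the path from the root down to `node`, as a string.
    if node.father is None:
      return ""
    return self.get_path(node.father) + node.character

  def get_frequent(self):
    # Conditional pattern base: for each item, the prefix paths that lead to its
    # nodes, each with the count of the node it ends in.
    conditional_pattern_base = {
        i: [(set(self.get_path(j)[:-1]), j.count) for j in self.pointers[i] if len(self.get_path(j))>1] for i in self.pointers
    }
    # Every item stored in the tree is frequent on its own.
    frequent = [frozenset([i]) for i in self.pointers]
    for i in conditional_pattern_base:
      values = conditional_pattern_base[i]
      if len(values)==0:
        continue
      # Simplification: instead of building a conditional FP-tree recursively, keep
      # only the items shared by every prefix path of i. This finds the patterns of
      # the example below but can miss frequent itemsets in general.
      inter = functools.reduce(lambda a,b: a.intersection(b), [j[0] for j in values])
      inter.add(i)
      pattern_support = sum([j[1] for j in values]) / self.n
      if pattern_support < self.threshold:
        continue
      # frozenset makes ('K', 'E') and ('E', 'K') the same pattern.
      frequent.extend(frozenset(j) for j in powerset(inter) if len(j) > 0)
    return set(frequent)
data = [
    "EKMNOY",
    "DEKNOY",
    "AEKM",
    "CKMUY",
    "CEIKOO"
]
fp = FPGrowth(data,0.5)
fp.get_frequent()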

## PCY

In [None]:
import functools
from itertools import chain, combinations
from collections import defaultdict

# Join step: take the frequent k-itemsets from the previous level and union every
# pair that differs in exactly one item, giving candidate (k+1)-itemsets without duplicates.
def generate_unions(previous):
  candidates = []
  for i in previous:
    for j in previous:
      union = i.union(j)
      if len(union) == len(i) + 1 and union not in candidates:
        candidates.append(union)
  return candidates

# Prune check: returns True when every subset one item smaller than new_item is
# frequent, i.e. the candidate survives downward-closure pruning.
def prune_subset(previous, new_item):
  subsets = combinations(new_item, len(new_item)-1)
  return all(set(i) in previous for i in subsets)

def PCY(X, threshold: float, dictionary_size: int) -> list:

  counts_for_1_item_set = defaultdict(lambda: 0)
  bucket_counts = np.zeros((dictionary_size))
  # Symmetric in x and y, so a pair hashes to the same bucket in either order.
  hash_function = lambda x, y: (hash(y) + hash(x)) % dictionary_size

  # First pass: count single items and, in the same scan, hash every pair of
  # items in the basket to a bucket and count the bucket (each pair once).
  for i in X:
    for j in i:
      counts_for_1_item_set[j]+=1
    for j, k in combinations(i, 2):
      bucket_counts[hash_function(j,k)]+=1

  # Between the passes the bucket counts are reduced to a bitmap of frequent buckets.
  bit_mask = bucket_counts / len(X) >= threshold

  # We get the frequent 1-itemsets.
  F = [
      [{i} for i in counts_for_1_item_set if counts_for_1_item_set[i]/len(X) >= threshold]
  ]


  k = 0
  while len(F[k]) != 0:

    # We generate the candidate itemsets that are one item larger.
    generated = generate_unions(F[k])

    # We keep only the candidates whose subsets are all frequent.
    pruned = [i for i in generated if prune_subset(F[k], i)]

    # For pairs only: a pair can be frequent only if it hashes to a frequent bucket.
    if k == 0:
      pruned = [i for i in pruned if bit_mask[hash_function(list(i)[0],list(i)[1])]]
    # We remove those with small support.
    with_support = [i for i in pruned if support(i, X) >= threshold]

    # We add the new frequent itemsets as the next level.
    F.append(with_support)
    k += 1
  # Flatten the per-level lists into a single list.
  return functools.reduce(lambda a,b: [*a, *b], F)
PCY(X, 0.3, 5)

## Hashing Speedups

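Hashing speedups improve the efficiency of frequent itemset mining by shrinking the set of candidate pairs, which is usually the memory bottleneck. In the PCY algorithm, the first pass over the data counts single items and also hashes every pair in each basket to a bucket, incrementing that bucket's count. A bucket's count is an upper bound on the count of any pair that hashes to it, so a pair can only be frequent if its bucket is frequent. Between the passes the bucket counts are replaced by a bitmap with one bit per bucket, and in the second pass a pair is counted only if both of its items are frequent and it hashes to a frequent bucket. This is particularly useful for large datasets, where the counts for all candidate pairs would otherwise not fit in main memory.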