# Chapter 8 Itemset Mining

* Purpose is to mine patterns that appear frequently in a dataset
* Most common example is *market basket analysis*
    * Mine the sets of groceries that are often bought together (also known as frequent itemsets)
    * Once mined, the frequent itemsets allow us to extract *association rules*
        * Association rules tell us how likely two sets of items are to co-occur or conditionally occur.
        * E.g. of association rules:
            * Users who visit the pages Main, Laptop, and Rebates also visit (->) shopping-cart and checkout
            * This suggests that the rebate offer results in more checkouts.

## 8.1 Frequent Itemsets and Association Rules

#### Itemsets and Tidsets
* Let $\textbf{I} = \{x_1, x_2, ..., x_m\}$ be a set of m elements called items.
    * $\textbf{I}$ contains only the individual items; combinations of items (itemsets) are defined next.
* A set $X \subseteq \textbf{I}$ is called an *itemset*
    * An itemset can contain multiple items from $\textbf{I}$, such as BE, AB, etc.
* $\textbf{I}$ might denote a collection of products
* An itemset of cardinality (or size) k is called a k-itemset. 
* We denote by $\textbf{I}^{(k)}$ the set of all k-itemsets, which are the subsets of $\textbf{I}$ with size k.
* Let $\textbf{T} = \{t_1, t_2, ..., t_n\}$ be a set of elements called **transaction identifiers** or **tids**. 
* A set $T \subseteq \textbf{T}$ is called a tidset.
* Assume itemsets and tidsets are sorted in lexicographic (alphabetical) order.
* A *transaction* is a tuple (t, X), where $t \in \textbf{T}$ is a *unique* transaction identifier and X is an itemset.
    * t never needs to appear more than once in the database: if the entity tied to t involves several items (e.g. a person buys multiple items), all of those items are stored in the single itemset that corresponds to t. This is best seen in the transactional database representation.
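
To make the $\textbf{I}^{(k)}$ notation concrete, here is a minimal sketch that enumerates k-itemsets in lexicographic order. It assumes the five items A to E used in the examples of these notes; the names `items` and `k_itemsets` are illustrative only.

```python
from itertools import combinations
from math import comb

# Item universe I, assuming the five items used in this chapter's examples.
items = ["A", "B", "C", "D", "E"]

def k_itemsets(items, k):
    """Return all k-itemsets of `items` as strings, in lexicographic order."""
    return ["".join(c) for c in combinations(sorted(items), k)]

two_itemsets = k_itemsets(items, 2)
print(two_itemsets)
# There are C(m, k) itemsets of size k, so the two counts below agree.
print(len(two_itemsets), comb(len(items), 2))
```

Summing the sizes of $\textbf{I}^{(k)}$ over all k gives $2^m$ itemsets in total, which is why exhaustive enumeration gets expensive quickly.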

#### Database Representation

#### Binary database
* A binary relation between tids (transaction identifiers) and items: $D \subseteq \textbf{T} \times \textbf{I}$.
* We say that tid $t \in \textbf{T}$ *contains* item $x \in \textbf{I}$ iff $(t,x)\in D$.
* In the binary matrix, this is the case exactly when the entry in row t and column x is 1.
* We say that tid t *contains* itemset $X=\{x_1, x_2, ..., x_k\}$, i.e. transaction t includes all items in the itemset X, iff $(t, x_i) \in D \text{ for all } i=1,2,...,k$

In [1]:
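# Binary database: one row per tid t, one 0/1 column per item
import pandas as pd
table = [[1,1,1,0,1,1],
         [2,0,1,1,0,1],
         [3,1,1,0,1,1],
         [4,1,1,1,0,1],
         [5,1,1,1,1,1],
         [6,0,1,1,1,0]]
# The first column holds the tid; naming it 't' avoids a duplicate 'D' column label
pd.DataFrame(table, columns=['t', 'A', 'B', 'C', 'D', 'E'])

Unnamed: 0,t,A,B,C,D,E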
0,1,1,1,0,1,1
1,2,0,1,1,0,1
2,3,1,1,0,1,1
3,4,1,1,1,0,1
4,5,1,1,1,1,1
5,6,0,1,1,1,0
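
The "contains" test for the binary representation can be sketched as follows. This is only an illustration: the relation is stored as a Python set of (tid, item) pairs built from the matrix above, and the names `matrix` and `contains` are made up for this example.

```python
items = ["A", "B", "C", "D", "E"]
# Rows of the binary matrix, keyed by tid.
matrix = {
    1: [1, 1, 0, 1, 1],
    2: [0, 1, 1, 0, 1],
    3: [1, 1, 0, 1, 1],
    4: [1, 1, 1, 0, 1],
    5: [1, 1, 1, 1, 1],
    6: [0, 1, 1, 1, 0],
}

# D holds a pair (t, x) exactly where the matrix has a 1.
D = {(t, x) for t, bits in matrix.items() for x, bit in zip(items, bits) if bit}

def contains(t, itemset):
    """True iff (t, x) is in D for every item x of the itemset."""
    return all((t, x) in D for x in itemset)

print(contains(1, "ABE"))  # A, B and E are all set in row 1
print(contains(2, "AB"))   # A is not set in row 2
```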


#### Transactional database
* We define $i(T) = \{x \mid \forall t \in T, \text{ t contains x}\}$, where $T \subseteq \textbf{T}$
    * English: i(T) returns the items that are common to *all* transactions in the tidset T, i.e. the items associated with every tid in T.
* For a single tid t, i(t) is simply the itemset of transaction t.
* A transaction database contains tuples of the form $(t, i(t))$ with $t \in \textbf{T}$

In [2]:
# Transactional database
data = [[1, "ABDE"],
       [2,"BCE"],
       [3,"ABDE"],
       [4, "ABCE"],
       [5,"ABCDE"],
       [6, "BCD"]]

pd.DataFrame(data, columns=['t', 'i(t)'])

Unnamed: 0,t,i(t)
0,1,ABDE
1,2,BCE
2,3,ABDE
3,4,ABCE
4,5,ABCDE
5,6,BCD
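
A small sketch of i(T) on the transactional database above. It assumes the database is held in a plain dict from tid to item string; the function name `i` just mirrors the notation and `db` is an illustrative name.

```python
# Transactional database: tid -> itemset, as in the table above.
db = {1: "ABDE", 2: "BCE", 3: "ABDE", 4: "ABCE", 5: "ABCDE", 6: "BCD"}

# All items that occur anywhere in the database.
all_items = set().union(*map(set, db.values()))

def i(tidset):
    """Items common to every transaction in `tidset`.

    For an empty tidset the condition holds vacuously, so all items are returned.
    """
    common = set(all_items)
    for t in tidset:
        common &= set(db[t])
    return "".join(sorted(common))

print(i({1}))      # itemset of transaction 1
print(i({1, 2}))   # items shared by transactions 1 and 2
print(i({2, 6}))   # items shared by transactions 2 and 6
```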


#### Vertical database
* We define $t(X) = \{t \mid t \in \textbf{T} \text{ and t contains } X\}$ where $X \subseteq \textbf{I}$
* t(X) is the set of tids that contain *all* items x in the itemset X.
    * It is the counterpart of i(T): instead of finding the items shared by a set of tids, it finds the tids shared by a set of items.

In [3]:
import numpy as np
# vertical database
data = [['t(x)', 1, 1, 2, 1, 1],
       ['t(x)', 3,2,4,3,2],
       ['t(x)', 4,3,5,5,3],
       ['t(x)', 5,4,6,6,4],
       ['t(x)', np.nan, 5, np.nan, np.nan, 5],
       ['t(x)', np.nan, 6, np.nan, np.nan, np.nan]]

pd.DataFrame(data, columns=['x', 'A', 'B', 'C', 'D', 'E'])

Unnamed: 0,x,A,B,C,D,E
0,t(x),1.0,1,2.0,1.0,1.0
1,t(x),3.0,2,4.0,3.0,2.0
2,t(x),4.0,3,5.0,5.0,3.0
3,t(x),5.0,4,6.0,6.0,4.0
4,t(x),,5,,,5.0
5,t(x),,6,,,


* Notation: for brevity, a transaction such as $\langle 1, \{A, B, D, E\} \rangle$ is written as $\langle 1, ABDE \rangle$
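
The tidset function t(X) from the vertical representation can be sketched by intersecting the per-item tidsets. This is a rough illustration, assuming the same six transactions as above; `vertical` and `t` are illustrative names and the itemset passed in is assumed to be non-empty.

```python
# Transactional database: tid -> itemset.
db = {1: "ABDE", 2: "BCE", 3: "ABDE", 4: "ABCE", 5: "ABCDE", 6: "BCD"}

# Vertical database: one tidset per single item.
vertical = {}
for tid, itemset in db.items():
    for x in itemset:
        vertical.setdefault(x, set()).add(tid)

def t(itemset):
    """Tids whose transaction contains every item of a non-empty `itemset`."""
    return set.intersection(*(vertical[x] for x in itemset))

print(sorted(vertical["C"]))  # the column for C in the vertical table
print(sorted(t("AD")))        # tids that contain both A and D
```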

#### Support and Frequent Itemsets
* The *support* of an itemset X in a dataset D, denoted as *sup*(X, D), is the number of transactions in D that contain the itemset X.
    * mathematically defined as $sup(X,D) = |\{t \mid \langle t, i(t) \rangle \in D \text{ and } X \subseteq i(t)\}| = |t(X)|$
    * English: count the tids whose itemset i(t) contains every item of X, i.e. X is a subset of (or equal to) i(t).
* *Relative support* is the fraction of transactions that contain X.
    * $rsup(X,D) = \frac{sup(X,D)}{|D|}$
* An itemset X is said to be *frequent* in D if, given some user-supplied value minsup, $sup(X,D) \geq minsup$.
    * minsup stands for minimum support threshold.
    * If minsup is specified as a fraction, it's assumed that the value is relative support.
* We use the set $\textbf{F}$ to denote the set of all frequent itemsets, and $\textbf{F}^{(k)}$ to denote the set of frequent k-itemsets.
    * As stated before, k is the number of items in each itemset. For example, $\textbf{F}^{(3)}$ contains the frequent item combinations of size 3.
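
The support definitions translate almost directly into code. A minimal sketch, assuming the transactional database from earlier in these notes held in a dict; `sup` and `rsup` mirror the notation, everything else is an illustrative name.

```python
# Transactional database from this chapter's running example.
db = {1: "ABDE", 2: "BCE", 3: "ABDE", 4: "ABCE", 5: "ABCDE", 6: "BCD"}

def sup(itemset, db):
    """Number of transactions whose itemset i(t) contains every item of `itemset`."""
    return sum(1 for items in db.values() if set(itemset) <= set(items))

def rsup(itemset, db):
    """Fraction of transactions that contain `itemset`."""
    return sup(itemset, db) / len(db)

minsup = 3  # same threshold as the tables in these notes
print(sup("BE", db), rsup("BE", db))  # absolute and relative support of BE
print(sup("CD", db) >= minsup)        # CD occurs in only 2 transactions
```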

In [6]:
# Frequent itemset with minsup = 3
# How support shrinks as itemsets grow is the foundation Apriori builds on.
data = [
    [6, ['B']],
    [5, ['E', 'BE']],
    [4, ['A', 'C', 'D', 'AB', 'AE', 'BC', 'BD', 'ABE']],
    [3, ['AD', 'CE', 'DE', 'ABD', 'ADE', 'BCE', 'BDE', 'ABDE']]
]
pd.DataFrame(data, columns=['sup', 'itemsets'])
# Using the binary database above, B appears in all 6 transactions,
# so sup(B) = 6. 'BE' appears in every transaction except
# transaction 6 (BCD), which does not contain E, so sup(BE) = 5.

Unnamed: 0,sup,itemsets
0,6,[B]
1,5,"[E, BE]"
2,4,"[A, C, D, AB, AE, BC, BD, ABE]"
3,3,"[AD, CE, DE, ABD, ADE, BCE, BDE, ABDE]"


In [5]:
data = [
    [1, ['A', 'B', 'C', 'D', 'E']],
    [2, ['AB', 'AD', 'AE', 'BC', 'BD', 'BE', 'CE', 'DE']],
    [3, ['ABD', 'ABE', 'ADE', 'BCE', 'BDE']],
    [4, ['ABDE']]
]
pd.DataFrame(data, columns=['k', 'frequent itemsets'])

Unnamed: 0,k,frequent itemsets
0,1,"[A, B, C, D, E]"
1,2,"[AB, AD, AE, BC, BD, BE, CE, DE]"
2,3,"[ABD, ABE, ADE, BCE, BDE]"
3,4,[ABDE]


#### Association Rules
* An association rule is expressed as $X \xrightarrow{s,c} Y$, where X and Y are disjoint itemsets: $X \cap Y = \emptyset$ and $X,Y \subseteq \textbf{I}$
* Let itemset $X \cup Y$ be denoted as XY.
* The *support* of an association rule is the number of transactions (occurrences) in which XY occurs.
    * Mathematically defined as $s = sup(X \rightarrow Y) = |\textbf{t}(XY)| = sup(XY)$, where $\textbf{t}(XY)$ is the tidset of XY, so its cardinality is the number of transactions that contain every item of XY
* The *relative support* is defined as the joint probability that a transaction contains both X and Y.
    * Mathematically defined as $rsup(X \rightarrow Y) = \frac{sup(XY)}{|\textbf{D}|} = P(X \wedge Y)$
* The *confidence* of a rule is the conditional probability that a transaction contains Y given that it contains X.
    * $c = conf(X \rightarrow Y) = P(Y|X) = \frac{P(X \wedge Y)}{P(X)} = \frac{sup(XY)}{sup(X)}$
* A rule is **frequent** if XY is frequent (defined as $sup(XY) \geq minsup$)
* A rule is **strong** if $conf \geq minconf$, where minconf is a user-specified minimum confidence threshold
    * E.g. using the binary database above, $s = sup(BC \rightarrow E) = sup(BCE) = 3$
    * $c = conf(BC \rightarrow E) = \frac{sup(BCE)}{sup(BC)} = \frac{3}{4} = 0.75$
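
The rule measures can be checked with a few lines of code. This is a sketch under the same assumptions as the earlier examples (six transactions, itemsets written as strings); `rule_stats` is a made-up helper name, and it assumes sup(X) > 0.

```python
# Transactional database from this chapter's running example.
db = {1: "ABDE", 2: "BCE", 3: "ABDE", 4: "ABCE", 5: "ABCDE", 6: "BCD"}

def sup(itemset):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for items in db.values() if set(itemset) <= set(items))

def rule_stats(X, Y):
    """Return (support, confidence) of the rule X -> Y; X and Y must be disjoint."""
    assert not set(X) & set(Y), "X and Y must be disjoint"
    s = sup(X + Y)          # sup(XY)
    return s, s / sup(X)    # conf = sup(XY) / sup(X)

print(rule_stats("BC", "E"))
# Confidence is not symmetric: E -> BC divides by sup(E) instead of sup(BC).
print(rule_stats("E", "BC"))
```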


## 8.2 Itemset Mining Algorithms

#### Candidate Generation (Brute Force)
* there are two steps:
    * i.) Candidate generation
    * ii.) Support computation
* Candidate Generation
    * Generate the power set of $\textbf{I}$. Since this is the power set, there are $2^{|\textbf{I}|}$ possible candidates.
    * The itemsets are arranged in a lattice structure in which there is a link between two itemsets X and Y iff X is an *immediate subset* of Y, i.e. $X \subseteq Y$ and $|X| = |Y| - 1$.
        * In other words, Y is X plus exactly one extra item.
    * To enumerate the candidates, use BFS or DFS on the prefix tree, where two itemsets X and Y are connected by a link iff X is an immediate subset and prefix of Y.
    * Enumeration can start from the empty set and add one item at a time.
    * This needs a custom data structure (the prefix tree) to store the itemsets so that the condition above holds.
    * Ultimately this generates every possible candidate over $\textbf{I}$.
* Support Computation
    * calculates the support of each candidate pattern X and determines if it's frequent, sup(X) >= minsup.
* Computational Complexity
    * Calculating the support of a single candidate is $O(|\textbf{I}| \cdot |D|)$: every transaction $\langle t, i(t) \rangle$ has to be checked ($|D|$ of them), and each subset test $X \subseteq i(t)$ costs up to $|\textbf{I}|$ steps.
        * Worst case is a transaction whose itemset i(t) contains all items in $\textbf{I}$
    * Since this is done for all $2^{|\textbf{I}|}$ candidate itemsets, the overall complexity is $O(|\textbf{I}| \cdot |D| \cdot 2^{|\textbf{I}|})$.
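
The two brute-force steps can be sketched in a few lines. This is a simplified version that enumerates candidates with `itertools.combinations` rather than walking a prefix tree, and it assumes the six-transaction example database with minsup = 3; all names are illustrative.

```python
from itertools import combinations

# Transactional database from this chapter's running example.
db = {1: "ABDE", 2: "BCE", 3: "ABDE", 4: "ABCE", 5: "ABCDE", 6: "BCD"}
items = sorted(set("".join(db.values())))
minsup = 3

frequent = {}
n_candidates = 0
for k in range(1, len(items) + 1):
    # Candidate generation: every k-itemset over I is a candidate.
    for cand in combinations(items, k):
        n_candidates += 1
        # Support computation: one subset test per transaction.
        s = sum(1 for itemset in db.values() if set(cand) <= set(itemset))
        if s >= minsup:
            frequent["".join(cand)] = s

print(n_candidates)   # all non-empty subsets of I were tested
print(len(frequent))  # how many of them turned out frequent
```

Every non-empty subset of the item universe is tested here, including supersets of itemsets already known to be infrequent, which is the waste the level-wise approach avoids.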

### 8.2.1 Level-Wise Approach: Apriori Algorithm (this boi gets his own section)
* Brute force iterates through all possible itemsets, but that is not necessary.
    * Let $X,Y \subseteq \textbf{I}$ 
    * Some basic truths given definitions:
        * if $X \subseteq Y$, the itemset X (like ABC) is contained in the itemset Y (like ABCD)
            * Then $sup(X) \geq sup(Y)$, because every transaction that contains Y also contains X
        * From the above, we can say two things:
            * if Y is frequent, $sup(Y) \geq minsup$, then every X such that $X \subseteq Y$ is also frequent. (If ABCD is frequent then ABC is also frequent)
            * if Y is **not** frequent, $sup(Y) < minsup$, then every X such that $Y \subseteq X$ is also not frequent.
* The Apriori algorithm uses these two properties to improve on the brute-force approach by exploring the itemset lattice level by level and pruning.
    * It prunes all supersets of any infrequent candidate, since no superset of an infrequent itemset can be frequent.
    * Conversely, if one itemset is frequent, say ABDE, then all of its subsets (ABD, BDE, ABE, etc.) are also frequent.
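
Both properties can be checked empirically on the example database. The sketch below is only a sanity check, not part of the algorithm; it assumes the six transactions and minsup = 3 from earlier, and the variable names are illustrative.

```python
from itertools import combinations

# Transactional database from this chapter's running example.
db = {1: "ABDE", 2: "BCE", 3: "ABDE", 4: "ABCE", 5: "ABCDE", 6: "BCD"}
items = "ABCDE"
minsup = 3

def sup(itemset):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for its in db.values() if set(itemset) <= set(its))

# Check sup(X) >= sup(Y) for every itemset Y and each immediate subset X of Y.
violations = []
for k in range(2, len(items) + 1):
    for Y in combinations(items, k):
        for X in combinations(Y, k - 1):
            if sup(X) < sup(Y):
                violations.append(("".join(X), "".join(Y)))
print(violations)  # an empty list means the property holds on this database

# CD is infrequent, so every superset of CD must be infrequent too.
supersets_of_cd = ["".join(Y) for k in range(3, 6)
                   for Y in combinations(items, k) if set("CD") <= set(Y)]
print(all(sup(Y) < minsup for Y in supersets_of_cd))
```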

#### APRIORI
* $\textbf{F} \leftarrow \emptyset$
    * Set F to be the empty set <br/>
* $C^{(1)} \leftarrow \{\emptyset\}$
    * Creates the prefix tree with only its root, the empty itemset; the single items are added in the next step.
    * In the example, this is the first row, which holds just $\emptyset$ <br/>
* foreach $i \in \textbf{I}$ do add i as child of $\emptyset$ in $C^{(1)}$ with $sup(i) \leftarrow 0$
    * constructs the {A, B, C, D, E} row for k = 1 by adding each individual item in $\textbf{I}$ to $C^{(1)}$
    * $sup(i) \leftarrow 0$ initializes the support of every itemset in that row (A, B, C, etc.) to 0.
* $k \leftarrow 1$
    * k denotes the current level; it is the counter variable used in the rest of the algorithm
* while $C^{(k)} \neq \emptyset$ do <br/>
    * keep going as long as level k still has candidate nodes
    * ComputeSupport$(C^{(k)}, \textbf{D})$
        * see ComputeSupport below
    * foreach leaf $X \in C^{(k)}$ do:
        * iterates through all candidate k-itemsets
        * if $sup(X) \geq minsup$ then $\textbf{F} \leftarrow \textbf{F} \cup \{(X, sup(X))\}$
            * if the current k-itemset is frequent, add it (with its support) to F.
        * else remove X from $C^{(k)}$
            * if a k-itemset is not frequent, remove it from the k-th level.
    * $C^{(k+1)} \leftarrow$ ExtendPrefixTree$(C^{(k)})$
        * pass the remaining (frequent) k-itemsets into ExtendPrefixTree (see below)
    * $k \leftarrow k + 1$
        * move on to the next level
* return $\textbf{F}$
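
The main loop can be sketched end to end as below. This is a simplified take rather than the exact structure of the pseudocode: candidates live in a dict keyed by sorted tuples instead of a prefix tree, and the ComputeSupport and ExtendPrefixTree steps are inlined. The function name `apriori` and its arguments are illustrative.

```python
from itertools import combinations

# Transactional database from this chapter's running example.
db = {1: "ABDE", 2: "BCE", 3: "ABDE", 4: "ABCE", 5: "ABCDE", 6: "BCD"}

def apriori(db, minsup):
    """Level-wise mining; a dict of candidates stands in for the prefix tree."""
    items = sorted(set("".join(db.values())))
    candidates = {(x,): 0 for x in items}   # C(1), all supports start at 0
    frequent = {}                           # F
    k = 1
    while candidates:
        # ComputeSupport: count each k-subset of a transaction that is a candidate.
        for itemset in db.values():
            for subset in combinations(sorted(itemset), k):
                if subset in candidates:
                    candidates[subset] += 1
        # Keep the frequent candidates, drop the rest.
        candidates = {X: s for X, s in candidates.items() if s >= minsup}
        frequent.update(candidates)
        # ExtendPrefixTree: join itemsets sharing a (k-1)-prefix, then prune.
        level = sorted(candidates)
        next_candidates = {}
        for a, Xa in enumerate(level):
            for Xb in level[a + 1:]:
                if Xa[:-1] != Xb[:-1]:
                    break  # sorted order: no further siblings of Xa
                Xab = Xa + (Xb[-1],)
                if all(sub in candidates for sub in combinations(Xab, k)):
                    next_candidates[Xab] = 0
        candidates = next_candidates
        k += 1
    return {"".join(X): s for X, s in frequent.items()}

result = apriori(db, minsup=3)
print(len(result))
print(result["ABDE"], "BCD" in result)
```

On the example database only 21 candidates get their support counted (5 + 10 + 5 + 1 across the four levels), compared with 31 non-empty itemsets in the brute-force enumeration.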

* ComputeSupport$(C^{(k)}, \textbf{D})$:
* foreach $\langle t, i(t) \rangle \in \textbf{D}$ do:
    * for every transaction, i.e. every tid t together with its itemset i(t)
    * foreach k-subset $X \subseteq i(t)$ do:
        * for every k-itemset that can be formed from the items in i(t)
        * if $X \in C^{(k)}$ then $sup(X) \leftarrow sup(X) + 1$
            * if X is a candidate at the current level (only level k matters here, not levels k-1, k-2, etc.), increment its support.
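
A sketch of the support-counting step on its own, assuming the level-2 candidates are all ten pairs over A to E (which is what level 2 looks like in the running example, since every single item is frequent). The candidates are stored in a plain dict from itemset string to count; names are illustrative.

```python
from itertools import combinations

# Transactional database from this chapter's running example.
db = {1: "ABDE", 2: "BCE", 3: "ABDE", 4: "ABCE", 5: "ABCDE", 6: "BCD"}

def compute_support(candidates, db, k):
    """Increment sup(X) for each candidate X that is a k-subset of a transaction."""
    for itemset in db.values():
        for subset in combinations(sorted(itemset), k):
            key = "".join(subset)
            if key in candidates:
                candidates[key] += 1
    return candidates

# C(2): all pairs of items, each starting with support 0.
C2 = {"".join(p): 0 for p in combinations("ABCDE", 2)}
compute_support(C2, db, 2)
print(C2)
print(sorted(X for X, s in C2.items() if s < 3))  # pairs that fail minsup = 3
```

Each transaction is scanned once per level, and only subsets that are still candidates are counted, which is why pruning earlier levels reduces the work here.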

* ExtendPrefixTree$(C^{(k)})$:
* foreach leaf $X_a \in C^{(k)}$ do:
    * for every leaf in the k-th level of C
    * foreach leaf $X_b \in SIBLING(X_a)$, such that b > a do:
        * iterate through all sibling k-itemsets to the right of $X_a$ (siblings share the same parent, i.e. the same (k-1)-prefix)
        * $X_{ab} \leftarrow X_a \cup X_b$
            * construct a potential candidate for level k+1
        * if $X_j \in C^{(k)}$ for all $X_j \subset X_{ab}$ such that $|X_j| = |X_{ab}| - 1$ then add $X_{ab}$ as a child of $X_a$ with $sup(X_{ab}) \leftarrow 0$
            * checks that every k-subset of $X_{ab}$ is still in $C^{(k)}$, i.e. is frequent. If any of its subsets is missing, $X_{ab}$ cannot be frequent by the second property above (if Y is **not** frequent, then no X with $Y \subseteq X$ is frequent), so it is not added. See the example at the end of this list.
    * if no extensions from $X_a$ then remove $X_a$ from $C^{(k)}$
        * a leaf that produced no extensions is a dead end: it cannot lead to any (k+1)-candidate, so it is dropped from the tree.
* Example: if $X_a = BC$ and $X_b = BD$ then $X_{ab} = BCD$. The subsets of BCD of size 2 are BC, BD, and CD; CD is not in $C^{(2)}$ because it was removed as infrequent, so BCD is not a valid candidate and is pruned.
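
The join-and-prune step can be sketched without an explicit tree by treating itemsets with the same (k-1)-prefix as siblings. This is an approximation of ExtendPrefixTree under that assumption; `extend` is a made-up name, and the input is the list of frequent 2-itemsets from the running example.

```python
from itertools import combinations

def extend(frequent_k):
    """Join sibling k-itemsets and prune candidates that have an infrequent k-subset.

    Returns (kept, pruned) lists of (k+1)-itemsets.
    """
    level = sorted(frequent_k)
    known = set(level)
    kept, pruned = [], []
    for a, Xa in enumerate(level):
        for Xb in level[a + 1:]:
            if Xa[:-1] != Xb[:-1]:
                break  # sorted order: Xb and everything after it is not a sibling
            Xab = Xa + Xb[-1]
            k = len(Xa)
            # Every k-subset of Xab must have survived the previous level.
            if all("".join(sub) in known for sub in combinations(Xab, k)):
                kept.append(Xab)
            else:
                pruned.append(Xab)
    return kept, pruned

# Frequent 2-itemsets from the running example (minsup = 3).
F2 = ["AB", "AD", "AE", "BC", "BD", "BE", "CE", "DE"]
kept, pruned = extend(F2)
print(kept)
print(pruned)
```

Here BCD is generated from BC and BD but dropped because CD is missing from the frequent 2-itemsets, so its support never needs to be counted.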