# Discovery of Frequent Itemsets and Association Rules

In [2]:
import pandas as pd
from collections import defaultdict
from itertools import permutations

In [3]:
def read_data(path):
    '''
    Read a .dat file with one basket per line,
    items separated by whitespace
    '''
    baskets = []
    with open(path) as f:
        for line in f:
            basket = line.strip().split()
            baskets.append(set(basket)) # set is used to remove duplicates
    return baskets #list of sets

## 1. Finding frequent itemsets with support at least s

Given a large set of items $I=\{i_1, i_2, \dots, i_N\}$ and a set of baskets $T=\{t_1, t_2, \dots, t_n\}$ such that $t_j \subset I$, we define:
* an itemset $X$: $X \subset I$
* a $k$-itemset $X$: $X \subset I$ such that $|X|=k$
* an association rule $X \rightarrow Y$: $X,Y \subset I$ and $X \cap Y = \emptyset$

The support of an itemset $X$ is the number of baskets containing all items in $X$.

A frequent itemset is an itemset whose support is at least the support threshold $s$.
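
As a minimal illustration of these definitions, the sketch below counts support on a handful of made-up baskets. The data and the helper name `support_count` are assumptions for illustration only; they are not part of the notebook or of `T10I4D100K.dat`.

```python
# Toy baskets (hypothetical data, not taken from T10I4D100K.dat)
baskets = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk"},
]

def support_count(itemset, baskets):
    """Number of baskets that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(1 for basket in baskets if itemset.issubset(basket))

s = 2  # support threshold, as an absolute number of baskets

print(support_count({"milk"}, baskets))            # 3
print(support_count({"milk", "bread"}, baskets))   # 2
print(support_count({"milk", "butter"}, baskets))  # 1

# {"milk", "bread"} is frequent for s = 2, {"milk", "butter"} is not
print(support_count({"milk", "bread"}, baskets) >= s)   # True
print(support_count({"milk", "butter"}, baskets) >= s)  # False
```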

### A-Priori algorithm

* **Step 1:** Read the baskets and count the occurrences of each individual item. Find the frequent items, i.e. items appearing in at least $s$ baskets.
  * Input: Baskets $T$ and support threshold $s$ (e.g. $s=1\%$ of the total number of baskets).
  * Output: Frequent items (singletons such that $count(singleton)\geq s$).
* **Step 2:** Read the baskets again and count only those candidate pairs whose elements are all frequent.
  * Input: Baskets $T$, frequent items and support threshold $s$.
  * Output: Frequent pairs.
* **REPEAT Step 2** with the frequent $k$-itemsets as input to obtain the frequent $(k+1)$-itemsets, until no more candidates are generated.
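
The steps above can be sketched as a single level-wise loop. This is one possible variant, not the notebook's implementation: it builds candidates by joining frequent $(k-1)$-itemsets and pruning any candidate that has an infrequent subset, whereas the functions defined in this notebook extend each frequent itemset with further items from the basket. The name `apriori_sketch` and the toy baskets are assumptions for illustration.

```python
from collections import defaultdict
from itertools import combinations

def apriori_sketch(baskets, s):
    """Return {frozenset: count} for every itemset with support >= s."""
    # Step 1: count singletons and keep the frequent ones.
    counts = defaultdict(int)
    for basket in baskets:
        for item in basket:
            counts[frozenset([item])] += 1
    current = {x: c for x, c in counts.items() if c >= s}
    frequent = dict(current)

    k = 2
    while current:
        # Candidate k-itemsets: unions of two frequent (k-1)-itemsets
        # whose (k-1)-subsets are all frequent.
        prev = set(current)
        candidates = set()
        for a, b in combinations(prev, 2):
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in prev for sub in combinations(union, k - 1)
            ):
                candidates.add(union)

        # Step 2: one pass over the baskets, counting only the candidates.
        counts = defaultdict(int)
        for basket in baskets:
            for cand in candidates:
                if cand <= basket:
                    counts[cand] += 1
        current = {x: c for x, c in counts.items() if c >= s}
        frequent.update(current)
        k += 1
    return frequent

# Toy baskets (hypothetical data)
toy_baskets = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"b", "c"},
    {"a", "b", "c"},
]
result = apriori_sketch(toy_baskets, s=3)
print(len(result))                      # 6: three singletons and three pairs
print(result[frozenset({"a", "b"})])    # 3
print(frozenset({"a", "b", "c"}) in result)  # False: support is only 2
```

Pruning on subsets relies on the fact that an itemset cannot be frequent unless all of its subsets are frequent, so the second pass never counts candidates that are already known to fail.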

In [4]:
def find_single_cand(baskets, support):
    ''' 
    Count how often each single item occurs; this must be run first.
    baskets is the data obtained from read_data (list of sets)
    support is the minimum number of baskets an item must appear in
    Returns the counts of all items and the list of frequent singletons.
    '''
    # Count the occurrences of each singleton
    res_dict = defaultdict(int)
    freq_dict = []
    for basket in baskets:
        for item in basket:
            res_dict[item] += 1
    
    # Keep the frequent items
    for key, value in res_dict.items():
        if value >= support:
            freq_dict.append({key}) #list of sets

    return res_dict, freq_dict

In [5]:
def find_freq_items(baskets, prev_cand, support):
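    ''' 
    Find the frequent k-itemsets (k >= 2) by extending each frequent
    (k-1)-itemset in prev_cand with one more item of the basket.
    Returns the counts and the list of frequent k-itemsets.
    '''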

    #prev_cand = [set(cand) for cand in prev_cand]

    freq_dict = defaultdict(int)
    res_dict = {}
    final_candidate = []
    final_count = {}

    # each candidate is a frequent itemset from prev_cand plus one other item of the basket
    for basket in baskets:
        for cand in prev_cand:
            if cand.issubset(basket): # check if the candidate is in the basket
                for item in basket:
                    if item not in cand:
                        freq_dict[tuple(cand) + tuple({item})] += 1 # count the frequency of each candidate

    # Keep the candidates whose count is at least the support threshold
    for key, value in freq_dict.items():
        if value >= support:
            res_dict[key] = value
    
    # Remove duplicates: the same itemset is reached in several item orders
    for item in res_dict:
        if set(item) not in final_candidate:
            final_candidate.append(set(item))
            final_count[tuple(set(item))] = res_dict[item]

    return final_count, final_candidate

**Note:** Both functions also return the item counts.

#### Results

* Support threshold $s=1000$, i.e. $1\%$ of the total number of baskets.
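
A relative threshold can be converted to an absolute basket count as in the sketch below. The helper `absolute_support` is hypothetical, and the figure of 100,000 baskets is inferred from the statement that $s=1000$ is $1\%$ of the total; in the notebook one could pass `len(baskets)` instead.

```python
import math

def absolute_support(percent, n_baskets):
    """Smallest basket count that reaches `percent` % of `n_baskets`."""
    return math.ceil(n_baskets * percent / 100)

# Assumed total of 100,000 baskets (inferred, see the note above)
print(absolute_support(1, 100_000))  # 1000
```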

In [6]:
support = 1000
baskets = read_data('T10I4D100K.dat')
_, freq_cand_1 = find_single_cand(baskets, support)
#freq_dict_1 = find_freq_items(baskets, )
freq_count_2, freq_cand_2 = find_freq_items(baskets, freq_cand_1, support)
freq_count_3, freq_cand_3 = find_freq_items(baskets, freq_cand_2, support)

In [7]:
len(freq_cand_1)

375

Frequent pairs:

In [8]:
freq_cand_2

[{'39', '825'},
 {'704', '825'},
 {'39', '704'},
 {'227', '390'},
 {'789', '829'},
 {'368', '829'},
 {'217', '346'},
 {'368', '682'},
 {'390', '722'}]

Frequent triplet:

In [9]:
freq_cand_3

[{'39', '704', '825'}]

No more candidates are generated: quadruplets with support at least $s$ don't exist.

In [10]:
freq_count_4, freq_cand_4 = find_freq_items(baskets, freq_cand_3, support)
freq_cand_4

[]

## 2. Association rules

Given $X,Y \subset I$ with $X \cap Y = \emptyset$, $X \rightarrow Y$ is an association rule.

We define the confidence of an association rule $X \rightarrow Y$ as follows,  $conf(X \rightarrow Y) = \frac{support(X \cup Y)}{support(X)}$ .

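**Obs:** The confidence is the fraction of baskets containing $X$ that also contain $Y$. The higher the confidence, the more reliably the presence of $X$ in a basket indicates the presence of $Y$.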
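
The formula can be checked by hand on a few made-up baskets. The sketch below uses hypothetical data and hypothetical helper names (`support_count`, `confidence`); it also shows that confidence is not symmetric in $X$ and $Y$.

```python
# Toy baskets (hypothetical data, not taken from T10I4D100K.dat)
baskets = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk"},
]

def support_count(itemset, baskets):
    """Number of baskets that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(1 for basket in baskets if itemset <= basket)

def confidence(x, y, baskets):
    """conf(X -> Y) = support(X ∪ Y) / support(X)."""
    return support_count(set(x) | set(y), baskets) / support_count(x, baskets)

# support({butter, bread}) = 2, support({butter}) = 2, support({bread}) = 3
print(confidence({"butter"}, {"bread"}, baskets))  # 2/2 = 1.0
print(confidence({"bread"}, {"butter"}, baskets))  # 2/3 ≈ 0.667
```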

In [11]:
def cal_conf(association_X, association_Y, baskets):
    '''
    Compute conf(X -> Y) = support(X ∪ Y) / support(X).
    association_X and association_Y are tuples of items; they are converted to sets.
    baskets is the list of sets returned by read_data.
    '''
    association_X = set(association_X)
    association_Y = set(association_Y)
    cnt_X = 0   # baskets containing X
    cnt_XY = 0  # baskets containing both X and Y
    for basket in baskets:
        if association_X.issubset(basket):
            cnt_X += 1
            if association_Y.issubset(basket):
                cnt_XY += 1
    # if X never occurs, report confidence 0 instead of dividing by zero
    return cnt_XY/cnt_X if cnt_X else 0.0

def find_association_rules(candidates, threshold, baskets):
    ''' 
    Find the association rules X -> Y with confidence >= threshold,
    where X and Y split a frequent itemset from candidates.
    Returns a list of [X, Y, confidence].
    '''
    res_pairs = []
    for candidate in candidates:
        # antecedents already evaluated for this candidate, to avoid duplicate rules
        association_X_list = []
        # permutations enumerates every ordering of the candidate's items
        for cand_comb in permutations(candidate):
            # pos is the position where the ordering is split into X and Y
            for pos in range(1, len(cand_comb)):
                association_X = cand_comb[:pos]
                association_Y = cand_comb[pos:]
                # skip antecedents that were already evaluated
                if frozenset(association_X) not in association_X_list:
                    association_X_list.append(frozenset(association_X))
                    # compute the confidence once and keep the rule if it reaches the threshold
                    conf = cal_conf(association_X, association_Y, baskets)
                    if conf >= threshold:
                        res_pairs.append([association_X, association_Y, conf])
    
    return res_pairs
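
The same set of rules can also be enumerated without going through every permutation. The sketch below is an alternative, not the notebook's method: it picks each non-empty proper subset of the itemset as the antecedent with `itertools.combinations`, so no duplicate check is needed. The function name `rules_from_itemset` and the toy baskets are assumptions for illustration.

```python
from itertools import combinations

def rules_from_itemset(itemset, baskets, threshold):
    """Return (X, Y, confidence) for every split of `itemset` into X -> Y
    whose confidence is at least `threshold`."""
    items = sorted(itemset)
    full = set(items)
    support_full = sum(1 for b in baskets if full <= b)
    rules = []
    # every non-empty proper subset of the itemset is a possible antecedent X
    for size in range(1, len(items)):
        for x in combinations(items, size):
            support_x = sum(1 for b in baskets if set(x) <= b)
            if support_x == 0:
                continue  # X never occurs, confidence is undefined
            conf = support_full / support_x
            if conf >= threshold:
                y = tuple(i for i in items if i not in x)
                rules.append((x, y, conf))
    return rules

# Toy baskets (hypothetical data)
toy_baskets = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk"},
]
print(rules_from_itemset({"bread", "butter"}, toy_baskets, 0.9))
# [(('butter',), ('bread',), 1.0)]
```

A $k$-itemset has $2^k - 2$ possible antecedents but $k!$ permutations, so enumerating subsets directly does less work as $k$ grows.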

#### Results

Only association rules with a confidence of at least a threshold $t$ are kept.

* Support $s=1000$
* Confidence threshold $t=0.5$

In [12]:
support = 1000
baskets = read_data('T10I4D100K.dat')
_, freq_cand_1 = find_single_cand(baskets, support)
freq_count_2, freq_cand_2 = find_freq_items(baskets, freq_cand_1, support)
res_pairs_2 = find_association_rules(freq_cand_2, 0.5, baskets)
freq_count_3, freq_cand_3 = find_freq_items(baskets, freq_cand_2, support)
res_pairs_3 = find_association_rules(freq_cand_3, 0.5, baskets)

Association rules between 2 items, $i_i \rightarrow i_j$, where $i_i, i_j \in I$ and $i_i \neq i_j$:

In [13]:
res_pairs_2

[[('704',), ('825',), 0.6142697881828316],
 [('704',), ('39',), 0.617056856187291],
 [('227',), ('390',), 0.577007700770077]]

Association rules between 3 items, $(i_i,i_j) \rightarrow i_k$ or $i_i \rightarrow (i_j, i_k)$, where $i_i, i_j, i_k \in I$ are pairwise distinct:

In [14]:
res_pairs_3

[[('825', '39'), ('704',), 0.8719460825610783],
 [('825', '704'), ('39',), 0.9392014519056261],
 [('39', '704'), ('825',), 0.9349593495934959],
 [('704',), ('825', '39'), 0.5769230769230769]]