# Example 1 -- Generating Frequent Itemsets and association rules

The __association_rules__ function takes a DataFrame of __frequent itemsets__ as produced by the __apriori__ function in mlxtend.frequent_patterns.

To demonstrate the usage of association_rules, we first create a pandas DataFrame of frequent itemsets with the apriori function.

The apriori function expects data in a one-hot encoded pandas DataFrame. Suppose we have the following transaction data:

#### -- Generate frequent itemsets

In [1]:
# pip install mlxtend
import pandas as pd

from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

In [3]:
# dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
#            ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
#            ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
#            ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
#            ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]

In [2]:
dataset = [['A', 'B', 'C'],
           ['A', 'C'],
           ['A', 'D'],
           ['B', 'E', 'F']
           ]

In [5]:
# dataset = [['beer', 'wine', 'rum'],
#            ['beer', 'rum', 'vodka'],
#            ['beer', 'vodka'],
#            ['beer', 'wine', 'rum']
#            ]

We can transform it into the right format via the TransactionEncoder as follows:

In [3]:
te     = TransactionEncoder()

te_ary = te.fit(dataset).transform(dataset)
te_ary

array([[ True,  True,  True, False, False, False],
       [ True, False,  True, False, False, False],
       [ True, False, False,  True, False, False],
       [False,  True, False, False,  True,  True]])

In [4]:
te.columns_

['A', 'B', 'C', 'D', 'E', 'F']
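
For intuition, the same encoding can be reproduced by hand. The following is a minimal plain-Python sketch of one-hot encoding, not TransactionEncoder's actual implementation; the names `columns` and `one_hot` are illustrative only.

```python
# Plain-Python sketch of one-hot encoding a list of transactions.
dataset = [['A', 'B', 'C'],
           ['A', 'C'],
           ['A', 'D'],
           ['B', 'E', 'F']]

# Column order: every distinct item, sorted alphabetically.
columns = sorted({item for transaction in dataset for item in transaction})

# One row per transaction, one boolean per column.
one_hot = [[item in transaction for item in columns] for transaction in dataset]

print(columns)
for row in one_hot:
    print(row)
```

Each row marks which of the six items occur in the corresponding transaction, matching the array shown above.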

In [5]:
# create a dataframe
df     = pd.DataFrame(te_ary, columns=te.columns_)
df

|   | A | B | C | D | E | F |
|---|---|---|---|---|---|---|
| 0 | True | True | True | False | False | False |
| 1 | True | False | True | False | False | False |
| 2 | True | False | False | True | False | False |
| 3 | False | True | False | False | True | True |


Now, let us return the items and itemsets with at least 50% support:

In [6]:
apriori(df, min_support=0.5)

|   | support | itemsets |
|---|---|---|
| 0 | 0.75 | (0) |
| 1 | 0.5 | (1) |
| 2 | 0.5 | (2) |
| 3 | 0.5 | (0, 2) |


By default, apriori returns the column indices of the items, which may be useful in downstream operations such as association rule mining. For better readability, we can set use_colnames=True to convert these integer values into the respective item names:

In [7]:
apriori(df, min_support=0.5, use_colnames=True)

|   | support | itemsets |
|---|---|---|
| 0 | 0.75 | (A) |
| 1 | 0.5 | (B) |
| 2 | 0.5 | (C) |
| 3 | 0.5 | (C, A) |
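
The support values above can be checked by hand: the support of an itemset is the fraction of transactions that contain all of its items. The sketch below enumerates every candidate by brute force, which is only reasonable for a toy dataset and is not the pruning strategy the Apriori algorithm uses; `support` and `frequent` are illustrative names.

```python
from itertools import combinations

dataset = [['A', 'B', 'C'],
           ['A', 'C'],
           ['A', 'D'],
           ['B', 'E', 'F']]
transactions = [set(t) for t in dataset]
n = len(transactions)


def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / n


# Brute force: test every combination of items against the threshold.
items = sorted(set().union(*transactions))
frequent = {}
for size in range(1, len(items) + 1):
    for candidate in combinations(items, size):
        s = support(candidate)
        if s >= 0.5:
            frequent[candidate] = s

for itemset, s in frequent.items():
    print(itemset, s)
```

With a threshold of 0.5, only A, B, C, and the pair (A, C) survive, in agreement with the apriori output.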


#### -- Selecting and Filtering Results

Let's assume we are only interested in itemsets that contain at least 2 items and have a support of at least 50 percent.

First, we create the frequent itemsets via apriori and add a new column that stores the length of each itemset:

In [10]:
frequent_itemsets = apriori(df, min_support=0.5, use_colnames=True)

frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(lambda x: len(x))
frequent_itemsets

|   | support | itemsets | length |
|---|---|---|---|
| 0 | 0.75 | (A) | 1 |
| 1 | 0.5 | (B) | 1 |
| 2 | 0.5 | (C) | 1 |
| 3 | 0.5 | (C, A) | 2 |


Then, we can select the results that satisfy our desired criteria as follows:

In [11]:
frequent_itemsets[ (frequent_itemsets['length']  >= 2) &
                   (frequent_itemsets['support'] >= 0.5) ]

|   | support | itemsets | length |
|---|---|---|---|
| 3 | 0.5 | (C, A) | 2 |


Similarly, using the Pandas API, we can select entries based on the "itemsets" column:

In [12]:
frequent_itemsets[ frequent_itemsets['itemsets'] == {'A', 'C'} ]

|   | support | itemsets | length |
|---|---|---|---|
| 3 | 0.5 | (C, A) | 2 |
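
The comparison against `{'A', 'C'}` matches the row displayed as (C, A) because the entries in the "itemsets" column are frozensets, and Python compares sets by content rather than by order. A small standalone illustration (no pandas required):

```python
# A frozenset compares equal to a regular set with the same elements,
# regardless of the order in which the elements were given.
itemset = frozenset(['C', 'A'])

print(itemset == {'A', 'C'})   # True
print(itemset == {'A'})        # False
print(itemset >= {'A'})        # True: superset test ("contains A")
```

The superset test on the last line is the set operation to reach for when the goal is "all itemsets that contain A" rather than an exact match.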


#### -- Working with Sparse Representations

To save memory, you may want to represent your transaction data in the sparse format. This is especially useful if you have lots of products and small transactions.

In [13]:
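# Encode the transactions as a SciPy sparse matrix instead of a dense array
oht_ary = te.fit(dataset).transform(dataset, sparse=True)

# pd.SparseDataFrame was removed in pandas 1.0; build the sparse DataFrame from the matrix
sparse_df = pd.DataFrame.sparse.from_spmatrix(oht_ary, columns=te.columns_)
sparse_df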


|   | A | B | C | D | E | F |
|---|---|---|---|---|---|---|
| 0 | True | True | True | False | False | False |
| 1 | True | False | True | False | False | False |
| 2 | True | False | False | True | False | False |
| 3 | False | True | False | False | True | True |


In [14]:
apriori(sparse_df, min_support=0.5, use_colnames=True)

|   | support | itemsets |
|---|---|---|
| 0 | 0.75 | (A) |
| 1 | 0.5 | (B) |
| 2 | 0.5 | (C) |
| 3 | 0.5 | (C, A) |
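
To get a rough feel for why a sparse layout can save memory, the sketch below stores only the positions of the True cells as (row, column) pairs. This is a simplified coordinate-list illustration under our own naming (`coords`, `col_index`); real sparse formats differ in their internal layout.

```python
dataset = [['A', 'B', 'C'],
           ['A', 'C'],
           ['A', 'D'],
           ['B', 'E', 'F']]

columns = sorted({item for t in dataset for item in t})
col_index = {item: j for j, item in enumerate(columns)}

# Coordinate list: one (row, column) pair per True cell.
coords = [(i, col_index[item])
          for i, t in enumerate(dataset)
          for item in sorted(set(t))]

total_cells = len(dataset) * len(columns)
print(len(coords), "stored entries vs", total_cells, "dense cells")
```

On this toy dataset the saving is small, but the gap widens as the number of distinct items grows while transactions stay short.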


#### -- Generate rules

The association_rules() function allows you to 
- (1) specify your metric of interest and 
- (2) the corresponding threshold. 

The examples below use confidence and lift as metrics; the returned DataFrame also reports leverage and conviction for each rule. 

Let's say we are interested in rules derived from the frequent itemsets only if the level of confidence is at least 60 percent (min_threshold=0.6):


In [15]:
association_rules(frequent_itemsets, metric="confidence", min_threshold=0.60)

|   | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction |
|---|---|---|---|---|---|---|---|---|---|
| 0 | (C) | (A) | 0.5 | 0.75 | 0.5 | 1.0 | 1.333333 | 0.125 | inf |
| 1 | (A) | (C) | 0.75 | 0.5 | 0.5 | 0.666667 | 1.333333 | 0.125 | 1.5 |
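
The metric columns can be recomputed from support values alone. The sketch below uses the standard textbook definitions of confidence, lift, leverage, and conviction; it is a hand calculation for illustration, not mlxtend's code, and `rule_metrics` is a hypothetical helper name.

```python
import math

dataset = [['A', 'B', 'C'],
           ['A', 'C'],
           ['A', 'D'],
           ['B', 'E', 'F']]
transactions = [set(t) for t in dataset]
n = len(transactions)


def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(set(itemset) <= t for t in transactions) / n


def rule_metrics(antecedent, consequent):
    """Metrics for the rule antecedent -> consequent."""
    supp_a = support(antecedent)
    supp_c = support(consequent)
    supp_ac = support(set(antecedent) | set(consequent))
    confidence = supp_ac / supp_a
    lift = confidence / supp_c
    leverage = supp_ac - supp_a * supp_c
    # Conviction is undefined (reported as infinity) when confidence is 1.
    conviction = math.inf if confidence == 1 else (1 - supp_c) / (1 - confidence)
    return {"confidence": confidence, "lift": lift,
            "leverage": leverage, "conviction": conviction}


print(rule_metrics({'C'}, {'A'}))
print(rule_metrics({'A'}, {'C'}))
```

Both rules share the same lift and leverage because those two metrics are symmetric in antecedent and consequent, while confidence and conviction depend on the direction of the rule.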


#### -- Rule Generation and different metric

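If you are interested in rules according to a different metric, you can simply adjust the metric and min_threshold arguments. For example, suppose you are only interested in rules that have a lift score of >= 1.2. Note that the output below lists beer, wine, and rum: it comes from frequent_itemsets rebuilt on the beverage dataset that appears commented out earlier, not on the A-F dataset.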

In [48]:
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1.2)
rules

|   | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction |
|---|---|---|---|---|---|---|---|---|---|
| 0 | (wine) | (rum) | 0.5 | 0.75 | 0.5 | 1.0 | 1.333333 | 0.125 | inf |
| 1 | (rum) | (wine) | 0.75 | 0.5 | 0.5 | 0.666667 | 1.333333 | 0.125 | 1.5 |
| 2 | (beer, wine) | (rum) | 0.5 | 0.75 | 0.5 | 1.0 | 1.333333 | 0.125 | inf |
| 3 | (beer, rum) | (wine) | 0.75 | 0.5 | 0.5 | 0.666667 | 1.333333 | 0.125 | 1.5 |
| 4 | (wine) | (beer, rum) | 0.5 | 0.75 | 0.5 | 1.0 | 1.333333 | 0.125 | inf |
| 5 | (rum) | (beer, wine) | 0.75 | 0.5 | 0.5 | 0.666667 | 1.333333 | 0.125 | 1.5 |


Pandas DataFrames make it easy to filter the results further. Let's say we are only interested in rules that satisfy the following criteria:

- at least 2 antecedents
- a confidence > 0.75
- a lift score > 1.2

In [49]:
rules["antecedent_len"] = rules["antecedents"].apply(lambda x: len(x))
rules

|   | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction | antecedent_len |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | (wine) | (rum) | 0.5 | 0.75 | 0.5 | 1.0 | 1.333333 | 0.125 | inf | 1 |
| 1 | (rum) | (wine) | 0.75 | 0.5 | 0.5 | 0.666667 | 1.333333 | 0.125 | 1.5 | 1 |
| 2 | (beer, wine) | (rum) | 0.5 | 0.75 | 0.5 | 1.0 | 1.333333 | 0.125 | inf | 2 |
| 3 | (beer, rum) | (wine) | 0.75 | 0.5 | 0.5 | 0.666667 | 1.333333 | 0.125 | 1.5 | 2 |
| 4 | (wine) | (beer, rum) | 0.5 | 0.75 | 0.5 | 1.0 | 1.333333 | 0.125 | inf | 1 |
| 5 | (rum) | (beer, wine) | 0.75 | 0.5 | 0.5 | 0.666667 | 1.333333 | 0.125 | 1.5 | 1 |


In [50]:
rules[ (rules['antecedent_len'] >= 2) &
       (rules['confidence'] > 0.75) &
       (rules['lift'] > 1.2) ]

|   | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction | antecedent_len |
|---|---|---|---|---|---|---|---|---|---|---|
| 2 | (beer, wine) | (rum) | 0.5 | 0.75 | 0.5 | 1.0 | 1.333333 | 0.125 | inf | 2 |
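
As a cross-check, the filtered result can be reproduced in plain Python. The sketch assumes the rules were mined from the beer/wine/rum/vodka transactions (the item names in the rule tables) with a minimum support of 0.5. It enumerates itemsets and rules by brute force, so it illustrates the filtering logic rather than how mlxtend generates rules.

```python
from itertools import combinations

dataset = [['beer', 'wine', 'rum'],
           ['beer', 'rum', 'vodka'],
           ['beer', 'vodka'],
           ['beer', 'wine', 'rum']]
transactions = [set(t) for t in dataset]
n = len(transactions)


def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(set(itemset) <= t for t in transactions) / n


# Frequent itemsets with support >= 0.5 (brute-force enumeration).
items = sorted(set().union(*transactions))
frequent = [frozenset(c)
            for size in range(1, len(items) + 1)
            for c in combinations(items, size)
            if support(c) >= 0.5]

# Every split of a frequent itemset into antecedent -> consequent,
# keeping rules with lift >= 1.2.
rules = []
for itemset in frequent:
    for k in range(1, len(itemset)):
        for combo in combinations(sorted(itemset), k):
            antecedent = frozenset(combo)
            consequent = itemset - antecedent
            confidence = support(itemset) / support(antecedent)
            lift = confidence / support(consequent)
            if lift >= 1.2:
                rules.append((sorted(antecedent), sorted(consequent), confidence, lift))

# Apply the three criteria from the text.
selected = [r for r in rules
            if len(r[0]) >= 2 and r[2] > 0.75 and r[3] > 1.2]

print(len(rules), "rules with lift >= 1.2")
print(selected)
```

Only the rule (beer, wine) -> (rum) passes all three criteria; (beer, rum) -> (wine) has two antecedents but its confidence of roughly 0.67 falls below the 0.75 cutoff.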
