# Finding frequent itemsets and association rules

### Library: mlxtend
We will use the mlxtend library, which contains an implementation of the Apriori algorithm; see its <a href="http://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/">documentation on association rules</a>. We recommend installing it with

pip install mlxtend

or

conda install mlxtend

These commands sometimes fail to install the library into the right environment, especially when several Python versions are installed alongside your Jupyter notebook. In that case, we recommend running the command shown in the next cell directly in your Jupyter notebook:

In [1]:
import sys
!{sys.executable} -m pip install mlxtend
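
To confirm that the package is visible to the interpreter the notebook kernel is actually running, a quick check along the following lines can help. This is only a minimal sketch: the helper name `is_installed` is our own and is not part of mlxtend.

```python
import importlib.util
import sys


def is_installed(package_name):
    """Return True if the current interpreter can locate the package."""
    return importlib.util.find_spec(package_name) is not None


# Show which Python executable the kernel uses and whether it can see mlxtend.
print(sys.executable)
print("mlxtend available:", is_installed("mlxtend"))
```

If the second line prints `False`, the install most likely went to a different Python installation than the one shown on the first line.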



### One-hot encoding

Categorical data, where a feature may take more than two values, sometimes has to be transformed into a one-hot encoding, where each feature takes either 0 or 1 (or True/False). For example, consider the following set of baskets, each containing all the items bought during a single trip to the supermarket. The following script uses a TransactionEncoder from mlxtend to obtain a one-hot encoding of the data, which the apriori method of mlxtend requires:

In [21]:
from mlxtend.preprocessing import TransactionEncoder

dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
           ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
           ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]

type(dataset)

list

In [None]:
te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
te_ary

Observe that the number of features now equals the number of distinct items in the data. Each column corresponds to one item (e.g. "Milk"), and each entry specifies whether the basket in that row contains the item in that column.
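
To make the encoding concrete, here is a small sketch that builds the same kind of Boolean table by hand, without mlxtend. The three toy baskets are made up for illustration, and placing the columns in sorted order is an assumption of this sketch.

```python
# Toy baskets, made up for illustration only.
baskets = [
    ["Milk", "Onion", "Eggs"],
    ["Onion", "Eggs"],
    ["Milk", "Milk", "Bread"],  # a repeated item still yields a single True
]

# One column per distinct item, in sorted order.
columns = sorted({item for basket in baskets for item in basket})

# One row per basket: True where the basket contains the column's item.
rows = [[item in basket for item in columns] for basket in baskets]

print(columns)
for row in rows:
    print(row)
```

Note that one-hot encoding records only whether an item is present, so buying the same item twice in one basket makes no difference.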

### Frequent itemsets and association rules

We can now run apriori on the one-hot encoded data using the corresponding functions from mlxtend. The full code follows, starting with the construction of the one-hot encoded DataFrame:

In [4]:
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori


dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
           ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
           ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]

te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
df = pd.DataFrame(te_ary, columns=te.columns_)
df

,Apple,Corn,Dill,Eggs,Ice cream,Kidney Beans,Milk,Nutmeg,Onion,Unicorn,Yogurt
0,False,False,False,True,False,True,True,True,True,False,True
1,False,False,True,True,False,True,False,True,True,False,True
2,True,False,False,True,False,True,True,False,False,False,False
3,False,True,False,False,False,True,True,False,False,True,True
4,False,True,False,True,True,True,False,False,True,False,False


In [18]:
dataset

[['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
 ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
 ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
 ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
 ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]

### DataFrame

One convenient way to load data from an input file and preprocess it is to use a pandas DataFrame. A DataFrame is "two-dimensional, size-mutable, potentially heterogeneous tabular data." It comes with many tools to process and handle tabular data. See the <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html">documentation</a> for details.

To use a DataFrame, install the "pandas" library with any of the methods shown above.



In [5]:
frequent_itemsets = apriori(df, min_support=0.6, use_colnames=True)

frequent_itemsets


,support,itemsets
0,0.8,(Eggs)
1,1.0,(Kidney Beans)
2,0.6,(Milk)
3,0.6,(Onion)
4,0.6,(Yogurt)
5,0.8,"(Eggs, Kidney Beans)"
6,0.6,"(Eggs, Onion)"
7,0.6,"(Milk, Kidney Beans)"
8,0.6,"(Onion, Kidney Beans)"
9,0.6,"(Yogurt, Kidney Beans)"
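
The `support` column is the fraction of baskets that contain every item of the itemset. As a cross-check, the following sketch recomputes a few of these values directly from the baskets in plain Python; the helper `support` is our own name, not an mlxtend function.

```python
dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
           ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
           ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]

# Treat each basket as a set so that subset tests are straightforward.
transactions = [set(basket) for basket in dataset]


def support(itemset):
    """Fraction of transactions that contain all items of the itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


print(support({'Eggs'}))                  # 0.8
print(support({'Milk'}))                  # 0.6
print(support({'Eggs', 'Kidney Beans'}))  # 0.8
print(support({'Milk', 'Onion'}))         # 0.2, below min_support=0.6
```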


In [6]:
from mlxtend.frequent_patterns import association_rules

a=association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7)

a[["antecedents","consequents","support","confidence"]]

,antecedents,consequents,support,confidence
0,(Eggs),(Kidney Beans),0.8,1.0
1,(Kidney Beans),(Eggs),0.8,0.8
2,(Eggs),(Onion),0.6,0.75
3,(Onion),(Eggs),0.6,1.0
4,(Milk),(Kidney Beans),0.6,1.0
5,(Onion),(Kidney Beans),0.6,1.0
6,(Yogurt),(Kidney Beans),0.6,1.0
7,"(Eggs, Onion)",(Kidney Beans),0.6,1.0
8,"(Eggs, Kidney Beans)",(Onion),0.6,0.75
9,"(Onion, Kidney Beans)",(Eggs),0.6,1.0
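
The confidence of a rule A → B is the support of A ∪ B divided by the support of A, i.e. the fraction of baskets containing A that also contain B. The sketch below recomputes a few confidence values from the table in plain Python; the helper names `count` and `confidence` are ours and do not come from mlxtend.

```python
dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
           ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
           ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]

transactions = [set(basket) for basket in dataset]


def count(itemset):
    """Number of transactions that contain all items of the itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)


def confidence(antecedent, consequent):
    """count(antecedent and consequent together) / count(antecedent)."""
    both = set(antecedent) | set(consequent)
    return count(both) / count(antecedent)


print(confidence({'Eggs'}, {'Onion'}))         # 0.75
print(confidence({'Onion'}, {'Eggs'}))         # 1.0
print(confidence({'Kidney Beans'}, {'Eggs'}))  # 0.8
```

The first two lines show that confidence is not symmetric: every basket with Onion contains Eggs, but only three of the four baskets with Eggs contain Onion.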


In [7]:
import pandas as pd

#read order_data.csv and create a DataFrame with that content
data = pd.read_csv(r"order_data.csv",delimiter=" ",header=None)
data

,0,1,2,3,4,5,6,7,8
0,toothpaste,brush,milk,cereals,honey,bread,butter,cheese,yogurt
1,milk,cereals,honey,bread,cheese,razor,gel,shampoo,
2,milk,cereals,honey,cheese,soap,shampoo,,,
3,honey,bread,butter,cheese,mouthwash,toothpaste,,,
4,cereals,honey,bread,butter,gel,soap,,,
5,cheesse,yogurt,milk,cereals,honey,shampoo,gel,,
6,honey,bread,cheese,razor,butter,yogurt,,,
7,honey,bread,cheese,butter,milk,,,,
8,cereals,butter,cookies,chips,,,,,
9,cerals,cheese,yogurt,cookies,chips,,,,


In [8]:
#we run apriori on the order_data.csv file

import math
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules


data = pd.read_csv(r"order_data.csv",delimiter=" ",header=None)

#preprocessing: change to one hot encoding so as to be able to use apriori from mlxtend
d=data.values.tolist()

#removing nan values (short rows are padded with NaN by read_csv)
d = [[item for item in row
      if not (isinstance(item, float) and math.isnan(item))]
     for row in d]

te = TransactionEncoder()
te_ary = te.fit(d).transform(d)
df = pd.DataFrame(te_ary, columns=te.columns_)

#computing frequent itemsets and association rules
frequent_itemsets = apriori(df, min_support=0.2, use_colnames=True)

a=association_rules(frequent_itemsets, metric="confidence", min_threshold=0.6)

#visualizing association rules results
a[["antecedents","consequents","support","confidence"]]

,antecedents,consequents,support,confidence
0,(butter),(bread),0.35,0.875000
1,(bread),(butter),0.35,0.636364
2,(cheese),(bread),0.35,0.700000
3,(bread),(cheese),0.35,0.636364
4,(honey),(bread),0.30,0.666667
...,...,...,...,...
99,"(cereals, shampoo)","(milk, honey)",0.20,0.800000
100,"(cereals, honey)","(milk, shampoo)",0.20,0.666667
101,"(milk, shampoo)","(cereals, honey)",0.20,0.800000
102,"(milk, honey)","(cereals, shampoo)",0.20,0.666667


In [9]:
type(a["antecedents"][99]) #prints frozenset

frozenset

In [10]:
#a frozenset is very similar to a Python set, the main difference being that it cannot be modified,
#so we can check whether a rule has shampoo in its antecedent and honey in its consequent as follows
i=99
if ("shampoo" in a["antecedents"][i] and "honey" in a["consequents"][i] ):
    print ("The "+str(i)+" rule talks about shampoo and honey:" )
else:
    print ("The "+str(i)+"th rule does not talk about shampoo and honey:" )

The 99 rule talks about shampoo and honey:
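
The following sketch illustrates the frozenset operations that are useful when inspecting rules: membership and subset tests work as for a set, while modification is not possible. The four rules are copied by hand from the last rows of the table above as (antecedent, consequent) pairs; the list `rules` is our own stand-in for the DataFrame.

```python
# The last four rules of the table above, written as (antecedent, consequent).
rules = [
    (frozenset({"cereals", "shampoo"}), frozenset({"milk", "honey"})),
    (frozenset({"cereals", "honey"}), frozenset({"milk", "shampoo"})),
    (frozenset({"milk", "shampoo"}), frozenset({"cereals", "honey"})),
    (frozenset({"milk", "honey"}), frozenset({"cereals", "shampoo"})),
]

antecedent, consequent = rules[0]

# Membership and subset tests behave exactly as for a regular set.
print("shampoo" in antecedent)       # True
print({"milk"} <= consequent)        # True

# A frozenset has no add/remove methods, so it cannot be changed in place.
print(hasattr(antecedent, "add"))    # False

# Positions of the rules with shampoo on the left and honey on the right.
matching = [i for i, (a, c) in enumerate(rules)
            if "shampoo" in a and "honey" in c]
print(matching)                      # [0, 2]
```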


In [13]:
import math
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules


data = pd.read_csv(r"mammographic_masses.csv",delimiter=",",header=0)

#preprocessing, first step: convert the DataFrame to a list of rows (one list per record)
d=data.values.tolist()

d  

[['5', '67', '3', '5', '3', 1],
 ['4', '43', '1', '1', '?', 1],
 ['5', '58', '4', '5', '3', 1],
 ['4', '28', '1', '1', '3', 0],
 ['5', '74', '1', '5', '?', 1],
 ['4', '65', '1', '?', '3', 0],
 ['4', '70', '?', '?', '3', 0],
 ['5', '42', '1', '?', '3', 0],
 ['5', '57', '1', '5', '3', 1],
 ['5', '60', '?', '5', '1', 1],
 ['5', '76', '1', '4', '3', 1],
 ['3', '42', '2', '1', '3', 1],
 ['4', '64', '1', '?', '3', 0],
 ['4', '36', '3', '1', '2', 0],
 ['4', '60', '2', '1', '2', 0],
 ['4', '54', '1', '1', '3', 0],
 ['3', '52', '3', '4', '3', 0],
 ['4', '59', '2', '1', '3', 1],
 ['4', '54', '1', '1', '3', 1],
 ['4', '40', '1', '?', '?', 0],
 ['?', '66', '?', '?', '1', 1],
 ['5', '56', '4', '3', '1', 1],
 ['4', '43', '1', '?', '?', 0],
 ['5', '42', '4', '4', '3', 1],
 ['4', '59', '2', '4', '3', 1],
 ['5', '75', '4', '5', '3', 1],
 ['2', '66', '1', '1', '?', 0],
 ['5', '63', '3', '?', '3', 0],
 ['5', '45', '4', '5', '3', 1],
 ['5', '55', '4', '4', '3', 0],
 ['4', '46', '1', '5', '2', 0],
 ['5', '

In [14]:
#removing nan values
#note: in the rows shown here missing values appear as '?' rather than NaN,
#so this step leaves them in place
d = [[item for item in row
      if not (isinstance(item, float) and math.isnan(item))]
     for row in d]

d   

[['5', '67', '3', '5', '3', 1],
 ['4', '43', '1', '1', '?', 1],
 ['5', '58', '4', '5', '3', 1],
 ['4', '28', '1', '1', '3', 0],
 ['5', '74', '1', '5', '?', 1],
 ['4', '65', '1', '?', '3', 0],
 ['4', '70', '?', '?', '3', 0],
 ['5', '42', '1', '?', '3', 0],
 ['5', '57', '1', '5', '3', 1],
 ['5', '60', '?', '5', '1', 1],
 ['5', '76', '1', '4', '3', 1],
 ['3', '42', '2', '1', '3', 1],
 ['4', '64', '1', '?', '3', 0],
 ['4', '36', '3', '1', '2', 0],
 ['4', '60', '2', '1', '2', 0],
 ['4', '54', '1', '1', '3', 0],
 ['3', '52', '3', '4', '3', 0],
 ['4', '59', '2', '1', '3', 1],
 ['4', '54', '1', '1', '3', 1],
 ['4', '40', '1', '?', '?', 0],
 ['?', '66', '?', '?', '1', 1],
 ['5', '56', '4', '3', '1', 1],
 ['4', '43', '1', '?', '?', 0],
 ['5', '42', '4', '4', '3', 1],
 ['4', '59', '2', '4', '3', 1],
 ['5', '75', '4', '5', '3', 1],
 ['2', '66', '1', '1', '?', 0],
 ['5', '63', '3', '?', '3', 0],
 ['5', '45', '4', '5', '3', 1],
 ['5', '55', '4', '4', '3', 0],
 ['4', '46', '1', '5', '2', 0],
 ['5', '

In [15]:
#prefix each value with its attribute (column) name, so that every item has the form attribute=value
for i in range(len(d)):
    for j in range(len(d[i])):
        d[i][j]=data.columns[j] + "=" +str(d[i][j])

d 

[['BI-RADS=5', 'Age=67', 'Shape=3', 'Margin=5', 'Density=3', 'Severity=1'],
 ['BI-RADS=4', 'Age=43', 'Shape=1', 'Margin=1', 'Density=?', 'Severity=1'],
 ['BI-RADS=5', 'Age=58', 'Shape=4', 'Margin=5', 'Density=3', 'Severity=1'],
 ['BI-RADS=4', 'Age=28', 'Shape=1', 'Margin=1', 'Density=3', 'Severity=0'],
 ['BI-RADS=5', 'Age=74', 'Shape=1', 'Margin=5', 'Density=?', 'Severity=1'],
 ['BI-RADS=4', 'Age=65', 'Shape=1', 'Margin=?', 'Density=3', 'Severity=0'],
 ['BI-RADS=4', 'Age=70', 'Shape=?', 'Margin=?', 'Density=3', 'Severity=0'],
 ['BI-RADS=5', 'Age=42', 'Shape=1', 'Margin=?', 'Density=3', 'Severity=0'],
 ['BI-RADS=5', 'Age=57', 'Shape=1', 'Margin=5', 'Density=3', 'Severity=1'],
 ['BI-RADS=5', 'Age=60', 'Shape=?', 'Margin=5', 'Density=1', 'Severity=1'],
 ['BI-RADS=5', 'Age=76', 'Shape=1', 'Margin=4', 'Density=3', 'Severity=1'],
 ['BI-RADS=3', 'Age=42', 'Shape=2', 'Margin=1', 'Density=3', 'Severity=1'],
 ['BI-RADS=4', 'Age=64', 'Shape=1', 'Margin=?', 'Density=3', 'Severity=0'],
 ['BI-RADS=4
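
A short sketch of why the attribute prefix matters: without it, equal values from different columns would collapse into the same item. The column names and the record below are taken from the first row of the output above; the variable names are ours.

```python
columns = ["BI-RADS", "Age", "Shape", "Margin", "Density", "Severity"]
row = ["5", "67", "3", "5", "3", 1]  # first record of the dataset shown above

# Without a prefix, BI-RADS=5 and Margin=5 become the same item '5',
# and Shape=3 and Density=3 become the same item '3'.
bare_items = {str(value) for value in row}

# With the prefix, every attribute/value pair is a distinct item.
labeled_items = {f"{column}={value}" for column, value in zip(columns, row)}

print(sorted(bare_items))     # ['1', '3', '5', '67']
print(sorted(labeled_items))
```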

In [17]:
te = TransactionEncoder()
te_ary = te.fit(d).transform(d)

df = pd.DataFrame(te_ary, columns=te.columns_)
#this part needs to be filled...

,Age=18,Age=19,Age=20,Age=21,Age=22,Age=23,Age=24,Age=25,Age=26,Age=27,...,Margin=4,Margin=5,Margin=?,Severity=0,Severity=1,Shape=1,Shape=2,Shape=3,Shape=4,Shape=?
0,False,False,False,False,False,False,False,False,False,False,...,False,True,False,False,True,False,False,True,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,True,True,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,True,False,False,True,False,False,False,True,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,True,False,True,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,True,False,False,True,True,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
956,False,False,False,False,False,False,False,False,False,False,...,False,False,False,True,False,False,True,False,False,False
957,False,False,False,False,False,False,False,False,False,False,...,False,True,False,False,True,False,False,False,True,False
958,False,False,False,False,False,False,False,False,False,False,...,False,True,False,True,False,False,False,False,True,False
959,False,False,False,False,False,False,False,False,False,False,...,False,True,False,False,True,False,False,False,True,False
