## Finding Frequent Itemsets & Association Rule Mining

mlxtend is a Python library for data science tasks. Its documentation is available at this [link](http://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/#overview). It can be installed using

`pip install mlxtend` or `conda install -c conda-forge mlxtend`

These commands sometimes fail to install the library into the environment your notebook actually uses, especially when several Python versions are installed on the machine. In that case, it is recommended to run the command shown in the next cell directly in your Jupyter notebook, since it installs mlxtend with the same interpreter that runs the notebook:

In [47]:
#!pip install mlxtend
#import sys
#!{sys.executable} -m pip install mlxtend
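
To confirm which interpreter the notebook is running and whether a package is visible to it, a small check like the following can help. This is a minimal sketch; the helper name `is_installed` is an illustrative assumption, not part of mlxtend.

```python
import importlib.util
import sys


def is_installed(package_name: str) -> bool:
    """Return True if the current interpreter can locate the package."""
    return importlib.util.find_spec(package_name) is not None


# The path printed here is the interpreter that `{sys.executable} -m pip` targets.
print("Interpreter:", sys.executable)
print("mlxtend importable:", is_installed("mlxtend"))
```

If the second line prints `False`, the package was most likely installed for a different Python installation.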

### One-hot encoding

Categorical features that can take more than two values often need to be converted to a one-hot encoding, in which each column takes either 0 or 1 (or True/False). For example, consider the following set of baskets, each containing all the items bought during a single trip to the supermarket. The script below uses `TransactionEncoder` from mlxtend to produce the one-hot encoding that mlxtend's `apriori` function requires:

### DataFrame

A convenient way to load data from an input file and preprocess it is a pandas DataFrame. A DataFrame is "two-dimensional, size-mutable, potentially heterogeneous tabular data", and it comes with many tools for processing tabular data. See the [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) for details.

To use a DataFrame, install the `pandas` library with any of the methods shown above.

In [48]:
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder

dataset = [
    ["Milk", "Onion", "Nutmeg", "Kidney Beans", "Eggs", "Yogurt"],
    ["Dill", "Onion", "Nutmeg", "Kidney Beans", "Eggs", "Yogurt"],
    ["Milk", "Apple", "Kidney Beans", "Eggs"],
    ["Milk", "Unicorn", "Corn", "Kidney Beans", "Yogurt"],
    ["Corn", "Onion", "Onion", "Kidney Beans", "Ice cream", "Eggs"],
]

te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
df = pd.DataFrame(te_ary, columns=te.columns_)
df

Unnamed: 0,Apple,Corn,Dill,Eggs,Ice cream,Kidney Beans,Milk,Nutmeg,Onion,Unicorn,Yogurt
0,False,False,False,True,False,True,True,True,True,False,True
1,False,False,True,True,False,True,False,True,True,False,True
2,True,False,False,True,False,True,True,False,False,False,False
3,False,True,False,False,False,True,True,False,False,True,True
4,False,True,False,True,True,True,False,False,True,False,False


Observe that the number of features now equals the number of distinct items in the data. Each column corresponds to an item (e.g. "Milk"), and each entry specifies whether the corresponding basket contains that item.
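
To make the encoding concrete, the same table can be rebuilt with plain Python. This is a minimal sketch of the idea, not mlxtend's implementation; it assumes the columns are the distinct items in alphabetical order, which matches the table above.

```python
dataset = [
    ["Milk", "Onion", "Nutmeg", "Kidney Beans", "Eggs", "Yogurt"],
    ["Dill", "Onion", "Nutmeg", "Kidney Beans", "Eggs", "Yogurt"],
    ["Milk", "Apple", "Kidney Beans", "Eggs"],
    ["Milk", "Unicorn", "Corn", "Kidney Beans", "Yogurt"],
    ["Corn", "Onion", "Onion", "Kidney Beans", "Ice cream", "Eggs"],
]

# One column per distinct item, sorted alphabetically.
columns = sorted({item for basket in dataset for item in basket})

# One boolean row per basket: True where the basket contains the item.
# Duplicates inside a basket (the repeated "Onion") collapse to a single True.
rows = [[item in set(basket) for item in columns] for basket in dataset]

print(columns)
for row in rows:
    print(row)
```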

### Frequent itemsets and association rules

With the data one-hot encoded, we can finally run Apriori using the corresponding functions from mlxtend. The full code is shown below:

In [49]:
from mlxtend.frequent_patterns import apriori

# Uncomment the following line to use FP-growth algorithm
# from mlxtend.frequent_patterns import fpgrowth

frequent_itemsets = apriori(df, min_support=0.6, use_colnames=True)
# Uncomment the following line to use FP-growth algorithm
# frequent_itemsets = fpgrowth(df, min_support=0.6, use_colnames=True)
frequent_itemsets["length"] = frequent_itemsets["itemsets"].apply(lambda x: len(x))
frequent_itemsets

Unnamed: 0,support,itemsets,length
0,0.8,(Eggs),1
1,1.0,(Kidney Beans),1
2,0.6,(Milk),1
3,0.6,(Onion),1
4,0.6,(Yogurt),1
5,0.8,"(Eggs, Kidney Beans)",2
6,0.6,"(Eggs, Onion)",2
7,0.6,"(Milk, Kidney Beans)",2
8,0.6,"(Onion, Kidney Beans)",2
9,0.6,"(Yogurt, Kidney Beans)",2
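
The support values can be checked by brute force on such a small dataset. The sketch below enumerates every candidate itemset up to three items and counts the baskets containing it; it is an exhaustive check written for illustration, not the Apriori algorithm, and the function names are assumptions of this example.

```python
from itertools import combinations

dataset = [
    ["Milk", "Onion", "Nutmeg", "Kidney Beans", "Eggs", "Yogurt"],
    ["Dill", "Onion", "Nutmeg", "Kidney Beans", "Eggs", "Yogurt"],
    ["Milk", "Apple", "Kidney Beans", "Eggs"],
    ["Milk", "Unicorn", "Corn", "Kidney Beans", "Yogurt"],
    ["Corn", "Onion", "Onion", "Kidney Beans", "Ice cream", "Eggs"],
]
baskets = [set(basket) for basket in dataset]


def support(itemset):
    """Fraction of baskets that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def brute_force_frequent_itemsets(min_support, max_len=3):
    """Test every combination of up to max_len items against min_support."""
    items = sorted(set().union(*baskets))
    found = {}
    for size in range(1, max_len + 1):
        for candidate in combinations(items, size):
            value = support(candidate)
            if value >= min_support:
                found[frozenset(candidate)] = value
    return found


result = brute_force_frequent_itemsets(0.6)
for itemset, value in result.items():
    print(sorted(itemset), value)
```

Besides the one- and two-item sets listed above, this check also reports the three-item set {Eggs, Kidney Beans, Onion} with support 0.6, which is the itemset behind the three-item rules derived in the next cell.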


In [50]:
from mlxtend.frequent_patterns import association_rules

a = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7)

a[["antecedents", "consequents", "support", "confidence"]]

Unnamed: 0,antecedents,consequents,support,confidence
0,(Eggs),(Kidney Beans),0.8,1.0
1,(Kidney Beans),(Eggs),0.8,0.8
2,(Eggs),(Onion),0.6,0.75
3,(Onion),(Eggs),0.6,1.0
4,(Milk),(Kidney Beans),0.6,1.0
5,(Onion),(Kidney Beans),0.6,1.0
6,(Yogurt),(Kidney Beans),0.6,1.0
7,"(Eggs, Onion)",(Kidney Beans),0.6,1.0
8,"(Eggs, Kidney Beans)",(Onion),0.6,0.75
9,"(Onion, Kidney Beans)",(Eggs),0.6,1.0
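
The confidence of a rule A ⇒ C is the support of A ∪ C divided by the support of A, i.e. the fraction of baskets containing A that also contain C. A minimal sketch on the same baskets, using raw counts to avoid floating-point noise (the function names are illustrative):

```python
baskets = [
    {"Milk", "Onion", "Nutmeg", "Kidney Beans", "Eggs", "Yogurt"},
    {"Dill", "Onion", "Nutmeg", "Kidney Beans", "Eggs", "Yogurt"},
    {"Milk", "Apple", "Kidney Beans", "Eggs"},
    {"Milk", "Unicorn", "Corn", "Kidney Beans", "Yogurt"},
    {"Corn", "Onion", "Kidney Beans", "Ice cream", "Eggs"},
]


def count(itemset):
    """Number of baskets that contain every item of the itemset."""
    return sum(set(itemset) <= basket for basket in baskets)


def confidence(antecedent, consequent):
    """count(A and C together) / count(A)."""
    return count(set(antecedent) | set(consequent)) / count(antecedent)


print(confidence({"Eggs"}, {"Onion"}))          # 3 of the 4 Eggs baskets -> 0.75
print(confidence({"Onion"}, {"Eggs"}))          # 3 of the 3 Onion baskets -> 1.0
print(confidence({"Kidney Beans"}, {"Eggs"}))   # 4 of the 5 baskets -> 0.8
```

Note that confidence is not symmetric: (Eggs) ⇒ (Onion) and (Onion) ⇒ (Eggs) share the same support but have different confidence.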


## Selecting and Filtering Results

In [51]:
frequent_itemsets[(frequent_itemsets["length"] == 2) & (frequent_itemsets["support"] >= 0.8)]

Unnamed: 0,support,itemsets,length
5,0.8,"(Eggs, Kidney Beans)",2


In [52]:
frequent_itemsets[frequent_itemsets["itemsets"] == {"Onion", "Eggs"}]

Unnamed: 0,support,itemsets,length
6,0.6,"(Eggs, Onion)",2
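
The comparison above works because the `itemsets` column stores `frozenset` objects, and a `frozenset` compares equal to a regular `set` with the same elements regardless of the order in which they are written. A short illustration in plain Python:

```python
stored = frozenset(["Eggs", "Onion"])

# Equality ignores element order and the set/frozenset distinction.
same_items = stored == {"Onion", "Eggs"}

# A strict subset is not equal, but the subset operator can test containment.
only_onion_equal = stored == {"Onion"}
onion_contained = {"Onion"} <= stored

print(same_items, only_onion_equal, onion_contained)  # True False True
```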


In [53]:
association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,(Eggs),(Kidney Beans),0.8,1.0,0.8,1.0,1.0,0.0,inf,0.0
1,(Kidney Beans),(Eggs),1.0,0.8,0.8,0.8,1.0,0.0,1.0,0.0
2,(Eggs),(Onion),0.8,0.6,0.6,0.75,1.25,0.12,1.6,1.0
3,(Onion),(Eggs),0.6,0.8,0.6,1.0,1.25,0.12,inf,0.5
4,(Milk),(Kidney Beans),0.6,1.0,0.6,1.0,1.0,0.0,inf,0.0
5,(Onion),(Kidney Beans),0.6,1.0,0.6,1.0,1.0,0.0,inf,0.0
6,(Yogurt),(Kidney Beans),0.6,1.0,0.6,1.0,1.0,0.0,inf,0.0
7,"(Eggs, Onion)",(Kidney Beans),0.6,1.0,0.6,1.0,1.0,0.0,inf,0.0
8,"(Eggs, Kidney Beans)",(Onion),0.8,0.6,0.6,0.75,1.25,0.12,1.6,1.0
9,"(Onion, Kidney Beans)",(Eggs),0.6,0.8,0.6,1.0,1.25,0.12,inf,0.5
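
The additional columns follow directly from the supports of the antecedent, the consequent, and their union. The sketch below recomputes confidence, lift, leverage, and conviction from raw basket counts using the usual textbook definitions; it is an illustration rather than mlxtend's code, and `rule_metrics` is a hypothetical helper name.

```python
def rule_metrics(n_total, n_antecedent, n_consequent, n_both):
    """Compute rule metrics from basket counts."""
    supp_a = n_antecedent / n_total
    supp_c = n_consequent / n_total
    supp_ac = n_both / n_total
    confidence = n_both / n_antecedent
    # Lift: how much more often A and C co-occur than if they were independent.
    lift = confidence / supp_c
    # Leverage: difference between observed and expected joint support.
    leverage = supp_ac - supp_a * supp_c
    # Conviction is reported as infinite when the rule never fails.
    if confidence == 1:
        conviction = float("inf")
    else:
        conviction = (1 - supp_c) / (1 - confidence)
    return {
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
    }


# (Eggs) => (Onion): 5 baskets, 4 with Eggs, 3 with Onion, 3 with both.
eggs_onion = rule_metrics(5, 4, 3, 3)
# (Onion) => (Eggs): the same counts with the roles swapped.
onion_eggs = rule_metrics(5, 3, 4, 3)

print({name: round(value, 2) for name, value in eggs_onion.items()})
print(onion_eggs["conviction"])
```

Up to floating-point rounding, the values agree with rows 2 and 3 of the table above. A lift of 1.0, as in every rule with (Kidney Beans) as consequent, means the antecedent does not change how likely the consequent is.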


In [54]:
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1.2)
rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,(Eggs),(Onion),0.8,0.6,0.6,0.75,1.25,0.12,1.6,1.0
1,(Onion),(Eggs),0.6,0.8,0.6,1.0,1.25,0.12,inf,0.5
2,"(Eggs, Kidney Beans)",(Onion),0.8,0.6,0.6,0.75,1.25,0.12,1.6,1.0
3,"(Onion, Kidney Beans)",(Eggs),0.6,0.8,0.6,1.0,1.25,0.12,inf,0.5
4,(Eggs),"(Onion, Kidney Beans)",0.8,0.6,0.6,0.75,1.25,0.12,1.6,1.0
5,(Onion),"(Eggs, Kidney Beans)",0.6,0.8,0.6,1.0,1.25,0.12,inf,0.5


In [55]:
rules["antecedent_len"] = rules["antecedents"].apply(lambda x: len(x))
rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric,antecedent_len
0,(Eggs),(Onion),0.8,0.6,0.6,0.75,1.25,0.12,1.6,1.0,1
1,(Onion),(Eggs),0.6,0.8,0.6,1.0,1.25,0.12,inf,0.5,1
2,"(Eggs, Kidney Beans)",(Onion),0.8,0.6,0.6,0.75,1.25,0.12,1.6,1.0,2
3,"(Onion, Kidney Beans)",(Eggs),0.6,0.8,0.6,1.0,1.25,0.12,inf,0.5,2
4,(Eggs),"(Onion, Kidney Beans)",0.8,0.6,0.6,0.75,1.25,0.12,1.6,1.0,1
5,(Onion),"(Eggs, Kidney Beans)",0.6,0.8,0.6,1.0,1.25,0.12,inf,0.5,1


## Using a CSV file

In [56]:
import pandas as pd

# read order_data.csv and create a DataFrame with that content
data = pd.read_csv(r"order_data.csv", delimiter=" ", header=None)
data

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,toothpaste,brush,milk,cereals,honey,bread,butter,cheese,yogurt
1,milk,cereals,honey,bread,cheese,razor,gel,shampoo,
2,milk,cereals,honey,cheese,soap,shampoo,,,
3,honey,bread,butter,cheese,mouthwash,toothpaste,,,
4,cereals,honey,bread,butter,gel,soap,,,
5,cheesse,yogurt,milk,cereals,honey,shampoo,gel,,
6,honey,bread,cheese,razor,butter,yogurt,,,
7,honey,bread,cheese,butter,milk,,,,
8,cereals,butter,cookies,chips,,,,,
9,cerals,cheese,yogurt,cookies,chips,,,,
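
The empty cells in the table above are the NaN values pandas uses to pad shorter rows to a common width. As an alternative sketch, ragged transactions can be read with the standard `csv` module, which never introduces the padding. The sample text below copies three rows from the output above and assumes, like the `read_csv` call, that items are separated by single spaces.

```python
import csv
import io

# Three rows copied from the output above, standing in for order_data.csv.
sample = (
    "milk cereals honey bread cheese razor gel shampoo\n"
    "milk cereals honey cheese soap shampoo\n"
    "cereals butter cookies chips\n"
)

transactions = []
for row in csv.reader(io.StringIO(sample), delimiter=" "):
    items = [item for item in row if item]  # drop empty strings from stray spaces
    if items:
        transactions.append(items)

print([len(t) for t in transactions])  # [8, 6, 4]
```

The resulting list of lists can be passed to `TransactionEncoder` in the same way as the cleaned list `d` in the next cell.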


In [57]:
# we run apriori on the order_data.csv file

import math
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules


data = pd.read_csv(r"order_data.csv", delimiter=" ", header=None)

# preprocessing: change to one hot encoding so as to be able to use apriori from mlxtend
d = data.values.tolist()

# removing nan values (read_csv pads shorter rows with NaN)
d = [
    [item for item in row if not (isinstance(item, float) and math.isnan(item))]
    for row in d
]

te = TransactionEncoder()
te_ary = te.fit(d).transform(d)
df = pd.DataFrame(te_ary, columns=te.columns_)

# computing frequent itemsets and association rules
frequent_itemsets = apriori(df, min_support=0.2, use_colnames=True)

a = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.6)

# visualizing association rules results
a[["antecedents", "consequents", "support", "confidence"]]

Unnamed: 0,antecedents,consequents,support,confidence
0,(bread),(butter),0.35,0.636364
1,(butter),(bread),0.35,0.875000
2,(bread),(cheese),0.35,0.636364
3,(cheese),(bread),0.35,0.700000
4,(honey),(bread),0.30,0.666667
...,...,...,...,...
99,"(honey, shampoo)","(cereals, milk)",0.20,1.000000
100,"(honey, milk)","(cereals, shampoo)",0.20,0.666667
101,"(cereals, shampoo)","(honey, milk)",0.20,0.800000
102,"(cereals, milk)","(honey, shampoo)",0.20,0.666667


In [58]:
type(a["antecedents"][99])  # prints frozenset

frozenset

In [59]:
i = 99
if "shampoo" in a["antecedents"][i] and "honey" in a["consequents"][i]:
    print(f"Rule {i} has shampoo in its antecedent and honey in its consequent.")
else:
    print(f"Rule {i} does not have shampoo in its antecedent and honey in its consequent.")

Rule 99 does not have shampoo in its antecedent and honey in its consequent.
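
The same membership test can be applied to every rule at once. The sketch below uses the four rules visible at the end of the output above, stored as plain (antecedent, consequent) pairs; the positions it reports are indices into this small sample list, not the DataFrame indices 99 to 102.

```python
sample_rules = [
    (frozenset({"honey", "shampoo"}), frozenset({"cereals", "milk"})),
    (frozenset({"honey", "milk"}), frozenset({"cereals", "shampoo"})),
    (frozenset({"cereals", "shampoo"}), frozenset({"honey", "milk"})),
    (frozenset({"cereals", "milk"}), frozenset({"honey", "shampoo"})),
]

# Keep the positions of rules with shampoo on the left and honey on the right.
matches = [
    position
    for position, (antecedent, consequent) in enumerate(sample_rules)
    if "shampoo" in antecedent and "honey" in consequent
]

print(matches)  # [2]
```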


In [60]:
import math
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, fpgrowth
from mlxtend.frequent_patterns import association_rules


data = pd.read_csv(r"mammographic_masses.csv", delimiter=",", header=0)

# preprocessing: turn every row into a list of "column=value" items so that
# it can be one-hot encoded for apriori
d = [
    [
        f"{column}={value}"
        for column, value in zip(data.columns, row)
        # skip missing (NaN) values; labelling before dropping keeps every
        # value paired with its own column name
        if not (isinstance(value, float) and math.isnan(value))
    ]
    for row in data.values.tolist()
]


te = TransactionEncoder()
te_ary = te.fit(d).transform(d)

df = pd.DataFrame(te_ary, columns=te.columns_)
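
When items are labelled by position, each row must still line up with the column names at the moment the label is attached. The sketch below shows, on a hypothetical row with a missing Age, how dropping missing values before labelling would shift every later value onto the wrong column. The column order is an assumption made for this example.

```python
import math

# Assumed column order and a made-up row whose Age is missing.
columns = ["BI-RADS", "Age", "Shape", "Margin", "Density", "Severity"]
row = ["4", float("nan"), "1", "1", "3", 0]


def is_missing(value):
    return isinstance(value, float) and math.isnan(value)


# Dropping first and labelling by position afterwards misaligns the labels.
dropped = [value for value in row if not is_missing(value)]
misaligned = [f"{columns[j]}={value}" for j, value in enumerate(dropped)]

# Labelling while the row is still aligned, then skipping missing values.
aligned = [f"{column}={value}" for column, value in zip(columns, row) if not is_missing(value)]

print(misaligned)
print(aligned)
```

In the misaligned version the Shape value is reported as an Age, which would silently corrupt any rule mined from it.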

#### 1. Report 3 rules with support at least 0.2 and confidence at least 0.9. Specify for each of them the support and the confidence.

In [61]:
frequent_itemsets = apriori(df, min_support=0.2, use_colnames=True)
# frequent_itemsets = fpgrowth(df, min_support=0.2, use_colnames=True)

frequent_itemsets["length"] = frequent_itemsets["itemsets"].apply(lambda x: len(x))
ar = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.9).sort_values(by=["confidence"], ascending=False)
ar[['antecedents', 'consequents', 'support', 'confidence']]

Unnamed: 0,antecedents,consequents,support,confidence
7,"(Density=3, Margin=1, Severity=0)",(BI-RADS=4),0.238293,0.927126
10,"(Shape=4, Density=3, BI-RADS=5)",(Severity=1),0.224766,0.915254
9,"(Shape=4, Severity=1, BI-RADS=5)",(Density=3),0.224766,0.911392
2,"(Margin=1, Severity=0)",(BI-RADS=4),0.299688,0.911392
3,"(Margin=1, BI-RADS=4)",(Severity=0),0.299688,0.911392
5,"(Shape=4, BI-RADS=5)",(Severity=1),0.246618,0.908046
8,"(Margin=1, Density=3, BI-RADS=4)",(Severity=0),0.238293,0.905138
4,"(Shape=4, BI-RADS=5)",(Density=3),0.245578,0.904215
6,"(Shape=4, Severity=1)",(Density=3),0.295525,0.901587
1,"(Margin=1, Density=3)",(BI-RADS=4),0.263267,0.900356


#### 2. This task consists of finding attributes and values that help determine whether a given instance is benign (Severity=0) or malignant (Severity=1).

In [62]:
frequent_itemsets = apriori(df, min_support=0.1, use_colnames=True)
# frequent_itemsets = fpgrowth(df, min_support=0.1, use_colnames=True)

frequent_itemsets["length"] = frequent_itemsets["itemsets"].apply(lambda x: len(x))

ar = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.9)

# keep only rules whose consequent mentions Severity and whose antecedent does not mention BI-RADS
ar = ar[
    (ar["consequents"].apply(lambda x: "Severity" in str(x)))
    & (ar["antecedents"].apply(lambda x: "BI-RADS" not in str(x)))
]
ar.reset_index(drop=True, inplace=True)

ar[["antecedents", "consequents", "support", "confidence"]]

Unnamed: 0,antecedents,consequents,support,confidence
0,"(Margin=1, Shape=2)",(Severity=0),0.136316,0.903448
1,"(Shape=1, Margin=1, Density=3)",(Severity=0),0.1436,0.901961


#### 3. The BI-RADS assessment is not always accurate and it might lead to unnecessary breast biopsy. Provide one or two rules that might give some evidence that the BI-RADS assessment is not always accurate. Explain your answer.

In [63]:
frequent_itemsets = apriori(df, min_support=0.1, use_colnames=True)
# frequent_itemsets = fpgrowth(df, min_support=0.1, use_colnames=True)

frequent_itemsets["length"] = frequent_itemsets["itemsets"].apply(lambda x: len(x))

ar = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.9)

ar = ar[
    (ar["antecedents"].apply(lambda x: "BI-RADS=4" in str(x) or "BI-RADS=5" in str(x) or "BI-RADS=6" in str(x)))
    & (ar["consequents"].apply(lambda x: "Severity=0" in str(x)))
]
ar.reset_index(drop=True, inplace=True)

ar[["antecedents", "consequents", "support", "confidence"]]

Unnamed: 0,antecedents,consequents,support,confidence
0,"(Margin=1, BI-RADS=4)",(Severity=0),0.299688,0.911392
1,"(Shape=1, BI-RADS=4)",(Severity=0),0.173777,0.907609
2,"(Shape=2, BI-RADS=4)",(Severity=0),0.162331,0.901734
3,"(Margin=1, Density=3, BI-RADS=4)",(Severity=0),0.238293,0.905138
4,"(Shape=1, Density=3, BI-RADS=4)",(Severity=0),0.145682,0.915033
5,"(Shape=1, Margin=1, BI-RADS=4)",(Severity=0),0.156087,0.909091
6,"(Margin=1, Shape=2, BI-RADS=4)",(Severity=0),0.12487,0.9375
7,"(Shape=1, Margin=1, Density=3, BI-RADS=4)",(Severity=0),0.133195,0.920863


#### 4. Find the support and confidence of the following rule: Age = 35 ⇒ Severity = 0. Do you think this rule tells us something valuable, or should we ignore it because there is not enough evidence to support it?


In [64]:
frequent_itemsets = apriori(df, min_support=0.000001, use_colnames=True)
# frequent_itemsets = fpgrowth(df, min_support=0.000001, use_colnames=True)

frequent_itemsets["length"] = frequent_itemsets["itemsets"].apply(lambda x: len(x))

ar = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.09)

ar = ar[
    (ar["antecedents"].apply(lambda x: len(x) == 1 and "Age=35" in x))
    & (ar["consequents"].apply(lambda x: len(x) == 1 and "Severity=0" in x))
]
ar.reset_index(drop=True, inplace=True)

ar[["antecedents", "consequents", "support", "confidence"]]

Unnamed: 0,antecedents,consequents,support,confidence
0,(Age=35),(Severity=0),0.012487,0.923077
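
To judge how much evidence stands behind this rule, it helps to translate support and confidence back into numbers of instances. The sketch below assumes the table has 961 rows, which is consistent with the support values shown in this section; in the notebook, `len(df)` gives the actual value.

```python
n_rows = 961            # assumed number of instances; use len(df) in the notebook
support = 0.012487      # support of {Age=35, Severity=0}, from the output above
confidence = 0.923077   # confidence of Age=35 => Severity=0, from the output above

# support = n_both / n_rows and confidence = n_both / n_antecedent
n_both = round(support * n_rows)
n_antecedent = round(n_both / confidence)

print(n_both, n_antecedent)  # 12 13
```

Under this assumption, the rule is based on 13 instances with Age=35, of which 12 are benign, so the high confidence rests on a small number of cases.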


#### 5. The attribute “Age” is ordinal, which makes the rule mining approach not ideal, since every distinct age is treated as an unrelated item. In particular, one would like to obtain rules of the kind:
$$\text{Age} \geq n,\quad A_1 = a_1, \ldots, A_k = a_k \implies \text{Severity} = 1$$


In [82]:
# encode age as a binary value
age_threshold = 65  # Age becomes 1 if age >= age_threshold, 0 otherwise

df = pd.read_csv("mammographic_masses.csv")
df = df[df["Age"] != "?"].copy()  # remove rows with missing age values
df["Age"] = df["Age"].apply(lambda age: 0 if int(age) < age_threshold else 1)
df.to_csv("mammographic_masses_modified.csv", index=False)

# perform preprocessing as before
data = pd.read_csv(r"mammographic_masses_modified.csv", delimiter=",", header=0)

# preprocessing: turn every row into a list of "column=value" items so that
# it can be one-hot encoded for apriori
d = [
    [
        f"{column}={value}"
        for column, value in zip(data.columns, row)
        # skip missing (NaN) values; labelling before dropping keeps every
        # value paired with its own column name
        if not (isinstance(value, float) and math.isnan(value))
    ]
    for row in data.values.tolist()
]

te = TransactionEncoder()
te_ary = te.fit(d).transform(d)

df = pd.DataFrame(te_ary, columns=te.columns_)

frequent_itemsets = apriori(df, min_support=0.1, use_colnames=True)
#frequent_itemsets = fpgrowth(df, min_support=0.1, use_colnames=True)

frequent_itemsets["length"] = frequent_itemsets["itemsets"].apply(lambda x: len(x))

ar = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.9)

ar = ar[
    (ar["antecedents"].apply(lambda x: "Age=1" in x))
    & (ar["consequents"].apply(lambda x: len(x) == 1 and "Severity=1" in x))
]
ar.reset_index(drop=True, inplace=True)

print(f'{ar[["antecedents", "consequents", "support", "confidence"]]} \n')

                              antecedents   consequents   support  confidence
0                      (Age=1, BI-RADS=5)  (Severity=1)  0.157950    0.932099
1           (Age=1, Density=3, BI-RADS=5)  (Severity=1)  0.149582    0.947020
2             (Shape=4, Age=1, BI-RADS=5)  (Severity=1)  0.121339    0.943089
3  (Shape=4, Age=1, Density=3, BI-RADS=5)  (Severity=1)  0.117155    0.957265 
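
The binarization step can be isolated into a small function, which makes it easy to try other values of n. This is a minimal sketch with a hypothetical helper name; it assumes, as in the cell above, that ages are stored as strings and that "?" marks a missing value.

```python
def binarize_age(raw_age, threshold):
    """Return 1 if age >= threshold, 0 if lower, None if the age is missing."""
    if raw_age == "?":
        return None
    return 1 if int(raw_age) >= threshold else 0


raw_ages = ["67", "43", "?", "65", "64"]

for threshold in (50, 65):
    encoded = [binarize_age(age, threshold) for age in raw_ages]
    print(threshold, encoded)
```

With this encoding, the item `Age=1` stands for "Age ≥ n", so a rule with `Age=1` in its antecedent has the form requested in the task.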

