# **3. ASSOCIATION RULES MINING**

Association rule mining finds frequent patterns and associations among sets of items in transaction databases, relational databases, or other information repositories.

Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in a transaction.

    An association rule is an implication of the form X → Y, where X and Y are itemsets.

*Evaluation metrics*:

      Support = fraction of transactions that contain both X and Y.
      Confidence = fraction of the transactions containing X that also contain Y, i.e. support(X ∪ Y) / support(X).
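
As a minimal sketch of the two metrics above, the following plain-Python snippet computes support and confidence on a toy list of transactions. The data and the helper names `support` and `confidence` are assumptions made for illustration; they are not part of the notebook or of any library used here.

```python
# Toy transactions (hypothetical data, not the notebook's dataset).
transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"milk", "eggs"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(X ∪ Y) / support(X) for the rule X -> Y."""
    x = set(antecedent)
    xy = x | set(consequent)
    return support(xy, transactions) / support(x, transactions)

# Rule {milk} -> {bread}
print("support:", support({"milk", "bread"}, transactions))
print("confidence:", confidence({"milk"}, {"bread"}, transactions))
```

Note that confidence is not symmetric: the rule Y → X generally has a different confidence from X → Y, because the denominator changes.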

***GOAL***

Given a set of transactions T, the goal of association rule mining is to find all rules having
1. support >= minsup_threshold
      
        Frequent Itemset = an itemset whose support is greater than or equal to the minsup_threshold.

2. confidence >= minconf_threshold.

**Mining Association Rules** *Two step approach*:

    1. Frequent Itemset Generation: Generate all itemsets whose support >= minsup (computationally expensive)
    2. Rule Generation: Generate high-confidence rules from each frequent itemset

Clone the course repository, install the required packages, and import the libraries:

In [1]:
!git clone https://github.com/camillasancricca/DATADIQ.git

fatal: destination path 'DATADIQ' already exists and is not an empty directory.


In [None]:
!pip install mlxtend pyECLAT efficient-apriori plotly

In [3]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

In [16]:
import pandas as pd
import numpy as np
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
import plotly.express as px
from mlxtend.frequent_patterns import fpgrowth
from pyECLAT import ECLAT
from DATADIQ import eff_apriori
import plotly.offline as pyo

***1. FREQUENT ITEMSET GENERATION***

**APRIORI**

***Apriori principle***: If an itemset is frequent, then all of its subsets must also be frequent → every superset of an infrequent itemset is itself infrequent and can be pruned from the lattice.

*Main steps*:

      1. Start with itemsets containing just a single item (individual items)
      2. Determine the support of each itemset
      3. Keep the itemsets that meet the minimum support threshold and discard those that do not
      4. Combine the itemsets that are kept to generate the candidate itemsets of the next size

      *Repeat steps 2 to 4 until no new itemsets are found*.
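
The steps above can be sketched in plain Python as a level-wise search. This is an illustrative sketch, not the implementation used by `mlxtend`; the function name `apriori_sketch` and the toy transactions are assumptions introduced here.

```python
from itertools import combinations

def apriori_sketch(transactions, min_support):
    """Return {frozenset: support} for every frequent itemset (level-wise search)."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def supp(itemset):
        return sum(itemset <= t for t in transactions) / n

    # Steps 1-3: start from single items and keep the frequent ones.
    items = {item for t in transactions for item in t}
    current = {}
    for item in items:
        s = supp(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s

    frequent = {}
    k = 1
    while current:
        frequent.update(current)
        k += 1
        # Step 4: join frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a, b in combinations(list(current), 2)
                      if len(a | b) == k}
        # Apriori pruning: every (k-1)-subset of a candidate must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in current
                             for sub in combinations(c, k - 1))}
        # Steps 2-3 again: count support and keep the frequent candidates.
        current = {}
        for c in candidates:
            s = supp(c)
            if s >= min_support:
                current[c] = s
    return frequent

# Toy transactions (hypothetical data).
transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"milk", "eggs"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
]
frequent = apriori_sketch(transactions, min_support=0.4)
for itemset, s in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

The pruning step is where the Apriori principle pays off: a candidate is discarded without scanning the transactions as soon as one of its subsets is known to be infrequent.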

In [5]:
BASKET = pd.read_csv('https://raw.githubusercontent.com/camillasancricca/DATADIQ/master/MARKETBASKET.csv', header=None)
BASKET

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,shrimp,almonds,avocado,vegetables mix,green grapes,whole weat flour,yams,cottage cheese,energy drink,tomato juice,low fat yogurt,green tea,honey,salad,mineral water,salmon,antioxydant juice,frozen smoothie,spinach,olive oil
1,burgers,meatballs,eggs,,,,,,,,,,,,,,,,,
2,chutney,,,,,,,,,,,,,,,,,,,
3,turkey,avocado,,,,,,,,,,,,,,,,,,
4,mineral water,milk,energy bar,whole wheat rice,green tea,,,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
996,shrimp,body spray,green tea,,,,,,,,,,,,,,,,,
997,frozen smoothie,,,,,,,,,,,,,,,,,,,
998,herb & pepper,frozen vegetables,mineral water,muffins,cereals,,,,,,,,,,,,,,,
999,turkey,tomatoes,spaghetti,milk,cider,eggs,honey,cake,green tea,french fries,brownies,tomato juice,,,,,,,,


In [6]:
#Put all items of each transaction into a list (missing values become the string 'nan')
records = []
for i in range(0, len(BASKET)):
    records.append([str(BASKET.values[i, j]) for j in range(0, 20)])

In [7]:
#Initializing the transactionEncoder
TE = TransactionEncoder()
array = TE.fit(records).transform(records)

In [8]:
#Build the data frame: rows are transactions, columns are items, values are True if the item was purchased
transf_df = pd.DataFrame(array, columns = TE.columns_)
transf_df

Unnamed: 0,asparagus,almonds,antioxydant juice,asparagus.1,avocado,babies food,bacon,barbecue sauce,black tea,blueberries,...,turkey,vegetables mix,water spray,white wine,whole weat flour,whole wheat pasta,whole wheat rice,yams,yogurt cake,zucchini
0,False,True,True,False,True,False,False,False,False,False,...,False,True,False,False,True,False,False,True,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,True,False,False,False,False,False,...,True,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,True,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
996,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
997,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
998,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
999,False,False,False,False,False,False,False,False,False,False,...,True,False,False,False,False,False,False,False,False,False


In [9]:
#Drop the 'nan' column produced by the empty slots of shorter transactions
basket_clean = transf_df.drop(['nan'], axis = 1)
basket_clean

Unnamed: 0,asparagus,almonds,antioxydant juice,asparagus.1,avocado,babies food,bacon,barbecue sauce,black tea,blueberries,...,turkey,vegetables mix,water spray,white wine,whole weat flour,whole wheat pasta,whole wheat rice,yams,yogurt cake,zucchini
0,False,True,True,False,True,False,False,False,False,False,...,False,True,False,False,True,False,False,True,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,True,False,False,False,False,False,...,True,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,True,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
996,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
997,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
998,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
999,False,False,False,False,False,False,False,False,False,False,...,True,False,False,False,False,False,False,False,False,False


In [10]:
#Choose a minimum support of 0.03
a_rules = apriori(basket_clean, min_support = 0.03, use_colnames = True)
#Store the number of items in each frequent itemset
a_rules['length'] = a_rules['itemsets'].apply(len)

In [11]:
#Frequent itemset
a_rules

Unnamed: 0,support,itemsets,length
0,0.034965,(avocado),1
1,0.079920,(burgers),1
2,0.037962,(butter),1
3,0.075924,(cake),1
4,0.048951,(champagne),1
...,...,...,...
56,0.046953,"(milk, mineral water)",2
57,0.044955,"(spaghetti, milk)",2
58,0.034965,"(olive oil, mineral water)",2
59,0.030969,"(soup, mineral water)",2


***2. RULES GENERATION***

Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f → L – f satisfies the minimum confidence requirement.
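
A minimal sketch of this step: enumerate every non-empty proper subset f of a frequent itemset L and keep the rule f → L – f when its confidence reaches the threshold. The function name `rules_from_itemset` and the support values are hypothetical, chosen only for illustration.

```python
from itertools import combinations

def rules_from_itemset(itemset, support_of, min_confidence):
    """Return (antecedent, consequent, confidence) for each rule f -> L - f."""
    L = frozenset(itemset)
    rules = []
    # r ranges over the sizes of the non-empty proper subsets of L.
    for r in range(1, len(L)):
        for subset in combinations(sorted(L), r):
            f = frozenset(subset)
            conf = support_of[L] / support_of[f]
            if conf >= min_confidence:
                rules.append((f, L - f, conf))
    return rules

# Hypothetical supports of a frequent itemset and of its subsets.
support_of = {
    frozenset({"milk"}): 0.8,
    frozenset({"eggs"}): 0.4,
    frozenset({"milk", "eggs"}): 0.4,
}
rules = rules_from_itemset({"milk", "eggs"}, support_of, min_confidence=0.7)
for antecedent, consequent, conf in rules:
    print(sorted(antecedent), "->", sorted(consequent), conf)
```

Because L is frequent, all of its subsets are frequent too (Apriori principle), so their supports are already available from the itemset generation step and no further pass over the data is needed.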

In [None]:
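#Choose a minimum confidence of 0.05
#num_itemsets (the number of transactions) is required by some mlxtend releases
rules = association_rules(a_rules, metric = 'confidence', min_threshold = 0.05, num_itemsets = len(basket_clean))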
rules

In [20]:
#Rules generation using the efficient-apriori library
ex = pd.read_csv('https://raw.githubusercontent.com/camillasancricca/DATADIQ/master/BRIDGES.csv')
eff_apriori.rules(ex,0.1,1)

[{('ARCH', 12)} -> {('HIGHWAY', 4)},
 {('ARCH', 12)} -> {('STEEL', 9)},
 {('CANTILEV', 12)} -> {('STEEL', 9)},
 {('IRON', 9)} -> {('THROUGH', 8)},
 {('LONG', 10)} -> {('STEEL', 9)},
 {('WOOD', 9)} -> {('S', 11)},
 {('WOOD', 12)} -> {('S', 11)},
 {('SUSPEN', 12)} -> {('THROUGH', 8)},
 {('WOOD', 12)} -> {('WOOD', 9)},
 {('WOOD', 9)} -> {('WOOD', 12)},
 {('2', 6), ('LONG', 10)} -> {('STEEL', 9)},
 {('M', 1), ('S', 11)} -> {('2', 6)},
 {('2', 6), ('WOOD', 9)} -> {('N', 7)},
 {('2', 6), ('WOOD', 12)} -> {('N', 7)},
 {('2', 6), ('WOOD', 9)} -> {('S', 11)},
 {('2', 6), ('WOOD', 12)} -> {('S', 11)},
 {('2', 6), ('WOOD', 12)} -> {('WOOD', 9)},
 {('2', 6), ('WOOD', 9)} -> {('WOOD', 12)},
 {('4', 6), ('F', 11)} -> {('HIGHWAY', 4)},
 {('4', 6), ('F', 11)} -> {('STEEL', 9)},
 {('4', 6), ('MEDIUM', 10)} -> {('STEEL', 9)},
 {('?', 5), ('F', 11)} -> {('G', 7)},
 {('?', 5), ('STEEL', 9)} -> {('G', 7)},
 {('A', 1), ('WOOD', 9)} -> {('S', 11)},
 {('A', 1), ('WOOD', 12)} -> {('S', 11)},
 {('A', 1), ('WOOD

**ECLAT ALGORITHM**

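ECLAT works on the vertical data format: each item is associated with its tidset, the set of IDs of the transactions that contain it, and tidsets are used directly for support computation.

The support of a candidate itemset is computed by intersecting the tidsets of suitably chosen subsets.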

*Main steps*:

    1. Convert the data into the vertical format (item → tidset)
    2. Set the minimum support value
    3. Exclude all itemsets that appear in fewer transactions than the minimum support requires
    4. Combine the itemsets that are kept to generate the candidate itemsets of the next length

    *Repeat steps 3 and 4 as many times as needed to analyze itemsets of the required length.*
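
The following plain-Python sketch illustrates the steps above with tidset intersections. It is not the `pyECLAT` implementation; the function name `eclat_sketch`, its parameters, and the toy transactions are assumptions made for illustration.

```python
from itertools import combinations

def eclat_sketch(transactions, min_support, max_length=2):
    """Return {frozenset: support} using tidset intersections."""
    n = len(transactions)
    # Step 1: vertical format, each item maps to its tidset.
    tidsets = {}
    for tid, t in enumerate(transactions):
        for item in t:
            tidsets.setdefault(frozenset([item]), set()).add(tid)
    # Steps 2-3: drop the items below the minimum support.
    current = {i: tids for i, tids in tidsets.items()
               if len(tids) / n >= min_support}

    frequent = {}
    k = 1
    while current:
        frequent.update({i: len(tids) / n for i, tids in current.items()})
        if k == max_length:
            break
        k += 1
        # Step 4: combine kept itemsets; the tidset of a union is the
        # intersection of the tidsets, so no pass over the data is needed.
        nxt = {}
        for (a, ta), (b, tb) in combinations(list(current.items()), 2):
            c = a | b
            if len(c) == k and c not in nxt:
                tids = ta & tb
                if len(tids) / n >= min_support:
                    nxt[c] = tids
        current = nxt
    return frequent

# Toy transactions (hypothetical data).
transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"milk", "eggs"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
]
result = eclat_sketch(transactions, min_support=0.4, max_length=2)
for itemset, s in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

The support of an itemset is simply the size of its tidset divided by the number of transactions, which is why the original data is scanned only once, during the conversion to the vertical format.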

In [21]:
#Loading transactions DataFrame to ECLAT class
eclat = ECLAT(data=BASKET)

In [22]:
#DataFrame of binary values
eclat.df_bin

Unnamed: 0,almonds,water spray,white wine,escalope,parmesan cheese,soda,melons,mayonnaise,clothes accessories,antioxydant juice,...,mineral water,bramble,strawberries,bug spray,cider,cream,honey,tomato sauce,soup,frozen vegetables
0,1,0,0,0,0,0,0,0,0,1,...,1,0,0,0,0,0,1,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
996,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
997,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
998,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,1
999,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,1,0,0,0


In [23]:
#Count items in each row
items_per_transaction = eclat.df_bin.astype(int).sum(axis=1)
items_per_transaction

0       20
1        3
2        1
3        2
4        5
        ..
996      3
997      1
998      5
999     12
1000     9
Length: 1001, dtype: int64

In [24]:
#Each itemset should appear in at least 3% of the transactions
min_support = 0.03
#Smallest itemset size to evaluate (2 items)
min_combination = 2
#Largest itemset size to evaluate (2 items)
max_combination = 2

rule_indices, rule_supports = eclat.fit(min_support=min_support,
                                        min_combination=min_combination,
                                        max_combination=max_combination,
                                        separator=' & ',
                                        verbose=True)

Combination 2 by 2


741it [00:16, 44.21it/s]


In [25]:
result = pd.DataFrame(rule_supports.items(),columns=['Item', 'Support'])
result.sort_values(by=['Support'], ascending=False)

Unnamed: 0,Item,Support
13,eggs & mineral water,0.05994
19,chocolate & mineral water,0.058941
16,spaghetti & mineral water,0.057942
14,spaghetti & chocolate,0.050949
5,milk & mineral water,0.046953
4,milk & chocolate,0.045954
3,milk & spaghetti,0.044955
10,eggs & chocolate,0.042957
11,eggs & french fries,0.040959
8,eggs & spaghetti,0.038961


**FP-GROWTH**

FP-Growth compresses a large database into a compact structure, the Frequent-Pattern tree (FP-tree).

Frequent itemsets are extracted directly from the FP-tree.

***GOAL*** To avoid candidate generation (computationally expensive)

*Main steps*:

    1. Construct the frequent pattern tree
    2. For each frequent item: compute the projected FP-tree
    3. Mine conditional FP-trees and grow frequent patterns
    4. If the conditional FP-tree contains a single path: enumerate all the patterns
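
As a partial sketch of the steps above, the snippet below builds an FP-tree (step 1) and extracts the prefix paths of an item, the data from which its projected tree is built (step 2). It does not perform the full recursive mining, and it is not the `mlxtend` implementation; `FPNode`, `build_fp_tree`, `conditional_pattern_base`, and the toy transactions are hypothetical names and data introduced for illustration.

```python
from collections import Counter

class FPNode:
    """One node of the FP-tree: an item, its count, and links to parent/children."""
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}

def build_fp_tree(transactions, min_count):
    # Pass 1: count the items and keep the frequent ones.
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {i: c for i, c in counts.items() if c >= min_count}
    root = FPNode(None)
    header = {i: [] for i in frequent}  # item -> all tree nodes carrying it
    # Pass 2: insert each transaction with items in descending frequency order,
    # so that transactions sharing frequent items share a path prefix.
    for t in transactions:
        ordered = sorted((i for i in set(t) if i in frequent),
                         key=lambda i: (-frequent[i], i))
        node = root
        for item in ordered:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, parent=node)
                node.children[item] = child
                header[item].append(child)
            child.count += 1
            node = child
    return root, header

def conditional_pattern_base(item, header):
    """Prefix paths that lead to `item`, each with the count of the item node."""
    base = []
    for node in header[item]:
        path, parent = [], node.parent
        while parent is not None and parent.item is not None:
            path.append(parent.item)
            parent = parent.parent
        base.append((path[::-1], node.count))
    return base

# Toy transactions (hypothetical data).
transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"milk", "eggs"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
]
root, header = build_fp_tree(transactions, min_count=2)
print(conditional_pattern_base("eggs", header))
```

No candidate itemsets are generated at any point: the tree is built with two passes over the data, and the frequent patterns ending in an item are grown from its prefix paths.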

In [26]:
#Put all items of each transaction into a list (missing values become the string 'nan')
records = []
for i in range(0, len(BASKET)):
    records.append([str(BASKET.values[i, j]) for j in range(0, 20)])

In [27]:
#Initializing the transactionEncoder
TE = TransactionEncoder()
array = TE.fit(records).transform(records)

In [28]:
#Build the data frame: rows are transactions, columns are items, values are True if the item was purchased
transf_df = pd.DataFrame(array, columns=TE.columns_)
transf_df

Unnamed: 0,asparagus,almonds,antioxydant juice,asparagus.1,avocado,babies food,bacon,barbecue sauce,black tea,blueberries,...,turkey,vegetables mix,water spray,white wine,whole weat flour,whole wheat pasta,whole wheat rice,yams,yogurt cake,zucchini
0,False,True,True,False,True,False,False,False,False,False,...,False,True,False,False,True,False,False,True,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,True,False,False,False,False,False,...,True,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,True,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
996,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
997,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
998,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
999,False,False,False,False,False,False,False,False,False,False,...,True,False,False,False,False,False,False,False,False,False


In [29]:
#Drop NaN
transf_df = transf_df.drop(['nan'], axis = 1)
transf_df

Unnamed: 0,asparagus,almonds,antioxydant juice,asparagus.1,avocado,babies food,bacon,barbecue sauce,black tea,blueberries,...,turkey,vegetables mix,water spray,white wine,whole weat flour,whole wheat pasta,whole wheat rice,yams,yogurt cake,zucchini
0,False,True,True,False,True,False,False,False,False,False,...,False,True,False,False,True,False,False,True,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,True,False,False,False,False,False,...,True,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,True,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
996,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
997,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
998,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
999,False,False,False,False,False,False,False,False,False,False,...,True,False,False,False,False,False,False,False,False,False


In [30]:
#Running the fpgrowth algorithm
res = fpgrowth(transf_df,min_support=0.05, use_colnames=True)
res

Unnamed: 0,support,itemsets
0,0.244755,(mineral water)
1,0.140859,(green tea)
2,0.082917,(shrimp)
3,0.082917,(low fat yogurt)
4,0.075924,(olive oil)
5,0.061938,(frozen smoothie)
6,0.207792,(eggs)
7,0.07992,(burgers)
8,0.078921,(turkey)
9,0.135864,(milk)


In [31]:
#Extract association rules with min confidence 0.05
#Some mlxtend releases require num_itemsets (the number of transactions), so pass it by keyword
res = association_rules(res, metric="confidence", min_threshold=0.05, num_itemsets=len(transf_df))
res

In [None]:
#Sort values based on confidence
res.sort_values("confidence",ascending=False)

**Summary**

from *mlxtend.frequent_patterns*:

- apriori()
- association_rules()
- fpgrowth()

from *pyECLAT*:

- ECLAT()