# Eclat Association Rules

Introduction:
Eclat (Equivalence Class Clustering and bottom-up Lattice Traversal) is an algorithm for mining frequent itemsets from transactional datasets. It operates on a vertical data representation, in which each item is mapped to the set of transactions that contain it, so the support of an itemset can be computed by intersecting those sets instead of rescanning the whole dataset. In this blog, we will explore the working principles of the Eclat algorithm, discuss its core concepts, provide example code implementations in Python, and examine its advantages and limitations.
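
To make the vertical representation concrete, here is a minimal sketch that converts the fruit transactions used later in this post into item-to-transaction-id sets (often called tidsets) and computes a support count by intersection. The helper name to_vertical is an assumption made for illustration; it is not part of any library.

```python
from collections import defaultdict

# Fruit transactions used in the examples later in this post
transactions = [
    {"apple", "banana", "cherry"},
    {"banana", "orange"},
    {"apple", "banana", "grape"},
    {"banana", "cherry"},
    {"apple", "grape"},
    {"banana", "orange"},
]

def to_vertical(transactions):
    """Hypothetical helper: map each item to the set of transaction ids containing it."""
    tidsets = defaultdict(set)
    for tid, transaction in enumerate(transactions):
        for item in transaction:
            tidsets[item].add(tid)
    return dict(tidsets)

tidsets = to_vertical(transactions)

# The support count of an itemset is the size of the intersection of its items' tidsets
apple_banana = tidsets["apple"] & tidsets["banana"]
print(sorted(apple_banana))  # ids of the transactions containing both items
print(len(apple_banana))     # support count of {apple, banana}
```

Because the intersection only touches the two tidsets involved, the original transactions never need to be scanned again once the vertical layout is built.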


Code:

In [1]:
import pandas as pd

The mlxtend example later in this post works as follows. We first import the necessary libraries and define a sample transaction dataset, represented by the dataset list of sets.

Next, we one-hot encode the dataset using the TransactionEncoder from the mlxtend.preprocessing module. This step transforms the dataset into a binary (boolean) matrix, where each column corresponds to an item and each row represents a transaction.

After that, we create a DataFrame from the one-hot encoded matrix, taking the column names from the encoder's columns_ attribute, which holds the unique items in the dataset.

We then mine frequent itemsets with the apriori() function from the mlxtend.frequent_patterns module; note that this cell uses Apriori rather than Eclat. We specify a minimum support threshold of 0.5 and set use_colnames=True so that the resulting frequent itemsets use item names instead of column indices.

Finally, we print the frequent itemsets.

In [27]:
dataset = [
    {"apple", "banana", "cherry"},
    {"banana", "orange"},
    {"apple", "banana", "grape"},
    {"banana", "cherry"},
    {"apple", "grape"},
    {"banana", "orange"},
]

fruits = ["apple", "banana", "cherry", "grape", "orange"]

# Sort the items of each transaction by their position in the fruits list
mapped_dataset = [sorted(itemset, key=fruits.index) for itemset in dataset]

# Print the mapped dataset
for itemset in mapped_dataset:
    print(itemset)

['apple', 'banana', 'cherry']
['banana', 'orange']
['apple', 'banana', 'grape']
['banana', 'cherry']
['apple', 'grape']
['banana', 'orange']


In [4]:
dataset = [
    {"apple", "banana", "cherry"},
    {"banana", "orange"},
    {"apple", "banana", "grape"},
    {"banana", "cherry"},
    {"apple", "grape"},
    {"banana", "orange"},
]

In [5]:
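# Collect the unique items. A set has no fixed iteration order, so the
# column order of binary_dataset below may differ between runs.
items = set(item for transaction in dataset for item in transaction)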

In [7]:
binary_dataset = [[1 if item in transaction else 0 for item in items] for transaction in dataset]
binary_dataset

[[0, 1, 1, 1, 0],
 [1, 0, 1, 0, 0],
 [0, 1, 1, 0, 1],
 [0, 0, 1, 1, 0],
 [0, 1, 0, 0, 1],
 [1, 0, 1, 0, 0]]

In [8]:
support_counts = {}

In [9]:
for transaction in binary_dataset:
    for i, item in enumerate(transaction):
        if item == 1:
            support_counts[i] = support_counts.get(i, 0) + 1

In [10]:
sorted_items = sorted(support_counts.items(), key=lambda x: x[1], reverse=True)

In [11]:
class EclatNode:
    def __init__(self, item, support_count):
        self.item = item
        self.support_count = support_count
        self.children = []

In [12]:
root = EclatNode(None, 0)

In [13]:
for item, support_count in sorted_items:
    parent = root
    for child in parent.children:
        if child.item == item:
            parent = child
            break
    else:
        new_node = EclatNode(item, support_count)
        parent.children.append(new_node)
        parent = new_node
    for i, transaction in enumerate(binary_dataset):
        if transaction[item] == 1:
            child_node = EclatNode(i, 1)
            parent.children.append(child_node)

In [14]:
def dfs_eclat(node, itemset, min_support):
    # Check if the current itemset is frequent
    if node.support_count >= min_support:
        print(itemset)

    # Recursively traverse child nodes
    for child in node.children:
        new_itemset = itemset.copy()
        if child.item is not None:
            new_itemset.append(child.item)
        dfs_eclat(child, new_itemset, min_support)

In [17]:
def dfs_eclat(node, prefix, min_support):
    # Check if the current itemset is frequent
    if len(node.children) > 0 and node.support_count >= min_support:
        print(prefix)

    # Recursively traverse child nodes
    for child in node.children:
        new_prefix = prefix.copy()
        if child.item is not None:
            new_prefix.append(child.item)
        dfs_eclat(child, new_prefix, min_support)

In [18]:
min_support = 2  # Set your desired minimum support threshold
dfs_eclat(root, [], min_support)

[2]
[1]
[3]
[0]
[4]

These values are column indices of binary_dataset (in this run, 2 is banana, 1 is apple, 3 is cherry, 0 is orange, and 4 is grape), so the traversal reports only single items; it does not yet combine items into larger itemsets.


In [29]:
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori

dataset = [
    {"apple", "banana", "cherry"},
    {"banana", "orange"},
    {"apple", "banana", "grape"},
    {"banana", "cherry"},
    {"apple", "grape"},
    {"banana", "orange"},
]

# Transform the dataset into a one-hot encoded format
te = TransactionEncoder()
one_hot_encoded = te.fit_transform(dataset)
df = pd.DataFrame(one_hot_encoded, columns=te.columns_)

# Find frequent itemsets with minimum support of 0.5 using Apriori
frequent_itemsets = apriori(df, min_support=0.5, use_colnames=True)

# Print the frequent itemsets
print(frequent_itemsets)

    support  itemsets
0  0.500000   (apple)
1  0.833333  (banana)
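
As a sanity check on the table above, the support values can be recomputed by hand: support is the fraction of transactions that contain every item of an itemset. The sketch below uses plain Python only, and the support helper is an illustrative name rather than an mlxtend function.

```python
from itertools import combinations

dataset = [
    {"apple", "banana", "cherry"},
    {"banana", "orange"},
    {"apple", "banana", "grape"},
    {"banana", "cherry"},
    {"apple", "grape"},
    {"banana", "orange"},
]

def support(itemset):
    """Fraction of transactions that contain every item of the itemset."""
    return sum(1 for transaction in dataset if itemset <= transaction) / len(dataset)

all_items = sorted(set().union(*dataset))

# Single items: only apple (3/6) and banana (5/6) reach the 0.5 threshold
for item in all_items:
    print(item, round(support({item}), 6))

# Pairs: none of them reaches 0.5, which is why the table has only two rows
frequent_pairs = [pair for pair in combinations(all_items, 2) if support(set(pair)) >= 0.5]
print(frequent_pairs)
```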


In [1]:
from collections import defaultdict

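def eclat(dataset, min_support):
    # Vertical layout: map each item to the set of transaction ids (tidset) containing it
    tidsets = defaultdict(set)
    for tid, transaction in enumerate(dataset):
        for item in transaction:
            tidsets[item].add(tid)

    frequent_itemsets = []
    # Each stack entry is (prefix, candidates), where candidates are the
    # (item, tidset) pairs that can still extend the prefix
    frequent_items = [(item, tids) for item, tids in tidsets.items() if len(tids) >= min_support]
    stack = [([], sorted(frequent_items, key=lambda pair: pair[0]))]
    while stack:
        prefix, candidates = stack.pop()
        for idx, (item, tids) in enumerate(candidates):
            frequent_itemsets.append(prefix + [item])
            # Extend only with later items, so an item is never repeated;
            # the tidset intersection gives the support of the longer itemset
            extensions = [(other, tids & other_tids) for other, other_tids in candidates[idx + 1:]]
            extensions = [(other, shared) for other, shared in extensions if len(shared) >= min_support]
            if extensions:
                stack.append((prefix + [item], extensions))

    return frequent_itemsets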

dataset = [
    ["apple", "banana", "cherry"],
    ["banana", "orange"],
    ["apple", "banana", "grape"],
    ["banana", "cherry"],
    ["apple", "grape"],
    ["banana", "orange"],
]

min_support = 2  # Minimum support count

frequent_itemsets = eclat(dataset, min_support)

# Print the frequent itemsets
for itemset in frequent_itemsets:
    print(itemset)

In [1]:
# importing a sample dataset (Example1 and Example2 are datasets bundled with pyECLAT)
from pyECLAT import Example2

# storing the dataset in a variable
dataset = Example2().get()

# printing the dataset
dataset.head()

   0              1          2           3                 4             5                 6
0  shrimp         almonds    avocado     vegetables mix    green grapes  whole weat flour  yams
1  burgers        meatballs  eggs
2  chutney
3  turkey         avocado
4  mineral water  milk       energy bar  whole wheat rice  green tea


In [2]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3001 entries, 0 to 3000
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       3001 non-null   object
 1   1       2315 non-null   object
 2   2       1774 non-null   object
 3   3       1374 non-null   object
 4   4       1048 non-null   object
 5   5       775 non-null    object
 6   6       581 non-null    object
dtypes: object(7)
memory usage: 164.2+ KB


In [3]:
## Visualizing the frequent items
# importing the ECLAT module
from pyECLAT import ECLAT

# loading transactions DataFrame to ECLAT class
eclat = ECLAT(data=dataset)

# DataFrame of binary values
eclat.df_bin

Unnamed: 0,rice,light mayo,green tea,frozen vegetables,grated cheese,white wine,fromage blanc,mint,energy drink,burgers,...,parmesan cheese,honey,frozen smoothie,brownies,cookies,almonds,butter,pickles,tomato sauce,vegetables mix
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,1
1,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2996,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2997,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2998,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2999,0,0,0,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


In this binary dataset, every row represents a transaction and every column represents a product that may appear in a transaction. Every cell contains one of two possible values:

0: the transaction does not contain the product
1: the transaction contains the product
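
To illustrate how support can be read off such a binary layout, here is a small sketch restricted to the first five transactions shown earlier and to four of the products. The dictionary binary_columns and the support helper are assumptions made for this example; they are not part of pyECLAT.

```python
# First five transactions from the sample output, restricted to four products
# (hypothetical hand-built excerpt, not the full 3001-row dataset)
binary_columns = {
    "almonds":   [1, 0, 0, 0, 0],
    "avocado":   [1, 0, 0, 1, 0],
    "burgers":   [0, 1, 0, 0, 0],
    "green tea": [0, 0, 0, 0, 1],
}

def support(*products):
    """Fraction of rows in which every listed product has the value 1."""
    columns = [binary_columns[product] for product in products]
    n_rows = len(columns[0])
    hits = sum(1 for row in zip(*columns) if all(row))
    return hits / n_rows

print(support("avocado"))             # avocado appears in 2 of the 5 rows
print(support("almonds", "avocado"))  # both appear together in 1 of the 5 rows
```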

In [4]:
# count items in each column
items_total = eclat.df_bin.astype(int).sum(axis=0)

items_total

rice                  50
light mayo            70
green tea            340
frozen vegetables    276
grated cheese        166
                    ... 
almonds               52
butter                89
pickles               17
tomato sauce          52
vegetables mix        59
Length: 119, dtype: int64

In [5]:
# count items in each row
items_per_transaction = eclat.df_bin.astype(int).sum(axis=1)

items_per_transaction

0       7
1       3
2       1
3       2
4       5
       ..
2996    1
2997    2
2998    3
2999    7
3000    5
Length: 3001, dtype: int64

In [8]:
## Frequent ItemList
import pandas as pd

# Loading items per column stats to the DataFrame
df = pd.DataFrame({'items': items_total.index, 'transactions': items_total.values}) 

# sorted copy of the DataFrame for visualization purposes
df_table = df.sort_values("transactions", ascending=False)

#  Top 5 most popular products/items
df_table.head(5).style.background_gradient(cmap='Blues')

    items          transactions
92  mineral water  711
88  spaghetti      549
13  eggs           532
97  chocolate      485
53  french fries   463


To generate association rules, we need to define:

Minimum support: the minimum fraction of transactions that must contain an item combination (for example, 0.05 for 5%)
Minimum combinations: the minimum number of items in a combination
Maximum combinations: the maximum number of items in a combination
Note: the higher the value of the maximum combinations, the longer the calculation will take
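
The meaning of these three parameters can be sketched with a brute-force search over the small fruit dataset from earlier in this post. The function brute_force_itemsets is a hypothetical helper written for illustration; it mirrors what the parameters mean and is not pyECLAT's actual implementation.

```python
from itertools import combinations

transactions = [
    {"apple", "banana", "cherry"},
    {"banana", "orange"},
    {"apple", "banana", "grape"},
    {"banana", "cherry"},
    {"apple", "grape"},
    {"banana", "orange"},
]

def brute_force_itemsets(transactions, min_support, min_combination,
                         max_combination, separator=" & "):
    """Hypothetical helper: test every item combination of the allowed sizes."""
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if set(itemset) <= t) / n

    # Only items that are frequent on their own can appear in a frequent combination
    frequent_items = sorted(
        item for item in set().union(*transactions) if support((item,)) >= min_support
    )
    result = {}
    for size in range(min_combination, max_combination + 1):
        for combo in combinations(frequent_items, size):
            s = support(combo)
            if s >= min_support:
                result[separator.join(combo)] = s
    return result

found = brute_force_itemsets(transactions, min_support=0.3,
                             min_combination=2, max_combination=3)
for name, s in found.items():
    print(f"{name}: {s:.3f}")
```

Raising max_combination adds a whole new layer of combinations to test, which is why the note above warns about longer running times.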

In [9]:
# an item combination should appear in at least 5% of transactions
min_support = 5/100

# start from combinations of at least 2 items
min_combination = 2

# go up to the largest number of items found in a single transaction
max_combination = max(items_per_transaction)

rule_indices, rule_supports = eclat.fit(min_support=min_support,
                                        min_combination=min_combination,
                                        max_combination=max_combination,
                                        separator=' & ',
                                        verbose=True)

Combination 2 by 2


253it [00:00, 319.76it/s]


Combination 3 by 3


1771it [00:05, 305.66it/s]


Combination 4 by 4


8855it [00:29, 297.58it/s]


Combination 5 by 5


33649it [02:14, 250.08it/s]


Combination 6 by 6


100947it [07:01, 239.61it/s]


Combination 7 by 7


245157it [18:15, 223.69it/s]
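
The iteration counts in the progress output grow quickly with the combination size. They happen to equal the number of k-item combinations that can be drawn from 23 items, which suggests that 23 items met the 5% threshold and that every combination of them was evaluated. This is an inference from the numbers, not something the output states, and the value 23 below is an assumption. The sketch checks the arithmetic with math.comb:

```python
from math import comb

# Iteration counts copied from the progress output above, keyed by combination size
observed = {2: 253, 3: 1771, 4: 8855, 5: 33649, 6: 100947, 7: 245157}

# Assumption: 23 items passed the minimum support filter (inferred, not reported)
n_items = 23

for size, count in observed.items():
    expected = comb(n_items, size)
    print(f"size {size}: C({n_items}, {size}) = {expected}, observed = {count}")

matches = all(comb(n_items, size) == count for size, count in observed.items())
print(matches)
```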


The fit() method of the ECLAT class returns two objects:

rule_indices: the association rule indices
rule_supports: the association rule support values, keyed by item combination

In [10]:
import pandas as pd

result = pd.DataFrame(rule_supports.items(),columns=['Item', 'Support'])
result.sort_values(by=['Support'], ascending=False)

   Item                       Support
0  spaghetti & mineral water  0.060646


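With a minimum support of 5%, the only combination of two or more items that qualifies is spaghetti & mineral water, which appears in about 6.1% of the transactions in our dataset. In other words, these two products are bought together more often than any other combination we searched for.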
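
A frequent itemset becomes an association rule once a direction and a strength measure are attached to it. The sketch below applies the standard definitions of confidence and lift to the numbers reported above. The pair count is not printed anywhere in the output; it is derived here from the reported support and the 3001 transactions, so treat it as an assumption.

```python
# Values taken from the outputs above
n_transactions = 3001
count_mineral_water = 711
count_spaghetti = 549
support_pair = 0.060646

# Derived assumption: number of transactions containing both items
pair_count = round(support_pair * n_transactions)

# confidence(A -> B) = count(A and B) / count(A)
conf_spaghetti_to_water = pair_count / count_spaghetti
conf_water_to_spaghetti = pair_count / count_mineral_water

# lift = support(A and B) / (support(A) * support(B)); values above 1
# mean the items occur together more often than if they were independent
lift = (pair_count / n_transactions) / (
    (count_mineral_water / n_transactions) * (count_spaghetti / n_transactions)
)

print(pair_count)
print(round(conf_spaghetti_to_water, 3))
print(round(conf_water_to_spaghetti, 3))
print(round(lift, 3))
```

Under these assumptions, roughly a third of the transactions containing spaghetti also contain mineral water, and the lift is above 1.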