**Association rules** have infiltrated your data while you were asleep! 😯 What are they, and how do you find them? Let's try them out with [Desbordante](https://github.com/Desbordante/desbordante-core)!

# Install necessary dependencies

First, let's install and import the necessary libraries:

In [None]:
!pip install desbordante==2.3.2

Collecting desbordante==2.3.2
  Downloading desbordante-2.3.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (19 kB)
Downloading desbordante-2.3.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.0 MB)
Installing collected packages: desbordante
Successfully installed desbordante-2.3.2


The Desbordante library will be used to discover association rules, and the pandas library will be used to display the data:

In [None]:
import desbordante
import pandas as pd

Let's download example data:

In [None]:
!wget https://raw.githubusercontent.com/Desbordante/desbordante-core/refs/heads/main/examples/datasets/rules_book_rows.csv
!wget https://raw.githubusercontent.com/Desbordante/desbordante-core/refs/heads/main/examples/datasets/rules_book.csv

--2025-03-20 17:00:04--  https://raw.githubusercontent.com/Desbordante/desbordante-core/refs/heads/main/examples/datasets/rules_book_rows.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 98 [text/plain]
Saving to: ‘rules_book_rows.csv’


2025-03-20 17:00:04 (5.37 MB/s) - ‘rules_book_rows.csv’ saved [98/98]

--2025-03-20 17:00:04--  https://raw.githubusercontent.com/Desbordante/desbordante-core/refs/heads/main/examples/datasets/rules_book.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 126 [text/plain]
Saving to: ‘rules_book

# Explore data

Let's look at the dataset:

In [None]:
dataset = pd.read_csv('rules_book_rows.csv', header=None, keep_default_na=False)
dataset

Unnamed: 0,0,1,2,3
0,Bread,Butter,Milk,
1,Eggs,Milk,Yogurt,
2,Bread,Cheese,Eggs,Milk
3,Eggs,Milk,Yogurt,
4,Cheese,Milk,Yogurt,


The dataset contains receipts from some supermarket: each row of the dataset is a single receipt. This kind of data is called transactional data.
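
To make the notion of transactional data more concrete, the five receipts from the table above can also be written as plain Python sets, one per transaction. This is only an illustrative sketch: the `receipts` list is typed in by hand from the table, and the names `receipts`, `transactions` and `item_counts` are ours, not part of Desbordante.

```python
# Receipts copied by hand from the table above (one list per row).
receipts = [
    ["Bread", "Butter", "Milk"],
    ["Eggs", "Milk", "Yogurt"],
    ["Bread", "Cheese", "Eggs", "Milk"],
    ["Eggs", "Milk", "Yogurt"],
    ["Cheese", "Milk", "Yogurt"],
]

# A transaction is an unordered collection of items, so a set is a natural fit.
transactions = [frozenset(r) for r in receipts]

# Count in how many receipts each item occurs.
item_counts = {}
for t in transactions:
    for item in t:
        item_counts[item] = item_counts.get(item, 0) + 1

for item in sorted(item_counts):
    print(item, item_counts[item])
```

Note that the column in which an item appears carries no meaning here: only the set of items in each row matters.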

# Find association rules

Now, let's find association rules (ARs) with Desbordante:

In [None]:
algo = desbordante.ar.algorithms.Default()
algo.load_data(table=dataset, input_format='tabular')
algo.execute(minconf=1)
ars = algo.get_ars()
print('Total count of ARs:', len(ars))
print('The first 10 ARs:')
for ar in ars[:10]:
  print(ar)

Total count of ARs: 24
The first 10 ARs:
conf: 1.000000	sup: 0.200000	{Butter} -> {Bread}
conf: 1.000000	sup: 0.400000	{Bread} -> {Milk}
conf: 1.000000	sup: 0.200000	{Butter} -> {Milk}
conf: 1.000000	sup: 0.600000	{Eggs} -> {Milk}
conf: 1.000000	sup: 0.600000	{Yogurt} -> {Milk}
conf: 1.000000	sup: 0.400000	{Cheese} -> {Milk}
conf: 1.000000	sup: 0.200000	{Butter, Milk} -> {Bread}
conf: 1.000000	sup: 0.200000	{Bread, Butter} -> {Milk}
conf: 1.000000	sup: 0.200000	{Butter} -> {Bread, Milk}
conf: 1.000000	sup: 0.200000	{Bread, Eggs} -> {Milk}


The AR mining algorithm has found 24 ARs!

The rule ['Butter'] -> ['Bread'] with confidence 1 means that whenever butter appears in a receipt, bread is present as well. The only receipt containing butter is the first one (row 0), and it also contains bread, so this AR holds in the table.

# Support and confidence

Let's discuss support and confidence of an association rule.

Support of an AR is the fraction of rows that satisfy both the left-hand side (LHS) and the right-hand side (RHS) of the AR. It equals the number of rows satisfying both LHS and RHS divided by the total number of rows: $supp(X \rightarrow Y ) = \frac{n(X \cup Y)}{N}$, where $n(X \cup Y)$ is the number of rows containing every item of $X$ and $Y$, and $N$ is the total number of rows.

For example, let's calculate the support of the above AR ['Butter'] -> ['Bread']. It can easily be seen that there is only one receipt in the table that contains both bread and butter. The number of all receipts is 5, so the support of the AR ['Butter'] -> ['Bread'] equals $\frac{1}{5} = 0.2$.
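
The same calculation can be sketched in a few lines of plain Python. This is a hand-rolled illustration, not Desbordante's implementation: the `transactions` list is copied by hand from the table, and `support` is a helper name introduced here.

```python
# The five receipts from the table, one set of items per receipt.
transactions = [
    {"Bread", "Butter", "Milk"},
    {"Eggs", "Milk", "Yogurt"},
    {"Bread", "Cheese", "Eggs", "Milk"},
    {"Eggs", "Milk", "Yogurt"},
    {"Cheese", "Milk", "Yogurt"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    matching = sum(1 for t in transactions if itemset <= t)
    return matching / len(transactions)

# Support of ['Butter'] -> ['Bread']: receipts containing both items.
print(support({"Butter", "Bread"}, transactions))
```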

Confidence of an AR is the fraction of rows satisfying LHS that also satisfy RHS. It equals the number of rows satisfying both LHS and RHS divided by the number of rows satisfying LHS: $conf(X \rightarrow Y ) = \frac{n(X \cup Y)}{n(X)}$.

Let's calculate the confidence of the above AR. We already know that the number of receipts containing both bread and butter is 1. As we can see, the number of rows satisfying LHS (i.e. the number of receipts containing butter) also equals 1, so the confidence of the AR equals $\frac{1}{1} = 1$.
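
As a rough sketch, confidence can be computed by hand in the same way. The helpers `count` and `confidence` below are our own names, and the `transactions` list is again copied from the table. The second call shows that the direction of a rule matters.

```python
# The five receipts from the table, one set of items per receipt.
transactions = [
    {"Bread", "Butter", "Milk"},
    {"Eggs", "Milk", "Yogurt"},
    {"Bread", "Cheese", "Eggs", "Milk"},
    {"Eggs", "Milk", "Yogurt"},
    {"Cheese", "Milk", "Yogurt"},
]

def count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= t)

def confidence(lhs, rhs, transactions):
    """n(LHS and RHS together) divided by n(LHS)."""
    n_lhs = count(lhs, transactions)
    if n_lhs == 0:
        raise ValueError("LHS never occurs, so confidence is undefined")
    return count(set(lhs) | set(rhs), transactions) / n_lhs

print(confidence({"Butter"}, {"Bread"}, transactions))  # 1 / 1
print(confidence({"Bread"}, {"Butter"}, transactions))  # 1 / 2
```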

When the confidence of an AR equals 1, every row that satisfies LHS also satisfies RHS. In our example, if butter is present in a receipt, then bread is always present as well. If the confidence is less than 1, then there are counterexamples to this AR in the dataset, and RHS is only likely to be satisfied given that LHS is satisfied.

Let's see more examples of ARs with different support and confidence.

# Second example: minimum confidence

Now, let's examine the same dataset with *minconf*=0.6. The option *minconf* defines the minimum confidence that we allow an AR to have. All ARs with lower confidence are omitted from the results. By setting this option to 0.6 we allow more ARs to be present than with the previous value of 1:

In [None]:
algo.execute(minconf=0.6)
ars = algo.get_ars()
print('Total count of ARs:', len(ars))
print('The first 10 ARs:')
for ar in ars[:10]:
  print(ar)

Total count of ARs: 32
The first 10 ARs:
conf: 1.000000	sup: 0.200000	{Butter} -> {Bread}
conf: 1.000000	sup: 0.400000	{Bread} -> {Milk}
conf: 1.000000	sup: 0.200000	{Butter} -> {Milk}
conf: 1.000000	sup: 0.600000	{Eggs} -> {Milk}
conf: 0.600000	sup: 0.600000	{Milk} -> {Eggs}
conf: 1.000000	sup: 0.600000	{Yogurt} -> {Milk}
conf: 0.600000	sup: 0.600000	{Milk} -> {Yogurt}
conf: 1.000000	sup: 0.400000	{Cheese} -> {Milk}
conf: 0.666667	sup: 0.400000	{Yogurt} -> {Eggs}
conf: 0.666667	sup: 0.400000	{Eggs} -> {Yogurt}


As we can see, the number of ARs has increased from 24 to 32, as we expected.

['Yogurt'] -> ['Eggs'] with confidence 0.67 means that when yogurt is found in the receipt, the chance of eggs being present amounts to 67%. So, customers are likely to buy eggs with yogurt.

As can be seen below, the number of receipts containing yogurt is 3, while only two of them also contain eggs:

In [None]:
def color_cells(x):
  # Start with an empty style for every cell
  df1 = pd.DataFrame('', index=x.index, columns=x.columns)
  # Yogurt (LHS) appears in column 2 of rows 1, 3 and 4: mark it green
  for row in (1, 3, 4):
    df1.iloc[row, 2] = 'color:green;font-weight:bold'
  # Eggs (RHS) appears in column 0 of rows 1 and 3: mark it red
  for row in (1, 3):
    df1.iloc[row, 0] = 'color:red;font-weight:bold'
  return df1

dataset.style.apply(color_cells, axis=None)

Unnamed: 0,0,1,2,3
0,Bread,Butter,Milk,
1,Eggs,Milk,Yogurt,
2,Bread,Cheese,Eggs,Milk
3,Eggs,Milk,Yogurt,
4,Cheese,Milk,Yogurt,


Row 4 is a counterexample: the receipt contains yogurt but doesn't contain eggs. This lowers the confidence from 1 to $\frac{2}{3} \approx 0.67$.
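
Counterexamples like this one can also be located programmatically. The snippet below is a minimal sketch in plain Python: `counterexamples` is a hypothetical helper written for this tutorial, not a Desbordante function, and the `transactions` list is copied by hand from the table, in the same row order.

```python
# The five receipts from the table, in row order (row indices start at 0).
transactions = [
    {"Bread", "Butter", "Milk"},
    {"Eggs", "Milk", "Yogurt"},
    {"Bread", "Cheese", "Eggs", "Milk"},
    {"Eggs", "Milk", "Yogurt"},
    {"Cheese", "Milk", "Yogurt"},
]

def counterexamples(lhs, rhs, transactions):
    """Indices of rows that contain all of LHS but not all of RHS."""
    lhs, rhs = set(lhs), set(rhs)
    return [i for i, t in enumerate(transactions) if lhs <= t and not rhs <= t]

print(counterexamples({"Yogurt"}, {"Eggs"}, transactions))
print(counterexamples({"Butter"}, {"Bread"}, transactions))
```

A rule with confidence 1 yields an empty list, while every row in a non-empty list lowers the confidence.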

# Third example: minimum support

Let us turn to the next issue. You can observe that a lot of association rules were found in this small dataset. This happens because we did not set a minimum support value. Since the default minimum support is 0, the system discovers all association rules, even those that occur only once in the dataset. Now, let's see the results with *minsup*=0.4 and *minconf*=0.6.

In [None]:
algo.execute(minsup=0.4, minconf=0.6)
ars = algo.get_ars()
print('Total count of ARs:', len(ars))
print('The first 10 ARs:')
for ar in ars[:10]:
  print(ar)

Total count of ARs: 13
The first 10 ARs:
conf: 1.000000	sup: 0.400000	{Bread} -> {Milk}
conf: 1.000000	sup: 0.600000	{Eggs} -> {Milk}
conf: 0.600000	sup: 0.600000	{Milk} -> {Eggs}
conf: 1.000000	sup: 0.600000	{Yogurt} -> {Milk}
conf: 0.600000	sup: 0.600000	{Milk} -> {Yogurt}
conf: 1.000000	sup: 0.400000	{Cheese} -> {Milk}
conf: 0.666667	sup: 0.400000	{Yogurt} -> {Eggs}
conf: 0.666667	sup: 0.400000	{Eggs} -> {Yogurt}
conf: 1.000000	sup: 0.400000	{Eggs, Yogurt} -> {Milk}
conf: 0.666667	sup: 0.400000	{Milk, Yogurt} -> {Eggs}


Now we can see that the number of association rules has decreased significantly. This happened because *minsup* was set to 0.4: ARs with lower support are no longer present in the results. For example, the AR ['Butter'] -> ['Bread'] is gone because its support of 0.2 is too low. For real-world data, a low support means that the pattern occurs too rarely to be meaningful.

# Fourth example: usefulness

A typical approach to controlling the algorithm is to think in terms of "usefulness", which is defined as confidence * support. In the previous example, the chosen thresholds implied a minimum "usefulness" of 0.6 * 0.4 = 0.24.
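
As an illustration of this definition, the sketch below computes "usefulness" for a few of the rules printed in the previous example. The confidence and support values are copied by hand from that output, and the `usefulness` helper is our own name, not a Desbordante function.

```python
# (LHS, RHS, confidence, support), copied from the output of the previous example.
rules = [
    ("Bread", "Milk", 1.0, 0.4),
    ("Eggs", "Milk", 1.0, 0.6),
    ("Milk", "Eggs", 0.6, 0.6),
    ("Yogurt", "Eggs", 2 / 3, 0.4),
]

def usefulness(conf, sup):
    """'Usefulness' as defined above: confidence times support."""
    return conf * sup

# Print the rules from the most to the least "useful".
ranked = sorted(rules, key=lambda r: usefulness(r[2], r[3]), reverse=True)
for lhs, rhs, conf, sup in ranked:
    print(f"{{{lhs}}} -> {{{rhs}}}: {usefulness(conf, sup):.2f}")
```

Every rule that passes both thresholds has a "usefulness" of at least *minconf* * *minsup*, but the reverse does not hold, so the two thresholds together are stricter than a single bound on the product.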

Now, let's try *minsup*=0.6 and *minconf*=0.6, which raises the minimum "usefulness" to 0.6 * 0.6 = 0.36:

In [None]:
algo.execute(minsup=0.6, minconf=0.6)
ars = algo.get_ars()
print('Total count of ARs:', len(ars))
print('The first 10 ARs:')
for ar in ars[:10]:
  print(ar)

Total count of ARs: 4
The first 10 ARs:
conf: 1.000000	sup: 0.600000	{Eggs} -> {Milk}
conf: 0.600000	sup: 0.600000	{Milk} -> {Eggs}
conf: 1.000000	sup: 0.600000	{Yogurt} -> {Milk}
conf: 0.600000	sup: 0.600000	{Milk} -> {Yogurt}


So, now the total number of returned association rules is only four. We reduced the amount of "noisy" information in our output. You are free to play with these parameters to see how they change the results. Eventually, you will find out what fits your dataset and your task best.
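
To see how the two thresholds interact, the rule counts from the four runs can be cross-checked with a brute-force enumeration in plain Python. This is only a sketch for a tiny dataset: it tries every itemset and every split into LHS and RHS, which is not how a real mining algorithm works and would not scale. The `transactions` list is copied by hand from the table, and `mine_rules` is a name introduced here.

```python
from itertools import combinations

# The five receipts from the table, one set of items per receipt.
transactions = [
    frozenset({"Bread", "Butter", "Milk"}),
    frozenset({"Eggs", "Milk", "Yogurt"}),
    frozenset({"Bread", "Cheese", "Eggs", "Milk"}),
    frozenset({"Eggs", "Milk", "Yogurt"}),
    frozenset({"Cheese", "Milk", "Yogurt"}),
]

def count(itemset):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def mine_rules(minsup=0.0, minconf=0.0):
    """Enumerate all rules X -> Y with disjoint, non-empty X and Y."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    rules = []
    for size in range(2, len(items) + 1):
        for itemset in map(frozenset, combinations(items, size)):
            n_all = count(itemset)
            # Skip itemsets that never occur or are too rare.
            if n_all == 0 or n_all / n < minsup:
                continue
            # Try every non-empty proper subset as the LHS.
            for k in range(1, size):
                for lhs in map(frozenset, combinations(sorted(itemset), k)):
                    conf = n_all / count(lhs)
                    if conf >= minconf:
                        rules.append((lhs, itemset - lhs, conf, n_all / n))
    return rules

for minsup, minconf in [(0.0, 1.0), (0.0, 0.6), (0.4, 0.6), (0.6, 0.6)]:
    print(f"minsup={minsup}, minconf={minconf}:", len(mine_rules(minsup, minconf)), "rules")
```

On this dataset the counts should agree with the totals reported by Desbordante in the four examples above.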

# Input formats

In all previous examples we used a "tabular" input format, which means that transactions are passed in a table:

In [None]:
dataset

Unnamed: 0,0,1,2,3
0,Bread,Butter,Milk,
1,Eggs,Milk,Yogurt,
2,Bread,Cheese,Eggs,Milk
3,Eggs,Milk,Yogurt,
4,Cheese,Milk,Yogurt,


Desbordante also supports another input format called "singular". This is a two-column format, where the first column is the identifier of an order (transaction), and the second column is an item that belongs to that order:

In [None]:
singular_dataset = pd.read_csv('rules_book.csv', header=None, keep_default_na=False)
singular_dataset

Unnamed: 0,0,1
0,1,Bread
1,1,Butter
2,3,Cheese
3,2,Eggs
4,1,Milk
5,2,Milk
6,2,Yogurt
7,3,Bread
8,3,Eggs
9,3,Milk


The singular input format is just a different table representation. In our example the above two datasets are equivalent. Let's find some ARs with the singular input format and see if the results are the same:

In [None]:
algo = desbordante.ar.algorithms.Default()
algo.load_data(table=singular_dataset, input_format='singular')
algo.execute(minsup=0.6, minconf=0.6)
new_ars = algo.get_ars()
print('Total count of ARs:', len(new_ars))
print('The first 10 ARs:')
for ar in new_ars[:10]:
  print(ar)

Total count of ARs: 4
The first 10 ARs:
conf: 0.600000	sup: 0.600000	{Milk} -> {Eggs}
conf: 1.000000	sup: 0.600000	{Eggs} -> {Milk}
conf: 1.000000	sup: 0.600000	{Yogurt} -> {Milk}
conf: 0.600000	sup: 0.600000	{Milk} -> {Yogurt}


Let's recall previous results:

In [None]:
print('Total count of ARs:', len(ars))
print('The first 10 ARs:')
for ar in ars[:10]:
  print(ar)

Total count of ARs: 4
The first 10 ARs:
conf: 1.000000	sup: 0.600000	{Eggs} -> {Milk}
conf: 0.600000	sup: 0.600000	{Milk} -> {Eggs}
conf: 1.000000	sup: 0.600000	{Yogurt} -> {Milk}
conf: 0.600000	sup: 0.600000	{Milk} -> {Yogurt}


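As we can see, both runs return the same four rules with the same confidence and support, as expected. Only the order in which the rules are listed differs.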
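
The relationship between the two formats can be illustrated without Desbordante. The sketch below converts the tabular receipts into (order id, item) pairs and back again. The `receipts` list is copied by hand from the tabular dataset, and numbering the orders from 1 is our assumption, chosen to mirror the singular table shown above.

```python
# Tabular form: one row per receipt, copied by hand from the table.
receipts = [
    ["Bread", "Butter", "Milk"],
    ["Eggs", "Milk", "Yogurt"],
    ["Bread", "Cheese", "Eggs", "Milk"],
    ["Eggs", "Milk", "Yogurt"],
    ["Cheese", "Milk", "Yogurt"],
]

# Singular form: one (order id, item) pair per purchased item.
# Order ids start at 1 here (an assumption made for illustration).
singular_rows = [
    (order_id, item)
    for order_id, items in enumerate(receipts, start=1)
    for item in items
]

# Converting back: group the items by order id.
rebuilt = {}
for order_id, item in singular_rows:
    rebuilt.setdefault(order_id, set()).add(item)

original = {i: set(items) for i, items in enumerate(receipts, start=1)}
print(len(singular_rows), "rows in singular form")
print("Same transactions:", rebuilt == original)
```

The order of the rows in the singular form does not matter, since the items are grouped by their order id.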

# Find unique items

The AR mining algorithm has one more feature: it can list all of the unique items in a dataset. Let's try it:

In [None]:
items = algo.get_itemnames()
print('Total number of items:', len(items))
for item in items:
  print(item)

Total number of items: 6
Bread
Butter
Cheese
Eggs
Milk
Yogurt


# Conclusion

If you are reading this, then you have learnt about association rules. Congratulations!

We have explored data and found that people tend to buy milk with eggs. We have also learnt about support and confidence and how they affect the number of meaningful ARs.

If you wish to find these patterns in your data, now you know how to do it 🙂
Also, you can learn more about other pattern types presented in [Desbordante](https://github.com/Desbordante/desbordante-core).