# Objective

To understand the key concepts of association rule mining using the `mlxtend` package.

In [1]:
! pip install mlxtend


In [2]:
import random

import pandas as pd

from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

In [3]:
random.seed(20130810)

# Dataset

We create a synthetic transaction dataset by randomly sampling baskets of 2 to 5 items from those available for sale in a grocery store.

In [4]:
grocery_items = ['toor dal', 
                 'rice', 
                 'bread',
                 'ghee',
                 'sooji',
                 'jeera',
                 'salt',
                 'poha',
                 'cake mix',
                 'rava',
                 'wheat flour',
                 'all purpose flour',
                 'urad dal',
                 'pigeon peas',
                 'kidney beans',
                 'masoor dal']

In [5]:
MAX_LEN_BASKET = 5
NUM_TRANSACTIONS = 10000

In [6]:
dataset = [random.sample(grocery_items, k=random.choice(range(2, MAX_LEN_BASKET+1)))
           for _ in range(NUM_TRANSACTIONS)]

In [7]:
dataset[0:5]

[['poha', 'pigeon peas', 'kidney beans', 'rava'],
 ['all purpose flour', 'jeera', 'salt'],
 ['kidney beans', 'ghee', 'all purpose flour', 'pigeon peas', 'rice'],
 ['toor dal', 'rice'],
 ['sooji', 'pigeon peas', 'ghee', 'rice']]

This dataset can now be encoded as a one-hot boolean DataFrame (one row per transaction, one column per item), as described in the [manual for the mlxtend package](http://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/).

In [8]:
te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
df = pd.DataFrame(te_ary, columns=te.columns_)
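
To make the encoding step concrete, here is a minimal pure-Python sketch of what a one-hot transaction encoding produces. It is not mlxtend's implementation; the helper name `one_hot_encode` and the toy baskets are hypothetical and exist only for illustration.

```python
def one_hot_encode(transactions):
    """Return (columns, rows): sorted item names and one boolean row per transaction."""
    # Collect every distinct item and sort it to get a stable column order
    columns = sorted({item for basket in transactions for item in basket})
    # One row per basket: True where the basket contains the column's item
    rows = [[item in set(basket) for item in columns] for basket in transactions]
    return columns, rows


# Toy baskets (hypothetical), not the notebook's dataset
toy_transactions = [['toor dal', 'rice'], ['rice', 'ghee', 'salt'], ['salt', 'toor dal']]
columns, rows = one_hot_encode(toy_transactions)
print(columns)
for row in rows:
    print(row)
```

Each row has the same width regardless of basket size, which is the shape `apriori` expects as input.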

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 16 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   all purpose flour  10000 non-null  bool 
 1   bread              10000 non-null  bool 
 2   cake mix           10000 non-null  bool 
 3   ghee               10000 non-null  bool 
 4   jeera              10000 non-null  bool 
 5   kidney beans       10000 non-null  bool 
 6   masoor dal         10000 non-null  bool 
 7   pigeon peas        10000 non-null  bool 
 8   poha               10000 non-null  bool 
 9   rava               10000 non-null  bool 
 10  rice               10000 non-null  bool 
 11  salt               10000 non-null  bool 
 12  sooji              10000 non-null  bool 
 13  toor dal           10000 non-null  bool 
 14  urad dal           10000 non-null  bool 
 15  wheat flour        10000 non-null  bool 
dtypes: bool(16)
memory usage: 156.4 KB


We will use the dataframe `df` for the rest of the notebook.

## All itemsets with a minimum support of 1% using the `apriori` function from `mlxtend`.

In [10]:
frequent_itemsets = apriori(df, 
                            min_support=0.01, 
                            use_colnames=True)

In [11]:
frequent_itemsets

| | support | itemsets |
|---|---|---|
| 0 | 0.2273 | (all purpose flour) |
| 1 | 0.2185 | (bread) |
| 2 | 0.2152 | (cake mix) |
| 3 | 0.2211 | (ghee) |
| 4 | 0.2187 | (jeera) |
| ... | ... | ... |
| 131 | 0.0434 | (sooji, urad dal) |
| 132 | 0.0417 | (sooji, wheat flour) |
| 133 | 0.0390 | (urad dal, toor dal) |
| 134 | 0.0423 | (wheat flour, toor dal) |
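
Because every basket is a uniform random sample, the expected support of an itemset can be worked out in closed form. This helps explain why the table above shows single items with support near 0.22 and pairs with support near 0.04. The sketch below assumes the sampling scheme from the dataset cell: basket size drawn uniformly from 2 to `MAX_LEN_BASKET`, and items drawn without replacement from the 16 grocery items. The helper name `expected_support` is hypothetical.

```python
from fractions import Fraction
from math import comb

NUM_ITEMS = 16              # len(grocery_items)
BASKET_SIZES = range(2, 6)  # 2 .. MAX_LEN_BASKET, each equally likely


def expected_support(m):
    """Expected support of one specific m-item itemset under uniform sampling."""
    # Given a basket of size k, a fixed m-item set is contained in it
    # with probability C(k, m) / C(NUM_ITEMS, m).
    per_size = [Fraction(comb(k, m), comb(NUM_ITEMS, m)) for k in BASKET_SIZES]
    return sum(per_size) / len(BASKET_SIZES)


for m in (1, 2, 3):
    print(m, float(expected_support(m)))
```

Under these assumptions, a three-item set has an expected support of roughly 0.0067. That is below the 1% threshold but above 0.5%, so three-item sets would only be expected to appear once the threshold is lowered.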


## Of all the itemsets with a minimum support of 2%, how many have a length equal to 2? (Use the `apriori` function from `mlxtend`)

In [12]:
frequent_itemsets = apriori(df, 
                            min_support=0.02, 
                            use_colnames=True)

In [13]:
frequent_itemsets.head()

| | support | itemsets |
|---|---|---|
| 0 | 0.2273 | (all purpose flour) |
| 1 | 0.2185 | (bread) |
| 2 | 0.2152 | (cake mix) |
| 3 | 0.2211 | (ghee) |
| 4 | 0.2187 | (jeera) |


In [14]:
filter_condition = (frequent_itemsets.itemsets.apply(len) == 2)

In [15]:
frequent_itemsets[filter_condition]

| | support | itemsets |
|---|---|---|
| 16 | 0.0455 | (bread, all purpose flour) |
| 17 | 0.0409 | (cake mix, all purpose flour) |
| 18 | 0.0415 | (all purpose flour, ghee) |
| 19 | 0.0434 | (all purpose flour, jeera) |
| 20 | 0.0420 | (all purpose flour, kidney beans) |
| ... | ... | ... |
| 131 | 0.0434 | (sooji, urad dal) |
| 132 | 0.0417 | (sooji, wheat flour) |
| 133 | 0.0390 | (urad dal, toor dal) |
| 134 | 0.0423 | (wheat flour, toor dal) |
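
To answer the question in the heading with a number, the boolean mask can be summed with `filter_condition.sum()`, since each `True` counts as 1. With 16 items there can be at most C(16, 2) = 120 pairs. The same counting idea is sketched below in pure Python on a toy dataset. It enumerates candidates by brute force rather than with Apriori-style pruning, and the function name `frequent_itemsets_bruteforce` and the toy baskets are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets_bruteforce(transactions, min_support, max_len=3):
    """Enumerate every itemset up to max_len and keep those meeting min_support."""
    baskets = [frozenset(t) for t in transactions]
    items = sorted(set().union(*baskets))
    result = {}
    for size in range(1, max_len + 1):
        for candidate in combinations(items, size):
            # Support = fraction of baskets that contain every item in the candidate
            support = sum(set(candidate) <= b for b in baskets) / len(baskets)
            if support >= min_support:
                result[frozenset(candidate)] = support
    return result


# Toy baskets (hypothetical), not the notebook's dataset
toy_transactions = [
    ['rice', 'ghee'],
    ['rice', 'salt'],
    ['rice', 'ghee', 'salt'],
    ['ghee', 'salt'],
]
itemsets = frequent_itemsets_bruteforce(toy_transactions, min_support=0.5)
# Group the frequent itemsets by their length and report how many are pairs
count_by_length = Counter(len(s) for s in itemsets)
print(count_by_length[2])
```

Brute-force enumeration is only practical for small item catalogues; it is used here because it keeps the definition of support visible.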


## All association rules with a minimum support of 0.5% and a minimum confidence of 0.2

In [16]:
frequent_itemsets = apriori(df, 
                            min_support=0.005, 
                            use_colnames=True)

In [17]:
rules = association_rules(frequent_itemsets, 
                          metric="confidence", 
                          min_threshold=0.2)

In [18]:
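# Keep only rules with lift >= 1, i.e. rules whose items co-occur
# at least as often as expected if they were independent
filter_condition = (rules.lift >= 1)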

In [19]:
rules[filter_condition]

| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction |
|---|---|---|---|---|---|---|---|---|---|
| 35 | (cake mix, kidney beans) | (toor dal) | 0.0391 | 0.2142 | 0.0086 | 0.219949 | 1.026839 | 0.000225 | 1.00737 |
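
The metric columns in this row can be re-derived from the three support values alone, which is a useful check on what each metric means. The sketch below uses the standard definitions of confidence, lift, leverage and conviction; the variable names are chosen for illustration, and the input values are copied from the row above.

```python
# Values copied from the rule (cake mix, kidney beans) -> (toor dal)
antecedent_support = 0.0391  # share of baskets containing cake mix and kidney beans
consequent_support = 0.2142  # share of baskets containing toor dal
support = 0.0086             # share of baskets containing all three items

# confidence: how often the consequent appears when the antecedent is present
confidence = support / antecedent_support
# lift: confidence relative to the consequent's baseline frequency
lift = confidence / consequent_support
# leverage: observed joint support minus the support expected under independence
leverage = support - antecedent_support * consequent_support
# conviction: how much more often the rule would fail if the two sides were independent
conviction = (1 - consequent_support) / (1 - confidence)

print(round(confidence, 6), round(lift, 6), round(leverage, 6), round(conviction, 5))
```

A lift this close to 1 means the antecedent and consequent appear together about as often as they would if they were independent, which is consistent with the baskets having been generated at random.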
