# Market Basket Analysis With Association Rules in Python



Market basket analysis is a statistical technique used to identify relationships between items that are frequently purchased together. It involves examining transaction data to discover patterns and associations among products.

**Key Concepts:**

### **Association Rules:**
These rules describe relationships between items. For example, a rule might be:  
_"If a customer buys bread, they are likely to also buy milk."_

### **Support:**
This measures the frequency of occurrence of an item or itemset in the dataset.
- **Formula:**  
  Support(A) = (Number of transactions containing A) / (Total number of transactions)

### **Confidence:**
This measures the probability of buying item B given that item A has already been bought.
- **Formula:**  
  Confidence(A ⇒ B) = (Number of transactions containing both A and B) / (Number of transactions containing A)

### **Lift:**
This measures the strength of the association between two items, accounting for the individual popularity of each item.
- **Formula:**  
  Lift(A ⇒ B) = Confidence(A ⇒ B) / Support(B)
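
As a quick sanity check of the three formulas, the sketch below computes them by hand on a small made-up list of baskets. The `baskets` data and the helper name `support` are illustrative assumptions and are not part of the groceries dataset used later.

```python
# Toy transactions (hypothetical data, only for illustrating the formulas)
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk", "eggs"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

support_bread = support({"bread"}, baskets)          # 4 of 5 baskets
support_milk = support({"milk"}, baskets)            # 4 of 5 baskets
support_both = support({"bread", "milk"}, baskets)   # 3 of 5 baskets

# Confidence(bread => milk): share of bread baskets that also contain milk
confidence = support_both / support_bread
# Lift(bread => milk): confidence relative to the overall popularity of milk
lift = confidence / support_milk

print(f"support    = {support_both:.2f}")
print(f"confidence = {confidence:.2f}")
print(f"lift       = {lift:.4f}")
```

In this toy data the lift comes out slightly below 1, so bread buyers are marginally less likely than the average customer to buy milk, even though the confidence of the rule looks high.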


In [1]:
from csv import reader

##### We iterate over each line in the input file (`groceries.csv`) and append it to a list called `groceries`

In [3]:
groceries = []
with open('groceries.csv','r') as csvfile:
    csv_read = reader(csvfile)
    for row in csv_read:
        groceries.append(row)

In [4]:
groceries[0:5]

[['citrus fruit', 'semi-finished bread', 'margarine', 'ready soups'],
 ['tropical fruit', 'yogurt', 'coffee'],
 ['whole milk'],
 ['pip fruit', 'yogurt', 'cream cheese', 'meat spreads'],
 ['other vegetables',
  'whole milk',
  'condensed milk',
  'long life bakery product']]

Now that we have imported the transactions into a list, we need to encode them and represent the data in a one-hot encoded format before we can generate frequent itemsets.

In [9]:
from mlxtend.preprocessing import TransactionEncoder

Instantiate an object called `encoder` from the `TransactionEncoder` class.

In [10]:
encoder = TransactionEncoder()

Using the `encoder` object, we call the `fit()` method to extract the unique labels in the transaction set and the `transform()` method to one-hot encode the transactions into a boolean NumPy array.

In [11]:
transactions = encoder.fit(groceries).transform(groceries)

transactions

array([[False, False, False, ..., False, False, False],
       [False, False, False, ..., False,  True, False],
       [False, False, False, ...,  True, False, False],
       ...,
       [False, False, False, ..., False,  True, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False]])
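
To make the two steps concrete, here is a minimal pure-Python sketch of the same idea on a made-up list of baskets. It does not call mlxtend; the names `transactions_demo`, `columns`, and `encoded` are hypothetical, and the sketch only mirrors what "fit" (collect the labels) and "transform" (build one boolean row per transaction) mean conceptually.

```python
# Hypothetical mini transaction list (not the groceries data)
transactions_demo = [
    ["milk", "bread"],
    ["yogurt"],
    ["bread", "yogurt", "coffee"],
]

# "fit": collect the unique item labels, in sorted order
columns = sorted({item for basket in transactions_demo for item in basket})

# "transform": one boolean row per transaction, one column per label
encoded = [[label in basket for label in columns] for basket in transactions_demo]

print(columns)
for row in encoded:
    print(row)
```

Each row has as many entries as there are distinct items, which is why the real array above is 169 columns wide and almost entirely `False`.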

We convert the encoded transactions into a pandas DataFrame.

In [12]:
import pandas as pd

In [13]:
df =  pd.DataFrame(transactions,columns=encoder.columns_)

In [16]:
df

Unnamed: 0,abrasive cleaner,artif. sweetener,baby cosmetics,baby food,bags,baking powder,bathroom cleaner,beef,berries,beverages,...,uht-milk,vinegar,waffles,whipped/sour cream,whisky,white bread,white wine,whole milk,yogurt,zwieback
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,True,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,True,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9830,False,False,False,False,False,False,False,True,False,False,...,False,False,False,True,False,False,False,True,False,False
9831,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
9832,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,True,False
9833,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9835 entries, 0 to 9834
Columns: 169 entries, abrasive cleaner to zwieback
dtypes: bool(169)
memory usage: 1.6 MB


## Generate Frequent Itemsets

In [17]:
from mlxtend.frequent_patterns import apriori

The `apriori` function takes several arguments. The first one is the pandas DataFrame of the transactions we wish to analyze. The second is the minimum support threshold of the itemsets we consider frequent. This value specifies how often an itemset must occur in the transaction set in order to warrant our attention. 

Let’s assume that we only want to focus our attention on itemsets that occur at least $5$ times a day. Given that our data is for $30$ days and our dataset has $9,835$ transactions, this means that we need to set our minimum support threshold to $ 5 \times \frac{30}{9835} \approx 0.015$.
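
The arithmetic behind that threshold can be written out in a few lines. This is just a sketch of the calculation described above; the variable names are hypothetical.

```python
transactions_per_day = 5      # minimum daily occurrences we care about
days = 30                     # length of the observation period
total_transactions = 9835     # number of transactions in the dataset

# An itemset must appear in at least this many transactions overall
min_count = transactions_per_day * days

# Express the same requirement as a fraction of all transactions
min_support = min_count / total_transactions

print(min_count)
print(round(min_support, 4))
```

Rounding the result down to `0.015` makes the threshold slightly more permissive: it admits itemsets that appear in 148 or more transactions rather than exactly 150.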

In [38]:
frequent_itemsets = apriori(df,min_support=0.015,use_colnames=True)

In [39]:
frequent_itemsets

Unnamed: 0,support,itemsets
0,0.017692,(baking powder)
1,0.052466,(beef)
2,0.033249,(berries)
3,0.026029,(beverages)
4,0.080529,(bottled beer)
...,...,...
175,0.023183,"(other vegetables, whole milk, root vegetables)"
176,0.017082,"(other vegetables, tropical fruit, whole milk)"
177,0.022267,"(other vegetables, whole milk, yogurt)"
178,0.015557,"(rolls/buns, whole milk, yogurt)"


In [40]:
frequent_itemsets.sort_values('support',ascending=False)

Unnamed: 0,support,itemsets
71,0.255516,(whole milk)
44,0.193493,(other vegetables)
53,0.183935,(rolls/buns)
60,0.174377,(soda)
72,0.139502,(yogurt)
...,...,...
163,0.015252,"(yogurt, shopping bags)"
179,0.015150,"(tropical fruit, whole milk, yogurt)"
11,0.015048,(canned fish)
45,0.015048,(pasta)


We see that `{whole milk}`, `{other vegetables}`, `{rolls/buns}`, `{soda}`, and `{yogurt}` are the five most frequently bought items in the store.

In [41]:
length = frequent_itemsets['itemsets'].str.len()

Let's break down the code with an example to make it clear:

Assume the `frequent_itemsets` DataFrame looks like this:

| support   | itemsets                        |
|-----------|----------------------------------|
| 0.255516  | (whole milk)                    |
| 0.193493  | (other vegetables, whole milk)  |
| 0.183935  | (rolls/buns, soda, yogurt)      |
| 0.174377  | (soda)                          |

Now, the code:

```python
length = frequent_itemsets['itemsets'].str.len()
```

What it's doing step by step:

**Accessing the 'itemsets' column:** The code looks at the `itemsets` column, which contains sets of items like (whole milk) and (other vegetables, whole milk).

**Applying .str.len():** This operation calculates the length of each itemset (i.e., how many items are in each set).

- For the first itemset (whole milk), the length is 1 (because it contains only one item).
- For the second itemset (other vegetables, whole milk), the length is 2 (because it contains two items).
- For the third itemset (rolls/buns, soda, yogurt), the length is 3 (since it has three items).
- For the fourth itemset (soda), the length is 1 (just one item).


In [43]:
rows = length >2

Now we see the six frequent itemsets with a length greater than $2$.

In [44]:
frequent_itemsets[rows]

Unnamed: 0,support,itemsets
174,0.017895,"(rolls/buns, other vegetables, whole milk)"
175,0.023183,"(other vegetables, whole milk, root vegetables)"
176,0.017082,"(other vegetables, tropical fruit, whole milk)"
177,0.022267,"(other vegetables, whole milk, yogurt)"
178,0.015557,"(rolls/buns, whole milk, yogurt)"
179,0.01515,"(tropical fruit, whole milk, yogurt)"


We can also use the `describe()` method of a pandas DataFrame to get a big picture view of the distribution of values in the data. For example, to get a statistical summary of the support values by itemset length, we do the following:

In [26]:
frequent_itemsets.groupby(length)['support'].describe()

| itemset length | count | mean     | std      | min      | 25%      | 50%      | 75%      | max      |
|----------------|-------|----------|----------|----------|----------|----------|----------|----------|
| 1              | 73.0  | 0.053441 | 0.045956 | 0.015048 | 0.024504 | 0.037112 | 0.06487  | 0.255516 |
| 2              | 101.0 | 0.024799 | 0.010058 | 0.015048 | 0.018404 | 0.021047 | 0.027555 | 0.074835 |
| 3              | 6.0   | 0.018522 | 0.003417 | 0.01515  | 0.015938 | 0.017489 | 0.021174 | 0.023183 |


## Creating Association Rules

In [45]:
from mlxtend.frequent_patterns import association_rules

The `association_rules` function takes several arguments. The first is the frequent itemset. The next is the metric we intend to use to filter the rules for significance. This can either be "*support*", "*confidence*", "*lift*", "*leverage*" or "*conviction*". 

Let's assume that we want to limit our focus to rules that have a confidence of `0.25` or more. To do this, we set the `metric` argument to `"confidence"` and the `min_threshold` argument to `0.25`.

In [46]:
rules = association_rules(frequent_itemsets,metric="confidence",min_threshold=0.25)

In [47]:
rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,(beef),(other vegetables),0.052466,0.193493,0.019725,0.375969,1.943066,0.009574,1.292416,0.512224
1,(beef),(root vegetables),0.052466,0.108998,0.017387,0.331395,3.040367,0.011668,1.332628,0.708251
2,(beef),(whole milk),0.052466,0.255516,0.021251,0.405039,1.585180,0.007845,1.251315,0.389597
3,(bottled beer),(whole milk),0.080529,0.255516,0.020437,0.253788,0.993237,-0.000139,0.997684,-0.007351
4,(bottled water),(soda),0.110524,0.174377,0.028978,0.262190,1.503577,0.009705,1.119017,0.376535
...,...,...,...,...,...,...,...,...,...,...
73,"(rolls/buns, yogurt)",(whole milk),0.034367,0.255516,0.015557,0.452663,1.771563,0.006775,1.360192,0.451027
74,"(whole milk, yogurt)",(rolls/buns),0.056024,0.183935,0.015557,0.277677,1.509648,0.005252,1.129779,0.357630
75,"(tropical fruit, whole milk)",(yogurt),0.042298,0.139502,0.015150,0.358173,2.567516,0.009249,1.340701,0.637483
76,"(tropical fruit, yogurt)",(whole milk),0.029283,0.255516,0.015150,0.517361,2.024770,0.007668,1.542528,0.521384


There are 78 association rules that match our conditions. Each rule has two parts: an "antecedent" (the "if" part) and a "consequent" (the "then" part). For each rule, we have measurements that show how often the antecedent and consequent appear. We also have other measurements that show the support, confidence, lift, leverage, and conviction for each rule.
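
To illustrate how rules are derived from frequent itemsets, the sketch below splits each multi-item itemset into every possible antecedent and consequent and keeps the splits whose confidence meets a threshold. It is a simplified illustration, not mlxtend's implementation: the `supports` values are made up, `generate_rules` is a hypothetical helper, and it assumes the support of every subset of a frequent itemset is available in the dictionary.

```python
from itertools import combinations

# Hypothetical supports for a few frequent itemsets (not the real output)
supports = {
    frozenset({"milk"}): 0.50,
    frozenset({"bread"}): 0.40,
    frozenset({"milk", "bread"}): 0.30,
}

def generate_rules(supports, min_confidence):
    """Split each multi-item itemset into antecedent -> consequent pairs."""
    rules = []
    for itemset, sup in supports.items():
        if len(itemset) < 2:
            continue  # a single item cannot be split into two sides
        for size in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(sorted(itemset), size)):
                consequent = itemset - antecedent
                # Confidence(A => B) = Support(A and B) / Support(A)
                confidence = sup / supports[antecedent]
                if confidence >= min_confidence:
                    rules.append((antecedent, consequent, confidence))
    return rules

demo_rules = generate_rules(supports, min_confidence=0.7)
for antecedent, consequent, confidence in demo_rules:
    print(set(antecedent), "->", set(consequent), round(confidence, 2))
```

Only the rule from bread to milk survives here, because its confidence is higher than that of the reverse rule. The same itemset can therefore yield a rule in one direction but not the other.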

Since our rules are in a pandas DataFrame, we can easily change and filter the data to find what we need. For example, if we only want to see rules where 'rolls/buns' is the only item in the antecedent (the "if" part), we can start by creating a condition to filter for that:

In [52]:
rows = rules['antecedents'] =={'rolls/buns'}

This checks which rows in the `rules` DataFrame have an antecedent exactly equal to `{'rolls/buns'}`. It creates a Boolean series (`rows`) where each value is True if the antecedent in that row is `{'rolls/buns'}`, and False if it is not. This series can then be used to filter or analyze the relevant rows.

In [53]:
rules[rows]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
51,(rolls/buns),(whole milk),0.183935,0.255516,0.056634,0.307905,1.205032,0.009636,1.075696,0.208496


In [54]:
rows = rules['consequents'] =={'rolls/buns'}

This checks which rows in the `rules` DataFrame have a consequent exactly equal to `{'rolls/buns'}`. It creates a Boolean series (`rows`) where each value is True if the consequent in that row is `{'rolls/buns'}`, and False if it is not. This series can then be used to filter or analyze the relevant rows.

In [55]:
rules[rows]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
24,(frankfurter),(rolls/buns),0.058973,0.183935,0.019217,0.325862,1.771616,0.00837,1.210531,0.462839
50,(sausage),(rolls/buns),0.09395,0.183935,0.030605,0.325758,1.771048,0.013324,1.210344,0.480506
74,"(whole milk, yogurt)",(rolls/buns),0.056024,0.183935,0.015557,0.277677,1.509648,0.005252,1.129779,0.35763


In [56]:
rows = rules['antecedents'].astype(str).str.contains('rolls/buns')

##### Explanation:
**rules['antecedents']:** This accesses the 'antecedents' column of the rules DataFrame, which contains the "if" parts of the association rules.

**.astype(str):** This converts each element in the 'antecedents' column to a string. This is necessary because the antecedents are stored as sets rather than text, and string methods such as `.str.contains()` only work on strings.

**.str.contains('rolls/buns'):** This checks if the string 'rolls/buns' is present within each string in the 'antecedents' column. It returns a Boolean series where each value is True if 'rolls/buns' is found in the antecedent, and False if it is not.

**rows =:** The result of this operation is assigned to the variable rows.

##### Statement:
The code checks which rows in the rules DataFrame have an antecedent that contains the substring 'rolls/buns'. It creates a Boolean series (rows) where each value is True if the antecedent includes 'rolls/buns', and False if it does not. This series can be used for filtering or analyzing the relevant rows.

In [57]:
rules[rows]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
51,(rolls/buns),(whole milk),0.183935,0.255516,0.056634,0.307905,1.205032,0.009636,1.075696,0.208496
62,"(rolls/buns, other vegetables)",(whole milk),0.042603,0.255516,0.017895,0.420048,1.643919,0.00701,1.283699,0.409128
63,"(rolls/buns, whole milk)",(other vegetables),0.056634,0.193493,0.017895,0.315978,1.633026,0.006937,1.179067,0.410912
72,"(rolls/buns, whole milk)",(yogurt),0.056634,0.139502,0.015557,0.274686,1.969049,0.007656,1.18638,0.521686
73,"(rolls/buns, yogurt)",(whole milk),0.034367,0.255516,0.015557,0.452663,1.771563,0.006775,1.360192,0.451027
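
One caveat with the substring approach is that it matches any item whose name contains the search text: a search for `'milk'`, for instance, would match both 'whole milk' and 'condensed milk'. A possible alternative, sketched below on a tiny stand-in DataFrame, is to test for set membership instead. The `rules_demo` rows and the `mask` name are hypothetical and used only for illustration.

```python
import pandas as pd

# Tiny stand-in for the `rules` DataFrame (hypothetical rows)
rules_demo = pd.DataFrame({
    "antecedents": [
        frozenset({"rolls/buns"}),
        frozenset({"rolls/buns", "yogurt"}),
        frozenset({"whole milk"}),
    ],
    "consequents": [
        frozenset({"whole milk"}),
        frozenset({"whole milk"}),
        frozenset({"yogurt"}),
    ],
})

# True only where 'rolls/buns' is an actual member of the antecedent set
mask = rules_demo["antecedents"].apply(lambda items: "rolls/buns" in items)

print(rules_demo[mask])
```

Because the test is on whole items rather than on text, it cannot be fooled by partial name matches.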


We can also filter our rules by the length of the antecedent or consequent. For example, to match only rules with an antecedent length of more than `1`, we do the following:

In [60]:
rows = rules['antecedents'].str.len()>1
rules[rows]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
62,"(rolls/buns, other vegetables)",(whole milk),0.042603,0.255516,0.017895,0.420048,1.643919,0.00701,1.283699,0.409128
63,"(rolls/buns, whole milk)",(other vegetables),0.056634,0.193493,0.017895,0.315978,1.633026,0.006937,1.179067,0.410912
64,"(other vegetables, whole milk)",(root vegetables),0.074835,0.108998,0.023183,0.309783,2.842082,0.015026,1.2909,0.700572
65,"(other vegetables, root vegetables)",(whole milk),0.047382,0.255516,0.023183,0.48927,1.914833,0.011076,1.457687,0.501524
66,"(whole milk, root vegetables)",(other vegetables),0.048907,0.193493,0.023183,0.474012,2.44977,0.013719,1.53332,0.62223
67,"(other vegetables, tropical fruit)",(whole milk),0.035892,0.255516,0.017082,0.475921,1.862587,0.007911,1.420556,0.480353
68,"(tropical fruit, whole milk)",(other vegetables),0.042298,0.193493,0.017082,0.403846,2.08714,0.008898,1.352851,0.54388
69,"(other vegetables, whole milk)",(yogurt),0.074835,0.139502,0.022267,0.297554,2.132979,0.011828,1.225003,0.574138
70,"(other vegetables, yogurt)",(whole milk),0.043416,0.255516,0.022267,0.512881,2.007235,0.011174,1.52834,0.524577
71,"(whole milk, yogurt)",(other vegetables),0.056024,0.193493,0.022267,0.397459,2.054131,0.011427,1.338511,0.543633


We can also filter our rules based on the values in any of the numeric columns. For example, let's assume that we only want to see rules that have a lift of more than `2`, a leverage score more than `0.01` and a conviction score of more than `1.4`. This can be written as follows:

In [62]:
rows =((rules['lift']>2) & (rules['leverage']>0.01) & (rules['conviction']>1.4))

rules[rows]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
39,(root vegetables),(other vegetables),0.108998,0.193493,0.047382,0.434701,2.246605,0.026291,1.426693,0.622764
66,"(whole milk, root vegetables)",(other vegetables),0.048907,0.193493,0.023183,0.474012,2.44977,0.013719,1.53332,0.62223
70,"(other vegetables, yogurt)",(whole milk),0.043416,0.255516,0.022267,0.512881,2.007235,0.011174,1.52834,0.524577


## Evaluate Association Rules

A quick way to get a big-picture view of the metrics is with summary statistics. We do this by calling the `describe()` method of the `rules` DataFrame:

In [63]:
rules.describe()

Unnamed: 0,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
count,78.0,78.0,78.0,78.0,78.0,78.0,78.0,78.0
mean,0.073186,0.211041,0.025578,0.360834,1.763158,0.010156,1.245113,0.440231
std,0.036738,0.046866,0.012045,0.070495,0.377079,0.004766,0.115688,0.124659
min,0.029283,0.104931,0.015048,0.253714,0.993237,-0.000139,0.997684,-0.007351
25%,0.052466,0.193493,0.017895,0.308374,1.504669,0.00697,1.166832,0.360102
50%,0.06121,0.193493,0.021912,0.354567,1.740032,0.00926,1.226608,0.451502
75%,0.082766,0.255516,0.028876,0.405608,1.942669,0.011788,1.294081,0.514002
max,0.255516,0.255516,0.074835,0.517361,3.040367,0.026291,1.542528,0.708251


The summary statistics give us important information, including the average, standard deviation, minimum, maximum, and some percentiles for the metrics of the association rules. From this summary, we see that the average rule has a lift of about 1.76, with lift values ranging from 0.99 to 3.04.

Lift compares how often the antecedent and consequent occur together with how often they would be expected to occur together if they were independent. In other words, lift shows the strength of their relationship. Lift values can range from 0 to infinity, with a value of 1 indicating that the antecedent and consequent are independent of each other. Now, let’s look at the top 5 rules based on lift.

In [64]:
rows = rules.sort_values('lift',ascending=False).head()

In [66]:
rows

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
1,(beef),(root vegetables),0.052466,0.108998,0.017387,0.331395,3.040367,0.011668,1.332628,0.708251
64,"(other vegetables, whole milk)",(root vegetables),0.074835,0.108998,0.023183,0.309783,2.842082,0.015026,1.2909,0.700572
77,"(whole milk, yogurt)",(tropical fruit),0.056024,0.104931,0.01515,0.270417,2.577089,0.009271,1.226823,0.648285
47,(pip fruit),(tropical fruit),0.075648,0.104931,0.020437,0.270161,2.574648,0.012499,1.226392,0.66165
75,"(tropical fruit, whole milk)",(yogurt),0.042298,0.139502,0.01515,0.358173,2.567516,0.009249,1.340701,0.637483


The first rule has a lift score of 3.04. This means that customers who buy beef are 3.04 times as likely to buy root vegetables as customers in general. Remember, lift values above 1 suggest a greater likelihood of buying together, while values below 1 suggest a lower likelihood.
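
As a check on the lift formula from the key concepts, the value for the beef rule can be reproduced from the other columns of its row. The numbers below are copied from the table output; the variable names are hypothetical.

```python
# Values copied from the (beef) -> (root vegetables) row
confidence = 0.331395
consequent_support = 0.108998   # support of {root vegetables}

# Lift(A => B) = Confidence(A => B) / Support(B)
lift = confidence / consequent_support

print(round(lift, 2))
```

The result agrees with the `lift` column up to the rounding of the displayed inputs.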

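Leverage is related to lift, but instead of a ratio it measures the difference between how often the antecedent and consequent occur together and how often they would occur together if they were independent. Leverage values range from -1 to 1, where a value of 0 means the antecedent and consequent are independent of each other. Now, let’s look at the top 5 rules based on leverage.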

In [67]:
rows = rules.sort_values('leverage',ascending=False).head()
rows

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
39,(root vegetables),(other vegetables),0.108998,0.193493,0.047382,0.434701,2.246605,0.026291,1.426693,0.622764
43,(other vegetables),(whole milk),0.193493,0.255516,0.074835,0.386758,1.513634,0.025394,1.214013,0.42075
44,(whole milk),(other vegetables),0.255516,0.193493,0.074835,0.292877,1.513634,0.025394,1.140548,0.455803
52,(root vegetables),(whole milk),0.108998,0.255516,0.048907,0.448694,1.756031,0.021056,1.350401,0.483202
61,(yogurt),(whole milk),0.139502,0.255516,0.056024,0.401603,1.571735,0.020379,1.244132,0.422732


The first rule has a leverage score of 0.026. This suggests that customers who buy root vegetables are also likely to buy other vegetables, which makes sense. However, the second and third rules pair other vegetables with whole milk in both directions, and their lift of about 1.51 says that buying one makes a customer roughly 1.5 times as likely (about 50% more likely) to buy the other, which seems questionable. Rules involving popular items like whole milk can be misleading, so it's important to consider the conviction of the association rules as well.
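
The leverage of the first rule can be reproduced as the observed joint support minus the support expected under independence. The inputs are copied from the table output; the variable names are hypothetical.

```python
# Values copied from the (root vegetables) -> (other vegetables) row
support_ab = 0.047382            # support of both itemsets together
antecedent_support = 0.108998    # support of {root vegetables}
consequent_support = 0.193493    # support of {other vegetables}

# Joint support we would expect if the two sides were independent
expected_if_independent = antecedent_support * consequent_support

# Leverage is the gap between observed and expected joint support
leverage = support_ab - expected_if_independent

print(round(leverage, 3))
```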

**Conviction** measures how much the consequent depends on the antecedent. It is related to lift but differs in that it is sensitive to the direction of the rule. This means that, in general, $\text{Conviction}_{A \rightarrow B} \neq \text{Conviction}_{B \rightarrow A}$. Conviction values range from 0 to infinity, with a value of 1 indicating that the antecedent and consequent are independent. Now, let’s look at the top 5 rules based on conviction.


In [68]:
rules.sort_values('conviction',ascending=False).head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
76,"(tropical fruit, yogurt)",(whole milk),0.029283,0.255516,0.01515,0.517361,2.02477,0.007668,1.542528,0.521384
66,"(whole milk, root vegetables)",(other vegetables),0.048907,0.193493,0.023183,0.474012,2.44977,0.013719,1.53332,0.62223
70,"(other vegetables, yogurt)",(whole milk),0.043416,0.255516,0.022267,0.512881,2.007235,0.011174,1.52834,0.524577
9,(butter),(whole milk),0.055414,0.255516,0.027555,0.497248,1.946053,0.013395,1.480817,0.514659
19,(curd),(whole milk),0.053279,0.255516,0.026131,0.490458,1.919481,0.012517,1.461085,0.505984


The first rule has a conviction of $1.54$. This means that the rule $\{\text{tropical fruit, yogurt}\} \rightarrow \{\text{whole milk}\}$ would be incorrect $54\%$ more often (or $1.54$ times as often) if the consequent was independent of the antecedent. The higher the conviction, the more likely it is that the consequent is dependent on the antecedent.
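
The notebook does not spell out the conviction formula, so the sketch below uses the standard definition, $(1 - \text{Support}_{B}) / (1 - \text{Confidence}_{A \rightarrow B})$, to reproduce the value for the first rule. The inputs are copied from the table output; the variable names are hypothetical.

```python
# Values copied from the (tropical fruit, yogurt) -> (whole milk) row
consequent_support = 0.255516   # support of {whole milk}
confidence = 0.517361

# How often the rule would fail if the two sides were independent,
# divided by how often it actually fails
conviction = (1 - consequent_support) / (1 - confidence)

print(round(conviction, 2))
```

Unlike lift, this ratio changes when the antecedent and consequent are swapped, because both the confidence and the consequent support change.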

Besides lift, leverage, and conviction, **Zhang's Metric** is another useful metric that we should also take into consideration when evaluating rules. It ranges in value from $-1$ to $1$, which represent perfect dissociation and perfect association respectively. Zhang's metric is useful in identifying items that should not be placed next to each other, even if they have been purchased together previously. The `association_rules` output above already includes it in the `zhangs_metric` column, but computing it by hand is a useful way to see how it works. It is calculated as follows:

$$ \text{Zhang}_{A \rightarrow B} = \frac{\text{Support}_{A \rightarrow B} - (\text{Support}_{A} \times \text{Support}_{B})}{\text{max}\{[\text{Support}_{A \rightarrow B} \times (1 - \text{Support}_{A})], [\text{Support}_{A} \times (\text{Support}_{B} - \text{Support}_{A \rightarrow B})]\}}$$

Where $\text{Support}_{A \rightarrow B}$ is the support of the rule, $\text{Support}_{A}$ is the antecedent support and $\text{Support}_{B}$ is the consequent support.

We can add Zhang's metric to our `rules` DataFrame by first creating a function that calculates it:

In [71]:
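import numpy as np

def zhang_metric(rules):
    """Compute Zhang's metric for every row of the rules DataFrame."""
    sup = rules['support']
    sup_a = rules['antecedent support']
    sup_b = rules['consequent support']
    # Numerator: observed joint support minus the support expected under independence
    num = sup - sup_a * sup_b
    # Denominator: element-wise maximum of the two normalising terms
    denom = np.maximum(sup * (1 - sup_a), sup_a * (sup_b - sup))
    return num / denom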


In [72]:
rules['zhang'] = zhang_metric(rules)

rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric,zhang
0,(beef),(other vegetables),0.052466,0.193493,0.019725,0.375969,1.943066,0.009574,1.292416,0.512224,0.512224
1,(beef),(root vegetables),0.052466,0.108998,0.017387,0.331395,3.040367,0.011668,1.332628,0.708251,0.708251
2,(beef),(whole milk),0.052466,0.255516,0.021251,0.405039,1.585180,0.007845,1.251315,0.389597,0.389597
3,(bottled beer),(whole milk),0.080529,0.255516,0.020437,0.253788,0.993237,-0.000139,0.997684,-0.007351,-0.007351
4,(bottled water),(soda),0.110524,0.174377,0.028978,0.262190,1.503577,0.009705,1.119017,0.376535,0.376535
...,...,...,...,...,...,...,...,...,...,...,...
73,"(rolls/buns, yogurt)",(whole milk),0.034367,0.255516,0.015557,0.452663,1.771563,0.006775,1.360192,0.451027,0.451027
74,"(whole milk, yogurt)",(rolls/buns),0.056024,0.183935,0.015557,0.277677,1.509648,0.005252,1.129779,0.357630,0.357630
75,"(tropical fruit, whole milk)",(yogurt),0.042298,0.139502,0.015150,0.358173,2.567516,0.009249,1.340701,0.637483,0.637483
76,"(tropical fruit, yogurt)",(whole milk),0.029283,0.255516,0.015150,0.517361,2.024770,0.007668,1.542528,0.521384,0.521384


In [73]:
rules.sort_values('zhang',ascending=False).head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric,zhang
1,(beef),(root vegetables),0.052466,0.108998,0.017387,0.331395,3.040367,0.011668,1.332628,0.708251,0.708251
64,"(other vegetables, whole milk)",(root vegetables),0.074835,0.108998,0.023183,0.309783,2.842082,0.015026,1.2909,0.700572,0.700572
47,(pip fruit),(tropical fruit),0.075648,0.104931,0.020437,0.270161,2.574648,0.012499,1.226392,0.66165,0.66165
77,"(whole milk, yogurt)",(tropical fruit),0.056024,0.104931,0.01515,0.270417,2.577089,0.009271,1.226823,0.648285,0.648285
75,"(tropical fruit, whole milk)",(yogurt),0.042298,0.139502,0.01515,0.358173,2.567516,0.009249,1.340701,0.637483,0.637483


The first rule has a zhang metric score of $0.708$. This indicates a pretty strong positive association between beef and root vegetables. This tells us that if we were to separate beef from root vegetables in our store, there could be an impact to how much of both are purchased. In other words, pairing beef and root vegetables for promotional purposes is a good choice.

Looking at rules that have a low zhang metric is also very useful. Let's take a look at the bottom $5$ rules by the zhang metric: 

In [74]:
rules.sort_values('zhang').head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric,zhang
3,(bottled beer),(whole milk),0.080529,0.255516,0.020437,0.253788,0.993237,-0.000139,0.997684,-0.007351,-0.007351
5,(bottled water),(whole milk),0.110524,0.255516,0.034367,0.310948,1.21694,0.006126,1.080446,0.200417,0.200417
51,(rolls/buns),(whole milk),0.183935,0.255516,0.056634,0.307905,1.205032,0.009636,1.075696,0.208496,0.208496
54,(sausage),(whole milk),0.09395,0.255516,0.029893,0.318182,1.245252,0.005887,1.09191,0.217372,0.217372
16,(coffee),(whole milk),0.058058,0.255516,0.018709,0.322242,1.261141,0.003874,1.098451,0.21983,0.21983


The first rule has a zhang metric score of $-0.007$. This value is very close to zero, which indicates that bottled beer and whole milk are almost independent, with only a slight dissociation. This tells us that if we were to separate bottled beer from whole milk in the store, there would likely not be an appreciable impact on purchase patterns for both items. This means that there is little to gain from pairing these two items together for promotional purposes.