# <center> Association Rules Mining </center>

___

Association rules mining is an important machine learning technique, best known for its use in market basket analysis, although that is not its only use case.

For example, in a store all vegetables are placed in the same aisle, all dairy items are placed together, and cosmetics form another such group. Investing time and resources in deliberate product placement like this not only reduces a customer’s shopping time but also reminds customers of related items they might be interested in buying, which helps stores cross-sell. Association rules help uncover such relationships between items in huge databases.

Association rules mining is an unsupervised technique for uncovering patterns or relationships between items. A rule expresses the association between A and B as A => B, i.e. if A is purchased, B is likely to be purchased as well.

An association rule consists of an antecedent and a consequent.

$$\{Pen, Pencil\} \to \{Paper\}$$
$$antecedent \to consequent$$

For a given rule, the `itemset` is the set of all the items in the antecedent and the consequent.

$$itemset = \{Pen, Pencil, Paper\}$$

The goodness of an association rule is measured using three primary measures: support, confidence and lift.


**Support**

This measure gives an idea of how frequent an itemset is in all the transactions. 

- Consider itemset1 = {bread} and itemset2 = {shampoo}. There will be far more transactions containing bread than transactions containing shampoo, so itemset1 will generally have a higher support than itemset2.

- Now consider itemset1 = {bread, butter} and itemset2 = {bread, shampoo}. Many transactions will have both bread and butter in the cart, but bread and shampoo? Not so many. So in this case too, itemset1 will generally have a higher support than itemset2.

Mathematically, support is the fraction of the total number of transactions in which the itemset occurs.

$$
Support(\{A\} \to \{B\}) = \frac{Transactions\ containing\ both\ A\ and\ B}{Total\ number\ of\ transactions}
$$

The value of `support` helps us identify the rules worth considering for further analysis.

For example, one might want to consider only the itemsets that occur at least 50 times out of a total of 10,000 transactions, i.e. support ≥ 0.005.

If an itemset has a very low support, we do not have enough information about the relationship between its items, and hence no conclusions can be drawn from such a rule.
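As a minimal sketch of the definition above, support can be computed directly from a list of transactions. The five baskets and the `support` helper below are made up for illustration and are not part of the lab dataset.

```python
# Toy transactions (hypothetical data, for illustration only)
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "shampoo"},
    {"milk", "butter"},
    {"bread", "milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

support_bread_butter = support({"bread", "butter"}, transactions)
support_bread_shampoo = support({"bread", "shampoo"}, transactions)
print(support_bread_butter, support_bread_shampoo)
```

In this toy data {bread, butter} appears in more baskets than {bread, shampoo}, so it gets the higher support, in line with the intuition above.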

**Confidence**

This measure defines the likelihood of the consequent being in the cart given that the cart already contains the antecedent.

For example, it answers the question: of all the transactions containing, say, {Kellogg's Corn Flakes}, how many also contained {Milk}?

Common knowledge suggests that {Kellogg's Corn Flakes} → {Milk} should be a high confidence rule. Technically, confidence is the conditional probability of the consequent occurring given the antecedent.


$$
Confidence(\{A\} \to \{B\}) = \frac{Transactions\ containing\ both\ A\ and\ B}{Transactions\ containing\ A}
$$

Consider a few more examples before moving ahead.

- What do you think the confidence for {Butter} → {Bread} would be? 
   That is, what fraction of transactions containing butter also contained bread? Very high, i.e. a value close to 1? That’s right. 
   
- What about {Yogurt} → {Milk}? High again. And {Toothbrush} → {Milk}? Not so sure? The confidence for this rule will also be high, because {Milk} is such a frequent itemset that it appears in most transactions.

*For such a frequent consequent, it hardly matters what the antecedent is. The confidence of an association rule with a very frequent consequent will tend to be high.*

![Confidence](confidence.png)

<i><center>Total transactions = 100.</center></i>
<i><center>10 of them have both milk and toothbrush, 70 have milk but no toothbrush and 4 have toothbrush but no milk.</center></i>

Consider the numbers from the above figure. `Confidence` for $\{Toothbrush\} \to \{Milk\}$ will be 10/(10+4) ≈ 0.71. This looks like a high confidence value, but we know intuitively that these two products have a weak association, so there is something misleading about it. Lift is introduced to overcome this challenge.

A very high confidence means that when A is purchased, the probability of B also being purchased is very high. Taken alone, though, it does not tell us whether A makes B any more likely than it already was.

**`Considering just the value of confidence limits our capability to make any business inference.`**
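The toothbrush and milk numbers can be checked with a few lines of arithmetic. This is only a sketch: the variable names are assumptions, and the counts are the ones given in the figure caption.

```python
# Counts read from the figure caption (100 transactions in total)
total = 100
both = 10             # milk and toothbrush
milk_only = 70        # milk, no toothbrush
toothbrush_only = 4   # toothbrush, no milk

confidence = both / (both + toothbrush_only)   # P(Milk | Toothbrush)
milk_support = (both + milk_only) / total      # P(Milk), the base rate of milk

print(f"confidence = {confidence:.3f}, support(Milk) = {milk_support:.2f}")
```

Comparing the two values shows why confidence alone can mislead: the confidence of the rule is lower than the share of all transactions that contain milk anyway.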

**Lift**

Lift controls for the support (frequency) of consequent while calculating the conditional probability of occurrence of {B} given {A}. 

Lift is a very literal term given to this measure. Think of it as the **`lift`** that {A} provides to our confidence for having {B} on the cart. To rephrase, lift is the rise in probability of having {B} on the cart with the knowledge of {A} being present over the probability of having {B} on the cart without any knowledge about presence of {A}. Mathematically,

$$
Lift(\{A\} \to \{B\}) = \frac{\left( \frac{Transactions\ containing\ both\ A\ and\ B}{Transactions\ containing\ A} \right)}{Fraction\ of\ transactions\ containing\ B}
$$

In cases where {A} actually leads to {B} being in the cart, the value of lift will be greater than 1. 

Let us understand this with an example that continues the {Toothbrush} → {Milk} rule.

Probability of having milk in the cart given that a toothbrush is present (i.e. confidence): 10/(10+4) ≈ 0.71

Now, to put this number in perspective, consider the probability of having milk in the cart without any knowledge about the toothbrush: 80/100 = 0.8

These numbers show that having a toothbrush in the cart actually reduces the probability of having milk in the cart, from 0.8 to about 0.71. This corresponds to a lift of 0.71/0.8 ≈ 0.89. Now that’s more like the real picture. A lift of less than 1 shows that having a toothbrush in the cart does not increase the chances of milk being in the cart, in spite of the rule showing a high confidence value. A lift greater than 1 indicates a positive association between {A} and {B}. The higher the lift, the more likely the customer is to buy {B} given that they have already bought {A}. Lift is the measure that helps store managers decide product placement in the aisles.

A lift well above 1 means that B is purchased together with A far more often than B's overall frequency would suggest, i.e. the rule is strong.
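As a worked check of the toothbrush and milk example, lift can equivalently be written as the joint support divided by the product of the individual supports. Here A = Toothbrush and B = Milk, the counts come from the figure, and the unrounded confidence 10/14 is carried through the calculation:

$$
Lift(\{A\} \to \{B\}) = \frac{Support(\{A, B\})}{Support(\{A\}) \times Support(\{B\})} = \frac{10/100}{(14/100) \times (80/100)} \approx 0.89
$$

This form is symmetric in A and B, so {Milk} → {Toothbrush} has the same lift as {Toothbrush} → {Milk}, even though the two rules have different confidence values.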

***`Ideally, we look for rules that have sufficient support, high confidence and high lift.`***


**Association Rules Mining**

Now that we understand how to quantify the strength of association between products within an itemset, the next step is to generate rules from the entire list of items and identify the most important ones. This is not as simple as it might sound. Supermarkets have thousands of different products in store, and for d items there are ${2}^{d}$ possible itemsets, so the number of candidates grows exponentially with the number of items. Computing lift for every one of them quickly becomes computationally very expensive. How do we deal with this problem and arrive at a set of the most important association rules? The **`Apriori`** algorithm comes to our rescue.

We will see the Apriori algorithm as part of our activity.
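The key idea behind Apriori is that every subset of a frequent itemset must itself be frequent, so candidates containing an infrequent subset can be discarded without counting them. The sketch below is a simplified, pure-Python illustration of that level-wise search on made-up baskets. The function name `apriori_sketch` is hypothetical, and this is not how mlxtend implements it.

```python
from itertools import combinations

def apriori_sketch(transactions, min_support):
    """Return {itemset: support} for all itemsets with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Level 1: keep the single items that are frequent on their own
    items = {item for t in transactions for item in t}
    current = {frozenset([i]): support(frozenset([i])) for i in items}
    current = {c: s for c, s in current.items() if s >= min_support}
    frequent = dict(current)

    k = 2
    while current:
        prev = set(current)
        # Join step: merge frequent (k-1)-itemsets into candidates of size k
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: drop candidates that have an infrequent (k-1)-subset
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        current = {c: support(c) for c in candidates}
        current = {c: s for c, s in current.items() if s >= min_support}
        frequent.update(current)
        k += 1
    return frequent

# Toy baskets (hypothetical data, for illustration only)
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "shampoo"},
    {"milk", "butter"},
    {"bread", "milk"},
]
frequent = apriori_sketch(baskets, min_support=0.4)
for itemset, s in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

Because {shampoo} falls below the threshold at the first level, no pair containing shampoo is ever generated or counted. That pruning is what keeps the search tractable on real data.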

___

## Load the libraries

For this lab exercise we use a package called mlxtend (machine learning extensions).

If it is not already installed, remove the comment symbol `#` from the first line of the cell below and run the cell.

**mlxtend : Machine learning extensions.**

In [None]:
#!pip install mlxtend --user
!pip install xlrd==1.2.0

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting xlrd==1.2.0
  Downloading xlrd-1.2.0-py2.py3-none-any.whl (103 kB)
Installing collected packages: xlrd
  Attempting uninstall: xlrd
    Found existing installation: xlrd 1.1.0
    Uninstalling xlrd-1.1.0:
      Successfully uninstalled xlrd-1.1.0
Successfully installed xlrd-1.2.0


In [None]:
import os
import numpy as np
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules
import xlrd

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
help(xlrd)

Help on package xlrd:

NAME
    xlrd

DESCRIPTION
    # Copyright (c) 2005-2012 Stephen John Machin, Lingfo Pty Ltd
    # This module is part of the xlrd package, which is released under a
    # BSD-style licence.

PACKAGE CONTENTS
    biffh
    book
    compdoc
    formatting
    formula
    info
    sheet
    timemachine
    xldate
    xlsx

FUNCTIONS
    count_records(filename, outfile=<ipykernel.iostream.OutStream object at 0x7fe525647a30>)
        For debugging and analysis: summarise the file's BIFF records.
        ie: produce a sorted file of ``(record_name, count)``.
        
        :param filename: The path to the file to be summarised.
        :param outfile: An open file, to which the summary is written.
    
    dump(filename, outfile=<ipykernel.iostream.OutStream object at 0x7fe525647a30>, unnumbered=False)
        For debugging: dump an XLS file's BIFF records in char & hex.
        
        :param filename: The path to the file to be dumped.
        :param outfile: An 

### Activity

You are given a transactions dataset from the retail domain. In this activity we will identify the association rules between the product sub-categories.

In [None]:
# Read the Superstore_Sales.xls data (SSS = SuperStoreSales)
SSS = pd.read_excel('/content/drive/My Drive/ML_LAB/Association_Rules_Lab/Example1/Superstore_Sales.xls')

In [None]:
SSS.shape  # check how many rows and columns 

(9994, 21)

In [None]:
SSS['Order ID'].nunique()      # check the number of unique orders

5009

In [None]:
SSS['Customer ID'].nunique()   # check the number of unique customers


793

In [None]:
SSS.columns  # check the column names

Index(['Row ID', 'Order ID', 'Order Date', 'Ship Date', 'Ship Mode',
       'Customer ID', 'Customer Name', 'Segment', 'Country', 'City', 'State',
       'Postal Code', 'Region', 'Product ID', 'Category', 'Sub-Category',
       'Product Name', 'Sales', 'Quantity', 'Discount', 'Profit'],
      dtype='object')

In [None]:
SSS['Product ID'].nunique()

1862

In [None]:
print('No of categories: ',SSS['Category'].nunique())
print('No of sub-categories : ',SSS['Sub-Category'].nunique())


No of categories:  3
No of sub-categories :  17


In [None]:
# Reorganise the data so that each row is one order (transaction) and each column is a sub-category
# Rules are mined at the order level, not at the customer level
transactions_df = pd.crosstab(SSS['Order ID'], SSS['Sub-Category'])

In [None]:
transactions_df.shape # check the data shape now

(5009, 17)
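To see what `pd.crosstab` does here, the following sketch applies it to a handful of made-up order lines. The `toy` frame and its values are hypothetical; only the two column names match the lab dataset.

```python
import pandas as pd

# Hypothetical order lines: one row per product line within an order
toy = pd.DataFrame({
    "Order ID": ["O1", "O1", "O1", "O2", "O3", "O3"],
    "Sub-Category": ["Paper", "Paper", "Binders", "Phones", "Paper", "Phones"],
})

# Rows become orders, columns become sub-categories,
# and each cell counts how many lines of that sub-category the order has
toy_counts = pd.crosstab(toy["Order ID"], toy["Sub-Category"])
print(toy_counts)
```

Order `O1` has two Paper lines, so its Paper cell holds 2 rather than 1. This is why counts above 1 can appear in the transaction table.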

In [None]:
transactions_df.sample(10)

Order ID,Accessories,Appliances,Art,Binders,Bookcases,Chairs,Copiers,Envelopes,Fasteners,Furnishings,Labels,Machines,Paper,Phones,Storage,Supplies,Tables
CA-2017-144456,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0
CA-2017-100951,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
CA-2014-116246,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
CA-2016-167605,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
CA-2016-108364,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
US-2015-168732,1,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0
CA-2015-151043,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
CA-2017-134096,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0
US-2015-131842,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
CA-2015-128993,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [None]:
transactions_df.head(10)

Order ID,Accessories,Appliances,Art,Binders,Bookcases,Chairs,Copiers,Envelopes,Fasteners,Furnishings,Labels,Machines,Paper,Phones,Storage,Supplies,Tables
CA-2014-100006,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0
CA-2014-100090,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1
CA-2014-100293,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
CA-2014-100328,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
CA-2014-100363,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0
CA-2014-100391,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
CA-2014-100678,1,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0
CA-2014-100706,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
CA-2014-100762,0,0,1,0,0,0,0,0,0,0,1,0,2,0,0,0,0
CA-2014-100860,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0


In [None]:
## transactions_df.index.name = ''
## transactions_df.columns.name = ''

In [None]:
transactions_df[transactions_df.index=='CA-2016-113733']

Order ID,Accessories,Appliances,Art,Binders,Bookcases,Chairs,Copiers,Envelopes,Fasteners,Furnishings,Labels,Machines,Paper,Phones,Storage,Supplies,Tables
CA-2016-113733,2,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0


In [None]:
# For each sub-category, count the orders in which it appears more than once
(transactions_df > 1).sum()

Sub-Category
Accessories     54
Appliances      14
Art             59
Binders        188
Bookcases        4
Chairs          33
Copiers          0
Envelopes        5
Fasteners        2
Furnishings     73
Labels          18
Machines         3
Paper          163
Phones          71
Storage         67
Supplies         3
Tables          11
dtype: int64

In [None]:
# Only the presence of a sub-category in an order matters, not how often it appears,
# so cap every count greater than 1 at 1
transactions_df[transactions_df > 1] = 1
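An alternative way to express the same step is to turn the counts into True/False flags. The sketch below uses a small hypothetical count table. Boolean input is, as far as I know, the form that recent mlxtend versions prefer for `apriori`, but treat that as an assumption and check the documentation for your installed version.

```python
import pandas as pd

# Hypothetical count table shaped like transactions_df (orders x sub-categories)
toy_counts = pd.DataFrame(
    {"Binders": [1, 0, 0], "Paper": [2, 0, 1], "Phones": [0, 1, 1]},
    index=["O1", "O2", "O3"],
)

# True where the sub-category appears in the order at all, False otherwise
toy_flags = toy_counts > 0
print(toy_flags)
```

The information kept is the same as after capping counts at 1: each cell only records whether the sub-category is present in the order.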

In [None]:
# Generate the frequent itemsets (minimum support of 1%)
itemsets = apriori(transactions_df, min_support=0.01, use_colnames=True)
itemsets

,support,itemsets
0,0.143342,(Accessories)
1,0.090038,(Appliances)
2,0.145937,(Art)
3,0.262727,(Binders)
4,0.044720,(Bookcases)
...,...,...
63,0.023358,"(Storage, Phones)"
64,0.010381,"(Tables, Phones)"
65,0.010182,"(Paper, Binders, Furnishings)"
66,0.010781,"(Paper, Binders, Phones)"


In [None]:
# Create the association rules (antecedents = LHS, consequents = RHS)
rules = association_rules(itemsets, metric='lift', min_threshold=1.1)
rules

,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(Paper),(Fasteners),0.237772,0.042923,0.011779,0.049538,1.154125,0.001573,1.00696
1,(Fasteners),(Paper),0.042923,0.237772,0.011779,0.274419,1.154125,0.001573,1.050507
2,(Storage),(Labels),0.155121,0.069076,0.012178,0.078507,1.136537,0.001463,1.010235
3,(Labels),(Storage),0.069076,0.155121,0.012178,0.176301,1.136537,0.001463,1.025713
4,"(Paper, Binders)",(Phones),0.054901,0.162507,0.010781,0.196364,1.208336,0.001859,1.042129
5,"(Paper, Phones)",(Binders),0.034937,0.262727,0.010781,0.308571,1.174494,0.001602,1.066304
6,"(Binders, Phones)",(Paper),0.039728,0.237772,0.010781,0.271357,1.141248,0.001334,1.046092
7,(Paper),"(Binders, Phones)",0.237772,0.039728,0.010781,0.04534,1.141248,0.001334,1.005878
8,(Binders),"(Paper, Phones)",0.262727,0.034937,0.010781,0.041033,1.174494,0.001602,1.006357
9,(Phones),"(Paper, Binders)",0.162507,0.054901,0.010781,0.066339,1.208336,0.001859,1.012251
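The metric columns in the table above can be reproduced by hand from the three support values. The sketch below does this for rule 0, (Paper) → (Fasteners), using the rounded numbers printed in the table, so the results agree only up to rounding. The leverage and conviction formulas used here are the standard definitions, which I assume match what mlxtend computes.

```python
# Values copied from rule 0, (Paper) -> (Fasteners), in the table above
antecedent_support = 0.237772   # support(Paper)
consequent_support = 0.042923   # support(Fasteners)
support = 0.011779              # support(Paper and Fasteners together)

confidence = support / antecedent_support
lift = confidence / consequent_support
# Leverage: observed joint support minus the support expected under independence
leverage = support - antecedent_support * consequent_support
# Conviction: how much more often the rule would fail if A and B were independent
conviction = (1 - consequent_support) / (1 - confidence)

print(round(confidence, 4), round(lift, 3), round(leverage, 5), round(conviction, 4))
```

The same arithmetic applied to rule 1, (Fasteners) → (Paper), gives a different confidence but the same lift and leverage, because those two measures are symmetric in the antecedent and consequent.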


In [None]:
rules['antecedents']

0                (Paper)
1            (Fasteners)
2              (Storage)
3               (Labels)
4       (Paper, Binders)
5        (Paper, Phones)
6      (Binders, Phones)
7                (Paper)
8              (Binders)
9               (Phones)
10      (Paper, Binders)
11      (Paper, Storage)
12    (Binders, Storage)
13               (Paper)
14             (Binders)
15             (Storage)
Name: antecedents, dtype: object

In [None]:
# What should be recommended if Binders is already in the basket/cart?
rules[rules['antecedents'] == frozenset({'Binders'})]

,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
8,(Binders),"(Paper, Phones)",0.262727,0.034937,0.010781,0.041033,1.174494,0.001602,1.006357
14,(Binders),"(Paper, Storage)",0.262727,0.035536,0.010581,0.040274,1.133316,0.001245,1.004936



### Comparing the column with the plain string 'Binders' does not match any rule, because each entry is a frozenset, not a string

### So the comparison has to be made against a frozenset, as follows:

In [None]:
# Another example: which antecedents lead to Phones as the consequent?
# The entries are frozensets, so compare against a frozenset

rules[rules['consequents'] == frozenset({'Phones'})]

,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
4,"(Paper, Binders)",(Phones),0.054901,0.162507,0.010781,0.196364,1.208336,0.001859,1.042129
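The difference between string and frozenset comparison can be seen on a tiny stand-in for the `antecedents` column. The Series below is hypothetical. The last line shows one possible way, not taken from the lab, to find every rule that mentions an item rather than only exact matches.

```python
import pandas as pd

# Hypothetical miniature of the `antecedents` column
antecedents = pd.Series([
    frozenset({"Binders"}),
    frozenset({"Paper", "Binders"}),
    frozenset({"Phones"}),
])

# A plain string never equals a frozenset, so nothing matches
as_string = int((antecedents == "Binders").sum())

# Comparing with a frozenset matches only the exact itemset {Binders}
exact = int((antecedents == frozenset({"Binders"})).sum())

# Membership test: every antecedent that contains Binders, alone or with other items
contains = int(antecedents.apply(lambda s: "Binders" in s).sum())

print(as_string, exact, contains)
```

The exact comparison is what the cells above use. The membership test is useful when rules such as (Paper, Binders) → ... should also be returned for a cart containing Binders.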


In [None]:
# Sort by support (ascending), then by confidence and lift (both descending)
sorted_rules = rules.sort_values(['support', 'confidence', 'lift'], ascending=[True, False, False])
sorted_rules

,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
11,"(Paper, Storage)",(Binders),0.035536,0.262727,0.010581,0.297753,1.133316,0.001245,1.049877
12,"(Binders, Storage)",(Paper),0.039728,0.237772,0.010581,0.266332,1.120114,0.001135,1.038927
10,"(Paper, Binders)",(Storage),0.054901,0.155121,0.010581,0.192727,1.242434,0.002065,1.046585
15,(Storage),"(Paper, Binders)",0.155121,0.054901,0.010581,0.068211,1.242434,0.002065,1.014284
13,(Paper),"(Binders, Storage)",0.237772,0.039728,0.010581,0.0445,1.120114,0.001135,1.004994
14,(Binders),"(Paper, Storage)",0.262727,0.035536,0.010581,0.040274,1.133316,0.001245,1.004936
5,"(Paper, Phones)",(Binders),0.034937,0.262727,0.010781,0.308571,1.174494,0.001602,1.066304
6,"(Binders, Phones)",(Paper),0.039728,0.237772,0.010781,0.271357,1.141248,0.001334,1.046092
4,"(Paper, Binders)",(Phones),0.054901,0.162507,0.010781,0.196364,1.208336,0.001859,1.042129
9,(Phones),"(Paper, Binders)",0.162507,0.054901,0.010781,0.066339,1.208336,0.001859,1.012251


In [None]:
sorted_rules.to_csv('rules_extracted.csv')