# Association rule mining

Association rule mining, or association analysis, is a machine learning method for discovering relationships among a large set of variables. This notebook introduces one of the main applications of association rule mining: market basket analysis, which identifies items that tend to be bought together based on consumer purchase patterns.

The model uses the Apriori algorithm, which prunes (cuts down) the tree of candidate itemsets based on the popularity of the variables (products). It calculates three important indicators, in the following order:

- **Support** - popularity of an itemset, measured by the proportion of transactions in which an itemset appears.
- **Confidence** - how likely item Y is purchased when item X is purchased, measured by the proportion of transactions with item X, in which item Y also appears.
- **Lift** - how much more likely item Y is purchased when item X is purchased, compared with how often Y would be purchased if X and Y were independent. A lift of 1 implies no association.
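
Written as formulas, the three indicators are commonly defined as below. The symbols are notation introduced here for illustration: $N$ is the total number of transactions, $t$ is a single transaction, and $X$ and $Y$ are itemsets.

$$
\begin{aligned}
\mathrm{support}(X) &= \frac{|\{t : X \subseteq t\}|}{N} \\
\mathrm{confidence}(X \Rightarrow Y) &= \frac{\mathrm{support}(X \cup Y)}{\mathrm{support}(X)} \\
\mathrm{lift}(X \Rightarrow Y) &= \frac{\mathrm{confidence}(X \Rightarrow Y)}{\mathrm{support}(Y)} = \frac{\mathrm{support}(X \cup Y)}{\mathrm{support}(X)\,\mathrm{support}(Y)}
\end{aligned}
$$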

Higher support signals that a product is popular, while higher confidence shows that two or more products tend to be bought together. Lift is the final "decision-maker", as it takes into account both the popularity of the products and the dependence/association between them.
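
To make the three indicators concrete, here is a small sketch that computes them by hand on five made-up baskets. The basket contents and the helper names `support`, `confidence` and `lift` are illustrative assumptions, not part of the notebook's dataset or of mlxtend.

```python
# Five hypothetical baskets (illustrative data, not from products.xlsx)
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

def support(itemset):
    """Share of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(x, y):
    """Share of baskets containing `x` that also contain `y`."""
    return support(set(x) | set(y)) / support(x)

def lift(x, y):
    """Confidence of x -> y relative to the baseline popularity of `y`."""
    return confidence(x, y) / support(y)

print(support({"bread"}))               # 4 of the 5 baskets contain bread
print(confidence({"bread"}, {"milk"}))  # 3 of the 4 bread baskets also contain milk
print(lift({"bread"}, {"milk"}))        # below 1: bread buyers pick milk slightly less often than average
```

In this toy data milk is bought with bread in most cases (high confidence), yet the lift is below 1 because milk is so popular on its own. This is why confidence alone can be misleading.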

To learn more about association rule mining, the Apriori algorithm and the indicators, see [this tutorial](http://pbpython.com/market-basket-analysis.html). Some nice examples and the math formulas are also provided on the [Wikipedia page](https://en.wikipedia.org/wiki/Association_rule_learning).

To complete the tasks in this notebook, you need two additional Python libraries, which can be installed by running:

- pip install **qgrid**
- pip install **mlxtend**

**Qgrid** will be used to create a data table with filtering options (like in Excel), and **mlxtend** will be used to apply the Apriori algorithm and get the association rules.

In [1]:
# import necessary libraries
import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
import qgrid

In [2]:
# read the data, using the first column as row names (index);
# recent pandas versions spell the argument sheet_name (older ones used sheetname)
data = pd.read_excel("products.xlsx", index_col=0, sheet_name=3)
data.head()

ID,10002,10120,10125,10135,11001,15036,15039,15044C,15056BL,15056N,...,90030C,90031,90099,90184B,90184C,90201B,90201C,C2,M,POST
536370,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
536852,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
536974,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,1
537065,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
537463,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1



If the first column was not set as the index when reading the file, you can set it afterwards:
```
data = data.set_index(data.columns[0])
```
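
The table loaded above already has one row per invoice and one 0/1 column per product, which is the layout `apriori` works on. If the raw data instead came as one row per purchased item, a table of that shape could be built roughly as follows. The column names `invoice` and `product` and the sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical long-format purchases: one row per (invoice, product) pair
purchases = pd.DataFrame({
    "invoice": [536370, 536370, 536852, 536974, 536974],
    "product": ["10002", "POST", "POST", "15056BL", "POST"],
})

# Count each product per invoice, then turn the counts into 0/1 flags
basket = pd.crosstab(purchases["invoice"], purchases["product"])
basket = (basket > 0).astype(int)

print(basket)
```

Turning counts into flags matters because the analysis only asks whether a product appears in a transaction, not how many units were bought.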

In [3]:
# use the Apriori algorithm to keep only itemsets with support of at least 0.07
frequent_itemsets = apriori(data, min_support=0.07, use_colnames=True)

In [4]:
# make sure that the result is a dataframe
type(frequent_itemsets)

pandas.core.frame.DataFrame

In [5]:
# view the dataframe
frequent_itemsets.head()

Unnamed: 0,support,itemsets
0,0.08243,[20724]
1,0.132321,[20725]
2,0.101952,[20726]
3,0.119306,[20750]
4,0.112798,[21080]
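
The level-wise pruning behind `apriori` can be sketched in plain Python. This is a simplified illustration of the idea on assumed toy data, not mlxtend's actual implementation; the function name `frequent_itemsets_sketch` and the baskets are made up for the example.

```python
from itertools import combinations

def frequent_itemsets_sketch(baskets, min_support):
    """Return {itemset: support} for all itemsets meeting `min_support`."""
    n = len(baskets)

    def support(itemset):
        return sum(itemset <= basket for basket in baskets) / n

    # Level 1: single items that are popular enough
    items = sorted({item for basket in baskets for item in basket})
    current = {frozenset([i]) for i in items}
    current = {s for s in current if support(s) >= min_support}

    frequent = {}
    k = 1
    while current:
        frequent.update({s: support(s) for s in current})
        k += 1
        # Candidates of size k are unions of frequent (k-1)-itemsets
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return frequent

# Hypothetical baskets for illustration
baskets = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B", "C"},
]
result = frequent_itemsets_sketch(baskets, min_support=0.6)
for itemset, sup in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)
```

The pruning step relies on the fact that an itemset cannot be frequent unless all of its subsets are, so unpopular items are dropped early and never combined into larger candidates.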


In [6]:
# get info (more specifically, the number of observations)
frequent_itemsets.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 68 entries, 0 to 67
Data columns (total 2 columns):
support     68 non-null float64
itemsets    68 non-null object
dtypes: float64(1), object(1)
memory usage: 1.1+ KB


In [7]:
# form the rules from the frequent itemsets, keeping those with a lift of at least 1
# (confidence is calculated for each rule as well)
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
rules.head()

Unnamed: 0,antecedants,consequents,support,confidence,lift
0,(21080),(21086),0.112798,0.769231,6.566952
1,(21086),(21080),0.117137,0.740741,6.566952
2,(POST),(21731),0.67462,0.199357,1.276438
3,(21731),(POST),0.156182,0.861111,1.276438
4,(POST),(22727),0.67462,0.109325,1.362127
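
As a sanity check, the lift in the first row can be reproduced from the other numbers in the table. The sketch below assumes, based on the values shown, that the `support` column in this output holds the support of the antecedent (0.112798 is also the support of item 21080 in the frequent itemsets table), so the support of 21086 is read from the second row.

```python
# Values copied from rows 0 and 1 of the rules table above
support_21080 = 0.112798        # support column of row 0 (antecedent 21080)
support_21086 = 0.117137        # support column of row 1 (antecedent 21086)
confidence_80_to_86 = 0.769231  # confidence of 21080 -> 21086
confidence_86_to_80 = 0.740741  # confidence of 21086 -> 21080

# lift(X -> Y) = confidence(X -> Y) / support(Y)
lift_80_to_86 = confidence_80_to_86 / support_21086
lift_86_to_80 = confidence_86_to_80 / support_21080

# Both directions should agree with the lift column up to rounding
print(round(lift_80_to_86, 3), round(lift_86_to_80, 3))
```

Lift is symmetric in X and Y, which is why each pair of mirrored rules in the table shares the same lift while their confidence values differ.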


In [8]:
# get info (more specifically, the number of observations)
rules.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 96 entries, 0 to 95
Data columns (total 5 columns):
antecedants    96 non-null object
consequents    96 non-null object
support        96 non-null float64
confidence     96 non-null float64
lift           96 non-null float64
dtypes: float64(3), object(2)
memory usage: 3.8+ KB


In [9]:
# keep only the rules with a lift higher than 6
rules[rules["lift"]>6]

Unnamed: 0,antecedants,consequents,support,confidence,lift
0,(21080),(21086),0.112798,0.769231,6.566952
1,(21086),(21080),0.117137,0.740741,6.566952
24,(21080),(21094),0.112798,0.769231,7.092308
25,(21094),(21080),0.10846,0.8,7.092308
28,(21094),(21086),0.10846,0.96,8.195556
29,(21086),(21094),0.117137,0.888889,8.195556
30,"(21080, POST)",(21094),0.093275,0.767442,7.075814
32,"(POST, 21094)",(21080),0.091106,0.785714,6.965659
33,(21080),"(POST, 21094)",0.112798,0.634615,6.965659
35,(21094),"(21080, POST)",0.10846,0.66,7.075814
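
Several conditions can be combined in the same way. The sketch below rebuilds a few of the rows shown above as a small DataFrame (so that it runs on its own) and keeps rules with both high lift and high confidence. The thresholds 7 and 0.8 are arbitrary choices for illustration, and the item sets are simplified to plain strings.

```python
import pandas as pd

# A handful of rows copied from the output above (item sets simplified to strings)
rules_sample = pd.DataFrame(
    {
        "antecedants": ["21080", "21094", "21094", "21086"],
        "consequents": ["21094", "21080", "21086", "21094"],
        "confidence": [0.769231, 0.8, 0.96, 0.888889],
        "lift": [7.092308, 7.092308, 8.195556, 8.195556],
    },
    index=[24, 25, 28, 29],
)

# Combine boolean masks with & and wrap each condition in parentheses
strong = rules_sample[(rules_sample["lift"] > 7) & (rules_sample["confidence"] >= 0.8)]

# Show the strongest associations first
print(strong.sort_values("lift", ascending=False))
```

On the full `rules` table from the notebook, the same expression would apply with `rules` in place of `rules_sample`.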


In [11]:
# create an interactive datatable with filters
qgrid.show_grid(rules)