## Apriori Algorithm using 'mlxtend' package

### Install the module called 'mlxtend'
Run 'pip install mlxtend' or 'conda install mlxtend', then load the required libraries.

In [60]:
import pandas as pd
import numpy as np
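# OnehotTransactions is the encoder in the mlxtend release used here;
# newer releases provide TransactionEncoder instead.
from mlxtend.preprocessing import OnehotTransactions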
from mlxtend.frequent_patterns import apriori

The 'dataset' consists of a set of transactions, where each transaction represents the set of items it contains. Suppose that we have a particular set of items A (e.g., beans and squash) and another set of items B (e.g., asparagus). Then an association rule has the form "if A, then B", written A ⇒ B.

In [61]:
dataset = [['Broccoli', 'Green Peppers', 'Corn'],
           ['Asparagus', 'Squash', 'Corn'],
           ['Corn', 'Tomatoes', 'Beans', 'Squash'],
           ['Green Peppers', 'Corn', 'Tomatoes', 'Beans'],
           ['Beans', 'Asparagus', 'Broccoli'],
           ['Squash', 'Asparagus', 'Beans', 'Tomatoes'],
           ['Tomatoes', 'Corn'],
           ['Broccoli', 'Tomatoes', 'Green Peppers'],
           ['Squash', 'Asparagus', 'Beans'],
           ['Beans', 'Corn'],
           ['Green peppers', 'Broccoli', 'Beans', 'Squash'],
           ['Asparagus', 'Beans', 'Squash'],
           ['Squash', 'Corn', 'Asparagus', 'Beans'],
           ['Corn', 'Green Peppers', 'Tomatoes', 'Beans', 'Broccoli']]            

dataset

[['Broccoli', 'Green Peppers', 'Corn'],
 ['Asparagus', 'Squash', 'Corn'],
 ['Corn', 'Tomatoes', 'Beans', 'Squash'],
 ['Green Peppers', 'Corn', 'Tomatoes', 'Beans'],
 ['Beans', 'Asparagus', 'Broccoli'],
 ['Squash', 'Asparagus', 'Beans', 'Tomatoes'],
 ['Tomatoes', 'Corn'],
 ['Broccoli', 'Tomatoes', 'Green Peppers'],
 ['Squash', 'Asparagus', 'Beans'],
 ['Beans', 'Corn'],
 ['Green peppers', 'Broccoli', 'Beans', 'Squash'],
 ['Asparagus', 'Beans', 'Squash'],
 ['Squash', 'Corn', 'Asparagus', 'Beans'],
 ['Corn', 'Green Peppers', 'Tomatoes', 'Beans', 'Broccoli']]

#### Convert the dataset into a dataframe with one-hot encoding and with columns as items

In [62]:
oht = OnehotTransactions()
oht_ary = oht.fit(dataset).transform(dataset)
df = pd.DataFrame(oht_ary, columns=oht.columns_)

In [63]:
oht

OnehotTransactions()

In [64]:
oht_ary

array([[0, 0, 1, 1, 1, 0, 0, 0],
       [1, 0, 0, 1, 0, 0, 1, 0],
       [0, 1, 0, 1, 0, 0, 1, 1],
       [0, 1, 0, 1, 1, 0, 0, 1],
       [1, 1, 1, 0, 0, 0, 0, 0],
       [1, 1, 0, 0, 0, 0, 1, 1],
       [0, 0, 0, 1, 0, 0, 0, 1],
       [0, 0, 1, 0, 1, 0, 0, 1],
       [1, 1, 0, 0, 0, 0, 1, 0],
       [0, 1, 0, 1, 0, 0, 0, 0],
       [0, 1, 1, 0, 0, 1, 1, 0],
       [1, 1, 0, 0, 0, 0, 1, 0],
       [1, 1, 0, 1, 0, 0, 1, 0],
       [0, 1, 1, 1, 1, 0, 0, 1]])

In [65]:
df

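|  | Asparagus | Beans | Broccoli | Corn | Green Peppers | Green peppers | Squash | Tomatoes |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 |
| 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 2 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 1 |
| 3 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 1 |
| 4 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| 5 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
| 6 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 7 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
| 8 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 9 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 10 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 |
| 11 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 12 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
| 13 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 1 |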
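The encoded frame has two pepper columns, 'Green Peppers' and 'Green peppers', because one transaction spells the item with a lowercase "p". The following is a minimal plain-Python sketch of what a transaction one-hot encoder does, and of one possible clean-up. It is illustrative only and not mlxtend's implementation: the helper `one_hot` and the `title()` normalisation are assumptions introduced here, and the outputs shown in this section were produced without any such clean-up.

```python
# Two transactions taken from 'dataset'; the second uses a lowercase "p".
transactions = [
    ['Broccoli', 'Green Peppers', 'Corn'],
    ['Green peppers', 'Broccoli', 'Beans', 'Squash'],
]

def one_hot(transactions):
    """Return (sorted item labels, list of 0/1 rows), one row per transaction."""
    columns = sorted({item for t in transactions for item in t})
    rows = [[int(col in t) for col in columns] for t in transactions]
    return columns, rows

columns, rows = one_hot(transactions)
print(columns)  # the two spellings are different strings, so each gets its own column

# Hypothetical clean-up: normalise capitalisation before encoding.
normalised = [[item.title() for item in t] for t in transactions]
clean_columns, clean_rows = one_hot(normalised)
print(clean_columns)  # a single 'Green Peppers' column remains
```

Applying a normalisation like this to the full 'dataset' would change the support of 'Green Peppers', so the later results would need to be recomputed.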


### Support of a Rule
The support for a particular association rule A ⇒ B is the proportion of transactions in 'dataset' that contain both A and B. We normally prefer rules that have high support, high confidence, or both. Strong rules are those that meet or surpass certain minimum support and confidence criteria.
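To make the definition concrete, support can be counted by hand. This is a small sketch in plain Python; the function name `support` is an assumption made for illustration and is not part of mlxtend.

```python
dataset = [['Broccoli', 'Green Peppers', 'Corn'],
           ['Asparagus', 'Squash', 'Corn'],
           ['Corn', 'Tomatoes', 'Beans', 'Squash'],
           ['Green Peppers', 'Corn', 'Tomatoes', 'Beans'],
           ['Beans', 'Asparagus', 'Broccoli'],
           ['Squash', 'Asparagus', 'Beans', 'Tomatoes'],
           ['Tomatoes', 'Corn'],
           ['Broccoli', 'Tomatoes', 'Green Peppers'],
           ['Squash', 'Asparagus', 'Beans'],
           ['Beans', 'Corn'],
           ['Green peppers', 'Broccoli', 'Beans', 'Squash'],
           ['Asparagus', 'Beans', 'Squash'],
           ['Squash', 'Corn', 'Asparagus', 'Beans'],
           ['Corn', 'Green Peppers', 'Tomatoes', 'Beans', 'Broccoli']]

def support(itemset, transactions):
    """Proportion of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

print(round(support({'Beans'}, dataset), 6))            # 0.714286 (10 of 14)
print(round(support({'Beans', 'Squash'}, dataset), 6))  # 0.428571 (6 of 14)
```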

#### Generate the items/itemsets that occur together frequently (minimum support of 0.3)

In [66]:
frequent_itemsets = apriori(df, min_support=0.3, use_colnames=True)

frequent_itemsets

|  | support | itemsets |
|---|---|---|
| 0 | 0.428571 | [Asparagus] |
| 1 | 0.714286 | [Beans] |
| 2 | 0.357143 | [Broccoli] |
| 3 | 0.571429 | [Corn] |
| 4 | 0.5 | [Squash] |
| 5 | 0.428571 | [Tomatoes] |
| 6 | 0.357143 | [Asparagus, Beans] |
| 7 | 0.357143 | [Asparagus, Squash] |
| 8 | 0.357143 | [Beans, Corn] |
| 9 | 0.428571 | [Beans, Squash] |
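The level-wise idea behind Apriori can be sketched in plain Python: frequent itemsets of size k+1 are built only from frequent itemsets of size k, because any superset of an infrequent itemset must itself be infrequent. This is a simplified illustration under that assumption, not mlxtend's implementation, and `apriori_sketch` is a name invented here.

```python
from itertools import combinations

dataset = [['Broccoli', 'Green Peppers', 'Corn'],
           ['Asparagus', 'Squash', 'Corn'],
           ['Corn', 'Tomatoes', 'Beans', 'Squash'],
           ['Green Peppers', 'Corn', 'Tomatoes', 'Beans'],
           ['Beans', 'Asparagus', 'Broccoli'],
           ['Squash', 'Asparagus', 'Beans', 'Tomatoes'],
           ['Tomatoes', 'Corn'],
           ['Broccoli', 'Tomatoes', 'Green Peppers'],
           ['Squash', 'Asparagus', 'Beans'],
           ['Beans', 'Corn'],
           ['Green peppers', 'Broccoli', 'Beans', 'Squash'],
           ['Asparagus', 'Beans', 'Squash'],
           ['Squash', 'Corn', 'Asparagus', 'Beans'],
           ['Corn', 'Green Peppers', 'Tomatoes', 'Beans', 'Broccoli']]

def apriori_sketch(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def supp(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Level 1 candidates: every single item.
    current = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while current:
        # Keep only the candidates that meet the support threshold.
        level = {c: supp(c) for c in current}
        level = {c: s for c, s in level.items() if s >= min_support}
        frequent.update(level)
        # Join step: merge frequent k-itemsets into (k+1)-item candidates.
        k += 1
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: drop candidates that have an infrequent (k-1)-item subset.
        current = {c for c in candidates
                   if all(frozenset(s) in level for s in combinations(c, k - 1))}
    return frequent

result = apriori_sketch(dataset, min_support=0.3)
for itemset, s in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(round(s, 6), sorted(itemset))
```

With a threshold of 0.3, the sketch finds the same six single items and four pairs as the table above. The only three-item candidate that survives pruning, Asparagus, Beans and Squash, appears in 4 of 14 transactions and falls below the threshold.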


### Confidence of a Rule

The confidence of the association rule A ⇒ B measures the accuracy of the rule: it is the proportion of transactions in the 'dataset' containing A that also contain B. In other words, it shows how frequently the rule head (B) occurs among all the transactions containing the rule body (A), and so indicates how reliable the rule is. The higher the value, the more often the items in B accompany the items in A.
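As a quick check of the definition, confidence can be computed from two supports in the frequent-itemset table. The sketch below uses the rule Beans ⇒ Squash and writes the supports as exact fractions of the 14 transactions; the variable names are chosen here for illustration.

```python
from fractions import Fraction

# Supports from the frequent-itemset table (14 transactions in total).
supp_beans = Fraction(10, 14)        # 0.714286
supp_beans_squash = Fraction(6, 14)  # 0.428571

# confidence(Beans => Squash) = supp(Beans and Squash) / supp(Beans)
confidence = supp_beans_squash / supp_beans
print(confidence, float(confidence))  # 3/5 0.6
```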

### Usefulness of Association Rules
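Generally, not all association rules are equally useful. We can quantify the usefulness of an association rule with a measure called 'lift'. Lift is defined as the ratio of the confidence of the rule to the expected confidence of the rule. The expected confidence of a rule is defined as the product of the support values of the rule body and the rule head divided by the support of the rule body, which simplifies to the support of the rule head. A lift above 1 means the rule body and rule head occur together more often than would be expected if they were independent; a lift below 1 means they occur together less often.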
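Written out, the definition reduces to a simple ratio of supports. The notation below is chosen here for illustration: supp(A ∪ B) denotes the proportion of transactions containing every item of both A and B. The numbers in the last line come from the frequent-itemset table.

```latex
\begin{aligned}
\text{expected confidence}(A \Rightarrow B)
  &= \frac{\mathrm{supp}(A)\,\mathrm{supp}(B)}{\mathrm{supp}(A)} = \mathrm{supp}(B), \\
\mathrm{lift}(A \Rightarrow B)
  &= \frac{\mathrm{conf}(A \Rightarrow B)}{\mathrm{supp}(B)}
   = \frac{\mathrm{supp}(A \cup B)}{\mathrm{supp}(A)\,\mathrm{supp}(B)}, \\
\mathrm{lift}(\text{Asparagus} \Rightarrow \text{Squash})
  &= \frac{0.357143}{0.428571 \times 0.5} \approx 1.666667.
\end{aligned}
```

Because the last expression is symmetric in A and B, a rule and its reverse always have the same lift.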

In [67]:
from mlxtend.frequent_patterns import association_rules

association_rules(frequent_itemsets, metric="confidence", min_threshold=0.3)


|  | antecedants | consequents | support | confidence | lift |
|---|---|---|---|---|---|
| 0 | (Asparagus) | (Beans) | 0.428571 | 0.833333 | 1.166667 |
| 1 | (Beans) | (Asparagus) | 0.714286 | 0.5 | 1.166667 |
| 2 | (Asparagus) | (Squash) | 0.428571 | 0.833333 | 1.666667 |
| 3 | (Squash) | (Asparagus) | 0.5 | 0.714286 | 1.666667 |
| 4 | (Beans) | (Corn) | 0.714286 | 0.5 | 0.875 |
| 5 | (Corn) | (Beans) | 0.571429 | 0.625 | 0.875 |
| 6 | (Squash) | (Beans) | 0.5 | 0.857143 | 1.2 |
| 7 | (Beans) | (Squash) | 0.714286 | 0.6 | 1.2 |
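Note that the 'support' column in this output matches the support of the antecedent alone, for example 0.428571 for (Asparagus) ⇒ (Beans), rather than the support of the rule as defined under "Support of a Rule" (0.357143 for that pair). The sketch below recomputes a few rows from transaction counts to show this. The function `rule_metrics` is hypothetical, and which quantity the column reports may differ between mlxtend versions.

```python
from fractions import Fraction

N = 14  # number of transactions in 'dataset'

# Transaction counts behind the supports in the frequent-itemset table.
count = {
    frozenset(['Asparagus']): 6,
    frozenset(['Beans']): 10,
    frozenset(['Corn']): 8,
    frozenset(['Squash']): 7,
    frozenset(['Asparagus', 'Beans']): 5,
    frozenset(['Asparagus', 'Squash']): 5,
    frozenset(['Beans', 'Corn']): 5,
    frozenset(['Beans', 'Squash']): 6,
}

def rule_metrics(antecedent, consequent):
    """Return (antecedent support, rule support, confidence, lift) as fractions."""
    a, c = frozenset(antecedent), frozenset(consequent)
    supp_a = Fraction(count[a], N)
    supp_c = Fraction(count[c], N)
    supp_rule = Fraction(count[a | c], N)
    confidence = supp_rule / supp_a
    lift = confidence / supp_c
    return supp_a, supp_rule, confidence, lift

for a, c in [('Asparagus', 'Beans'), ('Beans', 'Corn'), ('Squash', 'Beans')]:
    supp_a, supp_rule, conf, lift = rule_metrics([a], [c])
    print(f"{a} => {c}: antecedent support={float(supp_a):.6f}, "
          f"rule support={float(supp_rule):.6f}, "
          f"confidence={float(conf):.6f}, lift={float(lift):.6f}")
```

The confidence and lift values agree with the table, and the antecedent support is the number that appears in its 'support' column.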


In [68]:
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1.2)
rules

|  | antecedants | consequents | support | confidence | lift |
|---|---|---|---|---|---|
| 0 | (Asparagus) | (Squash) | 0.428571 | 0.833333 | 1.666667 |
| 1 | (Squash) | (Asparagus) | 0.5 | 0.714286 | 1.666667 |
| 2 | (Squash) | (Beans) | 0.5 | 0.857143 | 1.2 |
| 3 | (Beans) | (Squash) | 0.714286 | 0.6 | 1.2 |


In [69]:
rules["antecedant_len"] = rules["antecedants"].apply(len)
rules

|  | antecedants | consequents | support | confidence | lift | antecedant_len |
|---|---|---|---|---|---|---|
| 0 | (Asparagus) | (Squash) | 0.428571 | 0.833333 | 1.666667 | 1 |
| 1 | (Squash) | (Asparagus) | 0.5 | 0.714286 | 1.666667 | 1 |
| 2 | (Squash) | (Beans) | 0.5 | 0.857143 | 1.2 | 1 |
| 3 | (Beans) | (Squash) | 0.714286 | 0.6 | 1.2 | 1 |


Suppose we are interested only in itemsets of length 2 that have a support of at least 40 percent. First, we create the frequent itemsets via apriori and add a new column that stores the length of each itemset.

In [70]:
frequent_itemsets = apriori(df, min_support=0.3, use_colnames=True)
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(len)
frequent_itemsets

|  | support | itemsets | length |
|---|---|---|---|
| 0 | 0.428571 | [Asparagus] | 1 |
| 1 | 0.714286 | [Beans] | 1 |
| 2 | 0.357143 | [Broccoli] | 1 |
| 3 | 0.571429 | [Corn] | 1 |
| 4 | 0.5 | [Squash] | 1 |
| 5 | 0.428571 | [Tomatoes] | 1 |
| 6 | 0.357143 | [Asparagus, Beans] | 2 |
| 7 | 0.357143 | [Asparagus, Squash] | 2 |
| 8 | 0.357143 | [Beans, Corn] | 2 |
| 9 | 0.428571 | [Beans, Squash] | 2 |


In [71]:
frequent_itemsets[ (frequent_itemsets['length'] == 2) &
                   (frequent_itemsets['support'] >= 0.4) ]

|  | support | itemsets | length |
|---|---|---|---|
| 9 | 0.428571 | [Beans, Squash] | 2 |


### References
https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/apriori/<br>
https://www.ibm.com/support/knowledgecenter/en/SSEPGG_9.5.0/com.ibm.im.model.doc/c_lift_in_an_association_rule.html<br>
Data Mining and Predictive Analytics by Daniel T. Larose and Chantal D. Larose