# Association Rule Mining

Association rule mining is a technique for identifying underlying relationships between different items. Take the example of a supermarket where customers can buy a variety of items. Usually, there is a pattern in what customers buy. For instance, parents with babies buy baby products such as milk and diapers, some shoppers buy makeup items, and others buy beer and chips. In short, transactions follow patterns, and more profit can be generated if the relationships between the items purchased in different transactions can be identified.

For instance, if items A and B are frequently bought together, several steps can be taken to increase profit:

- A and B can be placed together so that a customer who buys one of them does not have to go far to buy the other.
- People who buy one of the products can be targeted through an advertising campaign to buy the other.
- Collective discounts can be offered if the customer buys both products.
- A and B can be packaged together.

The process of identifying associations between products is called association rule mining.

Association rules are rules derived from a database that help determine relationships among variables in a large transactional database.

For example, let I = {i(1), i(2), ..., i(m)} be a set of m attributes called items, and T = {t(1), t(2), ..., t(n)} be the set of transactions. Every transaction t(i) in T has a unique transaction ID and contains a subset of the items in I.

Association rules are usually written as i(j) -> i(k). This means that there is a strong relationship between the purchase of item i(j) and item i(k): transactions that contain i(j) tend to contain i(k) as well.

In the above example, i(j) is the antecedent and i(k) is the consequent.

Note that both the antecedent and the consequent can contain multiple items. For example, {Diaper, Gum} -> {Beer, Chips} is also valid.

Since multiple rules are possible even from a very small database, we select the most relevant ones by placing constraints on various measures of interest. The most important measures (support, confidence, lift, and conviction) are discussed below.

**Apriori Algorithm for Association Rule Mining:** 
Different statistical algorithms have been developed to implement association rule mining, and Apriori is one such algorithm. In this article we will study the theory behind the Apriori algorithm and then implement it in Python.


# Apriori Algorithm

There are four major components of the Apriori algorithm:
1. Support 
2. Confidence 
3. Lift
4. Conviction


1. Support: The support of an itemset X, supp(X), is the proportion of transactions in the database in which X appears. It signifies the popularity of an itemset.

supp(X) = (Number of transactions in which X appears)/(Total number of transactions)

Given a minimum support threshold, we can identify itemsets whose support exceeds that threshold as significant (frequent) itemsets.
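
As a minimal sketch of the support formula, the function below counts the transactions that contain every item of an itemset. The `support` helper and the toy transactions are illustrative assumptions, not part of the dataset used later.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

# Hypothetical toy data: five transactions.
transactions = [
    {"milk", "diapers", "beer"},
    {"milk", "diapers"},
    {"milk", "bread"},
    {"diapers", "beer"},
    {"bread", "chips"},
]

supp_milk = support({"milk"}, transactions)             # milk is in 3 of 5 transactions
supp_pair = support({"diapers", "beer"}, transactions)  # both items are in 2 of 5
print(supp_milk, supp_pair)
```

With a threshold of, say, 0.5, only `{"milk"}` would count as significant here.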

2. Confidence: The confidence of a rule signifies the likelihood of itemset Y being purchased when itemset X is purchased.

Thus, conf(X -> Y) = supp(X U Y) / supp(X)

If conf(X -> Y) is 75%, it implies that 75% of the transactions containing X also contain Y. It is the conditional probability P(Y|X): the probability of finding itemset Y in a transaction given that the transaction already contains itemset X.
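
The conditional-probability reading can be sketched by restricting attention to the transactions that contain X and then checking how many of those also contain Y. The `confidence` helper and the toy data are assumptions made for illustration.

```python
def confidence(antecedent, consequent, transactions):
    """Share of transactions containing `antecedent` that also contain `consequent`."""
    antecedent, consequent = set(antecedent), set(consequent)
    with_x = [t for t in transactions if antecedent <= set(t)]
    if not with_x:
        raise ValueError("antecedent never occurs in the transactions")
    with_both = [t for t in with_x if consequent <= set(t)]
    return len(with_both) / len(with_x)

# Hypothetical toy data.
transactions = [
    {"milk", "diapers", "beer"},
    {"milk", "diapers"},
    {"milk", "bread"},
    {"diapers", "beer"},
    {"bread", "chips"},
]

conf_diapers_beer = confidence({"diapers"}, {"beer"}, transactions)  # 2 of 3
conf_beer_diapers = confidence({"beer"}, {"diapers"}, transactions)  # 2 of 2
print(conf_diapers_beer, conf_beer_diapers)
```

The two values differ, which shows that confidence depends on the direction of the rule.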

3. Lift: Lift expresses the likelihood of itemset Y being purchased when itemset X is purchased, while taking into account the popularity of Y.

Thus, lift(X -> Y) = supp(X U Y) / (supp(X) * supp(Y))

If the lift is greater than 1, Y is more likely to be bought when X is bought; a value less than 1 implies that Y is less likely to be bought when X is bought. A lift of exactly 1 means the two itemsets are independent.
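
To make the interpretation of lift concrete, the sketch below evaluates the formula for three sets of made-up support values. The `lift` helper and the numbers are assumptions chosen only to show a value above 1, around 1, and below 1.

```python
def lift(supp_xy, supp_x, supp_y):
    """lift(X -> Y) = supp(X U Y) / (supp(X) * supp(Y))."""
    return supp_xy / (supp_x * supp_y)

# Illustrative support values, not taken from any real dataset.
lift_positive = lift(0.40, 0.6, 0.4)     # X and Y co-occur more often than chance
lift_independent = lift(0.24, 0.6, 0.4)  # co-occurrence matches chance
lift_negative = lift(0.20, 0.6, 0.4)     # X and Y co-occur less often than chance

print(round(lift_positive, 3), round(lift_independent, 3), round(lift_negative, 3))
```

Because the formula is symmetric in X and Y, lift(X -> Y) equals lift(Y -> X), unlike confidence.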

4. Conviction: The conviction of a rule is defined as:

conv(X -> Y) = (1 - supp(Y)) / (1 - conf(X -> Y))

If the conviction is 1.4, the rule X -> Y would be incorrect 40% more often if the association between X and Y were purely due to chance.
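
A small sketch of the conviction formula follows. The `conviction` helper is hypothetical, the input values are made up to reproduce the 1.4 case, and returning infinity when the confidence is 1 is an assumed convention for a rule that is never wrong.

```python
import math

def conviction(supp_y, conf_xy):
    """conv(X -> Y) = (1 - supp(Y)) / (1 - conf(X -> Y))."""
    if conf_xy == 1:
        # The rule never fails, so the ratio is unbounded (assumed convention).
        return math.inf
    return (1 - supp_y) / (1 - conf_xy)

# Illustrative values: Y appears in 30% of transactions, and the rule holds
# for 50% of the transactions that contain X.
conv_example = conviction(0.3, 0.5)
conv_perfect = conviction(0.3, 1.0)
print(round(conv_example, 2), conv_perfect)
```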


### Steps in Apriori Algorithm

The steps in implementing the Apriori algorithm are:

1. Create a frequency table of all items that occur across all transactions.

2. Select only the significant items, that is, those whose support is greater than the threshold (for example, 50%).

3. Create all possible pairs of the significant items (remember that AB is the same as BA).

4. Select only the pairs that are significant (support > threshold).

5. Create triplets using another rule, called self-join: among the item pairs AB, AC, BC, BD, look for pairs with an identical first letter. From AB and AC we get ABC; from BC and BD we get BCD.

6. Find the frequency of the new triplets and keep only those whose support (for ABC or BCD) is greater than the threshold.

7. If two or more significant triplets remain, combine them to form groups of four, apply the threshold again, and continue.

8. Stop when no itemset formed by grouping meets the support threshold.
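
The steps above can be sketched in plain Python as follows. This is a simplified illustration, not the implementation used by any library: the function name `frequent_itemsets` is hypothetical, candidates are built by taking unions of frequent itemsets and pruning those with an infrequent subset (instead of the prefix-based self-join described in step 5), and itemsets are kept when their support is at least the threshold.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    def keep_frequent(candidates):
        scored = {c: support(c) for c in candidates}
        return {c: s for c, s in scored.items() if s >= min_support}

    # Steps 1-2: count single items and keep the frequent ones.
    singles = {frozenset([item]) for t in transactions for item in t}
    current = keep_frequent(singles)
    result = dict(current)

    k = 2
    while current:
        # Steps 3, 5, 7: combine frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a, b in combinations(current, 2) if len(a | b) == k}
        # Drop candidates that have any infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        # Steps 4, 6: keep only candidates that meet the threshold.
        current = keep_frequent(candidates)
        result.update(current)
        k += 1  # Step 8: the loop ends when no candidate survives.
    return result

# Hypothetical toy data.
toy = [{"A", "B", "C"}, {"A", "B", "C"}, {"A", "B", "D"}, {"B", "C", "D"}, {"A", "C"}]
found = frequent_itemsets(toy, min_support=0.5)
for itemset, supp in sorted(found.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), supp)
```

In this toy data the pairs AB, AC, and BC are frequent, but the triplet ABC appears in only two of five transactions, so the search stops at pairs.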

### Pros of Apriori algorithm
- Easy to understand and implement
- Can be used on large itemsets

### Cons of Apriori algorithm
- Can get computationally expensive if the number of candidate rules is large
- Calculating support is also expensive, since it requires a pass through the whole database

## Apriori implementation

In [2]:
# Install the apyori package

!pip install apyori

Collecting apyori
Installing collected packages: apyori
Successfully installed apyori-1.1.1


In [4]:
# Import all the required libraries

import numpy as np  
import matplotlib.pyplot as plt  
import pandas as pd  
from apyori import apriori

Dataset downloaded from https://drive.google.com/file/d/1y5DYn0dGoSbC22xowBq2d4po6h1JxcTQ/view


In [5]:
# Importing the data set

store_data = pd.read_csv('store_data.csv')

In [6]:
store_data.head()

Unnamed: 0,shrimp,almonds,avocado,vegetables mix,green grapes,whole weat flour,yams,cottage cheese,energy drink,tomato juice,low fat yogurt,green tea,honey,salad,mineral water,salmon,antioxydant juice,frozen smoothie,spinach,olive oil
0,burgers,meatballs,eggs,,,,,,,,,,,,,,,,,
1,chutney,,,,,,,,,,,,,,,,,,,
2,turkey,avocado,,,,,,,,,,,,,,,,,,
3,mineral water,milk,energy bar,whole wheat rice,green tea,,,,,,,,,,,,,,,
4,low fat yogurt,,,,,,,,,,,,,,,,,,,


Carefully looking at the data, we can see that the header is actually the first transaction. 
Each row corresponds to a transaction and each column corresponds to an item purchased in that specific transaction.

NaN indicates that the transaction contained fewer items than that column position. For example, a transaction with three items has NaN in every column from the fourth onward.

This dataset has no header row, but by default pd.read_csv treats the first row as the header. To fix this, pass header=None to pd.read_csv, as shown below:

In [7]:
store_data = pd.read_csv('store_data.csv', header=None)

In [8]:
store_data.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,shrimp,almonds,avocado,vegetables mix,green grapes,whole weat flour,yams,cottage cheese,energy drink,tomato juice,low fat yogurt,green tea,honey,salad,mineral water,salmon,antioxydant juice,frozen smoothie,spinach,olive oil
1,burgers,meatballs,eggs,,,,,,,,,,,,,,,,,
2,chutney,,,,,,,,,,,,,,,,,,,
3,turkey,avocado,,,,,,,,,,,,,,,,,,
4,mineral water,milk,energy bar,whole wheat rice,green tea,,,,,,,,,,,,,,,


Now we will use the Apriori algorithm to find out which items are commonly sold together, so that the store owner can place related items together or advertise them together in order to increase profit.

### Data Preprocessing

The Apriori library we are going to use requires our dataset to be in the form of a list of lists, where the whole dataset is a big list and each transaction in the dataset is an inner list within the outer big list. 

Currently we have the data in the form of a pandas DataFrame. To convert it into a list of lists, execute the following script:

In [9]:
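# Convert each row (one transaction) into a list of strings.
# Note: missing values (NaN) become the string 'nan'.
records = []
for i in range(len(store_data)):
    records.append([str(store_data.values[i, j]) for j in range(store_data.shape[1])])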
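
Because the conversion above turns missing values into the string 'nan', that string is treated as an item and can show up in rules. One possible variation, sketched here on a small stand-in for `records` (the rows are shortened and purely illustrative), filters those placeholders out. Running Apriori on the filtered list may produce a different set of rules from the output shown later in this article.

```python
# Shortened stand-in for the `records` list built above.
records = [
    ["burgers", "meatballs", "eggs", "nan"],
    ["chutney", "nan", "nan", "nan"],
    ["turkey", "avocado", "nan", "nan"],
]

# Keep only real items in each transaction.
cleaned_records = [[item for item in row if item != "nan"] for row in records]
print(cleaned_records)
```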

### Applying Apriori

The next step is to apply the Apriori algorithm on the dataset. To do so, we can use the apriori function that we imported from the apyori library.

The apriori function requires some parameter values to work.
- The first parameter is the list of lists that you want to extract rules from.
- The second parameter is min_support. It is used to select the itemsets with support of at least the specified value.
- Next, the min_confidence parameter keeps only those rules whose confidence is at least the specified threshold.
- Similarly, the min_lift parameter specifies the minimum lift value for the shortlisted rules.
- Finally, the min_length parameter is meant to specify the minimum number of items that you want in your rules. Note that apyori's documented keyword arguments are min_support, min_confidence, min_lift, and max_length, so min_length is likely ignored by the library.

Let's suppose that we want rules only for items that are purchased at least 5 times a day, or 7 x 5 = 35 times in one week, since our dataset covers a one-week period. The support for those items is 35/7500 ≈ 0.0047; we use the slightly lower value 0.0045 as the threshold. The minimum confidence for the rules is 20%, or 0.2. Similarly, we specify the value for lift as 3, and finally min_length is 2 since we want at least two products in our rules. These values are chosen mostly arbitrarily, so you can experiment with them and see how the resulting rules change.
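
The threshold arithmetic can be checked with a few lines of Python. The variable names are illustrative; 7500 is the round transaction count used in the text, while the loaded data has 7501 rows.

```python
purchases_per_day = 5
days_in_dataset = 7
n_transactions = 7500  # round figure from the text

weekly_purchases = purchases_per_day * days_in_dataset
exact_support = weekly_purchases / n_transactions
print(weekly_purchases, round(exact_support, 4))
```

The exact ratio is a little above the 0.0045 passed to the algorithm, so the threshold used in the script is slightly more permissive than "35 purchases per week".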

Execute the following script:


In [10]:
association_rules = apriori(records, min_support=0.0045, min_confidence=0.2, min_lift=3, min_length=2)  
association_results = list(association_rules)

For instance, from the first item we can see that light cream and chicken are commonly bought together. One possible explanation is that people who purchase light cream are careful about what they eat and are therefore more likely to buy chicken (white meat) than beef (red meat). Alternatively, light cream may simply be a common ingredient in chicken recipes.

The support value for the first rule is 0.0045. This number is calculated by dividing the number of transactions containing both light cream and chicken by the total number of transactions. The confidence for the rule is 0.2906, which shows that out of all the transactions that contain light cream, 29.06% also contain chicken. Finally, the lift of 4.84 tells us that chicken is 4.84 times more likely to be bought by customers who buy light cream, compared to the default likelihood of the sale of chicken.

The following script displays the rule, the support, the confidence, and the lift for each rule more clearly:

In [11]:
for item in association_results:

    # item[0] is the frozenset of items in the rule. A frozenset has no
    # guaranteed order, so the printed direction may not match the actual
    # antecedent -> consequent (those are in item[2][0][0] and item[2][0][1]).
    pair = item[0]
    items = [x for x in pair]
    print("Rule: " + items[0] + " -> " + items[1])

    # item[1] is the support of the itemset
    print("Support: " + str(item[1]))

    # item[2][0] is the first ordered statistic of the record;
    # its index 2 is the confidence and index 3 is the lift
    print("Confidence: " + str(item[2][0][2]))
    print("Lift: " + str(item[2][0][3]))
    print("=====================================")

Rule: chicken -> light cream
Support: 0.00453272896947
Confidence: 0.290598290598
Lift: 4.84395061728
Rule: escalope -> mushroom cream sauce
Support: 0.0057325689908
Confidence: 0.300699300699
Lift: 3.79083269672
Rule: pasta -> escalope
Support: 0.00586588454873
Confidence: 0.372881355932
Lift: 4.70081185016
Rule: herb & pepper -> ground beef
Support: 0.0159978669511
Confidence: 0.323450134771
Lift: 3.29199384113
Rule: tomato sauce -> ground beef
Support: 0.00533262231702
Confidence: 0.377358490566
Lift: 3.84065948132
Rule: olive oil -> whole wheat pasta
Support: 0.00799893347554
Confidence: 0.27149321267
Lift: 4.12241009764
Rule: pasta -> shrimp
Support: 0.00506599120117
Confidence: 0.322033898305
Lift: 4.50667214774
Rule: chicken -> nan
Support: 0.00453272896947
Confidence: 0.290598290598
Lift: 4.84395061728
Rule: frozen vegetables -> shrimp
Support: 0.00533262231702
Confidence: 0.232558139535
Lift: 3.25451232211
Rule: spaghetti -> ground beef
Support: 0.00479936008532
Confidence: 0.

The output lists each discovered rule along with its support, confidence, and lift.
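
The printed order of the two items comes from iterating over a frozenset, so it does not necessarily reflect which item is the antecedent. The sketch below shows one way to print the direction explicitly by reading the antecedent and consequent fields of each ordered statistic. To keep it runnable without the library or the dataset, it defines stand-in namedtuples that mirror the field names of apyori's result records; the `format_rules` helper is hypothetical. The example values are copied from the first rule in the output above, with the direction taken from the accompanying explanation.

```python
from collections import namedtuple

# Stand-ins mirroring the field names of apyori's result tuples.
RelationRecord = namedtuple("RelationRecord", ["items", "support", "ordered_statistics"])
OrderedStatistic = namedtuple(
    "OrderedStatistic", ["items_base", "items_add", "confidence", "lift"]
)

def format_rules(record):
    """Return one formatted line per directed rule stored in `record`."""
    lines = []
    for stat in record.ordered_statistics:
        base = ", ".join(sorted(stat.items_base))  # antecedent
        add = ", ".join(sorted(stat.items_add))    # consequent
        lines.append(
            f"Rule: {base} -> {add} | support={record.support:.4f} "
            f"confidence={stat.confidence:.4f} lift={stat.lift:.4f}"
        )
    return lines

example = RelationRecord(
    items=frozenset({"light cream", "chicken"}),
    support=0.00453272896947,
    ordered_statistics=[
        OrderedStatistic(
            items_base=frozenset({"light cream"}),
            items_add=frozenset({"chicken"}),
            confidence=0.290598290598,
            lift=4.84395061728,
        )
    ],
)

for line in format_rules(example):
    print(line)
```

With real results, the same function could be applied to each element of `association_results`, assuming the records expose these field names.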

**Explaining the second rule:**
The second rule states that mushroom cream sauce and escalope are frequently bought together.

- The support for the itemset {mushroom cream sauce, escalope} is 0.0057.
- The confidence for this rule is 0.3007, which means that out of all the transactions containing mushroom cream sauce, 30.07% also contain escalope.
- Finally, the lift of 3.79 shows that escalope is 3.79 times more likely to be bought by customers who buy mushroom cream sauce, compared to its default sale.
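
The three numbers for this rule are tied together by the definitions given earlier: confidence is the joint support divided by the support of the antecedent, and lift is the confidence divided by the support of the consequent. As a rough check, the individual supports can be recovered from the printed values. The variable names are illustrative, and the rule direction (mushroom cream sauce as antecedent) follows the explanation above.

```python
# Values printed for the second rule.
supp_both = 0.0057325689908
conf_rule = 0.300699300699
lift_rule = 3.79083269672

# conf = supp_both / supp(antecedent)  =>  supp(antecedent) = supp_both / conf
supp_sauce = supp_both / conf_rule
# lift = conf / supp(consequent)       =>  supp(consequent) = conf / lift
supp_escalope = conf_rule / lift_rule

print(round(supp_sauce, 4), round(supp_escalope, 4))
```

Under these definitions, mushroom cream sauce appears in roughly 1.9% of transactions and escalope in roughly 7.9%, which is the "default sale" that the lift compares against.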

### Conclusion

Association rule mining algorithms such as Apriori are very useful for finding simple associations between data items. They are easy to implement and highly explainable. However, for more advanced insights, such as those used by Google or Amazon, more complex approaches such as recommender systems are used. Still, this method is a very simple way to get basic associations if that is all your use case needs.
