# Association rule mining

In this notebook, we will look at the task of learning association rules, that is, learning rules of the type "customers who bought A and B also bought C".

Finding out which items are frequently bought together is a classical data mining task, also known as *Market Basket Analysis*. Association rule mining is one approach to it. Other methods for this and similar tasks exist within recommender systems, such as content-based filtering and collaborative filtering.

Applications of association rule mining include product recommendation, such as on Amazon or Netflix. (What happens at Amazon and Netflix is of course more advanced than the simple association rule mining we will see here.) Other applications include placing products next to each other in a physical store (for instance, the classical example of diapers and beer being bought together) or devising offers and advertisements.

## Defining association rule mining

For association rule mining, data is given as a set of "baskets" (or "transactions"), each containing a set of items. Association rule mining then tries to generate rules of the form "*if a customer has oatmeal and sugar in her basket, it is likely that she also has milk in her basket*".

We will use the `arules` package for association rule learning in R, so let us load it:

In [None]:
library(arules)
options(repr.plot.width=8, repr.plot.height=6)

As an example, we will use the dataset "Groceries" from the `arules` package.

In [None]:
data("Groceries")
str(Groceries)

Note that the Groceries dataset is not a data frame of the kind we are used to. It is an object of the class `transactions`, which the `arules` package provides to make the data easily available for association rule mining. To get a better grip on the data, we can use the `inspect` function or the classic `summary` function instead of `str`.

In [None]:
inspect(Groceries[1:6, ])

Note how the transaction data simply consists of customer transactions, i.e. "baskets" of items bought together. (Information about the quantity of each item is abstracted away in this representation, as it is not needed for the task.)

In [None]:
summary(Groceries)

Not surprisingly, the summary gives us some summary information about the data. It tells us the number of transactions/baskets (9835) and the total number of distinct items (169). It lists the most frequent items, i.e. how many transactions each of them occurs in. It also gives us the distribution of transaction/basket sizes, ranging from 1 to 32 items with a mean of 4.4 items per basket. (We can also see that the transactions are represented as an itemMatrix object in sparse format.)

We can also investigate the data further by asking how frequently the different items occur, for instance:

In [None]:
itemFrequency(Groceries)[1:4]

In [None]:
itemFrequencyPlot(Groceries, support = 0.07)

This plot shows the frequency of all items that have a frequency of at least 0.07 (set by the `support = 0.07` argument in the call to `itemFrequencyPlot`). We can see that the most frequently bought item is whole milk, with a frequency just above 0.25.
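
The numbers behind the plot can also be listed directly. The following is a small sketch, assuming the `arules` package is installed; `freq` is just a name chosen for illustration. It sorts the item frequencies, lists the items that clear the same 0.07 threshold, and shows absolute counts via the `type` argument of `itemFrequency`:

```r
library(arules)
data("Groceries")

# Relative frequency (= support) of each single item, largest first
freq <- sort(itemFrequency(Groceries), decreasing = TRUE)
round(head(freq, 5), 3)

# Items that clear the same 0.07 threshold used in the plot
names(freq[freq >= 0.07])

# Absolute counts (number of baskets) instead of proportions
head(sort(itemFrequency(Groceries, type = "absolute"), decreasing = TRUE), 3)
```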

## Association rules mining terminology

Some terminology is important here, so let us introduce it:

* **Association rule**. An *association rule* is of the form *"If LHS, then RHS"*, where LHS and RHS are both placeholders for sets of items.
    + For example: The rule "if {milk, chocolate} then {chili}" says that if milk and chocolate occur together in a basket, then the basket is likely to contain chili as well.
* **Support**: The *support* of a set of items is the proportion of all baskets where the particular combination of items occurred.
    + For example: if `support({milk, chocolate}) = 0.01`, it means that milk and chocolate occurred together in `1%` of the baskets.
    + The *support* of a rule "if LHS, then RHS" is the support of the item set `{LHS, RHS}`, i.e. `support("if LHS, then RHS") = support({LHS, RHS})`
    + High support tells us that the combination is frequent.
    + Too low a support tells us that the rule rests on few observations and is therefore uncertain, and also that we will rarely get to apply it, so it is not really an interesting rule.
    + If both the `LHS` and the `RHS` of a rule have high support on their own, the rule might merely reflect this rather than an actual association. If a lot of people buy milk and a lot of people buy bread, milk and bread will often appear together without there being a real "association".
* **Confidence**: The confidence of a rule is the proportion of times the `RHS` of a rule occurs when the `LHS` of the rule occurs.
    + For example: `confidence("if {milk, chocolate} then {chili}") = 0.5` means that `50%` of the times milk and chocolate occurred together, chili also occurred.
    + In other words, it is the number of baskets containing both `LHS` and `RHS` divided by the number of baskets containing `LHS`
    + Alternatively, in math: `confidence("if LHS, then RHS") = support({LHS, RHS})/support(LHS)`
    + High confidence says something about how reliable the rule is, i.e. how confident we can be in it.

* **Lift**: The *lift* is another measure of the "quality" of an association rule.
    + Greater lift values indicate stronger associations.
    + The definition: `lift("if LHS, then RHS") = support({LHS, RHS}) / (support(LHS) * support(RHS))`
    + Alternatively: `lift("if LHS, then RHS") = confidence("if LHS, then RHS") / support(RHS)`
    + Intuitively, the lift is the "lift" in the likelihood of `RHS` when we know that `LHS` occurred, compared to the likelihood of `RHS` in general. A lift of 1 means that `LHS` and `RHS` occur together exactly as often as we would expect if they were independent.
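
To make the three measures concrete, here is a minimal sketch in base R that computes them by hand. The five baskets are made up for illustration, and the helper `support` is a hypothetical function written for this example, not part of `arules`:

```r
# Toy baskets (made-up data, for illustration only)
baskets <- list(
  c("milk", "chocolate", "chili"),
  c("milk", "chocolate"),
  c("milk", "bread"),
  c("bread", "chili"),
  c("milk", "chocolate", "chili", "bread")
)

# Support: proportion of baskets that contain every item in `items`
support <- function(items, baskets) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

lhs <- c("milk", "chocolate")
rhs <- "chili"

supp_rule <- support(c(lhs, rhs), baskets)      # support({LHS, RHS})
conf_rule <- supp_rule / support(lhs, baskets)  # confidence of the rule
lift_rule <- conf_rule / support(rhs, baskets)  # lift of the rule

round(c(support = supp_rule, confidence = conf_rule, lift = lift_rule), 3)
```

Here milk, chocolate and chili occur together in 2 of 5 baskets (support 0.4), milk and chocolate occur in 3 baskets (confidence 2/3), and chili occurs in 3 baskets, so the lift is (2/3) / 0.6, roughly 1.11.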

## Finding association rules

One of the most common algorithms for finding association rules is the *Apriori algorithm*. In the `arules` package this algorithm is implemented in the `apriori` function. To apply the `apriori` function in R, we need to set a minimum for support and confidence. These limits will depend on the data and the domain.
* We want a minimum support that is low, but still high enough to make a rule useful (the `LHS` must occur often enough for us to recommend the `RHS`).
* We want high confidence, such that there is a fair amount of certainty in the rule.
* However, in the end, the rules with the highest lift are likely the most interesting ones.
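
To see how the two thresholds act, the sketch below enumerates every rule of the form "if {a} then {b}" on a handful of made-up baskets and keeps only those that pass both minimums. This is a brute-force illustration in base R, not the Apriori algorithm itself, and the names `cand`, `kept`, `min_support` and `min_confidence` are chosen for this example only:

```r
# Toy baskets (made-up data, for illustration only)
baskets <- list(
  c("milk", "bread"),
  c("milk", "bread", "butter"),
  c("milk", "butter"),
  c("bread"),
  c("milk", "bread", "butter")
)
items <- sort(unique(unlist(baskets)))

# Proportion of baskets containing every item in `set`
support <- function(set) {
  mean(vapply(baskets, function(b) all(set %in% b), logical(1)))
}

# All candidate rules "if {lhs} then {rhs}" with lhs != rhs
cand <- expand.grid(lhs = items, rhs = items, stringsAsFactors = FALSE)
cand <- cand[cand$lhs != cand$rhs, ]

cand$support    <- mapply(function(a, b) support(c(a, b)),
                          cand$lhs, cand$rhs, USE.NAMES = FALSE)
cand$confidence <- cand$support /
  vapply(cand$lhs, support, numeric(1), USE.NAMES = FALSE)
cand$lift       <- cand$confidence /
  vapply(cand$rhs, support, numeric(1), USE.NAMES = FALSE)

# Keep only the rules that pass both thresholds, highest lift first
min_support    <- 0.5
min_confidence <- 0.8
kept <- cand[cand$support >= min_support & cand$confidence >= min_confidence, ]
kept[order(-kept$lift), ]
```

With these thresholds only "if {butter} then {milk}" survives: every basket with butter also contains milk (confidence 1), and the pair occurs in 3 of 5 baskets. Lowering `min_confidence` to 0.75 would let several more rules through, which is the kind of trade-off the limits control.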

The `apriori` function can now be called in the following way:

In [None]:
apriori(Groceries, parameter = list(support = 0.01, confidence = 0.5))

We can also try other parameter values:

In [None]:
apriori(Groceries, parameter = list(support = 0.005, confidence = 0.5))

In [None]:
apriori(Groceries, parameter = list(support = 0.005, confidence = 0.6))

Let us stick to a support of 0.01 and a confidence of 0.5 and investigate the rules it returns:

In [None]:
rules <- apriori(Groceries, parameter = list(support = 0.01, confidence = 0.5))

In [None]:
rules

In [None]:
inspect(rules[1:4])

In [None]:
summary(rules)

From this we can see that we got 15 rules, all of length 3. We also get some descriptive statistics on the support, confidence and lift of these 15 rules.

We can now sort the rules by their lift to find the most valuable ones. (We turn it into a data frame for easier reading in Jupyter Notebooks.)

In [None]:
as.data.frame(inspect(sort(rules, by = "lift")))
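
Sorting is one way to focus on the interesting rules; filtering is another. As a small sketch (the lift cutoff of 2.5 and the names `strong` and `milk_rules` are arbitrary choices for illustration, not recommendations), `subset` can keep only the rules that satisfy a condition, and `quality` returns the measures of a rule set as a data frame:

```r
library(arules)
data("Groceries")

# Same thresholds as used in this notebook; verbose = FALSE hides the progress log
rules <- apriori(Groceries,
                 parameter = list(support = 0.01, confidence = 0.5),
                 control = list(verbose = FALSE))

# Keep only the rules whose lift exceeds the (arbitrary) cutoff
strong <- subset(rules, subset = lift > 2.5)
quality(strong)[, c("support", "confidence", "lift")]

# Keep only the rules that recommend whole milk
milk_rules <- subset(rules, subset = rhs %in% "whole milk")
inspect(milk_rules)
```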

### Exercise

The `arules` package comes with another dataset named `Epub`. Load that dataset using `data(Epub)` and do association rule mining on it. What are the three rules with the highest lift?

## Example involving transaction data

Until now we have only used data that was already in the right format for the `arules` package. However, real data rarely comes this way. So in this example we will look at some typical transaction data, load it into R and transform it into a format that suits the `arules` package.

We will load a typical transaction dataset used as sample data by the Tableau software (https://community.tableau.com/s/contentdocument/0694T000001ivFbQAI), which is also available on the Jupyter Hub or Moodle under the name "Global Superstore 2018.xlsx".

In [None]:
library(readxl)
transactionData <- read_excel("Global Superstore 2018.xlsx")

In [None]:
head(transactionData)

We are only interested in the `Order ID` and `Product ID` columns, as these tell us which products are bought together in an order. From them we can recreate the baskets for an association rule mining task. We also do a little cleaning of the data.

In [None]:
library(tidyverse)

In [None]:
transactionData <- transactionData %>% select('Order ID', 'Product ID')

In [None]:
head(transactionData)

In [None]:
names(transactionData) <- c("OrderID", "ProductID")

In [None]:
str(transactionData)

To transform the data into a proper transaction class for the `apriori` function, it is easiest to write the data out to a file and load it in using the `read.transactions` function.

In [None]:
write.csv(transactionData, file = "afile.csv", row.names=FALSE)

In [None]:
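# The file is comma-separated with a header row; column 1 is the order, column 2 the product
transData <- read.transactions("afile.csv", format = "single", sep = ",", header = TRUE, cols = c(1, 2))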

By inspecting the data we can see that it is now in a familiar format we can work with and use as input for the `apriori` function:

In [None]:
inspect(transData[1:4])

In [None]:
summary(transData)
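
As an alternative to the round trip via a file, a data frame with one row per order line can also be converted in memory. The following is a minimal sketch on a small made-up data frame (`orders`, `basket_list` and `trans` are names chosen for illustration); it assumes the same two-column layout as `transactionData`:

```r
library(arules)

# Hypothetical order lines in the same two-column layout as `transactionData`
orders <- data.frame(
  OrderID   = c("O1", "O1", "O2", "O3", "O3", "O3"),
  ProductID = c("P1", "P2", "P1", "P2", "P3", "P3"),
  stringsAsFactors = FALSE
)

# One character vector of products per order. unique() drops products that
# are repeated within an order, since duplicated items in a basket can
# trigger an error or warning in the coercion below.
basket_list <- lapply(split(orders$ProductID, orders$OrderID), unique)

# Coerce the list of baskets to the transactions class
trans <- as(basket_list, "transactions")
inspect(trans)
```

The same two lines, `split` followed by `as(..., "transactions")`, should carry over to `transactionData$ProductID` and `transactionData$OrderID`.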

### Exercise

1. For the `transData` data, make an item frequency plot.
2. Do association rule mining on `transData` using the `apriori` function.