# Association Rule Mining


In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

import plotly
import plotly.graph_objs as go

from tqdm.notebook import tqdm
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
%matplotlib inline

Scikit-Learn does not support association rule learning. Fortunately, [Sebastian Raschka](https://sebastianraschka.com) (a personal hero of mine) implemented it (and many other cool things) in his library *mlxtend*, which aims to be as Scikit-Learn compatible as possible.

You can find examples for generating frequent itemsets with apriori [here](http://rasbt.github.io/mlxtend/user_guide/frequent_patterns/apriori/) and for association rule mining [here](http://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/).

## Manually
But first, we do a little manual calculation. You are given the following dataset of transactions.

In [None]:
transactions = [['oats', 'lego', 'teddybear', 'rc car'],
                ['oats', 'red coat', 'gloves', 'teddybear', 'doll', 'warm boot'],
                ['lego', 'red jelly bag cap', 'rc car', 'doll'],
                ['lego', 'oats', 'large red bag', 'gift wrap paper', 'warm boot']]

transactions = pd.DataFrame(data={"Items":transactions}, index=range(1,5))
transactions.index.name = 'Id'

with pd.option_context('display.max_colwidth', 80):
    print(transactions)

> Calculate the support for lego, oats and doll (manually or by code, your choice).

*Click on the dots to display the solution*

In [None]:
# support_lego = 3/4
# support_oats = 3/4
# support_doll = 2/4

# or with code, for example:
support = {}
for item in ['lego', 'oats', 'doll']:
    support[item] = transactions.Items.map(lambda x: item in x).sum() / transactions.shape[0] # fraction of transactions containing the item
support

> Calculate the confidence of `['lego', 'oats'] -> ['teddybear']`

*Click on the dots to display the solution*

In [None]:
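# confidence(['lego', 'oats'] -> ['teddybear']) = support(lego, oats, teddybear) / support(lego, oats) = 0.25 / 0.5
# Only transaction 1 contains all three items, so the numerator is 1/4 = 0.25.
0.25 / (transactions.Items.map(lambda x: 'lego' in x and 'oats' in x).sum() / transactions.shape[0])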
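
The two measures generalise to arbitrary itemsets. The following is a minimal sketch in plain Python, independent of mlxtend; the names `baskets`, `support` and `confidence` are illustrative helpers and not part of any library.

```python
# Toy transactions from above, stored as plain sets (illustrative variable name).
baskets = [
    {'oats', 'lego', 'teddybear', 'rc car'},
    {'oats', 'red coat', 'gloves', 'teddybear', 'doll', 'warm boot'},
    {'lego', 'red jelly bag cap', 'rc car', 'doll'},
    {'lego', 'oats', 'large red bag', 'gift wrap paper', 'warm boot'},
]

def support(itemset, baskets):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Support of antecedent and consequent together, divided by support of the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)

print(support({'lego'}, baskets))                            # 0.75
print(confidence({'lego', 'oats'}, {'teddybear'}, baskets))  # 0.5
```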

Now apply the Apriori algorithm: find the frequent itemsets with a minimum support of 0.5, then the rules with a minimum confidence of 0.75. Here is the dataset again:

In [None]:
with pd.option_context('display.max_colwidth', 80):
    print(transactions)

> **Step 1**: Generate frequent item sets satisfying the support threshold (hint: there are 6 itemsets of length 1 and 4 itemsets of length 2)

*Click on the dots to display the solution*

In [None]:
# Execute the following code to show the solution. We will see how to use this library in a minute.
te = TransactionEncoder()
te_ary = te.fit_transform(transactions.Items.values.tolist())
df = pd.DataFrame(te_ary, columns=te.columns_)

freq_itemsets = apriori(df, use_colnames=True, min_support=0.5)
freq_itemsets
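
For intuition, the level-wise search behind Step 1 can be sketched in plain Python. This is a simplified illustration of the Apriori idea, not mlxtend's implementation; `baskets` and `frequent_itemsets` are hypothetical names.

```python
from itertools import combinations

# Toy transactions from above, stored as plain sets (illustrative variable name).
baskets = [
    {'oats', 'lego', 'teddybear', 'rc car'},
    {'oats', 'red coat', 'gloves', 'teddybear', 'doll', 'warm boot'},
    {'lego', 'red jelly bag cap', 'rc car', 'doll'},
    {'lego', 'oats', 'large red bag', 'gift wrap paper', 'warm boot'},
]

def frequent_itemsets(baskets, min_support):
    """Return {itemset: support} for all itemsets with support >= min_support."""
    n = len(baskets)

    def supp(items):
        return sum(items <= basket for basket in baskets) / n

    # Level 1: frequent single items.
    singles = {frozenset([item]) for basket in baskets for item in basket}
    level = {s: supp(s) for s in singles if supp(s) >= min_support}
    result = dict(level)

    k = 2
    while level:
        # Join step: unions of two frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in level for sub in combinations(c, k - 1))}
        level = {c: supp(c) for c in candidates if supp(c) >= min_support}
        result.update(level)
        k += 1
    return result

found = frequent_itemsets(baskets, min_support=0.5)
sizes = [len(itemset) for itemset in found]
print(sizes.count(1), sizes.count(2))  # 6 4
```

The prune step is what keeps the search small: no itemset of length 3 is ever counted here, because each candidate contains at least one infrequent pair.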

> **Step 2**: Extract rules from frequent item sets satisfying the confidence threshold (hint: there are three rules)

There are 8 candidates: each of the 4 itemsets with two items yields two candidate rules, one per direction.

Of these 8 candidates, 5 have a confidence of 0.5/0.75 ≈ 0.67, which is below the threshold, and 3 have a confidence of 0.5/0.5 = 1, which is above it.

*Click on the dots to display the solution*

In [None]:
# Execute the following code to show the solution
association_rules(freq_itemsets, metric='confidence', min_threshold=0.75)
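
The candidate rules can also be checked by hand. The sketch below hard-codes the supports of the frequent itemsets found in Step 1 and evaluates both directions of every frequent pair; the variable names are illustrative.

```python
# Supports of the frequent items and pairs from Step 1 (toy dataset).
item_support = {'oats': 0.75, 'lego': 0.75, 'teddybear': 0.5,
                'rc car': 0.5, 'warm boot': 0.5}
pair_support = {('lego', 'oats'): 0.5, ('oats', 'teddybear'): 0.5,
                ('oats', 'warm boot'): 0.5, ('lego', 'rc car'): 0.5}

min_confidence = 0.75
kept = []
for (a, b), s_pair in pair_support.items():
    # Each pair gives two candidate rules: a -> b and b -> a.
    for antecedent, consequent in ((a, b), (b, a)):
        conf = s_pair / item_support[antecedent]
        verdict = 'keep' if conf >= min_confidence else 'drop'
        print(f'{antecedent} -> {consequent}: confidence {conf:.2f} ({verdict})')
        if conf >= min_confidence:
            kept.append((antecedent, consequent))
```

Only rules whose antecedent has support 0.5 reach the threshold, since 0.5/0.5 = 1 while 0.5/0.75 is roughly 0.67.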

Ok, enough manual calculation with a toy example for today. Let's work with a bigger dataset.

## Automated
You are given some transactional data about purchases in a supermarket.

In [None]:
transactions = pd.read_csv('acostasg.csv')
transactions.columns = ['Date', 'Transaction', 'Item']
transactions.head()

In [None]:
transactions.shape

There is a kind of placeholder item *'all- purpose'* (notice the space after the dash) in the data which appears multiple times in some transactions. 
> Remove rows with this item. 

*Click on the dots to display the solution*

In [None]:
transactions = transactions[transactions.Item != 'all- purpose']

In [None]:
transactions.shape

### Group by transaction ID
We group the data by transaction ID and aggregate purchases into a list (the Date is constant for a single transaction).

In [None]:
transactions = transactions.groupby('Transaction').agg({'Date':lambda x: x.iloc[0] ,'Item':list})
transactions.head()

### Calculate size for each transaction
We also calculate the size for each transaction.
> Use `map` to apply the function `len` to each row of the *Item* column.

In [None]:
#transactions['Size'] = 

*Click on the dots to display the solution*

In [None]:
transactions['Size'] = transactions['Item'].map(len)
transactions.head()

### Statistics

In [None]:
transactions.describe()

In [None]:
transactions.hist()

The mlxtend library offers a function to turn a list of transactions into the required binary transaction format: 

In [None]:
te = TransactionEncoder()
te_binary = te.fit_transform(transactions.Item)

df = pd.DataFrame(te_binary, columns=te.columns_)
df.head()
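
Conceptually, the binary format has one row per transaction and one boolean column per distinct item. A minimal plain-Python sketch of that idea follows; it is not mlxtend's implementation, `one_hot` is a hypothetical helper, and the two short transactions are made up from items of the toy example.

```python
def one_hot(transactions):
    """Return (columns, rows): sorted item names and one boolean row per transaction."""
    columns = sorted({item for t in transactions for item in t})
    rows = []
    for t in transactions:
        present = set(t)
        rows.append([item in present for item in columns])
    return columns, rows

columns, rows = one_hot([['oats', 'lego'], ['lego', 'doll']])
print(columns)  # ['doll', 'lego', 'oats']
print(rows)     # [[False, True, True], [True, True, False]]
```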

### Association Rule Mining
> Generate frequent itemsets with a minimum support of 0.05. Look at the examples linked above or given in the solutions of the toy example for hints.

In [None]:
#freq_itemsets = 

*Click on the dots to display the solution*

In [None]:
freq_itemsets = apriori(df, min_support=0.05, use_colnames=True)

> Now extract all association rules with a confidence threshold of 0.5.

In [None]:
#rules = 

*Click on the dots to display the solution*

In [None]:
rules = association_rules(freq_itemsets, metric='confidence', min_threshold=0.5)
rules.head()

> Sort this so that the rules with the highest lift are at the top and print the top ten rules.

*Click on the dots to display the solution*

In [None]:
rules.sort_values('lift', ascending=False).head(n=10)

You now have a list of rules that are frequent enough to be interesting (support >= 0.05) and trustworthy (confidence >= 0.5), ordered by the strength of the association (lift) between antecedents and consequents.
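
As a reminder of how lift relates to the other two measures, here is a short sketch assuming the usual definition lift(A -> C) = confidence(A -> C) / support(C), which equals support(A and C) / (support(A) * support(C)). The function name `lift` is illustrative, and the numbers are taken from the toy example at the top of the notebook.

```python
def lift(support_both, support_antecedent, support_consequent):
    """Lift of A -> C; values above 1 mean A and C co-occur more often than if independent."""
    return support_both / (support_antecedent * support_consequent)

# rc car -> lego in the toy data: support(both) = 0.5, support(rc car) = 0.5, support(lego) = 0.75
print(round(lift(0.5, 0.5, 0.75), 2))   # 1.33

# oats -> lego in the toy data: support(both) = 0.5, support(oats) = 0.75, support(lego) = 0.75
print(round(lift(0.5, 0.75, 0.75), 2))  # 0.89
```

A lift below 1, as for oats and lego, indicates that the two items appear together less often than independence would suggest.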

Finally, let us display our rule set with the three measures support, confidence and lift.

In [None]:
plt.subplots(figsize=(10, 8))
plt.scatter(rules.support, rules.confidence, c=rules.lift, s=5)
plt.xlabel('Support')
plt.ylabel('Confidence')
plt.title('Scatter Plot for Support vs. Confidence vs. Lift')
plt.colorbar(label='Lift')  # the colour encodes lift
plt.show()

We can filter the rules the following way:

In [None]:
rules.query('support > 0.05 and confidence > 0.9 and lift > 1.3')

## ILIAS Quiz

> Now filter for the rules in the top right corner (support greater than 0.2 and confidence greater than 0.7). What do these rules have in common? Answer the question on ILIAS.

> Let's look at the rules which have a lift greater than 1.6. Are these rules interesting?

That's it for the topic of Association Rules!