# Exercise 09 Association Rule Mining - Solution

## Pedagogy

This notebook contains both theoretical explanations and executable code cells.

When you see the <span style="color:red">**[TBC]**</span> (To Be Completed) sign, it means that you need to do something besides executing the existing code cells. These actions can be:
- Complete the code with proper comments
- Respond to a question
- Write an analysis
- etc.

### Import libraries

In [1]:
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

## Part 1. Association rule mining using a toy dataset

In this part, we will perform association rule mining on a toy dataset that we define ourselves.

This toy dataset records the transaction history from a supermarket.

### Create the dataset

The `dataset` is a list of lists.

Each item in the `dataset` list is a transaction represented by a list.

Each item in the transaction list is a product purchased in this transaction.

In [2]:
# create the dataset
dataset = [
    ['Apple', 'Beer', 'Rice', 'Drumstick'],
    ['Apple', 'Beer', 'Rice'],
    ['Apple', 'Beer'],
    ['Apple', 'Pear'],
    ['Feeder', 'Beer', 'Rice', 'Drumstick'],
    ['Feeder', 'Beer', 'Rice'],
    ['Feeder', 'Beer'],
    ['Feeder', 'Pear']
]

In [3]:
# get the type of the dataset
type(dataset)

list

In [4]:
# get the type of the 1st item in the dataset
type(dataset[0])

list

### Pre-processing the data

To perform association rule mining on this dataset, we need to transform the dataset into a one-hot-encoded DataFrame.

[One-hot encoding](https://en.wikipedia.org/wiki/One-hot) is a technique used to convert categorical data into a numerical format.

It creates a binary column for each category, where only one of these columns is "hot" (set to 1 or `True`) while all others are "cold" (set to 0 or `False`) for each example in the dataset.

Assuming that we have a dataset that records the nationalities of different people, which looks like:

||Name|Nationality|
|-|-|-|
|0|Ameli|French|
|1|Jack|Canadian|
|2|Yoko|Japanese|

If we want to train a classifier to predict the nationality of different people, we need to convert the categorical variable into a numerical one. A common way to do that is to use different numbers to represent different categories.

||Name|Nationality|Encoded nationality|
|-|-|-|-|
|0|Ameli|French|1|
|1|Jack|Canadian|2|
|2|Yoko|Japanese|3|

Unlike the integer encoding above, which implies an artificial ordering between the categories, one-hot encoding creates a binary column for each nationality. Only the person's actual nationality is set to 1 or `True`.

||Name|Nationality|French|Canadian|Japanese|
|-|-|-|-|-|-|
|0|Ameli|French|True|False|False|
|1|Jack|Canadian|False|True|False|
|2|Yoko|Japanese|False|False|True|

`sklearn` provides the `sklearn.preprocessing.OneHotEncoder` for this purpose.
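As a library-free illustration, the one-hot table above can also be rebuilt in a few lines of plain Python. This is only a sketch: the names `people`, `categories`, and `one_hot` are chosen here for illustration and are not part of `sklearn` or `Mlxtend`.

```python
# People and their nationalities, copied from the table above.
people = [("Ameli", "French"), ("Jack", "Canadian"), ("Yoko", "Japanese")]

# One column per distinct nationality, kept in order of first appearance.
categories = list(dict.fromkeys(nationality for _, nationality in people))

# For each person, exactly one column is True (the "hot" one).
one_hot = {
    name: {category: nationality == category for category in categories}
    for name, nationality in people
}

for name, row in one_hot.items():
    print(name, row)
```

Each printed row contains exactly one `True`, which is the defining property of one-hot encoding.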

As we are using `Mlxtend` for association rule mining today, we will use `mlxtend.preprocessing.TransactionEncoder` to perform one-hot encoding. You can find the documentation [here](https://rasbt.github.io/mlxtend/user_guide/preprocessing/TransactionEncoder/#api).

In [5]:
# one-hot encoding
te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
df = pd.DataFrame(te_ary, columns = te.columns_)
df

Unnamed: 0,Apple,Beer,Drumstick,Feeder,Pear,Rice
0,True,True,True,False,False,True
1,True,True,False,False,False,True
2,True,True,False,False,False,False
3,True,False,False,False,True,False
4,False,True,True,True,False,True
5,False,True,False,True,False,True
6,False,True,False,True,False,False
7,False,False,False,True,True,False
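To make the encoding step less of a black box, here is a minimal plain-Python sketch of the same idea. It is not the `TransactionEncoder` implementation; it only assumes that column labels are the distinct items sorted alphabetically, which matches the column order in the output above. Unlike the nationality example, a transaction row can contain several `True` values, one per purchased product.

```python
# The toy dataset from this notebook.
dataset = [
    ['Apple', 'Beer', 'Rice', 'Drumstick'],
    ['Apple', 'Beer', 'Rice'],
    ['Apple', 'Beer'],
    ['Apple', 'Pear'],
    ['Feeder', 'Beer', 'Rice', 'Drumstick'],
    ['Feeder', 'Beer', 'Rice'],
    ['Feeder', 'Beer'],
    ['Feeder', 'Pear'],
]

# Column labels: every distinct item, sorted alphabetically.
columns = sorted({item for transaction in dataset for item in transaction})

# One boolean row per transaction: True where the item was purchased.
encoded = [[item in transaction for item in columns] for transaction in dataset]

print(columns)
for row in encoded:
    print(row)
```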


### Get frequent itemsets

We will use the `mlxtend.frequent_patterns.apriori` API to extract frequent itemsets from the one-hot-encoded DataFrame using the Apriori algorithm. You can find the documentation [here](https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/apriori/).

The extracted frequent itemsets will be used to generate association rules.

We need to specify the minimum support threshold above which an itemset is considered frequent.

The support is computed as the fraction `transactions_where_item(s)_occur / total_transactions`.
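The fraction above can be checked by hand on the toy dataset. The helper `support` below is a hypothetical function written for this illustration, not part of `Mlxtend`; it counts the transactions that contain every item of a given itemset.

```python
# The toy dataset from this notebook.
dataset = [
    ['Apple', 'Beer', 'Rice', 'Drumstick'],
    ['Apple', 'Beer', 'Rice'],
    ['Apple', 'Beer'],
    ['Apple', 'Pear'],
    ['Feeder', 'Beer', 'Rice', 'Drumstick'],
    ['Feeder', 'Beer', 'Rice'],
    ['Feeder', 'Beer'],
    ['Feeder', 'Pear'],
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for transaction in transactions if itemset.issubset(transaction))
    return hits / len(transactions)

print(support({'Beer'}, dataset))          # Beer appears in 6 of 8 transactions
print(support({'Beer', 'Rice'}, dataset))  # both appear together in 4 of 8
print(support({'Pear'}, dataset))          # Pear appears in 2 of 8
```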

In [6]:
# get frequent itemsets with at least 30% support
frequent_itemsets = apriori(df, min_support = 0.3)
frequent_itemsets

Unnamed: 0,support,itemsets
0,0.5,(0)
1,0.75,(1)
2,0.5,(3)
3,0.5,(5)
4,0.375,"(0, 1)"
5,0.375,"(1, 3)"
6,0.5,"(1, 5)"


By default, `mlxtend.frequent_patterns.apriori` returns the column indices of the items.

For better readability, we can set `use_colnames = True` to convert these integer values into the respective item names.

In [7]:
# get frequent itemsets with at least 30% support, labelled by item name
frequent_itemsets = apriori(df, min_support = 0.3, use_colnames = True)
frequent_itemsets

Unnamed: 0,support,itemsets
0,0.5,(Apple)
1,0.75,(Beer)
2,0.5,(Feeder)
3,0.5,(Rice)
4,0.375,"(Beer, Apple)"
5,0.375,"(Beer, Feeder)"
6,0.5,"(Beer, Rice)"


The results show that, among the six products in our dataset, `Apple`, `Beer`, `Feeder`, and `Rice` are frequent single-item itemsets.

`Apple & Beer`, `Feeder & Beer`, and `Rice & Beer` are frequent two-item itemsets.

`Drumstick` and `Pear` each appear in only 2 of the 8 transactions (support 0.25), so they fall below the 0.3 threshold. With a different `min_support` value, the results may differ.
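To see how the threshold changes the result, the sketch below counts frequent itemsets for several `min_support` values. It uses brute-force enumeration of every possible itemset rather than the Apriori algorithm, which is only practical here because the toy dataset has six items. The function name `frequent_itemsets_bruteforce` is hypothetical and introduced for this illustration.

```python
from itertools import combinations

# The toy dataset from this notebook.
dataset = [
    ['Apple', 'Beer', 'Rice', 'Drumstick'],
    ['Apple', 'Beer', 'Rice'],
    ['Apple', 'Beer'],
    ['Apple', 'Pear'],
    ['Feeder', 'Beer', 'Rice', 'Drumstick'],
    ['Feeder', 'Beer', 'Rice'],
    ['Feeder', 'Beer'],
    ['Feeder', 'Pear'],
]

def frequent_itemsets_bruteforce(transactions, min_support):
    """Map every itemset with support >= min_support to its support."""
    items = sorted({item for transaction in transactions for item in transaction})
    total = len(transactions)
    result = {}
    # Try every non-empty combination of items (no Apriori pruning).
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            count = sum(1 for t in transactions if set(candidate).issubset(t))
            if count / total >= min_support:
                result[candidate] = count / total
    return result

for threshold in (0.25, 0.3, 0.5):
    found = frequent_itemsets_bruteforce(dataset, threshold)
    print(threshold, len(found), sorted(found))
```

Lowering the threshold admits rarer itemsets, while raising it keeps only the most common ones.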

### Generate association rules

We will generate association rules from these frequent itemsets using `mlxtend.frequent_patterns.association_rules`. You can find the documentation [here](https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/).

A set of metrics is calculated for each association rule:

$support(A\rightarrow C)=support(A\cup C)$, range: \[0, 1\]

$confidence(A\rightarrow C)=\frac{support(A\rightarrow C)}{support(A)}$, range: \[0, 1\]

$lift(A\rightarrow C)=\frac{confidence(A\rightarrow C)}{support(C)}$, range: \[0, $\infty$)

In [8]:
# generate association rules
rules = association_rules(frequent_itemsets, metric = "confidence", min_threshold = 0.7)
rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,(Apple),(Beer),0.5,0.75,0.375,0.75,1.0,0.0,1.0,0.0
1,(Feeder),(Beer),0.5,0.75,0.375,0.75,1.0,0.0,1.0,0.0
2,(Rice),(Beer),0.5,0.75,0.5,1.0,1.333333,0.125,inf,0.5
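The support, confidence, and lift values in the table above can be reproduced by hand from the formulas. The sketch below does this in plain Python; `support` and `rule_metrics` are hypothetical helpers written for this illustration and are not `Mlxtend` functions.

```python
# The toy dataset from this notebook.
dataset = [
    ['Apple', 'Beer', 'Rice', 'Drumstick'],
    ['Apple', 'Beer', 'Rice'],
    ['Apple', 'Beer'],
    ['Apple', 'Pear'],
    ['Feeder', 'Beer', 'Rice', 'Drumstick'],
    ['Feeder', 'Beer', 'Rice'],
    ['Feeder', 'Beer'],
    ['Feeder', 'Pear'],
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for transaction in dataset if set(itemset).issubset(transaction))
    return hits / len(dataset)

def rule_metrics(antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    joint = support(antecedent | consequent)   # support(A -> C) = support(A u C)
    confidence = joint / support(antecedent)   # support(A -> C) / support(A)
    lift = confidence / support(consequent)    # confidence(A -> C) / support(C)
    return joint, confidence, lift

print("Rice -> Beer:", rule_metrics({'Rice'}, {'Beer'}))
print("Beer -> Rice:", rule_metrics({'Beer'}, {'Rice'}))
```

The reverse rule `Beer -> Rice` has the same support and lift as `Rice -> Beer`, but its confidence is 0.5 / 0.75, which is below the 0.7 threshold, so it does not appear in the table.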


## Part 2. Hands-on Exercise

In this part, we will perform association rule mining on a real-world dataset.

This dataset contains details of more than 9000 viewers watching TV shows.

Each row represents all the TV shows a viewer has watched.

Please perform association rule mining on this dataset.

In [9]:
# load dataset
raw_data = pd.read_csv('TV_Shows.csv', header = None)
raw_data.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,22,23,24,25,26,27,28,29,30,31
0,Cobra Kai,Lupin,12 Monkeys,Sherlock,,,,,,,...,,,,,,,,,,
1,Lost,Jack Ryan,The Flash,Game of thrones,House of Cards,12 Monkeys,Vikings,Fringe,The Mentalist,The Alienist,...,,,,,,,,,,
2,Sex Education,Dr. House,Kingdom,The Walking Dead,,,,,,,...,,,,,,,,,,
3,Ozark,Sex Education,Constantine,Preacher,Vikings,The Tick,,,,,...,,,,,,,,,,
4,Naruto,,,,,,,,,,...,,,,,,,,,,


In [10]:
# convert the dataframe to a list of lists
dataset = []
for index, row in raw_data.iterrows():
    dataset.append(row.dropna().tolist())

### Task 1. Pre-process the dataset

<span style="color:red">**[TBC]**</span> Transform the dataset into a one-hot-encoded DataFrame.

In [11]:
# [TBC] complete your code here with proper comments
# one-hot encoding
te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
df = pd.DataFrame(te_ary, columns = te.columns_)
df.head()

Unnamed: 0,12 Monkeys,24,Absentia,Alice in Borderland,Altered Carbon,American Gods,Another Life,Archer,Arrow,Atypical,...,True Detective,Two and a half men,Upload,Vikings,Watchmen,Westworld,White Collar,X-Files,You,Young Sheldon
0,True,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,True,False,False,False,False,False,False,False,False,False,...,False,False,False,True,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,True,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


### Task 2. Generate frequent itemsets

<span style="color:red">**[TBC]**</span> Set the minimum support threshold and generate frequent itemsets.

In [12]:
# [TBC] complete your code here with proper comments
# get frequent itemsets
frequent_itemsets = apriori(df, min_support = 0.05, use_colnames = True)
frequent_itemsets

Unnamed: 0,support,itemsets
0,0.058617,(12 Monkeys)
1,0.07709,(Absentia)
2,0.057792,(Archer)
3,0.139938,(Atypical)
4,0.055521,(Berlin Station)
5,0.052219,(Chernobyl)
6,0.083075,(Cobra Kai)
7,0.093911,(Daredevil)
8,0.089164,(Dark)
9,0.075439,(Demon Slayer)


### Task 3. Generate association rules

<span style="color:red">**[TBC]**</span> Set the minimum confidence threshold and generate association rules.

In [13]:
# [TBC] complete your code here with proper comments
# generate association rules
rules = association_rules(frequent_itemsets, metric = "confidence", min_threshold = 0.3)
rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,(Atypical),(Sex Education),0.139938,0.255624,0.05614,0.40118,1.569412,0.020369,1.243071,0.421852
1,(Ozark),(Sex Education),0.193705,0.255624,0.075129,0.387853,1.517277,0.025613,1.216008,0.422828
2,(Two and a half men),(Sex Education),0.183591,0.255624,0.056553,0.308038,1.205043,0.009623,1.075747,0.208417
