# 3. Frequent Patterns

This Jupyter notebook is part of an exercise series titled *Frequent Patterns*.
The series itself is based on the lecture *6. Mining Frequent Patterns, Associations and Correlations*.

There are two parts:

- Part One: Implementing A Priori and FP-Growth
- Part Two: Mining Frequent Patterns in the AdventureWorks Database

Recall that we have two exercise groups.
Depending on how each group progresses, some parts of these exercises may not be discussed in their entirety.
If questions arise, ask them in your study group or in our StudOn forum.

## Part Two: Mining Frequent Patterns in the AdventureWorks Database

In Part One you worked on a very small and therefore non-realistic data set.
Now you will apply your knowledge of Frequent Patterns to a more realistic scenario.
Imagine this:

*You are an employee in the fictitious company Adventure Works GmbH.*
*Your job is to find out which of the company's products are frequently bought together.*
*To start with, the management wants you to find the ten most "relevant" product pairs bought together.*

*You get access to the OLTP database of the company.*
*The information about individual transactions can be found in the relation `TransactionHistory`.*
*The translation of ProductIDs into real product names can be done with the help of the relation `Product`.*

*From other similar projects of the company you also already know the required libraries and the code to connect to the OLTP database:*

In [None]:
# Import required libraries
import os
import tempfile
import sqlite3
import urllib.request
import pandas as pd

from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth
from mlxtend.frequent_patterns import association_rules

In [None]:
# Create a temporary directory
dataset_folder = tempfile.mkdtemp()

# Build path to database
database_path = os.path.join(dataset_folder, "adventure-works.db")

# Get the database
urllib.request.urlretrieve(
    "https://github.com/FAU-CS6/KDD-Databases/raw/main/AdventureWorks/adventure-works.db",
    database_path,
)

# Open connection to the adventure-works.db
connection = sqlite3.connect(database_path)

### Finding the Ten Most "Relevant" Frequent Patterns

Within this worksheet you are now given two options:
You can either do it on your own or with our guidance by splitting up the problem into smaller steps.

We recommend that you first try it on your own and only switch to the guided version if you encounter problems.


<div class="alert alert-block alert-warning">

**Note:** In both cases there is another section at the end of this worksheet.
Do not skip it, regardless of your decision in this section.

</div>

#### Option 1: Solve the Assignment Independently

In this variant we don't give you anything except for the description of the scenario, the libraries to use, the database connection, and some code cells (just add more if needed).

We give you one small tip:
If you have successfully determined the frequent itemsets, you may take another look at the list of libraries.
There you will probably find a function to determine the association rules from the frequent itemsets.

<div class="alert alert-block alert-info">

**Task:** Find the ten most "relevant" frequent patterns in the OLTP database of the fictitious Adventure Works GmbH.
You have to decide every step from loading the DataFrames to determining the Association Rules from the Frequent Itemsets.

</div>

In [None]:
# Find the ten most "relevant" frequent patterns (Code placeholder 01/10)

In [None]:
# Find the ten most "relevant" frequent patterns (Code placeholder 02/10)

In [None]:
# Find the ten most "relevant" frequent patterns (Code placeholder 03/10)

In [None]:
# Find the ten most "relevant" frequent patterns (Code placeholder 04/10)

In [None]:
# Find the ten most "relevant" frequent patterns (Code placeholder 05/10)

In [None]:
# Find the ten most "relevant" frequent patterns (Code placeholder 06/10)

In [None]:
# Find the ten most "relevant" frequent patterns (Code placeholder 07/10)

In [None]:
# Find the ten most "relevant" frequent patterns (Code placeholder 08/10)

In [None]:
# Find the ten most "relevant" frequent patterns (Code placeholder 09/10)

In [None]:
# Find the ten most "relevant" frequent patterns (Code placeholder 10/10)

In [None]:
# Sample solution => See Option 2

#### Option 2: Solve the Assignment by Solving Small Tasks 

Any large assignment can be broken down into many smaller steps.
For a KDD task the first step is to get familiar with the given data.

##### Getting to Know `TransactionHistory` and `Product`

First, the records of `TransactionHistory` and `Product` have to be loaded.
Since we don't know anything about the relations yet, we load all attributes and tuples of both relations.

<div class="alert alert-block alert-info">

**Task:** Load the relations `TransactionHistory` and `Product` into two individual DataFrames and display the first ten rows of each DataFrame.
(Hint: You might want to look at exercise sheet 2 (a-c) to recall the methods for loading relations into a DataFrame.)

</div>

In [None]:
# Load TransactionHistory into a DataFrame and display the first ten rows

In [None]:
# Load Product into a DataFrame and display the first ten rows

In [None]:
# Load TransactionHistory into a DataFrame and display the first ten rows
transaction_history_df = pd.read_sql_query(
    "SELECT * FROM TransactionHistory", connection
)
transaction_history_df.head(10)

In [None]:
# Load Product into a DataFrame and display the first ten rows
product_df = pd.read_sql_query("SELECT * FROM Product", connection)
product_df.head(10)

The `TransactionHistory` relation seems to contain information about individual transactions.
However, it is not yet clear how to determine which products (probably identified by the `ProductID`) are purchased together.

The `Product` relation can be used to map from `ProductID` to `Name`.

We might therefore assume that we are looking for products that are purchased in the same transaction.
The attribute `TransactionID` seems to uniquely identify each transaction.
We test this hypothesis by determining whether there are `TransactionID`s with more than one linked `ProductID`.

<div class="alert alert-block alert-info">
    
**Task:** Check if there are cases of several different `ProductID`s for the same `TransactionID`.
    
</div>

In [None]:
# Check the TransactionHistory DataFrame for cases of several different ProductIDs for the same TransactionID

In [None]:
# Check the TransactionHistory DataFrame for cases of several different ProductIDs for the same TransactionID
# First group the DataFrame by TransactionID and count the entries of the other columns per group
transaction_history_df_grouped = transaction_history_df.groupby(
    ["TransactionID"]
).count()

# Then check whether there are groups where the count in the ProductID column is greater than one
transaction_history_df_grouped[transaction_history_df_grouped["ProductID"] > 1]

# No results => There are no cases of several different ProductIDs for the same TransactionID

We find no cases of multiple different `ProductID`s for the same `TransactionID`, so our first hypothesis does not seem to be correct.
Apparently the `TransactionID` is the primary key for the `TransactionHistory` relation and one `TransactionID` cannot refer to different `ProductID`s.
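
Note that `count()` counts the rows per group rather than the distinct values. For `TransactionID` this makes no difference, because no group contains more than one row. If the check should explicitly target *different* `ProductID`s, `nunique()` can be used instead. The following is a minimal sketch on a made-up toy DataFrame (the data and the name `toy_df` are invented for illustration and are not part of the AdventureWorks database):

```python
import pandas as pd

# Made-up toy data: order 1 contains product 10 twice and product 20 once
toy_df = pd.DataFrame(
    {
        "TransactionID": [1, 2, 3, 4],
        "ReferenceOrderID": [1, 1, 1, 2],
        "ProductID": [10, 10, 20, 30],
    }
)

# count() counts rows per group, nunique() counts distinct values per group
rows_per_order = toy_df.groupby("ReferenceOrderID")["ProductID"].count()
distinct_per_order = toy_df.groupby("ReferenceOrderID")["ProductID"].nunique()

# Groups with more than one *different* ProductID
orders_with_several_products = distinct_per_order[distinct_per_order > 1]
print(rows_per_order)
print(orders_with_several_products)
```

In the toy data, order 1 has three rows but only two different products, so the two aggregations disagree for that group.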

When looking at `TransactionHistory` a second attribute stands out.
The `ReferenceOrderID` could identify the individual order, and products that are part of the same order are obviously purchased together.

So let's test this new hypothesis as well.

<div class="alert alert-block alert-info">

**Task:** Check if there are cases of several different `ProductID`s for the same `ReferenceOrderID`.

</div>

In [None]:
# Check the TransactionHistory DataFrame for cases of several different ProductIDs for the same ReferenceOrderID

In [None]:
# Check the TransactionHistory DataFrame for cases of several different ProductIDs for the same ReferenceOrderID
# Now group the DataFrame by ReferenceOrderID and count the entries of the other columns per group
transaction_history_df_grouped = transaction_history_df.groupby(
    ["ReferenceOrderID"]
).count()

# Then check again whether there are groups where the count in the ProductID column is greater than one
transaction_history_df_grouped[transaction_history_df_grouped["ProductID"] > 1]

# 23249 results => There are multiple cases of different ProductIDs for the same ReferenceOrderID

Our new hypothesis seems to be correct.
In the next step we want to search for `ProductID`s that regularly occur in the same `ReferenceOrderID`, the frequent itemsets of our problem.

##### Identifying the Frequent Itemsets

To determine our frequent itemsets using mlxtend we first need to do some preprocessing on `TransactionHistory`.

<div class="alert alert-block alert-info">

**Task:** Aggregate the `TransactionHistory` so that next to each `ReferenceOrderID` the associated `ProductID`s are listed in a single cell.

</div>

In [None]:
# Aggregate the TransactionHistory to have a list of ProductIDs per ReferenceOrderID

In [None]:
# Aggregate the TransactionHistory to have a list of ProductIDs per ReferenceOrderID
products_per_order_df = (
    transaction_history_df.groupby("ReferenceOrderID")["ProductID"]
    .apply(list)
    .reset_index(name="ProductIDs")
    .set_index("ReferenceOrderID")
)
products_per_order_df

<div class="alert alert-block alert-info">

**Task:** Prepare the dataset for `mlxtend`'s `fpgrowth` by using the `TransactionEncoder` of the library.

</div>

In [None]:
# Apply one hot encoding to the prepared dataset by using the TransactionEncoder

In [None]:
# Apply one hot encoding to the prepared dataset by using the TransactionEncoder
# Create a TransactionEncoder
transaction_encoder = TransactionEncoder()

# Use the TransactionEncoder to transform the dataset into a one-hot encoded NumPy boolean array
one_hot_encoded_dataset = transaction_encoder.fit(
    products_per_order_df["ProductIDs"].tolist()
).transform(products_per_order_df["ProductIDs"].tolist())

# Transform the one-hot encoded array into a pandas DataFrame
preprocessed_dataset = pd.DataFrame(
    one_hot_encoded_dataset,
    columns=transaction_encoder.columns_,
    index=products_per_order_df.index,
)
preprocessed_dataset

After the preprocessing steps the frequent itemsets can now theoretically be determined.
But we do not yet know which `min_support` to choose.

Even by trial and error it is difficult to find a meaningful threshold here, and we have only been told by our bosses to find the ten most "relevant" frequent patterns.

To begin with, we should rather determine too many itemsets than too few, because it is easier to discard frequent itemsets later than to generate additional ones.
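
One way to get a starting value for `min_support` is to look at the support of the k-th most frequent itemset. The following is only a rough sketch under stated assumptions: the transactions are made up, only itemsets of size 1 and 2 are counted by brute force, and the names `transactions`, `k` and `candidate_min_support` are hypothetical. On the real dataset one would rather run `fpgrowth` with a low threshold and inspect the sorted supports.

```python
from collections import Counter
from itertools import combinations

# Made-up toy transactions (lists of product IDs per order)
transactions = [[1, 2, 3], [1, 2], [1, 3], [2, 3], [1, 2, 3], [4]]

# Count how many transactions contain each itemset (sizes 1 and 2 only)
counts = Counter()
for transaction in transactions:
    items = sorted(set(transaction))
    for size in (1, 2):
        counts.update(combinations(items, size))

# Relative support of every counted itemset, sorted from high to low
supports = sorted((c / len(transactions) for c in counts.values()), reverse=True)

# The support of the k-th most frequent itemset is a candidate threshold
k = 6
candidate_min_support = supports[k - 1]
print(candidate_min_support)
```

Itemsets that tie with the k-th support also pass the threshold, so the number of returned itemsets can be slightly larger than k.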

<div class="alert alert-block alert-info">

**Task:** Use `fpgrowth` to determine the frequent itemsets of our dataset.
Select `min_support` so that approximately the 100 most frequent itemsets are returned.

</div>

In [None]:
# Determine the frequent itemsets

In [None]:
# Determine the frequent itemsets
frequent_itemsets = fpgrowth(preprocessed_dataset, min_support=0.01, use_colnames=True)
frequent_itemsets

##### Determination of the Frequent Patterns

Before using the frequent itemsets to determine frequent patterns we have to determine how to define "relevance" in the context of frequent patterns.

If our bosses wanted to know which ten patterns occur most frequently in our dataset the *support* would be the appropriate measure.
Did they want to know how certain one can be that Product A will end up in the shopping cart if Product B is already there?
Then the calculation of *confidence* would be more appropriate.
In addition there is a large number of other interestingness measures.
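
The difference between the two measures can be illustrated on a handful of made-up shopping carts. This is a minimal sketch; the carts and the helper functions `support` and `confidence` are invented for illustration and are not part of `mlxtend`.

```python
# Made-up toy shopping carts
carts = [
    {"bottle", "cage"},
    {"bottle", "cage", "helmet"},
    {"bottle"},
    {"helmet"},
    {"bottle", "helmet"},
]


def support(itemset, carts):
    """Fraction of carts that contain every item of the itemset."""
    return sum(itemset <= cart for cart in carts) / len(carts)


def confidence(antecedent, consequent, carts):
    """Estimate of P(consequent | antecedent) from the carts."""
    return support(antecedent | consequent, carts) / support(antecedent, carts)


# Support is symmetric, confidence depends on the direction of the rule
rule_support = support({"bottle", "cage"}, carts)
conf_bottle_to_cage = confidence({"bottle"}, {"cage"}, carts)
conf_cage_to_bottle = confidence({"cage"}, {"bottle"}, carts)
print(rule_support, conf_bottle_to_cage, conf_cage_to_bottle)
```

In the toy data every cart with a cage also contains a bottle, but only half of the carts with a bottle contain a cage. Ranking by support and ranking by confidence can therefore lead to different "top ten" lists.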

All in all, this question cannot be answered conclusively.
In practice a dialog with the management would be appropriate in order to narrow down more precisely what is meant by the most "relevant" ten patterns.

This ambiguity was included in the assignment on purpose to show that real-world assignments often contain imprecise requirements.

While in the real world dialogue is the best solution we have no opportunity to consult with our fictitious bosses.
Therefore we do what is best for us and choose the simplest measure to apply: the support.

It is not necessary to generate exactly the ten rules with the highest support.
As long as they are part of the resulting list, everything is fine.

<div class="alert alert-block alert-info">

**Task:** Use `mlxtend`'s `association_rules` to generate frequent patterns from the frequent itemsets.
Set the corresponding threshold so that at least the 10 frequent patterns with the highest support are included.

</div>

In [None]:
# Generate the association rules/frequent patterns

In [None]:
# Generate the association rules/frequent patterns
frequent_patterns = association_rules(
    frequent_itemsets, metric="support", min_threshold=0.02
)
frequent_patterns

Of course it is no problem at all to sort out extra patterns afterwards. 

<div class="alert alert-block alert-info">

**Task:** Delete all patterns that do not belong to the ten patterns with the highest support.

</div>

In [None]:
# Delete the extra patterns

In [None]:
# Delete the extra patterns
frequent_patterns = frequent_patterns.nlargest(10, "support")
frequent_patterns

##### Getting to Know the Product Names 

After completing the core task we tidy up the results: management cannot use the information that `ProductID` 871 is often purchased in combination with `ProductID` 870, because these are internal database IDs.

To complete our task satisfactorily for the management we still need to enrich the `ProductID`s with their actual names.

<div class="alert alert-block alert-info">

**Task:** Enrich the frequent patterns by adding the product names to the list. 

</div>

In [None]:
# Merge the ProductName into the frequent pattern df

In [None]:
# Merge the ProductName into the frequent pattern df
# We first have to unpack the frozensets in the two columns into single ProductIDs
# (as we know that there is only one item per set this is pretty simple)
frequent_patterns["antecedents"] = frequent_patterns["antecedents"].apply(
    lambda x: list(x)[0]
)
frequent_patterns["consequents"] = frequent_patterns["consequents"].apply(
    lambda x: list(x)[0]
)

# After that we have to merge frequent_patterns with the product df
frequent_patterns = pd.merge(
    frequent_patterns, product_df, left_on="antecedents", right_on="ProductID"
)[
    [
        "antecedents",
        "Name",
        "consequents",
        "antecedent support",
        "consequent support",
        "support",
        "confidence",
        "lift",
        "leverage",
        "conviction",
    ]
]
frequent_patterns = frequent_patterns.rename(columns={"Name": "antecedents name"})
frequent_patterns = pd.merge(
    frequent_patterns, product_df, left_on="consequents", right_on="ProductID"
)[
    [
        "antecedents",
        "antecedents name",
        "consequents",
        "Name",
        "antecedent support",
        "consequent support",
        "support",
        "confidence",
        "lift",
        "leverage",
        "conviction",
    ]
]
frequent_patterns = frequent_patterns.rename(columns={"Name": "consequents name"})

# Print the df
frequent_patterns

Now the assignment is fully completed!
In the fictitious scenario introduced at the beginning of this part you would now be able to report to the management that the `Mountain Bottle Cage` is often purchased in combination with the `Water Bottle - 30 oz.`.
The same is true for the other nine requested frequent patterns.
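
As a side note, the name lookup can also be written with `Series.map` instead of two merges. The following is a minimal sketch on made-up stand-ins for `product_df` and the rule table (the IDs, the product names and the variable names `toy_product_df`, `toy_rules` and `name_by_id` are invented for illustration):

```python
import pandas as pd

# Made-up stand-ins for product_df and the rule table
toy_product_df = pd.DataFrame({"ProductID": [1, 2], "Name": ["Product A", "Product B"]})
toy_rules = pd.DataFrame({"antecedents": [1, 2], "consequents": [2, 1]})

# Build a ProductID -> Name lookup once and reuse it for both columns
name_by_id = toy_product_df.set_index("ProductID")["Name"]
toy_rules["antecedents name"] = toy_rules["antecedents"].map(name_by_id)
toy_rules["consequents name"] = toy_rules["consequents"].map(name_by_id)
print(toy_rules)
```

With this approach the rows of the rule table stay in place, and IDs without a match in the lookup receive `NaN` instead of being dropped.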


### Implementing the Kulczynski Measure and the Imbalance Ratio

The library `mlxtend` offers some more measures besides support and confidence for evaluating frequent patterns.
While lift, leverage and conviction are offered, the Kulczynski measure and the imbalance ratio presented in the lecture are not.

We can use the antecedent support, the consequent support and the support calculated by `mlxtend` to easily calculate these interestingness measures.

<div class="alert alert-block alert-info">

**Task:** Write a function to compute the Kulczynski measure known from the lecture as `Kulc(a, b)`.

</div>

In [None]:
# Complete the function kulczynski_measure to compute the kulczynski measure
def kulczynski_measure(antecedent_support, consequent_support, support):
    # ...
    return 0


# Compute the kulczynski measure for "Water Bottle - 30 oz." -> "Mountain Bottle Cage"
kulczynski_measure(0.112802, 0.049356, 0.041139)

In [None]:
# Complete the function kulczynski_measure to compute the kulczynski measure
def kulczynski_measure(antecedent_support, consequent_support, support):
    # Simply use the formula introduced in the lecture
    return (support / 2) * ((1 / antecedent_support) + (1 / consequent_support))


# Compute the kulczynski measure for "Water Bottle - 30 oz." -> "Mountain Bottle Cage"
kulczynski_measure(0.112802, 0.049356, 0.041139)

<div class="alert alert-block alert-info">

**Task:** Write a function to compute the imbalance ratio.

</div>

In [None]:
# Complete the function imbalance_ratio to compute the imbalance ratio
def imbalance_ratio(antecedent_support, consequent_support, support):
    # ...
    return 0


# Compute the imbalance ratio for "Water Bottle - 30 oz." -> "Mountain Bottle Cage"
imbalance_ratio(0.112802, 0.049356, 0.041139)

In [None]:
# Complete the function imbalance_ratio to compute the imbalance ratio
def imbalance_ratio(antecedent_support, consequent_support, support):
    # Simply use the formula introduced in the lecture
    return abs(antecedent_support - consequent_support) / (
        antecedent_support + consequent_support - support
    )


# Compute the imbalance ratio for "Water Bottle - 30 oz." -> "Mountain Bottle Cage"
imbalance_ratio(0.112802, 0.049356, 0.041139)

Now the question arises as to how these metrics have to be interpreted.

<div class="alert alert-block alert-info">

**Task:** Interpret the interestingness measures for the association rule `"Water Bottle - 30 oz." -> "Mountain Bottle Cage"`

</div>

Write down your solution here:

First both values must be interpreted separately from each other:

- **Kulczynski Measure:**  
When the Kulczynski measure is close to 0 or 1 we have an "interesting" association rule.
Since in this case the value is about 0.6, the Kulczynski measure rather suggests that this association rule is uninteresting.
- **Imbalance Ratio:**  
For the imbalance ratio, a value of 0 indicates a perfectly balanced association rule while 1 indicates a very unbalanced one.
In this case the value is about 0.52, which is roughly the middle of the spectrum.
Thus we cannot speak of a particularly well balanced rule, but neither can we speak of a completely unbalanced one.
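
The two values can be reproduced step by step from the supports used in the function calls above. The sketch below writes the Kulczynski measure as the mean of the two confidences of the rule, which is equivalent to the formula in the sample solution; the variable names are chosen for illustration only.

```python
# Supports of the rule "Water Bottle - 30 oz." -> "Mountain Bottle Cage"
antecedent_support = 0.112802
consequent_support = 0.049356
support = 0.041139

# The two confidences of the rule, one per direction
conf_a_to_b = support / antecedent_support
conf_b_to_a = support / consequent_support

# Kulczynski measure: the mean of both confidences
kulc = (conf_a_to_b + conf_b_to_a) / 2

# Imbalance ratio: difference of the supports relative to the support of the union
ir = abs(antecedent_support - consequent_support) / (
    antecedent_support + consequent_support - support
)

print(round(kulc, 2), round(ir, 2))  # 0.6 0.52
```

The two confidences differ noticeably (about 0.36 versus about 0.83), which is what the imbalance ratio of about 0.52 reflects.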

In summary, we have not discovered the most interesting rule, but one that is not completely uninteresting (this would be the case for a Kulczynski measure of 0.5 and an imbalance ratio of 0.0).