# 5. Clustering

This Jupyter Notebook is part of an exercise series titled *Clustering*.
The series itself is based on lecture *8. Cluster Analysis*.

There are two parts:

- Part One: Implementing K-means and DBSCAN
- Part Two: Clustering in the AdventureWorks Database

Again we would like to remind you that we have multiple exercise groups.
Depending on how each group progresses, some parts of these exercises may not be discussed in their entirety.
If questions arise, ask them in your study group or in our StudOn forum.

## Part Two: Clustering in the AdventureWorks Database

After implementing two clustering methods in Part One, you now get to apply your knowledge of clustering methods to the AdventureWorks database.

For this purpose, you will be presented with two fictitious scenarios, which you may work on either completely independently or with guidance:
- Clustering products based on their profitability
- Clustering customers based on their interests

In [None]:
# Import required libraries
import os
import tempfile
import sqlite3
import urllib.request
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.cluster import KMeans
from sklearn.cluster import DBSCAN
from sklearn.cluster import Birch
from sklearn.preprocessing import MinMaxScaler

In [None]:
# Create a temporary directory
dataset_folder = tempfile.mkdtemp()

# Build path to database
database_path = os.path.join(dataset_folder, "adventure-works.db")

# Get the database
urllib.request.urlretrieve(
    "https://github.com/FAU-CS6/KDD-Databases/raw/main/AdventureWorks/adventure-works.db",
    database_path,
)

# Open connection to the adventure-works.db
connection = sqlite3.connect(database_path)

### Clustering products based on their profitability

The first scenario is intended as a first look at clustering in practice, without major stumbling blocks. To this end, put yourself in the following situation:

*You are once again a Data Scientist at the fictitious company Adventure Works GmbH. After your successful analyses on the topic of Frequent Patterns, your bosses now assign you the task of dividing the products into groups of different profitability.*

*In discussions with your business administration colleagues, you learn that the decisive metrics here are probably the number of products actually sold and the profit per product (the sales price minus the production costs).*

*The colleagues from IT tell you that you will probably find the required data in the tables `Product` and `SalesOrderDetail`.*

#### Option 1: Solve the Assignment Independently

You may, of course, once again tackle the task without help. Be aware that the two values suggested by your colleagues may not be available in the required form at the beginning. Also, think about which clustering method is best suited for the task.

<div class="alert alert-block alert-info">

**Task:** Group the products within the OLTP database of the fictitious Adventure Works GmbH according to their profitability. Furthermore, visualize the result.

</div>

In [None]:
# Cluster the products based on their profitability (Code placeholder 01/10)

In [None]:
# Cluster the products based on their profitability (Code placeholder 02/10)

In [None]:
# Cluster the products based on their profitability (Code placeholder 03/10)

In [None]:
# Cluster the products based on their profitability (Code placeholder 04/10)

In [None]:
# Cluster the products based on their profitability (Code placeholder 05/10)

In [None]:
# Cluster the products based on their profitability (Code placeholder 06/10)

In [None]:
# Cluster the products based on their profitability (Code placeholder 07/10)

In [None]:
# Cluster the products based on their profitability (Code placeholder 08/10)

In [None]:
# Cluster the products based on their profitability (Code placeholder 09/10)

In [None]:
# Cluster the products based on their profitability (Code placeholder 10/10)

In [None]:
# Sample solution => See Option 2

#### Option 2: Solve the Assignment by Solving Small Tasks 

A first step in any KDD task should always be to take a look at the available data. In this case, we already know which tables may be relevant for us: `Product` and `SalesOrderDetail`.

<div class="alert alert-block alert-info">

**Task:** Load the relations `Product` and `SalesOrderDetail` into two individual DataFrames and display the first ten rows of each DataFrame.

</div>

In [None]:
# Load Product into a DataFrame and display the first ten rows

In [None]:
# Load SalesOrderDetail into a DataFrame and display the first ten rows

In [None]:
# Load Product into a DataFrame and display the first ten rows
product_df = pd.read_sql_query("SELECT * FROM Product", connection)
product_df.head(10)

In [None]:
# Load SalesOrderDetail into a DataFrame and display the first ten rows
sales_order_detail_df = pd.read_sql_query("SELECT * FROM SalesOrderDetail", connection)
sales_order_detail_df.head(10)

It can be seen that the attribute `ProductID` is probably the link between the two tables. It is less clear how to calculate the two metrics suggested by the business colleagues.

<div class="alert alert-block alert-info">

**Task:** Consider which attributes are needed for the two proposed metrics and how to combine them. Do not worry about the implementation in SQL/Python at this moment.
If you do not yet know everything about the data sets that you need to accomplish this task, try to learn more about the data sets.

</div>
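
One way to learn more about the tables is to ask SQLite itself for the schema. The following is only a minimal sketch on an in-memory toy database: the table definition and the names `toy_connection`, `table_names` and `columns_by_table` are made up for illustration. The same two queries could be sent over the notebook's `connection` instead.

```python
import sqlite3

# Toy in-memory database with a hypothetical, heavily simplified schema
toy_connection = sqlite3.connect(":memory:")
toy_connection.execute(
    "CREATE TABLE Product (ProductID INTEGER PRIMARY KEY, Name TEXT, StandardCost REAL)"
)

# sqlite_master lists all objects of the database; keep only the tables
table_names = [
    row[0]
    for row in toy_connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]

# PRAGMA table_info returns one row per column:
# (cid, name, type, notnull, dflt_value, pk)
columns_by_table = {
    name: [
        (column[1], column[2])
        for column in toy_connection.execute(f"PRAGMA table_info({name})")
    ]
    for name in table_names
}
toy_connection.close()

print(table_names)
print(columns_by_table)
```

In pandas, `DataFrame.describe()` and `DataFrame.dtypes` give a comparable overview once a table has been loaded.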

How to compute the metrics required for the clustering:

- **Number of copies sold per product:**  
Write down your solution here
- **Average profit per copy sold (per product):**  
Write down your solution here

How to compute the metrics required for the clustering:

- **Number of copies sold per product:**  
The table `SalesOrderDetail` records how many copies of a product were sold within a single order (`OrderQty`). If we sum up the `OrderQty` per product, we get the number of copies sold per product.
- **Average profit per copy sold (per product):**  
First of all, note that the profit per copy sold can be calculated by subtracting the manufacturing cost (`StandardCost` in `Product`) from the actual selling price (`UnitPrice` in `SalesOrderDetail`). Unlike the quantity, however, this profit should not be summed up per product; instead, its average should be determined.

<div class="alert alert-block alert-info">

**Task:** Calculate both values (`ProfitPerUnit` and `OverallOrderQty`) and generate a corresponding DataFrame `product_overview_df` containing the `ProductID` as well as the `ProfitPerUnit` and the `OverallOrderQty`. You may use SQL and/or Python to perform the computation.

</div>

In [None]:
# Load `ProductID`,`ProfitPerUnit` and `OverallOrderQty` into a DataFrame

In [None]:
# Load `ProductID`, `ProfitPerUnit` and `OverallOrderQty` into a DataFrame
product_overview_df = pd.read_sql_query(
    """
    SELECT p.ProductID,
           AVG(sod.UnitPrice - p.StandardCost) AS ProfitPerUnit,
           SUM(sod.OrderQty) AS OverallOrderQty
    FROM Product p JOIN SalesOrderDetail sod ON p.ProductID = sod.ProductID
    GROUP BY p.ProductID
    """,
    connection,
)
product_overview_df.head(10)
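
Since the task allows SQL and/or Python, the same computation can also be sketched in pandas. The example below is only an illustration on two tiny hand-made DataFrames (`toy_product_df` and `toy_sales_df` are hypothetical stand-ins, not the real relations); it joins the order lines with the product data and aggregates per product.

```python
import pandas as pd

# Toy stand-ins for the two relations; the values are made up for illustration
toy_product_df = pd.DataFrame({"ProductID": [1, 2], "StandardCost": [10.0, 50.0]})
toy_sales_df = pd.DataFrame(
    {
        "ProductID": [1, 1, 2],
        "OrderQty": [3, 1, 2],
        "UnitPrice": [15.0, 13.0, 45.0],
    }
)

# Join the order lines with the product data on the shared key
toy_merged_df = toy_sales_df.merge(toy_product_df, on="ProductID", how="inner")

# Profit of a single copy on each order line
toy_merged_df["Profit"] = toy_merged_df["UnitPrice"] - toy_merged_df["StandardCost"]

# Aggregate per product: mean profit over the order lines and total quantity sold
toy_overview_df = toy_merged_df.groupby("ProductID", as_index=False).agg(
    ProfitPerUnit=("Profit", "mean"),
    OverallOrderQty=("OrderQty", "sum"),
)
print(toy_overview_df)
```

Note that the mean is taken over order lines, so a line with `OrderQty` 3 counts as much as a line with `OrderQty` 1. Weighting each line by its quantity would be a different, equally defensible definition of the average profit per copy.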

The values of `OverallOrderQty` range from 4 to 8311, whereas those of `ProfitPerUnit` only range from about -55 to about 1155. In its current form, the dataset is therefore a poor fit for most clustering techniques: in distance computations, `OverallOrderQty` would have a much higher influence than `ProfitPerUnit`. This is why it makes sense to normalize `product_overview_df` first.

<div class="alert alert-block alert-info">

**Task:** Normalize `product_overview_df`. You may write your own function or use a library for this. The `MinMaxScaler` from scikit-learn has already been imported.

</div>

In [None]:
# Normalize the product_overview_df

In [None]:
# Normalize the product_overview_df
min_max_scaler = MinMaxScaler()
product_overview_df[
    ["ProfitPerUnit", "OverallOrderQty"]
] = min_max_scaler.fit_transform(
    product_overview_df[["ProfitPerUnit", "OverallOrderQty"]]
)
product_overview_df
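
To make the idea behind min-max scaling concrete, here is a small hand-written sketch. It maps every value $x$ of a column to $(x - \min) / (\max - \min)$, so that the smallest value becomes 0 and the largest becomes 1. The function `min_max_scale` and the sample values are assumptions made for illustration only; they merely span the two value ranges mentioned above.

```python
def min_max_scale(values):
    """Rescale a list of numbers linearly to the range [0, 1]."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # A constant column has no spread; map every entry to 0.0
        return [0.0 for _ in values]
    return [(value - lowest) / (highest - lowest) for value in values]


# Hypothetical values spanning the two ranges quoted in the text
order_quantities = [4, 1000, 8311]
profits_per_unit = [-55.0, 0.0, 1155.0]

scaled_quantities = min_max_scale(order_quantities)
scaled_profits = min_max_scale(profits_per_unit)

print(scaled_quantities)
print(scaled_profits)
```

After scaling, both columns live in the same interval, so neither dominates a Euclidean distance purely because of its unit.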

The resulting DataFrame is well suited for clustering. The remaining question is which clustering method to use. We will focus on k-means, DBSCAN and Birch. All of these methods have been presented in the lecture and are implemented in scikit-learn.

<div class="alert alert-block alert-info">

**Task:** Run scikit-learn's k-means for the DataFrame at hand. You have to determine a good number of clusters yourself. Visualize the results as known from Part One.

</div>

In [None]:
# Perform scikit-learn's k-means clustering on the dataset

In [None]:
# Perform scikit-learn's k-means clustering on the dataset
kmeans = KMeans(n_clusters=6).fit(
    product_overview_df[["ProfitPerUnit", "OverallOrderQty"]]
)

# Save the labels to a copy of product_overview_df
clustered_product_overview_df = product_overview_df.copy()
clustered_product_overview_df["cluster"] = kmeans.labels_

# Plot the result
plt.figure(figsize=(8, 8))
sns.scatterplot(
    x=clustered_product_overview_df["ProfitPerUnit"],
    y=clustered_product_overview_df["OverallOrderQty"],
    hue=clustered_product_overview_df["cluster"],
    palette="deep",
)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.0)
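
The number of clusters is a judgment call. One possible aid, sketched here under the assumption that a silhouette score is an acceptable quality measure for your data, is to run k-means for several values of `k` and compare the scores. The example uses synthetic points from `make_blobs`, not the AdventureWorks data, and the names `toy_points`, `silhouette_by_k` and `best_k` are illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data with three well-separated groups
toy_points, _ = make_blobs(
    n_samples=150,
    centers=[[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]],
    cluster_std=0.3,
    random_state=0,
)

# Run k-means for several candidate values of k and record the silhouette score
silhouette_by_k = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(toy_points)
    silhouette_by_k[k] = silhouette_score(toy_points, labels)

# The candidate with the highest score is one reasonable choice, not the only one
best_k = max(silhouette_by_k, key=silhouette_by_k.get)
print(silhouette_by_k)
print("k with the highest silhouette score:", best_k)
```

On real data the scores are usually much closer together, so the result should be read as a hint and checked against the scatter plot and the business question.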

<div class="alert alert-block alert-info">

**Task:** Run scikit-learn's DBSCAN for the DataFrame at hand. DBSCAN does not take a number of clusters as input; instead, you have to determine good values for `eps` and `min_samples` yourself. Visualize the results as known from Part One.

</div>

In [None]:
# Perform scikit-learn's DBSCAN clustering on the dataset

In [None]:
# Perform scikit-learn's DBSCAN clustering on the dataset
dbscan = DBSCAN(eps=0.2, min_samples=5).fit(
    product_overview_df[["ProfitPerUnit", "OverallOrderQty"]]
)

# Save the labels to a copy of product_overview_df
clustered_product_overview_df = product_overview_df.copy()
clustered_product_overview_df["cluster"] = dbscan.labels_

# Plot the result
plt.figure(figsize=(8, 8))
sns.scatterplot(
    x=clustered_product_overview_df["ProfitPerUnit"],
    y=clustered_product_overview_df["OverallOrderQty"],
    hue=clustered_product_overview_df["cluster"],
    palette="deep",
)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.0)
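
Choosing `eps` by trial and error can be tedious. A common heuristic, sketched here on synthetic data rather than on `product_overview_df`, is to look at the sorted distance of every point to its k-th nearest neighbor and to pick `eps` near the point where this curve bends sharply upwards. All names and values below (`toy_points`, `k_distances`, the cluster centers) are assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

# Synthetic points in [0, 1]^2 standing in for the normalized features
toy_points = np.vstack(
    [
        rng.normal(loc=[0.2, 0.2], scale=0.02, size=(50, 2)),
        rng.normal(loc=[0.8, 0.7], scale=0.02, size=(50, 2)),
        rng.uniform(0.0, 1.0, size=(5, 2)),  # a few scattered points
    ]
)

min_samples = 5

# Querying the training data returns each point itself at distance 0, which
# matches DBSCAN's convention that min_samples includes the point itself
neighbors = NearestNeighbors(n_neighbors=min_samples).fit(toy_points)
distances, _ = neighbors.kneighbors(toy_points)

# Sorted distance to the min_samples-th neighbor (counting the point itself)
k_distances = np.sort(distances[:, -1])

print("median k-distance:", np.median(k_distances))
print("largest k-distances:", k_distances[-5:])
```

Plotting `k_distances` with `plt.plot(k_distances)` shows the bend more clearly. Points far to the right of the bend are the ones DBSCAN would tend to label as noise (label `-1` in scikit-learn) for an `eps` chosen at the bend.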

<div class="alert alert-block alert-info">

**Task:** Run scikit-learn's Birch for the DataFrame at hand. You have to determine a good number of clusters yourself. Visualize the results as known from Part One.

</div>

In [None]:
# Perform scikit-learn's Birch clustering on the dataset

In [None]:
# Perform scikit-learn's Birch clustering on the dataset
birch = Birch(threshold=0.1, n_clusters=6).fit(
    product_overview_df[["ProfitPerUnit", "OverallOrderQty"]]
)

# Save the labels to a copy of product_overview_df
clustered_product_overview_df = product_overview_df.copy()
clustered_product_overview_df["cluster"] = birch.labels_

# Plot the result
plt.figure(figsize=(8, 8))
sns.scatterplot(
    x=clustered_product_overview_df["ProfitPerUnit"],
    y=clustered_product_overview_df["OverallOrderQty"],
    hue=clustered_product_overview_df["cluster"],
    palette="deep",
)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.0)

Of course, your fictitious bosses don't want to be presented with three different results. One clustering method has to be chosen. For this purpose, it is useful to briefly compare the results.

<div class="alert alert-block alert-info">

**Task:** Compare the results you have achieved with all three methods and consider which one you think is best.

</div>

Write down your solution here:

Of course, this comparison also depends on the results you were able to achieve by optimizing the parameters. However, some things should be universally valid:

- **K-means and Birch:**  
Both k-means and Birch produce quite similar results in this case. With our parameters, we can interpret the clusters found as follows:
  - Less interesting from a business perspective:
    - The cluster that brings little profit per unit and was not sold frequently.
    - The cluster that brings little profit per unit but was sold somewhat more often.
    - The cluster that brings slightly more profit per unit but was not sold frequently.
  - Interesting from a business perspective:
    - The cluster that brings little profit per unit but was sold extremely often.
    - The cluster that is average in terms of both profit per unit and frequency of sales.
    - The cluster that brings an extremely high profit per unit but was hardly sold.

- **DBSCAN:**  
With our parameters, we can interpret the clusters found as follows:
  - Less interesting from a business perspective:
    - Merged into one cluster
  - Interesting from a business perspective:
    - The cluster that brings little profit per unit but was sold extremely often.
    - The cluster that is average in terms of both profit per unit and frequency of sales.
    - The cluster that brings an extremely high profit per unit but was hardly sold.
    
Since DBSCAN merges the less interesting products into a single cluster, DBSCAN is probably the better choice here. This way, a presentation can focus on the economically more interesting product clusters.

However, it is of course not a big problem in this case if k-means or Birch are used, since one can at least argue that here a better distinction is made between completely uninteresting and at least somewhat interesting products.
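
The visual impression that two methods produce similar results can be backed up with a number. The following sketch uses the adjusted Rand index from scikit-learn, which compares two labelings independently of how the cluster IDs are numbered. It runs on synthetic `make_blobs` data with assumed parameters, so the value it prints says nothing about the AdventureWorks results.

```python
import numpy as np
from sklearn.cluster import Birch, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in data with three well-separated groups
toy_points, _ = make_blobs(
    n_samples=150,
    centers=[[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]],
    cluster_std=0.3,
    random_state=0,
)

kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(toy_points)
birch_labels = Birch(threshold=0.3, n_clusters=3).fit_predict(toy_points)

# 1.0 means identical partitions; values near 0.0 mean chance-level agreement
agreement = adjusted_rand_score(kmeans_labels, birch_labels)

# Cluster sizes per method give a second, very quick comparison
kmeans_sizes = sorted(np.bincount(kmeans_labels).tolist())
birch_sizes = sorted(np.bincount(birch_labels).tolist())

print("adjusted Rand index:", agreement)
print("k-means cluster sizes:", kmeans_sizes)
print("Birch cluster sizes:", birch_sizes)
```

For DBSCAN, keep in mind that noise points carry the label `-1`, which `np.bincount` does not accept; `pd.Series(labels).value_counts()` is a simple alternative there.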

### Clustering customers based on their interests

<div class="alert alert-block alert-warning">

**TODO**: Use `Customer`, `SalesOrderHeader` and `SalesOrderDetail` to cluster customers into groups interested in different products.
    
There are two possibilities to take a look at:
- Simple one-hot encoding:
One column per product ID. If the customer has ever ordered the product, they are considered interested in it.
- One-hot encoding with counts:
One column per product ID. If the customer ordered the product multiple times, they are considered more interested in it.
    
</div>
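
The two encodings from the TODO can be sketched with `pd.crosstab`. The example below works on a tiny hand-made DataFrame `toy_orders_df` with one row per order line; in the real scenario these rows would presumably come from joining `SalesOrderHeader` and `SalesOrderDetail`, which is an assumption here. Counting order lines (rather than summing `OrderQty`) is likewise just one possible reading of "ordered multiple times".

```python
import pandas as pd

# Toy order lines: one row per (customer, product) occurrence; values are made up
toy_orders_df = pd.DataFrame(
    {
        "CustomerID": [1, 1, 1, 2, 2, 3],
        "ProductID": [10, 10, 20, 20, 30, 10],
    }
)

# Encoding with counts: how often each customer ordered each product
count_matrix = pd.crosstab(toy_orders_df["CustomerID"], toy_orders_df["ProductID"])

# Simple one-hot encoding: 1 if the customer ordered the product at least once
binary_matrix = (count_matrix > 0).astype(int)

print(count_matrix)
print(binary_matrix)
```

Either matrix has one row per customer and one column per product ID, so it can be passed to the clustering methods used above. With many products the matrix becomes wide and sparse, which is worth keeping in mind when choosing a distance measure.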