# 5. Clustering

This Jupyter notebook is part of an exercise series titled *Clustering*.\
The series itself is based on lecture *8. Cluster Analysis*.

This exercise series is divided into two parts. There will be one exercise session per part (= one part per week):

- **5.1.** [A Close Look at K-Means and DBSCAN](./5.1-A-Close-Look-at-K-Means-and-DBSCAN.ipynb) (*last week's notebook*)
- **5.2.** Clustering in Python (*this notebook*)
    - **5.2.1.** [Clustering Products Based on their Profitability](#5.2.1.-Clustering-Products-Based-on-their-Profitability)
    - **5.2.2.** [Clustering Customers Based on their Interests](#5.2.2.-Clustering-Customers-Based-on-their-Interests)
        - **5.2.2.1.** [Clustering](#5.2.2.1.-Clustering)
        - **5.2.2.2.** [Validation](#5.2.2.2.-Validation)

<div class="alert alert-block alert-warning">

**Important:**
    
Work on the respective part yourself **BEFORE** each exercise session. The exercise session is **NOT** intended to take a first look at the exercise sheet, but to solve problems students had while preparing the exercise sheet beforehand.
    
</div>

## 5.2. Clustering in Python

After implementing K-means and DBSCAN in [**5.1.**](./5.1-A-Close-Look-at-K-Means-and-DBSCAN.ipynb), in this part you get to try out your knowledge of clustering methods with the AdventureWorks database. 

For this purpose, you will be presented with two fictitious scenarios, which you may work on either completely independently or guided:

   - **5.2.1.** [Clustering Products Based on their Profitability](#5.2.1.-Clustering-Products-Based-on-their-Profitability)
   - **5.2.2.** [Clustering Customers Based on their Interests](#5.2.2.-Clustering-Customers-Based-on-their-Interests)

In [None]:
# Import required libraries
import os
import tempfile
import sqlite3
import urllib.request
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.cluster import KMeans
from sklearn.cluster import DBSCAN
from sklearn.cluster import Birch
from sklearn.preprocessing import MinMaxScaler

In [None]:
# Create a temporary directory
dataset_folder = tempfile.mkdtemp()

# Build path to database
database_path = os.path.join(dataset_folder, "adventure-works.db")

# Get the database
urllib.request.urlretrieve(
    "https://github.com/FAU-CS6/KDD-Databases/raw/main/AdventureWorks/adventure-works.db",
    database_path,
)

# Open connection to the adventure-works.db
connection = sqlite3.connect(database_path)

### 5.2.1. Clustering Products Based on their Profitability

The first scenario is intended as a first look at clustering in practice, without major stumbling blocks. For this purpose, put yourself in the following scenario:

*You are once again a Data Scientist at the fictitious company Adventure Works GmbH. After your successful analyses on the topic of Frequent Patterns, your bosses now assign you the task of dividing the products into groups of different profitability.*

*In discussions with your business administration colleagues, you learn that the decisive metrics here are probably the number of products actually sold and the profit per product (the sales price minus the production costs).*

*The colleagues from IT tell you that you will probably find the required data in the tables `Product` and `SalesOrderDetail`.*

You are now given two options:

You can either do the assignment on your own or try a step-by-step guided version.

- **Option 1:** [Cluster the Products on your Own](#Option-1:-Cluster-the-Products-on-your-Own)
- **Option 2:** [Cluster the Products using Step-by-Step Tasks](#Option-2:-Cluster-the-Products-using-Step-by-Step-Tasks)

We recommend that you first try it on your own and only switch to the guided version if you encounter problems.

#### Option 1: Cluster the Products on your Own

You may, of course, once again tackle the task without help. Be aware that the two values suggested by your colleagues may not be available in the required form at the beginning. Also, think about which clustering method is best suited for the task.

<div class="alert alert-block alert-info">

**Task 1.1:** 
    
Group the products within the OLTP database of the fictitious Adventure Works GmbH according to their profitability. Furthermore, visualize the result.

</div>

In [None]:
# Cluster the products based on their profitability (Code placeholder 01/10)

In [None]:
# Cluster the products based on their profitability (Code placeholder 02/10)

In [None]:
# Cluster the products based on their profitability (Code placeholder 03/10)

In [None]:
# Cluster the products based on their profitability (Code placeholder 04/10)

In [None]:
# Cluster the products based on their profitability (Code placeholder 05/10)

In [None]:
# Cluster the products based on their profitability (Code placeholder 06/10)

In [None]:
# Cluster the products based on their profitability (Code placeholder 07/10)

In [None]:
# Cluster the products based on their profitability (Code placeholder 08/10)

In [None]:
# Cluster the products based on their profitability (Code placeholder 09/10)

In [None]:
# Cluster the products based on their profitability (Code placeholder 10/10)

In [None]:
# Sample solution => See Option 2

#### Option 2: Cluster the Products using Step-by-Step Tasks 

A first step in any KDD Task should always be to get a look at the available data. In this case we already know which tables may be relevant for us: `Product` and `SalesOrderDetail`.

<div class="alert alert-block alert-info">

**Task 1.2.1:**
    
Load the relations `Product` and `SalesOrderDetail` into two individual DataFrames and display the first ten rows of each DataFrame.

</div>

In [None]:
# Load Product into a DataFrame and display the first ten rows

In [None]:
# Load SalesOrderDetail into a DataFrame and display the first ten rows

In [None]:
# Load Product into a DataFrame and display the first ten rows
product_df = pd.read_sql_query("SELECT * FROM Product", connection)
product_df.head(10)

In [None]:
# Load SalesOrderDetail into a DataFrame and display the first ten rows
sales_order_detail_df = pd.read_sql_query("SELECT * FROM SalesOrderDetail", connection)
sales_order_detail_df.head(10)

It can be seen that the attribute `ProductID` is probably the link between the two tables. It is less clear how to calculate the two metrics suggested by the business colleagues.

<div class="alert alert-block alert-info">

**Task 1.2.2:** 
    
Consider which attributes are needed for the two proposed metrics and how to combine them. Do not worry about the implementation in SQL/Python at this moment.
If you do not yet know everything about the data sets that you need to accomplish this task, try to learn more about the data sets.

</div>

How to compute the metrics required for the clustering:

- **Number of copies sold per product:**  
Write down your solution here
- **Average profit per copy sold (per product):**  
Write down your solution here

How to compute the metrics required for the clustering:

- **Number of copies sold per product:**  
The table `SalesOrderDetail` contains information on how many products have been sold within that single order (`OrderQty`). If we sum up the `OrderQty` per product we get the number of copies sold per product.
- **Average profit per copy sold (per product):**  
First of all, it must be understood that the profit per sale can be calculated simply by subtracting the manufacturing cost (`StandardCost` in `Product`) from the actual selling price (`UnitPrice` in `SalesOrderDetail`). In this case, however, the profits should not be summed up per product; instead, their average should be determined.
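
The two metrics described above can also be sketched in pandas. The following is a minimal illustration on made-up toy data, not the sample solution: the DataFrames `toy_product_df` and `toy_sales_order_detail_df` and all their values are assumptions that only mimic the relevant columns of `Product` and `SalesOrderDetail`. Note that the average is taken over order lines and is not weighted by `OrderQty`.

```python
import pandas as pd

# Toy stand-ins for the Product and SalesOrderDetail tables (values are made up)
toy_product_df = pd.DataFrame(
    {"ProductID": [1, 2], "StandardCost": [10.0, 50.0]}
)
toy_sales_order_detail_df = pd.DataFrame(
    {
        "ProductID": [1, 1, 2, 2, 2],
        "OrderQty": [3, 2, 1, 4, 1],
        "UnitPrice": [15.0, 13.0, 80.0, 70.0, 90.0],
    }
)

# Attach the production cost to every order line via ProductID
merged_df = toy_sales_order_detail_df.merge(toy_product_df, on="ProductID")

# Profit of one order line: selling price minus production cost
merged_df["Profit"] = merged_df["UnitPrice"] - merged_df["StandardCost"]

# Aggregate per product: average profit per line and total quantity sold
toy_overview_df = (
    merged_df.groupby("ProductID")
    .agg(ProfitPerUnit=("Profit", "mean"), OverallOrderQty=("OrderQty", "sum"))
    .reset_index()
)
print(toy_overview_df)
```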

<div class="alert alert-block alert-info">

**Task 1.2.3:** 
    
Calculate both values (`ProfitPerUnit` and `OverallOrderQty`) and generate a corresponding DataFrame `product_overview_df` containing the `ProductID` as well as the `ProfitPerUnit` and the `OverallOrderQty`. You may use SQL and/or Python to perform the computation.

</div>

In [None]:
# Load `ProductID`,`ProfitPerUnit` and `OverallOrderQty` into a DataFrame

In [None]:
# Load `ProductID`,`ProfitPerUnit` and `OverallOrderQty` into a DataFrame
product_overview_df = pd.read_sql_query(
    "SELECT p.ProductID, AVG(sod.UnitPrice - p.StandardCost) AS ProfitPerUnit, SUM(sod.OrderQty) AS OverallOrderQty FROM Product p, SalesOrderDetail sod WHERE p.ProductID = sod.ProductID GROUP BY p.ProductID",
    connection,
)
product_overview_df.head(10)

Since the value range of `OverallOrderQty` goes from 4 to 8311 and the value range of `ProfitPerUnit` only goes from about -55 to about 1155, this dataset would not currently be a good fit for most clustering techniques. The `OverallOrderQty` would have a much higher influence in this case, which is why it makes sense to normalize the `product_overview_df` first. 
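
To make the effect of the different value ranges concrete, here is a small sketch using the approximate ranges quoted above. The three products `a`, `b` and `c` are hypothetical, and the helper `min_max_scale` is only an illustration of the usual min-max formula `(x - min) / (max - min)`, not part of the exercise code.

```python
import numpy as np

# Approximate value ranges quoted in the text: (ProfitPerUnit, OverallOrderQty)
feature_min = np.array([-55.0, 4.0])
feature_max = np.array([1155.0, 8311.0])


def min_max_scale(x):
    """Map each feature linearly onto [0, 1] using the ranges above."""
    return (x - feature_min) / (feature_max - feature_min)


# Three hypothetical products (values are made up)
a = np.array([0.0, 1000.0])
b = np.array([1000.0, 1000.0])  # differs from a only in profit
c = np.array([0.0, 3000.0])  # differs from a only in quantity

# Euclidean distances before and after scaling
raw_ab = np.linalg.norm(a - b)
raw_ac = np.linalg.norm(a - c)
scaled_ab = np.linalg.norm(min_max_scale(a) - min_max_scale(b))
scaled_ac = np.linalg.norm(min_max_scale(a) - min_max_scale(c))

print(f"raw:    d(a,b)={raw_ab:.1f}  d(a,c)={raw_ac:.1f}")
print(f"scaled: d(a,b)={scaled_ab:.3f}  d(a,c)={scaled_ac:.3f}")
```

On the raw values, `c` appears farther from `a` than `b` does, although `b` spans most of the profit range. After scaling, the order is reversed.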

<div class="alert alert-block alert-info">

**Task 1.2.4:**
    
Normalize the `product_overview_df`. You may write your own function or use a library for this. The `MinMaxScaler` from scikit-learn is already imported.

</div>

In [None]:
# Normalize the product_overview_df

In [None]:
# Normalize the product_overview_df
min_max_scaler = MinMaxScaler()
product_overview_df[
    ["ProfitPerUnit", "OverallOrderQty"]
] = min_max_scaler.fit_transform(
    product_overview_df[["ProfitPerUnit", "OverallOrderQty"]]
)
product_overview_df.head(10)

The resulting DataFrame is now well suited for clustering. However, the question is which clustering method should be used. We will focus on K-Means, DBSCAN and BIRCH. All of these methods have been presented in the lecture and are implemented in scikit-learn. 

<div class="alert alert-block alert-info">

**Task 1.2.5:**
    
Run scikit-learn's K-means for the DataFrame at hand. You have to determine a good number of clusters yourself. Visualize the results as known from Part One.

</div>

In [None]:
# Perform scikit-learn's K-means clustering on the dataset

In [None]:
# Perform scikit-learn's K-means clustering on the dataset
kmeans = KMeans(n_clusters=6, n_init="auto").fit(
    product_overview_df[["ProfitPerUnit", "OverallOrderQty"]]
)

# Save the labels to a copy of product_overview_df so the original DataFrame stays unchanged
clustered_product_overview_df = product_overview_df.copy()
clustered_product_overview_df["cluster"] = kmeans.labels_

# Plot the result
plt.figure(figsize=(8, 8))
sns.scatterplot(
    x=clustered_product_overview_df["ProfitPerUnit"],
    y=clustered_product_overview_df["OverallOrderQty"],
    hue=clustered_product_overview_df["cluster"],
    palette="deep",
)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.0)

<div class="alert alert-block alert-info">

**Task 1.2.6:**
    
Run scikit-learn's DBSCAN for the DataFrame at hand. DBSCAN does not take a number of clusters, so you have to determine good values for the parameters `eps` and `min_samples` yourself. Visualize the results as known from Part One.

</div>

In [None]:
# Perform scikit-learn's DBSCAN clustering on the dataset

In [None]:
# Perform scikit-learn's DBSCAN clustering on the dataset
dbscan = DBSCAN(eps=0.2, min_samples=5).fit(
    product_overview_df[["ProfitPerUnit", "OverallOrderQty"]]
)

# Save the labels to a copy of product_overview_df so the original DataFrame stays unchanged
clustered_product_overview_df = product_overview_df.copy()
clustered_product_overview_df["cluster"] = dbscan.labels_

# Plot the result
plt.figure(figsize=(8, 8))
sns.scatterplot(
    x=clustered_product_overview_df["ProfitPerUnit"],
    y=clustered_product_overview_df["OverallOrderQty"],
    hue=clustered_product_overview_df["cluster"],
    palette="deep",
)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.0)
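
A frequently used aid for picking `eps`, again not prescribed by this exercise, is the sorted k-distance curve: for every point, compute the distance to its `min_samples`-th nearest neighbor (counting the point itself) and look for a sharp bend in the sorted values. The sketch below uses synthetic data; `toy_points` and all values are assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Synthetic data: one dense group plus a few scattered points
rng = np.random.default_rng(0)
dense = 0.5 + 0.02 * rng.standard_normal((100, 2))
scattered = rng.uniform(0.0, 1.0, size=(5, 2))
toy_points = np.vstack([dense, scattered])

min_samples = 5

# Because the query points are the fitted points, each point is returned
# as its own nearest neighbor at distance 0 and therefore counts as one of the k
neighbors = NearestNeighbors(n_neighbors=min_samples).fit(toy_points)
distances, _ = neighbors.kneighbors(toy_points)

# Distance to the k-th neighbor per point, sorted in ascending order
k_distances = np.sort(distances[:, -1])

print("median k-distance:", round(float(np.median(k_distances)), 4))
print("largest k-distances:", np.round(k_distances[-5:], 4))
```

Plotting `k_distances` (for example with `plt.plot(k_distances)`) typically shows a flat part for points in dense regions and a steep rise for isolated points. A value of `eps` near the bend is a reasonable starting point.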

<div class="alert alert-block alert-info">

**Task 1.2.7:**
    
Run scikit-learn's BIRCH for the DataFrame at hand. You have to determine a good number of clusters yourself. Visualize the results as known from Part One.

</div>

In [None]:
# Perform scikit-learn's BIRCH clustering on the dataset

In [None]:
# Perform scikit-learn's BIRCH clustering on the dataset
birch = Birch(threshold=0.1, n_clusters=6).fit(
    product_overview_df[["ProfitPerUnit", "OverallOrderQty"]]
)

# Save the labels to a copy of product_overview_df so the original DataFrame stays unchanged
clustered_product_overview_df = product_overview_df.copy()
clustered_product_overview_df["cluster"] = birch.labels_

# Plot the result
plt.figure(figsize=(8, 8))
sns.scatterplot(
    x=clustered_product_overview_df["ProfitPerUnit"],
    y=clustered_product_overview_df["OverallOrderQty"],
    hue=clustered_product_overview_df["cluster"],
    palette="deep",
)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.0)

Of course, your fictitious bosses don't want to be presented with three different results. One clustering method has to be chosen. For this purpose, it is useful to compare the results briefly.

<div class="alert alert-block alert-info">

**Task 1.2.8:**
    
Compare the results you have achieved with all three methods and consider which one you think is best.

</div>

Write down your solution here:

Of course, this comparison also depends on the results you were able to achieve through optimization of parameters. However, some things should be universally valid:

- **K-means and BIRCH:**  
Both K-means and BIRCH produce quite similar results in this case. With our parameters we can interpret the found clusters the following way: 
  - Less interesting from a business perspective:
    - The cluster that does not bring much profit per unit and that was not sold frequently.
    - The cluster that does not bring much profit per unit, but was sold at least a little more often.
    - The cluster that brings a little more profit per unit, but which was not sold frequently.
  - Interesting from a business perspective:
    - The cluster that does not bring much profit per unit, but which was sold extremely often.
    - The cluster that is average both in terms of profit per unit and frequency of sales.
    - The cluster that brings an extremely high profit per unit, but which was hardly sold.

- **DBSCAN:**  
With our parameters we can interpret the found clusters the following way:
  - Less interesting from a business perspective:
    - Merged into one cluster
  - Interesting from a business perspective:
    - The cluster that does not bring much profit per unit, but which was sold extremely often.
    - The cluster that is average both in terms of profit per unit and frequency of sales.
    - The cluster that brings an extremely high profit per unit, but which was hardly sold.
    
Since DBSCAN merges the less interesting products into a cluster, it can be said that DBSCAN is probably the better choice here. This way, the focus in a presentation can be placed on the economically more interesting product clusters. 

However, it is of course not a big problem in this case if K-means or BIRCH are used, since one can at least argue that here a better distinction is made between completely uninteresting and at least somewhat interesting products.
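
Interpretations like the ones above are easier to derive from a per-cluster summary than from the scatter plot alone. The following sketch shows one possible way to build such a summary. The DataFrame `toy_clustered_df` and its values are made up and only mimic the columns of `clustered_product_overview_df`.

```python
import pandas as pd

# Made-up, already normalized products with cluster labels
toy_clustered_df = pd.DataFrame(
    {
        "ProfitPerUnit": [0.05, 0.10, 0.08, 0.90, 0.95, 0.07],
        "OverallOrderQty": [0.02, 0.05, 0.04, 0.03, 0.01, 0.98],
        "cluster": [0, 0, 0, 1, 1, 2],
    }
)

# Mean of both metrics and number of products per cluster
cluster_profile_df = toy_clustered_df.groupby("cluster").agg(
    MeanProfitPerUnit=("ProfitPerUnit", "mean"),
    MeanOverallOrderQty=("OverallOrderQty", "mean"),
    Products=("ProfitPerUnit", "size"),
)
print(cluster_profile_df)
```

In this toy example, cluster `1` would be the "high profit, hardly sold" group and cluster `2` the "low profit, sold extremely often" group.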

### 5.2.2. Clustering Customers Based on their Interests

The first scenario was quite straightforward. Before we briefly look at why this is not the case with the second scenario, we must first introduce it:

*As a Data Scientist who has already solved two different tasks for his bosses in the fictitious Adventure Works GmbH, you are immediately assigned another task. Your bosses want to make the sales team more efficient by assigning customers with similar product interests to the same employee.*

*In order to be able to carry out this reassignment, you are tasked with dividing the customers into 16 clusters (there are 16 sales persons in the company). This classification is to be based on the products that the customers have ordered in the past.*

*Via the IT department you learn that the customers can probably be found in the table `Customer`. You will need to join the table `SalesOrderHeader` and then the table `SalesOrderDetail` to get information on the ordered ProductIDs per customer.*

But before you can dive into the task on your own or with help, we first need to take a look at the relevant dataset:

In [None]:
customer_purchases_df = pd.read_sql_query(
    "SELECT c.CustomerID, sod.ProductID, sod.OrderQty FROM Customer c JOIN SalesOrderHeader soh ON c.CustomerID = soh.CustomerID JOIN SalesOrderDetail sod ON sod.SalesOrderID = soh.SalesOrderID",
    connection,
    index_col="CustomerID",
)
customer_purchases_df

If the attribute `ProductID` were fed directly to the clustering in this form, simply the numerical distance between two product IDs would be used to calculate a distance between customers. 

A customer who bought only the product with ID `707` would thus be more dissimilar to a customer who bought only the product with ID `879` than to a customer who bought only the product with ID `712`. Usually, however, the distribution of product IDs is not based on similarity, but simply on the order in which the products were added to the catalog.

So for this task, we first need to put the data into a different format. Basically, we need to compare each customer's interest in each product individually. That means we need one dimension per product.

But here, too, we have to choose between two different possibilities:

- **Concept 1:** Determine for each product if the customer has purchased the product (binary scale):

|            | Product 1 | Product 2 | Product 3 | Product 4 |
|------------|-----------|-----------|-----------|-----------|
| Customer 1 | 1         | 0         | 0         | 0         |
| Customer 2 | 1         | 1         | 0         | 0         |
| Customer 3 | 0         | 1         | 1         | 1         |


- **Concept 2:** Determine the sum of copies purchased for each product (continuous scale): 

|            | Product 1 | Product 2 | Product 3 | Product 4 |
|------------|-----------|-----------|-----------|-----------|
| Customer 1 | 236       | 0         | 0         | 0         |
| Customer 2 | 1         | 199       | 0         | 0         |
| Customer 3 | 0         | 199       | 5         | 1         |

In the first variant, `Customer 1` and `Customer 2` would be most similar. In the second variant, the interests of `Customer 2` and `Customer 3` would be most similar.
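
The claim can be checked directly on the two example tables. The sketch below assumes Euclidean distance, which is what K-means is based on. The helper `closest_pair` is a hypothetical name introduced only for this illustration.

```python
from itertools import combinations

import numpy as np

customers = ["Customer 1", "Customer 2", "Customer 3"]

# Concept 1: binary purchase indicators (rows = customers, columns = products)
binary = np.array(
    [[1, 0, 0, 0], [1, 1, 0, 0], [0, 1, 1, 1]], dtype=float
)

# Concept 2: number of copies purchased
quantities = np.array(
    [[236, 0, 0, 0], [1, 199, 0, 0], [0, 199, 5, 1]], dtype=float
)


def closest_pair(matrix):
    """Return the names of the two customers with the smallest Euclidean distance."""
    pairs = combinations(range(len(customers)), 2)
    i, j = min(pairs, key=lambda p: np.linalg.norm(matrix[p[0]] - matrix[p[1]]))
    return customers[i], customers[j]


print(closest_pair(binary))      # ('Customer 1', 'Customer 2')
print(closest_pair(quantities))  # ('Customer 2', 'Customer 3')
```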

Since both options have pros and cons in your case, you decide to discuss the two options with your superiors:

*When you present the two options to your supervisors, they decide that the number of copies purchased should be relevant in determining similar interests. One of the supervisors justifies this decision as follows:*

*In the example, each of the customers obviously has more interest in certain products. A sales person who looks after `Customer 1` and `Customer 2` would be responsible for large orders for `Product 1` and `Product 2` and would therefore have to be well versed in both products. A shared sales person for `Customer 2` and `Customer 3`, by contrast, would only need to be an expert in `Product 2`.*

#### 5.2.2.1. Clustering

You are now given two options:

You can either do the assignment on your own or try a step-by-step guided version.

- **Option 1:** [Cluster the Customers on your Own](#Option-1:-Cluster-the-Customers-on-your-Own)
- **Option 2:** [Cluster the Customers using Step-by-Step Tasks](#Option-2:-Cluster-the-Customers-using-Step-by-Step-Tasks)

We recommend that you first try it on your own and only switch to the guided version if you encounter problems.

##### Option 1: Cluster the Customers on your Own

Now that you are prepared for one of the stumbling blocks in this task, you are again free to tackle it on your own. Since more than two dimensions are relevant this time, no visualization is necessary. 

<div class="alert alert-block alert-info">

**Task 2.1:**
    
Group the customers into 16 clusters by using K-means. You do not need to visualize the result.

</div>

In [None]:
# Cluster the customers based on their interests (Code placeholder 01/10)

In [None]:
# Cluster the customers based on their interests (Code placeholder 02/10)

In [None]:
# Cluster the customers based on their interests (Code placeholder 03/10)

In [None]:
# Cluster the customers based on their interests (Code placeholder 04/10)

In [None]:
# Cluster the customers based on their interests (Code placeholder 05/10)

In [None]:
# Cluster the customers based on their interests (Code placeholder 06/10)

In [None]:
# Cluster the customers based on their interests (Code placeholder 07/10)

In [None]:
# Cluster the customers based on their interests (Code placeholder 08/10)

In [None]:
# Cluster the customers based on their interests (Code placeholder 09/10)

In [None]:
# Cluster the customers based on their interests (Code placeholder 10/10)

In [None]:
# Sample solution => See Option 2

##### Option 2: Cluster the Customers using Step-by-Step Tasks

Having already considered the data set during the explanation of the issues with this task, we do not propose a mandatory "Getting to Know Your Data" task at this point. However, we do recommend that you continue to familiarize yourself with the dataset at hand if you still see ambiguities:

In [None]:
# Placeholder for optional "Getting to Know Your Data" code

The first step on the way to clustering is to load a DataFrame from the database that lists the sum of all purchased copies per customer and product.

<div class="alert alert-block alert-info">

**Task 2.2.1:**
    
Load a DataFrame that contains the total copies (`TotalOrderQty`) purchased per customer and product.

</div>

In [None]:
# Load a DataFrame that contains the total copies purchased per customer and product

In [None]:
# Load a DataFrame that contains the total copies purchased per customer and product
customer_interests_df = pd.read_sql_query(
    "SELECT c.CustomerID, sod.ProductID, SUM(sod.OrderQty) AS TotalOrderQty FROM Customer c JOIN SalesOrderHeader soh ON c.CustomerID = soh.CustomerID JOIN SalesOrderDetail sod ON sod.SalesOrderID = soh.SalesOrderID GROUP BY c.CustomerID, sod.ProductID",
    connection,
)
customer_interests_df.tail(10)

We can now pivot this DataFrame so that the `CustomerID` forms the index of the rows, the `ProductID` the columns and the `TotalOrderQty` the values. 

<div class="alert alert-block alert-info">

**Task 2.2.2:**
    
Use `pivot` to bring the DataFrame into the correct format. Replace created `NaN` values with `0`s. 

</div>

In [None]:
# Pivot the DataFrame

In [None]:
# Pivot the DataFrame
customer_interests_pivot_df = customer_interests_df.pivot(
    index="CustomerID", columns="ProductID", values="TotalOrderQty"
).fillna(0)
customer_interests_pivot_df.tail(10)
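
`pivot` only works here because the SQL query has already aggregated the data to one row per customer and product; with duplicate (`CustomerID`, `ProductID`) pairs it raises an error. As an alternative sketch, pandas' `pivot_table` can sum and pivot in a single step, starting from unaggregated order lines. The DataFrame `toy_purchases_df` and its values are made up for illustration.

```python
import pandas as pd

# Made-up order lines: customer 1 bought product 707 in two separate orders
toy_purchases_df = pd.DataFrame(
    {
        "CustomerID": [1, 1, 2, 2, 3],
        "ProductID": [707, 707, 707, 712, 879],
        "OrderQty": [2, 3, 1, 4, 5],
    }
)

# Sum the quantities per customer and product, fill missing combinations with 0
toy_pivot_df = toy_purchases_df.pivot_table(
    index="CustomerID",
    columns="ProductID",
    values="OrderQty",
    aggfunc="sum",
    fill_value=0,
)
print(toy_pivot_df)
```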

This DataFrame is now prepared to be used in clustering. 

<div class="alert alert-block alert-info">

**Task 2.2.3:**
    
Use the K-means implementation of scikit-learn to cluster the DataFrame into 16 clusters.

</div>

In [None]:
# Perform scikit-learn's K-means clustering on the dataset

In [None]:
# Perform scikit-learn's K-means clustering on the dataset
kmeans = KMeans(n_clusters=16, n_init="auto").fit(customer_interests_pivot_df)

# Save the labels to a copy of customer_interests_pivot_df so the original DataFrame stays unchanged
clustered_customer_interests_pivot_df = customer_interests_pivot_df.copy()
clustered_customer_interests_pivot_df["cluster"] = kmeans.labels_

# Print the resulting DataFrame
clustered_customer_interests_pivot_df.tail(10)

#### 5.2.2.2. Validation

Now that we have 16 clusters, one might assume that the task has been solved. However, there is another stumbling block in this task.

<div class="alert alert-block alert-info">

**Task 3:**
    
Output how many customers are in each of the clusters.

</div>

In [None]:
# Get the count of customers per cluster

In [None]:
# Get the count of customers per cluster
clustered_customer_interests_pivot_df[["cluster"]].value_counts()

A very unbalanced distribution can be seen, indicating that these clusters should definitely not be used in this way to assign customers to the 16 sales persons. If you try the different clustering methods implemented in scikit-learn, you will notice that none of the methods changes this imbalance decisively. 
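
The imbalance can be quantified by looking at relative instead of absolute cluster sizes. The following sketch uses a made-up label Series `toy_labels` with four clusters as a stand-in for the `cluster` column; the sizes are assumptions and not the result of the clustering above.

```python
import pandas as pd

# Made-up cluster labels: one dominant cluster and three small ones
toy_labels = pd.Series([0] * 90 + [1] * 6 + [2] * 3 + [3] * 1, name="cluster")

# Absolute and relative cluster sizes
cluster_sizes = toy_labels.value_counts()
cluster_shares = toy_labels.value_counts(normalize=True)

# Share each cluster would have if all clusters were equally large
even_share = 1 / toy_labels.nunique()

print(cluster_sizes)
print(cluster_shares.round(2))
print("even share:", even_share)
```

Comparing each share with the even share (here 0.25, in the exercise 1/16) shows at a glance how far the result is from a balanced assignment.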

This is simply because clustering is not designed to achieve (roughly) equal cluster sizes. There are ways to work around this (e.g., calculate significantly more clusters and then merge each of these with neighbors until about the required size is reached), but these new groups would then not necessarily only contain customers with similar interests. 

It would make more sense here to approach the fictitious bosses again and tell them that similarity of interests is probably not the best criterion for dividing customers among sales persons. 

Since it is of course a shame to end an exercise sheet with a perceived failure: 
There are some interesting things that can be concluded from the identified clusters. We now know that the majority of the customer base seems to share similar interests. This can be pitched to the management to further specialize the focus of the company. Even failures in data science sometimes contain new insights; you just have to be open enough to discover them.