# 5. Clustering

This Jupyter notebook is part of an exercise series titled *Clustering*. <br/>
The series itself is based on lecture *8. Cluster Analysis*.

This exercise series is divided into two parts. There will be one exercise session per part (= one part per week):

- **5.1.** A Close Look at K-Means and DBSCAN (*this notebook*)
    - **5.1.1.** [K-Means](#5.1.1.-K-Means)
        - **5.1.1.1.** [Application by Hand](#5.1.1.1.-Application-by-Hand)
        - **5.1.1.2.** [Implementation](#5.1.1.2.-Implementation)
- **5.2.** [Clustering in Python](./5.2-Clustering-in-Python.ipynb) (*next week's notebook*)

<div class="alert alert-block alert-warning">

**Important:**
    
Work on the respective part yourself <u>BEFORE</u> each exercise session. The exercise session is <u>NOT</u> intended to take a first look at the exercise sheet, but to solve problems students had while preparing the exercise sheet beforehand.
    
</div>

## 5.1. A Close Look at K-Means and DBSCAN

In this part we will take a closer look at two clustering methods known from the lecture: K-means and DBSCAN. 

In the following, you will first apply both methods step by step by hand to a data set and then write your own implementation for both methods.

In [None]:
# Import the required libraries
import math
import pandas as pd
import seaborn as sns
import numpy as np
from sklearn.datasets import make_blobs
from matplotlib import pyplot as plt

The data set to cluster is the same for both methods:

In [None]:
# Create the dataset
dataset = pd.DataFrame(
    [
        [1, 1],
        [1, 2],
        [1, 4],
        [2, 1],
        [2, 3],
        [3, 2],
        [3, 4],
        [4, 1],
        [4, 3],
        [4, 4],
    ],
    columns=["x", "y"],
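)

# The implementation cells further below refer to this data as `small_dataset`
small_dataset = dataset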

# Output the dataset in a scatterplot diagram
plt.figure(figsize=(4, 4))
sns.scatterplot(x=dataset["x"], y=dataset["y"])

### 5.1.1. K-Means

The first clustering method we are taking a close look at is K-means. It is part of the partitioning methods.

#### 5.1.1.1. Application by Hand

In order to familiarise yourself with K-means, you should first apply K-means by hand.

Given is the small data set:

| x | y |
|:-:|:-:|
| 1 | 1 |
| 1 | 2 |
| 1 | 4 |
| 2 | 1 |
| 2 | 3 |
| 3 | 2 |
| 3 | 4 |
| 4 | 1 |
| 4 | 3 |
| 4 | 4 |

Now apply K-Means to this data set by hand, using the Euclidean distance and $k = 3$. You can do this in one of two ways:

- **Option 1:** [Apply K-Means on your own](#Option-1:-Apply-K-Means-on-your-own)
- **Option 2:** [Apply K-Means step by step](#Option-2:-Apply-K-Means-step-by-step)

It is recommended that you first try it on your own and only resort to the guided step-by-step variant if you have problems.

##### Option 1: Apply K-Means on your own

<div class="alert alert-block alert-info">

**Task 1.1:** 
    
Use K-Means to cluster the data points into three clusters. Write down all intermediate steps.

</div>

Write down your solution here:

Sample solution => See Option 2

##### Option 2: Apply K-Means step by step

The first step of K-Means is to arbitrarily distribute all data points into k partitions. This can be done in many different ways (e.g. randomly or by dividing the points into partitions of equal size).

In this case we distribute the points into (approximately) equal-sized partitions:

| x | y | Partition |
|:-:|:-:|:---------:|
| 1 | 1 |     1     |
| 1 | 2 |     1     |
| 1 | 4 |     1     |
| 2 | 1 |     1     |
| 2 | 3 |     2     |
| 3 | 2 |     2     |
| 3 | 4 |     2     |
| 4 | 1 |     3     |
| 4 | 3 |     3     |
| 4 | 4 |     3     |
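
The equal-sized split shown in the table can be reproduced with a few lines of plain Python. This is only a sketch: the function name `equal_sized_labels` is made up for this example, and it numbers the partitions 0 to 2 instead of 1 to 3.

```python
def equal_sized_labels(n_points, k):
    """Return one partition label (0..k-1) per point, in the order of the points."""
    quotient, remainder = divmod(n_points, k)
    labels = []
    for part in range(k):
        # The first `remainder` partitions receive one additional point
        size = quotient + (1 if part < remainder else 0)
        labels.extend([part] * size)
    return labels


labels = equal_sized_labels(10, 3)
print(labels)
```

For ten points and three partitions, the first partition receives four points and the other two receive three each, as in the table.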

The next step is to calculate the centroids of the partitions.

<div class="alert alert-block alert-info">

**Task 1.2.1:** 
    
Determine the coordinates of the centroids of the partitions.

</div>

Write down your solution here:

**Centroid of Partition 1:**

- $x = \frac{1 + 1 + 1 + 2}{4} = 1.25$
- $y = \frac{1 + 2 + 4 + 1}{4} = 2$

**Centroid of Partition 2:**

- $x = \frac{2 + 3 + 3}{3} = \frac{8}{3} \approx 2.667$
- $y = \frac{3 + 2 + 4}{3} = 3$

**Centroid of Partition 3:**

- $x = \frac{4 + 4 + 4}{3} = 4$
- $y = \frac{1 + 3 + 4}{3} = \frac{8}{3} \approx 2.667$

Next, the nearest centroid is determined for each of the original data points.

<div class="alert alert-block alert-info">

**Task 1.2.2:** 
    
For each data point, determine which centroid has the smallest Euclidean distance to that point.
    
</div>

Write down your solution here:

**Data Point at (1,1):**
- $Distance_{(1,1)\leftrightarrow(1.25,2.0)} = \sqrt{(1-1.25)^2+(1-2.0)^2} \approx 1.03$
- $Distance_{(1,1)\leftrightarrow(2.67,3.0)} = \sqrt{(1-2.67)^2+(1-3.0)^2} \approx 2.6$
- $Distance_{(1,1)\leftrightarrow(4.0,2.67)} = \sqrt{(1-4.0)^2+(1-2.67)^2} \approx 3.43$

Nearest centroid: $(1.25,2.0)$ (Partition 1)

**Data Point at (1,2):**
- $Distance_{(1,2)\leftrightarrow(1.25,2.0)} = \sqrt{(1-1.25)^2+(2-2.0)^2} \approx 0.25$
- $Distance_{(1,2)\leftrightarrow(2.67,3.0)} = \sqrt{(1-2.67)^2+(2-3.0)^2} \approx 1.94$
- $Distance_{(1,2)\leftrightarrow(4.0,2.67)} = \sqrt{(1-4.0)^2+(2-2.67)^2} \approx 3.07$

Nearest centroid: $(1.25,2.0)$ (Partition 1)

**Data Point at (1,4):**
- $Distance_{(1,4)\leftrightarrow(1.25,2.0)} = \sqrt{(1-1.25)^2+(4-2.0)^2} \approx 2.02$
- $Distance_{(1,4)\leftrightarrow(2.67,3.0)} = \sqrt{(1-2.67)^2+(4-3.0)^2} \approx 1.94$
- $Distance_{(1,4)\leftrightarrow(4.0,2.67)} = \sqrt{(1-4.0)^2+(4-2.67)^2} \approx 3.28$

Nearest centroid: $(2.67,3.0)$ (Partition 2)

**Data Point at (2,1):**
- $Distance_{(2,1)\leftrightarrow(1.25,2.0)} = \sqrt{(2-1.25)^2+(1-2.0)^2} \approx 1.25$
- $Distance_{(2,1)\leftrightarrow(2.67,3.0)} = \sqrt{(2-2.67)^2+(1-3.0)^2} \approx 2.11$
- $Distance_{(2,1)\leftrightarrow(4.0,2.67)} = \sqrt{(2-4.0)^2+(1-2.67)^2} \approx 2.6$

Nearest centroid: $(1.25,2.0)$ (Partition 1)

**Data Point at (2,3):**
- $Distance_{(2,3)\leftrightarrow(1.25,2.0)} = \sqrt{(2-1.25)^2+(3-2.0)^2} \approx 1.25$
- $Distance_{(2,3)\leftrightarrow(2.67,3.0)} = \sqrt{(2-2.67)^2+(3-3.0)^2} \approx 0.67$
- $Distance_{(2,3)\leftrightarrow(4.0,2.67)} = \sqrt{(2-4.0)^2+(3-2.67)^2} \approx 2.03$

Nearest centroid: $(2.67,3.0)$ (Partition 2)

**Data Point at (3,2):**
- $Distance_{(3,2)\leftrightarrow(1.25,2.0)} = \sqrt{(3-1.25)^2+(2-2.0)^2} \approx 1.75$
- $Distance_{(3,2)\leftrightarrow(2.67,3.0)} = \sqrt{(3-2.67)^2+(2-3.0)^2} \approx 1.05$
- $Distance_{(3,2)\leftrightarrow(4.0,2.67)} = \sqrt{(3-4.0)^2+(2-2.67)^2} \approx 1.2$

Nearest centroid: $(2.67,3.0)$ (Partition 2)

**Data Point at (3,4):**
- $Distance_{(3,4)\leftrightarrow(1.25,2.0)} = \sqrt{(3-1.25)^2+(4-2.0)^2} \approx 2.66$
- $Distance_{(3,4)\leftrightarrow(2.67,3.0)} = \sqrt{(3-2.67)^2+(4-3.0)^2} \approx 1.05$
- $Distance_{(3,4)\leftrightarrow(4.0,2.67)} = \sqrt{(3-4.0)^2+(4-2.67)^2} \approx 1.67$

Nearest centroid: $(2.67,3.0)$ (Partition 2)

**Data Point at (4,1):**
- $Distance_{(4,1)\leftrightarrow(1.25,2.0)} = \sqrt{(4-1.25)^2+(1-2.0)^2} \approx 2.93$
- $Distance_{(4,1)\leftrightarrow(2.67,3.0)} = \sqrt{(4-2.67)^2+(1-3.0)^2} \approx 2.4$
- $Distance_{(4,1)\leftrightarrow(4.0,2.67)} = \sqrt{(4-4.0)^2+(1-2.67)^2} \approx 1.67$

Nearest centroid: $(4.0,2.67)$ (Partition 3)

**Data Point at (4,3):**
- $Distance_{(4,3)\leftrightarrow(1.25,2.0)} = \sqrt{(4-1.25)^2+(3-2.0)^2} \approx 2.93$
- $Distance_{(4,3)\leftrightarrow(2.67,3.0)} = \sqrt{(4-2.67)^2+(3-3.0)^2} \approx 1.33$
- $Distance_{(4,3)\leftrightarrow(4.0,2.67)} = \sqrt{(4-4.0)^2+(3-2.67)^2} \approx 0.33$

Nearest centroid: $(4.0,2.67)$ (Partition 3)

**Data Point at (4,4):**
- $Distance_{(4,4)\leftrightarrow(1.25,2.0)} = \sqrt{(4-1.25)^2+(4-2.0)^2} \approx 3.4$
- $Distance_{(4,4)\leftrightarrow(2.67,3.0)} = \sqrt{(4-2.67)^2+(4-3.0)^2} \approx 1.67$
- $Distance_{(4,4)\leftrightarrow(4.0,2.67)} = \sqrt{(4-4.0)^2+(4-2.67)^2} \approx 1.33$

Nearest centroid: $(4.0,2.67)$ (Partition 3)
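
The hand computations of this first run can be cross-checked with plain Python. This is only a sketch: the helper names `centroids_of` and `nearest_label` are made up for this example, and the points and partition numbers are copied from the initial partition table.

```python
import math

points = [(1, 1), (1, 2), (1, 4), (2, 1), (2, 3), (3, 2), (3, 4), (4, 1), (4, 3), (4, 4)]
labels = [1, 1, 1, 1, 2, 2, 2, 3, 3, 3]


def centroids_of(points, labels):
    """Map each partition number to the coordinate-wise mean of its points."""
    groups = {}
    for point, label in zip(points, labels):
        groups.setdefault(label, []).append(point)
    return {
        label: (
            sum(x for x, _ in members) / len(members),
            sum(y for _, y in members) / len(members),
        )
        for label, members in groups.items()
    }


def nearest_label(point, centroids):
    """Return the partition number whose centroid has the smallest Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(point, centroids[label]))


centroids = centroids_of(points, labels)
new_labels = [nearest_label(point, centroids) for point in points]
print(centroids)
print(new_labels)
```

Comparing `new_labels` with `labels` shows which points change their partition in this run.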

In the next step, the data points are assigned to the partition to which the respective centroid belongs.

<div class="alert alert-block alert-info">

**Task 1.2.3:** 
    
Assign the points to the respective new partition.
    
</div>

| x | y | Old Partition | New Partition |
|:-:|:-:|:-------------:|:-------------:|
| 1 | 1 |       1       |       ?       |
| 1 | 2 |       1       |       ?       |
| 1 | 4 |       1       |       ?       |
| 2 | 1 |       1       |       ?       |
| 2 | 3 |       2       |       ?       |
| 3 | 2 |       2       |       ?       |
| 3 | 4 |       2       |       ?       |
| 4 | 1 |       3       |       ?       |
| 4 | 3 |       3       |       ?       |
| 4 | 4 |       3       |       ?       |

| x | y | Old Partition | New Partition |
|:-:|:-:|:-------------:|:-------------:|
| 1 | 1 |       1       |       1       |
| 1 | 2 |       1       |       1       |
| 1 | 4 |       1       |       2       |
| 2 | 1 |       1       |       1       |
| 2 | 3 |       2       |       2       |
| 3 | 2 |       2       |       2       |
| 3 | 4 |       2       |       2       |
| 4 | 1 |       3       |       3       |
| 4 | 3 |       3       |       3       |
| 4 | 4 |       3       |       3       |

Since at least one point has been assigned to a new partition, another run begins: the centroids are recomputed, the distances of all points to the new centroids are measured, and the points are reassigned. This is repeated until no point changes its partition.

<div class="alert alert-block alert-info">

**Task 1.2.4:**
    
Continue K-means until no points are reassigned.

</div>

Write down your solution here:

**Step 4: Compute the new centroids**

**Centroid of Partition 1:**

- $x = \frac{1 + 1 + 2}{3} = \frac{4}{3} \approx 1.333$
- $y = \frac{1 + 2 + 1}{3} = \frac{4}{3} \approx 1.333$

**Centroid of Partition 2:**

- $x = \frac{1 + 2 + 3 + 3}{4} = 2.25$
- $y = \frac{4 + 3 + 2 + 4}{4} = 3.25$

**Centroid of Partition 3:**

- $x = \frac{4 + 4 + 4}{3} = 4$
- $y = \frac{1 + 3 + 4}{3} = \frac{8}{3} \approx 2.667$



**Step 5: Compute the new distances**

**Data Point at (1,1):**
- $Distance_{(1,1)\leftrightarrow(1.33,1.33)} = \sqrt{(1-1.33)^2+(1-1.33)^2} \approx 0.47$
- $Distance_{(1,1)\leftrightarrow(2.25,3.25)} = \sqrt{(1-2.25)^2+(1-3.25)^2} \approx 2.57$
- $Distance_{(1,1)\leftrightarrow(4.0,2.67)} = \sqrt{(1-4.0)^2+(1-2.67)^2} \approx 3.43$

Nearest centroid: $(1.33,1.33)$ (Partition 1)

**Data Point at (1,2):**
- $Distance_{(1,2)\leftrightarrow(1.33,1.33)} = \sqrt{(1-1.33)^2+(2-1.33)^2} \approx 0.75$
- $Distance_{(1,2)\leftrightarrow(2.25,3.25)} = \sqrt{(1-2.25)^2+(2-3.25)^2} \approx 1.77$
- $Distance_{(1,2)\leftrightarrow(4.0,2.67)} = \sqrt{(1-4.0)^2+(2-2.67)^2} \approx 3.07$

Nearest centroid: $(1.33,1.33)$ (Partition 1)

**Data Point at (1,4):**
- $Distance_{(1,4)\leftrightarrow(1.33,1.33)} = \sqrt{(1-1.33)^2+(4-1.33)^2} \approx 2.69$
- $Distance_{(1,4)\leftrightarrow(2.25,3.25)} = \sqrt{(1-2.25)^2+(4-3.25)^2} \approx 1.46$
- $Distance_{(1,4)\leftrightarrow(4.0,2.67)} = \sqrt{(1-4.0)^2+(4-2.67)^2} \approx 3.28$

Nearest centroid: $(2.25,3.25)$ (Partition 2)

**Data Point at (2,1):**
- $Distance_{(2,1)\leftrightarrow(1.33,1.33)} = \sqrt{(2-1.33)^2+(1-1.33)^2} \approx 0.75$
- $Distance_{(2,1)\leftrightarrow(2.25,3.25)} = \sqrt{(2-2.25)^2+(1-3.25)^2} \approx 2.26$
- $Distance_{(2,1)\leftrightarrow(4.0,2.67)} = \sqrt{(2-4.0)^2+(1-2.67)^2} \approx 2.6$

Nearest centroid: $(1.33,1.33)$ (Partition 1)

**Data Point at (2,3):**
- $Distance_{(2,3)\leftrightarrow(1.33,1.33)} = \sqrt{(2-1.33)^2+(3-1.33)^2} \approx 1.8$
- $Distance_{(2,3)\leftrightarrow(2.25,3.25)} = \sqrt{(2-2.25)^2+(3-3.25)^2} \approx 0.35$
- $Distance_{(2,3)\leftrightarrow(4.0,2.67)} = \sqrt{(2-4.0)^2+(3-2.67)^2} \approx 2.03$

Nearest centroid: $(2.25,3.25)$ (Partition 2)

**Data Point at (3,2):**
- $Distance_{(3,2)\leftrightarrow(1.33,1.33)} = \sqrt{(3-1.33)^2+(2-1.33)^2} \approx 1.8$
- $Distance_{(3,2)\leftrightarrow(2.25,3.25)} = \sqrt{(3-2.25)^2+(2-3.25)^2} \approx 1.46$
- $Distance_{(3,2)\leftrightarrow(4.0,2.67)} = \sqrt{(3-4.0)^2+(2-2.67)^2} \approx 1.2$

Nearest centroid: $(4.0,2.67)$ (Partition 3)

**Data Point at (3,4):**
- $Distance_{(3,4)\leftrightarrow(1.33,1.33)} = \sqrt{(3-1.33)^2+(4-1.33)^2} \approx 3.14$
- $Distance_{(3,4)\leftrightarrow(2.25,3.25)} = \sqrt{(3-2.25)^2+(4-3.25)^2} \approx 1.06$
- $Distance_{(3,4)\leftrightarrow(4.0,2.67)} = \sqrt{(3-4.0)^2+(4-2.67)^2} \approx 1.67$

Nearest centroid: $(2.25,3.25)$ (Partition 2)

**Data Point at (4,1):**
- $Distance_{(4,1)\leftrightarrow(1.33,1.33)} = \sqrt{(4-1.33)^2+(1-1.33)^2} \approx 2.69$
- $Distance_{(4,1)\leftrightarrow(2.25,3.25)} = \sqrt{(4-2.25)^2+(1-3.25)^2} \approx 2.85$
- $Distance_{(4,1)\leftrightarrow(4.0,2.67)} = \sqrt{(4-4.0)^2+(1-2.67)^2} \approx 1.67$

Nearest centroid: $(4.0,2.67)$ (Partition 3)

**Data Point at (4,3):**
- $Distance_{(4,3)\leftrightarrow(1.33,1.33)} = \sqrt{(4-1.33)^2+(3-1.33)^2} \approx 3.14$
- $Distance_{(4,3)\leftrightarrow(2.25,3.25)} = \sqrt{(4-2.25)^2+(3-3.25)^2} \approx 1.77$
- $Distance_{(4,3)\leftrightarrow(4.0,2.67)} = \sqrt{(4-4.0)^2+(3-2.67)^2} \approx 0.33$

Nearest centroid: $(4.0,2.67)$ (Partition 3)

**Data Point at (4,4):**
- $Distance_{(4,4)\leftrightarrow(1.33,1.33)} = \sqrt{(4-1.33)^2+(4-1.33)^2} \approx 3.77$
- $Distance_{(4,4)\leftrightarrow(2.25,3.25)} = \sqrt{(4-2.25)^2+(4-3.25)^2} \approx 1.9$
- $Distance_{(4,4)\leftrightarrow(4.0,2.67)} = \sqrt{(4-4.0)^2+(4-2.67)^2} \approx 1.33$

Nearest centroid: $(4.0,2.67)$ (Partition 3)


**Step 6: Reassign the data points**

| x | y | Old Partition | New Partition |
|:-:|:-:|:-------------:|:-------------:|
| 1 | 1 |       1       |       1       |
| 1 | 2 |       1       |       1       |
| 1 | 4 |       2       |       2       |
| 2 | 1 |       1       |       1       |
| 2 | 3 |       2       |       2       |
| 3 | 2 |       2       |       3       |
| 3 | 4 |       2       |       2       |
| 4 | 1 |       3       |       3       |
| 4 | 3 |       3       |       3       |
| 4 | 4 |       3       |       3       |


**Step 7: Compute the new centroids**

**Centroid of Partition 1:**

- $x = \frac{1 + 1 + 2}{3} = \frac{4}{3} \approx 1.333$
- $y = \frac{1 + 2 + 1}{3} = \frac{4}{3} \approx 1.333$

**Centroid of Partition 2:**

- $x = \frac{1 + 2 + 3}{3} = 2$
- $y = \frac{4 + 3 + 4}{3} = \frac{11}{3} \approx 3.667$

**Centroid of Partition 3:**

- $x = \frac{3 + 4 + 4 + 4}{4} = 3.75$
- $y = \frac{2 + 1 + 3 + 4}{4} = 2.5$

**Step 8: Check for further reassignments**

Recomputing the distances to these centroids leaves every point in its current partition. Since no point is reassigned, K-means terminates with the partitions from step 6.
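
The complete procedure (compute the centroids, reassign the points, repeat until nothing changes) can also be sketched in plain Python. The function name `k_means_by_hand` is made up for this example, and the initial partition is the one used in Option 2; it is a sketch for checking the hand computation, not the implementation asked for in 5.1.1.2.

```python
import math

points = [(1, 1), (1, 2), (1, 4), (2, 1), (2, 3), (3, 2), (3, 4), (4, 1), (4, 3), (4, 4)]
initial_labels = [1, 1, 1, 1, 2, 2, 2, 3, 3, 3]


def k_means_by_hand(points, labels):
    """Repeat the centroid and reassignment steps until no label changes."""
    labels = list(labels)
    runs = 0
    while True:
        runs += 1
        # Centroid step: coordinate-wise mean of the points of each partition
        centroids = {}
        for label in sorted(set(labels)):
            members = [p for p, l in zip(points, labels) if l == label]
            centroids[label] = (
                sum(x for x, _ in members) / len(members),
                sum(y for _, y in members) / len(members),
            )
        # Reassignment step: every point moves to its nearest centroid
        new_labels = [
            min(centroids, key=lambda label: math.dist(p, centroids[label]))
            for p in points
        ]
        if new_labels == labels:
            return labels, runs
        labels = new_labels


final_labels, runs = k_means_by_hand(points, initial_labels)
print(final_labels, runs)
```

The returned `runs` counts every pass through the loop, including the last one, which only confirms that no point is reassigned any more.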



In [None]:
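# Helper cell: prints the Markdown for the distance computations of the second run
# NOTE: euclidean_distance is only introduced in 5.1.1.2., so it is repeated here
# to keep this cell runnable on its own
def euclidean_distance(a, b):
    return ((a - b) ** 2).sum() ** 0.5


# Centroids after the first reassignment: partition 2 then contains
# (1,4), (2,3), (3,2) and (3,4), so its centroid is (9/4, 13/4)
centroids = pd.DataFrame(
    [[4 / 3, 4 / 3], [9 / 4, 13 / 4], [4, 8 / 3]],
    columns=["x", "y"],
)

for index, row in dataset.iterrows():
    # Track the nearest centroid found so far for this data point
    best_dist = math.inf
    best_cent = 0
    print("**Data Point at (" + str(row["x"]) + "," + str(row["y"]) + "):**")
    for index_cent, row_cent in centroids.iterrows():
        distance = euclidean_distance(row, row_cent)
        if distance < best_dist:
            best_cent = index_cent
            best_dist = distance
        # Backslashes are doubled so that Python does not read them as escape sequences
        print(
            "- $Distance_{("
            + str(row["x"])
            + ","
            + str(row["y"])
            + ")\\leftrightarrow("
            + str(round(row_cent["x"], 2))
            + ","
            + str(round(row_cent["y"], 2))
            + ")} = \\sqrt{("
            + str(row["x"])
            + "-"
            + str(round(row_cent["x"], 2))
            + ")^2+("
            + str(row["y"])
            + "-"
            + str(round(row_cent["y"], 2))
            + ")^2} \\approx "
            + str(round(distance, 2))
            + "$"
        )
    print(" ")
    print(
        "Nearest centroid: $("
        + str(round(centroids.iloc[best_cent]["x"], 2))
        + ","
        + str(round(centroids.iloc[best_cent]["y"], 2))
        + ")$ (Partition "
        + str(best_cent + 1)
        + ")"
    )
    print(" ")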

#### 5.1.1.2. Implementation

As announced, there are two options for you regarding the implementation of K-means. You may implement the method without extra help or you can choose option two: A guided step-by-step implementation of the K-means algorithm. 

##### Option 1: Implement K-means on Your Own

Some of you may prefer to implement K-means on your own. In this case refer to the lecture for a comprehensive explanation of the method. 

<div class="alert alert-block alert-info">

**Task:** Use your knowledge of K-means to implement a function `k_means` that clusters the `small_dataset` into `k` clusters, using the Euclidean distance to measure the distance between two points.
If you are in need of more code cells than provided, feel free to add more.

</div>

In [None]:
# Implement a k_means function (Code placeholder 01/10)

In [None]:
# Implement a k_means function (Code placeholder 02/10)

In [None]:
# Implement a k_means function (Code placeholder 03/10)

In [None]:
# Implement a k_means function (Code placeholder 04/10)

In [None]:
# Implement a k_means function (Code placeholder 05/10)

In [None]:
# Implement a k_means function (Code placeholder 06/10)

In [None]:
# Implement a k_means function (Code placeholder 07/10)

In [None]:
# Implement a k_means function (Code placeholder 08/10)

In [None]:
# Implement a k_means function (Code placeholder 09/10)

In [None]:
# Implement a k_means function (Code placeholder 10/10)

In [None]:
# Sample k_means skeleton
# NOTE: You are allowed to use this skeleton but don't have to
def k_means(dataset, k):
    # Copy the original dataset
    dataset_copy = dataset.copy()

    # Create a new empty column to save the cluster/partition affiliation (-1 is representing no cluster/partition)
    dataset_copy["cluster"] = -1

    # ...
    return dataset_copy

In [None]:
# Cluster the small_dataset (We use k=3)
clustered_small_dataset = k_means(small_dataset, 3)

# Print a scatterplot
plt.figure(figsize=(4, 4))
sns.scatterplot(
    x=clustered_small_dataset["x"],
    y=clustered_small_dataset["y"],
    hue=clustered_small_dataset["cluster"],
    palette="deep",
)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.0)

In [None]:
# Sample solution => See Option 2

##### Option 2: Implement K-means by Solving Small Tasks

When implementing K-means step by step, the first step is always to split the existing data into `k` initial partitions. This split can be random or follow an arbitrary scheme. However, it is important that the result consists of exactly `k` partitions, that none of these partitions is empty, and that each sample belongs to exactly one partition.
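
These rules can be turned into a small check. The following is a sketch under the assumption that the partitioning is stored as one label per sample with labels `0` to `k - 1`; the function name `is_valid_partition` is made up for this example. Because every sample carries exactly one label, it automatically belongs to exactly one partition, so only the label values need to be checked.

```python
def is_valid_partition(labels, k):
    """True if exactly the labels 0..k-1 occur, i.e. there are k non-empty partitions."""
    return set(labels) == set(range(k))


print(is_valid_partition([0, 0, 0, 0, 1, 1, 1, 2, 2, 2], 3))
print(is_valid_partition([0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 3))
```

The second call fails the check because partitions 1 and 2 would be empty.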

<div class="alert alert-block alert-info">

**Task:** Write a function `partition_dataset` that splits a `dataset` into `k` initial partitions. It doesn't matter which kind of partitioning you choose, as long as it complies with the rules mentioned above.

</div>

In [None]:
# Implement a function to arbitrarily partition the dataset into k parts
def partition_dataset(dataset, k):
    # Copy the original dataset
    dataset_copy = dataset.copy()

    # Create a new empty column to save the cluster/partition affiliation (-1 is representing no cluster/partition)
    dataset_copy["cluster"] = -1

    # ...

    # Return the dataset
    return dataset_copy


# Partition the small_dataset
partitioned_small_dataset = partition_dataset(small_dataset, 3)

# Print a scatterplot
plt.figure(figsize=(4, 4))
sns.scatterplot(
    x=partitioned_small_dataset["x"],
    y=partitioned_small_dataset["y"],
    hue=partitioned_small_dataset["cluster"],
    palette="deep",
)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.0)

In [None]:
# Implement a function to arbitrarily partition the dataset into k parts
def partition_dataset(dataset, k):
    # Copy the original dataset
    dataset_copy = dataset.copy()

    # Create a new empty column to save the cluster/partition affiliation (-1 is representing no cluster/partition)
    dataset_copy["cluster"] = -1

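    # Compute the quotient and the remainder when splitting the dataset into k parts
    quotient, remainder = divmod(dataset_copy.shape[0], k)

    # Assign the samples to the clusters/partitions
    # (the first `remainder` partitions receive one additional sample)
    cluster_column = dataset_copy.columns.get_loc("cluster")
    for i in range(0, k):
        # Positional slicing excludes the end, so the partitions do not overlap
        start = i * quotient + min(i, remainder)
        end = (i + 1) * quotient + min(i + 1, remainder)
        dataset_copy.iloc[start:end, cluster_column] = i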

    # Return the dataset
    return dataset_copy


# Partition the small_dataset
partitioned_small_dataset = partition_dataset(small_dataset, 3)

# Print a scatterplot
plt.figure(figsize=(4, 4))
sns.scatterplot(
    x=partitioned_small_dataset["x"],
    y=partitioned_small_dataset["y"],
    hue=partitioned_small_dataset["cluster"],
    palette="deep",
)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.0)

The first repeated step in K-means is to calculate the so-called centroids (mean points) of each partition/cluster.
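
As a reminder of what is being computed: the centroid of a partition is the coordinate-wise mean of its points. The symbols $C_j$ (the set of points in partition $j$) and $\mu_j$ (its centroid) are introduced here for illustration only and may differ from the notation used in the lecture:

$$\mu_j = \frac{1}{|C_j|} \sum_{(x, y) \in C_j} (x, y)$$

For the initial Partition 1 of the hand example, this gives $\mu_1 = \left(\frac{1 + 1 + 1 + 2}{4}, \frac{1 + 2 + 4 + 1}{4}\right) = (1.25, 2)$.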

<div class="alert alert-block alert-info">

**Task:** Implement the function `compute_centroids` that computes the centroid for each of the `k` partitions. The return value should be a pandas DataFrame with the cluster identifier as an index and two columns `x` and `y` indicating the coordinates of the corresponding centroid.

</div>

In [None]:
# Implement a function to compute the centroids for a partitioned dataset
def compute_centroids(partitioned_dataset, k):
    # Init a DataFrame to hold the centroids
    centroids = pd.DataFrame(
        [[np.nan, np.nan] for i in range(0, k)], columns=["x", "y"]
    )

    # ...

    # Return the centroids
    return centroids


# Compute the centroids of the initial partitioning
centroids = compute_centroids(partitioned_small_dataset, 3)

# Print the centroids into the scatterplot (black)
plt.figure(figsize=(4, 4))
sns.scatterplot(
    x=partitioned_small_dataset["x"],
    y=partitioned_small_dataset["y"],
    hue=partitioned_small_dataset["cluster"],
    palette="deep",
)
sns.scatterplot(x=centroids["x"], y=centroids["y"], color="black")
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.0)

In [None]:
# Implement a function to compute the centroids for a partitioned dataset
def compute_centroids(partitioned_dataset, k):
    # Init a DataFrame to hold the centroids
    centroids = pd.DataFrame(
        [[np.nan, np.nan] for i in range(0, k)], columns=["x", "y"]
    )

    # Compute the centroid of each partition
    for i in range(0, k):
        # Compute the mean of the x values within that single partition
        x_mean = partitioned_dataset[partitioned_dataset["cluster"] == i]["x"].mean()

        # Compute the mean of the y values within that single partition
        y_mean = partitioned_dataset[partitioned_dataset["cluster"] == i]["y"].mean()

        # Add the centroid of this single partition
        centroids.loc[i, ["x", "y"]] = [x_mean, y_mean]

    # Return the centroids
    return centroids


# Compute the centroids of the initial partitioning
centroids = compute_centroids(partitioned_small_dataset, 3)

# Print the centroids into the scatterplot (black)
plt.figure(figsize=(4, 4))
sns.scatterplot(
    x=partitioned_small_dataset["x"],
    y=partitioned_small_dataset["y"],
    hue=partitioned_small_dataset["cluster"],
    palette="deep",
)
sns.scatterplot(x=centroids["x"], y=centroids["y"], color="black")
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.0)

To reassign points to their nearest centroid, a distance measure must be defined. Here the Euclidean distance comes in handy, which we have already implemented ourselves in an earlier exercise.

In [None]:
# "Pythonic" implementation of the euclidean distance
def euclidean_distance(a, b):
    return (abs(a - b) ** 2).sum() ** 0.5


# Compute the euclidean distance for two random points a and b
a = pd.Series([1, 1])
b = pd.Series([2, 2])
euclidean_distance(a, b)
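
For comparison, the same distance can be written without pandas. The sketch below spells out the formula $\sqrt{\sum_i (a_i - b_i)^2}$ for plain tuples and checks it against `math.dist` from the standard library (available since Python 3.8). The name `euclidean_distance_plain` is made up for this example so that it does not overwrite the function defined above.

```python
import math


def euclidean_distance_plain(a, b):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a_i - b_i) ** 2 for a_i, b_i in zip(a, b)))


d = euclidean_distance_plain((1, 1), (2, 2))
print(d, math.dist((1, 1), (2, 2)))
```

Both calls compute the length of the diagonal of a unit square, i.e. $\sqrt{2}$.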

Reassignment is also the next step within K-means. Samples are always reassigned to the cluster/partition whose centroid is closest to themselves.

<div class="alert alert-block alert-info">

**Task:** Complete the function `reassign_samples` that reassigns each sample to the cluster/partition whose centroid is closest to it. Return an indicator that communicates whether at least one sample was reassigned within the function, together with the dataset.

</div>

In [None]:
# Implement a function to reassign each sample to its nearest centroid
def reassign_samples(partitioned_dataset, centroids, k):
    # Indicator to show whether there was at least one tuple reassigned
    reassign_indicator = False

    # Copy the original partitioned_dataset
    dataset_copy = partitioned_dataset.copy()

    # ...

    return reassign_indicator, dataset_copy


# Reassign the samples of our partitioned_small_dataset to their nearest centroid
reassign_indicator, reassigned_small_dataset = reassign_samples(
    partitioned_small_dataset, centroids, 3
)

# Output the indicator
print("Was there at least one sample reassigned? - " + str(reassign_indicator))

# Print a scatterplot showing the new class assignments
plt.figure(figsize=(4, 4))
sns.scatterplot(
    x=reassigned_small_dataset["x"],
    y=reassigned_small_dataset["y"],
    hue=reassigned_small_dataset["cluster"],
    palette="deep",
)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.0)

In [None]:
# Implement a function to reassign each sample to its nearest centroid
def reassign_samples(partitioned_dataset, centroids, k):
    # Indicator to show whether there was at least one tuple reassigned
    reassign_indicator = False

    # Copy the original partitioned_dataset
    dataset_copy = partitioned_dataset.copy()

    # Check for each sample whether it has to be reassigned
    for i in range(0, dataset_copy.shape[0]):
        # Get the coordinates of the sample for easier access
        sample = dataset_copy.loc[i, ["x", "y"]]

        # Set the current cluster id and centroid values
        current_cluster = dataset_copy.loc[i, "cluster"]
        current_centroid = centroids.loc[current_cluster]
        current_distance = euclidean_distance(sample, current_centroid)

        # Iterate through the centroids and check whether the distance is lower than the current distance
        # NOTE: We do not skip the current centroid, as this would complicate the code and isn't a big performance problem
        for j in range(0, k):
            # Compute the distance
            distance = euclidean_distance(sample, centroids.loc[j])

            # If the distance is lower than the current_distance we have to reassign
            if distance < current_distance:
                # Set the cluster
                dataset_copy.loc[i, "cluster"] = j
                current_cluster = j

                # Set the current_centroid
                current_centroid = centroids.loc[j]

                # Set the current_distance
                current_distance = distance

                # Set the reassign_indicator
                reassign_indicator = True

    return reassign_indicator, dataset_copy


# Reassign the samples of our partitioned_small_dataset to their nearest centroid
reassign_indicator, reassigned_small_dataset = reassign_samples(
    partitioned_small_dataset, centroids, 3
)

# Output the indicator
print("Was there at least one sample reassigned? - " + str(reassign_indicator))

# Print a scatterplot showing the new class assignments
plt.figure(figsize=(4, 4))
sns.scatterplot(
    x=reassigned_small_dataset["x"],
    y=reassigned_small_dataset["y"],
    hue=reassigned_small_dataset["cluster"],
    palette="deep",
)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.0)

In the iterative K-means algorithm it would now be checked whether samples were reassigned or not. If yes, we have to go back to calculating the centroids for this new assignment. If not, then the corresponding clusters have been found. 

This decision can of course be passed to a wrapper function `k_means` which summarizes the whole algorithm.

<div class="alert alert-block alert-info">

**Task:** Merge the previously implemented functions into a wrapper function `k_means` to obtain a complete implementation of the algorithm.
    
</div>

In [None]:
# Implement the wrapper function k_means
def k_means(dataset, k):
    # Copy the original dataset
    dataset_copy = dataset.copy()

    # Create a new empty column to save the cluster/partition affiliation (-1 is representing no cluster/partition)
    dataset_copy["cluster"] = -1

    # ...

    # Return the clustered dataset
    return dataset_copy


# Cluster the small_dataset
clustered_small_dataset = k_means(small_dataset, 3)

# Output the corresponding scatterplot
plt.figure(figsize=(4, 4))
sns.scatterplot(
    x=clustered_small_dataset["x"],
    y=clustered_small_dataset["y"],
    hue=clustered_small_dataset["cluster"],
    palette="deep",
)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.0)

In [None]:
# Implement the wrapper function k_means
def k_means(dataset, k):
    # Partition the dataset
    dataset = partition_dataset(dataset, k)

    # Set the reassign_indicator to True (as the initial partitioning was a reassignment in itself)
    reassign_indicator = True

    # As long as there are reassignments, the following two steps are repeated
    while reassign_indicator:
        # Compute the centroids
        centroids = compute_centroids(dataset, k)

        # Reassign each sample to the cluster of the nearest centroid
        reassign_indicator, dataset = reassign_samples(dataset, centroids, k)

    # Return the clustered dataset
    return dataset


# Cluster the small_dataset
clustered_small_dataset = k_means(small_dataset, 3)

# Output the corresponding scatterplot
plt.figure(figsize=(4, 4))
sns.scatterplot(
    x=clustered_small_dataset["x"],
    y=clustered_small_dataset["y"],
    hue=clustered_small_dataset["cluster"],
    palette="deep",
)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.0)

#### Library: scikit-learn

For the clustering algorithms on this exercise sheet, too, it is normally not necessary to write your own implementation. For K-means, for example, scikit-learn provides a good implementation.

In [None]:
from sklearn.cluster import KMeans

<div class="alert alert-block alert-info">

**Task:** Use scikit-learn's implementation of K-means to find three clusters in the `small_dataset`. Print the result in a diagram.
    
</div>

In [None]:
# Perform scikit-learn's K-means clustering on the small_dataset
# ...

In [None]:
# Perform scikit-learn's K-means clustering on the small_dataset
kmeans = KMeans(n_clusters=3, n_init="auto").fit(small_dataset[["x", "y"]])

# Save the labels to a copy of the small_dataset to generate the equivalent of our clustered_small_dataset
clustered_small_dataset_2 = small_dataset.copy()
clustered_small_dataset_2["cluster"] = kmeans.labels_

# Print the result
plt.figure(figsize=(4, 4))
sns.scatterplot(
    x=clustered_small_dataset_2["x"],
    y=clustered_small_dataset_2["y"],
    hue=clustered_small_dataset_2["cluster"],
    palette="deep",
)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.0)

### DBSCAN

In addition to the partitioning methods, density-based methods were also presented in the lecture. As an example of these methods, you will be asked to implement DBSCAN in the following.
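
DBSCAN is built on the eps-neighbourhood of a point and on the distinction between core points and all other points. The following plain-Python sketch illustrates only these two notions for the small data set. It assumes that a neighbourhood contains the point itself and every point with a Euclidean distance of at most `eps`; the lecture's definition may differ in these details, and the helper name `region_query` is made up for this example.

```python
import math

points = [(1, 1), (1, 2), (1, 4), (2, 1), (2, 3), (3, 2), (3, 4), (4, 1), (4, 3), (4, 4)]


def region_query(points, index, eps):
    """Indices of all points within distance eps of points[index] (itself included)."""
    return [j for j, q in enumerate(points) if math.dist(points[index], q) <= eps]


eps, min_pts = 1, 2

# A point is a core point if its eps-neighbourhood contains at least min_pts points
core_points = [
    points[i] for i in range(len(points)) if len(region_query(points, i, eps)) >= min_pts
]
print(core_points)
```

Under these assumptions, points without any neighbour other than themselves are not core points and cannot start a cluster.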

#### Implementation

For this implementation, too, you have two options: you may either implement DBSCAN completely on your own or work through a series of smaller tasks.

##### Option 1: Implement DBSCAN on Your Own

If you decide to implement DBSCAN on your own, refer to the lecture for a comprehensive explanation of the method.

<div class="alert alert-block alert-info">

**Task:** Implement a method `dbscan` that can be used to cluster the two datasets `small_dataset` and `big_dataset` into multiple clusters. You shall use the euclidean distance to measure the distance between two points during the clustering.
If you are in need of more code cells than provided, feel free to add more.

</div>

In [None]:
# Implement a dbscan function (Code placeholder 01/10)

In [None]:
# Implement a dbscan function (Code placeholder 02/10)

In [None]:
# Implement a dbscan function (Code placeholder 03/10)

In [None]:
# Implement a dbscan function (Code placeholder 04/10)

In [None]:
# Implement a dbscan function (Code placeholder 05/10)

In [None]:
# Implement a dbscan function (Code placeholder 06/10)

In [None]:
# Implement a dbscan function (Code placeholder 07/10)

In [None]:
# Implement a dbscan function (Code placeholder 08/10)

In [None]:
# Implement a dbscan function (Code placeholder 09/10)

In [None]:
# Implement a dbscan function (Code placeholder 10/10)

In [None]:
# Sample dbscan skeleton
# NOTE: You are allowed to use this skeleton but don't have to
def dbscan(dataset, eps, min_pts):
    # Copy the original dataset
    dataset_copy = dataset.copy()

    # Create a new empty column to save the cluster/partition affiliation
    # Special codings for ...
    # ... points that are not set yet: -1
    # ... points that are noise: -2
    dataset_copy["cluster"] = -1

    # Create a new empty column to save the visited status
    dataset_copy["visited"] = False

    # ...

    # Return the clustered dataset
    return dataset_copy

In [None]:
# Cluster the small_dataset
# (the parameters eps=1 and min_pts=2 should result in five different clusters and
# one "noisy" point for this dataset)
clustered_small_dataset = dbscan(small_dataset, 1, 2)

# Output the corresponding scatterplot
plt.figure(figsize=(4, 4))
sns.scatterplot(
    x=clustered_small_dataset["x"],
    y=clustered_small_dataset["y"],
    hue=clustered_small_dataset["cluster"],
    palette="deep",
)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.0)

In [None]:
# Cluster the big_dataset
# (the parameters eps=1 and min_pts=5 should result in five different clusters and
# multiple "noisy" points for this dataset)
clustered_big_dataset = dbscan(big_dataset, 1, 5)

# Output the clustered dataset including information on the true classes
plt.figure(figsize=(4, 4))
sns.scatterplot(
    x=clustered_big_dataset["x"],
    y=clustered_big_dataset["y"],
    hue=clustered_big_dataset["cluster"],
    style=clustered_big_dataset["true_labels"],
    palette="deep",
)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.0)

In [None]:
# Sample solution => See Option 2

##### Option 2: Implement DBSCAN by Solving Small Tasks

For DBSCAN, you need not only the cluster membership as meta information, but also the status "visited". Before we start with the step-by-step implementation of DBSCAN, it is useful to write a small function for preparing the data set:

In [None]:
# Add columns to the dataset to save the status of the dataset
def prepare_dataset(dataset):
    # Copy the original dataset
    dataset_copy = dataset.copy()

    # Create a new empty column to save the cluster/partition affiliation
    # Special codings for ...
    # ... points that are not set yet: -1
    # ... points that are noise: -2
    dataset_copy["cluster"] = -1

    # Create a new empty column to save the visited status
    dataset_copy["visited"] = False

    # Return the dataset_copy
    return dataset_copy


# Prepare the small dataset
prepared_small_dataset = prepare_dataset(small_dataset)
prepared_small_dataset

Besides this preparatory helper function, two further steps are worth moving into separate functions before the actual DBSCAN implementation.

First, a function is needed in DBSCAN to randomly select a single unvisited point from a prepared data set. 

<div class="alert alert-block alert-info">

**Task:** Write a function `pick_random_unvisited_point` that randomly selects an unvisited point out of the `dataset` and returns it.

</div>

In [None]:
# Pick a random point that is unvisited
def pick_random_unvisited_point(dataset):
    # ...
    return None


# Pick a random point
random_point = pick_random_unvisited_point(prepared_small_dataset)
random_point

In [None]:
# Pick a random point that is unvisited
def pick_random_unvisited_point(dataset):
    # Select all points that are unvisited
    unvisited_points = dataset[dataset["visited"] == False]

    # If there are no unvisited points return None
    if len(unvisited_points) < 1:
        return None
    else:
        # Select one random point and return it
        return unvisited_points.sample().iloc[0]


# Pick a random point
random_point = pick_random_unvisited_point(prepared_small_dataset)
random_point

A second helper function that helps to implement DBSCAN is a function that returns all points within eps distance of a selected point.

<div class="alert alert-block alert-info">

**Task:** Write a function `get_all_points_within_eps_distance` that returns all points within a distance of `eps` to the passed `point`. Use the Euclidean distance function introduced in the K-means part of this exercise to determine the distance between two points.

</div>
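
For reference, the quantities involved in this task can be written down as follows. This is a sketch in generic notation that may differ from the symbols used in the lecture: $D$ denotes the data set, $p$ and $q$ are points with coordinates $(p_x, p_y)$ and $(q_x, q_y)$, and $\epsilon$ corresponds to the parameter `eps`.

$$
d(p, q) = \sqrt{(p_x - q_x)^2 + (p_y - q_y)^2}, \qquad
N_\epsilon(p) = \{\, q \in D \mid d(p, q) \le \epsilon \,\}
$$

Since $d(p, p) = 0$, a point $p$ from the data set is always contained in its own neighborhood $N_\epsilon(p)$. In this notebook, a point is treated as a core point if $|N_\epsilon(p)| \ge$ `min_pts`, so the point itself counts towards `min_pts`.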

In [None]:
# Get all points within a distance of eps next to a specific point
def get_all_points_within_eps_distance(point, dataset, eps):
    # ...
    return None


# Get all points within a distance of 1 of the point (6, 5)
points_within_eps_distance = get_all_points_within_eps_distance(
    pd.Series(data=[6, 5, -1, False], index=["x", "y", "cluster", "visited"]),
    prepared_small_dataset,
    1,
)
points_within_eps_distance

In [None]:
# Get all points within a distance of eps next to a specific point
def get_all_points_within_eps_distance(point, dataset, eps):
    # Select all points (visited or not) within eps distance
    return dataset[
        dataset.apply(
            lambda a: euclidean_distance([a["x"], a["y"]], point[["x", "y"]].values)
            <= eps,
            axis=1,
        )
    ]


# Get all points within a distance of 1 of the point (6, 5)
points_within_eps_distance = get_all_points_within_eps_distance(
    pd.Series(data=[6, 5, -1, False], index=["x", "y", "cluster", "visited"]),
    prepared_small_dataset,
    1,
)
points_within_eps_distance
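
Calling a Python function once per row via `apply` becomes slow for larger data sets. As a possible alternative, the distances to all points can be computed at once with NumPy. The following is a minimal sketch; the function name `get_neighbors_vectorized` and the data set `toy_dataset` are assumptions made for illustration and are not part of this exercise series.

```python
import numpy as np
import pandas as pd


def get_neighbors_vectorized(point, dataset, eps):
    # Hypothetical helper name; not part of the notebook's task series
    # Coordinate differences between every row and the query point
    differences = dataset[["x", "y"]].to_numpy(dtype=float) - np.array(
        [point["x"], point["y"]], dtype=float
    )

    # Euclidean distance of every row to the query point
    distances = np.sqrt((differences**2).sum(axis=1))

    # Keep the rows whose distance is at most eps
    return dataset[distances <= eps]


# Hypothetical toy data (not the notebook's small_dataset)
toy_dataset = pd.DataFrame([[1, 1], [1, 2], [3, 3], [5, 5]], columns=["x", "y"])
toy_neighbors = get_neighbors_vectorized(pd.Series({"x": 1, "y": 1}), toy_dataset, 1)
toy_neighbors
```

Like the version based on `apply`, this sketch returns the matching rows of the data set, including the query point itself if it is part of the data set.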

The pseudocode from the lecture on DBSCAN is quite general. Thus, the substep `If p′ is core point, add all objects in its ϵ-neighborhood to N` is ultimately something that can be implemented both by merging multiple sets of points, and by recursion. 
Since the recursive variant of DBSCAN is easier to implement, we focus on this variant in this step-by-step implementation. 

Finally, it makes sense to move the entire step `For each p′ in N that does not yet belong to a cluster` into a separate recursive function.

<div class="alert alert-block alert-info">

**Task:** Complete the skeleton of the function `expand_cluster` below. Remember that you can use the previously defined helper functions.

</div>

In [None]:
# This function is used to expand a specific cluster by one point
# If the point is itself a core point (at least min_pts points within eps distance),
# expand_cluster is called for each neighbor.
def expand_cluster(dataset, eps, min_pts, point, cluster_id):
    # ...
    return None

In [None]:
# This function is used to expand a specific cluster by one point
# If the point is itself a core point (at least min_pts points within eps distance),
# expand_cluster is called for each neighbor.
def expand_cluster(dataset, eps, min_pts, point, cluster_id):
    # Add the point to the cluster
    dataset.loc[point.name, "cluster"] = cluster_id

    # If point was not visited, we have to visit it now
    if dataset.loc[point.name, "visited"] == False:
        # Mark the point as visited
        dataset.loc[point.name, "visited"] = True

        # Get all points within eps distance
        points_within_eps_distance = get_all_points_within_eps_distance(
            point, dataset, eps
        )

        # Check if the count of points is at least min_pts => is a core point
        # => We have to go deeper into the recursion
        if len(points_within_eps_distance.index) >= min_pts:
            # Iterate through the points in eps distance
            for index, row in points_within_eps_distance.iterrows():
                # Check whether the neighbor is already member of a cluster
                # (Note that a point marked as noise is not part of a cluster either)
                if dataset.loc[index, "cluster"] >= 0:
                    # Skip that point
                    continue
                else:
                    # Expand the cluster with that point
                    expand_cluster(dataset, eps, min_pts, row, cluster_id)

Of course, it is useful to have a test case to check your implementation against. However, this test case is somewhat harder to understand, because the function depends on input from a function that has not been defined yet. Therefore, let's describe the test scenario first:

*Let's say that the main `dbscan` function selects the point with id `7` (coordinates `(3, 3)`) as the random unvisited point from `prepared_small_dataset`. With eps set to `1` and min_pts set to `2`, this point is a core point, because there is one other point (id `2` with coordinates `(4, 3)`) within eps distance. Therefore, a new cluster with id `0` is created, the point with id `7` is added to it, and `expand_cluster` is called for all neighboring points that are not yet part of a cluster. In our example, we look at the call of `expand_cluster` for the point with id `2`.*

If your function works correctly, it should add the point with id `2` to the cluster and check whether this point is a core point itself. Since there is one unvisited point left to descend to (id `8` with coordinates `(4, 4)`), the recursion starts. In the end, there should be three visited points within cluster `0` (ids `2`, `7` and `8`).

In [None]:
# Prepare the dataset
prepared_small_dataset = prepare_dataset(small_dataset)

# Mark the point with id 7 as visited and add it to the cluster with id 0
prepared_small_dataset.loc[7, "visited"] = True
prepared_small_dataset.loc[7, "cluster"] = 0

# Select the point with id 2
point_with_id_2 = prepared_small_dataset.loc[2]

# Call expand_cluster
expand_cluster(prepared_small_dataset, 1, 2, point_with_id_2, 0)

# Take a look at the dataset (should now contain three points within cluster 0)
prepared_small_dataset

With the help of the recursive function `expand_cluster` it is now not difficult to implement the function `dbscan`, which in principle takes over the remaining steps of the pseudocode and uses `expand_cluster` whenever neighboring items have to be added to the cluster.

<div class="alert alert-block alert-info">

**Task:** Complete the `dbscan` function. Again, it is recommended to use the previously defined functions.

</div>

In [None]:
# Implement dbscan
def dbscan(dataset, eps, min_pts):
    # ...

    # Return the clustered dataset
    return dataset


# Cluster the small_dataset
clustered_small_dataset = dbscan(small_dataset, 1.5, 2)

# Output the corresponding scatterplot
plt.figure(figsize=(4, 4))
sns.scatterplot(
    x=clustered_small_dataset["x"],
    y=clustered_small_dataset["y"],
    hue=clustered_small_dataset["cluster"],
    palette="deep",
)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.0)

In [None]:
# Implement dbscan
def dbscan(dataset, eps, min_pts):
    # Prepare the dataset
    dataset = prepare_dataset(dataset)

    # While there are unvisited points pick a random one
    while len(dataset[dataset["visited"] == False]) > 0:
        # Select a random unvisited point
        random_point = pick_random_unvisited_point(dataset)

        # Mark the random point as visited
        dataset.loc[random_point.name, "visited"] = True

        # Get all points within eps distance
        points_within_eps_distance = get_all_points_within_eps_distance(
            random_point, dataset, eps
        )

        # Check if the count of points is below min_pts => is not a core point
        if len(points_within_eps_distance.index) < min_pts:
            # Not a core point => mark as noise
            dataset.loc[random_point.name, "cluster"] = -2
        else:
            # Get the last used cluster id
            last_cluster_id = dataset["cluster"].max()

            # Increment the id to get a new id for the new cluster
            new_cluster_id = last_cluster_id + 1

            # Add the random point to the cluster
            dataset.loc[random_point.name, "cluster"] = new_cluster_id

            # Iterate through the points in eps distance
            for index, row in points_within_eps_distance.iterrows():
                # Check whether the neighbor is already member of a cluster
                # (Note that a point marked as noise is not part of a cluster either)
                if dataset.loc[index, "cluster"] >= 0:
                    # Skip that point
                    continue
                else:
                    # Expand the cluster with that point
                    expand_cluster(dataset, eps, min_pts, row, new_cluster_id)

    # Return the clustered dataset
    return dataset


# Cluster the small_dataset
clustered_small_dataset = dbscan(small_dataset, 1.5, 2)

# Output the corresponding scatterplot
plt.figure(figsize=(4, 4))
sns.scatterplot(
    x=clustered_small_dataset["x"],
    y=clustered_small_dataset["y"],
    hue=clustered_small_dataset["cluster"],
    palette="deep",
)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.0)
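
The substep `If p′ is core point, add all objects in its ϵ-neighborhood to N` can also be implemented without recursion, by keeping a work list of points that still have to be processed. This avoids the recursion depth limit of Python (roughly 1000 nested calls by default in CPython), which a recursive expansion may reach for large clusters. The following is a minimal, self-contained sketch; the names `region_query`, `expand_cluster_iteratively` and `toy` are assumptions made for illustration, and the data set is expected to have the `cluster` and `visited` columns created by `prepare_dataset`.

```python
from collections import deque

import numpy as np
import pandas as pd


def region_query(dataset, index, eps):
    # Hypothetical helper: index labels of all points within eps of the point at `index`
    coordinates = dataset[["x", "y"]].to_numpy(dtype=float)
    center = dataset.loc[index, ["x", "y"]].to_numpy(dtype=float)
    distances = np.sqrt(((coordinates - center) ** 2).sum(axis=1))
    return list(dataset.index[distances <= eps])


def expand_cluster_iteratively(dataset, eps, min_pts, start_index, cluster_id):
    # Work list of points that still have to be added to the cluster
    queue = deque([start_index])

    while queue:
        index = queue.popleft()

        # Skip points that were assigned to a cluster in the meantime
        if dataset.loc[index, "cluster"] >= 0:
            continue

        # Add the point to the cluster
        dataset.loc[index, "cluster"] = cluster_id

        # Points that were already visited (e.g. former noise) are not expanded again
        if dataset.loc[index, "visited"]:
            continue
        dataset.loc[index, "visited"] = True

        # A core point passes the cluster on to its unassigned neighbors
        neighbors = region_query(dataset, index, eps)
        if len(neighbors) >= min_pts:
            queue.extend(n for n in neighbors if dataset.loc[n, "cluster"] < 0)


# Hypothetical toy data: a chain of three points and one far-away point
toy = pd.DataFrame([[1, 1], [1, 2], [1, 3], [5, 5]], columns=["x", "y"])
toy["cluster"] = -1
toy["visited"] = False

expand_cluster_iteratively(toy, eps=1, min_pts=2, start_index=0, cluster_id=0)
toy
```

In this sketch, the three chained points end up in cluster `0`, while the far-away point stays unassigned and unvisited.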

#### Library: scikit-learn

Just as for K-means, scikit-learn also offers an extensive implementation for DBSCAN.

In [None]:
from sklearn.cluster import DBSCAN

<div class="alert alert-block alert-info">

**Task:** Use scikit-learn's implementation of DBSCAN to find clusters in the `big_dataset`. Use the same parameters that we used above for our own implementation. Print the result in a diagram.
    
</div>

In [None]:
# Perform scikit-learn's DBSCAN clustering on the big_dataset
# ...

In [None]:
# Perform scikit-learn's DBSCAN clustering on the big_dataset
# (named dbscan_model so that our own dbscan function is not overwritten)
dbscan_model = DBSCAN(eps=1, min_samples=5).fit(big_dataset[["x", "y"]])

# Save the labels to a copy of the big_dataset to generate the equivalent of our clustered_big_dataset
clustered_big_dataset_3 = big_dataset.copy()
clustered_big_dataset_3["cluster"] = dbscan_model.labels_

# Print the result
plt.figure(figsize=(4, 4))
sns.scatterplot(
    x=clustered_big_dataset_3["x"],
    y=clustered_big_dataset_3["y"],
    hue=clustered_big_dataset_3["cluster"],
    style=clustered_big_dataset_3["true_labels"],
    palette="deep",
)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.0)

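In this case, our function and scikit-learn's implementation find the same clusters. Only the label values may differ: scikit-learn marks noise with `-1` while our implementation uses `-2`, and the cluster ids depend on the order in which the points are processed. This illustrates that DBSCAN is more deterministic than K-means. Core points and noise points are the same for every processing order, and only border points that lie within eps distance of core points from several clusters can change their cluster. The result of K-means, in contrast, depends on the randomly chosen initial centroids.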
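
How strongly the result of DBSCAN depends on the processing order can also be checked empirically. The following is a minimal sketch on hypothetical toy data (three well-separated blobs created with `make_blobs`, not the `big_dataset` of this notebook). It clusters the same points once in the original and once in a shuffled order and compares both results with the adjusted Rand index, which is `1.0` if two clusterings agree up to the naming of the labels. For comparison, K-means is run with a single random initialization for several seeds.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Hypothetical toy data: three well-separated blobs (not the notebook's big_dataset)
points, _ = make_blobs(
    n_samples=150,
    centers=[(-10, -10), (0, 0), (10, 10)],
    cluster_std=0.5,
    random_state=0,
)

# Cluster the points in their original order
labels_original = DBSCAN(eps=1, min_samples=5).fit_predict(points)

# Cluster the same points in a shuffled order
rng = np.random.default_rng(1)
permutation = rng.permutation(len(points))
labels_shuffled = DBSCAN(eps=1, min_samples=5).fit_predict(points[permutation])

# Compare both clusterings point by point (reorder the original labels first)
dbscan_agreement = adjusted_rand_score(labels_original[permutation], labels_shuffled)
print("DBSCAN agreement after shuffling:", dbscan_agreement)

# For comparison: K-means with a single random initialization per seed
for seed in range(3):
    toy_kmeans = KMeans(n_clusters=3, init="random", n_init=1, random_state=seed)
    toy_kmeans.fit(points)
    print("K-means seed", seed, "inertia:", toy_kmeans.inertia_)
```

Whether the K-means runs actually differ depends on the data and on the seeds. On data with overlapping clusters, different initial centroids are more likely to lead to different results.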