# 5. Clustering

This Jupyter notebook is part of an exercise series titled *Clustering*.
The series itself is based on lecture *8. Cluster Analysis*.

There are two parts:

- Part One: Implementing k-means and DBSCAN
- Part Two: Clustering in the AdventureWorks Database

Again, we would like to remind you that there are multiple exercise groups.
Depending on how each group progresses, some parts of these exercises may not be discussed in their entirety.
If questions arise, ask them in your study group or in our StudOn forum.

## Part One: Implementing k-means and DBSCAN

In the first part of this exercise sheet, we take a closer look at two clustering methods known from the lecture: k-means and DBSCAN. You will be asked to implement both methods from scratch. In both cases, you can either implement the algorithm entirely on your own or follow a step-by-step task series.

In [None]:
# Import the required libraries
import pandas as pd
import seaborn as sns
import numpy as np
from sklearn.datasets import make_blobs
from matplotlib import pyplot as plt

The datasets to cluster are the same for both methods. On the one hand, there is the `small_dataset`, which is used as an example for k-means in the lecture.

In [None]:
# Create the small_dataset
small_dataset = pd.DataFrame(
    [
        [6, 5],
        [8, 5],
        [4, 3],
        [5, 6],
        [6, 2],
        [6, 3],
        [2, 2],
        [3, 3],
        [4, 4],
        [5, 5],
        [6, 6],
        [7, 7],
        [8, 7],
        [8, 4],
    ],
    columns=["x", "y"],
)

# Output the small_dataset in a scatterplot diagram
plt.figure(figsize=(8, 8))
sns.scatterplot(x=small_dataset["x"], y=small_dataset["y"])

On the other hand, there is the `big_dataset`, a larger dataset that was generated using `sklearn` and for which a ground truth (column `true_labels`) exists to validate the clustering.

In [None]:
# Generate the big_dataset
big_data, big_data_true_labels = make_blobs(
    n_samples=1000, centers=10, random_state=2306
)
big_dataset = pd.DataFrame(big_data, columns=["x", "y"])
big_dataset["true_labels"] = big_data_true_labels

# Print a scatterplot diagram of the big_dataset while showing the actual class affiliations with colors
plt.figure(figsize=(8, 8))
sns.scatterplot(
    x=big_dataset["x"],
    y=big_dataset["y"],
    hue=big_dataset["true_labels"],
    palette="deep",
    legend=None,
)

If you decide to implement the methods on your own, we recommend that you start with the `small_dataset`. A clustering run on the `big_dataset` takes noticeably longer and is therefore less suited for debugging.

### K-means

The first approach you are asked to implement is k-means. It belongs to the so-called partitioning methods and is the first approach we looked at in the lecture.

#### Implementation

As announced, there are two options for you regarding the implementation of k-means. You may implement the method without extra help or you can choose option two: A guided step-by-step implementation of the k-means algorithm. 

##### Option 1: Implement K-means on Your Own

Some of you may prefer to implement k-means on your own. In this case refer to the lecture for a comprehensive explanation of the method. 

<div class="alert alert-block alert-info">

**Task:** Use your knowledge of k-means to implement a method `k_means` that can be used to cluster the two datasets `small_dataset` and `big_dataset` into `k` clusters, using the Euclidean distance to measure the distance between two points.
If you need more code cells than provided, feel free to add more.

</div>

In [None]:
# Implement a k_means function (Code placeholder 01/10)

In [None]:
# Implement a k_means function (Code placeholder 02/10)

In [None]:
# Implement a k_means function (Code placeholder 03/10)

In [None]:
# Implement a k_means function (Code placeholder 04/10)

In [None]:
# Implement a k_means function (Code placeholder 05/10)

In [None]:
# Implement a k_means function (Code placeholder 06/10)

In [None]:
# Implement a k_means function (Code placeholder 07/10)

In [None]:
# Implement a k_means function (Code placeholder 08/10)

In [None]:
# Implement a k_means function (Code placeholder 09/10)

In [None]:
# Implement a k_means function (Code placeholder 10/10)

In [None]:
# Sample k_means skeleton
# NOTE: You are allowed to use this skeleton but don't have to
def k_means(dataset, k):
    # Copy the original dataset
    dataset_copy = dataset.copy()

    # Create a new empty column to save the cluster/partition affiliation (-1 is representing no cluster/partition)
    dataset_copy["cluster"] = -1

    # ...
    return dataset_copy

In [None]:
# Cluster the small_dataset (just as in the lecture we use k=2)
clustered_small_dataset = k_means(small_dataset, 2)

# Print a scatterplot
plt.figure(figsize=(8, 8))
sns.scatterplot(
    x=clustered_small_dataset["x"],
    y=clustered_small_dataset["y"],
    hue=clustered_small_dataset["cluster"],
    palette="deep",
    legend=None,
)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.0)

In [None]:
# Cluster the big_dataset (as we know that there are 10 classes, we use k=10)
clustered_big_dataset = k_means(big_dataset, 10)

# Print a scatterplot
plt.figure(figsize=(8, 8))
sns.scatterplot(
    x=clustered_big_dataset["x"],
    y=clustered_big_dataset["y"],
    hue=clustered_big_dataset["cluster"],
    palette="deep",
    legend=None,
)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.0)

In [None]:
# Sample solution => See Option 2

##### Option 2: Implement K-means by Solving Small Tasks

When implementing k-means step by step, the first step is always to create an initial partition of the existing data into `k` non-empty partitions. This division can be random or follow an arbitrary scheme. However, it is important that the result consists of exactly `k` partitions, that none of these partitions is empty, and that each sample belongs to exactly one of the partitions.

<div class="alert alert-block alert-info">

**Task:** Write a function `partition_dataset` that splits a `dataset` into `k` initial partitions. It doesn't matter what kind of partitioning you decide on, as long as it complies with the rules mentioned above.

</div>

In [None]:
# Implement a function to arbitrarily partition the dataset into k parts
def partition_dataset(dataset, k):
    # Copy the original dataset
    dataset_copy = dataset.copy()

    # Create a new empty column to save the cluster/partition affiliation (-1 is representing no cluster/partition)
    dataset_copy["cluster"] = -1

    # ...

    # Return the dataset
    return dataset_copy


# Partition the small_dataset
partitioned_small_dataset = partition_dataset(small_dataset, 2)

# Print a scatterplot
plt.figure(figsize=(8, 8))
sns.scatterplot(
    x=partitioned_small_dataset["x"],
    y=partitioned_small_dataset["y"],
    hue=partitioned_small_dataset["cluster"],
    palette="deep",
)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.0)

In [None]:
# Implement a function to arbitrarily partition the dataset into k parts
def partition_dataset(dataset, k):
    # Copy the original dataset
    dataset_copy = dataset.copy()

    # Create a new empty column to save the cluster/partition affiliation (-1 is representing no cluster/partition)
    dataset_copy["cluster"] = -1

    # The method used in the lecture example is to sort the samples by their y values
    dataset_copy = dataset_copy.sort_values(by=["y"]).reset_index(drop=True)

    # Then define an (integer) size for each cluster/partition
    cluster_size = max(dataset_copy.shape[0] // k, 1)

    # And then assign the samples to the clusters/partitions slice by slice
    for i in range(0, dataset_copy.shape[0], cluster_size):
        # Start of the slice
        start = i

        # End of the slice
        end = min(i + cluster_size, dataset_copy.shape[0])

        # Integer cluster id; leftover samples are added to the last cluster
        # so that exactly k clusters are created
        cluster_id = min(i // cluster_size, k - 1)

        # Assign the cluster value
        dataset_copy.loc[start : end - 1, "cluster"] = cluster_id

    # Return the dataset
    return dataset_copy


# Partition the small_dataset
partitioned_small_dataset = partition_dataset(small_dataset, 2)

# Print a scatterplot
plt.figure(figsize=(8, 8))
sns.scatterplot(
    x=partitioned_small_dataset["x"],
    y=partitioned_small_dataset["y"],
    hue=partitioned_small_dataset["cluster"],
    palette="deep",
)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.0)
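
The three rules for the initial partitioning can be checked mechanically. The following is only a minimal sketch and not part of the lecture material: the helper name `is_valid_partition` is hypothetical, and it assumes that the partitions are numbered `0` to `k - 1` and that `-1` marks an unassigned sample, as in the cells above.

```python
import pandas as pd


def is_valid_partition(labels, k):
    """Return True if labels describe exactly k non-empty partitions 0..k-1."""
    labels = pd.Series(labels)

    # Every sample needs a label (no missing values, no -1 placeholder)
    if labels.isna().any() or (labels < 0).any():
        return False

    # The labels that actually occur must be exactly {0, ..., k-1};
    # this rules out both empty partitions and surplus partitions
    return set(labels.astype(int).unique()) == set(range(k))


# A valid partitioning into two parts
print(is_valid_partition([0, 0, 1, 1, 1], 2))

# Invalid: one sample is still unassigned and partition 1 is empty
print(is_valid_partition([0, 0, 0, -1, 0], 2))
```

Because every sample carries a single label, it cannot belong to more than one partition; the check therefore only has to make sure that no sample is left without one.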

The first repeated step in k-means is to calculate the so-called centroids (mean points) of each partition/cluster.

<div class="alert alert-block alert-info">

**Task:** Implement the function `compute_centroids` that computes the centroid for each of the `k` partitions. The return value should be a pandas DataFrame with the cluster identifier as an index and two columns `x` and `y` indicating the coordinates of the corresponding centroid.

</div>

In [None]:
# Implement a function to compute the centroids for a partitioned dataset
def compute_centroids(partitioned_dataset, k):
    # Init a DataFrame to hold the centroids
    centroids = pd.DataFrame(
        [[np.nan, np.nan] for i in range(0, k)], columns=["x", "y"]
    )

    # ...

    # Return the centroids
    return centroids


# Compute the centroids of the initial partitioning
centroids = compute_centroids(partitioned_small_dataset, 2)

# Print the centroids into the scatterplot (black)
plt.figure(figsize=(8, 8))
sns.scatterplot(
    x=partitioned_small_dataset["x"],
    y=partitioned_small_dataset["y"],
    hue=partitioned_small_dataset["cluster"],
    palette="deep",
)
sns.scatterplot(x=centroids["x"], y=centroids["y"], color="black")
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.0)

In [None]:
# Implement a function to compute the centroids for a partitioned dataset
def compute_centroids(partitioned_dataset, k):
    # Init a DataFrame to hold the centroids
    centroids = pd.DataFrame(
        [[np.nan, np.nan] for i in range(0, k)], columns=["x", "y"]
    )

    # Compute the centroid of each partition
    for i in range(0, k):
        # Compute the mean of the x values within that single partition
        x_mean = partitioned_dataset[partitioned_dataset["cluster"] == i]["x"].mean()

        # Compute the mean of the y values within that single partition
        y_mean = partitioned_dataset[partitioned_dataset["cluster"] == i]["y"].mean()

        # Add the centroid of this single partition
        centroids.loc[i, ["x", "y"]] = [x_mean, y_mean]

    # Return the centroids
    return centroids


# Compute the centroids of the initial partitioning
centroids = compute_centroids(partitioned_small_dataset, 2)

# Print the centroids into the scatterplot (black)
plt.figure(figsize=(8, 8))
sns.scatterplot(
    x=partitioned_small_dataset["x"],
    y=partitioned_small_dataset["y"],
    hue=partitioned_small_dataset["cluster"],
    palette="deep",
)
sns.scatterplot(x=centroids["x"], y=centroids["y"], color="black")
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.0)
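
As an alternative to the explicit loop, the centroids can also be obtained with a pandas `groupby`. This is a sketch for comparison only; the function name `compute_centroids_groupby` and the tiny example frame are made up for illustration. Note one difference in behavior: a cluster without any samples simply does not appear in the result instead of showing up as a `NaN` row.

```python
import pandas as pd


def compute_centroids_groupby(partitioned_dataset):
    # Group the samples by cluster id and average x and y within each group;
    # the resulting DataFrame is indexed by the cluster id
    return partitioned_dataset.groupby("cluster")[["x", "y"]].mean()


# Hypothetical example: two clusters with two samples each
example = pd.DataFrame(
    {"x": [0, 2, 10, 12], "y": [0, 4, 10, 14], "cluster": [0, 0, 1, 1]}
)
example_centroids = compute_centroids_groupby(example)
print(example_centroids)
```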

To reassign points to their nearest centroid, a distance measure must be defined. Here, for example, the Euclidean distance comes in handy, which we have already implemented ourselves in an earlier exercise.

In [None]:
# "Pythonic" implementation of the euclidean distance
def euclidean_distance(a, b):
    return (abs(a - b) ** 2).sum() ** 0.5


# Compute the euclidean distance for two random points a and b
a = pd.Series([1, 9])
b = pd.Series([9, 5])
euclidean_distance(a, b)
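
Calling a distance function pair by pair becomes slow for larger datasets. As an optional sketch (not taken from the lecture), the distances between all points and all centroids can be computed at once with NumPy broadcasting. The function name `nearest_centroid` and the example values are assumptions made for this illustration.

```python
import numpy as np


def nearest_centroid(points, centroids):
    points = np.asarray(points, dtype=float)
    centroids = np.asarray(centroids, dtype=float)

    # Broadcasting yields an array of shape (n_points, n_centroids, n_dims)
    diff = points[:, None, :] - centroids[None, :, :]

    # Euclidean distance of every point to every centroid
    distances = np.sqrt((diff ** 2).sum(axis=2))

    # Index of the closest centroid per point, plus the full distance matrix
    return distances.argmin(axis=1), distances


# Two points and two hypothetical centroids
labels, distances = nearest_centroid([[1, 9], [9, 5]], [[1, 8], [9, 9]])
print(labels)
print(distances)
```

Each row of `distances` belongs to one point and each column to one centroid, so `argmin(axis=1)` picks the closest centroid for every point.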

Reassignment is also the next step within k-means. Samples are always reassigned to the cluster/partition whose centroid is closest to themselves.

<div class="alert alert-block alert-info">

**Task:** Complete the function `reassign_samples` that reassigns samples to the cluster/partition whose centroid is closest to them. Return an indicator that communicates whether at least one tuple was reassigned within the function, together with the dataset.

</div>

In [None]:
# Implement a function to reassign each sample to its nearest centroid
def reassign_samples(partitioned_dataset, centroids, k):
    # Indicator to show whether there was at least one tuple reassigned
    reassign_indicator = False

    # Copy the original partitioned_dataset
    dataset_copy = partitioned_dataset.copy()

    # ...

    return reassign_indicator, dataset_copy


# Reassign the samples of our partitioned_small_dataset to their nearest centroid
reassign_indicator, reassigned_small_dataset = reassign_samples(
    partitioned_small_dataset, centroids, 2
)

# Output the indicator
print("Was there at least one sample reassigned? - " + str(reassign_indicator))

# Print a scatterplot showing the new class assignments
plt.figure(figsize=(8, 8))
sns.scatterplot(
    x=reassigned_small_dataset["x"],
    y=reassigned_small_dataset["y"],
    hue=reassigned_small_dataset["cluster"],
    palette="deep",
)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.0)

In [None]:
# Implement a function to reassign each sample to its nearest centroid
def reassign_samples(partitioned_dataset, centroids, k):
    # Indicator to show whether there was at least one tuple reassigned
    reassign_indicator = False

    # Copy the original partitioned_dataset
    dataset_copy = partitioned_dataset.copy()

    # Check for each sample whether it has to be reassigned
    for i in range(0, dataset_copy.shape[0]):
        # Get the coordinates of the sample for easier access
        sample = dataset_copy.loc[i, ["x", "y"]]

        # Set the current cluster id and centroid values
        current_cluster = dataset_copy.loc[i, "cluster"]
        current_centroid = centroids.loc[current_cluster]
        current_distance = euclidean_distance(sample, current_centroid)

        # Iterate through the centroids and check whether the distance is lower than the current distance
        # NOTE: We do not skip the current centroid, as this would complicate the code and isn't a big performance problem
        for j in range(0, k):
            # Compute the distance
            distance = euclidean_distance(sample, centroids.loc[j])

            # If the distance is lower than the current_distance we have to reassign
            if distance < current_distance:

                # Set the cluster
                dataset_copy.loc[i, "cluster"] = j
                current_cluster = j

                # Set the current_centroid
                current_centroid = centroids.loc[j]

                # Set the current_distance
                current_distance = distance

                # Set the reassign_indicator
                reassign_indicator = True

    return reassign_indicator, dataset_copy


# Reassign the samples of our partitioned_small_dataset to their nearest centroid
reassign_indicator, reassigned_small_dataset = reassign_samples(
    partitioned_small_dataset, centroids, 2
)

# Output the indicator
print("Was there at least one sample reassigned? - " + str(reassign_indicator))

# Print a scatterplot showing the new class assignments
plt.figure(figsize=(8, 8))
sns.scatterplot(
    x=reassigned_small_dataset["x"],
    y=reassigned_small_dataset["y"],
    hue=reassigned_small_dataset["cluster"],
    palette="deep",
)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.0)

In the iterative k-means algorithm it would now be checked whether samples were reassigned or not. If yes, we have to go back to calculating the centroids for this new assignment. If not, then the corresponding clusters have been found. 

This decision can of course be passed to a wrapper function `k_means` which summarizes the whole algorithm.

<div class="alert alert-block alert-info">

**Task:** Merge the previously implemented functions into a wrapper function `k_means` to achieve a complete implementation of the algorithm.
    
</div>

In [None]:
# Implement the wrapper function k_means
def k_means(dataset, k):
    # Copy the original dataset
    dataset_copy = dataset.copy()

    # Create a new empty column to save the cluster/partition affiliation (-1 is representing no cluster/partition)
    dataset_copy["cluster"] = -1

    # ...

    # Return the clustered dataset
    return dataset_copy


# Cluster the small_dataset
clustered_small_dataset = k_means(small_dataset, 2)

# Output the corresponding scatterplot
plt.figure(figsize=(8, 8))
sns.scatterplot(
    x=clustered_small_dataset["x"],
    y=clustered_small_dataset["y"],
    hue=clustered_small_dataset["cluster"],
    palette="deep",
)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.0)

In [None]:
# Implement the wrapper function k_means
def k_means(dataset, k):
    # Partition the dataset
    dataset = partition_dataset(dataset, k)

    # Set the reassign_indicator to True (as the initial partitioning was a reassignment in itself)
    reassign_indicator = True

    # As long as there are reassignments, the following two steps are repeated
    while reassign_indicator:
        # Compute the centroids
        centroids = compute_centroids(dataset, k)

        # Reassign each sample to the cluster of the nearest centroid
        reassign_indicator, dataset = reassign_samples(dataset, centroids, k)

    # Return the clustered dataset
    return dataset


# Cluster the small_dataset
clustered_small_dataset = k_means(small_dataset, 2)

# Output the corresponding scatterplot
plt.figure(figsize=(8, 8))
sns.scatterplot(
    x=clustered_small_dataset["x"],
    y=clustered_small_dataset["y"],
    hue=clustered_small_dataset["cluster"],
    palette="deep",
)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.0)

#### Validation

If we take a look at the `big_dataset` with its `true_labels`, we can identify some areas that might lead to problems with k-means due to overlapping of the classes. 

In [None]:
# Print a scatterplot diagram of the big_dataset while showing the actual class affiliations with colors
plt.figure(figsize=(8, 8))
sns.scatterplot(
    x=big_dataset["x"],
    y=big_dataset["y"],
    hue=big_dataset["true_labels"],
    palette="deep",
)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.0)

If you now run your k-means implementation on this `big_dataset`, you will probably also see some areas where the classes were not well recognized. 

In [None]:
# Cluster the big_dataset
clustered_big_dataset = k_means(big_dataset, 10)

# Output the clustered dataset including information on the true classes
plt.figure(figsize=(8, 8))
sns.scatterplot(
    x=clustered_big_dataset["x"],
    y=clustered_big_dataset["y"],
    hue=clustered_big_dataset["cluster"],
    style=clustered_big_dataset["true_labels"],
    palette="deep",
)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.0)

It is important to mention that these problems may well be caused by k-means itself and not necessarily by faulty programming on your part. 

<div class="alert alert-block alert-info">

**Task:** Compare the actual class partitioning and the partitioning determined by your k-means implementation. Describe the most prominent misidentifications.
    
</div>

Write down your solution here:

Because the initial assignment can be chosen arbitrarily, the results of your implementation and of this sample solution probably differ considerably. For this reason, this description explicitly covers only the misidentifications within the sample solution:

- Our k-means implementation merged two original classes around the point (-10, 5) into one mega cluster, so these two classes are not recognized correctly.
- Another mega cluster exists around the point (7, 7.5): again, two classes were detected as only one cluster.
- To compensate for the two classes that were not recognized, our k-means implementation created four clusters around the point (0, 3). Originally, this was a region of two classes.

K-means has problems especially with strongly overlapping classes. However, this is not unusual for a clustering algorithm.
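
Besides inspecting the plot, a contingency table of true classes versus found clusters makes merged and split classes visible as numbers. The following sketch uses a small made-up frame in place of the clustered `big_dataset`; only the column names `true_labels` and `cluster` are taken from this notebook.

```python
import pandas as pd

# Hypothetical result: classes 0 and 1 were merged into cluster 0,
# while class 2 was split into clusters 1 and 2
example = pd.DataFrame(
    {
        "true_labels": [0, 0, 0, 1, 1, 1, 2, 2, 2, 2],
        "cluster": [0, 0, 0, 0, 0, 0, 1, 1, 2, 2],
    }
)

# Rows are the true classes, columns are the clusters that were found
contingency = pd.crosstab(example["true_labels"], example["cluster"])
print(contingency)
```

A column with large counts in several rows indicates a mega cluster, and a row with counts spread over several columns indicates a class that was split.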

#### Library: scikit-learn

As with most methods, it is normally not necessary to write your own implementation of the clustering algorithms from this exercise sheet. In the case of k-means, for example, there is a good implementation in scikit-learn.

In [None]:
from sklearn.cluster import KMeans

<div class="alert alert-block alert-info">

**Task:** Use scikit-learn's implementation of k-means to find ten clusters in the `big_dataset`. Print the result in a diagram.
    
</div>

In [None]:
# Perform scikit-learn's k-means clustering on the big_dataset
# ...

In [None]:
# Perform scikit-learn's k-means clustering on the big_dataset
kmeans = KMeans(n_clusters=10).fit(big_dataset[["x", "y"]])

# Save the labels to a copy of the big_dataset to generate the equivalent of our clustered_big_dataset
clustered_big_dataset_2 = big_dataset.copy()
clustered_big_dataset_2["cluster"] = kmeans.labels_

# Print the result
plt.figure(figsize=(8, 8))
sns.scatterplot(
    x=clustered_big_dataset_2["x"],
    y=clustered_big_dataset_2["y"],
    hue=clustered_big_dataset_2["cluster"],
    style=clustered_big_dataset_2["true_labels"],
    palette="deep",
)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.0)

This shows how useful it can be to optimize the selection of the initial centroids. By default, scikit-learn uses an initialization method called k-means++, which leads to a significantly better recognition of the true classes contained in the dataset.
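
The basic idea behind k-means++ is to spread the initial centroids out: the first one is drawn uniformly at random, and every further one is drawn with a probability proportional to its squared distance to the closest centroid chosen so far. The following is a simplified sketch of that seeding idea only, not scikit-learn's actual code. The function name `kmeans_plus_plus_init` and the `seed` parameter are assumptions, and the sketch assumes that the data contains at least `k` distinct points.

```python
import numpy as np


def kmeans_plus_plus_init(points, k, seed=0):
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)

    # The first centre is picked uniformly at random
    centers = [points[rng.integers(len(points))]]

    for _ in range(1, k):
        # Squared distance of every point to its closest chosen centre
        diff = points[:, None, :] - np.array(centers)[None, :, :]
        squared_distances = (diff ** 2).sum(axis=2).min(axis=1)

        # Points far away from all centres are more likely to be picked next
        probabilities = squared_distances / squared_distances.sum()
        centers.append(points[rng.choice(len(points), p=probabilities)])

    return np.array(centers)


# Two well separated groups of two points each
example_points = [[0, 0], [0, 1], [10, 10], [10, 11]]
initial_centers = kmeans_plus_plus_init(example_points, 2)
print(initial_centers)
```

These centres would then replace the arbitrary initial partitioning; the iteration of computing centroids and reassigning samples stays the same.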

### DBSCAN

In addition to the partitioning methods, density-based methods were also presented in the lecture. As an example of these methods, you will be asked to implement DBSCAN in the following.

#### Implementation

For this implementation you again have two options: you may either implement DBSCAN completely on your own or follow the task series that is divided into smaller tasks.

##### Option 1: Implement DBSCAN on Your Own

If you decide to implement DBSCAN on your own, refer to the lecture for a comprehensive explanation of the method.

<div class="alert alert-block alert-info">

**Task:** Implement a method `dbscan` that can be used to cluster the two datasets `small_dataset` and `big_dataset` into multiple clusters. Use the Euclidean distance to measure the distance between two points during the clustering.
If you need more code cells than provided, feel free to add more.

</div>

In [None]:
# Implement a dbscan function (Code placeholder 01/10)

In [None]:
# Implement a dbscan function (Code placeholder 02/10)

In [None]:
# Implement a dbscan function (Code placeholder 03/10)

In [None]:
# Implement a dbscan function (Code placeholder 04/10)

In [None]:
# Implement a dbscan function (Code placeholder 05/10)

In [None]:
# Implement a dbscan function (Code placeholder 06/10)

In [None]:
# Implement a dbscan function (Code placeholder 07/10)

In [None]:
# Implement a dbscan function (Code placeholder 08/10)

In [None]:
# Implement a dbscan function (Code placeholder 09/10)

In [None]:
# Implement a dbscan function (Code placeholder 10/10)

In [None]:
# Sample dbscan skeleton
# NOTE: You are allowed to use this skeleton but don't have to
def dbscan(dataset, eps, min_pts):
    # Copy the original dataset
    dataset_copy = dataset.copy()

    # Create a new empty column to save the cluster/partition affiliation
    # Special codings for ...
    # ... points that are not set yet: -1
    # ... points that are noise: -2
    dataset_copy["cluster"] = -1

    # Create a new empty column to save the visited status
    dataset_copy["visited"] = False

    # ...

    # Return the clustered dataset
    return dataset_copy

In [None]:
# Cluster the small_dataset
# (the parameters eps=1 and min_pts=2 should result in five different clusters and
# one "noisy" point for this dataset)
clustered_small_dataset = dbscan(small_dataset, 1, 2)

# Output the corresponding scatterplot
plt.figure(figsize=(8, 8))
sns.scatterplot(
    x=clustered_small_dataset["x"],
    y=clustered_small_dataset["y"],
    hue=clustered_small_dataset["cluster"],
    palette="deep",
)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.0)

In [None]:
# Cluster the big_dataset
# (the parameters eps=1 and min_pts=5 should result in five different clusters and
# multiple "noisy" points for this dataset)
clustered_big_dataset = dbscan(big_dataset, 1, 5)

# Output the clustered dataset including information on the true classes
plt.figure(figsize=(8, 8))
sns.scatterplot(
    x=clustered_big_dataset["x"],
    y=clustered_big_dataset["y"],
    hue=clustered_big_dataset["cluster"],
    style=clustered_big_dataset["true_labels"],
    palette="deep",
)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.0)

In [None]:
# Sample solution => See Option 2

##### Option 2: Implement DBSCAN by Solving Small Tasks

For DBSCAN, you need not only the cluster membership as meta information, but also the status "visited". Before we start with the step-by-step implementation of DBSCAN, it is useful to write a small function for preparing the data set:

In [None]:
# Add columns to the dataset to save the status of the dataset
def prepare_dataset(dataset):
    # Copy the original dataset
    dataset_copy = dataset.copy()

    # Create a new empty column to save the cluster/partition affiliation
    # Special codings for ...
    # ... points that are not set yet: -1
    # ... points that are noise: -2
    dataset_copy["cluster"] = -1

    # Create a new empty column to save the visited status
    dataset_copy["visited"] = False

    # Return the dataset_copy
    return dataset_copy


# Prepare the small dataset
prepared_small_dataset = prepare_dataset(small_dataset)
prepared_small_dataset

Besides this preparatory helper function, there are two things that it would make sense to outsource to separate functions before the actual DBSCAN implementation. 

First, a function is needed in DBSCAN to randomly select a single unvisited point from a prepared data set. 

<div class="alert alert-block alert-info">

**Task:** Write a function `pick_random_unvisited_point` that randomly selects an unvisited point out of the `dataset` and returns it.

</div>

In [None]:
# Pick a random point that is unvisited
def pick_random_unvisited_point(dataset):
    # ...
    return None


# Pick a random point
random_point = pick_random_unvisited_point(prepared_small_dataset)
random_point

In [None]:
# Pick a random point that is unvisited
def pick_random_unvisited_point(dataset):
    # Select all points that are unvisited
    unvisited_points = dataset[dataset["visited"] == False]

    # If there are no unvisited points return None
    if len(unvisited_points) < 1:
        return None
    else:
        # Select one random point and return it
        return unvisited_points.sample().iloc[0]


# Pick a random point
random_point = pick_random_unvisited_point(prepared_small_dataset)
random_point

A second helper function that helps in implementing DBSCAN returns all points within eps distance of a selected point.

<div class="alert alert-block alert-info">

**Task:** Write a function `get_all_points_within_eps_distance` that returns all points within a distance of `eps` to the passed `point`. Use the Euclidean distance function introduced during the k-means part of this exercise to determine the distance between two points.

</div>

In [None]:
# Get all points within a distance of eps next to a specific point
def get_all_points_within_eps_distance(point, dataset, eps):
    # ...
    return None


# Get all points within distance of 1 regarding to the point (6,5)
points_within_eps_distance = get_all_points_within_eps_distance(
    pd.Series(data=[6, 5, -1, False], index=["x", "y", "cluster", "visited"]),
    prepared_small_dataset,
    1,
)
points_within_eps_distance

In [None]:
# Get all points within a distance of eps next to a specific point
def get_all_points_within_eps_distance(point, dataset, eps):
    # Select all points (visited or not) within eps distance
    return dataset[
        dataset.apply(
            lambda a: euclidean_distance([a["x"], a["y"]], point[["x", "y"]].values)
            <= eps,
            axis=1,
        )
    ]


# Get all points within distance of 1 regarding to the point (6,5)
points_within_eps_distance = get_all_points_within_eps_distance(
    pd.Series(data=[6, 5, -1, False], index=["x", "y", "cluster", "visited"]),
    prepared_small_dataset,
    1,
)
points_within_eps_distance
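
The row-wise `apply` is easy to read but slow on the `big_dataset`. As an optional sketch, the same neighborhood query can be written with NumPy on the coordinate columns. The function name `points_within_eps` and the four example points are assumptions for illustration. As in the cell above, the queried point itself is part of the result whenever it is contained in the dataset, so it counts towards `min_pts`.

```python
import numpy as np
import pandas as pd


def points_within_eps(point, dataset, eps):
    # Difference between every sample and the queried point
    center = np.array([point["x"], point["y"]], dtype=float)
    diff = dataset[["x", "y"]].to_numpy(dtype=float) - center

    # Euclidean distance of every sample to the queried point
    distances = np.sqrt((diff ** 2).sum(axis=1))

    # Keep the rows whose distance does not exceed eps
    return dataset[distances <= eps]


# Four points taken from the small_dataset
example = pd.DataFrame([[6, 5], [8, 5], [5, 5], [6, 6]], columns=["x", "y"])
neighbors = points_within_eps({"x": 6, "y": 5}, example, 1)
print(neighbors)
```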

The pseudocode from the lecture on DBSCAN is quite general. Thus, the substep `If p′ is core point, add all objects in its ϵ-neighborhood to N` is ultimately something that can be implemented both by merging multiple sets of points, and by recursion. 
Since the recursive variant of DBSCAN is easier to implement, we focus on this variant in this step-by-step implementation. 

Finally, it makes sense to outsource the entire step `For each p′ in N that does not yet belong to a cluster` to a separate recursive function.

<div class="alert alert-block alert-info">

**Task:** Complete the skeleton of the function `expand_cluster` below. Remember that you can use the previously defined helper functions.

</div>

In [None]:
# This function is used to expand a specific cluster by one point
# If the point is a core point (at least min_pts in eps distance) by itself
# expand_cluster is called for each neighbor.
def expand_cluster(dataset, eps, min_pts, point, cluster_id):
    # ...
    return None

In [None]:
# This function is used to expand a specific cluster by one point
# If the point is a core point (at least min_pts in eps distance) by itself
# expand_cluster is called for each neighbor.
def expand_cluster(dataset, eps, min_pts, point, cluster_id):
    # Add the point to the cluster
    dataset.loc[point.name, "cluster"] = cluster_id

    # If point was not visited, we have to visit it now
    if dataset.loc[point.name, "visited"] == False:
        # Mark the point as visited
        dataset.loc[point.name, "visited"] = True

        # Get all points within eps distance
        points_within_eps_distance = get_all_points_within_eps_distance(
            point, dataset, eps
        )

        # Check if the number of points is at least min_pts => is a core point
        # => We have to go deeper into the recursion
        if len(points_within_eps_distance.index) >= min_pts:
            # Iterate through the points in eps distance
            for index, row in points_within_eps_distance.iterrows():
                # Check whether the neighbor is already member of a cluster
                # (Note that a point marked as noise is not part of a cluster either)
                if dataset.loc[index, "cluster"] >= 0:
                    # Skip that point
                    continue
                else:
                    # Expand the cluster with that point
                    expand_cluster(dataset, eps, min_pts, row, cluster_id)

Of course, it is useful to have a test case to test your implementation against. However, this test case is somewhat more difficult to understand, as the function depends on the input of a function that is not defined yet. Therefore, let's describe the test scenario first:

*Let's say that the point with id `7` (coordinates `(3, 3)`) is selected as the random unvisited point from the `prepared_small_dataset` by the main dbscan function. As eps is `1` and min_pts is `2` in this example, the selected point is a core point, because there is another point (id `2`, coordinates `(4, 3)`) within eps distance. Therefore, a new cluster with id `0` is created, the point with id `7` is added, and `expand_cluster` gets called for all neighboring points that are not part of a cluster yet. In our example, we take a look at the call of `expand_cluster` for the point with id `2`.*

If your function works correctly, it should add the point with id `2` to the cluster and check whether it is a core point itself. As there is one still unvisited point to descend to (id `8` with coordinates `(4, 4)`), the recursion is started. In the end, there should be three visited points within cluster `0` (ids `2`, `7` and `8`).

In [None]:
# Prepare the dataset
prepared_small_dataset = prepare_dataset(small_dataset)

# Mark the point with id 7 as visited and add it to the cluster with id 0
prepared_small_dataset.loc[7, "visited"] = True
prepared_small_dataset.loc[7, "cluster"] = 0

# Select the point with id 2
point_with_id_2 = prepared_small_dataset.iloc[2]

# Call expand_cluster
expand_cluster(prepared_small_dataset, 1, 2, point_with_id_2, 0)

# Take a look at the dataset (should now contain three points within cluster 0)
prepared_small_dataset

With the help of the recursive function `expand_cluster`, it is now straightforward to implement the function `dbscan`, which in principle takes over the remaining steps of the pseudocode and calls `expand_cluster` whenever neighboring points have to be added to the cluster.
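
For comparison, the same idea can also be written without recursion. The following is only a sketch of an alternative formulation, not the solution intended by this exercise: it works on a plain NumPy array instead of the prepared DataFrame, replaces the recursion with a queue, visits the points in index order instead of random order, and uses the hypothetical names `dbscan_iterative`, `region_query` and `NOISE`. In this sketch, `-1` stands for both "not assigned yet" and "noise".

```python
from collections import deque

import numpy as np

NOISE = -1  # hypothetical label of this sketch: unassigned or noise


def dbscan_iterative(points, eps, min_pts):
    """Label each row of `points` with a cluster id (0, 1, ...) or NOISE."""
    points = np.asarray(points, dtype=float)
    labels = np.full(len(points), NOISE)
    visited = np.zeros(len(points), dtype=bool)
    cluster_id = -1

    def region_query(i):
        # Ids of all points within eps of point i (the point itself included)
        return np.flatnonzero(np.linalg.norm(points - points[i], axis=1) <= eps)

    for start in range(len(points)):
        if visited[start]:
            continue
        visited[start] = True
        neighbors = region_query(start)
        if len(neighbors) < min_pts:
            continue  # stays noise unless a later cluster reaches it as a border point
        cluster_id += 1
        labels[start] = cluster_id
        queue = deque(neighbors)
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # unassigned or former noise joins the cluster
            if visited[j]:
                continue
            visited[j] = True
            j_neighbors = region_query(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
    return labels


# The small_dataset from the beginning of the notebook
small_points = [
    [6, 5], [8, 5], [4, 3], [5, 6], [6, 2], [6, 3], [2, 2],
    [3, 3], [4, 4], [5, 5], [6, 6], [7, 7], [8, 7], [8, 4],
]
labels = dbscan_iterative(small_points, eps=1, min_pts=2)
print(labels)
```

With `eps=1` and `min_pts=2`, this sketch finds five clusters in the small dataset and leaves the point with id `6` as noise. A queue avoids deep recursion when a cluster consists of a long chain of core points.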

<div class="alert alert-block alert-info">

**Task:** Complete the `dbscan` function. Again, it is recommended to use the previously defined functions.

</div>

In [None]:
# Implement dbscan
def dbscan(dataset, eps, min_pts):
    # ...

    # Return the clustered dataset
    return dataset


# Cluster the small_dataset
clustered_small_dataset = dbscan(small_dataset, 1, 2)

# Output the corresponding scatterplot
plt.figure(figsize=(8, 8))
sns.scatterplot(
    x=clustered_small_dataset["x"],
    y=clustered_small_dataset["y"],
    hue=clustered_small_dataset["cluster"],
    palette="deep",
)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.0)

In [None]:
# Implement dbscan
def dbscan(dataset, eps, min_pts):
    # Prepare the dataset
    dataset = prepare_dataset(dataset)

    # While there are unvisited points pick a random one
    while len(dataset[dataset["visited"] == False]) > 0:
        # Select a random unvisited point
        random_point = pick_random_unvisited_point(dataset)

        # Mark the random point as visited
        dataset.loc[random_point.name, "visited"] = True

        # Get all points within eps distance
        points_within_eps_distance = get_all_points_within_eps_distance(
            random_point, dataset, eps
        )

        # Check whether at least min_pts points are within eps distance => is a core point
        if len(points_within_eps_distance.index) < min_pts:
            # Not a core point => mark as noise
            dataset.loc[random_point.name, "cluster"] = -2
        else:
            # Get the last used cluster id
            last_cluster_id = dataset["cluster"].max()

            # Increment the id to get an new id for the new cluster
            new_cluster_id = last_cluster_id + 1

            # Add the random point to the new cluster
            dataset.loc[random_point.name, "cluster"] = new_cluster_id

            # Iterate through the points in eps distance
            for index, row in points_within_eps_distance.iterrows():
                # Check whether the neighbor is already member of a cluster
                # (Note that a point marked as noise is not part of a cluster either)
                if dataset.loc[index, "cluster"] >= 0:
                    # Skip that point
                    continue
                else:
                    # Expand the cluster with that point
                    expand_cluster(dataset, eps, min_pts, row, new_cluster_id)

    # Return the clustered dataset
    return dataset


# Cluster the small_dataset
clustered_small_dataset = dbscan(small_dataset, 1, 2)

# Output the corresponding scatterplot
plt.figure(figsize=(8, 8))
sns.scatterplot(
    x=clustered_small_dataset["x"],
    y=clustered_small_dataset["y"],
    hue=clustered_small_dataset["cluster"],
    palette="deep",
)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.0)

#### Validation

Again, we can take a look at the `big_dataset` with its `true_labels`. We will again see strengths and weaknesses of a clustering method.

In [None]:
# Cluster the big_dataset
clustered_big_dataset = dbscan(big_dataset, 1, 5)

# Output the clustered dataset including information on the true classes
plt.figure(figsize=(8, 8))
sns.scatterplot(
    x=clustered_big_dataset["x"],
    y=clustered_big_dataset["y"],
    hue=clustered_big_dataset["cluster"],
    style=clustered_big_dataset["true_labels"],
    palette="deep",
)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.0)

<div class="alert alert-block alert-info">

**Task:** Describe the most prominent problems you can identify for the DBSCAN algorithm in this example.
    
</div>

Write down your solution here:

In this case, a few things stand out in particular. Note, however, that some of them may be less pronounced with other `eps` and `min_pts` configurations, while others may be more pronounced:

- Our implementation of DBSCAN finds fewer clusters than there are actual classes. (However, it can also be an advantage that the number of classes does not have to be known in advance.)
- There are some "noisy" points that are not part of any cluster.
- DBSCAN cannot separate overlapping classes at all.
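
These observations can also be expressed in numbers. The following is a minimal sketch, assuming that negative cluster labels mark points without a cluster; `summarize_clustering` is a hypothetical helper that is not part of this notebook, and the toy labels are made up for illustration. It counts clusters and noise points and computes the adjusted Rand index from `sklearn.metrics`, which is `1.0` only if the clustering reproduces the true classes exactly.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score


def summarize_clustering(true_labels, cluster_labels):
    """Compare a clustering with known classes (negative cluster labels count as noise)."""
    true_labels = np.asarray(true_labels)
    cluster_labels = np.asarray(cluster_labels)
    is_noise = cluster_labels < 0
    return {
        "n_classes": len(np.unique(true_labels)),
        "n_clusters": len(np.unique(cluster_labels[~is_noise])),
        "n_noise": int(is_noise.sum()),
        # Each distinct noise label is treated like an ordinary cluster here
        "ari": adjusted_rand_score(true_labels, cluster_labels),
    }


# Toy labels (made up for illustration): two classes merged, one point marked as noise
true = [0, 0, 0, 1, 1, 1, 2, 2, 2]
found = [0, 0, 0, 0, 0, 0, 1, 1, -2]
summary = summarize_clustering(true, found)
print(summary)
```

In this notebook, the helper could be called with `clustered_big_dataset["true_labels"]` and `clustered_big_dataset["cluster"]`.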

#### Library: scikit-learn

Just as for k-means, scikit-learn also offers an extensive implementation for DBSCAN.

In [None]:
from sklearn.cluster import DBSCAN

<div class="alert alert-block alert-info">

**Task:** Use scikit-learn's implementation of DBSCAN to find clusters in the `big_dataset`. Use the same parameters as for our own implementation above. Plot the result in a diagram.
    
</div>

In [None]:
# Perform sklearn's DBSCAN clustering on the big_dataset
# ...

In [None]:
# Perform sklearn's DBSCAN clustering on the big_dataset
# (the result is not named `dbscan`, so our own dbscan function is not overwritten)
sklearn_dbscan = DBSCAN(eps=1, min_samples=5).fit(big_dataset[["x", "y"]])

# Save the labels to a copy of the big_dataset to generate the equivalent of our clustered_big_dataset
clustered_big_dataset_3 = big_dataset.copy()
clustered_big_dataset_3["cluster"] = sklearn_dbscan.labels_

# Print the result
plt.figure(figsize=(8, 8))
sns.scatterplot(
    x=clustered_big_dataset_3["x"],
    y=clustered_big_dataset_3["y"],
    hue=clustered_big_dataset_3["cluster"],
    style=clustered_big_dataset_3["true_labels"],
    palette="deep",
)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.0)

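In this case, our function and scikit-learn find the same clusters (only the label values may differ; for example, scikit-learn marks noise with `-1`). This illustrates that DBSCAN is more deterministic than k-means: it does not start from randomly chosen centroids, so core points and noise points are always the same, and only border points that are reachable from more than one cluster can depend on the processing order.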
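
The difference in determinism can be made visible with a small experiment. The following is a sketch under assumptions: the data is regenerated with only the `make_blobs` parameters shown at the top of this notebook (any further arguments of the original call are not reproduced), DBSCAN is run on the original and on a shuffled order of the points, and k-means is run twice with a single random initialization and different seeds.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Assumption: blobs generated with the parameters visible in the notebook
points, _ = make_blobs(n_samples=1000, centers=10, random_state=2306)

# DBSCAN on the original order and on a shuffled copy of the same points
rng = np.random.default_rng(0)
order = rng.permutation(len(points))
labels_original = DBSCAN(eps=1, min_samples=5).fit(points).labels_
labels_shuffled = np.empty_like(labels_original)
# Map the labels of the shuffled run back to the original row positions
labels_shuffled[order] = DBSCAN(eps=1, min_samples=5).fit(points[order]).labels_

# Noise points do not depend on the processing order
same_noise = np.array_equal(labels_original == -1, labels_shuffled == -1)
print("same noise points:", same_noise)
print("DBSCAN agreement (ARI):", adjusted_rand_score(labels_original, labels_shuffled))

# k-means with one random initialization per run and two different seeds
kmeans_a = KMeans(n_clusters=10, init="random", n_init=1, random_state=1).fit(points).labels_
kmeans_b = KMeans(n_clusters=10, init="random", n_init=1, random_state=2).fit(points).labels_
print("k-means agreement (ARI):", adjusted_rand_score(kmeans_a, kmeans_b))
```

An adjusted Rand index of `1.0` means that two runs produce the same partition, regardless of how the clusters are numbered.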

## Part Two: Clustering in the AdventureWorks Database

<div class="alert alert-block alert-warning">

TODO

</div>