# DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

DBSCAN is a density-based clustering algorithm that groups together points that are close to each other based on a distance metric and a density threshold, while marking points that are in low-density regions as noise.

## History

DBSCAN was proposed by Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu in 1996. The algorithm has since become a popular choice for clustering tasks due to its ability to find clusters of arbitrary shapes and its robustness to noise.

## Mathematical Equations

Unlike k-means, DBSCAN does not minimize an explicit objective function. Instead, it is built on a notion of local density: the density at a point is the number of dataset points that lie within a given radius `eps` of it, and a region counts as dense when that number reaches a threshold `min_samples`.
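
The density notion above can be written a little more formally. The notation below follows the usual textbook presentation and is meant as a sketch, not a full specification: $D$ is the dataset, $\varepsilon$ corresponds to `eps`, and $\text{MinPts}$ corresponds to `min_samples`.

$$
N_\varepsilon(p) = \{\, q \in D : \operatorname{dist}(p, q) \le \varepsilon \,\}
$$

$$
p \text{ is a core point} \iff |N_\varepsilon(p)| \ge \text{MinPts}
$$

Because $\operatorname{dist}(p, p) = 0$, the neighborhood $N_\varepsilon(p)$ contains $p$ itself, so each point counts toward its own density.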

## Learning Algorithm

The learning algorithm for DBSCAN consists of the following steps:

1. For each point in the dataset, determine the points within the `eps` radius.
2. If a point has at least `min_samples` points within its `eps` radius (the point itself included), mark it as a core point. A non-core point that lies within `eps` of some core point is a border point; every other point is noise.
3. Assign each point to a cluster by following the procedure:
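   - If a point is a core point that has not been assigned yet, create a new cluster and add every point that is density-reachable from it: its `eps`-neighbors, then the `eps`-neighbors of any of those that are themselves core points, and so on.
   - If a point is a border point, assign it to the cluster of a core point within its `eps` radius. A border point within reach of two clusters goes to whichever cluster reaches it first, so its assignment can depend on the processing order.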
   - If a point is noise, do not assign it to any cluster.

The `eps` radius and `min_samples` are hyperparameters of the DBSCAN algorithm.
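
As an illustration of step 2, the following is a minimal sketch that labels each point as core, border, or noise. The function name `classify_points` and the toy array `X_demo` are made up for this example, and the brute-force distance matrix is only practical for small datasets.

```python
import numpy as np

def classify_points(X, eps, min_samples):
    """Label each row of X as 'core', 'border', or 'noise'."""
    # Pairwise Euclidean distances, shape (n, n).
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    within_eps = dists <= eps  # every point is its own neighbor

    # Core: at least min_samples points (itself included) within eps.
    is_core = within_eps.sum(axis=1) >= min_samples
    # Border: not core, but within eps of at least one core point.
    is_border = ~is_core & (within_eps & is_core[None, :]).any(axis=1)

    kinds = np.full(len(X), "noise", dtype=object)
    kinds[is_border] = "border"
    kinds[is_core] = "core"
    return kinds

X_demo = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],  # dense group
    [0.5, 0.0],                                      # near the group, sparse itself
    [5.0, 5.0],                                      # isolated
])
kinds = classify_points(X_demo, eps=0.45, min_samples=4)
print(kinds)
```

Here the four grouped points come out as core points, the fifth point is a border point because it lies within `eps` of two core points without being dense itself, and the last point is noise.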

## Pros and Cons

**Pros:**
- Can find clusters of arbitrary shapes.
- Robust to noise.
- Does not require the number of clusters as an input parameter.
- Separates dense clusters from sparse background regions, provided the clusters themselves have similar densities.
- Requires only two hyperparameters.

**Cons:**
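- Degrades on high-dimensional data, where distances between points become less discriminative and a suitable `eps` is hard to find. Without a spatial index, the neighbor queries also cost O(n²) distance computations in total.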
- Sensitive to the choice of `eps` and `min_samples` hyperparameters.
- Struggles when clusters have markedly different densities, because a single global `eps` and `min_samples` pair cannot fit all of them.
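
Because results are sensitive to `eps`, one heuristic from the original DBSCAN paper is to sort every point's distance to its k-th nearest neighbor and look for a sharp bend in the resulting curve. The sketch below computes those sorted k-distances with NumPy only; the helper name `k_distances` and the synthetic `X_demo` are assumptions made for illustration, and reading off the bend remains a judgment call.

```python
import numpy as np

def k_distances(X, k):
    """Return the sorted distance from each point to its k-th nearest other point."""
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # After sorting each row, column 0 is the point itself (distance 0),
    # so column k holds the distance to the k-th nearest other point.
    kth = np.sort(dists, axis=1)[:, k]
    return np.sort(kth)

rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(50, 2)),  # one dense blob
    np.array([[3.0, 3.0], [-3.0, 3.0]]),           # two isolated points
])
kd = k_distances(X_demo, k=4)
print("smallest k-distances:", kd[:3])
print("largest k-distances: ", kd[-3:])
```

Plotting `kd` would show a long flat stretch for the blob followed by a jump for the isolated points; an `eps` just below the jump is a reasonable starting value. When `min_samples` counts the point itself, as in scikit-learn, `k = min_samples - 1` is the matching choice.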

## Suitable Tasks and Datasets

DBSCAN can be applied to a variety of clustering tasks, including:

- Anomaly detection
- Image segmentation
- Spatial data analysis
- Pattern recognition

It works well with datasets whose clusters have arbitrary shapes but comparable densities. DBSCAN is also suitable for datasets with noise, since points in sparse regions are left unassigned instead of being forced into a cluster.
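
As a sketch of the anomaly-detection use case, the example below treats every point that scikit-learn's `DBSCAN` labels `-1` as an anomaly. The synthetic data (two tight blobs plus two far-away points) and the parameter values are assumptions chosen for illustration; real data needs its own tuning.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.1, size=(50, 2)),  # rows 0-49
    rng.normal(loc=(5.0, 5.0), scale=0.1, size=(50, 2)),  # rows 50-99
    np.array([[10.0, -10.0], [-8.0, 9.0]]),               # rows 100-101, isolated
])

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X_demo)

# Noise points carry the label -1; here they play the role of anomalies.
anomaly_idx = np.flatnonzero(labels == -1)
print("Anomalous rows:", anomaly_idx)
```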

## Difference between K-means and DBSCAN

The main differences between k-means and DBSCAN are:

- K-means requires the number of clusters as an input parameter, while DBSCAN does not.
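- K-means is sensitive to the initial placement of centroids and may converge to local optima, while DBSCAN has no random initialization: for fixed `eps` and `min_samples`, core and noise points always receive the same labels, and only border points shared between clusters can depend on the processing order.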
- K-means tends to work well with spherical clusters and may struggle with clusters of arbitrary shapes, while DBSCAN can find clusters of any shape.
- K-means is less robust to noise compared to DBSCAN, which can identify and separate noise points from clusters.
- Neither method copes well with clusters of very different densities: k-means implicitly favors clusters of similar spread, while DBSCAN applies one global density threshold to the whole dataset.
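
To make the shape difference concrete, the following sketch runs both algorithms from scikit-learn on the "two moons" dataset and compares each result with the true moon labels using the adjusted Rand index (1.0 means perfect agreement). The metric and the `n_init` and `random_state` values are choices made for this illustration.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

X_demo, y_true = make_moons(n_samples=300, noise=0.05, random_state=42)

# K-means needs the number of clusters up front; DBSCAN does not.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X_demo)
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X_demo)

ari_kmeans = adjusted_rand_score(y_true, kmeans_labels)
ari_dbscan = adjusted_rand_score(y_true, dbscan_labels)
print(f"k-means ARI: {ari_kmeans:.2f}")
print(f"DBSCAN  ARI: {ari_dbscan:.2f}")
```

On this dataset k-means tends to split the moons with a straight boundary, so its score should come out well below DBSCAN's.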

## References

1. Ester, M., Kriegel, H.-P., Sander, J., & Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), pp. 226-231.


In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from collections import deque

# Function to calculate Euclidean distance between two points
def euclidean_distance(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

# DBSCAN class
class DBSCAN:
    def __init__(self, eps=0.5, min_samples=5):
        self.eps = eps
        self.min_samples = min_samples

    def fit_predict(self, X):
        n_samples = X.shape[0]
        labels = np.full(n_samples, -1, dtype=int)  # -1 marks noise or not-yet-assigned points
        cluster_id = 0

        for i in range(n_samples):
            if labels[i] != -1:  # Point already assigned to a cluster
                continue

            neighbors = self._find_neighbors(X, i)
            if len(neighbors) < self.min_samples:  # Not a core point: stays -1 unless a cluster later claims it as a border point
                continue

            # Assign point and its neighbors to a new cluster
            self._expand_cluster(X, labels, i, neighbors, cluster_id)
            cluster_id += 1

        return labels

    def _find_neighbors(self, X, i):
        neighbors = []
        for j, x_j in enumerate(X):
            if euclidean_distance(X[i], x_j) <= self.eps:
                neighbors.append(j)
        return neighbors

    def _expand_cluster(self, X, labels, i, neighbors, cluster_id):
        labels[i] = cluster_id
        queue = deque(neighbors)

        while queue:
            j = queue.popleft()
            if labels[j] != -1:  # Already assigned to a cluster: skip it
                continue
            labels[j] = cluster_id  # Claim an unassigned point

            # Only core points extend the cluster; border points are
            # labeled, but their neighbors are not explored.
            new_neighbors = self._find_neighbors(X, j)
            if len(new_neighbors) >= self.min_samples:
                queue.extend(new_neighbors)

# Generate the "two moons" dataset
X, y = make_moons(n_samples=300, noise=0.05, random_state=42)

# Apply the DBSCAN algorithm
dbscan = DBSCAN(eps=0.2, min_samples=5)
clusters = dbscan.fit_predict(X)

# Visualize the results
plt.scatter(X[:, 0], X[:, 1], c=clusters, cmap='viridis')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('DBSCAN Clustering')
plt.show()


In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

# Generate the "two moons" dataset
X, y = make_moons(n_samples=300, noise=0.05, random_state=42)

# Apply the DBSCAN algorithm
dbscan = DBSCAN(eps=0.2, min_samples=5)
clusters = dbscan.fit_predict(X)

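# Evaluate with the silhouette score, excluding noise points (label -1).
# The score is only defined when at least two clusters remain.
mask = clusters != -1
if len(set(clusters[mask])) >= 2:
    score = silhouette_score(X[mask], clusters[mask])
    print(f"Silhouette Score: {score:.2f}")
else:
    print("Silhouette score undefined: fewer than two clusters found")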

# Visualize the results
plt.scatter(X[:, 0], X[:, 1], c=clusters, cmap='viridis')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('DBSCAN Clustering')
plt.show()
