## Ordering Points To Identify the Clustering Structure (OPTICS)

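OPTICS does not directly partition the data into clusters. Instead, it produces an ordering of the points together with their reachability distances (a reachability plot), from which clusters can then be extracted.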

OPTICS draws inspiration from DBSCAN:
* DBSCAN assumes that all clusters have roughly the same density
* OPTICS allows the density to vary from cluster to cluster

DBSCAN works with two important parameters:
* Radius of neighbourhood (R)<br>
The radius `R` defines a neighbourhood around a point; if that neighbourhood contains enough points, we call it a dense area.
* Minimum number of neighbours (M)<br>
`M` defines the minimum number of points a neighbourhood must contain to form a cluster.
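As a rough illustration of these two parameters, here is a minimal sketch using scikit-learn's `DBSCAN`, where `R` corresponds to `eps` and `M` to `min_samples`. The toy grid and the single far-away point (`X_toy`) are made-up data used only for this illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy data (illustrative): a dense 3x3 grid with 0.1 spacing plus one isolated point
grid = np.array([[0.1 * i, 0.1 * j] for i in range(3) for j in range(3)])
X_toy = np.vstack([grid, [[5.0, 5.0]]])

# R -> eps, M -> min_samples (scikit-learn's DBSCAN counts the point itself)
dbscan = DBSCAN(eps=0.15, min_samples=4).fit(X_toy)

# The nine grid points share one cluster label; the isolated point is noise (-1)
print(dbscan.labels_)
```

With a single `eps`, every cluster is judged against the same density threshold, which is the limitation OPTICS relaxes.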

OPTICS introduces two additional quantities, computed for each point:
* **Core Distance**
* **Reachability Distance**

**Core Distance**
* In simple terms: **it's the minimum radius required to classify a point as a core-point**
* More precisely, the core distance of a data-point `p` is the smallest value `epsilon` such that the `epsilon` neighbourhood of `p` still contains at least `min_samples` points
* If a given point is not a core-point, its core-distance is undefined

Let's understand core-distance using this example:
* To classify point `p` as a core-point we need at least `5` points in its neighbourhood. The epsilon (radius of the neighbourhood around the point of interest) is set to `6mm`.
* A radius of only `3mm` already captures those `5` points, so `3mm` is the minimum epsilon required to classify `p` as a core-point.
* Hence the core-distance of `p` is `3mm`.

<img src='./notes/OPTICS - core-distance.PNG'>
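The core-distance definition can be sketched with plain NumPy. The helper name `core_distance` and the coordinates in `X_toy` are illustrative assumptions, not part of any library. This sketch counts the point itself as one of its `min_samples` neighbours, which is a convention choice; counting only the other points would shift the index by one.

```python
import numpy as np

def core_distance(X, p_index, min_samples, max_eps=np.inf):
    """Distance from X[p_index] to its min_samples-th nearest point (itself included).

    Returns np.inf when that distance exceeds max_eps, i.e. the point is
    not a core point and its core distance is undefined.
    """
    dists = np.sort(np.linalg.norm(X - X[p_index], axis=1))
    d = dists[min_samples - 1]  # index 0 is the point itself (distance 0)
    return d if d <= max_eps else np.inf

# Toy data (hypothetical): p at the origin, four neighbours within 3 units, one far away
X_toy = np.array([[0, 0], [1, 0], [0, 2], [-3, 0], [0, -2.5], [10, 10]], dtype=float)

cd = core_distance(X_toy, p_index=0, min_samples=5, max_eps=6.0)
print(cd)  # 3.0 for this toy data, mirroring the 3mm example

# With a neighbourhood radius that is too small, p is not a core point
cd_small = core_distance(X_toy, p_index=0, min_samples=5, max_eps=2.0)
print(cd_small)  # inf
```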

**Reachability Distance**

* The reachability distance of a point `q` with respect to a core-point `p` is the maximum of two values
    * `core-distance(p)`
        * The minimum `epsilon` such that the epsilon neighbourhood of `p` still contains the `min_samples` points that make `p` a core-point
    * `distance_between(p, q)`
        * We can use any distance metric, e.g. `euclidean`, `cosine`, `manhattan`, to compute the distance between points.
* Only `p` has to be a core-point; if it is not, the reachability distance is undefined.

Let's understand the reachability-distance using an example:
* As before, point `p` needs at least `5` points in its neighbourhood to be a core-point, and epsilon (radius of the neighbourhood around the point of interest) is set to `6mm`.
* The minimum epsilon required to classify `p` as a core-point is only `3mm`, hence the core-distance of `p` is `3mm`.
* In this example we have `3` points: `p`, `q` and `r`.
* The Euclidean distances are `distance(p, q) = 7mm` and `distance(p, r) = 2mm`.
* The reachability distance between `(p, q)` is `maximum(core-distance(p), distance(p, q))`
    * Hence `Reachability-distance(p, q)` = `7mm`, which is `max(3mm, 7mm)`
    * Hence `Reachability-distance(p, r)` = `3mm`, which is `max(3mm, 2mm)`

<img src='./notes/OPTICS - reachability-distance.PNG'>
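The same arithmetic can be written as a few lines of Python. The helper `reachability_distance` and the coordinates chosen for `p`, `q` and `r` are hypothetical; they are picked so that the distances match the 7mm and 2mm used in the example.

```python
import numpy as np

def reachability_distance(core_dist_p, dist_pq):
    """Reachability of q from p: max(core-distance(p), distance(p, q))."""
    if not np.isfinite(core_dist_p):
        return np.inf  # p is not a core point, so the value is undefined
    return float(max(core_dist_p, dist_pq))

p = np.array([0.0, 0.0])
q = np.array([7.0, 0.0])   # 7 units away from p
r = np.array([0.0, 2.0])   # 2 units away from p
core_dist_p = 3.0          # taken from the worked example

reach_pq = reachability_distance(core_dist_p, np.linalg.norm(p - q))
reach_pr = reachability_distance(core_dist_p, np.linalg.norm(p - r))
print(reach_pq, reach_pr)  # 7.0 3.0
```

Note how `r`, although only 2 units from `p`, still gets a reachability of 3: the core distance acts as a floor, which smooths the reachability plot inside dense regions.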



The OPTICS algorithm shares many similarities with the DBSCAN algorithm, and can be considered a generalization of DBSCAN that relaxes the `eps` requirement from a single value to a value range.

* The key difference between DBSCAN and OPTICS is that the OPTICS algorithm builds a reachability graph, which assigns each sample both a `reachability_` distance and a spot within the cluster `ordering_` attribute.
* These two attributes (`reachability_` and `ordering_`) are assigned when the model is fitted, and are used to determine cluster membership.
* If OPTICS is run with the default value of `inf` for the `max_eps` parameter, then DBSCAN-style cluster extraction can be performed repeatedly in linear time for any given `eps` value using the `cluster_optics_dbscan` function.
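To see what these fitted attributes look like, here is a small sketch on made-up data (`X_toy` and the variable names are illustrative). It only inspects `ordering_` and `reachability_`; it does not show how scikit-learn computes them internally.

```python
import numpy as np
from sklearn.cluster import OPTICS

rng = np.random.RandomState(0)
# Toy data (illustrative): two small blobs of different density
X_toy = np.vstack([
    [0, 0] + 0.1 * rng.randn(30, 2),
    [5, 5] + 0.5 * rng.randn(30, 2),
])

model = OPTICS(min_samples=5).fit(X_toy)  # max_eps is left at its default, np.inf

order = model.ordering_       # sample indices in the order OPTICS processed them
reach = model.reachability_   # reachability distance per sample, indexed by sample

print(order[:5])
print(reach[order][:5])       # reach[order] is what a reachability plot displays
```

The first processed sample has no predecessor, so its reachability is `inf`; every other value is finite here because `max_eps` is unbounded.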


```python
from sklearn.cluster import OPTICS, cluster_optics_dbscan
import matplotlib.gridspec as gridspec
import matplotlib.pyplot as plt
import numpy as np
```

#### Generate sample data

```python
np.random.seed(0)
n_points_per_cluster = 250

# Six Gaussian blobs with different centres and spreads (i.e. different densities)
C1 = [-5, -2] + 0.8 * np.random.randn(n_points_per_cluster, 2)
C2 = [4, -1] + 0.1 * np.random.randn(n_points_per_cluster, 2)
C3 = [1, -2] + 0.2 * np.random.randn(n_points_per_cluster, 2)
C4 = [-2, 3] + 0.3 * np.random.randn(n_points_per_cluster, 2)
C5 = [3, -2] + 1.6 * np.random.randn(n_points_per_cluster, 2)
C6 = [5, 6] + 2 * np.random.randn(n_points_per_cluster, 2)
X = np.vstack((C1, C2, C3, C4, C5, C6))

print('Shape of X :', X.shape)
plt.scatter(X[:, 0], X[:, 1]);
```

Shape of X : (1500, 2)


<img src='./plots/data-for-clustering.png'>

#### OPTICS
The default cluster extraction with OPTICS looks at the steep slopes within the graph to find clusters, and the user can define what counts as a steep slope using the parameter `xi`. 

* `cluster_method`
    * The extraction method used to extract clusters from the calculated reachability and ordering.
    * Possible values are `'xi'` and `'dbscan'`.
    * `default='xi'`
* `xi`
    * Determines the minimum steepness on the reachability plot that constitutes a cluster boundary.
    * float between 0 and 1
    * `default=0.05`
    * For example, an upwards point in the reachability plot is defined by the ratio from one point to its successor being at most `1-xi`.
    * Used only when `cluster_method='xi'`.
* `min_cluster_size`
    * Minimum number of samples in an OPTICS cluster.
    * Expressed as an absolute number or a fraction of the number of samples (rounded to be at least 2).
    * int > 1 or float between 0 and 1
    * `default=None`
    * If `None`, the value of `min_samples` is used instead.
    * Used only when `cluster_method='xi'`.
* `min_samples`
    * The number of samples in a neighborhood for a point to be considered as a core point.
    * Also, up and down steep regions can't have more than `min_samples` consecutive non-steep points.
    * Expressed as an absolute number or a fraction of the number of samples (rounded to be at least 2).
    * int > 1 or float between 0 and 1
    * `default=5`
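The steepness rule for `xi` can be illustrated on a hand-made reachability sequence. This is only a sketch of the ratio test described above, not scikit-learn's internal code; the values in `reach_toy` are invented, and the downward test is the mirror-image condition assumed here for symmetry.

```python
import numpy as np

xi = 0.05
# Hypothetical reachability values in cluster order: a valley, a peak, another valley
reach_toy = np.array([1.00, 0.50, 0.49, 0.50, 1.20, 0.60, 0.59])

ratio = reach_toy[:-1] / reach_toy[1:]   # each point divided by its successor

# Upward: the point is at most (1 - xi) times its successor
steep_up = ratio <= 1 - xi
# Downward (mirror condition): the successor is at most (1 - xi) times the point
steep_down = ratio >= 1 / (1 - xi)

print(np.flatnonzero(steep_up))    # [3]    the jump from 0.50 up to 1.20
print(np.flatnonzero(steep_down))  # [0 4]  the drops 1.00 -> 0.50 and 1.20 -> 0.60
```

Small wiggles such as 0.50 to 0.49 change by less than 5% and are therefore not treated as cluster boundaries; a larger `xi` would ignore even bigger changes.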

```python
optics = OPTICS(cluster_method='xi', min_cluster_size=0.05, xi=0.05, max_eps=np.inf, min_samples=50)
```


```python
optics.fit(X)
```

#### `sklearn.cluster.cluster_optics_dbscan`

Perform DBSCAN extraction for an arbitrary epsilon.

Extracting the clusters runs in linear time. 
* Returns: the estimated labels, a `labels_` array of shape `(n_samples,)`.
* Note that the resulting `labels_` are close to those of a DBSCAN run with similar settings and `eps` only if `eps` is close to `max_eps`.
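To make the relationship with DBSCAN concrete, the following sketch extracts DBSCAN-style labels from a fitted OPTICS model and compares them with a direct `DBSCAN` run. The two-grid data set `X_toy` is made up for illustration and is deliberately clean and well separated; on real data, border points may be assigned differently by the two approaches.

```python
import numpy as np
from sklearn.cluster import DBSCAN, OPTICS, cluster_optics_dbscan
from sklearn.metrics import adjusted_rand_score

# Toy data (illustrative): two 5x5 grids with 0.1 spacing, 10 units apart
grid = np.array([[0.1 * i, 0.1 * j] for i in range(5) for j in range(5)])
X_toy = np.vstack([grid, grid + 10.0])

model = OPTICS(min_samples=5, max_eps=np.inf).fit(X_toy)

eps = 0.25
labels_optics = cluster_optics_dbscan(
    reachability=model.reachability_,
    core_distances=model.core_distances_,
    ordering=model.ordering_,
    eps=eps,
)
labels_dbscan = DBSCAN(eps=eps, min_samples=5).fit_predict(X_toy)

# Cluster ids may be numbered differently, so compare the partitions instead
ari = adjusted_rand_score(labels_dbscan, labels_optics)
print(ari)
```

Because the OPTICS model is fitted once, the extraction step can be repeated for other `eps` values without recomputing any distances.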



```python
labels_050 = cluster_optics_dbscan(
    reachability=optics.reachability_,
    core_distances=optics.core_distances_,
    ordering=optics.ordering_,
    eps=0.5)

labels_200 = cluster_optics_dbscan(
    reachability=optics.reachability_,
    core_distances=optics.core_distances_,
    ordering=optics.ordering_,
    eps=2)
```

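```python
space = np.arange(len(X))                              # x-axis: position in the cluster ordering
reachability = optics.reachability_[optics.ordering_]  # reachability in processing order
labels = optics.labels_[optics.ordering_]              # xi-extracted labels in the same order
```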

```python
plt.figure(figsize=(10,7))
G = gridspec.GridSpec(nrows=2, ncols=3)
ax1 = plt.subplot(G[0, :])
ax2 = plt.subplot(G[1, 0])
ax3 = plt.subplot(G[1, 1])
ax4 = plt.subplot(G[1, 2])

# Reachability plot (only the first five clusters get a colour)
colors = ["g", "r", "b", "y", "c"]
for cls, color in zip(range(5), colors):
    x = space[cls==labels]
    y = reachability[cls==labels]
    ax1.plot(x, y, linestyle='', marker='o',  c=color, alpha=0.3)

# reachability of noise
x_noise = space[labels==-1]
y_noise = reachability[labels==-1]
ax1.plot(x_noise, y_noise, c='k', linestyle='', marker='.',  alpha=0.3)

# horizontal cuts at eps = 2.0 and eps = 0.5, the values used for the DBSCAN-style extractions
ax1.plot(space, np.full_like(space, fill_value=2.0, dtype=np.float32), c='k', linestyle='--', alpha=0.5)
ax1.plot(space, np.full_like(space, fill_value=0.5, dtype=np.float32), c='k', linestyle='--', alpha=0.5)

ax1.set(title='Reachability plot', ylabel='reachability (epsilon distance)', xlabel='cluster ordering')

# CLUSTERING -- OPTICS (xi method)
for cls, color in zip(range(5), colors):
    x = X[optics.labels_==cls]
    ax2.plot(x[:, 0], x[:, 1], c=color, linestyle='', marker='.', alpha=0.3)
# plot noise
x_noise = X[optics.labels_ == -1]
ax2.plot(x_noise[:, 0], x_noise[:, 1], c='k', linestyle='', marker='+', alpha=0.2)
ax2.set(title='Automatic Clustering OPTICS')

# CLUSTERING -- DBSCAN eps = 0.5
colors = ["g", "r", "b",  "c"]
for cls, color in zip(range(4), colors):
    x = X[labels_050==cls]
    ax3.plot(x[:, 0], x[:, 1], c=color, linestyle='', marker='.', alpha=0.3)
# plot noise
x_noise = X[labels_050 == -1]
ax3.plot(x_noise[:, 0], x_noise[:, 1], c='k', linestyle='', marker='+', alpha=0.2)
ax3.set(title='Clustering at 0.5 epsilon cut DBSCAN')

# CLUSTERING -- DBSCAN eps = 2.0
colors = ["g", "m", "y",  "c"]
for cls, color in zip(range(4), colors):
    x = X[labels_200==cls]
    ax4.plot(x[:, 0], x[:, 1], c=color, linestyle='', marker='.', alpha=0.3)
# plot noise
x_noise = X[labels_200 == -1]
ax4.plot(x_noise[:, 0], x_noise[:, 1], c='k', linestyle='', marker='+', alpha=0.2)
ax4.set(title='Clustering at 2.0 epsilon cut DBSCAN')

plt.tight_layout()
```

<img src='./plots/OPTICS-reachability-and-clustering.png'>