## Types of Clustering
There are many types of clustering algorithms. Four of the best-known families are:

* Connectivity-based Clustering
* Centroid-based Clustering
* Distribution-based Clustering
* Density-based Clustering

### Clustering Principles:
All clustering algorithms try to group data points based on similarities between the data. What does this actually mean?


This is often described in terms of **`inter-cluster heterogeneity`** and **`intra-cluster homogeneity`**.

* `Inter-cluster heterogeneity`<br> The clusters should be as different from one another as possible: the characteristics of one cluster differ clearly from those of another. Well-separated clusters are more stable and reliable.
* `Intra-cluster homogeneity`<br> The data points within a cluster should have similar characteristics. The more similar they are, the more cohesive, and hence more stable, the cluster is.

* **Hence the objective of clustering is to maximise the inter-cluster distance (high inter-cluster heterogeneity) and minimise the intra-cluster distance (high intra-cluster homogeneity).**
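
The two distances can be made concrete with a small sketch. The toy points and the helper names `mean_distance_to_centroid` and `centroid_distance` are assumptions made for illustration only; centroid-based distances are just one of several reasonable ways to measure cohesion and separation.

```python
import numpy as np

# Two tight, well-separated toy clusters (values chosen only for illustration).
cluster_a = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
cluster_b = np.array([[10.0, 10.0], [10.0, 11.0], [11.0, 10.0], [11.0, 11.0]])


def mean_distance_to_centroid(points):
    """Intra-cluster distance: average distance of the points to their own centroid."""
    centroid = points.mean(axis=0)
    return np.linalg.norm(points - centroid, axis=1).mean()


def centroid_distance(points_1, points_2):
    """Inter-cluster distance: distance between the two cluster centroids."""
    return np.linalg.norm(points_1.mean(axis=0) - points_2.mean(axis=0))


intra_a = mean_distance_to_centroid(cluster_a)
intra_b = mean_distance_to_centroid(cluster_b)
inter_ab = centroid_distance(cluster_a, cluster_b)

print(f"intra-cluster distance A: {intra_a:.3f}")
print(f"intra-cluster distance B: {intra_b:.3f}")
print(f"inter-cluster distance A-B: {inter_ab:.3f}")
```

A good clustering of this toy data keeps the intra-cluster values small relative to the inter-cluster value.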



## Density-Based Spatial Clustering of Applications with Noise (DBSCAN)

Most traditional clustering algorithms, such as centroid-based K-means and connectivity-based hierarchical clustering, can be used to group data in an unsupervised way. However, when applied to data with arbitrarily shaped clusters, or clusters within clusters, these traditional methods might not achieve good results.

For example, K-means can cause problems in anomaly detection, because it assigns each anomaly to the same cluster as normal data. The anomalies pull the cluster centroid towards them, making it harder to distinguish anomalies from normal data.
* The K-means algorithm has no notion of outliers
    * K-means assigns every point to a cluster, even points that do not belong to any
* Density-based clustering locates regions of high density and separates the outliers.
    * Density in this context is the number of points within a specified radius
* DBSCAN can find dense clusters and separate the noise
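
The contrast above can be sketched on synthetic data. The dataset, the single-cluster K-means setting and the DBSCAN parameters (`eps=0.5`, `min_samples=5`) are assumptions chosen to keep the example small, not recommended values.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

# Assumed toy data: 100 "normal" points around the origin plus one far-away anomaly.
rng = np.random.RandomState(0)
normal = rng.normal(loc=0.0, scale=0.3, size=(100, 2))
anomaly = np.array([[8.0, 8.0]])
X_demo = np.vstack([normal, anomaly])

# K-means must place every point in a cluster, so the anomaly is absorbed
# and drags the centroid away from the mean of the normal data.
kmeans = KMeans(n_clusters=1, n_init=10, random_state=0).fit(X_demo)
print("K-means label of the anomaly:", kmeans.labels_[-1])
print("Mean of normal data:", normal.mean(axis=0))
print("K-means centroid:   ", kmeans.cluster_centers_[0])

# DBSCAN marks points in low-density regions as noise (label -1).
dbscan_demo = DBSCAN(eps=0.5, min_samples=5).fit(X_demo)
print("DBSCAN label of the anomaly:", dbscan_demo.labels_[-1])
```

In this sketch the anomaly shifts the K-means centroid, while DBSCAN leaves it outside every cluster.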

**Idea** : *If a particular point belongs to a cluster, it should be near to lots of other points in that cluster*

DBSCAN works based on two important parameters
* Radius of neighbourhood (R)<br>
The radius `"R"` defines a neighbourhood around a point; if that neighbourhood contains enough points, we call it a dense area
* Minimum number of neighbours (M) <br>
`"M"` defines the minimum number of points we want in a neighbourhood for it to count as dense enough to form a cluster
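
As a rough illustration of how `R` and `M` interact, the sketch below counts the points inside each point's neighbourhood. The toy points and the values `R = 0.5` and `M = 3` are assumptions, and the count includes the point itself, which is the convention scikit-learn uses for `min_samples`.

```python
import numpy as np

# Assumed toy data: a tightly packed group and one isolated point.
points = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],  # tightly packed group
    [5.0, 5.0],                                       # isolated point
])
R = 0.5  # radius of the neighbourhood
M = 3    # minimum number of points for a neighbourhood to be dense

# Pairwise Euclidean distances between all points.
diffs = points[:, None, :] - points[None, :, :]
distances = np.linalg.norm(diffs, axis=2)

# Density of each point: number of points within distance R (the point itself included).
neighbour_counts = (distances <= R).sum(axis=1)
is_dense = neighbour_counts >= M

print("neighbour counts:", neighbour_counts)
print("dense area?     :", is_dense)
```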

**Each point in our dataset can be one of the following**
* **Core point**<br>
A data point is a core point if it has at least `"M"` data points in its neighbourhood
* **Border point**<br>
A data point is a border point if it has fewer than `"M"` data points in its neighbourhood<br> but is reachable from at least one core point, i.e. it lies within the neighbourhood of a core point
* **Outlier point**<br>
A data point is an outlier point if it has fewer than `"M"` data points in its neighbourhood<br> and is **not** reachable from any core point

**A cluster is formed by connecting all core points that lie in each other's neighbourhoods, along with the border points that are reachable from those core points**
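
The point types and the cluster-forming rule can be sketched from scratch as follows. This is a minimal illustration, not scikit-learn's implementation: the function name `simple_dbscan`, the toy points and the parameter values are all assumptions, and the brute-force distance matrix only suits small datasets.

```python
from collections import deque

import numpy as np


def simple_dbscan(points, radius, min_points):
    """Return (labels, kinds) for each point.

    labels: cluster id per point, -1 for outliers.
    kinds:  'core', 'border' or 'outlier' per point.
    Neighbourhood counts include the point itself.
    """
    n = len(points)
    distances = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    neighbours = [np.flatnonzero(distances[i] <= radius) for i in range(n)]
    is_core = np.array([len(nb) >= min_points for nb in neighbours])

    labels = np.full(n, -1)
    cluster_id = 0
    for start in range(n):
        # A new cluster can only be started from an unassigned core point.
        if not is_core[start] or labels[start] != -1:
            continue
        labels[start] = cluster_id
        queue = deque([start])
        while queue:
            current = queue.popleft()
            for nb in neighbours[current]:
                if labels[nb] == -1:
                    labels[nb] = cluster_id
                    # Only core points keep expanding the cluster;
                    # border points join it but are not expanded.
                    if is_core[nb]:
                        queue.append(nb)
        cluster_id += 1

    kinds = np.where(is_core, "core", np.where(labels >= 0, "border", "outlier"))
    return labels, kinds


# Assumed toy data: a chain of points, one point at the end of the chain, one far away.
toy = np.array([[0.0, 0.0], [0.2, 0.0], [0.4, 0.0], [0.6, 0.0], [1.0, 0.0], [5.0, 5.0]])
labels, kinds = simple_dbscan(toy, radius=0.45, min_points=3)
for point, label, kind in zip(toy, labels, kinds):
    print(point, label, kind)
```

Here the first four points are core points, the point at `[1.0, 0.0]` is a border point attached to the same cluster, and `[5.0, 5.0]` is an outlier.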

Advantages of DBSCAN
* Robust to outliers
* Does not require the number of clusters to be specified in advance

In [13]:
import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN

#### Data generation

In [73]:
centers = [[1, 1], [-1, -1], [1, -1]]
X, labels_true = make_blobs(
    n_samples=750, centers=centers, cluster_std=0.4, random_state=0
)

X = StandardScaler().fit_transform(X)

plt.scatter(X[:, 0], X[:, 1]);

<img src='./plots/data-3-blobs.png'>

#### DBSCAN
* `eps` <br>
    The maximum distance between two samples for one to be considered as in the neighborhood of the other. This plays the role of the radius `"R"` described above.
* `min_samples` <br>
    The number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself. This plays the role of `"M"` described above.
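
To get a feel for how sensitive the result is to `eps`, one option is to refit DBSCAN for a few values and compare the cluster and noise counts. The sketch below regenerates the same blobs as above so that it runs on its own; the helper name `summarise` and the candidate values `0.1`, `0.3` and `1.0` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Same synthetic data as in the data generation step.
centers = [[1, 1], [-1, -1], [1, -1]]
X_sweep, _ = make_blobs(n_samples=750, centers=centers, cluster_std=0.4, random_state=0)
X_sweep = StandardScaler().fit_transform(X_sweep)


def summarise(eps, min_samples=10):
    """Fit DBSCAN and return (number of clusters, number of noise points)."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit(X_sweep).labels_
    n_clusters = len(set(labels) - {-1})  # -1 is the noise label, not a cluster
    n_noise = int(np.sum(labels == -1))
    return n_clusters, n_noise


results = {eps: summarise(eps) for eps in (0.1, 0.3, 1.0)}
for eps, (n_clusters, n_noise) in results.items():
    print(f"eps={eps}: {n_clusters} clusters, {n_noise} noise points")
```

With `min_samples` fixed, a larger `eps` can only enlarge each neighbourhood, so the number of noise points never increases as `eps` grows; a very large `eps` tends to merge separate groups into one cluster.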

In [15]:
# Minkowski distance with p=2 is the Euclidean distance
dbscan = DBSCAN(eps=0.3, min_samples=10, metric='minkowski', p=2)
dbscan.fit(X)

# DBSCAN labels noise points as -1; every other label is a cluster id
noise = dbscan.labels_ == -1
clusters_found = len(np.unique(dbscan.labels_[~noise]))
print('Number of clusters found :', clusters_found, 'Number of noise found :', len(dbscan.labels_[noise]))

Number of clusters found : 3 Number of noise found : 18


#### Visualize clusters
* The color indicates cluster membership, with large circles indicating core samples found by the algorithm.
* Smaller circles are non-core samples that are still part of a cluster.
* The outliers are indicated by black points.

In [72]:
# One color per cluster label (three clusters were found above)
colors = np.array(['salmon', 'lightblue', 'seagreen'])
core_samples = X[dbscan.core_sample_indices_]

# All clustered points (core and non-core) as small circles
plt.scatter(X[~noise, 0], X[~noise, 1], color=colors[dbscan.labels_[~noise]], edgecolors='k')
# Core samples drawn again on top as larger circles
plt.scatter(
    core_samples[:, 0], core_samples[:, 1],
    color=colors[dbscan.labels_[dbscan.core_sample_indices_]],
    edgecolors='k', s=90)
# Noise points in black
plt.scatter(X[noise, 0], X[noise, 1], c='k')
plt.title(f'Number of clusters found: {clusters_found} & number of noise points found: {len(dbscan.labels_[noise])}')

<img src='./plots/dbscan-result.png'>