Clustering and Preprocessing data.

Let's train an unsupervised model which will cluster a make_moons dataset.

Let's start with DBSCAN.

In [None]:
import sklearn
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import SpectralClustering
from sklearn.mixture import BayesianGaussianMixture

In [None]:
x, y = make_moons(n_samples=1000, noise=0.05, random_state=42)
dbscan=DBSCAN(eps=0.05, min_samples=5)
dbscan.fit(x)
dbscan.labels_[:10]

For each instance, DBSCAN counts how many instances lie within a small distance eps of it (its eps-neighborhood).

If an instance has at least min_samples instances in its eps-neighborhood (including itself), it is considered a core instance, meaning it sits in a dense region.

In this case, an instance needs at least 5 instances within a distance of 0.05 to be a core instance.

All instances in the neighborhood of a core instance belong to the same cluster.
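
To make the rule concrete, here is a minimal sketch that recomputes by hand which instances sit in a dense region (usually called core instances) and compares the result with what scikit-learn reports. The names `pairwise_dist`, `neighbor_counts`, `core_mask` and `dbscan_check` are illustrative, and the brute-force distance matrix is only practical for a small dataset like this one.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=1000, noise=0.05, random_state=42)
eps, min_samples = 0.05, 5

# Brute-force Euclidean distance between every pair of instances
diff = X[:, np.newaxis, :] - X[np.newaxis, :, :]
pairwise_dist = np.sqrt((diff ** 2).sum(axis=-1))

# Size of each eps-neighborhood (the instance itself is included)
neighbor_counts = (pairwise_dist <= eps).sum(axis=1)

# An instance is in a dense region if its neighborhood holds at least min_samples instances
core_mask = neighbor_counts >= min_samples
manual_core_indices = np.flatnonzero(core_mask)

# Compare with scikit-learn's own bookkeeping
dbscan_check = DBSCAN(eps=eps, min_samples=min_samples).fit(X)
print(len(manual_core_indices), len(dbscan_check.core_sample_indices_))
```

Both counts should agree, since the sketch applies the same neighborhood rule that DBSCAN uses.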

In [None]:
len(dbscan.core_sample_indices_)

The model identifies 808 of the 1000 instances as core instances.

The remaining 192 instances are either non-core instances on the edge of a cluster or anomalies (label -1).

That isn't too great.
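
Note that `core_sample_indices_` only lists the instances in dense regions. An instance can also belong to a cluster without being dense itself, and only the instances labeled -1 are anomalies. A small sketch of how the three groups could be counted (the variable names are illustrative):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=1000, noise=0.05, random_state=42)
model = DBSCAN(eps=0.05, min_samples=5).fit(X)

# Instances that sit in a dense region
n_core = len(model.core_sample_indices_)
# Instances that were not assigned to any cluster
n_anomalies = int((model.labels_ == -1).sum())
# Instances assigned to a cluster without being in a dense region themselves
n_border = len(X) - n_core - n_anomalies

print(f"core={n_core}, border={n_border}, anomalies={n_anomalies}")
```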

In [None]:
dbscan.core_sample_indices_[:10]

In [None]:
np.unique(dbscan.labels_)

With these hyperparameters, DBSCAN identifies 7 different clusters (the label -1 marks the anomalies).

We want something more like 2 clusters.

In [None]:
def plot_dbscan(dbscan, X, size, show_xlabels=True, show_ylabels=True):
    core_mask = np.zeros_like(dbscan.labels_, dtype=bool)
    core_mask[dbscan.core_sample_indices_] = True
    anomalies_mask = dbscan.labels_ == -1
    non_core_mask = ~(core_mask | anomalies_mask)

    cores = dbscan.components_
    anomalies = X[anomalies_mask]
    non_cores = X[non_core_mask]
    
    plt.scatter(cores[:, 0], cores[:, 1],
                c=dbscan.labels_[core_mask], marker='o', s=size, cmap="Paired")
    plt.scatter(cores[:, 0], cores[:, 1], marker='*', s=20, c=dbscan.labels_[core_mask])
    plt.scatter(anomalies[:, 0], anomalies[:, 1],
                c="r", marker="x", s=100)
    plt.scatter(non_cores[:, 0], non_cores[:, 1], c=dbscan.labels_[non_core_mask], marker=".")
    if show_xlabels:
        plt.xlabel("$x_1$", fontsize=14)
    else:
        plt.tick_params(labelbottom=False)
    if show_ylabels:
        plt.ylabel("$x_2$", fontsize=14, rotation=0)
    else:
        plt.tick_params(labelleft=False)
    plt.title("eps={:.2f}, min_samples={}".format(dbscan.eps, dbscan.min_samples), fontsize=14)
    

In [None]:
plt.figure(figsize=(9, 3.2))

plt.subplot(121)
plot_dbscan(dbscan, x, size=100)

plt.show()

This is what our clusters look like.

Let's improve it!

In [None]:
dbscan2 = DBSCAN(eps=0.2)
dbscan2.fit(x)

plt.figure(figsize=(9, 3.2))
plt.subplot(122)
plot_dbscan(dbscan2, x, size=600, show_ylabels=False)
plt.show()

Increasing eps from 0.05 to 0.20 gives a much better result: the plot now shows two clusters, one per moon.
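
To see the effect of eps without plotting, we can count clusters and anomalies for a few values. This is only a rough sketch: the intermediate value 0.1 and the `summary` dictionary are additions for illustration, not part of the analysis above.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=1000, noise=0.05, random_state=42)

summary = {}
for eps in (0.05, 0.1, 0.2):
    labels = DBSCAN(eps=eps, min_samples=5).fit(X).labels_
    # Label -1 marks anomalies, so it does not count as a cluster
    n_clusters = len(np.unique(labels[labels != -1]))
    n_anomalies = int((labels == -1).sum())
    summary[eps] = (n_clusters, n_anomalies)
    print(f"eps={eps:.2f}: {n_clusters} clusters, {n_anomalies} anomalies")
```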

In [None]:
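knn = KNeighborsClassifier(n_neighbors=50)
# Train on the core instances of the better model (dbscan2, eps=0.2)
knn.fit(dbscan2.components_, dbscan2.labels_[dbscan2.core_sample_indices_])
new_x = np.array([[-0.5, 0], [0, 0.5], [1, -0.1], [2, 1]])
knn.predict(new_x)

What if we need to assign new data to our clusters?

Unfortunately, DBSCAN has no predict() method, so it cannot tell us which cluster a new instance belongs to.

Instead, we train a KNeighborsClassifier on the core instances and use it to classify the new instances, as in the cell above.

In [None]:
knn.predict_proba(new_x)

By default, KNN does not flag anomalies: every new instance is assigned to a cluster, however far away it is.

We can introduce a maximum distance and treat instances that are farther than that from every core instance as anomalies.

In [None]:
# Distance to, and index of, the nearest core instance for each new instance
y_dist, y_pred_idx = knn.kneighbors(new_x, n_neighbors=1)
y_pred = dbscan2.labels_[dbscan2.core_sample_indices_][y_pred_idx]
# Anything farther than 0.2 from its nearest core instance is an anomaly
y_pred[y_dist > 0.2] = -1
y_pred.ravel()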

Now new instances can be assigned to clusters with KNN, and instances that are too far from every cluster are flagged as anomalies.

Spectral clustering is another algorithm that can capture complex cluster structures. It takes a similarity matrix between the instances, reduces its dimensionality to create a low-dimensional embedding, and then applies another clustering algorithm in that low-dimensional space.

It does not scale well to a large number of instances, and it does not work well when the clusters have very different sizes.
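
The steps just described can be sketched by hand. This is a simplified illustration, assuming an RBF similarity, the symmetric normalized graph Laplacian for the dimensionality reduction, and K-Means as the final clustering algorithm. scikit-learn's SpectralClustering may differ in its details, and the smaller sample size and the names `W`, `laplacian`, `embedding` and `manual_labels` are choices made for this example only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel

X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

# 1. Similarity matrix between all pairs of instances
W = rbf_kernel(X, gamma=100)

# 2. Symmetric normalized graph Laplacian: I - D^(-1/2) W D^(-1/2)
inv_sqrt_deg = 1.0 / np.sqrt(W.sum(axis=1))
laplacian = np.eye(len(X)) - inv_sqrt_deg[:, None] * W * inv_sqrt_deg[None, :]

# 3. Low-dimensional embedding: eigenvectors of the 2 smallest eigenvalues
eigvals, eigvecs = np.linalg.eigh(laplacian)  # eigenvalues in ascending order
embedding = eigvecs[:, :2]
# Scale every row to unit length before clustering
embedding = embedding / np.linalg.norm(embedding, axis=1, keepdims=True)

# 4. Run another clustering algorithm in the embedded space
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
manual_labels = kmeans.fit_predict(embedding)
print(np.bincount(manual_labels))
```

The full eigendecomposition in step 3 is one reason the approach scales poorly: the similarity matrix alone has one entry per pair of instances.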

Let's try it out.

In [None]:
sc = SpectralClustering(n_clusters=2, gamma=1, random_state=42)
sc.fit(x)

We know we want 2 clusters, so we set n_clusters=2 and start with gamma=1.

In [None]:
np.percentile(sc.affinity_matrix_, 95)

In [None]:
def plot_spectral_clustering(sc, X, size, alpha, show_xlabels=True, show_ylabels=True):
    plt.scatter(X[:, 0], X[:, 1], marker='o', s=size, c='gray', alpha=alpha)
    plt.scatter(X[:, 0], X[:, 1], marker='o', s=30, c='w')
    plt.scatter(X[:, 0], X[:, 1], marker='.', s=10, c=sc.labels_, cmap="Paired")
    
    if show_xlabels:
        plt.xlabel("$x_1$", fontsize=14)
    else:
        plt.tick_params(labelbottom=False)
    if show_ylabels:
        plt.ylabel("$x_2$", fontsize=14, rotation=0)
    else:
        plt.tick_params(labelleft=False)
    plt.title("RBF gamma={}".format(sc.gamma), fontsize=14)

In [None]:
plt.figure(figsize=(6, 3))

plt.subplot(121)
plot_spectral_clustering(sc, x, size=500, alpha=0.1)

plt.show()

That did not turn out as intended.

Let's try a larger gamma of 100.
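
Why would gamma matter? With the RBF affinity, the similarity between two instances is exp(-gamma * squared distance). With gamma=1, two instances 0.5 apart still have a similarity of roughly 0.78, so almost everything looks connected. With gamma=100, the similarity drops to nearly zero beyond a few tenths. A quick sketch with three made-up points at distances 0.05, 0.2 and 0.5 from the origin:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

origin = np.array([[0.0, 0.0]])
# Hypothetical points at distances 0.05, 0.2 and 0.5 from the origin
points = np.array([[0.05, 0.0], [0.2, 0.0], [0.5, 0.0]])

affinities = {}
for gamma in (1, 100):
    # RBF affinity: exp(-gamma * squared Euclidean distance)
    affinities[gamma] = rbf_kernel(origin, points, gamma=gamma).ravel()
    print(f"gamma={gamma}: {affinities[gamma]}")
```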

In [None]:
sc2 = SpectralClustering(n_clusters=2, gamma=100, random_state=42)
sc2.fit(x)

plt.figure(figsize=(6, 3))
plt.subplot(122)
plot_spectral_clustering(sc2, x, size=4000, alpha=0.01, show_ylabels=False)

plt.show()

That looks pretty good.

Another algorithm we can try is a Gaussian mixture model (GMM).

A GMM assumes that the instances were generated from a mixture of several Gaussian distributions. All the instances generated from a single Gaussian form a cluster that typically looks like an ellipsoid.

There are several GMM variants. Let's try a popular one, the Bayesian Gaussian mixture, which can give weights equal or close to zero to unnecessary clusters and so eliminate them automatically.

In [None]:
x_moons, y_moons = make_moons(n_samples=1000, noise=0.05, random_state=42)
bgm = BayesianGaussianMixture(n_components=5, n_init=10, random_state=42)
bgm.fit(x_moons)
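
One way to check how many components the model actually kept is to look at its `weights_` attribute. The sketch below refits the same model so that it runs on its own. The 0.01 cutoff for calling a component "active" is an arbitrary assumption for illustration, not a scikit-learn setting.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.mixture import BayesianGaussianMixture

X, _ = make_moons(n_samples=1000, noise=0.05, random_state=42)
bgm_check = BayesianGaussianMixture(n_components=5, n_init=10, random_state=42)
bgm_check.fit(X)

# One weight per component; weights near zero mark components the model dropped
weights = np.round(bgm_check.weights_, 2)
# Assumed cutoff: treat components with weight above 0.01 as still in use
n_active = int((bgm_check.weights_ > 0.01).sum())
print(weights, n_active)
```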

In [None]:
def plot_data(x):
    plt.plot(x[:, 0], x[:, 1], 'k.', markersize=2)

def plot_gaussian_mixture(clusterer, x, resolution=1000, show_ylabels=True):
    mins = x.min(axis=0) - 0.1
    maxs = x.max(axis=0) + 0.1
    xx, yy = np.meshgrid(np.linspace(mins[0], maxs[0], resolution),
                         np.linspace(mins[1], maxs[1], resolution))
    Z = -clusterer.score_samples(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    plt.contourf(xx, yy, Z,
                 norm=LogNorm(vmin=1.0, vmax=30.0),
                 levels=np.logspace(0, 2, 12))
    plt.contour(xx, yy, Z,
                norm=LogNorm(vmin=1.0, vmax=30.0),
                levels=np.logspace(0, 2, 12),
                linewidths=1, colors='k')

    Z = clusterer.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.contour(xx, yy, Z,
                linewidths=2, colors='r', linestyles='dashed')
    
    plt.plot(x[:, 0], x[:, 1], 'k.', markersize=2)

    plt.xlabel("$x_1$", fontsize=14)
    if show_ylabels:
        plt.ylabel("$x_2$", fontsize=14, rotation=0)
    else:
        plt.tick_params(labelleft=False)
    
plt.figure(figsize=(9, 3.2))

plt.subplot(121)
plot_data(x_moons)

plt.subplot(122)
plot_gaussian_mixture(bgm, x_moons, show_ylabels=False)

plt.show()

The Bayesian GMM does not work well on this dataset.

We set n_components to 5, hoping it would automatically narrow this down to the 2 moon-shaped clusters, but it keeps looking for ellipsoids, which the moons are not.

It works much better on datasets whose clusters are roughly ellipsoidal.

The density plot, however, looks pretty good.
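
Since the density estimate looks reasonable, one possible use for it is flagging instances that lie in low-density regions. The sketch below is only an illustration of that idea: the 4% cutoff is an assumed value chosen for the example, and `density_model` and `low_density` are illustrative names.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.mixture import BayesianGaussianMixture

X, _ = make_moons(n_samples=1000, noise=0.05, random_state=42)
density_model = BayesianGaussianMixture(n_components=5, n_init=10, random_state=42)
density_model.fit(X)

# Log of the estimated density at each instance
log_density = density_model.score_samples(X)

# Assumed cutoff: flag the 4% of instances with the lowest estimated density
threshold = np.percentile(log_density, 4)
low_density = X[log_density < threshold]
print(len(low_density))
```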