# Spectral Clustering
*Curtis Miller*

Here I demonstrate spectral clustering.

Both spectral and hierarchical clustering depend on computing how "similar" datapoints in a dataset are (using some measure of similarity). Both methods then try to group "similar" datapoints into common clusters.

Spectral clustering builds a similarity matrix from the datapoints' pairwise similarity measures and uses the eigenvectors of a matrix derived from it (the graph Laplacian) to form clusters. If a random walker were jumping from datapoint to datapoint, with jumps more likely between "similar" datapoints, the clusters would be the collections of datapoints in which the walker spends a long time before leaving.
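
To make the idea concrete, here is a minimal sketch of a two-cluster spectral partition written directly in NumPy. It is an illustration under simplifying assumptions, not what scikit-learn does internally: it uses the unnormalized graph Laplacian and splits on the sign of the second eigenvector, and the toy points and the `gamma` value are made up for the example.

```python
import numpy as np

# Six toy points forming two visually obvious groups (made up for illustration)
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [5.0, 5.0], [5.0, 4.0], [4.0, 5.0]])

# Gaussian (RBF) similarity between every pair of points; gamma is an assumed value
gamma = 0.1
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
W = np.exp(-gamma * sq_dists)
np.fill_diagonal(W, 0.0)    # No self-loops

# Unnormalized graph Laplacian L = D - W, where D holds the row sums of W
L = np.diag(W.sum(axis=1)) - W

# eigh returns eigenvalues in ascending order; the eigenvector belonging to the
# second-smallest eigenvalue (the Fiedler vector) encodes the two-way split
eigvals, eigvecs = np.linalg.eigh(L)
fiedler = eigvecs[:, 1]
toy_labels = (fiedler > 0).astype(int)
print(toy_labels)
```

Which group receives label 0 and which receives label 1 is arbitrary, since an eigenvector is only defined up to sign. For more than two clusters, a common approach is to take several eigenvectors and run k-means on their rows.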

## Clustering the Iris Dataset

I will demonstrate spectral clustering on the iris dataset. I first load that dataset.

In [None]:
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
iris_obj = load_iris()
iris_data = iris_obj.data
species = iris_obj.target
iris_data[:5,:]

In [None]:
plt.scatter(iris_data[:, 0], iris_data[:, 1], c=species, cmap=plt.cm.brg)
plt.xlabel("Sepal Length")
plt.ylabel("Sepal Width")
plt.show()

Next I import the `SpectralClustering` class and then apply the method.

In [None]:
from sklearn.cluster import SpectralClustering

In [None]:
irisclust = SpectralClustering(n_clusters=3,   # Three clusters
                               affinity="rbf")    # "Closeness" is defined using Gaussian kernel
irisclust = irisclust.fit(iris_data)

# Visualizing the clustering
plt.scatter(iris_data[:, 0], iris_data[:, 1], c=irisclust.labels_, cmap=plt.cm.brg)
plt.xlabel("Sepal Length")
plt.ylabel("Sepal Width")
plt.show()

Choosing different affinity schemes yields different results. As with hierarchical clustering, spectral clustering produces nice results but does not allow for "prediction". That is, it doesn't take new, never-before-seen datapoints and assign them to a cluster.
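
As a rough illustration of both points, the sketch below fits the iris data with two affinity schemes and measures how much the resulting partitions agree. The `n_neighbors=10` and `random_state=0` settings are assumed values chosen for the example, and scikit-learn may warn if the nearest-neighbor graph is not fully connected. The adjusted Rand index equals 1 when two partitions are identical, regardless of how the clusters are numbered.

```python
from sklearn.cluster import SpectralClustering
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score

X = load_iris().data

# Same data, two different definitions of "closeness"
rbf_model = SpectralClustering(n_clusters=3, affinity="rbf", random_state=0)
knn_model = SpectralClustering(n_clusters=3, affinity="nearest_neighbors",
                               n_neighbors=10, random_state=0)
rbf_labels = rbf_model.fit_predict(X)
knn_labels = knn_model.fit_predict(X)

# How similar are the two partitions?
ari = adjusted_rand_score(rbf_labels, knn_labels)
print("Agreement (adjusted Rand index):", ari)

# The fitted estimator exposes fit and fit_predict, but no predict for new data
print("Has a predict method:", hasattr(rbf_model, "predict"))
```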

## Clustering Headlines

Let's cluster the headlines dataset, as we did before with hierarchical clustering. While hierarchical clustering requires a distance matrix, spectral clustering requires a similarity matrix.

The code below was explained in a previous video.

In [None]:
import pandas as pd
from nltk import ngrams
import numpy as np

In [None]:
headlines = pd.read_csv("HNHeadlines.txt", header=None, index_col=0).iloc[:, 0]
headline_sets = [set(''.join(u) for u in ngrams(h.lower(), 3)) for h in headlines]
sims = np.zeros((len(headlines), len(headlines)))    # Will contain the affinity matrix
for i in range(len(headlines)):
    for j in range(i, len(headlines)):
        h1, h2 = headline_sets[i], headline_sets[j]
        js = len(h1.intersection(h2))/len(h1.union(h2))    # Compute the Jaccard similarity for the two documents
        sims[i,j] = sims[j,i] = js    # Store the Jaccard similarity in the appropriate entries of the matrix

headlines
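
To make the similarity measure concrete, here is a small pure-Python sketch of the Jaccard similarity between the character trigram sets of two strings. The helper names `char_trigrams` and `jaccard` are made up for this illustration, and slicing stands in for `nltk.ngrams`. The guard for an empty union is an addition: a string shorter than three characters has no trigrams, and the loop above would divide by zero in that case.

```python
def char_trigrams(text):
    """Return the set of 3-character substrings of text (lowercased)."""
    text = text.lower()
    return {text[i:i + 3] for i in range(len(text) - 2)}

def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

s1 = char_trigrams("abcd")    # {"abc", "bcd"}
s2 = char_trigrams("abce")    # {"abc", "bce"}
print(jaccard(s1, s2))        # One shared trigram out of three distinct ones, so 1/3
print(jaccard(s1, s1))        # A set compared with itself gives 1.0
```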

I also plan to assess the quality of the resulting clustering using a silhouette plot.

In [None]:
import matplotlib.cm as cm
import matplotlib.pyplot as plt
from sklearn.metrics import silhouette_score, silhouette_samples
%matplotlib inline

def silhouette_plot(data, labels, metric="euclidean", xticks=True):
    """Creates a silhouette plot given a dataset and the labels corresponding to cluster assignment, and reports the
       average silhouette score"""
    silhouette_avg = silhouette_score(data, labels,
                                      metric=metric)    # The average silhouette score over the entire sample
    sample_silhouette_values = silhouette_samples(data, labels,
                                                  metric=metric)    # The silhouette score of each individual data point
    
    # This loop creates the silhouettes in the silhouette plot
    y_lower = 10    # For space between silhouettes
    for k in np.unique(labels):
        cluster_values = sample_silhouette_values[labels == k]
        cluster_values.sort()
        nk = len(cluster_values)
        y_upper = y_lower + nk
        color = cm.nipy_spectral(float(k) / len(np.unique(labels)))    # One color per cluster
        plt.fill_betweenx(np.arange(y_lower, y_upper),
                          0, cluster_values,
                          facecolor=color, edgecolor=color)
        plt.text(-0.05, y_lower + 0.5 * nk, str(k))
        y_lower = y_upper + 10
    
    plt.axvline(x=silhouette_avg, color="red", linestyle="--")
    if xticks:
        plt.xticks([-0.1, 0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0])
    plt.yticks([])
    plt.xlabel("Silhouette Score")
    plt.ylabel("Cluster")
    plt.show()
    
    print("The average silhouette score is", silhouette_avg)
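
The per-point silhouette value that the function above plots can also be computed by hand, which may help in reading the plot. The sketch below is a plain NumPy version written for illustration; the helper name `silhouette_from_distances` and the toy similarity matrix are made up here. It assumes at least two clusters and scores single-member clusters as 0, a common convention.

```python
import numpy as np

def silhouette_from_distances(dist, labels):
    """Per-point silhouette values from a precomputed distance matrix."""
    dist = np.asarray(dist, dtype=float)
    labels = np.asarray(labels)
    scores = np.zeros(len(labels))
    for i, own in enumerate(labels):
        same = labels == own
        same[i] = False    # Exclude the point itself
        if not same.any():
            continue       # Single-member cluster: leave the score at 0
        a = dist[i, same].mean()    # Mean distance to the point's own cluster
        b = min(dist[i, labels == other].mean()    # Mean distance to the nearest other cluster
                for other in np.unique(labels) if other != own)
        scores[i] = (b - a) / max(a, b)
    return scores

# Toy similarities: items 0 and 1 are alike, as are items 2 and 3
toy_sims = np.array([[1.0, 0.9, 0.1, 0.1],
                     [0.9, 1.0, 0.1, 0.1],
                     [0.1, 0.1, 1.0, 0.8],
                     [0.1, 0.1, 0.8, 1.0]])
toy_dist = 1 - toy_sims    # Turn similarities into distances with a zero diagonal
toy_scores = silhouette_from_distances(toy_dist, [0, 0, 1, 1])
print(toy_scores, toy_scores.mean())
```

Values near 1 mean a point sits much closer to its own cluster than to any other; values near 0 or below suggest the point lies between clusters or may be in the wrong one.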

Now we can perform the clustering.

In [None]:
headlineclust = SpectralClustering(n_clusters=4, affinity="precomputed")
hclusters = headlineclust.fit_predict(sims)
hclusters
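
For readers who want to see `affinity="precomputed"` on something small enough to check by eye, here is a sketch with a made-up block-structured similarity matrix. The matrix values and `random_state=0` are assumptions for the example. With this affinity, the input is read as similarities (larger means more alike), not as distances.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

# Toy similarity matrix: items 0-3 resemble each other, as do items 4-7
toy_sims = np.full((8, 8), 0.1)
toy_sims[:4, :4] = 0.9
toy_sims[4:, 4:] = 0.9
np.fill_diagonal(toy_sims, 1.0)

toy_model = SpectralClustering(n_clusters=2, affinity="precomputed", random_state=0)
toy_clusters = toy_model.fit_predict(toy_sims)
print(toy_clusters)
```

The two blocks should come out as two clusters, though which block receives which label number is arbitrary.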

How well did the algorithm do?

In [None]:
silhouette_plot(1 - sims, hclusters, metric="precomputed", xticks=False)

In [None]:
headlines[hclusters==3]

In [None]:
headlines[hclusters==2]

The clustering is not necessarily bad; at least one cluster seems reasonable. That said, it's not great either, according to the silhouette plot.