# Spectral Clustering

## Objectives

- Explore the effectiveness of Spectral Clustering in identifying complex cluster structures within datasets.
- Compare the performance of Spectral Clustering with traditional k-Means clustering in handling non-linearly separable data.
- Demonstrate the application of Spectral Clustering on various datasets with intricate geometrical shapes.
- Apply Spectral Clustering to text data to uncover thematic clusters from a collection of documents using natural language processing techniques.

## Background

Spectral Clustering identifies clusters with non-convex or intertwined structures, which traditional clustering methods such as k-Means often fail to separate.

## Datasets Used

The notebook uses three synthetic datasets: "Half Moons" and "Concentric Circles," generated with scikit-learn's `make_moons` and `make_circles`, and "Spiral-Shaped Data," generated with a custom function. Their non-convex shapes make them good tests for clustering algorithms. The final example uses a small collection of short text documents.

## Introduction

Spectral Clustering partitions data into clusters using the eigenvectors of a matrix derived from the pairwise similarities between points (the graph Laplacian). It is often used when individual clusters are non-convex or intertwined, which makes it well suited to cluster structures that cannot be separated with linear boundaries.

In [1]:
import numpy as np
import pandas as pd

from sklearn.cluster import KMeans

import plotly.express as px
import ClusterVisualizer as cv

## Half Moons Example

In [2]:
from sklearn.datasets import make_moons

Xm, _ = make_moons(300, noise=.05, random_state=10)

In [3]:
# Saving the data to a pandas DataFrame
df_Xm = pd.DataFrame(Xm, columns=['x', 'y'])

In [4]:
cv_m = cv.ClusterVisualizer(df_Xm)

cv_m.plot_data(title='Half Moons Data')

In [5]:
# Applying k-Means with k=2
ym_k2 = KMeans(n_clusters=2, random_state=10, n_init='auto').fit_predict(Xm)

cv_m.plot_clusters(ym_k2, title='Half Moons: k-Means Clustering (k=2)')

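The results are not good: with two clusters, k-Means separates the points with a straight boundary instead of following the two half moons. Let's see what happens when we increase the number of clusters.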

In [6]:
ym_k3 = KMeans(n_clusters=3, random_state=10, n_init='auto').fit_predict(Xm)

cv_m.plot_clusters(ym_k3, title='Half Moons: k-Means Clustering (k=3)')

In [7]:
ym_k6 = KMeans(n_clusters=6, random_state=10, n_init='auto').fit_predict(Xm)

cv_m.plot_clusters(ym_k6, title='Half Moons: k-Means Clustering (k=6)')

In all three cases, one cluster contains points from both half moons, so k-Means does not recover the two shapes.

When we studied Support Vector Machines, we used a kernel transformation to project the data into a higher dimension where a linear separation is possible.

We could use the same trick to allow k-means to discover non-linear boundaries.

Scikit-Learn implements a version of this kernelized k-means: the SpectralClustering estimator.

It builds a nearest-neighbors graph of the data, computes a new representation of the points (a spectral embedding) from that graph, and then assigns labels by running k-means on the embedded points.
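The steps described above can be sketched by hand. This is a simplified illustration, not scikit-learn's actual implementation: the helper name `spectral_clustering_sketch` is hypothetical, the dense eigendecomposition is only practical for small datasets, and the choice of a symmetrized k-nearest-neighbors graph with a normalized Laplacian is an assumption made for the sketch.

```python
import numpy as np
from scipy.sparse.csgraph import laplacian
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score
from sklearn.neighbors import kneighbors_graph


def spectral_clustering_sketch(X, n_clusters=2, n_neighbors=20, random_state=10):
    """Simplified spectral clustering (hypothetical helper, not scikit-learn's code)."""
    # 1. k-nearest-neighbors graph, symmetrized so that edges are undirected
    knn = kneighbors_graph(X, n_neighbors=n_neighbors, include_self=True)
    affinity = 0.5 * (knn + knn.T)

    # 2. Normalized graph Laplacian (converted to dense; fine for a few hundred points)
    lap = laplacian(affinity, normed=True).toarray()

    # 3. Eigenvectors of the smallest eigenvalues give the spectral embedding
    _, eigenvectors = np.linalg.eigh(lap)  # eigenvalues are returned in ascending order
    embedding = eigenvectors[:, :n_clusters]

    # 4. Ordinary k-means on the embedded points
    km = KMeans(n_clusters=n_clusters, random_state=random_state, n_init=10)
    return km.fit_predict(embedding)


X_demo, y_demo = make_moons(300, noise=.05, random_state=10)
labels_demo = spectral_clustering_sketch(X_demo)
print("ARI vs. true half moons:", adjusted_rand_score(y_demo, labels_demo))
```

An adjusted Rand index close to 1 means the labels match the two half moons up to a renaming of the clusters.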

In [8]:
from sklearn.cluster import SpectralClustering

In [9]:
ym_s2 = SpectralClustering(n_clusters=2, affinity='nearest_neighbors', 
                          n_neighbors=20, assign_labels='kmeans').fit_predict(Xm)

cv_m.plot_clusters(ym_s2, title='Half Moons: Spectral Clustering (k=2)')

With this kernel transform approach, the kernelized k-means can find the more complicated non-linear boundaries between clusters.

In [10]:
ym_s3 = SpectralClustering(n_clusters=3, affinity='nearest_neighbors', 
                              n_neighbors=20, assign_labels='kmeans').fit_predict(Xm)

cv_m.plot_clusters(ym_s3, title='Half Moons: Spectral Clustering (k=3)')

In [11]:
ym_s6 = SpectralClustering(n_clusters=6, affinity='nearest_neighbors', 
                              n_neighbors=20, assign_labels='kmeans').fit_predict(Xm)

cv_m.plot_clusters(ym_s6, title='Half Moons: Spectral Clustering (k=6)')

Even with an incorrect number of clusters, Spectral Clustering gives a better result than k-Means on this dataset.

In these runs, Spectral Clustering never places points from different half moons in the same cluster.
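The visual comparison can also be checked numerically. The sketch below assumes we keep the true half-moon labels that `make_moons` returns (the notebook discards them) and uses the homogeneity score, which equals 1 when every cluster contains points from a single half moon. The helper `compare_homogeneity` and the `random_state` passed to `SpectralClustering` are additions for this illustration.

```python
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.datasets import make_moons
from sklearn.metrics import homogeneity_score

# Same data as above, but keeping the true labels for evaluation
X_eval, y_true = make_moons(300, noise=.05, random_state=10)


def compare_homogeneity(n_clusters):
    """Return (k-Means, Spectral Clustering) homogeneity for a given k."""
    y_km = KMeans(n_clusters=n_clusters, random_state=10, n_init=10).fit_predict(X_eval)
    y_sc = SpectralClustering(n_clusters=n_clusters, affinity='nearest_neighbors',
                              n_neighbors=20, assign_labels='kmeans',
                              random_state=10).fit_predict(X_eval)
    return homogeneity_score(y_true, y_km), homogeneity_score(y_true, y_sc)


for k in (2, 3, 6):
    h_km, h_sc = compare_homogeneity(k)
    print(f"k={k}: k-Means homogeneity={h_km:.3f}, Spectral homogeneity={h_sc:.3f}")
```

Homogeneity ignores how many clusters were requested, so it suits the question asked here: whether any cluster mixes points from both half moons.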

## Concentric Circles Example

In [12]:
from sklearn.datasets import make_circles

Xc, _ = make_circles(n_samples=400, random_state=123, noise=0.1, factor=0.2)

In [13]:
df_c = pd.DataFrame(Xc, columns=['x', 'y'])

cv_c = cv.ClusterVisualizer(df_c)

cv_c.plot_data(title='Concentric Circles Data')

In [14]:
# Applying k-Means with k=2
yc_k2 = KMeans(n_clusters=2, random_state=10, n_init='auto').fit_predict(Xc)

cv_c.plot_clusters(yc_k2, title='Concentric Circles: k-Means Clustering (k=2)')

The results are not good: k-Means does not separate the inner circle from the outer ring. Let's see what happens when we apply Spectral Clustering.

In [15]:
yc_c2 = SpectralClustering(n_clusters=2, affinity='nearest_neighbors',
                           n_neighbors=20, assign_labels='kmeans').fit_predict(Xc)


Graph is not fully connected, spectral embedding may not work as expected.



The warning "Graph is not fully connected, spectral embedding may not work as expected" is raised when the affinity graph is not fully connected: it contains subgraphs with no edges between them, so the spectral embedding may not behave as intended.

This often happens when the `n_neighbors` parameter is set too low for the dataset, or when the data is very sparse or contains distinct clusters that are far apart.
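One way to explore this is to count the connected components of the nearest-neighbors graph for several values of `n_neighbors`. The sketch below is an approximation of what the estimator builds internally: it assumes a symmetrized k-nearest-neighbors connectivity graph, and the helper `count_components` is hypothetical.

```python
from scipy.sparse.csgraph import connected_components
from sklearn.datasets import make_circles
from sklearn.neighbors import kneighbors_graph

X_circ, _ = make_circles(n_samples=400, random_state=123, noise=0.1, factor=0.2)


def count_components(X, n_neighbors):
    """Number of connected components of the symmetrized k-NN graph."""
    knn = kneighbors_graph(X, n_neighbors=n_neighbors, include_self=True)
    symmetric = 0.5 * (knn + knn.T)  # treat every edge as undirected
    n_components, _ = connected_components(symmetric, directed=False)
    return n_components


neighbor_values = (5, 10, 20, 40)
component_counts = [count_components(X_circ, k) for k in neighbor_values]
for k, n_comp in zip(neighbor_values, component_counts):
    print(f"n_neighbors={k}: {n_comp} connected component(s)")
```

A count of 1 means the graph is fully connected. Adding neighbors only adds edges, so the count can stay the same or decrease as `n_neighbors` grows.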

Let's increase the `n_neighbors` parameter.

In [16]:
yc_c3 = SpectralClustering(n_clusters=2, affinity='nearest_neighbors',
                           n_neighbors=40, assign_labels='kmeans').fit_predict(Xc)

cv_c.plot_clusters(yc_c3, title='Concentric Circles: Spectral Clustering (k=2)')

## Spiral-Shaped Data Example

In [17]:
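def create_spiral_data(n_points, noise=0.8):
    '''
    Generate two interleaved spirals with n_points points each.
    Returns an array of shape (2 * n_points, 2).
    '''
    np.random.seed(12345)  # fixed seed for reproducibility
    # Angles in [0, 2*pi]; the square root places more points at larger angles
    n = np.sqrt(np.random.rand(n_points, 1)) * 360 * (2 * np.pi) / 360
    d1x = -np.cos(n) * n + np.random.rand(n_points, 1) * noise
    d1y = np.sin(n) * n + np.random.rand(n_points, 1) * noise
    # The second spiral is the first one reflected through the origin
    return np.vstack((np.hstack((d1x, d1y)), np.hstack((-d1x, -d1y))))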


In [18]:
# Generating the data
Xs = create_spiral_data(300)

In [19]:
df_s = pd.DataFrame(Xs, columns=['x', 'y'])

cv_s = cv.ClusterVisualizer(df_s)

cv_s.plot_data(title='Spiral-Shaped Data')

In [20]:
# Applying k-Means with k=2
ys_k2 = KMeans(n_clusters=2, random_state=0, n_init='auto').fit_predict(Xs)

cv_s.plot_clusters(ys_k2, title='Spiral-Shaped Data: k-Means Clustering (k=2)')   

The k-Means algorithm assigns each point to the nearest cluster center. It works well for spherical or blob-like clusters because it minimizes the variance within each cluster.

When it comes to more complex structures, like spirals, k-Means cannot capture the shape and continuity of the data. 

As seen in the plot, it has divided the spirals into two halves based on the nearest mean, resulting in a mix of both spirals in each cluster.
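The nearest-center rule explains the straight cut. With two clusters, "closest to center 0 or center 1" is equivalent to asking on which side of the perpendicular bisector between the two centers a point lies. The sketch below checks both formulations on spiral data similar to the notebook's. The data is regenerated with a local random generator, so it is a stand-in, not the exact `Xs` array, and all variable names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in spiral data (assumption: same construction idea as create_spiral_data)
rng = np.random.default_rng(0)
theta = np.sqrt(rng.random(300)) * 2 * np.pi
spiral = np.column_stack((-np.cos(theta) * theta, np.sin(theta) * theta))
spiral += rng.random(spiral.shape) * 0.8
X_sp = np.vstack((spiral, -spiral))

km = KMeans(n_clusters=2, random_state=0, n_init=10).fit(X_sp)
c0, c1 = km.cluster_centers_

# Rule 1: each point takes the label of its nearest centroid
dists = np.linalg.norm(X_sp[:, None, :] - km.cluster_centers_[None, :, :], axis=2)
nearest = dists.argmin(axis=1)

# Rule 2: equivalently, a straight line (the perpendicular bisector of the
# segment c0-c1) splits the plane; points on the c1 side get label 1
midpoint = (c0 + c1) / 2
side = ((X_sp - midpoint) @ (c1 - c0) > 0).astype(int)

agree_nearest = np.mean(nearest == km.labels_)
agree_linear = np.mean(side == km.labels_)
print("agreement with nearest-centroid rule:", agree_nearest)
print("agreement with linear boundary:      ", agree_linear)
```

Because the boundary is a straight line, no placement of two centers can follow the curved gap between the spirals.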

In [21]:
ys_s2 = SpectralClustering(n_clusters=2, affinity='nearest_neighbors',
                           n_neighbors=10, assign_labels='kmeans').fit_predict(Xs) 

cv_s.plot_clusters(ys_s2, title='Spiral-Shaped Data: Spectral Clustering (k=2)')

Spectral Clustering constructs a similarity graph and uses the eigenvectors of its graph Laplacian to project the data into a lower-dimensional space that respects the data's inherent connectivity.

It essentially "unrolls" the spirals in this new space, so that k-means can separate them easily.

In [22]:
# Applying k-Means with k=3
ys_k3 = KMeans(n_clusters=3, random_state=0, n_init='auto').fit_predict(Xs)

cv_s.plot_clusters(ys_k3, title='Spiral-Shaped Data: k-Means Clustering (k=3)')

The algorithm tries to minimize variance within each cluster, leading to a poor fit for the naturally spiral-shaped data.

It cuts through the spirals, incorrectly assigning points based on proximity to the nearest centroid.

In [23]:
# Spectral Clustering with k=3
ys_s3 = SpectralClustering(n_clusters=3, affinity='nearest_neighbors',
                           n_neighbors=10, assign_labels='kmeans').fit_predict(Xs)

cv_s.plot_clusters(ys_s3, title='Spiral-Shaped Data: Spectral Clustering (k=3)')

Spectral Clustering effectively uses the data's connectivity, correctly separating the spirals based on their inherent structure, not just distance, leading to a more meaningful clustering that follows the shape of the data.

## Document Analysis Example

Let's analyze a simple example related to document analysis.

In [24]:
documents = [
    "The sky is blue and beautiful",
    "Love this blue and beautiful sky!",
    "The quick brown fox jumps over the lazy dog.",
    "A king's crown is made of gold",
    "The crown of a king is called a diadem",
    "The bright sun in the sky is shining",
    "The sun is bright and beautiful today",
    "The sun in the sky is very bright",
    "We can see the shining sun, the bright sun",
    "The fox jumped over the dog",
    "The fox is quicker than the lazy dog",
    "The dog is lazy but the brown fox is quick!",
    "Gold crowns are worn by kings",
    "The king's diadem is made of gold",
    "The water is clear and the sun is bright",
    "Drink water to stay hydrated",
    "Stay hydrated and drink water",
    "Python is a programming language",
    "Python language is used for machine learning",
    "Machine learning can be done with Python",
    "Artificial Intelligence is a branch of machine learning",
    "Learning Python is fun",
    "Python is easy to learn and powerful",
    "Machine learning is a fascinating field",
    "You need clear water to stay hydrated",
    "Stay healthy by drinking clear water"
]

The `documents` variable contains a list of short text documents that cover several topics:
- Nature and Weather: References to the sky, blue color, sunshine, and the beauty of the sky.
- Animals: Mentions of foxes and dogs, as well as their characteristics.
- Royalty: References to kings, crowns, and diadems.
- Hydration: Discussion about the importance of clear water for staying hydrated and healthy.
- Programming and Technology: References to Python as a programming language and its use in machine learning and artificial intelligence.

In [25]:
# Import the TF-IDF vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

In [26]:
# Convert a collection of raw documents to a matrix of TF-IDF features
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)

The `TfidfVectorizer`, which stands for "Term Frequency-Inverse Document Frequency Vectorizer," is a popular feature extraction technique used in natural language processing (NLP) and text analysis. It is used to convert a collection of text documents into a numerical format suitable for machine learning algorithms. 

Let's create a function for applying spectral clustering and plotting the results.

In [27]:
def spectral_clustering_docs(X, n_clusters=2, n_neighbors=10):
    '''
    Perform spectral clustering on the TF-IDF matrix X and print the
    documents in each cluster (read from the global `documents` list).
    '''
    clustering = SpectralClustering(n_clusters=n_clusters, affinity='nearest_neighbors', 
                                    n_neighbors=n_neighbors, n_jobs=-1)
    labels = clustering.fit_predict(X)
    # Print out the clusters
    for i in range(n_clusters):
        print(f"Cluster {i}:")
        cluster_documents = np.where(labels == i)[0]
        for doc_index in cluster_documents:
            print(f" - {documents[doc_index]}")
    return labels        

In [28]:
# Asking for 3 clusters
labels3 = spectral_clustering_docs(X, n_clusters=3)

Cluster 0:
 - The sky is blue and beautiful
 - Love this blue and beautiful sky!
 - The quick brown fox jumps over the lazy dog.
 - A king's crown is made of gold
 - The crown of a king is called a diadem
 - The bright sun in the sky is shining
 - The sun is bright and beautiful today
 - The sun in the sky is very bright
 - We can see the shining sun, the bright sun
 - The fox jumped over the dog
 - The fox is quicker than the lazy dog
 - The dog is lazy but the brown fox is quick!
 - Gold crowns are worn by kings
 - The king's diadem is made of gold
Cluster 1:
 - Python is a programming language
 - Python language is used for machine learning
 - Machine learning can be done with Python
 - Artificial Intelligence is a branch of machine learning
 - Learning Python is fun
 - Python is easy to learn and powerful
 - Machine learning is a fascinating field
Cluster 2:
 - The water is clear and the sun is bright
 - Drink water to stay hydrated
 - Stay hydrated and drink water
 - You need clea

Here is a description of each cluster:

- **Cluster 0**:
It is related to Nature, Animals, and Royalty.
- **Cluster 1**:
It is focused on Programming, Python, and Machine Learning.
- **Cluster 2**:
It is centered around Hydration and Health.

Notice that the sentence "The water is clear and the sun is bright" was placed in Cluster 2, although it appears to be more related to the Nature theme in Cluster 0.

In [29]:
# Asking for 4 clusters
labels4 = spectral_clustering_docs(X, n_clusters=4)

Cluster 0:
 - The water is clear and the sun is bright
 - Drink water to stay hydrated
 - Stay hydrated and drink water
 - You need clear water to stay hydrated
 - Stay healthy by drinking clear water
Cluster 1:
 - Python is a programming language
 - Python language is used for machine learning
 - Machine learning can be done with Python
 - Artificial Intelligence is a branch of machine learning
 - Learning Python is fun
 - Python is easy to learn and powerful
 - Machine learning is a fascinating field
Cluster 2:
 - The sky is blue and beautiful
 - Love this blue and beautiful sky!
 - The quick brown fox jumps over the lazy dog.
 - The sun is bright and beautiful today
 - The sun in the sky is very bright
 - We can see the shining sun, the bright sun
 - The fox jumped over the dog
 - The fox is quicker than the lazy dog
 - The dog is lazy but the brown fox is quick!
Cluster 3:
 - A king's crown is made of gold
 - The crown of a king is called a diadem
 - The bright sun in the sky is sh

Here is a description of each cluster:

- **Cluster 0**:
It is centered around the theme of hydration and water. 
- **Cluster 1**:
It is related to programming and technology. 
- **Cluster 2**:
It is focused on natural elements and animals. 
- **Cluster 3**:
It is associated with royalty. 

Notice that the sentence "The bright sun in the sky is shining" appears to have been mistakenly placed in Cluster 3. This sentence is more related to Cluster 2, which focuses on natural elements and scenes.

In [30]:
# Asking for 5 clusters
labels5 = spectral_clustering_docs(X, n_clusters=5)

Cluster 0:
 - Drink water to stay hydrated
 - Stay hydrated and drink water
 - You need clear water to stay hydrated
 - Stay healthy by drinking clear water
Cluster 1:
 - Python is a programming language
 - Python language is used for machine learning
 - Machine learning can be done with Python
 - Artificial Intelligence is a branch of machine learning
 - Learning Python is fun
 - Python is easy to learn and powerful
 - Machine learning is a fascinating field
Cluster 2:
 - A king's crown is made of gold
 - The crown of a king is called a diadem
 - The king's diadem is made of gold
Cluster 3:
 - The sky is blue and beautiful
 - The quick brown fox jumps over the lazy dog.
 - The fox jumped over the dog
 - The fox is quicker than the lazy dog
 - The dog is lazy but the brown fox is quick!
Cluster 4:
 - Love this blue and beautiful sky!
 - The bright sun in the sky is shining
 - The sun is bright and beautiful today
 - The sun in the sky is very bright
 - We can see the shining sun, the b

Here is a description of each cluster:

- **Cluster 0**:
It is focused on hydration and health. 
- **Cluster 1**:
It is related to programming and technology. 
- **Cluster 2**:
It is associated with royalty. 
- **Cluster 3**:
It is centered around animals, although it also contains one sentence about the sky. 
- **Cluster 4**:
It is related to the sky and the sun. 

We could keep asking for more clusters, but doing so makes the results harder to interpret and the clusters harder to tell apart.

Here, we have an example where domain experts can evaluate the clustering results to determine if the documents within each cluster share a common theme or topic that makes sense.

## Conclusions

Key Takeaways:
- Spectral Clustering identifies and separates complex structures where k-Means struggles, as seen most clearly in the "Half Moons" and "Spiral-Shaped Data" examples.
- Even with more clusters, k-Means cannot correctly separate the points in these non-linearly separable datasets, unlike Spectral Clustering.
- Spectral Clustering outperformed k-Means on every synthetic dataset in this notebook, although its results depend on `n_neighbors`: too small a value can leave the affinity graph disconnected, as in the "Concentric Circles" example.
- Spectral Clustering grouped the text documents into largely coherent themes based on content similarity, with a few sentences placed in less fitting clusters, which shows its usefulness in natural language processing tasks.

## References

- https://scikit-learn.org/stable/modules/clustering.html#spectral-clustering
- Müller, A.C. & Guido, S. (2017) Introduction to Machine Learning with Python: A Guide for Data Scientists. USA: O'Reilly Media, Inc., chapter 3.
- VanderPlas, J. (2017) Python Data Science Handbook: Essential Tools for Working with Data. USA: O'Reilly Media, Inc., chapter 5.