# Affinity Propagation

## Objectives

- Explore using Affinity Propagation to cluster various datasets, including synthetic, text, and real-world data.
- Assess the impact of different parameters on the clustering performance of Affinity Propagation.
- Evaluate clustering results using external validation measures.

## Background

Affinity Propagation (AP) is a clustering algorithm that does not require specifying the number of clusters in advance. It identifies "exemplars," representative points for clusters, through message passing between data points.

## Datasets Used

- Synthetic datasets generated with make_blobs and make_moons demonstrate the clustering capability of AP in structured and complex geometrical distributions.
- Text documents illustrate how AP handles high-dimensional, sparse data using cosine similarity measures.
- The Wine dataset from the UCI repository, which consists of chemical analyses of Italian wines, evaluates AP's performance in real-world, multivariate data.

## A simple example

Affinity Propagation (AP) is a clustering algorithm that identifies exemplars among data points and forms clusters by exchanging real-valued messages between points until a high-quality set of exemplars and corresponding clusters emerges.
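The message exchange described above can be sketched in a few lines of NumPy. This is a simplified illustration, not scikit-learn's implementation: it runs a fixed number of iterations with no convergence check, and the function name `affinity_propagation_sketch` and the toy data are hypothetical. Two kinds of messages are updated in turn: responsibilities `R[i, k]` (how well suited point `k` is to be the exemplar for point `i`) and availabilities `A[i, k]` (how appropriate it is for point `i` to choose point `k`).

```python
import numpy as np

def affinity_propagation_sketch(S, damping=0.5, n_iter=200):
    """Toy message-passing loop. S is an (n, n) similarity matrix
    whose diagonal holds the preferences."""
    n = S.shape[0]
    R = np.zeros((n, n))  # responsibilities r(i, k)
    A = np.zeros((n, n))  # availabilities a(i, k)
    rows = np.arange(n)
    for _ in range(n_iter):
        # r(i, k) = s(i, k) - max over k' != k of [a(i, k') + s(i, k')]
        AS = A + S
        best = AS.argmax(axis=1)
        first = AS[rows, best]
        AS[rows, best] = -np.inf
        second = AS.max(axis=1)
        R_new = S - first[:, None]
        R_new[rows, best] = S[rows, best] - second
        R = damping * R + (1 - damping) * R_new

        # a(i, k) = min(0, r(k, k) + sum of positive r(i', k) for i' not in {i, k})
        # a(k, k) = sum of positive r(i', k) for i' != k
        Rp = np.maximum(R, 0)
        Rp[rows, rows] = R[rows, rows]      # keep r(k, k) unclipped
        A_new = Rp.sum(axis=0)[None, :] - Rp
        self_avail = A_new[rows, rows].copy()
        A_new = np.minimum(A_new, 0)
        A_new[rows, rows] = self_avail      # self-availability is not clipped
        A = damping * A + (1 - damping) * A_new

    # A point is an exemplar when its combined self-evidence is positive
    exemplars = np.flatnonzero(np.diag(A + R) > 0)
    if exemplars.size == 0:
        return exemplars, np.full(n, -1)
    labels = S[:, exemplars].argmax(axis=1)
    labels[exemplars] = np.arange(exemplars.size)
    return exemplars, labels

# Toy data: two tight groups on a line (values chosen for illustration)
points = np.array([0.0, 0.1, 0.2, 10.0, 10.1, 10.2])
S_demo = -(points[:, None] - points[None, :]) ** 2   # negative squared distance
np.fill_diagonal(S_demo, np.median(S_demo))          # preference = median similarity
exemplars_demo, labels_demo = affinity_propagation_sketch(S_demo)
print("Exemplar indices:", exemplars_demo)
print("Labels:", labels_demo)
```

The damping factor blends each new message with the previous one, which helps avoid oscillations between competing exemplars.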

In [1]:
import pandas as pd
pd.set_option('display.max_columns', 8)

import ClusterVisualizer as cv

import plotly.express as px

In [2]:
# Generate data with well-defined clusters
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, cluster_std=1.5, random_state=10) 

In [3]:
# Save the data to a DataFrame
dataX = pd.DataFrame(X, columns=['x', 'y'])
print(dataX.shape)
dataX.head()

(50, 2)


Unnamed: 0,x,y
0,-0.122728,-6.634907
1,-1.597559,-6.70108
2,4.366142,3.929363
3,-2.292342,-4.573166
4,-1.414224,-4.79944


In [4]:
cvX = cv.ClusterVisualizer(dataX)

cvX.plot_data()

In [5]:
# Initialize the model
from sklearn.cluster import AffinityPropagation

ap_clusters = AffinityPropagation(random_state=0).fit(X)

In [6]:
print('Number of clusters: %i' % len(ap_clusters.cluster_centers_indices_))
print('Cluster centers: \n', ap_clusters.cluster_centers_.round(2))

Number of clusters: 3
Cluster centers: 
 [[ 5.82 -9.42]
 [ 3.52  4.75]
 [-0.75 -5.37]]


In [7]:
# Plotting the clusters
cvX.plot_clusters(ap_clusters.labels_, ap_clusters.cluster_centers_, 
                  title='Affinity Propagation Clustering')

As you can see, the AP algorithm detects the three clusters.

## Half Moons Example

In [8]:
from sklearn.datasets import make_moons

Xm, _ = make_moons(100, noise=.08, random_state=10)

In [9]:
# Save the data to a DataFrame
dataXm = pd.DataFrame(Xm, columns=['x', 'y'])
print(dataXm.shape)
dataXm.head()

(100, 2)


Unnamed: 0,x,y
0,0.201479,0.883652
1,0.5107,0.882023
2,-0.889352,0.401421
3,-0.573666,0.582637
4,0.683472,-0.327357


In [10]:
cv_m = cv.ClusterVisualizer(dataXm)

cv_m.plot_data('Half Moons Data')

In [11]:
ap_clusters_m = AffinityPropagation().fit(Xm)

print('Number of clusters: %i' % len(ap_clusters_m.cluster_centers_indices_))

print('Cluster centers: \n', ap_clusters_m.cluster_centers_.round(2))

Number of clusters: 8
Cluster centers: 
 [[-0.89  0.4 ]
 [ 0.63 -0.41]
 [ 1.96  0.22]
 [-0.35  0.92]
 [ 0.14  0.09]
 [ 0.33  1.02]
 [ 1.36 -0.41]
 [ 0.96  0.4 ]]


In [12]:
cv_m.plot_clusters(ap_clusters_m.labels_, ap_clusters_m.cluster_centers_, 
                   title = 'Half Moons - Affinity Propagation Clustering')

The `preference` parameter is a crucial determinant of how many exemplars (or cluster centers) are chosen. 

A lower (more negative) preference value makes every point less suitable as an exemplar, so it typically results in fewer clusters. If `preference` is not set, scikit-learn uses the median of the input similarities.
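To see the effect of `preference` more directly, the sketch below refits AP on the same half-moons data for a few values. The specific values in `pref_values` are illustrative assumptions. It also computes the median similarity that scikit-learn falls back to when no preference is given (with the default Euclidean affinity, similarities are negative squared distances).

```python
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import euclidean_distances

X_demo, _ = make_moons(100, noise=0.08, random_state=10)

# Similarities used by the default affinity: negative squared Euclidean distance
S_moons = -euclidean_distances(X_demo, squared=True)
default_pref = float(np.median(S_moons))
print(f"Median similarity (default preference): {default_pref:.3f}")

# Illustrative preference values, from the default down to strongly negative
pref_values = [default_pref, -5.0, -20.0, -50.0]
n_clusters_by_pref = {}
for pref in pref_values:
    model = AffinityPropagation(preference=pref, random_state=0).fit(X_demo)
    n_clusters_by_pref[pref] = len(model.cluster_centers_indices_)
    print(f"preference = {pref:8.3f} -> {n_clusters_by_pref[pref]} clusters")
```

If a run fails to converge, scikit-learn emits a warning and reports no cluster centers, so a count of zero signals non-convergence rather than a valid clustering.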

In [13]:
# Modifying the preference parameter
ap_clusters_m3 = AffinityPropagation(preference = -50).fit(Xm)

print('Number of clusters: %i' % len(ap_clusters_m3.cluster_centers_indices_))

print('Cluster centers: \n', ap_clusters_m3.cluster_centers_.round(2))

Number of clusters: 2
Cluster centers: 
 [[-0.37  0.69]
 [ 1.02  0.08]]


In [14]:
cv_m.plot_clusters(ap_clusters_m3.labels_, ap_clusters_m3.cluster_centers_, 
                   title = 'Half Moons - Affinity Propagation (preference = -50)')

The expected outcome is that each moon shape forms its own cluster, which is not the case here. Instead, points from both half moons are mixed within the two clusters, indicating that AP did not effectively separate the two distinct shapes.
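One way to quantify this mixing is to compare the AP labels with the true moon membership that `make_moons` returns. The following is a minimal sketch that regenerates the data with the same settings and keeps the labels; the variable names are assumptions, and no particular score is implied.

```python
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Same data as above, but keeping the true moon membership this time
X_moons, y_moons = make_moons(100, noise=0.08, random_state=10)

ap_moons = AffinityPropagation(preference=-50, random_state=0).fit(X_moons)

# An ARI near 1 would mean each cluster matches one moon;
# a value well below 1 reflects clusters that mix the two shapes
ari_moons = adjusted_rand_score(y_moons, ap_moons.labels_)
print(f"Adjusted Rand Index vs. true moons: {ari_moons:.4f}")
```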

## Document Analysis Example

In [15]:
# Sample text data
documents = [
    "Nature is beautiful.",        
    "I love nature.",    
    "Machine learning is a facet of AI.",
    "Nature is my best friend.",
    "General AI is coming soon.",
    "AI is creating a new era.",
    "Cluster Analysis is a branch of AI."
]

In [16]:
# Convert the text data into TF-IDF features
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english')
Xd = vectorizer.fit_transform(documents)

For document clustering with AP, cosine similarity is a common choice for high-dimensional, sparse text data because it compares the orientation of the TF-IDF vectors rather than their magnitude. Documents that use the same terms in similar proportions score as similar, regardless of how long they are.

Cosine similarity inherently normalizes the document vectors, which is crucial in text analysis, especially when document lengths vary significantly. 
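The normalization claim can be checked on a toy example. In the sketch below, the three vectors are made-up term counts: `long_doc` has the same term proportions as `short_doc` but ten times the counts, and `other_doc` shares no terms with either.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical term-count vectors over a three-word vocabulary
short_doc = np.array([[1.0, 2.0, 0.0]])
long_doc = 10 * short_doc                 # same proportions, much longer document
other_doc = np.array([[0.0, 0.0, 3.0]])   # no terms in common

sim_scaled = cosine_similarity(short_doc, long_doc)[0, 0]
sim_other = cosine_similarity(short_doc, other_doc)[0, 0]
euclid_scaled = np.linalg.norm(short_doc - long_doc)

print(f"Cosine similarity, short vs. long:  {sim_scaled:.4f}")
print(f"Cosine similarity, short vs. other: {sim_other:.4f}")
print(f"Euclidean distance, short vs. long: {euclid_scaled:.4f}")
```

The scaled document is treated as identical under cosine similarity, while Euclidean distance reports it as far away purely because of its length.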

In [17]:
# Compute the Cosine similarity matrix
from sklearn.metrics.pairwise import cosine_similarity

cosine_sim_matrix = cosine_similarity(Xd)

In [18]:
# Perform Affinity Propagation with precomputed similarities
ap = AffinityPropagation(affinity='precomputed', random_state=0)
ap.fit(cosine_sim_matrix)

In [19]:
# Output the cluster assignments
print("Cluster labels:", ap.labels_)

Cluster labels: [0 0 1 0 1 1 1]


In [20]:
# Optionally, print the documents and their cluster assignments
for document, label in zip(documents, ap.labels_):
    print(f"Cluster: {label} - {document}")

Cluster: 0 - Nature is beautiful.
Cluster: 0 - I love nature.
Cluster: 1 - Machine learning is a facet of AI.
Cluster: 0 - Nature is my best friend.
Cluster: 1 - General AI is coming soon.
Cluster: 1 - AI is creating a new era.
Cluster: 1 - Cluster Analysis is a branch of AI.


The AP algorithm grouped the documents into two clusters:
- Cluster 0: documents related to Nature
- Cluster 1: documents related to AI
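Because AP picks exemplars from the data itself, each cluster has a representative document. The sketch below repeats the steps above in a self-contained form and reads the exemplar positions from `cluster_centers_indices_`; the variable names `docs` and `ap_docs` are introduced here only for illustration.

```python
from sklearn.cluster import AffinityPropagation
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "Nature is beautiful.",
    "I love nature.",
    "Machine learning is a facet of AI.",
    "Nature is my best friend.",
    "General AI is coming soon.",
    "AI is creating a new era.",
    "Cluster Analysis is a branch of AI.",
]

tfidf = TfidfVectorizer(stop_words='english').fit_transform(docs)
ap_docs = AffinityPropagation(affinity='precomputed', random_state=0)
ap_docs.fit(cosine_similarity(tfidf))

# cluster_centers_indices_[k] is the row index of the exemplar for cluster k
for cluster_id, doc_index in enumerate(ap_docs.cluster_centers_indices_):
    print(f"Cluster {cluster_id} exemplar: {docs[doc_index]}")
```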

## Wine Dataset

The Wine dataset is a collection of chemical analyses of wines grown in the same region in Italy but derived from three different cultivars, featuring measurements such as alcohol content, color intensity, and other chemical properties used to classify the wines.

In [21]:
from sklearn.datasets import load_wine

wine = load_wine()

In [22]:
df = pd.DataFrame(wine.data, columns=wine.feature_names)
df['class'] = wine.target_names[wine.target]
print(df.shape)
df.head()

(178, 14)


Unnamed: 0,alcohol,malic_acid,ash,alcalinity_of_ash,...,hue,od280/od315_of_diluted_wines,proline,class
0,14.23,1.71,2.43,15.6,...,1.04,3.92,1065.0,class_0
1,13.2,1.78,2.14,11.2,...,1.05,3.4,1050.0,class_0
2,13.16,2.36,2.67,18.6,...,1.03,3.17,1185.0,class_0
3,14.37,1.95,2.5,16.8,...,0.86,3.45,1480.0,class_0
4,13.24,2.59,2.87,21.0,...,1.04,2.93,735.0,class_0


In [23]:
# Transforming the DataFrame into a long format for visualization purposes
df_melted = pd.melt(df, id_vars=['class'])

# Create a box plot with the original data
fig_o = px.box(df_melted, x='variable', y='value', 
             width=700, height=400, title='Box Plot of Original Wine Features')
# Change the legend position to 'top'
fig_o.update_layout(legend=dict(x=0.4, y=1.2, orientation='h'))
# Show the plot
fig_o.show()

In [24]:
# Standardize the data
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

In [25]:
dfS = pd.DataFrame(scaler.fit_transform(df.iloc[:, :-1]), columns=df.columns[:-1])                                       
dfS['class'] = df['class']
dfS.head()

Unnamed: 0,alcohol,malic_acid,ash,alcalinity_of_ash,...,hue,od280/od315_of_diluted_wines,proline,class
0,1.518613,-0.56225,0.232053,-1.169593,...,0.362177,1.84792,1.013009,class_0
1,0.24629,-0.499413,-0.827996,-2.490847,...,0.406051,1.113449,0.965242,class_0
2,0.196879,0.021231,1.109334,-0.268738,...,0.318304,0.788587,1.395148,class_0
3,1.69155,-0.346811,0.487926,-0.809251,...,-0.427544,1.184071,2.334574,class_0
4,0.2957,0.227694,1.840403,0.451946,...,0.362177,0.449601,-0.037874,class_0


In [26]:
# Transforming the DataFrame into a long format for visualization purposes
dfS_melted = pd.melt(dfS, id_vars=['class'])

# Create a box plot with the standardized data
fig_s = px.box(dfS_melted, x='variable', y='value', 
               width=700, height=400, 
               title='Box Plot of Standardized Wine Features')
# Change the legend position to 'top'
fig_s.update_layout(legend=dict(x=0.4, y=1.2, orientation='h'))
# Show the plot
fig_s.show()

In [27]:
# TSNE (for visualization only) 
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2)

# fit_transform fits the model and returns the embedding, so no separate fit call is needed
df_tsne = pd.DataFrame(tsne.fit_transform(dfS.iloc[:, :-1]), columns=['TSNE_1', 'TSNE_2'])
df_tsne['Original Class'] = dfS['class']
df_tsne.head()

Unnamed: 0,TSNE_1,TSNE_2,Original Class
0,6.892741,11.439153,class_0
1,-0.496582,11.350184,class_0
2,3.206583,11.402533,class_0
3,5.181119,14.11286,class_0
4,2.216416,7.116625,class_0


In [28]:
cv_w = cv.ClusterVisualizer(df_tsne)

cv_w.plot_clusters(df_tsne['Original Class'], 
                   title='Original Class in the t-SNE Space')

In [29]:
# Using Affinity Propagation
ap = AffinityPropagation(random_state=0,  preference=-200)
ap.fit(dfS.iloc[:, :-1])

print('Number of clusters: %i' % len(ap.cluster_centers_indices_))

Number of clusters: 3


In [30]:
# Plot the clusters in the t-SNE space
cv_w.plot_clusters(ap.labels_, title='Affinity Propagation Results in the t-SNE Space')

### External Validation Measures

External validation measures for clustering assess the quality of clustering results by comparing them to known ground truth or predefined labels in the data.

In [31]:
dfS['AP Cluster'] = ap.labels_
dfS.head()

Unnamed: 0,alcohol,malic_acid,ash,alcalinity_of_ash,...,od280/od315_of_diluted_wines,proline,class,AP Cluster
0,1.518613,-0.56225,0.232053,-1.169593,...,1.84792,1.013009,class_0,0
1,0.24629,-0.499413,-0.827996,-2.490847,...,1.113449,0.965242,class_0,1
2,0.196879,0.021231,1.109334,-0.268738,...,0.788587,1.395148,class_0,0
3,1.69155,-0.346811,0.487926,-0.809251,...,1.184071,2.334574,class_0,0
4,0.2957,0.227694,1.840403,0.451946,...,0.449601,-0.037874,class_0,0


In [32]:
cluster = {'class_0': 0, 'class_1': 1, 'class_2': 2}
dfS['class'] = dfS['class'].map(cluster)
dfS.head()

Unnamed: 0,alcohol,malic_acid,ash,alcalinity_of_ash,...,od280/od315_of_diluted_wines,proline,class,AP Cluster
0,1.518613,-0.56225,0.232053,-1.169593,...,1.84792,1.013009,0,0
1,0.24629,-0.499413,-0.827996,-2.490847,...,1.113449,0.965242,0,1
2,0.196879,0.021231,1.109334,-0.268738,...,0.788587,1.395148,0,0
3,1.69155,-0.346811,0.487926,-0.809251,...,1.184071,2.334574,0,0
4,0.2957,0.227694,1.840403,0.451946,...,0.449601,-0.037874,0,0


In [33]:
# Calculate the Adjusted Rand Index 
from sklearn.metrics import adjusted_rand_score

print("Adjusted Rand Index = %.4f" % adjusted_rand_score(dfS['class'], dfS['AP Cluster']))

Adjusted Rand Index = 0.6218


An ARI of 1 indicates perfect agreement with the true labels, while values near 0 (or negative) indicate agreement no better than random label assignment. An ARI of 0.6218 suggests good, but not perfect, clustering performance.
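A small example helps calibrate these values. The Adjusted Rand Index compares which pairs of points are grouped together, so it does not matter what the clusters are called. The labels below are toy values chosen for illustration.

```python
from sklearn.metrics import adjusted_rand_score

truth = [0, 0, 0, 1, 1, 1]
renamed = [1, 1, 1, 0, 0, 0]      # same grouping, cluster ids swapped
one_cluster = [0, 0, 0, 0, 0, 0]  # everything lumped together

ari_renamed = adjusted_rand_score(truth, renamed)
ari_one_cluster = adjusted_rand_score(truth, one_cluster)

print(f"Same grouping, different ids: {ari_renamed:.4f}")
print(f"Single cluster:               {ari_one_cluster:.4f}")
```

This is why the AP cluster ids do not need to match the wine class ids before computing the score.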

In [34]:
# Calculate the Adjusted Mutual Information Score 
from sklearn.metrics import adjusted_mutual_info_score

print("Adjusted Mutual Information Score = %.4f" 
      % adjusted_mutual_info_score(dfS['class'], dfS['AP Cluster']))

Adjusted Mutual Information Score = 0.6075


The AMI measures the mutual information between the clustering assignments and the true labels, adjusted for the agreement expected by chance. An AMI of 0.6075 indicates a substantial amount of shared information, reflecting good clustering quality.

In [35]:
from sklearn.metrics import normalized_mutual_info_score

print("Normalized Mutual Information (NMI) = %.4f" 
      % normalized_mutual_info_score(dfS['class'], dfS['AP Cluster']))

Normalized Mutual Information (NMI) = 0.6116


NMI is a normalization of the mutual information score, comparing the clustering results to the true labels. It ranges from 0 (no mutual information) to 1 (perfect agreement). Unlike AMI, it is not adjusted for chance. An NMI of 0.6116 shows a good level of agreement between the clustering assignments and the true labels.
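The difference between NMI and AMI shows up when two labelings are unrelated. In the sketch below, both labelings are drawn at random (the sizes and seed are arbitrary assumptions), so any measured agreement is due to chance alone.

```python
import numpy as np
from sklearn.metrics import (adjusted_mutual_info_score,
                             normalized_mutual_info_score)

rng = np.random.default_rng(0)
# Two independent random labelings: 50 points, 10 possible labels each
labels_a = rng.integers(0, 10, size=50)
labels_b = rng.integers(0, 10, size=50)

nmi_random = normalized_mutual_info_score(labels_a, labels_b)
ami_random = adjusted_mutual_info_score(labels_a, labels_b)

print(f"NMI on random labels: {nmi_random:.4f}")
print(f"AMI on random labels: {ami_random:.4f}")
```

NMI stays above the AMI here because it does not subtract the mutual information expected by chance, which grows with the number of clusters relative to the number of samples.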

In [36]:
from sklearn.metrics import fowlkes_mallows_score

print("Fowlkes-Mallows Index (FMI) = %.4f" 
      % fowlkes_mallows_score(dfS['class'], dfS['AP Cluster']))

Fowlkes-Mallows Index (FMI) = 0.7487



FMI is the geometric mean of the pairwise precision and recall of the clustering, where a pair of points counts as a true positive if both the clustering and the true labels place them in the same group. An FMI closer to 1 indicates high precision and recall. An FMI of 0.7487 is fairly high, suggesting good clustering quality.
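The pairwise definition can be verified by counting pairs by hand and comparing against `fowlkes_mallows_score`. The two label lists below are toy values chosen for illustration.

```python
from itertools import combinations
from math import sqrt
from sklearn.metrics import fowlkes_mallows_score

true_labels = [0, 0, 0, 1, 1, 1]
pred_labels = [0, 0, 1, 1, 2, 2]

tp = fp = fn = 0
for i, j in combinations(range(len(true_labels)), 2):
    same_true = true_labels[i] == true_labels[j]
    same_pred = pred_labels[i] == pred_labels[j]
    if same_true and same_pred:
        tp += 1   # pair grouped together in both labelings
    elif same_pred:
        fp += 1   # grouped by the clustering only
    elif same_true:
        fn += 1   # grouped by the true labels only

precision = tp / (tp + fp)
recall = tp / (tp + fn)
fmi_manual = sqrt(precision * recall)
fmi_sklearn = fowlkes_mallows_score(true_labels, pred_labels)

print(f"Pairwise precision = {precision:.4f}, recall = {recall:.4f}")
print(f"FMI (manual)  = {fmi_manual:.4f}")
print(f"FMI (sklearn) = {fmi_sklearn:.4f}")
```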

Taken together, these metrics indicate that Affinity Propagation recovered much of the class structure in the UCI Wine dataset, though not all of it. Scores between roughly 0.6 and 0.75 show substantial agreement with the true labels, and the fact that four measures of different aspects of clustering quality point the same way makes that conclusion more credible.

In [37]:
crosstab = pd.crosstab(dfS['class'], dfS['AP Cluster']).reset_index()

long_crosstab = crosstab.melt(id_vars='class', var_name='AP Cluster', value_name='Count')
fig_c = px.bar(long_crosstab, x='class', y='Count', color='AP Cluster', barmode='group')               
fig_c.update_layout(title='Wine Class Count by AP Cluster',
                    width=600, height=400, legend_title='AP Cluster',
                    xaxis_title='Wine Class', yaxis_title='Count')
fig_c.show()

- AP Cluster 0 (blue) has elements of the three wine classes, mainly from class 0.
- AP Cluster 1 (red) has elements of wine classes 0 and 1, but mainly from class 1.
- AP Cluster 2 (green) has elements of wine classes 1 and 2, but mainly from class 2.

## Conclusions

Key Takeaways:
- Affinity Propagation identified the three clusters in the simple synthetic data without being told how many to look for. On the half-moons data it produced eight clusters by default, and lowering `preference` to -50 gave two clusters that still mixed points from both moons.
- AP effectively grouped documents into semantically similar categories in text clustering, showcasing its applicability in natural language processing.
- AP demonstrated good clustering quality for the Wine dataset: the external validation scores (ARI 0.6218, AMI 0.6075, NMI 0.6116, FMI 0.7487) indicate a substantial, though not perfect, match between the true labels and the clusters formed.

## References

- [Affinity Propagation - sklearn](https://scikit-learn.org/stable/modules/clustering.html#affinity-propagation)
- [Wine Dataset - UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/dataset/109/wine) 