# Affinity Propagation and Agglomerative Clustering

## Overview

This notebook applies **Affinity Propagation (AP)** and **Agglomerative Clustering** to a TF-IDF representation of tweet data. It covers the steps listed below.

## Steps

1. **Import Libraries**
   - Necessary libraries are imported, including those for clustering, dimensionality reduction, and evaluation.

2. **Load and Prepare Data**
   - The TF-IDF matrix is loaded from a saved file (`tfidf_matrix.pkl`).
   - Dimensionality reduction is applied using TruncatedSVD to reduce the matrix to 100 components.

3. **Chunk Processing**
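   - Affinity Propagation is run on fixed-size chunks of rows, because it builds a pairwise similarity matrix whose memory use grows quadratically with the number of samples.
   - The `chunk_data` function yields consecutive chunks of `chunk_size` rows (1000 in this notebook).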

4. **Affinity Propagation Clustering**
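   - A separate Affinity Propagation model is fitted on each data chunk.
   - Labels from each chunk are collected and concatenated. The labels are local to their chunk, so the same label value in two different chunks does not refer to the same cluster.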

5. **Agglomerative Clustering**
   - Agglomerative Clustering is performed on the full data matrix to group the data into 3 clusters.

6. **Save Results**
   - The clustering results are saved to CSV files, including individual files for each cluster label.
   - Models and labels are saved to pickle files for future use.

7. **Evaluation**
   - The Silhouette Score, Calinski-Harabasz Score, and Davies-Bouldin Index are computed for the Agglomerative Clustering labels on the first 500 rows of the reduced matrix to assess clustering quality.

## Files Generated
- `tweets_with_agg_labels.csv`: Data with Agglomerative Clustering labels.
- `tweets_label_0.csv`, `tweets_label_1.csv`, `tweets_label_2.csv`: Filtered data by cluster labels.
- `svd_model.pkl`: Saved TruncatedSVD model.
- `agglomerative_clustering_labels.pkl`: Saved Agglomerative Clustering labels.
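- `affinity_propagation_labels.pkl`: Affinity Propagation labels of the last chunk only (the `labels_` attribute of the final fitted model).
- `cluster_labels.pkl`: Concatenated Affinity Propagation labels from all chunks.
- `affinity_propagation_model.pkl`: Affinity Propagation model fitted on the last chunk.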

## Metrics
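- **Silhouette Score:** Compares, for each sample, the mean distance to its own cluster with the mean distance to the nearest other cluster. It ranges from -1 to 1; higher is better.
- **Calinski-Harabasz Score:** Ratio of between-cluster dispersion to within-cluster dispersion. Higher values indicate denser, better-separated clusters.
- **Davies-Bouldin Index:** Average similarity of each cluster with its most similar cluster. The minimum is 0; lower is better.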
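
To make the direction of each metric concrete, here is a minimal sketch that computes all three on synthetic data. The blob centers and the names `X_demo`, `demo_labels`, `sil`, `ch`, and `db` are assumptions chosen for illustration; they are not part of the notebook's tweet data.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

# Synthetic stand-in for the reduced tweet matrix (assumption: three
# well-separated 2-D blobs, which is much cleaner than real tweet data).
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [5, 5], [-5, 5]],
    cluster_std=0.5,
    random_state=0,
)
demo_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X_demo)

sil = silhouette_score(X_demo, demo_labels)        # in [-1, 1], higher is better
ch = calinski_harabasz_score(X_demo, demo_labels)  # higher is better
db = davies_bouldin_score(X_demo, demo_labels)     # >= 0, lower is better

print(f"Silhouette: {sil:.3f}, Calinski-Harabasz: {ch:.1f}, Davies-Bouldin: {db:.3f}")
```

On clean blobs like these, the Silhouette Score should be far higher than the value reported for the tweet data later in this notebook, which gives a rough sense of scale.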

In [9]:
# AP_clustering_implementation.ipynb

# Importing Necessary Libraries
import numpy as np
import pandas as pd
import pickle
from sklearn.cluster import AffinityPropagation, AgglomerativeClustering
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score
from sklearn.decomposition import TruncatedSVD

In [10]:
# Loading the TF-IDF matrix
with open('tfidf_matrix.pkl', 'rb') as f:
    tfidf_matrix = pickle.load(f)

In [11]:
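# Reducing dimensionality to 100 components
# (no random_state is set, so the randomized solver can give slightly different results between runs)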
svd = TruncatedSVD(n_components=100)
reduced_tfidf_matrix = svd.fit_transform(tfidf_matrix)
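
One way to judge whether the chosen number of components is adequate is to look at the share of variance the fitted model retains. The sketch below uses a small random sparse matrix as a stand-in for `tfidf_matrix`; the shapes, the density, and the names `X_sparse`, `svd_demo`, and `retained` are assumptions for illustration only.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# Hypothetical stand-in for the TF-IDF matrix: 200 documents, 500 terms,
# roughly 2% of entries non-zero.
rng = np.random.default_rng(0)
dense = rng.random((200, 500)) * (rng.random((200, 500)) < 0.02)
X_sparse = csr_matrix(dense)

# random_state is fixed here only to make the sketch reproducible.
svd_demo = TruncatedSVD(n_components=20, random_state=0)
X_reduced = svd_demo.fit_transform(X_sparse)

# Fraction of the total variance captured by the kept components.
retained = svd_demo.explained_variance_ratio_.sum()
print(f"Reduced shape: {X_reduced.shape}, variance retained: {retained:.2%}")
```

The same `explained_variance_ratio_` attribute is available on the notebook's `svd` object after `fit_transform`.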


In [12]:
# Function to split data into chunks
def chunk_data(data, chunk_size):
    for i in range(0, data.shape[0], chunk_size):
        yield data[i:i + chunk_size]

# Initialize variables
chunk_size = 1000  # Adjust chunk size as needed
all_labels = []  # List to store labels for each chunk
all_data_chunks = []  # List to store data chunks for Agglomerative Clustering


# Processing data in chunks: a fresh AffinityPropagation model is fitted on each chunk
for chunk in chunk_data(reduced_tfidf_matrix, chunk_size):
    affinity_propagation = AffinityPropagation(random_state=4, max_iter=500, damping=0.9)
    # Labels are local to the chunk: label 0 in one chunk is unrelated to label 0 in another
    chunk_labels = affinity_propagation.fit_predict(chunk)
    all_labels.append(chunk_labels)
    all_data_chunks.append(chunk)  # Collect data for the final clustering
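
Because each chunk is clustered independently, concatenating the raw labels mixes unrelated clusters that happen to share a label value. A possible remedy, sketched below, shifts each chunk's labels by a running offset so that every cluster ID is unique across chunks. The helper `offset_chunk_labels` is hypothetical and is not part of the notebook. It leaves `-1` untouched, on the assumption that `-1` marks samples from a chunk where Affinity Propagation produced no exemplars.

```python
import numpy as np


def offset_chunk_labels(label_chunks):
    """Concatenate per-chunk labels so cluster IDs are unique across chunks.

    Labels equal to -1 are kept as-is (assumed to mean "no cluster").
    """
    merged = []
    offset = 0
    for chunk_labels in label_chunks:
        chunk_labels = np.asarray(chunk_labels)
        shifted = chunk_labels.copy()
        valid = chunk_labels >= 0
        shifted[valid] += offset
        merged.append(shifted)
        if valid.any():
            # Advance the offset past the largest label used in this chunk.
            offset += int(chunk_labels[valid].max()) + 1
    return np.concatenate(merged)


demo = offset_chunk_labels(
    [np.array([0, 1, 1]), np.array([0, 0, 1]), np.array([-1, -1])]
)
print(demo.tolist())  # [0, 1, 1, 2, 2, 3, -1, -1]
```

This only makes the IDs distinct. It does not merge similar clusters found in different chunks, which would need a further step such as clustering the exemplars themselves.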


In [13]:
# Concatenate all data chunks
full_data_matrix = np.vstack(all_data_chunks)
print(f"Full data shape: {full_data_matrix.shape}")

# Verify length of labels
labels = np.concatenate(all_labels)
print(f"Length of labels: {len(labels)}")

# Perform Agglomerative Clustering on the full data
agglomerative_clustering = AgglomerativeClustering(n_clusters=3)
agg_labels = agglomerative_clustering.fit_predict(full_data_matrix)

# Verify length of Agglomerative labels
print(f"Length of Agglomerative labels: {len(agg_labels)}")

# Save the full dataset with labels
# Load the tweet data; its rows must be in the same order as the rows of the TF-IDF matrix
df = pd.read_csv('cleaned_tweets.csv')
df['agg_labels'] = agg_labels  # Add Agglomerative labels to the DataFrame
df.to_csv('tweets_with_agg_labels.csv', index=False)


Full data shape: (27981, 100)
Length of labels: 27981
Length of Agglomerative labels: 27981


In [14]:
# Filter and save tweets with label 0
df_label_0 = df[df['agg_labels'] == 0]
df_label_0.to_csv('tweets_label_0.csv', index=False)

# Filter and save tweets with label 1
df_label_1 = df[df['agg_labels'] == 1]
df_label_1.to_csv('tweets_label_1.csv', index=False)

# Filter and save tweets with label 2
df_label_2 = df[df['agg_labels'] == 2]
df_label_2.to_csv('tweets_label_2.csv', index=False)
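
The three filter-and-save blocks above follow the same pattern, so they could also be written as a single loop over the groups. The sketch below assumes a small hypothetical DataFrame (`demo_df`) and writes to a temporary directory so that it does not touch the notebook's real output files.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical miniature of the labeled tweet table.
demo_df = pd.DataFrame(
    {"text": ["a", "b", "c", "d"], "agg_labels": [0, 1, 0, 2]}
)

out_dir = Path(tempfile.mkdtemp())  # temporary folder, for the sketch only
written = []
for label, group in demo_df.groupby("agg_labels"):
    path = out_dir / f"tweets_label_{label}.csv"
    group.to_csv(path, index=False)
    written.append(path.name)

print(written)
```

A loop like this keeps working unchanged if `n_clusters` is later set to a value other than 3.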

In [15]:
# Save the cluster labels and the models
# Save the fitted SVD model
with open('svd_model.pkl', 'wb') as f:
    pickle.dump(svd, f)

# Save the Agglomerative Clustering labels
with open('agglomerative_clustering_labels.pkl', 'wb') as f:
    pickle.dump(agg_labels, f)
    
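# Save the Affinity Propagation labels of the last chunk only
# (affinity_propagation refers to the model fitted in the final loop iteration)
cluster_labels = affinity_propagation.labels_
with open('affinity_propagation_labels.pkl', 'wb') as f:
    pickle.dump(cluster_labels, f)

# Save the concatenated Affinity Propagation labels from all chunks
with open('cluster_labels.pkl', 'wb') as f:
    pickle.dump(labels, f)

# Save the Affinity Propagation model fitted on the last chunk
with open('affinity_propagation_model.pkl', 'wb') as f:
    pickle.dump(affinity_propagation, f)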
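
A saved SVD model is mainly useful for projecting new documents into the same reduced space later on. The round trip below is a minimal sketch on random data. The array shapes and the names `svd_fit`, `svd_loaded`, and `projected` are assumptions, and the file is written to a temporary directory rather than to the notebook's `svd_model.pkl`.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
X_train = rng.random((50, 30))  # hypothetical training matrix
svd_fit = TruncatedSVD(n_components=5, random_state=0).fit(X_train)

# Save and reload the fitted model.
model_path = Path(tempfile.mkdtemp()) / "svd_model.pkl"
with open(model_path, "wb") as f:
    pickle.dump(svd_fit, f)
with open(model_path, "rb") as f:
    svd_loaded = pickle.load(f)

# New rows must have the same number of columns (features) as the training data.
X_new = rng.random((3, 30))
projected = svd_loaded.transform(X_new)
print(projected.shape)  # (3, 5)
```

For real tweets, the new rows would have to come from the same fitted TF-IDF vectorizer that produced the original matrix, so that the columns line up.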

In [16]:
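# Evaluation on a subset of the data (the first subset_size rows, not a random sample)
subset_size = 500  # Adjust subset size as needed
# Note: despite its name, this is a slice of the SVD-reduced matrix, not the raw TF-IDF matrix
tfidf_matrix_subset = full_data_matrix[:subset_size, :]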
agg_labels_subset = agg_labels[:subset_size]


# Calculate evaluation metrics
silhouette_avg = silhouette_score(tfidf_matrix_subset, agg_labels_subset)
calinski_harabasz = calinski_harabasz_score(tfidf_matrix_subset, agg_labels_subset)

print(f'Silhouette Score: {silhouette_avg:.3f}')
print(f'Calinski-Harabasz Score: {calinski_harabasz:.3f}')

# Davies-Bouldin Index
davies_bouldin = davies_bouldin_score(tfidf_matrix_subset, agg_labels_subset)
print(f'Davies-Bouldin Index: {davies_bouldin:.3f}')


Silhouette Score: 0.169
Calinski-Harabasz Score: 13.605
Davies-Bouldin Index: 1.953
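
The scores above are computed on the first 500 rows, which may not be representative if the tweets are stored in any meaningful order. As a hedged alternative, the sketch below evaluates on a random subset instead. It uses the `sample_size` and `random_state` parameters of `silhouette_score` and a manually drawn index for the other two metrics. The synthetic data and the names `X_demo`, `labels_demo`, and `idx` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

# Synthetic stand-in for the reduced matrix and its cluster labels.
X_demo, labels_demo = make_blobs(
    n_samples=2000,
    centers=[[0, 0], [6, 0], [0, 6]],
    cluster_std=1.0,
    random_state=0,
)

# silhouette_score can draw its own random subset.
sil_random = silhouette_score(X_demo, labels_demo, sample_size=500, random_state=0)

# The other two metrics have no sampling option, so draw the indices manually.
rng = np.random.default_rng(0)
idx = rng.choice(X_demo.shape[0], size=500, replace=False)
ch_random = calinski_harabasz_score(X_demo[idx], labels_demo[idx])
db_random = davies_bouldin_score(X_demo[idx], labels_demo[idx])

print(f"Silhouette: {sil_random:.3f}")
print(f"Calinski-Harabasz: {ch_random:.1f}")
print(f"Davies-Bouldin: {db_random:.3f}")
```

Repeating the evaluation with a few different seeds gives a rough idea of how much the scores depend on which rows are sampled.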


In [17]:
# Display the number of clusters
n_clusters = len(set(agg_labels))
print(f'Number of clusters: {n_clusters}')


Number of clusters: 3
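
The cluster count is fixed by `n_clusters=3`, so the size of each cluster is often more informative than the count itself. A minimal sketch with a hypothetical label array (`demo_labels`) is shown here; in the notebook the same call could be applied to `agg_labels`.

```python
import numpy as np

demo_labels = np.array([0, 0, 1, 2, 2, 2])  # hypothetical cluster labels

# Count how many samples fall into each cluster.
unique, counts = np.unique(demo_labels, return_counts=True)
sizes = dict(zip(unique.tolist(), counts.tolist()))
print(sizes)  # {0: 2, 1: 1, 2: 3}
```

A strongly unbalanced result, with one cluster holding nearly all samples, would suggest that the clusters are not separating the tweets in a useful way.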
