Adapted from: https://www.sbert.net/examples/applications/clustering/README.html
This is a more complex example of clustering a large-scale dataset.

This example finds local communities in a large set of sentences, i.e., groups of sentences that are highly
similar. You can freely configure the similarity threshold. A high threshold will
only find extremely similar sentences; a lower threshold will find more sentences that are less similar.

A second parameter is 'min_community_size': Only communities with at least a certain number of sentences will be returned.
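
To make the interplay of the two parameters concrete, here is a minimal pure-Python sketch of a greedy community search. It is a simplified illustration only, not the library's actual implementation; the function name toy_community_detection and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def toy_community_detection(vectors, threshold, min_community_size):
    """Simplified greedy community search (hypothetical helper, for illustration)."""
    # For every vector, collect the indices of all vectors (itself included)
    # whose cosine similarity reaches the threshold.
    neighbors = [
        [j for j, v in enumerate(vectors) if cosine(u, v) >= threshold]
        for u in vectors
    ]
    # Keep only candidate groups that are large enough, largest first.
    candidates = sorted(
        (n for n in neighbors if len(n) >= min_community_size),
        key=len,
        reverse=True,
    )
    assigned = set()
    communities = []
    for candidate in candidates:
        # Drop members already claimed by a larger community.
        members = [i for i in candidate if i not in assigned]
        if len(members) >= min_community_size:
            communities.append(members)
            assigned.update(members)
    return communities


toy_vectors = [
    [1.0, 0.0], [0.9, 0.1], [0.95, 0.05],  # roughly along the x axis
    [0.0, 1.0], [0.1, 0.9],                # roughly along the y axis
    [0.7, 0.7],                            # in between both groups
]
strict = toy_community_detection(toy_vectors, threshold=0.95, min_community_size=2)
loose = toy_community_detection(toy_vectors, threshold=0.70, min_community_size=2)
print(strict)
print(loose)
```

With the strict threshold the two tight groups stay separate and the in-between vector is left out; with the loose threshold the in-between vector is similar enough to everything that all six vectors end up in one community.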

The method for finding the communities is extremely fast: clustering 50k sentences requires only 5 seconds (plus embedding computation).

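The original example downloads a large set of questions from Quora and then finds similar questions in this set. This adaptation instead loads Facebook posts and comments from a local JSON file (../nlp/ranada_all.json) and finds similar texts in that set.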

In [3]:
%pip install sentence_transformers

Note: you may need to restart the kernel to use updated packages.


In [70]:
from sentence_transformers import SentenceTransformer, util
import os
import csv
import time
import json

In [71]:
# Model for computing sentence embeddings. all-MiniLM-L6-v2 is a general-purpose sentence embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

In [72]:
# Get all unique sentences from the file
corpus_sentences = set()
with open('../nlp/ranada_all.json', 'r') as openfile:
    json_object = json.load(openfile)
    for line in json_object:
        text = line['text']
        if text.strip() != '':
            corpus_sentences.add(text) 

corpus_sentences = list(corpus_sentences)
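
A side note on the deduplication step: Python randomizes string hashing per process by default, so the order of a list built from a set of strings can differ between runs, and the sentence indices (and possibly the cluster numbering) may then differ as well. The sketch below shows an order-preserving alternative based on dict.fromkeys. The inline JSON string is hypothetical toy data standing in for the real file.

```python
import json

# Hypothetical toy records in the same shape as the entries read above
# (a list of objects with a 'text' field).
raw = '[{"text": "hello"}, {"text": "  "}, {"text": "hello"}, {"text": "world"}]'
records = json.loads(raw)

# dict.fromkeys keeps the first occurrence of each key in insertion order,
# so the result is deduplicated and reproducible across runs.
unique_texts = list(
    dict.fromkeys(r["text"] for r in records if r["text"].strip() != "")
)
print(unique_texts)
```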

In [73]:
# Encode the corpus
print("Encode the corpus. This might take a while")
corpus_embeddings = model.encode(corpus_sentences, batch_size=64, show_progress_bar=True, convert_to_tensor=True)

Encode the corpus. This might take a while



In [74]:
print("Start clustering")
start_time = time.time()

# Two parameters to tune:
# min_community_size: Only consider clusters that have at least 40 elements
# threshold: Consider sentence pairs with a cosine similarity larger than threshold as similar
clusters = util.community_detection(corpus_embeddings, min_community_size=40, threshold=0.78)

print("Clustering done after {:.2f} sec".format(time.time() - start_time))

# For each cluster, print the first 3 and last 3 elements
for i, cluster in enumerate(clusters):
    print("\nCluster {}, #{} Elements ".format(i+1, len(cluster)))
    for sentence_id in cluster[0:3]:
        print("\t", corpus_sentences[sentence_id])
    print("\t", "...")
    for sentence_id in cluster[-3:]:
        print("\t", corpus_sentences[sentence_id])

Start clustering
Clustering done after 7.08 sec

Cluster 1, #541 Elements 
	 😠😠😠😠😠😠😠😠😠😠😠😠👎👎👎👎👎👎👎👎👎👎👎👎👎
	 🤣😂😜😝😛😜😝😛🤣😂😅
	 🤢🤮
	 ...
	 😂😂😂😂😂😂😂😂
	 🤐🤐🤐
	 😶‍🌫️

Cluster 2, #459 Elements 
	 swerte ni pia hindi ako nakasagupa nya kng hindi my kalalagyan sya🤣😅😂
	 si pia Ranada ay may kakapalan ng mukha yan mga bwesit na Rappler yan dapat sa kanika sunugin ng buhay mga yan ang lakas ng loob manira mawawala din sila sa madaling panahon.
	 Pia ranada talaga wala ng ginawa maayos.hayst rappler wala talaga kai kwenta.
	 ...
	 true nmn d kapaniwa niwala ang rappler wala ng gnwang balita panay kontra gobyerno lalo n yan pia .
	 dapat sa kanya bigwasan isang beses ang putang pia na yan..
	 Mauti nga saiyo Pia alalahanin mo pangulo na yongnasa harap yong way na pagtatanong mo. Iha kumo maykarapatan na kayo ganyan ang asal nyo magreport ka pa may kasinunngalingan tama ang panngulo pahiya ka ano

Cluster 3, #418 Elements 
	 Dapat lang sa kanya yang ganyang trato, bastos eh, kung makipag usap sa pangulo aka

In [75]:
print(len(clusters))
print(len(corpus_embeddings))
corpus_embeddings.shape

86
41287


torch.Size([41287, 384])

In [76]:
# Map each cluster id (as a string) to a display label of the form "<percentage>%      <label>"
cluster_to_label = {}

with open('../nlp/ranada_cluster_labels.json', 'r') as openfile:
    json_object = json.load(openfile)
    for entry in json_object:
        cluster = str(entry['cluster'])
        percentage = round(entry['percentage'] * 100, 1)
        cluster_to_label[cluster] = f"{percentage}%      {entry['label']}"
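
Since the label file is prepared separately from the clustering run, it can be worth checking that every cluster id has a label before using the mapping. The following is a minimal sketch of such a check; the helper name missing_labels and the toy mapping are hypothetical, and it assumes cluster ids are the strings '1' to 'N' as used in this notebook.

```python
def missing_labels(num_clusters, cluster_to_label):
    """Return the cluster ids ('1'..'N') that have no entry in the label mapping."""
    expected_ids = [str(i) for i in range(1, num_clusters + 1)]
    return [cid for cid in expected_ids if cid not in cluster_to_label]


# Hypothetical toy mapping: labels exist for clusters 1, 2 and 4 only.
toy_labels = {"1": "10.0%      praise", "2": "7.5%      insults", "4": "3.1%      emoji"}
unlabeled = missing_labels(5, toy_labels)
print(unlabeled)
```

In the notebook this would be called as missing_labels(len(clusters), cluster_to_label); an empty list means every cluster can be mapped to a label.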

In [77]:
# create list of assigned cluster for each sentence
corpus_cluster = ['0']*len(corpus_sentences)

for i, cluster in enumerate(clusters):
    for sentence_id in cluster:
        corpus_cluster[sentence_id] = str(i+1)
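
The loop above inverts the list of clusters into one cluster id per sentence, with '0' meaning "no cluster". A small self-contained sketch with hypothetical toy data shows the resulting shape and how to count the sentences per id:

```python
from collections import Counter

# Hypothetical toy data: five sentences, two clusters given as lists of sentence indices.
toy_sentences = ["a", "b", "c", "d", "e"]
toy_clusters = [[0, 3], [1]]

# '0' marks sentences that belong to no cluster; real cluster ids start at '1'.
assignment = ["0"] * len(toy_sentences)
for i, cluster in enumerate(toy_clusters):
    for sentence_id in cluster:
        assignment[sentence_id] = str(i + 1)

counts = Counter(assignment)
print(assignment)
print(counts)
```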

In [78]:
import pandas as pd

In [79]:
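# Project the embeddings to 2D for visualization
from sklearn.manifold import TSNE

# corpus_embeddings is a torch tensor (convert_to_tensor=True); move it to the CPU
# and convert it to a NumPy array before passing it to scikit-learn
X = corpus_embeddings.cpu().numpy()

# reduce dimensionality
X_embedded = TSNE(n_components=2).fit_transform(X)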

In [80]:
X_embedded

array([[-51.62401 ,  57.033005],
       [-16.489244,  21.888742],
       [ 11.578374,  37.396584],
       ...,
       [ -8.158647, -72.6615  ],
       [ 84.16173 , -34.671124],
       [-23.970442,  40.86671 ]], dtype=float32)

In [82]:
# remove ones with no cluster
new_corpus_sentences = []
new_corpus_cluster = []
new_X_embedded = []

for i, cluster in enumerate(corpus_cluster):
    if cluster != '0':
        new_corpus_sentences.append(corpus_sentences[i])
        # cluster is the cluster id as a string; map it to its label name.
        # The label file may not cover every cluster (the original run raised
        # KeyError: '81' here), so fall back to a generic name for unlabeled clusters.
        new_corpus_cluster.append(cluster_to_label.get(cluster, 'Cluster ' + cluster))

        new_X_embedded.append(X_embedded[i])

In [None]:
# create a dataframe of the text and their label (cluster)
df_embeddings = pd.DataFrame(new_X_embedded)
df_embeddings = df_embeddings.rename(columns={0:'x',1:'y'})
df_embeddings = df_embeddings.assign(Topic=new_corpus_cluster)


In [66]:
df_embeddings = df_embeddings.assign(text=new_corpus_sentences)

In [54]:
print(len(new_corpus_sentences))

6420
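
To put that count in context, the share of the corpus that ended up in any cluster can be computed directly from the cluster lists. This is a minimal sketch with hypothetical toy data; the helper name clustered_share is not part of any library, and it assumes no sentence appears in more than one cluster.

```python
def clustered_share(clusters, corpus_size):
    """Fraction of the corpus that belongs to some cluster (assumes disjoint clusters)."""
    clustered = sum(len(cluster) for cluster in clusters)
    return clustered / corpus_size if corpus_size else 0.0


# Hypothetical toy example: 5 of 10 sentences are clustered.
toy_clusters = [[0, 3, 4], [1, 7]]
share = clustered_share(toy_clusters, corpus_size=10)
print(f"{share:.1%}")
```

With the counts printed in this notebook (6420 clustered sentences, 41287 embeddings), the share would be roughly 15.5%, assuming both numbers come from the same run.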


In [23]:
%pip install plotly


In [84]:
# display embedding
import plotly.express as px

fig = px.scatter(
    df_embeddings, x='x', y='y',
    color='Topic', labels={'color': 'Topic'},
    hover_data=['text'],
    title='Topic cluster of Facebook posts/comments related to Pia Ranada-Robles')
fig.show()
fig.write_html("../html/topic_clusters.html")