# Overview

In the notebook [Agglomerative with Sentence Transformers](https://www.kaggle.com/code/aisuko/agglomerative-with-sentence-transformers), we split sentences into a number of clusters. However, [Agglomerative Clustering](https://www.kaggle.com/code/aisuko/agglomerative-clustering) is quite slow on larger datasets, so it is only practical for perhaps a few thousand sentences. In this notebook, we present a clustering algorithm that is tuned for large datasets (50k sentences in less than 5 seconds). It searches the large list of sentences for local communities, where a local community is a set of highly similar sentences.

We can configure the threshold of cosine-similarity for which we consider two sentences as similar. Also, we can specify the minimal size for a local community. This allows us to get either large coarse-grained clusters or small fine-grained clusters. 
* A high threshold will only find extremely similar sentences; a lower threshold will find more sentences that are less similar.
* A second parameter is `min_community_size`: only communities with at least a certain number of sentences will be returned.
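
To make the effect of the threshold concrete, here is a minimal sketch using hand-picked similarity scores. The scores and the helper `similar_pairs` are illustrative assumptions, not output from a real model.

```python
# Hypothetical cosine-similarity scores between four sentences (made up for illustration).
scores = {
    (0, 1): 0.92,
    (0, 2): 0.78,
    (0, 3): 0.31,
    (1, 2): 0.74,
    (1, 3): 0.28,
    (2, 3): 0.35,
}

def similar_pairs(scores, threshold):
    """Return the sentence pairs whose similarity is at least `threshold`."""
    return sorted(pair for pair, score in scores.items() if score >= threshold)

print(similar_pairs(scores, 0.90))  # [(0, 1)]
print(similar_pairs(scores, 0.70))  # [(0, 1), (0, 2), (1, 2)]
```

With the higher threshold only the near-duplicate pair survives, while the lower threshold also links sentence 2 to the other two, which is what allows larger, coarser communities to form.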

In [1]:
!pip install sentence-transformers==2.3.1
!pip install datasets==2.15.0


# Loading dataset

We will use the dataset `aisuko/quora_questions_raw`, which keeps the CPU requirements low. If you have more powerful CPU resources, try the complete pre-processed embeddings of all the questions in the dataset `aisuko/quora_questions`.

In [2]:
from datasets import load_dataset

dataset = load_dataset("aisuko/quora_questions_raw")
dataset


DatasetDict({
    train: Dataset({
        features: ['id', 'qid1', 'qid2', 'question1', 'question2', 'is_duplicate'],
        num_rows: 404290
    })
})

In [3]:
ds=dataset['train'].remove_columns(['id','qid1','qid2','is_duplicate'])
ds

Dataset({
    features: ['question1', 'question2'],
    num_rows: 404290
})

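We intend to limit our corpus to the first 50k questions. Note that the loop below adds every unique `question1` entry before it re-checks the limit, so the resulting corpus holds 290,457 sentences rather than 50,000.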

In [4]:
max_corpus_size=50000

# Get all unique sentences from the file
corpus_sentences = set()

num=len(ds['question1'])
num2=len(ds['question2'])

while len(corpus_sentences)<max_corpus_size:
    if num>0:
        for i in ds['question1']:
            corpus_sentences.add(i)
            num-=1
    elif num2>0:
        for i in ds['question2']:
            corpus_sentences.add(i)
            num2-=1
    break

corpus_sentences=list(corpus_sentences)
len(corpus_sentences)

290457
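
The count above shows that the 50k cap was not applied. One way to enforce it is sketched here. `collect_unique` is a hypothetical helper, and it assumes the columns are plain lists of strings, as `ds['question1']` and `ds['question2']` are.

```python
from itertools import chain

def collect_unique(columns, max_size):
    """Collect unique sentences in first-seen order, stopping at max_size."""
    seen = set()
    result = []
    for sentence in chain.from_iterable(columns):
        if sentence not in seen:
            seen.add(sentence)
            result.append(sentence)
            # Stop as soon as the cap is reached.
            if len(result) >= max_size:
                break
    return result

# Toy stand-ins for ds['question1'] and ds['question2'].
question1 = ["a?", "b?", "a?", "c?"]
question2 = ["b?", "d?", "e?", "f?"]
corpus = collect_unique([question1, question2], max_size=5)
print(corpus)  # ['a?', 'b?', 'c?', 'd?', 'e?']
```

In this notebook the call would look like `collect_unique([ds['question1'], ds['question2']], max_corpus_size)`. Keeping first-seen order, rather than converting a `set` to a list, also makes the corpus order reproducible between runs.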

# Loading the model

In [5]:
from sentence_transformers import SentenceTransformer

model=SentenceTransformer('all-MiniLM-L6-v2').to('cuda')
model.max_seq_length=256
model

SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False})
  (2): Normalize()
)

In [6]:
# corpus_sentences=list(corpus_sentences)
corpus_embeddings=model.encode(corpus_sentences, batch_size=64, show_progress_bar=True, convert_to_tensor=True, normalize_embeddings=True)
len(corpus_embeddings)

290457
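
Since `normalize_embeddings=True` rescales every embedding to unit length, the cosine similarity between two embeddings equals their dot product. The following sketch checks this on toy vectors in plain Python. The helper names are illustrative only.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    """Scale a vector to unit (L2) length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(u, v):
    """Cosine similarity computed from the full formula."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

u, v = [3.0, 4.0], [4.0, 3.0]
print(round(cosine(u, v), 6))                     # 0.96
print(round(dot(normalize(u), normalize(v)), 6))  # 0.96
```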

# Start clustering

Here, we tune two parameters:

**min_community_size**

Only consider clusters that have at least 25 elements.

**threshold**

Consider sentence pairs with a cosine similarity larger than the threshold as similar.
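
The interplay of the two parameters can be illustrated with a simplified greedy procedure on a precomputed similarity matrix. This is only a sketch of the idea in plain Python, not the implementation inside `sentence_transformers.util.community_detection`. The function `toy_communities` and the matrix values are assumptions made for illustration.

```python
def toy_communities(sim, threshold, min_community_size):
    """Greedy, non-overlapping communities from a square similarity matrix."""
    # Candidate community for each sentence: every sentence at least `threshold` similar to it.
    candidates = []
    for row in sim:
        members = [j for j, score in enumerate(row) if score >= threshold]
        if len(members) >= min_community_size:
            candidates.append(members)

    # Largest candidates first; each sentence joins at most one community.
    candidates.sort(key=len, reverse=True)
    assigned = set()
    communities = []
    for members in candidates:
        free = [j for j in members if j not in assigned]
        if len(free) >= min_community_size:
            communities.append(free)
            assigned.update(free)
    return communities

# Made-up similarities: sentences 0-2 are close, 3-4 are close, 5 is an outlier.
sim = [
    [1.0, 0.9, 0.8, 0.1, 0.2, 0.1],
    [0.9, 1.0, 0.85, 0.2, 0.1, 0.3],
    [0.8, 0.85, 1.0, 0.1, 0.3, 0.2],
    [0.1, 0.2, 0.1, 1.0, 0.8, 0.2],
    [0.2, 0.1, 0.3, 0.8, 1.0, 0.1],
    [0.1, 0.3, 0.2, 0.2, 0.1, 1.0],
]
print(toy_communities(sim, threshold=0.75, min_community_size=2))  # [[0, 1, 2], [3, 4]]
print(toy_communities(sim, threshold=0.75, min_community_size=3))  # [[0, 1, 2]]
```

Raising `min_community_size` from 2 to 3 drops the two-sentence community, and the outlier sentence is never returned in either case.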

In [7]:
import time
from sentence_transformers.util import community_detection

print('Start clustering')
start_time=time.time()
 
clusters=community_detection(corpus_embeddings, min_community_size=25, threshold=0.75,show_progress_bar=True)
print('Clustering done after {:.2f} sec'.format(time.time()-start_time))

Start clustering


Clustering done after 13.28 sec


Here we print the top 3 and bottom 3 elements of each of the first 10 clusters.

In [8]:
for i, cluster in enumerate(clusters[:10]):
    print('\nCluster {}, #{} Elements'.format(i+1, len(cluster)))
    for sentence_id in cluster[0:3]:
        print('\t', corpus_sentences[sentence_id])
    print('\t','...')
    for sentence_id in cluster[-3:]:
        print('\t', corpus_sentences[sentence_id])


Cluster 1, #459 Elements
	 What are some things new employees should know going into their first day at Atricure?
	 What are some things new employees should know going into their first day at Loews?
	 What are some things new employees should know going into their first day at Receptos?
	 ...
	 What are some things new employees should know going into their first day at Electronics for Imaging?
	 What are some things new employees should know going into their first day at Select Income REIT?
	 What are some things new employees should know going into their first day at Immunomedics?

Cluster 2, #326 Elements
	 What is a good inpatient drug and alcohol rehab center in Greene County AR?
	 What is a good inpatient drug and alcohol rehab center in Franklin County AR?
	 What is a good inpatient drug and alcohol rehab center in Logan County AR?
	 ...
	 Is there an inpatient Drug and Alcohol Rehab Center in San Francisco County California?
	 Is there an inpatient Drug and Alcohol Rehab Cent