### Fast Clustering
The fast clustering algorithm is tuned for large datasets (50k sentences in less than 5 seconds). It searches a large list of sentences for local communities, where a local community is a set of highly similar sentences.

You can configure the cosine-similarity threshold above which two sentences are considered similar, and you can specify the minimum size of a local community. Together, these two settings let you obtain either large coarse-grained clusters or small fine-grained clusters.
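
To make the roles of the two settings concrete, here is a minimal pure-Python sketch of threshold-based community detection on toy 2-D vectors. It is a simplified illustration of the idea and not the library's implementation; the names `cosine_similarity`, `simple_community_detection`, and `toy_embeddings` are hypothetical and exist only for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def simple_community_detection(embeddings, threshold=0.5, min_community_size=2):
    """Greedy sketch (hypothetical helper): group items whose similarity
    to a candidate center is at or above `threshold`."""
    n = len(embeddings)

    # For every item, collect all items that are similar enough to it.
    candidates = []
    for i in range(n):
        members = [
            j for j in range(n)
            if cosine_similarity(embeddings[i], embeddings[j]) >= threshold
        ]
        if len(members) >= min_community_size:
            candidates.append(members)

    # Prefer the largest candidate communities.
    candidates.sort(key=len, reverse=True)

    # Keep only members that were not already claimed by a larger community.
    assigned = set()
    communities = []
    for members in candidates:
        remaining = [j for j in members if j not in assigned]
        if len(remaining) >= min_community_size:
            communities.append(remaining)
            assigned.update(remaining)
    return communities


# Toy data (an assumption for illustration): two tight groups and one outlier.
toy_embeddings = [
    [1.0, 0.0], [0.9, 0.1], [0.8, 0.2],   # group A
    [0.0, 1.0], [0.1, 0.9],               # group B
    [-1.0, -1.0],                         # outlier
]

print(simple_community_detection(toy_embeddings, threshold=0.5, min_community_size=2))
print(simple_community_detection(toy_embeddings, threshold=0.2, min_community_size=2))
print(simple_community_detection(toy_embeddings, threshold=0.5, min_community_size=3))
```

On this toy data, a threshold of 0.5 separates the two groups, lowering it to 0.2 merges them into one coarse community, and raising `min_community_size` to 3 drops the two-item group. The outlier is never assigned to any community.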

In [None]:
from sentence_transformers import SentenceTransformer, util
import pandas as pd

In [None]:
model = SentenceTransformer('all-MiniLM-L6-v2')

In [5]:
# here we use the Quora Duplicate Questions dataset
# http://qim.fs.quoracdn.net/quora_duplicate_questions.tsv
df = pd.read_csv('quora_duplicate_questions.tsv', sep='\t')
df.shape

(404290, 6)

In [6]:
df.head()

| | id | qid1 | qid2 | question1 | question2 | is_duplicate |
|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 2 | What is the step by step guide to invest in sh... | What is the step by step guide to invest in sh... | 0 |
| 1 | 1 | 3 | 4 | What is the story of Kohinoor (Koh-i-Noor) Dia... | What would happen if the Indian government sto... | 0 |
| 2 | 2 | 5 | 6 | How can I increase the speed of my internet co... | How can Internet speed be increased by hacking... | 0 |
| 3 | 3 | 7 | 8 | Why am I mentally very lonely? How can I solve... | Find the remainder when [math]23^{24}[/math] i... | 0 |
| 4 | 4 | 9 | 10 | Which one dissolve in water quikly sugar, salt... | Which fish would survive in salt water? | 0 |


In [7]:
# here we only use the first 5k questions for fast computation
sentences = df['question1'].tolist()[:5000]
len(sentences)

5000

In [None]:
corpus_embeddings = model.encode(sentences, batch_size=64, show_progress_bar=True)

In [9]:
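# Fast clustering
# threshold=0.5: cosine similarity required to treat two sentences as similar
# min_community_size=5: communities with fewer than 5 sentences are discarded
clusters = util.community_detection(corpus_embeddings, min_community_size=5, threshold=0.5)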

In [10]:
# number of clusters
len(clusters)

217
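
Each cluster is a list of indices into `sentences`, and sentences that do not belong to any community are simply absent from the result. If a per-sentence label is more convenient, the following sketch flattens such index lists into one label per sentence. The helper name `clusters_to_labels`, the `-1` marker for unassigned sentences, and the toy input are assumptions made for this example.

```python
def clusters_to_labels(clusters, num_items, unassigned=-1):
    """Turn a list of index lists into one cluster label per item.

    Items that appear in no cluster keep the `unassigned` label.
    """
    labels = [unassigned] * num_items
    for cluster_id, members in enumerate(clusters):
        for idx in members:
            labels[idx] = cluster_id
    return labels


# Toy input (hypothetical): 7 items, two clusters, items 3 and 6 unassigned.
toy_clusters = [[0, 2, 5], [1, 4]]
labels = clusters_to_labels(toy_clusters, num_items=7)

# Fraction of items that ended up in some cluster.
coverage = sum(label != -1 for label in labels) / len(labels)
print(labels, round(coverage, 2))
```

With the real data, the same helper could be called as `clusters_to_labels(clusters, len(sentences))` to see how many of the 5000 questions were assigned to a cluster.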

In [11]:
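# Print the first three questions of every cluster
for i, cluster in enumerate(clusters):
    print("\nCluster {}, total {} Questions".format(i + 1, len(cluster)))
    for sentence_id in cluster[0:3]:  # avoid shadowing the built-in `id`
        print("\t", sentences[sentence_id])
    print("\t", "...")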


Cluster 1, total 48 Questions
	 How will demonetization of ‎₹1000 and ‎₹500 notes will help curb the rampant black currency in India?
	 How does banning 500 & 1000 rupee notes solve black money problem?
	 How exactly does banning Rs 500 and Rs 1000 notes curb the problem of black money?
	 ...

Cluster 2, total 42 Questions
	 How do you start making money?
	 How you make money?
	 How do we start a business?
	 ...

Cluster 3, total 41 Questions
	 Which is the best institute to learn digital marketing (job oriented) in India?
	 Which is the best digital marketing institution in banglore?
	 What is the best certification course to learn digital marketing?
	 ...

Cluster 4, total 38 Questions
	 How can learn English?
	 How I can speak English fluently?
	 What can I do to practice my English?
	 ...

Cluster 5, total 32 Questions
	 Which were some of the bad experiences you had in life?
	 Which moment in your life changed you completely?
	 What do you think you were in a past life and why?
	