<a href="https://colab.research.google.com/github/PrajwalRaut8/SBERT---Sentence-Transformers/blob/main/SBERT_Sentence_Transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
%pip install -U sentence-transformers


In [None]:
from sentence_transformers import SentenceTransformer, util

In [None]:
model = SentenceTransformer('all-MiniLM-L6-v2')

In [None]:
sentences = ['The cat sits outside', 'the new movie is cool', 'the new movie is fantastic', 'the dog bark on strangers']

In [None]:
embeddings = model.encode(sentences = sentences, convert_to_tensor=True)

In [None]:
for sent,embed in zip(sentences, embeddings):
  print("Sentence:", sent)
  print("Len(Embeddings):", len(embed))
  print()

Sentence: The cat sits outside
Len(Embeddings): 384

Sentence: the new movie is cool
Len(Embeddings): 384

Sentence: the new movie is fantastic
Len(Embeddings): 384

Sentence: the dog bark on strangers
Len(Embeddings): 384



In [None]:
cosine_scores = util.cos_sim(embeddings, embeddings)

In [None]:
cosine_scores

tensor([[ 1.0000, -0.0066, -0.0226,  0.2008],
        [-0.0066,  1.0000,  0.7770,  0.1504],
        [-0.0226,  0.7770,  1.0000,  0.1815],
        [ 0.2008,  0.1504,  0.1815,  1.0000]])
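
To help read the matrix above, here is a minimal pure-Python sketch of the cosine similarity that `util.cos_sim` computes for each pair of embeddings. The toy 3-dimensional vectors and the helper names `cosine_similarity` and `cos_sim_matrix` are illustrative assumptions, not library functions; the real embeddings from this model have 384 dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cos_sim_matrix(vectors):
    """Pairwise cosine similarities, one row and one column per vector."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]

# Hypothetical toy vectors standing in for sentence embeddings.
toy_vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
]

matrix = cos_sim_matrix(toy_vectors)
for row in matrix:
    print(["{:.4f}".format(score) for score in row])
```

The diagonal is 1 because every vector is identical to itself, and the matrix is symmetric, which matches the shape of the output above.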

In [None]:
sentences

['The cat sits outside',
 'the new movie is cool',
 'the new movie is fantastic',
 'the dog bark on strangers']

In [None]:
paraphrases = util.paraphrase_mining(model, sentences)

In [None]:
for sim in paraphrases[0:10]:
  score, i, j = sim
  print(sentences[i], "<>", sentences[j], "-->", score)

the new movie is cool <> the new movie is fantastic --> 0.7770369052886963
The cat sits outside <> the dog bark on strangers --> 0.20075498521327972
the new movie is fantastic <> the dog bark on strangers --> 0.18148058652877808
the new movie is cool <> the dog bark on strangers --> 0.1504034698009491
The cat sits outside <> the new movie is cool --> -0.006579317152500153
The cat sits outside <> the new movie is fantastic --> -0.02260509878396988
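
Conceptually, the ranking above is every unique sentence pair sorted by cosine score. The sketch below reproduces that ordering by brute force from the rounded scores printed earlier in this notebook. It only illustrates the idea and is not how `util.paraphrase_mining` is implemented; the library may process large corpora differently.

```python
from itertools import combinations

# Cosine scores copied (rounded) from the matrix printed earlier in this notebook.
scores = [
    [1.0000, -0.0066, -0.0226, 0.2008],
    [-0.0066, 1.0000, 0.7770, 0.1504],
    [-0.0226, 0.7770, 1.0000, 0.1815],
    [0.2008, 0.1504, 0.1815, 1.0000],
]
sentences = [
    'The cat sits outside',
    'the new movie is cool',
    'the new movie is fantastic',
    'the dog bark on strangers',
]

# Score each unique pair (i < j) once, then sort from most to least similar.
pairs = sorted(
    ((scores[i][j], i, j) for i, j in combinations(range(len(sentences)), 2)),
    reverse=True,
)

for score, i, j in pairs:
    print(sentences[i], "<>", sentences[j], "-->", score)
```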


## **Semantic Search**

In [None]:
from sentence_transformers import SentenceTransformer, util
import torch

In [None]:
model = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')


In [None]:
import requests

In [None]:
response = requests.get('https://raw.githubusercontent.com/laxmimerit/machine-learning-dataset/master/text-dataset-for-machine-learning/sbert-corpus.txt')
corpus = response.text.split('\r\n')

response = requests.get('https://raw.githubusercontent.com/laxmimerit/machine-learning-dataset/master/text-dataset-for-machine-learning/sbert-queries.txt')
queries = response.text.split('\r\n')

In [None]:
print(corpus)

['A man is eating food.', 'A man is eating a piece of bread.', 'A man is eating pasta.', 'The girl is carrying a baby.', 'The baby is carried by the woman', 'A man is riding a horse.', 'A man is riding a white horse on an enclosed ground.', 'A monkey is playing drums.', 'Someone in a gorilla costume is playing a set of drums.', 'A cheetah is running behind its prey.', 'A cheetah chases prey on across a field.']


In [None]:
print(queries)

['A man is eating pasta.', 'Someone in a gorilla costume is playing a set of drums.', 'A cheetah chases prey on across a field.']


In [None]:
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
queries_embeddings = model.encode(queries, convert_to_tensor=True)

In [None]:
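# Normalize the embeddings to unit length so that the dot product equals cosine similarity
corpus_embeddings = util.normalize_embeddings(corpus_embeddings)
# Assign back to queries_embeddings (the original wrote to an unused name, queries_embedding)
queries_embeddings = util.normalize_embeddings(queries_embeddings)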

In [None]:
len(corpus_embeddings[0])

384
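
A small pure-Python check of why normalization allows the cheaper dot product to be used as the score: for unit-length vectors, the dot product and the cosine similarity agree. The vectors and helper names here are illustrative assumptions.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(vector):
    """Scale a vector so that its L2 norm is 1."""
    norm = math.sqrt(dot(vector, vector))
    return [x / norm for x in vector]

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Hypothetical toy vectors; real embeddings have 384 dimensions.
a = [3.0, 4.0]
b = [4.0, 3.0]

cosine_score = cosine(a, b)
dot_after_normalizing = dot(normalize(a), normalize(b))
print(cosine_score, dot_after_normalizing)
```

Both values agree up to floating-point rounding, so once the embeddings are normalized, `util.dot_score` ranks results the same way cosine similarity would.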

In [None]:
hits = util.semantic_search(queries_embeddings, corpus_embeddings, score_function = util.dot_score, top_k = 3)

In [None]:
hits

[[{'corpus_id': 2, 'score': 0.9999999403953552},
  {'corpus_id': 0, 'score': 0.8384666442871094},
  {'corpus_id': 1, 'score': 0.7468275427818298}],
 [{'corpus_id': 8, 'score': 1.0},
  {'corpus_id': 7, 'score': 0.7612733840942383},
  {'corpus_id': 3, 'score': 0.3815287947654724}],
 [{'corpus_id': 10, 'score': 1.0},
  {'corpus_id': 9, 'score': 0.8703994750976562},
  {'corpus_id': 6, 'score': 0.37411707639694214}]]

In [None]:
for query, hit in zip(queries, hits):
  for q_hit in hit:
    corpus_id = q_hit['corpus_id']  # avoid shadowing the built-in id()
    score = q_hit['score']

    print(query, "<>", corpus[corpus_id], "-->", score)

  print()

A man is eating pasta. <> A man is eating pasta. --> 0.9999999403953552
A man is eating pasta. <> A man is eating food. --> 0.8384666442871094
A man is eating pasta. <> A man is eating a piece of bread. --> 0.7468275427818298

Someone in a gorilla costume is playing a set of drums. <> Someone in a gorilla costume is playing a set of drums. --> 1.0
Someone in a gorilla costume is playing a set of drums. <> A monkey is playing drums. --> 0.7612733840942383
Someone in a gorilla costume is playing a set of drums. <> The girl is carrying a baby. --> 0.3815287947654724

A cheetah chases prey on across a field. <> A cheetah chases prey on across a field. --> 1.0
A cheetah chases prey on across a field. <> A cheetah is running behind its prey. --> 0.8703994750976562
A cheetah chases prey on across a field. <> A man is riding a white horse on an enclosed ground. --> 0.37411707639694214
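
The hit lists above have a simple structure: score every corpus entry against the query, then keep the `top_k` best. Below is a minimal sketch of that idea with toy 2-dimensional unit vectors. `top_k_hits` is a hypothetical helper written for illustration and is not the implementation of `util.semantic_search`.

```python
import heapq

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def top_k_hits(query_vector, corpus_vectors, top_k=3):
    """Return the top_k corpus entries by dot score, best first."""
    hits = [
        {"corpus_id": idx, "score": dot(query_vector, vec)}
        for idx, vec in enumerate(corpus_vectors)
    ]
    return heapq.nlargest(top_k, hits, key=lambda hit: hit["score"])

# Hypothetical unit-length vectors standing in for normalized embeddings.
toy_corpus = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0], [-1.0, 0.0]]
toy_query = [0.8, 0.6]

toy_hits = top_k_hits(toy_query, toy_corpus, top_k=3)
for hit in toy_hits:
    print(hit)
```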



## K-Means Clustering

In [None]:
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

In [None]:
model = SentenceTransformer('all-MiniLM-L6-v2')


In [None]:
import requests
response = requests.get('https://raw.githubusercontent.com/laxmimerit/machine-learning-dataset/master/text-dataset-for-machine-learning/sbert-corpus.txt')
corpus = response.text.split('\r\n')

In [None]:
print(len(corpus))
print(corpus)

11
['A man is eating food.', 'A man is eating a piece of bread.', 'A man is eating pasta.', 'The girl is carrying a baby.', 'The baby is carried by the woman', 'A man is riding a horse.', 'A man is riding a white horse on an enclosed ground.', 'A monkey is playing drums.', 'Someone in a gorilla costume is playing a set of drums.', 'A cheetah is running behind its prey.', 'A cheetah chases prey on across a field.']

In [None]:
corpus_embeddings = model.encode(corpus)

In [None]:
num_clusters = 5
clustering_model = KMeans(n_clusters= num_clusters)
clustering_model.fit(corpus_embeddings)
cluster_assignment = clustering_model.labels_



In [None]:
cluster_assignment

array([1, 1, 1, 0, 0, 4, 4, 3, 3, 2, 2], dtype=int32)

In [None]:
clustered_sentences = [[] for i in range(num_clusters)]
clustered_sentences

[[], [], [], [], []]

In [None]:
for sentence_id, cluster_id in enumerate(cluster_assignment):
  clustered_sentences[cluster_id].append(corpus[sentence_id])


In [None]:
for i, cluster in enumerate(clustered_sentences):
  print("Cluster ", i+1)
  print(cluster)
  print()

Cluster  1
['The girl is carrying a baby.', 'The baby is carried by the woman']

Cluster  2
['A man is eating food.', 'A man is eating a piece of bread.', 'A man is eating pasta.']

Cluster  3
['A cheetah is running behind its prey.', 'A cheetah chases prey on across a field.']

Cluster  4
['A monkey is playing drums.', 'Someone in a gorilla costume is playing a set of drums.']

Cluster  5
['A man is riding a horse.', 'A man is riding a white horse on an enclosed ground.']
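
An alternative way to group sentences by label is a `defaultdict`, which does not need the number of clusters up front. This sketch reuses the corpus and the label array printed above as plain Python lists, so it runs without the model. Note that K-Means label numbering can change between runs unless a `random_state` is fixed.

```python
from collections import defaultdict

corpus = [
    'A man is eating food.',
    'A man is eating a piece of bread.',
    'A man is eating pasta.',
    'The girl is carrying a baby.',
    'The baby is carried by the woman',
    'A man is riding a horse.',
    'A man is riding a white horse on an enclosed ground.',
    'A monkey is playing drums.',
    'Someone in a gorilla costume is playing a set of drums.',
    'A cheetah is running behind its prey.',
    'A cheetah chases prey on across a field.',
]
# Labels copied from the cluster_assignment output shown above.
cluster_assignment = [1, 1, 1, 0, 0, 4, 4, 3, 3, 2, 2]

clusters = defaultdict(list)
for sentence, cluster_id in zip(corpus, cluster_assignment):
    clusters[cluster_id].append(sentence)

for cluster_id in sorted(clusters):
    print("Cluster", cluster_id + 1)
    print(clusters[cluster_id])
    print()
```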



## Agglomerative Clustering

In [None]:
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering
import numpy as np

In [None]:
model = SentenceTransformer('all-MiniLM-L6-v2')

In [None]:
import requests
response = requests.get('https://raw.githubusercontent.com/laxmimerit/machine-learning-dataset/master/text-dataset-for-machine-learning/sbert-corpus.txt')
corpus = response.text.split('\r\n')

In [None]:
print(corpus)

['A man is eating food.', 'A man is eating a piece of bread.', 'A man is eating pasta.', 'The girl is carrying a baby.', 'The baby is carried by the woman', 'A man is riding a horse.', 'A man is riding a white horse on an enclosed ground.', 'A monkey is playing drums.', 'Someone in a gorilla costume is playing a set of drums.', 'A cheetah is running behind its prey.', 'A cheetah chases prey on across a field.']


In [None]:
corpus_embeddings = model.encode(corpus)

In [None]:
corpus_embeddings = corpus_embeddings/np.linalg.norm(corpus_embeddings, axis=1, keepdims=True)

In [None]:
clustering_model = AgglomerativeClustering(n_clusters=None, distance_threshold=1.5)
clustering_model.fit(corpus_embeddings)
cluster_assignment = clustering_model.labels_

In [None]:
cluster_assignment

array([0, 0, 0, 4, 4, 1, 1, 2, 2, 3, 3])

In [None]:
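# Size the list from the fitted model rather than reusing num_clusters from the K-Means section
clustered_sentences = [[] for _ in range(clustering_model.n_clusters_)]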
for sentence_id, cluster_id in enumerate(cluster_assignment):
  clustered_sentences[cluster_id].append(corpus[sentence_id])

for i, cluster in enumerate(clustered_sentences):
  print("Cluster ", i+1)
  print(cluster)
  print()

Cluster  1
['A man is eating food.', 'A man is eating a piece of bread.', 'A man is eating pasta.']

Cluster  2
['A man is riding a horse.', 'A man is riding a white horse on an enclosed ground.']

Cluster  3
['A monkey is playing drums.', 'Someone in a gorilla costume is playing a set of drums.']

Cluster  4
['A cheetah is running behind its prey.', 'A cheetah chases prey on across a field.']

Cluster  5
['The girl is carrying a baby.', 'The baby is carried by the woman']



## Fast Clustering

In [None]:
from sentence_transformers import SentenceTransformer, util

In [None]:
import pandas as pd
import time

In [None]:
model = SentenceTransformer('all-MiniLM-L6-v2')


In [None]:
import requests

url = 'http://qim.fs.quoracdn.net/quora_duplicate_questions.tsv'
response = requests.get(url)

if response.status_code == 200:
    with open('quora_duplicate_questions.tsv', 'wb') as file:
        file.write(response.content)
else:
    print('Failed to download the file. Status code:', response.status_code)


In [None]:
import pandas as pd

df = pd.read_csv('quora_duplicate_questions.tsv', sep='\t')
df.shape


(404290, 6)

In [None]:
df.head()

| | id | qid1 | qid2 | question1 | question2 | is_duplicate |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 1 | 2 | What is the step by step guide to invest in sh... | What is the step by step guide to invest in sh... | 0 |
| 1 | 1 | 3 | 4 | What is the story of Kohinoor (Koh-i-Noor) Dia... | What would happen if the Indian government sto... | 0 |
| 2 | 2 | 5 | 6 | How can I increase the speed of my internet co... | How can Internet speed be increased by hacking... | 0 |
| 3 | 3 | 7 | 8 | Why am I mentally very lonely? How can I solve... | Find the remainder when [math]23^{24}[/math] i... | 0 |
| 4 | 4 | 9 | 10 | Which one dissolve in water quikly sugar, salt... | Which fish would survive in salt water? | 0 |


In [None]:
sentences = df['question1'].tolist()[:1000]
len(sentences)

1000

In [None]:
corpus_embeddings = model.encode(sentences, batch_size=64, show_progress_bar = True)


In [None]:
import torch
from sentence_transformers import SentenceTransformer, util

# Convert 'corpus_embeddings' to a PyTorch tensor
corpus_embeddings = torch.tensor(corpus_embeddings)

# Perform community detection
clusters = util.community_detection(corpus_embeddings, min_community_size=5, threshold=0.5)


In [None]:
clusters

[[92, 103, 304, 607, 688, 723, 777, 870, 919, 978],
 [105, 199, 295, 321, 439, 675, 689, 877, 907],
 [28, 78, 273, 284, 564, 647, 784, 945],
 [79, 299, 549, 590, 725, 726, 733],
 [100, 140, 287, 598, 618, 669],
 [93, 263, 401, 544, 930, 957],
 [72, 198, 364, 644, 686, 969],
 [384, 722, 734, 752, 895, 973],
 [49, 302, 566, 591, 967],
 [3, 63, 115, 218, 910],
 [233, 333, 419, 422, 425],
 [317, 502, 532, 608, 852],
 [219, 540, 703, 742, 858],
 [175, 612, 796, 926, 996]]
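
To give a feel for what threshold-based community detection does, here is a simplified greedy sketch over a precomputed similarity matrix. It is an assumption-laden illustration, not the implementation of `util.community_detection`: the function name, the toy matrix, and the overlap rule (keep only members not already claimed by a larger community) are choices made for this example.

```python
def simple_community_detection(sim, threshold=0.5, min_community_size=2):
    """Greedy threshold clustering over a square similarity matrix."""
    # For each item, collect every item whose similarity meets the threshold.
    candidates = []
    for row in sim:
        members = [j for j, score in enumerate(row) if score >= threshold]
        if len(members) >= min_community_size:
            candidates.append(members)

    # Consider the largest candidate communities first.
    candidates.sort(key=len, reverse=True)

    # Keep only members that no earlier (larger) community has claimed.
    assigned = set()
    communities = []
    for members in candidates:
        free = [j for j in members if j not in assigned]
        if len(free) >= min_community_size:
            communities.append(sorted(free))
            assigned.update(free)
    return communities

# Hypothetical similarities: items 0-2 are mutually close, as are items 3-4.
toy_sim = [
    [1.0, 0.8, 0.7, 0.1, 0.0],
    [0.8, 1.0, 0.6, 0.2, 0.1],
    [0.7, 0.6, 1.0, 0.0, 0.2],
    [0.1, 0.2, 0.0, 1.0, 0.9],
    [0.0, 0.1, 0.2, 0.9, 1.0],
]

toy_communities = simple_community_detection(toy_sim, threshold=0.5, min_community_size=2)
print(toy_communities)
```

As in the output above, communities come out ordered from largest to smallest, and each is a list of row indices into the original sentence list.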

In [None]:
for i, cluster in enumerate(clusters):
  print('\nCluster {}, #{}Questions'.format(i+1, len(cluster)))
  for id in cluster[0:3]:
    print("\t", sentences[id])
  print("\t.")


Cluster 1, #10Questions
	 What are some of the best romantic movies in English?
	 Which is the best fiction novel of 2016?
	 Which are the best Hollywood thriller movies?
	.

Cluster 2, #9Questions
	 Will the recent demonetisation results in higher GDP? If so how much?
	 What are the effects of demonitization of 500 and 1000 rupees notes on real estate sector?
	 What will be the effect of banning 500 and 1000 notes on stock markets in India?
	.

Cluster 3, #8Questions
	 What is best way to make money online?
	 How can I make money through the Internet?
	 What is the best way to get traffic on your website?
	.

Cluster 4, #7Questions
	 What is purpose of life?
	 What the meaning of this all life?
	 What is the best lesson in life?
	.

Cluster 5, #6Questions
	 Will there really be any war between India and Pakistan over the Uri attack? What will be its effects?
	 What is our stance against Pakistan?
	 If there will be a war between India and Pakistan who will win?
	.

Cluster 6, #6Quest

## Quora Questions Auto-complete Suggester

In [None]:
 !pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113

Looking in indexes: https://pypi.org/simple, https://download.pytorch.org/whl/cu113


In [None]:
import requests

url = 'http://qim.fs.quoracdn.net/quora_duplicate_questions.tsv'
response = requests.get(url)

if response.status_code == 200:
    with open('quora_duplicate_questions.tsv', 'wb') as file:
        file.write(response.content)
else:
    print('Failed to download the file. Status code:', response.status_code)

In [None]:
import pandas as pd

df = pd.read_csv('quora_duplicate_questions.tsv', sep='\t')
df.shape

(404290, 6)

In [None]:
import torch
torch.__version__

'2.1.0+cu118'

In [None]:
import os
import time
import pandas as pd

In [None]:
df.shape

(404290, 6)

In [None]:
df.head()

| | id | qid1 | qid2 | question1 | question2 | is_duplicate |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 1 | 2 | What is the step by step guide to invest in sh... | What is the step by step guide to invest in sh... | 0 |
| 1 | 1 | 3 | 4 | What is the story of Kohinoor (Koh-i-Noor) Dia... | What would happen if the Indian government sto... | 0 |
| 2 | 2 | 5 | 6 | How can I increase the speed of my internet co... | How can Internet speed be increased by hacking... | 0 |
| 3 | 3 | 7 | 8 | Why am I mentally very lonely? How can I solve... | Find the remainder when [math]23^{24}[/math] i... | 0 |
| 4 | 4 | 9 | 10 | Which one dissolve in water quikly sugar, salt... | Which fish would survive in salt water? | 0 |


In [None]:
corpus_sentences = list(set(df['question1'].tolist() + df['question2'].tolist()))
len(corpus_sentences)

537361
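
One caveat with `list(set(...))`: a set has no guaranteed order, and Python randomizes string hashes per process by default, so the `corpus_id` values printed later can differ from run to run. A minimal sketch of an order-preserving alternative follows. `unique_questions` is a hypothetical helper, and skipping non-string entries (for example missing values) is an assumption about what is wanted here.

```python
def unique_questions(*columns):
    """Deduplicate questions while keeping first-seen order."""
    seen = {}  # dicts preserve insertion order
    for column in columns:
        for question in column:
            # Skip non-string entries such as NaN from missing values.
            if isinstance(question, str) and question.strip():
                seen.setdefault(question, None)
    return list(seen)

# Hypothetical stand-ins for df['question1'].tolist() and df['question2'].tolist().
question1 = ["What is Python?", "How do I learn Python?", float("nan")]
question2 = ["How do I learn Python?", "Is Python fast?", "What is Python?"]

toy_corpus_sentences = unique_questions(question1, question2)
print(toy_corpus_sentences)
```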

In [None]:
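# Import here so this section also runs on its own in a fresh kernel
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('quora-distilbert-multilingual')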


In [None]:
corpus_embeddings = model.encode(corpus_sentences, show_progress_bar = True, convert_to_tensor = True)


In [None]:
model._target_device

device(type='cuda')

In [None]:
corpus_embeddings = corpus_embeddings.to(model._target_device)

In [None]:
while True:
  query = input("Please enter a question: ")

  if query == 'n':
    break

  question_embedding = model.encode(query, convert_to_tensor = True)
  hits = util.semantic_search(question_embedding, corpus_embeddings, top_k=5)
  hits = hits[0]
  print(hits)

  for hit in hits:
    print(hit['score'], ": ", corpus_sentences[hit['corpus_id']])

  print('\n\n\n\n')

Please enter a question: what is python
[{'corpus_id': 262583, 'score': 0.9529969096183777}, {'corpus_id': 311197, 'score': 0.9495009183883667}, {'corpus_id': 448567, 'score': 0.9477468729019165}, {'corpus_id': 396237, 'score': 0.9458462595939636}, {'corpus_id': 233730, 'score': 0.9456238150596619}]
0.9529969096183777 :  What do people want to know about python?
0.9495009183883667 :  What is the best python CMS?
0.9477468729019165 :  What is python language?
0.9458462595939636 :  What are the most dangerous kinds of pythons?
0.9456238150596619 :  What's the scope of python programming?





Please enter a question: how to learn python
[{'corpus_id': 207458, 'score': 0.9958191514015198}, {'corpus_id': 73519, 'score': 0.9927080273628235}, {'corpus_id': 23041, 'score': 0.9891510009765625}, {'corpus_id': 9543, 'score': 0.9827979803085327}, {'corpus_id': 354805, 'score': 0.9823888540267944}]
0.9958191514015198 :  How do I learn python?
0.9927080273628235 :  What are some of the best ways to

## Similar Research Paper Recommendation using SBERT

In [None]:
!pip install -U sentence-transformers


In [None]:
from sentence_transformers import SentenceTransformer, util
import os
import json
import requests

In [None]:
response = requests.get('https://sbert.net/datasets/emnlp2016-2018.json')
papers = json.loads(response.text)

In [None]:
len(papers)

974

In [None]:
model = SentenceTransformer('allenai-specter')


In [None]:
paper_texts = [paper['title'] + '[SEP]' + paper['abstract'] for paper in papers]

In [None]:
corpus_embeddings = model.encode(paper_texts, convert_to_tensor=True, show_progress_bar = True)


In [None]:
def search(title, abstract):
  query_embedding = model.encode(title + '[SEP]' + abstract, convert_to_tensor = True)

  search_hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=3)[0]

  print("Most Similar Papers\n")
  for hit in search_hits:
    related_paper = papers[hit['corpus_id']]
    print(related_paper['title'])
    print(related_paper['abstract'])
    print('\n\n')

In [None]:
title = 'a novel method to find out similar documents'
abstract = 'a novel method to find out similar documents'
search(title, abstract)

Most Similar Papers

Quantifying the Effects of Text Duplication on Semantic Models
One of the important factors that make a search engine fast and accurate is a concise and duplicate free index. In order to remove duplicate and near-duplicate documents from the index, a search engine needs a swift and reliable duplicate and near-duplicate text document detection system. Traditional approaches to this problem, such as brute force comparisons or simple hash-based algorithms are not suitable as they are not scalable and are not capable of detecting near-duplicate documents effectively. In this paper, a new signature-based approach to text similarity detection is introduced which is fast, scalable, reliable and needs less storage space. The proposed method is examined on popular text document data-sets such as CiteseerX, Enron, Gold Set of Near-duplicate News Articles and etc. The results are promising and comparable with the best cutting-edge algorithms, considering the accuracy and perf