[Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks](https://arxiv.org/abs/1908.10084)

A siamese bi-encoder passes both inputs through the same network with shared weights, so it learns sentence embeddings that can be compared directly using cosine similarity.
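
As a minimal sketch of the comparison step, cosine similarity can be written in plain Python. The three-dimensional vectors below are hypothetical stand-ins for real embeddings, which have hundreds of dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy "embeddings", not produced by any model.
query = [1.0, 2.0, 0.0]
related = [2.0, 4.0, 0.0]    # points in the same direction as the query
unrelated = [0.0, 0.0, 3.0]  # orthogonal to the query

print(cosine_similarity(query, related))
print(cosine_similarity(query, unrelated))
```

Vectors that point the same way score close to 1 regardless of their length, while orthogonal vectors score 0.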

### Siamese BERT-networks for semantic searching

In [5]:
!pip install -U sentence-transformers



In [6]:
import numpy as np
import requests

from bs4 import BeautifulSoup
from urllib.request import urlopen
from datasets import load_dataset

from sentence_transformers import SentenceTransformer, util, InputExample, losses, evaluation
from transformers import pipeline

from random import sample, seed, shuffle
from torch.utils.data import DataLoader

In [7]:
PERSON = 'Agney Praseed'

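# Name the parser explicitly so BeautifulSoup does not guess one (and warn about it)
google_html_res = BeautifulSoup(requests.get(f'https://www.google.com/search?q={PERSON}').text, 'html.parser').get_text()[:1024]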

MODEL_NAME = 'deepset/roberta-base-squad2'

nlp = pipeline(
    'question-answering',
    model=MODEL_NAME,
    tokenizer=MODEL_NAME,
    max_length=10
    )

nlp(f'Who is {PERSON}?',google_html_res)

{'score': 0.2403692603111267,
 'start': 273,
 'end': 301,
 'answer': 'Engineer at Publicis Sapient'}

In [8]:
text = urlopen('https://www.gutenberg.org/cache/epub/277/pg277.txt').read().decode()

docs = list(filter(lambda x: len(x) >100, text.split('\r\n\r\n')))

docs = np.array(docs)

print('There are ',len(docs),' documents/para')

docs[2]

There are  95  documents/para


"\r\nOn Monday morning July 16, 1945, the world was changed forever when\r\nthe first atomic bomb was tested in an isolated area of the New Mexico\r\ndesert. Conducted in the final month of World War II by the top-secret\r\nManhattan Engineer District, this test was code named Trinity. The\r\nTrinity test took place on the Alamogordo Bombing and Gunnery Range,\r\nabout 230 miles south of the Manhattan Project's headquarters at Los\r\nAlamos, New Mexico. Today this 3,200 square mile range, partly located\r\nin the desolate Jornada del Muerto Valley, is named the White Sands\r\nMissile Range and is actively used for non-nuclear weapons testing."

In [9]:
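# MS MARCO is a passage-ranking dataset built from real Bing search queries
bi_encoder = SentenceTransformer('msmarco-distilbert-base-v4')
# The truncation length is controlled by the `max_seq_length` property, e.g.
# `bi_encoder.max_seq_length = 256`. Assigning to `get_max_seq_length` would only
# overwrite the getter method and leave the limit at 512, which is the value
# the outputs in this notebook were produced with.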

bi_encoder

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: DistilBertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
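
The `Pooling` module above has `pooling_mode_mean_tokens` set to `True`, meaning the sentence embedding is the average of the token embeddings. A rough sketch of that idea in plain Python follows; `mean_pool` and the toy vectors are hypothetical, and the real module operates on batched tensors.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask is 1, ignoring padding tokens."""
    dim = len(token_embeddings[0])
    totals = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                totals[i] += value
    # Guard against an all-padding input to avoid dividing by zero.
    return [t / max(count, 1) for t in totals]


# Hypothetical 2-dimensional token vectors; the last one is padding.
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]

sentence_embedding = mean_pool(tokens, mask)
print(sentence_embedding)
```

Because padding is masked out, paragraphs of different lengths all map to a fixed-size vector, 768 dimensions for this model.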

In [10]:
doc_embdeddings = bi_encoder.encode(docs,convert_to_tensor=True,show_progress_bar=True)

doc_embdeddings.shape

torch.Size([95, 768])

In [11]:
QUESTION = 'Why was the name TRINITY given to first detonation of a nuclear weapon?'

question_emb = bi_encoder.encode(QUESTION,convert_to_tensor=True)

# Retrieve the 3 most similar documents with the bi-encoder
hits = util.semantic_search(question_emb,doc_embdeddings,top_k=3)[0]

hits

[{'corpus_id': 7, 'score': 0.5677276849746704},
 {'corpus_id': 21, 'score': 0.5582700967788696},
 {'corpus_id': 8, 'score': 0.5567224621772766}]
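
Conceptually, the retrieval step scores every document embedding against the question embedding and keeps the best `top_k`. The sketch below reproduces that behaviour on toy data; `toy_semantic_search` is a hypothetical helper written for illustration, not the library's implementation, and it only mimics the `corpus_id`/`score` result shape seen above.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def toy_semantic_search(query_emb, corpus_embs, top_k=3):
    """Score every corpus vector against the query and keep the top_k hits."""
    scored = [
        {'corpus_id': i, 'score': cosine_similarity(query_emb, emb)}
        for i, emb in enumerate(corpus_embs)
    ]
    scored.sort(key=lambda hit: hit['score'], reverse=True)
    return scored[:top_k]


# Hypothetical 2-dimensional corpus and query vectors.
corpus = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0], [-1.0, 0.0]]
query = [1.0, 0.0]

toy_hits = toy_semantic_search(query, corpus, top_k=2)
print(toy_hits)
```

An exhaustive scan like this is fine for a corpus of 95 paragraphs; much larger corpora usually call for an approximate nearest-neighbour index instead.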

In [12]:
print("Question : ",QUESTION)

for i, hit in enumerate(hits):
    print(f'Document {i+1} Cosine_Similarity_Score {hit["score"]:.3f}:\n\n{docs[hit["corpus_id"]]}')
    print("\n")

Question :  Why was the name TRINITY given to first detonation of a nuclear weapon?
Document 1 Cosine_Similarity_Score 0.568:

The origin of the code name Trinity for the test site is also
interesting, but the true source is unknown. One popular account
attributes the name to J. Robert Oppenheimer, the scientific head of the
Manhattan Project. According to this version, the well read Oppenheimer
based the name Trinity on the fourteenth Holy Sonnet by John Donne, a
16th century English poet and sermon writer. The sonnet started, "Batter
my heart, three-personed God."[2] Another version of the name's origin
comes from University of New Mexico historian Ferenc M. Szasz. In his
1984 book, The Day the Sun Rose Twice, Szasz quotes Robert W. Henderson
head of the Engineering Group in the Explosives Division of the
Manhattan Project. Henderson told Szasz that the name Trinity came from
Major W. A. (Lex) Stevens. According to Henderson, he and Stevens were
at the test site discussing the best wa

In [13]:
nlp(QUESTION,str(docs[hits[0]['corpus_id']]))

{'score': 0.01621008850634098,
 'start': 302,
 'end': 328,
 'answer': 'the fourteenth Holy Sonnet'}

In [14]:
training_qa = load_dataset('adversarial_qa','adversarialQA',split='train')

good_training_data = []
bad_training_data = []

last_example = None

# Each time the context changes, pair the new question with the previous
# (unrelated) context and label that mismatched pair 0.0
for example in training_qa:
    if last_example and example['context'] != last_example['context']:
        bad_training_data.append((example['question'], last_example['context'], 0.0))
    good_training_data.append((example['question'], example['context'], 1.0))
    last_example = example
    
len(good_training_data),len(bad_training_data)

(30000, 2647)
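
To see why there are far fewer negatives than positives, the same loop can be run on a handful of rows. The rows below are hypothetical; only the `question` and `context` keys mirror the dataset.

```python
# Hypothetical rows: two questions share context 'A', then the context changes twice.
toy_rows = [
    {'question': 'q1', 'context': 'A'},
    {'question': 'q2', 'context': 'A'},
    {'question': 'q3', 'context': 'B'},
    {'question': 'q4', 'context': 'C'},
]

good, bad = [], []
last = None

for row in toy_rows:
    # A negative is created only at a context boundary.
    if last and row['context'] != last['context']:
        bad.append((row['question'], last['context'], 0.0))
    good.append((row['question'], row['context'], 1.0))
    last = row

print(len(good), len(bad))
print(bad)
```

Every row yields one positive pair, but a negative pair appears only when the context changes, so the number of negatives equals the number of context boundaries.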

In [15]:
good_training_data[0]

('What sare the benifts of the blood brain barrir?',
 'Another approach to brain function is to examine the consequences of damage to specific brain areas. Even though it is protected by the skull and meninges, surrounded by cerebrospinal fluid, and isolated from the bloodstream by the blood–brain barrier, the delicate nature of the brain makes it vulnerable to numerous diseases and several types of damage. In humans, the effects of strokes and other types of brain damage have been a key source of information about brain function. Because there is no ability to experimentally control the nature of the damage, however, this information is often difficult to interpret. In animal studies, most commonly involving rats, it is possible to use electrodes or locally injected chemicals to produce precise patterns of damage and then examine the consequences for behavior.',
 1.0)

In [16]:
bad_training_data[0]

('What do you think with?',
 'Another approach to brain function is to examine the consequences of damage to specific brain areas. Even though it is protected by the skull and meninges, surrounded by cerebrospinal fluid, and isolated from the bloodstream by the blood–brain barrier, the delicate nature of the brain makes it vulnerable to numerous diseases and several types of damage. In humans, the effects of strokes and other types of brain damage have been a key source of information about brain function. Because there is no ability to experimentally control the nature of the damage, however, this information is often difficult to interpret. In animal studies, most commonly involving rats, it is possible to use electrodes or locally injected chemicals to produce precise patterns of damage and then examine the consequences for behavior.',
 0.0)

In [17]:
seed(42)

sampled_training_data = sample(good_training_data,500) + sample(bad_training_data,500)

shuffle(sampled_training_data)

training_index = int(.8*len(sampled_training_data))

In [18]:
train_examples = [InputExample(texts=t[:2],label=t[2]) for t in sampled_training_data[:training_index]]

In [19]:
train_examples[0].__dict__

{'guid': '',
 'texts': ('What changed after the eigth century?',
  'There is disagreement about the origin of the term, but general consensus that "cardinalis" from the word cardo (meaning \'pivot\' or \'hinge\') was first used in late antiquity to designate a bishop or priest who was incorporated into a church for which he had not originally been ordained. In Rome the first persons to be called cardinals were the deacons of the seven regions of the city at the beginning of the 6th century, when the word began to mean “principal,” “eminent,” or "superior." The name was also given to the senior priest in each of the "title" churches (the parish churches) of Rome and to the bishops of the seven sees surrounding the city. By the 8th century the Roman cardinals constituted a privileged class among the Roman clergy. They took part in the administration of the church of Rome and in the papal liturgy. By decree of a synod of 769, only a cardinal was eligible to become pope. In 1059, during th

In [20]:
train_data_loader = DataLoader(train_examples, shuffle=True, batch_size=32)

train_loss = losses.CosineSimilarityLoss(bi_encoder)
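
As a rough sketch of what this objective optimizes: the cosine similarity of each question/context embedding pair is pushed toward its label, here using mean squared error, which is understood to be the default for `CosineSimilarityLoss`. The helper names and toy vectors are hypothetical, and the real loss runs on batched tensors with gradients.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def toy_cosine_similarity_loss(pairs, labels):
    """Mean squared error between cos(u, v) and the target label."""
    errors = [(cosine_similarity(u, v) - y) ** 2 for (u, v), y in zip(pairs, labels)]
    return sum(errors) / len(errors)


# Hypothetical embedding pairs with their labels (1.0 = match, 0.0 = mismatch).
pairs = [
    ([1.0, 0.0], [1.0, 0.0]),  # identical, labelled 1.0: no error
    ([1.0, 0.0], [0.0, 1.0]),  # orthogonal, labelled 0.0: no error
    ([1.0, 0.0], [1.0, 0.0]),  # identical but labelled 0.0: squared error of 1
]
labels = [1.0, 0.0, 0.0]

loss = toy_cosine_similarity_loss(pairs, labels)
print(loss)
```

Only the third pair contributes to the loss, so training would move those two embeddings apart.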

In [21]:
sentences1,sentences2,scores = zip(*sampled_training_data[training_index:])

evaluator = evaluation.EmbeddingSimilarityEvaluator(sentences1,sentences2,scores)

In [22]:
bi_encoder.evaluate(evaluator)

0.5043181128603533
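
The score above is understood here to be a Spearman rank correlation between the embedding similarities and the gold labels; treat that as an assumption about the installed library version. A minimal pure-Python sketch of the metric, with hypothetical helper names and toy values, looks like this:

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical predicted similarities and binary gold labels.
predicted = [0.9, 0.8, 0.3, 0.1]
gold = [1.0, 1.0, 0.0, 0.0]

print(spearman(predicted, gold))
```

With binary labels there are many ties, so even a perfect separation of matches from mismatches scores below 1.0.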

In [24]:
bi_encoder.fit(
    train_objectives=[(train_data_loader, train_loss)],
    output_path="./results",
    epochs=2,
    evaluator=evaluator,
)

In [25]:
bi_encoder.evaluate(evaluator)

0.5050109764878448

In [27]:
finetuned_bi_encoder = SentenceTransformer("./results")

QUESTION = 'Who made the call to drop the nuclear bomb on Japan, List all names?'

doc_embdeddings = finetuned_bi_encoder.encode(docs,convert_to_tensor=True,show_progress_bar=True)

question_emb = finetuned_bi_encoder.encode(QUESTION,convert_to_tensor=True)

hits = util.semantic_search(question_emb,doc_embdeddings,top_k=3)[0]

print("Question : ",QUESTION)

for i, hit in enumerate(hits):
    print(f'Document {i+1} Cosine_Similarity_Score {hit["score"]:.3f}:\n\n{docs[hit["corpus_id"]]}')
    print("\n")              
    
nlp(QUESTION,str(docs[hits[0]['corpus_id']]))    

Question :  Who made the call to drop the nuclear bomb on Japan, List all names?
Document 1 Cosine_Similarity_Score 0.368:

Feis, Herbert. Japan Subdued: The Atomic Bomb and the End of the War in
the Pacific. Princeton: Princeton University Press, 1961.


Document 2 Cosine_Similarity_Score 0.356:

The true story of the Trinity test first became known to the public on
August 6, 1945. This is when the world's second nuclear bomb, nicknamed
Little Boy, exploded 1,850 feet over Hiroshima, Japan, destroying a
large portion of the city and killing an estimated 70,000 to 130,000
of its inhabitants. Three days later on August 9, a third atomic bomb
devastated the city of Nagasaki and killed approximately 45,000 more
Japanese. The Nagasaki weapon was a plutonium bomb, similar to the
Trinity device, and it was nicknamed Fat Man. On Tuesday August 14, at 7
p.m. Eastern War Time, President Truman made a brief formal announcement
that Japan had finally surrendered and World War II was over after
al

{'score': 0.14259175956249237,
 'start': 0,
 'end': 13,
 'answer': 'Feis, Herbert'}