#### Introduction to Sentence Transformers
By Ankush Chander

## Problem statement

Sentence pair regression: Given two sentences, generate a numeric value based on the use case.  

**Common use cases**:

1. Semantic search: Given a query, find the most similar documents in the corpus.
2. Paraphrase mining: Finding paraphrases (texts with identical or similar meaning) in a large corpus of sentences.
3. Translated sentence mining (bitext mining): Finding parallel (translated) sentence pairs across monolingual corpora in different languages.
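
One common choice for the numeric value is the cosine similarity of the two sentence embeddings. The sketch below assumes the embeddings are already available as plain lists of floats. The three-dimensional vectors are made up for illustration and do not come from a real model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings (a real model returns hundreds of dimensions).
sunny = [0.9, 0.1, 0.0]
lovely_weather = [0.8, 0.2, 0.1]
stadium = [0.0, 0.2, 0.9]

print(cosine_similarity(sunny, lovely_weather))  # close to 1: similar direction
print(cosine_similarity(sunny, stadium))         # close to 0: nearly orthogonal
```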

## BERT model 

BERT (Bidirectional Encoder Representations from Transformers)

- Encoder-only model
- Conditions on both left and right context in all layers
- Two pretraining tasks:
  - Masked language modelling (MLM)
  - Next sentence prediction (NSP)
- Architecture:
   - BERT-BASE (Layers=12, Hidden layer dimensions=768, Attention heads=12, Total Parameters=110M)
   - BERT-LARGE (Layers=24, Hidden layer dimensions=1024, Attention heads=16, Total Parameters=340M)
- Pretrained on Wikipedia and BooksCorpus
- Task-specific fine-tuning on a separate, task-specific corpus
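
The parameter totals above can be reproduced approximately from the layer count and hidden size. The sketch below is a back-of-the-envelope count, not code from the BERT release. It assumes values that are not stated above: a 30,522-token WordPiece vocabulary, 512 positions, 2 segment types, and a feed-forward width of 4 × hidden. The number of attention heads does not change the count because the heads split the same projection matrices.

```python
def bert_param_count(layers, hidden, vocab_size=30522, max_positions=512, type_vocab=2):
    """Count weights and biases of a BERT encoder plus pooler (no task head)."""
    ffn = 4 * hidden          # assumed feed-forward width
    layer_norm = 2 * hidden   # scale and bias

    # Token, position and segment embeddings, followed by one LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm

    # Query, key, value and output projections, each with a bias.
    attention = 4 * (hidden * hidden + hidden)
    # Two dense layers: hidden -> ffn -> hidden.
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    per_layer = attention + feed_forward + 2 * layer_norm

    pooler = hidden * hidden + hidden
    return embeddings + layers * per_layer + pooler

print(f"BERT-BASE:  {bert_param_count(12, 768):,}")   # 109,482,240
print(f"BERT-LARGE: {bert_param_count(24, 1024):,}")  # 335,141,888
```

Under these assumptions the counts come out at roughly 109M and 335M, in line with the rounded 110M and 340M figures.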

![Bert architecture](img/sbert_talk/bert_architecture.png)

Picture credits: [BERT paper](https://aclanthology.org/N19-1423.pdf)

## Sentence embeddings (timeline)


| Year | Model/Technique                     | Description                                                                                                                                                                                                             |
| ---- | ----------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| 2013 | Word2Vec<br>(Mikolov et al.) | Uses skip-gram and CBOW models to create dense word vectors. |
| 2014 | GloVe<br>(Pennington et al.) | Uses global word-word co-occurrence statistics to create word embeddings. |
| 2014 | Doc2Vec<br>(Le and Mikolov) | An extension of Word2Vec for creating paragraph or document embeddings. |
| 2015 | Skip-Thought Vectors<br>(Kiros et al.) | Extends the skip-gram idea to sentences, capturing semantic meaning over larger contexts. |
| 2017 | InferSent<br>(Conneau et al.) | Uses labeled data from the Stanford Natural Language Inference dataset (Bowman et al., 2015) and the Multi-Genre NLI dataset (Williams et al., 2018) to train a siamese BiLSTM network with max-pooling over the output. |
| 2018 | BERT<br>(Devlin et al.) | Uses a transformer architecture to create deeply contextualized word embeddings. |
| 2019 | Sentence-BERT<br>(Reimers and Gurevych) | Fine-tunes BERT to produce semantically meaningful sentence embeddings. |

## Bi-encoder vs Cross encoder

|                       | Bi-Encoder                                                                              | Cross-Encoder                                                                                  |
| --------------------- | --------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------- |
| **How does it work?** | Computes an embedding for the query and for each document independently, then measures the distance between them. | Passes the query and the document to the model together and asks it to output a score between 0 and 1. |
| **Pros**              | - Similarity calculation is fast because each query/document is embedded only once: O(n) model calls | - Produces better scores because the model attends to the query and the document at the same time |
|                       | - Faster computation for large datasets | - Suited to higher accuracy over a small set of documents |
| **Cons**              | - Does not score a pair directly; similarity is computed afterwards from the two embeddings | - Does not produce sentence embeddings |
|                       | - Encodes each sentence on its own, so it cannot attend to both sentences of a pair at once | - Cannot process individual sentences, only pairs |
|                       | - Does not necessarily work well for asymmetric search | - Works well for asymmetric search thanks to the attention mechanism |
| **When to use**       | - Generating candidate documents | - Re-ranking a small set of candidate documents |
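
The cost difference in the table can be made concrete by counting model calls. The sketch below uses toy stand-ins, not real models: `toy_embed` (a bag-of-words vector) plays the bi-encoder and `toy_cross_score` (Jaccard word overlap) plays the cross-encoder. Both names and both scoring rules are assumptions chosen only to keep the example self-contained.

```python
import math
from collections import Counter

calls = {"bi_encoder": 0, "cross_encoder": 0}

def toy_embed(text):
    """Stand-in for a bi-encoder: one 'model call' per text."""
    calls["bi_encoder"] += 1
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(u[w] * v[w] for w in u)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

def toy_cross_score(query, doc):
    """Stand-in for a cross-encoder: one 'model call' per (query, doc) pair."""
    calls["cross_encoder"] += 1
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d)  # Jaccard overlap, always between 0 and 1

docs = [
    "python is a programming language",
    "the weather is lovely today",
    "he drove to the stadium",
]
queries = ["what is python", "how is the weather"]

# Bi-encoder: embed every document once, then reuse the embeddings for all queries.
doc_embeddings = [toy_embed(d) for d in docs]
for q in queries:
    q_emb = toy_embed(q)
    bi_scores = [cosine(q_emb, e) for e in doc_embeddings]

# Cross-encoder: every (query, document) pair needs its own model call.
for q in queries:
    cross_scores = [toy_cross_score(q, d) for d in docs]

print(calls)  # {'bi_encoder': 5, 'cross_encoder': 6}
```

With n documents and m queries the bi-encoder needs n + m model calls while the cross-encoder needs n × m, so the gap widens quickly as the corpus grows.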

## Bi-encoder vs Cross encoder

![BiEncoder](https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/Bi_vs_Cross-Encoder.png)

Credit: [Docs: Sentence Bert](https://www.sbert.net) 

## Applications

### Semantic Search
Given a query, find the documents in the corpus that are semantically most similar to it (similar in meaning, not just in the terms used).

| Criteria                 | Symmetric Search                                                                                 | Asymmetric Search                                                                                                                                               |
| ------------------------ | ------------------------------------------------------------------------------------------------ | --------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Query and Corpus Length  | Query and entries in the corpus are of about the same length and have the same amount of content | Short query (like a question or some keywords) and longer paragraph answering the query                                                                         |
| Example                  | *Query:* "How to learn Python online?"<br>*Expected Document:* "How to learn Python on the web?" | *Query:* "What is Python"<br>*Expected Document:* "Python is an interpreted, high-level and general-purpose programming language. Python’s design philosophy …" |
| Related Training Example | Quora Duplicate Questions                                                                        | MS MARCO                                                                                                                                                        |
| Suitable Models          | Pre-Trained Sentence Embedding Models                                                            | Pre-Trained MS MARCO Models                                                                                                                                     |

Choose the model to match the use case (symmetric or asymmetric search).

#### Approximate Nearest Neighbor
Searching a large corpus with millions of embeddings can be time-consuming with exact nearest neighbor search (as used by `util.semantic_search`), because the query is compared against every embedding in the corpus.
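
To illustrate why exact search scales linearly with corpus size, here is a minimal sketch of a brute-force top-k search. It is not the implementation of `util.semantic_search`. The function name `exact_search` and the two-dimensional vectors are hypothetical.

```python
import heapq
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def exact_search(query_emb, corpus_embs, top_k=2):
    """Score the query against every corpus embedding and keep the best top_k."""
    scored = ((cosine(query_emb, emb), idx) for idx, emb in enumerate(corpus_embs))
    return heapq.nlargest(top_k, scored)

# Hypothetical toy corpus of four 2-d embeddings.
corpus = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [-1.0, 0.0]]
query = [0.9, 0.1]

for score, idx in exact_search(query, corpus):
    print(idx, round(score, 3))
```

Each query touches all n corpus embeddings, which is what becomes expensive at the scale of millions of vectors.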

Approximate Nearest Neighbor (ANN) search can help when dealing with large datasets.
- The data is partitioned into smaller groups of similar embeddings. This index can be searched efficiently, allowing retrieval of the embeddings with the highest similarity (the nearest neighbors) within milliseconds, even with millions of vectors.
- ANN methods typically have one or more parameters to tune, which determine the recall-speed trade-off.
- Three popular libraries for ANN are Annoy, FAISS, and hnswlib.
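
The partitioning idea and the recall-speed trade-off can be sketched in a few lines. This is a simplified illustration in the spirit of an inverted-file index, not how Annoy, FAISS, or hnswlib are implemented. The centroids are fixed by hand here (a real index would learn them from the data), and the parameter name `n_probe` is an assumption for this sketch.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def build_index(embeddings, centroids):
    """Assign every embedding to the partition of its most similar centroid."""
    partitions = {c: [] for c in range(len(centroids))}
    for idx, emb in enumerate(embeddings):
        best = max(range(len(centroids)), key=lambda c: cosine(emb, centroids[c]))
        partitions[best].append(idx)
    return partitions

def ann_search(query, embeddings, centroids, partitions, n_probe=1, top_k=1):
    """Scan only the n_probe partitions whose centroids are most similar to the query."""
    ranked = sorted(range(len(centroids)), key=lambda c: cosine(query, centroids[c]), reverse=True)
    candidates = [i for c in ranked[:n_probe] for i in partitions[c]]
    scored = sorted(((cosine(query, embeddings[i]), i) for i in candidates), reverse=True)
    return scored[:top_k], len(candidates)

# Hypothetical toy data: six 2-d embeddings and three hand-picked centroids.
embeddings = [[1.0, 0.1], [0.9, 0.3], [0.6, 0.7], [0.1, 1.0], [0.2, 0.9], [-0.8, 0.5]]
centroids = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
partitions = build_index(embeddings, centroids)

query = [0.75, 0.6]
fast_hits, fast_scanned = ann_search(query, embeddings, centroids, partitions, n_probe=1)
wide_hits, wide_scanned = ann_search(query, embeddings, centroids, partitions, n_probe=2)

print(fast_hits[0][1], fast_scanned)  # index 1 after scanning 2 vectors (true neighbor missed)
print(wide_hits[0][1], wide_scanned)  # index 2 after scanning 5 vectors (true neighbor found)
```

Probing more partitions scans more vectors but recovers the true nearest neighbor, which is the recall-speed trade-off in miniature.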

#### Retrieve and rerank
Semantic search with a bi-encoder is computationally efficient, but it can return noisy results.
For complex search tasks, such as question-answering retrieval, results can be improved significantly with Retrieve & Re-Rank: a fast retriever first selects candidate documents, and a cross-encoder then re-ranks that short list.



![Retrieve and Rerank](https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/InformationRetrieval.png)

Credit: [Docs: Sentence Bert](https://www.sbert.net) 
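
The two-stage pipeline in the figure can be sketched as follows. The scorers are toy stand-ins and not real models: `word_overlap` ignores word order and plays the fast retriever, while `bigram_overlap` looks at ordered word pairs and plays the more careful re-ranker. All function names, parameters, and sentences here are hypothetical.

```python
def word_overlap(query, doc):
    """Cheap stand-in for bi-encoder similarity: shared words, order ignored."""
    return len(set(query.split()) & set(doc.split()))

def bigram_overlap(query, doc):
    """Costlier stand-in for a cross-encoder: shared ordered word pairs."""
    def bigrams(text):
        words = text.split()
        return set(zip(words, words[1:]))
    return len(bigrams(query) & bigrams(doc))

def retrieve_and_rerank(query, corpus, retrieve_score, rerank_score, retrieve_k=3, final_k=1):
    """Stage 1 scores the whole corpus cheaply; stage 2 re-scores only the shortlist."""
    shortlist = sorted(corpus, key=lambda doc: retrieve_score(query, doc), reverse=True)[:retrieve_k]
    reranked = sorted(shortlist, key=lambda doc: rerank_score(query, doc), reverse=True)
    return reranked[:final_k]

corpus = [
    "man bites dog",
    "cat sleeps all day",
    "dog bites man in park",
    "man walks dog",
]
query = "dog bites man"

# Stage 1 alone cannot separate the first and third documents (same shared words).
print(sorted(corpus, key=lambda d: word_overlap(query, d), reverse=True)[0])
# Re-ranking the shortlist with the order-aware scorer picks the better match.
print(retrieve_and_rerank(query, corpus, word_overlap, bigram_overlap))
```

The expensive scorer runs only on `retrieve_k` candidates rather than on the whole corpus, which is what keeps the second stage affordable.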

# Usage

In [1]:
from sentence_transformers import SentenceTransformer

# 1. Load a pretrained Sentence Transformer model
model = SentenceTransformer("all-MiniLM-L6-v2")

# The sentences to encode
sentences = [
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium.",
]

# 2. Calculate embeddings by calling model.encode()
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 384)

# 3. Calculate the embedding similarities
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 0.6660, 0.1046],
#         [0.6660, 1.0000, 0.1411],
#         [0.1046, 0.1411, 1.0000]])





(3, 384)
tensor([[1.0000, 0.6660, 0.1046],
        [0.6660, 1.0000, 0.1411],
        [0.1046, 0.1411, 1.0000]])




# Training


## Key datasets
1. [SNLI (Stanford Natural Language Inference)](https://paperswithcode.com/dataset/snli) consists of 570k sentence pairs manually labeled as entailment, contradiction, or neutral.
2. [MS MARCO (Microsoft MAchine Reading COmprehension)](https://paperswithcode.com/dataset/ms-marco) is a collection of datasets focused on deep learning in search. The first dataset was a question-answering dataset featuring 100,000 real Bing questions, each with a human-generated answer.
3. The [Multi-Genre Natural Language Inference (MultiNLI)](https://huggingface.co/datasets/nyu-mll/multi_nli) corpus is a crowd-sourced collection of 433k sentence pairs annotated with textual entailment information.

Browse [https://huggingface.co/datasets?other=sentence-transformers](https://huggingface.co/datasets?other=sentence-transformers) to find training datasets that might be useful for your tasks.



## Loss functions

| Inputs                                  | Labels                         | Appropriate Loss Functions                                                                                                                                                                                                                             |
| --------------------------------------- | ------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `(sentence_A, sentence_B) pairs`        | `class`                        | [`SoftmaxLoss`](https://www.sbert.net/docs/package_reference/sentence_transformer/losses.html#softmaxloss)                                                                                                                                             |
| `(anchor, positive/negative) pairs`     | `1 if positive, 0 if negative` | [`ContrastiveLoss`](https://www.sbert.net/docs/package_reference/sentence_transformer/losses.html#contrastiveloss)  <br>[`OnlineContrastiveLoss`](https://www.sbert.net/docs/package_reference/sentence_transformer/losses.html#onlinecontrastiveloss) |
| `(sentence_A, sentence_B) pairs`        | `float similarity score`       | [`CoSENTLoss`](https://www.sbert.net/docs/package_reference/sentence_transformer/losses.html#cosentloss)  <br>[`CosineSimilarityLoss`](https://www.sbert.net/docs/package_reference/sentence_transformer/losses.html#cosinesimilarityloss)             |
| `(anchor, positive, negative) triplets` | `none`                         | [`TripletLoss`](https://www.sbert.net/docs/package_reference/sentence_transformer/losses.html#tripletloss)                                                                                                                                             |


The full list of loss functions is available in the [Loss overview](https://www.sbert.net/docs/sentence_transformer/loss_overview.html).
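
To make two of the table rows concrete, here is a minimal sketch of the arithmetic behind a triplet loss and a contrastive loss on plain Python lists. It is not the library's code. The Euclidean distance and the margin of 1.0 are illustrative assumptions and may differ from the defaults of `TripletLoss` and `ContrastiveLoss`.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero once the negative is at least `margin` farther from the anchor than the positive."""
    return max(euclidean(anchor, positive) - euclidean(anchor, negative) + margin, 0.0)

def contrastive_loss(u, v, label, margin=1.0):
    """label=1 pulls the pair together; label=0 pushes it apart until `margin` is reached."""
    d = euclidean(u, v)
    return 0.5 * (label * d ** 2 + (1 - label) * max(margin - d, 0.0) ** 2)

anchor, positive = [0.0, 0.0], [0.0, 1.0]

print(triplet_loss(anchor, positive, [3.0, 4.0]))  # 0.0: negative is already far away
print(triplet_loss(anchor, positive, [1.0, 0.0]))  # 1.0: negative is as close as the positive

print(contrastive_loss(anchor, [3.0, 4.0], label=1))  # 12.5: positive pair is far apart
print(contrastive_loss(anchor, [3.0, 4.0], label=0))  # 0.0: negative pair is beyond the margin
```

The label formats in the table follow from these definitions: a triplet carries its supervision in the ordering of its three inputs and needs no label, while a pair needs a 0/1 label to say which way to move it.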

## How to train
1. Adapt the dataset to one of the accepted formats.
2. Choose a loss function consistent with the dataset format.
3. Choose an evaluation method consistent with the task.
4. Train the model.

In [1]:
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses
from datasets import Dataset

model = SentenceTransformer("microsoft/mpnet-base")

embeddings = model.encode(["It's nice weather outside today.", "It's quite rainy, sadly."])
before_similarity = model.similarity(embeddings[0], embeddings[1])
print(before_similarity)

train_dataset = Dataset.from_dict({
    "anchor": ["It's nice weather outside today.", "He drove to work."],
    "positive": ["It's so sunny.", "He took the car to the office."],
    "negative": ["It's quite rainy, sadly.", "She walked to the store."],
})
loss = losses.TripletLoss(model=model)

trainer = SentenceTransformerTrainer(
    model=model,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()

# save model
model.save("my_model")

No sentence-transformers model found with name microsoft/mpnet-base. Creating a new one with mean pooling.
Some weights of MPNetModel were not initialized from the model checkpoint at microsoft/mpnet-base and are newly initialized: ['mpnet.pooler.dense.bias', 'mpnet.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


tensor([[0.9480]])



In [2]:
# use saved model
model = SentenceTransformer("my_model")
embeddings = model.encode(["It's nice weather outside today.", "It's quite rainy, sadly."])
after_similarity = model.similarity(embeddings[0], embeddings[1])
print(after_similarity)


tensor([[0.8304]])


## Adapting datasets (examples)

1. Abstract analyzer: Given a research paper abstract, classify its sentences into categories: research_context, problem_statement, approach, results. Possible formats:
   - inputs: single sentences, labels: class
   - inputs: (anchor, positive) pairs, labels: none
   - inputs: (anchor, positive, negative) triplets, labels: none
2. RAG on your research papers:
   - (para1, para2, class): 1 if both paragraphs come from the same section of the paper, else 0
   - (para1, para2, class): 1 if both paragraphs come from the same paper, else 0
   - (chunk1, chunk2, class): 1 if chunk2 follows chunk1, else 0
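
As a sketch of how such labeled pairs could be derived from raw text, the snippet below builds same-section pairs and next-chunk pairs. The input layout (a list of `(section, paragraph)` tuples), the function names, the column names, and the sample sentences are all hypothetical.

```python
from itertools import combinations

# Hypothetical input: one paper as (section, paragraph) tuples in reading order.
paper = [
    ("introduction", "Sentence embeddings map text to vectors."),
    ("introduction", "Comparing all sentence pairs is expensive."),
    ("method", "A siamese network encodes each sentence separately."),
    ("method", "Pooling produces a fixed-size vector."),
]

def same_section_pairs(paragraphs):
    """(para1, para2, label): label is 1 when both paragraphs share a section."""
    return [
        (text_a, text_b, int(sec_a == sec_b))
        for (sec_a, text_a), (sec_b, text_b) in combinations(paragraphs, 2)
    ]

def next_chunk_pairs(chunks):
    """(chunk1, chunk2, label): label is 1 when chunk2 directly follows chunk1."""
    return [
        (chunks[i], chunks[j], int(j == i + 1))
        for i in range(len(chunks))
        for j in range(len(chunks))
        if i != j
    ]

pairs = same_section_pairs(paper)

# Column layout for a pair dataset with 0/1 labels (column names are illustrative).
columns = {
    "sentence_A": [a for a, _, _ in pairs],
    "sentence_B": [b for _, b, _ in pairs],
    "label": [label for _, _, label in pairs],
}
print(columns["label"])  # [1, 0, 0, 0, 0, 1]
```

A dictionary of this shape could then be handed to `Dataset.from_dict`, as in the training example earlier, and paired with a loss that accepts 0/1 labels such as `ContrastiveLoss`.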

# References
1. [Paper: Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks](https://doi.org/10.18653/v1/d19-1410)
2. [Docs: Sentence Transformers](https://www.sbert.net/index.html)
3. [Paper: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://doi.org/10.48550/arxiv.1810.04805)
4. [Huggingface - Sentence Transformers](https://huggingface.co/sentence-transformers)