### 1. How Sentence Transformers models work

In a Sentence Transformer model, you map a variable-length text (or image pixels) to a fixed-size embedding representing that input's meaning.

This is how the Sentence Transformers models work:

1. Layer 1 – The input text is passed through a pre-trained Transformer model that can be obtained directly from the Hugging Face Hub. This tutorial will use the "distilroberta-base" model. The Transformer outputs are contextualized word embeddings for all input tokens; imagine an embedding for each token of the text.

2. Layer 2 - The embeddings go through a pooling layer to get a single fixed-length embedding for all the text. For example, mean pooling averages the embeddings generated by the model.

In [79]:
from sentence_transformers import SentenceTransformer, models

## Step 1: use an existing language model
word_embedding_model = models.Transformer('distilroberta-base')

## Step 2: use a pool function over the token embeddings
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())

## Join steps 1 and 2 using the modules argument
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])


Some weights of the model checkpoint at distilroberta-base were not used when initializing RobertaModel: ['lm_head.dense.bias', 'lm_head.dense.weight', 'lm_head.layer_norm.bias', 'lm_head.bias', 'lm_head.layer_norm.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


### 2. The question I have here is: Tutorial says that from first layer we get last hidden state of transformer. So let's check it comparing with output of the original transformer.

Let's download original RobertaModel and its tokenizator

In [35]:
from transformers import RobertaModel
distilroberta_model = RobertaModel.from_pretrained('distilroberta-base')
from transformers import RobertaTokenizer
distilroberta_tokenizer = RobertaTokenizer.from_pretrained('distilroberta-base')

Some weights of the model checkpoint at distilroberta-base were not used when initializing RobertaModel: ['lm_head.dense.bias', 'lm_head.dense.weight', 'lm_head.layer_norm.bias', 'lm_head.bias', 'lm_head.layer_norm.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


We will use next sentence. First we will tokenize it

In [None]:
text1='I am a student'

In [23]:
text_emb=distilroberta_tokenizer(text1,return_tensors='pt')

In [32]:
text_emb

{'input_ids': tensor([[   0,  100,  524,   10, 1294,    2]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1]]), 'token_embeddings': tensor([[[ 0.0008,  0.0691, -0.0220,  ..., -0.0082, -0.0437, -0.0104],
         [ 0.0038,  0.1006, -0.0765,  ...,  0.1998, -0.0285, -0.1054],
         [ 0.1853,  0.0873,  0.0098,  ...,  0.3601,  0.0253, -0.1086],
         [ 0.2156, -0.1720, -0.1061,  ...,  0.2778,  0.0912,  0.0706],
         [ 0.0728, -0.0749, -0.0756,  ...,  0.1787,  0.1348,  0.0565],
         [-0.0155,  0.0790, -0.0671,  ..., -0.0799, -0.0534, -0.0420]]],
       grad_fn=<NativeLayerNormBackward0>)}

Then we will look at the output of the original model.

In [25]:
distilroberta_model.eval()
out = distilroberta_model(input_ids=text_emb['input_ids'],
                 attention_mask=text_emb['attention_mask'],
                 output_attentions=True,
                 output_hidden_states=True,
                 return_dict=True)
out.keys()

odict_keys(['last_hidden_state', 'pooler_output', 'hidden_states', 'attentions'])

In [26]:
out['last_hidden_state']

tensor([[[ 0.0008,  0.0691, -0.0220,  ..., -0.0082, -0.0437, -0.0104],
         [ 0.0038,  0.1006, -0.0765,  ...,  0.1998, -0.0285, -0.1054],
         [ 0.1853,  0.0873,  0.0098,  ...,  0.3601,  0.0253, -0.1086],
         [ 0.2156, -0.1720, -0.1061,  ...,  0.2778,  0.0912,  0.0706],
         [ 0.0728, -0.0749, -0.0756,  ...,  0.1787,  0.1348,  0.0565],
         [-0.0155,  0.0790, -0.0671,  ..., -0.0799, -0.0534, -0.0420]]],
       grad_fn=<NativeLayerNormBackward0>)

In [28]:
out['pooler_output'][0][:5]

tensor([ 0.1637,  0.0231, -0.1608, -0.2013,  0.2004], grad_fn=<SliceBackward0>)

Next we will look at the output of the transformer module from sentence-transformer

In [30]:

word_embedding_model.eval()
out = word_embedding_model(text_emb)

In [31]:
out

{'input_ids': tensor([[   0,  100,  524,   10, 1294,    2]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1]]), 'token_embeddings': tensor([[[ 0.0008,  0.0691, -0.0220,  ..., -0.0082, -0.0437, -0.0104],
         [ 0.0038,  0.1006, -0.0765,  ...,  0.1998, -0.0285, -0.1054],
         [ 0.1853,  0.0873,  0.0098,  ...,  0.3601,  0.0253, -0.1086],
         [ 0.2156, -0.1720, -0.1061,  ...,  0.2778,  0.0912,  0.0706],
         [ 0.0728, -0.0749, -0.0756,  ...,  0.1787,  0.1348,  0.0565],
         [-0.0155,  0.0790, -0.0671,  ..., -0.0799, -0.0534, -0.0420]]],
       grad_fn=<NativeLayerNormBackward0>)}

We see that it also outputs the tokenized text. Note that its tokenized output is the same as output of the original tokenizator. Also note that the output is really the last_hidden_state of the original trasformer.

### 3. The next question is:

Why not use a Transformer model, like BERT or Roberta, out of the box to create embeddings for entire sentences and texts? There are at least two reasons.

1. Pre-trained Transformers require heavy computation to perform semantic search tasks. For example, finding the most similar pair in a collection of 10,000 sentences requires about 50 million inference computations (~65 hours) with BERT. In contrast, a BERT Sentence Transformers model reduces the time to about 5 seconds.

2. Once trained, Transformers create poor sentence representations out of the box. A BERT model with its token embeddings averaged to create a sentence embedding performs worse than the GloVe embeddings developed in 2014.

### 4. Dataset format

To train a Sentence Transformers model, you need to inform it somehow that two sentences have a certain degree of similarity. Therefore, each example in the data requires a label or structure that allows the model to understand whether two sentences are similar or different.

Unfortunately, there is no single way to prepare your data to train a Sentence Transformers model. Furthermore, the structure of your data will influence which loss function you can use. 

Most dataset configurations will take one of four forms (below you will see examples of each case):

1. Case 1: A pair of sentences and label. The label can be integer o float.

2. Case 2: The example is a pair of positive (similar) sentences without a label. For example, pairs of paraphrases, pairs of full texts and their summaries, pairs of duplicate questions, pairs of (query, response), or pairs of (source_language, target_language). Natural Language Inference datasets can also be formatted this way by pairing entailing sentences. Having your data in this format can be great since you can use the MultipleNegativesRankingLoss, one of the most used loss functions for Sentence Transformers models.

3. Case 3: The example is a sentence with an integer label. This data format is easily converted by loss functions into three sentences (triplets) where the first is an "anchor", the second a "positive" of the same class as the anchor, and the third a "negative" of a different class. Each sentence has an integer label indicating the class to which it belongs.

4. Case 4: The example is a triplet (anchor, positive, negative) without classes or labels for the sentences.

The next step is converting the dataset into a format the Sentence Transformers model can understand. The model cannot accept raw lists of strings. Each example must be converted to a sentence_transformers.InputExample class and then to a torch.utils.data.DataLoader class to batch and shuffle the examples.

Let's download dataset for Case4

In [63]:
from datasets import load_dataset

dataset_id = "embedding-data/QQP_triplets"
dataset = load_dataset(dataset_id)


Bellow we can see the Case4 structure

In [42]:
dataset['train']['set'][:1]

[{'query': 'Why in India do we not have one on one political debate as in USA?',
  'pos': ['Why cant we have a public debate between politicians in India like the one in US?'],
  'neg': ['Can people on Quora stop India Pakistan debate? We are sick and tired seeing this everyday in bulk?',
   'Why do politicians, instead of having a decent debate on issues going in and around the world, end up fighting always?',
   'Can educated politicians make a difference in India?',
   'What are some unusual aspects about politics and government in India?',
   'What is debate?',
   'Why does civic public communication and discourse seem so hollow in modern India?',
   'What is a Parliamentary debate?',
   "Why do we always have two candidates at the U.S. presidential debate. yet the ballot has about 7 candidates? Isn't that a misrepresentation of democracy?",
   'Why is civic public communication and discourse so hollow in modern India?',
   "Aren't the Presidential debates teaching our whole countr

Next we will convert our dataset to InputExample

In [64]:
from sentence_transformers import InputExample

train_examples = []
train_data = dataset['train']['set']
# For agility we only 1/2 of our available data
n_examples = dataset['train'].num_rows // 2000




In [65]:
n_examples

50

In [66]:
for i in range(n_examples):
  example = train_data[i]
  train_examples.append(InputExample(texts=[example['query'], example['pos'][0], example['neg'][0]]))

We get the list of InputExamples

In [55]:
train_examples[:10]

[<sentence_transformers.readers.InputExample.InputExample at 0x2305616add0>,
 <sentence_transformers.readers.InputExample.InputExample at 0x23001be4e90>,
 <sentence_transformers.readers.InputExample.InputExample at 0x23077572c90>,
 <sentence_transformers.readers.InputExample.InputExample at 0x2300f87ebd0>,
 <sentence_transformers.readers.InputExample.InputExample at 0x2300f87ec90>,
 <sentence_transformers.readers.InputExample.InputExample at 0x2305e1d41d0>,
 <sentence_transformers.readers.InputExample.InputExample at 0x2305e1d4210>,
 <sentence_transformers.readers.InputExample.InputExample at 0x2305e1d4190>,
 <sentence_transformers.readers.InputExample.InputExample at 0x2305e1d4150>,
 <sentence_transformers.readers.InputExample.InputExample at 0x2305e1d4110>]

Next we convert the training examples to a Dataloader.

In [67]:
from torch.utils.data import DataLoader

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=1)


### 5. The next step is to choose a suitable loss function that can be used with the data format.

Remember the four different formats your data could be in? Each will have a different loss function associated with it.

1. Case 1: Pair of sentences and a label indicating how similar they are. The loss function optimizes such that (1) the sentences with the closest labels are near in the vector space, and (2) the sentences with the farthest labels are as far as possible. The loss function depends on the format of the label. If its an integer use ContrastiveLoss or SoftmaxLoss; if its a float you can use CosineSimilarityLoss.

2. Case 2: If you only have two similar sentences (two positives) with no labels, then you can use the MultipleNegativesRankingLoss function. The MegaBatchMarginLoss can also be used, and it would convert your examples to triplets (anchor_i, positive_i, positive_j) where positive_j serves as the negative.

3. Case 3: When your samples are triplets of the form [anchor, positive, negative] and you have an integer label for each, a loss function optimizes the model so that the anchor and positive sentences are closer together in vector space than the anchor and negative sentences. You can use BatchHardTripletLoss, which requires the data to be labeled with integers (e.g., labels 1, 2, 3) assuming that samples with the same label are similar. Therefore, anchors and positives must have the same label, while negatives must have a different one. Alternatively, you can use BatchAllTripletLoss, BatchHardSoftMarginTripletLoss, or BatchSemiHardTripletLoss. The differences between them is beyond the scope of this tutorial, but can be reviewed in the Sentence Transformers documentation.

4. Case 4: If you don't have a label for each sentence in the triplets, you should use TripletLoss. This loss minimizes the distance between the anchor and the positive sentences while maximizing the distance between the anchor and the negative sentences.

![title](losses_for_dif_structures.bmp)

So we must use TripletLoss

In [68]:
from sentence_transformers import losses

train_loss = losses.TripletLoss(model=model)


### 6. What are the limits of Sentence Transformers?

Sentence Transformers models work much better than the simple Transformers models for semantic search. However, where do the Sentence Transformers models not work well? If your task is classification, then using sentence embeddings is the wrong approach. In that case, the 🤗 Transformers library would be a better choice.

### 7. Training

In [69]:
num_epochs = 2

warmup_steps = int(len(train_dataloader) * num_epochs * 0.1) #10% of train data

In [98]:
from sentence_transformers.evaluation import TripletEvaluator
evaluator = TripletEvaluator([train_data[i]['query'] for i in range(30)],[train_data[i]['pos'][0] for i in range(30)],[train_data[i]['neg'] for i in range(30)],batch_size=20)

In [103]:
model.evaluate(evaluator)

0.9666666666666667

In [102]:
model.fit(train_objectives=[(train_dataloader, train_loss)],
          epochs=num_epochs,
          evaluator=evaluator,
          evaluation_steps=5,
          warmup_steps=warmup_steps) 

Epoch:   0%|          | 0/2 [00:00<?, ?it/s]

Iteration:   0%|          | 0/50 [00:00<?, ?it/s]

Iteration:   0%|          | 0/50 [00:00<?, ?it/s]