In [1]:
# !pip install accelerate==0.34.2

In [2]:
# !pip install transformers==4.45.2 

In [3]:
# !pip install sentence-transformers==3.1.1

# Chapter 10. Creating Text Embedding Models

## Contrastive Learning

There are many ways to apply contrastive learning to create text embedding models, but the best-known technique and framework is sentence-transformers.

### SBERT

The sentence-transformers approach fixes a major problem with using the original BERT implementation to create sentence embeddings, namely its computational overhead.

Before sentence-transformers, sentence embeddings often relied on an architecture called a **cross-encoder with BERT**.

A cross-encoder passes two sentences to the Transformer network simultaneously to predict how similar they are.

- It does so by adding a classification head to the original architecture that outputs a similarity score.
- PROBLEM: the number of computations rises quickly when you want to find the most similar pair in a collection of 10,000 sentences.
- Moreover, a cross-encoder generally does not generate embeddings; it only outputs a similarity score between the input sentences.
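To see how quickly that cost grows: a cross-encoder has to score every unordered pair of sentences, which is n(n-1)/2 forward passes for n sentences. The snippet below is a quick arithmetic check for the 10,000-sentence example; the helper name is purely illustrative.

```python
from math import comb


def cross_encoder_pair_count(num_sentences: int) -> int:
    """Number of unordered sentence pairs a cross-encoder must score."""
    return comb(num_sentences, 2)  # n * (n - 1) / 2


pairs = cross_encoder_pair_count(10_000)
print(f"{pairs:,} pairwise forward passes")  # 49,995,000 pairwise forward passes
```

A model that embeds each sentence once would need only 10,000 forward passes for the same collection, followed by cheap vector comparisons.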


A solution to this overhead is to generate embeddings from a BERT model by averaging its output layer or using the [CLS] token. This, however, has been shown to be worse than simply averaging word vectors, such as GloVe.

# <img src="imgs/crossencoder.png" alt="Patching" width="500" height="200">

Instead, the authors of sentence-transformers approached the problem differently and searched for a method that is fast and creates embeddings that can be compared semantically. 

Unlike a cross-encoder, in **sentence-transformers** the **classification head is dropped**, and **mean pooling** is used on the final output layer to generate an embedding.

- This pooling layer averages the token embeddings into a single output vector, so every sentence gets an embedding of the same fixed size regardless of its length.
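A minimal sketch of mean pooling in plain Python, on toy numbers. The attention mask that skips padding tokens is an assumption added here for illustration (the text above only says that the token vectors are averaged), and real implementations operate on batched tensors rather than lists.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors of one sentence, skipping padding positions."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:  # 1 = real token, 0 = padding (assumed convention)
            total = [t + v for t, v in zip(total, vector)]
            count += 1
    return [t / count for t in total]


# Toy sentence: 3 real tokens plus 1 padding token, 2-dimensional vectors
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]
print(mean_pool(tokens, mask))  # [3.0, 4.0]
```

Whatever the number of tokens, the output has the same dimensionality as a single token vector, which is what makes the embeddings directly comparable.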


The figure below shows the architecture of the original sentence-transformers model, which uses a Siamese network, also called a **bi-encoder**.

# <img src="imgs/biencoder.png" alt="Patching" width="350" height="250">

The model is optimized on these sentence pairs through a loss function, and the choice of loss can have a major impact on the model's performance.

During training, the embeddings of the two sentences are concatenated together with the difference between them. The resulting vector is then passed to a softmax classifier, and the model is optimized on the classifier's output.
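The concatenation step can be sketched as follows. This assumes the element-wise absolute difference |u - v| is used as the "difference" term, as in the original SBERT setup; the function name and toy vectors are illustrative only.

```python
def softmax_loss_features(u, v):
    """Concatenate (u, v, |u - v|), the input to the softmax classifier."""
    abs_diff = [abs(a - b) for a, b in zip(u, v)]
    return u + v + abs_diff


u = [0.5, -1.0, 2.0]
v = [0.0, 1.0, 2.0]
features = softmax_loss_features(u, v)
print(features)       # [0.5, -1.0, 2.0, 0.0, 1.0, 2.0, 0.5, 2.0, 0.0]
print(len(features))  # 9, i.e. three times the embedding dimension
```

Under this assumption, a 768-dimensional embedding model would feed 2,304 features into the classifier, which maps them to one score per label.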

**The resulting architecture is also referred to as a bi-encoder or SBERT, for sentence-BERT.**

*Note: Although a bi-encoder is quite fast and creates accurate sentence representations, cross-encoders generally achieve better performance than a bi-encoder but do not generate embeddings.*

To perform contrastive learning, we need two things. First, we need data that constitutes similar/dissimilar pairs. Second, we need to specify how the model measures and optimizes similarity.

## Creating an Embedding Model

### Generating Contrastive Examples

When pretraining your embedding model, you will often see data from natural language inference (NLI) datasets being used:

- NLI is the task of determining whether a given premise entails a hypothesis (entailment), contradicts it (contradiction), or does neither (neutral).

The data that we use throughout this chapter for creating and fine-tuning embedding models is derived from the **General Language Understanding Evaluation benchmark (GLUE)**. The GLUE benchmark consists of nine language understanding tasks to evaluate and analyze model performance.

One of these tasks is the **Multi-Genre Natural Language Inference (MNLI)** corpus:
- 392,702 sentence pairs
- each annotated with one of three labels: entailment, neutral, or contradiction

In [132]:
from datasets import Dataset, load_dataset

from sentence_transformers import SentenceTransformer, losses, models
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
from sentence_transformers.training_args import SentenceTransformerTrainingArguments
from sentence_transformers.trainer import SentenceTransformerTrainer
from sentence_transformers import InputExample
from sentence_transformers.datasets import NoDuplicatesDataLoader
from sentence_transformers.cross_encoder import CrossEncoder
from sentence_transformers.datasets import DenoisingAutoEncoderDataset
from mteb import MTEB
import nltk
import gc
import torch
import random
from tqdm import tqdm
import pandas as pd
import numpy as np

In [5]:
# Load MNLI dataset from GLUE
# 0 = entailment, 1 = neutral, 2 = contradiction
train_dataset = load_dataset("glue", "mnli", split="train").select(range(50_000))
train_dataset = train_dataset.remove_columns("idx")

In [6]:
train_dataset[2]

{'premise': 'One of our number will carry out your instructions minutely.',
 'hypothesis': 'A member of my team will execute your orders with immense precision.',
 'label': 0}

In [7]:
train_dataset[4]

{'premise': "yeah i tell you what though if you go price some of those tennis shoes i can see why now you know they're getting up in the hundred dollar range",
 'hypothesis': 'The tennis shoes have a range of prices.',
 'label': 1}

In [8]:
train_dataset[8]

{'premise': 'Gays and lesbians.', 'hypothesis': 'Heterosexuals.', 'label': 2}

### Train Model

We typically choose an existing sentence-transformers model and fine-tune it, but in this example, **we are going to train an embedding model from scratch**.

We need to define two things:

- a pretrained Transformer model that serves to embed individual words
- a loss function over which we will optimize the model (softmax loss)

In [9]:
# Model
embedding_model = SentenceTransformer('bert-base-uncased')



In [10]:
# Define the loss function. In softmax loss, we will also need to explicitly set the number of labels.
train_loss = losses.SoftmaxLoss(
    model=embedding_model,
    sentence_embedding_dimension=embedding_model.get_sentence_embedding_dimension(),
    num_labels=3
)

In [11]:
embedding_model.get_sentence_embedding_dimension()

768

In [12]:
len(set(train_dataset["label"]))

3

Before we train our model, we define an evaluator to measure the model's performance during training, which also determines the best model to save.

We can evaluate our model using the Semantic Textual Similarity Benchmark (STSB). It is a collection of human-labeled sentence pairs with similarity scores between 0 and 5, which we divide by 5 below to normalize them to the range 0 to 1.

In [13]:
# Create an embedding similarity evaluator for stsb
val_sts = load_dataset('glue', 'stsb', split='validation')

In [14]:
val_sts

Dataset({
    features: ['sentence1', 'sentence2', 'label', 'idx'],
    num_rows: 1500
})

In [15]:
val_sts["label"][:10]

[5.0,
 4.75,
 5.0,
 2.4000000953674316,
 2.75,
 2.615000009536743,
 5.0,
 2.3329999446868896,
 3.75,
 5.0]

In [16]:
val_sts["sentence1"][:10]

['A man with a hard hat is dancing.',
 'A young child is riding a horse.',
 'A man is feeding a mouse to a snake.',
 'A woman is playing the guitar.',
 'A woman is playing the flute.',
 'A woman is cutting an onion.',
 'A man is erasing a chalk board.',
 'A woman is carrying a boy.',
 'Three men are playing guitars.',
 'A woman peels a potato.']

In [17]:
val_sts["sentence2"][:10]

['A man wearing a hard hat is dancing.',
 'A child is riding a horse.',
 'The man is feeding a mouse to the snake.',
 'A man is playing guitar.',
 'A man is playing a flute.',
 'A man is cutting onions.',
 'The man is erasing the chalk board.',
 'A woman is carrying her baby.',
 'Three men are on stage playing guitars.',
 'A woman is peeling a potato.']

In [18]:
evaluator = EmbeddingSimilarityEvaluator(
    sentences1=val_sts["sentence1"],
    sentences2=val_sts["sentence2"],
    scores=[score/5 for score in val_sts["label"]],
    main_similarity="cosine",
)

- Parameters to highlight:
    - warmup_steps: the number of steps during which the learning rate will be linearly increased from zero to the initial learning rate defined for the training process. Note that we did not specify a custom learning rate for this training process.
    - fp16: By enabling this parameter we allow for mixed precision training, where computations are performed using 16-bit floating-point numbers (FP16) instead of the default 32-bit (FP32). This reduces memory usage and potentially increases the training speed.
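As a rough sketch of what `warmup_steps` does to the learning rate: the function below assumes a base learning rate of 5e-5 and a linear decay to zero after warmup, which is what the training logs in this notebook suggest for the default settings. The total of 1,563 steps corresponds to 50,000 examples at a batch size of 32. The function itself is illustrative and not part of the library.

```python
def learning_rate(step, base_lr=5e-5, warmup_steps=100, total_steps=1563):
    """Linear warmup from zero to base_lr, then linear decay back to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * (total_steps - step) / (total_steps - warmup_steps)


for step in (0, 50, 100, 200, 1563):
    print(step, learning_rate(step))
```

With these assumptions, step 100 gives the full 5e-05 and step 200 gives roughly 4.658e-05, matching the first two `learning_rate` values logged during training.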

In [19]:
# Define the training arguments
args = SentenceTransformerTrainingArguments(
    output_dir="training_base_embedding_model",
    num_train_epochs=1,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    warmup_steps=100,
    fp16=True,
    eval_steps=100,
    logging_steps=100,
)

In [20]:
# Train embedding model
trainer = SentenceTransformerTrainer(
    model=embedding_model,
    args=args,
    train_dataset=train_dataset,
    loss=train_loss,
    evaluator=evaluator
)

In [21]:
trainer.train()

  0%|          | 0/1563 [00:00<?, ?it/s]


{'loss': 1.0764, 'grad_norm': 2.6743345260620117, 'learning_rate': 5e-05, 'epoch': 0.06}
{'loss': 0.9465, 'grad_norm': 2.88771653175354, 'learning_rate': 4.6582365003417636e-05, 'epoch': 0.13}
{'loss': 0.8812, 'grad_norm': 3.460381507873535, 'learning_rate': 4.316473000683528e-05, 'epoch': 0.19}
{'loss': 0.8422, 'grad_norm': 3.8805480003356934, 'learning_rate': 3.9747095010252904e-05, 'epoch': 0.26}
{'loss': 0.8262, 'grad_norm': 4.558804512023926, 'learning_rate': 3.632946001367054e-05, 'epoch': 0.32}


Computing widget examples:   0%|          | 0/1 [00:00<?, ?example/s]

{'loss': 0.8235, 'grad_norm': 3.3562326431274414, 'learning_rate': 3.291182501708818e-05, 'epoch': 0.38}
{'loss': 0.8069, 'grad_norm': 4.393437385559082, 'learning_rate': 2.9494190020505813e-05, 'epoch': 0.45}
{'loss': 0.7899, 'grad_norm': 4.835585594177246, 'learning_rate': 2.611073137388927e-05, 'epoch': 0.51}
{'loss': 0.778, 'grad_norm': 4.837282657623291, 'learning_rate': 2.2693096377306907e-05, 'epoch': 0.58}
{'loss': 0.7632, 'grad_norm': 4.742218494415283, 'learning_rate': 1.9275461380724537e-05, 'epoch': 0.64}
{'loss': 0.7486, 'grad_norm': 3.6050546169281006, 'learning_rate': 1.5857826384142175e-05, 'epoch': 0.7}
{'loss': 0.7292, 'grad_norm': 4.809312343597412, 'learning_rate': 1.2440191387559808e-05, 'epoch': 0.77}
{'loss': 0.7467, 'grad_norm': 4.783884525299072, 'learning_rate': 9.022556390977444e-06, 'epoch': 0.83}
{'loss': 0.7127, 'grad_norm': 3.5038461685180664, 'learning_rate': 5.604921394395079e-06, 'epoch': 0.9}
{'loss': 0.7478, 'grad_norm': 5.3534698486328125, 'learning

TrainOutput(global_step=1563, training_loss=0.8112286099698096, metrics={'train_runtime': 75.6031, 'train_samples_per_second': 661.349, 'train_steps_per_second': 20.674, 'total_flos': 0.0, 'train_loss': 0.8112286099698096, 'epoch': 1.0})

In [22]:
# Evaluate our trained model
evaluator(embedding_model)

{'pearson_cosine': 0.5328487247553961,
 'spearman_cosine': 0.610085819603248,
 'pearson_manhattan': 0.595494529766493,
 'spearman_manhattan': 0.6172999437988739,
 'pearson_euclidean': 0.5832804782034675,
 'spearman_euclidean': 0.6126055900254425,
 'pearson_dot': 0.5086478770718181,
 'spearman_dot': 0.5466362366701062,
 'pearson_max': 0.595494529766493,
 'spearman_max': 0.6172999437988739}

The one we are interested in most is 'pearson_cosine', the Pearson correlation between the cosine similarities of the embedding pairs and the human-labeled scores. It ranges from -1 to 1, where a higher value indicates that the model's similarities agree more closely with the labels. We get a value of 0.53, which we consider a baseline throughout this chapter.
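To make the metric concrete, here is a small sketch of how a score like 'pearson_cosine' can be computed: take the cosine similarity of each embedding pair, then correlate those similarities with the gold scores. The toy embeddings and gold scores are made up for illustration, and the evaluator's actual implementation may differ in detail.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Toy embeddings for three sentence pairs and their gold scores in [0, 1]
pairs = [([1.0, 0.0], [1.0, 0.1]), ([1.0, 0.0], [0.5, 0.5]), ([1.0, 0.0], [0.0, 1.0])]
gold = [1.0, 0.5, 0.0]
predicted = [cosine(u, v) for u, v in pairs]
print(round(pearson(predicted, gold), 3))
```

The 'spearman_cosine' variant follows the same idea but correlates ranks instead of raw values.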

Larger batch sizes tend to work better with multiple negatives ranking (MNR) loss, as a larger batch makes the task more difficult.

### In-Depth Evaluation

A good embedding model is more than just a good score on the STSB benchmark! As we observed earlier, the GLUE benchmark has a number of tasks for which we can evaluate our embedding model. 

To unify this evaluation procedure, the Massive Text Embedding Benchmark (MTEB) was developed.

**The MTEB spans 8 embedding tasks that cover 58 datasets and 112 languages.**

In [24]:
# Choose evaluation task
evaluation = MTEB(tasks=["Banking77Classification"])

# Calculate results
results = evaluation.run(embedding_model)



huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

In [25]:
results

[MTEBResults(task_name=Banking77Classification, scores=...)]

In [27]:
mteb_results = results[0]

In [29]:
mteb_results.model_dump()


{'dataset_revision': '0fd18e25b25c072e09e0d92ab615fda904d66300',
 'task_name': 'Banking77Classification',
 'mteb_version': '1.12.39',
 'scores': {'test': [{'accuracy': 0.5904220779220779,
    'f1': 0.5889991560371988,
    'f1_weighted': 0.5889991560371988,
    'scores_per_experiment': [{'accuracy': 0.587987012987013,
      'f1': 0.5866559054728446,
      'f1_weighted': 0.5866559054728445},
     {'accuracy': 0.5743506493506494,
      'f1': 0.5729107811455537,
      'f1_weighted': 0.5729107811455536},
     {'accuracy': 0.5928571428571429,
      'f1': 0.5902115454600196,
      'f1_weighted': 0.5902115454600197},
     {'accuracy': 0.599025974025974,
      'f1': 0.6004218246023643,
      'f1_weighted': 0.6004218246023644},
     {'accuracy': 0.6009740259740259,
      'f1': 0.6003663902905382,
      'f1_weighted': 0.6003663902905382},
     {'accuracy': 0.6,
      'f1': 0.5985031216347628,
      'f1_weighted': 0.5985031216347628},
     {'accuracy': 0.6006493506493507,
      'f1': 0.59754962132

In [None]:
# Empty and delete trainer/model
trainer.accelerator.clear()
del trainer, embedding_model

gc.collect()
torch.cuda.empty_cache()

# Loss Functions

- Softmax loss is generally not advised, as there are more performant losses. Two common alternatives:

    - **Cosine similarity loss**: this loss is straightforward. It calculates the cosine similarity between the embeddings of the two texts and compares it to the labeled similarity score. It works best with data consisting of sentence pairs and labels that indicate their similarity between 0 and 1.
    - **Multiple negatives ranking (MNR) loss**

See losses: https://www.sbert.net/docs/package_reference/sentence_transformer/losses.html
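As a rough illustration of the cosine similarity loss described above, the sketch below compares the cosine similarity of each embedding pair with its label using mean squared error. Squared error is assumed as the comparison function here, and the toy vectors are illustrative.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


def cosine_similarity_loss(embedding_pairs, labels):
    """Mean squared error between predicted cosine similarity and the label."""
    errors = [(cosine(u, v) - y) ** 2 for (u, v), y in zip(embedding_pairs, labels)]
    return sum(errors) / len(errors)


batch = [([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.0], [0.0, 1.0])]
print(cosine_similarity_loss(batch, [1.0, 0.0]))  # 0.0: similarities match the labels
print(cosine_similarity_loss(batch, [0.0, 1.0]))  # 1.0: each similarity is off by 1
```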



### Cosine Similarity

To use this loss with our NLI dataset, we need to convert the entailment (0), neutral (1), and contradiction (2) labels to values between 0 and 1. 

Entailment represents a high similarity between the sentences, so we give it a similarity score of 1. In contrast, since both neutral and contradiction represent dissimilarity, we give these labels a similarity score of 0:

In [31]:
# Load MNLI dataset from GLUE
# 0 = entailment, 1 = neutral, 2 = contradiction
train_dataset = load_dataset("glue", "mnli", split="train").select(range(50_000))
train_dataset = train_dataset.remove_columns("idx")


In [32]:
# (neutral/contradiction)=0 and (entailment)=1
mapping = {2: 0, 1: 0, 0: 1}
train_dataset = Dataset.from_dict({
    "sentence1": train_dataset["premise"],
    "sentence2": train_dataset["hypothesis"],
    "label": [float(mapping[label]) for label in train_dataset["label"]]
})

In [33]:
# Create an embedding similarity evaluator for stsb
val_sts = load_dataset('glue', 'stsb', split='validation')
evaluator = EmbeddingSimilarityEvaluator(
    sentences1=val_sts["sentence1"],
    sentences2=val_sts["sentence2"],
    scores=[score/5 for score in val_sts["label"]],
    main_similarity="cosine"
)

In [36]:
# Define model
embedding_model = SentenceTransformer('bert-base-uncased')

# Loss function
train_loss = losses.CosineSimilarityLoss(model=embedding_model)



In [37]:
# Define the training arguments
args = SentenceTransformerTrainingArguments(
    output_dir="cosineloss_embedding_model",
    num_train_epochs=1,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    warmup_steps=100,
    fp16=True,
    eval_steps=100,
    logging_steps=100,
)

In [38]:
# Train model
trainer = SentenceTransformerTrainer(
    model=embedding_model,
    args=args,
    train_dataset=train_dataset,
    loss=train_loss,
    evaluator=evaluator
)

In [39]:
trainer.train()

  0%|          | 0/1563 [00:00<?, ?it/s]

{'loss': 0.2297, 'grad_norm': 1.8058668375015259, 'learning_rate': 5e-05, 'epoch': 0.06}
{'loss': 0.1679, 'grad_norm': 2.009979009628296, 'learning_rate': 4.6582365003417636e-05, 'epoch': 0.13}
{'loss': 0.1725, 'grad_norm': 1.6234655380249023, 'learning_rate': 4.316473000683528e-05, 'epoch': 0.19}
{'loss': 0.1585, 'grad_norm': 1.1327171325683594, 'learning_rate': 3.9747095010252904e-05, 'epoch': 0.26}
{'loss': 0.1538, 'grad_norm': 1.5563822984695435, 'learning_rate': 3.632946001367054e-05, 'epoch': 0.32}


Computing widget examples:   0%|          | 0/1 [00:00<?, ?example/s]

{'loss': 0.1574, 'grad_norm': 1.4668450355529785, 'learning_rate': 3.291182501708818e-05, 'epoch': 0.38}
{'loss': 0.1533, 'grad_norm': 1.2550017833709717, 'learning_rate': 2.9494190020505813e-05, 'epoch': 0.45}
{'loss': 0.1575, 'grad_norm': 1.3735573291778564, 'learning_rate': 2.6076555023923443e-05, 'epoch': 0.51}
{'loss': 0.149, 'grad_norm': 1.4254496097564697, 'learning_rate': 2.2658920027341084e-05, 'epoch': 0.58}
{'loss': 0.1468, 'grad_norm': 1.0495551824569702, 'learning_rate': 1.9241285030758715e-05, 'epoch': 0.64}
{'loss': 0.148, 'grad_norm': 1.1007907390594482, 'learning_rate': 1.5823650034176352e-05, 'epoch': 0.7}
{'loss': 0.1464, 'grad_norm': 1.0369843244552612, 'learning_rate': 1.2406015037593984e-05, 'epoch': 0.77}
{'loss': 0.1449, 'grad_norm': 1.4250003099441528, 'learning_rate': 8.988380041011621e-06, 'epoch': 0.83}
{'loss': 0.14, 'grad_norm': 1.3402336835861206, 'learning_rate': 5.570745044429255e-06, 'epoch': 0.9}
{'loss': 0.1401, 'grad_norm': 1.1403541564941406, 'lear

TrainOutput(global_step=1563, training_loss=0.15715857339225667, metrics={'train_runtime': 76.861, 'train_samples_per_second': 650.525, 'train_steps_per_second': 20.335, 'total_flos': 0.0, 'train_loss': 0.15715857339225667, 'epoch': 1.0})

In [40]:
# Evaluate our trained model
evaluator(embedding_model)

{'pearson_cosine': 0.7338387386384624,
 'spearman_cosine': 0.7356435201309607,
 'pearson_manhattan': 0.7431044662138206,
 'spearman_manhattan': 0.7429445631964317,
 'pearson_euclidean': 0.7429558955289768,
 'spearman_euclidean': 0.7428646728011881,
 'pearson_dot': 0.6999397043633594,
 'spearman_dot': 0.6999497108007732,
 'pearson_max': 0.7431044662138206,
 'spearman_max': 0.7429445631964317}

In [41]:
import gc
import torch

gc.collect()
torch.cuda.empty_cache()

### Multiple Negatives Ranking Loss

Multiple negatives ranking (MNR) loss, often referred to as InfoNCE or NTXentLoss, is a loss that uses either positive pairs of sentences or triplets that **contain a pair of positive sentences and an additional unrelated sentence.**

- **Negative pairs** are constructed by mixing a positive pair with another positive pair. These negatives are called **in-batch negatives** and can also be used to generate the triplets.

After having generated these positive and negative pairs, we calculate their embeddings and apply cosine similarity. 

- These similarity scores are then used to answer the question: are these pairs negative or positive? In other words, it is treated as a classification task, and we can use **cross-entropy loss to optimize the model**.
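The classification view can be sketched in a few lines: for each anchor, every positive in the batch is a candidate, the anchor's own positive is the correct class, and cross-entropy is taken over the scaled cosine similarities. The scale factor of 20 is an assumption here, and the toy vectors are illustrative.

```python
from math import exp, log, sqrt


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


def mnr_loss(anchors, positives, scale=20.0):
    """Cross-entropy over in-batch candidates; positives[i] is the target for anchors[i]."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, candidate) for candidate in positives]
        log_sum_exp = log(sum(exp(z) for z in logits))
        losses.append(log_sum_exp - logits[i])  # -log softmax of the true pair
    return sum(losses) / len(losses)


anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]  # each positive is close to its own anchor
swapped = [[0.1, 0.9], [0.9, 0.1]]  # each positive is close to the other anchor
print(mnr_loss(anchors, aligned) < mnr_loss(anchors, swapped))  # True
```

With a batch of size B, each anchor is compared against B - 1 in-batch negatives, which is why larger batches make the task harder.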

To make these triplets, we start with an **anchor sentence** (here, the “premise”), against which the other sentences are compared.

- We only select sentence pairs that are positive (i.e., labeled as “entailment”).
- To add negative sentences, we randomly sample sentences from the “hypothesis” column.

In [None]:
# Load MNLI dataset from GLUE and keep only entailment pairs (label 0)
mnli = load_dataset("glue", "mnli", split="train").select(range(50_000))
mnli = mnli.remove_columns("idx")
mnli = mnli.filter(lambda x: x["label"] == 0)

In [44]:
# Prepare data and add a soft negative
train_dataset = {"anchor": [], "positive": [], "negative": []}
soft_negatives = list(mnli["hypothesis"])  # copy into a plain list so it can be shuffled in place
random.shuffle(soft_negatives)
for row, soft_negative in tqdm(zip(mnli, soft_negatives)):
    train_dataset["anchor"].append(row["premise"])
    train_dataset["positive"].append(row["hypothesis"])
    train_dataset["negative"].append(soft_negative)
train_dataset = Dataset.from_dict(train_dataset)
len(train_dataset)

16875it [00:00, 45839.20it/s]


16875

In [46]:
train_dataset[2]

{'anchor': 'How do you know? All this is their information again.',
 'positive': 'This information belongs to them.',
 'negative': "They're pretty nice yes."}

In [47]:
train_dataset[3]

{'anchor': "my walkman broke so i'm upset now i just have to turn the stereo up real loud",
 'positive': "I'm upset that my walkman broke and now I have to turn the stereo up really loud.",
 'negative': 'I asked what was wrong with the frosting.'}

In [48]:
# Create an embedding similarity evaluator for stsb
val_sts = load_dataset('glue', 'stsb', split='validation')
evaluator = EmbeddingSimilarityEvaluator(
    sentences1=val_sts["sentence1"],
    sentences2=val_sts["sentence2"],
    scores=[score/5 for score in val_sts["label"]],
    main_similarity="cosine"
)

In [49]:
val_sts

Dataset({
    features: ['sentence1', 'sentence2', 'label', 'idx'],
    num_rows: 1500
})

In [50]:
# Define model
embedding_model = SentenceTransformer('bert-base-uncased')

# Loss function
train_loss = losses.MultipleNegativesRankingLoss(model=embedding_model)




In [51]:
# Define the training arguments
args = SentenceTransformerTrainingArguments(
    output_dir="training_mnrloss_embedding_model",
    num_train_epochs=1,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    warmup_steps=100,
    fp16=True,
    eval_steps=100,
    logging_steps=100,
)


In [52]:
# Train model
trainer = SentenceTransformerTrainer(
    model=embedding_model,
    args=args,
    train_dataset=train_dataset,
    loss=train_loss,
    evaluator=evaluator
)

In [54]:
trainer.train()

  0%|          | 0/528 [00:00<?, ?it/s]

{'loss': 0.3277, 'grad_norm': 5.583022594451904, 'learning_rate': 4.85e-05, 'epoch': 0.19}
{'loss': 0.1104, 'grad_norm': 4.510645866394043, 'learning_rate': 3.866822429906542e-05, 'epoch': 0.38}
{'loss': 0.0783, 'grad_norm': 3.3987390995025635, 'learning_rate': 2.698598130841122e-05, 'epoch': 0.57}
{'loss': 0.0677, 'grad_norm': 0.8355557918548584, 'learning_rate': 1.530373831775701e-05, 'epoch': 0.76}
{'loss': 0.0688, 'grad_norm': 1.185404658317566, 'learning_rate': 3.6214953271028036e-06, 'epoch': 0.95}


Computing widget examples:   0%|          | 0/1 [00:00<?, ?example/s]

{'train_runtime': 32.9937, 'train_samples_per_second': 511.461, 'train_steps_per_second': 16.003, 'train_loss': 0.1269701152588382, 'epoch': 1.0}


TrainOutput(global_step=528, training_loss=0.1269701152588382, metrics={'train_runtime': 32.9937, 'train_samples_per_second': 511.461, 'train_steps_per_second': 16.003, 'total_flos': 0.0, 'train_loss': 0.1269701152588382, 'epoch': 1.0})

In [55]:
# Evaluate our trained model
evaluator(embedding_model)

{'pearson_cosine': 0.8017746118354057,
 'spearman_cosine': 0.806607378712418,
 'pearson_manhattan': 0.8185531478394932,
 'spearman_manhattan': 0.8146311493742627,
 'pearson_euclidean': 0.8186071709577609,
 'spearman_euclidean': 0.8147699590785977,
 'pearson_dot': 0.7352056555456891,
 'spearman_dot': 0.7235103433617001,
 'pearson_max': 0.8186071709577609,
 'spearman_max': 0.8147699590785977}

We would like to have negatives that are very related to the question but not the right answer. These negatives are called **hard negatives**.


Gathering negatives can roughly be divided into the following three processes:

- **Easy negatives**: through randomly sampling documents as we did before.
- **Semi-hard negatives**: using a pretrained embedding model, we can apply cosine similarity on all sentence embeddings to find those that are highly related. Generally, this does not lead to hard negatives since this method merely finds similar sentences, not question/answer pairs.
- **Hard negatives**: these often need to be labeled manually (for instance, starting from generated semi-hard negatives), or you can use a generative model to either judge or generate sentence pairs.
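The semi-hard case above can be sketched as a nearest-neighbor lookup: among all candidates except the known positive, pick the one whose embedding is most similar to the anchor. The function name and toy embeddings are illustrative. In practice the embeddings would come from a pretrained model, and the mined candidates may still need checking to confirm they really are negatives.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


def mine_semi_hard_negative(anchor, positive_idx, candidates):
    """Index of the candidate most similar to the anchor, skipping its known positive."""
    scored = [
        (cosine(anchor, candidate), idx)
        for idx, candidate in enumerate(candidates)
        if idx != positive_idx
    ]
    return max(scored)[1]


anchor = [1.0, 0.0]
candidates = [
    [0.9, 0.1],   # 0: the labeled positive
    [0.7, 0.7],   # 1: related, but not the positive
    [-1.0, 0.0],  # 2: unrelated
]
print(mine_semi_hard_negative(anchor, 0, candidates))  # 1
```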

# Fine-Tuning an Embedding Model


In the previous section, we went through the basics of training an embedding model from scratch and saw how we could leverage loss functions to further optimize its performance. 

This approach, although quite powerful, requires creating an embedding model from scratch. This process can be quite costly and time-consuming.

Instead, the sentence-transformers framework allows nearly all embedding models to be used as a base for fine-tuning. 

## Supervised Fine-Tuning

The most straightforward way to fine-tune an embedding model is to repeat the training process we used before, but replace 'bert-base-uncased' with a pretrained sentence-transformers model.

There are many to choose from, but all-MiniLM-L6-v2 generally performs well across many use cases and, due to its small size, is quite fast.

We use the same data as we used to train our model in the MNR loss example but instead use a pretrained embedding model to fine-tune. 

In [56]:
# Load MNLI dataset from GLUE
# 0 = entailment, 1 = neutral, 2 = contradiction
train_dataset = load_dataset("glue", "mnli", split="train").select(range(50_000))
train_dataset = train_dataset.remove_columns("idx")

In [57]:
# Create an embedding similarity evaluator for stsb
val_sts = load_dataset('glue', 'stsb', split='validation')
evaluator = EmbeddingSimilarityEvaluator(
    sentences1=val_sts["sentence1"],
    sentences2=val_sts["sentence2"],
    scores=[score/5 for score in val_sts["label"]],
    main_similarity="cosine"
)

In [58]:
# Define model
embedding_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

In [59]:
# Loss function
train_loss = losses.MultipleNegativesRankingLoss(model=embedding_model)

In [60]:
# Define the training arguments
args = SentenceTransformerTrainingArguments(
    output_dir="finetuned_embedding_model",
    num_train_epochs=1,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    warmup_steps=100,
    fp16=True,
    eval_steps=100,
    logging_steps=100,
)

In [61]:
# Train model
trainer = SentenceTransformerTrainer(
    model=embedding_model,
    args=args,
    train_dataset=train_dataset,
    loss=train_loss,
    evaluator=evaluator
)

In [62]:
trainer.train()

  0%|          | 0/1563 [00:00<?, ?it/s]


{'loss': 0.1658, 'grad_norm': 3.795701265335083, 'learning_rate': 5e-05, 'epoch': 0.06}
{'loss': 0.1092, 'grad_norm': 3.0204427242279053, 'learning_rate': 4.6582365003417636e-05, 'epoch': 0.13}
{'loss': 0.1208, 'grad_norm': 2.0707175731658936, 'learning_rate': 4.316473000683528e-05, 'epoch': 0.19}
{'loss': 0.1128, 'grad_norm': 3.7757997512817383, 'learning_rate': 3.9747095010252904e-05, 'epoch': 0.26}
{'loss': 0.1098, 'grad_norm': 4.821083068847656, 'learning_rate': 3.632946001367054e-05, 'epoch': 0.32}


Computing widget examples:   0%|          | 0/1 [00:00<?, ?example/s]

{'loss': 0.1005, 'grad_norm': 2.3958559036254883, 'learning_rate': 3.291182501708818e-05, 'epoch': 0.38}
{'loss': 0.1183, 'grad_norm': 4.6376051902771, 'learning_rate': 2.9494190020505813e-05, 'epoch': 0.45}
{'loss': 0.1015, 'grad_norm': 1.9947108030319214, 'learning_rate': 2.6076555023923443e-05, 'epoch': 0.51}
{'loss': 0.1099, 'grad_norm': 1.2352083921432495, 'learning_rate': 2.2658920027341084e-05, 'epoch': 0.58}
{'loss': 0.1011, 'grad_norm': 5.134367942810059, 'learning_rate': 1.9241285030758715e-05, 'epoch': 0.64}
{'loss': 0.0932, 'grad_norm': 2.1828413009643555, 'learning_rate': 1.5823650034176352e-05, 'epoch': 0.7}
{'loss': 0.106, 'grad_norm': 1.9396013021469116, 'learning_rate': 1.2406015037593984e-05, 'epoch': 0.77}
{'loss': 0.1058, 'grad_norm': 2.8037314414978027, 'learning_rate': 8.988380041011621e-06, 'epoch': 0.83}
{'loss': 0.105, 'grad_norm': 1.2151201963424683, 'learning_rate': 5.570745044429255e-06, 'epoch': 0.9}
{'loss': 0.1074, 'grad_norm': 3.354924201965332, 'learnin

TrainOutput(global_step=1563, training_loss=0.11024498771721174, metrics={'train_runtime': 26.4611, 'train_samples_per_second': 1889.564, 'train_steps_per_second': 59.068, 'total_flos': 0.0, 'train_loss': 0.11024498771721174, 'epoch': 1.0})

In [63]:
# Evaluate our trained model
evaluator(embedding_model)

{'pearson_cosine': 0.8476046186902555,
 'spearman_cosine': 0.8474595848409044,
 'pearson_manhattan': 0.8501359286358263,
 'spearman_manhattan': 0.8468474331729003,
 'pearson_euclidean': 0.8508508761235378,
 'spearman_euclidean': 0.8474595848409044,
 'pearson_dot': 0.8476046173484233,
 'spearman_dot': 0.8474595848409044,
 'pearson_max': 0.8508508761235378,
 'spearman_max': 0.8474595848409044}

### TIP

Instead of using a pretrained BERT model like 'bert-base-uncased' or a possibly out-of-domain model like 'all-mpnet-base-v2', you can also perform **masked language modeling** on the pretrained BERT model to first adapt it to your domain.

Then, you can use this fine-tuned BERT model as the base for training your embedding model.

This is a form of **domain adaptation**. (*In the next chapter, we will apply masked language modeling on a pretrained model.*)

## Augmented SBERT

A disadvantage of training or fine-tuning these embedding models is that they often require substantial training data. 

Collecting that many sentence pairs for your use case is often not possible; in many cases, only a couple of thousand labeled data points are available.

Fortunately, there is a way to augment your data such that an embedding model can be fine-tuned when there is only a little labeled data available. This procedure is referred to as **Augmented SBERT**.

In this procedure, we aim to augment the small amount of labeled data so that it can be used for regular training. It makes use of the slower but more accurate cross-encoder architecture (BERT) to augment and label a larger set of input pairs. These newly labeled pairs are then used for fine-tuning a bi-encoder (SBERT). The steps are:

- Fine-tune a cross-encoder (BERT) using a small, annotated dataset (gold dataset).
- Create new sentence pairs.
- Label new sentence pairs with the fine-tuned cross-encoder (silver dataset).
- Train a bi-encoder (SBERT) on the extended dataset (gold + silver dataset).

# <img src="imgs/augmentedsbert.png" alt="Patching" width="500" height="200">

Augmented SBERT works through training a cross-encoder on a small gold dataset, then using that to label an unlabeled dataset to generate a larger silver dataset. Finally, both the gold and silver datasets are used to train the bi-encoder.

### Start fine-tuning Augmented SBERT

Instead of our original 50,000 sentence pairs, we take a subset of 10,000 pairs to simulate a setting where we have limited annotated data.

In [64]:
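# Prepare a small set of 10,000 sentence pairs for the cross-encoder
dataset = load_dataset("glue", "mnli", split="train").select(range(10_000))
# MNLI labels: 0 = entailment, 1 = neutral, 2 = contradiction.
# Collapse to binary: entailment -> 1 (similar), neutral/contradiction -> 0
mapping = {2: 0, 1: 0, 0: 1}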


In [None]:
# Data Loader
gold_examples = [
    InputExample(texts=[row["premise"], row["hypothesis"]], label=mapping[row["label"]])
    for row in tqdm(dataset)
]
gold_dataloader = NoDuplicatesDataLoader(gold_examples, batch_size=32)

100%|██████████| 10000/10000 [00:00<00:00, 78329.69it/s]


In [69]:
# Pandas DataFrame for easier data handling
gold = pd.DataFrame(
    {
    'sentence1': dataset['premise'],
    'sentence2': dataset['hypothesis'],
    'label': [mapping[label] for label in dataset['label']]
    }
)

In [70]:
gold

| | sentence1 | sentence2 | label |
|---|---|---|---|
| 0 | Conceptually cream skimming has two basic dime... | Product and geography are what make cream skim... | 0 |
| 1 | you know during the season and i guess at at y... | You lose the things to the following level if ... | 1 |
| 2 | One of our number will carry out your instruct... | A member of my team will execute your orders w... | 1 |
| 3 | How do you know? All this is their information... | This information belongs to them. | 1 |
| 4 | yeah i tell you what though if you go price so... | The tennis shoes have a range of prices. | 0 |
| ... | ... | ... | ... |
| 9995 | Because, despite its monopoly power, Microsoft... | Microsoft owns 60 percent of all computer-rela... | 0 |
| 9996 | 'Right,' I mumbled. | 'Wrong', I said. | 0 |
| 9997 | Thanks dad. | Thanks Obama. | 0 |
| 9998 | which is good | I don't think that's great | 0 |


In [73]:
# Train a cross-encoder on the gold dataset
cross_encoder = CrossEncoder('bert-base-uncased', num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [78]:
# cross_encoder.model

In [87]:
batch = next(iter(gold_dataloader))

In [88]:
batch[0].texts

['Random CWC inspections, Bork says, violate the Fourth Amendment, which requires a warrant to search private facilities.',
 'They do not have a warrant to conduct these random inspections.']

In [89]:
batch[0].label

0

In [90]:
gold_dataloader = NoDuplicatesDataLoader(gold_examples, batch_size=32)
cross_encoder.fit(
    train_dataloader=gold_dataloader,
    epochs=1,
    show_progress_bar=True,
    warmup_steps=100,
    use_amp=False
)

Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

Iteration:   0%|          | 0/312 [00:00<?, ?it/s]

After training our cross-encoder, we use the remaining 40,000 sentence pairs (from our original dataset of 50,000 sentence pairs) as our silver dataset (step 2):

**Step 2:** Create new sentence pairs

In [91]:
# Prepare the silver dataset by predicting labels with the cross-encoder
silver = load_dataset(
    "glue", "mnli", split="train"
).select(range(10_000, 50_000))
pairs = list(zip(silver["premise"], silver["hypothesis"]))

In [93]:
pairs[0]

('Hindus and Buddhists still bathe where he bathed.',
 'Hindus and Buddhists bathe in the same location.')
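The notebook takes a shortcut for step 2: it reuses existing MNLI pairs and simply ignores their labels. When only individual unlabeled sentences are available, candidate pairs have to be constructed first. The following is a minimal sketch, assuming plain random recombination is acceptable. The helper `sample_new_pairs` and the toy sentences are hypothetical and not part of the notebook; more targeted strategies (for example BM25 or semantic search sampling) are often preferred because random pairs are mostly dissimilar.

```python
import random
from itertools import combinations


def sample_new_pairs(sentences, existing_pairs, n_pairs, seed=0):
    """Randomly recombine sentences into pairs that are not already labeled.

    Hypothetical helper: enumerates all unordered pairs, which is only
    practical for small collections (the count grows quadratically).
    """
    seen = {frozenset(pair) for pair in existing_pairs}
    candidates = [
        pair
        for pair in combinations(sorted(set(sentences)), 2)
        if frozenset(pair) not in seen
    ]
    rng = random.Random(seed)
    return rng.sample(candidates, min(n_pairs, len(candidates)))


# Toy data (assumption, for illustration only)
sentences = [
    "A man is playing a guitar.",
    "Someone plays an instrument.",
    "The stock market fell today.",
    "Shares dropped sharply.",
]
gold_pairs = [("A man is playing a guitar.", "Someone plays an instrument.")]

new_pairs = sample_new_pairs(sentences, gold_pairs, n_pairs=3)
for pair in new_pairs:
    print(pair)
```

The resulting list has the same shape as `pairs` above, so it could be passed to the cross-encoder for labeling in the next step.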

**Step 3:** Label new sentence pairs with the fine-tuned cross-encoder (silver dataset)

In [95]:
# Label the sentence pairs using our fine-tuned cross-encoder
output = cross_encoder.predict(pairs, apply_softmax=True, show_progress_bar=True)

Batches:   0%|          | 0/1250 [00:00<?, ?it/s]

In [97]:
output.shape

(40000, 2)

In [98]:
output

array([[0.32352844, 0.67647153],
       [0.9028155 , 0.09718444],
       [0.9230069 , 0.07699306],
       ...,
       [0.8280179 , 0.17198205],
       [0.23569162, 0.76430833],
       [0.15352033, 0.8464796 ]], dtype=float32)

In [101]:
silver = pd.DataFrame(
    {
        "sentence1": silver["premise"], 
        "sentence2": silver["hypothesis"],
        "label": np.argmax(output, axis=1)
    }
)
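Each row of `output` holds the softmax probabilities for class 0 and class 1, and `np.argmax` picks the more probable class as the silver label. The sketch below reproduces that step in plain Python on the first three rows printed above. The `min_confidence` threshold is an assumption added for illustration: the notebook keeps every prediction, but discarding uncertain ones is one way to reduce label noise.

```python
def to_silver_labels(probabilities, min_confidence=0.0):
    """Return (row_index, label) for rows whose top probability passes the threshold."""
    labeled = []
    for i, row in enumerate(probabilities):
        # argmax over the class probabilities
        label = max(range(len(row)), key=row.__getitem__)
        if row[label] >= min_confidence:
            labeled.append((i, label))
    return labeled


# First three rows of the cross-encoder output shown above
probs = [
    [0.32352844, 0.67647153],
    [0.9028155, 0.09718444],
    [0.9230069, 0.07699306],
]

print(to_silver_labels(probs))                      # [(0, 1), (1, 0), (2, 0)]
print(to_silver_labels(probs, min_confidence=0.9))  # [(1, 0), (2, 0)]
```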

In [102]:
silver

| | sentence1 | sentence2 | label |
|---|---|---|---|
| 0 | Hindus and Buddhists still bathe where he bathed. | Hindus and Buddhists bathe in the same location. | 1 |
| 1 | Probably no one will even notice you at all." | Everyone will know who you are. | 0 |
| 2 | well what what do you mean if they can prove i... | You don't need to say anymore about the matter... | 0 |
| 3 | I feel dizzy. | The dizziness I feel is from drinking. | 0 |
| 4 | Well, he did, sir. | Sir, well, he did complete it before he left l... | 0 |
| ... | ... | ... | ... |
| 39995 | It was a cop, a woman of intermediate age. | The police were after us. | 0 |
| 39996 | In Trinidad, Motel Las Cuevas has a disco with... | There is a disco in Motel Las Cuevas. | 1 |
| 39997 | Tommy's heart sank at the sight of them. | The sight of them cheered Tommy up. | 0 |
| 39998 | Wodehouse, Paul Gigot and Mary McGrory. | Paul Gigot and others. | 1 |


**Step 4:** Train a bi-encoder (SBERT) on the extended dataset (gold + silver dataset)

In [103]:
# Combine gold + silver
data = pd.concat([gold, silver], ignore_index=True, axis=0)
data = data.drop_duplicates(subset=['sentence1', 'sentence2'], keep="first")
train_dataset = Dataset.from_pandas(data, preserve_index=False)

In [104]:
len(train_dataset)

49998
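The 10,000 gold rows plus 40,000 silver rows give 50,000 rows, so a length of 49,998 means two duplicate pairs were dropped. Because the gold rows are concatenated first and `keep="first"` is used, a duplicated pair keeps its human-annotated gold label rather than the predicted silver label. A minimal pure-Python sketch of that behaviour, using a hypothetical helper and toy rows:

```python
def combine_gold_silver(gold_rows, silver_rows):
    """Concatenate gold then silver, keeping the first row seen for each pair."""
    combined = {}
    for sentence1, sentence2, label in list(gold_rows) + list(silver_rows):
        # setdefault only stores the label if the pair has not been seen yet
        combined.setdefault((sentence1, sentence2), label)
    return [(s1, s2, label) for (s1, s2), label in combined.items()]


# Toy rows (assumption): the pair ("a", "b") appears in both sets
gold_rows = [("a", "b", 1), ("c", "d", 0)]
silver_rows = [("a", "b", 0), ("e", "f", 1)]

merged = combine_gold_silver(gold_rows, silver_rows)
print(merged)  # [('a', 'b', 1), ('c', 'd', 0), ('e', 'f', 1)]
```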

In [105]:
# Create an embedding similarity evaluator for stsb
val_sts = load_dataset('glue', 'stsb', split='validation')
evaluator = EmbeddingSimilarityEvaluator(
    sentences1=val_sts["sentence1"],
    sentences2=val_sts["sentence2"],
    scores=[score/5 for score in val_sts["label"]],
    main_similarity="cosine"
)
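The evaluator embeds both sentence lists, computes a similarity per pair, and correlates those similarities with the human scores. STS-B scores range from 0 to 5, which is why they are divided by 5 here. The sketch below shows how a value like `spearman_cosine` can be computed, written in plain Python with toy two-dimensional "embeddings" (an assumption for illustration; the library's own implementation may differ in details).

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def ranks(values):
    """Average 1-based ranks, so tied values share the same rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def pearson(x, y):
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    # Spearman correlation is the Pearson correlation of the ranks
    return pearson(ranks(x), ranks(y))


# Toy embeddings for three sentence pairs and their 0-5 human scores
emb1 = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
emb2 = [[1.0, 0.1], [1.0, 1.0], [0.0, 1.0]]
gold = [score / 5 for score in [4.8, 2.5, 0.2]]

predicted = [cosine(u, v) for u, v in zip(emb1, emb2)]
rho = spearman(predicted, gold)
print(rho)  # 1.0, because the predicted ordering matches the gold ordering
```

Spearman only compares orderings, so a model can score well even if its raw cosine values are not on the same scale as the normalized labels.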

In [106]:
# Define model
embedding_model = SentenceTransformer('bert-base-uncased')

# Loss function
train_loss = losses.CosineSimilarityLoss(model=embedding_model)
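As a rough picture of what this loss optimizes: the cosine similarity between the two sentence embeddings is compared with the target label, by default using mean squared error. The following is a sketch of that idea in plain Python with toy vectors, not the library's implementation.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def cosine_similarity_loss(pairs, labels):
    """Mean squared error between cosine(u, v) and the target label."""
    errors = [(cosine(u, v) - y) ** 2 for (u, v), y in zip(pairs, labels)]
    return sum(errors) / len(errors)


# Toy embedding pairs (assumption): one parallel pair, one orthogonal pair
pairs = [([1.0, 0.0], [2.0, 0.0]), ([1.0, 0.0], [0.0, 3.0])]

print(cosine_similarity_loss(pairs, [1.0, 0.0]))  # 0.0 (labels agree with the geometry)
print(cosine_similarity_loss(pairs, [0.0, 1.0]))  # 1.0 (labels contradict the geometry)
```

With the binary gold and silver labels used here, the model is pushed toward cosine 1 for pairs labeled 1 and cosine 0 for pairs labeled 0.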



In [107]:
# Define the training arguments
args = SentenceTransformerTrainingArguments(
    output_dir="augmented_embedding_model",
    num_train_epochs=1,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    warmup_steps=100,
    fp16=True,
    eval_steps=100,
    logging_steps=100,
)

In [108]:
# Train model
trainer = SentenceTransformerTrainer(
    model=embedding_model,
    args=args,
    train_dataset=train_dataset,
    loss=train_loss,
    evaluator=evaluator
)

In [109]:
trainer.train()

  0%|          | 0/1563 [00:00<?, ?it/s]

{'loss': 0.2244, 'grad_norm': 1.6568797826766968, 'learning_rate': 5e-05, 'epoch': 0.06}
{'loss': 0.1649, 'grad_norm': 1.353955864906311, 'learning_rate': 4.6582365003417636e-05, 'epoch': 0.13}
{'loss': 0.1495, 'grad_norm': 1.5114811658859253, 'learning_rate': 4.316473000683528e-05, 'epoch': 0.19}
{'loss': 0.1454, 'grad_norm': 1.3773846626281738, 'learning_rate': 3.9747095010252904e-05, 'epoch': 0.26}
{'loss': 0.1451, 'grad_norm': 1.428134799003601, 'learning_rate': 3.632946001367054e-05, 'epoch': 0.32}


Computing widget examples:   0%|          | 0/1 [00:00<?, ?example/s]

{'loss': 0.1413, 'grad_norm': 0.9189131855964661, 'learning_rate': 3.291182501708818e-05, 'epoch': 0.38}
{'loss': 0.1379, 'grad_norm': 1.5339771509170532, 'learning_rate': 2.9494190020505813e-05, 'epoch': 0.45}
{'loss': 0.1344, 'grad_norm': 1.3115142583847046, 'learning_rate': 2.6076555023923443e-05, 'epoch': 0.51}
{'loss': 0.1396, 'grad_norm': 1.3162517547607422, 'learning_rate': 2.2658920027341084e-05, 'epoch': 0.58}
{'loss': 0.1379, 'grad_norm': 2.0594136714935303, 'learning_rate': 1.9241285030758715e-05, 'epoch': 0.64}
{'loss': 0.134, 'grad_norm': 1.292578935623169, 'learning_rate': 1.5823650034176352e-05, 'epoch': 0.7}
{'loss': 0.138, 'grad_norm': 1.0106781721115112, 'learning_rate': 1.2406015037593984e-05, 'epoch': 0.77}
{'loss': 0.135, 'grad_norm': 1.2033296823501587, 'learning_rate': 8.988380041011621e-06, 'epoch': 0.83}
{'loss': 0.1341, 'grad_norm': 1.5598094463348389, 'learning_rate': 5.570745044429255e-06, 'epoch': 0.9}
{'loss': 0.1334, 'grad_norm': 0.9005824327468872, 'lear

TrainOutput(global_step=1563, training_loss=0.1456825832335215, metrics={'train_runtime': 77.1952, 'train_samples_per_second': 647.682, 'train_steps_per_second': 20.247, 'total_flos': 0.0, 'train_loss': 0.1456825832335215, 'epoch': 1.0})

In [110]:
# Evaluate our trained model
evaluator(embedding_model)

{'pearson_cosine': 0.697684514280203,
 'spearman_cosine': 0.7006963038426953,
 'pearson_manhattan': 0.7203120291872793,
 'spearman_manhattan': 0.7170015019278266,
 'pearson_euclidean': 0.7200757173710048,
 'spearman_euclidean': 0.7167987301830531,
 'pearson_dot': 0.6446288153738338,
 'spearman_dot': 0.6430106252431572,
 'pearson_max': 0.7203120291872793,
 'spearman_max': 0.7170015019278266}

In [111]:
import gc
import torch

gc.collect()
torch.cuda.empty_cache()

# Unsupervised Learning

To create an embedding model, we typically need labeled data. However, not all real-world datasets come with a nice set of labels that we can use. 

We instead look for techniques to train the model without any predetermined labels: unsupervised learning.

Many approaches exist, such as:
 
- Simple Contrastive Learning of Sentence Embeddings (SimCSE)
- Contrastive Tension (CT)
- Transformer-based Sequential Denoising Auto-Encoder (TSDAE)
- Generative Pseudo-Labeling (GPL)


## Transformer-Based Sequential Denoising Auto-Encoder (TSDAE)

The underlying idea of TSDAE is that we **add noise** to the input sentence by removing a certain percentage of words from it. 

This “damaged” sentence is put through an encoder, with a pooling layer on top of it, to map it to a sentence embedding.

From this sentence embedding, a decoder tries to reconstruct the original sentence, that is, the sentence without the artificial noise.

The main concept here is that the more accurate the sentence embedding is, the more accurate the reconstructed sentence will be.

**This method is very similar to masked language modeling, where we try to reconstruct and learn certain masked words. Here, instead of reconstructing masked words, we try to reconstruct the entire sentence.**


<img src="imgs/tsdae.png" alt="Hugging Face" height=400 width=500>

*Caption*:  TSDAE randomly removes words from an input sentence that is passed through an encoder to generate a sentence embedding. From this sentence embedding, the original sentence is reconstructed.

After training, we only need the encoder to generate embeddings from text; the decoder is used during training solely to judge whether the embeddings allow the original sentence to be reconstructed accurately.
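The noise step itself is simple. Below is a minimal sketch of deletion noise, assuming whitespace tokenization (the library relies on an NLTK tokenizer instead) and a deletion ratio of 0.6, the setting reported to work well in the TSDAE paper. The helper `delete_words` is hypothetical and only mirrors the idea.

```python
import random


def delete_words(sentence, del_ratio=0.6, rng=random):
    """Drop each word with probability del_ratio, always keeping at least one."""
    words = sentence.split()
    if not words:
        return sentence
    keep = [rng.random() > del_ratio for _ in words]
    if not any(keep):
        # Never return an empty sentence: keep one random word
        keep[rng.randrange(len(words))] = True
    return " ".join(word for word, kept in zip(words, keep) if kept)


rng = random.Random(0)  # seeded only to make the sketch repeatable
original = "Hazardous waste disposal has some associated costs."
damaged = delete_words(original, rng=rng)
print(damaged)
```

Because the deletion is random, the same sentence yields a different damaged version on each call, so the model sees varying noise for the same original sentence.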

Since we only need a bunch of sentences without any labels, training this model is straightforward. We start by downloading an external tokenizer, which is used for the denoising procedure:

In [112]:
# Download additional tokenizer
nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/david/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

Then, we create flat sentences from our data and remove any labels that we have to mimic an unsupervised setting:

In [113]:
# Create a flat list of sentences
mnli = load_dataset("glue", "mnli", split="train").select(range(25_000))
flat_sentences = mnli["premise"] + mnli["hypothesis"]

In [116]:
# Add noise to our input data
damaged_data = DenoisingAutoEncoderDataset(list(set(flat_sentences)))

In [128]:
example = damaged_data[0]

In [129]:
example.texts

['waste associated costs',
 'Hazardous waste disposal has some associated costs.']

In [126]:
# Create dataset
train_dataset = {"damaged_sentence": [], "original_sentence": []}
for data in tqdm(damaged_data):
    train_dataset["damaged_sentence"].append(data.texts[0])
    train_dataset["original_sentence"].append(data.texts[1])
train_dataset = Dataset.from_dict(train_dataset)

100%|██████████| 48353/48353 [00:03<00:00, 13440.70it/s]


In [127]:
train_dataset[0]

{'damaged_sentence': 'waste disposal has costs',
 'original_sentence': 'Hazardous waste disposal has some associated costs.'}

In [130]:
# Create an embedding similarity evaluator for stsb
val_sts = load_dataset('glue', 'stsb', split='validation')
evaluator = EmbeddingSimilarityEvaluator(
    sentences1=val_sts["sentence1"],
    sentences2=val_sts["sentence2"],
    scores=[score/5 for score in val_sts["label"]],
    main_similarity="cosine"
)

Next, we run the training as before but with the [CLS] token as the pooling strategy instead of the mean pooling of the token embeddings. In the TSDAE paper, this was shown to be more effective since mean pooling loses the position information, which is not the case when using the [CLS] token:

In [133]:
# Create your embedding model
word_embedding_model = models.Transformer('bert-base-uncased')

In [134]:
word_embedding_model

Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 

In [139]:
# list(word_embedding_model.modules())

In [140]:
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(), 'cls')
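To make the difference between the two pooling strategies concrete, here is a small sketch on a toy matrix of token embeddings (the values and helper names are assumptions for illustration). [CLS] pooling takes the vector of the first token, whereas mean pooling averages all non-padding token vectors.

```python
def cls_pooling(token_embeddings):
    """Use the embedding of the first ([CLS]) token as the sentence embedding."""
    return list(token_embeddings[0])


def mean_pooling(token_embeddings, attention_mask):
    """Average the embeddings of all tokens that are not padding."""
    kept = [vec for vec, mask in zip(token_embeddings, attention_mask) if mask == 1]
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


token_embeddings = [
    [1.0, 0.0],  # [CLS]
    [0.0, 2.0],
    [2.0, 4.0],
    [9.0, 9.0],  # padding, ignored by mean pooling
]
attention_mask = [1, 1, 1, 0]

print(cls_pooling(token_embeddings))                   # [1.0, 0.0]
print(mean_pooling(token_embeddings, attention_mask))  # [1.0, 2.0]
```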

In [141]:
embedding_model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

Using our sentence pairs, we need a loss function that attempts to reconstruct the original sentence from the noisy sentence, namely DenoisingAutoEncoderLoss.

Moreover, we tie the parameters of both models. Instead of having separate weights for the encoder’s embedding layer and the decoder’s output layer, they share the same weights. This means that any updates to the weights in one layer will be reflected in the other layer as well:

In [142]:
# Use the denoising auto-encoder loss
train_loss = losses.DenoisingAutoEncoderLoss(
    embedding_model, tie_encoder_decoder=True
)
train_loss.decoder = train_loss.decoder.to("cuda")

Some weights of BertLMHeadModel were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['bert.encoder.layer.0.crossattention.output.LayerNorm.bias', 'bert.encoder.layer.0.crossattention.output.LayerNorm.weight', 'bert.encoder.layer.0.crossattention.output.dense.bias', 'bert.encoder.layer.0.crossattention.output.dense.weight', 'bert.encoder.layer.0.crossattention.self.key.bias', 'bert.encoder.layer.0.crossattention.self.key.weight', 'bert.encoder.layer.0.crossattention.self.query.bias', 'bert.encoder.layer.0.crossattention.self.query.weight', 'bert.encoder.layer.0.crossattention.self.value.bias', 'bert.encoder.layer.0.crossattention.self.value.weight', 'bert.encoder.layer.1.crossattention.output.LayerNorm.bias', 'bert.encoder.layer.1.crossattention.output.LayerNorm.weight', 'bert.encoder.layer.1.crossattention.output.dense.bias', 'bert.encoder.layer.1.crossattention.output.dense.weight', 'bert.encoder.layer.1.crossattention.self.key.bias', 'bert.e

Finally, training our model works the same as we have seen several times before, but we lower the batch size because memory usage increases with this loss function:

In [144]:
# Define the training arguments
args = SentenceTransformerTrainingArguments(
    output_dir="training_tsdae_embedding_model",
    num_train_epochs=1,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    warmup_steps=100,
    fp16=True,
    eval_steps=100,
    logging_steps=100,
)


In [145]:
# Train model
trainer = SentenceTransformerTrainer(
    model=embedding_model,
    args=args,
    train_dataset=train_dataset,
    loss=train_loss,
    evaluator=evaluator
)

In [146]:
trainer.train()

  0%|          | 0/3023 [00:00<?, ?it/s]

We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.


{'loss': 6.9358, 'grad_norm': 9.466378211975098, 'learning_rate': 4.9e-05, 'epoch': 0.03}
{'loss': 4.8196, 'grad_norm': 6.414368152618408, 'learning_rate': 4.8323640095791995e-05, 'epoch': 0.07}
{'loss': 4.605, 'grad_norm': 7.210152626037598, 'learning_rate': 4.66130687649675e-05, 'epoch': 0.1}
{'loss': 4.4569, 'grad_norm': 7.5083160400390625, 'learning_rate': 4.4902497434143e-05, 'epoch': 0.13}
{'loss': 4.3871, 'grad_norm': 6.3871750831604, 'learning_rate': 4.319192610331851e-05, 'epoch': 0.17}


Computing widget examples:   0%|          | 0/1 [00:00<?, ?example/s]

{'loss': 4.2901, 'grad_norm': 6.639077663421631, 'learning_rate': 4.1481354772494014e-05, 'epoch': 0.2}
{'loss': 4.2378, 'grad_norm': 6.110928058624268, 'learning_rate': 3.977078344166952e-05, 'epoch': 0.23}
{'loss': 4.1407, 'grad_norm': 7.846587657928467, 'learning_rate': 3.806021211084502e-05, 'epoch': 0.26}
{'loss': 4.1004, 'grad_norm': 7.981261253356934, 'learning_rate': 3.634964078002053e-05, 'epoch': 0.3}
{'loss': 4.0651, 'grad_norm': 5.656072616577148, 'learning_rate': 3.463906944919603e-05, 'epoch': 0.33}
{'loss': 4.0565, 'grad_norm': 6.975688457489014, 'learning_rate': 3.2928498118371536e-05, 'epoch': 0.36}
{'loss': 3.9328, 'grad_norm': 6.082306385040283, 'learning_rate': 3.121792678754704e-05, 'epoch': 0.4}
{'loss': 3.9532, 'grad_norm': 6.799237251281738, 'learning_rate': 2.9507355456722545e-05, 'epoch': 0.43}
{'loss': 3.8724, 'grad_norm': 7.174602031707764, 'learning_rate': 2.7796784125898052e-05, 'epoch': 0.46}
{'loss': 3.8855, 'grad_norm': 5.872269153594971, 'learning_rate

TrainOutput(global_step=3023, training_loss=4.034782010910332, metrics={'train_runtime': 172.8166, 'train_samples_per_second': 279.794, 'train_steps_per_second': 17.493, 'total_flos': 0.0, 'train_loss': 4.034782010910332, 'epoch': 1.0})
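The step counts in the training logs follow directly from the dataset size and the batch size. A quick check, assuming a single device, no gradient accumulation, and that the last partial batch is kept:

```python
import math


def steps_per_epoch(num_examples, batch_size):
    """Number of optimizer steps for one pass over the data."""
    return math.ceil(num_examples / batch_size)


# Augmented SBERT run: 49,998 pairs with batch size 32
print(steps_per_epoch(49_998, 32))  # 1563

# TSDAE run: 48,353 sentences with batch size 16
print(steps_per_epoch(48_353, 16))  # 3023
```

Halving the batch size roughly doubles the number of steps for a dataset of similar size, which is part of why the TSDAE run takes longer.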

In [147]:
# Evaluate our trained model
evaluator(embedding_model)

{'pearson_cosine': 0.7331246985650315,
 'spearman_cosine': 0.7400307035317102,
 'pearson_manhattan': 0.7306657513545326,
 'spearman_manhattan': 0.7335406002926033,
 'pearson_euclidean': 0.7310164658364212,
 'spearman_euclidean': 0.7335530400406114,
 'pearson_dot': 0.6425985295271535,
 'spearman_dot': 0.6431215653931609,
 'pearson_max': 0.7331246985650315,
 'spearman_max': 0.7400307035317102}

### Domain Adaptation

When you have very little or no labeled data available, you typically use unsupervised learning to create your text embedding model. However, unsupervised techniques are generally outperformed by supervised techniques and **have difficulty learning domain-specific concepts.**

**This is where domain adaptation comes in. Its goal is to update existing embedding models to a specific textual domain.**

The target domain, or out-domain, generally contains words and subjects that were not found in the source domain or in-domain.

**In domain adaptation, the aim is to create and generalize an embedding model from one domain to another.**


### Adaptive Pretraining

One method for domain adaptation is called adaptive pretraining.

- You start by pretraining a model on your domain-specific corpus using an unsupervised technique, such as the previously discussed TSDAE or masked language modeling.
- You then fine-tune that model using a training dataset that can be either outside or in your target domain (*although data from the target domain is preferred, out-domain data also works since we started with unsupervised training on the target domain.*)
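The two stages can be chained with the same building blocks used earlier in this notebook. The function below is a sketch under several assumptions: `adaptive_pretraining` is a hypothetical name, `labeled_dataset` is expected to have `sentence1`, `sentence2`, and `label` columns, the NLTK `punkt` tokenizer has already been downloaded, and default training arguments are acceptable. It has not been run as part of this notebook.

```python
def adaptive_pretraining(domain_sentences, labeled_dataset, base_model="bert-base-uncased"):
    """Stage 1: TSDAE on unlabeled domain text. Stage 2: supervised fine-tuning."""
    # Imported lazily so that defining the function does not require the libraries
    from datasets import Dataset
    from sentence_transformers import (
        SentenceTransformer,
        SentenceTransformerTrainer,
        losses,
        models,
    )
    from sentence_transformers.datasets import DenoisingAutoEncoderDataset

    # Encoder with [CLS] pooling, as in the TSDAE section
    encoder = models.Transformer(base_model)
    pooling = models.Pooling(encoder.get_word_embedding_dimension(), "cls")
    model = SentenceTransformer(modules=[encoder, pooling])

    # Stage 1: unsupervised pretraining on sentences from the target domain
    noisy = DenoisingAutoEncoderDataset(list(set(domain_sentences)))
    texts = [noisy[i].texts for i in range(len(noisy))]
    tsdae_data = Dataset.from_dict(
        {
            "damaged_sentence": [pair[0] for pair in texts],
            "original_sentence": [pair[1] for pair in texts],
        }
    )
    tsdae_loss = losses.DenoisingAutoEncoderLoss(model, tie_encoder_decoder=True)
    tsdae_loss.decoder = tsdae_loss.decoder.to(model.device)
    SentenceTransformerTrainer(
        model=model, train_dataset=tsdae_data, loss=tsdae_loss
    ).train()

    # Stage 2: supervised fine-tuning on labeled pairs (in-domain or out-domain)
    supervised_loss = losses.CosineSimilarityLoss(model)
    SentenceTransformerTrainer(
        model=model, train_dataset=labeled_dataset, loss=supervised_loss
    ).train()
    return model
```

The key design point is that the same `model` object passes through both stages, so the supervised stage starts from weights that have already been adapted to the target domain.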

<img src="imgs/adaptivepretrain.png" alt="Hugging Face" height=400 width=500>