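# Building Text Embedding Models

Embedding: the process of converting text data into a numerical representation.

Embedding model: a language model that converts its input into embeddings.

Goal: represent text data as embedding vectors as accurately as possible. What the representation should capture (semantics, sentiment, and so on) depends on the purpose.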

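## Contrastive learning

One of the main techniques for training and fine-tuning text embedding models.

Basic idea: feed the model pairs of similar and dissimilar documents as examples. This mirrors contrastive explanation, which understands "why P" by asking "why P and not Q".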
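
As a minimal sketch of this idea, the snippet below compares an anchor with one similar and one dissimilar text using cosine similarity and a margin-based objective. The three-dimensional vectors, the example sentences in the comments, and the margin of 0.5 are all invented for illustration; a real model would produce the embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical embeddings, invented for illustration only
anchor = [0.9, 0.1, 0.0]     # e.g. "The cat sits on the mat."
positive = [0.8, 0.2, 0.1]   # e.g. "A cat is resting on a rug."
negative = [0.0, 0.2, 0.9]   # e.g. "Stock prices fell sharply."

sim_pos = cosine_similarity(anchor, positive)
sim_neg = cosine_similarity(anchor, negative)

# Margin-based contrastive objective: the loss reaches zero once the
# positive is more similar to the anchor than the negative by `margin`.
margin = 0.5
loss = max(0.0, margin - (sim_pos - sim_neg))
```

Training pushes `sim_pos` up and `sim_neg` down until the loss vanishes, which is the "P rather than Q" comparison expressed as a number.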

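## SBERT

SBERT (Sentence-BERT) is a bi-encoder architecture.

It is the Siamese architecture that sentence-transformers uses for training.

Training procedure:
1. The two texts are fed separately into two identical BERT models that share weights
2. Mean pooling over the output layer produces an embedding for each text
3. The two embeddings are concatenated together with their difference vector
4. A softmax classifier on top of the concatenated vector is used to optimize the embeddings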
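
The four steps above can be sketched with NumPy. This is an illustrative sketch, not the sentence-transformers implementation: random arrays stand in for the BERT token outputs, and the shapes, masks, labels, and classifier weights are assumptions made up for the example. The concatenation follows the (u, v, |u - v|) layout.

```python
import numpy as np

rng = np.random.default_rng(0)


def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, ignoring padding positions (mask == 0)."""
    mask = attention_mask[:, :, None].astype(float)    # (batch, tokens, 1)
    summed = (token_embeddings * mask).sum(axis=1)      # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)      # (batch, 1)
    return summed / counts


def softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


batch, tokens, dim, num_labels = 2, 4, 8, 3

# Step 1: stand-ins for the outputs of the shared-weight encoder
tokens_a = rng.normal(size=(batch, tokens, dim))
tokens_b = rng.normal(size=(batch, tokens, dim))
mask_a = np.array([[1, 1, 1, 0], [1, 1, 0, 0]])
mask_b = np.array([[1, 1, 1, 1], [1, 0, 0, 0]])

# Step 2: mean pooling gives one embedding per sentence
u = mean_pool(tokens_a, mask_a)
v = mean_pool(tokens_b, mask_b)

# Step 3: concatenate both embeddings with their element-wise difference
features = np.concatenate([u, v, np.abs(u - v)], axis=1)  # (batch, 3 * dim)

# Step 4: softmax classifier and cross-entropy over the NLI labels
classifier_weights = rng.normal(size=(3 * dim, num_labels))
probs = softmax(features @ classifier_weights)
labels = np.array([0, 2])
loss = -np.log(probs[np.arange(batch), labels]).mean()
```

In real training the gradient of `loss` flows back through the classifier into the shared encoder, which is how the classification task shapes the embeddings.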

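## Building an embedding model

### Generating contrastive examples

NLI (natural language inference) datasets: each premise/hypothesis pair is labeled as entailment, neutral, or contradiction, so entailment pairs can serve as positive examples and contradiction pairs as negative examples.

The GLUE benchmark is used to evaluate and analyze model performance; its MNLI subset supplies the training pairs here.

### Training

1. Choose a BERT base model
2. Define the loss function

#### Loss functions

Softmax loss is no longer recommended.

Alternatives:

* Cosine similarity loss: computes the cosine similarity between the embeddings of two texts and compares it with the labeled similarity score
* Multiple negatives ranking (MNR) loss: also known as InfoNCE or NTXentLoss. It uses positive sentence pairs, or triplets made of a positive pair plus an unrelated sentence (the negative), and minimizes the distance between related texts while maximizing the distance between unrelated texts
  * Problem: hard negatives (negatives that are related to the question but are not the correct answer) are difficult to obtain
  * Steps for collecting negatives:
    1. Easy negatives: random sampling
    2. Semi-hard negatives: use a pretrained embedding model, compute cosine similarity across sentences, and pick highly related ones
    3. Hard negatives: label manually, or use a generative model to judge or generate them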
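
The two alternative losses can be written out in a few lines of NumPy. This is a sketch under stated assumptions rather than the library code: the cosine similarity loss is taken to be a mean squared error against the labeled score, the MNR loss treats every other positive in the batch as a negative, and the `scale` of 20 and the toy embeddings are illustrative choices.

```python
import numpy as np


def normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def cosine_similarity_loss(emb_a, emb_b, target_scores):
    """Mean squared error between pairwise cosine similarity and the labeled score."""
    cos = (normalize(emb_a) * normalize(emb_b)).sum(axis=1)
    return float(np.mean((cos - target_scores) ** 2))


def multiple_negatives_ranking_loss(anchors, positives, scale=20.0):
    """Cross-entropy over in-batch similarities: row i should select column i."""
    sims = scale * normalize(anchors) @ normalize(positives).T   # (batch, batch)
    sims = sims - sims.max(axis=1, keepdims=True)                # numerical stability
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))


# Toy embeddings, invented for illustration
anchors = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
aligned = np.array([[0.9, 0.1], [0.1, 0.9], [-0.9, 0.1]])   # positives close to their anchors
shuffled = aligned[[1, 2, 0]]                                # positives paired with the wrong anchors

loss_aligned = multiple_negatives_ranking_loss(anchors, aligned)
loss_shuffled = multiple_negatives_ranking_loss(anchors, shuffled)
perfect = cosine_similarity_loss(anchors, anchors, np.ones(3))
```

Because each batch row competes against all other rows, larger batches give the MNR loss more negatives to contrast against, and harder negatives make that contrast more informative.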

### Evaluation

Massive Text Embedding Benchmark (MTEB)

In [1]:
from datasets import load_dataset

train_dataset = load_dataset('glue', 'mnli', split='train').select(range(50_000))
train_dataset = train_dataset.remove_columns('idx')

train_dataset[0]

{'premise': 'Conceptually cream skimming has two basic dimensions - product and geography.',
 'hypothesis': 'Product and geography are what make cream skimming work. ',
 'label': 1}

In [2]:
from sentence_transformers import SentenceTransformer, losses
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

# base model
embedding_model = SentenceTransformer('bert-base-uncased')
# criterion / loss function
train_loss = losses.SoftmaxLoss(
    model=embedding_model,
    sentence_embedding_dimension=embedding_model.get_sentence_embedding_dimension(),
    num_labels=3,
)
# evaluation
val_sts = load_dataset('glue', 'stsb', split='validation')
evaluator = EmbeddingSimilarityEvaluator(
    sentences1=val_sts['sentence1'],
    sentences2=val_sts['sentence2'],
    scores=[score/5 for score in val_sts['label']],
    main_similarity='cosine'
)


No sentence-transformers model found with name bert-base-uncased. Creating a new one with mean pooling.


In [None]:
from sentence_transformers.training_args import SentenceTransformerTrainingArguments
from sentence_transformers.trainer import SentenceTransformerTrainer
import torch

# Disable MPS: fp16 mixed precision is not supported on the MPS backend
# torch.cuda.is_available = lambda: False
torch.backends.mps.is_available = lambda: False

args = SentenceTransformerTrainingArguments(
    output_dir='models/base_embedding_model',
    num_train_epochs=1,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    warmup_steps=100,
    fp16=True,
    eval_steps=100,
    logging_steps=100,
)

trainer = SentenceTransformerTrainer(
    model=embedding_model,
    args=args,
    train_dataset=train_dataset,
    evaluator=evaluator,
    loss=train_loss,
)

trainer.train()

Column 'hypothesis' is at index 1, whereas a column with this name is usually expected at index 0. Note that the column order can be important for some losses, e.g. MultipleNegativesRankingLoss will always consider the first column as the anchor and the second as the positive, regardless of the dataset column names. Consider renaming the columns to match the expected order, e.g.:
dataset = dataset.select_columns(['hypothesis', 'entailment', 'contradiction'])


Step,Training Loss
100,1.0697
200,0.9431
300,0.8856
400,0.846
500,0.8256
600,0.8317
700,0.8087
800,0.7898
900,0.7808
1000,0.7734


TrainOutput(global_step=1563, training_loss=0.8150056020159486, metrics={'train_runtime': 175.745, 'train_samples_per_second': 284.503, 'train_steps_per_second': 8.894, 'total_flos': 0.0, 'train_loss': 0.8150056020159486, 'epoch': 1.0})

In [4]:
evaluator(embedding_model)

{'pearson_cosine': 0.5208034264503338, 'spearman_cosine': 0.5903850622353031}
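
The two numbers in this output are correlations between the model's cosine similarities and the gold similarity scores. The sketch below recomputes both metrics by hand on made-up data; the embeddings and gold scores are hypothetical, and the rank helper assumes there are no ties to keep the example short.

```python
import numpy as np


def cosine_scores(emb_a, emb_b):
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    return (a * b).sum(axis=1)


def pearson(x, y):
    """Linear correlation between two score lists."""
    return float(np.corrcoef(x, y)[0, 1])


def ranks(values):
    """Rank positions 0..n-1 (assumes no ties)."""
    order = np.argsort(values)
    result = np.empty(len(values))
    result[order] = np.arange(len(values))
    return result


def spearman(x, y):
    """Rank correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical embeddings for four sentence pairs and gold scores in [0, 1]
emb_a = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])
emb_b = np.array([[1.0, 0.1], [1.0, 1.0], [0.0, 1.0], [-1.0, 0.2]])
gold = np.array([0.9, 0.7, 0.2, 0.1])

predicted = cosine_scores(emb_a, emb_b)
metrics = {
    "pearson_cosine": pearson(predicted, gold),
    "spearman_cosine": spearman(predicted, gold),
}
```

Spearman only cares whether the pairs are ordered correctly, while Pearson also rewards a linear relationship between predicted and gold scores, which is why the two values usually differ.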

In [None]:
# MTEB
# from mteb import MTEB

# evaluation = MTEB(tasks=['Banking77Classification'])
# results = evaluation.run(embedding_model)

In [5]:
from datasets import Dataset, load_dataset

train_dataset = load_dataset('glue', 'mnli', split='train').select(range(50_000))
train_dataset = train_dataset.remove_columns('idx')

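# MNLI labels: 0 = entailment, 1 = neutral, 2 = contradiction.
# Map entailment to similarity 1 and everything else to 0.
mapping = {2: 0, 1: 0, 0: 1}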
train_dataset = Dataset.from_dict({
    'sentence1': train_dataset['premise'],
    'sentence2': train_dataset['hypothesis'],
    'label': [float(mapping[label]) for label in train_dataset['label']]
})

train_dataset[0]


{'sentence1': 'Conceptually cream skimming has two basic dimensions - product and geography.',
 'sentence2': 'Product and geography are what make cream skimming work. ',
 'label': 0.0}

In [None]:
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

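# Evaluate on the held-out MNLI validation split, with labels mapped the same way as the training data
val_mnli = load_dataset('glue', 'mnli', split='validation_matched')
evaluator = EmbeddingSimilarityEvaluator(
    sentences1=val_mnli['premise'],
    sentences2=val_mnli['hypothesis'],
    scores=[float(mapping[label]) for label in val_mnli['label']],
    main_similarity='cosine'
)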


In [None]:

embedding_model = SentenceTransformer('bert-base-uncased')
train_loss = losses.CosineSimilarityLoss(model=embedding_model)

args = SentenceTransformerTrainingArguments(
    output_dir='models/cosineloss_embedding_model',
    num_train_epochs=1,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    warmup_steps=100,
    fp16=True,
    eval_steps=100,
    logging_steps=100,
)

trainer = SentenceTransformerTrainer(
    model=embedding_model,
    args=args,
    train_dataset=train_dataset,
    evaluator=evaluator,
    loss=train_loss,
)
trainer.train()
evaluator(embedding_model)

No sentence-transformers model found with name bert-base-uncased. Creating a new one with mean pooling.


AttributeError: 'NoneType' object has no attribute 'float'

In [14]:
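# Multiple negatives ranking loss
import random
from tqdm import tqdm


mnli = load_dataset('glue', 'mnli', split='train').select(range(50_000))
mnli = mnli.remove_columns('idx')
# Keep only entailment pairs (label 0). filter() returns a new dataset, so the result must be assigned.
# The outputs recorded below come from a run where it was not assigned, so they reflect all 50,000 rows.
mnli = mnli.filter(lambda x: x['label'] == 0)

train_dataset = {'anchor': [], 'positive': [], 'negative': []}
# Easy ("soft") negatives: hypotheses shuffled so that each is paired with a random premise
soft_negatives = list(mnli['hypothesis'])
random.shuffle(soft_negatives)

for row, soft_negative in tqdm(zip(mnli, soft_negatives), total=len(mnli)):
    train_dataset['anchor'].append(row['premise'])
    train_dataset['positive'].append(row['hypothesis'])
    train_dataset['negative'].append(soft_negative)

train_dataset = Dataset.from_dict(train_dataset)
train_dataset[0]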



50000it [00:01, 43380.97it/s]


{'anchor': 'Conceptually cream skimming has two basic dimensions - product and geography.',
 'positive': 'Product and geography are what make cream skimming work. ',
 'negative': 'There were staff members who resented the new outcome-oriented approach of the office.'}

In [None]:
val_sts = load_dataset('glue', 'stsb', split='validation')
evaluator = EmbeddingSimilarityEvaluator(
    sentences1=val_sts['sentence1'],
    sentences2=val_sts['sentence2'],
    scores=[score/5 for score in val_sts['label']],
    main_similarity='cosine'
)

embedding_model = SentenceTransformer('bert-base-uncased')

# Multiple negatives ranking loss
train_loss = losses.MultipleNegativesRankingLoss(model=embedding_model)

args = SentenceTransformerTrainingArguments(
    output_dir='models/multinegatives_embedding_model',
    num_train_epochs=1,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    warmup_steps=100,
    fp16=True,
    eval_steps=100,
    logging_steps=100,
)

trainer = SentenceTransformerTrainer(
    model=embedding_model,
    args=args,
    train_dataset=train_dataset,
    evaluator=evaluator,
    loss=train_loss,
)

trainer.train()
evaluator(embedding_model)


No sentence-transformers model found with name bert-base-uncased. Creating a new one with mean pooling.


Step,Training Loss
100,0.6601
200,0.2426
300,0.2491
400,0.2209
500,0.2073
600,0.203
700,0.2122
800,0.1856
900,0.1754
1000,0.1845


{'pearson_cosine': 0.7492317880271468, 'spearman_cosine': 0.7590657106720886}

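## Fine-tuning

### Supervised learning

Repeat the earlier training procedure, this time starting from a pretrained embedding model instead of a plain BERT checkpoint.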

In [2]:
from datasets import load_dataset
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
from sentence_transformers import losses, SentenceTransformer
from sentence_transformers.trainer import SentenceTransformerTrainer
from sentence_transformers.training_args import SentenceTransformerTrainingArguments


train_dataset = load_dataset('glue', 'mnli', split='train').select(range(50_000))
train_dataset = train_dataset.remove_columns('idx')

val_sts = load_dataset('glue', 'stsb', split='validation')
evaluator = EmbeddingSimilarityEvaluator(
    sentences1=val_sts['sentence1'],
    sentences2=val_sts['sentence2'],
    scores=[score/5 for score in val_sts['label']],
    main_similarity='cosine',
)

embedding_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
train_loss = losses.MultipleNegativesRankingLoss(embedding_model)
args = SentenceTransformerTrainingArguments(
    output_dir='models/finetuned_ninilm-l6-v2',
    num_train_epochs=1,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    warmup_steps=100,
    fp16=True,
    eval_steps=100,
    logging_steps=100,
)
trainer = SentenceTransformerTrainer(
    model=embedding_model,
    args=args,
    train_dataset=train_dataset,
    loss=train_loss,
    evaluator=evaluator,
)

trainer.train()
evaluator(embedding_model)

***** Running training *****
  Num examples = 50,000
  Num Epochs = 1
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 1,563
  Number of trainable parameters = 22,713,216
Column 'hypothesis' is at index 1, whereas a column with this name is usually expected at index 0. Note that the column order can be important for some losses, e.g. MultipleNegativesRankingLoss will always consider the first column as the anchor and the second as the positive, regardless of the dataset column names. Consider renaming the columns to match the expected order, e.g.:
dataset = dataset.select_columns(['hypothesis', 'entailment', 'contradiction'])


Step,Training Loss
100,0.1564
200,0.1113
300,0.118
400,0.1144
500,0.1121
600,0.0978
700,0.1163
800,0.1012
900,0.1072
1000,0.108


{'pearson_cosine': 0.8479709577792636, 'spearman_cosine': 0.8480548361413733}

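#### Augmented SBERT

Fine-tuning when only a small amount of labeled data is available.

The idea is to augment the small labeled dataset so that it becomes large enough for regular training.

Steps:
1. Fine-tune a cross-encoder (BERT) on the small labeled dataset (the gold dataset)
2. Create new sentence pairs
3. Label the new sentence pairs with the fine-tuned cross-encoder (the silver dataset)
4. Train SBERT on the extended dataset (gold + silver)

Notes:
* Gold dataset: small but fully annotated, and its labels are the ground truth
* Silver dataset: fully annotated, but the labels are not necessarily the ground truth because they are generated by BERT predictions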
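
Step 2 can be as simple as randomly recombining sentences that already appear in the gold data. The sketch below shows that one strategy; `sample_new_pairs` is a hypothetical helper, the example sentences are invented, and other strategies (for instance retrieving similar sentences with an existing model) are equally valid.

```python
import random


def sample_new_pairs(sentences, existing_pairs, num_pairs, seed=0):
    """Randomly recombine sentences into pairs that are not already labeled."""
    rng = random.Random(seed)
    seen = set(existing_pairs)
    new_pairs = []
    attempts = 0
    max_attempts = num_pairs * 100   # guard against looping forever on tiny inputs
    while len(new_pairs) < num_pairs and attempts < max_attempts:
        attempts += 1
        first, second = rng.sample(sentences, 2)   # two distinct sentences
        if (first, second) in seen:
            continue
        seen.add((first, second))
        new_pairs.append((first, second))
    return new_pairs


# Invented gold pairs, for illustration only
gold_pairs = [
    ("A man is playing guitar.", "A person plays an instrument."),
    ("A dog runs in the park.", "An animal is outside."),
]
sentences = sorted({s for pair in gold_pairs for s in pair})
candidate_pairs = sample_new_pairs(sentences, gold_pairs, num_pairs=4)
```

The resulting `candidate_pairs` are unlabeled; in step 3 the fine-tuned cross-encoder would score them to produce the silver labels.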

In [1]:
import pandas as pd
from tqdm import tqdm
from datasets import load_dataset, Dataset
from sentence_transformers import InputExample
from sentence_transformers.datasets import NoDuplicatesDataLoader


dataset = load_dataset('glue', 'mnli', split='train').select(range(10_000))
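# MNLI labels: 0 = entailment, 1 = neutral, 2 = contradiction.
# Map entailment to 1 and everything else to 0.
mapping = {2: 0, 1: 0, 0: 1}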

gold_examples = [
    InputExample(
        texts=[row['premise'], row['hypothesis']], label=mapping[row['label']]
    ) for row in tqdm(dataset)
]
gold_dataloader = NoDuplicatesDataLoader(gold_examples, batch_size=32)
gold = pd.DataFrame({
    'sentences1': dataset['premise'],
    'sentences2': dataset['hypothesis'],
    'label': [mapping[label] for label in dataset['label']]
})

100%|██████████| 10000/10000 [00:00<00:00, 33854.79it/s]


In [2]:
from sentence_transformers.cross_encoder import CrossEncoder


cross_encoder = CrossEncoder('bert-base-uncased', num_labels=2)
cross_encoder.fit(
    train_dataloader=gold_dataloader,
    epochs=1,
    show_progress_bar=True,
    warmup_steps=100,
    use_amp=False,
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



In [3]:
import numpy as np


silver = load_dataset('glue', 'mnli', split='train').select(range(10_000, 50_000))
pairs = list(zip(silver['premise'], silver['hypothesis']))

output = cross_encoder.predict(pairs, apply_softmax=True, show_progress_bar=True)
silver = pd.DataFrame({
    'sentences1': silver['premise'],
    'sentences2': silver['hypothesis'],
    'label': np.argmax(output, axis=1)
})

In [4]:
import pandas as pd
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
from sentence_transformers import losses, SentenceTransformer
from sentence_transformers.trainer import SentenceTransformerTrainer
from sentence_transformers.training_args import SentenceTransformerTrainingArguments


data = pd.concat([gold, silver], ignore_index=True, axis=0)
# drop_duplicates returns a new DataFrame, so assign the result
data = data.drop_duplicates(subset=['sentences1', 'sentences2'], keep='first')
train_dataset = Dataset.from_pandas(data, preserve_index=False)

val_sts = load_dataset('glue', 'stsb', split='validation')
evaluator = EmbeddingSimilarityEvaluator(
    sentences1=val_sts['sentence1'],
    sentences2=val_sts['sentence2'],
    scores=[score/5 for score in val_sts['label']],
    main_similarity='cosine',
)

embedding_model = SentenceTransformer('bert-base-uncased')
train_loss = losses.CosineSimilarityLoss(model=embedding_model)

args = SentenceTransformerTrainingArguments(
    output_dir='models/augmented_bert-base-uncased',
    num_train_epochs=1,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    warmup_steps=100,
    fp16=True,
    eval_steps=100,
)
trainer = SentenceTransformerTrainer(
    model=embedding_model,
    args=args,
    train_dataset=train_dataset,
    loss=train_loss,
    evaluator=evaluator,
)

trainer.train()
evaluator(embedding_model)

No sentence-transformers model found with name bert-base-uncased. Creating a new one with mean pooling.


Step,Training Loss
500,0.1596
1000,0.135
1500,0.127


{'pearson_cosine': 0.7097646012060139, 'spearman_cosine': 0.7206616699341748}

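### Unsupervised learning

Training that requires no predefined labeled data.

Techniques:

* SimCSE (Simple Contrastive Learning of Sentence Embeddings)
* CT (Contrastive Tension)
* TSDAE (Transformer-based Sequential Denoising Auto-Encoder)
* GPL (Generative Pseudo-Labeling)


#### TSDAE

TSDAE assumes there is no labeled data at all and does not require creating labels manually.

Basic idea: add noise to a sentence by deleting a certain percentage of its words. The noisy sentence is passed through the encoder and a pooling layer to produce an embedding. The decoder then takes that embedding and tries to reconstruct the original sentence, without the artificially added noise.

Core intuition: the more accurate the embedding, the more accurate the reconstructed sentence.

This is similar to masked language modeling, except that here the model tries to reconstruct the entire sentence.

The encoder is the embedding model being trained; the decoder is only used to check whether the sentence can be recovered from the embedding.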
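
The noise step can be sketched in plain Python. This is a simplified illustration, not the library routine: it splits on whitespace instead of using a proper tokenizer, `delete_words` is a hypothetical helper, and the deletion ratio of 0.6 is an assumed value chosen for the example.

```python
import random


def delete_words(sentence, deletion_ratio=0.6, seed=None):
    """Return a noisy copy of `sentence` with roughly `deletion_ratio` of its words removed."""
    rng = random.Random(seed)
    words = sentence.split()
    if not words:
        return sentence
    # Keep each word with probability 1 - deletion_ratio
    kept = [word for word in words if rng.random() > deletion_ratio]
    if not kept:
        # Never return an empty input: fall back to one random word
        kept = [rng.choice(words)]
    return " ".join(kept)


original = "The island's villas are still occupied and host large vineyards."
damaged = delete_words(original, seed=0)

# One training example: the encoder sees the damaged text,
# and the decoder is asked to reproduce the original.
training_pair = {"damaged_sentence": damaged, "original_sentence": original}
```

Word order is preserved in the damaged sentence, so the embedding has to carry enough meaning for the decoder to fill in what was removed.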

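Key classes:

* sentence_transformers.datasets.DenoisingAutoEncoderDataset
* sentence_transformers.losses.DenoisingAutoEncoderLoss

Domain adaptation: updating an existing embedding model for a specific text domain whose topics differ from those of the source domain. Steps:
1. Pre-train on a domain-specific corpus with an unsupervised technique
2. Fine-tune the model on an in-domain or out-of-domain training dataset

For example:
1. Train the model in-domain with TSDAE
2. Follow with regular supervised training or SBERT fine-tuning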

In [6]:
%pip install nltk


In [3]:
import nltk


nltk.download('punkt_tab')


True

In [4]:
from tqdm import tqdm
from datasets import load_dataset, Dataset
from sentence_transformers.datasets import DenoisingAutoEncoderDataset


mnli = load_dataset('glue', 'mnli', split='train').select(range(25_000))
# Convert both columns to plain lists before concatenating them
flat_sentences = list(mnli['premise']) + list(mnli['hypothesis'])

damaged_data = DenoisingAutoEncoderDataset(list(set(flat_sentences)))
train_dataset = {'damaged_sentence': [], 'original_sentence': []}
for data in tqdm(damaged_data):
    train_dataset['damaged_sentence'].append(data.texts[0])
    train_dataset['original_sentence'].append(data.texts[1])

train_dataset = Dataset.from_dict(train_dataset)
train_dataset[0]

100%|██████████| 48353/48353 [00:04<00:00, 9843.46it/s] 


{'damaged_sentence': "The's villas.",
 'original_sentence': "The island's villas are still occupied and host large vineyards."}

In [5]:
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
from sentence_transformers import SentenceTransformer, models
from sentence_transformers.losses import DenoisingAutoEncoderLoss
from sentence_transformers.trainer import SentenceTransformerTrainer
from sentence_transformers.training_args import SentenceTransformerTrainingArguments


val_sts = load_dataset('glue', 'stsb', split='validation')
evaluator = EmbeddingSimilarityEvaluator(
    sentences1=val_sts['sentence1'],
    sentences2=val_sts['sentence2'],
    scores=[score/5 for score in val_sts['label']],
    main_similarity='cosine',
)

word_embedding_model = models.Transformer('bert-base-uncased')
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(), 'cls')
embedding_model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

train_loss = DenoisingAutoEncoderLoss(embedding_model, tie_encoder_decoder=True)
# Move the decoder to the same device as the encoder instead of hard-coding 'cuda'
train_loss.decoder = train_loss.decoder.to(embedding_model.device)

args = SentenceTransformerTrainingArguments(
    output_dir='models/tsdae_bert-base-uncased',
    num_train_epochs=1,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    warmup_steps=100,
    fp16=True,
    eval_steps=100,
    logging_steps=100,
)
trainer = SentenceTransformerTrainer(
    model=embedding_model,
    args=args,
    train_dataset=train_dataset,
    loss=train_loss,
    evaluator=evaluator,
)
trainer.train()
evaluator(embedding_model)


Step,Training Loss
100,6.7124
200,4.7303
300,4.3985
400,4.2414
500,4.1551
600,4.0589
700,3.9933
800,3.9102
900,3.8577
1000,3.8045


{'pearson_cosine': 0.7456462312597696, 'spearman_cosine': 0.7505943187048942}