# 构建文本嵌入模型

嵌入：文本数据转为数值表示的过程

嵌入模型：对输入进行嵌入的 LLM 模型

目的：尽可能准确的将文本数据表示为嵌入向量（表示包含语义、情感等基于不同目的）

## 对比学习

训练和微调文本嵌入的一种主要技术

基本理念：向模型输入相似和不相似的文档对作为示例。 对比解释是通过“为什么是 P 而不是 Q”来理解“为什么是 P”

## SBERT

bi-encoder 或 sentences-BERT 双编码器架构

是 sentence-transformers 使用的训练的一种孪生架构

训练过程：
1. 文本分别输入两个完全相同共享权重的 BERT 模型
2. 输出层做平均池化生成嵌入向量
3. 文本的嵌入和之间的差向量拼接
4. 使用softmax 分类器对嵌入向量进行优化

## 构建嵌入模型

### 生成对比样本

NLI（自然语言推理）数据集

GLUE 基准评估和分析模型性能

### 训练

1. 选择 BERT 基座模型
2. 定义损失函数

#### 损失函数

目前不建议使用softmax

其他：

* 余弦相似度损失函数：计算两段文本的两个嵌入的余弦相似度，与标注相似度分数比较
* 多负例排序损失函数：InfoNCE 或 NTXentLoss，使用正例句子对或包含一对正例句子和一个不相关句子（负例）的三元组；最小化相关文本对距离，最大化不相关文本对距离
  * 问题：难负例（和问题相关但不正确的负例）获取困难
  * 搜集负例步骤：
    1. 获取简单负例，随机采样
    2. 获取半难负例：使用预训练的嵌入模型，对句子应用余弦相似度，找到高度相关的句子
    3. 获取难负例：手动标注或生成模型判断或生成

### 评估

大规模文本嵌入基准（Massive Text Embedding Benchmark, MTEB）

In [1]:
from datasets import load_dataset

train_dataset = load_dataset('glue', 'mnli', split='train').select(range(50_000))
train_dataset = train_dataset.remove_columns('idx')

train_dataset[0]

{'premise': 'Conceptually cream skimming has two basic dimensions - product and geography.',
 'hypothesis': 'Product and geography are what make cream skimming work. ',
 'label': 1}

In [2]:
from sentence_transformers import SentenceTransformer, losses
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

# base model
embedding_model = SentenceTransformer('bert-base-uncased')
# critetion / loss function
train_loss = losses.SoftmaxLoss(
    model=embedding_model,
    sentence_embedding_dimension=embedding_model.get_sentence_embedding_dimension(),
    num_labels=3,
)
# evaluation
val_sts = load_dataset('glue', 'stsb', split='validation')
evaluator = EmbeddingSimilarityEvaluator(
    sentences1=val_sts['sentence1'],
    sentences2=val_sts['sentence2'],
    scores=[score/5 for score in val_sts['label']],
    main_similarity='cosine'
)


No sentence-transformers model found with name bert-base-uncased. Creating a new one with mean pooling.


In [None]:
from sentence_transformers.training_args import SentenceTransformerTrainingArguments
from sentence_transformers.trainer import SentenceTransformerTrainer
import torch

# 禁用mps， fp16 加速不能使用 mps
# torch.cuda.is_available = lambda: False
torch.backends.mps.is_available = lambda: False

args = SentenceTransformerTrainingArguments(
    output_dir='models/base_embedding_model',
    num_train_epochs=1,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    warmup_steps=100,
    fp16=True,
    eval_steps=100,
    logging_steps=100,
)

trainer = SentenceTransformerTrainer(
    model=embedding_model,
    args=args,
    train_dataset=train_dataset,
    evaluator=evaluator,
    loss=train_loss,
)

trainer.train()

Computing widget examples:   0%|          | 0/1 [00:00<?, ?example/s]

Column 'hypothesis' is at index 1, whereas a column with this name is usually expected at index 0. Note that the column order can be important for some losses, e.g. MultipleNegativesRankingLoss will always consider the first column as the anchor and the second as the positive, regardless of the dataset column names. Consider renaming the columns to match the expected order, e.g.:
dataset = dataset.select_columns(['hypothesis', 'entailment', 'contradiction'])


Step,Training Loss
100,1.0697
200,0.9431
300,0.8856
400,0.846
500,0.8256
600,0.8317
700,0.8087
800,0.7898
900,0.7808
1000,0.7734


TrainOutput(global_step=1563, training_loss=0.8150056020159486, metrics={'train_runtime': 175.745, 'train_samples_per_second': 284.503, 'train_steps_per_second': 8.894, 'total_flos': 0.0, 'train_loss': 0.8150056020159486, 'epoch': 1.0})

In [4]:
evaluator(embedding_model)

{'pearson_cosine': 0.5208034264503338, 'spearman_cosine': 0.5903850622353031}

In [None]:
# MTEB
# from mteb import MTEB

# evaluation = MTEB(tasks=['Banking77Classification'])
# results = evaluation.run(embedding_model)

In [5]:
from datasets import Dataset, load_dataset

train_dataset = load_dataset('glue', 'mnli', split='train').select(range(50_000))
train_dataset = train_dataset.remove_columns('idx')

mapping = {2: 0, 1: 0, 0: 1}
train_dataset = Dataset.from_dict({
    'sentence1': train_dataset['premise'],
    'sentence2': train_dataset['hypothesis'],
    'label': [float(mapping[label]) for label in train_dataset['label']]
})

train_dataset[0]


{'sentence1': 'Conceptually cream skimming has two basic dimensions - product and geography.',
 'sentence2': 'Product and geography are what make cream skimming work. ',
 'label': 0.0}

In [None]:
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

val_sts = load_dataset('glue', 'mnli', split='validation_matched')
evaluator = EmbeddingSimilarityEvaluator(
    sentences1=train_dataset['sentence1'],
    sentences2=train_dataset['sentence2'],
    scores=train_dataset['label'],
    main_similarity='cosine'
)

Using the latest cached version of the dataset since glue couldn't be found on the Hugging Face Hub
Found the latest cached dataset configuration 'mnli' at C:\Users\Dita\.cache\huggingface\datasets\glue\mnli\0.0.0\bcdcba79d07bc864c1c254ccfcedcce55bcc9a8c (last modified on Thu Jun 12 10:24:35 2025).


In [None]:

embedding_model = SentenceTransformer('bert-base-uncased')
train_loss = losses.CosineSimilarityLoss(model=embedding_model)

args = SentenceTransformerTrainingArguments(
    output_dir='models/cosineloss_embedding_model',
    num_train_epochs=1,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    warmup_steps=100,
    fp16=True,
    eval_steps=100,
    logging_steps=100,
)

trainer = SentenceTransformerTrainer(   
    model=embedding_model,  
    args=args,
    train_dataset=train_dataset,
    evaluator=evaluator,
    loss=train_loss,
)
trainer.train()
evaluator(embedding_model)

No sentence-transformers model found with name bert-base-uncased. Creating a new one with mean pooling.


Computing widget examples:   0%|          | 0/1 [00:00<?, ?example/s]

AttributeError: 'NoneType' object has no attribute 'float'

In [14]:
# 多负例排序损失函数
import random
from tqdm import tqdm


mnli = load_dataset('glue', 'mnli', split='train').select(range(50_000))
mnli = mnli.remove_columns('idx')
mnli.filter(lambda x: True if x['label'] == 0 else False)

train_dataset = {'anchor': [], 'positive': [], 'negative': []}
soft_negatives = mnli['hypothesis']
random.shuffle(soft_negatives)

for row, soft_negative in tqdm(zip(mnli, soft_negatives)):
    train_dataset['anchor'].append(row['premise'])
    train_dataset['positive'].append(row['hypothesis'])
    train_dataset['negative'].append(soft_negative)

train_dataset = Dataset.from_dict(train_dataset)
train_dataset[0]



Filter:   0%|          | 0/50000 [00:00<?, ? examples/s]

50000it [00:01, 43380.97it/s]


{'anchor': 'Conceptually cream skimming has two basic dimensions - product and geography.',
 'positive': 'Product and geography are what make cream skimming work. ',
 'negative': 'There were staff members who resented the new outcome-oriented approach of the office.'}

In [None]:
val_sts = load_dataset('glue', 'stsb', split='validation')
evaluator = EmbeddingSimilarityEvaluator(
    sentences1=val_sts['sentence1'],
    sentences2=val_sts['sentence2'],
    scores=[score/5 for score in val_sts['label']],
    main_similarity='cosine'
)

embedding_model = SentenceTransformer('bert-base-uncased')

# 多负例排序损失函数
train_loss = losses.MultipleNegativesRankingLoss(model=embedding_model)

args = SentenceTransformerTrainingArguments(
    output_dir='models/multinegatives_embedding_model',
    num_train_epochs=1,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    warmup_steps=100,
    fp16=True,
    eval_steps=100,
    logging_steps=100,
)

trainer = SentenceTransformerTrainer(
    model=embedding_model,
    args=args,
    train_dataset=train_dataset,
    evaluator=evaluator,
    loss=train_loss,
)

trainer.train()
evaluator(embedding_model)


No sentence-transformers model found with name bert-base-uncased. Creating a new one with mean pooling.


Computing widget examples:   0%|          | 0/1 [00:00<?, ?example/s]

Step,Training Loss
100,0.6601
200,0.2426
300,0.2491
400,0.2209
500,0.2073
600,0.203
700,0.2122
800,0.1856
900,0.1754
1000,0.1845


{'pearson_cosine': 0.7492317880271468, 'spearman_cosine': 0.7590657106720886}

## 微调