# Fine-tuning a masked language model for **domain adaptation**

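In the context of large language models (LLMs), domain adaptation means adjusting or optimizing a general-purpose pretrained language model (such as GPT, LLaMA, or BERT) so that it performs better in a specific domain (such as medicine, law, finance, or chip design). The goal is for the model to better understand, generate, or process the terminology, knowledge structures, and task requirements of that domain, which improves its accuracy, relevance, and usefulness there.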

In [2]:
!pip install datasets evaluate transformers[sentencepiece]
!pip install --upgrade --quiet datasets



In [3]:
from pprint import pprint

## 1 Choosing a pretrained model for masked language modeling

### 1.1 Loading the pretrained `distilbert` model and its tokenizer

In [4]:
from transformers import AutoModelForMaskedLM

ckpt = "distilbert-base-uncased"
model = AutoModelForMaskedLM.from_pretrained(ckpt)

In [5]:
num_params_distilbert = model.num_parameters()
# num_parameters() returns a parameter count, not a size in bytes, so report it in millions
print(f"Number of parameters in DistilBERT: {round(num_params_distilbert / 1_000_000)}M")

Number of parameters in DistilBERT: 67M
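
The parameter count and the memory footprint are different quantities. A rough sketch of the conversion follows, assuming about 67 million parameters and ignoring buffers, optimizer state, and activations. `footprint_mb` is a hypothetical helper for illustration, not part of `transformers`.

```python
def footprint_mb(num_params: int, bytes_per_param: int) -> float:
    """Approximate weight storage in megabytes (1 MB = 10**6 bytes)."""
    return num_params * bytes_per_param / 1_000_000


# Assumed (approximate) parameter count, used only for illustration.
approx_params = 67_000_000

for dtype_name, width in [("fp32", 4), ("fp16", 2)]:
    print(f"{dtype_name}: ~{footprint_mb(approx_params, width):.0f} MB")
```

Under these assumptions the fp32 weights take roughly 268 MB and the fp16 weights roughly half of that.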


In [6]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(ckpt)

### 1.2 Trying mask filling

In [7]:
text = "This is a great [MASK]."
inputs = tokenizer(text, return_tensors="pt")
print("inputs:")
pprint(inputs)
print("")

print("inputs to token:")
print(inputs.tokens())
print("")

outputs = model(**inputs)
logits = outputs.logits
print(f"logits' shape: {logits.shape}")
assert logits.shape[-1] == model.config.vocab_size

inputs:
{'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]]),
 'input_ids': tensor([[ 101, 2023, 2003, 1037, 2307,  103, 1012,  102]])}

inputs to token:
['[CLS]', 'this', 'is', 'a', 'great', '[MASK]', '.', '[SEP]']

logits' shape: torch.Size([1, 8, 30522])


In [19]:
print(tokenizer.decode(inputs["input_ids"][0]))

import numpy as np
logit = logits[0]
output_text = ""
for t in range(logit.shape[0]):
    w = np.argmax(logit[t].detach().numpy())
    # print(tokenizer.decode([w]))
    output_text += " " + tokenizer.decode([w])
print(output_text)

[CLS] this is a great [MASK]. [SEP]
 . this is a great deal . .


In [9]:
# Token id of the "[MASK]" token:
mask_token_id = tokenizer.mask_token_id
print(f"mask_token_id: {mask_token_id}")

mask_token_id: 103


In [10]:
import torch

# Find the index of the mask token in the input:
input_ids = inputs["input_ids"][0]  # our input contains a single sentence
mask_token_index = torch.where(input_ids == mask_token_id)[0].item()
print(f"mask_token_index: {mask_token_index}")

mask_token_index: 5


In [11]:
mask_token_logits = logits[0, mask_token_index, :]
print(f"mask_token_logits' shape: {mask_token_logits.shape}")

mask_token_logits' shape: torch.Size([30522])


In [12]:
top_5_tokens = torch.topk(mask_token_logits, 5, dim=-1).indices.tolist()
# print(top_5_tokens.indices.tolist())
print(top_5_tokens)

[3066, 3112, 6172, 2801, 8658]


In [13]:
for token in top_5_tokens:
    print(text.replace(tokenizer.mask_token, tokenizer.decode([token])))

This is a great deal.
This is a great success.
This is a great adventure.
This is a great idea.
This is a great feat.
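
The top-5 list above ranks raw logits. To attach probabilities to the candidates, the logits can be passed through a softmax first. Here is a minimal pure-Python sketch on made-up logits over a 5-token vocabulary; `top_k_probs` is a hypothetical helper, not a library function.

```python
import math


def top_k_probs(logits, k):
    """Return the k (index, probability) pairs with the highest softmax probability."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in ranked[:k]]


# Toy logits (made-up values), standing in for mask_token_logits.
toy_logits = [2.0, 0.5, 3.0, -1.0, 1.0]
for index, prob in top_k_probs(toy_logits, 3):
    print(index, round(prob, 3))
```

On the real tensor, the same step would typically be `torch.softmax(mask_token_logits, dim=-1)` followed by `torch.topk`.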


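## 2 The IMDb dataset

To demonstrate domain adaptation, we will use the well-known movie review dataset (IMDb for short), a corpus of movie reviews that is commonly used to benchmark sentiment analysis models. By fine-tuning DistilBERT on this corpus, we expect the language model to shift its vocabulary from the factual Wikipedia data it was pretrained on toward the more subjective language of movie reviews.

### 2.1 Loading the dataset

We can fetch the data from the Hugging Face Hub with the `load_dataset()` function from 🤗 Datasets: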

In [14]:
from datasets import load_dataset

imdb_dataset = load_dataset("imdb")
imdb_dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

### 2.2 Inspecting the dataset

In [16]:
def show_samples(samples):
    for sample in samples:
        print(f"\nReview: {sample['text']}")
        print(f"Label: {sample['label']}")

In [17]:
samples = imdb_dataset["train"].shuffle(123).select(range(3))
show_samples(samples)


Review: I won't claim to be a fan of Ralph Bakshi, because i am not. I have only watched 5 of his animated films so far: Coonskin, Wizards, Fritz the Cat and Lord of the Rings and finally "Fire and Ice". What i CAN claim, is that i found "Fire and Ice" to be the most enjoyable of the lot. It is a straightforward fantasy tale of swords and sorcery along the lines of Conan the Barbarian, but the beautiful artwork, realistic animation and lively film score effectively lends a very classic charm to this movie.<br /><br />Deserving first mention, is the animation itself. I do not care what people say about rotoscoping but in my opinion Ralph Bakshi used that technique very effectively here. I was amazed at how realistic the movements of the characters were. The style of directing and the photo-realistic character designs made "Fire and Ice" feel more like a big budget fantasy blockbuster than a cartoon. Sadly the level of art detail tends to get a little inconsistent, especially near the e

In [18]:
samples = imdb_dataset["unsupervised"].shuffle(123).select(range(3))
show_samples(samples)


Review: > Escapes is the textbook example of bad film-making. Whenever you've seen a > movie that you feel was horrible, see this one and realize what true garbage > is. One can only guess that Vincent Price was blackmailed into being > involved in this mess. Two bright spots about this film were that it has no > sequel, and that it has a "Mystery Science Theater" quality about it. To me > the most frightening thing about this movie was that I paid .99 to rent this > dog. >
Label: -1

Review: I spent what seemed like hours (really only 60 minutes) waiting for something to happen in this movie. Although I liked seeing the excellent photos of interesting places that I'd like to visit someday, it seemed this was more a travelogue than a drama. I also thought it was nice to witness a healthy, positive relationship between the mother and daughter. But John Malkowitz seemed a bit affected in both his manner and his role. And what was the reason for the three famous/rich women sitting at the

## 3 Preprocessing the dataset

### 3.1 Using words as labels for the language model

In [22]:
def tokenize_function(samples, tokenizer):
    result = tokenizer(samples["text"])
    if tokenizer.is_fast:
        result["word_ids"] = [result.word_ids(i) for i in range(len(result["input_ids"]))]
    return result

In [23]:
tokenized_dataset = imdb_dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=["text", "label"],
    fn_kwargs={"tokenizer": tokenizer},
)
tokenized_dataset

Token indices sequence length is longer than the specified maximum sequence length for this model (720 > 512). Running this sequence through the model will result in indexing errors


DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 50000
    })
})
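
The `word_ids` column records, for every token, the index of the word it came from, with `None` for special tokens. Here is a minimal sketch of the idea, assuming WordPiece's `##` continuation prefix. `toy_word_ids` is a hypothetical helper that only approximates what the fast tokenizer computes, and the token split shown is illustrative.

```python
def toy_word_ids(tokens):
    """Assign a word index to each token; special tokens get None."""
    word_ids = []
    current = -1
    for tok in tokens:
        if tok in ("[CLS]", "[SEP]"):
            word_ids.append(None)      # special tokens belong to no word
        elif tok.startswith("##"):
            word_ids.append(current)   # continuation of the previous word
        else:
            current += 1               # a new word starts here
            word_ids.append(current)
    return word_ids


# Illustrative split: one word broken into two sub-word tokens.
tokens = ["[CLS]", "sw", "##ede", "thought", "[SEP]"]
print(toy_word_ids(tokens))  # [None, 0, 0, 1, None]
```

Tokens that share a word index are the ones that whole-word masking later masks together.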

### 3.2 Chunking

In [24]:
tokenizer.model_max_length

512

In [25]:
chunk_size = 128

In [37]:
tokenized_samples = tokenized_dataset["train"][:3]

total_length = 0
for idx, sample in enumerate(tokenized_samples["input_ids"]):
    print(f"Review {idx} length: {len(sample)}")
    total_length += len(sample)
print(f"Total reviews length: {total_length}")

Review 0 length: 363
Review 1 length: 304
Review 2 length: 133
Total reviews length: 800


Flatten a nested list:

In [27]:
lst = [
    [1, 4, 5],
    [6, 3, 2]
]
print(sum(lst, []))

[1, 4, 5, 6, 3, 2]
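
`sum(lst, [])` builds a new list at every step, so its cost grows quadratically with the number of sublists. Here is a sketch of a linear-time alternative from the standard library, offered as an option rather than a required change; `flatten` is a name chosen for illustration.

```python
from itertools import chain


def flatten(nested):
    """Concatenate a list of lists into one flat list in a single pass."""
    return list(chain.from_iterable(nested))


lst = [
    [1, 4, 5],
    [6, 3, 2],
]
print(flatten(lst))  # [1, 4, 5, 6, 3, 2]
```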


In [31]:
print(tokenized_samples.keys())

dict_keys(['input_ids', 'attention_mask', 'word_ids'])


In [35]:
concated_samples = {
    k: sum(tokenized_samples[k], []) for k in tokenized_samples.keys()
}
total_length = len(concated_samples["input_ids"])
print(f"Concated review length: {total_length}")

Concated review length: 800


In [41]:
chunks = {
    k: [t[i:i+chunk_size] for i in range(0, total_length, chunk_size)]
    for k, t in concated_samples.items()
}
print(len(chunks["input_ids"]))

7


In [43]:
for chunk in chunks["input_ids"]:
    print(f"Chunk length: {len(chunk)}")

Chunk length: 128
Chunk length: 128
Chunk length: 128
Chunk length: 128
Chunk length: 128
Chunk length: 128
Chunk length: 32


In [51]:
def group_texts(samples):
    # Concatenate all texts
    concated_samples = {k: sum(samples[k], []) for k in samples.keys()}
    # Compute length of concatenated texts
    total_length = len(concated_samples["input_ids"])
    # We drop the last chunk if it's smaller than chunk_size
    total_length = (total_length // chunk_size) * chunk_size
    # Split by chunks of max_len
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concated_samples.items()
    }
    # Create a new labels column, the label is the token itself
    result["labels"] = result["input_ids"].copy()
    return result
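
To see the effect of dropping the last partial chunk, here is a self-contained sketch of the same logic on toy data. `group_texts_toy` and its explicit `chunk_size` argument are assumptions made for illustration; the notebook's `group_texts` reads `chunk_size` from the enclosing scope.

```python
from itertools import chain


def group_texts_toy(samples, chunk_size):
    """Concatenate every column, cut it into fixed-size chunks, and drop the remainder."""
    concatenated = {k: list(chain.from_iterable(v)) for k, v in samples.items()}
    total_length = len(concatenated["input_ids"])
    usable_length = (total_length // chunk_size) * chunk_size
    result = {
        k: [t[i : i + chunk_size] for i in range(0, usable_length, chunk_size)]
        for k, t in concatenated.items()
    }
    # For masked language modeling the labels start out as a copy of the inputs.
    result["labels"] = [chunk[:] for chunk in result["input_ids"]]
    return result


# Three toy "reviews" with 3 + 4 + 3 = 10 tokens in total.
toy = {"input_ids": [[1, 2, 3], [4, 5, 6, 7], [8, 9, 10]]}
grouped = group_texts_toy(toy, chunk_size=4)
print(grouped["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

With the three reviews above (800 tokens, `chunk_size = 128`), the same arithmetic gives 800 // 128 = 6 full chunks, and the trailing 32-token chunk is discarded.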

In [52]:
lm_datasets = tokenized_dataset.map(group_texts, batched=True)
lm_datasets

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 61291
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 59904
    })
    unsupervised: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 122957
    })
})

## 4 Fine-tuning DistilBERT with the Trainer API

### 4.1 Randomly masking individual tokens

In [54]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm_probability=0.15,
)

In [61]:
samples = [lm_datasets["train"][i] for i in range(2)]
for sample in samples:
    sample.pop("word_ids")  # word_ids cannot be converted to a tensor, so drop it first

for chunk in data_collator(samples)["input_ids"]:
    print(f"{tokenizer.decode(chunk)}")

[CLS] i rented [MASK] am curious - yellow from my video store because of all the controversy that surrounded [MASK] when it was first [MASK] in [MASK]. i [MASK] [MASK] [MASK] at first it was seized by u. s. customs if it ever tried to enter [MASK] country, therefore being a [MASK] of films considered " controversial " i really had [MASK] see this for myself [MASK] < br / > < br [MASK] > the plot is centered around a young swedish drama student named lena who wants [MASK] learn everything she can [MASK] life. in [MASK] she wants to focus her attentions to [MASK] some sort of documentary on what the average sw [MASK] thought about certain political issues such
as the vietnam war and race issues in the united states. in between asking politicians and ordinary denizens of stockholm about their opinions [MASK] politics, she has sex with her drama teacher, classmates, and married men. < br / > < br / > what kills me [MASK] [MASK] am curious - yellow is that improving years ago, this was cons
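
The collator selects about 15% of the tokens as prediction targets. A pure-Python sketch of the BERT-style recipe follows: 80% of the selected tokens become `[MASK]`, 10% become a random token, and 10% stay unchanged. To our knowledge this matches the collator's default behavior, but it is not a copy of its implementation; `mask_tokens_toy`, the special-token set, and the `seed` argument are assumptions for illustration.

```python
import random

MASK_ID = 103             # [MASK] id printed earlier in this notebook
SPECIAL_IDS = {101, 102}  # [CLS] and [SEP] ids seen in the tokenized example
IGNORE_INDEX = -100       # label value excluded from the loss


def mask_tokens_toy(input_ids, vocab_size, mlm_probability=0.15, seed=0):
    """Return (masked_input_ids, labels) following the 80/10/10 recipe."""
    rng = random.Random(seed)
    masked = list(input_ids)
    labels = [IGNORE_INDEX] * len(input_ids)
    for i, token_id in enumerate(input_ids):
        if token_id in SPECIAL_IDS or rng.random() >= mlm_probability:
            continue  # not selected: input unchanged, no loss at this position
        labels[i] = token_id  # the model must predict the original token
        roll = rng.random()
        if roll < 0.8:
            masked[i] = MASK_ID                    # replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.randrange(vocab_size)  # replace with a random token
        # otherwise keep the original token in the input
    return masked, labels


# "this is a great idea ." using token ids that appear earlier in the notebook.
ids = [101, 2023, 2003, 1037, 2307, 2801, 1012, 102]
masked, labels = mask_tokens_toy(ids, vocab_size=30522, mlm_probability=0.5)
print(masked)
print(labels)
```

The probability is raised to 0.5 here only so that a sentence this short is likely to show some masked positions.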

### 4.2 Randomly masking whole words

In [62]:
import collections
import numpy as np

from transformers import default_data_collator

wwm_probability = 0.2  # mask each whole word with probability 20%


def whole_word_masking_data_collator(features):
    for feature in features:
        word_ids = feature.pop("word_ids")

        # Create a map between words and corresponding token indices
        mapping = collections.defaultdict(list)
        current_word_index = -1
        current_word = None
        for idx, word_id in enumerate(word_ids):
            if word_id is not None:
                if word_id != current_word:
                    current_word = word_id
                    current_word_index += 1
                mapping[current_word_index].append(idx)

        # Randomly mask words
        mask = np.random.binomial(1, wwm_probability, (len(mapping),))
        input_ids = feature["input_ids"]
        labels = feature["labels"]
        new_labels = [-100] * len(labels)  # -100 marks positions excluded from the loss
        for word_id in np.where(mask)[0]:
            word_id = word_id.item()
            for idx in mapping[word_id]:
                new_labels[idx] = labels[idx]  # masked positions keep the original token id, so the loss is computed there
                input_ids[idx] = tokenizer.mask_token_id
        feature["labels"] = new_labels

    return default_data_collator(features)

In [64]:
samples = [lm_datasets["train"][i] for i in range(2)]
batch = whole_word_masking_data_collator(samples)

for chunk in batch["input_ids"]:
    print(f"{tokenizer.decode(chunk)}")

[CLS] i [MASK] i am curious - yellow from my video store [MASK] of [MASK] the [MASK] that [MASK] it when [MASK] was first released in 1967. i [MASK] heard that at first it was seized by [MASK]. s. customs [MASK] it ever tried to enter this [MASK], therefore being a fan of films [MASK] " [MASK] " i really [MASK] to [MASK] this for [MASK]. < br / [MASK] < [MASK] / > [MASK] plot is centered around [MASK] young swedish [MASK] student named lena who wants to learn [MASK] she can [MASK] life. in [MASK] she wants [MASK] focus her attentions to making some sort of documentary on what the average swede thought about certain political issues such
as the vietnam war and race issues in the [MASK] states. in between asking politicians and ordinary denizens of stockholm [MASK] their [MASK] on politics, she [MASK] sex with her drama teacher, classmates, [MASK] married men [MASK] < br / > < [MASK] / > what kills me about i [MASK] curious - yellow is [MASK] 40 years ago, this was [MASK] pornographic. r

### 4.3 Downsampling the dataset

In [65]:
train_size = 10_000
test_size = int(0.1 * train_size)

downsampled_dataset = lm_datasets["train"].train_test_split(
    train_size=train_size,
    test_size=test_size,
    seed=123,
)
downsampled_dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 10000
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 1000
    })
})

### 4.4 Defining the Trainer

In [66]:
from huggingface_hub import notebook_login

notebook_login()

In [69]:
from transformers import TrainingArguments

batch_size = 64
# Show the training loss with every epoch
logging_steps = len(downsampled_dataset["train"]) // batch_size
model_name = ckpt.split("/")[-1]

training_args = TrainingArguments(
    output_dir=f"{model_name}-finetuned-imdb",
    overwrite_output_dir=True,
    eval_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    push_to_hub=True,
    fp16=True,
    logging_steps=logging_steps,
    remove_unused_columns=False,  # keep word_ids, which the whole-word-masking collator needs
)
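
The step counts implied by these settings can be checked with a little arithmetic. The sketch below assumes a single device, no gradient accumulation, and 3 epochs (the `TrainingArguments` default when `num_train_epochs` is not set); the variable names are chosen for illustration.

```python
import math

train_examples = 10_000  # size of the downsampled training split
batch_size = 64
num_epochs = 3           # assumed default, since the arguments above do not set it

logging_steps = train_examples // batch_size
steps_per_epoch = math.ceil(train_examples / batch_size)  # the last batch is partial
total_steps = steps_per_epoch * num_epochs

print(logging_steps, steps_per_epoch, total_steps)  # 156 157 471
```

The total of 471 steps agrees with the `global_step=471` reported by `trainer.train()`.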

In [71]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=downsampled_dataset["train"],
    eval_dataset=downsampled_dataset["test"],
    data_collator=whole_word_masking_data_collator,
    processing_class=tokenizer
)

### 4.5 Evaluating the language model with perplexity

Let's check the perplexity before training:

In [73]:
import math

eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

Perplexity: 63.95
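
Perplexity is the exponential of the mean cross-entropy loss, where within a batch the mean is taken only over positions whose label is not -100. Here is a minimal pure-Python sketch of that relationship; the helper names are hypothetical and the logits are toy values.

```python
import math

IGNORE_INDEX = -100  # label value excluded from the loss


def masked_cross_entropy(logits, labels):
    """Mean cross-entropy over positions whose label is not IGNORE_INDEX."""
    total, count = 0.0, 0
    for row, label in zip(logits, labels):
        if label == IGNORE_INDEX:
            continue  # unmasked position: contributes nothing to the loss
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))  # stable log-sum-exp
        total += log_z - row[label]  # negative log-probability of the true token
        count += 1
    if count == 0:
        raise ValueError("no positions with a valid label")
    return total / count


def perplexity(mean_loss):
    """Perplexity is exp of the mean cross-entropy."""
    return math.exp(mean_loss)


# Two positions over a 4-token vocabulary; only the first one is a masked target.
toy_logits = [[0.0, 0.0, 0.0, 0.0], [5.0, 1.0, 0.0, 0.0]]
toy_labels = [2, IGNORE_INDEX]
loss = masked_cross_entropy(toy_logits, toy_labels)
print(round(perplexity(loss), 2))  # 4.0
```

A model that spreads probability uniformly over 4 tokens has perplexity 4; lower perplexity means the model is less surprised by the held-out text.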


Now run the training:

In [74]:
trainer.train()

| Epoch | Training Loss | Validation Loss | Model Preparation Time |
|---|---|---|---|
| 1 | 3.5486 | 3.330831 | 0.0021 |
| 2 | 3.4041 | 3.295099 | 0.0021 |
| 3 | 3.3688 | 3.293973 | 0.0021 |


TrainOutput(global_step=471, training_loss=3.4407115620412645, metrics={'train_runtime': 42.7453, 'train_samples_per_second': 701.832, 'train_steps_per_second': 11.019, 'total_flos': 994208670720000.0, 'train_loss': 3.4407115620412645, 'epoch': 3.0})

Check the perplexity again after training:

In [75]:
import math

eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

Perplexity: 25.99


Well done! The perplexity has dropped from 63.95 to 25.99. We can now save the model by pushing it to the Hub:

In [76]:
trainer.push_to_hub()

CommitInfo(commit_url='https://huggingface.co/f499d5/distilbert-base-uncased-finetuned-imdb/commit/77ca7750461db8c89df1e913147f6eb4e5a72808', commit_message='End of training', commit_description='', oid='77ca7750461db8c89df1e913147f6eb4e5a72808', pr_url=None, repo_url=RepoUrl('https://huggingface.co/f499d5/distilbert-base-uncased-finetuned-imdb', endpoint='https://huggingface.co', repo_type='model', repo_id='f499d5/distilbert-base-uncased-finetuned-imdb'), pr_revision=None, pr_num=None)