# Fine-tuning a Language Model

## 1 Prepare the Dataset


Load the dataset

In [2]:
from datasets import load_dataset
datasets = load_dataset('/home/futureai/datasets/Salesforce--wikitext/wikitext-2-raw-v1')



Inspect a few samples

In [3]:
# Look at the 11th example in the training split
datasets["train"][10]

{'text': ' The game \'s battle system , the BliTZ system , is carried over directly from Valkyira Chronicles . During missions , players select each unit using a top @-@ down perspective of the battlefield map : once a character is selected , the player moves the character around the battlefield in third @-@ person . A character can only act once per @-@ turn , but characters can be granted multiple turns at the expense of other characters \' turns . Each character has a field and distance of movement limited by their Action Gauge . Up to nine characters can be assigned to a single mission . During gameplay , characters will call out if something happens to them , such as their health points ( HP ) getting low or being knocked out by enemy attacks . Each character has specific " Potentials " , skills unique to each character . They are divided into " Personal Potential " , which are innate skills that remain unaltered unless otherwise dictated by the story and can either help or impede

In [4]:
# Look at the 11th example in the validation split
datasets["validation"][10]

{'text': ' The closest relative of H. gammarus is the American lobster , Homarus americanus . The two species are very similar , and can be crossed artificially , although hybrids are unlikely to occur in the wild since their ranges do not overlap . The two species can be distinguished by a number of characteristics : \n'}

In [5]:
# Look at the 4th example in the test split
datasets["test"][3]

{'text': ' Robert Boulter is an English film , television and theatre actor . He had a guest @-@ starring role on the television series The Bill in 2000 . This was followed by a starring role in the play Herons written by Simon Stephens , which was performed in 2001 at the Royal Court Theatre . He had a guest role in the television series Judge John Deed in 2002 . In 2004 Boulter landed a role as " Craig " in the episode " Teddy \'s Story " of the television series The Long Firm ; he starred alongside actors Mark Strong and Derek Jacobi . He was cast in the 2005 theatre productions of the Philip Ridley play Mercury Fur , which was performed at the Drum Theatre in Plymouth and the Menier Chocolate Factory in London . He was directed by John Tiffany and starred alongside Ben Whishaw , Shane Zaza , Harry Kent , Fraser Ayres , Sophie Stanton and Dominic Hall . \n'}

## 2 Prepare the Model

This walkthrough uses Qwen/Qwen2.5-0.5B, which can be downloaded with the modelscope CLI:

```modelscope download --model Qwen/Qwen2.5-0.5B --local_dir /home/futureai/models/Qwen/Qwen2.5-0.5B```

In [6]:
model_checkpoint = "/home/futureai/models/Qwen/Qwen2.5-0.5B"

## 3 Process the Dataset

Load the tokenizer

In [7]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
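
The message above says that no deep-learning backend was found in this environment. Tokenization still works, but the later model-loading and training cells need one. A minimal pre-flight check might look like the sketch below; it assumes PyTorch is the intended backend, and the helper name `backend_available` is illustrative, not part of any library.

```python
import importlib.util


def backend_available(module_name="torch"):
    # find_spec returns None when the top-level module cannot be located.
    return importlib.util.find_spec(module_name) is not None


if not backend_available("torch"):
    print("PyTorch not found; install it before loading the model.")
```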


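Tokenize the data. `map` runs over every split (train, validation, and test), and the raw `text` column is dropped afterwards.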

In [8]:
def tokenizer_function(example):
    return tokenizer(example["text"])

In [9]:
tokenized_datasets = datasets.map(tokenizer_function, batched=True, num_proc=4, remove_columns=["text"])

Inspect a few tokenized samples

In [10]:
tokenized_datasets["train"][1]

{'input_ids': [284, 85162, 88, 4204, 65316, 14429, 284, 715],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}

In [11]:
tokenized_datasets["test"][1]

{'input_ids': [284, 8397, 425, 10965, 465, 284, 715],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}

Concatenate the tokenized texts and split the result into chunks of a fixed `block_size`

In [12]:
block_size = 20

In [13]:
def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder. We could pad instead if the model supported it;
    # customize this part to your needs.
    total_length = (total_length // block_size) * block_size
    # Split by chunks of max_len.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result
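
To make the concatenate-then-chunk step concrete, here is a toy sketch on made-up token ids. `group_texts_demo` is an illustrative stand-in that takes `block_size` as a parameter; it is not the notebook's function. Note that with `batched=True` the remainder is dropped once per batch passed to the function, not once for the whole dataset.

```python
def group_texts_demo(examples, block_size):
    # Flatten each column (a list of token-id lists) into one long list.
    concatenated = {k: [tok for seq in v for tok in seq] for k, v in examples.items()}
    total_length = len(concatenated["input_ids"])
    # Drop the tail that does not fill a whole block.
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    # For causal language modeling the labels are a copy of the inputs.
    result["labels"] = [chunk[:] for chunk in result["input_ids"]]
    return result


# Three toy "documents" holding 9 tokens in total.
toy_batch = {
    "input_ids": [[1, 2, 3], [4, 5], [6, 7, 8, 9]],
    "attention_mask": [[1, 1, 1], [1, 1], [1, 1, 1, 1]],
}
grouped = group_texts_demo(toy_batch, block_size=4)
print(grouped["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Token `9` is discarded because it cannot fill a complete block, and chunk boundaries ignore document boundaries.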

In [14]:
lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    batch_size=1000,
    num_proc=4,
)

Map (num_proc=4): 100%|██████████| 36718/36718 [00:16<00:00, 2234.58 examples/s]
Map (num_proc=4): 100%|██████████| 3760/3760 [00:01<00:00, 2011.31 examples/s]
Map (num_proc=4): 100%|██████████| 4358/4358 [00:02<00:00, 1742.72 examples/s]


Inspect the data after concatenation and chunking

In [22]:
tokenizer.decode(lm_datasets["train"][1]["input_ids"])

'ed Chronicles ( Japanese : 戦場のヴァルキュリア3 , lit . Valkyria'
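
Each chunk is used as both input and label, since `group_texts` copies `input_ids` into `labels`. Causal language models in transformers shift the labels inside the forward pass, so the prediction at position i is scored against token i + 1. The sketch below shows that pairing in plain Python; `next_token_pairs` is a hypothetical helper, and the ids are borrowed from the tokenized sample shown earlier purely for illustration.

```python
def next_token_pairs(input_ids, labels):
    # The input at position i is paired with the label at position i + 1.
    return list(zip(input_ids[:-1], labels[1:]))


chunk = [284, 85162, 88, 4204]
pairs = next_token_pairs(chunk, list(chunk))
print(pairs)  # [(284, 85162), (85162, 88), (88, 4204)]
```

A chunk of n tokens therefore yields n - 1 prediction targets.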

## 4 Training Code

Load the model

In [None]:
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(model_checkpoint)

Training arguments

In [None]:
from transformers import Trainer, TrainingArguments

In [None]:
model_name = model_checkpoint.split("/")[-1]
training_args = TrainingArguments(
    output_dir=f"{model_name}-finetuned-wikitext2",
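    # Evaluate at the end of every epoch. Recent transformers releases
    # rename this argument to eval_strategy.
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,  # weight-decay coefficient (an L2-style regularizer) that helps limit overfitting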
    push_to_hub=False
)
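
The arguments above leave batch size and epoch count at their defaults, which in `TrainingArguments` are `per_device_train_batch_size=8` and `num_train_epochs=3`. A rough sketch for estimating the number of optimizer steps follows. It assumes a single device, no gradient accumulation, and a final partial batch that is kept; the example count is made up, and `total_training_steps` is a hypothetical helper.

```python
import math


def total_training_steps(num_examples, batch_size=8, num_epochs=3):
    # One optimizer step per batch; the last partial batch still counts.
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs


print(total_training_steps(100))  # 39
```

For the real run, `len(lm_datasets["train"])` would supply the example count.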

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["validation"]
)

In [None]:
trainer.train()
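
After training, language models are commonly reported by perplexity, which is the exponential of the mean cross-entropy loss. The sketch below assumes `trainer.evaluate()` returns a dict containing `eval_loss`, as it does when the evaluation set carries labels. The helper name is illustrative, and no results are shown because the cells above were not executed.

```python
import math


def perplexity_from_loss(eval_loss):
    # Perplexity is exp of the mean cross-entropy loss (measured in nats).
    return math.exp(eval_loss)


# Hypothetical usage once training has finished (needs the trainer above):
# eval_results = trainer.evaluate()
# print(f"Perplexity: {perplexity_from_loss(eval_results['eval_loss']):.2f}")
print(perplexity_from_loss(0.0))  # 1.0
```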