<a href="https://colab.research.google.com/github/LolitaSian/Getting-Started-with-Google-BERT/blob/main/Chapter03/3.05.%20Finetuning%20BERT%20for%20downstream%20tasks.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Finetuning BERT for downstream tasks
So far, we have learned how to use the pre-trained BERT model. Now, let us learn how to finetune the pre-trained BERT for downstream tasks. Finetuning means that we do not train BERT from scratch; instead, we start from the already trained BERT and update its weights for our task.

In this section, we will learn how to finetune the pre-trained BERT for the following downstream tasks: 

- Text classification 
- Natural language inference 
- Named entity recognition 
- Question-answering

Let us start with text classification.

First, install the necessary libraries.

In [43]:
!pip install transformers
!pip install datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


Then, import the necessary modules.

In [44]:
from transformers import BertForSequenceClassification, BertTokenizerFast, Trainer, TrainingArguments
from datasets import load_dataset
import torch
import numpy as np

Download and load the dataset, then print it to look at its contents.

In [119]:
ethos = load_dataset("ethos", "binary", split='train')
print(ethos)



Dataset({
    features: ['text', 'label'],
    num_rows: 998
})


Next, check the data type.

In [110]:
type(ethos)

datasets.arrow_dataset.Dataset

In [47]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

cuda


Now, create the training and test sets.

In [112]:
dataset = ethos.train_test_split(test_size=0.2)
train_set = dataset['train']
test_set = dataset['test']
print(train_set)

Dataset({
    features: ['text', 'label'],
    num_rows: 798
})
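The split sizes can be checked by hand. The sketch below is plain Python, not the `datasets` implementation; it assumes the test size is rounded up and the remainder goes to the training set, which matches the 798 and 200 rows reported in this notebook. The helper name `split_sizes` is hypothetical.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test count is rounded up and the rest is training data.
    """
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test

print(split_sizes(998, 0.2))  # (798, 200)
```

Because the split is shuffled, the rows in each part change from run to run. Passing a fixed `seed` to `train_test_split` should make the split reproducible.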


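Next, download and load the pre-trained BERT model. In this example, we use the pre-trained bert-base-uncased model. Since we are doing sequence classification, we use the BertForSequenceClassification class.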

In [70]:
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

With the tokenizer, we can easily preprocess the dataset. We define a function named preprocess to process the dataset, as follows:

In [50]:
def preprocess(data):
    return tokenizer(data['text'], padding = True, truncation = True)

Use the preprocess function to preprocess the training and test sets.

In [113]:
train_set = train_set.map(preprocess, batched = True, batch_size = len(train_set))
test_set = test_set.map(preprocess, batched = True, batch_size = len(test_set))

Map:   0%|          | 0/798 [00:00<?, ? examples/s]

Map:   0%|          | 0/200 [00:00<?, ? examples/s]
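With `padding = True`, the tokenizer pads every sequence to the longest one in the batch it receives. Calling `map` with `batch_size = len(train_set)` hands it the whole split as one batch, so every example in that split ends up with the same length. The sketch below mimics the idea in plain Python. It is a simplification, not the tokenizer's code: the token ids are illustrative, the helper name `pad_and_truncate` is hypothetical, and the real tokenizer keeps the final `[SEP]` token when it truncates.

```python
def pad_and_truncate(batch, max_length=512, pad_id=0):
    """Truncate each id list to max_length, then pad to the longest in the batch.

    Assumptions: pad_id=0 and max_length=512 match bert-base-uncased.
    """
    truncated = [ids[:max_length] for ids in batch]
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two already-tokenized sentences of different lengths (illustrative ids).
batch = [[101, 7592, 102], [101, 7592, 2088, 999, 102]]
encoded = pad_and_truncate(batch)
print(encoded["input_ids"])
print(encoded["attention_mask"])
```

The shorter sequence gets two padding ids and two zeros in its attention mask, while the longer one is left unchanged.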

Next, use the set_format function to select the columns we need from the dataset and the format they should be returned in, as follows.

In [114]:
train_set.set_format('torch',columns = ['input_ids', 'attention_mask', 'label'])
test_set.set_format('torch',columns = ['input_ids', 'attention_mask', 'label'])

First, define the batch size and the number of epochs. Then, set the number of warmup steps and the weight decay.

In [115]:
batch_size = 64
epochs = 5
warmup_steps = 500
weight_decay = 0.01
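It can help to check how many optimizer steps these values produce. The arithmetic below assumes a single device, no gradient accumulation, and the 798 training examples from the split above. Under those assumptions, training runs for fewer steps than the configured warmup. The recorded training output further down reports a different step count, so it may come from a run with other settings.

```python
import math

train_examples = 798   # size of the training split reported above
batch_size = 64
epochs = 5
warmup_steps = 500

# The last, partially filled batch still counts as a step.
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch)             # 13
print(total_steps)                 # 65
print(warmup_steps > total_steps)  # True
```

If the warmup is meant to cover only the start of training, a value well below `total_steps` is one option to consider.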

Next, set the training arguments.

In [116]:
training_args = TrainingArguments(
    output_dir = './results',
    num_train_epochs = epochs,
    per_device_train_batch_size = batch_size,
    per_device_eval_batch_size = batch_size,
    warmup_steps = warmup_steps,
    weight_decay = weight_decay,
    evaluation_strategy = 'steps',  # evaluate every eval_steps steps; newer transformers releases may name this eval_strategy
    logging_dir = './logs',
)
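To see what `warmup_steps` does, here is a minimal sketch of a linear schedule with warmup: the learning rate rises linearly from zero during warmup, then decays linearly to zero. It assumes the `Trainer` defaults (a linear scheduler and a base learning rate of 5e-5), since neither is set explicitly above. The function name `linear_warmup_multiplier` is hypothetical, and the 65 total steps assume 13 steps per epoch for 5 epochs.

```python
def linear_warmup_multiplier(step, warmup_steps, total_steps):
    """Factor applied to the base learning rate at a given step."""
    if step < warmup_steps:
        # Warmup phase: grow linearly from 0 towards 1.
        return step / max(1, warmup_steps)
    # Decay phase: shrink linearly from 1 down to 0 at total_steps.
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

base_lr = 5e-5  # assumption: default learning_rate of TrainingArguments
for step in (0, 13, 65):
    print(step, base_lr * linear_warmup_multiplier(step, warmup_steps=500, total_steps=65))
```

With 500 warmup steps and only 65 training steps, the multiplier never exceeds 0.13, so the learning rate stays far below its base value for the whole run.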

Finally, define the trainer.

In [117]:
trainer = Trainer(
    model = model,
    args = training_args,
    train_dataset = train_set,
    eval_dataset = test_set
)

Start training.

In [118]:
trainer.train()



Step,Training Loss,Validation Loss


TrainOutput(global_step=200, training_loss=0.6117241287231445, metrics={'train_runtime': 149.2698, 'train_samples_per_second': 10.692, 'train_steps_per_second': 1.34, 'total_flos': 419925244354560.0, 'train_loss': 0.6117241287231445, 'epoch': 2.0})

After training, evaluate the model.

In [120]:
trainer.evaluate()

{'eval_loss': 0.5309603810310364,
 'eval_runtime': 3.755,
 'eval_samples_per_second': 53.262,
 'eval_steps_per_second': 6.658,
 'epoch': 2.0}
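The evaluation above reports only the loss. To get a classification metric as well, `Trainer` accepts a `compute_metrics` callable. The sketch below is one possible version that computes accuracy. It is written in plain Python so it runs on nested lists as well as arrays; with NumPy arrays, `np.argmax(logits, axis=-1)` is the more usual way to pick the predicted class.

```python
def compute_metrics(eval_pred):
    """Compute accuracy from (logits, labels)."""
    logits, labels = eval_pred
    # Predicted class = index of the largest logit in each row.
    predictions = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(predictions)}

# Toy check with three examples and two classes (made-up logits).
toy_logits = [[0.1, 0.9], [2.0, -1.0], [0.3, 0.2]]
toy_labels = [1, 0, 1]
print(compute_metrics((toy_logits, toy_labels)))
```

Passing `compute_metrics = compute_metrics` when constructing the `Trainer` should add the metric to the output of `trainer.evaluate()`, with the key prefixed as `eval_accuracy`.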

In this way, we can finetune the pre-trained BERT model for text classification tasks.