# Multi-Class Text Classification Experiment with the Hugging Face transformers Library

## BERT

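BERT stands for Bidirectional Encoder Representations from Transformers, a language representation model released by Google in October 2018. BERT is pre-trained on a large corpus built from Wikipedia and book text. To use it, one only needs to adapt the output layer to the downstream task and fine-tune the model, which usually gives strong results.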

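The overall structure of BERT is shown in the figure below. It is built on the Transformer, preprocesses text with WordPiece tokenization, and is pre-trained with two tasks: masked language modeling (MLM) and next sentence prediction (NSP).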

![avatar](./img/bert.png)
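To make the WordPiece step more concrete, here is a minimal sketch of greedy longest-match-first sub-word splitting. The function name `wordpiece_tokenize` and the toy vocabulary are hypothetical and exist only for illustration; the real BERT tokenizer uses a vocabulary of roughly 30,000 pieces and performs additional text normalization first.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into sub-word pieces by greedy longest-match-first."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        current = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                current = candidate
                break
            end -= 1
        if current is None:
            return [unk_token]  # no piece matches, so the whole word is unknown
        pieces.append(current)
        start = end
    return pieces


# Toy vocabulary (an assumption for this example, not BERT's real vocabulary).
toy_vocab = {"un", "##aff", "##able", "play", "##ing", "##ed"}

print(wordpiece_tokenize("unaffable", toy_vocab))  # ['un', '##aff', '##able']
print(wordpiece_tokenize("playing", toy_vocab))    # ['play', '##ing']
print(wordpiece_tokenize("xyz", toy_vocab))        # ['[UNK]']
```

Splitting rare words into known pieces keeps the vocabulary small while still letting the model represent words it never saw as a whole.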

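When BERT is used for text classification, there are usually two approaches:

1. Extract feature vectors from the pre-trained BERT model (the Feature Extraction approach).
2. Add a downstream task model on top of BERT, then train the whole model on the downstream task's training set (the Fine-Tuning approach).

Fine-tuning is the more commonly used approach, because training on the downstream dataset adjusts the pre-trained parameters so that the model captures the characteristics of the downstream data better. Below, a pre-trained BERT model is fine-tuned for a text classification task.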

## Transformer

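The Transformer is a module built entirely on the attention mechanism. Compare it with an RNN (Recurrent Neural Network): when the input is a long sentence, an RNN may forget words that appeared earlier, whereas the Transformer's attention mechanism assigns larger weights to the important words in the sentence so that they are not forgotten. Another major advantage is that the Transformer can process all positions in parallel, which speeds up computation.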
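The weighting idea can be sketched with scaled dot-product attention, the core operation inside the Transformer. The snippet below is a simplified single-head version in plain Python. The helper names and the toy two-dimensional vectors are assumptions for illustration, and real implementations add learned projections, multiple heads, and masking.

```python
import math


def softmax(xs):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Three toy token vectors used as queries, keys and values (self-attention).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(tokens, tokens, tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` shows how strongly one token attends to every token in the sequence. Because the rows are independent of each other, they can be computed in parallel, unlike the step-by-step recurrence of an RNN.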

The detailed structure of the Transformer is shown below:

![avatar](./img/transformer.png)

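## Hugging Face Transformers

Hugging Face's transformers is one of the most popular deep learning libraries today, and the same organization also provides the datasets library for fetching and processing data quickly. Together they make the whole machine learning workflow for BERT-style models remarkably simple.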

## Experiment Description

> References<br>https://zhuanlan.zhihu.com/p/344767513<br>
https://blog.csdn.net/weixin_38471579/article/details/108239629

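The referenced tutorial targets the AG News news-classification task. Note, however, that the dataset actually loaded in the code below carries six emotion labels (sadness, joy, love, anger, fear, surprise), as its `features` output shows, rather than the four AG News labels.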

BERT training (`Trainer`) and prediction (`pipeline`) are implemented with Hugging Face's transformers and datasets libraries.

Labels of the ag_news dataset:

| label | meaning  |
| ----- | -------- |
| 0     | World    |
| 1     | Sports   |
| 2     | Business |
| 3     | Sci/Tech |

## Code

In [1]:
import os
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from transformers import Trainer, TrainingArguments
from transformers import pipeline
from datasets import load_dataset
from datasets import load_from_disk

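### Loading the dataset with datasets

training set: as the name suggests, it is used to train the model, so it makes up the vast majority of the data.<br>
development set (validation set): used to evaluate the model trained on the training set, so that hyperparameters can be tuned and the model improved iteratively.<br>
test set: used for one final evaluation of the trained model after training has finished.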
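As an illustration of how such splits can be produced, the sketch below shuffles a list of examples and cuts it into 80% / 10% / 10%, which matches the 16000 / 2000 / 2000 sizes used in this experiment. The function `split_dataset` and its parameters are hypothetical; the datasets on disk here were prepared separately.

```python
import random


def split_dataset(examples, train_frac=0.8, dev_frac=0.1, seed=42):
    """Shuffle the examples and split them into train / dev / test parts."""
    items = list(examples)
    random.Random(seed).shuffle(items)  # fixed seed keeps the split reproducible
    n_train = int(len(items) * train_frac)
    n_dev = int(len(items) * dev_frac)
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    test = items[n_train + n_dev:]      # whatever remains becomes the test set
    return train, dev, test


# Integers stand in for the 20000 (text, label) examples.
train, dev, test = split_dataset(range(20000))
print(len(train), len(dev), len(test))
```

Keeping the three parts disjoint is what makes the final test score an honest estimate of performance on unseen data.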

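> The original tutorial uses load_dataset and chooses the size of each split itself. Here load_dataset could not be used because of network issues; for the workaround, see https://blog.csdn.net/weixin_43201090/article/details/123308940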

In [2]:
train_dataset = load_from_disk("./dataset/train")
dev_dataset = load_from_disk("./dataset/validation/")
test_dataset = load_from_disk("./dataset/test/")
print(train_dataset)
print(dev_dataset)
print(test_dataset)

Dataset({
    features: ['text', 'label'],
    num_rows: 16000
})
Dataset({
    features: ['text', 'label'],
    num_rows: 2000
})
Dataset({
    features: ['text', 'label'],
    num_rows: 2000
})


The raw dataset contains two fields, text and label.

In [3]:
train_dataset.features # inspect the dataset features

{'text': Value(dtype='string', id=None),
 'label': ClassLabel(num_classes=6, names=['sadness', 'joy', 'love', 'anger', 'fear', 'surprise'], id=None)}

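The BERT model expects the label field to be named `labels`, whereas the original dataset calls it `label`, so a small adjustment is needed.

The code below copies the `label` field to `labels`.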

In [4]:
train_dataset = train_dataset.map(lambda examples: {'labels': examples['label']}, batched=True)
dev_dataset = dev_dataset.map(lambda examples: {'labels': examples['label']}, batched=True)
test_dataset = test_dataset.map(lambda examples: {'labels': examples['label']}, batched=True)

train_dataset[0] # check the result



{'text': 'i didnt feel humiliated', 'label': 0, 'labels': 0}
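With `batched=True`, the mapped function receives a dictionary of lists (one list per column) instead of a single example, and the columns it returns are added to the dataset. The plain-Python sketch below imitates that behavior. `toy_batched_map` is a hypothetical stand-in rather than the datasets API, and all rows except the first are made up for illustration.

```python
def copy_label(batch):
    """Same logic as the lambda above: expose `label` under the name `labels`."""
    return {"labels": batch["label"]}


def toy_batched_map(columns, fn, batch_size=2):
    """Apply fn to slices of the columns and merge the returned columns back in."""
    n_rows = len(next(iter(columns.values())))
    new_columns = {}
    for start in range(0, n_rows, batch_size):
        # Each batch is a dict of lists, one list per column.
        batch = {name: vals[start:start + batch_size] for name, vals in columns.items()}
        for name, vals in fn(batch).items():
            new_columns.setdefault(name, []).extend(vals)
    merged = dict(columns)
    merged.update(new_columns)
    return merged


toy_columns = {
    "text": ["i didnt feel humiliated", "second toy sentence", "third toy sentence"],
    "label": [0, 1, 3],
}
mapped = toy_batched_map(toy_columns, copy_label)
print(mapped["labels"])  # [0, 1, 3]
```

Processing whole batches at once is mainly a speed optimization; the result is the same as mapping one example at a time.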

### Loading the model and tokenizer, and preprocessing the data

In [5]:
model_id = 'prajjwal1/bert-tiny'
# note that we need to specify the number of classes for this task
# we can directly use the metadata (num_classes) stored in the dataset
model = AutoModelForSequenceClassification.from_pretrained(model_id, 
            num_labels=train_dataset.features["label"].num_classes)
tokenizer = AutoTokenizer.from_pretrained(model_id)

Some weights of the model checkpoint at prajjwal1/bert-tiny were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.decoder.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initia

Tokenize the dataset with BERT's tokenizer, padding or truncating every sequence to 256 tokens.

In [6]:
MAX_LENGTH = 256
train_dataset = train_dataset.map(lambda e: tokenizer(e['text'], truncation=True, padding='max_length', max_length=MAX_LENGTH), batched=True)
dev_dataset = dev_dataset.map(lambda e: tokenizer(e['text'], truncation=True, padding='max_length', max_length=MAX_LENGTH), batched=True)
test_dataset = test_dataset.map(lambda e: tokenizer(e['text'], truncation=True, padding='max_length', max_length=MAX_LENGTH), batched=True)
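The effect of `truncation=True, padding='max_length'` can be sketched on a list of token ids. The helper `pad_or_truncate` is hypothetical and simplified: it assumes the ids already include the special start and end tokens, keeps the final token when cutting, and uses 0 as the padding id, which matches the zeros visible in the tokenized output of this notebook.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Force a token id list to exactly max_length and build its attention mask."""
    if len(token_ids) > max_length:
        # Keep the closing special token so the sequence still ends properly.
        ids = token_ids[:max_length - 1] + token_ids[-1:]
    else:
        ids = list(token_ids)
    padding = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * padding,
        # 1 marks real tokens, 0 marks padding the model should ignore.
        "attention_mask": [1] * len(ids) + [0] * padding,
    }


# Token ids of "i didnt feel humiliated" as shown in the notebook output.
example_ids = [101, 1045, 2134, 2102, 2514, 26608, 102]

padded = pad_or_truncate(example_ids, max_length=10)
print(padded["input_ids"])       # [101, 1045, 2134, 2102, 2514, 26608, 102, 0, 0, 0]
print(padded["attention_mask"])  # [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]

truncated = pad_or_truncate(example_ids, max_length=4)
print(truncated["input_ids"])    # [101, 1045, 2134, 102]
```

Fixed-length sequences can be stacked into a single tensor per batch, at the cost of wasted computation on padding when most texts are short.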

To feed the data to a PyTorch model, we also need to declare the output format and the columns to use.

In [7]:
train_dataset.set_format(type='torch', columns=['input_ids', 'token_type_ids', 'attention_mask', 'labels'])
dev_dataset.set_format(type='torch', columns=['input_ids', 'token_type_ids', 'attention_mask', 'labels'])
test_dataset.set_format(type='torch', columns=['input_ids', 'token_type_ids', 'attention_mask', 'labels'])

In [8]:
train_dataset.features # inspect the dataset; it can now be fed directly to BERT for training

{'text': Value(dtype='string', id=None),
 'label': ClassLabel(num_classes=6, names=['sadness', 'joy', 'love', 'anger', 'fear', 'surprise'], id=None),
 'labels': Value(dtype='int64', id=None),
 'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None),
 'token_type_ids': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None),
 'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None)}

In [9]:
train_dataset[0]

{'labels': tensor(0),
 'input_ids': tensor([  101,  1045,  2134,  2102,  2514, 26608,   102,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,  

### Defining the evaluation metrics shown during training

In [10]:
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='macro')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }
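To show what `average='macro'` computes, the sketch below reimplements macro-averaged precision, recall, and F1 in plain Python: each class gets its own score, and the scores are averaged with equal weight regardless of class size. The function name and the toy labels are assumptions for illustration; `compute_metrics` above relies on scikit-learn for the real computation.

```python
def macro_precision_recall_f1(labels, preds):
    """Per-class precision/recall/F1, averaged with equal weight per class."""
    classes = sorted(set(labels) | set(preds))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(1 for y, p in zip(labels, preds) if y == c and p == c)
        fp = sum(1 for y, p in zip(labels, preds) if y != c and p == c)
        fn = sum(1 for y, p in zip(labels, preds) if y == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(classes)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy ground truth and predictions for three classes.
toy_labels = [0, 0, 1, 1, 2, 2]
toy_preds = [0, 1, 1, 1, 2, 0]

precision, recall, f1 = macro_precision_recall_f1(toy_labels, toy_preds)
accuracy = sum(y == p for y, p in zip(toy_labels, toy_preds)) / len(toy_labels)
print(round(accuracy, 4), round(precision, 4), round(recall, 4), round(f1, 4))
```

Because every class counts equally, macro F1 can be noticeably lower than accuracy when small classes are predicted poorly, which is consistent with the gap between `eval_accuracy` and `eval_f1` in the training log.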

### Setting the training arguments and training directly with Trainer

In [11]:
training_args = TrainingArguments(
    output_dir='./results',
    learning_rate=3e-4,
    num_train_epochs=3,              # overfitting was observed with 10 epochs
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    do_train=True,
    do_eval=True,
    no_cuda=False,
    load_best_model_at_end=True,
    # eval_steps=100,                # unclear whether needed; it only applies when evaluation_strategy="steps"
    evaluation_strategy="epoch",
    save_strategy="epoch"
)

trainer = Trainer(
    model=model,                         # the instantiated Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=dev_dataset,            # evaluation dataset
    compute_metrics=compute_metrics
)

train_out = trainer.train()

The following columns in the training set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 16000
  Num Epochs = 3
  Instantaneous batch size per device = 64
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 1
  Total optimization steps = 750


The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 2000
  Batch size = 64


Saving model checkpoint to ./results\checkpoint-250
Configuration saved in ./results\checkpoint-250\config.json
Model weights saved in ./results\checkpoint-250\pytorch_model.bin


{'eval_loss': 0.39984986186027527, 'eval_accuracy': 0.8775, 'eval_f1': 0.8548110250334194, 'eval_precision': 0.861507897846472, 'eval_recall': 0.8499940196540186, 'eval_runtime': 0.9295, 'eval_samples_per_second': 2151.613, 'eval_steps_per_second': 34.426, 'epoch': 1.0}


The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 2000
  Batch size = 64


{'loss': 0.6281, 'learning_rate': 9.999999999999999e-05, 'epoch': 2.0}


Saving model checkpoint to ./results\checkpoint-500
Configuration saved in ./results\checkpoint-500\config.json
Model weights saved in ./results\checkpoint-500\pytorch_model.bin


{'eval_loss': 0.29661861062049866, 'eval_accuracy': 0.9035, 'eval_f1': 0.8780639170951697, 'eval_precision': 0.8743198461066242, 'eval_recall': 0.8831430136325844, 'eval_runtime': 0.894, 'eval_samples_per_second': 2237.117, 'eval_steps_per_second': 35.794, 'epoch': 2.0}


The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 2000
  Batch size = 64


Saving model checkpoint to ./results\checkpoint-750
Configuration saved in ./results\checkpoint-750\config.json
Model weights saved in ./results\checkpoint-750\pytorch_model.bin


Training completed. Do not forget to share your model on huggingface.co/models =)


Loading best model from ./results\checkpoint-750 (score: 0.2821809649467468).


{'eval_loss': 0.2821809649467468, 'eval_accuracy': 0.903, 'eval_f1': 0.876381013773018, 'eval_precision': 0.870887500095454, 'eval_recall': 0.8826170614228879, 'eval_runtime': 0.9, 'eval_samples_per_second': 2222.224, 'eval_steps_per_second': 35.556, 'epoch': 3.0}
{'train_runtime': 47.4411, 'train_samples_per_second': 1011.781, 'train_steps_per_second': 15.809, 'train_loss': 0.5023726908365885, 'epoch': 3.0}
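The numbers in the training log above can be checked with a little arithmetic. The sketch below recomputes the total number of optimization steps and the learning rate logged at step 500. It assumes a linear decay to zero without warmup, which is consistent with the logged value of about 1e-4 but is not stated in the notebook itself; `linear_lr` is a hypothetical helper.

```python
import math

num_examples = 16000   # size of the training set
batch_size = 64        # per_device_train_batch_size on a single device
num_epochs = 3
learning_rate = 3e-4

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)  # 250 750


def linear_lr(step, base_lr=learning_rate, total=total_steps):
    """Learning rate under linear decay to zero (assumed schedule, no warmup)."""
    return base_lr * max(0.0, (total - step) / total)


# Checkpoints are saved once per epoch, i.e. every steps_per_epoch steps.
checkpoints = [steps_per_epoch * (epoch + 1) for epoch in range(num_epochs)]
print(checkpoints)                   # [250, 500, 750]
print(linear_lr(500))
```

This explains why the checkpoints are named `checkpoint-250`, `checkpoint-500`, and `checkpoint-750`, and why the learning rate has dropped to roughly one third of its initial value after two of the three epochs.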


| epoch | eval_accuracy | eval_f1     | eval_loss   | eval_precision | eval_recall |
| ----- | ------------- | ----------- | ----------- | -------------- | ----------- |
| 1     | 0.885         | 0.853234205 | 0.894597054 | 0.875926099    | 0.83732716  |
| 2     | 0.884         | 0.856785579 | 0.910005748 | 0.871581735    | 0.84460723  |
| 3     | 0.884         | 0.855251653 | 0.912829518 | 0.862350882    | 0.84961513  |

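### Predicting on text directly with pipeline

`pipeline` can load the trained model and tokenizer and classify raw text directly, with no manual preprocessing.

First, move the model back to the CPU for prediction.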

In [12]:
model = model.cpu()

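The task name `sentiment-analysis` tells the pipeline that we are doing text classification (sentiment analysis is a representative text classification task), and we pass in the model trained above.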

In [13]:
classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)

Pick an example from the test set, which the model has not seen, and predict its label.

In [14]:
test_examples = load_from_disk("./dataset/test/")
test_examples[622]

{'text': 'i don t feel brave though', 'label': 1}

The true label of this text is 1 (`joy`). Let's see whether the model predicts it correctly.

In [15]:
result = classifier(test_examples[622]['text'])
result

[{'label': 'LABEL_1', 'score': 0.980215311050415}]
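The pipeline reports generic names such as `LABEL_1` because the model configuration was not given human-readable label names. A small post-processing step can map them back to the class names listed in the dataset's `features`. The helper `readable` below is hypothetical, and the `result` value is copied from the output above.

```python
# Class names in index order, as shown by train_dataset.features.
label_names = ['sadness', 'joy', 'love', 'anger', 'fear', 'surprise']


def readable(prediction, names=label_names):
    """Replace a generic 'LABEL_<i>' name with the i-th class name."""
    index = int(prediction["label"].split("_")[-1])
    return {"label": names[index], "score": prediction["score"]}


result = [{'label': 'LABEL_1', 'score': 0.980215311050415}]
pretty = [readable(r) for r in result]
print(pretty)  # [{'label': 'joy', 'score': 0.980215311050415}]
```

The predicted index 1 matches the example's true label, so the model classified this sentence correctly.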