# Task06: Solving Text Classification with Transformers and Hyperparameter Search

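## 1 Introduction to the Text Classification Tasks
- We use models from the Transformers library to solve text classification tasks taken from the [GLUE Benchmark](https://gluebenchmark.com/).
- The GLUE benchmark contains nine classification tasks on sentences or sentence pairs:
  1. CoLA (Corpus of Linguistic Acceptability): decide whether a sentence is grammatically acceptable.
  2. MNLI (Multi-Genre Natural Language Inference): given a hypothesis, decide whether another sentence entails it, contradicts it, or is unrelated to it (neutral).
  3. MRPC (Microsoft Research Paraphrase Corpus): decide whether two sentences are paraphrases of each other.
  4. QNLI (Question-answering Natural Language Inference): decide whether the second sentence contains the answer to the question in the first sentence.
  5. QQP (Quora Question Pairs): decide whether two questions are semantically equivalent.
  6. RTE (Recognizing Textual Entailment): decide whether a sentence entails a given hypothesis.
  7. SST-2 (Stanford Sentiment Treebank): decide whether the sentiment of a sentence is positive or negative.
  8. STS-B (Semantic Textual Similarity Benchmark): score the similarity of two sentences (a score from 1 to 5).
  9. WNLI (Winograd Natural Language Inference): given a sentence containing an ambiguous pronoun and a second sentence in which the pronoun has been replaced, decide whether the first entails the second.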

In [1]:
GLUE_TASKS = ["cola", "mnli", "mnli-mm", "mrpc",
              "qnli", "qqp", "rte", "sst2", "stsb", "wnli"]

In [2]:
# Use the CoLA task
task = "cola"
# DistilBERT checkpoint (a distilled version of BERT)
model_checkpoint = "distilbert-base-uncased"
# Adjust batch_size to fit your GPU memory and avoid out-of-memory errors
batch_size = 16

## 2 Loading the Data

### 2.1 Loading the Dataset and Its Evaluation Metric

In [3]:
from datasets import load_dataset, load_metric

In [4]:
actual_task = "mnli" if task == "mnli-mm" else task
# Load the GLUE dataset for the chosen task
dataset = load_dataset("glue", actual_task)
# Load the matching GLUE evaluation metric
metric = load_metric('glue', actual_task)

Reusing dataset glue (C:\Users\hurui\.cache\huggingface\datasets\glue\cola\1.0.0\dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)


### 2.2 Inspecting the Data

In [5]:
# Each split (train, validation, test) can be accessed by its key
dataset

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 8551
    })
    validation: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1043
    })
    test: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1063
    })
})

In [6]:
# Look at the first example of the training set
dataset["train"][0]

{'sentence': "Our friends won't buy this analysis, let alone the next one we propose.",
 'label': 1,
 'idx': 0}

In [7]:
import datasets
import random
import pandas as pd
from IPython.display import display, HTML


def show_random_elements(dataset, num_examples=10):
    """Display a few randomly chosen examples from the dataset."""
    assert num_examples <= len(
        dataset), "Can't pick more elements than there are in the dataset."
    # Sample distinct indices without replacement
    picks = random.sample(range(len(dataset)), num_examples)

    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

In [8]:
show_random_elements(dataset["train"])

|  | sentence | label | idx |
| :---: | :---: | :---: | :---: |
| 0 | No one can forgive you that comment. | acceptable | 2078 |
| 1 | Bill and Kathy married. | acceptable | 2318 |
| 2 | $5 will buy a ticket. | acceptable | 2410 |
| 3 | Which books did Robin talk to Chris and read? | unacceptable | 7039 |
| 4 | Jill offered the ball towards Bob. | unacceptable | 2053 |
| 5 | Has not Henri studied for his exam? | unacceptable | 7466 |
| 6 | Fanny stopped talking when in came Aunt Norris. | unacceptable | 6778 |
| 7 | Who do you think that would be nominated for the position? | unacceptable | 4784 |
| 8 | Mickey teamed with the women up. | unacceptable | 440 |
| 9 | Baseballs toss easily. | unacceptable | 2783 |


### 2.3 Inspecting the Metric

In [9]:
metric

Metric(name: "glue", features: {'predictions': Value(dtype='int64', id=None), 'references': Value(dtype='int64', id=None)}, usage: """
Compute GLUE evaluation metric associated to each GLUE dataset.
Args:
    predictions: list of predictions to score.
        Each translation should be tokenized into a list of tokens.
    references: list of lists of references for each translation.
        Each reference should be tokenized into a list of tokens.
Returns: depending on the GLUE subset, one or several of:
    "accuracy": Accuracy
    "f1": F1 score
    "pearson": Pearson Correlation
    "spearmanr": Spearman Correlation
    "matthews_correlation": Matthew Correlation
Examples:

    >>> glue_metric = datasets.load_metric('glue', 'sst2')  # 'sst2' or any of ["mnli", "mnli_mismatched", "mnli_matched", "qnli", "rte", "wnli", "hans"]
    >>> references = [0, 1]
    >>> predictions = [0, 1]
    >>> results = glue_metric.compute(predictions=predictions, references=references)
    >>> print(res

In [10]:
# Call the metric's compute method to calculate the score
import numpy as np

fake_preds = np.random.randint(0, 2, size=(64,))
fake_labels = np.random.randint(0, 2, size=(64,))
metric.compute(predictions=fake_preds, references=fake_labels)

{'matthews_correlation': -0.00392156862745098}

### 2.4 Tasks and Their Evaluation Metrics

| Task | Metric |
| :---: | :---: |
| CoLA | [Matthews Correlation Coefficient](https://en.wikipedia.org/wiki/Matthews_correlation_coefficient) |
| MNLI | Accuracy |
| MRPC | Accuracy and [F1 score](https://en.wikipedia.org/wiki/F1_score) |
| QNLI | Accuracy |
| QQP | Accuracy and F1 score |
| RTE | Accuracy |
| SST-2 | Accuracy |
| STS-B | [Pearson Correlation Coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) and [Spearman's Rank Correlation Coefficient](https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient) |
| WNLI | Accuracy |
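
For CoLA, the metric in the table is the Matthews correlation coefficient. As a rough illustration (this is not the `datasets` implementation), the sketch below computes it from the binary confusion matrix in plain Python. The helper name `matthews_corrcoef_binary` and the choice to return 0.0 when the denominator is zero are assumptions made for this example.

```python
import math


def matthews_corrcoef_binary(predictions, references):
    """Matthews correlation coefficient for 0/1 labels (illustrative helper)."""
    pairs = list(zip(predictions, references))
    tp = sum(1 for p, r in pairs if p == 1 and r == 1)  # true positives
    tn = sum(1 for p, r in pairs if p == 0 and r == 0)  # true negatives
    fp = sum(1 for p, r in pairs if p == 1 and r == 0)  # false positives
    fn = sum(1 for p, r in pairs if p == 0 and r == 1)  # false negatives
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Assumption: treat the undefined case (e.g. constant predictions) as 0.0
        return 0.0
    return (tp * tn - fp * fn) / denom


perfect = matthews_corrcoef_binary([1, 0, 1, 0], [1, 0, 1, 0])
inverted = matthews_corrcoef_binary([0, 1, 0, 1], [1, 0, 1, 0])
constant = matthews_corrcoef_binary([1, 1, 1, 1], [1, 0, 1, 0])
print(perfect, inverted, constant)
```

A predictor that outputs the same class for every example has a zero denominator, so it scores 0.0 here. That may be what lies behind the 0.0 values reported for some trials in the search log of section 5.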

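## 3 Data Preprocessing

### 3.1 Preprocessing Pipeline
- Tool: Tokenizer
- Steps:
  1. Tokenize the input text to obtain tokens.
  2. Convert the tokens to the corresponding token IDs in the pretrained model's vocabulary.
  3. Arrange the token IDs into the input format the model expects.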
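
The three steps above can be sketched with a toy whitespace tokenizer. The vocabulary, the `encode` helper, and the fixed padding length are all hypothetical. A real tokenizer, such as the WordPiece tokenizer used by DistilBERT, also splits words into subword pieces.

```python
# Hypothetical toy vocabulary; real checkpoints ship their own vocabulary file.
VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "hello": 4, "world": 5}


def encode(text, max_length=6):
    # Step 1: tokenize (here simply lowercase and split on whitespace).
    tokens = text.lower().split()
    # Step 2: map tokens to IDs, falling back to [UNK] for unknown words.
    ids = [VOCAB.get(tok, VOCAB["[UNK]"]) for tok in tokens]
    # Step 3: build the model input: special tokens, truncation, padding, mask.
    ids = [VOCAB["[CLS]"]] + ids[: max_length - 2] + [VOCAB["[SEP]"]]
    attention_mask = [1] * len(ids)
    pad = max_length - len(ids)
    return {
        "input_ids": ids + [VOCAB["[PAD]"]] * pad,
        "attention_mask": attention_mask + [0] * pad,
    }


encoded = encode("Hello brave world")
print(encoded)
```

The word "brave" is not in the toy vocabulary, so it maps to `[UNK]`, and the last position is padding with an attention mask of 0.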

### 3.2 Building the Tokenizer for the Model

In [11]:
from transformers import AutoTokenizer
    
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

In [12]:
tokenizer("Hello, this one sentence!", "And this sentence goes with it.")

{'input_ids': [101, 7592, 1010, 2023, 2028, 6251, 999, 102, 1998, 2023, 6251, 3632, 2007, 2009, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
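
In the output above, ID 101 is the `[CLS]` token and 102 is the `[SEP]` token, so a sentence pair is encoded as `[CLS] A [SEP] B [SEP]`. The sketch below splits the IDs from that output back into the two sentences; `split_segments` is a hypothetical helper written only for this illustration.

```python
CLS_ID, SEP_ID = 101, 102

# input_ids copied from the tokenizer output above
input_ids = [101, 7592, 1010, 2023, 2028, 6251, 999, 102,
             1998, 2023, 6251, 3632, 2007, 2009, 1012, 102]


def split_segments(ids):
    """Split a [CLS] A [SEP] B [SEP] sequence into its segments."""
    segments, current = [], []
    for token_id in ids:
        if token_id == CLS_ID:
            continue  # skip the leading classification token
        if token_id == SEP_ID:
            segments.append(current)  # a separator closes the current segment
            current = []
        else:
            current.append(token_id)
    return segments


first, second = split_segments(input_ids)
print(len(first), len(second))
```

Note that the output contains only `input_ids` and `attention_mask`, so the two sentences are told apart here by the separator token alone.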

### 3.3 Preprocessing Every Example in the Dataset

In [13]:
# Map each task to the name(s) of its text column(s); used to check the data format
task_to_keys = {
    "cola": ("sentence", None),
    "mnli": ("premise", "hypothesis"),
    "mnli-mm": ("premise", "hypothesis"),
    "mrpc": ("sentence1", "sentence2"),
    "qnli": ("question", "sentence"),
    "qqp": ("question1", "question2"),
    "rte": ("sentence1", "sentence2"),
    "sst2": ("sentence", None),
    "stsb": ("sentence1", "sentence2"),
    "wnli": ("sentence1", "sentence2"),
}

In [14]:
# Check the format of the first training example
sentence1_key, sentence2_key = task_to_keys[task]
if sentence2_key is None:
    print(f"Sentence: {dataset['train'][0][sentence1_key]}")
else:
    print(f"Sentence 1: {dataset['train'][0][sentence1_key]}")
    print(f"Sentence 2: {dataset['train'][0][sentence2_key]}")

Sentence: Our friends won't buy this analysis, let alone the next one we propose.


In [15]:
# Define the preprocessing function
def preprocess_function(examples):
    if sentence2_key is None:
        return tokenizer(examples[sentence1_key], truncation=True)
    return tokenizer(examples[sentence1_key], examples[sentence2_key], truncation=True)

In [16]:
# Preprocess all splits of the dataset
encoded_dataset = dataset.map(preprocess_function, batched=True)

Loading cached processed dataset at C:\Users\hurui\.cache\huggingface\datasets\glue\cola\1.0.0\dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad\cache-fd5eee62c2b8c26e.arrow
Loading cached processed dataset at C:\Users\hurui\.cache\huggingface\datasets\glue\cola\1.0.0\dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad\cache-0ce499346cf9c20b.arrow
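
`preprocess_function` truncates but does not pad, so the encoded examples have different lengths. Padding is typically applied per batch, up to the longest sequence in that batch; when a tokenizer is passed to `Trainer`, it is used for this step. A minimal plain-Python sketch of the idea follows, where `pad_batch` is a hypothetical helper and the pad ID of 0 is an assumption.

```python
def pad_batch(features, pad_id=0):
    """Pad every example to the longest sequence in the batch (illustrative)."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": []}
    for f in features:
        pad = max_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_id] * pad)
        # Padded positions get mask 0 so the model ignores them
        batch["attention_mask"].append(f["attention_mask"] + [0] * pad)
    return batch


features = [
    {"input_ids": [101, 7592, 102], "attention_mask": [1, 1, 1]},
    {"input_ids": [101, 7592, 1010, 2023, 102], "attention_mask": [1, 1, 1, 1, 1]},
]
batch = pad_batch(features)
print(batch)
```

Padding per batch rather than to a global maximum length keeps short batches small, which usually saves computation.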






## 4 Fine-tuning the Pretrained Model

### 4.1 Loading the Classification Model

In [17]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

num_labels = 3 if task.startswith("mnli") else 1 if task == "stsb" else 2
model = AutoModelForSequenceClassification.from_pretrained(
    model_checkpoint, num_labels=num_labels)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.weight', 'vocab_transform.weight', 'vocab_projector.bias', 'vocab_projector.weight', 'vocab_transform.bias', 'vocab_layer_norm.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'classifier.weight', 'classifier

### 4.2 Setting the Training Arguments

In [18]:
metric_name = "pearson" if task == "stsb" else "matthews_correlation" if task == "cola" else "accuracy"

args = TrainingArguments(
    "test-glue",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=5,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
    log_level='error',
    logging_strategy="no",
    report_to="none"
)

In [19]:
# Turn model outputs into predictions and score them with the task's metric
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    if task != "stsb":
        predictions = np.argmax(predictions, axis=1)
    else:
        predictions = predictions[:, 0]
    return metric.compute(predictions=predictions, references=labels)
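
`compute_metrics` receives raw model outputs. For the classification tasks each row holds one score per class and the predicted label is the index of the largest score; for STS-B (regression) the single output column is used directly. The following plain-Python sketch mirrors that logic with made-up logits; `logits_to_predictions` is a hypothetical helper.

```python
def logits_to_predictions(logits, is_regression=False):
    if is_regression:
        # Regression: take the single output value of each row
        return [row[0] for row in logits]
    # Classification: index of the highest score in each row (argmax)
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


fake_logits = [[0.2, 1.5], [2.0, -1.0], [-0.3, 0.1]]
class_preds = logits_to_predictions(fake_logits)
score_preds = logits_to_predictions([[3.7], [0.4]], is_regression=True)
print(class_preds, score_preds)
```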

In [20]:
# Build the Trainer
validation_key = "validation_mismatched" if task == "mnli-mm" else "validation_matched" if task == "mnli" else "validation"
trainer = Trainer(
    model,
    args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset[validation_key],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

### 4.3 Training the Model

In [21]:
trainer.train()

Epoch,Training Loss,Validation Loss,Matthews Correlation
1,No log,0.493609,0.419078
2,No log,0.536483,0.466428
3,No log,0.628672,0.47826
4,No log,0.739923,0.512745
5,No log,0.862426,0.519563


TrainOutput(global_step=2675, training_loss=0.2717150308484229, metrics={'train_runtime': 100.5668, 'train_samples_per_second': 425.14, 'train_steps_per_second': 26.599, 'total_flos': 229537542078168.0, 'train_loss': 0.2717150308484229, 'epoch': 5.0})
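
The `global_step` value in the output above can be checked by hand: each epoch takes ceil(num_examples / batch_size) optimizer steps. The sketch assumes a single device and no gradient accumulation, which is an assumption about this run; `total_steps` is a hypothetical helper.

```python
import math


def total_steps(num_examples, batch_size, num_epochs):
    """Optimizer steps for a full run (single device, no gradient accumulation)."""
    return math.ceil(num_examples / batch_size) * num_epochs


# 8551 training examples (see section 2.2), 5 epochs
steps_bs16 = total_steps(8551, 16, 5)
steps_bs8 = total_steps(8551, 8, 5)
print(steps_bs16, steps_bs8)
```

With batch size 16 this gives 2675, matching the output above. With batch size 8 it gives 5345, the `global_step` reported for the final run in section 5.3.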

### 4.4 Evaluating the Model

In [22]:
trainer.evaluate()

{'eval_loss': 0.8624260425567627,
 'eval_matthews_correlation': 0.519563286537562,
 'eval_runtime': 0.6501,
 'eval_samples_per_second': 1604.31,
 'eval_steps_per_second': 101.519,
 'epoch': 5.0}

## 5 Hyperparameter Search

### 5.1 Defining the Model Initializer

In [23]:
def model_init():
    # Return a freshly initialized model; the Trainer calls this at the start of each run
    return AutoModelForSequenceClassification.from_pretrained(
        model_checkpoint, num_labels=num_labels)

In [24]:
trainer = Trainer(
    model_init=model_init,
    args=args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset[validation_key],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

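### 5.2 Running the Search

In [25]:
# Run 10 search trials and maximize the metric.
# Note: the Trainer defined above searches on the full training set, not on 1/10 of it.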
best_run = trainer.hyperparameter_search(n_trials=10, direction="maximize")

[I 2021-08-24 19:12:22,473] A new study created in memory with name: no-name-0184509a-44cb-49e4-a5f0-830e75becd06


Epoch,Training Loss,Validation Loss,Matthews Correlation
1,No log,0.467975,0.459079
2,No log,0.498678,0.521112


[I 2021-08-24 19:12:45,880] Trial 0 finished with value: 0.5211120728046958 and parameters: {'learning_rate': 7.172454795940363e-05, 'num_train_epochs': 2, 'seed': 1, 'per_device_train_batch_size': 64}. Best is trial 0 with value: 0.5211120728046958.


Epoch,Training Loss,Validation Loss,Matthews Correlation
1,No log,0.4835,0.429662
2,No log,0.478505,0.498233
3,No log,0.518115,0.514713


[I 2021-08-24 19:13:19,993] Trial 1 finished with value: 0.514713087977408 and parameters: {'learning_rate': 2.7656377054145692e-05, 'num_train_epochs': 3, 'seed': 29, 'per_device_train_batch_size': 64}. Best is trial 0 with value: 0.5211120728046958.


Epoch,Training Loss,Validation Loss,Matthews Correlation
1,No log,0.560201,0.379474


[I 2021-08-24 19:14:32,984] Trial 2 finished with value: 0.3794738324574525 and parameters: {'learning_rate': 4.279050781378118e-06, 'num_train_epochs': 1, 'seed': 19, 'per_device_train_batch_size': 4}. Best is trial 0 with value: 0.5211120728046958.


Epoch,Training Loss,Validation Loss,Matthews Correlation
1,No log,0.481317,0.453374
2,No log,0.678357,0.463343
3,No log,0.920804,0.515403
4,No log,0.978926,0.554827
5,No log,1.189169,0.550403


[I 2021-08-24 19:17:35,982] Trial 3 finished with value: 0.5504031254980248 and parameters: {'learning_rate': 4.301257551502102e-05, 'num_train_epochs': 5, 'seed': 20, 'per_device_train_batch_size': 8}. Best is trial 3 with value: 0.5504031254980248.


Epoch,Training Loss,Validation Loss,Matthews Correlation
1,No log,0.612682,0.0
2,No log,0.601103,0.0
3,No log,0.587219,0.0
4,No log,0.582568,0.0


  mcc = cov_ytyp / np.sqrt(cov_ytyt * cov_ypyp)
  mcc = cov_ytyp / np.sqrt(cov_ytyt * cov_ypyp)
  mcc = cov_ytyp / np.sqrt(cov_ytyt * cov_ypyp)
  mcc = cov_ytyp / np.sqrt(cov_ytyt * cov_ypyp)
[I 2021-08-24 19:18:20,631] Trial 4 finished with value: 0.0 and parameters: {'learning_rate': 1.514308427468381e-06, 'num_train_epochs': 4, 'seed': 8, 'per_device_train_batch_size': 64}. Best is trial 3 with value: 0.5504031254980248.


Epoch,Training Loss,Validation Loss,Matthews Correlation
1,No log,0.554493,0.0


  mcc = cov_ytyp / np.sqrt(cov_ytyt * cov_ypyp)
[I 2021-08-24 19:18:58,287] Trial 5 pruned.


Epoch,Training Loss,Validation Loss,Matthews Correlation
1,No log,0.566486,0.285888


[I 2021-08-24 19:20:09,803] Trial 6 pruned.


Epoch,Training Loss,Validation Loss,Matthews Correlation
1,No log,0.621905,0.0


  mcc = cov_ytyp / np.sqrt(cov_ytyt * cov_ypyp)
[I 2021-08-24 19:20:47,404] Trial 7 pruned.


Epoch,Training Loss,Validation Loss,Matthews Correlation
1,No log,0.613638,0.0


  mcc = cov_ytyp / np.sqrt(cov_ytyt * cov_ypyp)
[I 2021-08-24 19:20:59,269] Trial 8 pruned.


Epoch,Training Loss,Validation Loss,Matthews Correlation
1,No log,0.52209,0.403548


[I 2021-08-24 19:21:36,476] Trial 9 pruned.


In [26]:
# The best run and its hyperparameters
best_run

BestRun(run_id='3', objective=0.5504031254980248, hyperparameters={'learning_rate': 4.301257551502102e-05, 'num_train_epochs': 5, 'seed': 20, 'per_device_train_batch_size': 8})
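
The search above relies on the default search space. `hyperparameter_search` also accepts an `hp_space` callable, which with the Optuna backend receives an Optuna trial object. The sketch below defines one that samples the same four hyperparameters that appear in the log. The ranges are assumptions chosen for illustration, and `StubTrial` is a hypothetical stand-in so that the snippet runs without Optuna installed.

```python
import math
import random


def my_hp_space(trial):
    """Search space over the four hyperparameters seen in the log (ranges assumed)."""
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-6, 1e-4, log=True),
        "num_train_epochs": trial.suggest_int("num_train_epochs", 1, 5),
        "seed": trial.suggest_int("seed", 1, 40),
        "per_device_train_batch_size": trial.suggest_categorical(
            "per_device_train_batch_size", [4, 8, 16, 32, 64]),
    }


class StubTrial:
    """Hypothetical stand-in for the three Optuna trial methods used above."""

    def __init__(self, seed=0):
        self._rng = random.Random(seed)

    def suggest_float(self, name, low, high, log=False):
        if log:
            # Sample uniformly in log space
            return math.exp(self._rng.uniform(math.log(low), math.log(high)))
        return self._rng.uniform(low, high)

    def suggest_int(self, name, low, high):
        return self._rng.randint(low, high)

    def suggest_categorical(self, name, choices):
        return self._rng.choice(choices)


sampled = my_hp_space(StubTrial(seed=0))
print(sampled)
```

With Optuna installed, the function would be passed as `trainer.hyperparameter_search(hp_space=my_hp_space, n_trials=10, direction="maximize")`. Sampling the learning rate on a log scale spreads trials evenly across orders of magnitude, which is a common choice for this parameter.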

### 5.3 Training with the Best Hyperparameters

In [27]:
for n, v in best_run.hyperparameters.items():
    setattr(trainer.args, n, v)

trainer.train()

Epoch,Training Loss,Validation Loss,Matthews Correlation
1,No log,0.481317,0.453374
2,No log,0.678357,0.463343
3,No log,0.920804,0.515403
4,No log,0.978926,0.554827
5,No log,1.189169,0.550403


TrainOutput(global_step=5345, training_loss=0.26719996967083726, metrics={'train_runtime': 178.4912, 'train_samples_per_second': 239.536, 'train_steps_per_second': 29.945, 'total_flos': 413547436355364.0, 'train_loss': 0.26719996967083726, 'epoch': 5.0})

In [28]:
trainer.evaluate()

{'eval_loss': 0.9789257049560547,
 'eval_matthews_correlation': 0.5548273578107759,
 'eval_runtime': 0.6556,
 'eval_samples_per_second': 1590.796,
 'eval_steps_per_second': 100.664,
 'epoch': 5.0}
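
The evaluation result above matches epoch 4 of the training table rather than the last epoch. A small sketch of the selection rule, using the Matthews correlation values copied from that table (the variable names are illustrative):

```python
# (epoch, Matthews correlation) pairs copied from the training table above
history = [(1, 0.453374), (2, 0.463343), (3, 0.515403),
           (4, 0.554827), (5, 0.550403)]

# Pick the epoch with the highest validation metric
best_epoch, best_score = max(history, key=lambda item: item[1])
print(best_epoch, best_score)
```

The best epoch is 4, with a score of about 0.5548 against 0.5504 for epoch 5. This is consistent with `load_best_model_at_end=True` reloading the best checkpoint before evaluation, although the sketch only illustrates the selection rule and not the Trainer's internals.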

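## 6 Summary

&emsp;&emsp;This task showed how to solve a text classification task with a BERT-family model (DistilBERT). The workflow has four stages: loading the data, preprocessing it, fine-tuning the pretrained model, and searching for hyperparameters. When loading the data, the evaluation metric must match the classification task. During preprocessing, we build the tokenizer that matches the model checkpoint and apply it to every example in the dataset. During fine-tuning, we set the training arguments, build a `Trainer`, and then train and evaluate the model. Finally, in the hyperparameter search stage, we use the `hyperparameter_search` method to find the best-performing hyperparameters, then retrain and evaluate the model with them.  
&emsp;&emsp;Note that downloading the dataset may require a proxy if the download servers are not directly reachable. If you install the `ray[tune]` package with conda, also install the corresponding `ray-tune` dependency package.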