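In the fast-moving field of natural language processing (NLP), we often compare language models to see which one best fits a given task. This post compares three models: RoBERTa, Mistral-7b, and Llama-2-7b. We use them to tackle a common problem: classifying tweets about disasters. Note that Mistral and Llama 2 are large models with 7 billion parameters each. By contrast, RoBERTa-large (355M parameters) is a relatively small model that serves as the baseline for the comparison.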

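In this post, we use a PEFT (Parameter-Efficient Fine-Tuning) technique, LoRA (Low-Rank Adaptation of Large Language Models), to fine-tune the pretrained models on a sequence classification task. LoRA is designed to significantly reduce the number of trainable parameters while maintaining strong downstream task performance.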
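
To make the idea concrete, here is a minimal sketch of a LoRA-style forward pass in plain Python. It is an illustration under stated assumptions, not how the PEFT library is implemented: the helper names `matvec` and `lora_forward` are hypothetical, the dimensions are toy values, and the initialization (small Gaussian `A`, zero `B`) follows the scheme described in the LoRA paper.

```python
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha, r):
    """Frozen base output W @ x plus the scaled low-rank update (alpha / r) * B @ (A @ x)."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

# Toy dimensions (assumption, chosen only for illustration)
d_out, d_in, r, alpha = 4, 6, 2, 16
rng = random.Random(0)
W = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen pretrained weight
A = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]   # trainable, r x d_in
B = [[0.0] * r for _ in range(d_out)]                               # trainable, d_out x r, starts at zero
x = [rng.gauss(0, 1) for _ in range(d_in)]

full_params = d_out * d_in        # parameters updated by full fine-tuning of W
lora_params = r * (d_in + d_out)  # parameters updated by LoRA (A and B only)
print(full_params, lora_params)
```

Because `B` starts at zero, the adapted layer initially reproduces the pretrained layer exactly. The savings are negligible at these toy sizes, but they grow quickly with the matrix dimensions: for a 1024 x 1024 projection with `r=2`, LoRA trains 4,096 parameters instead of 1,048,576.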

RoBERTa has a maximum sequence length of 512 tokens:

In [1]:
MAX_LEN = 512 
roberta_checkpoint = "roberta-large"

We load the dataset from Hugging Face:

In [2]:
from datasets import load_dataset
dataset = load_dataset("wangrongsheng/twitter_disaster")


Now, let's split the dataset into training and validation sets, then add the test set:

In [3]:
from datasets import Dataset
# Split the dataset into training and validation datasets
data = dataset['train'].train_test_split(train_size=0.8, seed=42)
# Rename the default "test" split to "validation"
data['val'] = data.pop("test")
# Add the original test split alongside the new train/val splits
data['test'] = dataset['test']

Let's inspect the columns and non-null counts of the training and test sets:

In [4]:
import pandas as pd

data['train'].to_pandas().info()
data['test'].to_pandas().info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6090 entries, 0 to 6089
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        6090 non-null   int64 
 1   keyword   6037 non-null   object
 2   location  4064 non-null   object
 3   text      6090 non-null   object
 4   target    6090 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 238.0+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3263 entries, 0 to 3262
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        3263 non-null   int64 
 1   keyword   3237 non-null   object
 2   location  2158 non-null   object
 3   text      3263 non-null   object
 4   target    3263 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 127.6+ KB


Because the classes are imbalanced, we compute positive and negative class weights and use them later in the loss calculation:

In [5]:
train_df = data['train'].to_pandas()
class_counts = train_df.target.value_counts()
pos_weights = len(train_df) / (2 * class_counts[1])
neg_weights = len(train_df) / (2 * class_counts[0])

In [6]:
pos_weights, neg_weights

(1.1622137404580153, 0.877521613832853)
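
The weighting used here is the common "balanced" heuristic, w_c = N / (K * n_c), where N is the number of samples, K the number of classes, and n_c the number of samples in class c. The sketch below shows the same formula in plain Python. The helper name `balanced_class_weights` is hypothetical and the counts are toy values, not the real dataset.

```python
def balanced_class_weights(class_counts):
    """Return {label: N / (K * n_label)} so that rarer classes get larger weights."""
    total = sum(class_counts.values())
    num_classes = len(class_counts)
    return {label: total / (num_classes * count) for label, count in class_counts.items()}

# Toy counts (assumption, not the real dataset): 3 negatives, 1 positive
toy_counts = {0: 3, 1: 1}
weights = balanced_class_weights(toy_counts)
print(weights)
```

With this weighting, every class contributes the same total weight (N / K) to the loss, which is why the minority class ends up with a weight above 1 and the majority class with a weight below 1.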

Next, we compute the maximum length of the text column, in characters and in words:

In [7]:
# Number of Characters
max_char = data['train'].to_pandas()['text'].str.len().max()
# Number of Words
max_words = data['train'].to_pandas()['text'].str.split().str.len().max()

In [8]:
max_char, max_words

(152, 31)

Let's look at one example row from the training data:

In [9]:
data['train'][0]

{'id': 5285,
 'keyword': 'fear',
 'location': 'Thibodaux, LA',
 'text': 'my worst fear. https://t.co/iH8UDz8mq3',
 'target': 0}

The data includes a keyword, a location, and the tweet text. For simplicity, we use only the text feature as input to the LLM.

Prepare the RoBERTa data pipeline, starting with the tokenizer:

In [10]:
from transformers import AutoTokenizer
roberta_tokenizer = AutoTokenizer.from_pretrained(roberta_checkpoint, add_prefix_space=True)


Define a preprocessing function that tokenizes a row of the dataset:

In [11]:
def roberta_preprocessing_function(examples):
    return roberta_tokenizer(examples['text'], truncation=True, max_length=MAX_LEN)

Applying the preprocessing function to the first example of the training dataset gives the tokenized input (input_ids) and the attention mask:

In [12]:
roberta_preprocessing_function(data['train'][0])

{'input_ids': [0, 127, 2373, 2490, 4, 1205, 640, 90, 4, 876, 73, 118, 725, 398, 13083, 329, 398, 119, 1343, 246, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

Now, let's apply the preprocessing function to the whole dataset:

In [13]:
col_to_delete = ['id', 'keyword','location', 'text']
# Apply the preprocessing function and remove the undesired columns
roberta_tokenized_datasets = data.map(roberta_preprocessing_function, batched=True, remove_columns=col_to_delete)
# Rename "target" to "label", the column name Hugging Face models expect
roberta_tokenized_datasets = roberta_tokenized_datasets.rename_column("target", "label")
# Set to torch format
roberta_tokenized_datasets.set_format("torch")


We can take a look at the tokenized training dataset:

In [14]:
roberta_tokenized_datasets['train'][0]

{'label': tensor(0),
 'input_ids': tensor([    0,   127,  2373,  2490,     4,  1205,   640,    90,     4,   876,
            73,   118,   725,   398, 13083,   329,   398,   119,  1343,   246,
             2]),
 'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])}

To build training batches, we also need to pad the rows of each batch to the longest sequence in that batch. For this we use the DataCollatorWithPadding class:

In [15]:
# Data collator for padding a batch of examples to the maximum length seen in the batch
from transformers import DataCollatorWithPadding

roberta_data_collator = DataCollatorWithPadding(tokenizer=roberta_tokenizer)
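
As a rough illustration of dynamic padding, the sketch below pads a batch to its longest sequence in plain Python. The helper `pad_batch` is hypothetical, and the default `pad_token_id=1` assumes RoBERTa's `<pad>` token id. The real collator also converts the batch to tensors and carries the labels along, which this sketch leaves out.

```python
def pad_batch(examples, pad_token_id=1):
    """Right-pad every example to the longest sequence in this batch."""
    max_len = max(len(ex["input_ids"]) for ex in examples)
    batch = {"input_ids": [], "attention_mask": []}
    for ex in examples:
        n_pad = max_len - len(ex["input_ids"])
        # Padding positions get the pad token id and an attention mask of 0
        batch["input_ids"].append(ex["input_ids"] + [pad_token_id] * n_pad)
        batch["attention_mask"].append(ex["attention_mask"] + [0] * n_pad)
    return batch

batch = pad_batch([
    {"input_ids": [0, 127, 2373, 2], "attention_mask": [1, 1, 1, 1]},
    {"input_ids": [0, 2490, 2], "attention_mask": [1, 1, 1]},
])
print(batch["input_ids"])       # [[0, 127, 2373, 2], [0, 2490, 2, 1]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

Padding per batch rather than to MAX_LEN matters here: the tweets are short, so padding every row to 512 tokens would waste most of the computation.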


We load the pretrained RoBERTa model with a sequence classification head using the Hugging Face AutoModelForSequenceClassification class:

In [16]:
from transformers import AutoModelForSequenceClassification

roberta_model = AutoModelForSequenceClassification.from_pretrained(roberta_checkpoint, num_labels=2)

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-large and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [19]:
!pip install peft -q


We import the LoRA configuration and set a few parameters for the RoBERTa classifier:

In [20]:
from peft import get_peft_model, LoraConfig, TaskType

roberta_peft_config = LoraConfig(
    task_type=TaskType.SEQ_CLS, r=2, lora_alpha=16, lora_dropout=0.1, bias="none",
)

roberta_model = get_peft_model(roberta_model, roberta_peft_config)
roberta_model.print_trainable_parameters()

trainable params: 1,248,258 || all params: 356,610,052 || trainable%: 0.3500
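
The trainable parameter count reported above can be reproduced with a short back-of-the-envelope calculation. This sketch rests on two assumptions that are not stated in the post: that LoRA is applied to the query and value projections of each layer, and that the classification head stays fully trainable for the SEQ_CLS task type. The fact that the total matches the printed figure suggests these assumptions hold for this run.

```python
hidden_size = 1024     # roberta-large hidden dimension
num_layers = 24        # roberta-large encoder layers
r = 2                  # LoRA rank from the config above
num_labels = 2
adapted_per_layer = 2  # assumption: LoRA on the query and value projections

# Each adapted (hidden x hidden) projection gains A (r x hidden) and B (hidden x r)
lora_params = num_layers * adapted_per_layer * r * (hidden_size + hidden_size)

# Assumption: the classification head (dense + out_proj, with biases) is fully trainable
head_params = (hidden_size * hidden_size + hidden_size) + (hidden_size * num_labels + num_labels)

total_trainable = lora_params + head_params
print(lora_params, head_params, total_trainable)  # 196608 1051650 1248258
```

Under these assumptions, most of the trainable parameters sit in the newly initialized classification head, and the LoRA adapters themselves account for fewer than 200k parameters.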


In [23]:
!pip install evaluate -q


We define the performance metrics used to compare the three models: F1 score, recall, precision, and accuracy:

In [24]:
import evaluate
import numpy as np

def compute_metrics(eval_pred):
    # All metrics are already predefined in the HF `evaluate` package
    precision_metric = evaluate.load("precision")
    recall_metric = evaluate.load("recall")
    f1_metric= evaluate.load("f1")
    accuracy_metric = evaluate.load("accuracy")

    logits, labels = eval_pred # eval_pred is the tuple of predictions and labels returned by the model
    predictions = np.argmax(logits, axis=-1)
    precision = precision_metric.compute(predictions=predictions, references=labels)["precision"]
    recall = recall_metric.compute(predictions=predictions, references=labels)["recall"]
    f1 = f1_metric.compute(predictions=predictions, references=labels)["f1"]
    accuracy = accuracy_metric.compute(predictions=predictions, references=labels)["accuracy"]
    # The trainer is expecting a dictionary where the keys are the metrics names and the values are the scores. 
    return {"precision": precision, "recall": recall, "f1-score": f1, 'accuracy': accuracy}
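
For reference, the four metrics reduce to simple ratios of confusion-matrix counts. The sketch below computes them by hand for a binary task. The helper `binary_metrics` is hypothetical and the predictions are toy values; the `evaluate` package remains the source of the numbers reported in this post.

```python
def binary_metrics(predictions, labels):
    """Precision, recall, F1 and accuracy for binary labels, with class 1 as positive."""
    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, y in pairs if p == 1 and y == 1)
    fp = sum(1 for p, y in pairs if p == 1 and y == 0)
    fn = sum(1 for p, y in pairs if p == 0 and y == 1)
    tn = sum(1 for p, y in pairs if p == 0 and y == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs)
    return {"precision": precision, "recall": recall, "f1-score": f1, "accuracy": accuracy}

# Toy predictions and labels (assumption, for illustration only)
metrics = binary_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
print(metrics)
```

On an imbalanced dataset, accuracy alone can look good for a model that mostly predicts the majority class, which is why precision, recall, and F1 are tracked alongside it.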

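As noted earlier, the positive and negative classes are imbalanced. To address this, we train the model with a weighted cross-entropy loss. The Trainer class does not take a custom loss directly, because it expects to get the loss from the model's outputs. We therefore define a custom WeightedCELossTrainer that overrides the compute_loss method to compute the weighted cross-entropy loss from the model's predictions and the input labels: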

In [25]:
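import torch
from transformers import Trainer

class WeightedCELossTrainer(Trainer):
    # **kwargs absorbs extra arguments (e.g. num_items_in_batch) that newer transformers releases may pass
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):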
        labels = inputs.pop("labels")
        # Get model's predictions
        outputs = model(**inputs)
        logits = outputs.get("logits")
        # Compute custom loss
        loss_fct = torch.nn.CrossEntropyLoss(weight=torch.tensor([neg_weights, pos_weights], device=model.device, dtype=logits.dtype))
        loss = loss_fct(logits.view(-1, self.model.config.num_labels), labels.view(-1))
        return (loss, outputs) if return_outputs else loss
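
To show what the weighted loss computes, here is a minimal plain-Python sketch of weighted cross-entropy with mean reduction, where the weighted per-example losses are divided by the sum of the applied weights. The helper `weighted_cross_entropy` is hypothetical, the logits and labels are toy values, and the class weights are rounded from the values computed earlier.

```python
import math

def weighted_cross_entropy(logits, labels, class_weights):
    """Weighted mean of per-example cross-entropy, normalized by the sum of applied weights."""
    weighted_sum, weight_total = 0.0, 0.0
    for row, label in zip(logits, labels):
        # Numerically stable log-sum-exp, then negative log-probability of the true class
        max_logit = max(row)
        log_norm = max_logit + math.log(sum(math.exp(z - max_logit) for z in row))
        nll = log_norm - row[label]
        weighted_sum += class_weights[label] * nll
        weight_total += class_weights[label]
    return weighted_sum / weight_total

class_weights = [0.8775, 1.1622]  # [negative, positive], rounded from the weights computed earlier
logits = [[2.0, -1.0], [0.5, 0.5], [-1.0, 3.0]]  # toy values (assumption)
labels = [0, 1, 1]
loss = weighted_cross_entropy(logits, labels, class_weights)
print(round(loss, 4))
```

The effect is that a mistake on a positive (minority) example moves the loss more than the same mistake on a negative example, which pushes the model away from defaulting to the majority class.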

Move the model to the GPU for training:

In [27]:
roberta_model = roberta_model.cuda()
#roberta_model.device()

Set the training arguments:

In [28]:
from transformers import TrainingArguments

lr = 1e-4
batch_size = 8
num_epochs = 5

training_args = TrainingArguments(
    output_dir="roberta-large-lora-token-classification",
    learning_rate=lr,
    lr_scheduler_type= "constant",
    warmup_ratio= 0.1,  # note: has no effect with the "constant" scheduler; "constant_with_warmup" would apply it
    max_grad_norm= 0.3,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_epochs,
    weight_decay=0.001,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    report_to="wandb",
    fp16=False,
    gradient_checkpointing=True,
)



Finally, we define the RoBERTa trainer by passing in the model, the training arguments, and the tokenized datasets:

In [29]:
roberta_trainer = WeightedCELossTrainer(
    model=roberta_model,
    args=training_args,
    train_dataset=roberta_tokenized_datasets['train'],
    eval_dataset=roberta_tokenized_datasets["val"],
    data_collator=roberta_data_collator,
    compute_metrics=compute_metrics
)

Start training:

In [32]:
import torch

roberta_trainer.train()

Epoch,Training Loss,Validation Loss,Precision,Recall,F1-score,Accuracy
1,0.6561,0.557806,0.664411,0.754224,0.706475,0.732108
2,0.5728,0.595131,0.538108,0.900154,0.673563,0.627052
3,0.5577,0.498145,0.69146,0.771121,0.729121,0.755089
4,0.5531,0.479321,0.820976,0.697389,0.754153,0.805647
5,0.54,0.475383,0.848077,0.677419,0.753202,0.810243



TrainOutput(global_step=3810, training_loss=0.5647373109351932, metrics={'train_runtime': 363.5465, 'train_samples_per_second': 83.758, 'train_steps_per_second': 10.48, 'total_flos': 2795794693370352.0, 'train_loss': 0.5647373109351932, 'epoch': 5.0})
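
The `global_step` value in the output above can be checked against the training setup with a quick calculation. This sketch assumes a single device, no gradient accumulation, and that the last partial batch of each epoch is kept.

```python
import math

train_examples = 6090  # size of the training split shown earlier
batch_size = 8
num_epochs = 5

# Assumptions: single device, no gradient accumulation, last partial batch kept
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)  # 762 3810
```

The result matches the reported `global_step=3810`, which is consistent with the run having used the batch size and epoch count set in the training arguments.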