# Fine-tuning a Pretrained Model for Text Classification

## Parameter and Variable Settings

In [14]:
model_checkpoint = "schen/longformer-chinese-base-4096"
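batch_size = 2 # number of examples per batch
num_labels = 2 # number of classes; this is binary classification (positive vs. negative)
output_dir = "/home/chenli/pre_model/20221108" # directory where the model is saved
learning_rate = 1e-5 # learning rate
num_train_epochs = 10 # number of training epochs; around 5 is usually enough. With too many epochs the training loss keeps falling but the fine-tuned model performs worse, probably because of overfitting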

## Parameter and Variable Settings
Running on my own dataset

In [4]:
model_checkpoint = "schen/longformer-chinese-base-4096"
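batch_size = 2 # number of examples per batch
num_labels = 2 # number of classes; this is binary classification (positive vs. negative)
output_dir = "/home/chenli/pre_model/20221109" # directory where the model is saved
learning_rate = 1e-5 # learning rate
num_train_epochs = 5 # number of training epochs; around 5 is usually enough. With too many epochs the training loss keeps falling but the fine-tuned model performs worse, probably because of overfitting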

## Parameter and Variable Settings
20221114 <br/>
Fine-tuning the model on my own dataset

In [1]:
model_checkpoint = "schen/longformer-chinese-base-4096"
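batch_size = 2 # number of examples per batch
num_labels = 2 # number of classes; this is binary classification (positive vs. negative)
output_dir = "/home/chenli/pre_model/20221114" # directory where the model is saved
learning_rate = 1e-5 # learning rate
weight_decay=0.01 # weight decay (a regularization penalty on the weights, not a learning-rate decay); 0.01 is a reasonable value. If it is set too small it has almost no effect.
num_train_epochs = 10 # number of training epochs; around 5 is usually enough. With too many epochs the training loss keeps falling but the fine-tuned model performs worse, probably because of overfitting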
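
To make the role of `weight_decay` concrete, here is a minimal sketch of a single update with decoupled weight decay, the scheme used by AdamW-style optimizers. It uses a plain gradient step instead of Adam's adaptive update so the arithmetic stays visible, which is a simplification; the helper name `sgd_step_with_decay` is hypothetical.

```python
def sgd_step_with_decay(weight, grad, lr, weight_decay):
    """One simplified update: a gradient step plus decoupled weight decay."""
    # Decay term: pulls the weight toward zero, scaled by lr * weight_decay.
    decayed = weight * (1.0 - lr * weight_decay)
    # Gradient term: the usual descent step.
    return decayed - lr * grad


# With a zero gradient, only the decay acts on the weight.
w = sgd_step_with_decay(weight=1.0, grad=0.0, lr=1e-5, weight_decay=0.01)
print(w)
```

The learning rate itself is untouched by this setting; weight decay only shrinks the weights a little on every step.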

## Loading the Data

In [2]:
from datasets import load_dataset
from datasets import load_from_disk
# Load an evaluation metric (the default one)
from datasets import load_metric

  from .autonotebook import tqdm as notebook_tqdm


In [3]:
# Load the dataset
dataset = load_from_disk('./data/ChnSentiCorp')
metric = load_metric("glue","mrpc")

  metric = load_metric("glue","mrpc")
Using the latest cached version of the module from /home/chenli/.cache/huggingface/modules/datasets_modules/metrics/glue/91f3cfc5498873918ecf119dbf806fb10815786c84f41b85a5d3c47c1519b343 (last modified on Mon Nov  7 19:47:24 2022) since it couldn't be found locally at glue, or remotely on the Hugging Face Hub.


In [4]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 9600
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 1200
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1200
    })
})

In [5]:
metric

Metric(name: "glue", features: {'predictions': Value(dtype='int64', id=None), 'references': Value(dtype='int64', id=None)}, usage: """
Compute GLUE evaluation metric associated to each GLUE dataset.
Args:
    predictions: list of predictions to score.
        Each translation should be tokenized into a list of tokens.
    references: list of lists of references for each translation.
        Each reference should be tokenized into a list of tokens.
Returns: depending on the GLUE subset, one or several of:
    "accuracy": Accuracy
    "f1": F1 score
    "pearson": Pearson Correlation
    "spearmanr": Spearman Correlation
    "matthews_correlation": Matthew Correlation
Examples:

    >>> glue_metric = datasets.load_metric('glue', 'sst2')  # 'sst2' or any of ["mnli", "mnli_mismatched", "mnli_matched", "qnli", "rte", "wnli", "hans"]
    >>> references = [0, 1]
    >>> predictions = [0, 1]
    >>> results = glue_metric.compute(predictions=predictions, references=references)
    >>> print(res

Calling the metric's compute method with the labels and predictions returns the metric values directly

In [6]:
import numpy as np

fake_preds = np.random.randint(0, 2, size=(64,))
fake_labels = np.random.randint(0, 2, size=(64,))
metric.compute(predictions=fake_preds, references=fake_labels)

{'accuracy': 0.53125, 'f1': 0.5}
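
For reference, the two numbers returned for the `mrpc` configuration are accuracy and binary F1. A minimal pure-Python sketch of both follows; it assumes label `1` is the positive class, and the helper name `accuracy_and_f1` is hypothetical rather than part of `datasets`.

```python
def accuracy_and_f1(predictions, references, positive=1):
    """Compute accuracy and binary F1 for two equal-length label lists."""
    pairs = list(zip(predictions, references))
    correct = sum(p == r for p, r in pairs)
    tp = sum(p == positive and r == positive for p, r in pairs)
    fp = sum(p == positive and r != positive for p, r in pairs)
    fn = sum(p != positive and r == positive for p, r in pairs)
    # F1 = 2*TP / (2*TP + FP + FN); reported as 0.0 when the denominator is zero.
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return {"accuracy": correct / len(pairs), "f1": f1}


scores = accuracy_and_f1(predictions=[1, 0, 1, 1], references=[1, 0, 0, 1])
print(scores)
```

Here three of four predictions are correct, and there are two true positives, one false positive, and no false negatives.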

## Loading the Data
Loading my own dataset

In [3]:
from datasets import load_dataset
from datasets import load_from_disk
# Load an evaluation metric (the default one)
from datasets import load_metric

train_dataset = load_dataset('csv',data_files='../data/MyDataset/data2/train_dataset.csv',split='train')
valid_dataset = load_dataset('csv',data_files='../data/MyDataset/data2/valid_dataset.csv',split='train')
test_dataset = load_dataset('csv',data_files='../data/MyDataset/data2/test_dataset.csv',split='train')

Using custom data configuration default-5602383f9cde0ea3
Reusing dataset csv (/home/chenli/.cache/huggingface/datasets/csv/default-5602383f9cde0ea3/0.0.0/652c3096f041ee27b04d2232d41f10547a8fecda3e284a79a0ec4053c916ef7a)
Using custom data configuration default-062c84d526dcea84
Reusing dataset csv (/home/chenli/.cache/huggingface/datasets/csv/default-062c84d526dcea84/0.0.0/652c3096f041ee27b04d2232d41f10547a8fecda3e284a79a0ec4053c916ef7a)
Using custom data configuration default-0f8395db45727ded
Reusing dataset csv (/home/chenli/.cache/huggingface/datasets/csv/default-0f8395db45727ded/0.0.0/652c3096f041ee27b04d2232d41f10547a8fecda3e284a79a0ec4053c916ef7a)


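## Data Preprocessing
Before feeding data to the model, we need to preprocess it. The preprocessing tool is the Tokenizer. It first tokenizes the input, then converts the tokens to the token IDs the pretrained model expects, and finally arranges them in the input format the model requires.

To do this, we instantiate our tokenizer with the AutoTokenizer.from_pretrained method, which ensures that:

- we get the tokenizer that matches the pretrained model;
- when we use the tokenizer for the given model checkpoint, the vocabulary the model needs (more precisely, the token vocabulary) is downloaded as well;
- the downloaded token vocabulary is cached, so it is not downloaded again on later use.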
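
The three steps (tokenize, map tokens to IDs, build the model input) can be illustrated with a toy sketch. The vocabulary, the character-level splitting, and the function `toy_encode` are assumptions made up for illustration; the real tokenizer loaded from the checkpoint uses its own vocabulary and rules.

```python
# Hypothetical miniature vocabulary; a real checkpoint ships a far larger one.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "很": 4, "好": 5, "吃": 6}


def toy_encode(text):
    """Split text into characters, map them to IDs, and wrap with special tokens."""
    tokens = ["[CLS]"] + list(text) + ["[SEP]"]
    # Characters missing from the vocabulary fall back to the [UNK] ID.
    input_ids = [TOY_VOCAB.get(tok, TOY_VOCAB["[UNK]"]) for tok in tokens]
    # attention_mask marks every real (non-padding) position with 1.
    return {"tokens": tokens, "input_ids": input_ids, "attention_mask": [1] * len(input_ids)}


encoded = toy_encode("很好吃!")
print(encoded["input_ids"])
```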

In [7]:
from transformers import AutoTokenizer
    
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

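Note: use_fast=True requires the tokenizer to be a transformers.PreTrainedTokenizerFast, because preprocessing relies on some features specific to fast tokenizers (such as fast multi-threaded tokenization). If the model has no fast tokenizer, simply drop this option.

Almost every model has a corresponding fast tokenizer.

The tokenizer can preprocess either a single text or a pair of texts, and its output matches the input format the pretrained model expects.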

In [8]:
# Tokenize
def preprocess_function(data):
    return tokenizer(data['text'],truncation=True)

Next, we preprocess every example in the dataset by using the map function to apply the preprocessing function preprocess_function to all of them.

In [9]:
encoded_dataset = dataset.map(function=preprocess_function,
                     batched=True,
                     remove_columns=['text'])

100%|██████████| 10/10 [00:00<00:00, 10.14ba/s]
100%|██████████| 2/2 [00:00<00:00, 10.18ba/s]
100%|██████████| 2/2 [00:00<00:00, 16.51ba/s]


In [10]:
encoded_dataset

DatasetDict({
    train: Dataset({
        features: ['label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 9600
    })
    validation: Dataset({
        features: ['label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1200
    })
    test: Dataset({
        features: ['label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1200
    })
})

## Data Preprocessing
Preprocessing my own data

In [4]:
from transformers import AutoTokenizer
    
# tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

Downloading:   0%|          | 0.00/839 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/131k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/112 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/88.0 [00:00<?, ?B/s]

In [5]:
# Tokenize
def preprocess_function(data):
    return tokenizer(data['text'],padding='max_length',max_length=1500,truncation=True)
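
With `padding='max_length'` and `truncation=True`, every example ends up with exactly `max_length` token IDs. The sketch below mimics that behaviour on a plain list of IDs; it is a simplification (a real tokenizer also keeps the special tokens in place when truncating), and `pad_or_truncate` is a hypothetical helper.

```python
def pad_or_truncate(input_ids, max_length, pad_id=0):
    """Return fixed-length input_ids plus the matching attention_mask."""
    kept = input_ids[:max_length]      # truncation: drop anything past max_length
    n_pad = max_length - len(kept)     # padding needed to reach max_length, if any
    return {
        "input_ids": kept + [pad_id] * n_pad,
        "attention_mask": [1] * len(kept) + [0] * n_pad,
    }


short = pad_or_truncate([2, 4, 5, 3], max_length=6)
long = pad_or_truncate(list(range(10)), max_length=6)
print(short)
print(long)
```

Padding every example to a fixed length keeps batches uniform, but each step then pays for the full length even when the text is short.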

In [6]:
encoded_train_dataset = train_dataset.map(function=preprocess_function,
                     batched=True,
                     remove_columns=['text'])
encoded_train_dataset = encoded_train_dataset.rename_column("label", "labels")
encoded_train_dataset

  0%|          | 0/3 [00:00<?, ?ba/s]

Dataset({
    features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 2755
})

In [7]:
encoded_valid_dataset = valid_dataset.map(function=preprocess_function,
                     batched=True,
                     remove_columns=['text'])
encoded_valid_dataset = encoded_valid_dataset.rename_column("label", "labels")
encoded_valid_dataset

  0%|          | 0/1 [00:00<?, ?ba/s]

Dataset({
    features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 344
})

In [8]:
encoded_test_dataset = test_dataset.map(function=preprocess_function,
                     batched=True,
                     remove_columns=['text'])
encoded_test_dataset = encoded_test_dataset.rename_column("label", "labels")
encoded_test_dataset

  0%|          | 0/1 [00:00<?, ?ba/s]

Dataset({
    features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 345
})

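## Fine-tuning the Pretrained Model

Now that the data is ready, we download and load the pretrained model and then fine-tune it. Since this is a text classification task, we need a model class built for that task, so we use AutoModelForSequenceClassification. As with the tokenizer, the from_pretrained method downloads and loads the model and also caches it, so the model is not downloaded again.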

In [12]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)

Some weights of the model checkpoint at schen/longformer-chinese-base-4096 were not used when initializing BertForSequenceClassification: ['bert.encoder.layer.4.attention.self.value_global.weight', 'bert.encoder.layer.8.attention.self.value_global.weight', 'cls.predictions.bias', 'bert.encoder.layer.3.attention.self.key_global.bias', 'bert.encoder.layer.4.attention.self.value_global.bias', 'bert.encoder.layer.3.attention.self.query_global.weight', 'bert.encoder.layer.0.attention.self.key_global.weight', 'bert.encoder.layer.1.attention.self.query_global.weight', 'bert.encoder.layer.1.attention.self.value_global.weight', 'bert.encoder.layer.2.attention.self.query_global.bias', 'bert.encoder.layer.3.attention.self.key_global.weight', 'bert.encoder.layer.10.attention.self.query_global.weight', 'bert.encoder.layer.5.attention.self.value_global.bias', 'cls.seq_relationship.bias', 'bert.encoder.layer.5.attention.self.key_global.weight', 'bert.encoder.layer.2.attention.self.query_global.weight

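Because we are fine-tuning for text classification but loading a pretrained language model, a message tells us that some mismatched weights were discarded when loading (for example, the language-modeling head of the pretrained model was dropped and a text classification head was randomly initialized). The message also lists the `*_global` attention weights as unused, since the checkpoint is loaded as a BertForSequenceClassification.

To build a Trainer we need three more things, the most important of which is the training configuration, TrainingArguments. It holds every attribute that defines the training process.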

In [15]:
metric_name = "accuracy"

args = TrainingArguments(
    output_dir = output_dir,
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    learning_rate = learning_rate,
    per_device_train_batch_size = batch_size,
    per_device_eval_batch_size = batch_size,
    num_train_epochs = num_train_epochs,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
)

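The evaluation_strategy = "epoch" argument above tells the training code to run validation once per epoch.

batch_size was defined earlier in this notebook.

Finally, since different tasks need different evaluation metrics, we define a function that computes the metrics from the model's predictions.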

In [17]:
def compute_metrics(eval_preds):
    metric = load_metric('glue','mrpc')
    logits,labels = eval_preds # model outputs (logits) and ground-truth labels
    predictions = np.argmax(logits,axis=-1)
    return metric.compute(predictions=predictions,references=labels)

Pass everything to the Trainer

In [18]:
trainer = Trainer(
    model,
    args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

## Fine-tuning the Pretrained Model
Fine-tuning on my own data

In [9]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)

Downloading:   0%|          | 0.00/423M [00:00<?, ?B/s]

Some weights of the model checkpoint at schen/longformer-chinese-base-4096 were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.decoder.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'bert.encoder.layer.0.attention.self.query_global.weight', 'bert.encoder.layer.0.attention.self.query_global.bias', 'bert.encoder.layer.0.attention.self.key_global.weight', 'bert.encoder.layer.0.attention.self.key_global.bias', 'bert.encoder.layer.0.attention.self.value_global.weight', 'bert.encoder.layer.0.attention.self.value_global.bias', 'bert.encoder.layer.1.attention.self.query_global.weight', 'bert.encoder.layer.1.attention.self.query_global.bias', 'bert.encoder.layer.1.attention.self.key_global.weight', 'bert.encoder.layer.1.

In [11]:
metric_name = "accuracy"

args = TrainingArguments(
    output_dir = output_dir,
    evaluation_strategy = "epoch",
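    # save_strategy = "epoch",  # left disabled in this run; recent transformers versions require it to match evaluation_strategy when load_best_model_at_end=True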
    learning_rate = learning_rate,
    per_device_train_batch_size = batch_size,
    per_device_eval_batch_size = batch_size,
    num_train_epochs = num_train_epochs,
    weight_decay=weight_decay,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
)

In [12]:
import numpy as np
def compute_metrics(eval_preds):
    metric = load_metric('glue','mrpc')
    logits,labels = eval_preds # model outputs (logits) and ground-truth labels
    predictions = np.argmax(logits,axis=-1)
    return metric.compute(predictions=predictions,references=labels)

In [13]:
trainer = Trainer(
    model,
    args,
    train_dataset=encoded_train_dataset,
    eval_dataset=encoded_valid_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

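## Training

### 20221108 run
The parameters and variables are as follows: <br/>
model_checkpoint = "schen/longformer-chinese-base-4096" <br/>
batch_size = 2 # number of examples per batch <br/>
num_labels = 2 # number of classes; this is binary classification (positive vs. negative) <br/>
output_dir = "/home/chenli/pre_model/20221108" # directory where the model is saved <br/>
learning_rate = 1e-5 # learning rate <br/>
num_train_epochs = 10 # number of training epochs <br/>
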
In [19]:
trainer.train()

***** Running training *****
  Num examples = 9600
  Num Epochs = 10
  Instantaneous batch size per device = 2
  Total train batch size (w. parallel, distributed & accumulation) = 2
  Gradient Accumulation steps = 1
  Total optimization steps = 48000
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,Accuracy,F1
1,0.4482,0.436606,0.923333,0.921635
2,0.2392,0.361255,0.938333,0.936968
3,0.2046,0.413974,0.944167,0.943268
4,0.0695,0.473601,0.938333,0.93686
5,0.0622,0.506522,0.944167,0.942784
6,0.0332,0.488925,0.9425,0.941772
7,0.0356,0.540889,0.940833,0.938846
8,0.0276,0.544506,0.940833,0.938315
9,0.0186,0.536422,0.9475,0.945923
10,0.0146,0.563689,0.946667,0.945299


***** Running Evaluation *****
  Num examples = 1200
  Batch size = 2
Using the latest cached version of the module from /home/chenli/.cache/huggingface/modules/datasets_modules/metrics/glue/91f3cfc5498873918ecf119dbf806fb10815786c84f41b85a5d3c47c1519b343 (last modified on Mon Nov  7 19:47:24 2022) since it couldn't be found locally at glue, or remotely on the Hugging Face Hub.
Saving model checkpoint to /home/chenli/pre_model/20221108/checkpoint-4800
Configuration saved in /home/chenli/pre_model/20221108/checkpoint-4800/config.json
Model weights saved in /home/chenli/pre_model/20221108/checkpoint-4800/pytorch_model.bin
tokenizer config file saved in /home/chenli/pre_model/20221108/checkpoint-4800/tokenizer_config.json
Special tokens file saved in /home/chenli/pre_model/20221108/checkpoint-4800/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 1200
  Batch size = 2
Using the latest cached version of the module from /home/chenli/.cache/huggingface/modules/datasets_

TrainOutput(global_step=48000, training_loss=0.1149462267626077, metrics={'train_runtime': 125940.975, 'train_samples_per_second': 0.762, 'train_steps_per_second': 0.381, 'total_flos': 7428788537010720.0, 'train_loss': 0.1149462267626077, 'epoch': 10.0})
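
The `Total optimization steps` figure in the log can be reproduced by hand. A minimal sketch, assuming a single device and no gradient accumulation (as the log reports); `total_optimization_steps` is a hypothetical helper.

```python
import math


def total_optimization_steps(num_examples, batch_size, num_epochs):
    """Steps per epoch (rounded up for the last partial batch) times epochs."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs


# 9600 examples, batch size 2, 10 epochs: the 20221108 run.
print(total_optimization_steps(9600, 2, 10))  # → 48000
```

The result also matches the `checkpoint-4800` directory saved after the first epoch (9600 / 2 = 4800 steps per epoch).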

After training, evaluate on the validation set

In [20]:
trainer.evaluate()

***** Running Evaluation *****
  Num examples = 1200
  Batch size = 2


Using the latest cached version of the module from /home/chenli/.cache/huggingface/modules/datasets_modules/metrics/glue/91f3cfc5498873918ecf119dbf806fb10815786c84f41b85a5d3c47c1519b343 (last modified on Mon Nov  7 19:47:24 2022) since it couldn't be found locally at glue, or remotely on the Hugging Face Hub.


{'eval_loss': 0.5364220142364502,
 'eval_accuracy': 0.9475,
 'eval_f1': 0.945922746781116,
 'eval_runtime': 476.841,
 'eval_samples_per_second': 2.517,
 'eval_steps_per_second': 1.258,
 'epoch': 10.0}
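
The values returned here (`eval_accuracy` 0.9475, `eval_loss` 0.5364) coincide with epoch 9 in the training table, which is what one would expect with `load_best_model_at_end=True` and `metric_for_best_model="accuracy"`. The sketch below picks the best epoch from the logged table; the rows are copied from the training output, and `best_epoch` is a hypothetical helper.

```python
# (epoch, training_loss, validation_loss, accuracy) copied from the training log.
LOG = [
    (1, 0.4482, 0.436606, 0.923333),
    (2, 0.2392, 0.361255, 0.938333),
    (3, 0.2046, 0.413974, 0.944167),
    (4, 0.0695, 0.473601, 0.938333),
    (5, 0.0622, 0.506522, 0.944167),
    (6, 0.0332, 0.488925, 0.9425),
    (7, 0.0356, 0.540889, 0.940833),
    (8, 0.0276, 0.544506, 0.940833),
    (9, 0.0186, 0.536422, 0.9475),
    (10, 0.0146, 0.563689, 0.946667),
]


def best_epoch(rows, column, greater_is_better=True):
    """Return the epoch number whose value in `column` is best."""
    pick = max if greater_is_better else min
    return pick(rows, key=lambda row: row[column])[0]


best_by_accuracy = best_epoch(LOG, column=3)
best_by_val_loss = best_epoch(LOG, column=2, greater_is_better=False)
print(best_by_accuracy, best_by_val_loss)
```

Validation loss is lowest at epoch 2 while the training loss keeps falling, which is consistent with the earlier remark about not training for too many epochs.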

Evaluate on the test set

In [21]:
trainer.evaluate(eval_dataset=encoded_dataset["test"])

***** Running Evaluation *****
  Num examples = 1200
  Batch size = 2


{'eval_loss': 0.4677197337150574,
 'eval_accuracy': 0.9541666666666667,
 'eval_f1': 0.9541284403669725,
 'eval_runtime': 363.9768,
 'eval_samples_per_second': 3.297,
 'eval_steps_per_second': 1.648,
 'epoch': 10.0}

In [22]:
trainer.predict(test_dataset=encoded_dataset["test"])

***** Running Prediction *****
  Num examples = 1200
  Batch size = 2


PredictionOutput(predictions=array([[ 5.9555726, -5.709093 ],
       [ 6.093847 , -5.8897943],
       [ 6.08048  , -5.892603 ],
       ...,
       [-4.9640007,  4.1808424],
       [-6.003969 ,  5.558333 ],
       [ 6.074811 , -5.919014 ]], dtype=float32), label_ids=array([1, 0, 0, ..., 1, 1, 0]), metrics={'test_loss': 0.4677197337150574, 'test_accuracy': 0.9541666666666667, 'test_f1': 0.9541284403669725, 'test_runtime': 872.9183, 'test_samples_per_second': 1.375, 'test_steps_per_second': 0.687})
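
The `predictions` array holds raw logits, one row per example and one column per class. A minimal sketch of turning a row into a predicted label and class probabilities, using two rows from the output above; `softmax` and `predict_label` are hypothetical helpers.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(logits):
    """Index of the largest logit, i.e. the predicted class."""
    return max(range(len(logits)), key=lambda i: logits[i])


rows = [[5.9555726, -5.709093], [-4.9640007, 4.1808424]]
labels = [predict_label(row) for row in rows]
probs = [softmax(row) for row in rows]
print(labels)
print(probs)
```

This is the same argmax step that `compute_metrics` applies before scoring.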

In [23]:
trainer.args

TrainingArguments(
_n_gpu=0,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=True,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=epoch,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
greater_is_better=True,
group_by_length=False,
half_precision_backend=auto,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=every_save,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=False,
jit_mode_eval=False,
label_n

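### 2022.11.10 run
Training on my own training set <br/>
model_checkpoint = "schen/longformer-chinese-base-4096" <br/>
batch_size = 2 # number of examples per batch <br/>
num_labels = 2 # number of classes; this is binary classification (positive vs. negative) <br/>
output_dir = "/home/chenli/pre_model/20221109" # directory where the model is saved <br/>
learning_rate = 1e-5 # learning rate <br/>
num_train_epochs = 5 # number of training epochs; around 5 is usually enough. With too many epochs the training loss keeps falling but the fine-tuned model performs worse, probably because of overfitting

Evaluate once before training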

In [14]:
trainer.evaluate()

***** Running Evaluation *****
  Num examples = 236
  Batch size = 2
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


  metric = load_metric('glue','mrpc')


{'eval_loss': 0.7960287928581238,
 'eval_accuracy': 0.2457627118644068,
 'eval_f1': 0.0,
 'eval_runtime': 2459.7237,
 'eval_samples_per_second': 0.096,
 'eval_steps_per_second': 0.048}

In [15]:
trainer.args

TrainingArguments(
_n_gpu=0,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=True,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=epoch,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
greater_is_better=True,
group_by_length=False,
half_precision_backend=auto,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=every_save,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=False,
jit_mode_eval=False,
label_n

Start training

In [16]:
trainer.train()

***** Running training *****
  Num examples = 1889
  Num Epochs = 5
  Instantaneous batch size per device = 2
  Total train batch size (w. parallel, distributed & accumulation) = 2
  Gradient Accumulation steps = 1
  Total optimization steps = 4725


Epoch,Training Loss,Validation Loss


RuntimeError: [enforce fail at alloc_cpu.cpp:66] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 1610612736 bytes. Error code 12 (Cannot allocate memory)

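### 20221114 run
model_checkpoint = "schen/longformer-chinese-base-4096" <br/>
batch_size = 2 # number of examples per batch <br/>
num_labels = 2 # number of classes; this is binary classification (positive vs. negative) <br/>
output_dir = "/home/chenli/pre_model/20221114" # directory where the model is saved <br/>
learning_rate = 1e-5 # learning rate <br/>
weight_decay=0.01 # weight decay (a regularization penalty on the weights, not a learning-rate decay); 0.01 is a reasonable value. If it is set too small it has almost no effect. <br/>
num_train_epochs = 10 # number of training epochs; around 5 is usually enough. With too many epochs the training loss keeps falling but the fine-tuned model performs worse, probably because of overfitting

Evaluate before training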

In [14]:
trainer.evaluate()



{'eval_loss': 0.6863282918930054,
 'eval_accuracy': 0.5523255813953488,
 'eval_f1': 0.7116104868913857,
 'eval_runtime': 42.1272,
 'eval_samples_per_second': 8.166}

In [15]:
trainer.evaluate(eval_dataset=encoded_test_dataset)



{'eval_loss': 0.6917498111724854,
 'eval_accuracy': 0.553623188405797,
 'eval_f1': 0.7126865671641791,
 'eval_runtime': 30.3158,
 'eval_samples_per_second': 11.38}

Start training

In [16]:
trainer.train()



Epoch,Training Loss,Validation Loss,Accuracy,F1,Runtime,Samples Per Second
1,0.1368,0.224179,0.959302,0.962963,62.0941,5.54
2,0.1108,0.176989,0.97093,0.973958,129.8985,2.648
3,0.0859,0.203063,0.968023,0.971429,30.4146,11.31
4,0.0752,0.191595,0.97093,0.973822,30.1685,11.403
5,0.041,0.313224,0.962209,0.965699,30.4654,11.291
6,0.0164,0.263096,0.968023,0.970976,31.7895,10.821
7,0.0298,0.40847,0.915698,0.91922,130.5327,2.635
8,0.0243,0.233864,0.97093,0.973822,130.1279,2.644
9,0.0183,0.235423,0.968023,0.971279,129.9329,2.648
10,0.0108,0.241471,0.97093,0.973822,30.1266,11.418


Using the latest cached version of the module from /home/chenli/.cache/huggingface/modules/datasets_modules/metrics/glue/91f3cfc5498873918ecf119dbf806fb10815786c84f41b85a5d3c47c1519b343 (last modified on Fri Nov 11 21:04:53 2022) since it couldn't be found locally at glue, or remotely on the Hugging Face Hub.

TrainOutput(global_step=6890, training_loss=0.05106816651686182, metrics={'train_runtime': 6912.9123, 'train_samples_per_second': 0.997, 'total_flos': 26040130019100000, 'epoch': 10.0})

Evaluate after training

In [17]:
trainer.evaluate()



Using the latest cached version of the module from /home/chenli/.cache/huggingface/modules/datasets_modules/metrics/glue/91f3cfc5498873918ecf119dbf806fb10815786c84f41b85a5d3c47c1519b343 (last modified on Fri Nov 11 21:04:53 2022) since it couldn't be found locally at glue, or remotely on the Hugging Face Hub.


{'eval_loss': 0.17698924243450165,
 'eval_accuracy': 0.9709302325581395,
 'eval_f1': 0.9739583333333334,
 'eval_runtime': 129.1969,
 'eval_samples_per_second': 2.663,
 'epoch': 10.0}

In [18]:
trainer.evaluate(eval_dataset=encoded_test_dataset)

Using the latest cached version of the module from /home/chenli/.cache/huggingface/modules/datasets_modules/metrics/glue/91f3cfc5498873918ecf119dbf806fb10815786c84f41b85a5d3c47c1519b343 (last modified on Fri Nov 11 21:04:53 2022) since it couldn't be found locally at glue, or remotely on the Hugging Face Hub.


{'eval_loss': 0.2300749123096466,
 'eval_accuracy': 0.9623188405797102,
 'eval_f1': 0.9669211195928753,
 'eval_runtime': 129.4035,
 'eval_samples_per_second': 2.666,
 'epoch': 10.0}

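## Hyperparameter Search

Trainer also supports hyperparameter search, using the optuna or Ray Tune libraries.

Run the following two lines to install the dependencies: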

In [24]:
! pip install optuna
! pip install ray[tune]

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Collecting optuna
  Using cached optuna-3.0.3-py3-none-any.whl (348 kB)
Collecting alembic>=1.5.0
  Using cached alembic-1.8.1-py3-none-any.whl (209 kB)
Collecting cliff
  Using cached cliff-4.0.0-py3-none-any.whl (80 kB)
Collecting cmaes>=0.8.2
  Using cached cmaes-0.9.0-py3-none-any.whl (23 kB)
Collecting colorlog
  Using cached colorlog-6.7.0-py2.py3-none-any.whl (11 kB)
Collecting Mako
  Using cached Mako-1.2.3-py3-none-any.whl (78 kB)
Collecting autopage>=0.4.0
  Using cached autopage-0.5.1-py3-none-any.whl (29 kB)
Collecting cmd2>=1.0.0
  Using cached cmd2-2.4.2-py3-none-any.whl (147 kB)
Collecting PrettyTable>=0.7.2
  Using cached prettytable-3.5.0-py3-none-any.whl (26 kB)
Collecting stevedore>=2.0.1


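During hyperparameter search, the Trainer trains a separate model for each trial, so instead of a model instance we pass a function that creates the model, which lets the Trainer re-initialize it for every trial.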

In [25]:
def model_init():
    return AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)

Then create the Trainer much as before:

In [26]:
trainer = Trainer(
    model_init=model_init,
    args=args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

loading configuration file config.json from cache at /home/chenli/.cache/huggingface/hub/models--schen--longformer-chinese-base-4096/snapshots/f0e53c8afe22f6b7cca5d5278fda13e26951a3b6/config.json
Model config BertConfig {
  "_name_or_path": "schen/longformer-chinese-base-4096",
  "architectures": [
    "BertForPreTraining"
  ],
  "attention_probs_dropout_prob": 0.1,
  "attention_window": [
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512
  ],
  "classifier_dropout": null,
  "directionality": "bidi",
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 4096,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "pooler_fc_size": 768,
  "pooler_num_attention_heads": 12,
  "pooler_num_fc_layers": 3,
  "pooler_size_

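Call the hyperparameter_search method. Note that this can take a long time; we can first search on a subset of the data, for example 1/10 of it, and then train on the full dataset.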

In [None]:
best_run = trainer.hyperparameter_search(n_trials=10, direction="maximize")
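
When no search space is given, `hyperparameter_search` falls back to a default one for the chosen backend. A custom space can be passed through its `hp_space` argument; the sketch below shows what such a function might look like for Optuna. The parameter ranges are illustrative assumptions, not values from this notebook, and `DryRunTrial` is a made-up stand-in so the sketch runs without Optuna installed.

```python
def hp_space(trial):
    """Hypothetical search space; the ranges below are illustrative assumptions."""
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-6, 1e-4, log=True),
        "weight_decay": trial.suggest_float("weight_decay", 0.0, 0.1),
        "num_train_epochs": trial.suggest_int("num_train_epochs", 2, 5),
    }


class DryRunTrial:
    """Stand-in for an Optuna trial that always returns the lower bound."""

    def suggest_float(self, name, low, high, log=False):
        return low

    def suggest_int(self, name, low, high):
        return low


# Dry run, just to inspect the shape of the search space.
space = hp_space(DryRunTrial())
print(space)

# Assumed usage with the Trainer defined earlier (not executed here):
# best_run = trainer.hyperparameter_search(
#     hp_space=hp_space, backend="optuna", n_trials=10, direction="maximize")
```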

hyperparameter_search returns the hyperparameters of the best-performing run:

In [None]:
best_run

Set the Trainer to the best hyperparameters found and train:

In [None]:
for n, v in best_run.hyperparameters.items():
    setattr(trainer.args, n, v)

trainer.train()

Finally, don't forget to [upload your model](https://huggingface.co/transformers/model_sharing.html) to the 🤗 Model Hub. After that, you can load your own model directly by name, just as at the beginning of this notebook.