## 可执行的，完整的自定义问答模型（法律案件专精训练）
### 1）将司法阅读理解CJRC数据集转换成transformers的Dataloader能识别的格式

In [1]:
import torch
import transformers

print(torch.__version__)
transformers.__version__

  from .autonotebook import tqdm as notebook_tqdm


2.0.1


'4.30.2'

In [2]:
# 函数转换CJRC数据为SQaAD格式
# 详细数据结构参见其他文件

import json
import codecs
import os
from tqdm import tqdm
# from transformers.data.datasets import QAExample

def convert_file(input_file, output_file):
    with codecs.open(input_file, "r", encoding="utf-8") as f:
        input_data = json.load(f)
    output_data = []
    for article in tqdm(input_data["data"]):
        single_data = {}
        title = article["domain"] + '-' + article["caseid"].replace(".txt","")
        single_data['title'] = title
        single_data['paragraphs'] = []
        for paragraph in article["paragraphs"]:        
            context = paragraph["context"]
            qas_list = []
            for qa in paragraph["qas"]:
                question = qa["question"]
                id = qa["id"]
                is_impossible = qa['is_impossible']
                answers = qa["answers"]
    #             start_position = qa["answers"][0]["answer_start"] if qa["answers"] else -1
                qas_list.append(dict(
                    id=id,
                    question=question,
                    answers=answers,
                    is_impossible=is_impossible
                ))
            single_data['paragraphs'].append({"context":context,"qas":qas_list})
        output_data.append(single_data)
        
    output_json = {'version': '1.0'}
    with codecs.open(output_file, "w", encoding="utf-8") as writer:
#         for example in output_data:
#             writer.write(json.dumps(example, ensure_ascii=False) + "\n")
        output_json['data'] = output_data
        writer.write(json.dumps(output_json, ensure_ascii=False))

def convert_files(input_dir, output_dir):
    os.makedirs(output_dir, exist_ok=True)
    for file_name in ["big_train_data", "dev_ground_truth", "test_ground_truth"]:
        input_file = os.path.join(input_dir, f"{file_name}.json")
        output_file = os.path.join(output_dir, f"{file_name}.json")
        convert_file(input_file, output_file)
    print("All Done")
        
convert_files("./CJRC","./CJRC/transfered")

100%|███████████████████████████████████████████████████████████████████████████| 8000/8000 [00:00<00:00, 83554.79it/s]
100%|██████████████████████████████████████████████████████████████████████████| 1000/1000 [00:00<00:00, 124823.05it/s]
100%|██████████████████████████████████████████████████████████████████████████| 1000/1000 [00:00<00:00, 167117.06it/s]

All Done





### 2）构建dataset

In [3]:
from transformers import BertTokenizer,BertConfig
from transformers.data.processors.squad import SquadV2Processor
from torch.utils.data import DataLoader
from transformers.data.data_collator import default_data_collator

In [4]:
processor = SquadV2Processor()

file_path = "./CJRC/transfered"
train_file = "big_train_data.json"
train_examples = processor.get_train_examples(data_dir=file_path, filename=train_file)

dev_file = "dev_ground_truth.json"
dev_examples = processor.get_dev_examples(data_dir=file_path, filename=dev_file)

# test_file = "test_ground_truth.json"
# test_examples = processor.get_test_examples(data_dir=file_path, filename=test_file)

100%|█████████████████████████████████████████████████████████████████████████████| 8000/8000 [00:14<00:00, 552.34it/s]
100%|█████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:02<00:00, 434.83it/s]


In [20]:
# max_position_embeddings默认是512，因为我调整了traindata的max_seq_length=128，这里要改一下
config = BertConfig.from_pretrained('chinese-bert-wwm')
print(config.hidden_size)
# config.max_position_embeddings = 128

tokenizer = BertTokenizer.from_pretrained("chinese-bert-wwm", config=config)

768


squad_convert_examples_to_features是将SQuAD数据集转换成BERT模型可以处理的特征的函数。该函数主要包括以下参数：<br>

examples: 输入的SQuAD数据集示例，包含问题、段落和答案。<br>
tokenizer: BERT模型使用的分词器，用于将文本分割成标记（token）。<br>
max_seq_length: 模型最大序列长度。如果超出此长度，则会被截断，否则将填充到相同长度。<br>
doc_stride: 在转换示例时跨越文档的步幅。如果设置为0，则每个示例只涵盖一个段落，否则可能会跨越多个段落。<br>
max_query_length: 最大问题长度。如果问题超出此长度，则会被截断。<br>
is_training: 是否为训练模式。如果是，则输出包括每个示例的特征、真实标签、答案文本等，否则只输出特征。<br>

通过调整这些参数，您可以改变模型的性能、内存使用量和训练时间。具体而言：<br>

max_seq_length：增加此值会增加内存消耗，因为模型需要为每个特征构建输入张量。但是，如果此值太低，则会丢失重要的上下文信息，从而导致性能下降。<br>

doc_stride：如果此值过大，则可能会出现潜在的信息丢失，导致性能下降。但是，如果此值过小，则会增加生成的特征数和内存消耗。因此，建议在实验中进行搜索以找到最佳值。<br>

max_query_length：增加此值可能会增加训练时间和内存使用量，因为每个问题都需要用于生成特征。但是，此值应根据数据集中的典型查询长度进行选择，否则可能会丢失重要的信息。<br>

In [7]:
from transformers.data.processors.squad import SquadV2Processor, squad_convert_examples_to_features
from transformers.data.datasets.squad import SquadDataset

# 训练集
train_features = squad_convert_examples_to_features(train_examples, 
                                                       tokenizer, 
                                                       max_seq_length=512, 
                                                       doc_stride=128, 
                                                       max_query_length=64, 
                                                       is_training=True)

# 调整最大长度以配合硬件水平，但会影响性能
# train_features = squad_convert_examples_to_features(train_examples, 
#                                                        tokenizer, 
#                                                        max_seq_length=128, 
#                                                        doc_stride=32, 
#                                                        max_query_length=16, 
#                                                        is_training=True)

len(train_features)

convert squad examples to features: 100%|████████████████████████████████████████| 39333/39333 [07:30<00:00, 87.27it/s]
add example index and unique id: 100%|███████████████████████████████████████| 39333/39333 [00:00<00:00, 743661.27it/s]


86463

In [67]:
# 硬件原因，先搞500个试试，确保模型能跑起来
train_features = train_features[:700]

# 将features中的每个样本的input_ids、attention_mask、token_type_ids、start_positions和end_positions提取出来，并组成一个TensorDataset
all_input_ids = torch.tensor([f.input_ids for f in train_features], dtype=torch.long)
all_attention_masks = torch.tensor([f.attention_mask for f in train_features], dtype=torch.long)
all_token_type_ids = torch.tensor([f.token_type_ids for f in train_features], dtype=torch.long)
all_start_positions = torch.tensor([f.start_position for f in train_features], dtype=torch.long)
all_end_positions = torch.tensor([f.end_position for f in train_features], dtype=torch.long)

train_dataset = torch.utils.data.TensorDataset(all_input_ids, all_attention_masks, all_token_type_ids, all_start_positions, all_end_positions)

# 使用DataLoader将训练集加载到内存中
batch_size = 32
# batch_size = 16
train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
len(train_dataloader)

16

In [64]:
# model

import torch
import torch.nn as nn
from transformers import BertTokenizer, BertForQuestionAnswering, BertModel
from transformers import InputExample
from transformers.data.processors.squad import SquadV2Processor, squad_convert_examples_to_features
from transformers.data.datasets.squad import SquadDataset


# 模型接受的输入应该是一个包含input_ids、attention_mask、token_type_ids、start_positions和end_positions参数的字典，
# 其中input_ids、attention_mask和token_type_ids参数是标准的BERT输入，start_positions和end_positions参数是答案的开始和结束位置。

# 如果提供了start_positions和end_positions参数，模型将返回损失；否则，将返回答案的logits。
# 在训练时，使用BertTokenizer将原始文本转换为模型所需的输入格式，并优化模型的参数。

class BertQA(nn.Module):
    def __init__(self, pretrained_path):
        super(BertQA, self).__init__()
        self.bert = BertModel.from_pretrained(pretrained_path, config=config)
        self.dropout = torch.nn.Dropout(self.bert.config.hidden_dropout_prob)
        self.qa_outputs = torch.nn.Linear(self.bert.config.hidden_size, 2)

    def forward(self, input_ids=None, attention_mask=None, token_type_ids=None,
                start_positions=None, end_positions=None):
        outputs = self.bert(
            input_ids=input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
        )


        '''
        sequence_output 是 BERT 模型最后一个 Transformer block 的每个 token 对应的隐藏状态，
        它的张量形状为 [batch_size, sequence_length, hidden_size]。
        pooled_output 是对最后一个 Transformer block 的所有 token 进行池化后得到的向量，
        它的张量形状为 [batch_size, hidden_size]。
        
        在 QA 模型中，我们需要对输入的问题和文本段落进行建模，然后预测答案的起始位置和结束位置。
        具体来说，我们可以将问题和文本段落同时输入 BERT 模型，
        然后使用 BERT 输出的 sequence_output 来计算每个位置上的答案起始位置和结束位置的预测值（即 logits），最终得到最可能的答案区间。
        
        而 pooled_output 通常被用于额外的任务，如分类任务或序列标注任务等，例如可以使用 pooled_output 来预测其他相关问题的标签或实体的类别。
        '''
        sequence_output = outputs[0]  # (batch_size, sequence_length, hidden_size)
        pooled_output = outputs[1]  # (batch_size, hidden_size)
        print("sequence_output's shape:",sequence_output.shape) # [16, 512, 768]
        print("pooled_output's shape:",pooled_output.shape) # [16, 768]

        # Flatten the sequence tensor
#         flattened_sequence_output = sequence_output.view(-1, self.bert.config.hidden_size)

#         pooled_output = self.dropout(pooled_output)

        # [src_len, batch_size,2]
        logits = self.qa_outputs(sequence_output)

        start_logits, end_logits = logits.split(1, dim=-1)
        print("start_logits's shape:",start_logits.shape) #[16,512,1]
        print("end_logits's shape:",end_logits.shape)
        # [src_len, batch_size,1], [src_len, batch_size,1]
        start_logits = start_logits.squeeze(-1).transpose(0,1)
        end_logits = end_logits.squeeze(-1).transpose(0,1)
        if start_positions is not None and end_positions is not None:          
            
#             ignored_index = start_logits.size(1) # 取输入序列的长度
#             start_logits.clamp_(0,ignored_index)
#             end_logits.clamp_(0,ignored_index)
#             loss_fct = torch.nn.CrossEntropyLoss(ignored_index=ignored_index)
    
            loss_fct = torch.nn.CrossEntropyLoss()
            print("start_positions's shape:", start_positions.shape)
            start_loss = loss_fct(start_logits, start_positions)
            end_loss = loss_fct(end_logits, end_positions)
            total_loss = (start_loss + end_loss) / 2

            return total_loss, start_logits, end_logits
        else:
            return start_logits, end_logits

In [69]:
# 实例化模型和参数
import gc
gc.collect()

model = BertQA('chinese-bert-wwm')

param_optimizer = list(model.named_parameters())
# 这里是设置不应该添加正则化项的参数，一般是BN层的可训练参数及卷积层和全连接层的 bias
no_decay = ["bias", "LayerNorm.bias", "LayerNorm.weight"]
optimizer_parameters = [
    {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay': 0.001},
    {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.0},
]
optimizer = torch.optim.AdamW(optimizer_parameters, lr=3e-5)

# 将模型转移到GPU（如果可用）
# device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device = 'cpu'
print(device)
model.to(device)

Some weights of the model checkpoint at chinese-bert-wwm were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


cpu


BertQA(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(21128, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=Tru

In [11]:
d = next(iter(train_dataloader))
print(d)
# 使用index标记各个参数，顺序如下
# input_ids, attention_masks, token_type_ids, start_positions, end_positions
len(d)

[tensor([[ 101, 1155,  815,  ...,    0,    0,    0],
        [ 101, 1155,  815,  ..., 4500, 3326,  102],
        [ 101, 4506, 3378,  ...,  865, 2157,  102],
        ...,
        [ 101, 1403, 1333,  ..., 4135, 2154,  102],
        [ 101, 1352, 3175,  ..., 5385, 1104,  102],
        [ 101, 3289,  166,  ..., 1357, 4638,  102]]), tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1],
        ...,
        [1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1]]), tensor([[0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 1, 1, 1],
        [0, 0, 0,  ..., 1, 1, 1],
        ...,
        [0, 0, 0,  ..., 1, 1, 1],
        [0, 0, 0,  ..., 1, 1, 1],
        [0, 0, 0,  ..., 1, 1, 1]]), tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])]


5

如果显存不够用，可以尝试以下方法对训练参数进行调整：<br>

减小批次大小 (batch_size)：默认情况下，每个批次训练数据集中有16个样本。您可以将此值减小为8、4或更小以减少内存使用量。但是，批次大小太小可能会导致梯度稳定性下降，并增加训练时间。<br>

减少序列长度 (max_seq_length)：默认情况下，BERT模型可以处理长度为512的序列。您可以将此值减小为256、128或更小以减少内存使用量。但是，序列长度太短可能会丢失重要的上下文信息，导致性能下降。<br>

减少训练步数 (num_train_steps)：默认情况下，训练步数为1000。您可以将此值减小为500、200或更小以减少内存使用量。但是，训练步数太少可能会导致模型欠拟合，并且在执行推理时，模型无法很好地泛化。<br>

减少并行数：如果您使用多个GPU并行训练模型，则可以尝试减少GPU数量以减少内存使用量。<br>

当您对训练参数进行调整时，请注意，在执行过程中，一些训练设置可能会导致模型的性能下降。因此，您应该评估模型的测试/验证数据集上的性能，以确保所做的更改不会对性能产生负面影响。

In [70]:
# 训练批次
num_epochs = 3

# 训练模型
for epoch in range(num_epochs):
    # 开始一个epoch的训练过程
    running_loss = 0.0
    for batch_idx, batch_data in enumerate(train_dataloader):
#         input_ids = batch_data['input_ids'].to(device)
#         attention_mask = batch_data['attention_mask'].to(device)
#         start_positions = batch_data['start_positions'].to(device)
#         end_positions = batch_data['end_positions'].to(device)
        input_ids = batch_data[0].to(device)
        attention_mask = batch_data[1].to(device)
        token_type_ids = batch_data[2].to(device) 
        start_positions = batch_data[3].to(device)
        end_positions = batch_data[4].to(device)

        optimizer.zero_grad()

        loss, _, _ = model(input_ids=input_ids,
                           attention_mask=attention_mask,
                           token_type_ids = token_type_ids,
                           start_positions=start_positions,
                           end_positions=end_positions)
        
        loss.backward()
        optimizer.step()

        running_loss += loss.item()

        # 每隔10个batch打印一次训练结果
        if batch_idx % 10 == 9:
            avg_loss = running_loss / 10
            print(f'Epoch {epoch+1}, Batch {batch_idx+1}/{len(train_dataloader)}, Avg. Loss: {avg_loss:.4f}')
            running_loss = 0.0

# 保存训练好的模型
torch.save({'state_dict': model.state_dict()}, 'self_QA/bert_model/cjrc_model_dict.pth.bar')
# tokenizer.save_pretrained('self_QA/bert_tokenizer')

sequence_output's shape: torch.Size([32, 512, 768])
pooled_output's shape: torch.Size([32, 768])
start_logits's shape: torch.Size([32, 512, 1])
end_logits's shape: torch.Size([32, 512, 1])
start_positions's shape: torch.Size([32])


ValueError: Expected input batch_size (512) to match target batch_size (32).