### 3. Fine-tuning with LoRA (TriviaQA dataset)

Device: M40-24G

Base model: MiniCPM

Data: TriviaQA

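Because of the GPU memory limit, `per_device_train_batch_size` is set to 1 during training.
`gradient_accumulation_steps` is set to 16, because 16 performed best on the cosmosQA dataset.
With a single GPU this is equivalent to a `Total train batch size` of 16. (Note that the launch command logged below passes `--gradient_accumulation_steps 4`, which would give a total train batch size of 4 instead.)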
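The relationship between these settings can be written out as a small calculation. This is a minimal sketch, assuming a single GPU (the `--world_info` in the log lists one local device); the helper name `effective_batch_size` is illustrative and not part of the training script.

```python
def effective_batch_size(per_device_batch_size: int,
                         grad_accum_steps: int,
                         num_devices: int = 1) -> int:
    """Number of samples that contribute to one optimizer step."""
    return per_device_batch_size * grad_accum_steps * num_devices


# Values described in the text: batch size 1, 16 accumulation steps, one GPU
print(effective_batch_size(1, 16))

# Values that appear in the logged launch command (--gradient_accumulation_steps 4)
print(effective_batch_size(1, 4))
```

Gradient accumulation trades wall-clock time for memory: the gradients of several small forward/backward passes are summed before a single optimizer update.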

In [1]:
!bash lora_finetune_trivia_web.sh

20240726235858
[2024-07-26 23:59:00,750] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-26 23:59:03,241] [INFO] [runner.py:568:main] cmd = /usr/local/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=19888 --enable_each_rank_log=None finetune.py --model_name_or_path /hy-tmp/MiniCPM/finetune/output/LoRA/20240725165943/checkpoint-1000 --output_dir output/LoRA/20240726235858/ --train_data_path /hy-tmp/data_MiniCPM/triviaqa_web_train_withDescription.json --eval_data_path /hy-tmp/data_MiniCPM/triviaqa_web_valid_withDescription.json --learning_rate 5e-5 --per_device_train_batch_size 1 --per_device_eval_batch_size 4 --model_max_length 1024 --use_lora --gradient_accumulation_steps 4 --warmup_steps 100 --max_steps 1000 --weight_decay 0.01 --evaluation_strategy steps --eval_steps 100 --save_strategy steps --save_steps 500 --seed 42 --log_level info --logging_strategy steps

## Inference check

In [1]:
import torch
from tqdm import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer


In [2]:
path = "/hy-tmp/MiniCPM/finetune/output/LoRA/20240726235858/checkpoint-1000"
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(
    path, torch_dtype=torch.float16, device_map="cuda", trust_remote_code=True
)

In [10]:
import json
input_file = '/hy-tmp/data_MiniCPM/triviaqa_web_valid_withDescription.json'
message = {}
with open(input_file, 'r', encoding='utf-8') as infile:
    line = infile.readline()
    message = json.loads(line)

In [11]:
query = f"<用户>{message['messages'][0]['content']}"
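The two cells above read only the first record. A minimal sketch of how the same steps might be generalized to every record in the file, assuming (as the cells suggest) that each line is a JSON object whose `messages` list holds the user prompt at index 0 and the reference answer at index 1. The names `build_query` and `iter_examples` are hypothetical helpers, not part of the notebook.

```python
import json
from typing import Iterator, Tuple


def build_query(record: dict) -> str:
    """Prefix the user prompt with the user tag, as done for `query` above."""
    return f"<用户>{record['messages'][0]['content']}"


def iter_examples(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (query, reference_answer) pairs from a JSON Lines file."""
    with open(path, "r", encoding="utf-8") as infile:
        for line in infile:
            line = line.strip()
            if not line:  # skip blank lines
                continue
            record = json.loads(line)
            yield build_query(record), record["messages"][1]["content"]
```

Each yielded query can then be passed to `model.chat` in the same way as the single `query` in this section.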

In [14]:
query

"<用户>You are a reading comprehension expert and you need to answer the encyclopaedic questions below based on the text given. Note: Try to be as concise as possible. Do not need to answer in complete sentences.\nExample:--Text: Omitted here. --Question: Where in England was Dame Judi Dench born? --Answer: York\nHere is the question you have to answer:\nQuestion: Who was the man behind The Chipmunks?\nText: Description 1: A struggling songwriter named Dave Seville finds success when he comes across a trio of singing chipmunks: ... Title: Alvin and the Chipmunks (2007) ...\nDescription 2: The man who brought the Chipmunks to life, ... Five more Chipmunks singles charted in the early '60s, ... See Behind-the-Scenes Rehearsal Photos of Fox's 'The Passion'"

In [12]:
res, history = model.chat(tokenizer, query=query, max_length=1024, top_p=0.5)
res

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
The `seen_tokens` attribute is deprecated and will be removed in v4.41. Use the `cache_position` model input instead.


'Dave Seville'

In [13]:
# Check the reference answer
message['messages'][1]['content']

'David Seville'
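The model's reply (`'Dave Seville'`) and the reference (`'David Seville'`) differ in one token, so a strict string comparison would count this as a miss. Below is a minimal sketch of SQuAD-style answer normalization with exact match and token-level F1, a common way to score short-answer QA. The function names are illustrative, and the source does not show which metric the author used. TriviaQA also normally accepts several answer aliases, which this sketch does not model.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> bool:
    return normalize_answer(prediction) == normalize_answer(reference)


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("Dave Seville", "David Seville"))
print(token_f1("Dave Seville", "David Seville"))
```

For this pair, exact match fails while the token F1 is 0.5, since one of the two tokens on each side overlaps.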

In [15]:
# See what the model produces without fine-tuning

MiniCPM_bf16_path = "/hy-tmp/miniCPM-bf16"
MiniCPM_bf16_tokenizer = AutoTokenizer.from_pretrained(MiniCPM_bf16_path)
MiniCPM_bf16_model = AutoModelForCausalLM.from_pretrained(
    MiniCPM_bf16_path, torch_dtype=torch.float16, device_map="cuda", trust_remote_code=True
)
         
MiniCPM_bf16_res, MiniCPM_bf16_history = MiniCPM_bf16_model.chat(MiniCPM_bf16_tokenizer, query=query, max_length=1024, top_p=0.5)
MiniCPM_bf16_res

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


'Question: Who brought the Chipmunks to life?\nText: Description 2: The man who brought the Chipmunks to life, ...\nAnswer: Dave Seville'
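The base model restates the question and text before giving its answer, whereas the fine-tuned model returns the answer alone. To compare the two outputs on equal terms, one option is to keep only what follows the last `Answer:` marker. This is a minimal sketch with a hypothetical helper `extract_answer`, and it assumes the marker appears literally in the reply.

```python
def extract_answer(reply: str, marker: str = "Answer:") -> str:
    """Return the first line after the last `marker`, or the whole reply if absent."""
    _, sep, tail = reply.rpartition(marker)
    if not sep:  # marker not found: treat the full reply as the answer
        return reply.strip()
    tail = tail.strip()
    return tail.splitlines()[0].strip() if tail else ""


base_reply = (
    "Question: Who brought the Chipmunks to life?\n"
    "Text: Description 2: The man who brought the Chipmunks to life, ...\n"
    "Answer: Dave Seville"
)
print(extract_answer(base_reply))
print(extract_answer("Dave Seville"))
```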

In [18]:
# Check whether the model's answers on cosmosqa data have improved

query_cos = (
    "<用户>You are a reading comprehension expert and below you need to answer multiple choice questions based on the text. Note: Only the number of the option needs to be answered and your answer is only one number.\n"
    "Here are the questions you have to answer: \n"
    "Text: HGH and steroid use is rampant in track , and , just like in most American professional sports , any incredible individual achievement is questioned immediately , which is a sad state of affairs . The fact that we , as a human population have to assume that someone is using steroids because of an outstanding feat is horrible . What about the people who actually train their hearts out to achieve the same success that the users are trying to achieve ? It is n't fair to the men and women that dedicate their lives to achieving glory the right way , hard work and determination .\n"
    "Question: Why is HGH and steroid use rampant in track ?\n"
    "Option 0: Because we have to assume that someone is using steroids because of an outstanding feat .\n"
    "Option 1: None of the above choices .\n"
    "Option 2: Because any incredible individual achievement is questioned immediately .\n"
    "Option 3: Because it 's an American professional sport .\n"
)
res_cos = model.chat(tokenizer, query=query_cos, max_length=1024, top_p=0.5)[0]
res_cos

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


'2'
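The reply here is a bare digit, but nothing guarantees that for every example. A reply such as `Option 2 or 3` would collapse to `23` if all non-digit characters were simply stripped. Below is a minimal sketch of a stricter parser that takes the first digit and checks it against the number of options. `parse_option_label` is a hypothetical helper, not part of the notebook.

```python
import re
from typing import Optional


def parse_option_label(reply: str, num_options: int = 4) -> Optional[int]:
    """Return the first digit in `reply` if it is a valid option index, else None."""
    match = re.search(r"[0-9]", reply)
    if match is None:
        return None
    label = int(match.group())
    return label if label < num_options else None


print(parse_option_label("2"))
print(parse_option_label("Option 2 or 3"))
print(parse_option_label("I am not sure"))
```

Returning `None` for unparseable replies makes it possible to count them separately instead of silently writing an empty or out-of-range label.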

In [6]:
import json
import os
import re

input_file_path = '/hy-tmp/cosmosqa-master/data/test.jsonl'
output_file_path = os.path.join(path, 'predictions_cosmosqa_2nd.lst')

# Count the lines first so that tqdm can show overall progress
with open(input_file_path, 'r', encoding='utf-8') as file:
    total_lines = sum(1 for _ in file)

# Read the JSONL file line by line and write one predicted label per line
with open(input_file_path, 'r', encoding='utf-8') as file, open(output_file_path, 'w') as lstfile:
    for line in tqdm(file, total=total_lines, desc="Processing"):
        record = json.loads(line)
        # Named `prompt` rather than `input` to avoid shadowing the builtin
        prompt = (
            "<用户>You are a reading comprehension expert and below you need to answer multiple choice questions based on the text. Note: Only the number of the option needs to be answered and your answer is only one number.\n"
            "Here are the questions you have to answer: \n"
            f"Text: {record['context']}\n"
            f"Question: {record['question']}\n"
            f"Option 0: {record['answer0']}\n"
            f"Option 1: {record['answer1']}\n"
            f"Option 2: {record['answer2']}\n"
            f"Option 3: {record['answer3']}\n"
        )
        res, history = model.chat(tokenizer, query=prompt, max_length=1024, top_p=0.5)
        # Keep only the digits of the reply as the predicted label
        label = re.sub(r'\D', '', res)
        lstfile.write(f"{label}\n")

Processing:   0%|          | 0/6963 [00:00<?, ?it/s]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Processing:   0%|          | 1/6963 [00:01<3:18:54,  1.71s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Processing:   0%|          | 2/6963 [00:03<2:59:56,  1.55s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Processing:   0%|          | 3/6963 [00:04<3:00:20,  1.55s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Processing:   0%|          | 4/6963 [00:06<2:49:22,  1.46s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Processing:   0%|          | 5/6963 [00:07<2:52:54,  1.49s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Processing:   0%|          | 6/6963 [00:09<2:51:30,  1.48s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Processing:   0%|          | 7/6963 [00:10<2:47:45,  1.45s/it]Setting `pad_token_id` to `eos_token_i