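# Studying how the order in which datasets are applied during fine-tuning affects model performance
Fine-tuning order in this notebook: `base --> Tri --> TriMixCos`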

### 3. Fine-tuning with LoRA (on the TriviaQA dataset)

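Device: M40-24G

<u>Base model: `MiniCPM`</u>

<u>Training data: `TriviaQA`</u>

Because of GPU memory limits, `per_device_train_batch_size` is set to 1 during training.

`gradient_accumulation_steps` is set to 4. In the fine-tuning experiments with cosmos, accuracy with 4 was slightly lower than with 8 and 16 (0.7513, 0.7593, and 0.7742 respectively), but the run time was much shorter (about 4 h, 8 h, and 16 h respectively).

This is equivalent to a `Total train batch size` of 4.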
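
The arithmetic behind that figure can be sketched as follows. The per-device batch size and accumulation steps are the values stated above; the device count is read from the `--world_info` string that appears in the launch log of this notebook, which is base64-encoded JSON. The function name `effective_batch_size` is illustrative and not part of the training script.

```python
import base64
import json


def effective_batch_size(per_device: int, accumulation_steps: int, num_devices: int) -> int:
    """Number of samples that contribute to one optimizer step."""
    return per_device * accumulation_steps * num_devices


# --world_info value copied from the DeepSpeed launch log in this notebook
world_info = json.loads(base64.b64decode("eyJsb2NhbGhvc3QiOiBbMF19"))
num_devices = sum(len(gpu_ids) for gpu_ids in world_info.values())

total = effective_batch_size(per_device=1, accumulation_steps=4, num_devices=num_devices)
print(world_info, num_devices, total)
```

Under the same assumptions, a run with `--max_steps 1000` consumes at most 1000 × 4 = 4000 training samples per stage.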

In [1]:
"""
Fine-tune model 5
"""

!bash lora_finetune_triviaQA.sh

20240729121408
[2024-07-29 12:14:10,037] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-29 12:14:12,430] [INFO] [runner.py:568:main] cmd = /usr/local/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=19888 --enable_each_rank_log=None finetune.py --model_name_or_path /hy-tmp/miniCPM-bf16 --output_dir output/LoRA/20240729121408/ --train_data_path /hy-tmp/data_MiniCPM/triviaqa_web_train_withDescription.json --eval_data_path /hy-tmp/data_MiniCPM/triviaqa_web_valid_withDescription.json --learning_rate 5e-5 --per_device_train_batch_size 1 --per_device_eval_batch_size 4 --model_max_length 1024 --use_lora --gradient_accumulation_steps 4 --warmup_steps 100 --max_steps 1000 --weight_decay 0.01 --evaluation_strategy steps --eval_steps 1000 --save_strategy steps --save_steps 500 --seed 42 --log_level info --logging_strategy steps --logging_steps 10 --deepspeed configs/ds_con

## Further fine-tuning of fine-tuned model 5

<u>Base model: `model tuned with Tri`</u>

<u>Training data: `CosmosQA`</u>

In [2]:
!bash lora_finetune_trivia_cosmos.sh

20240729162803
[2024-07-29 16:28:05,109] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-29 16:28:07,571] [INFO] [runner.py:568:main] cmd = /usr/local/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=19888 --enable_each_rank_log=None finetune.py --model_name_or_path /hy-tmp/MiniCPM/finetune/output/LoRA/20240729121408/checkpoint-1000 --output_dir output/LoRA/20240729162803/ --train_data_path /hy-tmp/data_MiniCPM/cosmosqa_train.json --eval_data_path /hy-tmp/data_MiniCPM/cosmosqa_valid.json --learning_rate 5e-5 --per_device_train_batch_size 1 --per_device_eval_batch_size 4 --model_max_length 1024 --use_lora --gradient_accumulation_steps 4 --warmup_steps 100 --max_steps 1000 --weight_decay 0.01 --evaluation_strategy steps --eval_steps 1000 --save_strategy steps --save_steps 500 --seed 42 --log_level info --logging_strategy steps --logging_steps 10 --deepspeed configs

## Inference validation

In [5]:
"""
Validate fine-tuned model 5 on the cosmos dataset
"""

import csv
import json
import os
import re

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


for run_id in ['20240729121408']:
    path = f"/hy-tmp/MiniCPM/finetune/output/LoRA/{run_id}/checkpoint-1000"
    tokenizer = AutoTokenizer.from_pretrained(path)
    model = AutoModelForCausalLM.from_pretrained(
        path, torch_dtype=torch.float16, device_map="cuda", trust_remote_code=True
    )

    input_file_path = '/hy-tmp/cosmosqa-master/data/test.jsonl'
    csv_file_path = os.path.join(path, 'predictions_cosmosqa.csv')
    lst_file_path = os.path.join(path, 'predictions_cosmosqa.lst')

    # Read the JSONL file line by line and write one prediction per record
    with open(input_file_path, 'r') as file, open(csv_file_path, 'w', newline='') as csvfile, open(lst_file_path, 'w') as lstfile:
        writer = csv.writer(csvfile)
        writer.writerow(['id', 'label'])
        for line in file:
            record = json.loads(line)
            record_id = record['id']
            # The `<用户>` marker at the start of the prompt is kept verbatim from the original run
            query = (
                "<用户>You are a reading comprehension expert and below you need to answer multiple choice questions based on the text. Note: Only the number of the option needs to be answered and your answer is only one number.\n"
                "Here are the questions you have to answer: \n"
                f"Text: {record['context']}\n"
                f"Question: {record['question']}\n"
                f"Option 0: {record['answer0']}\n"
                f"Option 1: {record['answer1']}\n"
                f"Option 2: {record['answer2']}\n"
                f"Option 3: {record['answer3']}\n"
            )
            res = model.chat(tokenizer, query=query, max_length=1024, top_p=0.5)[0]
            label = re.sub(r'\D', '', res)  # keep only the digits in the response
            writer.writerow([record_id, label])
            lstfile.write(f"{label}\n")

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
The `seen_tokens` attribute is deprecated and will be removed in v4.41. Use the `cache_position` model input instead.
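
The label-extraction step in the cell above keeps every digit in the response, so a reply such as `Option 2 (of 4)` would yield `24`, and a reply without digits yields an empty label. A more defensive variant, sketched here as a hypothetical helper `extract_label` (not part of the original notebook), takes the first standalone digit in the range 0 to 3 and falls back to a default otherwise.

```python
import re
from typing import Optional


def extract_label(response: str, default: Optional[str] = None) -> Optional[str]:
    """Return the first standalone digit 0-3 in the response, else `default`."""
    # The lookarounds reject digits that are part of a longer number such as "12"
    match = re.search(r"(?<!\d)[0-3](?!\d)", response)
    return match.group(0) if match else default


print(extract_label("2"))
print(extract_label("The answer is option 1."))
print(extract_label("Option 2 (of 4)"))
print(extract_label("I am not sure"))
```

Whether an unparseable reply should count as wrong or be mapped to a fixed default is a design choice that affects the reported accuracy, so it is worth stating alongside the results.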

## Testing model 5 on the Trivia dataset

In [6]:
"""
Validate fine-tuned model 5 on the Trivia dataset
"""

import json
import os

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


for run_id in ['20240729121408']:
    path = f"/hy-tmp/MiniCPM/finetune/output/LoRA/{run_id}/checkpoint-1000"
    tokenizer = AutoTokenizer.from_pretrained(path)
    model = AutoModelForCausalLM.from_pretrained(
        path, torch_dtype=torch.float16, device_map="cuda", trust_remote_code=True
    )

    input_file_path = '/hy-tmp/qa/verified-web-dev.json'
    output_file_path = os.path.join(path, 'predictions_tri.json')

    # Instruction and one-shot example shared by every query
    prompt = "You are a reading comprehension expert and you need to answer the encyclopaedic questions below based on the text given. Note: Try to be as concise as possible. Do not need to answer in complete sentences.\n" \
                "Example:--Text: Omitted here. --Question: Where in England was Dame Judi Dench born? --Your Answer: York\n" \
                "Here is the question you have to answer:\n"

    with open(input_file_path, 'r') as file, open(output_file_path, 'w') as json_file:
        data = json.load(file)['Data']
        result_dict = {}
        for item in data:
            question_id = item['QuestionId']
            # Start from the entity pages and append the file name of every search result
            filenames = item['EntityPages']
            for result in item['SearchResults']:
                filenames.append(result['Filename'])
            # Concatenate the search-result descriptions as the supporting text
            descriptions = "\n".join([f"Description {index+1}: {desc['Description']}" for index, desc in enumerate(item['SearchResults'])])
            query =  f"{prompt}"\
                     f"Question: {item['Question']}\n"\
                     f"Text: {descriptions}"
            res = model.chat(tokenizer, query=query, max_length=1024, top_p=0.5)[0]
            # Only the first entry of `filenames` is used to build the prediction key
            result_dict[f'{question_id}--{filenames[0]}'] = res
        json.dump(result_dict, json_file, indent=4)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
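
The evaluation output for these predictions contains `Missed question <QuestionId>--<Filename>` entries, which suggests the evaluator looks up one prediction per question-document pair, while the cell above stores each answer under a single key. A minimal sketch of filling in every key, assuming each entry of `EntityPages` and `SearchResults` is a dict with a `Filename` field; `prediction_keys` and the toy record are hypothetical.

```python
from typing import Dict, List


def prediction_keys(item: dict) -> List[str]:
    """Build one '<QuestionId>--<Filename>' key per document attached to a question."""
    pages = item.get("EntityPages", []) + item.get("SearchResults", [])
    return [f"{item['QuestionId']}--{page['Filename']}" for page in pages]


# Toy record shaped like the fields the notebook reads (all values are made up)
item = {
    "QuestionId": "q_0",
    "EntityPages": [{"Filename": "Some_Page.txt"}],
    "SearchResults": [{"Filename": "1/1_1.txt", "Description": "..."}],
}

answer = "York"  # stands in for the model response
result_dict: Dict[str, str] = {key: answer for key in prediction_keys(item)}
print(result_dict)
```

Building a new list with `+` also avoids mutating `item['EntityPages']` in place.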

In [7]:
!python /hy-tmp/triviaqa_evaluation.py

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


em=0: New York Yankees ['boston braves', 'boston braves disambiguation', 'boston braves']
em=0: Coca-Cola ['wonderbra', 'wonderbra women', 'wonder bra', 'wonderbra']
em=0: Angela ['bowie disambiguation', 'bowie', 'bowie']
em=0: The Waves ['katrina and waves', 'katrina waves', 'katrina waves']
em=0: Buckle ['agnet', 'aglet', 'fluglebinder', 'flugelbinder', 'anglets', 'aglet']
em=0: Elysium ['alysian fields', 'elysian fields', 'elysiane fields', 'elysian fields disambiguation', 'elysian fields']
em=0: Preacher ['adrian', 'adrián', 'adrian cronauer']
Missed question qz_3569--Winter_Olympic_Games.txt will receive score 0.
Missed question qz_3569--110/110_186062.txt will receive score 0.
em=0: Lactase ['animal rennet', 'rennet', 'emporase', 'rennets', 'rennett', 'rennet']
em=0: Whiskey ['cuban rum', 'gunpowder rum', 'rum beverage', 'jamaica spirit', 'light rum', 'caña blanca', 'overproof rum', 'gold rum', 'rum', 'jamaica spirits', 'rude rum', 'coconut rum', 'dark rum', 'spiced rum', 'white 
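
Each `em=0` line pairs a prediction with the normalized ground-truth aliases. The sketch below shows SQuAD-style answer normalization and exact match; it is an approximation written for illustration, not the code of `triviaqa_evaluation.py`, and the helper names are assumptions.

```python
import re
import string
from typing import Iterable


def normalize_answer(text: str) -> str:
    """Lowercase, replace punctuation with spaces, drop English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(" " if ch in string.punctuation else ch for ch in text)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, aliases: Iterable[str]) -> bool:
    """True if the normalized prediction equals any normalized alias."""
    pred = normalize_answer(prediction)
    return any(pred == normalize_answer(alias) for alias in aliases)


aliases = ["katrina and waves", "katrina waves"]
print(exact_match("The Waves", aliases))
print(exact_match("Katrina and the Waves", aliases))
```

Under this kind of matching a partially correct answer such as `The Waves` scores zero, which is why several of the logged predictions look close to the reference but are still counted as misses.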

## Testing the model fine-tuned with trivia_cosmos on the cosmos dataset

In [10]:
"""
Validate the model fine-tuned with trivia_cosmos on the cosmos dataset
"""

import csv
import json
import os
import re

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


for run_id in ['20240729162803']:
    path = f"/hy-tmp/MiniCPM/finetune/output/LoRA/{run_id}/checkpoint-1000"
    tokenizer = AutoTokenizer.from_pretrained(path)
    model = AutoModelForCausalLM.from_pretrained(
        path, torch_dtype=torch.float16, device_map="cuda", trust_remote_code=True
    )

    input_file_path = '/hy-tmp/cosmosqa-master/data/test.jsonl'
    csv_file_path = os.path.join(path, 'predictions_cosmosqa.csv')
    lst_file_path = os.path.join(path, 'predictions_cosmosqa.lst')

    # Read the JSONL file line by line and write one prediction per record
    with open(input_file_path, 'r') as file, open(csv_file_path, 'w', newline='') as csvfile, open(lst_file_path, 'w') as lstfile:
        writer = csv.writer(csvfile)
        writer.writerow(['id', 'label'])
        for line in file:
            record = json.loads(line)
            record_id = record['id']
            # The `<用户>` marker at the start of the prompt is kept verbatim from the original run
            query = (
                "<用户>You are a reading comprehension expert and below you need to answer multiple choice questions based on the text. Note: Only the number of the option needs to be answered and your answer is only one number.\n"
                "Here are the questions you have to answer: \n"
                f"Text: {record['context']}\n"
                f"Question: {record['question']}\n"
                f"Option 0: {record['answer0']}\n"
                f"Option 1: {record['answer1']}\n"
                f"Option 2: {record['answer2']}\n"
                f"Option 3: {record['answer3']}\n"
            )
            res = model.chat(tokenizer, query=query, max_length=1024, top_p=0.5)[0]
            label = re.sub(r'\D', '', res)  # keep only the digits in the response
            writer.writerow([record_id, label])
            lstfile.write(f"{label}\n")

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
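
Because the label is taken from free-form model output, it can help to check the `.lst` file before it is scored. The sketch below assumes that a valid line is exactly one of `0`, `1`, `2`, `3`; `find_invalid_labels` is a hypothetical helper, and the demo writes a temporary file with made-up content instead of reading the notebook's real predictions.

```python
import os
import tempfile
from typing import List, Tuple


def find_invalid_labels(lst_path: str) -> List[Tuple[int, str]]:
    """Return (line_number, content) for every line that is not a single label 0-3."""
    invalid = []
    with open(lst_path, "r") as handle:
        for line_number, line in enumerate(handle, start=1):
            label = line.rstrip("\n")
            if label not in {"0", "1", "2", "3"}:
                invalid.append((line_number, label))
    return invalid


# Demo on a temporary file: line 2 is empty and line 3 holds two digits
with tempfile.NamedTemporaryFile("w", suffix=".lst", delete=False) as tmp:
    tmp.write("1\n\n24\n3\n")
    demo_path = tmp.name

invalid_lines = find_invalid_labels(demo_path)
os.remove(demo_path)
print(invalid_lines)
```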

## Testing the model fine-tuned with trivia_cosmos on the Trivia dataset

In [8]:
"""
Validate the model fine-tuned with trivia_cosmos on the Trivia dataset
"""

import json
import os

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


for run_id in ['20240729162803']:
    path = f"/hy-tmp/MiniCPM/finetune/output/LoRA/{run_id}/checkpoint-1000"
    tokenizer = AutoTokenizer.from_pretrained(path)
    model = AutoModelForCausalLM.from_pretrained(
        path, torch_dtype=torch.float16, device_map="cuda", trust_remote_code=True
    )

    input_file_path = '/hy-tmp/qa/verified-web-dev.json'
    output_file_path = os.path.join(path, 'predictions_tri.json')

    # Instruction and one-shot example shared by every query
    prompt = "You are a reading comprehension expert and you need to answer the encyclopaedic questions below based on the text given. Note: Try to be as concise as possible. Do not need to answer in complete sentences.\n" \
                "Example:--Text: Omitted here. --Question: Where in England was Dame Judi Dench born? --Your Answer: York\n" \
                "Here is the question you have to answer:\n"

    with open(input_file_path, 'r') as file, open(output_file_path, 'w') as json_file:
        data = json.load(file)['Data']
        result_dict = {}
        for item in data:
            question_id = item['QuestionId']
            # Start from the entity pages and append the file name of every search result
            filenames = item['EntityPages']
            for result in item['SearchResults']:
                filenames.append(result['Filename'])
            # Concatenate the search-result descriptions as the supporting text
            descriptions = "\n".join([f"Description {index+1}: {desc['Description']}" for index, desc in enumerate(item['SearchResults'])])
            query =  f"{prompt}"\
                     f"Question: {item['Question']}\n"\
                     f"Text: {descriptions}"
            res = model.chat(tokenizer, query=query, max_length=1024, top_p=0.5)[0]
            # Only the first entry of `filenames` is used to build the prediction key
            result_dict[f'{question_id}--{filenames[0]}'] = res
        json.dump(result_dict, json_file, indent=4)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.

In [9]:
!python /hy-tmp/triviaqa_evaluation.py

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


em=0: Question: Rita Coolidge sang the title song for which Bond film?
Text: Description 1: ... Rita Coolidge Performing The title track to the JAMES BOND film OCTOPUSSY. Clip from THE VAL DOONICAN MUSIC SHOW 1983 Featuring Rita Coolidge ... HIGH ...
Answer: OCTOPUSSY ['list of bond girls in octopussy', 'bond 13', 'list of james bond allies in octopussy', 'magda james bond', 'penelope smallbone', 'kamal kahn', 'octopussy', 'list of james bond villains in octopussy', 'vijay james bond', 'jim fanning james bond', 'general orlov', 'kamal khan', 'octopussy character', 'octopussy film', 'octopussy']
em=0: Answer: "On the Street Where You Live" is a song from the musical "Camelot" with music by Frederick Loewe and lyrics by Alan Jay Lerner. ['my fair lady musical', 'my fair lady', 'my fair lady 2010 film', 'why can t english 3f', 'my fair lady upcoming film', 'my fair lady 2012 film', 'my fair lady 2014 film', 'my fair lady 2015 film', 'i m ordinary man', 'enry iggins', 'my fair lady']
em=0:
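
In the output above, some responses repeat the prompt or wrap the answer in a full sentence (for example a response ending in `Answer: OCTOPUSSY`), so exact match fails even when the expected string is present. One possible post-processing step, shown only as a sketch, keeps the text after the last `Answer:` marker. `extract_answer` is a hypothetical helper; applying it changes what is being measured, so scores with and without it are not directly comparable.

```python
def extract_answer(response: str, marker: str = "Answer:") -> str:
    """Return the text after the last `marker`, or the stripped response if the marker is absent."""
    _, found, tail = response.rpartition(marker)
    return (tail if found else response).strip()


# Response shaped like the first logged prediction, with the description shortened
echoed = (
    "Question: Rita Coolidge sang the title song for which Bond film?\n"
    "Text: Description 1: ...\n"
    "Answer: OCTOPUSSY"
)
print(extract_answer(echoed))
print(extract_answer("York"))
```

This only handles the echoed-prompt pattern; a response that embeds the answer in a longer sentence would still need a different metric or further cleaning.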