# 👾Qwen2大模型微调入门-命名实体识别任务   -8.15


## 1.安装环境

本案例测试于modelscope==1.14.0、transformers==4.41.2、datasets==2.18.0、peft==0.11.1、accelerate==0.30.1、swanlab==0.3.11

In [1]:
%pip install torch swanlab modelscope transformers datasets peft pandas accelerate

Looking in indexes: https://mirrors.cloud.aliyuncs.com/pypi/simple
Collecting swanlab
  Downloading https://mirrors.cloud.aliyuncs.com/pypi/packages/cf/d8/ffc5f26c488ce91b76787274470c6c7dd646221e64e4f95051f06377aa57/swanlab-0.3.16-py3-none-any.whl (218 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m218.1/218.1 kB[0m [31m17.4 MB/s[0m eta [36m0:00:00[0m
Collecting cos-python-sdk-v5
  Downloading https://mirrors.cloud.aliyuncs.com/pypi/packages/fa/d7/97727ec88a45ff676539b3ba8c286d0b91c2d74c9aacb0d304103a763572/cos-python-sdk-v5-1.9.31.tar.gz (123 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m123.5/123.5 kB[0m [31m23.7 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25ldone
[?25hCollecting swanboard==0.1.3b4
  Downloading https://mirrors.cloud.aliyuncs.com/pypi/packages/72/a5/a81a0c116899f912adfba3c2238e675b431fcf27e4e1f10cf2d234ac5d87/swanboard-0.1.3b4-py3-none-any.whl (747 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━

如果是第一次使用SwanLab，则前往[SwanLab](https://swanlab.cn)注册账号后，在[用户设置](https://swanlab.cn/settings/overview)复制API Key，如果执行下面的代码：

In [2]:
!swanlab login

[1m[34mswanlab[0m[0m: Logging into swanlab cloud.
[1m[34mswanlab[0m[0m: You can find your API key at: [33mhttps://swanlab.cn/settings[0m
[1m[34mswanlab[0m[0m: Paste an API key from your profile and hit enter, or press 'CTRL-C' to quit: 
Aborted!


## 2. 数据集加载

1. 在[chinese_ner_sft - huggingface](https://huggingface.co/datasets/qgyd2021/chinese_ner_sft/tree/main/data)下载ccfbdci.jsonl到同级目录下。

<img src="../assets/ner_dataset.png" width=600>

2. 将ccfbdci.jsonl进行处理，转换成new_train.jsonl和new_test.jsonl

In [3]:
# 2.将train.jsonl和test.jsonl进行处理，转换成new_train.jsonl和new_test.jsonl

import json
import pandas as pd
import os

def dataset_jsonl_transfer(origin_path, new_path):
    """
    将原始数据集转换为大模型微调所需数据格式的新数据集
    """
    messages = []

    # 读取旧的JSONL文件
    with open(origin_path, "r") as file:
        for line in file:
            # 解析每一行的json数据
            data = json.loads(line)
            input_text = data["text"]
            entities = data["entities"]
            match_names = ["疾病分类", "等待期", "意外免等待期", "费用报销范围"]
            
            entity_sentence = ""
            for entity in entities:
                entity_json = dict(entity)
                entity_text = entity_json["text"]
                entity_names = entity_json["label"]
                
                if entity_names in match_names:

                    entity_sentence += f"""{{"entity_text": "{entity_text}", "entity_label": "{entity_names}"}}"""
                    # break
                
                
            
            if entity_sentence == "":
                entity_sentence = "没有找到任何实体"
            
            message = {
                "instruction": """你是一个保险领域命名实体识别的专家，你需要从给定的文本中提取以下四个可能存在的实体：疾病分类; 等待期; 意外免等待期; 费用报销范围。提取出结果后以 json 格式输出, 如 [{"text": "重症疾病", "type": "疾病分类"},{"text": "因意外伤害或于本合同生效之日起", "type": "意外免等待期"},{"text": "180日", "type": "等待期"},{"text": "合理且必需的", "type": "费用报销范围"}] 注意: 1. 输出的每一行都必须是正确的 json 字符串。2. 找不到任何实体时, 输出"没有找到任何实体"。3.尽可能多的找全疾病分类、等待期、意外免等待期、费用报销范围四个实体。4.当有多个实体存在时，使用,分隔每个实体。5.不要输出任何多余的信息。""",
                "input": f"文本:{input_text}",
                "output": entity_sentence,
            }
            
            messages.append(message)
            # break

    # 保存重构后的JSONL文件
    with open(new_path, "w", encoding="utf-8") as file:
        for message in messages:
            file.write(json.dumps(message, ensure_ascii=False) + "\n")

def test_dataset_jsonl_transfer(origin_path, new_path):
    """
    将测试集的原始数据集转换为大模型微调所需数据格式的新数据集
    """
    messages = []

    # 读取旧的JSONL文件
    with open(origin_path, "r") as file:
        for line in file:
            # 解析每一行的json数据
            data = json.loads(line)
            input_text = data["text"]
            id = data["id"]
            match_names = ["疾病分类", "等待期", "意外免等待期", "费用报销范围"]
            
            
            message = {
                "instruction": """你是一个保险领域命名实体识别的专家，你需要从给定的文本中提取以下四个可能存在的实体：疾病分类; 等待期; 意外免等待期; 费用报销范围。提取出结果后以 json 格式输出, 如 [{"text": "重症疾病", "type": "疾病分类"},{"text": "因意外伤害或于本合同生效之日起", "type": "意外免等待期"},{"text": "180日", "type": "等待期"},{"text": "合理且必需的", "type": "费用报销范围"}] 注意: 1. 输出的每一行都必须是正确的 json 字符串。2. 找不到任何实体时, 输出"没有找到任何实体"。3.尽可能多的找全疾病分类、等待期、意外免等待期、费用报销范围四个实体。4.当有多个实体存在时，使用,分隔每个实体。5.不要输出任何多余的信息。""",
                "id": id,
                "input": f"文本:{input_text}",
            }
            
            messages.append(message)
            # break

    # 保存重构后的JSONL文件
    with open(new_path, "w", encoding="utf-8") as file:
        for message in messages:
            file.write(json.dumps(message, ensure_ascii=False) + "\n")


# 加载、处理数据集和测试集
train_dataset_path = "train.jsonl"
train_jsonl_new_path = "train_trans.jsonl"

# 加载测试数据，同时要保留id信息   #TODO
test_dataset_path = "AFAC_track1_testA_230601.jsonl"
test_jsonl_new_path = "test_trans.jsonl"


if not os.path.exists(train_jsonl_new_path):
    dataset_jsonl_transfer(train_dataset_path, train_jsonl_new_path)

if not os.path.exists(test_jsonl_new_path):
    test_dataset_jsonl_transfer(test_dataset_path, test_jsonl_new_path)



total_df = pd.read_json(train_jsonl_new_path, lines=True)
train_df = total_df[int(len(total_df) * 0.1):]  # 取90%的数据做训练集

test_df = pd.read_json(test_jsonl_new_path, lines=True)
# test_df = total_df[:int(len(total_df) * 0.1)].sample(n=20)  # 随机取10%的数据中的20条做测试集

In [4]:
test_df.head()

Unnamed: 0,instruction,id,input
0,你是一个金融文本命名实体识别领域的专家，你需要从给定的句子中提取 疾病分类; 等待期; 意外...,yanbao23_dev_0,文本:，本报告中的信息或所表述的意见均不构成对任何人的投资建议。在任何情况下，本公司、本公司...
1,你是一个金融文本命名实体识别领域的专家，你需要从给定的句子中提取 疾病分类; 等待期; 意外...,yanbao23_dev_1,文本:• 产品层面，优选第二曲线培育成熟的公司。综合品类公司逐渐聚焦，如盐津更加聚焦核心品类...
2,你是一个金融文本命名实体识别领域的专家，你需要从给定的句子中提取 疾病分类; 等待期; 意外...,yanbao23_dev_2,文本:投资策略报告|传媒 图73：公众号和视频号邦定升级 数据来源：友望数据，广发证券发展研...
3,你是一个金融文本命名实体识别领域的专家，你需要从给定的句子中提取 疾病分类; 等待期; 意外...,yanbao23_dev_3,文本:我们认为目前传媒板块处于产业演进新周期的初始阶段，中长期看，新的通信技术及ARVR等硬...
4,你是一个金融文本命名实体识别领域的专家，你需要从给定的句子中提取 疾病分类; 等待期; 意外...,yanbao23_dev_4,文本:成长空间大；高端酒是确定性最强的价格带，长期量价齐升趋势确定。我们长期看好高景气优质赛...


## 3. 下载/加载模型和tokenizer

In [5]:
!pip install tf-keras

Looking in indexes: https://mirrors.cloud.aliyuncs.com/pypi/simple
Collecting tf-keras
  Downloading https://mirrors.cloud.aliyuncs.com/pypi/packages/21/8b/75f7572ec0273ed8da50bc19defe08aaaafcc15fda3407db53f49acec814/tf_keras-2.17.0-py3-none-any.whl (1.7 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.7/1.7 MB[0m [31m30.2 MB/s[0m eta [36m0:00:00[0ma [36m0:00:01[0m
[?25hCollecting tensorflow<2.18,>=2.17
  Downloading https://mirrors.cloud.aliyuncs.com/pypi/packages/ad/fc/b1b67cbad080b8b7a13d0118e2cc60a28dbdbacabdc3f8dd0480210c3f25/tensorflow-2.17.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (601.3 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m601.3/601.3 MB[0m [31m4.5 MB/s[0m eta [36m0:00:00[0m00:01[0m00:01[0m
Installing collected packages: tensorflow, tf-keras
  Attempting uninstall: tensorflow
    Found existing installation: tensorflow 2.16.1
    Uninstalling tensorflow-2.16.1:
      Successfully uninstalled te

## 使用glm4

In [None]:
from modelscope import snapshot_download, AutoTokenizer
from transformers import AutoModelForCausalLM, TrainingArguments, Trainer, DataCollatorForSeq2Seq
import torch

model_id = "ZhipuAI/glm-4-9b-chat"    
model_dir = "./ZhipuAI/glm-4-9b-chat/"

# 在modelscope上下载GLM4模型到本地目录下
model_dir = snapshot_download(model_id, cache_dir="./", revision="master")

# Transformers加载模型权重
tokenizer = AutoTokenizer.from_pretrained(model_dir, use_fast=False, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True)
model.enable_input_require_grads()  # 开启梯度检查点时，要执行该方法

## 使用QWEN2

In [6]:
from modelscope import snapshot_download, AutoTokenizer
from transformers import AutoModelForCausalLM, TrainingArguments, Trainer, DataCollatorForSeq2Seq
import torch

model_id = "qwen/Qwen2-1.5B-Instruct"    
model_dir = "./qwen/Qwen2-1___5B-Instruct"

# 在modelscope上下载Qwen模型到本地目录下
model_dir = snapshot_download(model_id, cache_dir="./", revision="master")

# Transformers加载模型权重
tokenizer = AutoTokenizer.from_pretrained(model_dir, use_fast=False, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", torch_dtype=torch.bfloat16)
model.enable_input_require_grads()  # 开启梯度检查点时，要执行该方法

2024-08-15 23:23:51.915016: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-08-15 23:23:51.925811: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-08-15 23:23:51.940469: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-08-15 23:23:51.944813: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-08-15 23:23:51.955486: I tensorflow/core/platform/cpu_feature_guar

## 4. 预处理训练数据

In [7]:
def process_func(example):
    """
    将数据集进行预处理, 处理成模型可以接受的格式
    """

    MAX_LENGTH = 384 
    input_ids, attention_mask, labels = [], [], []
    system_prompt = """你是一个保险领域命名实体识别的专家，你需要从给定的文本中提取以下四个可能存在的实体：疾病分类; 等待期; 意外免等待期; 费用报销范围。提取出结果后以 json 格式输出, 如 [{"text": "重症疾病", "type": "疾病分类"},{"text": "因意外伤害或于本合同生效之日起", "type": "意外免等待期"},{"text": "180日", "type": "等待期"},{"text": "合理且必需的", "type": "费用报销范围"}] 注意: 1. 输出的每一行都必须是正确的 json 字符串。2. 找不到任何实体时, 输出"没有找到任何实体"。3.尽可能多的找全疾病分类、等待期、意外免等待期、费用报销范围四个实体。4.当有多个实体存在时，使用,分隔每个实体。5.不要输出任何多余的信息。"""
    
    instruction = tokenizer(
        f"<|im_start|>system\n{system_prompt}<|im_end|>\n<|im_start|>user\n{example['input']}<|im_end|>\n<|im_start|>assistant\n",
        add_special_tokens=False,
    )
    response = tokenizer(f"{example['output']}", add_special_tokens=False)
    input_ids = instruction["input_ids"] + response["input_ids"] + [tokenizer.pad_token_id]
    attention_mask = (
        instruction["attention_mask"] + response["attention_mask"] + [1]
    )
    labels = [-100] * len(instruction["input_ids"]) + response["input_ids"] + [tokenizer.pad_token_id]
    if len(input_ids) > MAX_LENGTH:  # 做一个截断
        input_ids = input_ids[:MAX_LENGTH]
        attention_mask = attention_mask[:MAX_LENGTH]
        labels = labels[:MAX_LENGTH]
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}   

In [8]:
from datasets import Dataset

train_ds = Dataset.from_pandas(train_df)
train_dataset = train_ds.map(process_func, remove_columns=train_ds.column_names)

Map:   0%|          | 0/450 [00:00<?, ? examples/s]

## 5. 设置LORA

In [9]:
from peft import LoraConfig, TaskType, get_peft_model

config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
    inference_mode=False,  # 训练模式
    r=8,  # Lora 秩
    lora_alpha=32,  # Lora alaph，具体作用参见 Lora 原理
    lora_dropout=0.1,  # Dropout 比例
)

model = get_peft_model(model, config)

## 6. 训练

In [10]:
args = TrainingArguments(
    output_dir="./output/Qwen2-NER",
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=4,
    logging_steps=10,
    num_train_epochs=2,
    save_steps=100,
    learning_rate=1e-4,
    save_on_each_node=True,
    gradient_checkpointing=True,
    report_to="none",
)

In [11]:
from swanlab.integration.huggingface import SwanLabCallback
import swanlab

swanlab_callback = SwanLabCallback(
    project="Qwen2-NER-fintune",
    experiment_name="Qwen2-1.5B-Instruct-8.15-3",
    description="使用通义千问Qwen2-1.5B-Instruct模型在NER数据集上微调，实现关键实体识别任务。",
    config={
        "model": model_id,
        "model_dir": model_dir,
        "dataset": "AFAC_track1_testA_230601.jsonl",
    },
)

In [12]:

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer, padding=True),
    callbacks=[swanlab_callback],
)

trainer.train()


# ====== 训练结束后的预测 ===== #

def predict(messages, model, tokenizer):
    device = "cuda"
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    model_inputs = tokenizer([text], return_tensors="pt").to(device)
    generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=512)
    generated_ids = [
        output_ids[len(input_ids) :]
        for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]

    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
    print(response)

    return response
    

test_text_list = []
output_jsonl_data = []
for index, row in test_df.iterrows():
    instruction = row["instruction"]
    input_value = row["input"]
    id = row["id"]

    messages = [
        {"role": "system", "content": f"{instruction}"},
        {"role": "user", "content": f"{input_value}"},
    ]

    response = predict(messages, model, tokenizer)
    output_data = {"id": id, "entity": f"{response}"}
    messages.append(output_data)
    
    output_jsonl_data.append(output_data)
    
    result_text = f"{messages[0]}\n\n{messages[1]}\n\n{messages[2]}"   #message0：系统背景提示词，message1：用户输入，message2：系统回复
    
#     将messages[2]保存为jsonl格式
    
    test_text_list.append(swanlab.Text(result_text, caption=response))

# 将jsonl_data保存为JSONL格式
with open("output_baoxian.jsonl", "w", encoding="utf-8") as outfile:
    for entry in output_jsonl_data:
        json.dump(entry, outfile, ensure_ascii=False)
        outfile.write("\n")

swanlab.log({"Prediction": test_text_list})
swanlab.finish()

Detected kernel version 4.19.91, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


[2024-08-15 23:26:35,399] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)


df: /root/.triton/autotune: 没有那个文件或目录


[1m[34mswanlab[0m[0m: Tracking run with swanlab version 0.3.16                                  
[1m[34mswanlab[0m[0m: Run data will be saved locally in [35m[1m/mnt/workspace/LLM-Finetune/notebook/swanlog/run-20240815_232638-a3b1799d[0m[0m
[1m[34mswanlab[0m[0m: 👋 Hi [1m[39mVincent[0m[0m, welcome to swanlab!
[1m[34mswanlab[0m[0m: Syncing run [33mQwen2-1.5B-Instruct-8.15-3_Aug15_23-26-38[0m to the cloud
[1m[34mswanlab[0m[0m: 🌟 Run `[1mswanlab watch -l /mnt/workspace/LLM-Finetune/notebook/swanlog[0m` to view SwanLab Experiment Dashboard locally
[1m[34mswanlab[0m[0m: 🏠 View project at [34m[4mhttps://swanlab.cn/@Vincent/Qwen2-NER-fintune[0m[0m
[1m[34mswanlab[0m[0m: 🚀 View run at [34m[4mhttps://swanlab.cn/@Vincent/Qwen2-NER-fintune/runs/t6zdlerlnx7qm42xn99l4[0m[0m



[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.0.1[0m[39;49m -> [0m[32;49m24.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...


Step,Training Loss
10,0.9583
20,0.2012
30,0.1866
40,0.1559
50,0.1288


The attention mask is not set and cannot be inferred from input because pad token is same as eos token.As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


没有找到任何实体
没有找到任何实体
没有找到任何实体
{"entity_text": "媒体板块", "entity_label": "疾病分类"}
没有找到任何实体
没有找到任何实体
{"entity_text": "女装", "entity_label": "类别"}
没有找到任何实体
没有找到任何实体
没有找到任何实体
{"entity_text": "疫情", "entity_label": "疾病分类"}
没有找到任何实体
没有找到任何实体
{"entity_text": "盒马", "entity_label": "疾病分类"}
没有找到任何实体
没有找到任何实体
没有找到任何实体
{"entity_text": "骁龙XR平台", "entity_label": "费用报销范围"}
{"entity_text": "丁雄军", "entity_label": "人物"}
没有找到任何实体
没有找到任何实体
没有找到任何实体
没有找到任何实体
{"entity_text": "军机等航空装备", "entity_label": "军工电子"}
没有找到任何实体
没有找到任何实体
没有找到任何实体
{"entity_text": "跨境电商业务", "entity_label": "费用报销范围"}
没有找到任何实体
{"entity_text": "20年11月，公司升级打造\"喜粤TV\"品牌，继续强化内容产品运营，并提出将从播控平台走向主流媒体平台的战略", "entity_label": "核心出品方"}
没有找到任何实体
没有找到任何实体
没有找到任何实体
没有找到任何实体
没有找到任何实体
{"entity_text": "AI", "entity_label": "疾病分类"}
没有找到任何实体
没有找到任何实体
{"entity_text": "安井", "entity_label": "疾病分类"}
没有找到任何实体
{"entity_text": "航空胎", "entity_label": "疾病分类"}
没有找到任何实体
{"entity_text": "联合创始人", "entity_label": "创始人及核心高管团队"}
没有找到任何实体
没有找到任何实体
没有找到任何实体
没有找到任何实体
没有找到任何实体
没有找到任何实

In [None]:
test_text_list = []
for index, row in test_df.iterrows():
    instruction = row["instruction"]
    input_value = row["input"]
    id = row["id"]

    messages = [
        {"role": "system", "content": f"{instruction}"},
        {"role": "user", "content": f"{input_value}"},
    ]

    response = predict(messages, model, tokenizer)
    messages.append({"id": id, "entity": f"{response}"})
    result_text = f"{messages[0]}\n\n{messages[1]}\n\n{messages[2]}"   #message0：系统背景提示词，message1：用户输入，message2：系统回复
    test_text_list.append(swanlab.Text(result_text, caption=response))

swanlab.log({"Prediction": test_text_list}   
swanlab.finish()

In [24]:
print(test_text_list[0].data)

AttributeError: 'Text' object has no attribute 'data'