# Fine-Tuning a Large Model: ChatGLM2-6B for Binary Classification

## Loading the Data

In [1]:
import pandas as pd

In [2]:
train_df = pd.read_csv('./csv_data/train.csv')
testB_df = pd.read_csv('./csv_data/testB.csv')

In [3]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6000 entries, 0 to 5999
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   uuid      6000 non-null   int64 
 1   title     6000 non-null   object
 2   author    6000 non-null   object
 3   abstract  6000 non-null   object
 4   Keywords  6000 non-null   object
 5   label     6000 non-null   int64 
dtypes: int64(2), object(4)
memory usage: 281.4+ KB


## Building the Dataset

In [4]:
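res = []

# Build one instruction-tuning record per paper.
# Columns are accessed by name: integer keys on a label-indexed Series
# (e.g. paper_item[1]) are deprecated in recent pandas versions.
for i in range(len(train_df)):
    paper_item = train_df.loc[i]
    tmp = {
        "instruction": "Please judge whether it is a medical field paper according to the given paper title and abstract, output 1 or 0, the following is the paper title and abstract -->",
        "input": f"title:{paper_item['title']},abstract:{paper_item['abstract']}",
        "output": str(paper_item['label']),
    }
    res.append(tmp)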

In [5]:
import json

with open('./data/paper_label.json', mode='w', encoding='utf-8') as f:
    json.dump(res, f, ensure_ascii=False, indent=4)
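
Each record uses an instruction/input/output layout. As a minimal sketch of that structure (the helper name `build_record` and the sample values are hypothetical and not part of the original notebook), a record can be built and round-tripped through JSON without pandas:

```python
import json

INSTRUCTION = (
    "Please judge whether it is a medical field paper according to the given "
    "paper title and abstract, output 1 or 0, the following is the paper "
    "title and abstract -->"
)

def build_record(title, abstract, label):
    """Hypothetical helper: build one instruction-tuning record."""
    return {
        "instruction": INSTRUCTION,
        "input": f"title:{title},abstract:{abstract}",
        "output": str(label),  # the label is stored as text, not as an int
    }

# Made-up sample values, for illustration only.
records = [
    build_record("A sample title", "A sample abstract.", 1),
    build_record("Another title", "Another abstract.", 0),
]

# Serialize and parse back to confirm the structure survives unchanged.
text = json.dumps(records, ensure_ascii=False, indent=4)
loaded = json.loads(text)
print(loaded[0]["input"])
```

Storing the label as a string matches the notebook's `str(...)` conversion: the model is trained to generate the character `1` or `0`.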

## Fine-Tuning ChatGLM2-6B

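- First, clone the fine-tuning scripts: `git clone https://github.com/LLLM-Lab/xfg-paper.git`
- Enter the directory and install the dependencies: `cd ./xfg-paper`, then `pip install -r requirements.txt`
- In the script, change `model_name_or_path` to the path of your local chatglm2-6b model, then run it: `sh xfg_train.sh`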

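Fine-tuning takes about two hours (measured on an Alibaba Cloud A10 with 24 GB of GPU memory). It needs about 16 GB of GPU memory, so a card with 24 GB, such as an RTX 3090 or 4090, is recommended.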
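
The memory requirement can be sanity-checked with simple arithmetic. The sketch below is a rough estimate only: the parameter count (about 6.2 billion), the LoRA rank (8), the adapted projection size (4096 to 4608), and the layer count (28) are all assumptions chosen for illustration. The real values come from the model config and the training script.

```python
# Back-of-the-envelope estimate; every constant below is an assumption.
N_PARAMS = 6.2e9        # approximate parameter count of ChatGLM2-6B
BYTES_PER_FP16 = 2      # half precision stores each weight in 2 bytes

# Memory taken by the frozen base weights alone, in GiB.
weights_gib = N_PARAMS * BYTES_PER_FP16 / 2**30
print(f"fp16 weights: {weights_gib:.1f} GiB")

def lora_params(d_in, d_out, rank):
    """Trainable parameters LoRA adds to one d_in x d_out linear layer."""
    # LoRA learns two low-rank factors: A (rank x d_in) and B (d_out x rank).
    return rank * (d_in + d_out)

# Hypothetical setup: rank 8 on one 4096 -> 4608 projection in each of 28 layers.
trainable = 28 * lora_params(4096, 4608, rank=8)
print(f"LoRA trainable parameters: {trainable:,}")
```

Under these assumptions the frozen weights account for most of the memory, while the trainable LoRA parameters are a tiny fraction of the model. Activations, gradients for the adapters, and optimizer state take up the remainder.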

The trained LoRA weights are already included in the repository, so you can run the code below directly.

## Loading the Trained LoRA Weights and Predicting

In [6]:
from peft import PeftModel
from transformers import AutoTokenizer, AutoModel

model_path = "../chatglm2-6b"
# Load the base model in half precision on the GPU
model = AutoModel.from_pretrained(model_path, trust_remote_code=True).half().cuda()
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
# Load the LoRA weights fine-tuned for the label task
model = PeftModel.from_pretrained(model, './output/label_xfg').half()
model = model.eval()
# Quick smoke test: send a greeting ("hello") and inspect the reply
response, history = model.chat(tokenizer, "你好", history=[])
response

'你好👋！我是人工智能助手 ChatGLM2-6B，很高兴见到你，欢迎问我任何问题。'

In [7]:
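# Prediction function

def predict(text):
    # Note: this prompt mentions "title, author and abstract", while the
    # training instruction in In [4] mentions only "title and abstract".
    response, history = model.chat(
        tokenizer,
        f"Please judge whether it is a medical field paper according to the given paper title and abstract, output 1 or 0, the following is the paper title, author and abstract -->{text}",
        history=[],
        temperature=0.01,  # very low temperature keeps sampling close to deterministic
    )
    return response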

In [8]:
predict('title:Seizure Detection and Prediction by Parallel Memristive Convolutional Neural Networks,author:Li, Chenqi; Lammie, Corey; Dong, Xuening; Amirsoleimani, Amirali; Azghadi, Mostafa Rahimi; Genov, Roman,abstract:During the past two decades, epileptic seizure detection and prediction algorithms have evolved rapidly. However, despite significant performance improvements, their hardware implementation using conventional technologies, such as Complementary Metal-Oxide-Semiconductor (CMOS), in power and areaconstrained settings remains a challenging task; especially when many recording channels are used. In this paper, we propose a novel low-latency parallel Convolutional Neural Network (CNN) architecture that has between 2-2,800x fewer network parameters compared to State-Of-The-Art (SOTA) CNN architectures and achieves 5-fold cross validation accuracy of 99.84% for epileptic seizure detection, and 99.01% and 97.54% for epileptic seizure prediction, when evaluated using the University of Bonn Electroencephalogram (EEG), CHB-MIT and SWEC-ETHZ seizure datasets, respectively. We subsequently implement our network onto analog crossbar arrays comprising Resistive Random-Access Memory (RRAM) devices, and provide a comprehensive benchmark by simulating, laying out, and determining hardware requirements of theCNNcomponent of our system. We parallelize the execution of convolution layer kernels on separate analog crossbars to enable 2 orders of magnitude reduction in latency compared to SOTA hybrid Memristive-CMOS Deep Learning (DL) accelerators. Furthermore, we investigate the effects of non-idealities on our system and investigate Quantization Aware Training (QAT) to mitigate the performance degradation due to lowAnalog-to-Digital Converter (ADC)/Digital-to-Analog Converter (DAC) resolution. Finally, we propose a stuck weight offsetting methodology to mitigate performance degradation due to stuck RON/ROFF memristor weights, recovering up to 32% accuracy, without requiring retraining. The CNN component of our platform is estimated to consume approximately 2.791Wof power while occupying an area of 31.255 mm(2) in a 22 nm FDSOI CMOS process.')

'1'

In [9]:
# Predict on the test set

from tqdm import tqdm

label = []

for i in tqdm(range(len(testB_df))):
    test_item = testB_df.loc[i]
    # Access columns by name rather than by deprecated positional keys
    test_input = f"title:{test_item['title']},author:{test_item['author']},abstract:{test_item['abstract']}"
    label.append(int(predict(test_input)))


100%|██████████| 2000/2000 [05:50<00:00,  5.70it/s]
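
The loop above converts each response with `int(...)`, which raises `ValueError` if the model ever returns anything other than a bare integer. A more defensive variant might extract the label with a regular expression. The helper `parse_label` and its `default` fallback are hypothetical additions, not part of the original notebook:

```python
import re

def parse_label(response, default=0):
    """Hypothetical helper: extract a 0/1 label from a model response."""
    # Match a lone 0 or 1 that is not part of a longer number such as "10".
    match = re.search(r"(?<!\d)[01](?!\d)", response)
    # Fall back to `default` when no standalone 0 or 1 is found.
    return int(match.group()) if match else default

print(parse_label("1"))
print(parse_label("The answer is 0."))
print(parse_label("not sure"))
```

With this helper the loop body would read `label.append(parse_label(predict(test_input)))`. Whether a silent fallback is acceptable depends on the task; logging the unparseable responses is a reasonable alternative.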


In [15]:
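testB_df['label'] = label
# Fill the Keywords column with the constant 'tmp'; a scalar is broadcast to
# every row, so the row count does not need to be hard-coded.
testB_df['Keywords'] = 'tmp'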

In [16]:
testB_df

| | uuid | title | author | abstract | label | Keywords |
|---|---|---|---|---|---|---|
| 0 | 0 | Tobacco Consumption and High-Sensitivity Cardi... | Julia Brox Skranes,Magnus Nakrem Lyngbakken,Kr... | Background Cardiac troponins represent a sensi... | 1 | tmp |
| 1 | 1 | Approaching towards sustainable supply chain u... | Mohammad Reza Seddigh,Sajjad Shokouhyar,Fateme... | These two main objectives of this study are to... | 1 | tmp |
| 2 | 2 | Does globalization matter for ecological footp... | Kirikkaleli, Dervis; Adebayo, Tomiwa Sunday; K... | The main aim of this paper is to explore the r... | 0 | tmp |
| 3 | 3 | Myths and Misconceptions About University Stud... | Megan Paull,Kirsten Holmes,Maryam Omari,Debbie... | This paper examines myths and misconceptions a... | 1 | tmp |
| 4 | 4 | Antioxidant Status of Rat Liver Mitochondria u... | S I Khizrieva,R A Khalilov,A M Dzhafarova,V R ... | For evaluation of the contribution of the anti... | 1 | tmp |
| ... | ... | ... | ... | ... | ... | ... |
| 1995 | 1995 | The treatment of veterinary antibiotics in swi... | Qian, Mengcheng; Yang, Linyan; Chen, Xingkui; ... | Elevated concentrations and potential toxiciti... | 0 | tmp |
| 1996 | 1996 | Socio-political efficacy explains increase in ... | Taciano L Milfont,Danny Osborne,Chris G Sibley... | The ongoing COVID-19 pandemic claimed millions... | 1 | tmp |
| 1997 | 1997 | Investigation of early puberty prevalence and ... | Esin Gizem Olgun,Sirmen Kizilcan Cetin,Zeynep ... | We aimed to determine the prevalence of early ... | 1 | tmp |
| 1998 | 1998 | From 3D printing to 3D bioprinting: the materi... | Nihal Engin Vrana,Sharda Gupta,Kunal Mitra,Alb... | The application of 3D printing technologies fi... | 1 | tmp |


In [17]:
submit = testB_df[['uuid', 'Keywords', 'label']]

In [18]:
submit

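| | uuid | Keywords | label |
|---|---|---|---|
| 0 | 0 | tmp | 1 |
| 1 | 1 | tmp | 1 |
| 2 | 2 | tmp | 0 |
| 3 | 3 | tmp | 1 |
| 4 | 4 | tmp | 1 |
| ... | ... | ... | ... |
| 1995 | 1995 | tmp | 0 |
| 1996 | 1996 | tmp | 1 |
| 1997 | 1997 | tmp | 1 |
| 1998 | 1998 | tmp | 1 |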


In [19]:
submit.to_csv('submit.csv', index=False)
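
Before uploading, it can be worth checking that the file has the expected shape. The following is a minimal sketch using only the standard library. The function `validate_submission` and the rule that labels must be `0` or `1` are assumptions based on the columns shown above, not a documented requirement of any grader:

```python
import csv
import io

def validate_submission(csv_text, expected_columns=("uuid", "Keywords", "label")):
    """Hypothetical check: return a list of problems found in a submission CSV."""
    reader = csv.DictReader(io.StringIO(csv_text))
    problems = []
    # The header must match the expected column names, in order.
    if tuple(reader.fieldnames or ()) != tuple(expected_columns):
        problems.append(f"unexpected columns: {reader.fieldnames}")
        return problems
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        if row["label"] not in {"0", "1"}:
            problems.append(f"line {line_no}: invalid label {row['label']!r}")
    return problems

# Made-up sample content, for illustration only.
sample = "uuid,Keywords,label\n0,tmp,1\n1,tmp,0\n"
print(validate_submission(sample))
```

To check the real file, read it first, for example `validate_submission(open('submit.csv', encoding='utf-8').read())`. An empty list means no problems were found.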