<a href="https://colab.research.google.com/github/CLUEbenchmark/pCLUE/blob/main/Fine_tuning_PromptCLUE_model_with_pCLUE_using_PyTorch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

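# Training the <a href='https://github.com/clue-ai/PromptCLUE'>PromptCLUE</a> model on a custom dataset

In this notebook we use the transformers library to train the PromptCLUE model on a GPU, using the <a href='https://github.com/CLUEbenchmark/pCLUE'>pCLUE multi-task prompt learning dataset</a>.

It is a PyTorch implementation that covers the whole process: environment setup, data download and conversion, model training, prediction, and evaluation of the model.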


In [5]:
# Install the required libraries
!pip install sentencepiece
!pip install transformers
!pip install torch
!pip install rich[jupyter]

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [6]:
# Import the required libraries
import json
import os
import time

import numpy as np
import pandas as pd
import torch
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader, RandomSampler, SequentialSampler
# Importing the T5 modules from huggingface/transformers
from transformers import T5Tokenizer, T5ForConditionalGeneration

# rich: for a better display on terminal
from rich.table import Column, Table
from rich import box
from rich.console import Console
print("end2...")

end2...


In [7]:
# Show GPU information
!nvidia-smi

Tue Oct  4 02:25:05 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla V100-SXM2...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   34C    P0    23W / 300W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

# Data preparation and conversion

In [8]:
# Download the pCLUE training data shards (e.g., pCLUE_train_1.json) to local storage
!wget https://raw.githubusercontent.com/CLUEbenchmark/pCLUE/main/datasets/pCLUE_train_1.json
!wget https://raw.githubusercontent.com/CLUEbenchmark/pCLUE/main/datasets/pCLUE_train_2.json
!wget https://raw.githubusercontent.com/CLUEbenchmark/pCLUE/main/datasets/pCLUE_train_3.json
!wget https://raw.githubusercontent.com/CLUEbenchmark/pCLUE/main/datasets/pCLUE_train_4.json
!wget https://raw.githubusercontent.com/CLUEbenchmark/pCLUE/main/datasets/pCLUE_train_5.json
!wget https://raw.githubusercontent.com/CLUEbenchmark/pCLUE/main/datasets/pCLUE_train_6.json
!wget https://raw.githubusercontent.com/CLUEbenchmark/pCLUE/main/datasets/pCLUE_train_7.json
!wget https://raw.githubusercontent.com/CLUEbenchmark/pCLUE/main/datasets/pCLUE_train_8.json
!wget https://raw.githubusercontent.com/CLUEbenchmark/pCLUE/main/datasets/pCLUE_train_9.json

--2022-10-04 02:25:05--  https://raw.githubusercontent.com/CLUEbenchmark/pCLUE/main/datasets/pCLUE_train_1.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 100662150 (96M) [text/plain]
Saving to: ‘pCLUE_train_1.json.7’


2022-10-04 02:25:11 (463 MB/s) - ‘pCLUE_train_1.json.7’ saved [100662150/100662150]

--2022-10-04 02:25:11--  https://raw.githubusercontent.com/CLUEbenchmark/pCLUE/main/datasets/pCLUE_train_2.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 100254394 (96M) [text/plain]
Saving to: ‘pCLUE_train_2.json.3’


2

In [12]:
# Merge the training shards into one full training set (needed for full-data training; otherwise the steps below train on only part of the data)
!rm -rf pCLUE_train.json
!cat pCLUE_train_1.json pCLUE_train_2.json pCLUE_train_3.json pCLUE_train_4.json pCLUE_train_5.json pCLUE_train_6.json pCLUE_train_7.json pCLUE_train_8.json pCLUE_train_9.json >> pCLUE_train.json

In [13]:
# Check the number of examples
!wc -l pCLUE_train.json

1200705 pCLUE_train.json


In [14]:
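# Data preparation: convert the JSON-lines file to a CSV file.
def convert_json_to_csv(source_file, target_file):
    """Convert a JSON-lines file to a CSV file.
       source_file: input file;
       target_file: converted output file
    """
    with open(source_file,'r',encoding='utf-8') as f:
        lines=f.readlines()
    print("length of lines:",len(lines))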
    input_list=[]
    output_list=[]
    answer_choices_list=[]
    type_list=[]
    for i, line in enumerate(lines):
        # {"input": "以下内容为真：“滁县地区专员张友道说:大都架到高处了”那么下面的陈述：“张友道对身边的官员说了话。”是真的,假的,或未知？\n答案：", "target": "未知", "answer_choices": ["真的", "假的", "未知"], "type": "nli"}
        # 1) Extract the field values
        json_string=json.loads(line.strip())
        input_=json_string["input"].replace("\n", "_")
        output_=json_string["target"]
        answer_choices_=json_string.get("answer_choices",[])
        type_=json_string["type"]
        if i<10:print(i,"input:",input_,";output:",output_)
        # 2) Append them to the lists
        input_list.append(input_)
        output_list.append(output_)
        answer_choices_list.append(answer_choices_)
        type_list.append(type_)

    # 3) Build a pandas DataFrame and save it as CSV
    df = pd.DataFrame({'input': input_list,
                       'target':output_list,
                       'answer_choices': answer_choices_list,
                       'type': type_list,
                       })
    df.to_csv(target_file,index=False)

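# Run the following three lines to convert the format if you want to train on the full dataset.
# By default, only part of the data is used for training.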
source_file='pCLUE_train.json'
target_file='pCLUE_train.csv'
convert_json_to_csv(source_file, target_file)

length of lines: 1200705
0 input: 这是关于哪方面的新闻： 故事,文化,娱乐,体育,财经,房产,汽车,教育,科技,军事,旅游,国际,股票,农业,游戏?崔万军合同到期 广州龙狮主教练离职_答案： ;output: 体育
1 input: 这是一个完型填空任务。候选的词语有这些：针锋相对，牵肠挂肚，心急如焚，望眼欲穿，不翼而飞，黯然神伤，金石为开，归心似箭，艰苦卓绝，触景伤情。文章内容为：_既然没有了姚明，我们也没有了那么多可以__的东西。不妨放开心思，好好的欣赏一下姚明之外的东西，也许，乐趣就在其中。(嘟嘟)_ 请问：下划线处应该选择哪个词语？_答案： ;output: 牵肠挂肚
2 input: 哪个类别最好的描述了这篇新闻？汶川地震10周年丨航拍新北川 楼房拔地起 旧貌换新颜_选项：故事，文化，娱乐，体育，财经，房产，汽车，教育，科技，军事，旅游，国际，股票，农业，游戏_答案： ;output: 国际
3 input: “现在买不是很好的时机了”我们这样说有道理吗“现在能以历史最低价买到”？是的,不是,或也许？_答案： ;output: 不是
4 input: 假定下面是真的“他想起方才王琦瑶关于指纹的话,就找一块抹布将所有的家什抹了一遍”因此,“他做了亏心事”是必然的,可能的,或不可能？_答案： ;output: 可能的
5 input: 哪个类别最好的描述了这个APP应用程序？平台简介有信钱包&mdash;&mdash;您的随身银行，为您提供专业的借钱借款快贷款金融平台。有信钱包为您提供低息借贷产品、征信查询、信用卡办理和理财资讯。平台利用大数据技术，为用户匹配推荐最适合的低费率、放款快的贷款产品，是一款满足个人和中小企业各种现金借贷需求的贷款APP。产品特点1.操作简单仅需身份证，1分钟即可完成申请；2.额度灵活贷款金额100元100万不等，任你借；3.极速贷款30秒审批，3分钟到账；4.超低月息灵活分期，让你贷款无忧；5.征信查询一键查询网贷报告，快速了解您的信用情况；6.安全保障信息安全，息费透明，保护您的隐私安全_选项：银行，社区，电商，支付，经营，卡牌，借贷，驾校，理财，职考，新闻，旅游，交通，魔幻，医疗，影像，动作，工具，体育，小说，运动，相机，工具，快递，教育，股票，菜谱，行车，仙侠，亲子，购物，射击，漫画，小学，
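
One detail of the conversion above is easier to see on a single record: newlines in `input` become underscores, and the `answer_choices` list ends up in the CSV as its string representation, so it comes back as text rather than as a list when the file is read. The following is a minimal sketch using only the standard library (`csv` and an in-memory buffer stand in for pandas and a file on disk). The sample record is a shortened, made-up example, and using `ast.literal_eval` to recover the list is one possible approach, not something the notebook itself does.

```python
import ast
import csv
import io
import json

# A shortened, illustrative record in the pCLUE line format (one JSON object per line).
line = json.dumps({
    "input": "Is the statement true, false, or unknown?\nAnswer:",
    "target": "unknown",
    "answer_choices": ["true", "false", "unknown"],
    "type": "nli",
})

record = json.loads(line.strip())
row = {
    "input": record["input"].replace("\n", "_"),  # same newline handling as convert_json_to_csv
    "target": record["target"],
    "answer_choices": record.get("answer_choices", []),
    "type": record["type"],
}

# Write one CSV row to an in-memory buffer (a stand-in for df.to_csv).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)

# Read it back: every field, including the list, is now a plain string.
buffer.seek(0)
loaded = next(csv.DictReader(buffer))
choices = ast.literal_eval(loaded["answer_choices"])  # parse the string back into a list

print(loaded["input"])
print(type(loaded["answer_choices"]).__name__, "->", choices)
```

If the answer choices are needed later (for example, to constrain predictions), they have to be parsed back in this way or read from the original JSON file.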

In [15]:
# Configuration (console display; GPU setup)
# define a rich console logger
console = Console(record=True)

# to display dataframe in ASCII format
def display_df(df):
    """display dataframe in ASCII format"""

    console = Console()
    table = Table(
        Column("source_text", justify="center"),
        Column("target_text", justify="center"),
        title="Sample Data",
        pad_edge=False,
        box=box.ASCII,
    )

    for i, row in enumerate(df.values.tolist()):
        table.add_row(row[0], row[1])

    # console.print(table)  # TODO: uncomment to print the table

# training logger to log training progress
training_logger = Table(
    Column("Epoch", justify="center"),
    Column("Steps", justify="center"),
    Column("Loss", justify="center"),
    title="Training Status",
    pad_edge=False,
    box=box.ASCII,
)

# Setting up the device for GPU usage
from torch import cuda
device = 'cuda' if cuda.is_available() else 'cpu'
print("end...")

end...


# Dataset Class

In [16]:
class YourDataSetClass(Dataset):
    """
    Custom dataset used for training; the dataframe must contain two fields: an input (e.g. source_text) and an output (e.g. target_text).
    Creating a custom dataset for reading the dataset and
    loading it into the dataloader to pass it to the
    neural network for finetuning the model

    """

    def __init__(
        self, dataframe, tokenizer, source_len, target_len, source_text, target_text
    ):
        """
        Initializes a Dataset class

        Args:
            dataframe (pandas.DataFrame): Input dataframe
            tokenizer (transformers.tokenizer): Transformers tokenizer
            source_len (int): Max length of source text
            target_len (int): Max length of target text
            source_text (str): column name of source text
            target_text (str): column name of target text
        """
        self.tokenizer = tokenizer
        self.data = dataframe
        self.source_len = source_len
        self.summ_len = target_len
        self.target_text = self.data[target_text]
        self.source_text = self.data[source_text]

    def __len__(self):
        """returns the length of dataframe"""

        return len(self.target_text)

    def __getitem__(self, index):
        """return the input ids, attention masks and target ids"""

        source_text = str(self.source_text[index])
        target_text = str(self.target_text[index])

        # cleaning data so as to ensure data is in string type
        source_text = " ".join(source_text.split())
        target_text = " ".join(target_text.split())

        # padding="max_length" already pads to max_length, so the deprecated
        # pad_to_max_length flag is not needed
        source = self.tokenizer.batch_encode_plus(
            [source_text],
            max_length=self.source_len,
            truncation=True,
            padding="max_length",
            return_tensors="pt",
        )
        target = self.tokenizer.batch_encode_plus(
            [target_text],
            max_length=self.summ_len,
            truncation=True,
            padding="max_length",
            return_tensors="pt",
        )

        source_ids = source["input_ids"].squeeze()
        source_mask = source["attention_mask"].squeeze()
        target_ids = target["input_ids"].squeeze()
        target_mask = target["attention_mask"].squeeze()

        return {
            "source_ids": source_ids.to(dtype=torch.long),
            "source_mask": source_mask.to(dtype=torch.long),
            "target_ids": target_ids.to(dtype=torch.long),
            "target_ids_y": target_ids.to(dtype=torch.long),
        }
print("end...")

end...
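
To make the shapes returned by `__getitem__` concrete, here is a minimal sketch of what `padding="max_length"` combined with `truncation=True` does to a single sequence. It deliberately uses a hypothetical character-level `toy_encode` function instead of `T5Tokenizer` (so nothing needs to be downloaded), and the ids `PAD_ID = 0` and `EOS_ID = 1` are assumptions made for this illustration only.

```python
PAD_ID = 0  # assumed padding id for this toy example
EOS_ID = 1  # assumed end-of-sequence id for this toy example

def toy_encode(text, max_length):
    """Hypothetical character-level stand-in for a tokenizer call with
    padding="max_length" and truncation=True."""
    ids = [ord(ch) for ch in text]
    ids = ids[: max_length - 1] + [EOS_ID]  # truncate, leaving room for EOS
    attention_mask = [1] * len(ids)
    pad = max_length - len(ids)
    return {
        "input_ids": ids + [PAD_ID] * pad,             # pad on the right up to max_length
        "attention_mask": attention_mask + [0] * pad,  # 0 marks padding positions
    }

short = toy_encode("abc", max_length=6)
long_ = toy_encode("abcdefghij", max_length=6)
print(short)
print(long_)
```

Every example therefore has the same length, which is what lets the `DataLoader` stack them into a batch; the attention mask tells the model which positions are real tokens.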


# Train

In [17]:
def train(epoch, tokenizer, model, device, loader, optimizer):

    """
    Training routine for one epoch.
    Function to be called for training with the parameters passed from main function

    """

    model.train()
    time1=time.time()
    for _, data in enumerate(loader, 0):
        y = data["target_ids"].to(device, dtype=torch.long)
        y_ids = y[:, :-1].contiguous() # decoder input: the target without its last position, e.g. "你好吗？"
        lm_labels = y[:, 1:].clone().detach() # labels: the target from its second position to the end, e.g. "好吗？<EOS>"
        lm_labels[y[:, 1:] == tokenizer.pad_token_id] = -100 # padding positions are set to -100 so that the loss ignores them; for details see https://github.com/Shivanandroy/T5-Finetuning-PyTorch/issues/3
        ids = data["source_ids"].to(device, dtype=torch.long) # input. e.g. "how are you?"
        mask = data["source_mask"].to(device, dtype=torch.long)

        outputs = model(
            input_ids=ids,
            attention_mask=mask,
            decoder_input_ids=y_ids,
            labels=lm_labels,
        )
        loss = outputs[0]
        # Log every 100 steps
        if _ % 100 == 0 and _!=0:
            time2=time.time()
            print(_,"epoch:"+str(epoch)+"-loss:"+str(loss)+";each step's time spent:"+str(float(time2-time1)/float(_+0.0001)))
            # training_logger.add_row(str(epoch), str(_), str(loss))
            # console.print(training_logger)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
print("end...")

end...
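
The slicing in `train` is easier to follow on one padded target sequence. The sketch below reproduces the same three steps with plain Python lists: drop the last position to get the decoder inputs, drop the first position to get the labels, and replace padding in the labels with `-100` (the index that PyTorch's cross-entropy loss ignores by default). The token ids are made up for illustration.

```python
PAD_ID = 0  # assumed pad token id for this illustration

# One padded target sequence: three content tokens, EOS (id 1), then padding.
y = [11, 12, 13, 1, PAD_ID, PAD_ID]

decoder_input_ids = y[:-1]  # mirrors y[:, :-1]
labels = y[1:]              # mirrors y[:, 1:]
labels = [-100 if tok == PAD_ID else tok for tok in labels]  # mask padding for the loss

for step, (inp, lab) in enumerate(zip(decoder_input_ids, labels)):
    note = "ignored by the loss" if lab == -100 else "contributes to the loss"
    print(step, inp, lab, note)
```

At each step the decoder sees one token and is trained to predict the next one; positions whose label is `-100` add nothing to the loss.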


# Validate

In [18]:
def validate(epoch, tokenizer, model, device, loader,max_length):

  """
  Validation routine: takes the validation data and returns the model predictions together with the reference labels.
  Function to evaluate model for predictions

  """
  model.eval()
  predictions = []
  actuals = []
  with torch.no_grad():
      for _, data in enumerate(loader, 0):
          y = data['target_ids'].to(device, dtype = torch.long)
          ids = data['source_ids'].to(device, dtype = torch.long)
          mask = data['source_mask'].to(device, dtype = torch.long)

          generated_ids = model.generate(
              input_ids = ids,
              attention_mask = mask, 
              max_length=max_length, 
              num_beams=2,
              repetition_penalty=2.5, 
              length_penalty=1.0, 
              early_stopping=True
              )
          preds = [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=True) for g in generated_ids]
          target = [tokenizer.decode(t, skip_special_tokens=True, clean_up_tokenization_spaces=True)for t in y]
          if _%1000==0:
              console.print(f'Completed {_}')

          predictions.extend(preds)
          actuals.extend(target)
  return predictions, actuals
print("end...")

end...
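
`validate` returns two parallel lists of decoded strings. One simple way to summarize them (not part of the original notebook) is an exact-match rate; the helper name `exact_match_rate` and the sample strings below are hypothetical.

```python
def exact_match_rate(predictions, actuals):
    """Fraction of positions where the stripped prediction equals the stripped reference."""
    if len(predictions) != len(actuals):
        raise ValueError("predictions and actuals must have the same length")
    if not predictions:
        return 0.0
    hits = sum(p.strip() == a.strip() for p, a in zip(predictions, actuals))
    return hits / len(predictions)

# Made-up outputs in the shape returned by validate().
predictions = ["sports", "education", "yes", "finance"]
actuals = ["sports", "military", "yes", "finance "]
rate = exact_match_rate(predictions, actuals)
print(f"exact match: {rate:.2f}")
```

Exact match suits the classification-style tasks; for free-form generation a softer metric such as ROUGE-L is more informative.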


# Trainer


In [21]:
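# Trainer: combines the dataset class, the training routine, and the validation routine; loads the data, trains, and validates after each epoch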
def T5Trainer(
    dataframe, source_text, target_text, model_params, output_dir="./outputs/"
):
    """
    T5 trainer
    """
    # Set random seeds and deterministic pytorch for reproducibility
    torch.manual_seed(model_params["SEED"])  # pytorch random seed
    np.random.seed(model_params["SEED"])  # numpy random seed
    torch.backends.cudnn.deterministic = True

    # logging
    console.log(f"""[Model]: Loading {model_params["MODEL"]}...\n""")

    # tokenizer for encoding the text
    tokenizer = T5Tokenizer.from_pretrained(model_params["MODEL"])

    # Defining the model. We load PromptCLUE as a T5ForConditionalGeneration model, which includes the language modeling head used to generate predictions.
    # The model is then moved to the selected device (GPU if available, otherwise CPU).
    model = T5ForConditionalGeneration.from_pretrained(model_params["MODEL"])
    model = model.to(device)

    # logging
    console.log(f"[Data]: Reading data...\n")

    # Importing the raw dataset
    dataframe = dataframe[[source_text, target_text]]
    # display_df(dataframe.head(2))

    # Creation of Dataset and Dataloader
    # Define the train size: 94% of the data is used for training and the rest for validation.
    train_size = 0.94
    train_dataset = dataframe.sample(frac=train_size, random_state=model_params["SEED"])
    val_dataset = dataframe.drop(train_dataset.index).reset_index(drop=True)
    train_dataset = train_dataset.reset_index(drop=True)
    
    # Log dataset information: dataset sizes and the number of training steps
    console.print(f"FULL Dataset: {dataframe.shape}")
    console.print(f"TRAIN Dataset: {train_dataset.shape}")
    console.print(f"TEST Dataset: {val_dataset.shape}\n")
    total_train_steps=int((train_dataset.shape[0] * model_params["TRAIN_EPOCHS"])/model_params["TRAIN_BATCH_SIZE"])
    console.print(f"Total Train Steps: {total_train_steps}\n")

    # Creating the Training and Validation dataset for further creation of Dataloader
    training_set = YourDataSetClass(
        train_dataset,
        tokenizer,
        model_params["MAX_SOURCE_TEXT_LENGTH"],
        model_params["MAX_TARGET_TEXT_LENGTH"],
        source_text,
        target_text,
    )
    val_set = YourDataSetClass(
        val_dataset,
        tokenizer,
        model_params["MAX_SOURCE_TEXT_LENGTH"],
        model_params["MAX_TARGET_TEXT_LENGTH"],
        source_text,
        target_text,
    )

    # Defining the parameters for creation of dataloaders
    train_params = {
        "batch_size": model_params["TRAIN_BATCH_SIZE"],
        "shuffle": True,
        "num_workers": 0,
    }

    val_params = {
        "batch_size": model_params["VALID_BATCH_SIZE"],
        "shuffle": False,
        "num_workers": 0,
    }

    # Create the dataloaders for training and validation. They are used below in the training and validation stages.
    training_loader = DataLoader(training_set, **train_params)
    val_loader = DataLoader(val_set, **val_params)

    # Defining the optimizer that will be used to tune the weights of the network in the training session.
    optimizer = torch.optim.Adam(
        params=model.parameters(), lr=model_params["LEARNING_RATE"]
    )

    # Training loop
    console.log(f"[Initiating Fine Tuning]...\n")

    for epoch in range(model_params["TRAIN_EPOCHS"]):
        # 1) train for one epoch
        train(epoch, tokenizer, model, device, training_loader, optimizer)
        
        # 2) save model for each epoch
        console.log(f"[Saving Model]...\n")
        path = os.path.join(output_dir, "model_files")
        model.save_pretrained(path)
        tokenizer.save_pretrained(path)

        # 3) evaluating test dataset
        console.log(f"[Initiating Validation]...\n")
        with torch.no_grad(): # add 2022.10.4
          #for epoch in range(model_params["VAL_EPOCHS"]):
          predictions, actuals = validate(epoch, tokenizer, model, device, val_loader,model_params["MAX_TARGET_TEXT_LENGTH"])
          final_df = pd.DataFrame({"Generated Text": predictions, "Actual Text": actuals})
          final_df.to_csv(os.path.join(output_dir, "predictions.csv"))

    console.save_text(os.path.join(output_dir, "logs.txt"))

    console.log(f"[Validation Completed.]\n")
    console.print(
        f"""[Model] Model saved @ {os.path.join(output_dir, "model_files")}\n"""
    )
    console.print(
        f"""[Validation] Generation on Validation data saved @ {os.path.join(output_dir,'predictions.csv')}\n"""
    )
    console.print(f"""[Logs] Logs saved @ {os.path.join(output_dir,'logs.txt')}\n""")
print("end...")

end...
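
The `Total Train Steps` value logged by `T5Trainer` is computed with integer truncation, so it can be slightly lower than the number of batches the `DataLoader` actually yields (by default the loader keeps the final partial batch). A small sketch with made-up sizes; both function names are hypothetical:

```python
import math

def logged_train_steps(num_rows, epochs, batch_size):
    """Same arithmetic as total_train_steps in T5Trainer."""
    return int((num_rows * epochs) / batch_size)

def actual_train_steps(num_rows, epochs, batch_size):
    """Batches yielded by a DataLoader that keeps the last partial batch."""
    return math.ceil(num_rows / batch_size) * epochs

# Illustrative sizes, not taken from the run in this notebook.
num_rows, epochs, batch_size = 1001, 1, 8
print(logged_train_steps(num_rows, epochs, batch_size))
print(actual_train_steps(num_rows, epochs, batch_size))
```

The difference only matters when the step count drives something else, such as a learning-rate schedule.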


In [22]:
# Define the model parameters specific to T5
model_params = {
    "MODEL": "ClueAI/PromptCLUE",  # model_type
    "TRAIN_BATCH_SIZE": 8,  # training batch size, 8
    "VALID_BATCH_SIZE": 8,  # validation batch size,8 
    "TRAIN_EPOCHS": 1,  # number of training epochs
    "VAL_EPOCHS": 1,  # number of validation epochs
    "LEARNING_RATE": 1e-4,  # learning rate
    "MAX_SOURCE_TEXT_LENGTH": 512,  # max length of source text, 512
    "MAX_TARGET_TEXT_LENGTH": 64,  # max length of target text,64
    "SEED": 42,  # set seed for reproducibility
}
print("end...")

end...


In [23]:
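# Train the model
# Use part of the pCLUE multi-task prompt learning dataset (1,200,000+ examples)
# The dataframe must have 2 columns:
#   - input: the input text
#   - target: the target output
df = pd.read_csv('/content/pCLUE_train.csv')  # about 1,200k rows
df = df.sample(frac=0.01) # TODO: remove this line to train on more data
print("df.head:",df.head(n=5))
print("df.shape:",df.shape)
# GPU memory note: if you run out of GPU memory, check usage with nvidia-smi; if most of the memory is already occupied, restart the Colab runtime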
T5Trainer(
    dataframe=df,
    source_text="input",
    target_text="target",
    model_params=model_params,
    output_dir="outputs",
)
print("end..")

df.head:                                                      input target  \
1116350  “不登大雅，一无所有，残杯冷炙，家常茶饭，奇珍异宝，山珍海错，步步为营，名垂青史，绫罗绸缎，...   残杯冷炙   
593090   给定“有马是神户附近最有名的一个温泉”因此，它必定是真的“神户附近还有别的温泉。”？是的,不...     是的   
823073   以下内容为真：“这个贝宁顿就是早期他就是这个嘶吼,但是后来到了中年其实还是变的,听说是有点流...     假的   
79862    对话：男：毕业论文还没写完吗？你不是说你周末一定可以完成吗？女：唉，别提了，本来打算周末写完...    写论文   
1084238  “这时候放在床上枕头旁边的手机（候选词）响了，我感到奇怪，因为欠费已被停机两个月，现在它(代...     是的   

                                            answer_choices  \
1116350  ['不登大雅', '一无所有', '残杯冷炙', '家常茶饭', '奇珍异宝', '山珍海错...   
593090                                  ['是的', '不是', '也许']   
823073                                  ['真的', '假的', '未知']   
79862                          ['写论文', '逛街', '陪妹妹', '看电影']   
1084238                                       ['是的', '不是']   

                        type  
1116350                  mrc  
593090                   nli  
823073                   nli  
79862                    mrc  
1084238  anaphora_resolution  
df.shape: (12007,

100 epoch:0-loss:tensor(0.7344, device='cuda:0', grad_fn=<NllLossBackward0>);each step's time spent:0.3988712065250642
200 epoch:0-loss:tensor(0.8039, device='cuda:0', grad_fn=<NllLossBackward0>);each step's time spent:0.39218965471236256
300 epoch:0-loss:tensor(0.3353, device='cuda:0', grad_fn=<NllLossBackward0>);each step's time spent:0.3900473144867411
400 epoch:0-loss:tensor(4.0370, device='cuda:0', grad_fn=<NllLossBackward0>);each step's time spent:0.38897510460139295
500 epoch:0-loss:tensor(1.5137, device='cuda:0', grad_fn=<NllLossBackward0>);each step's time spent:0.38859843016147855
600 epoch:0-loss:tensor(0.4864, device='cuda:0', grad_fn=<NllLossBackward0>);each step's time spent:0.3879960057189783
700 epoch:0-loss:tensor(1.6351, device='cuda:0', grad_fn=<NllLossBackward0>);each step's time spent:0.38756365995543995
800 epoch:0-loss:tensor(0.5294, device='cuda:0', grad_fn=<NllLossBackward0>);each step's time spent:0.3872689508103219
900 epoch:0-loss:tensor(0.7503, device='cuda

end..


In [28]:
# Check GPU memory usage after training. If memory is still occupied, the responsible processes can be killed
!nvidia-smi
# !fuser -v /dev/nvidia*

Tue Oct  4 02:38:58 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla V100-SXM2...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   38C    P0    50W / 300W |   1203MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [27]:
# !nvidia-smi -r 
# Release cached GPU memory left over from training (a single call is enough)
torch.cuda.empty_cache()

In [None]:
# Find the processes that occupy GPU memory (they can be killed afterwards)
!fuser -v /dev/nvidia*

# Load the trained model for prediction

In [29]:
# Load the fine-tuned model
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("ClueAI/PromptCLUE")
model_trained = AutoModelForSeq2SeqLM.from_pretrained("/content/outputs/model_files/") 
print("end...")

end...


In [30]:
# import torch
# from transformers import AutoTokenizer
# Switch the Colab runtime to GPU (and set the device to 'cuda') for faster inference
device = torch.device('cpu') # cuda
model_trained.to(device)
def preprocess(text):
  return text.replace("\n", "_")
def postprocess(text):
  return text.replace("_", "\n")

def answer_fn(text, sample=False, top_p=0.6):
  '''sample: whether to sample; can be set to True for generation tasks.
     top_p: a value between 0 and 1; the larger it is, the more diverse the generated text.
  '''
  text = preprocess(text)
  encoding = tokenizer(text=[text], truncation=True, padding=True, max_length=768, return_tensors="pt").to(device) 
  if not sample: # no sampling: beam search
    out = model_trained.generate(**encoding, return_dict_in_generate=True, output_scores=False, max_length=128, num_beams=4, length_penalty=0.6)
  else: # sampling (generation)
    out = model_trained.generate(**encoding, return_dict_in_generate=True, output_scores=False, max_length=128, do_sample=True, top_p=top_p)
  out_text = tokenizer.batch_decode(out["sequences"], skip_special_tokens=True)
  return postprocess(out_text[0])  
print("end...")

end...
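
When `sample=True`, `answer_fn` passes `top_p` to `generate`, which enables nucleus (top-p) sampling: at each step only the smallest set of most-probable tokens whose cumulative probability reaches `top_p` is kept, and the next token is drawn from that set. The sketch below illustrates the filtering step on a toy distribution in plain Python. It is a simplified illustration, not the transformers implementation, and the token names and probabilities are made up.

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of highest-probability tokens whose cumulative
    probability reaches top_p, then renormalize the kept probabilities."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}

# Toy next-token distribution (illustrative values).
probs = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}
narrow = top_p_filter(probs, top_p=0.6)
wide = top_p_filter(probs, top_p=0.9)
print(narrow)
print(wide)
```

A larger `top_p` keeps more candidate tokens, which is why higher values tend to produce more varied text.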


In [31]:
text="这是关于哪方面的新闻： 故事,文化,娱乐,体育,财经,房产,汽车,教育,科技,军事,旅游,国际,股票,农业,游戏?如果日本沉没，中国会接收日本难民吗？"
result=answer_fn(text, sample=False, top_p=0.6)
print("result2:",result)

result2: 国际


In [40]:
# Measure the average time per prediction
time1=time.time()
num_times=100
for i in range(num_times):
  result=answer_fn(text, sample=False, top_p=0.6)
time2=time.time()
time_spent=float(time2-time1)/float(num_times)
print("time spent for single input:"+str(time_spent))


time spent for single input:0.27129202604293823


# Evaluate on the public test set

In [32]:
!pip install pylcs
!pip install Rouge

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [33]:
# Import the packages
import json,pylcs
from rouge import Rouge
import numpy as np
print("end...")

end...


In [34]:
# Download the public test set (test_public.json)
!wget https://raw.githubusercontent.com/CLUEbenchmark/pCLUE/main/datasets/pCLUE_test_public_1.json
!wget https://raw.githubusercontent.com/CLUEbenchmark/pCLUE/main/datasets/pCLUE_test_public_2.json

--2022-10-04 02:39:59--  https://raw.githubusercontent.com/CLUEbenchmark/pCLUE/main/datasets/pCLUE_test_public_1.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.111.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 61122625 (58M) [text/plain]
Saving to: ‘pCLUE_test_public_1.json.1’


2022-10-04 02:40:02 (419 MB/s) - ‘pCLUE_test_public_1.json.1’ saved [61122625/61122625]

--2022-10-04 02:40:03--  https://raw.githubusercontent.com/CLUEbenchmark/pCLUE/main/datasets/pCLUE_test_public_2.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 52082600 (50M) [text/plain]
Saving to: ‘pCLUE

In [35]:
#!tail -f pCLUE_test_public_2.json

{"input": "哪个类别最好的描述了这篇新闻？五月一定要去一次江南——吃喝玩乐全攻略\n选项：故事，文化，娱乐，体育，财经，房产，汽车，教育，科技，军事，旅游，国际，股票，农业，游戏\n答案：", "target": "旅游", "answer_choices": ["故事", "文化", "娱乐", "体育", "财经", "房产", "汽车", "教育", "科技", "军事", "旅游", "国际", "股票", "农业", "游戏"], "type": "classify"}
{"input": "给定下面的问题。\n大学生就业形势严峻的根本原因是什么？\n以及下面的答案：工作岗位不足。 写一段可能发生的对话。\n答案：", "target": "男：亲爱的网友们，大家好，欢迎收看我们的节目。又到了大学生毕业的季节，大学生们要走出校园进入社会，开始新的人生里程了。我们今天节目的主题就是大家关心的大学生就业问题。我们有幸请到了李教授。您好，李教授。女：您好。男：李教授，您能不能先给我们介绍一下最近几年大学生就业的基本情况？女：好的，最近几年我们国家各高校不断扩招，毕业生数量也逐年增加。这当然使年轻人有了更多的学习机会，对提高国民素质有非常积极的促进作用，但是同时也带来一些消极影响。因为社会上的工作岗位有限，需求已趋于饱和，大量的毕业生涌入社会，没有足够的岗位提供给他们，这就从根本上造成了今天大学生就业压力大、就业形势严峻的局面。男：面对这样的形势，有没有什么应对或者解决的措施？女：从政府方面来说，要做的工作是帮助学校和用人单位搭建一个互动的平台，制定相应的保障机制，应对大学生毕业后马上失业的状况；从学校方面来说，应该给学生提供相应的就业指导，包括求职准备、求职技巧、求职礼仪等，帮助学生搜集就业信息，对学生进行心理辅导等等；从学生自身来说，应该认清就业形势，正确给自己定位，提高自身的综合能力和素质，积极主动地寻找和把握工作机会。男：其实很多用人单位还是有招聘需要的，有很多岗位是需要人才的，一方面大学生找不到工作，而另一方面企业招不到人。女：没错，不同的岗位对人才的要求是不同的。比如专业，某些岗位就是需要特定专业的人，其他专业的毕业生不能胜任。还有经验，工作经验是非常重要的，比如一些管理岗位，没有工作经验的人是不可能得到这样的职位的。另外一个重要的原因，是毕

In [37]:
# Merge the public test set files
!rm -rf pCLUE_test_public.json
!cat pCLUE_test_public_1.json pCLUE_test_public_2.json >> pCLUE_test_public.json
!wc -l pCLUE_test_public.json

129556 pCLUE_test_public.json


In [50]:
# Run prediction on the public test set and write the results to a file
def predict_on_test(source_file,target_file,select_top):
  with open(source_file,'r',encoding='utf-8') as f:
    lines=f.readlines()
  if select_top!=-1: # select_top==-1 --> predict on all lines; otherwise use only the first select_top lines
    lines=lines[0:select_top] 
  print("length of lines:",len(lines))
  # the context manager closes (and flushes) the prediction file before it is evaluated
  with open(target_file,'w',encoding='utf-8') as target_object:
    for i,line in enumerate(lines):
      json_string_right=json.loads(line)
      input_string=json_string_right["input"]
      task_type=json_string_right["type"]

      predict_answer=answer_fn(input_string)
      json_string_predict={"target":predict_answer.strip(),"type":task_type}
      json_string_predict=json.dumps(json_string_predict,ensure_ascii=False)
      target_object.write(json_string_predict+"\n")
      if i%100==0: 
        print(i,"input_string:",input_string,";predict:",predict_answer)

select_top=1000 # TODO: change select_top to a larger number, or use the full dataset
source_file='pCLUE_test_public.json'
target_file='pCLUE_test_public_predict.json'
predict_on_test(source_file,target_file,select_top)

length of lines: 1000
0 input_string: 哪个类别最好的描述了这篇新闻？扣篮王拉文：精彩暴扣表演！炸
选项：故事，文化，娱乐，体育，财经，房产，汽车，教育，科技，军事，旅游，国际，股票，农业，游戏
答案： ;predict: 体育
100 input_string: 哪个类别最好的描述了这篇新闻？新泽西寄宿高中推荐！好位置！好学校！一共就6所，抓紧时间！
选项：故事，文化，娱乐，体育，财经，房产，汽车，教育，科技，军事，旅游，国际，股票，农业，游戏
答案： ;predict: 教育
200 input_string: 下面两个句子语义是“相同”或“不同”？“点开花呗进去显示系统繁忙”，“点花呗 显示系统繁忙”。选项：相同，不同。答案： ;predict: 不同
300 input_string: 给定“抓好传染病、地方病、青少年近视防治”是否遵循“已经不会有人再得传染病、地方病和近视了”是的,不是,或也许？
答案： ;predict: 是的
400 input_string: “哎,她在完全没道理,后来.”问题：“我听到了她所说的东西”真的,假的,或未知？
答案： ;predict: 真的
500 input_string: 这是一个完型填空任务。候选的词语有这些：旷日持久，顺理成章，遥遥无期，作如是观，白日做梦，拳打脚踢，一模一样，势均力敌，指日可待，不了了之。文章内容为：
小凯说他4岁的时候，父母离婚了，他一直跟着母亲过。从小母亲对他要求十分严格，放学后必须回家，很少有机会和别的同学玩耍、沟通，去超市买东西、去操场打篮球、健身，母亲也要陪护在身边。从小学到初中，母亲一直这样呵护自己长大。因为长的白净，性格又腼腆，男孩离他远远的，女孩都愿意和他交朋友。到了技校后，班里的女生王某经常约他出去打篮球。后来王某提出要和他“交朋友”，还要带他到她家里玩儿，他没答应。一天晚上10点多了，其他班级三四个男生把他拖到操场上__，其中一名男生拽着他的衣领说：“王*是我们的姊妹儿，她能看好你，是你的荣幸，如果不答应，我们看见你就揍！”后来他们又打过他三四次，小凯没敢告诉老师，他偷偷告诉了舅舅，让舅舅找人去惩罚那几个“野蛮男生”。舅舅和母亲去找过学校，校方表示将调查落实，后来#idiom578603#。今年6月，小凯再次被打，贺女士把此事反映给了辖区

In [52]:
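# Evaluate with the evaluation script
"""
Script: https://github.com/CLUEbenchmark/pCLUE/blob/main/evaluate_pclue.py
Computes the overall pCLUE score and the per-task sub-scores
"""
def f1_sim(text_a, text_b):
    """F1 similarity.
    Computes the length of the longest common subsequence of the two texts,
    multiplies it by 2, and divides by the sum of their lengths.
    pylcs is recommended for this because it is fast.
    """
    if not text_a and not text_b:
        return 0.
    else:
        lcs = pylcs.lcs(text_a, text_b)
        return 2. * lcs / (len(text_a) + len(text_b))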

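def rouge_l_zh(target, pred):
    """Compute the ROUGE-L score, a metric commonly used to evaluate automatic summarization and translation.
    target: reference text
    pred: predicted text"""
    if not (isinstance(target, str) and isinstance(pred, str)):
        # both arguments must be strings; score the pair as 0 so that averaging still works
        print("target or pred is not a string; please check!")
        return 0.
    else:
        rouge = Rouge()
        scores = rouge.get_scores(" ".join(list(pred)), " ".join(list(target)))
        score = scores[0]["rouge-l"]
        return score["f"]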

def normalize(text):
    """Simple text normalization: lowercase and collapse whitespace.
    """
    return ' '.join(text.lower().split())

def evaluate_pclue_fn(predict_file,target_file,select_top):
    """
    计算pclue的成绩
    :param predict_file: 预测文件
    :param target_file:  正确的文件
    :return: 一个dict，包括总分score，以及各个部分的分数（mrc, generate, classify, nli）
    """
    predict_lines=open(predict_file,'r').readlines()
    target_lines=open(target_file,'r').readlines()
    
    predict_lines=predict_lines[0:select_top]
    target_lines=target_lines[0:select_top]
    # 1.记录
    classify_list=[]
    mrc_list=[]
    generate_list=[]
    nli_list=[]
    for i, target_line in enumerate(target_lines):
        # e.g. target_line = {"target": "不同"}
        predict_line=predict_lines[i]
        # Replace full-width commas that could break parsing, so the line loads as JSON
        target_record=json.loads(target_line.replace("，",","))
        target_answer=target_record["target"] # ground-truth label
        if isinstance(target_answer, list):  # join a list into one string, e.g. for keyword generation
            target_answer = "，".join(target_answer)
        target_answer=normalize(target_answer)
        predict_answer=json.loads(predict_line)["target"] # predicted label
        predict_answer=normalize(predict_answer)
        if len(predict_answer)==0:
            predict_answer="无答案" # placeholder meaning "no answer"
        if i%100==0:
            print(i,"target_answer:",target_answer,";predict_answer:",predict_answer,"length of predict_answer:",len(predict_answer))

        task_type=target_record["type"] # named task_type to avoid shadowing the built-in type
        if task_type=='classify' or task_type=='anaphora_resolution': # classification
            classify_list.append(target_answer==predict_answer)
        elif task_type=='mrc': # machine reading comprehension
            em=1 if target_answer==predict_answer else 0
            f1=f1_sim(predict_answer,target_answer)
            mrc_list.append((em, f1))
        elif task_type=='generate': # generation
            rouge_l=rouge_l_zh(target_answer, predict_answer)
            generate_list.append(rouge_l)
        elif task_type=='nli': # natural language inference
            nli_list.append(target_answer==predict_answer)
        else:
            print("error...predict_line:",predict_line,";target_line:",target_line)
            break # stop processing on an unknown task type
        if i<10: print(i, 'target_answer:',target_answer,";predict_answer:",predict_answer) # show the first few examples

    # 2. Compute the final scores
    classify_score=np.average(classify_list)
    nli_score=np.average(nli_list)
    generate_score=np.average(generate_list)
    mrc_em_score=np.average([x[0] for x in mrc_list])
    mrc_f1_score=np.average([x[1] for x in mrc_list])
    mrc_score=np.average([mrc_em_score,mrc_f1_score])
    # Overall score: unweighted mean of the four task scores
    score=np.average([classify_score,nli_score,generate_score,mrc_score])
    # Collect the scores
    result_dict={"score":score,"classify_score":classify_score,"nli_score":nli_score,"generate_score":generate_score,
                 "mrc_em_score":mrc_em_score,"mrc_f1_score":mrc_f1_score}
    return result_dict

# Prediction file and ground-truth file
target_file='pCLUE_test_public.json'
predict_file='pCLUE_test_public_predict.json'
result=evaluate_pclue_fn(predict_file,target_file,select_top)
print("result:",result)

0 target_answer: 电竞 ;predict_answer: 体育 length of predict_answer: 2
0 target_answer: 电竞 ;predict_answer: 体育
1 target_answer: 休闲益智 ;predict_answer: 休闲益智
2 target_answer: 孤立无援 ;predict_answer: 孤立无援
3 target_answer: 军事 ;predict_answer: 教育
4 target_answer: 约会社交 ;predict_answer: 视频
5 target_answer: 怕丢脸 ;predict_answer: 奖励他
6 target_answer: 是的 ;predict_answer: 是的
7 target_answer: 也许 ;predict_answer: 是的
8 target_answer: 工具 ;predict_answer: 办公
9 target_answer: 足三两是哪个品牌的招牌食品之一？ ;predict_answer: 麦当劳的餐牌上足-{}-三两及double足-{}-三两都会以小字体加上「烹调前」标签,以符合香港海关《商品说明条例》的规定。
100 target_answer: 教育 ;predict_answer: 教育 length of predict_answer: 2
200 target_answer: 相同 ;predict_answer: 不同 length of predict_answer: 2
300 target_answer: 不是 ;predict_answer: 是的 length of predict_answer: 2
400 target_answer: 未知 ;predict_answer: 真的 length of predict_answer: 2
500 target_answer: 拳打脚踢 ;predict_answer: 拳打脚踢 length of predict_answer: 4
600 target_answer: 电竞 ;predict_answer: 娱乐 length of predict_answer: 2
700 target_answer: 男：
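
The mrc F1 above depends on `pylcs.lcs`, a compiled extension. As a minimal sketch of what that score computes, the pure-Python version below applies the same formula, 2 * LCS / (len(a) + len(b)), using a standard dynamic-programming LCS. The names `lcs_length` and `f1_sim_py` are hypothetical and are not part of the notebook or of pylcs; the sketch is meant for checking small examples by hand, not for speed.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b.

    Classic dynamic programming, keeping only the previous row:
    time O(len(a) * len(b)), memory O(len(b)).
    """
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Extend the best subsequence that ends before both characters
                curr.append(prev[j - 1] + 1)
            else:
                # Drop one character from either string and keep the better result
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def f1_sim_py(text_a: str, text_b: str) -> float:
    """LCS-based similarity: 2 * LCS / (len(text_a) + len(text_b))."""
    if not text_a and not text_b:
        return 0.0
    return 2.0 * lcs_length(text_a, text_b) / (len(text_a) + len(text_b))
```

For instance, "abcde" and "ace" share the subsequence "ace" (length 3), so the score is 2 * 3 / (5 + 3) = 0.75; identical strings score 1.0 and strings with no common character score 0.0.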

# Generate predictions on the test set and submit

In [49]:
# Download the two test-set shards and merge them into pCLUE_test.json
!wget https://raw.githubusercontent.com/CLUEbenchmark/pCLUE/main/datasets/pCLUE_test_1.json
!wget https://raw.githubusercontent.com/CLUEbenchmark/pCLUE/main/datasets/pCLUE_test_2.json
!rm -rf pCLUE_test.json
!cat pCLUE_test_1.json pCLUE_test_2.json >> pCLUE_test.json
!wc -l pCLUE_test.json

--2022-10-04 03:36:36--  https://raw.githubusercontent.com/CLUEbenchmark/pCLUE/main/datasets/pCLUE_test_1.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 89403305 (85M) [text/plain]
Saving to: ‘pCLUE_test_1.json’


2022-10-04 03:36:40 (445 MB/s) - ‘pCLUE_test_1.json’ saved [89403305/89403305]

--2022-10-04 03:36:40--  https://raw.githubusercontent.com/CLUEbenchmark/pCLUE/main/datasets/pCLUE_test_2.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 97299757 (93M) [text/plain]
Saving to: ‘pCLUE_test_2.json’


2022-10-04 03:36

In [None]:
# Run prediction on the test set and write the prediction file
source_file='pCLUE_test.json'
target_file='pCLUE_predict.json'
select_top=-1 # -1: predict on the full test set
predict_on_test(source_file,target_file,select_top)

length of lines: 250461
0 input_string: 下面两个句子语义是“相同”或“不同”？“我的蚂蚁借呗和花呗用不了啦”，“花呗借呗现在还不能用” 选项：相同，不同。答案： ;predict: 不同
100 input_string: 阅读短文：
柳宗悦也是如此这般对比美术与工艺的，他说:如果只有美是美之通途，这样的希望就过于渺茫，因为美术是少数天才所胜任的工作。给予__以美之通途，只有工艺之道。即使没有文化的人，与神的邂逅机缘也是相同的。  
 从候选成语“凡夫俗子，举目无亲，同心同德，拉帮结派，数一数二，情不自禁，大惊失色，盛气凌人，微不足道，明哲保身”中选出最适合填在下划线处的成语。正确答案是： ;predict: 举目无亲
200 input_string: 我想知道下面两句话的意思是否相同。“美团为什么不能使用花呗付款“，”为什么在美团外卖上用不了花呗”是相同的吗？选项：相同，不同。答案： ;predict: 不同
300 input_string: 
段落：生日那天，保罗的哥哥送给他一辆新车。当保罗离开办公室时，一个男孩儿看着那辆新车，很羡慕地问：“先生，这是您的车？”保罗点点头：“这是我哥哥送给我的生日礼物。”男孩儿吃惊地说：“你是说这是你哥哥送的礼物？„„我也好希望能„„”保罗以为他是希望能有个送他车子的哥哥，但那男孩儿却说，“我希望自己能成为送车给弟弟的哥哥。”保罗对他说：“你要不要坐我的车去兜风？”男孩儿高兴地坐上车，车子开了一会儿以后，那男孩儿小心地说：“先生，你能不能把车开到我家门前？”保罗心想那男孩儿一定是想要告诉他认识的人，他坐了一辆新车子回家。没想到保罗这次又猜错了。男孩儿下了车，过了一会儿保罗听到他回来的声音，但是动作有些缓慢。原来他扶着脚有毛病的弟弟出来了，他扶着弟弟在台阶上坐下，指着那辆新车。只听那男孩儿告诉弟弟：“你看，这就是保罗的哥哥送给他的新车。将来我也会送给你一辆这样的车，到那时你就可以不用每天都呆在家里了。”那个生日，保罗才真正体会到“给予比接受更幸福”的道理。 
问：这个故事主要告诉我们什么道理 选项：应该关心别人，“给”更让人幸福，保罗是一个好人，应该送给亲人礼物。答案： ;predict: “给”更让人幸福
400 input_string: 天蝎座ν (ν Sco / ν Scorpi
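
Before uploading `pCLUE_predict.json`, it can help to confirm that the file is well formed. The sketch below assumes, based on how `evaluate_pclue_fn` reads predictions, that each line is a JSON object with a string `target` field and that there should be one prediction per input line; the actual submission requirements may differ. `check_prediction_file` is a hypothetical helper, not part of the notebook.

```python
import json


def check_prediction_file(predict_path, source_path):
    """Return a list of problems found; an empty list means the file looks consistent."""
    problems = []

    # Count the non-empty input lines
    with open(source_path, "r", encoding="utf-8") as f:
        n_source = sum(1 for line in f if line.strip())

    n_pred = 0
    with open(predict_path, "r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore blank lines
            n_pred += 1
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                problems.append(f"line {line_no}: not valid JSON")
                continue
            # Assumed format: {"target": "<predicted text>"}
            if not isinstance(record, dict) or not isinstance(record.get("target"), str):
                problems.append(f"line {line_no}: missing string field 'target'")

    if n_pred != n_source:
        problems.append(f"line count mismatch: {n_pred} predictions vs {n_source} inputs")
    return problems
```

Calling `check_prediction_file('pCLUE_predict.json', 'pCLUE_test.json')` returns an empty list when every line parses and the line counts agree.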

This notebook is adapted from the following project, modified to work with the PromptCLUE model and the pCLUE dataset: https://github.com/Shivanandroy/T5-Finetuning-PyTorch