UIE text information extraction fine-tuning problem #2895

JoewithAmma · 2022-07-27T08:22:21Z

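I currently have two problems.
1. I am fine-tuning UIE. I annotate with Doccano and export the annotations from Doccano as a jsonl file. Passing that file to doccano.py fails with UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa5 in position 37: invalid start byte, so I have to convert the file to UTF-8 by hand first.

2. After annotating, I use doccano.py to split the data into training, dev, and test sets, and I have confirmed that the resulting files are not empty. But when I pass them to finetune.py, training always stops after 1 epoch without any error message. Predicting on new files with the model_10 checkpoint that was saved gives poor results.

Has anyone run into the same situation?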
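
Editorial note: the byte 0xa5 in the error is not a valid UTF-8 start byte, which suggests the exported file is in a legacy encoding rather than UTF-8. On a Traditional Chinese Windows system that is plausibly Big5/cp950, but the thread does not confirm this, so the candidate list below is an assumption. A minimal sketch that tries a few encodings and rewrites the file as UTF-8 before it is passed to doccano.py (the function name and file names are hypothetical):

```python
from pathlib import Path


def reencode_to_utf8(src, dst, candidates=("utf-8", "cp950")):
    """Decode src with the first candidate encoding that works, write dst as UTF-8.

    Returns the name of the encoding that succeeded.
    """
    raw = Path(src).read_bytes()
    for encoding in candidates:
        try:
            text = raw.decode(encoding)
        except UnicodeDecodeError:
            continue
        # Writing bytes keeps the original line endings untouched.
        Path(dst).write_bytes(text.encode("utf-8"))
        return encoding
    raise ValueError(f"none of {candidates} could decode {src}")
```

For example, reencode_to_utf8("export.jsonl", "export_utf8.jsonl") would return the encoding that decoded the file. A successful decode does not prove the guess was right, so it is worth opening the output and checking that the text is readable.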

linjieccc · 2022-07-27T11:32:18Z

@JoewithAmma Hello, please provide the configuration you used for fine-tuning and information about your environment.

JoewithAmma · 2022-07-27T22:40:41Z

Hello,

My software environment: paddlenlp 2.3.4, paddlepaddle-gpu 2.3.1.post101, Windows 10, Python 3.8.13, CUDA 10.1, cuDNN 7.6.5.

Hardware environment:
Operating system: Windows 10 Home 64-bit
System model: ASUS TUF Gaming A15 FA506IU_FA506IU
BIOS: FA506IU.319
Processor: AMD Ryzen 7 4800H with Radeon Graphics (16 CPUs), ~2.9GHz
Memory: 16384MB RAM
Page file: 14996MB used, 8949MB available
DirectX version: DirectX 12

Display Devices

       Card name: AMD Radeon(TM) Graphics
    Manufacturer: Advanced Micro Devices, Inc.
       Chip type: AMD Radeon Graphics Processor (0x1636)
        DAC type: Internal DAC(400MHz)
     Device Type: Full Device (POST)

       Card name: NVIDIA GeForce GTX 1660 Ti
    Manufacturer: NVIDIA
       Chip type: NVIDIA GeForce GTX 1660 Ti
        DAC type: Integrated RAMDAC
     Device Type: Full Device
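
Editorial note: version details like those above can also be gathered programmatically, which avoids transcription slips when reporting an environment. A minimal sketch using only the standard library; the package names queried are taken from this post, and the function name is hypothetical:

```python
import platform
from importlib import metadata


def collect_env(packages=("paddlepaddle-gpu", "paddlepaddle", "paddlenlp")):
    """Return the Python version, OS, and installed versions of the given packages."""
    info = {"python": platform.python_version(), "os": platform.platform()}
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            info[name] = None  # not installed in this environment
    return info


if __name__ == "__main__":
    for key, value in collect_env().items():
        print(f"{key}: {value}")
```

Since the thread ends with a change of Python version resolving the problem, the interpreter version is a useful field to include in such a report.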

linjieccc · 2022-07-28T02:26:21Z

How many examples are in the training set? You could try batch_size=1 and see whether the problem persists.

JoewithAmma · 2022-07-28T02:53:19Z

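There are 47 annotated examples in total, which I split into training, dev, and test sets with the ratios 0.8 0.1 0.1.
With batch_size set to 1 the result is still the same.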

linjieccc · 2022-07-28T03:00:52Z

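You could try specifying --device cpu to see whether training runs normally. I'm not sure whether this is an installation problem with the GPU build of PaddlePaddle.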

JoewithAmma · 2022-07-28T03:09:20Z

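With --device cpu, training still does not run normally: global step stops at 10 and epoch is only 1.

After each fine-tuning run, do I need to change any files or delete anything?

I downloaded the whole folder and run it locally.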

linjieccc · 2022-07-28T03:35:55Z

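The first run needs to download the UIE pretrained model; after that nothing needs to be changed.

If convenient, please share a few data samples and the command you ran, and we will try to reproduce the problem.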

JoewithAmma · 2022-07-28T03:50:24Z

Sorry, the data I am using is sensitive, so I cannot send it to you.
Here are the commands I ran:
python doccano.py --doccano_file path.jsonl --splits 0.8 0.1 0.1

python finetune_new.py --train_path ./data/train.txt --dev_path ./data/dev.txt --save_dir ./checkpoint --model uie-tiny --learning_rate 1e-5 --batch_size 2 --max_seq_len 512 --num_epochs 10 --seed 1000 --logging_steps 10 --valid_steps 10

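Could the problem come from how the data was annotated? My source data is court ruling documents, and the main labels are issuing authority, debtor, national ID number, order content, and so on.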
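
Editorial note: with --logging_steps 10, a progress line appears only every 10 global steps, so it helps to know how many steps a complete run should produce. A rough sketch of the arithmetic, assuming one optimizer step per batch and that the last partial batch is kept (neither is confirmed in this thread). The function names are hypothetical, and the number of lines in train.txt may differ from the number of annotated documents, so it should be counted from the file itself:

```python
import math


def expected_steps(num_examples, batch_size, num_epochs):
    """Return (steps per epoch, total steps) for a plain mini-batch loop."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch, steps_per_epoch * num_epochs


def count_examples(path):
    """Count non-empty lines in a training file such as ./data/train.txt."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())
```

As a purely illustrative figure, if train.txt held 37 lines, expected_steps(37, 2, 10) would give 19 steps per epoch and 190 steps in total. Under those assumptions, a run whose last log line is global step 10 ended well before the first epoch finished.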

linjieccc · 2022-07-28T05:03:07Z

You could check whether the length of any annotation result (the extracted span text) exceeds max_seq_len.
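
Editorial note: one way to run that check is to scan the generated training file. The sketch below assumes each line of train.txt is a JSON object with a "result_list" of objects that carry a "text" field; those field names are an assumption about doccano.py's output and should be adjusted to the actual file. Lengths are counted in characters, which only approximates the tokenizer's count.

```python
import json


def find_long_spans(lines, max_seq_len=512):
    """Return (line_number, span_length) for each span longer than max_seq_len."""
    too_long = []
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # "result_list" and "text" are assumed field names.
        for result in record.get("result_list", []):
            span_length = len(result.get("text", ""))
            if span_length > max_seq_len:
                too_long.append((line_number, span_length))
    return too_long
```

The function accepts any iterable of lines, so an open file handle for ./data/train.txt (opened with encoding="utf-8") can be passed directly. An empty result means no span exceeded the limit by this character-based measure.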

JoewithAmma · 2022-07-28T05:05:46Z

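for epoch in range(1, args.num_epochs + 1):
    ...  # everything originally inside the loop
    epoch = epoch + 1

I tried adding this line (epoch = epoch + 1) both inside and outside the epoch loop. The first run completed in full, but from the second run onward it still only runs 1 epoch. I'm not sure where the problem is.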
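
Editorial note: reassigning the loop variable inside a Python for loop over range() does not change which value the next iteration receives, so the added line cannot alter how many epochs run. A small self-contained demonstration (the function name is illustrative):

```python
def epochs_visited(num_epochs):
    """Record the epoch values a for/range loop actually iterates over."""
    visited = []
    for epoch in range(1, num_epochs + 1):
        visited.append(epoch)
        epoch = epoch + 1  # overwritten by range() at the next iteration
    return visited


print(epochs_visited(3))  # [1, 2, 3]
```

This is consistent with the report that the change made no lasting difference; the one run that completed was presumably unrelated to it.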

JoewithAmma · 2022-07-28T05:17:32Z

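Sorry, I don't quite understand. Does the length of the annotation result (the extracted span text) refer to the text content of one line in the txt file?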

linjieccc · 2022-07-28T06:01:47Z

Please try running the default example from the documentation and see whether it also gets stuck: https://github.com/PaddlePaddle/PaddleNLP/tree/develop/model_zoo/uie#%E8%AE%AD%E7%BB%83%E5%AE%9A%E5%88%B6

JoewithAmma · 2022-07-28T06:24:22Z

Running the default example from the documentation still gets stuck. The output is as follows:

import pandas._libs.testing as _testing
[2022-07-28 14:21:16,263] [ INFO] - Downloading resource files...
[2022-07-28 14:21:16,266] [ INFO] - We are using <class 'paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer'> to load 'uie-base'.
W0728 14:21:16.310019 19452 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 7.5, Driver API Version: 11.7, Runtime API Version: 10.1
W0728 14:21:16.317026 19452 gpu_resources.cc:91] device: 0, cuDNN Version: 7.6.
[2022-07-28 14:21:27,019] [ INFO] - global step 10, epoch: 1, loss: 0.00431, speed: 2.06 step/s
[2022-07-28 14:21:30,884] [ INFO] - global step 20, epoch: 1, loss: 0.00346, speed: 2.59 step/s
[2022-07-28 14:21:34,713] [ INFO] - global step 30, epoch: 1, loss: 0.00279, speed: 2.61 step/s
[2022-07-28 14:21:38,525] [ INFO] - global step 40, epoch: 1, loss: 0.00252, speed: 2.62 step/s

JoewithAmma · 2022-07-29T00:48:35Z

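Hello, would you recommend reinstalling the whole environment, or is there something else I can test?

In my recent tests, the first training run of the day executes successfully, but because my machine does not have enough storage capacity I stopped it partway through. Does that have any effect? From the second run onward, training does not succeed and only runs 1 epoch.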

linjieccc · 2022-07-29T02:17:43Z

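1. For the environment, I recommend trying a fresh conda environment or a Docker image. For detailed installation steps, see the official installation guide: https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/install/conda/linux-conda.html

2. If disk space is insufficient, you can comment out this part of the code so that only the best model is kept.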

PaddleNLP/model_zoo/uie/finetune.py

Lines 127 to 135 in f1f3eb7

    
            save_dir = os.path.join(args.save_dir, "model_%d" % global_step)
            if not os.path.exists(save_dir):
                os.makedirs(save_dir)
            model_to_save = model._layers if isinstance(
                model, paddle.DataParallel) else model
            model_to_save.save_pretrained(save_dir)
            logger.disable()
            tokenizer.save_pretrained(save_dir)
            logger.enable()
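
Editorial note: a minimal sketch of the "keep only the best model" idea, independent of the actual finetune.py code. The class name, the "model_best" directory name, and the score passed in are all hypothetical; the only calls taken from the snippet above are save_pretrained on the model and tokenizer, which would be supplied through the save_fn callback.

```python
import os


class BestModelTracker:
    """Save a checkpoint only when the evaluation score improves."""

    def __init__(self, save_dir):
        # A single directory is reused, so disk usage stays at one checkpoint.
        self.best_dir = os.path.join(save_dir, "model_best")
        self.best_score = float("-inf")

    def update(self, score, save_fn):
        """Call save_fn(best_dir) if score beats the best so far; return True if saved."""
        if score <= self.best_score:
            return False
        os.makedirs(self.best_dir, exist_ok=True)
        save_fn(self.best_dir)
        self.best_score = score
        return True


# Hypothetical use inside an evaluation step:
# tracker = BestModelTracker(args.save_dir)
# tracker.update(f1, lambda d: (model_to_save.save_pretrained(d),
#                               tokenizer.save_pretrained(d)))
```

Compared with writing a model_%d directory at every validation step, this overwrites one directory, at the cost of losing intermediate checkpoints.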

JoewithAmma · 2022-07-29T09:24:08Z

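I tried a new conda environment and it still doesn't work; training still only runs 1 epoch. Could there be some file that needs to be deleted, or some place in the code that could make it end early? I tested the training section with only the computation part kept, and the program still only runs 1 epoch.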

JoewithAmma · 2022-08-01T06:49:00Z

Later I switched to Python 3.9 and it ran normally. Thank you for your help.

github-actions · 2022-12-08T02:57:22Z

This issue is stale because it has been open for 60 days with no activity.

github-actions · 2022-12-22T16:16:30Z

This issue was closed because it has been inactive for 14 days since being marked as stale.

LemonNoel assigned LemonNoel and linjieccc and unassigned LemonNoel Jul 27, 2022

github-actions bot added the stale label Dec 8, 2022

github-actions bot closed this as not planned Dec 22, 2022
