Can NER in UIE be trained with my own doccano-annotated data? #4320

Closed
ucas010 opened this issue Jan 3, 2023 · 10 comments
Labels: question, stale

Comments

ucas010 commented Jan 3, 2023

Feature request

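I do not want to fine-tune; I want to train directly on my own data.
The entities in my NER task differ a lot from what bert-base covers: there is only one label, a keyword, and everything else is other.
I would like to train my own model for this case. Is there a script or reference documentation?
Thanks!!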

Motivation

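In many cases entity recognition is not limited to O, per, loc, misc, time, money and so on.
The target may be none of these and may simply be specific words, regardless of their part of speech.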

Your contribution

thx

github-actions bot added the triage label Jan 3, 2023
ucas010 commented Jan 3, 2023

@tonyanhq @ZeyuChen @kztao @QingshuChen
Could you please give me some pointers? Thanks 🙏

JunnYu added the question label and removed the triage label Jan 3, 2023
linjieccc (Contributor) commented

@ucas010 Hi,

NER data annotated with doccano can be used for training by following the UIE custom-training workflow.
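As a rough illustration of what that workflow involves, here is a minimal sketch of turning doccano span annotations into prompt-style training examples. Everything in it is an assumption rather than the repository's actual converter: the doccano export is assumed to be JSONL with a `label` field holding `[start, end, name]` triples, the output field names `content`, `result_list` and `prompt` are my recollection of the UIE training format, and `doccano_to_prompt_examples` is a hypothetical helper. Please check it against the conversion script shipped with the UIE example before relying on it.

```python
import json
from collections import defaultdict


def doccano_to_prompt_examples(jsonl_lines):
    """Turn doccano span annotations into one prompt-style example per label.

    Hypothetical helper: the input and output layouts are assumptions.
    """
    examples = []
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        text = record["text"]
        spans_by_label = defaultdict(list)
        # Assumed doccano layout: "label" is a list of [start, end, name].
        for start, end, name in record.get("label", []):
            spans_by_label[name].append(
                {"text": text[start:end], "start": start, "end": end}
            )
        # One example per entity type, with the type name used as the prompt.
        for name, spans in spans_by_label.items():
            examples.append(
                {"content": text, "result_list": spans, "prompt": name}
            )
    return examples


sample = ['{"id": 1, "text": "call Alice at noon", "label": [[5, 10, "P"]]}']
examples = doccano_to_prompt_examples(sample)
print(json.dumps(examples[0], ensure_ascii=False))
```

With a single keyword-style label, as described in the original question, each annotated text yields at most one example whose prompt is that label name.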

ucas010 commented Jan 4, 2023

Hi, my task is fairly simple. I am using waybill_ie with three classes (P-B, P-I, O), but I hit a bug:
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/data/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/data/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "/data/python3.9/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 217, in _thread_loop
    batch = self._dataset_fetcher.fetch(indices,
  File "/data/python3.9/site-packages/paddle/fluid/dataloader/fetcher.py", line 125, in fetch
    data.append(self.dataset[idx])
  File "/data/python3.9/site-packages/paddlenlp/datasets/dataset.py", line 260, in __getitem__
    return self._transform(self.new_data[idx]) if self._transform_pipline else self.new_data[idx]
  File "/data/lib/python3.9/site-packages/paddlenlp/datasets/dataset.py", line 252, in _transform
    data = fn(data)
  File "/dataPaddleNLP/examples/information_extraction/waybill_ie/run_ernie_crf.py", line 43, in convert_to_features
    tokenized_input["labels"] = [label_vocab[x] for x in labels]
  File "/data/PaddleNLP/examples/information_extraction/waybill_ie/run_ernie_crf.py", line 43, in <listcomp>
    tokenized_input["labels"] = [label_vocab[x] for x in labels]
KeyError: 'OOOOOOOOOOOOOOOP-BP-IP-IP-IP-IP-IP-IOP-BP-IP-IP-IOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOP-BP-IP-IP-IP-IP-IP-IOOOOOOP-BP-IP-IP-IOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOP-BP-IP-IP-IP-IP-IP-IOOOOOOOOOOOOOOOOP-BP-IP-IP-IP-IP-IP-IOP-BP-IP-IP-IOP-BP-IP-IP-IOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO'
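One possible reading of this traceback, not confirmed anywhere in the thread: the failing key is an entire sentence's tags run together as a single string, which is what you would see if the label column of the data file had no separator between tags, so splitting it produced one giant "label". The sketch below shows a quick check for that. It assumes each data line is `text<TAB>labels` and that characters and tags are each joined by a separator character (`\002` here is an assumption about the waybill_ie data format); `parse_line` and `check_example` are hypothetical helpers, not part of PaddleNLP.

```python
SEP = "\002"  # assumed separator between characters and between tags


def parse_line(line, sep=SEP):
    """Split one 'text<TAB>labels' line into parallel token and tag lists."""
    words, labels = line.rstrip("\n").split("\t")
    return words.split(sep), labels.split(sep)


def check_example(tokens, tags, label_vocab):
    """Return a list of human-readable problems; empty means the line is fine."""
    problems = []
    unknown = sorted({t for t in tags if t not in label_vocab})
    if unknown:
        # Truncate long keys so a fused tag string stays readable in the report.
        problems.append(f"unknown tags: {[u[:20] for u in unknown]}")
    if len(tokens) != len(tags):
        problems.append(f"{len(tokens)} tokens but {len(tags)} tags")
    return problems


label_vocab = {"P-B": 0, "P-I": 1, "O": 2}

good = SEP.join("toAl") + "\t" + SEP.join(["O", "O", "P-B", "P-I"]) + "\n"
bad = SEP.join("toAl") + "\t" + "OOP-BP-I" + "\n"  # tags written with no separator

print(check_example(*parse_line(good), label_vocab))
print(check_example(*parse_line(bad), label_vocab))
```

Running a check like this over every line of the training file before launching `run_ernie_crf.py` would show whether the data, rather than the script, is the source of the `KeyError`.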

ucas010 commented Jan 4, 2023

Command I ran:
CUDA_VISIBLE_DEVICES=1 python run_ernie_crf.py --data_dir mydata
With bigru_crf there is no error, but the metrics are all 0. What should I do?
[TRAIN] Epoch:9 - Step:2769 - Loss: 0.000000
[TRAIN] Epoch:9 - Step:2770 - Loss: 0.000000
[TRAIN] Epoch:9 - Step:2771 - Loss: 0.000000
[TRAIN] Epoch:9 - Step:2772 - Loss: 0.000000
[TRAIN] Epoch:9 - Step:2773 - Loss: 0.000000
[TRAIN] Epoch:9 - Step:2774 - Loss: 0.000000
[TRAIN] Epoch:9 - Step:2775 - Loss: 0.000000
[TRAIN] Epoch:9 - Step:2776 - Loss: 0.000000
[TRAIN] Epoch:9 - Step:2777 - Loss: 0.000000
[TRAIN] Epoch:9 - Step:2778 - Loss: 0.000000
[TRAIN] Epoch:9 - Step:2779 - Loss: 0.000000
[TRAIN] Epoch:9 - Step:2780 - Loss: 0.000000
[EVAL] Precision: 0.000000 - Recall: 0.000000 - F1: 0.000000
Looking at the annotations, O is indeed by far the most common tag, so the model predicts O everywhere.
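To make the all-zero metrics concrete, here is a small self-contained sketch of entity-level (chunk-level) scoring for the P-B / P-I / O scheme. It is a simplified stand-in written for illustration, not PaddleNLP's evaluator; `extract_chunks` and `chunk_prf` are hypothetical helpers. It shows that a tagger that outputs only O finds no entities, so precision, recall and F1 are all 0 even though most individual tags are "correct". It does not explain the reported loss of exactly 0.000000, which may point to a separate problem.

```python
from collections import Counter


def extract_chunks(tags):
    """Collect (start, end) spans from tags written as 'P-B', 'P-I', 'O'."""
    chunks, start = set(), None
    for i, tag in enumerate(tags):
        if tag == "P-B":
            # A new entity begins; close any entity that was still open.
            if start is not None:
                chunks.add((start, i))
            start = i
        elif tag == "P-I" and start is not None:
            continue  # still inside the current entity
        else:
            if start is not None:
                chunks.add((start, i))
            start = None
    if start is not None:
        chunks.add((start, len(tags)))
    return chunks


def chunk_prf(gold_tags, pred_tags):
    """Entity-level precision, recall and F1 on exact span matches."""
    gold, pred = extract_chunks(gold_tags), extract_chunks(pred_tags)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


gold = ["O", "O", "P-B", "P-I", "O", "O", "O", "O"]
all_o = ["O"] * len(gold)

print(Counter(gold))           # how skewed the tag distribution is
print(chunk_prf(gold, all_o))  # an all-O tagger scores zero on every metric
print(chunk_prf(gold, gold))   # a perfect tagger, for comparison
```

Counting tags over the whole training set in the same way gives a quick picture of how skewed the data is before deciding whether to rebalance it or add more positive examples.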

ucas010 commented Jan 4, 2023

linjieccc (Contributor) commented

@ucas010 Could you share the code needed to reproduce this? We will try to reproduce the problem.

ucas010 commented Jan 5, 2023

python run_ernie_crf.py --data_dir /data/PaddleNLP/examples/information_extraction/waybill_ie/data2/ --batch_size 256 --epochs 10 --save_dir /data/PaddleNLP/examples/information_extraction/waybill_ie/models/ernie_crf_ckpt/

linjieccc (Contributor) commented

@ucas010 If convenient, could you post the dataset used for fine-tuning, /data/PaddleNLP/examples/information_extraction/waybill_ie/data2/

github-actions bot commented Mar 8, 2023

This issue is stale because it has been open for 60 days with no activity.

github-actions bot added the stale label Mar 8, 2023
github-actions bot commented

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions bot closed this as not planned Mar 23, 2023