
After pre-training ernie-1.0 on local data, prediction quality degrades and the results differ on every run #1532

Closed
Zengpr opened this issue Dec 29, 2021 · 15 comments
Labels: pre-training (Issues about pre-training), stale

Comments

@Zengpr

Zengpr commented Dec 29, 2021

Cloze (fill-in-the-blank) problem:

```python
print(predict_mask('有关 [MASK] [MASK] 和单位的同事', topk=2))
print(predict_mask('原则上 [MASK] 构成暂时适用的障碍。', topk=2))
```

Results with the original ernie-1.0:
(screenshot)
The results are stable.

Results after training on local data:
(screenshot)
(screenshot)
The results are different on every run, and the predictions are completely wrong. What could be causing this?
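(For reference: `predict_mask` here is the reporter's own helper, not a PaddleNLP API. The `topk=2` part of its behavior, picking the two highest-scoring vocabulary entries for a mask position from the MLM scores, can be sketched framework-free; the vocabulary and scores below are made up for illustration.)

```python
def topk_tokens(scores, vocab, k=2):
    """Return the k highest-scoring vocabulary tokens for one [MASK] position."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [vocab[i] for i in ranked[:k]]

# toy vocabulary and MLM scores for a single mask position (made-up numbers)
vocab = ['猫', '狗', '人', '花']
scores = [2.5, 1.9, 0.3, -0.7]
print(topk_tokens(scores, vocab, k=2))  # ['猫', '狗']
```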

@wawltor
Collaborator

wawltor commented Dec 29, 2021

Could you check whether the fine-tuned parameters have actually been loaded?

@Zengpr
Author

Zengpr commented Dec 30, 2021

> Could you check whether the fine-tuned parameters have actually been loaded?

They must have been loaded; if they weren't, it wouldn't run at all. Has this example actually been verified to work?

@wawltor
Collaborator

wawltor commented Dec 30, 2021

> They must have been loaded; if they weren't, it wouldn't run at all. Has this example actually been verified to work?

Could you tell us which specific example you are using?

@Zengpr
Author

Zengpr commented Dec 30, 2021

https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/language_model/ernie-1.0
This one. Using the static_vars.pdparams file produced by that training run for the cloze task gives very poor and unstable results, while the original ernie-1.0 (the save_weights.pdparams file) works fine.
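(A side note on debugging this, not from the thread itself: with Paddle, a `.pdparams` checkpoint such as `static_vars.pdparams` loads as a plain dict via `paddle.load`, and `Layer.set_state_dict` skips keys it cannot match, leaving those parameters at their random initialization, which by itself would produce poor, run-to-run-varying predictions. A minimal sketch for diffing the key sets; the parameter names below are hypothetical examples.)

```python
# With Paddle a checkpoint is a plain dict:
#   state = paddle.load('static_vars.pdparams')
#   model.set_state_dict(state)   # mismatched keys are skipped, not errors
# so it is worth diffing the key sets explicitly:

def diff_state_keys(model_keys, ckpt_keys):
    """Return (missing, unexpected): parameters left at random init,
    and checkpoint entries the model ignored."""
    missing = sorted(set(model_keys) - set(ckpt_keys))
    unexpected = sorted(set(ckpt_keys) - set(model_keys))
    return missing, unexpected

# hypothetical key names for illustration
model_keys = ['cls.predictions.decoder_bias',
              'ernie.embeddings.word_embeddings.weight']
ckpt_keys = ['ernie.embeddings.word_embeddings.weight']
print(diff_state_keys(model_keys, ckpt_keys))
# (['cls.predictions.decoder_bias'], [])
```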

@wawltor
Collaborator

wawltor commented Dec 30, 2021

This example provides pre-training capability; you are expected to build your own dataset and run pre-training on it. May I ask what data you used? The original ernie-1.0 is the result of pre-training on a large Chinese corpus; you only need to fine-tune it on your specific task.

@Zengpr
Author

Zengpr commented Dec 30, 2021

I built the data from the sample data provided in the data_tools folder, trained for 2000 steps with the ernie-1.0 scripts, and then ran prediction on the source text of that same sample data. It still cannot predict anything correctly.

@wawltor
Collaborator

wawltor commented Dec 31, 2021

> I built the data from the sample data in data_tools, trained for 2000 steps, and ran prediction on that same sample data. It still cannot predict anything correctly.

Sorry, the data in data_tools is only a small sample, not ERNIE's actual pre-training data. If you want to use ERNIE, just use the original pre-trained ERNIE directly.

@Zengpr
Author

Zengpr commented Dec 31, 2021

> Sorry, the data in data_tools is only a small sample, not ERNIE's actual pre-training data. If you want to use ERNIE, just use the original pre-trained ERNIE directly.

But that is the provided example. If even the provided test sample doesn't work, my own data certainly won't either. So you never actually verified the results? That is frustrating.

@wawltor
Collaborator

wawltor commented Dec 31, 2021

> But that is the provided example. If even the provided test sample doesn't work, my own data certainly won't either. So you never actually verified the results?

Understood, and sorry, we did not make this clear in the documentation. ERNIE's original pre-training corpus is not open source, so we cannot release all of the data. We have, however, tested the open-sourced pre-training framework on ERNIE's pre-training dataset, and it does converge. We will update the documentation and additionally provide a larger dataset that we are able to open-source.

@ZHUI
Collaborator

ZHUI commented Dec 31, 2021

After New Year's Day we will provide a training example on the CLUECorpusSmall dataset, and we will notify you as soon as it is updated.

@Zengpr Zengpr closed this as completed Dec 31, 2021
@Zengpr Zengpr reopened this Dec 31, 2021
@ZeyuChen ZeyuChen added the pre-training Issues about pre-training label Jan 3, 2022
@ZHUI
Collaborator

ZHUI commented Jan 4, 2022

#1555 @Zengpr A tutorial for preparing the CLUECorpusSmall dataset is now available; feel free to try it.

Tutorials for the training process and evaluation will follow shortly.

@ZHUI
Collaborator

ZHUI commented Jan 4, 2022

Detailed training logs and the pre-trained weights at 1M steps are now available. https://github.com/PaddlePaddle/PaddleNLP/pull/1555/files#diff-318be3751cecb9049866fb59aa5aec28a4f07ca0261ae39bee205e7476b8420bR86-R108

Usage example:

```python
import paddle
from paddlenlp.transformers import ErnieForMaskedLM, ErnieTokenizer

tokenizer = ErnieTokenizer.from_pretrained('zhui/cluecorpussmall_ernie-1.0')
model = ErnieForMaskedLM.from_pretrained('zhui/cluecorpussmall_ernie-1.0')

tokens = ['[CLS]', '我', '的', '[MASK]', '很', '可', '爱', '。', '[SEP]']
masked_ids = paddle.to_tensor([tokenizer.convert_tokens_to_ids(tokens)])
segment_ids = paddle.to_tensor([[0] * len(tokens)])

# the model returns MLM prediction scores of shape [batch, seq_len, vocab_size]
prediction_scores = model(masked_ids, token_type_ids=segment_ids)

# position 3 is the [MASK] token
prediction_index = paddle.argmax(prediction_scores[0, 3]).item()
predicted_token = tokenizer.convert_ids_to_tokens([prediction_index])[0]
print(tokens)
# ['[CLS]', '我', '的', '[MASK]', '很', '可', '爱', '。', '[SEP]']
print(predicted_token)
# 猫
```

@Zengpr
Author

Zengpr commented Jan 6, 2022

> Detailed training logs and the pre-trained weights at 1M steps are now available, with the usage example above.

OK, thanks a lot! I'll give it a try later.

@github-actions

This issue is stale because it has been open for 60 days with no activity.

@github-actions github-actions bot added the stale label Jan 25, 2023
@github-actions

github-actions bot commented Feb 8, 2023

This issue was closed because it has been inactive for 14 days since being marked as stale.

@github-actions github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) Feb 8, 2023
4 participants