
Add model Prohetnet #1698

Merged: 8 commits, Mar 7, 2022

Conversation

d294270681
Contributor

Description

Add new model ProphetNet.
Model weights:
Link: https://pan.baidu.com/s/1FOnd01rNvDJoONYegacq1Q
Extraction code: o28q
Tokenizer vocab file:
Link: https://pan.baidu.com/s/1pUxLy6eGTZFqzf85OlIzUg
Extraction code: ltp6

d294270681 changed the title from "add Prohetnet model" to "Add model Prohetnet" on Feb 22, 2022
Contributor

smallv0221 left a comment

Please take another look at the dataset loading. Both the cnn_dailymail and gigaword datasets can be loaded by passing the dataset name to load_dataset. The difference is that the former is a PaddleNLP dataset and the latter is a HuggingFace dataset, but accessing and processing them should work in much the same way.

--epochs=6 \
--lr=0.0001 \
--warmup_init_lr=1e-07 \
--warmup_updates=1000 \
Contributor

It would be better to use warmup_steps here.

Contributor Author

Done.
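
For readers unfamiliar with the flags under discussion, here is a minimal sketch of how `lr`, `warmup_init_lr`, and a `warmup_steps` count could interact. It assumes a linear warmup followed by inverse-square-root decay, which is a common pairing but is not confirmed anywhere in this thread. The function name `warmup_lr` is hypothetical and is not taken from the PR.

```python
def warmup_lr(step, lr=1e-4, warmup_init_lr=1e-7, warmup_steps=1000):
    """Return the learning rate at `step` (assumed schedule, not the PR's code)."""
    if step < warmup_steps:
        # Linear interpolation from warmup_init_lr (step 0) towards lr.
        return warmup_init_lr + (lr - warmup_init_lr) * step / warmup_steps
    # After warmup: decay proportionally to 1 / sqrt(step); equals lr at step == warmup_steps.
    return lr * (warmup_steps / step) ** 0.5


if __name__ == "__main__":
    for s in (0, 500, 1000, 4000):
        print(s, warmup_lr(s))
```

The default values mirror the command-line arguments quoted above. Only the flag name changes from `warmup_updates` to `warmup_steps`; the schedule shape itself is the assumption here.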

test_data_src = 'data/' + args.dataset + '_data/uncased_tok_data/test.src'
test_data_tgt = 'data/' + args.dataset + '_data/uncased_tok_data/test.tgt'

test_dataset = load_dataset(
Contributor

Could the cnn_dailymail dataset built into PaddleNLP be used directly here?

Contributor Author

The original code uses the cnn_dailymail data from the GLGE baseline, which differs slightly from PaddleNLP's cnn_dailymail: the GLGE text contains an extra [S_SEP] tag. I am not sure whether that would affect the results.

Contributor

Then are these two datasets from the GLGE baseline the same as the two datasets on Hugging Face?

Contributor Author

Both cnndm and gigaword differ in some ways.
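
To make the [S_SEP] difference concrete, the sketch below shows one way GLGE-style text could be normalized so that it resembles text without the tag. This is only an illustration: the thread leaves open whether the tag affects results, and the helper name `strip_sentence_separators` is hypothetical rather than part of the PR.

```python
S_SEP = "[S_SEP]"  # sentence-separator tag mentioned in the thread


def strip_sentence_separators(text, sep=S_SEP):
    """Remove every separator tag and collapse the leftover whitespace."""
    # Replace each tag with a space, then split/join to normalize spacing.
    return " ".join(text.replace(sep, " ").split())


if __name__ == "__main__":
    sample = "first sentence . [S_SEP] second sentence ."
    print(strip_sentence_separators(sample))
```

Whether stripping the tag is the right choice depends on how the pretrained weights were produced, which this conversation does not settle.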

from .. import PretrainedTokenizer, BasicTokenizer, WordpieceTokenizer


class Trie:
Contributor

This Trie already exists in the base class, so it should not need to be redefined here.

Contributor Author

Done.

dev_data_src = 'data/' + args.dataset + '_data/uncased_tok_data/dev.src'
dev_data_tgt = 'data/' + args.dataset + '_data/uncased_tok_data/dev.tgt'

train_dataset = load_dataset(
Contributor

The built-in cnn_dailymail dataset should be readable directly here. The gigaword dataset is also available on HuggingFace, and PaddleNLP's load_dataset can read HF datasets as well.

smallv0221
Contributor

If both datasets can be loaded directly by passing the dataset name, some of the data-processing code can probably be dropped.
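
For context, the file-based approach under review amounts to pairing each line of a `.src` file with the same line of a `.tgt` file. The sketch below illustrates that kind of data-processing code under stated assumptions: the directory layout follows the paths quoted in the diff, while the function names `data_paths` and `read_parallel` and the `source`/`target` keys are hypothetical.

```python
import os


def data_paths(dataset, split, root="data"):
    """Build the .src/.tgt paths using the layout quoted in the diff."""
    base = os.path.join(root, dataset + "_data", "uncased_tok_data")
    return os.path.join(base, split + ".src"), os.path.join(base, split + ".tgt")


def read_parallel(src_path, tgt_path):
    """Yield one example per aligned line pair from a .src/.tgt file pair."""
    with open(src_path, encoding="utf-8") as src, open(tgt_path, encoding="utf-8") as tgt:
        # zip stops at the shorter file, so mismatched lengths are truncated silently.
        for source, target in zip(src, tgt):
            yield {"source": source.strip(), "target": target.strip()}
```

Loading by dataset name, as suggested in the review, would make helpers like these unnecessary.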

Contributor

smallv0221 left a comment

Please remove the __init__.py under example.

d294270681
Contributor Author

> Please remove the __init__.py under example.

Done.

Contributor

smallv0221 left a comment

LGTM


3 participants