[DATASET] Add cnn_dailymail dataset #1061

Merged: 5 commits, Sep 28, 2021

Member

@gongel gongel commented Sep 22, 2021

PR types

New features

PR changes

Others

Description

  • Add the cnn_dailymail dataset
  • Usage:
from paddlenlp.datasets import load_dataset
train_set = load_dataset("cnn_dailymail", splits=["train"])  # version defaults to "3.0.0"
train_set, dev_set, test_set = load_dataset("cnn_dailymail", splits=["train", "dev", "test"], version="3.0.0")

Member

@ZeyuChen ZeyuChen left a comment

Is it possible to just use the latest 3.0.0 version instead of involving the tfds concept?

# Make article into a single string
article = " ".join(article_lines)

if tfds_version >= "2.0.0":
Member

Don't bring the tfds_version concept into the paddlenlp dataset. Is it possible to just use the latest version?

Member Author

I think paddlenlp should be consistent with the original data. For ease of use, version defaults to "3.0.0".

Contributor

Let's just use version here.

Member Author

"Let's just use version here."

Done, thx
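
A side note on the quoted check `tfds_version >= "2.0.0"`: if the version is held as a plain string (an assumption, since the excerpt does not show its type), the comparison is lexicographic and only behaves as intended while every component is a single digit. Below is a minimal sketch of a numeric comparison. `parse_version` and `version_at_least` are hypothetical helper names and are not part of paddlenlp.

```python
def parse_version(version: str) -> tuple:
    """Turn a dotted version such as "3.0.0" into a tuple of ints, e.g. (3, 0, 0)."""
    return tuple(int(part) for part in version.split("."))


def version_at_least(version: str, minimum: str) -> bool:
    """Compare two dotted versions numerically, component by component."""
    return parse_version(version) >= parse_version(minimum)


# Plain string comparison works for single-digit components...
print("3.0.0" >= "2.0.0")                    # True
# ...but compares character by character, so "1" < "2" decides the result here.
print("10.0.0" >= "2.0.0")                   # False
# The numeric comparison gives the intended answer.
print(version_at_least("10.0.0", "2.0.0"))   # True
```

If the loader already wraps the version in an object with its own ordering, a helper like this is unnecessary.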

Contributor

@smallv0221 smallv0221 left a comment

I tested it and no data was read. Have you tested it?

Member Author

gongel commented Sep 26, 2021

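"I tested it and no data was read. Have you tested it?"

It works fine in my tests. The first load is slow because the data has to be decompressed.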
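
For anyone reproducing the exchange above, a quick sanity check can confirm that a loaded split actually yields examples. This is only a sketch: `check_split` is a hypothetical helper, and the field names `article` and `highlights` are assumed from the upstream dataset format and are not confirmed by this PR.

```python
def check_split(examples, required_fields=("article", "highlights"), sample_size=3):
    """Inspect up to `sample_size` examples and return how many were checked.

    Raises ValueError if the split yields no examples or if an inspected
    example lacks one of the required fields.
    """
    checked = 0
    for example in examples:
        missing = [field for field in required_fields if field not in example]
        if missing:
            raise ValueError(f"example {checked} is missing fields: {missing}")
        checked += 1
        if checked >= sample_size:
            break
    if checked == 0:
        raise ValueError("split is empty: no examples were read")
    return checked


# Intended use (not run here; the first load may be slow while the archive
# is downloaded and decompressed):
#   from paddlenlp.datasets import load_dataset
#   train_set = load_dataset("cnn_dailymail", splits=["train"])
#   check_split(train_set)
```

Checking only a few examples keeps the test fast once the data has been cached.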

Contributor

@smallv0221 smallv0221 left a comment

LGTM

@gongel gongel merged commit 5f0e8c2 into PaddlePaddle:develop Sep 28, 2021
@gongel gongel deleted the dataset_cnn_dm branch September 30, 2021 07:55
3 participants