refine fleet dataset class api #27133

Merged
merged 15 commits into from
Sep 16, 2020

@yaoxuefeng6 yaoxuefeng6 commented Sep 7, 2020

PR types

Others

PR changes

APIs

Describe

1. Hide some setter methods and other methods in the dataset class. These settings are now applied once in init() by passing specific key-value pairs.
2. Move some methods from the base class to child classes.
3. Modify the related unit tests.
4. Update the example code.
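
The first item can be pictured with a small, framework-free sketch. `SketchDataset` and its private `_set_*` helpers are hypothetical names chosen for illustration and are not Paddle's implementation; the point is only that callers pass key-value pairs to `init()` once instead of calling public setters one by one.

```python
class SketchDataset:
    """Hypothetical stand-in for a dataset class; not Paddle's real code."""

    def __init__(self):
        # Illustrative defaults only.
        self.batch_size = 1
        self.thread_num = 1
        self.pipe_command = "cat"

    # Setters are private: users are not expected to call them directly.
    def _set_batch_size(self, value):
        if value <= 0:
            raise ValueError("batch_size must be positive")
        self.batch_size = value

    def _set_thread_num(self, value):
        self.thread_num = value

    def _set_pipe_command(self, value):
        self.pipe_command = value

    def init(self, **kwargs):
        """Apply every supplied setting by dispatching to its private setter."""
        for key, value in kwargs.items():
            setter = getattr(self, "_set_" + key, None)
            if setter is None:
                raise KeyError("unknown setting: " + key)
            setter(value)


dataset = SketchDataset()
dataset.init(batch_size=32, thread_num=3)
```

Settings that are not passed keep their defaults, and an unrecognized key fails loudly rather than being silently ignored.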

dataset api python demo

import paddle
import paddle.fluid as fluid

slots = ["slot1", "slot2", "slot3", "slot4"]
slots_vars = []
for slot in slots:
    var = fluid.layers.data(name=slot, shape=[1], dtype="int64", lod_level=1)
    slots_vars.append(var)

# create dataset instance directly with distributed.InMemoryDataset
dataset = paddle.distributed.InMemoryDataset()
# call init() to initialize single node related settings once.
dataset.init(
    batch_size=32,
    thread_num=3,
    pipe_command="cat",
    use_var=slots_vars)
# call _init_distributed_settings() to initialize distributed related settings.
dataset._init_distributed_settings(
    fea_eval=True,
    candidate_size=10000)
# call update_settings to update specific settings.
dataset.update_settings(batch_size=2)
dataset.set_filelist(
    ["test_run_with_dump_a.txt", "test_run_with_dump_b.txt"])
dataset.load_into_memory()
dataset.local_shuffle()

place = paddle.CUDAPlace(0) if paddle.fluid.core.is_compiled_with_cuda() else paddle.CPUPlace()
exe = paddle.static.Executor(place)
startup_program = paddle.static.Program()
main_program = paddle.static.Program()
exe.run(startup_program)

exe.train_from_dataset(main_program, dataset)

Updates according to review comments:
1. Only expose InMemoryDataset and QueueDataset in paddle.distributed; both can be created directly without using a factory.
2. Add an init_distributed_settings() method to set advanced distributed-related settings.
3. Add an update_settings() method to update some settings on an existing dataset instance.
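
A rough sketch of how these three entry points could divide the work, using plain Python with no Paddle dependency. The class name, the key groupings, and the validation behavior are assumptions made for illustration; the actual classes in this PR may accept different keys and behave differently.

```python
class SketchInMemoryDataset:
    """Hypothetical illustration only; names and key sets are assumptions."""

    _LOCAL_KEYS = {"batch_size", "thread_num", "pipe_command"}
    _DISTRIBUTED_KEYS = {"fea_eval", "candidate_size"}

    def __init__(self):
        self.settings = {}

    def init(self, **kwargs):
        # Single-node settings, intended to be passed once.
        self._apply(kwargs, self._LOCAL_KEYS)

    def init_distributed_settings(self, **kwargs):
        # Advanced settings that only matter for distributed runs.
        self._apply(kwargs, self._DISTRIBUTED_KEYS)

    def update_settings(self, **kwargs):
        # Change any subset of settings on an existing instance.
        self._apply(kwargs, self._LOCAL_KEYS | self._DISTRIBUTED_KEYS)

    def _apply(self, kwargs, allowed):
        unknown = set(kwargs) - allowed
        if unknown:
            raise KeyError("unsupported settings: %s" % sorted(unknown))
        self.settings.update(kwargs)


ds = SketchInMemoryDataset()
ds.init(batch_size=32, thread_num=3)
ds.init_distributed_settings(fea_eval=True, candidate_size=10000)
ds.update_settings(batch_size=2)
```

In this sketch `update_settings()` overwrites only the keys it receives, so `thread_num` and the distributed settings keep the values given earlier.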

paddle-bot-old bot commented Sep 7, 2020

Thanks for your contribution!
Please wait for the CI result first. See the Paddle CI Manual for details.

@@ -991,6 +908,107 @@ def __init__(self):
        self.boxps = core.BoxPS(self.dataset)
        self.proto_desc.name = "PaddleBoxDataFeed"

    def init(self, **kwargs):
        """
        should be called only once in user's python scripts to initialize seetings of dataset instance

typo: seetings

@guru4elephant guru4elephant left a comment

LGTM

"get_world_size",
"prepare_context",
"ParallelEnv",
"init_parallel_env", "get_rank", "get_world_size", "prepare_context",

Better to keep the original layout and add a trailing comma on the last line.
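
The layout being asked for might look like the sketch below. The list is reassembled from the names visible in the hunk above, and the variable names and the appended entry are only illustrative assumptions, not the file's actual contents.

```python
# Hypothetical export list: one name per line, with a trailing comma.
exported_names = [
    "init_parallel_env",
    "get_rank",
    "get_world_size",
    "prepare_context",
    "ParallelEnv",
]

# Adding a name later touches a single line, so the diff stays minimal.
extended_names = exported_names + ["InMemoryDataset"]
```

With one entry per line and a trailing comma, a later addition shows up in review as one added line rather than a rewrapped block.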

@yaoxuefeng6 yaoxuefeng6 requested review from XiaoguangHu01 and jzhang533 and removed request for fuyinno4, ForFishes and hutuxian September 15, 2020 11:27

@jzhang533 jzhang533 left a comment

  • Deprecate the APIs in paddle.fluid.dataset, as discussed in "move dataset from paddfle.fluid to paddle.fleet" #25887.
  • Add examples that demonstrate working with dygraph mode.
  • Add examples that demonstrate working with the APIs in paddle.static.
  • Add docs that show typical scenarios for using these APIs.
  • What is the format of the files passed to set_filelist?
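
On the last question, this thread does not spell out the file format. As an assumption (based on the MultiSlot-style text layout used elsewhere in Paddle, and not confirmed here), each line may hold, for every slot in order, a count followed by that many feature ids. The helper `parse_multislot_line` below is hypothetical and only illustrates that reading:

```python
def parse_multislot_line(line, slot_names):
    """Split one text line into {slot_name: [ids]}.

    Assumed layout (not confirmed by this PR): for each slot, in order,
    an integer count followed by that many integer feature ids.
    """
    tokens = line.split()
    pos = 0
    parsed = {}
    for name in slot_names:
        count = int(tokens[pos])
        pos += 1
        parsed[name] = [int(t) for t in tokens[pos:pos + count]]
        if len(parsed[name]) != count:
            raise ValueError("line ends before slot %r is complete" % name)
        pos += count
    if pos != len(tokens):
        raise ValueError("unexpected trailing tokens")
    return parsed


slots = ["slot1", "slot2", "slot3", "slot4"]
record = parse_multislot_line("1 1 2 3 3 4 5 5 5 5 1 1", slots)
```

Under that assumption, the sample line gives one id for slot1, two for slot2, four for slot3, and one for slot4.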

@jzhang533 jzhang533 left a comment

lgtm
will have followup prs.

@luotao1 luotao1 left a comment

LGTM

@yaoxuefeng6 yaoxuefeng6 merged commit c67c391 into PaddlePaddle:develop Sep 16, 2020
seiriosPlus added a commit to seiriosPlus/Paddle that referenced this pull request Sep 23, 2020
6 participants