
【paddle.fleet】refine launch and distributed repr string for print #27093

Merged

merged 5 commits into PaddlePaddle:develop on Sep 9, 2020

Conversation

guru4elephant
Member

@guru4elephant guru4elephant commented Sep 6, 2020

PR types

Function optimization

PR changes

Others

Describe

In Paddle 2.0, paddle.distributed.fleet.DistributedStrategy can be serialized into protobuf, but the serialized form is not pretty when printed. This PR optimizes the log format of DistributedStrategy. Log samples are listed below.

fleetrun log sample:

    +=======================================================================================+
    |                        Distributed Envs                      Value                    |
    +---------------------------------------------------------------------------------------+
    |                 PADDLE_CURRENT_ENDPOINT                 127.0.0.1:25538               |
    |                     PADDLE_TRAINERS_NUM                        4                      |
    |                     FLAGS_selected_gpus                        4                      |
    |                PADDLE_TRAINER_ENDPOINTS  ... 0.1:49874,127.0.0.1:15274,127.0.0.1:58230|
    |                       PADDLE_TRAINER_ID                        0                      |
    +=======================================================================================+
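The boxed key/value layout in the fleetrun sample above can be sketched with a small helper. The following is a minimal stand-alone illustration, not the actual fleet launch code; the helper name `print_env_table` and the fixed width are assumptions for the sketch:

```python
def print_env_table(envs, width=87):
    """Print a boxed key/value table in the style of the fleetrun log:
    a '=' border, a centered two-column header, a '-' rule, then rows."""
    half = (width - 2) // 2
    print("+" + "=" * width + "+")
    print("|" + "Distributed Envs".center(half) + "Value".center(width - half) + "|")
    print("+" + "-" * width + "+")
    for key, value in envs.items():
        # Center each key and value in its column, like the sample log.
        print("|" + str(key).center(half) + str(value).center(width - half) + "|")
    print("+" + "=" * width + "+")

print_env_table({
    "PADDLE_CURRENT_ENDPOINT": "127.0.0.1:25538",
    "PADDLE_TRAINERS_NUM": 4,
    "PADDLE_TRAINER_ID": 0,
})
```

Centering both columns keeps long environment values readable without per-row width calculations.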

DistributedStrategy log sample:

    +==============================================================================+                        
    |                                                                              |
    |                         DistributedStrategy Overview                         |
    |                                                                              |
    +==============================================================================+
    |                     amp = True, please check amp_configs                     |
    +------------------------------------------------------------------------------+
    |                     init_loss_scaling                 32768.0                |
    |                    incr_every_n_steps                   1000                 |
    |               decr_every_n_nan_or_inf                    2                   |
    |                            incr_ratio                   2.0                  |
    |                            decr_ratio              0.800000011921            |
    |              use_dynamic_loss_scaling                   True                 |
    +==============================================================================+
    |               recompute = True, please check recompute_configs               |
    +------------------------------------------------------------------------------+
    |                           checkpoints              pool2d_0.tmp_0            |
    |                                               res2a.add.output.5.tmp_1       |
    |                                               res2b.add.output.5.tmp_1       |
    |                                               res2c.add.output.5.tmp_1       |
    |                                               res3a.add.output.5.tmp_1       |
    |                                               res3b.add.output.5.tmp_1       |
    |                                               res3c.add.output.5.tmp_1       |
    |                                               res3d.add.output.5.tmp_1       |
    |                                               res4a.add.output.5.tmp_1       |
    |                                               res4b.add.output.5.tmp_1       |
    |                                               res4c.add.output.5.tmp_1       |
    |                                               res4d.add.output.5.tmp_1       |
    |                                               res4e.add.output.5.tmp_1       |
    |                                               res4f.add.output.5.tmp_1       |
    |                                               res5a.add.output.5.tmp_1       |
    |                                               res5b.add.output.5.tmp_1       |
    |                                               res5c.add.output.5.tmp_1       |
    |                                                    pool2d_1.tmp_0             |
    |                                                      fc_0.tmp_1               |
    +==============================================================================+
    |                  a_sync = True, please check a_sync_configs                  |
    +------------------------------------------------------------------------------+
    |                               k_steps                    -1                  |
    |                     max_merge_var_num                    1                   |
    |                       send_queue_size                    16                  |
    |               independent_recv_thread                  False                 |
    |         min_send_grad_num_before_recv                    1                   |
    |                      thread_pool_size                    1                   |
    |                       send_wait_times                    1                   |
    |               runtime_split_send_recv                  False                 |
    +==============================================================================+
    |                    Environment Flags, Communication Flags                    |
    +------------------------------------------------------------------------------+
    |                                  mode                    1                   |
    |                               elastic                  False                 |
    |                                  auto                  False                 |
    |                   sync_nccl_allreduce                   True                 |
    |                         nccl_comm_num                    1                   |
    |            use_hierarchical_allreduce                  False                 |
    |   hierarchical_allreduce_inter_nranks                    1                   |
    |                       sync_batch_norm                  False                 |
    |                   fuse_all_reduce_ops                   True                 |
    |                  fuse_grad_size_in_MB                    32                  |
    |              fuse_grad_size_in_TFLOPS                   50.0                 |
    |               cudnn_exhaustive_search                   True                 |
    |             conv_workspace_size_limit                   4000                 |
    |    cudnn_batchnorm_spatial_persistent                   True                 |
    +==============================================================================+
    |                                Build Strategy                                |
    +------------------------------------------------------------------------------+
    |           enable_sequential_execution                  False                 |
    |              fuse_elewise_add_act_ops                  False                 |
    |                       fuse_bn_act_ops                  False                 |
    |              fuse_relu_depthwise_conv                  False                 |
    |                    fuse_broadcast_ops                  False                 |
    |                fuse_all_optimizer_ops                  False                 |
    |                        enable_inplace                  False                 |
    |     enable_backward_optimizer_op_deps                   True                 |
    |                 cache_runtime_context                  False                 |
    +==============================================================================+
    |                              Execution Strategy                              |
    +------------------------------------------------------------------------------+
    |                           num_threads                    1                   |
    |          num_iteration_per_drop_scope                    10                  |
    |                 num_iteration_per_run                    1                   |
    |                    use_thread_barrier                  False                 |
    +==============================================================================+
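The sectioned DistributedStrategy overview above follows the same idea: each section opens with a `=` band and a centered title, a `-` rule separates the title from its rows, and key/value pairs are aligned beneath it. A minimal sketch of that layout (the helper `print_strategy_sections` is hypothetical, not the actual `DistributedStrategy` repr code):

```python
def print_strategy_sections(sections, width=78):
    """Print sectioned key/value tables in the style of the
    DistributedStrategy overview log sample."""
    band = "+" + "=" * width + "+"   # opens/closes each section
    rule = "+" + "-" * width + "+"   # separates title from rows
    half = width // 2
    print(band)
    for title, fields in sections:
        print("|" + title.center(width) + "|")
        print(rule)
        for key, value in fields.items():
            # Right-align keys, center values, as in the sample.
            print("|" + str(key).rjust(half - 2)
                      + str(value).center(width - half + 2) + "|")
        print(band)

print_strategy_sections([
    ("amp = True, please check amp_configs",
     {"init_loss_scaling": 32768.0, "incr_ratio": 2.0}),
    ("Execution Strategy",
     {"num_threads": 1, "use_thread_barrier": False}),
])
```

Grouping options by section header (amp, recompute, a_sync, build, execution) lets a user scan straight to the config block named in the section title.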

@paddle-bot-old

paddle-bot-old bot commented Sep 6, 2020

Thanks for your contribution!
Please wait for the CI result first. See Paddle CI Manual for details.

danleifeng
danleifeng previously approved these changes Sep 6, 2020
@guru4elephant guru4elephant changed the title refine launch and distributed repr string for print 【paddle.fleet】refine launch and distributed repr string for print Sep 7, 2020
@PaddlePaddle PaddlePaddle locked and limited conversation to collaborators Sep 8, 2020
@PaddlePaddle PaddlePaddle unlocked this conversation Sep 8, 2020
@guru4elephant guru4elephant merged commit f7d08b7 into PaddlePaddle:develop Sep 9, 2020
3 participants