【paddle.fleet】Update fleetrun & ps-heter #27472

MrChengmo · 2020-09-22T10:59:33Z

PR types

New features,Bug fixes

PR changes

Others

Describe

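Optimizes the fleetrun startup code for parameter server tasks: adds class ParameterServerLauncher to launch_utils.py to encapsulate the ps launch implementation
Reorganizes the fleetrun launch command: fleetrun --help now describes the arguments grouped by category
Adds support for submitting ps-heter and ps-gpu tasks through fleetrun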

# Simulated with multiple processes on 1 machine: 2 server nodes paired with 2 training nodes; the two training nodes share one GPU card
# 2 servers, 2 workers
export CUDA_VISIBLE_DEVICES=0
fleetrun --server_num=2 --worker_num=2  train.py

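# Simulated with multiple processes on 1 machine: 2 server nodes paired with 2 training nodes and 2 heterogeneous training nodes; each heterogeneous training node occupies one GPU card
# 2 servers, 2 workers, 2 heter_workers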
export CUDA_VISIBLE_DEVICES=0,1
fleetrun --server_num=2 --worker_num=2 --heter_worker_num=2 train.py

# 2 servers, 4 workers, 1 heterogeneous training node
# Every machine is assigned the available device GPU:0
export CUDA_VISIBLE_DEVICES=0
fleetrun --servers="xx.xx.xx.xx:6170,yy.yy.yy.yy:6171" --workers="xx.xx.xx.xx:6172,xx.xx.xx.xx:6173,xx.xx.xx.xx:6174,xx.xx.xx.xx:6175,yy.yy.yy.yy:6176,yy.yy.yy.yy:6177,yy.yy.yy.yy:6178,yy.yy.yy.yy:6179" --heter_workers="xx.xx.xx.xx:6180,yy.yy.yy.yy:6181"  train.py
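
To make the endpoint-list format used by --servers, --workers and --heter_workers concrete, here is a minimal sketch of splitting such a string into (host, port) pairs. The helper name parse_endpoints is hypothetical and is not part of fleetrun; the sketch only assumes the comma-separated "ip:port" form shown in the command above.

```python
def parse_endpoints(endpoints):
    """Split "ip:port,ip:port" into a list of (host, port) tuples."""
    pairs = []
    for item in endpoints.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing comma or stray whitespace
        host, port = item.rsplit(":", 1)
        pairs.append((host, int(port)))
    return pairs


# Usage with the placeholder server addresses from the command above.
servers = parse_endpoints("xx.xx.xx.xx:6170,yy.yy.yy.yy:6171")
```

Each entry identifies one process, so the length of each list gives the number of processes of that role.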

Improves the usability of the parameter server environment variable checks in PaddleCloud RoleMaker

Removed the try/except in _ps_env and changed every os.environ[] lookup to os.getenv()

paddle-bot-old · 2020-09-22T10:59:38Z

Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

paddle-bot-old · 2020-09-22T10:59:40Z

✅ This PR's description meets the template requirements!
Please wait for other CI results.

123malin

LGTM

seiriosPlus · 2020-10-10T10:26:24Z

python/paddle/distributed/fleet/base/role_maker.py

+            self._worker_endpoints = []
+
+        trainers_num = os.getenv("PADDLE_TRAINERS_NUM", None)
+        assert trainers_num != None


Replace the assert with raise ValueError.
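
A minimal sketch of what this suggestion could look like, assuming the value is read from PADDLE_TRAINERS_NUM as in the diff above. The function name read_trainers_num and the error message are hypothetical; this is not the code that was merged.

```python
import os


def read_trainers_num(env=None):
    """Return PADDLE_TRAINERS_NUM as an int, or raise ValueError if unset."""
    # Accept a mapping for easy testing; default to the process environment.
    env = os.environ if env is None else env
    value = env.get("PADDLE_TRAINERS_NUM")
    if value is None:
        # Unlike assert, this check is not stripped under `python -O`
        # and tells the user which variable is missing.
        raise ValueError(
            "PADDLE_TRAINERS_NUM must be set in the parameter server environment")
    return int(value)
```

An explicit exception keeps the check active in optimized runs and gives the user an actionable message instead of a bare AssertionError.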

seiriosPlus · 2020-10-10T10:27:44Z

python/paddle/distributed/fleet/base/role_maker.py

+        trainers_num = int(trainers_num)
+
+        training_role = os.getenv("TRAINING_ROLE", None)
+        assert training_role != None


There is already a check further down; can this one be removed?

danleifeng

LGTM

123malin

LGTM

seiriosPlus · 2020-10-12T06:56:27Z

python/paddle/fluid/tests/unittests/dist_fleet_heter_ctr.py

        fleet.stop_worker()
+        shutil.rmtree(model_path)


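Only node 0 creates model_path. Won't deleting it unconditionally here raise an error?

There is indeed a problem here, but the unit test did not report an error. I'll take a look.

The unit test has been fixed.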
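
The thread does not show how the test was fixed. One possible guard, sketched here under that caveat, is to delete the directory only when it exists, so that workers other than node 0 do not fail. The helper name remove_model_dir is hypothetical.

```python
import os
import shutil
import tempfile


def remove_model_dir(model_path):
    """Delete model_path if it exists; return True when something was removed."""
    if os.path.isdir(model_path):
        shutil.rmtree(model_path)
        return True
    return False


# Usage: the first call removes the directory, the second is a no-op.
demo_dir = tempfile.mkdtemp()
first = remove_model_dir(demo_dir)
second = remove_model_dir(demo_dir)
```

Calling shutil.rmtree(model_path, ignore_errors=True) is a shorter alternative, at the cost of also hiding unrelated failures such as permission errors.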

seiriosPlus · 2020-10-12T08:12:56Z

python/paddle/fluid/incubate/fleet/parameter_server/ir/public.py

@@ -133,6 +133,8 @@ def __init__(self, main_program, startup_program, strategy, role_maker):

        self.origin_main_program = main_program
        self.origin_startup_program = startup_program


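What is the difference between origin main and origin ps main?

origin_main is fluid.default_main_program().

origin_ps_main is the program after the ps graph split but before the heterogeneous graph split. It is saved separately so that the heterogeneous model can be saved later.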

seiriosPlus

LGTM

* refine fleetrun.ps_launch * update fleet run for multi device support * ps_graph support ps-gpu * fix heter save * add heter save unittest * fix unittest & simple code * update fleetrun * fix fleetrun * fix launch barrier * fix role maker * add paddlecloud rolemaker unittest * rename heter_worker_device_guard

refine fleetrun.ps_launch

34bd004

MrChengmo added 3 commits September 22, 2020 19:02

for merge

50ada3a

fix

be70c94

update fleet run for multi device support

4efcb9d

MrChengmo changed the title ~~refine fleetrun.ps_launch~~ 【paddle.fleet】Update fleetrun & ps-heter Sep 23, 2020

MrChengmo added 18 commits September 23, 2020 13:57

ps_graph support ps-gpu

1c57d55

fix

dbbcc43

fix

d3cda7f

revert performance code

291e159

fix

5189880

for merge

03a2be6

for merge

d8c4ddb

fix heter save

e5660af

add heter save unittest

369b955

fix unittest

6ddd726

add heter unittest

b8e873e

fix unittest

aafa019

fix unittest & simple code

d7bb812

fix unittest

da6aaee

update fleetrun

578dd7c

fix fleetrun

f58a290

fix coverage

c987c1f

fix launch barrier

ed20fea

123malin previously approved these changes Oct 9, 2020

simple code

3236ec2

MrChengmo dismissed 123malin’s stale review via 3236ec2 October 9, 2020 04:04

guru4elephant requested review from danleifeng and seiriosPlus October 9, 2020 15:58

guru4elephant requested a review from Thunderbrook October 10, 2020 08:59

seiriosPlus reviewed Oct 10, 2020

fix role maker

f327822

danleifeng previously approved these changes Oct 12, 2020

add paddlecloud rolemaker unittest

7fc7c08

MrChengmo dismissed danleifeng’s stale review via 7fc7c08 October 12, 2020 02:44

123malin previously approved these changes Oct 12, 2020

seiriosPlus reviewed Oct 12, 2020

rename heter_worker_device_guard

2cb76b9

MrChengmo dismissed 123malin’s stale review via 2cb76b9 October 12, 2020 08:18

fix unittest

ebab5dd

seiriosPlus approved these changes Oct 13, 2020

seiriosPlus merged commit c5f2802 into PaddlePaddle:develop Oct 13, 2020
