【paddle.fleet】Update fleetrun & ps-heter #27472

Merged
merged 27 commits into from
Oct 13, 2020

Contributor

@MrChengmo MrChengmo commented Sep 22, 2020

PR types

New features, Bug fixes

PR changes

Others

Describe

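  • Refactored the fleetrun launch code for parameter-server (PS) jobs: launch_utils.py now has a class ParameterServerLauncher that encapsulates the PS launch implementation

  • Reorganized the fleetrun launch command; fleetrun --help now describes the arguments grouped by category

  • fleetrun can now submit ps-heter and ps-gpu jobs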

# Simulate on one machine with multiple processes: 2 server nodes with 2 trainer nodes; the two trainer nodes share a single GPU
# 2 servers, 2 workers
export CUDA_VISIBLE_DEVICES=0
fleetrun --server_num=2 --worker_num=2  train.py
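# Simulate on one machine with multiple processes: 2 server nodes, 2 trainer nodes, and 2 heterogeneous trainer nodes; each heterogeneous trainer node occupies one GPU
# 2 servers, 2 workers, 2 heter_workers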
export CUDA_VISIBLE_DEVICES=0,1
fleetrun --server_num=2 --worker_num=2 --heter_worker_num=2 train.py
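# Two machines (xx.xx.xx.xx and yy.yy.yy.yy): 2 servers in total, with 4 workers and 1 heterogeneous trainer node per machine
# Every machine specifies GPU:0 as its available device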
export CUDA_VISIBLE_DEVICES=0
fleetrun --servers="xx.xx.xx.xx:6170,yy.yy.yy.yy:6171" --workers="xx.xx.xx.xx:6172,xx.xx.xx.xx:6173,xx.xx.xx.xx:6174,xx.xx.xx.xx:6175,yy.yy.yy.yy:6176,yy.yy.yy.yy:6177,yy.yy.yy.yy:6178,yy.yy.yy.yy:6179" --heter_workers="xx.xx.xx.xx:6180,yy.yy.yy.yy:6181"  train.py
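  • Improved the usability of the parameter-server environment variable checks in the PaddleCloud RoleMaker

    Removed the try/except in _ps_env and changed every os.environ[] lookup to os.getenv()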
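
To make the multi-machine command above easier to sanity-check, here is a minimal Python sketch that counts how many endpoints each host receives from a comma-separated `ip:port` list like the ones passed to `--servers`, `--workers`, and `--heter_workers`. The helper `count_per_host` is hypothetical and is not part of fleetrun, and the IP addresses are placeholders standing in for `xx.xx.xx.xx` and `yy.yy.yy.yy`.

```python
from collections import Counter


def count_per_host(endpoints: str) -> dict:
    """Count endpoints per host in a comma-separated "host:port" list.

    Hypothetical helper for illustration only; not part of fleetrun.
    """
    hosts = []
    for endpoint in endpoints.split(","):
        endpoint = endpoint.strip()
        if not endpoint:
            continue
        # Split on the last colon so the port is always the final field.
        host, sep, port = endpoint.rpartition(":")
        if not sep or not host or not port.isdigit():
            raise ValueError(f"malformed endpoint: {endpoint!r}")
        hosts.append(host)
    return dict(Counter(hosts))


# Placeholder hosts; substitute the real machine addresses.
servers = "10.0.0.1:6170,10.0.0.2:6171"
workers = "10.0.0.1:6172,10.0.0.1:6173,10.0.0.2:6176,10.0.0.2:6177"
print(count_per_host(servers))
print(count_per_host(workers))
```

Running a check like this before launching can catch a list that places more processes on one machine than intended.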

@paddle-bot-old

Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

paddle-bot-old bot commented Sep 22, 2020

✅ This PR's description meets the template requirements!
Please wait for other CI results.

@MrChengmo MrChengmo changed the title from "refine fleetrun.ps_launch" to "【paddle.fleet】Update fleetrun & ps-heter" Sep 23, 2020
123malin previously approved these changes Oct 9, 2020
Contributor

@123malin 123malin left a comment

LGTM

self._worker_endpoints = []

trainers_num = os.getenv("PADDLE_TRAINERS_NUM", None)
assert trainers_num != None
Collaborator

Replace the assert with raise ValueError.

Contributor Author

Fixed.
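
The review suggestion above can be sketched as follows. This is a minimal illustration, not the code that was merged: the helper name `read_trainers_num` and the error message are assumptions, and only `os.getenv` and the `PADDLE_TRAINERS_NUM` variable name come from the snippet under review.

```python
import os


def read_trainers_num() -> int:
    """Read PADDLE_TRAINERS_NUM, raising ValueError instead of asserting.

    Hypothetical helper. Unlike an assert, this check is not stripped
    when Python runs with the -O flag.
    """
    raw = os.getenv("PADDLE_TRAINERS_NUM", None)
    if raw is None:
        raise ValueError(
            "PADDLE_TRAINERS_NUM must be set for parameter-server training")
    # int() itself raises ValueError if the value is not an integer.
    return int(raw)


os.environ["PADDLE_TRAINERS_NUM"] = "2"
print(read_trainers_num())  # prints 2
```

An explicit exception also lets the message name the missing variable, which a bare `assert trainers_num != None` does not.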

trainers_num = int(trainers_num)

training_role = os.getenv("TRAINING_ROLE", None)
assert training_role != None
Collaborator

Same as above.

Collaborator

There is already a check below; can this one be removed?

Contributor Author

Fixed.

danleifeng previously approved these changes Oct 12, 2020
Contributor

@danleifeng danleifeng left a comment

LGTM

123malin previously approved these changes Oct 12, 2020
Contributor

@123malin 123malin left a comment

LGTM

fleet.stop_worker()
shutil.rmtree(model_path)
Collaborator

Only node 0 creates model_path. Won't deleting it unconditionally here raise an error?

Contributor Author

There is indeed a problem here, but the unit test did not report an error. I'll take a look.

Contributor Author

The unit test has been fixed.
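
One way to make the cleanup safe on nodes that never created the directory is sketched below. This is an assumption about a possible fix, not the change actually made to the unit test, and `cleanup_model_dir` is a hypothetical helper.

```python
import os
import shutil
import tempfile


def cleanup_model_dir(model_path: str) -> bool:
    """Remove model_path if it exists; return True when something was removed."""
    if os.path.isdir(model_path):
        shutil.rmtree(model_path)
        return True
    # Nothing to remove, e.g. on a node that did not save the model.
    return False


created = tempfile.mkdtemp()        # stands in for the directory node 0 creates
print(cleanup_model_dir(created))   # True: the directory existed
print(cleanup_model_dir(created))   # False: already gone, and no exception
```

Calling `shutil.rmtree(model_path, ignore_errors=True)` is a shorter alternative, at the cost of also hiding unrelated failures such as permission errors.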

@@ -133,6 +133,8 @@ def __init__(self, main_program, startup_program, strategy, role_maker):

self.origin_main_program = main_program
self.origin_startup_program = startup_program
Collaborator

What is the difference between origin main and origin ps main?

Contributor Author

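origin_main is fluid.default_main_program().

origin_ps_main is the program after the PS graph split but before the heterogeneous graph split. It is kept separately so that the model can be saved later in heterogeneous mode.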

Collaborator

@seiriosPlus seiriosPlus left a comment

LGTM

@seiriosPlus seiriosPlus merged commit c5f2802 into PaddlePaddle:develop Oct 13, 2020
chen-zhiyu pushed a commit to chen-zhiyu/Paddle that referenced this pull request Oct 15, 2020
* refine fleetrun.ps_launch

* update fleet run for multi device support

* ps_graph support ps-gpu

* fix heter save

* add heter save unittest

* fix unittest & simple code

* update fleetrun

* fix fleetrun

* fix launch barrier

* fix role maker

* add paddlecloud rolemaker unittest

* rename heter_worker_device_guard