【paddle.fleet】Update fleetrun & ps-heter #27472
Thanks for your contribution!
✅ This PR's description meets the template requirements!
LGTM
self._worker_endpoints = []

trainers_num = os.getenv("PADDLE_TRAINERS_NUM", None)
assert trainers_num != None
Replace the assert with raise ValueError.
Fixed.
trainers_num = int(trainers_num)

training_role = os.getenv("TRAINING_ROLE", None)
assert training_role != None
Same as above.
There is already a check further down. Can this one be removed?
Fixed.
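The two review threads above ask for the bare asserts on `PADDLE_TRAINERS_NUM` and `TRAINING_ROLE` to become explicit `ValueError`s. A minimal sketch of that pattern follows. It is not the code merged in this PR: the helper names `read_required_env` and `read_ps_env`, the `env` parameter, and the set of accepted role names are assumptions made for illustration.

```python
import os


def read_required_env(name, env=None):
    """Return a required environment variable, or raise ValueError if unset.

    `env` is a hypothetical injection point (defaults to os.environ) so the
    check can be exercised without touching the real process environment.
    """
    env = os.environ if env is None else env
    value = env.get(name)
    if value is None:
        raise ValueError(
            "Environment variable {} must be set for parameter-server "
            "training.".format(name))
    return value


def read_ps_env(env=None, allowed_roles=("TRAINER", "PSERVER", "HETER_TRAINER")):
    """Read and validate the two variables discussed in the review.

    The default `allowed_roles` tuple is an assumption, not taken from the PR.
    """
    trainers_num = int(read_required_env("PADDLE_TRAINERS_NUM", env))
    training_role = read_required_env("TRAINING_ROLE", env)
    if training_role not in allowed_roles:
        raise ValueError(
            "TRAINING_ROLE must be one of {}, but got {!r}.".format(
                allowed_roles, training_role))
    return trainers_num, training_role
```

Unlike `assert`, which Python skips when run with `-O`, the `ValueError` is always raised and names the missing variable in its message.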
LGTM
LGTM
fleet.stop_worker()
shutil.rmtree(model_path)
Only node 0 creates model_path. Won't deleting it unconditionally here raise an error?
There is indeed a problem here, but the unit test did not report an error. I'll take a look.
The unit test has been fixed.
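The concern in this thread is that `shutil.rmtree(model_path)` runs on every worker although only node 0 creates the directory. One way to guard the cleanup is sketched below. The helper `cleanup_model_dir` and its `is_first_worker` flag are hypothetical, and this is not necessarily how the unit test was fixed in the PR.

```python
import os
import shutil


def cleanup_model_dir(model_path, is_first_worker):
    """Remove model_path only on the worker that created it.

    Returns True if the directory was removed and False otherwise, so a
    worker that never created the directory does not fail on a missing path.
    """
    if is_first_worker and os.path.isdir(model_path):
        shutil.rmtree(model_path)
        return True
    return False
```

In a test script, the flag would come from whatever the script already uses to decide which worker saves the model, so that creation and deletion are gated by the same condition.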
@@ -133,6 +133,8 @@ def __init__(self, main_program, startup_program, strategy, role_maker):

self.origin_main_program = main_program
self.origin_startup_program = startup_program
What is the difference between origin main and origin ps main?
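origin_main is fluid.default_main_program().
origin_ps_main is the program after the PS graph split but before the heterogeneous graph split. It is kept separately so that it can be used later when saving the model in heter mode.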
LGTM
* refine fleetrun.ps_launch
* update fleet run for multi device support
* ps_graph support ps-gpu
* fix heter save
* add heter save unittest
* fix unittest & simple code
* update fleetrun
* fix fleetrun
* fix launch barrier
* fix role maker
* add paddlecloud rolemaker unittest
* rename heter_worker_device_guard
PR types
New features, Bug fixes
PR changes
Others
Describe
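* Refined the fleetrun launch code for parameter-server jobs: added class ParameterServerLauncher in launch_utils.py to encapsulate the PS launch implementation.
* Reorganized the fleetrun launch command. fleetrun --help now describes the arguments grouped by category.
* fleetrun now supports submitting ps-heter and ps-gpu jobs.
* Improved the usability of the parameter-server environment-variable checks in the PaddleCloud RoleMaker.
* Removed the try & catch in _ps_env and changed every os.environ[] lookup to os.getenv().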
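The last item in the description swaps indexed environment lookups for `os.getenv()`. The difference that makes the surrounding try/except unnecessary can be sketched with the standard library alone. The helper `lookup_both_ways` is hypothetical and exists only to show the two behaviours side by side.

```python
import os


def lookup_both_ways(name):
    """Return (value from os.getenv, whether os.environ[name] raised KeyError)."""
    via_getenv = os.getenv(name)  # yields None when the variable is unset
    try:
        os.environ[name]          # raises KeyError when the variable is unset
        raised = False
    except KeyError:
        raised = True
    return via_getenv, raised
```

With `os.getenv()`, a missing variable shows up as `None`, so the caller can check for it explicitly and raise an error that names the variable instead of catching a `KeyError`.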