[AutoTuner]Improve ETCD fault tolerance #58314

Merged: 3 commits, Oct 27, 2023

Caozhou1995 commented Oct 23, 2023

PR types

Others

PR changes

Others

Description

PCard-76733
Improve ETCD fault tolerance.
In the AutoTuner scenario, multiple tasks are launched. In multi-machine scenarios, ETCD communication can be unstable, for example because of poor connectivity. This PR adds a retry operation so that the program does not crash because of this instability.
This PR also fixes several bugs and upgrades AutoTuner as follows:

  1. Fix the memory grep error caused by the ROCm memory upgrade.
  2. Fix the metric None error caused by the first error.
  3. Remove the prune rule for mp degree > 8.
  4. Change the memory monitoring interval to 5 seconds.
  5. Add a "run best" control field.
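
For readers unfamiliar with the pattern, the retry idea described above can be sketched roughly as follows. This is a minimal illustration and not the code merged in this PR: `retry_on_failure`, `FlakyStore`, and all parameter names and defaults are hypothetical, and the stand-in store simulates an unstable connection instead of talking to a real ETCD server.

```python
import functools
import time


def retry_on_failure(max_attempts=3, delay_seconds=0.0, exceptions=(Exception,)):
    """Retry the wrapped call when it raises one of `exceptions`.

    Hypothetical helper: names and defaults are assumptions, not from the PR.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            last_error = None
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions as error:
                    last_error = error
                    # Wait before the next attempt, but not after the last one.
                    if attempt < max_attempts:
                        time.sleep(delay_seconds)
            # Every attempt failed: surface the most recent error.
            raise last_error

        return wrapper

    return decorator


class FlakyStore:
    """Stand-in for an ETCD client: fails a fixed number of times, then succeeds."""

    def __init__(self, failures_before_success):
        self.remaining_failures = failures_before_success
        self.data = {}

    @retry_on_failure(max_attempts=5, delay_seconds=0.0, exceptions=(ConnectionError,))
    def put(self, key, value):
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise ConnectionError("simulated unstable connection")
        self.data[key] = value
        return True
```

With this sketch, a `put` that fails twice still succeeds within the five allowed attempts, while a store that keeps failing raises `ConnectionError` once the attempts are exhausted. A real deployment would likely use a non-zero delay between attempts.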


paddle-bot bot commented Oct 23, 2023

Your PR has been submitted. Thanks for your contribution to this open-source project!
Please wait for the CI results first. See the Paddle CI Manual for details.


gongweibao commented Oct 24, 2023

Could the explanation be more detailed?

  1. What's the problem?
  2. How does this PR resolve it?


Caozhou1995 commented Oct 24, 2023

> Could the explanation be more detailed?
>
>   1. What's the problem?
>   2. How does this PR resolve it?

  1. In the AutoTuner scenario, multiple tasks are launched. In multi-machine scenarios, ETCD communication can be unstable, for example because of poor connectivity.
  2. This PR adds a retry operation to avoid program crashes.

@gongweibao

@paddle-bot paddle-bot bot added the contributor External developers label Oct 25, 2023

@zhiqiu zhiqiu left a comment

LGTM

@zhiqiu zhiqiu merged commit 8f1854c into PaddlePaddle:develop Oct 27, 2023
28 checks passed
@paddle-bot paddle-bot bot removed the contributor External developers label Nov 3, 2023
zeroRains pushed a commit to zeroRains/Paddle that referenced this pull request Nov 8, 2023
* add fault tolerant for etcd apis

* fix metric bug

* fix some bugs
danleifeng pushed a commit to danleifeng/Paddle that referenced this pull request Nov 14, 2023
* add fault tolerant for etcd apis

* fix metric bug

* fix some bugs

3 participants