Try distributed run for 3 times to prevent failure #5305

Merged
oneflow-ci-bot merged 7 commits into master from run_distributed_test_3_times on Jun 24, 2021

jackalcooper
Collaborator

Add a retry to the distributed CI: the test is run up to three times so that a network failure does not fail the job.
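
In the workflow the retry is written as repeated steps, each gated on the previous attempt's outcome. The same idea can be sketched in Python as a loop that stops at the first success. This is only an illustration, not code from this PR: `run_with_retries` and its `runner` parameter are hypothetical names, and a zero exit code is assumed to mean success.

```python
import subprocess
import sys
from typing import Callable, Sequence


def _default_runner(cmd: Sequence[str]) -> int:
    """Run the command as a subprocess and return its exit code."""
    return subprocess.run(list(cmd)).returncode


def run_with_retries(
    cmd: Sequence[str],
    max_attempts: int = 3,
    runner: Callable[[Sequence[str]], int] = _default_runner,
) -> int:
    """Run cmd until it exits with 0 or max_attempts is exhausted.

    Returns the 1-based number of the attempt that succeeded.
    Raises RuntimeError if every attempt fails.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    last_code = None
    for attempt in range(1, max_attempts + 1):
        last_code = runner(cmd)
        if last_code == 0:
            return attempt
        # A failed attempt is only logged; the loop decides whether to retry.
        print(
            f"attempt {attempt}/{max_attempts} failed with exit code {last_code}",
            file=sys.stderr,
        )

    raise RuntimeError(
        f"command failed {max_attempts} times, last exit code {last_code}"
    )


if __name__ == "__main__":
    # Hypothetical usage: a trivial command that succeeds on the first try.
    run_with_retries([sys.executable, "-c", "print('ok')"])
```

Only the last failure is surfaced as an error, which mirrors a job that tolerates the first two failed attempts and fails only when the third one fails as well.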

    python3 ci/test/distributed_run.py --bash_script=ci/test/2node_op_test.sh --custom_img_tag=${{ env.image_name }} --oneflow_wheel_path=${{ env.wheelhouse_dir }}
- name: Op test (distributed, 3rd try)
  if: matrix.test_suite == 'cuda' && steps.distributed_try_2.outcome == 'failure'
  continue-on-error: true
Contributor

If this is set to false, wouldn't that make the check below unnecessary?

@oneflow-ci-bot oneflow-ci-bot self-requested a review June 24, 2021 11:09
@jackalcooper jackalcooper changed the title Retry distributed run for 3 times to prevent failure Try distributed run for 3 times to prevent failure Jun 24, 2021
@oneflow-ci-bot oneflow-ci-bot removed their request for review June 24, 2021 13:00
@oneflow-ci-bot oneflow-ci-bot merged commit 1117b7f into master Jun 24, 2021
@oneflow-ci-bot oneflow-ci-bot deleted the run_distributed_test_3_times branch June 24, 2021 13:02
3 participants