Terminate job if fail too much #206

Closed
wants to merge 1 commit into from


@typhoonzero typhoonzero commented Jul 7, 2017

Fix #149

@@ -25,6 +25,10 @@ check_trainer_ret() {
echo "Program Abort" > /dev/termination-log
fi
echo "termination log wroted..."
FAILED_COUNT=$(python /root/k8s_tools.py fetch_pserver_ips)

The argument should be fetch_job_fail_count, not fetch_pserver_ips.

@pineking pineking mentioned this pull request Jul 14, 2017
if j.metadata.name == PADDLE_JOB_NAME:
failed_count = j.status.failed
break
if not failed_count:

Is failed_count undefined here?
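
One way to address this, as a rough sketch only: it assumes the concern is that failed_count is assigned only inside the loop, so the later `if not failed_count` check could hit an unbound name when no job matches. The job list is passed in as a parameter here, and `make_job` and the "example-job" name are hypothetical stand-ins so the snippet runs without a cluster; the actual code in k8s_tools.py may look different.

```python
from types import SimpleNamespace

# Hypothetical value; the real script presumably reads this from its environment.
PADDLE_JOB_NAME = "example-job"


def fetch_job_fail_count(jobs, job_name=PADDLE_JOB_NAME):
    """Return the failed count for job_name, or 0 if none is recorded."""
    # Define the name before the loop so it exists even when no job matches.
    failed_count = 0
    for j in jobs:
        if j.metadata.name == job_name:
            # status.failed may be None when nothing has failed yet.
            failed_count = j.status.failed or 0
            break
    return failed_count


def make_job(name, failed):
    # Hypothetical stand-in for a job object with metadata.name and status.failed.
    return SimpleNamespace(
        metadata=SimpleNamespace(name=name),
        status=SimpleNamespace(failed=failed),
    )
```

With the default set before the loop, the function returns 0 both when the job is missing and when its failed field is empty, so the caller always gets an integer to compare against its retry limit.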

@gongweibao gongweibao left a comment

Just one comment.

@Yancey1989 Yancey1989 mentioned this pull request Aug 16, 2017
@typhoonzero typhoonzero deleted the fix_max_retry branch November 2, 2017 10:06

3 participants