Feature Request: Restart individual job from build matrix #2680

Closed
mattclay opened this Issue Jun 7, 2016 · 14 comments

mattclay commented Jun 7, 2016

Sometimes just one or two builds in the build matrix fail due to transient errors such as a network timeout. In these cases it would be nice to have the option to restart individual builds from the build matrix, rather than the entire set of jobs, particularly when the matrix is large.

Member

avinci commented Jun 7, 2016

@mattclay do you want to restart it as a new job or do you need it to run as the same job?

The former is super easy; the latter is a bit chaotic, especially since all internal eventing needs to be reset once the whole set of jobs has reached a terminal state.

mattclay commented Jun 7, 2016

If restarting as a new job would allow that job, upon completion, to take the place of the previous job for purposes of reporting status for the entire set of jobs (such as updating a PR), then yes.

For example, if 1 of 30 jobs fails, is restarted and then passes, the status for the set of jobs would need to reflect everything passing.
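
A toy sketch of the behaviour being requested, assuming a simple in-memory model (the Matrix/Job names here are illustrative, not Shippable's actual data model): the status reported for the whole set is recomputed from the latest result recorded for each slot, so a restarted job that passes flips the overall status to passing.

```python
from dataclasses import dataclass, field

@dataclass
class Job:
    number: str   # e.g. "545.2"
    status: str   # "pass" or "fail"

@dataclass
class Matrix:
    jobs: dict = field(default_factory=dict)  # job number -> latest Job recorded

    def record(self, job):
        # A restarted job replaces the previous result for the same matrix slot.
        self.jobs[job.number] = job

    def overall_status(self):
        # The status reported for the whole set (e.g. to a PR) reflects the
        # latest result recorded for every slot.
        all_pass = all(j.status == "pass" for j in self.jobs.values())
        return "pass" if all_pass else "fail"

m = Matrix()
for i in range(1, 31):
    m.record(Job(f"545.{i}", "pass"))
m.record(Job("545.7", "fail"))        # 1 of 30 jobs hits a transient failure
assert m.overall_status() == "fail"
m.record(Job("545.7", "pass"))        # restart just that one job
assert m.overall_status() == "pass"   # the set now reports everything passing
```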

jimi-c commented Jun 7, 2016

Yes, restarting in-place would be best. For example, https://app.shippable.com/runs/575705135cde2b0c00589cf3 (job 434.15) failed in a one-off way, so restarting it should delete and recreate 434.15.

mattclay commented Jun 19, 2016

This appears to be the same request as issue #794.

nitzmahone commented Sep 6, 2016

This would solve a lot of pain. Some of our tests that rely on external infrastructure can have spurious failures, and when we're running dozens of those builds at once, the likelihood of at least one spurious failure is quite high. Being able to restart just the failures would be a much more stable path to "everything's green" than rolling the dice with all of them repeatedly.
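
For a rough sense of scale (the failure rate below is an assumed figure, not measured data): if each job fails spuriously with probability p, the chance that at least one of n independent jobs fails is 1 - (1 - p)^n, which climbs quickly as the matrix grows.

```python
# Illustrative arithmetic: chance of at least one spurious failure across a matrix.
def p_any_failure(p_single: float, n_jobs: int) -> float:
    return 1 - (1 - p_single) ** n_jobs

# Assuming a 2% spurious-failure rate per job:
print(p_any_failure(0.02, 1))    # 0.02  -> a single job usually goes green
print(p_any_failure(0.02, 30))   # ~0.45 -> a 30-job matrix fails almost half the time
print(p_any_failure(0.02, 60))   # ~0.70 -> and it only gets worse as the matrix grows
```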

mattclay commented Apr 6, 2018

Has any more consideration been given to implementing this?

Contributor

manishas commented Apr 7, 2018

@mattclay we're discussing internally and will have an update early next week.

Contributor

manishas commented Apr 24, 2018

@mattclay we discussed this and wanted to make sure the following scenario would work for you:

  • You have N jobs in your build matrix. For build number 545, these are numbered 545.1, 545.2, ..., 545.N

  • For build 545, you have some failures, let's say 545.2 and 545.7 fail, while everything else passes.

  • You can click on rebuild for those individual jobs. 545.2 and 545.7 will rerun and update the status for build 545. If this was a pull request build, the pull request status will be updated as well.

  • You will lose the previous job information (i.e., console output, status, etc.) for 545.2 and 545.7, since it will be overwritten by the rebuilt information (roughly sketched below).
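
A minimal sketch of the trade-off in that last bullet, using a plain dict as a stand-in for the stored job records (purely illustrative, not Shippable's data model):

```python
# Illustrative only: what "overwritten by the rebuilt information" means for 545.2.
jobs = {
    "545.2": {"status": "fail", "console": "...original log with the real failure..."},
    "545.7": {"status": "fail", "console": "...original log..."},
    # remaining jobs in the matrix all passed
}

# Clicking rebuild on 545.2 replaces the record for that job number in place;
# the build (and PR) status would then be recomputed from the latest results,
# but the original console output and status for 545.2 are no longer retrievable.
jobs["545.2"] = {"status": "pass", "console": "...new log..."}
```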

mattclay commented Apr 27, 2018

@manishas What you describe sounds good, except for the last item. Losing the previous job information would make it very difficult to troubleshoot past failures in order to reduce or eliminate them in the future.

If the failed job information could be preserved, even if located elsewhere in the UI, that would probably be adequate, as long as there was some way to get at it after the job had been restarted (possibly multiple times, if it failed on subsequent re-runs).

The person(s) reviewing failed jobs may not be the same person(s) who are restarting them, so capturing or reviewing the information before jobs are restarted isn't practical.

mattclay commented Apr 27, 2018

@manishas I just thought of another issue with replacing individual job results. We have automated processes that respond to CI failures by downloading results using the API. If those results disappear or change when jobs are restarted, that is going to cause problems for those processes.

Would it be possible to create a new run that contained only the jobs that failed from the previous run, then splice in the results from the previous jobs which succeeded?

For example, the initial run could be:

  • 500
    • 500.1 pass
    • 500.2 fail
    • 500.3 fail

Re-running the entire run (failed jobs only) would result in a new run:

  • 501
    • 500.1 pass (this would simply be a reference to the original 500.1 job in the job list)
    • 501.2 pass
    • 501.3 pass

The final status for 501 would work like any normal run, except that some of the jobs contributing to that status are actually from the previous run, 500.
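
A rough sketch of this splicing idea, assuming a plain dict model of runs (the rerun_failed name and structure are illustrative, not an actual Shippable API):

```python
# Illustrative only: build a new run containing just the failed jobs of a previous
# run, with the passing jobs carried over as references to the original results.
previous_run = {
    "number": 500,
    "jobs": {
        "500.1": {"status": "pass"},
        "500.2": {"status": "fail"},
        "500.3": {"status": "fail"},
    },
}

def rerun_failed(previous, new_number):
    new_run = {"number": new_number, "jobs": {}}
    for slot, (job_id, job) in enumerate(sorted(previous["jobs"].items()), start=1):
        if job["status"] == "pass":
            # Passing jobs are not re-executed; the new run just points back at
            # the original job, so the old results stay intact for the API.
            new_run["jobs"][job_id] = {"ref": job_id, **job}
        else:
            # Failed jobs get fresh job numbers under the new run and are re-queued.
            new_run["jobs"][f"{new_number}.{slot}"] = {"status": "queued"}
    return new_run

rerun = rerun_failed(previous_run, 501)
# rerun["jobs"] -> {"500.1": reference to the original pass, "501.2": queued, "501.3": queued}
```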

Contributor

manishas commented May 2, 2018

@mattclay let me discuss with the team and get back to you in a couple of days.

mattclay commented May 25, 2018

Thank you for adding the "Restart failed items" option. This is going to save a lot of time and reduce CI load substantially.

mattclay closed this May 25, 2018

manishas reopened this May 25, 2018

Contributor

manishas commented May 25, 2018

Reopening this to fix the problem with the wrong SHA being selected while rerunning.

Contributor

manishas commented Aug 1, 2018

This was fixed in late May. Rerunning failed jobs now gives you a choice between using the latest SHA and the original SHA.

manishas closed this Aug 1, 2018
