
Make "timed out" and "log limit exceeded" builds aborted #1369

Open · wants to merge 1 commit into master
Conversation

@Ma27 (Member) commented Mar 13, 2024

In 7369408 I gave builds that failed because of a timeout or an exceeded log limit a stop sign, and I stand by that reasoning: it makes it possible to distinguish actual build failures from rather transient issues such as timeouts.

Back then I considered it a feature that these are shown in a different tab, but I no longer think that's a good idea. When using a jobset to e.g. track the regressions from a mass rebuild (like a gcc or other compiler update), "Newly failed builds" should exclusively display regressions (and flaky builds of course, not much I can do about those).

Also, when a bunch of builds in such a jobset fail because of e.g. a broken connection to a builder that results in a timeout, I want to be able to restart them all without rebuilding actual regressions.

To make it clear that the tab now contains more than just "Aborted" builds, I renamed its label to "Aborted / Timed out".

cc @NixOS/infra for opinions, since they will probably be the most affected by this change.
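
For illustration, the intended categorization roughly looks like the sketch below. This is only a sketch: Hydra's actual code is Perl plus Template Toolkit, and the status strings, set names and tab names here are placeholders rather than the real `buildstatus` values.

```python
# Illustrative sketch of the tab categorization this PR aims for.
# NOTE: placeholders only; not Hydra's real buildstatus values or code.

ABORTED_LIKE = {"aborted", "cancelled", "timed out", "log limit exceeded"}
FAILED_LIKE = {"build failed", "dependency failed", "failed with output"}

def tab_for(status: str) -> str:
    """Decide which eval tab an unsuccessful build should land in."""
    if status in ABORTED_LIKE:
        return "Aborted / Timed out"          # bulk-restartable, not a regression
    if status in FAILED_LIKE:
        return "Newly failed / Still failing"  # actual regressions
    return "Other"
```

With that split, "restart all" on the "Aborted / Timed out" tab retries the transient failures without touching actual regressions.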

@mweinelt (Member)

Deployed on hydra.nixos.org.

@vcunat (Member) commented Mar 19, 2024

I find it a little weird that "restart all aborted jobs" now won't restart this whole tab but only a subset of it (it won't restart the timed-out jobs).

@Ma27 (Member, Author) commented Mar 19, 2024

OK, I only tested this up to the point where a given job was displayed in a jobset eval. You're right, that should be part of this PR as well.

We should probably factor out the logic that decides when something counts as aborted, to get rid of the duplication (in the restart handler, the jobset eval template, and the jobset diff). A rough sketch of that follows below.
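
Something like a single shared predicate, sketched below. This is hypothetical and written in Python purely for illustration; the real helper would be a Perl function reused by the restart action, the jobset-eval template and the jobset diff, and the status names are placeholders.

```python
# Hypothetical sketch of a shared "is this build aborted-like?" predicate,
# so the three call sites can't drift apart. Status names are placeholders,
# not Hydra's real buildstatus values.

ABORTED_LIKE_STATUSES = frozenset(
    {"aborted", "cancelled", "timed out", "log limit exceeded"}
)

def is_aborted_like(build_status: str) -> bool:
    """Single source of truth for what counts as 'aborted' in the UI
    and in bulk restarts."""
    return build_status in ABORTED_LIKE_STATUSES
```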

@vcunat (Member) commented Mar 22, 2024

Also, for hydra.nixos.org there's a practical disadvantage: timed-out jobs can't be filtered out when doing diffs, i.e. instead of showing up in the "still failing" tab they'll be in the "aborted" tab. I mean the timeouts that aren't transient but just a bad build; though yeah, we'd better fix those somehow.

@Ma27 (Member, Author) commented Mar 23, 2024

@vcunat part of the motivation was that I don't want to see timed-out builds in "newly failing" (and consequently I don't want to restart actual regressions when restarting timed-out builds), and aborted builds seem like the more sensible category here.

Perhaps we want to add another tab? That doesn't seem too complicated, I guess; the categorization happens entirely outside of the DB AFAIK.

@vcunat (Member) commented Apr 16, 2024

Note: "log limit exceeded" is usually not transient, in my experience. Restarting such builds will typically produce too much logs again, so I'm not sure about the move in that case. I consider it similar to "output size exceeded".

Though yes, it's somewhat impure – at least in the sense that these limits are configurable without changing the derivation hash.
