18.09 Zero Hydra Failures #45960

Closed
samueldr opened this issue Sep 2, 2018 · 88 comments

Comments

@samueldr
Member

@samueldr samueldr commented Sep 2, 2018

Let's make Jellyfish the best release so far!

We have: the main jobset starting at 425 failures, x86_64-darwin at 829, and aarch64-linux at ~1120. The numbers may seem large, but one appropriate fix may fix many at once.

How can I help???

  • Choose some package from those that fail on Hydra. Hurry before the good ones have been taken!
  • Find a fix. That may mean simply restricting meta.platforms, if the package inherently doesn't support the platforms currently listed there.
  • Typically the package was broken on master already. You can verify that on Hydra (example URL: https://hydra.nixos.org/job/nixpkgs/trunk/bash.x86_64-linux). In that case, base the fix on master and request backporting in the description of the pull request.
  • Ping this issue from the PR, e.g. /cc ZHF #45960. Do this also if you have some WIP. Alternatively you may just post a note in this issue. If the breakage is specific to darwin #45961 or aarch64 #45962 mention the respective issue instead.
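
For checking several packages, the example URL above can be generalized. This is a minimal sketch: the helper name `hydra_job_url` is hypothetical, and it assumes other jobs follow the same `job/<project>/<jobset>/<attribute>.<system>` pattern as the bash example.

```python
def hydra_job_url(attr, system, project="nixpkgs", jobset="trunk"):
    """Build a Hydra job URL shaped like the bash example in the list above.

    The defaults mirror that example; other projects or jobsets are an
    assumption and should be checked on Hydra itself.
    """
    return f"https://hydra.nixos.org/job/{project}/{jobset}/{attr}.{system}"


# Reproduces the example URL from the list above.
print(hydra_job_url("bash", "x86_64-linux"))
# Same package on another platform mentioned in this issue.
print(hydra_job_url("bash", "aarch64-linux"))
```

Opening the resulting page shows the job's build history, which is how to tell whether a failure predates the release branch.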

The remaining packages will be marked as broken before the release (on the failing platforms), i.e. at the end of September. /cc @NixOS/nixpkgs-committers, but everyone can help out!

@samueldr samueldr added this to the 18.09 milestone Sep 2, 2018
@worldofpeace worldofpeace mentioned this issue Sep 3, 2018
4 of 9 tasks complete
@symphorien symphorien mentioned this issue Sep 3, 2018
5 of 9 tasks complete
@danieldk danieldk mentioned this issue Sep 3, 2018
4 of 9 tasks complete
@veprbl veprbl mentioned this issue Sep 3, 2018
0 of 9 tasks complete
danieldk added a commit to danieldk/nixpkgs that referenced this issue Sep 3, 2018
Keras expects keras_preprocessing 1.0.2 and 1.0.4. 1.0.3 and 1.0.5
are respectively in nixpkgs.

ZHF NixOS#45960
@symphorien symphorien mentioned this issue Sep 3, 2018
4 of 9 tasks complete
@volth
Contributor

@volth volth commented Sep 3, 2018

Reverting ad47c38 will fix nixpkgs.perl*Packages.MouseXGetOpt

@xeji
Contributor

@xeji xeji commented Sep 3, 2018

> Reverting ad47c38 will fix nixpkgs.perl*Packages.MouseXGetOpt

reverted in 9889c0f and 4c00a04

xeji added a commit that referenced this issue Sep 3, 2018
Keras expects keras_preprocessing 1.0.2 and 1.0.4. 1.0.3 and 1.0.5
are respectively in nixpkgs.

ZHF #45960

(cherry picked from commit e33be2a)
xeji added a commit that referenced this issue Sep 3, 2018
Keras expects keras_preprocessing 1.0.2 and 1.0.4. 1.0.3 and 1.0.5
are respectively in nixpkgs.

ZHF #45960
@dywedir dywedir mentioned this issue Sep 3, 2018
0 of 9 tasks complete
@markuskowa markuskowa mentioned this issue Sep 3, 2018
4 of 9 tasks complete
@andir andir mentioned this issue Sep 3, 2018
3 of 9 tasks complete
@samueldr
Member Author

@samueldr samueldr commented Sep 4, 2018

Failures report as of right now.

Let's see how useful this is as a format. This was queried from the last finished eval, there were evals running while this was made.

@volth
Contributor

@volth volth commented Sep 4, 2018

> Failures report as of right now.

I assumed that perl52[68]Packages.TestMagpie should be ignored as broken because it depends on broken perl52[68]Packages.UNIVERSALref (#45983).

Or should each dependent have its own meta.broken = versionAtLeast perl.version "5.26" ?

@xeji
Contributor

@xeji xeji commented Sep 4, 2018

> This was queried from the last finished eval, there were evals running while this was made.

@volth the table shows an earlier eval where UNIVERSALref wasn't marked as broken yet (see the build logs). It is fine in the latest eval: https://hydra.nixos.org/eval/1477017#tabs-removed

> Or should each dependent have its own meta.broken = versionAtLeast perl.version "5.26" ?

No, only the package that is broken itself.

@danieldk danieldk mentioned this issue Sep 4, 2018
3 of 9 tasks complete
@vcunat
Member

@vcunat vcunat commented Sep 4, 2018

I'd add that broken and other checks are transitive during evaluation (implemented as exceptions).
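
As a rough illustration of why only the broken package itself needs the flag, here is a toy model in Python. It is not how the Nix evaluator is written: the `MarkedBroken` exception, the `PACKAGES` table, and both helper functions are hypothetical, and only the two Perl package names come from this thread.

```python
class MarkedBroken(Exception):
    """Raised when evaluation reaches a package flagged as broken."""


# Hypothetical package set: only UNIVERSALref carries the broken flag.
PACKAGES = {
    "UNIVERSALref": {"broken": True, "deps": []},
    "TestMagpie": {"broken": False, "deps": ["UNIVERSALref"]},
    "bash": {"broken": False, "deps": []},
}


def evaluate(name):
    """Evaluate a package and, recursively, everything it depends on."""
    pkg = PACKAGES[name]
    if pkg["broken"]:
        raise MarkedBroken(name)
    for dep in pkg["deps"]:
        evaluate(dep)  # an exception raised here propagates to the caller
    return name


def is_buildable(name):
    """True if neither the package nor any dependency is marked broken."""
    try:
        evaluate(name)
        return True
    except MarkedBroken:
        return False


for name in PACKAGES:
    print(name, is_buildable(name))
```

In this model TestMagpie is skipped without carrying a flag of its own, because the exception from its dependency travels up through it.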

@timokau
Member

@timokau timokau commented Sep 4, 2018

The sage failures are due to a mistake I made when adding pkg-config aliases to openblas and the recent numpy update. I fixed openblas in #46016 in staging. I don't know if that means it will also be merged into 18.09. I haven't gotten to backporting the numpy upgrade from sage upstream yet.

I really think it is a shame that hydra doesn't ping maintainers on failures anymore. Seems like an essential feature to miss.

@xeji
Contributor

@xeji xeji commented Sep 4, 2018

@vcunat @samueldr there are a number of changes currently in staging/staging-next that should go to 18.09 once they reach master - openblas, texlive 2018, a systemd bugfix, etc.
What's the workflow for these? Guess we'll need a staging-18.09 branch + Hydra job.

@vcunat
Member

@vcunat vcunat commented Sep 4, 2018

Since the fork point, changes on the staging branch no longer reach 18.09. Cherry-pick them if desired. For this one I did it in 6f8e07a.

@Ma27 Ma27 mentioned this issue Sep 24, 2018
5 of 9 tasks complete
Mic92 added a commit that referenced this issue Sep 24, 2018
Recent boost versions name their `python3` shared objects
`boost_python3x` rather than `boost_python3`.

See https://hydra.nixos.org/build/80712295
Addresses #45960

(cherry picked from commit 50f23da)
xeji added a commit that referenced this issue Sep 24, 2018
The dependency `distro` was missing.
See https://hydra.nixos.org/build/81330387

Addresses #45960

(cherry picked from commit baa7e52)
xeji added a commit that referenced this issue Sep 24, 2018
@xeji
Contributor

@xeji xeji commented Sep 24, 2018

sage is still broken on 18.09 (but not on master). ping @timokau

Ma27 added a commit to Ma27/nixpkgs that referenced this issue Sep 24, 2018
@Ma27 Ma27 mentioned this issue Sep 24, 2018
4 of 9 tasks complete
xeji added a commit that referenced this issue Sep 24, 2018
@timokau
Member

@timokau timokau commented Sep 27, 2018

@xeji thanks for the heads up. I cannot reproduce a failure with the most recent release-18.09, so that failure is probably outdated.

@vcunat
Member

@vcunat vcunat commented Sep 29, 2018

@timokau: the newest evaluation failed three times (on different Hydra machines): https://hydra.nixos.org/build/81780926 There's always some "N doctests failed" at the end.

@qknight
Member

@qknight qknight commented Sep 29, 2018

You lack an implementation (library, egg) where fib is defined; see the error:

NameError: name 'fib' is not defined

**********************************************************************
File "/nix/store/9nix1fnsj56iqqcjavkar8gbjnic8djm-sage-src-8.3/src/sage/repl/ipython_extension.py", line 403, in sage.repl.ipython_extension.SageMagics.fortran
Failed example:
    fib
Exception raised:
    Traceback (most recent call last):
      File "/nix/store/0bg15dw9whi80i71ldkxkyh1v5ck74lv-python-2.7.15-env/lib/python2.7/site-packages/sage/doctest/forker.py", line 573, in _run
        self.compile_and_execute(example, compiler, test.globs)
      File "/nix/store/0bg15dw9whi80i71ldkxkyh1v5ck74lv-python-2.7.15-env/lib/python2.7/site-packages/sage/doctest/forker.py", line 983, in compile_and_execute
        exec(compiled, globs)
      File "<doctest sage.repl.ipython_extension.SageMagics.fortran[3]>", line 1, in <module>
        fib
    NameError: name 'fib' is not defined
**********************************************************************
File "/nix/store/9nix1fnsj56iqqcjavkar8gbjnic8djm-sage-src-8.3/src/sage/repl/ipython_extension.py", line 407, in sage.repl.ipython_extension.SageMagics.fortran
Failed example:
    fib(a, 10)
Exception raised:
    Traceback (most recent call last):
      File "/nix/store/0bg15dw9whi80i71ldkxkyh1v5ck74lv-python-2.7.15-env/lib/python2.7/site-packages/sage/doctest/forker.py", line 573, in _run
        self.compile_and_execute(example, compiler, test.globs)
      File "/nix/store/0bg15dw9whi80i71ldkxkyh1v5ck74lv-python-2.7.15-env/lib/python2.7/site-packages/sage/doctest/forker.py", line 983, in compile_and_execute
        exec(compiled, globs)
      File "<doctest sage.repl.ipython_extension.SageMagics.fortran[6]>", line 1, in <module>
        fib(a, Integer(10))
    NameError: name 'fib' is not defined
**********************************************************************
File "/nix/store/9nix1fnsj56iqqcjavkar8gbjnic8djm-sage-src-8.3/src/sage/repl/ipython_extension.py", line 408, in sage.repl.ipython_extension.SageMagics.fortran
Failed example:
    a
Expected:
    array([  0.,   1.,   1.,   2.,   3.,   5.,   8.,  13.,  21.,  34.])
Got:
    array([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])
**********************************************************************
@timokau
Member

@timokau timokau commented Sep 29, 2018

@vcunat I still cannot reproduce that locally. The main error seems to be

File "/nix/store/9nix1fnsj56iqqcjavkar8gbjnic8djm-sage-src-8.3/src/sage/misc/inline_fortran.py", line 65, in sage.misc.inline_fortran.InlineFortran.eval
Failed example:
    fortran(code, globals())
Exception raised:
    Traceback (most recent call last):
      ...
      File "/nix/store/0bg15dw9whi80i71ldkxkyh1v5ck74lv-python-2.7.15-env/lib/python2.7/site-packages/sage/misc/inline_fortran.py", line 125, in eval
        raise RuntimeError("failed to compile Fortran code:\n" + log_string)
    RuntimeError: failed to compile Fortran code:
      ...
      File "/nix/store/0bg15dw9whi80i71ldkxkyh1v5ck74lv-python-2.7.15-env/lib/python2.7/threading.py", line 736, in start
        _start_new_thread(self.__bootstrap, ())
    thread.error: can't start new thread

With the rest being avalanche errors. Could that be some sort of hydra threading issue? Can anybody reproduce the error?

@knedlsepp knedlsepp mentioned this issue Sep 30, 2018
2 of 9 tasks complete
@samueldr
Member Author

@samueldr samueldr commented Oct 1, 2018

Hmm, the last few times I wanted to generate a report there weren't good evals to check. Now there is one:

It looks like a chunk of the new failures on aarch64 would be fixed with #47564.

@vcunat any blockers for a release you think? I don't have anything in mind, since staging was merged and the last things I knew of were all on staging.

While not strictly ZHF related, the blockers don't seem to be severe enough to warrant blocking the release? Most are either older than this release cycle, or things that don't really block, but still would be great to finalize.

@matthewbauer
Member

@matthewbauer matthewbauer commented Oct 1, 2018

It sounds alright to me. Someone had mentioned merging #42846 before the release, but since it's a nonessential module I would consider it nonblocking (although it might be worth backporting).

Make sure you look at the 18.09 milestone PRs though too: https://github.com/NixOS/nixpkgs/pulls?q=is%3Aopen+is%3Apr+milestone%3A18.09

@peterhoeg
Member

@peterhoeg peterhoeg commented Oct 1, 2018

I consider #47577 a blocker - apologies for not raising it earlier.

@timokau
Member

@timokau timokau commented Oct 1, 2018

It looks like the sage build succeeded, curious.

As for the release I suggest removing all the blocker labels and bumping the milestones, then give the people involved at least one day to complain before going ahead with 18.09.

@samueldr
Member Author

@samueldr samueldr commented Oct 1, 2018

Hey, what do you know: I found a blocker, and I should have known better and known it is one beforehand:

See #47602. While this doesn't block on a technical side, it is a definite blocker on the human side, with a side dish of bad user experience as their first bite into NixOS.

@timokau
Member

@timokau timokau commented Nov 7, 2018

In case anybody was on the edge of their seat because of the transient sage failure: it turns out it was caused by a numpy issue with high CPU counts. Since I'm still seeing the issue on master, it's probably still there in release-18.09 too. I've opened a PR (#49888) to cherry-pick the numpy fix to release-18.09. For master it should be good enough to wait for the next numpy upgrade.

@cyounkins
Contributor

@cyounkins cyounkins commented Dec 17, 2018

Looks like we had a huge bump at https://hydra.nixos.org/eval/1495549

Many are propagated from https://hydra.nixos.org/build/85961701 which failed with "Log limit exceeded" and "building of '/nix/store/5awxqywjwjldazlzls4jslgm1l828hb3-nbd-3.18' killed after writing more than 67108864 bytes of log output" despite not writing 6MB to the log in the web UI. Possible hydra issue? @grahamc
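
For scale, the byte count in that message is exactly 64 MiB. A quick check, where the 6 MB figure is read as 6,000,000 bytes (an assumption about what the web UI displayed):

```python
LOG_LIMIT_BYTES = 67108864  # limit quoted in the Hydra message above
UI_LOG_BYTES = 6_000_000    # assumed decimal reading of the "6MB" seen in the web UI


def to_mib(n_bytes):
    """Convert a byte count to mebibytes (1 MiB = 1024 * 1024 bytes)."""
    return n_bytes / (1024 * 1024)


print(to_mib(LOG_LIMIT_BYTES))          # 64.0
print(LOG_LIMIT_BYTES == 2 ** 26)       # True
print(LOG_LIMIT_BYTES / UI_LOG_BYTES)   # how many times larger the limit is
```

So the log shown in the web UI was more than ten times smaller than the limit the build was reported to have exceeded.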

@vcunat
Member

@vcunat vcunat commented Dec 30, 2018

Apparently someone has restarted these builds and they succeeded. (Yes, I'm a bit late.)

@matthewbauer matthewbauer removed this from the 18.09 milestone Jan 22, 2019
@grahamc grahamc closed this Feb 1, 2019