Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

stdenv/generic/setup.sh: enable parallel installs for parallel builds #217568

Merged
merged 13 commits into from Mar 15, 2023

Conversation

trofi
Copy link
Contributor

@trofi trofi commented Feb 21, 2023

The primary motivating example is openssl:

Before the change full package build took 1m54s minutes. After the change full package build takes 59s.

About a 2x speedup.

The difference is visible because openssl builds hundreds of manpages spawning a perl process per manual in install phase. Such a workload is very easy to parallelize.

Another example would be autotools+libtool based build system where install step requires relinking. The more binaries there are to relink the more gain it will be to do it in parallel.

The change enables parallel installs by default only for buiilds that already have parallel builds enabled. There is a high chance those build systems already handle parallelism well but some packages will fail.

Description of changes
Things done
  • Built on platform(s)
    • x86_64-linux
    • aarch64-linux
    • x86_64-darwin
    • aarch64-darwin
  • For non-Linux: Is sandbox = true set in nix.conf? (See Nix manual)
  • Tested, as applicable:
  • Tested compilation of all packages that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review rev HEAD". Note: all changes have to be committed, also see nixpkgs-review usage
  • Tested basic functionality of all binary files (usually in ./result/bin/)
  • 23.05 Release Notes (or backporting 22.11 Release notes)
    • (Package updates) Added a release notes entry if the change is major or breaking
    • (Module updates) Added a release notes entry if the change is significant
    • (Module addition) Added a release notes entry if adding a new NixOS module
  • Fits CONTRIBUTING.md.

@github-actions github-actions bot added the 6.topic: stdenv Standard environment label Feb 21, 2023
@trofi
Copy link
Contributor Author

trofi commented Feb 21, 2023

If there is a consensus it's a useful change I think it would be worthwile to have a hydra run before merge to staging.

While building system I encountered at least the following failures:

  • net-snmp: relinking fails
  • xfsprogs: missing dependencies in install rules
  • ocaml: missing dependencies in install rules

More failures from hydra run:

  • subversion: relinking fails
  • eresi: missing dependencies
  • s9fes: missing dependencies
  • vpnc: missing depends

@Atemu
Copy link
Member

Atemu commented Feb 22, 2023

If a package build is doing anything other than installing files in the installPhase, that smells to me that it's building some previously unbuilt target that should have been built in the buildPhase but wasn't. Your example might be building the docs as a dependency of the install.

The correct fix in that case would be to explicitly build all targets that are to be installed rather than accelerating builds in the installPhase.

Now of course there might still be a benefit to installing in parallel (I would expect there to be one) but that should be evaluated independently of the coincidental fix to the potential issue above.

@trofi
Copy link
Contributor Author

trofi commented Feb 22, 2023

If a package build is doing anything other than installing files in the installPhase, that smells to me that it's building some previously unbuilt target that should have been built in the buildPhase but wasn't. Your example might be building the docs as a dependency of the install.

The correct fix in that case would be to explicitly build all targets that are to be installed rather than accelerating builds in the installPhase.

I will not stop you from fixing it. But I do not think such fixes belong to nixpkgs and should be done upstream. I would say it's a good enough workaround to just make these cases faster. No need to penalize nixpkgs users for upstream faults. Calling make with and without -j is inconsistent. I would say openssl is a perfect example where it helps.

Now of course there might still be a benefit to installing in parallel (I would expect there to be one) but that should be evaluated independently of the coincidental fix to the potential issue above.

ghc is another example of heavyweight make install. gcc is yet another one. Both install many files and relink executables against target install directory.

@Atemu
Copy link
Member

Atemu commented Feb 22, 2023

I will not stop you from fixing it. But I do not think such fixes belong to nixpkgs and should be done upstream.

It really wouldn't be much of a fix; you'd merely declare the default + additional targets in buildFlags. Yes that is absolutely an upstream bug and they should fix it internally but that applies to many things we currently patch ourselves.

I would say it's a good enough workaround to just make these cases faster. No need to penalize nixpkgs users for upstream faults. Calling make with and without -j is inconsistent. I would say openssl is a perfect example where it helps.

I generally agree but I fear doing so would mask the actual issue that should be worked around or at least reported.

Therefore, I'd advocate for this feature being handled the same as enableParallelBuilding; disabled by default unless config.enableParallelInstallingByDefault but enabled by default for select build systems that are known to handle it well and/or benefit from it.

Another unknown is whether this affects r13y (certain parallel builds have done so in the past) which is also why I'd prefer this being off by default.

ghc is another example of heavyweight make install. gcc is yet another one. Both install many files and relink executables against target install directory.

That's great. My point is that this PR's merit should be measured by cases like that rather than (potentially, I haven't checked) broken build scripts.

@trofi
Copy link
Contributor Author

trofi commented Feb 22, 2023

I will not stop you from fixing it. But I do not think such fixes belong to nixpkgs and should be done upstream.

It really wouldn't be much of a fix; you'd merely declare the default + additional targets in buildFlags. Yes that is absolutely an upstream bug and they should fix it internally but that applies to many things we currently patch ourselves.

If we enable parallel installs by default you would not need to do even that. Which sounds like a nice decrease in maintenance burden.

I would say it's a good enough workaround to just make these cases faster. No need to penalize nixpkgs users for upstream faults. Calling make with and without -j is inconsistent. I would say openssl is a perfect example where it helps.

I generally agree but I fear doing so would mask the actual issue that should be worked around or at least reported.

I don't think we have any signal today that we will lose by enabling install parallelism.

Therefore, I'd advocate for this feature being handled the same as enableParallelBuilding; disabled by default unless config.enableParallelInstallingByDefault but enabled by default for select build systems that are known to handle it well and/or benefit from it.

Why is already used config.enableParallelInstallingByDefault not enough? If the package can handle build parallelism there should be a high chance of handling install parallelism at least in most cases.

Another unknown is whether this affects r13y (certain parallel builds have done so in the past) which is also why I'd prefer this being off by default.

Why is it relevant here? If parallel builds break reproducibility then we fix the reproducibility issue one way or another. Why mask it? Note that it is not as drastic as as breaking package build. I would expect packages to fail the build be a more common issue that introduce output non-determinism.

ghc is another example of heavyweight make install. gcc is yet another one. Both install many files and relink executables against target install directory.

That's great. My point is that this PR's merit should be measured by cases like that rather than (potentially, I haven't checked) broken build scripts.

I don't see a big difference in those. Both can do more or do less in make install as long as it produces needed outcome.

@trofi trofi marked this pull request as ready for review February 22, 2023 23:06
@trofi trofi requested a review from ttuegel as a code owner February 22, 2023 23:06
@ghost
Copy link

ghost commented Feb 23, 2023

I generally agree but I fear doing so would mask the actual issue

The actual issue is already masked.

There is no assert or build failure if you are compiling things during the installPhase. Serializing the installPhase isn't an assertion or a check -- it's an annoyance, which we only notice in the really ridiculous cases like openssl.

To unmask the issue would require a PR that causes cc-wrapper (and compilers for other languages) to fail during the installPhase, and add a allowCompilationDuringInstallPhase flag for the cases where upstream deliberately does things like running perl scripts during make install. Then we would have a real check for the issue.

Another unknown is whether this affects r13y (certain parallel builds have done so in the past) which is also why I'd prefer this being off by default.

This is a totally valid concern. I think the proposal to do a few Hydra runs with this PR before merging it is a very good idea.

@Atemu
Copy link
Member

Atemu commented Feb 23, 2023

If we enable parallel installs by default you would not need to do even that. Which sounds like a nice decrease in maintenance burden.

That'd mean building things in the install phase and that just seems wrong to me.

I don't think we have any signal today that we will lose by enabling install parallelism.

The signal is time and parallelism.

When refactoring ffmpeg, I mistakenly stopped building the general all target and only noticed because it was taking forever building ffmpeg on only one core despite enableParallelBuilding.

I wasn't building anything in the buildPhase there. I wouldn't have noticed that if the installPhase built the package in parallel.

Why is already used config.enableParallelInstallingByDefault not enough? If the package can handle build parallelism there should be a high chance of handling install parallelism at least in most cases.

Ah, didn't notice it was guarded behind enableParallelBuilding. That looks sane to me.

I don't see a big difference in those.

Not a big difference in what? Time?

I'm mostly concerned about weighing this PR's merit in the time improvements in regular builds rather than broken ones as the broken ones should be fixed rather than worked around.

To unmask the issue would require a PR that causes cc-wrapper (and compilers for other languages) to fail during the installPhase, and add a allowCompilationDuringInstallPhase flag for the cases where upstream deliberately does things like running perl scripts during make install. Then we would have a real check for the issue.

That sounds great.

@trofi
Copy link
Contributor Author

trofi commented Feb 23, 2023

I don't think we have any signal today that we will lose by enabling install parallelism.

The signal is time and parallelism.

Sounds a bit abstract. What query should I write against local logs or hydra to catch openssl and other packages like that?

If I were to build interesting the problematic cases in nixpkgs I would consider adding something queryable.

An example of a more actionable signal:

  • CPU and wall-time seconds in build phase
  • CPU and wall-time seconds in install time

That at least allows you to construct a query to find suspicious cases, like: "a lot (>10s) of time in install", "install time is a lot longer than build time" and similar.

I don't think making an install phase parallel by default masks something like that infeasible. And I don't think we have an equivalent of that today as a signal.

When refactoring ffmpeg, I mistakenly stopped building the general all target and only noticed because it was taking forever building ffmpeg on only one core despite enableParallelBuilding.

I wasn't building anything in the buildPhase there. I wouldn't have noticed that if the installPhase built the package in parallel.

How did you detect that? Did you look at the timing reported in logs? Or happened to notice it interactively?

I agree this specific use case will look differently. I disagree it overweights the benefit of faster installs for everyone.

I would say that it's a feature to get faster install in this case: if the end result is the same then it's a cosmetic issue. If the result is different - you can find about it faster as the full build is still fast.

I don't see a big difference in those.

Not a big difference in what? Time?

Not a big difference in the type of the problem: both do something substantial in install phase that is worth running in parallel.

I'm mostly concerned about weighing this PR's merit in the time improvements in regular builds rather than broken ones as the broken ones should be fixed rather than worked around.

I disagree that openssl build is broken: it's not ideal but it's not incorrect.

@7c6f434c
Copy link
Member

Another unknown is whether this affects r13y (certain parallel builds have done so in the past) which is also why I'd prefer this being off by default.

Yes, if we don't always parallel-build due to race conditions making output less deterministic than it could be, and the main case where parallel install helps is building during the install phase, same considerations probably apply.

Therefore, I'd advocate for this feature being handled the same as enableParallelBuilding; disabled by default unless config.enableParallelInstallingByDefault but enabled by default for select build systems that are known to handle it well and/or benefit from it.

Maybe it should just default to enableParallelBuilding then? And override in the very rare case where installation behaves worse than main building.

Notably, openssl already has this flag set.

@trofi trofi changed the title stdenv/generic/setup.sh: enable parallel installs by default stdenv/generic/setup.sh: enable parallel installs for parallel builds Feb 24, 2023
@trofi
Copy link
Contributor Author

trofi commented Feb 26, 2023

Rebased against today's merge base of master and staging in preparation for hydra jobset. No changes otherwise.

@vcunat
Copy link
Member

vcunat commented Feb 27, 2023

Jobset: https://hydra.nixos.org/jobset/nixpkgs/pr-217568-stdenv-parallel-install

Without the change parallel install fails as:

    $ install flags: -j16 ...
    ...
    collect2: error: ld returned 1 exit status
    libtool:   error: error: relink 'libsvn_ra_serf-1.la' with the above command before installing it
    make: *** [build-outputs.mk:1316: install-serf-lib] Error 1
    make: *** Waiting for unfinished jobs....
    /nix/store/1qasgqvab0xh2jcy00x9b1zh39dw7m8f-bin
Without the change parallel install fails as:

    $ install flags: -j16 ...
    ...
    install: target '...-ocaml-4.14.0/lib/ocaml/threads': No such file or directory
    make[1]: *** [Makefile:140: installopt] Error 1
Without the change parallel installs fail as:

    install flags: -j2
    ...
    ln: failed to create symbolic link '...-eresi-0.83-a3-phoenix//bin/elfsh': No such file or directory
    make: *** [Makefile:108: install64] Error 1
Without the change parallel installs fail as:

    install flags: -j16
    install -d -m 0755 ...-s9fes-20181205/share/s9fes
    sed -e "s|^#! /usr/local|#! ...-s9fes-20181205|" <prog/s9help.scm >...-s9fes-20181205/bin/s9help
    ...-bash-5.2-p15/bin/bash: line 1: ...-s9fes-20181205/bin/s9help: No such file or directory
    make: *** [Makefile:157: install-util] Error 1
    make: *** Waiting for unfinished jobs....
@ofborg ofborg bot requested a review from siraben March 4, 2023 16:42
Without the change parallel installs fail as:

    install flags: -j1
    ...
    install -m644 src/doc/*.md ...-vpnc-unstable-2021-11-04/share/doc/vpnc
    install: target '...-vpnc-unstable-2021-11-04/share/doc/vpnc': No such file or directory
Without the change parallel installs fail as:

    ...-coreutils-9.1/bin/install: cannot stat 'asy-keywords.el': No such file or directory
    make: *** [Makefile:272: install-asy] Error 1
Without the change parallel installs fail as:

    cp: cannot stat '...-gretl-2022c/share/gretl/data/plotbars': Not a directory
    make[1]: *** [Makefile:73: install_datafiles] Error 1
Without the change parallel installs fail as:

    lrelease error: Parse error at src/translations/qsynth_ru.ts:1503:33: Premature end of document.
    make: *** [Makefile:107: src/translations/qsynth_ru.qm] Error 1
Without the change parallel installs fail as:

    ...-binutils-2.40/bin/ld: cannot find ./.libs/libircd.so: No such file or directory
    collect2: error: ld returned 1 exit status
    make[4]: *** [Makefile:634: solanum] Error 1
@trofi
Copy link
Contributor Author

trofi commented Mar 15, 2023

Hydra run is complete \o/: https://hydra.nixos.org/eval/1791763?compare=trunk&full=1#tabs-now-fail

It uncovered only 12 packages that fail to install (all have a workaround now). Does not sound too bad.

@mweinelt mweinelt merged commit 19680e9 into NixOS:staging Mar 15, 2023
5 checks passed
@trofi trofi deleted the stdenv-parallel-install branch March 15, 2023 18:40
@K900 K900 mentioned this pull request Mar 15, 2023
4 tasks
@trofi
Copy link
Contributor Author

trofi commented Mar 16, 2023

A bit of fallout in staging post merge:

@trofi
Copy link
Contributor Author

trofi commented Mar 20, 2023

From staging-next failures:

@vcunat vcunat mentioned this pull request Mar 24, 2023
6 tasks
@vcunat
Copy link
Member

vcunat commented Mar 26, 2023

I fixed more failures that seemed caused by this. I suppose the best reference generally is

git grep 'enableParallelInstalling = false;'

or

git log -S enableParallelInstalling

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants