
Parallelize LLVM image generation #47797

Merged · 34 commits into master · Mar 6, 2023
Conversation

@pchintalapudi (Member) commented Dec 4, 2022

By chopping up the LLVM module into N smaller modules, and doing some of the multiversioning work upfront, we can achieve large speedups from throwing multiple threads at sysimage building.

Should probably wait for #47184 to avoid large merge conflicts.

@vchuravy is there a good test for these changes on a package image workload?

TODO:
- [ ] figure out why target uses of relocation slots are not being picked up in the pre-optimization scanner
- [ ] lower-bound the thread count by the actual amount of work available
- [ ] remove debug timing prints
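
As a rough illustration of the scheme described above (in Julia terms only; the actual implementation lives in the C++ AOT compiler, and emit_partition / work_items here are hypothetical stand-ins for the per-partition codegen step and the list of compiled code to emit):

function emit_image_parallel(work_items, nthreads::Int)
    # One chunk of work per thread; each chunk becomes its own smaller module.
    chunksize = cld(length(work_items), nthreads)
    chunks = collect(Iterators.partition(work_items, chunksize))
    # Optimize/emit every chunk on its own thread, then gather the results
    # so the per-chunk outputs can be linked into the final image.
    tasks = [Threads.@spawn map(emit_partition, chunk) for chunk in chunks]
    return reduce(vcat, fetch.(tasks))
end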

@IanButterworth (Sponsor Member) commented Dec 5, 2022

Sounds great!

If this means that threading would be used for pkgimage generation during package precompilation, then we might need a slightly cleverer load-balancing approach in Pkg.precompile, given that it already uses multi-processing.

I can imagine a Pkg.precompile approach where, in the early stages of the dependency tree when many deps can be precompiled in parallel, multi-processing is prioritized and this stays single-threaded; then, once the tree is down to a small number of remaining deps, it bumps this multithreading N up (a sketch follows below).

That could nicely mitigate the long-tail effect in Pkg.precompile.
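
A minimal sketch of that policy, assuming a hypothetical helper Pkg could consult before spawning each precompile worker (ncores and n_parallelizable are whatever Pkg already knows about the machine and the remaining dependency tree; nothing like this exists in Pkg today):

function image_threads_for(ncores::Int, n_parallelizable::Int)
    # Early in the dependency tree: many packages can precompile at once,
    # so let multi-processing own the cores and keep each worker single-threaded.
    n_parallelizable >= ncores && return 1
    # In the long tail: few packages remain, so split the idle cores among them.
    return max(1, ncores ÷ max(n_parallelizable, 1))
end

Pkg could then hand the result to each worker via the JULIA_IMAGE_THREADS environment variable.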

@staticfloat (Sponsor Member):

This reduces the Outputting sysimage file... portion from 142s to 20s for me, using 32 cores.

@vchuravy (Sponsor Member) commented Dec 5, 2022

Yeah, we were particularly interested in reducing the long tail here, where single very large packages take excessive time.

@brenhinkeller added the compiler:llvm, compiler:latency, and compiler:codegen labels on Dec 5, 2022
@petvana (Member) commented Dec 6, 2022

Just out of curiosity: Would it be possible to reduce peak memory usage for building large sysimages by using more chunks than threads with some basic scheduler? (It may also help to distribute the load for CPUs with fewer threads.)

@pchintalapudi (Member, Author):

> Just out of curiosity: Would it be possible to reduce peak memory usage for building large sysimages by using more chunks than threads with some basic scheduler? (It may also help to distribute the load for CPUs with fewer threads.)

It's possible to chop up modules in this way, but judging by the build failures we actually use the most memory when we dump the sysimage bytes at the end, rather than during optimization. If that stops being an issue, we may want to revisit this.

@IanButterworth (Sponsor Member):

I just tried this rebased to include pkgimages, and with precompilation triggered via ordinary code loading I don't see >100% CPU.
Does it need to be switched on specifically for pkgimage generation?

@pchintalapudi (Member, Author):

Is there a print along the lines of Adding outputs with n threads? That should say if it's attempting to use multiple threads.

@IanButterworth (Sponsor Member):

All I see is this (and single thread utilization)

julia> using GLMakie
[ Info: Precompiling GLMakie [e9467ef8-e4e7-5192-8a1a-b1aee30e663a]

julia> 

create_expr_cache forwards both stdout and stderr from the child process, so it seems to not be happening.

@pchintalapudi (Member, Author):

Does it at least appear in the system image build after rebasing? If not, then something might have gone wrong there. In any case, I'm working on adapting to the pkgimg changes separately, so we'll see how that goes.

@IanButterworth (Sponsor Member):

It does

@pchintalapudi force-pushed the pc/psys3 branch 2 times, most recently from 06f01ed to 594fa00, on January 7, 2023
@pchintalapudi (Member, Author):

@IanButterworth could you try with this set of commits? I personally don't see any of the debug timing dumping I'm doing during the package load process, but if I set JULIA_IMAGE_THREADS to 4 and run using OrdinaryDiffEq I do see 400% CPU usage, which suggests that it will speed up package image compilation.
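
For anyone else trying this, the opt-in I'm describing is just the environment variable; a session along these lines is what I mean (illustrative only — it assumes the precompile worker inherits the parent's environment, and OrdinaryDiffEq is just an example package that hasn't been precompiled yet):

julia> ENV["JULIA_IMAGE_THREADS"] = "4"   # request 4 image-generation threads for the child precompile processes
"4"

julia> using OrdinaryDiffEq   # watch the CPU monitor during the pkgimage phase of precompilation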

@IanButterworth (Sponsor Member) commented Jan 7, 2023

It does appear to be working. On some packages the LLVM phase during precompilation is quite short with this speedup, so it can be easy to miss in a CPU monitor.

I also decided to try unrestricting this during Pkg.precompile (by reverting JuliaLang/Pkg.jl#3273): given that the LLVM stage is a short fraction of a package's precompilation with this PR, the likelihood of many threads from packages precompiling in parallel clashing and overwhelming the system seems low.

master

shell> rm -rf ~/.julia/compiled/v1.10/

(@v1.10) pkg> activate /home/ian/Documents/GitHub/Makie.jl/GLMakie
  Activating project at `~/Documents/GitHub/Makie.jl/GLMakie`

Precompiling environment...
  197 dependencies successfully precompiled in 191 seconds

This PR + JuliaLang/Pkg.jl#3273 reverted

shell> rm -rf ~/.julia/compiled/v1.10/

(@v1.10) pkg> activate /home/ian/Documents/GitHub/Makie.jl/GLMakie
  Activating project at `~/Documents/GitHub/Makie.jl/GLMakie`

(GLMakie) pkg> precompile
Precompiling environment...
  197 dependencies successfully precompiled in 129 seconds

So that's a minute faster (about 2/3 of the time master takes), and it behaved well; the animation remained smooth.
That's on a 16-core Linux machine.

For completeness: during the julia build, outputting the sysimage takes 120s on master and 21s with this PR.

Seeing how Windows behaves during the same test would be interesting, I think, so I pushed JuliaLang/julia@9647a58 (#47797) so that others can test. Note that that commit needs to be dropped before merge.

@pchintalapudi (Member, Author):

That looks like a decent time saving; do we see similar savings with PackageCompiler custom system images?

@giordano (Contributor) commented Jan 7, 2023

Using an environment with only Plots.jl. On Julia v1.8.3:

(tmp) pkg> instantiate
Precompiling project...
  135 dependencies successfully precompiled in 38 seconds

master (de73c26):

(tmp) pkg> precompile
Precompiling environment...
  134 dependencies successfully precompiled in 76 seconds

This PR (9647a58):

(tmp) pkg> precompile
Precompiling environment...
  134 dependencies successfully precompiled in 63 seconds

It’s still far from v1.8 times, but slightly faster than master.

@pchintalapudi (Member, Author):

@IanButterworth After thinking a little more, I think we'll probably run into bad situations if we precompile a project with multiple heavy top-level dependencies, rather than just one. Right now we only use half the system threads for parallel precompilation, so if there are 3 or more heavy packages at the top level we'll probably see those thread clashes become more problematic (e.g. on a 16-core machine, three heavy packages each grabbing ~8 image threads would ask for ~24 threads at once).

@giordano (Contributor):

GNU Make 4.4 has a nice new feature:

* New feature: Improved support for -l / --load-average
  On systems that provide /proc/loadavg (Linux), GNU Make will use it to
  determine the number of runnable jobs and use this as the current load,
  avoiding the need for heuristics.
  Implementation provided by Sven C. Dack <sdack@gmx.com>

Is there any hope to implement something similar (aiming to maintain a given load average, rather than using a fixed number of threads)? I guess the challenge would also be to do it for all platforms.

@pchintalapudi (Member, Author):

We pick the number of threads at start time and partition the work based on that count, unlike make, which has a fixed job list with dependencies. We could potentially use the system load to select a more aggressive fraction of the core count for these jobs, but I don't know if system load will respond quickly enough to be a useful heuristic here.
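
For what it's worth, a load-based pick would be cheap to compute on Linux. Something like the following (a hypothetical helper, not part of this PR; it only answers the "how many threads right now" question and doesn't address the responsiveness concern above):

function image_threads_from_load(ncores::Int = Sys.CPU_THREADS)
    # Fall back to a ~half-the-cores default where /proc/loadavg is unavailable.
    isfile("/proc/loadavg") || return max(1, ncores ÷ 2)
    # The first field is the 1-minute load average; only use the cores that look idle.
    load = parse(Float64, first(split(read("/proc/loadavg", String))))
    return clamp(ncores - round(Int, load), 1, ncores)
end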

@staticfloat (Sponsor Member):

I think the most likely useful approach is going to be similar to -j in make; because we know each thread will approximately saturate a single core, the ideal number of threads across all processes is roughly the number of cores in the system. To solve this, we’d need a way to coordinate thread creation across processes, e.g. we create a Julia subprocess and refuse to let it create threads if the parent already has N threads alive. This could actually be useful for normal Julia workloads as well, but in any case, I think it’s fine to overload the system for a minute or two for now. This is a big improvement, and overloading the system in the short term should not be considered a blocker, IMO.

@Krastanov:

A question from the peanut gallery: are you saying the system will be overloaded in terms of CPU usage, or in terms of memory? I am asking because when parallel precompilation was enabled some time ago, there were a lot of novices who were confused about why Julia was crashing with out-of-memory errors on their systems. I have students who have to start Julia with a single thread; otherwise precompiling DiffEq and something else at the same time would definitely cause the OOM killer to kill programs on their device.

Either way, looking forward to this being merged. Just trying to make sure I know what to tell students with machines with no more than 8 GB of RAM.

@pchintalapudi (Member, Author):

@staticfloat I'm now thinking that memory limiting should be out of scope here. It looks like Pkg.jl will be controlling the number of threads for precompile and therefore Pkg should be responsible for any memory limiting. The other case where this is useful is when building system images, in which case I don't think it's unreasonable to ask people to set the environment variable if that's an issue. What do you think about this?

@pchintalapudi (Member, Author):

@nanosoldier runtests(configuration=(precompile=false, julia_flags=["--pkgimages=yes"], env=["JULIA_IMAGE_THREADS=8"]))

@nanosoldier (Collaborator):

Your package evaluation job has completed - possible new issues were detected.
A full report can be found here.

@IanButterworth (Sponsor Member):

Are the pkgeval runs equivalent here apart from the commit? Or is one run under rr etc?

@pchintalapudi (Member, Author):

The last failures were caused by unfortunate timing. A linked change between master and JET.jl went through in the last 3 days that broke packages dependent on JET. Rebasing eliminated that error, but I want to make sure nothing else has changed since then.

@pchintalapudi (Member, Author):

@nanosoldier runtests(["EQDSKReader", "NSGAII", "NeidArchive", "ReliabilityDiagrams", "Survey", "QuantumOptics", "Syslogs", "Scalpels", "MakieThemes", "PopGenCore", "NonlinearSolveMINPACK", "DiffEqCallbacks", "UnfoldMakie", "JosephsonCircuits", "REoptLite", "PersistentCollections", "QuantEcon", "RPRMakie", "Bokeh", "REopt", "LogicToolkit", "WaterWaves1D"], configuration = (precompile = false, julia_flags = ["--pkgimages=yes"], env = ["JULIA_IMAGE_THREADS=8"]))

@pchintalapudi (Member, Author):

If this comes back without obvious related failures I think we can call this clean from a testing perspective.

@nanosoldier (Collaborator):

Your package evaluation job has completed - possible new issues were detected.
A full report can be found here.

@giordano (Contributor) commented Mar 5, 2023

Tests of Scalpels.jl basically fail because two different environments are used in the two runs of the tests, with only the one for this PR having an old, buggy version of Polynomials.jl.

@maleadt Using different environments makes apples-to-apples comparisons hard and can add needless noise, like in this case; is there a way to avoid that?

@maleadt (Member) commented Mar 5, 2023

> Tests of Scalpels.jl basically fail because two different environments are used in the two runs of the tests, with only the one for this PR having an old, buggy version of Polynomials.jl.

Both tests use Polynomials 2.0.25 during package installation and v1.2.1 during testing; I don't see a difference between the two runs? FWIW, PkgEval.jl checks out and fixes the registry during set-up, so there aren't multiple environments at play.

@giordano (Contributor) commented Mar 5, 2023

Uhm, ok, I got confused comparing the environments of the two runs and didn't realise the installation and test environments were different. Scalpels.jl sets needlessly restrictive compat bounds for the test environment; I opened a PR to remove that: RvSpectML/Scalpels.jl#4. It's still unclear why the warning that causes the test failure is triggered only in one case, but that had been fixed in Polynomials.jl by JuliaMath/Polynomials.jl#442.

@vchuravy (Sponsor Member) left a review:

This looks good from my side.

@vchuravy requested a review from maleadt on March 6, 2023
@staticfloat (Sponsor Member):

LGTM!

@staticfloat merged commit 0f5f62c into master on Mar 6, 2023
@staticfloat deleted the pc/psys3 branch on March 6, 2023
@timholy (Sponsor Member) commented Mar 6, 2023

Woohoo! Thanks @pchintalapudi, looking forward to seeing this in action.

@staticfloat (Sponsor Member) commented Mar 6, 2023

From 454 seconds to 67 seconds: in one fell swoop, you've reduced overall build CI times by 30%. (The assert build is even more impressive: from 700 seconds to 101 seconds.)

The buildbots salute you, sir.

[image: buildbot salute]

pchintalapudi added a commit that referenced this pull request Mar 6, 2023
@KristofferC (Sponsor Member):

@pchintalapudi, I noticed my M1 Pro only seems to be using 4 cores for generating the sysimage, but the M1 Pro has 8 (or 6) high-performance cores, and manually bumping the count via the env variable improves performance (42s -> 33s for the sysimage). Of course I can just set that env variable, but maybe the defaults could be tweaked a bit?

@pchintalapudi (Member, Author):

I'm hesitant to increase the number of cores allocated in general, as users on low-power systems will probably want to do something else with the rest of their compute while waiting for image compilation. Also, keeping the core count at ~half means we are much less likely to see oversubscription in Pkg.jl runs, since even if two packages overlap we will not have oversubscribed the system. That said, for the system image or PackageCompiler in particular I think we can just set JULIA_IMAGE_THREADS to Sys.CPU_THREADS, since we don't expect underpowered users to be running those commands anyway.
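
Concretely, the opt-in for an all-cores image build would just be something like the following (illustrative only; whether PackageCompiler should set this automatically is a separate question):

julia> ENV["JULIA_IMAGE_THREADS"] = string(Sys.CPU_THREADS)   # use every core for the image build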

for (auto &F : M.functions()) {
    if (!F.isDeclaration()) {
        if (!Preserve.contains(&F)) {
            // Strip the body so this partition only carries a declaration.
            F.deleteBody();
        }
    }
}
Review comment (Sponsor Member):

I am not sure this results in legal IR, since there may still be attributes attached to the module for the declaration (such as DISubprogram and llvm.dbg.cu) which should not be added there. We used to have a function (prepare_call) for getting a proper declaration for a definition, but I think we removed that at some point. I thought maybe OrcJIT would have it, but it doesn't look like it includes support for handling debug information correctly either: https://llvm.org/doxygen/CompileOnDemandLayer_8cpp_source.html

@pchintalapudi (Author):

Can we just drop all metadata on the function? Since we're still attaching metadata to the real definition, the linker might save us, right?

Review comment (Sponsor Member):

We probably don't need the metadata, though note that this is module metadata, not function metadata.

@pchintalapudi (Author):

Module metadata seems more complicated since we're not allowed to drop module flags. However, we could go back to using CloneModule either before or after serialization, at a time cost. Presumably CloneModule should handle the debug info correctly.

Review comment (Sponsor Member):

We emitted them into separate modules to avoid this issue. Can we keep them separate?

@pchintalapudi (Author):

It requires a bit more care to keep the modules separate since GPUCompiler also uses this codepath and probably assumes a single module, which I'm more hesitant to mess with.

For DISubprogram, the LangRef says it's legal to have it on a declaration for call-site debug info, so I think it's valid to leave it there despite deleting the function body. I suspect the same reasoning applies to global variable declarations, though the LangRef does not mention that; stripping them also shouldn't affect correctness. llvm.dbg.cu doesn't have much documentation, but I don't see anything obvious that LLVM doesn't already do (https://github.com/llvm/llvm-project/blob/main/llvm/lib/Transforms/Utils/CloneFunction.cpp#L280 copies all the compile units, irrespective of whether they're related to the function that was just cloned), so I'm not seeing anything obviously wrong here.

Review comment (Sponsor Member):

> It requires a bit more care to keep the modules separate since GPUCompiler also uses this codepath and probably assumes a single module, which I'm more hesitant to mess with.

We could merge it in GPUCompiler.jl; generally I would be in favor of exposing the work-list to GPUCompiler, since we might want to implement caching differently for Enzyme vs CUDA.
