
Add PGO+LTO Makefile #45641

Merged: 5 commits from ttfx-improvements into JuliaLang:master on Feb 9, 2024

Conversation

haampie
Contributor

@haampie haampie commented Jun 10, 2022

Adds a convenient way to enable PGO+LTO on Julia and LLVM together:

  1. cd contrib/pgo-lto
  2. make -j$(nproc) stage1
  3. make clean-profiles
  4. ./stage1.build/julia -O3 -e 'using Pkg; Pkg.add("LoopVectorization"); Pkg.test("LoopVectorization")'
  5. make -j$(nproc) stage2
Output looks roughly as follows:
$ make -C contrib/pgo-lto top 
make: Entering directory '/dev/shm/julia/contrib/pgo-lto'
llvm-profdata show --topn=50 /dev/shm/julia/contrib/pgo-lto/profiles/merged.prof | c++filt
Instrumentation level: IR  entry_first = 0
Total functions: 85943
Maximum function count: 7867557260
Maximum internal block count: 3468437590
Top 50 functions with the largest internal block counts: 
  llvm::BitVector::operator|=(llvm::BitVector const&), max count = 7867557260
  LateLowerGCFrame::ComputeLiveness(State&), max count = 3468437590
  llvm::hashing::detail::hash_combine_recursive_helper::hash_combine_recursive_helper(), max count = 1742259834
  llvm::SUnit::addPred(llvm::SDep const&, bool), max count = 511396575
  llvm::LiveRange::overlaps(llvm::LiveRange const&, llvm::CoalescerPair const&, llvm::SlotIndexes const&) const, max count = 508061762
  llvm::StringMapImpl::LookupBucketFor(llvm::StringRef), max count = 505682177
  std::map<llvm::BasicBlock*, BBState, std::less<llvm::BasicBlock*>, std::allocator<std::pair<llvm::BasicBlock* const, BBState> > >::operator[](llvm::BasicBlock* const&), max count = 395628888
  llvm::LiveRange::advanceTo(llvm::LiveRange::Segment const*, llvm::SlotIndex) const, max count = 384642728
  llvm::LiveRange::isLiveAtIndexes(llvm::ArrayRef<llvm::SlotIndex>) const, max count = 380291040
  llvm::PassRegistry::enumerateWith(llvm::PassRegistrationListener*), max count = 352313953
  ijl_method_instance_add_backedge, max count = 349608221
  llvm::SUnit::ComputeHeight(), max count = 336604330
  llvm::LiveRange::advanceTo(llvm::LiveRange::Segment*, llvm::SlotIndex), max count = 331030109
  llvm::SmallPtrSetImplBase::insert_imp(void const*), max count = 272966545
  llvm::LiveIntervals::checkRegMaskInterference(llvm::LiveInterval&, llvm::BitVector&), max count = 257449540
  LateLowerGCFrame::ComputeLiveSets(State&), max count = 252096274
  /dev/shm/julia/src/jltypes.c:has_free_typevars, max count = 230879464
  ijl_get_pgcstack, max count = 216953592
  LateLowerGCFrame::RefineLiveSet(llvm::BitVector&, State&, std::vector<int, std::allocator<int> > const&), max count = 188013152
  /dev/shm/julia/src/flisp/flisp.c:apply_cl, max count = 174863813
  /dev/shm/julia/src/flisp/builtins.c:fl_memq, max count = 168621603

This quite often results in spectacular speedups for time-to-first-X, as
it reduces the time spent in LLVM optimization passes by 25% or even 30%.

Example 1:

using LoopVectorization
function f!(a, b)
    @turbo for i in eachindex(a)
        a[i] *= b[i]
    end
    return a
end
f!(rand(1), rand(1))
$ time ./julia -O3 lv.jl

Without PGO+LTO: 14.801s
With PGO+LTO: 11.978s (-19%)

Example 2:

$ time ./julia -e 'using Pkg; Pkg.test("Unitful");'

Without PGO+LTO: 1m47.688s
With PGO+LTO: 1m35.704s (-11%)

Example 3 (taken from issue #45395, which is almost only LLVM):

$ JULIA_LLVM_ARGS=-time-passes ./julia script-45395.jl

Without PGO+LTO:

===-------------------------------------------------------------------------===
                      ... Pass execution timing report ...
===-------------------------------------------------------------------------===
  Total Execution Time: 101.0130 seconds (98.6253 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
  53.6961 ( 54.7%)   0.1050 (  3.8%)  53.8012 ( 53.3%)  53.8045 ( 54.6%)  Unroll loops
  25.5423 ( 26.0%)   0.0072 (  0.3%)  25.5495 ( 25.3%)  25.5444 ( 25.9%)  Global Value Numbering
   7.1995 (  7.3%)   0.0526 (  1.9%)   7.2521 (  7.2%)   7.2517 (  7.4%)  Induction Variable Simplification
   5.0541 (  5.1%)   0.0098 (  0.3%)   5.0639 (  5.0%)   5.0561 (  5.1%)  Combine redundant instructions #2

With PGO+LTO:

===-------------------------------------------------------------------------===
                      ... Pass execution timing report ...
===-------------------------------------------------------------------------===
  Total Execution Time: 72.6507 seconds (70.1337 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
  36.0894 ( 51.7%)   0.0825 (  2.9%)  36.1719 ( 49.8%)  36.1738 ( 51.6%)  Unroll loops
  16.5713 ( 23.7%)   0.0129 (  0.5%)  16.5843 ( 22.8%)  16.5794 ( 23.6%)  Global Value Numbering
   5.9047 (  8.5%)   0.0395 (  1.4%)   5.9442 (  8.2%)   5.9438 (  8.5%)  Induction Variable Simplification
   4.7566 (  6.8%)   0.0078 (  0.3%)   4.7645 (  6.6%)   4.7575 (  6.8%)  Combine redundant instructions #2

Or -28% time spent in LLVM.

perf reports show this is mostly due to fewer instructions executed and a reduction in icache misses.
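
A minimal sketch of how such counters can be collected (using the lv.jl workload from Example 1; the perf event selection here is just an illustration, not the exact invocation used for this PR):

$ perf stat -e instructions,L1-icache-load-misses,branch-misses ./julia -O3 lv.jl

Running this against both the plain and the PGO+LTO build gives the instruction and icache-miss counts to compare.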


Finally there's a significant reduction in binary sizes. For libLLVM.so:

79M	usr/lib/libLLVM-13jl.so (before)
67M	usr/lib/libLLVM-13jl.so (after)

And it can be reduced by another 2 MB with --icf=safe when using LLD as
the linker anyway.
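
For context, when linking through the clang driver the flag is forwarded to LLD roughly like this (a generic sketch; <objects> stands for the elided inputs, and this is not the exact command the build system runs):

$ clang -shared -fuse-ld=lld -Wl,--icf=safe <objects> -o usr/lib/libLLVM-13jl.so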

  • Two out-of-source builds would be better than a single in-source build, so that it's easier to find good profile data

@giordano added the domain:building (Build system, or building Julia or its dependencies) label on Jun 10, 2022
@vchuravy
Sponsor Member

Can we use make -C deps install-clang / Clang_jll as the stage0? We should formalize that for ASAN as well.

@haampie
Contributor Author

haampie commented Jun 10, 2022

I tried Clang_jll as well, and things compile fine, but no profile data is generated. Maybe some relocation issue, I didn't look into it further.

@haampie
Contributor Author

haampie commented Jun 10, 2022

It's not just compile-time improvements, but also runtime:

$ cat alloc.jl
@time for i in 1:1000000000
    string(i)
end
$ julia ./alloc.jl
Before: 43.057379 seconds (2.00 G allocations: 89.332 GiB, 6.71% gc time)
After:  34.928795 seconds (2.00 G allocations: 89.332 GiB, 6.95% gc time)

@oscardssmith
Member

oscardssmith commented Jun 10, 2022

How does this affect runtime?

@vchuravy
Sponsor Member

I tried Clang_jll as well, and things compile fine, but no profile data is generated. Maybe some relocation issue, I didn't look into it further.

Might be that we misconfigured it :)

@haampie
Contributor Author

haampie commented Jun 11, 2022

How does this affect runtime?

From perf record:

symbol                          before    after
julia_dec_41286*                 8.66%   22.66%
jl_gc_pool_alloc_noinline        9.00%   11.49%
ijl_alloc_string                 7.07%    9.72%
ijl_array_to_string             17.45%    5.61%
ijl_string_to_array             19.73%    3.62%
julia_ndigits0zpb_50016*         1.92%    3.23%
julia_YY.stringYY.443_30211*     1.47%    1.74%
ijl_get_pgcstack                 0.39%    0.89%
gc_sweep_pool                    0.37%    0.45%
add_page                         0.30%    0.33%

The symbols with * are in sys.o, the others are libjulia-internal.so.
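
A rough sketch of how a table like this can be produced (the workload is the alloc.jl script above; the julia-pgo-lto binary name and the exact perf options are illustrative, not taken from this PR):

$ perf record -o before.data ./julia ./alloc.jl
$ perf record -o after.data ./julia-pgo-lto ./alloc.jl
$ perf report -i before.data --sort symbol --stdio
$ perf report -i after.data --sort symbol --stdio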

So it seems ijl_array_to_string and ijl_string_to_array are optimized better

@giordano
Contributor

So it seems ijl_array_to_string and ijl_string_to_array are optimized better

OK, so this is the C side of the runtime which gets optimised, but the code generated by Julia should still be the same, right?

Sponsor Member

@staticfloat staticfloat left a comment

Overall, this looks really nice! Excellent work! I think there are a few more pieces that would be really helpful for this:

  • Tracking down why Clang_jll doesn't work. I'll be happy to help investigate whether we're building it wrong or what. Being able to use that would improve the ergonomics of this significantly, IMO.
  • A smoke-test script that can be run to build julia, generate an example trace, then rebuild Julia with that profile data. We can, for example, run that on CI to ensure that we don't break this in the future.

Comment on lines 37 to 39
stage1: export CFLAGS=-fprofile-generate=$(PROFILE_DIR) -Xclang -mllvm -Xclang -vp-counters-per-site=$(COUNTERS_PER_SITE)
stage1: export CXXFLAGS=-fprofile-generate=$(PROFILE_DIR) -Xclang -mllvm -Xclang -vp-counters-per-site=$(COUNTERS_PER_SITE)
stage1: export LDFLAGS=-fuse-ld=lld -flto=thin -fprofile-generate=$(PROFILE_DIR)
Sponsor Member

I actually didn't know you could attach environment variables to targets like this in Make! This is very cool!

Quick demonstration to anyone else watching, who wants to understand better how this interacts with rules and dependencies:

$ cat Makefile 
all: foo bar foobar

# This rule will have `$FOO` defined within it
foo:
        @echo "[foo]    FOO: $${FOO}"

# This rule will not
bar:
        @echo "[bar]    FOO: $${FOO}"

# Even though this rule depends on `foo`, it won't have `$FOO` defined.
foobar: foo bar
        @echo "[foobar] FOO: $${FOO}"

# Attach an environment variable to `foo`
foo: export FOO=foo
$ make
[foo]    FOO: foo
[bar]    FOO: 
[foobar] FOO: 

Contributor Author

Yeah, this is neat! :D

Contributor Author

It wasn't so neat in the end, because the variables are also set on prerequisites.
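
A minimal sketch of that pitfall, in the same style as the demo above (hypothetical Makefile):

$ cat Makefile
all: dep
        @echo "[all] FOO: $${FOO}"

# `dep` never sets FOO itself, but as a prerequisite of `all`
# it inherits the target-specific export into its recipe too.
dep:
        @echo "[dep] FOO: $${FOO}"

all: export FOO=set-on-all
$ make
[dep] FOO: set-on-all
[all] FOO: set-on-all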

@haampie
Contributor Author

haampie commented Jun 15, 2022

Regarding lld, I tried both LLVM 13 and 14; both have the issue. When you do clang -fprofile-generate, clang adds the static compiler-rt library to the linker invocation, together with the flag -u__llvm_profile_runtime. This flag forces an object file defining that symbol to be linked (I guess they want finer granularity than --whole-archive), and this object file contains a global whose constructor registers an atexit hook. The issue is that this constructor is never called...

When comparing system clang/lld vs Julia's clang/lld, it seems the system version generates just a .init_array section, whereas Julia's version adds both an .init_array and a .ctors section to the ELF file. So likely that's an issue, but I can't explain why.

Edit: okay, so the issue is potentially that the static lib has a .ctors section instead of an .init_array?

$ ar x /home/harmen/Documents/projects/julia/usr/lib/clang/14.0.3/lib/linux/libclang_rt.profile-x86_64.a InstrProfilingRuntime.cpp.o
$ objdump -x InstrProfilingRuntime.cpp.o
...
  4 .ctors        00000008  0000000000000000  0000000000000000  00000048  2**3
                  CONTENTS, ALLOC, LOAD, RELOC, DATA
...

Possibly related: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100896 ?

@haampie
Contributor Author

haampie commented Jun 15, 2022

That's probably it... ld translates ctors into init_array and merges the sections, whereas lld keeps the sections the same. From an old thread in the GCC bug tracker: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=46770:

// ctors.c
#include <stdio.h>
static void ctor() { puts("ctor"); }
static void dtor() { puts("dtor"); }
static void (*const ctors []) ()
  __attribute__ ((used, section (".ctors"), aligned (sizeof (void *))))
  = { ctor };
static void (*const dtors []) ()
  __attribute__ ((used, section (".dtors"), aligned (sizeof (void *))))
  = { dtor };

// init_array.c
#include <stdio.h>
static void init() { puts ("init_array"); }
static void fini () { puts ("fini_array"); }
static void (*const init_array []) ()
  __attribute__ ((used, section (".init_array"), aligned (sizeof (void *))))
  = { init };
static void (*const fini_array []) ()
  __attribute__ ((used, section (".fini_array"), aligned (sizeof (void *))))
  = { fini };

// main.c
#include <stdio.h>
int main() { puts("hello world");}
$ clang -fuse-ld=ld main.c ctors.c init_array.c -o with_ld 
$ ./with_ld
ctor
init_array
hello world
fini_array
dtor
$ clang -fuse-ld=lld main.c ctors.c init_array.c -o with_lld 
$ ./with_lld
init_array
hello world
fini_array
$ readelf -S with_ld | grep -E '(ctors|init_array)'
  [18] .init_array       INIT_ARRAY       0000000000403df0  00002df0
$ readelf -S with_lld | grep -E '(ctors|init_array)'
  [12] .ctors            PROGBITS         00000000002004c8  000004c8
  [21] .init_array       INIT_ARRAY       0000000000202900  00000900

So the solution is to configure GCC < 11 targeting Linux with --enable-initfini-array? (https://reviews.llvm.org/D45508 did not land.) As I understand it, when using Clang_jll locally, it will still look for a system GCC, from which it takes, among other things, crtbegin.o/crtend.o. Those files should have .ctors/.dtors sections with the required sentinel values that make those constructors/destructors work, meaning support for ctors/dtors is fixed when GCC is compiled, and there's no way around that except using a linker that converts .ctors into .init_array, or renaming those sections with objcopy (which sounds painful).
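
For reference, that objcopy rename on the single offending object extracted earlier would look roughly like this (a sketch; it is the same rename the stage0 Makefile rule later in this thread applies to the whole static archive):

$ objcopy --rename-section .ctors=.init_array --rename-section .dtors=.fini_array InstrProfilingRuntime.cpp.o
$ objdump -h InstrProfilingRuntime.cpp.o | grep -E 'ctors|init_array'   # should now show only .init_array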

Edit: https://github.com/JuliaBinaryWrappers/GCCBootstrap_jll.jl/releases uses .init/fini_array

@Krastanov

Will this be something used in the official or nightly binaries, or will it be available only to people that build julia on their own? Asking as I do not know exactly how the contrib folder is used. Please excuse the tangential question.

@haampie
Contributor Author

haampie commented Jun 27, 2022

If you have a recent clang and lld on your system, it might be best to try commit c949fc0 and go through the steps at the top of this PR.

In the newer commits the idea is to use a patched Yggdrasil version of clang so you no longer need clang installed, but currently make gets stuck in an infinite loop, and I haven't had time to check why yet.

$(MAKE) -C $(STAGE0_BUILD)/deps install-clang install-llvm install-llvm-tools
# Turn [cd]tors into init/fini_array sections in libclang_rt, since lld
# doesn't do that, and otherwise the profile constructor is not executed
find $< -name 'libclang_rt.profile-*.a' -exec objcopy --rename-section .ctors=.init_array --rename-section .dtors=.fini_array {} +
Sponsor Member

Do these not have opposite ordering?

Contributor Author

That could be; I need to check. It's likely there's only one global, so it might not be an issue.

Contributor Author

$ nm --defined-only ./clang/14.0.5/lib/linux/libclang_rt.profile-x86_64.a | grep GLOBAL
0000000000000000 t _GLOBAL__sub_I_InstrProfilingRuntime.cpp

@staticfloat
Sponsor Member

Will this be something used in the official or nightly binaries, or will it be available only to people that build julia on their own?

It's not certain yet; we'll need to do pretty extensive testing to ensure that it wouldn't e.g. speed up some workloads, but slow down others. Most likely what this will be used for is for application-specific Julia builds, e.g. you have a workload and you want Julia to run 10% faster on that workload, so you can profile it on exactly that workload.

@haampie force-pushed the ttfx-improvements branch 3 times, most recently from 229819b to 6f56ffd on June 28, 2022 10:42
Turn into makefile

Newline

Use two out of source builds

Ignore profiles + build dirs

Add --icf=safe

stage0 setup prebuilt clang with [cd]tors->init/fini patch
@haampie
Contributor Author

haampie commented Jun 28, 2022

This should now build Julia with BB's LLVM

@maleadt
Member

maleadt commented Jun 28, 2022

I'm getting an error building stage1 here (on a fresh clone):

    JULIA contrib/pgo-lto/stage1.build/usr/lib/julia/corecompiler.ji
ERROR: `ccall` requires the compiler

@haampie
Contributor Author

haampie commented Jun 28, 2022

I've seen that before when a different libstdc++.so is used during linking & runtime. Does ./contrib/pgo-lto/stage0.build/usr/tools/clang -v pick a sensible GCC?

@maleadt
Member

maleadt commented Jun 28, 2022

Does ./contrib/pgo-lto/stage0.build/usr/tools/clang -v pick a sensible GCC?

I think so?

$ ./stage0.build/usr/tools/clang -v
clang version 14.0.5 (/depot/downloads/clones/llvm-project.git-5a9787eb535c2edc5dea030cc221c1d60f38c9f42344f410e425ea2139e233aa 3c1151c0f6c5b265ec2b3a176fe12be4b23252bf)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /home/tim/Julia/src/julia/contrib/pgo-lto/./stage0.build/usr/tools
Found candidate GCC installation: /usr/lib/gcc/x86_64-pc-linux-gnu/10.3.0
Found candidate GCC installation: /usr/lib/gcc/x86_64-pc-linux-gnu/11.3.0
Found candidate GCC installation: /usr/lib/gcc/x86_64-pc-linux-gnu/12.1.0
Found candidate GCC installation: /usr/lib64/gcc/x86_64-pc-linux-gnu/10.3.0
Found candidate GCC installation: /usr/lib64/gcc/x86_64-pc-linux-gnu/11.3.0
Found candidate GCC installation: /usr/lib64/gcc/x86_64-pc-linux-gnu/12.1.0
Selected GCC installation: /usr/lib64/gcc/x86_64-pc-linux-gnu/12.1.0
Candidate multilib: .;@m64
Candidate multilib: 32;@m32
Selected multilib: .;@m64
Found CUDA installation: /opt/cuda, version

Tracing execution in GDB, it looks like we're dispatching to the codegen stubs instead of the actual compiler, jl_generate_fptr_for_unspecialized_fallback instead of jl_generate_fptr_for_unspecialized_impl.

Ah, this is once more caused by the libstdc++ we helpfully put there (I'm on Arch, so have a recent libc):

$ ldd usr/lib/libjulia-codegen.so.1.9
usr/lib/libjulia-codegen.so.1.9: /home/tim/Julia/src/julia/contrib/pgo-lto/stage1.build/usr/lib/libstdc++.so.6: version `GLIBCXX_3.4.30' not found (required by /home/tim/Julia/src/julia/contrib/pgo-lto/stage1.build/usr/lib/libLLVM-14jl.so)

Removing that makes the build continue.
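
Concretely, a sketch of that workaround, using the path from the ldd output above (deleting the bundled copy and resuming the build; the exact steps are an assumption):

$ rm contrib/pgo-lto/stage1.build/usr/lib/libstdc++.so*
$ make -C contrib/pgo-lto -j$(nproc) stage1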

@haampie
Contributor Author

haampie commented Jun 28, 2022

Yeah, Julia's libstdc++ detection looks wrong with clang. It uses the default Fortran compiler, assumes it's GCC, and uses the libstdc++ in its install prefix. But clang will search for the most recent GCC on the system and use the libstdc++ shipped with that.

@staticfloat maybe it's more reliable to use something along the lines of $(CXX) test.cc && ldd a.out | grep ...?
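
A minimal sketch of that kind of probe (file names and the grep pattern are just placeholders):

$ printf 'int main() { return 0; }\n' > test.cc
$ clang++ test.cc -o a.out
$ ldd a.out | grep libstdc++   # prints the libstdc++ the chosen compiler actually links against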

@maleadt
Member

maleadt commented Jun 28, 2022

I can reproduce the speed-up, but at least some of it seems to come from the LLVM source build that's involved (either the fact that it's a source build, or the -flto=thin, or ...). Doing a Pkg.test("Unitful"):

Julia with LLVM from BB
./bin/julia -e   86.49s user 6.18s system 108% cpu 1:25.78 total
./bin/julia -e   85.12s user 5.64s system 107% cpu 1:24.23 total

PGO+LTO stage2 without `-fprofile-use`
./bin/julia -e   83.23s user 6.09s system 108% cpu 1:22.41 total
./bin/julia -e   82.89s user 6.08s system 108% cpu 1:22.10 total

Actual PGO+LTO
./bin/julia -e   78.25s user 5.90s system 108% cpu 1:17.28 total
./bin/julia -e   78.55s user 6.09s system 108% cpu 1:17.77 total

@haampie
Contributor Author

haampie commented Jun 29, 2022

Probably stage0 should be optional, so that package managers can use their own clang.

@maleadt
Member

maleadt commented Jul 4, 2022

Maybe you can print the top x of those and see if there is some similarity among those packages.

In absolute difference:

 Row │ name                               diff    
     │ String                             Float64 
─────┼────────────────────────────────────────────
   1 │ KernelEstimator                    932.901
   2 │ NeighbourLists                     593.324
   3 │ CrystalInfoFramework               571.33
   4 │ AdmittanceModels                   417.928
   5 │ GXBeam                             365.921
   6 │ NeutralLandscapes                  310.035
   7 │ YAAD                               221.342
   8 │ DifferentiableTrajectoryOptimiza…  164.797
   9 │ Quiqbox                            142.185
  10 │ DrelTools                          127.423

Looking at KernelEstimator:

 Row │ name             1.9.0-buildkite  1.9.0-lto  1.9.0-reference  1.9.0-lto+pgo 
     │ String           Float64          Float64    Float64          Float64       
─────┼─────────────────────────────────────────────────────────────────────────────
   1 │ KernelEstimator           1727.3    778.552          794.394        402.494

I can pretty much reproduce these timings in isolation. This was executed on a 2x 32-Core AMD EPYC 7513 (deepsea4 for JC people).

@haampie
Contributor Author

haampie commented Jul 4, 2022

Again the relative numbers, lto+pgo compared to reference, with the top 10 dropped on both sides; y-axis = % change in runtime, x-axis = packages sorted by change:

(screenshot of the per-package runtime change plot)

List of packages with more than 5% increase in runtime:

 Row │ name            ref         pgo         change   
     │ String          Float64     Float64     Float64  
─────┼──────────────────────────────────────────────────
   1 │ GridUtilities    144.54      151.81      5.0297
   2 │ MLJOpenML        118.427     125.063     5.60387
   3 │ GLNS              16.4351     17.5027    6.49577
   4 │ JuliaFormatter    74.4771     82.8296   11.2148
   5 │ JuliaZH            5.25297     6.37024  21.2693
   6 │ Quiqbox         1239.79     1537.26     23.9933
   7 │ LKH                5.42084     8.16988  50.7125

87% of the pkgs: at least 5% reduction in test runtime
32% of the pkgs: at least 10% reduction in test runtime

@maleadt
Member

maleadt commented Jul 5, 2022

A comment by @gbaraldi is that we should check whether the PGO-attained performance benefit is portable across systems. The easiest way for that is if we could trick the buildbots into generating PGO-optimized binaries, and run PkgEval on that. For testing purposes, maybe we could commit a merged profile trace (e.g. from running Base.runtests) and modify the main Makefile to use it?

@haampie
Contributor Author

haampie commented Jul 5, 2022

Here's a merged.prof file generated using Base.runtests on x86; could someone with an M1 try just the stage 2 build? I don't have access to aarch64 right now.

Without changing the Makefile:

cd contrib/pgo-lto
make stage0 -j$(nproc)
touch stage1
mkdir profiles
curl -Lfs https://github.com/JuliaLang/julia/files/9048880/data.tar.gz | tar -zxf- -C profiles/
touch profiles/merged.prof
make stage2 -j$(nproc)

@haampie
Contributor Author

haampie commented Jul 5, 2022

Interestingly, those additional measurements show that LTO itself doesn't yield much speed-up compared to a plain source build

Turns out this is because -flto was not part of the cflags/cxxflags; I must have dropped those flags during a force push :(. However, adding them back runs into counter overflow issues really quickly, which happens for sure with Base.runtests() as training data. I've seen this before, and it is probably a bug in LLVM, since these numbers are impossible:

Maximum function count: 17870283321406155538
Maximum internal block count: 17582052945254417008
Top 50 functions with the largest internal block counts: 
  llvm::IRBuilderBase::CreateAlloca(llvm::Type*, llvm::Value*, llvm::Twine const&), max count = 17870283321406155538

Workaround: use -fprofile-instr-generate=$(PROFILE_DIR)/prof-%p.profraw, which seems to work fine, but the downside is it generates ~ 30GB of profile data instead of ~ 15MB.
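
For reference, the per-process .profraw files from that workaround still need to be merged into the single profile the stage2 build reads, roughly (paths follow the ones used above):

$ llvm-profdata merge -output=profiles/merged.prof profiles/prof-*.profraw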

Another issue is that Julia, when built with LTO and PGO, drops the jl_crc32c symbol... Edit: I think there's some actual issue with the crc32c software fallback... but it was triggered by passing CC= on the command line, which results in a missing CC += -march=... flag :(. Will fix Julia's CC behavior in a separate PR then, because Julia overrides CC if passed as an environment variable, which is a pain.

@gbaraldi
Member

gbaraldi commented Feb 6, 2023

bump?

@lseman

lseman commented Dec 20, 2023

Any update on this?

@oscardssmith
Member

IMO we should merge this. Tagging triage to confirm.

@oscardssmith added the performance (Must go faster) and status:triage (This should be discussed on a triage call) labels on Dec 20, 2023
@LilithHafner
Member

Triage thinks this is a good idea, provided there is a roadmap to eventually turn it on by default. Feel free to merge when ready.

@LilithHafner removed the status:triage (This should be discussed on a triage call) label on Dec 21, 2023
@gitboy16
Contributor

gitboy16 commented Feb 9, 2024

I was just wondering if this PR would be added before the 1.11 feature freeze? Thank you.

@oscardssmith oscardssmith merged commit 36b7d3b into JuliaLang:master Feb 9, 2024
8 checks passed
@oscardssmith
Member

yes.

@Krastanov

Is it on the roadmap for this to become the default way julia is built?

@oscardssmith
Member

if someone makes the PR to do that :)


AFTER_STAGE1_MESSAGE:='Run `make clean-profiles` to start with a clean slate. $\
Then run Julia to collect realistic profile data, for example: `$(STAGE1_BUILD)/julia -O3 -e $\
'\''using Pkg; Pkg.add("LoopVectorization"); Pkg.test("LoopVectorization")'\''`. This $\

Just a reminder to whoever pursues this in the future: LoopVectorization is being sunset, and a different default should probably be picked for profiling. If someone among the core devs has good suggestions, I would be happy to set up some of the future PRs related to fixing this and making PGO/LTO the default.

@vchuravy
Sponsor Member

vchuravy commented Feb 9, 2024

We should at least have a CI pipeline that checks that this doesn't break.

@Krastanov

I seem to be able to compile julia with the default makefile, but trying to run

cd contrib/pgo-lto
make -j$(nproc) stage1

leads to some hash mismatch for a downloaded file. Not sure where to look for that:

  ERROR: sha512 checksum failure on llvm-julia-16.0.6-2.tar.gz, should be:
      6f2513adea1b939229c9be171e7ce41e488b3cfaa2e615912c4bc1ddaf0ab2e7
      5df213a5d5db80105d6473a8017b0656016bbbb085ef00a38073519668885884
  But `sha512sum /home/stefan/Documents/LocalScratchSpace/julia-pgo/deps/srccache/llvm-julia-16.0.6-2.tar.gz | awk '{ print $1; }'` results in:
      5f2f88b4673b13780fa819c78cb27fc5dab77c2976768ae4f7863b904c911e39
      fc18ee85d212e512a7c60081c74efd1fa2e7142b78002982533b7326ff808f24

@lseman

lseman commented Feb 12, 2024

For version 10.0, we were able to derive a PKGBUILD (an Arch Linux package description) that makes full use of PGO, following the steps of the first post in this thread.

https://github.com/CachyOS/CachyOS-PKGBUILDS/blob/master/julia/PKGBUILD

@haampie haampie deleted the ttfx-improvements branch February 12, 2024 21:43
@haampie
Contributor Author

haampie commented Feb 12, 2024

Nice, I didn't realize this got merged. It did not bitrot?
