Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Float16 test failures when building with GCC 12 and USE_SYSTEM_CSL=0 USE_BINARYBUILDER_CSL=0 #44829

Closed
nalimilan opened this issue Apr 1, 2022 · 17 comments · Fixed by #45249
Closed
Labels
domain:building Build system, or building Julia or its dependencies
Milestone

Comments

@nalimilan
Copy link
Member

When building Julia with make USE_SYSTEM_CSL=0 USE_BINARYBUILDER_CSL=0 and GCC 12 fails when running Float16 tests. This prevents shipping the Julia RPM package in Fedora 36 (https://bugzilla.redhat.com/show_bug.cgi?id=2044284).

Cc: @vchuravy who had started investigating this

The output is:

julia> Float16(2.) > Float16(1.)
false

julia> Base.runtests("float16")
Running parallel tests with:
  nworkers() = 1
  nthreads() = 1
  Sys.CPU_THREADS = 1
  Sys.total_memory() = 7.439 GiB
  Sys.free_memory() = 6.513 GiB

Test  (Worker) | Time (s) | GC (s) | GC % | Alloc (MB) | RSS (MB)
float16    (1) |        started at 2022-04-01T16:47:18.204
float16    (1) |         failed at 2022-04-01T16:47:30.885
Test Failed at /home/fedora/nalimilan/julia/usr/share/julia/test/float16.jl:9
  Expression: f > g
   Evaluated: Float16(0.0) > Float16(0.0)
Test Failed at /home/fedora/nalimilan/julia/usr/share/julia/test/float16.jl:10
  Expression: g < f
   Evaluated: Float16(0.0) < Float16(0.0)
Test Failed at /home/fedora/nalimilan/julia/usr/share/julia/test/float16.jl:12
  Expression: all([g g] .< [f f])
Test Failed at /home/fedora/nalimilan/julia/usr/share/julia/test/float16.jl:14
  Expression: all([f f] .> [g g])
Test Failed at /home/fedora/nalimilan/julia/usr/share/julia/test/float16.jl:16
  Expression: isless(g, f)
Test Failed at /home/fedora/nalimilan/julia/usr/share/julia/test/float16.jl:22
  Expression: !(isequal(Float16(-0.0), Float16(0.0)))
   Evaluated: !(isequal(Float16(0.0), Float16(0.0)))
Test Failed at /home/fedora/nalimilan/julia/usr/share/julia/test/float16.jl:23
  Expression: !(isequal(Float16(0.0), Float16(-0.0)))
   Evaluated: !(isequal(Float16(0.0), Float16(0.0)))
Test Failed at /home/fedora/nalimilan/julia/usr/share/julia/test/float16.jl:42
  Expression: convert(Bool, Float16(1.0)) == true
   Evaluated: false == true
Test Failed at /home/fedora/nalimilan/julia/usr/share/julia/test/float16.jl:45
  Expression: convert(Int128, Float16(-2.0)) == Int128(-2)
   Evaluated: 0 == -2
Test Failed at /home/fedora/nalimilan/julia/usr/share/julia/test/float16.jl:46
  Expression: convert(UInt128, Float16(2.0)) == UInt128(2)
   Evaluated: 0x00000000000000000000000000000000 == 0x00000000000000000000000000000002
Test Failed at /home/fedora/nalimilan/julia/usr/share/julia/test/float16.jl:49
  Expression: convert(Int128, Float16(1.0)) === Int128(1.0)
   Evaluated: 0 === 1
Test Failed at /home/fedora/nalimilan/julia/usr/share/julia/test/float16.jl:50
  Expression: convert(Int128, Float16(-1.0)) === Int128(-1.0)
   Evaluated: 0 === -1
Test Failed at /home/fedora/nalimilan/julia/usr/share/julia/test/float16.jl:54
  Expression: convert(UInt128, Float16(1.0)) === UInt128(1.0)
   Evaluated: 0x00000000000000000000000000000000 === 0x00000000000000000000000000000001
Test Failed at /home/fedora/nalimilan/julia/usr/share/julia/test/float16.jl:58
  Expression: convert(Int128, Float16(-1.0)) == Int128(-1)
   Evaluated: 0 == -1
[...]
@nalimilan nalimilan added the domain:building Build system, or building Julia or its dependencies label Apr 1, 2022
@vchuravy vchuravy added this to the 1.8 milestone Apr 1, 2022
@vchuravy vchuravy added the kind:regression Regression in behavior compared to a previous version label Apr 1, 2022
@vchuravy
Copy link
Sponsor Member

vchuravy commented Apr 1, 2022

@nalimilan Could you try to bisect this? I never got anywhere in my investigation

@nalimilan
Copy link
Member Author

nalimilan commented Apr 1, 2022

Well AFAIK it's not a regression in Julia, it's just due to Fedora moving to GCC12. I'll confirm this in a moment [EDIT: I confirm that it also fails on 1.7.2, which builds fine on Fedora 35 with GCC 11].

@giordano
Copy link
Contributor

giordano commented Apr 1, 2022

Why is everybody using gcc 12 before the release? Also Debian...

@vchuravy vchuravy removed the kind:regression Regression in behavior compared to a previous version label Apr 2, 2022
@mtasaka
Copy link

mtasaka commented Apr 5, 2022

Well, compiling with gcc12 itself seems not needed here.

What I tried now is:

  • First of all, set up Fedora 35 buildroot by $ mock --verbose -r fedora-35-x86_64 --uniqueest julia --init
  • Install Fedora 35 julia into that buildroot: $ mock --verbose -r fedora-35-x86_64 --uniqueest julia --install julia
  • In that buildroot, try some julia basic command like julia> Float16(2.) > Float16(1.) or convert(Int128, Float16(-2.0)) == Int128(-2) - WORKS
  • Now in that buildroot, first just update glibc to Fedora 36 ones (i.e. download Fedora 36 glibc binary rpms on that buildroot, and update them in the buildroot), then try the previous julia basic command - still WORKS
  • Next, this time upgrade libgcc to Fedora 36 ones (from libgcc-11.2.1-9.fc35 to libgcc-12.0.1-0.12.fc36). To be clear, julia is using one in Fedora 35 binary rpm, julia itself is recompiled against gcc11 here, just updated libgcc only to 12. Now julia> Float16(2.) > Float16(1.) or convert(Int128, Float16(-2.0)) == Int128(-2) BOTH RETURNS FALSE.

So the problem does not seems to be gcc12-compiled julia binary, but it seems some functions in libgcc_s.so.1 is returning different result.

@nalimilan
Copy link
Member Author

Thanks for checking @mtasaka. I don't understand the details, but @vchuravy said (on Slack) that the problem was likely due to a change in the GCC ABI, which would be consistent with your observations that using libgcc_s.so 12 is enough to reproduce the bug.

@giordano
Copy link
Contributor

giordano commented Apr 6, 2022

the problem was likely due to a change in the GCC ABI

If true, that'd be rather worrying 😕 Do you happen to know whether GCC intends to make this kind of ABI breaking changes? Perhaps it's accidental and it should be reported/fixed

@vchuravy
Copy link
Sponsor Member

vchuravy commented Apr 6, 2022

Right now it's an hypothesis. The issue might be that we shadow a GCC function definition and I am wondering if that leads to a conflict...

@mtasaka
Copy link

mtasaka commented Apr 6, 2022

Long comment...

Now I compared gcc11 libgcc_s.so symbols and gcc12 libgcc_s.so symbol, and compared to gcc11 one, gcc12 libgcc_s.so adds some new symbols. So I tried to try

$ gdb --args julia -e 'println(Float16(2.) > Float32(1.))'

which returns True with gcc11 while False with gcc12, with adding breakpoints on all symbols newly added in gcc12 libgcc_s.so.
Then results are below.

gdb-gcc11.log.txt
gdb-gcc12.log.txt

Note that with gdb-gcc12.log.txt, __extendhfsf2 and __truncdfhf2 shows (2 locations), and execution of the above julia command hits __truncdfhf2 - with gcc11 it is inside julia, but with gcc12 ones inside libgcc12 is used.

and while julia prototype is uint16_t __truncdfhf2(double), gcc12 ones is _Float16 __truncdfhf2(double) and disassembling gcc12 __truncdfhf2 looks like some values are stored in xmm0 register then returns, so apparently int16 value is not returned.

So my guess it when julia sees ''Float16(2.)", julia (or llvm?) tries to interpret as "conversion from double value of 2 to some int16 representation" and generates the code to call __truncdfhf2, but with gcc12 gcc12 __truncdfhf2 is now called, and it is no longer working.

@mtasaka
Copy link

mtasaka commented Apr 6, 2022

A quick workaround is not to call libgcc internal __extendhfsf2 or __truncdfhf2 but force to call julia internal ones?

@mtasaka
Copy link

mtasaka commented Apr 7, 2022

So (although this may be simply the workaround and the "desired" solution may be different) actually forcely avoiding __extendhfsf2 __truncdfhf2 symbol collision by

grep -rl __truncdfhf2 . | xargs sed -i -e 's|__truncdfhf2|__truncdfhf2_internal|'
grep -rl __extendhfsf2 . | xargs sed -i -e 's|__extendhfsf2|__extendhfsf2_internal|'

inside both julia-1.7.2.tar.gz and llvm-julia-12.0.1-4.tar.gz makes test suite pass:

https://koji.fedoraproject.org/koji/taskinfo?taskID=85284148
https://koji.fedoraproject.org/koji/taskinfo?taskID=85284173

@vtjnash
Copy link
Sponsor Member

vtjnash commented Apr 12, 2022

clang expects this value to be in %ax when you call this function (_Float16 is not specified in the platform ABI, but it would be a major ABI-breaking change for GCC to put it anywhere else)
https://github.com/llvm/llvm-project/blob/788f94f731dc9ba514925ed0c5936c5e79f6abd8/llvm/test/CodeGen/X86/half.ll#L171-L172

; CHECK-NEXT:    callq __truncdfhf2@PLT
; CHECK-NEXT:    movw %ax, (%rbx)

@phoebewang
Copy link

Thanks @vtjnash for letting me know this.

In fact, I'm working on applying the same ABI to features pre avx512fp16. It's true it's ABI breaking, which is a big concern to me too.

It is expected we will use the same xmm registers rather than GPRs when passing/returning f16, which is compatible to GCC12. It is not a problem to Clang, since Clang doesn't allow passing/returning half type directly before avx512fp16. The concerns are from other front end. I didn't have such knowledge, and am glad to know it now.

Could you help me to evaluate the impact to Julia if we change the ABI? Is there anyway to workaround it? Anything I can do with? Thanks!

@vtjnash
Copy link
Sponsor Member

vtjnash commented Apr 13, 2022

The main concern for me was the conditional breakage, since we compile code simultaneously for many different targets. We can usually deal with flag-day type events by runtime detection, but we need it to be detectable in some way then.

There seems to be a related ABI problem in llvm / compiler-rt / clang here: https://reviews.llvm.org/D92241, since it is going to create different code based on the target cpu, but llvm CodeGen will have no idea what compiler flags were used to compile the support library, and this may thus result in mis-compilation of code that calls into the runtime support library.

@vtjnash
Copy link
Sponsor Member

vtjnash commented Apr 13, 2022

Looking around further at llvm issues, it looks like we are essentially running into the same ABI problems as were being triggered by attempts to land https://reviews.llvm.org/D114099 to enable this in clang also

@phoebewang
Copy link

I think we won't have conditional breakage if D114099 and related patches merged. But I have no idea what to do next, given it will expose the problem you described.

@vtjnash
Copy link
Sponsor Member

vtjnash commented Apr 13, 2022

Yes, seems like https://reviews.llvm.org/D107082 is a fix for this issue. We can have 2 copies of our __truncdfhf2 helper (for each ABI), and select at compile time based on the LLVM version we know is in use.

vtjnash added a commit that referenced this issue Apr 13, 2022
A partial fix to #44829, but perhaps we will still have ABI problems with the sysimg (which may not link directly enough on linux in many cases)
vtjnash added a commit to vtjnash/julia that referenced this issue May 6, 2022
vtjnash added a commit that referenced this issue May 9, 2022
Fixes #44829, until llvm fixes the support for these intrinsics itself
vtjnash added a commit that referenced this issue May 9, 2022
Fixes #44829, until llvm fixes the support for these intrinsics itself
vtjnash added a commit that referenced this issue May 9, 2022
Fixes #44829, until llvm fixes the support for these intrinsics itself
vtjnash added a commit that referenced this issue May 9, 2022
Fixes #44829, until llvm fixes the support for these intrinsics itself
vtjnash added a commit that referenced this issue May 10, 2022
Fixes #44829, until llvm fixes the support for these intrinsics itself
KristofferC pushed a commit that referenced this issue May 18, 2022
Fixes #44829, until llvm fixes the support for these intrinsics itself

Also need to handle vectors, since the vectorizer may have introduced them.

Also change our runtime emulation versions to f32 for consistency.
KristofferC pushed a commit that referenced this issue May 18, 2022
Fixes #44829, until llvm fixes the support for these intrinsics itself

Also need to handle vectors, since the vectorizer may have introduced them.

Also change our runtime emulation versions to f32 for consistency.

(cherry picked from commit f2c627e)
@vchuravy vchuravy reopened this Jun 26, 2022
@vchuravy
Copy link
Sponsor Member

Should be fixed by #45649

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain:building Build system, or building Julia or its dependencies
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants