`Float16` test failures when building with GCC 12 and `USE_SYSTEM_CSL=0 USE_BINARYBUILDER_CSL=0` #44829

nalimilan · 2022-04-01T16:53:20Z

When building Julia with make USE_SYSTEM_CSL=0 USE_BINARYBUILDER_CSL=0 and GCC 12 fails when running Float16 tests. This prevents shipping the Julia RPM package in Fedora 36 (https://bugzilla.redhat.com/show_bug.cgi?id=2044284).

Cc: @vchuravy who had started investigating this

The output is:

julia> Float16(2.) > Float16(1.)
false

julia> Base.runtests("float16")
Running parallel tests with:
  nworkers() = 1
  nthreads() = 1
  Sys.CPU_THREADS = 1
  Sys.total_memory() = 7.439 GiB
  Sys.free_memory() = 6.513 GiB

Test  (Worker) | Time (s) | GC (s) | GC % | Alloc (MB) | RSS (MB)
float16    (1) |        started at 2022-04-01T16:47:18.204
float16    (1) |         failed at 2022-04-01T16:47:30.885
Test Failed at /home/fedora/nalimilan/julia/usr/share/julia/test/float16.jl:9
  Expression: f > g
   Evaluated: Float16(0.0) > Float16(0.0)
Test Failed at /home/fedora/nalimilan/julia/usr/share/julia/test/float16.jl:10
  Expression: g < f
   Evaluated: Float16(0.0) < Float16(0.0)
Test Failed at /home/fedora/nalimilan/julia/usr/share/julia/test/float16.jl:12
  Expression: all([g g] .< [f f])
Test Failed at /home/fedora/nalimilan/julia/usr/share/julia/test/float16.jl:14
  Expression: all([f f] .> [g g])
Test Failed at /home/fedora/nalimilan/julia/usr/share/julia/test/float16.jl:16
  Expression: isless(g, f)
Test Failed at /home/fedora/nalimilan/julia/usr/share/julia/test/float16.jl:22
  Expression: !(isequal(Float16(-0.0), Float16(0.0)))
   Evaluated: !(isequal(Float16(0.0), Float16(0.0)))
Test Failed at /home/fedora/nalimilan/julia/usr/share/julia/test/float16.jl:23
  Expression: !(isequal(Float16(0.0), Float16(-0.0)))
   Evaluated: !(isequal(Float16(0.0), Float16(0.0)))
Test Failed at /home/fedora/nalimilan/julia/usr/share/julia/test/float16.jl:42
  Expression: convert(Bool, Float16(1.0)) == true
   Evaluated: false == true
Test Failed at /home/fedora/nalimilan/julia/usr/share/julia/test/float16.jl:45
  Expression: convert(Int128, Float16(-2.0)) == Int128(-2)
   Evaluated: 0 == -2
Test Failed at /home/fedora/nalimilan/julia/usr/share/julia/test/float16.jl:46
  Expression: convert(UInt128, Float16(2.0)) == UInt128(2)
   Evaluated: 0x00000000000000000000000000000000 == 0x00000000000000000000000000000002
Test Failed at /home/fedora/nalimilan/julia/usr/share/julia/test/float16.jl:49
  Expression: convert(Int128, Float16(1.0)) === Int128(1.0)
   Evaluated: 0 === 1
Test Failed at /home/fedora/nalimilan/julia/usr/share/julia/test/float16.jl:50
  Expression: convert(Int128, Float16(-1.0)) === Int128(-1.0)
   Evaluated: 0 === -1
Test Failed at /home/fedora/nalimilan/julia/usr/share/julia/test/float16.jl:54
  Expression: convert(UInt128, Float16(1.0)) === UInt128(1.0)
   Evaluated: 0x00000000000000000000000000000000 === 0x00000000000000000000000000000001
Test Failed at /home/fedora/nalimilan/julia/usr/share/julia/test/float16.jl:58
  Expression: convert(Int128, Float16(-1.0)) == Int128(-1)
   Evaluated: 0 == -1
[...]

The text was updated successfully, but these errors were encountered:

vchuravy · 2022-04-01T16:56:18Z

@nalimilan Could you try to bisect this? I never got anywhere in my investigation

nalimilan · 2022-04-01T16:57:48Z

Well AFAIK it's not a regression in Julia, it's just due to Fedora moving to GCC12. I'll confirm this in a moment [EDIT: I confirm that it also fails on 1.7.2, which builds fine on Fedora 35 with GCC 11].

giordano · 2022-04-01T17:34:32Z

Why is everybody using gcc 12 before the release? Also Debian...

mtasaka · 2022-04-05T15:01:14Z

Well, compiling with gcc12 itself seems not needed here.

What I tried now is:

First of all, set up Fedora 35 buildroot by $ mock --verbose -r fedora-35-x86_64 --uniqueest julia --init
Install Fedora 35 julia into that buildroot: $ mock --verbose -r fedora-35-x86_64 --uniqueest julia --install julia
In that buildroot, try some julia basic command like julia> Float16(2.) > Float16(1.) or convert(Int128, Float16(-2.0)) == Int128(-2) - WORKS
Now in that buildroot, first just update glibc to Fedora 36 ones (i.e. download Fedora 36 glibc binary rpms on that buildroot, and update them in the buildroot), then try the previous julia basic command - still WORKS
Next, this time upgrade libgcc to Fedora 36 ones (from libgcc-11.2.1-9.fc35 to libgcc-12.0.1-0.12.fc36). To be clear, julia is using one in Fedora 35 binary rpm, julia itself is recompiled against gcc11 here, just updated libgcc only to 12. Now julia> Float16(2.) > Float16(1.) or convert(Int128, Float16(-2.0)) == Int128(-2) BOTH RETURNS FALSE.

So the problem does not seems to be gcc12-compiled julia binary, but it seems some functions in libgcc_s.so.1 is returning different result.

nalimilan · 2022-04-06T07:25:21Z

Thanks for checking @mtasaka. I don't understand the details, but @vchuravy said (on Slack) that the problem was likely due to a change in the GCC ABI, which would be consistent with your observations that using libgcc_s.so 12 is enough to reproduce the bug.

giordano · 2022-04-06T08:54:05Z

the problem was likely due to a change in the GCC ABI

If true, that'd be rather worrying 😕 Do you happen to know whether GCC intends to make this kind of ABI breaking changes? Perhaps it's accidental and it should be reported/fixed

vchuravy · 2022-04-06T09:10:28Z

Right now it's an hypothesis. The issue might be that we shadow a GCC function definition and I am wondering if that leads to a conflict...

mtasaka · 2022-04-06T14:18:28Z

Long comment...

Now I compared gcc11 libgcc_s.so symbols and gcc12 libgcc_s.so symbol, and compared to gcc11 one, gcc12 libgcc_s.so adds some new symbols. So I tried to try

$ gdb --args julia -e 'println(Float16(2.) > Float32(1.))'

which returns True with gcc11 while False with gcc12, with adding breakpoints on all symbols newly added in gcc12 libgcc_s.so.
Then results are below.

gdb-gcc11.log.txt
gdb-gcc12.log.txt

Note that with gdb-gcc12.log.txt, __extendhfsf2 and __truncdfhf2 shows (2 locations), and execution of the above julia command hits __truncdfhf2 - with gcc11 it is inside julia, but with gcc12 ones inside libgcc12 is used.

and while julia prototype is uint16_t __truncdfhf2(double), gcc12 ones is _Float16 __truncdfhf2(double) and disassembling gcc12 __truncdfhf2 looks like some values are stored in xmm0 register then returns, so apparently int16 value is not returned.

So my guess it when julia sees ''Float16(2.)", julia (or llvm?) tries to interpret as "conversion from double value of 2 to some int16 representation" and generates the code to call __truncdfhf2, but with gcc12 gcc12 __truncdfhf2 is now called, and it is no longer working.

mtasaka · 2022-04-06T14:20:00Z

A quick workaround is not to call libgcc internal __extendhfsf2 or __truncdfhf2 but force to call julia internal ones?

mtasaka · 2022-04-07T08:17:30Z

So (although this may be simply the workaround and the "desired" solution may be different) actually forcely avoiding __extendhfsf2 __truncdfhf2 symbol collision by

grep -rl __truncdfhf2 . | xargs sed -i -e 's|__truncdfhf2|__truncdfhf2_internal|'
grep -rl __extendhfsf2 . | xargs sed -i -e 's|__extendhfsf2|__extendhfsf2_internal|'

inside both julia-1.7.2.tar.gz and llvm-julia-12.0.1-4.tar.gz makes test suite pass:

https://koji.fedoraproject.org/koji/taskinfo?taskID=85284148
https://koji.fedoraproject.org/koji/taskinfo?taskID=85284173

vtjnash · 2022-04-12T21:35:39Z

clang expects this value to be in %ax when you call this function (_Float16 is not specified in the platform ABI, but it would be a major ABI-breaking change for GCC to put it anywhere else)
https://github.com/llvm/llvm-project/blob/788f94f731dc9ba514925ed0c5936c5e79f6abd8/llvm/test/CodeGen/X86/half.ll#L171-L172

; CHECK-NEXT:    callq __truncdfhf2@PLT
; CHECK-NEXT:    movw %ax, (%rbx)

phoebewang · 2022-04-13T07:35:55Z

Thanks @vtjnash for letting me know this.

In fact, I'm working on applying the same ABI to features pre avx512fp16. It's true it's ABI breaking, which is a big concern to me too.

It is expected we will use the same xmm registers rather than GPRs when passing/returning f16, which is compatible to GCC12. It is not a problem to Clang, since Clang doesn't allow passing/returning half type directly before avx512fp16. The concerns are from other front end. I didn't have such knowledge, and am glad to know it now.

Could you help me to evaluate the impact to Julia if we change the ABI? Is there anyway to workaround it? Anything I can do with? Thanks!

vtjnash · 2022-04-13T14:55:55Z

The main concern for me was the conditional breakage, since we compile code simultaneously for many different targets. We can usually deal with flag-day type events by runtime detection, but we need it to be detectable in some way then.

There seems to be a related ABI problem in llvm / compiler-rt / clang here: https://reviews.llvm.org/D92241, since it is going to create different code based on the target cpu, but llvm CodeGen will have no idea what compiler flags were used to compile the support library, and this may thus result in mis-compilation of code that calls into the runtime support library.

vtjnash · 2022-04-13T14:58:48Z

Looking around further at llvm issues, it looks like we are essentially running into the same ABI problems as were being triggered by attempts to land https://reviews.llvm.org/D114099 to enable this in clang also

phoebewang · 2022-04-13T15:11:13Z

I think we won't have conditional breakage if D114099 and related patches merged. But I have no idea what to do next, given it will expose the problem you described.

vtjnash · 2022-04-13T21:42:32Z

Yes, seems like https://reviews.llvm.org/D107082 is a fix for this issue. We can have 2 copies of our __truncdfhf2 helper (for each ABI), and select at compile time based on the LLVM version we know is in use.

A partial fix to #44829, but perhaps we will still have ABI problems with the sysimg (which may not link directly enough on linux in many cases)

A partial fix to JuliaLang#44829

Fixes #44829, until llvm fixes the support for these intrinsics itself

Fixes #44829, until llvm fixes the support for these intrinsics itself Also need to handle vectors, since the vectorizer may have introduced them. Also change our runtime emulation versions to f32 for consistency.

Fixes #44829, until llvm fixes the support for these intrinsics itself Also need to handle vectors, since the vectorizer may have introduced them. Also change our runtime emulation versions to f32 for consistency. (cherry picked from commit f2c627e)

vchuravy · 2022-06-26T16:05:15Z

Should be fixed by #45649

nalimilan added the domain:building Build system, or building Julia or its dependencies label Apr 1, 2022

vchuravy added this to the 1.8 milestone Apr 1, 2022

vchuravy added the kind:regression Regression in behavior compared to a previous version label Apr 1, 2022

vchuravy removed the kind:regression Regression in behavior compared to a previous version label Apr 2, 2022

vtjnash added a commit that referenced this issue Apr 13, 2022

JIT: explicitly handle __extendhfsf2 and friends

f37fcfd

A partial fix to #44829, but perhaps we will still have ABI problems with the sysimg (which may not link directly enough on linux in many cases)

vtjnash mentioned this issue Apr 13, 2022

JIT: explicitly handle __extendhfsf2 and friends #44975

Closed

vtjnash added a commit to vtjnash/julia that referenced this issue May 6, 2022

codegen: explicitly handle __extendhfsf2 and friends

10064bf

A partial fix to JuliaLang#44829

vtjnash added a commit that referenced this issue May 9, 2022

codegen: explicitly handle Float16 intrinsics

e558d30

Fixes #44829, until llvm fixes the support for these intrinsics itself

vtjnash mentioned this issue May 9, 2022

codegen: explicitly handle Float16 intrinsics #45249

Merged

vtjnash added a commit that referenced this issue May 9, 2022

codegen: explicitly handle Float16 intrinsics

255e8b4

Fixes #44829, until llvm fixes the support for these intrinsics itself

vtjnash added a commit that referenced this issue May 9, 2022

codegen: explicitly handle Float16 intrinsics

f8b2c46

Fixes #44829, until llvm fixes the support for these intrinsics itself

vtjnash added a commit that referenced this issue May 9, 2022

codegen: explicitly handle Float16 intrinsics

73777a7

Fixes #44829, until llvm fixes the support for these intrinsics itself

vtjnash added a commit that referenced this issue May 10, 2022

codegen: explicitly handle Float16 intrinsics

d2a7deb

Fixes #44829, until llvm fixes the support for these intrinsics itself

KristofferC closed this as completed in #45249 May 18, 2022

vtjnash mentioned this issue May 24, 2022

Build: failing Float16 intrinsics on 1.6.6 or 1.7.2 on ubuntu 22.04 #45433

Closed

zsoerenm mentioned this issue May 31, 2022

Conversion from Float64 to Float16 is incorrect #45528

Closed

giordano mentioned this issue Jun 4, 2022

[CompilerSupportLibraries_jll] Upgrade to libraries from GCC 12 #45582

Merged

vchuravy reopened this Jun 26, 2022

JeffBezanson closed this as completed Jul 25, 2022

fxcoudert mentioned this issue Aug 17, 2022

julia 1.8.0 Homebrew/homebrew-core#100404

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`Float16` test failures when building with GCC 12 and `USE_SYSTEM_CSL=0 USE_BINARYBUILDER_CSL=0` #44829

`Float16` test failures when building with GCC 12 and `USE_SYSTEM_CSL=0 USE_BINARYBUILDER_CSL=0` #44829

nalimilan commented Apr 1, 2022

vchuravy commented Apr 1, 2022

nalimilan commented Apr 1, 2022 •

edited

giordano commented Apr 1, 2022

mtasaka commented Apr 5, 2022

nalimilan commented Apr 6, 2022

giordano commented Apr 6, 2022

vchuravy commented Apr 6, 2022

mtasaka commented Apr 6, 2022

mtasaka commented Apr 6, 2022

mtasaka commented Apr 7, 2022

vtjnash commented Apr 12, 2022

phoebewang commented Apr 13, 2022

vtjnash commented Apr 13, 2022

vtjnash commented Apr 13, 2022

phoebewang commented Apr 13, 2022

vtjnash commented Apr 13, 2022

vchuravy commented Jun 26, 2022

Float16 test failures when building with GCC 12 and USE_SYSTEM_CSL=0 USE_BINARYBUILDER_CSL=0 #44829

Float16 test failures when building with GCC 12 and USE_SYSTEM_CSL=0 USE_BINARYBUILDER_CSL=0 #44829

Comments

nalimilan commented Apr 1, 2022

vchuravy commented Apr 1, 2022

nalimilan commented Apr 1, 2022 • edited

giordano commented Apr 1, 2022

mtasaka commented Apr 5, 2022

nalimilan commented Apr 6, 2022

giordano commented Apr 6, 2022

vchuravy commented Apr 6, 2022

mtasaka commented Apr 6, 2022

mtasaka commented Apr 6, 2022

mtasaka commented Apr 7, 2022

vtjnash commented Apr 12, 2022

phoebewang commented Apr 13, 2022

vtjnash commented Apr 13, 2022

vtjnash commented Apr 13, 2022

phoebewang commented Apr 13, 2022

vtjnash commented Apr 13, 2022

vchuravy commented Jun 26, 2022

`Float16` test failures when building with GCC 12 and `USE_SYSTEM_CSL=0 USE_BINARYBUILDER_CSL=0` #44829

`Float16` test failures when building with GCC 12 and `USE_SYSTEM_CSL=0 USE_BINARYBUILDER_CSL=0` #44829

nalimilan commented Apr 1, 2022 •

edited