Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[OpenBLAS] Build the BFloat16 kernels in OpenBLAS #7202

Merged
merged 3 commits into from
Jan 25, 2024

Conversation

imciner2
Copy link
Member

Continuation of #7168. This now includes a patch that fixes detection and building on x86_64 architectures, which is based on the upstream PR OpenMathLib/OpenBLAS#4192.

The aarch64 build failures need a bit more investigation, since it looks like there might be naming differences between the apple toolchains and gnu toolchains used for linux that are causing the kernels to not compile properly.

@ViralBShah
Copy link
Contributor

ViralBShah commented Aug 14, 2023

BTW, I was just using GCC 11.1 as an experiment to see if it would get me over the hump of getting things to build. That is probably what is preventing the building of the libgfortran3 and libgfortran4 variants.

@imciner2
Copy link
Member Author

imciner2 commented Aug 14, 2023

For the bfloat16 functions, GCC 10+ is needed, but GCC 11 added the Sapphire Rapids support, so compiling with GCC 11 will give both. The downside is binarybuilder won't be able to use GCC 8+ with anything other than gfortran5, so the x86_64 targets with libgfortran3 or libgfortran4 won't be able to have the bfloat16 components, probably (they need older GCC versions).

The powerpc build failure is because of GCC 11 (the test was assuming exactly GCC 10, not anything greater), but it is a simple patch (sent upstream in OpenMathLib/OpenBLAS#4193).

The remaining issues are actually now clang-based issues. One with AppleClang and one with the flang compiler for the memsan version.

@imciner2
Copy link
Member Author

(actually cooperlake is new in GCC 10, so we need at least GCC 10 to build the bfloat16 interface parts).

@ViralBShah
Copy link
Contributor

Would it make sense to keep building the old libgfortran{3,4} builds with the older compilers without bfloat16 support, but add it to the libgfortran5 version?

Or perhaps we should just build a separate library (say OpenBLASWithExt) with the extensions and fewer platforms. We also have OpenBLAS with high core counts, which is not widely used at all, and could also include the extensions and other experimental features.

@imciner2
Copy link
Member Author

Rebased to pickup the new 0.3.24, which allows the three current bfloat-specific patches to be dropped. Lets see if any of the other platforms start working now (I expect the gfortran3 and 4 to still be broken though).

@imciner2
Copy link
Member Author

Building with gfortran3 and gfortran4 will require some build system tweaks, so I have asked upstream about it here: OpenMathLib/OpenBLAS#4422. So hopefully they can work out the best way to get the generic kernels for the newer targets on these older compilers.

@imciner2
Copy link
Member Author

sigh the aarch62 Apple builds are still failing, but at least the others are passing.

@giordano I believe you had some SVE-related failures previously, do these look familiar? I think these are living in the bfloat16 kernels, so maybe the upstream fix for your SVE problem was incomplete last time?

@giordano
Copy link
Member

My problem was only with the use of -march=native (OpenMathLib/OpenBLAS#4433), which was fixed by OpenMathLib/OpenBLAS#4434.

The error you're getting now is

[17:25:10] In file included from ../kernel/arm64/sbgemm_kernel_8x4_neoversen2.c:29:
[17:25:10] /opt/x86_64-linux-musl/lib/clang/13.0.1/include/arm_sve.h:15:2: error: "SVE support not enabled"
[17:25:10] #error "SVE support not enabled"
[17:25:10]  ^

Relevant lines of /opt/x86_64-linux-musl/lib/clang/13.0.1/include/arm_sve.h are

#if !defined(__ARM_FEATURE_SVE)
#error "SVE support not enabled"
#else

so the problem is that __ARM_FEATURE_SVE is undefined, which looking at the compiler invocations like

[17:25:10] cc -O2 -DMAX_STACK_ALLOC=2048 -Wall -DF_INTERFACE_GFORT -fPIC -DDYNAMIC_ARCH -DSMP_SERVER -DNO_WARMUP -DMAX_CPU_NUMBER=512 -DMAX_PARALLEL_NUMBER=1 -DBUILD_BFLOAT16 -DBUILD_SINGLE=1 -DBUILD_DOUBLE=1 -DBUILD_COMPLEX=1 -DBUILD_COMPLEX16=1 -DVERSION=\"0.3.26\" -march=armv8.2-a -mtune=cortex-a72 -UASMNAME -UASMFNAME -UNAME -UCNAME -UCHAR_NAME -UCHAR_CNAME -DASMNAME=_dtrmm_kernel_LN_NEOVERSEN2 -DASMFNAME=_dtrmm_kernel_LN_NEOVERSEN2_ -DNAME=dtrmm_kernel_LN_NEOVERSEN2_ -DCNAME=dtrmm_kernel_LN_NEOVERSEN2 -DCHAR_NAME=\"dtrmm_kernel_LN_NEOVERSEN2_\" -DCHAR_CNAME=\"dtrmm_kernel_LN_NEOVERSEN2\" -DNO_AFFINITY -DTS=_NEOVERSEN2 -I.. -DBUILD_KERNEL -DTABLE_NAME=gotoblas_NEOVERSEN2 -DDOUBLE  -UCOMPLEX -c -DTRMMKERNEL -DDOUBLE -UCOMPLEX -DLEFT -UTRANSA ../kernel/arm64/dtrmm_kernel_8x4.S -o dtrmm_kernel_LN_NEOVERSEN2.o

looks plausible, as -march=armv8.2-a alone doesn't enable SVE automatically

$ julia --compile=min -e 'using BinaryBuilderBase; BinaryBuilderBase.runshell(Platform("aarch64", "macos"); preferred_llvm_version=v"13", lock_microarchitecture=false)'
sandbox:${WORKSPACE} # cc -march=armv8.2-a -dM -E - < /dev/null | grep __ARM_FEATURE_SVE
sandbox:${WORKSPACE} # cc -march=armv8.2-a+sve -dM -E - < /dev/null | grep __ARM_FEATURE_SVE
#define __ARM_FEATURE_SVE 1

According to https://gcc.gnu.org/onlinedocs/gcc/AArch64-Options.html SVE is required only with armv9-a

@giordano
Copy link
Member

I don't know if these are the lines being hit, but https://github.com/OpenMathLib/OpenBLAS/blob/d6a5174e9c50b9f68db96d7d7818f92cdfb4e7ec/Makefile.arm64#L138-L142 looks suspicious: when targeting macOS -march=armv8.2-a is used, otherwise -march=armv8.5-a+sve+sve2+bf16, which is capable of generating code for SVE and bf16. I guess there's some logic to review around there.

@imciner2
Copy link
Member Author

I've sent an issue report upstream about the SVE kernels and the NeoverseN2 target: OpenMathLib/OpenBLAS#4449. Looking at the issue report that prompted the fallback to the armv8.2-a option, OpenMathLib/OpenBLAS#3739, I guess it would also be fixed by bumping the GCC version to be 11.4 or newer, but there probably is definitely an issue to fix upstream about including the kernels when they shouldn't be.

As for the new Windows failure... that is an assembler problem that I haven't dived into, but it looks like it isn't happy with some part of the bfloat16 kernel assembly.

@imciner2
Copy link
Member Author

And I've sent the Windows assembler errors upstream now OpenMathLib/OpenBLAS#4450.

@imciner2
Copy link
Member Author

🥳 all platforms are building now 🎉.

Right now, I have it only turned on for the OpenBLAS build script, not the OpenBLAS32 build script. Does it make sense to enable it in the OpenBLAS32 build script as well?

@giordano
Copy link
Member

🥳 all platforms are building now 🎉.

Thanks for the heroic effort of tracking down all the issues!

Right now, I have it only turned on for the OpenBLAS build script, not the OpenBLAS32 build script. Does it make sense to enable it in the OpenBLAS32 build script as well?

Not sure, maybe yes for consistency?

@giordano
Copy link
Member

While we are here, I verified patch 10-neoverse-generic-kernels.patch can now be safely dropped. With a local build of OpenBLAS v0.3.13 (make DYNAMIC_ARCH=1 LIBPREFIX=libopenblas64_ INTERFACE64=1 SYMBOLSUFFIX=64_ -j36) on Neoverse V2:

$ OPENBLAS_CORETYPE=NEOVERSEN1 OPENBLAS_VERBOSE=2 julia -q
Core: neoversen1
julia> using LinearAlgebra

julia> BLAS.lbt_forward("./libopenblas64_.so"; clear=true)
Core: neoversen1
4856

julia> a = zeros(Float64, 100);

julia> a[1] = -Inf;

julia> BLAS.nrm2(a)
NaN

Instead, with a local build of commit OpenMathLib/OpenBLAS@d6a5174 (latest on branch develop as of now) with the same configuration I get

$ OPENBLAS_CORETYPE=NEOVERSEN1 OPENBLAS_VERBOSE=2 julia -q
Core: neoversen1
julia> using LinearAlgebra

julia> BLAS.lbt_forward("./libopenblas64_.so"; clear=true)
Core: neoversen1
4860

julia> a = zeros(Float64, 100);

julia> a[1] = -Inf;

julia> BLAS.nrm2(a)
Inf

So it appears the bug reported at OpenMathLib/OpenBLAS#2998 is indeed fixed upstream, and we don't need to carry the patch around.

@imciner2
Copy link
Member Author

I was actually going to do a follow-on PR to backport the patch mentioned in #7661 to our builds so that we can maintain better history of these changes, so we can drop that patch at that time.

@giordano
Copy link
Member

giordano commented Jan 24, 2024

Sounds good.

In the meantime, I verified the neoversen1 kernel for nrm2 is way faster than the generic one:

$ OPENBLAS_CORETYPE=NEOVERSEN1 OPENBLAS_VERBOSE=2 julia -q
Core: neoversen1
julia> using BenchmarkTools, LinearAlgebra

julia> a = zeros(Float64, 100);

julia> a[1] = -Inf;

julia> @btime norm($a);
  40.557 ns (0 allocations: 0 bytes)

julia> BLAS.lbt_forward("./libopenblas64_.so"; clear=true)
Core: neoversen1
4860

julia> @btime norm($a);
  6.912 ns (0 allocations: 0 bytes)

Apple Silicon (which now uses neoversen1 kernels by default) users who compute lots of norms are going to love this improvement.

Edit: new benchmarks, this time on an M1 (instead of neoversev2):

% OPENBLAS_VERBOSE=2 julia -q
Core: neoversen1
julia> using BenchmarkTools, LinearAlgebra

julia> a = zeros(Float64, 100); a[1] = -Inf;

julia> @btime norm($(a))
  43.267 ns (0 allocations: 0 bytes)
Inf

julia> BLAS.lbt_forward(realpath("./libopenblas64_.dylib"); clear=true)
Core: neoversen1
4874

julia> @btime norm($(a))
  8.008 ns (0 allocations: 0 bytes)
Inf

./libopenblas64_.dylib is a build with proper neoversen1 kernel for nrm2 from this PR, this confirms the ~5x speedup observed on neoversev2 for this core type.

@DrTimothyAldenDavis
Copy link

bfloat16 on the CPU? oooo nice. I'd love to add it as a user-defined type to GraphBLAS. I just haven't figured out how to get those operations to work on the CPU.

GraphBLAS doesn't yet call any of the BLAS, no dgemms or whatever. It could do so for say GrB_mxm when all the matrices are dense and when the semiring is (plus,times) for double, single, double complex or single complex. I could do that for bfloat16 too, if it's available.

@ViralBShah
Copy link
Contributor

@imciner2 Let's also incorporate the patch for GEMV thresholds that will speed up UMFPACK into this PR.

@imciner2
Copy link
Member Author

@imciner2 Let's also incorporate the patch for GEMV thresholds that will speed up UMFPACK into this PR.

I guess we could, but my plan was to merge this one and then do the patch backporting/removal for both 0.3.23 and 0.3.26 in another PR to get better git history. Since we squash merge PRs, that would be a lot of patch changes going in/out at once, so for future reference I think it is better to keep them separate.

@ViralBShah
Copy link
Contributor

ViralBShah commented Jan 24, 2024

Right - but the problem is that every merge publishes hundreds of megs of binaries for download, which would be nice to avoid. If we make a clean PR here, we can choose not to squash the merge so that we can retain the history.

* Add upstream patch to fix compilation with non-AVX512bf processors.
  Taken from upstream, can be removed in 0.3.27.
* Add patch to compile with SVE on aarch64 NeoverseN2, used by Apple
  systems. No upstream fix yet.
@giordano
Copy link
Member

If we make a clean PR here, we can choose not to squash the merge so that we can retain the history

The way CI works here requires squash-merging, as we do the diff with the previous commit to determine what to rebuild. But I'm sympathetic to the idea of avoiding intermediate builds, that won't likely have much use, if possible

* Remove fallback to generic aarch kernels on neoverse. This was fixed
  before 0.3.20, so the optimized kernels should be fine to use now.
* Backport threshold increase for gemv multithreading to improve
  performance. This is included in 0.3.27.
@imciner2
Copy link
Member Author

Ok, I have backported the patch and also removed the neoverse fallback. All that's left is the threading number update.

@ViralBShah
Copy link
Contributor

ViralBShah commented Jan 25, 2024

sanitize build is still failing. Should we turn it off?

@giordano
Copy link
Member

sanitize build is still failing. Should we turn it off?

I simply used preferred_llvm_version in the wrong recipe, it was OpenBLAS32@0.3.23, not OpenBLAS@0.3.23 🤦

@ViralBShah ViralBShah merged commit 65d6af9 into JuliaPackaging:master Jan 25, 2024
47 checks passed
@ViralBShah
Copy link
Contributor

ViralBShah commented Jan 25, 2024

@KristofferC @oscardssmith Note that we should use the new OpenBLAS binaries from here for 1.10.1.

giordano added a commit to JuliaLang/julia that referenced this pull request Jan 26, 2024
This also

* drops a patch (`deps/patches/neoverse-generic-kernels.patch`) not
needed anymore for an [old
bug](OpenMathLib/OpenBLAS#2998) fixed upstream
in OpenBLAS. This results in ~5x speedup in the computation of
`BLAS.nrm2` (and hence `LinearAlgebra.norm` for vectors longer than
`LinearAlgebra.NRM2_CUTOFF` (== 32) elements) when the neoversen1
kernels are used, e.g. by default on all Apple Silicon CPUs
* adds a regression test for the above bug
* updates other patches when building openblas from source

Corresponding PR in Yggdrasil:
JuliaPackaging/Yggdrasil#7202.
grasph pushed a commit to Moelf/Yggdrasil that referenced this pull request Jul 1, 2024
* [OpenBLAS] Build the BFloat16 kernels

* Add upstream patch to fix compilation with non-AVX512bf processors.
  Taken from upstream, can be removed in 0.3.27.
* Add patch to compile with SVE on aarch64 NeoverseN2, used by Apple
  systems. No upstream fix yet.

* [OpenBLAS] Update patches to improve performance

* Remove fallback to generic aarch kernels on neoverse. This was fixed
  before 0.3.20, so the optimized kernels should be fine to use now.
* Backport threshold increase for gemv multithreading to improve
  performance. This is included in 0.3.27.

* [OpenBLAS@0.3.23] Compile with llvm 13 for msan build

---------

Co-authored-by: Viral B. Shah <ViralBShah@users.noreply.github.com>
Co-authored-by: Mosè Giordano <mose@gnu.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants