__muloti4 generated after "buffer: Fix grow_buffers() for block size > PAGE_SIZE" #1958

nathanchance · 2023-11-13T14:54:25Z

After commit fa4992db4fa5 ("buffer: fix grow_buffers() for block size > PAGE_SIZE") in -next (hash may not be stable), certain builds fail due to the generation of __muloti4:

$ make -skj"$(nproc)" ARCH=arm64 LLVM=1 defconfig all
...
ld.lld: error: undefined symbol: __muloti4
>>> referenced by buffer.c:0 (fs/buffer.c:0)
>>>               fs/buffer.o:(bdev_getblk) in archive vmlinux.a

This appears to be due to the types passed to __builtin_mul_overflow, as shown by this simple reproducer: https://godbolt.org/z/csfGc6z6c

__builtin_mul_overflow is used frequently across the tree, but apparently it has never been called with an unsigned long long and an unsigned int as the two factors? We could add a cast on size in this one __builtin_mul_overflow call, but that will not scale as more places use this builtin.
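For readers without access to the godbolt link, here is a minimal sketch of the type combination described above. The function name is hypothetical and this is not the kernel code itself; the signed 64-bit result type is an assumption taken from the later observation in this thread that pos is declared as loff_t (long long).

```c
#include <stdbool.h>

/*
 * Hypothetical stand-in for the overflow check in grow_buffers():
 * an unsigned 64-bit factor, an unsigned 32-bit factor, and a signed
 * 64-bit result. Returns true if the product does not fit in *pos.
 */
bool block_pos_overflows(unsigned long long block, unsigned int size,
                         long long *pos)
{
    return __builtin_mul_overflow(block, size, pos);
}
```

In a hosted userspace build this typically links without trouble; the failure reported here shows up when the target lowers the resulting operation to a __muloti4 libcall and nothing in the link provides that symbol.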

nathanchance · 2023-11-15T14:45:13Z

Looking at the optimized IR, I think I can see why this ends up as __muloti4, which is the libcall for 128-bit multiplication with overflow checking (i65 gets widened to i128 during legalization?):

for.cond.preheader.i:                             ; preds = %bdev_logical_block_size.exit.i
  %7 = zext i64 %block to i65
  %8 = zext i32 %size to i65
  %9 = call { i65, i1 } @llvm.smul.with.overflow.i65(i65 %7, i65 %8)

HANDLE_LIBCALL(MULO_I128, "__muloti4")

https://godbolt.org/z/hMe13jx5x

but why in the world is it trying to zero extend these values to i65 in the first place? This appears to come from clang in the front end, as I can see

  %0 = load i64, ptr %block.addr, align 8, !dbg !15500
  %1 = load i32, ptr %size.addr, align 4, !dbg !15500
  %2 = zext i64 %0 to i65, !dbg !15500
  %3 = zext i32 %1 to i65, !dbg !15500
  %4 = call { i65, i1 } @llvm.smul.with.overflow.i65(i65 %2, i65 %3), !dbg !15500

in the IR with -Xclang -disable-llvm-optzns, which should be unoptimized IIUC.
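One plausible way to arrive at 65 bits, sketched below under stated assumptions: if any of the three types involved (both factors and the result) is signed, the common type must be signed, and every unsigned type then needs one extra bit so that its full range still fits. The struct and function names are hypothetical and this is only a model of that reasoning, not a copy of clang's implementation.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of an integer type: bit width and signedness. */
struct int_type {
    unsigned width;
    bool is_signed;
};

/*
 * Smallest integer type that can represent every value of every type in
 * `types`. If any type is signed, the result is signed and each unsigned
 * type needs one additional bit.
 */
struct int_type encompassing_type(const struct int_type *types, size_t n)
{
    struct int_type out = { 0, false };
    size_t i;

    for (i = 0; i < n; i++)
        if (types[i].is_signed)
            out.is_signed = true;

    for (i = 0; i < n; i++) {
        unsigned need = types[i].width;

        if (out.is_signed && !types[i].is_signed)
            need += 1;
        if (need > out.width)
            out.width = need;
    }
    return out;
}
```

Under this model, feeding in an unsigned 64-bit factor, an unsigned 32-bit factor, and a signed 64-bit result gives a signed 65-bit type, which matches the i65 in the IR above.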

nathanchance · 2023-11-16T19:06:15Z

I wonder if this is llvm/llvm-project#38013?

heiher · 2023-11-23T02:26:48Z

This can also be reproduced on LoongArch.

LLVM HEAD: llvm/llvm-project@3311112

This library function only exists in compiler-rt not libgcc. So this would fail to link unless we were linking with compiler-rt. Fixes ClangBuiltLinux/linux#1958

nathanchance · 2023-11-23T15:19:32Z

Thanks for the change, but that will only fix the issue for LoongArch. This is seen on all architectures because the problem appears to occur during clang's LLVM IR code generation phase. I think the most likely fix is making clang not generate IR that causes this multiplication to be widened from i65 to i128 in the first place (otherwise code generation will continue to suffer as well), but that does not seem like an easy fix based on the LLVM issue that I mentioned above.

nathanchance · 2023-11-28T23:56:31Z

I sent https://lore.kernel.org/20231128-avoid-muloti4-grow_buffers-v1-1-bc3d0f0ec483@kernel.org/ to work around this in the kernel.

When building with clang after commit 6976079 ("buffer: fix grow_buffers() for block size > PAGE_SIZE"), there is an error at link time due to the generation of a 128-bit multiplication libcall:

  ld.lld: error: undefined symbol: __muloti4
  >>> referenced by buffer.c:0 (fs/buffer.c:0)
  >>>               fs/buffer.o:(bdev_getblk) in archive vmlinux.a

Due to the width mismatch between the factors and the sign mismatch between the factors and the result, clang generates IR that performs this overflow check with 65-bit signed multiplication and LLVM does not improve on it during optimization, so the 65-bit multiplication is extended to 128-bit during legalization, resulting in the libcall on most targets.

To avoid the initial situation that causes clang to generate the problematic IR, cast size (which is an 'unsigned int') to the same type/width as block (which is currently a 'u64'/'unsigned long long'). GCC appears to already do this internally because there is no binary difference with the cast for arm, arm64, riscv, or x86_64.

Link: ClangBuiltLinux#1958
Link: llvm/llvm-project#38013
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Closes: https://lore.kernel.org/CA+G9fYuA_PTd7R2NsBvtNb7qjwp4avHpCmWi4=OmY4jndDcQYA@mail.gmail.com/
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
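A minimal sketch of the cast described in the commit message, using plain C types instead of the kernel's sector_t and loff_t. The function name is hypothetical and this is not the patch itself; it only illustrates widening size to the type of block before the overflow check.

```c
#include <stdbool.h>

/*
 * Hypothetical illustration of the workaround: both factors have the
 * same unsigned 64-bit type, so only the result differs in signedness.
 * Returns true if the product does not fit in *pos.
 */
bool block_pos_overflows_cast(unsigned long long block, unsigned int size,
                              long long *pos)
{
    return __builtin_mul_overflow(block, (unsigned long long)size, pos);
}
```

The cast does not change which inputs are reported as overflowing; it only changes the operand types that the compiler sees.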

nathanchance · 2023-12-01T18:08:59Z

Unfortunately, the only reason that this workaround works at all is due to a change in LLVM 12, so the build is still broken for LLVM 11:

llvm/llvm-project@3203143
https://github.com/ClangBuiltLinux/continuous-integration2/actions/runs/7059960842/job/19221361610
https://storage.tuxsuite.com/public/clangbuiltlinux/continuous-integration2/builds/2YwP88xX1Dl8qaMR8QAhzPrXN6F/build.log

I wonder if this is worth bumping the minimum version of LLVM for the kernel over...

nickdesaulniers · 2023-12-01T23:25:24Z

That change in clang-12 seems to apply when:

neither input is signed, but the output is
the inputs and outputs are the same width

Are there any further changes that can be made to the kernel sources to use the same signedness for both inputs, and the output?
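To make the case in the list above concrete, here is a rough sketch of how a check with two unsigned 64-bit inputs and a signed 64-bit output can be done without any wider type: multiply as unsigned, then additionally reject products above LLONG_MAX. The function name is hypothetical, and this is an assumption about the general approach, not a claim about the code clang actually emits.

```c
#include <limits.h>
#include <stdbool.h>

/*
 * Hypothetical helper: unsigned * unsigned -> signed, all 64 bits wide.
 * Returns true if the mathematical product does not fit in a long long.
 */
bool mul_u64_to_s64_overflows(unsigned long long a, unsigned long long b,
                              long long *res)
{
    unsigned long long product;

    /* First check for overflow of the unsigned 64-bit multiplication. */
    bool overflow = __builtin_mul_overflow(a, b, &product);

    /* A product that fits in 64 unsigned bits may still exceed LLONG_MAX. */
    if (product > (unsigned long long)LLONG_MAX)
        overflow = true;

    /* On overflow the stored value is the wrapped product (GCC/clang behavior). */
    *res = (long long)product;
    return overflow;
}
```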

nickdesaulniers · 2023-12-01T23:33:11Z

For example, grow_buffers declares pos as a loff_t aka long long.

grow_buffers passes pos to grow_dev_folio which takes a pgoff_t aka unsigned long.

There is type confusion going on in grow_buffers; we confuse BOTH signedness and long-longy-ness.

Does this help clang-11?

diff --git a/fs/buffer.c b/fs/buffer.c
index 9f41d2b38902..1e06d2118981 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1085,7 +1085,7 @@ static bool grow_dev_folio(struct block_device *bdev, sector_t block,
 static bool grow_buffers(struct block_device *bdev, sector_t block,
                unsigned size, gfp_t gfp)
 {
-       loff_t pos;
+       pgoff_t pos;
 
        /*
         * Check for a block which lies outside our maximum possible

nathanchance · 2023-12-01T23:53:03Z

Yes, that diff helps clang-11 and I think that would also allow us to drop the cast that I added to __builtin_mul_overflow.

However, the change that introduced this moved away from pgoff_t, which seemed deliberate? I don't know enough about those types to know what the intention was there. Perhaps this was to avoid issues with long being different widths on different architectures?