@grooverdan
Member

  • The Jira issue number for this PR is: MDEV-36316/MDEV-36317/MDEV-36318

Description

Various fixes to allow a Debug MSAN build with Clang-20 to pass tests.

See individual commit messages for details.

Release Notes

  • Internal changes only.

How can this PR be tested?

podman run --rm -ti -v "$PWD":/source:z --mount=type=tmpfs,tmpfs-size=10G,dst=/build --shm-size=10g --workdir /build --entrypoint /bin/bash --user buildbot --cap-add=SYS_PTRACE --privileged quay.io/mariadb-foundation/bb-worker:dev_debian12-msan-clang-20

cmake    -DWITH_EMBEDDED_SERVER=OFF \
                -DWITH_INNODB_{BZIP2,LZ4,LZMA,LZO,SNAPPY}=OFF \
                -DPLUGIN_{MROONGA,ROCKSDB,OQGRAPH,SPIDER}=NO \
                -DWITH_ZLIB=bundled \
                -DHAVE_LIBAIO_H=0 \
                -DCMAKE_DISABLE_FIND_PACKAGE_{URING,LIBAIO}=1 \
                -DWITH_NUMA=NO \
                -DWITH_SYSTEMD=no \
                -DWITH_MSAN=ON \
                -DHAVE_CXX_NEW=1 \
                -DCMAKE_{EXE,MODULE}_LINKER_FLAGS="-L${MSAN_LIBDIR} -Wl,-rpath=${MSAN_LIBDIR}" \
                -DCMAKE_CXX_FLAGS=-fsanitize=memory \
                -DWITH_DBUG_TRACE=OFF \
                -DCMAKE_BUILD_TYPE=Debug \
                /source
cmake --build .
mysql-test/mtr --parallel=auto --force --big-test

Basing the PR against the correct MariaDB version

  • This is a new feature or a refactoring, and the PR is based against the main branch.
  • This is a bug fix, and the PR is based against the earliest maintained branch in which the bug can be reproduced.

PR quality check

  • I checked the CODING_STANDARDS.md file and my PR conforms to this where appropriate.
  • For any trivial modifications to the PR, I am ok with the reviewer making the changes themselves.

With CMAKE_BUILD_TYPE=Debug, MSAN in clang-20.1 reports
MemorySanitizer: use-of-uninitialized-value on mach_read_from_2
called by rec_set_bit_field_2 (and likewise for the _1 equivalent).

Non-debug builds are assumed to optimize the read away, so that
the operation becomes a plain store of the new value.

@grooverdan added the "MariaDB Foundation" label (pull requests created by MariaDB Foundation) on Mar 20, 2025
@grooverdan requested a review from dr-m on March 20, 2025 06:11
@grooverdan changed the title from "MDEV-36316/MDEV-36317/MDEV-36318 Debug msan fixes 10.6" to "MDEV-36316/MDEV-36327/MDEV-36328 Debug msan fixes 10.6" on Mar 20, 2025

Without this increase the mtr test case pre/post conditions will
fail, as stack usage has increased under MSAN with clang-20.1.

Partial success was achieved with 432K, but 448K was needed
for test cases that changed the default collation.

The behaviour observed with a smaller stack size was a SEGV when
a function allocated memory on the stack and then called another
function (potentially, and perhaps coincidentally, memset, which is
assumed to be common early in a function, right after allocation).

@grooverdan force-pushed the debug-msan-fixes-10.6 branch from 12611e9 to b445f66 on March 20, 2025 07:52

The function dict_process_sys_columns_rec left nth_v_col uninitialized
unless it was a virtual column. This was ok as the function
i_s_sys_columns_fill_table also didn't read this value unless it was a
virtual column.

MSAN in clang-20 does not follow this reasoning, however, so the pass
by value was changed to a pass by pointer so that MSAN could track the
value correctly.
…n_range

ror_scan_selectivity passed an uninitialized page structure so
we shouldn't be using its values. btr_estimate_n_rows_in_range
doesn't use the page numbers in the tuples so these can be omitted.

While ror_scan_selectivity never uses the result, the MRR call path
into records_in_range does use it.

@grooverdan force-pushed the debug-msan-fixes-10.6 branch from b445f66 to 5e9b106 on March 21, 2025 05:42

Comment on lines +159 to 164
+#ifndef DBUG_OFF
+MEM_MAKE_DEFINED(rec - offs, 1);
+#endif
 mach_write_to_1(rec - offs,
 (mach_read_from_1(rec - offs) & ~mask)
 | (val << shift));
Contributor

This looks incorrect to me. Why would we claim that all bits at rec[-offs] are initialized when we are only overwriting some of the bits here? What would fail if this change and the similar change to rec_set_bit_field_2() were omitted?

Member Author

The trace was like in MDEV-36316.

I agree it seems incorrect to claim that all bits are initialised. It only occurred in Debug mode, so I'm assuming that less optimised code makes this look different. I'll look at forcing a higher optimisation level on these blocks as an alternative.
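
For reference, the read-modify-write under discussion can be looked at in isolation. The following is a minimal sketch, not the InnoDB code: `set_bit_field_1` is a hypothetical stand-in for rec_set_bit_field_1() that works on a plain byte instead of a record pointer. It shows that only the bits selected by the mask are overwritten, which is why marking the whole byte as defined beforehand claims more than the write establishes.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for rec_set_bit_field_1(): overwrite only the bits
// selected by `mask` and keep every other bit of the byte as it was.
static void set_bit_field_1(uint8_t *byte, unsigned val, unsigned mask,
                            unsigned shift)
{
  assert(mask != 0 && mask <= 0xFF);
  // The new value must fit inside the masked field.
  assert(((val << shift) & mask) == (val << shift));
  // The old byte is read here; the bits outside `mask` survive unchanged,
  // so whatever state they were in before the call carries over.
  *byte = static_cast<uint8_t>((*byte & ~mask) | (val << shift));
}
```

For example, starting from 0xF0 and writing 0x5 into the low nibble (mask 0x0F, shift 0) yields 0xF5: the upper four bits are never written by this call.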

Comment on lines -5456 to 5457
 more column information */
-ulint nth_v_col, /*!< in: virtual column, its
+ulint* nth_v_col, /*!< in: virtual column, its
 sequence number (nth virtual col) */
Contributor

I do not understand why we would need any of the changes to this file and which problem these changes would solve. We’re no longer passing a read-only parameter by value but via a pointer that is effectively read-only. Can you test again without including any of these changes?

Member Author

The error received was in MDEV-36327.

It was working around an MSAN pattern of possibly not tracing undefined memory once it has been copied to registers and then used as an argument to the next function.

An alternative is:

$ git diff storage/innobase/dict/dict0load.cc
diff --git a/storage/innobase/dict/dict0load.cc b/storage/innobase/dict/dict0load.cc
index 87ff163a233..0b2cdc9e01b 100644
--- a/storage/innobase/dict/dict0load.cc
+++ b/storage/innobase/dict/dict0load.cc
@@ -1184,6 +1184,8 @@ static const char *dict_load_column_low(dict_table_t *table,
        /* Report the virtual column number */
        if ((prtype & DATA_VIRTUAL) && nth_v_col != NULL) {
                *nth_v_col = dict_get_v_col_pos(pos);
+       } else {
+               *nth_v_col = 0;
        }
 
        return(NULL);
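
The idea behind this alternative is to give the out-parameter a defined value on every path. A minimal self-contained sketch follows, under stated assumptions: `load_column_sketch`, `DATA_VIRTUAL_SKETCH` and `v_col_pos_sketch` are hypothetical names, and the position encoding is invented for illustration. Because the condition in the diff suggests that callers may pass NULL, the sketch checks the pointer before writing through it on either branch.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical flag and decoder standing in for DATA_VIRTUAL and
// dict_get_v_col_pos(); the real InnoDB definitions differ.
constexpr unsigned long DATA_VIRTUAL_SKETCH = 8192;

static unsigned long v_col_pos_sketch(unsigned long pos)
{
  return pos >> 16;  // assumed encoding, for illustration only
}

// Writes *nth_v_col on every path when the caller supplied a pointer, so a
// later pass-by-value of that variable never carries uninitialized bits.
static void load_column_sketch(unsigned long prtype, unsigned long pos,
                               unsigned long *nth_v_col)
{
  if (nth_v_col == nullptr)
    return;  // caller is not interested in the virtual column number
  *nth_v_col = (prtype & DATA_VIRTUAL_SKETCH) ? v_col_pos_sketch(pos) : 0;
}
```

With this shape, a caller that forwards the value to another function always passes initialized data, whether or not the column is virtual.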

Contributor

Oh, right, starting with clang-16 or thereabouts, -fsanitize=memory would complain when uninitialized data is being passed to function calls, and there is no way to disable that via MSAN_OPTIONS. In older clang versions, the data was passed just fine, apparently along with the shadow bytes that would indicate which bits are uninitialized.

Comment on lines -14479 to +14480
-btr_pos_t tuple1(range_start, mode1, pages->first_page);
-btr_pos_t tuple2(range_end, mode2, pages->last_page);
+btr_pos_t tuple1(range_start, mode1, 0);
+btr_pos_t tuple2(range_end, mode2, 0);
Contributor

This would seem to be the actual fix. ~0ULL might be a safer value, but I think that 0 should be OK as well, because the smallest possible index page number is 3.

Member Author

I think I could add MEM_UNDEFINED(&tuple1.page_id, sizeof tuple1.page_id) (and the same for tuple2) to protect against a future btr_estimate_n_rows_in_range that accesses it.
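
A rough sketch of that idea, assuming MEM_UNDEFINED maps to an MSAN poisoning call such as `__msan_allocated_memory` from `<sanitizer/msan_interface.h>` and is a no-op in other builds. The names `MEM_UNDEFINED_SKETCH`, `pos_sketch` and `hide_page_id` are hypothetical and stand in for the real macro and for btr_pos_t.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Poison memory only when compiling with -fsanitize=memory.
#if defined(__has_feature)
# if __has_feature(memory_sanitizer)
#  include <sanitizer/msan_interface.h>
#  define MEM_UNDEFINED_SKETCH(addr, len) __msan_allocated_memory(addr, len)
# endif
#endif
#ifndef MEM_UNDEFINED_SKETCH
# define MEM_UNDEFINED_SKETCH(addr, len) ((void) (addr), (void) (len))
#endif

// Hypothetical stand-in for btr_pos_t with only the fields needed here.
struct pos_sketch
{
  uint64_t page_id;
  int mode;
  pos_sketch(int m, uint64_t id) : page_id(id), mode(m) {}
};

// Marks page_id as uninitialized for MSAN. In an MSAN build, a later use of
// p.page_id in a branch or as a function argument would be reported; in any
// other build this does nothing.
static void hide_page_id(pos_sketch &p)
{
  MEM_UNDEFINED_SKETCH(&p.page_id, sizeof p.page_id);
}
```

The placeholder value 0 stays in memory either way; the poisoning only turns an accidental future read of the field into an MSAN report instead of a silent use of the dummy value.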

Contributor

@dr-m left a comment

I installed a fresh build of clang-20 from https://apt.llvm.org, specifically:

Debian clang version 20.1.2 (++20250402124457+58df0ef89dd6-1~exp1~20250402004517.100)

I built MSAN-instrumented libraries using the build-msan19.sh script attached to MDEV-20377. For libidn2 I used the release tarball, because apt source libidn2 would omit the configure script. I tested e7442e5 (10.6) and 25737db (10.5), both CMAKE_BUILD_TYPE=Debug and CMAKE_BUILD_TYPE=RelWithDebInfo, and no issues were reported. Here is my script for invoking cmake, which I invoked with CLANG=20:

#!/bin/bash
set -ex
test ! -e .git
test ! -e .bashrc
: ${CLANG=15}
# sudo apt install libunwind15-$CLANG
exec cmake -DCMAKE_{C_COMPILER=clang,CXX_COMPILER=clang++}-$CLANG \
-DCMAKE_C_FLAGS='-O2 -march=native -mtune=native -Wno-unused-command-line-argument -fdebug-macro -fno-limit-debug-info' \
-DCMAKE_CXX_FLAGS='-stdlib=libc++ -O2 -march=native -mtune=native -Wno-unused-command-line-argument -fdebug-macro -fno-limit-debug-info' \
-DCMAKE_{EXE,MODULE,SHARED}_LINKER_FLAGS:STRING=-fuse-ld=lld-$CLANG \
-DWITH_DBUG_TRACE=OFF \
-DWITH_EMBEDDED_SERVER=OFF -DWITH_UNIT_TESTS=OFF -DCMAKE_BUILD_TYPE=Debug \
-DWITH_INNODB_{BZIP2,LZMA,LZO,SNAPPY}=OFF \
-DPLUGIN_{ARCHIVE,TOKUDB,MROONGA,OQGRAPH,ROCKSDB,CONNECT,SPIDER}=NO -DWITH_SAFEMALLOC=OFF \
-DWITH_{ZLIB,SSL,PCRE}=bundled \
-DHAVE_LIBAIO_H=0 \
-DCMAKE_DISABLE_FIND_PACKAGE_URING=1 -DCMAKE_DISABLE_FIND_PACKAGE_LIBAIO=1 \
-DWITH_MSAN=ON \
-G Ninja "$@"

Note: there is no package installed similar to what the commented line says. I have libunwind-20 and libunwind-20-dev. With clang-20 I’d get lots of trouble if I pointed LD_LIBRARY_PATH to an MSAN-instrumented libunwind.so. So, I omitted that instrumented library at test runtime.

Can you please check again?

Also, it is worth noting that I used MSAN_OPTIONS=poison_in_dtor=0 in order to work around MDEV-30942. Can you double-check if that would make the problems disappear in your environment? If yes, then we’d seem to need a more specific fix.

@grooverdan
Member Author

Won't fix - avoid with -DCMAKE_C{,XX}_FLAGS=-O2

@grooverdan closed this on Apr 4, 2025
@grooverdan deleted the debug-msan-fixes-10.6 branch on April 4, 2025 07:14