-
-
Notifications
You must be signed in to change notification settings - Fork 1.9k
MDEV-36316/MDEV-36327/MDEV-36328 Debug msan fixes 10.6 #3899
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
In CMAKE_BUILD_TYPE=Debug the MSAN of clang-20.1 results in MemorySanitizer: use-of-uninitialized-value on mach_read_from_2 called by rec_set_bit_field_2 (and likewise for the _1 equivalent). The non-debug builds are assumed to optimize this down such that this becomes just a setting of values.
Without this increase the mtr test case pre/post conditions will fail as the stack usage has increased under MSAN with clang-20.1. A partial success with 432K was achieved, however the 448K was needed for test cases that changed default collation. The resulting behaviour observed on smaller stack size was SEGV when a function allocated memory from the stack, and the called another function (potentially coincidenly memset - assuming common in early functions post allocation).
12611e9 to
b445f66
Compare
The function dict_process_sys_columns_rec left nth_v_col uninitialized unless it was a virtual column. This was ok as the function i_s_sys_columns_fill_table also didn't read this value unless it was a virtual column. As MSAN in clang-20 didn't follow this though, the pass by value was changed to a pass by ptr so that MSAN could detect this correctly.
…n_range ror_scan_selectivity passed an uninitialized page structure so we shouldn't be using its values. btr_estimate_n_rows_in_range doesn't use the page numbers in the tuples so these can be omitted. While ror_scan_selectivity never uses the result, however the mrr calling of records_in_range does use the result.
b445f66 to
5e9b106
Compare
| #ifndef DBUG_OFF | ||
| MEM_MAKE_DEFINED(rec - offs, 1); | ||
| #endif | ||
| mach_write_to_1(rec - offs, | ||
| (mach_read_from_1(rec - offs) & ~mask) | ||
| | (val << shift)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks incorrect to me. Why would we claim that all bits at rec[-offs] are initialized when we are only overwriting some of the bits here? What would fail if this change and the similar change to rec_set_bit_field_2() were omitted?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The trace was like in MDEV-36316.
I agree it seem overly incorrect claiming all bits are initialised. It only occurred in Debug mode so I'm assuming a less optimised code makes this look different. I'll look at forcing a higher optimisation on these blocks as an alternative.
| more column information */ | ||
| ulint nth_v_col, /*!< in: virtual column, its | ||
| ulint* nth_v_col, /*!< in: virtual column, its | ||
| sequence number (nth virtual col) */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not understand why we would need any of the changes to this file and which problem these changes would solve. We’re no longer passing a read-only parameter by value but via a pointer that is effectively read-only. Can you test again without including any of these changes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The error received was in MDEV-36327.
It was working around a msan patten of possibly not tracing undefined memory once it got copied to registers and then used as argument to the next function.
An alternative is:
$ git diff storage/innobase/dict/dict0load.cc
diff --git a/storage/innobase/dict/dict0load.cc b/storage/innobase/dict/dict0load.cc
index 87ff163a233..0b2cdc9e01b 100644
--- a/storage/innobase/dict/dict0load.cc
+++ b/storage/innobase/dict/dict0load.cc
@@ -1184,6 +1184,8 @@ static const char *dict_load_column_low(dict_table_t *table,
/* Report the virtual column number */
if ((prtype & DATA_VIRTUAL) && nth_v_col != NULL) {
*nth_v_col = dict_get_v_col_pos(pos);
+ } else {
+ *nth_v_col = 0;
}
return(NULL);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, right, starting with clang-16 or thereabouts, -fsanitize=memory would complain when uninitialized data is being passed to function calls, and there is no way to disable that via MSAN_OPTIONS. In older clang versions, the data was passed just fine, apparently along with the shadow bytes that would indicate which bits are uninitialized.
| btr_pos_t tuple1(range_start, mode1, pages->first_page); | ||
| btr_pos_t tuple2(range_end, mode2, pages->last_page); | ||
| btr_pos_t tuple1(range_start, mode1, 0); | ||
| btr_pos_t tuple2(range_end, mode2, 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would seem to be the actual fix. ~0ULL might be a safer value, but I think that 0 should be OK as well, because the smallest possible index page number is 3.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I could add MEM_UNDEFINED(&tuple1.page_id, sizeof tuple1.page_id) (plus tuple2) to protect against future btr_estimate_n_rows_in_range that access it.
dr-m
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I installed a fresh build of clang-20 from https://apt.llvm.org, specifically:
Debian clang version 20.1.2 (++20250402124457+58df0ef89dd6-1~exp1~20250402004517.100)
I built MSAN-instrumented libraries using the build-msan19.sh script attached to MDEV-20377. For libidn2 I used the release tarball, because apt source libidn2 would omit the configure script. I tested e7442e5 (10.6) and 25737db (10.5), both CMAKE_BUILD_TYPE=Debug and CMAKE_BUILD_TYPE=RelWithDebInfo, and no issues were reported. Here is my script for invoking cmake, which I invoked with CLANG=20:
#!/bin/bash
set -ex
test ! -e .git
test ! -e .bashrc
: ${CLANG=15}
# sudo apt install libunwind15-$CLANG
exec cmake -DCMAKE_{C_COMPILER=clang,CXX_COMPILER=clang++}-$CLANG \
-DCMAKE_C_FLAGS='-O2 -march=native -mtune=native -Wno-unused-command-line-argument -fdebug-macro -fno-limit-debug-info' \
-DCMAKE_CXX_FLAGS='-stdlib=libc++ -O2 -march=native -mtune=native -Wno-unused-command-line-argument -fdebug-macro -fno-limit-debug-info' \
-DCMAKE_{EXE,MODULE,SHARED}_LINKER_FLAGS:STRING=-fuse-ld=lld-$CLANG \
-DWITH_DBUG_TRACE=OFF \
-DWITH_EMBEDDED_SERVER=OFF -DWITH_UNIT_TESTS=OFF -DCMAKE_BUILD_TYPE=Debug \
-DWITH_INNODB_{BZIP2,LZMA,LZO,SNAPPY}=OFF \
-DPLUGIN_{ARCHIVE,TOKUDB,MROONGA,OQGRAPH,ROCKSDB,CONNECT,SPIDER}=NO -DWITH_SAFEMALLOC=OFF \
-DWITH_{ZLIB,SSL,PCRE}=bundled \
-DHAVE_LIBAIO_H=0 \
-DCMAKE_DISABLE_FIND_PACKAGE_URING=1 -DCMAKE_DISABLE_FIND_PACKAGE_LIBAIO=1 \
-DWITH_MSAN=ON \
-G Ninja "$@"Note: there is no package installed similar to what the commented line says. I have libunwind-20 and libunwind-20-dev. With clang-20 I’d get lots of trouble if I pointed LD_LIBRARY_PATH to an MSAN-instrumented libunwind.so. So, I omitted that instrumented library at test runtime.
Can you please check again?
Also, it is worth noting is that I used MSAN_OPTIONS=poison_in_dtor=0 in order to work around MDEV-30942. Can you double-check if that would make the problems disappear in your environment? If yes, then we’d seem to need a more specific fix.
|
Won't fix - avoid with |
Description
Various fixing to allow a Debug MSAN with Clang-20 to pass tests.
See individual commit messages for details.
Release Notes
How can this PR be tested?
Basing the PR against the correct MariaDB version
mainbranch.PR quality check