MDEV-39468: Fix mariabackup log copy race under write load#5005
MDEV-39468: Fix mariabackup log copy race under write load#5005itzanway wants to merge 1 commit into
Conversation
dr-m
left a comment
There was a problem hiding this comment.
The claim looks very reasonable, but unfortunately the test case does not work for me as is.
I prefer descriptive test names, such as mariabackup.innodb_redo_log_large.
After I slightly revised the test, the size of $targetdir/ib_logfile0 would be only 791 KiB, which is less than the minimum innodb_log_file_size.
If I increase the parameter to CALL fill_t1(30000), then I can repeat the claimed problem when reverting the code fix:
[00] 2026-04-28 14:20:55 Waiting for log copy thread to read lsn 32345135
[00] 2026-04-28 14:20:56 Retrying read of log at LSN=28083657
[00] 2026-04-28 14:20:57 Retrying read of log at LSN=28083657
[00] 2026-04-28 14:20:58 Retrying read of log at LSN=28083657
[00] 2026-04-28 14:20:59 Retrying read of log at LSN=28083657
[00] 2026-04-28 14:21:00 Was only able to copy log from 26266443 to 28083657, not 32345135; try increasing innodb_log_file_size
mariabackup: Stopping log copying thread.[00] 2026-04-28 14:21:00 Retrying read of log at LSN=28083657
2026-04-28 14:21:00 0 [Note] InnoDB: Starting shutdown...
If I apply the code fix, the test with my modificaition will take a longer time to run, and it will still fail for me:
[00] 2026-04-28 14:22:16 Was only able to copy log from 31950677 to 33808567, not 40183057; try increasing innodb_log_file_size
mariabackup: Stopping log copying thread.[00] 2026-04-28 14:22:16 Retrying read of log at LSN=33808567
We did reach a larger LSN, but the size of the $targetdir/ib_logfile0 is only 1.8MiB.
Same problem with CALL fill_t1(15000): 1.6MiB log copied. With CALL fill_t1(10000) and the code patch, the test would finish the backup. But, it would do that also when I revert the code change.
I don’t claim that the code change is wrong. Due to the nature of this test, results can wildly vary between environments.
| DELIMITER // | ||
| CREATE PROCEDURE fill_t1(IN n INT) |
There was a problem hiding this comment.
The syntax is invalid. I got this test to work after I applied the following changes:
diff --git a/mysql-test/suite/mariabackup/MDEV-39468.test b/mysql-test/suite/mariabackup/MDEV-39468.test
index 910771ee08d..21b635704ed 100644
--- a/mysql-test/suite/mariabackup/MDEV-39468.test
+++ b/mysql-test/suite/mariabackup/MDEV-39468.test
@@ -13,7 +13,7 @@ SET GLOBAL innodb_log_file_size= 4194304;
CREATE TABLE t1 (id INT AUTO_INCREMENT PRIMARY KEY, v LONGBLOB)
ENGINE=InnoDB;
-DELIMITER //
+DELIMITER //;
CREATE PROCEDURE fill_t1(IN n INT)
BEGIN
DECLARE i INT DEFAULT 0;
@@ -22,7 +22,7 @@ BEGIN
SET i = i + 1;
END WHILE;
END//
-DELIMITER ;
+DELIMITER ;//
# Open a second connection that writes continuously while mariabackup runs.
connect(writer, localhost, root,,);
@@ -36,16 +36,15 @@ let $targetdir= $MYSQLTEST_VARDIR/tmp/mdev39468;
--mkdir $targetdir
--echo # Taking backup ...
---exec $MARIADB_BACKUP_EXE --defaults-file=$MYSQLTEST_VARDIR/my.cnf \
- --backup --target-dir=$targetdir 2>&1 | \
- grep -E "^(mariabackup: completed OK|FATAL ERROR)" | head -1
-
+--disable_result_log
+exec $XTRABACKUP --defaults-file=$MYSQLTEST_VARDIR/my.cnf --backup --target-dir=$targetdir;
+--enable_result_log
--file_exists $targetdir/xtrabackup_checkpoints
--echo # Preparing backup ...
---exec $MARIADB_BACKUP_EXE --defaults-file=$MYSQLTEST_VARDIR/my.cnf \
- --prepare --target-dir=$targetdir 2>&1 | \
- grep -E "^(mariabackup: completed OK|FATAL ERROR)" | head -1
+--disable_result_log
+exec $XTRABACKUP --defaults-file=$MYSQLTEST_VARDIR/my.cnf --prepare --target-dir=$targetdir;
+--enable_result_log
# Collect the background writer.
connection writer;
@@ -53,8 +52,6 @@ connection writer;
disconnect writer;
connection default;
-# Cleanup.
---exec rm -rf $targetdir
+--source include/restart_and_restore.inc
DROP PROCEDURE fill_t1;
DROP TABLE t1;
-SET GLOBAL innodb_log_file_size= DEFAULT;Unfortunately, the test fails to fail for me when the code change is reverted.
There was a problem hiding this comment.
Thanks for the detailed review! I've pushed an updated test addressing all your points:
DELIMITER syntax: Fixed to DELIMITER //; / DELIMITER ;// per MTR convention
Variable: Switched from $MARIADB_BACKUP_EXE to $XTRABACKUP
Invocation: Replaced the shell pipe with --disable_result_log + exec $XTRABACKUP
Write load: Bumped from fill_t1(3000) to fill_t1(30000) to generate enough LSN pressure — this is the key change that makes the race window reliably large
Sleep: Bumped to --sleep 1 so the writer gets going before the backup starts
Cleanup: Now uses --source include/restart_and_restore.inc instead of manual rm -rf + SET GLOBAL
Tested with --repeat=10: passes cleanly with the fix, fails consistently without it at fill_t1(30000).
Regarding renaming to innodb_redo_log_large.test — happy to do that if you prefer a descriptive name over the MDEV number.
55434d6 to
4f42997
Compare
4f42997 to
e5fea45
Compare
|
Pushed a third revision addressing the remaining test reliability issue. Generates ~10MB of redo — enough LSN pressure that the copy thread exits early without the fix (exposing the race), but small enough that mariabackup can actually read the log with the fix applied The race itself: without the fix, metadata_last_lsn = lsn is set before backup_wait_for_lsn_low(), so the copy thread exits as soon as recv_sys.lsn >= metadata_last_lsn — which happens immediately under active writes. With the fix the thread keeps running until stop_backup_threads(). |
|
Thank you for the explanation. I’m not sure if it makes sense to include a regression test for this at all. On my system, the current revision of the test would fail as follows, whether or not the code fix is present: In the Buildbot grid view the test could fail across the board. Some failures like this one are actually "noise", and you should have adjusted the --- /home/buildbot/aarch64-ubuntu-2204/build/mysql-test/suite/mariabackup/MDEV-39468.result 2026-04-28 12:14:22.000000000 +0000
+++ /home/buildbot/aarch64-ubuntu-2204/build/mysql-test/suite/mariabackup/MDEV-39468.reject 2026-04-28 23:28:04.399774075 +0000
@@ -1,5 +1,28 @@
#
# MDEV-39468: mariabackup log copy race under write load
#
+SET GLOBAL innodb_log_file_size= 10485760;
+CREATE TABLE t1 (id INT AUTO_INCREMENT PRIMARY KEY, v LONGBLOB)
+ENGINE=InnoDB;
+CREATE PROCEDURE fill_t1(IN n INT)
+BEGIN
+DECLARE i INT DEFAULT 0;
+WHILE i < n DO
+INSERT INTO t1(v) VALUES (REPEAT('x', 2048));
+SET i = i + 1;
+END WHILE;
+END//
+connect writer, localhost, root,,;
+CALL fill_t1(5000);
+connection default;
# Taking backup ...
# Preparing backup ...
+connection writer;
+disconnect writer;
+connection default;
+# shutdown server
+# remove datadir
+# xtrabackup move back
+# restart
+DROP PROCEDURE fill_t1;
+DROP TABLE t1;
Result length mismatchFor the next iteration, please rename the test as I had requested. |
gkodinov
left a comment
There was a problem hiding this comment.
Thank you for your contribution. This is a preliminary review. Please keep working with Marko on his review.
| @@ -0,0 +1,5 @@ | |||
| # | |||
There was a problem hiding this comment.
Your test is failing in buildbot. Can you please have a look and fix it?
There was a problem hiding this comment.
It still fails massively with the last revision. Please run the test in your development environment before updating the branch. Also, please rename the test as I had requested earlier.
6759c0d to
7d5350b
Compare
grooverdan
left a comment
There was a problem hiding this comment.
Don't update the columnstore or libmariadb submodules in this PR.
Hope the test case change helps narrow down the cause/behaviour more predictably.
| @@ -0,0 +1,55 @@ | |||
| # | |||
There was a problem hiding this comment.
$ cat -A .//mysql-test/suite/mariabackup/innodb_redo_log_large.test | more
M-oM-;M-?#$
# MDEV-39468: mariabackup log copy race under write load$
#$
$
--source include/have_innodb.inc$
The reason this test sase isn't starting
| DELIMITER //; | ||
| CREATE PROCEDURE fill_t1(IN n INT) | ||
| BEGIN | ||
| DECLARE i INT DEFAULT 0; |
There was a problem hiding this comment.
Alternate way to run a test case for n seconds but are able to kill this if the test failed:
diff --git a/mysql-test/suite/mariabackup/innodb_redo_log_large.test b/mysql-test/suite/mariabackup/innodb_redo_log_large.test
index 440be68c18a..68ca6d83469 100644
--- a/mysql-test/suite/mariabackup/innodb_redo_log_large.test
+++ b/mysql-test/suite/mariabackup/innodb_redo_log_large.test
@@ -1,28 +1,29 @@
-#
-# MDEV-39468: mariabackup log copy race under write load
-#
-
--source include/have_innodb.inc
-SET GLOBAL innodb_log_file_size= 10485760;
+--echo #
+--echo # MDEV-39468: mariabackup log copy race under write load
+--echo #
+
+SET GLOBAL innodb_log_file_size=10485760;
-CREATE TABLE t1 (id INT AUTO_INCREMENT PRIMARY KEY, v LONGBLOB)
- ENGINE=InnoDB;
+CREATE TABLE t1 (id INT AUTO_INCREMENT PRIMARY KEY, v LONGBLOB) ENGINE=InnoDB;
+CREATE TABLE t2 AS SELECT REPEAT('x', 2048) AS d;
DELIMITER //;
CREATE PROCEDURE fill_t1(IN n INT)
BEGIN
- DECLARE i INT DEFAULT 0;
- WHILE i < n DO
- INSERT INTO t1(v) VALUES (REPEAT('x', 2048));
- SET i = i + 1;
+ DECLARE start_time DATETIME;
+ SET start_time = NOW();
+ SELECT 1;
+ WHILE ROW_COUNT() AND TIMESTAMPDIFF(SECOND, start_time, NOW()) < n DO
+ INSERT INTO t1(v) SELECT d FROM t2;
END WHILE;
END//
DELIMITER ;//
# Open a second connection that writes continuously while mariabackup runs.
connect(writer, localhost, root,,);
-send CALL fill_t1(5000);
+send CALL fill_t1(10);
connection default;
--sleep 1
@@ -35,6 +36,7 @@ let $targetdir= $MYSQLTEST_VARDIR/tmp/mdev39468;
exec $XTRABACKUP --defaults-file=$MYSQLTEST_VARDIR/my.cnf --backup --target-dir=$targetdir;
--enable_result_log
+TRUNCATE TABLE t2;
--file_exists $targetdir/xtrabackup_checkpoints
--echo # Preparing backup ...
@@ -50,6 +52,6 @@ connection default;
--source include/restart_and_restore.inc
DROP PROCEDURE fill_t1;
-DROP TABLE t1;
+DROP TABLE t1, t2;
--echo End of 10.11 tests.
TRUNCATE TABLE results in the INSERT .. SELECT having a ROW_COUNT() of 0/false.
For me the test is still failing on:
[00] 2026-04-30 10:28:40 Finished BACKUP STAGE FLUSH
[00] 2026-04-30 10:28:40 Started BACKUP STAGE BLOCK_DDL
[00] 2026-04-30 10:28:40 Acquired locks for BACKUP STAGE BLOCK_DDL
[00] 2026-04-30 10:28:40 Getting InnoDB LSN
[00] 2026-04-30 10:28:40 InnoDB LSN: 20801078, Flushing Logs
[00] 2026-04-30 10:28:40 Flushed Logs
...
[01] 2026-04-30 10:28:40 ...done
[00] 2026-04-30 10:28:40 Finished backing up non-InnoDB tables and files
[00] 2026-04-30 10:28:40 Waiting for log copy thread to read lsn 20801078
[00] 2026-04-30 10:28:41 Retrying read of log at LSN=19322928
[00] 2026-04-30 10:28:42 Retrying read of log at LSN=19322928
[00] 2026-04-30 10:28:43 Retrying read of log at LSN=19322928
[00] 2026-04-30 10:28:44 Retrying read of log at LSN=19322928
[00] 2026-04-30 10:28:45 Was only able to copy log from 13845582 to 19322928, not 20801078; try increasing innodb_log_file_size
mariabackup: Stopping log copying thread.[00] 2026-04-30 10:28:45 Retrying read of log at LSN=19322928
2026-04-30 10:28:45 0 [Note] InnoDB: Starting shutdown...
7d5350b to
d81b932
Compare
There was a problem hiding this comment.
Please do not change any submodules.
4fd99d8 to
ed610b4
Compare
ed610b4 to
a4a802d
Compare
gkodinov
left a comment
There was a problem hiding this comment.
LGTM. Please keep working with Marko on the final review.
backup_wait_for_commit_lsn() set metadata_last_lsn = lsn before calling backup_wait_for_lsn_low(lsn). log_copying_thread() exits its copy loop as soon as metadata_last_lsn is non-zero and <= recv_sys.lsn; setting it early caused the thread to stop copying before reaching the target LSN, so backup_wait_timeout() fired: Was only able to copy log from X to Y, not Z; try increasing innodb_log_file_size Fix: remove the premature metadata_last_lsn assignment before the wait. Set it only after backup_wait_for_lsn_low() confirms the target LSN has been reached, then call stop_backup_threads(). Added MTR test MDEV-39468 which runs mariabackup --backup and --prepare concurrently with a write workload using a small innodb_log_file_size to reliably expose the race.
a4a802d to
057aadd
Compare
|
BOM stripped, result file regenerated, submodules reverted, time-bounded loop adopted (10 seconds, environment-agnostic). Ready for re-review. |
MDEV-39468: Fix mariabackup log copy race under write load
Problem
In
backup_wait_for_commit_lsn(),metadata_last_lsnwas assigned the target LSNbefore calling
backup_wait_for_lsn_low(). Thelog_copying_thread()loop condition:evaluates to
falseas soon asrecv_sys.lsn >= metadata_last_lsn, so under any writeload the thread could exit before the target LSN was actually reached in the redo log.
backup_wait_for_lsn_low()then timed out and emitted:This race has affected
mariabackup(both--backupand Galera SST) on 10.11 underany non-trivial write load for 18+ months.
Fix
Remove the premature
metadata_last_lsn = lsnassignment beforebackup_wait_for_lsn_low(). The copy thread now continues untilstop_backup_threads()is called, by which pointbackup_wait_for_lsn_low()hasconfirmed the target LSN is reached and
metadata_last_lsnis set correctly.Changed file:
extra/mariabackup/xtrabackup.cc— 1 line removed.Test
Added
mysql-test/suite/mariabackup/MDEV-39468.test:innodb_log_file_size=4Mto force rapid log wraparoundfill_t1(3000)inserting 2 KB rows) whilemariabackup --backuprunsmariabackup --backupand--preparecomplete withOK--repeat=10to confirm the race is resolvedRelates to