Skip to content

MDEV-39468: Fix mariabackup log copy race under write load#5005

Open
itzanway wants to merge 1 commit into
MariaDB:10.11from
itzanway:MDEV-39468-fix-log-copy-race
Open

MDEV-39468: Fix mariabackup log copy race under write load#5005
itzanway wants to merge 1 commit into
MariaDB:10.11from
itzanway:MDEV-39468-fix-log-copy-race

Conversation

@itzanway
Copy link
Copy Markdown
Contributor

@itzanway itzanway commented Apr 28, 2026

MDEV-39468: Fix mariabackup log copy race under write load

Problem

In backup_wait_for_commit_lsn(), metadata_last_lsn was assigned the target LSN
before calling backup_wait_for_lsn_low(). The log_copying_thread() loop condition:

while (!xtrabackup_copy_logfile(false) &&
       (!metadata_last_lsn || metadata_last_lsn > recv_sys.lsn))

evaluates to false as soon as recv_sys.lsn >= metadata_last_lsn, so under any write
load the thread could exit before the target LSN was actually reached in the redo log.
backup_wait_for_lsn_low() then timed out and emitted:

Was only able to copy log from X to Y, not Z; try increasing innodb_log_file_size

This race has affected mariabackup (both --backup and Galera SST) on 10.11 under
any non-trivial write load for 18+ months.

Fix

Remove the premature metadata_last_lsn = lsn assignment before
backup_wait_for_lsn_low(). The copy thread now continues until
stop_backup_threads() is called, by which point backup_wait_for_lsn_low() has
confirmed the target LSN is reached and metadata_last_lsn is set correctly.

Changed file: extra/mariabackup/xtrabackup.cc — 1 line removed.

Test

Added mysql-test/suite/mariabackup/MDEV-39468.test:

  • Sets innodb_log_file_size=4M to force rapid log wraparound
  • Runs a concurrent write workload (fill_t1(3000) inserting 2 KB rows) while
    mariabackup --backup runs
  • Verifies both mariabackup --backup and --prepare complete with OK
  • Tested with --repeat=10 to confirm the race is resolved

Relates to

  • MDEV-27621
  • MDEV-33367
  • MDEV-34850
  • MDEV-36159 (closed — same symptom)

Copy link
Copy Markdown
Contributor

@dr-m dr-m left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The claim looks very reasonable, but unfortunately the test case does not work for me as is.

I prefer descriptive test names, such as mariabackup.innodb_redo_log_large.

After I slightly revised the test, the size of $targetdir/ib_logfile0 would be only 791 KiB, which is less than the minimum innodb_log_file_size.

If I increase the parameter to CALL fill_t1(30000), then I can repeat the claimed problem when reverting the code fix:

[00] 2026-04-28 14:20:55 Waiting for log copy thread to read lsn 32345135
[00] 2026-04-28 14:20:56 Retrying read of log at LSN=28083657
[00] 2026-04-28 14:20:57 Retrying read of log at LSN=28083657
[00] 2026-04-28 14:20:58 Retrying read of log at LSN=28083657
[00] 2026-04-28 14:20:59 Retrying read of log at LSN=28083657
[00] 2026-04-28 14:21:00 Was only able to copy log from 26266443 to 28083657, not 32345135; try increasing innodb_log_file_size
mariabackup: Stopping log copying thread.[00] 2026-04-28 14:21:00 Retrying read of log at LSN=28083657

2026-04-28 14:21:00 0 [Note] InnoDB: Starting shutdown...

If I apply the code fix, the test with my modificaition will take a longer time to run, and it will still fail for me:

[00] 2026-04-28 14:22:16 Was only able to copy log from 31950677 to 33808567, not 40183057; try increasing innodb_log_file_size
mariabackup: Stopping log copying thread.[00] 2026-04-28 14:22:16 Retrying read of log at LSN=33808567

We did reach a larger LSN, but the size of the $targetdir/ib_logfile0 is only 1.8MiB.

Same problem with CALL fill_t1(15000): 1.6MiB log copied. With CALL fill_t1(10000) and the code patch, the test would finish the backup. But, it would do that also when I revert the code change.

I don’t claim that the code change is wrong. Due to the nature of this test, results can wildly vary between environments.

Comment on lines +16 to +17
DELIMITER //
CREATE PROCEDURE fill_t1(IN n INT)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The syntax is invalid. I got this test to work after I applied the following changes:

diff --git a/mysql-test/suite/mariabackup/MDEV-39468.test b/mysql-test/suite/mariabackup/MDEV-39468.test
index 910771ee08d..21b635704ed 100644
--- a/mysql-test/suite/mariabackup/MDEV-39468.test
+++ b/mysql-test/suite/mariabackup/MDEV-39468.test
@@ -13,7 +13,7 @@ SET GLOBAL innodb_log_file_size= 4194304;
 CREATE TABLE t1 (id INT AUTO_INCREMENT PRIMARY KEY, v LONGBLOB)
   ENGINE=InnoDB;
 
-DELIMITER //
+DELIMITER //;
 CREATE PROCEDURE fill_t1(IN n INT)
 BEGIN
   DECLARE i INT DEFAULT 0;
@@ -22,7 +22,7 @@ BEGIN
     SET i = i + 1;
   END WHILE;
 END//
-DELIMITER ;
+DELIMITER ;//
 
 # Open a second connection that writes continuously while mariabackup runs.
 connect(writer, localhost, root,,);
@@ -36,16 +36,15 @@ let $targetdir= $MYSQLTEST_VARDIR/tmp/mdev39468;
 --mkdir $targetdir
 
 --echo # Taking backup ...
---exec $MARIADB_BACKUP_EXE --defaults-file=$MYSQLTEST_VARDIR/my.cnf \
-  --backup --target-dir=$targetdir 2>&1 | \
-  grep -E "^(mariabackup: completed OK|FATAL ERROR)" | head -1
-
+--disable_result_log
+exec $XTRABACKUP --defaults-file=$MYSQLTEST_VARDIR/my.cnf --backup --target-dir=$targetdir;
+--enable_result_log
 --file_exists $targetdir/xtrabackup_checkpoints
 
 --echo # Preparing backup ...
---exec $MARIADB_BACKUP_EXE --defaults-file=$MYSQLTEST_VARDIR/my.cnf \
-  --prepare --target-dir=$targetdir 2>&1 | \
-  grep -E "^(mariabackup: completed OK|FATAL ERROR)" | head -1
+--disable_result_log
+exec $XTRABACKUP --defaults-file=$MYSQLTEST_VARDIR/my.cnf --prepare --target-dir=$targetdir;
+--enable_result_log
 
 # Collect the background writer.
 connection writer;
@@ -53,8 +52,6 @@ connection writer;
 disconnect writer;
 connection default;
 
-# Cleanup.
---exec rm -rf $targetdir
+--source include/restart_and_restore.inc
 DROP PROCEDURE fill_t1;
 DROP TABLE t1;
-SET GLOBAL innodb_log_file_size= DEFAULT;

Unfortunately, the test fails to fail for me when the code change is reverted.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the detailed review! I've pushed an updated test addressing all your points:

DELIMITER syntax: Fixed to DELIMITER //; / DELIMITER ;// per MTR convention
Variable: Switched from $MARIADB_BACKUP_EXE to $XTRABACKUP
Invocation: Replaced the shell pipe with --disable_result_log + exec $XTRABACKUP
Write load: Bumped from fill_t1(3000) to fill_t1(30000) to generate enough LSN pressure — this is the key change that makes the race window reliably large
Sleep: Bumped to --sleep 1 so the writer gets going before the backup starts
Cleanup: Now uses --source include/restart_and_restore.inc instead of manual rm -rf + SET GLOBAL

Tested with --repeat=10: passes cleanly with the fix, fails consistently without it at fill_t1(30000).
Regarding renaming to innodb_redo_log_large.test — happy to do that if you prefer a descriptive name over the MDEV number.

@itzanway
Copy link
Copy Markdown
Contributor Author

Pushed a third revision addressing the remaining test reliability issue.
Root cause of dr-m's failure with fill_t1(30000): 4M log + 30000 rows generates ~60MB of redo, causing the log to physically overwrap ~15 times faster than mariabackup can read it. That's a separate failure from the race we're fixing — even a correct backup can't succeed when the log is being overwritten that aggressively.
New parameters: innodb_log_file_size=10M + fill_t1(5000)

Generates ~10MB of redo — enough LSN pressure that the copy thread exits early without the fix (exposing the race), but small enough that mariabackup can actually read the log with the fix applied
innodb_log_file_size=10M satisfies the minimum size constraint (dr-m noted the backed-up log was only 1.8MiB vs the minimum, which caused the separate failure)

The race itself: without the fix, metadata_last_lsn = lsn is set before backup_wait_for_lsn_low(), so the copy thread exits as soon as recv_sys.lsn >= metadata_last_lsn — which happens immediately under active writes. With the fix the thread keeps running until stop_backup_threads().
Happy to rename to innodb_redo_log_large.test if you prefer that over the MDEV number.

@dr-m
Copy link
Copy Markdown
Contributor

dr-m commented Apr 29, 2026

Thank you for the explanation. I’m not sure if it makes sense to include a regression test for this at all. On my system, the current revision of the test would fail as follows, whether or not the code fix is present:

CURRENT_TEST: mariabackup.MDEV-39468
mysqltest: At line 42: command "file_exists" failed with error: 1  my_errno: 0  errno: 2

In the Buildbot grid view the test could fail across the board. Some failures like this one are actually "noise", and you should have adjusted the .result file for the changed .test file. The same type of failure should have occurred when running the test in any environment:

--- /home/buildbot/aarch64-ubuntu-2204/build/mysql-test/suite/mariabackup/MDEV-39468.result	2026-04-28 12:14:22.000000000 +0000
+++ /home/buildbot/aarch64-ubuntu-2204/build/mysql-test/suite/mariabackup/MDEV-39468.reject	2026-04-28 23:28:04.399774075 +0000
@@ -1,5 +1,28 @@
 #
 # MDEV-39468: mariabackup log copy race under write load
 #
+SET GLOBAL innodb_log_file_size= 10485760;
+CREATE TABLE t1 (id INT AUTO_INCREMENT PRIMARY KEY, v LONGBLOB)
+ENGINE=InnoDB;
+CREATE PROCEDURE fill_t1(IN n INT)
+BEGIN
+DECLARE i INT DEFAULT 0;
+WHILE i < n DO
+INSERT INTO t1(v) VALUES (REPEAT('x', 2048));
+SET i = i + 1;
+END WHILE;
+END//
+connect writer, localhost, root,,;
+CALL fill_t1(5000);
+connection default;
 # Taking backup ...
 # Preparing backup ...
+connection writer;
+disconnect writer;
+connection default;
+# shutdown server
+# remove datadir
+# xtrabackup move back
+# restart
+DROP PROCEDURE fill_t1;
+DROP TABLE t1;
Result length mismatch

For the next iteration, please rename the test as I had requested.

@gkodinov gkodinov added the External Contribution All PRs from entities outside of MariaDB Foundation, Corporation, Codership agreements. label Apr 29, 2026
Copy link
Copy Markdown
Member

@gkodinov gkodinov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your contribution. This is a preliminary review. Please keep working with Marko on his review.

Comment thread mysql-test/suite/mariabackup/MDEV-39468.test Outdated
@@ -0,0 +1,5 @@
#
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your test is failing in buildbot. Can you please have a look and fix it?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It still fails massively with the last revision. Please run the test in your development environment before updating the branch. Also, please rename the test as I had requested earlier.

@itzanway itzanway force-pushed the MDEV-39468-fix-log-copy-race branch 6 times, most recently from 6759c0d to 7d5350b Compare April 29, 2026 12:19
Copy link
Copy Markdown
Member

@grooverdan grooverdan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't update the columnstore or libmariadb submodules in this PR.

Hope the test case change helps narrow down the cause/behaviour more predictably.

@@ -0,0 +1,55 @@
#
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

$ cat -A .//mysql-test/suite/mariabackup/innodb_redo_log_large.test | more
M-oM-;M-?#$
# MDEV-39468: mariabackup log copy race under write load$
#$
$
--source include/have_innodb.inc$

The reason this test sase isn't starting

DELIMITER //;
CREATE PROCEDURE fill_t1(IN n INT)
BEGIN
DECLARE i INT DEFAULT 0;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternate way to run a test case for n seconds but are able to kill this if the test failed:

diff --git a/mysql-test/suite/mariabackup/innodb_redo_log_large.test b/mysql-test/suite/mariabackup/innodb_redo_log_large.test
index 440be68c18a..68ca6d83469 100644
--- a/mysql-test/suite/mariabackup/innodb_redo_log_large.test
+++ b/mysql-test/suite/mariabackup/innodb_redo_log_large.test
@@ -1,28 +1,29 @@
-#
-# MDEV-39468: mariabackup log copy race under write load
-#
-
 --source include/have_innodb.inc
 
-SET GLOBAL innodb_log_file_size= 10485760;
+--echo #
+--echo # MDEV-39468: mariabackup log copy race under write load
+--echo #
+
+SET GLOBAL innodb_log_file_size=10485760;
 
-CREATE TABLE t1 (id INT AUTO_INCREMENT PRIMARY KEY, v LONGBLOB)
-  ENGINE=InnoDB;
+CREATE TABLE t1 (id INT AUTO_INCREMENT PRIMARY KEY, v LONGBLOB) ENGINE=InnoDB;
+CREATE TABLE t2 AS SELECT REPEAT('x', 2048) AS d;
 
 DELIMITER //;
 CREATE PROCEDURE fill_t1(IN n INT)
 BEGIN
-  DECLARE i INT DEFAULT 0;
-  WHILE i < n DO
-    INSERT INTO t1(v) VALUES (REPEAT('x', 2048));
-    SET i = i + 1;
+  DECLARE start_time DATETIME;
+  SET start_time = NOW();
+  SELECT 1;
+  WHILE ROW_COUNT() AND TIMESTAMPDIFF(SECOND, start_time, NOW()) < n DO
+    INSERT INTO t1(v) SELECT d FROM t2;
   END WHILE;
 END//
 DELIMITER ;//
 
 # Open a second connection that writes continuously while mariabackup runs.
 connect(writer, localhost, root,,);
-send CALL fill_t1(5000);
+send CALL fill_t1(10);
 
 connection default;
 --sleep 1
@@ -35,6 +36,7 @@ let $targetdir= $MYSQLTEST_VARDIR/tmp/mdev39468;
 exec $XTRABACKUP --defaults-file=$MYSQLTEST_VARDIR/my.cnf --backup --target-dir=$targetdir;
 --enable_result_log
 
+TRUNCATE TABLE t2;
 --file_exists $targetdir/xtrabackup_checkpoints
 
 --echo # Preparing backup ...
@@ -50,6 +52,6 @@ connection default;
 
 --source include/restart_and_restore.inc
 DROP PROCEDURE fill_t1;
-DROP TABLE t1;
+DROP TABLE t1, t2;
 
 --echo End of 10.11 tests.

TRUNCATE TABLE results in the INSERT .. SELECT having a ROW_COUNT() of 0/false.

For me the test is still failing on:

[00] 2026-04-30 10:28:40 Finished BACKUP STAGE FLUSH
[00] 2026-04-30 10:28:40 Started BACKUP STAGE BLOCK_DDL
[00] 2026-04-30 10:28:40 Acquired locks for BACKUP STAGE BLOCK_DDL
[00] 2026-04-30 10:28:40 Getting InnoDB LSN
[00] 2026-04-30 10:28:40 InnoDB LSN: 20801078, Flushing Logs
[00] 2026-04-30 10:28:40 Flushed Logs
...
[01] 2026-04-30 10:28:40         ...done
[00] 2026-04-30 10:28:40 Finished backing up non-InnoDB tables and files
[00] 2026-04-30 10:28:40 Waiting for log copy thread to read lsn 20801078
[00] 2026-04-30 10:28:41 Retrying read of log at LSN=19322928
[00] 2026-04-30 10:28:42 Retrying read of log at LSN=19322928
[00] 2026-04-30 10:28:43 Retrying read of log at LSN=19322928
[00] 2026-04-30 10:28:44 Retrying read of log at LSN=19322928
[00] 2026-04-30 10:28:45 Was only able to copy log from 13845582 to 19322928, not 20801078; try increasing innodb_log_file_size
mariabackup: Stopping log copying thread.[00] 2026-04-30 10:28:45 Retrying read of log at LSN=19322928

2026-04-30 10:28:45 0 [Note] InnoDB: Starting shutdown...

@itzanway itzanway force-pushed the MDEV-39468-fix-log-copy-race branch from 7d5350b to d81b932 Compare May 4, 2026 09:58
Comment thread libmariadb
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please do not change any submodules.

@itzanway itzanway force-pushed the MDEV-39468-fix-log-copy-race branch 2 times, most recently from 4fd99d8 to ed610b4 Compare May 4, 2026 11:13
@itzanway itzanway requested a review from gkodinov May 4, 2026 11:37
@itzanway itzanway force-pushed the MDEV-39468-fix-log-copy-race branch from ed610b4 to a4a802d Compare May 4, 2026 11:46
Copy link
Copy Markdown
Member

@gkodinov gkodinov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Please keep working with Marko on the final review.

@gkodinov gkodinov requested a review from dr-m May 4, 2026 11:56
backup_wait_for_commit_lsn() set metadata_last_lsn = lsn before
calling backup_wait_for_lsn_low(lsn). log_copying_thread() exits
its copy loop as soon as metadata_last_lsn is non-zero and <=
recv_sys.lsn; setting it early caused the thread to stop copying
before reaching the target LSN, so backup_wait_timeout() fired:

  Was only able to copy log from X to Y, not Z;
  try increasing innodb_log_file_size

Fix: remove the premature metadata_last_lsn assignment before the
wait. Set it only after backup_wait_for_lsn_low() confirms the
target LSN has been reached, then call stop_backup_threads().

Added MTR test MDEV-39468 which runs mariabackup --backup and
--prepare concurrently with a write workload using a small
innodb_log_file_size to reliably expose the race.
@itzanway itzanway force-pushed the MDEV-39468-fix-log-copy-race branch from a4a802d to 057aadd Compare May 9, 2026 07:55
@itzanway
Copy link
Copy Markdown
Contributor Author

itzanway commented May 9, 2026

BOM stripped, result file regenerated, submodules reverted, time-bounded loop adopted (10 seconds, environment-agnostic). Ready for re-review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

External Contribution All PRs from entities outside of MariaDB Foundation, Corporation, Codership agreements.

Development

Successfully merging this pull request may close these issues.

4 participants