Skip to content

Commit a394fc0

Browse files
committed
MDEV-29981: Replica stops with "Found invalid event in binary log"
Replication can stop in error if a Heartbeat log event is sent to a replica during rotation. There are two bugs at play: 1. Prior to MDEV-30128 (added in 11.0), there is a bug when checking legacy events. When the replica rotates its relay logs, it initializes its Format_description_log_event with binlog version 3 (this is hard-coded). So immediately after rotation (and until a new Format_descriptor with binlog_format 4 is sent from the master), the IO thread is expecting binlog_format 3 (i.e. it will call queue_old_event() for incoming events). This invalidates any events that are sent with an event type higher than 14. In theory, we wouldn't expect any events to be sent in-between a rotate and the next format descriptor log event, but if a long enough period of time passes between then, the primary will generate and send a Heartbeat event (of type 27). In such case, the slave will see the heartbeat event of type 27, see it is higher than 14, and result in an error mentioning 'Found invalid event in binary log', with the expected log coordinates of the new log (which is optimistically populated from the Rotate log event, not the new event). 2. In all versions of MariaDB (11.0+), there is a bug when checking the state of a Heartbeat log event, in that it doesn't consider a rotated binary log. The check is meant to ensure that the heartbeat provided by the master (i.e. the state of the master) is greater than or equal to the state of the slave. In other words, it checks that the slave isn't ahead of the master. However, if the filename provided by the master heartbeat event is different than the filename saved for the slave's state, the check always fails. This is broken, because when the master rotates its logs, the new binary log file will have a different filename (i.e. an incremented index counter suffix). For example, if the master rotates its binary logs from master-bin.000002 to master-bin.000003, master-bin.000003 is ahead of master-bin.000002, but the slave will see a difference between the filenames and fail the check. To fix the first problem, this patch disallows passing a heartbeat event into queue_old_event (which is the source of the error, as it tries to parse a heartbeat log event). This function (queue_old_event) was removed with MDEV-30128, so bypassing it for heartbeat events is not consequential (and it is already also done for Format_description_events, which are not supported in old binlog file versions). Note that backporting all of MDEV-30128 was also considered, but this is less risky for GA. To fix the second problem, we simply ignore heartbeat events on the slave if the filenames don't match. This is because during rotation, it can appear that the slave is ahead of the master, which breaks the validity of the check (i.e. the check is to ensure the master is ahead of the slave). Additionally note that this patch restores a heartbeat check that was incorrectly removed in 780db8e Reviewed-by: Andrei Elkin <andrei.elkin@mariadb.com> Signed-off-by: Brandon Nesterenko <brandon.nesterenko@mariadb.com>
1 parent ba9e8eb commit a394fc0

File tree

4 files changed

+142
-8
lines changed

4 files changed

+142
-8
lines changed
Lines changed: 38 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,38 @@
1+
# Binlog format independent
2+
include/master-slave.inc
3+
[connection master]
4+
#
5+
# Initialize test data
6+
connection slave;
7+
include/stop_slave.inc
8+
change master to master_heartbeat_period=2;
9+
include/start_slave.inc
10+
# Rotating master logs and pausing after Rotate_event is written..
11+
connection master;
12+
set @old_dbug= @@global.debug_dbug;
13+
set @@global.debug_dbug= "+d,stop_after_rotate_written";
14+
flush binary logs;
15+
# Waiting for Rotate event to write/sync to disk
16+
connection server_1;
17+
set debug_sync="now wait_for rotate_written";
18+
# Sleep 3s so the master sends a heartbeat event to the slave
19+
connection server_1;
20+
set debug_sync="now signal finish_rotate";
21+
connection master;
22+
set @@global.debug_dbug= @old_dbug;
23+
#
24+
# Ensure replication continues without error..
25+
connection master;
26+
create table t1 (a int);
27+
insert into t1 values (1);
28+
connection slave;
29+
include/diff_tables.inc [master:test.t1,slave:test.t1]
30+
#
31+
# Cleanup
32+
connection master;
33+
drop table t1;
34+
include/save_master_gtid.inc
35+
connection slave;
36+
include/sync_with_master_gtid.inc
37+
include/rpl_end.inc
38+
# End of rpl_heartbeat_at_rotate.test
Lines changed: 75 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,75 @@
1+
#
2+
# Test ensures that a heartbeat event can be replicated from master to slave
3+
# during a slow rotation. MDEV-29981 reported two bugs where the slave could
4+
# stop in error during a slow rotation because of incorrect checks.
5+
#
6+
# To simulate a slow rotation, this test pauses (via debug_sync) after a rotate
7+
# event has been written and synced to disk, but before the new binary log has
8+
# been created. After the rotate event has been received and processed by the
9+
# slave, the test then waits for the next heartbeat event to be sent to the
10+
# slave. Once the heartbeat has been sent, the rotation is completed, and then
11+
# the test ensures replication is ok by creating transactions and ensuring they
12+
# are able to be replicated by the slave. Note that the heartbeat period is set
13+
# to 2 seconds.
14+
#
15+
#
16+
# References:
17+
# MDEV-29881: Replica stops with "Found invalid event in binary log"
18+
#
19+
--source include/have_debug.inc
20+
--source include/have_debug_sync.inc
21+
--source include/have_innodb.inc
22+
23+
--echo # Binlog format independent
24+
--source include/have_binlog_format_row.inc
25+
--source include/master-slave.inc
26+
27+
28+
--echo #
29+
--echo # Initialize test data
30+
--connection slave
31+
--source include/stop_slave.inc
32+
change master to master_heartbeat_period=2;
33+
--source include/start_slave.inc
34+
35+
--echo # Rotating master logs and pausing after Rotate_event is written..
36+
--connection master
37+
set @old_dbug= @@global.debug_dbug;
38+
set @@global.debug_dbug= "+d,stop_after_rotate_written";
39+
--send flush binary logs
40+
41+
--echo # Waiting for Rotate event to write/sync to disk
42+
--connection server_1
43+
set debug_sync="now wait_for rotate_written";
44+
45+
--echo # Sleep 3s so the master sends a heartbeat event to the slave
46+
--sleep 3
47+
48+
--connection server_1
49+
set debug_sync="now signal finish_rotate";
50+
51+
--connection master
52+
--reap
53+
set @@global.debug_dbug= @old_dbug;
54+
55+
--echo #
56+
--echo # Ensure replication continues without error..
57+
--connection master
58+
create table t1 (a int);
59+
insert into t1 values (1);
60+
--sync_slave_with_master
61+
--let $diff_tables=master:test.t1,slave:test.t1
62+
--source include/diff_tables.inc
63+
64+
65+
--echo #
66+
--echo # Cleanup
67+
--connection master
68+
drop table t1;
69+
--source include/save_master_gtid.inc
70+
71+
--connection slave
72+
--source include/sync_with_master_gtid.inc
73+
74+
--source include/rpl_end.inc
75+
--echo # End of rpl_heartbeat_at_rotate.test

sql/log.cc

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -5653,6 +5653,12 @@ int MYSQL_BIN_LOG::new_file_impl()
56535653
}
56545654
update_binlog_end_pos();
56555655

5656+
DBUG_EXECUTE_IF("stop_after_rotate_written", {
5657+
DBUG_ASSERT(!debug_sync_set_action(
5658+
current_thd,
5659+
STRING_WITH_LEN("now SIGNAL rotate_written WAIT_FOR finish_rotate")));
5660+
});
5661+
56565662
old_name=name;
56575663
name=0; // Don't free name
56585664
close_flag= LOG_CLOSE_TO_BE_OPENED | LOG_CLOSE_INDEX;

sql/slave.cc

Lines changed: 23 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -6508,8 +6508,9 @@ static int queue_event(Master_info* mi, const uchar *buf, ulong event_len)
65086508
}
65096509
DBUG_ASSERT(((uchar) buf[FLAGS_OFFSET] & LOG_EVENT_ACCEPT_OWN_F) == 0);
65106510

6511-
if (mi->rli.relay_log.description_event_for_queue->binlog_version<4 &&
6512-
buf[EVENT_TYPE_OFFSET] != FORMAT_DESCRIPTION_EVENT /* a way to escape */)
6511+
if (mi->rli.relay_log.description_event_for_queue->binlog_version < 4 &&
6512+
buf[EVENT_TYPE_OFFSET] != FORMAT_DESCRIPTION_EVENT /* a way to escape */
6513+
&& buf[EVENT_TYPE_OFFSET] != HEARTBEAT_LOG_EVENT)
65136514
DBUG_RETURN(queue_old_event(mi,buf,event_len));
65146515

65156516
#ifdef ENABLED_DEBUG_SYNC
@@ -6806,17 +6807,31 @@ static int queue_event(Master_info* mi, const uchar *buf, ulong event_len)
68066807
68076808
Heartbeat is sent only after an event corresponding to the corrdinates
68086809
the heartbeat carries.
6809-
Slave can not have a higher coordinate except in the only
6810-
special case when mi->master_log_name, master_log_pos have never
6811-
been updated by Rotate event i.e when slave does not have any history
6812-
with the master (and thereafter mi->master_log_pos is NULL).
6810+
6811+
Slave can not have a higher coordinate except when rotating logs. That
6812+
is, either
6813+
1. when mi->master_log_name, master_log_pos have never been updated by
6814+
Rotate event i.e when slave does not have any history with the
6815+
master (and thereafter mi->master_log_pos is NULL)
6816+
2. if a heartbeat is sent during a slow rotation, the master can send
6817+
its Rotate event (thereby increasing the mi->master_log_name); yet
6818+
the sent heartbeat may still be for the old log file.
6819+
6820+
Therefore, state comparison is only valid when the log file names match,
6821+
otherwise the heartbeat is ignored.
68136822
68146823
Slave can have lower coordinates, if some event from master was omitted.
68156824
68166825
TODO: handling `when' for SHOW SLAVE STATUS' snds behind
6826+
6827+
TODO: Extend heartbeat events to use GTIDs instead of binlog
6828+
coordinates. This would alleviate the strange exceptions during log
6829+
rotation.
68176830
*/
6818-
if (memcmp(mi->master_log_name, hb.get_log_ident(), hb.get_ident_len()) ||
6819-
mi->master_log_pos > hb.log_pos) {
6831+
if (mi->master_log_pos &&
6832+
!memcmp(mi->master_log_name, hb.get_log_ident(), hb.get_ident_len()) &&
6833+
mi->master_log_pos > hb.log_pos)
6834+
{
68206835
/* missed events of heartbeat from the past */
68216836
error= ER_SLAVE_HEARTBEAT_FAILURE;
68226837
error_msg.append(STRING_WITH_LEN("heartbeat is not compatible with local info;"));

0 commit comments

Comments
 (0)