MDEV-28213 Skip ignored domain IDs during GTID validation#4677
MDEV-28213 Skip ignored domain IDs during GTID validation#4677bodyhedia44 wants to merge 1 commit intoMariaDB:mainfrom
Conversation
988af46 to
155d0fe
Compare
gkodinov
left a comment
There was a problem hiding this comment.
Thank you for your contribution. This is a preliminary review.
gkodinov
left a comment
There was a problem hiding this comment.
Please do not do multiple commits. Please stick to a single commit and amend it.
8f36cde to
2f4d520
Compare
|
done |
| Send the slave's IGNORE_DOMAIN_IDS and DO_DOMAIN_IDS to the master, | ||
| so it can skip GTID state validation for domains the slave doesn't | ||
| care about. See MDEV-28213. |
There was a problem hiding this comment.
Why on the master side?
Let’s see – if the ignored domains are not provided in the @@gtid_slave_pos, then the master will think the slave wants to replicate those domains from the beginning, even though the domain will end up ignored, regardless of where the master starts.
So this problem is really overlapping with (but not necessarily entirely part of) MDEV-9345 filtering on master, #4086.
There was a problem hiding this comment.
this PR does not implement master-side filtering. The actual filtering of binlog events (i.e., skipping events for ignored domains) still happens entirely on the slave side, as it always has.
What this PR does is send the slave's IGNORE_DOMAIN_IDS / DO_DOMAIN_IDS lists to the master as user variables during the connection handshake. The master uses them only in check_slave_start_position() to skip GTID state validation for domains the slave has declared it doesn't care about. Without this, if the slave ignores domain X but has no GTID position for it, the master rejects the connection with ER_GTID_POSITION_NOT_FOUND — even though the slave would have happily discarded those events anyway.
This approach was actually suggested by "Kristian Nielsen" in the context of MDEV-9345 / MDEV-20438:
"it might be reasonable if slave would send its list of IGNORE_DOMAIN_ID to the master on connect, and then to ignore any missing domain if it is in the ignore list."
gkodinov
left a comment
There was a problem hiding this comment.
LGTM.
Thanks for working on this.
Some optional improvements below.
Please stand by for the final review.
| DYNAMIC_ARRAY *ids; | ||
| const char *var_name; | ||
| } domain_id_vars[]= { | ||
| { ignore_ids, "slave_connect_state_domain_ids_ignore" }, |
There was a problem hiding this comment.
one little thing: prefix these with "mariadb_" please. We should avoid overriding any user variables.
There was a problem hiding this comment.
There's already the convention where the slave sets @slave_connect_state . I think this can be an exception to the "mariadb_" prefix to adhere to the existing convention (though in other cases I agree, we should add the prefix).
| about these variables simply ignore them (backwards compatible). | ||
| */ | ||
| { | ||
| DYNAMIC_ARRAY *ignore_ids= |
There was a problem hiding this comment.
optional: I'd fold these into the next initialization. I believe there was something about the compiler trying to optimize initializations by reordering them. Apparently not happening here, but better safe than sorry.
| &mi->domain_id_filter.m_domain_ids[Domain_id_filter::DO_DOMAIN_IDS]; | ||
|
|
||
| struct { | ||
| DYNAMIC_ARRAY *ids; |
There was a problem hiding this comment.
optional: I'd move the variable name first: easier on the eyes.
| true if success. | ||
| */ | ||
| static bool | ||
| get_slave_ignore_domain_ids(THD *thd, String *out_str) |
There was a problem hiding this comment.
I'd combine these two into get_user_var_string(). To complement the already existing get_user_var_int().
bf08125 to
da1a69e
Compare
|
Done improvements waiting the final review |
bnestere
left a comment
There was a problem hiding this comment.
Thank you for the contribution, @bodyhedia44 !
Generally I approve, though this can't go into 10.6. I think main makes the most sense for the fix, as it changes behavior and complements another ticket going into main for the 13.0 release: MDEV-9345. This would be put into our 13.0 preview release, with a code cutoff deadline of March 10. Then, we have a six week period of QA testing to ensure this works as expected (and without bugs). During this six week period, if any bugs are found, they would need to be fixed before the RC release (currently planned for May 7). I don't know if you can change the target branch once a PR is created, you might have to close this and create a new PR with the correct target (though @gkodinov might be able to correct me if I am wrong).
Once rebased on main, I'll do another round of review.
| */ | ||
| static bool | ||
| is_domain_id_ignored(const DYNAMIC_ARRAY *ignore_ids, | ||
| const DYNAMIC_ARRAY *do_ids, ulong domain_id) |
There was a problem hiding this comment.
domain_id should be a uint32_t:
const DYNAMIC_ARRAY *do_ids, uint32 domain_id)
| } | ||
|
|
||
|
|
||
| static int ulong_cmp(const void *a, const void *b) |
There was a problem hiding this comment.
Because domain_id is uint32_t, this should also be a uint32_cmp.
There was a problem hiding this comment.
Aktually, I recall those Domain ID List fields have ulong elements (why). Of course, they don’t have to be on the master side.
| DYNAMIC_ARRAY *ids; | ||
| const char *var_name; | ||
| } domain_id_vars[]= { | ||
| { ignore_ids, "slave_connect_state_domain_ids_ignore" }, |
There was a problem hiding this comment.
There's already the convention where the slave sets @slave_connect_state . I think this can be an exception to the "mariadb_" prefix to adhere to the existing convention (though in other cases I agree, we should add the prefix).
The author can change the base of their PR. |
da1a69e to
9a2177b
Compare
|
Done |
bnestere
left a comment
There was a problem hiding this comment.
Hi @bodyhedia44 !
Thanks for rebasing and updating the patch! While accounting for newer features after your rebase, I'd like to think on some of the details. I've left a few points for you to consider in the mean time.
| my_init_dynamic_array(PSI_INSTRUMENT_ME, &slave_ignore_domain_ids, | ||
| sizeof(uint32), 4, 4, MYF(0)); | ||
| my_init_dynamic_array(PSI_INSTRUMENT_ME, &slave_do_domain_ids, | ||
| sizeof(uint32), 4, 4, MYF(0)); |
There was a problem hiding this comment.
IGNORE/DO options are not common, so initializing the dynamic array always will more often than not be inefficient. They can probably be lazy initialized when the option is actually found.
| --connection server_1 | ||
| SET @@session.gtid_domain_id= 1; | ||
| INSERT INTO t1 VALUES (4); | ||
| --save_master_pos |
There was a problem hiding this comment.
Something that should also be tested here is what happens when new transactions arrive from gtid_domain_id=2. They should be ignored, but we want to make sure it doesn't cause replication to break. Previously, the replication connection would never get this far b/c it would error at connection time.
| See MDEV-28213. | ||
| */ | ||
| DYNAMIC_ARRAY slave_ignore_domain_ids; | ||
| DYNAMIC_ARRAY slave_do_domain_ids; |
There was a problem hiding this comment.
Would it make sense to re-use Domain_id_filter here instead of creating a new set of DYNAMIC_ARRAYs and re-writing the Domain_id_filter::do_filter()?
IIRC, the current MDEV-9345 draft patch for master-side replication filtering only considers db and table-level options. Having the Domain_id_filter` around would provide the support to add domain-level filtering on the master-side later.
| which is a generic rpl_binlog_state_base method used in other contexts. | ||
| Since domain filtering with purged binlogs is a rare configuration and | ||
| this bypass only runs once per slave connection (not per event), skipping | ||
| the index in this case is the right tradeoff. |
There was a problem hiding this comment.
I'm not convinced it is the right trade-off. It would be good to keep behavior consistent.
If changing your code to use the Domain_id_filter, perhaps these functions could be extended with a Domain_id_filter parameter with a default of NULL? The "other contexts" (e.g. innodb binlog use case) don't care about filtering and can just keep the NULL default. That seems conceptually clean to me.
IMO it makes sense conceptually for is_before_pos() to support filtering by domain-id. Not related to your patch, but generally I think the is_before_pos() likely shouldn't belong to rpl_binlog_state_base, as it sets a dependence on slave_connection_state (and other server internals, in-turn) which isn't required for a binary log state. Though fixing that likely isn't something to be addressed by your PR. Let me think about this, and ping @knielsen to see if he has any thoughts.
9a2177b to
6d5af66
Compare
gkodinov
left a comment
There was a problem hiding this comment.
Please fix the embedded build failures.
6d5af66 to
2d6949e
Compare
|
Done |
gkodinov
left a comment
There was a problem hiding this comment.
LGTM. Please keep working on the final review with Brandon.
bnestere
left a comment
There was a problem hiding this comment.
Thanks for the updates, @bodyhedia44 !
This is quite close. I've just left a few last notes.
| actually sends IGNORE_DOMAIN_IDS or DO_DOMAIN_IDS. Older slaves won't | ||
| send these, so the filter stays NULL and all domains are validated. | ||
| */ | ||
| { |
There was a problem hiding this comment.
This is a bit over-complicated (also on the sending-side). The slave can only use one of these options at a time, both ignore_domain_ids and do_domain_ids can't be set at the same time. So it is a bit inefficient to loop through both options and do the same logic twice.
It'd be better to initially figure out which is configured, and do the underlying logic for that option once.
| size_t var_name_len; | ||
| int list_type; /* Domain_id_filter::DO_DOMAIN_IDS or IGNORE_DOMAIN_IDS */ | ||
| } domain_vars[]= { | ||
| { STRING_WITH_LEN("slave_connect_state_domain_ids_ignore"), |
There was a problem hiding this comment.
It'd be good to make these LEX_CSTRING constants that both the slave and master side would use everywhere (error messages too)
2d6949e to
8487e06
Compare
|
Done |
bnestere
left a comment
There was a problem hiding this comment.
Thanks for such a quick turn-around, @bodyhedia44 ! I've left another round of review points. I probably should have caught them in my review yesterday, but better late than never, I suppose.
| @@ -1578,7 +1714,8 @@ gtid_check_binlog_file(slave_connection_state *state, | |||
|
|
|||
| if (likely(reader && !reader->open_index_file(buf))) | |||
| { | |||
| int lookup= reader->search_gtid_pos(state, out_start_seek, found_count); | |||
| int lookup= reader->search_gtid_pos(state, out_start_seek, found_count, | |||
| filter); | |||
There was a problem hiding this comment.
Bad whitespace (though if you move the filter into slave_connection_state, this won't be relevant)
| @@ -2709,7 +2906,8 @@ static int init_binlog_sender(binlog_send_info *info, | |||
| search_file_name, | |||
| info->until_gtid_state, | |||
| &info->until_binlog_state, | |||
| &found_in_index, &start_seek))) | |||
| &found_in_index, &start_seek, | |||
| info->slave_domain_filter))) | |||
There was a problem hiding this comment.
I think we also need to check ignored domains when using the new in-engine binlog (see just above's gtid_find_engine_pos()). Then we should also add a test to make sure the behavior works for in-engine binlog. You should be able to just include your existing test, e.g. binlog_in_engine.rpl_mysqlbinlog_slave_consistency_basic.test sets a pattern to do this using a common include file.
|
|
||
| if (var_name) | ||
| { | ||
| info->slave_domain_filter= new Domain_id_filter(); |
There was a problem hiding this comment.
We try to manage/track our own memory within the server, rather than just having new do it automatically (there are many exceptions that were missed, but we try 😅).
We have a concept called a MEM_ROOT which is a scoped memory pool, usually tied to some object. For objects who's scope is tied to that of the thread, like in this case, we can use the thread's MEM_ROOT like the following (also with an addition to account for out-of-resource):
| info->slave_domain_filter= new Domain_id_filter(); | |
| info->slave_domain_filter= new (thd->mem_root) Domain_id_filter(); | |
| if (!info->slave_domain_filter) | |
| { | |
| //... error... |
| @@ -480,6 +481,7 @@ class Gtid_index_reader : public Gtid_index_base | |||
| Index_node_base *n; | |||
| int (Gtid_index_reader::* search_cmp_function)(uint32, rpl_binlog_state_base *); | |||
| slave_connection_state *in_search_gtid_pos; | |||
| Domain_id_filter *in_search_domain_filter; | |||
There was a problem hiding this comment.
It is awkward to have this be a part of the Gtid_index_reader class. Where I like the slave_domain_filter in the binlog_send_info, it may lead to a more concise implementation of your patch if the filter was instead defined in the slave_connection_state, which is directly passed everywhere you'd need it (I think). Sorry I didn't think of that earlier.
| if (expect_number) | ||
| { | ||
| char *endptr; | ||
| ulong domain_id= strtoul(p, &endptr, 10); |
There was a problem hiding this comment.
ulong should always be avoided, as it is not consistent between architectures. A few caveats for us to consider
domain_idcan only ever be 32-bit, so any architecture inconsistencies wouldn't really be relevant for us here anyway..- We have our own internal functions for numeric parsing. There's an example already for parsing
domain_idwhen one manually specifies a GTID position on a slave forgtid_slave_pos, and it uses the functiongtid_parser_helper(). You should be able to do the same thing it does for parsing the domain_id, and note it also has additional validation checks that it would be good to include here as well.
There was a problem hiding this comment.
Domain IDs are ulongs in slave-side replication filters.
But they don’t have to be on the master side.
8487e06 to
3453bf5
Compare
|
Done |
When a slave is configured with IGNORE_DOMAIN_IDS or DO_DOMAIN_IDS, the master's binlog dump thread should skip GTID state validation for those filtered domains. This avoids false ER_GTID_POSITION_NOT_FOUND errors when the slave does not have (or need) the current GTID state for domains it is filtering. The slave now sends its IGNORE/DO domain ID lists to the master via user variables @slave_connect_state_domain_ids_ignore and @slave_connect_state_domain_ids_do, which the master reads in mysql_binlog_send() and passes to check_slave_start_position(). Changes: - sql/sql_repl.cc: load_ignore_domain_ids() returns bool, fix parser to avoid redundant whitespace skip and scope local variables tightly. Add ulong_cmp() comparator. Replace O(n) linear scans in is_domain_id_ignored() with bsearch() after sorting the arrays. - sql/slave.cc: Add build_domain_ids_query() helper to construct SET queries for domain ID user variables. Refactor duplicate code into a loop using a struct array. - mysql-test/suite/rpl/t/rpl_gtid_ignored_domain_ids_validation.test: New test validating end-to-end GTID replication with domain filtering. Reviewed-by: Georgi Kodinov <joro@mariadb.org> Reviewed-by: Brandon Nesterenko <brandon.nesterenko@mariadb.com>
3453bf5 to
8c3d53c
Compare
When a slave connects to a master using MASTER_USE_GTID=Slave_Pos and the
master has purged old binlogs, the master validates the slave's GTID state
against the oldest available binlog's Gtid_list event. If the Gtid_list
references domains that the slave is configured to ignore (via
CHANGE MASTER IGNORE_DOMAIN_IDS or DO_DOMAIN_IDS), validation incorrectly
fails with error 1236:
"Could not find GTID state requested by slave in any binlog files.
Probably the slave state is too old and required binlog files have
been purged."
This is a false rejection -- the slave does not need events from those domains.
Fix: the slave now sends its IGNORE_DOMAIN_IDS and DO_DOMAIN_IDS to the master
as user variables (@slave_connect_state_domain_ids_ignore and
@slave_connect_state_domain_ids_do) before COM_BINLOG_DUMP. The master reads
these and skips validation for ignored domains in three code paths:
searching for the right binlog file
does not care about
This is backwards compatible: older masters store the unknown user variables
harmlessly, and older slaves simply do not send them.
Includes MTR test rpl.rpl_gtid_ignored_domain_ids_validation covering both
IGNORE_DOMAIN_IDS and DO_DOMAIN_IDS scenarios with purged binlogs.