MDEV-28213 Skip ignored domain IDs during GTID validation #4677

Open
bodyhedia44 wants to merge 1 commit into MariaDB:main from bodyhedia44:MDEV-28213-fix-ignored-domain-validation

Conversation

@bodyhedia44
Contributor

When a slave connects to a master using MASTER_USE_GTID=Slave_Pos and the
master has purged old binlogs, the master validates the slave's GTID state
against the oldest available binlog's Gtid_list event. If the Gtid_list
references domains that the slave is configured to ignore (via
CHANGE MASTER IGNORE_DOMAIN_IDS or DO_DOMAIN_IDS), validation incorrectly
fails with error 1236:

"Could not find GTID state requested by slave in any binlog files.
Probably the slave state is too old and required binlog files have
been purged."

This is a false rejection -- the slave does not need events from those domains.

Fix: the slave now sends its IGNORE_DOMAIN_IDS and DO_DOMAIN_IDS to the master
as user variables (@slave_connect_state_domain_ids_ignore and
@slave_connect_state_domain_ids_do) before COM_BINLOG_DUMP. The master reads
these and skips validation for ignored domains in three code paths:

  • contains_all_slave_gtid(): skip domains not needed by the slave when
    searching for the right binlog file
  • check_slave_start_position(): skip validation for domains the slave
    does not care about
  • gtid_find_binlog_file(): pass the ignore/do lists through to the above

This is backwards compatible: older masters store the unknown user variables
harmlessly, and older slaves simply do not send them.

Includes MTR test rpl.rpl_gtid_ignored_domain_ids_validation covering both
IGNORE_DOMAIN_IDS and DO_DOMAIN_IDS scenarios with purged binlogs.
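
For illustration only, the sending side can be pictured as building one SET statement per configured list. This is a sketch, not the code from the patch: the helper name echoes build_domain_ids_query() from the commit message, but its signature, the use of std::string, and the comma-separated value format are all assumptions.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: builds the statement a replica could send before
// COM_BINLOG_DUMP. The comma-separated value format is an assumption.
std::string build_domain_ids_query(const std::string &var_name,
                                   const std::vector<uint32_t> &ids)
{
  assert(!var_name.empty());
  std::string query= "SET @" + var_name + "='";
  for (size_t i= 0; i < ids.size(); i++)
  {
    if (i)
      query+= ',';                       // separator between domain IDs
    query+= std::to_string(ids[i]);
  }
  query+= "'";
  return query;
}
```

An older master would simply store such a variable and never read it, which is the backwards-compatibility argument made above.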

@CLAassistant

CLAassistant commented Feb 21, 2026

CLA assistant check
All committers have signed the CLA.

@bodyhedia44 bodyhedia44 force-pushed the MDEV-28213-fix-ignored-domain-validation branch from 988af46 to 155d0fe Compare February 21, 2026 14:28
@gkodinov gkodinov added the External Contribution All PRs from entities outside of MariaDB Foundation, Corporation, Codership agreements. label Feb 23, 2026
Member

@gkodinov gkodinov left a comment

Thank you for your contribution. This is a preliminary review.

Comment thread sql/sql_repl.cc Outdated
@bodyhedia44 bodyhedia44 requested a review from gkodinov February 23, 2026 17:10
Member

@gkodinov gkodinov left a comment

Please do not do multiple commits. Please stick to a single commit and amend it.

Comment thread sql/slave.cc Outdated
Comment thread sql/sql_repl.cc Outdated
Comment thread sql/sql_repl.cc Outdated
Comment thread sql/sql_repl.cc Outdated
Comment thread sql/sql_repl.cc Outdated
Comment thread sql/sql_repl.cc Outdated
Comment thread sql/sql_repl.cc Outdated
@bodyhedia44 bodyhedia44 force-pushed the MDEV-28213-fix-ignored-domain-validation branch from 8f36cde to 2f4d520 Compare February 24, 2026 21:41
@bodyhedia44 bodyhedia44 requested a review from gkodinov February 24, 2026 21:42
@bodyhedia44
Contributor Author

done

Comment thread sql/slave.cc Outdated
Comment on lines +2597 to +2599
Send the slave's IGNORE_DOMAIN_IDS and DO_DOMAIN_IDS to the master,
so it can skip GTID state validation for domains the slave doesn't
care about. See MDEV-28213.
Contributor

Why on the master side?

Let’s see – if the ignored domains are not provided in the @@gtid_slave_pos, then the master will think the slave wants to replicate those domains from the beginning, even though the domain will end up ignored, regardless of where the master starts.

So this problem is really overlapping with (but not necessarily entirely part of) MDEV-9345 filtering on master, #4086.

Contributor Author

This PR does not implement master-side filtering. The actual filtering of binlog events (i.e., skipping events for ignored domains) still happens entirely on the slave side, as it always has.

What this PR does is send the slave's IGNORE_DOMAIN_IDS / DO_DOMAIN_IDS lists to the master as user variables during the connection handshake. The master uses them only in check_slave_start_position() to skip GTID state validation for domains the slave has declared it doesn't care about. Without this, if the slave ignores domain X but has no GTID position for it, the master rejects the connection with ER_GTID_POSITION_NOT_FOUND — even though the slave would have happily discarded those events anyway.

This approach was actually suggested by "Kristian Nielsen" in the context of MDEV-9345 / MDEV-20438:

"it might be reasonable if slave would send its list of IGNORE_DOMAIN_ID to the master on connect, and then to ignore any missing domain if it is in the ignore list."

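The validation skip described above can be sketched with a toy model. Everything here is an assumption made for illustration: the GtidState map, the DomainFilter struct, and start_position_ok() are hypothetical stand-ins, and the real check in check_slave_start_position() is more involved than a per-domain sequence-number comparison.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>

// Toy GTID state: domain_id -> last seq_no (server_id is left out).
using GtidState= std::map<uint32_t, uint64_t>;

// Hypothetical stand-in for the lists the replica sends on connect.
struct DomainFilter
{
  std::set<uint32_t> ignore_ids;
  std::set<uint32_t> do_ids;

  bool is_ignored(uint32_t domain_id) const
  {
    // Only one of the two lists may be configured at a time.
    assert(ignore_ids.empty() || do_ids.empty());
    if (!do_ids.empty())
      return do_ids.count(domain_id) == 0;   // not listed in DO: filtered out
    return ignore_ids.count(domain_id) != 0;
  }
};

// Returns true if the replica may connect: every non-filtered domain in the
// oldest binlog's Gtid_list must be covered by the replica's position.
bool start_position_ok(const GtidState &oldest_gtid_list,
                       const GtidState &slave_pos,
                       const DomainFilter &filter)
{
  for (const auto &entry : oldest_gtid_list)
  {
    if (filter.is_ignored(entry.first))
      continue;                              // replica discards this domain anyway
    auto it= slave_pos.find(entry.first);
    if (it == slave_pos.end() || it->second < entry.second)
      return false;                          // needed events were purged
  }
  return true;
}
```

With a Gtid_list covering domains 1 and 2 and a replica position that only mentions domain 1, the check fails without a filter and passes once domain 2 is in the ignore list.
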
Member

@gkodinov gkodinov left a comment

LGTM.
Thanks for working on this.

Some optional improvements below.

Please stand by for the final review.

Comment thread sql/slave.cc Outdated
DYNAMIC_ARRAY *ids;
const char *var_name;
} domain_id_vars[]= {
{ ignore_ids, "slave_connect_state_domain_ids_ignore" },
Member

one little thing: prefix these with "mariadb_" please. We should avoid overriding any user variables.

Contributor

There's already the convention where the slave sets @slave_connect_state . I think this can be an exception to the "mariadb_" prefix to adhere to the existing convention (though in other cases I agree, we should add the prefix).

Comment thread sql/slave.cc Outdated
about these variables simply ignore them (backwards compatible).
*/
{
DYNAMIC_ARRAY *ignore_ids=
Member

optional: I'd fold these into the next initialization. I believe there was something about the compiler trying to optimize initializations by reordering them. Apparently not happening here, but better safe than sorry.

Comment thread sql/slave.cc Outdated
&mi->domain_id_filter.m_domain_ids[Domain_id_filter::DO_DOMAIN_IDS];

struct {
DYNAMIC_ARRAY *ids;
Member

optional: I'd move the variable name first: easier on the eyes.

Comment thread sql/sql_repl.cc Outdated
true if success.
*/
static bool
get_slave_ignore_domain_ids(THD *thd, String *out_str)
Member

I'd combine these two into get_user_var_string(). To complement the already existing get_user_var_int().

@ParadoxV5 ParadoxV5 added the Replication Patches involved in replication label Feb 28, 2026
@bodyhedia44 bodyhedia44 force-pushed the MDEV-28213-fix-ignored-domain-validation branch 2 times, most recently from bf08125 to da1a69e Compare March 2, 2026 20:55
@bodyhedia44
Contributor Author

Improvements done; waiting for the final review.

Contributor

@bnestere bnestere left a comment

Thank you for the contribution, @bodyhedia44 !

Generally I approve, though this can't go into 10.6. I think main makes the most sense for the fix, as it changes behavior and complements another ticket going into main for the 13.0 release: MDEV-9345. This would be put into our 13.0 preview release, with a code cutoff deadline of March 10. Then, we have a six-week period of QA testing to ensure this works as expected (and without bugs). During this six-week period, if any bugs are found, they would need to be fixed before the RC release (currently planned for May 7). I don't know if you can change the target branch once a PR is created; you might have to close this and create a new PR with the correct target (though @gkodinov might be able to correct me if I am wrong).

Once rebased on main, I'll do another round of review.

Comment thread sql/sql_repl.cc Outdated
*/
static bool
is_domain_id_ignored(const DYNAMIC_ARRAY *ignore_ids,
const DYNAMIC_ARRAY *do_ids, ulong domain_id)
Contributor

domain_id should be a uint32_t:

                     const DYNAMIC_ARRAY *do_ids, uint32 domain_id)

Comment thread sql/sql_repl.cc Outdated
}


static int ulong_cmp(const void *a, const void *b)
Contributor

Because domain_id is uint32_t, this should also be a uint32_cmp.

Contributor

@ParadoxV5 ParadoxV5 Mar 2, 2026

Actually, I recall those Domain ID List fields have ulong elements (why). Of course, they don’t have to be on the master side.
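
As a rough sketch of what a 32-bit comparator for the sorted lookup could look like on the master side: the function names below are hypothetical, and plain qsort()/bsearch() over a raw array stand in for whatever container the patch actually uses.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Three-way comparator for qsort()/bsearch(). Comparing instead of
// subtracting avoids wraparound for large unsigned values.
static int uint32_cmp(const void *a, const void *b)
{
  uint32_t x= *static_cast<const uint32_t *>(a);
  uint32_t y= *static_cast<const uint32_t *>(b);
  return (x > y) - (x < y);
}

// Sorts the list in place; must run once before any lookup.
void sort_domain_ids(uint32_t *ids, size_t count)
{
  assert(ids != nullptr || count == 0);
  std::qsort(ids, count, sizeof(uint32_t), uint32_cmp);
}

// Binary search over a list previously sorted with sort_domain_ids().
bool domain_in_sorted_list(const uint32_t *ids, size_t count,
                           uint32_t domain_id)
{
  assert(ids != nullptr || count == 0);
  return std::bsearch(&domain_id, ids, count, sizeof(uint32_t),
                      uint32_cmp) != nullptr;
}
```

The `(x > y) - (x < y)` form matters here because domain IDs near the top of the 32-bit range would produce a wrong sign if the comparator returned a truncated difference.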

@ParadoxV5
Contributor

"I don't know if you can change the target branch once a PR is created, you might have to close this and create a new PR with the correct target (though @gkodinov might be able to correct me if I am wrong)."

The author can change the base of their PR.
(For this one, they even allowed maintainer editing.)

@bodyhedia44 bodyhedia44 force-pushed the MDEV-28213-fix-ignored-domain-validation branch from da1a69e to 9a2177b Compare March 4, 2026 14:38
@bodyhedia44 bodyhedia44 changed the base branch from 10.6 to main March 4, 2026 14:39
@bodyhedia44 bodyhedia44 requested a review from bnestere March 4, 2026 14:39
@bodyhedia44
Contributor Author

Done

Contributor

@bnestere bnestere left a comment

Hi @bodyhedia44 !

Thanks for rebasing and updating the patch! While accounting for newer features after your rebase, I'd like to think on some of the details. I've left a few points for you to consider in the meantime.

Comment thread sql/sql_repl.cc Outdated
my_init_dynamic_array(PSI_INSTRUMENT_ME, &slave_ignore_domain_ids,
sizeof(uint32), 4, 4, MYF(0));
my_init_dynamic_array(PSI_INSTRUMENT_ME, &slave_do_domain_ids,
sizeof(uint32), 4, 4, MYF(0));
Contributor

IGNORE/DO options are not common, so always initializing the dynamic arrays is wasted work in most cases. They can probably be lazily initialized when the option is actually found.
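
One way to picture the lazy initialization, as a sketch only: the LazyDomainList type is hypothetical, and std::unique_ptr with std::vector stands in for the server's DYNAMIC_ARRAY and its own allocation routines.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical holder for the replica's domain list on the master side.
struct LazyDomainList
{
  std::unique_ptr<std::vector<uint32_t>> ids;   // stays null for most replicas

  bool configured() const { return ids != nullptr; }

  void add(uint32_t domain_id)
  {
    if (!ids)                                   // allocate on first use only
      ids.reset(new std::vector<uint32_t>());
    ids->push_back(domain_id);
  }

  const std::vector<uint32_t> &get() const
  {
    assert(configured());                       // callers check configured() first
    return *ids;
  }
};
```

A connection from a replica without IGNORE/DO options then costs one null pointer rather than two array initializations.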

--connection server_1
SET @@session.gtid_domain_id= 1;
INSERT INTO t1 VALUES (4);
--save_master_pos
Contributor

Something that should also be tested here is what happens when new transactions arrive from gtid_domain_id=2. They should be ignored, but we want to make sure it doesn't cause replication to break. Previously, the replication connection would never get this far b/c it would error at connection time.

Comment thread sql/sql_repl.cc Outdated
See MDEV-28213.
*/
DYNAMIC_ARRAY slave_ignore_domain_ids;
DYNAMIC_ARRAY slave_do_domain_ids;
Contributor

Would it make sense to re-use Domain_id_filter here instead of creating a new set of DYNAMIC_ARRAYs and re-writing the Domain_id_filter::do_filter()?

IIRC, the current MDEV-9345 draft patch for master-side replication filtering only considers db and table-level options. Having the Domain_id_filter around would provide the support to add domain-level filtering on the master-side later.

Comment thread sql/sql_repl.cc Outdated
which is a generic rpl_binlog_state_base method used in other contexts.
Since domain filtering with purged binlogs is a rare configuration and
this bypass only runs once per slave connection (not per event), skipping
the index in this case is the right tradeoff.
Contributor

I'm not convinced it is the right trade-off. It would be good to keep behavior consistent.

If changing your code to use the Domain_id_filter, perhaps these functions could be extended with a Domain_id_filter parameter with a default of NULL? The "other contexts" (e.g. innodb binlog use case) don't care about filtering and can just keep the NULL default. That seems conceptually clean to me.

IMO it makes sense conceptually for is_before_pos() to support filtering by domain-id. Not related to your patch, but generally I think is_before_pos() likely shouldn't belong to rpl_binlog_state_base, as it sets a dependence on slave_connection_state (and other server internals, in turn) which isn't required for a binary log state. Though fixing that likely isn't something to be addressed by your PR. Let me think about this, and ping @knielsen to see if he has any thoughts.

@bodyhedia44 bodyhedia44 force-pushed the MDEV-28213-fix-ignored-domain-validation branch from 9a2177b to 6d5af66 Compare March 11, 2026 22:23
@bodyhedia44 bodyhedia44 requested a review from bnestere March 11, 2026 22:24
Member

@gkodinov gkodinov left a comment

Please fix the embedded build failures.

@bodyhedia44 bodyhedia44 force-pushed the MDEV-28213-fix-ignored-domain-validation branch from 6d5af66 to 2d6949e Compare April 8, 2026 17:44
@bodyhedia44 bodyhedia44 requested a review from gkodinov April 8, 2026 17:45
@bodyhedia44
Contributor Author

Done

Member

@gkodinov gkodinov left a comment

LGTM. Please keep working on the final review with Brandon.

Contributor

@bnestere bnestere left a comment

Thanks for the updates, @bodyhedia44 !

This is quite close. I've just left a few last notes.

Comment thread sql/sql_repl.cc
actually sends IGNORE_DOMAIN_IDS or DO_DOMAIN_IDS. Older slaves won't
send these, so the filter stays NULL and all domains are validated.
*/
{
Contributor

This is a bit over-complicated (also on the sending-side). The slave can only use one of these options at a time, both ignore_domain_ids and do_domain_ids can't be set at the same time. So it is a bit inefficient to loop through both options and do the same logic twice.

It'd be better to initially figure out which is configured, and do the underlying logic for that option once.
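
A minimal sketch of that idea, assuming the two lists are mutually exclusive as stated above. The enum, struct, and function names are hypothetical and only show the shape of "decide once, then run the logic once".

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

enum class DomainListType { NONE, DO_IDS, IGNORE_IDS };

// Hypothetical result type: which list is configured, and where it lives.
struct ConfiguredList
{
  DomainListType type;
  const std::vector<uint32_t> *ids;   // null when type == NONE
};

// Picks the single configured list so later code can handle it once,
// instead of looping over both options.
ConfiguredList pick_configured_list(const std::vector<uint32_t> &do_ids,
                                    const std::vector<uint32_t> &ignore_ids)
{
  // Both options cannot be set at the same time.
  assert(do_ids.empty() || ignore_ids.empty());
  if (!do_ids.empty())
    return {DomainListType::DO_IDS, &do_ids};
  if (!ignore_ids.empty())
    return {DomainListType::IGNORE_IDS, &ignore_ids};
  return {DomainListType::NONE, nullptr};
}
```

The caller would then pick the variable name from the returned type and send (or parse) exactly one user variable.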

Comment thread sql/sql_repl.cc Outdated
size_t var_name_len;
int list_type; /* Domain_id_filter::DO_DOMAIN_IDS or IGNORE_DOMAIN_IDS */
} domain_vars[]= {
{ STRING_WITH_LEN("slave_connect_state_domain_ids_ignore"),
Contributor

It'd be good to make these LEX_CSTRING constants that both the slave and master side would use everywhere (error messages too)

@bodyhedia44 bodyhedia44 force-pushed the MDEV-28213-fix-ignored-domain-validation branch from 2d6949e to 8487e06 Compare April 10, 2026 15:23
@bodyhedia44 bodyhedia44 requested a review from bnestere April 10, 2026 15:24
@bodyhedia44
Contributor Author

Done

Contributor

@bnestere bnestere left a comment

Thanks for such a quick turn-around, @bodyhedia44 ! I've left another round of review points. I probably should have caught them in my review yesterday, but better late than never, I suppose.

Comment thread sql/sql_repl.cc Outdated
@@ -1578,7 +1714,8 @@ gtid_check_binlog_file(slave_connection_state *state,

if (likely(reader && !reader->open_index_file(buf)))
{
- int lookup= reader->search_gtid_pos(state, out_start_seek, found_count);
+ int lookup= reader->search_gtid_pos(state, out_start_seek, found_count,
+ filter);
Contributor

Bad whitespace (though if you move the filter into slave_connection_state, this won't be relevant)

Comment thread sql/sql_repl.cc Outdated
@@ -2709,7 +2906,8 @@ static int init_binlog_sender(binlog_send_info *info,
search_file_name,
info->until_gtid_state,
&info->until_binlog_state,
- &found_in_index, &start_seek)))
+ &found_in_index, &start_seek,
+ info->slave_domain_filter)))
Contributor

I think we also need to check ignored domains when using the new in-engine binlog (see gtid_find_engine_pos() just above). Then we should also add a test to make sure the behavior works for the in-engine binlog. You should be able to just include your existing test; e.g., binlog_in_engine.rpl_mysqlbinlog_slave_consistency_basic.test sets a pattern for doing this with a common include file.

Comment thread sql/sql_repl.cc Outdated

if (var_name)
{
info->slave_domain_filter= new Domain_id_filter();
Contributor

We try to manage/track our own memory within the server, rather than just having new do it automatically (there are many exceptions that were missed, but we try 😅).

We have a concept called a MEM_ROOT which is a scoped memory pool, usually tied to some object. For objects whose scope is tied to that of the thread, like in this case, we can use the thread's MEM_ROOT like the following (also with an addition to account for out-of-resource):

Suggested change
- info->slave_domain_filter= new Domain_id_filter();
+ info->slave_domain_filter= new (thd->mem_root) Domain_id_filter();
+ if (!info->slave_domain_filter)
+ {
+ //... error...
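
To make the mechanism concrete outside the server, here is a toy sketch of a pool-backed placement new with a null check. Arena and Filter are hypothetical stand-ins, not MEM_ROOT or Domain_id_filter, and the fixed 256-byte buffer is an arbitrary choice for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <new>

// Toy stand-in for a scoped memory pool; everything is released together
// when the Arena itself goes away.
struct Arena
{
  alignas(std::max_align_t) unsigned char buf[256];
  size_t used= 0;

  void *alloc(size_t size)
  {
    assert(used <= sizeof(buf));
    const size_t align= alignof(std::max_align_t);
    size_t aligned= (size + align - 1) & ~(align - 1);
    if (aligned > sizeof(buf) - used)
      return nullptr;                       // out of resources
    void *p= buf + used;
    used+= aligned;
    return p;
  }
};

struct Filter
{
  int list_type= 0;

  // noexcept lets the new-expression yield nullptr instead of throwing,
  // and the constructor is skipped when allocation fails.
  static void *operator new(size_t size, Arena *arena) noexcept
  { return arena->alloc(size); }
  // Matching placement delete; memory is reclaimed with the arena.
  static void operator delete(void *, Arena *) noexcept {}
};
```

Usage follows the shape of the suggestion above: `Filter *f= new (&arena) Filter();` followed by an explicit check for a null result.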

Comment thread sql/gtid_index.h Outdated
@@ -480,6 +481,7 @@ class Gtid_index_reader : public Gtid_index_base
Index_node_base *n;
int (Gtid_index_reader::* search_cmp_function)(uint32, rpl_binlog_state_base *);
slave_connection_state *in_search_gtid_pos;
+ Domain_id_filter *in_search_domain_filter;
Contributor

It is awkward to have this be a part of the Gtid_index_reader class. While I like the slave_domain_filter in the binlog_send_info, it may lead to a more concise implementation of your patch if the filter were instead defined in the slave_connection_state, which is directly passed everywhere you'd need it (I think). Sorry I didn't think of that earlier.

Comment thread sql/sql_repl.cc Outdated
if (expect_number)
{
char *endptr;
ulong domain_id= strtoul(p, &endptr, 10);
Contributor

ulong should always be avoided, as it is not consistent between architectures. A few caveats for us to consider:

  1. domain_id can only ever be 32-bit, so any architecture inconsistencies wouldn't really be relevant for us here anyway.
  2. We have our own internal functions for numeric parsing. There's an example already for parsing domain_id when one manually specifies a GTID position on a slave for gtid_slave_pos, and it uses the function gtid_parser_helper(). You should be able to do the same thing it does for parsing the domain_id, and note it also has additional validation checks that it would be good to include here as well.

Contributor

Domain IDs are ulongs in slave-side replication filters.
But they don’t have to be on the master side.
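
For illustration, a width-independent parse of a domain ID list with explicit range validation could look like the sketch below. It deliberately uses the standard strtoull() rather than the server's gtid_parser_helper(), whose exact checks are not reproduced here; the function name and the comma-separated input format are assumptions.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Parses "1,2,3" into *out. Returns false on any malformed or
// out-of-range entry; an empty string is an empty list.
bool parse_domain_id_list(const char *str, std::vector<uint32_t> *out)
{
  assert(str != nullptr && out != nullptr);
  out->clear();
  const char *p= str;
  if (*p == '\0')
    return true;
  for (;;)
  {
    if (*p < '0' || *p > '9')
      return false;                  // rejects signs, spaces, empty items
    errno= 0;
    char *end;
    unsigned long long value= std::strtoull(p, &end, 10);
    if (errno == ERANGE || value > UINT32_MAX)
      return false;                  // domain IDs are 32-bit
    out->push_back(static_cast<uint32_t>(value));
    if (*end == '\0')
      return true;
    if (*end != ',')
      return false;                  // unexpected trailing character
    p= end + 1;
  }
}
```

Parsing into a 64-bit temporary and checking against UINT32_MAX gives the same result whether ulong is 32 or 64 bits wide on the build platform.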

@bodyhedia44 bodyhedia44 force-pushed the MDEV-28213-fix-ignored-domain-validation branch from 8487e06 to 3453bf5 Compare April 12, 2026 19:52
@bodyhedia44 bodyhedia44 requested a review from bnestere April 12, 2026 19:53
@bodyhedia44
Contributor Author

Done

When a slave is configured with IGNORE_DOMAIN_IDS or DO_DOMAIN_IDS,
the master's binlog dump thread should skip GTID state validation for
those filtered domains. This avoids false ER_GTID_POSITION_NOT_FOUND
errors when the slave does not have (or need) the current GTID state
for domains it is filtering.

The slave now sends its IGNORE/DO domain ID lists to the master via
user variables @slave_connect_state_domain_ids_ignore and
@slave_connect_state_domain_ids_do, which the master reads in
mysql_binlog_send() and passes to check_slave_start_position().

Changes:
- sql/sql_repl.cc: load_ignore_domain_ids() returns bool, fix parser
  to avoid redundant whitespace skip and scope local variables tightly.
  Add ulong_cmp() comparator. Replace O(n) linear scans in
  is_domain_id_ignored() with bsearch() after sorting the arrays.
- sql/slave.cc: Add build_domain_ids_query() helper to construct SET
  queries for domain ID user variables. Refactor duplicate code into a
  loop using a struct array.
- mysql-test/suite/rpl/t/rpl_gtid_ignored_domain_ids_validation.test:
  New test validating end-to-end GTID replication with domain filtering.

Reviewed-by: Georgi Kodinov <joro@mariadb.org>
Reviewed-by: Brandon Nesterenko <brandon.nesterenko@mariadb.com>
@bodyhedia44 bodyhedia44 force-pushed the MDEV-28213-fix-ignored-domain-validation branch from 3453bf5 to 8c3d53c Compare April 12, 2026 20:22
Labels

External Contribution: All PRs from entities outside of MariaDB Foundation, Corporation, Codership agreements.
Replication: Patches involved in replication

5 participants