Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Documentation - switching to row based replication and data loss #991

Open
marnixgb opened this issue Jun 18, 2021 · 7 comments · Fixed by openark/gh-ost#23 · May be fixed by #999
Open

Documentation - switching to row based replication and data loss #991

marnixgb opened this issue Jun 18, 2021 · 7 comments · Fixed by openark/gh-ost#23 · May be fixed by #999

Comments

@marnixgb
Copy link

If you are running gh-ost on master (ie --allow-master-master, --allow-on-master) with anything other than row based binary logs the switch to row based needs to be handled carefully (more carefully than currently documented)

We found that many of our connections to mariadb are very long lasting (we recycle the connections a lot) - after a great deal of experimentation after some significant data loss it was discovered that the connections that were made prior to the switch to row based binary logging retained the mixed setting until closed. Transactions that took place on these connections are not shuttled into the building _gho table

These binary logs entries are silently ignored by gh-ost

@shlomi-noach
Copy link
Contributor

shlomi-noach commented Jun 20, 2021

While the help message indicates that this should run on replica, gh-ost does not actually prevent you from switching on primary. Solution: the feature must only work on a replica and fail (exit with error) if on primary.

@shlomi-noach
Copy link
Contributor

Addressed downstream by openark#23

@marnixgb
Copy link
Author

Hi

Just wanted to clarify - this is not related to --switch-to-rbr running on a master

We normally run MIXED mode in master-master and have always switched to row based prior to running gh-ost using :

set global binlog_format='ROW'

The point I was trying to make is that mysql / mariadb DO NOT immediately switch open connections based in the command above - any long running connections that are still open and doing inserts etc after the above command has been issued still run in the binary logging mode they were initiated with (in our case MIXED). This leads to gh-ost ignoring any transactions that run on those connections subsequently leading to data loss

This is not a criticism of gh-ost (it's not really even a criticism of MariaDB) - it's just worth mentioning in the documentation I think (as a warning) as it's a source of data loss that was not easy to track down/not obvious

Hope that helps clarify

@shlomi-noach
Copy link
Contributor

Got it. Thank you for clarifying! You are correct and this is a behavior of mysql&mariadb, and I'm open to suggestions on how to document this best.

@marnixgb
Copy link
Author

Ok, maybe something like the following :

In [https://github.com/github/gh-ost/blob/master/doc/requirements-and-limitations.md]

Due to MySQL / MariaDB global setting changes only applying to new connections you should ensure that when switching to ROW based binary logging that all connections are re-established prior to running gh-ost operations - failure to do so may result in gh-ost being unable to interpret transactions/data loss : reference [https://mariadb.com/kb/en/set/#global-session] - this is particularly a problem with -allow-on-master (using slave binary logs does not suffer with this as slave replication threads are very short lived)

@timvaillancourt
Copy link
Collaborator

@shlomi-noach / @marnixgb with the right tweaks I think the performance_schema could provided an indicator that this is the case

I think it will need some tweaking but this shows me all MySQL sessions with binlog_row_image=MINIMAL on 5.7:

SELECT threads.processlist_user, vars.variable_value FROM performance_schema.threads
    LEFT JOIN performance_schema.variables_by_thread vars ON threads.thread_id=vars.thread_id
    WHERE threads.processlist_command IN ('Query','Sleep')
    AND vars.variable_name='binlog_row_image' AND vars.variable_value='MINIMAL';

Perhaps a warning or error could be generated with this data 🤔?

@marnixgb
Copy link
Author

marnixgb commented Jun 22, 2021

I do like where you are going with that - but we don't operate performance_schema unfortunately (maybe we should) - a lot of people don't

In the rawest form you could gather the full processlist "Time" column - if any connections time implies connections prior to the running of gh-ost there is a risk that the connection could update in the wrong format (when gh-ost has initiated the switch to RBR of course)

But that's all still a bit fuzzy - probably enough to throw a warning though

@timvaillancourt timvaillancourt removed this from the v1.1.5 milestone Jun 5, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants