
Conversation

@gianlucaborello

Hello,

I am facing some issues with this image in production, in particular around the semantics of MYSQL_INNODB_NUM_MEMBERS. Specifically, the container refuses to start until exactly MYSQL_INNODB_NUM_MEMBERS members are part of the cluster and online. In my case, that number is set to 3, and the setup is a standard single-primary InnoDB Cluster.
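
For reference, I start the Router container roughly like this (the network name, credentials and tag below are placeholders from my setup):

$ docker run -d --name mysql-router --net innodbnet \
    -e MYSQL_HOST=mysql1 \
    -e MYSQL_PORT=3306 \
    -e MYSQL_USER=root \
    -e MYSQL_PASSWORD=secret \
    -e MYSQL_INNODB_NUM_MEMBERS=3 \
    mysql/mysql-router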

This creates a problem: if one mysql replica goes down, the cluster is still healthy, since it still has quorum (it won't tolerate further failures, but that might not be an imminent problem). Yet, even though the cluster is healthy, starting new Router containers will fail, because they will be stuck waiting for the missing replica to come up. This severely limits the availability of Router, especially when Router itself is deployed alongside the client application (as recommended in the official documentation), which might be subject to a very frequent deployment cycle, even while a mysql replica is down.

Looking a bit deeper at the implementation of the container, the variable MYSQL_INNODB_NUM_MEMBERS is used in the docker entrypoint as such:

select count(MEMBER_STATE) = $MYSQL_INNODB_NUM_MEMBERS from replication_group_members where MEMBER_STATE = 'ONLINE';

So, we don't proceed with the initialization until exactly MYSQL_INNODB_NUM_MEMBERS members are up and online.
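
Paraphrased as shell, the gating logic in the entrypoint amounts to something like this (a sketch, not the literal script; I've qualified the table with its performance_schema schema):

# Poll until exactly $MYSQL_INNODB_NUM_MEMBERS members report ONLINE
while true; do
    ready=$(mysql -h "$MYSQL_HOST" -P "$MYSQL_PORT" -u "$MYSQL_USER" -p"$MYSQL_PASSWORD" -N -B -e \
        "select count(MEMBER_STATE) = $MYSQL_INNODB_NUM_MEMBERS from performance_schema.replication_group_members where MEMBER_STATE = 'ONLINE';")
    [ "$ready" = "1" ] && break
    sleep 1
done
# ...only then does the entrypoint proceed with the Router bootstrap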

My first attempt to solve this was to relax this query to the following:

select count(MEMBER_STATE) = $MYSQL_INNODB_NUM_MEMBERS from replication_group_members

This doesn't work, because replication_group_members entries are removed when a member of the cluster goes down, so if I temporarily lose one replica, the table shows just two entries.

In other words, if I bring down the mysql3 container, I go from:

mysql> select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| group_replication_applier | 0120d6f4-bb57-11e8-a2e8-0242ac110003 | mysql1      |        3306 | ONLINE       | PRIMARY     | 8.0.12         |
| group_replication_applier | 026bb327-bb57-11e8-b03c-0242ac110004 | mysql2      |        3306 | ONLINE       | SECONDARY   | 8.0.12         |
| group_replication_applier | 02914fbf-bb57-11e8-b455-0242ac110005 | mysql3      |        3306 | ONLINE       | SECONDARY   | 8.0.12         |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
3 rows in set (0.00 sec)

To:

mysql> select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| group_replication_applier | 0120d6f4-bb57-11e8-a2e8-0242ac110003 | mysql1      |        3306 | ONLINE       | PRIMARY     | 8.0.12         |
| group_replication_applier | 026bb327-bb57-11e8-b03c-0242ac110004 | mysql2      |        3306 | ONLINE       | SECONDARY   | 8.0.12         |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
2 rows in set (0.00 sec)

So that's not a viable solution. Another attempt to relax the entrypoint constraint was to wait until at least (and not exactly) N members are online, so I could just pass MYSQL_INNODB_NUM_MEMBERS=1 and the underlying query would be:

select count(MEMBER_STATE) >= $MYSQL_INNODB_NUM_MEMBERS from replication_group_members where MEMBER_STATE = 'ONLINE';

Upon first inspection, this worked: Router will dynamically discover the other cluster members at runtime via the bootstrap server and properly route traffic to the right master, so it doesn't matter if we end up with a configuration file containing a single bootstrap server:

$ docker exec -it mysql-router cat /tmp/mysqlrouter/mysqlrouter.conf
# File automatically generated during MySQL Router bootstrap
[metadata_cache:test]
bootstrap_server_addresses=mysql://mysql1:3306
...

However, it seems that the cluster state is only ever discovered via the bootstrap nodes. This means that if the original node discovered during bootstrap goes down, the entire thing goes down: Router stops serving all requests, even though we have a valid cluster formed by mysql2 and mysql3:

2018-09-13 18:31:17 metadata_cache WARNING [7f7ec54c2700] Replicaset 'default' not available
2018-09-13 18:31:17 routing WARNING [7f7ec54c2700] No available servers found for 'default' primary routing

So this option is not viable either (or rather, it would only be viable if we introduced an explicit dependency between Router and the mysql instances, making sure that Router is started after the cluster has fully formed for the first time, which seems annoying).
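
(Just to illustrate the annoyance: such an explicit dependency could be expressed along these lines in a docker-compose v2.1+ file, where the healthcheck on each mysql service is hypothetical and would need to verify that the member is ONLINE:)

# Fragment of a compose "services:" section; healthchecks not shown
mysql-router:
  image: mysql/mysql-router
  depends_on:
    mysql1:
      condition: service_healthy
    mysql2:
      condition: service_healthy
    mysql3:
      condition: service_healthy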

The third option that I found was to not rely on the replication_group_members table at all, but instead on the exact same table that Router internally uses during the bootstrap process to discover the members of the cluster (https://github.com/mysql/mysql-router/blob/8.0/src/router/src/config_generator.cc#L1328). This table has the advantage that all the members are always listed, even if one is down:

mysql> select * from mysql_innodb_cluster_metadata.instances;
+-------------+---------+---------------+--------------------------------------+---------------+------+--------+--------------------------------------------------------------------------------------+------------+---------------+-------------+
| instance_id | host_id | replicaset_id | mysql_server_uuid                    | instance_name | role | weight | addresses                                                                            | attributes | version_token | description |
+-------------+---------+---------------+--------------------------------------+---------------+------+--------+--------------------------------------------------------------------------------------+------------+---------------+-------------+
|           1 |       1 |             1 | 42dd80ee-bb64-11e8-b479-0242ac110003 | mysql1:3306   | HA   |   NULL | {"mysqlX": "mysql1:33060", "grLocal": "mysql1:33061", "mysqlClassic": "mysql1:3306"} | NULL       |          NULL | NULL        |
|           8 |       8 |             1 | 456b4ebd-bb64-11e8-9a17-0242ac110004 | mysql2:3306   | HA   |   NULL | {"mysqlX": "mysql2:33060", "grLocal": "mysql2:33061", "mysqlClassic": "mysql2:3306"} | NULL       |          NULL | NULL        |
|          15 |      15 |             1 | 46845bc1-bb64-11e8-841f-0242ac110005 | mysql3:3306   | HA   |   NULL | {"mysqlX": "mysql3:33060", "grLocal": "mysql3:33061", "mysqlClassic": "mysql3:3306"} | NULL       |          NULL | NULL        |
+-------------+---------+---------------+--------------------------------------+---------------+------+--------+--------------------------------------------------------------------------------------+------------+---------------+-------------+
3 rows in set (0.00 sec)

So, if we use this table with MYSQL_INNODB_NUM_MEMBERS=3, we can still make sure that Router, just like before, doesn't come up until all the bootstrap server addresses can be properly discovered by the bootstrap process, but we also become more tolerant of a replica going temporarily down, since Router can be safely started from scratch in that scenario. The query that I'm using, adapted from the C++ code in the link above, is:

select count(*) = $MYSQL_INNODB_NUM_MEMBERS FROM mysql_innodb_cluster_metadata.instances WHERE replicaset_id = (SELECT replicaset_id FROM mysql_innodb_cluster_metadata.instances WHERE mysql_server_uuid = @@server_uuid);
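
As a sanity check (session sketch, with 3 substituted for the variable), running this while mysql3 is stopped still returns true, because the metadata table keeps all registered members:

mysql> select count(*) = 3 from mysql_innodb_cluster_metadata.instances where replicaset_id = (select replicaset_id from mysql_innodb_cluster_metadata.instances where mysql_server_uuid = @@server_uuid);
+--------------+
| count(*) = 3 |
+--------------+
|            1 |
+--------------+
1 row in set (0.00 sec)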

What do you think? I'm not too familiar with Router, so I would be interested to know if I missed some corner case.

Use mysql_innodb_cluster_metadata.instances when
checking the number of members in the cluster.

This has the advantage that, once the cluster is fully formed, the table
never shrinks even if a member goes offline, so starting the router
container while one mysql replica is down doesn't fail. Such a scenario
can frequently occur when the router is deployed alongside the client
application.

Signed-off-by: Gianluca Borello <g.borello@gmail.com>
@mysql-oca-bot

Hi, thank you for submitting this pull request. In order to consider your code we need you to sign the Oracle Contribution Agreement (OCA). Please review the details and follow the instructions at http://www.oracle.com/technetwork/community/oca-486395.html
Please make sure to include your MySQL bug system user (email) in the returned form.
Thanks

@gianlucaborello
Author

Hi,

I am already a contributor to other Oracle repositories on GitHub, and my name is listed on the page at http://www.oracle.com/technetwork/community/oca-486395.html, along with my GitHub username:

Gianluca Borello - MySQL Operator - gianlucaborello

As your FAQ says (http://www.oracle.com/technetwork/oca-faq-405384.pdf):

Once you execute an OCA, it is valid for all Oracle sponsored projects. 

So I think I'm good?

@datakatalyst

Hi,

I have the same problem, and Gianluca's solution works for me.
Do you have an estimate for when it might be merged?

Thank you

@ltangvald requested a review from neumayer September 20, 2018 13:54
@ltangvald
Member

@gianlucaborello I'm not 100% familiar with the procedure, but generally you need to follow the instructions given on the page linked by the oca bot, and your merge proposal will be added to an internal queue for review

@lefred

lefred commented Sep 20, 2018

@gianlucaborello : Hi,
let me check on our site, but if you are listed that should not be a problem. You just have to send "I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it."
I'm more familiar with the server code, but it should be the same here. Stay tuned and thank you already for your contribution ;)

@gianlucaborello
Author

Thank you for your help. Yes, my name is already listed in the OCA list, and I have successfully contributed to other Oracle projects on GitHub using this username. As for the other request, here is the explicit agreement to the OCA:

I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Let me know if I should also send it via email.

Thanks

@mysql-admin

@gianlucaborello Sorry for the late reply. For handling MySQL contributions, in addition to the signed OCA, you need to have a user at http://bugs.mysql.com; please create one by clicking the 'register' link at the top right side of the screen.

Thanks
==Omer

@gianlucaborello
Author

Done. Username is g.borello@gmail.com.

Thanks

@mysql-oca-bot

Hi, thank you for your contribution. Please confirm this code is submitted under the terms of the OCA (Oracle's Contribution Agreement) you have previously signed by cutting and pasting the following text as a comment:
"I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it."
Thanks

@gianlucaborello
Author

I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

@mysql-admin

Thanks
==Omer

@mysql-oca-bot

Hi, thank you for your contribution. Your code has been assigned to an internal queue. Please follow
bug http://bugs.mysql.com/bug.php?id=92547 for updates.
Thanks

@neumayer

@gianlucaborello What if we remove the loop altogether and you use "restart: on-failure"? That way we remove the static dependency, and your setup will still work, because the container will be restarted until the initial bootstrap server is working. Right?
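
I.e. something like this in a compose file (a sketch; service names and variable values are illustrative):

mysql-router:
  image: mysql/mysql-router
  restart: on-failure   # keep retrying the bootstrap instead of waiting in the entrypoint
  environment:
    MYSQL_HOST: mysql1
    MYSQL_PORT: 3306
    MYSQL_USER: root
    MYSQL_PASSWORD: secret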

A somewhat larger problem is that a similar version of this dependency still exists for the initial bootstrap server: if it goes away and Router is restarted, it will fail. But that's maybe part of a larger internal discussion.

@gianlucaborello
Author

gianlucaborello commented Sep 29, 2018

@neumayer thanks for your reply.

What you are suggesting can be seen as an improvement over the current behavior, which I'd definitely welcome, but if I'm not mistaken it is still subject to a race condition, unless the user explicitly makes sure that all the Router instances are started after the cluster has been fully formed at least once, which in my experience just places additional burden on the user. I, for one, maintain a few production applications and usually let all components start at the same time on Kubernetes and figure out the proper dependencies among themselves, relying for example on the MYSQL_INNODB_NUM_MEMBERS loop for Router, so I would hit this race condition.

In particular, if you remove the loop, this is the race condition I see:

  1. Start a MySQL cluster of 3 members (mysql1, mysql2, mysql3), adding one cluster member at a time.
  2. Before step 1 is fully completed, start a client application with Router deployed alongside. Since the loop was removed, Router will not wait for all 3 members to join and will start, using the members active at that time as bootstrap servers. Let's suppose it's just mysql1, so we have:
$ docker exec -it mysql-router cat /tmp/mysqlrouter/mysqlrouter.conf | grep boot
bootstrap_server_addresses=mysql://mysql1:3306
  3. All the other replicas come up and join the cluster. Router discovers them at runtime, through the bootstrap server mysql1, and considers them in its routing decisions.
  4. Temporarily bring down mysql1, the bootstrap server. The cluster keeps functioning because we still have quorum with mysql2 and mysql3.
  5. Router, however, having lost the connection with the bootstrap server, stops working, with the message:
2018-09-29 00:43:59 metadata_cache ERROR [7ff445615700] Failed connecting with any of the metadata servers
2018-09-29 00:44:02 metadata_cache WARNING [7ff445615700] Failed connecting with Metadata Server mysql1:3306: Can't connect to MySQL server on 'mysql1' (113) (2003)
2018-09-29 00:44:02 metadata_cache ERROR [7ff445615700] Failed to connect to metadata server
2018-09-29 00:44:02 metadata_cache ERROR [7ff445615700] Failed connecting with any of the metadata servers
2018-09-29 00:44:05 metadata_cache WARNING [7ff445615700] Failed connecting with Metadata Server mysql1:3306: Can't connect to MySQL server on 'mysql1' (113) (2003)
2018-09-29 00:44:05 metadata_cache ERROR [7ff445615700] Failed to connect to metadata server
2018-09-29 00:44:05 metadata_cache ERROR [7ff445615700] Failed connecting with any of the metadata servers

Notice that Router doesn't exit; it stays in this loop waiting for mysql1 to reappear, while in the meantime we have an available, valid cluster with the mysql2 and mysql3 members. IMHO, unless I missed something, this is not ideal, and a byproduct of removing the loop.

I'm not particularly attached to the solution I'm proposing, but on paper it seems to work relatively well, in the sense that I can't think of a scenario where Router would actually become unavailable as long as the underlying cluster is still operating, even with some replicas down (modulo the initial bootstrap server problem that you mention, but that's inevitable at the moment). What disadvantage do you see?

The only scenario it doesn't cover is a user changing the number of cluster members at runtime, but that seems to me a different use case that should perhaps be handled natively inside Router, the same way it is done in other distributed systems (e.g. in Cassandra, once a member of the cluster is discovered via a bootstrap node, it keeps being considered even if the original bootstrap node goes down, thus eliminating the need to know all the cluster members at bootstrap time, as Router instead seems to want).

Hope this all makes sense.

@neumayer

neumayer commented Oct 1, 2018

First of all, in case I haven't said it: thanks for the good writeup and your input, we really appreciate it.

Now a bit of explanation on how I see things:

The entrypoint script was always meant as a workaround until the functionality the router image should provide becomes a bit more clear. Currently it is intended to let users use Router's bootstrap mode against the cluster. And the "wait for a number of servers to be online" part was mainly a workaround for the case where one starts multiple servers and Router locally at the same time: the servers will simply not have finished the initialisation process in time, so Router (or rather, bootstrap mode) fails. I always had a strange feeling about it, and now it clearly doesn't work with scaling of the cluster (well, both intended and unintended scaling).

I really think all the issues are due to current limitations in bootstrap mode (or due to how we use bootstrap mode in the image). So in the medium run I see two "proper" solutions:

  • bootstrap mode handles scaling (disconnects/reconnects/up/down) better
  • use non-bootstrap mode, i.e. manually write a config file and mount it into the router image (running in non-bootstrap mode)

A lot of my motivation comes from keeping the docker image as thin a layer as possible, really only handling configuration for the application. This is to avoid duplicate logic (which the waiting for the servers is, in a way).

I hear there are internal efforts to make bootstrap mode more graceful/dynamic. There is also a somewhat larger discussion about how Router is best used in different docker scenarios (plain/docker-compose/container schedulers).

I also see that your immediate problem is not solved by any of this :-)

For the time being I can propose a compromise: leave the loop the way you describe it (after all, it doesn't change the semantics and shouldn't impact existing deployments), but make it optional, i.e. if the variable is not set there is no loop, just a restart.
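
Roughly like this in the entrypoint (a sketch, not the final patch; wait_for_online_members is a hypothetical name wrapping the current polling loop):

# Run the polling loop only when MYSQL_INNODB_NUM_MEMBERS is set;
# otherwise bootstrap immediately and rely on the restart policy.
if [ -n "$MYSQL_INNODB_NUM_MEMBERS" ]; then
    wait_for_online_members "$MYSQL_INNODB_NUM_MEMBERS"   # hypothetical helper
fi
mysqlrouter --bootstrap "$MYSQL_USER@$MYSQL_HOST:$MYSQL_PORT" --directory /tmp/mysqlrouter
exec mysqlrouter --config /tmp/mysqlrouter/mysqlrouter.conf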

@gianlucaborello
Author

Thanks for the context @neumayer.

I'll definitely be interested in hearing the new developments of the bootstrap mode, I'd love to see a more resilient workflow.

I am fine with your compromise. If the variable is removed, it could be worth adding a note saying that in that mode users should launch Router only after the cluster has been fully formed for the first time; otherwise people might end up spending hours troubleshooting like me, only to eventually figure out the internal detail that Router never uses the other cluster members discovered at runtime if the ones discovered during the bootstrap go down, leading to a loss of availability.

Thanks

@neumayer

neumayer commented Oct 2, 2018

Great.

I mentioned it in the readme. I merged both changes; they will be part of the next release.
