
fix(migrations) ensure inserts are performed after schema agreement #7667

Merged
merged 1 commit into from Oct 7, 2021

Conversation

mikefero
Contributor

@mikefero mikefero commented Aug 6, 2021

Summary

Using a multi-node Apache Cassandra 4.0.0 cluster fails bootstrap

Error: 
/usr/local/share/lua/5.1/kong/cmd/utils/migrations.lua:48: [Cassandra error] cluster_mutex callback threw an error: /usr/local/share/lua/5.1/kong/cmd/utils/migrations.lua:41: [Cassandra error] failed to run migration '013_220_to_230' up: [Write timeout] Operation timed out - received only 1 responses.
stack traceback:
        [C]: in function 'assert'
        /usr/local/share/lua/5.1/kong/cmd/utils/migrations.lua:41: in function </usr/local/share/lua/5.1/kong/cmd/utils/migrations.lua:40>
        [C]: in function 'xpcall'
        /usr/local/share/lua/5.1/kong/db/init.lua:402: in function </usr/local/share/lua/5.1/kong/db/init.lua:352>
        [C]: in function 'pcall'
        /usr/local/share/lua/5.1/kong/concurrency.lua:45: in function 'cluster_mutex'
        /usr/local/share/lua/5.1/kong/cmd/utils/migrations.lua:40: in function 'bootstrap'
        /usr/local/share/lua/5.1/kong/cmd/migrations.lua:162: in function 'cmd_exec'
        /usr/local/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/share/lua/5.1/kong/cmd/init.lua:88>
        [C]: in function 'xpcall'
        /usr/local/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/share/lua/5.1/kong/cmd/init.lua:45>
        /usr/local/bin/kong:9: in function 'file_gen'
        init_worker_by_lua:48: in function <init_worker_by_lua:46>
        [C]: in function 'xpcall'
        init_worker_by_lua:55: in function <init_worker_by_lua:53>
stack traceback:
        [C]: in function 'error'
        /usr/local/share/lua/5.1/kong/cmd/utils/migrations.lua:48: in function 'bootstrap'
        /usr/local/share/lua/5.1/kong/cmd/migrations.lua:162: in function 'cmd_exec'
        /usr/local/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/share/lua/5.1/kong/cmd/init.lua:88>
        [C]: in function 'xpcall'
        /usr/local/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/share/lua/5.1/kong/cmd/init.lua:45>
        /usr/local/bin/kong:9: in function 'file_gen'
        init_worker_by_lua:48: in function <init_worker_by_lua:46>
        [C]: in function 'xpcall'
        init_worker_by_lua:55: in function <init_worker_by_lua:53>

When bootstrapping a multi-node Apache Cassandra cluster, the CREATE TABLE response from the coordinator node does not guarantee that the schema-altering statement has reached schema agreement across the cluster. This update moves the insert statement to occur after the CREATE TABLE statement so that schema agreement is reached before the insert statement executes.
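To illustrate the failure mode, here is a toy model (not Kong's or Cassandra's actual code; `FakeNode`, `receive_ddl`, and `schema_agreement` are invented for this sketch): the coordinator applies the DDL immediately, while replicas learn about it asynchronously via gossip. An INSERT issued before all nodes converge on the same schema version can land on a replica that does not yet know the table, which surfaces as the write timeout above.

```python
class FakeNode:
    """Stand-in for a Cassandra node; tracks the tables and schema version it has seen."""
    def __init__(self):
        self.schema_version = 0
        self.tables = set()

    def receive_ddl(self, table):
        # A node bumps its schema version when a CREATE TABLE reaches it.
        self.tables.add(table)
        self.schema_version += 1

    def execute_insert(self, table):
        # DML against a table the replica has not yet learned about fails.
        if table not in self.tables:
            raise RuntimeError("Write timeout: table not yet known on this replica")
        return "ok"

def schema_agreement(nodes):
    """True once every node reports the same schema version."""
    return len({n.schema_version for n in nodes}) == 1

# Three-node cluster; only the coordinator has applied the CREATE TABLE so far.
nodes = [FakeNode() for _ in range(3)]
nodes[0].receive_ddl("schema_meta")

assert not schema_agreement(nodes)  # replicas lag behind: an INSERT now is unsafe

for n in nodes[1:]:                 # gossip eventually propagates the DDL
    n.receive_ddl("schema_meta")

assert schema_agreement(nodes)      # only now is the INSERT safe on any replica
assert all(n.execute_insert("schema_meta") == "ok" for n in nodes)
```

The fix in this PR follows the same ordering: wait for agreement after the DDL, then run the DML.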

Full changelog

  • fix(migrations) ensure inserts are performed after schema agreement

Reproduction steps

#!/usr/bin/env bash

# Start with a fresh environment
ccm remove
gojira nuke -f
set -e

# Create and start a C* cluster
CASS_VERSION=${1:-4.0.0}
CASS_NODES=${2:-3}
CASS_NAME="cass-${CASS_VERSION//\./-}-${CASS_NODES}-nodes"
ccm create -v "$CASS_VERSION" -n "$CASS_NODES" -s "$CASS_NAME"

# Export some Kong Gateway settings for ease of use with Gojira environments
export KONG_DATABASE=cassandra
export KONG_TEST_DATABASE=cassandra
if [[ "$CASS_NODES" -eq 1 ]]; then
  export KONG_CASSANDRA_REPL_FACTOR=1
  export KONG_CASSANDRA_WRITE_CONSISTENCY=ONE
  export KONG_CASSANDRA_READ_CONSISTENCY=ONE
else
  export KONG_CASSANDRA_REPL_FACTOR=3
  export KONG_CASSANDRA_WRITE_CONSISTENCY=LOCAL_QUORUM
  export KONG_CASSANDRA_READ_CONSISTENCY=LOCAL_QUORUM
fi
export KONG_DB_UPDATE_PROPAGATION=2

# Bring up the Kong Gateway instance
gojira up \
       --network-mode host \
       --alone \
       -e KONG_CASSANDRA_REPL_FACTOR \
       -e KONG_CASSANDRA_WRITE_CONSISTENCY \
       -e KONG_CASSANDRA_READ_CONSISTENCY \
       -e KONG_DB_UPDATE_PROPAGATION

# Prepare the C* datastore
gojira run \
       --network-mode host \
       --alone \
       -e KONG_CASSANDRA_REPL_FACTOR \
       -e KONG_CASSANDRA_WRITE_CONSISTENCY \
       -e KONG_CASSANDRA_READ_CONSISTENCY \
       -e KONG_DB_UPDATE_PROPAGATION \
       kong migrations bootstrap --vv

# Start the Kong Gateway
gojira run \
       --network-mode host \
       --alone \
       -e KONG_CASSANDRA_REPL_FACTOR \
       -e KONG_CASSANDRA_WRITE_CONSISTENCY \
       -e KONG_CASSANDRA_READ_CONSISTENCY \
       -e KONG_DB_UPDATE_PROPAGATION \
       kong start

@mikefero mikefero requested a review from a team August 6, 2021 21:28
Member

@thibaultcha thibaultcha left a comment


Having just skimmed the code a little bit as well as the up_f implementation (f66c3ec), and unless I am missing something, I believe that there are fundamental issues with this approach: mixing DDL and DML queries in the up migration step should be avoided because other replicas chosen for the DML (INSERT) may still be digesting schema changes from the DDL (ALTER) or may not even have received them yet. The new migrations framework (#3802) was adamant about this and that is why the up step was meant to only receive a string of DDL queries: to make it fool-proof since this was a common mistake by migrations authors with the previous framework. Now, up_f seems to have cancelled that?

@mikefero
Contributor Author

mikefero commented Aug 9, 2021

I believe that there are fundamental issues with this approach: mixing DDL and DML queries in the up migration step should be avoided because other replicas chosen for the DML (INSERT) may still be digesting schema changes from the DDL (ALTER) or may not even have received them yet.

Agreed, DDL and DML queries should not be mixed; DML queries should only be executed after schema agreement/consensus is reached. During initial testing it appeared consensus was being reached based on the verbose output, but reading through the migration code, both up and up_f are executed one after the other without consensus being reached in between. I have moved the INSERT into a teardown function for C* in order to ensure migrations are properly executed.
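The phase ordering described above can be sketched as follows. This is a hypothetical model, not the real Kong migration runner: `MIGRATION`, `run_migration`, and the `wait_for_schema_agreement` stub are invented names, and the runner is reduced to the one property that matters here — DDL-only statements run first, the runner blocks until the cluster converges, and only then does the DML phase execute.

```python
# Hypothetical migration shape: DDL-only string in "up", DML deferred to a
# teardown-style phase that runs only after schema agreement is confirmed.
MIGRATION = {
    "up": "CREATE TABLE IF NOT EXISTS sessions (id uuid PRIMARY KEY, data text);",
    "teardown": lambda execute: execute(
        "INSERT INTO sessions (id, data) VALUES (uuid(), 'seed');"
    ),
}

def run_migration(migration, execute, wait_for_schema_agreement):
    execute(migration["up"])        # 1. DDL first
    wait_for_schema_agreement()     # 2. block until every node agrees on the schema
    migration["teardown"](execute)  # 3. DML is now safe to run

executed = []
run_migration(
    MIGRATION,
    execute=executed.append,
    wait_for_schema_agreement=lambda: None,  # stub: a real driver would poll peers
)
assert executed[0].startswith("CREATE TABLE")
assert executed[1].startswith("INSERT")
```

The design point is that the agreement wait sits between the two phases in the runner itself, so a migration author cannot accidentally interleave DML with pending schema changes.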

@jeremyjpj0916
Contributor

I wonder if this occurs with the C* 3.x series as well; I have always had weird r/w timeouts during migrations at times.

@mikefero
Contributor Author

I wonder if this occurs with the C* 3.x series as well; I have always had weird r/w timeouts during migrations at times.

It does; I tested on a 3.11.10 cluster as well. It makes sense that it would occur on all flavors of C* since DDL and DML statements were mixed without waiting on consensus.

@bungle bungle force-pushed the fix/migrations-multi-node-cass-cluster branch 2 times, most recently from 0d395fd to a1c31b1 Compare October 7, 2021 12:50
@bungle bungle force-pushed the fix/migrations-multi-node-cass-cluster branch 7 times, most recently from b0549d3 to 9f4683b Compare October 7, 2021 14:22
When bootstrapping a multi-node Apache Cassandra cluster, the CREATE
TABLE response from the coordinator node does not guarantee that the
schema-altering statement has reached schema agreement across the
cluster. This update moves the insert statement to occur after the
CREATE TABLE statement so that schema agreement is reached before the
insert statement executes.
@bungle bungle force-pushed the fix/migrations-multi-node-cass-cluster branch from 9f4683b to e3c0c92 Compare October 7, 2021 14:26
@bungle bungle merged commit 67e5697 into master Oct 7, 2021
@bungle bungle deleted the fix/migrations-multi-node-cass-cluster branch October 7, 2021 14:43