Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use DAO layer instead of raw 'SELECT *' during migration #3532

Closed
wants to merge 1 commit into from

Conversation

pamiel
Copy link
Contributor

@pamiel pamiel commented Jun 12, 2018

Summary

This PR directly addresses issue #3392 (and indirectly addresses some posts of issue #3005)

Step 2017-01-24-132600_upstream_timeouts_2 of the migration process starts with a selection of all rows of the apis table (https://github.com/Kong/kong/blob/master/kong/dao/migrations/cassandra.lua#L417). However:

  • This comes just after step 2017-01-24-132600_upstream_timeouts that just altered the same apis table
  • And the selection is done by directly using a raw SELECT * FROM apis query, so not using the DAO layer... so not taking benefit on the "retry on read failure" policy nor of the "wait schema consensus" provided by this DAO layer.

Working on the issue with a Cassandra DBA, the analysis is that:

  • As working on a multi-nodes Cassandra cluster, the table alteration might not have been propagated properly on all nodes when the raw SELECT * is run, generating a read error (most probably due to an inconsistency on the schema of the queried node)... and unfortunately, no retry is implemented on this read error as this is a raw query that is done.
  • Solution is just to use local rows, err = dao.apis:find_all() instead of the local rows, err = dao.db:query([[ SELECT * FROM apis; ]]): the schema consensus is obtained and even if a read error occurs, a retry is performed by the DAO layer.

Full changelog

  • use local rows, err = dao.apis:find_all() instead of the local rows, err = dao.db:query([[ SELECT * FROM apis; ]]) in kong/doa/migrations/cassandra.lua (line 417)

Sorry, it not really clear on my fork: it mentions some "old" commits I already pushed a couple of month ago... but it says that the only updated file is kong/doa/migrations/cassandra.lua (which is correct).
Can you cross check that from your side the updates included in this PR are limited to this file only? Thanks a lot!

Issues resolved

Fix #3392

@thibaultcha
Copy link
Member

Hi,

Thank you for the patch. It seems necessary, and could be shipped in our upcoming 0.14 (for which the merge window closes this Friday).

However, it is actually a deliberate decision not to use the DAO in the migrations.

a retry is performed by the DAO layer.

The DAO has no such retry mechanism. The driver does.

The DAO should not be used in migrations because of schema changes evolving over time, which results in other errors. Unfortunately this guideline is not always respected. Another reason not to is because the migrations will elect one coordinator to run all of the migrations, in order to avoid that very issue you are referring to. Since the DAO will load balance between nodes, it increases the chances of hitting a node on which the schema has not being propagated yet. In this particular case, the apis column family has a wide partition key, and rows will thus be scattered across all nodes.

Ensuring the schema reaches consensus is done, but only after all of the migrations finished running, see: https://github.com/Kong/kong/blob/master/kong/dao/factory.lua#L458-L464

In this case, we should probably wait for the schema consensus before making the subsequent query (but this query should still be executed without using the DAO).

Sorry, it not really clear on my fork: it mentions some "old" commits I already pushed a couple of month ago...

We will need you to rebase this branch on top of our current master branch. Your PR should only consist of a single commit. See the part of CONTRIBUTING.md about this topic.

@thibaultcha thibaultcha added the pr/changes requested Changes were requested to this PR by a maintainer. Please address them and ping back once done. label Jun 12, 2018
@pamiel
Copy link
Contributor Author

pamiel commented Jun 13, 2018

Hi @thibaultcha,

I understand the risk you mention by using the DOA layer... but the various attempts we made with our 4 nodes cluster (migration launched with a non-existing kong keyspace => db is empty) highlights that:

  • When using "SELECT *" query, then the issue happens roughly in 50% of the migration attempts
  • While when using the find_all method... then we never had any failure there!

So, even if we can understand you analysis, the test results we have are at the opposite of the analysis ! Oops :(
We tried to identify the difference between running dao.db:query and dao.apis:find_all()and we did not find any valuable difference (all executed by a same Coordinator, with almost the same series of actions)=> its is really difficult to understand the results we have!

The DAO should not be used in migrations [...] Unfortunately this guideline is not always respected.

For sure, we tried the find_all usage… because many of the migration steps are already using this method, including the api rows update that is performed just after the select we are talking about!... So:

  • Do you mean that in order to respect the guideline, we should transform the whole function defined for this up step of the 2017-01-24-132600_upstream_timeouts_2 migration into a table of CQL commands as it is done in 2017-01-24-132600_upstream_timeouts ? But if so, how to loop on the rows to perform the updates using CQL commands only?
  • Or do you mean that the way the function is coded today is fine, but before the call to dao.db:query([[SELECT * FROM apis;]]), we should insert a piece of code to wait schema consensus first (with code similar to what is written here https://github.com/Kong/kong/blob/master/kong/dao/factory.lua#L458-L464 ?), as you also mentioned:

In this case, we should probably wait for the schema consensus before making the subsequent query (but this query should still be executed without using the DAO).

  • Or do you even mean that we should avoid any usage of the dao object in the migration function? However, the execution of the SELECT * FROM apis clause is also using the DAO layer as it is done through the call to the dao.db:query function. So we can keep a function for the implementation of the 2017-01-24-132600_upstream_timeouts_2 migration step, be we need to perform the select query without even using dao.db. If so, could you advice how to do this?

Last remark:

Since the DAO will load balance between nodes, it increases the chances of hitting a node on which the schema has not being propagated yet

A DBA told me that your explanation definitely makes sense for potential consistency issues related to data updates, buy maybe it is not for schema updates because schema propagation looks to be done using maximum consistency level (and we are using the LOCAL_QUORUM consistency). Any idea on that?

Thanks for your help!

@thibaultcha
Copy link
Member

@pamiel Yes, I understand that the approach you used here fix the issue, but it isn't the ideal solution, because the DAO should not be used in migrations (for several aforementioned reasons). It fixes the issue because find_all() performs such a consensus wait, as I suggested you implement in the migration (without using the DAO0.

Or do you mean that the way the function is coded today is fine, but before the call to dao.db:query([[SELECT * FROM apis;]]), we should insert a piece of code to wait schema consensus first

Yes, this is what I mean.

the execution of the SELECT * FROM apis clause is also using the DAO layer as it is done through the call to the dao.db:query function

No, this goes directly to the driver and skips the DAO.

A DBA told me that your explanation definitely makes sense for potential consistency issues related to data updates, buy maybe it is not for schema updates because schema propagation looks to be done using maximum consistency level (and we are using the LOCAL_QUORUM consistency). Any idea on that?

Well, as per the Cassandra error observed in #3392, and because of the nature of the fix (waiting for schema consensus), this is obviously the culprit here.

@pamiel
Copy link
Contributor Author

pamiel commented Jun 14, 2018

Ok, understood.
Schema consensus check added before the SELECT *.

Does this look ok for you ?

Thanks for your help.

@thibaultcha
Copy link
Member

@pamiel Yes, this looks good. Could you please rebase your branch on top of master? We would need your PR to be made of one single commit implementing the change, as stated previously. We do not have enough time to cleanup the git history for you, sorry.

Thanks!

@pamiel
Copy link
Contributor Author

pamiel commented Jun 15, 2018

After a couple of hours fighting against git, I think I properly rebased with master and squashed all the commits to a single commit.
Tell me is something wrong still remains !

@thibaultcha
Copy link
Member

@pamiel Yes, that is great, nicely done.

thibaultcha pushed a commit that referenced this pull request Jun 15, 2018
@thibaultcha
Copy link
Member

Manually merged to master. Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pr/changes requested Changes were requested to this PR by a maintainer. Please address them and ping back once done.
Projects
None yet
2 participants