Unbuffered datastore queries for download #3703

Merged
merged 7 commits into from
Nov 8, 2021

@dafeder dafeder commented Oct 28, 2021

Another approach to #3646. Previously we tried to optimize CSV streaming by breaking the query into multiple chunks, but that does not scale once a query sorts on any column other than the primary key. Sorting a large table is always slow, and the sort has to be repeated for every chunk. This may add only a few seconds per chunk, but as tables grow, both the time per sort and the number of times it is repeated increase.
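To make the cost concrete, here is a rough sketch of the chunked approach. It is not DKAN code: it uses Python's built-in sqlite3 as a stand-in for MySQL, and the table name, column names, and the `chunked_download` helper are all made up for illustration. The point is only that every page issues its own `ORDER BY` query.

```python
import sqlite3


def make_table(rows=1000):
    """Build an in-memory table whose 'city' column is not the primary key."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE datastore (record_number INTEGER PRIMARY KEY, city TEXT)"
    )
    conn.executemany(
        "INSERT INTO datastore (city) VALUES (?)",
        ((f"city-{(i * 7919) % rows:05d}",) for i in range(rows)),
    )
    return conn


def chunked_download(conn, chunk_size):
    """Fetch every row sorted by a non-key column, one LIMIT/OFFSET page at a time."""
    rows, queries, offset = [], 0, 0
    while True:
        page = conn.execute(
            "SELECT record_number, city FROM datastore "
            "ORDER BY city LIMIT ? OFFSET ?",
            (chunk_size, offset),
        ).fetchall()
        queries += 1  # each page is a separate query that sorts again
        if not page:
            break
        rows.extend(page)
        offset += chunk_size
    return rows, queries


conn = make_table(1000)
rows, queries = chunked_download(conn, chunk_size=100)
print(len(rows), queries)
```

With 1000 rows and a chunk size of 100, this issues eleven sorted queries (ten pages plus the final empty one) to deliver one download.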

A better solution may be to use unbuffered queries: run the entire query just once, then stream the results directly from the database server. By default, Drupal and most other PHP/MySQL projects use buffered queries, meaning the entire result set is transferred to PHP as soon as the query executes.
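As a rough illustration of the "run once, then stream" idea, the sketch below executes a sorted query a single time and yields CSV text row by row as the cursor is iterated. It is an analogy rather than the PR's implementation: it uses Python's sqlite3 instead of PHP/PDO, and `stream_csv` and the table layout are hypothetical names. Calling `fetchall()` on the cursor instead would be the rough equivalent of a buffered query.

```python
import csv
import io
import sqlite3


def stream_csv(conn, query):
    """Yield CSV text one line at a time from a single execution of `query`."""
    cursor = conn.execute(query)  # the query, including its sort, runs once
    buffer = io.StringIO()
    writer = csv.writer(buffer)

    # Header row comes from the cursor's column metadata.
    writer.writerow([col[0] for col in cursor.description])
    yield buffer.getvalue()

    for row in cursor:  # rows are pulled from the database as we iterate
        buffer.seek(0)
        buffer.truncate()  # reuse the buffer so only one row is held at a time
        writer.writerow(row)
        yield buffer.getvalue()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE datastore (record_number INTEGER PRIMARY KEY, city TEXT)"
)
conn.executemany(
    "INSERT INTO datastore (city) VALUES (?)",
    [("Oslo",), ("Lima",), ("Cairo",)],
)

chunks = list(
    stream_csv(conn, "SELECT record_number, city FROM datastore ORDER BY city")
)
```

In a real download handler, each yielded chunk would be written to the response as it is produced rather than collected into a list.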

Here, we create a second database connection object based on the default one, but with the PDO::MYSQL_ATTR_USE_BUFFERED_QUERY attribute set to false. All datastore module database operations will use this connection, which should be fine, but we should look out for any unexpected side effects.

Known issues, misc

This adds to the pieces of DKAN that assume MySQL is the underlying database. It would be good to have a fallback for other databases, or to make it more obvious how one might add support for PostgreSQL or other PDO drivers. Most database backends have some equivalent to this cursor-based fetching, but it's not standardized in PDO.
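One possible shape for such a fallback, sketched in Python rather than PHP and not taken from this PR: copy the default connection settings, and only add the unbuffered option when the driver is MySQL. The option key, the settings layout, and `datastore_connection_info` are all hypothetical; the key merely stands in for PDO::MYSQL_ATTR_USE_BUFFERED_QUERY.

```python
import copy

# Hypothetical option name standing in for PDO::MYSQL_ATTR_USE_BUFFERED_QUERY.
UNBUFFERED_OPTION = "mysql_use_buffered_query"


def datastore_connection_info(default_info):
    """Return a copy of the default connection settings for datastore use."""
    info = copy.deepcopy(default_info)  # never mutate the default connection
    if info.get("driver") == "mysql":
        # Turn buffering off only where we know the option exists.
        info.setdefault("options", {})[UNBUFFERED_OPTION] = False
    # Any other driver falls back to its default (buffered) behavior.
    return info


mysql_default = {"driver": "mysql", "database": "dkan"}
pgsql_default = {"driver": "pgsql", "database": "dkan"}

mysql_info = datastore_connection_info(mysql_default)
pgsql_info = datastore_connection_info(pgsql_default)
```

A driver-specific branch like this would also be the natural place to add an equivalent cursor-based setting for another backend later.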

QA Steps

Coming... in general, just download some big CSVs and make sure they work!

@dafeder dafeder changed the title Unbuffered datastore queries Unbuffered datastore queries for downlaod Oct 28, 2021
@dafeder dafeder changed the title Unbuffered datastore queries for downlaod Unbuffered datastore queries for download Oct 28, 2021
@dafeder dafeder marked this pull request as ready for review October 28, 2021 17:11

dafeder commented Oct 28, 2021

Testing notes:

  1. Tests have been copied over from the previous streaming CSV PR (Change to keyset pagination for streaming large CSVs #3700) and could probably be cleaned up to better fit what we're testing here.
  2. We need quite a bit more test coverage for the database/query code.

dafeder commented Oct 28, 2021

Note: This does seem to be working well, much faster than previous pagination-based approaches in all cases, but needs some more testing against real world data and queries.

@janette janette added this to In progress in DKAN 2 Development Nov 8, 2021
@janette janette merged commit 0eed854 into 2.x Nov 8, 2021
DKAN 2 Development automation moved this from In progress to Done Nov 8, 2021
@clayliddell clayliddell deleted the unbuffered branch November 22, 2021 20:34
Development

Successfully merging this pull request may close these issues.

Optimize large OFFSET queries on large datastore tables