
Download/streaming speed optimization: initial experiments #3692

Closed · wants to merge 25 commits from improve-csv-download

Conversation

@dafeder (Member) commented Oct 11, 2021

The download endpoints on the datastore API are currently extremely slow. I believe this is due to the combination of two issues:

  1. The limit of 500 rows per loop in our streaming response creates high overhead.
  2. The use of OFFSET progressively slows responses as the offset value grows, because the database must read and discard every skipped row before returning results (see this explanation).

Regarding 1, we have already changed the datastore API to allow higher limits on row counts -- see #3689. For 2, this PR introduces a new pagination method for the streaming loop that uses conditions rather than offsets to find a starting place.

I wrote a very simple first pass at this; see this change to the loop. It captures the last record_number from one page of results and passes it as a WHERE record_number > $lastRowId condition on the next page. This will not work on queries that do not explicitly request the record_number column (rowIds=true) or that have any sorting set other than record_number ASC. A rough sketch of the approach is below.
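To make the approach concrete, here's a minimal sketch of the kind of loop I mean (not the actual DKAN code; the connection details, table name, and page size are placeholders for illustration):

```php
<?php
// Sketch of the condition-based ("keyset") streaming loop described above.
// Not the actual DKAN code: the connection details, table name, and page
// size are placeholders for illustration.

$pdo = new PDO('mysql:host=localhost;dbname=dkan', 'user', 'pass');
$pageSize = 20000;
$lastRowId = 0;
$out = fopen('php://output', 'w');

// Prepare once; only the :last parameter changes between iterations.
$stmt = $pdo->prepare(
    'SELECT * FROM datastore_table
     WHERE record_number > :last
     ORDER BY record_number ASC
     LIMIT ' . $pageSize
);

do {
    // Seek directly past the last row streamed instead of using OFFSET,
    // so the database can use the record_number index and each page costs
    // about the same no matter how deep into the table we are.
    $stmt->execute([':last' => $lastRowId]);
    $rows = $stmt->fetchAll(PDO::FETCH_ASSOC);

    foreach ($rows as $row) {
        $lastRowId = $row['record_number'];
        fputcsv($out, $row); // CSV header row omitted for brevity.
    }
} while (count($rows) === $pageSize);

fclose($out);
```

The results are very impressive. I ran some benchmarks with a 460 MB dataset, trying both a 500-row limit and a 20,000-row limit with both the old (offset) and new methods. This was on a local Docker-based development environment, so network speed was eliminated as a variable.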

| Limit  | Offset | Duration |
|--------|--------|----------|
| 500    | ✔️     | 3:31:05  |
| 500    |        | 2:04     |
| 20,000 | ✔️     | 6:52     |
| 20,000 |        | 1:18     |

Clearly, the row count per loop iteration is the most significant factor. We see a progressive -- possibly even exponential -- loss of speed with each iteration. While download speeds were similar for the first 30 MB or so of transfer on all tests, on the first one -- which mimics the current DKAN behavior -- the speed was outrageously slow past the 100 MB mark. On the test that still used the old offset behavior but increased the row limit to 20,000, we saw the same progressive slowdown, but there were few enough iterations that the download could still complete in a reasonable amount of time.

On both tests with the new condition-based pagination, the download speed was consistent throughout the whole download.

Issues

  • The current rowIds logic will not work with this -- we'll need to remove the record_number column later in the request.
  • We will not be able to allow sorts, limits, or offsets in streaming downloads. It's unclear whether we should throw an error or just strip them out.
  • We could allow sorts as well as hiding record_number, but that would require a more serious rewrite of the streaming code to analyze the entire query object and add new conditions for each column used for sorting, not just record_number (see the sketch after this list).
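To sketch what those per-column conditions could look like (a hypothetical helper, not code from this PR): for a sort of city ASC with record_number ASC as a unique tie-breaker, the next page has to start strictly after the last streamed row, which takes a compound condition rather than a single record_number comparison.

```php
<?php
// Hypothetical sketch: build a keyset WHERE clause for an arbitrary list
// of ascending sort columns, resuming strictly after $lastRow. For
// ['city', 'record_number'] this produces:
//   (city > :k0_0) OR (city = :k1_0 AND record_number > :k1_1)
function keysetCondition(array $sortColumns, array $lastRow): array
{
    $clauses = [];
    $params = [];
    foreach ($sortColumns as $i => $column) {
        // All earlier sort columns must equal their last-seen values...
        $parts = [];
        foreach (array_slice($sortColumns, 0, $i) as $j => $prev) {
            $parts[] = "$prev = :k{$i}_{$j}";
            $params[":k{$i}_{$j}"] = $lastRow[$prev];
        }
        // ...and this column must be strictly greater.
        $parts[] = "$column > :k{$i}_{$i}";
        $params[":k{$i}_{$i}"] = $lastRow[$column];
        $clauses[] = '(' . implode(' AND ', $parts) . ')';
    }
    return [implode(' OR ', $clauses), $params];
}

// Example: resume a stream sorted on city, then record_number.
[$where, $params] = keysetCondition(
    ['city', 'record_number'],
    ['city' => 'Boston', 'record_number' => 123456]
);
```

Descending sorts would flip the comparison operators, which is part of why supporting sorts would need that more serious rewrite.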

@dafeder changed the title from "Initial experiments" to "Download/streaming speed optimization: initial experiments" on Oct 11, 2021
@grugnog (Member) commented Oct 11, 2021

@dafeder see also #3646, which has an optimization that will improve queries that sort on another field -- it still won't be as fast as sorting on record_number, of course.

@dafeder (Member, Author) commented Oct 11, 2021

@grugnog yeah, perhaps I'm forgetting something from the conversation that led to that idea, but I'm now thinking that, at least for datastore-specific queries, we should be able to put together all the appropriate pagination conditions based on the last result of the previous page. I think this would be faster, and it could all be done by modifying the datastore query object without breaking its schema validation, rather than grafting on a self-join at a lower level like that.
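As a rough sketch of what I mean (the exact "conditions"/"sorts" key names here are assumptions about the datastore query schema, not verbatim DKAN):

```php
<?php
// Sketch only: rewrite a decoded datastore query array between streaming
// pages. Key names ("conditions", "sorts", "offset") are assumptions
// about the query schema, for illustration.
function nextPageQuery(array $query, array $lastRow): array
{
    // Pagination is condition-based now, so drop any user-supplied offset.
    unset($query['offset']);

    // Seek past the last streamed row instead of counting rows.
    $query['conditions'][] = [
        'property' => 'record_number',
        'value' => $lastRow['record_number'],
        'operator' => '>',
    ];

    // Pin the sort so the seek condition stays valid on every page.
    $query['sorts'] = [
        ['property' => 'record_number', 'order' => 'asc'],
    ];

    return $query;
}
```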

@dafeder (Member, Author) commented Oct 11, 2021

(I think it's something we would do only for streaming responses; I'm imagining a second controller class for streaming endpoints that would handle the incoming query differently than the standard JSON API response, which could probably keep handling a normal offset for a single query.)

@dafeder (Member, Author) commented Oct 20, 2021

Closing in favor of #3700

@dafeder closed this Oct 20, 2021
@dafeder deleted the improve-csv-download branch Oct 20, 2021
@dafeder linked an issue Oct 27, 2021 that may be closed by this pull request: Optimize large OFFSET queries on large datastore tables