
Download/streaming speed optimization: initial experiments #3692

Closed · wants to merge 25 commits from improve-csv-download

Conversation

@dafeder (Member) commented Oct 11, 2021

The download endpoints on the datastore API are currently extremely slow. I believe this is due to the combination of two issues:

  1. The limit of 500 rows per loop in our streaming response creates high overhead.
  2. The use of OFFSET progressively slows responses as the offset value grows, because the database must read and discard every skipped row before returning results (see this explanation).

Regarding 1, we have already changed the datastore API to allow higher limits on row counts -- see #3689. For 2, this PR introduces a new pagination method for the streaming loop that uses conditions rather than offsets to find a starting place.

I wrote a very simple first pass at this; see this change to the loop. It captures the last record_number from one page of results and passes it as a WHERE record_number > $lastRowId condition on the next page. This will not work on queries that do not explicitly request the record_number column (rowIds=true) or that have any sorting set other than record_number ASC. A rough sketch of the approach is below.
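To make the approach concrete, here's a minimal sketch of the kind of loop I mean (not the actual DKAN code; the connection details, table name, and page size are placeholders for illustration):

```php
<?php
// Sketch of the condition-based ("keyset") streaming loop described above.
// Not the actual DKAN code: the connection details, table name, and page
// size are placeholders for illustration.

$pdo = new PDO('mysql:host=localhost;dbname=dkan', 'user', 'pass');
$pageSize = 20000;
$lastRowId = 0;
$out = fopen('php://output', 'w');

// Prepare once; only the :last parameter changes between iterations.
$stmt = $pdo->prepare(
    'SELECT * FROM datastore_table
     WHERE record_number > :last
     ORDER BY record_number ASC
     LIMIT ' . $pageSize
);

do {
    // Seek directly past the last row streamed instead of using OFFSET,
    // so the database can use the record_number index and each page costs
    // about the same no matter how deep into the table we are.
    $stmt->execute([':last' => $lastRowId]);
    $rows = $stmt->fetchAll(PDO::FETCH_ASSOC);

    foreach ($rows as $row) {
        $lastRowId = $row['record_number'];
        fputcsv($out, $row); // CSV header row omitted for brevity.
    }
} while (count($rows) === $pageSize);

fclose($out);
```

The results are very impressive. I ran some benchmarks with a 460 MB dataset, trying both a 500-row limit and a 20,000-row limit with both the old (offset) and new methods. This was on a local Docker-based development environment, so network speed was eliminated as a variable.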

| Limit  | Offset | Duration |
|--------|--------|----------|
| 500    | ✔️     | 3:31:05  |
| 500    |        | 2:04     |
| 20,000 | ✔️     | 6:52     |
| 20,000 |        | 1:18     |

Clearly, the row count per loop iteration is the most significant factor. We see a progressive -- possibly even exponential -- loss of speed with each iteration. While download speeds were similar for the first 30 MB or so of transfer on all tests, on the first one -- which mimics the current DKAN behavior -- the speed was outrageously slow past the 100 MB mark. On the test that still used the old offset behavior but increased the row limit to 20,000, we saw the same progressive slowdown, but there were few enough iterations that the download could still complete in a reasonable amount of time.

On both tests with the new condition-based pagination, the download speed was consistent throughout the whole download.

Issues

  • The current rowIds logic will not work with this -- we'll need to remove the record_number column later in the request.
  • We will not be able to allow sorts, limits, or offsets in streaming downloads. It's unclear whether we should throw an error or just strip them out.
  • We could allow sorts as well as hiding record_number, but that would require a more serious rewrite of the streaming code to analyze the entire query object and add new conditions for each column used for sorting, not just record_number (see the sketch after this list).
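To sketch what those per-column conditions could look like (a hypothetical helper, not code from this PR): for a sort of city ASC with record_number ASC as a unique tie-breaker, the next page has to start strictly after the last streamed row, which takes a compound condition rather than a single record_number comparison.

```php
<?php
// Hypothetical sketch: build a keyset WHERE clause for an arbitrary list
// of ascending sort columns, resuming strictly after $lastRow. For
// ['city', 'record_number'] this produces:
//   (city > :k0_0) OR (city = :k1_0 AND record_number > :k1_1)
function keysetCondition(array $sortColumns, array $lastRow): array
{
    $clauses = [];
    $params = [];
    foreach ($sortColumns as $i => $column) {
        // All earlier sort columns must equal their last-seen values...
        $parts = [];
        foreach (array_slice($sortColumns, 0, $i) as $j => $prev) {
            $parts[] = "$prev = :k{$i}_{$j}";
            $params[":k{$i}_{$j}"] = $lastRow[$prev];
        }
        // ...and this column must be strictly greater.
        $parts[] = "$column > :k{$i}_{$i}";
        $params[":k{$i}_{$i}"] = $lastRow[$column];
        $clauses[] = '(' . implode(' AND ', $parts) . ')';
    }
    return [implode(' OR ', $clauses), $params];
}

// Example: resume a stream sorted on city, then record_number.
[$where, $params] = keysetCondition(
    ['city', 'record_number'],
    ['city' => 'Boston', 'record_number' => 123456]
);
```

Descending sorts would flip the comparison operators, which is part of why supporting sorts would need that more serious rewrite.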

@dafeder changed the title from "Initial experiments" to "Download/streaming speed optimization: initial experiments" on Oct 11, 2021
@grugnog (Member) commented Oct 11, 2021

@dafeder see also #3646, which has an optimization that will improve queries that sort on another field -- it still won't be as fast as sorting on record_number, of course.

@dafeder (Member, Author) commented Oct 11, 2021

@grugnog yeah, perhaps I'm forgetting something from the conversation that led to that idea, but I'm now thinking that, at least for datastore-specific queries, we should be able to put together all the appropriate pagination conditions based on the last result of the previous page. I think this would be faster, and it could all be done by modifying the datastore query object without breaking its schema validation, rather than grafting on a self-join at a lower level like that.
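As a rough sketch of what I mean (the exact "conditions"/"sorts" key names here are assumptions about the datastore query schema, not verbatim DKAN):

```php
<?php
// Sketch only: rewrite a decoded datastore query array between streaming
// pages. Key names ("conditions", "sorts", "offset") are assumptions
// about the query schema, for illustration.
function nextPageQuery(array $query, array $lastRow): array
{
    // Pagination is condition-based now, so drop any user-supplied offset.
    unset($query['offset']);

    // Seek past the last streamed row instead of counting rows.
    $query['conditions'][] = [
        'property' => 'record_number',
        'value' => $lastRow['record_number'],
        'operator' => '>',
    ];

    // Pin the sort so the seek condition stays valid on every page.
    $query['sorts'] = [
        ['property' => 'record_number', 'order' => 'asc'],
    ];

    return $query;
}
```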

@dafeder (Member, Author) commented Oct 11, 2021

(I think it's something we would do only for streaming responses; I'm imagining a second controller class for streaming endpoints that would handle the incoming query differently than the standard JSON API response, which could probably keep handling a normal offset for a single query.)

@dafeder (Member, Author) commented Oct 20, 2021

Closing in favor of #3700

@dafeder closed this Oct 20, 2021
@dafeder deleted the improve-csv-download branch Oct 20, 2021
@dafeder linked an issue Oct 27, 2021 that may be closed by this pull request: Optimize large OFFSET queries on large datastore tables