Download/streaming speed optimization: initial experiments #3692
Conversation
@grugnog yeah, perhaps I am forgetting something from the conversation that led to that idea, but I'm now thinking that, at least for datastore-specific queries, we should be able to put together all the appropriate pagination conditions based on the last result of the previous page. This would, I think, be faster, and would all be doable by modifying the datastore query object without breaking its schema validation, as opposed to adding something lower-level to graft on a self-join like this.
(I think it's something we would do only for streaming responses; I'm imagining a second controller class for streaming endpoints that would handle the incoming query differently than the standard JSON API response, which could probably handle a normal offset for a single query.)
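A hedged sketch of what that could look like, assuming the query object carries a `conditions` array; the property names and condition shape here are illustrative, not DKAN's actual query schema:

```php
<?php

// Hypothetical sketch: build the next page's pagination condition from
// the last result of the previous page by modifying the datastore
// query object. Assumes $query->conditions is an array of condition
// objects; none of these names are confirmed DKAN schema.
function addPaginationCondition(object $query, array $lastRow): object {
  // Drop any offset; the seek condition replaces it entirely.
  unset($query->offset);

  // For results ordered by record_number ASC, seek past the last row
  // that was already streamed.
  $query->conditions[] = (object) [
    'property' => 'record_number',
    'value' => $lastRow['record_number'],
    'operator' => '>',
  ];

  return $query;
}
```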
Closing in favor of #3700
The download endpoints on the datastore API are currently extremely slow. I believe this is due to the combination of two issues:

1. The row limit per query is low, so a full download requires many iterations of the streaming loop.
2. Paginating with `OFFSET` will progressively slow down the responses the higher the offset value is (see this explanation).

Regarding 1, we have already changed the datastore API to allow higher limits on row counts; see #3689. For 2, this PR introduces a new pagination method for the streaming loop that uses conditions rather than offsets to find a starting place.
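To illustrate the difference (a sketch only; the table name and values are made up), `OFFSET` forces the database to scan and discard every preceding row, while a condition on `record_number` can seek straight to the starting row via an index:

```php
<?php

// Old style: every page makes the database walk past all earlier rows,
// so each iteration gets slower. (Table name is hypothetical.)
$offsetSql = "SELECT * FROM datastore_table
              ORDER BY record_number ASC
              LIMIT 500 OFFSET 250000";

// New style: an indexed comparison jumps directly to the starting row,
// so every page costs roughly the same.
$conditionSql = "SELECT * FROM datastore_table
                 WHERE record_number > 250000
                 ORDER BY record_number ASC
                 LIMIT 500";
```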
I wrote a very simple first pass at this; see this change to the loop. It captures the last `record_number` from one page of results and passes it as a `WHERE record_number > $lastRowId` condition in the next page. This will not work on queries that do not explicitly request the `record_number` column (`rowIds=true`) or that have any sorting set other than `record_number ASC`.
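A minimal sketch of that loop, assuming a hypothetical `fetchPage()` helper that runs a parameterized query and returns rows as associative arrays (the real change lives in DKAN's streaming code):

```php
<?php

// Minimal sketch of the condition-based streaming loop. fetchPage()
// is a hypothetical stand-in for DKAN's actual query layer.
$lastRowId = 0;
$limit = 20000;

do {
  // Seek past the last record_number instead of using OFFSET.
  $rows = fetchPage(
    "SELECT * FROM datastore_table
     WHERE record_number > :last
     ORDER BY record_number ASC
     LIMIT :limit",
    [':last' => $lastRowId, ':limit' => $limit]
  );

  foreach ($rows as $row) {
    // Stream the row out (e.g. as a CSV line) here, then remember
    // where we are for the next page.
    $lastRowId = $row['record_number'];
  }
} while (count($rows) === $limit);
```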
The results are very impressive. I ran some benchmarks with a 460MB dataset, trying both a 500-row limit and a 20,000-row limit with both the old (offset) and new methods. This was on a local docker-based development environment, so network speed was eliminated as a variable.

Clearly, the row count per loop iteration is the most significant factor. We see a progressive, possibly even exponential, loss in speed with each iteration. While the download speeds were similar for the first 30MB or so of transfer on all tests, on the first one, which mimics the current DKAN behavior, the speed was outrageously slow past the 100MB mark. On the test that still used the old offset behavior but increased the row limit to 20,000, we saw the same progressive slowdown, but there were few enough iterations that the download was still able to complete in a reasonable amount of time.

On both tests with the new condition-based pagination, the download speed was consistent throughout the whole download.
Issues
This approach could in principle be extended to sorts on columns other than `record_number`, but that would need a more serious rewrite of the streaming code to analyze the entire query object and add new conditions for each column used for sorting, not just `record_number`. A sketch of that kind of condition follows.
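For a hypothetical sort of `city ASC, record_number ASC`, the seek condition would have to compare the full sort tuple rather than a single column; a hedged sketch:

```php
<?php

// Hypothetical seek condition for a two-column sort
// (ORDER BY city ASC, record_number ASC). Each additional sort column
// adds a clause that pins the earlier columns to the last row's values.
$seekSql = "(city > :lastCity)
         OR (city = :lastCity AND record_number > :lastRecordNumber)";
```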