
resultsets fetch implementation #22

Closed
gogosofta opened this issue Sep 15, 2016 · 14 comments

Comments

@gogosofta

gogosofta commented Sep 15, 2016

Is the resultset FETCH implemented based on this concept https://www.elastic.co/guide/en/elasticsearch/client/java-api/current/java-search-scrolling.html or is it restricted by the greatest of index.max_result_window and fetch.size ?
Currently I can't find a way to get the entire result set from a SELECT query if the result contains more than fetch.size (or index.max_result_window) rows because offsets are currently not implemented in sql4es.
Any hints/workarounds or plans for the future? :)

@corneversloot
Member

The fetch is implemented as a scroll and it will put fetch.size number of results within a single ResultSet. More results (if they exist) can be fetched using the Statement's getMoreResults() function, which is part of the JDBC specification. SQLWorkbench for example will execute this automatically and show a tab for each ResultSet (Result 1, Result 2, ..., Result N).
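The getMoreResults() pattern described here can be sketched as follows. FakeStatement is a hypothetical stand-in for java.sql.Statement (so the example runs without a driver or a live cluster), but the do/while loop is the same one a JDBC client would use to drain all ResultSets:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// FakeStatement is a hypothetical stand-in for java.sql.Statement that
// hands out fetch.size-sized batches, the way sql4es splits a scroll.
class FakeStatement {
    private final List<List<String>> batches = new ArrayList<>();
    private int current = 0;

    FakeStatement(List<String> allRows, int fetchSize) {
        for (int i = 0; i < allRows.size(); i += fetchSize) {
            batches.add(allRows.subList(i, Math.min(i + fetchSize, allRows.size())));
        }
    }

    List<String> getResultSet() { return batches.get(current); }

    // Mirrors Statement.getMoreResults(): true while another ResultSet exists.
    boolean getMoreResults() { return ++current < batches.size(); }
}

public class FetchAll {
    public static void main(String[] args) {
        FakeStatement st = new FakeStatement(
                Arrays.asList("a", "b", "c", "d", "e", "f", "g"), 3);
        List<String> all = new ArrayList<>();
        do {
            all.addAll(st.getResultSet());  // consume the current ResultSet
        } while (st.getMoreResults());      // advance to the next one, if any
        System.out.println(all.size());     // 7
    }
}
```

A non-programmable client cannot issue this loop itself, which is exactly the limitation raised in the next comment.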

@gogosofta
Author

Got it, but this can't be achieved through SQL alone, which makes the feature inaccessible to a non-programmable client (a simple SQL data source).

@corneversloot
Member

What exactly do you mean by 'means of SQL'? The getMoreResults is part of the JDBC specification and clients can use it (SQLWorkbench does).

Maybe it is better to approach this from another side: what are you trying to achieve, and with which client? Understanding this will enable me to help more efficiently, I think :)

As a side note, implementing OFFSET or LIMIT X, Y is not planned atm. The SQL parser used (from the Presto project) does not even support parsing these, I think.

@corneversloot
Member

corneversloot commented Sep 17, 2016

I have added an option (results.split) which can be used to control how results from Elasticsearch are put in ResultSet objects. The current default is that all results, no matter the fetch size, are put in a single ResultSet. Setting results.split to true will split the ResultSet into multiple ResultSet objects each containing at most fetch.size number of records.

I understand that it is not really an answer to your issue as it does not implement a way to use offset (which Elasticsearch does not support either). But I hope it helps :)

EDIT: Elasticsearch actually does support it, so it might be interesting to see if I can add it.
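Assuming the usual sql4es JDBC URL form (host, port, and index, with driver properties passed as query parameters; the host and index below are placeholders), the new option would be combined with fetch.size like this:

```
jdbc:sql4es://localhost:9300/myindex?results.split=true&fetch.size=10000
```

With this URL each ResultSet would hold at most 10000 records, and a JDBC client would move through them via getMoreResults().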

@gogosofta
Author

Sounds great! That's what I actually wanted.
The idea is to be able to get all records from a select into a single ResultSet - what you've said now is the default case.

@dmarkhas

dmarkhas commented Sep 18, 2016

In ESQueryState, I think the limit is not calculated correctly in lines 138-148.
If neither a fetch size nor a limit has been provided, the default value for limit is used, which is -1. I guess this is what causes Drill to return 10 results, since 10 is the default number of results ES returns unless specified otherwise (and -1 is not a valid number).
When using WorkbenchJ, I see that it is calling the method setMaxRows with 0 (which according to JDBC spec should mean no limit), which then causes determineLimit to return 0...
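A minimal sketch of the limit logic being described (method and parameter names are hypothetical, not the actual sql4es source): the buggy path lets the -1 sentinel or JDBC's 0 ("no limit") leak through as the requested size, while a fix has to fall back to an explicit value:

```java
public class LimitFix {
    // Buggy sketch: the -1 default or setMaxRows(0) leaks through unchanged,
    // so ES either falls back to its default of 10 rows or gets a size of 0.
    static int determineLimitBuggy(int queryLimit, int maxRows) {
        if (queryLimit > 0) return queryLimit;
        return maxRows; // may be 0 (JDBC "no limit") or -1 (nothing set)
    }

    // Fixed sketch: fall back to an explicit "fetch everything" value,
    // matching the fix described in the next comment.
    static int determineLimitFixed(int queryLimit, int maxRows) {
        if (queryLimit > 0) return queryLimit;
        if (maxRows > 0) return maxRows;
        return Integer.MAX_VALUE;
    }

    public static void main(String[] args) {
        System.out.println(determineLimitBuggy(-1, 0));  // 0
        System.out.println(determineLimitFixed(-1, 0));  // 2147483647
    }
}
```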

@corneversloot
Member

Wow, good call on this issue. I have implemented a fix to avoid setting the number of results to -1. If there is no limit or maxResults specified, the driver will fetch Integer.MAX_VALUE rows.

I have explored the option to use OFFSET using query.setFrom(N) and it turns out that its use is limited because Elasticsearch throws an exception when offset + limit > 10000. Given this limitation I have chosen not to implement OFFSET at this time. Offsetting a result will have to be done by the client by iterating over the results. Obviously this is not efficient but ES does not really provide the support for this scenario at this point.
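The exception he mentions comes from Elasticsearch's index.max_result_window setting, which defaults to 10000: a from/size request is rejected whenever from + size exceeds that window. For example, against a default-configured index a request like this fails, because 9500 + 1000 > 10000:

```json
{
  "from": 9500,
  "size": 1000,
  "query": { "match_all": {} }
}
```

The window can be raised per index, but deep paging this way gets increasingly expensive, which is why the scroll API is the recommended route for walking a full result set.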

@dmarkhas

Looks like there is another issue: when not specifying results.split/fetch.size in the URL, and with no limit provided in the query, I am getting an IndexOutOfBoundsException in Squirrel/WorkbenchJ/Drill.
There's no stack trace though, so I'm not sure where it is coming from yet, but I'll try to dig around some more.


@corneversloot
Member

Workbench's logs (help -> view logfile) could provide some information on the Exception.

One way this can happen is when your documents have more than 250 fields (including nested objects), which do not fit in the default array initialised to hold results. If this is the case you can increase the number of fields by setting the 'default.row.length' parameter.

@dmarkhas

I wonder if this is maybe coming back from ES itself? When I set a limit of 10,000 or 20,000, it doesn't happen and I get all results correctly. It only happens when the limit is really big (MAX_INT)...

I've also found this thread which seems related:
https://discuss.elastic.co/t/getting-java-lang-arrayindexoutofboundsexception-while-searching/35756

--Dan


@corneversloot
Member

Seems very likely. I actually assumed ES would fetch the MAX_INT results using the scroll in 10.000 increments, but I do not think that is the case. Setting the size to 10.000 and executing the scroll in the driver until the LIMIT number of results, or the end of the result set, has been reached should fix this.
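The batched approach he describes can be simulated without Elasticsearch. In this sketch, fetchPage is a stand-in for one scroll call (it is not the real ES client API), and the loop stops at either the LIMIT or the end of the scroll, whichever comes first:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Simulation of the batched-scroll approach: fetch fixed-size pages until
// LIMIT rows are collected or the source is exhausted.
public class ScrollLoop {
    static List<Integer> fetchAll(Function<Integer, List<Integer>> fetchPage,
                                  int limit) {
        List<Integer> rows = new ArrayList<>();
        int page = 0;
        while (rows.size() < limit) {
            List<Integer> batch = fetchPage.apply(page++);
            if (batch.isEmpty()) break;          // end of the scroll
            for (Integer r : batch) {
                if (rows.size() >= limit) break; // honour the LIMIT
                rows.add(r);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // A fake 25-row index served in pages of 10.
        List<Integer> data = new ArrayList<>();
        for (int i = 0; i < 25; i++) data.add(i);
        Function<Integer, List<Integer>> fetch = p ->
                data.subList(Math.min(p * 10, data.size()),
                             Math.min(p * 10 + 10, data.size()));
        System.out.println(fetchAll(fetch, Integer.MAX_VALUE).size()); // 25
        System.out.println(fetchAll(fetch, 7).size());                 // 7
    }
}
```

Because each page stays at the scroll size, Integer.MAX_VALUE as a limit no longer reaches Elasticsearch as a request size.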

@corneversloot
Member

I have rebuilt this part and results are now always fetched in batches using a scroll. Search results are added to a single ResultSet, or to multiple if configured to do so (by setting results.split or using setMaxRows(...) on the Statement). Splitting results is generally speaking the better approach since it avoids loading all results before handing them to the client. Tested up to 3.6M rows with and without specifying a limit. I must say it is not particularly fast, at least not on my laptop. In the end, Elasticsearch is not really built to just serve large numbers of records.

Feel free to use the latest 0.8.2.4 release to test it

@dmarkhas

Yep, seems to be OK now :)

--Dan


@aronica

aronica commented Mar 15, 2018

ES has supported offset since v0.9. I have added MySQL-style LIMIT with OFFSET support in my forks of presto-parser and es4sql; maybe this will be helpful.
