
Switch to using Point in Time search operations instead of Scrolls #1925

Open

jbaiera opened this issue Mar 2, 2022 · 6 comments

Comments

@jbaiera (Member) commented Mar 2, 2022

The scroll API is no longer recommended for scanning all of the data in an index. The recommended read pattern is now the search_after API in combination with a point in time (PIT).

Scrolled searches are less optimal than search_after due to a combination of resource consumption concerns and their single-use read pattern. If a scroll request succeeds on the server but the response is discarded by the client because of a timeout, the server-side cursor has already advanced past the missed documents and cannot be rolled back to replay them; a new scroll must be created instead.

Search after with a PIT gives us some benefits over scrolls as well. Aside from the stable nature of its reads, it is compatible with composite aggregations, which is the basis for streaming aggregate results.
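For reference, a minimal sketch of this read pattern (using the elasticsearch-py 8.x client; the index name, page size, and the print call are placeholders):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Open a point in time so that every page sees the same view of the index.
pit_id = es.open_point_in_time(index="my-index", keep_alive="1m")["id"]
search_after = None

try:
    while True:
        kwargs = dict(
            size=1000,
            pit={"id": pit_id, "keep_alive": "1m"},
            # _shard_doc is the built-in tiebreaker sort for PIT searches.
            sort=[{"_shard_doc": "asc"}],
        )
        if search_after is not None:
            kwargs["search_after"] = search_after
        resp = es.search(**kwargs)

        hits = resp["hits"]["hits"]
        if not hits:
            break
        for hit in hits:
            print(hit["_id"])  # stand-in for real processing

        # Resume from the last hit's sort values. Unlike a scroll cursor, a page
        # can simply be re-requested if the client missed the response.
        search_after = hits[-1]["sort"]
        pit_id = resp["pit_id"]  # the server may return a refreshed PIT id
finally:
    es.close_point_in_time(id=pit_id)
```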

@matan129 commented

Any ideas when this will get pushed through?
Thanks

@masseyke (Member) commented

We don't currently have any plans to do it. After some investigation we found that PITs are missing the ability to target only specific indices and shards, which es-hadoop needs in order to be performant. That capability is not currently on the roadmap for PITs, but for testing we put in a placeholder (a hack) for it, and found that PITs offered no advantages over scrolls -- there was no performance improvement and no new functionality was enabled.
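For context, the shard targeting that scroll-based reads rely on today looks roughly like this -- each task restricts its search to its own shard via the preference parameter (a sketch with the elasticsearch-py client; the index name and shard number are placeholders):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

shard = 0  # the shard assigned to this task / input split
resp = es.search(
    index="my-index",
    preference=f"_shards:{shard}",  # restrict the search to a single shard
    scroll="5m",
    size=1000,
    query={"match_all": {}},
)
print(len(resp["hits"]["hits"]), "docs from shard", shard)

# Subsequent pages come from the scroll cursor; once a page is consumed it
# cannot be replayed.
scroll_id = resp["_scroll_id"]
resp = es.scroll(scroll_id=scroll_id, scroll="5m")
scroll_id = resp["_scroll_id"]
es.clear_scroll(scroll_id=scroll_id)
```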

@masseyke (Member) commented

We had initially thought that one big advantage of PITs would be that we could create them in the partitioner at the start of the job and have a consistent view of the data for the whole job, but we realized that if we did that we'd have no way to close the PIT in Spark, because there's no way to be notified when an RDD is complete (there's not really such a notion -- no equivalent to RecordReader.close() in MapReduce). So we wound up having to push the PIT creation out to the tasks like we do for scrolls, taking away that advantage. (I'm working from memory of events more than 7 months ago, so I might have some of the details wrong.)

@matan129 commented Nov 23, 2022

Got it. Thanks for the detailed response!

I was asking because I'm interested in making multiple queries from Spark, and it would be very nice if I could have assured point-in-time behaviour across them.

i.e. scrolling is fine for a single query, but I don't see a way to coordinate a consistent view across multiple queries without the PIT API.
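Roughly what I have in mind -- one PIT shared by several otherwise independent queries, so they all observe the same snapshot of the data (a sketch with the elasticsearch-py client; the index and queries are made up):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

pit_id = es.open_point_in_time(index="my-index", keep_alive="10m")["id"]
try:
    # Two unrelated queries against the same PIT; the index is implied by the
    # PIT, so no index is passed to search().
    errors = es.search(
        pit={"id": pit_id, "keep_alive": "10m"},
        query={"term": {"level": "error"}},
        size=100,
    )
    by_host = es.search(
        pit={"id": pit_id, "keep_alive": "10m"},
        aggregations={"by_host": {"terms": {"field": "host"}}},
        size=0,
    )
    print(errors["hits"]["total"])
    print(by_host["aggregations"]["by_host"]["buckets"])
finally:
    es.close_point_in_time(id=pit_id)
```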

@masseyke (Member) commented

Yeah, I would love to have that too. Assuming we got the needed functionality into PIT (which is outside of this project), we'd need some way to tell es-spark to use a certain PIT -- maybe just by setting it in the configuration? It wouldn't be as automatic as what we have now, but if you wanted to create and manage the PIT yourself it would be an option.
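Something like the sketch below, where the es.read.pit.id option is purely hypothetical (it does not exist in es-hadoop today) and the caller creates and closes the PIT themselves:

```python
from elasticsearch import Elasticsearch
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pit-managed-read").getOrCreate()
es = Elasticsearch("http://localhost:9200")

# The user creates and owns the PIT for the whole job...
pit_id = es.open_point_in_time(index="my-index", keep_alive="30m")["id"]

df = (
    spark.read.format("org.elasticsearch.spark.sql")
    .option("es.nodes", "localhost:9200")
    .option("es.read.pit.id", pit_id)  # hypothetical option, per the suggestion above
    .load("my-index")
)
df.show()

# ...and is responsible for closing it, since es-spark has no hook that fires
# when an RDD/DataFrame is fully consumed.
es.close_point_in_time(id=pit_id)
```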

@matan129 commented

Yeah, totally an option.

To be honest... what I'd actually prefer is to be able to just read ES snapshots directly from S3 (I know they are not totally consistent, but that's fine).

Theoretically it should be totally possible - just map each shard (Lucene index) in S3 to a Spark partition 🤩
