
Switch to using Point in Time search operations instead of Scrolls #1925

Open

jbaiera opened this issue Mar 2, 2022 · 6 comments

Comments

@jbaiera (Member) commented Mar 2, 2022

The scroll API is no longer recommended for scanning all of the data in an index. The recommended read pattern is now the search_after API in combination with a point in time (PIT).

Scrolled searches are less optimal than search_after due to a combination of resource consumption concerns and their single-use read pattern. If a scroll request succeeds on the server but the response is discarded by the client because of a timeout, the server-side cursor has already advanced past the missed documents and cannot be rolled back to replay them; a new scroll must be created instead.

Search after with a PIT gives us some benefits over scrolls as well. Aside from the stable nature of its reads, it is compatible with composite aggregations, which is the basis for streaming aggregate results.
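For reference, a minimal sketch of this read pattern (using the elasticsearch-py 8.x client; the index name, page size, and the print call are placeholders):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Open a point in time so that every page sees the same view of the index.
pit_id = es.open_point_in_time(index="my-index", keep_alive="1m")["id"]
search_after = None

try:
    while True:
        kwargs = dict(
            size=1000,
            pit={"id": pit_id, "keep_alive": "1m"},
            # _shard_doc is the built-in tiebreaker sort for PIT searches.
            sort=[{"_shard_doc": "asc"}],
        )
        if search_after is not None:
            kwargs["search_after"] = search_after
        resp = es.search(**kwargs)

        hits = resp["hits"]["hits"]
        if not hits:
            break
        for hit in hits:
            print(hit["_id"])  # stand-in for real processing

        # Resume from the last hit's sort values. Unlike a scroll cursor, a page
        # can simply be re-requested if the client missed the response.
        search_after = hits[-1]["sort"]
        pit_id = resp["pit_id"]  # the server may return a refreshed PIT id
finally:
    es.close_point_in_time(id=pit_id)
```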

@matan129 commented

Any ideas when this will get pushed through?
Thanks

@masseyke (Member) commented

We don't currently have any plans to do it. After some investigation we found that PITs are missing the ability to target only specific indices and shards, which es-hadoop needs in order to be performant. That capability is not currently on the roadmap for PITs, but for testing we put in a placeholder (a hack) for it, and found that PITs offered no advantages over scrolls -- there was no performance improvement and no new functionality was enabled.
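For context, the shard targeting that scroll-based reads rely on today looks roughly like this -- each task restricts its search to its own shard via the preference parameter (a sketch with the elasticsearch-py client; the index name and shard number are placeholders):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

shard = 0  # the shard assigned to this task / input split
resp = es.search(
    index="my-index",
    preference=f"_shards:{shard}",  # restrict the search to a single shard
    scroll="5m",
    size=1000,
    query={"match_all": {}},
)
print(len(resp["hits"]["hits"]), "docs from shard", shard)

# Subsequent pages come from the scroll cursor; once a page is consumed it
# cannot be replayed.
scroll_id = resp["_scroll_id"]
resp = es.scroll(scroll_id=scroll_id, scroll="5m")
scroll_id = resp["_scroll_id"]
es.clear_scroll(scroll_id=scroll_id)
```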

@masseyke (Member) commented

We had initially thought that one big advantage of PITs would be that we could create them in the partitioner at the start of the job and have a consistent view of the data for the whole job, but we realized that if we did that we'd have no way to close the PIT in Spark, because there's no way to be notified when an RDD is complete (there's not really such a notion -- no equivalent to RecordReader.close() in MapReduce). So we wound up having to push the PIT creation out to the tasks like we do for scrolls, taking away that advantage. (I'm working from memory of events more than 7 months ago, so I might have some of the details wrong.)

@matan129 commented Nov 23, 2022

Got it. Thanks for the detailed response!

I was asking because I'm interested in making multiple queries from Spark, and it would be very nice if I could have assured point-in-time behaviour across them.

i.e. scrolling is fine for a single query, but I don't see a way to coordinate a consistent view across multiple queries without the PIT API.
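Roughly what I have in mind -- one PIT shared by several otherwise independent queries, so they all observe the same snapshot of the data (a sketch with the elasticsearch-py client; the index and queries are made up):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

pit_id = es.open_point_in_time(index="my-index", keep_alive="10m")["id"]
try:
    # Two unrelated queries against the same PIT; the index is implied by the
    # PIT, so no index is passed to search().
    errors = es.search(
        pit={"id": pit_id, "keep_alive": "10m"},
        query={"term": {"level": "error"}},
        size=100,
    )
    by_host = es.search(
        pit={"id": pit_id, "keep_alive": "10m"},
        aggregations={"by_host": {"terms": {"field": "host"}}},
        size=0,
    )
    print(errors["hits"]["total"])
    print(by_host["aggregations"]["by_host"]["buckets"])
finally:
    es.close_point_in_time(id=pit_id)
```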

@masseyke (Member) commented

Yeah, I would love to have that too. Assuming we got the needed functionality into PIT (which is outside of this project), we'd need some way to tell es-spark to use a certain PIT -- maybe just by setting it in the configuration? It wouldn't be as automatic as what we have now, but if you wanted to create and manage the PIT yourself it would be an option.
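Something like the sketch below, where the es.read.pit.id option is purely hypothetical (it does not exist in es-hadoop today) and the caller creates and closes the PIT themselves:

```python
from elasticsearch import Elasticsearch
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pit-managed-read").getOrCreate()
es = Elasticsearch("http://localhost:9200")

# The user creates and owns the PIT for the whole job...
pit_id = es.open_point_in_time(index="my-index", keep_alive="30m")["id"]

df = (
    spark.read.format("org.elasticsearch.spark.sql")
    .option("es.nodes", "localhost:9200")
    .option("es.read.pit.id", pit_id)  # hypothetical option, per the suggestion above
    .load("my-index")
)
df.show()

# ...and is responsible for closing it, since es-spark has no hook that fires
# when an RDD/DataFrame is fully consumed.
es.close_point_in_time(id=pit_id)
```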

@matan129 commented

Yeah, totally an option.

To be honest... what I'd actually prefer is to be able to just read ES snapshots directly from S3 (I know they are not totally consistent, but that's fine).

Theoretically it should be totally possible - just map each shard (Lucene index) in S3 to a Spark partition 🤩
