Memory efficient source filtering #25168

Closed
amir20001 opened this issue Jun 9, 2017 · 15 comments
Labels: :Core/Infra/Scripting, >enhancement, :Search/Search, Team:Core/Infra, Team:Search

Comments

@amir20001

Example:

Using Twitter as an example, each user is a document, and each tweet is a nested document under the user. For active users, a single document can accumulate thousands of tweets and grow to a few megabytes in size.

```
{
  "userId": "1",
  "tweets": [
    {
      "id": 1,
      "message": "tweet 1"
    },
    {
      "id": 2,
      "message": "tweet 2"
    },
    ...
  ]
}
```

Use Case:
We want to find users that have used a specific hashtag in their tweets and view only those tweets. We use source filtering and nested inner hit queries to get back just the users and matching tweets.
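A request along these lines is roughly what we run (field names follow the example document above; the hashtag value is made up):

```
{
  "_source": ["userId"],
  "query": {
    "nested": {
      "path": "tweets",
      "query": {
        "match": { "tweets.message": "#somehashtag" }
      },
      "inner_hits": {
        "_source": ["tweets.id", "tweets.message"]
      }
    }
  }
}
```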

Problem:
Even though we are using source filtering, Elasticsearch loads the entire document into memory before applying the source filter. Since each document is so large, any real throughput causes constant garbage collection, which we see in the logs.

Feature Request:
Could filtered source be loaded in a more memory-efficient manner, so that the entire source does not have to be loaded into memory first?

@jpountz
Contributor

jpountz commented Jun 12, 2017

If the issue is about not loading the _source into memory at all, then this is high-hanging fruit. However, you mentioned that the issue is mostly about garbage collection in your case, which I think we could improve by avoiding the intermediate map-of-maps representation, which I suspect is the source of all that garbage.
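To make the map-of-maps point concrete, here is a minimal sketch of what that path implies, using plain Jackson rather than the actual Elasticsearch internals (class name and document contents are made up from the example above):

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.Map;

public class MapOfMapsSketch {
    public static void main(String[] args) throws Exception {
        String json = "{\"userId\":\"1\",\"tweets\":[{\"id\":1,\"message\":\"tweet 1\"}]}";
        // The entire document is materialized as Java objects (maps, lists,
        // strings) before any filtering happens, so a multi-megabyte source
        // allocates a proportionally large object graph for the GC to reclaim.
        Map<?, ?> source = new ObjectMapper().readValue(json, Map.class);
        System.out.println(source); // {userId=1, tweets=[{id=1, message=tweet 1}]}
    }
}
```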

@jpountz
Contributor

jpountz commented Jun 12, 2017

Also relates to #9034.

@clintongormley clintongormley added :Search/Search Search-related issues that do not fall into other categories discuss >enhancement labels Jun 12, 2017
@osman

osman commented Jun 13, 2017

👍

@jaemoore

👍

@amir20001
Author

Yeah, it seems related to #9034. In this case, since the large items are under a single nested field, it would also require each nested item to be stored separately.

@tlrx
Member

tlrx commented Jun 16, 2017

We discussed this offline in our "Fix-it Friday" meeting and agreed that we could still reduce the garbage collection issue by filtering the source of documents in a streaming fashion instead of the current in-memory map implementation. We could reuse the same machinery that response filtering uses for filter_path.
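(For reference, filter_path is the existing response-filtering parameter, e.g.:

```
GET /_search?filter_path=hits.hits._source
```

The idea would be to apply the same kind of token-level filtering to the _source itself.)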

I'll give it a try in the next few weeks and update this issue.

@tlrx tlrx removed the discuss label Jun 16, 2017
@tlrx tlrx self-assigned this Jun 16, 2017
@tlrx
Member

tlrx commented Jul 4, 2017

So I finally looked at this. I created tlrx@3362a50, which uses a stream-based implementation of source filtering to replace the current in-memory maps implementation. The results are similar to what I saw more than a year ago when I last looked at this optimization, but at that time we didn't have Rally, so the tests were hard to reproduce.

tl;dr
Benchmarks show that both implementations have almost the same performance, because most of the time is spent loading and parsing the source, and these steps are always executed regardless of how the filtering is done. Differences appear only for edge cases like the one described in this issue (i.e., a document with a lot of fields where most of them are filtered out). The stream-based implementation has less memory pressure since it creates far fewer objects, so I think it is a good long-term solution. Sadly, the filtering methods do not behave exactly the same, making the change non-trivial, and other features like inner hits, highlighting, or scripted fields require the source to be parsed as a map anyway. So I think we should investigate #9034 instead of optimizing for edge cases like this one.

A new XContentHelper.filter(BytesReference, XContentType, String[], String[]) method has been added in https://github.com/tlrx/elasticsearch/tree/use-streamed-based-source-filtering. It uses Jackson's streaming filtering under the hood, and the implementation is quite straightforward. This method is used in the FetchSourceSubPhase to filter the source.
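For illustration, here is a minimal sketch of the kind of Jackson streaming filtering this relies on. This is not the actual XContentHelper code; the JSON pointer, class name, and sample document are made up, and the FilteringParserDelegate constructor signature varies across Jackson versions (newer versions take a TokenFilter.Inclusion instead of the booleans):

```java
import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonGenerator;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.filter.FilteringParserDelegate;
import com.fasterxml.jackson.core.filter.JsonPointerBasedFilter;
import java.io.StringWriter;

public class StreamingFilterSketch {
    public static void main(String[] args) throws Exception {
        String json = "{\"userId\":\"1\",\"tweets\":[{\"id\":1,\"message\":\"tweet 1\"}]}";
        JsonFactory factory = new JsonFactory();
        // Wrap the parser so only tokens under /userId are surfaced;
        // filtered-out tokens are skipped without building any objects.
        JsonParser parser = new FilteringParserDelegate(
            factory.createParser(json),
            new JsonPointerBasedFilter("/userId"),
            true,   // includePath: keep the enclosing object structure
            false); // allowMultipleMatches
        StringWriter out = new StringWriter();
        JsonGenerator gen = factory.createGenerator(out);
        // Copy the surviving token stream straight to the output; no
        // intermediate map of maps is ever created.
        while (parser.nextToken() != null) {
            gen.copyCurrentEvent(parser);
        }
        gen.close();
        System.out.println(out); // {"userId":"1"}
    }
}
```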

The implementation was tested with Rally, with JFR telemetry and memory profiling enabled, on our default benchmarks. Note that the JFR options were changed to use a custom profile, and I created a new challenge with only searches using source filtering operations.

It has been tested with multiple benchmarks, but pmc gave the most telling results because it contains a large body field that can be filtered out:

Rally results for the map-based filtering indicate a median throughput of 187.097 ops/s and a 99th percentile latency of 470.179 ms, compared to 195.384 ops/s and 196.684 ms for the stream-based implementation.

Memory overview

Looking at the JFR records is interesting and shows lower memory usage with stream-based filtering:

Map based filtering

[JFR screenshot: master_overview]

Streaming based filtering

[JFR screenshot: test_overview]

Garbage collections

And fewer GCs with stream-based filtering, which is expected.

Map based filtering

[JFR screenshot: gc_master]

Streaming based filtering

[JFR screenshot: gc_test]

Allocations

The allocation statistics also show far fewer allocations for the stream-based filtering (4949 allocations for 3.5 GB in TLABs, 7567 for 501 MB outside TLABs) compared to map-based filtering (9676 allocations for 6.5 GB in TLABs, 32428 for 2.96 GB outside TLABs).

Map based filtering

[JFR screenshot: master_allocs]

Streaming based filtering

[JFR screenshot: test_allocsq]


Other considerations

While investigating the change I noticed that our filtering methods do not behave the same, so I created #25491 so that all methods share the same set of tests. But there are still some differences: map-based filtering prints out empty objects (#4715) while the streaming-based implementation excludes them. Also, map-based filtering handles dots in field names as sub-objects (#20736); the streaming-based implementation does not work exactly like this and would require some non-trivial changes.
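To illustrate the dots-in-field-names difference with a made-up field: map-based filtering treats the first document below like the second, so an include pattern of foo.bar matches both, while a purely streaming filter only sees the literal key "foo.bar":

```
{ "foo.bar": 1 }

{ "foo": { "bar": 1 } }
```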

Also, some features require the source to be parsed as a map in order to work (like highlighting or scripted fields). If combined with source filtering, we don't want to parse the source twice: once as raw bytes for source filtering and again as a map for highlighting. Changing the way this works is not easy, and I think we should instead investigate other solutions like #9034 rather than optimizing source filtering for edge cases like the one described in this issue.

I'd be happy to hear any thoughts or comments on this! I might have missed something...

@jpountz
Contributor

jpountz commented Jul 4, 2017

It is a pity that we managed to come up with different semantics for filtering values in a document. I'd be keen on switching to stream-based filtering even if that implies minor backwards-compatibility breaks.

@s1monw
Contributor

s1monw commented Jul 4, 2017

I'd be keen on switching to stream-based filtering even if that implies minor backwards-compatibility breaks.

++ Is there a chance we can stick with the object-based parsing, based on the index-created version or some setting, and remove it in 7.0?

@osman

osman commented Mar 1, 2018

Any updates on whether this may be included in 7.0?

@talevy talevy added :Search/Search Search-related issues that do not fall into other categories and removed :Search/Search Search-related issues that do not fall into other categories labels Mar 19, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-search-aggs

@jpountz
Contributor

jpountz commented Mar 20, 2018

@osman No updates for now.

@rjernst rjernst added the Team:Search Meta label for search team label May 4, 2020
@stu-elastic stu-elastic added the :Core/Infra/Scripting Scripting abstractions, Painless, and Mustache label Jul 23, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-core-infra (:Core/Infra/Scripting)

@elasticmachine elasticmachine added the Team:Core/Infra Meta label for core/infra team label Jul 23, 2020
@jtibshirani
Contributor

We use source filtering and nested inner hit queries to get back just the users and matching tweets.

I noticed that when using source filtering in inner_hits, we were reloading and reparsing the _source for each nested document. So we recently merged #60494 to load and parse the _source only once per root document. This doesn't address the memory consumption of source filtering itself, but it could help here (if I'm understanding the use case correctly).

@rjernst rjernst added the needs:triage Requires assignment of a team area label label Dec 3, 2020
@stu-elastic stu-elastic removed the needs:triage Requires assignment of a team area label label Dec 9, 2020
nik9000 added a commit to nik9000/elasticsearch that referenced this issue Sep 1, 2021
I found myself needing support for something like `filter_path` on
`XContentParser`. It was simple enough to plug it in so I did. Then I
realized that it might offer more memory efficient source filtering
(elastic#25168) so I put together a quick benchmark comparing the source
filtering that we do in `_search`.

Filtering using the parser is about 33% faster than how we filter now
when you select a single field from a 300 byte document:
```
Benchmark                                          (excludes)  (includes)  (source)  Mode  Cnt     Score    Error  Units
FetchSourcePhaseBenchmark.filterObjects                           message     short  avgt    5  2360.342 ±  4.715  ns/op
FetchSourcePhaseBenchmark.filterXContentOnBuilder                 message     short  avgt    5  2010.278 ± 15.042  ns/op
FetchSourcePhaseBenchmark.filterXContentOnParser                  message     short  avgt    5  1588.446 ± 18.593  ns/op
```

The top line is the way we filter now. The middle line is adding a
filter to `XContentBuilder` - something we can do right now without any
of my plumbing work. The bottom line is filtering on the parser,
requiring all the new plumbing.

This isn't particularly impressive. 33% *sounds* great! But 700
nanoseconds per document isn't going to cut into anyone's search times.
If you fetch a thousand documents that's 0.7 milliseconds of savings.

But we mostly advise folks to use source filtering on fetch when the
source is large and you only want a small part of it. So I tried when
the source is about 4.3 kB and you want a single field:
```
Benchmark                                          (excludes)  (includes)      (source)  Mode  Cnt     Score     Error  Units
FetchSourcePhaseBenchmark.filterObjects                           message  one_4k_field  avgt    5  5957.128 ± 117.402  ns/op
FetchSourcePhaseBenchmark.filterXContentOnBuilder                 message  one_4k_field  avgt    5  4999.073 ±  96.003  ns/op
FetchSourcePhaseBenchmark.filterXContentonParser                  message  one_4k_field  avgt    5  3261.478 ±  48.879  ns/op
```

That's 45% faster. Put another way, 2.7 microseconds a document. Not
bad!

But have a look at how things come out when you want a single field from
a 4 *megabyte* document:
```
Benchmark                                          (excludes)  (includes)      (source)  Mode  Cnt        Score        Error  Units
FetchSourcePhaseBenchmark.filterObjects                           message  one_4m_field  avgt    5  8266343.036 ± 176197.077  ns/op
FetchSourcePhaseBenchmark.filterXContentOnBuilder                 message  one_4m_field  avgt    5  6227560.013 ±  68306.318  ns/op
FetchSourcePhaseBenchmark.filterXContentonParser                  message  one_4m_field  avgt    5  1617153.472 ±  80164.547  ns/op
```

These documents are very large. I've encountered documents like them in
real life, but they've always been the outlier for me. But a 6.5
millisecond per document savings ain't anything to sneeze at.

Take a look at what you get when I turn on gc metrics:
```
FetchSourcePhaseBenchmark.filterObjects                          message  one_4m_field  avgt    5   7036097.561 ±  84721.312   ns/op
FetchSourcePhaseBenchmark.filterObjects:·gc.alloc.rate           message  one_4m_field  avgt    5      2166.613 ±     25.975  MB/sec
FetchSourcePhaseBenchmark.filterXContentOnBuilder                message  one_4m_field  avgt    5   6104595.992 ±  55445.508   ns/op
FetchSourcePhaseBenchmark.filterXContentOnBuilder:·gc.alloc.rate message  one_4m_field  avgt    5      2496.978 ±     22.650  MB/sec
FetchSourcePhaseBenchmark.filterXContentonParser                 message  one_4m_field  avgt    5   1614980.846 ±  31716.956   ns/op
FetchSourcePhaseBenchmark.filterXContentonParser:·gc.alloc.rate  message  one_4m_field  avgt    5         1.755 ±      0.035  MB/sec
```
nik9000 added a commit that referenced this issue Sep 13, 2021
nik9000 added a commit to nik9000/elasticsearch that referenced this issue Sep 13, 2021
elasticsearchmachine pushed a commit that referenced this issue Sep 13, 2021
* Memory efficient xcontent filtering (backport of #77154)

* Fixup benchmark for 7.x
@romseygeek
Contributor

This has been implemented in #77154.
