Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

Return index set wildcard instead of individual indices in Searches#termineAffectedIndices() #4062

Merged
merged 5 commits into from Aug 11, 2017

Conversation

@joschi
Copy link
Contributor

@joschi joschi commented Aug 7, 2017

This PR changes the behavior of the Searches class to use a Terms Query filtering for the _index field instead of listing all indices in the URI path if a defined number of indices (see Searches.MAX_INDICES_PER_QUERY) has been reached.

While this leads to more load required on the side of Elasticsearch (because all index shards have to be visited instead of just the ones of the specified indices), it circumvents the problem with the URI being larger than 4 KB.

Using partitioned parallel queries when the maximum URI size has been reached would unfortunately only be feasible for a subset of queries currently supported by the Searches class. 馃槥

Fixes #4054

@joschi joschi added this to the 2.3.1 milestone Aug 7, 2017
@joschi joschi added this to the 2.4.0 milestone Aug 7, 2017
@joschi joschi removed this from the 2.3.1 milestone Aug 7, 2017
@bernd
Copy link
Member

@bernd bernd commented Aug 7, 2017

@joschi Thanks for the PR!

One thing I don't understand yet: Why do we need the terms query with the index list? We are already using a timerange filter in the query, how does the terms query improve things?

Loading

@joschi
Copy link
Contributor Author

@joschi joschi commented Aug 7, 2017

Why do we need the terms query with the index list? We are already using a timerange filter in the query, how does the terms query improve things?

Since users can have multiple index sets which contain distinct data sets, specifying the time range alone isn't sufficient.

Just think of a search query covering firewall and network appliance logs in a number of index sets which shouldn't include application logs from the same time frame.

Loading

@bernd
Copy link
Member

@bernd bernd commented Aug 7, 2017

Just think of a search query covering firewall and network appliance logs in a number of index sets which shouldn't include application logs from the same time frame.

Sorry, I forgot to mention that we also filter on streams. Index sets are bound to streams so this will be handled as far as I can see. Currently, I can search in either one or all streams. If I search in one stream, we have a stream filter in addition to the timerange. If I search in all streams, it's okay to look into all index sets.

Not sure if there is another scenario where selecting the indices is important. If not, we can greatly simply the code (as far as I can see) and save some cpu cycles processing the query.

Loading

@hc4
Copy link
Contributor

@hc4 hc4 commented Aug 7, 2017

@bernd agree.
Just use pattern as index name: graylog2_* (or firewall_*,network_* in your example)
E.g. that is what doing Grafana in it's requests to ES. And ES handles such queries quiet fast.
No sense in making Terms query, just more work for ES.

Loading

鈥ermineAffectedIndices()

If the number of indices determined in `Searches#determineAffectedIndices()` is larger than
`Searches.MAX_INDICES_PER_QUERY`, return the covering write index aliases of the respective
index sets instead of the individual index names.

Fixes #4054
@joschi
Copy link
Contributor Author

@joschi joschi commented Aug 7, 2017

@bernd I've implemented a variant returning the write index aliases instead of using a terms filter.

Loading

.map(IndexRange::indexName)
.map(indexSetRegistry::getForIndex)
.flatMap(o -> o.map(java.util.stream.Stream::of).orElseGet(java.util.stream.Stream::empty))
.map(IndexSet::getWriteIndexAlias)
Copy link
Member

@bernd bernd Aug 8, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess you wanted to use IndexSet#getIndexWildcard() here? Using graylog_deflector doesn't make sense here. AFAIK

Loading

Copy link
Contributor Author

@joschi joschi Aug 8, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. That was bollocks.

Loading

@bernd bernd self-assigned this Aug 8, 2017
@@ -607,7 +607,7 @@ public void determineAffectedIndicesOnlyReturnsAliasesIfTooManyIndicesAreFound()

final TimeRange absoluteRange = AbsoluteRange.create(now, now.plusDays(numberOfIndices + 1));

assertThat(searches.determineAffectedIndices(absoluteRange, null)).containsExactly(indexSet.getWriteIndexAlias());
assertThat(searches.determineAffectedIndices(absoluteRange, null)).containsExactly(indexSet.getIndexWildcard());
Copy link
Contributor

@hc4 hc4 Aug 8, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good practice is to use constant values for expectations, insead of building it using actual code.
This is great example, why ;)

Loading

@bernd bernd changed the title Use Terms query if more than Searches.MAX_INDICES_PER_QUERY are used in a query Return index set wildcard instead of individual indices in Searches#termineAffectedIndices() Aug 9, 2017
Copy link
Member

@bernd bernd left a comment

I just tested the changes and noticed that the Searches#search(SearchesConfig) is using determineAffectedIndicesWithRanges() and then collects the index list manually. That means it's still using the full index list instead of the wildcards if the limit is crossed.

final Set<IndexRange> indexRanges = determineAffectedIndicesWithRanges(config.range(), config.filter());
final Set<String> indices = indexRanges.stream().map(IndexRange::indexName).collect(Collectors.toSet());

I guess the search method should use the regular determineAffectedIndices() method here, not sure if there was a reason to not use it, though.

Loading

@bernd
Copy link
Member

@bernd bernd commented Aug 9, 2017

Also please check if there are other search methods that are not using the correct index list. Thanks! 馃槂

Loading

@joschi
Copy link
Contributor Author

@joschi joschi commented Aug 9, 2017

I guess the search method should use the regular determineAffectedIndices() method here, not sure if there was a reason to not use it, though.

The idea was not to duplicate the work done in determineAffectedIndicesWithRanges() and this went well until determineAffectedIndices(...) started to include more complex logic than just extracting the index name from the set of index ranges.

Loading

Copy link
Member

@bernd bernd left a comment

I noticed another place where we are not using the alias. indicesContainingField() returns a list of indices.

final Set<String> affectedIndices = indicesContainingField(determineAffectedIndices(range, filter), field);

Loading

@joschi
Copy link
Contributor Author

@joschi joschi commented Aug 11, 2017

I noticed another place where we are not using the alias. indicesContainingField() returns a list of indices.

@bernd The Searches#indicesContainingField() method in Searches#fieldStats() already receives a reduced list of indices from Searches#determineAffectedIndices().

Loading

@bernd
Copy link
Member

@bernd bernd commented Aug 11, 2017

@joschi Yes, but Indices#getAllMessageFieldsForIndices() returns a full list of indices again.

Loading

bernd
bernd approved these changes Aug 11, 2017
Copy link
Member

@bernd bernd left a comment

Thanks for working through it! LGTM 馃憤

Loading

@bernd bernd merged commit 3dc4ca1 into master Aug 11, 2017
4 checks passed
Loading
@bernd bernd deleted the issue-4054 branch Aug 11, 2017
joschi added a commit that referenced this issue Aug 11, 2017
鈥 is too long (#4062)

* Return write index alias instead of individual indices in Searches#determineAffectedIndices()

If the number of indices determined in `Searches#determineAffectedIndices()` is larger than
`Searches.MAX_INDICES_PER_QUERY`, return the covering write index aliases of the respective
index sets instead of the individual index names.

Fixes #4054

* Use index wildcard instead of write index alias (duh!)

* Use correct indices in Searches#search(SearchesConfig)

* Reduce number of indices in Searches#fieldStats() if necessary

(cherry picked from commit 3dc4ca1)
bernd added a commit that referenced this issue Aug 11, 2017
鈥 is too long (#4062) (#4078)

* Return write index alias instead of individual indices in Searches#determineAffectedIndices()

If the number of indices determined in `Searches#determineAffectedIndices()` is larger than
`Searches.MAX_INDICES_PER_QUERY`, return the covering write index aliases of the respective
index sets instead of the individual index names.

Fixes #4054

* Use index wildcard instead of write index alias (duh!)

* Use correct indices in Searches#search(SearchesConfig)

* Reduce number of indices in Searches#fieldStats() if necessary

(cherry picked from commit 3dc4ca1)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

3 participants