New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Waiting for index range calculation before switching deflector alias. #2278

Merged
merged 18 commits into from Jun 9, 2016

Conversation

Projects
None yet
4 participants
@dennisoelkers
Member

dennisoelkers commented May 24, 2016

Before this change, whenever the deflector was cycled, a system job was submitted to calculate the index range of the current write index/deflector (index 1). Immediately after submitting the job (not necessarily after the index range calculation job has finished already), the deflector alias was cycled to a newly created index (index 2). When the index range calculation job takes longer than cycling the deflector alias, there is a time range, where searches do not take the index which was recently cycled (index 1) into account.

This happens because IndexHelper#determineAffectedIndices returns an index if:

  • it has a valid index range and the supplied time range is an intersection of it
  • it is the current deflector

Both conditions do not hold true for index 1 at this time. As searches are also used in alerts, there is also the potential of false positives/negatives, as described in #2264.

There are two ways to fix this:

  1. Wait for the index range calculation to complete before cycling the deflector alias
  2. Return all indices without valid time ranges in IndexHelper#determineAffectedIndices as well.

Solution 2 was not chosen, because it seems to be a too intrusive approach, as it has an effect on all searches being done, especially when indices are pretty large.

Instead, solution 1 was implemented, by enhancing the result type of the SystemJobManager#submit to include a ScheduledFuture as well, which is then used to wait for the completion of the index range calculation before switching the deflector alias. As we do not want to wait an infinite amount of time for the unlikely event of noncompletion of this system job, a configuration property was introduced defaulting to a duration of one hour, which may or may not be a good idea.

Solution 1 has the downside though, that there is a timeframe which is at maximum the time between the start of the index range calculation and the time of the actual index alias switch, which would not be included in the index range calculated and there might be the need for a second index range calculation after the index was switched to read only.

UPDATE: After tinkering a bit with the scenarios and possible implementations, a combination of both solutions was chosen. Now, when the deflector is cycled, a static index range with "empty" (== 0) begin/end parameters is saved. Index range calculation for the old deflector is postponed until after it was made read only, so we are sure to be able to calculate the correct time ranges. When searching, all indices with "empty" begin/end parameters are included in the search, because we cannot determine reliably which time range they cover.

@bernd

This comment has been minimized.

Member

bernd commented May 24, 2016

Two comments after quickly looking at this. (no time to do a full review currently, sorry)

  1. This is a breaking change to the SystemJobs API. So we have to discuss if we want to to this and if we do it, we have to adjust all affected plugins and document the upgrade path.
  2. As far as I can see, there is no second index range calculation to ensure the "index 1" range is correct. Without this, we will also run into alert problems just like it is now. 😄
@joschi

This comment has been minimized.

Contributor

joschi commented May 24, 2016

@dennisoelkers Would it solve to the problem to insert a fake index range for the current write-active index with end being now + 1 hour just before index rotation and keep the index range calculation as it is?

I share the fear of @bernd that without the second calculation (after the deflector alias has been pointed to the new index) the index range will be off by a few messages (and a few more messages in loaded ES clusters which take their time to actually return the min and max timestamps of an index).

@dennisoelkers

This comment has been minimized.

Member

dennisoelkers commented May 24, 2016

@bernd:
1.) Correct. That's why it's scheduled for 2.1.0., to keep the semver breakage minimal :)

@bernd / @joschi:
2.) I added the "ready-for-review" label a bit early. Nonetheless: We actually have the same issue at the moment too, because the index range is calculated 30 seconds before it is switched read-only, because we expect more messages to come in. I am still thinking about how to solve this issue, but it is also a separate one from the one causing #2264.

@hc4

This comment has been minimized.

Contributor

hc4 commented May 24, 2016

Why not just set range for current index to infinity and schedule job for index range recalculation right after index became readonly?
It will not break anything and will not change current behaviour (assuming, that current deflector may contain any timerange).
Also ES 2.3 a lot better (than 1.7) filters data by timestamp. So it shouldn't be a big performance drop.

@dennisoelkers dennisoelkers force-pushed the issue-2264 branch from 1812ab0 to 32846cf May 25, 2016

Changing handling of deflector index ranges.
Deflector index ranges are now statically created with the current time
as begin and 0L as end. Searching for the indices matching a time range
was changed accordingly.
@hc4

This comment has been minimized.

Contributor

hc4 commented May 25, 2016

I think commit d523b74 breaks logic.
Just created index can receive messages from the past.
For example in case of long journal queue, or messages from some queueing engines (like AMQP or Beats)

@dennisoelkers

This comment has been minimized.

Member

dennisoelkers commented May 26, 2016

Thanks for the input, @hc4, I came to the same conclusion. Please note though, that the PR Is still WIP. When it is finished, it will be labeled accordingly and any comments are welcome.

dennisoelkers added some commits May 26, 2016

Changing begin for deflector index range, returning it per default.
As the timestamp of a message is generated on the source and not during
ingestion time, we cannot determine anything for the deflector index
range. Therefore we need to set begin time to 0 and always return the
deflector index range for search.
@dennisoelkers

This comment has been minimized.

Member

dennisoelkers commented May 26, 2016

For the documentation: @bernd, I changed the method signatures back to their original form and added new ones for the SystemJobManager class, so it is backwards compatible.

@joschi joschi self-assigned this Jun 1, 2016

} catch (SystemJobConcurrencyException e) {
final SystemJobManager.ScheduleResult scheduleResult = systemJobManager.submitWithDelayForResult(makeReadOnlyJob, 30, TimeUnit.SECONDS);
if (scheduleResult != null) {
scheduleResult.getFuture().get(deflectorIndexReadOnlyTimeout.toMilliseconds(), TimeUnit.MILLISECONDS);

This comment has been minimized.

@joschi

joschi Jun 1, 2016

Contributor

In the worst case this would block the thread for the duration of deflectorIndexReadOnlyTimeout (1 hour by default). What would happen if there is the need to do another cycle during that time?

This comment has been minimized.

@dennisoelkers

dennisoelkers Jun 1, 2016

Member

From what I have seen in the code, this would mean that another cycle would be done, which would be waiting at the same point.

This comment has been minimized.

@bernd

bernd Jun 1, 2016

Member

I wouldn't feel good adding this "waiting on system job" logic in here.

Would it be enough to trigger an index range calculation job in the "make-read-only" job after it's done?

  • Do not call addSingleIndexRanges(oldTarget) in line 160
  • Modify the SetIndexReadOnlyJob to trigger a CreateNewSingleIndexRangeJob for the index once the read-only status is set

With this we wouldn't need the timeout logic and all the changes to the system jobs to return a future. We can also be sure that the index rage is correct because the index cannot be written anymore. Until the index is read-only, it will be included in the searches because it has the "deflector" index range.

This comment has been minimized.

@dennisoelkers

dennisoelkers Jun 2, 2016

Member

Coupling the job which sets indices read only to the index range calculation job sounds wrong to me, because they are not logically coupled and executing the first one should not need you to know (and be surprised about) that the second one is triggered implicitly too. The coupling should be done from the outside (as it is done here) via composition without any inherent coupling.

The logic you are describing is exactly what is happening in the PR, why it should be better to give up a separation of concerns to shift the waiting logic one level lower, while gaining nothing, is beyond me, tbh.

This comment has been minimized.

@hc4

hc4 Jun 2, 2016

Contributor

I think the problem is at line 182.
Why should we wait for job finished?
Is it possible to schedule one job after another (sorry, don't knwo deeply Java)

This comment has been minimized.

@hc4

hc4 Jun 2, 2016

Contributor

The only problem I see - you will wait here at least for 30 seconds (submit delay) and will block scheduler (single thread) for that time.
however, I'am not sure if it is real problem :)

This comment has been minimized.

@joschi

joschi Jun 2, 2016

Contributor

That's why I've asked if the thread is blocked and what happened if a new deflector cycle was necessary while waiting for the last one. 😉

In the end, this would in the worst case consume all (executor) threads in SystemJobManager and block all subsequent deflector cycles (until some threads are freed again).

This comment has been minimized.

@dennisoelkers

dennisoelkers Jun 6, 2016

Member

I agree, but from what I've seen this is already the case before this change. I am not sure if you are implying that there is a regression in this PR or that we should also change the current behavior (which I would perceive as scope creep).

This comment has been minimized.

@hc4

hc4 Jun 6, 2016

Contributor

scheduleResult.getFuture().get(deflectorIndexReadOnlyTimeout.toMilliseconds(), TimeUnit.MILLISECONDS) was introduced by this commit.
In matster there is no any waiting.

This comment has been minimized.

@bernd

bernd Jun 6, 2016

Member

Coupling the job which sets indices read only to the index range calculation job sounds wrong to me, because they are not logically coupled and executing the first one should not need you to know (and be surprised about) that the second one is triggered implicitly too. The coupling should be done from the outside (as it is done here) via composition without any inherent coupling.

I agree that the coupling introduced by this wouldn't be nice. We could introduce a separate job class that takes care of both. (i.e. SetIndexReadOnlyAndCalculateRange) That would make it clear what the job is supposed to do.
I also agree that it's nicer to be able to compose smaller jobs to avoid having lots of specialized jobs, but in this case I would vote for having one to avoid the introduction of timeout logic in this class and to avoid the async/sync issue you described later.

Another option would be to extend the system jobs subsystem to be able to run another job (or more) after one has finished. This would make it possible to compose a chain of system jobs without having to use timeouts or block the calling thread.

The logic you are describing is exactly what is happening in the PR, why it should be better to give up a separation of concerns to shift the waiting logic one level lower, while gaining nothing, is beyond me, tbh.

I would happily give up a separation of concern when I can avoid using a timeout. 😉 The problem I see with using timeouts is that it's really hard to find a good default for them and that the user probably needs to tune them to make them work with their setup which makes using the product harder.

import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;
import static com.codahale.metrics.MetricRegistry.name;
public class SystemJobManager {
public class ScheduleResult {

This comment has been minimized.

@joschi

joschi Jun 1, 2016

Contributor

This should be a static nested class or no nested class at all (as it's also used by other classes). It would also be suitable to be converted to an Auto Value class.

This comment has been minimized.

@dennisoelkers

dennisoelkers Jun 1, 2016

Member

Why is it necessary to make it static? It does not access any members of the enclosing class.

This comment has been minimized.

@joschi

joschi Jun 1, 2016

Contributor

You're right.

@@ -31,4 +31,5 @@
void save(IndexRange indexRange);
IndexRange calculateRange(String index);
IndexRange createForDeflector(String index);

This comment has been minimized.

@joschi

joschi Jun 1, 2016

Contributor

I think createEmptyRange would be a better name as it's not inherently bound to the deflector alias (it's not even being used in the method implementations).

Maybe adding it as a default method (https://docs.oracle.com/javase/tutorial/java/IandI/defaultmethods.html) would make sense here as its implementation doesn't rely on any internals of the specific implementation of IndexRangeService.

This comment has been minimized.

@dennisoelkers

dennisoelkers Jun 1, 2016

Member

That's a good idea! 👍

This comment has been minimized.

@dennisoelkers

dennisoelkers Jun 1, 2016

Member

Turning out it's not as easy as it seems, as it currently relies on the choice of the value object in the service implementation. Using MongoIndexRange in a default implementation in the interface would smell broken to me. I will rename the method to a more generic name though.

@@ -194,7 +194,7 @@ public void cycle() {
}
private void addDeflectorIndexRange(String newTarget) {
final IndexRange deflectorRange = indexRangeService.createForDeflector(newTarget);
final IndexRange deflectorRange = indexRangeService.createEmptyRange(newTarget);

This comment has been minimized.

@hc4

hc4 Jun 2, 2016

Contributor

As I understand EmptyRange later treated as "Any time".
It looks a bit confusing to me :)

@joschi

This comment has been minimized.

Contributor

joschi commented Jun 9, 2016

LGTM. 👍

@joschi joschi merged commit 9388d6c into master Jun 9, 2016

4 checks passed

ci-server-integration Jenkins build graylog2-server-integration-pr 973 has succeeded
Details
ci-web-linter Jenkins build graylog-pr-linter-check 459 has succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
continuous-integration/travis-ci/push The Travis CI build passed
Details

@joschi joschi deleted the issue-2264 branch Jun 9, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment