Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Waiting for index range calculation before switching deflector alias. #2278

Merged
merged 18 commits into from Jun 9, 2016

Conversation

dennisoelkers
Copy link
Member

@dennisoelkers dennisoelkers commented May 24, 2016

Before this change, whenever the deflector was cycled, a system job was submitted to calculate the index range of the current write index/deflector (index 1). Immediately after submitting the job (not necessarily after the index range calculation job has finished already), the deflector alias was cycled to a newly created index (index 2). When the index range calculation job takes longer than cycling the deflector alias, there is a time range, where searches do not take the index which was recently cycled (index 1) into account.

This happens because IndexHelper#determineAffectedIndices returns an index if:

  • it has a valid index range and the supplied time range is an intersection of it
  • it is the current deflector

Both conditions do not hold true for index 1 at this time. As searches are also used in alerts, there is also the potential of false positives/negatives, as described in #2264.

There are two ways to fix this:

  1. Wait for the index range calculation to complete before cycling the deflector alias
  2. Return all indices without valid time ranges in IndexHelper#determineAffectedIndices as well.

Solution 2 was not chosen, because it seems to be a too intrusive approach, as it has an effect on all searches being done, especially when indices are pretty large.

Instead, solution 1 was implemented, by enhancing the result type of the SystemJobManager#submit to include a ScheduledFuture as well, which is then used to wait for the completion of the index range calculation before switching the deflector alias. As we do not want to wait an infinite amount of time for the unlikely event of noncompletion of this system job, a configuration property was introduced defaulting to a duration of one hour, which may or may not be a good idea.

Solution 1 has the downside though, that there is a timeframe which is at maximum the time between the start of the index range calculation and the time of the actual index alias switch, which would not be included in the index range calculated and there might be the need for a second index range calculation after the index was switched to read only.

UPDATE: After tinkering a bit with the scenarios and possible implementations, a combination of both solutions was chosen. Now, when the deflector is cycled, a static index range with "empty" (== 0) begin/end parameters is saved. Index range calculation for the old deflector is postponed until after it was made read only, so we are sure to be able to calculate the correct time ranges. When searching, all indices with "empty" begin/end parameters are included in the search, because we cannot determine reliably which time range they cover.

@bernd
Copy link
Member

bernd commented May 24, 2016

Two comments after quickly looking at this. (no time to do a full review currently, sorry)

  1. This is a breaking change to the SystemJobs API. So we have to discuss if we want to to this and if we do it, we have to adjust all affected plugins and document the upgrade path.
  2. As far as I can see, there is no second index range calculation to ensure the "index 1" range is correct. Without this, we will also run into alert problems just like it is now. 😄

@joschi
Copy link
Contributor

joschi commented May 24, 2016

@dennisoelkers Would it solve to the problem to insert a fake index range for the current write-active index with end being now + 1 hour just before index rotation and keep the index range calculation as it is?

I share the fear of @bernd that without the second calculation (after the deflector alias has been pointed to the new index) the index range will be off by a few messages (and a few more messages in loaded ES clusters which take their time to actually return the min and max timestamps of an index).

@dennisoelkers
Copy link
Member Author

@bernd:
1.) Correct. That's why it's scheduled for 2.1.0., to keep the semver breakage minimal :)

@bernd / @joschi:
2.) I added the "ready-for-review" label a bit early. Nonetheless: We actually have the same issue at the moment too, because the index range is calculated 30 seconds before it is switched read-only, because we expect more messages to come in. I am still thinking about how to solve this issue, but it is also a separate one from the one causing #2264.

@hc4
Copy link
Contributor

hc4 commented May 24, 2016

Why not just set range for current index to infinity and schedule job for index range recalculation right after index became readonly?
It will not break anything and will not change current behaviour (assuming, that current deflector may contain any timerange).
Also ES 2.3 a lot better (than 1.7) filters data by timestamp. So it shouldn't be a big performance drop.

Deflector index ranges are now statically created with the current time
as begin and 0L as end. Searching for the indices matching a time range
was changed accordingly.
@hc4
Copy link
Contributor

hc4 commented May 25, 2016

I think commit d523b74 breaks logic.
Just created index can receive messages from the past.
For example in case of long journal queue, or messages from some queueing engines (like AMQP or Beats)

@dennisoelkers
Copy link
Member Author

Thanks for the input, @hc4, I came to the same conclusion. Please note though, that the PR Is still WIP. When it is finished, it will be labeled accordingly and any comments are welcome.

As the timestamp of a message is generated on the source and not during
ingestion time, we cannot determine anything for the deflector index
range. Therefore we need to set begin time to 0 and always return the
deflector index range for search.
@dennisoelkers
Copy link
Member Author

For the documentation: @bernd, I changed the method signatures back to their original form and added new ones for the SystemJobManager class, so it is backwards compatible.

@joschi joschi self-assigned this Jun 1, 2016
} catch (SystemJobConcurrencyException e) {
final SystemJobManager.ScheduleResult scheduleResult = systemJobManager.submitWithDelayForResult(makeReadOnlyJob, 30, TimeUnit.SECONDS);
if (scheduleResult != null) {
scheduleResult.getFuture().get(deflectorIndexReadOnlyTimeout.toMilliseconds(), TimeUnit.MILLISECONDS);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the worst case this would block the thread for the duration of deflectorIndexReadOnlyTimeout (1 hour by default). What would happen if there is the need to do another cycle during that time?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I have seen in the code, this would mean that another cycle would be done, which would be waiting at the same point.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't feel good adding this "waiting on system job" logic in here.

Would it be enough to trigger an index range calculation job in the "make-read-only" job after it's done?

  • Do not call addSingleIndexRanges(oldTarget) in line 160
  • Modify the SetIndexReadOnlyJob to trigger a CreateNewSingleIndexRangeJob for the index once the read-only status is set

With this we wouldn't need the timeout logic and all the changes to the system jobs to return a future. We can also be sure that the index rage is correct because the index cannot be written anymore. Until the index is read-only, it will be included in the searches because it has the "deflector" index range.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Coupling the job which sets indices read only to the index range calculation job sounds wrong to me, because they are not logically coupled and executing the first one should not need you to know (and be surprised about) that the second one is triggered implicitly too. The coupling should be done from the outside (as it is done here) via composition without any inherent coupling.

The logic you are describing is exactly what is happening in the PR, why it should be better to give up a separation of concerns to shift the waiting logic one level lower, while gaining nothing, is beyond me, tbh.

Copy link
Contributor

@hc4 hc4 Jun 2, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the problem is at line 182.
Why should we wait for job finished?
Is it possible to schedule one job after another (sorry, don't knwo deeply Java)

Copy link
Contributor

@hc4 hc4 Jun 2, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only problem I see - you will wait here at least for 30 seconds (submit delay) and will block scheduler (single thread) for that time.
however, I'am not sure if it is real problem :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's why I've asked if the thread is blocked and what happened if a new deflector cycle was necessary while waiting for the last one. 😉

In the end, this would in the worst case consume all (executor) threads in SystemJobManager and block all subsequent deflector cycles (until some threads are freed again).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, but from what I've seen this is already the case before this change. I am not sure if you are implying that there is a regression in this PR or that we should also change the current behavior (which I would perceive as scope creep).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

scheduleResult.getFuture().get(deflectorIndexReadOnlyTimeout.toMilliseconds(), TimeUnit.MILLISECONDS) was introduced by this commit.
In matster there is no any waiting.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Coupling the job which sets indices read only to the index range calculation job sounds wrong to me, because they are not logically coupled and executing the first one should not need you to know (and be surprised about) that the second one is triggered implicitly too. The coupling should be done from the outside (as it is done here) via composition without any inherent coupling.

I agree that the coupling introduced by this wouldn't be nice. We could introduce a separate job class that takes care of both. (i.e. SetIndexReadOnlyAndCalculateRange) That would make it clear what the job is supposed to do.
I also agree that it's nicer to be able to compose smaller jobs to avoid having lots of specialized jobs, but in this case I would vote for having one to avoid the introduction of timeout logic in this class and to avoid the async/sync issue you described later.

Another option would be to extend the system jobs subsystem to be able to run another job (or more) after one has finished. This would make it possible to compose a chain of system jobs without having to use timeouts or block the calling thread.

The logic you are describing is exactly what is happening in the PR, why it should be better to give up a separation of concerns to shift the waiting logic one level lower, while gaining nothing, is beyond me, tbh.

I would happily give up a separation of concern when I can avoid using a timeout. 😉 The problem I see with using timeouts is that it's really hard to find a good default for them and that the user probably needs to tune them to make them work with their setup which makes using the product harder.

@@ -194,7 +194,7 @@ public void cycle() {
}

private void addDeflectorIndexRange(String newTarget) {
final IndexRange deflectorRange = indexRangeService.createForDeflector(newTarget);
final IndexRange deflectorRange = indexRangeService.createEmptyRange(newTarget);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I understand EmptyRange later treated as "Any time".
It looks a bit confusing to me :)

@joschi
Copy link
Contributor

joschi commented Jun 9, 2016

LGTM. 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants