New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Waiting for index range calculation before switching deflector alias. #2278
Conversation
Two comments after quickly looking at this. (no time to do a full review currently, sorry)
|
@dennisoelkers Would it solve to the problem to insert a fake index range for the current write-active index with I share the fear of @bernd that without the second calculation (after the deflector alias has been pointed to the new index) the index range will be off by a few messages (and a few more messages in loaded ES clusters which take their time to actually return the min and max timestamps of an index). |
@bernd: @bernd / @joschi: |
Why not just set range for current index to infinity and schedule job for index range recalculation right after index became readonly? |
Deflector index ranges are now statically created with the current time as begin and 0L as end. Searching for the indices matching a time range was changed accordingly.
I think commit d523b74 breaks logic. |
Thanks for the input, @hc4, I came to the same conclusion. Please note though, that the PR Is still WIP. When it is finished, it will be labeled accordingly and any comments are welcome. |
As the timestamp of a message is generated on the source and not during ingestion time, we cannot determine anything for the deflector index range. Therefore we need to set begin time to 0 and always return the deflector index range for search.
For the documentation: @bernd, I changed the method signatures back to their original form and added new ones for the |
} catch (SystemJobConcurrencyException e) { | ||
final SystemJobManager.ScheduleResult scheduleResult = systemJobManager.submitWithDelayForResult(makeReadOnlyJob, 30, TimeUnit.SECONDS); | ||
if (scheduleResult != null) { | ||
scheduleResult.getFuture().get(deflectorIndexReadOnlyTimeout.toMilliseconds(), TimeUnit.MILLISECONDS); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the worst case this would block the thread for the duration of deflectorIndexReadOnlyTimeout
(1 hour by default). What would happen if there is the need to do another cycle during that time?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From what I have seen in the code, this would mean that another cycle would be done, which would be waiting at the same point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wouldn't feel good adding this "waiting on system job" logic in here.
Would it be enough to trigger an index range calculation job in the "make-read-only" job after it's done?
- Do not call
addSingleIndexRanges(oldTarget)
in line 160 - Modify the
SetIndexReadOnlyJob
to trigger aCreateNewSingleIndexRangeJob
for the index once the read-only status is set
With this we wouldn't need the timeout logic and all the changes to the system jobs to return a future. We can also be sure that the index rage is correct because the index cannot be written anymore. Until the index is read-only, it will be included in the searches because it has the "deflector" index range.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Coupling the job which sets indices read only to the index range calculation job sounds wrong to me, because they are not logically coupled and executing the first one should not need you to know (and be surprised about) that the second one is triggered implicitly too. The coupling should be done from the outside (as it is done here) via composition without any inherent coupling.
The logic you are describing is exactly what is happening in the PR, why it should be better to give up a separation of concerns to shift the waiting logic one level lower, while gaining nothing, is beyond me, tbh.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the problem is at line 182.
Why should we wait for job finished?
Is it possible to schedule one job after another (sorry, don't knwo deeply Java)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only problem I see - you will wait here at least for 30 seconds (submit delay) and will block scheduler (single thread) for that time.
however, I'am not sure if it is real problem :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's why I've asked if the thread is blocked and what happened if a new deflector cycle was necessary while waiting for the last one. 😉
In the end, this would in the worst case consume all (executor) threads in SystemJobManager
and block all subsequent deflector cycles (until some threads are freed again).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, but from what I've seen this is already the case before this change. I am not sure if you are implying that there is a regression in this PR or that we should also change the current behavior (which I would perceive as scope creep).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
scheduleResult.getFuture().get(deflectorIndexReadOnlyTimeout.toMilliseconds(), TimeUnit.MILLISECONDS)
was introduced by this commit.
In matster there is no any waiting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Coupling the job which sets indices read only to the index range calculation job sounds wrong to me, because they are not logically coupled and executing the first one should not need you to know (and be surprised about) that the second one is triggered implicitly too. The coupling should be done from the outside (as it is done here) via composition without any inherent coupling.
I agree that the coupling introduced by this wouldn't be nice. We could introduce a separate job class that takes care of both. (i.e. SetIndexReadOnlyAndCalculateRange
) That would make it clear what the job is supposed to do.
I also agree that it's nicer to be able to compose smaller jobs to avoid having lots of specialized jobs, but in this case I would vote for having one to avoid the introduction of timeout logic in this class and to avoid the async/sync issue you described later.
Another option would be to extend the system jobs subsystem to be able to run another job (or more) after one has finished. This would make it possible to compose a chain of system jobs without having to use timeouts or block the calling thread.
The logic you are describing is exactly what is happening in the PR, why it should be better to give up a separation of concerns to shift the waiting logic one level lower, while gaining nothing, is beyond me, tbh.
I would happily give up a separation of concern when I can avoid using a timeout. 😉 The problem I see with using timeouts is that it's really hard to find a good default for them and that the user probably needs to tune them to make them work with their setup which makes using the product harder.
@@ -194,7 +194,7 @@ public void cycle() { | |||
} | |||
|
|||
private void addDeflectorIndexRange(String newTarget) { | |||
final IndexRange deflectorRange = indexRangeService.createForDeflector(newTarget); | |||
final IndexRange deflectorRange = indexRangeService.createEmptyRange(newTarget); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I understand EmptyRange later treated as "Any time".
It looks a bit confusing to me :)
LGTM. 👍 |
Before this change, whenever the deflector was cycled, a system job was submitted to calculate the index range of the current write index/deflector (index 1). Immediately after submitting the job (not necessarily after the index range calculation job has finished already), the deflector alias was cycled to a newly created index (index 2). When the index range calculation job takes longer than cycling the deflector alias, there is a time range, where searches do not take the index which was recently cycled (index 1) into account.
This happens because
IndexHelper#determineAffectedIndices
returns an index if:Both conditions do not hold true for index 1 at this time. As searches are also used in alerts, there is also the potential of false positives/negatives, as described in #2264.
There are two ways to fix this:
IndexHelper#determineAffectedIndices
as well.Solution 2 was not chosen, because it seems to be a too intrusive approach, as it has an effect on all searches being done, especially when indices are pretty large.
Instead, solution 1 was implemented, by enhancing the result type of the
SystemJobManager#submit
to include aScheduledFuture
as well, which is then used to wait for the completion of the index range calculation before switching the deflector alias. As we do not want to wait an infinite amount of time for the unlikely event of noncompletion of this system job, a configuration property was introduced defaulting to a duration of one hour, which may or may not be a good idea.Solution 1 has the downside though, that there is a timeframe which is at maximum the time between the start of the index range calculation and the time of the actual index alias switch, which would not be included in the index range calculated and there might be the need for a second index range calculation after the index was switched to read only.
UPDATE: After tinkering a bit with the scenarios and possible implementations, a combination of both solutions was chosen. Now, when the deflector is cycled, a static index range with "empty" (== 0) begin/end parameters is saved. Index range calculation for the old deflector is postponed until after it was made read only, so we are sure to be able to calculate the correct time ranges. When searching, all indices with "empty" begin/end parameters are included in the search, because we cannot determine reliably which time range they cover.