If index is stale we may miss timeouts #2133
Comments
@andreasohlund @synhershko I think we need a hotfix for this?
+1, this seems like a hotfix
@johnsimons I think you can just bring in my changes from the RavenDB repo?
@synhershko can you fix this in the core and ship it?
There is no streaming API in RavenDB 2.0, so we cannot apply the fix from upstream here. I see two routes we could take:
There may be some optimizations we could make to the current code to minimize the risk of this happening, but for the problem to be gone completely we need to use one of the above solutions or come up with another. Thoughts?
But how does number 2 work regarding the range query we are currently doing?
Looking at this again, I think I'll apply similar logic to what we have with streams, only without streams. Meaning, we will request a big page size and not page further. If there are more timeouts to handle, the method will be called again; we'll find a way to make sure it starts querying from where it stopped. That leaves the implementation with the same 2 holes, or gotchas, as the current upstream implementation:
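For illustration, a minimal sketch of that plan. The names here (TimeoutRecord, TimeoutChunkReader, "TimeoutsIndex") are hypothetical stand-ins, not the actual persister code:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Raven.Client;

// Hypothetical stand-in for the stored timeout document.
public class TimeoutRecord
{
    public string Id { get; set; }
    public DateTime DueUtc { get; set; }
}

public class TimeoutChunkReader
{
    const int MaxPageSize = 1024; // one large page instead of a streaming cursor
    DateTime lastReadUtc = DateTime.MinValue;

    public List<TimeoutRecord> GetNextChunk(IDocumentSession session)
    {
        var now = DateTime.UtcNow;
        var chunk = session.Query<TimeoutRecord>("TimeoutsIndex")
            .Where(t => t.DueUtc > lastReadUtc && t.DueUtc <= now)
            .OrderBy(t => t.DueUtc)
            .Take(MaxPageSize)
            .ToList();

        if (chunk.Count > 0)
        {
            // Resume point for the next invocation. Because the query uses '>',
            // records sharing the last DueUtc can be skipped on the next call:
            // one of the gotchas referred to above.
            lastReadUtc = chunk[chunk.Count - 1].DueUtc;
        }

        // A full page implies more due timeouts may remain; the caller is
        // expected to invoke this method again right away.
        return chunk;
    }
}
```

Calling code would treat a full page as the signal to call GetNextChunk again immediately, which is the "the method will be called again" behaviour described above.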
A failing test against master demonstrating this: b99e444 /cc @SimonCropp |
Nice work Itamar, does this test pass on the new Raven repo (with the streaming)?
Nope, working on that now
@johnsimons so do we release now?
@andreasohlund there is one failing test, which is faulty; I would want to see it pass before we do. Plus we need to make the cleanup params configurable, as you requested.
Got it, do you have any bandwidth to look at it this week?
yeah I'll have it done by Friday
thanks!
@synhershko any update on this?
Sent an email to @johnsimons and @andreasohlund re status; never heard back.
Re my work on timeouts: I've been giving this some more love and wrote some more tests. It's hard, since the timeout coordinator doesn't take eventually consistent storages into account, and the cleanup procedure I just introduced here is VERY hard to test, especially with the current API (how do you know all cleanup operations have run? how do you verify this will behave the same in production?). There is one new failing test, Should_return_the_next_time_of_retrieval, basically because I must have broken the contract of what nextTimeToRunQuery should return, or simply because of the eventually consistent nature of the persister. I copied this test from the NHibernate repo, I think, so it probably should pass. I'm pretty much out of time this week; I may have more next week. Unless you can think of an air-tight way out of this, may I suggest making the coordinator aware of eventually consistent storages? The easiest way is to just invalidate startSlice when a new message comes in, but yes, it would introduce many duplicates. Not feeling comfortable releasing this yet.
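As a rough illustration of the "invalidate startSlice" idea: the coordinator shape below is hypothetical, not the actual NServiceBus code.

```csharp
using System;

// Hypothetical sketch of making the coordinator aware of eventual consistency:
// when a timeout is stored that is due at or before the current resume point,
// drop the resume point so the next query re-scans from the beginning. Nothing
// gets skipped, at the cost of the duplicate dispatches mentioned above.
// (Unconditionally resetting on every new message, as suggested above, is the
// simpler variant of the same idea.)
public class TimeoutQueryCoordinator
{
    DateTime startSlice = DateTime.MinValue;
    readonly object locker = new object();

    public void OnTimeoutStored(DateTime dueUtc)
    {
        lock (locker)
        {
            if (dueUtc <= startSlice)
                startSlice = DateTime.MinValue; // invalidate: re-scan everything
        }
    }

    public DateTime StartSlice
    {
        get { lock (locker) return startSlice; }
        set { lock (locker) startSlice = value; }
    }
}
```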
(I am the one who posted the question on the mailing list, and I've been following this issue closely since then.) This has become too large a problem for us in production code: things stopped running several times a week. As a result we have moved away from the NServiceBus Scheduler in favour of the Quartz scheduler. After switching we have not seen any more problems. We have only seen this issue with
Simplify querying for timeouts, and introduce cleaning-up abilities to guarantee we don't skip any timeouts
Further avoid duplicates by better cleanup logic
Adding failing tests demonstrating skipped timeouts
Backwards port of the fix to #2133
Simplify querying for timeouts, and introduce cleaning-up abilities to guarantee we don't skip any timeouts, fixes #2133
Re-enabling tests that should have NEVER been disabled
Minimizing the wait time for nextTimeToQuery timeouts
Optimization
Added back the loop that keeps querying for more results. Otherwise we could skip results that are supposed to trigger at the same time, because the startSlice uses greater-than logic.
Double assignment not needed
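The page-boundary gotcha named in that commit message can be shown with a small in-memory sketch (hypothetical, not the persister code): three timeouts due at the same instant and a page size of two. Advancing startSlice with strict greater-than after a single query would drop the third timeout; looping with Skip/Take drains all equal-timestamp results before the resume point moves.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class StartSliceDemo
{
    static void Main()
    {
        var due = new DateTime(2014, 6, 17, 10, 0, 0, DateTimeKind.Utc);
        var timeouts = new[] { due, due, due }; // three timeouts due at the same instant

        const int pageSize = 2;
        var startSlice = DateTime.MinValue;
        var dispatched = new List<DateTime>();

        // A single query followed by "startSlice = last result" would leave the
        // third timeout behind forever, because the next query uses strict
        // greater-than. Paging with Skip/Take inside the loop avoids that.
        int alreadyRead = 0;
        List<DateTime> page;
        do
        {
            page = timeouts
                .Where(t => t > startSlice)
                .OrderBy(t => t)
                .Skip(alreadyRead)
                .Take(pageSize)
                .ToList();
            dispatched.AddRange(page);
            alreadyRead += page.Count;
        } while (page.Count == pageSize);

        startSlice = due; // only now is it safe to advance past the equal timestamps

        Console.WriteLine(dispatched.Count); // 3: nothing skipped
    }
}
```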
I have been looking at this issue reported on the mailing list.
So to summarise the issue: the user is using the Scheduler API to schedule two tasks, one that runs every 5 seconds and another that runs every 1 minute.
Every so often the 5-second scheduled task stops recurring.
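For context, the setup described is roughly the following, assuming the NServiceBus 4.x scheduling API (the task bodies are placeholders):

```csharp
using System;
using NServiceBus;

// Each run of a scheduled task registers the next timeout, which is why a
// skipped timeout silently ends the recurrence.
public class ScheduleTasksAtStartup : IWantToRunWhenBusStartsAndStops
{
    public void Start()
    {
        Schedule.Every(TimeSpan.FromSeconds(5)).Action(() =>
        {
            // work that runs every 5 seconds
        });

        Schedule.Every(TimeSpan.FromMinutes(1)).Action(() =>
        {
            // work that runs every minute
        });
    }

    public void Stop()
    {
    }
}
```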
Here are the logs:
And here is one of the timeouts stored in Raven:
The only explanation I have is that if the index is stale, we could end up skipping over timeouts.
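One way to observe that failure mode from the RavenDB 2.x client is to ask for query statistics, which report staleness. This is an illustrative sketch; TimeoutDoc and "TimeoutsIndex" are stand-ins, and the real persister query differs.

```csharp
using System;
using System.Linq;
using Raven.Client;
using Raven.Client.Document;
using Raven.Client.Linq;

// Hypothetical stand-in for the stored timeout document.
public class TimeoutDoc
{
    public DateTime Time { get; set; }
}

class StalenessCheck
{
    static void Main()
    {
        using (var store = new DocumentStore { Url = "http://localhost:8080" }.Initialize())
        using (var session = store.OpenSession())
        {
            var now = DateTime.UtcNow;
            RavenQueryStatistics stats;
            var due = session.Query<TimeoutDoc>("TimeoutsIndex")
                .Statistics(out stats)
                .Where(t => t.Time <= now)
                .ToList();

            if (stats.IsStale)
            {
                // The results may be missing recently stored timeouts; advancing
                // the query window past 'now' at this point is what skips them.
                Console.WriteLine("Index was stale as of {0:u}", stats.IndexTimestamp);
            }
        }
    }
}
```

Whether to block on non-stale results (e.g. with a WaitForNonStaleResults query customization) or to tolerate staleness by re-querying is exactly the trade-off discussed in the comments above.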
@synhershko has actually identified this issue in the https://github.com/Particular/NServiceBus.RavenDB repo and has fixed it.
I cannot replicate this myself, but index staleness is definitely a possibility, and it would explain the entries I can see in the log file.