If index is stale we may miss timeouts #2133

Closed
johnsimons opened this issue May 28, 2014 · 19 comments

@johnsimons
Member

I have been looking at this issue reported on the mailing list.

To summarise: the user is using the Scheduler API to create two scheduled tasks, one that runs every 5 seconds and another that runs every 1 minute.
Every so often the 5-second scheduled task stops recurring.

Here are the logs

And here is one of the timeouts stored in Raven:

{
  "Destination": {
    "Queue": "Flyt.TiosCtcAgent",
    "Machine": "GMBSANNTID"
  },
  "SagaId": "00000000-0000-0000-0000-000000000000",
  "State": "PD94bWwgdmVyc2lvbj0iMS4wIiA/Pg0KPE1lc3NhZ2VzIHhtbG5zOnhzaT0iaHR0cDovL3d3dy53My5vcmcvMjAwMS9YTUxTY2hlbWEtaW5zdGFuY2UiIHhtbG5zOnhzZD0iaHR0cDovL3d3dy53My5vcmcvMjAwMS9YTUxTY2hlbWEiIHhtbG5zPSJodHRwOi8vdGVtcHVyaS5uZXQvTlNlcnZpY2VCdXMuU2NoZWR1bGluZy5NZXNzYWdlcyI+CjxTY2hlZHVsZWRUYXNrPgo8VGFza0lkPmU2YjZlN2NmLTlmYjMtNDhiZS1iZGUwLTc0NmJhZmE3MmRlNzwvVGFza0lkPgo8TmFtZT5Cb290c3RyYXA8L05hbWU+CjxFdmVyeT5QVDVTPC9FdmVyeT4KPC9TY2hlZHVsZWRUYXNrPgo8L01lc3NhZ2VzPg0K",
  "Time": "2014-05-21T08:18:00.2114720Z",
  "CorrelationId": "2f849e22-d760-4ce6-846b-a33200a9b784\\0",
  "OwningTimeoutManager": "Flyt.TiosCtcAgent",
  "Headers": {
    "NServiceBus.MessageId": "2f849e22-d760-4ce6-846b-a33200a9b784",
    "NServiceBus.CorrelationId": "2f849e22-d760-4ce6-846b-a33200a9b784",
    "NServiceBus.OriginatingEndpoint": "Flyt.TiosCtcAgent",
    "$.diagnostics.originating.hostid": "522caf7d3f53a790fdcc328b6cb0d627",
    "NServiceBus.MessageIntent": "Send",
    "NServiceBus.Version": "4.4.2",
    "NServiceBus.TimeSent": "2014-05-21 08:17:55:211472 Z",
    "NServiceBus.OriginatingMachine": "GMBSANNTID",
    "NServiceBus.ContentType": "text/xml",
    "NServiceBus.EnclosedMessageTypes": "NServiceBus.Scheduling.Messages.ScheduledTask, NServiceBus.Core, Version=4.4.0.0, Culture=neutral, PublicKeyToken=9fc386479f8a226c",
    "CorrId": "2f849e22-d760-4ce6-846b-a33200a9b784\\0",
    "WinIdName": "GMBDOMENE1\\sanntidintegrasjon",
    "NServiceBus.RelatedTo": "5da0ead5-1701-4e2e-9b43-a33200a9b16e",
    "NServiceBus.ConversationId": "a00caaba-1eab-48a7-854f-a33100a5c243",
    "NServiceBus.IsDeferredMessage": "True",
    "NServiceBus.Temporary.DelayDeliveryWith": "00:00:05",
    "NServiceBus.Timeout.Expire": "2014-05-21 08:18:00:211472 Z",
    "NServiceBus.Timeout.RouteExpiredTimeoutTo": "Flyt.TiosCtcAgent@GMBSANNTID",
    "NServiceBus.Timeout.ReplyToAddress": "Flyt.TiosCtcAgent@GMBSANNTID"
  }
}

The only explanation I have is that if the index is stale, we could end up skipping over timeouts.
@synhershko has actually identified this issue in the https://github.com/Particular/NServiceBus.RavenDB repo and has fixed it.

I cannot replicate this myself, but index staleness is definitely a possibility, and it would explain the entries I can see in the log file.
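
To make the failure mode concrete, here is a rough sketch (C#, illustrative only, not the actual persister code; store, endpointName and startSlice stand in for the real fields) of the kind of range query we run against the timeouts index, and why a stale index can return nothing even though the due timeout document already exists:

// Illustrative sketch only - not the real NServiceBus.RavenDB persister code.
// The timeout above was written at 08:17:55 with Time = 08:18:00; the document
// exists, but the index backing this query may not have caught up with it yet.
using (var session = store.OpenSession())
{
    RavenQueryStatistics stats;

    var dueTimeouts = session.Query<TimeoutData>()
        .Statistics(out stats)
        .Where(t => t.OwningTimeoutManager == endpointName
                    && t.Time > startSlice
                    && t.Time <= DateTime.UtcNow)
        .OrderBy(t => t.Time)
        .ToList();

    // If stats.IsStale is true, dueTimeouts can be missing entries that are
    // already due. The caller then moves startSlice past "now", so the missed
    // timeout never falls inside a later query window either.
}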

@johnsimons johnsimons added the Bug label May 28, 2014
@johnsimons
Member Author

@andreasohlund @synhershko I think we need a hotfix for this?

@andreasohlund
Member

+1 this seems like a hotfix

@synhershko
Contributor

@johnsimons I think you can just bring in my changes from the RavenDB repo?

@johnsimons johnsimons added this to the 4.6.2 milestone Jun 2, 2014
@johnsimons
Member Author

@synhershko can you fix this in the core and ship it?

@synhershko
Contributor

There is no streaming API in RavenDB 2.0, so we cannot apply the fix from upstream here.

I see 2 routes we could go here:

  1. Waiting for non-stale results. I don't really like this.
  2. Making sure we only delete timeouts we've seen before; this allows us to fire events even if late. Better late than never (see the sketch below).

There may be some optimizations we could make to the current code to minimize the risk of this happening, but in order for this to be gone completely we need to use one of the above solutions or come up with another.
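
For option 2, roughly what I have in mind (a sketch only; the TryRemove-style ownership call and the cleanup query are illustrative, not the final API):

// Sketch of option 2 - illustrative names, not the final implementation.
// Timeout documents are only deleted once we have actually seen and
// dispatched them, so a timeout the index surfaced too late is not lost.

// Normal polling uses the startSlice window; a periodic cleanup pass drops the
// lower bound so anything still stored and overdue is picked up eventually.
var due = session.Query<TimeoutData>()
    .Where(t => t.OwningTimeoutManager == endpointName
                && t.Time <= DateTime.UtcNow) // cleanup pass: no startSlice bound
    .OrderBy(t => t.Time)
    .ToList();

foreach (var timeout in due)
{
    TimeoutData timeoutData;

    // Delete only what we have seen: if another node already removed this
    // timeout the call fails and we skip it; otherwise we own it and dispatch.
    if (timeoutPersister.TryRemove(timeout.Id, out timeoutData))
    {
        DispatchTimeout(timeoutData); // fired late if the index was stale, but fired
    }
}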

Thoughts?

@johnsimons
Member Author

But how does number 2 work regarding the range query we are currently doing?


@synhershko
Contributor

Looking at this again, I think I'll apply similar logic to what we have with streams, only without streams.

Meaning, we will request a big page size and not page further. If there are more timeouts to handle, the method will be called again; we'll make sure it starts querying from where it stopped (rough sketch after the list below).

That leaves the implementation with the same two holes, or gotchas, as the current upstream implementation:

  1. If timeouts are added for a chunk we have already processed, we may miss them. IIRC John and I concluded that can't really happen.
  2. When there are many timeouts, the query used to pull them from the store is going to be very costly. This is also an issue upstream; no solution comes to mind at this point, and we are waiting to see if it's a real issue.
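
Rough sketch of that (illustrative only; maxPageSize, lastProcessedTime and DispatchTimeout are placeholders, and the polling interval at the end is arbitrary):

// Sketch of the "one big page, no further paging" idea - illustrative names,
// not the final implementation.
const int maxPageSize = 1024;

RavenQueryStatistics stats;
var chunk = session.Query<TimeoutData>()
    .Statistics(out stats)
    .Where(t => t.OwningTimeoutManager == endpointName
                && t.Time > lastProcessedTime
                && t.Time <= DateTime.UtcNow)
    .OrderBy(t => t.Time)
    .Take(maxPageSize)
    .ToList();

foreach (var timeout in chunk)
{
    DispatchTimeout(timeout);
    lastProcessedTime = timeout.Time; // resume point for the next call
}

// If the page came back full there may be more due timeouts, so ask to be
// called again straight away instead of waiting for the normal poll interval.
var nextTimeToRunQuery = chunk.Count == maxPageSize
    ? DateTime.UtcNow
    : DateTime.UtcNow.AddSeconds(10); // placeholder polling interval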

synhershko added a commit that referenced this issue Jun 3, 2014
Simplify querying for timeouts, and introduce cleaning-up abilities to guarantee we don't skip any timeouts
@synhershko
Contributor

A failing test against master demonstrating this: b99e444 /cc @SimonCropp

@andreasohlund
Member

Nice work Itamar, does this test pass on the new Raven repo (with the streaming)?


@synhershko
Contributor

Nope, working on that now

@andreasohlund
Member

@johnsimons so do we release now?

@synhershko
Contributor

@andreasohlund there is one failing test which is faulty; I would want to see it pass before we do. Plus we need to make the cleanup params configurable, as you requested.

@andreasohlund
Member

Got it, do you have any bandwidth to look at it this week?


@synhershko
Contributor

yeah I'll have it done by Friday

@andreasohlund
Member

thanks!


@SimonCropp
Contributor

@synhershko any update on this?

@synhershko
Contributor

Sent an email to @johnsimons and @andreasohlund re status; never heard back.

@synhershko
Contributor

Re my work on timeouts:

Been giving this some more love and wrote some more tests. It's hard, since the timeout coordinator doesn't take eventually consistent storages into account, and the cleanup procedure I just introduced here is VERY hard to test, especially with the current API (how do you know all cleanup operations have run? how do you verify this will behave the same in production?).

There is one new failing test, Should_return_the_next_time_of_retrieval - basically either because I broke the contract of what nextTimeToRunQuery should return, or simply because of the eventually consistent nature of the persister. I copied this test from the NHibernate repo I think, so it probably should pass.

I'm pretty much out of time this week, may have more next week. Unless you can think of an air-tight way out of this, may I suggest making the coordinator aware of eventually consistent storages? The easiest way is to just invalidate startSlice when a new message comes in (rough sketch below), but yeah, it would introduce many duplicates.

Not feeling comfortable releasing this yet.
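
For reference, the startSlice invalidation I mean would look roughly like this (illustrative only; Store, sliceLock and the field names are placeholders), and it is exactly where the duplicates would come from:

// Sketch of "invalidate startSlice when a new timeout comes in" - illustrative
// only, not the real persister. The idea is to pull the query window back so
// the next poll re-reads a range the index may not have caught up with yet.
public void Add(TimeoutData timeout)
{
    Store(timeout); // persist the timeout document as we do today

    lock (sliceLock)
    {
        // Pull startSlice back to (at most) the new timeout's time. The next
        // query re-covers that range once the index catches up, at the cost of
        // returning timeouts we may already have dispatched (the duplicates).
        if (timeout.Time < startSlice)
        {
            startSlice = timeout.Time;
        }
    }
}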

@mariusGundersen

(I am the one who posted the question on the mailing list, and I've been following this issue closely since then)

This has become too large of a problem for us in production code: things stopped running several times a week. As a result we have gone away from using the NServiceBus Scheduler in favour of the Quartz scheduler. After switching we have not seen any more problems.

We have only seen this issue with Schedule.Every(), not with timeouts in sagas. I'm not familiar with the code, but I assume there is shared code here. This is worrying, as it could mean that sagas won't get the timeouts or deferred messages that we expect them to get. If this is true, then we have a much bigger problem! If it is not true (that they are implemented the same way), then why is the scheduler implemented in a different way from the saga timeouts?

synhershko added a commit that referenced this issue Jul 10, 2014
Simplify querying for timeouts, and introduce cleaning-up abilities to guarantee we don't skip any timeouts

Further avoid duplicates by better cleanup logic

Adding failing tests demonstrating skipped timeouts
synhershko added a commit that referenced this issue Jul 15, 2014
Simplify querying for timeouts, and introduce cleaning-up abilities to guarantee we don't skip any timeouts
@johnsimons johnsimons added the Bug label Jul 15, 2014
synhershko added a commit that referenced this issue Jul 15, 2014
synhershko added a commit that referenced this issue Jul 15, 2014
Backwards port of the fix to #2133
synhershko added a commit that referenced this issue Jul 21, 2014
…o guarantee we don't skip any timeouts, fixes #2133

Re-enabling tests that should have NEVER been disabled

Minimizing the wait time for nextTimeToQuery timeouts

Optimization

Added back the loop that keeps querying for more results.

Otherwise we could skip results that are supposed to trigger at the same time, because the startSlice uses greater-than logic.

Double assignment not needed