If index is stale we may miss timeouts #2133

Closed
johnsimons opened this issue May 28, 2014 · 19 comments

@johnsimons
Member

I have been looking at this issue reported on the mailing list.

To summarise: the user is using the Scheduler API to create two scheduled tasks, one that runs every 5 seconds and another that runs every 1 minute.
Every so often the 5-second scheduled task stops recurring.

Here are the logs

And here is one of the timeouts stored in Raven:

{
  "Destination": {
    "Queue": "Flyt.TiosCtcAgent",
    "Machine": "GMBSANNTID"
  },
  "SagaId": "00000000-0000-0000-0000-000000000000",
  "State": "PD94bWwgdmVyc2lvbj0iMS4wIiA/Pg0KPE1lc3NhZ2VzIHhtbG5zOnhzaT0iaHR0cDovL3d3dy53My5vcmcvMjAwMS9YTUxTY2hlbWEtaW5zdGFuY2UiIHhtbG5zOnhzZD0iaHR0cDovL3d3dy53My5vcmcvMjAwMS9YTUxTY2hlbWEiIHhtbG5zPSJodHRwOi8vdGVtcHVyaS5uZXQvTlNlcnZpY2VCdXMuU2NoZWR1bGluZy5NZXNzYWdlcyI+CjxTY2hlZHVsZWRUYXNrPgo8VGFza0lkPmU2YjZlN2NmLTlmYjMtNDhiZS1iZGUwLTc0NmJhZmE3MmRlNzwvVGFza0lkPgo8TmFtZT5Cb290c3RyYXA8L05hbWU+CjxFdmVyeT5QVDVTPC9FdmVyeT4KPC9TY2hlZHVsZWRUYXNrPgo8L01lc3NhZ2VzPg0K",
  "Time": "2014-05-21T08:18:00.2114720Z",
  "CorrelationId": "2f849e22-d760-4ce6-846b-a33200a9b784\\0",
  "OwningTimeoutManager": "Flyt.TiosCtcAgent",
  "Headers": {
    "NServiceBus.MessageId": "2f849e22-d760-4ce6-846b-a33200a9b784",
    "NServiceBus.CorrelationId": "2f849e22-d760-4ce6-846b-a33200a9b784",
    "NServiceBus.OriginatingEndpoint": "Flyt.TiosCtcAgent",
    "$.diagnostics.originating.hostid": "522caf7d3f53a790fdcc328b6cb0d627",
    "NServiceBus.MessageIntent": "Send",
    "NServiceBus.Version": "4.4.2",
    "NServiceBus.TimeSent": "2014-05-21 08:17:55:211472 Z",
    "NServiceBus.OriginatingMachine": "GMBSANNTID",
    "NServiceBus.ContentType": "text/xml",
    "NServiceBus.EnclosedMessageTypes": "NServiceBus.Scheduling.Messages.ScheduledTask, NServiceBus.Core, Version=4.4.0.0, Culture=neutral, PublicKeyToken=9fc386479f8a226c",
    "CorrId": "2f849e22-d760-4ce6-846b-a33200a9b784\\0",
    "WinIdName": "GMBDOMENE1\\sanntidintegrasjon",
    "NServiceBus.RelatedTo": "5da0ead5-1701-4e2e-9b43-a33200a9b16e",
    "NServiceBus.ConversationId": "a00caaba-1eab-48a7-854f-a33100a5c243",
    "NServiceBus.IsDeferredMessage": "True",
    "NServiceBus.Temporary.DelayDeliveryWith": "00:00:05",
    "NServiceBus.Timeout.Expire": "2014-05-21 08:18:00:211472 Z",
    "NServiceBus.Timeout.RouteExpiredTimeoutTo": "Flyt.TiosCtcAgent@GMBSANNTID",
    "NServiceBus.Timeout.ReplyToAddress": "Flyt.TiosCtcAgent@GMBSANNTID"
  }
}

The only explanation I have is that if the index is stale, we could end up skipping over timeouts.
@synhershko has actually identified this issue in the https://github.com/Particular/NServiceBus.RavenDB repo and has fixed it.

I cannot replicate this myself, but index staleness is definitely a possibility, and it would explain the entries I can see in the log file.
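
To make the failure mode concrete, here is a rough sketch (C#, illustrative only, not the actual persister code; store, endpointName and startSlice stand in for the real fields) of the kind of range query we run against the timeouts index, and why a stale index can return nothing even though the due timeout document already exists:

// Illustrative sketch only - not the real NServiceBus.RavenDB persister code.
// The timeout above was written at 08:17:55 with Time = 08:18:00; the document
// exists, but the index backing this query may not have caught up with it yet.
using (var session = store.OpenSession())
{
    RavenQueryStatistics stats;

    var dueTimeouts = session.Query<TimeoutData>()
        .Statistics(out stats)
        .Where(t => t.OwningTimeoutManager == endpointName
                    && t.Time > startSlice
                    && t.Time <= DateTime.UtcNow)
        .OrderBy(t => t.Time)
        .ToList();

    // If stats.IsStale is true, dueTimeouts can be missing entries that are
    // already due. The caller then moves startSlice past "now", so the missed
    // timeout never falls inside a later query window either.
}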

@johnsimons johnsimons added the Bug label May 28, 2014
@johnsimons
Member Author

@andreasohlund @synhershko I think we need a hotfix for this?

@andreasohlund
Member

+1 this seems like a hotfix

@synhershko
Contributor

@johnsimons I think you can just bring in my changes from the RavenDB repo?

@johnsimons johnsimons added this to the 4.6.2 milestone Jun 2, 2014
@johnsimons
Member Author

@synhershko can you fix this in the core and ship it?

@synhershko
Contributor

There is no streaming API in RavenDB 2.0, so we cannot apply the fix from upstream here.

I see 2 routes we could go here:

  1. Waiting for non-stale results. I don't really like this.
  2. Making sure we only delete timeouts we've seen before; this allows us to fire events even if late. Better late than never (see the sketch below).

There may be some optimizations we could make to the current code to minimize the risk of this happening, but in order for this to be gone completely we need to use one of the above solutions or come up with another.
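
For option 2, roughly what I have in mind (a sketch only; the TryRemove-style ownership call and the cleanup query are illustrative, not the final API):

// Sketch of option 2 - illustrative names, not the final implementation.
// Timeout documents are only deleted once we have actually seen and
// dispatched them, so a timeout the index surfaced too late is not lost.

// Normal polling uses the startSlice window; a periodic cleanup pass drops the
// lower bound so anything still stored and overdue is picked up eventually.
var due = session.Query<TimeoutData>()
    .Where(t => t.OwningTimeoutManager == endpointName
                && t.Time <= DateTime.UtcNow) // cleanup pass: no startSlice bound
    .OrderBy(t => t.Time)
    .ToList();

foreach (var timeout in due)
{
    TimeoutData timeoutData;

    // Delete only what we have seen: if another node already removed this
    // timeout the call fails and we skip it; otherwise we own it and dispatch.
    if (timeoutPersister.TryRemove(timeout.Id, out timeoutData))
    {
        DispatchTimeout(timeoutData); // fired late if the index was stale, but fired
    }
}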

Thoughts?

@johnsimons
Member Author

But how does number 2 work regarding the range query we are currently doing?


@synhershko
Contributor

Looking at this again, I think I'll apply similar logic to what we have with streams, only without streams.

Meaning, we will request a big page size and not page further. If there are more timeouts to handle, the method will be called again; we'll make sure it starts querying from where it stopped (rough sketch after the list below).

That leaves the implementation with the same two holes, or gotchas, as the current upstream implementation:

  1. If timeouts are added for a chunk we have already processed, we may miss them. IIRC John and I concluded that can't really happen.
  2. When there are many timeouts, the query used to pull them from the store is going to be very costly. This is also an issue upstream; no solution comes to mind at this point, and we are waiting to see if it's a real issue.
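
Rough sketch of that (illustrative only; maxPageSize, lastProcessedTime and DispatchTimeout are placeholders, and the polling interval at the end is arbitrary):

// Sketch of the "one big page, no further paging" idea - illustrative names,
// not the final implementation.
const int maxPageSize = 1024;

RavenQueryStatistics stats;
var chunk = session.Query<TimeoutData>()
    .Statistics(out stats)
    .Where(t => t.OwningTimeoutManager == endpointName
                && t.Time > lastProcessedTime
                && t.Time <= DateTime.UtcNow)
    .OrderBy(t => t.Time)
    .Take(maxPageSize)
    .ToList();

foreach (var timeout in chunk)
{
    DispatchTimeout(timeout);
    lastProcessedTime = timeout.Time; // resume point for the next call
}

// If the page came back full there may be more due timeouts, so ask to be
// called again straight away instead of waiting for the normal poll interval.
var nextTimeToRunQuery = chunk.Count == maxPageSize
    ? DateTime.UtcNow
    : DateTime.UtcNow.AddSeconds(10); // placeholder polling interval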

synhershko added a commit that referenced this issue Jun 3, 2014
Simplify querying for timeouts, and introduce cleaning-up abilities to guarantee we don't skip any timeouts
@synhershko
Contributor

A failing test against master demonstrating this: b99e444 /cc @SimonCropp

@andreasohlund
Member

Nice work Itamar, does this test pass on the new Raven repo (with the streaming)?


@synhershko
Contributor

Nope, working on that now

@andreasohlund
Member

@johnsimons so do we release now?

@synhershko
Contributor

@andreasohlund there is one failing test which is faulty; I would want to see it pass before we do. Plus we need to make the cleanup params configurable, as you requested.

@andreasohlund
Member

Got it, do you have any bandwidth to look at it this week?


@synhershko
Contributor

yeah I'll have it done by Friday

@andreasohlund
Member

thanks!


@SimonCropp
Contributor

@synhershko any update on this?

@synhershko
Contributor

Sent an email to @johnsimons and @andreasohlund re status; never heard back.

@synhershko
Contributor

Re my work on timeouts:

Been giving this some more love and wrote some more tests. It's hard, since the timeout coordinator doesn't take eventually consistent storages into account, and the cleanup procedure I just introduced here is VERY hard to test, especially with the current API (how do you know all cleanup operations have run? how do you verify this will behave the same in production?).

There is one new failing test, Should_return_the_next_time_of_retrieval - basically either because I broke the contract of what nextTimeToRunQuery should return, or simply because of the eventually consistent nature of the persister. I copied this test from the NHibernate repo I think, so it probably should pass.

I'm pretty much out of time this week, may have more next week. Unless you can think of an air-tight way out of this, may I suggest making the coordinator aware of eventually consistent storages? The easiest way is to just invalidate startSlice when a new message comes in (rough sketch below), but yeah, it would introduce many duplicates.

Not feeling comfortable releasing this yet.
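
For reference, the startSlice invalidation I mean would look roughly like this (illustrative only; Store, sliceLock and the field names are placeholders), and it is exactly where the duplicates would come from:

// Sketch of "invalidate startSlice when a new timeout comes in" - illustrative
// only, not the real persister. The idea is to pull the query window back so
// the next poll re-reads a range the index may not have caught up with yet.
public void Add(TimeoutData timeout)
{
    Store(timeout); // persist the timeout document as we do today

    lock (sliceLock)
    {
        // Pull startSlice back to (at most) the new timeout's time. The next
        // query re-covers that range once the index catches up, at the cost of
        // returning timeouts we may already have dispatched (the duplicates).
        if (timeout.Time < startSlice)
        {
            startSlice = timeout.Time;
        }
    }
}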

@mariusGundersen

(I am the one who posted the question on the mailing list, and I've been following this issue closely since then)

This has become too large of a problem for us in production code: things stopped running several times a week. As a result we have gone away from using the NServiceBus Scheduler in favour of the Quartz scheduler. After switching we have not seen any more problems.

We have only seen this issue with Schedule.Every(), not with timeouts in sagas. I'm not familiar with the code, but I assume there is shared code here. This is worrying, as it could mean that sagas won't get the timeouts or deferred messages that we expect them to get. If this is true, then we have a much bigger problem! If it is not true (that they are implemented the same way), then why is the scheduler implemented in a different way from the saga timeouts?

synhershko added a commit that referenced this issue Jul 10, 2014
Simplify querying for timeouts, and introduce cleaning-up abilities to guarantee we don't skip any timeouts

Further avoid duplicates by better cleanup logic

Adding failing tests demonstrating skipped timeouts
synhershko added a commit that referenced this issue Jul 15, 2014
Simplify querying for timeouts, and introduce cleaning-up abilities to guarantee we don't skip any timeouts
@johnsimons johnsimons added the Bug label Jul 15, 2014
synhershko added a commit that referenced this issue Jul 15, 2014
synhershko added a commit that referenced this issue Jul 15, 2014
Backwards port of the fix to #2133
synhershko added a commit that referenced this issue Jul 21, 2014
…o guarantee we don't skip any timeouts, fixes #2133

Re-enabling tests that should have NEVER been disabled

Minimizing the wait time for nextTimeToQuery timeouts

Optimization

Added back the loop that keeps querying for more results.

Otherwise we could skip results that are supposed to trigger at the same time, because the startSlice uses greater-than logic.

Double assignment not needed