
Log exception when message is handled by retry behaviors #2074

Closed
andreasohlund opened this issue Apr 28, 2014 · 30 comments

@andreasohlund
Member

Who's affected

Users relying on the exception details of failing messages from the log file.

Symptoms

When message processing fails during First Level and Second Level Retries, the exception is logged only at INFO level, which usually isn't monitored in production environments.

Changes

Exceptions during message processing will be logged with the following levels:

  • INFO when the error is handled by First Level Retry
  • WARN when the error is handled by Second Level Retry
  • ERROR when the message is moved to the error queue

Each log message is written from the behavior that handles the failure, which allows custom logging configuration for each of the mentioned log entries when an advanced logging framework is plugged in.
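As an illustration only, a log4net configuration along these lines could then tune each entry independently. The logger names below are hypothetical placeholders; the discussion later in this thread shows the actual names (NServiceBus.Unicast.Transport.TransportReceiver in v4/v5, NServiceBus.RecoverabilityExecutor in v6):

<!-- Sketch with hypothetical logger names; see the thread below for the real ones -->
<root>
  <level value="WARN" />
</root>
<logger name="NServiceBus.FirstLevelRetries">
  <level value="INFO" />
</logger>
<logger name="NServiceBus.SecondLevelRetries">
  <level value="WARN" />
</logger>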

@SimonCropp
Contributor

Isn't the issue that the line that actually logs the exception is an INFO and not an ERROR or WARN? https://github.com/Particular/NServiceBus/blob/develop/src/NServiceBus.Core/Unicast/Transport/TransportReceiver.cs#L318

@andreasohlund
Member Author

I'm arguing that's correct, since it's only a WARN when we give up on FLR and go to SLR, and then an ERROR when we give up on SLR and move the message to the error queue.


@johannesg
Contributor

Any status on this? We've started to use v4 more extensively, and it is increasingly becoming a problem for us that we can't see what the error is in our logs.

We filter out logs from NServiceBus that are lower than WARN. We also have SLR disabled.

IMHO, I don't really understand why a failed message is not considered an error (or at least a warning).

@johannesg
Contributor

Logging every failed message as a warning is also useful if you want to measure the number of failures you get. For example, if you have an endpoint that, due to DB contention, fails 3-4 times for most messages, you want to know about it so you can take action. Having NSB log only whenever it goes to FLR or SLR is not helpful in this case.

@johnsimons
Member

@andreasohlund can you please update the links you are pointing to in the issue description? The links are invalid! You can press 'y' to get links that point to the actual commit, so that they don't change.
Also, I'm not clear on what you are proposing; my understanding is that we log an error when a message is sent to the error queue.

@johnsimons
Member

@johannesg in your scenario, wouldn't it be better to monitor the perf counters for SQL instead?
I'm pretty sure there is a perf counter for contention issues.

@johannesg
Contributor

There are, and we monitor them too. I still think, though, that it is very useful to be able to see the problem directly at the source rather than having to backtrack. Also, what if you call a third-party web service instead of a SQL server?

May I ask, what is the reasoning behind not treating a failed message as a potential problem?

@johnsimons
Member

May I ask, what is the reasoning behind not treating a failed message as a potential problem?

I guess the reasoning is that a message is not considered a failure until it has gone through all the retries and been moved to the error queue.

@mauroservienti
Member

Is there any way to get the reason for the failure into the logs, each time a retry attempt is performed?

@ThinkFr33ly

I think the error details should be logged as warnings during the retries, and as an error once the message is moved to the error queue. This is especially important when a simple ToString() isn't good enough at capturing the error details. See: http://stackoverflow.com/questions/27696539/configure-nservicebus-to-log-exceptions-that-are-causing-message-retries

@mlidbom

mlidbom commented Sep 7, 2015

I honestly don't know how to comment politely on this.

You have a horrible bug in your code that makes production logs almost completely useless. And you mark fixing it as a FEATURE? And do nothing about it for more than a year?

It should have been hotfixed the day after it was reported. All you need to do is change a log level to warn. That is all!

Unbelievable. Just unbelievably unprofessional.

@mlidbom

mlidbom commented Sep 7, 2015

As for the log level for the original exception it seems obvious to me that it should be error or at least warn.

  • The message handling code has utterly failed. An exception has escaped the user code and had to be caught by infrastructure code that may, at best, be able to recover by retrying.
  • Recovery is only possible if retries are configured. If not, the message fails completely.
  • Even if we can recover this time, the failure is an indication that something is wrong: a service is down, we are getting close to the limit of the load the system can handle, we are having problems with the SQL server, etc. We must know these things before we pass the threshold and disaster strikes.
  • Each retry applies additional load to the system. We must know that this is happening.
  • When a message goes to SLR, we no longer have any ordering guarantees. We must know about problems and be able to react to them before this milestone is passed. (Please don't say we should "just" design everything to be guaranteed to handle events in any order.)

Any one of the above reasons should make it obvious that this needs to be, at least, a warning level log entry that includes the full call stack. And I have no doubt that there are many more reasons that could be listed.

@andreasohlund andreasohlund changed the title Include the details of the last exception when moving messages to SLQ or Error Include the details of the last exception when moving messages to SLR or to the error queue Sep 7, 2015
@andreasohlund
Member Author

@mlidbom this one definitely slipped through the cracks. @adamralph was just cleaning up some labels, so he's innocent :). We decided to merge Feature and Improvement, and I'm changing this to a bug.

This was fixed on develop a few days ago by @timbussmann.

You have a horrible bug in your code that makes production logs almost completely useless

Well, all the details of the exception are available on the message in the error queue, so I'm not sure I'd call it a "horrible bug": you can configure the logger to pick it up even though the level is too low. Again, no info was lost, since the details are kept on the error message. But yes, quite inconvenient :(

Current approach is:

  • INFO when the error is handled by First Level Retry
  • WARN when the error is handled by Second Level Retry
  • ERROR when the message is moved to the error queue

I think that with the above + the notifications API mentioned we have a good balance?

@andreasohlund andreasohlund added this to the 6.0.0 milestone Sep 7, 2015
@mlidbom

mlidbom commented Sep 7, 2015

I just tried setting the log level to INFO and restarted a few of our services. I got almost 7000 messages spewed out in the log, and NOTHING had gone wrong in any way, shape, or form. This is the log level at which you are hiding complete failures of the user code.

Could you please explain why you think that hiding the complete failure of user code at the INFO log level will help your users? It makes absolutely no sense to me, and you did not respond at all to my list of reasons why this needs to be WARN level at least.

I stand by my opinion that this is a horrible bug; inconvenient does not begin to cover it. The stack may well be lost forever when the message gets through on a later retry. Manually trying to correlate messages, which may or may not still be there, to errors in the log is not an acceptable way of finding the call stack that goes with an error.

Having to write our own code to hook into some custom API just to get errors logged as errors is likewise not acceptable. An API that does not even seem to exist in 4.6, which is the version we use.

I'm not asking for much here. I just want exceptions escaping from my code to be logged at at least the WARN level, including the complete call stack, whenever they occur. That this is something I have to argue about is unbelievable to me. That NServiceBus has never supplied this without us having to fork it leaves me incredibly frustrated.

EDIT Clarification: NServiceBus used to log the errors, but intentionally mutilated the call stacks, making the log entries useless most of the time. Now it does not log them at all with production log levels configured. So, all in all, we have never had what we needed from NServiceBus's logging without having to fork it.

@andreasohlund
Member Author

Regarding the INFO logging: agreed, we log way too much at INFO. We've started to fix that recently

278420b

with the goal of having a pretty much silent startup with logging set to the INFO level.

@andreasohlund
Member Author

Could you please explain why you think that hiding the complete failure of user code at the INFO log level will help your users?

Since we have now corrected the logging, you will always get an ERROR log entry, and also at least one WARN if SLR is on. You will get INFO on the FLR, with the motivation that if FLR doesn't solve the issue, it will be logged at the higher levels mentioned above.

Yes, it could be that the FLR attempts prior to the last one had a different exception, and those would be logged at INFO.

Did I read you correctly in that this is your main concern with what we have at the moment? (Talking about the latest develop for now.)

@mlidbom

mlidbom commented Sep 7, 2015

"Since we now have corrected the logging you will always get at an ERROR log"
"Yes it could be the FLR attempts prior to the last one could have a different exception and those would be logged with INFO."

I'm not really sure how to interpret this. Seems self contradictory. If FLRs are logged as info, how do we still always get an ERROR log entry? Seems nothing would be logged above INFO unless FLR fails completely?

@mlidbom

mlidbom commented Sep 7, 2015

OK, so let me take a different tack here.
We use custom-configured Log4Net logging with NSB 4.6, so we have full access to the configurability of Log4Net.

Could you suggest a configuration of the logger that will not log anything below the WARN level except for making sure that exceptions thrown from message handling code are logged including the full stack trace?

@johannesg
Contributor

Andreas,

Consider this hypothetical scenario: there is a problem with a message handler due to contention, a slow web service, or whatever. This causes 50% of all messages to fail once or twice, but they usually succeed on the third attempt. They will therefore never trigger SLR.

Q: How can we detect this problem with the help of NSB?


@mlidbom

mlidbom commented Sep 7, 2015

@andreasohlund
Our issue is this: we need to be able to review our logs to find current and potential problems.
In order to do that, we need a log level that means: "There might be a problem; a developer should probably look at this log entry." Any message handler failing definitely fits this description.

If NServiceBus logs message handler failures as INFO, we lose this ability. In one stroke, our logs stop being able to tell us whether we should be worried. Potential and real problems are hidden in the torrent of log entries that various libraries output at the INFO level.

@timbussmann
Contributor

@mlidbom have you tried the following configuration:

<logger name="NServiceBus.Unicast.Transport.TransportReceiver">
  <level value="INFO" />
</logger>
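For context, here is a minimal sketch of how that element might sit inside a fuller log4net configuration that keeps everything else at WARN; the appender name is a hypothetical placeholder, not something NServiceBus ships:

<log4net>
  <root>
    <!-- keep all other chatter out of the log -->
    <level value="WARN" />
    <appender-ref ref="FileAppender" /> <!-- hypothetical appender -->
  </root>
  <!-- but let the TransportReceiver's INFO entries (the handler exceptions) through -->
  <logger name="NServiceBus.Unicast.Transport.TransportReceiver">
    <level value="INFO" />
  </logger>
</log4net>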

@johannesg but that's what First Level Retries are useful for. Do you really want to see the exceptions in this case (spamming your log)? There are better options for detecting network/server issues (e.g. performance counters).

@mlidbom

mlidbom commented Sep 7, 2015

@timbussmann
Thanks for that config. I'll try it.

"@johannesg but that's what First Level Retries are useful for. Do you really wanna see the exceptions in this case (spamming your log)?"

Absolutely.
It seems we see these failures very differently. To me NServiceBus being able to retry automatically is great, but it is like the seat belt in your car. You absolutely should have one, but being saved by it should not be considered normal.

Message handlers failing always indicates that something is wrong. Something that we should at least review to see if we need to look more closely into it. Delegating failures to the INFO level seems to me to be sweeping real problems under the rug where they are less visible.

IMO that is a recipe for running into really nasty problems down the line that you will not be prepared for.

@timbussmann timbussmann changed the title Include the details of the last exception when moving messages to SLR or to the error queue Log exception when message is handled by retry behaviors Sep 7, 2015
@timbussmann
Contributor

Great, let me know if that helps.

In most cases, first level retry failures are issues with external resources/systems. Examining network exceptions may be helpful for a company trying to detect an unstable intranet, but it's often not useful for the developer, because it's not an issue with the endpoint (and there are far more efficient tools for monitoring network/service status).

But it has become obvious that we can't generalize this and need to provide a way for users to change it according to their needs. While I think INFO is fine for FLR most of the time (and I agree that there are too many INFO messages), the next release will use different loggers for each recoverability feature, which allows custom logging configuration when using log4net or other logging frameworks (e.g. log WARN by default, but include INFO from First Level Retries).

btw, I completely agree with you on the seat belts, but please consider that there are different seat belts, specialized for their use case. While a race car demands a proper racing seat belt, that's complete overkill for a common car.

@mlidbom

mlidbom commented Sep 7, 2015

@timbussmann
That config works better than I would have expected. I was worried that the TransportReceiver would log more at INFO, but it seems to only log the failures.

Thanks again :)

@mlidbom

mlidbom commented Sep 7, 2015

@timbussmann
"In most cases, first level retry failures are issues with external resources/systems and examining network exceptions may be helpful for a company detecting an unstable intranet, it's often not useful for the developer because it's not an issue with the endpoint"

We run our applications/services in production as well as develop them. In this context I do not care if a problem is caused by our code or by an overloaded SQL server or by a broken port on a switch. I always need to know about it.

Consider the very common situation where timeouts and deadlocks are becoming increasingly more common as load increases or as the database size increases, or as more databases are put on the same host without the knowledge of our team. Sooner or later we are very likely to run into major pain. This is certainly a resource external to the endpoint. But delegating messages about deadlocks and timeouts to the INFO level is likely to leave many organisations with little to no warning that they have an escalating problem in production. Once the problems escalate to the level that FLR fails and the entries in the log are noticed it is likely that the system is already on the verge of grinding to a virtual halt. It may well be to late to fix the problem without serious business impact at this point.

Most developers will assume that anything below the WARN level can safely be ignored. This assumption is perfectly understandable since it is almost invariably true in my experience.

I now know about this unusual aspect of NServiceBus and can adjust to it. How many of your customers do?
Hiding these messages with the logger configuration is no harder than showing them would be.
Is defaulting to hiding them really the best thing for your customers?
I for one believe that letting users filter out exceptions they do not want to see is a better default than hiding exceptions away in a log level that is normally irrelevant to detecting problems.

Will the customer in the above scenario thank you for putting these timeouts and deadlocks at the INFO log level? I rather doubt it.

Edit: You may argue that other forms of monitoring are better suited to detecting problems like this. Well, I live in a world that is far from optimal, and I need the tools I do have at my disposal to do their best to help me detect problems. I imagine having little to no ability to affect production monitoring is only too common in the trenches of application development. Also, those other tools would not tell me which parts of my code are causing the timeouts/deadlocks.

@timbussmann
Contributor

@mlidbom Glad the config helps! :)

From my experience running NServiceBus in production, we used several other monitoring mechanisms, especially for SQL Server, which also allowed us to examine these issues in more detail (e.g. query performance, number of queries, users, and so on). But as concluded before: every project is different and has different needs, and sometimes it's very difficult to even gather those needs. What we end up discussing is reasonable default settings, which is a tricky task for a framework; these are things we only change carefully. If it becomes apparent that most users would like this kind of logging information, we will happily deliver. We therefore really value feedback and more detailed insights like what you've provided us with; thanks for that!

I'm going to close this issue, since I think we have a working solution for future releases that allows more fine-grained logging configuration for FLR & SLR. As Andreas mentioned, we also have a really cool notifications API from v5 on.

@mlidbom

mlidbom commented Sep 7, 2015

So, just so it's clear: is any fix coming to 4.* or 5.*, or will the fix only come in the 6.0.0 milestone set for this bug?

The workaround with the log config seems like something we will have to use permanently, so I'm not overly concerned either way; I just want to know :)

More importantly: if you change the loggers for these exceptions in the future, so that the configuration trick referring to NServiceBus.Unicast.Transport.TransportReceiver stops working, please be sure to include that in the migration instructions. I really do not want to end up with a logging configuration that excludes these exceptions again.

@timbussmann
Contributor

Currently there are no changes planned for v4/v5. The custom logger configuration for the TransportReceiver should also work with v5.

I'll make sure to include the logging changes in the v6 upgrade guide.

@swettstein

I can't find anything in the upgrade guide related to this. @timbussmann can you point me in the right direction?

@timbussmann
Contributor

Hey @swettstein
You're right, there is currently no mention of the logging changes in the upgrade guide. We thought about this and decided not to include it, as we currently do not consider log statements part of the public API (which follows semver). Since we can't reliably guarantee that we aren't changing something about logging even between patch versions, including it in the upgrade guide wouldn't add the necessary reliability. Please note that I've raised an issue to make logging part of the public API and counter this behavior: #4159

As of now, the logger that logs first level retry attempts has changed from NServiceBus.Unicast.Transport.TransportReceiver to NServiceBus.RecoverabilityExecutor. This logger logs all the retry-specific log entries in v6.
/cc @mlidbom
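For v6 users, a minimal sketch of the equivalent log4net configuration, assuming the logger name stated above and the same setup as the earlier snippet:

<!-- v6: all retry-related entries come from this logger -->
<logger name="NServiceBus.RecoverabilityExecutor">
  <level value="INFO" />
</logger>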

I'm sorry this change didn't make it into the upgrade guide. For the reasons described above, we don't have a good way to decide which log changes belong in the upgrade guide and which are not useful there. I hope we can come to a good decision in #4159 to address the root of this issue and make log statements more reliable for production monitoring.

@timbussmann timbussmann removed their assignment Sep 26, 2016