Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Endpoints with no heartbeat plugin should not be automatically added to monitoring groups #726

Closed
johnsimons opened this issue May 13, 2016 · 39 comments
Assignees
Milestone

Comments

@johnsimons
Copy link
Member

johnsimons commented May 13, 2016

Monitoring is only activated if heartbeats are ON or the user explicitly turns it ON.
This ensures that satellite queues are not monitored by default hence producing false positives.

To find out more about monitoring, see https://docs.particular.net/servicepulse/intro-endpoints-heartbeats

This issue was originally raised in Particular/ServicePulse#340

@johnsimons johnsimons added the Bug label May 13, 2016
@johnsimons
Copy link
Member Author

@pablocastilla issue moved here

@johnsimons
Copy link
Member Author

@Particular/servicecontrol-maintainers not sure how we are going to fix this one!

@mikeminutillo
Copy link
Member

So the issue is that we are treating timeouts as a separate endpoint?

From SCs perspective the rules are easy. If we see an endpoint we've never seen before we enlist it for monitoring. Once enlisted it's either alive or dead. Maybe we need a more nuanced state machine so we can hear about an endpoint and display it on screen without declaring it dead.

The simple workaround is to disable monitoring of that endpoint in Pulse. That could be painful if you have a lot of endpoints with Timeout Manager installed though.

@pablocastilla
Copy link

How about a special filter in service pulse? I just don't want to see it in
the web

El vie., 13 may. 2016 10:21, Mike Minutillo notifications@github.com
escribió:

So the issue is that we are treating timeouts as a separate endpoint?

From SCs perspective the rules are easy. If we see an endpoint we've never
seen before we enlist it for monitoring. Once enlisted it's either alive or
dead. Maybe we need a more nuanced state machine so we can hear about an
endpoint and display it on screen without declaring it dead.

The simple workaround is to disable monitoring of that endpoint in Pulse.
That could be painful if you have a lot of endpoints with Timeout Manager
installed though.


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub
#726 (comment)

@johnsimons
Copy link
Member Author

@mikeminutillo my issue with our current solution is that ServicePulse should not be making an endpoint "red" if a user does not have heartbeats installed, heartbeats is an optional plugin, IMO the endpoint should only be red if an endpoint has heartbeats on and we stop receiving the heartbeat. For endpoints that do not have heartbeats installed, I would disable that capability.

cc @pablocastilla

@pablocastilla
Copy link

I agree, but in my case those endpoints have the heartbeat dll installed

El lun., 16 may. 2016 1:50, John Simons notifications@github.com escribió:

@mikeminutillo https://github.com/mikeminutillo my issue with our
current solution is that ServicePulse should not be making an endpoint
"red" if a user does not have heartbeats installed, heartbeats is an
optional plugin, IMO the endpoint should only be red if an endpoint has
heartbeats on and we stop receiving the heartbeat. For endpoints that do
not have heartbeats installed, I would disable that capability.

cc @pablocastilla https://github.com/pablocastilla


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub
#726 (comment)

@indualagarsamy
Copy link
Contributor

@johnsimons @mikeminutillo - Are timeout messages, control messages? If they are control messages, can we not filter from SC? i.e. if we can make a determination that the error messages are arriving from satellite queues, why should it get added to the Known endpoints list?

@johnsimons
Copy link
Member Author

Are timeout messages, control messages? If they are control messages, can we not filter from SC?

Not all messages are control messages, but control or not, something is failing and we need to report on it.

if we can make a determination that the error messages are arriving from satellite queues

I don't think we can determine if a message is from a satellite queue or not.

@johnsimons
Copy link
Member Author

@Particular/servicecontrol-maintainers I still think the way to address this issue is #726 (comment), thoughts ?

@SzymonPobiega
Copy link
Member

@johnsimons Would that imply a third category/tab on http://localhost:9090/#/endpoints page:

  • Inactive endpoints
  • Active endpoints
  • Not monitored endpoints

@johnsimons
Copy link
Member Author

@SzymonPobiega that sounds right.
We need to involve @Particular/servicepulse-maintainers to make sure they are ok with this change.

@WojcikMike
Copy link
Contributor

@johnsimons can you elaborate a bit on your implementation? From what I understand at the moment when the endpoint is discovered and don't send heartbeats we assume that it failed. THe reason for that is that every new endpoint is marked as monitored. What you are describing John would require us to by default add endpoints as unmonitored. Is this what you suggest?

@SzymonPobiega
Copy link
Member

SzymonPobiega commented Oct 27, 2016

@WojcikMike I believe that's the proposal. Add endpoints as "unmonitored". Move to "active" when HB is discovered. Move to "inactive" when there is no HB any more and the endpoint has previously been in "active" state

@WojcikMike
Copy link
Contributor

AFAIK there is no way to recognize that endpoint don't have HB. We mark endpoint as 'failed' when the HB don't show, which means that endpoint is not working. If you don't want to see that endpoint as failed you mark it as unmonitored.

If we unmonitor every endpoint that don't send HB then HB feature is useless. Unless I misunderstood something or some feature works in a different way that I think it works

@SzymonPobiega
Copy link
Member

Hmmm well, I would say that if an endpoint never ever sent us a HB then it is unmonitored. We only mark an endpoint as down/inactive if we know it previously has sent a HB and is not sending them any more.

@pablocastilla
Copy link

+1 to this :)

El jue., 27 oct. 2016 14:55, Szymon Pobiega notifications@github.com
escribió:

Hmmm well, I would say that if an endpoint never ever sent us a HB then it
is unmonitored. We only mark an endpoint as down/inactive if we know it
previously has sent a HB and is not sending them any more.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#726 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AHxv4XNmToE1K5YpZu0-z3mRL1KOpGXPks5q4J9cgaJpZM4IduOe
.

@johnsimons
Copy link
Member Author

@SzymonPobiega @WojcikMike

Add endpoints as "unmonitored". Move to "active" when HB is discovered. Move to "inactive" when there is no HB any more and the endpoint has previously been in "active" state

That is pretty much it 👍
This works because the heartbeat includes the endpoint name which equals the queue name so for satellite queues eg myendpoint.timeouts it would not match and therefore it would be unmonitored.
Now this is all theoretical, we need to validate this for all transport permutations and see if it works.

Also there is a catch, if a user decides to uninstall heartbeats from a previously beating endpoint there would be no way to move that endpoint from "inactive" to "unmonitored", but to be honest I am not sure if we need to support it, thoughts ?

@pablocastilla
Copy link

Maybe in that case the user could disable it manually in the configuration
tab

El vie., 28 oct. 2016 1:30, John Simons notifications@github.com escribió:

@SzymonPobiega https://github.com/SzymonPobiega @WojcikMike
https://github.com/WojcikMike

Add endpoints as "unmonitored". Move to "active" when HB is discovered.
Move to "inactive" when there is no HB any more and the endpoint has
previously been in "active" state

That is pretty much it 👍
This works because the heartbeat includes the endpoint name
https://github.com/Particular/ServiceControl.Plugin.Nsb5.Heartbeat/blob/9aaf6f2775382e7dde121c81105eb0db6e21dcf2/src/ServiceControl.Plugin.Nsb5.Heartbeat/Heartbeats.cs#L109
which equals the queue name so for satellite queues eg myendpoint.timeouts
it would not match and therefore it would be unmonitored.
Now this is all theoretical, we need to validate this for all transport
permutations and see if it works.

Also there is a catch, if a user decides to uninstall heartbeats from a
previously beating endpoint there would be no way to move that endpoint
from "inactive" to "unmonitored", but to be honest I am not sure if we need
to support it, thoughts ?


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#726 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AHxv4c0s17fy5wjiDPK0x03zUJepl7ymks5q4TQOgaJpZM4IduOe
.

@johnsimons
Copy link
Member Author

@SzymonPobiega any other questions ? is it ok if i put your face against it ?

@SzymonPobiega SzymonPobiega self-assigned this Oct 31, 2016
@SzymonPobiega
Copy link
Member

@johnsimons I think enough for start. I self-assigned it.

@SzymonPobiega
Copy link
Member

Well, it turns out it works as expected. I did a repro where I created my own satellite "FaultySatellite" that throws exceptions. I also had a handler that throws exceptions as a base line. I run the app twice and generated four failed messages: two in the satellite and two in the handler.

I opened SP and I can see only one endpoint "active". The satellite does not show as "inactive" endpoint. On the failed message list the message that failed in the satellite shows proper endpoint name "SatelliteFailureGenerator" (as in "ProcessingEndpoint" header). The "FailedQ" header points correctly to the satellite queue.

Am I missing something @johnsimons & @pablocastilla ? I used V6 endpoint to generate the failed messages.

@johnsimons
Copy link
Member Author

@SzymonPobiega
I think we should try v5, it could be that "ProcessingEndpoint" header is not in v5 satellites ?
Also do you have heartbeats on ?

@SzymonPobiega
Copy link
Member

@johnsimons yes, I had HB on and working because I verified the endpoint is in the "active" tab. I'll try V5.

@SzymonPobiega
Copy link
Member

Check this and V5 does not add NServiceBus.ProcessingEndpoint header to failed messages (both coming from satellite and regular handler), only to audit messages. This is the reason the satellite is shown as a different endpoint.

How about we add this header in a patch release of NSB 5 instead of changing the way SC/SP works?

@johnsimons
Copy link
Member Author

How about we add this header in a patch release of NSB 5 instead of changing the way SC/SP works?

Even though that header quite possible makes sense to be in satellites, I don't think we can just patch the core to fix the issue.
As it currently stands users are already affected by this and there is no easy way for them to fix it, even if we were to patch the core and the users update to it, it would still be broken and showing as a filed endpoint in ServicePulse.
As I said before, the endpoint should only be red if an endpoint has heartbeats on and we stop receiving the heartbeat, that is IMO still the correct way to fix this issue.

@WojcikMike
Copy link
Contributor

As I said before, the endpoint should only be red if an endpoint has heartbeats on and we stop receiving the heartbeat, that is IMO still the correct way to fix this issue.

This approach has a drawback when you first try to start endpoint and you have the intention to use heartbeats but they are never recieved by SC. By explicit marking endpoint as monitored vs unmonitored we allow system to mark endpoint as red from the start. However to even mark endpoint as monitored vs unmonitored it needs to be discovered (failed message, audit or heartbeat).

We could change that by default discovered endpoints are not monitored but user can change it.

@SzymonPobiega
Copy link
Member

@WojcikMike I am with @johnsimons on this. When I really really want to make sure newly deployed endpoints are monitored, I would double check it in SC and not rely on the red label showing me lack of heartbeat.

@WojcikMike
Copy link
Contributor

@SzymonPobiega I hear you, however I am always reluctant with magic processes and later on understanding why my endpoint is not showing as red in SP. However I can live with this automation.

@SzymonPobiega
Copy link
Member

@johnsimons speaking about a quick fix, users who are affected can go to configuration and disable monitoring of the offending satellite even now.

@johnsimons johnsimons added Improvement and removed Bug labels Nov 8, 2016
@johnsimons
Copy link
Member Author

users who are affected can go to configuration and disable monitoring of the offending satellite even now.

@SzymonPobiega Agree, but really not optimal, I will mark this as an improvement.

So is there anything else that prevent us from proceeding ?

@SzymonPobiega
Copy link
Member

SzymonPobiega commented Nov 8, 2016

@johnsimons let me validate my plan of attack here before I jump into the code:

Assumptions

  1. Currently KnownEndpoint sets its Monitored property to true in the default constructor which means endpoints are monitored by default
  2. Users can opt-out from monitoring via the Configuration tab
  3. Endpoints with Monitored set to false do not show up in the Endpoints Overview

PoA

  1. Change the behavior of KnownEndpoint to set Monitored to false by default causing new endpoints that don't have HB plugin installed to not show up in Endpoints Overview
  2. Validate that toggling the switch in Configuration makes the endpoint show up in Endpoints Overview as "inactive"
  3. Validate that if a HB message is received for an endpoint that is not set to be monitored, it automatically switches to "monitored" mode (as active) and when the HB is gone, it shows up as "inactive" and the red label is present.

@WojcikMike @johnsimons does it sound good? There are no changes to SP here. I don't think we need an "unmonitored" tab in Endpoints Overview since the list of all the endpoints can be accessed via Configuration.

@johnsimons
Copy link
Member Author

@SzymonPobiega sounds like a plan

@WojcikMike
Copy link
Contributor

@SzymonPobiega sounds good.

Validate that if a HB message is received for an endpoint that is not set to be monitored, it automatically switches to "monitored" mode (as active) and when the HB is gone, it shows up as "inactive" and the red label is present.

This will cause that you will not be able to mark as inactive endpoint that sends heartbeats. However I am struggling in what circumstances that would be needed.

@SzymonPobiega
Copy link
Member

I made a good progress on that in #838 . Can you guys take a look?

@johnsimons
Copy link
Member Author

@SzymonPobiega how is this coming along ?

@SzymonPobiega
Copy link
Member

@johnsimons from coding perspective this is done. There was one question you asked the maintainer group but nobody answered yet...

@WojcikMike
Copy link
Contributor

WojcikMike commented Nov 28, 2016

As this piece of code change the behavior should we make a doco PR before we close this? @SzymonPobiega @johnsimons

@WojcikMike WojcikMike reopened this Nov 28, 2016
@SzymonPobiega
Copy link
Member

Here's a doco pull @WojcikMike Particular/docs.particular.net#2279. Please review. And good catch! Thanks!

@SzymonPobiega
Copy link
Member

Doco updated.

@johnsimons johnsimons changed the title Timeouts and timeoutsdispatcher are registered as endpoints, but they don't generate heartbeats so always stay "in red" Satellite queues are registered as normal endpoints, but they don't generate heartbeats so they are always "in red" Nov 30, 2016
@johnsimons johnsimons changed the title Satellite queues are registered as normal endpoints, but they don't generate heartbeats so they are always "in red" Satellite queues by default should have monitoring off Nov 30, 2016
@johnsimons johnsimons changed the title Satellite queues by default should have monitoring off Monitoring of endpoints is by default OFF Nov 30, 2016
@johnsimons johnsimons changed the title Monitoring of endpoints is by default OFF Endpoints with no heartbeat plugin should not be automatically added to monitoring groups Nov 30, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants