Endpoints with no heartbeat plugin should not be automatically added to monitoring groups #726

johnsimons · 2016-05-13T06:03:16Z

Monitoring is only activated if heartbeats are ON or the user explicitly turns it ON.
This ensures that satellite queues are not monitored by default, which would otherwise produce false positives.

To find out more about monitoring, see https://docs.particular.net/servicepulse/intro-endpoints-heartbeats

This issue was originally raised in Particular/ServicePulse#340

johnsimons · 2016-05-13T06:03:35Z

@pablocastilla issue moved here

johnsimons · 2016-05-13T06:04:32Z

@Particular/servicecontrol-maintainers not sure how we are going to fix this one!

mikeminutillo · 2016-05-13T08:21:06Z

So the issue is that we are treating timeouts as a separate endpoint?

From SC's perspective the rules are easy. If we see an endpoint we've never seen before, we enlist it for monitoring. Once enlisted it's either alive or dead. Maybe we need a more nuanced state machine so we can hear about an endpoint and display it on screen without declaring it dead.

The simple workaround is to disable monitoring of that endpoint in Pulse. That could be painful if you have a lot of endpoints with Timeout Manager installed though.

pablocastilla · 2016-05-13T14:35:50Z

How about a special filter in ServicePulse? I just don't want to see it in the web UI.

johnsimons · 2016-05-15T23:50:29Z

@mikeminutillo my issue with our current solution is that ServicePulse should not be making an endpoint "red" if a user does not have heartbeats installed. Heartbeats is an optional plugin. IMO the endpoint should only be red if it has heartbeats on and we stop receiving the heartbeat. For endpoints that do not have heartbeats installed, I would disable that capability.

cc @pablocastilla

pablocastilla · 2016-05-16T05:37:02Z

I agree, but in my case those endpoints have the heartbeat dll installed

indualagarsamy · 2016-05-16T15:45:58Z

@johnsimons @mikeminutillo - Are timeout messages control messages? If they are control messages, can we not filter them in SC? I.e. if we can determine that the error messages are arriving from satellite queues, why should those queues get added to the Known endpoints list?

johnsimons · 2016-08-18T01:08:39Z

> Are timeout messages, control messages? If they are control messages, can we not filter from SC?

Not all messages are control messages, but control or not, something is failing and we need to report on it.

> if we can make a determination that the error messages are arriving from satellite queues

I don't think we can determine if a message is from a satellite queue or not.

johnsimons · 2016-08-18T01:09:47Z

@Particular/servicecontrol-maintainers I still think the way to address this issue is #726 (comment), thoughts ?

SzymonPobiega · 2016-10-26T12:21:14Z

@johnsimons Would that imply a third category/tab on the http://localhost:9090/#/endpoints page:

- Inactive endpoints
- Active endpoints
- Not monitored endpoints

johnsimons · 2016-10-26T21:58:13Z

@SzymonPobiega that sounds right.
We need to involve @Particular/servicepulse-maintainers to make sure they are ok with this change.

WojcikMike · 2016-10-27T10:25:58Z

@johnsimons can you elaborate a bit on your implementation? From what I understand, at the moment when an endpoint is discovered and doesn't send heartbeats, we assume that it failed. The reason for that is that every new endpoint is marked as monitored. What you are describing, John, would require us to add endpoints as unmonitored by default. Is this what you suggest?

SzymonPobiega · 2016-10-27T10:34:02Z

@WojcikMike I believe that's the proposal. Add endpoints as "unmonitored". Move to "active" when HB is discovered. Move to "inactive" when there is no HB any more and the endpoint has previously been in "active" state

WojcikMike · 2016-10-27T12:45:33Z

AFAIK there is no way to recognize that an endpoint doesn't have HB. We mark an endpoint as 'failed' when the HB doesn't show up, which means that the endpoint is not working. If you don't want to see that endpoint as failed, you mark it as unmonitored.

If we unmonitor every endpoint that doesn't send HB then the HB feature is useless. Unless I misunderstood something, or some feature works in a different way than I think it works.

SzymonPobiega · 2016-10-27T12:55:54Z

Hmmm well, I would say that if an endpoint never ever sent us a HB then it is unmonitored. We only mark an endpoint as down/inactive if we know it previously has sent a HB and is not sending them any more.
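
The rule described here can be sketched as a small classification function. This is only an illustration of the proposal in this thread, not ServiceControl code; `EndpointState`, `classify`, and both parameter names are hypothetical.

```python
from enum import Enum


class EndpointState(Enum):
    UNMONITORED = "unmonitored"
    ACTIVE = "active"
    INACTIVE = "inactive"


def classify(ever_sent_heartbeat: bool, heartbeat_is_current: bool) -> EndpointState:
    """Classify an endpoint using the rule proposed in this thread.

    Hypothetical inputs: whether any heartbeat was ever received, and
    whether heartbeats are still arriving right now.
    """
    if not ever_sent_heartbeat:
        # Never heard a heartbeat: we cannot tell whether the plugin is
        # installed, so the endpoint is not reported as down.
        return EndpointState.UNMONITORED
    if heartbeat_is_current:
        return EndpointState.ACTIVE
    # Heartbeats were seen before and have stopped: only now is it "red".
    return EndpointState.INACTIVE
```

Under this rule a satellite queue that never sends a heartbeat stays in the unmonitored state instead of showing up as a failed endpoint.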

pablocastilla · 2016-10-27T15:44:11Z

+1 to this :)

johnsimons · 2016-10-27T23:30:21Z

@SzymonPobiega @WojcikMike

> Add endpoints as "unmonitored". Move to "active" when HB is discovered. Move to "inactive" when there is no HB any more and the endpoint has previously been in "active" state

That is pretty much it 👍
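This works because the heartbeat includes the endpoint name (https://github.com/Particular/ServiceControl.Plugin.Nsb5.Heartbeat/blob/9aaf6f2775382e7dde121c81105eb0db6e21dcf2/src/ServiceControl.Plugin.Nsb5.Heartbeat/Heartbeats.cs#L109), which equals the queue name, so for satellite queues, e.g. myendpoint.timeouts, it would not match and therefore it would be unmonitored.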
Now this is all theoretical, we need to validate this for all transport permutations and see if it works.
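
A rough sketch of the matching idea, with hypothetical function and sample names: heartbeats carry the endpoint name, so a queue such as myendpoint.timeouts never matches a heartbeat and stays unmonitored. It assumes an exact, case-sensitive comparison, which is exactly the part that would need validating per transport.

```python
def split_by_heartbeat(discovered_queues, heartbeat_endpoint_names):
    """Split discovered queue names into (monitored, unmonitored).

    A queue counts as monitored only if some heartbeat reported exactly
    that name. Assumption taken from the thread: the endpoint name in the
    heartbeat equals the endpoint's main queue name.
    """
    beating = set(heartbeat_endpoint_names)
    monitored = [q for q in discovered_queues if q in beating]
    unmonitored = [q for q in discovered_queues if q not in beating]
    return monitored, unmonitored


# Hypothetical data: one endpoint with the heartbeat plugin and its
# timeouts satellite queue, which never sends heartbeats of its own.
monitored, unmonitored = split_by_heartbeat(
    ["myendpoint", "myendpoint.timeouts"],
    ["myendpoint"],
)
```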

Also there is a catch, if a user decides to uninstall heartbeats from a previously beating endpoint there would be no way to move that endpoint from "inactive" to "unmonitored", but to be honest I am not sure if we need to support it, thoughts ?

pablocastilla · 2016-10-28T05:12:38Z

Maybe in that case the user could disable it manually in the configuration tab

johnsimons · 2016-10-31T05:29:48Z

@SzymonPobiega any other questions? Is it OK if I put your face against it?

SzymonPobiega · 2016-10-31T05:40:37Z

@johnsimons I think that's enough for a start. I self-assigned it.

SzymonPobiega · 2016-11-03T08:48:08Z

Well, it turns out it works as expected. I did a repro where I created my own satellite "FaultySatellite" that throws exceptions. I also had a handler that throws exceptions as a baseline. I ran the app twice and generated four failed messages: two in the satellite and two in the handler.

I opened SP and I can see only one endpoint as "active". The satellite does not show up as an "inactive" endpoint. On the failed message list, the message that failed in the satellite shows the proper endpoint name "SatelliteFailureGenerator" (as in the "ProcessingEndpoint" header). The "FailedQ" header points correctly to the satellite queue.

Am I missing something @johnsimons & @pablocastilla ? I used V6 endpoint to generate the failed messages.

johnsimons · 2016-11-03T10:20:19Z

@SzymonPobiega
I think we should try v5, it could be that "ProcessingEndpoint" header is not in v5 satellites ?
Also do you have heartbeats on ?

SzymonPobiega · 2016-11-03T13:16:10Z

@johnsimons yes, I had HB on and working because I verified the endpoint is in the "active" tab. I'll try V5.

SzymonPobiega · 2016-11-04T09:51:28Z

Checked this, and V5 does not add the NServiceBus.ProcessingEndpoint header to failed messages (both those coming from a satellite and those from a regular handler), only to audit messages. This is the reason the satellite is shown as a different endpoint.
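
To illustrate why the missing header matters, here is a minimal sketch that assumes the endpoint name is taken from the NServiceBus.ProcessingEndpoint header when present and otherwise derived from the failed queue address. The fallback logic, the `NServiceBus.FailedQ` key, the `queue@machine` address format, and the sample queue names are assumptions for illustration, not ServiceControl's actual implementation.

```python
from typing import Mapping, Optional

PROCESSING_ENDPOINT = "NServiceBus.ProcessingEndpoint"  # header named in this thread
FAILED_QUEUE = "NServiceBus.FailedQ"  # assumed full key of the "FailedQ" header


def resolve_endpoint_name(headers: Mapping[str, str]) -> Optional[str]:
    """Pick the endpoint a failed message is attributed to (hypothetical logic)."""
    name = headers.get(PROCESSING_ENDPOINT)
    if name:
        return name
    failed_queue = headers.get(FAILED_QUEUE)
    if failed_queue:
        # Assumed address format "queue@machine"; keep only the queue part.
        return failed_queue.split("@", 1)[0]
    return None


# Hypothetical headers for the same satellite failure on V6 and on V5.
v6_failure = {
    PROCESSING_ENDPOINT: "SatelliteFailureGenerator",
    FAILED_QUEUE: "SatelliteFailureGenerator.FaultySatellite@localhost",
}
v5_failure = {FAILED_QUEUE: "SatelliteFailureGenerator.FaultySatellite@localhost"}
```

With the header present the failure is attributed to the hosting endpoint; without it, the satellite queue name is all there is to go on, so it appears as a separate endpoint.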

How about we add this header in a patch release of NSB 5 instead of changing the way SC/SP works?

johnsimons · 2016-11-06T23:16:22Z

> How about we add this header in a patch release of NSB 5 instead of changing the way SC/SP works?

Even though it quite possibly makes sense for that header to be in satellites, I don't think we can just patch the core to fix the issue.
As it currently stands, users are already affected by this and there is no easy way for them to fix it. Even if we were to patch the core and the users updated to it, it would still be broken and showing as a failed endpoint in ServicePulse.
As I said before, the endpoint should only be red if an endpoint has heartbeats on and we stop receiving the heartbeat, that is IMO still the correct way to fix this issue.

WojcikMike · 2016-11-07T10:12:02Z

> As I said before, the endpoint should only be red if an endpoint has heartbeats on and we stop receiving the heartbeat, that is IMO still the correct way to fix this issue.

This approach has a drawback when you first start an endpoint and intend to use heartbeats, but they are never received by SC. By explicitly marking an endpoint as monitored vs unmonitored, we allow the system to mark the endpoint as red from the start. However, to even mark an endpoint as monitored vs unmonitored, it needs to be discovered (failed message, audit or heartbeat).

We could change it so that discovered endpoints are not monitored by default, but the user can change that.

SzymonPobiega · 2016-11-07T13:33:55Z

@WojcikMike I am with @johnsimons on this. When I really really want to make sure newly deployed endpoints are monitored, I would double check it in SC and not rely on the red label showing me lack of heartbeat.

WojcikMike · 2016-11-07T13:51:53Z

@SzymonPobiega I hear you, however I am always reluctant about magic processes that later leave me trying to understand why my endpoint is not showing as red in SP. However, I can live with this automation.

SzymonPobiega · 2016-11-07T14:03:57Z

@johnsimons speaking about a quick fix, users who are affected can go to configuration and disable monitoring of the offending satellite even now.

johnsimons · 2016-11-08T00:34:45Z

> users who are affected can go to configuration and disable monitoring of the offending satellite even now.

@SzymonPobiega Agree, but really not optimal, I will mark this as an improvement.

So is there anything else that prevents us from proceeding?

SzymonPobiega · 2016-11-08T07:28:17Z

@johnsimons let me validate my plan of attack here before I jump into the code:

Assumptions

- Currently KnownEndpoint sets its Monitored property to true in the default constructor, which means endpoints are monitored by default
- Users can opt out of monitoring via the Configuration tab
- Endpoints with Monitored set to false do not show up in the Endpoints Overview

PoA

- Change the behavior of KnownEndpoint to set Monitored to false by default, causing new endpoints that don't have the HB plugin installed to not show up in Endpoints Overview
- Validate that toggling the switch in Configuration makes the endpoint show up in Endpoints Overview as "inactive"
- Validate that if a HB message is received for an endpoint that is not set to be monitored, it automatically switches to "monitored" mode (as active) and when the HB is gone, it shows up as "inactive" and the red label is present.
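
The plan above could be modeled roughly as follows. This is a Python sketch for illustration only; the real KnownEndpoint is part of ServiceControl, and the field and function names here (`heartbeat_active`, `on_heartbeat`, `on_heartbeat_lost`, `endpoints_overview`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class KnownEndpoint:
    name: str
    monitored: bool = False  # proposed default; currently true per the assumptions
    heartbeat_active: bool = False

    def on_heartbeat(self) -> None:
        # Receiving a heartbeat switches the endpoint to monitored and active.
        self.monitored = True
        self.heartbeat_active = True

    def on_heartbeat_lost(self) -> None:
        # The endpoint stays monitored, so it now shows up as inactive.
        self.heartbeat_active = False


def endpoints_overview(endpoints):
    """Return (name, status) pairs for monitored endpoints only."""
    return [
        (e.name, "active" if e.heartbeat_active else "inactive")
        for e in endpoints
        if e.monitored
    ]
```

A newly discovered endpoint without heartbeats is absent from the overview, manually setting `monitored` to true makes it appear as "inactive", and a heartbeat moves it to "active" until the heartbeat is lost.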

@WojcikMike @johnsimons does it sound good? There are no changes to SP here. I don't think we need an "unmonitored" tab in Endpoints Overview since the list of all the endpoints can be accessed via Configuration.

johnsimons · 2016-11-08T07:45:55Z

@SzymonPobiega sounds like a plan

WojcikMike · 2016-11-08T09:02:14Z

@SzymonPobiega sounds good.

> Validate that if a HB message is received for an endpoint that is not set to be monitored, it automatically switches to "monitored" mode (as active) and when the HB is gone, it shows up as "inactive" and the red label is present.

This means that you will not be able to mark as inactive an endpoint that sends heartbeats. However, I am struggling to think of circumstances in which that would be needed.

SzymonPobiega · 2016-11-08T14:37:27Z

I made good progress on that in #838. Can you guys take a look?

johnsimons · 2016-11-22T02:49:10Z

@SzymonPobiega how is this coming along ?

SzymonPobiega · 2016-11-22T07:23:04Z

@johnsimons from a coding perspective this is done. There was one question you asked the maintainer group but nobody has answered yet...

WojcikMike · 2016-11-28T12:16:43Z

As this code change alters the behavior, should we make a doco PR before we close this? @SzymonPobiega @johnsimons

SzymonPobiega · 2016-11-28T12:46:07Z

Here's a doco pull @WojcikMike Particular/docs.particular.net#2279. Please review. And good catch! Thanks!

SzymonPobiega · 2016-11-28T13:48:12Z

Doco updated.

johnsimons added the Bug label May 13, 2016

SzymonPobiega self-assigned this Oct 31, 2016

johnsimons added Improvement and removed Bug labels Nov 8, 2016

SzymonPobiega added the State: In Progress label Nov 8, 2016

SzymonPobiega mentioned this issue Nov 8, 2016

Endpoints with no heartbeat plugin should not be automatically added to monitoring groups #838

Merged

johnsimons closed this as completed in #838 Nov 27, 2016

johnsimons removed the State: In Progress label Nov 27, 2016

johnsimons added this to the 1.28.0 milestone Nov 27, 2016

WojcikMike reopened this Nov 28, 2016

SzymonPobiega closed this as completed Nov 28, 2016

johnsimons changed the title ~~Timeouts and timeoutsdispatcher are registered as endpoints, but they don't generate heartbeats so always stay "in red"~~ Satellite queues are registered as normal endpoints, but they don't generate heartbeats so they are always "in red" Nov 30, 2016

johnsimons changed the title ~~Satellite queues are registered as normal endpoints, but they don't generate heartbeats so they are always "in red"~~ Satellite queues by default should have monitoring off Nov 30, 2016

johnsimons changed the title ~~Satellite queues by default should have monitoring off~~ Monitoring of endpoints is by default OFF Nov 30, 2016

johnsimons changed the title ~~Monitoring of endpoints is by default OFF~~ Endpoints with no heartbeat plugin should not be automatically added to monitoring groups Nov 30, 2016
