Endpoints with no heartbeat plugin should not be automatically added to monitoring groups #726
Comments
@pablocastilla issue moved here |
@Particular/servicecontrol-maintainers not sure how we are going to fix this one! |
So the issue is that we are treating timeouts as a separate endpoint? From SC's perspective the rules are easy. If we see an endpoint we've never seen before, we enlist it for monitoring. Once enlisted, it's either alive or dead. Maybe we need a more nuanced state machine so we can hear about an endpoint and display it on screen without declaring it dead. The simple workaround is to disable monitoring of that endpoint in Pulse. That could be painful if you have a lot of endpoints with Timeout Manager installed though. |
How about a special filter in service pulse? I just don't want to see it in |
@mikeminutillo my issue with our current solution is that ServicePulse should not mark an endpoint "red" if the user does not have heartbeats installed. Heartbeats is an optional plugin. IMO the endpoint should only be red if it has heartbeats on and we stop receiving them. For endpoints that do not have heartbeats installed, I would disable that capability. |
I agree, but in my case those endpoints have the heartbeat dll installed |
@johnsimons @mikeminutillo - Are timeout messages control messages? If they are control messages, can we not filter them out in SC? I.e. if we can determine that the error messages are arriving from satellite queues, why should those queues get added to the Known endpoints list? |
Not all messages are control messages, but control or not, something is failing and we need to report on it.
I don't think we can determine if a message is from a satellite queue or not. |
@Particular/servicecontrol-maintainers I still think the way to address this issue is #726 (comment), thoughts ? |
@johnsimons Would that imply a third category/tab on
|
@SzymonPobiega that sounds right. |
@johnsimons can you elaborate a bit on your implementation? From what I understand, at the moment when an endpoint is discovered and doesn't send heartbeats, we assume that it has failed. The reason is that every new endpoint is marked as monitored. What you are describing, John, would require us to add endpoints as unmonitored by default. Is this what you suggest? |
@WojcikMike I believe that's the proposal. Add endpoints as "unmonitored". Move to "active" when HB is discovered. Move to "inactive" when there is no HB any more and the endpoint has previously been in "active" state |
AFAIK there is no way to recognize that an endpoint doesn't have HB. We mark an endpoint as 'failed' when the HB doesn't show up, which means that the endpoint is not working. If you don't want to see that endpoint as failed, you mark it as unmonitored. If we unmonitor every endpoint that doesn't send HB, then the HB feature is useless. Unless I misunderstood something, or some feature works differently from how I think it works. |
Hmmm well, I would say that if an endpoint never ever sent us a HB then it is unmonitored. We only mark an endpoint as down/inactive if we know it previously has sent a HB and is not sending them any more. |
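To make the proposal above concrete, here is a rough sketch of the three states as a tiny state machine. It is written in Python purely for illustration and is not ServiceControl code; the names `State` and `next_state` are hypothetical, and the assumption is that each endpoint is re-evaluated once per check interval.

```python
from enum import Enum


class State(Enum):
    UNMONITORED = "unmonitored"  # discovered, but never sent a heartbeat
    ACTIVE = "active"            # heartbeats are arriving
    INACTIVE = "inactive"        # used to send heartbeats, has stopped


def next_state(state: State, heartbeat_received: bool) -> State:
    """Advance one check interval for a single endpoint (illustrative only)."""
    if heartbeat_received:
        # Any heartbeat moves the endpoint to "active", whatever it was before.
        return State.ACTIVE
    if state is State.ACTIVE:
        # Only an endpoint that was previously beating can go "inactive".
        return State.INACTIVE
    # No heartbeat and never active (or already inactive): nothing changes.
    return state
```

With this shape, a satellite queue that is only ever discovered through a failed message stays "unmonitored" and never turns red.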
+1 to this :) |
That is pretty much it 👍 There is a catch, though: if a user decides to uninstall heartbeats from a previously beating endpoint, there would be no way to move that endpoint from "inactive" to "unmonitored". To be honest, I am not sure we need to support that. Thoughts? |
Maybe in that case the user could disable it manually in the configuration |
@SzymonPobiega any other questions? Is it OK if I put your face against it? |
@johnsimons I think enough for start. I self-assigned it. |
Well, it turns out it works as expected. I did a repro where I created my own satellite "FaultySatellite" that throws exceptions. I also had a handler that throws exceptions as a baseline. I ran the app twice and generated four failed messages: two in the satellite and two in the handler. I opened SP and I can see only one "active" endpoint. The satellite does not show up as an "inactive" endpoint. In the failed message list, the message that failed in the satellite shows the proper endpoint name "SatelliteFailureGenerator" (as in the "ProcessingEndpoint" header). The "FailedQ" header points correctly to the satellite queue. Am I missing something @johnsimons & @pablocastilla ? I used a V6 endpoint to generate the failed messages. |
@SzymonPobiega |
@johnsimons yes, I had HB on and working because I verified the endpoint is in the "active" tab. I'll try V5. |
Check this and V5 does not add How about we add this header in a patch release of NSB 5 instead of changing the way SC/SP works? |
Even though that header quite possibly makes sense for satellites, I don't think we can just patch the core to fix the issue. |
This approach has a drawback when you first start an endpoint intending to use heartbeats, but they are never received by SC. By explicitly marking an endpoint as monitored vs unmonitored, we allow the system to mark the endpoint as red from the start. However, to even mark an endpoint as monitored vs unmonitored, it needs to be discovered (failed message, audit or heartbeat). We could change it so that by default discovered endpoints are not monitored, but the user can change that. |
@WojcikMike I am with @johnsimons on this. When I really really want to make sure newly deployed endpoints are monitored, I would double check it in SC and not rely on the red label showing me lack of heartbeat. |
@SzymonPobiega I hear you. However, I am always wary of magic processes and of having to work out later why my endpoint is not showing as red in SP. That said, I can live with this automation. |
@johnsimons speaking about a quick fix, users who are affected can go to configuration and disable monitoring of the offending satellite even now. |
@SzymonPobiega Agreed, but it is really not optimal. I will mark this as an improvement. So is there anything else that prevents us from proceeding? |
@johnsimons let me validate my plan of attack here before I jump into the code: Assumptions
PoA
@WojcikMike @johnsimons does it sound good? There are no changes to SP here. I don't think we need an "unmonitored" tab in Endpoints Overview since the list of all the endpoints can be accessed via Configuration. |
@SzymonPobiega sounds like a plan |
@SzymonPobiega sounds good.
The consequence is that you will not be able to mark an endpoint that sends heartbeats as inactive. However, I am struggling to think of circumstances in which that would be needed. |
I made good progress on that in #838. Can you guys take a look? |
@SzymonPobiega how is this coming along ? |
@johnsimons from coding perspective this is done. There was one question you asked the maintainer group but nobody answered yet... |
As this code changes the behavior, should we make a doco PR before we close this? @SzymonPobiega @johnsimons |
Here's a doco pull @WojcikMike Particular/docs.particular.net#2279. Please review. And good catch! Thanks! |
Doco updated. |
Monitoring is only activated if heartbeats are ON or the user explicitly turns it ON.
This ensures that satellite queues are not monitored by default, so they no longer produce false positives.
To find out more about monitoring, see https://docs.particular.net/servicepulse/intro-endpoints-heartbeats
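As a rough illustration of the rule stated above, the decision could be expressed like this. This is a sketch in Python, not the actual ServiceControl implementation; the `Endpoint` type, its field names, and the example endpoint names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    name: str
    heartbeat_seen: bool = False  # assumption: set once any heartbeat has arrived
    user_enabled: bool = False    # assumption: set when the user turns monitoring ON


def is_monitored(endpoint: Endpoint) -> bool:
    """Monitoring is on only if heartbeats were seen or the user enabled it."""
    return endpoint.heartbeat_seen or endpoint.user_enabled


# A satellite queue discovered only through a failed message stays unmonitored.
satellite = Endpoint(name="Sales.Timeouts")
# An endpoint with the heartbeat plugin is monitored automatically.
sales = Endpoint(name="Sales", heartbeat_seen=True)
# An endpoint without heartbeats can still be opted in by the user.
billing = Endpoint(name="Billing", user_enabled=True)
```

Under these assumptions, only `sales` and `billing` can ever be reported as failing, while the satellite queue is listed but never flagged.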
This issue was originally raised in Particular/ServicePulse#340