Runtime attributes contain values of the current state in last_... variables #6302
I will try to explain our use-case to make it easier to understand the rest of the issue.
For 2. we tried to implement the following condition in our notification script:
Currently this does not seem to be possible (at least not in the way we expected). The runtime attributes in https://www.icinga.com/docs/icinga2/latest/doc/09-object-types/#service have confusing names:
So during the Warning -> Ok-Change in 2. we have no way to access the timestamp of the last time that the service was ok.
So in my opinion the last_... variables should contain the timestamp of the last state change BEFORE the current state change. Or the
I hope my description is comprehensible. If not I will try to explain again ;)
(I guess the same applies to the host runtime attributes.)
Thank you for your super fast answer :)
As far as I understand, a notification with a filter that excludes the warning state would not allow me to send a recovery ONLY when the service changed like this: ok->critical->warning->ok, but not when it changed ok->warning->ok. Right?
In the second half of my text I also explain why I think that the current names of the variables are wrong or at least confusing. I would like some feedback on what you think about it.
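To make the desired behaviour concrete, here is a minimal Python sketch of the rule described above: a recovery should only be reported if the problem phase that just ended reached critical at some point. This is an illustration only, not Icinga 2 code; the function name `recoveries_to_notify` and the `threshold` parameter are hypothetical, and the numeric values follow the usual service state numbering (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN).

```python
# Illustration only: hypothetical helper, not part of Icinga 2.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def recoveries_to_notify(states, threshold=CRITICAL):
    """Return the indices of OK states that end a problem phase
    whose worst state reached at least `threshold`."""
    worst = OK  # worst state seen since the last OK
    result = []
    for index, state in enumerate(states):
        if state == OK:
            if worst >= threshold:
                result.append(index)
            worst = OK  # a new problem phase starts from scratch
        else:
            worst = max(worst, state)
    return result


# ok -> critical -> warning -> ok: recovery should be sent
print(recoveries_to_notify([OK, CRITICAL, WARNING, OK]))
# ok -> warning -> ok: no recovery should be sent
print(recoveries_to_notify([OK, WARNING, OK]))
```

The point of the sketch is that the decision needs information about the states passed through since the last OK, which the current state and the state immediately before it do not provide on their own.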
In that specific use case, service.duration_sec is not the metric we should use, although I think it is „correct“ in the sense of naming: it tells you how long the service has been in the state it is currently in.
So it isn’t really useful for this issue with the
After talking to @lippserd and @Crunsher and discussing different ideas and solutions, I think the most elegant solution would be to not change the current behaviour as it is the one that is technically correct.
Instead of altering the behaviour, I propose adding two values:
As the „state worseness“ is OK -> WARNING -> CRITICAL -> UNKNOWN, the value of
Once the state changes from OK to something else again,
See the attached graph for details of the state changes.
I named it
In pseudo code, it would need to look like this:
There are two open questions I can’t answer myself, so please give me feedback on that
I trust your expertise on this since you've talked and discussed in person during the OSMC hackathon.
In order to reliably determine SLA values (percentage of a timeframe, in which a monitored object was available) I am only interested in hard state changes, which are being populated by Icinga2 in the IDO table
Currently (because of
So in order to reduce the complexity of said availability function, it would be very helpful to know the value of the last (previous) hard state. Then I could just do a select on all hard state changes and could calculate the duration of OK time or NOK time based on the values of the previous hard states.
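As a rough sketch of that calculation: if every hard state change carries both the new and the previous hard state, the OK time in a period can be summed in a single pass. The tuple layout `(timestamp, new_hard_state, previous_hard_state)` and the function name `ok_seconds` are assumptions made for illustration; they do not mirror the actual IDO schema. The sketch also ignores downtimes and refuses periods without any event.

```python
# Illustration only: hypothetical data layout, not the IDO schema.
OK = 0


def ok_seconds(events, start, end):
    """Sum the seconds spent in hard state OK between `start` and `end`.

    `events` is a list of (timestamp, new_hard_state, previous_hard_state)
    tuples, sorted by timestamp, all lying within [start, end].
    """
    if not events:
        # Without an event the state has to come from somewhere else.
        raise ValueError("no hard state change in the chosen period")
    total = 0
    cursor = start
    for timestamp, _new_state, previous_state in events:
        # The time up to this change was spent in the previous hard state.
        if previous_state == OK:
            total += timestamp - cursor
        cursor = timestamp
    # The time after the last change is spent in its new hard state.
    if events[-1][1] == OK:
        total += end - cursor
    return total


changes = [(100, 2, 0), (160, 0, 2)]  # OK -> CRITICAL at 100, back to OK at 160
seconds_ok = ok_seconds(changes, start=0, end=200)
print(seconds_ok / 200)  # share of the period spent in OK
```

With the previous hard state available on each row, the interval before the first event in the period can be attributed without looking up older events.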
It's good to see this discussion going on. However, to me it seems like different use-cases are trying to tackle a slightly related issue from a completely different perspective.
@mj84: this alone will not help you. This is not enough. You'll lose events. You might have no event in the chosen period. You might be forced to mix in events from before or after the chosen period or even the current object state to get a meaningful result. Downtimes might come into play, they might be considered "legal" SLA violations. Downtimes could be nested. Someone asks you to calculate availability only for specific time periods (weekdays, 9 to 5). Another one wants to "fix" an SLA afterwards because of .
Believe me, for your use case, time dedicated to this issue will be wasted. I've been there before. With PostgreSQL, pick a cursor, jump from line to line, and do your math. Here is a working example for MySQL/MariaDB. It doesn't address every weird scenario, but it isn't that bad at all. It needs to be written in a different way for PostgreSQL, but you can possibly steal some ideas.
@Thomas-Gelf: Actually, my approach is more or less final and has been in production use in my environment for about 10 months, mostly generating monthly and weekly reports.
I have implemented my PL/pgSQL function after inspecting the
I have thought about not only adding the
The idea behind that would be that you can look at a state change and always determine the state after the change by looking at the
I have not considered the implications of this yet; once I have, I’ll open an issue to discuss them. But as far as I understand your use case, this would then also solve your issue.
But for this issue, I’ll limit myself to the introduction of the
I think I need some help with how to make it work for hosts, too. Maybe I’m missing something, or I can’t see the wood for the trees any more.
The change in my feature branch is working for services, but there’s something wrong when I try to apply this to hosts.
The current state (timestamps are not updated yet) can be compared to the master easily.
As you see, I added the reset to OK on line 211 - but this does a hard coded reset to
Will this also work for Hosts? I never really looked into host states as they always „just worked“ for me. My understanding is as follows:
The host derives its state from the check_command it uses. Therefore, this checkable is treated as a service that never appears to the user because the host wraps this „pseudo-service“.
Am I correct with this or did I get something fundamentally wrong?
I have tested some more and it looks like defining my test check as
Can somebody please have a look at or test my current implementation and tell me whether I just failed to set up my test environment correctly or whether there really is an issue?