Runtime attributes contain values of the current state in last_... variables #6302
Comments
I've only read the requirements, which leads to the notification filter excluding the What's wrong with that attempt and why bother with your own state history logic later on? |
Thank you for your super fast answer :) As far as I understand, a notification with a filter that excludes the warning state would not allow me to ONLY send a recovery when the service changed like this: ok->critical->warning->ok, but not when it changed ok->warning->ok. Right? In the second half of my text I also explain why I think that the current names of the variables are wrong or at least confusing. I would like some feedback on what you think about it. |
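The distinction asked about here can be pictured with a small state tracker. This is only a sketch of the requirement, not Icinga 2 code: the class `RecoveryTracker`, its `update` method and the lowercase state strings are hypothetical names chosen for illustration.

```python
class RecoveryTracker:
    """Remembers whether CRITICAL was reached since the last OK (hypothetical helper)."""

    def __init__(self) -> None:
        self.seen_critical = False

    def update(self, new_state: str) -> bool:
        """Feed one state change; return True if a recovery should be sent."""
        if new_state == "ok":
            send = self.seen_critical   # recover only if CRITICAL was on the path
            self.seen_critical = False  # start a fresh problem episode
            return send
        if new_state == "critical":
            self.seen_critical = True
        return False


def recoveries(path):
    """Return, per state change in `path`, whether a recovery would be sent."""
    tracker = RecoveryTracker()
    return [tracker.update(state) for state in path]


# ok->critical->warning->ok: recovery is sent on the final change.
print(recoveries(["critical", "warning", "ok"]))
# ok->warning->ok: no recovery is sent.
print(recoveries(["warning", "ok"]))
```

A plain state filter cannot express this, because the decision depends on the path taken since the last OK and not only on the previous state.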
Any thoughts about my last message? :) |
Sorry, this is one of those issues where I need time, pen and paper. I'm waiting for others to share their input meanwhile. |
No need to apologize, I will wait. Thanks for the confirmation that this issue is not forgotten. :) |
Might be a candidate for the OSMC hackathon. |
In that specific use case, service.duration_sec is not the metric we should use; I consider it „correct“ in the sense of naming. It tells you how long the service has been in the state it is currently in. So it isn’t really useful for this issue with the |
After talking to @lippserd and @Crunsher and discussing different ideas and solutions, I think the most elegant solution would be to not change the current behaviour as it is the one that is technically correct. Instead of altering the behaviour, I propose adding two values:
As the „state worseness“ is OK -> WARNING -> CRITICAL -> UNKNOWN, the value of Once the state changes from OK to something else again, See the attached graph for details of the state changes. I named it In pseudo code, it would need to look like this:
There are two open questions I can’t answer myself, so please give me feedback on that
@dnsmichi @Crunsher could you look at this and give me feedback, please? |
Nah, that's not really helpful information. Checks may randomly fail once in a blue moon and we don't care about that.
We certainly want this for icingadb ^_^ |
Thanks for the feedback!
I’ll start implementing this next week.
|
I trust your expertise on this since you've talked and discussed in person during the OSMC hackathon. |
Alright, as suggested by @dnsmichi in #5533, I'm gonna explain my reporting situation a bit further. In order to reliably determine SLA values (percentage of a timeframe, in which a monitored object was available) I am only interested in hard state changes, which are being populated by Icinga2 in the IDO table Currently (because of So in order to reduce the complexity of said availability function, it would be very helpful to know the value of the last (previous) hard state. Then I could just do a select on all hard state changes and could calculate the duration of OK time or NOK time based on the values of the previous hard states. |
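To illustrate why the previous hard state would simplify such a calculation, here is a minimal sketch in Python rather than PL/pgSQL. It assumes each hard state change record carries both the new and the previous hard state; the `HardStateChange` type, its field names and the `availability` function are hypothetical, and only state 0 (OK) is counted as available. Downtimes and time periods are ignored.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class HardStateChange:
    timestamp: float     # seconds since epoch
    state: int           # new hard state, 0 = OK
    previous_state: int  # hard state before the change (hypothetical column)


def availability(changes: Iterable[HardStateChange], start: float, end: float) -> float:
    """Return the fraction of [start, end] spent in the OK hard state."""
    if end <= start:
        raise ValueError("end must be after start")
    events = sorted(
        (c for c in changes if start <= c.timestamp <= end),
        key=lambda c: c.timestamp,
    )
    if not events:
        # Without any event in the window the state cannot be derived from this data.
        raise ValueError("no hard state change inside the window")
    ok_time = 0.0
    cursor = start
    for change in events:
        # The interval that ends at this change was spent in previous_state.
        if change.previous_state == 0:
            ok_time += change.timestamp - cursor
        cursor = change.timestamp
    # The tail of the window is spent in the state of the last change.
    if events[-1].state == 0:
        ok_time += end - cursor
    return ok_time / (end - start)


changes = [
    HardStateChange(timestamp=20.0, state=2, previous_state=0),
    HardStateChange(timestamp=30.0, state=0, previous_state=2),
]
print(availability(changes, start=0.0, end=100.0))
```

With the previous state stored on each row, the interval before the first event in the window can be attributed without looking up earlier rows.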
It's good to see this discussion going on. However, to me it seems like different use-cases are trying to tackle a slightly related issue from a completely different perspective. @mj84: this alone will not help you. This is not enough. You'll lose events. You might have no event in the chosen period. You might be forced to mix in events from before or after the chosen period or even the current object state to get a meaningful result. Downtimes might come into play, they might be considered "legal" SLA violations. Downtimes could be nested. Someone asks you to calculate availability only for specific time periods (weekdays, 9 to 5). Another one wants to "fix" an SLA afterwards because of . Believe me, for your use-case time dedicated to this issue will be wasted. I've been there before. With PostgreSQL pick a cursor, jump from line to line, do your math. Here is a working example for MySQL/MariaDB. It doesn't address every weird scenario, but it isn't that bad at all. It needs to be written in a different way for PostgreSQL, but you can eventually steal some ideas. Cheers, |
@Thomas-Gelf: Actually, my approach is kind of finalized and is being used productively in my environment for about 10 months with mostly monthly and weekly reports being generated. I have implemented my PL/pgSQL function after inspecting the Cheers |
@mj84 I see your point, but I agree with @Thomas-Gelf here. While your current approach works for you, I consider it to be too specific to make it into a general purpose solution. I have thought about not only adding the The idea behind that would be that you can look at a state change and always determine the state after the change by looking at the I have not considered the implications of this yet and when I finished that, I’ll open an issue to discuss them. But as far as I understand your use case, this would then also solve your issue. But for this issue, I’ll limit myself to the introduction of the |
@mauricemeyer you got any update on the implementation state? I would take that task, if you don't have anything laying around yet. |
I have a work in progress implementation with loads of debug output in my fork.
Work on it will continue on Monday as I’m currently on vacation.
|
I think I need some help with how to make it work for hosts, too. Maybe I’m missing something or I can’t see the wood for the trees any more. The change in my feature branch is working for services, but there’s something wrong when I try to apply this to hosts. The current state (timestamps are not updated yet) can be compared to the master easily. As you see, I added the reset to OK on line 211 - but this does a hard coded reset to Will this also work for Hosts? I never really looked into host states as they always „just worked“ for me. My understanding is as follows: The host derives its state from the check_command it uses. Therefore, this checkable is treated as a service that never appears to the user because the host wraps this „pseudo-service“. Am I correct with this or did I get something fundamentally wrong? |
Yes your assumption is correct, the host states are just wrapped service states. That means resetting the state to |
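The wrapping confirmed here can be pictured as a mapping from service-style states onto host states. A minimal sketch, assuming the Icinga 2 convention that OK and WARNING count as UP while CRITICAL and UNKNOWN count as DOWN; the dictionary and function names are hypothetical and not part of Icinga 2.

```python
# Service-style check states mapped onto the two host states (assumed convention).
SERVICE_TO_HOST_STATE = {
    "OK": "UP",
    "WARNING": "UP",
    "CRITICAL": "DOWN",
    "UNKNOWN": "DOWN",
}


def host_state(service_state: str) -> str:
    """Translate the state of the wrapped pseudo-service into a host state."""
    return SERVICE_TO_HOST_STATE[service_state.upper()]


for state in ("OK", "WARNING", "CRITICAL", "UNKNOWN"):
    print(state, "->", host_state(state))
```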
Okay, thanks for that info. I will debug this again next week to find out if I’m just misreading something or if it’s an actual error. |
I have tested some more and it looks like defining my test check as Can somebody please have a look at/test my current implementation and tell me if I just failed to set up my test environment correctly or if there is really an issue? |
Imho that's implemented with previous_hard_state etc. coming with 2.12 for IcingaDB. |
Hi,
I will try to explain our use-case to make it easier to understand the rest of the issue.
For different state changes we need different notifications in our setup:
For 2. we tried to implement the following condition in our notification script:
Currently this does not seem to be possible (at least not in the way we expected). The runtime attributes in https://www.icinga.com/docs/icinga2/latest/doc/09-object-types/#service have confusing names:
state: the new state
last_state: the previous state
last_state_ok, last_state_warning, last_state_critical: Based on the fact that last_state contains the previous state we assumed that the three last_... variables would contain the timestamp of the last state change BEFORE the current state change. But apparently the values already contain the current state change. (As a side effect the service.duration_sec is always very small (0.000767) in case of a recovery because it contains the number of seconds since the current state change. We would expect the time since the previous state.)
So during the Warning -> Ok-Change in 2. we have no way to access the timestamp of the last time that the service was ok.
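The behaviour described above can be reproduced with a toy model. This is a sketch of the observation only, not Icinga 2 code: the class `ServiceModel`, its `process` method and the `last_seen` dictionary (standing in for last_state_ok, last_state_warning and last_state_critical) are hypothetical, and the model assumes all attributes are updated before the notification script runs.

```python
class ServiceModel:
    """Toy model of the reported attribute behaviour (not Icinga 2 code)."""

    def __init__(self, now: float) -> None:
        self.state = "ok"
        self.last_state = "ok"
        self.last_state_change = now
        self.last_seen = {"ok": now}  # stands in for the last_state_<name> timestamps

    def process(self, new_state: str, now: float) -> None:
        """Apply a check result; attributes are updated before any notification."""
        if new_state != self.state:
            self.last_state = self.state
            self.state = new_state
            self.last_state_change = now
        self.last_seen[new_state] = now  # overwrites the previous timestamp

    def duration_sec(self, now: float) -> float:
        """Seconds spent in the current state."""
        return now - self.last_state_change


svc = ServiceModel(now=0.0)
svc.process("warning", now=100.0)
ok_before_recovery = svc.last_seen["ok"]  # 0.0, the value the notification would need
svc.process("ok", now=160.0)

# What a notification script sees at the moment of the recovery:
print(svc.last_state)           # previous state
print(svc.last_seen["ok"])      # already the time of the current change
print(svc.duration_sec(160.0))  # time since the current change
print(ok_before_recovery)       # no longer available from the attributes
```

In this model, last_state refers to the state before the change, while the timestamp for OK already refers to the change that triggered the notification, which is the inconsistency the issue points out.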
So in my opinion the last_... variables should contain the timestamp of the last state change BEFORE the current state change. Or the last_state variable would have to contain the current state to be consistent with the other variable names.
Another option would be to improve the variable names and the documentation. This would not solve our use-case, but it would avoid confusion based on the different meaning of last in the variable names.
I hope my description is comprehensible. If not I will try to explain again ;)
(I guess the same applies to the host runtime attributes.)
Your Environment
Version (icinga2 --version): 2.8.4