Unify check attempt data type to uint32 already used somewhere #656

Merged
julianbrost merged 2 commits into main from max_check_attempts-range on Apr 9, 2024

Conversation

Al2Klimov
Member

A float isn't necessary, as both Checkable#max_check_attempts and check_attempt are ints in Icinga 2. But uint8 isn't enough, e.g. for a check running once per second to reach a HARD state after 5 minutes (300 attempts > 255).

@Al2Klimov Al2Klimov self-assigned this Oct 10, 2023
@cla-bot cla-bot bot added the cla/signed label Oct 10, 2023
@Al2Klimov Al2Klimov linked an issue Oct 10, 2023 that may be closed by this pull request
@Al2Klimov
Member Author

Before

2023-10-10 11:36:23 2023-10-10T09:36:23.002Z    FATAL   icingadb        strconv.ParseUint: parsing "256": value out of range
2023-10-10 11:36:23 can't parse check_attempt into the uint8 HostState#State.CheckAttempt: 256
2023-10-10 11:36:23 github.com/icinga/icingadb/pkg/structify.structifyMapByTree
2023-10-10 11:36:23     /icingadb-src/pkg/structify/structify.go:97
2023-10-10 11:36:23 github.com/icinga/icingadb/pkg/structify.structifyMapByTree
2023-10-10 11:36:23     /icingadb-src/pkg/structify/structify.go:102
2023-10-10 11:36:23 github.com/icinga/icingadb/pkg/structify.MakeMapStructifier.func1
2023-10-10 11:36:23     /icingadb-src/pkg/structify/structify.go:42
2023-10-10 11:36:23 github.com/icinga/icingadb/pkg/icingadb.(*RuntimeUpdates).Sync.structifyStream.func6
2023-10-10 11:36:23     /icingadb-src/pkg/icingadb/runtime_updates.go:324
2023-10-10 11:36:23 golang.org/x/sync/errgroup.(*Group).Go.func1
2023-10-10 11:36:23     /go/pkg/mod/golang.org/x/sync@v0.3.0/errgroup/errgroup.go:75
2023-10-10 11:36:23 runtime.goexit
2023-10-10 11:36:23     /usr/local/go/src/runtime/asm_amd64.s:1650
2023-10-10 11:36:23 can't structify map map[string]interface {}{"check_attempt":"256", "check_commandline":"'dummy'", "check_source":"master2", "check_timeout":"60000", "checksum":"b2ce7a04e37e4dd2bfad51f78d7a055244fef255", "environment_id":"36470b09ec644b7dc09863cdf0fbd4e68bc7f91b", "execution_time":"0", "hard_state":"1", "host_id":"acf2ca3316e8f6ca6ee42645d13a539445b62aaf", "id":"acf2ca3316e8f6ca6ee42645d13a539445b62aaf", "in_downtime":"0", "is_acknowledged":"0", "is_active":"1", "is_flapping":"0", "is_handled":"0", "is_problem":"1", "is_reachable":"1", "last_state_change":"0", "last_update":"1696930582403", "latency":"0", "next_check":"1696930582911", "next_update":"1696930582913", "output":"Check was successful.", "previous_hard_state":"99", "previous_soft_state":"99", "redis_key":"icinga:host:state", "runtime_type":"upsert", "scheduling_source":"master2", "severity":"2112", "soft_state":"1", "state_type":"0"} by tree []structify.structBranch{structify.structBranch{field:0, leaf:"", subTree:[]structify.structBranch{structify.structBranch{field:0, leaf:"", subTree:[]structify.structBranch{structify.structBranch{field:0, leaf:"", subTree:[]structify.structBranch{structify.structBranch{field:0, leaf:"", subTree:[]structify.structBranch{structify.structBranch{field:0, leaf:"id", subTree:[]structify.structBranch(nil)}}}}}, structify.structBranch{field:1, leaf:"", subTree:[]structify.structBranch{structify.structBranch{field:0, leaf:"checksum", subTree:[]structify.structBranch(nil)}}}}}, structify.structBranch{field:1, leaf:"", subTree:[]structify.structBranch{structify.structBranch{field:0, leaf:"environment_id", subTree:[]structify.structBranch(nil)}}}, structify.structBranch{field:2, leaf:"acknowledgement_comment_id", subTree:[]structify.structBranch(nil)}, structify.structBranch{field:3, leaf:"last_comment_id", subTree:[]structify.structBranch(nil)}, structify.structBranch{field:4, leaf:"check_attempt", subTree:[]structify.structBranch(nil)}, 
structify.structBranch{field:5, leaf:"check_commandline", subTree:[]structify.structBranch(nil)}, structify.structBranch{field:6, leaf:"check_source", subTree:[]structify.structBranch(nil)}, structify.structBranch{field:7, leaf:"scheduling_source", subTree:[]structify.structBranch(nil)}, structify.structBranch{field:8, leaf:"execution_time", subTree:[]structify.structBranch(nil)}, structify.structBranch{field:9, leaf:"hard_state", subTree:[]structify.structBranch(nil)}, structify.structBranch{field:10, leaf:"in_downtime", subTree:[]structify.structBranch(nil)}, structify.structBranch{field:11, leaf:"is_acknowledged", subTree:[]structify.structBranch(nil)}, structify.structBranch{field:12, leaf:"is_flapping", subTree:[]structify.structBranch(nil)}, structify.structBranch{field:13, leaf:"is_handled", subTree:[]structify.structBranch(nil)}, structify.structBranch{field:14, leaf:"is_problem", subTree:[]structify.structBranch(nil)}, structify.structBranch{field:15, leaf:"is_reachable", subTree:[]structify.structBranch(nil)}, structify.structBranch{field:16, leaf:"last_state_change", subTree:[]structify.structBranch(nil)}, structify.structBranch{field:17, leaf:"last_update", subTree:[]structify.structBranch(nil)}, structify.structBranch{field:18, leaf:"latency", subTree:[]structify.structBranch(nil)}, structify.structBranch{field:19, leaf:"long_output", subTree:[]structify.structBranch(nil)}, structify.structBranch{field:20, leaf:"next_check", subTree:[]structify.structBranch(nil)}, structify.structBranch{field:21, leaf:"next_update", subTree:[]structify.structBranch(nil)}, structify.structBranch{field:22, leaf:"output", subTree:[]structify.structBranch(nil)}, structify.structBranch{field:23, leaf:"performance_data", subTree:[]structify.structBranch(nil)}, structify.structBranch{field:24, leaf:"normalized_performance_data", subTree:[]structify.structBranch(nil)}, structify.structBranch{field:25, leaf:"previous_soft_state", subTree:[]structify.structBranch(nil)}, 
structify.structBranch{field:26, leaf:"previous_hard_state", subTree:[]structify.structBranch(nil)}, structify.structBranch{field:27, leaf:"severity", subTree:[]structify.structBranch(nil)}, structify.structBranch{field:28, leaf:"soft_state", subTree:[]structify.structBranch(nil)}, structify.structBranch{field:29, leaf:"state_type", subTree:[]structify.structBranch(nil)}, structify.structBranch{field:30, leaf:"check_timeout", subTree:[]structify.structBranch(nil)}}}, structify.structBranch{field:1, leaf:"host_id", subTree:[]structify.structBranch(nil)}}
2023-10-10 11:36:23 github.com/icinga/icingadb/pkg/structify.MakeMapStructifier.func1
2023-10-10 11:36:23     /icingadb-src/pkg/structify/structify.go:42
2023-10-10 11:36:23 github.com/icinga/icingadb/pkg/icingadb.(*RuntimeUpdates).Sync.structifyStream.func6
2023-10-10 11:36:23     /icingadb-src/pkg/icingadb/runtime_updates.go:324
2023-10-10 11:36:23 golang.org/x/sync/errgroup.(*Group).Go.func1
2023-10-10 11:36:23     /go/pkg/mod/golang.org/x/sync@v0.3.0/errgroup/errgroup.go:75
2023-10-10 11:36:23 runtime.goexit
2023-10-10 11:36:23     /usr/local/go/src/runtime/asm_amd64.s:1650
2023-10-10 11:36:23 can't structify values map[string]interface {}{"check_attempt":"256", "check_commandline":"'dummy'", "check_source":"master2", "check_timeout":"60000", "checksum":"b2ce7a04e37e4dd2bfad51f78d7a055244fef255", "environment_id":"36470b09ec644b7dc09863cdf0fbd4e68bc7f91b", "execution_time":"0", "hard_state":"1", "host_id":"acf2ca3316e8f6ca6ee42645d13a539445b62aaf", "id":"acf2ca3316e8f6ca6ee42645d13a539445b62aaf", "in_downtime":"0", "is_acknowledged":"0", "is_active":"1", "is_flapping":"0", "is_handled":"0", "is_problem":"1", "is_reachable":"1", "last_state_change":"0", "last_update":"1696930582403", "latency":"0", "next_check":"1696930582911", "next_update":"1696930582913", "output":"Check was successful.", "previous_hard_state":"99", "previous_soft_state":"99", "redis_key":"icinga:host:state", "runtime_type":"upsert", "scheduling_source":"master2", "severity":"2112", "soft_state":"1", "state_type":"0"}
2023-10-10 11:36:23 github.com/icinga/icingadb/pkg/icingadb.(*RuntimeUpdates).Sync.structifyStream.func6
2023-10-10 11:36:23     /icingadb-src/pkg/icingadb/runtime_updates.go:326
2023-10-10 11:36:23 golang.org/x/sync/errgroup.(*Group).Go.func1
2023-10-10 11:36:23     /go/pkg/mod/golang.org/x/sync@v0.3.0/errgroup/errgroup.go:75
2023-10-10 11:36:23 runtime.goexit
2023-10-10 11:36:23     /usr/local/go/src/runtime/asm_amd64.s:1650

After

No crash so far.

@Al2Klimov Al2Klimov removed their assignment Oct 10, 2023
@Al2Klimov Al2Klimov marked this pull request as ready for review October 10, 2023 09:44
@Al2Klimov
Member Author

ref/IP/48137


@A41susan A41susan left a comment

looks fine

Contributor

@julianbrost julianbrost left a comment

This should have a corresponding schema change as well. Currently, the check_attempt columns have type tinyint unsigned (MySQL) or tinyuint (PostgreSQL).

cmd/icingadb-migrate/convert.go (outdated, resolved)
@julianbrost julianbrost added this to the 1.2.0 milestone Jan 5, 2024
@julianbrost julianbrost added the consider backporting Candidate for inclusion in a bugfix release. label Jan 5, 2024
Contributor

@julianbrost julianbrost left a comment

Looks good overall. However, given that the schema upgrade changes column types in history tables (thus potentially resulting in time-consuming rewrites thereof), I'm not yet sure if we'd want to backport this and what this means for the schema version number change (which I want to consider before a final approval).

@julianbrost julianbrost requested a review from oxzi March 7, 2024 09:10
Contributor

@julianbrost julianbrost left a comment

As discussed on Friday, please adapt this PR so that the change to the state_history table is optional in the migration, so that only users who actually need it have to apply it. Otherwise everyone, even users who just have regular checks with a few attempts, would have to take the penalty of rewriting that whole table for no real benefit.

Unfortunately, this will imply a possible discrepancy between fresh and upgraded installations, so there should be a corresponding hint in the full schema file.

Comment on lines +88 to +97
expectedMysqlSchemaVersion = 5
expectedPostgresSchemaVersion = 3
Member

This should be done separately and even in a separate PR, probably as part of #707.

@@ -1343,4 +1343,4 @@ CREATE TABLE icingadb_schema (
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin ROW_FORMAT=DYNAMIC;

INSERT INTO icingadb_schema (version, timestamp)
-VALUES (4, CURRENT_TIMESTAMP() * 1000);
+VALUES (5, CURRENT_TIMESTAMP() * 1000);
Member

Same here!

Member Author

@Al2Klimov Al2Klimov Mar 25, 2024

Sure, there are already upgrade files w/o even such changes. But the specific things our 1.1.2 upgrade files already do don't break the daemon and are even idempotent:

So they don't need a new schema version, but this PR does.

Member

> So they don't need a new schema version, but this PR does.

Of course they do! What makes them different in any way from this PR?

As I said before, these specific schema version changes should be committed either in a separate commit or even better in a separate PR. This schema change has nothing to do with the actual fix and as such, should better be part of the #707 PR.

schema/mysql/upgrades/1.1.2.sql (outdated, resolved)
@lippserd lippserd self-requested a review March 24, 2024 20:46
Member

@lippserd lippserd left a comment

Our upgrading docs should mention this bug and the schema fix and why it is not included in the upgrade scripts. It should also mention a workaround for this bug, which is to increase retry_interval so that time_to_hard_state = 255 * x * retry_interval. We should also document the statements to be executed and make clear that they take a very long time and should not be aborted under any circumstances.

Edit: I don't like the additional upgrade scripts. Since there's something to be written in the upgrading docs anyway, the statements should be there.

@julianbrost
Contributor

> Edit: I don't like the additional upgrade scripts. Since there's something to be written in the upgrading docs anyway, the statements should be there.

There and only there? I mean if these exist as a file like all the other ones, one can apply them the same way and in one go like this:

mysql [...] icingadb < /usr/share/icingadb/mysql/upgrades/1.1.2.sql
mysql [...] icingadb < /usr/share/icingadb/mysql/upgrades/1.1.2-history.sql

If it's only in the upgrading docs, you have to copy & paste, just another step where mistakes can be made.

@lippserd
Member

> > Edit: I don't like the additional upgrade scripts. Since there's something to be written in the upgrading docs anyway, the statements should be there.
>
> There and only there? I mean if these exist as a file like all the other ones, one can apply them the same way and in one go like this:
>
> mysql [...] icingadb < /usr/share/icingadb/mysql/upgrades/1.1.2.sql
> mysql [...] icingadb < /usr/share/icingadb/mysql/upgrades/1.1.2-history.sql
>
> If it's only in the upgrading docs, you have to copy & paste, just another step where mistakes can be made.

We'd have to have full documentation in both places to make absolutely sure people read and understand it, although I'm pretty sure there'll still be people who just import the upgrade script without reading it, because why not. Of course, we could rename the file to caution-read-before-use, but that doesn't feel right either. And even if it's not a big problem: if we ever introduce automatic schema migrations, we need to handle this one special case as well.

@Al2Klimov Al2Klimov force-pushed the max_check_attempts-range branch 2 times, most recently from 1790fa4 to 89f43ac Compare March 26, 2024 11:28
doc/04-Upgrading.md (outdated, resolved)
@julianbrost
Contributor

> > > Edit: I don't like the additional upgrade scripts. Since there's something to be written in the upgrading docs anyway, the statements should be there.
> >
> > There and only there? I mean if these exist as a file like all the other ones, one can apply them the same way and in one go like this:
> >
> > mysql [...] icingadb < /usr/share/icingadb/mysql/upgrades/1.1.2.sql
> > mysql [...] icingadb < /usr/share/icingadb/mysql/upgrades/1.1.2-history.sql
> >
> > If it's only in the upgrading docs, you have to copy & paste, just another step where mistakes can be made.
>
> We'd have to have full documentation in both places to make absolutely sure people read and understand it, although I'm pretty sure there'll still be people who just import the upgrade script without reading it, because why not.

I don't think it's possible to make everyone ignoring the upgrading instructions happy here. If it was, there shouldn't be an optional part. There will probably be users that ignore the upgrading docs, upgrade their setup, and then run into the bug a year later.

> Of course, we could rename the file to caution-read-before-use, but that doesn't feel right either.

Sure, the name 1.1.2-history.sql is pretty non-descriptive in that regard. But shouldn't simply including "optional" in the name be enough to make people think about whether they need it? Maybe even placing it into schema/mysql/upgrades/optional/1.1.2-history.sql to make it even more distinct from the other upgrades?

> And even if it's not a big problem: if we ever introduce automatic schema migrations, we need to handle this one special case as well.

But that special handling would be necessary regardless of where we put it now. And if the optional part is only mentioned in the upgrading docs and not where all the other schema upgrades are, I think there's a higher chance that we would simply forget that there was this optional upgrade some time in the past.

@lippserd
Member

> Maybe even placing it into schema/mysql/upgrades/optional/1.1.2-history.sql to make it even more distinct from the other upgrades?

That is a good idea.

doc/04-Upgrading.md (outdated, resolved)
doc/04-Upgrading.md (outdated, resolved)
schema/mysql/upgrades/1.1.2.sql (resolved)
doc/04-Upgrading.md (outdated, resolved)
@Al2Klimov Al2Klimov force-pushed the max_check_attempts-range branch 2 times, most recently from 4dff7fc to 34fb7ed Compare April 3, 2024 11:13
Comment on lines 18 to 21
### Upgrading the state_history Table (optional)

Icinga DB crashes if hosts/services reach check attempt 256. The `check_attempt` column of the `state_history` table
is too small to fit values greater 255. ([#655](https://github.com/Icinga/icingadb/issues/655))
Contributor

It's just that if you read that section top to bottom, with the information you were given up to that sentence, you may ask yourself "well, if it still crashes, what does this update even do?".

-- #656 (comment)

Member Author

Your claim sounds better if you read only the first third of that section. Yes. There's a crash. And the update doesn't fix it.* But we have good reasons and wrote them down in the upgrading docs.

* Actually it does, you "just" have to apply two files.

Contributor

> Your claim sounds better if you read only the first third of that section.

Well yes, people tend to read text from top to bottom. And usually it's a feature if understanding a sentence doesn't depend on information only given later in the text. Otherwise, you end up with a text you have to read multiple times until you understand it.

Contributor

I've prepared a suggestion for how this can be written in 1298c30. It's based on the text provided in this PR (hence the Co-authored-by) but restructured so that it first explains why the schema upgrade is split and then mentions the options including the "may crash" warning if the upgrade is skipped. Apart from that, I've tweaked the wording in a few places.

Member Author

Good in general, but for max_check_attempts=256 (which already works btw.) the highest possible check_attempt is 255: https://github.com/Icinga/icinga2/blob/9e31b8b5590c6d67b7dd538b2c884bd377a4e486/lib/icinga/checkable-check.cpp#L233-L237

Contributor

You're right. Feel free to change it, but I think it's fine to leave it as is with 255 obviously being a safe value for uint8 (with extra room for an off by one somewhere that hopefully doesn't exist 🙈)

doc/04-Upgrading.md (outdated, resolved)
@Al2Klimov Al2Klimov requested review from lippserd and removed request for lippserd and oxzi April 3, 2024 12:59
@Al2Klimov Al2Klimov marked this pull request as draft April 8, 2024 13:54
@Al2Klimov Al2Klimov marked this pull request as ready for review April 8, 2024 14:01
Al2Klimov and others added 2 commits April 8, 2024 16:01
A float isn't necessary as in Icinga 2 Checkable#max_check_attempts and
check_attempt are ints. But uint8 isn't enough for e.g. 1 check/s to get
HARD after 5m (300s > 255).
Co-authored-by: Alexander A. Klimov <alexander.klimov@icinga.com>
@julianbrost julianbrost removed their assignment Apr 8, 2024
@julianbrost julianbrost merged commit 0daca8b into main Apr 9, 2024
31 checks passed
@julianbrost julianbrost deleted the max_check_attempts-range branch April 9, 2024 12:11
Labels
cla/signed consider backporting Candidate for inclusion in a bugfix release. ref/IP
Development

Successfully merging this pull request may close these issues.

IcingaDB won't start when max_check_attempts is out of range
5 participants