
null value in column "downtime_end" of relation "sla_history_downtime" violates not-null constraint #9942

Closed
Mordecaine opened this issue Oct 19, 2023 · 11 comments · Fixed by #9896
Labels: area/icingadb New backend, bug Something isn't working

@Mordecaine

Describe the bug

If we start the Icinga DB service on our config master server, the daemon crashes.
On our second master system, the daemon runs fine.

To Reproduce

I have no idea how we got into this state, but the daemon crashes on startup:

systemctl start icingadb

Expected behavior

The daemon should run.

Your Environment

Icinga HA cluster with a PostgreSQL cluster as backend

  • Icinga DB version: 1.1.1
  • Icinga 2 version: 2.14
  • Operating System and version: Red Hat 8.8

Additional context

journalctl returns the following lines:

Oct 19 14:39:31 msd-ic-ma01 systemd[1]: Starting Icinga DB...
Oct 19 14:39:31 msd-ic-ma01 icingadb[1687676]: Starting Icinga DB
Oct 19 14:39:31 msd-ic-ma01 icingadb[1687676]: Connecting to database at 'localhost:5432'
Oct 19 14:39:31 msd-ic-ma01 systemd[1]: Started Icinga DB.
Oct 19 14:39:31 msd-ic-ma01 icingadb[1687676]: Connecting to Redis at 'localhost:6380'
Oct 19 14:39:31 msd-ic-ma01 icingadb[1687676]: Starting history sync
Oct 19 14:39:31 msd-ic-ma01 icingadb[1687676]: heartbeat: Received Icinga heartbeat
Oct 19 14:39:31 msd-ic-ma01 icingadb[1687676]: high-availability: Another instance is active
Oct 19 14:39:32 msd-ic-ma01 icingadb[1687676]: pq: null value in column "downtime_end" of relation "sla_history_downtime" violates not-null constraint
                                               can't perform "INSERT INTO \"sla_history_downtime\" (\"downtime_start\", \"downtime_id\", \"environment_id\", \"endpoint_id\", \"object_type\", \"host_id\", \"service_id\", \"downtime_end\") VALUES (:downtime_start,:downtime_id,:environment_id,:endpoint_id,:object_type,:host_id,:service_id,:downtime_end) ON CONFLICT ON CONSTRAINT pk_sla_history_downtime DO UPDATE SET \"downtime_end\" = EXCLUDED.\"downtime_end\""
                                               github.com/icinga/icingadb/internal.CantPerformQuery
                                                       github.com/icinga/icingadb/internal/internal.go:30
                                               github.com/icinga/icingadb/pkg/icingadb.(*DB).NamedBulkExec.func1.1.1.1
                                                       github.com/icinga/icingadb/pkg/icingadb/db.go:391
                                               github.com/icinga/icingadb/pkg/retry.WithBackoff
                                                       github.com/icinga/icingadb/pkg/retry/retry.go:49
                                               github.com/icinga/icingadb/pkg/icingadb.(*DB).NamedBulkExec.func1.1.1
                                                       github.com/icinga/icingadb/pkg/icingadb/db.go:386
                                               golang.org/x/sync/errgroup.(*Group).Go.func1
                                                       golang.org/x/sync@v0.3.0/errgroup/errgroup.go:75
                                               runtime.goexit
                                                       runtime/asm_amd64.s:1598
                                               can't retry
                                               github.com/icinga/icingadb/pkg/retry.WithBackoff
                                                       github.com/icinga/icingadb/pkg/retry/retry.go:68
                                               github.com/icinga/icingadb/pkg/icingadb.(*DB).NamedBulkExec.func1.1.1
                                                       github.com/icinga/icingadb/pkg/icingadb/db.go:386
                                               golang.org/x/sync/errgroup.(*Group).Go.func1
                                                       golang.org/x/sync@v0.3.0/errgroup/errgroup.go:75
                                               runtime.goexit
                                                       runtime/asm_amd64.s:1598
Oct 19 14:39:32 msd-ic-ma01 systemd[1]: icingadb.service: Main process exited, code=exited, status=1/FAILURE
Oct 19 14:39:32 msd-ic-ma01 systemd[1]: icingadb.service: Failed with result 'exit-code'.
@Al2Klimov
Member

@log1-c Could you please share the Redis data Icinga DB can't process:

  1. Run icingadb-redis-cli
  2. Query XRANGE icinga:history:stream:downtime - +

@log1-c
Contributor

log1-c commented Nov 8, 2023

Running the command produces a lot of output like this:

6470) 1) "1698153784788-1"
       2)  1) "downtime_id"
           2) "8763702cd5f895051473a13b1a0f6a83a86521c0"
           3) "environment_id"
           4) "d4af2e0c3a51075336c92e0de60bd5336ca9c9bf"
           5) "host_id"
           6) "192bb8b024dd83154cf545f4c646c737e67d7eae"
           7) "entry_time"
           8) "1698153784538"
           9) "author"
          10) "icingaadmin"
          11) "comment"
          12) "Diese Downtime gilt immer f\xc3\xbcr alle Develop Machinen des Virtual Desktop Teams"
          13) "is_flexible"
          14) "0"
          15) "flexible_duration"
          16) "0"
          17) "scheduled_start_time"
          18) "1698098400000"
          19) "scheduled_end_time"
          20) "1698184800000"
          21) "has_been_cancelled"
          22) "0"
          23) "trigger_time"
          24) "1698153784538"
          25) "event_id"
          26) "a79ec8a7a7a82aceb13c3c5880e84a5d66c13bf1"
          27) "event_type"
          28) "downtime_start"
          29) "object_type"
          30) "service"
          31) "service_id"
          32) "0a3cffce433d4305f0d5effa3dbf8005df53352c"
          33) "start_time"
          34) "1698098400000"
          35) "end_time"
          36) "1698184800000"
          37) "endpoint_id"
          38) "7fad74b9bcc7287690d7f4bdc47ce9c133830d81"
          39) "scheduled_by"
          40) "mvd-d9-ts1001!service-lanmanworkstation!virtual_desktop_develop_server"
 6471) 1) "1698153784788-2"
       2)  1) "downtime_id"
           2) "3d2da43046eca2490e42ae935d1afe05f80cba84"
           3) "environment_id"
           4) "d4af2e0c3a51075336c92e0de60bd5336ca9c9bf"
           5) "host_id"
           6) "192bb8b024dd83154cf545f4c646c737e67d7eae"
           7) "entry_time"
           8) "1698153784542"
           9) "author"
          10) "icingaadmin"
          11) "comment"
          12) "Diese Downtime gilt immer f\xc3\xbcr alle Develop Machinen des Virtual Desktop Teams"
          13) "is_flexible"
          14) "0"
          15) "flexible_duration"
          16) "0"
          17) "scheduled_start_time"
          18) "1698098400000"
          19) "scheduled_end_time"
          20) "1698184800000"
          21) "has_been_cancelled"
          22) "0"
          23) "trigger_time"
          24) "1698153784542"
          25) "event_id"
          26) "05f6c3ed0c3efaa52ea2ac66c2580ba14555f8bc"
          27) "event_type"
          28) "downtime_start"
          29) "object_type"
          30) "service"
          31) "service_id"
          32) "f84893b01646df8dc1ee7eec33ead3900e76d0e0"
          33) "start_time"
          34) "1698098400000"
          35) "end_time"
          36) "1698184800000"
          37) "endpoint_id"
          38) "7fad74b9bcc7287690d7f4bdc47ce9c133830d81"
          39) "scheduled_by"
          40) "mvd-d9-ts1001!service-frxsvc!virtual_desktop_develop_server"

...
15068) 1) "1698702302458-3"
       2)  1) "downtime_id"
           2) "c391e07992e08e36631d914dcf177fa9221e5f3a"
           3) "environment_id"
           4) "d4af2e0c3a51075336c92e0de60bd5336ca9c9bf"
           5) "host_id"
           6) "6426d0d5be7fde903c2c4b5ffc72b243e20c41e7"
           7) "entry_time"
           8) "1698702302348"
           9) "author"
          10) "maintenance-user"
          11) "comment"
          12) "automatic downtime for vm creation, 24h sentinelone downtime"
          13) "is_flexible"
          14) "0"
          15) "flexible_duration"
          16) "0"
          17) "scheduled_start_time"
          18) "1698702302000"
          19) "scheduled_end_time"
          20) "1698822302000"
          21) "has_been_cancelled"
          22) "0"
          23) "trigger_time"
          24) "1698702302348"
          25) "event_id"
          26) "d5746b65f3471c39d5b2928d9f1c5a7a24c9232d"
          27) "event_type"
          28) "downtime_start"
          29) "object_type"
          30) "service"
          31) "service_id"
          32) "253efc7b550fa1ed48f9b08cc35e1ddeaba5431e"
          33) "start_time"
          34) "1698702302000"
          35) "end_time"
          36) "1698822302000"
          37) "endpoint_id"
          38) "7fad74b9bcc7287690d7f4bdc47ce9c133830d81"
15069) 1) "1698702302458-4"
       2)  1) "downtime_id"
           2) "011a8cc65880cadac251a55f87c973dc4711c7e6"
           3) "environment_id"
           4) "d4af2e0c3a51075336c92e0de60bd5336ca9c9bf"
           5) "host_id"
           6) "a1d9bac76db364203e7d546b7c9511dfaa384425"
           7) "entry_time"
           8) "1698702302371"
           9) "author"
          10) "maintenance-user"
          11) "comment"
          12) "automatic downtime for vm creation"
          13) "is_flexible"
          14) "0"
          15) "flexible_duration"
          16) "0"
          17) "scheduled_start_time"
          18) "1698702302000"
          19) "scheduled_end_time"
          20) "1698735902000"
          21) "has_been_cancelled"
          22) "0"
          23) "trigger_time"
          24) "1698702302371"
          25) "event_id"
          26) "f3fabb956cb45826349b58eda4c5c642e399d637"
          27) "event_type"
          28) "downtime_start"
          29) "object_type"
          30) "host"
          31) "start_time"
          32) "1698702302000"
          33) "end_time"
          34) "1698735902000"
          35) "endpoint_id"
          36) "7fad74b9bcc7287690d7f4bdc47ce9c133830d81"
15070) 1) "1698702302458-5"
       2)  1) "downtime_id"
           2) "dc6299459d548ec1e577df385d02b6d7e6f65730"
           3) "environment_id"
           4) "d4af2e0c3a51075336c92e0de60bd5336ca9c9bf"
           5) "host_id"
           6) "a1d9bac76db364203e7d546b7c9511dfaa384425"
           7) "entry_time"
           8) "1698702302380"
           9) "author"
          10) "maintenance-user"
          11) "comment"
          12) "automatic downtime for vm creation"
          13) "is_flexible"
          14) "0"
          15) "flexible_duration"
          16) "0"
          17) "scheduled_start_time"
          18) "1698702302000"
          19) "scheduled_end_time"
          20) "1698735902000"
          21) "has_been_cancelled"
          22) "0"
          23) "trigger_time"
          24) "1698702302380"
          25) "event_id"
          26) "376ea1f29f0ca519b1200531d853a2892aaa8027"
          27) "event_type"
          28) "downtime_start"
          29) "object_type"
          30) "service"
          31) "service_id"
          32) "e4a461d6f88d4d783dca596e7015781f6634de7c"
          33) "start_time"
          34) "1698702302000"
          35) "end_time"
          36) "1698735902000"
          37) "endpoint_id"
          38) "7fad74b9bcc7287690d7f4bdc47ce9c133830d81"
          39) "parent_id"
          40) "011a8cc65880cadac251a55f87c973dc4711c7e6"
15071) 1) "1698702302458-6"
       2)  1) "downtime_id"
           2) "18da40453435eb3461437cef796599f5efcf9d1f"
           3) "environment_id"
           4) "d4af2e0c3a51075336c92e0de60bd5336ca9c9bf"
           5) "host_id"
           6) "a1d9bac76db364203e7d546b7c9511dfaa384425"
           7) "entry_time"
           8) "1698702302384"
           9) "author"
          10) "maintenance-user"
          11) "comment"
          12) "automatic downtime for vm creation"
          13) "is_flexible"
          14) "0"
          15) "flexible_duration"
          16) "0"
          17) "scheduled_start_time"
          18) "1698702302000"
          19) "scheduled_end_time"
          20) "1698735902000"
          21) "has_been_cancelled"
          22) "0"
          23) "trigger_time"
          24) "1698702302384"
          25) "event_id"
          26) "21ee7e1813b010e8ae11d72a664de7a6ccaede63"
          27) "event_type"
          28) "downtime_start"
          29) "object_type"
          30) "service"
          31) "service_id"
          32) "9595c971c11b49346ce91ab1bb7580684859bd24"
          33) "start_time"
          34) "1698702302000"
          35) "end_time"
          36) "1698735902000"
          37) "endpoint_id"
          38) "7fad74b9bcc7287690d7f4bdc47ce9c133830d81"
          39) "parent_id"
          40) "011a8cc65880cadac251a55f87c973dc4711c7e6"

But only on the node where the icingadb daemon is crashing. On the second master node (which is not the config master, by the way) it just returns an empty array.

@Al2Klimov
Member

Could you please share your file in /var/lib/icingadb-redis from the node with the long output via https://nextcloud.icinga.com/index.php/s/YnTKYNSLaYyx9d8?

@log1-c
Contributor

log1-c commented Nov 8, 2023

Sure, I uploaded the dump.rdb file.

@Al2Klimov
Member

From the node with the long output of XRANGE. Obviously I need to debug "the node where the icingadb daemon is crashing".

127.0.0.1:6380> XRANGE icinga:history:stream:downtime + -
(empty array)
127.0.0.1:6380>

@log1-c
Contributor

log1-c commented Nov 8, 2023

That is the file I uploaded

msd-ic-ma01# ls -lah
total 74M
drwxr-x---.  2 icingadb-redis icingadb-redis   22 Nov  8 14:35 .
drwxr-xr-x. 49 root           root           4.0K Oct 18 14:57 ..
-rw-r--r--.  1 icingadb-redis icingadb-redis  74M Nov  8 14:35 dump.rdb

node where the daemon runs:


msd-ic-ma02# ls -lah
total 28M
drwxr-x---.  2 icingadb-redis icingadb-redis   22 Nov  8 14:36 .
drwxr-xr-x. 46 root           root           4.0K Oct 23 06:42 ..
-rw-r--r--.  1 icingadb-redis icingadb-redis  28M Nov  8 14:36 dump.rdb

Did you maybe switch + and - in your command? When I run XRANGE icinga:history:stream:downtime + - I also get (empty array); the other way round produces output.

@Al2Klimov
Member

Oops 😅

@Al2Klimov
Member

Successfully reproduced it, but only on Postgres. 🙈

@Al2Klimov
Member

Monkey-patched my local Icinga DB a bit so that it syncs history entries one by one:

diff --git a/pkg/icingadb/history/sync.go b/pkg/icingadb/history/sync.go
index dc8bc611..a23cbb13 100644
--- a/pkg/icingadb/history/sync.go
+++ b/pkg/icingadb/history/sync.go
@@ -20,6 +20,7 @@ import (
        "golang.org/x/sync/errgroup"
        "reflect"
        "sync"
+       "time"
 )

 // Sync specifies the source and destination of a history sync.
@@ -118,6 +119,7 @@ func (s Sync) readFromRedis(ctx context.Context, key string, output chan<- redis

                for _, stream := range streams {
                        for _, message := range stream.Messages {
+                               time.Sleep(time.Second / 3)
                                xra.Streams[1] = message.ID

                                select {

This surfaced exactly the one case I had seen in the code and was afraid of:

127.0.0.1:6380> XRANGE icinga:history:stream:downtime - + COUNT 1
1) 1) "1697508001888-4"
   2)  1) "downtime_id"
       2) "1a542440ed19197da76b01788dd54cb10af7b4e6"
       3) "environment_id"
       4) "d4af2e0c3a51075336c92e0de60bd5336ca9c9bf"
       5) "host_id"
       6) "b4281fed7d790ac5b54b7481addee4413cdc7c49"
       7) "entry_time"
       8) "1697508001690"
       9) "author"
      10) "maintenance-user"
      11) "comment"
      12) "automatic downtime for customer creation"
      13) "is_flexible"
      14) "0"
      15) "flexible_duration"
      16) "0"
      17) "scheduled_start_time"
      18) "1697508001000"
      19) "scheduled_end_time"
      20) "1705475119000"
      21) "has_been_cancelled"
      22) "1"
      23) "trigger_time"
      24) "1697508001690"
      25) "event_id"
      26) "fbe4ec2afa1feca24b21c7fa510db57b75ba152f"
      27) "event_type"
      28) "downtime_start"
      29) "object_type"
      30) "service"
      31) "service_id"
      32) "9c42d91fc10ee3e6da1c0477aee3233a6d08fcc8"
      33) "start_time"
      34) "1697508001000"
      35) "end_time"
      36) "1705475119000"
      37) "endpoint_id"
      38) "7fad74b9bcc7287690d7f4bdc47ce9c133830d81"
      39) "parent_id"
      40) "8edec9935b10706f2389ff7f62b537c6d285410b"
127.0.0.1:6380>

has_been_cancelled=1, but cancel_time is missing.
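Events like this can also be found programmatically before deleting them. A minimal sketch (not from the thread; it assumes redis-py is installed and uses the stream key and field names from the XRANGE output above — the helper names are made up):

```python
def is_broken_downtime_event(fields: dict) -> bool:
    """True if a downtime event claims cancellation but carries no cancel_time.

    Such events end up as a NULL downtime_end in sla_history_downtime,
    which violates the NOT NULL constraint seen in the crash above.
    """
    return fields.get("has_been_cancelled") == "1" and "cancel_time" not in fields


def print_broken_events(host="localhost", port=6380):
    """Scan the downtime stream and print an XDEL command per broken event.

    Printing instead of deleting lets you review before discarding history.
    Requires redis-py and a running icingadb-redis instance.
    """
    import redis

    r = redis.Redis(host=host, port=port, decode_responses=True)
    for event_id, fields in r.xrange("icinga:history:stream:downtime", "-", "+"):
        if is_broken_downtime_event(fields):
            print(f"XDEL icinga:history:stream:downtime {event_id}")
```

Running `print_broken_events()` against the affected node should list entries like 1697508001888-4 above.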

@Al2Klimov
Member

Can you discard that broken event?

XDEL icinga:history:stream:downtime 1697508001888-4

@log1-c
Contributor

log1-c commented Dec 12, 2023

Done.

127.0.0.1:6380> XDEL icinga:history:stream:downtime 1697508001888-4
(integer) 1

The service keeps running after that.

● icingadb.service - Icinga DB
   Loaded: loaded (/usr/lib/systemd/system/icingadb.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2023-12-12 08:42:02 CET; 2min 12s ago
 Main PID: 791376 (icingadb)
    Tasks: 8 (limit: 49030)
   Memory: 240.7M
   CGroup: /system.slice/icingadb.service
           └─791376 /usr/sbin/icingadb --config /etc/icingadb/config.yml

Dec 12 08:42:22 msd-ic-ma01 icingadb[791376]: history-sync: Synced 80 notification history items
Dec 12 08:42:42 msd-ic-ma01 icingadb[791376]: history-sync: Synced 17424 downtime history items
Dec 12 08:42:42 msd-ic-ma01 icingadb[791376]: history-sync: Synced 67328 state history items
Dec 12 08:43:02 msd-ic-ma01 icingadb[791376]: history-sync: Synced 88187 state history items
Dec 12 08:43:02 msd-ic-ma01 icingadb[791376]: history-sync: Synced 37740 downtime history items
Dec 12 08:43:02 msd-ic-ma01 icingadb[791376]: high-availability: Another instance is active
Dec 12 08:43:22 msd-ic-ma01 icingadb[791376]: history-sync: Synced 8704 downtime history items
Dec 12 08:43:22 msd-ic-ma01 icingadb[791376]: history-sync: Synced 36014 state history items
Dec 12 08:43:42 msd-ic-ma01 icingadb[791376]: history-sync: Synced 1 state history items
Dec 12 08:44:02 msd-ic-ma01 icingadb[791376]: high-availability: Another instance is active

@yhabteab yhabteab added the bug Something isn't working label Dec 12, 2023
@yhabteab yhabteab reopened this Dec 12, 2023
@Al2Klimov Al2Klimov transferred this issue from Icinga/icingadb Dec 19, 2023
@Al2Klimov Al2Klimov added this to the 2.13.9 milestone Dec 19, 2023
@yhabteab yhabteab added the area/icingadb New backend label Dec 19, 2023