Core update hangs indefinitely #6854

DavidePrincipi · 2024-02-21T10:30:27Z

Since I added a fast node to an existing cluster, the nightly apply-updates procedure blocks during update-core.

● redis.service - Core Redis DB
     Loaded: loaded (/etc/systemd/system/redis.service; enabled; preset: disabled)
     Active: active (running) since Tue 2024-02-20 02:45:50 CET; 13h ago

Feb 20 02:45:49.688798 ns8n5 agent@node[11056]: Running /var/lib/nethserver/node/update-core.d/95cleanup_images...
Feb 20 02:45:49.822616 ns8n5 agent@node[11056]: Failed to publish the action status on channel progress/node/15/task/9d6cec45-13e1-4799-a481-846ed0eb3469
Feb 20 02:45:49.822864 ns8n5 agent@node[11056]: task/node/15/9d6cec45-13e1-4799-a481-846ed0eb3469: update-core/95cleanup_images is starting
Feb 20 02:45:49.900384 ns8n5 agent@node[11056]: Failed to publish the action status on channel progress/node/15/task/9d6cec45-13e1-4799-a481-846ed0eb3469
Feb 20 02:45:49.900954 ns8n5 agent@node[11056]: Redis command failed: dial tcp 10.5.4.5:6379: connect: connection refused
Feb 20 02:45:49.900954 ns8n5 agent@node[11056]: task/node/15/9d6cec45-13e1-4799-a481-846ed0eb3469: action "update-core" status is "completed" (0) at step 95cleanup_images

The log trace shows no retry attempt to write the task output to Redis: while Redis is restarting, the node agent running on the fast node fails to publish its update-core exit status. As a result, the controlling task running on the cluster leader never finds the task outcome, and the whole action blocks.

Steps to reproduce

Install NS8.
Define the check-bug-6854 action and run it with api-cli.

To define the action:

mkdir /var/lib/nethserver/cluster/actions/check-bug-6854
vi /var/lib/nethserver/cluster/actions/check-bug-6854/10restart_redis
chmod +x /var/lib/nethserver/cluster/actions/check-bug-6854/10restart_redis

In 10restart_redis:

#!/bin/bash
systemctl stop redis
systemctl start redis --no-block

Expected results

The action terminates.

Actual results

The action is blocked until I manually create a fake task exit status with MPUT.

Fix proposal

During Redis restarts, the default retry settings of the go-redis library may not suffice.

Increase the retry period of our agent.
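
The idea behind this proposal can be sketched in shell, to stay consistent with the reproduction script above. This is only an illustration of bounded retries with exponential backoff: the actual change belongs in the Go agent, where go-redis exposes the MaxRetries, MinRetryBackoff and MaxRetryBackoff options. The retry_with_backoff helper and its arguments are hypothetical names introduced for this example, not part of NS8 or go-redis.

```shell
#!/bin/bash
# Illustrative sketch only: retry a command with exponential backoff.
# retry_with_backoff is a hypothetical helper, not an NS8 or go-redis API.
#
# Usage: retry_with_backoff MAX_ATTEMPTS INITIAL_DELAY MAX_DELAY COMMAND [ARGS...]
# Delays are whole seconds. Returns 0 as soon as COMMAND succeeds,
# or 1 once MAX_ATTEMPTS attempts have failed.
retry_with_backoff() {
    local max_attempts=$1 delay=$2 max_delay=$3
    shift 3
    local attempt=1
    while true; do
        "$@" && return 0                 # success: stop retrying
        if (( attempt >= max_attempts )); then
            return 1                     # give up after the last attempt
        fi
        sleep "$delay"
        delay=$(( delay * 2 ))           # double the wait each time...
        (( delay > max_delay )) && delay=$max_delay   # ...up to a ceiling
        attempt=$(( attempt + 1 ))
    done
}

# Example (assumes redis-cli is installed and Redis listens on 10.5.4.5):
#   retry_with_backoff 6 1 8 redis-cli -h 10.5.4.5 ping
```

With a total retry window longer than the time Redis needs to come back up, a publish attempt that first hits "connection refused" gets another chance instead of being dropped.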

Components

core 2.5.1


DavidePrincipi · 2024-02-22T18:07:39Z

Released in https://github.com/NethServer/ns8-core/releases/tag/2.5.2

DavidePrincipi added this to NethServer Feb 20, 2024

DavidePrincipi self-assigned this Feb 21, 2024

DavidePrincipi converted this from a draft issue Feb 21, 2024

DavidePrincipi added the bug label Feb 21, 2024

DavidePrincipi moved this from 🔖 Ready to 🏗 In progress in NethServer Feb 21, 2024

DavidePrincipi changed the title ~~Bug? Core update hangs indefinitely~~ Core update hangs indefinitely Feb 21, 2024

DavidePrincipi mentioned this issue Feb 21, 2024 in "Configure the agent retry backoff" NethServer/ns8-core#581 (merged)

DavidePrincipi moved this from 🏗 In progress to 👀 Testing in NethServer Feb 22, 2024

DavidePrincipi added the verified All test cases were verified successfully label Feb 22, 2024

DavidePrincipi closed this as completed Feb 22, 2024

github-project-automation bot moved this from 👀 Testing to ✅ Done in NethServer Feb 22, 2024
