Core update hangs indefinitely #6854

Closed
DavidePrincipi opened this issue Feb 21, 2024 · 1 comment

DavidePrincipi commented Feb 21, 2024

Since I added a fast node to an existing cluster, the nightly apply-updates procedure blocks during update-core.

● redis.service - Core Redis DB
     Loaded: loaded (/etc/systemd/system/redis.service; enabled; preset: disabled)
     Active: active (running) since Tue 2024-02-20 02:45:50 CET; 13h ago
Feb 20 02:45:49.688798 ns8n5 agent@node[11056]: Running /var/lib/nethserver/node/update-core.d/95cleanup_images...
Feb 20 02:45:49.822616 ns8n5 agent@node[11056]: Failed to publish the action status on channel progress/node/15/task/9d6cec45-13e1-4799-a481-846ed0eb3469
Feb 20 02:45:49.822864 ns8n5 agent@node[11056]: task/node/15/9d6cec45-13e1-4799-a481-846ed0eb3469: update-core/95cleanup_images is starting
Feb 20 02:45:49.900384 ns8n5 agent@node[11056]: Failed to publish the action status on channel progress/node/15/task/9d6cec45-13e1-4799-a481-846ed0eb3469
Feb 20 02:45:49.900954 ns8n5 agent@node[11056]: Redis command failed: dial tcp 10.5.4.5:6379: connect: connection refused
Feb 20 02:45:49.900954 ns8n5 agent@node[11056]: task/node/15/9d6cec45-13e1-4799-a481-846ed0eb3469: action "update-core" status is "completed" (0) at step 95cleanup_images

The log trace shows no retry attempt to write the task output to Redis: after Redis is restarted, the node agent running on the fast node fails to publish its update-core exit status. As a result, the task outcome is never found by the controlling task running on the cluster leader, and the whole action blocks.

Steps to reproduce

  • install NS8
  • define the check-bug-6854 action and run it with api-cli

To define such action

mkdir /var/lib/nethserver/cluster/actions/check-bug-6854
vi /var/lib/nethserver/cluster/actions/check-bug-6854/10restart_redis
chmod +x /var/lib/nethserver/cluster/actions/check-bug-6854/10restart_redis

In 10restart_redis:

#!/bin/bash
systemctl stop redis
systemctl start redis --no-block

Expected results

The action terminates.

Actual results

The action is blocked until I manually create a fake task exit status with MPUT.

Fix proposal

During Redis restarts, the default go-redis library retry settings may not suffice.

Increase the retry period of our agent.

Components

  • core 2.5.1
@DavidePrincipi DavidePrincipi self-assigned this Feb 21, 2024
@DavidePrincipi DavidePrincipi converted this from a draft issue Feb 21, 2024
@DavidePrincipi DavidePrincipi moved this from 🔖 Ready to 🏗 In progress in NethServer Feb 21, 2024
@DavidePrincipi DavidePrincipi changed the title Bug? Core update hangs indefinitely Core update hangs indefinitely Feb 21, 2024
@DavidePrincipi DavidePrincipi moved this from 🏗 In progress to 👀 Testing in NethServer Feb 22, 2024
@DavidePrincipi DavidePrincipi added the verified All test cases were verified successfully label Feb 22, 2024
@github-project-automation github-project-automation bot moved this from 👀 Testing to ✅ Done in NethServer Feb 22, 2024