Controller restore procedure gets stuck #1646

@nrauso

Description

The restore-from-backup procedure for the nethsecurity-controller app cannot be completed.
The process hangs indefinitely at the database import step, restore-module/30restore_database.

The restore action checks every 5 seconds for the timescale container to exist and be running, but at this stage no container has been created, so the check never succeeds.

Relevant logs:

May 06 10:15:26 rl1 agent@nethsecurity-controller1[41205]: task/module/nethsecurity-controller1/8b6433e2-5e9f-4822-b196-f879e6ea93bb: restore-module/30restore_database is starting
May 06 10:15:26 rl1 agent@nethsecurity-controller1[41205]: Error: no container with name or ID "timescale" found: no such container
May 06 10:15:31 rl1 agent@nethsecurity-controller1[41205]: Error: no container with name or ID "timescale" found: no such container
May 06 10:15:36 rl1 agent@nethsecurity-controller1[41205]: Error: no container with name or ID "timescale" found: no such container
May 06 10:15:41 rl1 agent@nethsecurity-controller1[41205]: Error: no container with name or ID "timescale" found: no such container
May 06 10:15:46 rl1 agent@nethsecurity-controller1[41205]: Error: no container with name or ID "timescale" found: no such container

At this point the timescale container has not been started, and the module has no containers at all:

~]# runagent -m nethsecurity-controller1 podman ps -a
CONTAINER ID  IMAGE       COMMAND     CREATED     STATUS      PORTS       NAMES

The restore action remains indefinitely blocked:

~]# runagent -m nethsecurity-controller1 ps xf
    PID TTY      STAT   TIME COMMAND
...
  41205 ?        Ssl    0:00  \_ /usr/local/bin/agent --agentid=module/nethsecurity-controller1 --actionsdir=/usr/local/agent/actions --actionsdir=/home/ne
  42839 ?        S      0:00  |   \_ /bin/sh /home/nethsecurity-controller1/.config/actions/restore-module/30restore_database
  43278 ?        S      0:00  |       \_ sleep 5
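
For illustration only, a bounded wait would at least make the step fail instead of hanging. The sketch below is not the module's actual 30restore_database script. The helper name wait_until, its arguments, and the timeout values are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper, not taken from the nethsecurity-controller module.
# Usage: wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it succeeds; gives up once TIMEOUT_SECONDS have elapsed.
wait_until() {
    timeout=$1
    interval=$2
    shift 2
    elapsed=0
    until "$@" >/dev/null 2>&1; do
        if [ "$elapsed" -ge "$timeout" ]; then
            # Report the failure on stderr so it shows up in the agent log.
            echo "timed out after ${elapsed}s waiting for: $*" >&2
            return 1
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    return 0
}
```

Assuming the step only needs the container to exist, it could be called as `wait_until 120 5 podman container exists timescale || exit 1`, so the restore task ends with an error after two minutes rather than looping on `sleep 5` forever. This does not address the root cause, which is that nothing starts the timescale container before the import step runs.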

A temporary manual workaround is possible, but it requires several manual steps:

  • Kill the blocked 30restore_database action
  • Temporarily modify the timescale systemd unit so that it does not start its dependencies, then start it manually
  • Re-run the 30restore_database action manually
  • Stop timescale and restore the default timescale systemd unit
  • Run configure-module to put the restored controller into production

This workaround allows the restore to complete, but it should not be required during a normal restore procedure.

Components

  • nethsecurity-controller:2.2.3

Metadata

Assignees: No one assigned

Labels: verified (All test cases were verified successfully)

Status: Done ✅
