Fix device running or reporting old snapshot id/name (backport #2575) #2580

github-actions · 2023-08-11T10:35:34Z

Backport of #2575

Description

This PR addresses several issues found while debugging the reported problem.

NOTE: None of the work here involved messing with the device files. The state of the device and its files were all a result of actions taken in the forge application. I feel this is important to point out because as it stands, it is possible to get the device to crash and into odd states.

The issues encountered...

problem 1 - device running snapshot of deleted instance

Device active/target snapshot get populated in checkin code
device reports it is on old snapshot & that matches
checkin code did not check that the activeSnapshots project matched the device current project

problem 2 - snapshot table is not cleaned - holds snapshots from deleted instances

When device checks in, its reported snapshot is found and the device objects active/target snapshot are populated meaning the device passes its checkin & does not instruct an update.

NOTE: No work was done to delete the old snapshots, only to mitigate the problem. A separate issue will be raised detailing thoughts and care points around this.

problem 3

Even if we force a DeviceUpdate, it just passes the exact same settings due to the snapshot table having old entries

problem 4

Once we do get the device to clean up, switching between dev/autonomous mode caused a crash due to missing files

[AGENT] 09/08/2023 09:47:27 [info] Cleaning instance directory
[AGENT] 09/08/2023 09:47:50 [info] Enabling developer mode
[AGENT] 09/08/2023 09:47:51 [info] Enabling remote editor access
[AGENT] 09/08/2023 09:47:51 [info] No running Node-RED instance, not starting editor
[AGENT] 09/08/2023 09:47:56 [info] Enabling remote editor access
[AGENT] 09/08/2023 09:47:56 [info] No running Node-RED instance, not starting editor
[Error: ENOENT: no such file or directory, open 'C:\\opt\\flowforge-device\\project\\flows.json'] {
 errno: -4058,
 code: 'ENOENT',
 syscall: 'open',
 path: 'C:\\\\opt\\\\flowforge-device\\\\project\\\\flows.json'
}

problem 5

when the agent determines a new snapshot is available, we dont instruct it to grab new settings.
Since some env vars are computed by the platform (e.g. FF_SNAPSHOT_ID/NAME) we must also instruct
the agent to refresh its settings.

The work in this PR

1: ensure device is not using orphaned snapshot

When checking the status of the device, grab the full row from ProjectSnapshots and verify the associated project is both present and matches the device. If not, the snapshot is considered to be an orphan or mismatched and a device update is requested.

2: ensure platform doesnt send ophaned snapshot > dev

When sending an update command to device, check the targetSnapshot has a project AND it matches the projectId that owns the device. If this fails, set the snapshot id in the payload update to null. This will signal the device to clear its current snapshot.

This can occur when an instance (with a member device and target snapshot) is deleted. The ProjectId field is cleared but the snapshot remains in the database.

TODO Unit Tests

Adds tests to test/unit/forge/db/controllers/Device_spec.js

Device controller
sendDeviceUpdateCommand
sends update command to device
does not send orphaned snapshot in update command to device

Manual Tests

delete instance - expect device to clear its config

On next poll, agent reports correct change 👍

[AGENT] 09/08/2023 15:17:39 [info] Stopping Node-RED
[AGENT] 09/08/2023 15:17:39 [info] Stopped Node-RED
[AGENT] 09/08/2023 15:17:39 [info] Cleaning instance directory

Device status on forge app shows Application and Instace Unassigned 👍
Last Known Status shows stopped 👍

delete active snapshot - expect device to clear its config

On next poll, agent reports correct change 👍

[AGENT] 09/08/2023 15:13:52 [info] Stopping Node-RED
[AGENT] 09/08/2023 15:13:52 [info] Stopped Node-RED
[AGENT] 09/08/2023 15:13:52 [info] Cleaning instance directory

Related Issue(s)

FlowFuse/device-agent#132

Checklist

I have read the contribution guidelines
Suitable unit/system level tests have been added and they pass
Documentation has been updated
Upgrade instructions
Configuration details
Concepts
Changes flowforge.yml?
Issue/PR raised on flowforge/helm to update ConfigMap Template
Issue/PR raised on flowforge/CloudProject to update values for Staging/Production

Labels

Backport needed? -> add the backport label
Includes a DB migration? -> add the area:migration label

Steve-Mcl and others added 5 commits August 11, 2023 10:35

ensure device is not using orphaned snapshot

15f444b

ensure platform doesnt send ophaned snapshot > dev

87025da

aff tests for sendDeviceUpdateCommand

49044a0

move new tests to own describe block

4c4161c

remove .only from test

804a0ae

knolleary approved these changes Aug 17, 2023

View reviewed changes

knolleary merged commit d6ca76d into maintenance Aug 17, 2023

knolleary deleted the backport-2575 branch August 17, 2023 10:04

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix device running or reporting old snapshot id/name (backport #2575) #2580

Fix device running or reporting old snapshot id/name (backport #2575) #2580

github-actions bot commented Aug 11, 2023

Fix device running or reporting old snapshot id/name (backport #2575) #2580

Fix device running or reporting old snapshot id/name (backport #2575) #2580

Conversation

github-actions bot commented Aug 11, 2023

Description

problem 1 - device running snapshot of deleted instance

problem 2 - snapshot table is not cleaned - holds snapshots from deleted instances

problem 3

problem 4

problem 5

The work in this PR

1: ensure device is not using orphaned snapshot

2: ensure platform doesnt send ophaned snapshot > dev

TODO Unit Tests

Manual Tests

Related Issue(s)

Checklist

Labels