Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix device running or reporting old snapshot id/name #2575

Merged
merged 5 commits into from
Aug 11, 2023
Merged

Conversation

Steve-Mcl
Copy link
Contributor

@Steve-Mcl Steve-Mcl commented Aug 9, 2023

Description

This PR addresses several issues found while debugging the reported problem.

NOTE: None of the work here involved messing with the device files. The state of the device and its files were all a result of actions taken in the forge application. I feel this is important to point out because as it stands, it is possible to get the device to crash and into odd states.

The issues encountered...

problem 1 - device running snapshot of deleted instance

Device active/target snapshot get populated in checkin code
device reports it is on old snapshot & that matches
checkin code did not check that the activeSnapshots project matched the device current project

problem 2 - snapshot table is not cleaned - holds snapshots from deleted instances

When device checks in, its reported snapshot is found and the device objects active/target snapshot are populated meaning the device passes its checkin & does not instruct an update.

NOTE: No work was done to delete the old snapshots, only to mitigate the problem. A separate issue will be raised detailing thoughts and care points around this.

problem 3

Even if we force a DeviceUpdate, it just passes the exact same settings due to the snapshot table having old entries

problem 4

Once we do get the device to clean up, switching between dev/autonomous mode caused a crash due to missing files

[AGENT] 09/08/2023 09:47:27 [info] Cleaning instance directory
[AGENT] 09/08/2023 09:47:50 [info] Enabling developer mode
[AGENT] 09/08/2023 09:47:51 [info] Enabling remote editor access
[AGENT] 09/08/2023 09:47:51 [info] No running Node-RED instance, not starting editor
[AGENT] 09/08/2023 09:47:56 [info] Enabling remote editor access
[AGENT] 09/08/2023 09:47:56 [info] No running Node-RED instance, not starting editor
[Error: ENOENT: no such file or directory, open 'C:\opt\flowforge-device\project\flows.json'] {
  errno: -4058,
  code: 'ENOENT',
  syscall: 'open',
  path: 'C:\\opt\\flowforge-device\\project\\flows.json'
}

problem 5

when the agent determines a new snapshot is available, we dont instruct it to grab new settings.
Since some env vars are computed by the platform (e.g. FF_SNAPSHOT_ID/NAME) we must also instruct
the agent to refresh its settings.

The work in this PR

1: ensure device is not using orphaned snapshot

When checking the status of the device, grab the full row from ProjectSnapshots and verify the associated project is both present and matches the device. If not, the snapshot is considered to be an orphan or mismatched and a device update is requested.

2: ensure platform doesnt send ophaned snapshot > dev

When sending an update command to device, check the targetSnapshot has a project AND it matches the projectId that owns the device. If this fails, set the snapshot id in the payload update to null. This will signal the device to clear its current snapshot.

This can occur when an instance (with a member device and target snapshot) is deleted. The ProjectId field is cleared but the snapshot remains in the database.

TODO Unit Tests

Adds tests to test/unit/forge/db/controllers/Device_spec.js

  • Device controller
    • sendDeviceUpdateCommand
      • sends update command to device
      • does not send orphaned snapshot in update command to device

Manual Tests

  1. delete instance - expect device to clear its config
  • On next poll, agent reports correct change 👍
[AGENT] 09/08/2023 15:17:39 [info] Stopping Node-RED
[AGENT] 09/08/2023 15:17:39 [info] Stopped Node-RED
[AGENT] 09/08/2023 15:17:39 [info] Cleaning instance directory
  • Device status on forge app shows Application and Instace Unassigned 👍
  • Last Known Status shows stopped 👍
  1. delete active snapshot - expect device to clear its config
  • On next poll, agent reports correct change 👍
[AGENT] 09/08/2023 15:13:52 [info] Stopping Node-RED
[AGENT] 09/08/2023 15:13:52 [info] Stopped Node-RED
[AGENT] 09/08/2023 15:13:52 [info] Cleaning instance directory

Related Issue(s)

FlowFuse/device-agent#132

Checklist

  • I have read the contribution guidelines
  • Suitable unit/system level tests have been added and they pass
  • Documentation has been updated
    • Upgrade instructions
    • Configuration details
    • Concepts
  • Changes flowforge.yml?
    • Issue/PR raised on flowforge/helm to update ConfigMap Template
    • Issue/PR raised on flowforge/CloudProject to update values for Staging/Production

Labels

  • Backport needed? -> add the backport label
  • Includes a DB migration? -> add the area:migration label

@codecov
Copy link

codecov bot commented Aug 9, 2023

Codecov Report

Merging #2575 (3fbb84f) into main (0c3a764) will increase coverage by 73.09%.
Report is 228 commits behind head on main.
The diff coverage is 56.25%.

@@            Coverage Diff             @@
##            main    #2575       +/-   ##
==========================================
+ Coverage   1.57%   74.67%   +73.09%     
==========================================
  Files        498      228      -270     
  Lines      17865     9231     -8634     
  Branches    4182     1921     -2261     
==========================================
+ Hits         282     6893     +6611     
+ Misses     17583     2338    -15245     
Flag Coverage Δ
backend 74.67% <56.25%> (?)
frontend ?

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Changed Coverage Δ
forge/comms/devices.js 34.64% <0.00%> (+34.64%) ⬆️
forge/db/controllers/Device.js 84.00% <100.00%> (+84.00%) ⬆️
forge/routes/api/device.js 78.64% <100.00%> (+78.64%) ⬆️

... and 444 files with indirect coverage changes

@Steve-Mcl Steve-Mcl linked an issue Aug 9, 2023 that may be closed by this pull request
@Steve-Mcl
Copy link
Contributor Author

@hardillb ready for review now tests are in place (passing locally)

I recommend testing against device running on release code and with device running under PR 157

@Steve-Mcl Steve-Mcl marked this pull request as ready for review August 10, 2023 13:17
@hardillb
Copy link
Contributor

Before I merge these do we want to add the backport label so we can include it in next weeks point release?

@hardillb
Copy link
Contributor

@Steve-Mcl ^

@Steve-Mcl
Copy link
Contributor Author

Before I merge these do we want to add the backport label so we can include it in next weeks point release?

Happy either way Ben.

Is this a good candidate for next weeks point release?

@hardillb
Copy link
Contributor

I think the same comments as the device agent apply, this was something only I had hit so far, so no great rush...

I'll add the flag, but not merge the backport PR

@hardillb hardillb merged commit 52fc3e4 into main Aug 11, 2023
5 checks passed
@hardillb hardillb deleted the 132-old-snapshots branch August 11, 2023 10:35
knolleary added a commit that referenced this pull request Aug 17, 2023
Fix device running or reporting old snapshot id/name (backport #2575)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Device running old snapshot
2 participants