Improve error handling around Device Agent tunnels #2488

knolleary · 2023-07-14T11:47:11Z

Part of #2483

Description

This modifies the Device Agent tunnel handling to improve its overall error handling and resilience.

If the tunnel (websocket from the device agent) closes, it now allows the device agent to reconnect its tunnel. Previously we would discard the whole tunnel and require the user to toggle the enable/disable editor button to get it working again.
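As a rough illustration of the reconnect behaviour described above (not the code in this PR), the sketch below keeps the tunnel entry when the agent's websocket closes and only discards it when the editor is disabled. The class name `TunnelRegistry`, its methods, and the socket interface (`on('close', ...)`, `close()`) are assumptions made for the example.

```javascript
// Hypothetical sketch: names and structure are illustrative only and are
// not the real DeviceTunnelManager API.
class TunnelRegistry {
    constructor () {
        this.tunnels = new Map() // deviceId -> { socket }
    }

    // User enables the editor: create an entry the agent can attach to.
    enable (deviceId) {
        if (!this.tunnels.has(deviceId)) {
            this.tunnels.set(deviceId, { socket: null })
        }
    }

    // User disables the editor: the only path that discards the entry.
    disable (deviceId) {
        const tunnel = this.tunnels.get(deviceId)
        if (tunnel && tunnel.socket) {
            tunnel.socket.close()
        }
        this.tunnels.delete(deviceId)
    }

    // Device agent opens (or re-opens) its websocket.
    attach (deviceId, socket) {
        const tunnel = this.tunnels.get(deviceId)
        if (!tunnel) {
            return false // editor not enabled, so reject the connection
        }
        tunnel.socket = socket
        socket.on('close', () => {
            // Keep the entry so the agent can reconnect; only clear the
            // socket if a newer one has not already replaced it.
            if (tunnel.socket === socket) {
                tunnel.socket = null
            }
        })
        return true
    }

    isConnected (deviceId) {
        const tunnel = this.tunnels.get(deviceId)
        return !!(tunnel && tunnel.socket)
    }
}
```

The check inside the close handler is there so that a late close event from an old socket cannot wipe out a socket that has already reconnected.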

To provide better feedback to the user, when the Editor is 'enabled', the page polls every 5 seconds in case of any status change. If it finds the editor is enabled, but the device agent is not currently connected, it shows a 'not connected' status (and disables the Open Editor button).

In this case, the Device Agent will either reconnect of its own accord or the enable/disable button will have to be toggled to trigger the agent to reconnect.
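For illustration only, the polling and button logic could be sketched roughly as follows. The status fields `editorEnabled` and `agentConnected`, the function names, and the shape of the returned state are assumptions for the example, not the actual API response or Vue component code.

```javascript
// Hypothetical sketch: field and function names are assumptions.

// Map a device status object to what the page should display.
function deriveEditorState (status) {
    if (!status.editorEnabled) {
        return { label: 'disabled', canOpenEditor: false }
    }
    if (!status.agentConnected) {
        // Editor mode is on but the agent's tunnel is down.
        return { label: 'not connected', canOpenEditor: false }
    }
    return { label: 'enabled', canOpenEditor: true }
}

// Poll `fetchStatus` (an async function supplied by the caller) and pass
// each derived state to `onState`. Returns a function that stops polling.
function startStatusPolling (fetchStatus, onState, intervalMs = 5000) {
    let stopped = false
    let timer = null
    async function tick () {
        if (stopped) return
        try {
            const status = await fetchStatus()
            if (stopped) return
            onState(deriveEditorState(status))
        } catch (err) {
            // Keep the previous state and try again on the next tick.
        }
        if (!stopped) {
            timer = setTimeout(tick, intervalMs)
        }
    }
    tick()
    return () => {
        stopped = true
        clearTimeout(timer)
    }
}
```

A page would call `startStatusPolling` when editor mode is enabled and call the returned function when the page is left or editor mode is disabled. Scheduling the next poll only after the previous one finishes avoids overlapping requests on a slow connection.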

It also now does a better job of handling websocket disconnects from editors. Previously it did nothing if an editor closed its window - it will now send a notification to the device agent so it can close its corresponding websocket.

The same is true in reverse - the Device Agent will (once the companion PR is merged) send a closed notification over the tunnel if Node-RED closes a websocket connection. This allows the platform to close the corresponding editor websocket - which in turn ensures the editor shows the 'lost connection to server' message.
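A minimal sketch of the two-way close notification, assuming a made-up message shape of `{ id, closed: true }` sent over the tunnel and an editor socket exposing `on`, `send` and `close`. The class name `EditorSessions` and the message format are hypothetical; the real tunnel protocol may look different.

```javascript
// Hypothetical sketch: the message format and names are assumptions.
class EditorSessions {
    constructor (sendToAgent) {
        this.sendToAgent = sendToAgent // function(message) that writes to the tunnel
        this.editors = new Map() // session id -> editor websocket
    }

    // Register an editor websocket and watch for it closing.
    addEditor (id, editorSocket) {
        this.editors.set(id, editorSocket)
        editorSocket.on('close', () => {
            // Editor window closed: forget the session and notify the
            // agent so it can close its matching websocket to Node-RED.
            if (this.editors.delete(id)) {
                this.sendToAgent({ id, closed: true })
            }
        })
    }

    // Handle a message arriving from the agent over the tunnel.
    handleAgentMessage (message) {
        const editorSocket = this.editors.get(message.id)
        if (!editorSocket) return
        if (message.closed) {
            // Node-RED closed its side. Remove the session first so the
            // close handler above does not echo a notification back.
            this.editors.delete(message.id)
            editorSocket.close()
        } else {
            editorSocket.send(message.data)
        }
    }
}
```

Deleting the session before closing the editor socket is what stops the two ends from bouncing close notifications back and forth.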

I have tested these changes against the current released Device Agent (which doesn't do any of the tunnel reconnection logic). Everything works as well as it did before - which is to say, gold path is great, but stray off it and you have to toggle the enable/disable button to get reconnected. The main point is the feedback improvements about being in these iffy states.

With the new Device Agent (PR coming shortly...), everything feels much more robust and reliable. You can disable editor mode, re-enable it, and any existing editor browser will reconnect without requiring a reload, so un-deployed changes won't get lost.

There are a number of further improvements to be had which I'll write up separately but aren't needed right now to improve the overall reliability of the device editor proxy.

This PR currently has no unit tests. Because so much depends on the sequencing of interactions with the device agent, I have been doing a lot of manual testing up to this point. Raising the PR to get some eyes on it whilst evaluating what unit test coverage is possible.

Related Issue(s)

Checklist

I have read the contribution guidelines
Suitable unit/system level tests have been added and they pass
Documentation has been updated
- Upgrade instructions
- Configuration details
- Concepts
Changes flowforge.yml?
- Issue/PR raised on flowforge/helm to update ConfigMap Template
- Issue/PR raised on flowforge/CloudProject to update values for Staging/Production

Labels

Backport needed? -> add the backport label
Includes a DB migration? -> add the area:migration label

codecov · 2023-07-14T11:55:09Z

Codecov Report

Merging #2488 (d0b7ebc) into main (c2ac613) will decrease coverage by 32.38%.
The diff coverage is 34.95%.

@@             Coverage Diff             @@
##             main    #2488       +/-   ##
===========================================
- Coverage   72.32%   39.94%   -32.38%     
===========================================
  Files         224      489      +265     
  Lines        8817    17133     +8316     
  Branches     1811     3976     +2165     
===========================================
+ Hits         6377     6844      +467     
- Misses       2440    10289     +7849

Flag	Coverage Δ
backend	`74.37% <55.38%> (+2.04%)`	⬆️
frontend	`1.55% <0.00%> (?)`

Flags with carried forward coverage won't be shown.

Impacted Files	Coverage Δ
frontend/src/api/devices.js	`0.00% <0.00%> (ø)`
frontend/src/pages/device/Overview.vue	`0.00% <0.00%> (ø)`
frontend/src/pages/device/index.vue	`0.00% <0.00%> (ø)`
forge/ee/lib/deviceEditor/DeviceTunnelManager.js	`67.93% <47.05%> (+61.15%)`	⬆️
forge/ee/routes/deviceEditor/index.js	`71.53% <57.14%> (+57.53%)`	⬆️

... and 342 files with indirect coverage changes

Steve-Mcl · 2023-07-14T14:56:32Z

Steve-Mcl · 2023-07-14T15:12:40Z

Additional checks

Copying link & attempting to access remote editor without login to FF - OK - prompts for login
~~Logging in as a user from different team gives access if the Device ID is known (not using the one time token for initiating tunnel anymore?)~~
- confirmed fix in eed18a6
NR user menu "logout" doesn't do anything (might not be a new issue)
- console: POST http://192.168.86.130:3000/api/v1/devices/ELDpr3x9MP/editor/proxy/auth/revoke 415 (Unsupported Media Type)

knolleary · 2023-07-14T15:30:33Z

Test 1

Old window left open causes outage (resulting in oversized "not connected" status pill)

Well, if you widen your window a little it's fine... ;).

From the screenshot - the editor hanging on load is a known behaviour of the existing device agent when it is trying to connect with old tokens. This is one of the many scenarios I've specifically fixed in the linked device agent PR.

Test 2

disable/re-enable - deploy and inject still works but WS is lost (need to refresh browser)

Yup - this is the symptom of neither end properly telling the other end if a websocket is dropped.

NR user menu "logout" doesn't do anything

Preexisting - but part of a wider issue I'll be raising.

Steve-Mcl

Plenty of play time and scanning code. Much more resilient.

Will not deploy immediately in case you or I find a last-minute gotcha (I am still playing locally) - but feel free to merge if you beat me to it.

Improve error handling around Device Agent tunnels

6b6173b

knolleary requested a review from Steve-Mcl July 14, 2023 11:47

Add unit tests for DeviceTunnelManager

eed18a6

Prevent wrapping of not-connected status pill

d0b7ebc

Steve-Mcl approved these changes Jul 14, 2023

Steve-Mcl linked an issue Jul 14, 2023 that may be closed by this pull request

Improve reliability of Device Editor tunnel #2483

Closed

Steve-Mcl merged commit 5beb535 into main Jul 14, 2023
4 of 5 checks passed

Steve-Mcl deleted the 2483-device-editor-tunnel-reconnect branch July 14, 2023 16:05

knolleary mentioned this pull request Jul 14, 2023

Add background polling to Device page to ensure correct editor access status is shown #2471

Open

knolleary linked an issue Jul 14, 2023 that may be closed by this pull request

Improve error feedback when trying to access expired Device editor tunnel #2473

Closed

This was referenced Jul 14, 2023

Improve error feedback when trying to access expired Device editor tunnel #2473

Closed

Improve error handling around Device Agent tunnels - backport #2508

Merged

Steve-Mcl commented Jul 14, 2023 • edited

Test 1: FF PR2483 + Agent 1.9.4

Observations

Test 2: FF v1.9.0-git + Agent PR2483

Observations

Test 3: FF PR2483 + Agent PR2483
