Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve error handling around Device Agent tunnels #2488

Merged
merged 3 commits into from Jul 14, 2023

Conversation

knolleary
Copy link
Member

@knolleary knolleary commented Jul 14, 2023

Part of #2483

Description

This modifies the Device Agent tunnel handling to improve its overall error handling and resiliance.

If the tunnel (websocket from the device agent) closes, it now allows the device agent to reconnect its tunnel. Previously we would discard the whole tunnel and require the user to toggle the enable/disable editor button to get it working again.

To provide better feedback to the user, when the Editor is 'enabled', the page polls every 5 seconds in case of any status change. If it finds the editor is enabled, but the device agent is not currently connected, it shows the following (and disables the Open Editor button).

image

In this case, the Device Agent will either reconnect of its own accord or the enable/disable button will have to be toggled to trigger the agent to reconnect.


It also now does a better job of handling websocket disconnects from editors. Previously it did nothing if an editor closed its window - it will now send a notification to the device agent so it can close its corresponding websocket.

The same is true in reverse - the Device Agent will (once the companion PR is merged) send a closed notification over the tunnel if Node-RED closes a websocket connection. This allows the platform to close the corresponding editor websocket - which in turn ensures the editor shows the 'lost connection to server' message.


I have tested these changes against the current released Device Agent (which doesn't do any of the tunnel reconnection logic). Everything works as well as it did before - which is to say, gold path is great, but stray off it and you have toggle the enable/disable button to get reconnected. The main point is the feedback improvements about being in these iffy states.

When the new Device Agent (PR coming shortly...), everything feels much more robust and reliable. You can disable editor mode, reenable it, and any existing editor browser will reconnect without requiring a reload, so un-deployed changes won't get lost.

There are a number of further improvements to be had which I'll write up separately but aren't needed right now to improve the overall reliability of the device editor proxy.


This PR is currently absent of unit tests. Because so much is based on the interactions of sequencing of the device agent, I have been doing a lot of manual testing up to this point. Raising the PR to get some eyes on it whilst evaluating what unit test coverage is possible.

Related Issue(s)

Checklist

  • I have read the contribution guidelines
  • Suitable unit/system level tests have been added and they pass
  • Documentation has been updated
    • Upgrade instructions
    • Configuration details
    • Concepts
  • Changes flowforge.yml?
    • Issue/PR raised on flowforge/helm to update ConfigMap Template
    • Issue/PR raised on flowforge/CloudProject to update values for Staging/Production

Labels

  • Backport needed? -> add the backport label
  • Includes a DB migration? -> add the area:migration label

@knolleary knolleary requested a review from Steve-Mcl July 14, 2023 11:47
@codecov
Copy link

codecov bot commented Jul 14, 2023

Codecov Report

Merging #2488 (d0b7ebc) into main (c2ac613) will decrease coverage by 32.38%.
The diff coverage is 34.95%.

@@             Coverage Diff             @@
##             main    #2488       +/-   ##
===========================================
- Coverage   72.32%   39.94%   -32.38%     
===========================================
  Files         224      489      +265     
  Lines        8817    17133     +8316     
  Branches     1811     3976     +2165     
===========================================
+ Hits         6377     6844      +467     
- Misses       2440    10289     +7849     
Flag Coverage Δ
backend 74.37% <55.38%> (+2.04%) ⬆️
frontend 1.55% <0.00%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
frontend/src/api/devices.js 0.00% <0.00%> (ø)
frontend/src/pages/device/Overview.vue 0.00% <0.00%> (ø)
frontend/src/pages/device/index.vue 0.00% <0.00%> (ø)
forge/ee/lib/deviceEditor/DeviceTunnelManager.js 67.93% <47.05%> (+61.15%) ⬆️
forge/ee/routes/deviceEditor/index.js 71.53% <57.14%> (+57.53%) ⬆️

... and 342 files with indirect coverage changes

@Steve-Mcl
Copy link
Contributor

Steve-Mcl commented Jul 14, 2023

Local testing with various arrangements of FF and agent.

Short version, apart from the oversized status pill. running branch 2483 of FF and agent were stable and worked well.

Test 1: FF PR2483 + Agent 1.9.4

  • Existing Agent starts up and connects to platform
  • Developer mode engages correctly
  • Tunnel connects correctly
  • Tunnel closes correctly
  • Developer mode disabled correctly
  • Tunnel resilience
    • sleep recovery
    • multiple connections
    • old/stale connection

Observations

sleep proved fine on windows, (15m sleep)
Old window left open causes outage (resulting in oversized "not connected" status pill)

Test 2: FF v1.9.0-git + Agent PR2483

  • Upgraded Agent (original config) starts up and connects to platform
  • Developer mode engages correctly
  • Tunnel connects correctly
  • Tunnel closes correctly
  • Developer mode disabled correctly
  • Tunnel resilience
    • sleep recovery
    • multiple connections
    • old/stale connection

Observations

  • disable/re-enable - deploy and inject still works but WS is lost (need to refresh browser)
  • issues with multiple editors - 2nd editor deploy shows toast "Deploy failed: no response from server & console" and console gets :3000/api/v1/devices/ELDpr3x9MP/editor/proxy/flows:1 Failed to load resource: the server responded with a status of 404 (Not Found) NOTE: This may be a NR 3.0.2 issue or my own understanding. Either way, it is an observation of the mismatched versions of FF and Agent + an off-standard operation. Noted only as a prompt to follow up on NR.
      1. open 2 browsers & connect remote device
    • browser 1 - edit and deploy
    • browser 2 - ignore, edit & deploy

Test 3: FF PR2483 + Agent PR2483

  • Agent starts up and connects to upgraded platform
  • Developer mode engages correctly
  • Tunnel connects correctly
  • Tunnel closes correctly
  • Developer mode disabled correctly
  • Tunnel resilience
    • sleep recovery
    • multiple connections
    • old/stale connection

image

@Steve-Mcl
Copy link
Contributor

Steve-Mcl commented Jul 14, 2023

Additional checks

  • Copying link & attempting to access remote editor without login to FF - OK - prompts for login
  • Logging in as a user from different team gives access if the Device ID is known (not using the one time token for initiating tunnel anymore?)
  • NR user menu "logout" doesnt do anything (might not be a new issue)
    • console: POST http://192.168.86.130:3000/api/v1/devices/ELDpr3x9MP/editor/proxy/auth/revoke 415 (Unsupported Media Type)

@knolleary
Copy link
Member Author

Test 1

Old window left open causes outage (resulting in oversized "not connected" status pill)

Well, if you widen your window a little its fine... ;).

From the screenshot - the editor hanging on load is a known behaviour of the existing device agent when it is trying to connect with old tokens. This is one of the many scenarios I've specifically fixed in the linked device agent PR

Test 2

disable/re-enable - deploy and inject still works but WS is lost (need to refresh browser)

Yup - this is the symptom of neither end properly telling the other end if a websock is dropped.

NR user menu "logout" doesnt do anything (

Preexisting - but part of a wider issue I'll be raising.

Copy link
Contributor

@Steve-Mcl Steve-Mcl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Plenty of play time and scanning code. Much more resilient.

Will not deploy immediately in case you or I find a last a last minute gotcha (I am still playing locally) - but feel free to merge if you beat me to it.

@Steve-Mcl Steve-Mcl linked an issue Jul 14, 2023 that may be closed by this pull request
@Steve-Mcl Steve-Mcl merged commit 5beb535 into main Jul 14, 2023
4 of 5 checks passed
@Steve-Mcl Steve-Mcl deleted the 2483-device-editor-tunnel-reconnect branch July 14, 2023 16:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants