Improve error handling around Device Agent tunnels #2488
Codecov Report
@@ Coverage Diff @@
## main #2488 +/- ##
===========================================
- Coverage 72.32% 39.94% -32.38%
===========================================
Files 224 489 +265
Lines 8817 17133 +8316
Branches 1811 3976 +2165
===========================================
+ Hits 6377 6844 +467
- Misses 2440 10289 +7849
Local testing with various arrangements of FF and agent. Short version: apart from the oversized status pill, branch 2483 of FF and the agent were stable and worked well.
Test 1: FF PR2483 + Agent 1.9.4
Observations: sleep proved fine on Windows (15m sleep).
Test 2: FF v1.9.0-git + Agent PR2483
Observations
Test 3: FF PR2483 + Agent PR2483
Additional checks
Test 1
Well, if you widen your window a little it's fine... ;). From the screenshot - the editor hanging on load is a known behaviour of the existing device agent when it is trying to connect with old tokens. This is one of the many scenarios I've specifically fixed in the linked device agent PR.
Test 2
Yup - this is the symptom of neither end properly telling the other end when a websocket is dropped.
Preexisting - but part of a wider issue I'll be raising.
Plenty of play time and scanning code. Much more resilient.
Will not deploy immediately in case you or I find a last-minute gotcha (I am still playing locally) - but feel free to merge if you beat me to it.
Part of #2483
Description
This modifies the Device Agent tunnel handling to improve its overall error handling and resilience.
If the tunnel (websocket from the device agent) closes, it now allows the device agent to reconnect its tunnel. Previously we would discard the whole tunnel and require the user to toggle the enable/disable editor button to get it working again.
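The reconnect behaviour described above could be sketched roughly as follows. This is a minimal illustration only: `TunnelRegistry` and all of its method names are hypothetical and are not taken from the FlowForge code. The point it shows is that a closed agent websocket clears only the socket reference, so the tunnel record survives and the agent can attach again without the user toggling anything.

```javascript
// Hypothetical sketch - names and shapes are illustrative, not FlowForge's actual code.
class TunnelRegistry {
    constructor () {
        this.tunnels = new Map() // deviceId -> { socket }
    }

    // Editor enabled by the user: create the tunnel record with no socket yet.
    enable (deviceId) {
        this.tunnels.set(deviceId, { socket: null })
    }

    // Editor disabled by the user: close any socket and discard the tunnel.
    disable (deviceId) {
        const tunnel = this.tunnels.get(deviceId)
        if (tunnel && tunnel.socket) tunnel.socket.close()
        this.tunnels.delete(deviceId)
    }

    // Device agent connects (or reconnects) its websocket.
    attach (deviceId, socket) {
        const tunnel = this.tunnels.get(deviceId)
        if (!tunnel) return false // editor not enabled, so reject the connection
        if (tunnel.socket && tunnel.socket !== socket) tunnel.socket.close() // drop a stale socket
        tunnel.socket = socket
        return true
    }

    // Agent websocket closed: keep the tunnel record so the agent can reconnect.
    onSocketClose (deviceId, socket) {
        const tunnel = this.tunnels.get(deviceId)
        if (tunnel && tunnel.socket === socket) tunnel.socket = null
    }

    status (deviceId) {
        const tunnel = this.tunnels.get(deviceId)
        return { enabled: !!tunnel, connected: !!(tunnel && tunnel.socket) }
    }
}
```

The identity check in `onSocketClose` means a late close event from a replaced socket does not wipe out a newer connection.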
To provide better feedback to the user, when the Editor is 'enabled', the page polls every 5 seconds in case of any status change. If it finds the editor is enabled, but the device agent is not currently connected, it shows the following (and disables the Open Editor button).
In this case, the Device Agent will either reconnect of its own accord or the enable/disable button will have to be toggled to trigger the agent to reconnect.
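As a rough illustration of the polling described above, the sketch below maps a status response onto what the page shows. The status shape (`enabled`, `connected`) and the names `editorUiState`, `startStatusPolling`, `fetchStatus` and `render` are assumptions made for this example, not the actual frontend code; only the 5-second interval comes from the description.

```javascript
// Hypothetical sketch - the status shape and function names are assumptions.
const POLL_INTERVAL_MS = 5000 // the 5-second interval mentioned in the description

// Decide what the page should show for a given status response.
function editorUiState (status) {
    const enabled = !!status.enabled
    const connected = !!status.connected
    return {
        // Warn when the editor is enabled but the agent has no tunnel connected.
        agentDisconnected: enabled && !connected,
        // Only allow opening the editor when there is a live tunnel.
        canOpenEditor: enabled && connected
    }
}

// Start polling while the editor is enabled; returns a function that stops it.
function startStatusPolling (fetchStatus, render, intervalMs = POLL_INTERVAL_MS) {
    const timer = setInterval(async () => {
        try {
            render(editorUiState(await fetchStatus()))
        } catch (err) {
            // A failed poll leaves the previous state on screen; the next tick retries.
        }
    }, intervalMs)
    return () => clearInterval(timer)
}
```

Keeping the state mapping separate from the timer makes the decision logic easy to cover with unit tests, independent of any websocket sequencing.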
It also now does a better job of handling websocket disconnects from editors. Previously it did nothing if an editor closed its window - it will now send a notification to the device agent so it can close its corresponding websocket.
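A minimal sketch of that editor-side handling might look like the following. The `EditorSessions` class, the `sendToAgent` callback and the `{ id, closed: true }` message shape are all assumptions for illustration; the actual message format used over the tunnel is not shown in this PR description.

```javascript
// Hypothetical sketch - class name, callback and message shape are assumptions.
class EditorSessions {
    constructor (sendToAgent) {
        this.sendToAgent = sendToAgent // function that sends one message object over the tunnel
        this.editors = new Map() // sessionId -> editor websocket
    }

    add (sessionId, editorSocket) {
        this.editors.set(sessionId, editorSocket)
    }

    // Editor websocket closed (e.g. the browser window was closed):
    // forget the session and tell the agent to close its matching websocket.
    onEditorClose (sessionId) {
        if (!this.editors.delete(sessionId)) return // unknown or already handled
        this.sendToAgent({ id: sessionId, closed: true })
    }
}
```

Guarding on the result of `delete` ensures a duplicate close event produces only one notification.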
The same is true in reverse - the Device Agent will (once the companion PR is merged) send a `closed` notification over the tunnel if Node-RED closes a websocket connection. This allows the platform to close the corresponding editor websocket - which in turn ensures the editor shows the 'lost connection to server' message.
I have tested these changes against the current released Device Agent (which doesn't do any of the tunnel reconnection logic). Everything works as well as it did before - which is to say, the golden path is great, but stray off it and you have to toggle the enable/disable button to get reconnected. The main point is the improved feedback about being in these iffy states.
With the new Device Agent (PR coming shortly...), everything feels much more robust and reliable. You can disable editor mode, re-enable it, and any existing editor browser will reconnect without requiring a reload, so un-deployed changes won't get lost.
There are a number of further improvements to be had which I'll write up separately but aren't needed right now to improve the overall reliability of the device editor proxy.
This PR currently has no unit tests. Because so much depends on the sequencing of interactions with the device agent, I have been doing a lot of manual testing up to this point. Raising the PR to get some eyes on it whilst evaluating what unit test coverage is possible.
Related Issue(s)
Checklist
flowforge.yml?
flowforge/helm to update ConfigMap Template
flowforge/CloudProject to update values for Staging/Production
Labels
backport label
area:migration label