
Automatic failover for Node-RED Instances #1920

Open
MarianRaphael opened this issue Apr 5, 2023 · 3 comments

Labels
feature-request New feature or request that needs to be turned into Epic/Story details size:XXL - 13 Sizing estimation point

Comments

@MarianRaphael
Contributor

MarianRaphael commented Apr 5, 2023

Description

Implement a robust automatic failover mechanism for Node-RED instances that focuses solely on high availability without considering scalability. This feature will monitor the active Node-RED instance and seamlessly switch to a hot-spare instance if the primary instance fails or becomes unresponsive, thus ensuring reliability without the added complexity of load balancing and scaling.

Related Epic

#1678

Assumption

Automatic failover without scaling is assumed to be easier to implement than a complete high availability solution with scaling, as it omits the complexities associated with load balancing, state management, and other challenges tied to scaling.

Motivation

As a customer of FlowForge,
I would like to have the option to utilize high availability instances.
This allows me to run business-critical processes within Node-RED and ensure that they are always available.

Key considerations

  • Heartbeat mechanism: Introduce a heartbeat system between the primary and hot-spare instances to ensure both instances can respond to requests. This could involve periodic 'ping' messages or other methods to track each instance's status (a rough sketch follows this list).
  • Failover decision-making: The hot-spare instance should autonomously determine whether to become the active instance, based on the monitored status of the primary instance. This decision-making process must be efficient, reliable, and safe to prevent having two active instances simultaneously.
  • Safe failover: Design the failover process to avoid potential conflicts or issues that could arise when transitioning from the primary to the hot-spare instance. This may involve synchronization, locking, or other techniques to ensure a smooth handover.
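
As an illustration only, here is a minimal sketch of how the hot-spare side of such a heartbeat check could look. The interval, the miss threshold and the `becomeActive` hook are hypothetical, not part of any existing FlowForge component:

```js
// Hypothetical hot-spare watchdog: count missed heartbeats from the primary
// and only take over after several consecutive misses, to avoid flapping.
const HEARTBEAT_INTERVAL_MS = 1000;   // assumed heartbeat period
const MISSES_BEFORE_FAILOVER = 5;     // assumed threshold

let missedHeartbeats = 0;
let active = false;

// Call this whenever a heartbeat ('ping') arrives from the primary instance.
function onHeartbeat() {
    missedHeartbeats = 0;
}

// Periodically check whether the primary has gone silent.
setInterval(() => {
    missedHeartbeats++;
    if (!active && missedHeartbeats >= MISSES_BEFORE_FAILOVER) {
        // Failover decision: promote the hot spare. A real implementation
        // would also need a fencing/locking step so the old primary cannot
        // remain active at the same time.
        active = true;
        becomeActive();
    }
}, HEARTBEAT_INTERVAL_MS);

// Hypothetical hook that would start the local Node-RED flows.
function becomeActive() {
    console.log('Primary unresponsive - promoting hot spare to active');
}
```

The point of the threshold is that failover requires several consecutive misses rather than one, so a single dropped message does not leave both instances believing they are active.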
@MarianRaphael MarianRaphael added feature-request New feature or request that needs to be turned into Epic/Story details needs-triage Needs looking at to decide what to do labels Apr 5, 2023
@MarianRaphael MarianRaphael added this to the 1.7 milestone Apr 11, 2023
@MarianRaphael
Contributor Author

If the assumption is incorrect, the issue can be closed.

@knolleary
Member

knolleary commented Apr 27, 2023

@hardillb and I have been discussing how we will approach this at length to start building an implementation plan.

Our working notes are in https://www.figma.com/file/upA7oHb9seloP74kTMyegN/FlowForge-High-Availability-Design-notes?node-id=0%3A1&t=bZVnNEZIeQBpxH12-1

There is a lot of technical work required and it isn't something you can do half of, but we are starting to identify the steps needed to work towards it.

The key criterion is whether we can build a failover system that operates faster than it takes k8s to restart a crashed pod.

Current Architecture

(diagram of the current architecture)

HA Architecture

Lots more detail at the figma link above... copying here for reference

(diagram of the proposed HA architecture)

Key points

  • Move to using StatefulSets rather than Deployments. This enables us to deploy two instances of node-red, but each can have a distinct configuration
  • Introduce an HA Controller (naming TBD) component. This becomes what the platform communicates with, rather than talking to the launcher directly. (How/where this gets deployed is TBD.)
  • Each launcher instance connects back to the HA Controller with a websocket - allowing for bi-directional push notification.
  • The launcher starts Node-RED with a CLI flag to start in stopped state (this needs adding upstream) - this gets it into 'ready' state.
  • The HA Controller decides which instance should be 'active' and which should be 'inactive'. If a state change is needed, it will tell the active instance to stop its flows and become inactive, tell the inactive instance to start its flows and become active, and update the ingress controller's service with the pod label of the new active instance.
  • The launchers monitor their local Node-RED instance and send regular heartbeat pings back to the HA Controller (a rough controller-side sketch follows this list).
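
To make the shape of that concrete, here is a minimal controller-side sketch using the `ws` package. The port, the message format and the `updateIngressSelector` helper are assumptions for illustration, not the planned implementation:

```js
// Hypothetical HA Controller: each launcher connects back with a websocket
// and sends heartbeats; the controller picks exactly one 'active' instance
// and promotes the other if the active one stops responding.
const { WebSocketServer } = require('ws');

const launchers = new Map();   // launcherId -> { socket, lastSeen }
let activeId = null;

const wss = new WebSocketServer({ port: 8080 }); // port is an assumption

wss.on('connection', (socket) => {
    socket.on('message', (raw) => {
        const msg = JSON.parse(raw);
        if (msg.type === 'register' || msg.type === 'heartbeat') {
            launchers.set(msg.launcherId, { socket, lastSeen: Date.now() });
            if (!activeId) {
                promote(msg.launcherId);
            }
        }
    });
});

// Promote one launcher to active: stop the old one, start the new one,
// then repoint the ingress at the new pod.
function promote(launcherId) {
    if (activeId && launchers.has(activeId)) {
        launchers.get(activeId).socket.send(JSON.stringify({ command: 'stop-flows' }));
    }
    activeId = launcherId;
    launchers.get(launcherId).socket.send(JSON.stringify({ command: 'start-flows' }));
    updateIngressSelector(launcherId);
}

// Fail over if the active instance's heartbeats stop arriving.
setInterval(() => {
    const entry = activeId && launchers.get(activeId);
    if (entry && Date.now() - entry.lastSeen > 5000) {
        const fallback = [...launchers.keys()].find((id) => id !== activeId);
        if (fallback) promote(fallback);
    }
}, 1000);

// Placeholder: a real implementation would patch the k8s Service/Ingress
// so traffic selects the pod label of the new active instance.
function updateIngressSelector(launcherId) {
    console.log(`Would point the ingress service at the pod for ${launcherId}`);
}
```

Keeping the ingress update in one place is what makes the handover safe: only the controller ever changes which pod receives traffic, so the two Node-RED instances never race each other.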

Restarting Node-RED

If the platform asks the HA Controller to restart Node-RED (could be updating settings or a staged deployment rollout), the HA Controller will notify the inactive instance to restart first - once it is ready, HA Controller will trigger a failover so the newly updated inactive instance becomes the active instance. It will then tell the newly inactive instance to restart. This will minimise the downtime of rolling out new flows.
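
A minimal sketch of that ordering, assuming hypothetical `currentState`, `restartInstance`, `waitForReady` and `failover` helpers on the controller (none of these exist yet):

```js
// Hypothetical rolling-restart ordering: restart the inactive instance first,
// fail over to it once it is ready, then restart the old active instance.
async function rollingRestart(controller) {
    const { activeId, inactiveId } = controller.currentState();

    // 1. Restart the inactive instance so it picks up the new settings/flows.
    await controller.restartInstance(inactiveId);
    await controller.waitForReady(inactiveId);

    // 2. Fail over: the freshly restarted instance becomes the active one.
    await controller.failover({ from: activeId, to: inactiveId });

    // 3. Restart the newly inactive instance in the background.
    await controller.restartInstance(activeId);
}
```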

There are a few different scenarios like this - some already documented in the figma doc.

Tasks

Two immediate tasks have been identified that can get underway now:

  • Detect hung Node-RED flows and force restart nr-launcher#110 - Update the launcher to monitor for hung flows - e.g. if a `while(true){}` loop is deployed in a function node, the instance's event loop will be stuck. The launcher can monitor the responsiveness of the instance and trigger a restart if needed (a rough watchdog sketch follows this list). Even without the HA work, this will improve the resilience and recovery time for this particular failure mode.
  • Move to using k8s StatefulSets when deploying instances driver-k8s#82 - Update the k8s driver to use StatefulSets rather than Deployments. We had only just migrated to Deployments from bare pods, but some of the work done to manage that migration should help with this update.
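
For the first task, a minimal watchdog sketch - the endpoint, timeout, failure threshold and `restartNodeRed` hook are all assumptions, not the nr-launcher implementation:

```js
// Hypothetical hung-flow watchdog in the launcher: periodically hit the local
// Node-RED HTTP endpoint and force a restart if it stops answering. A
// while(true){} in a function node blocks the event loop, so these requests
// will time out even though the process is still running.
const http = require('http');

const CHECK_INTERVAL_MS = 10000;
const RESPONSE_TIMEOUT_MS = 5000;
let consecutiveFailures = 0;

function checkResponsive() {
    const req = http.get('http://127.0.0.1:1880/', { timeout: RESPONSE_TIMEOUT_MS }, (res) => {
        res.resume();              // drain the response body
        consecutiveFailures = 0;   // any reply means the event loop is alive
    });
    req.on('timeout', () => {
        req.destroy();
        consecutiveFailures++;
        if (consecutiveFailures >= 3) {
            restartNodeRed();      // hypothetical launcher restart hook
        }
    });
    req.on('error', () => { /* process-down errors are handled elsewhere */ });
}

setInterval(checkResponsive, CHECK_INTERVAL_MS);

function restartNodeRed() {
    console.log('Node-RED unresponsive - forcing a restart');
}
```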

From there, we then have to build the HA Controller. There's no short-cutting that piece - a finer-grained task breakdown will follow for that.

@knolleary knolleary added size:XXL - 13 Sizing estimation point and removed needs-triage Needs looking at to decide what to do labels May 7, 2023
@MarianRaphael MarianRaphael modified the milestones: 1.7, 1.8 May 11, 2023
@MarianRaphael
Contributor Author

Activities paused, based on workshop discussion.
Alternative first approach: #2156
