Instance Load Balancing #2156

MarianRaphael · 2023-05-19T07:54:13Z

Epic

Description

Allow horizontal scaling of n Node-RED instances in a FlowForge environment, and distribute incoming (HTTP) requests between the instances in a round-robin manner.
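As an illustration only, round-robin distribution can be sketched as follows. In a real deployment the platform's load balancer would do this rather than application code; the `createRoundRobin` helper and the replica names are hypothetical.

```javascript
// Hypothetical helper: hands out replicas from a fixed list, one after another,
// wrapping back to the start once the end of the list is reached.
function createRoundRobin (replicas) {
    if (!Array.isArray(replicas) || replicas.length === 0) {
        throw new Error('at least one replica is required')
    }
    let next = 0
    return function pick () {
        const replica = replicas[next]
        next = (next + 1) % replicas.length
        return replica
    }
}

// Three consecutive requests against two replicas
const pick = createRoundRobin(['replica-0', 'replica-1'])
const order = [pick(), pick(), pick()]
// order is ['replica-0', 'replica-1', 'replica-0']
```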

Constraints

Editor disabled
Flows have to be stateless or store all state in a separate storage layer. We have to inform the user about this and provide them with examples of how to achieve it.

User Story

As a FlowForge customer, I want to leverage instance load balancing.
This will allow me to run business-critical processes within Node-RED and ensure that they are always available and can handle increasing workloads.

Assumption

Flows for business-critical processes will not be developed in the target system / instance.
It is assumed that we can effectively communicate the significance of stateless functions to the user and ensure that they understand the concept and use it correctly.

Have you provided an initial effort estimate for this issue?

I have provided an initial effort estimate

Backend Tasks


HA: Add HA EE feature flag
HA: Create multiple replicas of instance based on ha setting
HA: Add HA flag to create instance endpoint
HA: Add endpoint for modifying HA settings
HA: Add HA flag to instance view
HA: Support querying individual replica logs in logging API #2260 (labels: size:M - 3, task)
Disable editor if ha configuration present flowforge/flowforge-nr-launcher#121 (labels: task)
HA: Disable persistent context cache layer in HA enabled instances flowforge/flowforge#2261 (labels: blocked, priority:high, size:L - 5, task)
Add support for Shared Subscriptions when instance in HA mode nr-project-nodes#28 (labels: task)
Add shared subscriptions to broker ACL for project nodes flowforge/flowforge#2225
HA: Enable HA feature to be billable flowforge/flowforge#2262 (labels: area:billing, priority:high, size:XL - 8, task)

UX Tasks


HA UX: Add option in Instance Settings to enable HA
HA UX: Add HA indicator on Instance overview page
HA UX: Disable editor button if HA is enabled

Documentation Tasks


HA: Add HA feature description to docs
HA: Guidance on creating HA-capable flows

knolleary · 2023-05-23T10:36:02Z

We need to decide how HA (in whatever form) is exposed to the user as a choice, and how that relates to billing. Whilst there's a lot of business decisions wrapped in that which could evolve, it has a specific impact on how we choose to implement it from the start. I want to make sure our initial iteration is pointed in the right direction.

In our current model, we have three Instance Types - S, M and L. Each Instance Type has a Stripe Product/Price associated with it.

At its simplest, HA could be an on/off choice. When turned on, the user gets two replicas of the instance. That gives them double the capacity, but that might not be enough for their requirements... what if they need 3 replicas? (Save that thought for later.)

In terms of how we model this in Stripe, we have two options. Let's consider a Team with 5 small instances, 2 of which are HA enabled.

Option 1 - HA is treated as an add-on. It has a separate Product/Price (with one corresponding to each InstanceType). On their invoice they will see:
```
5 x small
2 x small ha add-on
```
If (in the future) we allow users to pick how many replicas they get, then they would be purchasing additional 'ha add-on units'.
Option 2 - HA InstanceTypes. When HA is enabled on an instance, we use a different Product/Price for the whole instance:
```
3 x small
2 x small (ha-enabled)
```
This means HA is a fixed additional cost over the base instance price. But if the number of replicas is variable but the price is fixed, we need to ensure our margins are protected. The price either needs to allow a certain amount of capacity for additional replicas or we provide a way for a user to purchase additional HA capacity.

One future problem with the InstanceTypes approach (opt 2) is what happens if we have other features in the future that mean we end up managing MxN InstanceTypes to cover all combinations. Add-ons (opt 1) are a cleaner way to manage it in my view.
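To make the difference concrete, here is a rough sketch of the invoice lines each option would produce for the example team (5 small instances, 2 of them HA-enabled). The function name, the option names and the instance shape are made up for illustration; none of this reflects the Stripe API or FlowForge's billing code.

```javascript
// instances: [{ type: 'small', ha: true | false }, ...]
// option: 'addon' (option 1) or 'instance-type' (option 2)
function invoiceLines (instances, option) {
    const counts = new Map()
    const add = (label) => counts.set(label, (counts.get(label) || 0) + 1)
    for (const instance of instances) {
        if (option === 'addon') {
            // Option 1: every instance is billed at the base price and each
            // HA-enabled instance adds one add-on unit as a separate line.
            add(instance.type)
            if (instance.ha) {
                add(`${instance.type} ha add-on`)
            }
        } else {
            // Option 2: an HA-enabled instance is billed as a different product.
            add(instance.ha ? `${instance.type} (ha-enabled)` : instance.type)
        }
    }
    return [...counts].map(([label, quantity]) => `${quantity} x ${label}`)
}

const team = [
    { type: 'small', ha: false },
    { type: 'small', ha: false },
    { type: 'small', ha: false },
    { type: 'small', ha: true },
    { type: 'small', ha: true }
]

const addonInvoice = invoiceLines(team, 'addon')
const instanceTypeInvoice = invoiceLines(team, 'instance-type')
```

With add-ons, a future "pick your replica count" feature only changes the quantity on the add-on line. With HA InstanceTypes, every additional feature multiplies the number of products to maintain.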

Open questions

Should a user be able to enable/disable HA for an existing instance?
MVP: Only support setting the option at create time.
Should a user be able to pick how many replicas they get?
MVP: They get two replicas
Auto-scaling of the replicas?
MVP: No - but would be a useful roadmap item

MarianRaphael · 2023-05-23T14:49:12Z

As discussed in the Product Meeting:

Option 1 - High Availability (HA) is treated as an add-on. It has a separate product/price (with one corresponding to each Instance Type)
Beta feature in 1.8 – no extra charge

knolleary · 2023-05-24T13:00:53Z

Getting some implementation notes out of head and onto virtual paper.

Feature Flag

As this feature will only be available under certain conditions, we will introduce an ha feature flag that is enabled if:

an EE license has been applied
we are using the k8s driver (as we're restricting this to the k8s driver in the initial implementation. Adding docker support will require lots of additional work to replicate bits the k8s stack gives us for free)
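A minimal sketch of that check, assuming the platform can tell whether an EE license is active and which driver is configured. The `license.active` field and the `'kubernetes'` driver name are assumptions made for the example, not the actual FlowForge API.

```javascript
// Drivers that can run multiple replicas of an instance.
// Only the k8s driver qualifies in the initial implementation.
const HA_CAPABLE_DRIVERS = new Set(['kubernetes'])

// Hypothetical inputs: `license.active` and `driverName` are illustrative names.
function isHAFeatureEnabled ({ license, driverName }) {
    const hasEELicense = Boolean(license && license.active)
    return hasEELicense && HA_CAPABLE_DRIVERS.has(driverName)
}
```

Keeping the capable drivers in one set means adding docker support later would be a one-line change to the flag logic, even though the driver work itself is much larger.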

In the future, this feature will be restricted to particular team tiers. That will require team-level feature flags, something we don't have today but will need to be introduced as part of the work to reintroduce team tiers.

UX

Creating Instance with HA

When creating a new instance, a toggle will be added below the 'select instance type' to enable/disable HA mode. The UI should clearly show the cost associated with enabling the feature. For MVP this will be at no cost, but we need a means to have a cost associated with it that is dependent on the InstanceType selected.

Viewing Instance Details

The instance overview needs to indicate whether HA is enabled on the instance.

Enabling/Disabling HA on an existing Instance

Out of scope for MVP.

Accessing Logs

We will now have logs from n separate replicas. @hardillb is investigating how we can gather those logs from the individual replicas.

In the UI, we need a way to show logs from individual replicas.
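One possible shape for this, purely as a sketch: tag every log entry with the replica it came from, merge by timestamp for a combined view, and filter by replica for the individual view. The entry shape (`ts`, `msg`) and the replica names are assumptions; how the logs are actually collected is still being investigated.

```javascript
// logsByReplica: { 'replica-0': [{ ts, msg }, ...], 'replica-1': [...] }
// Returns one list, ordered by timestamp, with each entry tagged with its replica.
function mergeReplicaLogs (logsByReplica) {
    const merged = []
    for (const [replica, entries] of Object.entries(logsByReplica)) {
        for (const entry of entries) {
            merged.push({ ...entry, replica })
        }
    }
    return merged.sort((a, b) => a.ts - b.ts)
}

// Returns only the entries that came from a single replica.
function logsForReplica (merged, replica) {
    return merged.filter((entry) => entry.replica === replica)
}
```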

Disabling the editor

If HA is enabled on an instance, we will disable the editor. The only way to update the flows will be via a pipeline deploy from another instance. The instance overview needs to help the user to understand that - and explain why the editor button is unavailable.
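On the runtime side, Node-RED already has a `disableEditor` setting that stops the editor UI from being served. A sketch of how the launcher could set it when an HA configuration is present; the `applyHASettings` name and the `ha` property on the project settings are assumptions for the example.

```javascript
// nodeRedSettings: the settings object that will be handed to Node-RED
// projectSettings: instance settings from the platform (hypothetical shape)
function applyHASettings (nodeRedSettings, projectSettings) {
    const haConfigured = Boolean(projectSettings && projectSettings.ha)
    return {
        ...nodeRedSettings,
        // Never re-enable an editor that was already disabled for another reason.
        disableEditor: haConfigured || nodeRedSettings.disableEditor === true
    }
}
```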

API Changes

A new flag can be added to the create-instance API endpoint to indicate whether HA mode is to be enabled. The server will validate whether HA can be enabled for the given request.

The instance view will include the flag for the UI to know the state (but only if EE license applied).
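A sketch of the server-side validation, under the assumption that the request carries an optional `ha` object and that the MVP only accepts exactly two replicas. The request shape, function name and error codes are hypothetical.

```javascript
// body: parsed create-instance request, e.g. { name: 'my-instance', ha: { replicas: 2 } }
// haFeatureEnabled: result of the platform-level HA feature flag check
function validateCreateInstance (body, { haFeatureEnabled }) {
    if (body.ha === undefined) {
        // HA not requested: nothing to validate
        return { ok: true, ha: null }
    }
    if (!haFeatureEnabled) {
        return { ok: false, code: 'ha_not_available' }
    }
    if (!body.ha || body.ha.replicas !== 2) {
        // MVP assumption: the replica count is fixed at two
        return { ok: false, code: 'invalid_ha_configuration' }
    }
    return { ok: true, ha: { replicas: 2 } }
}
```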

DB Changes

At this stage, none. The HA flag can be stored in the existing ProjectSettings blob.

Billing configuration

This is potentially out of scope if for MVP there's no cost associated with enabling HA. However that will be a very short-term position, so it is worth having a sense of how billing will be applied.

From previous discussions, the decision is for HA to be considered as an add-on item on the invoice, rather than use an alternative instance product.

In the current model, we have a ProjectType which includes the stripe price/product for that type.

The most straightforward implementation will be to add HA price/product options in the ProjectType properties. This won't easily scale if we want different prices for different TeamTiers. That is a future issue we already have to tackle, so this won't make it any worse to deal with than it already is.

Persistent Context

We added a caching layer to our persistent context implementation for two reasons:

performance
it allows the API to operate in synchronous mode, which is what most users expect. Removing the caching layer will require the API to be asynchronous. This will require any Function nodes accessing context to be modified to use the async API Node-RED provides.

If HA is enabled, we will need to disable the caching layer.
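For reference, this is roughly what the change looks like in a Function node. Node-RED's async context API takes the store name and a callback as extra arguments to `flow.get` and `flow.set`. The store name `persistent` and the `count` key are assumptions for the example, and the code is wrapped in a function with a small in-memory stand-in so it can be run outside Node-RED.

```javascript
// Body of a Function node, wrapped so it can run outside Node-RED.
// `flow`, `node` and `msg` mirror the objects a Function node has access to.
function countMessage (flow, node, msg) {
    // Synchronous form, which relies on the cache: flow.get('count', 'persistent')
    // Async form: pass a callback as the last argument.
    flow.get('count', 'persistent', (err, count) => {
        if (err) {
            node.error(err, msg)
            return
        }
        const next = (count || 0) + 1
        flow.set('count', next, 'persistent', (err) => {
            if (err) {
                node.error(err, msg)
                return
            }
            msg.payload = next
            node.send(msg)
        })
    })
}

// Minimal in-memory stand-in for a context store, for local experiments only.
function createFakeFlowContext () {
    const data = {}
    return {
        get: (key, store, callback) => callback(null, data[key]),
        set: (key, value, store, callback) => {
            data[key] = value
            callback(null)
        }
    }
}
```

Note that a read followed by a write like this is still not atomic across replicas: two replicas handling messages at the same time can both read the same value. Flows that need exact counters would have to rely on the backing store for that.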

ProjectNodes MQTT configuration

To have multiple instances connected to the broker, they will need to use a shared subscription for the project nodes - so that messages are distributed.
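In MQTT 5, a shared subscription is requested by subscribing to `$share/<ShareName>/<topic filter>`; the broker then delivers each matching message to only one subscriber in the share group. A small sketch of building that filter, where the share name (one per instance, so that all its replicas join the same group) and the example topic are assumptions:

```javascript
// Builds an MQTT 5 shared subscription filter.
// The share name must not be empty or contain '/', '+' or '#'.
function sharedSubscription (shareName, topicFilter) {
    if (!shareName || /[/+#]/.test(shareName)) {
        throw new Error(`invalid share name: ${shareName}`)
    }
    return `$share/${shareName}/${topicFilter}`
}

// All replicas of the same instance subscribe with the same share name.
const filter = sharedSubscription('instance-1234', 'example/topic/#')
// filter is '$share/instance-1234/example/topic/#'
```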

knolleary · 2023-05-24T13:25:30Z

ProjectSettings

To indicate the HA state of an Instance, we will store an object in ProjectSettings under the key ha. For this iteration it will contain a single key replicas which indicates how many replicas should be running:

ha: {
   replicas: 2
}
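A sketch of how a driver might read this setting, assuming that a missing `ha` key means a single, non-HA instance. The `getReplicaCount` name is hypothetical.

```javascript
// projectSettings: the ProjectSettings blob for an instance
// Returns the number of replicas to run; 1 when HA is not configured.
function getReplicaCount (projectSettings) {
    const ha = projectSettings && projectSettings.ha
    if (!ha) {
        return 1
    }
    const replicas = Number(ha.replicas)
    if (!Number.isInteger(replicas) || replicas < 1) {
        throw new Error(`invalid ha.replicas value: ${ha.replicas}`)
    }
    return replicas
}
```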

knolleary · 2023-05-24T13:34:52Z

For anyone just reading the notifications of comments... I've added task lists to the main description of this epic to track individual tasks.

knolleary · 2023-05-24T13:36:51Z

For the sake of local development/testing, I'm going to make the stub driver appear to offer HA capabilities. This removes the need for everyone to have a local k8s development environment before they can contribute to this work (although longer term, that remains something we need to enable).

hardillb · 2023-05-25T08:44:26Z

Branch for k8s driver changes:
https://github.com/flowforge/flowforge-driver-k8s/tree/2156-ha-replicas

PR:
FlowFuse/driver-k8s#85

knolleary · 2023-06-02T11:18:58Z

Regarding Persistent Context... we need to disable synchronous access if HA is enabled as we need to ensure access is synced with the backing store.

However, we have FlowFuse/nr-persistent-context#16 when the store is async-only. This has been fixed upstream in Node-RED 3.1.

Need to work out the consequences of this: do we have to document a restriction on persistent context with this iteration of scaling, or make Node-RED 3.1 (which is still in beta) a prerequisite?

knolleary · 2023-06-08T10:49:42Z

I have raised follow-up items for the HA tasks known at this time - all linked in the task lists above.

MarianRaphael added the feature-request, needs-triage and size:XXL - 13 labels May 19, 2023

MarianRaphael added this to the 1.8 milestone May 19, 2023

MarianRaphael mentioned this issue May 19, 2023

Automatic failover for Node-RED Instances #1920

Open

3 tasks

MarianRaphael removed the needs-triage label May 22, 2023

MarianRaphael mentioned this issue May 23, 2023

Auto-scaling of replicas #2174

Open

knolleary self-assigned this May 24, 2023

hardillb mentioned this issue May 25, 2023

HA: multiple instance replicas FlowFuse/driver-k8s#85

Merged

12 tasks

hardillb mentioned this issue May 26, 2023

Add permission to list endpoints FlowFuse/helm#133

Merged

11 tasks

This was referenced Jun 2, 2023

Add support for Shared Subscriptions when instance in HA mode FlowFuse/nr-project-nodes#28

Closed

HA: multiple instance replica support #2180

Merged

Disable editor if ha configuration present FlowFuse/nr-launcher#121

Closed

MarianRaphael mentioned this issue Jun 4, 2023

Blog Post Image for Release Notes 1.8 FlowFuse/website#821

Closed

MarianRaphael closed this as completed Jun 26, 2023
