Instance Load Balancing #2156

Closed
14 of 16 tasks
MarianRaphael opened this issue May 19, 2023 · 9 comments
Labels: feature-request (New feature or request that needs to be turned into Epic/Story details), size:XXL - 13 (Sizing estimation point)

@MarianRaphael
Contributor

MarianRaphael commented May 19, 2023

Epic

#1678

Description

Allow horizontal scaling of n Node-RED instances in a FlowForge environment, distributing incoming (HTTP) requests between the instances in a round-robin manner.
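
As a rough illustration of the round-robin distribution described above, the sketch below cycles requests across a fixed list of replicas. It is illustrative only: in a real deployment the platform's routing layer would make this choice, not application code. The function name `createRoundRobin` and the replica names are hypothetical.

```javascript
// Illustrative sketch only: returns a function that picks the next replica
// in strict rotation each time it is called.
function createRoundRobin(replicas) {
    if (!Array.isArray(replicas) || replicas.length === 0) {
        throw new Error('at least one replica is required')
    }
    let next = 0
    return function pick () {
        const replica = replicas[next]
        // Wrap back to the first replica after the last one
        next = (next + 1) % replicas.length
        return replica
    }
}

// Hypothetical replica names for an instance running two replicas
const pick = createRoundRobin(['instance-0', 'instance-1'])
const assigned = [pick(), pick(), pick(), pick()]
console.log(assigned.join(', '))
```

Because consecutive requests land on different replicas, any state held in one replica's memory is invisible to the next request, which is why the constraints below require stateless flows.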

Constraints

  • Editor disabled
  • Flows have to be stateless or store all state in a separate storage layer. We have to inform the user about this and provide them with examples of how to achieve it.

User Story

As a FlowForge customer, I want to leverage instance load balancing.
This will allow me to run business-critical processes within Node-RED and ensure that they are always available and can handle increasing workloads.

Assumption

  • Flows for business-critical processes will not be developed in the target system / instance.
  • It is assumed that we can effectively communicate the significance of stateless functions to the user and ensure that they understand the concept and use it correctly.

Have you provided an initial effort estimate for this issue?

I have provided an initial effort estimate


Backend Tasks

  1. size:M - 3 task
    hardillb
  2. task
    hardillb
  3. blocked priority:high size:L - 5 task
  4. task
  5. area:billing priority:high size:XL - 8 task

UX Tasks

Documentation Tasks

@MarianRaphael MarianRaphael added feature-request New feature or request that needs to be turned into Epic/Story details needs-triage Needs looking at to decide what to do size:XXL - 13 Sizing estimation point labels May 19, 2023
@MarianRaphael MarianRaphael added this to the 1.8 milestone May 19, 2023
@MarianRaphael MarianRaphael removed the needs-triage Needs looking at to decide what to do label May 22, 2023
@knolleary
Member

knolleary commented May 23, 2023

We need to decide how HA (in whatever form) is exposed to the user as a choice, and how that relates to billing. Whilst there's a lot of business decisions wrapped in that which could evolve, it has a specific impact on how we choose to implement it from the start. I want to make sure our initial iteration is pointed in the right direction.

In our current model, we have three Instance Types - S, M and L. Each Instance Type has a Stripe Product/Price associated with it.

At its simplest, HA could be an on/off choice. When turned on, the user gets two replicas of the instance. That gives them double the capacity, but that might not be enough for their requirements... what if they need 3 replicas (save that thought for later)?

In terms of how we model this in Stripe, we have two options. Let's consider a Team with 5 small instances, 2 of which are HA enabled.

  1. HA is treated as an add-on. It has a separate Product/Price (with one corresponding to each InstanceType). On their invoice they will see:

    5 x small
    2 x small ha add-on
    

    If (in the future) we allow users to pick how many replicas they get, then they would be purchasing additional 'ha add-on units'.

  2. HA InstanceTypes. When HA is enabled on an instance, we use a different Product/Price for the whole instance:

    3 x small
    2 x small (ha-enabled)
    

    This means HA is a fixed additional cost over the base instance price. However, if the number of replicas is variable while the price is fixed, we need to ensure our margins are protected. The price either needs to allow a certain amount of capacity for additional replicas, or we provide a way for a user to purchase additional HA capacity.

One future problem with the InstanceTypes approach (opt 2) is what happens if we have other features in the future that mean we end up managing MxN InstanceTypes to cover all combinations. Add-ons (opt 1) are a cleaner way to manage it in my view.
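
To make the difference between the two options concrete, here is a small sketch that produces the invoice lines for the example team (5 small instances, 2 of them HA enabled) under each model. The helper `invoiceLines`, the model names `'add-on'` and `'instance-type'`, and the instance object shape are all hypothetical; nothing here reflects the actual Stripe integration.

```javascript
// Hypothetical helper: builds invoice line labels for a list of instances
// under one of the two billing models discussed above.
function invoiceLines (instances, model) {
    const counts = new Map()
    const add = (label) => counts.set(label, (counts.get(label) || 0) + 1)
    for (const { type, ha } of instances) {
        if (model === 'add-on') {
            // Option 1: every instance is billed at its base type,
            // and HA adds a separate add-on line item
            add(type)
            if (ha) add(`${type} ha add-on`)
        } else if (model === 'instance-type') {
            // Option 2: an HA instance is billed as a different product entirely
            add(ha ? `${type} (ha-enabled)` : type)
        } else {
            throw new Error(`unknown billing model: ${model}`)
        }
    }
    return [...counts].map(([label, quantity]) => `${quantity} x ${label}`)
}

// The example team: 5 small instances, 2 of which have HA enabled
const team = [
    { type: 'small', ha: false },
    { type: 'small', ha: false },
    { type: 'small', ha: false },
    { type: 'small', ha: true },
    { type: 'small', ha: true }
]
const addOnInvoice = invoiceLines(team, 'add-on')
const instanceTypeInvoice = invoiceLines(team, 'instance-type')
console.log(addOnInvoice, instanceTypeInvoice)
```

With the add-on model, a second feature would only add one more optional line per instance, whereas the instance-type model would need a distinct product for every combination.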

Open questions

  1. Should a user be able to enable/disable HA for an existing instance?
    MVP: Only support setting the option at create time.
  2. Should a user be able to pick how many replicas they get?
    MVP: They get two replicas
  3. Auto-scaling of the replicas?
    MVP: No - but would be a useful roadmap item

@MarianRaphael
Contributor Author

As discussed in the Product Meeting:

  • Option 1 - High Availability (HA) is treated as an add-on. It has a separate product/price (with one corresponding to each Instance Type)
  • Beta feature in 1.8 – no extra charge

@knolleary
Member

Getting some implementation notes out of my head and onto virtual paper.

Feature Flag

As this feature will only be available under certain conditions, we will introduce an ha feature flag that is enabled if:

  • an EE license has been applied
  • we are using the k8s driver (as we're restricting this to the k8s driver in the initial implementation. Adding docker support will require lots of additional work to replicate bits the k8s stack gives us for free)
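
A minimal sketch of the two conditions above, assuming the platform can report its license type and container driver name. The function name, the `licenseType` and `driver` property names, and the values `'EE'` and `'kubernetes'` are placeholders for illustration, not the real FlowForge internals.

```javascript
// Hypothetical check: the ha feature flag is on only when both
// conditions listed above hold.
function isHAFeatureEnabled ({ licenseType, driver }) {
    const hasEELicense = licenseType === 'EE'
    const isK8sDriver = driver === 'kubernetes'
    return hasEELicense && isK8sDriver
}

console.log(isHAFeatureEnabled({ licenseType: 'EE', driver: 'kubernetes' }))
console.log(isHAFeatureEnabled({ licenseType: 'EE', driver: 'docker' }))
```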

In the future, this feature will be restricted to particular team tiers. That will require team-level feature flags, something we don't have today but will need to be introduced as part of the work to reintroduce team tiers.

UX

Creating Instance with HA

When creating a new instance, a toggle will be added below the 'select instance type' to enable/disable HA mode. The UI should clearly show the cost associated with enabling the feature. For MVP this will be at no cost, but we need a means to have a cost associated with it that is dependent on the InstanceType selected.

Viewing Instance Details

The instance overview needs to indicate whether HA is enabled on the instance.

Enabling/Disabling HA on an existing Instance

Out of scope for MVP.

Accessing Logs

We will now have logs from n separate replicas. @hardillb is investigating how we can gather those logs from the individual replicas.

In the UI, we need a way to show logs from individual replicas.

Disabling the editor

If HA is enabled on an instance, we will disable the editor. The only way to update the flows will be via a pipeline deploy from another instance. The instance overview needs to help the user to understand that - and explain why the editor button is unavailable.

API Changes

A new flag can be added to the create-instance API endpoint to indicate whether HA mode is to be enabled. The server will validate whether HA can be enabled for the given request.

The instance view will include the flag so the UI knows the state (but only if an EE license is applied).
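
The server-side validation could look roughly like the sketch below. The request field `ha`, the `features.ha` lookup, and the shape of the returned object are assumptions made for illustration; the real endpoint and its error format may differ.

```javascript
// Hypothetical validation of a create-instance request body.
// `features.ha` stands in for the platform-level ha feature flag.
function validateCreateInstance (body, features) {
    // No flag, or an explicit false, means a normal single-replica instance
    if (body.ha === undefined || body.ha === false) {
        return { valid: true, ha: false }
    }
    if (body.ha !== true) {
        return { valid: false, error: 'ha must be a boolean' }
    }
    // HA was requested: only allow it when the feature flag is enabled
    if (!features.ha) {
        return { valid: false, error: 'HA is not available on this platform' }
    }
    return { valid: true, ha: true }
}

console.log(validateCreateInstance({ name: 'my-instance', ha: true }, { ha: true }))
console.log(validateCreateInstance({ name: 'my-instance', ha: true }, { ha: false }))
```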

DB Changes

At this stage, none. The HA flag can be stored in the existing ProjectSettings blob.

Billing configuration

This is potentially out of scope if for MVP there's no cost associated with enabling HA. However that will be a very short-term position, so it is worth having a sense of how billing will be applied.

From previous discussions, the decision is for HA to be considered as an add-on item on the invoice, rather than use an alternative instance product.

In the current model, we have a ProjectType which includes the stripe price/product for that type.

The most straightforward implementation will be to add HA price/product options in the ProjectType properties. This won't easily scale if we want different prices for different TeamTiers. That is a future issue we already have to tackle, so this won't make it any worse to deal with than it already is.

Persistent Context

We added a caching layer to our persistent context implementation for two reasons:

  1. performance
  2. it allows the API to operate in synchronous mode - which is what most users expect. Removing the caching layer will require the API to be asynchronous. This will require any Function nodes accessing context to be modified to use the async API Node-RED provides.

If HA is enabled, we will need to disable the caching layer.
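
For reference, this is roughly what a Function node looks like once it uses the callback form of the Node-RED context API (`flow.get(key, store, callback)` and `flow.set(key, value, store, callback)`). So that the sketch runs outside Node-RED, the Function node body is wrapped in a function and `flow` and `node` are replaced with simple in-memory stand-ins that invoke their callbacks immediately; those stand-ins and the store name `'persistent'` are assumptions for illustration.

```javascript
// Stand-in for a Node-RED context object, offering only the callback form
// of get/set. A real async store would call back later, not immediately.
function createMockContext () {
    const data = new Map()
    return {
        get (key, store, callback) {
            callback(null, data.get(`${store}:${key}`))
        },
        set (key, value, store, callback) {
            data.set(`${store}:${key}`, value)
            callback(null)
        }
    }
}

// What would be the body of a Function node: count messages using
// asynchronous context access instead of `flow.get('count')`.
function countMessage (flow, node, msg) {
    flow.get('count', 'persistent', (err, count) => {
        if (err) { node.error(err, msg); return }
        const next = (count || 0) + 1
        flow.set('count', next, 'persistent', (err) => {
            if (err) { node.error(err, msg); return }
            msg.count = next
            // With async access the message is sent from the callback,
            // rather than returned from the function
            node.send(msg)
        })
    })
}

const flow = createMockContext()
const sent = []
const node = { send: (m) => sent.push(m), error: (e) => { throw e } }
countMessage(flow, node, { payload: 'a' })
countMessage(flow, node, { payload: 'b' })
console.log(sent.map((m) => m.count))
```

Note that a read-then-write sequence like this is not atomic across replicas, so two replicas handling messages at the same time could still overwrite each other's update.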

ProjectNodes MQTT configuration

To have multiple instances connected to the broker, they will need to use a shared subscription for the project nodes - so that messages are distributed.
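
MQTT 5 shared subscriptions use a topic filter of the form `$share/<ShareName>/<filter>`, and the broker delivers each matching message to only one subscriber in the share group. The sketch below builds such a filter; the helper name, the group name, and the example topic are hypothetical, and it assumes the broker in use supports shared subscriptions.

```javascript
// Builds an MQTT shared subscription filter. All replicas of one instance
// would subscribe with the same share name so that each message is
// delivered to just one of them.
function sharedSubscription (shareName, topicFilter) {
    // The share name must be non-empty and must not contain '/', '+' or '#'
    if (!shareName || /[/+#]/.test(shareName)) {
        throw new Error('invalid share name')
    }
    if (!topicFilter) {
        throw new Error('a topic filter is required')
    }
    return `$share/${shareName}/${topicFilter}`
}

// Hypothetical share name and topic, for illustration only
const filter = sharedSubscription('instance-abc', 'example/team1/project/+/in/#')
console.log(filter)
```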

@knolleary knolleary self-assigned this May 24, 2023
@knolleary
Member

knolleary commented May 24, 2023

ProjectSettings

To indicate the HA state of an Instance, we will store an object in ProjectSettings under the key ha. For this iteration it will contain a single key replicas which indicates how many replicas should be running:

ha: {
   replicas: 2
}
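
A small sketch of how code reading these settings might interpret them. The helper name `getReplicaCount` and the choice to treat a missing or malformed `ha` object as a single replica are assumptions, not part of the design above.

```javascript
// Hypothetical reader for the ha settings object shown above.
// Falls back to 1 (no HA) when the key is absent or not a positive integer.
function getReplicaCount (projectSettings) {
    const replicas = projectSettings?.ha?.replicas
    return Number.isInteger(replicas) && replicas > 0 ? replicas : 1
}

console.log(getReplicaCount({ ha: { replicas: 2 } }))
console.log(getReplicaCount({}))
```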

@knolleary
Member

For anyone just reading the notifications of comments... I've added task lists to the main description of this epic to track individual tasks.

@knolleary
Member

For the sake of local development/testing, I'm going to make the stub driver appear to offer HA capabilities. This removes the need for everyone to have a local k8s development environment before they can contribute to this work (although longer term, that remains something we need to enable).

@hardillb
Contributor

@knolleary
Member

Regarding Persistent Context... we need to disable synchronous access if HA is enabled, as we need to ensure access is synced with the backing store.

However, we have FlowFuse/nr-persistent-context#16 when the store is async-only. This has been fixed upstream in Node-RED 3.1.

We need to work out the consequences of this: do we have to document a restriction in persistent context with this iteration of scaling, or make Node-RED 3.1 (which is still in beta) a prerequisite?

@knolleary
Member

I have raised follow-up items for the HA tasks known at this time - all linked in the task lists above.


3 participants