
High availability deployment

StackStorm has been systematically built with high availability (HA) as a goal. The exact deployment steps to achieve HA depend on the specifics of the infrastructure in which StackStorm is deployed. This guide provides a brief explanation of the various services, how they interact, and the external services necessary for StackStorm to function. As the description that follows outlines, StackStorm components also scale horizontally, increasing system throughput while achieving higher availability.

The section /install/overview provides a detailed picture and explanation of how single-box deployments work. Keep that picture in mind as a reference while layering on the HA deployment specific details below.

Components

st2api

This process hosts the REST API endpoints that serve requests from the WebUI, CLI and ChatOps. It maintains connections to MongoDB to store and retrieve information, and connects to RabbitMQ to push various kinds of messages onto the message bus. It is a Python WSGI app running under a gunicorn-managed process which by default listens on port 9101 and is front-ended by Nginx acting as a reverse proxy.

Multiple st2api processes can be behind a loadbalancer in an active-active configuration. Each of these processes can be deployed on separate compute instances.
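As a minimal sketch (the host names match the reference setup later in this guide; ports, paths and SSL settings are illustrative and will vary by deployment), an Nginx loadbalancer stanza for st2api could look like:

    upstream st2api {
        server st2-multi-node-1:443;
        server st2-multi-node-2:443;
    }

    server {
        listen 443 ssl;
        # ssl_certificate / ssl_certificate_key directives omitted for brevity

        location /api/ {
            proxy_pass https://st2api;
        }
    }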

st2auth

All authentication is managed by this process. It needs a connection to MongoDB and an authentication backend. See authentication backends <ref-auth-backends> for more information. It is a Python WSGI app running under a gunicorn-managed process which by default listens on port 9100 and is front-ended by Nginx acting as a reverse proxy.

Multiple st2auth processes can be behind a loadbalancer in an active-active configuration. Each of these processes can be deployed on separate compute instances. If using the PAM authentication backend, special care has to be taken to guarantee that all boxes running an st2auth instance have the same users. Generally, all st2auth processes should see the same identities, via a shared provider if applicable, for the system to work predictably.

st2stream

This process exposes a server-sent event stream and requires access to both MongoDB and RabbitMQ. It is also a gunicorn-managed process which listens on port 9102 by default and is front-ended by Nginx acting as a reverse proxy. Clients like the WebUI and ChatOps maintain a persistent connection with an st2stream process and receive updates from the st2stream server.

Multiple st2stream processes can be behind a loadbalancer in an active-active configuration. Since clients maintain a persistent connection with a specific instance, they will momentarily lose events if an st2stream process goes down. It is the responsibility of the client to reconnect to an alternate stream via the loadbalancer. Note that this is in contrast with st2api, where each connection from a client is short-lived. Take the long-lived nature of connections to this process into account when configuring timeouts for the loadbalancer, the WSGI app server (e.g. gunicorn), etc.
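For illustration, a client can subscribe to the event stream through the loadbalancer with curl; the /stream/v1/stream path assumes the standard Nginx proxy rules, and the token and host are placeholders:

    curl -s -H "X-Auth-Token: <token>" https://<loadbalancer>/stream/v1/stream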

st2sensorcontainer

st2sensorcontainer manages the sensors to be run on a node. It will start, stop and restart the sensors running on a node, based on policy. In this context a node is the same as a compute instance, i.e. a virtual machine, and somewhere down the line could be a container.

It is possible to run st2sensorcontainer in HA mode by running one on each compute instance, resulting in multiple sensor nodes. Each sensor node needs to be provided with proper partition information to share work with other sensor nodes, so that the same sensor does not run on different nodes. See /reference/sensor_partitioning for information on how to partition sensors; an illustrative snippet follows below. At this point st2sensorcontainer processes do not form a cluster that distributes work or takes over new work if some nodes in the cluster disappear. It is, however, possible for a sensor itself to be implemented with HA in mind, so that the same sensor can be deployed on multiple nodes with the sensor managing active-active or active-passive behavior. Providing platform-level HA support for sensors is likely to be an enhancement to StackStorm in future releases.
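An illustrative /etc/st2/st2.conf snippet for one sensor node; the option names follow the sensor partitioning documentation referenced above, and the values are examples only:

    [sensorcontainer]
    sensor_node_name = sensornode1
    partition_provider = name:kvstore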

st2rulesengine

This is a dual-purpose process - its main function is to evaluate rules as it sees TriggerInstances and decide whether an ActionExecution should be requested. It needs access to MongoDB to locate rules and to RabbitMQ to listen for TriggerInstances and request ActionExecutions. The auxiliary purpose of this process is to run all the defined timers. See timers <ref-rule-timers> for specifics on setting up timers via rules.

Multiple st2rulesengine processes can run in active-active, requiring only connections to MongoDB and RabbitMQ. They share the TriggerInstance load and naturally pick up more work if one or more of the processes become unavailable. The timer function in each rules engine is not HA compatible. In the interim it is possible to disable the timer in all but one st2rulesengine to avoid duplicate timer events; expect this to be fixed in a future release.
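For example, to avoid duplicate timer events, the timer can be disabled on all but one st2rulesengine box in /etc/st2/st2.conf (the same snippet is used in the reference setup below):

    [timer]
    enable = False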

st2actionrunner

All ActionExecutions are handled by st2actionrunner, which manages the full life-cycle of an execution from scheduling to one of the terminal states.

Multiple st2actionrunner processes can run in active-active, requiring only connections to MongoDB and RabbitMQ. Work gets naturally distributed across runners via RabbitMQ. Adding more st2actionrunner processes increases StackStorm's capacity to execute actions.

In a proper distributed setup it is recommended to set up Zookeeper or Redis to provide a distributed co-ordination layer. See Policies </policies>. The default file-based co-ordination backend will not work as it would in a single-box deployment.
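A sketch of the co-ordination settings in /etc/st2/st2.conf, assuming a Redis or Zookeeper backend; the hosts and credentials are illustrative:

    [coordination]
    url = redis://:<password>@<redis-host>:6379
    # or, for Zookeeper:
    # url = zookeeper://<zookeeper-host>:2181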

st2resultstracker

Tracks the results of executions handed over to Mistral. It requires access to MongoDB and RabbitMQ to perform its function.

Multiple st2resultstracker processes will co-operate with each other to perform work. At startup there is a possibility of duplicated work, but this duplication has no negative consequences. Specifically, the jobs to track results are also stored in the DB in case there are no workers to take over the work; this pattern makes all result trackers pick up the same work set on startup. Once this work set is exhausted, all subsequent tasks are round-robined. If needed, st2resultstracker processes can be started in a staggered manner to avoid the extra work.

st2notifier

This is a dual-purpose process - its main function is to generate st2.core.actiontrigger and st2.core.notifytrigger based on the completion of ActionExecutions. The auxiliary purpose is to act as a backup scheduler for actions that may not have been scheduled.

Multiple st2notifiers can run in active-active, requiring connections to RabbitMQ and MongoDB. For the auxiliary purpose to function in an HA deployment with more than one st2notifier running, either Zookeeper or Redis is required to provide co-ordination, much like for policies. It is also possible to designate a single st2notifier as the provider of the auxiliary functions by disabling the scheduler in all but one st2notifier.

st2garbagecollector

An optional service that cleans up old executions and other operational data based on the configured settings. By default this process does nothing and needs to be configured to perform any work.

By design it is a singleton process. Running multiple instances in active-active will not yield much benefit, but will not do any harm either. The ideal configuration is active-passive, but StackStorm does not itself provide the ability to run this process in active-passive.

mistral-api

The Mistral API is served by this aptly named process. It needs access to PostgreSQL and RabbitMQ.

Multiple mistral-api processes can run, just like st2api, in an active-active configuration with a loadbalancer distributing requests at the front end. In a typical single-box deployment mistral-api is local to the box and StackStorm communicates with it via a direct HTTP connection; for an HA setup we recommend putting mistral-api behind a loadbalancer and configuring StackStorm to communicate via the loadbalancer.
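For instance, StackStorm can be pointed at the load-balanced Mistral API in /etc/st2/st2.conf; the v2_base_url option name follows the standard st2 configuration, and the URL is illustrative:

    [mistral]
    v2_base_url = https://<loadbalancer>/mistral/v2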

mistral-server

mistral-server is the worker engine for Mistral, i.e. the process which actually manages executions. The StackStorm plugin to Mistral, i.e. st2mistral, communicates back to the StackStorm API. This process needs access to PostgreSQL and RabbitMQ.

Multiple mistral-server processes can run and co-ordinate work in an active-active configuration. In an HA deployment all communication with the StackStorm API must go via the configured loadbalancer.

Required dependencies

Here are some HA recommendations for the dependencies required by StackStorm components. Depending on your exact infrastructure these may not be suitable and should be treated only as suggestions.

MongoDB

StackStorm uses MongoDB to cache Action, Rule and Sensor metadata, which already live in the filesystem. All content should ideally be source-control managed, preferably in a git repository. StackStorm also stores operational data like ActionExecutions, TriggerInstances, etc. MongoDB supports replica-set high availability, which we recommend to provide safe failover.

Loss of connectivity to the MongoDB cluster will cause downtime for StackStorm. However, once a replacement MongoDB is brought back, it should be quite possible to bring StackStorm back to an operational state by simply reloading the content. Easy access to old ActionExecutions will be lost, but all the data of old ActionExecutions will still be available in the audit logs.
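For example, once a fresh MongoDB is available, the filesystem content can be re-registered with:

    st2ctl reload --register-all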

PostgreSQL

PostgreSQL is used primarily by mistral-api and mistral-server. To deploy PostgreSQL in HA, please see the documentation provided by the PostgreSQL project.

The data stored in PostgreSQL is operational data for Mistral; therefore, starting from a brand-new PostgreSQL after the loss of a cluster will bring automation services back instantly, although there will certainly be downtime while the new DB cluster is provisioned.

RabbitMQ

RabbitMQ is the communication hub for StackStorm to co-ordinate and distribute work. See the RabbitMQ documentation to understand HA deployment strategies.

Our recommendation is to mirror all the queues and exchanges so that the loss of one server still retains functionality.
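As a sketch, a classic mirrored-queue policy covering all queues can be applied as follows; the policy name and pattern are illustrative:

    rabbitmqctl set_policy ha-all ".*" '{"ha-mode":"all","ha-sync-mode":"automatic"}'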

Zookeeper/Redis

Various StackStorm features rely on a proper co-ordination backend to work correctly in a distributed deployment.

See the ZooKeeper documentation for how to run a replicated ZooKeeper setup, and the Redis documentation to understand Redis deployments using Sentinel.

Nginx and loadbalancer

An Nginx server is required to reverse proxy each instance of st2api, st2auth, st2stream and mistral-api. This server terminates SSL connections, shields clients from the internal port numbers of the various services, and only requires ports 80 and 443 to be open on the boxes. Often it is best to deploy one set of all these services on a compute instance and share an Nginx server.

There is also a need for a loadbalancer to front-end all the REST services. This results in an HA deployment for the REST services as well as a single endpoint for clients. Most deployment infrastructures will already have a preferred loadbalancer solution, so we do not provide a specific recommendation; the sketch under st2api above shows one possible Nginx-based approach.

Sharing Content

In an HA setup with st2api, st2actionrunner and st2sensorcontainer each running on multiple boxes, the question of managing distributed content is crucial. StackStorm does not provide a built-in solution for distributing content to the various boxes; instead it relies on content being managed from the outside. Here are a few strategies.

Read-Write NFS mounts

If the content folders, i.e. /opt/stackstorm/packs and /opt/stackstorm/virtualenvs, are placed on read-write NFS mounts, then writes from any node will be visible to the other nodes. Special care needs to be taken in the case of /opt/stackstorm/virtualenvs since it contains symlinks to system libraries. If care is not taken to provision all host boxes in an identical manner it can lead to unpredictable behavior. Although possible to implement in this manner, it is certainly not ideal, and managing the virtualenvs on every host box individually would perhaps be a more robust approach.
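A sketch of /etc/fstab entries for such mounts; the NFS server name and export paths are illustrative:

    # mount StackStorm content from a hypothetical NFS server
    nfs-server:/exports/packs        /opt/stackstorm/packs        nfs  rw,hard  0 0
    nfs-server:/exports/virtualenvs  /opt/stackstorm/virtualenvs  nfs  rw,hard  0 0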

Content management

Manage pack installation using a content management tool of your choice. Assuming that the list of packs to be deployed is static, deploying content to the nodes via CM tools can be a sub-step of the overall deployment. This is perhaps the better of the two approaches for ending up with a predictable HA deployment of StackStorm.

Reference HA setup

In this section we provide a highly opinionated, and therefore prescriptive, approach to deploying StackStorm in HA. This deployment has 3 independent boxes, which we categorize as a controller box and blueprint boxes. Let's call these boxes st2-multi-node-cntl, st2-multi-node-1 and st2-multi-node-2. For the sake of reference we will be using Ubuntu 14.04 as the base operating system.

Controller box

This box runs all the shared required dependencies and some StackStorm components.

  • Nginx as loadbalancer
  • MongoDB
  • PostgreSQL
  • RabbitMQ
  • st2chatops
  • st2web

In practice MongoDB, PostgreSQL and RabbitMQ are often run in standalone clusters managed outside of StackStorm. The two shared components, st2chatops and st2web, are placed here for the sake of convenience and could be placed anywhere with the right configuration.

Nginx acting as the loadbalancer can easily be switched out for Amazon ELB, HAProxy or any other loadbalancer of your choosing. In that case st2web, which is served off this Nginx, will also need a new home.

st2chatops, which uses hubot, is not easily deployed in HA. Using something like keepalived to maintain st2chatops in an active-passive configuration would be an option.

The following are the steps to provision a controller box on Ubuntu 14.04.

Install required dependencies

  1. Install MongoDB, PostgreSQL and RabbitMQ.
sudo apt-get install -y mongodb-server rabbitmq-server postgresql
  2. Fix the listen address in /etc/postgresql/9.3/main/postgresql.conf to have PostgreSQL listen on an interface with an IP address reachable from st2-multi-node-1 and st2-multi-node-2.
  3. Fix bind_ip in /etc/mongodb.conf to bind MongoDB to an interface that has an IP address reachable from st2-multi-node-1 and st2-multi-node-2.
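For example, assuming 10.0.3.1 is the controller's address on the subnet shared with the blueprint boxes:

    # /etc/postgresql/9.3/main/postgresql.conf
    listen_addresses = '10.0.3.1'

    # /etc/mongodb.conf
    bind_ip = 10.0.3.1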
  4. Restart MongoDB.

    service mongodb restart
  5. Add an ACL rule to /etc/postgresql/9.3/main/pg_hba.conf. Here the subnet being allowed access is 10.0.3.1/24:
host       all  all  10.0.3.1/24  trust
  6. Restart PostgreSQL.
service postgresql restart
  7. Create the Mistral DB in PostgreSQL.
cat << EHD | sudo -u postgres psql
CREATE ROLE mistral WITH CREATEDB LOGIN ENCRYPTED PASSWORD 'StackStorm';
CREATE DATABASE mistral OWNER mistral;
EHD
  8. Add stable repos.
curl -s https://packagecloud.io/install/repositories/StackStorm/staging-stable/script.deb.sh | sudo bash
  9. Set up st2web and SSL termination. Follow install webui and setup ssl<ref-install-webui-ssl-deb>.
  10. Configure Nginx as the loadbalancer for the controller box. With this configuration Nginx will loadbalance all requests between the two blueprint boxes st2-multi-node-1 and st2-multi-node-2. This includes requests to st2api, st2auth and mistral-api. Nginx also serves as the webserver for st2web.
  11. Install st2chatops following setup chatops<ref-setup-chatops-deb>.

Blueprint box

This box is a repeatable image that is essentially the single-box reference deployment with a few changes. The aim is to deploy as many of these boxes as required for the desired HA objectives, while also gaining some horizontal scaling. The StackStorm processes outlined above can each be turned on and off individually, so each box can also be made to offer different services.

  1. Add stable repos.
curl -s https://packagecloud.io/install/repositories/StackStorm/staging-stable/script.deb.sh | sudo bash
  2. Install all StackStorm components and Mistral.
sudo apt-get install -y st2 st2mistral
  3. Install Nginx.
sudo apt-get install -y nginx
  4. Update the Mistral connection to PostgreSQL in /etc/mistral/mistral.conf by changing the database.connection property.
  5. Update the Mistral connection to RabbitMQ in /etc/mistral/mistral.conf by changing the default.transport_url property.
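A sketch of both changes; <controller-ip> is a placeholder, and the mistral role and password match the DB created on the controller box:

    # /etc/mistral/mistral.conf
    [database]
    connection = postgresql://mistral:StackStorm@<controller-ip>/mistral

    [DEFAULT]
    transport_url = rabbit://guest:guest@<controller-ip>:5672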
  6. Set up users as per this documentation<ref-config-auth-deb>. Make sure that the user configuration on all boxes running st2auth is identical. This ensures consistent authentication across the entire StackStorm install, since a request to authenticate a user can be forwarded by the loadbalancer to any of the st2auth processes.
  7. Use the shared st2 config to replace /etc/st2/st2.conf. This config points at the controller node for the database, messaging and mistral configuration values; a sketch follows below.
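A sketch of the relevant st2.conf values; the option names follow the standard st2 configuration, and <controller-ip> is a placeholder:

    # /etc/st2/st2.conf
    [database]
    host = <controller-ip>

    [messaging]
    url = amqp://guest:guest@<controller-ip>:5672

    [mistral]
    v2_base_url = https://<controller-ip>/mistral/v2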
  8. Configure authentication as per this documentation<ref-config-auth-deb>.
  9. Use the Nginx config for the blueprint boxes. In this config Nginx acts as the SSL termination endpoint for all the REST endpoints exposed by st2api, st2auth and mistral-api.
  10. To use timer triggers, only enable them on one server. On all other servers, disable them by setting the following in /etc/st2/st2.conf:

    [timer]
    enable = False
  11. See /reference/sensor_partitioning to decide how to partition sensors in a way that suits your requirements.
  12. All content should be synced by choosing a suitable strategy, as outlined above. This is crucial to obtaining predictable outcomes.