
Merge pull request #26 from Metaswitch/chronos_scale_docs
[Reviewer: Matt] Add chronos scaling changes
eleanor-merry committed Apr 23, 2015
2 parents b10a5d3 + cfe961c commit 2bdcef4
Showing 3 changed files with 18 additions and 71 deletions.
60 changes: 17 additions & 43 deletions docs/Clearwater_Elastic_Scaling.md
@@ -4,23 +4,15 @@ This page explains how to use this elastic scaling function when using a deployment

## Before scaling your deployment

Before scaling up or down, you should decide how many of each of the Bono, Sprout, Homestead and Homer nodes you need (i.e. your target size). This should be based on your call load profile and measurements of your current systems, though based on experience we recommend scaling up a tier of a given type (Sprout, Bono, etc.) when the average CPU utilization within that tier reaches ~60%. The [Deployment Sizing Spreadsheet](http://www.projectclearwater.org/technical/clearwater-performance/) may also provide useful input.
Before scaling up or down, you should decide how many of each of the Bono, Sprout, Homestead, Homer and Ralf nodes you need (i.e. your target size). This should be based on your call load profile and measurements of your current systems, though based on experience we recommend scaling up a tier of a given type (Sprout, Bono, etc.) when the average CPU utilization within that tier reaches ~60%. The [Deployment Sizing Spreadsheet](http://www.projectclearwater.org/technical/clearwater-performance/) may also provide useful input.

Having determined your new cluster size, you need to decide which of the following two scaling methods you are going to use. The quick method is completely safe when scaling up your cluster but, when scaling down, timer events programmed before the scale operation started may be lost. This may lead to:

* Incomplete billing records generated by Ralf
* Missing NOTIFY messages on registration expiry
* Authentication vectors not being invalidated if the client never sends an authenticated REGISTER message.

On the other hand, the slow method guarantees that timers are preserved across the scale operation. Despite its limitations, the quick method is simpler and is sufficient for test deployments. Unless you are using your deployment to host a live service, we recommend the quick method.

## Performing the resize (Quick Method)
## Performing the resize

### If you did an Automated Install

To resize your automated deployment, run:

knife deployment resize -E <env> --sprout-count <n> --bono-count <n> --homer-count <n> --homestead-count <n>
knife deployment resize -E <env> --sprout-count <n> --bono-count <n> --homer-count <n> --homestead-count <n> --ralf-count <n>

Where the `<n>` values are the number of nodes of each type you need. Once this command has finished, the resize operation is complete and any nodes that are no longer needed will have been terminated.
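For example, to scale an environment to four Sprout nodes and two nodes of each other type, you might run the command below. The environment name and the counts are purely illustrative; substitute your own target sizes.

    knife deployment resize -E clearwater --sprout-count 4 --bono-count 2 --homer-count 2 --homestead-count 2 --ralf-count 2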

@@ -31,43 +23,25 @@ If you're scaling up your manual deployment, follow the process below.
1. Spin up new nodes, following the [standard install process](Manual Install).
2. On Sprout, Memento and Ralf nodes, update `/etc/clearwater/cluster_settings` to contain both a list of the old nodes (`servers=...`) and a (longer) list of the new nodes (`new_servers=...`) and then run `service <process> reload` to re-read this file (see the example after this list).
3. On new Memento, Homestead and Homer nodes, follow the [instructions on the Cassandra website](http://www.datastax.com/documentation/cassandra/1.2/cassandra/operations/ops_add_node_to_cluster_t.html) to join the new nodes to the existing cluster.
4. On Sprout, Homestead and Ralf nodes, update `/etc/chronos/chronos.conf` to contain a list of all the nodes and then run `service chronos reload` to re-read this file.
4. On Sprout and Ralf nodes, update `/etc/chronos/chronos.conf` to contain a list of all the nodes (see [here](https://github.com/Metaswitch/chronos/blob/dev/doc/clustering.md) for details of how to do this) and then run `service chronos reload` to re-read this file.
5. On Sprout, Memento and Ralf nodes, run `service astaire reload` to start resynchronization.
6. Update DNS to contain the new nodes.
7. On Sprout, Memento and Ralf nodes, wait until Astaire has resynchronized, either by running `service astaire wait-sync` or by polling over [SNMP](Clearwater SNMP Statistics).
8. On all nodes, update `/etc/clearwater/cluster_settings` to just contain the new list of nodes (`servers=...`) and then run `service <process> reload` to re-read this file.
6. On Sprout and Ralf nodes, run `service chronos resync` to start resynchronization of Chronos timers.
7. Update DNS to contain the new nodes.
8. On Sprout, Memento and Ralf nodes, wait until Astaire has resynchronized, either by running `service astaire wait-sync` or by polling over [SNMP](Clearwater SNMP Statistics).
9. On Sprout and Ralf nodes, wait until Chronos has resynchronized, either by running `service chronos wait-sync` or by polling over [SNMP](Clearwater SNMP Statistics).
10. On all nodes, update `/etc/clearwater/cluster_settings` to just contain the new list of nodes (`servers=...`) and then run `service <process> reload` to re-read this file.
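To make steps 2 and 10 concrete, here is a sketch of what `/etc/clearwater/cluster_settings` might contain part-way through a scale-up from two nodes to three. The IP addresses are placeholders, and 11211 is the memcached port your existing file should already reference; follow the style of your current file rather than copying this sketch verbatim.

    servers=10.0.0.1:11211,10.0.0.2:11211
    new_servers=10.0.0.1:11211,10.0.0.2:11211,10.0.0.3:11211

Once Astaire has finished resynchronizing (step 8), the file is collapsed back to a single `servers=...` line listing all of the new nodes (step 10).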

If you're scaling down your manual deployment, follow the process below.

1. Update DNS to contain the nodes that will remain after the scale-down.
2. On Sprout, Memento and Ralf nodes, update `/etc/clearwater/cluster_settings` to contain both a list of the old nodes (`servers=...`) and a (shorter) list of the new nodes (`new_servers=...`) and then run `service <process> reload` to re-read this file.
3. On leaving Memento, Homestead and Homer nodes, follow the [instructions on the Cassandra website](http://www.datastax.com/documentation/cassandra/1.2/cassandra/operations/ops_remove_node_t.html) to remove the leaving nodes from the cluster.
4. On Sprout, Homestead and Ralf nodes, update `/etc/chronos/chronos.conf` to contain a list of just the remaining nodes and then run `service chronos reload` to re-read this file.
4. On Sprout and Ralf nodes, update `/etc/chronos/chronos.conf` to mark the nodes that are being scaled down as leaving (see [here](https://github.com/Metaswitch/chronos/blob/dev/doc/clustering.md) for details of how to do this, and the sketch after this list) and then run `service chronos reload` to re-read this file.
5. On Sprout, Memento and Ralf nodes, run `service astaire reload` to start resynchronization.
6. On Sprout, Memento and Ralf nodes, wait until Astaire has resynchronized, either by running `service astaire wait-sync` or by polling over [SNMP](Clearwater SNMP Statistics).
7. On all nodes, update `/etc/clearwater/cluster_settings` to just contain the new list of nodes (`servers=...`) and then run `service <process> reload` to re-read this file.
8. On the nodes that are about to be turned down, run `monit unmonitor <process> && service <process> quiesce` to start the main process quiescing.
9. Turn down each of these nodes once the process has terminated.

## Performing the resize (Slow Method)

### If you did an Automated Install

To resize your automated deployment:

* Run

knife deployment resize -E <env> --sprout-count <n> --bono-count <n> --homer-count <n> --homestead-count <n> --start

where the `<n>` values are as they were in the quick method.

* Wait for the session refresh time to pass.
* Run

knife deployment resize -E <env> --finish

Once this last command has completed, the resize operation is complete and any nodes that are no longer needed will have been terminated.

### If you did a Manual Install

To resize your manual deployment, follow the quick process described above but, after waiting for Astaire to resynchronize, also wait for the session refresh time to pass.
6. On the Sprout and Ralf nodes that are staying in the Chronos cluster, run `service chronos resync` to start resynchronization of Chronos timers.
7. On Sprout, Memento and Ralf nodes, wait until Astaire has resynchronized, either by running `service astaire wait-sync` or by polling over [SNMP](Clearwater SNMP Statistics).
8. On Sprout and Ralf nodes, wait until Chronos has resynchronized, either by running `service chronos wait-sync` or by polling over [SNMP](Clearwater SNMP Statistics).
9. On Sprout, Memento and Ralf nodes, update `/etc/clearwater/cluster_settings` to just contain the new list of nodes (`servers=...`) and then run `service <process> reload` to re-read this file.
10. On the Sprout and Ralf nodes that are staying in the cluster, update `/etc/chronos/chronos.conf` so that it only contains entries for the staying nodes in the cluster and then run `service chronos reload` to re-read this file.
11. On the nodes that are about to be turned down, run `monit unmonitor <process> && service <process> quiesce|stop` to start the main process quiescing.
12. Turn down each of these nodes once the process has terminated.
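As an illustration of step 4, the clustering section of `/etc/chronos/chronos.conf` on a node that is staying in the cluster might look roughly like the sketch below while the scale-down is in progress. This is only a sketch based on the Chronos clustering documentation linked above: the addresses are placeholders, and you should follow that documentation for the exact option names supported by your Chronos version.

    [cluster]
    localhost = 10.0.0.1
    node = 10.0.0.1
    node = 10.0.0.2
    leaving = 10.0.0.3

After Chronos has resynchronized (step 8), step 10 removes the `leaving` entries so that the file again lists only the remaining nodes.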
27 changes: 0 additions & 27 deletions docs/Clearwater_IP_Port_Usage.md
@@ -49,10 +49,6 @@ All-in-one nodes need the following ports opened to the world

UDP/32768-65535

* 0MQ statistics interface:

TCP/6665-6669

## Ellis

The Ellis node needs the following ports opened to the world:
@@ -81,10 +77,6 @@ The Bono nodes need the following ports opened to the world:

UDP/32768-65535

* 0MQ statistics interface:

TCP/6669

They also need the following ports open to all other Bono nodes and to all the Sprout nodes:

* Internal SIP signalling:
@@ -125,10 +117,6 @@ They also need the following ports opened to all homestead nodes:

They also need the following ports opened to the world:

* 0MQ statistics interface:

TCP/6666

* HTTP interface (if including a Memento AS):

TCP/443
@@ -156,11 +144,6 @@ They also need the following ports opened to all other Homestead nodes:

They also need the following ports opened to the world:

* 0MQ statistics interface:

TCP/6667
TCP/6668

## Homer

The Homer nodes need the following ports open to all the Sprout nodes and the Ellis node:
@@ -177,10 +160,6 @@ They also need the following ports opened to all other Homer nodes:

They also need the following ports opened to the world:

* 0MQ statistics interface:

TCP/6665

## Ralf

The Ralf nodes need the following ports open to all the Sprout and Bono nodes:
@@ -199,12 +178,6 @@ They also need the following ports open to all other Ralf nodes:

TCP/11211
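How you open and restrict these ports depends on your environment (for example cloud security groups or a host firewall). As a rough, hypothetical sketch, restricting the memcached port above to the other Ralf nodes with plain iptables might look like the following; the source subnet is a placeholder for wherever your Ralf nodes live.

    # Accept memcached traffic only from the subnet containing the other Ralf nodes (placeholder subnet)
    iptables -A INPUT -p tcp --dport 11211 -s 10.0.1.0/24 -j ACCEPT
    # Drop memcached traffic from everywhere else
    iptables -A INPUT -p tcp --dport 11211 -j DROP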

They also need the following ports opened to the world:

* 0MQ statistics interface:

TCP/6664

## Standalone Application Servers

Standalone application servers need the following ports open to all Sprout nodes:
2 changes: 1 addition & 1 deletion docs/Manual_Install.md
@@ -234,7 +234,7 @@ Once you've reached this point, your Clearwater deployment is ready to handle ca

## Larger-Scale Deployments

If you're intending to spin up a larger-scale deployment containing more than one node of each type, it's recommended that you use the [automated install process](Automated_Install), as this makes scaling up and down very straightforward. If for some reason you can't, you'll need to configure DNS correctly and cluster the nodes in the sprout, homestead and homer tiers.
If you're intending to spin up a larger-scale deployment containing more than one node of each type, it's recommended that you use the [automated install process](Automated_Install), as this makes scaling up and down very straightforward. If for some reason you can't, you'll need to configure DNS correctly and cluster the nodes in the Sprout, Homestead, Homer and Ralf tiers.

### Configuring DNS

