
Merge pull request #26 from Metaswitch/chronos_scale_docs
[Reviewer: Matt] Add chronos scaling changes
eleanor-merry committed Apr 23, 2015
2 parents b10a5d3 + cfe961c commit 2bdcef4
Showing 3 changed files with 18 additions and 71 deletions.
60 changes: 17 additions & 43 deletions docs/Clearwater_Elastic_Scaling.md
@@ -4,23 +4,15 @@ This page explains how to use this elastic scaling function when using a deployment

## Before scaling your deployment

Before scaling up or down, you should decide how many of each of the Bono, Sprout, Homestead and Homer nodes you need (i.e. your target size). This should be based on your call load profile and measurements of your current systems, though based on experience we recommend scaling up a tier of a given type (Sprout, Bono, etc.) when the average CPU utilization within that tier reaches ~60%. The [Deployment Sizing Spreadsheet](http://www.projectclearwater.org/technical/clearwater-performance/) may also provide useful input.
Before scaling up or down, you should decide how many of each of the Bono, Sprout, Homestead, Homer and Ralf nodes you need (i.e. your target size). This should be based on your call load profile and measurements of your current systems, though based on experience we recommend scaling up a tier of a given type (Sprout, Bono, etc.) when the average CPU utilization within that tier reaches ~60%. The [Deployment Sizing Spreadsheet](http://www.projectclearwater.org/technical/clearwater-performance/) may also provide useful input.

Having determined your new cluster size, you need to decide which of the following two scaling methods you are going to use. The quick method is completely safe when scaling up your cluster but, when scaling down, timer events programmed before the scale operation started may be lost. This may lead to:

* Incomplete billing records generated by Ralf
* Missing NOTIFY messages on registration expiry
* Authentication vectors not being invalidated if the client never sends an authenticated REGISTER message.

On the other hand, the slow method guarantees that timers are preserved across the scale operation. Despite its limitations, the quick method is simpler and is sufficient for test deployments. Unless you are using your deployment to host a live service, we recommend the quick method.

## Performing the resize (Quick Method)
## Performing the resize

### If you did an Automated Install

To resize your automated deployment, run:

knife deployment resize -E <env> --sprout-count <n> --bono-count <n> --homer-count <n> --homestead-count <n>
knife deployment resize -E <env> --sprout-count <n> --bono-count <n> --homer-count <n> --homestead-count <n> --ralf-count <n>

Where the `<n>` values are the number of nodes of each type you need. Once this command has finished, the resize operation is complete and any nodes that are no longer needed will have been terminated.
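For example, to scale an environment to four Sprout nodes and two nodes of each other type, you might run the command below. The environment name and the counts are purely illustrative; substitute your own target sizes.

    knife deployment resize -E clearwater --sprout-count 4 --bono-count 2 --homer-count 2 --homestead-count 2 --ralf-count 2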

@@ -31,43 +23,25 @@ If you're scaling up your manual deployment, follow the process below.
1. Spin up new nodes, following the [standard install process](Manual Install).
2. On Sprout, Memento and Ralf nodes, update `/etc/clearwater/cluster_settings` to contain both a list of the old nodes (`servers=...`) and a (longer) list of the new nodes (`new_servers=...`) and then run `service <process> reload` to re-read this file (see the example after this list).
3. On new Memento, Homestead and Homer nodes, follow the [instructions on the Cassandra website](http://www.datastax.com/documentation/cassandra/1.2/cassandra/operations/ops_add_node_to_cluster_t.html) to join the new nodes to the existing cluster.
4. On Sprout, Homestead and Ralf nodes, update `/etc/chronos/chronos.conf` to contain a list of all the nodes and then run `service chronos reload` to re-read this file.
4. On Sprout and Ralf nodes, update `/etc/chronos/chronos.conf` to contain a list of all the nodes (see [here](https://github.com/Metaswitch/chronos/blob/dev/doc/clustering.md) for details of how to do this) and then run `service chronos reload` to re-read this file.
5. On Sprout, Memento and Ralf nodes, run `service astaire reload` to start resynchronization.
6. Update DNS to contain the new nodes.
7. On Sprout, Memento and Ralf nodes, wait until Astaire has resynchronized, either by running `service astaire wait-sync` or by polling over [SNMP](Clearwater SNMP Statistics).
8. On all nodes, update `/etc/clearwater/cluster_settings` to just contain the new list of nodes (`servers=...`) and then run `service <process> reload` to re-read this file.
6. On Sprout and Ralf nodes, run `service chronos resync` to start resynchronization of Chronos timers.
7. Update DNS to contain the new nodes.
8. On Sprout, Memento and Ralf nodes, wait until Astaire has resynchronized, either by running `service astaire wait-sync` or by polling over [SNMP](Clearwater SNMP Statistics).
9. On Sprout and Ralf nodes, wait until Chronos has resynchronized, either by running `service chronos wait-sync` or by polling over [SNMP](Clearwater SNMP Statistics).
10. On all nodes, update `/etc/clearwater/cluster_settings` to just contain the new list of nodes (`servers=...`) and then run `service <process> reload` to re-read this file.
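To make steps 2 and 10 concrete, here is a sketch of what `/etc/clearwater/cluster_settings` might contain part-way through a scale-up from two nodes to three. The IP addresses are placeholders, and 11211 is the memcached port your existing file should already reference; follow the style of your current file rather than copying this sketch verbatim.

    servers=10.0.0.1:11211,10.0.0.2:11211
    new_servers=10.0.0.1:11211,10.0.0.2:11211,10.0.0.3:11211

Once Astaire has finished resynchronizing (step 8), the file is collapsed back to a single `servers=...` line listing all of the new nodes (step 10).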

If you're scaling down your manual deployment, follow the process below.

1. Update DNS to contain the nodes that will remain after the scale-down.
2. On Sprout, Memento and Ralf nodes, update `/etc/clearwater/cluster_settings` to contain both a list of the old nodes (`servers=...`) and a (shorter) list of the new nodes (`new_servers=...`) and then run `service <process> reload` to re-read this file.
3. On leaving Memento, Homestead and Homer nodes, follow the [instructions on the Cassandra website](http://www.datastax.com/documentation/cassandra/1.2/cassandra/operations/ops_remove_node_t.html) to remove the leaving nodes from the cluster.
4. On Sprout, Homestead and Ralf nodes, update `/etc/chronos/chronos.conf` to contain a list of just the remaining nodes and then run `service chronos reload` to re-read this file.
4. On Sprout and Ralf nodes, update `/etc/chronos/chronos.conf` to mark the nodes that are being scaled down as leaving (see [here](https://github.com/Metaswitch/chronos/blob/dev/doc/clustering.md) for details of how to do this, and the sketch after this list) and then run `service chronos reload` to re-read this file.
5. On Sprout, Memento and Ralf nodes, run `service astaire reload` to start resynchronization.
6. On Sprout, Memento and Ralf nodes, wait until Astaire has resynchronized, either by running `service astaire wait-sync` or by polling over [SNMP](Clearwater SNMP Statistics).
7. On all nodes, update `/etc/clearwater/cluster_settings` to just contain the new list of nodes (`servers=...`) and then run `service <process> reload` to re-read this file.
8. On the nodes that are about to be turned down, run `monit unmonitor <process> && service <process> quiesce` to start the main process quiescing.
9. Turn down each of these nodes once the process has terminated.

## Performing the resize (Slow Method)

### If you did an Automated Install

To resize your automated deployment:

* Run

knife deployment resize -E <env> --sprout-count <n> --bono-count <n> --homer-count <n> --homestead-count <n> --start

where the `<n>` values are as they were in the quick method.

* Wait for the session refresh time to pass.
* Run

knife deployment resize -E <env> --finish

Once this last command has completed, the resize operation is complete and any nodes that are no longer needed will have been terminated.

### If you did a Manual Install

To resize your manual deployment, follow the quick process described above but, after waiting for Astaire to resynchronize, also wait for the session refresh time to pass.
6. On the Sprout and Ralf nodes that are staying in the Chronos cluster, run `service chronos resync` to start resynchronization of Chronos timers.
7. On Sprout, Memento and Ralf nodes, wait until Astaire has resynchronized, either by running `service astaire wait-sync` or by polling over [SNMP](Clearwater SNMP Statistics).
8. On Sprout and Ralf nodes, wait until Chronos has resynchronized, either by running `service chronos wait-sync` or by polling over [SNMP](Clearwater SNMP Statistics).
9. On Sprout, Memento and Ralf nodes, update `/etc/clearwater/cluster_settings` to just contain the new list of nodes (`servers=...`) and then run `service <process> reload` to re-read this file.
10. On the Sprout and Ralf nodes that are staying in the cluster, update `/etc/chronos/chronos.conf` so that it only contains entries for the staying nodes in the cluster and then run `service chronos reload` to re-read this file.
11. On the nodes that are about to be turned down, run `monit unmonitor <process> && service <process> quiesce|stop` to start the main process quiescing.
12. Turn down each of these nodes once the process has terminated.
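As an illustration of step 4, the clustering section of `/etc/chronos/chronos.conf` on a node that is staying in the cluster might look roughly like the sketch below while the scale-down is in progress. This is only a sketch based on the Chronos clustering documentation linked above: the addresses are placeholders, and you should follow that documentation for the exact option names supported by your Chronos version.

    [cluster]
    localhost = 10.0.0.1
    node = 10.0.0.1
    node = 10.0.0.2
    leaving = 10.0.0.3

After Chronos has resynchronized (step 8), step 10 removes the `leaving` entries so that the file again lists only the remaining nodes.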
27 changes: 0 additions & 27 deletions docs/Clearwater_IP_Port_Usage.md
@@ -49,10 +49,6 @@ All-in-one nodes need the following ports opened to the world

UDP/32768-65535

* 0MQ statistics interface:

TCP/6665-6669

## Ellis

The Ellis node needs the following ports opened to the world:
@@ -81,10 +77,6 @@ The Bono nodes need the following ports opened to the world:

UDP/32768-65535

* 0MQ statistics interface:

TCP/6669

They also need the following ports open to all other Bono nodes and to all the Sprout nodes:

* Internal SIP signalling:
@@ -125,10 +117,6 @@ They also need the following ports opened to all homestead nodes:

They also need the following ports opened to the world:

* 0MQ statistics interface:

TCP/6666

* HTTP interface (if including a Memento AS):

TCP/443
@@ -156,11 +144,6 @@ They also need the following ports opened to all other Homestead nodes:

They also need the following ports opened to the world:

* 0MQ statistics interface:

TCP/6667
TCP/6668

## Homer

The Homer nodes need the following ports open to all the Sprout nodes and the Ellis node:
@@ -177,10 +160,6 @@ They also need the following ports opened to all other Homer nodes:

They also need the following ports opened to the world:

* 0MQ statistics interface:

TCP/6665

## Ralf

The Ralf nodes need the following ports open to all the Sprout and Bono nodes:
@@ -199,12 +178,6 @@ They also need the following ports open to all other Ralf nodes:

TCP/11211
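How you open and restrict these ports depends on your environment (for example cloud security groups or a host firewall). As a rough, hypothetical sketch, restricting the memcached port above to the other Ralf nodes with plain iptables might look like the following; the source subnet is a placeholder for wherever your Ralf nodes live.

    # Accept memcached traffic only from the subnet containing the other Ralf nodes (placeholder subnet)
    iptables -A INPUT -p tcp --dport 11211 -s 10.0.1.0/24 -j ACCEPT
    # Drop memcached traffic from everywhere else
    iptables -A INPUT -p tcp --dport 11211 -j DROP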

They also need the following ports opened to the world:

* 0MQ statistics interface:

TCP/6664

## Standalone Application Servers

Standalone application servers need the following ports open to all Sprout nodes:
2 changes: 1 addition & 1 deletion docs/Manual_Install.md
@@ -234,7 +234,7 @@ Once you've reached this point, your Clearwater deployment is ready to handle ca

## Larger-Scale Deployments

If you're intending to spin up a larger-scale deployment containing more than one node of each type, it's recommended that you use the [automated install process](Automated_Install), as this makes scaling up and down very straightforward. If for some reason you can't, you'll need to configure DNS correctly and cluster the nodes in the sprout, homestead and homer tiers.
If you're intending to spin up a larger-scale deployment containing more than one node of each type, it's recommended that you use the [automated install process](Automated_Install), as this makes scaling up and down very straightforward. If for some reason you can't, you'll need to configure DNS correctly and cluster the nodes in the Sprout, Homestead, Homer and Ralf tiers.

### Configuring DNS

