
Merge pull request #121 from Metaswitch/rtd_updates
[Reviewer: Rob] Updates to ReadTheDocs
eleanor-merry committed Oct 15, 2015
2 parents 16be125 + ade5543 commit cec6567
Showing 3 changed files with 63 additions and 53 deletions.
docs/Clearwater_stress_testing.md: 16 changes (8 additions, 8 deletions)
Expand Up @@ -18,6 +18,8 @@ The clearwater-sip-stress package includes two important scripts.
* `/usr/share/clearwater/infrastructure/scripts/sip-stress`, which generates a `/usr/share/clearwater/sip-stress/users.csv.1` file containing the list of all subscribers we should be targeting - these are calculated from properties in `/etc/clearwater/shared_config`.
* `/etc/init.d/clearwater-sip-stress`, which runs `/usr/share/clearwater/bin/sip-stress`, which in turn runs SIPp specifying `/usr/share/clearwater/sip-stress/call_load2.xml` as its test script. This test script simulates a pair of subscribers registering every 5 minutes and then making a call every 30 minutes.

The stress test logs to `/var/log/clearwater-sip-stress/sipp.<index>.out`.
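
For example, to follow the stress run on a given node (a sketch only; the index in the filename depends on the node, and 1 is used here purely for illustration):

```
tail -f /var/log/clearwater-sip-stress/sipp.1.out
```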

## Running Stress

### Using Chef
This section describes step-by-step how to run stress using Chef automation.
"repo_server" => "http://repo.cw-ngv.com/latest",
"number_start" => "2010000000",
"number_count" => 1000,
"pstn_number_count" => 0,
"enum_server" => "enum.ENVIRONMENT.DOMAIN"}
"pstn_number_count" => 0}

4. Upload your new environment to the chef server by typing `knife environment from file environments/ENVIRONMENT.rb`
5. Create the deployment by typing `knife deployment resize -E ENVIRONMENT`. If you want more nodes, supply parameters such as "--bono-count 5" or "--sprout-count 3" to control this.
6. Follow [this process](https://github.com/Metaswitch/crest/blob/dev/docs/Bulk-Provisioning%20Numbers.md) to bulk provision subscribers. Create 100,000 subscribers per SIPp node.
7. Create your stress test node by typing `knife box create -E ENVIRONMENT sipp --index 1`. If you have multiple bono nodes, you'll need to create multiple stress test nodes by repeating this command with "--index 2", "--index 3", etc. - each stress test node only sends traffic to the bono with the same index.
    * To create multiple nodes, try `for x in {1..20} ; do { knife box create -E ENVIRONMENT sipp --index $x && sleep 2 ; } ; done`.
    * To modify the number of calls/hour to simulate, edit/add `count=<number>` to `/etc/clearwater/shared_config`, then run `sudo /usr/share/clearwater/infrastructure/scripts/sip-stress` and `sudo service clearwater-sip-stress restart`.
8. Create a Cacti server for monitoring the deployment, as described in [this document](Cacti.md).
9. When you've finished, destroy your deployment with `knife deployment delete -E ENVIRONMENT`. (The knife commands from these steps are collected in the sketch below.)
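
For reference, the knife commands from the steps above can be strung together as a rough run-through. This is a sketch only: ENVIRONMENT and the node counts are placeholders, the environment file is assumed to already exist, and the bulk-provisioning and Cacti steps are noted only as comments.

```
# Sketch of the command sequence from the steps above.
knife environment from file environments/ENVIRONMENT.rb
knife deployment resize -E ENVIRONMENT    # add e.g. --bono-count 5 --sprout-count 3 for more nodes
# ...bulk provision subscribers (100,000 per SIPp node), per the linked document...
for x in 1 2 ; do                         # one stress test node per bono node
  knife box create -E ENVIRONMENT sipp --index $x && sleep 2
done
# ...set up a Cacti server as described in Cacti.md...
knife deployment delete -E ENVIRONMENT    # tear everything down when you've finished
```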

### Manual (i.e. non-Chef) stress runs

Set the following properties in /etc/clearwater/shared_config:
* (optional) bono_servers - a list of bono servers in this deployment
* (optional) stress_target - the target host (defaults to the $node_idx-th entry in $bono_servers or, if there are no $bono_servers, defaults to $home_realm)
* (optional) base - the base directory number (defaults to 2010000000)
* (optional) count - the number of calls to run on this node (defaults to 30000) - note that the SIPp script simulates 2 subscribers per "call".
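
As an illustration, these settings are plain `key=value` lines in `/etc/clearwater/shared_config` (the same syntax as the `count=<number>` example in the Chef section above). The values below are examples only, and any properties described earlier in this section but not shown here still apply.

```
# Example values only - adjust to match your deployment.
stress_target=bono-1.example.com   # hypothetical target host; defaults as described above
base=2010000000                    # base directory number
count=30000                        # number of "calls" run by this node (2 subscribers per call)
```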

Finally install the clearwater-sip-stress Debian package. Stress will start automatically after the package is installed.

docs/Geographic_redundancy.md: 53 changes (29 additions, 24 deletions)
The architecture of a geographically-redundant system is as follows.

![Diagram](img/Geographic_redundancy_diagram.png)

Sprout has one memcached cluster per geographic region. Although memcached
itself does not support the concept of local and remote peers, Sprout builds
this on top - writing to both local and remote clusters and reading from the
local but falling back to the remote. Communication between the nodes should be
secure - for example, if it is going over the public internet rather than a
private connection between datacenters, it should be encrypted and
authenticated with IPsec. Each Sprout uses Homers and Homesteads in the same
region only. Sprout also has one Chronos cluster per geographic region; these
clusters do not communicate.

Separate instances of Bono in each geographic region front the Sprouts
in that region. Clearwater uses a geo-routing DNS service such as
Amazon's Route&nbsp;53 to achieve this. A geo-routing DNS service
responds to DNS queries based on latency, so if you're nearer to
geographic region B's instances, you'll be served by them.

Homestead and Homer are each a single cluster split over the
geographic regions. Since they are backed by Cassandra (which is aware
of local and remote peers), they can be smarter about spatial
locality. As with Sprout nodes, communication between the nodes should be
secure.

Ellis is not redundant, whether deployed in a single geographic region
or more. It is deployed in one of the geographic regions and a failure
of that region would deny all provisioning function.

Ralf does not support geographic redundancy. Each geographic region has its
own Ralf cluster; Sprout and Bono should only communicate with their local
Ralfs.

While it appears as a single node in our system, Route 53 DNS is actually a
geographically-redundant service provided by Amazon. Route 53's DNS
interface has had 100% uptime since it was first turned up in 2010.
(Its configuration interface has not, but that is less important.)

The architecture above is for 2 geographic regions - we do not currently
support more regions.

Note that there are other servers involved in a deployment that are not
described above. Specifically,
The subscriber interacts with Clearwater through 3 interfaces, and these
each have a different user experience.

- SIP to Bono for calls
- HTTP to Homer for direct call service configuration (not currently
  exposed)
- HTTP to Ellis for web-UI-based provisioning

For the purposes of the following descriptions, we label the two regions
A and B, and assume that the deployment in region A has failed.

### SIP to Bono

If the subscriber was connected to a Bono node in region A, their TCP
connection fails. They then attempt to re-register. If it has been more
than the DNS TTL (proposed to be 30s) since they last connected, DNS
will point to region B, they will re-register and their service will
recover (both for incoming and outgoing calls). If it has been less than
the DNS TTL since they last connected, they will probably wait 5 minutes
before they try to re-register (using the correct DNS entry this time).

If the subscriber was connected to a Bono node in region B, their TCP
connection does not fail, they do not re-register and their service is
unaffected.

Overall, the subscriber's expected incoming or outgoing call service
outage would be as follows.

50% chance of being on a Bono node in region A *
30s/300s chance of having a stale DNS entry *
300s re-registration time
= 5% chance of a 300s outage
= 15s average outage

Realistically, if 50% of subscribers all re-registered almost
simultaneously (due to their TCP connection dropping and their DNS being
timed out), it's unlikely that Bono would be able to keep up.

Also, depending on the failure mode of the nodes in region A, it's
possible that the TCP connection failure would be silent and the clients
would not notice until they next re-REGISTERed. In this case, all
clients connected to Bonos in region A would take an average of 150s to
notice the failure. This equates to a 50% chance of a 150s outage, or an
average outage of 75s.
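
As a back-of-the-envelope check on the figures above (a sketch, assuming the 30s DNS TTL and 300s re-registration interval already described):

```
# 50% chance of region A * 30s/300s stale-DNS chance * 300s re-registration time
echo "0.5 * (30 / 300) * 300" | bc -l   # 15.0s average outage
# Silent failure: 50% chance of region A * 150s average wait for the next re-REGISTER
echo "0.5 * 150" | bc -l                # 75.0s average outage
```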

### HTTP to Homer

(This function is not currently exposed.)

If the subscriber was using a Homer node in region A, their requests
would fail until their DNS timed out. If the subscriber was using a
Homer node in region B, they would see no failures.

Given the proposed DNS TTL of 30s, 50% of subscribers (those in region
A) would see an average of 15s of failures. On average, a subscriber
would see 7.5s of failures.

### HTTP to Ellis

Ellis is not geographically redundant. If Ellis was deployed in region
A, all service would fail until region A was recovered. If Ellis was
deployed in region B, there would be no outage.

Setup
