
Merge pull request #121 from Metaswitch/rtd_updates
[Reviewer: Rob] Updates to ReadTheDocs
eleanor-merry committed Oct 15, 2015
2 parents 16be125 + ade5543 commit cec6567
Showing 3 changed files with 63 additions and 53 deletions.
docs/Clearwater_stress_testing.md: 16 changes (8 additions, 8 deletions)
Expand Up @@ -18,6 +18,8 @@ The clearwater-sip-stress package includes two important scripts.
* `/usr/share/clearwater/infrastructure/scripts/sip-stress`, which generates a `/usr/share/clearwater/sip-stress/users.csv.1` file containing the list of all subscribers we should be targeting - these are calculated from properties in `/etc/clearwater/shared_config`.
* `/etc/init.d/clearwater-sip-stress`, which runs `/usr/share/clearwater/bin/sip-stress`, which in turn runs SIPp specifying `/usr/share/clearwater/sip-stress/call_load2.xml` as its test script. This test script simulates a pair of subscribers registering every 5 minutes and then making a call every 30 minutes.

The stress test logs to `/var/log/clearwater-sip-stress/sipp.<index>.out`.
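
For example, to follow the stress run on a given node (a sketch only; the index in the filename depends on the node, and 1 is used here purely for illustration):

```
tail -f /var/log/clearwater-sip-stress/sipp.1.out
```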

## Running Stress

### Using Chef
This section describes step-by-step how to run stress using Chef automation.
"repo_server" => "http://repo.cw-ngv.com/latest",
"number_start" => "2010000000",
"number_count" => 1000,
"pstn_number_count" => 0,
"enum_server" => "enum.ENVIRONMENT.DOMAIN"}
"pstn_number_count" => 0}

4. Upload your new environment to the chef server by typing `knife environment from file environments/ENVIRONMENT.rb`
5. Create the deployment by typing `knife deployment resize -E ENVIRONMENT`. If you want more nodes, supply parameters such as "--bono-count 5" or "--sprout-count 3" to control this.
6. Follow [this process](https://github.com/Metaswitch/crest/blob/dev/docs/Bulk-Provisioning%20Numbers.md) to bulk provision subscribers. Create 100,000 subscribers per SIPp node.
7. Create your stress test node by typing `knife box create -E ENVIRONMENT sipp --index 1`. If you have multiple bono nodes, you'll need to create multiple stress test nodes by repeating this command with "--index 2", "--index 3", etc. - each stress test node only sends traffic to the bono with the same index.
    * To create multiple nodes, try `for x in {1..20} ; do { knife box create -E ENVIRONMENT sipp --index $x && sleep 2 ; } ; done`.
    * To modify the number of calls/hour to simulate, edit/add `count=<number>` to `/etc/clearwater/shared_config`, then run `sudo /usr/share/clearwater/infrastructure/scripts/sip-stress` and `sudo service clearwater-sip-stress restart`.
8. Create a Cacti server for monitoring the deployment, as described in [this document](Cacti.md).
9. When you've finished, destroy your deployment with `knife deployment delete -E ENVIRONMENT`. (The knife commands from these steps are collected in the sketch below.)
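
For reference, the knife commands from the steps above can be strung together as a rough run-through. This is a sketch only: ENVIRONMENT and the node counts are placeholders, the environment file is assumed to already exist, and the bulk-provisioning and Cacti steps are noted only as comments.

```
# Sketch of the command sequence from the steps above.
knife environment from file environments/ENVIRONMENT.rb
knife deployment resize -E ENVIRONMENT    # add e.g. --bono-count 5 --sprout-count 3 for more nodes
# ...bulk provision subscribers (100,000 per SIPp node), per the linked document...
for x in 1 2 ; do                         # one stress test node per bono node
  knife box create -E ENVIRONMENT sipp --index $x && sleep 2
done
# ...set up a Cacti server as described in Cacti.md...
knife deployment delete -E ENVIRONMENT    # tear everything down when you've finished
```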

### Manual (i.e. non-Chef) stress runs

Set the following properties in /etc/clearwater/shared_config:
* (optional) bono_servers - a list of bono servers in this deployment
* (optional) stress_target - the target host (defaults to the $node_idx-th entry in $bono_servers or, if there are no $bono_servers, defaults to $home_realm)
* (optional) base - the base directory number (defaults to 2010000000)
* (optional) count - the number of calls to run on this node (defaults to 30000) - note that the SIPp script simulates 2 subscribers per "call".
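
As an illustration, these settings are plain `key=value` lines in `/etc/clearwater/shared_config` (the same syntax as the `count=<number>` example in the Chef section above). The values below are examples only, and any properties described earlier in this section but not shown here still apply.

```
# Example values only - adjust to match your deployment.
stress_target=bono-1.example.com   # hypothetical target host; defaults as described above
base=2010000000                    # base directory number
count=30000                        # number of "calls" run by this node (2 subscribers per call)
```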

Finally install the clearwater-sip-stress Debian package. Stress will start automatically after the package is installed.

docs/Geographic_redundancy.md: 53 changes (29 additions, 24 deletions)
The architecture of a geographically-redundant system is as follows.

![Diagram](img/Geographic_redundancy_diagram.png)

Sprout has one memcached cluster per geographic region. Although memcached
itself does not support the concept of local and remote peers, Sprout builds
this on top - writing to both local and remote clusters and reading from the
local but falling back to the remote. Communication between the nodes should be
secure - for example, if it is going over the public internet rather than a
private connection between datacenters, it should be encrypted and
authenticated with IPsec. Each Sprout uses Homers and Homesteads in the same
region only. Sprout also has one Chronos cluster per geographic region; these
clusters do not communicate.

Separate instances of Bono in each geographic region front the Sprouts
in that region. Clearwater uses a geo-routing DNS service such as
Amazon's Route&nbsp;53 to achieve this. A geo-routing DNS service
responds to DNS queries based on latency, so if you're nearer to
geographic region B's instances, you'll be served by them.

Homestead and Homer are each a single cluster split over the
geographic regions. Since they are backed by Cassandra (which is aware
of local and remote peers), they can be smarter about spatial
locality. As with Sprout nodes, communication between the nodes should be
secure.

Ellis is not redundant, whether deployed in a single geographic region
or more. It is deployed in one of the geographic regions and a failure
of that region would deny all provisioning function.

Ralf does not support geographic redundancy. Each geographic region has its
own Ralf cluster; Sprout and Bono should only communicate with their local
Ralfs.

While it appears as a single node in our system, Route 53 DNS is actually a
geographically-redundant service provided by Amazon. Route 53's DNS
interface has had 100% uptime since it was first turned up in 2010.
(Its configuration interface has not, but that is less important.)

The architecture above is for 2 geographic regions - we do not currently
support more regions.

Note that there are other servers involved in a deployment that are not
described above. Specifically,
The subscriber interacts with Clearwater through 3 interfaces, and these
each have a different user experience.

- SIP to Bono for calls
- HTTP to Homer for direct call service configuration (not currently
  exposed)
- HTTP to Ellis for web-UI-based provisioning

For the purposes of the following descriptions, we label the two regions
A and B, and assume that the deployment in region A has failed.

### SIP to Bono

If the subscriber was connected to a Bono node in region A, their TCP
connection fails. They then attempt to re-register. If it has been more
than the DNS TTL (proposed to be 30s) since they last connected, DNS
will point to region B, they will re-register and their service will
recover (both for incoming and outgoing calls). If it has been less than
the DNS TTL since they last connected, they will probably wait 5 minutes
before they try to re-register (using the correct DNS entry this time).

If the subscriber was connected to a Bono node in region B, their TCP
connection does not fail, they do not re-register and their service is
unaffected.

Overall, the subscriber's expected incoming or outgoing call service
outage would be as follows.

50% chance of being on a Bono node in region A *
30s/300s chance of having a stale DNS entry *
300s re-registration time
= 5% chance of a 300s outage
= 15s average outage

Realistically, if 50% of subscribers all re-registered almost
simultaneously (due to their TCP connection dropping and their DNS being
timed out), it's unlikely that Bono would be able to keep up.

Also, depending on the failure mode of the nodes in region A, it's
possible that the TCP connection failure would be silent and the clients
would not notice until they next re-REGISTERed. In this case, all
clients connected to Bonos in region A would take an average of 150s to
notice the failure. This equates to a 50% chance of a 150s outage, or an
average outage of 75s.
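
As a back-of-the-envelope check on the figures above (a sketch, assuming the 30s DNS TTL and 300s re-registration interval already described):

```
# 50% chance of region A * 30s/300s stale-DNS chance * 300s re-registration time
echo "0.5 * (30 / 300) * 300" | bc -l   # 15.0s average outage
# Silent failure: 50% chance of region A * 150s average wait for the next re-REGISTER
echo "0.5 * 150" | bc -l                # 75.0s average outage
```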

### HTTP to Homer

(This function is not currently exposed.)

If the subscriber was using a Homer node in region A, their requests
would fail until their DNS timed out. If the subscriber was using a
Homer node in region B, they would see no failures.

Given the proposed DNS TTL of 30s, 50% of subscribers (those in region
A) would see an average of 15s of failures. On average, a subscriber
would see 7.5s of failures.

### HTTP to Ellis

Ellis is not geographically redundant. If Ellis was deployed in region
A, all service would fail until region A was recovered. If Ellis was
deployed in region B, there would be no outage.

Setup
