This repository has been archived by the owner on Feb 27, 2020. It is now read-only.

Merge pull request #275 from Metaswitch/cass_to_memcached
Cassandra to Memcached
sebrexmetaswitch committed Sep 7, 2017
2 parents a5b04cf + ebe5a78 commit 4974703
Showing 10 changed files with 30 additions and 17 deletions.
6 changes: 3 additions & 3 deletions docs/Clearwater_Architecture.md
@@ -35,7 +35,7 @@ Dime nodes run Clearwater's Homestead and Ralf components.

#### Homestead (HSS Cache)

Homestead provides a web services interface to Sprout for retrieving authentication credentials and user profile information. It can either master the data (in which case it exposes a web services provisioning interface) or can pull the data from an IMS compliant HSS over the Cx interface. The Homestead nodes themselves are stateless - the mastered / cached subscriber data is all stored on Vellum (via Cassandra's Thrift interface).
Homestead provides a web services interface to Sprout for retrieving authentication credentials and user profile information. It can either master the data (in which case it exposes a web services provisioning interface) or can pull the data from an IMS compliant HSS over the Cx interface. The Homestead nodes themselves are stateless - the mastered / cached subscriber data is all stored on Vellum (Cassandra for the mastered data, and Astaire/Memcached for the cached data).

In the IMS architecture, the HSS mirror function is considered to be part of the I-CSCF and S-CSCF components, so in Clearwater I-CSCF and S-CSCF function is implemented with a combination of Sprout and Dime clusters.

@@ -46,10 +46,10 @@ Ralf provides an HTTP API that both Bono and Sprout can use to report billable e
### Vellum (State store)

As described above, Vellum is used to maintain all long-lived state in the deployment. It does this by running a number of cloud-optimized, distributed storage clusters.
- [Cassandra](http://cassandra.apache.org/). Cassandra is used by Homestead to store authentication credentials and profile information, and is used by Homer to store MMTEL service settings. Vellum exposes Cassandra's Thrift API.
- [Cassandra](http://cassandra.apache.org/). Cassandra is used by Homestead to store authentication credentials and profile information when an HSS is not in use, and is used by Homer to store MMTEL service settings. Vellum exposes Cassandra's Thrift API.
- [etcd](https://github.com/coreos/etcd). etcd is used by Vellum itself to share clustering information between Vellum nodes and by other nodes in the deployment for shared configuration.
- [Chronos](https://github.com/Metaswitch/chronos). Chronos is a distributed, redundant, reliable timer service developed by Clearwater. It is used by Sprout and Ralf nodes to enable timers to be run (e.g. for SIP Registration expiry) without pinning operations to a specific node (one node can set the timer and another act on it when it pops). Chronos is accessed via an HTTP API.
- [Memcached](https://memcached.org/) / [Astaire](https://github.com/Metaswitch/astaire). Vellum also runs a Memcached cluster fronted by Astaire. Astaire is a service developed by Clearwater that enables more rapid scale up and scale down of memcached clusters. This cluster is used by Sprout and Ralf for storing registration and session state.
- [Memcached](https://memcached.org/) / [Astaire](https://github.com/Metaswitch/astaire). Vellum also runs a Memcached cluster fronted by Astaire. Astaire is a service developed by Clearwater that enables more rapid scale up and scale down of memcached clusters. This cluster is used by Sprout for storing registration state, Ralf for storing session state and Homestead for storing cached subscriber data.

### Homer (XDMS)

6 changes: 4 additions & 2 deletions docs/Clearwater_Configuration_Options_Reference.md
@@ -39,7 +39,7 @@ This section describes settings that are specific to a single node and are not a
* If this node is an etcd master, this should be left blank
* If this node is an etcd proxy, it should contain the IP addresses of all the nodes that are currently etcd masters in the cluster.
* `etcd_cluster_key` - this is the name of the etcd datastore clusters that this node should join. It defaults to the function of the node (e.g. a Vellum node defaults to using 'vellum' as its etcd datastore cluster name when it joins the Cassandra cluster). This must be set explicitly on nodes that colocate function.
* `remote_cassandra_seeds` - this is used to connect the Cassandra cluster in your second site to the Cassandra cluster in your first site; this is only necessary in a geographically redundant deployment. It should be set to an IP address of a Vellum node in your first site, and it should only be set on the first Vellum node in your second site.
* `remote_cassandra_seeds` - this is used to connect the Cassandra cluster in your second site to the Cassandra cluster in your first site; this is only necessary in a geographically redundant deployment which is using at least one of Homestead-Prov, Homer or Memento. It should be set to an IP address of a Vellum node in your first site, and it should only be set on the first Vellum node in your second site.
* `scscf_node_uri` - this can be optionally set, and only applies to nodes running an S-CSCF. If it is configured, it almost certainly needs configuring on each S-CSCF node in the deployment.

If set, this is used by the node to advertise the URI to which requests to this node should be routed. It should be formatted as a SIP URI.
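As a rough illustration of how these node-specific options fit together, the relevant lines of `/etc/clearwater/local_config` on the first Vellum node in a second (geographically redundant) site might look like the following sketch - all IP addresses and the site name are hypothetical, and only the options discussed above are shown:

    # Hypothetical values - substitute your own addresses and names
    local_site_name=siteB
    # Comma-separated IP addresses of nodes in the local site's etcd cluster
    etcd_cluster=10.1.0.10,10.1.0.11,10.1.0.12
    # Only needed on nodes that colocate function (defaults to the node's function)
    etcd_cluster_key=vellum
    # Only set on the first Vellum node in the second site of a GR deployment,
    # and only if Homestead-Prov, Homer or Memento are in use
    remote_cassandra_seeds=10.0.0.10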
@@ -80,6 +80,7 @@ This section describes options for the basic configuration of a Clearwater deplo
* `memento_hostname` - a hostname that resolves by DNS round-robin to all Mementos in the cluster (the default is `memento.<home_domain>`). This should match Memento's SSL certificate, if you are using one.
* `sprout_registration_store` - this is the location of Sprout's registration store. It has the format `<site_name>=<domain>[:<port>][,<site_name>=<domain>[:<port>]]`. In a non-GR deployment, only one domain is provided (and the site name is optional). For a GR deployment, each domain is identified by the site name, and one of the domains must relate to the local site.
* `ralf_session_store` - this is the location of ralf's session store. It has the format `<site_name>=<domain>[:<port>][,<site_name>=<domain>[:<port>]]`. In a non-GR deployment, only one domain is provided (and the site name is optional). For a GR deployment, each domain is identified by the site name, and one of the domains must relate to the local site.
* `homestead_impu_store` - this is the location of homestead's IMPU store. It has the format `<site_name>=<domain>[:<port>][,<site_name>=<domain>[:<port>]]`. In a non-GR deployment, only one domain is provided (and the site name is optional). For a GR deployment, each domain is identified by the site name, and one of the domains must relate to the local site.
* `memento_auth_store` - this is the location of Memento's authorization vector store. It just has the format `<domain>[:port]`. If not present, defaults to the loopback IP.
* `sprout_chronos_callback_uri` - the callback hostname used on Sprout's Chronos timers. If not present, defaults to the host specified in `sprout-hostname`. In a GR deployment, should be set to a deployment-wide Sprout hostname (that will be resolved by using static DNS records in `/etc/clearwater/dns.json`).
* `ralf_chronos_callback_uri` - the callback hostname used on ralf's Chronos timers. If not present, defaults to the host specified in `ralf-hostname`. In a GR deployment, should be set to a deployment-wide Dime hostname (that will be resolved by using static DNS records in `/etc/clearwater/dns.json`).
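For illustration, a two-site GR deployment with sites named `siteA` and `siteB` might set the store options above as in the following sketch (the hostnames and site names are hypothetical):

    sprout_registration_store=siteA=vellum-siteA.example.com,siteB=vellum-siteB.example.com
    ralf_session_store=siteA=vellum-siteA.example.com,siteB=vellum-siteB.example.com
    homestead_impu_store=siteA=vellum-siteA.example.com,siteB=vellum-siteB.example.com
    # In a non-GR deployment a single domain is enough, e.g.
    # sprout_registration_store=vellum.example.com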
@@ -212,6 +213,7 @@ This section describes optional configuration options, particularly for ensuring
* `dummy_app_server` - this field allows the name of a dummy application server to be specified. If an iFC contains this dummy application server, then no application server will be invoked when this iFC is triggered.
* `http_acr_logging` - when set to 'Y', Clearwater will log the bodies of HTTP requests made to Ralf. This provides additional diagnostics, but increases the volume of data sent to SAS.
* `dns_timeout` - The time in milliseconds that Clearwater will wait for a response from the DNS server (defaults to 200 milliseconds).
* `homestead_cache_threads` - The number of threads used by Homestead for accessing its subscriber data cache. Defaults to 50 times the number of CPU cores.
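For example, a deployment that wanted to override the defaults described above might add lines such as the following to shared config (the values are purely illustrative, not recommendations):

    dns_timeout=500
    homestead_cache_threads=100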

### Experimental options

@@ -239,7 +241,7 @@ This section describes settings that may vary between systems in the same deploy
* `ibcf_domain` - For Bono IBCF nodes, allows for a domain alias to be specified for the IBCF to allow for including IBCFs in routes as domains instead of IPs.
* `upstream_recycle_connections` - the average number of seconds before Bono will destroy and re-create a connection to Sprout. A higher value means slightly less work, but means that DNS changes will not take effect as quickly (as new Sprout nodes added to DNS will only start to receive messages when Bono creates a new connection and does a fresh DNS lookup).
* `authentication` - by default, Clearwater performs authentication challenges (SIP Digest or IMS AKA depending on HSS configuration). When this is set to 'Y', it simply accepts all REGISTERs - obviously this is very insecure and should not be used in production.
* `num_http_threads` (homestead) - determines the number of HTTP worker threads that will be used to process requests. Defaults to 50 times the number of CPU cores on the system.
* `num_http_threads` (homestead) - determines the number of HTTP worker threads that will be used to process requests. Defaults to 4 times the number of CPU cores on the system.

## DNS Config

3 changes: 2 additions & 1 deletion docs/Configuring_GR_deployments.md
@@ -26,7 +26,8 @@ Adding a site to a non-GR deployment follows the same basic process as described
2. Now you need to update the shared configuration on your first site so that it will communicate with your second site
* Update the shared configuration on your first site to use the GR options - follow the GR parts of setting up shared config [here](http://clearwater.readthedocs.io/en/latest/Manual_Install.html#provide-shared-configuration).
* Update the Chronos configuration on your Vellum nodes on your first site to add the GR configuration file - instructions [here](http://clearwater.readthedocs.io/en/latest/Manual_Install.html#chronos-configuration).
* Update Cassandra's strategy by running `cw-update_cassandra_strategy` on any Vellum node in your entire deployment.
* If you are using any of Homestead-Prov, Homer or Memento:
* Update Cassandra's strategy by running `cw-update_cassandra_strategy` on any Vellum node in your entire deployment.
* At this point, your first and second sites are replicating data between themselves, but no external traffic is going to your second site.
3. Change DNS so that your external nodes (e.g. the HSS, the P-CSCF) will send traffic to your new site. Now you have a GR deployment.

2 changes: 1 addition & 1 deletion docs/External_HSS_Integration.md
@@ -15,7 +15,7 @@ This page describes

When Clearwater is deployed without an external HSS, all HSS data is mastered in Vellum's Cassandra database.

When Clearwater is deployed with an external HSS, HSS data is queried from the external HSS via its Cx/Diameter interface and is then cached in the Cassandra database.
When Clearwater is deployed with an external HSS, HSS data is queried from the external HSS via its Cx/Diameter interface and is then cached in Memcached on Vellum.

Clearwater uses the following Cx message types.

4 changes: 2 additions & 2 deletions docs/Geographic_redundancy.md
@@ -14,12 +14,12 @@ Each site has its own, separate, etcd cluster. This means that Clearwater's [aut

Vellum has 3 databases, which support Geographic Redundancy differently:

* The Homestead, Homer and Memento databases are backed by Cassandra, which is aware of local and remote peers, so these are a single cluster split across the two geographic regions.
* The Homestead-Prov, Homer and Memento databases are backed by Cassandra, which is aware of local and remote peers, so these are a single cluster split across the two geographic regions.
* Chronos is aware of local peers and the remote cluster, and handles replicating timers across the two sites itself.
* There is one memcached cluster per geographic region. Although memcached itself does not support the concept of local and remote peers, Vellum runs Astaire as a memcached proxy which allows Sprout and Dime nodes to build geographic redundancy on top - writing to both local and remote clusters, and reading from the local but falling back to the remote.

Sprout nodes use the local Vellum cluster for Chronos and both local and remote Vellum clusters for memcached (via Astaire). If the Sprout node includes Memento, then it also uses the local Vellum cluster for Cassandra.
Dime nodes use the local Vellum cluster for Chronos and Cassandra, and both local and remote Vellum clusters for memcached (via Astaire).
Dime nodes use the local Vellum cluster for Chronos and both local and remote Vellum clusters for memcached (via Astaire). If Homestead-Prov is in use, then it also uses the local Vellum cluster for Cassandra.

Communications between nodes in different sites should be secure - for example, if it is going over the public internet rather than a private connection between datacenters, it should be encrypted and authenticated with (something like) IPsec.

2 changes: 2 additions & 0 deletions docs/Handling_Failed_Nodes.md
@@ -29,5 +29,7 @@ For each site that contains one or more failed Vellum nodes, log into a healthy

* `sudo cw-mark_node_failed "vellum" "memcached" <failed node IP>`
* `sudo cw-mark_node_failed "vellum" "chronos" <failed node IP>`

If you are using any of Homestead-Prov, Homer or Memento, also run:
* `sudo cw-mark_node_failed "vellum" "cassandra" <failed node IP>`
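For example, if the failed node's IP address were `10.0.52.1` (a hypothetical address) and your deployment used all three stores, the full set of commands would be:

    sudo cw-mark_node_failed "vellum" "memcached" 10.0.52.1
    sudo cw-mark_node_failed "vellum" "chronos" 10.0.52.1
    sudo cw-mark_node_failed "vellum" "cassandra" 10.0.52.1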

5 changes: 4 additions & 1 deletion docs/Handling_Multiple_Failed_Nodes.md
@@ -58,7 +58,7 @@ The shared configuration is at `/etc/clearwater/shared_config`. Verify that this

#### Vellum - Cassandra configuration

Check that the Cassandra cluster is healthy by running the following on a Vellum node:
If you are using any of Homestead-Prov, Homer or Memento, check that the Cassandra cluster is healthy by running the following on a Vellum node:

sudo /usr/share/clearwater/bin/run-in-signaling-namespace nodetool status

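In a healthy cluster every node should be reported with status `UN` (Up/Normal). As a rough, hedged illustration (addresses, load figures and host IDs are invented), the output for a three-node cluster might look something like:

    Datacenter: dc1
    ===============
    Status=Up/Down
    |/ State=Normal/Leaving/Joining/Moving
    --  Address    Load     Tokens  Owns    Host ID                               Rack
    UN  10.0.0.1   1.2 MB   256     33.4%   11111111-2222-3333-4444-555555555555  rack1
    UN  10.0.0.2   1.1 MB   256     33.3%   66666666-7777-8888-9999-aaaaaaaaaaaa  rack1
    UN  10.0.0.3   1.3 MB   256     33.3%   bbbbbbbb-cccc-dddd-eeee-ffffffffffff  rack1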
@@ -113,6 +113,9 @@ Run these commands on one Vellum node in the affected site:

/usr/share/clearwater/clearwater-cluster-manager/scripts/load_from_chronos_cluster vellum
/usr/share/clearwater/clearwater-cluster-manager/scripts/load_from_memcached_cluster vellum

If you are using any of Homestead-Prov, Homer or Memento, also run:

/usr/share/clearwater/clearwater-cluster-manager/scripts/load_from_cassandra_cluster vellum

Verify the cluster state is correct in etcd by running `sudo /usr/share/clearwater/clearwater-cluster-manager/scripts/check_cluster_state`
4 changes: 3 additions & 1 deletion docs/Handling_Site_Failure.md
@@ -9,10 +9,12 @@ More information about Clearwater's geographic redundancy support is available [

### Recovery

To recover from this situation, all you need to do is remove the failed Vellum nodes from the Cassandra cluster.
If you are using any of Homestead-Prov, Homer or Memento, all you need to do to recover from this situation is remove the failed Vellum nodes from the Cassandra cluster.

* From any Vellum node in the remaining site, run `cw-remove_site_from_cassandra <site ID - the name of the failed site>`
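For example, assuming the failed site was named `siteB` (an illustrative name), you would run:

    cw-remove_site_from_cassandra siteB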

If you are not using any of Homestead-Prov, Homer or Memento, you do not need to do anything to recover the single remaining site.

You should now have a working single-site cluster, which can continue to run as a single site, or be safely paired with a new remote site (details on how to set up a new remote site are [here](http://clearwater.readthedocs.io/en/latest/Configuring_GR_deployments.html#removing-a-site-from-a-gr-deployment)).

### Impact
13 changes: 8 additions & 5 deletions docs/Manual_Install.md
@@ -77,8 +77,9 @@ Note that the `etcd_cluster` variable should be set to a comma separated list th
If you are creating a [geographically redundant deployment](Geographic_redundancy.md), then:

* `etcd_cluster` should contain the IP addresses of nodes only in the local site
* You should set `local_site_name` in `/etc/clearwater/local_config`. The name you choose is arbitrary, but must be the same for every node in the site. This name will also be used in the `remote_site_names`, `sprout_registration_store` and `ralf_session_store` configuration options set in shared config (described below).
* On the first Vellum node in the second site, you should set `remote_cassandra_seeds` to the IP address of a Vellum node in the first site.
* You should set `local_site_name` in `/etc/clearwater/local_config`. The name you choose is arbitrary, but must be the same for every node in the site. This name will also be used in the `remote_site_names`, `sprout_registration_store`, `homestead_impu_store` and `ralf_session_store` configuration options set in shared config (described below).
* If your deployment uses Homestead-Prov, Homer or Memento:
* on the first Vellum node in the second site, you should set `remote_cassandra_seeds` to the IP address of a Vellum node in the first site.

## Install Node-Specific Software

@@ -149,6 +150,7 @@ Log onto any node in the deployment and create the file `/etc/clearwater/shared_
sprout_registration_store=vellum.<site_name>.<zone>
hs_hostname=hs.<site_name>.<zone>:8888
hs_provisioning_hostname=hs.<site_name>.<zone>:8889
homestead_impu_store=vellum.<zone>
ralf_hostname=ralf.<site_name>.<zone>:10888
ralf_session_store=vellum.<zone>
xdms_hostname=homer.<site_name>.<zone>:7888
@@ -187,11 +189,12 @@ If you want your Sprout nodes to include Gemini/Memento Application Servers add

See the [Chef instructions](Installing_a_Chef_workstation.md#add-deployment-specific-configuration) for more information on how to fill these in. The values marked `<secret>` **must** be set to secure values to protect your deployment from unauthorized access. To modify these settings after the deployment is created, follow [these instructions](Modifying_Clearwater_settings.md).

If you are creating a [geographically redundant deployment](Geographic_redundancy.md), some of the options require information about all sites to be specified. You need to set the `remote_site_names` configuration option to include the `local_site_name` of each site, replace the `sprout_registration_store` and `ralf_session_store` with the values as described in [Clearwater Configuration Options Reference](Clearwater_Configuration_Options_Reference.md), and set the `sprout_chronos_callback_uri` and `ralf_chronos_callback_uri` to deployment wide hostnames. For example, for sites named `siteA` and `siteB`:
If you are creating a [geographically redundant deployment](Geographic_redundancy.md), some of the options require information about all sites to be specified. You need to set the `remote_site_names` configuration option to include the `local_site_name` of each site, replace the `sprout_registration_store`, `homestead_impu_store` and `ralf_session_store` with the values as described in [Clearwater Configuration Options Reference](Clearwater_Configuration_Options_Reference.md), and set the `sprout_chronos_callback_uri` and `ralf_chronos_callback_uri` to deployment wide hostnames. For example, for sites named `siteA` and `siteB`:

remote_site_names=siteA,siteB
sprout_registration_store="siteA=sprout-siteA.<zone>,siteB=sprout-siteB.<zone>"
ralf_session_store="siteA=ralf-siteA.<zone>,siteB=ralf-siteB.<zone>"
sprout_registration_store="siteA=vellum-siteA.<zone>,siteB=vellum-siteB.<zone>"
homestead_impu_store="siteA=vellum-siteA.<zone>,siteB=vellum-siteB.<zone>"
ralf_session_store="siteA=vellum-siteA.<zone>,siteB=vellum-siteB.<zone>"
sprout_chronos_callback_uri=sprout.<zone>
ralf_chronos_callback_uri=ralf.<zone>

2 changes: 1 addition & 1 deletion docs/Troubleshooting_and_Recovery.md
@@ -28,7 +28,7 @@ To examine Ellis' database, run `mysql` (as root), then type `use ellis;` to set

Problems on Vellum may include:

* Failing to read or write to the Cassandra database:
* Failing to read or write to the Cassandra database (only relevant if your deployment is using Homestead-Prov, Homer or Memento):
* Check that Cassandra is running (`sudo monit status`). If not, check its `/var/log/cassandra/*.log` files.
* Check that Cassandra is configured correctly. First access the command-line CQL interface by running `cqlsh`. There are 3 databases:
* Type `use homestead_provisioning;` to set the provisioning database and then `describe tables;` - this should report `service_profiles`, `public`, `implicit_registration_sets` and `private`.
