This repository has been archived by the owner on Feb 27, 2020. It is now read-only.

Merge pull request #275 from Metaswitch/cass_to_memcached
Cassandra to Memcached
sebrexmetaswitch committed Sep 7, 2017
2 parents a5b04cf + ebe5a78 commit 4974703
Showing 10 changed files with 30 additions and 17 deletions.
6 changes: 3 additions & 3 deletions docs/Clearwater_Architecture.md
@@ -35,7 +35,7 @@ Dime nodes run Clearwater's Homestead and Ralf components.

#### Homestead (HSS Cache)

Homestead provides a web services interface to Sprout for retrieving authentication credentials and user profile information. It can either master the data (in which case it exposes a web services provisioning interface) or can pull the data from an IMS compliant HSS over the Cx interface. The Homestead nodes themselves are stateless - the mastered / cached subscriber data is all stored on Vellum (via Cassandra's Thrift interface).
Homestead provides a web services interface to Sprout for retrieving authentication credentials and user profile information. It can either master the data (in which case it exposes a web services provisioning interface) or can pull the data from an IMS compliant HSS over the Cx interface. The Homestead nodes themselves are stateless - the mastered / cached subscriber data is all stored on Vellum (Cassandra for the mastered data, and Astaire/Memcached for the cached data).

In the IMS architecture, the HSS mirror function is considered to be part of the I-CSCF and S-CSCF components, so in Clearwater I-CSCF and S-CSCF function is implemented with a combination of Sprout and Dime clusters.

@@ -46,10 +46,10 @@ Ralf provides an HTTP API that both Bono and Sprout can use to report billable e
### Vellum (State store)

As described above, Vellum is used to maintain all long-lived state in the deployment. It does this by running a number of cloud-optimized, distributed storage clusters.
- [Cassandra](http://cassandra.apache.org/). Cassandra is used by Homestead to store authentication credentials and profile information, and is used by Homer to store MMTEL service settings. Vellum exposes Cassandra's Thrift API.
- [Cassandra](http://cassandra.apache.org/). Cassandra is used by Homestead to store authentication credentials and profile information when an HSS is not in use, and is used by Homer to store MMTEL service settings. Vellum exposes Cassandra's Thrift API.
- [etcd](https://github.com/coreos/etcd). etcd is used by Vellum itself to share clustering information between Vellum nodes and by other nodes in the deployment for shared configuration.
- [Chronos](https://github.com/Metaswitch/chronos). Chronos is a distributed, redundant, reliable timer service developed by Clearwater. It is used by Sprout and Ralf nodes to enable timers to be run (e.g. for SIP Registration expiry) without pinning operations to a specific node (one node can set the timer and another act on it when it pops). Chronos is accessed via an HTTP API.
- [Memcached](https://memcached.org/) / [Astaire](https://github.com/Metaswitch/astaire). Vellum also runs a Memcached cluster fronted by Astaire. Astaire is a service developed by Clearwater that enables more rapid scale up and scale down of memcached clusters. This cluster is used by Sprout and Ralf for storing registration and session state.
- [Memcached](https://memcached.org/) / [Astaire](https://github.com/Metaswitch/astaire). Vellum also runs a Memcached cluster fronted by Astaire. Astaire is a service developed by Clearwater that enables more rapid scale up and scale down of memcached clusters. This cluster is used by Sprout for storing registration state, Ralf for storing session state and Homestead for storing cached subscriber data.

### Homer (XDMS)

6 changes: 4 additions & 2 deletions docs/Clearwater_Configuration_Options_Reference.md
@@ -39,7 +39,7 @@ This section describes settings that are specific to a single node and are not a
* If this node is an etcd master, this should be left blank
* If this node is an etcd proxy, it should contain the IP addresses of all the nodes that are currently etcd masters in the cluster.
* `etcd_cluster_key` - this is the name of the etcd datastore clusters that this node should join. It defaults to the function of the node (e.g. a Vellum node defaults to using 'vellum' as its etcd datastore cluster name when it joins the Cassandra cluster). This must be set explicitly on nodes that colocate function.
* `remote_cassandra_seeds` - this is used to connect the Cassandra cluster in your second site to the Cassandra cluster in your first site; this is only necessary in a geographically redundant deployment. It should be set to an IP address of a Vellum node in your first site, and it should only be set on the first Vellum node in your second site.
* `remote_cassandra_seeds` - this is used to connect the Cassandra cluster in your second site to the Cassandra cluster in your first site; this is only necessary in a geographically redundant deployment which is using at least one of Homestead-Prov, Homer or Memento. It should be set to an IP address of a Vellum node in your first site, and it should only be set on the first Vellum node in your second site.
* `scscf_node_uri` - this can be optionally set, and only applies to nodes running an S-CSCF. If it is configured, it almost certainly needs configuring on each S-CSCF node in the deployment.

If set, this is used by the node to advertise the URI to which requests to this node should be routed. It should be formatted as a SIP URI.
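As a rough illustration of how these node-specific options fit together, the relevant lines of `/etc/clearwater/local_config` on the first Vellum node in a second (geographically redundant) site might look like the following sketch - all IP addresses and the site name are hypothetical, and only the options discussed above are shown:

    # Hypothetical values - substitute your own addresses and names
    local_site_name=siteB
    # Comma-separated IP addresses of nodes in the local site's etcd cluster
    etcd_cluster=10.1.0.10,10.1.0.11,10.1.0.12
    # Only needed on nodes that colocate function (defaults to the node's function)
    etcd_cluster_key=vellum
    # Only set on the first Vellum node in the second site of a GR deployment,
    # and only if Homestead-Prov, Homer or Memento are in use
    remote_cassandra_seeds=10.0.0.10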
@@ -80,6 +80,7 @@ This section describes options for the basic configuration of a Clearwater deplo
* `memento_hostname` - a hostname that resolves by DNS round-robin to all Mementos in the cluster (the default is `memento.<home_domain>`). This should match Memento's SSL certificate, if you are using one.
* `sprout_registration_store` - this is the location of Sprout's registration store. It has the format `<site_name>=<domain>[:<port>][,<site_name>=<domain>[:<port>]]`. In a non-GR deployment, only one domain is provided (and the site name is optional). For a GR deployment, each domain is identified by the site name, and one of the domains must relate to the local site.
* `ralf_session_store` - this is the location of ralf's session store. It has the format `<site_name>=<domain>[:<port>][,<site_name>=<domain>[:<port>]]`. In a non-GR deployment, only one domain is provided (and the site name is optional). For a GR deployment, each domain is identified by the site name, and one of the domains must relate to the local site.
* `homestead_impu_store` - this is the location of homestead's IMPU store. It has the format `<site_name>=<domain>[:<port>][,<site_name>=<domain>[:<port>]]`. In a non-GR deployment, only one domain is provided (and the site name is optional). For a GR deployment, each domain is identified by the site name, and one of the domains must relate to the local site.
* `memento_auth_store` - this is the location of Memento's authorization vector store. It just has the format `<domain>[:port]`. If not present, defaults to the loopback IP.
* `sprout_chronos_callback_uri` - the callback hostname used on Sprout's Chronos timers. If not present, defaults to the host specified in `sprout-hostname`. In a GR deployment, should be set to a deployment-wide Sprout hostname (that will be resolved by using static DNS records in `/etc/clearwater/dns.json`).
* `ralf_chronos_callback_uri` - the callback hostname used on ralf's Chronos timers. If not present, defaults to the host specified in `ralf-hostname`. In a GR deployment, should be set to a deployment-wide Dime hostname (that will be resolved by using static DNS records in `/etc/clearwater/dns.json`).
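For illustration, a two-site GR deployment with sites named `siteA` and `siteB` might set the store options above as in the following sketch (the hostnames and site names are hypothetical):

    sprout_registration_store=siteA=vellum-siteA.example.com,siteB=vellum-siteB.example.com
    ralf_session_store=siteA=vellum-siteA.example.com,siteB=vellum-siteB.example.com
    homestead_impu_store=siteA=vellum-siteA.example.com,siteB=vellum-siteB.example.com
    # In a non-GR deployment a single domain is enough, e.g.
    # sprout_registration_store=vellum.example.com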
@@ -212,6 +213,7 @@ This section describes optional configuration options, particularly for ensuring
* `dummy_app_server` - this field allows the name of a dummy application server to be specified. If an iFC contains this dummy application server, then no application server will be invoked when this iFC is triggered.
* `http_acr_logging` - when set to 'Y', Clearwater will log the bodies of HTTP requests made to Ralf. This provides additional diagnostics, but increases the volume of data sent to SAS.
* `dns_timeout` - The time in milliseconds that Clearwater will wait for a response from the DNS server (defaults to 200 milliseconds).
* `homestead_cache_threads` - The number of threads used by Homestead for accessing its subscriber data cache. Defaults to 50 times the number of CPU cores.
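For example, a deployment that wanted to override the defaults described above might add lines such as the following to shared config (the values are purely illustrative, not recommendations):

    dns_timeout=500
    homestead_cache_threads=100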

### Experimental options

@@ -239,7 +241,7 @@ This section describes settings that may vary between systems in the same deploy
* `ibcf_domain` - For Bono IBCF nodes, allows for a domain alias to be specified for the IBCF to allow for including IBCFs in routes as domains instead of IPs.
* `upstream_recycle_connections` - the average number of seconds before Bono will destroy and re-create a connection to Sprout. A higher value means slightly less work, but means that DNS changes will not take effect as quickly (as new Sprout nodes added to DNS will only start to receive messages when Bono creates a new connection and does a fresh DNS lookup).
* `authentication` - by default, Clearwater performs authentication challenges (SIP Digest or IMS AKA depending on HSS configuration). When this is set to 'Y', it simply accepts all REGISTERs - obviously this is very insecure and should not be used in production.
* `num_http_threads` (homestead) - determines the number of HTTP worker threads that will be used to process requests. Defaults to 50 times the number of CPU cores on the system.
* `num_http_threads` (homestead) - determines the number of HTTP worker threads that will be used to process requests. Defaults to 4 times the number of CPU cores on the system.

## DNS Config

3 changes: 2 additions & 1 deletion docs/Configuring_GR_deployments.md
@@ -26,7 +26,8 @@ Adding a site to a non-GR deployment follows the same basic process as described
2. Now you need to update the shared configuration on your first site so that it will communicate with your second site
* Update the shared configuration on your first site to use the GR options - follow the GR parts of setting up shared config [here](http://clearwater.readthedocs.io/en/latest/Manual_Install.html#provide-shared-configuration).
* Update the Chronos configuration on your Vellum nodes on your first site to add the GR configuration file - instructions [here](http://clearwater.readthedocs.io/en/latest/Manual_Install.html#chronos-configuration).
* Update Cassandra's strategy by running `cw-update_cassandra_strategy` on any Vellum node in your entire deployment.
* If you are using any of Homestead-Prov, Homer or Memento:
* Update Cassandra's strategy by running `cw-update_cassandra_strategy` on any Vellum node in your entire deployment.
* At this point, your first and second sites are replicating data between themselves, but no external traffic is going to your second site.
3. Change DNS so that your external nodes (e.g. the HSS, the P-CSCF) will send traffic to your new site. Now you have a GR deployment.

2 changes: 1 addition & 1 deletion docs/External_HSS_Integration.md
@@ -15,7 +15,7 @@ This page describes

When Clearwater is deployed without an external HSS, all HSS data is mastered in Vellum's Cassandra database.

When Clearwater is deployed with an external HSS, HSS data is queried from the external HSS via its Cx/Diameter interface and is then cached in the Cassandra database.
When Clearwater is deployed with an external HSS, HSS data is queried from the external HSS via its Cx/Diameter interface and is then cached in Memcached on Vellum.

Clearwater uses the following Cx message types.

4 changes: 2 additions & 2 deletions docs/Geographic_redundancy.md
@@ -14,12 +14,12 @@ Each site has its own, separate, etcd cluster. This means that Clearwater's [aut

Vellum has 3 databases, which support Geographic Redundancy differently:

* The Homestead, Homer and Memento databases are backed by Cassandra, which is aware of local and remote peers, so these are a single cluster split across the two geographic regions.
* The Homestead-Prov, Homer and Memento databases are backed by Cassandra, which is aware of local and remote peers, so these are a single cluster split across the two geographic regions.
* Chronos is aware of local peers and the remote cluster, and handles replicating timers across the two sites itself.
* There is one memcached cluster per geographic region. Although memcached itself does not support the concept of local and remote peers, Vellum runs Astaire as a memcached proxy which allows Sprout and Dime nodes to build geographic redundancy on top - writing to both local and remote clusters, and reading from the local but falling back to the remote.

Sprout nodes use the local Vellum cluster for Chronos and both local and remote Vellum clusters for memcached (via Astaire). If the Sprout node includes Memento, then it also uses the local Vellum cluster for Cassandra.
Dime nodes use the local Vellum cluster for Chronos and Cassandra, and both local and remote Vellum clusters for memcached (via Astaire).
Dime nodes use the local Vellum cluster for Chronos and both local and remote Vellum clusters for memcached (via Astaire). If Homestead-Prov is in use, then it also uses the local Vellum cluster for Cassandra.

Communications between nodes in different sites should be secure - for example, if it is going over the public internet rather than a private connection between datacenters, it should be encrypted and authenticated with (something like) IPsec.

2 changes: 2 additions & 0 deletions docs/Handling_Failed_Nodes.md
@@ -29,5 +29,7 @@ For each site that contains one or more failed Vellum nodes, log into a healthy

* `sudo cw-mark_node_failed "vellum" "memcached" <failed node IP>`
* `sudo cw-mark_node_failed "vellum" "chronos" <failed node IP>`

If you are using any of Homestead-Prov, Homer or Memento, also run:
* `sudo cw-mark_node_failed "vellum" "cassandra" <failed node IP>`
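For example, if the failed node's IP address were `10.0.52.1` (a hypothetical address) and your deployment used all three stores, the full set of commands would be:

    sudo cw-mark_node_failed "vellum" "memcached" 10.0.52.1
    sudo cw-mark_node_failed "vellum" "chronos" 10.0.52.1
    sudo cw-mark_node_failed "vellum" "cassandra" 10.0.52.1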

5 changes: 4 additions & 1 deletion docs/Handling_Multiple_Failed_Nodes.md
@@ -58,7 +58,7 @@ The shared configuration is at `/etc/clearwater/shared_config`. Verify that this

#### Vellum - Cassandra configuration

Check that the Cassandra cluster is healthy by running the following on a Vellum node:
If you are using any of Homestead-Prov, Homer or Memento, check that the Cassandra cluster is healthy by running the following on a Vellum node:

sudo /usr/share/clearwater/bin/run-in-signaling-namespace nodetool status

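In a healthy cluster every node should be reported with status `UN` (Up/Normal). As a rough, hedged illustration (addresses, load figures and host IDs are invented), the output for a three-node cluster might look something like:

    Datacenter: dc1
    ===============
    Status=Up/Down
    |/ State=Normal/Leaving/Joining/Moving
    --  Address    Load     Tokens  Owns    Host ID                               Rack
    UN  10.0.0.1   1.2 MB   256     33.4%   11111111-2222-3333-4444-555555555555  rack1
    UN  10.0.0.2   1.1 MB   256     33.3%   66666666-7777-8888-9999-aaaaaaaaaaaa  rack1
    UN  10.0.0.3   1.3 MB   256     33.3%   bbbbbbbb-cccc-dddd-eeee-ffffffffffff  rack1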
@@ -113,6 +113,9 @@ Run these commands on one Vellum node in the affected site:

/usr/share/clearwater/clearwater-cluster-manager/scripts/load_from_chronos_cluster vellum
/usr/share/clearwater/clearwater-cluster-manager/scripts/load_from_memcached_cluster vellum

If you are using any of Homestead-Prov, Homer or Memento, also run:

/usr/share/clearwater/clearwater-cluster-manager/scripts/load_from_cassandra_cluster vellum

Verify the cluster state is correct in etcd by running `sudo /usr/share/clearwater/clearwater-cluster-manager/scripts/check_cluster_state`
4 changes: 3 additions & 1 deletion docs/Handling_Site_Failure.md
@@ -9,10 +9,12 @@ More information about Clearwater's geographic redundancy support is available [

### Recovery

To recover from this situation, all you need to do is remove the failed Vellum nodes from the Cassandra cluster.
If you are using any of Homestead-Prov, Homer or Memento, all you need to do to recover from this situation is remove the failed Vellum nodes from the Cassandra cluster.

* From any Vellum node in the remaining site, run `cw-remove_site_from_cassandra <site ID - the name of the failed site>`
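For example, assuming the failed site was named `siteB` (an illustrative name), you would run:

    cw-remove_site_from_cassandra siteB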

If you are not using any of Homestead-Prov, Homer or Memento, you do not need to do anything to recover the single remaining site.

You should now have a working single-site cluster, which can continue to run as a single site, or be safely paired with a new remote site (details on how to set up a new remote site are [here](http://clearwater.readthedocs.io/en/latest/Configuring_GR_deployments.html#removing-a-site-from-a-gr-deployment)).

### Impact
13 changes: 8 additions & 5 deletions docs/Manual_Install.md
@@ -77,8 +77,9 @@ Note that the `etcd_cluster` variable should be set to a comma separated list th
If you are creating a [geographically redundant deployment](Geographic_redundancy.md), then:

* `etcd_cluster` should contain the IP addresses of nodes only in the local site
* You should set `local_site_name` in `/etc/clearwater/local_config`. The name you choose is arbitrary, but must be the same for every node in the site. This name will also be used in the `remote_site_names`, `sprout_registration_store` and `ralf_session_store` configuration options set in shared config (described below).
* On the first Vellum node in the second site, you should set `remote_cassandra_seeds` to the IP address of a Vellum node in the first site.
* You should set `local_site_name` in `/etc/clearwater/local_config`. The name you choose is arbitrary, but must be the same for every node in the site. This name will also be used in the `remote_site_names`, `sprout_registration_store`, `homestead_impu_store` and `ralf_session_store` configuration options set in shared config (described below).
* If your deployment uses Homestead-Prov, Homer or Memento:
* on the first Vellum node in the second site, you should set `remote_cassandra_seeds` to the IP address of a Vellum node in the first site.

## Install Node-Specific Software

@@ -149,6 +150,7 @@ Log onto any node in the deployment and create the file `/etc/clearwater/shared_
sprout_registration_store=vellum.<site_name>.<zone>
hs_hostname=hs.<site_name>.<zone>:8888
hs_provisioning_hostname=hs.<site_name>.<zone>:8889
homestead_impu_store=vellum.<zone>
ralf_hostname=ralf.<site_name>.<zone>:10888
ralf_session_store=vellum.<zone>
xdms_hostname=homer.<site_name>.<zone>:7888
@@ -187,11 +189,12 @@ If you want your Sprout nodes to include Gemini/Memento Application Servers add

See the [Chef instructions](Installing_a_Chef_workstation.md#add-deployment-specific-configuration) for more information on how to fill these in. The values marked `<secret>` **must** be set to secure values to protect your deployment from unauthorized access. To modify these settings after the deployment is created, follow [these instructions](Modifying_Clearwater_settings.md).

If you are creating a [geographically redundant deployment](Geographic_redundancy.md), some of the options require information about all sites to be specified. You need to set the `remote_site_names` configuration option to include the `local_site_name` of each site, replace the `sprout_registration_store` and `ralf_session_store` with the values as described in [Clearwater Configuration Options Reference](Clearwater_Configuration_Options_Reference.md), and set the `sprout_chronos_callback_uri` and `ralf_chronos_callback_uri` to deployment wide hostnames. For example, for sites named `siteA` and `siteB`:
If you are creating a [geographically redundant deployment](Geographic_redundancy.md), some of the options require information about all sites to be specified. You need to set the `remote_site_names` configuration option to include the `local_site_name` of each site, replace the `sprout_registration_store`, `homestead_impu_store` and `ralf_session_store` with the values as described in [Clearwater Configuration Options Reference](Clearwater_Configuration_Options_Reference.md), and set the `sprout_chronos_callback_uri` and `ralf_chronos_callback_uri` to deployment wide hostnames. For example, for sites named `siteA` and `siteB`:

remote_site_names=siteA,siteB
sprout_registration_store="siteA=sprout-siteA.<zone>,siteB=sprout-siteB.<zone>"
ralf_session_store="siteA=ralf-siteA.<zone>,siteB=ralf-siteB.<zone>"
sprout_registration_store="siteA=vellum-siteA.<zone>,siteB=vellum-siteB.<zone>"
homestead_impu_store="siteA=vellum-siteA.<zone>,siteB=vellum-siteB.<zone>"
ralf_session_store="siteA=vellum-siteA.<zone>,siteB=vellum-siteB.<zone>"
sprout_chronos_callback_uri=sprout.<zone>
ralf_chronos_callback_uri=ralf.<zone>

2 changes: 1 addition & 1 deletion docs/Troubleshooting_and_Recovery.md
@@ -28,7 +28,7 @@ To examine Ellis' database, run `mysql` (as root), then type `use ellis;` to set

Problems on Vellum may include:

* Failing to read or write to the Cassandra database:
* Failing to read or write to the Cassandra database (only relevant if your deployment is using Homestead-Prov, Homer or Memento):
* Check that Cassandra is running (`sudo monit status`). If not, check its `/var/log/cassandra/*.log` files.
* Check that Cassandra is configured correctly. First access the command-line CQL interface by running `cqlsh`. There are 3 databases:
* Type `use homestead_provisioning;` to set the provisioning database and then `describe tables;` - this should report `service_profiles`, `public`, `implicit_registration_sets` and `private`.
