
Prerequisites

  • Python 2.5+
  • boto v2.0
  • simplejson
  • prettytable (version 0.5; 0.6.1 throws an exception)
  • setuptools
  • dateutil (version 1.5; 2.0 throws an exception)
  • cElementTree
  • elementtree
  • PyYAML
  • Yapsy (version 1.8; 1.9 requires Python 3)
  • Fabric

As of 0.8.15 we've added a requirements file that (hopefully) will stay up-to-date. So, for all you pip users out there, you should be able to do:

git clone git://github.com/digitalreasoning/PyStratus.git
cd PyStratus
pip install -r requirements.txt
pip install .

And, if you're set on using easy_install, open up the requirements.txt file and install the packages listed.
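
For example, to pin the versions called out in the prerequisites above (illustrative version strings; treat requirements.txt as the canonical list):

% easy_install "boto==2.0" "prettytable==0.5" "python-dateutil==1.5" "Yapsy==1.8"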

Plugins

Stratus provides a plugin-based architecture (via http://yapsy.sourceforge.net/) to easily support additional services in the cloud. Each service requires two plugins: one for the service's functionality (launching the cluster, stopping Cassandra, etc.) and one for the command-line interface that controls the service. Three services and their plugins ship with Stratus (Hadoop, Cassandra, and a hybrid Hadoop/Cassandra) and are located in the plugins directory.

Stratus looks for plugins in two places: a plugins directory in the current working directory and a plugins directory in ~/.stratus. If you install stratus with the setup.py script, it's recommended to symlink the plugins directory from your github clone to ~/.stratus/plugins so stratus can find the plugins; otherwise you have to execute the stratus script from within the project directory (which is equally fine).
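
For example, assuming your clone lives at ~/src/PyStratus (adjust the path to match your checkout):

% mkdir -p ~/.stratus
% ln -s ~/src/PyStratus/plugins ~/.stratus/plugins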

Setting Environment Variables to Specify AWS Credentials

You must specify your AWS credentials when using stratus. The simplest way to do this is to set the environment variables:

  • AWS_ACCESS_KEY_ID: your AWS Access Key ID
  • AWS_SECRET_ACCESS_KEY: your AWS Secret Access Key
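
For example, in a Bourne-style shell (substitute your own credentials):

% export AWS_ACCESS_KEY_ID=YOUR_ACCESS_KEY_GOES_HERE
% export AWS_SECRET_ACCESS_KEY=YOUR_SECRET_KEY_GOES_HERE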

Configuration

To configure stratus, create a directory called .stratus in your home directory (note the leading period "."). In that directory, create a file called clusters.cfg that contains a section for each cluster you want to control. Start each section with a unique name for the section enclosed in square brackets. Each key/value pair must be on its own line. Keys are separated from values by an equals sign. For example:

[my-cassandra-cluster]
service_type=cassandra
cloud_provider=ec2

In addition to clusters.cfg, any files ending in .cfg in ~/.stratus/clusters.cfg.d and any of its subdirectories will be parsed for configuration as well.
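
For example, all of the following would be parsed (the file names under clusters.cfg.d are hypothetical):

~/.stratus/clusters.cfg
~/.stratus/clusters.cfg.d/dev-clusters.cfg
~/.stratus/clusters.cfg.d/team/prod-clusters.cfg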

Each cluster requires the following key/value pairs:

  • service_type: One of [cassandra, hadoop, hadoop_cassandra_hybrid]
  • cloud_provider: Only ec2 is supported
  • image_id: The Amazon EC2 image ID for your cluster nodes
  • instance_type: The type of EC2 instance to run (e.g., m1.small, m1.large; see the EC2 documentation for the full list)
  • key_name: Key name to use
  • availability_zone: The zone to place your instance in (see EC2 documentation)
  • region: The region to place your instance in (see EC2 documentation)
  • private_key: Path to your private key for password-less SSH commands
  • user_data_file: Path to a bootstrap script that will be executed on each node after the instance is started (see http://aws.amazon.com/articles/1085)

Optional key/value pairs:

  • ssh_options: Options to supply to ssh and scp
  • security_groups: Any user-defined security groups to authorize your cluster to use (separated by newlines)
  • env: List of user-defined key/value pairs to be set in your node's environment (separated by newlines)

Property Interpolation:

  • Config files may use standard Python variable interpolation. In the following example, value2 evaluates to jake.smith:

[section]
value1=jake
value2=%(value1)s.smith

  • Several variables are provided by default and may be used in interpolation:
    • config_dir: ~/.stratus
    • config_ddir: ~/.stratus/clusters.cfg.d/
    • this_dir: the directory containing the config file that first defined the current section. Unless a section is defined in more than one file, this is the directory containing the current config file. If you define the same section in more than one file, it's best to simply not use this variable, since there is no guarantee which file defines the section first.
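
For instance, you could keep your private key alongside your config and reference it through config_dir (my-key.pem is a hypothetical file name):

[my-cluster]
private_key=%(config_dir)s/my-key.pem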

NOTES

  • It's best practice to define your cluster with a unique and identifiable name so that other users will know who owns this cluster.
  • security_groups allow you to define custom security groups for your cluster. This is useful if you have multiple clusters that need to communicate via their internal/private network.
  • See Cloudera CDH for other AMIs to use with Stratus.
  • Be sure that your clusters.cfg file uses Unix-style (LF) line endings; files saved with Windows-style (CRLF) endings may not parse correctly.
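
If the file was edited on Windows, one way to convert it (assuming the dos2unix utility is available):

% dos2unix ~/.stratus/clusters.cfg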

Configuring Cassandra 0.6.x Clusters

The following example shows how to specify an i386 Fedora OS as the AMI in a clusters.cfg file for a Cassandra cluster:

[my-cassandra-cluster]
service_type=cassandra
cloud_provider=ec2
image_id=ami-6159bf08
instance_type=m1.small
key_name=your_key_name
availability_zone=us-east-1c
region=us-east-1
private_key=/path/to/key/file
ssh_options=-i %(private_key)s -o StrictHostKeyChecking=no
security_groups=security-group-1
    security-group-2
    security-group-3
user_data_file=file:///path/to/cassandra-ec2-init-remote.sh
cassandra_config_file=file:///path/to/storage-conf.xml
env=AWS_ACCESS_KEY_ID=YOUR_ACCESS_KEY_GOES_HERE
    AWS_SECRET_ACCESS_KEY=YOUR_SECRET_KEY_GOES_HERE

NOTES

  • cassandra_config_file is the location of your storage-conf.xml file. This file will be copied to each node in your cluster and Cassandra will use it for its configuration. See the Cassandra 0.6.x Config File section for details.

Configuring Cassandra 0.7.x Clusters (last tested with Beta3)

[my-cassandra-cluster]
service_type=cassandra
cloud_provider=ec2
image_id=ami-6159bf08
instance_type=m1.small
key_name=your_key_name
availability_zone=us-east-1d
region=us-east-1
private_key=/path/to/key/file
ssh_options=-i %(private_key)s -o StrictHostKeyChecking=no
security_groups=security-group-1
user_data_file=file:///path/to/cassandra-ec2-init-remote.sh
cassandra_config_file=file:///path/to/cassandra.yaml
keyspace_definitions_file=file:///path/to/keyspace_definitions
env=AWS_ACCESS_KEY_ID=YOUR_ACCESS_KEY_GOES_HERE
    AWS_SECRET_ACCESS_KEY=YOUR_SECRET_KEY_GOES_HERE
    CASSANDRA_URL=http://apache.mirrors.pair.com//cassandra/0.7.0/apache-cassandra-0.7.0-beta3-bin.tar.gz

NOTES

  • cassandra_config_file is the location of your cassandra.yaml file. This file will be copied to each node in your cluster and Cassandra will use it for its configuration. See the Cassandra 0.7.x Config File section for details.
  • keyspace_definitions_file points to a text file containing a batch of Thrift API commands that will be used to set up your keyspaces initially. Cassandra 0.7 allows for dynamic keyspaces, and you are now required to use the API to manage them (see the Keyspace Definition File section for an example).
  • CASSANDRA_URL in the env section overrides which version of Cassandra is pulled and installed on each node of your cluster. See the cassandra-ec2-init-remote.sh file in cassandra/data for how this variable is used to configure Cassandra.

Configuring Hadoop Clusters

The following example shows how to specify an i386 Fedora OS (ami-6159bf08) as the AMI in a clusters.cfg file for a Hadoop cluster:

[my-hadoop-cluster]
service_type=hadoop
cloud_provider=ec2
image_id=ami-6159bf08
instance_type=m1.small
key_name=your_key_name
availability_zone=us-east-1c
region=us-east-1
private_key=/path/to/key/file
ssh_options=-i %(private_key)s -o StrictHostKeyChecking=no
security_groups=security-group-1
    security-group-2
    security-group-3
user_data_file=file:///path/to/hadoop-ec2-init-remote.sh
env=AWS_ACCESS_KEY_ID=YOUR_ACCESS_KEY_GOES_HERE
    AWS_SECRET_ACCESS_KEY=YOUR_SECRET_KEY_GOES_HERE

NOTES

  • cassandra_config_file is not used for Hadoop clusters and is not present here.

Configuring Hadoop/Cassandra Hybrid Clusters

Hybrid Hadoop/Cassandra clusters operate exactly like Hadoop clusters: one node acts as the namenode, secondary namenode, and job tracker, and one or more nodes act as data nodes and task trackers. The only difference is that Cassandra is installed and started on the Hadoop nodes designated as data nodes. The same commands used to operate a Cassandra cluster are also available, but they only manipulate the data nodes running Cassandra services.

[my-hadoop-cassandra-cluster]
service_type=hadoop_cassandra_hybrid
cloud_provider=ec2
image_id=ami-6159bf08
instance_type=m1.small
key_name=your_key_name
availability_zone=us-east-1c
region=us-east-1
private_key=/path/to/key/file
ssh_options=-i %(private_key)s -o StrictHostKeyChecking=no
security_groups=security-group-1
    security-group-2
    security-group-3
user_data_file=file:///path/to/hadoop-cassandra-hybrid-ec2-init-remote.sh
cassandra_config_file=file:///path/to/storage-conf.xml
env=AWS_ACCESS_KEY_ID=YOUR_ACCESS_KEY_GOES_HERE
    AWS_SECRET_ACCESS_KEY=YOUR_SECRET_KEY_GOES_HERE

NOTES

  • cassandra_config_file is the same as in a pure Cassandra 0.6.x or 0.7.x cluster.
  • For Cassandra 0.7.x, remember to supply your keyspace_definitions_file.

Cassandra 0.6.x Config File

The cassandra_config_file parameter in your clusters.cfg file points to a local copy of a storage-conf.xml file for Cassandra v0.6.x that will be pushed out to each node in your cluster. You are responsible for configuring settings in this file, but keep in mind that stratus will automatically copy this file and modify various parameters before it pushes it out. The modifications for storage-conf.xml files are:

  • Seeds element will contain valid Seed elements containing the private IP address of the seed nodes. Stratus arbitrarily chooses the first two nodes to be seeds.
  • InitialToken will contain a generated token for proper key distribution
  • CommitLogDirectory will be /mnt/cassandra-logs
  • DataFileDirectories will contain one DataFileDirectory element with the value /mnt/cassandra-data
  • ListenAddress and ThriftAddress will be null
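
As a sketch of the result, the rewritten elements look roughly like this (the seed IPs and token are illustrative):

<Seeds>
    <Seed>10.0.0.1</Seed>
    <Seed>10.0.0.2</Seed>
</Seeds>
<InitialToken>85070591730234615865843651857942052864</InitialToken>
<CommitLogDirectory>/mnt/cassandra-logs</CommitLogDirectory>
<DataFileDirectories>
    <DataFileDirectory>/mnt/cassandra-data</DataFileDirectory>
</DataFileDirectories>
<ListenAddress></ListenAddress>
<ThriftAddress></ThriftAddress>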

Cassandra 0.7.x Config File

The cassandra_config_file parameter in your clusters.cfg file points to a local copy of a cassandra.yaml file for Cassandra v0.7.x that will be pushed out to each node in your cluster. You are responsible for configuring settings in this file, but keep in mind that stratus will automatically copy this file and modify various parameters before it pushes it out. The modifications for cassandra.yaml files are:

  • seeds will contain a list of private IP addresses of the seed nodes. Stratus arbitrarily chooses the first two nodes to be seeds.
  • initial_token will contain a generated token for proper key distribution
  • commitlog_directory will be /mnt/cassandra-logs
  • data_file_directories will contain a single list with the value /mnt/cassandra-data
  • listen_address and rpc_address will be null
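
The corresponding cassandra.yaml entries look roughly like this after rewriting (seed IPs and token are illustrative):

seeds:
    - 10.0.0.1
    - 10.0.0.2
initial_token: 85070591730234615865843651857942052864
commitlog_directory: /mnt/cassandra-logs
data_file_directories:
    - /mnt/cassandra-data
listen_address:
rpc_address: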

Keyspace Definition File

For Cassandra 0.7.x only

The following is a sample file pulled from http://wiki.apache.org/cassandra/LiveSchemaUpdates that shows how you can use Thrift API commands in batch style to build up your keyspaces. It creates a keyspace Keyspace1 with two column families, Standard1 and Standard2. If this file is passed in through the keyspace_definitions_file parameter of your clusters.cfg file, it will be executed on ONE node via the cassandra-cli utility after the Cassandra service has started.

/* Create a new keyspace */
create keyspace Keyspace1 with replication_factor = 3 and placement_strategy = 'org.apache.cassandra.locator.RackUnawareStrategy'

/* Switch to the new keyspace */
use Keyspace1

/* Create new column families */
create column family Standard1 with column_type = 'Standard' and comparator = 'BytesType'
create column family Standard2 with column_type = 'Standard' and comparator = 'UTF8Type' and rows_cached = 10000
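
Stratus runs this file for you, but you can also apply it by hand from a node, per the LiveSchemaUpdates page linked above (host and path are illustrative):

% cassandra-cli --host localhost --batch < /path/to/keyspace_definitions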

Installing

Check out the package, browse to that project's root directory, and run the following:

% sudo python setup.py install

This will build and install the python egg file to your python's site-packages location. It will also install an executable script in your path so you can run stratus from anywhere. Keep in mind it's not required to actually build and install the stratus egg and executable script. It's perfectly fine to check out the project (which includes the executable script) and run from that directory (e.g., "./stratus list"). This has the benefit of letting you pull the latest changes and run the script directly, rather than remembering to build and install after each update to the code. If you choose to install the egg, remember you must symlink or copy the plugins directory to ~/.stratus/plugins before stratus knows what services are available to it and how to execute them.
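
For example, running straight from a fresh checkout:

% git clone git://github.com/digitalreasoning/PyStratus.git
% cd PyStratus
% ./stratus list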

Running a Basic Cloud Script

After specifying an AMI, you can run stratus. It will display usage instructions when you invoke it without arguments.

You can test that the script can connect to your cloud provider by typing:

% stratus list --all

This will list the cluster name, service type, and cloud provider for ALL clusters that have been defined or are currently running in EC2.

Launching a Cluster

After you install stratus and set up your EC2 account information, you can start a 10-node Cassandra cluster with a single command:

% stratus exec CLUSTER_NAME launch-cluster 10 # (where CLUSTER_NAME is a defined cluster in your ~/.stratus/clusters.cfg file)

Expanding a Cassandra Cluster

After a Cassandra cluster is launched, you may add 10 nodes to it with one command:

% stratus exec CLUSTER_NAME expand-cluster 10 # (where CLUSTER_NAME is a defined cluster in your ~/.stratus/clusters.cfg file)

It is recommended that you double the size of the cluster when expanding. There are a few things to note about cluster expansion. First, this functionality is still experimental, and you may encounter performance issues. Only one new Cassandra node can be started every two minutes, so if you specify 10 nodes, the script will wait 2 minutes between the launch of each node. If the expansion does not double the cluster, the token ring will be unbalanced. You may fix this with the "rebalance" command, but that can have significant performance implications. Also, stratus does not currently repair nodes after bootstrapping, so that needs to be done manually after the bootstrapping process has finished.
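
The rebalance command presumably follows the same exec pattern as the other cluster operations (a sketch; check stratus's usage output for the exact form):

% stratus exec CLUSTER_NAME rebalance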

Using Persistent Clusters

  1. Create a new section in your clusters.cfg file. (This is completely optional. Most users will want EBS, so you can use an existing cluster config if you would like.)
  2. Create storage for the new cluster by creating a temporary EBS volume, formatting it, and saving it as a snapshot in S3. This way, you only have to do the formatting once and can use the snapshot to clone cluster volumes later. NOTE: You only have to do this step once unless you remove the snapshot later. All snapshots of a given size are identical, so you can just reuse one if one already exists in the size you want.
  3. Create a JSON spec file that defines how storage volumes will be created and assigned for your cluster. This spec file should reference the snapshot ID you created in the previous step. Remember that if you already have a formatted snapshot, you may use that ID instead. IMPORTANT CASSANDRA INFO: All Cassandra cluster nodes expect to have two separate storage devices defined. One storage volume will be used to store Cassandra log files (/dev/sdj) and the second will be used to store Cassandra data (/dev/sdk). The automatic configuration of the nodes will try to mount these volumes to /mnt/cassandra-logs and /mnt/cassandra-data respectively; both volumes MUST exist for persistent storage to work. A sample JSON spec file can be found in the stratus/cassandra/data directory of the project and is shown below in the "Sample Cassandra JSON spec file" section.
  4. Use the create-storage command to create the storage volumes defined in your spec file for the number of nodes your cluster will have. The Example below creates storage for a 3-node Cassandra cluster; assuming your spec defines the required two volumes per node, this command will create 6 volumes (2 for each node).
  5. Launch your cluster with the appropriate number of nodes (this should be the same number used in the previous step).
  6. When all nodes have finished launching, the configuration of your nodes will begin. This consists of assigning the devices for your storage volumes to the appropriate nodes, mounting those volumes to the proper mount points, and launching the Cassandra services. You can test your persistent storage by:
    • writing data to the Cassandra services
    • terminating your cluster like normal: % stratus exec CLUSTER_NAME terminate-cluster
    • re-launching the cluster: % stratus exec CLUSTER_NAME launch-cluster N
    • retrieving the data previously written to Cassandra
    • SSHing into your cluster: % stratus exec CLUSTER_NAME login

Example:

The following example shows how to create a 100GB snapshot, create storage for a 3-node cluster, and then launch the cluster.

% stratus exec CLUSTER_NAME create-formatted-snapshot 100
% stratus exec CLUSTER_NAME create-storage 3 ~/.stratus/my-cassandra-ebs-cluster-storage-spec.json
% stratus exec CLUSTER_NAME launch-cluster 3

JSON Spec File Keys

  • nn = Hadoop name node
  • snn = Hadoop secondary name node
  • dn = Hadoop data node
  • tt = Hadoop task tracker
  • jt = Hadoop job tracker
  • cn = Cassandra node
  • hcn = Hadoop/Cassandra node
  • In hybrid cluster specs, prefix the Hadoop-specific keys with "hybrid_" (e.g., hybrid_nn)

Sample Cassandra JSON spec file

{
    "cn": [
        {
          "device": "/dev/sdj",
          "mount_point": "/mnt/cassandra-logs",
          "size_gb": "100",
          "snapshot_id": "snap-xxxxxx"
        },
        {
          "device": "/dev/sdk",
          "mount_point": "/mnt/cassandra-data",
          "size_gb": "100",
          "snapshot_id": "snap-xxxxxx"
        }
    ]
}
  • For the automatic configuration to work correctly, two volumes must be defined, and they must reference the devices /dev/sdj and /dev/sdk. The sdj device must have the mount point /mnt/cassandra-logs and the sdk device must have the mount point /mnt/cassandra-data.

Sample Hadoop JSON spec file

{
    "nn": [
        {
          "device": "/dev/sdh",
          "mount_point": "/mnt/hadoop-ebs",
          "size_gb": "100",
          "snapshot_id": "snap-xxxxxx"
        }
    ],
    "dn": [
        {
          "device": "/dev/sdi",
          "mount_point": "/mnt/hadoop-ebs",
          "size_gb": "100",
          "snapshot_id": "snap-xxxxxx"
        }
    ]
}

Sample Hadoop/Cassandra Hybrid JSON spec file

{
    "hybrid_nn": [
        {
          "device": "/dev/sdh",
          "mount_point": "/mnt/hadoop-ebs",
          "size_gb": "100",
          "snapshot_id": "snap-xxxxxx"
        }
    ],
    "hybrid_dn": [
        {
          "device": "/dev/sdi",
          "mount_point": "/mnt/hadoop-ebs",
          "size_gb": "100",
          "snapshot_id": "snap-xxxxxx"
        }
    ],
    "cn": [
        {
          "device": "/dev/sdj",
          "mount_point": "/mnt/cassandra-logs",
          "size_gb": "100",
          "snapshot_id": "snap-xxxxxx"
        },
        {
          "device": "/dev/sdk",
          "mount_point": "/mnt/cassandra-data",
          "size_gb": "100",
          "snapshot_id": "snap-xxxxxx"
        }
    ]
}