Automated Data Persistence

Chris Donati edited this page May 26, 2015 · 9 revisions

Introduction

The most important feature in AppScale is the ability to persist your data across cloud and cluster deployments. AppScale 1.10.0 brings that feature to you, for VirtualBox, EC2, Eucalyptus, and Google Compute Engine deployments. This post details how we support data persistence in general, as well as specifics of persisting your data across each supported cluster or cloud.

Persistence

Saving your data is typically a hard problem. Why? Saving the state of an entire system can be as complicated as saving the state of every machine in it. Thankfully, App Engine and AppScale both make this normally difficult problem a lot easier. For starters, the App Engine programming model pushes you toward a more-or-less stateless web server: you save all persistent state into the Datastore, and anything in memcache can be reconstituted from that if needed. AppScale's implementation of these services is also mostly stateless - all state that we need to persist resides in three places:

  1. Cassandra - this NoSQL datastore is used within AppScale to implement support for the Datastore and Blobstore APIs, so all user and application data is stored here.
  2. ZooKeeper - since Cassandra only supports row-level transactions (not sufficient for the type of transactions that App Engine supports), we use this service to provide locking. With this, AppScale can implement App Engine transaction semantics.
  3. The source / war files for the App Engine apps that are hosted in this AppScale deployment. Note that we could simply not store these, and require the user to upload their apps every time they start up AppScale.

We begin by instructing Cassandra to store all its data in /opt/appscale/cassandra and ZooKeeper to store its data in /opt/appscale/zookeeper. Similarly, we store the App Engine apps that users upload in /opt/appscale/apps. That turns the problem of how to persist data in AppScale deployments into "how do we persist the /opt/appscale directory" - a much simpler problem! Let's dive into how we do it anywhere you can run AppScale.
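To make the layout concrete, here is a minimal sketch of the three locations described above and a check that a given path falls under the persisted directory. The names PERSISTENT_ROOT, STATE_DIRS and is_persisted are made up for this illustration and are not part of AppScale.

```python
from pathlib import PurePosixPath

# Directory that ends up on the persistent disk (taken from the text above).
PERSISTENT_ROOT = PurePosixPath("/opt/appscale")

# Where each piece of persistent state lives, as described above.
STATE_DIRS = {
    "cassandra": PERSISTENT_ROOT / "cassandra",
    "zookeeper": PERSISTENT_ROOT / "zookeeper",
    "apps": PERSISTENT_ROOT / "apps",
}


def is_persisted(path):
    """Return True if `path` is the persistent root or lies underneath it.

    Paths are compared as written; ".." segments are not resolved.
    """
    candidate = PurePosixPath(path)
    return candidate == PERSISTENT_ROOT or PERSISTENT_ROOT in candidate.parents


if __name__ == "__main__":
    for name, directory in STATE_DIRS.items():
        print(name, directory, is_persisted(directory))
```

Anything written outside /opt/appscale (for example, scratch files in /tmp) is not covered by this scheme, which is why all three services are pointed at subdirectories of it.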

VirtualBox

Nothing special here - whenever you power off VirtualBox machines, your data is automatically saved. This also works fine if you use appscale down, then appscale up. A separate post is planned that discusses how AppScale 1.10.0 automatically handles the case where your VM was rebooted or halted while AppScale was running.

Amazon EC2 and Eucalyptus

Both Amazon EC2 and Eucalyptus provide data persistence across instances in the form of the Elastic Block Store (EBS). Before you start AppScale, you ask Amazon or Eucalyptus for one EBS volume (disk) per machine in your AppScale deployment. AppScale will then automatically format each volume (if necessary) and mount it at /opt/appscale. Assuming you're on a one-node deployment, you begin by creating one new EBS volume:

$ ec2-create-volume --size 10 -z us-east-1b
# returns a volume id, something like vol-ABCDEFG
# if in Eucalyptus, use euca-create-volume instead and euca-describe-availability-zones to get your AZ names

This creates one 10 GB disk, in the US East 1B availability zone. Next, your AppScalefile needs to tell us both what disks you're using and what availability zone the instances are running in (which needs to be the same as where your EBS volume is):

zone: us-east-1b

disks:
 node-1: vol-ABCDEFG

Then you can run 'appscale up' and 'appscale down' to your heart's content, knowing that your data is automatically saved!
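For deployments with more than one machine, the same AppScalefile section needs one volume per node. The sketch below generates that section from a list of volume ids. The helper name appscalefile_fragment is hypothetical, and the node-2, node-3, ... keys are an assumed extension of the node-1 key shown above, so check them against your own AppScalefile before relying on this.

```python
def appscalefile_fragment(zone, volumes):
    """Build the zone/disks part of an AppScalefile as text.

    `volumes` is a list of volume ids (or disk names), one per machine,
    in node order. Keys beyond node-1 are an assumption of this sketch.
    """
    if not volumes:
        raise ValueError("need at least one volume, one per machine")
    if len(set(volumes)) != len(volumes):
        raise ValueError("each machine needs its own volume")
    lines = ["zone: {}".format(zone), "", "disks:"]
    for index, volume in enumerate(volumes, start=1):
        lines.append(" node-{}: {}".format(index, volume))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder volume id, as in the example above.
    print(appscalefile_fragment("us-east-1b", ["vol-ABCDEFG"]), end="")
```

Rejecting duplicate ids up front reflects the one-volume-per-machine requirement described above.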

Google Compute Engine

This is actually the first cloud we implemented with persistent disk support because of the very attractive per-minute instance pricing. The setup looks similar to our AWS support, but uses Google Compute Engine's Persistent Disks instead of AWS's Elastic Block Store. Like before, begin by creating a persistent disk:

$ gcutil adddisk --size_gb=10 --zone=us-central1-a mydisk1
# here, your disk is called mydisk1, instead of vol-ABCDEFG like in AWS

And tell your AppScalefile what your disk is called and where it can be found:

zone: us-central1-a

disks:
 node-1: mydisk1

Just like in EC2 and Eucalyptus, that's it! We'll persist your data across AppScale deployments from here on out.
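On either cloud, the step AppScale performs for you (format the disk only if necessary, then mount it at /opt/appscale) can be pictured roughly as below. This is a sketch and not AppScale's actual code: the function name mount_plan, the device path /dev/xvdf and the choice of ext4 are all assumptions made for illustration. The function only builds the commands and does not run them.

```python
def mount_plan(device, has_filesystem, mount_point="/opt/appscale", fs_type="ext4"):
    """Return the commands (as argv lists) to prepare and mount a disk.

    Nothing is executed here; the caller decides whether to run them.
    """
    commands = []
    if not has_filesystem:
        # Only a blank disk is formatted; formatting a used one would wipe
        # the data we are trying to keep.
        commands.append(["mkfs", "-t", fs_type, device])
    commands.append(["mkdir", "-p", mount_point])
    commands.append(["mount", device, mount_point])
    return commands


if __name__ == "__main__":
    # First deployment: the disk is blank, so it is formatted before mounting.
    for argv in mount_plan("/dev/xvdf", has_filesystem=False):
        print(" ".join(argv))
    # Later deployments: the disk already holds data, so it is only mounted.
    for argv in mount_plan("/dev/xvdf", has_filesystem=True):
        print(" ".join(argv))
```

The "if necessary" condition is what makes repeated deployments safe: a disk that already carries a filesystem is mounted as-is, so the Cassandra, ZooKeeper and app data from the previous run is still there.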

Conclusion and Future Work

This covers how AppScale automatically persists your data to cloud block storage (EBS volumes and Persistent Disks) and reuses it in future deployments. One area that we'd love to look into in the future is periodically backing up your data stored in /opt/appscale to a cloud storage service like Amazon S3 or Google Cloud Storage and restoring from that (instead of EBS or PD). Alternatively, we'd also like to consider the performance impacts of storing your Google App Engine applications in the Datastore / Blobstore itself, so that they automatically get replicated across machines (and so that the AppController does not have to worry about storing and locating your apps). It should also be possible to reduce the number of persistent disks you need, from one per machine to one for each machine that runs the Cassandra or ZooKeeper services. We would love to have an extra set of eyes looking over this, so feel free to join us in #appscale on freenode.net and let us know what you think!