# Data management in the cloud

- **Estimated time**: 15 minutes
- **Requirements**:
  - An active node on Chameleon with a Floating IP, provisioned with a Key Pair on this Jupyter instance (so you can SSH)

In this tutorial we'll go over the various ways you can handle storage for your experiment: software, configuration, input data sets, and outputs or experiment results. You'll learn what Chameleon offers for your experimental needs, as well as strategies for thinking about storage on the cloud.

## Background: Data strategies on the cloud

Unlike a laptop, or a server plugged into a wall that your lab group uses, cloud environments should ideally be considered ephemeral. While this sounds like a constraint at first, it actually unlocks a lot of interesting possibilities. Ephemeral environments require reproducibility, because you will probably need to re-create your experiment setup several times. It therefore pays to invest some time in learning how to properly store all the data and configuration for your experiment. There are a few common storage patterns on the cloud:

- **Bootable disk images**: the bread and butter of the cloud. You need something to boot, after all! Bootable images work best when they contain the minimum amount of configuration to start your experiment. You can and should customize your boot image to contain dependencies required for your experiment. It can be helpful to work off of a simple base image and install additional software, and then save your changes as a snapshot, rather than attempt to build a disk image from scratch.
  - AWS equivalent: [AMIs](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AMIs.html)
  - Chameleon equivalent: [Images (Glance)](https://chameleoncloud.readthedocs.io/en/latest/technical/images.html)
- **Ephemeral storage**: any data written to your instance's primary block device is usually not persisted unless you snapshot your disk image or base your disk image _on top of_ a mountable block device (AWS EBS volumes can work like this, for example). This is both good and bad: if you get into a bad state you can easily revert to a clean image, but you can also lose important data if it is not persisted somewhere.
- **Mountable block devices**: mountable volumes are nice for persistent storage not necessarily related to the operations of your image. For example, your bootable image might contain the MySQL binaries, but you may store the database itself on a mountable block device, so that when your instance is destroyed, you can first detach the block device and later re-attach it to a new instance, thus preserving all of your data in practice.
  - AWS equivalent: [EBS](https://aws.amazon.com/ebs/), [EFS](https://aws.amazon.com/efs/)
  - Chameleon equivalent: _Not yet supported_
- **Object storage**: unlike a block device, an object storage system stores files as binary blobs; it has no concept of a file as consisting of multiple chunks of data. This has some interesting implications. First, object storage systems tend to scale very well on the cloud, because binary blobs are easy to replicate and store. Object storage also supports arbitrary metadata attached to an object, usually including at least a checksum for integrity verification, but also security levels, ACLs, and further descriptions of the data. Object storage is a great solution for data that you need to read or bootstrap into your experimental environment, and it also provides an easy mechanism for sharing data sets with others. However, it cannot be used effectively like a "real" filesystem, because you cannot incrementally or partially update a file; you must write a complete new copy of the object over the old one. This can hurt performance in write-heavy workloads.
  - AWS equivalent: [S3](https://aws.amazon.com/s3/)
  - Chameleon equivalent: [Object Store (Swift)](https://chameleoncloud.readthedocs.io/en/latest/technical/swift.html)
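
To make the whole-object update constraint concrete, here is a minimal sketch of "appending" a line to an object with the `openstack` CLI, which is introduced later in this tutorial. The function name `append_line_to_object` is hypothetical, and the sketch assumes the CLI is already authenticated and that the container and object exist. Because an object cannot be edited in place, the function downloads the full object, edits the local copy, and uploads the whole thing again.

```shell
# Hypothetical helper: "append" a line to an object by rewriting it in full.
# Assumes an authenticated `openstack` CLI and an existing container/object.
append_line_to_object() {
  local container="$1" object="$2" line="$3"
  local tmp
  tmp="$(mktemp)" || return 1

  # 1. Download the entire object; there is no partial read-modify-write.
  openstack object save --file "$tmp" "$container" "$object" \
    || { rm -f "$tmp"; return 1; }

  # 2. Edit the local copy.
  printf '%s\n' "$line" >>"$tmp"

  # 3. Upload the full contents under the same name, replacing the old object.
  openstack object create --name "$object" "$container" "$tmp" >/dev/null
  local status=$?

  rm -f "$tmp"
  return "$status"
}
```

Every call transfers the object twice, once in each direction, which is why workloads that make many small writes are a poor fit for object storage.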
  
We'll now go over what options are available to you in Chameleon, so you are better-equipped to address your storage needs.

## Tutorial

1. [Creating your own disk image](#Creating-your-own-disk-image)
1. [Using the mounted Object Store](#Using-the-mounted-Object-Store)
1. [Using the Object Store directly](#Using-the-Object-Store-directly)

### Variables you'll see/use in this Notebook

  - `FLOATING_IP`: a Floating IP attached to your running instance
  - `OS_REGION_NAME`: the Chameleon site name, e.g. CHI@TACC or CHI@UC
  - `OS_PROJECT_NAME`: the Chameleon project to authenticate under. You might need to set this if no project was automatically chosen for you (which can happen if you are on multiple projects!)
  
### Set variables

In [34]:
FLOATING_IP="192.5.87.33" # Required
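OS_REGION_NAME="${OS_REGION_NAME:-CHI@UC}"
if [[ -z "${OS_PROJECT_NAME:+x}" ]]; then
  OS_PROJECT_NAME="CH-XXXXXX" # For example.
fi
# Export so the openstack CLI sees these values even if they were not
# already present in the environment.
export OS_REGION_NAME OS_PROJECT_NAME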

# Create shortcut for SSH
do_ssh() {
  ssh "cc@$FLOATING_IP" "$@"
}

# wait_ssh is assumed to be provided by the Jupyter environment.
wait_ssh "$FLOATING_IP"
# Kludgy way to ensure your server has latest snapshot utility (force past update prompt)
do_ssh timeout 10 "yes | sudo cc-snapshot" >/dev/null 2>&1 || true

Waiting up to 300 seconds for SSH on 192.5.87.33...
SSH is running!
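
If `wait_ssh` is not available in your environment, a similar polling loop is easy to sketch. The names `wait_until` and `ssh_ready` below are hypothetical, and the real helper may behave differently; the 300-second budget simply mirrors the output above.

```shell
# Hypothetical stand-in for a wait helper: retry a command until it succeeds
# or a timeout (in seconds) elapses.
wait_until() {
  local timeout="$1" interval="$2"
  shift 2
  local waited=0
  until "$@"; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "Timed out after ${timeout}s waiting for: $*" >&2
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
}

# Probe that succeeds once a non-interactive SSH login works.
# BatchMode avoids password prompts; ConnectTimeout bounds each attempt.
# Assumes the host key is already trusted, since BatchMode cannot prompt.
ssh_ready() {
  ssh -o BatchMode=yes -o ConnectTimeout=5 "cc@$1" true
}
```

A call such as `wait_until 300 5 ssh_ready "$FLOATING_IP"` would then poll every 5 seconds for up to 300 seconds.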


## Creating your own disk image

When you are logged in to one of your bare metal nodes, and you're using a disk image derived from one of Chameleon's official images ([CentOS7](https://www.chameleoncloud.org/appliances/1/) or [Ubuntu16.04](https://www.chameleoncloud.org/appliances/19/), for example), you will have access to a tool called `cc-snapshot`. This tool allows you to create a new bootable disk image from your current filesystem contents. It was created to get around a limitation of the [OpenStack Ironic](https://docs.openstack.org/ironic) bare metal provisioning system, which does not support snapshotting an instance the way you would snapshot a VM.

You can interactively use the tool by just running `cc-snapshot` (requires sudo). It will prompt you for your Chameleon username and password; this is necessary to save the resulting disk image back up to Chameleon. It may also ask you if you'd like to update the script. This is recommended, as bugs are fixed from time to time.

We will create a snapshot of your instance now. We'll craft a special invocation of `cc-snapshot` so we can run it over SSH. You can also try connecting to your node via SSH in a Jupyter Terminal and running it interactively if you like.

In [27]:
snapshot_name="$USER-tutorial-snapshot-$(date +%b%d)"

# We do a little trick to pass our authentication token to our remote
# instance here by setting OS_TOKEN and OS_AUTH_TYPE.
# We also invoke with some flags, -f and -y, which help us skip any
# prompts (necessary because Jupyter Bash cannot wait for input)
do_ssh sudo env OS_TOKEN=$OS_TOKEN OS_AUTH_TYPE=token \
  cc-snapshot -fy "$snapshot_name"

Will snapshot the instance using the following name: 'jason_a-tutorial-snapshot-May30'
BDB2053 Freeing read locks for locker 0x54: 23942/140308114094208
BDB2053 Freeing read locks for locker 0x56: 23942/140308114094208
BDB2053 Freeing read locks for locker 0x57: 23942/140308114094208
BDB2053 Freeing read locks for locker 0x58: 23942/140308114094208
Package 1:libguestfs-xfs-1.38.2-12.el7_6.2.x86_64 already installed and latest version
Nothing to do
No packages marked for update
tar: Removing leading `/' from member names
tar: Removing leading `/' from hard link targets
[   0.0] Examining the guest ...
[   2.0] Setting a random seed
[   2.0] Running: grub2-install /dev/sda && grub2-mkconfig -o /boot/grub2/grub.cfg
[   9.5] Finishing off
[   0.0] Examining the guest ...
[   2.2] Performing "abrt-data" ...
[   2.2] Performing "backup-files" ...
[   2.9] Performing "bash-history" ...
[   2.9] Performing "blkid-tab" ...
[   2.9] Performing "crash-data" ...
[   2.9] Performing "cron-spool" ..

Your image should now be visible in the [Images panel](https://chi.uc.chameleoncloud.org/project/images) in the Chameleon GUI. You may have to filter for your username to find it if there are a lot of images available to your project. You can now use this image when launching a new bare metal instance!
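
If you prefer the command line to the GUI, a minimal sketch of the same check with the `openstack` CLI follows. The helper name `image_status` is hypothetical, and the sketch assumes `snapshot_name` is still set from the cell above and matches exactly one image.

```shell
# Hypothetical helper: print the status of an image looked up by name or ID.
# Assumes an authenticated `openstack` CLI.
image_status() {
  openstack image show -f value -c status "$1"
}
```

Running `image_status "$snapshot_name"` should report `active` once the image has finished uploading; other values, such as `saving`, suggest the upload is still in progress.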

Creating a disk snapshot is a great way to save the state of a node so you can resume your work later, after your lease ends for example. However, it may not be a good idea to store a bunch of data in a disk image. Disk images are best when they are lean and easy to launch/start. Large data sets, or experimental data that you'd like to store in an easily sharable way, are better suited for the Object Store.

## Using the mounted Object Store

The Chameleon [Object Store](https://chameleoncloud.readthedocs.io/en/latest/technical/swift.html) contains several petabytes of storage for your use. Chameleon exposes it to you in a number of ways, but the most immediately obvious is a special mounted directory you may have already noticed inside your instance. The `my_mounting_point` folder in your default home directory is a [cloudfuse](https://github.com/redbo/cloudfuse) mount of the Chameleon Object Store, where your various data objects are presented as files in a navigable tree. You can interact with it much like you would a normal file system:

In [36]:
container_name=$USER-tutorial-container

# Creating a folder creates a new container (namespace) in the
# Object Store, owned by your current Project.
do_ssh mkdir -p "my_mounting_point/$container_name"

# Writing files creates new objects. Each command is passed as one string
# so the redirection and its arguments reach the remote shell intact.
do_ssh "echo Test >my_mounting_point/$container_name/TestFile.txt"
do_ssh "echo Test >my_mounting_point/$container_name/TestFile2.txt"

echo "Objects found in '$container_name':"
openstack object list "$container_name"

do_ssh rm my_mounting_point/$container_name/TestFile2.txt

echo "Objects found in '$container_name' after deletion:"
openstack object list "$container_name"

Objects found in 'jason_a-tutorial-container':
+---------------+
| Name          |
+---------------+
| TestFile.txt  |
| TestFile2.txt |
+---------------+
Objects found in 'jason_a-tutorial-container' after deletion:
+--------------+
| Name         |
+--------------+
| TestFile.txt |
+--------------+
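
One subtlety when running compound commands through `do_ssh`: `ssh` joins all of its command arguments with spaces and hands the resulting string to the remote login shell, so any quoting applied locally is parsed a second time on the remote side. The sketch below simulates this locally; `fake_ssh` is a hypothetical stand-in that mimics the joining behaviour without needing a remote host.

```shell
# Hypothetical local stand-in for `ssh host ...`: join the arguments with
# spaces and let a shell re-parse the result, as the remote side would.
fake_ssh() {
  sh -c "$*"
}

# The inner quotes are lost after joining: the shell sees `sh -c echo Test`,
# so `echo` runs with no arguments and prints an empty line.
# (The same applies to `bash -c`.)
fake_ssh sh -c "echo Test"

# Passing the whole command as a single string survives the re-parse
# and prints "Test".
fake_ssh "echo Test"
```

For redirections, this means a single-string call such as `do_ssh "echo Test >file"` writes `Test` into the remote file, whereas wrapping the same command in `bash -c` would leave only an empty line there.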


While it looks like a normal filesystem, under the hood you are making requests to the Object Store. Because the underlying representation is an object rather than a file, some operations can be much slower than on a local disk; changing part of a file, for example, means writing the whole object again. Consult the [Object Store documentation](https://chameleoncloud.readthedocs.io/en/latest/technical/swift.html) to learn more. In general, the mount is a nice way to get acquainted with the Object Store, and it may be fine for your use case, especially if your experiment is read-intensive.

## Using the Object Store directly

The Object Store is also exposed via its OpenStack API, which means you can, once again, use the `openstack` CLI tool to access it. The Object Store functions are under the `openstack object` top-level command.

In [50]:
echo "Uploading this Notebook!"

openstack object create $container_name DataManagement.ipynb >/dev/null \
  && openstack object show $container_name DataManagement.ipynb

Uploading this Notebook!
+----------------+----------------------------------+
| Field          | Value                            |
+----------------+----------------------------------+
| account        | v1                               |
| container      | jason_a-tutorial-container       |
| content-length | 18560                            |
| content-type   | binary/octet-stream              |
| etag           | 8332695bfd8af037c6f10ad9f6844b11 |
| last-modified  | Thu, 30 May 2019 02:06:54 GMT    |
| object         | DataManagement.ipynb             |
+----------------+----------------------------------+
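
The `etag` field above can double as an integrity check. For objects uploaded in a single piece, Swift typically reports the MD5 digest of the contents as the ETag; this does not hold for segmented large objects. A minimal sketch, assuming `md5sum` is available locally and using the hypothetical helper name `verify_upload`:

```shell
# Hypothetical helper: compare a local file's MD5 digest with the ETag that
# the Object Store reports for the uploaded copy.
# Assumes an authenticated `openstack` CLI and a non-segmented object.
verify_upload() {
  local container="$1" file="$2"
  local local_sum remote_sum
  local_sum="$(md5sum "$file" | cut -d' ' -f1)" || return 1
  remote_sum="$(openstack object show -f value -c etag "$container" "$file")" \
    || return 1

  if [ "$local_sum" = "$remote_sum" ]; then
    echo "OK: $file matches the stored object"
  else
    echo "MISMATCH: local $local_sum, remote $remote_sum" >&2
    return 1
  fi
}
```

For the upload above, this would be invoked as `verify_upload "$container_name" DataManagement.ipynb`.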


If you are dealing with particularly large files, you should look into using the `swift` CLI tool instead. It is a more specialized tool for [Swift](https://docs.openstack.org/swift), the OpenStack service that powers the Object Store. In particular, the `swift` tool supports multi-threaded transfers and transparently breaks large files (> 5 GB) into chunks for easier (and parallel) transmission. It can also be used to customize ACLs on your objects.

In [61]:
# Not actually giving a nice example because the raw Swift
# client doesn't support token-authentication (requires you
# to type in your username/password every time.)
swift upload -h

Usage: swift upload [--changed] [--skip-identical] [--segment-size <size>]
                    [--segment-container <container>] [--leave-segments]
                    [--object-threads <thread>] [--segment-threads <threads>]
                    [--meta <name:value>] [--header <header>] [--use-slo]
                    [--ignore-checksum] [--object-name <object-name>]
                    <container> <file_or_directory> [<file_or_directory>] [...]

Uploads specified files and directories to the given container.

Positional arguments:
  <container>           Name of container to upload to.
  <file_or_directory>   Name of file or directory to upload. Specify multiple
                        times for multiple uploads. If "-" is specified, reads
                        content from standard input (--object-name is required
                        in this case).

Optional arguments:
  -c, --changed         Only upload files that have changed since the last
                        upload.
  -
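
As a rough illustration of the large-file options listed in the help text, the sketch below prints (but does not run) a `swift upload` command, adding segmentation flags only when a file exceeds a threshold. The function name `swift_upload_cmd`, the 1 GiB segment size, and the thread count are illustrative assumptions rather than recommended settings; the flags themselves come from the usage output above.

```shell
# Hypothetical helper: print a `swift upload` command for a file, adding
# segmentation flags when the file is larger than a threshold (default 5 GiB).
swift_upload_cmd() {
  local container="$1" file="$2"
  local threshold="${3:-5368709120}"  # 5 GiB in bytes
  local size
  size="$(wc -c <"$file")" || return 1
  size=$((size))  # normalise any whitespace padding from wc

  if [ "$size" -gt "$threshold" ]; then
    # 1 GiB segments, uploaded by 4 threads (illustrative values).
    echo "swift upload --use-slo --segment-size 1073741824 --segment-threads 4 $container $file"
  else
    echo "swift upload $container $file"
  fi
}
```

Printing the command first makes it easy to review the chosen flags before running it with your own credentials.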

## Recap

You should now have a bootable image snapshot and be more familiar with the `cc-snapshot` tool. You should also understand what the `my_mounting_point` directory is, how it works, and what it is good for. Finally, you should be familiar with a few more `openstack` CLI commands for managing data in the Chameleon Object Store.