
GitBook: [master] 4 pages modified
jaredscheib authored and gitbook-bot committed Jan 2, 2020
1 parent b064846 commit b3c1e06
Showing 4 changed files with 121 additions and 137 deletions.
3 changes: 2 additions & 1 deletion SUMMARY.md
@@ -116,7 +116,7 @@
* [Managing Data in Gradient](data/managing-data-in-gradient.md)
* [Managing Persistent Storage with VMs](data/managing-persistent-storage-with-vms.md)
* [Public Datasets Repository](data/public-datasets-repository.md)
- * [Experiment Datasets](experiments/datasets.md)
+ * [Private Datasets](data/private-datasets-repository.md)

## Instances

@@ -150,3 +150,4 @@

* [Product release notes](https://support.paperspace.com/hc/en-us/articles/217560197)
* [CLI/SDK Release notes](https://github.com/Paperspace/gradient-cli/releases)

112 changes: 112 additions & 0 deletions data/private-datasets-repository.md
@@ -0,0 +1,112 @@
# Private Datasets

## About

When executing an experiment in Gradient, you may optionally supply one or more datasets to be downloaded into your experiment's environment before execution. Datasets can be downloaded from an S3 object or folder \(including the full bucket\). Gradient lets teams run reproducible machine learning experiments by taking advantage of S3 ETags and Version IDs, which together let you verify that a dataset matches exactly across training runs and know exactly which version of a dataset you are using.

### S3 Datasets

Datasets are downloaded and mounted read-only at `/data/global/DATASET` within your experiment jobs, using the supplied AWS credentials. The credentials are optional for public buckets. The name of the dataset is the `basename` of the last item in the S3 path; e.g. `s3://my-bucket/mnist.zip` would have the name `mnist`, and `s3://my-bucket` would have the name `my-bucket`. The name may be overridden with the optional `name` parameter.

```text
datasets: [
  {
    "url": "s3://my-bucket/mnist-modified.zip",
    "awsSecretAccessKey": "<KEY>",
    "awsAccessKeyId": "<ID>",
    "name": "mnist",
  },
]
```
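
Inside the experiment job, the dataset from the example above would be mounted read-only at `/data/global/mnist`. A minimal sketch of inspecting it from your experiment code \(the directory contents shown are whatever your archive expands to\):

```python
import os

# The dataset named "mnist" above is mounted read-only at this path.
DATASET_DIR = "/data/global/mnist"

# List the files that were downloaded (and expanded, if the object was an archive).
for entry in sorted(os.listdir(DATASET_DIR)):
    print(entry)
```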

#### ETag

When downloading a dataset you may supply an optional `etag` parameter, which tells the dataset downloader to verify that the object stored at the path matches the supplied ETag. If it does not match, the experiment ends with an error. This feature is only supported on S3 objects, not buckets.

```text
datasets: [
  {
    "url": "s3://my-bucket/my-dataset.zip",
    "awsSecretAccessKey": "<KEY>",
    "awsAccessKeyId": "<ID>",
    "etag": "d0e2243df4d1e89ead52d51083b2eb523593b38e",
  },
]
```
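
You can look up an object's current ETag ahead of time through the S3 API. A sketch using `boto3` \(not part of Gradient; the bucket and key are the ones from the example above\). Note that S3 returns the ETag wrapped in double quotes:

```python
import boto3

s3 = boto3.client("s3")

# HeadObject returns the object's metadata, including its ETag.
response = s3.head_object(Bucket="my-bucket", Key="my-dataset.zip")

# S3 wraps the ETag value in double quotes; strip them before use.
etag = response["ETag"].strip('"')
print(etag)
```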

#### VersionId

When downloading a dataset you may supply an optional `versionId` parameter, which tells the dataset downloader to fetch your S3 object at the specified version. This feature is only supported on versioned S3 buckets, and is not supported when downloading folders.

```text
datasets: [
  {
    "url": "s3://my-bucket/my-dataset.zip",
    "awsSecretAccessKey": "<KEY>",
    "awsAccessKeyId": "<ID>",
    "versionId": "1111111",
  },
]
```
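
To find the `versionId` values available for an object, you can list its versions through the S3 API. A sketch with `boto3` \(not part of Gradient\), assuming versioning is enabled on the bucket:

```python
import boto3

s3 = boto3.client("s3")

# List all stored versions of the object (requires bucket versioning).
response = s3.list_object_versions(Bucket="my-bucket", Prefix="my-dataset.zip")

for version in response.get("Versions", []):
    print(version["VersionId"], version["IsLatest"], version["LastModified"])
```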

#### Supplying a Different Volume Size

By default, datasets are downloaded to an ephemeral volume that lasts for the duration of the experiment job. These volumes default to 5 GiB \(`"5Gi"`\); if you need a larger volume, you may supply a size parameter with your dataset.

For example, this snippet starts an experiment with a dataset that downloads to a 10 GiB volume:

```text
datasets: [
  {
    "url": "s3://my-bucket/my-dataset.zip",
    "awsSecretAccessKey": "<KEY>",
    "awsAccessKeyId": "<ID>",
    "volumeOptions": {
      "kind": "dynamic",
      "size": "10Gi",
    },
  },
]
```

Size units may be specified with SI prefixes for base-10 quantities \(K, M, G, T\); for base-2 quantities, append an `i` \(Ki, Mi, Gi, Ti\).
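
To illustrate the difference between the two unit families, here is a small sketch \(illustrative only, not part of Gradient\) that converts both kinds of size strings to bytes:

```python
import re

# Base-10 (SI) and base-2 (IEC) multipliers for volume size strings.
SI = {"K": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}
IEC = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}

def size_to_bytes(size: str) -> int:
    number, unit = re.fullmatch(r"(\d+)([KMGT]i?)", size).groups()
    return int(number) * (IEC[unit] if unit.endswith("i") else SI[unit])

print(size_to_bytes("10G"))   # 10000000000 (10 * 10^9)
print(size_to_bytes("10Gi"))  # 10737418240 (10 * 2^30)
```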

#### Downloading to Shared Storage

Datasets are normally downloaded to transient storage per experiment job, which means that for an experiment with three workers and two parameter servers the dataset is downloaded five times. Since this can be very inefficient, you may choose to download your dataset to your team's shared storage space to decrease job startup time. This downloads the dataset to a unique path within your shared storage and mounts it into your experiment jobs at the same path as if you had downloaded it to a dynamic volume, which is useful for quickly switching your volume options without changing your experiment code. It is _strongly_ recommended to supply an ETag for shared-storage downloads to ensure a consistent dataset between experiment executions.

For example,

```text
datasets: [
  {
    "url": "s3://my-bucket/my-dataset.zip",
    "awsSecretAccessKey": "<KEY>",
    "awsAccessKeyId": "<ID>",
    "etag": "d0e2243df4d1e89ead52d51083b2eb523593b38e",
    "volumeOptions": {
      "kind": "shared",
    },
  },
]
```

**Cleaning Up Shared Storage**

Because shared-storage datasets are stored in your team storage, they are not automatically deleted. Datasets are downloaded to `/<TEAMHANDLE>/data/<DATASET_NAME-ETAG>`, where `DATASET_NAME` is derived from the bucket or from the user-supplied `name` parameter. If the dataset was downloaded without an `etag`, the `-ETAG` portion of the download path is omitted.
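
For example, from an environment where your team's shared storage is mounted \(the team handle, dataset name, and mount point below are hypothetical; check where your team storage appears in your environment\), you could remove a stale download like this:

```python
import shutil

# Hypothetical path following the /<TEAMHANDLE>/data/<DATASET_NAME-ETAG> pattern above.
stale_dataset = "/my-team/data/my-dataset-d0e2243df4d1e89ead52d51083b2eb523593b38e"

# Recursively delete the downloaded dataset to reclaim shared storage space.
shutil.rmtree(stale_dataset)
```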

#### Archive Expansion

If the supplied object is in a recognized archive format, such as zip, the archive is automatically expanded at the root of the mount path. For example, `s3://my-bucket/dataset.zip` would be downloaded and expanded so that the contents of `dataset.zip` are accessible inside the container at `/data/global/dataset`. Archive formats are detected by their file extension, as sketched after the list below. These are the supported archive extensions:

* .zip
* .tar
* .tar.bz2
* .tar.gz
* .tar.gz2
* .tar.xz
* .tbz
* .tbz2
* .tgz
* .txz
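
As referenced above, a minimal sketch of extension-based detection \(illustrative only, not Gradient's actual implementation\):

```python
# Supported archive extensions, as listed above.
ARCHIVE_EXTENSIONS = (
    ".zip", ".tar", ".tar.bz2", ".tar.gz", ".tar.gz2",
    ".tar.xz", ".tbz", ".tbz2", ".tgz", ".txz",
)

def is_archive(key: str) -> bool:
    """Return True if the S3 key ends with a recognized archive extension."""
    return key.lower().endswith(ARCHIVE_EXTENSIONS)

print(is_archive("s3://my-bucket/dataset.zip"))  # True
print(is_archive("s3://my-bucket/dataset.csv"))  # False
```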

14 changes: 7 additions & 7 deletions experiments/about.md
@@ -8,19 +8,19 @@ Experiments are intended to be used for intensive computational tasks like neura

Experiments can be run from the **Experiment Builder** web interface, our **CLI**, the **GradientCI** bot, or our new **SDK**. Here is a quick overview and instructions for each option:

The web interface is great for getting familiar with Experiments and running sample projects.

{% page-ref page="run-experiments-ui.md" %}

The CLI \(command-line interface\) is the most popular tool for launching Experiments. It's powerful, flexible, and easy to use.

{% page-ref page="run-experiments-cli.md" %}

The SDK lets you programmatically interact with the Gradient platform. The SDK can be incorporated into any Python project and enables more advanced ML pipelines.

{% page-ref page="../projects/gradientci.md" %}

GradientCI enables you to submit Experiments directly from a GitHub commit \(or branch\). You can launch Experiments without ever leaving your code.

{% page-ref page="../gradient-python-sdk/gradient-python-sdk/" %}

@@ -62,11 +62,11 @@ An experiment goes through a number of "states" between being submitted to Gradi
| `EXPERIMENT_STATE_CANCELLED` | Cancelled |
| `EXPERIMENT_STATE_ERROR` | Error |

![](../.gitbook/assets/image%20%2834%29.png)

![](../.gitbook/assets/image%20%2835%29.png)

## Private Datasets

You may mount private datasets hosted in S3 buckets into your experiment's environment.

{% page-ref page="datasets.md" %}
{% page-ref page="../data/private-datasets-repository.md" %}

129 changes: 0 additions & 129 deletions experiments/datasets.md

This file was deleted.
