diff --git a/SUMMARY.md b/SUMMARY.md
index 303b2bb5..ad3cbc30 100644
--- a/SUMMARY.md
+++ b/SUMMARY.md
@@ -116,7 +116,7 @@
* [Managing Data in Gradient](data/managing-data-in-gradient.md)
* [Managing Persistent Storage with VMs](data/managing-persistent-storage-with-vms.md)
* [Public Datasets Repository](data/public-datasets-repository.md)
-* [Experiment Datasets](experiments/datasets.md)
+* [Private Datasets](data/private-datasets-repository.md)

## Instances

@@ -150,3 +150,4 @@
* [Product release notes](https://support.paperspace.com/hc/en-us/articles/217560197)
* [CLI/SDK Release notes](https://github.com/Paperspace/gradient-cli/releases)
+

diff --git a/data/private-datasets-repository.md b/data/private-datasets-repository.md
new file mode 100644
index 00000000..504bb4cb
--- /dev/null
+++ b/data/private-datasets-repository.md
@@ -0,0 +1,112 @@

# Private Datasets

## About

When executing an experiment in Gradient you may optionally supply one or more datasets that will be downloaded into your experiment's environment prior to execution. These datasets can be downloaded from an S3 object or folder \(including the full bucket\). Gradient allows teams to run reproducible machine learning experiments by taking advantage of S3 ETags and Version IDs, which together let you verify that a dataset matches exactly across experiment runs and know precisely which version of a dataset you are using.

### S3 Datasets

Datasets are downloaded using the supplied AWS credentials and mounted read-only at `/data/global/DATASET` within your experiment jobs. The credentials are optional for public buckets. The name of the dataset is the `basename` of the last item in the S3 path, e.g. `s3://my-bucket/mnist.zip` would have the name `mnist` and `s3://my-bucket` would have the name `my-bucket`. The name may be overridden with the optional `name` parameter.
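As a rough illustration of this naming rule \(a hypothetical Python sketch, not Gradient's actual implementation\):

```python
import os

def dataset_name(s3_url: str) -> str:
    """Illustrative sketch: derive a dataset name from an S3 URL by taking
    the basename of the last path component and stripping one extension.
    (Simplification: multi-part extensions like ".tar.gz" would leave ".tar".)"""
    last = s3_url.rstrip("/").split("/")[-1]   # "mnist.zip" or "my-bucket"
    return os.path.splitext(last)[0]           # strip a single extension if present

dataset_name("s3://my-bucket/mnist.zip")   # -> "mnist"
dataset_name("s3://my-bucket")             # -> "my-bucket"
```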
```text
datasets: [
  {
    "url": "s3://my-bucket/mnist-modified.zip",
    "awsSecretAccessKey": "",
    "awsAccessKeyId": "",
    "name": "mnist",
  },
]
```

#### ETag

When downloading a dataset you may supply an optional `etag` parameter, which tells the dataset downloader to verify that the object stored at the path matches the supplied ETag. If the object does not match the ETag, the experiment ends with an error. This feature is only supported on S3 objects, not buckets.

```text
datasets: [
  {
    "url": "s3://my-bucket/my-dataset.zip",
    "awsSecretAccessKey": "",
    "awsAccessKeyId": "",
    "etag": "d0e2243df4d1e89ead52d51083b2eb523593b38e",
  },
]
```

#### VersionId

When downloading a dataset you may supply an optional `versionId` parameter, which tells the dataset downloader to fetch your S3 object at the specified version. This feature is only supported on versioned S3 buckets and is not supported on downloads of folders.

```text
datasets: [
  {
    "url": "s3://my-bucket/my-dataset.zip",
    "awsSecretAccessKey": "",
    "awsAccessKeyId": "",
    "versionId": "1111111",
  },
]
```

#### Supplying a Different Volume Size

By default, datasets are downloaded to an ephemeral volume that lasts for the duration of the experiment job. These volumes are 5 GiB \(`"5Gi"`\) by default; if you need a larger volume you may supply a size parameter with your dataset.

For example, this snippet starts an experiment with a dataset that downloads to a 10 GiB volume:

```text
datasets: [
  {
    "url": "s3://my-bucket/my-dataset.zip",
    "awsSecretAccessKey": "",
    "awsAccessKeyId": "",
    "volumeOptions": {
      "kind": "dynamic",
      "size": "10Gi",
    },
  },
]
```

Size units may be specified with an SI prefix for base-10 units \(K, M, G, T\), or with an `i` suffix for base-2 quantities \(Ki, Mi, Gi, Ti\).
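The unit convention can be illustrated with a small parser \(a hypothetical helper for illustration only; Gradient interprets these values internally\):

```python
# Multipliers for volume-size suffixes: SI (base-10) and binary (base-2).
UNITS = {
    "K": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
}

def parse_size(size: str) -> int:
    """Convert a size string like "5Gi" or "10G" to a number of bytes."""
    # Try longer suffixes first so "Gi" is not mistaken for "G".
    for suffix in sorted(UNITS, key=len, reverse=True):
        if size.endswith(suffix):
            return int(size[: -len(suffix)]) * UNITS[suffix]
    return int(size)  # no suffix: treat as plain bytes

parse_size("5Gi")   # -> 5 * 2**30 bytes
parse_size("10G")   # -> 10 * 10**9 bytes
```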
#### Downloading to Shared Storage

Datasets are normally downloaded to transient storage per experiment job. This means that for a 3-worker, 2-parameter-server experiment the dataset will be downloaded 5 times. Since this can be very inefficient and slows job start-up, you may instead choose to download your dataset to your team's shared storage space. This downloads the dataset to a unique path within your shared storage and mounts it into your experiment jobs at the same path as if you had downloaded it to a dynamic volume, which is useful for quickly switching your volume options without changing your experiment code. It is _strongly_ recommended to supply an ETag for shared storage downloads to ensure that you have a consistent dataset between experiment executions.

For example,

```text
datasets: [
  {
    "url": "s3://my-bucket/my-dataset.zip",
    "awsSecretAccessKey": "",
    "awsAccessKeyId": "",
    "etag": "d0e2243df4d1e89ead52d51083b2eb523593b38e",
    "volumeOptions": {
      "kind": "shared",
    },
  },
]
```

**Cleaning Up Shared Storage**

Because shared storage datasets are stored in your team storage, they are not automatically deleted. Datasets are downloaded to a path of the form `/<...>/data/<DATASET_NAME>-<ETAG>`, where `DATASET_NAME` is derived from the bucket or the user-supplied `name` parameter. If the dataset was downloaded without an `etag`, the `-<ETAG>` portion of the download path is omitted.

#### Archive Expansion

If the object supplied is in a recognized archive format, such as zip, the archive is automatically expanded in the root of the mount path. For example, `s3://my-bucket/dataset.zip` would be downloaded and expanded so that the contents of `dataset.zip` are accessible inside the container at `/data/global/dataset`. Archive formats are detected by their extension.
These are the supported archive extensions:

* .zip
* .tar
* .tar.bz2
* .tar.gz
* .tar.gz2
* .tar.xz
* .tbz
* .tbz2
* .tgz
* .txz

diff --git a/experiments/about.md b/experiments/about.md
index ae11efa2..c8f41145 100644
--- a/experiments/about.md
+++ b/experiments/about.md
@@ -8,19 +8,19 @@ Experiments are intended to be used for intensive computational tasks like neura
Experiments can be run from the **Experiment Builder** web interface, our **CLI**, the **GradientCI** bot, or our new **SDK**. Here is a quick overview and instructions for each option:

-The web interface is great for getting familiar with Experiments and running sample projects.
+The web interface is great for getting familiar with Experiments and running sample projects.

{% page-ref page="run-experiments-ui.md" %}

-The CLI \(command-line interface\) is the most popular tool for launching Experiments. It's powerful, flexible, and easy-to-use.
+The CLI \(command-line interface\) is the most popular tool for launching Experiments. It's powerful, flexible, and easy to use.

{% page-ref page="run-experiments-cli.md" %}

-The SDK let's you programmatically interact with the Gradient platform. The SDK can be incorporated into any python project and enables more advanced ML pipelines.
+The SDK lets you programmatically interact with the Gradient platform. The SDK can be incorporated into any Python project and enables more advanced ML pipelines.

{% page-ref page="../gradient-python-sdk/gradient-python-sdk/" %}

-GradientCI enables you to submit Experiments directly from a GitHub commit \(or branch\). You can launch Experiments without ever leaving your code.
+GradientCI enables you to submit Experiments directly from a GitHub commit \(or branch\). You can launch Experiments without ever leaving your code.

{% page-ref page="../projects/gradientci.md" %}

@@ -62,11 +62,11 @@ An experiment goes through a number of "states" between being submitted to Gradi
| `EXPERIMENT_STATE_CANCELLED` | Cancelled |
| `EXPERIMENT_STATE_ERROR` | Error |

-![](../.gitbook/assets/image%20%2834%29.png)
-
+![](../.gitbook/assets/image%20%2835%29.png)

## Private Datasets

You may mount private datasets hosted in S3 buckets into your experiment environment.

-{% page-ref page="datasets.md" %}
+{% page-ref page="../data/private-datasets-repository.md" %}
+

diff --git a/experiments/datasets.md b/experiments/datasets.md
deleted file mode 100644
index 5eb69c30..00000000
--- a/experiments/datasets.md
+++ /dev/null
@@ -1,129 +0,0 @@
-# Datasets
-
-## About
-
-When executing an experiment in Gradient you may optionally supply one or more datasets that will be downloaded into your experiment's environment prior to execution.
-These datasets can be downloaded from an S3 object or folder (including the full bucket).
-Gradient allows teams to run reproducible machine learning experiments by taking advantage of S3 ETags and Version IDs, which combine to allow you to be sure that datasets exactly match between training sets, and to be sure which version of a dataset you are using.
-
-### S3 Datasets
-
-Datasets are downloaded and mounted readonly on `/data/global/DATASET` within your experiment jobs using the supplied AWS credentials.
-The credentials are optional for public buckets.
-The name of the dataset is the `basename` of the last item in the s3 path, e.g. `s3://my-bucket/mnist.zip` would have the name `mnist` and `s3://my-bucket` would have the name `my-bucket`.
-The name maybe overridden with the optional `name` parameter.
- -``` -datasets: [ - { - "url": "s3://my-bucket/mnist-modified.zip", - "awsSecretAccessKey": "", - "awsAccessKeyId": "", - "name": "mnist", - }, -] -``` - -#### ETag - -When downloading a dataset you may supply an optional `etag` parameter, which will tell the dataset downloader to verify that the object stored at the path matches the supplied etag. -If it does not match the etag, the experiment will end with an error. -This feature is only supported on S3 objects and not buckets. - -``` -datasets: [ - { - "url": "s3://my-bucket/my-dataset.zip", - "awsSecretAccessKey": "", - "awsAccessKeyId": "", - "etag": "d0e2243df4d1e89ead52d51083b2eb523593b38e", - }, -] -``` - -#### VersionId - -When downloading a dataset you may supply an optional `versionId` parameter, which will tell the dataset downloader to fetch your S3 object at the specified version. -This feature is only supported on versioned S3 buckets and is not supported on downloads of folders. - -``` -datasets: [ - { - "url": "s3://my-bucket/my-dataset.zip", - "awsSecretAccessKey": "", - "awsAccessKeyId": "", - "versionId": "1111111", - }, -] -``` - -#### Supplying a Different Volume Size - -When downloading a dataset they are by default downloaded to an ephemeral volume that lasts for the duration of the experiment job. -These volumes are 5 GB (`"5Gi"`) by default; if you need a larger volume you may supply a size parameter with your dataset. - -For example, this snippet will start an experiment with a dataset that downloads to a 10 GB volume: - -``` -datasets: [ - { - "url": "s3://my-bucket/my-dataset.zip", - "awsSecretAccessKey": "", - "awsAccessKeyId": "", - "volumeOptions": { - "kind": "dynamic", - "size": "10Gi", - }, - }, -] -``` - -Size units may be specified with the SI prefix for base-10 units (K, M, G, T). -Or for base-2 quantities you may add an `i` specifier at the end (Ki, Mi, Gi, Ti). - -#### Downloading to Shared Storage - -Datasets are normally downloaded to transient storage per experiment job. 
-This means that for a 3 worker, 2 parameter server experiment the dataset will be downloaded 5 times. -Since this can be very inefficient, in order to decrease job start up time you may choose to download your artifact to your team shared storage space. -This will download the dataset to a unique path within your shared storage and mount it into your experiment jobs at the same path as if you had downloaded it to a dynamic volume. -This can be useful for quickly switching your volume options without changing your experiment code. -It is *strongly* recommended to supply an etag for shared storage downloads to ensure that you have a consistent dataset between experiment executions. - -For example, - -``` -datasets: [ - { - "url": "s3://my-bucket/my-dataset.zip", - "awsSecretAccessKey": "", - "awsAccessKeyId": "", - "etag": "d0e2243df4d1e89ead52d51083b2eb523593b38e", - "volumeOptions": { - "kind": "shared", - }, - }, -] -``` - -##### Cleaning Up Shared Storage - -Because shared storage datasets are stored in your team storage they are not automatically deleted. -Datasets are downloaded to `//data/`, where `DATASET_NAME` is derived from the bucket or the user supplied parameter. -If the dataset was downloaded without an `etag`, the `-ETAG` portion of the download path is omitted. - -#### Archive Expansion - -If the object supplied is in a recognized archive format, such as zip, the archive will automatically be expanded in the root of the mount path. -For example, `s3://my-bucket/dataset.zip` would be downloaded and expanded so that the contents of `dataset.zip` are accessible inside the container at `/data/global/dataset`. -Archive formats are detected by their extension. These are the supported archive extensions: -* .zip -* .tar -* .tar.bz2 -* .tar.gz -* .tar.gz2 -* .tar.xz -* .tbz -* .tbz2 -* .tgz -* .txz