## Basic Bundle and File Operations with the DCP Python API

Here are some examples of basic operations with the HCA DSS: getting bundle and file metadata and contents. We'll illustrate them using the HCA DCP's Python API.

First, install the Python library so we can make some requests.

In [1]:
import sys
!{sys.executable} -m pip install hca



Now, we're going to get the "manifest" of a bundle: metadata about the bundle and its contents. We'll make the request with the Python API using the bundle's UUID and version.

In [3]:
bundle_uuid = "ead66505-a78b-44ee-81f6-418be859ab65"
bundle_version = "2018-12-06T043139.806469Z"
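
Before sending a request it can help to sanity-check the two identifiers locally. The following is a minimal sketch, not part of the `hca` library: the helper names `is_valid_uuid` and `parse_dss_version` are hypothetical, and the timestamp format is an assumption inferred from the version strings shown in this notebook.

```python
import uuid
from datetime import datetime, timezone

# Assumed format, inferred from version strings such as "2018-12-06T043139.806469Z".
DSS_VERSION_FORMAT = "%Y-%m-%dT%H%M%S.%fZ"


def is_valid_uuid(value):
    """Return True if value parses as a UUID."""
    try:
        uuid.UUID(value)
    except ValueError:
        return False
    return True


def parse_dss_version(version):
    """Parse a version string into a timezone-aware UTC datetime.

    Raises ValueError if the string does not match the assumed format.
    """
    parsed = datetime.strptime(version, DSS_VERSION_FORMAT)
    return parsed.replace(tzinfo=timezone.utc)


print(is_valid_uuid("ead66505-a78b-44ee-81f6-418be859ab65"))
print(parse_dss_version("2018-12-06T043139.806469Z").isoformat())
```

A malformed version string raises `ValueError` here, which is cheaper to debug than a failed remote request.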

We first need to create a `DSSClient` object, and then we can make the request with `DSSClient.get_bundle`:

In [4]:
import hca.dss
dss_client = hca.dss.DSSClient()
dss_client.get_bundle(uuid=bundle_uuid, version=bundle_version, replica="aws")

{'bundle': {'creator_uid': 8008,
  'files': [{'content-type': 'application/json; dcp-type="metadata/biomaterial"',
    'crc32c': 'ec0d3d14',
    'indexed': True,
    'name': 'cell_suspension_0.json',
    's3_etag': 'fee4d8354468476052a1c0da67fb1f03',
    'sha1': '6f77abbdd0892a99e6fae5c920daaaf3018b24fd',
    'sha256': '5d41934e1af911f510be04506d41b59ebd91d1678f42df759819b223a898b1da',
    'size': 1356,
    'uuid': '3e4e6f8e-0c67-470c-8059-950e0bd1ebff',
    'version': '2018-12-04T190346.297000Z'},
   {'content-type': 'application/json; dcp-type="metadata/biomaterial"',
    'crc32c': '73fc7859',
    'indexed': True,
    'name': 'specimen_from_organism_0.json',
    's3_etag': '5ba0e2a0506f3fab2d69c5d19227c5c3',
    'sha1': '53275ddd170d3d6dd31239674bff86ede607ab2f',
    'sha256': '3b76d6f98b4912fd9fc233054cd47c321bb2411e5a56c656efc5c7e9327f2a32',
    'size': 1643,
    'uuid': '3a04d7b0-105b-457b-9a5e-a6b2ab4d3225',
    'version': '2018-12-04T190116.949000Z'},
   {'content-type': 'applic
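
Since the manifest comes back as a nested dictionary, it can be summarized with ordinary Python. The sketch below assumes the structure visible in the output above (`bundle` → `files` → one dictionary per file). The sample data is the two complete entries from that output, trimmed to a few fields, and the helper names are hypothetical.

```python
# Sample data: the two complete file entries shown above, trimmed to a few fields.
manifest = {
    "bundle": {
        "creator_uid": 8008,
        "files": [
            {
                "content-type": 'application/json; dcp-type="metadata/biomaterial"',
                "name": "cell_suspension_0.json",
                "size": 1356,
                "uuid": "3e4e6f8e-0c67-470c-8059-950e0bd1ebff",
                "version": "2018-12-04T190346.297000Z",
            },
            {
                "content-type": 'application/json; dcp-type="metadata/biomaterial"',
                "name": "specimen_from_organism_0.json",
                "size": 1643,
                "uuid": "3a04d7b0-105b-457b-9a5e-a6b2ab4d3225",
                "version": "2018-12-04T190116.949000Z",
            },
        ],
    }
}


def summarize_manifest(manifest):
    """Return a (name, uuid, size) tuple for every file in the manifest."""
    return [(f["name"], f["uuid"], f["size"]) for f in manifest["bundle"]["files"]]


def total_size(manifest):
    """Return the combined size in bytes of all files in the manifest."""
    return sum(f["size"] for f in manifest["bundle"]["files"])


def find_file(manifest, name):
    """Return the first manifest entry with the given file name, or None."""
    for entry in manifest["bundle"]["files"]:
        if entry["name"] == name:
            return entry
    return None


for name, file_uuid, size in summarize_manifest(manifest):
    print(f"{name}\t{file_uuid}\t{size} bytes")
print("total:", total_size(manifest), "bytes")
```

With a real response, `find_file(response, "links.json")` would be one way to look up the UUID and version needed for a per-file request.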

The manifest lists the bundle's files along with per-file metadata such as checksums, size, and content type. We can get metadata for a single file using `DSSClient.head_file`.

In [6]:
file_uuid = "9adf9f89-f546-4889-86ff-b430e3123c8b"
file_version = "2018-12-04T191256.554000Z"

In [10]:
response = dss_client.head_file(uuid=file_uuid, version=file_version, replica="aws")
response.headers

{'Date': 'Tue, 11 Dec 2018 23:48:50 GMT', 'Content-Type': 'text/html; charset=utf-8', 'Content-Length': '0', 'Connection': 'keep-alive', 'x-amzn-RequestId': '4f3cec17-fd9f-11e8-97f5-975f4601d6bf', 'X-DSS-SHA1': '9bff1801fb12b3302b7f7d7dd57a3a3cb6070914', 'Access-Control-Allow-Origin': '*', 'X-DSS-S3-ETAG': '003989f2df2aa8f39723f3925e5fb851', 'X-DSS-SHA256': '7a967973684ffa8ac3fe5004177e2fc7d8f3a2c822bbede536dcd58d9af7bdb5', 'Strict-Transport-Security': 'max-age=31536000; includeSubDomains; preload', 'Access-Control-Allow-Headers': 'Authorization,Content-Type,X-Amz-Date,X-Amz-Security-Token,X-Api-Key', 'X-DSS-CONTENT-TYPE': 'application/json; dcp-type="metadata/process"', 'X-DSS-CRC32C': '3c8c3cc5', 'X-DSS-CREATOR-UID': '8008', 'x-amz-apigw-id': 'RxDzZFD3oAMFaFg=', 'X-DSS-VERSION': '2018-12-04T191256.554000Z', 'X-Amzn-Trace-Id': 'Root=1-5c104ce2-6e03b1a094df5820d80881c0;Sampled=0', 'X-DSS-SIZE': '408'}
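
The file-specific values are mixed in with generic HTTP headers. As a rough sketch, the `X-DSS-*` entries can be pulled into their own dictionary. The helper name `dss_metadata_from_headers` is hypothetical, and the sample headers are a subset copied from the output above.

```python
# Sample data: a subset of the headers shown above.
sample_headers = {
    "Date": "Tue, 11 Dec 2018 23:48:50 GMT",
    "Content-Length": "0",
    "X-DSS-SHA1": "9bff1801fb12b3302b7f7d7dd57a3a3cb6070914",
    "X-DSS-S3-ETAG": "003989f2df2aa8f39723f3925e5fb851",
    "X-DSS-CONTENT-TYPE": 'application/json; dcp-type="metadata/process"',
    "X-DSS-CRC32C": "3c8c3cc5",
    "X-DSS-VERSION": "2018-12-04T191256.554000Z",
    "X-DSS-SIZE": "408",
}


def dss_metadata_from_headers(headers):
    """Collect X-DSS-* headers into a dict keyed by the lowercased suffix.

    For example, "X-DSS-SIZE" becomes "size". Values stay as strings.
    """
    prefix = "x-dss-"
    return {
        key[len(prefix):].lower(): value
        for key, value in headers.items()
        if key.lower().startswith(prefix)
    }


meta = dss_metadata_from_headers(sample_headers)
print(meta["content-type"])
print(int(meta["size"]), "bytes")
```

Header values are always strings, so numeric fields such as the size need an explicit `int()` conversion.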

The `head_file` response carries the file's metadata in its headers (the `X-DSS-*` entries). If we want the file contents, we can use `DSSClient.get_file`:

In [12]:
dss_client.get_file(uuid=file_uuid, version=file_version, replica="aws")

{'process_core': {'process_id': 'E18_20160930_Neurons_Sample_71_S068_L007_006'},
 'schema_type': 'process',
 'describedBy': 'https://schema.humancellatlas.org/type/process/6.0.2/process',
 'provenance': {'document_id': '9adf9f89-f546-4889-86ff-b430e3123c8b',
  'submission_date': '2018-12-04T18:59:53.218Z',
  'update_date': '2018-12-04T19:12:56.554Z'}}

And there are the contents of the file. Note that since the remote file was JSON, the Python library has converted it to a Python dictionary rather than returning raw bytes.

### Downloading an Entire Bundle

The `DSSClient` also has a `download` method that downloads an entire bundle at once:

In [13]:
dss_client.download(bundle_uuid=bundle_uuid, version=bundle_version, replica="aws")

In [14]:
!ls "$bundle_uuid"

cell_suspension_0.json
dissociation_protocol_0.json
donor_organism_0.json
E18_20160930_Neurons_Sample_71_S068_L007_I1_006.fastq.gz
E18_20160930_Neurons_Sample_71_S068_L007_R1_006.fastq.gz
E18_20160930_Neurons_Sample_71_S068_L007_R2_006.fastq.gz
library_preparation_protocol_0.json
links.json
process_0.json
process_1.json
process_2.json
project_0.json
sequence_file_0.json
sequence_file_1.json
sequence_file_2.json
sequencing_protocol_0.json
specimen_from_organism_0.json
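
Once a bundle is on disk, its JSON metadata files can be loaded in one pass. This is a minimal sketch that assumes the layout shown in the listing above, that is, a directory named after the bundle UUID with the `*.json` files directly inside it. The helper name `load_bundle_metadata` is hypothetical and not part of the `hca` library.

```python
import json
from pathlib import Path


def load_bundle_metadata(bundle_dir):
    """Return {file name: parsed JSON} for every *.json file directly in bundle_dir.

    Non-JSON files (for example the .fastq.gz data files) are skipped.
    """
    metadata = {}
    for path in sorted(Path(bundle_dir).glob("*.json")):
        with path.open(encoding="utf-8") as fh:
            metadata[path.name] = json.load(fh)
    return metadata
```

In this notebook, `load_bundle_metadata(bundle_uuid)` would return a dictionary keyed by names such as `links.json` and `project_0.json`.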


### Streaming Large Files

Some files are too large to read into memory at once. The `DSSClient.get_file.stream` method allows iterating over portions of the file. For example, `E18_20160930_Neurons_Sample_71_S068_L007_R2_006.fastq.gz` in the bundle above is 386,122,988 bytes. We can read it in 1 MiB (`1 << 20` bytes) chunks:

In [20]:
with dss_client.get_file.stream(uuid="330b4a8c-57d6-4d7a-b15d-9a49ed1c0472", replica="aws") as remote_fh, \
        open("E18_20160930_Neurons_Sample_71_S068_L007_R2_006.fastq.gz", "wb") as local_fh:
    while True:
        chunk = remote_fh.raw.read(1<<20)
        if not chunk:
            break
        local_fh.write(chunk)

In [21]:
!ls -l E18_20160930_Neurons_Sample_71_S068_L007_R2_006.fastq.gz

-rw-r--r-- 1 jovyan users 386122988 Dec 12 03:41 E18_20160930_Neurons_Sample_71_S068_L007_R2_006.fastq.gz
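
The size matches, but a checksum is a stronger check. The manifest entries shown earlier carry a `sha256` field, so one option is to hash the downloaded file in the same chunked fashion and compare. This is a sketch under that assumption. The helper names are hypothetical, and the expected digest for this particular file is not shown in this notebook, so it would need to be read from the manifest.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it chunk_size bytes at a time."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel keeps calling fh.read until it returns b"" (end of file).
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_manifest(path, expected_sha256):
    """Return True if the file's SHA-256 equals the expected hex digest (case-insensitive)."""
    return sha256_of_file(path) == expected_sha256.lower()
```

Hashing in chunks keeps memory use bounded by `chunk_size`, for the same reason the download loop above reads one chunk at a time.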
