**N.B.:** This notebook doesn't point to the "real" HCA Data Coordination Platform (DCP). It points to an instance of the Data Storage System (DSS) used for testing. So you won't find this data in the future, and its organization and metadata structure won't necessarily match what's available in the DCP. Nevertheless, it can be useful for exploring the API and data model conceptually.

## Interacting with the HCA Data Storage System

The current interface of the HCA DSS is defined here: https://dss.data.humancellatlas.org/v1/swagger.json, and there are two bindings: a Python interface and a command line interface.

Here we'll use the Python interface to look through some objects in the DSS and see how we can access them. First, we'll import the library and initialize the client.

In [1]:
import hca.dss
dss_client = hca.dss.DSSClient()

### Bundles
Now let's take a look at a "bundle" that's stored in the DSS. A bundle is the basic unit of storage in the DSS, and it's simply a versioned collection of files with a unique ID.

In [2]:
bundle_uuid = 'ec7e6476-abeb-4d78-82d8-8a28b43c46a6' # Elsewhere, we'll cover how to query for a bundle uuid
bundle = dss_client.get_bundle(uuid=bundle_uuid, replica="aws")
bundle

{'bundle': {'creator_uid': 8008,
  'files': [{'content-type': 'application/json; dcp-type="metadata/project"',
    'crc32c': '4ce66be3',
    'indexed': True,
    'name': 'project.json',
    's3_etag': '2be0a2479191779505af379eba6d5d10',
    'sha1': '36a9ab5c9cf6d772e52310740ce55ed09eb2dd19',
    'sha256': '76106f0ac8250169e75f1034192cec634ac395c99a912a9be9b8496782f6e6f4',
    'size': 2714,
    'uuid': '370452fa-3091-48cb-8d69-10c26e0db433',
    'version': '2018-03-26T154205.186524Z'},
   {'content-type': 'application/json; dcp-type="metadata/biomaterial"',
    'crc32c': 'b1c5b387',
    'indexed': True,
    'name': 'biomaterial.json',
    's3_etag': '8a3359dcb4c38b24f648d9168a8fea64',
    'sha1': 'd43bbe3baf4e3fabb7f5d4bc0c062429027dd28d',
    'sha256': '7ca86a30655aedc6c00a34b1768d529459740a14c989acfd4552fb660493a6bd',
    'size': 3700,
    'uuid': '1d7fd92c-ae01-4920-91a9-3edf950d4d70',
    'version': '2018-03-26T154205.786705Z'},
   {'content-type': 'application/gzip; dcp-type=data',

So this bundle is a collection of eight files:

In [3]:
[f['name'] for f in bundle['bundle']['files']]

['project.json',
 'biomaterial.json',
 'SRR3934437_1.fastq.gz',
 'SRR3934437_2.fastq.gz',
 'file.json',
 'process.json',
 'protocol.json',
 'links.json']

A couple of fastqs and some json files. Let's look at one of the json files:

In [4]:
biomaterial_json_uuid = [f['uuid'] for f in bundle['bundle']['files'] if f['name'] == 'biomaterial.json'][0]
print("The uuid of biomaterial.json is", biomaterial_json_uuid)
dss_client.get_file(uuid=biomaterial_json_uuid, replica="aws")

The uuid of biomaterial.json is 1d7fd92c-ae01-4920-91a9-3edf950d4d70


{'biomaterials': [{'content': {'biomaterial_core': {'has_input_biomaterial': 'BT_S2_T',
     'biomaterial_name': 'Single cell from Tumor,1001000174.A2',
     'supplementary_files': ['ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM2243nnn/GSM2243527/suppl/GSM2243527_1001000174.A2.csv.gz'],
     'ncbi_taxon_id': [9606],
     'biomaterial_id': 'GSM2243527',
     'biosd_biomaterial': 'SAMN05421306'},
    'genus_species': [{'text': 'Homo sapiens', 'ontology': 'NCBITaxon:9606'}],
    'describedBy': 'https://schema.humancellatlas.org/type/biomaterial/5.1.0/cell_suspension',
    'total_estimated_cells': 1,
    'schema_type': 'biomaterial'},
   'hca_ingest': {'accession': '',
    'submissionDate': '2018-03-22T10:14:35.739Z',
    'updateDate': '2018-03-22T10:19:26.105Z',
    'document_id': '64ad6b33-66d0-46f0-a34d-283ca8ae5978'}},
  {'content': {'organ': {'text': 'brain', 'ontology': 'UBERON:0000955'},
    'genus_species': [{'text': 'Homo sapiens', 'ontology': 'NCBITaxon:9606'}],
    'schema_type': '

Metadata. Great.

This describes the data that's in the two fastq files. So these fastqs are from a tumor in the temporal lobe of a human brain. This json follows the HCA Metadata Schema, which is defined here: https://github.com/HumanCellAtlas/metadata-schema . The other json files contain other metadata.
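
The `content-type` strings in the bundle listing earlier carry a `dcp-type` parameter that appears to separate metadata files from data files. As a rough sketch, we can pull that parameter out and group file names by it. The helper names `dcp_type` and `group_by_dcp_type` are ours and are not part of the `hca` library, and the sample entries are copied from the listing so the cell runs without a DSS connection.

```python
from collections import defaultdict

def dcp_type(content_type):
    """Return the dcp-type parameter of a content-type string, or None."""
    # Parameters follow the media type, separated by semicolons.
    for param in content_type.split(";")[1:]:
        key, _, value = param.strip().partition("=")
        if key == "dcp-type":
            return value.strip('"')  # the value may or may not be quoted
    return None

def group_by_dcp_type(files):
    """Map each dcp-type to the list of file names that carry it."""
    groups = defaultdict(list)
    for f in files:
        groups[dcp_type(f["content-type"])].append(f["name"])
    return dict(groups)

# Sample entries copied from the bundle listing; in the notebook you would
# pass bundle['bundle']['files'] instead.
sample_files = [
    {"name": "project.json",
     "content-type": 'application/json; dcp-type="metadata/project"'},
    {"name": "biomaterial.json",
     "content-type": 'application/json; dcp-type="metadata/biomaterial"'},
    {"name": "SRR3934437_1.fastq.gz",
     "content-type": "application/gzip; dcp-type=data"},
]

groups = group_by_dcp_type(sample_files)
print(groups)
```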

Now, if we want to download one of the fastqs, we probably don't want to dump it to an output cell in the notebook. So we'll write it to a file.

In [5]:
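# Stream the fastq from the DSS to a local file in 1 MiB chunks,
# so the whole file is never held in memory at once.
with dss_client.get_file.stream(uuid="b968bf89-f6a8-4fd2-9bc1-8dd25d4c85c9", replica="aws") as remote_fh, \
          open("/home/jovyan/SRR3934437_2.fastq.gz", "wb") as local_fh:
    while True:
        chunk = remote_fh.raw.read(1<<20)  # read up to 1 MiB
        if not chunk:  # an empty read means the stream is exhausted
            break
        local_fh.write(chunk)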

In [6]:
ls -l /home/jovyan/*.fastq.gz

-rw-r--r-- 1 jovyan users 49310595 Apr 25 15:25 /home/jovyan/SRR3934437_2.fastq.gz


There it is. And the size matches the one reported for this file in the bundle's listing, so hey, that's pretty good.
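
Each file entry in a bundle also records a `sha256` next to its `size`, so a stronger check than eyeballing `ls -l` is possible. Here is a minimal sketch, assuming `sha256` is the hex SHA-256 digest of the file's contents. The function `verify_download` is our own helper, not something the `hca` library provides, and the demo uses a throwaway temporary file in place of the real fastq.

```python
import hashlib
import os
import tempfile

def verify_download(path, file_entry, chunk_size=1 << 20):
    """Check a local file's size and SHA-256 against a DSS file entry."""
    # Cheap check first: a size mismatch means there is no need to hash.
    if os.path.getsize(path) != file_entry["size"]:
        return False
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Hash in chunks so large fastqs are never loaded whole.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == file_entry["sha256"]

# Demo with a stand-in file; the entry mimics the 'size' and 'sha256'
# fields of one element of bundle['bundle']['files'].
payload = b"example payload"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
demo_entry = {"size": len(payload),
              "sha256": hashlib.sha256(payload).hexdigest()}

matches = verify_download(tmp.name, demo_entry)
wrong_size = verify_download(tmp.name, dict(demo_entry, size=len(payload) + 1))
wrong_hash = verify_download(tmp.name, dict(demo_entry, sha256="0" * 64))
os.remove(tmp.name)
```

In the notebook, `file_entry` would be the dict from `bundle['bundle']['files']` whose `name` is `SRR3934437_2.fastq.gz`.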

So, we've looked at a bundle and its contents, the metadata stored in the bundle, and we've downloaded some of its data.

### Replicas

In most of the `DSSClient` invocations above, there was a `replica="aws"` parameter. The DSS replicates data across multiple clouds so users are more likely to be able to compute next to the HCA's data. Currently, `aws` and `gcp` are permitted values for `replica`, but the DSS is designed to allow expanding that list.

In [7]:
bundle = dss_client.get_bundle(uuid=bundle_uuid, replica="gcp")
bundle

{'bundle': {'creator_uid': 8008,
  'files': [{'content-type': 'application/json; dcp-type="metadata/project"',
    'crc32c': '4ce66be3',
    'indexed': True,
    'name': 'project.json',
    's3_etag': '2be0a2479191779505af379eba6d5d10',
    'sha1': '36a9ab5c9cf6d772e52310740ce55ed09eb2dd19',
    'sha256': '76106f0ac8250169e75f1034192cec634ac395c99a912a9be9b8496782f6e6f4',
    'size': 2714,
    'uuid': '370452fa-3091-48cb-8d69-10c26e0db433',
    'version': '2018-03-26T154205.186524Z'},
   {'content-type': 'application/json; dcp-type="metadata/biomaterial"',
    'crc32c': 'b1c5b387',
    'indexed': True,
    'name': 'biomaterial.json',
    's3_etag': '8a3359dcb4c38b24f648d9168a8fea64',
    'sha1': 'd43bbe3baf4e3fabb7f5d4bc0c062429027dd28d',
    'sha256': '7ca86a30655aedc6c00a34b1768d529459740a14c989acfd4552fb660493a6bd',
    'size': 3700,
    'uuid': '1d7fd92c-ae01-4920-91a9-3edf950d4d70',
    'version': '2018-03-26T154205.786705Z'},
   {'content-type': 'application/gzip; dcp-type=data',
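
The `gcp` listing looks the same as the `aws` one. To check that programmatically rather than by eye, one option is to compare the `(uuid, version, sha256)` triple of every file in the two responses. This is only a sketch: the helper names are ours, and it assumes both replicas are expected to report identical file entries. The sample dictionaries reuse values from the listings above so the cell runs offline.

```python
def file_fingerprints(bundle):
    """Return the set of (uuid, version, sha256) triples in a bundle response."""
    return {(f["uuid"], f["version"], f["sha256"])
            for f in bundle["bundle"]["files"]}

def replicas_agree(bundle_a, bundle_b):
    """True when both bundle responses list exactly the same files."""
    return file_fingerprints(bundle_a) == file_fingerprints(bundle_b)

# Stand-ins for the responses of get_bundle(..., replica="aws") and
# get_bundle(..., replica="gcp"), trimmed to a single file entry.
aws_bundle = {"bundle": {"files": [
    {"uuid": "370452fa-3091-48cb-8d69-10c26e0db433",
     "version": "2018-03-26T154205.186524Z",
     "sha256": "76106f0ac8250169e75f1034192cec634ac395c99a912a9be9b8496782f6e6f4"},
]}}
gcp_bundle = {"bundle": {"files": [dict(f) for f in aws_bundle["bundle"]["files"]]}}

agree = replicas_agree(aws_bundle, gcp_bundle)

# A hypothetical out-of-sync replica, to show the check failing.
stale_bundle = {"bundle": {"files": [
    dict(aws_bundle["bundle"]["files"][0], version="2018-01-01T000000.000000Z"),
]}}
disagree = replicas_agree(aws_bundle, stale_bundle)
```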

### Versions

There's an additional identifying field in the descriptions of bundles and files: `version`. Objects in the DSS are versioned, and superseding versions of a bundle or file can be published later. By omitting a `version` parameter above, we were defaulting to the latest version of each object, but we can specify a version as well:

In [8]:
bundle = dss_client.get_bundle(uuid=bundle_uuid, replica="gcp", version="2018-03-26T154214.008924Z")
bundle

{'bundle': {'creator_uid': 8008,
  'files': [{'content-type': 'application/json; dcp-type="metadata/project"',
    'crc32c': '4ce66be3',
    'indexed': True,
    'name': 'project.json',
    's3_etag': '2be0a2479191779505af379eba6d5d10',
    'sha1': '36a9ab5c9cf6d772e52310740ce55ed09eb2dd19',
    'sha256': '76106f0ac8250169e75f1034192cec634ac395c99a912a9be9b8496782f6e6f4',
    'size': 2714,
    'uuid': '370452fa-3091-48cb-8d69-10c26e0db433',
    'version': '2018-03-26T154205.186524Z'},
   {'content-type': 'application/json; dcp-type="metadata/biomaterial"',
    'crc32c': 'b1c5b387',
    'indexed': True,
    'name': 'biomaterial.json',
    's3_etag': '8a3359dcb4c38b24f648d9168a8fea64',
    'sha1': 'd43bbe3baf4e3fabb7f5d4bc0c062429027dd28d',
    'sha256': '7ca86a30655aedc6c00a34b1768d529459740a14c989acfd4552fb660493a6bd',
    'size': 3700,
    'uuid': '1d7fd92c-ae01-4920-91a9-3edf950d4d70',
    'version': '2018-03-26T154205.786705Z'},
   {'content-type': 'application/gzip; dcp-type=data',
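
The version strings shown above look like timestamps without colons in the time part. A small sketch for parsing and comparing them follows. The format string is inferred from the examples in this notebook rather than taken from the DSS specification, and treating the trailing `Z` as UTC is likewise an assumption. `parse_version` is our own helper name.

```python
from datetime import datetime, timezone

# Inferred from strings such as '2018-03-26T154205.186524Z'.
DSS_VERSION_FORMAT = "%Y-%m-%dT%H%M%S.%fZ"

def parse_version(version):
    """Parse a DSS-style version string into a timezone-aware datetime."""
    parsed = datetime.strptime(version, DSS_VERSION_FORMAT)
    return parsed.replace(tzinfo=timezone.utc)  # assumption: 'Z' means UTC

# Versions that appear in this notebook: two file versions and the
# bundle version passed to get_bundle.
versions = [
    "2018-03-26T154205.786705Z",
    "2018-03-26T154214.008924Z",
    "2018-03-26T154205.186524Z",
]

latest = max(versions, key=parse_version)
print(latest)
```

Sorting on the parsed value rather than on the raw string keeps the comparison correct even if the fractional-seconds part were ever written with fewer digits.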