**N.B.**: This notebook doesn't point to the "real" HCA Data Coordination Platform (DCP). It points to a test instance of the Data Storage System (DSS), so you won't find this data in the future, and its organization and metadata structure won't necessarily match what's available in the DCP.

This is especially true for anything regarding expression matrices. The formats, metadata, and APIs for expression matrices are under active development, and feedback and ideas are welcome. Nevertheless, the examples below can be useful for exploring the API and data model conceptually.

## Indexed Metadata

The [Investigate a Bundle](Investigate a Bundle.ipynb) notebook walked through the contents of a bundle (you should probably step through it before this notebook). There we started with a bundle uuid, but what if we wanted to find that bundle uuid based on some facts we knew about the bundle?

Recall the bundle we were looking at:

In [1]:
import hca.dss
dss_client = hca.dss.DSSClient()
bundle_uuid = 'ec7e6476-abeb-4d78-82d8-8a28b43c46a6'
bundle = dss_client.get_bundle(uuid=bundle_uuid, replica="aws")
bundle

{'bundle': {'creator_uid': 8008,
  'files': [{'content-type': 'application/json; dcp-type="metadata/project"',
    'crc32c': '4ce66be3',
    'indexed': True,
    'name': 'project.json',
    's3_etag': '2be0a2479191779505af379eba6d5d10',
    'sha1': '36a9ab5c9cf6d772e52310740ce55ed09eb2dd19',
    'sha256': '76106f0ac8250169e75f1034192cec634ac395c99a912a9be9b8496782f6e6f4',
    'size': 2714,
    'uuid': '370452fa-3091-48cb-8d69-10c26e0db433',
    'version': '2018-03-26T154205.186524Z'},
   {'content-type': 'application/json; dcp-type="metadata/biomaterial"',
    'crc32c': 'b1c5b387',
    'indexed': True,
    'name': 'biomaterial.json',
    's3_etag': '8a3359dcb4c38b24f648d9168a8fea64',
    'sha1': 'd43bbe3baf4e3fabb7f5d4bc0c062429027dd28d',
    'sha256': '7ca86a30655aedc6c00a34b1768d529459740a14c989acfd4552fb660493a6bd',
    'size': 3700,
    'uuid': '1d7fd92c-ae01-4920-91a9-3edf950d4d70',
    'version': '2018-03-26T154205.786705Z'},
   {'content-type': 'application/gzip; dcp-type=data',
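
The response is a plain dictionary, so its file list can be summarized with ordinary Python. The following is a minimal sketch, assuming a dictionary shaped like the (truncated) output above. The sample data copies only a few fields from the first two file entries, and `indexed_file_names` is a hypothetical helper, not part of the `hca` package:

```python
# Sample data: a subset of the fields from the first two file entries shown above.
bundle_sample = {
    "bundle": {
        "files": [
            {
                "name": "project.json",
                "indexed": True,
                "uuid": "370452fa-3091-48cb-8d69-10c26e0db433",
            },
            {
                "name": "biomaterial.json",
                "indexed": True,
                "uuid": "1d7fd92c-ae01-4920-91a9-3edf950d4d70",
            },
        ]
    }
}


def indexed_file_names(bundle_response):
    """Return the names of the files whose 'indexed' flag is true."""
    files = bundle_response["bundle"]["files"]
    return [f["name"] for f in files if f.get("indexed")]


print(indexed_file_names(bundle_sample))
```

The same function should work on the `bundle` object returned by `get_bundle`, provided it has the structure shown in the output.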

And recall that the bundle contained some JSON files, one of which was `biomaterial.json`. In the bundle listing, that file has an `'indexed': True` field, which means that its contents can be searched. So let's look at the contents of `biomaterial.json` again:

In [3]:
biomaterial_json_uuid = [f['uuid'] for f in bundle['bundle']['files'] if f['name'] == 'biomaterial.json'][0]
dss_client.get_file(uuid=biomaterial_json_uuid, replica="aws")

{'biomaterials': [{'content': {'biomaterial_core': {'has_input_biomaterial': 'BT_S2_T',
     'biomaterial_name': 'Single cell from Tumor,1001000174.A2',
     'supplementary_files': ['ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM2243nnn/GSM2243527/suppl/GSM2243527_1001000174.A2.csv.gz'],
     'ncbi_taxon_id': [9606],
     'biomaterial_id': 'GSM2243527',
     'biosd_biomaterial': 'SAMN05421306'},
    'genus_species': [{'text': 'Homo sapiens', 'ontology': 'NCBITaxon:9606'}],
    'describedBy': 'https://schema.humancellatlas.org/type/biomaterial/5.1.0/cell_suspension',
    'total_estimated_cells': 1,
    'schema_type': 'biomaterial'},
   'hca_ingest': {'accession': '',
    'submissionDate': '2018-03-22T10:14:35.739Z',
    'updateDate': '2018-03-22T10:19:26.105Z',
    'document_id': '64ad6b33-66d0-46f0-a34d-283ca8ae5978'}},
  {'content': {'organ': {'text': 'brain', 'ontology': 'UBERON:0000955'},
    'genus_species': [{'text': 'Homo sapiens', 'ontology': 'NCBITaxon:9606'}],
    'schema_type': '

There are a number of things in there that we could search for, but let's say we want to find bundles that contain a biomaterial whose `biomaterial_id` is `BT_S2_T`, the value that appears above as the `has_input_biomaterial` of the cell suspension:

In [4]:
es_query = {
    "query": {
        "term": {
            "files.biomaterial_json.biomaterials.content.biomaterial_core.biomaterial_id": "BT_S2_T"
        }
    }
}
result = dss_client.post_search(replica="aws", es_query=es_query)
print(result["total_hits"])
result

5613


{'es_query': {'query': {'term': {'files.biomaterial_json.biomaterials.content.biomaterial_core.biomaterial_id': 'BT_S2_T'}}},
 'results': [{'bundle_fqid': '01cd1cb1-dabf-4d96-9534-9a5c3182ff37.2018-03-26T154145.331261Z',
   'bundle_url': 'https://dss.integration.data.humancellatlas.org/v1/bundles/01cd1cb1-dabf-4d96-9534-9a5c3182ff37?version=2018-03-26T154145.331261Z&replica=aws',
   'search_score': 0.74350744},
  {'bundle_fqid': 'b9167b17-9577-4478-bb99-f1a81486a76f.2018-03-26T154145.350947Z',
   'bundle_url': 'https://dss.integration.data.humancellatlas.org/v1/bundles/b9167b17-9577-4478-bb99-f1a81486a76f?version=2018-03-26T154145.350947Z&replica=aws',
   'search_score': 0.74350744},
  {'bundle_fqid': '77adc081-699e-4f07-a124-c48c99ccd4d8.2018-03-26T154145.090092Z',
   'bundle_url': 'https://dss.integration.data.humancellatlas.org/v1/bundles/77adc081-699e-4f07-a124-c48c99ccd4d8?version=2018-03-26T154145.090092Z&replica=aws',
   'search_score': 0.74350744},
  {'bundle_fqid': '6897626d-a

That worked a little too well: we found 5613 bundles that contain that `biomaterial_id`. Let's check whether the uuid we're looking for is among them:

In [5]:
bundle_uuid in (r["bundle_fqid"][:36] for r in dss_client.post_search.iterate(replica="aws", es_query=es_query))

True

Promising! Also, note a couple of changes in that function call. First, we used `post_search.iterate`, which returns a generator over all the query results. If we had just used `post_search`, we would only have gotten the first page of 100 query results.

Second, each result carries a `bundle_fqid`, a fully qualified bundle id. It includes both the bundle uuid and the version, so we use `[:36]` to keep the uuid and strip off the version.
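
The `[:36]` slice relies on a uuid always being 36 characters long. A slightly more explicit alternative, sketched here under the assumption that an fqid always has the form `<uuid>.<version>` as in the results above, splits at the first dot. `split_fqid` is a hypothetical helper, not part of the `hca` package:

```python
def split_fqid(bundle_fqid):
    """Split '<uuid>.<version>' at the first dot into (uuid, version).

    A uuid contains hyphens but no dots, so the first dot marks the
    boundary even though the version string contains a dot of its own.
    """
    uuid, _, version = bundle_fqid.partition(".")
    return uuid, version


# An fqid copied from the search results above.
fqid = "01cd1cb1-dabf-4d96-9534-9a5c3182ff37.2018-03-26T154145.331261Z"
uuid, version = split_fqid(fqid)
print(uuid)
print(version)
```

With this helper, the membership test could be written as `bundle_uuid in (split_fqid(r["bundle_fqid"])[0] for r in ...)`, which yields the same uuids as the slice.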

Now, we can make our query more specific by adding some other constraints:

In [6]:
es_query = {
    "query": {
        "bool": {
            "must": [
                {
                    "term": {
                        "files.biomaterial_json.biomaterials.content.biomaterial_core.biomaterial_id": "BT_S2_T"
                    }
                },
                {
                    "term": {
                        "files.file_json.files.content.file_core.file_name": "SRR3934437_2.fastq.gz"
                    }
                }
            ]
        }
    }
}
result = dss_client.post_search(replica="aws", es_query=es_query)
print(result["total_hits"])
result

8


{'es_query': {'query': {'bool': {'must': [{'term': {'files.biomaterial_json.biomaterials.content.biomaterial_core.biomaterial_id': 'BT_S2_T'}},
     {'term': {'files.file_json.files.content.file_core.file_name': 'SRR3934437_2.fastq.gz'}}]}}},
 'results': [{'bundle_fqid': '1ec1f360-4132-426f-bb02-1337be74340d.2018-03-26T205712.928298Z',
   'bundle_url': 'https://dss.integration.data.humancellatlas.org/v1/bundles/1ec1f360-4132-426f-bb02-1337be74340d?version=2018-03-26T205712.928298Z&replica=aws',
   'search_score': 7.523884},
  {'bundle_fqid': '7a533bfc-c99a-4f50-8d28-0257c59b53d1.2018-04-09T144329.633297Z',
   'bundle_url': 'https://dss.integration.data.humancellatlas.org/v1/bundles/7a533bfc-c99a-4f50-8d28-0257c59b53d1?version=2018-04-09T144329.633297Z&replica=aws',
   'search_score': 7.523884},
  {'bundle_fqid': 'f3959034-e93b-4b28-8f40-82d8bcccf10b.2018-03-26T212152.807357Z',
   'bundle_url': 'https://dss.integration.data.humancellatlas.org/v1/bundles/f3959034-e93b-4b28-8f40-82d8bcccf

Well okay, eight bundles. This is a test environment, so some bundles have been uploaded multiple times. But our desired bundle uuid, `ec7e6476-abeb-4d78-82d8-8a28b43c46a6`, is there.
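
Each additional constraint is just one more `term` clause inside `must`, so the nested dictionary can be generated instead of typed out. The following is a minimal sketch of that idea; `all_terms_query` is a hypothetical helper, not part of the `hca` package, and it only covers the case used above, where every constraint is an exact `term` match:

```python
def all_terms_query(constraints):
    """Build a bool/must query with one term clause per (field, value) pair.

    Every clause under "must" has to match, so the constraints are
    combined with a logical AND.
    """
    must = [{"term": {field: value}} for field, value in constraints.items()]
    return {"query": {"bool": {"must": must}}}


# Rebuild the query used above from its two field/value pairs.
es_query = all_terms_query({
    "files.biomaterial_json.biomaterials.content.biomaterial_core.biomaterial_id": "BT_S2_T",
    "files.file_json.files.content.file_core.file_name": "SRR3934437_2.fastq.gz",
})
print(es_query)
```

The resulting dictionary has the same structure as the hand-written query, so it can be passed to `post_search` or `post_search.iterate` through the `es_query` argument in the same way.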