# Download Any BAM File 🌕

Any BAM file will do. I just want to see what aligned data looks like in the HCA.

First I'll set up the DSS client.

In [1]:
import hca, hca.dss, json
client = hca.dss.DSSClient()

****
#### Now I want to find a bundle that has a BAM file in it.

The `client` has a method `get_file` that sounds very promising, but I need to get a UUID. I don't know what RFC4122 is, but hopefully I won't have to.
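
For what it's worth, RFC 4122 is the specification behind the familiar 8-4-4-4-12 hexadecimal UUID format, and Python's standard `uuid` module can parse it. Here is a minimal sketch of a sanity check; the helper name `is_rfc4122_uuid` is made up for illustration and is not part of the `hca` client.

```python
import uuid


def is_rfc4122_uuid(text):
    """Return True if `text` parses as a UUID that uses the RFC 4122 variant."""
    try:
        parsed = uuid.UUID(text)
    except ValueError:
        # Raised when the string is not 32 hex digits (hyphens ignored).
        return False
    return parsed.variant == uuid.RFC_4122


# The example ID that appears in the `get_file` help output.
example_ok = is_rfc4122_uuid("7a8fbda7-d470-467a-904e-5c73413fab3e")
# An arbitrary string is rejected.
example_bad = is_rfc4122_uuid("not-a-uuid")
```

A check like this only says the string is well formed; it says nothing about whether the data store actually holds a file with that ID.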

In [2]:
help(client.get_file)

Help on method get_file in module hca.util:

get_file(client, uuid: str = None, replica: str = None, version: Union[str, NoneType] = None, token: Union[str, NoneType] = None, directurl: Union[str, NoneType] = None) method of builtins.type instance
    Retrieve a file given a UUID and optionally a version.
    
    
    .. admonition:: Streaming
    
     Use ``DSSClient.get_file.stream(**kwargs)`` to get a ``requests.Response`` object whose body has not been
     read yet. This allows streaming large file bodies:
    
     .. code-block:: python
    
        fid = "7a8fbda7-d470-467a-904e-5c73413fab3e"
        with DSSClient().get_file.stream(uuid=fid, replica="aws") as fh:
            while True:
                chunk = fh.raw.read(1024)
                ...
                if not chunk:
                    break
    
     The keyword arguments for ``DSSClient.get_file.stream()`` are identical to the arguments for
     ``DSSClient.get_file()`` listed here.
    :param uuid:  A RFC4122-c

****
#### Searching for bundles

The `post_search` method accepts a `query`, and there's even an example query in the data-store repo's [readme](https://github.com/HumanCellAtlas/data-store/blob/master/README.md). The function signature doesn't quite match the example, but it should be easy to fix. 

In [3]:
client.post_search(replica="aws", es_query={
    "query": {
        "bool": {
            "must": [{
                "match": {
                    "files.sample_json.donor.species": "Homo sapiens"
                }
            }, {
                "match": {
                    "files.assay_json.single_cell.method": "Fluidigm C1"
                }
            }, {
                "match": {
                    "files.sample_json.ncbi_biosample": "SAMN04303778"
                }
            }]
        }
    }
})

{'es_query': {'query': {'bool': {'must': [{'match': {'files.sample_json.donor.species': 'Homo sapiens'}},
     {'match': {'files.assay_json.single_cell.method': 'Fluidigm C1'}},
     {'match': {'files.sample_json.ncbi_biosample': 'SAMN04303778'}}]}}},
 'results': [],
 'total_hits': 0}
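
The query above has a regular shape: one `match` clause per field, all wrapped in `bool`/`must`. A small sketch of a builder for that shape makes it easier to drop or swap clauses when a search comes back empty. The function name `build_match_query` is hypothetical, and the field names are simply the ones from the example query.

```python
def build_match_query(fields):
    """Build an Elasticsearch bool/must query with one match clause per field."""
    return {
        "query": {
            "bool": {
                "must": [{"match": {name: value}} for name, value in fields.items()]
            }
        }
    }


# Same three clauses as the query in the cell above.
es_query = build_match_query({
    "files.sample_json.donor.species": "Homo sapiens",
    "files.assay_json.single_cell.method": "Fluidigm C1",
    "files.sample_json.ncbi_biosample": "SAMN04303778",
})

# A looser variant that keeps only the species clause.
species_only = build_match_query({"files.sample_json.donor.species": "Homo sapiens"})
```

Either dictionary can be passed as `es_query=` to `client.post_search`. Whether those field paths exist in the current index is a separate question, which the empty result set suggests they may not.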

****
#### Well that didn't work!

But I can see how the results are structured. What if I just give it an empty query?

In [4]:
search_response = client.post_search(replica="aws", es_query={})
search_response["total_hits"]

592713

****
Okay great, many results. What does each result look like?

In [5]:
search_response["results"][0]

{'bundle_fqid': 'ffffab51-b074-464b-8dfc-771b9cff7662.2019-04-05T130104.213000Z',
 'bundle_url': 'https://dss.data.humancellatlas.org/v1/bundles/ffffab51-b074-464b-8dfc-771b9cff7662?version=2019-04-05T130104.213000Z&replica=aws',
 'search_score': None}

____
Now I have an ID that I can work with, though it's an "fqid", not a "uuid", and it's for a bundle, not a file. I can try passing it to `get_bundle`...

In [6]:
try:
    client.get_bundle(uuid=search_response["results"][1]["bundle_fqid"], replica="aws")
    print("Completed successfully!")
except Exception as e:
    # If this operation fails, let's print the error (without raising the exception)
    print("Oh no! There was an error.")
    print(e)

Oh no! There was an error.
not_found: Cannot find bundle! (HTTP 404). Details:
Traceback (most recent call last):
  File "/var/task/chalicelib/dss/error.py", line 72, in wrapper
    return func(*args, **kwargs)
  File "/var/task/chalicelib/dss/api/bundles/__init__.py", line 54, in get
    raise DSSException(404, "not_found", "Cannot find bundle!")
dss.error.DSSException



****
Hmmm, a `DSSException`; it appears that it couldn't find a bundle with that UUID. I suppose I can strip off the timestamp from the ID. (Note that the UUID is the part of the FQID before the first `.`. The timestamp is everything after.)
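
As a sketch of that split, the helper below separates an FQID into its two halves and refuses strings that have no version part. `BundleFQID` and `split_fqid` are names invented here for illustration.

```python
from typing import NamedTuple


class BundleFQID(NamedTuple):
    """The two halves of a fully qualified bundle ID."""
    uuid: str
    version: str


def split_fqid(fqid):
    """Split '<uuid>.<version>' at the first '.' only.

    The version timestamp contains its own '.', so splitting on every
    dot would break it apart.
    """
    uuid_part, dot, version = fqid.partition(".")
    if not dot or not uuid_part or not version:
        raise ValueError("expected '<uuid>.<version>', got {!r}".format(fqid))
    return BundleFQID(uuid_part, version)


parsed = split_fqid("ffffab51-b074-464b-8dfc-771b9cff7662.2019-04-05T130104.213000Z")
print(parsed.uuid)     # the part to pass as uuid=
print(parsed.version)  # the timestamp half
```

The version half looks like the same value that appears as `version=` in the `bundle_url` of each search result, so keeping it around may be useful for pinning an exact bundle version.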

In [7]:
client.get_bundle(uuid=search_response["results"][1]["bundle_fqid"].split('.', maxsplit=1)[0], replica="aws")

{'bundle': {'creator_uid': 0,
  'files': [{'content-type': 'application/json; dcp-type="metadata/biomaterial"',
    'crc32c': '8722c942',
    'indexed': True,
    'name': 'cell_suspension_0.json',
    's3_etag': 'a3fc5da1c0cac1a2e27fe6b0ea6d0ecd',
    'sha1': '742e010fa41313862475ef534900289126758ae0',
    'sha256': 'e1be7945afe98b4c50da869c41070ab63bc52a0ce8fe880da10e45cd231f0d98',
    'size': 833,
    'uuid': 'd858e8c4-3528-4594-9b94-0f510fad630d',
    'version': '2019-01-30T170526.388000Z'},
   {'content-type': 'application/json; dcp-type="metadata/biomaterial"',
    'crc32c': '59253d96',
    'indexed': True,
    'name': 'specimen_from_organism_0.json',
    's3_etag': '107a71915fc6b795e78d0a03c78a850f',
    'sha1': '509c7b4dc829de2fe4a5e4840063fe2343cf612f',
    'sha256': '5ead3f45433a0d74611aab300efee4cc518d5dd59f89b7b27de82f33e42197a4',
    'size': 860,
    'uuid': 'a4cbb8ad-a6c3-468a-8c98-3d79aae05a02',
    'version': '2019-01-30T152520.346000Z'},
   {'content-type': 'application

****
This is very promising! No BAM files in this bundle, but I see the structure of a "bundle" and how I can work with it. Now, I just need to iterate over bundles until I find a BAM.

In [8]:
for result in search_response["results"]:
    bundle_uuid = result["bundle_fqid"].split('.', maxsplit=1)[0]
    try:
        bundle_dict = client.get_bundle(uuid=bundle_uuid, replica="aws")
    except Exception:
        # Sometimes, bundles that are deleted linger in the index.
        # It's no problem - we can just ignore it and move on.
        continue
    found_file = False
    for file_dict in bundle_dict["bundle"]["files"]:
        if file_dict["name"].endswith(".bam"):
            print("Name: {}, UUID: {}".format(file_dict["name"], file_dict["uuid"]))
            found_file = True
            break
    if found_file:
        break

Name: ceae7e4d-6871-4d47-b2af-f3c9a5b3f5db_qc.bam, UUID: 1907c7b9-55e5-47d5-a54c-85423e942523


No results. What if I try it for FASTQs?

In [9]:
for result in search_response["results"]:
    bundle_uuid = result["bundle_fqid"].split('.', maxsplit=1)[0]
    try:
        bundle_dict = client.get_bundle(uuid=bundle_uuid, replica="aws")
    except Exception:
        continue
    found_file = False
    for file_dict in bundle_dict["bundle"]["files"]:
        if file_dict["name"].endswith(".fastq.gz"):
            print("Name: {}, UUID: {}".format(file_dict["name"], file_dict["uuid"]))
            found_file = True
            break
    if found_file:
        break

Name: SRR6520468_1.fastq.gz, UUID: b86e4309-2995-4c05-9d27-23b5ca214512


****
Perhaps there aren't any BAMs?

No, it turns out the search results are paginated, so the `results` list contains only the first 100 of the 592,713 hits. There's another RFC for the reading list there, but there's also the `post_search.iterate` method, which should work.
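
To get a feel for what an `iterate`-style helper does, here is a conceptual sketch of cursor-based pagination as a generator. It is not the `hca` implementation: `iterate_pages`, `fetch_page`, and the fake pages are all hypothetical stand-ins, with `fetch_page(cursor)` assumed to return a page of results plus the cursor for the next page (or `None` at the end).

```python
import itertools


def iterate_pages(fetch_page):
    """Yield results one at a time, requesting the next page only when needed."""
    cursor = None
    while True:
        results, cursor = fetch_page(cursor)
        yield from results
        if cursor is None:
            return


# Fake three-page service, keyed by the cursor that requests each page.
FAKE_PAGES = {
    None: (["a", "b"], "page2"),
    "page2": (["c", "d"], "page3"),
    "page3": (["e"], None),
}
requests_made = []


def fake_fetch_page(cursor):
    requests_made.append(cursor)  # record every simulated request
    return FAKE_PAGES[cursor]


# Taking three items touches only the first two pages.
first_three = list(itertools.islice(iterate_pages(fake_fetch_page), 3))
```

The laziness is the point: a loop that breaks on the first match never pays for the pages it doesn't read.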

In [10]:
help(client.post_search)

Help on method post_search in module hca.util:

post_search(client, es_query: Mapping = None, output_format: Union[str, NoneType] = 'summary', replica: str = None, per_page: Union[str, NoneType] = 100, search_after: Union[str, NoneType] = None) method of builtins.type instance
    Find bundles by searching their metadata with an Elasticsearch query
    
    
    
    .. admonition:: Pagination
    
     This method supports pagination. Use ``DSSClient.post_search.iterate(**kwargs)`` to create a generator that
     yields all results, making multiple requests over the wire if necessary:
    
     .. code-block:: python
    
       for result in DSSClient.post_search.iterate(**kwargs):
           ...
    
     The keyword arguments for ``DSSClient.post_search.iterate()`` are identical to the arguments for
     ``DSSClient.post_search()`` listed here.
    :param es_query:  Elasticsearch query 
    :type es_query: typing.Mapping
    :param output_format:  Specifies the output format. The d

In [11]:
results = client.post_search.iterate(replica="aws", es_query={})
for result in results:
    bundle_uuid = result["bundle_fqid"].split('.', maxsplit=1)[0]
    try:
        bundle_dict = client.get_bundle(uuid=bundle_uuid, replica="aws")
    except Exception:
        continue
    found_file = False
    for file_dict in bundle_dict["bundle"]["files"]:
        if file_dict["name"].endswith(".bam"):
            print("Name: {}, UUID: {}".format(file_dict["name"], file_dict["uuid"]))
            found_file = True
            break
    if found_file:
        break

Name: ceae7e4d-6871-4d47-b2af-f3c9a5b3f5db_qc.bam, UUID: 1907c7b9-55e5-47d5-a54c-85423e942523
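
The search loops in this notebook differ only in the file suffix and in where the results come from, so they could be folded into two small helpers. This is a sketch, not a tested utility: `iter_bundles` and `first_file_with_suffix` are names invented here, and `FakeClient` is a stand-in so the example runs without the network.

```python
def iter_bundles(client, results, replica="aws"):
    """Yield bundle dicts for search results, skipping bundles that fail to load."""
    for result in results:
        bundle_uuid = result["bundle_fqid"].split(".", maxsplit=1)[0]
        try:
            bundle_dict = client.get_bundle(uuid=bundle_uuid, replica=replica)
        except Exception:
            # Deleted bundles can linger in the index; skip them.
            continue
        yield bundle_dict


def first_file_with_suffix(bundles, suffix):
    """Return the first file dict whose name ends with `suffix`, or None."""
    for bundle_dict in bundles:
        for file_dict in bundle_dict["bundle"]["files"]:
            if file_dict["name"].endswith(suffix):
                return file_dict
    return None


class FakeClient:
    """Stand-in for the DSS client: one missing bundle, one with a FASTQ."""

    def get_bundle(self, uuid, replica):
        if uuid == "missing":
            raise RuntimeError("not_found")
        return {"bundle": {"files": [
            {"name": "cell_suspension_0.json", "uuid": "file-1"},
            {"name": "SRR6520468_1.fastq.gz", "uuid": "file-2"},
        ]}}


fake_results = [{"bundle_fqid": "missing.v1"}, {"bundle_fqid": "present.v1"}]
found = first_file_with_suffix(iter_bundles(FakeClient(), fake_results), ".fastq.gz")
not_found = first_file_with_suffix(iter_bundles(FakeClient(), fake_results), ".bam")
```

With the real client, `results` could be either `search_response["results"]` or the generator from `client.post_search.iterate(...)`, and returning the file dict avoids relying on a loop variable that outlives its loop.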


In [12]:
bam_file = client.get_file(uuid=file_dict['uuid'], replica="aws")
with open("Aligned.sortedByCoord.out.bam", "wb") as output_bam:
    output_bam.write(bam_file)
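
Reading the whole BAM into memory works here, but the `get_file` help output also describes `get_file.stream`, which exposes the unread response body. The sketch below shows the chunked-copy half of that idea on any file-like object; `save_stream` is a hypothetical helper, and the demo uses an in-memory buffer in place of a real response.

```python
import io
import os
import tempfile


def save_stream(raw, path, chunk_size=1024 * 1024):
    """Copy a file-like object to `path` in chunks; return the bytes written."""
    written = 0
    with open(path, "wb") as out:
        while True:
            chunk = raw.read(chunk_size)
            if not chunk:
                break
            out.write(chunk)
            written += len(chunk)
    return written


# Demo with an in-memory stand-in for the response body.
payload = b"x" * 2500
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.bin")
    bytes_written = save_stream(io.BytesIO(payload), demo_path, chunk_size=1024)
    with open(demo_path, "rb") as fh:
        round_trip = fh.read()

# With the real client, following the pattern in the help text:
# with client.get_file.stream(uuid=file_dict["uuid"], replica="aws") as fh:
#     save_stream(fh.raw, "Aligned.sortedByCoord.out.bam")
```

For a small QC BAM the difference is negligible, but for multi-gigabyte alignments this keeps memory use bounded by the chunk size.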

In [13]:
import pysam
bam = pysam.AlignmentFile("Aligned.sortedByCoord.out.bam", "rb")
print(bam.header)

@HD	VN:1.0	SO:coordinate
@SQ	SN:chr1	LN:248956422
@SQ	SN:chr2	LN:242193529
@SQ	SN:chr3	LN:198295559
@SQ	SN:chr4	LN:190214555
@SQ	SN:chr5	LN:181538259
@SQ	SN:chr6	LN:170805979
@SQ	SN:chr7	LN:159345973
@SQ	SN:chr8	LN:145138636
@SQ	SN:chr9	LN:138394717
@SQ	SN:chr10	LN:133797422
@SQ	SN:chr11	LN:135086622
@SQ	SN:chr12	LN:133275309
@SQ	SN:chr13	LN:114364328
@SQ	SN:chr14	LN:107043718
@SQ	SN:chr15	LN:101991189
@SQ	SN:chr16	LN:90338345
@SQ	SN:chr17	LN:83257441
@SQ	SN:chr18	LN:80373285
@SQ	SN:chr19	LN:58617616
@SQ	SN:chr20	LN:64444167
@SQ	SN:chr21	LN:46709983
@SQ	SN:chr22	LN:50818468
@SQ	SN:chrX	LN:156040895
@SQ	SN:chrY	LN:57227415
@SQ	SN:chrM	LN:16569
@SQ	SN:GL000008.2	LN:209709
@SQ	SN:GL000009.2	LN:201709
@SQ	SN:GL000194.1	LN:191469
@SQ	SN:GL000195.1	LN:182896
@SQ	SN:GL000205.2	LN:185591
@SQ	SN:GL000208.1	LN:92689
@SQ	SN:GL000213.1	LN:164239
@SQ	SN:GL000214.1	LN:137718
@SQ	SN:GL000216.2	LN:176608
@SQ	SN:GL000218.1	LN:161147
@SQ	SN:GL000219.1	LN:179198
@SQ	SN:GL000220.1	LN:161802
@SQ	SN:GL00022