# Download Any BAM File

Any BAM file will do. I just want to see what aligned data looks like in the HCA.

First I'll set up the DSS client.

In [32]:
import hca, hca.dss, json
client = hca.dss.DSSClient()

****
#### Now I want to find a bundle that has a BAM file in it.

The `client` has a method `get_file` that sounds very promising, but I need to get a UUID. I don't know what RFC4122 is, but hopefully I won't have to.

In [33]:
help(client.get_file)

Help on method get_file in module hca.util:

get_file(client, uuid:str=None, replica:str=None, version:Union[str, NoneType]=None, token:Union[str, NoneType]=None, directurl:Union[str, NoneType]=None, content_disposition:Union[str, NoneType]=None) method of builtins.type instance
    Retrieve a file given a UUID and optionally a version.
    
    
    .. admonition:: Streaming
    
     Use ``DSSClient.get_file.stream(**kwargs)`` to get a ``requests.Response`` object whose body has not been
     read yet. This allows streaming large file bodies:
    
     .. code-block:: python
    
        fid = "7a8fbda7-d470-467a-904e-5c73413fab3e"
        with DSSClient().get_file.stream(uuid=fid, replica="aws") as fh:
            while True:
                chunk = fh.raw.read(1024)
                ...
                if not chunk:
                    break
    
     The keyword arguments for ``DSSClient.get_file.stream()`` are identical to the arguments for
     ``DSSClient.get_file()`` listed her

****
#### Searching for bundles

The `post_search` method accepts a `query`, and there's even an example query in the data-store repo's [readme](https://github.com/HumanCellAtlas/data-store/blob/master/README.md). The function signature doesn't quite match the example, but it should be easy to fix. 

In [34]:
client.post_search(replica="aws", es_query={
    "query": {
        "bool": {
            "must": [{
                "match": {
                    "files.sample_json.donor.species": "Homo sapiens"
                }
            }, {
                "match": {
                    "files.assay_json.single_cell.method": "Fluidigm C1"
                }
            }, {
                "match": {
                    "files.sample_json.ncbi_biosample": "SAMN04303778"
                }
            }]
        }
    }
})

{'es_query': {'query': {'bool': {'must': [{'match': {'files.sample_json.donor.species': 'Homo sapiens'}},
     {'match': {'files.assay_json.single_cell.method': 'Fluidigm C1'}},
     {'match': {'files.sample_json.ncbi_biosample': 'SAMN04303778'}}]}}},
 'results': [],
 'total_hits': 0}
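
Writing the nested `bool`/`must`/`match` structure by hand is easy to get wrong, so here is a minimal sketch of a helper that builds it from a plain dictionary. The name `build_match_query` is my own and is not part of the `hca` package; it only produces the same shape of dictionary used in the cell above.

```python
def build_match_query(field_values):
    """Build an Elasticsearch bool/must query with one match clause per field.

    Hypothetical helper for illustration; not part of the hca package.
    """
    must = [{"match": {field: value}} for field, value in field_values.items()]
    return {"query": {"bool": {"must": must}}}


query = build_match_query({
    "files.sample_json.donor.species": "Homo sapiens",
    "files.assay_json.single_cell.method": "Fluidigm C1",
})
```

The result could then be passed as `es_query=query` to `client.post_search`. Dropping one field at a time from the dictionary is a simple way to see which clause is excluding every bundle.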

****
#### Well that didn't work!

But I can see how the results are structured. What if I just give it an empty query?

In [35]:
search_response = client.post_search(replica="aws", es_query={})
search_response["total_hits"]

2307

****
Okay great, many results. What does each result look like?

In [36]:
search_response["results"][0]

{'bundle_fqid': 'fff14b79-c37e-43a4-9abc-c0777314f380.2019-05-14T105022.468000Z',
 'bundle_url': 'https://dss.integration.data.humancellatlas.org/v1/bundles/fff14b79-c37e-43a4-9abc-c0777314f380?version=2019-05-14T105022.468000Z&replica=aws',
 'search_score': None}

****
Now I have an ID that I can work with, `bundle_fqid`! It's an FQID and not a UUID, and for a bundle, not a file. What happens if I provide the FQID to `get_bundle` as the UUID?

In [37]:
try:
    client.get_bundle(uuid=search_response["results"][0]["bundle_fqid"], replica="aws")
    print("Completed successfully!")
except Exception as e:
    # If this operation fails, let's print the error (without raising the exception)
    print("Oh no! There was an error.")
    print(e)

Oh no! There was an error.
not_found: Cannot find bundle! (HTTP 404). Details:
Traceback (most recent call last):
  File "/var/task/chalicelib/dss/error.py", line 72, in wrapper
    return func(*args, **kwargs)
  File "/var/task/chalicelib/dss/api/bundles/__init__.py", line 55, in get
    raise DSSException(404, "not_found", "Cannot find bundle!")
dss.error.DSSException



****
Hmmm, a `DSSException`. It couldn't find a bundle with that UUID, which makes sense because FQIDs aren't UUIDs: the UUID is the part of the FQID before the first `.`, and the version (a timestamp) is everything after it. So, if I extract the UUID from the FQID, everything should work:

In [38]:
bundle_uuid, bundle_version = search_response['results'][0]['bundle_fqid'].split('.', maxsplit=1)
client.get_bundle(uuid=bundle_uuid, replica="aws")

{'bundle': {'creator_uid': 0,
  'files': [{'content-type': 'application/json; dcp-type="metadata/biomaterial"',
    'crc32c': 'fdff6dea',
    'indexed': True,
    'name': 'cell_suspension_0.json',
    's3_etag': '608e9aba7ccfa6f9ab1a41fe55d3034c',
    'sha1': 'dd0be5655f99000e257e295011cf0a56824c32b2',
    'sha256': '1842c7285092fa5e7f56ad48fa208edd29120b4798875ae8a6fa1142f817d415',
    'size': 997,
    'uuid': '5851c22f-3ad8-40db-bdf2-f5dc0e26dbd3',
    'version': '2019-05-14T093128.827000Z'},
   {'content-type': 'application/json; dcp-type="metadata/biomaterial"',
    'crc32c': 'a0c4269e',
    'indexed': True,
    'name': 'specimen_from_organism_0.json',
    's3_etag': '73163822cf12c0426e9730ef0c0cddd9',
    'sha1': '7bcb10987d77d74a7080bb8c46b8193f00505985',
    'sha256': 'bb712ce013a58eccdf0bdf55c15d3ea13388bcf3cec5c24220e381eb65846283',
    'size': 1169,
    'uuid': '1a2e8c90-0d8b-465c-a4a6-bed616dbfad2',
    'version': '2019-05-14T092648.799000Z'},
   {'content-type': 'applicatio
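
Since passing an FQID where a UUID is expected only fails after a round trip to the server, it may be worth checking the string locally first. The sketch below splits an FQID at the first `.` and validates the UUID part with the standard library's `uuid` module. The helper name `split_fqid` is hypothetical, and the sample FQID is the one from the search result above.

```python
import uuid


def split_fqid(fqid):
    """Split '<uuid>.<version>' at the first '.' and validate the UUID part.

    Hypothetical helper for illustration; not part of the hca package.
    """
    bundle_uuid, sep, version = fqid.partition(".")
    if not sep or not version:
        raise ValueError(f"not an FQID (missing version): {fqid!r}")
    uuid.UUID(bundle_uuid)  # raises ValueError if this is not a well-formed UUID
    return bundle_uuid, version


fqid = "fff14b79-c37e-43a4-9abc-c0777314f380.2019-05-14T105022.468000Z"
bundle_uuid, bundle_version = split_fqid(fqid)
```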

****
Nice! Now I can see the general structure of a bundle.

Using this information, how can I find a bundle with a BAM file? I can write a function `find_bam()` that iterates over bundles and yields the file entry for each BAM file it finds:

In [39]:
def find_bam():
    for result in search_response["results"]:
        bundle_uuid, _ = result['bundle_fqid'].split('.', maxsplit=1)
        bundle_dict = client.get_bundle(uuid=bundle_uuid, replica="aws")
        for file_dict in bundle_dict["bundle"]["files"]:
            if file_dict["name"].endswith(".bam"):
                yield file_dict

Looks good! `find_bam()` will loop over each result in the `post_search` query we made earlier and GET each bundle until it finds a bundle containing a file that ends in `.bam`. One thing to remember is that sometimes, `post_search` can return a bundle that's been deleted but is lingering in the index, and that trying to GET that bundle could result in an error. It's no problem, as long as we can catch and ignore those cases...

In [50]:
def find_bam():
    for result in search_response["results"]:
        bundle_uuid = result["bundle_fqid"].split('.', maxsplit=1)[0]
        try:
            bundle_dict = client.get_bundle(uuid=bundle_uuid, replica="aws")
        except Exception:
            # Skip bundles that linger in the index but can no longer be fetched
            continue
        for file_dict in bundle_dict["bundle"]["files"]:
            if file_dict["name"].endswith(".bam"):
                yield file_dict

One last thing: `post_search` is paginated, and each request returns at most one page of results (100 by default, per the `per_page` parameter). Luckily, there's a method that automatically and transparently pages through all results: `post_search.iterate`.

In [51]:
help(client.post_search)

Help on method post_search in module hca.util:

post_search(client, es_query:Mapping=None, output_format:Union[str, NoneType]='summary', replica:str=None, per_page:Union[str, NoneType]=100, search_after:Union[str, NoneType]=None) method of builtins.type instance
    Find bundles by searching their metadata with an Elasticsearch query
    
    
    
    .. admonition:: Pagination
    
     This method supports pagination. Use ``DSSClient.post_search.iterate(**kwargs)`` to create a generator that
     yields all results, making multiple requests over the wire if necessary:
    
     .. code-block:: python
    
       for result in DSSClient.post_search.iterate(**kwargs):
           ...
    
     The keyword arguments for ``DSSClient.post_search.iterate()`` are identical to the arguments for
     ``DSSClient.post_search()`` listed here.
    :param es_query:  Elasticsearch query 
    :type es_query: typing.Mapping
    :param output_format:  Specifies the output format. The default format, 

I can update my function `find_bam` to use it:

In [67]:
def find_bam():
    results = client.post_search.iterate(replica="aws", es_query={})
    for result in results:
        bundle_uuid = result["bundle_fqid"].split('.', maxsplit=1)[0]
        try:
            bundle_dict = client.get_bundle(uuid=bundle_uuid, replica="aws")
        except Exception:
            # Skip bundles that can no longer be fetched
            continue
        for file_dict in bundle_dict["bundle"]["files"]:
            if file_dict["name"].endswith(".bam"):
                yield file_dict
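
To get a feel for what "transparent" pagination means, here is a small sketch of the general pattern: a generator that keeps requesting pages until there are none left. This is not how `hca` implements `post_search.iterate` (I haven't read that code); `fetch_page` is a hypothetical callable and the three-page backend is made up so the example runs without a network connection.

```python
def iterate_pages(fetch_page):
    """Yield every result across pages.

    fetch_page(cursor) is a hypothetical callable returning
    (results, next_cursor); next_cursor is None on the last page.
    """
    cursor = None
    while True:
        results, cursor = fetch_page(cursor)
        yield from results
        if cursor is None:
            break


# Fake three-page backend standing in for repeated search requests.
PAGES = {None: (["a", "b"], 1), 1: (["c", "d"], 2), 2: (["e"], None)}
requests_made = []


def fake_fetch(cursor):
    requests_made.append(cursor)
    return PAGES[cursor]


gen = iterate_pages(fake_fetch)
first = next(gen)                           # only the first page is requested
requests_after_first = list(requests_made)
rest = list(gen)                            # exhausting the generator requests the rest
```

The useful property is laziness: pages are requested only as results are consumed, so stopping at the first BAM file avoids fetching the remaining pages.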

The `find_bam()` function is a generator (since it uses the `yield` keyword), so calling it returns a generator object without running any of its code. To get the next BAM file entry, pass that object to the built-in `next()` function:

In [68]:
bam_generator = find_bam()
bam_file = next(bam_generator)
print(json.dumps(bam_file, indent=4))

{
    "content-type": "application/gzip; dcp-type=data",
    "crc32c": "13d0a885",
    "indexed": false,
    "name": "merged.bam",
    "s3_etag": "2a460ca1a794924868eb9ab10188148d-263",
    "sha1": "e80d114b07a165f638f80904806587a8c2a67b0a",
    "sha256": "610003f3064db6e64aa14b759827e9e4b8c7b8a8fd24b591f85a6db6aae34a9f",
    "size": 17629632969,
    "uuid": "0fb3b147-11c1-4494-bbfb-55f0140ac656",
    "version": "2019-07-28T070418.165550Z"
}
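
One caveat with `next()`: it raises `StopIteration` once the generator has nothing left, which is what would happen here if no bundle contained a BAM file. A minimal sketch of the safer form, using a made-up list of file entries instead of a real search, passes a default value as the second argument:

```python
# Made-up file entries for illustration; only the "name" key matters here.
files = [{"name": "cell_suspension_0.json"}, {"name": "merged.bam"}]

# A generator expression behaves like the generator returned by find_bam().
bam_gen = (f for f in files if f["name"].endswith(".bam"))

first = next(bam_gen, None)    # the first match
second = next(bam_gen, None)   # None: the generator is exhausted, no exception
```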


Check the size to make sure this BAM file is less than 1 GB (since we will download it for this tutorial), and if it is too large, get the next BAM file returned by `find_bam()`:

In [69]:
B2GB = 1/1024/1024/1024
size_in_gb = bam_file['size']*B2GB
print("BAM file %s is %0.2f GB"%(bam_file['name'], bam_file['size']*B2GB))

while size_in_gb > 1.0:
    print(" ------ This BAM file is too large! continuing to look for another BAM file ------")
    bam_file = next(bam_generator)
    size_in_gb = bam_file['size']*B2GB
    print("BAM file %s is %0.2f GB"%(bam_file['name'], bam_file['size']*B2GB))

BAM file merged.bam is 16.42 GB
 ------ This BAM file is too large! continuing to look for another BAM file ------
BAM file ecd3c30b-610c-4311-91c6-54c04f63f632_qc.bam is 0.04 GB
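
The `while` loop above works, but it would raise `StopIteration` if the generator ran out before finding a small enough file. As a sketch of an alternative, the size check can be folded into a single `next()` call with a default. The helper name `first_under_limit` and the second sample entry are made up for illustration. Dividing by 1024 three times, as above, strictly gives gibibytes (GiB), so the constant is named accordingly.

```python
BYTES_PER_GIB = 1024 ** 3


def first_under_limit(file_dicts, max_gib=1.0):
    """Return the first file dict whose size is at most max_gib, else None.

    Hypothetical helper for illustration; file_dicts can be any iterable,
    including a generator such as the one returned by find_bam().
    """
    return next(
        (f for f in file_dicts if f["size"] / BYTES_PER_GIB <= max_gib),
        None,
    )


candidates = [
    {"name": "merged.bam", "size": 17629632969},  # size taken from the output above
    {"name": "small.bam", "size": 40000000},      # made-up entry for illustration
]
chosen = first_under_limit(candidates)
```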


Looks good! What if I want to look for another file type, like FASTQ files? I can generalize the code above. Note that this version returns the first match instead of yielding every match:

In [70]:
def find_ext(extension):
    results = client.post_search.iterate(replica="aws", es_query={})
    for result in results:
        bundle_uuid = result["bundle_fqid"].split('.', maxsplit=1)[0]
        try:
            bundle_dict = client.get_bundle(uuid=bundle_uuid, replica="aws")
        except Exception:
            # Skip bundles that can no longer be fetched
            continue
        for file_dict in bundle_dict["bundle"]["files"]:
            if file_dict["name"].endswith(extension):
                return file_dict

In [71]:
fastq_file = find_ext('.fastq.gz')
print(f"Name: {fastq_file['name']}, UUID: {fastq_file['uuid']}")

Name: HP1506401_H16.fastq.gz, UUID: 90dc8de8-9837-492f-bef8-673b262dd417
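
The filtering step doesn't need the network at all, so it can be pulled out into a function that takes a bundle dictionary and is easy to try on hand-written data. The sketch below relies on `str.endswith` accepting a tuple of suffixes, which allows matching several spellings of an extension at once. The function name `files_with_extension` and the sample bundle (including the `reads.FQ.GZ` entry) are made up; only the `bundle` → `files` → `name` layout comes from the `get_bundle` output above.

```python
def files_with_extension(bundle_dict, extensions):
    """Return the file entries of one bundle whose names end with any extension.

    Hypothetical helper for illustration; matching is case-insensitive.
    """
    suffixes = tuple(ext.lower() for ext in extensions)
    return [
        f for f in bundle_dict["bundle"]["files"]
        if f["name"].lower().endswith(suffixes)
    ]


# Hand-written bundle with the same layout as the get_bundle response.
sample_bundle = {"bundle": {"files": [
    {"name": "cell_suspension_0.json"},
    {"name": "HP1506401_H16.fastq.gz"},
    {"name": "reads.FQ.GZ"},
]}}

fastq_entries = files_with_extension(sample_bundle, [".fastq.gz", ".fq.gz"])
```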


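If I know a file's UUID, I can download it with the `get_file` method. Note that I'm reusing the name `bam_file` here, so after this cell it holds the downloaded contents rather than the metadata dictionary: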

In [72]:
bam_file = client.get_file(uuid=bam_file['uuid'], replica="aws")

Waiting 10s before redirect per Retry-After header


In [73]:
print(type(bam_file))

<class 'bytes'>


In [74]:
with open("Aligned.sortedByCoord.out.bam", "wb") as output_bam:
    # Write the file in blocks of up to 1 GB (this small file fits in a single block)
    blocksize = 1000000000
    for i in range(0, len(bam_file), blocksize):
        output_bam.write(bam_file[i:i+blocksize])
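
Because `get_file` handed back the whole file as a `bytes` object, the download above only works for files that fit in memory. For something like the 16 GB `merged.bam`, the streaming form shown in the `get_file` help text looks like the better fit. Here is a sketch of the chunked copy loop it would need, written against generic file-like objects so it runs without a network connection. `copy_in_chunks` is my own name, and the `io.BytesIO` objects stand in for the response body and the output file.

```python
import io


def copy_in_chunks(src, dst, chunk_size=1024 * 1024):
    """Copy a binary file-like object to another in fixed-size chunks.

    Hypothetical helper for illustration; returns the number of bytes written.
    """
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:       # an empty read means the source is exhausted
            break
        dst.write(chunk)
        total += len(chunk)
    return total


source = io.BytesIO(b"x" * 2_500_000)   # stand-in for the raw response body
target = io.BytesIO()                   # stand-in for open("out.bam", "wb")
written = copy_in_chunks(source, target)
```

Going by the help text, the `fh.raw` attribute of the object returned by `client.get_file.stream(uuid=..., replica="aws")` should be usable as `src`, with an open output file as `dst`. I haven't run that combination here, so treat it as an assumption.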

In [75]:
import pysam
bam = pysam.AlignmentFile("Aligned.sortedByCoord.out.bam", "rb")
print(bam.header)

@HD	VN:1.0	SO:coordinate
@SQ	SN:chr1	LN:248956422
@SQ	SN:chr2	LN:242193529
@SQ	SN:chr3	LN:198295559
@SQ	SN:chr4	LN:190214555
@SQ	SN:chr5	LN:181538259
@SQ	SN:chr6	LN:170805979
@SQ	SN:chr7	LN:159345973
@SQ	SN:chr8	LN:145138636
@SQ	SN:chr9	LN:138394717
@SQ	SN:chr10	LN:133797422
@SQ	SN:chr11	LN:135086622
@SQ	SN:chr12	LN:133275309
@SQ	SN:chr13	LN:114364328
@SQ	SN:chr14	LN:107043718
@SQ	SN:chr15	LN:101991189
@SQ	SN:chr16	LN:90338345
@SQ	SN:chr17	LN:83257441
@SQ	SN:chr18	LN:80373285
@SQ	SN:chr19	LN:58617616
@SQ	SN:chr20	LN:64444167
@SQ	SN:chr21	LN:46709983
@SQ	SN:chr22	LN:50818468
@SQ	SN:chrX	LN:156040895
@SQ	SN:chrY	LN:57227415
@SQ	SN:chrM	LN:16569
@SQ	SN:GL000008.2	LN:209709
@SQ	SN:GL000009.2	LN:201709
@SQ	SN:GL000194.1	LN:191469
@SQ	SN:GL000195.1	LN:182896
@SQ	SN:GL000205.2	LN:185591
@SQ	SN:GL000208.1	LN:92689
@SQ	SN:GL000213.1	LN:164239
@SQ	SN:GL000214.1	LN:137718
@SQ	SN:GL000216.2	LN:176608
@SQ	SN:GL000218.1	LN:161147
@SQ	SN:GL000219.1	LN:179198
@SQ	SN:GL000220.1	LN:161802
@SQ	SN:GL00022