# Download Any BAM File

In this notebook we show how to search all bundles in the Human Cell Atlas (HCA) Data Store (DSS) to find any bundle with a BAM file, and download it. This will allow us to see an example of aligned data in the HCA.

We start by creating a DSS client:

In [17]:
import hca, hca.dss, json
client = hca.dss.DSSClient()

## Finding Bundles with BAM Files

The DSS client provides a `get_file()` method, which we will ultimately use to download the file. Its help string shows that it needs the UUID of the file to download:

In [18]:
help(client.get_file)

Help on method get_file in module hca.util:

get_file(client, uuid:str=None, replica:str=None, version:Union[str, NoneType]=None, token:Union[str, NoneType]=None, directurl:Union[str, NoneType]=None, content_disposition:Union[str, NoneType]=None) method of builtins.type instance
    Retrieve a file given a UUID and optionally a version.
    
    
    .. admonition:: Streaming
    
     Use ``DSSClient.get_file.stream(**kwargs)`` to get a ``requests.Response`` object whose body has not been
     read yet. This allows streaming large file bodies:
    
     .. code-block:: python
    
        fid = "7a8fbda7-d470-467a-904e-5c73413fab3e"
        with DSSClient().get_file.stream(uuid=fid, replica="aws") as fh:
            while True:
                chunk = fh.raw.read(1024)
                ...
                if not chunk:
                    break
    
     The keyword arguments for ``DSSClient.get_file.stream()`` are identical to the arguments for
     ``DSSClient.get_file()`` listed her

## Writing the ElasticSearch Query

To find BAM files, we can run an ElasticSearch wildcard query, and search for files listed in each bundle's manifest for patterns matching `*.bam`. See the ["ElasticSearch Queries"](../elasticsearch-queries/elasticsearch-queries.ipynb) notebook for details about ElasticSearch wildcard and regular expression queries.

We will pass the ElasticSearch query to the `post_search()` method. Before writing the wildcard query, we run an empty search to inspect the metadata structure of a bundle, including its file manifest, so that we know which field to query:

In [26]:
#print(help(client.post_search))

In [27]:
bundles = client.post_search(es_query={}, replica='aws', output_format='raw')

first_bundle = bundles['results'][0]

The file manifest, after some exploration, can be found under `metadata.manifest.files`:

In [28]:
print(json.dumps(
    first_bundle['metadata']['manifest']['files'],
    indent=4
))

[
    {
        "content-type": "application/json; dcp-type=\"metadata/biomaterial\"",
        "crc32c": "fdff6dea",
        "indexed": true,
        "name": "cell_suspension_0.json",
        "s3-etag": "608e9aba7ccfa6f9ab1a41fe55d3034c",
        "sha1": "dd0be5655f99000e257e295011cf0a56824c32b2",
        "sha256": "1842c7285092fa5e7f56ad48fa208edd29120b4798875ae8a6fa1142f817d415",
        "size": 997,
        "uuid": "5851c22f-3ad8-40db-bdf2-f5dc0e26dbd3",
        "version": "2019-05-14T093128.827000Z"
    },
    {
        "content-type": "application/json; dcp-type=\"metadata/biomaterial\"",
        "crc32c": "a0c4269e",
        "indexed": true,
        "name": "specimen_from_organism_0.json",
        "s3-etag": "73163822cf12c0426e9730ef0c0cddd9",
        "sha1": "7bcb10987d77d74a7080bb8c46b8193f00505985",
        "sha256": "bb712ce013a58eccdf0bdf55c15d3ea13388bcf3cec5c24220e381eb65846283",
        "size": 1169,
        "uuid": "1a2e8c90-0d8b-465c-a4a6-bed616dbfad2",
        "version

When we write an ElasticSearch query, we can use a wildcard search to match bundles where the field `manifest.files.name` matches the wildcard expression `*.bam`.

In [31]:
bam_query = {
    "query": {
        "wildcard": {
            "manifest.files.name" : {
                "value": "*.bam"
            }
        }
    }
}
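To build intuition for which file names the pattern `*.bam` selects, the sketch below applies the same pattern locally with Python's `fnmatch` module. This only approximates ElasticSearch wildcard matching (`fnmatch` additionally understands `[...]` character sets, and ElasticSearch behavior can depend on how the field is indexed), and the file names are illustrative examples rather than a real manifest.

```python
from fnmatch import fnmatchcase

# Illustrative file names only; not taken from a real bundle manifest
example_names = [
    "merged.bam",
    "merged.bam.bai",
    "sample_qc.bam",
    "reads_R1.fastq.gz",
    "cell_suspension_0.json",
]

def names_matching(pattern, names):
    """Return the names that match a shell-style wildcard pattern (case-sensitive)."""
    return [name for name in names if fnmatchcase(name, pattern)]

bam_names = names_matching("*.bam", example_names)
print(bam_names)
```

Note that the index file `merged.bam.bai` is not selected, because the pattern requires the name to end in `.bam`.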

## Running the ElasticSearch Query

Now we can write a function that calls `post_search.iterate()` to iterate over the ElasticSearch results and extract the manifest entry, including the file UUID, for each BAM file. We need those UUIDs to download BAM files through the DSS API.

We implement this function as a generator using the `yield` keyword, so that results are produced lazily, one at a time, as they are requested.

In [73]:
def find_bam():
    # Iterate over every bundle that matches the wildcard query
    for bundle in client.post_search.iterate(es_query=bam_query, replica='aws', output_format='raw'):
        # A matching bundle also contains non-BAM files, so check each manifest entry
        for file_dict in bundle["metadata"]["manifest"]["files"]:
            if file_dict["name"].endswith(".bam"):
                yield file_dict

Because `find_bam()` is a generator, we can test it by creating the generator once and calling `next()` on it, which should return a dictionary describing a BAM file from the first matching result of our ElasticSearch query:

In [74]:
bam_generator = find_bam()
bam_file = next(bam_generator)
print(json.dumps(bam_file, indent=4))

{
    "content-type": "application/gzip; dcp-type=data",
    "crc32c": "13d0a885",
    "indexed": false,
    "name": "merged.bam",
    "s3-etag": "2a460ca1a794924868eb9ab10188148d-263",
    "sha1": "e80d114b07a165f638f80904806587a8c2a67b0a",
    "sha256": "610003f3064db6e64aa14b759827e9e4b8c7b8a8fd24b591f85a6db6aae34a9f",
    "size": 17629632969,
    "uuid": "0fb3b147-11c1-4494-bbfb-55f0140ac656",
    "version": "2019-07-28T070418.165550Z"
}
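
Because `find_bam()` needs a live connection to the data store, it can help to exercise the same filtering logic on local data. The sketch below is a stand-in, not part of the `hca` package: `iter_files_with_suffix()` and the two mock bundles are hypothetical, and the bundles only mimic the `metadata.manifest.files` layout shown earlier.

```python
def iter_files_with_suffix(bundles, suffix):
    """Lazily yield manifest entries whose file name ends with `suffix`."""
    for bundle in bundles:
        for file_dict in bundle["metadata"]["manifest"]["files"]:
            if file_dict["name"].endswith(suffix):
                yield file_dict

# Hypothetical bundles that mimic the raw search output structure
mock_bundles = [
    {"metadata": {"manifest": {"files": [
        {"name": "cell_suspension_0.json", "size": 997, "uuid": "mock-uuid-1"},
        {"name": "merged.bam", "size": 2048, "uuid": "mock-uuid-2"},
    ]}}},
    {"metadata": {"manifest": {"files": [
        {"name": "sample_qc.bam", "size": 512, "uuid": "mock-uuid-3"},
    ]}}},
]

mock_generator = iter_files_with_suffix(mock_bundles, ".bam")
first_match = next(mock_generator)   # the generator advances only as far as needed
second_match = next(mock_generator)
print(first_match["name"], second_match["name"])
```

Each call to `next()` resumes the generator where it left off, which is why the notebook stores the generator in a variable instead of calling `find_bam()` again.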


Check the size to make sure this BAM file is less than 1 GB (since we will download it for this tutorial), and if it is too large, get the next BAM file returned by `find_bam()`:

In [48]:
B2GB = 1/1024/1024/1024  # conversion factor from bytes to GB (1024-based)
size_in_gb = bam_file['size']*B2GB
print("BAM file %s is %0.2f GB" % (bam_file['name'], size_in_gb))

while size_in_gb > 1.0:
    print(" ------ This BAM file is too large! continuing to look for another BAM file ------")
    bam_file = next(bam_generator)
    size_in_gb = bam_file['size']*B2GB
    print("BAM file %s is %0.2f GB" % (bam_file['name'], size_in_gb))

BAM file merged.bam is 16.42 GB
 ------ This BAM file is too large! continuing to look for another BAM file ------
BAM file ecd3c30b-610c-4311-91c6-54c04f63f632_qc.bam is 0.04 GB
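
The `while` loop above raises `StopIteration` if the generator runs out before a small enough file turns up. One possible alternative, sketched here with hypothetical names (`first_file_under()` and the mock entries are not part of the notebook or the `hca` package), filters by size and returns `None` when nothing qualifies.

```python
def first_file_under(file_dicts, max_bytes):
    """Return the first manifest entry whose size is at most `max_bytes`, or None."""
    return next((f for f in file_dicts if f["size"] <= max_bytes), None)

ONE_GB = 1024 ** 3  # same 1024-based convention as B2GB

# Hypothetical manifest entries, reduced to the fields used here
mock_files = [
    {"name": "merged.bam", "size": 17629632969},
    {"name": "small_qc.bam", "size": 40 * 1024 ** 2},
]

small_bam = first_file_under(iter(mock_files), ONE_GB)
nothing = first_file_under(iter(mock_files), 1024)
print(small_bam["name"], nothing)
```

With the real generator, this could be called as `first_file_under(find_bam(), ONE_GB)`.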


## Downloading the BAM File

We can now use the `bam_file` information to download the file. We call the `client.get_file()` method and pass the UUID of the file:

In [65]:
bam_file = client.get_file(uuid=bam_file['uuid'], replica="aws")

Waiting 10s before redirect per Retry-After header


KeyboardInterrupt: 

In [66]:
#print(type(bam_file)) # <class 'bytes'>

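`get_file()` returns the file contents as `bytes`, so once the download completes we can write them to disk in chunks. Note that the download cell reassigns `bam_file` from the manifest entry to the downloaded bytes. In the run recorded here the download was interrupted, so `bam_file` is still a dictionary, and slicing it raises the `TypeError` shown below: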

In [67]:
with open("Aligned.sortedByCoord.out.bam", "wb") as output_bam:
    # Write the file in chunks
    blocksize = 1000000000
    for i in range(0, len(bam_file), blocksize):
        output_bam.write(bam_file[i:i+blocksize])

TypeError: unhashable type: 'slice'
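
The help text for `get_file()` quoted earlier mentions `DSSClient.get_file.stream()`, which exposes the response body as a file-like object (`fh.raw`) instead of loading it all into memory. The sketch below shows one way to copy such an object to disk in fixed-size chunks while computing a SHA-256 digest, which could presumably be compared with the `sha256` field of the manifest entry. `save_stream()` is a hypothetical helper, and an in-memory `io.BytesIO` stands in for `fh.raw` so that the example runs without network access.

```python
import hashlib
import io
import os
import tempfile

def save_stream(raw, path, chunk_size=1024 * 1024):
    """Copy a binary file-like object to `path` in chunks; return the SHA-256 hex digest."""
    digest = hashlib.sha256()
    with open(path, "wb") as out:
        while True:
            chunk = raw.read(chunk_size)
            if not chunk:  # an empty read means the stream is exhausted
                break
            out.write(chunk)
            digest.update(chunk)
    return digest.hexdigest()

# Stand-in for the streamed response body (assumption: fh.raw supports .read(n))
payload = b"not a real BAM file" * 1000
demo_path = os.path.join(tempfile.mkdtemp(), "demo.bin")
demo_digest = save_stream(io.BytesIO(payload), demo_path, chunk_size=4096)

print(os.path.getsize(demo_path) == len(payload))
print(demo_digest == hashlib.sha256(payload).hexdigest())
```

Assuming the streaming interface behaves as its help text describes, the real call might look like `with client.get_file.stream(uuid=file_uuid, replica="aws") as fh: save_stream(fh.raw, "out.bam")`, where `file_uuid` is the UUID taken from the manifest entry.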

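Now test the file to ensure we can read it using the `pysam` module. In the run recorded here the previous cell failed before writing any data, so the output file is empty and `pysam` cannot find a BAM header: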

In [68]:
import pysam
bam = pysam.AlignmentFile("Aligned.sortedByCoord.out.bam", "rb")
print(bam.header)

ValueError: file has no sequences defined (mode='rb') - is it SAM/BAM format? Consider opening with check_sq=False
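
A quick sanity check before handing a download to `pysam` is to confirm that the file is non-empty and starts with the gzip magic bytes, since BAM files are stored in BGZF, a gzip-compatible block format. The helper below, `starts_with_gzip_magic()`, is a hypothetical name and only a first-pass check: it does not validate the BAM header itself. The demo files are synthetic.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"

def starts_with_gzip_magic(path):
    """Return True if the file begins with the two gzip magic bytes."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC

demo_dir = tempfile.mkdtemp()
empty_path = os.path.join(demo_dir, "empty.bam")
gzip_path = os.path.join(demo_dir, "compressed.bam")

# An empty file, like one left behind by a failed write
open(empty_path, "wb").close()
# A gzip-compressed placeholder (not a valid BAM, but it has the right magic bytes)
with gzip.open(gzip_path, "wb") as handle:
    handle.write(b"placeholder")

empty_ok = starts_with_gzip_magic(empty_path)
gzip_ok = starts_with_gzip_magic(gzip_path)
print(empty_ok, gzip_ok)
```

An empty file, such as one left behind by an interrupted download, fails this check immediately.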

## Generalizing: Find and Download Any File of Any Extension in the Data Store

With a bit more work, we can write a general function that iterates over files with a given extension in the data store, and pass the results to the `client.get_file()` method to download them:

In [69]:
def es_filetype_query(extension):
    """
    Given a file extension (excluding .), return an ElasticSearch
    wildcard query that will return bundles containing files with
    that extension.
    
    Example: es_filetype_query('fastq.gz')
    """
    query = {
        "query": {
            "wildcard": {
                "manifest.files.name": {
                    "value": "*.%s"%(extension)
                }
            }
        }
    }
    return query
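
As a quick check that the query builder produces the structure we expect, the following self-contained sketch repeats the same wildcard query construction and prints the result as JSON. The one addition is stripping a leading `.`, so that `'.bam'` and `'bam'` give the same query; that is an assumption about convenient behavior rather than something the notebook requires, and `build_filetype_query()` is a hypothetical name.

```python
import json

def build_filetype_query(extension):
    """Build a wildcard query for files ending in `.extension` (leading dot optional)."""
    ext = extension.lstrip(".")
    return {
        "query": {
            "wildcard": {
                "manifest.files.name": {"value": f"*.{ext}"}
            }
        }
    }

fastq_query = build_filetype_query("fastq.gz")
print(json.dumps(fastq_query, indent=4))
```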

Next, we can rewrite `find_bam()` to take a file extension as input, assemble the corresponding ElasticSearch query with `es_filetype_query()`, and yield a dictionary with the UUID and other information for each matching file.

In [70]:
def find_all_filetype(extension):
    for bundle in client.post_search.iterate(es_query=es_filetype_query(extension),
                                             replica='aws',
                                             output_format='raw'):
        # Use the file manifest to find the entries with file info
        for file_dict in bundle["metadata"]["manifest"]["files"]:
            # Include the dot so the check mirrors the "*.<extension>" wildcard
            if file_dict["name"].endswith("." + extension):
                yield file_dict

In [71]:
fastq_file = next(find_all_filetype('fastq.gz'))
print(f"File Name: {fastq_file['name']}")
print(f"UUID: {fastq_file['uuid']}")

File Name: HP1506401_H16.fastq.gz
UUID: 90dc8de8-9837-492f-bef8-673b262dd417


This can be combined with the `client.get_file()` method to download the file:

In [79]:
for fastq_file in find_all_filetype('fastq.gz'):
    print("\nUncomment code to download the following file:")
    print(f"- File Name: {fastq_file['name']}")
    print(f"- UUID: {fastq_file['uuid']}")
    print(f"- Size: {fastq_file['size']/1024/1024/1024}GB")
    #client.get_file(uuid=fastq_file['uuid'], replica="aws")
    break


Uncomment code to download the following file:
- File Name: HP1506401_H16.fastq.gz
- UUID: 90dc8de8-9837-492f-bef8-673b262dd417
- Size: 0.02440743986517191GB
