# Download Any BAM File

In this notebook we show how to search all bundles in the Human Cell Atlas Data Store (DSS) to find any bundle with a BAM file, and download it. This will allow us to see an example of aligned data in the HCA.

We start by creating a DSS client:

In [1]:
import hca, hca.dss, json
client = hca.dss.DSSClient()

## Finding Bundles with BAM Files

The DSS client we created has a `get_file()` method, which we will ultimately use to download the file. Its help string shows that it needs a file UUID:

In [2]:
help(client.get_file)

Help on method get_file in module hca.util:

get_file(client, uuid:str=None, replica:str=None, version:Union[str, NoneType]=None, token:Union[str, NoneType]=None, directurl:Union[str, NoneType]=None, content_disposition:Union[str, NoneType]=None) method of builtins.type instance
    Retrieve a file given a UUID and optionally a version.
    
    
    .. admonition:: Streaming
    
     Use ``DSSClient.get_file.stream(**kwargs)`` to get a ``requests.Response`` object whose body has not been
     read yet. This allows streaming large file bodies:
    
     .. code-block:: python
    
        fid = "7a8fbda7-d470-467a-904e-5c73413fab3e"
        with DSSClient().get_file.stream(uuid=fid, replica="aws") as fh:
            while True:
                chunk = fh.raw.read(1024)
                ...
                if not chunk:
                    break
    
     The keyword arguments for ``DSSClient.get_file.stream()`` are identical to the arguments for
     ``DSSClient.get_file()`` listed her

## Writing the ElasticSearch Query

To find BAM files, we can run an ElasticSearch wildcard query, and search for files listed in each bundle's manifest for patterns matching `*.bam`. See the ["ElasticSearch Queries"](../elasticsearch-queries/elasticsearch-queries.ipynb) notebook for details about ElasticSearch wildcard and regular expression queries.

We will use the `post_search()` method and pass the ElasticSearch query as an argument. We want to use a wildcard search to find all bundles whose file manifest contains a `*.bam` file. We can run an empty search to determine the metadata structure, including the file manifest, to see how to assemble the ElasticSearch query:

In [3]:
#print(help(client.post_search))

In [4]:
bundles = client.post_search(es_query={}, replica='aws', output_format='raw')

first_bundle = bundles['results'][0]

The file manifest, after some exploration, can be found under `metadata.manifest.files`:

In [5]:
print(json.dumps(
    first_bundle['metadata']['manifest']['files'],
    indent=4
))

[
    {
        "content-type": "application/json; dcp-type=\"metadata/biomaterial\"",
        "crc32c": "e51ced73",
        "indexed": true,
        "name": "cell_suspension_0.json",
        "s3-etag": "4b126057ce7abdc231255a8bf7784f8a",
        "sha1": "5dc657584a0fb00b1a918e0ecfc4701edf569ca1",
        "sha256": "775d6a9a562a6e818a3de5741c48dfc17d304b942f448da760b3996a139a5876",
        "size": 841,
        "uuid": "ba96ea2d-c7e2-4c47-9561-418a849f93d0",
        "version": "2019-07-09T232055.867000Z"
    },
    {
        "content-type": "application/json; dcp-type=\"metadata/biomaterial\"",
        "crc32c": "6c24cc69",
        "indexed": true,
        "name": "specimen_from_organism_0.json",
        "s3-etag": "de2f1daec3d270806b2d5590eabb3dfc",
        "sha1": "077305bae96361f9cd453b2066ffb10d4fb6977f",
        "sha256": "40b34fd3f409255888f3065b7d1f9735538713eac567e647d74043dc44eb4777",
        "size": 861,
        "uuid": "2436de6c-82fa-4434-8cec-f73cde7b01cb",
        "version"

When we write an ElasticSearch query, we can use a wildcard search to match bundles where the field

```
manifest.files.name
```

matches the wildcard expression `*.bam`.

In [6]:
bam_query = {
    "query": {
        "wildcard": {
            "manifest.files.name" : {
                "value": "*.bam"
            }
        }
    }
}
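As a rough local illustration of what the `*.bam` pattern selects, the sketch below applies the same pattern to a few file names with Python's `fnmatch` module. This only approximates ElasticSearch wildcard behavior (shell-style matching differs in some details, such as `[...]` character sets), and the file names are invented for the example.

```python
import fnmatch

# Hypothetical file names, invented for illustration only
example_names = [
    "sample_qc.bam",
    "sample_qc.bam.bai",
    "reads_1.fastq.gz",
    "cell_suspension_0.json",
]

# fnmatchcase() does case-sensitive, shell-style matching against the whole name;
# "*" matches any run of characters
bam_names = [name for name in example_names if fnmatch.fnmatchcase(name, "*.bam")]
print(bam_names)
```

Note that the index file `sample_qc.bam.bai` is not selected, because the pattern has to match the entire name.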

## Running the ElasticSearch Query

Now we can write a function that calls `post_search.iterate()` to iterate over the ElasticSearch results and extracts the manifest entry, including the file UUID, for each BAM file. We can then pass that UUID to the DSS API to download the file.

We also implement this function as a generator using the `yield` keyword so that it evaluates lazily.

In [7]:
def find_bam():
    for bundle in client.post_search.iterate(es_query = bam_query, replica = 'aws', output_format = 'raw'):
        # Walk the bundle's file manifest and yield the entry for each BAM file
        for file_dict in bundle["metadata"]["manifest"]["files"]:
            if file_dict["name"].endswith(".bam"):
                yield file_dict

Because `find_bam()` is a generator, we can test it by calling `next()` on the generator it returns. This should give a dictionary (parsed from JSON) containing information about a BAM file from the first result of our ElasticSearch query.

In [8]:
bam_generator = find_bam()
bam_file = next(bam_generator)
print(json.dumps(bam_file, indent=4))

{
    "content-type": "application/gzip; dcp-type=data",
    "crc32c": "74405869",
    "indexed": false,
    "name": "ceae7e4d-6871-4d47-b2af-f3c9a5b3f5db_qc.bam",
    "s3-etag": "d86563efae1f97a11215ab8ac08ff57d-3",
    "sha1": "d497da691495b27de9a304b4b0e1853203fe3a6a",
    "sha256": "14b700a3d4e3643cf781fb4641bdaa6dbc252d31d9e5a465acc23f30d61237d7",
    "size": 176911395,
    "uuid": "1907c7b9-55e5-47d5-a54c-85423e942523",
    "version": "2019-05-18T173113.216367Z"
}
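To see the lazy behavior of such a generator without contacting the DSS, the following sketch runs the same nested loop over a small in-memory list. The bundles, file names, sizes, and UUIDs are invented placeholders that only mimic the nesting of the raw search results, and `find_bam_in()` is a hypothetical stand-in for `find_bam()`.

```python
# Stand-in search results with the same nesting as the raw results above.
# All names, sizes, and UUIDs here are invented placeholders.
fake_bundles = [
    {"metadata": {"manifest": {"files": [
        {"name": "cell_suspension_0.json", "size": 841, "uuid": "uuid-a"},
        {"name": "sample_one_qc.bam", "size": 2048, "uuid": "uuid-b"},
    ]}}},
    {"metadata": {"manifest": {"files": [
        {"name": "sample_two_qc.bam", "size": 4096, "uuid": "uuid-c"},
    ]}}},
]

def find_bam_in(bundles):
    """Yield the manifest entry of every file whose name ends with .bam."""
    for bundle in bundles:
        for file_dict in bundle["metadata"]["manifest"]["files"]:
            if file_dict["name"].endswith(".bam"):
                yield file_dict

generator = find_bam_in(fake_bundles)
first = next(generator)   # runs only until the first match is found
second = next(generator)  # resumes where the previous call stopped
print(first["name"], second["name"])
```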


Check the size to make sure this BAM file is less than 1 GB (since we will download it for this tutorial), and if it is too large, get the next BAM file returned by `find_bam()`:

In [9]:
B2GB = 1/1024/1024/1024
size_in_gb = bam_file['size']*B2GB
print("BAM file %s is %0.2f GB"%(bam_file['name'], bam_file['size']*B2GB))

while size_in_gb > 1.0:
    print(" ------ This BAM file is too large! continuing to look for another BAM file ------")
    bam_file = next(bam_generator)
    size_in_gb = bam_file['size']*B2GB
    print("BAM file %s is %0.2f GB"%(bam_file['name'], bam_file['size']*B2GB))

BAM file ceae7e4d-6871-4d47-b2af-f3c9a5b3f5db_qc.bam is 0.16 GB
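One caveat with the `while` loop above is that `next()` raises `StopIteration` if the generator runs out before a small enough file is found. A minimal sketch of an alternative, assuming we are happy to get `None` in that case, passes a default value to `next()`. The helper `first_file_under()` and the manifest entries are hypothetical and exist only for this example.

```python
ONE_GB = 1024 ** 3

def first_file_under(files, max_bytes=ONE_GB):
    """Return the first entry no larger than max_bytes, or None if there is none."""
    return next((f for f in files if f["size"] <= max_bytes), None)

# Invented manifest entries standing in for the output of find_bam()
candidates = [
    {"name": "large.bam", "size": 3 * ONE_GB},
    {"name": "small.bam", "size": 176911395},
]

chosen = first_file_under(candidates)
nothing = first_file_under([{"name": "large.bam", "size": 3 * ONE_GB}])
print(chosen["name"], nothing)
```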


## Downloading the BAM File

We can now use the `bam_file` information to download the file. We call the `client.get_file()` method and pass the UUID of the file:

In [10]:
bam_file = client.get_file(uuid=bam_file['uuid'], replica="aws")

In [11]:
#print(type(bam_file)) # <class 'bytes'>

This is potentially a large file, so we write the file in chunks:

In [12]:
with open("Aligned.sortedByCoord.out.bam", "wb") as output_bam:
    # Write the file in chunks
    blocksize = 1000000000
    for i in range(0, len(bam_file), blocksize):
        output_bam.write(bam_file[i:i+blocksize])
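Because `get_file()` returns the whole file as a `bytes` object, the content is already in memory by the time it is written. The help text above mentions `DSSClient.get_file.stream()` as a way to read the body incrementally instead. The sketch below shows only the chunked-copy loop, with an in-memory `io.BytesIO` standing in for the `fh.raw` object from that help text. `copy_in_chunks()` is a hypothetical helper, not part of the `hca` package.

```python
import io

def copy_in_chunks(source, destination, chunk_size=1024 * 1024):
    """Copy from one file-like object to another, chunk_size bytes at a time."""
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        destination.write(chunk)
        total += len(chunk)
    return total

# In-memory stand-ins; with the real client, source would be a streamed response body
source = io.BytesIO(b"x" * 2_500_000)
destination = io.BytesIO()
bytes_copied = copy_in_chunks(source, destination)
print(bytes_copied)
```

With this pattern only one chunk is held in memory at a time, which matters more as file sizes approach the 1 GB limit used earlier.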

Now test the file to ensure we can read it using the `pysam` module:

In [13]:
import pysam
bam = pysam.AlignmentFile("Aligned.sortedByCoord.out.bam", "rb")
print(bam.header)

@HD	VN:1.0	SO:coordinate
@SQ	SN:chr1	LN:248956422
@SQ	SN:chr2	LN:242193529
@SQ	SN:chr3	LN:198295559
@SQ	SN:chr4	LN:190214555
@SQ	SN:chr5	LN:181538259
@SQ	SN:chr6	LN:170805979
@SQ	SN:chr7	LN:159345973
@SQ	SN:chr8	LN:145138636
@SQ	SN:chr9	LN:138394717
@SQ	SN:chr10	LN:133797422
@SQ	SN:chr11	LN:135086622
@SQ	SN:chr12	LN:133275309
@SQ	SN:chr13	LN:114364328
@SQ	SN:chr14	LN:107043718
@SQ	SN:chr15	LN:101991189
@SQ	SN:chr16	LN:90338345
@SQ	SN:chr17	LN:83257441
@SQ	SN:chr18	LN:80373285
@SQ	SN:chr19	LN:58617616
@SQ	SN:chr20	LN:64444167
@SQ	SN:chr21	LN:46709983
@SQ	SN:chr22	LN:50818468
@SQ	SN:chrX	LN:156040895
@SQ	SN:chrY	LN:57227415
@SQ	SN:chrM	LN:16569
@SQ	SN:GL000008.2	LN:209709
@SQ	SN:GL000009.2	LN:201709
@SQ	SN:GL000194.1	LN:191469
@SQ	SN:GL000195.1	LN:182896
@SQ	SN:GL000205.2	LN:185591
@SQ	SN:GL000208.1	LN:92689
@SQ	SN:GL000213.1	LN:164239
@SQ	SN:GL000214.1	LN:137718
@SQ	SN:GL000216.2	LN:176608
@SQ	SN:GL000218.1	LN:161147
@SQ	SN:GL000219.1	LN:179198
@SQ	SN:GL000220.1	LN:161802
@SQ	SN:GL00022
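If `pysam` is not installed, a lighter sanity check is possible with the standard library alone. BAM files are BGZF-compressed, which is readable as gzip, and the decompressed content starts with the four bytes `BAM\1`. The sketch below tests only for that signature and says nothing about whether the rest of the file is valid. `looks_like_bam()` is a hypothetical helper, and the files it is run on here are tiny stand-ins rather than real alignments.

```python
import gzip
import os
import tempfile

BAM_MAGIC = b"BAM\x01"  # first four bytes of a decompressed BAM file

def looks_like_bam(path):
    """Return True if the gzip-decompressed file starts with the BAM magic bytes."""
    try:
        with gzip.open(path, "rb") as handle:
            return handle.read(4) == BAM_MAGIC
    except (OSError, EOFError):
        # Not gzip-compressed, or too damaged to read
        return False

# Tiny stand-in files, not real alignments: only the leading bytes matter here
with tempfile.TemporaryDirectory() as tmp:
    good_path = os.path.join(tmp, "fake.bam")
    bad_path = os.path.join(tmp, "not_a_bam.txt")
    with open(good_path, "wb") as f:
        f.write(gzip.compress(BAM_MAGIC + b"\x00" * 16))
    with open(bad_path, "wb") as f:
        f.write(b"plain text")
    good_result = looks_like_bam(good_path)
    bad_result = looks_like_bam(bad_path)

print(good_result, bad_result)
```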

## Generalizing: Find and Download Any File of Any Extension in the Data Store

With a bit more work, we can write a general function that iterates over files with a given extension in the data store, and pass the results to the `client.get_file()` method to download them:

In [14]:
def es_filetype_query(extension):
    """
    Given a file extension (excluding .), return an ElasticSearch
    wildcard query that will return bundles containing files with
    that extension.
    
    Example: es_filetype_query('fastq.gz')
    """
    query = {
        "query": {
            "wildcard": {
                "manifest.files.name": {
                    "value": "*.%s"%(extension)
                }
            }
        }
    }
    return query
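Since the docstring asks for the extension without its leading dot, passing `'.bam'` would produce the pattern `*..bam`. One possible way to guard against that is sketched below. `build_wildcard_query()` is a hypothetical variant written for this example, not a function from the notebook or the `hca` package.

```python
import json

def build_wildcard_query(extension):
    """Hypothetical variant of es_filetype_query() that tolerates a leading dot."""
    cleaned = extension.lstrip(".")
    return {
        "query": {
            "wildcard": {
                "manifest.files.name": {"value": f"*.{cleaned}"}
            }
        }
    }

with_dot = build_wildcard_query(".bam")
without_dot = build_wildcard_query("bam")
print(json.dumps(without_dot, indent=4))
```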

Next, we can rewrite `find_bam()` to take a file extension as input, assemble the corresponding ElasticSearch query with `es_filetype_query()`, and yield a dictionary with the UUID and other information for each matching file.

In [15]:
def find_all_filetype(extension):
    for bundle in client.post_search.iterate(es_query = es_filetype_query(extension),
                                             replica = 'aws',
                                             output_format = 'raw'):
        # Use the file manifest to find the entry for each matching file
        for file_dict in bundle["metadata"]["manifest"]["files"]:
            if file_dict["name"].endswith(extension):
                yield file_dict

In [16]:
fastq_file = next(find_all_filetype('fastq.gz'))
print(f"File Name: {fastq_file['name']}")
print(f"UUID: {fastq_file['uuid']}")

File Name: SRR6520067_1.fastq.gz
UUID: de204c6b-97be-44dd-bdb8-89af19a717b9


This can be processed with a function to either print information about the file or download it:

In [17]:
def print_file(fastq):
    print(f" - {fastq['name']} : Size {fastq['size']/1024/1024/1024:.2f} GB")

In [18]:
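def download_file(fastq):
    print(f"Now downloading fastq file {fastq['name']} ({fastq['size']/1024/1024/1024:.2f} GB)...")
    # Keep the manifest entry and the downloaded bytes in separate variables
    contents = client.get_file(uuid=fastq['uuid'], version=fastq['version'], replica='aws')
    # The manifest entry stores the file name under the 'name' key
    with open(fastq['name'], 'wb') as output_file:
        output_file.write(contents)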

Now we can iterate over a few fastq files:

In [19]:
import itertools
for fastq_file in itertools.islice(find_all_filetype('fastq.gz'), 10):
    print_file(fastq_file)
    # download_file(fastq_file)

 - SRR6520067_1.fastq.gz : Size 0.04 GB
 - SRR6520067_2.fastq.gz : Size 0.04 GB
 - SRR6579532_1.fastq.gz : Size 0.04 GB
 - SRR6579532_2.fastq.gz : Size 0.04 GB
 - SRR6611699_1.fastq.gz : Size 0.04 GB
 - SRR6611699_2.fastq.gz : Size 0.04 GB
 - ERR2459896_1.fastq.gz : Size 0.04 GB
 - ERR2459896_2.fastq.gz : Size 0.04 GB
 - E18_20161004_Neurons_Sample_14_S083_L005_I1_002.fastq.gz : Size 0.04 GB
 - E18_20161004_Neurons_Sample_14_S083_L005_R1_002.fastq.gz : Size 0.09 GB


To iterate over every fastq file, replace `itertools.islice(find_all_filetype('fastq.gz'), 10)` with `find_all_filetype('fastq.gz')`.
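The effect of `itertools.islice` can be checked without contacting the DSS. In the sketch below, `numbered_files()` is a hypothetical stand-in for `find_all_filetype()` that yields invented file names indefinitely and records each one it produces, so we can confirm that `islice` stops asking for items once it has enough.

```python
import itertools

produced = []  # records which items the generator was actually asked to produce

def numbered_files():
    """Stand-in for find_all_filetype(): yields invented file names forever."""
    for index in itertools.count():
        name = f"file_{index}.fastq.gz"
        produced.append(name)
        yield name

first_three = list(itertools.islice(numbered_files(), 3))
print(first_three)
print(len(produced))
```

Dropping the `islice` wrapper from a loop over the real generator means every matching file in the data store is visited, which can take a long time.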