# Download Any BAM File 🌕

Any BAM file will do. I just want to see what aligned data looks like in the HCA.

First I'll set up the DSS client.

In [1]:
import hca, hca.dss, json
client = hca.dss.DSSClient()

****
#### Now I want to find a bundle that has a BAM file in it.

The `client` has a method `get_file` that sounds very promising, but I need to get a UUID. I don't know what RFC4122 is, but hopefully I won't have to.

In [2]:
help(client.get_file)

Help on method get_file in module hca.util:

get_file(client, uuid: str = None, replica: str = None, version: Union[str, NoneType] = None, token: Union[str, NoneType] = None, directurl: Union[str, NoneType] = None) method of builtins.type instance
    Retrieve a file given a UUID and optionally a version.
    
    
    .. admonition:: Streaming
    
     Use ``DSSClient.get_file.stream(**kwargs)`` to get a ``requests.Response`` object whose body has not been
     read yet. This allows streaming large file bodies:
    
     .. code-block:: python
    
        fid = "7a8fbda7-d470-467a-904e-5c73413fab3e"
        with DSSClient().get_file.stream(uuid=fid, replica="aws") as fh:
            while True:
                chunk = fh.raw.read(1024)
                ...
                if not chunk:
                    break
    
     The keyword arguments for ``DSSClient.get_file.stream()`` are identical to the arguments for
     ``DSSClient.get_file()`` listed here.
    :param uuid:  A RFC4122-c

****
#### Searching for bundles

The `post_search` method accepts a `query`, and there's even an example query in the data-store repo's [readme](https://github.com/HumanCellAtlas/data-store/blob/master/README.md). The function signature doesn't quite match the example, but it should be easy to fix. 

In [3]:
client.post_search(replica="aws", es_query={
    "query": {
        "bool": {
            "must": [{
                "match": {
                    "files.sample_json.donor.species": "Homo sapiens"
                }
            }, {
                "match": {
                    "files.assay_json.single_cell.method": "Fluidigm C1"
                }
            }, {
                "match": {
                    "files.sample_json.ncbi_biosample": "SAMN04303778"
                }
            }]
        }
    }
})

{'es_query': {'query': {'bool': {'must': [{'match': {'files.sample_json.donor.species': 'Homo sapiens'}},
     {'match': {'files.assay_json.single_cell.method': 'Fluidigm C1'}},
     {'match': {'files.sample_json.ncbi_biosample': 'SAMN04303778'}}]}}},
 'results': [],
 'total_hits': 0}
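
The query above is an ordinary Elasticsearch `bool` query whose `must` list holds one `match` clause per field. As a minimal sketch, a helper can assemble that structure from a plain dict, which makes it easier to drop or change one field at a time when a search comes back empty. The name `build_match_query` is my own and is not part of the `hca` package.

```python
def build_match_query(fields):
    """Build an Elasticsearch bool/must query with one match clause per field.

    `build_match_query` is a hypothetical helper, not part of the hca client.
    """
    must = [{"match": {name: value}} for name, value in fields.items()]
    return {"query": {"bool": {"must": must}}}


es_query = build_match_query({
    "files.sample_json.donor.species": "Homo sapiens",
    "files.assay_json.single_cell.method": "Fluidigm C1",
})
# The result can be passed as client.post_search(replica="aws", es_query=es_query).
```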

****
#### Well that didn't work!

But I can see how the results are structured. What if I just give it an empty query?

In [4]:
search_response = client.post_search(replica="aws", es_query={})
search_response["total_hits"]

695059

****
Okay great, many results. What does each result look like?

In [5]:
search_response["results"][0]

{'bundle_fqid': 'ffffba2d-30da-4593-9008-8b3528ee94f1.2019-08-01T200147.309074Z',
 'bundle_url': 'https://dss.data.humancellatlas.org/v1/bundles/ffffba2d-30da-4593-9008-8b3528ee94f1?version=2019-08-01T200147.309074Z&replica=aws',
 'search_score': None}

****
Now I have an ID that I can work with, `bundle_fqid`! It's an FQID and not a UUID, and for a bundle, not a file. What happens if I provide the FQID to `get_bundle` as the UUID?

In [6]:
try:
    client.get_bundle(uuid=search_response["results"][0]["bundle_fqid"], replica="aws")
    print("Completed successfully!")
except Exception as e:
    # If this operation fails, let's print the error (without raising the exception)
    print("Oh no! There was an error.")
    print(e)

Oh no! There was an error.
not_found: Cannot find bundle! (HTTP 404). Details:
Traceback (most recent call last):
  File "/var/task/chalicelib/dss/error.py", line 72, in wrapper
    return func(*args, **kwargs)
  File "/var/task/chalicelib/dss/api/bundles/__init__.py", line 54, in get
    raise DSSException(404, "not_found", "Cannot find bundle!")
dss.error.DSSException



****
Hmmm, a `DSSException`: the service couldn't find a bundle with that UUID. This makes sense because an FQID isn't a UUID. The UUID is the part of the FQID before the first `.`, and the version (a timestamp) is everything after it. So, if I extract the UUID from the FQID, everything should work:

In [7]:
bundle_uuid, bundle_version = search_response['results'][0]['bundle_fqid'].split('.', maxsplit=1)
client.get_bundle(uuid=bundle_uuid, replica="aws")

{'bundle': {'creator_uid': 8008,
  'files': [{'content-type': 'application/json; dcp-type="metadata/biomaterial"',
    'crc32c': 'e51ced73',
    'indexed': True,
    'name': 'cell_suspension_0.json',
    's3_etag': '4b126057ce7abdc231255a8bf7784f8a',
    'sha1': '5dc657584a0fb00b1a918e0ecfc4701edf569ca1',
    'sha256': '775d6a9a562a6e818a3de5741c48dfc17d304b942f448da760b3996a139a5876',
    'size': 841,
    'uuid': 'ba96ea2d-c7e2-4c47-9561-418a849f93d0',
    'version': '2019-07-09T232055.867000Z'},
   {'content-type': 'application/json; dcp-type="metadata/biomaterial"',
    'crc32c': '6c24cc69',
    'indexed': True,
    'name': 'specimen_from_organism_0.json',
    's3_etag': 'de2f1daec3d270806b2d5590eabb3dfc',
    'sha1': '077305bae96361f9cd453b2066ffb10d4fb6977f',
    'sha256': '40b34fd3f409255888f3065b7d1f9735538713eac567e647d74043dc44eb4777',
    'size': 861,
    'uuid': '2436de6c-82fa-4434-8cec-f73cde7b01cb',
    'version': '2019-07-09T223746.665000Z'},
   {'content-type': 'applicat
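
The split used in the cell above can be wrapped in a small helper. This is only a sketch: `split_fqid` is a name I made up, and it assumes an FQID always has the form `<uuid>.<version>` with no `.` inside the UUID, as in the search result shown earlier.

```python
def split_fqid(fqid):
    """Split a bundle FQID into (uuid, version) at the first '.'."""
    uuid, version = fqid.split(".", maxsplit=1)
    return uuid, version


bundle_uuid, bundle_version = split_fqid(
    "ffffba2d-30da-4593-9008-8b3528ee94f1.2019-08-01T200147.309074Z"
)
print(bundle_uuid)     # ffffba2d-30da-4593-9008-8b3528ee94f1
print(bundle_version)  # 2019-08-01T200147.309074Z
```

Note that `maxsplit=1` matters here: the version string contains a `.` of its own, so an unrestricted split would produce three pieces.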

****
Nice! Now I can see the general structure of a bundle.

Using this information, what if I want to find a bundle with a BAM? I can write a function `find_bam()` that iterates over bundles until it finds a bundle with a BAM...

In [8]:
def find_bam():
    for result in search_response["results"]:
        bundle_uuid, _ = result['bundle_fqid'].split('.', maxsplit=1)
        bundle_dict = client.get_bundle(uuid=bundle_uuid, replica="aws")
        for file_dict in bundle_dict["bundle"]["files"]:
            if file_dict["name"].endswith(".bam"):
                return file_dict

Looks good! `find_bam()` will loop over each result in the `post_search` query we made earlier and GET each bundle until it finds a bundle containing a file that ends in `.bam`. One thing to remember is that sometimes, `post_search` can return a bundle that's been deleted but is lingering in the index, and that trying to GET that bundle could result in an error. It's no problem, as long as we can catch and ignore those cases...

In [9]:
def find_bam():
    for result in search_response["results"]:
        bundle_uuid = result["bundle_fqid"].split('.', maxsplit=1)[0]
        try:
            bundle_dict = client.get_bundle(uuid=bundle_uuid, replica="aws")
        except Exception:
            # Skip bundles that can't be fetched (e.g. deleted but still indexed)
            continue
        for file_dict in bundle_dict["bundle"]["files"]:
            if file_dict["name"].endswith(".bam"):
                return file_dict
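
A variation on this skip-on-error pattern records which bundles were skipped instead of dropping them silently. The sketch below does not use the DSS client at all: `fetch` stands in for a call such as `client.get_bundle`, and both `fetch_all_ok` and `fake_fetch` are hypothetical names used only for illustration.

```python
def fetch_all_ok(uuids, fetch):
    """Call fetch(uuid) for each UUID; return (fetched, skipped).

    Failures are recorded with their error message rather than raised.
    """
    fetched, skipped = [], []
    for uuid in uuids:
        try:
            fetched.append(fetch(uuid))
        except Exception as e:  # deliberately broad, but not a bare except
            skipped.append((uuid, str(e)))
    return fetched, skipped


def fake_fetch(uuid):
    # Stand-in for client.get_bundle: pretend one bundle was deleted.
    if uuid == "deleted":
        raise LookupError("Cannot find bundle!")
    return {"bundle": {"uuid": uuid, "files": []}}


fetched, skipped = fetch_all_ok(["a", "deleted", "b"], fake_fetch)
print(len(fetched), skipped)  # 2 [('deleted', 'Cannot find bundle!')]
```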

One last thing: `post_search` is paginated, so each request returns only one page of results (100 per page by default). Luckily, there's a method I can use that will automatically and transparently paginate through all results, `post_search.iterate`.
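
To make the idea of transparent pagination concrete, here is a minimal sketch of the general pattern. It is not the actual implementation of `post_search.iterate`; the function `fetch_page` and its cursor argument are hypothetical stand-ins for a single paginated request.

```python
def iterate_pages(fetch_page):
    """Yield every result, requesting further pages until none remain.

    fetch_page(cursor) must return (results, next_cursor); next_cursor is
    None on the last page.
    """
    cursor = None
    while True:
        results, cursor = fetch_page(cursor)
        yield from results
        if cursor is None:
            break


# Fake three-page "server" for demonstration.
PAGES = {None: ([1, 2], "p2"), "p2": ([3, 4], "p3"), "p3": ([5], None)}

all_results = list(iterate_pages(lambda cursor: PAGES[cursor]))
print(all_results)  # [1, 2, 3, 4, 5]
```

Because this is a generator, a caller that stops early (for example, by returning on the first match) never requests the remaining pages.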

In [10]:
help(client.post_search)

Help on method post_search in module hca.util:

post_search(client, es_query: Mapping = None, output_format: Union[str, NoneType] = 'summary', replica: str = None, per_page: Union[str, NoneType] = 100, search_after: Union[str, NoneType] = None) method of builtins.type instance
    Find bundles by searching their metadata with an Elasticsearch query
    
    
    
    .. admonition:: Pagination
    
     This method supports pagination. Use ``DSSClient.post_search.iterate(**kwargs)`` to create a generator that
     yields all results, making multiple requests over the wire if necessary:
    
     .. code-block:: python
    
       for result in DSSClient.post_search.iterate(**kwargs):
           ...
    
     The keyword arguments for ``DSSClient.post_search.iterate()`` are identical to the arguments for
     ``DSSClient.post_search()`` listed here.
    :param es_query:  Elasticsearch query 
    :type es_query: typing.Mapping
    :param output_format:  Specifies the output format. The d

I can update my function `find_bam` to use it:

In [11]:
def find_bam():
    results = client.post_search.iterate(replica="aws", es_query={})
    for result in results:
        bundle_uuid = result["bundle_fqid"].split('.', maxsplit=1)[0]
        try:
            bundle_dict = client.get_bundle(uuid=bundle_uuid, replica="aws")
        except Exception:
            continue
        for file_dict in bundle_dict["bundle"]["files"]:
            if file_dict["name"].endswith(".bam"):
                return file_dict

In [12]:
bam_file = find_bam()
bam_file

{'content-type': 'application/gzip; dcp-type=data',
 'crc32c': '74405869',
 'indexed': False,
 'name': 'ceae7e4d-6871-4d47-b2af-f3c9a5b3f5db_qc.bam',
 's3_etag': 'd86563efae1f97a11215ab8ac08ff57d-3',
 'sha1': 'd497da691495b27de9a304b4b0e1853203fe3a6a',
 'sha256': '14b700a3d4e3643cf781fb4641bdaa6dbc252d31d9e5a465acc23f30d61237d7',
 'size': 176911395,
 'uuid': '1907c7b9-55e5-47d5-a54c-85423e942523',
 'version': '2019-05-18T173113.216367Z'}

Looks good! What if I want to look for another file type, like FASTQ files? I can generalize the code above...

In [13]:
def find_ext(extension):
    results = client.post_search.iterate(replica="aws", es_query={})
    for result in results:
        bundle_uuid = result["bundle_fqid"].split('.', maxsplit=1)[0]
        try:
            bundle_dict = client.get_bundle(uuid=bundle_uuid, replica="aws")
        except Exception:
            continue
        for file_dict in bundle_dict["bundle"]["files"]:
            if file_dict["name"].endswith(extension):
                return file_dict
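
Because `str.endswith` also accepts a tuple of suffixes, the matching step can check several extensions at once. The sketch below isolates that step in a hypothetical helper, `matching_files`, which works on an already-fetched bundle dict so it can be tried without any network access. The sample bundle is made up and only mimics the shape of the `get_bundle` output.

```python
def matching_files(bundle_dict, extensions):
    """Return the file dicts in a bundle whose name ends with any extension."""
    suffixes = tuple(extensions)  # str.endswith accepts a tuple of suffixes
    return [f for f in bundle_dict["bundle"]["files"]
            if f["name"].endswith(suffixes)]


# Trimmed-down, made-up bundle with the same shape as the get_bundle output.
sample_bundle = {"bundle": {"files": [
    {"name": "cell_suspension_0.json"},
    {"name": "SRR6520067_1.fastq.gz"},
    {"name": "reads_qc.bam"},
]}}

hits = matching_files(sample_bundle, [".bam", ".fastq.gz"])
print([f["name"] for f in hits])  # ['SRR6520067_1.fastq.gz', 'reads_qc.bam']
```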

In [14]:
fastq_file = find_ext('.fastq.gz')
print(f"Name: {fastq_file['name']}, UUID: {fastq_file['uuid']}")

Name: SRR6520067_1.fastq.gz, UUID: de204c6b-97be-44dd-bdb8-89af19a717b9


If I want to download a file and know its UUID, I can use the `get_file` method...

In [15]:
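# get_file returns the whole file body in memory (about 177 MB for this BAM),
# so keep it in its own variable rather than overwriting the bam_file dict
bam_bytes = client.get_file(uuid=bam_file['uuid'], replica="aws")
with open("Aligned.sortedByCoord.out.bam", "wb") as output_bam:
    output_bam.write(bam_bytes)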
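
For a file of this size (about 177 MB), the streaming interface described in the `get_file` help text may be a better fit, since it avoids holding the whole file in memory. The sketch below follows the pattern from that help text: `get_file.stream(...)` used as a context manager, with chunks read from `fh.raw`. The function name `stream_to_file`, the chunk size, and the fake client used to exercise it offline are my own assumptions.

```python
import io
import os
import tempfile


def stream_to_file(client, uuid, path, replica="aws", chunk_size=1024 * 1024):
    """Stream a DSS file to disk in chunks; return the number of bytes written."""
    written = 0
    with client.get_file.stream(uuid=uuid, replica=replica) as fh, \
            open(path, "wb") as out:
        while True:
            chunk = fh.raw.read(chunk_size)
            if not chunk:
                break
            out.write(chunk)
            written += len(chunk)
    return written


# --- Offline stand-ins so the sketch runs without contacting the DSS ---
class _FakeResponse:
    def __init__(self, data):
        self.raw = io.BytesIO(data)

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        return False


class _FakeGetFile:
    def stream(self, **kwargs):
        return _FakeResponse(b"BAM\x01" * 1000)


class _FakeClient:
    get_file = _FakeGetFile()


out_path = os.path.join(tempfile.mkdtemp(), "demo.bam")
n_bytes = stream_to_file(_FakeClient(), uuid="not-a-real-uuid", path=out_path,
                         chunk_size=512)
print(n_bytes)  # 4000
```

With the real client, this would be called as `stream_to_file(client, file_uuid, "Aligned.sortedByCoord.out.bam")` in place of the `get_file` call above, where `file_uuid` is the `uuid` field of the file found by `find_bam()`.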

In [16]:
import pysam
bam = pysam.AlignmentFile("Aligned.sortedByCoord.out.bam", "rb")
print(bam.header)

@HD	VN:1.0	SO:coordinate
@SQ	SN:chr1	LN:248956422
@SQ	SN:chr2	LN:242193529
@SQ	SN:chr3	LN:198295559
@SQ	SN:chr4	LN:190214555
@SQ	SN:chr5	LN:181538259
@SQ	SN:chr6	LN:170805979
@SQ	SN:chr7	LN:159345973
@SQ	SN:chr8	LN:145138636
@SQ	SN:chr9	LN:138394717
@SQ	SN:chr10	LN:133797422
@SQ	SN:chr11	LN:135086622
@SQ	SN:chr12	LN:133275309
@SQ	SN:chr13	LN:114364328
@SQ	SN:chr14	LN:107043718
@SQ	SN:chr15	LN:101991189
@SQ	SN:chr16	LN:90338345
@SQ	SN:chr17	LN:83257441
@SQ	SN:chr18	LN:80373285
@SQ	SN:chr19	LN:58617616
@SQ	SN:chr20	LN:64444167
@SQ	SN:chr21	LN:46709983
@SQ	SN:chr22	LN:50818468
@SQ	SN:chrX	LN:156040895
@SQ	SN:chrY	LN:57227415
@SQ	SN:chrM	LN:16569
@SQ	SN:GL000008.2	LN:209709
@SQ	SN:GL000009.2	LN:201709
@SQ	SN:GL000194.1	LN:191469
@SQ	SN:GL000195.1	LN:182896
@SQ	SN:GL000205.2	LN:185591
@SQ	SN:GL000208.1	LN:92689
@SQ	SN:GL000213.1	LN:164239
@SQ	SN:GL000214.1	LN:137718
@SQ	SN:GL000216.2	LN:176608
@SQ	SN:GL000218.1	LN:161147
@SQ	SN:GL000219.1	LN:179198
@SQ	SN:GL000220.1	LN:161802
@SQ	SN:GL00022