# Download Any BAM File 🌕

Any BAM file will do. I just want to see what aligned data looks like in the HCA.

First I'll set up the DSS client.

In [1]:
import hca, hca.dss, json
client = hca.dss.DSSClient()

****
#### Now I want to find a bundle that has a BAM file in it.

The `client` has a method `get_file` that sounds very promising, but I need to get a UUID. I don't know what RFC4122 is, but hopefully I won't have to.

In [6]:
help(client.get_file)

Help on method get_file in module hca.util:

get_file(client, uuid:str=None, replica:str=None, version:Union[str, NoneType]=None) method of builtins.type instance
    Retrieve a file given a UUID and optionally a version.
    
    
    .. admonition:: Streaming
    
     Use ``DSSClient.get_file.stream(**kwargs)`` to get a ``requests.Response`` object whose body has not been
     read yet. This allows streaming large file bodies:
    
     .. code-block:: python
    
        fid = "7a8fbda7-d470-467a-904e-5c73413fab3e"
        with DSSClient().get_file.stream(uuid=fid, replica="aws") as fh:
            while True:
                chunk = fh.raw.read(1024)
                ...
                if not chunk:
                    break
    
     The keyword arguments for ``DSSClient.get_file.stream()`` are identical to the arguments for
     ``DSSClient.get_file()`` listed here.
    :param uuid:  A RFC4122-compliant ID for the file. 
    :type uuid: <class 'str'>
    :param replica:  Replica

****
#### Searching for bundles

The `post_search` method accepts a `query`, and there's even an example query in the data-store repo's [readme](https://github.com/HumanCellAtlas/data-store/blob/master/README.md). The function signature doesn't quite match the example, but it should be easy to fix. 

In [10]:
client.post_search(replica="aws", es_query={
    "query": {
        "bool": {
            "must": [{
                "match": {
                    "files.sample_json.donor.species": "Homo sapiens"
                }
            }, {
                "match": {
                    "files.assay_json.single_cell.method": "Fluidigm C1"
                }
            }, {
                "match": {
                    "files.sample_json.ncbi_biosample": "SAMN04303778"
                }
            }]
        }
    }
})

{'es_query': {'query': {'bool': {'must': [{'match': {'files.sample_json.donor.species': 'Homo sapiens'}},
     {'match': {'files.assay_json.single_cell.method': 'Fluidigm C1'}},
     {'match': {'files.sample_json.ncbi_biosample': 'SAMN04303778'}}]}}},
 'results': [],
 'total_hits': 0}

****
#### Well that didn't work!

But I can see how the results are structured. What if I just give it an empty query?

In [13]:
search_response = client.post_search(replica="aws", es_query={})
search_response["total_hits"]

2241

****
Okay great, many results. What does each result look like?

In [14]:
search_response["results"][0]

{'bundle_fqid': 'fb57815c-dbd9-4d3c-853f-8651e72b099f.2017-09-30T023030.386657Z',
 'bundle_url': 'https://dss.staging.data.humancellatlas.org/v1/bundles/fb57815c-dbd9-4d3c-853f-8651e72b099f?version=2017-09-30T023030.386657Z&replica=aws',
 'search_score': 1.0}

****
Now I have an ID that I can work with, though it's an "fqid" (a UUID followed by a version timestamp) rather than a plain "uuid", and it's for a bundle, not a file. I can try it with `get_bundle`:

In [19]:
client.get_bundle(uuid=search_response["results"][0]["bundle_fqid"], replica="aws")

SwaggerAPIException: not_found: Cannot find file! (HTTP 404). Details:
Traceback (most recent call last):
  File "/var/task/chalicelib/dss/error.py", line 26, in wrapper
    return func(*args, **kwargs)
  File "/var/task/chalicelib/dss/api/bundles/__init__.py", line 34, in get
    return get_bundle(uuid, Replica[replica], version, directurls)
  File "/var/task/chalicelib/dss/storage/bundles.py", line 132, in get_bundle
    directurls
  File "/var/task/chalicelib/dss/storage/bundles.py", line 46, in get_bundle_from_bucket
    raise DSSException(404, "not_found", "Cannot find file!")
dss.error.DSSException


****
Hmmm, a `SwaggerAPIException`; it appears that it couldn't find a bundle with that UUID. I suppose I can strip the timestamp off the ID and keep only the first 36 characters.

In [24]:
client.get_bundle(uuid=search_response["results"][0]["bundle_fqid"][:36], replica="aws")

{'bundle': {'creator_uid': 1,
  'files': [{'content-type': 'gzip',
    'crc32c': 'e978e85d',
    'indexed': True,
    'name': 'SRR2967608_1.fastq.gz',
    's3_etag': '89f6f8bec37ec1fc4560f3f99d47721d',
    'sha1': '58f03f7c6c0887baa54da85db5c820cfbe25d367',
    'sha256': '9b4c0dde8683f924975d0867903dc7a967f46bee5c0a025c451b9ba73e43f120',
    'size': 22220717,
    'uuid': '44b872e5-20ac-4cbd-b6c6-404a04e21308',
    'version': '2017-09-30T023023.368927Z'},
   {'content-type': 'gzip',
    'crc32c': '942cd9d6',
    'indexed': True,
    'name': 'SRR2967608_2.fastq.gz',
    's3_etag': 'fb9bbafee8a92ced414b3658b1bb9517',
    'sha1': 'bb5c8a68c155bad257cb7b93faef71a116cecba2',
    'sha256': 'c0d11199740a66150b8bb70a0474d8de9819e77f3f77b55dd04790e3fe6fb53c',
    'size': 23675941,
    'uuid': 'a9fef945-e003-4fdf-84b3-eddfa6f0ef9c',
    'version': '2017-09-30T023024.232296Z'},
   {'content-type': 'application/json',
    'crc32c': 'd31000c8',
    'indexed': True,
    'name': 'assay.json',
    's3_
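
The `[:36]` slice works because a UUID written with hyphens is always 36 characters long. A slightly more explicit alternative is sketched below: split the fqid at its first dot and check that the left half really parses as a UUID. The helper name `split_fqid` is made up for this notebook and is not part of the `hca` package.

```python
import uuid


def split_fqid(fqid):
    """Split '<uuid>.<version>' at the first dot and validate the UUID part.

    Hypothetical helper for this notebook, not an hca API.
    """
    bundle_uuid, _, version = fqid.partition(".")
    uuid.UUID(bundle_uuid)  # raises ValueError if this is not a well-formed UUID
    return bundle_uuid, version


fqid = "fb57815c-dbd9-4d3c-853f-8651e72b099f.2017-09-30T023030.386657Z"
bundle_uuid, version = split_fqid(fqid)
print(bundle_uuid)
print(version)
```

The `bundle_url` in the search result carries the same timestamp as a `version` query parameter, so the second half is probably useful for pinning an exact bundle version, though I have not tried passing it to `get_bundle` here.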

****
This is very promising! No BAM files in this bundle, but I see the structure of a "bundle" and how I can work with it. Now I just need to iterate over bundles until I find a BAM.

In [28]:
for result in search_response["results"]:
    bundle_uuid = result["bundle_fqid"][:36]
    bundle_dict = client.get_bundle(uuid=bundle_uuid, replica="aws")
    found_file = False
    for file_dict in bundle_dict["bundle"]["files"]:
        if file_dict["name"].endswith(".bam"):
            print("Name: {}, UUID: {}".format(file_dict["name"], file_dict["uuid"]))
            found_file = True
            break
    if found_file:
        break

No results. What if I try it for FASTQs?

In [33]:
for result in search_response["results"]:
    bundle_uuid = result["bundle_fqid"][:36]
    bundle_dict = client.get_bundle(uuid=bundle_uuid, replica="aws")
    found_file = False
    for file_dict in bundle_dict["bundle"]["files"]:
        if file_dict["name"].endswith(".fastq.gz"):
            print("Name: {}, UUID: {}".format(file_dict["name"], file_dict["uuid"]))
            found_file = True
            break
    if found_file:
        break

Name: SRR2967608_1.fastq.gz, UUID: 44b872e5-20ac-4cbd-b6c6-404a04e21308


****
Perhaps there aren't any BAMs?

No, it turns out the search results are paginated, so the `results` list only contains the first 100 of the 2241 hits. There's another RFC for the reading list there, but there's also the `post_search.iterate` method that should work.

In [34]:
help(client.post_search)

Help on method post_search in module hca.util:

post_search(client, es_query:Mapping=None, output_format:Union[str, NoneType]='summary', replica:str=None, per_page:Union[str, NoneType]=100) method of builtins.type instance
    Find bundles by searching their metadata with an Elasticsearch query
    
    
    
    .. admonition:: Pagination
    
     This method supports pagination. Use ``DSSClient.post_search.iterate(**kwargs)`` to create a generator that
     yields all results, making multiple requests over the wire if necessary:
    
     .. code-block:: python
    
       for result in DSSClient.post_search.iterate(**kwargs):
           ...
    
     The keyword arguments for ``DSSClient.post_search.iterate()`` are identical to the arguments for
     ``DSSClient.post_search()`` listed here.
    :param es_query:  Elasticsearch query 
    :type es_query: typing.Mapping
    :param output_format:  Specifies the output format. The default format, ``summary``, is a list of UUIDs for bund

In [35]:
results = client.post_search.iterate(replica="aws", es_query={})
for result in results:
    bundle_uuid = result["bundle_fqid"][:36]
    bundle_dict = client.get_bundle(uuid=bundle_uuid, replica="aws")
    found_file = False
    for file_dict in bundle_dict["bundle"]["files"]:
        if file_dict["name"].endswith(".bam"):
            print("Name: {}, UUID: {}".format(file_dict["name"], file_dict["uuid"]))
            found_file = True
            break
    if found_file:
        break

Name: Aligned.sortedByCoord.out.bam, UUID: f9fa3804-ce46-456a-9023-50c8d5bf822f
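
The three search loops in this notebook differ only in the file suffix and in where the bundles come from, so they could be folded into one function. The sketch below is one way to do that; `find_first_file` is a name I made up, and the `sample_bundles` list is stand-in data shaped like the `get_bundle` responses above, not a real DSS response.

```python
def find_first_file(bundles, suffix):
    """Return (name, uuid) of the first file whose name ends with suffix, else None.

    `bundles` is any iterable of dicts shaped like the get_bundle responses.
    """
    for bundle_dict in bundles:
        for file_dict in bundle_dict["bundle"]["files"]:
            if file_dict["name"].endswith(suffix):
                return file_dict["name"], file_dict["uuid"]
    return None


# Stand-in data for illustration only (file names and UUIDs copied from the outputs above).
sample_bundles = [
    {"bundle": {"files": [{"name": "SRR2967608_1.fastq.gz",
                           "uuid": "44b872e5-20ac-4cbd-b6c6-404a04e21308"}]}},
    {"bundle": {"files": [{"name": "Aligned.sortedByCoord.out.bam",
                           "uuid": "f9fa3804-ce46-456a-9023-50c8d5bf822f"}]}},
]
bam_hit = find_first_file(sample_bundles, ".bam")
print(bam_hit)

# Against the live service, the bundles could be fetched lazily with the same calls as above:
# bundles = (client.get_bundle(uuid=r["bundle_fqid"][:36], replica="aws")
#            for r in client.post_search.iterate(replica="aws", es_query={}))
# find_first_file(bundles, ".bam")
```

Because the function returns on the first match and the commented-out `bundles` expression is a generator, no further bundles are requested once a hit is found, which mirrors the `break` logic in the loops.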


In [41]:
bam_file = client.get_file(uuid="f9fa3804-ce46-456a-9023-50c8d5bf822f", replica="aws")
with open("Aligned.sortedByCoord.out.bam", "wb") as output_bam:
    output_bam.write(bam_file)
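
The cell above holds the whole file in memory before writing it. That is fine for this small chr21 BAM, but the `get_file` help text points to `get_file.stream` for larger files. Below is a minimal sketch of the chunked copy that approach needs; `copy_in_chunks` is a hypothetical helper, exercised here on in-memory buffers rather than on a real download.

```python
import io


def copy_in_chunks(source, destination, chunk_size=1024 * 1024):
    """Copy bytes from one file-like object to another, returning the byte count."""
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:  # an empty read means the source is exhausted
            break
        destination.write(chunk)
        total += len(chunk)
    return total


# Self-contained demonstration with in-memory buffers.
source = io.BytesIO(b"x" * 2500)
destination = io.BytesIO()
copied = copy_in_chunks(source, destination, chunk_size=1024)
print(copied)

# Untested usage with the streaming call described in help(client.get_file):
# with client.get_file.stream(uuid="f9fa3804-ce46-456a-9023-50c8d5bf822f", replica="aws") as fh, \
#         open("Aligned.sortedByCoord.out.bam", "wb") as output_bam:
#     copy_in_chunks(fh.raw, output_bam)
```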

In [46]:
import pysam
bam = pysam.AlignmentFile("Aligned.sortedByCoord.out.bam", "rb")
print(bam.header)

@HD	VN:1.4	SO:coordinate
@SQ	SN:chr21	LN:48129895
@PG	ID:STAR	PN:STAR	VN:STAR_2.5.3a	CL:STAR   --genomeDir ./star   --genomeLoad NoSharedMemory   --readFilesIn /cromwell_root/org-humancellatlas-dss-staging/blobs/fe6d4fdfea2ff1df97500dcfe7085ac3abfb760026bff75a34c20fb97a4b2b29.17f8b4be0cc6e8281a402bb365b1283b458906a3.c7bbee4c46bbf29432862e05830c8f39.4ef74578   /cromwell_root/org-humancellatlas-dss-staging/blobs/c305bee37b3c3735585e11306272b6ab085f04cd22ea8703957b4503488cfeba.f166b6952e30a41e1409e7fb0cb0fb1ad93f3f21.a3a9f23d07cfc5e40a4c3a8adf3903ae.69987b3e      --readFilesCommand zcat      --limitBAMsortRAM 30000000000   --outSAMtype BAM   SortedByCoordinate      --outSAMstrandField intronMotif   --outSAMunmapped Within      --sjdbGTFfile /cromwell_root/broad-dsde-mint-staging-teststorage/demo/gencodev19_chr21.gtf   --quantMode TranscriptomeSAM      --twopassMode Basic
@CO	user command line: STAR --readFilesIn /cromwell_root/org-humancellatlas-dss-staging/blobs/fe6d4fdfea2ff1df97500dcfe
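
The printed header is plain SAM header text: one record per line, a record type such as `@SQ`, then tab-separated `TAG:value` fields (`@CO` lines are free-text comments). As a rough sketch of how to pull values out of it, the parser below works on the text alone; `parse_sam_header` is a made-up name, and the sample text is just the first two header lines from the output above.

```python
def parse_sam_header(text):
    """Parse SAM header text into a list of (record_type, fields) tuples.

    For @CO records, `fields` is the comment string; otherwise it is a dict.
    """
    records = []
    for line in text.splitlines():
        if not line.startswith("@"):
            continue
        record_type, _, rest = line.partition("\t")
        if record_type == "@CO":
            records.append((record_type, rest))
        else:
            # Split each field on the first colon only, since values may contain colons.
            fields = dict(field.split(":", 1) for field in rest.split("\t") if field)
            records.append((record_type, fields))
    return records


header_text = "@HD\tVN:1.4\tSO:coordinate\n@SQ\tSN:chr21\tLN:48129895\n"
records = parse_sam_header(header_text)
reference_lengths = {f["SN"]: int(f["LN"]) for t, f in records if t == "@SQ"}
print(reference_lengths)
```

For this file that gives a single reference, chr21, which matches the chr21-only GTF named in the STAR command line.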