# Downloading All Fastq Files in a Project

In this vignette, we'll demonstrate how to identify bundles associated with a given project, and how to download all `.fastq` files in those bundles.

This vignette assumes some familiarity with the HCA API and command line tools. If you need to brush up, try paging through [the other vignettes](https://github.com/HumanCellAtlas/data-consumer-vignettes/) - the [Download Any BAM File](https://github.com/HumanCellAtlas/data-consumer-vignettes/tree/master/Download%20Any%20BAM%20File) vignette will be particularly relevant. Make sure that you've installed the HCA CLI and that you're logged into it before starting!

## Method 1: Using the Data Explorer

The fastest way to get all `.fastq`s associated with a project is to find the project in the HCA Data Explorer, download a file manifest for its `.fastq`s, and then use `hca dss download-manifest` to download the files listed in that manifest. See [this guide](https://data.humancellatlas.org/guides/quick-start-guide) for details.

## Method 2: Using the HCA Data Store API

We can also retrieve all `.fastq`s associated with a project with the Data Store (DSS) API. First, you'll want to find a project that you like on the [HCA Data Explorer](https://data.humancellatlas.org/explore/). (You'll probably want to filter for `.fastq` files to make sure that whatever you're downloading has the kind of data that you're looking for.)

Once you find a project you like, open the project in the Data Explorer and then copy its URL. The UUID is at the end of the URL:

In [1]:
url = 'https://data.humancellatlas.org/explore/projects/cc95ff89-2e68-4a08-a234-480eca21ce79'
_, project_uuid = url.rsplit('/', maxsplit=1)
project_uuid

'cc95ff89-2e68-4a08-a234-480eca21ce79'

(This tutorial works a lot better if you choose a project that you know has `.fastq`s in it; see Method 1.)
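The `rsplit` call above assumes the URL ends exactly with the UUID. If the copied URL might carry a trailing slash or a query string, a slightly more defensive version could look like the sketch below. The helper name `project_uuid_from_url` is hypothetical and not part of the HCA tooling; it only uses the Python standard library.

```python
from urllib.parse import urlparse
import uuid


def project_uuid_from_url(url):
    """Return the last path segment of a project URL as a validated UUID string.

    Hypothetical helper for illustration; not part of the HCA CLI or API.
    """
    # Drop any query string or fragment, then any trailing slash.
    path = urlparse(url).path.rstrip('/')
    candidate = path.rsplit('/', maxsplit=1)[-1]
    # uuid.UUID raises ValueError if the segment is not a well-formed UUID.
    return str(uuid.UUID(candidate))


project_uuid = project_uuid_from_url(
    'https://data.humancellatlas.org/explore/projects/cc95ff89-2e68-4a08-a234-480eca21ce79'
)
```

Validating the UUID up front means a mistyped URL fails immediately with a `ValueError`, rather than later as a search that silently returns no bundles.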

Given a project UUID, we can make a `POST /search` request to the Data Store to search for all bundles that belong to that project. `POST /search` requests accept Elasticsearch queries. In a bundle, the `files.project_json.provenance.document_id` field contains the project UUID. That means we want a query that looks like this (with `project_uuid` standing in for the actual UUID string):

    {
      "query": {
        "bool": {
          "must": {
            "match": {
              "files.project_json.provenance.document_id": project_uuid
            }
          }
        }
      }
    }
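The query above is written with a `project_uuid` placeholder, so it is not valid JSON until the UUID is substituted. One way to make that substitution explicit is a small builder function, sketched below. The name `build_project_query` is hypothetical; the field path is the one given in the text.

```python
import json


def build_project_query(project_uuid):
    """Build the Elasticsearch query that matches bundles for one project.

    Hypothetical helper for illustration; the field path comes from the text above.
    """
    return {
        'query': {
            'bool': {
                'must': {
                    'match': {
                        'files.project_json.provenance.document_id': project_uuid
                    }
                }
            }
        }
    }


query = build_project_query('cc95ff89-2e68-4a08-a234-480eca21ce79')
# Serializing the dict shows the concrete JSON body of the search request.
print(json.dumps(query, indent=2))
```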

We can use the HCA API to search for bundles matching this query.

In [2]:
import hca.dss
client = hca.dss.DSSClient()
query = {'query':
            {'bool':
                {'must':
                    {'match': {'files.project_json.provenance.document_id': project_uuid}}
                }
            }
        }
results = client.post_search(replica='aws', es_query=query)
bundle_uuid, bundle_version = results['results'][0]['bundle_fqid'].split('.', maxsplit=1)
print("There are %d bundles in this project. Wow!" % results['total_hits'])
results['results'][0]

There are 254 bundles in this project. Wow!


{'bundle_fqid': 'ff5a367c-4861-4321-9c48-2595c001ec1f.2019-07-03T090632.606000Z',
 'bundle_url': 'https://dss.integration.data.humancellatlas.org/v1/bundles/ff5a367c-4861-4321-9c48-2595c001ec1f?version=2019-07-03T090632.606000Z&replica=aws',
 'search_score': None}

We can actually improve this code - `post_search` is a paginated endpoint, and might only return a subset of all results with each request. We *could* control how many results a `post_search` request returns with the `per_page` argument, but if we use the `post_search.iterate` method, the API Python bindings will handle pagination for us transparently:

In [3]:
results = client.post_search.iterate(replica='aws', es_query=query)
# Let's print just the first ten results for brevity. We can 'slice'
# the first ten results out of `results` using itertools.islice
import itertools
for result in itertools.islice(results, 10):
    bundle_uuid, bundle_version = result['bundle_fqid'].split('.', maxsplit=1)
    print(bundle_uuid, bundle_version)

ff5a367c-4861-4321-9c48-2595c001ec1f 2019-07-03T090632.606000Z
fe23bae9-4b71-4142-98c6-710edb2e42f9 2019-07-03T090632.605000Z
fdeb24d5-29f2-4c85-b6e5-5d8e8b40c294 2019-07-31T053949.705104Z
fce44c9a-dea9-4df5-865b-d99a8ef2c562 2019-07-31T043522.547425Z
fb7ea2c9-4c76-4249-a176-a4158df3b87a 2019-07-31T045957.400965Z
fb2bb6f8-529b-4f92-ab0f-579703956b53 2019-07-31T042306.099595Z
fab59898-3db9-4647-b3af-9cda96836937 2019-07-31T050607.292142Z
f9018b11-51e6-4520-8ba8-a9dac3c739c4 2019-07-03T090632.606000Z
f8d2c03e-c6fb-4e42-90f3-df7b8f446624 2019-07-31T045305.547659Z
f8b2b231-bf29-4e7a-9757-81656beb2da0 2019-07-31T050631.109072Z
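Each `bundle_fqid` is the bundle UUID and the bundle version joined by a period. Because the version string itself contains a period (before the fractional seconds), the split needs `maxsplit=1`. The sketch below wraps that step in two hypothetical helpers, `split_fqid` and `collect_bundles`, and runs them on sample values copied from the output above instead of a live search.

```python
def split_fqid(bundle_fqid):
    """Split 'UUID.VERSION' into (uuid, version).

    maxsplit=1 keeps the period inside the version timestamp intact.
    """
    bundle_uuid, bundle_version = bundle_fqid.split('.', maxsplit=1)
    return bundle_uuid, bundle_version


def collect_bundles(results):
    """Turn any iterable of search results into a list of (uuid, version) pairs."""
    return [split_fqid(result['bundle_fqid']) for result in results]


# Sample results copied from the output above, standing in for a live search.
sample_results = [
    {'bundle_fqid': 'ff5a367c-4861-4321-9c48-2595c001ec1f.2019-07-03T090632.606000Z'},
    {'bundle_fqid': 'fe23bae9-4b71-4142-98c6-710edb2e42f9.2019-07-03T090632.605000Z'},
]
bundles = collect_bundles(sample_results)
```

Since `collect_bundles` accepts any iterable, it should also work on the iterator returned by `client.post_search.iterate(...)`, although materializing every result of a large project into a list may take a while.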


Nice! That's a lot of bundles...

Now that we've identified all the bundles belonging to the project we're interested in, we can make a `GET /bundles/{uuid}` request to list all the files in a given bundle, and from there, identify all `.fastq` files in each bundle.

In [4]:
client.get_bundle(uuid=bundle_uuid, replica='aws')

{'bundle': {'creator_uid': 0,
  'files': [{'content-type': 'application/json; dcp-type="metadata/biomaterial"',
    'crc32c': '6f6a4adc',
    'indexed': True,
    'name': 'cell_suspension_0.json',
    's3_etag': 'e615def6fd75103bb602f684dddbbca1',
    'sha1': '58b2440dc4e96228f53abeb78ad1a99f64d244b7',
    'sha256': '547b0f0b9313c7362893cb098ca47c0536a16fc438115d66030cccd241d1a248',
    'size': 873,
    'uuid': 'bc708932-6606-4e88-90a3-071876e4315b',
    'version': '2019-07-03T083142.069000Z'},
   {'content-type': 'application/json; dcp-type="metadata/biomaterial"',
    'crc32c': 'd8904117',
    'indexed': True,
    'name': 'specimen_from_organism_0.json',
    's3_etag': '12e967fa198200a877fd4c2657a4b537',
    'sha1': '9c2a1e9378174570a9a4333586785844fa82525e',
    'sha256': '4e692aed34d438adbd4f18cecca77fa0d6078a62d36167ac1110d35cc0b7833b',
    'size': 1017,
    'uuid': 'ed281945-6106-4a90-afc1-80e964f77693',
    'version': '2019-07-03T083138.212000Z'},
   {'content-type': 'applicatio

Let's write a function `find_fastqs_in_bundle` that, given a bundle UUID and version, will return the metadata entries (including the UUIDs) of the `.fastq.gz` files in that bundle.

In [5]:
def find_fastqs_in_bundle(bundle_uuid, version):
    bundle = client.get_bundle(uuid=bundle_uuid, version=version, replica='aws')
    fastq_files = []
    for file_ in bundle['bundle']['files']:
        if file_['name'].endswith('.fastq.gz'):
            fastq_files.append(file_)
    return fastq_files

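Experienced Pythonistas will recognize that we can write `find_fastqs_in_bundle` more concisely with `filter`. Note that this version returns a lazy iterator rather than a list, so the matching files are only produced as we loop over the result: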

In [6]:
def find_fastqs_in_bundle(bundle_uuid, version):
    bundle = client.get_bundle(uuid=bundle_uuid, version=version, replica='aws')
    return filter(lambda x: x['name'].endswith('.fastq.gz'), bundle['bundle']['files'])
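Both versions above mix the network call with the filtering logic, which makes the filter hard to try out offline. As a sketch, the filtering step could be pulled into a pure function that works on the `files` list of a bundle. The name `select_fastqs` and the extra suffixes in `FASTQ_SUFFIXES` are assumptions for illustration; the bundles shown in this vignette only use `.fastq.gz`.

```python
# Assumed set of FASTQ file endings; only '.fastq.gz' appears in this vignette.
FASTQ_SUFFIXES = ('.fastq', '.fastq.gz', '.fq', '.fq.gz')


def select_fastqs(files, suffixes=FASTQ_SUFFIXES):
    """Return the file entries whose name ends with one of the given suffixes.

    `files` is a list of dicts shaped like bundle['bundle']['files'].
    """
    # str.endswith accepts a tuple and matches if any suffix matches.
    return [f for f in files if f['name'].lower().endswith(suffixes)]


# Sample entries with names taken from the outputs in this vignette.
sample_files = [
    {'name': 'cell_suspension_0.json'},
    {'name': 'MantonBM7_HiSeq_6_S21_L005_R1_001.fastq.gz'},
    {'name': 'MantonBM7_HiSeq_6_S21_L005_R2_001.fastq.gz'},
]
fastqs = select_fastqs(sample_files)
```

With this split, `find_fastqs_in_bundle` would reduce to fetching the bundle and returning `select_fastqs(bundle['bundle']['files'])`.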

Now, let's try using `find_fastqs_in_bundle` with the first ten of the bundles we found at the beginning of this tutorial:

In [7]:
results = client.post_search.iterate(replica='aws', es_query=query)
for result in itertools.islice(results, 10):
    bundle_uuid, bundle_version = result['bundle_fqid'].split('.', maxsplit=1)
    for fastq in find_fastqs_in_bundle(bundle_uuid, bundle_version):
        print(f"Bundle {result['bundle_fqid']} has {fastq['name']}")

Bundle ff5a367c-4861-4321-9c48-2595c001ec1f.2019-07-03T090632.606000Z has MantonBM7_HiSeq_6_S21_L005_I1_001.fastq.gz
Bundle ff5a367c-4861-4321-9c48-2595c001ec1f.2019-07-03T090632.606000Z has MantonBM7_HiSeq_6_S21_L005_R1_001.fastq.gz
Bundle ff5a367c-4861-4321-9c48-2595c001ec1f.2019-07-03T090632.606000Z has MantonBM7_HiSeq_6_S21_L005_R2_001.fastq.gz
Bundle ff5a367c-4861-4321-9c48-2595c001ec1f.2019-07-03T090632.606000Z has MantonBM7_HiSeq_6_S21_L006_I1_001.fastq.gz
Bundle ff5a367c-4861-4321-9c48-2595c001ec1f.2019-07-03T090632.606000Z has MantonBM7_HiSeq_6_S21_L006_R1_001.fastq.gz
Bundle ff5a367c-4861-4321-9c48-2595c001ec1f.2019-07-03T090632.606000Z has MantonBM7_HiSeq_6_S21_L006_R2_001.fastq.gz
Bundle fe23bae9-4b71-4142-98c6-710edb2e42f9.2019-07-03T090632.605000Z has MantonBM1_HiSeq_4_S4_L007_I1_001.fastq.gz
Bundle fe23bae9-4b71-4142-98c6-710edb2e42f9.2019-07-03T090632.605000Z has MantonBM1_HiSeq_4_S4_L007_R1_001.fastq.gz
Bundle fe23bae9-4b71-4142-98c6-710edb2e42f9.2019-07-03T090632.6050