# Downloading all `.fastq`s in a project

In this vignette, we'll demonstrate how to identify bundles associated with a given project, and how to download all `.fastq` files in those bundles.

This vignette assumes some familiarity with the HCA API and command line tools. If you need to brush up, try paging through [the other vignettes](https://github.com/HumanCellAtlas/data-consumer-vignettes/) - the [Download Any BAM File](https://github.com/HumanCellAtlas/data-consumer-vignettes/tree/master/Download%20Any%20BAM%20File) vignette will be particularly relevant. Make sure that you've installed the HCA CLI and that you're logged into it before starting!

## Method 1: Using the Data Explorer

The fastest way to get all `.fastq`s associated with a project is to find the project in the HCA Data Explorer, download a file manifest for its `.fastq` files, and then use `hca dss download-manifest` to download the files you selected. See [this guide](https://data.humancellatlas.org/guides/quick-start-guide) for details.

## Method 2: Using the HCA DSS REST API, the long way

We can also retrieve all `.fastq`s associated with a project with the DSS API. First, you'll want to find a project that you like on the [HCA Data Explorer](https://data.humancellatlas.org/explore/). (You'll probably want to filter for `.fastq` files to make sure that whatever you're downloading has the kind of data that you're looking for.)

Once you find a project you like, open the project in the Data Explorer and then copy its URL. The UUID is at the end of the URL:

In [1]:
url = 'https://data.humancellatlas.org/explore/projects/74b6d569-3b11-42ef-b6b1-a0454522b4a0'
_, project_uuid = url.rsplit('/', maxsplit=1)
project_uuid

'74b6d569-3b11-42ef-b6b1-a0454522b4a0'

(This tutorial works a lot better if you choose a project that you know has `.fastq`s in it; see Method 1.)
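The `rsplit` call above trusts the URL to end in a UUID. If you're pasting URLs by hand, a small check can catch a trailing slash or a truncated copy before you send a query. Here's a minimal sketch using only the standard library; the helper name `project_uuid_from_url` is ours for illustration and isn't part of the HCA tooling.

```python
import uuid
from urllib.parse import urlparse


def project_uuid_from_url(url):
    """Return the last path segment of `url` if it parses as a UUID."""
    # Ignore any query string or fragment and tolerate a trailing slash.
    path = urlparse(url).path.rstrip('/')
    candidate = path.rsplit('/', maxsplit=1)[-1]
    # uuid.UUID raises ValueError when the segment is not a valid UUID.
    return str(uuid.UUID(candidate))


example_url = 'https://data.humancellatlas.org/explore/projects/74b6d569-3b11-42ef-b6b1-a0454522b4a0'
print(project_uuid_from_url(example_url))
```

If the URL doesn't end in a UUID, this raises a `ValueError` right away instead of producing a search that silently matches nothing.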

Given a project UUID, we can make a `POST /search` request to the Data Store to search for all bundles that belong to the given project. `POST /search` requests accept Elasticsearch queries. In a bundle, the `files.project_json.provenance.document_id` field contains the project UUID. That means we want a query that looks like this (with `project_uuid` standing in for the UUID string we just extracted):

    {
      "query": {
        "bool": {
          "must": {
            "match": {
              "files.project_json.provenance.document_id": project_uuid
            }
          }
        }
      }
    }

We can use the HCA API to search for bundles matching this query.

In [2]:
import hca.dss
client = hca.dss.DSSClient()
query = {'query':
            {'bool':
                {'must':
                    {'match': {'files.project_json.provenance.document_id': project_uuid}}
                }
            }
        }
results = client.post_search(replica='aws', es_query=query)
bundle_uuid, bundle_version = results['results'][0]['bundle_fqid'].split('.', maxsplit=1)
print("There are %d bundles in this project. Wow!" % results['total_hits'])
results['results'][0]

There are 5459 bundles in this project. Wow!


{'bundle_fqid': 'fffcea5e-2e6c-4ca1-9aa9-c23b90b2e8b8.2019-05-16T211813.059000Z',
 'bundle_url': 'https://dss.data.humancellatlas.org/v1/bundles/fffcea5e-2e6c-4ca1-9aa9-c23b90b2e8b8?version=2019-05-16T211813.059000Z&replica=aws',
 'search_score': None}
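A note on `bundle_fqid`: it joins the bundle UUID and the bundle version with a `.`, but the version timestamp contains a `.` of its own. That's why the code above passes `maxsplit=1`. Here's a small sketch of the split in isolation, using the FQID from the output above; the helper name `split_bundle_fqid` is just for illustration.

```python
def split_bundle_fqid(bundle_fqid):
    """Split a 'uuid.version' string into (uuid, version)."""
    # Split on the first '.' only: the version timestamp has its own '.'.
    bundle_uuid, bundle_version = bundle_fqid.split('.', maxsplit=1)
    return bundle_uuid, bundle_version


fqid = 'fffcea5e-2e6c-4ca1-9aa9-c23b90b2e8b8.2019-05-16T211813.059000Z'
print(split_bundle_fqid(fqid))
# ('fffcea5e-2e6c-4ca1-9aa9-c23b90b2e8b8', '2019-05-16T211813.059000Z')
```

Without `maxsplit=1`, the split would produce three pieces and the tuple unpacking would fail.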

We can actually improve this code - `post_search` is a paginated endpoint, and might only return a subset of all results with each request. We *could* control how many results a `post_search` request returns with the `per_page` argument, but if we use the `post_search.iterate` method, the API Python bindings will handle pagination for us transparently:

In [3]:
results = client.post_search.iterate(replica='aws', es_query=query)
# Let's print just the first ten results for brevity. We can 'slice'
# the first ten results out of `results` using itertools.islice
import itertools
for result in itertools.islice(results, 10):
    bundle_uuid, bundle_version = result['bundle_fqid'].split('.', maxsplit=1)
    print(bundle_uuid, bundle_version)

fffcea5e-2e6c-4ca1-9aa9-c23b90b2e8b8 2019-05-16T211813.059000Z
ffc1714d-caef-41fe-9b68-a6db5efd4c69 2019-05-16T211813.060000Z
ffa728d1-3767-46a0-ad94-adefd36630c9 2019-05-16T211813.105000Z
ff982627-9abf-4d66-bc45-f9298ffb9311 2019-05-16T211813.111000Z
ff8020e9-5bce-491b-b5f3-821ab25fb167 2019-05-16T211813.082000Z
ff781a9d-e4f1-4a4f-acee-8bf28d8cd684 2019-05-16T211813.068000Z
ff74ec2f-21b7-4949-8c55-42315afa8295 2019-05-16T211813.074000Z
ff5e0cfc-bb2a-4f20-bade-897f1626fa8d 2019-05-16T211813.074000Z
ff5b1de9-fb65-4451-95dc-7da7d11bd493 2019-05-16T211813.105000Z
ff5a4036-08cf-4abc-98a7-572eaaa452cc 2019-05-16T211813.057000Z


Nice! That's a lot of bundles...
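If you're curious what "handling pagination transparently" means, here's a rough sketch of the general pattern: a generator that requests the next page only when the previous one has been used up. This is *not* how the HCA bindings are implemented. `fetch_page` is a hypothetical stand-in that serves fake results from memory, and the offset-based paging is an assumption made to keep the example self-contained.

```python
import itertools


def fetch_page(offset, per_page):
    """Hypothetical stand-in for one paginated request (25 fake results)."""
    fake_results = [{'bundle_fqid': f'bundle-{i}.version-{i}'} for i in range(25)]
    return fake_results[offset:offset + per_page]


def iterate_all(per_page=10):
    """Yield every result, fetching a new page only when one is needed."""
    offset = 0
    while True:
        page = fetch_page(offset, per_page)
        if not page:
            # An empty page means there is nothing left to fetch.
            return
        yield from page
        offset += len(page)


# The caller sees one flat stream of results and can slice it lazily.
first_three = list(itertools.islice(iterate_all(), 3))
total = sum(1 for _ in iterate_all())
print(first_three)
print(total)
```

Because the generator is lazy, slicing off the first few results with `itertools.islice` never requests the later pages.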

Now that we've identified all the bundles belonging to the project we're interested in, we can make a `GET /bundles/{uuid}` request to list all the files in a given bundle, and from there, identify all `.fastq` files in each bundle.

In [4]:
client.get_bundle(uuid=bundle_uuid, replica='aws')

{'bundle': {'creator_uid': 8008,
  'files': [{'content-type': 'application/json; dcp-type="metadata/biomaterial"',
    'crc32c': 'f04827fa',
    'indexed': True,
    'name': 'cell_suspension_0.json',
    's3_etag': '458bf0395db347c39bfaf267b0154568',
    'sha1': 'd5ba2ba27848c3b41d42c6e275aa4e3ad7cc6515',
    'sha256': '0e50c74847558a3123546b6e9d7eaf6766d504f42e6a7b389dba0369876af33d',
    'size': 1366,
    'uuid': 'd980eed4-9ac6-4343-81fc-647b0d88239f',
    'version': '2019-05-16T162211.753000Z'},
   {'content-type': 'application/json; dcp-type="metadata/biomaterial"',
    'crc32c': 'd3fc2fae',
    'indexed': True,
    'name': 'specimen_from_organism_0.json',
    's3_etag': '5c2033719c75743ed71058d7becf31ca',
    'sha1': '53b104f207ac57ab1f4e07cf6d5634ca856852e7',
    'sha256': 'cf9325255cc7bc9763844bbd1f50c7616f907985524dffd972d849f430d8bb07',
    'size': 1687,
    'uuid': '324191c3-db02-4671-9c51-1fdca8e6e88f',
    'version': '2019-05-16T161926.507000Z'},
   {'content-type': 'applic

Let's write a function `find_fastqs_in_bundle` that, given a bundle UUID and version, will return the file entries (each a dict with the file's name, UUID, version, and so on) for the `.fastq.gz` files in that bundle.

In [5]:
def find_fastqs_in_bundle(bundle_uuid, version):
    bundle = client.get_bundle(uuid=bundle_uuid, version=version, replica='aws')
    fastq_files = []
    for file_ in bundle['bundle']['files']:
        if file_['name'].endswith('.fastq.gz'):
            fastq_files.append(file_)
    return fastq_files

Experienced Pythonistas will recognize that we can write `find_fastqs_in_bundle` more concisely with `filter`. Note that this version returns a lazy iterator rather than a list, so the matching files are produced only as we loop over them:

In [6]:
def find_fastqs_in_bundle(bundle_uuid, version):
    bundle = client.get_bundle(uuid=bundle_uuid, version=version, replica='aws')
    # Keep only the file entries whose names end in '.fastq.gz'.
    return filter(lambda x: x['name'].endswith('.fastq.gz'), bundle['bundle']['files'])
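One thing to keep in mind about the `filter`-based version: in Python 3, `filter` returns an iterator that can only be consumed once. That's fine for a single `for` loop, but if you want to count the matches or loop twice, wrap the result in `list(...)` or use a list comprehension. Here's a quick sketch on made-up file entries (the `.fastq.gz` names below are invented for illustration), with no network access needed.

```python
# Invented file entries shaped like the ones in a bundle listing.
sample_files = [
    {'name': 'cell_suspension_0.json'},
    {'name': 'sample_R1.fastq.gz'},
    {'name': 'sample_R2.fastq.gz'},
]


def is_fastq(file_):
    return file_['name'].endswith('.fastq.gz')


lazy = filter(is_fastq, sample_files)
first_pass = [f['name'] for f in lazy]
second_pass = [f['name'] for f in lazy]  # the iterator is already exhausted

# A list comprehension gives a reusable list instead.
as_list = [f for f in sample_files if is_fastq(f)]

print(first_pass)    # ['sample_R1.fastq.gz', 'sample_R2.fastq.gz']
print(second_pass)   # []
print(len(as_list))  # 2
```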

Now, let's try using `find_fastqs_in_bundle` on the bundles we found at the beginning of this tutorial (again, just the first ten for brevity):

In [7]:
results = client.post_search.iterate(replica='aws', es_query=query)
for result in itertools.islice(results, 10):
    bundle_uuid, bundle_version = result['bundle_fqid'].split('.', maxsplit=1)
    for fastq in find_fastqs_in_bundle(bundle_uuid, bundle_version):
        print(f"Bundle {result['bundle_fqid']} has {fastq['name']}")

Bundle fffcea5e-2e6c-4ca1-9aa9-c23b90b2e8b8.2019-05-16T211813.059000Z has E18_20161004_Neurons_Sample_14_S083_L005_I1_002.fastq.gz
Bundle fffcea5e-2e6c-4ca1-9aa9-c23b90b2e8b8.2019-05-16T211813.059000Z has E18_20161004_Neurons_Sample_14_S083_L005_R1_002.fastq.gz
Bundle fffcea5e-2e6c-4ca1-9aa9-c23b90b2e8b8.2019-05-16T211813.059000Z has E18_20161004_Neurons_Sample_14_S083_L005_R2_002.fastq.gz
Bundle ffc1714d-caef-41fe-9b68-a6db5efd4c69.2019-05-16T211813.060000Z has E18_20161004_Neurons_Sample_57_S126_L008_I1_002.fastq.gz
Bundle ffc1714d-caef-41fe-9b68-a6db5efd4c69.2019-05-16T211813.060000Z has E18_20161004_Neurons_Sample_57_S126_L008_R1_002.fastq.gz
Bundle ffc1714d-caef-41fe-9b68-a6db5efd4c69.2019-05-16T211813.060000Z has E18_20161004_Neurons_Sample_57_S126_L008_R2_002.fastq.gz
Bundle ffa728d1-3767-46a0-ad94-adefd36630c9.2019-05-16T211813.105000Z has E18_20160930_Neurons_Sample_61_S058_L002_I1_010.fastq.gz
Bundle ffa728d1-3767-46a0-ad94-adefd36630c9.2019-05-16T211813.105000Z has E18_20160