# Download Data for All Liver Cells Sequenced with 10x

In this notebook, we cover how to search the Human Cell Atlas Data Store (DSS) for bundles containing liver cells that were sequenced with 10x, and download all of the data that is found.

The two steps of this process are:

1. Write an ElasticSearch query to return bundles matching our two conditions (liver cells, and 10x).

2. Iterate over the results and download the relevant files using the DSS API.

As usual, we start with a DSS API client.

In [1]:
import hca.dss, json
client = hca.dss.DSSClient()

## Writing the ElasticSearch Query

We recommend at least skimming the ["Writing ElasticSearch Queries"](../elasticsearch-queries/elasticsearch-queries.html) notebook, which covers how ElasticSearch queries are written.

To find bundles that contain liver cells and were sequenced with 10x, we will use a boolean conditional query with two conditions. We first run an empty query and inspect the metadata returned for one result to work out which field should match "liver" and which field should contain "10x".

In [2]:
response = client.post_search(es_query={}, replica='aws', output_format='raw')
first_bundle = response['results'][1]

Each result contains metadata, as extracted from several JSON files:

In [3]:
print("Metadata files:")
print("\n".join(first_bundle['metadata']['files'].keys()))

Metadata files:
cell_suspension_json
collection_protocol_json
donor_organism_json
enrichment_protocol_json
library_preparation_protocol_json
links_json
process_json
project_json
sequence_file_json
sequencing_protocol_json
specimen_from_organism_json


### Boolean Condition 1: 10x Data

If we are looking for 10x data, the `sequencing_protocol_json` file might be the first place you would look:

In [4]:
import json
print(json.dumps(
    first_bundle['metadata']['files']['sequencing_protocol_json'],
    indent=4
))

[
    {
        "describedBy": "https://schema.humancellatlas.org/type/protocol/sequencing/10.0.2/sequencing_protocol",
        "instrument_manufacturer_model": {
            "ontology": "EFO:0008637",
            "ontology_label": "Illumina NovaSeq 6000",
            "text": "Illumina NovaSeq 6000"
        },
        "method": {
            "ontology": "EFO:0008441",
            "ontology_label": "full length single cell RNA sequencing",
            "text": "full length single cell RNA sequencing"
        },
        "paired_end": true,
        "protocol_core": {
            "protocol_description": "Libraries were sequenced on an Illumina NovaSeq 6000",
            "protocol_id": "sequencing_protocol_1",
            "protocol_name": "SmartSeq2"
        },
        "provenance": {
            "document_id": "571cc0c7-4dc2-443b-93f4-0ce4af08cf6d",
            "submission_date": "2019-07-09T21:04:28.843Z",
            "update_date": "2019-07-09T21:04:34.377Z"
        },
        "schema_typ

This doesn't contain the right metadata, so we can look next in `library_preparation_protocol_json`:

In [5]:
import json
print(json.dumps(
    first_bundle['metadata']['files']['library_preparation_protocol_json'],
    indent=4
))

[
    {
        "describedBy": "https://schema.humancellatlas.org/type/protocol/sequencing/6.1.1/library_preparation_protocol",
        "end_bias": "full length",
        "input_nucleic_acid_molecule": {
            "ontology": "OBI:0000869",
            "ontology_label": "polyA RNA",
            "text": "polyA RNA"
        },
        "library_construction_method": {
            "ontology": "EFO:0008931",
            "ontology_label": "Smart-seq2",
            "text": "Smart-seq2"
        },
        "nucleic_acid_source": "single cell",
        "protocol_core": {
            "protocol_description": "smart-seq2",
            "protocol_id": "library_preparation_1",
            "protocol_name": "Smart-seq2"
        },
        "provenance": {
            "document_id": "47d78ecd-e946-4591-9ed4-acbbdbdf82a1",
            "submission_date": "2019-07-09T21:04:28.837Z",
            "update_date": "2019-07-09T21:04:34.114Z"
        },
        "schema_type": "protocol",
        "strand": "unstra

From this we can determine the first boolean condition:

```
files.library_preparation_protocol_json.library_construction_method.text
```

should contain the text "10x". A `wildcard` query would be good here.
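
As a rough sketch, the first condition can be written on its own as a `wildcard` clause. The second half of the snippet uses Python's `fnmatch` module purely as a local stand-in to show which strings a `*10x*` glob accepts. ElasticSearch applies its own matching against indexed terms, so the sample strings below are illustrative assumptions, not values taken from the DSS.

```python
from fnmatch import fnmatchcase

# Field path for the library construction method, using the `_json` file key
# that appears in the metadata listing.
FIELD = "files.library_preparation_protocol_json.library_construction_method.text"

# First boolean condition as a standalone wildcard clause.
wildcard_condition = {"wildcard": {FIELD: {"value": "*10x*"}}}

# Hypothetical sample values, used only to illustrate glob semantics locally.
samples = ["10x v2 sequencing", "smart-seq2", "chromium 10x 3' v3"]
matches = [s for s in samples if fnmatchcase(s, "*10x*")]
print(matches)
```

The leading and trailing `*` let "10x" appear anywhere in the value, which is useful when the exact label is not known in advance.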

### Boolean Condition 2: Matching Liver Cells

To find the organ type for the cells in the bundle, we can use the `specimen_from_organism_json` metadata file, which contains information about the organism the specimen came from (including the organ).

Checking the metadata reveals the path needed to obtain the organ type:

In [6]:
import json
print(json.dumps(
    first_bundle['metadata']['files']['specimen_from_organism_json'],
    indent=4
))

[
    {
        "biomaterial_core": {
            "biomaterial_description": "Lung EPCAM",
            "biomaterial_id": "3_38_F_Lung_EPCAM",
            "biomaterial_name": "3_38_F_Lung_EPCAM",
            "ncbi_taxon_id": [
                10090
            ]
        },
        "describedBy": "https://schema.humancellatlas.org/type/biomaterial/10.2.1/specimen_from_organism",
        "genus_species": [
            {
                "ontology": "NCBITaxon:10090",
                "ontology_label": "Mus musculus",
                "text": "Mus musculus"
            }
        ],
        "organ": {
            "ontology": "UBERON:0002048",
            "ontology_label": "lung",
            "text": "lung epcam"
        },
        "provenance": {
            "document_id": "577a91d8-e579-41b6-9353-7e4e774c161a",
            "submission_date": "2019-07-09T19:41:33.874Z",
            "update_date": "2019-07-09T22:28:11.151Z"
        },
        "schema_type": "biomaterial"
    }
]


The path needed for our boolean condition is

```
files.specimen_from_organism_json.organ.text
```

and it should match "liver".
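
Reading through printed JSON by eye works, but the search for a field can also be automated. The sketch below walks a nested metadata structure and reports the dotted paths whose string values contain a search term. The helper name `find_paths` is hypothetical, and the sample data is an abbreviated copy of the specimen metadata printed above, so it searches for "lung" rather than "liver".

```python
def find_paths(node, term, prefix=""):
    """Yield dotted paths to string values that contain `term` (case-insensitive)."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from find_paths(value, term, f"{prefix}.{key}" if prefix else key)
    elif isinstance(node, list):
        # Lists are flattened here: the field paths used in the queries
        # in this notebook do not include list indices.
        for item in node:
            yield from find_paths(item, term, prefix)
    elif isinstance(node, str) and term.lower() in node.lower():
        yield prefix

# Abbreviated copy of the specimen metadata printed above.
specimen = [{
    "biomaterial_core": {"biomaterial_description": "Lung EPCAM"},
    "organ": {"ontology": "UBERON:0002048", "ontology_label": "lung", "text": "lung epcam"},
    "schema_type": "biomaterial",
}]

paths = sorted(set(find_paths({"specimen_from_organism_json": specimen}, "lung", "files")))
print("\n".join(paths))
```

Several paths can match the same term; choosing between them (here, `organ.text` versus `organ.ontology_label`) is still a judgment call.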

### Combining for the Query

We now assemble a boolean conditional query using our two conditions: the first, a wildcard query to find 10x data, and the second, a match query to find organs matching "liver".

In [7]:
method = "*10x*"
organ = "liver"

query = {
    "query": {
        "bool": {
            "must": [
                {
                    "wildcard": {
                        "files.library_preparation_protocol_json.library_construction_method.text": {
                            "value": method
                        }
                    }
                },
                {
                    "match": {
                        "files.specimen_from_organism_json.organ.text": organ
                    }
                }
            ]
        }
    }
}
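
If the same search is needed for other organs or library methods, the query can be wrapped in a small function. This is a minimal sketch: `build_query` is a hypothetical helper name, and it assumes the same two field paths apply to every bundle being searched.

```python
def build_query(method_pattern, organ):
    """Return a bool query requiring a library-method wildcard and an organ match."""
    method_field = "files.library_preparation_protocol_json.library_construction_method.text"
    organ_field = "files.specimen_from_organism_json.organ.text"
    return {
        "query": {
            "bool": {
                # Both clauses under "must" have to hold for a bundle to match.
                "must": [
                    {"wildcard": {method_field: {"value": method_pattern}}},
                    {"match": {organ_field: organ}},
                ]
            }
        }
    }

# Same structure as the liver query, with a different organ.
kidney_query = build_query("*10x*", "kidney")
```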

Now we run the query:

In [8]:
search_results = client.post_search(
    es_query=query, replica='aws', output_format='raw')

print("post_search() found %d results"%(search_results['total_hits']))
print("post_search() returned %d results"%(len(search_results['results'])))

post_search() found 15 results
post_search() returned 10 results


We can also print information about the project a bundle is part of. It so happens that all of these bundles belong to the same project:

In [9]:
print(json.dumps(
    search_results['results'][1]['metadata']['files']['project_json'][0]['project_core'],
    indent=4
))

{
    "project_description": "The liver is the largest solid organ in the body and is critical for metabolic and immune functions. However, little is known about the cells that make up the human liver and its immune microenvironment. Here we report a map of the cellular landscape of the human liver using single-cell RNA sequencing. We provide the transcriptional profiles of 8444 parenchymal and non-parenchymal cells obtained from the fractionation of fresh hepatic tissue from five human livers. Using gene expression patterns, flow cytometry, and immunohistochemical examinations, we identify 20 discrete cell populations of hepatocytes, endothelial cells, cholangiocytes, hepatic stellate cells, B cells, conventional and non-conventional T cells, NK-like cells, and distinct intrahepatic monocyte/macrophage populations. Together, our study presents a comprehensive view of the human liver at single-cell resolution that outlines the characteristics of resident cells in the liver, and in part

This concludes the ElasticSearch query portion of this tutorial. The next step is to use the results of this query to download the data.

## Downloading All Query Results

Next, we iterate over the results of the query to download the data. To make that easier, we can rewrite the `post_search()` API call as `post_search.iterate()`, which iterates over all results instead of returning them one page at a time:

```python
bundles = client.post_search(es_query=query, replica='aws', output_format='raw')
```

now becomes

```python
bundles_generator = client.post_search.iterate(es_query=query, replica='aws', output_format='raw')
for i, bundle in enumerate(bundles_generator):
    print(i, bundle['bundle_fqid'])  # replace with your own processing
```
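
While developing a processing function, it can help to look at only the first few results instead of the full set. One way to do this, sketched below, is `itertools.islice`, which stops pulling from a generator after a fixed number of items. The `fake_results()` generator is a hypothetical stand-in so the example runs offline; in the notebook it would be replaced by the `post_search.iterate(...)` call.

```python
from itertools import islice

def fake_results():
    """Hypothetical stand-in for client.post_search.iterate(...)."""
    for n in range(15):
        yield {"bundle_fqid": f"00000000-0000-0000-0000-{n:012d}.2019-09-23T173114.106310Z"}

# islice stops requesting items from the generator after the third one.
preview = list(islice(fake_results(), 3))
for bundle in preview:
    print(bundle["bundle_fqid"])
```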

As a refresher, here is our ElasticSearch query:

In [10]:
print(json.dumps(query, indent=4))

{
    "query": {
        "bool": {
            "must": [
                {
                    "wildcard": {
                        "files.library_preparation_protocol_json.library_construction_method.text": {
                            "value": "*10x*"
                        }
                    }
                },
                {
                    "match": {
                        "files.specimen_from_organism_json.organ.text": "liver"
                    }
                }
            ]
        }
    }
}


We have our query, and we have our loop structure for iterating over all results of the query. The last thing we need is a function to process each bundle returned by the query.

### Functions to Process Results

Below we give several functions that process the results of the query in different ways. Swap out the call to `print_bundle()` with a call to your function of choice in the loop below.

In [11]:
def print_bundle(bundle):
    """This function extracts the bundle FQID and prints it"""
    bundle_fqid = bundle['bundle_fqid']
    bundle_uuid, bundle_version = bundle_fqid.split(".", 1)
    print(f"Now processing bundle UUID {bundle_uuid} Version {bundle_version}")
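
The `split(".", 1)` call deserves a short illustration. A bundle FQID joins the UUID and the version with a dot, but the version timestamp contains a dot of its own before the fractional seconds. The sketch below uses an FQID assembled from a UUID and version of the kind this notebook prints.

```python
fqid = "fd7a46db-1e90-4bfd-8e70-a77baa01faa5.2019-09-23T173114.106310Z"

# maxsplit=1 splits only at the first dot, so the version stays intact.
bundle_uuid, bundle_version = fqid.split(".", 1)
print(bundle_uuid)
print(bundle_version)

# Without maxsplit, the dot inside the version produces a third piece,
# and unpacking into two names would raise a ValueError.
pieces = fqid.split(".")
print(len(pieces))
```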

In [12]:
def download_bundle(bundle):
    """This function downloads the exact version of the bundle"""
    bundle_fqid = bundle['bundle_fqid']
    bundle_uuid, bundle_version = bundle_fqid.split(".", 1)
    client.download(
        bundle_uuid=bundle_uuid,
        version=bundle_version,
        replica="aws",
        download_dir="data-10x-liver"
    )

In [13]:
def download_latest_bundle(bundle):
    """This function downloads the latest version of the bundle"""
    bundle_fqid = bundle['bundle_fqid']
    bundle_uuid, _ = bundle_fqid.split(".", 1)
    
    # Get information about the latest version of this bundle
    # (omitting the version argument requests the latest one)
    bundle_info = client.get_bundle(
        uuid=bundle_uuid,
        replica="aws"
    )
    
    # Download the latest version of the bundle;
    # the response nests the bundle details under the 'bundle' key
    client.download(
        bundle_uuid=bundle_uuid,
        version=bundle_info['bundle']['version'],
        replica="aws",
        download_dir="data-10x-liver-latestversion"
    )
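
Search results can list more than one version of the same bundle UUID, in which case calling the function above for every hit would fetch the same latest version repeatedly. A minimal sketch of one way to avoid that is to reduce the FQIDs to the newest version per UUID before downloading. The helper name `latest_fqids` is hypothetical, and the comparison assumes that versions are fixed-width timestamps that sort chronologically as plain strings.

```python
def latest_fqids(fqids):
    """Keep only the newest version for each bundle UUID."""
    latest = {}
    for fqid in fqids:
        uuid, version = fqid.split(".", 1)
        # Assumption: fixed-width timestamps compare correctly as strings.
        if uuid not in latest or version > latest[uuid]:
            latest[uuid] = version
    return [f"{uuid}.{version}" for uuid, version in latest.items()]

# FQIDs assembled from UUID/version pairs of the kind this notebook prints.
fqids = [
    "fd7a46db-1e90-4bfd-8e70-a77baa01faa5.2019-09-23T173114.106310Z",
    "fd7a46db-1e90-4bfd-8e70-a77baa01faa5.2019-09-20T175818.017079Z",
    "fbda9910-5076-47a6-83d6-cfff39d17606.2019-09-26T051746.268160Z",
]
unique_latest = latest_fqids(fqids)
print("\n".join(unique_latest))
```

Note that this only picks the newest version among the search hits, which is not necessarily the newest version stored in the DSS.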

In [14]:
def download_fastq_only(bundle):
    """This function downloads only the .fastq.gz files contained in a bundle"""
    for f in bundle['metadata']['manifest']['files']:
        f_name = f['name']
        if f_name.endswith(".fastq.gz"):
            f_uuid = f['uuid']
            f_version = f['version']
            print(f"Downloading file {f_name} : UUID {f_uuid}")
            results = client.get_file(
                uuid=f_uuid,
                version=f_version,
                replica='aws'
            )
            # .fastq.gz files are gzip-compressed binary data, so write the
            # bytes as-is instead of decoding them as UTF-8 text
            with open(f_name, 'wb') as fh:
                fh.write(results)
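
Before downloading large sequence files, it may be worth listing what would be fetched. The sketch below separates the filtering step into a hypothetical helper, `select_files`, and runs it on made-up manifest entries that carry only the keys used above (`name`, `uuid`, `version`); real manifests contain more fields.

```python
def select_files(manifest_files, suffix=".fastq.gz"):
    """Return the manifest entries whose name ends with `suffix`."""
    return [f for f in manifest_files if f["name"].endswith(suffix)]

# Hypothetical manifest entries, for illustration only.
manifest_files = [
    {"name": "sample_R1.fastq.gz", "uuid": "uuid-1", "version": "v1"},
    {"name": "project_0.json", "uuid": "uuid-2", "version": "v1"},
    {"name": "sample_R2.fastq.gz", "uuid": "uuid-3", "version": "v1"},
]

selected = select_files(manifest_files)
for f in selected:
    print(f["name"], f["uuid"])
```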

In [15]:
results_generator = client.post_search.iterate(es_query=query, replica='aws', output_format='raw')

for bundle in results_generator:
    print_bundle(bundle)

Now processing bundle UUID fd7a46db-1e90-4bfd-8e70-a77baa01faa5 Version 2019-09-23T173114.106310Z
Now processing bundle UUID fd7a46db-1e90-4bfd-8e70-a77baa01faa5 Version 2019-09-20T175818.017079Z
Now processing bundle UUID fbda9910-5076-47a6-83d6-cfff39d17606 Version 2019-09-26T051746.268160Z
Now processing bundle UUID fb2ae8b7-06b0-4881-ad9f-1f37255b91b6 Version 2019-09-23T173114.107225Z
Now processing bundle UUID fb2ae8b7-06b0-4881-ad9f-1f37255b91b6 Version 2019-09-20T175818.015707Z
Now processing bundle UUID c65efd23-bbc4-459a-ac60-d3cde705193d Version 2019-09-23T173114.107641Z
Now processing bundle UUID c65efd23-bbc4-459a-ac60-d3cde705193d Version 2019-09-20T175818.014333Z
Now processing bundle UUID c59a8de8-d4f3-424b-b716-06b7152b980a Version 2019-09-23T173114.106782Z
Now processing bundle UUID c59a8de8-d4f3-424b-b716-06b7152b980a Version 2019-09-20T175818.012949Z
Now processing bundle UUID be9f2d04-77ee-4f59-a0f7-f0b58034cf8c Version 2019-09-23T173114.105576Z
Now processing bundl