# Find the Number of Cells of a Specific Cell Type

In this notebook, we demonstrate how to search for cells matching a particular organ type using the API for the Human Cell Atlas's Data Store (DSS).

Start by creating an API client for the DSS:

In [1]:
import hca.dss
client = hca.dss.DSSClient()

Next we specify the organ type of the cells we are searching for:

In [2]:
organ_type = 'pancreas'

The DSS indexes all content in the data store using ElasticSearch, so we need to assemble an ElasticSearch query that matches cells from our organ of interest. The problem is that we don't know the metadata structure ahead of time, so we first look at a few items in the data store to figure out where the organ type is stored.

## Writing the ElasticSearch Query

We know from the ["ElasticSearch Queries"](../elasticsearch-queries/elasticsearch-queries.ipynb) notebook that we can use a match search if we have only one condition to match, but we need a boolean conditional search if we have several. Here we need a boolean conditional search to match two criteria:

* The first condition is that the organ type matches our organ of interest
* The second condition is that the bundle contains a gene expression matrix, from which we can extract a count of the total number of cells in the bundle

The structure of the query will be as follows:

```python
query = {
    "query" : {
        "bool" : {
            "must" : [
                {
                    "match" : {
                        "some.field": "value"
                    }
                }, {
                    "match" : {
                        "some.field": "value"
                    }
                }
            ]
        }
    }
}
```
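As a minimal sketch of how this structure can be assembled programmatically, the helpers below build the same `bool`/`must` layout from individual clauses. The names `build_bool_query`, `match`, and `wildcard` are hypothetical helpers introduced here for illustration; they are not part of `hca.dss` or ElasticSearch.

```python
def match(field, value):
    """Build a single ElasticSearch `match` clause."""
    return {"match": {field: value}}


def wildcard(field, pattern):
    """Build a single ElasticSearch `wildcard` clause."""
    return {"wildcard": {field: {"value": pattern}}}


def build_bool_query(*conditions):
    """Wrap any number of clauses in a bool query; all of them must match."""
    return {"query": {"bool": {"must": list(conditions)}}}


# Same shape as the template above, with placeholder field names
example = build_bool_query(
    match("some.field", "value"),
    match("some.other.field", "other value"),
)
```

Keeping each clause in its own small function makes it easy to test one condition at a time before combining them.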

We can start with an empty ElasticSearch query and examine the first item to inspect the metadata structure:

In [3]:
bundles = client.post_search(es_query={}, replica='aws', output_format='raw')
first_bundle = bundles['results'][0]

In [4]:
import json
print(json.dumps(
    first_bundle,
    indent=4))

{
    "bundle_fqid": "ffffba2d-30da-4593-9008-8b3528ee94f1.2019-08-01T200147.309074Z",
    "bundle_url": "https://dss.data.humancellatlas.org/v1/bundles/ffffba2d-30da-4593-9008-8b3528ee94f1?version=2019-08-01T200147.309074Z&replica=aws",
    "metadata": {
        "files": {
            "cell_suspension_json": [
                {
                    "biomaterial_core": {
                        "biomaterial_description": "Bladder",
                        "biomaterial_id": "G5.B000610.3_56_F.1.1_B000610_3_56_F_Bladder_",
                        "ncbi_taxon_id": [
                            10090
                        ]
                    },
                    "describedBy": "https://schema.humancellatlas.org/type/biomaterial/13.1.1/cell_suspension",
                    "estimated_cell_count": 1,
                    "genus_species": [
                        {
                            "ontology": "NCBITaxon:10090",
                            "ontology_label": "Mus musculus",
   

There is a lot of information here, but it is logically organized as follows: each bundle contains a number of JSON files that contain metadata related to the bundle, which ranges from technical information about the sequencing equipment used, to DOI numbers for related publications, to date and timestamps of when the data was last updated. This is all organized in the following JSON files:

In [5]:
print("Metadata JSON Files:")
print("--------------------")
print("\n".join(first_bundle['metadata']['files'].keys()))

Metadata JSON Files:
--------------------
cell_suspension_json
collection_protocol_json
donor_organism_json
enrichment_protocol_json
library_preparation_protocol_json
links_json
process_json
project_json
sequence_file_json
sequencing_protocol_json
specimen_from_organism_json


Each JSON file contains different metadata information. In addition to metadata, we can also access the bundle manifest of files in each search result using the `metadata.manifest.files` key:

In [6]:
print("Bundle Manifest:")
print("----------------")
for file_json in first_bundle['metadata']['manifest']['files']:
    print(file_json['name'])

Bundle Manifest:
----------------
cell_suspension_0.json
specimen_from_organism_0.json
donor_organism_0.json
sequence_file_0.json
sequence_file_1.json
project_0.json
library_preparation_protocol_0.json
sequencing_protocol_0.json
collection_protocol_0.json
enrichment_protocol_0.json
process_0.json
process_1.json
process_2.json
links.json
SRR6520067_1.fastq.gz
SRR6520067_2.fastq.gz


To assemble our query conditionals, we need to:

* Find the metadata JSON file containing organ type
* Find bundles whose manifests list files matching `*.results` (a gene expression matrix file)
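For the first item in the list above, reading every metadata file by eye can be slow. A rough sketch of an alternative is to walk the nested metadata and report the dotted path of every occurrence of a key. The function `find_key_paths` is a hypothetical helper, and the `sample` dictionary is a trimmed-down stand-in for a real search result.

```python
def find_key_paths(node, key, path=""):
    """Yield the dotted path of every occurrence of `key` in nested dicts/lists."""
    if isinstance(node, dict):
        for name, child in node.items():
            child_path = f"{path}.{name}" if path else name
            if name == key:
                yield child_path
            yield from find_key_paths(child, key, child_path)
    elif isinstance(node, list):
        # List indices are left out of the path, since the query field
        # names used in this notebook do not include them
        for child in node:
            yield from find_key_paths(child, key, path)


# Trimmed-down stand-in for first_bundle['metadata']
sample = {
    "files": {
        "cell_suspension_json": [{"estimated_cell_count": 1}],
        "specimen_from_organism_json": [{"organ": {"text": "bladder"}}],
    }
}
paths = sorted(set(find_key_paths(sample, "organ")))
```

In the notebook, the same call would be `find_key_paths(first_bundle['metadata'], 'organ')`.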

## Query Condition: Cell Organ Type

By searching through the list of JSON files above, we find that `specimen_from_organism_json` contains information about the organs from which the cell data was taken.

In [7]:
print(json.dumps(
    first_bundle['metadata']['files']['specimen_from_organism_json'],
    indent = 4
))

[
    {
        "biomaterial_core": {
            "biomaterial_description": "Bladder",
            "biomaterial_id": "3_56_F_Bladder",
            "biomaterial_name": "3_56_F_Bladder",
            "ncbi_taxon_id": [
                10090
            ]
        },
        "describedBy": "https://schema.humancellatlas.org/type/biomaterial/10.2.1/specimen_from_organism",
        "genus_species": [
            {
                "ontology": "NCBITaxon:10090",
                "ontology_label": "Mus musculus",
                "text": "Mus musculus"
            }
        ],
        "organ": {
            "ontology": "UBERON:0018707",
            "ontology_label": "bladder organ",
            "text": "bladder"
        },
        "provenance": {
            "document_id": "2436de6c-82fa-4434-8cec-f73cde7b01cb",
            "submission_date": "2019-07-09T19:41:34.179Z",
            "update_date": "2019-07-09T22:37:46.665Z"
        },
        "schema_type": "biomaterial"
    }
]


Now we know where to find the organ type: we search on the field

```
files.specimen_from_organism_json.organ.text
```

and it must match whatever organ we specified above. To execute a query matching this one condition:

In [8]:
organ_query = {
    "query": {
        "match": {
            "files.specimen_from_organism_json.organ.text": organ_type
        }
    }
}

In [9]:
print(client.post_search(es_query=organ_query, replica='aws')['total_hits'])

12543


## Query Condition: Bundle Contains Gene Expression Matrix

We find expression matrices by searching for bundles containing `*.results` files, which are listed in the bundle manifest. Borrowing a query from the "ElasticSearch Queries" tutorial:

In [10]:
filetype_query = {
    "query" : {
        "wildcard": {
            "manifest.files.name": {
                "value": "*.results"
            }
        }
    }
}

In [11]:
print(client.post_search(es_query=filetype_query, replica='aws')['total_hits'])

14178
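To get a feel for which manifest entries a pattern such as `*.results` selects, the pattern can be tried locally against a list of file names. The sketch below uses Python's `fnmatch` module, which only approximates ElasticSearch wildcard semantics (the server matches against indexed terms, so results may differ). The helper `names_matching` and the file name `sample.genes.results` are made up for illustration.

```python
from fnmatch import fnmatchcase


def names_matching(manifest_files, pattern):
    """Return the names of manifest entries matching a shell-style pattern."""
    return [f["name"] for f in manifest_files if fnmatchcase(f["name"], pattern)]


# Stand-in manifest; the last entry is a hypothetical file name
manifest_files = [
    {"name": "links.json"},
    {"name": "SRR6520067_1.fastq.gz"},
    {"name": "sample.genes.results"},
]
matches = names_matching(manifest_files, "*.results")
```

With a real search result, `manifest_files` would be `first_bundle['metadata']['manifest']['files']`.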


## Combining the Queries

To combine the queries, we use a boolean conditional query and specify our two conditions in the `must` key:

In [12]:
query = {
    "query": {
        "bool": {
            "must": [
                {
                    "wildcard": {
                        "manifest.files.name": {
                            "value": "*.results"
                        }
                    }
                },
                {
                    "match": {
                        "files.specimen_from_organism_json.organ.text": organ_type
                    }
                }
            ]
        }
    }
}

Now we run the query, and see how many results we have found:

In [13]:
bundles = client.post_search(es_query=query, replica='aws', output_format='raw')
print(bundles['total_hits'])

2544


In [14]:
first_bundle = bundles['results'][0]

In [15]:
print(first_bundle['metadata']['files']['specimen_from_organism_json'][0]['organ']['text'])

pancreas


## Counting Cells

Now we wish to iterate over the results returned by the search, and count the number of cells in each result. We can see an `estimated_cell_count` field in the metadata, so a first pass might use that field:

In [16]:
def extract_cell_counts(bundle):
    return bundle['metadata']['files']['cell_suspension_json'][0]['estimated_cell_count']

In [18]:
import itertools

count = 0

# Iterate over search results for the chosen organ type.
# islice(..., 1) limits this run to the first result only.
results = client.post_search.iterate(es_query=query, replica='aws', output_format='raw')
for bundle in itertools.islice(results, 1):
    count += extract_cell_counts(bundle)

print('{} cell count: {}'.format(organ_type, count))

pancreas cell count: 1


Unfortunately, **the `estimated_cell_count` field is incorrect!**

To get an accurate count of the number of cells, we need to download the `*.results` file in each result, as that contains the gene expression matrix. This matrix will have one row per cell, so we can count the number of rows in the matrix to determine the total number of cells in that search result.

We need to redefine the `extract_cell_counts()` function above. The version here only outlines the steps; the download and counting logic is not implemented yet, so it returns 0:

In [19]:
def extract_cell_counts(bundle):
    # Go through the bundle manifest and look for the *.results file
    for f in bundle['metadata']['manifest']['files']:
        if f['name'].endswith(".results"):
            # Steps still to be implemented:
            #   1. Get the file UUID
            #   2. Download the file with client.get_file()
            #   3. Write the binary content to a temp file
            #   4. Count the number of lines and return that count
            pass
    # Placeholder result until the steps above are implemented
    return 0

Now we try it again:

In [21]:
import itertools

count = 0

# Iterate over search results for the chosen organ type.
# islice(..., 1) limits this run to the first result only.
results = client.post_search.iterate(es_query=query, replica='aws', output_format='raw')
for bundle in itertools.islice(results, 1):
    count += extract_cell_counts(bundle)

print('{} cell count: {}'.format(organ_type, count))

pancreas cell count: 0
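The steps left as comments in `extract_cell_counts()` could be filled in along the following lines. This is a sketch under several assumptions, not a tested implementation against the DSS: `download` is a hypothetical callable that takes a file UUID and returns the file content as bytes (in the notebook it would wrap `client.get_file()`), each manifest entry is assumed to carry a `uuid` key, and the matrix is assumed to be a text file with one header line. The demonstration data is made up so the code can run offline.

```python
import io


def count_rows(raw_bytes, has_header=True):
    """Count non-empty data rows in a text matrix held in memory."""
    text = io.StringIO(raw_bytes.decode("utf-8"))
    rows = sum(1 for line in text if line.strip())
    # Assumption: the first non-empty line is a header, not a data row
    return max(rows - 1, 0) if has_header else rows


def extract_cell_counts(bundle, download, has_header=True):
    """Sum the row counts of every *.results file listed in the manifest."""
    total = 0
    for f in bundle["metadata"]["manifest"]["files"]:
        if f["name"].endswith(".results"):
            # Assumption: manifest entries expose the file UUID under 'uuid'
            total += count_rows(download(f["uuid"]), has_header)
    return total


# Offline demonstration with a stand-in downloader and made-up content
fake_store = {"uuid-1": b"id\tcount\nrow_a\t3\nrow_b\t5\n"}
fake_bundle = {"metadata": {"manifest": {"files": [
    {"name": "links.json", "uuid": "uuid-0"},
    {"name": "sample.results", "uuid": "uuid-1"},
]}}}
demo_count = extract_cell_counts(fake_bundle, fake_store.__getitem__)
```

Passing the downloader in as an argument keeps the counting logic testable without network access, and avoids committing to a particular return type for `client.get_file()`, which this notebook does not show.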
