# Sum of All Brain Cells

I need to find how many cells of a certain organ type are available in the HCA database. How can I go about doing this?

First, let's get set up.

In [22]:
import hca.dss
client = hca.dss.DSSClient()

client.host = 'https://dss.staging.data.humancellatlas.org/v1'

Now, what type of cells are we looking for? Maybe brain cells?

In [23]:
organ_type = 'brain'

To find what cells appear in the database, we'll need to do some _searching_: specifically, using the __post_search()__ method. `post_search()` has four parameters: `es_query`, `replica`, `output_format`, and `per_page`. The last two, `output_format` and `per_page`, are optional.

Let's not worry about `per_page`; we won't be needing it for this. For `replica`, we'll be using AWS, and for `output_format`, we'll want to use the mode `raw`. Using `raw` will return the verbatim JSON metadata for bundles that match our query, which we'll need in order to view the organ type.

Assuming we don't know what the fields in our Elasticsearch query will look like, we can start by searching for those first. Our query should find all bundles that are of the organ type "brain" and also include a cell count. In the end, it should look something like this:

```python
query = {
    "query" : {
        "bool" : {
            "must" : [
                {
                    "match" : {
                        "some.path.to.the.organ.type" : organ_type
                    }
                },
                {
                    "exists" : { "field" : "some.path.to.the.cell.count" }
                }
            ]
        }
    }
}
```
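
The same query shape can be wrapped in a small function so it is reusable for other organs. This is only a sketch: `build_query` is a hypothetical helper name, not part of the `hca` package, and the field paths passed in below are still the placeholders from the template.

```python
def build_query(organ_field, count_field, organ):
    """Build a bool query: the organ field must match and the count field must exist."""
    return {
        "query": {
            "bool": {
                "must": [
                    {"match": {organ_field: organ}},
                    {"exists": {"field": count_field}},
                ]
            }
        }
    }

# Placeholder paths, to be swapped for the real field names once they are known
example = build_query(
    "some.path.to.the.organ.type",
    "some.path.to.the.cell.count",
    "brain",
)
```

Keeping the paths as arguments means only the call site changes once the real field names are found.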

To find these fields, we can just look at some arbitrary bundle in its `raw` form. Let's do a search over everything to find one. Notice that I'm using the `json` Python module so that our results print nicely.

In [24]:
import json

# An empty query matches every bundle; keep only the first result
result = client.post_search(es_query={}, replica='aws', output_format='raw')['results'][0]
print(json.dumps(result, indent=4, sort_keys=True))

{
    "bundle_fqid": "02be9959-4616-4f24-ad30-c5f0611694e6.2018-11-13T165615.841022Z",
    "bundle_url": "https://dss.staging.data.humancellatlas.org/v1/bundles/02be9959-4616-4f24-ad30-c5f0611694e6?version=2018-11-13T165615.841022Z&replica=aws",
    "metadata": {
        "files": {
            "analysis_file_json": [
                {
                    "describedBy": "http://schema.staging.data.humancellatlas.org/type/file/5.3.4/analysis_file",
                    "file_core": {
                        "file_format": "tsv",
                        "file_name": "barcodes.tsv"
                    },
                    "provenance": {
                        "document_id": "56307381-7bf5-472f-9d90-d1ee73675932",
                        "submission_date": "2018-11-13T16:49:57.454Z",
                        "update_date": "2018-11-13T16:52:51.181Z"
                    },
                    "schema_type": "file"
                },
                {
                    "describedBy": "htt

There we go! It looks kind of messy, but now that we know where the `total_estimated_cells` and `organ` fields lie, we can finish our Elasticsearch query.

In [25]:
query = {
    "query" : {
        "bool" : {
            "must" : [
                {
                    "match" : {
                        "files.specimen_from_organism_json.organ.text" : organ_type
                    }
                },
                {
                    "exists" : { "field" : "files.cell_suspension_json.total_estimated_cells" }
                }
            ]
        }
    }
}
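
If the raw output is too long to scan by eye, a short recursive search can list where a key occurs. The sketch below is an assumption-laden illustration: `find_key_paths` is a hypothetical helper, `sample_metadata` is a hand-made stand-in for the `metadata` object of a raw result, and it assumes (as the paths in the query above suggest) that query paths are written relative to that `metadata` object.

```python
def find_key_paths(node, target, prefix=""):
    """Yield dotted paths to every occurrence of key `target` in nested dicts and lists."""
    if isinstance(node, dict):
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else key
            if key == target:
                yield path
            yield from find_key_paths(value, target, path)
    elif isinstance(node, list):
        # List indices are left out of the path, matching the dotted paths used in the query
        for item in node:
            yield from find_key_paths(item, target, prefix)

# Tiny stand-in for a bundle's metadata; a real bundle has many more fields
sample_metadata = {
    "files": {
        "cell_suspension_json": [{"total_estimated_cells": 1000}],
        "specimen_from_organism_json": [{"organ": {"text": "brain"}}],
    }
}

# De-duplicate, since the same path can occur once per list element
cell_paths = sorted(set(find_key_paths(sample_metadata, "total_estimated_cells")))
organ_paths = sorted(set(find_key_paths(sample_metadata, "organ")))
print(cell_paths)
print(organ_paths)
```

The helper stops at the `organ` key itself; the query matches on the `text` field nested under it.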

Now, let's see if this returns any results. We don't need the `raw` output format here, since all we want from the search is the total number of hits.

In [26]:
client.post_search(es_query=query, replica='aws')['total_hits']

34179

Great, we got something. It looks like we have plenty of bundles containing some number of brain cells. Now we just need to add them up.

In [27]:
# Iterate through all search results for the chosen organ type, adding up
# the estimated cell count of the first cell suspension in each bundle

count = 0
for bundle in client.post_search.iterate(es_query=query, replica='aws', output_format='raw'):
    count += bundle['metadata']['files']['cell_suspension_json'][0]['total_estimated_cells']
print('{} cell count: {}'.format(organ_type, count))

brain cell count: 141770004
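
The loop above reads only the first entry of `cell_suspension_json`. If a bundle can carry more than one cell suspension (an assumption, not something confirmed here), a slightly more defensive sum might look like the sketch below. `bundle_cell_count` is a hypothetical helper and `sample_bundles` is hand-made illustration data, not real search output.

```python
def bundle_cell_count(bundle):
    """Sum total_estimated_cells over every cell suspension in one raw search result."""
    suspensions = bundle["metadata"]["files"].get("cell_suspension_json", [])
    # Treat a suspension without the field as contributing zero cells
    return sum(s.get("total_estimated_cells", 0) for s in suspensions)

# Hand-made stand-ins for raw search results, for illustration only
sample_bundles = [
    {"metadata": {"files": {"cell_suspension_json": [{"total_estimated_cells": 100}]}}},
    {"metadata": {"files": {"cell_suspension_json": [
        {"total_estimated_cells": 200},
        {"total_estimated_cells": 50},
    ]}}},
    {"metadata": {"files": {"cell_suspension_json": [{}]}}},
]

total = sum(bundle_cell_count(b) for b in sample_bundles)
print(total)
```

In the loop above, `count += bundle_cell_count(bundle)` would take the place of the indexed lookup.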


Alright, it looks like we're done here. Summing `total_estimated_cells` over the bundles that matched our query gives roughly 142 million brain cells in total.