# Find the Number of Cells from a Specific Organ Type

I need to find how many cells of a certain organ type are available in the HCA database. How can I go about doing this?

First, let's get set up.

In [1]:
import json

In [2]:
import hca.dss
client = hca.dss.DSSClient()

Now, what type of cells are we looking for? Maybe cells from a pancreas?

In [3]:
organ_type = 'pancreas'

To find what cells appear in the database, we'll need to do some _searching_: specifically, using the __post_search()__ method. `post_search()` has four parameters: `es_query`, `output_format`, `replica`, and `per_page`, two of which are optional: `output_format` and `per_page`.

Let's not worry about `per_page`; we won't be needing it for this. For `replica`, we'll be using AWS, and for `output_format`, we'll want to use the mode `raw`. Using `raw` will return the verbatim JSON metadata for bundles that match our query, which we'll need in order to view details on the organ type.

Assuming we don't yet know which fields to use in our ElasticSearch query, we'll start by looking for them. Our query should find all bundles that have both an _organ type_ matching `organ_type` (here, pancreas) and _at least one cell_ in the sample. In the end, it should look something like this:

```python
query = {
    "query" : {
        "bool" : {
            "must" : [
                {
                    "match" : {
                        "some.path.to.the.organ.type" : organ_type
                    }
                }, {
                    "range" : {
                        "some.path.to.the.total.number.of.cells" : {
                            "gt" : 0
                        }
                    }
                }
            ]
        }
    }
}
```
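
The same structure can be wrapped in a small function so the field paths and organ are filled in once they are known. This is only a sketch: `build_organ_query` and its `min_cells` parameter are hypothetical names introduced here, not part of the `hca` package, and the field paths passed in below are still the placeholders from the template.

```python
def build_organ_query(organ_field, count_field, organ_type, min_cells=0):
    """Build a bool query: organ field matches AND cell count > min_cells.

    Hypothetical helper for illustration; not part of the hca package.
    """
    return {
        "query": {
            "bool": {
                "must": [
                    # Clause 1: the organ field must match the requested organ
                    {"match": {organ_field: organ_type}},
                    # Clause 2: the cell count must be strictly greater than min_cells
                    {"range": {count_field: {"gt": min_cells}}},
                ]
            }
        }
    }


# Placeholder paths, exactly as in the template above
example_query = build_organ_query(
    "some.path.to.the.organ.type",
    "some.path.to.the.total.number.of.cells",
    "pancreas",
)
```

Both clauses sit under `must`, so a bundle has to satisfy the organ match and the cell-count range at the same time.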

To find these fields, we can examine a bundle -- _any_ bundle containing what we need -- in its `raw` form. Let's do a search over everything to find one.

In [4]:
print(json.dumps(
    client.post_search(es_query={}, replica='aws', output_format='raw')['results'][0],
    indent=4))

{
    "bundle_fqid": "fff14b79-c37e-43a4-9abc-c0777314f380.2019-05-14T105022.468000Z",
    "bundle_url": "https://dss.integration.data.humancellatlas.org/v1/bundles/fff14b79-c37e-43a4-9abc-c0777314f380?version=2019-05-14T105022.468000Z&replica=aws",
    "metadata": {
        "files": {
            "cell_suspension_json": [
                {
                    "biomaterial_core": {
                        "biomaterial_description": "HP1506401_H16 cell",
                        "biomaterial_id": "HP1506401_H16",
                        "biomaterial_name": "HP1506401_H16 cell",
                        "biosamples_accession": "SAMEA4438410",
                        "insdc_sample_accession": "ERS1349859",
                        "ncbi_taxon_id": [
                            9606
                        ]
                    },
                    "describedBy": "https://schema.humancellatlas.org/type/biomaterial/13.1.0/cell_suspension",
                    "estimated_cell_count": 1,
     

It's not very pretty, but now that we know where fields go, we can finish our ElasticSearch query.
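
For larger bundles, scanning the raw output by eye gets tedious. As an optional aid, here is a minimal sketch of a helper that walks a nested structure and lists the dotted path to every occurrence of a key. `find_key_paths` is a hypothetical name introduced here, and the `sample` dictionary is a trimmed stand-in for the `metadata` portion of the result shown above, not real search output.

```python
def find_key_paths(node, target, prefix=""):
    """Return sorted dotted paths to every occurrence of key `target`.

    Hypothetical helper for illustration; not part of the hca package.
    """
    paths = []
    if isinstance(node, dict):
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else key
            if key == target:
                paths.append(path)
            # Keep descending: the key may also appear deeper in the tree
            paths.extend(find_key_paths(value, target, path))
    elif isinstance(node, list):
        for item in node:
            # List indices are left out of the path on purpose
            paths.extend(find_key_paths(item, target, prefix))
    return sorted(set(paths))


# Trimmed stand-in for result['metadata'] from the search above
sample = {
    "files": {
        "cell_suspension_json": [
            {
                "biomaterial_core": {"biomaterial_id": "HP1506401_H16"},
                "estimated_cell_count": 1,
            }
        ]
    }
}

print(find_key_paths(sample, "estimated_cell_count"))
# ['files.cell_suspension_json.estimated_cell_count']
```

Passing the `metadata` part of a result, rather than the whole result, yields paths that start at `files`, which is the form used in the query that follows.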

In [5]:
query = {
    "query" : {
        "bool" : {
            "must" : [
                {
                    "match" : {"files.specimen_from_organism_json.organ.text" : organ_type}
                }, {
                    "range" : {
                        "files.cell_suspension_json.estimated_cell_count" : {
                            "gt" : 0
                        }
                    }
                }
            ]
        }
    }
}

Now, let's see if this returns any results. Instead of using a `raw` `output_format` here, let's just use the default, `summary`, for simplicity (since all we need from our search is the total number of hits).

In [6]:
print(client.post_search(es_query=query, replica='aws')['total_hits'])

495


We can have all of the metadata included with the search results by including the keyword `output_format="raw"` with our `post_search()` call:

In [7]:
item = client.post_search(es_query=query, replica='aws', output_format='raw')['results'][0]
print(json.dumps(
    item['metadata']['files']['specimen_from_organism_json'][0]['organ']['text'],
    indent=4))

"pancreas"


Great, we have multiple bundles, each containing at least one cell from our organ of interest. Now we just need to add up their cell counts.

In [12]:
count = 0

# Iterate through all search results matching the chosen organ type
for bundle in client.post_search.iterate(es_query=query, replica='aws', output_format='raw'):
    count += bundle['metadata']['files']['cell_suspension_json'][0]['estimated_cell_count']

print('{} cell count: {}'.format(organ_type, count))

pancreas cell count: 495
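
Note that the loop above reads only the first `cell_suspension_json` entry of each bundle. If a bundle could hold several cell suspensions, or an entry without a count, a more defensive sum might look like the sketch below. `total_estimated_cells` is a hypothetical helper, and `mock_bundles` is made-up data shaped like the `raw` results, used only so the example runs without a network call.

```python
def total_estimated_cells(bundles):
    """Sum estimated_cell_count over every cell suspension in every bundle.

    Hypothetical helper for illustration; missing or null counts add 0.
    """
    total = 0
    for bundle in bundles:
        files = bundle.get("metadata", {}).get("files", {})
        for suspension in files.get("cell_suspension_json", []):
            total += suspension.get("estimated_cell_count") or 0
    return total


# Made-up bundles shaped like the raw search results (not real HCA data)
mock_bundles = [
    {"metadata": {"files": {"cell_suspension_json": [
        {"estimated_cell_count": 1},
        {"estimated_cell_count": 2},
    ]}}},
    {"metadata": {"files": {"cell_suspension_json": [{}]}}},  # count missing
    {"metadata": {"files": {}}},  # no cell suspension at all
]

print(total_estimated_cells(mock_bundles))
# 3
```

Since the function accepts any iterable of bundles, it could presumably be given the iterator returned by `client.post_search.iterate(...)` with `output_format='raw'`, as in the loop above.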


And now we have finished counting the number of cells matching our chosen organ type.