# Find the Number of Cells of a Specific Cell Type

I need to find how many cells of a certain organ type are available in the HCA database. How can I go about doing this?

First, let's get set up.

In [9]:
import hca.dss
client = hca.dss.DSSClient()

Now, what type of cells are we looking for? Maybe cells from a _hematopoietic system_?

In [10]:
organ_type = 'hematopoietic system'

To find what cells appear in the database, we'll need to do some _searching_, specifically with the __post_search()__ method. `post_search()` takes four parameters: `es_query`, `output_format`, `replica`, and `per_page`. Two of them, `output_format` and `per_page`, are optional.

Let's not worry about `per_page`; we won't be needing it for this. For `replica`, we'll use `aws`, and for `output_format`, we'll use the mode `raw`. Using `raw` returns the verbatim JSON metadata for bundles that match our query, which we'll need in order to view details on the organ type.

Assuming we don't yet know the field paths for our Elasticsearch query, we'll start by looking for them. Our query should find all bundles that have both an _organ type_ of hematopoietic system and _at least one cell_ in the sample. In the end, it should look something like this:

```python
query = {
    "query" : {
        "bool" : {
            "must" : [{
                "match" : {
                    "some.path.to.the.organ.type" : organ_type
                }
            }, {
                "range" : {
                    "some.path.to.the.total.number.of.cells" : {
                        "gt" : 0
                    }
                }
            }]
        }
    }
}
```
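Since only the two field paths and the organ name vary, the template above can be wrapped in a small helper. This is a minimal sketch: `build_cell_query` and its parameter names are hypothetical and not part of the `hca` package, and the field paths passed in are still the placeholders from the template.

```python
def build_cell_query(organ_field, cell_count_field, organ_type, min_cells=0):
    """Return an Elasticsearch bool query: organ matches AND cell count > min_cells.

    Hypothetical helper for illustration; not part of the hca client.
    """
    return {
        "query": {
            "bool": {
                "must": [
                    {"match": {organ_field: organ_type}},
                    {"range": {cell_count_field: {"gt": min_cells}}},
                ]
            }
        }
    }


# Placeholder paths, exactly as in the template above.
example_query = build_cell_query(
    "some.path.to.the.organ.type",
    "some.path.to.the.total.number.of.cells",
    "hematopoietic system",
)
```

Once the real field paths are known, the same helper can be called with them instead of the placeholders.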

To find these fields, we can examine a bundle -- _any_ bundle containing what we need -- in its `raw` form. Let's do a search over everything to find one.

In [11]:
client.post_search(es_query={}, replica='aws', output_format='raw')['results'][0]

{'bundle_fqid': '0fd98881-8879-4928-939d-e3583ae0d006.2018-05-01T070337.608744Z',
 'bundle_url': 'https://dss.data.humancellatlas.org/v1/bundles/0fd98881-8879-4928-939d-e3583ae0d006?version=2018-05-01T070337.608744Z&replica=aws',
 'metadata': {'files': {},
  'manifest': {'creator_uid': 0,
   'files': [{'content-type': 'application/json',
     'crc32c': '003ad03e',
     'indexed': True,
     'name': 'cell.json',
     's3-etag': 'fb5aac9f8f0a9da710998cac814a33dc',
     'sha1': '6e1de4809aa92d3e142830fbcf353c2e03556335',
     'sha256': 'cb718e73527242a757431cbbc94b40878e201608064b77ca6b9ebf900c3c9ee6',
     'size': 42,
     'uuid': 'ff17b618-e94e-4c0a-a41a-f339ce917358',
     'version': '2018-05-01T070332.393227Z'},
    {'content-type': 'gzip',
     'crc32c': '942cd9d6',
     'indexed': False,
     'name': 'SRR2967608_2.fastq.gz',
     's3-etag': 'fb9bbafee8a92ced414b3658b1bb9517',
     'sha1': 'bb5c8a68c155bad257cb7b93faef71a116cecba2',
     'sha256': 'c0d11199740a66150b8bb70a0474d8de981

Well, that's not very helpful. The `files` section is empty. Let's try another one.
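Rather than guessing an index, one option is to scan the results for the first bundle whose `files` section is populated. The sketch below works on plain dictionaries shaped like the `raw` results shown above; `first_bundle_with_files` is a hypothetical helper, and `sample_results` is made-up data standing in for a real `['results']` list.

```python
def first_bundle_with_files(results):
    """Return (index, bundle) for the first result with non-empty metadata['files'].

    Returns None when no result has any indexed files.
    Hypothetical helper for illustration; not part of the hca client.
    """
    for index, bundle in enumerate(results):
        if bundle.get("metadata", {}).get("files"):
            return index, bundle
    return None


# Made-up stand-in for client.post_search(...)['results'].
sample_results = [
    {"bundle_fqid": "a", "metadata": {"files": {}}},
    {"bundle_fqid": "b", "metadata": {"files": {"biomaterial_json": {"biomaterials": []}}}},
]
found = first_bundle_with_files(sample_results)
```

Here we'll simply pick a later result by hand.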

In [12]:
client.post_search(es_query={}, replica='aws', output_format='raw')['results'][20]

{'bundle_fqid': '1e352213-af04-4678-b659-f245e924a888.2018-03-29T141756.401707Z',
 'bundle_url': 'https://dss.data.humancellatlas.org/v1/bundles/1e352213-af04-4678-b659-f245e924a888?version=2018-03-29T141756.401707Z&replica=aws',
 'metadata': {'files': {'biomaterial_json': {'biomaterials': [{'content': {'biomaterial_core': {'biomaterial_id': '22011_1#68',
        'has_input_biomaterial': '1134_T',
        'ncbi_taxon_id': [10090]},
       'describedBy': 'https://schema.humancellatlas.org/type/biomaterial/5.1.0/cell_suspension',
       'genus_species': [{'ontology': 'NCBITaxon:10090',
         'text': 'Mus musculus'}],
       'schema_type': 'biomaterial',
       'target_cell_type': [{'ontology': 'CL:0000625', 'text': 'CD8+ T cell'}],
       'total_estimated_cells': 1},
      'hca_ingest': {'accession': '',
       'document_id': 'd7c77c44-7ded-4e26-9046-f851a2e5ee37',
       'submissionDate': '2018-03-28T14:00:02.182Z',
       'updateDate': '2018-03-28T14:48:30.990Z'}},
     {'content': 

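There we go! It's not very pretty, but this bundle shows where `total_estimated_cells` lives: under `files.biomaterial_json.biomaterials.content`. The `organ` field follows the same pattern further down in the output. Now that we know where both fields lie, we can finish our Elasticsearch query.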
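Reading field paths off a deeply nested printout is error-prone, so a short recursive search can help. The sketch below is a hypothetical helper (`find_key_paths` is not part of the `hca` package) that walks a bundle's `metadata` and reports the dotted path to a key, skipping list indices so the result matches the dotted form used in the query. `sample_metadata` is a hand-trimmed stand-in for the output above.

```python
def find_key_paths(node, target, prefix=""):
    """Yield dotted paths to every occurrence of the key `target`.

    List indices are skipped, so paths use the same dotted form as the query fields.
    Hypothetical helper for illustration; not part of the hca client.
    """
    if isinstance(node, dict):
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else key
            if key == target:
                yield path
            yield from find_key_paths(value, target, path)
    elif isinstance(node, list):
        for item in node:
            yield from find_key_paths(item, target, prefix)


# Hand-trimmed stand-in for bundle['metadata'] from the output above.
sample_metadata = {
    "files": {
        "biomaterial_json": {
            "biomaterials": [
                {"content": {"total_estimated_cells": 1}},
            ]
        }
    }
}
paths = sorted(set(find_key_paths(sample_metadata, "total_estimated_cells")))
```

On a real bundle, the same call with `"organ"` as the target would list where that field appears.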

In [13]:
query = {
    "query" : {
        "bool" : {
            "must" : [{
                "match" : {
                    "files.biomaterial_json.biomaterials.content.organ.text" : organ_type
                }
            }, {
                "range" : {
                    "files.biomaterial_json.biomaterials.content.total_estimated_cells" : {
                        "gt" : 0
                    }
                }
            }]
        }
    }
}

Now, let's see if this returns any results. Instead of using a `raw` `output_format` here, let's just use the default, `summary`, for simplicity (since all we need from our search is the total number of hits).

In [14]:
client.post_search(es_query=query, replica='aws')['total_hits']

4

Great, we got something. It looks like we have 4 bundles containing at least one hematopoietic system cell. Now we just need to add up their cell counts.

In [15]:
count = 0

# Iterate through all search results matching the chosen organ type
for bundle in client.post_search.iterate(es_query=query, replica='aws', output_format='raw'):
    # Only the first biomaterial in each bundle is read for its cell estimate
    count += bundle['metadata']['files']['biomaterial_json']['biomaterials'][0]['content']['total_estimated_cells']

print('{} cell count: {}'.format(organ_type, count))

hematopoietic system cell count: 17397


Alright, it looks like we're done here. Our program counted 17397 cells in total across the four matching bundles.
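The loop above reads only the first biomaterial of each bundle and assumes the field is always present. A more defensive variant, sketched below, sums `total_estimated_cells` over every biomaterial that carries the field and skips the rest. `count_cells` is a hypothetical helper, `sample_bundles` is made-up data, and on real data this variant may give a different total from the one printed above if several biomaterials in a bundle report an estimate.

```python
def count_cells(bundles):
    """Sum total_estimated_cells over every biomaterial in every bundle.

    Bundles or biomaterials that lack the field contribute zero.
    Hypothetical helper for illustration; not part of the hca client.
    """
    total = 0
    for bundle in bundles:
        files = bundle.get("metadata", {}).get("files", {})
        biomaterials = files.get("biomaterial_json", {}).get("biomaterials", [])
        for biomaterial in biomaterials:
            total += biomaterial.get("content", {}).get("total_estimated_cells") or 0
    return total


# Made-up stand-in for the raw search results.
sample_bundles = [
    {"metadata": {"files": {"biomaterial_json": {"biomaterials": [
        {"content": {"total_estimated_cells": 1}},
        {"content": {}},  # no estimate on this biomaterial
    ]}}}},
    {"metadata": {"files": {"biomaterial_json": {"biomaterials": [
        {"content": {"total_estimated_cells": 3}},
    ]}}}},
    {"metadata": {"files": {}}},  # bundle with nothing indexed
]
sample_total = count_cells(sample_bundles)
```

Because the function accepts any iterable of bundles, the iterator from `client.post_search.iterate(es_query=query, replica='aws', output_format='raw')` could be passed to it directly.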