# Writing ElasticSearch Queries

This notebook gives an overview of how to write an ElasticSearch (ES) query to programmatically find (and download) data from the Human Cell Atlas Data Store (DSS).

All of the consumer vignette notebooks use a type of ES query called a "request body search", which means the search is provided as a JSON document.

## Running a Basic Query

We start with the most basic query: an empty query. First we need a DSS client:

In [1]:
import hca.dss
client = hca.dss.DSSClient()

In [2]:
my_query = {}

To run this ES query, we use the [`post_search()`](#) function of the HCA Python API:

In [3]:
bundles = client.post_search(es_query=my_query, replica='aws')

print("post_search found %d results"%(bundles['total_hits']))
print("post_search returned %d results"%(len(bundles['results'])))

post_search found 704548 results
post_search returned 100 results


`bundles` is a dictionary, and the ElasticSearch query results are stored under the `results` key:

In [4]:
print(bundles.keys())

dict_keys(['es_query', 'results', 'total_hits'])


### Bundles Found vs Bundles Returned

When you run an ES query, thousands of results may match. The ES search engine paginates results, meaning it returns them in chunks. That's why `bundles['total_hits']` is larger than `len(bundles['results'])`: `results` holds only the first page, while `total_hits` counts every item matching the search.

The total number of bundles found is always returned with the query results under the `total_hits` key:

In [5]:
print(bundles['total_hits'])

704548


If we need to iterate over all search results without handling pagination ourselves, we can get an iterator over every result found by ES by using the `.iterate()` method of `post_search` (as per the [HCA CLI documentation](https://hca.readthedocs.io/en/latest/api.html#hca.dss.DSSClient.post_search)):

```
for result in client.post_search.iterate(...):
    ...
```
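While exploring, it can be handy to cap how many results are pulled from such an iterator. The sketch below uses `itertools.islice` for that. The names `take` and `fake_results` are hypothetical: `fake_results` is only a stand-in for `client.post_search.iterate(...)` so that the example runs without network access.

```python
from itertools import islice

def take(results, limit):
    """Return at most `limit` items from any iterable of search results."""
    return list(islice(results, limit))

def fake_results():
    """Hypothetical stand-in for client.post_search.iterate(...)."""
    n = 0
    while True:
        yield {"bundle_fqid": "bundle-%d" % n}
        n += 1

# Only three items are consumed, even though the generator never ends
sample = take(fake_results(), 3)
print([r["bundle_fqid"] for r in sample])
```

Because `islice` is lazy, only as many results as requested are consumed from the underlying iterator.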

### Including Metadata in Results

To get the metadata, we should include the keyword argument `output_format='raw'` with our search:

In [6]:
bundles = client.post_search(
    es_query=my_query, replica='aws', output_format='raw')

We can inspect the results to determine the metadata schema, and use that to assemble more complicated queries:

In [7]:
first_bundle = bundles['results'][0]

import json
print(json.dumps(
    first_bundle,
    indent=4))

{
    "bundle_fqid": "ffffba2d-30da-4593-9008-8b3528ee94f1.2019-08-01T200147.309074Z",
    "bundle_url": "https://dss.data.humancellatlas.org/v1/bundles/ffffba2d-30da-4593-9008-8b3528ee94f1?version=2019-08-01T200147.309074Z&replica=aws",
    "metadata": {
        "files": {
            "cell_suspension_json": [
                {
                    "biomaterial_core": {
                        "biomaterial_description": "Bladder",
                        "biomaterial_id": "G5.B000610.3_56_F.1.1_B000610_3_56_F_Bladder_",
                        "ncbi_taxon_id": [
                            10090
                        ]
                    },
                    "describedBy": "https://schema.humancellatlas.org/type/biomaterial/13.1.1/cell_suspension",
                    "estimated_cell_count": 1,
                    "genus_species": [
                        {
                            "ontology": "NCBITaxon:10090",
                            "ontology_label": "Mus musculus",
   

## Boolean Conditional Queries

Most of the queries we will run will consist of boolean conditional tests, so we will show how to assemble such queries here.

All ES queries must include a `query` element in the search body. When we specify a set of boolean conditional checks, we must put them in a `bool` directive and wrap each condition in a `must` or `must_not` directive:

```json
my_query = {
    "query": {
        "bool": {
            "must": [
                ...snip...
            ]
        }
    }
}
```

where the `must` directive contains the list of boolean conditions that must be met by each search result.
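As a minimal sketch of this structure, the hypothetical helper `build_bool_query` below assembles the request body from lists of conditions. The field names `foo.bar` and `fizz.buz` are placeholders, not real DSS metadata fields.

```python
def build_bool_query(must=None, must_not=None):
    """Assemble a request body with a top-level bool query."""
    bool_clause = {}
    if must:
        # Every condition in this list has to hold for a result to match
        bool_clause["must"] = list(must)
    if must_not:
        # No condition in this list may hold for a result to match
        bool_clause["must_not"] = list(must_not)
    return {"query": {"bool": bool_clause}}

example_query = build_bool_query(
    must=[{"match": {"foo.bar": "baz"}}],
    must_not=[{"match": {"fizz.buz": "wuz"}}],
)
print(example_query)
```

The resulting dictionary can be passed as the `es_query` argument in the same way as the hand-written queries in this notebook.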

### Inspecting Metadata to Help Assemble Queries

The next step is to figure out what fields to use to assemble the boolean checks. We can use the metadata from the first result of the empty query (above) to pick out an interesting metadata field. Some of the JSON files in the bundle hold metadata, so we can use `first_bundle['metadata']['files']` to look at the metadata contained in various JSON files:

In [8]:
first_bundle_metadata = first_bundle['metadata']['files']
print("\n".join(first_bundle_metadata.keys()))

cell_suspension_json
collection_protocol_json
donor_organism_json
enrichment_protocol_json
library_preparation_protocol_json
links_json
process_json
project_json
sequence_file_json
sequencing_protocol_json
specimen_from_organism_json


If we look in the `specimen_from_organism_json` metadata file, we can see it contains the organ this sample is from.

In [9]:
print(first_bundle_metadata['specimen_from_organism_json'][0]['organ']['text'])

bladder


Each metadata JSON contains different types of information - for example, `project_json` contains information about publications related to the data set, so we can search using DOI numbers too:

In [10]:
print(first_bundle_metadata['project_json'][0]['publications'][0]['doi'])

10.1101/237446
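
Reading nested metadata by hand gets tedious, so it can help to list every field path a bundle contains. The sketch below is one way to do that. The helper name `iter_field_paths` is hypothetical, and `sample_files` is a trimmed, hand-written copy of the two values printed above rather than a full bundle. The `files` prefix follows the field names used by the queries in this notebook.

```python
def iter_field_paths(node, prefix=""):
    """Yield a dotted path for every leaf value, skipping list indices."""
    if isinstance(node, dict):
        for key, value in node.items():
            child = key if not prefix else prefix + "." + key
            yield from iter_field_paths(value, child)
    elif isinstance(node, list):
        # List items share the path of their parent key
        for item in node:
            yield from iter_field_paths(item, prefix)
    else:
        yield prefix

sample_files = {
    "specimen_from_organism_json": [{"organ": {"text": "bladder"}}],
    "project_json": [{"publications": [{"doi": "10.1101/237446"}]}],
}

paths = sorted(set(iter_field_paths(sample_files, "files")))
print("\n".join(paths))
```

With a real result, `first_bundle['metadata']['files']` could be passed in place of `sample_files`.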


### Creating the Boolean Conditional Check

To actually use this field in a boolean conditional in an ES query, we refer to it by joining the keys we used to access the field with periods. Note that the field path used in queries omits the leading `metadata` key of the raw result and any list indices such as `[0]`, so it starts with the `files` key:

```
files.specimen_from_organism_json.organ.text
```

We stipulate that this field must match a value that we provide, such as "pancreas":

```json
"match" : {
    "files.specimen_from_organism_json.organ.text" : "pancreas"
}
```

and if we have multiple conditions, we provide each boolean conditional check wrapped in a dictionary, with all of them provided together in a list:

```json
"must" : [
    {
        "match" : {
            "files.specimen_from_organism_json.organ.text" : "pancreas"
        }
    },
    {
        "match" : {
            "files.project_json.publications.doi" : "10.1016/j.cmet.2016.08.020"
        }
    }
]
```

In [11]:
organ_type = 'liver'
organ_query = {
    "query" : {
        "bool" : {
            "must" : [{
                "match" : {
                    "files.specimen_from_organism_json.organ.text" : organ_type
                }
            }]
        }
    }
}

Now we can run the search with our new query:

In [12]:
organ_bundles = client.post_search(
    es_query=organ_query, replica='aws', output_format='raw')

print("post_search found %d results"%(organ_bundles['total_hits']))
print("post_search returned %d results"%(len(organ_bundles['results'])))

post_search found 1914 results
post_search returned 10 results


As covered in the "Bundles Found vs Bundles Returned" section, the results of `post_search()` are paginated. We can get the total number of matches using the `total_hits` key. We can also use the `post_search.iterate()` method to iterate over all results:

```python
for org_bundle in client.post_search.iterate(
                            es_query = organ_query,
                            replica = 'aws',
                            output_format = 'raw'):
    
    ...
```
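As an illustration of what the loop body might do, the sketch below tallies the organ recorded in each raw result. The helper `count_organs` and the `sample_results` list are hypothetical. The sample mimics only the small part of the raw result structure that this notebook has shown, so real results may need extra handling.

```python
from collections import Counter

def count_organs(results):
    """Tally organ.text values across raw search results."""
    counts = Counter()
    for result in results:
        files = result.get("metadata", {}).get("files", {})
        # A bundle may list zero or more specimens
        for specimen in files.get("specimen_from_organism_json", []):
            organ = specimen.get("organ", {}).get("text")
            if organ is not None:
                counts[organ] += 1
    return counts

liver = {"specimen_from_organism_json": [{"organ": {"text": "liver"}}]}
sample_results = [
    {"metadata": {"files": liver}},
    {"metadata": {"files": liver}},
    {"metadata": {"files": {}}},  # no specimen metadata at all
]
print(count_organs(sample_results))
```

Since the function accepts any iterable, the iterator returned by `client.post_search.iterate(...)` could be passed in directly.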

## Match Queries for Single-Condition Checks

The organ query we just ran has only a single condition to match, so we can specify a `match` query instead of a `bool` query to make the query just a little simpler:

In [13]:
def print_query_hits(query):
    """Utility method to run a search and print how many results were found/returned"""
    bundles_ = client.post_search(
        es_query = query,
        replica = 'aws',
        output_format = 'raw'
    )
    print("post_search() found %d results"%(bundles_['total_hits']))
    print("post_search() returned %d results"%(len(bundles_['results'])))
    
def get_query_hits(query):
    """Utility method to run a search and pass along what is returned"""
    return client.post_search(
        es_query = query,
        replica = 'aws',
        output_format = 'raw'
    )

In [14]:
simpler_organ_query = {
    "query" : {
        "match" : {
            "files.specimen_from_organism_json.organ.text" : organ_type
        }
    }
}

In [15]:
print_query_hits(simpler_organ_query)

post_search() found 1914 results
post_search() returned 10 results


### Match Queries vs Boolean Conditional Queries

The boolean conditional method, while more complicated, is also more flexible.

With a plain `match` query, unlike a boolean conditional search, we _cannot_ specify multiple match conditions, since the `query` key cannot store a list:

```python
############ THIS IS AN INVALID QUERY ############
{
    "query" : [
        {
            "match" : {
                "foo.bar" : "baz"
            }
        },
        {
            "match" : {
                "fizz.buz" : "wuz"
            }
        }
    ]
}
```

Likewise, the `match` dictionaries cannot take multiple key-value pairs:

```python
############ THIS IS AN INVALID QUERY ############
{
    "query" : {
        "match" : {
            "foo.bar" : "baz",
            "fizz.buz" : "wuz"
        }
    }
}
```

Combining conditions therefore requires a `bool` query with one `match` clause per condition, which means that queries become long and deeply nested very quickly.
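One way to keep such queries manageable is to generate them. The sketch below uses a hypothetical helper, `all_of`, to turn a flat dictionary of field/value pairs into a valid `bool` query with one `match` clause per field. The placeholder field names are the same ones used in the invalid examples above.

```python
def all_of(conditions):
    """Turn {field: value, ...} into a bool query with one match per field."""
    must = [{"match": {field: value}} for field, value in conditions.items()]
    return {"query": {"bool": {"must": must}}}

valid_query = all_of({"foo.bar": "baz", "fizz.buz": "wuz"})
print(valid_query)
```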

## Wildcard and Regular Expression Queries

Sometimes the exact content of a field is unknown or only partially known (for example, when searching for data related to either the small or large intestine, or for filenames with a given extension). For these situations, we can use `wildcard` or `regexp` queries.

In [16]:
wc_query = {
    "query": {
        "wildcard": {
            "manifest.files.name": {
                "value": "*.fastq.gz"
            }
        }
    }
}

In [17]:
re_query = {
    "query": {
        "regexp": {
            "files.biomaterial_json.biomaterials.content.target_cell_type.text": {
                "value": ".*T\\ cell" # Gives us any type of T cell
            }
        }
    }
}

### Wildcard or RegEx?

Wildcard conditions are appropriate when all you need is to specify text and wildcards (e.g., "search for files that end in .fastq.gz").

A regular expression can be more surgical and exact, and is better when you have a more specific request (e.g., "search for bundles containing .fastq.gz files whose names consist only of digits").
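The difference can be tried out locally before sending a query. The sketch below only approximates the two query types with Python's `fnmatch` and `re` modules: Elasticsearch applies its own wildcard and regular expression syntax to indexed terms, so results on the DSS may differ. `fullmatch` is used because Elasticsearch regular expressions must match the whole term. All filenames except `SRR3562314_1.fastq.gz` are made up for illustration.

```python
import re
from fnmatch import fnmatchcase

names = [
    "SRR3562314_1.fastq.gz",
    "12345.fastq.gz",
    "12345.fastq.gz.md5",
    "notes.txt",
]

# Wildcard: any name ending in .fastq.gz
wildcard_hits = [n for n in names if fnmatchcase(n, "*.fastq.gz")]

# Regular expression: names made only of digits, followed by .fastq.gz
digits_only = re.compile(r"[0-9]+\.fastq\.gz")
regexp_hits = [n for n in names if digits_only.fullmatch(n)]

print(wildcard_hits)
print(regexp_hits)
```

The wildcard pattern keeps both FASTQ files, while the regular expression narrows the selection to the one whose name consists only of digits.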

## Filename Queries

Being able to search the DSS for particular file types is important. For example, it can help to find gene expression matrices, FASTQ files, and other useful data. This section covers the use of the manifest metadata to add query conditions based on filenames.

### Exact Filename Matches

To search for exact filenames, we can use a `match` query that looks at each bundle's manifest (the part of the metadata that stores a list of files contained in the bundle) for filenames matching a given name:

In [18]:
exact_filename_query = {
    "query": {
        "match": {
            "manifest.files.name": "SRR3562314_1.fastq.gz"
        }
    }
}

In [19]:
print_query_hits(exact_filename_query)

post_search() found 2 results
post_search() returned 2 results


While an exact filename search has its uses, it is rare that we know the exact name of the file we are looking for. Instead, we want a search type that allows for wildcard patterns.

### Wildcard Filename Matches

Instead of using a `match` condition, we can use a `wildcard` condition to find bundles with files matching a wildcard expression, like `*.fastq.gz`.

In [20]:
fastq_query = {
    "query": {
        "wildcard": {
            "manifest.files.name": {
                "value" : "*.fastq.gz"
            }
        }
    }
}

In [21]:
print_query_hits(fastq_query)

post_search() found 150588 results
post_search() returned 10 results


We can then print the names of all `*.fastq.gz` files in the bundles that are returned, by accessing the manifest via the metadata:

In [22]:
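def print_fastq_files_in_bundle(bundle_result):
    """Print the name of every *.fastq.gz file listed in a bundle's manifest"""
    # The manifest lists every file contained in the bundle
    files_list_ = bundle_result['metadata']['manifest']['files']
    print("\nBundle %s:"%(bundle_result['bundle_fqid']))
    for f in files_list_:
        # Include the leading dot so that only the .fastq.gz extension matches
        if f['name'].endswith(".fastq.gz"):
            print(" -", f['name'])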

In [23]:
fastq_bundles = get_query_hits(fastq_query)

for bundle in fastq_bundles['results']:
    print_fastq_files_in_bundle(bundle)


Bundle ffffba2d-30da-4593-9008-8b3528ee94f1.2019-08-01T200147.309074Z:
 - SRR6520067_1.fastq.gz
 - SRR6520067_2.fastq.gz

Bundle ffffaf55-f19c-40e3-aa81-a6c69d357265.2019-08-01T200147.836832Z:
 - SRR6579532_1.fastq.gz
 - SRR6579532_2.fastq.gz

Bundle fffee4fc-7bee-4989-8656-e61619da0e70.2019-08-01T200146.910993Z:
 - SRR6611699_1.fastq.gz
 - SRR6611699_2.fastq.gz

Bundle fffe55c1-18ed-401b-aa9a-6f64d0b93fec.2019-05-17T233932.932000Z:
 - ERR2459896_1.fastq.gz
 - ERR2459896_2.fastq.gz

Bundle fffcea5e-2e6c-4ca1-9aa9-c23b90b2e8b8.2019-05-16T211813.059000Z:
 - E18_20161004_Neurons_Sample_14_S083_L005_I1_002.fastq.gz
 - E18_20161004_Neurons_Sample_14_S083_L005_R1_002.fastq.gz
 - E18_20161004_Neurons_Sample_14_S083_L005_R2_002.fastq.gz

Bundle fffcc997-3121-42af-80ca-33d1cb06f509.2019-08-01T200147.451270Z:
 - SRR6530626_1.fastq.gz
 - SRR6530626_2.fastq.gz

Bundle fffc09cc-4b69-40ea-9fbe-1086e9ac25eb.2019-08-01T200147.153304Z:
 - SRR6604368_1.fastq.gz
 - SRR6604368_2.fastq.gz

Bundle fffbe17d

### Using Multiple Filename Conditionals

Multiple filename conditionals can be combined in a single query. For example, we can search for bundles that contain both `*.fastq.gz` files (sequencing data) and `*.results` files (gene expression matrices).

In [24]:
multifile_query = {
    "query": {
        "bool": {
            "must": [
                {
                    "wildcard": {
                        "manifest.files.name": {
                            "value": "*.fastq.gz"
                        }
                    }
                },
                {
                    "wildcard": {
                        "manifest.files.name": {
                            "value": "*.results"
                        }
                    }
                }
            ]
        }
    }
}

In [25]:
print_query_hits(multifile_query)

post_search() found 14178 results
post_search() returned 10 results
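
To double-check returned bundles on the client side, each manifest can be tested against the same patterns. The sketch below is an approximation that uses Python's `fnmatch` rather than Elasticsearch's own matching. The helper `has_all_patterns` is hypothetical, and the file `sample.genes.results` in the hand-written sample bundle is made up.

```python
from fnmatch import fnmatchcase

def has_all_patterns(bundle_result, patterns):
    """True if every pattern is matched by at least one file in the manifest."""
    files = bundle_result["metadata"]["manifest"]["files"]
    names = [f["name"] for f in files]
    return all(any(fnmatchcase(n, p) for n in names) for p in patterns)

sample_bundle = {
    "metadata": {
        "manifest": {
            "files": [
                {"name": "SRR6520067_1.fastq.gz"},
                {"name": "sample.genes.results"},
            ]
        }
    }
}

print(has_all_patterns(sample_bundle, ["*.fastq.gz", "*.results"]))
print(has_all_patterns(sample_bundle, ["*.bam"]))
```

Each pattern is satisfied by a different file here, which is the expected case: a bundle qualifies when every pattern matches at least one file, not when a single file matches all patterns.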


## Searching Across All Metadata

It is possible to use the reserved ES keyword `_all` to search every metadata field. This can be used in both match queries and boolean conditional queries.

If we are interested in any Human Cell Atlas data that mentions a particular term anywhere in its metadata, we can use the `_all` key in a match query to search across all metadata:

In [26]:
term_query = {
    "query": {
        "match": {
            "_all": "mouse"
        }
    }
}

In [27]:
print_query_hits(term_query)

post_search() found 119503 results
post_search() returned 10 results


Likewise, we can perform a boolean conditional search to find data relating to a specific organ type and mentioning a particular term in any metadata field:

In [28]:
term_organ_query = {
    "query" : {
        "bool" : {
            "must" : [
                {
                    "match" : {
                        "files.specimen_from_organism_json.organ.text" : organ_type
                    }
                },
                {
                    "match" : {
                        "_all" : "mouse"
                    }
                }
            ]
        }
    }
}

In [29]:
print_query_hits(term_organ_query)

post_search() found 1899 results
post_search() returned 10 results
