# Download all bundles for T cells sequenced with 10x

Suppose I want to get all bundles that contain T cells _and_ were sequenced using 10x. How should I go about doing this?

For those short on time, here are the steps in a nutshell:

> 1) Write an Elasticsearch query with a `bool` clause containing a `must` list, and then add each condition you want to require to that list (in this case, a `match` and a `regexp`). See the query below for an example.

> 2) Execute the `post_search` method using the Elasticsearch query you wrote, and from the search results get a bundle.

> 3) If your search returned no results, it may be helpful to try **a)** checking that the paths to your fields are correct, and/or **b)** disassembling your search and executing it one piece at a time.
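
For step 2, note that indexing `results[0]` raises an `IndexError` when a search comes back empty. The following is a minimal sketch of a guard, assuming the response is a dict with `results` and `total_hits` keys as in the outputs further down. The name `first_bundle` and the stand-in responses are invented for illustration and are not part of the `hca` package.

```python
def first_bundle(search_results):
    """Return the first search result, or None if the search found nothing."""
    results = search_results.get("results", [])
    if not results:
        # An empty result list is the cue to re-check field paths
        # or to run the query one piece at a time.
        print("No results; total_hits =", search_results.get("total_hits", 0))
        return None
    return results[0]


# Stand-in responses shaped like the ones in this vignette (not real data).
empty_response = {"results": [], "total_hits": 0}
one_hit_response = {"results": [{"metadata": {}}], "total_hits": 1}

print(first_bundle(empty_response))
print(first_bundle(one_hit_response))
```

With the real client, the argument would be the value returned by `client.post_search(...)`.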

And now for the in-depth answer. First, we'll need a query to search with; it might be a little more complicated than the ones used in previous vignettes, but the process overall is simple.

In [2]:
query = {
    "query": {
        "bool": {
            "must": [
                {
                    "match": {
                        "files.dissociation_protocol_json.dissociation_method.text": "10x_v2"
                    }
                },
                {
                    "regexp": {
                        "files.cell_suspension_json.selected_cell_type.text": {
                            "value": ".*T\\ cell" # Gives us any type of T cell
                        }
                    }
                }
            ]
        }
    }
}

This query should give us all bundles with a dissociation method matching *10x_v2* and a target cell type matching *any type of T cell*. Keep in mind that while the use of the wildcard characters **`.*`** is convenient for finding a value in an unknown format, it _can_ make searches slow. However, for this example, let's not worry about performance.
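
One detail worth knowing is that Elasticsearch `regexp` patterns are anchored: the pattern must match the entire indexed term, not just part of it. The sketch below approximates that behaviour with Python's `re.fullmatch`. It assumes the field is indexed as a single un-analyzed term, and Python's regular expression syntax is only a stand-in for the Lucene syntax Elasticsearch uses. The cell type labels are hypothetical examples, not values taken from the data store.

```python
import re

# The same pattern as in the query; "\\ " is an escaped space.
pattern = ".*T\\ cell"


def matches_whole_value(value):
    # re.fullmatch requires the pattern to cover the whole string,
    # which mimics Elasticsearch anchoring the pattern to the whole term.
    return re.fullmatch(pattern, value) is not None


# Hypothetical labels, for illustration only.
for label in ["T cell", "CD8-positive, alpha-beta T cell", "T cell receptor", "B cell"]:
    print(label, "->", matches_whole_value(label))
```

Under these assumptions, the pattern accepts any value that ends in "T cell" and rejects values where other text follows it.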

If you're wondering how to find the paths to these fields, [this previous vignette](https://github.com/HumanCellAtlas/data-consumer-vignettes/tree/master/tasks/Find%20Cell%20Type%20Count#find-cell-type-count) might be helpful.

Now, let's give the query a try.

In [3]:
import json
import hca.dss
client = hca.dss.DSSClient()

client.host = 'https://dss.staging.data.humancellatlas.org/v1'

# Print the first bundle we get from this query

search_results = client.post_search(es_query=query, replica='aws', output_format='raw')
print(json.dumps(search_results['results'][0], indent=4, sort_keys=True))

IndexError: list index out of range

...Well, that didn't exactly work out like we were hoping. What went wrong?

If the list index is out of range, it must mean that the search returned no results. Let's see...

In [4]:
print(search_results['total_hits'])

0


Aha, we've found the problem. Let's try simplifying the search a little, this time looking only for 10x sequencing.

In [5]:
query = {
    "query": {
        "match": {
            "files.dissociation_protocol_json.dissociation_method.text": "10x_v2"
        }
    }
}

Now that we've abandoned half the query, we should get some results.

In [6]:
search_results = client.post_search(es_query=query, replica='aws', output_format='raw')
print(search_results['total_hits'])

14070


That's a lot of bundles with 10x sequencing. Why aren't we getting any that also contain data on T cells?

Let's search using the other half of the query and find out.

In [7]:
query = {
    "query": {
        "regexp": {
            "files.cell_suspension_json.selected_cell_type.text": {
                "value": ".*T\\ cell"
            }
        }
    }
}

Okay, let's see how many bundles with T cells there are.

In [8]:
search_results = client.post_search(es_query=query, replica='aws', output_format='raw')
print(search_results['total_hits'])

1183
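
The piece-by-piece check in cells 5 through 8 could also be automated. Below is a sketch that splits a `bool`/`must` query into one standalone query per clause and reports the hit count for each. The helper names are invented for illustration, and `fake_search` is a stand-in that simply returns the two counts printed above so that the example runs without network access.

```python
def split_must_clauses(query):
    """Yield one standalone query per clause of a bool/must query."""
    for clause in query["query"]["bool"]["must"]:
        yield {"query": clause}


def hits_per_clause(query, search):
    """Run each clause on its own.

    `search` is any callable that takes a query dict and returns a dict
    with a 'total_hits' key.
    """
    return [
        (next(iter(piece["query"])), search(piece)["total_hits"])
        for piece in split_must_clauses(query)
    ]


compound = {
    "query": {
        "bool": {
            "must": [
                {"match": {"files.dissociation_protocol_json.dissociation_method.text": "10x_v2"}},
                {"regexp": {"files.cell_suspension_json.selected_cell_type.text": {"value": ".*T\\ cell"}}},
            ]
        }
    }
}


def fake_search(piece):
    # Stand-in for the real search; the counts are copied from the outputs above.
    counts = {"match": 14070, "regexp": 1183}
    return {"total_hits": counts[next(iter(piece["query"]))]}


for kind, hits in hits_per_clause(compound, fake_search):
    print(kind, hits)
```

With the real client, `search` could be `lambda q: client.post_search(es_query=q, replica='aws', output_format='raw')`.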


It looks like there are also plenty of bundles with T cells, so why aren't we getting any that also have 10x sequencing? Maybe we can examine part of a bundle to get a better idea of what's going on.

In [9]:
print(json.dumps(search_results['results'][0]['metadata']['files']['dissociation_protocol_json'][0], indent=4, sort_keys=True))

{
    "describedBy": "http://schema.staging.data.humancellatlas.org/type/protocol/biomaterial_collection/5.0.3/dissociation_protocol",
    "dissociation_method": {
        "ontology": "EFO:0009129",
        "text": "mechanical dissociation"
    },
    "protocol_core": {
        "document": "TissueDissociationProtocol.pdf",
        "protocol_id": "tissue_dissociation_protocol",
        "protocol_name": "Extracting cells from lymph nodes"
    },
    "provenance": {
        "document_id": "40056e47-131d-4c6e-a884-a927bfccf8ce",
        "submission_date": "2018-09-13T18:02:23.415Z",
        "update_date": "2018-09-13T18:02:29.781Z"
    },
    "schema_type": "protocol"
}


Looking at the first bundle, it would seem that the cells recorded here were prepared by mechanical dissociation, not 10x. What about the other bundles?

In [10]:
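# Check whether '10x_v2' appears anywhere in the returned results
print(any('10x_v2' in json.dumps(result) for result in search_results['results']))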

False


Well, there's our answer! None of the returned T cell bundles mention 10x, which is consistent with the zero hits from our original compound query: there aren't any bundles containing data on both T cells and 10x sequencing.
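
To see which dissociation methods the returned bundles do use, one option is to tally them. This is a sketch that assumes each result has the `metadata` → `files` → `dissociation_protocol_json` layout printed above. The function name and the sample data are invented for illustration. It only inspects the results actually present in `search_results['results']`, which may be fewer than `total_hits`.

```python
from collections import Counter


def dissociation_methods(results):
    """Count dissociation_method.text values across a list of search results."""
    counts = Counter()
    for result in results:
        files = result.get("metadata", {}).get("files", {})
        for protocol in files.get("dissociation_protocol_json", []):
            counts[protocol.get("dissociation_method", {}).get("text")] += 1
    return counts


# Stand-in result mirroring the structure of the bundle printed above.
sample_results = [
    {
        "metadata": {
            "files": {
                "dissociation_protocol_json": [
                    {
                        "dissociation_method": {
                            "ontology": "EFO:0009129",
                            "text": "mechanical dissociation",
                        }
                    }
                ]
            }
        }
    }
]

print(dissociation_methods(sample_results))
```

With the real data, the call would be `dissociation_methods(search_results['results'])`.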

Still, I'm not quite satisfied yet; I want to actually see some results from a compound query. I know that hematopoietic system tissue (which often contains T cells) is commonly broken down by _mechanical dissociation_. Maybe we can find some bundles with data on T cells and this method instead.

In [11]:
query = {
    "query": {
        "bool": {
            "must": [
                {
                    "match": {
                        "files.dissociation_protocol_json.dissociation_method.text": "mechanical dissociation"
                    }
                },
                {
                    "regexp": {
                        "files.cell_suspension_json.selected_cell_type.text": {
                            "value": ".*T\\ cell"
                        }
                    }
                }
            ]
        }
    }
}

And now to run a search on it...

In [12]:
search_results = client.post_search(es_query=query, replica='aws', output_format='raw')
print(search_results['total_hits'])

1183


There we go! Our query with two conditions worked. It returns 1183 bundles, the same number as the T cell query on its own, so it looks like all the bundles with T cell data have a _mechanical_ dissociation method.
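
The two compound queries in this vignette differ only in the value passed to `match`, so they could be generated by small helpers. This is a sketch; the function and constant names are invented for illustration and are not part of the `hca` package.

```python
DISSOCIATION_FIELD = "files.dissociation_protocol_json.dissociation_method.text"
CELL_TYPE_FIELD = "files.cell_suspension_json.selected_cell_type.text"


def match_clause(field, value):
    """Build a `match` clause for one field."""
    return {"match": {field: value}}


def regexp_clause(field, pattern):
    """Build a `regexp` clause for one field."""
    return {"regexp": {field: {"value": pattern}}}


def must_query(*clauses):
    """Combine clauses so that a bundle has to satisfy all of them."""
    return {"query": {"bool": {"must": list(clauses)}}}


query = must_query(
    match_clause(DISSOCIATION_FIELD, "mechanical dissociation"),
    regexp_clause(CELL_TYPE_FIELD, ".*T\\ cell"),
)
print(query)
```

Swapping `"mechanical dissociation"` for `"10x_v2"` reproduces the first query of this vignette, and the resulting dict can be passed to `client.post_search` as `es_query`.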