# How to search
---

Before we do any work, we need to import several functions from cdapython:
- `Q` and `query` which power the search
- `columns` which lets us view entity field names
- `unique_terms` which lets us view entity field contents

We're also asking cdapython to report its version so we can be sure we're using the one we mean to.

In [2]:
from cdapython import Q, columns, unique_terms, query
import cdapython
print(cdapython.__version__)
Q.set_host_url("http://35.192.60.10:8080/")

2022.4.13



The CDA provides a custom Python tool for searching CDA data. [`Q`](usage/#q) (short for Query) offers several ways to search and filter data, and several input modes:

- [Q.counts()](../../../Documentation/usage/#qcounts) returns summary information (counts) for the specified search
- [Q.run()](../../../Documentation/usage/#qrun) returns data for the specified search
- [Q.sql()](../../../Documentation/usage/#qsql) allows you to use SQL syntax instead of Q syntax
- [query()](../../../Documentation/usage/#query) allows you to write Q searches in a more natural-language syntax



## Retrieving summary information

Let's demonstrate search with a relatively simple question: we're interested in how well represented Hispanic and/or Latino populations are in CDA resources, and we would like to know what data is available from these subjects. To run this simple search, we first construct a query in `Q` and save it to a variable `myquery`:

In [8]:
myquery = Q('Subject.ethnicity = "hispanic or latino"')

If you aren't sure how we knew what terms to put in our search, please refer back to the [What search terms are available?](../SearchTerms) notebook. 

Since we are looking for summary information, we want to use this query in `Q.counts`. We do this by running `.counts()` on the query we just saved, and saving the result to a new variable `mycounts`:

In [23]:
mycounts = myquery.counts()

Since all we're doing is saving information in variables, we don't get any visible output. To see what our results are, we need to look into the variable. The simplest way is to call `mycounts` directly:

In [15]:
mycounts


            QueryID: 9f33298e-7f4c-4599-b098-6a7826d83e4a
            
            Offset: 0
            Count: 3
            Total Row Count: 3
            More pages: False
            

This output shows our QueryID, which we don't really need ourselves but which the system uses to track our queries. It then lists four parameters that describe our results:

- `Offset:` This is how many rows of information we've told the query to skip in the data. Here we didn't tell it to skip anything, so the offset is zero.
- `Count:` This is how many rows the current page of our results table has. To keep searches fast, we default to pages with 100 rows.
- `Total Row Count:` This is how many rows are in the full results table.
- `More pages:` This is always True or False. False means that our current page has all the available results. True means that we see only the first 100 results in this table and will need to page through for more.
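
To make the relationship between these four fields concrete, here is a minimal sketch in plain Python. It is an illustration only: `has_more_pages` is a hypothetical helper, not part of cdapython, and it assumes that more pages exist exactly when rows remain beyond the current page.

```python
def has_more_pages(offset: int, count: int, total_row_count: int) -> bool:
    """Return True when rows remain beyond the current page (assumed rule)."""
    return offset + count < total_row_count


# The values reported for our query above: everything fits on one page
print(has_more_pages(offset=0, count=3, total_row_count=3))      # False

# A made-up larger result: 250 rows viewed 100 at a time
print(has_more_pages(offset=0, count=100, total_row_count=250))  # True
```

With the numbers from our query, the offset plus the count already equals the total row count, which matches the `More pages: False` line in the output.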

Now that we've seen the metadata about our results, let's look at the actual table. The easiest way to do this is by calling the `.to_dataframe()` method on our `mycounts` variable:

In [27]:
mycounts.to_dataframe()

| | system | subject_count | subject_files_count | researchsubject_count | researchsubject_files_count | specimen_count | specimen_files_count |
|---|---|---|---|---|---|---|---|
| 0 | IDC | 371 | 86776 | 0 | 0 | 0 | 0 |
| 1 | PDC | 57 | 9401 | 444 | 9401 | 1033 | 9232 |
| 2 | GDC | 3118 | 32866 | 19158 | 32866 | 20598 | 30995 |


What do these numbers mean?

- **system:** Which data source contributed this data? The CDA currently has data from IDC, PDC and GDC
- **subject_count:** How many unique individuals meet our query. Note that *within* a data source the number is of *unique* individuals, but the same individuals can have data in multiple data sources. Here, there are 371 unique people in the IDC data; however, up to 57 of those may be exactly the same people as are in the PDC data.
- **subject_files_count:** This tells you roughly how much data is available. It is the total count of files for all the subjects in `subject_count`, which is also the total number of files that match your search.
- **researchsubject_count:** Some data sources have individual subjects that are in multiple studies. When this happens, the individual will have both a "subject" identifier and a "researchsubject" identifier. This column counts the latter. Zero in this column can mean either "there are no research_subjects that meet your search criteria" or "the data source for this row does not create special identifiers for subjects in multiple studies".
- **researchsubject_files_count:** This is the total count of files specific to researchsubjects in `researchsubject_count`. It is a subset of `subject_files_count`.
- **specimen_count:** Some data sources track whether files come from specific specimens from a given individual. This column counts the number of specimens that meet your search criteria. Zero in this column can mean either "there are no specimens that meet your search criteria" or "the data source for this row does not track specimens separately from subjects".
- **specimen_files_count:** This is the total count of files specific to specimens in `specimen_count`. It is a subset of both `subject_files_count` and `researchsubject_files_count`.
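
As a rough cross-check of these column descriptions, the sketch below re-types the counts table shown above as CSV text and uses only the Python standard library. The variable names are illustrative. It confirms that the file counts nest as described, and computes an upper bound on the number of subjects, since the same person may appear in more than one system.

```python
import csv
import io

# The counts table from above, copied by hand as CSV text
COUNTS_CSV = """system,subject_count,subject_files_count,researchsubject_count,researchsubject_files_count,specimen_count,specimen_files_count
IDC,371,86776,0,0,0,0
PDC,57,9401,444,9401,1033,9232
GDC,3118,32866,19158,32866,20598,30995
"""

rows = list(csv.DictReader(io.StringIO(COUNTS_CSV)))

# Upper bound only: a person with data in two systems is counted twice
max_subjects = sum(int(row["subject_count"]) for row in rows)
print(max_subjects)  # 3546

# Check the nesting: specimen files <= researchsubject files <= subject files
nesting_holds = all(
    int(row["specimen_files_count"])
    <= int(row["researchsubject_files_count"])
    <= int(row["subject_files_count"])
    for row in rows
)
print(nesting_holds)  # True
```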

In [24]:
Q('Subject.ethnicity = "hispanic or latino"').counts().to_dataframe()

| | system | subject_count | subject_files_count | researchsubject_count | researchsubject_files_count | specimen_count | specimen_files_count |
|---|---|---|---|---|---|---|---|
| 0 | IDC | 371 | 86776 | 0 | 0 | 0 | 0 |
| 1 | PDC | 57 | 9401 | 444 | 9401 | 1033 | 9232 |
| 2 | GDC | 3118 | 32866 | 19158 | 32866 | 20598 | 30995 |


In [27]:
unique_terms('ResearchSubject.id', files=True)

['00016c8f-a0be-4319-9c42-4f3bcd90ac92',
 '00021f80-e72d-4c18-9e36-7959060e1fe1',
 '00026454-8225-4395-b536-4ad5be713214',
 '0002f311-f263-4816-95f5-f78cc78f0c45',
 '00039740-17c1-4430-a112-59bf0614a963',
 '00041061-3fe2-417a-a333-23cf87a2f06e',
 '00048fa6-4318-42ef-9709-7dedb0d938b3',
 '0004bacb-64f9-4823-a962-618c3dafec65',
 '0004d251-3f70-4395-b175-c94c2f5b1b81',
 '000520f2-c081-4757-8f5c-0fbffee47a62',
 '00061f34-c891-4f9c-b8d6-3ca68b98c875',
 '0006b1fa-5db8-4737-9396-10284b5d8979',
 '000724d7-bb0c-4b21-91bf-7599399ff7b3',
 '00077417-fddb-4193-89ef-34dec0c75305',
 '00078589-2a72-403c-b300-dbaa4afc7c00',
 '0008bdfb-24a3-50fa-b112-89966d6ca423',
 '00090b9f-40c5-42b4-9687-430db52f7b36',
 '0009e11b-8c3c-464a-a379-c28d8e66d0ea',
 '000aa4f6-473f-4cc1-9392-ab8872019fe7',
 '000abfa8-e8ec-4222-8a57-1d00bafcadac',
 '000bddf2-20df-4322-8bd0-67e1959b786d',
 '000c3acf-e915-4c5f-bbc3-13031797024c',
 '000c783d-5333-4b33-89ee-6d6f97e608ce',
 '000d566c-96c7-4f1c-b36e-fa2222467983',
 '000da8b9-7511-

Additionally, you can specify a particular data provider by using the ```system``` argument:

In [None]:
unique_terms("ResearchSubject.Specimen.source_material_type", system="PDC")

## Basic querying

We can start by getting the record for ```id = TCGA-E2-A10A``` that we mentioned earlier. We expect a single subject record as a result.

Let's see what the result looks like:

In [None]:
q = Q('id = "TCGA-E2-A10A"') # note the double quotes for the string value

r = q.run()

r

In [None]:
stage = Q('ResearchSubject.Diagnosis.stage = "Stage IV"')
r = stage.run()
print(r)

In [None]:
r.to_dataframe()


In [None]:
q = Q('Subject.id = "TCGA-E2-A10A"') # note the double quotes for the string value

r = q.counts(table="gdc-bq-sample.dev", version="all_v3_0_Files")

r

In [None]:
r.to_dataframe()


The record is pretty large, so we'll print out ```identifier``` values for each ```ResearchSubject``` to confirm that we have one ```ResearchSubject``` that comes from GDC, and one that comes from PDC:

In [None]:
for research_subject in r['system']:
    print(research_subject['system'])

The values represent ```ResearchSubject``` IDs and are equivalent to ```case_id``` values in data nodes.

## Example queries

Now that we can create a query with the ```Q()``` function, let's see how we can combine multiple conditions.

There are three operators available:
* ```And()```
* ```Or()```
* ```From()```

The following examples show how those operators work in practice.
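
Before turning to real queries, it may help to see how ```And``` and ```Or``` compose in a toy model. The sketch below is plain Python with made-up records. The `Condition` class is a hypothetical stand-in that only mimics the method names; it is not how cdapython implements `Q`, and it does not cover ```From```.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Record = Dict[str, str]


@dataclass(frozen=True)
class Condition:
    """Toy stand-in for a query: wraps a test applied to one record."""
    test: Callable[[Record], bool]

    def And(self, other: "Condition") -> "Condition":
        # Both conditions must hold for the same record
        return Condition(lambda rec: self.test(rec) and other.test(rec))

    def Or(self, other: "Condition") -> "Condition":
        # At least one of the two conditions must hold
        return Condition(lambda rec: self.test(rec) or other.test(rec))

    def run(self, records: List[Record]) -> List[Record]:
        return [rec for rec in records if self.test(rec)]


# Made-up records for illustration only
records = [
    {"id": "A", "sex": "female", "site": "Ovary"},
    {"id": "B", "sex": "male", "site": "Lung"},
    {"id": "C", "sex": "female", "site": "Breast"},
    {"id": "D", "sex": "female", "site": "Lung"},
]

female = Condition(lambda rec: rec["sex"] == "female")
ovary = Condition(lambda rec: rec["site"] == "Ovary")
breast = Condition(lambda rec: rec["site"] == "Breast")

# female AND (ovary OR breast)
combined = female.And(ovary.Or(breast))
matching_ids = [rec["id"] for rec in combined.run(records)]
print(matching_ids)  # ['A', 'C']
```

Nesting one combined condition inside another, as in `female.And(ovary.Or(breast))`, plays the role of parentheses in a written expression.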

### Query 1

**Find data for subjects who were diagnosed after the age of 50 and who were investigated as part of the TCGA-OV project.**

In [None]:
q1 = Q('sex = "male"')
q2 = Q('identifier.system = "GDC"')
q3 = Q('identifier.system = "PDC"')

r2 = q1.And(q2)
r3 = q1.And(q3)

#taco = r2.run()
#taco

burrito = r3.run()


In [None]:
burrito

In [None]:
q1 = Q('sex = "male"')

taco = q1.run()
taco

In [None]:
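from pandas import DataFrame, concat

# Run a natural-language query and print the summary of the first page
q = query('ResearchSubject.primary_diagnosis_condition LIKE "Lung%" AND sex = "male"').run()
print(q)

# Collect every page of results into a single DataFrame
df = DataFrame()
for i in q.paginator(to_df=True):
    print(len(i))  # number of rows on this page
    df = concat([df, i])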

In [None]:
df

In [None]:
r.to_dataframe()

### Query 2

**Find data for donors with a melanoma (Nevi and Melanomas) diagnosis who were diagnosed before the age of 30.**

In [None]:
q1 = Q('ResearchSubject.Specimen.primary_disease_type = "Nevi and Melanomas"')
q2 = Q('ResearchSubject.Diagnosis.age_at_diagnosis < 30*365')

q = q1.And(q2)
r = q.run()

print(r)
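
The `30*365` in the query above implies that `age_at_diagnosis` is compared in days rather than years. The sketch below shows the same conversion in plain Python. `years_to_days` and `age_below` are hypothetical helpers, not part of cdapython, and passing the resulting string to `Q()` is an assumption rather than something this notebook demonstrates.

```python
DAYS_PER_YEAR = 365  # same approximation as the query above (ignores leap years)


def years_to_days(years: int) -> int:
    """Convert an age in years to days using the 365-day approximation."""
    return years * DAYS_PER_YEAR


def age_below(field: str, years: int) -> str:
    """Build a filter string for 'field is below this age' (hypothetical helper)."""
    return f"{field} < {years_to_days(years)}"


filter_text = age_below("ResearchSubject.Diagnosis.age_at_diagnosis", 30)
print(filter_text)
# ResearchSubject.Diagnosis.age_at_diagnosis < 10950
```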

In addition, we can check how many records come from particular systems by adding one more condition to the query:

In [None]:
q1 = Q('ResearchSubject.Specimen.primary_disease_type = "Nevi and Melanomas"')
q2 = Q('ResearchSubject.Diagnosis.age_at_diagnosis < 30*365')
q3 = Q('ResearchSubject.Specimen.identifier.system = "GDC"')

q = q1.And(q2.And(q3))
r = q.run()

print(r)

By comparing the ```Count``` value of the two results, we can see that all the patients returned in the initial query come from the GDC.

To explore the results further, we can fetch the patient JSON objects by iterating through the results:

In [None]:
projects = set()

for patient in r:
    research_subjects = patient['ResearchSubject']
    for rs in research_subjects:
        projects.add(rs['member_of_research_project'])

print(projects)

The output shows the projects where _Nevi and Melanomas_ cases appear.

### Query 3

**Identify all samples that meet the following conditions:**

* **Sample is from primary tumor**
* **Disease is ovarian or breast cancer**
* **Subjects are females under the age of 60 years**

In [None]:
tumor_type = Q('ResearchSubject.Specimen.source_material_type = "Primary Tumor"')
disease1 = Q('ResearchSubject.primary_diagnosis_site = "Ovary"')
disease2 = Q('ResearchSubject.primary_diagnosis_site = "Breast"')
demographics1 = Q('sex = "female"')
demographics2 = Q('days_to_birth > -60*365') # note that days_to_birth is a negative value

q1 = tumor_type.And(demographics1.And(demographics2))
q2 = disease1.Or(disease2)
q = q1.And(q2)

r = q.run()
print(r)

In this case, the result contains more than 1000 records, and 1000 is the default page size, so the first page does not hold everything. To load the next 1000 records, we can use the ```next_page()``` method:

In [None]:
r2 = r.next_page()

In [None]:
print(r2)

Alternatively, we can use the ```offset``` argument to specify the record to start from:

```
...
r = q.run(offset=1000)
print(r)
```
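
The idea behind paging with an offset can be sketched without any network calls. In the example below, `fetch_page` is a hypothetical stand-in for a paged search over a made-up result set of 2500 records. It is not a cdapython function, and the page size of 1000 simply mirrors the numbers used in the text.

```python
from typing import Iterator, List


def fetch_page(rows: List[int], offset: int, limit: int) -> List[int]:
    """Stand-in for a paged search: at most `limit` rows starting at `offset`."""
    return rows[offset:offset + limit]


def iter_pages(rows: List[int], limit: int) -> Iterator[List[int]]:
    """Yield successive pages by advancing the offset one page at a time."""
    offset = 0
    while offset < len(rows):
        yield fetch_page(rows, offset, limit)
        offset += limit


all_rows = list(range(2500))  # pretend result set with 2500 records
page_sizes = [len(page) for page in iter_pages(all_rows, limit=1000)]
print(page_sizes)  # [1000, 1000, 500]
```

Each pass through the loop corresponds to one more page of results, and the last page is shorter because only 500 records remain.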

### Query 4

**Find data for donors with "Ovarian Serous Cystadenocarcinoma" with proteomic and genomic data.**

**Note that the disease type values denoting the same disease group can be completely different in different systems. This is where CDA features come into play.** We first start by exploring the values available for this particular field in both systems.

In [None]:
unique_terms('ResearchSubject.primary_diagnosis_condition', system="GDC")

Since “Ovarian Serous Cystadenocarcinoma” doesn’t appear in the GDC values, we decide to look into the PDC:

In [None]:
unique_terms('ResearchSubject.primary_diagnosis_condition', system="PDC")

After examining the output, we see that this term does come from the PDC. So if we first identify the data whose research subjects are found in the PDC with this particular disease type, and then narrow the results down to the portion that is also present in GDC, we get the records that we are looking for.

In [None]:
q1 = Q('ResearchSubject.primary_diagnosis_condition = "Ovarian Serous Cystadenocarcinoma"')
q2 = Q('ResearchSubject.identifier.system = "PDC"')
q3 = Q('ResearchSubject.identifier.system = "GDC"')

q = q3.From(q1.And(q2))
r = q.run()

print(r)

As you can see, this is achieved with the ```From``` operator, which lets us build a query from the results of another query. This is particularly useful when a condition involves a single field that takes different values for different items in the list it belongs to, e.g. when we need ```ResearchSubject.identifier.system``` to be both “PDC” and “GDC” for a single patient. In such cases, the ```And``` operator can’t help: it returns only those entries where the same field takes both values at once, and there are no such entries.
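
The difference between the two approaches can be illustrated with a toy model in plain Python. The subjects and the "some GDC label" value below are made up, `has_entry` is a hypothetical helper, and the two-step filter only imitates the behaviour described above. It is not a description of how the CDA executes a ```From``` query.

```python
TARGET = "Ovarian Serous Cystadenocarcinoma"

# Made-up subjects; each has a list of research-subject entries
subjects = [
    {"id": "S1", "ResearchSubject": [
        {"system": "PDC", "condition": TARGET},
        {"system": "GDC", "condition": "some GDC label"},
    ]},
    {"id": "S2", "ResearchSubject": [
        {"system": "PDC", "condition": TARGET},
    ]},
    {"id": "S3", "ResearchSubject": [
        {"system": "GDC", "condition": "some GDC label"},
    ]},
]


def has_entry(subject, predicate):
    """True if any research-subject entry of this subject satisfies predicate."""
    return any(predicate(entry) for entry in subject["ResearchSubject"])


# And-style: a single entry would need to be "PDC" and "GDC" at the same time
and_style = [
    s["id"] for s in subjects
    if has_entry(s, lambda e: e["system"] == "PDC" and e["system"] == "GDC")
]

# From-style: first keep subjects with a matching PDC entry,
# then keep those of them that also have a GDC entry
inner = [
    s for s in subjects
    if has_entry(s, lambda e: e["system"] == "PDC" and e["condition"] == TARGET)
]
from_style = [
    s["id"] for s in inner
    if has_entry(s, lambda e: e["system"] == "GDC")
]

print(and_style)   # []
print(from_style)  # ['S1']
```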

In [4]:
for i in Q.sql("SELECT * FROM `gdc-bq-sample.cda_mvp.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS` WHERE table_name = 'v3' Limit 5"):
    print(i)

(400)
Reason: 
HTTP response headers: HTTPHeaderDict({'Content-Type': 'application/json', 'Transfer-Encoding': 'chunked', 'Date': 'Tue, 19 Apr 2022 19:10:13 GMT', 'Connection': 'close'})
HTTP response body: {"message":"Unable to find schema for that version.","statusCode":400,"causes":["Unable to find schema for that version."]}



TypeError: 'NoneType' object is not iterable

In [None]:
q1 = query('ResearchSubject.identifier.system = "GDC" FROM ResearchSubject.primary_diagnosis_condition = "Ovarian Serous Cystadenocarcinoma" AND ResearchSubject.identifier.system = "PDC"')
result = q1.run(async_call=True)
print(result)


## Data extraction and release information

In [None]:
# If you are interested in the extraction dates or data release versions of the GDC, PDC, or IDC data in a table or view, execute this code

for i in Q.sql("SELECT option_value FROM `gdc-bq-sample.integration.INFORMATION_SCHEMA.TABLE_OPTIONS` WHERE table_name = 'all_v1'"):
    print(i)