# Advanced Search - Operators
---

Before we do any work, we need to import several functions from cdapython:


- `Q` and `query` which power the search
- `columns` which lets us view entity field names
- `unique_terms` which lets us view entity field contents
    
We're also asking cdapython to report its version so we can be sure we're using the one we mean to.

In [1]:
from cdapython import Q, columns, unique_terms, query
print(Q.get_version())
Q.set_host_url("http://35.192.60.10:8080/")

2022.6.9



The CDA provides a custom Python tool for searching CDA data. [`Q`](usage/#q) (short for Query) offers several ways to search and filter data, and several input modes:

---
- **[Q()](../../../Documentation/usage/#q)** builds a query that can be used by `run()` or `count()`
- **[Q.run()](../../../Documentation/usage/#qrun)** returns data for the specified search
- **[Q.count()](../../../Documentation/usage/#qcounts)** returns summary information (counts) for the data that fit the specified search

---

## Operators

Operators let us build more complex queries and, with them, progressively more refined datasets. They come in two basic types:

Operators within `Q()`:

- `=`
- `!=`
- `OR()`
- `%`

- `>=`
- `<=`
- `<`
- `>`

And operators that allow us to combine `Q()` statements:

- `AND()`

- `IS()`
- `IN()`

- `NOT()`
- `NOT IN()`
- `NOT LIKE()`
- `IS NOT()`

In both cases, we use these operators to build more and more complex Q statements before sending our query to `run()` or `count()`.


### Q operators
#### Equals: `=`

In the other tutorials, we have always used the same query, which uses the `=` operator. 

```Q('ResearchSubject.primary_diagnosis_site = "brain"')```

This operator only returns data where the primary_diagnosis_site is exactly "brain". Let's do a similar search here, but for "uterus". We'll look at the researchsubject summary:

In [2]:
Q('ResearchSubject.primary_diagnosis_site = "uterus"').researchsubject.count.run()

Total execution time: 3515 ms


    total : 867     


   files : 242362   


system,count
IDC,867

primary_diagnosis_condition,count
,867

primary_diagnosis_site,count
Uterus,867




#### Not Equal: `!=`

The `!=` operator does the opposite of the `=` operator: it returns everything that is not exactly the term you give it:

In [3]:
Q('ResearchSubject.primary_diagnosis_site != "uterus"').researchsubject.count.run()

Total execution time: 3495 ms


   total : 148491   


  files : 39151493  


system,count
GDC,85076
PDC,2334
IDC,61081

primary_diagnosis_condition,count
Adenomas and Adenocarcinomas,32730
Myeloid Leukemias,3965
Gliomas,4772
"Cystic, Mucinous and Serous Neoplasms",3723
Squamous Cell Neoplasms,5076
Osseous and Chondromatous Neoplasms,615
Other,206
Lymphoid Leukemias,2072
Myomatous Neoplasms,632
,61082

primary_diagnosis_site,count
Bladder,2155
Hematopoietic and reticuloendothelial systems,9007
Liver and intrahepatic bile ducts,1609
Other and ill-defined sites,1186
Breast,21945
Colon,8559
Kidney,4788
Thyroid gland,1880
"Uterus, NOS",1998
Lymph nodes,538




Note that in our `!=` results, there are 1998 "Uterus, NOS" samples. These don't appear in our `=` search because "Uterus, NOS" is not *exactly* "Uterus".
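
The split between the two result sets can be pictured with a small, self-contained sketch. It does not call cdapython: the `sites` list and the `equals` helper are hypothetical, and ignoring letter case is an assumption drawn from the fact that searching for "uterus" returned "Uterus" above.

```python
# Toy site labels borrowed from the outputs above; this is NOT a CDA query.
sites = ["Uterus", "Uterus, NOS", "Breast", "Kidney"]


def equals(value: str, term: str) -> bool:
    """Whole-string comparison that ignores case (assumed behaviour)."""
    return value.casefold() == term.casefold()


# `=` keeps only whole-string matches; `!=` keeps everything else.
matched = [s for s in sites if equals(s, "uterus")]
unmatched = [s for s in sites if not equals(s, "uterus")]

print(matched)    # ['Uterus']
print(unmatched)  # ['Uterus, NOS', 'Breast', 'Kidney']
```

In this sketch "Uterus, NOS" lands on the `!=` side, just as it does in the summary above.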

There are several ways to change our search to get both "Uterus" and "Uterus, NOS". Which one we choose depends on our interests and on how much the terms we care about differ from one another.


#### OR

If we have a small enough number of search criteria to reliably type them out, we can use the OR operator to combine results:


In [4]:
Q('ResearchSubject.primary_diagnosis_site = "uterus" OR ResearchSubject.primary_diagnosis_site = "uterus, NOS"').researchsubject.count.run()

Total execution time: 3479 ms


    total : 2865    


   files : 257140   


system,count
GDC,1894
PDC,104
IDC,867

primary_diagnosis_condition,count
Uterine Corpus Endometrial Carcinoma,104
Myomatous Neoplasms,183
Adenomas and Adenocarcinomas,1037
Complex Mixed and Stromal Neoplasms,294
"Cystic, Mucinous and Serous Neoplasms",313
,867
Not Reported,12
Trophoblastic neoplasms,13
Complex Epithelial Neoplasms,2
"Soft Tissue Tumors and Sarcomas, NOS",14

primary_diagnosis_site,count
"Uterus, NOS",1998
Uterus,867




For each `OR` you must specify both the search term ("uterus") and where to find the term ("ResearchSubject.primary_diagnosis_site"). This means that the `OR` operator is flexible enough to run searches across columns, or even across endpoints:

In [6]:
Query1 = Q('ResearchSubject.primary_diagnosis_site = "uterus"') 
Query2 = Q('ResearchSubject.primary_diagnosis_condition = "Uterine Corpus Endometrial Carcinoma"')

Query1.OR(Query2).researchsubject.count.run()

Total execution time: 3456 ms


    total : 104     


    files : 2560    


system,count
PDC,104

primary_diagnosis_condition,count
Uterine Corpus Endometrial Carcinoma,104

primary_diagnosis_site,count
"Uterus, NOS",104




In [9]:
Q('ResearchSubject.primary_diagnosis_site = "uterus" OR ResearchSubject.primary_diagnosis_condition = "Uterine Corpus Endometrial Carcinoma"').researchsubject.count.run()

Total execution time: 3438 ms


    total : 971     


   files : 244922   


system,count
PDC,104
IDC,867

primary_diagnosis_condition,count
Uterine Corpus Endometrial Carcinoma,104
,867

primary_diagnosis_site,count
"Uterus, NOS",104
Uterus,867
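
In the string query above, the total of 971 equals the 867 "Uterus" subjects plus the 104 "Uterine Corpus Endometrial Carcinoma" subjects. A minimal sketch of that behaviour, treating `OR` as a set union over two different columns, might look like the following. The records and ids are hypothetical toy data, and nothing here calls the CDA.

```python
# Hypothetical toy records; the field names mirror the columns queried above.
records = [
    {"id": "s1", "primary_diagnosis_site": "Uterus",
     "primary_diagnosis_condition": ""},
    {"id": "s2", "primary_diagnosis_site": "Uterus, NOS",
     "primary_diagnosis_condition": "Uterine Corpus Endometrial Carcinoma"},
    {"id": "s3", "primary_diagnosis_site": "Breast",
     "primary_diagnosis_condition": "Adenomas and Adenocarcinomas"},
]

# Each side of the OR names its own column and its own term.
site_hits = {r["id"] for r in records
             if r["primary_diagnosis_site"].casefold() == "uterus"}
condition_hits = {r["id"] for r in records
                  if r["primary_diagnosis_condition"]
                  == "Uterine Corpus Endometrial Carcinoma"}

# OR keeps a record if either condition holds (set union).
either = site_hits | condition_hits
print(sorted(either))  # ['s1', 's2']
```

Because the two toy sets do not overlap, the size of the union is the sum of the two parts, which matches the 867 + 104 = 971 pattern in the output above.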




#### `%` pattern matching

While `OR` is useful for situations with only a few options, in some cases there are many terms that all have similar names, and it would be error-prone to type out every variant. For instance, if we filter the unique terms in "ResearchSubject.primary_diagnosis_site" to everything containing "uter" we get:

In [None]:
unique_terms("ResearchSubject.primary_diagnosis_site").to_list(filters="uter")

The `%` operator acts as a wildcard, and lets you run a query similar to the filter function in unique_terms:

In [None]:
Q('ResearchSubject.primary_diagnosis_site = "uter%"').researchsubject.count.run()

Because the `%` is at the end of "uter" this query returns anything that starts with "uter". But we can also do broader queries by moving the `%`, or adding more of them:

In [None]:
Q('ResearchSubject.primary_diagnosis_site = "%uter%"').researchsubject.count.run()

In [None]:
Q('ResearchSubject.primary_diagnosis_site != "%uter%"').researchsubject.count.run()
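
To make the effect of moving the `%` concrete, here is a small sketch that emulates the wildcard with a regular expression. The `like` helper and the list of sites are hypothetical illustrations, not part of cdapython, and case-insensitive matching is an assumption.

```python
import re


def like(value: str, pattern: str) -> bool:
    """Match `value` against a pattern in which % stands for any run of characters."""
    # Escape the literal pieces and join them with ".*" where each % was.
    regex = ".*".join(re.escape(part) for part in pattern.split("%"))
    return re.fullmatch(regex, value, flags=re.IGNORECASE | re.DOTALL) is not None


# Hypothetical toy values, not fetched from the CDA.
sites = ["Uterus", "Uterus, NOS", "Cervix uteri", "Corpus uteri", "Breast"]

starts_with = [s for s in sites if like(s, "uter%")]      # "uter" at the start
contains = [s for s in sites if like(s, "%uter%")]        # "uter" anywhere
excludes = [s for s in sites if not like(s, "%uter%")]    # the != counterpart

print(starts_with)  # ['Uterus', 'Uterus, NOS']
print(contains)     # ['Uterus', 'Uterus, NOS', 'Cervix uteri', 'Corpus uteri']
print(excludes)     # ['Breast']
```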

### Query 1

**Find data for subjects who were diagnosed after the age of 50 and who were investigated as part of the TCGA-OV project.**

In [None]:
Q('sex != "m%"').subject.count.run()

In [None]:
q1 = Q('sex = "male"')
q2 = Q('identifier.system = "GDC"')
q3 = Q('identifier.system = "PDC"')

r2 = q1.And(q2)
r3 = q1.And(q3)

#taco = r2.run()
#taco

burrito = r3.run()


In [None]:
burrito

In [None]:
q1 = Q('sex = "male"')

taco = q1.run()
taco

In [None]:
from pandas import DataFrame,concat
q = query('ResearchSubject.primary_diagnosis_condition LIKE "Lung%" AND sex = "male"').run()
print(q)
df = DataFrame()
for i in q.paginator(to_df=True):
    print(len(i))
    df = concat([df,i])

In [None]:
df

In [None]:
r.to_dataframe()

### Query 2

**Find data for donors with a melanoma (Nevi and Melanomas) diagnosis who were diagnosed before the age of 30.**

In [None]:
q1 = Q('ResearchSubject.Specimen.primary_disease_type = "Nevi and Melanomas"')
q2 = Q('ResearchSubject.Diagnosis.age_at_diagnosis < 30*365')

q = q1.And(q2)
r = q.run()

print(r)
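
The comparison against `30*365` implies that `age_at_diagnosis` is stored in days rather than years. A tiny sketch of building such a threshold is shown below; `years_to_days` and `DAYS_PER_YEAR` are hypothetical names, and the 365-day year is the same approximation used in the query above.

```python
DAYS_PER_YEAR = 365  # same approximation as the query above; leap days are ignored


def years_to_days(years: int) -> int:
    """Convert an age in years to the day count used in the query string."""
    return years * DAYS_PER_YEAR


threshold = years_to_days(30)
condition = f"ResearchSubject.Diagnosis.age_at_diagnosis < {threshold}"
print(condition)  # ResearchSubject.Diagnosis.age_at_diagnosis < 10950
```

The resulting string could then be passed to `Q(...)` in place of the hand-written expression.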

In addition, we can check how many records come from particular systems by adding one more condition to the query:

In [None]:
q1 = Q('ResearchSubject.Specimen.primary_disease_type = "Nevi and Melanomas"')
q2 = Q('ResearchSubject.Diagnosis.age_at_diagnosis < 30*365')
q3 = Q('ResearchSubject.Specimen.identifier.system = "GDC"')

q = q1.And(q2.And(q3))
r = q.run()

print(r)

By comparing the ```Count``` value of the two results we can see that all the patients returned in the initial query come from the GDC.

To explore the results further, we can fetch the patient JSON objects by iterating through the results:

In [None]:
projects = set()

for patient in r:
    research_subjects = patient['ResearchSubject']
    for rs in research_subjects:
        projects.add(rs['member_of_research_project'])

print(projects)

The output shows the projects where _Nevi and Melanomas_ cases appear.
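
For readers without access to the service, the same traversal can be tried on toy data. The records below are hypothetical and only mimic the nesting used in the loop above (a patient holding a list of research subjects); the `.get` calls are an added precaution for records that lack a key.

```python
# Hypothetical patient records shaped like the nested JSON iterated above.
patients = [
    {"id": "p1", "ResearchSubject": [
        {"member_of_research_project": "PROJECT-A"},
        {"member_of_research_project": "PROJECT-B"},
    ]},
    {"id": "p2", "ResearchSubject": [
        {"member_of_research_project": "PROJECT-A"},
    ]},
    {"id": "p3"},  # a record with no ResearchSubject list at all
]

# A set comprehension removes duplicates, like the explicit set.add() loop.
projects = {
    rs["member_of_research_project"]
    for patient in patients
    for rs in patient.get("ResearchSubject", [])
    if rs.get("member_of_research_project") is not None
}

print(sorted(projects))  # ['PROJECT-A', 'PROJECT-B']
```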

### Query 3

**Identify all samples that meet the following conditions:**

* **Sample is from primary tumor**
* **Disease is ovarian or breast cancer**
* **Subjects are females under the age of 60 years**

In [None]:
tumor_type = Q('ResearchSubject.Specimen.source_material_type = "Primary Tumor"')
disease1 = Q('ResearchSubject.primary_diagnosis_site = "Ovary"')
disease2 = Q('ResearchSubject.primary_diagnosis_site = "Breast"')
demographics1 = Q('sex = "female"')
demographics2 = Q('days_to_birth > -60*365') # note that days_to_birth is a negative value

q1 = tumor_type.And(demographics1.And(demographics2))
q2 = disease1.Or(disease2)
q = q1.And(q2)

r = q.run()
print(r)

In this case, the result contains more than 1000 records, and 1000 is the default page size. To load the next 1000 records, we can use the ```next_page()``` method:

In [None]:
r2 = r.next_page()

In [None]:
print(r2)

Alternatively, we can use the ```offset``` argument to specify the record to start from:

```
...
r = q.run(offset=1000)
print(r)
```
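
The relationship between page size and offset can be sketched without the service. In the example below, `fetch` and `pages` are hypothetical stand-ins for a request and a paging loop, not cdapython functions; only the page size of 1000 comes from the text above.

```python
from typing import Iterator, List


def fetch(records: List[int], offset: int = 0, limit: int = 1000) -> List[int]:
    """Stand-in for one request: at most `limit` records starting at `offset`."""
    return records[offset:offset + limit]


def pages(records: List[int], limit: int = 1000) -> Iterator[List[int]]:
    """Yield successive pages until an empty page signals the end."""
    offset = 0
    while True:
        page = fetch(records, offset, limit)
        if not page:
            break
        yield page
        offset += limit  # the next page starts where this one stopped


all_records = list(range(2500))  # pretend the query matched 2500 records
sizes = [len(page) for page in pages(all_records)]
print(sizes)  # [1000, 1000, 500]
```

In this sketch, starting at `offset=1000` returns the same records as the second page.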

### Query 4

**Find data for donors with "Ovarian Serous Cystadenocarcinoma" with proteomic and genomic data.**

**Note that disease type values denoting the same disease group can be completely different in different systems. This is where CDA features come into play.** We start by exploring the values available for this field in both systems.

In [None]:
unique_terms('ResearchSubject.primary_diagnosis_condition', system="GDC")

Since “Ovarian Serous Cystadenocarcinoma” doesn’t appear in GDC values we decide to look into the PDC:

In [None]:
unique_terms('ResearchSubject.primary_diagnosis_condition', system="PDC")

After examining the output, we see that this value does come from the PDC. So we can first identify the data whose research subjects are found in the PDC with this particular disease type, and then narrow those results down to the portion that is also present in the GDC. That gives us the records we are looking for.

In [None]:
q1 = Q('ResearchSubject.primary_diagnosis_condition = "Ovarian Serous Cystadenocarcinoma"')
q2 = Q('ResearchSubject.identifier.system = "PDC"')
q3 = Q('ResearchSubject.identifier.system = "GDC"')

q = q3.From(q1.And(q2))
r = q.run()

print(r)

As you can see, this is achieved with the ```From``` operator, which lets us build a query from the results of another query. It is particularly useful when a condition involves a single field that takes different values for different items of the list it belongs to. Here, for example, we need ```ResearchSubject.identifier.system``` to be “PDC” for one research subject and “GDC” for another research subject of the same patient. The ```And``` operator can’t help in such cases, because it only returns entries where the same field takes both values at once, which is zero entries.
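
The difference between the two approaches can be illustrated on toy data. The sketch below is plain Python, not cdapython: the patient records, the flattened `system` and `condition` keys, and the `has_subject` helper are all hypothetical, and the two-step filter is only an approximation of the narrowing described above.

```python
# Hypothetical patients, each with a list of research subjects.
patients = [
    {"id": "p1", "ResearchSubject": [
        {"system": "PDC", "condition": "Ovarian Serous Cystadenocarcinoma"},
        {"system": "GDC", "condition": "Cystic, Mucinous and Serous Neoplasms"},
    ]},
    {"id": "p2", "ResearchSubject": [
        {"system": "PDC", "condition": "Ovarian Serous Cystadenocarcinoma"},
    ]},
    {"id": "p3", "ResearchSubject": [
        {"system": "GDC", "condition": "Gliomas"},
    ]},
]


def has_subject(patient: dict, **wanted: str) -> bool:
    """True if any single research subject matches every wanted field."""
    return any(all(rs.get(key) == value for key, value in wanted.items())
               for rs in patient["ResearchSubject"])


# And-style: one entry would have to be PDC and GDC at once, which never happens.
and_style = [p["id"] for p in patients
             if any(rs["system"] == "PDC" and rs["system"] == "GDC"
                    for rs in p["ResearchSubject"])]

# From-style: narrow to the PDC disease match first, then filter that result.
inner = [p for p in patients
         if has_subject(p, system="PDC",
                        condition="Ovarian Serous Cystadenocarcinoma")]
from_style = [p["id"] for p in inner if has_subject(p, system="GDC")]

print(and_style)   # []
print(from_style)  # ['p1']
```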

In [None]:
for i in Q.sql("SELECT * FROM `gdc-bq-sample.cda_mvp.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS` WHERE table_name = 'v3' Limit 5"):
    print(i)

In [None]:
q1 = query('ResearchSubject.identifier.system = "GDC" FROM ResearchSubject.primary_diagnosis_condition = "Ovarian Serous Cystadenocarcinoma" AND ResearchSubject.identifier.system = "PDC"')
result = q1.run(async_call=True)
print(result)


## Data extraction and release information

In [None]:
# If you are interested in the extraction dates or data release versions of GDC, PDC, or IDC that are in a table or view, run this code

for i in Q.sql("SELECT option_value FROM `gdc-bq-sample.integration.INFORMATION_SCHEMA.TABLE_OPTIONS` WHERE table_name = 'all_v1'"):
    print(i)