# Advanced Search with Operators
---

Before we do any work, we need to import several functions from cdapython:


- `Q` and `query` which power the search
- `columns` which lets us view entity field names
- `unique_terms` which lets us view entity field contents
    
We're also asking cdapython to report its version so we can be sure we're using the one we mean to.

In [1]:
from cdapython import Q, columns, unique_terms, query
print(Q.get_version())
Q.set_host_url("http://35.192.60.10:8080/")

2022.6.21



The CDA provides a custom Python tool for searching CDA data. [`Q`](usage/#q) (short for Query) offers several ways to search and filter data, and several input modes:

---
- **[Q()](../../../Documentation/usage/#q)** builds a query that can be used by `run()` or `count()`
- **[Q.run()](../../../Documentation/usage/#qrun)** returns data for the specified search 
- **[Q.count()](../../../Documentation/usage/#qcounts)** returns summary information (counts) for the data that fit the specified search

---

Operators allow us to make more complex queries by adding, subtracting, or filtering data.

`Q` uses the following operators:

- [`=` : Equals](#Equals:-=)
- [`!=` : Not Equal](#Not-Equal:-!=)
- [`OR`](#OR) 
- [`AND`](#AND)
- [`IN` and `NOT IN`](#IN-and-NOT-IN)
- [`%` : pattern matching with a wildcard](#%-pattern-matching)
- [`IS` and `IS NOT`](#IS-and-IS-NOT)
- [`>`, `<`, `>=`, `<=`: Greater and Less than](#Greater-and-Less-than)



We use these operators to build more and more complex Q statements before sending our query to `run()` or `count()`.


## Equals: `=`

In the other tutorials, we have always used the same query, which uses the `=` operator. 

```Q('ResearchSubject.primary_diagnosis_site = "brain"')```

This operator will only return data where the primary_diagnosis_site is exactly "brain". Here let's do a similar search, but for "uterus". We'll look at the researchsubject summary:

In [2]:
Q('ResearchSubject.primary_diagnosis_site = "uterus"').researchsubject.count.run()

Total execution time: 3446 ms


system,count
IDC,867

primary_diagnosis_condition,count
,867

primary_diagnosis_site,count
Uterus,867




## Not Equal: `!=`

The `!=` operator does the opposite of the `=` operator: it returns everything that is not exactly the term you give it:

In [3]:
Q('ResearchSubject.primary_diagnosis_site != "uterus"').researchsubject.count.run()

Total execution time: 3180 ms


system,count
GDC,85416
IDC,61081
PDC,2334

primary_diagnosis_condition,count
"Cystic, Mucinous and Serous Neoplasms",3723
Gliomas,4772
Ductal and Lobular Neoplasms,7870
Adenomas and Adenocarcinomas,32730
Complex Mixed and Stromal Neoplasms,1826
Breast Invasive Carcinoma,251
Nevi and Melanomas,3155
Squamous Cell Neoplasms,5076
Transitional Cell Papillomas and Carcinomas,1885
Plasma Cell Tumors,1066

primary_diagnosis_site,count
Breast,21945
Hematopoietic and reticuloendothelial systems,9007
Not Reported,506
Prostate gland,2354
Kidney,4788
Bronchus and lung,12256
Adrenal gland,851
Thyroid gland,1880
Brain,2923
Head and Neck,148




Note that in our `!=` results, there are 1998 "Uterus, NOS" samples. These don't appear in our `=` search because "Uterus, NOS" is not *exactly* "Uterus".
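To make the exact-match behaviour concrete, here is a small local sketch that does not call the CDA at all. The site labels are copied from the outputs in this tutorial, and the `equals` helper is hypothetical. It assumes matching ignores case, since searching for "uterus" returned "Uterus" above.

```python
# Toy list of site labels taken from the outputs in this tutorial.
# Counts are not modeled and nothing here talks to the CDA.
sites = ["Uterus", "Uterus, NOS", "Brain", "Cervix uteri"]


def equals(value, term):
    # Assumption: matching ignores case, because "uterus" matched "Uterus" above.
    return value.lower() == term.lower()


# "=" keeps only the exact term; "!=" keeps everything else.
matches = [s for s in sites if equals(s, "uterus")]
non_matches = [s for s in sites if not equals(s, "uterus")]

print("=  :", matches)
print("!= :", non_matches)
```

In this sketch "Uterus, NOS" lands on the `!=` side because it is a different string, even though it shares a prefix with "Uterus".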

There are several ways to change our search to get both "Uterus" and "Uterus, NOS", and which we choose will depend on both our interests, and on how different the terms are that we care about.

## OR

If we have a small enough number of search criteria to reliably type them out, we can use the `OR` operator to combine results. In an `OR` query, each data point only needs to meet a single criterion to be returned. This makes `OR` good for early, broad searches. It *increases* the amount of data returned.

`OR` can be used both inside a Q statement:

In [4]:
Q('ResearchSubject.primary_diagnosis_site = "uterus" OR ResearchSubject.primary_diagnosis_site = "uterus, NOS"').researchsubject.count.run()

Total execution time: 3561 ms


system,count
PDC,104
GDC,1894
IDC,867

primary_diagnosis_condition,count
Complex Mixed and Stromal Neoplasms,294
Uterine Corpus Endometrial Carcinoma,104
Adenomas and Adenocarcinomas,1037
,867
Myomatous Neoplasms,183
"Cystic, Mucinous and Serous Neoplasms",313
Not Reported,12
"Soft Tissue Tumors and Sarcomas, NOS",14
Complex Epithelial Neoplasms,2
"Epithelial Neoplasms, NOS",20

primary_diagnosis_site,count
"Uterus, NOS",1998
Uterus,867




and to combine 2 or more Q statements:

In [5]:
Query1 = Q('ResearchSubject.primary_diagnosis_site = "uterus, NOS"') 
Query2 = Q('ResearchSubject.primary_diagnosis_condition = "Uterine Corpus Endometrial Carcinoma"')

Query1.OR(Query2).researchsubject.count.run()

Total execution time: 3427 ms


system,count
PDC,104
GDC,1894

primary_diagnosis_condition,count
Myomatous Neoplasms,183
Uterine Corpus Endometrial Carcinoma,104
"Cystic, Mucinous and Serous Neoplasms",313
Adenomas and Adenocarcinomas,1037
Complex Mixed and Stromal Neoplasms,294
Not Reported,12
"Epithelial Neoplasms, NOS",20
"Soft Tissue Tumors and Sarcomas, NOS",14
Trophoblastic neoplasms,13
"Neoplasms, NOS",3

primary_diagnosis_site,count
"Uterus, NOS",1998




For each `OR` you must specify both the search term ("uterus") and where to find the term ("ResearchSubject.primary_diagnosis_site"). This means that the `OR` operator is flexible enough to run searches across columns, or even across endpoints.
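Because every `OR` clause repeats the full field name, it can be convenient to assemble the query text from (field, term) pairs. The sketch below is one possible way to do that. `build_or_query` is a hypothetical helper, not part of cdapython, and it only produces a string.

```python
def build_or_query(conditions):
    """Join (field, term) pairs into a single OR query string.

    Hypothetical helper for illustration; it is not part of cdapython.
    """
    parts = []
    for field, term in conditions:
        # Terms are wrapped in double quotes, so a quote inside a term
        # would break the query text. Refuse it rather than guess.
        if '"' in term:
            raise ValueError(f"term contains a double quote: {term!r}")
        parts.append(f'{field} = "{term}"')
    return " OR ".join(parts)


query_text = build_or_query([
    ("ResearchSubject.primary_diagnosis_site", "uterus, NOS"),
    ("ResearchSubject.primary_diagnosis_condition", "Uterine Corpus Endometrial Carcinoma"),
])
print(query_text)
```

The resulting text has the same shape as the hand-written `OR` query above, so it could be passed to `Q(query_text)` in the usual way.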

## AND 

Like `OR`, `AND` can be used both inside a Q statement, and to join multiple Q statements. `AND` requires that both statements be true simultaneously for each returned bit of data. This makes `AND` good for filtering down results. It *decreases* the amount of data returned.
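One way to picture the difference is with sets: `OR` behaves like a union and `AND` like an intersection. The records below are invented for illustration and are not CDA data.

```python
# Toy records: (id, site, condition). Invented for illustration only.
records = [
    (1, "Uterus, NOS", "Uterine Corpus Endometrial Carcinoma"),
    (2, "Uterus, NOS", "Myomatous Neoplasms"),
    (3, "Uterus", None),
    (4, "Brain", "Gliomas"),
]

# Ids of records that satisfy each criterion on its own.
site_hits = {r[0] for r in records if r[1] == "Uterus, NOS"}
condition_hits = {r[0] for r in records if r[2] == "Uterine Corpus Endometrial Carcinoma"}

either = site_hits | condition_hits  # OR: a record needs to meet one criterion
both = site_hits & condition_hits    # AND: a record must meet both criteria

print("OR :", sorted(either))
print("AND:", sorted(both))
```

The `AND` result can never be larger than the `OR` result for the same pair of criteria, which is why `AND` is the narrowing operator.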

If we reuse the `OR` examples above, the first one will have no results, because primary_diagnosis_site can have only one value, so it can never be both "uterus" and "uterus, NOS":

In [6]:
Q('ResearchSubject.primary_diagnosis_site = "uterus" AND ResearchSubject.primary_diagnosis_site = "uterus, NOS"').researchsubject.count.run()

Total execution time: 3702 ms




However, for searches where you are interested in subsetting multiple columns, `AND` can help you to quickly filter to only the set you want. Note that `AND` can be used both inside a `Q` statement, and to add multiple `Q` statements together:

In [7]:
Q('ResearchSubject.primary_diagnosis_site = "uterus, NOS" AND ResearchSubject.primary_diagnosis_condition = "Uterine Corpus Endometrial Carcinoma"').researchsubject.count.run()

Total execution time: 3281 ms


system,count
PDC,104

primary_diagnosis_condition,count
Uterine Corpus Endometrial Carcinoma,104

primary_diagnosis_site,count
"Uterus, NOS",104




In [8]:
Query1 = Q('ResearchSubject.primary_diagnosis_site = "uterus, NOS"') 
Query2 = Q('ResearchSubject.primary_diagnosis_condition = "Uterine Corpus Endometrial Carcinoma"')

Query1.AND(Query2).researchsubject.count.run()

Total execution time: 3419 ms


system,count
PDC,104

primary_diagnosis_condition,count
Uterine Corpus Endometrial Carcinoma,104

primary_diagnosis_site,count
"Uterus, NOS",104




## `IN` and `NOT IN`

For instances where you have many search terms, it may be easier (and more readable) to use `IN`. With `IN` you make a list of all the terms you are interested in, and ask whether they are `IN` a given field:

In [9]:
Q('ResearchSubject.primary_diagnosis_site IN ("uterus, NOS", "uterus", "Cervix", "Cervix uteri")').researchsubject.count.run()

Total execution time: 3391 ms


system,count
PDC,104
GDC,2809
IDC,1174

primary_diagnosis_condition,count
Uterine Corpus Endometrial Carcinoma,104
Myomatous Neoplasms,183
Adenomas and Adenocarcinomas,1264
Squamous Cell Neoplasms,609
"Cystic, Mucinous and Serous Neoplasms",348
Complex Mixed and Stromal Neoplasms,294
Complex Epithelial Neoplasms,27
,1175
Not Reported,12
"Epithelial Neoplasms, NOS",26

primary_diagnosis_site,count
Cervix uteri,915
"Uterus, NOS",1998
Uterus,867
Cervix,307




The equivalent request without `IN` would require a large number of `OR` statements. (The triple quotes surrounding this example are to allow a multi-line Q statement):

``` py
Q("""ResearchSubject.primary_diagnosis_site = "uterus, NOS" OR 
      ResearchSubject.primary_diagnosis_site = "uterus" OR 
      ResearchSubject.primary_diagnosis_site = "Cervix" OR 
      ResearchSubject.primary_diagnosis_site = "Cervix uteri" """).researchsubject.count.run()
      
```
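If the terms already live in a Python list, the `IN` query text can be generated rather than typed. This is a minimal sketch; `build_in_query` is a hypothetical helper, not a cdapython function, and its `negate` flag simply switches the operator to `NOT IN`.

```python
def build_in_query(field, terms, negate=False):
    """Build an IN (or NOT IN) query string from a list of terms.

    Hypothetical helper for illustration; it is not part of cdapython.
    """
    for term in terms:
        # Terms are wrapped in double quotes, so reject embedded quotes.
        if '"' in term:
            raise ValueError(f"term contains a double quote: {term!r}")
    quoted = ", ".join(f'"{term}"' for term in terms)
    operator = "NOT IN" if negate else "IN"
    return f"{field} {operator} ({quoted})"


sites = ["uterus, NOS", "uterus", "Cervix", "Cervix uteri"]
in_query = build_in_query("ResearchSubject.primary_diagnosis_site", sites)
not_in_query = build_in_query("ResearchSubject.primary_diagnosis_site", sites, negate=True)

print(in_query)
print(not_in_query)
```

Keeping the terms in one list means the `IN` and `NOT IN` versions of a search are guaranteed to use exactly the same terms.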

`NOT IN` is the opposite of `IN`, and so gives the inverse results. If we add `NOT` to our above query, we get all the research subjects whose primary_diagnosis_site was not in our list:

In [24]:
Q('ResearchSubject.primary_diagnosis_site NOT IN ("uterus, NOS", "uterus", "Cervix", "Cervix uteri")').researchsubject.count.run()

Total execution time: 3483 ms


system,count
GDC,82607
IDC,60774
PDC,2230

primary_diagnosis_condition,count
Adenomas and Adenocarcinomas,31466
Other,206
Ductal and Lobular Neoplasms,7870
,60775
Plasma Cell Tumors,1066
Lymphoid Leukemias,2072
Myeloid Leukemias,3965
Nevi and Melanomas,3155
Pancreatic Ductal Adenocarcinoma,144
Neuroepitheliomatous Neoplasms,1331

primary_diagnosis_site,count
Ovary,4346
Breast,21945
Hematopoietic and reticuloendothelial systems,9007
Adrenal gland,851
Eye and adnexa,222
Brain,2923
Retroperitoneum and peritoneum,384
Liver and intrahepatic bile ducts,1609
Corpus uteri,780
Bronchus and lung,12256




## `%` pattern matching

While `OR` is useful for situations with only a few options, in some cases there are many terms that all have similar names, and it would be error-prone to type out every variant. For instance, if we filter the unique terms in "ResearchSubject.primary_diagnosis_site" to everything with "uter" we get:

In [13]:
unique_terms("ResearchSubject.primary_diagnosis_site").to_list(filters="uter")

['Cervix uteri', 'Corpus uteri', 'Uterus', 'Uterus, NOS']

The `%` operator acts as a wildcard, and lets you run a query similar to the filter function in unique_terms:

In [14]:
Q('ResearchSubject.primary_diagnosis_site = "uter%"').researchsubject.count.run()

Total execution time: 3317 ms


system,count
GDC,1894
PDC,104
IDC,867

primary_diagnosis_condition,count
Adenomas and Adenocarcinomas,1037
Uterine Corpus Endometrial Carcinoma,104
Not Reported,12
Complex Mixed and Stromal Neoplasms,294
Myomatous Neoplasms,183
,867
"Cystic, Mucinous and Serous Neoplasms",313
Trophoblastic neoplasms,13
"Epithelial Neoplasms, NOS",20
Mesonephromas,2

primary_diagnosis_site,count
"Uterus, NOS",1998
Uterus,867




Because the `%` is at the end of "uter", this query returns anything that starts with "uter". Depending on your question, you may want to move the `%`, or add more of them:

In [15]:
Q('ResearchSubject.primary_diagnosis_site = "%uter"').researchsubject.count.run()

Total execution time: 3325 ms




In [16]:
Q('ResearchSubject.primary_diagnosis_site = "%uter%"').researchsubject.count.run()

Total execution time: 3252 ms


system,count
GDC,3589
PDC,104
IDC,867

primary_diagnosis_condition,count
Adenomas and Adenocarcinomas,1671
"Cystic, Mucinous and Serous Neoplasms",487
Squamous Cell Neoplasms,609
Uterine Corpus Endometrial Carcinoma,104
Myomatous Neoplasms,187
Complex Mixed and Stromal Neoplasms,320
,868
Complex Epithelial Neoplasms,27
"Epithelial Neoplasms, NOS",230
Not Reported,12

primary_diagnosis_site,count
Corpus uteri,780
Cervix uteri,915
"Uterus, NOS",1998
Uterus,867
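
To check which terms a pattern will catch before running a query, the wildcard can be approximated locally. This is a sketch under two assumptions: that `%` stands for any run of characters (including none), and that matching ignores case, as the results above suggest. The `like` helper is hypothetical and uses Python's `re` module rather than anything in cdapython.

```python
import re


def like(value, pattern):
    """Return True if value matches a %-wildcard pattern, ignoring case.

    Hypothetical local approximation; it is not how the CDA evaluates queries.
    """
    # Escape the literal pieces and let each "%" match any run of characters.
    regex = ".*".join(re.escape(piece) for piece in pattern.split("%"))
    return re.fullmatch(regex, value, flags=re.IGNORECASE | re.DOTALL) is not None


# The four terms returned by unique_terms(...).to_list(filters="uter") above.
terms = ["Cervix uteri", "Corpus uteri", "Uterus", "Uterus, NOS"]

results = {
    pattern: [term for term in terms if like(term, pattern)]
    for pattern in ("uter%", "%uter", "%uter%")
}

for pattern, matched in results.items():
    print(pattern, matched)
```

Run as written, this lists the two "Uterus" terms for `uter%`, nothing for `%uter`, and all four terms for `%uter%`, in line with the three query results above.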




There may be cases in which you want to filter out all of the data with some partial word in it, in which case, you can combine `%` with `!=`: 

In [17]:
Q('sex != "f%"').subject.count.run()

Total execution time: 3267 ms


system,count
IDC,56038
GDC,40024
PDC,1089

sex,count
not reported,266
,51216
male,39793
unknown,81
unspecified,5

race,count
,51216
white,23406
chinese,65
asian,1348
black or african american,1815
not reported,9881
not allowed to collect,1106
other,415
Unknown,2027
american indian or alaska native,56

ethnicity,count
not reported,11796
,51216
not hispanic or latino,23020
Unknown,2293
hispanic or latino,1450
not allowed to collect,1586

cause_of_death,count
,90714
Not Reported,335
Metastasis,1
Infection,3
Cancer Related,198
Unknown,22
Not Cancer Related,76
HCC recurrence,5
Surgical Complications,3
"Cardiovascular Disorder, NOS",3




## IS and IS NOT

In computing, lack of data is often treated as a special case. In the CDA, values listed as "None" are actually `null`, that is, the field is empty. In order to search for emptiness, you need to use the special operator `IS`:

In [18]:
Q('ResearchSubject.primary_diagnosis_condition IS null').researchsubject.count.run()

Total execution time: 3273 ms


system,count
IDC,61948
GDC,2

primary_diagnosis_condition,count
,61950

primary_diagnosis_site,count
Breast,12587
Head-Neck,2704
Colon,1491
Chest,28221
Lung,4728
Abdomen,92
Various,449
Brain,1165
"Abdomen, Mediastinum",176
Thymus,125




More commonly, you will want to filter *out* the empty fields, in which case you use its companion operator `IS NOT`:

In [19]:
Q('sex IS NOT null').subject.count.run()

Total execution time: 3201 ms


system,count
GDC,84979
IDC,11004
PDC,2231

sex,count
female,45509
not reported,266
male,39793
unknown,81
unspecified,5

race,count
black or african american,4567
white,49069
not reported,21816
chinese,90
asian,2951
Unknown,3985
native hawaiian or other pacific islander,55
other,947
american indian or alaska native,116
not allowed to collect,2058

ethnicity,count
not hispanic or latino,48382
not reported,26034
Unknown,4455
hispanic or latino,3131
not allowed to collect,3652

cause_of_death,count
Not Reported,797
,84257
HCC recurrence,7
Cancer Related,336
Unknown,131
Infection,7
Not Cancer Related,107
Surgical Complications,4
"Cardiovascular Disorder, NOS",4
Cancer cell proliferation,1
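
The reason null gets its own operator is easiest to see in plain SQL. The sketch below uses Python's built-in `sqlite3` module with an invented five-row table, so it shows how SQLite treats null. Whether the CDA backend handles `= null` in exactly the same way is an assumption that this tutorial does not confirm.

```python
import sqlite3

# In-memory toy table; the rows are invented and are not CDA data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE subject (id TEXT, sex TEXT)")
conn.executemany(
    "INSERT INTO subject VALUES (?, ?)",
    [
        ("s1", "female"),
        ("s2", "male"),
        ("s3", None),            # stored as NULL: the field is empty
        ("s4", None),
        ("s5", "not reported"),  # a real string, so it is NOT null
    ],
)


def count(where):
    """Count rows in the toy table that satisfy a WHERE clause."""
    return conn.execute(f"SELECT COUNT(*) FROM subject WHERE {where}").fetchone()[0]


null_count = count("sex IS NULL")
not_null_count = count("sex IS NOT NULL")
equals_null_count = count("sex = NULL")  # in SQLite this comparison never matches

print(null_count, not_null_count, equals_null_count)
conn.close()
```

Note that "not reported" counts as a value here, just as it still appears in the `IS NOT null` results above.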




## Greater and Less than

While all of the above can also be used to search for numbers, there are four operators that only work for numerical values:

- `>` : Greater than
- `<` : Less than
- `>=` : Greater than or Equal to
- `<=` : Less than or Equal to

These can all be used in place of the `=` sign in queries where you are filtering by a numeric value. In this search, we find all the subjects who were over 50 years old when they entered the study. As the study entry date is day 0, `days_to_birth` is reported as a negative number:

In [32]:
Q('days_to_birth < 50*-365 ').subject.run().to_dataframe()

Total execution time: 4500 ms


Unnamed: 0,id,identifier,species,sex,race,ethnicity,days_to_birth,subject_associated_project,vital_status,age_at_death,cause_of_death
0,C3L-02170,"[{'system': 'GDC', 'value': 'C3L-02170'}, {'sy...",Homo sapiens,male,not reported,not reported,-26717,"[CPTAC3-Discovery, CPTAC-3, cptac_lscc]",Alive,,
1,C3L-02704,"[{'system': 'GDC', 'value': 'C3L-02704'}, {'sy...",Homo sapiens,male,white,not hispanic or latino,-19879,"[cptac_gbm, CPTAC3-Discovery, CPTAC-3]",Alive,,
2,C3N-01214,"[{'system': 'GDC', 'value': 'C3N-01214'}, {'sy...",Homo sapiens,male,not reported,not reported,-22241,"[CPTAC3-Discovery, CPTAC-3, cptac_ccrcc]",Alive,,Not Reported
3,C3N-01998,"[{'system': 'GDC', 'value': 'C3N-01998'}, {'sy...",Homo sapiens,male,not reported,not reported,-25293,"[cptac_pda, CPTAC3-Discovery, CPTAC-3]",Dead,186.0,
4,C3N-02339,"[{'system': 'GDC', 'value': 'C3N-02339'}, {'sy...",Homo sapiens,male,not reported,not reported,-28848,"[CPTAC3-Discovery, CPTAC-3, cptac_lscc]",Alive,,
...,...,...,...,...,...,...,...,...,...,...,...
95,GENIE-DFCI-027361,"[{'system': 'GDC', 'value': 'GENIE-DFCI-027361'}]",Homo sapiens,male,white,not hispanic or latino,-20819,[GENIE-DFCI],Not Reported,,
96,GENIE-DFCI-033908,"[{'system': 'GDC', 'value': 'GENIE-DFCI-033908'}]",Homo sapiens,female,white,not hispanic or latino,-19723,[GENIE-DFCI],Not Reported,,
97,GENIE-DFCI-035602,"[{'system': 'GDC', 'value': 'GENIE-DFCI-035602'}]",Homo sapiens,female,white,not hispanic or latino,-19358,[GENIE-DFCI],Not Reported,,
98,GENIE-DFCI-035797,"[{'system': 'GDC', 'value': 'GENIE-DFCI-035797'}]",Homo sapiens,male,white,not hispanic or latino,-20088,[GENIE-DFCI],Not Reported,,
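
The sign convention is easy to get backwards, so it can help to work out the threshold separately. The sketch below redoes the arithmetic from the query above in plain Python. `age_to_days_to_birth` is a hypothetical helper, the first three `days_to_birth` values are copied from the table above, and the fourth row is invented to show a subject who would be excluded.

```python
DAYS_PER_YEAR = 365  # same approximation as the query above; ignores leap years


def age_to_days_to_birth(years):
    """Convert an age in years to the (negative) days_to_birth value."""
    return -years * DAYS_PER_YEAR


threshold = age_to_days_to_birth(50)

sample = {
    "C3L-02170": -26717,          # from the table above
    "C3L-02704": -19879,          # from the table above
    "GENIE-DFCI-035602": -19358,  # from the table above
    "hypothetical-40yo": -14600,  # invented row: 40 * 365 days before entry
}

# Older than 50 means born MORE than 18250 days before entry,
# which is a days_to_birth value BELOW the threshold.
older_than_50 = [sid for sid, days in sample.items() if days < threshold]

# Assumption: Q accepts a literal negative number in place of the
# 50*-365 expression used in the query above.
query_text = f"days_to_birth < {threshold}"

print(threshold)
print(older_than_50)
print(query_text)
```

Using `>` instead of `<` here would select the subjects *younger* than 50, because the values are negative.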
