# Using Operators
---

Operators allow us to make more complex queries by adding, subtracting, or filtering data.

`Q` uses the following operators:

- [`=` : Equals](#equals)
- [`!=` : Not Equal](#not-equal)
- [`OR`](#or) 
- [`AND`](#and)
- [`IN` and `NOT IN`](#in-and-not-in)
- [`%` : Pattern matching with a wildcard](#pattern-matching)
- [`IS` and `IS NOT`](#is-and-is-not)
- [`>`, `<`, `>=`, `<=` : Greater and Less than](#greater-and-less-than)



We use these operators to build more and more complex Q statements before sending our query to `run()` or `count()`.


The CDA provides a custom Python tool for searching CDA data. [`Q`](usage/#q) (short for Query) offers several ways to search and filter data, and several input modes:

---
- **<a href="../../QuickStart/usage/#q">Q()</a>** builds a query that can be used by `run()` or `count()`
- **<a href="../../QuickStart/usage/#qrun">Q.run()</a>** returns data for the specified search
- **<a href="../../QuickStart/usage/#qcount">Q.count()</a>** returns summary information (counts) for data that fit the specified search
- **<a href="../../QuickStart/usage/#columns">columns()</a>** returns entity field names
- **<a href="../../QuickStart/usage/#unique_terms">unique_terms()</a>** returns entity field contents

---
                                                                    
Before we do any work, we need to import these functions from cdapython.
We're also telling cdapython to report its version so we can be sure we're using the one we mean to:

In [1]:
from cdapython import Q, columns, unique_terms, query
print(Q.get_version())

2022.6.22


In [None]:
Q.set_default_project_dataset("broad-dsde-dev.cda_dev")
Q.set_host_url("https://cancerdata.dsde-dev.broadinstitute.org/")
Q.get_host_url()
Q.get_default_project_dataset()

## Equals: `=`

In the other tutorials, we have always used the same query, which uses the `=` operator. 

```Q('ResearchSubject.primary_diagnosis_site = "brain"')```

This operator returns only data where the primary_diagnosis_site is exactly "brain". Let's do a similar search, but for "uterus". We'll look at the researchsubject summary:

In [2]:
Q('ResearchSubject.primary_diagnosis_site = "uterus"').researchsubject.count.run()

Total execution time: 3281 ms


system,count
IDC,867

primary_diagnosis_condition,count
,867

primary_diagnosis_site,count
Uterus,867




## Not Equal: `!=`

The `!=` operator is the opposite of the `=` operator. It returns everything that is not exactly the term you give it:

In [3]:
Q('ResearchSubject.primary_diagnosis_site != "uterus"').researchsubject.count.run()

Total execution time: 3373 ms


system,count
GDC,85416
IDC,61081
PDC,2334

primary_diagnosis_condition,count
Adenomas and Adenocarcinomas,32730
Other,206
Ductal and Lobular Neoplasms,7870
,61083
"Cystic, Mucinous and Serous Neoplasms",3723
Gliomas,4772
Lymphoid Leukemias,2072
Pancreatic Ductal Adenocarcinoma,144
Nevi and Melanomas,3155
Germ Cell Neoplasms,703

primary_diagnosis_site,count
Corpus uteri,780
Lymph nodes,538
Hematopoietic and reticuloendothelial systems,9007
Colon,8559
Breast,21945
Bladder,2155
Thyroid gland,1880
Ovary,4346
Other and unspecified major salivary glands,615
Brain,2923




Note that in our `!=` results, there are 1998 "Uterus, NOS" samples. These don't appear in our `=` search because "Uterus, NOS" is not *exactly* "Uterus".
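
The split between `=` and `!=` can be pictured with plain Python. This is only a sketch of the matching behaviour, not how the CDA evaluates a query: the list of sites is made up for illustration, `equals` is a hypothetical helper, and ignoring case is an assumption based on the search term "uterus" returning "Uterus" above.

```python
# Hypothetical site values for illustration; the real values live in the CDA.
sites = ["Uterus", "Uterus, NOS", "Brain", "Corpus uteri"]


def equals(value, term):
    """Match the whole string, ignoring case (an assumption, see the text above)."""
    return value.lower() == term.lower()


# `=` keeps only whole-string matches; `!=` keeps everything else.
matched = [s for s in sites if equals(s, "uterus")]
not_matched = [s for s in sites if not equals(s, "uterus")]

print(matched)      # ['Uterus']
print(not_matched)  # ['Uterus, NOS', 'Brain', 'Corpus uteri']
```

Every value lands in exactly one of the two lists, which is why "Uterus, NOS" shows up on the `!=` side.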

There are several ways to change our search to get both "Uterus" and "Uterus, NOS", and which we choose will depend on both our interests, and on how different the terms are that we care about.

## OR

If we have a small enough number of search criteria to reliably type them out, we can use the `OR` operator to combine results. In an `OR` query, each data point only needs to meet one of the criteria to be returned. This makes `OR` good for early, broad searches. It *increases* the amount of data returned.

`OR` can be used both inside a Q statement:

In [4]:
Q('ResearchSubject.primary_diagnosis_site = "uterus" OR ResearchSubject.primary_diagnosis_site = "uterus, NOS"').researchsubject.count.run()

Total execution time: 3318 ms


system,count
GDC,1894
PDC,104
IDC,867

primary_diagnosis_condition,count
Uterine Corpus Endometrial Carcinoma,104
Not Reported,12
Adenomas and Adenocarcinomas,1037
Myomatous Neoplasms,183
Complex Mixed and Stromal Neoplasms,294
,867
"Cystic, Mucinous and Serous Neoplasms",313
"Epithelial Neoplasms, NOS",20
Trophoblastic neoplasms,13
"Soft Tissue Tumors and Sarcomas, NOS",14

primary_diagnosis_site,count
"Uterus, NOS",1998
Uterus,867




and to combine 2 or more Q statements:

In [5]:
Query1 = Q('ResearchSubject.primary_diagnosis_site = "uterus, NOS"') 
Query2 = Q('ResearchSubject.primary_diagnosis_condition = "Uterine Corpus Endometrial Carcinoma"')

Query1.OR(Query2).researchsubject.count.run()

Total execution time: 3315 ms


system,count
GDC,1894
PDC,104

primary_diagnosis_condition,count
Complex Mixed and Stromal Neoplasms,294
Myomatous Neoplasms,183
Adenomas and Adenocarcinomas,1037
Uterine Corpus Endometrial Carcinoma,104
Not Reported,12
"Cystic, Mucinous and Serous Neoplasms",313
Trophoblastic neoplasms,13
"Epithelial Neoplasms, NOS",20
"Neoplasms, NOS",3
"Soft Tissue Tumors and Sarcomas, NOS",14

primary_diagnosis_site,count
"Uterus, NOS",1998




For each `OR` you must specify both the search term ("uterus") and where to find the term ("ResearchSubject.primary_diagnosis_site"). This means that the `OR` operator is flexible enough to run searches across columns, or even across endpoints.
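
Because every condition carries its own field name, a long `OR` expression can be assembled from a list of (field, term) pairs. The helper below is a hypothetical convenience written for this tutorial, not part of cdapython. It only builds the text of the query and assumes the terms contain no double quotes.

```python
def or_clause(conditions):
    """Join (field, term) pairs into one OR expression string.

    Hypothetical helper: assumes terms contain no double quotes.
    """
    return " OR ".join(f'{field} = "{term}"' for field, term in conditions)


query_text = or_clause([
    ("ResearchSubject.primary_diagnosis_site", "uterus, NOS"),
    ("ResearchSubject.primary_diagnosis_condition", "Uterine Corpus Endometrial Carcinoma"),
])

print(query_text)
```

The resulting string can be passed to `Q(...)` in the same way as the hand-typed queries in this section.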

## AND 

Like `OR`, `AND` can be used both inside a Q statement, and to join multiple Q statements. `AND` requires that both statements be true simultaneously for each returned bit of data. This makes `AND` good for filtering down results. It *decreases* the amount of data returned.

If we reuse the `OR` examples above, the first one will have no results, because primary_diagnosis_site can have only one value, so it can never be both "uterus" and "uterus, NOS":

In [6]:
Q('ResearchSubject.primary_diagnosis_site = "uterus" AND ResearchSubject.primary_diagnosis_site = "uterus, NOS"').researchsubject.count.run()

Total execution time: 3434 ms




However, for searches where you are interested in subsetting multiple columns, `AND` can help you to quickly filter to only the set you want. Note that `AND` can be used both inside a `Q` statement, and to add multiple `Q` statements together:

In [7]:
Q('ResearchSubject.primary_diagnosis_site = "uterus, NOS" AND ResearchSubject.primary_diagnosis_condition = "Uterine Corpus Endometrial Carcinoma"').researchsubject.count.run()

Total execution time: 3374 ms


system,count
PDC,104

primary_diagnosis_condition,count
Uterine Corpus Endometrial Carcinoma,104

primary_diagnosis_site,count
"Uterus, NOS",104




In [8]:
Query1 = Q('ResearchSubject.primary_diagnosis_site = "uterus, NOS"') 
Query2 = Q('ResearchSubject.primary_diagnosis_condition = "Uterine Corpus Endometrial Carcinoma"')

Query1.AND(Query2).researchsubject.count.run()

Total execution time: 3248 ms


system,count
PDC,104

primary_diagnosis_condition,count
Uterine Corpus Endometrial Carcinoma,104

primary_diagnosis_site,count
"Uterus, NOS",104
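
One way to think about the difference is with sets: `OR` behaves like a union and `AND` like an intersection. The sketch below uses three made-up records and a hypothetical `ids_where` helper. It illustrates the logic only and does not call the CDA.

```python
# Hypothetical records; each has exactly one site and one condition.
records = [
    {"id": 1, "site": "Uterus, NOS", "condition": "Uterine Corpus Endometrial Carcinoma"},
    {"id": 2, "site": "Uterus, NOS", "condition": "Myomatous Neoplasms"},
    {"id": 3, "site": "Uterus", "condition": None},
]


def ids_where(field, term):
    """Return the ids of the records whose field equals term."""
    return {r["id"] for r in records if r[field] == term}


site_nos = ids_where("site", "Uterus, NOS")
site_uterus = ids_where("site", "Uterus")
condition_ucec = ids_where("condition", "Uterine Corpus Endometrial Carcinoma")

either = site_nos | condition_ucec    # OR: a record needs to satisfy one condition
both = site_nos & condition_ucec      # AND: a record must satisfy both conditions
impossible = site_nos & site_uterus   # one field cannot hold two values at once

print(either)      # {1, 2}
print(both)        # {1}
print(impossible)  # set()
```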




## `IN` and `NOT IN`

For instances where you have many search terms, it may be easier (and more readable) to use `IN`. With `IN` you make a list of all the terms you are interested in, and ask whether they are `IN` a given field:

In [9]:
Q('ResearchSubject.primary_diagnosis_site IN ("uterus, NOS", "uterus", "Cervix", "Cervix uteri")').researchsubject.count.run()

Total execution time: 3232 ms


system,count
GDC,2809
PDC,104
IDC,1174

primary_diagnosis_condition,count
Squamous Cell Neoplasms,609
Uterine Corpus Endometrial Carcinoma,104
Adenomas and Adenocarcinomas,1264
"Cystic, Mucinous and Serous Neoplasms",348
Myomatous Neoplasms,183
,1175
Complex Mixed and Stromal Neoplasms,294
"Soft Tissue Tumors and Sarcomas, NOS",14
"Neoplasms, NOS",12
"Epithelial Neoplasms, NOS",26

primary_diagnosis_site,count
Cervix uteri,915
"Uterus, NOS",1998
Uterus,867
Cervix,307




The equivalent request without `IN` would require a large number of `OR` statements. (The triple quotes surrounding this example allow a multi-line Q statement):

``` py
Q("""ResearchSubject.primary_diagnosis_site = "uterus, NOS" OR 
      ResearchSubject.primary_diagnosis_site = "uterus" OR 
      ResearchSubject.primary_diagnosis_site = "Cervix" OR 
      ResearchSubject.primary_diagnosis_site = "Cervix uteri" """).researchsubject.count.run()
      
```

`NOT IN` is the opposite of `IN`, and so gives the inverse results. If we add `NOT` to our above query, we get all the research subjects whose primary_diagnosis_site was not in our list:

In [10]:
Q('ResearchSubject.primary_diagnosis_site NOT IN ("uterus, NOS", "uterus", "Cervix", "Cervix uteri")').researchsubject.count.run()

Total execution time: 3232 ms


system,count
GDC,82607
PDC,2230
IDC,60774

primary_diagnosis_condition,count
Adenomas and Adenocarcinomas,31466
Other,206
Ductal and Lobular Neoplasms,7870
,60775
Plasma Cell Tumors,1066
Lymphoid Leukemias,2072
Myeloid Leukemias,3965
Nevi and Melanomas,3155
Pancreatic Ductal Adenocarcinoma,144
Neuroepitheliomatous Neoplasms,1331

primary_diagnosis_site,count
Ovary,4346
Other and ill-defined sites,1186
Bladder,2155
Breast,21945
Eye and adnexa,222
Brain,2923
Stomach,1870
Liver and intrahepatic bile ducts,1609
Kidney,4788
Skin,3497
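
When the terms of interest already sit in a Python list, the `IN` and `NOT IN` expressions can be generated rather than typed. The `in_clause` helper below is hypothetical (it is not a cdapython function) and assumes the terms contain no double quotes. It reproduces the two query strings used in this section.

```python
def in_clause(field, terms, negate=False):
    """Build an IN (or NOT IN) expression string from a list of terms.

    Hypothetical helper: assumes terms contain no double quotes.
    """
    quoted = ", ".join(f'"{term}"' for term in terms)
    operator = "NOT IN" if negate else "IN"
    return f"{field} {operator} ({quoted})"


field = "ResearchSubject.primary_diagnosis_site"
terms = ["uterus, NOS", "uterus", "Cervix", "Cervix uteri"]

in_query = in_clause(field, terms)
not_in_query = in_clause(field, terms, negate=True)

print(in_query)
print(not_in_query)
```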




## `%` pattern matching

While `OR` is useful for situations with only a few options, in some cases there are many terms that all have similar names, and it would be error-prone to type out every variant. For instance, if we filter the unique terms in "ResearchSubject.primary_diagnosis_site" to everything with "uter" we get:

In [11]:
unique_terms("ResearchSubject.primary_diagnosis_site").to_list(filters="uter")

['Cervix uteri', 'Corpus uteri', 'Uterus', 'Uterus, NOS']

The `%` operator acts as a wildcard, and lets you run a query similar to the filter function in unique_terms:

In [12]:
Q('ResearchSubject.primary_diagnosis_site = "uter%"').researchsubject.count.run()

Total execution time: 3234 ms


system,count
GDC,1894
PDC,104
IDC,867

primary_diagnosis_condition,count
Uterine Corpus Endometrial Carcinoma,104
Adenomas and Adenocarcinomas,1037
Complex Mixed and Stromal Neoplasms,294
,867
Myomatous Neoplasms,183
"Cystic, Mucinous and Serous Neoplasms",313
Mesonephromas,2
"Epithelial Neoplasms, NOS",20
"Soft Tissue Tumors and Sarcomas, NOS",14
Trophoblastic neoplasms,13

primary_diagnosis_site,count
"Uterus, NOS",1998
Uterus,867




Because the `%` is at the end of "uter", this query returns anything that starts with "uter". Depending on your question, you may want to move the `%`, or add more of them:

In [13]:
Q('ResearchSubject.primary_diagnosis_site = "%uter"').researchsubject.count.run()

Total execution time: 3266 ms




In [14]:
Q('ResearchSubject.primary_diagnosis_site = "%uter%"').researchsubject.count.run()

Total execution time: 3310 ms


system,count
GDC,3589
PDC,104
IDC,867

primary_diagnosis_condition,count
Squamous Cell Neoplasms,609
Adenomas and Adenocarcinomas,1671
Uterine Corpus Endometrial Carcinoma,104
"Cystic, Mucinous and Serous Neoplasms",487
Complex Mixed and Stromal Neoplasms,320
,868
"Epithelial Neoplasms, NOS",230
Myomatous Neoplasms,187
"Neoplasms, NOS",12
"Soft Tissue Tumors and Sarcomas, NOS",14

primary_diagnosis_site,count
Corpus uteri,780
Cervix uteri,915
"Uterus, NOS",1998
Uterus,867
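
The effect of moving the `%` can be checked locally against the four terms that `unique_terms` returned. The sketch below translates a `%` pattern into a regular expression. The `like` helper is hypothetical, and it assumes matching ignores case and that `%` stands for any run of characters, including none. It mimics the behaviour seen in the results above and is not the CDA's own implementation.

```python
import re


def like(value, pattern):
    """Return True if value matches pattern, where % matches any run of characters."""
    # Escape the literal pieces, then let each % become ".*".
    regex = ".*".join(re.escape(piece) for piece in pattern.split("%"))
    return re.fullmatch(regex, value, flags=re.IGNORECASE) is not None


terms = ["Cervix uteri", "Corpus uteri", "Uterus", "Uterus, NOS"]

starts_with = [t for t in terms if like(t, "uter%")]
ends_with = [t for t in terms if like(t, "%uter")]
contains = [t for t in terms if like(t, "%uter%")]

print(starts_with)  # ['Uterus', 'Uterus, NOS']
print(ends_with)    # []
print(contains)     # ['Cervix uteri', 'Corpus uteri', 'Uterus', 'Uterus, NOS']
```

No term ends in exactly "uter", which is consistent with the empty result for `"%uter"`.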




There may be cases in which you want to filter out all of the data with some partial word in it, in which case, you can combine `%` with `!=`: 

In [15]:
Q('sex != "f%"').subject.count.run()

Total execution time: 3231 ms


system,count
IDC,56038
GDC,40024
PDC,1089

sex,count
not reported,266
,51216
male,39793
unspecified,5
unknown,81

race,count
white,23406
,51216
chinese,65
asian,1348
black or african american,1815
not reported,9881
not allowed to collect,1106
Unknown,2027
american indian or alaska native,56
other,415

ethnicity,count
,51216
not hispanic or latino,23020
not reported,11796
hispanic or latino,1450
Unknown,2293
not allowed to collect,1586

cause_of_death,count
,90714
HCC recurrence,5
Not Reported,335
Cancer Related,198
Unknown,22
Not Cancer Related,76
Surgical Complications,3
"Cardiovascular Disorder, NOS",3
Infection,3
Cerebral Hemorrhage,1




## IS and IS NOT

In computing, lack of data is often treated as a special case. In the CDA, values listed as "None" are actually `null`, that is, the field is empty. In order to search for emptiness, you need to use the special operator `IS`:

In [16]:
Q('ResearchSubject.primary_diagnosis_condition IS null').researchsubject.count.run()

Total execution time: 3266 ms


system,count
IDC,61948
GDC,2

primary_diagnosis_condition,count
,61950

primary_diagnosis_site,count
Ovary,664
Head-Neck,2704
Colon,1491
Breast,12587
Chest,28221
Lung,4728
Abdomen,92
Various,449
Brain,1165
"Abdomen, Mediastinum",176




More commonly, you will want to filter *out* the empty fields, in which case you use its companion operator `IS NOT`:

In [17]:
Q('sex IS NOT null').subject.count.run()

Total execution time: 3192 ms


system,count
GDC,84979
PDC,2231
IDC,11004

sex,count
female,45509
male,39793
not reported,266
unknown,81
unspecified,5

race,count
black or african american,4567
white,49069
not reported,21816
asian,2951
chinese,90
american indian or alaska native,116
Unknown,3985
other,947
not allowed to collect,2058
native hawaiian or other pacific islander,55

ethnicity,count
not hispanic or latino,48382
not reported,26034
Unknown,4455
hispanic or latino,3131
not allowed to collect,3652

cause_of_death,count
,84257
Not Reported,797
HCC recurrence,7
Cancer Related,336
Unknown,131
"Cardiovascular Disorder, NOS",4
Infection,7
Not Cancer Related,107
Cancer cell proliferation,1
Surgical Complications,4
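
Python makes the same distinction between an empty field and a text value. In the sketch below, `None` plays the role of `null`. The rows are made up, and the comparison is only an analogy for what `IS null` and `IS NOT null` select.

```python
# Hypothetical rows; None stands in for a null (empty) field.
rows = [
    {"sex": "female"},
    {"sex": None},
    {"sex": "not reported"},
    {"sex": None},
]

is_null = [r for r in rows if r["sex"] is None]
is_not_null = [r for r in rows if r["sex"] is not None]

# An empty field is not the same as the text "None".
text_none = [r for r in rows if r["sex"] == "None"]

print(len(is_null))      # 2
print(len(is_not_null))  # 2
print(len(text_none))    # 0
```

Note that "not reported" is a real value rather than an empty field, so it stays in the `IS NOT null` results.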




## Greater and Less than

While all of the above can also be used to search for numbers, there are four operators that only work for numerical values:

- `>` : Greater than
- `<` : Less than
- `>=` : Greater than or Equal to
- `<=` : Less than or Equal to

These can all be used in place of the `=` sign in queries where you are filtering by a numeric value. In this search, we find specimens for subjects who were at least 50 years old (counting a year as 365 days) when they entered the study. As the study entry date is day 0, `days_to_birth` is reported as a negative number:

In [20]:
Q('days_to_birth <= 50*-365 ').specimen.run().to_dataframe()

Total execution time: 15098 ms


Unnamed: 0,id,identifier,associated_project,age_at_collection,primary_disease_type,anatomical_site,source_material_type,specimen_type,derived_from_specimen,subject_id,researchsubject_id
0,00c574fc-2d6c-47da-8b45-f4cbadfec86a,"[{'system': 'GDC', 'value': '00c574fc-2d6c-47d...",TCGA-UCS,-25064,Complex Mixed and Stromal Neoplasms,,Primary Tumor,slide,f30073a4-2ade-5ecb-b69f-52cff561505e,TCGA-N8-A4PN,8b7c42cd-fc2a-4023-843e-a411821f94ff
1,030f905e-f514-5a25-b067-b9ecc0f46cc4,"[{'system': 'GDC', 'value': '030f905e-f514-5a2...",TCGA-MESO,-22135,Mesothelial Neoplasms,,Primary Tumor,portion,5464461d-010f-4713-bc5b-a176601b9218,TCGA-SH-A9CU,f544c652-8ee2-4bfe-a42c-41ab4b3697d1
2,05b857c3-124e-4b4a-a52d-8a5ea4ee1975,"[{'system': 'GDC', 'value': '05b857c3-124e-4b4...",TCGA-PCPG,-27311,Paragangliomas and Glomus Tumors,,Primary Tumor,aliquot,cb3a2587-a3fc-4cd0-8062-2a72d007e123,TCGA-S7-A7WT,cc95f915-33e7-4c57-9e6b-4c0e2ba598b2
3,07845351-b136-4ce9-ac2e-9218adb85194,"[{'system': 'GDC', 'value': '07845351-b136-4ce...",TCGA-UCS,-24823,Complex Mixed and Stromal Neoplasms,,Primary Tumor,aliquot,4160f0e5-1cb3-41de-a363-0bbe71b8c174,TCGA-N5-A4RO,4f4906dc-7ebd-47f1-a8f5-b35d3950e740
4,08b7411a-a54f-4581-8974-8c913d11a779,"[{'system': 'GDC', 'value': '08b7411a-a54f-458...",TCGA-PCPG,-19237,Paragangliomas and Glomus Tumors,,Primary Tumor,analyte,d4999a0b-58ad-435b-993a-4f5bc109ef0b,TCGA-QR-A7IN,dc5c8c56-e3be-4649-8df5-622d497955ce
...,...,...,...,...,...,...,...,...,...,...,...
95,8eca40b3-ebe0-45bb-90d9-2c9c45ce221c,"[{'system': 'GDC', 'value': '8eca40b3-ebe0-45b...",TCGA-PCPG,-24834,Paragangliomas and Glomus Tumors,,Blood Derived Normal,portion,7c86af18-de2e-449f-b675-2b718e3af7e3,TCGA-W2-A7HC,da094e20-df07-4efa-892c-33f4ac5c3056
96,8fed554b-543a-4aec-8516-16e3e3ee0f1e,"[{'system': 'GDC', 'value': '8fed554b-543a-4ae...",TCGA-UVM,-20872,Nevi and Melanomas,,Primary Tumor,sample,initial specimen,TCGA-WC-A87W,60d7b6cf-d605-4837-9061-612f6bbb7393
97,953898f4-51f3-5d82-baa6-eda0f9054b04,"[{'system': 'GDC', 'value': '953898f4-51f3-5d8...",CTSP-DLBCL1,-20786,Mature B-Cell Lymphomas,,Tumor,portion,9bfc1e7b-2d48-40f6-bea6-6527c9b42fd7,CTSP-AD14,9b42ae91-9921-48a0-8cac-e6b637804b3d
98,955c38ee-17c5-49a4-9f46-c5a68c0463ed,"[{'system': 'GDC', 'value': '955c38ee-17c5-49a...",HCMI-CMDC,-26209,Adenomas and Adenocarcinomas,,Next Generation Cancer Model,analyte,d2bffd02-8c71-51f3-846f-6c2c01968921,HCM-CSHL-0238-C18,67e3c17e-e524-4681-8162-99742c802256
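
The arithmetic behind `50*-365` is easy to get backwards, so here is a small worked sketch. The `days_to_birth_cutoff` helper and the sample values are hypothetical. As in the query above, a year is approximated as 365 days and leap days are ignored.

```python
DAYS_PER_YEAR = 365  # same approximation as the query above; leap days are ignored


def days_to_birth_cutoff(age_years):
    """Convert an age in years to the matching (negative) days_to_birth value."""
    return age_years * -DAYS_PER_YEAR


cutoff = days_to_birth_cutoff(50)

# Hypothetical days_to_birth values: more negative means older at study entry.
values = [-25064, -18250, -18249, -9000]
at_least_50 = [v for v in values if v <= cutoff]

print(cutoff)       # -18250
print(at_least_50)  # [-25064, -18250]
```

Because the values are negative, "older than" translates to `<=` rather than `>=`.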
