# Using Operators
---

Operators allow us to make more complex queries by adding, subtracting, or filtering data.

`Q` uses the following operators:

- [`=` : Equals](#equals)
- [`!=` : Not Equal](#not-equal)
- [`OR`](#or) 
- [`AND`](#and)
- [`IN` and `NOT IN`](#in-and-not-in)
- [`%` : Pattern matching with a wildcard](#pattern-matching)
- [`IS` and `IS NOT`](#is-and-is-not)
- [`>`, `<`, `>=`, `<=` : Greater and Less than](#greater-and-less-than)



We use these operators to build more and more complex Q statements before sending our query to `run()` or `count()`.


The CDA provides a custom Python tool for searching CDA data. [`Q`](usage/#q) (short for Query) offers several ways to search and filter data, and several input modes:

---
- **<a href="../../QuickStart/usage/#q">Q()</a>** builds a query that can be used by `run()` or `count()`
- **<a href="../../QuickStart/usage/#qrun">Q.run()</a>** returns data for the specified search
- **<a href="../../QuickStart/usage/#qcount">Q.count()</a>** returns summary information (counts) for data that fit the specified search
- **<a href="../../QuickStart/usage/#columns">columns()</a>** returns entity field names
- **<a href="../../QuickStart/usage/#unique_terms">unique_terms()</a>** returns entity field contents

---
                                                                    
Before we do any work, we need to import these functions from cdapython.
We're also telling cdapython to report its version so we can be sure we're using the one we mean to:

In [1]:
from cdapython import Q, columns, unique_terms, query
print(Q.get_version())

2022.6.28


In [2]:
Q.set_default_project_dataset("broad-dsde-dev.cda_dev")
Q.set_host_url("https://cancerdata.dsde-dev.broadinstitute.org/")
Q.get_host_url()
Q.get_default_project_dataset()

'broad-dsde-dev.cda_dev'

## Equals: `=`

In the other tutorials, we have always used the same query, which uses the `=` operator. 

```Q('ResearchSubject.primary_diagnosis_site = "brain"')```

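This operator will only return data where the primary_diagnosis_site is exactly "brain". The match appears to ignore letter case: in the results below, a search for "uterus" returns "Uterus". Here let's do a similar search, but for "uterus". We'll look at the researchsubject summary: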

In [3]:
Q('ResearchSubject.primary_diagnosis_site = "uterus"').researchsubject.count.run()

# total : 867     
#    files : 242362   
# system	count
# IDC	867 
# primary_diagnosis_condition	count
# None	867 
# primary_diagnosis_site	count
# Uterus	867

Total execution time: 3518 ms


system,count
IDC,867

primary_diagnosis_condition,count
,867

primary_diagnosis_site,count
Uterus,867




## Not Equal: `!=`

The `!=` operator does the opposite of the `=` operator: it returns everything that is not exactly the term you give it:

In [4]:
Q('ResearchSubject.primary_diagnosis_site != "uterus"').researchsubject.count.run()

# total : 151363   
#   files : 39829706  
# system	count
# GDC	85552
# PDC	2334
# IDC	63477 
# primary_diagnosis_condition	count
# Ductal and Lobular Neoplasms	7871
# Myomatous Neoplasms	633
# Ovarian Serous Cystadenocarcinoma	283
# Adnexal and Skin Appendage Neoplasms	58
# Nevi and Melanomas	3158
# Adenomas and Adenocarcinomas	32747
# Lymphoid Leukemias	2072
# None	63479
# Cystic, Mucinous and Serous Neoplasms	3723
# Gliomas	4774
# Squamous Cell Neoplasms	5077
# Complex Mixed and Stromal Neoplasms	1827
# Uterine Corpus Endometrial Carcinoma	104
# Transitional Cell Papillomas and Carcinomas	1885
# Epithelial Neoplasms, NOS	5696
# Neoplasms, NOS	1359
# Germ Cell Neoplasms	703
# Plasma Cell Tumors	1066
# Pancreatic Ductal Adenocarcinoma	144
# Acinar Cell Neoplasms	300
# Glioblastoma	100
# Other	206
# Neuroepitheliomatous Neoplasms	1332
# Thymic Epithelial Neoplasms	262
# Myeloid Leukemias	3965
# Breast Invasive Carcinoma	251
# Lung Squamous Cell Carcinoma	118
# Colon Adenocarcinoma	164
# Lung Adenocarcinoma	216
# Mature B-Cell Lymphomas	1019
# Mesothelial Neoplasms	647
# Not Applicable	440
# Osseous and Chondromatous Neoplasms	615
# Leukemias, NOS	118
# Malignant Lymphomas, NOS or Diffuse	42
# Acute Myeloid Leukemia	2
# Nerve Sheath Tumors	115
# Fibroepithelial Neoplasms	25
# Oral Squamous Cell Carcinoma	38
# Paragangliomas and Glomus Tumors	241
# Clear Cell Renal Cell Carcinoma	116
# Not Reported	271
# Hepatocellular Carcinoma	170
# Myelodysplastic Syndromes	386
# Rectum Adenocarcinoma	30
# Soft Tissue Tumors and Sarcomas, NOS	315
# Miscellaneous Tumors	89
# Pediatric/AYA Brain Tumors	199
# Head and Neck Squamous Cell Carcinoma	110
# Early Onset Gastric Cancer	80
# Fibromatous Neoplasms	322
# Chronic Myeloproliferative Disorders	476
# Synovial-like Neoplasms	98
# Miscellaneous Bone Tumors	130
# Lipomatous Neoplasms	343
# Meningiomas	289
# Granular Cell Tumors and Alveolar Soft Part Sarcomas	23
# Mucoepidermoid Neoplasms	60
# Complex Epithelial Neoplasms	254
# Unknown	63
# Precursor Cell Lymphoblastic Lymphoma	12
# Blood Vessel Tumors	156
# Specialized Gonadal Neoplasms	124
# Mature T- and NK-Cell Lymphomas	94
# Basal Cell Neoplasms	45
# Neoplasms of Histiocytes and Accessory Lymphoid Cells	66
# Other Leukemias	68
# Myxomatous Neoplasms	18
# Hodgkin Lymphoma	11
# Mesonephromas	5
# Trophoblastic neoplasms	21
# Chromophobe Renal Cell Carcinoma	1
# Mast Cell Tumors	10
# Odontogenic Tumors	3
# Giant Cell Tumors	3
# Immunoproliferative Diseases	4
# Other Hematologic Disorders	20
# Papillary Renal Cell Carcinoma	2
# Lymphatic Vessel Tumors	1 
# primary_diagnosis_site	count
# Kidney	4844
# Heart, mediastinum, and pleura	706
# Prostate gland	2354
# Retroperitoneum and peritoneum	384
# Uterus, NOS	2000
# Bladder	2156
# Breast	22938
# Skin	3501
# Corpus uteri	780
# Colon	8574
# Not Reported	506
# Brain	3716
# Thyroid gland	1880
# Head-Neck	3053
# Ovary	4349
# Bronchus and lung	12267
# Hematopoietic and reticuloendothelial systems	9007
# Pancreas	3355
# Liver and intrahepatic bile ducts	1609
# Spinal cord, cranial nerves, and other parts of central nervous system	3703
# Chest	28221
# Lung	4848
# Rectum	1310
# Connective, subcutaneous and other soft tissues	1574
# Cervix uteri	915
# Lymph nodes	538
# Floor of mouth	56
# Thymus	431
# Esophagus	1588
# Stomach	1876
# Testis	542
# Other and unspecified parts of tongue	133
# Bones, joints and articular cartilage of limbs	268
# Other and ill-defined sites in lip, oral cavity and pharynx	361
# Abdomen	92
# Head and Neck	148
# Unknown	3233
# Adrenal gland	851
# Hypopharynx	25
# Bones, joints and articular cartilage of other and unspecified sites	455
# Gallbladder	265
# Other and unspecified major salivary glands	615
# Eye and adnexa	222
# Larynx	169
# Other and ill-defined sites	1189
# Tonsil	46
# None	340
# Liver	667
# Other and ill-defined digestive organs	720
# Peripheral nerves and autonomic nervous system	418
# Oropharynx	194
# Rectosigmoid junction	81
# Abdomen, Mediastinum	176
# Various (11 locations)	89
# Head	105
# Prostate	2139
# Various	449
# Pelvis, Prostate, Anus	58
# Phantom	33
# Extremities	51
# Abdomen, Pelvis	230
# Testicles	150
# Chest-Abdomen-Pelvis, Leg, TSpine	261
# Thyroid	507
# Cervix	307
# Adrenal Glands	271
# Bile Duct	51
# Ear	242
# Vagina	72
# Base of tongue	24
# Other and unspecified parts of mouth	43
# Other and unspecified parts of biliary tract	226
# Lip	9
# Anus and anal canal	235
# Other and unspecified urinary organs	217
# Other and unspecified female genital organs	161
# Other endocrine glands and related structures	181
# Marrow, Blood	89
# Palate	5
# Gum	11
# Small intestine	269
# Intraocular	80
# Meninges	243
# Mesothelium	87
# Nasopharynx	101
# Nasal cavity and middle ear	40
# Penis	33
# Ureter	15
# Lung Phantom	8
# Trachea	7
# Vulva	10
# Other and ill-defined sites within respiratory system and intrathoracic organs	2
# Other and unspecified male genital organs	1
# Renal pelvis	1
# Pancreas	1

Total execution time: 3548 ms


system,count
GDC,85552
PDC,2334
IDC,63477

primary_diagnosis_condition,count
Ductal and Lobular Neoplasms,7871
Myomatous Neoplasms,633
Ovarian Serous Cystadenocarcinoma,283
Adnexal and Skin Appendage Neoplasms,58
Nevi and Melanomas,3158
Adenomas and Adenocarcinomas,32747
Lymphoid Leukemias,2072
,63479
"Cystic, Mucinous and Serous Neoplasms",3723
Gliomas,4774

primary_diagnosis_site,count
Kidney,4844
"Heart, mediastinum, and pleura",706
Prostate gland,2354
Retroperitoneum and peritoneum,384
"Uterus, NOS",2000
Bladder,2156
Breast,22938
Skin,3501
Corpus uteri,780
Colon,8574




Note that in our `!=` results, there are 2000 "Uterus, NOS" samples. These don't appear in our `=` search because "Uterus, NOS" is not *exactly* "Uterus".
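
To see why exact matching separates these two terms, here is a small stand-alone sketch. It does not call the CDA; it works on a hand-copied subset of the site counts shown on this page, and it assumes (based on the output above, where a search for "uterus" returned "Uterus") that `=` ignores letter case. The helper name `equals` is hypothetical.

```python
# Site counts copied from the outputs on this page (a small subset, for illustration only).
site_counts = {
    "Uterus": 867,
    "Uterus, NOS": 2000,
    "Corpus uteri": 780,
    "Cervix uteri": 915,
}

def equals(value, term):
    # Assumption: `=` compares the whole string and ignores letter case.
    return value.lower() == term.lower()

# What `= "uterus"` would keep, and what `!= "uterus"` would keep.
matched = {site: n for site, n in site_counts.items() if equals(site, "uterus")}
not_matched = {site: n for site, n in site_counts.items() if not equals(site, "uterus")}

print(matched)              # only the "Uterus" entry
print(sorted(not_matched))  # every other site, including "Uterus, NOS"
```

Because the comparison covers the whole string, "Uterus, NOS" lands on the `!=` side even though it starts with the same word.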

There are several ways to change our search to get both "Uterus" and "Uterus, NOS". Which one we choose depends on our interests and on how different the terms we care about are.

## OR

If we have a small enough number of search criteria to reliably type them out, we can use the `OR` operator to combine results. In an `OR` query, each data point only needs to meet a single criterion to be returned. This makes `OR` good for early, broad searches. It *increases* the amount of data returned.

`OR` can be used both inside a Q statement:

In [5]:
Q('ResearchSubject.primary_diagnosis_site = "uterus" OR ResearchSubject.primary_diagnosis_site = "uterus, NOS"').researchsubject.count.run()

#    total : 2867    
#    files : 257215   
# system	count
# GDC	1896
# PDC	104
# IDC	867 
# primary_diagnosis_condition	count
# Adenomas and Adenocarcinomas	1038
# Myomatous Neoplasms	184
# Not Reported	12
# Uterine Corpus Endometrial Carcinoma	104
# Complex Mixed and Stromal Neoplasms	294
# None	867
# Cystic, Mucinous and Serous Neoplasms	313
# Mesonephromas	2
# Epithelial Neoplasms, NOS	20
# Soft Tissue Tumors and Sarcomas, NOS	14
# Complex Epithelial Neoplasms	2
# Trophoblastic neoplasms	13
# Neoplasms, NOS	3
# Neuroepitheliomatous Neoplasms	1 
# primary_diagnosis_site	count
# Uterus, NOS	2000
# Uterus	867

Total execution time: 3353 ms


system,count
GDC,1896
PDC,104
IDC,867

primary_diagnosis_condition,count
Adenomas and Adenocarcinomas,1038
Myomatous Neoplasms,184
Not Reported,12
Uterine Corpus Endometrial Carcinoma,104
Complex Mixed and Stromal Neoplasms,294
,867
"Cystic, Mucinous and Serous Neoplasms",313
Mesonephromas,2
"Epithelial Neoplasms, NOS",20
"Soft Tissue Tumors and Sarcomas, NOS",14

primary_diagnosis_site,count
"Uterus, NOS",2000
Uterus,867




and to combine 2 or more Q statements:

In [6]:
Query1 = Q('ResearchSubject.primary_diagnosis_site = "uterus, NOS"') 
Query2 = Q('ResearchSubject.primary_diagnosis_condition = "Uterine Corpus Endometrial Carcinoma"')

Query1.OR(Query2).researchsubject.count.run()

#     total : 2000    
#    files : 14853    
# system	count
# PDC	104
# GDC	1896 
# primary_diagnosis_condition	count
# Uterine Corpus Endometrial Carcinoma	104
# Soft Tissue Tumors and Sarcomas, NOS	14
# Adenomas and Adenocarcinomas	1038
# Complex Mixed and Stromal Neoplasms	294
# Myomatous Neoplasms	184
# Cystic, Mucinous and Serous Neoplasms	313
# Not Reported	12
# Trophoblastic neoplasms	13
# Epithelial Neoplasms, NOS	20
# Complex Epithelial Neoplasms	2
# Mesonephromas	2
# Neoplasms, NOS	3
# Neuroepitheliomatous Neoplasms	1 
# primary_diagnosis_site	count
# Uterus, NOS	2000

Total execution time: 3449 ms


system,count
PDC,104
GDC,1896

primary_diagnosis_condition,count
Uterine Corpus Endometrial Carcinoma,104
"Soft Tissue Tumors and Sarcomas, NOS",14
Adenomas and Adenocarcinomas,1038
Complex Mixed and Stromal Neoplasms,294
Myomatous Neoplasms,184
"Cystic, Mucinous and Serous Neoplasms",313
Not Reported,12
Trophoblastic neoplasms,13
"Epithelial Neoplasms, NOS",20
Complex Epithelial Neoplasms,2

primary_diagnosis_site,count
"Uterus, NOS",2000




For each `OR` you must specify both the search term ("uterus") and where to find the term ("ResearchSubject.primary_diagnosis_site"). This means that the `OR` operator is flexible enough to run searches across columns, or even across endpoints.

## AND 

Like `OR`, `AND` can be used both inside a Q statement, and to join multiple Q statements. `AND` requires that both statements be true simultaneously for each returned bit of data. This makes `AND` good for filtering down results. It *decreases* the amount of data returned.
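
The difference between `OR` and `AND` can be sketched without touching the CDA at all. The example below is only an illustration: the three groups and their counts are taken from outputs on this page, "(other conditions)" is a made-up label that lumps together every "Uterus, NOS" condition except Uterine Corpus Endometrial Carcinoma, and the function names are hypothetical.

```python
# (primary_diagnosis_site, primary_diagnosis_condition, number of research subjects)
groups = [
    ("Uterus, NOS", "Uterine Corpus Endometrial Carcinoma", 104),
    ("Uterus, NOS", "(other conditions)", 1896),  # lumped together for brevity
    ("Uterus", None, 867),
]

def site_matches(site, condition):
    return site == "Uterus, NOS"

def condition_matches(site, condition):
    return condition == "Uterine Corpus Endometrial Carcinoma"

def total(predicate):
    # Add up the subjects in every group that satisfies the predicate.
    return sum(count for site, condition, count in groups if predicate(site, condition))

# OR: a group is kept if either test passes. AND: both tests must pass.
or_total = total(lambda s, c: site_matches(s, c) or condition_matches(s, c))
and_total = total(lambda s, c: site_matches(s, c) and condition_matches(s, c))

print(or_total)   # 2000
print(and_total)  # 104
```

The `OR` total is the larger one because it keeps every group that passes at least one test, while `AND` keeps only the group that passes both.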

If we reuse the `OR` examples above, the first one will have no results, because primary_diagnosis_site can have only one value, so it can never be both "uterus" and "uterus, NOS":

In [7]:
Q('ResearchSubject.primary_diagnosis_site = "uterus" AND ResearchSubject.primary_diagnosis_site = "uterus, NOS"').researchsubject.count.run()

#no results

Total execution time: 3906 ms




However, for searches where you are interested in subsetting multiple columns, `AND` can help you to quickly filter to only the set you want. Note that `AND` can be used both inside a `Q` statement, and to add multiple `Q` statements together:

In [8]:
Q('ResearchSubject.primary_diagnosis_site = "uterus, NOS" AND ResearchSubject.primary_diagnosis_condition = "Uterine Corpus Endometrial Carcinoma"').researchsubject.count.run()

#  total : 104     
#     files : 2560    
# system	count
# PDC	104 
# primary_diagnosis_condition	count
# Uterine Corpus Endometrial Carcinoma	104 
# primary_diagnosis_site	count
# Uterus, NOS	104

Total execution time: 3440 ms


system,count
PDC,104

primary_diagnosis_condition,count
Uterine Corpus Endometrial Carcinoma,104

primary_diagnosis_site,count
"Uterus, NOS",104




In [9]:
Query1 = Q('ResearchSubject.primary_diagnosis_site = "uterus, NOS"') 
Query2 = Q('ResearchSubject.primary_diagnosis_condition = "Uterine Corpus Endometrial Carcinoma"')

Query1.AND(Query2).researchsubject.count.run()

Total execution time: 3327 ms


system,count
PDC,104

primary_diagnosis_condition,count
Uterine Corpus Endometrial Carcinoma,104

primary_diagnosis_site,count
"Uterus, NOS",104




## `IN` and `NOT IN`

For instances where you have many search terms, it may be easier (and more readable) to use `IN`. With `IN` you make a list of all the terms you are interested in, and ask whether the value of a given field is `IN` that list:

In [10]:
Q('ResearchSubject.primary_diagnosis_site IN ("uterus, NOS", "uterus", "Cervix", "Cervix uteri")').researchsubject.count.run()

#     total : 4089    
#    files : 299923   
# system	count
# GDC	2811
# PDC	104
# IDC	1174 
# primary_diagnosis_condition	count
# Uterine Corpus Endometrial Carcinoma	104
# Squamous Cell Neoplasms	609
# Not Reported	12
# Adenomas and Adenocarcinomas	1265
# Myomatous Neoplasms	184
# Complex Mixed and Stromal Neoplasms	294
# Complex Epithelial Neoplasms	27
# None	1175
# Cystic, Mucinous and Serous Neoplasms	348
# Soft Tissue Tumors and Sarcomas, NOS	14
# Epithelial Neoplasms, NOS	26
# Trophoblastic neoplasms	13
# Neoplasms, NOS	12
# Mesonephromas	5
# Neuroepitheliomatous Neoplasms	1 
# primary_diagnosis_site	count
# Cervix uteri	915
# Uterus, NOS	2000
# Uterus	867
# Cervix	307

Total execution time: 3574 ms


system,count
GDC,2811
PDC,104
IDC,1174

primary_diagnosis_condition,count
Uterine Corpus Endometrial Carcinoma,104
Squamous Cell Neoplasms,609
Not Reported,12
Adenomas and Adenocarcinomas,1265
Myomatous Neoplasms,184
Complex Mixed and Stromal Neoplasms,294
Complex Epithelial Neoplasms,27
,1175
"Cystic, Mucinous and Serous Neoplasms",348
"Soft Tissue Tumors and Sarcomas, NOS",14

primary_diagnosis_site,count
Cervix uteri,915
"Uterus, NOS",2000
Uterus,867
Cervix,307




The equivalent request without `IN` would require a large number of `OR` statements. (The triple quotes surrounding this example are to allow a multi-line Q statement):

``` py
Q("""ResearchSubject.primary_diagnosis_site = "uterus, NOS" OR 
      ResearchSubject.primary_diagnosis_site = "uterus" OR 
      ResearchSubject.primary_diagnosis_site = "Cervix" OR 
      ResearchSubject.primary_diagnosis_site = "Cervix uteri" """).researchsubject.count.run()
      
```
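
If the terms already live in a Python list, the query text can be assembled rather than typed by hand. The sketch below only builds strings, so it runs without cdapython. The helpers `in_clause` and `or_chain` are hypothetical names (they are not part of cdapython), and they assume that no term contains a double quote.

```python
def in_clause(field, terms, negate=False):
    """Build 'field IN ("a", "b")' (or 'field NOT IN (...)') as plain text."""
    quoted = ", ".join(f'"{term}"' for term in terms)
    operator = "NOT IN" if negate else "IN"
    return f"{field} {operator} ({quoted})"

def or_chain(field, terms):
    """Build the longer 'field = "a" OR field = "b"' form of the same search."""
    return " OR ".join(f'{field} = "{term}"' for term in terms)

field = "ResearchSubject.primary_diagnosis_site"
sites = ["uterus, NOS", "uterus", "Cervix", "Cervix uteri"]

print(in_clause(field, sites))
print(in_clause(field, sites, negate=True))
print(or_chain(field, sites))
```

Each printed string has the same shape as the handwritten queries in this section, so it could be passed to `Q(...)` in the same way.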

`NOT IN` is the opposite of `IN`, and so gives the inverse results. If we add `NOT` to our above query, we get all the researchsubjects whose primary_diagnosis_site was not in our list:

<!--    total : 148141   
  files : 39774745  
system	count
GDC	82741
IDC	63170
PDC	2230 
primary_diagnosis_condition	count
Adenomas and Adenocarcinomas	31482
Neuroepitheliomatous Neoplasms	1331
Squamous Cell Neoplasms	4468
Mature B-Cell Lymphomas	1019
Ductal and Lobular Neoplasms	7871
None	63171
Gliomas	4774
Fibroepithelial Neoplasms	25
Oral Squamous Cell Carcinoma	38
Acinar Cell Neoplasms	300
Nevi and Melanomas	3158
Myeloid Leukemias	3965
Plasma Cell Tumors	1066
Transitional Cell Papillomas and Carcinomas	1885
Lymphoid Leukemias	2072
Epithelial Neoplasms, NOS	5670
Osseous and Chondromatous Neoplasms	615
Ovarian Serous Cystadenocarcinoma	283
Other	206
Thymic Epithelial Neoplasms	262
Cystic, Mucinous and Serous Neoplasms	3375
Clear Cell Renal Cell Carcinoma	116
Pancreatic Ductal Adenocarcinoma	144
Neoplasms, NOS	1347
Glioblastoma	100
Breast Invasive Carcinoma	251
Colon Adenocarcinoma	164
Chronic Myeloproliferative Disorders	476
Lung Squamous Cell Carcinoma	118
Complex Mixed and Stromal Neoplasms	1533
Paragangliomas and Glomus Tumors	241
Lung Adenocarcinoma	216
Leukemias, NOS	118
Mesothelial Neoplasms	647
Soft Tissue Tumors and Sarcomas, NOS	301
Germ Cell Neoplasms	703
Fibromatous Neoplasms	322
Myomatous Neoplasms	449
Meningiomas	289
Hepatocellular Carcinoma	170
Not Applicable	440
Rectum Adenocarcinoma	30
Pediatric/AYA Brain Tumors	199
Complex Epithelial Neoplasms	227
Head and Neck Squamous Cell Carcinoma	110
Early Onset Gastric Cancer	80
Lipomatous Neoplasms	343
Not Reported	259
Myelodysplastic Syndromes	386
Unknown	63
Blood Vessel Tumors	156
Mucoepidermoid Neoplasms	60
Mature T- and NK-Cell Lymphomas	94
Specialized Gonadal Neoplasms	124
Nerve Sheath Tumors	115
Malignant Lymphomas, NOS or Diffuse	42
Synovial-like Neoplasms	98
Miscellaneous Bone Tumors	130
Other Leukemias	68
Adnexal and Skin Appendage Neoplasms	58
Miscellaneous Tumors	89
Precursor Cell Lymphoblastic Lymphoma	12
Odontogenic Tumors	3
Other Hematologic Disorders	20
Lymphatic Vessel Tumors	1
Granular Cell Tumors and Alveolar Soft Part Sarcomas	23
Myxomatous Neoplasms	18
Mast Cell Tumors	10
Neoplasms of Histiocytes and Accessory Lymphoid Cells	66
Basal Cell Neoplasms	45
Trophoblastic neoplasms	8
Papillary Renal Cell Carcinoma	2
Hodgkin Lymphoma	11
Giant Cell Tumors	3
Acute Myeloid Leukemia	2
Chromophobe Renal Cell Carcinoma	1
Immunoproliferative Diseases	4 
primary_diagnosis_site	count
Lymph nodes	538
Prostate gland	2354
Bronchus and lung	12267
Skin	3501
Breast	22938
Hematopoietic and reticuloendothelial systems	9007
Not Reported	506
Retroperitoneum and peritoneum	384
Brain	3716
Head-Neck	3053
Heart, mediastinum, and pleura	706
Stomach	1876
Thyroid gland	1880
Unknown	3233
Ovary	4349
Esophagus	1588
Colon	8574
Kidney	4844
Liver and intrahepatic bile ducts	1609
Pancreas	3355
Rectum	1310
Chest	28221
Lung	4848
Larynx	169
Other and unspecified parts of tongue	133
Floor of mouth	56
Corpus uteri	780
Bladder	2156
Rectosigmoid junction	81
Other and ill-defined sites	1189
Adrenal gland	851
Spinal cord, cranial nerves, and other parts of central nervous system	3703
Other and ill-defined sites in lip, oral cavity and pharynx	361
Abdomen	92
Anus and anal canal	235
Testis	542
Tonsil	46
Bones, joints and articular cartilage of limbs	268
Thymus	431
Various	449
Connective, subcutaneous and other soft tissues	1574
Meninges	243
Other and unspecified major salivary glands	615
Eye and adnexa	222
Liver	667
Other and unspecified parts of mouth	43
Other and unspecified female genital organs	161
Gum	11
Abdomen, Mediastinum	176
Marrow, Blood	89
Various (11 locations)	89
Head	105
Prostate	2139
Lung Phantom	8
Pelvis, Prostate, Anus	58
Phantom	33
Extremities	51
Abdomen, Pelvis	230
Chest-Abdomen-Pelvis, Leg, TSpine	261
Thyroid	507
Adrenal Glands	271
Testicles	150
Mesothelium	87
Ear	242
Other and unspecified parts of biliary tract	226
Head and Neck	148
Bones, joints and articular cartilage of other and unspecified sites	455
Other and unspecified male genital organs	1
Base of tongue	24
Other and ill-defined digestive organs	720
Gallbladder	265
Nasopharynx	101
Oropharynx	194
Small intestine	269
Intraocular	80
Bile Duct	51
Other endocrine glands and related structures	181
Peripheral nerves and autonomic nervous system	418
Penis	33
Other and unspecified urinary organs	217
None	340
Vulva	10
Ureter	15
Lip	9
Nasal cavity and middle ear	40
Vagina	72
Trachea	7
Hypopharynx	25
Palate	5
Other and ill-defined sites within respiratory system and intrathoracic organs	2
Pancreas	1
Renal pelvis	1 -->

In [11]:
Q('ResearchSubject.primary_diagnosis_site NOT IN ("uterus, NOS", "uterus", "Cervix", "Cervix uteri")').researchsubject.count.run()

Total execution time: 3619 ms


system,count
GDC,82741
IDC,63170
PDC,2230

primary_diagnosis_condition,count
Adenomas and Adenocarcinomas,31482
Neuroepitheliomatous Neoplasms,1331
Squamous Cell Neoplasms,4468
Mature B-Cell Lymphomas,1019
Ductal and Lobular Neoplasms,7871
,63171
Gliomas,4774
Fibroepithelial Neoplasms,25
Oral Squamous Cell Carcinoma,38
Acinar Cell Neoplasms,300

primary_diagnosis_site,count
Lymph nodes,538
Prostate gland,2354
Bronchus and lung,12267
Skin,3501
Breast,22938
Hematopoietic and reticuloendothelial systems,9007
Not Reported,506
Retroperitoneum and peritoneum,384
Brain,3716
Head-Neck,3053




## `%` pattern matching

While `OR` is useful for situations with only a few options, in some cases there are many terms that all have similar names, and it would be error-prone to type out every variant. For instance, if we filter the unique terms in "ResearchSubject.primary_diagnosis_site" to everything containing "uter" we get:

In [12]:
unique_terms("ResearchSubject.primary_diagnosis_site").to_list(filters="uter")
#['Cervix uteri', 'Corpus uteri', 'Uterus', 'Uterus, NOS']

['Cervix uteri', 'Corpus uteri', 'Uterus', 'Uterus, NOS']

The `%` operator acts as a wildcard, and lets you run a query similar to the filter function in unique_terms:

In [13]:
Q('ResearchSubject.primary_diagnosis_site = "uter%"').researchsubject.count.run()

#     total : 2867    
#    files : 257215   
# system	count
# GDC	1896
# PDC	104
# IDC	867 
# primary_diagnosis_condition	count
# Adenomas and Adenocarcinomas	1038
# Complex Mixed and Stromal Neoplasms	294
# Uterine Corpus Endometrial Carcinoma	104
# None	867
# Myomatous Neoplasms	184
# Cystic, Mucinous and Serous Neoplasms	313
# Not Reported	12
# Epithelial Neoplasms, NOS	20
# Soft Tissue Tumors and Sarcomas, NOS	14
# Neuroepitheliomatous Neoplasms	1
# Trophoblastic neoplasms	13
# Complex Epithelial Neoplasms	2
# Neoplasms, NOS	3
# Mesonephromas	2 
# primary_diagnosis_site	count
# Uterus, NOS	2000
# Uterus	867

Total execution time: 3484 ms


system,count
GDC,1896
PDC,104
IDC,867

primary_diagnosis_condition,count
Adenomas and Adenocarcinomas,1038
Complex Mixed and Stromal Neoplasms,294
Uterine Corpus Endometrial Carcinoma,104
,867
Myomatous Neoplasms,184
"Cystic, Mucinous and Serous Neoplasms",313
Not Reported,12
"Epithelial Neoplasms, NOS",20
"Soft Tissue Tumors and Sarcomas, NOS",14
Neuroepitheliomatous Neoplasms,1

primary_diagnosis_site,count
"Uterus, NOS",2000
Uterus,867




Because the `%` is at the end of "uter", this query returns anything that starts with "uter". Depending on your question, you may want to move the `%`, or add more of them:

In [14]:
Q('ResearchSubject.primary_diagnosis_site = "%uter"').researchsubject.count.run()
# no results

Total execution time: 3411 ms




In [15]:
Q('ResearchSubject.primary_diagnosis_site = "%uter%"').researchsubject.count.run()

#     total : 4562    
#    files : 302579   
# system	count
# PDC	104
# GDC	3591
# IDC	867 
# primary_diagnosis_condition	count
# Adenomas and Adenocarcinomas	1672
# Squamous Cell Neoplasms	609
# Uterine Corpus Endometrial Carcinoma	104
# Cystic, Mucinous and Serous Neoplasms	487
# Complex Mixed and Stromal Neoplasms	320
# Not Reported	12
# None	868
# Myomatous Neoplasms	188
# Epithelial Neoplasms, NOS	230
# Neuroepitheliomatous Neoplasms	1
# Soft Tissue Tumors and Sarcomas, NOS	14
# Complex Epithelial Neoplasms	27
# Mesonephromas	5
# Trophoblastic neoplasms	13
# Neoplasms, NOS	12 
# primary_diagnosis_site	count
# Uterus, NOS	2000
# Cervix uteri	915
# Corpus uteri	780
# Uterus	867

Total execution time: 5758 ms


system,count
PDC,104
GDC,3591
IDC,867

primary_diagnosis_condition,count
Adenomas and Adenocarcinomas,1672
Squamous Cell Neoplasms,609
Uterine Corpus Endometrial Carcinoma,104
"Cystic, Mucinous and Serous Neoplasms",487
Complex Mixed and Stromal Neoplasms,320
Not Reported,12
,868
Myomatous Neoplasms,188
"Epithelial Neoplasms, NOS",230
Neuroepitheliomatous Neoplasms,1

primary_diagnosis_site,count
"Uterus, NOS",2000
Cervix uteri,915
Corpus uteri,780
Uterus,867
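
The effect of moving the `%` can be reproduced locally on the four terms that `unique_terms` returned above. This is a minimal sketch, not how the CDA implements matching: the helper `like` is a hypothetical name, it translates the pattern into a regular expression, and it assumes that matching ignores letter case (as the outputs above suggest).

```python
import re

def like(value, pattern):
    """Emulate a %-wildcard match, where % stands for any run of characters."""
    # Escape the literal pieces, then let ".*" play the role of each %.
    regex = ".*".join(re.escape(piece) for piece in pattern.split("%"))
    # Assumption: letter case is ignored.
    return re.fullmatch(regex, value, flags=re.IGNORECASE) is not None

sites = ["Cervix uteri", "Corpus uteri", "Uterus", "Uterus, NOS"]

for pattern in ("uter%", "%uter", "%uter%"):
    print(pattern, [site for site in sites if like(site, pattern)])
```

With these four terms, `uter%` keeps the two that start with "uter", `%uter` keeps none because no term ends in "uter", and `%uter%` keeps all four, which lines up with the three query results above.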




There may be cases in which you want to filter out all of the data that contains some partial word. In that case, you can combine `%` with `!=`:

In [16]:
Q('sex != "f%"').subject.count.run()

#   total : 92757    
#   files : 38480355  
# system	count
# IDC	58062
# PDC	1089
# GDC	40095 
# sex	count
# not reported	266
# None	52541
# male	39864
# unspecified	5
# unknown	81 
# race	count
# black or african american	1821
# white	23465
# None	52541
# not reported	9882
# chinese	65
# Unknown	2029
# asian	1349
# other	415
# american indian or alaska native	56
# native hawaiian or other pacific islander	26
# not allowed to collect	1106
# unknown	2 
# ethnicity	count
# not reported	11796
# None	52541
# not hispanic or latino	23084
# Unknown	2294
# hispanic or latino	1456
# not allowed to collect	1586 
# cause_of_death	count
# None	92110
# Not Reported	335
# Metastasis	1
# Infection	3
# Cancer Related	198
# Unknown	22
# Not Cancer Related	76
# HCC recurrence	5
# Cardiovascular Disorder, NOS	3
# Cerebral Hemorrhage	1
# Surgical Complications	3

Total execution time: 5961 ms


system,count
IDC,58062
PDC,1089
GDC,40095

sex,count
not reported,266
,52541
male,39864
unspecified,5
unknown,81

race,count
black or african american,1821
white,23465
,52541
not reported,9882
chinese,65
Unknown,2029
asian,1349
other,415
american indian or alaska native,56
native hawaiian or other pacific islander,26

ethnicity,count
not reported,11796
,52541
not hispanic or latino,23084
Unknown,2294
hispanic or latino,1456
not allowed to collect,1586

cause_of_death,count
,92110
Not Reported,335
Metastasis,1
Infection,3
Cancer Related,198
Unknown,22
Not Cancer Related,76
HCC recurrence,5
"Cardiovascular Disorder, NOS",3
Cerebral Hemorrhage,1
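
One detail in the output above is easy to miss: subjects whose `sex` field is empty (shown as "None") are kept by the `!=` filter. The stand-alone sketch below mimics that behavior on per-value counts copied from the outputs on this page. It does not query the CDA, and the helper `starts_with` is a hypothetical name.

```python
# Subjects per `sex` value; None stands for an empty field.
sex_counts = {
    "female": 45574,
    "male": 39864,
    "not reported": 266,
    "unknown": 81,
    "unspecified": 5,
    None: 52541,
}

def starts_with(value, prefix):
    # An empty field cannot start with anything, so it never matches the pattern.
    return value is not None and value.lower().startswith(prefix.lower())

# Keep everything that does NOT match "f%".
kept = {value: n for value, n in sex_counts.items() if not starts_with(value, "f")}

print(sum(kept.values()))  # 92757, the total reported above
print(None in kept)        # True: empty fields survive the filter
```

If the empty fields are not wanted, they have to be excluded explicitly with an additional condition.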




## IS and IS NOT

In computing, lack of data is often treated as a special case. In the CDA, values listed as "None" are actually `null`; that is, the field is empty. To search for empty fields, you need to use the special operator `IS`:

In [17]:
Q('ResearchSubject.primary_diagnosis_condition IS null').researchsubject.count.run()

#    total : 64346    
#   files : 39142943  
# system	count
# IDC	64344
# GDC	2 
# primary_diagnosis_condition	count
# None	64346 
# primary_diagnosis_site	count
# Breast	13571
# Ovary	664
# Head-Neck	3053
# Colon	1491
# Chest	28221
# Lung	4728
# Abdomen	92
# Lung Phantom	8
# Brain	1953
# Abdomen, Mediastinum	176
# Esophagus	187
# Bladder	431
# Kidney	1373
# Uterus	867
# Marrow, Blood	89
# Various (11 locations)	89
# Pancreas	481
# Skin	612
# Head	105
# Liver	497
# Prostate	2139
# Various	449
# Pelvis, Prostate, Anus	58
# Phantom	33
# Extremities	51
# Abdomen, Pelvis	230
# Testicles	150
# Thymus	125
# Rectum	171
# Stomach	443
# Cervix	307
# Chest-Abdomen-Pelvis, Leg, TSpine	261
# Thyroid	507
# Adrenal Glands	271
# Intraocular	80
# Ear	242
# Mesothelium	87
# Bile Duct	51
# None	1
# Pancreas	1
# Cervix uteri	1

Total execution time: 3338 ms


system,count
IDC,64344
GDC,2

primary_diagnosis_condition,count
,64346

primary_diagnosis_site,count
Breast,13571
Ovary,664
Head-Neck,3053
Colon,1491
Chest,28221
Lung,4728
Abdomen,92
Lung Phantom,8
Brain,1953
"Abdomen, Mediastinum",176




Probably more common is wanting to filter *out* the empty fields, in which case you use its companion operator `IS NOT`:

In [18]:
Q('sex IS NOT null').subject.count.run()

#    total : 85790    
#   files : 3523012   
# system	count
# GDC	85115
# PDC	2231
# IDC	12063 
# sex	count
# female	45574
# male	39864
# not reported	266
# unknown	81
# unspecified	5 
# race	count
# black or african american	4578
# white	49178
# not reported	21819
# chinese	90
# asian	2955
# Unknown	3991
# american indian or alaska native	116
# not allowed to collect	2058
# other	947
# native hawaiian or other pacific islander	55
# unknown	3 
# ethnicity	count
# not hispanic or latino	48500
# not reported	26036
# Unknown	4459
# hispanic or latino	3141
# not allowed to collect	3652
# unknown	2 
# cause_of_death	count
# Not Reported	797
# None	84393
# HCC recurrence	7
# Unknown	131
# Cancer Related	336
# Infection	7
# Not Cancer Related	107
# Surgical Complications	4
# Cancer cell proliferation	1
# Metastasis	2
# Cerebral Hemorrhage	1
# Cardiovascular Disorder, NOS	4

Total execution time: 3420 ms


system,count
GDC,85115
PDC,2231
IDC,12063

sex,count
female,45574
male,39864
not reported,266
unknown,81
unspecified,5

race,count
black or african american,4578
white,49178
not reported,21819
chinese,90
asian,2955
Unknown,3991
american indian or alaska native,116
not allowed to collect,2058
other,947
native hawaiian or other pacific islander,55

ethnicity,count
not hispanic or latino,48500
not reported,26036
Unknown,4459
hispanic or latino,3141
not allowed to collect,3652
unknown,2

cause_of_death,count
Not Reported,797
,84393
HCC recurrence,7
Unknown,131
Cancer Related,336
Infection,7
Not Cancer Related,107
Surgical Complications,4
Cancer cell proliferation,1
Metastasis,2
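
As a rough local analogy, `null` behaves like Python's `None`: it is the absence of a value, not a piece of text. The sketch below uses the per-value `sex` counts that appear in the outputs on this page and does not call the CDA. Note that "not reported", "unknown" and "unspecified" are ordinary text values, so they are not empty.

```python
# Subjects per `sex` value; None stands for an empty (null) field.
sex_counts = {
    "female": 45574,
    "male": 39864,
    "not reported": 266,
    "unknown": 81,
    "unspecified": 5,
    None: 52541,
}

# IS null keeps only the empty fields; IS NOT null keeps everything else.
is_null = sum(n for value, n in sex_counts.items() if value is None)
is_not_null = sum(n for value, n in sex_counts.items() if value is not None)

print(is_null)      # 52541
print(is_not_null)  # 85790, the total reported above
```

To drop the "not reported" style values as well, they would need their own conditions, since `IS NOT null` leaves them in.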




## Greater and Less than

While all of the above can also be used to search for numbers, there are four operators that only work for numerical values:

- `>` : Greater than
- `<` : Less than
- `>=` : Greater than or Equal to
- `<=` : Less than or Equal to

These can all be used in place of the `=` sign in queries where you are filtering by a numeric value. In this search, we find specimens from subjects who were at least 50 years old (counted as 50 × 365 days) when they entered the study. As the study entry date is day 0, `days_to_birth` is reported as a negative number, so older subjects have smaller (more negative) values:

In [19]:
Q('days_to_birth <= 50*-365 ').specimen.run().to_dataframe()

Total execution time: 3687 ms


Unnamed: 0,id,identifier,associated_project,days_to_collection,primary_disease_type,anatomical_site,source_material_type,specimen_type,derived_from_specimen,subject_id,researchsubject_id
0,041df0b4-6342-4f1b-acc9-6ee6309574e9,"[{'system': 'GDC', 'value': '041df0b4-6342-4f1...",TCGA-UVM,272.0,Nevi and Melanomas,,Blood Derived Normal,aliquot,5adb6385-1092-4fca-8999-85112021bd50,TCGA-WC-AA9E,4480d290-5e8a-4289-8e3c-de087e0de412
1,058796c7-6420-11e8-bcf1-0a2705229b82,"[{'system': 'PDC', 'value': '058796c7-6420-11e...",CPTAC-TCGA,265.0,Rectum Adenocarcinoma,Not Reported,Primary Tumor,sample,initial specimen,TCGA-AG-A015,9ec81956-63d8-11e8-bcf1-0a2705229b82
2,084142ba-bd6b-4f13-888c-c3c0ff85a9be,"[{'system': 'GDC', 'value': '084142ba-bd6b-4f1...",TCGA-UCS,98.0,Complex Mixed and Stromal Neoplasms,,Blood Derived Normal,aliquot,26bb4e21-3275-4f4f-9bf8-030d923844e8,TCGA-QM-A5NM,485b221f-f40f-46a0-a9cf-023f807e6146
3,088250d4-6fcc-4ef6-a2a8-f19d377d6beb,"[{'system': 'GDC', 'value': '088250d4-6fcc-4ef...",CGCI-HTMCP-CC,,Squamous Cell Neoplasms,,Blood Derived Normal,analyte,161312c6-a0b6-4738-9ae3-6301c0450d90,HTMCP-03-06-02361,47b1f82d-506a-4d5c-8e1f-da76360d6c25
4,09c6004b-641f-11e8-bcf1-0a2705229b82,"[{'system': 'PDC', 'value': '09c6004b-641f-11e...",CPTAC-TCGA,,Ovarian Serous Cystadenocarcinoma,Not Reported,Primary Tumor,sample,initial specimen,TCGA-36-2544,ec5cacb8-63d7-11e8-bcf1-0a2705229b82
...,...,...,...,...,...,...,...,...,...,...,...
95,85e5044a-a8b8-537b-b9b5-4d1970b8839a,"[{'system': 'GDC', 'value': '85e5044a-a8b8-537...",TCGA-TGCT,0.0,Germ Cell Neoplasms,,Primary Tumor,portion,208c0c78-d557-462a-9723-764770c6e0ff,TCGA-VF-A8AA,b0df5500-d608-416e-82f4-9a99206d24c8
96,8b50b805-0ecd-46f5-b986-aa9d1ba0831f,"[{'system': 'GDC', 'value': '8b50b805-0ecd-46f...",TCGA-CHOL,1113.0,Adenomas and Adenocarcinomas,,Blood Derived Normal,aliquot,82e4036d-45be-4e96-8787-b66fc36a4b38,TCGA-3X-AAV9,41b97b11-acaa-4fbc-b3b0-0abc1bcac13b
97,8ecfcf2f-240e-4116-83d7-376cd74c7a80,"[{'system': 'GDC', 'value': '8ecfcf2f-240e-411...",TCGA-DLBC,,Mature B-Cell Lymphomas,,Blood Derived Normal,analyte,016c4852-9a45-44a1-9f1b-c62deceefee9,TCGA-FA-8693,a43e5f0e-a21f-48d8-97e0-084d413680b7
98,8f5e2995-cea9-4b95-bf65-9a201cb59e71,"[{'system': 'GDC', 'value': '8f5e2995-cea9-4b9...",TCGA-ACC,0.0,Adenomas and Adenocarcinomas,,Primary Tumor,slide,b53c54a2-ec41-5a9e-9dac-050e916d410c,TCGA-OR-A5LL,0304b12d-7640-4150-a581-2eea2b1f2ad5
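
The direction of the comparison in the query above is easy to get backwards, so here is the arithmetic on its own. This is a plain Python sketch with hypothetical helper names; like the query, it approximates a year as 365 days.

```python
DAYS_PER_YEAR = 365  # same approximation as the query above (ignores leap days)

def age_threshold(years):
    """Return the days_to_birth value of someone exactly `years` old on day 0."""
    return years * -DAYS_PER_YEAR

def at_least(days_to_birth, years):
    # Older subjects were born longer ago, so their days_to_birth is MORE negative.
    return days_to_birth <= age_threshold(years)

print(age_threshold(50))     # -18250
print(at_least(-20000, 50))  # True: roughly 54 years old
print(at_least(-10000, 50))  # False: roughly 27 years old
```

Selecting subjects younger than 50 would flip the operator to `>`, since younger subjects have values closer to zero.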
