# Basic Search

The CDA provides a custom Python tool for searching CDA data. [`Q`](usage/#q) (short for Query) offers several ways to search and filter data, and several input modes:

---
- **<a href="../../QuickStart/usage/#q">Q.()</a>** builds a query that can be used by `run()` or `count()`
- **<a href="../../QuickStart/usage/#qrun">Q.run()</a>** returns data for the specified search 
- **<a href="../../QuickStart/usage/#qcount">Q.count()</a>** returns summary information (counts) for data that fit the specified search
- **<a href="../../QuickStart/usage/#columns">columns()</a>** returns entity field names
- **<a href="../../QuickStart/usage/#unique_terms">unique_terms()</a>** returns entity field contents

---
                                                                    
Before we do any work, we need to import these functions from cdapython.
We're also telling cdapython to report its version so we can be sure we're using the one we mean to:

In [1]:
from cdapython import Q, columns, unique_terms, query
print(Q.get_version())

2022.6.22


In [None]:
Q.set_default_project_dataset("broad-dsde-dev.cda_dev")
Q.set_host_url("https://cancerdata.dsde-dev.broadinstitute.org/")
Q.get_host_url()
Q.get_default_project_dataset()

<div class="cdanote" style="background-color:#b3e5d5;color:black;padding:20px;">
    
CDA data comes from three sources:
<ul>
<li><b>The <a href="https://proteomic.datacommons.cancer.gov/pdc/"> Proteomic Data Commons</a> (PDC)</b></li>
<li><b>The <a href="https://gdc.cancer.gov/">Genomic Data Commons</a> (GDC)</b></li>
<li><b>The <a href="https://datacommons.cancer.gov/repository/imaging-data-commons">Imaging Data Commons</a> (IDC)</b></li>
</ul> 
    
The CDA makes this data searchable in four main endpoints:

<ul>
<li><b>subject:</b> A patient entity captures the study-independent metadata for research subjects. Human research subjects are usually not traceable to a particular person to protect the subject's privacy.</li>
<li><b>researchsubject:</b> A research subject is the entity of interest in a specific research study or project, typically a human being or an animal, but can also be a device, group of humans or animals, or a tissue sample. Human research subjects are usually not traceable to a particular person to protect the subject's privacy. This entity plays the role of the case_id in existing data. An individual who participates in 3 studies will have 3 researchsubject IDs.</li>
<li><b>specimen:</b> Any material taken as a sample from a biological entity (living or dead), or from a physical object or the environment. Specimens are usually collected as an example of their kind, often for use in some investigation.</li>
<li><b>file:</b> A unit of data about subjects, researchsubjects, specimens, or their associated information</li>
</ul>
and two endpoints that offer deeper information about data in the researchsubject endpoint:
<ul>
<li><b>diagnosis:</b> A collection of characteristics that describe an abnormal condition of the body as assessed at a point in time. May be used to capture information about neoplastic and non-neoplastic conditions.</li>
<li><b>treatment:</b> Represent medication administration or other treatment types.</li>
</ul>
Any metadata field can be searched from any endpoint; the only difference between search types is what type of data is returned by default. This means that you can think of the CDA as a really, really enormous spreadsheet full of data. To search this enormous spreadsheet, you'd want to select columns, and then filter rows.
</div>


If you are looking to build a cohort of distinct individuals who meet some criteria, search by `subject`. If you want to build a cohort, but are particularly interested in studies rather than the participants per se, search by `researchsubject`. If you are looking for biosamples that can be ordered or a specific format of information (e.g. histological slides), start with `specimen`. If you are primarily looking for files you can reuse for your own analysis, start with `file`.

In CDA search, these concepts can also be chained together, so you can look specifically for specimen subjects, or researchsubject diagnoses. In the four 'main' tables, all of the rows will have one or more files associated with them that can be directly found by chaining, as in specimen files. Diagnosis and treatment do not have files directly associated with them and so can only be used to find files in conjunction with the other searches.

In all cases, any search can use any metadata field; the only difference between search types is what type of data you return by default.



## Basic search with endpoints

Let's try a broad search of the CDA to see what information exists about cancers that were first diagnosed in the brain. To run this simple search, we would first construct a query in `Q` and save it to a variable `myquery`:

In [2]:
myquery = Q('ResearchSubject.primary_diagnosis_site = "brain"')


<div class="cdawarn" style="background-color:#f9cfbf;color:black;padding:20px;">
    
<h3>Where did those terms come from?</h3>
    
If you aren't sure how we knew what terms to put in our search, please refer back to the <a href="../SearchTerms">What search terms are available?</a> notebook. 
</div>
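
The query above is a plain string of the form `field = "value"`. If you expect to build several filters like this, a small helper can keep the quoting consistent. The following is only a sketch: `make_filter` is a hypothetical helper and not part of cdapython, and it does nothing more than produce the text that would be handed to `Q`.

```python
def make_filter(field: str, value: str) -> str:
    """Return a filter string in the form used in this notebook: field = "value"."""
    if '"' in value:
        # Escaping rules for Q are not covered here, so refuse rather than guess.
        raise ValueError("value must not contain double quotes")
    return f'{field} = "{value}"'


filter_text = make_filter("ResearchSubject.primary_diagnosis_site", "brain")
print(filter_text)
# With cdapython imported, this string could be passed on as Q(filter_text).
```

The printed text matches the string used to build `myquery` above.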

### subject
Now we can use that query to search any of the information types. Let's start by looking at what Subjects meet our criteria. To do that, we will send our query to the subject endpoint, then ask for it to run:

In [3]:
subjectresults = myquery.subject.run()

Total execution time: 3683 ms


We saved the output in a variable `subjectresults`, so we don't get much visible output. To see what our results are, we need to look into the variable. The simplest way is to call `subjectresults` directly:

In [4]:
subjectresults


            QueryID: 3125058c-432d-4de0-b8bf-8e2bb7a2e5cd
            
            Offset: 0
            Count: 100
            Total Row Count: 2314
            More pages: True
            

This output tells us our QueryID, which we don't really need, but the computer does to track our questions. Then it tells us four parameters that describe our results:

---

- **Offset:** This is how many rows of information we've told the query to skip in the data. Here we didn't tell it to skip anything, so the offset is zero.
- **Count:** This is how many rows the current page of our results table has. To keep searches fast, we default to pages with 100 rows.
- **Total Row Count:** This is how many rows are in the full results table
- **More pages:** This is always a True or False. False means that our current page has all the available results. True means that we will see only the first 100 results in this table, and will need to page through for more.

---
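
To make the relationship between these four numbers concrete, here is a small sketch that derives paging facts from the values shown above. `page_summary` is a hypothetical helper, not a cdapython function, and it assumes that "More pages" simply means `offset + count` is still smaller than the total row count and that every page uses the same page size.

```python
import math


def page_summary(offset: int, count: int, total_row_count: int) -> dict:
    """Derive paging facts from the numbers in the result summary."""
    rows_seen = offset + count
    return {
        # True while rows remain beyond the current page (assumed rule).
        "more_pages": rows_seen < total_row_count,
        "rows_remaining": max(total_row_count - rows_seen, 0),
        # Number of pages needed if every page holds `count` rows.
        "total_pages": math.ceil(total_row_count / count) if count else 0,
    }


summary = page_summary(offset=0, count=100, total_row_count=2314)
print(summary)
```

For the subject results above, that works out to 24 pages in total, with 2214 rows still to fetch after the first page.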
    
Now that we've seen the metadata about our results, let's look at the actual table. The easiest way to do this is by using the python function `.to_dataframe()` on our `subjectresults` variable:

In [5]:
subjectresults.to_dataframe()

Unnamed: 0,id,identifier,species,sex,race,ethnicity,days_to_birth,subject_associated_project,vital_status,age_at_death,cause_of_death
0,900-00-5445,"[{'system': 'IDC', 'value': '900-00-5445'}]",Homo sapiens,,,,,[rembrandt],,,
1,C16974,"[{'system': 'PDC', 'value': 'C16974'}]",Homo sapiens,male,white,not hispanic or latino,,[Proteogenomic Analysis of Pediatric Brain Can...,Alive,,Not Reported
2,C270477,"[{'system': 'PDC', 'value': 'C270477'}]",Homo sapiens,male,white,not hispanic or latino,,[Proteogenomic Analysis of Pediatric Brain Can...,Alive,,Not Reported
3,C30012,"[{'system': 'PDC', 'value': 'C30012'}]",Homo sapiens,male,white,not hispanic or latino,,[Proteogenomic Analysis of Pediatric Brain Can...,Dead,,Not Reported
4,C38868,"[{'system': 'PDC', 'value': 'C38868'}]",Homo sapiens,female,white,not hispanic or latino,,[Proteogenomic Analysis of Pediatric Brain Can...,Alive,,Not Reported
...,...,...,...,...,...,...,...,...,...,...,...
95,TCGA-CS-6668,"[{'system': 'GDC', 'value': 'TCGA-CS-6668'}, {...",Homo sapiens,female,white,not hispanic or latino,-20836.0,"[TCGA-LGG, tcga_lgg]",Alive,,
96,TCGA-DB-A64X,"[{'system': 'GDC', 'value': 'TCGA-DB-A64X'}]",Homo sapiens,female,white,not hispanic or latino,-20625.0,[TCGA-LGG],Alive,,
97,TCGA-E1-A7YY,"[{'system': 'GDC', 'value': 'TCGA-E1-A7YY'}]",Homo sapiens,female,white,not hispanic or latino,-9867.0,[TCGA-LGG],Dead,4445.0,
98,TCGA-HT-7677,"[{'system': 'GDC', 'value': 'TCGA-HT-7677'}, {...",Homo sapiens,male,white,not hispanic or latino,-19610.0,"[TCGA-LGG, tcga_lgg]",Alive,,


By default `to_dataframe()` shows us the first and last five rows for the first page of our results, so we can easily preview our data.

Since we queried the Subject endpoint, our default results tell us Subject level information, that is, information about unique individuals: their sex, race, age, species, etc. The `id` column tells us the unique identifier for each individual. The `identifier` column has nested information about what study or studies a Subject participated in, and will list all of their researchsubject identifiers.

The `to_dataframe()` function converts the results to a pandas dataframe, so if we save the dataframe to a variable, we can use any pandas functions to explore it. For example, let's see whether any of the Subjects in our first 100 results are black or african american. First we'll save the results to a dataframe, then subset that dataframe to only show rows where the word "black" appears in the "race" column. Missing values ("NAs") are shown as "None" in these tables, so for our filter to work, we'll need to specifically tell it to ignore NAs:


In [17]:
subjectdata = subjectresults.to_dataframe()
subjectdata[subjectdata['race'].str.contains("black", na=False)]

Unnamed: 0,id,identifier,species,sex,race,ethnicity,days_to_birth,subject_associated_project,vital_status,age_at_death,cause_of_death
36,TCGA-E1-A7Z3,"[{'system': 'GDC', 'value': 'TCGA-E1-A7Z3'}]",Homo sapiens,female,black or african american,not hispanic or latino,-11330.0,[TCGA-LGG],Dead,2235.0,
57,TCGA-06-0394,"[{'system': 'GDC', 'value': 'TCGA-06-0394'}]",Homo sapiens,male,black or african american,not hispanic or latino,-18913.0,[TCGA-GBM],Dead,329.0,


There are subjects in our first hundred results that meet the criteria. If we just want to be sure that the data contains some value, this might be good enough. But often we want to search the entire set of results and not just the first page. 
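
To see what ignoring NAs means in practice, here is the same kind of filter written in plain Python. It is a sketch on hand-made rows, not CDA records, and `contains_text` is a hypothetical helper: it treats a missing value as "no match", which is the effect `na=False` has in the pandas filter.

```python
def contains_text(value, needle: str) -> bool:
    """True only when value is a string containing needle; missing values never match."""
    return isinstance(value, str) and needle in value


# Hand-made rows for illustration only; these are not CDA records.
rows = [
    {"id": "example-1", "race": "white"},
    {"id": "example-2", "race": None},
    {"id": "example-3", "race": "black or african american"},
]

matches = [row["id"] for row in rows if contains_text(row["race"], "black")]
print(matches)
```

Without the type check, the row with a missing `race` would raise an error instead of being skipped.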

We'll cover how to work with large results dataframes in the <a href="../Pagination">Pagination</a> notebook. Or, learn how to get summary information from search results in the <a href="../DataSummaries">Data Summaries</a> notebook.

---

<div class="cdadefine" style="background-color:#add9e5;color:black;padding:20px;">

<h3>Subject Field Definitions</h3>

<i>A patient entity captures the study-independent metadata for research subjects. Human research subjects are usually not traceable to a particular person to protect the subject's privacy.</i>

    
<ul>
<li><b>id:</b> The 'logical' identifier of the entity in the system of record, e.g. a UUID. This 'id' is unique within a given system. The identified entity may have a different 'id' in a different system.</li>
<li><b>identifier:</b> A 'business' identifier for the entity, typically as provided by an external system or authority, that persists across implementing systems (i.e. a 'logical' identifier). Uses a specialized, complex 'Identifier' data type to capture information about the source of the business identifier - or a URI expressed as a string to an existing entity.</li>
<li><b>identifier.system:</b> The system or namespace that defines the identifier.</li>
<li><b>identifier.value:</b> The value of the identifier, as defined by the system.</li>
<li><b>species:</b> The taxonomic group (e.g. species) of the subject. For MVP, since taxonomy vocabulary is consistent between GDC and PDC, using text. Ultimately, this will be a term returned by the vocabulary service.</li>
<li><b>sex:</b> The biologic character or quality that distinguishes male and female from one another as expressed by analysis of the person's gonadal, morphologic (internal and external), chromosomal, and hormonal characteristics.</li>
<li><b>race:</b> An arbitrary classification of a taxonomic group that is a division of a species. It usually arises as a consequence of geographical isolation within a species and is characterized by shared heredity, physical attributes and behavior, and in the case of humans, by common history, nationality, or geographic distribution. The provided values are based on the categories defined by the U.S. Office of Management and Budget and used by the U.S. Census Bureau.</li>
<li><b>ethnicity:</b> An individual's self-described social and cultural grouping, specifically whether an individual describes themselves as Hispanic or Latino. The provided values are based on the categories defined by the U.S. Office of Management and Budget and used by the U.S. Census Bureau.</li>
<li><b>days_to_birth:</b> Number of days between the date used for index and the date from a person's date of birth represented as a calculated negative number of days.</li>
<li><b>subject_associated_project:</b> The list of Projects associated with the Subject.</li>
<li><b>vital_status:</b> Coded value indicating the state or condition of being living or deceased; also includes the case where the vital status is unknown.</li>
<li><b>days_to_death:</b> Number of days between the date used for index and the date from a person's date of death represented as a calculated number of days.</li>
<li><b>cause_of_death:</b> Coded value indicating the circumstance or condition that results in the death of the subject.</li>
</ul>  

</div>
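
Because `days_to_birth` is reported as a negative number of days relative to the index date, it is often easier to read as an approximate age in years. The sketch below shows that arithmetic; `days_to_years` is a hypothetical helper and the 365.25-day year is an approximation, so treat the result as a rough figure.

```python
DAYS_PER_YEAR = 365.25  # average year length, an approximation


def days_to_years(days: float) -> float:
    """Convert a day offset to years, ignoring the sign, rounded to one decimal."""
    return round(abs(days) / DAYS_PER_YEAR, 1)


# -20836.0 is the days_to_birth value shown for TCGA-CS-6668 in the table above.
print(days_to_years(-20836.0))
```

This gives an age of roughly 57 years at the index date for that subject.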
    
---

### researchsubject

If we're interested in what researchsubjects meet our criteria, we can also run our query against the researchsubject endpoint:

In [7]:
researchsubjectresults = myquery.researchsubject.run()
researchsubjectresults

Total execution time: 3487 ms



            QueryID: 9dcafce1-716d-4f18-834b-4ae6277960ea
            
            Offset: 0
            Count: 100
            Total Row Count: 2923
            More pages: True
            

Now we see that our 2314 subjects have 2923 researchsubjects between them. That means that some, but not all, of our subjects were participants in more than one study. Let's peek at the data:

In [8]:
researchsubjectresults.to_dataframe()

Unnamed: 0,id,identifier,member_of_research_project,primary_diagnosis_condition,primary_diagnosis_site,subject_id
0,0a85c7b4-f07d-4727-b9d2-7b14c52edabb,"[{'system': 'GDC', 'value': '0a85c7b4-f07d-472...",TCGA-GBM,Gliomas,Brain,TCGA-12-0703
1,104c852f-2139-11ea-aee1-0e1aae319e49,"[{'system': 'PDC', 'value': '104c852f-2139-11e...",CPTAC3-Discovery,Glioblastoma,Brain,C3N-03473
2,19dd2c8f-05ca-44e8-8c19-bac0327f2ea9,"[{'system': 'GDC', 'value': '19dd2c8f-05ca-44e...",TCGA-GBM,Gliomas,Brain,TCGA-76-6661
3,374117f3-a351-43e8-9848-a1d724c71a46,"[{'system': 'GDC', 'value': '374117f3-a351-43e...",GENIE-DFCI,"Neoplasms, NOS",Brain,GENIE-DFCI-089524
4,3f70c3e3-0131-466f-92aa-0a63ab3d4258,"[{'system': 'GDC', 'value': '3f70c3e3-0131-466...",TCGA-LGG,Gliomas,Brain,TCGA-CS-6188
...,...,...,...,...,...,...
95,149a8565-e0c5-4474-a693-d44f1b445c0c,"[{'system': 'GDC', 'value': '149a8565-e0c5-447...",HCMI-CMDC,Gliomas,Brain,HCM-BROD-0199-C71
96,1f13065a-40f7-455d-b8eb-f9a128722eac,"[{'system': 'GDC', 'value': '1f13065a-40f7-455...",TCGA-LGG,Gliomas,Brain,TCGA-P5-A5EX
97,26d68cd6-e9a6-43b1-b1c6-8b8d366e32ad,"[{'system': 'GDC', 'value': '26d68cd6-e9a6-43b...",CPTAC-3,Gliomas,Brain,C3L-01155
98,2deec528-4550-47b2-9a0d-cac7eeb24f3e,"[{'system': 'GDC', 'value': '2deec528-4550-47b...",TCGA-GBM,Gliomas,Brain,TCGA-02-0085


Each row from the researchsubject endpoint results tells us about a subject in a given study. Using this endpoint we can find out information like what studies fit our search criteria, and also get data that we can filter to have only subjects from multiple studies, or only subjects from single studies.
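
One way to separate subjects in multiple studies from subjects in a single study is to count how many researchsubject rows share each `subject_id`. The sketch below does this on hand-made rows rather than CDA records; the ids and project names are placeholders.

```python
from collections import Counter

# Hand-made researchsubject rows for illustration only; not CDA records.
researchsubject_rows = [
    {"id": "rs-1", "member_of_research_project": "study-A", "subject_id": "subj-1"},
    {"id": "rs-2", "member_of_research_project": "study-B", "subject_id": "subj-1"},
    {"id": "rs-3", "member_of_research_project": "study-A", "subject_id": "subj-2"},
]

# Count how many researchsubject rows each subject has.
rows_per_subject = Counter(row["subject_id"] for row in researchsubject_rows)

multi_study = sorted(s for s, n in rows_per_subject.items() if n > 1)
single_study = sorted(s for s, n in rows_per_subject.items() if n == 1)
print(multi_study, single_study)
```

Rows in this shape can be produced from a pandas dataframe with `to_dict("records")`. Keep in mind that a count made on one page of results only reflects that page.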

Any given subject will have one row per study they participated in. The `subject_id` in the last column of this view is the same as the `id` in the first column of the subject endpoint results. You can use this to combine information across endpoints, which is covered near the end of the <a href="../BuildingACohort/#merging-results-across-endpoints">Cohort Building workflow</a> notebook.

---

<div class="cdadefine" style="background-color:#add9e5;color:black;padding:20px;">

<h3>ResearchSubject Field Definitions</h3>

<i>A research subject is the entity of interest in a specific research study or project, typically a human being or an animal, but can also be a device, group of humans or animals, or a tissue sample. Human research subjects are usually not traceable to a particular person to protect the subject's privacy. This entity plays the role of the case_id in existing data. An individual who participates in 3 studies will have 3 researchsubject IDs.</i>
    
<ul>
<li><b>id:</b> The 'logical' identifier of the entity in the system of record, e.g. a UUID. This 'id' is unique within a given system. The identified entity may have a different 'id' in a different system. For CDA, this is case_id.</li>
<li><b>identifier:</b> A 'business' identifier for the entity, typically as provided by an external system or authority, that persists across implementing systems (i.e. a 'logical' identifier). Uses a specialized, complex 'Identifier' data type to capture information about the source of the business identifier - or a URI expressed as a string to an existing entity.</li>
<li><b>identifier.system:</b> The system or namespace that defines the identifier.</li>
<li><b>identifier.value:</b> The value of the identifier, as defined by the system.</li>
<li><b>member_of_research_project:</b> A reference to the Study(s) of which this ResearchSubject is a member.</li>
<li><b>primary_diagnosis_condition:</b> The text term used to describe the type of malignant disease, as categorized by the World Health Organization's (WHO) <a href="https://www.who.int/standards/classifications/other-classifications/international-classification-of-diseases-for-oncology">International Classification of Diseases for Oncology</a> (ICD-O). This attribute represents the disease that qualified the subject for inclusion on the ResearchProject.</li>
<li><b>primary_diagnosis_site:</b> The text term used to describe the primary site of disease, as categorized by the World Health Organization's (WHO) <a href="https://www.who.int/standards/classifications/other-classifications/international-classification-of-diseases-for-oncology">International Classification of Diseases for Oncology</a> (ICD-O). This categorization groups cases into general categories. This attribute represents the primary site of disease that qualified the subject for inclusion on the ResearchProject.</li>
<li><b>subject_id:</b> The 'logical' identifier of the entity in the system of record, e.g. a UUID. This 'id' is unique within a given system. The identified entity may have a different 'id' in a different system. Can be joined to the `id` field from subject results.</li>
</ul>  

</div>
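
Since `subject_id` matches the `id` field in subject results, the two result sets can be combined with a simple lookup. The sketch below illustrates the idea on hand-made rows, not CDA records; the ids, project names, and the choice of `sex` as the column to carry over are placeholders.

```python
# Hand-made rows for illustration only; not CDA records.
subject_rows = [
    {"id": "subj-1", "sex": "female"},
    {"id": "subj-2", "sex": "male"},
]
researchsubject_rows = [
    {"id": "rs-1", "subject_id": "subj-1", "member_of_research_project": "study-A"},
    {"id": "rs-2", "subject_id": "subj-1", "member_of_research_project": "study-B"},
    {"id": "rs-3", "subject_id": "subj-2", "member_of_research_project": "study-A"},
]

# Index subjects by id so each researchsubject can look up its subject.
subjects_by_id = {row["id"]: row for row in subject_rows}

joined = [
    {
        "researchsubject_id": rs["id"],
        "project": rs["member_of_research_project"],
        "sex": subjects_by_id[rs["subject_id"]]["sex"],
    }
    for rs in researchsubject_rows
    if rs["subject_id"] in subjects_by_id  # skip rows whose subject is not in subject_rows
]
print(joined)
```

The membership check matters with paged results: the subject for a given researchsubject may sit on a different page of the subject results than the one you have loaded.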
    
---

### diagnosis

The diagnosis endpoint is an extension of the researchsubject endpoint, and returns information about researchsubjects that have a diagnosis that meets our search criteria:

In [9]:
diagnosisresults = myquery.diagnosis.run()
diagnosisresults.to_dataframe()

Total execution time: 4209 ms


Unnamed: 0,id,identifier,primary_diagnosis,age_at_diagnosis,morphology,stage,grade,method_of_diagnosis,subject_id,researchsubject_id
0,03603c65-d92c-5d36-bcd2-7e58bf7b43a4,"[{'system': 'GDC', 'value': '03603c65-d92c-5d3...",Glioblastoma,21838.0,9440/3,,not reported,,TCGA-28-2499,d88b35a3-a291-457a-b15b-a314859b25c5
1,0b3a155a-3f67-402c-a05d-add39adffcd2,"[{'system': 'GDC', 'value': '0b3a155a-3f67-402...",Mixed germ cell tumor,,9085/3,,Not Reported,,GENIE-MSK-P-0003994,85d7695b-eefc-483a-844b-14efaaff6066
2,1acc1167-4fa0-5940-8de0-88686f0160b0,"[{'system': 'GDC', 'value': '1acc1167-4fa0-594...","Astrocytoma, anaplastic",21824.0,9401/3,,not reported,,TCGA-DU-7013,d61b5d20-4d6d-4fdc-afe8-4b100b686eda
3,3314363b-c85e-5b33-9013-da6f7d1e79ae,"[{'system': 'GDC', 'value': '3314363b-c85e-5b3...","Oligodendroglioma, anaplastic",19185.0,9451/3,,not reported,,TCGA-R8-A6ML,5119aceb-ad95-4904-b6c0-a0a4b42c17d0
4,57d175a7-f78e-5e30-a35c-92ce31bb3de4,"[{'system': 'GDC', 'value': '57d175a7-f78e-5e3...",Glioblastoma,11956.0,9440/3,,not reported,,TCGA-06-0151,c57d6674-08d1-4eda-9ec1-ae439923d7cf
...,...,...,...,...,...,...,...,...,...,...
95,099b4994-46a7-595b-985c-a95af7edfff6,"[{'system': 'GDC', 'value': '099b4994-46a7-595...",Glioblastoma,22166.0,9440/3,,not reported,,TCGA-02-0034,69205453-c504-4684-8d32-57c82efc679d
96,1a78b7fe-0b6f-592d-ae0c-6bceced82b68,"[{'system': 'GDC', 'value': '1a78b7fe-0b6f-592...",Glioblastoma,26395.0,9440/3,,not reported,,TCGA-06-5410,884f867b-4a8b-4b67-8fe4-ab3f068be84e
97,2496b9b7-9dd1-5452-a6c6-6f4c82f7704d,"[{'system': 'GDC', 'value': '2496b9b7-9dd1-545...",Glioblastoma,24488.0,9440/3,,not reported,,TCGA-06-0219,25f41de3-9d70-45df-913f-4fb3e5f0f7d6
98,295813a8-7a75-5b4c-808f-e732e2d68a74,"[{'system': 'GDC', 'value': '295813a8-7a75-5b4...","Astrocytoma, anaplastic",18748.0,9401/3,,not reported,,TCGA-HT-8104,5de6fd84-ed02-4c27-9947-db57f05faf6d



---

<div class="cdadefine" style="background-color:#add9e5;color:black;padding:20px;">

<h3>Diagnosis Field Definitions</h3>

<i>A collection of characteristics that describe an abnormal condition of the body as assessed at a point in time. May be used to capture information about neoplastic and non-neoplastic conditions.</i>

    
<ul>
  <li><b>id:</b> The 'logical' identifier of the entity in the repository, e.g. a UUID. This 'id' is unique within a given system. The identified entity may have a different 'id' in a different system.</li>
<li><b>identifier:</b> A 'business' identifier for the entity, typically as provided by an external system or authority, that persists across implementing systems (i.e. a 'logical' identifier). Uses a specialized, complex 'Identifier' data type to capture information about the source of the business identifier - or a URI expressed as a string to an existing entity.</li>
<li><b>identifier.system:</b> The system or namespace that defines the identifier.</li>
<li><b>identifier.value:</b> The value of the identifier, as defined by the system.</li>
  <li><b>primary_diagnosis:</b> The diagnosis instance that qualified a subject for inclusion on a ResearchProject.</li>
  <li><b>age_at_diagnosis:</b> The age in days of the individual at the time of diagnosis.</li>
  <li><b>morphology:</b> Code that represents the histology of the disease using the third edition of the <a href="https://www.who.int/standards/classifications/other-classifications/international-classification-of-diseases-for-oncology">International Classification of Diseases for Oncology</a>, published in 2000, used principally in tumor and cancer registries for coding the site (topography) and the histology (morphology) of neoplasms.</li>
  <li><b>stage:</b> The extent of a cancer in the body. Staging is usually based on the size of the tumor, whether lymph nodes contain cancer, and whether the cancer has spread from the original site to other parts of the body. Different diseases may use different staging criteria; please refer to the originating data source to see what staging system is reported.</li>
  <li><b>grade:</b> The degree of abnormality of cancer cells, a measure of differentiation, the extent to which cancer cells are similar in appearance and function to healthy cells of the same tissue type. The degree of differentiation often relates to the clinical behavior of the particular tumor. Based on the microscopic findings, tumor grade is commonly described by one of four degrees of severity. Histopathologic grade of a tumor may be used to plan treatment and estimate the future course, outcome, and overall prognosis of disease. Certain types of cancers, such as soft tissue sarcoma, primary brain tumors, lymphomas, and breast have special grading systems.</li>
  <li><b>method_of_diagnosis:</b> The method used to confirm the patient's malignant diagnosis.</li>
  <li><b>subject_id:</b> An identifier for the subject. Can be joined to the `id` field from subject results.</li>
  <li><b>researchsubject_id:</b> An identifier for the researchsubject. Can be joined to the `id` field from researchsubject results.</li>
</ul>  

</div>
    
---




### treatment

The treatment endpoint is an extension of diagnosis and returns information about treatments undertaken on research subjects that have a given diagnosis that meets our search criteria:

In [10]:
treatmentresults = myquery.treatment.run()
treatmentresults.to_dataframe()

Total execution time: 3509 ms


Unnamed: 0,id,identifier,treatment_type,treatment_outcome,days_to_treatment_start,days_to_treatment_end,therapeutic_agent,treatment_anatomic_site,treatment_effect,treatment_end_reason,number_of_cycles,subject_id,researchsubject_id,researchsubject_diagnosis_id
0,00066b4a-aa08-59b2-a212-8fce5f9f1f8b,"[{'system': 'GDC', 'value': '00066b4a-aa08-59b...","Radiation Therapy, NOS",,,,,,,,,TCGA-DB-A4XH,5ccd86ab-2587-4f35-b96a-4f4320b10fb9,24a74146-a376-5332-8f2c-a973d4b942d0
1,1d521697-fef3-491e-be88-a0a7c9a25cf9,"[{'system': 'GDC', 'value': '1d521697-fef3-491...",Chemotherapy,,,78.0,Temozolomide,,,,,HCM-BROD-0196-C71,51b83c83-b639-424b-a023-fd5fe7c2f4b5,dbd3dfca-2deb-4ae0-abd7-1a49c5d92d2d
2,27af15e6-4e81-59b8-88d0-ecaef12655d6,"[{'system': 'GDC', 'value': '27af15e6-4e81-59b...","Radiation Therapy, NOS",,,,,,,,,TCGA-S9-A6U5,621424ec-67bc-4c42-b0ad-64c5654b5ad9,08d72d98-8a15-5ddc-a171-027da577a011
3,3d538109-3e26-568b-8ffc-ae0023f7c031,"[{'system': 'GDC', 'value': '3d538109-3e26-568...","Pharmaceutical Therapy, NOS",,,,,,,,,TCGA-P5-A735,a3822b5e-d794-4126-b31f-c70e1cc47f22,4c083e0e-7016-5b64-ac5c-6ee60d980aad
4,3e2b3782-eec3-49b1-88fa-04f34571fda5,"[{'system': 'GDC', 'value': '3e2b3782-eec3-49b...",Chemotherapy,,,60.0,Temozolomide,,,,,HCM-BROD-0048-C71,4b9b3130-2483-4e5d-8c4d-e225590a5cd2,7f7968f0-e608-441d-b7a6-61c30361d16d
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,3f18398e-a5c6-53a0-a7d6-243ff9c9ebff,"[{'system': 'GDC', 'value': '3f18398e-a5c6-53a...","Radiation Therapy, NOS",,,,,,,,,TCGA-DU-A76R,823f571a-3cfd-465d-95fd-187a398ccf6f,e270420a-4930-501b-a163-820f1ed9879f
96,47d2c3c4-c5f9-5f6d-b643-29fbc72a9402,"[{'system': 'GDC', 'value': '47d2c3c4-c5f9-5f6...","Radiation Therapy, NOS",,,,,,,,,TCGA-06-0150,f2cef0e2-fece-4499-9f99-c5719aff8b5a,8497e070-dc24-5de7-ba64-8622bd6d6ceb
97,50844921-f089-5ba7-8d3d-0a66e864d69c,"[{'system': 'GDC', 'value': '50844921-f089-5ba...","Radiation Therapy, NOS",,,,,,,,,TCGA-HT-A5R7,7e41baba-7ef6-4030-b595-d95cd96336b5,0cd3ea68-7bc4-5c83-90d2-0fb78cbc0b1d
98,54781ff8-a33e-5808-adfc-c02bde88564c,"[{'system': 'GDC', 'value': '54781ff8-a33e-580...","Pharmaceutical Therapy, NOS",,,,,,,,,TCGA-02-0052,5a448bf3-8e87-4791-91f2-908307a32ed1,489e38f1-12e2-5d10-b77d-f6816bf5146d



---

<div class="cdadefine" style="background-color:#add9e5;color:black;padding:20px;">

<h3>Treatment Field Definitions</h3>

<i>Medication administration or other treatment types. A single research subject may have multiple treatments for a single diagnosis, and/or different diagnoses, and different treatments, across different studies.</i>

    
<ul>
  <li><b>id:</b> The 'logical' identifier of the entity in the repository, e.g. a UUID. This 'id' is unique within a given system. The identified entity may have a different 'id' in a different system.</li>
<li><b>identifier:</b> A 'business' identifier for the entity, typically as provided by an external system or authority, that persists across implementing systems (i.e. a 'logical' identifier). Uses a specialized, complex 'Identifier' data type to capture information about the source of the business identifier - or a URI expressed as a string to an existing entity.</li>
<li><b>identifier.system:</b> The system or namespace that defines the identifier.</li>
<li><b>identifier.value:</b> The value of the identifier, as defined by the system.</li>
  <li><b>treatment_type:</b> The treatment type including medication/therapeutics or other procedures.</li>
  <li><b>treatment_outcome:</b> The final outcome of the treatment.</li>
  <li><b>days_to_treatment_start:</b> The timepoint at which the treatment started.</li>
  <li><b>days_to_treatment_end:</b> The timepoint at which the treatment ended.</li>
  <li><b>therapeutic_agent:</b> One or more therapeutic agents as part of this treatment.</li>
  <li><b>treatment_anatomic_site:</b> The anatomical site that the treatment targets.</li>
  <li><b>treatment_effect:</b> The effect of a treatment on the diagnosis or tumor.</li>
  <li><b>treatment_end_reason:</b> The reason the treatment ended.</li>
  <li><b>number_of_cycles:</b> The number of treatment cycles the subject received.</li>
  <li><b>subject_id:</b> An identifier for the subject. Can be joined to the `id` field from subject results.</li>
  <li><b>researchsubject_id:</b> An identifier for the researchsubject. Can be joined to the `id` field from researchsubject results.</li>
  <li><b>researchsubject_diagnosis_id:</b> An identifier for the diagnosis. Can be joined to the `id` field from diagnosis results.</li>
</ul>  

</div>
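
Because one diagnosis can have several treatments, it can help to group treatment rows by `researchsubject_diagnosis_id` before joining them to diagnosis results. The sketch below shows that grouping on hand-made rows, not CDA records; the ids are placeholders, and the treatment type labels are borrowed from the table above only as examples.

```python
from collections import defaultdict

# Hand-made treatment rows for illustration only; not CDA records.
treatment_rows = [
    {"id": "t-1", "treatment_type": "Chemotherapy", "researchsubject_diagnosis_id": "dx-1"},
    {"id": "t-2", "treatment_type": "Radiation Therapy, NOS", "researchsubject_diagnosis_id": "dx-1"},
    {"id": "t-3", "treatment_type": "Chemotherapy", "researchsubject_diagnosis_id": "dx-2"},
]

# Collect the treatment types recorded for each diagnosis id.
treatments_by_diagnosis = defaultdict(list)
for row in treatment_rows:
    treatments_by_diagnosis[row["researchsubject_diagnosis_id"]].append(row["treatment_type"])

print(dict(treatments_by_diagnosis))
```

Each key can then be matched against the `id` field in diagnosis results.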
    
---




### specimen

We can use this same query to see what specimens are available for brain tissue at the CDA:

In [11]:
specimenresults = myquery.specimen.run()
print(specimenresults)

Total execution time: 3447 ms

            QueryID: 21a791a2-ed17-4f2c-978c-da501c259aea
            
            Offset: 0
            Count: 100
            Total Row Count: 39150
            More pages: True
            


Nearly 40,000 specimens meet our search criteria! We would typically expect this number to be much larger than our number of subjects or researchsubjects, first because studies will often take more than one sample per subject, and second because any given specimen might be aliquoted out to be used in multiple tests. Since we didn't specify any further filters, our results will return all of these as separate specimens. Let's look at a few:

In [12]:
specimenresults.to_dataframe()

| | id | identifier | associated_project | age_at_collection | primary_disease_type | anatomical_site | source_material_type | specimen_type | derived_from_specimen | subject_id | researchsubject_id |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 003ed478-ceca-4752-8ae4-4cf12e6844bd | [{'system': 'GDC', 'value': '003ed478-ceca-475... | TCGA-LGG | -24578.0 | Gliomas | | Primary Tumor | aliquot | f1475dc1-b44a-438c-884d-b88a1bd2bbc9 | TCGA-CS-4941 | fc222f23-b3b2-4ac0-bc61-e8e8fa5cc160 |
| 1 | 02eb276e-4a58-4e90-b2fb-6dd4d0700799 | [{'system': 'GDC', 'value': '02eb276e-4a58-4e9... | TCGA-GBM | -21411.0 | Gliomas | | Primary Tumor | aliquot | cc0decaf-b9b2-4aa8-8faa-41e8191b3ea3 | TCGA-12-1092 | 4c070166-025e-49b7-b4d0-fe25c8a355b1 |
| 2 | 036ef90a-eaca-4105-9b7b-74c20bfa0bb9 | [{'system': 'GDC', 'value': '036ef90a-eaca-410... | TCGA-LGG | -12655.0 | Gliomas | | Primary Tumor | aliquot | 2b65591e-9852-412d-b785-2e55b74b3cb7 | TCGA-E1-5305 | a0bc5147-d9b4-41fd-8df3-a241d23722e8 |
| 3 | 05b85bc7-87db-49bb-85cf-e494554ab191 | [{'system': 'GDC', 'value': '05b85bc7-87db-49b... | TCGA-LGG | -13338.0 | Gliomas | | Primary Tumor | aliquot | 07020239-d1b6-4a6f-a236-686668c92eec | TCGA-DB-A75L | ba62b5eb-e169-4b41-b949-e004d66e98a1 |
| 4 | 066043e1-0af9-4289-b9d5-0a9523a16ccc | [{'system': 'GDC', 'value': '066043e1-0af9-428... | TCGA-LGG | -21964.0 | Gliomas | | Primary Tumor | sample | initial specimen | TCGA-DU-8165 | 11dffe71-b516-41e8-9987-e0ff5d2356d1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | 4a0d2845-0d85-4de9-b010-f1f50dbdd0df | [{'system': 'GDC', 'value': '4a0d2845-0d85-4de... | TCGA-LGG | -19138.0 | Gliomas | | Blood Derived Normal | aliquot | c31e4449-2f14-4af9-8f98-857d8e541fae | TCGA-FG-A6J3 | ae4f0714-3b74-46c0-9c1b-f028c7cf2e62 |
| 96 | 4ba9ac09-32b5-4613-8a75-ace5fbd38cfc | [{'system': 'GDC', 'value': '4ba9ac09-32b5-461... | TCGA-LGG | -15589.0 | Gliomas | | Primary Tumor | analyte | a149de53-0da4-4d78-b2b9-2269c3c100cd | TCGA-HT-7485 | 8d06226e-500b-4172-ae96-c674913909e6 |
| 97 | 4c42785c-5cdf-47f4-81a1-f9c191399099 | [{'system': 'GDC', 'value': '4c42785c-5cdf-47f... | TCGA-LGG | -16945.0 | Gliomas | | Primary Tumor | portion | f0739840-9fdf-4fb1-8581-542d901e7058 | TCGA-DU-A5TY | 0167cf11-74be-4701-ab9a-4e057d4bb545 |
| 98 | 4da556bb-76d5-4874-bf90-2eb9e50fc0b7 | [{'system': 'GDC', 'value': '4da556bb-76d5-487... | TCGA-GBM | -25252.0 | Gliomas | | Primary Tumor | portion | 67c18463-8e95-4501-9feb-190d421f0f47 | TCGA-32-1979 | 1974470e-ec23-4dfc-8907-2e4052c2a0fc |



---
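To check how specimens are distributed across subjects and specimen types, the dataframe can be summarized with ordinary pandas calls. The sketch below is only an illustration: it rebuilds the first five rows of the specimen output above by hand (three columns only) so that it runs without a CDA connection. With live data, the frame would instead come from `specimenresults.to_dataframe()`.

```python
import pandas as pd

# First five rows of the specimen output above, trimmed to three columns.
# With live data this frame would come from specimenresults.to_dataframe().
specimens = pd.DataFrame(
    {
        "subject_id": [
            "TCGA-CS-4941",
            "TCGA-12-1092",
            "TCGA-E1-5305",
            "TCGA-DB-A75L",
            "TCGA-DU-8165",
        ],
        "associated_project": [
            "TCGA-LGG",
            "TCGA-GBM",
            "TCGA-LGG",
            "TCGA-LGG",
            "TCGA-LGG",
        ],
        "specimen_type": ["aliquot", "aliquot", "aliquot", "aliquot", "sample"],
    }
)

# How many specimens of each type are in these rows?
type_counts = specimens["specimen_type"].value_counts()
print(type_counts)

# How many specimens does each subject contribute?
per_subject = specimens.groupby("subject_id").size()
print(per_subject.max())
```

In these five rows every subject appears once, so the maximum is 1. Across the full result set we would expect many subjects to contribute several specimens, for the reasons described above.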

<div class="cdadefine" style="background-color:#add9e5;color:black;padding:20px;">

<h3>Specimen Field Definitions</h3>

<i>Any material taken as a sample from a biological entity (living or dead), or from a physical object or the environment. Specimens are usually collected as an example of their kind, often for use in some investigation.
 A given specimen will have only a single subject ID and a single research subject ID.</i>
    
<ul>
  <li><b>id:</b> The 'logical' identifier of the entity in the repository, e.g. a UUID. This 'id' is unique within a given system. The identified entity may have a different 'id' in a different system.</li>
<li><b>identifier:</b> A 'business' identifier for the entity, typically as provided by an external system or authority, that persists across implementing systems (i.e. a 'logical' identifier). Uses a specialized, complex 'Identifier' data type to capture information about the source of the business identifier - or a URI expressed as a string to an existing entity.</li>
<li><b>identifier.system:</b> The system or namespace that defines the identifier.</li>
<li><b>identifier.value:</b> The value of the identifier, as defined by the system.</li>
  <li><b>associated_project:</b> The Project associated with the specimen.</li>
  <li><b>days_to_collection:</b> The number of days from the index date to either the date a sample was collected for a specific study or project, or the date a patient underwent a procedure (e.g. surgical resection) yielding a sample that was eventually used for research.</li>
  <li><b>primary_disease_type:</b> The text term used to describe the type of malignant disease, as categorized by the World Health Organization's (WHO) International Classification of Diseases for Oncology (ICD-O). This attribute represents the disease that qualified the subject for inclusion on the ResearchProject.</li>
  <li><b>anatomical_site:</b> Per GDC Dictionary, the text term that represents the name of the primary disease site of the submitted tumor sample; recommend dropping tumor; biospecimen_anatomic_site.</li>
  <li><b>source_material_type:</b> The general kind of material from which the specimen was derived, indicating the physical nature of the source material.</li>
  <li><b>specimen_type:</b> The high-level type of the specimen, based on how it has been derived from the original extracted sample. One of: analyte, aliquot, portion, sample, or slide.</li>
  <li><b>derived_from_specimen:</b> A source/parent specimen from which this one was directly derived.</li>
  <li><b>subject_id:</b> An identifier for the subject. Can be joined to the `id` field from subject results.</li>
  <li><b>researchsubject_id:</b> An identifier for the research subject. Can be joined to the `id` field from researchsubject results.</li>
</ul>  

</div>
    
---
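The definitions above note that `subject_id` can be joined to the `id` field from subject results. A minimal sketch of that join with pandas follows. The two frames are toy stand-ins: every ID is an invented placeholder, and the `sex` column is only an assumed example of a subject-level field. With live data, both frames would come from `to_dataframe()` calls on the specimen and subject results.

```python
import pandas as pd

# Toy stand-ins for two result frames. The IDs and the "sex" values are
# invented placeholders, not CDA records.
specimens = pd.DataFrame(
    {
        "id": ["spec-1", "spec-2", "spec-3"],
        "specimen_type": ["sample", "aliquot", "sample"],
        "subject_id": ["subj-A", "subj-A", "subj-B"],
    }
)
subjects = pd.DataFrame({"id": ["subj-A", "subj-B"], "sex": ["female", "male"]})

# Rename the subject key so it lines up with specimens["subject_id"],
# then attach the subject columns to every specimen row.
subjects_keyed = subjects.rename(columns={"id": "subject_id"})
joined = specimens.merge(subjects_keyed, on="subject_id", how="left")
print(joined)
```

A left join keeps every specimen row even if a subject is missing from the second frame. The same pattern should work for `researchsubject_id` against researchsubject results.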


### file

The file endpoint returns information about files that meet our search criteria, regardless of whether they are attached to subjects, researchsubjects, or specimens:

In [13]:
myquery.file.run()

Total execution time: 3604 ms



            QueryID: 9d625fce-d4e3-4d03-8d91-5eda35bef78f
            
            Offset: 0
            Count: 100
            Total Row Count: 4530800
            More pages: True
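To get a feel for how large this result is, the summary numbers can be turned into a page count. This is plain arithmetic on the values printed above. It assumes that `Count` is the number of rows returned per page and that every page except possibly the last is full.

```python
import math

# Values copied from the result summary above.
total_rows = 4530800
page_size = 100  # the "Count" field, assumed to be rows per page

# Number of pages needed to walk through every row.
pages = math.ceil(total_rows / page_size)
print(pages)  # 45308

# Offset at which the final page would start.
last_offset = (pages - 1) * page_size
print(last_offset)  # 4530700
```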
            

In [14]:
fileresults = myquery.file.run()
fileresults.to_dataframe()

Total execution time: 3532 ms


| | id | identifier | label | data_category | data_type | file_format | associated_project | drs_uri | byte_size | checksum | data_modality | imaging_modality | dbgap_accession_number | researchsubject_specimen_id | researchsubject_id | subject_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 08eb7462-2bc2-40d6-88b3-4af74bc19d91 | [{'system': 'IDC', 'value': '08eb7462-2bc2-40d... | idc/08eb7462-2bc2-40d6-88b3-4af74bc19d91.dcm | Imaging | | DICOM | tcga_gbm | drs://dg.4DFC:08eb7462-2bc2-40d6-88b3-4af74bc1... | | | Imaging | MR | | | TCGA-06-0184__tcga_gbm | TCGA-06-0184 |
| 1 | 097913dc-b8d9-4efa-912c-2d3a60e4efb4 | [{'system': 'IDC', 'value': '097913dc-b8d9-4ef... | idc/097913dc-b8d9-4efa-912c-2d3a60e4efb4.dcm | Imaging | | DICOM | acrin_dsc_mr_brain | drs://dg.4DFC:097913dc-b8d9-4efa-912c-2d3a60e4... | | | Imaging | MR | | | ACRIN-DSC-MR-Brain-048__acrin_dsc_mr_brain | ACRIN-DSC-MR-Brain-048 |
| 2 | 170aecd6-e148-4e70-be36-3d771a5cc092 | [{'system': 'IDC', 'value': '170aecd6-e148-4e7... | idc/170aecd6-e148-4e70-be36-3d771a5cc092.dcm | Imaging | | DICOM | ivygap | drs://dg.4DFC:170aecd6-e148-4e70-be36-3d771a5c... | | | Imaging | MR | | | W50__ivygap | W50 |
| 3 | 1b306916-fe77-469e-8c86-43e8b606b1b8 | [{'system': 'GDC', 'value': '1b306916-fe77-469... | nationwidechildrens.org_clinical_nte_gbm.txt | Clinical | Clinical Supplement | BCR Biotab | TCGA-GBM | drs://dg.4DFC:1b306916-fe77-469e-8c86-43e8b606... | 1700.0 | 8bf29be6a3300e3a835534dea2b88b83 | Genomic | | | | 386b629e-fab1-4033-b088-45d6eeb4a13e | TCGA-06-0745 |
| 4 | 25d550ca-a853-4510-8a34-ba4959f6dbb4 | [{'system': 'IDC', 'value': '25d550ca-a853-451... | idc/25d550ca-a853-4510-8a34-ba4959f6dbb4.dcm | Imaging | | DICOM | acrin_fmiso_brain | drs://dg.4DFC:25d550ca-a853-4510-8a34-ba4959f6... | | | Imaging | MR | | | ACRIN-FMISO-Brain-048__acrin_fmiso_brain | ACRIN-FMISO-Brain-048 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | bd90872f-154d-47ad-8e24-da5716977bc3 | [{'system': 'IDC', 'value': 'bd90872f-154d-47a... | idc/bd90872f-154d-47ad-8e24-da5716977bc3.dcm | Imaging | | DICOM | acrin_dsc_mr_brain | drs://dg.4DFC:bd90872f-154d-47ad-8e24-da571697... | | | Imaging | MR | | | ACRIN-DSC-MR-Brain-035__acrin_dsc_mr_brain | ACRIN-DSC-MR-Brain-035 |
| 96 | c69d2cd6-c2b9-466e-a8d0-ef5ed9e501f6 | [{'system': 'IDC', 'value': 'c69d2cd6-c2b9-466... | idc/c69d2cd6-c2b9-466e-a8d0-ef5ed9e501f6.dcm | Imaging | | DICOM | acrin_dsc_mr_brain | drs://dg.4DFC:c69d2cd6-c2b9-466e-a8d0-ef5ed9e5... | | | Imaging | MR | | | ACRIN-DSC-MR-Brain-056__acrin_dsc_mr_brain | ACRIN-DSC-MR-Brain-056 |
| 97 | c87489d0-3c00-45a9-9d9f-5568bfd7cdd2 | [{'system': 'IDC', 'value': 'c87489d0-3c00-45a... | idc/c87489d0-3c00-45a9-9d9f-5568bfd7cdd2.dcm | Imaging | | DICOM | tcga_lgg | drs://dg.4DFC:c87489d0-3c00-45a9-9d9f-5568bfd7... | | | Imaging | MR | | | TCGA-DU-7299__tcga_lgg | TCGA-DU-7299 |
| 98 | c8ae39d7-af8f-4b78-aec1-77fea5e8b210 | [{'system': 'IDC', 'value': 'c8ae39d7-af8f-4b7... | idc/c8ae39d7-af8f-4b78-aec1-77fea5e8b210.dcm | Imaging | | DICOM | cptac_gbm | drs://dg.4DFC:c8ae39d7-af8f-4b78-aec1-77fea5e8... | | | Imaging | MR | | | C3L-03266__cptac_gbm | C3L-03266 |


As you might expect, searching the file endpoint gives us a huge number of results. This is great if you are surveying what kind of data is available, but it is less useful for getting a coherent cohort.
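For a quick local survey of the rows already fetched, the dataframe can be grouped by category and format. The sketch below is an illustration only: it rebuilds three columns from the first five rows of the file output above so that it runs offline. With live data, the frame would come from `fileresults.to_dataframe()`. It tallies a single page of results and is not a substitute for server-side counts.

```python
import pandas as pd

# Three columns from the first five rows of the file output above.
# With live data this frame would come from fileresults.to_dataframe().
files = pd.DataFrame(
    {
        "data_category": ["Imaging", "Imaging", "Imaging", "Clinical", "Imaging"],
        "file_format": ["DICOM", "DICOM", "DICOM", "BCR Biotab", "DICOM"],
        "associated_project": [
            "tcga_gbm",
            "acrin_dsc_mr_brain",
            "ivygap",
            "TCGA-GBM",
            "acrin_fmiso_brain",
        ],
    }
)

# Count rows for each (data_category, file_format) combination.
survey = files.groupby(["data_category", "file_format"]).size()
print(survey)
```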

A better way to get files for a specific cohort is to chain your queries together. We cover this in the next tutorial, <a href="../Chaining">Chaining Endpoints</a>, which shows how to combine information from multiple endpoints and build And/Or/Like and other advanced query strings.

Another useful way to look at high-level information is to use our counts feature, which returns summary information rather than the full search results. Check out the <a href="../DataSummaries">Data Summaries tutorial</a> to try it.



---

<div class="cdadefine" style="background-color:#add9e5;color:black;padding:20px;">

<h3>File Field Definitions</h3>

<i>A file is an information-bearing electronic object that contains a physical embodiment of some information using a particular character encoding.</i>

    
<ul>
  <li><b>id:</b> The 'logical' identifier of the entity in the system of record, e.g. a UUID. This 'id' is unique within a given system. The identified entity may have a different 'id' in a different system.</li>
<li><b>identifier:</b> A 'business' identifier for the entity, typically as provided by an external system or authority, that persists across implementing systems (i.e. a 'logical' identifier). Uses a specialized, complex 'Identifier' data type to capture information about the source of the business identifier - or a URI expressed as a string to an existing entity.</li>
<li><b>identifier.system:</b> The system or namespace that defines the identifier.</li>
<li><b>identifier.value:</b> The value of the identifier, as defined by the system.</li>
  <li><b>label:</b> Short name or abbreviation for dataset. Maps to rdfs:label.</li>
  <li><b>data_category:</b> Broad categorization of the contents of the data file.</li>
  <li><b>data_type:</b> Specific content type of the data file.</li>
  <li><b>file_format:</b> Format of the data files.</li>
  <li><b>associated_project:</b> A reference to the Project(s) of which this file is a member. The associated_project may be embedded using the ref definition or may be a reference to the id for the Project - or a URI expressed as a string to an existing entity.</li>
  <li><b>drs_uri:</b> A string of characters used to identify a resource on the Data Repository Service (DRS). Can be used to retrieve this specific file from a server.</li>
  <li><b>byte_size:</b> Size of the file in bytes. Maps to dcat:byteSize.</li>
  <li><b>checksum:</b> The md5 value for the file. A digit representing the sum of the correct digits in a piece of stored or transmitted digital data, against which later comparisons can be made to detect errors in the data.</li>
  <li><b>data_modality:</b> Data modality describes the biological nature of the information gathered as the result of an Activity, independent of the technology or methods used to produce the information. Always one of "Genomic", "Proteomic", or "Imaging".</li>
  <li><b>imaging_modality:</b> An imaging modality describes the imaging equipment and/or method used to acquire certain structural or functional information about the body. These include but are not limited to computed tomography (CT) and magnetic resonance imaging (MRI). Taken from the DICOM standard.</li>
  <li><b>dbgap_accession_number:</b> The dbGaP accession number for the project.</li>
</ul>  

</div>
    
---
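The `checksum` field exists so that a downloaded file can be compared against the recorded MD5 value. A minimal sketch of that comparison with Python's standard `hashlib` follows. The helper name `md5_of_file` and the temporary file are hypothetical stand-ins: in practice the file would be one retrieved via its `drs_uri`, and the expected value would be read from the `checksum` column.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Stand-in for a downloaded file; real content would come from a DRS server.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
    local_path = tmp.name

# MD5 of b"hello", playing the role of the value in the checksum column.
expected = "5d41402abc4b2a76b9719d911017c592"

matches = md5_of_file(local_path) == expected
print(matches)  # True

os.remove(local_path)
```

Reading in chunks keeps memory use flat, which matters for large genomic or imaging files.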