Before we do any work, we need to import several functions from cdapython:
- `Q` and `query` which power the search
- `columns` which lets us view entity field names
- `unique_terms` which lets us view entity field contents

We're also asking cdapython to report its version so we can be sure we're using the one we mean to.

In [1]:
from cdapython import Q, columns, unique_terms, query, constantVariables
import cdapython
print(cdapython.__file__)
print(cdapython.__version__)
integration_host = "http://35.192.60.10:8080/"
Q.set_host_url(integration_host)

/Users/amanda.charbonneau/github/cda-python/cdapython/__init__.py
2022.5.9


CDA data comes from three sources:
- The [Proteomic Data Commons](https://proteomic.datacommons.cancer.gov/pdc/) (PDC)
- The [Genomic Data Commons](https://gdc.cancer.gov/) (GDC)
- The [Imaging Data Commons](https://datacommons.cancer.gov/repository/imaging-data-commons) (IDC)

The CDA makes this data searchable in four endpoints:

- `subject`: Specific, unique, individuals
- `research_subject`: Study-individual aggregate entities. A Subject who was part of three studies will appear as three ResearchSubjects
- `specimen`: Samples taken from an individual
- `file`: Data about Subjects, ResearchSubjects, Specimens, and their associated information


If you are looking to build a cohort of distinct individuals who meet some criteria, search by `Subject`. If you want to build a cohort but are more interested in the studies than in the participants per se, search by `ResearchSubject`. If you are looking for biosamples that can be ordered, or for a specific format of information (e.g., histological slides), start with `Specimen`. If you are primarily looking for files you can reuse for your own analysis, start with `File`.

In CDA search, these concepts can also be strung together, so you can look specifically for Specimen Files or ResearchSubject Specimens. In all cases, any search can use any metadata field; the only difference between search types is the type of data returned by default.
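
To make the Subject versus ResearchSubject distinction concrete, here is a minimal sketch in plain Python. It does not call cdapython: the `ResearchSubject` dataclass and the sample values (`subject-1`, `study-A`, and so on) are hypothetical and only illustrate how one individual maps to one row per study.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ResearchSubject:
    """One individual within one study (hypothetical, simplified shape)."""
    subject_id: str
    study: str


# Hypothetical data: one individual who took part in three studies,
# and another who took part in a single study.
research_subjects = [
    ResearchSubject("subject-1", "study-A"),
    ResearchSubject("subject-1", "study-B"),
    ResearchSubject("subject-1", "study-C"),
    ResearchSubject("subject-2", "study-A"),
]

# Group the study-level rows back into unique individuals.
studies_by_subject = defaultdict(list)
for rs in research_subjects:
    studies_by_subject[rs.subject_id].append(rs.study)

print(len(research_subjects))   # number of ResearchSubject rows
print(len(studies_by_subject))  # number of distinct Subjects
```

Searching by `Subject` corresponds to counting the keys of `studies_by_subject`, while searching by `ResearchSubject` corresponds to counting the individual rows.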


## Basic search with endpoints

Let's try a broad search of the CDA to see what information exists about cancers that were first diagnosed in the brain. To run this simple search, we would first construct a query in `Q` and save it to a variable `myquery`:

In [2]:
myquery = Q('ResearchSubject.primary_diagnosis_site = "brain"')
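
The string passed to `Q` follows the pattern `Entity.field = "value"`. As a small sketch, a helper such as `make_filter` below (hypothetical, not part of cdapython) can assemble strings of that shape, which may be convenient when you want to try several values for the same field.

```python
def make_filter(field, value):
    """Build a filter string in the `field = "value"` form used above.

    Hypothetical helper for illustration; it is not a cdapython function.
    """
    return f'{field} = "{value}"'


query_string = make_filter("ResearchSubject.primary_diagnosis_site", "brain")
print(query_string)
```

The resulting string could then be passed to `Q` in the same way as the literal string in the cell above.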



<div style="background-color:#6ce6b9;color:black;padding:20px;">
<h3>Where did those terms come from?</h3>
    
If you aren't sure how we knew what terms to put in our search, please refer back to the <a href="../SearchTerms">What search terms are available?</a> notebook. 
</div>

### subject
Now we can use that query to search any of the information types. Let's start by looking at which Subjects meet our criteria. To do that, we will send our query to the subject endpoint, then ask for it to run:

In [3]:
subjectresults = myquery.subject.run()

Total execution time: 3296 ms


We saved the output in a variable `subjectresults`, so we don't get much visible output. To see what our results are, we need to look into the variable. The simplest way is to call `subjectresults` directly:

In [4]:
subjectresults


            QueryID: 4d741a70-95f6-489e-85b5-5542fe8fb805
            Query:SELECT results.* EXCEPT(rn) FROM (SELECT ROW_NUMBER() OVER (PARTITION BY all_Subjects_v3_0_w_RS.id) as rn, all_Subjects_v3_0_w_RS.id AS id, all_Subjects_v3_0_w_RS.identifier AS identifier, all_Subjects_v3_0_w_RS.species AS species, all_Subjects_v3_0_w_RS.sex AS sex, all_Subjects_v3_0_w_RS.race AS race, all_Subjects_v3_0_w_RS.ethnicity AS ethnicity, all_Subjects_v3_0_w_RS.days_to_birth AS days_to_birth, all_Subjects_v3_0_w_RS.subject_associated_project AS subject_associated_project, all_Subjects_v3_0_w_RS.vital_status AS vital_status, all_Subjects_v3_0_w_RS.age_at_death AS age_at_death, all_Subjects_v3_0_w_RS.cause_of_death AS cause_of_death FROM gdc-bq-sample.dev.all_Subjects_v3_0_w_RS AS all_Subjects_v3_0_w_RS LEFT JOIN UNNEST(all_Subjects_v3_0_w_RS.ResearchSubject) AS _ResearchSubject WHERE (UPPER(_ResearchSubject.primary_diagnosis_site) = UPPER('brain'))) as results WHERE rn = 1
            Offset: 0
      

This output shows our QueryID, which we don't need ourselves but which the system uses to track our request, followed by the SQL that was run. It then reports four parameters that describe our results:

---

- **Offset:** This is how many rows of information we've told the query to skip in the data. Here we didn't tell it to skip anything, so the offset is zero.
- **Count:** This is how many rows the current page of our results table has. To keep searches fast, we default to pages with 100 rows.
- **Total Row Count:** This is how many rows are in the full results table.
- **More pages:** This is always True or False. False means that our current page has all the available results. True means that we will see only the first 100 results in this table and will need to page through for more.

---
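
The relationship between these four values can be sketched in plain Python. This is an illustration of the arithmetic only, not how cdapython computes them internally; `describe_page`, `page_offsets`, and the total of 250 rows are hypothetical, while the page size of 100 is the default stated above.

```python
PAGE_SIZE = 100  # default page size described above


def describe_page(total_row_count, offset=0, page_size=PAGE_SIZE):
    """Return the four paging values for one page of a result table."""
    remaining = max(total_row_count - offset, 0)
    count = min(page_size, remaining)
    return {
        "Offset": offset,
        "Count": count,
        "Total Row Count": total_row_count,
        "More pages": offset + count < total_row_count,
    }


def page_offsets(total_row_count, page_size=PAGE_SIZE):
    """Offsets needed to visit every page of the results."""
    return list(range(0, total_row_count, page_size))


# Hypothetical result table with 250 rows in total.
first_page = describe_page(250)
last_page = describe_page(250, offset=200)
print(first_page)
print(last_page)
print(page_offsets(250))
```

With 250 rows, the first page is full and more pages remain; the page starting at offset 200 holds the final 50 rows, so no more pages remain.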
    
Now that we've seen the metadata about our results, let's look at the actual table. The easiest way to do this is by calling the `.to_dataframe()` method on our `subjectresults` variable:

In [5]:
subjectresults.to_dataframe()

Unnamed: 0,id,identifier,species,sex,race,ethnicity,days_to_birth,subject_associated_project,vital_status,age_at_death,cause_of_death
0,ACRIN-DSC-MR-Brain-102,"[{'system': 'IDC', 'value': 'ACRIN-DSC-MR-Brai...",Homo sapiens,,,,,[acrin_dsc_mr_brain],,,
1,ACRIN-FMISO-Brain-012,"[{'system': 'IDC', 'value': 'ACRIN-FMISO-Brain...",Homo sapiens,,,,,[acrin_fmiso_brain],,,
2,C19680,"[{'system': 'PDC', 'value': 'C19680'}]",Homo sapiens,female,black or african american,not hispanic or latino,,[Proteogenomic Analysis of Pediatric Brain Can...,Alive,,Not Reported
3,C3N-01814,"[{'system': 'GDC', 'value': 'C3N-01814'}, {'sy...",Homo sapiens,female,not reported,not reported,-17813,"[CPTAC3-Discovery, CPTAC-3]",Alive,,
4,GENIE-GRCC-6563ba26,"[{'system': 'GDC', 'value': 'GENIE-GRCC-6563ba...",Homo sapiens,male,not allowed to collect,not allowed to collect,-15705,[GENIE-GRCC],Not Reported,,
...,...,...,...,...,...,...,...,...,...,...,...
95,TCGA-HT-7884,"[{'system': 'GDC', 'value': 'TCGA-HT-7884'}, {...",Homo sapiens,female,white,not hispanic or latino,-16280,"[TCGA-LGG, tcga_lgg]",Alive,,
96,TCGA-HT-A617,"[{'system': 'GDC', 'value': 'TCGA-HT-A617'}, {...",Homo sapiens,male,american indian or alaska native,not reported,-17331,"[TCGA-LGG, tcga_lgg]",Alive,,
97,TCGA-QH-A65X,"[{'system': 'GDC', 'value': 'TCGA-QH-A65X'}]",Homo sapiens,female,white,not hispanic or latino,-10440,[TCGA-LGG],Alive,,
98,TCGA-S9-A6WQ,"[{'system': 'GDC', 'value': 'TCGA-S9-A6WQ'}]",Homo sapiens,female,white,not hispanic or latino,-21133,[TCGA-LGG],Alive,,


By default `to_dataframe()` shows us the first and last five rows for the first page of our results, so we can easily preview our data.

Since we queried the Subject endpoint, our default results give us Subject-level information, that is, information about unique individuals: their sex, race, age, species, etc. The `id` column holds the unique identifier for each individual. The `identifier` column has nested information about what study or studies a Subject participated in, and will list all of their research_subject identifiers.
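
Each `identifier` cell is a list of `{'system': ..., 'value': ...}` entries. The sketch below shows one way to pull the originating systems out of that nested structure using plain Python. The first two rows reuse complete values from the table above; the third row, `example-subject`, is hypothetical and stands in for a subject known to more than one data center, since the multi-system cells in the preview are truncated.

```python
rows = [
    # Complete values copied from the preview table above.
    {"id": "C19680", "identifier": [{"system": "PDC", "value": "C19680"}]},
    {"id": "TCGA-QH-A65X",
     "identifier": [{"system": "GDC", "value": "TCGA-QH-A65X"}]},
    # Hypothetical row listed by two data centers.
    {"id": "example-subject",
     "identifier": [{"system": "GDC", "value": "example-subject"},
                    {"system": "IDC", "value": "example-subject"}]},
]


def systems_for(row):
    """List the data centers named in one row's identifier entries."""
    return [entry["system"] for entry in row["identifier"]]


systems_by_id = {row["id"]: systems_for(row) for row in rows}
for subject_id, systems in systems_by_id.items():
    print(subject_id, systems)
```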

<span style="background-color:#f20505;color:#f5f5f5"> Need to add subject ID info to output. Devs working on it</span>


### research_subject

But if we're interested in which research_subjects meet our criteria, we can also run our query against the research_subject endpoint:

In [6]:
resubjectresults = myquery.research_subject.run()
resubjectresults

Total execution time: 3252 ms



            QueryID: a59083fd-e5da-413e-95d0-7c5cdbaf2474
            Query:SELECT results.* EXCEPT(rn) FROM (SELECT ROW_NUMBER() OVER (PARTITION BY _ResearchSubject.id) as rn, _ResearchSubject.id AS id, _ResearchSubject.identifier AS identifier, _ResearchSubject.member_of_research_project AS member_of_research_project, _ResearchSubject.primary_diagnosis_condition AS primary_diagnosis_condition, _ResearchSubject.primary_diagnosis_site AS primary_diagnosis_site FROM gdc-bq-sample.dev.all_Subjects_v3_0_w_RS AS all_Subjects_v3_0_w_RS LEFT JOIN UNNEST(all_Subjects_v3_0_w_RS.ResearchSubject) AS _ResearchSubject WHERE (UPPER(_ResearchSubject.primary_diagnosis_site) = UPPER('brain'))) as results WHERE rn = 1
            Offset: 0
            Count: 100
            Total Row Count: 2923
            More pages: True
            

Now we see that our 2314 subjects have 2923 research_subjects between them. That means that some, but not all, of our subjects were participants in more than one study. Let's peek at the data:

In [7]:
resubjectresults.to_dataframe()

Unnamed: 0,id,identifier,member_of_research_project,primary_diagnosis_condition,primary_diagnosis_site
0,001ad307-4ad3-4f1d-b2fc-efc032871c7e,"[{'system': 'GDC', 'value': '001ad307-4ad3-4f1...",TCGA-LGG,Gliomas,Brain
1,292cf158-dd47-44a1-912b-73626095a0f7,"[{'system': 'GDC', 'value': '292cf158-dd47-44a...",TCGA-LGG,Gliomas,Brain
2,309005a2-93a8-4566-b8d3-6b9310144266,"[{'system': 'GDC', 'value': '309005a2-93a8-456...",TCGA-GBM,Gliomas,Brain
3,3726b315-6cf8-49a3-a581-b7f3f4ca6e16,"[{'system': 'GDC', 'value': '3726b315-6cf8-49a...",TCGA-GBM,Gliomas,Brain
4,4d0be324-b998-45af-a9ac-f158d8d03e4a,"[{'system': 'GDC', 'value': '4d0be324-b998-45a...",GENIE-DFCI,"Neoplasms, NOS",Brain
...,...,...,...,...,...
95,ACRIN-DSC-MR-Brain-030__acrin_dsc_mr_brain,"[{'system': 'IDC', 'value': 'ACRIN-DSC-MR-Brai...",acrin_dsc_mr_brain,,Brain
96,ACRIN-FMISO-Brain-028__acrin_fmiso_brain,"[{'system': 'IDC', 'value': 'ACRIN-FMISO-Brain...",acrin_fmiso_brain,,Brain
97,C3L-01155__cptac_gbm,"[{'system': 'IDC', 'value': 'C3L-01155'}]",cptac_gbm,,Brain
98,C3L-03744__cptac_gbm,"[{'system': 'IDC', 'value': 'C3L-03744'}]",cptac_gbm,,Brain


Each row from the research_subject endpoint results tells us about a subject in a given study. Any given subject will have one row per study they participated in. Using this endpoint, we can find out which studies fit our search criteria, and we can also get data that we can filter down to only subjects from multiple studies, or only subjects from single studies.
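
One way to sketch the multi-study versus single-study split is to count the entries in the `subject_associated_project` column of the Subject table shown earlier. The values below are copied from that preview; the variable names are illustrative. Note the assumption built into this approach: it counts project labels, and two labels such as `TCGA-LGG` and `tcga_lgg` may be the same study as named by two different data centers.

```python
# Rows copied from the Subject preview table earlier in this notebook.
subjects = [
    {"id": "C3N-01814",
     "subject_associated_project": ["CPTAC3-Discovery", "CPTAC-3"]},
    {"id": "GENIE-GRCC-6563ba26",
     "subject_associated_project": ["GENIE-GRCC"]},
    {"id": "TCGA-HT-7884",
     "subject_associated_project": ["TCGA-LGG", "tcga_lgg"]},
    {"id": "TCGA-QH-A65X",
     "subject_associated_project": ["TCGA-LGG"]},
]

# Subjects listed under more than one project label.
multi_project = [s["id"] for s in subjects
                 if len(s["subject_associated_project"]) > 1]

# Subjects listed under exactly one project label.
single_project = [s["id"] for s in subjects
                  if len(s["subject_associated_project"]) == 1]

print(multi_project)
print(single_project)
```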

### specimen

We can use this same query to see what specimens are available for brain tissue at the CDA:

In [8]:
specresults = myquery.specimen.run()
print(specresults)

Total execution time: 3406 ms

            QueryID: 313a9675-488f-4885-bdcf-e4246d98582f
            Query:SELECT results.* EXCEPT(rn) FROM (SELECT ROW_NUMBER() OVER (PARTITION BY _ResearchSubject_Specimen.id) as rn, _ResearchSubject_Specimen.id AS id, _ResearchSubject_Specimen.identifier AS identifier, _ResearchSubject_Specimen.associated_project AS associated_project, _ResearchSubject_Specimen.age_at_collection AS age_at_collection, _ResearchSubject_Specimen.primary_disease_type AS primary_disease_type, _ResearchSubject_Specimen.anatomical_site AS anatomical_site, _ResearchSubject_Specimen.source_material_type AS source_material_type, _ResearchSubject_Specimen.specimen_type AS specimen_type, _ResearchSubject_Specimen.derived_from_specimen AS derived_from_specimen, _ResearchSubject_Specimen.derived_from_subject AS derived_from_subject FROM gdc-bq-sample.dev.all_Subjects_v3_0_w_RS AS all_Subjects_v3_0_w_RS LEFT JOIN UNNEST(all_Subjects_v3_0_w_RS.ResearchSubject) AS _ResearchSubject LEF

Nearly 40,000 specimens meet our search criteria! We would typically expect this number to be much larger than our number of subjects or research_subjects: first, because studies often take more than one sample per subject, and second, because any given specimen might be aliquoted for use in multiple tests. Since we didn't specify any further filters, our results return all of these as separate specimens. Let's look at a few:

In [9]:
specresults.to_dataframe()

Unnamed: 0,id,identifier,associated_project,age_at_collection,primary_disease_type,anatomical_site,source_material_type,specimen_type,derived_from_specimen,derived_from_subject
0,006bc276-20b8-4d8d-97e3-44cc9e4690c5,"[{'system': 'GDC', 'value': '006bc276-20b8-4d8...",GENIE-MSK,-11688,Germ Cell Neoplasms,,Metastatic,aliquot,7e88ebc1-e2bf-500e-adf6-440ce3c30fa2,GENIE-MSK-P-0003994
1,0172a922-4f32-4b99-b121-044acea675e0,"[{'system': 'GDC', 'value': '0172a922-4f32-4b9...",TCGA-GBM,-17821,Gliomas,,Primary Tumor,portion,0f78eea7-b8b1-4d0d-b3ea-b127ffb9d849,TCGA-08-0512
2,02cc6ee6-9389-45b4-acd3-2f3bec53cb5c,"[{'system': 'GDC', 'value': '02cc6ee6-9389-45b...",TCGA-LGG,-16021,Gliomas,,Primary Tumor,analyte,287bb768-2aff-40be-99f3-d8ff6944ec3c,TCGA-TM-A7C3
3,03390ebe-13a6-5524-8ae8-07750c4b2f86,"[{'system': 'GDC', 'value': '03390ebe-13a6-552...",TCGA-GBM,-21678,Gliomas,,Primary Tumor,portion,b1c293ea-9f41-4cd9-a3d3-40f42fe4d594,TCGA-27-1832
4,04042fa9-105d-4e12-a976-866a41ca006f,"[{'system': 'GDC', 'value': '04042fa9-105d-4e1...",TCGA-GBM,-15736,Gliomas,,Primary Tumor,analyte,26c4fc97-517a-45a5-933e-3b1d75228ad4,TCGA-06-0138
...,...,...,...,...,...,...,...,...,...,...
95,4a5299fe-a61f-4e0d-9e89-bc7874cff10e,"[{'system': 'GDC', 'value': '4a5299fe-a61f-4e0...",TCGA-LGG,-15727,Gliomas,,Blood Derived Normal,aliquot,afbaf996-34ed-44d0-b10e-885d4b5605af,TCGA-HT-7692
96,4a6549ce-fc5a-4f33-89a6-ebb924f3e953,"[{'system': 'GDC', 'value': '4a6549ce-fc5a-4f3...",TCGA-LGG,-18494,Gliomas,,Primary Tumor,slide,574793ca-1449-5090-bca5-275f1db85e1e,TCGA-VM-A8C8
97,4aac6731-a9fa-4574-8b76-2b41616afd20,"[{'system': 'GDC', 'value': '4aac6731-a9fa-457...",TCGA-GBM,-19661,Gliomas,,Primary Tumor,aliquot,3a13e290-e5e6-49ec-a2bd-fa02d17a6945,TCGA-06-0145
98,4c85a4ba-ea92-4b99-9c08-0b4943620dc9,"[{'system': 'GDC', 'value': '4c85a4ba-ea92-4b9...",TCGA-LGG,,Gliomas,,Blood Derived Normal,aliquot,fff828db-dd51-4208-ad02-eda8f8b49671,TCGA-W9-A837
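
The previewed rows already hint at how one sample becomes many specimen records: aliquots, portions, analytes, and slides all appear. As a rough sketch, the `specimen_type` and `derived_from_subject` values from the nine rows visible above can be tallied with `collections.Counter`. This covers only the preview rows, not the full result set.

```python
from collections import Counter

# (derived_from_subject, specimen_type) pairs copied from the preview above.
preview_rows = [
    ("GENIE-MSK-P-0003994", "aliquot"),
    ("TCGA-08-0512", "portion"),
    ("TCGA-TM-A7C3", "analyte"),
    ("TCGA-27-1832", "portion"),
    ("TCGA-06-0138", "analyte"),
    ("TCGA-HT-7692", "aliquot"),
    ("TCGA-VM-A8C8", "slide"),
    ("TCGA-06-0145", "aliquot"),
    ("TCGA-W9-A837", "aliquot"),
]

# How many preview rows there are of each specimen type.
type_counts = Counter(specimen_type for _, specimen_type in preview_rows)

# How many preview rows each subject contributes.
per_subject = Counter(subject for subject, _ in preview_rows)

print(type_counts)
print(max(per_subject.values()))
```

In this small preview every subject appears once; across the full results, a per-subject tally like `per_subject` is where repeated sampling and aliquoting would show up.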


### file
<span style="background-color:#f20505;color:#f5f5f5"> This command does not work right now. Devs working on it</span>

The file endpoint returns information about files that meet our search criteria, regardless of whether they are attached to subjects, research-subjects or specimens: 

In [10]:
myquery.file.run()

AttributeError: 'Q' object has no attribute 'file'

As you might expect, searching file gives us a huge number of results. This is great if you are surveying what kind of data is available, but is less useful for getting a coherent cohort. 

A better way to get files for a specific cohort is to chain your queries together. We cover this in the next tutorial, [Advanced Search](../AdvancedSearch-Chaining), which shows how to combine information from multiple endpoints and build And/Or/Like and other advanced query strings.


What are all these fields?

---

<div style="background-color:#a2f2ed;color:black;padding:20px;">

<ul>
  <li><b>id:</b> The unique identifier for this file</li>
  <li><b>identifier:</b> An embedded array of information that includes the originating data center and the ID the file had there</li>
  <li><b>label:</b> The full name of the file</li>
  <li><b>data_category:</b> A general description of the kind of data the file holds</li>
  <li><b>data_type:</b> A more specific description of the data type</li>
  <li><b>file_format:</b> The extension of the file</li>
  <li><b>associated_project:</b> The name the data center uses for the study this file was generated for</li>
  <li><b>drs_uri:</b> A unique identifier that can be used to retrieve this specific file from a server</li>
  <li><b>byte_size:</b> Size of the file in bytes</li>
  <li><b>checksum:</b> The md5 value for the file</li>
  <li><b>data_modality:</b> A high level descriptor of file data, always one of "Genomic", "Proteomic", or "Imaging"</li>
  <li><b>imaging_modality:</b> For files with the <code>data_modality</code> of "Imaging", a descriptor for the image type</li>
  <li><b>dbgap_accession_number:</b></li>
</ul>  

</div>
    
---