Before we do any work, we need to import several functions from cdapython:
- `Q` and `query` which power the search
- `columns` which lets us view entity field names
- `unique_terms` which lets us view entity field contents

We're also asking cdapython to report its version so we can be sure we're using the one we mean to.

In [1]:
from cdapython import Q, columns, unique_terms, query
import cdapython
print(cdapython.__version__)
Q.set_host_url("http://35.192.60.10:8080/")

2022.4.13




CDA data comes from three sources:
- The [Proteomic Data Commons](https://proteomic.datacommons.cancer.gov/pdc/) (PDC)
- The [Genomic Data Commons](https://gdc.cancer.gov/) (GDC)
- The [Imaging Data Commons](https://datacommons.cancer.gov/repository/imaging-data-commons) (IDC)

The CDA makes this data searchable in two ways: by a "Subject" table and by a "File" table.

`Subject` data is information that is intrinsic to the individual under study, e.g. `sex`, `race`, `ethnicity`. However, any given subject might be part of multiple studies. To make searching across datasets easier, the CDA model aggregates this data as `ResearchSubject` information. Subjects that participate in multiple projects (are part of multiple nodes) will have multiple `ResearchSubject` entries.

`Subject` and `ResearchSubject` fields are available for both the "Subject" and "File" tables; however, terms specific to files, e.g. `data_type` or `file_format`, are only available in the "File" table.

To see what fields are available, we use the command `columns`. Here we're looking at the first ten metadata fields available for files:

In [2]:
columns(files=True, limit=10)

['id',
 'identifier',
 'identifier.system',
 'identifier.value',
 'label',
 'data_category',
 'data_type',
 'file_format',
 'associated_project',
 'drs_uri']

By default, `columns()` returns `Subject` table fields. The first several fields (those without a `.` in them) are `Subject` demographic information, which is intrinsically attached to a given subject. Subsequent entities (i.e. `ResearchSubject.xxx`) contain details about specific experiments the subject was part of. They are equivalent to the nodes' `Case` records in the GDC and PDC.
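
As a small illustration of the dot convention described above, the sketch below splits a list of field names into top-level and nested fields. The list is copied from the `columns(files=True, limit=10)` output shown earlier; `split_fields` is a hypothetical helper written for this example, not part of cdapython.

```python
# Field names copied from the columns(files=True, limit=10) output above.
fields = [
    'id', 'identifier', 'identifier.system', 'identifier.value', 'label',
    'data_category', 'data_type', 'file_format', 'associated_project', 'drs_uri',
]


def split_fields(names):
    """Separate top-level field names from nested ones (those containing a '.')."""
    top_level = [name for name in names if '.' not in name]
    nested = [name for name in names if '.' in name]
    return top_level, nested


top_level, nested = split_fields(fields)
print("Top-level fields:", top_level)
print("Nested fields:", nested)
```

The same split should work on whatever list `columns()` returns for the "Subject" table, assuming it is a plain list of strings as in the output above.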

While available search fields may look like ones you've seen in PDC, GDC or IDC, that does not mean they will contain exactly the same information; several are renamed or restructured in the CDA model. The field name mappings are described in [CDA Schema Field Mapping](../Documentation/Schema.md), but we can also directly get information about what data populates any of these fields using the `unique_terms()` function:

In [3]:
unique_terms("ResearchSubject.primary_diagnosis_site", files=True, limit=10)

[None,
 'Adrenal gland',
 'Anus and anal canal',
 'Base of tongue',
 'Bladder',
 'Bones, joints and articular cartilage of limbs',
 'Bones, joints and articular cartilage of other and unspecified sites',
 'Brain',
 'Breast',
 'Bronchus and lung']
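
Note that the returned list can include `None` alongside the text values. A minimal sketch for searching such a list, assuming it is an ordinary Python list as printed above; `find_terms` is a hypothetical helper, not a cdapython function:

```python
# Values copied from the unique_terms(...) output above.
# limit=10 was used, so this is only part of the full term list.
terms = [
    None,
    'Adrenal gland',
    'Anus and anal canal',
    'Base of tongue',
    'Bladder',
    'Bones, joints and articular cartilage of limbs',
    'Bones, joints and articular cartilage of other and unspecified sites',
    'Brain',
    'Breast',
    'Bronchus and lung',
]


def find_terms(values, text):
    """Return the non-null terms that contain `text`, ignoring case."""
    needle = text.lower()
    return [v for v in values if v is not None and needle in v.lower()]


matches = find_terms(terms, "bone")
print(matches)
```

Checking a term this way before writing a query helps avoid searching for a spelling that does not occur in the field.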

The CDA provides a custom Python tool for searching CDA data. [`Q`](usage/#q) (short for Query) offers several ways to search and filter data, and several input modes:

---

- **[Q.run()](../../../Documentation/usage/#qrun)** returns **demographic** data for the specified search 
- **[Q.counts()](../../../Documentation/usage/#qcounts)** returns summary information (counts) for the **files** that fit the specified search
- **[Q.files()](../../../Documentation/usage/#qfiles)** returns data for the **files** that fit the specified search
- **[Q.sql()](../../../Documentation/usage/#qsql)** allows you to use SQL syntax instead of Q syntax 
- **[query()](../../../Documentation/usage/#query)** allows you to use a more natural language syntax of Q

---


For today's demo, I'm going to show you how you can use `Q.counts()`, `Q.files()`, and `Q.run()` to build a cohort.

## Retrieving summary information

### Demographic summary
Let's run a search with a relatively simple question: We're interested in finding data about Kidney cancer. To run this simple search, we would first construct a query in `Q` and save it to a variable `myquery`:

In [4]:
myquery = Q('ResearchSubject.primary_diagnosis_site = "Kidney"')

Since we are looking for demographic summary information, we want to use this query in `Q.run()`. We do this by calling `.run()` on the query we just saved, and saving the result to a new variable, `mydemographic`:

In [5]:
mydemographic = myquery.run()

Getting results from database

Total execution time: 4043 ms


We're saving information in variables, so we don't get any visible output. To see what our results are, we need to look into the variable. The simplest way is to call `mydemographic` directly:

In [6]:
mydemographic


            QueryID: 4a88b68d-8cc4-49cf-8523-9bed80678a46
            Query:SELECT all_v3_0_subjects_meta.* FROM gdc-bq-sample.dev.all_v3_0_subjects_meta AS all_v3_0_subjects_meta, UNNEST(all_v3_0_subjects_meta.ResearchSubject) AS _ResearchSubject WHERE (UPPER(_ResearchSubject.primary_diagnosis_site) = UPPER('Kidney'))
            Offset: 0
            Count: 100
            Total Row Count: 3415
            More pages: True
            

This output tells us our QueryID, which we don't really need ourselves but which the system uses to track our queries.
Then it tells us five parameters that describe our results:

---

- **Query:** This is the actual SQL query that was run on our database to retrieve our results
- **Offset:** This is how many rows of information we've told the query to skip in the data. Here we didn't tell it to skip anything, so the offset is zero
- **Count:** This is how many rows (Subjects) the current page of our results table has. To keep searches fast, we default to pages with 100 rows.
- **Total Row Count:** This is how many rows (Subjects) are in the full results table
- **More pages:** This is always either True or False. False means that our current page has all the available results. True means that we will see only the first 100 results in this table, and will need to page through for more.

---
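
To make the relationship between these parameters concrete, here is a short arithmetic sketch using the numbers from the summary above. It only computes page counts and offsets; how a particular page is requested from cdapython is not shown here.

```python
import math

# Values copied from the result summary above.
total_row_count = 3415
page_size = 100  # the "Count" of a full page

# Number of pages needed to hold every row.
num_pages = math.ceil(total_row_count / page_size)

# Offset (rows skipped) at the start of each page: 0, 100, 200, ...
offsets = list(range(0, total_row_count, page_size))

# Rows on the final, partially filled page.
last_page_rows = total_row_count - offsets[-1]

print("pages:", num_pages)
print("first offsets:", offsets[:3])
print("rows on last page:", last_page_rows)
```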
    
Now that we've seen the metadata about our results, let's look at the actual table. The easiest way to do this is by calling the `.to_dataframe()` method on our `mydemographic` variable:

In [7]:
mydemographic.to_dataframe()

Unnamed: 0,id,identifier,species,sex,race,ethnicity,days_to_birth,subject_associated_project,vital_status,age_at_death,cause_of_death,ResearchSubject
0,GENIE-MSK-P-0014698,"[{'system': 'GDC', 'value': 'GENIE-MSK-P-00146...",Homo sapiens,male,asian,not hispanic or latino,-18993,[GENIE-MSK],Not Reported,,,[{'id': 'e884a6ca-6e96-4911-aba3-4cc8699c028b'...
1,GENIE-MSK-P-0016600,"[{'system': 'GDC', 'value': 'GENIE-MSK-P-00166...",Homo sapiens,male,Unknown,Unknown,-23010,[GENIE-MSK],Not Reported,,,[{'id': '099ab6ef-755f-4a17-8eac-9b6761202234'...
2,AD5315,"[{'system': 'GDC', 'value': 'AD5315'}]",Homo sapiens,male,not reported,not reported,,[FM-AD],Not Reported,,,[{'id': '34469f11-12b1-40c9-a156-13fdbba9c825'...
3,AD3384,"[{'system': 'GDC', 'value': 'AD3384'}]",Homo sapiens,male,not reported,not reported,,[FM-AD],Not Reported,,,[{'id': 'c739cfce-ea1b-412c-be8d-96ab1129f406'...
4,TARGET-30-PAUELT,"[{'system': 'GDC', 'value': 'TARGET-30-PAUELT'}]",Homo sapiens,female,black or african american,not hispanic or latino,,[TARGET-NBL],Dead,,,[{'id': 'e785d22c-f983-55ff-be60-554ac487cf8c'...
...,...,...,...,...,...,...,...,...,...,...,...,...
95,TARGET-50-PAJNNF,"[{'system': 'GDC', 'value': 'TARGET-50-PAJNNF'}]",Homo sapiens,female,white,not reported,,[TARGET-WT],Alive,,,[{'id': '455b8233-d97b-5f25-997e-40f281f858ca'...
96,TCGA-NP-A5GY,"[{'system': 'GDC', 'value': 'TCGA-NP-A5GY'}, {...",Homo sapiens,female,white,not hispanic or latino,-27464,"[TCGA-KICH, tcga_kich]",Dead,22,,[{'id': 'c7921a85-329b-4b23-ad87-5a695a88611e'...
97,GENIE-DFCI-005124,"[{'system': 'GDC', 'value': 'GENIE-DFCI-005124'}]",Homo sapiens,female,white,not hispanic or latino,-17897,[GENIE-DFCI],Not Reported,,,[{'id': 'd3694567-66d6-4689-b868-577680b36965'...
98,GENIE-MSK-P-0006742,"[{'system': 'GDC', 'value': 'GENIE-MSK-P-00067...",Homo sapiens,female,white,not hispanic or latino,-24837,[GENIE-MSK],Not Reported,,,[{'id': '331c0024-6505-45c6-9e86-f7fcdb5fef77'...


#### We can further subset by chaining queries

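There are lots of Subjects in our general search for Kidney, so let's try to narrow things down. We're most interested in early-stage cancer, so we'll filter by `ResearchSubject.Diagnosis.stage`, joining two stage conditions with `.Or()`. Note that `cancerquery` as written filters on stage only: the SQL in its output contains no `primary_diagnosis_site` condition, so these results are not limited to Kidney.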


In [8]:
cancerquery = Q('ResearchSubject.Diagnosis.stage = "Stage I"').Or(Q('ResearchSubject.Diagnosis.stage = "Stage II"'))
cancerdemographic = cancerquery.run()
cancerdemographic

Getting results from database

Total execution time: 4831 ms



            QueryID: ae953690-63a8-4283-8e21-253353cc03e8
            Query:SELECT all_v3_0_subjects_meta.* FROM gdc-bq-sample.dev.all_v3_0_subjects_meta AS all_v3_0_subjects_meta, UNNEST(all_v3_0_subjects_meta.ResearchSubject) AS _ResearchSubject, UNNEST(_ResearchSubject.Diagnosis) AS _ResearchSubject_Diagnosis WHERE ((UPPER(_ResearchSubject_Diagnosis.stage) = UPPER('Stage I')) OR (UPPER(_ResearchSubject_Diagnosis.stage) = UPPER('Stage II')))
            Offset: 0
            Count: 100
            Total Row Count: 226
            More pages: True
            

This is a much more manageable number of subjects, and now they're targeted by our actual question. Let's peek at the data:

In [9]:
cancerdemographic.to_dataframe().head()

Unnamed: 0,id,identifier,species,sex,race,ethnicity,days_to_birth,subject_associated_project,vital_status,age_at_death,cause_of_death,ResearchSubject
0,C3N-00315,"[{'system': 'GDC', 'value': 'C3N-00315'}, {'sy...",Homo sapiens,male,not reported,not reported,-25193,"[CPTAC3-Discovery, cptac_ccrcc, CPTAC-3]",Alive,,Not Reported,[{'id': '77817c88-77e1-4ff0-a24e-21ca2942ba3a'...
1,C3N-01522,"[{'system': 'GDC', 'value': 'C3N-01522'}, {'sy...",Homo sapiens,male,not reported,not reported,-29350,"[CPTAC3-Discovery, cptac_ccrcc, CPTAC-3]",Alive,,Not Reported,[{'id': '8be8aafe-ccd7-437f-aa68-09c4bc248d1d'...
2,C3L-02621,"[{'system': 'GDC', 'value': 'C3L-02621'}, {'sy...",Homo sapiens,male,not reported,not reported,-25156,"[CPTAC3-Discovery, CPTAC-3]",Alive,,,[{'id': '37cbde5d-932b-491b-a579-ad8df3905dad'...
3,C3N-00866,"[{'system': 'GDC', 'value': 'C3N-00866'}, {'sy...",Homo sapiens,female,white,not hispanic or latino,-28243,"[CPTAC3-Discovery, cptac_ucec, CPTAC-3]",Alive,,Not Reported,[{'id': '591733c5-3190-493a-8e56-b7e471473e22'...
4,C3L-00917,"[{'system': 'GDC', 'value': 'C3L-00917'}, {'sy...",Homo sapiens,male,white,hispanic or latino,-13840,"[CPTAC3-Discovery, cptac_ccrcc, CPTAC-3]",Alive,,Not Reported,[{'id': '30bf3778-fb08-4568-b6d2-dbcb5d2b1ee6'...
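
The `cancerquery` cell above writes out one condition string per stage by hand. When more values are involved, the strings can be generated instead. The sketch below builds only the strings, so it runs without a CDA connection; `stage_conditions` is a hypothetical helper, not part of cdapython.

```python
def stage_conditions(stages, field="ResearchSubject.Diagnosis.stage"):
    """Build one Q-style condition string per stage value."""
    return [f'{field} = "{stage}"' for stage in stages]


conditions = stage_conditions(["Stage I", "Stage II"])
for condition in conditions:
    print(condition)

# With a live CDA connection, each string could be wrapped in Q(...) and the
# resulting queries joined with .Or(), as the cancerquery cell does by hand.
```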


### File summary
Now that we know there are research subjects that meet our criteria, let's see what data exists about them. We can get a summary of the data for these subjects using the `counts()` feature:

In [10]:
cancerquery.counts().to_dataframe()

Unnamed: 0,system,subject_count,subject_files_count,researchsubject_count,researchsubject_files_count,specimen_count,specimen_files_count
0,IDC,403,0,0,0,0,0
1,GDC,437,0,0,0,0,0
2,PDC,478,13046,226,13046,1346,12959


### What do these numbers mean?

---
    
- **system:** Which data source contributed this data? The CDA currently has data from IDC, PDC and GDC
- **subject_count:** How many unique individuals meet our query. Note that *within* a data source the number is of *unique* individuals, but the same individuals can have data at multiple centers. Here, there are 403 unique people in the IDC data, but some of them may be exactly the same people as are in the PDC data.
- **subject_files_count:** This tells you roughly how much data is available. It is the total count of files for all the subjects in `subject_count`, which is also the total number of files that match your search.
- **researchsubject_count:** Some data sources have individual subjects that are in multiple studies. When this happens, the individual will have both a "subject" identifier and a "researchsubject" identifier. This column counts the latter. Zero in this column can mean either "there are no researchsubjects that meet your search criteria" or "the data source for this row does not create special identifiers for subjects in multiple studies".
- **researchsubject_files_count:** This is the total count of files specific to researchsubjects in `researchsubject_count`. It is a subset of `subject_files_count`.
- **specimen_count:** Some data sources track whether files come from specific specimens from a given individual. This column counts the number of specimens that meet your search criteria. Zero in this column can mean either "there are no specimens that meet your search criteria" or "the data source for this row does not track specimens separately from subjects".
- **specimen_files_count:** This is the total count of files specific to specimens in `specimen_count`. It is a subset of both `subject_files_count` and `researchsubject_files_count`.

---
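
A short sketch of how these columns relate, using the rows from the `counts()` output above re-entered as plain dictionaries. The calculations are illustrations of the descriptions in the list, not cdapython features.

```python
# Rows copied from the counts() output above.
counts = [
    {"system": "IDC", "subject_count": 403, "subject_files_count": 0,
     "researchsubject_count": 0, "researchsubject_files_count": 0,
     "specimen_count": 0, "specimen_files_count": 0},
    {"system": "GDC", "subject_count": 437, "subject_files_count": 0,
     "researchsubject_count": 0, "researchsubject_files_count": 0,
     "specimen_count": 0, "specimen_files_count": 0},
    {"system": "PDC", "subject_count": 478, "subject_files_count": 13046,
     "researchsubject_count": 226, "researchsubject_files_count": 13046,
     "specimen_count": 1346, "specimen_files_count": 12959},
]

# The same person can appear in several sources, so the sum of subject_count
# is only an upper bound on the number of unique individuals.
max_unique_subjects = sum(row["subject_count"] for row in counts)

# Sources that report at least one file for this query.
sources_with_files = [row["system"] for row in counts if row["subject_files_count"] > 0]

# File counts are nested: specimen files within researchsubject files
# within subject files.
nested_ok = all(
    row["specimen_files_count"]
    <= row["researchsubject_files_count"]
    <= row["subject_files_count"]
    for row in counts
)

print(max_unique_subjects, sources_with_files, nested_ok)
```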

## Retrieving data


Let's run the same query, but instead of asking for summary information, let's get the data about each file. We start this process the same way, by making a `Q` statement. We can reuse `cancerquery` here and just run the `.files()` function on it:

In [11]:
cancerfiles = cancerquery.files()
cancerfiles

Getting results from database




            QueryID: 44c20c92-7e51-4d7a-b089-93f9a9849d89
            Query:SELECT all_v3_0_Files.* FROM gdc-bq-sample.dev.all_v3_0_Files AS all_v3_0_Files, UNNEST(all_v3_0_Files.ResearchSubject) AS _ResearchSubject, UNNEST(_ResearchSubject.Diagnosis) AS _ResearchSubject_Diagnosis WHERE ((UPPER(_ResearchSubject_Diagnosis.stage) = UPPER('Stage I')) OR (UPPER(_ResearchSubject_Diagnosis.stage) = UPPER('Stage II')))
            Offset: 0
            Count: 100
            Total Row Count: 33654
            More pages: True
            

In [12]:
cancerfiles.to_dataframe()

Unnamed: 0,id,identifier,label,data_category,data_type,file_format,associated_project,drs_uri,byte_size,checksum,data_modality,imaging_modality,dbgap_accession_number,Subject,ResearchSubject,Specimen
0,2a644e3e-2cbf-4380-b560-48f33a4287fb,"[{'system': 'PDC', 'value': '2a644e3e-2cbf-438...",19CPTAC_COprospective_W_VU_20151112_14CO002_f0...,Processed Mass Spectra,Open Standard,mzML,CPTAC-2,drs://dg.4DFC:2a644e3e-2cbf-4380-b560-48f33a42...,143700872,472caaf527d31f5c39fb762d152e9f2f,Proteomic,,,"[{'id': '14CO002', 'identifier': [{'system': '...",[{'id': 'c9ab199d-63d6-11e8-bcf1-0a2705229b82'...,[]
1,4e775291-a2de-43eb-b4a5-dd9d800e5a9c,"[{'system': 'PDC', 'value': '4e775291-a2de-43e...",19CPTAC_COprospective_W_VU_20151112_14CO002_f0...,Peptide Spectral Matches,Open Standard,mzIdentML,CPTAC-2,drs://dg.4DFC:4e775291-a2de-43eb-b4a5-dd9d800e...,2388679,8803d600868385a3fabcb53491a3da6a,Proteomic,,,"[{'id': '14CO002', 'identifier': [{'system': '...",[{'id': 'c9ab199d-63d6-11e8-bcf1-0a2705229b82'...,[]
2,fc8aba01-ccc2-403e-8497-a90620f69197,"[{'system': 'PDC', 'value': 'fc8aba01-ccc2-403...",09CPTAC_BCProspective_AcK_BI_20170917_BL_F5.mz...,Peptide Spectral Matches,Open Standard,mzIdentML,CPTAC-2,drs://dg.4DFC:fc8aba01-ccc2-403e-8497-a90620f6...,5053474,ea59e3a50c77f8e284fd262ca3090761,Proteomic,,,"[{'id': '01BR025', 'identifier': [{'system': '...",[{'id': '327f5bee-0a5d-11eb-bc0e-0aad30af8a83'...,[{'id': '32801a2c-0a5d-11eb-bc0e-0aad30af8a83'...
3,cbb865a4-fd37-4889-9822-943faaaefe58,"[{'system': 'PDC', 'value': 'cbb865a4-fd37-488...",09CPTAC_COprospective_P_PNNL_20170215_B3S1_f04...,Processed Mass Spectra,Open Standard,mzML,CPTAC-2,drs://dg.4DFC:cbb865a4-fd37-4889-9822-943faaae...,176336109,c28384e7ebcb5bdaf65d3b64404efab9,Proteomic,,,"[{'id': '05CO020', 'identifier': [{'system': '...",[{'id': '15d7eb12-63d6-11e8-bcf1-0a2705229b82'...,[{'id': '6a47a138-ec51-11e9-81b4-2a2ae2dbcce4'...
4,3c1bb42c-3277-4e96-82bb-50eb0c6e3599,"[{'system': 'PDC', 'value': '3c1bb42c-3277-4e9...",09CPTAC_COprospective_W_PNNL_20170123_B3S1_f05...,Processed Mass Spectra,Open Standard,mzML,CPTAC-2,drs://dg.4DFC:3c1bb42c-3277-4e96-82bb-50eb0c6e...,197609714,64bd9cc2331d32ff0d948d5f654c88f9,Proteomic,,,"[{'id': '05CO020', 'identifier': [{'system': '...",[{'id': '15d7eb12-63d6-11e8-bcf1-0a2705229b82'...,[{'id': '6a47a138-ec51-11e9-81b4-2a2ae2dbcce4'...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,47dc96da-1b50-11e9-8e2e-005056921935,"[{'system': 'PDC', 'value': '47dc96da-1b50-11e...",10CPTAC_CCRCC_P_JHU_20180123_LUMOS_f04.mzML.gz,Processed Mass Spectra,Open Standard,mzML,CPTAC3-Discovery,drs://dg.4DFC:47dc96da-1b50-11e9-8e2e-00505692...,210985259,c83f281bf6ce48000ec806ed06ec6786,Proteomic,,,"[{'id': 'C3L-01836', 'identifier': [{'system':...",[{'id': '2c9d2692-1fb9-11e9-b7f8-0a80fada099c'...,[{'id': '12589be6-204e-11e9-b7f8-0a80fada099c'...
96,47dc96da-1b50-11e9-8e2e-005056921935,"[{'system': 'PDC', 'value': '47dc96da-1b50-11e...",10CPTAC_CCRCC_P_JHU_20180123_LUMOS_f04.mzML.gz,Processed Mass Spectra,Open Standard,mzML,CPTAC3-Discovery,drs://dg.4DFC:47dc96da-1b50-11e9-8e2e-00505692...,210985259,c83f281bf6ce48000ec806ed06ec6786,Proteomic,,,"[{'id': 'C3L-01836', 'identifier': [{'system':...",[{'id': '2c9d2692-1fb9-11e9-b7f8-0a80fada099c'...,[{'id': '12589be6-204e-11e9-b7f8-0a80fada099c'...
97,47dc96da-1b50-11e9-8e2e-005056921935,"[{'system': 'PDC', 'value': '47dc96da-1b50-11e...",10CPTAC_CCRCC_P_JHU_20180123_LUMOS_f04.mzML.gz,Processed Mass Spectra,Open Standard,mzML,CPTAC3-Discovery,drs://dg.4DFC:47dc96da-1b50-11e9-8e2e-00505692...,210985259,c83f281bf6ce48000ec806ed06ec6786,Proteomic,,,"[{'id': 'C3L-01836', 'identifier': [{'system':...",[{'id': '2c9d2692-1fb9-11e9-b7f8-0a80fada099c'...,[{'id': '12589be6-204e-11e9-b7f8-0a80fada099c'...
98,848a8f74-1a06-11e9-b898-005056921935,"[{'system': 'PDC', 'value': '848a8f74-1a06-11e...",10CPTAC_CCRCC_P_JHU_20180123_LUMOS_fA.raw,Raw Mass Spectra,Proprietary,vendor-specific,CPTAC3-Discovery,drs://dg.4DFC:848a8f74-1a06-11e9-b898-00505692...,676842076,498b6a97df442da92e1f01c91281cda5,Proteomic,,,"[{'id': 'C3L-01836', 'identifier': [{'system':...",[{'id': '2c9d2692-1fb9-11e9-b7f8-0a80fada099c'...,[{'id': '12589be6-204e-11e9-b7f8-0a80fada099c'...
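
Rows 95 to 97 in the table above share the same `id`; the source does not say why. If one row per file is wanted, a minimal sketch for removing repeats is shown below. It uses four rows abbreviated from the table (only `id` and `label` kept), and `unique_by_id` is a hypothetical helper, not part of cdapython.

```python
# Four rows abbreviated from the table above; only `id` and `label` are kept.
rows = [
    {"id": "47dc96da-1b50-11e9-8e2e-005056921935",
     "label": "10CPTAC_CCRCC_P_JHU_20180123_LUMOS_f04.mzML.gz"},
    {"id": "47dc96da-1b50-11e9-8e2e-005056921935",
     "label": "10CPTAC_CCRCC_P_JHU_20180123_LUMOS_f04.mzML.gz"},
    {"id": "47dc96da-1b50-11e9-8e2e-005056921935",
     "label": "10CPTAC_CCRCC_P_JHU_20180123_LUMOS_f04.mzML.gz"},
    {"id": "848a8f74-1a06-11e9-b898-005056921935",
     "label": "10CPTAC_CCRCC_P_JHU_20180123_LUMOS_fA.raw"},
]


def unique_by_id(records):
    """Keep the first record seen for each id, preserving the original order."""
    seen = set()
    unique = []
    for record in records:
        if record["id"] not in seen:
            seen.add(record["id"])
            unique.append(record)
    return unique


unique_rows = unique_by_id(rows)
print(len(rows), "rows ->", len(unique_rows), "unique files")
```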


What are all these fields?

---

- **id:** The unique identifier for this file
- **identifier:** An embedded array of information that includes the originating data center and the ID the file had there
- **label:** The full name of the file
- **data_category:** A general description of the kind of data the file holds
- **data_type:** A more specific description of the data type
- **file_format:** The extension of the file
- **associated_project:** The name the data center uses for the study this file was generated for
- **drs_uri:** A unique identifier that can be used to retrieve this specific file from a server
- **byte_size:** Size of the file in bytes
- **checksum:** The md5 value for the file
- **data_modality:** A high-level descriptor of file data, always one of "Genomic", "Proteomic", or "Imaging"
- **imaging_modality:** For files with the `data_modality` of "Imaging", a descriptor for the image type
- **dbgap_accession_number:** An identifier for the dbGaP project this file belongs to
- **Subject:** An embedded array of information that includes the originating data center and the Subject ID the file had there
- **ResearchSubject:** An embedded array of information that includes the originating data center and the ResearchSubject ID the file had there
- **Specimen:** An embedded array of information that includes the originating data center and the Specimen ID the file had there

---
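
The embedded `identifier` arrays shown in the outputs above are lists of `{'system': ..., 'value': ...}` entries. A minimal sketch for turning one into a lookup by data center follows. It assumes the values are already Python lists of dictionaries; `identifiers_by_system` is a hypothetical helper, and the two-source record is a made-up placeholder.

```python
def identifiers_by_system(identifier):
    """Map each originating system to the ID the record has there.

    If a system appears more than once, the last entry wins.
    """
    return {entry["system"]: entry["value"] for entry in identifier}


# Copied from row 2 of the Subject table earlier in this notebook.
single = [{'system': 'GDC', 'value': 'AD5315'}]

# Hypothetical two-source record; 'EXAMPLE-ID' is a placeholder, not real data.
double = [
    {'system': 'GDC', 'value': 'EXAMPLE-ID'},
    {'system': 'PDC', 'value': 'EXAMPLE-ID'},
]

print(identifiers_by_system(single))
print(sorted(identifiers_by_system(double)))
```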
