Before we do any work, we need to import several functions from cdapython:
- `Q` and `query` which power the search
- `columns` which lets us view entity field names
- `unique_terms` which lets us view entity field contents

We're also asking cdapython to report its version so we can be sure we're using the one we mean to.

In [1]:
from cdapython import Q, columns, unique_terms, query, constantVariables
import cdapython
import pandas as pd
print(cdapython.__file__)
print(cdapython.__version__)
integration_host = "http://35.192.60.10:8080/"
Q.set_host_url(integration_host)


from pandas import DataFrame, json_normalize
from IPython.display import display, HTML, display_html

/Users/amanda.charbonneau/github/cda-python/cdapython/__init__.py
2022.5.23


CDA data comes from three sources:
- The [Proteomic Data Commons](https://proteomic.datacommons.cancer.gov/pdc/) (PDC)
- The [Genomic Data Commons](https://gdc.cancer.gov/) (GDC)
- The [Imaging Data Commons](https://datacommons.cancer.gov/repository/imaging-data-commons) (IDC)

The CDA makes this data searchable in four endpoints:

- `subject`: Specific, unique, individuals
- `research_subject`: Study-individual aggregate entities. A Subject who was part of three studies will appear as three ResearchSubjects
- `specimen`: Samples taken from an individual
- `file`: Data about Subjects, ResearchSubjects, Specimens, and their associated information


If you are looking to build a cohort of distinct individuals who meet some criteria, search by `Subject`. If you want to build a cohort, but are particularly interested in studies rather than the participants per se, search by `ResearchSubject`. If you are looking for biosamples that can be ordered or a specific format of information (e.g. histological slides), start with `Specimen`. If you are primarily looking for files you can reuse for your own analysis, start with `File`.

In CDA search, these concepts can also be strung together, so you can look specifically for Specimen Files, or ResearchSubject Specimens. In all cases, any search can use any metadata field. The only difference between search types is what type of data you return by default.


## Getting simple summary data

Let's try a broad search of the CDA to see what information exists about cancers that were first diagnosed in the brain. To run this simple search, we would first construct a query in `Q` and save it to a variable `myquery`:

In [2]:
myquery = Q('ResearchSubject.primary_diagnosis_site = "brain"')


<div style="background-color:#6ce6b9;color:black;padding:20px;">
<h3>Where did those terms come from?</h3>
    
If you aren't sure how we knew what terms to put in our search, please refer back to the <a href="../SearchTerms">What search terms are available?</a> notebook. 
</div>

### subject
Now we can use that query to search any of the information types. Let's start by looking at what Subjects meet our criteria. To do that, we will send our query to the subject endpoint, ask for counts, then ask for it to run:

In [3]:
subjectresults = myquery.subject.count.run()

Total execution time: 10002 ms


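We saved the output in a variable `subjectresults`, so we don't get much visible output. To see what our results are, we need to look into the variable. The simplest way is to call `subjectresults` directly. The cell below instead re-runs the same count query and walks through the result, first printing each summary as text and then rendering the per-field counts as tables: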

In [4]:
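q1 = Q('ResearchSubject.primary_diagnosis_site = "brain"')
r = q1.subject.count.run()

# Take the first record of the count result. Scalar fields are totals;
# list fields are per-value breakdowns.
result = r[0]

# First pass: plain-text summary of every field.
for key in result:
  value = result[key]
  if isinstance(value, list):
    print(json_normalize(value))
  else:
    print(f'{key}: {value}')

  print('')

# Second pass: render each breakdown as an inline HTML table.
html_string = ''

# Named so they do not shadow the `columns` function imported from cdapython.
header_style = {
    'selector': 'th',
    'props': 'background-color: #000066; color: white; text-align: left'
}
cell_style = {
    'selector': 'td',
    'props': 'text-align:left; border-bottom: 1px solid black;'
}

for key in result:
  value = result[key]
  if isinstance(value, list):
    df = json_normalize(value)
    # Styler.hide_index() was removed in pandas 2.0; hide(axis="index") replaces it.
    s = df.style.hide(axis="index")
    s.set_table_styles([header_style, cell_style])
    s.set_table_attributes("style='display:inline'")
    html_string = html_string + s._repr_html_()
  else:
    print(f"{key}: {value}")

display_html(html_string, raw=True)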

Total execution time: 3367 ms
total: 2314

files: 1920

  system count
0    IDC   664
1    GDC  1449
2    PDC   201

            sex count
0          None   683
1          male   979
2        female   649
3  not reported     3

                                        race count
0                                       None   683
1                                      white  1308
2                               not reported   135
3                  black or african american    96
4                                      asian    33
5           american indian or alaska native     4
6                                    Unknown    20
7                     not allowed to collect    25
8                                      other     9
9  native hawaiian or other pacific islander     1

                ethnicity count
0                    None   683
1  not hispanic or latino  1282
2      hispanic or latino    84
3            not reported   219
4                 Unknown    21
5  not allowed to 

system,count
IDC,664
GDC,1449
PDC,201

sex,count
,683
male,979
female,649
not reported,3

race,count
,683
white,1308
not reported,135
black or african american,96
asian,33
american indian or alaska native,4
Unknown,20
not allowed to collect,25
other,9
native hawaiian or other pacific islander,1

ethnicity,count
,683
not hispanic or latino,1282
hispanic or latino,84
not reported,219
Unknown,21
not allowed to collect,25

cause_of_death,count
Cancer Related,63
,2028
Unknown,9
Not Reported,200
Not Cancer Related,9
Infection,3
Surgical Complications,2


Since we queried the Subject endpoint, our default results give us Subject-level information, that is, information about unique individuals: their sex, race, age, species, etc. Using counts gives us back a pivot-table-style summary of the countable fields for Subjects.
This gives you a quick way to assess whether the full search results will have the data fields you require.
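
As a rough illustration of that kind of check, the sketch below takes one count table, copied from the `sex` output above, and works out what share of subjects has no usable value. The list-of-dicts layout is an assumption inferred from how `json_normalize` rendered the output, and the variable names are illustrative rather than part of cdapython.

```python
import pandas as pd

# Counts copied from the `sex` table in the output above.
# Assumption: each breakdown arrives as a list of {field: value, "count": n} dicts.
sex_counts = [
    {"sex": None, "count": 683},
    {"sex": "male", "count": 979},
    {"sex": "female", "count": 649},
    {"sex": "not reported", "count": 3},
]

df = pd.DataFrame(sex_counts)
total = int(df["count"].sum())

# Treat both missing values and the literal "not reported" as uninformative.
uninformative = df["sex"].isna() | df["sex"].eq("not reported")
missing = int(df.loc[uninformative, "count"].sum())

print(f"{missing} of {total} subjects ({missing / total:.1%}) have no usable sex value")
```

The same pattern works for any of the counted fields, so you can decide up front whether a field is populated well enough to build a cohort on.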


---

<div style="background-color:#a2f2ed;color:black;padding:20px;">

<h3>Subject Field Definitions</h3>

<i>A subject is a specific, unique individual, e.g. a single human. When consent allows, a given entity will have a single subject ID that can be connected to all their studies and data across all datasets.</i>
    
    
    
<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;margin:0px auto;}
.tg td{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  overflow:hidden;padding:10px 5px;word-break:normal;}
.tg th{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;}
.tg .tg-7zrl{text-align:left;vertical-align:bottom}
.tg .tg-0lax{text-align:left;vertical-align:top}
</style>
<table class="tg">
<tbody>
  <tr>
    <td class="tg-7zrl">Not Counted</td>
    <td class="tg-0lax">id</td>
    <td class="tg-0lax"> The unique identifier for this subject</td>
  </tr>
  <tr>
    <td class="tg-7zrl">Not Counted</td>
    <td class="tg-7zrl">identifier</td>
    <td class="tg-0lax"> An embedded array of information that includes the originating data center and the ID the subject had there</td>
  </tr>
  <tr>
    <td class="tg-7zrl">Counted</td>
    <td class="tg-7zrl">species</td>
    <td class="tg-0lax"> The species of the subject</td>
  </tr>
  <tr>
    <td class="tg-7zrl">Counted</td>
    <td class="tg-7zrl">sex</td>
    <td class="tg-0lax"> The sex of the subject </td>
  </tr>
  <tr>
    <td class="tg-7zrl">Counted</td>
    <td class="tg-7zrl">race</td>
    <td class="tg-0lax"> The race of the subject</td>
  </tr>
  <tr>
    <td class="tg-7zrl">Counted</td>
    <td class="tg-7zrl">ethnicity</td>
    <td class="tg-0lax"> The ethnicity of the subject</td>
  </tr>
  <tr>
    <td class="tg-7zrl">Not Counted</td>
    <td class="tg-7zrl">days_to_birth</td>
    <td class="tg-0lax"> A number counting back to birth from date of first enrollment in a project. Usually negative</td>
  </tr>
  <tr>
    <td class="tg-7zrl">Not Counted</td>
    <td class="tg-7zrl">subject_associated_project</td>
    <td class="tg-0lax"> An embedded array of the names of projects (studies) the subject was part of</td>
  </tr>
  <tr>
    <td class="tg-7zrl">Not Counted</td>
    <td class="tg-7zrl">vital_status</td>
    <td class="tg-0lax"> Whether the subject is alive</td>
  </tr>
  <tr>
    <td class="tg-7zrl">Not Counted</td>
    <td class="tg-7zrl">age_at_death</td>
    <td class="tg-0lax"> The number of days after first enrollment that the subject died</td>
  </tr>
  <tr>
    <td class="tg-7zrl">Counted</td>
    <td class="tg-0lax"> cause_of_death</td>
    <td class="tg-0lax"> The cause of death if known</td>
  </tr>
</tbody>
</table>

</div>
    
---

### research_subject

If we're interested in what research_subjects meet our criteria, we can also run our query against the research_subject endpoint:

In [5]:
resubjectresults = myquery.research_subject.run()
resubjectresults

Total execution time: 3517 ms



            QueryID: 1ed0e349-e8c9-49bb-b9d5-523c845b7469
            Query:SELECT results.* EXCEPT(rn) FROM (SELECT ROW_NUMBER() OVER (PARTITION BY _ResearchSubject.id, all_Subjects_v3_0_w_RS.id) as rn, _ResearchSubject.id AS id, _ResearchSubject.identifier AS identifier, _ResearchSubject.member_of_research_project AS member_of_research_project, _ResearchSubject.primary_diagnosis_condition AS primary_diagnosis_condition, _ResearchSubject.primary_diagnosis_site AS primary_diagnosis_site, all_Subjects_v3_0_w_RS.id AS subject_id FROM gdc-bq-sample.dev.all_Subjects_v3_0_w_RS AS all_Subjects_v3_0_w_RS INNER JOIN UNNEST(all_Subjects_v3_0_w_RS.ResearchSubject) AS _ResearchSubject WHERE (UPPER(_ResearchSubject.primary_diagnosis_site) = UPPER('brain'))) as results WHERE rn = 1
            Offset: 0
            Count: 100
            Total Row Count: 2923
            More pages: True
            

Now we see that our 2314 subjects have 2923 research_subjects between them. That means that some, but not all, of our subjects were participants in more than one study. Let's peek at the data:

In [6]:
resubjectresults.to_dataframe()

Unnamed: 0,id,identifier,member_of_research_project,primary_diagnosis_condition,primary_diagnosis_site,subject_id
0,01a13aba-74a4-4895-a5ad-e5119925c202,"[{'system': 'GDC', 'value': '01a13aba-74a4-489...",TCGA-LGG,Gliomas,Brain,TCGA-S9-A7IZ
1,104c1f4b-2139-11ea-aee1-0e1aae319e49,"[{'system': 'PDC', 'value': '104c1f4b-2139-11e...",CPTAC3-Discovery,Glioblastoma,Brain,C3L-01834
2,104c3685-2139-11ea-aee1-0e1aae319e49,"[{'system': 'PDC', 'value': '104c3685-2139-11e...",CPTAC3-Discovery,Glioblastoma,Brain,C3L-03392
3,1aa6b9ff-7c97-47d9-923f-2bd83f20e531,"[{'system': 'GDC', 'value': '1aa6b9ff-7c97-47d...",TCGA-LGG,Gliomas,Brain,TCGA-P5-A5EU
4,1f2f3bf6-acdd-4a72-8c33-230de40910eb,"[{'system': 'GDC', 'value': '1f2f3bf6-acdd-4a7...",TCGA-LGG,Gliomas,Brain,TCGA-RY-A843
...,...,...,...,...,...,...
95,d08d3049-ff5e-11e9-9a07-0a80fada099c,"[{'system': 'PDC', 'value': 'd08d3049-ff5e-11e...",Proteogenomic Analysis of Pediatric Brain Canc...,Pediatric/AYA Brain Tumors,Brain,C116850
96,d08d42ee-ff5e-11e9-9a07-0a80fada099c,"[{'system': 'PDC', 'value': 'd08d42ee-ff5e-11e...",Proteogenomic Analysis of Pediatric Brain Canc...,Pediatric/AYA Brain Tumors,Brain,C155103
97,d08d4986-ff5e-11e9-9a07-0a80fada099c,"[{'system': 'PDC', 'value': 'd08d4986-ff5e-11e...",Proteogenomic Analysis of Pediatric Brain Canc...,Pediatric/AYA Brain Tumors,Brain,C16974
98,d08d5e3c-ff5e-11e9-9a07-0a80fada099c,"[{'system': 'PDC', 'value': 'd08d5e3c-ff5e-11e...",Proteogenomic Analysis of Pediatric Brain Canc...,Pediatric/AYA Brain Tumors,Brain,C22140


Each row of the research_subject endpoint results describes one subject in one study. With this endpoint we can see which studies match our search criteria, and we can filter the results down to subjects who appear in several studies, or to subjects who appear in only one.
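
A minimal pandas sketch of that filtering step is shown below. It uses a small hand-made table with the same column names as the results above; the IDs are made up, and in practice you would start from the dataframe returned by `to_dataframe()`. Note that the result above reports `More pages: True`, so a single page may not contain every row for a given subject.

```python
import pandas as pd

# Toy rows shaped like the research_subject results; all IDs are made up.
research_subjects = pd.DataFrame(
    {
        "id": ["rs-1", "rs-2", "rs-3", "rs-4"],
        "member_of_research_project": [
            "TCGA-LGG",
            "CPTAC3-Discovery",
            "TCGA-LGG",
            "TCGA-GBM",
        ],
        "subject_id": ["subj-A", "subj-A", "subj-B", "subj-C"],
    }
)

# Count how many distinct studies each subject appears in.
studies_per_subject = research_subjects.groupby("subject_id")[
    "member_of_research_project"
].nunique()

# Split the rows into subjects seen in several studies and subjects seen in one.
multi_study_ids = studies_per_subject[studies_per_subject > 1].index
in_multiple = research_subjects["subject_id"].isin(multi_study_ids)
multi_study = research_subjects[in_multiple]
single_study = research_subjects[~in_multiple]

print(multi_study)
print(single_study)
```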

Any given subject will have one row per study they participated in. The subject_id in the last column of this view is the same as the `id` in the first column of the Subjects endpoint results. You can use this to combine information across endpoints, which is covered in the [Merging Results](../MergingResults.ipynb) notebook.
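
The Merging Results notebook covers this properly. As a quick sketch of the join itself, assuming you already have both result sets as dataframes, a left merge on `subject_id` and `id` could look like this. The rows are toy data with made-up IDs, and the suffix names are an arbitrary choice.

```python
import pandas as pd

# Toy subject rows; column names follow the Subject field definitions.
subjects = pd.DataFrame(
    {
        "id": ["subj-A", "subj-B"],
        "sex": ["female", "male"],
    }
)

# Toy research_subject rows; subj-A took part in two studies.
research_subjects = pd.DataFrame(
    {
        "id": ["rs-1", "rs-2", "rs-3"],
        "primary_diagnosis_condition": ["Gliomas", "Glioblastoma", "Gliomas"],
        "subject_id": ["subj-A", "subj-A", "subj-B"],
    }
)

# Both tables have an `id` column, so suffixes keep them apart after the merge.
merged = research_subjects.merge(
    subjects,
    left_on="subject_id",
    right_on="id",
    how="left",
    suffixes=("_researchsubject", "_subject"),
)

print(merged)
```

Each research_subject row keeps its own identifier and gains the subject-level columns, so a subject in two studies appears twice with the same `sex` value.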


---

<div style="background-color:#a2f2ed;color:black;padding:20px;">

<h3>ResearchSubject Field Definitions</h3>

<i>A researchsubject is a person/plant/animal/microbe within a given study. An individual who participates in 3 studies will have 3 researchsubject IDs</i>
    
<ul>
  <li><b>id:</b> The unique identifier for this researchsubject</li>
  <li><b>identifier:</b> An embedded array of information that includes the originating data center and the ID the researchsubject had there</li>
  <li><b>member_of_research_project:</b> The name of the study/project that the subject participated in</li>
  <li><b>primary_diagnosis_condition:</b> The cancer, disease or other condition under study</li>
  <li><b>primary_diagnosis_site:</b> The anatomical location of the cancer or other condition in the subject</li>
  <li><b>subject_id:</b> An identifier for the subject. Can be joined to the `id` field from subject results</li>
</ul>  

</div>
    
---
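
The `identifier` column holds an embedded array, which is awkward to filter on directly. One way to flatten it, sketched here with pandas on toy rows, is to explode the array and expand each entry into `system` and `value` columns. The identifier values are placeholders, since the real ones are truncated in the output above.

```python
import pandas as pd

# Toy rows; the `identifier` values are placeholders, not real CDA IDs.
results = pd.DataFrame(
    {
        "id": ["rs-1", "rs-2"],
        "identifier": [
            [{"system": "GDC", "value": "gdc-id-1"}],
            [
                {"system": "PDC", "value": "pdc-id-2"},
                {"system": "GDC", "value": "gdc-id-2"},
            ],
        ],
    }
)

# One row per (record, identifier entry) pair.
exploded = results.explode("identifier", ignore_index=True)

# Expand each dict into separate `system` and `value` columns.
id_parts = pd.json_normalize(exploded["identifier"].tolist())
flat = pd.concat([exploded.drop(columns="identifier"), id_parts], axis=1)

print(flat)
```

After flattening, a record with several identifiers occupies several rows, one per originating data center.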

### specimen

We can use this same query to see what specimens are available for brain tissue at the CDA:

In [7]:
specresults = myquery.specimen.run()
print(specresults)

Total execution time: 7254 ms

            QueryID: 12f7c80e-5c85-4bef-b2ca-59f0b5b887cc
            Query:SELECT results.* EXCEPT(rn) FROM (SELECT ROW_NUMBER() OVER (PARTITION BY _ResearchSubject_Specimen.id, all_Subjects_v3_0_w_RS.id, _ResearchSubject.id) as rn, _ResearchSubject_Specimen.id AS id, _ResearchSubject_Specimen.identifier AS identifier, _ResearchSubject_Specimen.associated_project AS associated_project, _ResearchSubject_Specimen.age_at_collection AS age_at_collection, _ResearchSubject_Specimen.primary_disease_type AS primary_disease_type, _ResearchSubject_Specimen.anatomical_site AS anatomical_site, _ResearchSubject_Specimen.source_material_type AS source_material_type, _ResearchSubject_Specimen.specimen_type AS specimen_type, _ResearchSubject_Specimen.derived_from_specimen AS derived_from_specimen, all_Subjects_v3_0_w_RS.id AS subject_id, _ResearchSubject.id AS researchsubject_id FROM gdc-bq-sample.dev.all_Subjects_v3_0_w_RS AS all_Subjects_v3_0_w_RS INNER JOIN UNNEST(al

Nearly 40,000 specimens meet our search criteria! We would typically expect this number to be much larger than our number of subjects or research_subjects, for two reasons: studies often take more than one sample per subject, and any given specimen might be aliquoted for use in multiple tests. Since we didn't specify any further filters, our results return all of these as separate specimens. Let's look at a few:

In [8]:
specresults.to_dataframe()

Unnamed: 0,id,identifier,associated_project,age_at_collection,primary_disease_type,anatomical_site,source_material_type,specimen_type,derived_from_specimen,subject_id,researchsubject_id
0,00d54cc2-96ec-4fab-9239-b6253b7dd637,"[{'system': 'GDC', 'value': '00d54cc2-96ec-4fa...",TCGA-GBM,-13116,Gliomas,,Primary Tumor,aliquot,2b9a6333-4802-4d89-9164-1a30754b243b,TCGA-02-0024,0553e60e-3510-417d-af8a-75947ebe8ab6
1,01be801d-c376-458a-ad95-15dcbe474113,"[{'system': 'GDC', 'value': '01be801d-c376-458...",TCGA-GBM,-24502,Gliomas,,Primary Tumor,slide,695d8016-14a7-5e09-b7e9-7466ce7685ae,TCGA-06-0127,f66d92ff-85ad-4c83-b127-ce34c8488040
2,02876b00-a460-4a00-af72-f227e199b73f,"[{'system': 'GDC', 'value': '02876b00-a460-4a0...",TCGA-LGG,-13371,Gliomas,,Primary Tumor,slide,5a178f84-37c6-419a-b9f3-9e71550a3f8e,TCGA-DB-5275,bbfb5399-8d43-4b75-bf90-23ec142697d7
3,028b33a2-4b77-4b68-baa3-e756946dd085,"[{'system': 'GDC', 'value': '028b33a2-4b77-4b6...",TCGA-LGG,-11509,Gliomas,,Primary Tumor,aliquot,e8b02631-1c0e-4509-a74f-a278078b187a,TCGA-CS-4938,334f715e-08dc-4a29-b8e4-b010b829c478
4,02f3955e-590c-53ac-af30-65e711fcfed4,"[{'system': 'GDC', 'value': '02f3955e-590c-53a...",TCGA-LGG,-10342,Gliomas,,Primary Tumor,portion,5545fec5-5655-4c59-aee5-f5fd4ef6e866,TCGA-S9-A7R3,b7395082-415b-4ab3-bf1e-59d0f4e4b488
...,...,...,...,...,...,...,...,...,...,...,...
95,3a71b861-5ede-4489-a161-8583f66e7e05,"[{'system': 'GDC', 'value': '3a71b861-5ede-448...",TCGA-GBM,-20449,Gliomas,,Blood Derived Normal,aliquot,f8a76522-1fae-4597-a907-4becabbf5744,TCGA-06-0645,878584ad-e6b6-493a-9f7f-3e284f5d9f68
96,3abd8a28-af49-5c9c-9ab3-441cb5fb1432,"[{'system': 'GDC', 'value': '3abd8a28-af49-5c9...",CPTAC-3,-12827,Gliomas,,Primary Tumor,analyte,8a69e556-9c23-5b00-9a7e-ce8cb9574be2,C3N-03184,f9a63d56-94cb-4f15-ac35-7c59a08f4104
97,3b695a86-1603-4a6a-81d7-4b91f219b2a4,"[{'system': 'GDC', 'value': '3b695a86-1603-4a6...",TCGA-LGG,-10170,Gliomas,,Primary Tumor,aliquot,cc17a5fd-ad5a-4b7c-80de-2d35dfc65348,TCGA-DU-A6S7,a56c2083-33d5-4c98-b85f-3b68a3d80add
98,3c1d256c-58d0-525e-b157-aa3063a09bb1,"[{'system': 'GDC', 'value': '3c1d256c-58d0-525...",HCMI-CMDC,-21192,Gliomas,,Expanded Next Generation Cancer Model,portion,fdbc5c4f-c278-4f88-9015-cf06af5db153,HCM-BROD-0029-C71,b5160217-33c3-47cd-8540-44ca283c8464



---

<div style="background-color:#a2f2ed;color:black;padding:20px;">

<h3>Specimen Field Definitions</h3>

<i>A specimen is a tissue sample taken from a given subject, or a portion of the original sample. A given specimen will have only a single subject ID and a single research subject ID</i>

    
<ul>
  <li><b>id:</b> The unique identifier for this specimen</li>
  <li><b>identifier:</b> An embedded array of information that includes the originating data center and the ID the specimen had there</li>
  <li><b>associated_project:</b> The name of the study/project that the subject participated in</li>
  <li><b>age_at_collection:</b> The subject's age at collection of the proximate specimen</li>
  <li><b>primary_disease_type:</b> The cancer, disease or other condition under study</li>
  <li><b>anatomical_site:</b> The body part from which the proximate specimen was taken</li>
  <li><b>source_material_type:</b> What type of tissue the specimen consists of</li>
  <li><b>specimen_type:</b> One of: analyte, aliquot, portion, sample, or slide</li>
  <li><b>derived_from_specimen:</b> For derived samples, the `id` for the original sample</li>
  <li><b>subject_id:</b> An identifier for the subject. Can be joined to the `id` field from subject results</li>
  <li><b>researchsubject_id:</b> An identifier for the research subject. Can be joined to the `id` field from researchsubject results</li>
</ul>  

</div>
    
---
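
Because one subject can contribute many specimens of different types, it is often useful to tabulate specimen results by subject before choosing samples. The sketch below does this with pandas on toy rows that reuse the column names from the specimen results; the IDs are made up.

```python
import pandas as pd

# Toy rows shaped like the specimen results; all IDs are made up.
specimens = pd.DataFrame(
    {
        "id": ["sp-1", "sp-2", "sp-3", "sp-4", "sp-5"],
        "specimen_type": ["sample", "portion", "aliquot", "slide", "sample"],
        "subject_id": ["subj-A", "subj-A", "subj-A", "subj-A", "subj-B"],
    }
)

# Total number of specimens per subject.
per_subject = specimens.groupby("subject_id")["id"].count()

# Subjects as rows, specimen types as columns, counts in the cells.
type_table = pd.crosstab(specimens["subject_id"], specimens["specimen_type"])

print(per_subject)
print(type_table)
```

A table like this makes it easy to spot subjects that have, say, slides but no aliquots.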


### file

The file endpoint returns information about files that meet our search criteria, regardless of whether they are attached to subjects, research_subjects or specimens:

In [9]:
myquery.file.run()

Total execution time: 23477 ms



            QueryID: bc670802-8b41-40f4-9bbf-022ac4b90f5a
            Query:with ResearchSubject_Specimen_files as (SELECT results.* EXCEPT(rn) FROM (SELECT ROW_NUMBER() OVER (PARTITION BY all_Files_v3_0_w_RS.id, _ResearchSubject.id, _ResearchSubject_Specimen.id, all_Subjects_v3_0_w_RS.id) as rn, all_Files_v3_0_w_RS.id AS id, all_Files_v3_0_w_RS.identifier AS identifier, all_Files_v3_0_w_RS.label AS label, all_Files_v3_0_w_RS.data_category AS data_category, all_Files_v3_0_w_RS.data_type AS data_type, all_Files_v3_0_w_RS.file_format AS file_format, all_Files_v3_0_w_RS.associated_project AS associated_project, all_Files_v3_0_w_RS.drs_uri AS drs_uri, all_Files_v3_0_w_RS.byte_size AS byte_size, all_Files_v3_0_w_RS.checksum AS checksum, all_Files_v3_0_w_RS.data_modality AS data_modality, all_Files_v3_0_w_RS.imaging_modality AS imaging_modality, all_Files_v3_0_w_RS.dbgap_accession_number AS dbgap_accession_number, _ResearchSubject_Specimen.id AS researchsubject_specimen_id, _ResearchSubject

In [10]:
fileresults = myquery.file.run()
fileresults.to_dataframe()

Total execution time: 3515 ms


Unnamed: 0,id,identifier,label,data_category,data_type,file_format,associated_project,drs_uri,byte_size,checksum,data_modality,imaging_modality,dbgap_accession_number,researchsubject_specimen_id,researchsubject_id,subject_id
0,4ffcd65e-53a9-460f-8bd4-9b2360cbe85f,"[{'system': 'IDC', 'value': '4ffcd65e-53a9-460...",idc/4ffcd65e-53a9-460f-8bd4-9b2360cbe85f.dcm,Imaging,,DICOM,ivygap,drs://dg.4DFC:4ffcd65e-53a9-460f-8bd4-9b2360cb...,,,Imaging,MR,,,W5__ivygap,W5
1,55622e8e-866e-4c70-a7bf-b70150142acc,"[{'system': 'IDC', 'value': '55622e8e-866e-4c7...",idc/55622e8e-866e-4c70-a7bf-b70150142acc.dcm,Imaging,,DICOM,qin_gbm_treatment_response,drs://dg.4DFC:55622e8e-866e-4c70-a7bf-b7015014...,,,Imaging,MR,,,QIN-GBM-TR-04__qin_gbm_treatment_response,QIN-GBM-TR-04
2,5a878629-133b-4fe6-8a2c-0968933a3769,"[{'system': 'IDC', 'value': '5a878629-133b-4fe...",idc/5a878629-133b-4fe6-8a2c-0968933a3769.dcm,Imaging,,DICOM,tcga_lgg,drs://dg.4DFC:5a878629-133b-4fe6-8a2c-0968933a...,,,Imaging,MR,,,TCGA-DU-8166__tcga_lgg,TCGA-DU-8166
3,754f5cd3-6ffe-4de7-9302-4a2bee6d4f2c,"[{'system': 'IDC', 'value': '754f5cd3-6ffe-4de...",idc/754f5cd3-6ffe-4de7-9302-4a2bee6d4f2c.dcm,Imaging,,DICOM,acrin_fmiso_brain,drs://dg.4DFC:754f5cd3-6ffe-4de7-9302-4a2bee6d...,,,Imaging,MR,,,ACRIN-FMISO-Brain-031__acrin_fmiso_brain,ACRIN-FMISO-Brain-031
4,7c8bab39-7554-4ac4-8161-f431cd449815,"[{'system': 'IDC', 'value': '7c8bab39-7554-4ac...",idc/7c8bab39-7554-4ac4-8161-f431cd449815.dcm,Imaging,,DICOM,qin_gbm_treatment_response,drs://dg.4DFC:7c8bab39-7554-4ac4-8161-f431cd44...,,,Imaging,MR,,,QIN-GBM-TR-41__qin_gbm_treatment_response,QIN-GBM-TR-41
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,0dabdcbc-628c-4cbe-b3d6-08537f9571b4,"[{'system': 'IDC', 'value': '0dabdcbc-628c-4cb...",idc/0dabdcbc-628c-4cbe-b3d6-08537f9571b4.dcm,Imaging,,DICOM,qin_gbm_treatment_response,drs://dg.4DFC:0dabdcbc-628c-4cbe-b3d6-08537f95...,,,Imaging,MR,,,QIN-GBM-TR-49__qin_gbm_treatment_response,QIN-GBM-TR-49
96,2f974394-0f23-49ce-8d5a-dd02819609fb,"[{'system': 'IDC', 'value': '2f974394-0f23-49c...",idc/2f974394-0f23-49ce-8d5a-dd02819609fb.dcm,Imaging,,DICOM,acrin_dsc_mr_brain,drs://dg.4DFC:2f974394-0f23-49ce-8d5a-dd028196...,,,Imaging,MR,,,ACRIN-DSC-MR-Brain-042__acrin_dsc_mr_brain,ACRIN-DSC-MR-Brain-042
97,436b1790-3c96-451f-8adc-281dfb1bd10c,"[{'system': 'IDC', 'value': '436b1790-3c96-451...",idc/436b1790-3c96-451f-8adc-281dfb1bd10c.dcm,Imaging,,DICOM,qin_gbm_treatment_response,drs://dg.4DFC:436b1790-3c96-451f-8adc-281dfb1b...,,,Imaging,MR,,,QIN-GBM-TR-42__qin_gbm_treatment_response,QIN-GBM-TR-42
98,063157d2-7b14-4a94-b526-a64338ad27cb,"[{'system': 'IDC', 'value': '063157d2-7b14-4a9...",idc/063157d2-7b14-4a94-b526-a64338ad27cb.dcm,Imaging,,DICOM,ivygap,drs://dg.4DFC:063157d2-7b14-4a94-b526-a64338ad...,,,Imaging,MR,,,W4__ivygap,W4


As you might expect, searching file gives us a huge number of results. This is great if you are surveying what kind of data is available, but is less useful for getting a coherent cohort. 
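
For that kind of survey, a grouped summary is usually easier to read than the raw rows. The sketch below groups toy file rows by `data_modality` and `file_format` with pandas. The column names match the file results above, but the rows themselves, including the non-DICOM formats and the byte sizes, are invented for illustration.

```python
import pandas as pd

# Toy rows shaped like the file results; values are invented for illustration.
files = pd.DataFrame(
    {
        "id": ["f-1", "f-2", "f-3", "f-4"],
        "file_format": ["DICOM", "DICOM", "BAM", "TSV"],
        "data_modality": ["Imaging", "Imaging", "Genomic", "Proteomic"],
        # Some rows in the real output have no byte_size, so include gaps here too.
        "byte_size": [float("nan"), float("nan"), 2_000_000.0, 50_000.0],
    }
)

# Number of files and total known size for each modality/format pair.
summary = (
    files.groupby(["data_modality", "file_format"])
    .agg(n_files=("id", "count"), total_bytes=("byte_size", "sum"))
    .reset_index()
)

print(summary)
```

Missing sizes are skipped by the sum, so `total_bytes` is a lower bound wherever `byte_size` is not populated.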

A better way to get files for a specific cohort is to chain your queries together. We cover this in the next tutorial, [Chaining Queries](../ChainingQueries), which shows how to combine information from multiple endpoints and build And/Or/Like and other advanced query strings.



---

<div style="background-color:#a2f2ed;color:black;padding:20px;">

<h3>File Field Definitions</h3>

<i>A unit of data about subjects, research_subjects, specimens, or their associated information</i>

    
<ul>
  <li><b>id:</b> The unique identifier for this file</li>
  <li><b>identifier:</b> An embedded array of information that includes the originating data center and the ID the file had there</li>
  <li><b>label:</b> The full name of the file</li>
  <li><b>data_category:</b> A description of the general kind of data the file holds</li>
  <li><b>data_type:</b> A more specific description of the data type</li>
  <li><b>file_format:</b> The extension of the file</li>
  <li><b>associated_project:</b> The name the data center uses for the study this file was generated for</li>
  <li><b>drs_uri:</b> A unique identifier that can be used to retrieve this specific file from a server</li>
  <li><b>byte_size:</b> Size of the file in bytes</li>
  <li><b>checksum:</b> The md5 value for the file</li>
  <li><b>data_modality:</b> A high level descriptor of file data, always one of "Genomic", "Proteomic", or "Imaging"</li>
  <li><b>imaging_modality:</b> For files with the `data_modality` of "Imaging", a descriptor for the image type</li>
  <li><b>dbgap_accession_number:</b></li>
</ul>  

</div>
    
---