# Cohort Building

**Example use case:** 

<img src="https://github.com/CancerDataAggregator/readthedocs/blob/main/docs/Examples/images/Julia.png?raw=true" alt="alt_text" align="left"
	width="150" height="150" />
Julia is an oncologist who specializes in female reproductive health. As part of her research, she is interested in using existing data on uterine cancers. If possible, she would like to see multiple datatypes (gross imaging, genomic data, proteomic data, histology) that come from the same patient, so she can look for shared phenotypes to test for their potential as early diagnostics. Julia heard that the Cancer Data Aggregator has made it easy to search across multiple datasets created by NCI, and so has decided to start her search there.



## Getting Started

The CDA provides a custom Python tool for searching CDA data. [`Q`](usage/#q) (short for Query) offers several ways to search and filter data, and several input modes:

---
- **<a href="../../QuickStart/usage/#q">Q.()</a>** builds a query that can be used by `run()` or `count()`
- **<a href="../../QuickStart/usage/#qrun">Q.run()</a>** returns data for the specified search 
- **<a href="../../QuickStart/usage/#qcount">Q.count()</a>** returns summary information (counts) for data that fit the specified search
- **<a href="../../QuickStart/usage/#columns">columns()</a>** returns entity field names
- **<a href="../../QuickStart/usage/#unique_terms">unique_terms()</a>** returns entity field contents

---

Before Julia does any work, she needs to import these functions from cdapython.
She'll also need to import [pandas](https://pandas.pydata.org/) to get nice dataframes.
Finally, she tells cdapython to report its version so she can be sure she's using the one she means to:

In [1]:
from cdapython import Q, columns, unique_terms, query
import cdapython
import pandas as pd 
print(cdapython.__version__)

2022.6.28


In [2]:
Q.set_default_project_dataset("broad-dsde-dev.cda_dev")
Q.set_host_url("https://cancerdata.dsde-dev.broadinstitute.org/")
Q.get_host_url()

'https://cancerdata.dsde-dev.broadinstitute.org/'

In [3]:

Q.get_default_project_dataset()

'broad-dsde-dev.cda_dev'

<div class="cdanote" style="background-color:#b3e5d5;color:black;padding:20px;">
    
    
CDA data comes from three sources:
<ul>
<li><b>The <a href="https://proteomic.datacommons.cancer.gov/pdc/"> Proteomic Data Commons</a> (PDC)</b></li>
<li><b>The <a href="https://gdc.cancer.gov/">Genomic Data Commons</a> (GDC)</b></li>
<li><b>The <a href="https://datacommons.cancer.gov/repository/imaging-data-commons">Imaging Data Commons</a> (IDC)</b></li>
</ul> 
    
The CDA makes this data searchable in four main endpoints:

<ul>
<li><b>subject:</b> A patient entity captures the study-independent metadata for research subjects. Human research subjects are usually not traceable to a particular person to protect the subject's privacy.</li>
<li><b>researchsubject:</b> A research subject is the entity of interest in a specific research study or project, typically a human being or an animal, but can also be a device, group of humans or animals, or a tissue sample. Human research subjects are usually not traceable to a particular person to protect the subject's privacy. This entity plays the role of the case_id in existing data. An individual who participates in 3 studies will have 3 researchsubject IDs.</li>
<li><b>specimen:</b> Any material taken as a sample from a biological entity (living or dead), or from a physical object or the environment. Specimens are usually collected as an example of their kind, often for use in some investigation.</li>
<li><b>file:</b> A unit of data about subjects, researchsubjects, specimens, or their associated information</li>
</ul>
and two endpoints that offer deeper information about data in the researchsubject endpoint:
<ul>
<li><b>diagnosis:</b> A collection of characteristics that describe an abnormal condition of the body as assessed at a point in time. May be used to capture information about neoplastic and non-neoplastic conditions.</li>
<li><b>treatment:</b> Represent medication administration or other treatment types.</li>
</ul>
Any metadata field can be searched from any endpoint; the only difference between search types is what type of data is returned by default. This means that you can think of the CDA as a really, really enormous spreadsheet full of data. To search this enormous spreadsheet, you'd want to select columns, and then filter rows.
</div>
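
The "enormous spreadsheet" picture can be made concrete with a few lines of plain Python. This is only a sketch of the idea, not cdapython code: the rows are invented and the helper `search` is hypothetical. Only the field names are taken from the CDA column list.

```python
# Illustrative stand-in for the "one big spreadsheet" model (not cdapython).
# Each dict is one row; the values are invented for this example.
rows = [
    {"id": "S1", "sex": "female", "ResearchSubject.primary_diagnosis_site": "Uterus"},
    {"id": "S2", "sex": "male", "ResearchSubject.primary_diagnosis_site": "Lung"},
    {"id": "S3", "sex": "female", "ResearchSubject.primary_diagnosis_site": "Cervix"},
]


def search(rows, keep_row, columns):
    """Filter rows with a predicate, then keep only the requested columns."""
    return [{col: row[col] for col in columns} for row in rows if keep_row(row)]


result = search(
    rows,
    keep_row=lambda row: row["ResearchSubject.primary_diagnosis_site"] in {"Uterus", "Cervix"},
    columns=["id", "sex"],
)
print(result)
```

The predicate plays the role of the row filter, and the `columns` list plays the role of the endpoint, which decides which fields come back.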


## Finding Search Terms

<img src="https://github.com/CancerDataAggregator/readthedocs/blob/main/docs/Examples/images/Julia.png?raw=true" align="left"
	width="50" height="50" />
   
   
   Accordingly, to see what search fields are available, Julia starts by using the command `columns`:

In [4]:
columns().to_list()

['File.id',
 'File.identifier.system',
 'File.identifier.value',
 'File.label',
 'File.data_category',
 'File.data_type',
 'File.file_format',
 'File.associated_project',
 'File.drs_uri',
 'File.byte_size',
 'File.checksum',
 'File.data_modality',
 'File.imaging_modality',
 'File.dbgap_accession_number',
 'File.imaging_series',
 'id',
 'identifier.system',
 'identifier.value',
 'species',
 'sex',
 'race',
 'ethnicity',
 'days_to_birth',
 'subject_associated_project',
 'vital_status',
 'days_to_death',
 'cause_of_death',
 'ResearchSubject.id',
 'ResearchSubject.identifier.system',
 'ResearchSubject.identifier.value',
 'ResearchSubject.member_of_research_project',
 'ResearchSubject.primary_diagnosis_condition',
 'ResearchSubject.primary_diagnosis_site',
 'ResearchSubject.Diagnosis.id',
 'ResearchSubject.Diagnosis.identifier.system',
 'ResearchSubject.Diagnosis.identifier.value',
 'ResearchSubject.Diagnosis.primary_diagnosis',
 'ResearchSubject.Diagnosis.age_at_diagnosis',
 'ResearchSubje

<img src="https://github.com/CancerDataAggregator/readthedocs/blob/main/docs/Examples/images/Julia.png?raw=true" align="left"
	width="50" height="50" />
   
   
There are a lot of columns in the CDA data, but Julia is most interested in diagnosis data, so she filters the list to only those:

In [5]:
columns().to_list(filters="diagnosis")

['ResearchSubject.primary_diagnosis_condition',
 'ResearchSubject.primary_diagnosis_site',
 'ResearchSubject.Diagnosis.id',
 'ResearchSubject.Diagnosis.identifier.system',
 'ResearchSubject.Diagnosis.identifier.value',
 'ResearchSubject.Diagnosis.primary_diagnosis',
 'ResearchSubject.Diagnosis.age_at_diagnosis',
 'ResearchSubject.Diagnosis.morphology',
 'ResearchSubject.Diagnosis.stage',
 'ResearchSubject.Diagnosis.grade',
 'ResearchSubject.Diagnosis.method_of_diagnosis',
 'ResearchSubject.Diagnosis.Treatment.id',
 'ResearchSubject.Diagnosis.Treatment.identifier.system',
 'ResearchSubject.Diagnosis.Treatment.identifier.value',
 'ResearchSubject.Diagnosis.Treatment.treatment_type',
 'ResearchSubject.Diagnosis.Treatment.treatment_outcome',
 'ResearchSubject.Diagnosis.Treatment.days_to_treatment_start',
 'ResearchSubject.Diagnosis.Treatment.days_to_treatment_end',
 'ResearchSubject.Diagnosis.Treatment.therapeutic_agent',
 'ResearchSubject.Diagnosis.Treatment.treatment_anatomic_site',
 'Re

<div class="cdanote" style="background-color:#b3e5d5;color:black;padding:20px;">

To search the CDA, a user also needs to know what search terms are available. Each column will contain a huge amount of data, so retrieving all of the rows would be overwhelming. Instead, the CDA has a `unique_terms()` function that will return all of the unique values that populate the requested column. Like `columns`, `unique_terms` defaults to giving us an overview of the results, and can be filtered.
    
</div>
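
Conceptually, listing unique terms means collapsing a column to its distinct values and optionally keeping those that contain a substring. The sketch below illustrates that idea in plain Python. The function `distinct_terms` is hypothetical and is not how cdapython implements `unique_terms`. The case-insensitive comparison is an assumption, suggested by the way `filters="diagnosis"` matched 'ResearchSubject.Diagnosis.id' earlier.

```python
# Sketch of the idea behind unique_terms(...).to_list(filters=...); not the real implementation.
def distinct_terms(values, filters=None):
    """Return sorted distinct non-empty values, optionally keeping only those
    that contain `filters` (compared case-insensitively, by assumption)."""
    terms = sorted({v for v in values if v})
    if filters is not None:
        needle = filters.lower()
        terms = [t for t in terms if needle in t.lower()]
    return terms


# A made-up column with repeats and a missing value.
site_column = ["Cervix", "Brain", "Cervix", None, "Pelvis", "Brain", "Spine"]

print(distinct_terms(site_column))
print(distinct_terms(site_column, filters="CERV"))
```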

<img src="https://github.com/CancerDataAggregator/readthedocs/blob/main/docs/Examples/images/Julia.png?raw=true" align="left"
	width="50" height="50" />
   
Since Julia is interested specifically in uterine cancers, she uses the `unique_terms` function to check whether 'uterine' appears among the values of 'ResearchSubject.Diagnosis.Treatment.treatment_anatomic_site' and 'ResearchSubject.primary_diagnosis_site':

In [7]:
unique_terms("ResearchSubject.Diagnosis.Treatment.treatment_anatomic_site").to_list()

['Brain',
 'Cervix',
 'Head - Face Or Neck, Nos',
 'Lymph Node(s) Paraaortic',
 'Other',
 'Pelvis',
 'Spine',
 'Unknown']

In [8]:
unique_terms("ResearchSubject.primary_diagnosis_site").to_list()

['Abdomen',
 'Abdomen, Mediastinum',
 'Abdomen, Pelvis',
 'Adrenal Glands',
 'Adrenal gland',
 'Anus and anal canal',
 'Base of tongue',
 'Bile Duct',
 'Bladder',
 'Bones, joints and articular cartilage of limbs',
 'Bones, joints and articular cartilage of other and unspecified sites',
 'Brain',
 'Breast',
 'Bronchus and lung',
 'Cervix',
 'Cervix uteri',
 'Chest',
 'Chest-Abdomen-Pelvis, Leg, TSpine',
 'Colon',
 'Connective, subcutaneous and other soft tissues',
 'Corpus uteri',
 'Ear',
 'Esophagus',
 'Extremities',
 'Eye and adnexa',
 'Floor of mouth',
 'Gallbladder',
 'Gum',
 'Head',
 'Head and Neck',
 'Head-Neck',
 'Heart, mediastinum, and pleura',
 'Hematopoietic and reticuloendothelial systems',
 'Hypopharynx',
 'Intraocular',
 'Kidney',
 'Larynx',
 'Lip',
 'Liver',
 'Liver and intrahepatic bile ducts',
 'Lung',
 'Lung Phantom',
 'Lymph nodes',
 'Marrow, Blood',
 'Meninges',
 'Mesothelium',
 'Nasal cavity and middle ear',
 'Nasopharynx',
 'Not Reported',
 'Oropharynx',
 'Other an

<div class="cdanote" style="background-color:#b3e5d5;color:black;padding:20px;">
    
CDA makes multiple datasets searchable from a common interface, but does not harmonize the data. This means that researchers should review all the terms in a column, and not just choose the first one that fits, as there may be other similar terms available as well.
    
</div>
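
One way to spot near-duplicate terms in an unharmonized column is to compare every pair of values for string similarity. The sketch below uses Python's standard `difflib` module on a handful of values copied from the `primary_diagnosis_site` output above. The helper `similar_pairs` and the 0.9 cutoff are arbitrary choices for illustration, not part of cdapython.

```python
from difflib import SequenceMatcher
from itertools import combinations

# A few values copied from the primary_diagnosis_site output above.
terms = ["Adrenal Glands", "Adrenal gland", "Brain", "Breast", "Cervix", "Cervix uteri"]


def similar_pairs(terms, cutoff=0.9):
    """Return pairs of terms whose lowercase similarity ratio is >= cutoff."""
    pairs = []
    for a, b in combinations(terms, 2):
        ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if ratio >= cutoff:
            pairs.append((a, b))
    return pairs


pairs = similar_pairs(terms)
print(pairs)
```

A high cutoff catches spelling and capitalization variants such as 'Adrenal Glands' and 'Adrenal gland', but it will not pair 'Cervix' with 'Cervix uteri'. Substring filters remain the better tool for that kind of overlap.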

<img src="https://github.com/CancerDataAggregator/readthedocs/blob/main/docs/Examples/images/Julia.png?raw=true" align="left"
	width="50" height="50" />
   
Julia sees that "treatment_anatomic_site" does not have 'Uterine', but does have 'Cervix'. She also notes that both 'Uterus' and 'Uterus, NOS' are listed in the "primary_diagnosis_site" results. As she was initially looking for "uterine", Julia decides to expand her search a bit to account for variable naming schemes. So, she runs a fuzzy match filter on the "ResearchSubject.primary_diagnosis_site" for 'uter' as that should cover all variants:

In [5]:
unique_terms("ResearchSubject.primary_diagnosis_site").to_list(filters="uter")

['Cervix uteri', 'Corpus uteri', 'Uterus', 'Uterus, NOS']

<img src="https://github.com/CancerDataAggregator/readthedocs/blob/main/docs/Examples/images/Julia.png?raw=true" align="left"
	width="50" height="50" />
   
Just to be sure, Julia also searches for any other instances of "cervix":

In [6]:
unique_terms("ResearchSubject.primary_diagnosis_site").to_list(filters="cerv")

['Cervix', 'Cervix uteri']

## Building a Query

<img src="https://github.com/CancerDataAggregator/readthedocs/blob/main/docs/Examples/images/Julia.png?raw=true" align="left"
	width="50" height="50" />
   
With all her likely terms found, Julia begins to create a search that will get data for all of her terms. She does this by writing a series of `Q` statements that define what rows should be returned from each column. For the "treatment_anatomic_site", only one term is of interest, so she uses the `=` operator to get only exact matches:

In [7]:
Tsite = Q('ResearchSubject.Diagnosis.Treatment.treatment_anatomic_site = "Cervix"')

<img src="https://github.com/CancerDataAggregator/readthedocs/blob/main/docs/Examples/images/Julia.png?raw=true" align="left"
	width="50" height="50" />
   
However, for "primary_diagnosis_site", Julia has several terms she wants to search with. Luckily, `Q` can also run fuzzy searches, using `%` as a wildcard, and it can search more than one term at a time. So Julia writes one big `Q` statement to grab everything that contains either 'uter' or 'cerv':

In [8]:
Dsite = Q('ResearchSubject.primary_diagnosis_site = "%uter%" OR ResearchSubject.primary_diagnosis_site = "%cerv%"')
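
To see which values a `%`-style pattern would select, the wildcard can be emulated locally with a regular expression. This is a sketch under assumptions: `like` is a hypothetical helper, `%` is treated as "any run of characters", and matching is case-insensitive. The case-insensitivity is inferred from the counts later on this page, where 'Uterus' is among the matched sites.

```python
import re


def like(value, pattern):
    """Case-insensitive match of `value` against a pattern in which % stands
    for any run of characters (an assumption about the wildcard's behavior)."""
    regex = ".*".join(re.escape(part) for part in pattern.split("%"))
    return re.fullmatch(regex, value, flags=re.IGNORECASE | re.DOTALL) is not None


sites = ["Cervix", "Cervix uteri", "Corpus uteri", "Kidney", "Uterus", "Uterus, NOS"]
matches = [s for s in sites if like(s, "%uter%") or like(s, "%cerv%")]
print(matches)
```

Running a pattern against the output of `unique_terms` in this way is a cheap check that a fuzzy term is neither too broad nor too narrow before it goes into a query.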

<img src="https://github.com/CancerDataAggregator/readthedocs/blob/main/docs/Examples/images/Julia.png?raw=true" align="left"
	width="50" height="50" />
   
Finally, Julia adds her two queries together into one large one:

In [9]:
ALLDATA = Tsite.OR(Dsite)
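
The effect of combining two filters with `OR` versus `AND` can be illustrated with a toy class. The `Filter` class below is hypothetical and only borrows the method names used above. It is not how `Q` is implemented, and the rows are invented.

```python
class Filter:
    """Toy row filter that composes like the queries above (not cdapython's Q)."""

    def __init__(self, test):
        self.test = test  # callable: row dict -> bool

    def OR(self, other):
        # Keep a row if either filter accepts it.
        return Filter(lambda row: self.test(row) or other.test(row))

    def AND(self, other):
        # Keep a row only if both filters accept it.
        return Filter(lambda row: self.test(row) and other.test(row))

    def apply(self, rows):
        return [row for row in rows if self.test(row)]


rows = [
    {"treatment_site": "Cervix", "diagnosis_site": "Cervix uteri"},
    {"treatment_site": "Pelvis", "diagnosis_site": "Uterus"},
    {"treatment_site": "Brain", "diagnosis_site": "Brain"},
]

t_site = Filter(lambda r: r["treatment_site"] == "Cervix")
d_site = Filter(lambda r: "uter" in r["diagnosis_site"].lower())

either = t_site.OR(d_site)
both = t_site.AND(d_site)
print(len(either.apply(rows)), len(both.apply(rows)))
```

`OR` widens a cohort and `AND` narrows it, which is why Julia uses `OR` to gather every candidate term into one query.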

## Looking at Summary Data

<img src="https://github.com/CancerDataAggregator/readthedocs/blob/main/docs/Examples/images/Julia.png?raw=true" align="left"
	width="50" height="50" />
   
Now that Julia has a query, she can use it to look for data in any of the CDA endpoints. She starts by getting an overall summary of what data is available using `count`:

In [10]:
ALLDATA.count.run()

Total execution time: 4849 ms




<img src="https://github.com/CancerDataAggregator/readthedocs/blob/main/docs/Examples/images/Julia.png?raw=true" align="left"
	width="50" height="50" />
   
It seems there's a lot of data that might work for Julia's study! Since she is interested in the beginnings of cancer, she decides to start by looking at the researchsubject information, since that is where most of the diagnosis information is. She again gets a summary using `count`:

In [11]:
ALLDATA.researchsubject.count.run()

Total execution time: 10077 ms


| system | count |
| --- | --- |
| PDC | 104 |
| GDC | 3591 |
| IDC | 1174 |

| primary_diagnosis_condition | count |
| --- | --- |
| Adenomas and Adenocarcinomas | 1672 |
| Uterine Corpus Endometrial Carcinoma | 104 |
| Myomatous Neoplasms | 188 |
| Squamous Cell Neoplasms | 609 |
| Cystic, Mucinous and Serous Neoplasms | 487 |
| Complex Mixed and Stromal Neoplasms | 320 |
| Not Reported | 12 |
|  | 1175 |
| Complex Epithelial Neoplasms | 27 |
| Epithelial Neoplasms, NOS | 230 |

| primary_diagnosis_site | count |
| --- | --- |
| Cervix uteri | 915 |
| Uterus, NOS | 2000 |
| Corpus uteri | 780 |
| Uterus | 867 |
| Cervix | 307 |




## Refining Queries

<img src="https://github.com/CancerDataAggregator/readthedocs/blob/main/docs/Examples/images/Julia.png?raw=true" align="left"
	width="50" height="50" />
   
Browsing the primary_diagnosis_condition data, Julia notices that a large number of research subjects have the condition "Adenomas and Adenocarcinomas". Since Julia wants to look for common phenotypes in early cancers, she decides it might be easier to exclude this endocrine-related data, as it might involve different mechanisms. So she adds a new filter to her query:

In [10]:
Noadeno = Q('ResearchSubject.primary_diagnosis_condition != "Adenomas and Adenocarcinomas"')

NoAdenoData = ALLDATA.AND(Noadeno)

NoAdenoData.researchsubject.count.run()


<img src="https://github.com/CancerDataAggregator/readthedocs/blob/main/docs/Examples/images/Julia.png?raw=true" align="left"
	width="50" height="50" />
   
She then previews the actual metadata for researchsubject, subject, and file, to make sure that they have all the information she will need for her work. Since she's mostly interested in the kinds of data available from each endpoint, she views each result as a dataframe:

In [13]:
NoAdenoData.researchsubject.run().to_dataframe() # view the dataframe

Total execution time: 4063 ms


Unnamed: 0,id,identifier,member_of_research_project,primary_diagnosis_condition,primary_diagnosis_site,subject_id
0,0609d8ac-fc18-4e07-b5fb-12b372151dcf,"[{'system': 'GDC', 'value': '0609d8ac-fc18-4e0...",GENIE-DFCI,"Soft Tissue Tumors and Sarcomas, NOS","Uterus, NOS",GENIE-DFCI-011667
1,1d2937e8-6104-4330-b7dc-7bcd79dac927,"[{'system': 'GDC', 'value': '1d2937e8-6104-433...",TCGA-UCS,Complex Mixed and Stromal Neoplasms,"Uterus, NOS",TCGA-N5-A4RN
2,2639fc46-2bf7-4050-b5dd-c4d8beb89d2e,"[{'system': 'GDC', 'value': '2639fc46-2bf7-405...",GENIE-MSK,Myomatous Neoplasms,"Uterus, NOS",GENIE-MSK-P-0003476
3,30a4fa5d-98f4-4592-868f-2beb443eb82d,"[{'system': 'GDC', 'value': '30a4fa5d-98f4-459...",FM-AD,"Cystic, Mucinous and Serous Neoplasms","Uterus, NOS",AD13447
4,30ffb337-c844-493f-97d6-7aea157d1b5e,"[{'system': 'GDC', 'value': '30ffb337-c844-493...",CGCI-HTMCP-CC,Squamous Cell Neoplasms,Cervix uteri,HTMCP-03-06-02240
...,...,...,...,...,...,...
95,TCGA-N8-A4PN__tcga_ucs,"[{'system': 'IDC', 'value': 'TCGA-N8-A4PN__tcg...",tcga_ucs,,Uterus,TCGA-N8-A4PN
96,TCGA-ND-A4WA__tcga_ucs,"[{'system': 'IDC', 'value': 'TCGA-ND-A4WA__tcg...",tcga_ucs,,Uterus,TCGA-ND-A4WA
97,a6a07bc7-ab1b-429d-81ed-d7a890c88e51,"[{'system': 'GDC', 'value': 'a6a07bc7-ab1b-429...",TCGA-CESC,"Cystic, Mucinous and Serous Neoplasms",Cervix uteri,TCGA-VS-A9V1
98,a7f3f44f-5d75-4572-80e6-279f186e44a3,"[{'system': 'GDC', 'value': 'a7f3f44f-5d75-457...",GENIE-MSK,"Epithelial Neoplasms, NOS","Uterus, NOS",GENIE-MSK-P-0018781


---

<div class="cdadefine" style="background-color:#add9e5;color:black;padding:20px;">

<h3>ResearchSubject Field Definitions</h3>

<i>A research subject is the entity of interest in a specific research study or project, typically a human being or an animal, but can also be a device, group of humans or animals, or a tissue sample. Human research subjects are usually not traceable to a particular person to protect the subject's privacy. This entity plays the role of the case_id in existing data. An individual who participates in 3 studies will have 3 researchsubject IDs.</i>
    
<ul>
<li><b>id:</b> The 'logical' identifier of the entity in the system of record, e.g. a UUID. This 'id' is unique within a given system. The identified entity may have a different 'id' in a different system. For CDA, this is case_id.</li>
<li><b>identifier:</b> A 'business' identifier for the entity, typically as provided by an external system or authority, that persists across implementing systems (i.e. a 'logical' identifier). Uses a specialized, complex 'Identifier' data type to capture information about the source of the business identifier - or a URI expressed as a string to an existing entity.</li>
<li><b>identifier.system:</b> The system or namespace that defines the identifier.</li>
<li><b>identifier.value:</b> The value of the identifier, as defined by the system.</li>
<li><b>member_of_research_project:</b> A reference to the Study(s) of which this ResearchSubject is a member.</li>
<li><b>primary_diagnosis_condition:</b> The text term used to describe the type of malignant disease, as categorized by the World Health Organization's (WHO) <a href="https://www.who.int/standards/classifications/other-classifications/international-classification-of-diseases-for-oncology">International Classification of Diseases for Oncology</a> (ICD-O). This attribute represents the disease that qualified the subject for inclusion on the ResearchProject.</li>
<li><b>primary_diagnosis_site:</b> The text term used to describe the primary site of disease, as categorized by the World Health Organization's (WHO) <a href="https://www.who.int/standards/classifications/other-classifications/international-classification-of-diseases-for-oncology">International Classification of Diseases for Oncology</a> (ICD-O). This categorization groups cases into general categories. This attribute represents the primary site of disease that qualified the subject for inclusion on the ResearchProject.</li>
<li><b>subject_id:</b> The 'logical' identifier of the entity in the system of record, e.g. a UUID. This 'id' is unique within a given system. The identified entity may have a different 'id' in a different system. Can be joined to the <code>id</code> field from subject results.</li>
</ul>  

</div>
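
Because `subject_id` in researchsubject results matches `id` in subject results, the two result sets can be combined with an ordinary key lookup. The sketch below uses invented rows and plain dictionaries. With pandas dataframes, `DataFrame.merge` with `left_on="subject_id"` and `right_on="id"` would be the analogous call.

```python
# Hypothetical rows shaped like the researchsubject and subject results.
researchsubjects = [
    {"id": "rs-1", "subject_id": "SUBJ-A", "primary_diagnosis_site": "Uterus, NOS"},
    {"id": "rs-2", "subject_id": "SUBJ-A", "primary_diagnosis_site": "Uterus"},
    {"id": "rs-3", "subject_id": "SUBJ-B", "primary_diagnosis_site": "Cervix uteri"},
]
subjects = [
    {"id": "SUBJ-A", "sex": "female", "vital_status": "Alive"},
    {"id": "SUBJ-B", "sex": "female", "vital_status": "Not Reported"},
]

# Index subjects by id, then attach subject fields to each researchsubject row.
subject_by_id = {s["id"]: s for s in subjects}
joined = []
for rs in researchsubjects:
    subject = subject_by_id.get(rs["subject_id"])
    if subject is None:
        continue  # inner-join behavior: skip rows without a matching subject
    joined.append({**rs, "sex": subject["sex"], "vital_status": subject["vital_status"]})

for row in joined:
    print(row["id"], row["subject_id"], row["sex"], row["vital_status"])
```

One subject can appear in several researchsubject rows, as SUBJ-A does here, because each study a person takes part in gets its own researchsubject ID.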
    
---

In [14]:
NoAdenoData.subject.run().to_dataframe() # view the dataframe

Total execution time: 3838 ms


Unnamed: 0,id,identifier,species,sex,race,ethnicity,days_to_birth,subject_associated_project,vital_status,days_to_death,cause_of_death
0,AD15141,"[{'system': 'GDC', 'value': 'AD15141'}]",homo sapiens,female,not reported,not reported,,[FM-AD],Not Reported,,
1,AD290,"[{'system': 'GDC', 'value': 'AD290'}]",homo sapiens,female,not reported,not reported,,[FM-AD],Not Reported,,
2,AD4252,"[{'system': 'GDC', 'value': 'AD4252'}]",homo sapiens,female,not reported,not reported,,[FM-AD],Not Reported,,
3,AD9808,"[{'system': 'GDC', 'value': 'AD9808'}]",homo sapiens,female,not reported,not reported,,[FM-AD],Not Reported,,
4,C3L-02382,"[{'system': 'IDC', 'value': 'C3L-02382'}]",Homo sapiens,,,,,[cptac_ucec],,,
...,...,...,...,...,...,...,...,...,...,...,...
95,GENIE-MSK-P-0001439,"[{'system': 'GDC', 'value': 'GENIE-MSK-P-00014...",homo sapiens,female,white,not hispanic or latino,-28854.0,[GENIE-MSK],Not Reported,,
96,GENIE-MSK-P-0001606,"[{'system': 'GDC', 'value': 'GENIE-MSK-P-00016...",homo sapiens,female,white,not hispanic or latino,-21549.0,[GENIE-MSK],Not Reported,,
97,GENIE-MSK-P-0001989,"[{'system': 'GDC', 'value': 'GENIE-MSK-P-00019...",homo sapiens,female,white,not hispanic or latino,-14610.0,[GENIE-MSK],Not Reported,,
98,GENIE-MSK-P-0013913,"[{'system': 'GDC', 'value': 'GENIE-MSK-P-00139...",homo sapiens,female,white,not hispanic or latino,-20454.0,[GENIE-MSK],Not Reported,,


---

<div class="cdadefine" style="background-color:#add9e5;color:black;padding:20px;">

<h3>Subject Field Definitions</h3>

<i>A patient entity captures the study-independent metadata for research subjects. Human research subjects are usually not traceable to a particular person to protect the subject's privacy.</i>

    
<ul>
<li><b>id:</b> The 'logical' identifier of the entity in the system of record, e.g. a UUID. This 'id' is unique within a given system. The identified entity may have a different 'id' in a different system.</li>
<li><b>identifier:</b> A 'business' identifier for the entity, typically as provided by an external system or authority, that persists across implementing systems (i.e. a 'logical' identifier). Uses a specialized, complex 'Identifier' data type to capture information about the source of the business identifier - or a URI expressed as a string to an existing entity.</li>
<li><b>identifier.system:</b> The system or namespace that defines the identifier.</li>
<li><b>identifier.value:</b> The value of the identifier, as defined by the system.</li>
<li><b>species:</b> The taxonomic group (e.g. species) of the patient. For MVP, since taxonomy vocabulary is consistent between GDC and PDC, using text. Ultimately, this will be a term returned by the vocabulary service.</li>
<li><b>sex:</b> The biologic character or quality that distinguishes male and female from one another as expressed by analysis of the person's gonadal, morphologic (internal and external), chromosomal, and hormonal characteristics.</li>
<li><b>race:</b> An arbitrary classification of a taxonomic group that is a division of a species. It usually arises as a consequence of geographical isolation within a species and is characterized by shared heredity, physical attributes and behavior, and in the case of humans, by common history, nationality, or geographic distribution. The provided values are based on the categories defined by the U.S. Office of Management and Budget and used by the U.S. Census Bureau.</li>
<li><b>ethnicity:</b> An individual's self-described social and cultural grouping, specifically whether an individual describes themselves as Hispanic or Latino. The provided values are based on the categories defined by the U.S. Office of Management and Budget and used by the U.S. Census Bureau.</li>
<li><b>days_to_birth:</b> Number of days between the date used for index and the date from a person's date of birth represented as a calculated negative number of days.</li>
<li><b>subject_associated_project:</b> The list of Projects associated with the Subject.</li>
<li><b>vital_status:</b> Coded value indicating the state or condition of being living or deceased; also includes the case where the vital status is unknown.</li>
<li><b>days_to_death:</b> Number of days between the date used for index and the date from a person's date of death represented as a calculated number of days.</li>
<li><b>cause_of_death:</b> Coded value indicating the circumstance or condition that results in the death of the subject.</li>
</ul>  

</div>
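
Since `days_to_birth` is a negative day count relative to the index date, it can be turned into an approximate age at index. The sketch below assumes an average year of 365.25 days and uses a hypothetical helper, `age_at_index`. The two numeric inputs are `days_to_birth` values from the subject preview above.

```python
DAYS_PER_YEAR = 365.25  # average year length; an approximation chosen for this sketch


def age_at_index(days_to_birth):
    """Approximate age in years at the index date, or None if not reported."""
    if days_to_birth is None:
        return None
    return -days_to_birth / DAYS_PER_YEAR


# Two days_to_birth values taken from the subject preview above, plus a missing one.
for days in (-28854.0, -14610.0, None):
    print(days, age_at_index(days))
```

In a pandas dataframe a missing value would appear as NaN rather than `None`, so the missing-value check would need to be adapted accordingly.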
    
---

In [15]:
NoAdenoData.file.run().to_dataframe() # view the dataframe

Total execution time: 21458 ms


Unnamed: 0,id,identifier,label,data_category,data_type,file_format,associated_project,drs_uri,byte_size,checksum,data_modality,imaging_modality,dbgap_accession_number,imaging_series,researchsubject_specimen_id,researchsubject_id,subject_id
0,002ff7dd-d55c-4f60-8d9a-c814443ed665,"[{'system': 'GDC', 'value': '002ff7dd-d55c-4f6...",9e4a57ec-ca07-4843-b77f-eb5504a817c4.rna_seq.c...,Sequencing Reads,Aligned Reads,BAM,TCGA-CESC,drs://dg.4DFC:002ff7dd-d55c-4f60-8d9a-c814443e...,96671613,6a4989df92ba58fe6558e7f6d0d5eb0f,Genomic,,,,a36de57d-2315-4bc3-93bf-69ec4f51063b,0bd52d75-5113-4ef1-bc75-509f59eaef2b,TCGA-JW-A5VG
1,00305181-fec7-41c6-a143-f3bf4636bd5f,"[{'system': 'GDC', 'value': '00305181-fec7-41c...",39550f2d-268b-47a8-996c-7638f07de1dd.wxs.muse....,Simple Nucleotide Variation,Raw Simple Somatic Mutation,VCF,TCGA-UCEC,drs://dg.4DFC:00305181-fec7-41c6-a143-f3bf4636...,25035,42b1a48e34690de6cd6851ca753cff9d,Genomic,,,,987af661-a8b7-47b0-800f-a0b6a3ef0096,e31db89c-15ac-45a4-9788-afb339b0f3bc,TCGA-AX-A3G6
2,005ff5ca-0975-4b0b-b413-4a906701e923,"[{'system': 'GDC', 'value': '005ff5ca-0975-4b0...",f059db9c-510f-4dc0-bf36-6038149bf06a.wxs.somat...,Simple Nucleotide Variation,Raw Simple Somatic Mutation,VCF,TCGA-CESC,drs://dg.4DFC:005ff5ca-0975-4b0b-b413-4a906701...,268168,4174aa64898dfbdf2a73634c73b77d01,Genomic,,,,8a53198f-ee29-4cca-8cb2-9944790bcc53,e15d92a1-6925-4b58-a494-d6f665f67b83,TCGA-LP-A4AU
3,007c3c99-c50f-4e45-907a-7ab78f47ff0d,"[{'system': 'GDC', 'value': '007c3c99-c50f-4e4...",TCGA_UCS.7ea16309-04b4-41ff-93aa-dc3605be9fc7....,Simple Nucleotide Variation,Annotated Somatic Mutation,VCF,TCGA-UCS,drs://dg.4DFC:007c3c99-c50f-4e45-907a-7ab78f47...,229002,8d2de6c206adef0e1236a1008d9234f9,Genomic,,,,e8dc4a59-fdfb-40d0-afc9-f7f21eb0f339,a1206473-cf9d-4bca-97c9-67f35817b806,TCGA-N9-A4Q1
4,00a7b64e-ffd2-11e8-abd7-005056921935,"[{'system': 'PDC', 'value': '00a7b64e-ffd2-11e...",15CPTAC_UCEC_P_PNNL_20180503_B4S3_f08.mzid.gz,Peptide Spectral Matches,Open Standard,mzIdentML,CPTAC3-Discovery,drs://dg.4DFC:00a7b64e-ffd2-11e8-abd7-00505692...,5832060,194e198498ab1c3ff18984bff2ef28a9,Proteomic,,,,61c4d96e-1259-11e9-afb9-0a9c39d33490,7ef278e9-118a-11e9-afb9-0a9c39d33490,C3L-01925
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,6a95c4ba-ffb4-11e8-95a3-005056921935,"[{'system': 'PDC', 'value': '6a95c4ba-ffb4-11e...",07CPTAC_UCEC_W_PNNL_20170922_B2S3_f18.mzML.gz,Processed Mass Spectra,Open Standard,mzML,CPTAC3-Discovery,drs://dg.4DFC:6a95c4ba-ffb4-11e8-95a3-00505692...,231084046,4eccdcd78267609a2e6e0cdd6d1f075f,Proteomic,,,,c8a6ef50-1259-11e9-afb9-0a9c39d33490,bf1f593d-118a-11e9-afb9-0a9c39d33490,C3N-01217
96,6afce870-f915-11e8-90c1-005056921935,"[{'system': 'PDC', 'value': '6afce870-f915-11e...",10CPTAC_UCEC_P_PNNL_20180222_B3S2_f04.raw,Raw Mass Spectra,Proprietary,vendor-specific,CPTAC3-Discovery,drs://dg.4DFC:6afce870-f915-11e8-90c1-00505692...,806180410,625413aa5fece36dc7fa5375f92580ec,Proteomic,,,,d799fbea-1271-11e9-afb9-0a9c39d33490,c289029e-118a-11e9-afb9-0a9c39d33490,C3N-01267
97,6b78fe54-ffcf-11e8-9f26-005056921935,"[{'system': 'PDC', 'value': '6b78fe54-ffcf-11e...",03CPTAC_UCEC_P_PNNL_20170922_B1S3_f05.psm,Peptide Spectral Matches,Text,tsv,CPTAC3-Discovery,drs://dg.4DFC:6b78fe54-ffcf-11e8-9f26-00505692...,2957979,71d54336cc42e41228484d9399d22f59,Proteomic,,,,39743eb5-1259-11e9-afb9-0a9c39d33490,6b043589-118a-11e9-afb9-0a9c39d33490,C3L-01256
98,6bcbea24-ffcf-11e8-9f26-005056921935,"[{'system': 'PDC', 'value': '6bcbea24-ffcf-11e...",03CPTAC_UCEC_P_PNNL_20170922_B1S3_f08.psm,Peptide Spectral Matches,Text,tsv,CPTAC3-Discovery,drs://dg.4DFC:6bcbea24-ffcf-11e8-9f26-00505692...,2496941,bb2eb73c4d58f988538ca5b6b8453cfb,Proteomic,,,,bb56d00f-1258-11e9-afb9-0a9c39d33490,f91c655f-1189-11e9-afb9-0a9c39d33490,C3L-00358
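
Julia's goal is to find patients with more than one kind of data, and the `subject_id` and `data_modality` columns in the file results make that check possible. The sketch below groups invented file rows by subject and keeps the subjects with at least two modalities. Only the field names are taken from the file results.

```python
from collections import defaultdict

# Hypothetical file rows; only the field names come from the file results.
files = [
    {"id": "f1", "subject_id": "SUBJ-A", "data_modality": "Genomic"},
    {"id": "f2", "subject_id": "SUBJ-A", "data_modality": "Proteomic"},
    {"id": "f3", "subject_id": "SUBJ-B", "data_modality": "Genomic"},
    {"id": "f4", "subject_id": "SUBJ-B", "data_modality": "Genomic"},
    {"id": "f5", "subject_id": "SUBJ-C", "data_modality": "Imaging"},
    {"id": "f6", "subject_id": "SUBJ-C", "data_modality": "Genomic"},
]

# Collect the set of modalities seen for each subject.
modalities = defaultdict(set)
for f in files:
    modalities[f["subject_id"]].add(f["data_modality"])

# Keep only subjects that have files in more than one modality.
multi_modal = {subject: sorted(kinds) for subject, kinds in modalities.items() if len(kinds) > 1}
print(multi_modal)
```

This grouping is only meaningful on the full result set, not on a 100-row preview.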



---

<div class="cdadefine" style="background-color:#add9e5;color:black;padding:20px;">

<h3>File Field Definitions</h3>

<i>A file is an information-bearing electronic object that contains a physical embodiment of some information using a particular character encoding.</i>

    
<ul>
  <li><b>id:</b> The 'logical' identifier of the entity in the system of record, e.g. a UUID. This 'id' is unique within a given system. The identified entity may have a different 'id' in a different system.</li>
<li><b>identifier:</b> A 'business' identifier for the entity, typically as provided by an external system or authority, that persists across implementing systems (i.e. a 'logical' identifier). Uses a specialized, complex 'Identifier' data type to capture information about the source of the business identifier - or a URI expressed as a string to an existing entity.</li>
<li><b>identifier.system:</b> The system or namespace that defines the identifier.</li>
<li><b>identifier.value:</b> The value of the identifier, as defined by the system.</li>
  <li><b>label:</b> Short name or abbreviation for dataset. Maps to rdfs:label.</li>
  <li><b>data_category:</b> Broad categorization of the contents of the data file.</li>
  <li><b>data_type:</b> Specific content type of the data file.</li>
  <li><b>file_format:</b> Format of the data files.</li>
  <li><b>associated_project:</b> A reference to the Project(s) of which this ResearchSubject is a member. The associated_project may be embedded using the ref definition or may be a reference to the id for the Project - or a URI expressed as a string to an existing entity.</li>
  <li><b>drs_uri:</b> A string of characters used to identify a resource on the Data Repository Service (DRS). Can be used to retrieve this specific file from a server.</li>
  <li><b>byte_size:</b> Size of the file in bytes. Maps to dcat:byteSize.</li>
  <li><b>checksum:</b> The md5 value for the file. A digit representing the sum of the correct digits in a piece of stored or transmitted digital data, against which later comparisons can be made to detect errors in the data.</li>
  <li><b>data_modality:</b> Data modality describes the biological nature of the information gathered as the result of an Activity, independent of the technology or methods used to produce the information. Always one of "Genomic", "Proteomic", or "Imaging".</li>
  <li><b>imaging_modality:</b> An imaging modality describes the imaging equipment and/or method used to acquire certain structural or functional information about the body. These include but are not limited to computed tomography (CT) and magnetic resonance imaging (MRI). Taken from the DICOM standard.</li>
  <li><b>dbgap_accession_number:</b> The dbgap accession number for the project.</li>
</ul>  

</div>
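
The `byte_size` and `checksum` fields can be used to confirm that a downloaded file arrived intact. The sketch below is a minimal check using Python's standard `hashlib`. The helper `verify_file` is hypothetical, and a small temporary file stands in for a real download.

```python
import hashlib
import os
import tempfile


def verify_file(path, expected_md5, expected_size, chunk_size=1 << 20):
    """Return True if the file's size and MD5 digest match the expected values."""
    if os.path.getsize(path) != expected_size:
        return False
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_md5.lower()


# Self-contained demonstration with a small temporary file standing in for a download.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
    demo_path = tmp.name

ok = verify_file(demo_path, "5d41402abc4b2a76b9719d911017c592", 5)
bad = verify_file(demo_path, "5d41402abc4b2a76b9719d911017c592", 6)
os.remove(demo_path)
print(ok, bad)
```

Checking the size first is a cheap way to reject truncated downloads before hashing the whole file.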
    
---


## Working with Results (pagination)

<img src="https://github.com/CancerDataAggregator/readthedocs/blob/main/docs/Examples/images/Julia.png?raw=true" align="left"
	width="50" height="50" />
   
Finally, Julia wants to save these results for future use. Since the preview dataframes only show the first 100 results of each search, she uses the `paginator` function to get all the data from the subject and researchsubject endpoints into their own dataframes:

In [16]:
researchsubs = NoAdenoData.researchsubject.run()
rsdf = pd.DataFrame()
for i in researchsubs.paginator(to_df=True):
    rsdf = pd.concat([rsdf, i])

Total execution time: 3540 ms


In [17]:
subs = NoAdenoData.subject.run()
subsdf = pd.DataFrame()
for i in subs.paginator(to_df=True):
    subsdf = pd.concat([subsdf, i])

Total execution time: 3538 ms
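
The loops above follow a general pattern: walk through a result one page at a time and accumulate the pages. The sketch below shows the same pattern in plain Python. The generator `fake_paginator` is a hypothetical stand-in for a paged result and the rows are invented. It is not cdapython's `paginator`.

```python
def fake_paginator(rows, page_size):
    """Yield successive pages (lists of rows), standing in for a paged result."""
    for start in range(0, len(rows), page_size):
        yield rows[start:start + page_size]


all_rows = [{"id": f"rs-{n}"} for n in range(250)]  # invented data

# Collect the pages first, then flatten once at the end.
pages = list(fake_paginator(all_rows, page_size=100))
collected = [row for page in pages for row in page]

print([len(page) for page in pages])
print(len(collected))
```

With dataframes, collecting the pages in a list and calling `pd.concat(pages, ignore_index=True)` once avoids copying the growing dataframe on every iteration and gives a continuous row index. Without `ignore_index`, each page keeps its own row labels, which is likely why the row labels in the subject dataframe below restart at small numbers.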


In [18]:
rsdf # view the researchsubject dataframe

Unnamed: 0,id,identifier,member_of_research_project,primary_diagnosis_condition,primary_diagnosis_site,subject_id
0,0609d8ac-fc18-4e07-b5fb-12b372151dcf,"[{'system': 'GDC', 'value': '0609d8ac-fc18-4e0...",GENIE-DFCI,"Soft Tissue Tumors and Sarcomas, NOS","Uterus, NOS",GENIE-DFCI-011667
1,1d2937e8-6104-4330-b7dc-7bcd79dac927,"[{'system': 'GDC', 'value': '1d2937e8-6104-433...",TCGA-UCS,Complex Mixed and Stromal Neoplasms,"Uterus, NOS",TCGA-N5-A4RN
2,2639fc46-2bf7-4050-b5dd-c4d8beb89d2e,"[{'system': 'GDC', 'value': '2639fc46-2bf7-405...",GENIE-MSK,Myomatous Neoplasms,"Uterus, NOS",GENIE-MSK-P-0003476
3,30a4fa5d-98f4-4592-868f-2beb443eb82d,"[{'system': 'GDC', 'value': '30a4fa5d-98f4-459...",FM-AD,"Cystic, Mucinous and Serous Neoplasms","Uterus, NOS",AD13447
4,30ffb337-c844-493f-97d6-7aea157d1b5e,"[{'system': 'GDC', 'value': '30ffb337-c844-493...",CGCI-HTMCP-CC,Squamous Cell Neoplasms,Cervix uteri,HTMCP-03-06-02240
...,...,...,...,...,...,...
92,da85e3db-253d-4331-bf09-b4228d6c301a,"[{'system': 'GDC', 'value': 'da85e3db-253d-433...",CGCI-HTMCP-CC,Squamous Cell Neoplasms,Cervix uteri,HTMCP-03-06-02210
93,db7036b1-1427-49fc-9381-2df96ee45faf,"[{'system': 'GDC', 'value': 'db7036b1-1427-49f...",GENIE-DFCI,Trophoblastic neoplasms,"Uterus, NOS",GENIE-DFCI-007248
94,ed12e8d3-570b-4f63-b236-c807c523c7d0,"[{'system': 'GDC', 'value': 'ed12e8d3-570b-4f6...",GENIE-DFCI,Trophoblastic neoplasms,"Uterus, NOS",GENIE-DFCI-000252
95,ede93186-676f-4ebc-a750-60607929b0d1,"[{'system': 'GDC', 'value': 'ede93186-676f-4eb...",GENIE-MSK,"Cystic, Mucinous and Serous Neoplasms","Uterus, NOS",GENIE-MSK-P-0015430


In [19]:
subsdf # view the subject dataframe

Unnamed: 0,id,identifier,species,sex,race,ethnicity,days_to_birth,subject_associated_project,vital_status,days_to_death,cause_of_death
0,AD15141,"[{'system': 'GDC', 'value': 'AD15141'}]",homo sapiens,female,not reported,not reported,,[FM-AD],Not Reported,,
1,AD290,"[{'system': 'GDC', 'value': 'AD290'}]",homo sapiens,female,not reported,not reported,,[FM-AD],Not Reported,,
2,AD4252,"[{'system': 'GDC', 'value': 'AD4252'}]",homo sapiens,female,not reported,not reported,,[FM-AD],Not Reported,,
3,AD9808,"[{'system': 'GDC', 'value': 'AD9808'}]",homo sapiens,female,not reported,not reported,,[FM-AD],Not Reported,,
4,C3L-02382,"[{'system': 'IDC', 'value': 'C3L-02382'}]",Homo sapiens,,,,,[cptac_ucec],,,
...,...,...,...,...,...,...,...,...,...,...,...
4,TCGA-D1-A17K,"[{'system': 'GDC', 'value': 'TCGA-D1-A17K'}, {...",homo sapiens,female,white,not hispanic or latino,-27283.0,"[TCGA-UCEC, tcga_ucec]",Alive,,
5,TCGA-EK-A2R8,"[{'system': 'GDC', 'value': 'TCGA-EK-A2R8'}, {...",homo sapiens,female,white,not hispanic or latino,-17692.0,"[TCGA-CESC, tcga_cesc]",Alive,,
6,TCGA-N6-A4VC,"[{'system': 'GDC', 'value': 'TCGA-N6-A4VC'}, {...",homo sapiens,female,white,not hispanic or latino,-30643.0,"[TCGA-UCS, tcga_ucs]",Dead,597.0,
7,TCGA-N7-A4Y5,"[{'system': 'GDC', 'value': 'TCGA-N7-A4Y5'}, {...",homo sapiens,female,white,not reported,-32058.0,"[TCGA-UCS, tcga_ucs]",Dead,8.0,


## Merging Results across Endpoints

<img src="https://github.com/CancerDataAggregator/readthedocs/blob/main/docs/Examples/images/Julia.png?raw=true" align="left"
	width="50" height="50" />
   
Then Julia merges the two results into one big dataset, matching the `subject_id` field of each researchsubject to the `id` field of the corresponding subject:

In [20]:
allmetadata = pd.merge(rsdf,
                subsdf,
                left_on="subject_id",
                right_on='id')

allmetadata

Unnamed: 0,id_x,identifier_x,member_of_research_project,primary_diagnosis_condition,primary_diagnosis_site,subject_id,id_y,identifier_y,species,sex,race,ethnicity,days_to_birth,subject_associated_project,vital_status,days_to_death,cause_of_death
0,0609d8ac-fc18-4e07-b5fb-12b372151dcf,"[{'system': 'GDC', 'value': '0609d8ac-fc18-4e0...",GENIE-DFCI,"Soft Tissue Tumors and Sarcomas, NOS","Uterus, NOS",GENIE-DFCI-011667,GENIE-DFCI-011667,"[{'system': 'GDC', 'value': 'GENIE-DFCI-011667'}]",homo sapiens,female,white,not hispanic or latino,-23010.0,[GENIE-DFCI],Not Reported,,
1,1d2937e8-6104-4330-b7dc-7bcd79dac927,"[{'system': 'GDC', 'value': '1d2937e8-6104-433...",TCGA-UCS,Complex Mixed and Stromal Neoplasms,"Uterus, NOS",TCGA-N5-A4RN,TCGA-N5-A4RN,"[{'system': 'GDC', 'value': 'TCGA-N5-A4RN'}, {...",homo sapiens,female,white,not hispanic or latino,-26054.0,"[TCGA-UCS, tcga_ucs]",Alive,,
2,TCGA-N5-A4RN__tcga_ucs,"[{'system': 'IDC', 'value': 'TCGA-N5-A4RN__tcg...",tcga_ucs,,Uterus,TCGA-N5-A4RN,TCGA-N5-A4RN,"[{'system': 'GDC', 'value': 'TCGA-N5-A4RN'}, {...",homo sapiens,female,white,not hispanic or latino,-26054.0,"[TCGA-UCS, tcga_ucs]",Alive,,
3,2639fc46-2bf7-4050-b5dd-c4d8beb89d2e,"[{'system': 'GDC', 'value': '2639fc46-2bf7-405...",GENIE-MSK,Myomatous Neoplasms,"Uterus, NOS",GENIE-MSK-P-0003476,GENIE-MSK-P-0003476,"[{'system': 'GDC', 'value': 'GENIE-MSK-P-00034...",homo sapiens,female,white,not hispanic or latino,-16071.0,[GENIE-MSK],Not Reported,,
4,30a4fa5d-98f4-4592-868f-2beb443eb82d,"[{'system': 'GDC', 'value': '30a4fa5d-98f4-459...",FM-AD,"Cystic, Mucinous and Serous Neoplasms","Uterus, NOS",AD13447,AD13447,"[{'system': 'GDC', 'value': 'AD13447'}]",homo sapiens,female,not reported,not reported,,[FM-AD],Not Reported,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3192,da85e3db-253d-4331-bf09-b4228d6c301a,"[{'system': 'GDC', 'value': 'da85e3db-253d-433...",CGCI-HTMCP-CC,Squamous Cell Neoplasms,Cervix uteri,HTMCP-03-06-02210,HTMCP-03-06-02210,"[{'system': 'GDC', 'value': 'HTMCP-03-06-02210'}]",homo sapiens,female,black or african american,Unknown,,[CGCI-HTMCP-CC],Dead,183.0,Unknown
3193,db7036b1-1427-49fc-9381-2df96ee45faf,"[{'system': 'GDC', 'value': 'db7036b1-1427-49f...",GENIE-DFCI,Trophoblastic neoplasms,"Uterus, NOS",GENIE-DFCI-007248,GENIE-DFCI-007248,"[{'system': 'GDC', 'value': 'GENIE-DFCI-007248'}]",homo sapiens,female,white,not hispanic or latino,-15340.0,[GENIE-DFCI],Not Reported,,
3194,ed12e8d3-570b-4f63-b236-c807c523c7d0,"[{'system': 'GDC', 'value': 'ed12e8d3-570b-4f6...",GENIE-DFCI,Trophoblastic neoplasms,"Uterus, NOS",GENIE-DFCI-000252,GENIE-DFCI-000252,"[{'system': 'GDC', 'value': 'GENIE-DFCI-000252'}]",homo sapiens,female,white,not hispanic or latino,-13879.0,[GENIE-DFCI],Not Reported,,
3195,ede93186-676f-4ebc-a750-60607929b0d1,"[{'system': 'GDC', 'value': 'ede93186-676f-4eb...",GENIE-MSK,"Cystic, Mucinous and Serous Neoplasms","Uterus, NOS",GENIE-MSK-P-0015430,GENIE-MSK-P-0015430,"[{'system': 'GDC', 'value': 'GENIE-MSK-P-00154...",homo sapiens,female,white,not hispanic or latino,-30681.0,[GENIE-MSK],Not Reported,,
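
The merged output above contains two rows for subject TCGA-N5-A4RN, one per researchsubject record, so the join runs many-to-one from researchsubject to subject. The sketch below, on made-up data, shows one way pandas can check that assumption and flag rows that found no match. The `how`, `validate`, `indicator`, and `suffixes` arguments are standard `pd.merge` options; the toy frames and their values are hypothetical.

```python
import pandas as pd

# Hypothetical stand-ins for rsdf and subsdf
rs = pd.DataFrame({
    "id": ["rs1", "rs2", "rs3"],
    "subject_id": ["S1", "S1", "S2"],  # S1 has two researchsubject records
})
subj = pd.DataFrame({
    "id": ["S1", "S2"],
    "sex": ["female", "female"],
})

merged = pd.merge(
    rs,
    subj,
    left_on="subject_id",
    right_on="id",
    how="left",                  # keep every researchsubject row
    validate="many_to_one",      # raise if a subject id is duplicated in subj
    indicator=True,              # adds a _merge column: both / left_only
    suffixes=("_researchsubject", "_subject"),
)

unmatched = merged[merged["_merge"] != "both"]
print(merged.columns.tolist())
print(len(unmatched))
```

Named suffixes make it easier to tell the two `id` columns apart than the default `id_x` and `id_y`.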


<img src="https://github.com/CancerDataAggregator/readthedocs/blob/main/docs/Examples/images/Julia.png?raw=true" align="left"
	width="50" height="50" />
   
She then saves it to a CSV file so she can browse it in Excel:

In [21]:
allmetadata.to_csv("allmetadata.csv")
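
By default `to_csv` also writes the row index, and when such a file is read back pandas labels that column `Unnamed: 0`. The sketch below uses an in-memory buffer and made-up rows to show the difference; `index=False` is a standard `to_csv` option, but whether Julia wants the index dropped is an assumption.

```python
import io

import pandas as pd

toy = pd.DataFrame({"subject_id": ["S1", "S2"], "vital_status": ["Alive", "Dead"]})

# Default: the row index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False: only the data columns are written.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)     # ['Unnamed: 0', 'subject_id', 'vital_status']
print(cols_without_index)  # ['subject_id', 'vital_status']
```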

<img src="https://github.com/CancerDataAggregator/readthedocs/blob/main/docs/Examples/images/Julia.png?raw=true" align="left"
	width="50" height="50" />
   
Julia knows from her subject count summary that there are more than 200,000 files associated with her subjects, which is likely far more than she needs. To help her decide which files she wants, Julia uses endpoint chaining to get summary information about the files associated with the researchsubjects that match her search criteria:


In [22]:
NoAdenoData.researchsubject.file.count.run()

Total execution time: 13151 ms


system,count
IDC,264429
GDC,294
PDC,2560

data_category,count
Imaging,264429
Raw Mass Spectra,640
Peptide Spectral Matches,1280
Processed Mass Spectra,640
Simple Nucleotide Variation,114
Somatic Structural Variation,4
Copy Number Variation,37
Transcriptome Profiling,31
Biospecimen,17
Structural Variation,29

file_format,count
DICOM,264429
TSV,39
tsv,640
vendor-specific,640
BAM,35
mzML,640
BEDPE,19
mzIdentML,640
VCF,61
TXT,50

data_type,count
,264429
Text,640
Open Standard,1280
Proprietary,640
Aligned Reads,35
Copy Number Segment,11
Annotated Somatic Mutation,55
Gene Level Copy Number Scores,5
Gene Level Copy Number,5
Raw Simple Somatic Mutation,36
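
Once file records have been pulled into a dataframe, as Julia does later for the slide files, a similar per-column tally can be produced locally with pandas. The sketch below uses made-up rows and is only a local analogue of the summary above, not a description of how `count` is computed by the CDA service.

```python
import pandas as pd

# Hypothetical file records; real ones would come from a file query
files = pd.DataFrame({
    "file_format": ["DICOM", "DICOM", "SVS", "VCF", "DICOM"],
    "data_category": [
        "Imaging",
        "Imaging",
        "Biospecimen",
        "Simple Nucleotide Variation",
        "Imaging",
    ],
})

# value_counts tallies how often each distinct value occurs in a column
format_counts = files["file_format"].value_counts()
category_counts = files["data_category"].value_counts()

print(format_counts)
print(category_counts)
```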




<img src="https://github.com/CancerDataAggregator/readthedocs/blob/main/docs/Examples/images/Julia.png?raw=true" align="left"
	width="50" height="50" />
    
Julia decides that a good place to start would be with Slide Images. There are only 1111, so she should be able to quickly scan through them over the next few days and see if they will be useful. So she adds one more filter to her search:

In [23]:
JustSlides = Q('file.data_type = "Slide Image"')
NoadenoJustSlides = NoAdenoData.AND(JustSlides)
NoadenoJustSlides.researchsubject.file.count.run()

Total execution time: 7349 ms


system,count
GDC,8

data_category,count
Biospecimen,8

file_format,count
SVS,8

data_type,count
Slide Image,8




<img src="https://github.com/CancerDataAggregator/readthedocs/blob/main/docs/Examples/images/Julia.png?raw=true" align="left"
	width="50" height="50" />
    
Finally, Julia uses the `paginator` function again to get all the slide files, and merges her metadata with this file information. This way she will be able to review which phenotypes each slide is associated with:

In [24]:
slides = NoadenoJustSlides.researchsubject.file.run()
slidesdf = pd.DataFrame()
for i in slides.paginator(to_df=True):
    slidesdf = pd.concat([slidesdf, i])


Total execution time: 7162 ms


In [25]:
slidemetadata = pd.merge(slidesdf, 
                         allmetadata, 
                         on="subject_id")
slidemetadata

Unnamed: 0,id,identifier,label,data_category,data_type,file_format,associated_project,drs_uri,byte_size,checksum,...,identifier_y,species,sex,race,ethnicity,days_to_birth,subject_associated_project,vital_status,days_to_death,cause_of_death
0,002a60d4-280c-44b5-a0bf-5e2f063253b2,"[{'system': 'GDC', 'value': '002a60d4-280c-44b...",TCGA-EA-A6QX-01A-01-TSA.66853F14-1300-4D44-B29...,Biospecimen,Slide Image,SVS,TCGA-CESC,drs://dg.4DFC:002a60d4-280c-44b5-a0bf-5e2f0632...,256639621,c937af221666a0be405b665631582b32,...,"[{'system': 'GDC', 'value': 'TCGA-EA-A6QX'}, {...",homo sapiens,female,asian,not hispanic or latino,-18035.0,"[TCGA-CESC, tcga_cesc]",Alive,,
1,002a60d4-280c-44b5-a0bf-5e2f063253b2,"[{'system': 'GDC', 'value': '002a60d4-280c-44b...",TCGA-EA-A6QX-01A-01-TSA.66853F14-1300-4D44-B29...,Biospecimen,Slide Image,SVS,TCGA-CESC,drs://dg.4DFC:002a60d4-280c-44b5-a0bf-5e2f0632...,256639621,c937af221666a0be405b665631582b32,...,"[{'system': 'GDC', 'value': 'TCGA-EA-A6QX'}, {...",homo sapiens,female,asian,not hispanic or latino,-18035.0,"[TCGA-CESC, tcga_cesc]",Alive,,
2,016cff56-c7a4-462b-8a84-12d112a90739,"[{'system': 'GDC', 'value': '016cff56-c7a4-462...",TCGA-EK-A2H1-01A-01-TSA.B0EA350D-B3CA-482F-AC0...,Biospecimen,Slide Image,SVS,TCGA-CESC,drs://dg.4DFC:016cff56-c7a4-462b-8a84-12d112a9...,133766513,ec96548913e8060f75db0c867bb00336,...,"[{'system': 'GDC', 'value': 'TCGA-EK-A2H1'}, {...",homo sapiens,female,white,not reported,-7649.0,"[TCGA-CESC, tcga_cesc]",Alive,,
3,016cff56-c7a4-462b-8a84-12d112a90739,"[{'system': 'GDC', 'value': '016cff56-c7a4-462...",TCGA-EK-A2H1-01A-01-TSA.B0EA350D-B3CA-482F-AC0...,Biospecimen,Slide Image,SVS,TCGA-CESC,drs://dg.4DFC:016cff56-c7a4-462b-8a84-12d112a9...,133766513,ec96548913e8060f75db0c867bb00336,...,"[{'system': 'GDC', 'value': 'TCGA-EK-A2H1'}, {...",homo sapiens,female,white,not reported,-7649.0,"[TCGA-CESC, tcga_cesc]",Alive,,
4,01e9b85a-e6c7-4117-9bfd-86050f5a2ff7,"[{'system': 'GDC', 'value': '01e9b85a-e6c7-411...",TCGA-IW-A3M4-01Z-00-DX1.03BE04CC-552B-40E8-85D...,Biospecimen,Slide Image,SVS,TCGA-SARC,drs://dg.4DFC:01e9b85a-e6c7-4117-9bfd-86050f5a...,2303418723,6ef35302e17fdb5f712c0ece967ad543,...,"[{'system': 'GDC', 'value': 'TCGA-IW-A3M4'}, {...",homo sapiens,female,white,not hispanic or latino,-17721.0,"[tcga_sarc, TCGA-SARC]",Alive,,
5,00d9d6d3-07af-41f8-bf84-4f2bf200461d,"[{'system': 'GDC', 'value': '00d9d6d3-07af-41f...",TCGA-AX-A2H5-11A-01-TSA.6a778f9f-2717-4bb8-83d...,Biospecimen,Slide Image,SVS,TCGA-UCEC,drs://dg.4DFC:00d9d6d3-07af-41f8-bf84-4f2bf200...,56262805,58733b0552dc9dde57605e4215550b51,...,"[{'system': 'GDC', 'value': 'TCGA-AX-A2H5'}, {...",homo sapiens,female,black or african american,not hispanic or latino,-24757.0,"[TCGA-UCEC, tcga_ucec]",Alive,,
6,00d9d6d3-07af-41f8-bf84-4f2bf200461d,"[{'system': 'GDC', 'value': '00d9d6d3-07af-41f...",TCGA-AX-A2H5-11A-01-TSA.6a778f9f-2717-4bb8-83d...,Biospecimen,Slide Image,SVS,TCGA-UCEC,drs://dg.4DFC:00d9d6d3-07af-41f8-bf84-4f2bf200...,56262805,58733b0552dc9dde57605e4215550b51,...,"[{'system': 'GDC', 'value': 'TCGA-AX-A2H5'}, {...",homo sapiens,female,black or african american,not hispanic or latino,-24757.0,"[TCGA-UCEC, tcga_ucec]",Alive,,
7,00af722b-dd64-4909-92fa-27116bd2d728,"[{'system': 'GDC', 'value': '00af722b-dd64-490...",TCGA-FU-A3HZ-01Z-00-DX1.1E78D8EF-B1FF-49AC-9EC...,Biospecimen,Slide Image,SVS,TCGA-CESC,drs://dg.4DFC:00af722b-dd64-4909-92fa-27116bd2...,538854312,1a799ba38252a841393ef5aab4f87764,...,"[{'system': 'GDC', 'value': 'TCGA-FU-A3HZ'}, {...",homo sapiens,female,white,not hispanic or latino,-23565.0,"[TCGA-CESC, tcga_cesc]",Alive,,
8,00af722b-dd64-4909-92fa-27116bd2d728,"[{'system': 'GDC', 'value': '00af722b-dd64-490...",TCGA-FU-A3HZ-01Z-00-DX1.1E78D8EF-B1FF-49AC-9EC...,Biospecimen,Slide Image,SVS,TCGA-CESC,drs://dg.4DFC:00af722b-dd64-4909-92fa-27116bd2...,538854312,1a799ba38252a841393ef5aab4f87764,...,"[{'system': 'GDC', 'value': 'TCGA-FU-A3HZ'}, {...",homo sapiens,female,white,not hispanic or latino,-23565.0,"[TCGA-CESC, tcga_cesc]",Alive,,
9,02739e88-15df-4e7b-ac86-c0e2931bf713,"[{'system': 'GDC', 'value': '02739e88-15df-4e7...",TCGA-VS-A9UL-01Z-00-DX1.4348D92F-0EA0-4A98-8F5...,Biospecimen,Slide Image,SVS,TCGA-CESC,drs://dg.4DFC:02739e88-15df-4e7b-ac86-c0e2931b...,341653835,0c506efa82b701b2ef0f3bba8b275db0,...,"[{'system': 'GDC', 'value': 'TCGA-VS-A9UL'}, {...",homo sapiens,female,white,not reported,-29033.0,"[TCGA-CESC, tcga_cesc]",Dead,442.0,
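
Several files appear twice in the output above (rows 0 and 1, for example, share the same file `id`). That is what a merge produces when one `subject_id` matches more than one metadata row. If one row per file is enough, the repeats can be dropped by file `id`. The sketch below uses made-up frames, and keeping only the first match per file is an assumption about what Julia needs.

```python
import pandas as pd

# Hypothetical stand-ins for slidesdf and allmetadata
files = pd.DataFrame({
    "id": ["f1", "f2"],
    "subject_id": ["S1", "S2"],
})
meta = pd.DataFrame({
    "subject_id": ["S1", "S1", "S2"],  # S1 has two metadata rows
    "member_of_research_project": ["TCGA-CESC", "tcga_cesc", "TCGA-UCEC"],
})

merged = pd.merge(files, meta, on="subject_id")

# subset=["id"] compares only the file id, which also avoids hashing
# list-valued columns such as identifier
unique_files = merged.drop_duplicates(subset=["id"])

print(len(merged), len(unique_files))
```

Dropping repeats discards the other matching metadata rows, so the full merge is worth keeping if both project records matter.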


<img src="https://github.com/CancerDataAggregator/readthedocs/blob/main/docs/Examples/images/Julia.png?raw=true" align="left"
	width="50" height="50" />
    
She saves this file out as well.

In [None]:
slidemetadata.to_csv("slidemetadata.csv")

<img src="https://github.com/CancerDataAggregator/readthedocs/blob/main/docs/Examples/images/Julia.png?raw=true" align="left"
	width="50" height="50" />
    
Now Julia has all the information she needs to begin work on her project. She can use the values in the `drs_uri` column to download the images she is interested in directly with a DRS resolver, or she can enter the DRS URIs in a cloud workspace such as [Terra](https://terra.bio/) or the [Cancer Genomics Cloud](https://www.cancergenomicscloud.org/) to view the images online. In either case, she has all the metadata she needs to get started, and can save this notebook of her work in case she'd like to come back and modify her search.
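
The `drs_uri` values shown earlier have the form `drs://<prefix>:<object id>`. Some tools ask for the two parts separately, so the sketch below splits a URI of that form with plain string operations. `split_drs_uri` is a hypothetical helper, the example URI is a made-up placeholder rather than a real file, and only this prefix-and-colon form is handled.

```python
def split_drs_uri(uri):
    """Split 'drs://<prefix>:<object id>' into (prefix, object_id)."""
    scheme, sep, rest = uri.partition("://")
    if scheme != "drs" or not sep:
        raise ValueError(f"not a DRS URI: {uri!r}")
    prefix, colon, object_id = rest.partition(":")
    if not colon or not prefix or not object_id:
        raise ValueError(f"expected drs://<prefix>:<object id>, got {uri!r}")
    return prefix, object_id


# Placeholder URI for illustration only
example = "drs://dg.EXAMPLE:00000000-0000-0000-0000-000000000000"
prefix, object_id = split_drs_uri(example)
print(prefix)
print(object_id)
```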