# Cohort Building

**Example use case:** 

<img src="https://github.com/CancerDataAggregator/readthedocs/blob/main/docs/Examples/images/Julia.png?raw=true" alt="alt_text" align="left"
	width="150" height="150" />
Julia is an oncologist who specializes in female reproductive health. As part of her research, she is interested in using existing data on uterine cancers. If possible, she would like to see multiple datatypes (gross imaging, genomic data, proteomic data, histology) that come from the same patient, so she can look for shared phenotypes and test their potential as early diagnostics. Julia heard that the Cancer Data Aggregator has made it easy to search across multiple datasets created by NCI, and so has decided to start her search there.



## Getting Started

The CDA provides a custom Python tool for searching CDA data. [`Q`](usage/#q) (short for Query) offers several ways to search and filter data, and several input modes:

---
- **<a href="../../QuickStart/usage/#q">Q()</a>** builds a query that can be used by `run()` or `count()`
- **<a href="../../QuickStart/usage/#qrun">Q.run()</a>** returns data for the specified search
- **<a href="../../QuickStart/usage/#qcount">Q.count()</a>** returns summary information (counts) for data that fit the specified search
- **<a href="../../QuickStart/usage/#columns">columns()</a>** returns entity field names
- **<a href="../../QuickStart/usage/#unique_terms">unique_terms()</a>** returns entity field contents

---

Before Julia does any work, she needs to import these functions from cdapython.
She'll also need to import [pandas](https://pandas.pydata.org/) to get nice dataframes.
Finally, she tells cdapython to report its version so she can be sure she's using the one she means to:

In [1]:
from cdapython import Q, columns, unique_terms, query
import cdapython
import pandas as pd 
print(cdapython.__version__)

2022.6.28
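
The version string printed above looks date-based (year.month.day). A minimal sketch, assuming that format holds, of how one might compare an installed version against an expected minimum. `parse_version` and `MINIMUM` are hypothetical names used for illustration and are not part of cdapython.

```python
def parse_version(text):
    """Turn a dotted version string such as '2022.6.28' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))

# Hypothetical minimum: the version shown in this example.
MINIMUM = parse_version("2022.6.28")

installed = "2022.6.28"  # in a live session this would be cdapython.__version__
is_recent_enough = parse_version(installed) >= MINIMUM
print(is_recent_enough)
```

Comparing tuples of integers avoids the trap of comparing the raw strings, where "2022.10.1" would sort before "2022.6.28".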


<div class="cdanote" style="background-color:#b3e5d5;color:black;padding:20px;">
    
    
CDA data comes from three sources:
<ul>
<li><b>The <a href="https://proteomic.datacommons.cancer.gov/pdc/"> Proteomic Data Commons</a> (PDC)</b></li>
<li><b>The <a href="https://gdc.cancer.gov/">Genomic Data Commons</a> (GDC)</b></li>
<li><b>The <a href="https://datacommons.cancer.gov/repository/imaging-data-commons">Imaging Data Commons</a> (IDC)</b></li>
</ul> 
    
The CDA makes this data searchable in four main endpoints:

<ul>
<li><b>subject:</b> A patient entity captures the study-independent metadata for research subjects. Human research subjects are usually not traceable to a particular person to protect the subject's privacy.</li>
<li><b>researchsubject:</b> A research subject is the entity of interest in a specific research study or project, typically a human being or an animal, but can also be a device, group of humans or animals, or a tissue sample. Human research subjects are usually not traceable to a particular person to protect the subject's privacy. This entity plays the role of the case_id in existing data. An individual who participates in 3 studies will have 3 researchsubject IDs.</li>
<li><b>specimen:</b> Any material taken as a sample from a biological entity (living or dead), or from a physical object or the environment. Specimens are usually collected as an example of their kind, often for use in some investigation.</li>
<li><b>file:</b> A unit of data about subjects, researchsubjects, specimens, or their associated information</li>
</ul>
and two endpoints that offer deeper information about data in the researchsubject endpoint:
<ul>
<li><b>diagnosis:</b> A collection of characteristics that describe an abnormal condition of the body as assessed at a point in time. May be used to capture information about neoplastic and non-neoplastic conditions.</li>
<li><b>treatment:</b> Represent medication administration or other treatment types.</li>
</ul>
Any metadata field can be searched from any endpoint; the only difference between search types is what type of data is returned by default. This means that you can think of the CDA as a really, really enormous spreadsheet full of data. To search this enormous spreadsheet, you'd want to select columns, and then filter rows.
</div>
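
The "enormous spreadsheet" picture can be made concrete with a toy pandas dataframe. This is only an analogy for the select-columns-then-filter-rows idea, not a description of how the CDA stores or queries data; the rows below are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the "enormous spreadsheet"; values are illustrative only.
sheet = pd.DataFrame(
    {
        "id": ["s1", "s2", "s3", "s4"],
        "sex": ["female", "female", "male", "female"],
        "primary_diagnosis_site": ["Uterus", "Lung", "Kidney", "Cervix"],
    }
)

# Step 1: select the columns of interest.
selected = sheet[["id", "primary_diagnosis_site"]]

# Step 2: filter the rows on a value in one of those columns.
filtered = selected[selected["primary_diagnosis_site"] == "Uterus"]
print(filtered)
```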


## Finding Search Terms

<img src="https://github.com/CancerDataAggregator/readthedocs/blob/main/docs/Examples/images/Julia.png?raw=true" align="left"
	width="50" height="50" />
   
   
   Accordingly, to see what search fields are available, Julia starts by using the command `columns`:

In [2]:
columns().to_list()

['File.id',
 'File.identifier.system',
 'File.identifier.value',
 'File.label',
 'File.data_category',
 'File.data_type',
 'File.file_format',
 'File.associated_project',
 'File.drs_uri',
 'File.byte_size',
 'File.checksum',
 'File.data_modality',
 'File.imaging_modality',
 'File.dbgap_accession_number',
 'File.imaging_series',
 'id',
 'identifier.system',
 'identifier.value',
 'species',
 'sex',
 'race',
 'ethnicity',
 'days_to_birth',
 'subject_associated_project',
 'vital_status',
 'days_to_death',
 'cause_of_death',
 'ResearchSubject.id',
 'ResearchSubject.identifier.system',
 'ResearchSubject.identifier.value',
 'ResearchSubject.member_of_research_project',
 'ResearchSubject.primary_diagnosis_condition',
 'ResearchSubject.primary_diagnosis_site',
 'ResearchSubject.Diagnosis.id',
 'ResearchSubject.Diagnosis.identifier.system',
 'ResearchSubject.Diagnosis.identifier.value',
 'ResearchSubject.Diagnosis.primary_diagnosis',
 'ResearchSubject.Diagnosis.age_at_diagnosis',
 'ResearchSubje

<img src="https://github.com/CancerDataAggregator/readthedocs/blob/main/docs/Examples/images/Julia.png?raw=true" align="left"
	width="50" height="50" />
   
   
There are a lot of columns in the CDA data, but Julia is most interested in diagnosis data, so she filters the list to only those:

In [3]:
columns().to_list(filters="diagnosis")

['ResearchSubject.primary_diagnosis_condition',
 'ResearchSubject.primary_diagnosis_site',
 'ResearchSubject.Diagnosis.id',
 'ResearchSubject.Diagnosis.identifier.system',
 'ResearchSubject.Diagnosis.identifier.value',
 'ResearchSubject.Diagnosis.primary_diagnosis',
 'ResearchSubject.Diagnosis.age_at_diagnosis',
 'ResearchSubject.Diagnosis.morphology',
 'ResearchSubject.Diagnosis.stage',
 'ResearchSubject.Diagnosis.grade',
 'ResearchSubject.Diagnosis.method_of_diagnosis',
 'ResearchSubject.Diagnosis.Treatment.id',
 'ResearchSubject.Diagnosis.Treatment.identifier.system',
 'ResearchSubject.Diagnosis.Treatment.identifier.value',
 'ResearchSubject.Diagnosis.Treatment.treatment_type',
 'ResearchSubject.Diagnosis.Treatment.treatment_outcome',
 'ResearchSubject.Diagnosis.Treatment.days_to_treatment_start',
 'ResearchSubject.Diagnosis.Treatment.days_to_treatment_end',
 'ResearchSubject.Diagnosis.Treatment.therapeutic_agent',
 'ResearchSubject.Diagnosis.Treatment.treatment_anatomic_site',
 'Re

<div class="cdanote" style="background-color:#b3e5d5;color:black;padding:20px;">

To search the CDA, a user also needs to know what search terms are available. Each column will contain a huge amount of data, so retrieving all of the rows would be overwhelming. Instead, the CDA has a `unique_terms()` function that will return all of the unique values that populate the requested column. Like `columns`, `unique_terms` defaults to giving us an overview of the results, and can be filtered.
    
</div>
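
Conceptually, asking for the unique terms in a column resembles asking a dataframe column for its distinct values. A rough pandas analogy follows; it is not how `unique_terms` is implemented, and the values are illustrative.

```python
import pandas as pd

# Illustrative column with repeated values (not real CDA data).
site = pd.Series(["Brain", "Cervix", "Brain", "Pelvis", "Cervix", "Spine"])

# Drop missing values and duplicates, then sort: one entry per distinct term.
terms = sorted(site.dropna().unique().tolist())
print(terms)
```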

<img src="https://github.com/CancerDataAggregator/readthedocs/blob/main/docs/Examples/images/Julia.png?raw=true" align="left"
	width="50" height="50" />
   
Since Julia is interested specifically in uterine cancers, she uses the `unique_terms` function to list the values available for 'ResearchSubject.Diagnosis.Treatment.treatment_anatomic_site' and 'ResearchSubject.primary_diagnosis_site', and checks whether 'uterine' appears:

In [4]:
unique_terms("ResearchSubject.Diagnosis.Treatment.treatment_anatomic_site").to_list()

['Brain',
 'Cervix',
 'Head - Face Or Neck, Nos',
 'Lymph Node(s) Paraaortic',
 'Other',
 'Pelvis',
 'Spine',
 'Unknown']

In [5]:
unique_terms("ResearchSubject.primary_diagnosis_site").to_list()

['Abdomen',
 'Abdomen, Mediastinum',
 'Abdomen, Pelvis',
 'Adrenal Glands',
 'Adrenal gland',
 'Anus and anal canal',
 'Base of tongue',
 'Bile Duct',
 'Bladder',
 'Bones, joints and articular cartilage of limbs',
 'Bones, joints and articular cartilage of other and unspecified sites',
 'Brain',
 'Breast',
 'Bronchus and lung',
 'Cervix',
 'Cervix uteri',
 'Chest',
 'Chest-Abdomen-Pelvis, Leg, TSpine',
 'Colon',
 'Connective, subcutaneous and other soft tissues',
 'Corpus uteri',
 'Ear',
 'Esophagus',
 'Extremities',
 'Eye and adnexa',
 'Floor of mouth',
 'Gallbladder',
 'Gum',
 'Head',
 'Head and Neck',
 'Head-Neck',
 'Heart, mediastinum, and pleura',
 'Hematopoietic and reticuloendothelial systems',
 'Hypopharynx',
 'Intraocular',
 'Kidney',
 'Larynx',
 'Lip',
 'Liver',
 'Liver and intrahepatic bile ducts',
 'Lung',
 'Lung Phantom',
 'Lymph nodes',
 'Marrow, Blood',
 'Meninges',
 'Mesothelium',
 'Nasal cavity and middle ear',
 'Nasopharynx',
 'Not Reported',
 'Oropharynx',
 'Other an

<div class="cdanote" style="background-color:#b3e5d5;color:black;padding:20px;">
    
CDA makes multiple datasets searchable from a common interface, but does not harmonize the data. This means that researchers should review all the terms in a column, and not just choose the first one that fits, as there may be other similar terms available as well.
    
</div>
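
One way to spot similar terms in an unharmonized column is to group them by a crude key and review any group with more than one member. The sketch below uses a handful of terms copied from the list above; `first_word` is a hypothetical helper, and grouping on the leading word is just one possible heuristic.

```python
from collections import defaultdict

# A few terms copied from the primary_diagnosis_site list above.
terms = [
    "Adrenal Glands", "Adrenal gland", "Cervix", "Cervix uteri",
    "Head and Neck", "Head-Neck", "Kidney", "Liver",
    "Liver and intrahepatic bile ducts",
]

def first_word(term):
    """Lower-case the term and keep its leading alphabetic run as a grouping key."""
    key = ""
    for ch in term.lower():
        if not ch.isalpha():
            break
        key += ch
    return key

groups = defaultdict(list)
for term in terms:
    groups[first_word(term)].append(term)

# Keep only keys shared by more than one term: these deserve a closer look.
related = {key: names for key, names in groups.items() if len(names) > 1}
for key, names in related.items():
    print(key, names)
```

A heuristic like this only flags candidates; deciding whether two terms really mean the same thing still takes a human reading of the list.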

<img src="https://github.com/CancerDataAggregator/readthedocs/blob/main/docs/Examples/images/Julia.png?raw=true" align="left"
	width="50" height="50" />
   
Julia sees that "treatment_anatomic_site" does not have 'Uterine', but does have 'Cervix'. She also notes that both 'Uterus' and 'Uterus, NOS' are listed in the "primary_diagnosis_site" results. As she was initially looking for "uterine", Julia decides to expand her search a bit to account for variable naming schemes. So, she runs a fuzzy match filter on the "ResearchSubject.primary_diagnosis_site" for 'uter' as that should cover all variants:

In [6]:
unique_terms("ResearchSubject.primary_diagnosis_site").to_list(filters="uter")

['Cervix uteri', 'Corpus uteri', 'Uterus', 'Uterus, NOS']

<img src="https://github.com/CancerDataAggregator/readthedocs/blob/main/docs/Examples/images/Julia.png?raw=true" align="left"
	width="50" height="50" />
   
Just to be sure, Julia also searches for any other instances of "cervix":

In [7]:
unique_terms("ResearchSubject.primary_diagnosis_site").to_list(filters="cerv")

['Cervix', 'Cervix uteri']
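
The two filtered lists above are consistent with a case-insensitive substring match ('uter' matched 'Uterus'). A plain-Python sketch under that assumption, using only the terms shown in the outputs; `filter_terms` is a hypothetical helper, not a cdapython function.

```python
# Terms taken from the outputs above.
site_terms = ["Cervix", "Cervix uteri", "Corpus uteri", "Uterus", "Uterus, NOS"]

def filter_terms(terms, fragment):
    """Return the terms containing `fragment`, ignoring case."""
    needle = fragment.lower()
    return [term for term in terms if needle in term.lower()]

uter = filter_terms(site_terms, "uter")
cerv = filter_terms(site_terms, "cerv")

# Union of both searches, de-duplicated, as the candidate term list.
candidates = sorted(set(uter) | set(cerv))
print(candidates)
```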

## Building a Query

<img src="https://github.com/CancerDataAggregator/readthedocs/blob/main/docs/Examples/images/Julia.png?raw=true" align="left"
	width="50" height="50" />
   
With all her likely terms found, Julia begins to create a search that will get data for all of her terms. She does this by writing a series of `Q` statements that define what rows should be returned from each column. For the "treatment_anatomic_site", only one term is of interest, so she uses the `=` operator to get only exact matches:

In [8]:
Tsite = Q('ResearchSubject.Diagnosis.Treatment.treatment_anatomic_site = "Cervix"')

<img src="https://github.com/CancerDataAggregator/readthedocs/blob/main/docs/Examples/images/Julia.png?raw=true" align="left"
	width="50" height="50" />
   
However, for "primary_diagnosis_site", Julia has several terms she wants to search for. Luckily, `Q` can also run fuzzy searches: in the query below, each search fragment is wrapped in `%` characters, which appear to act as wildcards. `Q` can also combine more than one condition in a single statement, so Julia writes one big `Q` statement to grab everything that contains either 'uter' or 'cerv':

In [9]:
Dsite = Q('ResearchSubject.primary_diagnosis_site = "%uter%" OR ResearchSubject.primary_diagnosis_site = "%cerv%"')
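
To see which terms a pattern such as `%uter%` would be expected to catch, the sketch below translates `%` wildcards into regular expressions and applies them to a short list of terms. It assumes `%` means "any run of characters" and that matching ignores case; both are assumptions about the CDA's behaviour, and `like_to_regex` is a hypothetical helper.

```python
import re

def like_to_regex(pattern):
    """Translate a %-wildcard pattern into a case-insensitive regular expression."""
    parts = [re.escape(piece) for piece in pattern.split("%")]
    return re.compile("^" + ".*".join(parts) + "$", re.IGNORECASE)

site_terms = ["Brain", "Cervix", "Cervix uteri", "Corpus uteri", "Uterus", "Uterus, NOS"]

uter = like_to_regex("%uter%")
cerv = like_to_regex("%cerv%")

# A term is kept if either pattern matches, mirroring the OR in the query.
matched = [t for t in site_terms if uter.match(t) or cerv.match(t)]
print(matched)
```

Without the surrounding `%`, the same fragment would only match a term that is exactly "uter", which is why the wildcards matter here.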

<img src="https://github.com/CancerDataAggregator/readthedocs/blob/main/docs/Examples/images/Julia.png?raw=true" align="left"
	width="50" height="50" />
   
Finally, Julia adds her two queries together into one large one:

In [10]:
ALLDATA = Tsite.OR(Dsite)
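
Combining two queries with OR keeps a record when either condition holds. A small pandas analogy with boolean masks, using made-up rows and shortened column names; it illustrates the logic only and does not call the CDA.

```python
import pandas as pd

# Illustrative rows only; column names are shortened for readability.
rows = pd.DataFrame(
    {
        "treatment_anatomic_site": ["Cervix", "Brain", None, "Pelvis"],
        "primary_diagnosis_site": ["Lung", "Uterus, NOS", "Cervix uteri", "Kidney"],
    }
)

# Analogue of Tsite: exact match on the treatment site.
tsite = rows["treatment_anatomic_site"] == "Cervix"

# Analogue of Dsite: substring match on the diagnosis site.
dsite = rows["primary_diagnosis_site"].str.contains("uter|cerv", case=False)

# Analogue of Tsite.OR(Dsite): keep a row if either condition holds.
alldata = rows[tsite | dsite]
print(alldata.index.tolist())
```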

## Looking at Summary Data

<img src="https://github.com/CancerDataAggregator/readthedocs/blob/main/docs/Examples/images/Julia.png?raw=true" align="left"
	width="50" height="50" />
   
Now that Julia has a query, she can use it to look for data in any of the CDA endpoints. She starts by getting an overall summary of what data is available using `count`:

In [11]:
ALLDATA.count.run()

Total execution time: 3489 ms




<img src="https://github.com/CancerDataAggregator/readthedocs/blob/main/docs/Examples/images/Julia.png?raw=true" align="left"
	width="50" height="50" />
   
It seems there's a lot of data that might work for Julia's study! Since she is interested in the beginnings of cancer, she decides to start by looking at the researchsubject information, since that is where most of the diagnosis information is. She again gets a summary using `count`:

In [12]:
ALLDATA.researchsubject.count.run()

Total execution time: 3633 ms


| system | count |
|---|---|
| GDC | 3591 |
| PDC | 104 |
| IDC | 1174 |

| primary_diagnosis_condition | count |
|---|---|
| Uterine Corpus Endometrial Carcinoma | 104 |
| Squamous Cell Neoplasms | 609 |
| Adenomas and Adenocarcinomas | 1672 |
| Complex Mixed and Stromal Neoplasms | 320 |
| Cystic, Mucinous and Serous Neoplasms | 487 |
|  | 1175 |
| Epithelial Neoplasms, NOS | 230 |
| Myomatous Neoplasms | 188 |
| Trophoblastic neoplasms | 13 |
| Neoplasms, NOS | 12 |

| primary_diagnosis_site | count |
|---|---|
| Corpus uteri | 780 |
| Uterus, NOS | 2000 |
| Cervix uteri | 915 |
| Uterus | 867 |
| Cervix | 307 |
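
The summary above is a set of per-column tallies. For readers more familiar with pandas, a similar shape can be produced locally with `value_counts`; the rows below are made up, and this is an analogy rather than the way `count` is computed.

```python
import pandas as pd

# Illustrative researchsubject-like rows (not real CDA records).
rs = pd.DataFrame(
    {
        "system": ["GDC", "GDC", "IDC", "PDC", "GDC"],
        "primary_diagnosis_site": [
            "Uterus, NOS", "Corpus uteri", "Uterus", "Uterus, NOS", "Cervix uteri",
        ],
    }
)

# One small table of counts per column, similar in shape to the summary above.
summary = {col: rs[col].value_counts() for col in ["system", "primary_diagnosis_site"]}
for counts in summary.values():
    print(counts, end="\n\n")
```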




## Refining Queries

<img src="https://github.com/CancerDataAggregator/readthedocs/blob/main/docs/Examples/images/Julia.png?raw=true" align="left"
	width="50" height="50" />
   
Browsing the primary_diagnosis_condition data, Julia notices that there are a large number of research subjects with Adenomas and Adenocarcinomas. Since Julia wants to look for common phenotypes in early cancers, she decides it might be easier to exclude the endocrine-related data, as it might have different mechanisms. So she adds a new filter to her query:

In [13]:
Noadeno = Q('ResearchSubject.primary_diagnosis_condition != "Adenomas and Adenocarcinomas"')

NoAdenoData = ALLDATA.AND(Noadeno)

NoAdenoData.researchsubject.count.run()

Total execution time: 3364 ms


| system | count |
|---|---|
| GDC | 1919 |
| PDC | 104 |
| IDC | 1174 |

| primary_diagnosis_condition | count |
|---|---|
| Cystic, Mucinous and Serous Neoplasms | 487 |
| Squamous Cell Neoplasms | 609 |
| Complex Mixed and Stromal Neoplasms | 320 |
| Not Reported | 12 |
| Uterine Corpus Endometrial Carcinoma | 104 |
| Neoplasms, NOS | 12 |
| Myomatous Neoplasms | 188 |
|  | 1175 |
| Epithelial Neoplasms, NOS | 230 |
| Mesonephromas | 5 |

| primary_diagnosis_site | count |
|---|---|
| Uterus, NOS | 962 |
| Cervix uteri | 688 |
| Corpus uteri | 373 |
| Uterus | 867 |
| Cervix | 307 |
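
A quick arithmetic check on the two researchsubject summaries suggests the new filter did what Julia intended: the drop in the total number of research subjects equals the count reported for the excluded condition. The sketch uses only numbers printed in the summaries above.

```python
# Counts copied from the two researchsubject summaries above.
before_by_system = {"GDC": 3591, "PDC": 104, "IDC": 1174}
after_by_system = {"GDC": 1919, "PDC": 104, "IDC": 1174}
before_by_site = {
    "Corpus uteri": 780, "Uterus, NOS": 2000, "Cervix uteri": 915,
    "Uterus": 867, "Cervix": 307,
}
after_by_site = {
    "Uterus, NOS": 962, "Cervix uteri": 688, "Corpus uteri": 373,
    "Uterus": 867, "Cervix": 307,
}
adeno_count = 1672  # "Adenomas and Adenocarcinomas" in the first summary

total_before = sum(before_by_system.values())
total_after = sum(after_by_system.values())
removed = total_before - total_after
print(total_before, total_after, removed)  # 4869 3197 1672
```

The per-site totals add up to the same figures, and every removed record comes from GDC, which is the only system whose count changed.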




<img src="https://github.com/CancerDataAggregator/readthedocs/blob/main/docs/Examples/images/Julia.png?raw=true" align="left"
	width="50" height="50" />
   
She then previews the actual metadata for researchsubject, subject, and file to make sure that they have all the information she will need for her work. Since she's mostly interested in the kinds of data available from each endpoint, the default preview of each is enough:

In [14]:
NoAdenoData.researchsubject.run().to_dataframe() # view the dataframe

Total execution time: 3486 ms


Unnamed: 0,id,identifier,member_of_research_project,primary_diagnosis_condition,primary_diagnosis_site,subject_id
0,1cb371d5-20ca-41c3-8cf8-0979fa7bd9e0,"[{'system': 'GDC', 'value': '1cb371d5-20ca-41c...",GENIE-MSK,"Epithelial Neoplasms, NOS",Corpus uteri,GENIE-MSK-P-0021035
1,26d54a63-24da-444f-bbb7-1f40d6335c78,"[{'system': 'GDC', 'value': '26d54a63-24da-444...",GENIE-MSK,"Epithelial Neoplasms, NOS",Corpus uteri,GENIE-MSK-P-0010283
2,3f24b940-868d-40b3-916f-268bd66b7a5c,"[{'system': 'GDC', 'value': '3f24b940-868d-40b...",GENIE-DFCI,"Soft Tissue Tumors and Sarcomas, NOS","Uterus, NOS",GENIE-DFCI-000076
3,474438a9-a134-43fd-8434-8a657b63b3db,"[{'system': 'GDC', 'value': '474438a9-a134-43f...",GENIE-MSK,"Cystic, Mucinous and Serous Neoplasms","Uterus, NOS",GENIE-MSK-P-0019659
4,4a02ccbd-ed6b-497c-9e36-b7575ee2ee5d,"[{'system': 'GDC', 'value': '4a02ccbd-ed6b-497...",GENIE-DFCI,"Cystic, Mucinous and Serous Neoplasms","Uterus, NOS",GENIE-DFCI-010402
...,...,...,...,...,...,...
95,8bb89c0f-0f04-485e-8e08-801042dfcba9,"[{'system': 'GDC', 'value': '8bb89c0f-0f04-485...",TCGA-UCEC,"Cystic, Mucinous and Serous Neoplasms",Corpus uteri,TCGA-EY-A3QX
96,8bf18925-f53b-48b5-8245-e5b6c6a1cfd5,"[{'system': 'GDC', 'value': '8bf18925-f53b-48b...",GENIE-UHN,Squamous Cell Neoplasms,Cervix uteri,GENIE-UHN-993502
97,9ff09e88-9897-4dcf-a99e-9f1bfbcecbf1,"[{'system': 'GDC', 'value': '9ff09e88-9897-4dc...",GENIE-UHN,"Epithelial Neoplasms, NOS",Corpus uteri,GENIE-UHN-776060
98,C3L-01604__cptac_ucec,"[{'system': 'IDC', 'value': 'C3L-01604__cptac_...",cptac_ucec,,Uterus,C3L-01604


---

<div class="cdadefine" style="background-color:#add9e5;color:black;padding:20px;">

<h3>ResearchSubject Field Definitions</h3>

<i>A research subject is the entity of interest in a specific research study or project, typically a human being or an animal, but can also be a device, group of humans or animals, or a tissue sample. Human research subjects are usually not traceable to a particular person to protect the subject's privacy. This entity plays the role of the case_id in existing data. An individual who participates in 3 studies will have 3 researchsubject IDs.</i>
    
<ul>
<li><b>id:</b> The 'logical' identifier of the entity in the system of record, e.g. a UUID. This 'id' is unique within a given system. The identified entity may have a different 'id' in a different system. For CDA, this is case_id.</li>
<li><b>identifier:</b> A 'business' identifier for the entity, typically as provided by an external system or authority, that persists across implementing systems (i.e. a 'logical' identifier). Uses a specialized, complex 'Identifier' data type to capture information about the source of the business identifier - or a URI expressed as a string to an existing entity.</li>
<li><b>identifier.system:</b> The system or namespace that defines the identifier.</li>
<li><b>identifier.value:</b> The value of the identifier, as defined by the system.</li>
<li><b>member_of_research_project:</b> A reference to the Study(s) of which this ResearchSubject is a member.</li>
<li><b>primary_diagnosis_condition:</b> The text term used to describe the type of malignant disease, as categorized by the World Health Organization's (WHO) <a href="https://www.who.int/standards/classifications/other-classifications/international-classification-of-diseases-for-oncology">International Classification of Diseases for Oncology</a> (ICD-O). This attribute represents the disease that qualified the subject for inclusion on the ResearchProject.</li>
<li><b>primary_diagnosis_site:</b> The text term used to describe the primary site of disease, as categorized by the World Health Organization's (WHO) <a href="https://www.who.int/standards/classifications/other-classifications/international-classification-of-diseases-for-oncology">International Classification of Diseases for Oncology</a> (ICD-O). This categorization groups cases into general categories. This attribute represents the primary site of disease that qualified the subject for inclusion on the ResearchProject.</li>
<li><b>subject_id:</b> The 'logical' identifier of the entity in the system of record, e.g. a UUID. This 'id' is unique within a given system. The identified entity may have a different 'id' in a different system. Can be joined to the `id` field from subject results.</li>
</ul>  

</div>
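
Because one individual can appear as several research subjects, it can be useful to count researchsubject ids per `subject_id`. A small pandas sketch with hypothetical rows (the ids and study names are invented for illustration):

```python
import pandas as pd

# Hypothetical researchsubject rows: subject "P1" took part in three studies.
rs = pd.DataFrame(
    {
        "id": ["rs-a", "rs-b", "rs-c", "rs-d"],
        "member_of_research_project": ["study-1", "study-2", "study-3", "study-1"],
        "subject_id": ["P1", "P1", "P1", "P2"],
    }
)

# Count distinct researchsubject ids per subject.
per_subject = rs.groupby("subject_id")["id"].nunique()
print(per_subject)
```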
    
---

In [15]:
NoAdenoData.subject.run().to_dataframe() # view the dataframe

Total execution time: 3862 ms


Unnamed: 0,id,identifier,species,sex,race,ethnicity,days_to_birth,subject_associated_project,vital_status,days_to_death,cause_of_death
0,AD10521,"[{'system': 'GDC', 'value': 'AD10521'}]",homo sapiens,female,not reported,not reported,,[FM-AD],Not Reported,,
1,AD15235,"[{'system': 'GDC', 'value': 'AD15235'}]",homo sapiens,female,not reported,not reported,,[FM-AD],Not Reported,,
2,AD16470,"[{'system': 'GDC', 'value': 'AD16470'}]",homo sapiens,female,not reported,not reported,,[FM-AD],Not Reported,,
3,C3N-00860,"[{'system': 'IDC', 'value': 'C3N-00860'}]",homo sapiens,,,,,[cptac_ucec],,,
4,C3N-00872,"[{'system': 'IDC', 'value': 'C3N-00872'}]",homo sapiens,,,,,[cptac_ucec],,,
...,...,...,...,...,...,...,...,...,...,...,...
95,GENIE-MSK-P-0000382,"[{'system': 'GDC', 'value': 'GENIE-MSK-P-00003...",homo sapiens,female,white,not hispanic or latino,-23010.0,[GENIE-MSK],Not Reported,,
96,GENIE-MSK-P-0001564,"[{'system': 'GDC', 'value': 'GENIE-MSK-P-00015...",homo sapiens,female,white,not hispanic or latino,-18993.0,[GENIE-MSK],Not Reported,,
97,GENIE-MSK-P-0004976,"[{'system': 'GDC', 'value': 'GENIE-MSK-P-00049...",homo sapiens,female,Unknown,not hispanic or latino,-20454.0,[GENIE-MSK],Not Reported,,
98,GENIE-MSK-P-0011101,"[{'system': 'GDC', 'value': 'GENIE-MSK-P-00111...",homo sapiens,female,white,not hispanic or latino,-24837.0,[GENIE-MSK],Not Reported,,


---

<div class="cdadefine" style="background-color:#add9e5;color:black;padding:20px;">

<h3>Subject Field Definitions</h3>

<i>A patient entity captures the study-independent metadata for research subjects. Human research subjects are usually not traceable to a particular person to protect the subject's privacy.</i>

    
<ul>
<li><b>id:</b> The 'logical' identifier of the entity in the system of record, e.g. a UUID. This 'id' is unique within a given system. The identified entity may have a different 'id' in a different system.</li>
<li><b>identifier:</b> A 'business' identifier for the entity, typically as provided by an external system or authority, that persists across implementing systems (i.e. a 'logical' identifier). Uses a specialized, complex 'Identifier' data type to capture information about the source of the business identifier - or a URI expressed as a string to an existing entity.</li>
<li><b>identifier.system:</b> The system or namespace that defines the identifier.</li>
<li><b>identifier.value:</b> The value of the identifier, as defined by the system.</li>
<li><b>species:</b> The taxonomic group (e.g. species) of the patient. For MVP, since taxonomy vocabulary is consistent between GDC and PDC, using text. Ultimately, this will be a term returned by the vocabulary service.</li>
<li><b>sex:</b> The biologic character or quality that distinguishes male and female from one another as expressed by analysis of the person's gonadal, morphologic (internal and external), chromosomal, and hormonal characteristics.</li>
<li><b>race:</b> An arbitrary classification of a taxonomic group that is a division of a species. It usually arises as a consequence of geographical isolation within a species and is characterized by shared heredity, physical attributes and behavior, and in the case of humans, by common history, nationality, or geographic distribution. The provided values are based on the categories defined by the U.S. Office of Management and Budget and used by the U.S. Census Bureau.</li>
<li><b>ethnicity:</b> An individual's self-described social and cultural grouping, specifically whether an individual describes themselves as Hispanic or Latino. The provided values are based on the categories defined by the U.S. Office of Management and Budget and used by the U.S. Census Bureau.</li>
<li><b>days_to_birth:</b> Number of days between the date used for index and the date from a person's date of birth represented as a calculated negative number of days.</li>
<li><b>subject_associated_project:</b> The list of Projects associated with the Subject.</li>
<li><b>vital_status:</b> Coded value indicating the state or condition of being living or deceased; also includes the case where the vital status is unknown.</li>
<li><b>days_to_death:</b> Number of days between the date used for index and the date from a person's date of death represented as a calculated number of days.</li>
<li><b>cause_of_death:</b> Coded value indicating the circumstance or condition that results in the death of the subject.</li>
</ul>  

</div>
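
Since `days_to_birth` is a negative number of days relative to the index date, an approximate age at that index date can be derived from it. A minimal sketch, assuming an average year of 365.25 days; `age_in_years` is a hypothetical helper, and what the index date represents depends on the source dataset.

```python
DAYS_PER_YEAR = 365.25  # average year length, an approximation

def age_in_years(days_to_birth):
    """Convert a negative days_to_birth value into an approximate age in years."""
    if days_to_birth is None or days_to_birth != days_to_birth:  # None or NaN
        return None
    return -days_to_birth / DAYS_PER_YEAR

# -23010.0 is one of the days_to_birth values shown in the subject preview above.
age = age_in_years(-23010.0)
print(round(age, 1))
```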
    
---

In [16]:
NoAdenoData.file.run().to_dataframe() # view the dataframe

Total execution time: 4030 ms


Unnamed: 0,id,identifier,label,data_category,data_type,file_format,associated_project,drs_uri,byte_size,checksum,data_modality,imaging_modality,dbgap_accession_number,imaging_series,researchsubject_specimen_id,researchsubject_id,subject_id
0,031aea12-2e34-42d9-8aeb-2f279db4bc96,"[{'system': 'GDC', 'value': '031aea12-2e34-42d...",7357a5af-22c2-47db-9e8f-cf9573253e58.wgs.BRASS...,Somatic Structural Variation,Structural Rearrangement,BEDPE,CGCI-HTMCP-CC,drs://dg.4DFC:031aea12-2e34-42d9-8aeb-2f279db4...,104470,29ea3df3c376bd588ad81b25da485f72,Genomic,,phs000528,,c9444e17-2283-4cc2-81be-a247434cc9f2,f1b9e9d1-e08f-47be-9584-f968ca0e507c,HTMCP-03-06-02242
1,04780388-c545-4c11-a495-215e16a19420,"[{'system': 'GDC', 'value': '04780388-c545-4c1...",a9873ccc-fb22-4ca1-a4d7-c4905a4ab339.wgs.BRASS...,Somatic Structural Variation,Structural Rearrangement,BEDPE,CGCI-HTMCP-CC,drs://dg.4DFC:04780388-c545-4c11-a495-215e16a1...,119886,9ae5a4d8e330882aae98e2f5b9e1aa43,Genomic,,phs000528,,de873ade-747a-4c4c-92a3-444d4b3f1447,f33eb114-1258-4fde-be0c-4a245f58e1ce,HTMCP-03-06-02128
2,0b5cd4dc-5081-466a-9bcc-bc717b3254af,"[{'system': 'GDC', 'value': '0b5cd4dc-5081-466...",2db011f6-0005-4fab-ad90-a0e02afc1a0b.wgs.BRASS...,Somatic Structural Variation,Structural Rearrangement,VCF,CGCI-HTMCP-CC,drs://dg.4DFC:0b5cd4dc-5081-466a-9bcc-bc717b32...,67148,724e8649e0213a3ff98e4f47cffe324a,Genomic,,phs000528,,a6dca3aa-fb41-49b4-a3c2-68c55c1a1e50,226914c5-e486-4041-9b0a-83ba8baae0e7,HTMCP-03-06-02003
3,0f6bff1c-a0e0-47f6-b4a9-6eff171b7e82,"[{'system': 'GDC', 'value': '0f6bff1c-a0e0-47f...",e05cb0b1-3d3b-47d0-b513-aeeba47b313d.wgs.BRASS...,Somatic Structural Variation,Structural Rearrangement,BEDPE,CGCI-HTMCP-CC,drs://dg.4DFC:0f6bff1c-a0e0-47f6-b4a9-6eff171b...,106577,d7208a402ce015d4c5f8cc7ee5aaf290,Genomic,,phs000528,,dc61fea6-cee3-48dc-9ff1-27a4e635d665,29754644-2d3d-4e04-830c-4b1c845df651,HTMCP-03-06-02180
4,213b10f5-f321-4783-becb-d7732e356d47,"[{'system': 'GDC', 'value': '213b10f5-f321-478...",ba783552-1267-45d8-8bb9-3c1ef687a6ef.wgs.BRASS...,Somatic Structural Variation,Structural Rearrangement,BEDPE,CGCI-HTMCP-CC,drs://dg.4DFC:213b10f5-f321-4783-becb-d7732e35...,60762,bfd55fb0a72927a01eda936547528967,Genomic,,phs000528,,75a6157f-7644-495b-b834-9e9c5adbcc89,74b5fdf0-acb1-446e-a604-df12010a7384,HTMCP-03-06-02097
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,613e2cfc-83b8-40b1-b22c-41fa6ea92364,"[{'system': 'GDC', 'value': '613e2cfc-83b8-40b...",74dcdb96-608d-4768-94e7-10d3203d0721.wgs.BRASS...,Somatic Structural Variation,Structural Rearrangement,BEDPE,CGCI-HTMCP-CC,drs://dg.4DFC:613e2cfc-83b8-40b1-b22c-41fa6ea9...,186690,a07299bd5a8e61454058eae5613c521b,Genomic,,phs000528,,2a0ae2ba-f41b-4365-a56e-5f9ca43ab956,29d2a6dd-fb70-46ca-bfa3-0a00025970b7,HTMCP-03-06-02428
96,6c97257c-13cf-48b8-a2a1-01aceab7bb5b,"[{'system': 'GDC', 'value': '6c97257c-13cf-48b...",dc304249-7be1-4035-800a-ade9b2b5fc67.wgs.BRASS...,Somatic Structural Variation,Structural Rearrangement,BEDPE,CGCI-HTMCP-CC,drs://dg.4DFC:6c97257c-13cf-48b8-a2a1-01aceab7...,102572,dbceb045a7533e9ed0d0338c3efbd103,Genomic,,phs000528,,216980f2-62d2-464e-9fec-7dc1986debec,226914c5-e486-4041-9b0a-83ba8baae0e7,HTMCP-03-06-02003
97,7aaae0a8-ee16-4817-8e02-40dc032f20db,"[{'system': 'GDC', 'value': '7aaae0a8-ee16-481...",852776ab-970b-4f18-8f68-4efac4044955.wgs.BRASS...,Somatic Structural Variation,Structural Rearrangement,VCF,CGCI-HTMCP-CC,drs://dg.4DFC:7aaae0a8-ee16-4817-8e02-40dc032f...,71685,8a7c00e35763dd3b8335a39a64951316,Genomic,,phs000528,,631f7f49-3aed-45cc-82d0-75b9dfb015ef,59673b19-2e14-441f-9624-626ed0a3530b,HTMCP-03-06-02125
98,9d3ec32e-f978-46b6-86a7-0f0f222e7b66,"[{'system': 'GDC', 'value': '9d3ec32e-f978-46b...",786bfbd3-b161-402e-bfaa-af3646d7cf8e.wgs.BRASS...,Somatic Structural Variation,Structural Rearrangement,VCF,CGCI-HTMCP-CC,drs://dg.4DFC:9d3ec32e-f978-46b6-86a7-0f0f222e...,600398,7f430f95a5b784e75086768a14e69779,Genomic,,phs000528,,c7b71dc2-019f-4739-b03d-9cb2a4c984cc,ec6c18ae-1e62-4c7b-84b9-8b795b9da775,HTMCP-03-06-02109



---

<div class="cdadefine" style="background-color:#add9e5;color:black;padding:20px;">

<h3>File Field Definitions</h3>

<i>A file is an information-bearing electronic object that contains a physical embodiment of some information using a particular character encoding.</i>

    
<ul>
  <li><b>id:</b> The 'logical' identifier of the entity in the system of record, e.g. a UUID. This 'id' is unique within a given system. The identified entity may have a different 'id' in a different system.</li>
<li><b>identifier:</b> A 'business' identifier for the entity, typically as provided by an external system or authority, that persists across implementing systems (i.e. a 'logical' identifier). Uses a specialized, complex 'Identifier' data type to capture information about the source of the business identifier - or a URI expressed as a string to an existing entity.</li>
<li><b>identifier.system:</b> The system or namespace that defines the identifier.</li>
<li><b>identifier.value:</b> The value of the identifier, as defined by the system.</li>
  <li><b>label:</b> Short name or abbreviation for dataset. Maps to rdfs:label.</li>
  <li><b>data_category:</b> Broad categorization of the contents of the data file.</li>
  <li><b>data_type:</b> Specific content type of the data file.</li>
  <li><b>file_format:</b> Format of the data files.</li>
  <li><b>associated_project:</b> A reference to the Project(s) of which this ResearchSubject is a member. The associated_project may be embedded using the ref definition or may be a reference to the id for the Project - or a URI expressed as a string to an existing entity.</li>
  <li><b>drs_uri:</b> A string of characters used to identify a resource on the Data Repository Service (DRS). Can be used to retrieve this specific file from a server.</li>
  <li><b>byte_size:</b> Size of the file in bytes. Maps to dcat:byteSize.</li>
  <li><b>checksum:</b> The md5 value for the file. A digit representing the sum of the correct digits in a piece of stored or transmitted digital data, against which later comparisons can be made to detect errors in the data.</li>
  <li><b>data_modality:</b> Data modality describes the biological nature of the information gathered as the result of an Activity, independent of the technology or methods used to produce the information. Always one of "Genomic", "Proteomic", or "Imaging".</li>
  <li><b>imaging_modality:</b> An imaging modality describes the imaging equipment and/or method used to acquire certain structural or functional information about the body. These include but are not limited to computed tomography (CT) and magnetic resonance imaging (MRI). Taken from the DICOM standard.</li>
  <li><b>dbgap_accession_number:</b> The dbgap accession number for the project.</li>
</ul>  

</div>
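
The `byte_size` and `checksum` fields can be used to verify a file after it has been downloaded. A sketch using Python's standard `hashlib`, with a small temporary file standing in for a real download; `md5_of_file` and `matches_metadata` are hypothetical helpers, and how the file is fetched is outside the scope of this example.

```python
import hashlib
import os
import tempfile

def md5_of_file(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_metadata(path, byte_size, checksum):
    """Check a local file against the byte_size and checksum metadata values."""
    return os.path.getsize(path) == byte_size and md5_of_file(path) == checksum.lower()

# Demonstration with a small temporary file standing in for a download.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
demo_path = tmp.name
ok = matches_metadata(demo_path, 5, "5d41402abc4b2a76b9719d911017c592")
os.remove(demo_path)
print(ok)
```

Reading in chunks keeps memory use flat even for large genomic or imaging files.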
    
---


## Working with Results (pagination)

<img src="https://github.com/CancerDataAggregator/readthedocs/blob/main/docs/Examples/images/Julia.png?raw=true" align="left"
	width="50" height="50" />
   
Finally, Julia wants to save these results to use for the future. Since the preview dataframes only show the first 100 results of each search, she uses the `paginator` function to get all the data from the subject and researchsubject endpoints into their own dataframes:

In [17]:
researchsubs = NoAdenoData.researchsubject.run()
rsdf = pd.DataFrame()
for i in researchsubs.paginator(to_df=True):
    rsdf = pd.concat([rsdf, i])

Total execution time: 3579 ms


In [18]:
subs = NoAdenoData.subject.run()
subsdf = pd.DataFrame()
for i in subs.paginator(to_df=True):
    subsdf = pd.concat([subsdf, i])

Total execution time: 3520 ms
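
Concatenating inside the loop, as in the cells above, works. For larger results, a common pandas pattern is to collect the pages first and concatenate once. The sketch below uses a hypothetical generator, `fake_pages`, in place of a real paginator so that it runs without a connection to the CDA.

```python
import pandas as pd

def fake_pages():
    """Hypothetical stand-in for a paginator: yields one small dataframe per page."""
    yield pd.DataFrame({"id": ["a", "b"]})
    yield pd.DataFrame({"id": ["c", "d"]})
    yield pd.DataFrame({"id": ["e"]})

# Gather every page first, then concatenate once.
pages = list(fake_pages())
combined = pd.concat(pages, ignore_index=True)
print(len(combined))
```

Passing `ignore_index=True` renumbers the rows from 0; without it, each page keeps its own row labels, so the same labels repeat across pages.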


In [19]:
rsdf # view the researchsubject dataframe

Unnamed: 0,id,identifier,member_of_research_project,primary_diagnosis_condition,primary_diagnosis_site,subject_id
0,1cb371d5-20ca-41c3-8cf8-0979fa7bd9e0,"[{'system': 'GDC', 'value': '1cb371d5-20ca-41c...",GENIE-MSK,"Epithelial Neoplasms, NOS",Corpus uteri,GENIE-MSK-P-0021035
1,26d54a63-24da-444f-bbb7-1f40d6335c78,"[{'system': 'GDC', 'value': '26d54a63-24da-444...",GENIE-MSK,"Epithelial Neoplasms, NOS",Corpus uteri,GENIE-MSK-P-0010283
2,3f24b940-868d-40b3-916f-268bd66b7a5c,"[{'system': 'GDC', 'value': '3f24b940-868d-40b...",GENIE-DFCI,"Soft Tissue Tumors and Sarcomas, NOS","Uterus, NOS",GENIE-DFCI-000076
3,474438a9-a134-43fd-8434-8a657b63b3db,"[{'system': 'GDC', 'value': '474438a9-a134-43f...",GENIE-MSK,"Cystic, Mucinous and Serous Neoplasms","Uterus, NOS",GENIE-MSK-P-0019659
4,4a02ccbd-ed6b-497c-9e36-b7575ee2ee5d,"[{'system': 'GDC', 'value': '4a02ccbd-ed6b-497...",GENIE-DFCI,"Cystic, Mucinous and Serous Neoplasms","Uterus, NOS",GENIE-DFCI-010402
...,...,...,...,...,...,...
92,bf1f593d-118a-11e9-afb9-0a9c39d33490,"[{'system': 'PDC', 'value': 'bf1f593d-118a-11e...",CPTAC3-Discovery,Uterine Corpus Endometrial Carcinoma,"Uterus, NOS",C3N-01217
93,de7dabc7-ba48-4ac9-828c-02ec47351d6a,"[{'system': 'GDC', 'value': 'de7dabc7-ba48-4ac...",GENIE-DFCI,"Cystic, Mucinous and Serous Neoplasms","Uterus, NOS",GENIE-DFCI-004071
94,e1cb5a54-594b-4b2a-a9da-e206e2103c5c,"[{'system': 'GDC', 'value': 'e1cb5a54-594b-4b2...",FM-AD,"Cystic, Mucinous and Serous Neoplasms","Uterus, NOS",AD14736
95,e40bbfbd-b273-4b17-ac5b-d57125bc9ba8,"[{'system': 'GDC', 'value': 'e40bbfbd-b273-4b1...",GENIE-DFCI,Trophoblastic neoplasms,"Uterus, NOS",GENIE-DFCI-006773


In [20]:
subsdf # view the subject dataframe

Unnamed: 0,id,identifier,species,sex,race,ethnicity,days_to_birth,subject_associated_project,vital_status,days_to_death,cause_of_death
0,AD10521,"[{'system': 'GDC', 'value': 'AD10521'}]",homo sapiens,female,not reported,not reported,,[FM-AD],Not Reported,,
1,AD15235,"[{'system': 'GDC', 'value': 'AD15235'}]",homo sapiens,female,not reported,not reported,,[FM-AD],Not Reported,,
2,AD16470,"[{'system': 'GDC', 'value': 'AD16470'}]",homo sapiens,female,not reported,not reported,,[FM-AD],Not Reported,,
3,C3N-00860,"[{'system': 'IDC', 'value': 'C3N-00860'}]",homo sapiens,,,,,[cptac_ucec],,,
4,C3N-00872,"[{'system': 'IDC', 'value': 'C3N-00872'}]",homo sapiens,,,,,[cptac_ucec],,,
...,...,...,...,...,...,...,...,...,...,...,...
4,TCGA-EK-A3GM,"[{'system': 'GDC', 'value': 'TCGA-EK-A3GM'}, {...",homo sapiens,female,white,hispanic or latino,-23879.0,"[tcga_cesc, TCGA-CESC]",Alive,,
5,TCGA-EY-A1GH,"[{'system': 'GDC', 'value': 'TCGA-EY-A1GH'}, {...",homo sapiens,female,white,not hispanic or latino,-25802.0,"[tcga_ucec, TCGA-UCEC]",Alive,,
6,TCGA-HB-A43Z,"[{'system': 'GDC', 'value': 'TCGA-HB-A43Z'}, {...",homo sapiens,female,white,not hispanic or latino,-21233.0,"[tcga_sarc, TCGA-SARC]",Alive,,
7,TCGA-KJ-A3U4,"[{'system': 'GDC', 'value': 'TCGA-KJ-A3U4'}, {...",homo sapiens,female,white,hispanic or latino,-20392.0,"[tcga_ucec, TCGA-UCEC]",Alive,,


## Merging Results across Endpoints

<img src="https://github.com/CancerDataAggregator/readthedocs/blob/main/docs/Examples/images/Julia.png?raw=true" align="left"
	width="50" height="50" />
   
Then Julia uses the `subject_id` field of the researchsubject results and the `id` field of the subject results to merge them into one big dataset. She also specifies that columns appearing in both tables should be kept, with a suffix added to each name to show which table it came from. This will help her check that her merge worked correctly:

In [21]:
allmetadata = pd.merge(rsdf,
                subsdf,
                left_on="subject_id",
                right_on='id',
                suffixes=("_rs", "_sub"))

allmetadata

Unnamed: 0,id_rs,identifier_rs,member_of_research_project,primary_diagnosis_condition,primary_diagnosis_site,subject_id,id_sub,identifier_sub,species,sex,race,ethnicity,days_to_birth,subject_associated_project,vital_status,days_to_death,cause_of_death
0,1cb371d5-20ca-41c3-8cf8-0979fa7bd9e0,"[{'system': 'GDC', 'value': '1cb371d5-20ca-41c...",GENIE-MSK,"Epithelial Neoplasms, NOS",Corpus uteri,GENIE-MSK-P-0021035,GENIE-MSK-P-0021035,"[{'system': 'GDC', 'value': 'GENIE-MSK-P-00210...",homo sapiens,female,white,not hispanic or latino,-21915.0,[GENIE-MSK],Not Reported,,
1,26d54a63-24da-444f-bbb7-1f40d6335c78,"[{'system': 'GDC', 'value': '26d54a63-24da-444...",GENIE-MSK,"Epithelial Neoplasms, NOS",Corpus uteri,GENIE-MSK-P-0010283,GENIE-MSK-P-0010283,"[{'system': 'GDC', 'value': 'GENIE-MSK-P-00102...",homo sapiens,female,white,not hispanic or latino,-28854.0,[GENIE-MSK],Not Reported,,
2,3f24b940-868d-40b3-916f-268bd66b7a5c,"[{'system': 'GDC', 'value': '3f24b940-868d-40b...",GENIE-DFCI,"Soft Tissue Tumors and Sarcomas, NOS","Uterus, NOS",GENIE-DFCI-000076,GENIE-DFCI-000076,"[{'system': 'GDC', 'value': 'GENIE-DFCI-000076'}]",homo sapiens,female,white,not hispanic or latino,-17897.0,[GENIE-DFCI],Not Reported,,
3,474438a9-a134-43fd-8434-8a657b63b3db,"[{'system': 'GDC', 'value': '474438a9-a134-43f...",GENIE-MSK,"Cystic, Mucinous and Serous Neoplasms","Uterus, NOS",GENIE-MSK-P-0019659,GENIE-MSK-P-0019659,"[{'system': 'GDC', 'value': 'GENIE-MSK-P-00196...",homo sapiens,female,white,not hispanic or latino,-21184.0,[GENIE-MSK],Not Reported,,
4,4a02ccbd-ed6b-497c-9e36-b7575ee2ee5d,"[{'system': 'GDC', 'value': '4a02ccbd-ed6b-497...",GENIE-DFCI,"Cystic, Mucinous and Serous Neoplasms","Uterus, NOS",GENIE-DFCI-010402,GENIE-DFCI-010402,"[{'system': 'GDC', 'value': 'GENIE-DFCI-010402'}]",homo sapiens,female,white,not hispanic or latino,-23010.0,[GENIE-DFCI],Not Reported,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3192,a79ece94-3030-4e33-a9aa-623f430cd379,"[{'system': 'GDC', 'value': 'a79ece94-3030-4e3...",CGCI-HTMCP-CC,Squamous Cell Neoplasms,Cervix uteri,HTMCP-03-06-02103,HTMCP-03-06-02103,"[{'system': 'GDC', 'value': 'HTMCP-03-06-02103'}]",homo sapiens,female,Unknown,Unknown,,[CGCI-HTMCP-CC],Dead,415.0,Unknown
3193,de7dabc7-ba48-4ac9-828c-02ec47351d6a,"[{'system': 'GDC', 'value': 'de7dabc7-ba48-4ac...",GENIE-DFCI,"Cystic, Mucinous and Serous Neoplasms","Uterus, NOS",GENIE-DFCI-004071,GENIE-DFCI-004071,"[{'system': 'GDC', 'value': 'GENIE-DFCI-004071'}]",homo sapiens,female,white,not hispanic or latino,-24471.0,[GENIE-DFCI],Not Reported,,
3194,e1cb5a54-594b-4b2a-a9da-e206e2103c5c,"[{'system': 'GDC', 'value': 'e1cb5a54-594b-4b2...",FM-AD,"Cystic, Mucinous and Serous Neoplasms","Uterus, NOS",AD14736,AD14736,"[{'system': 'GDC', 'value': 'AD14736'}]",homo sapiens,female,not reported,not reported,,[FM-AD],Not Reported,,
3195,e40bbfbd-b273-4b17-ac5b-d57125bc9ba8,"[{'system': 'GDC', 'value': 'e40bbfbd-b273-4b1...",GENIE-DFCI,Trophoblastic neoplasms,"Uterus, NOS",GENIE-DFCI-006773,GENIE-DFCI-006773,"[{'system': 'GDC', 'value': 'GENIE-DFCI-006773'}]",homo sapiens,female,white,hispanic or latino,-10227.0,[GENIE-DFCI],Not Reported,,
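
To see how the suffixes behave, here is a small sketch of the same kind of merge on made-up tables. The frames `rs` and `sub` and their values are illustrative assumptions, not CDA data. The optional `validate="many_to_one"` argument is an extra safeguard that is not part of Julia's call: it makes pandas raise an error if a subject id appears more than once in the right-hand table.

```python
import pandas as pd

# Toy stand-ins for rsdf and subsdf (values are made up for illustration).
rs = pd.DataFrame({
    "id": ["rs-1", "rs-2", "rs-3"],
    "subject_id": ["S1", "S1", "S2"],
})
sub = pd.DataFrame({
    "id": ["S1", "S2"],
    "sex": ["female", "female"],
})

merged = pd.merge(
    rs,
    sub,
    left_on="subject_id",
    right_on="id",
    suffixes=("_rs", "_sub"),  # applied only to column names present in both tables
    validate="many_to_one",    # raises MergeError if a subject id repeats in `sub`
)

print(merged.columns.tolist())
```

Only the shared column name `id` is renamed (to `id_rs` and `id_sub`); `subject_id` and `sex` keep their original names.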


<img src="https://github.com/CancerDataAggregator/readthedocs/blob/main/docs/Examples/images/Julia.png?raw=true" align="left"
	width="50" height="50" />
   
`subject_id` from the research subject results seems to perfectly match the id data from the subject table, `id_sub`. Julia then checks to see that her dataframe is the right size. She had 3197 researchsubject rows, so she expects 3197 rows here as well:

In [35]:
allmetadata.count()

id_rs                          3197
identifier_rs                  3197
member_of_research_project     3197
primary_diagnosis_condition    2022
primary_diagnosis_site         3197
subject_id                     3197
id_sub                         3197
identifier_sub                 3197
species                        3197
sex                            3025
race                           3025
ethnicity                      3025
days_to_birth                  2740
subject_associated_project     3197
vital_status                   3025
days_to_death                   482
cause_of_death                  309
dtype: int64
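
Note that `count()` reports non-null values per column, which is why some columns show fewer than 3197, while `len()` reports the number of rows. As a rough sketch of a more explicit check, a left merge with `indicator=True` keeps every researchsubject row and labels the ones that found no matching subject. The toy frames below are assumptions for illustration only:

```python
import pandas as pd

# Made-up data: research subject "rs-3" points at a subject that is missing from `sub`.
rs = pd.DataFrame({"id": ["rs-1", "rs-2", "rs-3"], "subject_id": ["S1", "S2", "S9"]})
sub = pd.DataFrame({"id": ["S1", "S2"]})

checked = pd.merge(
    rs,
    sub,
    how="left",
    left_on="subject_id",
    right_on="id",
    suffixes=("_rs", "_sub"),
    indicator=True,  # adds a `_merge` column: "both", "left_only" or "right_only"
)

# how="left" keeps every research subject row, so the row counts must agree.
assert len(checked) == len(rs)

# Rows tagged "left_only" had no matching subject and would vanish in an inner merge.
unmatched = checked.loc[checked["_merge"] == "left_only", "subject_id"].tolist()
print(unmatched)
```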

<img src="https://github.com/CancerDataAggregator/readthedocs/blob/main/docs/Examples/images/Julia.png?raw=true" align="left"
	width="50" height="50" />
   
Satisfied with her results, Julia saves the data to a CSV file so she can browse it in Excel:

In [22]:
allmetadata.to_csv("allmetadata.csv")
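
Two details of CSV export are worth knowing when the file is read back later. The sketch below uses a made-up frame and an in-memory buffer instead of a file on disk. It shows that the default `to_csv` writes the row index as an extra unnamed column, and that list-valued cells come back as plain strings:

```python
import io

import pandas as pd

# Made-up frame with a list-valued column, similar in shape to subject_associated_project.
df = pd.DataFrame({
    "id": ["S1", "S2"],
    "subject_associated_project": [["tcga_ucec", "TCGA-UCEC"], ["FM-AD"]],
})

# Default to_csv writes the row index as a first column with an empty header.
with_index = pd.read_csv(io.StringIO(df.to_csv()))

# index=False leaves the row index out of the file.
without_index = pd.read_csv(io.StringIO(df.to_csv(index=False)))

print(with_index.columns.tolist())
print(without_index.columns.tolist())

# Lists are stored as their text representation, so they are read back as str.
print(type(without_index.loc[0, "subject_associated_project"]))
```

If the row numbers are not needed, `allmetadata.to_csv("allmetadata.csv", index=False)` avoids the extra column.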

<img src="https://github.com/CancerDataAggregator/readthedocs/blob/main/docs/Examples/images/Julia.png?raw=true" align="left"
	width="50" height="50" />
   
Julia knows from her subject count summary that there are more than 200,000 files associated with her subjects, which is likely far more than she needs. To help her decide what files she wants, Julia uses endpoint chaining to get summary information about the files that are assigned to researchsubjects for her search criteria:


In [23]:
NoAdenoData.researchsubject.file.count.run()

Total execution time: 3409 ms


system,count
IDC,264429
PDC,2560
GDC,31274

data_category,count
Imaging,264429
Simple Nucleotide Variation,11745
Structural Variation,2192
Sequencing Reads,4142
Biospecimen,2866
Transcriptome Profiling,2820
DNA Methylation,1947
Peptide Spectral Matches,1280
Copy Number Variation,4079
Processed Mass Spectra,640

file_format,count
DICOM,264429
MAF,5235
TSV,3980
SVS,1111
TXT,4827
IDAT,1298
BAM,4142
BCR SSF XML,517
BEDPE,1504
BCR XML,1217

data_type,count
Gene Level Copy Number Scores,809
Raw Simple Somatic Mutation,3397
,264429
Gene Expression Quantification,710
Text,640
Gene Level Copy Number,637
Annotated Somatic Mutation,6043
Biospecimen Supplement,1755
Aggregated Somatic Mutation,732
Aligned Reads,4142




<img src="https://github.com/CancerDataAggregator/readthedocs/blob/main/docs/Examples/images/Julia.png?raw=true" align="left"
	width="50" height="50" />
    
Julia decides that a good place to start would be with Slide Images. There are only 1111, so she should be able to quickly scan through them over the next few days and see if they will be useful. So she adds one more filter on her search:

In [36]:
JustSlides = Q('file.data_type = "Slide Image"')
NoadenoJustSlides = NoAdenoData.AND(JustSlides)
NoadenoJustSlides.researchsubject.file.count.run()

Total execution time: 3569 ms


system,count
GDC,1111

data_category,count
Biospecimen,1111

file_format,count
SVS,1111

data_type,count
Slide Image,1111




<img src="https://github.com/CancerDataAggregator/readthedocs/blob/main/docs/Examples/images/Julia.png?raw=true" align="left"
	width="50" height="50" />
    
Finally, Julia uses the `paginator` function again to get all the slide files, and merges her metadata file with this file information. This way she will be able to review what phenotypes each slide is associated with:

In [37]:
slides = NoadenoJustSlides.researchsubject.file.run()
slidesdf = pd.DataFrame()
for i in slides.paginator(to_df=True):
    slidesdf = pd.concat([slidesdf, i])


Total execution time: 3734 ms


In [38]:
slidemetadata = pd.merge(slidesdf, 
                         allmetadata, 
                         left_on=["subject_id", "researchsubject_id"],
                         right_on=["subject_id", "id_rs"],
                         suffixes=("_slide", "_all"))
slidemetadata

Unnamed: 0,id,identifier,label,data_category,data_type,file_format,associated_project,drs_uri,byte_size,checksum,...,identifier_sub,species,sex,race,ethnicity,days_to_birth,subject_associated_project,vital_status,days_to_death,cause_of_death
0,31dbb27e-40d9-421d-9dc1-e46014798478,"[{'system': 'GDC', 'value': '31dbb27e-40d9-421...",TCGA-EA-A3HS-01A-01-TSA.70C289D1-DA4E-4351-AF6...,Biospecimen,Slide Image,SVS,TCGA-CESC,drs://dg.4DFC:31dbb27e-40d9-421d-9dc1-e4601479...,397383025,1bb62627b5439493aa58da2163adfcee,...,"[{'system': 'GDC', 'value': 'TCGA-EA-A3HS'}, {...",homo sapiens,female,white,not hispanic or latino,-13117.0,"[tcga_cesc, TCGA-CESC]",Alive,,
1,c89ec421-a3f9-4f88-85a1-eaf0eb8337e5,"[{'system': 'GDC', 'value': 'c89ec421-a3f9-4f8...",TCGA-EA-A3HS-01Z-00-DX1.F74FF288-29EB-4AEF-BC6...,Biospecimen,Slide Image,SVS,TCGA-CESC,drs://dg.4DFC:c89ec421-a3f9-4f88-85a1-eaf0eb83...,1899807371,12bedf131ac5414599272920366ec9cf,...,"[{'system': 'GDC', 'value': 'TCGA-EA-A3HS'}, {...",homo sapiens,female,white,not hispanic or latino,-13117.0,"[tcga_cesc, TCGA-CESC]",Alive,,
2,56af89c9-5059-425c-bf6a-49f50b61e128,"[{'system': 'GDC', 'value': '56af89c9-5059-425...",TCGA-C5-A1BQ-01C-01-TS1.98EDB638-ED5E-47FB-833...,Biospecimen,Slide Image,SVS,TCGA-CESC,drs://dg.4DFC:56af89c9-5059-425c-bf6a-49f50b61...,157482701,00dd7016b7f336c716b7945e511b316a,...,"[{'system': 'GDC', 'value': 'TCGA-C5-A1BQ'}, {...",homo sapiens,female,white,not hispanic or latino,-24018.0,"[tcga_cesc, TCGA-CESC]",Dead,604.0,
3,b6c8e60d-65a5-40af-83ca-1d9b5c4889ea,"[{'system': 'GDC', 'value': 'b6c8e60d-65a5-40a...",TCGA-C5-A1BQ-01Z-00-DX1.72164F1F-956F-4C46-A7B...,Biospecimen,Slide Image,SVS,TCGA-CESC,drs://dg.4DFC:b6c8e60d-65a5-40af-83ca-1d9b5c48...,457023672,6c8c02b7002ec849793a1ab2051d1b10,...,"[{'system': 'GDC', 'value': 'TCGA-C5-A1BQ'}, {...",homo sapiens,female,white,not hispanic or latino,-24018.0,"[tcga_cesc, TCGA-CESC]",Dead,604.0,
4,6779a257-513c-4ff8-ae34-28db3e832265,"[{'system': 'GDC', 'value': '6779a257-513c-4ff...",TCGA-3B-A9I1-01A-01-TS1.D5A4C7EE-A821-4792-A41...,Biospecimen,Slide Image,SVS,TCGA-SARC,drs://dg.4DFC:6779a257-513c-4ff8-ae34-28db3e83...,93966751,c961d45bd551695d8327ff0bbe2f7017,...,"[{'system': 'GDC', 'value': 'TCGA-3B-A9I1'}, {...",homo sapiens,female,black or african american,not hispanic or latino,-20629.0,"[tcga_sarc, TCGA-SARC]",Dead,567.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1106,1da21c74-a418-4379-ba1a-cdebfcc21c7e,"[{'system': 'GDC', 'value': '1da21c74-a418-437...",TCGA-AJ-A2QM-01A-01-TSA.A7357359-DE36-4AB7-991...,Biospecimen,Slide Image,SVS,TCGA-UCEC,drs://dg.4DFC:1da21c74-a418-4379-ba1a-cdebfcc2...,104552321,0604561d915143572662c3b58a5ff198,...,"[{'system': 'GDC', 'value': 'TCGA-AJ-A2QM'}, {...",homo sapiens,female,white,not hispanic or latino,-24776.0,"[tcga_ucec, TCGA-UCEC]",Alive,,
1107,2adfdc32-8b41-418e-beb7-b0ab8544ba6b,"[{'system': 'GDC', 'value': '2adfdc32-8b41-418...",TCGA-EK-A2PL-01A-01-TS1.2A3E42CE-BFFE-4761-90F...,Biospecimen,Slide Image,SVS,TCGA-CESC,drs://dg.4DFC:2adfdc32-8b41-418e-beb7-b0ab8544...,640115960,0043c9d66b32e2780c91530d56c8606a,...,"[{'system': 'GDC', 'value': 'TCGA-EK-A2PL'}, {...",homo sapiens,female,white,not reported,-13384.0,"[tcga_cesc, TCGA-CESC]",Alive,,
1108,977bd855-9430-4f41-97ca-6b5010b67076,"[{'system': 'GDC', 'value': '977bd855-9430-4f4...",TCGA-KD-A5QU-01A-01-TS1.46D8773F-5E6F-49B1-963...,Biospecimen,Slide Image,SVS,TCGA-SARC,drs://dg.4DFC:977bd855-9430-4f41-97ca-6b5010b6...,87301173,c24e969802885328178a764134045706,...,"[{'system': 'GDC', 'value': 'TCGA-KD-A5QU'}, {...",homo sapiens,female,white,not hispanic or latino,-15018.0,"[tcga_sarc, TCGA-SARC]",Dead,1073.0,
1109,358728ce-157b-4002-8377-5391781a3d57,"[{'system': 'GDC', 'value': '358728ce-157b-400...",TCGA-EK-A2RB-01A-01-TS1.45101B3C-E301-4BD3-B4E...,Biospecimen,Slide Image,SVS,TCGA-CESC,drs://dg.4DFC:358728ce-157b-4002-8377-5391781a...,116519647,592cb544b9ee7331d52d0ab68505d956,...,"[{'system': 'GDC', 'value': 'TCGA-EK-A2RB'}, {...",homo sapiens,female,white,not hispanic or latino,-17752.0,"[tcga_cesc, TCGA-CESC]",Alive,,
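
The two-key merge above can be tried on a small scale. The sketch below uses made-up stand-ins for `slidesdf` and `allmetadata` (all values are assumptions), merges them on both keys, and then counts slides per diagnosis site as one example of reviewing which phenotypes the slides are associated with:

```python
import pandas as pd

# Toy stand-ins for slidesdf and allmetadata (values are made up).
slides = pd.DataFrame({
    "id": ["f1", "f2", "f3"],
    "subject_id": ["S1", "S1", "S2"],
    "researchsubject_id": ["rs-1", "rs-1", "rs-2"],
})
meta = pd.DataFrame({
    "id_rs": ["rs-1", "rs-2"],
    "subject_id": ["S1", "S2"],
    "primary_diagnosis_site": ["Cervix uteri", "Uterus, NOS"],
})

slide_meta = pd.merge(
    slides,
    meta,
    left_on=["subject_id", "researchsubject_id"],
    right_on=["subject_id", "id_rs"],
    validate="many_to_one",  # many slides may share one metadata row, never the reverse
)

# Count how many slide files fall under each diagnosis site.
per_site = slide_meta.groupby("primary_diagnosis_site")["id"].count().to_dict()
print(per_site)
```

Because `subject_id` has the same name on both sides, it appears only once in the merged result.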


In [41]:
slidemetadata.count()

id                             1111
identifier                     1111
label                          1111
data_category                  1111
data_type                      1111
file_format                    1111
associated_project             1111
drs_uri                        1111
byte_size                      1111
checksum                       1111
data_modality                  1111
imaging_modality                  0
dbgap_accession_number            0
imaging_series                    0
subject_id                     1111
researchsubject_id             1111
id_rs                          1111
identifier_rs                  1111
member_of_research_project     1111
primary_diagnosis_condition    1111
primary_diagnosis_site         1111
id_sub                         1111
identifier_sub                 1111
species                        1111
sex                            1099
race                           1099
ethnicity                      1099
days_to_birth               

<img src="https://github.com/CancerDataAggregator/readthedocs/blob/main/docs/Examples/images/Julia.png?raw=true" align="left"
	width="50" height="50" />
    
Julia saves this dataframe to a CSV file as well, and now she has all the information she needs to begin work on her project. She can use the `drs_uri` column to download the images she is interested in directly with a DRS resolver, or she can enter the DRS URIs at a cloud workspace such as [Terra](https://terra.bio/) or the [Cancer Genomics Cloud](https://www.cancergenomicscloud.org/) to view the images online. In either case, she has all the metadata she needs to get started, and can save this notebook of her work in case she'd like to come back and modify her search.

In [30]:
slidemetadata.to_csv("slidemetadata.csv")
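
Before downloading anything, it can help to know how many distinct files are involved and how large they are. The following is a minimal sketch that uses made-up rows with the `drs_uri` and `byte_size` columns. The one-URI-per-line manifest is an assumption for illustration, since a particular DRS resolver or workspace may expect a different input format:

```python
import pandas as pd

# Made-up rows shaped like the file columns; the URIs are placeholders, not real files.
files = pd.DataFrame({
    "drs_uri": ["drs://example.org:aaa", "drs://example.org:bbb", "drs://example.org:aaa"],
    "byte_size": [1_500_000_000, 500_000_000, 1_500_000_000],
})

# A file can appear on several rows after a merge, so count each URI once.
unique_files = files.drop_duplicates(subset="drs_uri")

# Total download size in gigabytes (10**9 bytes).
total_gb = unique_files["byte_size"].sum() / 1e9

# Hypothetical manifest layout: one DRS URI per line.
manifest_text = "\n".join(unique_files["drs_uri"])

print(f"{len(unique_files)} files, {total_gb:.1f} GB")
```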