# Cohort Building

**Example use case:** 

<img src="https://github.com/CancerDataAggregator/readthedocs/blob/main/docs/Examples/images/Julia.png?raw=true" alt="alt_text" align="left"
	width="150" height="150" />
Julia is an oncologist who specializes in female reproductive health. As part of her research, she is interested in using existing data on uterine cancers. If possible, she would like to see multiple data types (gross imaging, genomic data, proteomic data, histology) that come from the same patient, so she can look for shared phenotypes to test for their potential as early diagnostics. Julia heard that the Cancer Data Aggregator has made it easy to search across multiple datasets created by NCI, and so has decided to start her search there.



## Getting Started

The CDA provides a custom Python tool for searching CDA data. [`Q`](usage/#q) (short for Query) offers several ways to search and filter data, and several input modes:

---
- **<a href="../../QuickStart/usage/#q">Q.()</a>** builds a query that can be used by `run()` or `count()`
- **<a href="../../QuickStart/usage/#qrun">Q.run()</a>** returns data for the specified search 
- **<a href="../../QuickStart/usage/#qcount">Q.count()</a>** returns summary information (counts) for data that fit the specified search
- **<a href="../../QuickStart/usage/#columns">columns()</a>** returns entity field names
- **<a href="../../QuickStart/usage/#unique_terms">unique_terms()</a>** returns entity field contents

---

Before Julia does any work, she needs to import these functions from cdapython.
She'll also need to import [pandas](https://pandas.pydata.org/) to get nice dataframes.
Finally, she tells cdapython to report its version so she can be sure she's using the one she means to:

In [1]:
from cdapython import Q, columns, unique_terms, query
import cdapython
import pandas as pd 
print(Q.get_version())

2022.6.28


In [2]:
Q.set_default_project_dataset("broad-dsde-dev.cda_dev")
Q.set_host_url("https://cancerdata.dsde-dev.broadinstitute.org/")
Q.get_host_url()
Q.get_default_project_dataset()

'broad-dsde-dev.cda_dev'

<div class="cdanote" style="background-color:#b3e5d5;color:black;padding:20px;">
    
    
CDA data comes from three sources:
<ul>
<li><b>The <a href="https://proteomic.datacommons.cancer.gov/pdc/"> Proteomic Data Commons</a> (PDC)</b></li>
<li><b>The <a href="https://gdc.cancer.gov/">Genomic Data Commons</a> (GDC)</b></li>
<li><b>The <a href="https://datacommons.cancer.gov/repository/imaging-data-commons">Imaging Data Commons</a> (IDC)</b></li>
</ul> 
    
The CDA makes this data searchable in four main endpoints:

<ul>
<li><b>subject:</b> A patient entity captures the study-independent metadata for research subjects. Human research subjects are usually not traceable to a particular person to protect the subject's privacy.</li>
<li><b>researchsubject:</b> A research subject is the entity of interest in a specific research study or project, typically a human being or an animal, but can also be a device, group of humans or animals, or a tissue sample. Human research subjects are usually not traceable to a particular person to protect the subject's privacy. This entity plays the role of the case_id in existing data. An individual who participates in 3 studies will have 3 researchsubject IDs.</li>
<li><b>specimen:</b> Any material taken as a sample from a biological entity (living or dead), or from a physical object or the environment. Specimens are usually collected as an example of their kind, often for use in some investigation.</li>
<li><b>file:</b> A unit of data about subjects, researchsubjects, specimens, or their associated information</li>
</ul>
and two endpoints that offer deeper information about data in the researchsubject endpoint:
<ul>
<li><b>diagnosis:</b> A collection of characteristics that describe an abnormal condition of the body as assessed at a point in time. May be used to capture information about neoplastic and non-neoplastic conditions.</li>
<li><b>treatment:</b> Represent medication administration or other treatment types.</li>
</ul>
Any metadata field can be searched from any endpoint; the only difference between search types is what type of data is returned by default. This means that you can think of the CDA as a really, really enormous spreadsheet full of data. To search this enormous spreadsheet, you'd want to select columns, and then filter rows.
</div>
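
The spreadsheet analogy can be made concrete with a small sketch in plain Python. The rows and values below are invented for illustration and are not CDA data; only the field names follow the CDA naming style.

```python
# Toy "spreadsheet": each dict is one row. Values are invented for illustration.
rows = [
    {"id": "S1", "sex": "female", "primary_diagnosis_site": "Uterus"},
    {"id": "S2", "sex": "male", "primary_diagnosis_site": "Lung"},
    {"id": "S3", "sex": "female", "primary_diagnosis_site": "Cervix"},
]

# Step 1: select the columns of interest.
wanted_columns = ["id", "primary_diagnosis_site"]

# Step 2: filter the rows on a value in one column.
matches = [
    {col: row[col] for col in wanted_columns}
    for row in rows
    if row["primary_diagnosis_site"] == "Uterus"
]

print(matches)
```

The CDA tools do the same two things at a much larger scale: the endpoint decides which columns come back by default, and the query decides which rows are kept.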


## Finding Search Terms

<img src="https://github.com/CancerDataAggregator/readthedocs/blob/main/docs/Examples/images/Julia.png?raw=true" align="left"
	width="50" height="50" />
   
   
   Accordingly, to see what search fields are available, Julia starts by using the command `columns`:

In [3]:
columns().to_list()

['id',
 'identifier.system',
 'identifier.value',
 'species',
 'sex',
 'race',
 'ethnicity',
 'days_to_birth',
 'subject_associated_project',
 'vital_status',
 'days_to_death',
 'cause_of_death',
 'ResearchSubject.id',
 'ResearchSubject.identifier.system',
 'ResearchSubject.identifier.value',
 'ResearchSubject.member_of_research_project',
 'ResearchSubject.primary_diagnosis_condition',
 'ResearchSubject.primary_diagnosis_site',
 'ResearchSubject.Diagnosis.id',
 'ResearchSubject.Diagnosis.identifier.system',
 'ResearchSubject.Diagnosis.identifier.value',
 'ResearchSubject.Diagnosis.primary_diagnosis',
 'ResearchSubject.Diagnosis.age_at_diagnosis',
 'ResearchSubject.Diagnosis.morphology',
 'ResearchSubject.Diagnosis.stage',
 'ResearchSubject.Diagnosis.grade',
 'ResearchSubject.Diagnosis.method_of_diagnosis',
 'ResearchSubject.Diagnosis.Treatment.id',
 'ResearchSubject.Diagnosis.Treatment.identifier.system',
 'ResearchSubject.Diagnosis.Treatment.identifier.value',
 'ResearchSubject.Diagno

<img src="https://github.com/CancerDataAggregator/readthedocs/blob/main/docs/Examples/images/Julia.png?raw=true" align="left"
	width="50" height="50" />
   
   
There are a lot of columns in the CDA data, but Julia is most interested in diagnosis data, so she filters the list to only those:

In [4]:
columns().to_list(filters="diagnosis")

['ResearchSubject.primary_diagnosis_condition',
 'ResearchSubject.primary_diagnosis_site',
 'ResearchSubject.Diagnosis.id',
 'ResearchSubject.Diagnosis.identifier.system',
 'ResearchSubject.Diagnosis.identifier.value',
 'ResearchSubject.Diagnosis.primary_diagnosis',
 'ResearchSubject.Diagnosis.age_at_diagnosis',
 'ResearchSubject.Diagnosis.morphology',
 'ResearchSubject.Diagnosis.stage',
 'ResearchSubject.Diagnosis.grade',
 'ResearchSubject.Diagnosis.method_of_diagnosis',
 'ResearchSubject.Diagnosis.Treatment.id',
 'ResearchSubject.Diagnosis.Treatment.identifier.system',
 'ResearchSubject.Diagnosis.Treatment.identifier.value',
 'ResearchSubject.Diagnosis.Treatment.treatment_type',
 'ResearchSubject.Diagnosis.Treatment.treatment_outcome',
 'ResearchSubject.Diagnosis.Treatment.days_to_treatment_start',
 'ResearchSubject.Diagnosis.Treatment.days_to_treatment_end',
 'ResearchSubject.Diagnosis.Treatment.therapeutic_agent',
 'ResearchSubject.Diagnosis.Treatment.treatment_anatomic_site',
 'Re

<div class="cdanote" style="background-color:#b3e5d5;color:black;padding:20px;">

To search the CDA, a user also needs to know what search terms are available. Each column will contain a huge amount of data, so retrieving all of the rows would be overwhelming. Instead, the CDA has a `unique_terms()` function that will return all of the unique values that populate the requested column. Like `columns`, `unique_terms` defaults to giving us an overview of the results, and can be filtered.
    
</div>

<img src="https://github.com/CancerDataAggregator/readthedocs/blob/main/docs/Examples/images/Julia.png?raw=true" align="left"
	width="50" height="50" />
   
Since Julia is interested specifically in uterine cancers, she uses the `unique_terms` function to check whether 'uterine' appears among the values available for 'ResearchSubject.Diagnosis.Treatment.treatment_anatomic_site' and 'ResearchSubject.primary_diagnosis_site':

In [5]:
unique_terms("ResearchSubject.Diagnosis.Treatment.treatment_anatomic_site").to_list()

['Brain',
 'Cervix',
 'Head - Face Or Neck, Nos',
 'Lymph Node(s) Paraaortic',
 'Other',
 'Pelvis',
 'Spine',
 'Unknown']

In [6]:
unique_terms("ResearchSubject.primary_diagnosis_site").to_list()

['Abdomen',
 'Abdomen, Mediastinum',
 'Abdomen, Pelvis',
 'Adrenal Glands',
 'Adrenal gland',
 'Anus and anal canal',
 'Base of tongue',
 'Bile Duct',
 'Bladder',
 'Bones, joints and articular cartilage of limbs',
 'Bones, joints and articular cartilage of other and unspecified sites',
 'Brain',
 'Breast',
 'Bronchus and lung',
 'Cervix',
 'Cervix uteri',
 'Chest',
 'Chest-Abdomen-Pelvis, Leg, TSpine',
 'Colon',
 'Connective, subcutaneous and other soft tissues',
 'Corpus uteri',
 'Ear',
 'Esophagus',
 'Extremities',
 'Eye and adnexa',
 'Floor of mouth',
 'Gallbladder',
 'Gum',
 'Head',
 'Head and Neck',
 'Head-Neck',
 'Heart, mediastinum, and pleura',
 'Hematopoietic and reticuloendothelial systems',
 'Hypopharynx',
 'Intraocular',
 'Kidney',
 'Larynx',
 'Lip',
 'Liver',
 'Liver and intrahepatic bile ducts',
 'Lung',
 'Lung Phantom',
 'Lymph nodes',
 'Marrow, Blood',
 'Meninges',
 'Mesothelium',
 'Nasal cavity and middle ear',
 'Nasopharynx',
 'Not Reported',
 'Oropharynx',
 'Other an

<div class="cdanote" style="background-color:#b3e5d5;color:black;padding:20px;">
    
CDA makes multiple datasets searchable from a common interface, but does not harmonize the data. This means that researchers should review all the terms in a column, and not just choose the first one that fits, as there may be other similar terms available as well.
    
</div>
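
One way to spot such near-duplicate terms before choosing search values is to compare spellings automatically. The sketch below uses Python's standard `difflib` module on a few terms copied from the site list above; the helper name `similar_terms` and the similarity cutoff are illustrative assumptions, and a manual review of the full list is still the safer route.

```python
import difflib

# A few site terms copied from the unique_terms() output shown above.
terms = [
    "Adrenal Glands",
    "Adrenal gland",
    "Cervix",
    "Cervix uteri",
    "Head and Neck",
    "Head-Neck",
    "Lung",
]

def similar_terms(term, candidates, cutoff=0.6):
    """Return other candidates whose lowercase spelling is close to `term`."""
    lowered = {c.lower(): c for c in candidates if c != term}
    close = difflib.get_close_matches(term.lower(), list(lowered), n=5, cutoff=cutoff)
    return [lowered[c] for c in close]

# Print every term that has at least one look-alike in the list.
for term in terms:
    hits = similar_terms(term, terms)
    if hits:
        print(f"{term!r} resembles {hits}")
```

Raising `cutoff` reports only very close spellings; lowering it casts a wider net at the cost of more false alarms.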

<img src="https://github.com/CancerDataAggregator/readthedocs/blob/main/docs/Examples/images/Julia.png?raw=true" align="left"
	width="50" height="50" />
   
Julia sees that "treatment_anatomic_site" does not have 'Uterine', but does have 'Cervix'. She also notes that both 'Uterus' and 'Uterus, NOS' are listed in the "primary_diagnosis_site" results. As she was initially looking for "uterine", Julia decides to expand her search a bit to account for variable naming schemes. So, she runs a fuzzy match filter on the "ResearchSubject.primary_diagnosis_site" for 'uter' as that should cover all variants:

In [7]:
unique_terms("ResearchSubject.primary_diagnosis_site").to_list(filters="uter")

['Cervix uteri', 'Corpus uteri', 'Uterus', 'Uterus, NOS']

<img src="https://github.com/CancerDataAggregator/readthedocs/blob/main/docs/Examples/images/Julia.png?raw=true" align="left"
	width="50" height="50" />
   
Just to be sure, Julia also searches for any other instances of "cervix":

In [8]:
unique_terms("ResearchSubject.primary_diagnosis_site").to_list(filters="cerv")

['Cervix', 'Cervix uteri']
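
The effect of the `filters` argument can be imitated in plain Python. This sketch assumes a case-insensitive substring match, which is consistent with the outputs shown above but is not taken from the cdapython source; `filter_terms` is a hypothetical helper.

```python
# Site terms copied from the outputs above (a subset of the full list).
site_terms = ["Brain", "Cervix", "Cervix uteri", "Corpus uteri", "Uterus", "Uterus, NOS"]

def filter_terms(terms, fragment):
    """Keep terms that contain `fragment`, ignoring case (assumed behavior)."""
    fragment = fragment.lower()
    return [term for term in terms if fragment in term.lower()]

uter_terms = filter_terms(site_terms, "uter")
cerv_terms = filter_terms(site_terms, "cerv")
print(uter_terms)
print(cerv_terms)
```

A short fragment such as 'uter' catches 'Uterus', 'Uterus, NOS', and both 'uteri' terms in one pass, which is why Julia searches on word stems rather than full names.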

## Building a Query

<img src="https://github.com/CancerDataAggregator/readthedocs/blob/main/docs/Examples/images/Julia.png?raw=true" align="left"
	width="50" height="50" />
   
With all her likely terms found, Julia begins to create a search that will get data for all of her terms. She does this by writing a series of `Q` statements that define what rows should be returned from each column. For the "treatment_anatomic_site", only one term is of interest, so she uses the `=` operator to get only exact matches:

In [9]:
Tsite = Q('ResearchSubject.Diagnosis.Treatment.treatment_anatomic_site = "Cervix"')

<img src="https://github.com/CancerDataAggregator/readthedocs/blob/main/docs/Examples/images/Julia.png?raw=true" align="left"
	width="50" height="50" />
   
However, for "primary_diagnosis_site", Julia has several terms she wants to search with. Luckily, `Q` can also run fuzzy searches, and it can search for more than one term at a time. So Julia writes one big `Q` statement to grab every row whose site contains either 'uter' or 'cerv':

In [10]:
Dsite = Q('ResearchSubject.primary_diagnosis_site = "%uter%" OR ResearchSubject.primary_diagnosis_site = "%cerv%"')

<img src="https://github.com/CancerDataAggregator/readthedocs/blob/main/docs/Examples/images/Julia.png?raw=true" align="left"
	width="50" height="50" />
   
Finally, Julia adds her two queries together into one large one:

In [11]:
ALLDATA = Tsite.OR(Dsite)
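
To make the logic of the combined query concrete, the sketch below mimics it on a handful of invented rows. It assumes that `%` acts as a "match anything" wildcard and that matching ignores case, as the queries and results in this example suggest. The `like` helper and the sample rows are hypothetical, and none of this is cdapython code.

```python
import re

def like(value, pattern):
    """Case-insensitive match where '%' stands for any run of characters."""
    if value is None:
        return False
    regex = ".*".join(re.escape(part) for part in pattern.split("%"))
    return re.fullmatch(regex, value, flags=re.IGNORECASE) is not None

# Invented rows; only the field names mirror the CDA columns used above.
rows = [
    {"id": 1, "primary_diagnosis_site": "Uterus, NOS", "treatment_anatomic_site": None},
    {"id": 2, "primary_diagnosis_site": "Cervix uteri", "treatment_anatomic_site": "Pelvis"},
    {"id": 3, "primary_diagnosis_site": "Lung", "treatment_anatomic_site": "Cervix"},
    {"id": 4, "primary_diagnosis_site": "Breast", "treatment_anatomic_site": None},
]

def tsite(row):
    # Exact match, like the Tsite query.
    return row["treatment_anatomic_site"] == "Cervix"

def dsite(row):
    # Either wildcard pattern may match, like the Dsite query.
    site = row["primary_diagnosis_site"]
    return like(site, "%uter%") or like(site, "%cerv%")

# OR keeps a row when at least one of the two conditions holds.
all_data = [row["id"] for row in rows if tsite(row) or dsite(row)]
print(all_data)  # [1, 2, 3]
```

Row 3 shows why the two queries are joined with OR: its diagnosis site is unrelated, but its treatment site still qualifies it.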

## Looking at Summary Data

<img src="https://github.com/CancerDataAggregator/readthedocs/blob/main/docs/Examples/images/Julia.png?raw=true" align="left"
	width="50" height="50" />
   
Now that Julia has a query, she can use it to look for data in any of the CDA endpoints. She starts by getting an overall summary of what data is available using `count`:

In [12]:
ALLDATA.count.run()
#specimen_count : 40793
#treatment_count : 3049
#diagnosis_count : 3685
#researchsubject_count : 4869
#subject_count : 3742

Total execution time: 3741 ms




<img src="https://github.com/CancerDataAggregator/readthedocs/blob/main/docs/Examples/images/Julia.png?raw=true" align="left"
	width="50" height="50" />
   
It seems there's a lot of data that might work for Julia's study! Since she is interested in the beginnings of cancer, she decides to start by looking at the researchsubject information, since that is where most of the diagnosis information is. She again gets a summary using `count`:

In [13]:
ALLDATA.researchsubject.count.run()
# system	count
# GDC	3591
# PDC	104
# IDC	1174 
# primary_diagnosis_condition	count
# Adenomas and Adenocarcinomas	1672
# Uterine Corpus Endometrial Carcinoma	104
# Myomatous Neoplasms	188
# Cystic, Mucinous and Serous Neoplasms	487
# Squamous Cell Neoplasms	609
# Not Reported	12
# Complex Mixed and Stromal Neoplasms	320
# None	1175
# Epithelial Neoplasms, NOS	230
# Soft Tissue Tumors and Sarcomas, NOS	14
# Complex Epithelial Neoplasms	27
# Trophoblastic neoplasms	13
# Neoplasms, NOS	12
# Mesonephromas	5
# Neuroepitheliomatous Neoplasms	1 
# primary_diagnosis_site	count
# Cervix uteri	915
# Corpus uteri	780
# Uterus, NOS	2000
# Uterus	867
# Cervix	307

Total execution time: 3562 ms






## Refining Queries

<img src="https://github.com/CancerDataAggregator/readthedocs/blob/main/docs/Examples/images/Julia.png?raw=true" align="left"
	width="50" height="50" />
   
Browsing the primary_diagnosis_condition data, Julia notices that a large number of research subjects fall under Adenomas and Adenocarcinomas. Since Julia wants to look for common phenotypes in early cancers, she decides it might be easier to exclude the endocrine-related data, as they might have different mechanisms. So she adds a new filter to her query:

In [14]:
Noadeno = Q('ResearchSubject.primary_diagnosis_condition != "Adenomas and Adenocarcinomas"')

NoAdenoData = ALLDATA.AND(Noadeno)

NoAdenoData.researchsubject.count.run()
#   total : 3197    
#    files : 298263   
# system	count
# PDC	104
# GDC	1919
# IDC	1174 
# primary_diagnosis_condition	count
# Myomatous Neoplasms	188
# Squamous Cell Neoplasms	609
# Uterine Corpus Endometrial Carcinoma	104
# Cystic, Mucinous and Serous Neoplasms	487
# Complex Mixed and Stromal Neoplasms	320
# None	1175
# Not Reported	12
# Epithelial Neoplasms, NOS	230
# Complex Epithelial Neoplasms	27
# Soft Tissue Tumors and Sarcomas, NOS	14
# Neuroepitheliomatous Neoplasms	1
# Trophoblastic neoplasms	13
# Neoplasms, NOS	12
# Mesonephromas	5 
# primary_diagnosis_site	count
# Cervix uteri	688
# Corpus uteri	373
# Uterus, NOS	962
# Uterus	867
# Cervix	307

Total execution time: 3560 ms
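
A quick consistency check on the two summaries: if the new filter removed exactly the "Adenomas and Adenocarcinomas" research subjects, the totals should differ by that category's count. The numbers below are copied from the two `count` outputs; the variable names are only for illustration.

```python
# Research subject counts per source, before and after the filter (copied from the outputs).
before = {"GDC": 3591, "PDC": 104, "IDC": 1174}
after = {"GDC": 1919, "PDC": 104, "IDC": 1174}
adenomas_before = 1672  # "Adenomas and Adenocarcinomas" in the first summary

total_before = sum(before.values())
total_after = sum(after.values())
removed = total_before - total_after

print(total_before, total_after, removed)  # 4869 3197 1672

# The drop equals the excluded category, and it comes entirely from GDC records.
assert removed == adenomas_before
assert before["GDC"] - after["GDC"] == adenomas_before
```

The PDC and IDC counts are unchanged, which suggests that only GDC records carried the excluded label in this data release.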






<img src="https://github.com/CancerDataAggregator/readthedocs/blob/main/docs/Examples/images/Julia.png?raw=true" align="left"
	width="50" height="50" />
   
She then previews the actual metadata for researchsubject, subject, and file, to make sure that they have all the information she will need for her work. Since she's mostly interested in the kinds of data available from each endpoint, the default preview of each is enough:

In [15]:
NoAdenoData.researchsubject.run().to_dataframe() # view the dataframe

Total execution time: 3622 ms


| | id | identifier | member_of_research_project | primary_diagnosis_condition | primary_diagnosis_site | subject_id |
|---|---|---|---|---|---|---|
| 0 | 1f20b0c8-11c1-4a3c-84fc-d485aac49bc8 | [{'system': 'GDC', 'value': '1f20b0c8-11c1-4a3... | FM-AD | Complex Mixed and Stromal Neoplasms | Uterus, NOS | AD16769 |
| 1 | 29d93fb1-0b3d-4d13-8799-2dcf3e14be04 | [{'system': 'GDC', 'value': '29d93fb1-0b3d-4d1... | GENIE-DFCI | Cystic, Mucinous and Serous Neoplasms | Uterus, NOS | GENIE-DFCI-006032 |
| 2 | 3bccf0e3-d467-4477-adfd-b04d71f3eb86 | [{'system': 'GDC', 'value': '3bccf0e3-d467-447... | GENIE-DFCI | Myomatous Neoplasms | Uterus, NOS | GENIE-DFCI-009614 |
| 3 | 3c25bf98-1a11-41dc-8d65-182b0e8c5978 | [{'system': 'GDC', 'value': '3c25bf98-1a11-41d... | GENIE-MSK | Myomatous Neoplasms | Uterus, NOS | GENIE-MSK-P-0014679 |
| 4 | 4ed1616e-8c62-4104-bcbe-26d7059f04d0 | [{'system': 'GDC', 'value': '4ed1616e-8c62-410... | TCGA-CESC | Squamous Cell Neoplasms | Cervix uteri | TCGA-C5-A7X3 |
| ... | ... | ... | ... | ... | ... | ... |
| 95 | 3789ae3f-286b-42f0-a805-320ad69ea1df | [{'system': 'GDC', 'value': '3789ae3f-286b-42f... | TCGA-UCEC | Cystic, Mucinous and Serous Neoplasms | Corpus uteri | TCGA-B5-A3S1 |
| 96 | 3ac62513-b5f4-4063-aabd-dad08a1f56fb | [{'system': 'GDC', 'value': '3ac62513-b5f4-406... | TCGA-CESC | Squamous Cell Neoplasms | Cervix uteri | TCGA-EA-A3Y4 |
| 97 | 3ba89750-e310-4771-88e7-b5bf289354e0 | [{'system': 'GDC', 'value': '3ba89750-e310-477... | FM-AD | Squamous Cell Neoplasms | Cervix uteri | AD16507 |
| 98 | 5040ec9a-5050-4295-b792-38863eded12b | [{'system': 'GDC', 'value': '5040ec9a-5050-429... | FM-AD | Squamous Cell Neoplasms | Cervix uteri | AD3542 |


---

<div class="cdadefine" style="background-color:#add9e5;color:black;padding:20px;">

<h3>ResearchSubject Field Definitions</h3>

<i>A research subject is the entity of interest in a specific research study or project, typically a human being or an animal, but can also be a device, group of humans or animals, or a tissue sample. Human research subjects are usually not traceable to a particular person to protect the subject's privacy. This entity plays the role of the case_id in existing data. An individual who participates in 3 studies will have 3 researchsubject IDs.</i>
    
<ul>
<li><b>id:</b> The 'logical' identifier of the entity in the system of record, e.g. a UUID. This 'id' is unique within a given system. The identified entity may have a different 'id' in a different system. For CDA, this is case_id.</li>
<li><b>identifier:</b> A 'business' identifier for the entity, typically as provided by an external system or authority, that persists across implementing systems (i.e. a 'logical' identifier). Uses a specialized, complex 'Identifier' data type to capture information about the source of the business identifier - or a URI expressed as a string to an existing entity.</li>
<li><b>identifier.system:</b> The system or namespace that defines the identifier.</li>
<li><b>identifier.value:</b> The value of the identifier, as defined by the system.</li>
<li><b>member_of_research_project:</b> A reference to the Study(s) of which this ResearchSubject is a member.</li>
<li><b>primary_diagnosis_condition:</b> The text term used to describe the type of malignant disease, as categorized by the World Health Organization's (WHO) <a href="https://www.who.int/standards/classifications/other-classifications/international-classification-of-diseases-for-oncology">International Classification of Diseases for Oncology</a> (ICD-O). This attribute represents the disease that qualified the subject for inclusion on the ResearchProject.</li>
<li><b>primary_diagnosis_site:</b> The text term used to describe the primary site of disease, as categorized by the World Health Organization's (WHO) <a href="https://www.who.int/standards/classifications/other-classifications/international-classification-of-diseases-for-oncology">International Classification of Diseases for Oncology</a> (ICD-O). This categorization groups cases into general categories. This attribute represents the primary site of disease that qualified the subject for inclusion on the ResearchProject.</li>
<li><b>subject_id:</b> The 'logical' identifier of the entity in the system of record, e.g. a UUID. This 'id' is unique within a given system. The identified entity may have a different 'id' in a different system. Can be joined to the <code>id</code> field from subject results.</li>
</ul>  

</div>
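
As noted for `subject_id`, researchsubject rows can be linked to subject rows through the subject `id`. A minimal sketch of that join follows, using plain dictionaries and made-up records. With pandas DataFrames the same idea is usually expressed with `DataFrame.merge`, using `left_on="subject_id"` and `right_on="id"`.

```python
# Made-up records for illustration; field names follow the definitions above.
researchsubjects = [
    {"id": "rs-1", "primary_diagnosis_site": "Uterus, NOS", "subject_id": "SUBJ-A"},
    {"id": "rs-2", "primary_diagnosis_site": "Cervix uteri", "subject_id": "SUBJ-A"},
    {"id": "rs-3", "primary_diagnosis_site": "Corpus uteri", "subject_id": "SUBJ-B"},
]
subjects = [
    {"id": "SUBJ-A", "sex": "female", "vital_status": "Alive"},
    {"id": "SUBJ-B", "sex": "female", "vital_status": "Dead"},
]

# Index subjects by id so each researchsubject can look up its subject.
subjects_by_id = {subject["id"]: subject for subject in subjects}

joined = []
for rs in researchsubjects:
    subject = subjects_by_id.get(rs["subject_id"])
    if subject is None:
        continue  # no matching subject row; skip it, as an inner join would
    joined.append({
        "researchsubject_id": rs["id"],
        "subject_id": subject["id"],
        "primary_diagnosis_site": rs["primary_diagnosis_site"],
        "vital_status": subject["vital_status"],
    })

for row in joined:
    print(row)
```

Here one subject appears in two research projects, so the join yields two rows for that subject. This mirrors the point that one individual can have several researchsubject IDs.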
    
---

In [16]:
NoAdenoData.subject.run().to_dataframe() # view the dataframe

Total execution time: 3596 ms


| | id | identifier | species | sex | race | ethnicity | days_to_birth | subject_associated_project | vital_status | days_to_death | cause_of_death |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AD95 | [{'system': 'GDC', 'value': 'AD95'}] | homo sapiens | female | not reported | not reported | | [FM-AD] | Not Reported | | |
| 1 | C3L-02894 | [{'system': 'IDC', 'value': 'C3L-02894'}] | Homo sapiens | | | | | [cptac_ucec] | | | |
| 2 | C3N-03005 | [{'system': 'IDC', 'value': 'C3N-03005'}] | Homo sapiens | | | | | [cptac_ucec] | | | |
| 3 | GENIE-DFCI-002257 | [{'system': 'GDC', 'value': 'GENIE-DFCI-002257'}] | homo sapiens | female | white | not hispanic or latino | -22280.0 | [GENIE-DFCI] | Not Reported | | |
| 4 | GENIE-DFCI-003511 | [{'system': 'GDC', 'value': 'GENIE-DFCI-003511'}] | homo sapiens | female | white | not hispanic or latino | -24837.0 | [GENIE-DFCI] | Not Reported | | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | TCGA-EY-A3QX | [{'system': 'GDC', 'value': 'TCGA-EY-A3QX'}, {... | homo sapiens | female | white | not hispanic or latino | -23663.0 | [TCGA-UCEC, tcga_ucec] | Dead | 989.0 | |
| 96 | TCGA-N5-A4RU | [{'system': 'GDC', 'value': 'TCGA-N5-A4RU'}, {... | homo sapiens | female | white | not hispanic or latino | -20025.0 | [TCGA-UCS, tcga_ucs] | Dead | 1526.0 | |
| 97 | TCGA-NG-A4VU | [{'system': 'GDC', 'value': 'TCGA-NG-A4VU'}, {... | homo sapiens | female | white | not hispanic or latino | -23323.0 | [TCGA-UCS, tcga_ucs] | Dead | 442.0 | |
| 98 | TCGA-PG-A7D5 | [{'system': 'GDC', 'value': 'TCGA-PG-A7D5'}, {... | homo sapiens | female | black or african american | not hispanic or latino | -23006.0 | [TCGA-UCEC, tcga_ucec] | Alive | | |


---

<div class="cdadefine" style="background-color:#add9e5;color:black;padding:20px;">

<h3>Subject Field Definitions</h3>

<i>A patient entity captures the study-independent metadata for research subjects. Human research subjects are usually not traceable to a particular person to protect the subject's privacy.</i>

    
<ul>
<li><b>id:</b> The 'logical' identifier of the entity in the system of record, e.g. a UUID. This 'id' is unique within a given system. The identified entity may have a different 'id' in a different system.</li>
<li><b>identifier:</b> A 'business' identifier for the entity, typically as provided by an external system or authority, that persists across implementing systems (i.e. a 'logical' identifier). Uses a specialized, complex 'Identifier' data type to capture information about the source of the business identifier - or a URI expressed as a string to an existing entity.</li>
<li><b>identifier.system:</b> The system or namespace that defines the identifier.</li>
<li><b>identifier.value:</b> The value of the identifier, as defined by the system.</li>
<li><b>species:</b> The taxonomic group (e.g. species) of the patient. For MVP, since taxonomy vocabulary is consistent between GDC and PDC, using text. Ultimately, this will be a term returned by the vocabulary service.</li>
<li><b>sex:</b> The biologic character or quality that distinguishes male and female from one another as expressed by analysis of the person's gonadal, morphologic (internal and external), chromosomal, and hormonal characteristics.</li>
<li><b>race:</b> An arbitrary classification of a taxonomic group that is a division of a species. It usually arises as a consequence of geographical isolation within a species and is characterized by shared heredity, physical attributes and behavior, and in the case of humans, by common history, nationality, or geographic distribution. The provided values are based on the categories defined by the U.S. Office of Management and Budget and used by the U.S. Census Bureau.</li>
<li><b>ethnicity:</b> An individual's self-described social and cultural grouping, specifically whether an individual describes themselves as Hispanic or Latino. The provided values are based on the categories defined by the U.S. Office of Management and Budget and used by the U.S. Census Bureau.</li>
<li><b>days_to_birth:</b> Number of days between the date used for index and the date from a person's date of birth represented as a calculated negative number of days.</li>
<li><b>subject_associated_project:</b> The list of Projects associated with the Subject.</li>
<li><b>vital_status:</b> Coded value indicating the state or condition of being living or deceased; also includes the case where the vital status is unknown.</li>
<li><b>days_to_death:</b> Number of days between the date used for index and the date from a person's date of death represented as a calculated number of days.</li>
<li><b>cause_of_death:</b> Coded value indicating the circumstance or condition that results in the death of the subject.</li>
</ul>  

</div>
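
Because `days_to_birth` is a negative day count relative to the index date, an approximate age at index can be derived by flipping the sign and dividing by the average year length. This is a rough sketch: the 365.25-day year is an approximation, the helper name `age_at_index` is illustrative, and which event serves as the index date depends on the source data.

```python
DAYS_PER_YEAR = 365.25  # average year length, an approximation

def age_at_index(days_to_birth):
    """Approximate age in years at the index date, or None if unknown."""
    if days_to_birth is None:
        return None
    return -days_to_birth / DAYS_PER_YEAR

# -22280.0 is the days_to_birth value of one subject in the preview above.
age = age_at_index(-22280.0)
print(round(age, 1))  # 61.0
```

Subjects with no `days_to_birth` value, such as the IDC rows in the preview, simply yield `None` here rather than a misleading number.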
    
---

In [17]:
NoAdenoData.file.run().to_dataframe() # view the dataframe

Total execution time: 4225 ms


| | id | identifier | label | data_category | data_type | file_format | associated_project | drs_uri | byte_size | checksum | data_modality | imaging_modality | dbgap_accession_number | imaging_series | researchsubject_specimen_id | researchsubject_id | subject_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0b3ca53a-102f-4e81-845f-cad1afeabcb9 | [{'system': 'GDC', 'value': '0b3ca53a-102f-4e8... | baff6124-e0ed-4ae8-abb2-614fc68e6728.wgs.BRASS... | Somatic Structural Variation | Structural Rearrangement | BEDPE | CGCI-HTMCP-CC | drs://dg.4DFC:0b3ca53a-102f-4e81-845f-cad1afea... | 186898 | c2d7d5de9497268525e9a25753e108e8 | Genomic | | phs000528 | | 1f1933a7-64e4-4bb5-934d-e61ab9c7eb10 | 055ef10b-309e-4105-8379-ef6282d30c3a | HTMCP-03-06-02036 |
| 1 | 308d0d13-ba05-46b4-a46d-ec02e9768f63 | [{'system': 'GDC', 'value': '308d0d13-ba05-46b... | 32ac189b-859b-4f44-8040-c68b5bfc499d.wgs.BRASS... | Somatic Structural Variation | Structural Rearrangement | VCF | CGCI-HTMCP-CC | drs://dg.4DFC:308d0d13-ba05-46b4-a46d-ec02e976... | 64370 | f7eb03972ab6cca2e63ee209598d46e7 | Genomic | | phs000528 | | fdb8c4e2-fc29-444e-a441-dded10f769fd | 54d01763-63d0-43a7-85f2-d447c6d9950e | HTMCP-03-06-02076 |
| 2 | a1e661f5-3158-4748-8e4c-c262f6981e05 | [{'system': 'GDC', 'value': 'a1e661f5-3158-474... | TCGA-2W-A8YY-01A-21-A40H-20_RPPA_data.tsv | Proteome Profiling | Protein Expression Quantification | TSV | TCGA-CESC | drs://dg.4DFC:a1e661f5-3158-4748-8e4c-c262f698... | 23833 | a2c830ca73e7b2dba189b70c7ae96c05 | Genomic | | | | c9afd150-ea3a-4219-b634-6e097c3b2a4f | 5aeac31a-176a-4f93-a376-a93a670821bb | TCGA-2W-A8YY |
| 3 | a6de4f29-79fb-4139-b320-b664d2d5387c | [{'system': 'GDC', 'value': 'a6de4f29-79fb-413... | 16afeee8-917a-4682-abf6-3d599345fc28.wgs.BRASS... | Somatic Structural Variation | Structural Rearrangement | VCF | CGCI-HTMCP-CC | drs://dg.4DFC:a6de4f29-79fb-4139-b320-b664d2d5... | 38046 | 714ef4e3d21edd894a5e4e972e983170 | Genomic | | phs000528 | | 900e0e61-e387-4be4-ba67-99423a0a23aa | 1ffd0e35-68fd-4eaa-8b2b-a337fbdff5b9 | HTMCP-03-06-02026 |
| 4 | b30cd89d-895b-4d7c-b2dc-9cb469c9b39f | [{'system': 'GDC', 'value': 'b30cd89d-895b-4d7... | c2ad8dfa-b36b-4bf8-8db2-1a3beb9ec181.wgs.BRASS... | Somatic Structural Variation | Structural Rearrangement | VCF | CGCI-HTMCP-CC | drs://dg.4DFC:b30cd89d-895b-4d7c-b2dc-9cb469c9... | 48137 | 01d4836e1af4a4e60e36a8a40178b6ba | Genomic | | phs000528 | | 6d2955d3-764c-4457-9bd2-b4d94b26f80c | 8859b1ba-2d81-43e5-a845-4966ca6866da | HTMCP-03-06-02320 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | fed0013e-e5e2-4fce-b1f1-e7bc942b4a4a | [{'system': 'GDC', 'value': 'fed0013e-e5e2-4fc... | a415048b-118c-4e8e-a44d-ca9ea524eedd.wgs.BRASS... | Somatic Structural Variation | Structural Rearrangement | VCF | CGCI-HTMCP-CC | drs://dg.4DFC:fed0013e-e5e2-4fce-b1f1-e7bc942b... | 109420 | f63e7a05c74d325a1247d4e351ab477e | Genomic | | phs000528 | | ff4bae0a-b24e-461d-a802-28b27ad870bf | a284a345-42f1-4f3c-9833-13fc835e131a | HTMCP-03-06-02174 |
| 96 | 0b5cd4dc-5081-466a-9bcc-bc717b3254af | [{'system': 'GDC', 'value': '0b5cd4dc-5081-466... | 2db011f6-0005-4fab-ad90-a0e02afc1a0b.wgs.BRASS... | Somatic Structural Variation | Structural Rearrangement | VCF | CGCI-HTMCP-CC | drs://dg.4DFC:0b5cd4dc-5081-466a-9bcc-bc717b32... | 67148 | 724e8649e0213a3ff98e4f47cffe324a | Genomic | | phs000528 | | b669fa70-1392-4a9c-9d12-a56eb50b6e8d | 226914c5-e486-4041-9b0a-83ba8baae0e7 | HTMCP-03-06-02003 |
| 97 | 0f6bff1c-a0e0-47f6-b4a9-6eff171b7e82 | [{'system': 'GDC', 'value': '0f6bff1c-a0e0-47f... | e05cb0b1-3d3b-47d0-b513-aeeba47b313d.wgs.BRASS... | Somatic Structural Variation | Structural Rearrangement | BEDPE | CGCI-HTMCP-CC | drs://dg.4DFC:0f6bff1c-a0e0-47f6-b4a9-6eff171b... | 106577 | d7208a402ce015d4c5f8cc7ee5aaf290 | Genomic | | phs000528 | | 4a72437d-3711-4e49-9751-c327ca836f18 | 29754644-2d3d-4e04-830c-4b1c845df651 | HTMCP-03-06-02180 |
| 98 | 1844c0ab-f430-45bc-9865-4078cf57c5c4 | [{'system': 'GDC', 'value': '1844c0ab-f430-45b... | bcdf7a5f-2726-4b83-9a56-0ddf6e5d48c4.wgs.BRASS... | Somatic Structural Variation | Structural Rearrangement | BEDPE | CGCI-HTMCP-CC | drs://dg.4DFC:1844c0ab-f430-45bc-9865-4078cf57... | 87697 | 0e5e07ff37a2fdccba492db6a9de1c7a | Genomic | | phs000528 | | 9456911a-d588-436f-a77c-cc7754e218c7 | 57d7c3a9-0f1d-4ec1-8815-fea5d4565d3b | HTMCP-03-06-02435 |



---

<div class="cdadefine" style="background-color:#add9e5;color:black;padding:20px;">

<h3>File Field Definitions</h3>

<i>A file is an information-bearing electronic object that contains a physical embodiment of some information using a particular character encoding.</i>

    
<ul>
  <li><b>id:</b> The 'logical' identifier of the entity in the system of record, e.g. a UUID. This 'id' is unique within a given system. The identified entity may have a different 'id' in a different system.</li>
<li><b>identifier:</b> A 'business' identifier for the entity, typically as provided by an external system or authority, that persists across implementing systems (i.e. a 'logical' identifier). Uses a specialized, complex 'Identifier' data type to capture information about the source of the business identifier - or a URI expressed as a string to an existing entity.</li>
<li><b>identifier.system:</b> The system or namespace that defines the identifier.</li>
<li><b>identifier.value:</b> The value of the identifier, as defined by the system.</li>
  <li><b>label:</b> Short name or abbreviation for dataset. Maps to rdfs:label.</li>
  <li><b>data_category:</b> Broad categorization of the contents of the data file.</li>
  <li><b>data_type:</b> Specific content type of the data file.</li>
  <li><b>file_format:</b> Format of the data files.</li>
  <li><b>associated_project:</b> A reference to the Project(s) of which this ResearchSubject is a member. The associated_project may be embedded using the ref definition or may be a reference to the id for the Project - or a URI expressed as a string to an existing entity.</li>
  <li><b>drs_uri:</b> A string of characters used to identify a resource on the Data Repository Service (DRS). Can be used to retrieve this specific file from a server.</li>
  <li><b>byte_size:</b> Size of the file in bytes. Maps to dcat:byteSize.</li>
  <li><b>checksum:</b> The md5 value for the file. A digit representing the sum of the correct digits in a piece of stored or transmitted digital data, against which later comparisons can be made to detect errors in the data.</li>
  <li><b>data_modality:</b> Data modality describes the biological nature of the information gathered as the result of an Activity, independent of the technology or methods used to produce the information. Always one of "Genomic", "Proteomic", or "Imaging".</li>
  <li><b>imaging_modality:</b> An imaging modality describes the imaging equipment and/or method used to acquire certain structural or functional information about the body. These include but are not limited to computed tomography (CT) and magnetic resonance imaging (MRI). Taken from the DICOM standard.</li>
  <li><b>dbgap_accession_number:</b> The dbgap accession number for the project.</li>
</ul>  

</div>
    
---
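
The `byte_size` and `checksum` fields can be used to check a file after it has been downloaded. The following is a minimal sketch, assuming the checksum is an MD5 hex digest as described above. The `verify_file` helper is hypothetical and is not part of cdapython:

```python
import hashlib
import os


def verify_file(path, expected_md5, expected_size):
    """Return True if the file at `path` matches the expected size and MD5."""
    # Cheap check first: a size mismatch means the download is incomplete.
    if os.path.getsize(path) != expected_size:
        return False
    md5 = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            md5.update(chunk)
    return md5.hexdigest() == expected_md5.lower()
```

Given a row of file metadata, the call would look like `verify_file(local_path, row["checksum"], row["byte_size"])`, where `local_path` is wherever the file was saved.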


## Working with Results (pagination)

<img src="https://github.com/CancerDataAggregator/readthedocs/blob/main/docs/Examples/images/Julia.png?raw=true" align="left"
	width="50" height="50" />
   
Finally, Julia wants to save these results for future use. Since the preview dataframes only show the first 100 results of each search, she uses the `paginator` function to get all the data from the subject and researchsubject endpoints into their own dataframes:

In [18]:
researchsubs = NoAdenoData.researchsubject.run()
rsdf = pd.DataFrame()
for i in researchsubs.paginator(to_df=True):
    rsdf = pd.concat([rsdf, i])

Total execution time: 3592 ms


In [19]:
subs = NoAdenoData.subject.run()
subsdf = pd.DataFrame()
for i in subs.paginator(to_df=True):
    subsdf = pd.concat([subsdf, i])

Total execution time: 3596 ms
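
Calling `pd.concat` inside the loop copies the growing dataframe on every page. An alternative sketch collects the pages in a list and concatenates once. `fake_pages` below is a hypothetical stand-in for `subs.paginator(to_df=True)`, which is assumed to yield one dataframe per page, so that the example runs without a CDA connection:

```python
import pandas as pd


def fake_pages():
    """Hypothetical stand-in for a paginator: yields one dataframe per page."""
    yield pd.DataFrame({"id": ["A", "B"], "sex": ["female", "female"]})
    yield pd.DataFrame({"id": ["C"], "sex": ["female"]})


def collect_pages(pages):
    """Concatenate an iterable of page dataframes into a single dataframe."""
    frames = list(pages)
    if not frames:
        return pd.DataFrame()
    # ignore_index=True renumbers rows 0..n-1 instead of restarting per page.
    return pd.concat(frames, ignore_index=True)


all_subjects = collect_pages(fake_pages())
print(all_subjects)
```

With the real results this would be `collect_pages(subs.paginator(to_df=True))`. Renumbering the rows avoids the repeated row labels that per-page concatenation otherwise produces.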


In [20]:
rsdf # view the researchsubject dataframe

Unnamed: 0,id,identifier,member_of_research_project,primary_diagnosis_condition,primary_diagnosis_site,subject_id
0,1f20b0c8-11c1-4a3c-84fc-d485aac49bc8,"[{'system': 'GDC', 'value': '1f20b0c8-11c1-4a3...",FM-AD,Complex Mixed and Stromal Neoplasms,"Uterus, NOS",AD16769
1,29d93fb1-0b3d-4d13-8799-2dcf3e14be04,"[{'system': 'GDC', 'value': '29d93fb1-0b3d-4d1...",GENIE-DFCI,"Cystic, Mucinous and Serous Neoplasms","Uterus, NOS",GENIE-DFCI-006032
2,3bccf0e3-d467-4477-adfd-b04d71f3eb86,"[{'system': 'GDC', 'value': '3bccf0e3-d467-447...",GENIE-DFCI,Myomatous Neoplasms,"Uterus, NOS",GENIE-DFCI-009614
3,3c25bf98-1a11-41dc-8d65-182b0e8c5978,"[{'system': 'GDC', 'value': '3c25bf98-1a11-41d...",GENIE-MSK,Myomatous Neoplasms,"Uterus, NOS",GENIE-MSK-P-0014679
4,4ed1616e-8c62-4104-bcbe-26d7059f04d0,"[{'system': 'GDC', 'value': '4ed1616e-8c62-410...",TCGA-CESC,Squamous Cell Neoplasms,Cervix uteri,TCGA-C5-A7X3
...,...,...,...,...,...,...
92,c2b54dd8-ef0a-464a-b31e-3a4c1e9cd20c,"[{'system': 'GDC', 'value': 'c2b54dd8-ef0a-464...",GENIE-MSK,"Epithelial Neoplasms, NOS","Uterus, NOS",GENIE-MSK-P-0000080
93,d5ee39e8-a1d2-4244-9c38-07d7c5c1baba,"[{'system': 'GDC', 'value': 'd5ee39e8-a1d2-424...",TCGA-UCEC,"Cystic, Mucinous and Serous Neoplasms",Corpus uteri,TCGA-B5-A5OE
94,e5e66e4f-46af-4d88-a238-a8eb8e9dca09,"[{'system': 'GDC', 'value': 'e5e66e4f-46af-4d8...",GENIE-DFCI,"Epithelial Neoplasms, NOS",Corpus uteri,GENIE-DFCI-037165
95,f8c840a8-2891-4c72-b9d8-9eb9c4351cf7,"[{'system': 'GDC', 'value': 'f8c840a8-2891-4c7...",GENIE-UHN,"Cystic, Mucinous and Serous Neoplasms","Uterus, NOS",GENIE-UHN-242802


In [21]:
subsdf # view the subject dataframe

Unnamed: 0,id,identifier,species,sex,race,ethnicity,days_to_birth,subject_associated_project,vital_status,days_to_death,cause_of_death
0,AD95,"[{'system': 'GDC', 'value': 'AD95'}]",homo sapiens,female,not reported,not reported,,[FM-AD],Not Reported,,
1,C3L-02894,"[{'system': 'IDC', 'value': 'C3L-02894'}]",Homo sapiens,,,,,[cptac_ucec],,,
2,C3N-03005,"[{'system': 'IDC', 'value': 'C3N-03005'}]",Homo sapiens,,,,,[cptac_ucec],,,
3,GENIE-DFCI-002257,"[{'system': 'GDC', 'value': 'GENIE-DFCI-002257'}]",homo sapiens,female,white,not hispanic or latino,-22280.0,[GENIE-DFCI],Not Reported,,
4,GENIE-DFCI-003511,"[{'system': 'GDC', 'value': 'GENIE-DFCI-003511'}]",homo sapiens,female,white,not hispanic or latino,-24837.0,[GENIE-DFCI],Not Reported,,
...,...,...,...,...,...,...,...,...,...,...,...
4,TCGA-D1-A1NZ,"[{'system': 'GDC', 'value': 'TCGA-D1-A1NZ'}, {...",homo sapiens,female,white,not hispanic or latino,-22027.0,"[TCGA-UCEC, tcga_ucec]",Alive,,
5,TCGA-D1-A2G5,"[{'system': 'GDC', 'value': 'TCGA-D1-A2G5'}, {...",homo sapiens,female,white,not hispanic or latino,-18322.0,"[TCGA-UCEC, tcga_ucec]",Alive,,
6,TCGA-IS-A3K7,"[{'system': 'GDC', 'value': 'TCGA-IS-A3K7'}, {...",homo sapiens,female,white,not reported,-23022.0,"[tcga_sarc, TCGA-SARC]",Alive,,
7,TCGA-NF-A4WU,"[{'system': 'GDC', 'value': 'TCGA-NF-A4WU'}, {...",homo sapiens,female,white,not hispanic or latino,-21901.0,"[TCGA-UCS, tcga_ucs]",Alive,,


## Merging Results across Endpoints

<img src="https://github.com/CancerDataAggregator/readthedocs/blob/main/docs/Examples/images/Julia.png?raw=true" align="left"
	width="50" height="50" />
   
Then Julia merges the two results into one big dataset, matching the `subject_id` column of the researchsubject results to the `id` column of the subject results:

In [22]:
allmetadata = pd.merge(rsdf,
                subsdf,
                left_on="subject_id",
                right_on='id')

allmetadata

Unnamed: 0,id_x,identifier_x,member_of_research_project,primary_diagnosis_condition,primary_diagnosis_site,subject_id,id_y,identifier_y,species,sex,race,ethnicity,days_to_birth,subject_associated_project,vital_status,days_to_death,cause_of_death
0,1f20b0c8-11c1-4a3c-84fc-d485aac49bc8,"[{'system': 'GDC', 'value': '1f20b0c8-11c1-4a3...",FM-AD,Complex Mixed and Stromal Neoplasms,"Uterus, NOS",AD16769,AD16769,"[{'system': 'GDC', 'value': 'AD16769'}]",homo sapiens,female,not reported,not reported,,[FM-AD],Not Reported,,
1,29d93fb1-0b3d-4d13-8799-2dcf3e14be04,"[{'system': 'GDC', 'value': '29d93fb1-0b3d-4d1...",GENIE-DFCI,"Cystic, Mucinous and Serous Neoplasms","Uterus, NOS",GENIE-DFCI-006032,GENIE-DFCI-006032,"[{'system': 'GDC', 'value': 'GENIE-DFCI-006032'}]",homo sapiens,female,white,not hispanic or latino,-31046.0,[GENIE-DFCI],Not Reported,,
2,3bccf0e3-d467-4477-adfd-b04d71f3eb86,"[{'system': 'GDC', 'value': '3bccf0e3-d467-447...",GENIE-DFCI,Myomatous Neoplasms,"Uterus, NOS",GENIE-DFCI-009614,GENIE-DFCI-009614,"[{'system': 'GDC', 'value': 'GENIE-DFCI-009614'}]",homo sapiens,female,white,not hispanic or latino,-17897.0,[GENIE-DFCI],Not Reported,,
3,3c25bf98-1a11-41dc-8d65-182b0e8c5978,"[{'system': 'GDC', 'value': '3c25bf98-1a11-41d...",GENIE-MSK,Myomatous Neoplasms,"Uterus, NOS",GENIE-MSK-P-0014679,GENIE-MSK-P-0014679,"[{'system': 'GDC', 'value': 'GENIE-MSK-P-00146...",homo sapiens,female,white,not hispanic or latino,-16071.0,[GENIE-MSK],Not Reported,,
4,4ed1616e-8c62-4104-bcbe-26d7059f04d0,"[{'system': 'GDC', 'value': '4ed1616e-8c62-410...",TCGA-CESC,Squamous Cell Neoplasms,Cervix uteri,TCGA-C5-A7X3,TCGA-C5-A7X3,"[{'system': 'GDC', 'value': 'TCGA-C5-A7X3'}, {...",homo sapiens,female,white,not hispanic or latino,-25665.0,"[TCGA-CESC, tcga_cesc]",Dead,284.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3192,baf1b1f8-36ae-4370-98dd-23bda7ed147e,"[{'system': 'GDC', 'value': 'baf1b1f8-36ae-437...",GENIE-MSK,"Cystic, Mucinous and Serous Neoplasms","Uterus, NOS",GENIE-MSK-P-0005695,GENIE-MSK-P-0005695,"[{'system': 'GDC', 'value': 'GENIE-MSK-P-00056...",homo sapiens,female,asian,not hispanic or latino,-24471.0,[GENIE-MSK],Not Reported,,
3193,c2b54dd8-ef0a-464a-b31e-3a4c1e9cd20c,"[{'system': 'GDC', 'value': 'c2b54dd8-ef0a-464...",GENIE-MSK,"Epithelial Neoplasms, NOS","Uterus, NOS",GENIE-MSK-P-0000080,GENIE-MSK-P-0000080,"[{'system': 'GDC', 'value': 'GENIE-MSK-P-00000...",homo sapiens,female,white,not hispanic or latino,-24837.0,[GENIE-MSK],Not Reported,,
3194,e5e66e4f-46af-4d88-a238-a8eb8e9dca09,"[{'system': 'GDC', 'value': 'e5e66e4f-46af-4d8...",GENIE-DFCI,"Epithelial Neoplasms, NOS",Corpus uteri,GENIE-DFCI-037165,GENIE-DFCI-037165,"[{'system': 'GDC', 'value': 'GENIE-DFCI-037165'}]",homo sapiens,female,white,not hispanic or latino,-24471.0,[GENIE-DFCI],Not Reported,,
3195,f8c840a8-2891-4c72-b9d8-9eb9c4351cf7,"[{'system': 'GDC', 'value': 'f8c840a8-2891-4c7...",GENIE-UHN,"Cystic, Mucinous and Serous Neoplasms","Uterus, NOS",GENIE-UHN-242802,GENIE-UHN-242802,"[{'system': 'GDC', 'value': 'GENIE-UHN-242802'}]",homo sapiens,female,black or african american,not allowed to collect,-21915.0,[GENIE-UHN],Not Reported,,
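
The merge pairs each researchsubject row with the subject whose `id` equals its `subject_id`. Because both tables have an `id` and an `identifier` column, pandas adds the `_x` and `_y` suffixes. Below is a toy sketch of the same join, using made-up rows rather than CDA data. The `how`, `validate` and `indicator` arguments are optional pandas features added here as sanity checks; the cell above relies on the default inner join instead:

```python
import pandas as pd

# Made-up rows that mimic the shape of the two result tables.
rs_toy = pd.DataFrame({
    "id": ["rs-1", "rs-2", "rs-3"],
    "primary_diagnosis_site": ["Uterus, NOS", "Cervix uteri", "Corpus uteri"],
    "subject_id": ["S1", "S1", "S2"],
})
subs_toy = pd.DataFrame({
    "id": ["S1", "S2", "S3"],
    "sex": ["female", "female", "female"],
})

merged = pd.merge(
    rs_toy,
    subs_toy,
    left_on="subject_id",
    right_on="id",
    how="left",              # keep researchsubjects even if no subject matches
    validate="many_to_one",  # raise if a subject id appears twice on the right
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)
print(merged[["id_x", "subject_id", "id_y", "sex", "_merge"]])
```

Rows marked `left_only` in the `_merge` column would be researchsubjects with no matching subject, which an inner join drops silently.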


<img src="https://github.com/CancerDataAggregator/readthedocs/blob/main/docs/Examples/images/Julia.png?raw=true" align="left"
	width="50" height="50" />
   
She then saves it to a CSV file so she can browse it with Excel:

In [23]:
allmetadata.to_csv("allmetadata.csv")
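
By default `to_csv` also writes the row index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back with pandas. The small sketch below uses a made-up dataframe and a temporary file to show `index=False`, along with one caveat: list-valued columns such as `identifier` come back as plain strings after a round trip through CSV:

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({
    "id": ["S1", "S2"],
    "identifier": [
        [{"system": "GDC", "value": "S1"}],
        [{"system": "IDC", "value": "S2"}],
    ],
})

path = os.path.join(tempfile.mkdtemp(), "toy_metadata.csv")
toy.to_csv(path, index=False)  # omit the row index from the file

reloaded = pd.read_csv(path)
print(list(reloaded.columns))
print(type(reloaded.loc[0, "identifier"]))
```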

<img src="https://github.com/CancerDataAggregator/readthedocs/blob/main/docs/Examples/images/Julia.png?raw=true" align="left"
	width="50" height="50" />
   
Julia knows from her subject count summary that there are more than 200,000 files associated with her subjects, which is likely far more than she needs. To help her decide what files she wants, Julia uses endpoint chaining to get summary information about the files that are assigned to researchsubjects for her search criteria:


In [24]:
NoAdenoData.researchsubject.file.count.run()

Total execution time: 3505 ms


system,count
GDC,31274
IDC,264429
PDC,2560

data_category,count
Imaging,264429
Processed Mass Spectra,640
Simple Nucleotide Variation,11745
Copy Number Variation,4079
Sequencing Reads,4142
Clinical,774
Somatic Structural Variation,370
Biospecimen,2866
Structural Variation,2192
Raw Mass Spectra,640

file_format,count
DICOM,264429
BAM,4142
VCF,6652
SVS,1111
tsv,640
TSV,3980
BEDPE,1504
BCR SSF XML,517
MAF,5235
vendor-specific,640

data_type,count
Annotated Somatic Mutation,6043
,264429
Aggregated Somatic Mutation,732
Isoform Expression Quantification,700
Slide Image,1111
Copy Number Segment,1140
Proprietary,640
Masked Intensities,1298
Aligned Reads,4142
Gene Level Copy Number Scores,809
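
The summary above lists `tsv` and `TSV` as separate file formats. Once file records are in a dataframe, similar per-column counts can be computed locally, and the case can be normalised first. A sketch with made-up rows (the counts are illustrative only and do not come from CDA):

```python
import pandas as pd

files_toy = pd.DataFrame({
    "file_format": ["TSV", "tsv", "SVS", "TSV", "DICOM"],
})

# Counts as reported: 'TSV' and 'tsv' are treated as different values.
raw_counts = files_toy["file_format"].value_counts()

# Upper-casing first folds the two spellings into one category.
normalised_counts = files_toy["file_format"].str.upper().value_counts()

print(raw_counts)
print(normalised_counts)
```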




<img src="https://github.com/CancerDataAggregator/readthedocs/blob/main/docs/Examples/images/Julia.png?raw=true" align="left"
	width="50" height="50" />
    
Julia decides that a good place to start would be with Slide Images. There are only 1,111, so she should be able to quickly scan through them over the next few days and see if they will be useful. So she adds one more filter to her search:

In [25]:
JustSlides = Q('file.data_type = "Slide Image"')
NoadenoJustSlides = NoAdenoData.AND(JustSlides)
NoadenoJustSlides.researchsubject.file.count.run()

Total execution time: 3618 ms


system,count
GDC,1111

data_category,count
Biospecimen,1111

file_format,count
SVS,1111

data_type,count
Slide Image,1111




<img src="https://github.com/CancerDataAggregator/readthedocs/blob/main/docs/Examples/images/Julia.png?raw=true" align="left"
	width="50" height="50" />
    
Finally, Julia uses the `paginator` function again to get all the slide files, and merges her metadata file with this file information. This way she will be able to review what phenotypes each slide is associated with:

In [26]:
slides = NoadenoJustSlides.researchsubject.file.run()
slidesdf = pd.DataFrame()
for i in slides.paginator(to_df=True):
    slidesdf = pd.concat([slidesdf, i])


Total execution time: 3522 ms


In [37]:
slidemetadata = pd.merge(slidesdf, 
                         allmetadata, 
                         on="subject_id")
slidemetadata

Unnamed: 0,id,identifier,label,data_category,data_type,file_format,associated_project,drs_uri,byte_size,checksum,...,identifier_y,species,sex,race,ethnicity,days_to_birth,subject_associated_project,vital_status,days_to_death,cause_of_death
0,040229b3-224c-4107-bd33-0854196b6423,"[{'system': 'GDC', 'value': '040229b3-224c-410...",TCGA-VS-A8EI-01Z-00-DX1.8DD9CBFB-C3B2-48D0-ADE...,Biospecimen,Slide Image,SVS,TCGA-CESC,drs://dg.4DFC:040229b3-224c-4107-bd33-0854196b...,54328093,d4ecae7c6f8f467afbcd060058ffca84,...,"[{'system': 'GDC', 'value': 'TCGA-VS-A8EI'}, {...",homo sapiens,female,white,not reported,-14129.0,"[TCGA-CESC, tcga_cesc]",Alive,,
1,040229b3-224c-4107-bd33-0854196b6423,"[{'system': 'GDC', 'value': '040229b3-224c-410...",TCGA-VS-A8EI-01Z-00-DX1.8DD9CBFB-C3B2-48D0-ADE...,Biospecimen,Slide Image,SVS,TCGA-CESC,drs://dg.4DFC:040229b3-224c-4107-bd33-0854196b...,54328093,d4ecae7c6f8f467afbcd060058ffca84,...,"[{'system': 'GDC', 'value': 'TCGA-VS-A8EI'}, {...",homo sapiens,female,white,not reported,-14129.0,"[TCGA-CESC, tcga_cesc]",Alive,,
2,a9e316b2-abcf-4e40-870d-3e1d74abf8e4,"[{'system': 'GDC', 'value': 'a9e316b2-abcf-4e4...",TCGA-VS-A8EI-01A-01-TS1.64C2A4BF-CE1B-46CB-AE4...,Biospecimen,Slide Image,SVS,TCGA-CESC,drs://dg.4DFC:a9e316b2-abcf-4e40-870d-3e1d74ab...,192640689,c9526a0e3df583efda8f0dc61bb21d04,...,"[{'system': 'GDC', 'value': 'TCGA-VS-A8EI'}, {...",homo sapiens,female,white,not reported,-14129.0,"[TCGA-CESC, tcga_cesc]",Alive,,
3,a9e316b2-abcf-4e40-870d-3e1d74abf8e4,"[{'system': 'GDC', 'value': 'a9e316b2-abcf-4e4...",TCGA-VS-A8EI-01A-01-TS1.64C2A4BF-CE1B-46CB-AE4...,Biospecimen,Slide Image,SVS,TCGA-CESC,drs://dg.4DFC:a9e316b2-abcf-4e40-870d-3e1d74ab...,192640689,c9526a0e3df583efda8f0dc61bb21d04,...,"[{'system': 'GDC', 'value': 'TCGA-VS-A8EI'}, {...",homo sapiens,female,white,not reported,-14129.0,"[TCGA-CESC, tcga_cesc]",Alive,,
4,1595ab35-ad76-4f27-8b2e-8de0482bf164,"[{'system': 'GDC', 'value': '1595ab35-ad76-4f2...",TCGA-DI-A2QY-11A-01-TS1.9DEA9504-6CF4-4FD2-A91...,Biospecimen,Slide Image,SVS,TCGA-UCEC,drs://dg.4DFC:1595ab35-ad76-4f27-8b2e-8de0482b...,50173067,79e82fcf3107181102b50ca3e2fde9f5,...,"[{'system': 'GDC', 'value': 'TCGA-DI-A2QY'}, {...",homo sapiens,female,white,not hispanic or latino,-23398.0,"[TCGA-UCEC, tcga_ucec]",Dead,3349.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2145,317ec307-62fd-45d0-b1bd-03aeb094a67e,"[{'system': 'GDC', 'value': '317ec307-62fd-45d...",TCGA-AX-A2H4-01A-02-TSB.E4077E33-365F-46B0-838...,Biospecimen,Slide Image,SVS,TCGA-UCEC,drs://dg.4DFC:317ec307-62fd-45d0-b1bd-03aeb094...,361909653,df2c6b4dd80890e4f4bc84dd9c1b44fc,...,"[{'system': 'GDC', 'value': 'TCGA-AX-A2H4'}, {...",homo sapiens,female,white,not hispanic or latino,-24234.0,"[TCGA-UCEC, tcga_ucec]",Dead,916.0,
2146,0b9e4e31-bbb1-4d88-b50c-a989ce9f4aff,"[{'system': 'GDC', 'value': '0b9e4e31-bbb1-4d8...",TCGA-AX-A2H4-11A-01-TSA.6A13C652-59BA-40D8-937...,Biospecimen,Slide Image,SVS,TCGA-UCEC,drs://dg.4DFC:0b9e4e31-bbb1-4d88-b50c-a989ce9f...,126509613,3b8d3699fad4ea15f5efc8e60e4977e2,...,"[{'system': 'GDC', 'value': 'TCGA-AX-A2H4'}, {...",homo sapiens,female,white,not hispanic or latino,-24234.0,"[TCGA-UCEC, tcga_ucec]",Dead,916.0,
2147,0b9e4e31-bbb1-4d88-b50c-a989ce9f4aff,"[{'system': 'GDC', 'value': '0b9e4e31-bbb1-4d8...",TCGA-AX-A2H4-11A-01-TSA.6A13C652-59BA-40D8-937...,Biospecimen,Slide Image,SVS,TCGA-UCEC,drs://dg.4DFC:0b9e4e31-bbb1-4d88-b50c-a989ce9f...,126509613,3b8d3699fad4ea15f5efc8e60e4977e2,...,"[{'system': 'GDC', 'value': 'TCGA-AX-A2H4'}, {...",homo sapiens,female,white,not hispanic or latino,-24234.0,"[TCGA-UCEC, tcga_ucec]",Dead,916.0,
2148,358728ce-157b-4002-8377-5391781a3d57,"[{'system': 'GDC', 'value': '358728ce-157b-400...",TCGA-EK-A2RB-01A-01-TS1.45101B3C-E301-4BD3-B4E...,Biospecimen,Slide Image,SVS,TCGA-CESC,drs://dg.4DFC:358728ce-157b-4002-8377-5391781a...,116519647,592cb544b9ee7331d52d0ab68505d956,...,"[{'system': 'GDC', 'value': 'TCGA-EK-A2RB'}, {...",homo sapiens,female,white,not hispanic or latino,-17752.0,"[TCGA-CESC, tcga_cesc]",Alive,,


<img src="https://github.com/CancerDataAggregator/readthedocs/blob/main/docs/Examples/images/Julia.png?raw=true" align="left"
	width="50" height="50" />
    
This merge seems to have created a lot of duplicate rows, so Julia wants to drop them before saving the dataframe to a csv. Since the identifier columns hold lists, which pandas cannot hash, `drop_duplicates` cannot compare them directly. A quick fix is to restrict the comparison to the remaining columns, so Julia first gets the list of all the column names, and then pastes everything but "identifier", "identifier_x", "identifier_y" and "subject_associated_project" into her `drop_duplicates` call:

In [40]:
slidemetadata.columns

Index(['id', 'identifier', 'label', 'data_category', 'data_type',
       'file_format', 'associated_project', 'drs_uri', 'byte_size', 'checksum',
       'data_modality', 'imaging_modality', 'dbgap_accession_number',
       'imaging_series', 'subject_id', 'researchsubject_id', 'id_x',
       'identifier_x', 'member_of_research_project',
       'primary_diagnosis_condition', 'primary_diagnosis_site', 'id_y',
       'identifier_y', 'species', 'sex', 'race', 'ethnicity', 'days_to_birth',
       'subject_associated_project', 'vital_status', 'days_to_death',
       'cause_of_death'],
      dtype='object')

In [44]:
slidemetadata.drop_duplicates(inplace=True, 
                              subset=['id', 'label', 'data_category', 'data_type',
       'file_format', 'associated_project', 'drs_uri', 'byte_size', 'checksum',
       'data_modality', 'imaging_modality', 'dbgap_accession_number',
       'imaging_series', 'subject_id', 'researchsubject_id', 'id_x',
        'member_of_research_project',
       'primary_diagnosis_condition', 'primary_diagnosis_site', 'id_y',
        'species', 'sex', 'race', 'ethnicity', 'days_to_birth',
         'vital_status', 'days_to_death',
       'cause_of_death'])
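
Rather than pasting column names by hand, the subset can also be built by skipping any column that holds lists. This is a sketch on made-up data. `hashable_columns` is a hypothetical helper, and it assumes that inspecting every value of each column is affordable for the dataframe at hand:

```python
import pandas as pd


def hashable_columns(df):
    """Return the names of columns that contain no list or dict values."""
    keep = []
    for name in df.columns:
        has_container = df[name].map(lambda v: isinstance(v, (list, dict))).any()
        if not has_container:
            keep.append(name)
    return keep


toy = pd.DataFrame({
    "id": ["f1", "f1", "f2"],
    "identifier": [[{"system": "GDC"}], [{"system": "GDC"}], [{"system": "GDC"}]],
    "subject_id": ["S1", "S1", "S1"],
})

subset = hashable_columns(toy)
deduplicated = toy.drop_duplicates(subset=subset)
print(subset)
print(deduplicated)
```

Applied to the real data, this would be `slidemetadata.drop_duplicates(subset=hashable_columns(slidemetadata))`.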

In [45]:
slidemetadata

Unnamed: 0,id,identifier,label,data_category,data_type,file_format,associated_project,drs_uri,byte_size,checksum,...,identifier_y,species,sex,race,ethnicity,days_to_birth,subject_associated_project,vital_status,days_to_death,cause_of_death
0,040229b3-224c-4107-bd33-0854196b6423,"[{'system': 'GDC', 'value': '040229b3-224c-410...",TCGA-VS-A8EI-01Z-00-DX1.8DD9CBFB-C3B2-48D0-ADE...,Biospecimen,Slide Image,SVS,TCGA-CESC,drs://dg.4DFC:040229b3-224c-4107-bd33-0854196b...,54328093,d4ecae7c6f8f467afbcd060058ffca84,...,"[{'system': 'GDC', 'value': 'TCGA-VS-A8EI'}, {...",homo sapiens,female,white,not reported,-14129.0,"[TCGA-CESC, tcga_cesc]",Alive,,
2,a9e316b2-abcf-4e40-870d-3e1d74abf8e4,"[{'system': 'GDC', 'value': 'a9e316b2-abcf-4e4...",TCGA-VS-A8EI-01A-01-TS1.64C2A4BF-CE1B-46CB-AE4...,Biospecimen,Slide Image,SVS,TCGA-CESC,drs://dg.4DFC:a9e316b2-abcf-4e40-870d-3e1d74ab...,192640689,c9526a0e3df583efda8f0dc61bb21d04,...,"[{'system': 'GDC', 'value': 'TCGA-VS-A8EI'}, {...",homo sapiens,female,white,not reported,-14129.0,"[TCGA-CESC, tcga_cesc]",Alive,,
4,1595ab35-ad76-4f27-8b2e-8de0482bf164,"[{'system': 'GDC', 'value': '1595ab35-ad76-4f2...",TCGA-DI-A2QY-11A-01-TS1.9DEA9504-6CF4-4FD2-A91...,Biospecimen,Slide Image,SVS,TCGA-UCEC,drs://dg.4DFC:1595ab35-ad76-4f27-8b2e-8de0482b...,50173067,79e82fcf3107181102b50ca3e2fde9f5,...,"[{'system': 'GDC', 'value': 'TCGA-DI-A2QY'}, {...",homo sapiens,female,white,not hispanic or latino,-23398.0,"[TCGA-UCEC, tcga_ucec]",Dead,3349.0,
6,e2086d9b-63ba-4e03-966e-a4427b6ade74,"[{'system': 'GDC', 'value': 'e2086d9b-63ba-4e0...",TCGA-DI-A2QY-01Z-00-DX1.EBBED5CC-3098-4694-BCD...,Biospecimen,Slide Image,SVS,TCGA-UCEC,drs://dg.4DFC:e2086d9b-63ba-4e03-966e-a4427b6a...,1283995391,165a9199090fe0ff9ea86fd2168d41f2,...,"[{'system': 'GDC', 'value': 'TCGA-DI-A2QY'}, {...",homo sapiens,female,white,not hispanic or latino,-23398.0,"[TCGA-UCEC, tcga_ucec]",Dead,3349.0,
8,dd8d427d-d63c-41d4-9da7-dab428a3dc3b,"[{'system': 'GDC', 'value': 'dd8d427d-d63c-41d...",TCGA-DI-A2QY-01A-01-TS1.62316960-36E9-4919-BE3...,Biospecimen,Slide Image,SVS,TCGA-UCEC,drs://dg.4DFC:dd8d427d-d63c-41d4-9da7-dab428a3...,255850447,b96eceebd9116cf9567830ce36cd157a,...,"[{'system': 'GDC', 'value': 'TCGA-DI-A2QY'}, {...",homo sapiens,female,white,not hispanic or latino,-23398.0,"[TCGA-UCEC, tcga_ucec]",Dead,3349.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2140,9bc76014-1b92-4914-8bf1-482b85bae0dd,"[{'system': 'GDC', 'value': '9bc76014-1b92-491...",TCGA-AJ-A3BG-01A-01-TS1.FDB1B2A0-9AAD-460C-A87...,Biospecimen,Slide Image,SVS,TCGA-UCEC,drs://dg.4DFC:9bc76014-1b92-4914-8bf1-482b85ba...,78523868,6d4195721e8f04d6fbf708fd62f98af3,...,"[{'system': 'GDC', 'value': 'TCGA-AJ-A3BG'}, {...",homo sapiens,female,white,not hispanic or latino,-23861.0,"[TCGA-UCEC, tcga_ucec]",Alive,,
2142,859f8cfb-116d-44ec-9017-b6f6617d62bf,"[{'system': 'GDC', 'value': '859f8cfb-116d-44e...",TCGA-AJ-A3BG-01Z-00-DX1.87799C6D-8229-4DE2-BD6...,Biospecimen,Slide Image,SVS,TCGA-UCEC,drs://dg.4DFC:859f8cfb-116d-44ec-9017-b6f6617d...,185346200,67c97665154649a31354b8880aebd713,...,"[{'system': 'GDC', 'value': 'TCGA-AJ-A3BG'}, {...",homo sapiens,female,white,not hispanic or latino,-23861.0,"[TCGA-UCEC, tcga_ucec]",Alive,,
2144,317ec307-62fd-45d0-b1bd-03aeb094a67e,"[{'system': 'GDC', 'value': '317ec307-62fd-45d...",TCGA-AX-A2H4-01A-02-TSB.E4077E33-365F-46B0-838...,Biospecimen,Slide Image,SVS,TCGA-UCEC,drs://dg.4DFC:317ec307-62fd-45d0-b1bd-03aeb094...,361909653,df2c6b4dd80890e4f4bc84dd9c1b44fc,...,"[{'system': 'GDC', 'value': 'TCGA-AX-A2H4'}, {...",homo sapiens,female,white,not hispanic or latino,-24234.0,"[TCGA-UCEC, tcga_ucec]",Dead,916.0,
2146,0b9e4e31-bbb1-4d88-b50c-a989ce9f4aff,"[{'system': 'GDC', 'value': '0b9e4e31-bbb1-4d8...",TCGA-AX-A2H4-11A-01-TSA.6A13C652-59BA-40D8-937...,Biospecimen,Slide Image,SVS,TCGA-UCEC,drs://dg.4DFC:0b9e4e31-bbb1-4d88-b50c-a989ce9f...,126509613,3b8d3699fad4ea15f5efc8e60e4977e2,...,"[{'system': 'GDC', 'value': 'TCGA-AX-A2H4'}, {...",homo sapiens,female,white,not hispanic or latino,-24234.0,"[TCGA-UCEC, tcga_ucec]",Dead,916.0,


<img src="https://github.com/CancerDataAggregator/readthedocs/blob/main/docs/Examples/images/Julia.png?raw=true" align="left"
	width="50" height="50" />
    
That looks better! Julia saves this dataframe to a csv, and now she has all the information she needs to begin work on her project. She can use the `drs_uri` column to directly download the images she is interested in using a DRS resolver, or she can enter the DRS URIs in a cloud workspace such as [Terra](https://terra.bio/) or the [Cancer Genomics Cloud](https://www.cancergenomicscloud.org/) to view the images online. In either case, she has all the metadata she needs to get started, and can save this notebook of her work in case she'd like to come back and modify her search.

In [28]:
slidemetadata.to_csv("slidemetadata.csv")
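
The values in the `drs_uri` column have the form `drs://<prefix>:<accession>`, which appears to correspond to what the GA4GH DRS specification calls a compact-identifier URI. Below is a sketch of splitting such a value into its parts. `split_drs_uri` is a hypothetical helper, and the accession used is a placeholder rather than a real file ID. Actually fetching a file still requires a DRS resolver, which is not shown here:

```python
def split_drs_uri(uri):
    """Split 'drs://prefix:accession' into (prefix, accession)."""
    scheme = "drs://"
    if not uri.startswith(scheme):
        raise ValueError(f"not a DRS URI: {uri!r}")
    prefix, sep, accession = uri[len(scheme):].partition(":")
    if not sep or not prefix or not accession:
        raise ValueError(f"expected drs://prefix:accession, got {uri!r}")
    return prefix, accession


# Placeholder accession, not a real file.
print(split_drs_uri("drs://dg.4DFC:00000000-0000-0000-0000-000000000000"))
```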