# Build a Cohort

**Example use case:** 

<img src="https://github.com/CancerDataAggregator/readthedocs/blob/main/docs/Examples/images/Julia.png?raw=true" alt="alt_text" align="left"
	width="150" height="150" />
Julia is an oncologist who specializes in female reproductive health. As part of her research, she is interested in using existing data on uterine cancers. If possible, she would like to see multiple data types (gross imaging, genomic data, proteomic data, histology) that come from the same patient, so she can look for shared phenotypes and test their potential as early diagnostics. Julia heard that the Cancer Data Aggregator has made it easy to search across multiple datasets created by NCI, and so has decided to start her search there.



Before Julia does any work, she needs to import several functions from cdapython:

- `Q` and `query` which power the search
- `columns` which lets us view entity field names
- `unique_terms` which lets us view entity field contents

She also asks cdapython to report its version so she can be sure she's using the one she means to.

In [1]:
from cdapython import Q, columns, unique_terms, query
import cdapython
import pandas as pd 
print(cdapython.__version__)


2022.6.22


<div style="background-color:#b3e5d5;color:black;padding:20px;">
    
CDA data comes from three sources:
<ul>
<li><b>The <a href="https://proteomic.datacommons.cancer.gov/pdc/"> Proteomic Data Commons</a> (PDC)</b></li>
<li><b>The <a href="https://gdc.cancer.gov/">Genomic Data Commons</a> (GDC)</b></li>
<li><b>The <a href="https://datacommons.cancer.gov/repository/imaging-data-commons">Imaging Data Commons</a> (IDC)</b></li>
</ul> 
    
The CDA makes this data searchable in four main endpoints:

<ul>
<li><b>subject:</b> A specific, unique individual: for example, a single human. When consent allows, a given entity will have a single subject ID that can be connected to all their studies and data across all datasets</li>
<li><b>researchsubject:</b> a person/plant/animal/microbe within a given study. An individual who participates in 3 studies will have 3 researchsubject IDs</li>
<li><b>specimen:</b> a tissue sample taken from a given subject, or a portion of the original sample. A given specimen will have only a single subject ID and a single research subject ID</li>
<li><b>file:</b> A unit of data about subjects, researchsubjects, specimens, or their associated information</li>
</ul>
and two endpoints that offer deeper information about data in the researchsubject endpoint:
<ul>
<li><b>diagnosis:</b> Information about what medical diagnosis a researchsubject has</li>
<li><b>treatment:</b> Information about what medical treatment(s) were performed for a given diagnosis</li>
</ul>
Any metadata field can be searched from any endpoint; the only difference between search types is what type of data is returned by default. This means that you can think of the CDA as a really, really enormous spreadsheet full of data. To search this enormous spreadsheet, you'd want to select columns, and then filter rows.
</div>


<img src="https://github.com/CancerDataAggregator/readthedocs/blob/main/docs/Examples/images/Julia.png?raw=true" align="left"
	width="50" height="50" />
   
   
   Accordingly, to see what search fields are available, Julia starts by using the command `columns`:

In [2]:
columns().to_list()

['File.id',
 'File.identifier.system',
 'File.identifier.value',
 'File.label',
 'File.data_category',
 'File.data_type',
 'File.file_format',
 'File.associated_project',
 'File.drs_uri',
 'File.byte_size',
 'File.checksum',
 'File.data_modality',
 'File.imaging_modality',
 'File.dbgap_accession_number',
 'id',
 'identifier.system',
 'identifier.value',
 'species',
 'sex',
 'race',
 'ethnicity',
 'days_to_birth',
 'subject_associated_project',
 'vital_status',
 'age_at_death',
 'cause_of_death',
 'ResearchSubject.id',
 'ResearchSubject.identifier.system',
 'ResearchSubject.identifier.value',
 'ResearchSubject.member_of_research_project',
 'ResearchSubject.primary_diagnosis_condition',
 'ResearchSubject.primary_diagnosis_site',
 'ResearchSubject.Diagnosis.id',
 'ResearchSubject.Diagnosis.identifier.system',
 'ResearchSubject.Diagnosis.identifier.value',
 'ResearchSubject.Diagnosis.primary_diagnosis',
 'ResearchSubject.Diagnosis.age_at_diagnosis',
 'ResearchSubject.Diagnosis.morphology',

<img src="https://github.com/CancerDataAggregator/readthedocs/blob/main/docs/Examples/images/Julia.png?raw=true" align="left"
	width="50" height="50" />
   
   
There are a lot of columns in the CDA data, but Julia is most interested in diagnosis data, so she filters the list to only those:

In [3]:
columns().to_list(filters="diagnosis")

['ResearchSubject.primary_diagnosis_condition',
 'ResearchSubject.primary_diagnosis_site',
 'ResearchSubject.Diagnosis.id',
 'ResearchSubject.Diagnosis.identifier.system',
 'ResearchSubject.Diagnosis.identifier.value',
 'ResearchSubject.Diagnosis.primary_diagnosis',
 'ResearchSubject.Diagnosis.age_at_diagnosis',
 'ResearchSubject.Diagnosis.morphology',
 'ResearchSubject.Diagnosis.stage',
 'ResearchSubject.Diagnosis.grade',
 'ResearchSubject.Diagnosis.method_of_diagnosis',
 'ResearchSubject.Diagnosis.Treatment.id',
 'ResearchSubject.Diagnosis.Treatment.identifier.system',
 'ResearchSubject.Diagnosis.Treatment.identifier.value',
 'ResearchSubject.Diagnosis.Treatment.treatment_type',
 'ResearchSubject.Diagnosis.Treatment.treatment_outcome',
 'ResearchSubject.Diagnosis.Treatment.days_to_treatment_start',
 'ResearchSubject.Diagnosis.Treatment.days_to_treatment_end',
 'ResearchSubject.Diagnosis.Treatment.therapeutic_agent',
 'ResearchSubject.Diagnosis.Treatment.treatment_anatomic_site',
 'Re

<div style="background-color:#b3e5d5;color:black;padding:20px;">

To search the CDA, a user also needs to know what search terms are available. Each column will contain a huge amount of data, so retrieving all of the rows would be overwhelming. Instead, the CDA has a `unique_terms()` function that will return all of the unique values that populate the requested column. Like `columns`, `unique_terms` defaults to giving us an overview of the results, and can be filtered.
    
</div>

<img src="https://github.com/CancerDataAggregator/readthedocs/blob/main/docs/Examples/images/Julia.png?raw=true" align="left"
	width="50" height="50" />
   
Since Julia is interested specifically in uterine cancers, she uses the `unique_terms` function to see what data is available for 'ResearchSubject.Diagnosis.Treatment.treatment_anatomic_site' and 'ResearchSubject.primary_diagnosis_site', and checks whether 'uterine' appears:

In [4]:
unique_terms("ResearchSubject.Diagnosis.Treatment.treatment_anatomic_site").to_list()

['Brain',
 'Cervix',
 'Head - Face Or Neck, Nos',
 'Lymph Node(s) Paraaortic',
 'Other',
 'Pelvis',
 'Spine',
 'Unknown']

In [5]:
unique_terms("ResearchSubject.primary_diagnosis_site").to_list()

['Abdomen',
 'Abdomen, Mediastinum',
 'Adrenal Glands',
 'Adrenal gland',
 'Anus and anal canal',
 'Base of tongue',
 'Bile Duct',
 'Bladder',
 'Bones, joints and articular cartilage of limbs',
 'Bones, joints and articular cartilage of other and unspecified sites',
 'Brain',
 'Breast',
 'Bronchus and lung',
 'Cervix',
 'Cervix uteri',
 'Chest',
 'Chest-Abdomen-Pelvis, Leg, TSpine',
 'Colon',
 'Connective, subcutaneous and other soft tissues',
 'Corpus uteri',
 'Ear',
 'Esophagus',
 'Extremities',
 'Eye and adnexa',
 'Floor of mouth',
 'Gallbladder',
 'Gum',
 'Head',
 'Head and Neck',
 'Head-Neck',
 'Heart, mediastinum, and pleura',
 'Hematopoietic and reticuloendothelial systems',
 'Hypopharynx',
 'Intraocular',
 'Kidney',
 'Larynx',
 'Lip',
 'Liver',
 'Liver and intrahepatic bile ducts',
 'Lung',
 'Lung Phantom',
 'Lymph nodes',
 'Marrow, Blood',
 'Meninges',
 'Mesothelium',
 'Nasal cavity and middle ear',
 'Nasopharynx',
 'Not Reported',
 'Oropharynx',
 'Other and ill-defined digest

<div style="background-color:#b3e5d5;color:black;padding:20px;">
    
CDA makes multiple datasets searchable from a common interface, but does not harmonize the data. This means that researchers should review all the terms in a column, and not just choose the first one that fits, as there may be other similar terms available as well.
    
</div>
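
Reviewing a long list of terms by eye is error-prone, so a small helper can flag values that probably mean the same thing. The sketch below is only an illustration in plain Python: the `normalize` rule (lowercase, drop punctuation, drop "and", strip a trailing "s") is a crude assumption made for this example, not something cdapython provides, and the terms are a handful of values that appear in this walkthrough.

```python
import re
from collections import defaultdict


def normalize(term):
    """Reduce a term to a crude comparison key.

    Assumed rule for this sketch: lowercase, punctuation removed,
    the word 'and' dropped, trailing 's' stripped from each word.
    """
    words = re.sub(r"[^a-z]+", " ", term.lower()).split()
    return " ".join(word.rstrip("s") for word in words if word != "and")


def near_duplicates(terms):
    """Group terms that share a key and keep only groups of two or more."""
    groups = defaultdict(list)
    for term in terms:
        groups[normalize(term)].append(term)
    return [group for group in groups.values() if len(group) > 1]


# A few primary_diagnosis_site values that appear in this walkthrough.
terms = ["Adrenal Glands", "Adrenal gland", "Cervix", "Cervix uteri",
         "Head", "Head and Neck", "Head-Neck", "Uterus", "Uterus, NOS"]

candidates = near_duplicates(terms)
print(candidates)
# [['Adrenal Glands', 'Adrenal gland'], ['Head and Neck', 'Head-Neck']]
```

A helper like this only suggests candidates. Terms such as 'Uterus' and 'Uterus, NOS' are not grouped by this rule, so reading the full list is still necessary.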

<img src="https://github.com/CancerDataAggregator/readthedocs/blob/main/docs/Examples/images/Julia.png?raw=true" align="left"
	width="50" height="50" />
   
Julia sees that "treatment_anatomic_site" does not have 'Uterine', but does have 'Cervix'. She also notes that both 'Uterus' and 'Uterus, NOS' are listed in the "primary_diagnosis_site" results. As she was initially looking for "uterine", Julia decides to expand her search a bit to account for variable naming schemes. So, she runs a fuzzy match filter on the "ResearchSubject.primary_diagnosis_site" for 'uter' as that should cover all variants:

In [6]:
unique_terms("ResearchSubject.primary_diagnosis_site").to_list(filters="uter")

['Cervix uteri', 'Corpus uteri', 'Uterus', 'Uterus, NOS']
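
Judging by the outputs in this notebook, `filters` behaves like a case-insensitive substring match: 'uter' returns both 'Cervix uteri' and 'Uterus'. The plain-Python sketch below reproduces that behaviour on a short list. It is an inference from the results shown here, not a description of how cdapython implements the filter, and `filter_terms` is a name invented for the example.

```python
def filter_terms(terms, filters):
    """Keep the terms that contain `filters`, ignoring case."""
    needle = filters.lower()
    return [term for term in terms if needle in term.lower()]


# A short, hand-picked subset of primary_diagnosis_site values.
sites = ["Cervix", "Cervix uteri", "Corpus uteri", "Lung", "Uterus", "Uterus, NOS"]

uter_matches = filter_terms(sites, "uter")
print(uter_matches)
# ['Cervix uteri', 'Corpus uteri', 'Uterus', 'Uterus, NOS']
```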

<img src="https://github.com/CancerDataAggregator/readthedocs/blob/main/docs/Examples/images/Julia.png?raw=true" align="left"
	width="50" height="50" />
   
Just to be sure, Julia also searches for any other instances of "cervix":

In [7]:
unique_terms("ResearchSubject.primary_diagnosis_site").to_list(filters="cerv")

['Cervix', 'Cervix uteri']

<img src="https://github.com/CancerDataAggregator/readthedocs/blob/main/docs/Examples/images/Julia.png?raw=true" align="left"
	width="50" height="50" />
   
With all her likely terms found, Julia begins to create a search that will get data for all of her terms. She does this by writing a series of `Q` statements that define what rows should be returned from each column. For the "treatment_anatomic_site", only one term is of interest, so she uses the `=` operator to get only exact matches:

In [8]:
Tsite = Q('ResearchSubject.Diagnosis.Treatment.treatment_anatomic_site = "Cervix"')

<img src="https://github.com/CancerDataAggregator/readthedocs/blob/main/docs/Examples/images/Julia.png?raw=true" align="left"
	width="50" height="50" />
   
However, for "primary_diagnosis_site", Julia has several terms she wants to search with. Luckily, `Q` also can run fuzzy searches. It can also search more than one term at a time, so Julia writes one big `Q` statement to grab everything that is either 'uter' or 'cerv':

In [9]:
Dsite = Q('ResearchSubject.primary_diagnosis_site = "%uter%" OR ResearchSubject.primary_diagnosis_site = "%cerv%"')

<img src="https://github.com/CancerDataAggregator/readthedocs/blob/main/docs/Examples/images/Julia.png?raw=true" align="left"
	width="50" height="50" />
   
Finally, Julia adds her two queries together into one large one:

In [10]:
ALLDATA = Tsite.OR(Dsite)
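
To make the logic of the combined query concrete, here is a minimal sketch that applies the same conditions to a few rows in plain Python. It illustrates the intended meaning only and is not how the CDA evaluates queries. The rows are hypothetical, `like` is a helper written for this example, and case-insensitive matching is an assumption based on the results shown later in this notebook.

```python
import re


def like(value, pattern):
    """True if `value` matches `pattern`, where % stands for any run of characters.

    Case is ignored, which is an assumption made for this sketch.
    """
    if value is None:
        return False
    regex = ".*".join(re.escape(part) for part in pattern.split("%"))
    return re.fullmatch(regex, value, flags=re.IGNORECASE | re.DOTALL) is not None


def tsite(row):
    # Exact match, as in the Tsite query.
    return row["treatment_anatomic_site"] == "Cervix"


def dsite(row):
    # Either wildcard pattern may match, as in the Dsite query.
    site = row["primary_diagnosis_site"]
    return like(site, "%uter%") or like(site, "%cerv%")


def alldata(row):
    # Mirrors Tsite.OR(Dsite): keep a row if either condition holds.
    return tsite(row) or dsite(row)


rows = [  # hypothetical rows, for illustration only
    {"primary_diagnosis_site": "Uterus, NOS", "treatment_anatomic_site": None},
    {"primary_diagnosis_site": "Pelvis", "treatment_anatomic_site": "Cervix"},
    {"primary_diagnosis_site": "Lung", "treatment_anatomic_site": "Brain"},
]

kept = [row["primary_diagnosis_site"] for row in rows if alldata(row)]
print(kept)
# ['Uterus, NOS', 'Pelvis']
```

Because the two queries are joined with OR, the second row is kept on the strength of its treatment site alone, even though its diagnosis site matches neither pattern.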

<img src="https://github.com/CancerDataAggregator/readthedocs/blob/main/docs/Examples/images/Julia.png?raw=true" align="left"
	width="50" height="50" />
   
Now that Julia has a query, she can use it to look for data in any of the CDA endpoints. She starts by getting an overall summary of what data is available using `count`:

In [11]:
ALLDATA.count.run()

Total execution time: 3401 ms




<img src="https://github.com/CancerDataAggregator/readthedocs/blob/main/docs/Examples/images/Julia.png?raw=true" align="left"
	width="50" height="50" />
   
It seems there's a lot of data that might work for Julia's study! Since she is interested in the beginnings of cancer, she decides to start by looking at the researchsubject information, since that is where most of the diagnosis information is. She runs her query against the `researchsubject` endpoint:

In [12]:
ALLDATA.researchsubject.run()

Total execution time: 3521 ms



            QueryID: 32babf30-f9f1-4061-b4f5-ef333397be62
            
            Offset: 0
            Count: 100
            Total Row Count: 4867
            More pages: True
            

<img src="https://github.com/CancerDataAggregator/readthedocs/blob/main/docs/Examples/images/Julia.png?raw=true" align="left"
	width="50" height="50" />
   
Browsing the primary_diagnosis_condition data, Julia notices that a large number of research subjects have Adenomas and Adenocarcinomas. Since Julia wants to look for common phenotypes in early cancers, she decides it might be easier to exclude this endocrine-related data, as these cancers might have different mechanisms. So she adds a new filter to her query:

In [13]:
Noadeno = Q('ResearchSubject.primary_diagnosis_condition != "Adenomas and Adenocarcinomas"')

NoAdenoData = ALLDATA.AND(Noadeno)

NoAdenoData.researchsubject.count.run()

Total execution time: 3648 ms


system,count
GDC,1918
PDC,104
IDC,1174

primary_diagnosis_condition,count
"Cystic, Mucinous and Serous Neoplasms",487
Squamous Cell Neoplasms,609
Complex Mixed and Stromal Neoplasms,320
Myomatous Neoplasms,187
Uterine Corpus Endometrial Carcinoma,104
,1175
"Epithelial Neoplasms, NOS",230
Mesonephromas,5
Not Reported,12
"Neoplasms, NOS",12

primary_diagnosis_site,count
Cervix uteri,688
"Uterus, NOS",961
Corpus uteri,373
Uterus,867
Cervix,307
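
A quick way to sanity-check a count summary is to add up each table. The sketch below copies the three tables above into strings and totals them with the standard `csv` module, which handles quoted labels such as "Uterus, NOS". The `total` helper is written for this example.

```python
import csv
import io

# The three summary tables above, copied as CSV text without their header rows.
tables = {
    "system": "GDC,1918\nPDC,104\nIDC,1174\n",
    "primary_diagnosis_condition": (
        '"Cystic, Mucinous and Serous Neoplasms",487\n'
        "Squamous Cell Neoplasms,609\n"
        "Complex Mixed and Stromal Neoplasms,320\n"
        "Myomatous Neoplasms,187\n"
        "Uterine Corpus Endometrial Carcinoma,104\n"
        ",1175\n"
        '"Epithelial Neoplasms, NOS",230\n'
        "Mesonephromas,5\n"
        "Not Reported,12\n"
        '"Neoplasms, NOS",12\n'
    ),
    "primary_diagnosis_site": (
        'Cervix uteri,688\n"Uterus, NOS",961\nCorpus uteri,373\n'
        "Uterus,867\nCervix,307\n"
    ),
}


def total(table_text):
    """Sum the count column of a two-column CSV table."""
    reader = csv.reader(io.StringIO(table_text))
    return sum(int(count) for _label, count in reader)


totals = {name: total(text) for name, text in tables.items()}
print(totals)
# {'system': 3196, 'primary_diagnosis_condition': 3141, 'primary_diagnosis_site': 3196}
```

The system and site tables agree at 3196. The condition table adds up to less, so it may not list every value; that explanation is a guess and is worth confirming before relying on the counts.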




<img src="https://github.com/CancerDataAggregator/readthedocs/blob/main/docs/Examples/images/Julia.png?raw=true" align="left"
	width="50" height="50" />
   
She then previews the actual metadata for researchsubject, subject, and file, to make sure that they have all the information she will need for her work. Since she's mostly interested in looking at the kinds of data available from each endpoint, she uses `.head(2)` on her queries so they only give her back 2 lines of data, which is much easier to read:

In [14]:
NoAdenoData.researchsubject.run().to_dataframe().head(2) # view the first two lines of the dataframe

Total execution time: 3489 ms


Unnamed: 0,id,identifier,member_of_research_project,primary_diagnosis_condition,primary_diagnosis_site,subject_id
0,0020317d-d10e-4e75-8fa6-7c1bdcdee471,"[{'system': 'GDC', 'value': '0020317d-d10e-4e7...",TCGA-SARC,Myomatous Neoplasms,"Uterus, NOS",TCGA-HS-A5NA
1,081e129a-0082-4fe5-9917-8d54d420289d,"[{'system': 'GDC', 'value': '081e129a-0082-4fe...",GENIE-MSK,Myomatous Neoplasms,"Uterus, NOS",GENIE-MSK-P-0001095


---

<div style="background-color:#add9e5;color:black;padding:20px;">

<h3>ResearchSubject Field Definitions</h3>

<i>A research subject is the entity of interest in a research study, typically a human being or an animal, but can also be a device, group of humans or animals, or a tissue sample. Human research subjects are usually not traceable to a particular person to protect the subject’s privacy. An individual who participates in 3 studies will have 3 researchsubject IDs</i>
    
<ul>
  <li><b>id:</b> The unique identifier for this researchsubject</li>
  <li><b>identifier:</b> An embedded array of information that includes the originating data center and the ID the researchsubject had there</li>
  <li><b>member_of_research_project:</b> The name of the study/project that the subject participated in</li>
  <li><b>primary_diagnosis_condition:</b> The cancer, disease or other condition under study</li>
  <li><b>primary_diagnosis_site:</b> The primary_disease_site that qualifies the researchsubject for the research_project</li>
  <li><b>subject_id:</b> An identifier for the subject. Can be joined to the `id` field from subject results</li>
</ul>  

</div>
    
---
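
Because `identifier` is an embedded array, it usually needs a little unpacking before it is useful. The sketch below assumes each value is a list of dictionaries with `system` and `value` keys, as the previews suggest. The first entry has the shape shown in the outputs in this notebook; the second entry is hypothetical, and both helper functions are written for this example.

```python
def systems(identifier):
    """Return the data centers named in an `identifier` array, sorted."""
    return sorted({entry["system"] for entry in identifier})


def value_for(identifier, system):
    """Return the ID used by `system`, or None if that center is absent."""
    for entry in identifier:
        if entry["system"] == system:
            return entry["value"]
    return None


identifier = [
    {"system": "GDC", "value": "AD7251"},
    {"system": "IDC", "value": "AD7251"},  # hypothetical second entry
]

print(systems(identifier))           # ['GDC', 'IDC']
print(value_for(identifier, "PDC"))  # None
```

With a dataframe, helpers like these could be applied to the `identifier` column one row at a time, for example to find the records that are known to more than one data center.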

In [15]:
NoAdenoData.subject.run().to_dataframe().head(2) # view the first two lines of the dataframe

Total execution time: 3643 ms


Unnamed: 0,id,identifier,species,sex,race,ethnicity,days_to_birth,subject_associated_project,vital_status,age_at_death,cause_of_death
0,AD7251,"[{'system': 'GDC', 'value': 'AD7251'}]",Homo sapiens,female,not reported,not reported,,[FM-AD],Not Reported,,
1,C3L-00942,"[{'system': 'GDC', 'value': 'C3L-00942'}, {'sy...",Homo sapiens,female,white,not reported,-23631.0,"[CPTAC3-Discovery, CPTAC-3, cptac_ucec]",Alive,,Not Reported


---

<div style="background-color:#add9e5;color:black;padding:20px;">

<h3>Subject Field Definitions</h3>

<i>A subject is a specific, unique individual: for example, a single human. When consent allows, a given entity will have a single subject ID that can be connected to all their studies and data across all datasets</i>

    
<ul>
  <li><b>id:</b> The unique identifier for this subject</li>
  <li><b>identifier:</b> An embedded array of information that includes the originating data center and the ID the subject had there</li>
  <li><b>species:</b> The species of the subject</li>
  <li><b>sex:</b> A reference to the biological sex of the donor organism. </li>
  <li><b>race:</b> The race of the subject</li>
  <li><b>ethnicity:</b> The ethnicity of the subject</li>
  <li><b>days_to_birth:</b> Number of days between the index date and the person's date of birth, expressed as a negative number of days</li>
  <li><b>subject_associated_project:</b> An embedded array of the names of projects (studies) the subject was part of</li>
  <li><b>vital_status:</b> Whether the subject is alive</li>
  <li><b>age_at_death:</b> The number of days after first enrollment that the subject died</li>
  <li><b>cause_of_death:</b> The cause of death, if known</li>
</ul>  

</div>
    
---
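
Since `days_to_birth` is a negative number of days, a small conversion gives an approximate age at the index date. This is a minimal sketch: the 365.25-day year is an approximation chosen for the example, and the `age_in_years` helper is not part of cdapython. Missing values are passed through as `None`.

```python
import math

DAYS_PER_YEAR = 365.25  # average year length; an approximation


def age_in_years(days_to_birth):
    """Convert a negative days_to_birth into an approximate age in years."""
    if days_to_birth is None or math.isnan(days_to_birth):
        return None  # value not reported
    return -days_to_birth / DAYS_PER_YEAR


age = age_in_years(-23631.0)  # value from the second preview row above
print(round(age, 1))
# 64.7
```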

In [16]:
NoAdenoData.file.run().to_dataframe().head(2) # view the first two lines of the dataframe

Total execution time: 3948 ms


Unnamed: 0,id,identifier,label,data_category,data_type,file_format,associated_project,drs_uri,byte_size,checksum,data_modality,imaging_modality,dbgap_accession_number,researchsubject_specimen_id,researchsubject_id,subject_id
0,02da6ec6-341b-4f9f-bd8f-54cc5ceed88b,"[{'system': 'GDC', 'value': '02da6ec6-341b-4f9...",586d3dc0-0e65-4308-9ee2-63d43f1de70d.wgs.BRASS...,Somatic Structural Variation,Structural Rearrangement,BEDPE,CGCI-HTMCP-CC,drs://dg.4DFC:02da6ec6-341b-4f9f-bd8f-54cc5cee...,100421,f661db60a3ba026973f578abb10cec67,Genomic,,phs000528,8e3ca471-209d-4020-aabd-b578568ed791,80d31738-d3f2-42e6-871a-f0612812a678,HTMCP-03-06-02013
1,2d7591a0-28ed-42e8-a14f-46cee06a3982,"[{'system': 'GDC', 'value': '2d7591a0-28ed-42e...",5a08d4f6-b8f3-40ee-8931-1d521f6c7733.wgs.BRASS...,Somatic Structural Variation,Structural Rearrangement,VCF,CGCI-HTMCP-CC,drs://dg.4DFC:2d7591a0-28ed-42e8-a14f-46cee06a...,43628,8afd15c59cecff6620fb9c4327fa1b06,Genomic,,phs000528,b1cd6f12-ccb1-413f-9659-a074bb28675a,4b0fa193-d964-4936-bb28-f21cb967e98a,HTMCP-03-06-02215



---

<div style="background-color:#add9e5;color:black;padding:20px;">

<h3>File Field Definitions</h3>

<i>A file is an information-bearing electronic object that contains a physical embodiment of some information using a particular character encoding.</i>

    
<ul>
  <li><b>id:</b> The unique identifier for this file</li>
  <li><b>identifier:</b> An embedded array of information that includes the originating data center and the ID the file had there</li>
  <li><b>label:</b> The full name of the file</li>
  <li><b>data_category:</b> A description of the general kind of data the file holds</li>
  <li><b>data_type:</b> A more specific description of the data type</li>
  <li><b>file_format:</b> String to identify the full file extension including compression extensions</li>
  <li><b>associated_project:</b> The name the data center uses for the study this file was generated for</li>
  <li><b>drs_uri:</b> A unique identifier that can be used to retrieve this specific file from a server</li>
  <li><b>byte_size:</b> Size of the file in bytes</li>
  <li><b>checksum:</b> The md5 value for the file</li>
  <li><b>data_modality:</b> Describes the biological nature of the information gathered as the result of an activity, independent of the technology or methods used to produce the information. Always one of "Genomic", "Proteomic", or "Imaging"</li>
  <li><b>imaging_modality:</b> For files with the `data_modality` of "Imaging", a descriptor for the image type</li>
  <li><b>dbgap_accession_number:</b> The project id number for this data on dbGaP</li>
</ul>  

</div>
    
---
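
The `byte_size` and `checksum` fields make it possible to verify a file after it has been downloaded. The sketch below assumes the checksum is a hexadecimal md5 digest, as the field definition states, and uses a throwaway local file in place of a real download. The function names are chosen for this example.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the md5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_metadata(path, byte_size, checksum):
    """True if the file's size and md5 digest both match the metadata."""
    same_size = os.path.getsize(path) == byte_size
    return same_size and md5_of_file(path) == checksum.lower()


# Demonstration with a temporary file standing in for a downloaded file.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "example.bin")
    payload = b"example payload"
    with open(path, "wb") as handle:
        handle.write(payload)
    ok = matches_metadata(path, len(payload), hashlib.md5(payload).hexdigest())
    bad = matches_metadata(path, len(payload), "0" * 32)

print(ok, bad)
# True False
```

Reading in chunks keeps memory use flat, which matters for large files such as slide images.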


<img src="https://github.com/CancerDataAggregator/readthedocs/blob/main/docs/Examples/images/Julia.png?raw=true" align="left"
	width="50" height="50" />
   
Finally, Julia wants to save these results for future use. Since the preview dataframes only show the first 100 results of each search, she uses the `paginator` function to get all the data from the subject and researchsubject endpoints into their own dataframes:

In [17]:
researchsubs = NoAdenoData.researchsubject.run()
rsdf = pd.DataFrame()
for i in researchsubs.paginator(to_df=True):
    rsdf = pd.concat([rsdf, i])

Total execution time: 3422 ms


In [18]:
subs = NoAdenoData.subject.run()
subsdf = pd.DataFrame()
for i in subs.paginator(to_df=True):
    subsdf = pd.concat([subsdf, i])

Total execution time: 3495 ms
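
The loops above work as written. An alternative pattern, sketched here with pandas, collects the pages in a list and concatenates once, which avoids copying the growing dataframe on every iteration. `fake_pages` is a hypothetical stand-in for `paginator(to_df=True)` so that the example runs without a connection to the CDA, and its contents are made up.

```python
import pandas as pd


def fake_pages():
    """Hypothetical stand-in for result.paginator(to_df=True): one DataFrame per page."""
    yield pd.DataFrame({"id": ["a", "b"], "subject_id": ["s1", "s2"]})
    yield pd.DataFrame({"id": ["c"], "subject_id": ["s1"]})


# Collect every page first, then build the combined dataframe in one step.
pages = list(fake_pages())
combined = pd.concat(pages, ignore_index=True)

print(combined.index.tolist())
# [0, 1, 2]
```

Passing `ignore_index=True` renumbers the rows from 0. Without it, each page keeps its own row labels, so labels can repeat in the combined dataframe.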


In [19]:
rsdf.head(2) # view the first two lines of the researchsubject dataframe

Unnamed: 0,id,identifier,member_of_research_project,primary_diagnosis_condition,primary_diagnosis_site,subject_id
0,0020317d-d10e-4e75-8fa6-7c1bdcdee471,"[{'system': 'GDC', 'value': '0020317d-d10e-4e7...",TCGA-SARC,Myomatous Neoplasms,"Uterus, NOS",TCGA-HS-A5NA
1,081e129a-0082-4fe5-9917-8d54d420289d,"[{'system': 'GDC', 'value': '081e129a-0082-4fe...",GENIE-MSK,Myomatous Neoplasms,"Uterus, NOS",GENIE-MSK-P-0001095


In [20]:
subsdf.head(2) # view the first two lines of the subject dataframe

Unnamed: 0,id,identifier,species,sex,race,ethnicity,days_to_birth,subject_associated_project,vital_status,age_at_death,cause_of_death
0,AD7251,"[{'system': 'GDC', 'value': 'AD7251'}]",Homo sapiens,female,not reported,not reported,,[FM-AD],Not Reported,,
1,C3L-00942,"[{'system': 'GDC', 'value': 'C3L-00942'}, {'sy...",Homo sapiens,female,white,not reported,-23631.0,"[CPTAC3-Discovery, CPTAC-3, cptac_ucec]",Alive,,Not Reported


<img src="https://github.com/CancerDataAggregator/readthedocs/blob/main/docs/Examples/images/Julia.png?raw=true" align="left"
	width="50" height="50" />
   
Then Julia merges the two results into one big dataset, matching `subject_id` in the researchsubject results to `id` in the subject results:

In [21]:
allmetadata = pd.merge(rsdf,
                subsdf,
                left_on="subject_id",
                right_on='id')

allmetadata.head(2)

Unnamed: 0,id_x,identifier_x,member_of_research_project,primary_diagnosis_condition,primary_diagnosis_site,subject_id,id_y,identifier_y,species,sex,race,ethnicity,days_to_birth,subject_associated_project,vital_status,age_at_death,cause_of_death
0,0020317d-d10e-4e75-8fa6-7c1bdcdee471,"[{'system': 'GDC', 'value': '0020317d-d10e-4e7...",TCGA-SARC,Myomatous Neoplasms,"Uterus, NOS",TCGA-HS-A5NA,TCGA-HS-A5NA,"[{'system': 'GDC', 'value': 'TCGA-HS-A5NA'}, {...",Homo sapiens,female,not reported,not reported,-24025.0,"[tcga_sarc, TCGA-SARC]",Alive,,
1,081e129a-0082-4fe5-9917-8d54d420289d,"[{'system': 'GDC', 'value': '081e129a-0082-4fe...",GENIE-MSK,Myomatous Neoplasms,"Uterus, NOS",GENIE-MSK-P-0001095,GENIE-MSK-P-0001095,"[{'system': 'GDC', 'value': 'GENIE-MSK-P-00010...",Homo sapiens,female,white,not hispanic or latino,-24837.0,[GENIE-MSK],Not Reported,,
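
The merge leaves two ID columns named `id_x` and `id_y`, which are easy to mix up. As a sketch of one option, `pd.merge` accepts `suffixes` to give them clearer names and `validate` to confirm that each subject ID appears only once on the subject side. The small dataframes below are hypothetical.

```python
import pandas as pd

# Hypothetical stand-ins for the researchsubject and subject dataframes.
rs = pd.DataFrame({
    "id": ["rs-1", "rs-2", "rs-3"],
    "subject_id": ["s1", "s1", "s2"],
    "primary_diagnosis_site": ["Uterus", "Uterus, NOS", "Cervix"],
})
subjects = pd.DataFrame({"id": ["s1", "s2"], "sex": ["female", "female"]})

merged = pd.merge(
    rs,
    subjects,
    left_on="subject_id",
    right_on="id",
    suffixes=("_researchsubject", "_subject"),  # replaces the default _x / _y
    validate="many_to_one",  # raises MergeError if a subject id repeats on the right
)

print(merged.columns.tolist())
# ['id_researchsubject', 'subject_id', 'primary_diagnosis_site', 'id_subject', 'sex']
```

Here subject `s1` has two researchsubject rows, so it appears twice in the result. That is expected for a many-to-one merge.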


<img src="https://github.com/CancerDataAggregator/readthedocs/blob/main/docs/Examples/images/Julia.png?raw=true" align="left"
	width="50" height="50" />
   
And saves it out to a csv so she can browse it with Excel:

In [22]:
allmetadata.to_csv("allmetadata.csv")
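
Two details of `to_csv` are worth knowing before reopening the file. By default it writes the row index as an unlabeled first column, which pandas names `Unnamed: 0` when the file is read back. Embedded arrays such as `identifier` are also written as plain text. The sketch below uses a one-row dataframe and an in-memory buffer instead of a file, and shows one way to handle both.

```python
import ast
import io

import pandas as pd

df = pd.DataFrame({
    "id": ["AD7251"],
    "identifier": [[{"system": "GDC", "value": "AD7251"}]],
})

buffer = io.StringIO()       # stands in for a file on disk
df.to_csv(buffer, index=False)  # index=False leaves out the row-number column
buffer.seek(0)

restored = pd.read_csv(buffer)
print(restored.columns.tolist())                      # ['id', 'identifier']
print(type(restored.loc[0, "identifier"]).__name__)   # str

# Turn the text back into a list of dictionaries.
restored["identifier"] = restored["identifier"].apply(ast.literal_eval)
print(restored.loc[0, "identifier"][0]["system"])     # GDC
```

`ast.literal_eval` only parses Python literals, so it is a safer choice than `eval` for text read from a file.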

<img src="https://github.com/CancerDataAggregator/readthedocs/blob/main/docs/Examples/images/Julia.png?raw=true" align="left"
	width="50" height="50" />
   
Julia knows from her subject count summary that there are 297923 files associated with her subjects, which is likely far more than she needs. To help her decide what files she wants, Julia uses endpoint chaining to get summary information about the files that are assigned to researchsubjects for her search criteria:


In [23]:
NoAdenoData.researchsubject.file.count.run()

Total execution time: 3460 ms


system,count
IDC,264429
GDC,30934
PDC,2560

data_category,count
Imaging,264429
Sequencing Reads,4137
Structural Variation,2188
Simple Nucleotide Variation,11745
Peptide Spectral Matches,1280
Transcriptome Profiling,2818
Processed Mass Spectra,640
Copy Number Variation,4079
Raw Mass Spectra,640
Biospecimen,2865

file_format,count
DICOM,264429
BAM,4137
TXT,4717
BCR XML,1215
MAF,5235
IDAT,1078
VCF,6652
SVS,1111
TSV,3976
tsv,640

data_type,count
,264429
Aligned Reads,4137
Biospecimen Supplement,1754
Annotated Somatic Mutation,6043
Raw Simple Somatic Mutation,3397
Masked Copy Number Segment,998
Splice Junction Quantification,709
Gene Level Copy Number Scores,809
Clinical Supplement,776
Slide Image,1111
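
The `file_format` table above lists `TSV` and `tsv` separately, a small example of values that are not harmonized across sources. As a sketch, the counts can be folded together by normalizing the labels. Treating labels that differ only in case as the same format is an assumption made for this example.

```python
from collections import Counter

# file_format counts copied from the summary above.
file_format_counts = [
    ("DICOM", 264429), ("BAM", 4137), ("TXT", 4717), ("BCR XML", 1215),
    ("MAF", 5235), ("IDAT", 1078), ("VCF", 6652), ("SVS", 1111),
    ("TSV", 3976), ("tsv", 640),
]

# Add up counts under an upper-cased label so 'TSV' and 'tsv' share a key.
merged_counts = Counter()
for label, count in file_format_counts:
    merged_counts[label.upper()] += count

print(merged_counts["TSV"])
# 4616
```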




<img src="https://github.com/CancerDataAggregator/readthedocs/blob/main/docs/Examples/images/Julia.png?raw=true" align="left"
	width="50" height="50" />
    
Julia decides that a good place to start would be with Slide Images. There are only 1111, so she should be able to quickly scan through them over the next few days and see if they will be useful. So she adds one more filter on her search:

In [24]:
JustSlides = Q('file.data_type = "Slide Image"')
NoadenoJustSlides = NoAdenoData.AND(JustSlides)
NoadenoJustSlides.researchsubject.file.count.run()

Total execution time: 3361 ms


system,count
GDC,1111

data_category,count
Biospecimen,1111

file_format,count
SVS,1111

data_type,count
Slide Image,1111




<img src="https://github.com/CancerDataAggregator/readthedocs/blob/main/docs/Examples/images/Julia.png?raw=true" align="left"
	width="50" height="50" />
    
Finally, Julia uses the `paginator` function again to get all the slide files, and merges her metadata file with this file information. This way she will be able to review what phenotypes each slide is associated with:

In [25]:
slides = NoadenoJustSlides.researchsubject.file.run()
slidesdf = pd.DataFrame()
for i in slides.paginator(to_df=True):
    slidesdf = pd.concat([slidesdf, i])


Total execution time: 3501 ms


In [26]:
slidemetadata = pd.merge(slidesdf, 
                         allmetadata, 
                         on="subject_id")
slidemetadata.head(2)

Unnamed: 0,id,identifier,label,data_category,data_type,file_format,associated_project,drs_uri,byte_size,checksum,...,identifier_y,species,sex,race,ethnicity,days_to_birth,subject_associated_project,vital_status,age_at_death,cause_of_death
0,0261cb79-9879-480a-9626-63fd50d6daea,"[{'system': 'GDC', 'value': '0261cb79-9879-480...",TCGA-VS-A9UP-01A-01-TS1.68E82753-9B6E-4544-B91...,Biospecimen,Slide Image,SVS,TCGA-CESC,drs://dg.4DFC:0261cb79-9879-480a-9626-63fd50d6...,201483703,8318b25bce4cc03960bdbca1112c520b,...,"[{'system': 'GDC', 'value': 'TCGA-VS-A9UP'}, {...",Homo sapiens,female,not reported,not reported,-15849.0,"[TCGA-CESC, tcga_cesc]",Alive,,
1,0261cb79-9879-480a-9626-63fd50d6daea,"[{'system': 'GDC', 'value': '0261cb79-9879-480...",TCGA-VS-A9UP-01A-01-TS1.68E82753-9B6E-4544-B91...,Biospecimen,Slide Image,SVS,TCGA-CESC,drs://dg.4DFC:0261cb79-9879-480a-9626-63fd50d6...,201483703,8318b25bce4cc03960bdbca1112c520b,...,"[{'system': 'GDC', 'value': 'TCGA-VS-A9UP'}, {...",Homo sapiens,female,not reported,not reported,-15849.0,"[TCGA-CESC, tcga_cesc]",Alive,,
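
The two preview rows above show the same file `id`. One possible cause, and it is only an assumption because some columns are hidden behind `...`, is that a subject has more than one researchsubject row in the metadata, so each of that subject's files is repeated once per row. The sketch below reproduces the effect with hypothetical dataframes and shows how to count rows per file.

```python
import pandas as pd

# Hypothetical stand-ins for slidesdf and allmetadata.
files = pd.DataFrame({"id": ["f1", "f2"], "subject_id": ["s1", "s2"]})
meta = pd.DataFrame({
    "id_x": ["rs-1", "rs-2", "rs-3"],  # researchsubject IDs
    "subject_id": ["s1", "s1", "s2"],  # subject s1 appears twice
})

merged = pd.merge(files, meta, on="subject_id")

# How many merged rows does each file have?
rows_per_file = merged.groupby("id").size()
print(rows_per_file.to_dict())
# {'f1': 2, 'f2': 1}

# Keep the first row for each file if one row per file is enough.
one_row_per_file = merged.drop_duplicates(subset="id")
print(len(merged), len(one_row_per_file))
# 3 2
```

Whether to drop the repeats depends on the question being asked: the extra rows carry the per-study diagnosis information, so removing them discards some metadata.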


<img src="https://github.com/CancerDataAggregator/readthedocs/blob/main/docs/Examples/images/Julia.png?raw=true" align="left"
	width="50" height="50" />
    
She saves this file out as well.

In [27]:
slidemetadata.to_csv("slidemetadata.csv")

<img src="https://github.com/CancerDataAggregator/readthedocs/blob/main/docs/Examples/images/Julia.png?raw=true" align="left"
	width="50" height="50" />
    
Now Julia has all the information she needs to begin work on her project. She can use the `drs_uri` column to download the images she is interested in directly with a DRS resolver, or she can enter the DRS URIs in a cloud workspace such as [Terra](https://terra.bio/) or the [Cancer Genomics Cloud](https://www.cancergenomicscloud.org/) to view the images online. In either case, she has all the metadata she needs to get started, and can save this notebook of her work in case she'd like to come back and modify her search.
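
The DRS values shown in the file previews have the shape `drs://<prefix>:<object id>`. As a small illustration, the sketch below splits such a value into its two parts. It does not resolve or download anything, the `parse_compact_drs_uri` helper is written for this example, and the object ID is a made-up placeholder rather than a real file.

```python
def parse_compact_drs_uri(uri):
    """Split 'drs://<prefix>:<object id>' into (prefix, object_id)."""
    scheme = "drs://"
    if not uri.startswith(scheme):
        raise ValueError(f"not a DRS URI: {uri!r}")
    prefix, separator, object_id = uri[len(scheme):].partition(":")
    if not separator or not prefix or not object_id:
        raise ValueError(f"expected drs://<prefix>:<object id>, got {uri!r}")
    return prefix, object_id


# Hypothetical URI with a placeholder object ID.
example = "drs://dg.4DFC:00000000-0000-0000-0000-000000000000"
parts = parse_compact_drs_uri(example)
print(parts)
# ('dg.4DFC', '00000000-0000-0000-0000-000000000000')
```

Checking the shape of each value before handing a long list to a resolver or workspace can catch truncated or malformed entries early.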