Before we do any work, we need to import several functions from cdapython:
- `Q` and `query` which power the search
- `columns` which lets us view entity field names
- `unique_terms` which lets us view entity field contents

We're also asking cdapython to report its version so we can be sure we're using the one we mean to.

In [1]:
from cdapython import Q, columns, unique_terms, query
import cdapython
print(cdapython.__version__)
Q.set_host_url("http://35.192.60.10:8080/")

2022.4.13


Note that I'm working off a testing version. These are new features that will come out with our next release, so if you try these things in the current stable version they won't work. When we do release, this notebook will be available for you to run and play with.

CDA data comes from three sources:
- The [Proteomic Data Commons](https://proteomic.datacommons.cancer.gov/pdc/) (PDC)
- The [Genomic Data Commons](https://gdc.cancer.gov/) (GDC)
- The [Imaging Data Commons](https://datacommons.cancer.gov/repository/imaging-data-commons) (IDC)

The CDA makes this data searchable in two ways: by a "Subject" table and by a "File" table.

`Subject` data is information that is intrinsic to the individual under study, e.g. `sex`, `race`, `ethnicity`. However, any given subject might be part of multiple studies. To make search across datasets easier, the CDA model aggregates this data as `ResearchSubject` information. Subjects that participate in multiple projects (are part of multiple nodes) will have multiple `ResearchSubject` entries.

`Subject` and `ResearchSubject` fields are available for both the "Subject" and "File" tables. However, terms specific to files, e.g. `data_type` or `file_format`, are only available in the "File" table.
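To make the Subject / ResearchSubject relationship concrete, here is a minimal sketch of one person who took part in two projects. The class layout, the `associated_project` attribute, and all IDs are hypothetical illustrations, not the actual CDA schema:

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ResearchSubject:
    # One entry per study (project) the person took part in.
    id: str
    associated_project: str  # hypothetical attribute name
    primary_diagnosis_site: Optional[str] = None


@dataclass
class Subject:
    # Intrinsic, study-independent information about one individual.
    id: str
    sex: Optional[str] = None
    race: Optional[str] = None
    ethnicity: Optional[str] = None
    research_subjects: List[ResearchSubject] = field(default_factory=list)


# Hypothetical person who appears in two projects.
person = Subject(
    id="subject-001",
    sex="female",
    research_subjects=[
        ResearchSubject("rs-1", "project-A", "Kidney"),
        ResearchSubject("rs-2", "project-B", "Kidney"),
    ],
)

# One Subject, several ResearchSubject entries.
projects = [rs.associated_project for rs in person.research_subjects]
print(person.id, projects)
```

The demographic fields live once on the `Subject`, while anything study-specific is repeated per `ResearchSubject` entry.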

To see what fields are available, we use the command `columns`. Here we're looking at the first ten metadata fields available for files:

In [2]:
columns(files=True, limit=10)

['id',
 'identifier',
 'identifier.system',
 'identifier.value',
 'label',
 'data_category',
 'data_type',
 'file_format',
 'associated_project',
 'drs_uri']

By default, `columns()` returns `Subject` table fields. The first several fields (those without a `.` in them) are `Subject` demographic information, which is intrinsically attached to a given subject. Subsequent entities (i.e. `ResearchSubject.xxx`) contain details about specific experiments the subject was part of. They are equivalent to the nodes' `Case` record in the GDC and PDC.
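If you want to see that split at a glance, you can group field names by whatever comes before the last `.`. This is a small sketch in plain Python; the `fields` list is a hand-written sample of names that appear in this notebook, not the literal output of `columns()`:

```python
from collections import defaultdict

# Hand-written sample of field names used in this notebook;
# NOT the literal output of columns().
fields = [
    "id",
    "sex",
    "race",
    "ethnicity",
    "ResearchSubject.primary_diagnosis_site",
    "ResearchSubject.Diagnosis.stage",
]

grouped = defaultdict(list)
for name in fields:
    # Everything before the last "." names the entity;
    # a name with no "." is treated as a top-level Subject field.
    entity, _, leaf = name.rpartition(".")
    grouped[entity or "Subject"].append(leaf)

for entity, leaves in grouped.items():
    print(entity, leaves)
```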

While available search fields may look like ones you've seen in PDC, GDC or IDC, that does not mean they will contain exactly the same information; several are renamed or restructured in the CDA model. The field name mappings are described in [CDA Schema Field Mapping](../Documentation/Schema.md), but we can also directly get information about what data populates any of these fields using the `unique_terms()` function:

In [3]:
unique_terms("ResearchSubject.primary_diagnosis_site", files=True, limit=10)

[None,
 'Adrenal gland',
 'Anus and anal canal',
 'Base of tongue',
 'Bladder',
 'Bones, joints and articular cartilage of limbs',
 'Bones, joints and articular cartilage of other and unspecified sites',
 'Brain',
 'Breast',
 'Bronchus and lung']
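Term lists can get long, so it can help to search them before writing a query. The sketch below works on a plain Python list (here, the ten terms shown above) and uses a hypothetical helper, `find_terms`, that is not part of cdapython. Note that `None` marks a missing value and has to be skipped:

```python
# The ten terms shown in the output above.
terms = [
    None,
    "Adrenal gland",
    "Anus and anal canal",
    "Base of tongue",
    "Bladder",
    "Bones, joints and articular cartilage of limbs",
    "Bones, joints and articular cartilage of other and unspecified sites",
    "Brain",
    "Breast",
    "Bronchus and lung",
]


def find_terms(terms, fragment):
    """Return terms containing `fragment`, ignoring case and skipping None."""
    needle = fragment.lower()
    return [t for t in terms if t is not None and needle in t.lower()]


bone_terms = find_terms(terms, "bone")
br_terms = find_terms(terms, "br")
print(bone_terms)
print(br_terms)
```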

The CDA provides a custom Python tool for searching CDA data. [`Q`](usage/#q) (short for Query) offers several ways to search and filter data, and several input modes:

---

- **[Q.run()](../../../Documentation/usage/#qrun)** returns **demographic** data for the specified search 
- **[Q.counts()](../../../Documentation/usage/#qcounts)** returns summary information (counts) for the **files** that fit the specified search
- **[Q.files()](../../../Documentation/usage/#qfiles)** returns data for the **files** that fit the specified search
- **[Q.sql()](../../../Documentation/usage/#qsql)** allows you to use SQL syntax instead of Q syntax 
- **[query()](../../../Documentation/usage/#query)** allows you to use a more natural language syntax of Q

---


For today's demo, I'm going to show you how you can use `Q.counts()`, `Q.files()`, and `Q.run()` to build a cohort.

## Retrieving summary information

### Demographic summary
Let's run a search with a relatively simple question: We're interested in finding data about Kidney cancer. To run this simple search, we would first construct a query in `Q` and save it to a variable `myquery`:

In [4]:
myquery = Q('ResearchSubject.primary_diagnosis_site = "Kidney"')

Since we are looking for demographic summary information, we want to use this query in `Q.run()`. We do this by calling `.run()` on the query we just saved, and saving the result to a new variable, `mydemographic`:

In [5]:
mydemographic = myquery.run(version="all_Subjects_v3_0_w_RS")

Getting results from database

Total execution time: 5610 ms


We're saving information in variables, so we don't get any visible output. To see what our results are, we need to look into the variable. The simplest way is to call `mydemographic` directly:

In [6]:
mydemographic


            QueryID: da8ac8ca-1b5e-49ec-b489-65e29aca57cf
            Query:SELECT all_Subjects_v3_0_w_RS.* FROM gdc-bq-sample.dev.all_Subjects_v3_0_w_RS AS all_Subjects_v3_0_w_RS, UNNEST(all_Subjects_v3_0_w_RS.ResearchSubject) AS _ResearchSubject WHERE (UPPER(_ResearchSubject.primary_diagnosis_site) = UPPER('Kidney'))
            Offset: 0
            Count: 100
            Total Row Count: 4788
            More pages: True
            

This output tells us our QueryID, which we don't really need, but the computer does to track our questions. 
Then it tells us five parameters that describe our results:

---

- **Query:** This is the actual SQL query that was run on our database to retrieve our results
- **Offset:** This is how many rows of information we've told the query to skip in the data. Here we didn't tell it to skip anything, so the offset is zero
- **Count:** This is how many rows (Subjects) the current page of our results table has. To keep searches fast, we default to pages with 100 rows.
- **Total Row Count:** This is how many rows (Subjects) are in the full results table
- **More pages:** This is always True or False. False means that our current page has all the available results. True means that we will see only the first 100 results in this table, and will need to page through for more.

---
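The paging numbers fit together with a little arithmetic. This sketch only does the math on the values reported above (it does not call cdapython), and `page_offsets` is a hypothetical helper name:

```python
import math


def page_offsets(total_rows, page_size=100):
    """Return the starting offset of every page needed to cover total_rows."""
    return list(range(0, total_rows, page_size))


total_row_count = 4788  # "Total Row Count" reported above
count = 100             # "Count": rows on the current page
offset = 0              # "Offset": rows skipped so far

offsets = page_offsets(total_row_count, count)
n_pages = len(offsets)
last_page_rows = total_row_count - offsets[-1]

# There are more pages whenever rows remain past the current page.
more_pages = offset + count < total_row_count

print(n_pages, offsets[:3], offsets[-1], last_page_rows, more_pages)
```

So the 4788 subjects are spread over 48 pages, and the last page holds the remaining 88 rows.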
    
Now that we've seen the metadata about our results, let's look at the actual table. The easiest way to do this is by calling the `.to_dataframe()` method on our `mydemographic` variable:

In [7]:
mydemographic.to_dataframe()

Unnamed: 0,id,identifier,species,sex,race,ethnicity,days_to_birth,subject_associated_project,vital_status,age_at_death,cause_of_death,Files,ResearchSubject
0,GENIE-DFCI-005752,"[{'system': 'GDC', 'value': 'GENIE-DFCI-005752'}]",Homo sapiens,male,white,not hispanic or latino,-24837,[GENIE-DFCI],Not Reported,,,"[a8142a2a-4fe9-4ba1-bf33-ea6596a4210d, 58f1279...",[{'id': '9a8d25ec-4d3c-49dd-93b1-07173cfd0c2a'...
1,GENIE-VICC-161528,"[{'system': 'GDC', 'value': 'GENIE-VICC-161528'}]",Homo sapiens,female,Unknown,Unknown,-6574,[GENIE-VICC],Not Reported,,,"[f496e4cc-03eb-48b7-b26d-9ddc3444254c, b19e8f3...",[{'id': '2696fae1-0d32-40b4-9d70-0ea912ca75cc'...
2,AD14487,"[{'system': 'GDC', 'value': 'AD14487'}]",Homo sapiens,male,not reported,not reported,,[FM-AD],Not Reported,,,"[181889ec-c742-48e2-bc4c-bc08285f4b15, 6f9b04d...",[{'id': 'e907827e-9387-4309-88bd-c96b0c9e268d'...
3,TARGET-50-PAJNEC,"[{'system': 'GDC', 'value': 'TARGET-50-PAJNEC'}]",Homo sapiens,female,white,not hispanic or latino,,[TARGET-WT],Dead,1066,,"[30ad3e4d-845f-47da-a513-1408dc3c561d, 5d6a64d...",[{'id': '37825bea-c75c-5830-90ed-22ea7e057a44'...
4,TARGET-50-PAKKRK,"[{'system': 'GDC', 'value': 'TARGET-50-PAKKRK'}]",Homo sapiens,male,black or african american,not hispanic or latino,,[TARGET-WT],Alive,,,"[bf82599e-2722-4ffb-b837-d6707353872a, 0f42c06...",[{'id': 'b07b660d-236c-5fff-9e7a-13b649968118'...
...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,TCGA-KO-8414,"[{'system': 'GDC', 'value': 'TCGA-KO-8414'}, {...",Homo sapiens,female,white,not hispanic or latino,-18502,"[TCGA-KICH, tcga_kich]",Alive,,,"[cc1d9187-04b0-4d0d-9fa2-89636508dea0, 9cdb2cc...",[{'id': 'e562c60c-8e6e-481f-ad4f-28b067437ece'...
96,TCGA-KO-8414,"[{'system': 'GDC', 'value': 'TCGA-KO-8414'}, {...",Homo sapiens,female,white,not hispanic or latino,-18502,"[TCGA-KICH, tcga_kich]",Alive,,,"[cc1d9187-04b0-4d0d-9fa2-89636508dea0, 9cdb2cc...",[{'id': 'e562c60c-8e6e-481f-ad4f-28b067437ece'...
97,TCGA-CZ-5469,"[{'system': 'GDC', 'value': 'TCGA-CZ-5469'}, {...",Homo sapiens,male,white,not hispanic or latino,-15242,"[tcga_kirc, TCGA-KIRC]",Dead,946,,"[68597051-2f4e-4f8c-b29c-bd7894b74079, 809d5da...",[{'id': '6205c2d8-d431-4c4c-8ac6-ee407170e833'...
98,TCGA-CZ-5469,"[{'system': 'GDC', 'value': 'TCGA-CZ-5469'}, {...",Homo sapiens,male,white,not hispanic or latino,-15242,"[tcga_kirc, TCGA-KIRC]",Dead,946,,"[68597051-2f4e-4f8c-b29c-bd7894b74079, 809d5da...",[{'id': '6205c2d8-d431-4c4c-8ac6-ee407170e833'...
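A few things in this table are worth unpacking: `identifier` is a nested list of dictionaries, some subjects appear on more than one row, and `days_to_birth` is a negative day count. The sketch below works on rows hand-copied from the table (only three columns kept) using plain Python. The helper names are hypothetical, and the age conversion assumes `days_to_birth` is counted back from an index date such as diagnosis, as it is in the GDC:

```python
# Rows hand-copied from the table above (three columns only).
# For TCGA-KO-8414 only the first identifier entry is visible in the
# display; the truncated remainder is deliberately left out.
rows = [
    {"id": "GENIE-DFCI-005752", "days_to_birth": -24837,
     "identifier": [{"system": "GDC", "value": "GENIE-DFCI-005752"}]},
    {"id": "AD14487", "days_to_birth": None,
     "identifier": [{"system": "GDC", "value": "AD14487"}]},
    {"id": "TCGA-KO-8414", "days_to_birth": -18502,
     "identifier": [{"system": "GDC", "value": "TCGA-KO-8414"}]},
    {"id": "TCGA-KO-8414", "days_to_birth": -18502,
     "identifier": [{"system": "GDC", "value": "TCGA-KO-8414"}]},
]


def unique_by_id(rows):
    """Keep the first row seen for each subject id."""
    seen = {}
    for row in rows:
        seen.setdefault(row["id"], row)
    return list(seen.values())


def systems(row):
    """List the data centers named in the nested identifier field."""
    return [entry["system"] for entry in row["identifier"]]


def age_in_years(days_to_birth):
    """Convert a negative day count to years; None stays None."""
    if days_to_birth is None:
        return None
    return round(-days_to_birth / 365.25, 1)


subjects = unique_by_id(rows)
for row in subjects:
    print(row["id"], systems(row), age_in_years(row["days_to_birth"]))
```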


#### We can further subset by chaining queries

There are lots of Subjects in our general search for Kidney, so let's try to filter it to just cancer. We're most interested in early stage cancer, so we'll filter by `Diagnosis.stage`:


In [8]:
cancerquery = Q('ResearchSubject.Diagnosis.stage = "Stage I"').Or(Q('ResearchSubject.Diagnosis.stage = "Stage II"'))
kidneycancerquery = cancerquery.And(myquery)
kidneycancerdemographic = kidneycancerquery.run(version="all_Subjects_v3_0_w_RS")
kidneycancerdemographic

Getting results from database

Total execution time: 8264 ms



            QueryID: 4877bfc3-06bd-435a-a61a-c664e8a2241f
            Query:SELECT all_Subjects_v3_0_w_RS.* FROM gdc-bq-sample.dev.all_Subjects_v3_0_w_RS AS all_Subjects_v3_0_w_RS, UNNEST(all_Subjects_v3_0_w_RS.ResearchSubject) AS _ResearchSubject, UNNEST(_ResearchSubject.Diagnosis) AS _ResearchSubject_Diagnosis WHERE (((UPPER(_ResearchSubject_Diagnosis.stage) = UPPER('Stage I')) OR (UPPER(_ResearchSubject_Diagnosis.stage) = UPPER('Stage II'))) AND (UPPER(_ResearchSubject.primary_diagnosis_site) = UPPER('Kidney')))
            Offset: 0
            Count: 65
            Total Row Count: 65
            More pages: False
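The WHERE clause in that query boils down to `(stage is "Stage I" OR stage is "Stage II") AND site is "Kidney"`, with every comparison made case-insensitive by `UPPER(...)`. Here is the same logic sketched in plain Python over a handful of made-up, flattened records. The record layout and helper names are hypothetical, and this is not how cdapython evaluates queries internally:

```python
def same(a, b):
    """Case-insensitive equality, mirroring UPPER(a) = UPPER(b) in the SQL.

    A missing value (None) never matches, like NULL in SQL.
    """
    return a is not None and a.upper() == b.upper()


# Made-up records: one per (ResearchSubject, Diagnosis) pair.
records = [
    {"subject": "s1", "site": "Kidney", "stage": "Stage I"},
    {"subject": "s2", "site": "kidney", "stage": "stage ii"},
    {"subject": "s3", "site": "Kidney", "stage": "Stage IV"},
    {"subject": "s4", "site": "Brain", "stage": "Stage I"},
    {"subject": "s5", "site": "Kidney", "stage": None},
]


def is_early_stage(r):
    # Equivalent of Q(stage = "Stage I").Or(Q(stage = "Stage II"))
    return same(r["stage"], "Stage I") or same(r["stage"], "Stage II")


def is_kidney(r):
    # Equivalent of Q(primary_diagnosis_site = "Kidney")
    return same(r["site"], "Kidney")


# Equivalent of cancerquery.And(myquery)
matches = [r["subject"] for r in records if is_early_stage(r) and is_kidney(r)]
print(matches)
```

Only `s1` and `s2` satisfy both halves; `s2` matches despite its lowercase spelling because the comparison ignores case.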
            

This is a much more manageable number of subjects, and now they're targeted by our actual question. Let's peek at the data:

In [9]:
kidneycancerdemographic.to_dataframe()

Unnamed: 0,id,identifier,species,sex,race,ethnicity,days_to_birth,subject_associated_project,vital_status,age_at_death,cause_of_death,Files,ResearchSubject
0,C3N-00573,"[{'system': 'GDC', 'value': 'C3N-00573'}, {'sy...",Homo sapiens,male,not reported,not reported,-22407,"[CPTAC3-Discovery, CPTAC-3, cptac_ccrcc]",Alive,,Not Reported,"[b49d5be8-8012-4ff2-8e72-c9b1f5379d9e, 5f7ecf1...",[{'id': 'a0d5a7b2-4c7f-4d19-bda1-1357d3b1d72b'...
1,C3N-01524,"[{'system': 'GDC', 'value': 'C3N-01524'}, {'sy...",Homo sapiens,male,not reported,not reported,-22558,"[CPTAC3-Discovery, CPTAC-3, cptac_ccrcc]",Alive,,Not Reported,"[b76dc7ae-1b54-11e9-b4de-005056921935, 287cb4a...",[{'id': '5657645b-d839-4035-8e69-2a9c78232cdf'...
2,C3L-01283,"[{'system': 'GDC', 'value': 'C3L-01283'}, {'sy...",Homo sapiens,male,white,hispanic or latino,-13526,"[CPTAC3-Discovery, CPTAC-3, cptac_ccrcc]",Alive,,Not Reported,"[202ff344-1b54-11e9-a7d8-005056921935, 1d101c1...",[{'id': '3348ec86-9b46-4f5b-85ad-1a126385b695'...
3,C3L-00581,"[{'system': 'GDC', 'value': 'C3L-00581'}, {'sy...",Homo sapiens,male,white,not hispanic or latino,-19213,"[CPTAC3-Discovery, CPTAC-3, cptac_ccrcc]",Dead,1230,Not Reported,"[5d83ff7c-b9df-45eb-a827-35bb3cd0f0da, 842250b...",[{'id': '80d14fd9-4413-4fce-9d75-f7ec4711e285'...
4,C3N-00494,"[{'system': 'GDC', 'value': 'C3N-00494'}, {'sy...",Homo sapiens,male,not reported,not reported,-24122,"[CPTAC3-Discovery, CPTAC-3, cptac_ccrcc]",Alive,,Not Reported,"[7ea5329d-bc52-4e83-b9c5-f73d27d8b292, f8e03ca...",[{'id': '239a5fbb-0145-46f6-8ec3-2a96a04011dd'...
...,...,...,...,...,...,...,...,...,...,...,...,...,...
60,C3N-00317,"[{'system': 'GDC', 'value': 'C3N-00317'}, {'sy...",Homo sapiens,female,not reported,not reported,-27092,"[CPTAC3-Discovery, CPTAC-3, cptac_ccrcc]",Alive,,Not Reported,"[30b33e2e-1b54-11e9-a7d8-005056921935, 325ee4d...",[{'id': '0d81506b-7823-4ef2-8c94-d8fa624a51a1'...
61,C3N-00492,"[{'system': 'GDC', 'value': 'C3N-00492'}, {'sy...",Homo sapiens,female,not reported,not reported,-17993,"[CPTAC3-Discovery, CPTAC-3, cptac_ccrcc]",Alive,,Not Reported,"[f6891537-9de9-4f0b-8dab-29d18b84df18, 9228dde...",[{'id': '920b70b3-265f-4c67-afe5-079b073cab03'...
62,C3L-00902,"[{'system': 'GDC', 'value': 'C3L-00902'}, {'sy...",Homo sapiens,female,white,not hispanic or latino,-25512,"[CPTAC3-Discovery, CPTAC-3, cptac_ccrcc]",Alive,,Not Reported,"[bd67c388-cbb5-459d-babe-e6a80bd7715c, dc36814...",[{'id': 'fd2ff719-f1d6-43d1-a7b3-8e4e67da89ee'...
63,C3N-00315,"[{'system': 'GDC', 'value': 'C3N-00315'}, {'sy...",Homo sapiens,male,not reported,not reported,-25193,"[CPTAC3-Discovery, CPTAC-3, cptac_ccrcc]",Alive,,Not Reported,"[30b33e2e-1b54-11e9-a7d8-005056921935, 325ee4d...",[{'id': '77817c88-77e1-4ff0-a24e-21ca2942ba3a'...


### File summary
Now that we know there are research subjects that meet our criteria, let's see what data exists about them. We can get a summary of the data for these subjects using the `counts()` feature:

In [10]:
kidneycancerquery.counts(version="all_Files_v3_0_w_RS").to_dataframe().iloc[0:3,0:2]

Unnamed: 0,system,subject_count
0,IDC,106
1,PDC,119
2,GDC,106


### What do these numbers mean?

---
    
- **system:** Which data source contributed this data? The CDA currently has data from IDC, PDC and GDC
- **subject_count:** How many unique individuals meet our query. Note that *within* a data source the number is of *unique* individuals, but the same individuals can have data at multiple centers. Here, there are 106 unique people in the IDC data; however, some or all of them may be exactly the same people as are in the PDC or GDC data.

---
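Because the same person can be counted once per data source, the three numbers can't simply be added up. The sketch below shows the bounds we can state from the counts alone, followed by a toy example with made-up subject IDs showing how overlap shrinks the total:

```python
# subject_count per system, as reported in the table above.
counts = {"IDC": 106, "PDC": 119, "GDC": 106}

# If nobody is shared between sources, every count is a different person.
upper_bound = sum(counts.values())
# If the smaller sources are entirely contained in the largest one.
lower_bound = max(counts.values())
print(lower_bound, upper_bound)

# Toy illustration with made-up subject IDs.
idc = {"a", "b", "c"}
pdc = {"b", "c", "d"}
gdc = {"c", "e"}

naive_total = len(idc) + len(pdc) + len(gdc)  # counts shared people repeatedly
distinct_total = len(idc | pdc | gdc)         # each person counted once
print(naive_total, distinct_total)
```

For our query, the number of distinct people across all three sources is somewhere between 119 and 331; the counts alone can't narrow it further.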

## Retrieving data


Let's run the same query, but instead of asking for summary information, let's get the data about each file. We start this process the same way, by making a `Q` statement. We can reuse `kidneycancerquery` here as well, and just run the `.files()` function on it:

In [11]:
kidneycancerfiles = kidneycancerquery.files(version="all_Files_v3_0_w_RS")
kidneycancerfiles

Getting results from database




            QueryID: 5245d52b-7efa-4526-a5f5-c35c8e98ca42
            Query:SELECT all_Files_v3_0_w_RS.* FROM gdc-bq-sample.dev.all_Files_v3_0_w_RS AS all_Files_v3_0_w_RS, UNNEST(all_Files_v3_0_w_RS.ResearchSubject) AS _ResearchSubject, UNNEST(_ResearchSubject.Diagnosis) AS _ResearchSubject_Diagnosis WHERE (((UPPER(_ResearchSubject_Diagnosis.stage) = UPPER('Stage I')) OR (UPPER(_ResearchSubject_Diagnosis.stage) = UPPER('Stage II'))) AND (UPPER(_ResearchSubject.primary_diagnosis_site) = UPPER('Kidney')))
            Offset: 0
            Count: 100
            Total Row Count: 9982
            More pages: True
            

In [12]:
kidneycancerfiles.to_dataframe()

Unnamed: 0,id,identifier,label,data_category,data_type,file_format,associated_project,drs_uri,byte_size,checksum,data_modality,imaging_modality,dbgap_accession_number,Subject,ResearchSubject,Specimen
0,836e7306-19cc-11e9-99db-005056921935,"[{'system': 'PDC', 'value': '836e7306-19cc-11e...",07CPTAC_CCRCC_W_JHU_20171127_LUMOS_f22.raw,Raw Mass Spectra,Proprietary,vendor-specific,CPTAC3-Discovery,drs://dg.4DFC:836e7306-19cc-11e9-99db-00505692...,753651146,102264f0bec47aa75ddd826a924369ac,Proteomic,,,"[{'id': 'QC3', 'identifier': [{'system': 'PDC'...",[{'id': '0b43d28c-1fba-11e9-b7f8-0a80fada099c'...,[{'id': '0635b755-204e-11e9-b7f8-0a80fada099c'...
1,836e7306-19cc-11e9-99db-005056921935,"[{'system': 'PDC', 'value': '836e7306-19cc-11e...",07CPTAC_CCRCC_W_JHU_20171127_LUMOS_f22.raw,Raw Mass Spectra,Proprietary,vendor-specific,CPTAC3-Discovery,drs://dg.4DFC:836e7306-19cc-11e9-99db-00505692...,753651146,102264f0bec47aa75ddd826a924369ac,Proteomic,,,"[{'id': 'QC3', 'identifier': [{'system': 'PDC'...",[{'id': '0b43d28c-1fba-11e9-b7f8-0a80fada099c'...,[{'id': '0635b755-204e-11e9-b7f8-0a80fada099c'...
2,aef47e4c-1a00-11e9-b898-005056921935,"[{'system': 'PDC', 'value': 'aef47e4c-1a00-11e...",07CPTAC_CCRCC_P_JHU_20171212_LUMOS_f03.raw,Raw Mass Spectra,Proprietary,vendor-specific,CPTAC3-Discovery,drs://dg.4DFC:aef47e4c-1a00-11e9-b898-00505692...,908045374,fd60952b13702194cc1db527adbc95a9,Proteomic,,,"[{'id': 'QC3', 'identifier': [{'system': 'PDC'...",[{'id': '0b43d28c-1fba-11e9-b7f8-0a80fada099c'...,[{'id': '0635b755-204e-11e9-b7f8-0a80fada099c'...
3,aef47e4c-1a00-11e9-b898-005056921935,"[{'system': 'PDC', 'value': 'aef47e4c-1a00-11e...",07CPTAC_CCRCC_P_JHU_20171212_LUMOS_f03.raw,Raw Mass Spectra,Proprietary,vendor-specific,CPTAC3-Discovery,drs://dg.4DFC:aef47e4c-1a00-11e9-b898-00505692...,908045374,fd60952b13702194cc1db527adbc95a9,Proteomic,,,"[{'id': 'QC3', 'identifier': [{'system': 'PDC'...",[{'id': '0b43d28c-1fba-11e9-b7f8-0a80fada099c'...,[{'id': '0635b755-204e-11e9-b7f8-0a80fada099c'...
4,d0159292-1b58-11e9-b01e-005056921935,"[{'system': 'PDC', 'value': 'd0159292-1b58-11e...",07CPTAC_CCRCC_P_JHU_20171212_LUMOS_f07.mzid.gz,Peptide Spectral Matches,Open Standard,mzIdentML,CPTAC3-Discovery,drs://dg.4DFC:d0159292-1b58-11e9-b01e-00505692...,3405616,98c7e21ed95fa7d3964f82cca1a02d50,Proteomic,,,"[{'id': 'QC3', 'identifier': [{'system': 'PDC'...",[{'id': '0b43d28c-1fba-11e9-b7f8-0a80fada099c'...,[{'id': '0635b755-204e-11e9-b7f8-0a80fada099c'...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,bd1df6ca-1b53-11e9-a7d8-005056921935,"[{'system': 'PDC', 'value': 'bd1df6ca-1b53-11e...",02CPTAC_CCRCC_W_JHU_20171003_LUMOS_f15.psm,Peptide Spectral Matches,Text,tsv,CPTAC3-Discovery,drs://dg.4DFC:bd1df6ca-1b53-11e9-a7d8-00505692...,3890682,aaf08b7d54c2920821c75acf681bf74c,Proteomic,,,"[{'id': 'C3N-01214', 'identifier': [{'system':...",[{'id': 'a767e303-1fb9-11e9-b7f8-0a80fada099c'...,[{'id': 'e2be63e9-204c-11e9-b7f8-0a80fada099c'...
96,055f62fe-1b4d-11e9-8e2e-005056921935,"[{'system': 'PDC', 'value': '055f62fe-1b4d-11e...",06CPTAC_CCRCC_P_JHU_20171124_LUMOS_f02.mzML.gz,Processed Mass Spectra,Open Standard,mzML,CPTAC3-Discovery,drs://dg.4DFC:055f62fe-1b4d-11e9-8e2e-00505692...,269075153,054afc6635be21e96d4bc33ebca802ea,Proteomic,,,"[{'id': 'QC2', 'identifier': [{'system': 'PDC'...",[{'id': '09a53ad8-1fba-11e9-b7f8-0a80fada099c'...,[{'id': '04afc8fb-204e-11e9-b7f8-0a80fada099c'...
97,055f62fe-1b4d-11e9-8e2e-005056921935,"[{'system': 'PDC', 'value': '055f62fe-1b4d-11e...",06CPTAC_CCRCC_P_JHU_20171124_LUMOS_f02.mzML.gz,Processed Mass Spectra,Open Standard,mzML,CPTAC3-Discovery,drs://dg.4DFC:055f62fe-1b4d-11e9-8e2e-00505692...,269075153,054afc6635be21e96d4bc33ebca802ea,Proteomic,,,"[{'id': 'QC2', 'identifier': [{'system': 'PDC'...",[{'id': '09a53ad8-1fba-11e9-b7f8-0a80fada099c'...,[{'id': '04afc8fb-204e-11e9-b7f8-0a80fada099c'...
98,0c844200-1b54-11e9-a7d8-005056921935,"[{'system': 'PDC', 'value': '0c844200-1b54-11e...",16CPTAC_CCRCC_W_JHU_20180322_LUMOS_f15.psm,Peptide Spectral Matches,Text,tsv,CPTAC3-Discovery,drs://dg.4DFC:0c844200-1b54-11e9-a7d8-00505692...,4171463,720ea80c683b77a6ce53334ba54318d9,Proteomic,,,"[{'id': 'QC6', 'identifier': [{'system': 'PDC'...",[{'id': '10135c5c-1fba-11e9-b7f8-0a80fada099c'...,[{'id': '0ca44a76-204e-11e9-b7f8-0a80fada099c'...


What are all these fields?

---

- **id:** The unique identifier for this file
- **identifier:** An embedded array of information that includes the originating data center and the ID the file had there
- **label:** The full name of the file
- **data_category:** A description of the general kind of data the file holds
- **data_type:** A more specific description of the data type
- **file_format:** The extension of the file
- **associated_project:** The name the data center uses for the study this file was generated for
- **drs_uri:** A unique identifier that can be used to retrieve this specific file from a server
- **byte_size:** Size of the file in bytes
- **checksum:** The md5 value for the file
- **data_modality:** A high level descriptor of file data, always one of "Genomic", "Proteomic", or "Imaging"
- **imaging_modality:** For files with the `data_modality` of "Imaging", a descriptor for the image type
- **dbgap_accession_number:** An identifier for the dbGaP project this file belongs to
- **Subject:** An embedded array of information that includes the originating data center and the Subject ID the file had there
- **ResearchSubject:** An embedded array of information that includes the originating data center and the ResearchSubject ID the file had there
- **Specimen:** An embedded array of information that includes the originating data center and the Specimen ID the file had there

---
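Two of these fields, `byte_size` and `checksum`, are handy for checking that a file you downloaded arrived intact. The following is a minimal sketch using only the Python standard library. The helper names are hypothetical, and the demo uses a small temporary file rather than a real CDA file:

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the md5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5, expected_size):
    """Check a local file against the byte_size and checksum metadata."""
    if os.path.getsize(path) != expected_size:
        return False
    return md5_of_file(path) == expected_md5.lower()


# Demo with a stand-in file; a real check would use the values from
# the byte_size and checksum columns of the files table.
payload = b"example bytes"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    demo_path = tmp.name

expected = hashlib.md5(payload).hexdigest()
ok = verify_download(demo_path, expected, len(payload))
wrong_size = verify_download(demo_path, expected, len(payload) + 1)
os.remove(demo_path)
print(ok, wrong_size)
```

Checking the size first is cheap and catches truncated downloads before the more expensive checksum is computed.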
