# What terms are searchable?
---

Before we do any work, we need to import several functions from cdapython:
- `Q` and `query` which power the search
- `columns` which lets us view entity field names
- `unique_terms` which lets us view entity field contents

We're also asking cdapython to report its version so we can be sure we're using the one we mean to.

In [1]:
from cdapython import Q, columns, unique_terms, query
import cdapython
print(cdapython.__version__)
Q.set_host_url("http://35.192.60.10:8080/")

2022.4.13




CDA data comes from three sources:
- The [Proteomic Data Commons](https://proteomic.datacommons.cancer.gov/pdc/) (PDC)
- The [Genomic Data Commons](https://gdc.cancer.gov/) (GDC)
- The [Imaging Data Commons](https://datacommons.cancer.gov/repository/imaging-data-commons) (IDC)

The CDA makes this data searchable in two ways: by a "Subject" table and by a "File" table.

`Subject` data is information that is intrinsic to the individual under study, e.g. `sex`, `race`, `ethnicity`. However, any given subject might be part of multiple studies. To make search across datasets easier, the CDA model aggregates this data as `ResearchSubject` information. Subjects that participate in multiple projects (are part of multiple nodes) will have multiple `ResearchSubject` entries.

`Subject` and `ResearchSubject` fields are available in both the "Subject" and "File" tables; however, terms specific to files, e.g. `data_type` or `file_format`, are only available in the "File" table.

To see what fields are available, we use the command `columns`. Here we're limiting to 20 results for readability, but you can remove the limit to see all of them:

In [2]:
columns(limit=20)

['id',
 'identifier',
 'identifier.system',
 'identifier.value',
 'species',
 'sex',
 'race',
 'ethnicity',
 'days_to_birth',
 'subject_associated_project',
 'vital_status',
 'age_at_death',
 'cause_of_death',
 'ResearchSubject',
 'ResearchSubject.id',
 'ResearchSubject.identifier',
 'ResearchSubject.identifier.system',
 'ResearchSubject.identifier.value',
 'ResearchSubject.member_of_research_project',
 'ResearchSubject.primary_diagnosis_condition']

By default, `columns()` returns `Subject` table fields. The first several fields (those without an entity prefix such as `ResearchSubject.`) are `Subject` demographic information, which is intrinsically attached to a given subject. Subsequent entities (i.e. `ResearchSubject.xxx`) contain details about specific experiments the subject was part of. They are equivalent to the nodes' `Case` record in the GDC and PDC.
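The split between plain `Subject` fields and nested entity fields can be sketched in a few lines of Python. This is only an illustration: `group_by_entity` is a hypothetical helper, not part of cdapython, and it assumes that entity names start with an uppercase letter while plain field names start with a lowercase one, which holds for the list shown above. The field names are copied from the output above so the example runs without a connection to the CDA.

```python
# Field names copied from the `columns(limit=20)` output above.
subject_fields = [
    "id", "identifier", "identifier.system", "identifier.value",
    "species", "sex", "race", "ethnicity", "days_to_birth",
    "subject_associated_project", "vital_status", "age_at_death",
    "cause_of_death", "ResearchSubject", "ResearchSubject.id",
    "ResearchSubject.identifier", "ResearchSubject.identifier.system",
    "ResearchSubject.identifier.value",
    "ResearchSubject.member_of_research_project",
    "ResearchSubject.primary_diagnosis_condition",
]


def group_by_entity(fields, top_level="Subject"):
    """Group field names under the entity they belong to (hypothetical helper).

    Assumption: entity names start with an uppercase letter and plain
    field names start with a lowercase one.
    """
    groups = {}
    for name in fields:
        parts = name.split(".")
        # Count the leading capitalized segments; they form the entity path.
        depth = 0
        while depth < len(parts) and parts[depth][:1].isupper():
            depth += 1
        entity = ".".join(parts[:depth]) or top_level
        field = ".".join(parts[depth:])
        bucket = groups.setdefault(entity, [])
        if field:  # a bare entity name such as "ResearchSubject" has no field part
            bucket.append(field)
    return groups


groups = group_by_entity(subject_fields)
for entity, names in groups.items():
    print(f"{entity}: {len(names)} fields")
```

With cdapython available, the hard-coded list could presumably be replaced by the result of `columns()`, assuming it returns a plain list of strings as the output above suggests.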

To see the fields available in the File table, we add `files=True` to the command:


In [10]:
columns(files=True)

['id',
 'identifier',
 'identifier.system',
 'identifier.value',
 'label',
 'data_category',
 'data_type',
 'file_format',
 'associated_project',
 'drs_uri',
 'byte_size',
 'checksum',
 'data_modality',
 'imaging_modality',
 'dbgap_accession_number',
 'Subject',
 'Subject.id',
 'Subject.identifier',
 'Subject.identifier.system',
 'Subject.identifier.value',
 'Subject.species',
 'Subject.sex',
 'Subject.race',
 'Subject.ethnicity',
 'Subject.days_to_birth',
 'Subject.subject_associated_project',
 'Subject.vital_status',
 'Subject.age_at_death',
 'Subject.cause_of_death',
 'ResearchSubject',
 'ResearchSubject.id',
 'ResearchSubject.identifier',
 'ResearchSubject.identifier.system',
 'ResearchSubject.identifier.value',
 'ResearchSubject.member_of_research_project',
 'ResearchSubject.primary_diagnosis_condition',
 'ResearchSubject.primary_diagnosis_site',
 'ResearchSubject.Diagnosis',
 'ResearchSubject.Diagnosis.id',
 'ResearchSubject.Diagnosis.identifier',
 'ResearchSubject.Diagnosis.iden

Note that while some of the search values are the same in subject and file, not all of them are. There are also many more fields in the "File" table, and some names vary slightly. Further, while available search fields may look like ones you've seen in PDC, GDC or IDC, that does not mean they will contain exactly the same information; several are renamed or restructured in the CDA model. The field name mappings are described in [CDA Schema Field Mapping](../Documentation/Schema.md), but we can also directly get information about what data populates any of these fields using the `unique_terms()` function:

In [4]:
unique_terms("ethnicity")

[None,
 'Unknown',
 'hispanic or latino',
 'not allowed to collect',
 'not hispanic or latino',
 'not reported']
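Note that the returned values mix capitalization (`'Unknown'` next to `'not reported'`) and include `None`. When choosing a value to search for, it can help to look up the exact stored spelling first. The sketch below shows one way to do that; `find_stored_spelling` is a hypothetical helper, not a cdapython function, and the term list is copied from the output above.

```python
# Values copied from the `unique_terms("ethnicity")` output above.
ethnicity_terms = [
    None,
    "Unknown",
    "hispanic or latino",
    "not allowed to collect",
    "not hispanic or latino",
    "not reported",
]


def find_stored_spelling(value, terms):
    """Return the term that matches `value` ignoring case, or None if absent."""
    wanted = value.strip().casefold()
    for term in terms:
        # Skip the None entry, which marks records with no value at all.
        if term is not None and term.casefold() == wanted:
            return term
    return None


stored = find_stored_spelling("Hispanic or Latino", ethnicity_terms)
print(stored)
```

A result of `None` from this helper means the value does not appear in the field under any capitalization, so a search for it is unlikely to return anything.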

Like `columns`, `unique_terms` defaults to giving us information for a "Subject" search. To see what values are available for a "File" search, we again add `files=True` to the command:

In [9]:
unique_terms("ethnicity", files=True)

['Genomic', 'Imaging', 'Proteomic']

That is not what we expected: the result is not the list of ethnicity values we saw before. If we go back and check the output of the `columns(files=True)` command above, we see that in the File table, ethnicity is prefixed with `Subject`, so its field name is actually `Subject.ethnicity`, not just `ethnicity`. This makes sense: in the Subject search we are specifically asking about Subject information, whereas in the File search we need to tell the computer to look under Subject for this information, like this:

In [7]:
unique_terms("Subject.ethnicity", files=True)

[None,
 'Unknown',
 'hispanic or latino',
 'not allowed to collect',
 'not hispanic or latino',
 'not reported']

That's better! Both tables can be searched by Subject terms, so we can filter the file results by ethnicity. We just have to make sure we're asking for the right term. If you run into unexpected errors or results when running a search, check that the field you've requested is specified correctly for the table you are searching.
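One way to catch this kind of mismatch early is to look a short field name up in the table's field list before using it. The following is a minimal sketch under stated assumptions: `qualify_field` is a hypothetical helper, not part of cdapython, and `file_fields` is only an excerpt of the `columns(files=True)` output above.

```python
# Excerpt of the `columns(files=True)` output above (not the full list).
file_fields = [
    "id",
    "label",
    "data_type",
    "file_format",
    "Subject.id",
    "Subject.sex",
    "Subject.race",
    "Subject.ethnicity",
    "ResearchSubject.id",
    "ResearchSubject.primary_diagnosis_condition",
    "ResearchSubject.primary_diagnosis_site",
]


def qualify_field(field, available_fields):
    """Return candidate full names for `field` in `available_fields`.

    An exact match wins; otherwise every name ending in ".<field>" is returned.
    An empty list means the field is not available in this table.
    """
    if field in available_fields:
        return [field]
    suffix = "." + field
    return [name for name in available_fields if name.endswith(suffix)]


candidates = qualify_field("ethnicity", file_fields)
print(candidates)
```

Because an exact match takes priority, a name such as `id` resolves to the File table's own `id` rather than `Subject.id`; an ambiguous or empty result is a signal to check the full `columns(files=True)` listing by hand.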

Explore the available terms by changing which table, how many results, and which unique terms you request. Once you have found terms you're interested in, head to [How to Search](../SearchTypes) to build simple queries.