# Available Search Terms
---

Before we do any work, we need to import several functions from cdapython:
- `Q` and `query` which power the search
- `columns` which lets us view entity field names
- `unique_terms` which lets us view entity field contents

We're also asking cdapython to report its version so we can be sure we're using the one we mean to.

In [29]:
from cdapython import Q, columns, unique_terms, query
print(Q.get_version())
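
If you want to cross-check the version without importing the package, the Python standard library can read it from the installed package metadata. This is a minimal sketch, not part of cdapython: `installed_version` is a hypothetical helper, and it assumes the package is installed under the distribution name `cdapython`.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str = "cdapython") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent.

    The default name "cdapython" is an assumption about how the package
    was installed in your environment.
    """
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


found = installed_version()
if found is None:
    print("cdapython is not installed in this environment")
else:
    print(f"cdapython {found} is installed")
```

If the version printed here differs from the one `Q.get_version()` reports, you may have more than one copy of the package on your path.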



<div class="cdanote" style="background-color:#b3e5d5;color:black;padding:20px;">
    
You can think of the CDA as a really, really enormous spreadsheet full of data. To search this enormous spreadsheet, you'd want to select columns, and then filter rows.
</div>

Accordingly, to see what search fields are available, we use the command `columns`:

In [2]:
columns()


            
            Offset: 0
            Count: 65
            Total Row Count: 65
            More pages: False
            

This output tells us that there are 65 searchable fields, but it doesn't list them. Running a CDA command like this first gives you an overall summary of the data you're going to get, which is useful as a gut check. To see the field names themselves, we can have `columns()` return its contents as a list instead:

In [3]:
columns().to_list()

['File.id',
 'File.identifier.system',
 'File.identifier.value',
 'File.label',
 'File.data_category',
 'File.data_type',
 'File.file_format',
 'File.associated_project',
 'File.drs_uri',
 'File.byte_size',
 'File.checksum',
 'File.data_modality',
 'File.imaging_modality',
 'File.dbgap_accession_number',
 'File.imaging_series',
 'id',
 'identifier.system',
 'identifier.value',
 'species',
 'sex',
 'race',
 'ethnicity',
 'days_to_birth',
 'subject_associated_project',
 'vital_status',
 'days_to_death',
 'cause_of_death',
 'ResearchSubject.id',
 'ResearchSubject.identifier.system',
 'ResearchSubject.identifier.value',
 'ResearchSubject.member_of_research_project',
 'ResearchSubject.primary_diagnosis_condition',
 'ResearchSubject.primary_diagnosis_site',
 'ResearchSubject.Diagnosis.id',
 'ResearchSubject.Diagnosis.identifier.system',
 'ResearchSubject.Diagnosis.identifier.value',
 'ResearchSubject.Diagnosis.primary_diagnosis',
 'ResearchSubject.Diagnosis.age_at_diagnosis',
 'ResearchSubje

By default, `columns()` returns the first 100 items. If that is too many, you can cap the number of results with the `limit` option:

In [4]:
columns(limit=10).to_list()

['File.id',
 'File.identifier.system',
 'File.identifier.value',
 'File.label',
 'File.data_category',
 'File.data_type',
 'File.file_format',
 'File.associated_project',
 'File.drs_uri',
 'File.byte_size']

Or you can filter the list for terms that match your interests:

In [5]:
columns().to_list(filters="diagnosis")

['ResearchSubject.primary_diagnosis_condition',
 'ResearchSubject.primary_diagnosis_site',
 'ResearchSubject.Diagnosis.id',
 'ResearchSubject.Diagnosis.identifier.system',
 'ResearchSubject.Diagnosis.identifier.value',
 'ResearchSubject.Diagnosis.primary_diagnosis',
 'ResearchSubject.Diagnosis.age_at_diagnosis',
 'ResearchSubject.Diagnosis.morphology',
 'ResearchSubject.Diagnosis.stage',
 'ResearchSubject.Diagnosis.grade',
 'ResearchSubject.Diagnosis.method_of_diagnosis',
 'ResearchSubject.Diagnosis.Treatment.id',
 'ResearchSubject.Diagnosis.Treatment.identifier.system',
 'ResearchSubject.Diagnosis.Treatment.identifier.value',
 'ResearchSubject.Diagnosis.Treatment.treatment_type',
 'ResearchSubject.Diagnosis.Treatment.treatment_outcome',
 'ResearchSubject.Diagnosis.Treatment.days_to_treatment_start',
 'ResearchSubject.Diagnosis.Treatment.days_to_treatment_end',
 'ResearchSubject.Diagnosis.Treatment.therapeutic_agent',
 'ResearchSubject.Diagnosis.Treatment.treatment_anatomic_site',
 'Re

<div class="cdawarn" style="background-color:#f9cfbf;color:black;padding:20px;">
<strong>Check your search criteria!</strong>
While available search fields may look like ones you've seen in PDC, GDC or IDC, that does not mean they will contain exactly the same information; several are renamed or restructured in the CDA model. The field name mappings are described in <a href="../../Schema/overview_mapping">CDA Schema Field Mapping.</a>
</div>


We can directly get information about what data populates any of these fields using the `unique_terms()` function. Like `columns`, `unique_terms` defaults to giving us an overview of the results, and we view them the same way:

In [6]:
unique_terms("ResearchSubject.primary_diagnosis_site").to_list()

['Abdomen',
 'Abdomen, Mediastinum',
 'Abdomen, Pelvis',
 'Adrenal Glands',
 'Adrenal gland',
 'Anus and anal canal',
 'Base of tongue',
 'Bile Duct',
 'Bladder',
 'Bones, joints and articular cartilage of limbs',
 'Bones, joints and articular cartilage of other and unspecified sites',
 'Brain',
 'Breast',
 'Bronchus and lung',
 'Cervix',
 'Cervix uteri',
 'Chest',
 'Chest-Abdomen-Pelvis, Leg, TSpine',
 'Colon',
 'Connective, subcutaneous and other soft tissues',
 'Corpus uteri',
 'Ear',
 'Esophagus',
 'Extremities',
 'Eye and adnexa',
 'Floor of mouth',
 'Gallbladder',
 'Gum',
 'Head',
 'Head and Neck',
 'Head-Neck',
 'Heart, mediastinum, and pleura',
 'Hematopoietic and reticuloendothelial systems',
 'Hypopharynx',
 'Intraocular',
 'Kidney',
 'Larynx',
 'Lip',
 'Liver',
 'Liver and intrahepatic bile ducts',
 'Lung',
 'Lung Phantom',
 'Lymph nodes',
 'Marrow, Blood',
 'Meninges',
 'Mesothelium',
 'Nasal cavity and middle ear',
 'Nasopharynx',
 'Not Reported',
 'Oropharynx',
 'Other an

When you are browsing for possible search terms, it is often useful to see how much data each one has. A quick way to see the overall volume of data for each term is the `show_counts` option. The result needs to be viewed as a dataframe, since it is two-dimensional data:

In [28]:
unique_terms("ResearchSubject.primary_diagnosis_site", show_counts = True).to_dataframe()


| | primary_diagnosis_site | Count |
|---|---|---|
| 0 | Abdomen | 92 |
| 1 | Abdomen, Mediastinum | 176 |
| 2 | Abdomen, Pelvis | 230 |
| 3 | Adrenal Glands | 271 |
| 4 | Adrenal gland | 851 |
| ... | ... | ... |
| 90 | Uterus, NOS | 2000 |
| 91 | Vagina | 72 |
| 92 | Various | 449 |
| 93 | Various (11 locations) | 89 |
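
Counts like these are easy to post-process once you have them. The sketch below uses plain Python on the rows visible in the excerpt above (the elided rows are left out), ranks them by count, and adds up the counts for every spelling that contains a given fragment. `total_for` is a hypothetical helper written for this illustration, not a cdapython function.

```python
# (term, count) pairs copied from the visible rows of the output above.
# The rows hidden behind "..." are not included.
rows = [
    ("Abdomen", 92),
    ("Abdomen, Mediastinum", 176),
    ("Abdomen, Pelvis", 230),
    ("Adrenal Glands", 271),
    ("Adrenal gland", 851),
    ("Uterus, NOS", 2000),
    ("Vagina", 72),
    ("Various", 449),
    ("Various (11 locations)", 89),
]


def total_for(fragment, term_counts):
    """Hypothetical helper: sum the counts of every term containing
    `fragment`, ignoring case."""
    needle = fragment.lower()
    return sum(count for term, count in term_counts if needle in term.lower())


# Rank the visible rows from most to least data.
ranked = sorted(rows, key=lambda row: row[1], reverse=True)

print(ranked[0])                   # the term with the largest count
print(total_for("adrenal", rows))  # both spellings of the adrenal site combined
```

Adding up "Adrenal Glands" and "Adrenal gland" this way shows how much data a single-spelling search would miss.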


We can also use the same `filters` option here to search for only diagnosis sites that we're interested in:

In [7]:
unique_terms("ResearchSubject.primary_diagnosis_site").to_list(filters="lung")


['Bronchus and lung', 'Lung', 'Lung Phantom']

`filters` looks for both full and partial matches, which is useful for searching unharmonized data. For instance, if I'm not sure whether the data I'm interested in is labeled "uterine" or "uterus", I might search for just "uter":

In [None]:
unique_terms("ResearchSubject.primary_diagnosis_site").to_list(filters="uter")

Success! Not only are there multiple ways that "Uterus" is specified in the CDA data, I now know that there are also data for specific uterine tissues.

---

<div class="cdawarn" style="background-color:#f9cfbf;color:black;padding:20px;">
<strong>Check your search terms!</strong>
If you run into unexpected results when running a search, make sure that you're searching all the terms you want. CDA data is not yet harmonized across centers, so a single-term search will often not return all the information you need. However, the CDA provides tools that make it easy to search all forms of a term, which enables cross-dataset search.
</div>

---

However, if your filter is very short, or a very common word, this partial match behavior might give too many results. To force the search to only find exact matches, add the `exact = True` option:

In [22]:
unique_terms("ResearchSubject.primary_diagnosis_site").to_list(filters="lung", exact = True)

['Lung']

In [23]:
unique_terms("ResearchSubject.primary_diagnosis_site").to_list(filters="uter", exact = True)

[]
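
The difference between the two modes can be reproduced in plain Python. The sketch below is a hypothetical stand-in that mirrors the behavior seen in the outputs above (case-insensitive substring matching by default, case-insensitive whole-term matching with `exact`). It is not cdapython's implementation, and `sample` is a small subset of the terms listed earlier.

```python
def filter_terms(terms, pattern, exact=False):
    """Hypothetical stand-in for the `filters` / `exact` behavior shown above.

    Partial mode keeps terms that contain `pattern`; exact mode keeps
    terms that equal `pattern`. Both comparisons ignore case.
    """
    needle = pattern.lower()
    if exact:
        return [term for term in terms if term.lower() == needle]
    return [term for term in terms if needle in term.lower()]


# A small subset of the primary_diagnosis_site terms listed earlier.
sample = [
    "Bronchus and lung",
    "Cervix uteri",
    "Corpus uteri",
    "Lung",
    "Lung Phantom",
    "Uterus, NOS",
]

print(filter_terms(sample, "lung"))              # partial match
print(filter_terms(sample, "lung", exact=True))  # whole-term match only
print(filter_terms(sample, "uter"))              # a fragment matches several spellings
print(filter_terms(sample, "uter", exact=True))  # no term is exactly "uter"
```

A fragment such as "uter" is only useful in partial mode, since no term consists of that fragment alone.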

Explore the available terms by changing the filters, the number of results, and the field you pass to `unique_terms`. Once you have found terms you're interested in, head to <a href="../BasicSearch">Basic Search</a> to build simple queries.