# CDA Release 3 Query and Import Notebook

This notebook demonstrates how to use the CDA python API and how to import the resulting files into a project on CGC. It relies on Python functions from the *utilities.py* file, which were developed specifically for this demo. Feel free to use the same functions in your own projects. For questions, suggestions and bug reports, please contact Seven Bridges support.

This notebook consists of several parts. First, necessary libraries are installed. In order to install libraries, the project in which the notebook is executed needs to have internet access. To check if your project has internet access, go to project settings (top right of the project page, or directly using https://cgc.sbgenomics.com/u/username/project-name/settings) -> *Execution settings* and select *Allow internet access*.

In the second part, a connection to CDA is established and a query is run on six different endpoints.

Finally, the dataframes obtained from the endpoints are merged and the files are imported using the Seven Bridges API.

The goal of this notebook is not data exploration. There are many [online examples](https://cda.readthedocs.io/en/latest/Examples/SearchTerms/) showing how to explore CDA data. This notebook focuses on importing CDA files into your CGC project. The full functionality of CDA python is described in the [official documentation](https://github.com/CancerDataAggregator/readthedocs).

### Requirements installation

When the following cell is run, the required libraries are installed. By default, the installer runs quietly. To see what is going on behind the scenes, remove the -qq flag. Running the cell may report a dependency conflict. If the conflict only involves scipy and numba, feel free to ignore the error. CDA python is still under development and clashes with other libraries present in this environment. Seven Bridges is working on a separate environment just for CDA.

In [1]:
! pip install -r requirements.txt -qq

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
scipy 1.7.1 requires numpy<1.23.0,>=1.16.5, but you have numpy 1.23.4 which is incompatible.
numba 0.54.0 requires numpy<1.21,>=1.17, but you have numpy 1.23.4 which is incompatible.


### Importing cda python

CDA python comes with a tool, **Q**, which handles the interaction with the database. Additionally, the structure of the dataframe can be inspected using the _columns_ function.


In [2]:
from cdapython import Q, columns
import cdapython

print(cdapython.__file__)
print(cdapython.__version__)

### Writing a query

Let's get all files with subject_id = TCGA-E2-A10A:

In [5]:
q = Q('subject_id = "TCGA-E2-A10A"')

## Querying CDA

CDA consists of multiple endpoints, each of which gives access to a certain subset of the data. Since the goal of this notebook is to get files, the first endpoint queried is the files endpoint.

To transform query results into dataframes, the _iter_pages()_ function is defined in _utilities.py_. It iterates over the pages of a response and collects all the data into a single dataframe. There are other approaches to transforming query results into a dataframe; to explore them, see the [Pagination notebook](https://github.com/CancerDataAggregator/readthedocs/blob/main/docs/Examples/Pagination.ipynb).
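The implementation of _iter_pages()_ lives in _utilities.py_ and is not shown in this notebook. The following is only a minimal sketch of the general idea, assuming that each result page exposes `to_dataframe()`, `has_next_page` and `next_page()`. `FakePage` is a hypothetical stand-in so the sketch runs without a CDA connection, and `collect_pages` is an illustrative name, not the function from _utilities.py_.

```python
import pandas as pd


class FakePage:
    """Hypothetical stand-in for one page of a paginated query result."""

    def __init__(self, pages, index=0):
        self._pages = pages
        self._index = index

    @property
    def has_next_page(self):
        return self._index + 1 < len(self._pages)

    def next_page(self):
        return FakePage(self._pages, self._index + 1)

    def to_dataframe(self):
        return pd.DataFrame(self._pages[self._index])


def collect_pages(first_page):
    """Walk every page and concatenate the rows into one dataframe."""
    frames = [first_page.to_dataframe()]
    page = first_page
    while page.has_next_page:
        page = page.next_page()
        frames.append(page.to_dataframe())
    return pd.concat(frames, ignore_index=True)


# Two made-up pages: the first with two rows, the second with one
pages = [
    [{"file_id": "a"}, {"file_id": "b"}],
    [{"file_id": "c"}],
]
all_rows = collect_pages(FakePage(pages))
print(all_rows.shape)
```

Collecting the per-page frames in a list and concatenating once at the end avoids repeatedly copying a growing dataframe.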

Depending on the query, the following cell can take a couple of minutes to execute.

### Querying files endpoint

In [6]:
from utilities import iter_pages
files_of_interest = q.file.run()

files_df = iter_pages(files_of_interest)

Let's check out the format of the obtained dataframe:

In [7]:
files_df.head(3)

Unnamed: 0,file_id,file_identifier,label,data_category,data_type,file_format,file_associated_project,drs_uri,byte_size,checksum,data_modality,imaging_modality,dbgap_accession_number,imaging_series,specimen_id,researchsubject_id,subject_id
0,7af2c68c-6b5c-11e9-ab5c-005056921935,"[{'system': 'PDC', 'value': '7af2c68c-6b5c-11e...",TCGA_E2-A10A_BH-A18Q_C8-A130_117C_W_BI_2013022...,Peptide Spectral Matches,Open Standard,mzIdentML,CPTAC-TCGA,drs://dg.4DFC:7af2c68c-6b5c-11e9-ab5c-00505692...,11371231.0,d77024459559eb31f4db5b32eae36813,Proteomic,,,,,,TCGA-E2-A10A
1,4adca628-6b5e-11e9-9244-005056921935,"[{'system': 'PDC', 'value': '4adca628-6b5e-11e...",TCGA_E2-A10A_BH-A18Q_C8-A130_117C_P_BI_2013022...,Peptide Spectral Matches,Text,tsv,CPTAC-TCGA,drs://dg.4DFC:4adca628-6b5e-11e9-9244-00505692...,3776441.0,919fdbf943adc46bbef970847cf2aa6f,Proteomic,,,,,,TCGA-E2-A10A
2,6bf22560-6b5c-11e9-ab5c-005056921935,"[{'system': 'PDC', 'value': '6bf22560-6b5c-11e...",TCGA_E2-A10A_BH-A18Q_C8-A130_117C_W_BI_2013022...,Peptide Spectral Matches,Text,tsv,CPTAC-TCGA,drs://dg.4DFC:6bf22560-6b5c-11e9-ab5c-00505692...,6847471.0,4c713108933f5b554a11d9acb4e340b3,Proteomic,,,,,,TCGA-E2-A10A


The dataframe contains all file-related information as well as the ids of the other endpoints it connects to. These ids will later be useful for merging dataframes from multiple endpoints.

### Querying other endpoints

When querying other endpoints, **id** column is renamed to properly reflect which id it is referring to. This is done so that merging dataframes can be performed using appropriate **id** columns. Depending on the query, the following cell can take a couple of minutes to execute.

In [8]:
# Diagnosis
diagnosis = q.diagnosis.run()
diagnosis_df = iter_pages(diagnosis)

# Treatment
treatment = q.treatment.run()
treatment_df = iter_pages(treatment)

# Research subject
research_subject_of_interest = q.researchsubject.run()
rs_df = iter_pages(research_subject_of_interest)

# Subject
subject_of_interest = q.subject.run()
subject_df = iter_pages(subject_of_interest)

# Specimen
specimen_of_interest = q.specimen.run()
specimen_df = iter_pages(specimen_of_interest)
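Where exactly the **id** column gets renamed is not shown in this notebook (it presumably happens inside _utilities.py_). As a rough illustration of the idea only, a rename with pandas could look like the sketch below. The helper name `rename_id_column` and the sample rows are made up for this example.

```python
import pandas as pd


def rename_id_column(df, endpoint):
    """Return a copy of df with a generic 'id' column renamed to '<endpoint>_id'."""
    return df.rename(columns={"id": f"{endpoint}_id"})


# Made-up rows standing in for a raw specimen result
raw_specimen = pd.DataFrame(
    {"id": ["spec-1", "spec-2"], "specimen_type": ["tumor", "normal"]}
)
specimen_example = rename_id_column(raw_specimen, "specimen")
print(list(specimen_example.columns))
```

With endpoint-specific names such as `specimen_id`, each merge can name the key column it joins on without ambiguity.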

After the dataframes are obtained, they can be merged into a single dataframe. The goal of querying and merging multiple dataframes is to be able to infer all metadata fields for the files which are to be imported. In general, it is strongly suggested to explore the results and adjust the query before proceeding to merging dataframes and importing data to CGC.

For dataframe manipulation, pandas must be imported:

In [9]:
import pandas as pd

# Merge files and research_subject
combineddata = pd.merge(files_df, rs_df, on='researchsubject_id', how='outer', suffixes=(None, '_rs'))

# Merge combined and subject
combineddata = pd.merge(combineddata, subject_df, on='subject_id', how='outer', suffixes=(None, '_subject'))

# Merge combined and specimen
combineddata = pd.merge(combineddata, specimen_df, on='specimen_id', how='outer', suffixes=(None, '_specimen'))

# Merge combined and treatment
combineddata = pd.merge(combineddata, treatment_df, on='researchsubject_id', how='outer', suffixes=(None, '_treatment'))

# Merge combined and diagnosis
combineddata = pd.merge(combineddata, diagnosis_df, on='researchsubject_id', how='outer', suffixes=(None, '_diagnosis'))

# See what the merged dataframe looks like:
combineddata.head()

Unnamed: 0,file_id,file_identifier,label,data_category,data_type,file_format,file_associated_project,drs_uri,byte_size,checksum,data_modality,imaging_modality,dbgap_accession_number,imaging_series,specimen_id,researchsubject_id,subject_id,researchsubject_identifier,member_of_research_project,primary_diagnosis_condition,primary_diagnosis_site,subject_id_rs,subject_identifier,species,sex,race,ethnicity,days_to_birth,subject_associated_project,vital_status,days_to_death,cause_of_death,specimen_identifier,specimen_associated_project,days_to_collection,primary_disease_type,anatomical_site,source_material_type,specimen_type,derived_from_specimen,subject_id_specimen,researchsubject_id_specimen,treatment_id,treatment_identifier,treatment_type,treatment_outcome,days_to_treatment_start,days_to_treatment_end,therapeutic_agent,treatment_anatomic_site,treatment_effect,treatment_end_reason,number_of_cycles,subject_id_treatment,diagnosis_id,diagnosis_id_diagnosis,diagnosis_identifier,primary_diagnosis,age_at_diagnosis,morphology,stage,grade,method_of_diagnosis,subject_id_diagnosis
0,7af2c68c-6b5c-11e9-ab5c-005056921935,"[{'system': 'PDC', 'value': '7af2c68c-6b5c-11e...",TCGA_E2-A10A_BH-A18Q_C8-A130_117C_W_BI_2013022...,Peptide Spectral Matches,Open Standard,mzIdentML,CPTAC-TCGA,drs://dg.4DFC:7af2c68c-6b5c-11e9-ab5c-00505692...,11371231.0,d77024459559eb31f4db5b32eae36813,Proteomic,,,,,,TCGA-E2-A10A,,,,,,"[{'system': 'GDC', 'value': 'TCGA-E2-A10A'}, {...",homo sapiens,female,white,not hispanic or latino,-15085.0,"[TCGA-BRCA, tcga_brca, CPTAC-TCGA]",Alive,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,4adca628-6b5e-11e9-9244-005056921935,"[{'system': 'PDC', 'value': '4adca628-6b5e-11e...",TCGA_E2-A10A_BH-A18Q_C8-A130_117C_P_BI_2013022...,Peptide Spectral Matches,Text,tsv,CPTAC-TCGA,drs://dg.4DFC:4adca628-6b5e-11e9-9244-00505692...,3776441.0,919fdbf943adc46bbef970847cf2aa6f,Proteomic,,,,,,TCGA-E2-A10A,,,,,,"[{'system': 'GDC', 'value': 'TCGA-E2-A10A'}, {...",homo sapiens,female,white,not hispanic or latino,-15085.0,"[TCGA-BRCA, tcga_brca, CPTAC-TCGA]",Alive,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,6bf22560-6b5c-11e9-ab5c-005056921935,"[{'system': 'PDC', 'value': '6bf22560-6b5c-11e...",TCGA_E2-A10A_BH-A18Q_C8-A130_117C_W_BI_2013022...,Peptide Spectral Matches,Text,tsv,CPTAC-TCGA,drs://dg.4DFC:6bf22560-6b5c-11e9-ab5c-00505692...,6847471.0,4c713108933f5b554a11d9acb4e340b3,Proteomic,,,,,,TCGA-E2-A10A,,,,,,"[{'system': 'GDC', 'value': 'TCGA-E2-A10A'}, {...",homo sapiens,female,white,not hispanic or latino,-15085.0,"[TCGA-BRCA, tcga_brca, CPTAC-TCGA]",Alive,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,d44ffd83-1b69-4fd3-9b67-51e16228820f,"[{'system': 'GDC', 'value': 'd44ffd83-1b69-4fd...",8a540cec-e283-4a26-868e-a07c64fe3efb.mirbase21...,Transcriptome Profiling,miRNA Expression Quantification,TXT,TCGA-BRCA,drs://dg.4DFC:d44ffd83-1b69-4fd3-9b67-51e16228...,50169.0,ce4f99883fdca5372ea7793b484c1f6e,Genomic,,,,,,TCGA-E2-A10A,,,,,,"[{'system': 'GDC', 'value': 'TCGA-E2-A10A'}, {...",homo sapiens,female,white,not hispanic or latino,-15085.0,"[TCGA-BRCA, tcga_brca, CPTAC-TCGA]",Alive,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4,2bd10176-6abd-11e9-884a-005056921935,"[{'system': 'PDC', 'value': '2bd10176-6abd-11e...",TCGA_E2-A10A_BH-A18Q_C8-A130_117C_W_BI_2013022...,Processed Mass Spectra,Open Standard,mzML,CPTAC-TCGA,drs://dg.4DFC:2bd10176-6abd-11e9-884a-00505692...,179837609.0,ac1f36dd14c3c26598cb08e76746810b,Proteomic,,,,,,TCGA-E2-A10A,,,,,,"[{'system': 'GDC', 'value': 'TCGA-E2-A10A'}, {...",homo sapiens,female,white,not hispanic or latino,-15085.0,"[TCGA-BRCA, tcga_brca, CPTAC-TCGA]",Alive,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
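Columns such as _subject_id_rs_ and _subject_id_specimen_ in the output above come from the `suffixes` argument of `pd.merge`. A small sketch with made-up rows shows the effect: with `suffixes=(None, '_rs')`, an overlapping non-key column keeps its name on the left side and gets the suffix on the right side.

```python
import pandas as pd

# Made-up rows; both frames carry a subject_id column in addition to the merge key
files = pd.DataFrame({
    "file_id": ["f1", "f2"],
    "researchsubject_id": ["rs1", "rs2"],
    "subject_id": ["s1", "s2"],
})
research_subjects = pd.DataFrame({
    "researchsubject_id": ["rs1", "rs2"],
    "subject_id": ["s1", "s2"],
    "primary_diagnosis_site": ["breast", "breast"],
})

merged = pd.merge(
    files,
    research_subjects,
    on="researchsubject_id",
    how="outer",
    suffixes=(None, "_rs"),
)
print(list(merged.columns))
```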


We can also check the size of combined data and compare it to *files_df*. Notice that *combineddata* is much larger in both dimensions.

In [10]:
print('Dimensions of the files dataframe: {}'.format(files_df.shape))
print('Dimensions of the combined dataframe: {}'.format(combineddata.shape))
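One reason the combined dataframe has more rows is that an outer merge repeats a file row once for every matching row on the other side. The sketch below uses made-up data to show one file matched with three treatment rows, and how counting unique _file_id_ values recovers the number of files.

```python
import pandas as pd

# One made-up file linked to a research subject with three treatment records
files = pd.DataFrame({"file_id": ["f1"], "researchsubject_id": ["rs1"]})
treatments = pd.DataFrame({
    "researchsubject_id": ["rs1", "rs1", "rs1"],
    "treatment_type": ["chemotherapy", "radiation", "surgery"],
})

merged = pd.merge(files, treatments, on="researchsubject_id", how="outer")
print("rows:", len(merged))
print("unique files:", merged["file_id"].nunique())
```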

### Import data to CGC

To import data to CGC, you must first provide your CGC authentication token. The token is available under the [**Developer->Authentication token**](https://cgc.sbgenomics.com/developer/token) menu. If you have not used an authentication token before, you will need to generate one first.

In [9]:
# Set your SBG API token

import getpass
token = getpass.getpass()

 ································
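When the notebook is run non-interactively, prompting for the token may not be practical. One possible alternative, sketched below, is to look for the token in an environment variable first and fall back to the prompt only when it is missing. The helper name `read_token` and the variable name `SB_AUTH_TOKEN` are assumptions made for this example; any variable name can be used.

```python
import getpass
import os


def read_token(env_var="SB_AUTH_TOKEN"):
    """Return the token from the environment, or prompt for it if it is not set."""
    value = os.environ.get(env_var)
    if value:
        return value
    # Fall back to an interactive, masked prompt
    return getpass.getpass("CGC authentication token: ")
```

Calling `token = read_token()` would then replace the `getpass.getpass()` call above. Either way, avoid writing the token into the notebook itself.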


Now, CDA files can be imported into a specified project. _process_and_upload_ function handles the bulk import. Files are imported in chunks of 100. Make sure to change the **project** variable to reflect your project. Project should be in format _username/project-name_.
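How _process_and_upload_ splits the files internally is not shown in this notebook. As an illustration of what importing "in chunks of 100" means, a list can be split into batches as in the sketch below. The function name `chunked` and the `drs://example/...` URIs are hypothetical.

```python
def chunked(items, size=100):
    """Yield consecutive slices of items, each at most `size` long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Made-up DRS URIs, used only to show the batch sizes
uris = [f"drs://example/{i}" for i in range(218)]
batches = list(chunked(uris))
print([len(batch) for batch in batches])
```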

Note that a row in the _combineddata_ dataframe does not necessarily correspond to a single file, as a single file might be described by multiple rows. In this example, although 200 rows are used for import, fewer than 100 files will actually be imported. In general, you should import the whole dataframe (by removing _.head(200)_ from the function call). A subset is used here only to make this demonstration faster and cheaper to execute.

In [9]:
project = 'boris_majic/cda-test-project'

# you can tag files with multiple labels
tags = ['all_files', 'tcga_brca']

# Import
from utilities import process_and_upload

import_jobs = process_and_upload(
    df=combineddata.head(200),
    token=token,
    import_project=project,
    tags=tags
)



### Additional dataframe filtering

Because the result of merging dataframes is a _pandas.DataFrame_ object, it can be manipulated further before the files are imported. Let's say that, out of all the prepared files, we wish to import only files of a specific format. In this case, let's import all VCF files:

In [12]:
# Column names in the merged dataframe are lowercase (file_format, file_id)
vcf_df = combineddata[combineddata.file_format == 'VCF']

print('There are {} VCF files'.format(vcf_df.file_id.nunique()))
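The same kind of filter can be made a little more tolerant. The sketch below, on made-up rows, normalises the case of _file_format_ and accepts a set of formats, in case the source data spells a format inconsistently. Whether CDA data actually contains mixed-case formats is not something this notebook confirms.

```python
import pandas as pd

# Made-up rows; f1 appears twice to mimic a file described by several rows
example = pd.DataFrame({
    "file_id": ["f1", "f1", "f2", "f3"],
    "file_format": ["VCF", "VCF", "TXT", "vcf"],
})

# Normalise case before comparing, then keep rows whose format is in the wanted set
wanted = {"VCF"}
mask = example["file_format"].str.upper().isin(wanted)
selected = example[mask]
print(selected["file_id"].nunique())
```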

Because the filtered result is still a dataframe, it can be passed to the _process_and_upload()_ function.

In [13]:
# Import
from utilities import process_and_upload

project = 'boris_majic/cda-test-project'

# you can tag files with multiple labels
tags = ['vcf files', 'tcga_brca']

process_and_upload(
    df=vcf_df,
    token=token,
    import_project=project,
    tags=tags
)


[<DRSBulkImport: id=190762152705986560>]

### Example of a more complex query

CDA python supports creating a more complex query by using operators within the query language. More information on available operators and creating a more complex query can be found in the [Operators notebook](https://github.com/CancerDataAggregator/readthedocs/blob/main/docs/Examples/Operators.ipynb).

Briefly, CDA python supports the following operators:
* =
* !=
* AND
* OR
* IN and NOT IN
* % pattern matching
* IS and IS NOT
* comparison operators \>, <, >=, <=

Here a query is built to obtain all data related to research subjects whose primary diagnosis site is uterus and who are younger than 40. Note that CDA expresses age through days_to_birth, a negative number of days, so "younger than 40" translates to days_to_birth >= 40*-365:

In [10]:
q = Q('primary_diagnosis_site = "uterus" AND days_to_birth >= 40*-365')
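The arithmetic behind the age condition can be spelled out in a small helper. This is only a sketch: the function names are made up, a year is approximated as 365 days as in the query above, and it is assumed that the query language accepts a plain negative number in the same way it accepts the expression `40*-365`.

```python
def max_age_to_days_to_birth(years, days_per_year=365):
    """Convert an age limit in years to the matching negative days_to_birth bound."""
    return -years * days_per_year


def younger_than_clause(years):
    """Build a query fragment selecting subjects younger than the given age."""
    return f"days_to_birth >= {max_age_to_days_to_birth(years)}"


print(younger_than_clause(40))
```

Because days_to_birth is negative, a younger subject has a value closer to zero, which is why the comparison uses `>=` rather than `<=`.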

Before querying all endpoints, let's check the results of the files endpoint:

In [11]:
files_of_interest = q.file.run()
files_of_interest


            
            Offset: 0
            Count: 100
            Total Row Count: 5497
            More pages: True
            

Note that there are almost 5500 files! In rare cases you might actually want to import that many files, but generally this is a sign that the query is too broad, and it makes sense to add additional conditions or make the query stricter. Let's try again, this time querying only for patients who are younger than 34:

In [12]:
q = Q('primary_diagnosis_site = "uterus" AND days_to_birth >= 34*-365')
files_of_interest = q.file.run()
files_of_interest


            
            Offset: 0
            Count: 100
            Total Row Count: 218
            More pages: True
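The result summaries report 100 rows per page together with the total row count, so the number of pages that have to be fetched can be estimated up front. A minimal sketch, with `pages_needed` as an illustrative name:

```python
import math


def pages_needed(total_rows, page_size=100):
    """Number of pages required to return total_rows rows."""
    return math.ceil(total_rows / page_size)


# Total row counts taken from the two query summaries in this notebook
print(pages_needed(5497), pages_needed(218))
```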
            

Now we have 218 files to work with, which is a reasonable number for this demonstration.
Let's convert the result to a dataframe and proceed to query the other endpoints.

In [13]:
# Paginate the files query result:
files_df = iter_pages(files_of_interest)

# Query other endpoints:
diagnosis = q.diagnosis.run()
diagnosis_df = iter_pages(diagnosis)
treatment = q.treatment.run()
treatment_df = iter_pages(treatment)
research_subject_of_interest = q.researchsubject.run()
rs_df = iter_pages(research_subject_of_interest)
subject_of_interest = q.subject.run()
subject_df = iter_pages(subject_of_interest)
specimen_of_interest = q.specimen.run()
specimen_df = iter_pages(specimen_of_interest)

It is important to note that some queries might return no results. In our query, the treatment, diagnosis and specimen endpoints returned **0 results**. This highlights the importance of checking the results before merging dataframes and importing data:

In [14]:
diagnosis.count

0
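Rather than inspecting each result one by one, all counts can be checked in a single pass. The sketch below uses hypothetical stand-in objects that only model the `count` attribute. The non-zero counts are placeholders, not values returned by CDA.

```python
from types import SimpleNamespace

# Hypothetical stand-ins for query results; only the count attribute is modelled
results = {
    "researchsubject": SimpleNamespace(count=12),
    "subject": SimpleNamespace(count=12),
    "specimen": SimpleNamespace(count=0),
    "treatment": SimpleNamespace(count=0),
    "diagnosis": SimpleNamespace(count=0),
}

empty = [name for name, result in results.items() if result.count == 0]
print("Endpoints with no results:", empty)
```

In the notebook itself, the dictionary values would be the objects returned by the `run()` calls above.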

To avoid issues when merging dataframes, each result is checked before the merge is performed:

In [15]:
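# Start from the files dataframe so that combineddata is always defined,
# even if the research subject endpoint returned no results
combineddata = files_df

if research_subject_of_interest.count:
    combineddata = pd.merge(combineddata, rs_df, on='researchsubject_id', how='outer', suffixes=(None, '_rs'))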

if subject_of_interest.count:
    combineddata = pd.merge(combineddata, subject_df, on='subject_id', how='outer', suffixes=(None, '_subject'))

if specimen_of_interest.count:
    combineddata = pd.merge(combineddata, specimen_df, on='specimen_id', how='outer', suffixes=(None, '_specimen'))

if treatment.count:
    combineddata = pd.merge(combineddata, treatment_df, on='researchsubject_id', how='outer', suffixes=(None, '_treatment'))

if diagnosis.count:
    combineddata = pd.merge(combineddata, diagnosis_df, on='researchsubject_id', how='outer', suffixes=(None, '_diagnosis'))

# See what the merged dataframe looks like:
combineddata.head(3)

Unnamed: 0,file_id,file_identifier,label,data_category,data_type,file_format,file_associated_project,drs_uri,byte_size,checksum,data_modality,imaging_modality,dbgap_accession_number,imaging_series,specimen_id,researchsubject_id,subject_id,researchsubject_identifier,member_of_research_project,primary_diagnosis_condition,primary_diagnosis_site,subject_id_rs,subject_identifier,species,sex,race,ethnicity,days_to_birth,subject_associated_project,vital_status,days_to_death,cause_of_death
0,60e7c43b-0fcb-44ea-9790-cba420d15ee7,"[{'system': 'GDC', 'value': '60e7c43b-0fcb-44e...",nationwidechildrens.org_clinical_follow_up_v1....,Clinical,Clinical Supplement,BCR Biotab,TCGA-UCEC,drs://dg.4DFC:60e7c43b-0fcb-44ea-9790-cba420d1...,48222.0,94549a7df14021988e0b8471777ce7db,Genomic,,,,,,TCGA-A5-A1OJ,,,,,,"[{'system': 'GDC', 'value': 'TCGA-A5-A1OJ'}, {...",homo sapiens,female,asian,not hispanic or latino,-11635,"[tcga_ucec, TCGA-UCEC]",Alive,,
1,125414a1-6f59-4477-8aa0-2be672b037db,"[{'system': 'GDC', 'value': '125414a1-6f59-447...",nationwidechildrens.org_biospecimen.TCGA-A5-A1...,Biospecimen,Biospecimen Supplement,BCR XML,TCGA-UCEC,drs://dg.4DFC:125414a1-6f59-4477-8aa0-2be672b0...,61374.0,a505b1568d8432f7c018e2ade0ae72e6,Genomic,,,,,,TCGA-A5-A1OJ,,,,,,"[{'system': 'GDC', 'value': 'TCGA-A5-A1OJ'}, {...",homo sapiens,female,asian,not hispanic or latino,-11635,"[tcga_ucec, TCGA-UCEC]",Alive,,
2,a17f99b3-a3d0-4e6d-8b1c-a2e50ba5cda3,"[{'system': 'GDC', 'value': 'a17f99b3-a3d0-4e6...",TCGA_UCEC.0f68b6df-c5da-4398-a0bf-8c9b05a756f3...,Simple Nucleotide Variation,Annotated Somatic Mutation,VCF,TCGA-UCEC,drs://dg.4DFC:a17f99b3-a3d0-4e6d-8b1c-a2e50ba5...,58710.0,27f1afb7c0c81efc9d6a402f3255a285,Genomic,,,,,,TCGA-A5-A1OJ,,,,,,"[{'system': 'GDC', 'value': 'TCGA-A5-A1OJ'}, {...",homo sapiens,female,asian,not hispanic or latino,-11635,"[tcga_ucec, TCGA-UCEC]",Alive,,
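The repeated `if ... count` checks can also be folded into a small helper that skips empty dataframes. This is a sketch under the assumption that an endpoint with no results yields an empty dataframe. `merge_non_empty` is an illustrative name and the rows are made up.

```python
import pandas as pd


def merge_non_empty(base, others):
    """Outer-merge each non-empty (frame, key, suffix) triple onto base, in order."""
    combined = base
    for frame, key, suffix in others:
        if frame.empty:
            continue  # nothing to add from this endpoint
        combined = pd.merge(
            combined, frame, on=key, how="outer", suffixes=(None, suffix)
        )
    return combined


files = pd.DataFrame({"file_id": ["f1"], "subject_id": ["s1"]})
subjects = pd.DataFrame({"subject_id": ["s1"], "sex": ["female"]})
no_diagnoses = pd.DataFrame()  # stands in for an endpoint with 0 results

combined = merge_non_empty(
    files,
    [
        (subjects, "subject_id", "_subject"),
        (no_diagnoses, "subject_id", "_diagnosis"),
    ],
)
print(list(combined.columns))
```

Starting from the files dataframe means the result is always defined, even when every other endpoint comes back empty.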


Finally, _combineddata_ dataframe can be used to import the files to the CGC:

In [16]:
# Import
from utilities import process_and_upload

project = 'boris_majic/cda-test-project'

# you can tag files with multiple labels
tags = ['uterus', 'young']

process_and_upload(
    df=combineddata,
    token=token,
    import_project=project,
    tags=tags
)

[<DRSBulkImport: id=190762812053721088>,
 <DRSBulkImport: id=190762814752755712>]