# CDA Release 3 Query and Import Notebook

This notebook demonstrates how to use the CDA Python API and how to import the resulting files into a project on CGC. It relies on Python functions from the *utilities.py* file, which were developed specifically for this demo. Feel free to use the same functions in your own projects. For questions, suggestions, and bug reports, contact Seven Bridges support.

This notebook consists of several parts. First, the necessary libraries are installed. Installing libraries requires that the project in which the notebook is executed has internet access. To check whether your project has internet access, go to project settings (top right of the project page, or directly using https://cgc.sbgenomics.com/u/username/project-name/settings) -> *Execution settings* and select *Allow internet access*.

In the second part, a connection to the CDA is established and a query is run on six different endpoints.

Finally, the dataframes obtained from the endpoints are merged and the files are imported using the Seven Bridges API.

The goal of this notebook is not data exploration. There are many [online examples](https://cda.readthedocs.io/en/latest/Examples/SearchTerms/) that show how to explore CDA data. This notebook focuses on importing CDA files into your CGC project. The full functionality of CDA Python is described in the [official documentation](https://github.com/CancerDataAggregator/readthedocs).

In [2]:
! pip install -r requirements.txt -qq

### Importing cda python

CDA Python comes with a tool, **Q**, which handles the interaction with the database. Additionally, the structure of the data can be inspected using the _columns_ function.


In [3]:
from cdapython import Q, columns
import cdapython

print(cdapython.__file__)
print(cdapython.__version__)

/opt/conda/lib/python3.9/site-packages/cdapython/__init__.py
2022.6.28


### Writing a query

Let's get all files associated with the identifier value TCGA-E2-A10A:

In [4]:
q = Q('identifier.value = "TCGA-E2-A10A"')

## Querying CDA

CDA consists of multiple endpoints, each of which gives access to a certain subset of the data. Since the goal of this notebook is to get files, the first endpoint to query is the files endpoint.

To transform query results into dataframes, the _iter_pages()_ function is defined in _utilities.py_. This function iterates over the pages of a response and collects all data into a single dataframe. There are other approaches to transforming query results into a dataframe; to explore them, see the [Pagination notebook](https://github.com/CancerDataAggregator/readthedocs/blob/main/docs/Examples/Pagination.ipynb).
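The idea behind a helper such as _iter_pages()_ can be sketched as follows. This is not the code from _utilities.py_: the `FakePage` class and its `to_dataframe()`, `has_next_page` and `next_page()` members are hypothetical stand-ins for a paged query result, used only so that the example runs without a CDA connection.

```python
import pandas as pd


class FakePage:
    """Hypothetical stand-in for one page of a paged query result."""

    def __init__(self, pages, index=0):
        self._pages = pages
        self._index = index

    @property
    def has_next_page(self):
        return self._index + 1 < len(self._pages)

    def next_page(self):
        return FakePage(self._pages, self._index + 1)

    def to_dataframe(self):
        return pd.DataFrame(self._pages[self._index])


def collect_pages(first_page):
    """Walk every page and stack the rows into one dataframe."""
    frames = [first_page.to_dataframe()]
    page = first_page
    while page.has_next_page:
        page = page.next_page()
        frames.append(page.to_dataframe())
    return pd.concat(frames, ignore_index=True)


# Two toy pages: the first holds two rows, the second holds one.
pages = [
    [{"id": "a"}, {"id": "b"}],
    [{"id": "c"}],
]
all_rows = collect_pages(FakePage(pages))
print(all_rows.shape)
```

Collecting the per-page dataframes in a list and concatenating once at the end avoids repeatedly copying a growing dataframe.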

Depending on the query, the following cell can take a couple of minutes to execute.

### Querying files endpoint

In [5]:
from utilities import iter_pages
files_of_interest = q.file.run()

files_df = iter_pages(files_of_interest)

Total execution time: 4488 ms


Let's check the format of the obtained dataframe:

In [6]:
files_df.head(3)

Unnamed: 0,id,identifier,label,data_category,data_type,file_format,associated_project,drs_uri,byte_size,checksum,data_modality,imaging_modality,dbgap_accession_number,imaging_series,researchsubject_specimen_id,researchsubject_id,subject_id
0,77a57466-7b65-46bd-817a-ca96b8342273,"[{'system': 'GDC', 'value': '77a57466-7b65-46b...",41bd1fd2-89a4-49b2-bcbd-508ef23b471a.wxs.mutec...,Simple Nucleotide Variation,Raw Simple Somatic Mutation,VCF,TCGA-BRCA,drs://dg.4DFC:77a57466-7b65-46bd-817a-ca96b834...,176010.0,5e4497b6e9451fb8eef7032b0f522a6e,Genomic,,,,0c8615d8-e7a2-465a-8de3-961734941c16,4da7abaf-ac7a-41c0-8033-5780a398545c,TCGA-E2-A10A
1,82f51d21-b272-4aaa-96bb-27bfeaea6c7a,"[{'system': 'GDC', 'value': '82f51d21-b272-4aa...",d32c4662-2b0b-4f78-b031-e6ff5286664c.rna_seq.t...,Sequencing Reads,Aligned Reads,BAM,TCGA-BRCA,drs://dg.4DFC:82f51d21-b272-4aaa-96bb-27bfeaea...,16137100000.0,c0aebaf40110c7af46c610e6f7bc993a,Genomic,,,,0c8615d8-e7a2-465a-8de3-961734941c16,4da7abaf-ac7a-41c0-8033-5780a398545c,TCGA-E2-A10A
2,9d44b791-95e5-42d0-8073-203980b2d23a,"[{'system': 'GDC', 'value': '9d44b791-95e5-42d...",92069840-1bfb-4d89-9907-e3a352a5b339.wxs.Pinde...,Simple Nucleotide Variation,Annotated Somatic Mutation,MAF,TCGA-BRCA,drs://dg.4DFC:9d44b791-95e5-42d0-8073-203980b2...,27410.0,5fd8154ab0d2d3628730668f67bec894,Genomic,,,,79389df7-59f1-427f-80d4-3a959407407e,4da7abaf-ac7a-41c0-8033-5780a398545c,TCGA-E2-A10A


The dataframe contains all file-related information as well as the ids of the other endpoints it connects to. These ids will later be useful for merging dataframes from multiple endpoints.

### Querying other endpoints

When querying other endpoints, the **id** column is renamed to reflect which id it refers to. This makes it possible to merge dataframes on the appropriate **id** columns. Depending on the query, the following cell can take a couple of minutes to execute.

In [7]:
# Diagnosis
diagnosis = q.diagnosis.run()
diagnosis_df = iter_pages(diagnosis).rename(columns={'id': 'diagnosis_id'})

# Treatment
treatment = q.treatment.run()
treatment_df = iter_pages(treatment).rename(columns={'id': 'treatment_id'})

# Research subject
research_subject_of_interest = q.researchsubject.run()
rs_df = iter_pages(research_subject_of_interest).rename(columns={'id': 'researchsubject_id'})

# Subject
subject_of_interest = q.subject.run()
subject_df = iter_pages(subject_of_interest).rename(columns={'id': 'subject_id'})

# Specimen
specimen_of_interest = q.specimen.run()
specimen_df = iter_pages(specimen_of_interest).rename(columns={'id': 'researchsubject_specimen_id'})

Total execution time: 3399 ms


Total execution time: 4132 ms


Total execution time: 3318 ms


Total execution time: 3244 ms


Total execution time: 3431 ms


Once obtained, the dataframes can be merged into a single dataframe. The goal of querying and merging multiple dataframes is to be able to infer all metadata fields for the files that are to be imported. In general, it is strongly suggested to explore the results and adjust the query before proceeding to merging dataframes and importing data to CGC.
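One way to explore the results before merging is to tabulate the size of each dataframe and how many rows carry a value in the join keys. The sketch below uses toy dataframes; the `summarize` helper is hypothetical and not part of _utilities.py_.

```python
import pandas as pd


def summarize(frames, key_columns):
    """Return one row per dataframe: its size and how many key values are filled."""
    rows = []
    for name, df in frames.items():
        row = {"name": name, "rows": len(df), "columns": df.shape[1]}
        for key in key_columns:
            # Count non-null values only when the key column exists.
            row[key + "_filled"] = int(df[key].notna().sum()) if key in df else 0
        rows.append(row)
    return pd.DataFrame(rows)


# Toy data standing in for real query results.
toy_files = pd.DataFrame({"id": ["f1", "f2"], "subject_id": ["s1", None]})
toy_subjects = pd.DataFrame({"subject_id": ["s1"]})

summary = summarize({"files": toy_files, "subject": toy_subjects}, ["subject_id"])
print(summary)
```

With the real data, the same call could be made with the dataframes from the cells above and keys such as `researchsubject_id` and `subject_id`.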

For dataframe manipulation, pandas must be imported:

In [8]:
import pandas as pd

# Merge files and research_subject
combineddata = pd.merge(files_df, rs_df, on='researchsubject_id', how='outer', suffixes=(None, '_rs'))

# Merge combined and subject
combineddata = pd.merge(combineddata, subject_df, on='subject_id', how='outer', suffixes=(None, '_subject'))

# Merge combined and specimen
combineddata = pd.merge(combineddata, specimen_df, on='researchsubject_specimen_id', how='outer', suffixes=(None, '_specimen'))

# Merge combined and treatment
combineddata = pd.merge(combineddata, treatment_df, on='researchsubject_id', how='outer', suffixes=(None, '_treatment'))

# Merge combined and diagnosis
combineddata = pd.merge(combineddata, diagnosis_df, on='researchsubject_id', how='outer', suffixes=(None, '_diagnosis'))

# See what the merged dataframe looks like:
combineddata.head()

Unnamed: 0,id,identifier,label,data_category,data_type,file_format,associated_project,drs_uri,byte_size,checksum,data_modality,imaging_modality,dbgap_accession_number,imaging_series,researchsubject_specimen_id,researchsubject_id,subject_id,identifier_rs,member_of_research_project,primary_diagnosis_condition,primary_diagnosis_site,subject_id_rs,identifier_subject,species,sex,race,ethnicity,days_to_birth,subject_associated_project,vital_status,days_to_death,cause_of_death,identifier_specimen,associated_project_specimen,days_to_collection,primary_disease_type,anatomical_site,source_material_type,specimen_type,derived_from_specimen,subject_id_specimen,researchsubject_id_specimen,treatment_id,identifier_treatment,treatment_type,treatment_outcome,days_to_treatment_start,days_to_treatment_end,therapeutic_agent,treatment_anatomic_site,treatment_effect,treatment_end_reason,number_of_cycles,subject_id_treatment,researchsubject_diagnosis_id,diagnosis_id,identifier_diagnosis,primary_diagnosis,age_at_diagnosis,morphology,stage,grade,method_of_diagnosis,subject_id_diagnosis
0,77a57466-7b65-46bd-817a-ca96b8342273,"[{'system': 'GDC', 'value': '77a57466-7b65-46b...",41bd1fd2-89a4-49b2-bcbd-508ef23b471a.wxs.mutec...,Simple Nucleotide Variation,Raw Simple Somatic Mutation,VCF,TCGA-BRCA,drs://dg.4DFC:77a57466-7b65-46bd-817a-ca96b834...,176010.0,5e4497b6e9451fb8eef7032b0f522a6e,Genomic,,,,0c8615d8-e7a2-465a-8de3-961734941c16,4da7abaf-ac7a-41c0-8033-5780a398545c,TCGA-E2-A10A,"[{'system': 'GDC', 'value': '4da7abaf-ac7a-41c...",TCGA-BRCA,Ductal and Lobular Neoplasms,Breast,TCGA-E2-A10A,"[{'system': 'GDC', 'value': 'TCGA-E2-A10A'}, {...",homo sapiens,female,white,not hispanic or latino,-15085.0,"[TCGA-BRCA, tcga_brca, CPTAC-TCGA]",Alive,,,"[{'system': 'GDC', 'value': '0c8615d8-e7a2-465...",TCGA-BRCA,1096.0,Ductal and Lobular Neoplasms,,Primary Tumor,sample,initial specimen,TCGA-E2-A10A,4da7abaf-ac7a-41c0-8033-5780a398545c,de356671-f18e-5152-83cb-197f72b897d3,"[{'system': 'GDC', 'value': 'de356671-f18e-515...","Pharmaceutical Therapy, NOS",,,,,,,,,TCGA-E2-A10A,a84accf0-2294-550d-9825-22625f09f989,a84accf0-2294-550d-9825-22625f09f989,"[{'system': 'GDC', 'value': 'a84accf0-2294-550...","Infiltrating duct carcinoma, NOS",15085.0,8500/3,,not reported,,TCGA-E2-A10A
1,77a57466-7b65-46bd-817a-ca96b8342273,"[{'system': 'GDC', 'value': '77a57466-7b65-46b...",41bd1fd2-89a4-49b2-bcbd-508ef23b471a.wxs.mutec...,Simple Nucleotide Variation,Raw Simple Somatic Mutation,VCF,TCGA-BRCA,drs://dg.4DFC:77a57466-7b65-46bd-817a-ca96b834...,176010.0,5e4497b6e9451fb8eef7032b0f522a6e,Genomic,,,,0c8615d8-e7a2-465a-8de3-961734941c16,4da7abaf-ac7a-41c0-8033-5780a398545c,TCGA-E2-A10A,"[{'system': 'GDC', 'value': '4da7abaf-ac7a-41c...",TCGA-BRCA,Ductal and Lobular Neoplasms,Breast,TCGA-E2-A10A,"[{'system': 'GDC', 'value': 'TCGA-E2-A10A'}, {...",homo sapiens,female,white,not hispanic or latino,-15085.0,"[TCGA-BRCA, tcga_brca, CPTAC-TCGA]",Alive,,,"[{'system': 'GDC', 'value': '0c8615d8-e7a2-465...",TCGA-BRCA,1096.0,Ductal and Lobular Neoplasms,,Primary Tumor,sample,initial specimen,TCGA-E2-A10A,4da7abaf-ac7a-41c0-8033-5780a398545c,c1ec925a-387c-5a0c-9d50-655aa789c652,"[{'system': 'GDC', 'value': 'c1ec925a-387c-5a0...","Radiation Therapy, NOS",,,,,,,,,TCGA-E2-A10A,a84accf0-2294-550d-9825-22625f09f989,a84accf0-2294-550d-9825-22625f09f989,"[{'system': 'GDC', 'value': 'a84accf0-2294-550...","Infiltrating duct carcinoma, NOS",15085.0,8500/3,,not reported,,TCGA-E2-A10A
2,82f51d21-b272-4aaa-96bb-27bfeaea6c7a,"[{'system': 'GDC', 'value': '82f51d21-b272-4aa...",d32c4662-2b0b-4f78-b031-e6ff5286664c.rna_seq.t...,Sequencing Reads,Aligned Reads,BAM,TCGA-BRCA,drs://dg.4DFC:82f51d21-b272-4aaa-96bb-27bfeaea...,16137100000.0,c0aebaf40110c7af46c610e6f7bc993a,Genomic,,,,0c8615d8-e7a2-465a-8de3-961734941c16,4da7abaf-ac7a-41c0-8033-5780a398545c,TCGA-E2-A10A,"[{'system': 'GDC', 'value': '4da7abaf-ac7a-41c...",TCGA-BRCA,Ductal and Lobular Neoplasms,Breast,TCGA-E2-A10A,"[{'system': 'GDC', 'value': 'TCGA-E2-A10A'}, {...",homo sapiens,female,white,not hispanic or latino,-15085.0,"[TCGA-BRCA, tcga_brca, CPTAC-TCGA]",Alive,,,"[{'system': 'GDC', 'value': '0c8615d8-e7a2-465...",TCGA-BRCA,1096.0,Ductal and Lobular Neoplasms,,Primary Tumor,sample,initial specimen,TCGA-E2-A10A,4da7abaf-ac7a-41c0-8033-5780a398545c,de356671-f18e-5152-83cb-197f72b897d3,"[{'system': 'GDC', 'value': 'de356671-f18e-515...","Pharmaceutical Therapy, NOS",,,,,,,,,TCGA-E2-A10A,a84accf0-2294-550d-9825-22625f09f989,a84accf0-2294-550d-9825-22625f09f989,"[{'system': 'GDC', 'value': 'a84accf0-2294-550...","Infiltrating duct carcinoma, NOS",15085.0,8500/3,,not reported,,TCGA-E2-A10A
3,82f51d21-b272-4aaa-96bb-27bfeaea6c7a,"[{'system': 'GDC', 'value': '82f51d21-b272-4aa...",d32c4662-2b0b-4f78-b031-e6ff5286664c.rna_seq.t...,Sequencing Reads,Aligned Reads,BAM,TCGA-BRCA,drs://dg.4DFC:82f51d21-b272-4aaa-96bb-27bfeaea...,16137100000.0,c0aebaf40110c7af46c610e6f7bc993a,Genomic,,,,0c8615d8-e7a2-465a-8de3-961734941c16,4da7abaf-ac7a-41c0-8033-5780a398545c,TCGA-E2-A10A,"[{'system': 'GDC', 'value': '4da7abaf-ac7a-41c...",TCGA-BRCA,Ductal and Lobular Neoplasms,Breast,TCGA-E2-A10A,"[{'system': 'GDC', 'value': 'TCGA-E2-A10A'}, {...",homo sapiens,female,white,not hispanic or latino,-15085.0,"[TCGA-BRCA, tcga_brca, CPTAC-TCGA]",Alive,,,"[{'system': 'GDC', 'value': '0c8615d8-e7a2-465...",TCGA-BRCA,1096.0,Ductal and Lobular Neoplasms,,Primary Tumor,sample,initial specimen,TCGA-E2-A10A,4da7abaf-ac7a-41c0-8033-5780a398545c,c1ec925a-387c-5a0c-9d50-655aa789c652,"[{'system': 'GDC', 'value': 'c1ec925a-387c-5a0...","Radiation Therapy, NOS",,,,,,,,,TCGA-E2-A10A,a84accf0-2294-550d-9825-22625f09f989,a84accf0-2294-550d-9825-22625f09f989,"[{'system': 'GDC', 'value': 'a84accf0-2294-550...","Infiltrating duct carcinoma, NOS",15085.0,8500/3,,not reported,,TCGA-E2-A10A
4,ef49581e-e0e5-4eae-ae46-75573f643426,"[{'system': 'GDC', 'value': 'ef49581e-e0e5-4ea...",d32c4662-2b0b-4f78-b031-e6ff5286664c.rna_seq.s...,Transcriptome Profiling,Splice Junction Quantification,TSV,TCGA-BRCA,drs://dg.4DFC:ef49581e-e0e5-4eae-ae46-75573f64...,2653872.0,e0cf2be15fa62968721ead15104cbaa9,Genomic,,,,0c8615d8-e7a2-465a-8de3-961734941c16,4da7abaf-ac7a-41c0-8033-5780a398545c,TCGA-E2-A10A,"[{'system': 'GDC', 'value': '4da7abaf-ac7a-41c...",TCGA-BRCA,Ductal and Lobular Neoplasms,Breast,TCGA-E2-A10A,"[{'system': 'GDC', 'value': 'TCGA-E2-A10A'}, {...",homo sapiens,female,white,not hispanic or latino,-15085.0,"[TCGA-BRCA, tcga_brca, CPTAC-TCGA]",Alive,,,"[{'system': 'GDC', 'value': '0c8615d8-e7a2-465...",TCGA-BRCA,1096.0,Ductal and Lobular Neoplasms,,Primary Tumor,sample,initial specimen,TCGA-E2-A10A,4da7abaf-ac7a-41c0-8033-5780a398545c,de356671-f18e-5152-83cb-197f72b897d3,"[{'system': 'GDC', 'value': 'de356671-f18e-515...","Pharmaceutical Therapy, NOS",,,,,,,,,TCGA-E2-A10A,a84accf0-2294-550d-9825-22625f09f989,a84accf0-2294-550d-9825-22625f09f989,"[{'system': 'GDC', 'value': 'a84accf0-2294-550...","Infiltrating duct carcinoma, NOS",15085.0,8500/3,,not reported,,TCGA-E2-A10A
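Column names such as _identifier_rs_ and _subject_id_rs_ in the output above come from the `suffixes` argument of `pd.merge`. A minimal sketch with toy data shows the effect: with `suffixes=(None, '_rs')`, an overlapping column keeps its name on the left side and gets the suffix on the right side.

```python
import pandas as pd

left = pd.DataFrame({"researchsubject_id": ["rs1"], "identifier": ["file-identifier"]})
right = pd.DataFrame({"researchsubject_id": ["rs1"], "identifier": ["rs-identifier"]})

# Both frames have an "identifier" column; only the right one is renamed.
merged = pd.merge(left, right, on="researchsubject_id", how="outer", suffixes=(None, "_rs"))
print(list(merged.columns))
```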


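We can also check the size of the combined data and compare it to *files_df*. Notice that *combineddata* is larger in both dimensions: every endpoint adds its own columns, and a file is repeated once for each matching record in another endpoint (for example, rows 0 and 1 above describe the same file with two different treatments).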

In [9]:
print('Dimensions of the files dataframe: {}'.format(files_df.shape))
print('Dimensions of the combined dataframe: {}'.format(combineddata.shape))

Dimensions of the files dataframe: (860, 17)
Dimensions of the combined dataframe: (1213, 64)
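The growth in row count can be reproduced with toy data. In this sketch, two files share one research subject that has two treatment records, so the merge yields one row per file and treatment pair while the number of distinct files stays the same.

```python
import pandas as pd

files = pd.DataFrame({"id": ["f1", "f2"], "researchsubject_id": ["rs1", "rs1"]})
treatments = pd.DataFrame({
    "researchsubject_id": ["rs1", "rs1"],
    "treatment_type": ["Pharmaceutical Therapy, NOS", "Radiation Therapy, NOS"],
})

merged = pd.merge(files, treatments, on="researchsubject_id", how="outer")
print(merged.shape)            # rows: one per (file, treatment) pair
print(merged["id"].nunique())  # distinct files are unchanged
```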


### Import data to CGC

To import data to CGC, you must first provide your CGC authentication token. The token is available under the [**Developer->Authentication token**](https://cgc.sbgenomics.com/developer/token) menu. If you have not used an authentication token before, you will need to generate one first.

In [11]:
# Set your SBG API token

import getpass
token = getpass.getpass()

 ································
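For unattended runs, a token can also be read from an environment variable, with the interactive prompt as a fallback. The sketch below is one possible approach; the `read_token` helper and the choice of `SB_AUTH_TOKEN` as the variable name are assumptions of this example and are not used elsewhere in the notebook.

```python
import getpass
import os


def read_token(env_var="SB_AUTH_TOKEN"):
    """Return the token from the environment, or prompt for it if unset."""
    token = os.environ.get(env_var)
    if token:
        return token
    # Fall back to the same hidden prompt used in the cell above.
    return getpass.getpass("Authentication token: ")
```

Calling `token = read_token()` would then replace the `getpass.getpass()` call above. Either way, avoid writing the token into the notebook itself.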


Now, CDA files can be imported into a specified project. The _process_and_upload_ function handles the bulk import. Files are imported in chunks of 100. Make sure to change the **project** variable to reflect your project. The project should be in the format _username/project-name_.

Note that a row in the _combineddata_ dataframe does not necessarily equate to a single file, as a single file might be described by multiple rows. In this example, although 200 rows are used for import, fewer than 100 files will actually be imported. In general, you should import the whole dataframe (by removing _.head(200)_ from the function call). A subset is used here only to make this demonstration faster and cheaper to execute.
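The relationship between rows, files and chunks can be illustrated with toy data. The `unique_file_chunks` helper below is hypothetical: it only shows the idea of deduplicating file ids and splitting them into fixed-size chunks, and is not the implementation of _process_and_upload_.

```python
import pandas as pd


def unique_file_chunks(df, id_column="id", chunk_size=100):
    """Split the distinct file ids of a dataframe into fixed-size chunks."""
    file_ids = df[id_column].drop_duplicates().tolist()
    return [file_ids[i:i + chunk_size] for i in range(0, len(file_ids), chunk_size)]


# Five rows that describe only three distinct files.
toy = pd.DataFrame({"id": ["f1", "f1", "f2", "f3", "f3"]})
chunks = unique_file_chunks(toy, chunk_size=2)
print(len(toy), "rows describe", toy["id"].nunique(), "files in", len(chunks), "chunks")
```

To check how many files a subset actually covers before importing, `combineddata.head(200)['id'].nunique()` gives the number of distinct file ids.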

In [10]:
project = 'boris_majic/cda-data-import'

# you can tag files with multiple labels
tags = ['all_files', 'tcga_brca']

# Import
from utilities import process_and_upload

import_jobs = process_and_upload(
    df=combineddata.head(200),
    token=token,
    import_project=project,
    tags=tags
)


Submitting files in chunks of 100.

Importing chunk 1/1

SUBMITTED: 0
RUNNING: 0
FINISHED: 0/1

SUBMITTED: 0
RUNNING: 0
FINISHED: 1/1

Import completed!


### Additional dataframe filtering

Because the result of merging dataframes is a _pandas.DataFrame_ object, it can be manipulated further before the files are imported. Let's say that, out of all the prepared files, we wish to import only files of a specific format. In this case, let's import all VCF files:

In [11]:
vcf_df = combineddata[combineddata.file_format == 'VCF']

print('There are {} VCF files'.format(vcf_df.id.nunique()))

There are 10 VCF files
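The same pattern extends to several conditions at once. The sketch below uses toy data with the _file_format_ and _byte_size_ columns seen earlier; the list of formats and the size limit are arbitrary values chosen for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "id": ["f1", "f2", "f3", "f4"],
    "file_format": ["VCF", "MAF", "BAM", "VCF"],
    "byte_size": [176010.0, 27410.0, 16137100000.0, 573829.0],
})

wanted_formats = ["VCF", "MAF"]
max_bytes = 1_000_000  # arbitrary limit chosen for illustration

# Combine boolean masks with & and wrap each comparison in parentheses.
mask = toy["file_format"].isin(wanted_formats) & (toy["byte_size"] <= max_bytes)
small_variants = toy[mask]
print(small_variants["id"].nunique())
```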


Because the filtered result is still a dataframe, it can be passed to the _process_and_upload()_ function.

In [12]:
# Import
from utilities import process_and_upload

project = 'boris_majic/cda-data-import'

# you can tag files with multiple labels
tags = ['vcf files', 'tcga_brca']

process_and_upload(
    df=vcf_df,
    token=token,
    import_project=project,
    tags=tags
)


Submitting files in chunks of 100.

Importing chunk 1/1

SUBMITTED: 0
RUNNING: 0
FINISHED: 0/1

SUBMITTED: 0
RUNNING: 0
FINISHED: 1/1

Import completed!


[<DRSBulkImport: id=183047688648986624>]

### Example of a more complex query

CDA python supports creating a more complex query by using operators within the query language. More information on available operators and creating a more complex query can be found in the [Operators notebook](https://github.com/CancerDataAggregator/readthedocs/blob/main/docs/Examples/Operators.ipynb).

Briefly, CDA python supports the following operators:
* =
* !=
* AND
* OR
* IN and NOT IN
* % pattern matching
* IS and IS NOT
* comparison operators \>, <, >=, <=

Here a query is built to obtain all data related to research subjects whose primary diagnosis site is uterus and who are younger than 40. Note that CDA describes age as a negative number of days (_days_to_birth_), so "younger than 40" translates to a _days_to_birth_ value greater than or equal to 40*-365:

In [13]:
q = Q('ResearchSubject.primary_diagnosis_site = "uterus" AND days_to_birth >= 40*-365')
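The age condition can also be built programmatically. The sketch below only assembles the query text; the `max_age_condition` helper is hypothetical, and it assumes that the query language accepts a plain number such as -14600 in place of the expression `40*-365` used above.

```python
def max_age_condition(years, days_per_year=365):
    """Build a days_to_birth condition for subjects younger than `years`."""
    # Ages are stored as negative day counts, so younger subjects have
    # values closer to zero, hence the >= comparison.
    threshold = -years * days_per_year
    return "days_to_birth >= {}".format(threshold)


condition = max_age_condition(40)
query_text = 'ResearchSubject.primary_diagnosis_site = "uterus" AND ' + condition
print(query_text)
```

The resulting string could then be passed to **Q** in the same way as the hand-written query.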

Before querying all endpoints, let's check the results of the files endpoint:

In [14]:
files_of_interest = q.file.run()
files_of_interest

Total execution time: 10306 ms



            QueryID: bbc96ea3-cda4-4669-834b-9f34e183cab3
            
            Offset: 0
            Count: 100
            Total Row Count: 5497
            More pages: True
            

Note that there are almost 5500 files! In rare cases you might actually want to import that many files, but generally this is a sign that the query is too broad, and it makes sense to add conditions or make the existing ones stricter. Let's try again, this time querying only for patients who are younger than 34:

In [15]:
q = Q('ResearchSubject.primary_diagnosis_site = "uterus" AND days_to_birth >= 34*-365')
files_of_interest = q.file.run()
files_of_interest

Total execution time: 7401 ms



            QueryID: 799426e7-da2b-4440-af15-e75f97eff7bc
            
            Offset: 0
            Count: 100
            Total Row Count: 218
            More pages: True
            

Now we have 218 files to work with, which is a reasonable number for this demonstration.
Let's convert the result to a dataframe and proceed to query the other endpoints.

In [16]:
# Paginate the files query result:
files_df = iter_pages(files_of_interest)

# Query other endpoints:
diagnosis = q.diagnosis.run()
diagnosis_df = iter_pages(diagnosis).rename(columns={'id': 'diagnosis_id'})
treatment = q.treatment.run()
treatment_df = iter_pages(treatment).rename(columns={'id': 'treatment_id'})
research_subject_of_interest = q.researchsubject.run()
rs_df = iter_pages(research_subject_of_interest).rename(columns={'id': 'researchsubject_id'})
subject_of_interest = q.subject.run()
subject_df = iter_pages(subject_of_interest).rename(columns={'id': 'subject_id'})
specimen_of_interest = q.specimen.run()
specimen_df = iter_pages(specimen_of_interest).rename(columns={'id': 'researchsubject_specimen_id'})

Total execution time: 3806 ms


Total execution time: 3507 ms


Total execution time: 3726 ms


Total execution time: 3681 ms


Total execution time: 3592 ms


It is important to note that some queries might return no results. In our query, the treatment, diagnosis and specimen endpoints have returned **0 results**. This highlights the importance of checking the results before merging dataframes and importing data:

In [17]:
diagnosis.count

0

To avoid issues when merging dataframes, each result is checked before the corresponding merge is performed:

In [18]:
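# Start from the files dataframe so that a combineddata left over
# from an earlier cell is never reused
combineddata = files_df

if research_subject_of_interest.count:
    combineddata = pd.merge(combineddata, rs_df, on='researchsubject_id', how='outer', suffixes=(None, '_rs'))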

if subject_of_interest.count:
    combineddata = pd.merge(combineddata, subject_df, on='subject_id', how='outer', suffixes=(None, '_subject'))

if specimen_of_interest.count:
    combineddata = pd.merge(combineddata, specimen_df, on='researchsubject_specimen_id', how='outer', suffixes=(None, '_specimen'))

if treatment.count:
    combineddata = pd.merge(combineddata, treatment_df, on='researchsubject_id', how='outer', suffixes=(None, '_treatment'))

if diagnosis.count:
    combineddata = pd.merge(combineddata, diagnosis_df, on='researchsubject_id', how='outer', suffixes=(None, '_diagnosis'))

# See what the merged dataframe looks like:
combineddata.head(3)

Unnamed: 0,id,identifier,label,data_category,data_type,file_format,associated_project,drs_uri,byte_size,checksum,data_modality,imaging_modality,dbgap_accession_number,imaging_series,researchsubject_specimen_id,researchsubject_id,subject_id,identifier_rs,member_of_research_project,primary_diagnosis_condition,primary_diagnosis_site,subject_id_rs,identifier_subject,species,sex,race,ethnicity,days_to_birth,subject_associated_project,vital_status,days_to_death,cause_of_death
0,72a509a3-7e38-4e0c-ae05-3104fc1f2321,"[{'system': 'GDC', 'value': '72a509a3-7e38-4e0...",TCGA-AX-A3FZ-01A-01-TS1.0BE8C3AA-52DF-40C0-802...,Biospecimen,Slide Image,SVS,TCGA-UCEC,drs://dg.4DFC:72a509a3-7e38-4e0c-ae05-3104fc1f...,354582476.0,b81a2971435f981514493c955d9f5c49,Genomic,,,,,,TCGA-AX-A3FZ,,,,,,"[{'system': 'GDC', 'value': 'TCGA-AX-A3FZ'}, {...",homo sapiens,female,white,not hispanic or latino,-12331,"[tcga_ucec, TCGA-UCEC]",Alive,,
1,35d2a0de-d452-4fa9-9422-25c0130145f3,"[{'system': 'GDC', 'value': '35d2a0de-d452-4fa...",TCGA_UCEC.97c3b010-81c0-4075-b265-3f76d1801bf0...,Simple Nucleotide Variation,Annotated Somatic Mutation,VCF,TCGA-UCEC,drs://dg.4DFC:35d2a0de-d452-4fa9-9422-25c01301...,573829.0,8e60d0d65c8ec5dc49f04849d66fc87b,Genomic,,,,,,TCGA-AX-A3FZ,,,,,,"[{'system': 'GDC', 'value': 'TCGA-AX-A3FZ'}, {...",homo sapiens,female,white,not hispanic or latino,-12331,"[tcga_ucec, TCGA-UCEC]",Alive,,
2,bdde5dbb-9046-40b6-b67a-fa39303cf10d,"[{'system': 'GDC', 'value': 'bdde5dbb-9046-40b...",nationwidechildrens.org_clinical_follow_up_v4....,Clinical,Clinical Supplement,BCR Biotab,TCGA-UCEC,drs://dg.4DFC:bdde5dbb-9046-40b6-b67a-fa39303c...,12678.0,6f37e92361bacd1e0c06d87d2d1bf744,Genomic,,,,,,TCGA-AX-A3FZ,,,,,,"[{'system': 'GDC', 'value': 'TCGA-AX-A3FZ'}, {...",homo sapiens,female,white,not hispanic or latino,-12331,"[tcga_ucec, TCGA-UCEC]",Alive,,
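The chain of checks above can also be written as a loop. The following is a minimal sketch with toy data; the `merge_available` helper is hypothetical and tests whether each dataframe is empty, rather than reading the _count_ attribute of the query result.

```python
import pandas as pd


def merge_available(base, others):
    """Outer-merge `base` with every usable dataframe in `others`.

    `others` is a list of (dataframe, key, suffix) tuples.
    """
    combined = base
    for df, key, suffix in others:
        # Skip endpoints that returned nothing or lack the join key.
        if df.empty or key not in df.columns or key not in combined.columns:
            continue
        combined = pd.merge(combined, df, on=key, how="outer", suffixes=(None, suffix))
    return combined


toy_files = pd.DataFrame({"id": ["f1"], "subject_id": ["s1"], "researchsubject_id": ["rs1"]})
toy_subjects = pd.DataFrame({"subject_id": ["s1"], "sex": ["female"]})
toy_diagnosis = pd.DataFrame()  # an endpoint with 0 results

result = merge_available(
    toy_files,
    [
        (toy_subjects, "subject_id", "_subject"),
        (toy_diagnosis, "researchsubject_id", "_diagnosis"),
    ],
)
print(list(result.columns))
```

Starting from the files dataframe and skipping empty inputs means the result always contains at least the file rows, regardless of which endpoints returned data.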


Finally, the _combineddata_ dataframe can be used to import the files to CGC:

In [19]:
# Import
from utilities import process_and_upload

project = 'boris_majic/cda-data-import'

# you can tag files with multiple labels
tags = ['uterus', 'young']

process_and_upload(
    df=combineddata,
    token=token,
    import_project=project,
    tags=tags
)

Submitting files in chunks of 100.

Importing chunk 1/2
Importing chunk 2/2

SUBMITTED: 0
RUNNING: 0
FINISHED: 1/2

SUBMITTED: 0
RUNNING: 0
FINISHED: 2/2

Import completed!


[<DRSBulkImport: id=183047811921678336>,
 <DRSBulkImport: id=183047814508515328>]