# Loading Data
### Laurence Nickel (i6257119)

Libraries used: 
* TCGAbiolinks (version: '2.26.0')
* data.table (version: '1.14.8')

References: 
* [1] National Cancer Institute GDC Data Portal (2023). TCGA-GBM. Available: https://portal.gdc.cancer.gov/projects/TCGA-GBM (accessed May 15, 2023).
* [2] Fisher, R., Pusztai, L., & Swanton, C. (2013). Cancer Heterogeneity: Implications for Targeted Therapeutics. British Journal of Cancer, 108(3), 479-485. doi: 10.1038/bjc.2012.581.
* [3] Mayo Clinic (2023). Glioma - Glioblastoma. Available: https://www.mayoclinic.org/diseases-conditions/glioblastoma/cdc-20350148 (accessed January 23, 2023).
* [4] National Cancer Institute (2023). TCGA's Study of Glioblastoma Multiforme. Available: https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga/studied-cancers/glioblastoma (accessed January 23, 2023).
* [5] Novogene (2023). How to choose Normalization methods (TPM/RPKM/FPKM) for mRNA expression. Available: https://www.novogene.com/us-en/resources/blog/how-to-choose-normalization-methods-tpm-rpkm-fpkm-for-mrna-expression/ (accessed May 8, 2023).

## Introduction

In this notebook, we load the methylation and gene expression data of patients who have or had brain cancer. The data that I use for my thesis originate from the National Cancer Institute GDC Data Portal (https://portal.gdc.cancer.gov/repository) [1]. Cancer is heterogeneous: subpopulations of its cells can have different genetic and molecular characteristics, which makes it challenging to develop effective treatments, since a treatment that works for one type of cancer may not work for another [2]. Therefore, I decided to identify the key sites of DNA methylation that affect gene expression in a single type of brain cancer, glioblastoma, an aggressive cancer that occurs most often in older adults [3]. I chose this type of brain cancer because the recovery rate of patients is very low: there are no effective long-term treatments for the disease, and patients usually survive less than 15 months after diagnosis [4].

From the Data Portal, two datasets are retrieved, each featuring patients who were diagnosed with glioblastoma. Every sample in these datasets originates from the same project: TCGA-GBM. The first dataset contains the DNA methylation data of 142 patients and the second dataset contains the gene expression data of 162 patients.

In the sections below, these datasets are retrieved as IDAT files for the methylation data (two per sample: one green channel file and one red channel file) and as TSV files for the gene expression data (one per sample). Note that patients do not correspond one-to-one to samples, as multiple samples exist for some patients. Once all of these files are retrieved, we keep only those whose sample has both methylation data and gene expression data available.
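The matching step described above can be sketched as follows. This is a minimal illustration in base R, not the code used later in this notebook: the barcodes are made up, the helper names 'patient_id()' and 'sample_id()' are hypothetical, and it assumes that the first 12 characters of a TCGA barcode identify the patient and the first 16 characters identify the sample.

```r
# Made-up aliquot barcodes in the TCGA format (not real samples).
methylation_barcodes <- c(
  "TCGA-AA-0001-01A-01D-0000-05",
  "TCGA-AA-0002-01A-01D-0000-05",
  "TCGA-AA-0003-01A-01D-0000-05"
)
expression_barcodes <- c(
  "TCGA-AA-0001-01A-01R-0000-01",
  "TCGA-AA-0001-01A-02R-0000-01",  # second aliquot of the same sample
  "TCGA-AA-0003-11A-01R-0000-07",  # known patient, but a different sample
  "TCGA-AA-0004-01A-01R-0000-01"
)

# Assumption: characters 1-12 identify the patient, characters 1-16 the sample.
patient_id <- function(barcode) substr(barcode, 1, 12)
sample_id  <- function(barcode) substr(barcode, 1, 16)

# Four expression files, but only three distinct patients.
print(length(unique(patient_id(expression_barcodes))))

# Samples for which both data types are available.
shared_samples <- intersect(sample_id(methylation_barcodes),
                            sample_id(expression_barcodes))

# Keep only the files that belong to a shared sample.
kept_methylation <- methylation_barcodes[
  sample_id(methylation_barcodes) %in% shared_samples]
kept_expression <- expression_barcodes[
  sample_id(expression_barcodes) %in% shared_samples]

print(shared_samples)
```

Matching on the sample identifier rather than on the patient identifier avoids pairing, for example, a tumor methylation profile with the expression profile of a normal tissue sample from the same patient.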

For the IDAT files, the green and red channels refer to the fluorescent dyes used in the microarray assay to detect DNA methylation levels, labelled 'Cy3' and 'Cy5' respectively. How a channel maps to methylation status depends on the probe design of the array: for Infinium II probes, the green channel carries the signal for methylated DNA and the red channel the signal for unmethylated DNA, whereas for Infinium I probes both signals are measured in the same channel using two separate bead types. The methylated and unmethylated intensities are therefore derived from the two files together.
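Because each sample comes as a pair of files, it can be useful to check that both channels are present before processing. The sketch below groups a few file names (copied from the methylation data table further down in this notebook) by their shared prefix. It assumes that the two files of a pair differ only in the '_Grn.idat' or '_Red.idat' suffix, which holds for the names shown in that table.

```r
# File names copied from the methylation data table shown in this notebook.
idat_files <- c(
  "d5863779-fbd0-4c36-8b1e-31bbddb9fb12_noid_Red.idat",
  "d5863779-fbd0-4c36-8b1e-31bbddb9fb12_noid_Grn.idat",
  "d01ed9e7-57d7-400b-933d-8306dd60fcf7_noid_Grn.idat",
  "d01ed9e7-57d7-400b-933d-8306dd60fcf7_noid_Red.idat",
  "228a0ed2-736a-4ef5-8803-55608b2f68ed_noid_Red.idat"
)

# Strip the channel suffix to obtain the prefix shared by both files of a pair.
prefix  <- sub("_(Grn|Red)\\.idat$", "", idat_files)
channel <- sub("^.*_(Grn|Red)\\.idat$", "\\1", idat_files)

# Group the channels by prefix and flag the prefixes that have both channels.
channels_per_prefix <- split(channel, prefix)
is_complete <- vapply(
  channels_per_prefix,
  function(ch) setequal(ch, c("Grn", "Red")),
  logical(1)
)
print(is_complete)
```

Here one prefix is flagged as incomplete only because this excerpt lists just one of its two files.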

For the TSV files, the gene expression values represent the amount of activity or expression of each gene for the current sample. This value is measured in Transcripts Per Million (TPM). TPM is a normalization of the number of RNA transcripts of a particular gene in a sample [5]. TPM values are on a linear scale (not a logarithmic one) and sum to one million within each sample, so a value can be read as the share of the sample's transcripts that come from that gene. Higher TPM values indicate higher expression levels, while lower values indicate lower expression levels. TPM normalization accounts for differences in sequencing depth and gene length, which allows gene expression levels to be compared between samples.
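To make the normalization concrete, the sketch below computes TPM for three hypothetical genes with made-up read counts and gene lengths. It follows the usual two-step definition: divide each count by the gene length, then rescale so that the values of the sample sum to one million.

```r
# Toy read counts and gene lengths (in kilobases) for three hypothetical genes.
counts    <- c(geneA = 100, geneB = 300, geneC = 600)
length_kb <- c(geneA = 1,   geneB = 3,   geneC = 2)

# Step 1: divide by gene length to get reads per kilobase.
reads_per_kb <- counts / length_kb

# Step 2: rescale so that the values of the sample sum to one million.
tpm <- reads_per_kb / sum(reads_per_kb) * 1e6

print(tpm)
print(sum(tpm))
```

The first two genes end up with the same TPM even though their raw counts differ by a factor of three, because the second gene is three times as long.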

For reproducibility purposes: the methylation and gene expression data were retrieved on 15 May 2023 (19:36).

### Defining the data directory

Before we start retrieving the files, we first need to define the data directory in which the resulting files will be stored. Please note that this path must be changed to a directory that exists on your own machine.

In [1]:
data_directory = "C:/Users/laure/OneDrive/Documenten/Bachelor Thesis Data/original_data"
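Since the path above only exists on one machine, a small guard that creates the directory when it is missing can be helpful. The following is a minimal sketch that uses a temporary location as a stand-in ('example_directory' is a hypothetical name); in practice 'data_directory' would be used instead.

```r
# Hypothetical location used for illustration; replace with the desired path.
example_directory <- file.path(tempdir(), "original_data")

# Create the directory (and any missing parent directories) if needed.
if (!dir.exists(example_directory)) {
  dir.create(example_directory, recursive = TRUE)
}

# Confirm that the directory is now available.
print(dir.exists(example_directory))
```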

### Importing the libraries

In addition, we first install the libraries that are used throughout this notebook. 'TCGAbiolinks' is installed through 'BiocManager', while 'data.table' is installed with 'install.packages()'.

In [2]:
# Checking whether the package 'BiocManager' has already been installed and installing it if it has not been installed yet.
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")


cat("Starting the installing of the libraries...")


# Using 'BiocManager' to install the following libraries (which are also mentioned in the introduction of this notebook).
BiocManager::install("TCGAbiolinks")

# Using the R command 'install.packages()' to install the remaining necessary libraries.
install.packages("data.table")


cat("Finishing the installing of the libraries.")

Bioconductor version '3.16' is out-of-date; the current release version '3.17'
  is available with R version '4.3'; see https://bioconductor.org/install



Starting the installing of the libraries...

'getOption("repos")' replaces Bioconductor standard repositories, see
'help("repositories", package = "BiocManager")' for details.
Replacement repositories:
    CRAN: https://cran.r-project.org

Bioconductor version 3.16 (BiocManager 1.30.20), R 4.2.3 (2023-03-15 ucrt)

"package(s) not installed when version(s) same as or greater than current; use
  `force = TRUE` to re-install: 'TCGAbiolinks'"
Installation paths not writeable, unable to update packages
  path: C:/Program Files/R/R-4.2.3/library
  packages:
    class, KernSmooth, lattice, MASS, Matrix, nnet, survival

Old packages: 'cachem', 'DelayedArray', 'dplyr', 'evaluate', 'fs', 'httpuv',
  'httr', 'httr2', 'later', 'profvis', 'rlang', 'sass', 'testthat', 'vctrs',
  'viridisLite', 'vroom', 'waldo', 'xfun', 'xml2'

Installing package into 'C:/Users/laure/AppData/Local/R/win-library/4.2'
(as 'lib' is unspecified)



package 'data.table' successfully unpacked and MD5 sums checked

The downloaded binary packages are in
	C:\Users\laure\AppData\Local\Temp\RtmpIP5AE5\downloaded_packages
Finishing the installing of the libraries.

Now that all the libraries have been installed, we can load them into this notebook with the command 'library()'. To verify that they have been loaded, we can use the command 'packageVersion()', which displays the version of the package that is currently installed.

In [3]:
# Loading the following libraries (which are also mentioned in the introduction of this notebook) into this notebook. 
library(TCGAbiolinks)
library(data.table)


# Retrieving the version of the packages to verify they have been correctly loaded into this notebook.
cat("The library 'TCGAbiolinks' has been loaded into the notebook with its version being:")
packageVersion("TCGAbiolinks")

cat("The library 'data.table' has been loaded into the notebook with its version being:")
packageVersion("data.table")

The library 'TCGAbiolinks' has been loaded into the notebook with its version being:

[1] '2.26.0'

The library 'data.table' has been loaded into the notebook with its version being:

[1] '1.14.8'
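To make the version check explicit, the installed versions can also be compared against the ones listed at the top of this notebook. The helper 'has_version()' below is a hypothetical name introduced for illustration; it returns FALSE when a package is missing or has a different version.

```r
# Hypothetical helper: TRUE only if 'pkg' is installed at exactly 'expected'.
has_version <- function(pkg, expected) {
  requireNamespace(pkg, quietly = TRUE) &&
    utils::packageVersion(pkg) == expected
}

# Versions listed at the top of this notebook.
expected_versions <- c(TCGAbiolinks = "2.26.0", data.table = "1.14.8")

# Check every package and report which ones match.
matches <- vapply(
  names(expected_versions),
  function(pkg) has_version(pkg, expected_versions[[pkg]]),
  logical(1)
)
print(matches)
```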

### Setting up the queries

First, we set up the queries that are later used to retrieve the correct files. Each query holds the filters that select those files. Since we want the files of patients who were diagnosed with glioblastoma, both queries set the 'project' parameter to 'TCGA-GBM'. The 'query_methylation_data' query holds the filters for the methylation data: its 'data.category' is 'DNA Methylation', its 'platform' is 'Illumina Human Methylation 450' and its 'data.type' is 'Masked Intensities'.

In [4]:
# The query to retrieve the methylation data.
query_methylation_data <- GDCquery(
    project = "TCGA-GBM",
    data.category = "DNA Methylation",
    legacy = FALSE,
    platform = "Illumina Human Methylation 450",
    data.type = "Masked Intensities"
)

--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

--------------------

ooo By platform

ooo By data.type

----------------

oo Checking data

----------------

ooo Checking if there are duplicated cases


ooo Checking if there are results for the query

-------------------

o Preparing output

-------------------



The 'query_expression_data' query holds the filters for the gene expression data: its 'data.category' is 'Transcriptome Profiling', its 'data.type' is 'Gene Expression Quantification' and its 'workflow.type' is 'STAR - Counts'.

In [5]:
# The query to retrieve the gene expression data.
query_expression_data <- GDCquery(
    project = "TCGA-GBM",
    data.category = "Transcriptome Profiling",
    data.type = "Gene Expression Quantification", 
    workflow.type = "STAR - Counts"
)

--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

--------------------

ooo By data.type

ooo By workflow.type

----------------

oo Checking data

----------------

ooo Checking if there are duplicated cases

ooo Checking if there are results for the query

-------------------

o Preparing output

-------------------



### Collecting the data tables

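Now that the queries have been defined, we can retrieve the list of matching files, including additional metadata about them, by passing the output of 'getResults()' to the function 'data.table()'. The first argument is the result of the query. Note that 'data.table()' has no display 'options' parameter (such a parameter belongs to 'DT::datatable()'); the 'options' argument is instead stored as an extra list column named 'options', which explains the recycling warnings printed below and the 'options' column in the displayed tables.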
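As a side note on the 'options' argument used in the following cells: 'data.table()' treats additional named arguments as columns, which is consistent with the 'options' column and the recycling warnings in the output. The sketch below illustrates this on a toy data frame with made-up values that stands in for the output of 'getResults()', and shows that 'as.data.table()' keeps only the original columns. It assumes the recycling behaviour of the 'data.table' version used in this notebook.

```r
library(data.table)

# Toy stand-in for the data frame returned by 'getResults()' (made-up values).
results <- data.frame(
  file_name = paste0("file_", 1:6, ".tsv"),
  sample_type = rep("Primary Tumor", 6),
  stringsAsFactors = FALSE
)

# The extra named argument is stored as a list column called 'options',
# with its three elements recycled over the six rows.
with_options <- data.table(
  results,
  options = list(scrollX = TRUE, keys = TRUE, pageLength = 5)
)
print(names(with_options))

# Converting the data frame directly keeps only the original columns.
clean <- as.data.table(results)
print(names(clean))
```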

In [6]:
# Retrieving the methylation data files.
methylation_data_files = data.table(
    getResults(query_methylation_data),
    options = list(scrollX = TRUE, keys = TRUE, pageLength = 5)
)

"Item 2 has 3 rows but longest item has 310; recycled with remainder."


In [7]:
# Retrieving the gene expression data files.
gene_expression_data_files = data.table(
    getResults(query_expression_data),
    options = list(scrollX = TRUE, keys = TRUE, pageLength = 5)
)

"Item 2 has 3 rows but longest item has 175; recycled with remainder."


We can now display both of these data tables.

In [8]:
cat("The data table featuring the methylation data files:")
methylation_data_files

The data table featuring the methylation data files:

id,data_format,cases,access,file_name,channel,submitter_id,data_category,type,platform,⋯,analysis_state,analysis_submitter_id,analysis_workflow_link,analysis_workflow_type,analysis_workflow_version,sample_type,is_ffpe,cases.submitter_id,sample.submitter_id,options
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<lgl>,<chr>,<chr>,<list>
ccfd07e7-1a49-4f92-8051-1439925768ef,IDAT,TCGA-12-5301-01A-01D-1481-05,open,228a0ed2-736a-4ef5-8803-55608b2f68ed_noid_Red.idat,Red,21ec19c3-93cb-4e2e-af4d-98e2f48df0ad,DNA Methylation,masked_methylation_array,Illumina Human Methylation 450,⋯,released,228a0ed2-736a-4ef5-8803-55608b2f68ed,https://github.com/NCI-GDC/sesame-cwl/blob/7732f58193690d65dc535c9b2a4f1071113884c5/workflows/gdc_sesame_workflow.cwl,SeSAMe Methylation Beta Estimation,7732f58193690d65dc535c9b2a4f1071113884c5,Primary Tumor,,TCGA-12-5301,TCGA-12-5301-01A,TRUE
06be9906-bb51-467e-8425-cf85f0fb3d49,IDAT,TCGA-28-5209-01A-01D-1481-05,open,26dc8f0f-89fa-444f-b8ad-c003858fc44d_noid_Red.idat,Red,0d4a3194-b34a-4d7b-a9bd-819046c93878,DNA Methylation,masked_methylation_array,Illumina Human Methylation 450,⋯,released,26dc8f0f-89fa-444f-b8ad-c003858fc44d,https://github.com/NCI-GDC/sesame-cwl/blob/7732f58193690d65dc535c9b2a4f1071113884c5/workflows/gdc_sesame_workflow.cwl,SeSAMe Methylation Beta Estimation,7732f58193690d65dc535c9b2a4f1071113884c5,Primary Tumor,,TCGA-28-5209,TCGA-28-5209-01A,TRUE
0e7a6088-9eff-4863-85ad-cc544d71e669,IDAT,TCGA-76-6286-01A-11D-1844-05,open,c56d4813-ec8e-48f3-943e-4a61be10a1e8_noid_Grn.idat,Green,cb587a76-d60b-4994-9a7d-aa61a3f1c62d,DNA Methylation,masked_methylation_array,Illumina Human Methylation 450,⋯,released,c56d4813-ec8e-48f3-943e-4a61be10a1e8,https://github.com/NCI-GDC/sesame-cwl/blob/7732f58193690d65dc535c9b2a4f1071113884c5/workflows/gdc_sesame_workflow.cwl,SeSAMe Methylation Beta Estimation,7732f58193690d65dc535c9b2a4f1071113884c5,Primary Tumor,,TCGA-76-6286,TCGA-76-6286-01A,5
f0f7c26d-0235-43e3-8c16-309e1cfcfc51,IDAT,TCGA-06-6701-01A-11D-1844-05,open,d5863779-fbd0-4c36-8b1e-31bbddb9fb12_noid_Red.idat,Red,4c027231-6d43-4c18-adb0-20b5aa870daf,DNA Methylation,masked_methylation_array,Illumina Human Methylation 450,⋯,released,d5863779-fbd0-4c36-8b1e-31bbddb9fb12,https://github.com/NCI-GDC/sesame-cwl/blob/7732f58193690d65dc535c9b2a4f1071113884c5/workflows/gdc_sesame_workflow.cwl,SeSAMe Methylation Beta Estimation,7732f58193690d65dc535c9b2a4f1071113884c5,Primary Tumor,,TCGA-06-6701,TCGA-06-6701-01A,TRUE
977a3ade-812d-44b8-bd01-c4cfe8630763,IDAT,TCGA-06-6701-01A-11D-1844-05,open,d5863779-fbd0-4c36-8b1e-31bbddb9fb12_noid_Grn.idat,Green,f83cc414-99be-483c-8c02-6f3a45366270,DNA Methylation,masked_methylation_array,Illumina Human Methylation 450,⋯,released,d5863779-fbd0-4c36-8b1e-31bbddb9fb12,https://github.com/NCI-GDC/sesame-cwl/blob/7732f58193690d65dc535c9b2a4f1071113884c5/workflows/gdc_sesame_workflow.cwl,SeSAMe Methylation Beta Estimation,7732f58193690d65dc535c9b2a4f1071113884c5,Primary Tumor,,TCGA-06-6701,TCGA-06-6701-01A,TRUE
25453af6-229e-430a-bfe9-4ee13ee32d26,IDAT,TCGA-19-5954-01A-11D-1697-05,open,d01ed9e7-57d7-400b-933d-8306dd60fcf7_noid_Grn.idat,Green,75f4b2e0-5e59-4c0d-9d1d-0957f39a6571,DNA Methylation,masked_methylation_array,Illumina Human Methylation 450,⋯,released,d01ed9e7-57d7-400b-933d-8306dd60fcf7,https://github.com/NCI-GDC/sesame-cwl/blob/7732f58193690d65dc535c9b2a4f1071113884c5/workflows/gdc_sesame_workflow.cwl,SeSAMe Methylation Beta Estimation,7732f58193690d65dc535c9b2a4f1071113884c5,Primary Tumor,,TCGA-19-5954,TCGA-19-5954-01A,5
c03d0407-8543-45e1-b1d5-1fe00bf5f080,IDAT,TCGA-19-5954-01A-11D-1697-05,open,d01ed9e7-57d7-400b-933d-8306dd60fcf7_noid_Red.idat,Red,1fa75a74-59d3-495f-9bd0-33f857829a5c,DNA Methylation,masked_methylation_array,Illumina Human Methylation 450,⋯,released,d01ed9e7-57d7-400b-933d-8306dd60fcf7,https://github.com/NCI-GDC/sesame-cwl/blob/7732f58193690d65dc535c9b2a4f1071113884c5/workflows/gdc_sesame_workflow.cwl,SeSAMe Methylation Beta Estimation,7732f58193690d65dc535c9b2a4f1071113884c5,Primary Tumor,,TCGA-19-5954,TCGA-19-5954-01A,TRUE
98e8e240-1db4-4c58-b894-826d30757a5b,IDAT,TCGA-RR-A6KA-01A-21D-A33U-05,open,d1d732f0-794d-4bae-893b-915246e4c709_noid_Grn.idat,Green,f56da61c-2656-4469-b229-8c0a53ba6e8b,DNA Methylation,masked_methylation_array,Illumina Human Methylation 450,⋯,released,d1d732f0-794d-4bae-893b-915246e4c709,https://github.com/NCI-GDC/sesame-cwl/blob/7732f58193690d65dc535c9b2a4f1071113884c5/workflows/gdc_sesame_workflow.cwl,SeSAMe Methylation Beta Estimation,7732f58193690d65dc535c9b2a4f1071113884c5,Primary Tumor,,TCGA-RR-A6KA,TCGA-RR-A6KA-01A,TRUE
dd64cea8-33e5-4ac3-8ee1-a1a0ee66f992,IDAT,TCGA-19-5947-01A-11D-1697-05,open,e2577575-fcf2-4362-8523-a593a82960b0_noid_Grn.idat,Green,f09587ea-d5bc-4b37-a3a1-8a5ab1cc1db3,DNA Methylation,masked_methylation_array,Illumina Human Methylation 450,⋯,released,e2577575-fcf2-4362-8523-a593a82960b0,https://github.com/NCI-GDC/sesame-cwl/blob/7732f58193690d65dc535c9b2a4f1071113884c5/workflows/gdc_sesame_workflow.cwl,SeSAMe Methylation Beta Estimation,7732f58193690d65dc535c9b2a4f1071113884c5,Primary Tumor,,TCGA-19-5947,TCGA-19-5947-01A,5
f78cc9e4-4c83-4519-9117-4bbff43974ef,IDAT,TCGA-19-5947-01A-11D-1697-05,open,e2577575-fcf2-4362-8523-a593a82960b0_noid_Red.idat,Red,21d31e39-bd37-4d3b-b554-f6b41a28a37c,DNA Methylation,masked_methylation_array,Illumina Human Methylation 450,⋯,released,e2577575-fcf2-4362-8523-a593a82960b0,https://github.com/NCI-GDC/sesame-cwl/blob/7732f58193690d65dc535c9b2a4f1071113884c5/workflows/gdc_sesame_workflow.cwl,SeSAMe Methylation Beta Estimation,7732f58193690d65dc535c9b2a4f1071113884c5,Primary Tumor,,TCGA-19-5947,TCGA-19-5947-01A,TRUE


In [9]:
cat("The data table featuring the gene expression data files:")
gene_expression_data_files

The data table featuring the gene expression data files:

id,data_format,cases,access,file_name,submitter_id,data_category,type,file_size,created_datetime,⋯,analysis_state,analysis_submitter_id,analysis_workflow_link,analysis_workflow_type,analysis_workflow_version,sample_type,is_ffpe,cases.submitter_id,sample.submitter_id,options
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<int>,<chr>,⋯,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<lgl>,<chr>,<chr>,<list>
ddb16b8d-9ad8-4c66-a986-513d3507b26d,TSV,TCGA-06-0156-01A-03R-1849-01,open,1076483a-e462-47e5-a47a-ca9544548dff.rna_seq.augmented_star_gene_counts.tsv,601d6500-64b0-43ab-b93f-af78fddb3591,Transcriptome Profiling,gene_expression,4249454,2021-12-13T17:00:58.155882-06:00,⋯,released,1076483a-e462-47e5-a47a-ca9544548dff_star__counts,https://github.com/NCI-GDC/gdc-rnaseq-cwl/blob/5d8c131bbff59fb0c969217fc1d44e6d1503cd1f/rnaseq-star-align/star2pass.rnaseq_harmonization.cwl,STAR - Counts,5d8c131bbff59fb0c969217fc1d44e6d1503cd1f,Primary Tumor,,TCGA-06-0156,TCGA-06-0156-01A,TRUE
4bed9101-07f4-4d76-b79f-6eb8de04bd19,TSV,TCGA-06-0675-11A-32R-A36H-07,open,2856d609-b09a-486e-8552-23abaa1df201.rna_seq.augmented_star_gene_counts.tsv,d2fca1e9-2d34-4961-a8e7-39921cc396b5,Transcriptome Profiling,gene_expression,4243935,2021-12-13T17:00:47.023554-06:00,⋯,released,2856d609-b09a-486e-8552-23abaa1df201_star__counts,https://github.com/NCI-GDC/gdc-rnaseq-cwl/blob/5d8c131bbff59fb0c969217fc1d44e6d1503cd1f/rnaseq-star-align/star2pass.rnaseq_harmonization.cwl,STAR - Counts,5d8c131bbff59fb0c969217fc1d44e6d1503cd1f,Solid Tissue Normal,,TCGA-06-0675,TCGA-06-0675-11A,TRUE
7b5011b5-5816-4c25-ad72-31152b70b9a0,TSV,TCGA-32-4213-01A-01R-1850-01,open,954911a5-d0a0-42bb-842c-1dbd9ee60c6c.rna_seq.augmented_star_gene_counts.tsv,a4219b59-5018-462e-b8c6-44fdae703ded,Transcriptome Profiling,gene_expression,4209977,2021-12-13T17:02:30.106708-06:00,⋯,released,954911a5-d0a0-42bb-842c-1dbd9ee60c6c_star__counts,https://github.com/NCI-GDC/gdc-rnaseq-cwl/blob/5d8c131bbff59fb0c969217fc1d44e6d1503cd1f/rnaseq-star-align/star2pass.rnaseq_harmonization.cwl,STAR - Counts,5d8c131bbff59fb0c969217fc1d44e6d1503cd1f,Primary Tumor,,TCGA-32-4213,TCGA-32-4213-01A,5
dfbc7136-f7f7-4c2f-93a1-c46b2247cce5,TSV,TCGA-32-2634-01A-01R-1850-01,open,7a25193c-a87d-4929-b724-3d89553fa028.rna_seq.augmented_star_gene_counts.tsv,dfc5c954-00be-4dcf-adc2-455f03f1abff,Transcriptome Profiling,gene_expression,4262595,2021-12-13T17:02:45.822798-06:00,⋯,released,7a25193c-a87d-4929-b724-3d89553fa028_star__counts,https://github.com/NCI-GDC/gdc-rnaseq-cwl/blob/5d8c131bbff59fb0c969217fc1d44e6d1503cd1f/rnaseq-star-align/star2pass.rnaseq_harmonization.cwl,STAR - Counts,5d8c131bbff59fb0c969217fc1d44e6d1503cd1f,Primary Tumor,,TCGA-32-2634,TCGA-32-2634-01A,TRUE
bd40ca45-33da-4798-a3bd-7640c784d6a9,TSV,TCGA-06-0187-01A-01R-1849-01,open,c2bc46d6-7065-44ff-a21a-7a3676990fdb.rna_seq.augmented_star_gene_counts.tsv,c965fc0b-01c9-4afa-8c2b-907f8aa39d0e,Transcriptome Profiling,gene_expression,4236802,2021-12-13T16:58:17.287350-06:00,⋯,released,c2bc46d6-7065-44ff-a21a-7a3676990fdb_star__counts,https://github.com/NCI-GDC/gdc-rnaseq-cwl/blob/5d8c131bbff59fb0c969217fc1d44e6d1503cd1f/rnaseq-star-align/star2pass.rnaseq_harmonization.cwl,STAR - Counts,5d8c131bbff59fb0c969217fc1d44e6d1503cd1f,Primary Tumor,,TCGA-06-0187,TCGA-06-0187-01A,TRUE
7888f7b7-38da-4cc7-b542-836b193adfbf,TSV,TCGA-12-3652-01A-01R-1849-01,open,75615deb-e120-44f6-a585-1ca07987b7ef.rna_seq.augmented_star_gene_counts.tsv,147ee350-7db0-4fa5-b1ee-72d743fa94ee,Transcriptome Profiling,gene_expression,4228972,2021-12-13T17:06:16.439520-06:00,⋯,released,75615deb-e120-44f6-a585-1ca07987b7ef_star__counts,https://github.com/NCI-GDC/gdc-rnaseq-cwl/blob/5d8c131bbff59fb0c969217fc1d44e6d1503cd1f/rnaseq-star-align/star2pass.rnaseq_harmonization.cwl,STAR - Counts,5d8c131bbff59fb0c969217fc1d44e6d1503cd1f,Primary Tumor,,TCGA-12-3652,TCGA-12-3652-01A,5
20ccfe17-ec2c-433f-a87c-07018a3a7522,TSV,TCGA-06-0156-01A-02R-1849-01,open,9fefe100-9765-4b8a-8468-51df066f0f0c.rna_seq.augmented_star_gene_counts.tsv,d5489a43-3f38-4f5e-b136-abf11c9b3280,Transcriptome Profiling,gene_expression,4251212,2021-12-13T17:00:08.122310-06:00,⋯,released,9fefe100-9765-4b8a-8468-51df066f0f0c_star__counts,https://github.com/NCI-GDC/gdc-rnaseq-cwl/blob/5d8c131bbff59fb0c969217fc1d44e6d1503cd1f/rnaseq-star-align/star2pass.rnaseq_harmonization.cwl,STAR - Counts,5d8c131bbff59fb0c969217fc1d44e6d1503cd1f,Primary Tumor,,TCGA-06-0156,TCGA-06-0156-01A,TRUE
f93cd112-520d-408d-a15d-3451f2bacdfd,TSV,TCGA-26-5134-01A-01R-1850-01,open,a8050792-d0f5-4b8f-aea2-0a303f40b1ca.rna_seq.augmented_star_gene_counts.tsv,be3905a4-6bc4-44b7-8d22-a0a671f7802e,Transcriptome Profiling,gene_expression,4206919,2021-12-13T16:57:27.624883-06:00,⋯,released,a8050792-d0f5-4b8f-aea2-0a303f40b1ca_star__counts,https://github.com/NCI-GDC/gdc-rnaseq-cwl/blob/5d8c131bbff59fb0c969217fc1d44e6d1503cd1f/rnaseq-star-align/star2pass.rnaseq_harmonization.cwl,STAR - Counts,5d8c131bbff59fb0c969217fc1d44e6d1503cd1f,Primary Tumor,,TCGA-26-5134,TCGA-26-5134-01A,TRUE
e1f33468-f9c4-4b1a-8cd5-b2c77cc80dc9,TSV,TCGA-32-2638-01A-01R-1850-01,open,5cafc8c0-a92c-407a-ad52-a2d9562c0bfa.rna_seq.augmented_star_gene_counts.tsv,c0c2932e-8349-4f9a-aa93-b66d96161a90,Transcriptome Profiling,gene_expression,4250273,2021-12-13T16:59:17.535636-06:00,⋯,released,5cafc8c0-a92c-407a-ad52-a2d9562c0bfa_star__counts,https://github.com/NCI-GDC/gdc-rnaseq-cwl/blob/5d8c131bbff59fb0c969217fc1d44e6d1503cd1f/rnaseq-star-align/star2pass.rnaseq_harmonization.cwl,STAR - Counts,5d8c131bbff59fb0c969217fc1d44e6d1503cd1f,Primary Tumor,,TCGA-32-2638,TCGA-32-2638-01A,5
6a0df69c-6211-4db9-a278-fc7b2a2bb92c,TSV,TCGA-06-0882-01A-01R-1849-01,open,7412a444-9fad-4d96-a4d4-2c270045919a.rna_seq.augmented_star_gene_counts.tsv,d8d4338b-b3be-4e88-8ea5-0df30d66be41,Transcriptome Profiling,gene_expression,4248718,2021-12-13T17:04:50.788894-06:00,⋯,released,7412a444-9fad-4d96-a4d4-2c270045919a_star__counts,https://github.com/NCI-GDC/gdc-rnaseq-cwl/blob/5d8c131bbff59fb0c969217fc1d44e6d1503cd1f/rnaseq-star-align/star2pass.rnaseq_harmonization.cwl,STAR - Counts,5d8c131bbff59fb0c969217fc1d44e6d1503cd1f,Primary Tumor,,TCGA-06-0882,TCGA-06-0882-01A,TRUE


### Deleting the non-gliomas

Now that we have retrieved the data files for both the methylation data and the gene expression data, one adjustment remains. The methylation data files still contain 2 files for which the disease type is not reported, and the gene expression data files still contain 5 such files (for all the other files the disease type is 'gliomas'). Since the 'GDCquery()' function offers no option to filter these out, we have to perform this step here.

To achieve this, we can remove the files for which the disease type is not reported by finding the indices of these files and using the operator '-'. Since the number of such files is quite low (7 files across the two sets of data files), I have decided to perform this step manually.
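As an alternative to collecting indices by hand, rows can be excluded with a logical condition. The sketch below uses a toy data frame with made-up file names; it also shows why negating the result of 'which()' is fragile when none of the names match.

```r
# Toy table with made-up file names.
files <- data.frame(
  file_name = c("a.tsv", "b.tsv", "c.tsv"),
  stringsAsFactors = FALSE
)

# Names to exclude; "x.tsv" does not occur in the table.
unwanted <- c("b.tsv", "x.tsv")

# Logical exclusion keeps every row whose name is not in 'unwanted'.
kept <- files[!(files$file_name %in% unwanted), , drop = FALSE]
print(kept$file_name)

# Pitfall of the '-which()' pattern: when nothing matches, which() returns an
# empty vector, and indexing with its negation selects zero rows.
no_match <- which(files$file_name == "x.tsv")
dropped_everything <- files[-no_match, , drop = FALSE]
print(nrow(dropped_everything))
```

The logical form behaves the same whether or not the listed names are present, which makes it safer to re-run after the underlying query results change.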

In [10]:
# Removing the 2 methylation data files for which the disease type is not reported.
cat("The data table featuring the methylation data files after removing the files for which the disease type is not reported:")
methylation_data_files[-methylation_data_files[, which(file_name == "617785bf-02af-4d75-b7e6-1153c7d967f1_noid_Red.idat" |
                 file_name == "617785bf-02af-4d75-b7e6-1153c7d967f1_noid_Grn.idat")]]

# Removing the 5 gene expression data files for which the disease type is not reported.
cat("The data table featuring the gene expression data files after removing the files for which the disease type is not reported:")
gene_expression_data_files[-gene_expression_data_files[, which(file_name == "4250878d-ba0b-47e1-96c6-f2331f93ce37.rna_seq.augmented_star_gene_counts.tsv" | 
                 file_name == "2856d609-b09a-486e-8552-23abaa1df201.rna_seq.augmented_star_gene_counts.tsv" |
                 file_name == "152343f6-832e-4077-83d3-1589ef71b472.rna_seq.augmented_star_gene_counts.tsv" |
                 file_name == "3404cbdb-4d1b-462a-a820-58c4f01e84fc.rna_seq.augmented_star_gene_counts.tsv" |
                 file_name == "4cc96a03-4717-465b-b0c6-dfb947ff3c5e.rna_seq.augmented_star_gene_counts.tsv")]]

The data table featuring the methylation data files after removing the files for which the disease type is not reported:

id,data_format,cases,access,file_name,channel,submitter_id,data_category,type,platform,⋯,analysis_state,analysis_submitter_id,analysis_workflow_link,analysis_workflow_type,analysis_workflow_version,sample_type,is_ffpe,cases.submitter_id,sample.submitter_id,options
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<lgl>,<chr>,<chr>,<list>
ccfd07e7-1a49-4f92-8051-1439925768ef,IDAT,TCGA-12-5301-01A-01D-1481-05,open,228a0ed2-736a-4ef5-8803-55608b2f68ed_noid_Red.idat,Red,21ec19c3-93cb-4e2e-af4d-98e2f48df0ad,DNA Methylation,masked_methylation_array,Illumina Human Methylation 450,⋯,released,228a0ed2-736a-4ef5-8803-55608b2f68ed,https://github.com/NCI-GDC/sesame-cwl/blob/7732f58193690d65dc535c9b2a4f1071113884c5/workflows/gdc_sesame_workflow.cwl,SeSAMe Methylation Beta Estimation,7732f58193690d65dc535c9b2a4f1071113884c5,Primary Tumor,,TCGA-12-5301,TCGA-12-5301-01A,TRUE
06be9906-bb51-467e-8425-cf85f0fb3d49,IDAT,TCGA-28-5209-01A-01D-1481-05,open,26dc8f0f-89fa-444f-b8ad-c003858fc44d_noid_Red.idat,Red,0d4a3194-b34a-4d7b-a9bd-819046c93878,DNA Methylation,masked_methylation_array,Illumina Human Methylation 450,⋯,released,26dc8f0f-89fa-444f-b8ad-c003858fc44d,https://github.com/NCI-GDC/sesame-cwl/blob/7732f58193690d65dc535c9b2a4f1071113884c5/workflows/gdc_sesame_workflow.cwl,SeSAMe Methylation Beta Estimation,7732f58193690d65dc535c9b2a4f1071113884c5,Primary Tumor,,TCGA-28-5209,TCGA-28-5209-01A,TRUE
0e7a6088-9eff-4863-85ad-cc544d71e669,IDAT,TCGA-76-6286-01A-11D-1844-05,open,c56d4813-ec8e-48f3-943e-4a61be10a1e8_noid_Grn.idat,Green,cb587a76-d60b-4994-9a7d-aa61a3f1c62d,DNA Methylation,masked_methylation_array,Illumina Human Methylation 450,⋯,released,c56d4813-ec8e-48f3-943e-4a61be10a1e8,https://github.com/NCI-GDC/sesame-cwl/blob/7732f58193690d65dc535c9b2a4f1071113884c5/workflows/gdc_sesame_workflow.cwl,SeSAMe Methylation Beta Estimation,7732f58193690d65dc535c9b2a4f1071113884c5,Primary Tumor,,TCGA-76-6286,TCGA-76-6286-01A,5
f0f7c26d-0235-43e3-8c16-309e1cfcfc51,IDAT,TCGA-06-6701-01A-11D-1844-05,open,d5863779-fbd0-4c36-8b1e-31bbddb9fb12_noid_Red.idat,Red,4c027231-6d43-4c18-adb0-20b5aa870daf,DNA Methylation,masked_methylation_array,Illumina Human Methylation 450,⋯,released,d5863779-fbd0-4c36-8b1e-31bbddb9fb12,https://github.com/NCI-GDC/sesame-cwl/blob/7732f58193690d65dc535c9b2a4f1071113884c5/workflows/gdc_sesame_workflow.cwl,SeSAMe Methylation Beta Estimation,7732f58193690d65dc535c9b2a4f1071113884c5,Primary Tumor,,TCGA-06-6701,TCGA-06-6701-01A,TRUE
977a3ade-812d-44b8-bd01-c4cfe8630763,IDAT,TCGA-06-6701-01A-11D-1844-05,open,d5863779-fbd0-4c36-8b1e-31bbddb9fb12_noid_Grn.idat,Green,f83cc414-99be-483c-8c02-6f3a45366270,DNA Methylation,masked_methylation_array,Illumina Human Methylation 450,⋯,released,d5863779-fbd0-4c36-8b1e-31bbddb9fb12,https://github.com/NCI-GDC/sesame-cwl/blob/7732f58193690d65dc535c9b2a4f1071113884c5/workflows/gdc_sesame_workflow.cwl,SeSAMe Methylation Beta Estimation,7732f58193690d65dc535c9b2a4f1071113884c5,Primary Tumor,,TCGA-06-6701,TCGA-06-6701-01A,TRUE
25453af6-229e-430a-bfe9-4ee13ee32d26,IDAT,TCGA-19-5954-01A-11D-1697-05,open,d01ed9e7-57d7-400b-933d-8306dd60fcf7_noid_Grn.idat,Green,75f4b2e0-5e59-4c0d-9d1d-0957f39a6571,DNA Methylation,masked_methylation_array,Illumina Human Methylation 450,⋯,released,d01ed9e7-57d7-400b-933d-8306dd60fcf7,https://github.com/NCI-GDC/sesame-cwl/blob/7732f58193690d65dc535c9b2a4f1071113884c5/workflows/gdc_sesame_workflow.cwl,SeSAMe Methylation Beta Estimation,7732f58193690d65dc535c9b2a4f1071113884c5,Primary Tumor,,TCGA-19-5954,TCGA-19-5954-01A,5
c03d0407-8543-45e1-b1d5-1fe00bf5f080,IDAT,TCGA-19-5954-01A-11D-1697-05,open,d01ed9e7-57d7-400b-933d-8306dd60fcf7_noid_Red.idat,Red,1fa75a74-59d3-495f-9bd0-33f857829a5c,DNA Methylation,masked_methylation_array,Illumina Human Methylation 450,⋯,released,d01ed9e7-57d7-400b-933d-8306dd60fcf7,https://github.com/NCI-GDC/sesame-cwl/blob/7732f58193690d65dc535c9b2a4f1071113884c5/workflows/gdc_sesame_workflow.cwl,SeSAMe Methylation Beta Estimation,7732f58193690d65dc535c9b2a4f1071113884c5,Primary Tumor,,TCGA-19-5954,TCGA-19-5954-01A,TRUE
98e8e240-1db4-4c58-b894-826d30757a5b,IDAT,TCGA-RR-A6KA-01A-21D-A33U-05,open,d1d732f0-794d-4bae-893b-915246e4c709_noid_Grn.idat,Green,f56da61c-2656-4469-b229-8c0a53ba6e8b,DNA Methylation,masked_methylation_array,Illumina Human Methylation 450,⋯,released,d1d732f0-794d-4bae-893b-915246e4c709,https://github.com/NCI-GDC/sesame-cwl/blob/7732f58193690d65dc535c9b2a4f1071113884c5/workflows/gdc_sesame_workflow.cwl,SeSAMe Methylation Beta Estimation,7732f58193690d65dc535c9b2a4f1071113884c5,Primary Tumor,,TCGA-RR-A6KA,TCGA-RR-A6KA-01A,TRUE
dd64cea8-33e5-4ac3-8ee1-a1a0ee66f992,IDAT,TCGA-19-5947-01A-11D-1697-05,open,e2577575-fcf2-4362-8523-a593a82960b0_noid_Grn.idat,Green,f09587ea-d5bc-4b37-a3a1-8a5ab1cc1db3,DNA Methylation,masked_methylation_array,Illumina Human Methylation 450,⋯,released,e2577575-fcf2-4362-8523-a593a82960b0,https://github.com/NCI-GDC/sesame-cwl/blob/7732f58193690d65dc535c9b2a4f1071113884c5/workflows/gdc_sesame_workflow.cwl,SeSAMe Methylation Beta Estimation,7732f58193690d65dc535c9b2a4f1071113884c5,Primary Tumor,,TCGA-19-5947,TCGA-19-5947-01A,5
f78cc9e4-4c83-4519-9117-4bbff43974ef,IDAT,TCGA-19-5947-01A-11D-1697-05,open,e2577575-fcf2-4362-8523-a593a82960b0_noid_Red.idat,Red,21d31e39-bd37-4d3b-b554-f6b41a28a37c,DNA Methylation,masked_methylation_array,Illumina Human Methylation 450,⋯,released,e2577575-fcf2-4362-8523-a593a82960b0,https://github.com/NCI-GDC/sesame-cwl/blob/7732f58193690d65dc535c9b2a4f1071113884c5/workflows/gdc_sesame_workflow.cwl,SeSAMe Methylation Beta Estimation,7732f58193690d65dc535c9b2a4f1071113884c5,Primary Tumor,,TCGA-19-5947,TCGA-19-5947-01A,TRUE


The data table featuring the gene expression data files after removing the files for which the disease type is not reported:

id,data_format,cases,access,file_name,submitter_id,data_category,type,file_size,created_datetime,⋯,analysis_state,analysis_submitter_id,analysis_workflow_link,analysis_workflow_type,analysis_workflow_version,sample_type,is_ffpe,cases.submitter_id,sample.submitter_id,options
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<int>,<chr>,⋯,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<lgl>,<chr>,<chr>,<list>
ddb16b8d-9ad8-4c66-a986-513d3507b26d,TSV,TCGA-06-0156-01A-03R-1849-01,open,1076483a-e462-47e5-a47a-ca9544548dff.rna_seq.augmented_star_gene_counts.tsv,601d6500-64b0-43ab-b93f-af78fddb3591,Transcriptome Profiling,gene_expression,4249454,2021-12-13T17:00:58.155882-06:00,⋯,released,1076483a-e462-47e5-a47a-ca9544548dff_star__counts,https://github.com/NCI-GDC/gdc-rnaseq-cwl/blob/5d8c131bbff59fb0c969217fc1d44e6d1503cd1f/rnaseq-star-align/star2pass.rnaseq_harmonization.cwl,STAR - Counts,5d8c131bbff59fb0c969217fc1d44e6d1503cd1f,Primary Tumor,,TCGA-06-0156,TCGA-06-0156-01A,TRUE
7b5011b5-5816-4c25-ad72-31152b70b9a0,TSV,TCGA-32-4213-01A-01R-1850-01,open,954911a5-d0a0-42bb-842c-1dbd9ee60c6c.rna_seq.augmented_star_gene_counts.tsv,a4219b59-5018-462e-b8c6-44fdae703ded,Transcriptome Profiling,gene_expression,4209977,2021-12-13T17:02:30.106708-06:00,⋯,released,954911a5-d0a0-42bb-842c-1dbd9ee60c6c_star__counts,https://github.com/NCI-GDC/gdc-rnaseq-cwl/blob/5d8c131bbff59fb0c969217fc1d44e6d1503cd1f/rnaseq-star-align/star2pass.rnaseq_harmonization.cwl,STAR - Counts,5d8c131bbff59fb0c969217fc1d44e6d1503cd1f,Primary Tumor,,TCGA-32-4213,TCGA-32-4213-01A,5
dfbc7136-f7f7-4c2f-93a1-c46b2247cce5,TSV,TCGA-32-2634-01A-01R-1850-01,open,7a25193c-a87d-4929-b724-3d89553fa028.rna_seq.augmented_star_gene_counts.tsv,dfc5c954-00be-4dcf-adc2-455f03f1abff,Transcriptome Profiling,gene_expression,4262595,2021-12-13T17:02:45.822798-06:00,⋯,released,7a25193c-a87d-4929-b724-3d89553fa028_star__counts,https://github.com/NCI-GDC/gdc-rnaseq-cwl/blob/5d8c131bbff59fb0c969217fc1d44e6d1503cd1f/rnaseq-star-align/star2pass.rnaseq_harmonization.cwl,STAR - Counts,5d8c131bbff59fb0c969217fc1d44e6d1503cd1f,Primary Tumor,,TCGA-32-2634,TCGA-32-2634-01A,TRUE
bd40ca45-33da-4798-a3bd-7640c784d6a9,TSV,TCGA-06-0187-01A-01R-1849-01,open,c2bc46d6-7065-44ff-a21a-7a3676990fdb.rna_seq.augmented_star_gene_counts.tsv,c965fc0b-01c9-4afa-8c2b-907f8aa39d0e,Transcriptome Profiling,gene_expression,4236802,2021-12-13T16:58:17.287350-06:00,⋯,released,c2bc46d6-7065-44ff-a21a-7a3676990fdb_star__counts,https://github.com/NCI-GDC/gdc-rnaseq-cwl/blob/5d8c131bbff59fb0c969217fc1d44e6d1503cd1f/rnaseq-star-align/star2pass.rnaseq_harmonization.cwl,STAR - Counts,5d8c131bbff59fb0c969217fc1d44e6d1503cd1f,Primary Tumor,,TCGA-06-0187,TCGA-06-0187-01A,TRUE
7888f7b7-38da-4cc7-b542-836b193adfbf,TSV,TCGA-12-3652-01A-01R-1849-01,open,75615deb-e120-44f6-a585-1ca07987b7ef.rna_seq.augmented_star_gene_counts.tsv,147ee350-7db0-4fa5-b1ee-72d743fa94ee,Transcriptome Profiling,gene_expression,4228972,2021-12-13T17:06:16.439520-06:00,⋯,released,75615deb-e120-44f6-a585-1ca07987b7ef_star__counts,https://github.com/NCI-GDC/gdc-rnaseq-cwl/blob/5d8c131bbff59fb0c969217fc1d44e6d1503cd1f/rnaseq-star-align/star2pass.rnaseq_harmonization.cwl,STAR - Counts,5d8c131bbff59fb0c969217fc1d44e6d1503cd1f,Primary Tumor,,TCGA-12-3652,TCGA-12-3652-01A,5
20ccfe17-ec2c-433f-a87c-07018a3a7522,TSV,TCGA-06-0156-01A-02R-1849-01,open,9fefe100-9765-4b8a-8468-51df066f0f0c.rna_seq.augmented_star_gene_counts.tsv,d5489a43-3f38-4f5e-b136-abf11c9b3280,Transcriptome Profiling,gene_expression,4251212,2021-12-13T17:00:08.122310-06:00,⋯,released,9fefe100-9765-4b8a-8468-51df066f0f0c_star__counts,https://github.com/NCI-GDC/gdc-rnaseq-cwl/blob/5d8c131bbff59fb0c969217fc1d44e6d1503cd1f/rnaseq-star-align/star2pass.rnaseq_harmonization.cwl,STAR - Counts,5d8c131bbff59fb0c969217fc1d44e6d1503cd1f,Primary Tumor,,TCGA-06-0156,TCGA-06-0156-01A,TRUE
f93cd112-520d-408d-a15d-3451f2bacdfd,TSV,TCGA-26-5134-01A-01R-1850-01,open,a8050792-d0f5-4b8f-aea2-0a303f40b1ca.rna_seq.augmented_star_gene_counts.tsv,be3905a4-6bc4-44b7-8d22-a0a671f7802e,Transcriptome Profiling,gene_expression,4206919,2021-12-13T16:57:27.624883-06:00,⋯,released,a8050792-d0f5-4b8f-aea2-0a303f40b1ca_star__counts,https://github.com/NCI-GDC/gdc-rnaseq-cwl/blob/5d8c131bbff59fb0c969217fc1d44e6d1503cd1f/rnaseq-star-align/star2pass.rnaseq_harmonization.cwl,STAR - Counts,5d8c131bbff59fb0c969217fc1d44e6d1503cd1f,Primary Tumor,,TCGA-26-5134,TCGA-26-5134-01A,TRUE
e1f33468-f9c4-4b1a-8cd5-b2c77cc80dc9,TSV,TCGA-32-2638-01A-01R-1850-01,open,5cafc8c0-a92c-407a-ad52-a2d9562c0bfa.rna_seq.augmented_star_gene_counts.tsv,c0c2932e-8349-4f9a-aa93-b66d96161a90,Transcriptome Profiling,gene_expression,4250273,2021-12-13T16:59:17.535636-06:00,⋯,released,5cafc8c0-a92c-407a-ad52-a2d9562c0bfa_star__counts,https://github.com/NCI-GDC/gdc-rnaseq-cwl/blob/5d8c131bbff59fb0c969217fc1d44e6d1503cd1f/rnaseq-star-align/star2pass.rnaseq_harmonization.cwl,STAR - Counts,5d8c131bbff59fb0c969217fc1d44e6d1503cd1f,Primary Tumor,,TCGA-32-2638,TCGA-32-2638-01A,5
6a0df69c-6211-4db9-a278-fc7b2a2bb92c,TSV,TCGA-06-0882-01A-01R-1849-01,open,7412a444-9fad-4d96-a4d4-2c270045919a.rna_seq.augmented_star_gene_counts.tsv,d8d4338b-b3be-4e88-8ea5-0df30d66be41,Transcriptome Profiling,gene_expression,4248718,2021-12-13T17:04:50.788894-06:00,⋯,released,7412a444-9fad-4d96-a4d4-2c270045919a_star__counts,https://github.com/NCI-GDC/gdc-rnaseq-cwl/blob/5d8c131bbff59fb0c969217fc1d44e6d1503cd1f/rnaseq-star-align/star2pass.rnaseq_harmonization.cwl,STAR - Counts,5d8c131bbff59fb0c969217fc1d44e6d1503cd1f,Primary Tumor,,TCGA-06-0882,TCGA-06-0882-01A,TRUE
a04f2c22-87db-420d-b673-33fd9e48c89e,TSV,TCGA-06-0168-01A-01R-1849-01,open,075029c6-31b2-4b4a-9945-b905fc6bf730.rna_seq.augmented_star_gene_counts.tsv,efeb3596-b9d7-4d46-990e-303e0443e70c,Transcriptome Profiling,gene_expression,4202959,2021-12-13T17:06:01.947579-06:00,⋯,released,075029c6-31b2-4b4a-9945-b905fc6bf730_star__counts,https://github.com/NCI-GDC/gdc-rnaseq-cwl/blob/5d8c131bbff59fb0c969217fc1d44e6d1503cd1f/rnaseq-star-align/star2pass.rnaseq_harmonization.cwl,STAR - Counts,5d8c131bbff59fb0c969217fc1d44e6d1503cd1f,Primary Tumor,,TCGA-06-0168,TCGA-06-0168-01A,5


### Checking accessibility

Before we can check which cases (patients) have both methylation and gene expression data files, we need to filter out the files that are not openly available for download. Such files have the value 'controlled' rather than 'open' in the 'access' column of the data tables. Since we do not have access to controlled files, they would need to be removed before we download the data later on.

Whether any such files exist can be checked with the function 'all()', which only returns TRUE when every single file has 'open' access. The check is performed for both sets of data files.

In [11]:
methylation_all_open <- all(methylation_data_files$access == "open")
gene_expression_all_open <- all(gene_expression_data_files$access == "open")

if (methylation_all_open) {
    cat("All of the files in the 'methylation_data_files' have open access.\n")
} else {
    cat("Not all of the files in the 'methylation_data_files' have open access, some of them are controlled.\n")
}

if (gene_expression_all_open) {
    cat("All of the files in the 'gene_expression_data_files' have open access.\n")
} else {
    cat("Not all of the files in the 'gene_expression_data_files' have open access, some of them are controlled.\n")
}

All of the files in the 'methylation_data_files' have open access.
All of the files in the 'gene_expression_data_files' have open access.


As we can see, all of the files in both sets of data files have open access, so we do not need to remove any of them.
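
Had some of the files been controlled, they could have been dropped with a logical filter on the 'access' column. The following is only a minimal sketch on a made-up table: the data frame 'toy_files' and its values are hypothetical and merely mirror the 'access' column of the real data tables.

```r
# Hypothetical file table; only the 'access' column mirrors the real data tables.
toy_files <- data.frame(
    file_name = c("a.idat", "b.idat", "c.tsv"),
    access = c("open", "controlled", "open"),
    stringsAsFactors = FALSE
)

# all() is only TRUE if every element is TRUE, so a single controlled file makes it FALSE.
toy_all_open <- all(toy_files$access == "open")

# Keeping only the rows of which the access is 'open'.
toy_open_files <- toy_files[toy_files$access == "open", ]

print(toy_all_open)
print(toy_open_files$file_name)
```

Filtering on the column rather than hard-coding file names keeps the step valid if the portal changes the access level of a file.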

### Finding common patients

The next step is to find which patients have both methylation and gene expression data available, since the machine learning part of the thesis needs to relate the methylation data to the gene expression data of the same patient.

To achieve this, the function 'intersect()' can be called, which returns the values that occur in both of its arguments. Here, both arguments are the 'cases.submitter_id' columns of the two data tables, which contain the patient id associated with the file of each record. The call thus finds the common patient ids, which are printed below.

In [12]:
# Retrieving the patient ids which appear in both the methylation and gene expression data files.
common_patient_ids <- intersect(methylation_data_files$cases.submitter_id, gene_expression_data_files$cases.submitter_id)

# Displaying them and how many there exist.
cat("The patient ids which appear in both the methylation and the gene expression data files:\n")
print(common_patient_ids)
cat("\nThe number of patient ids which appear in both the methylation and the gene expression data files:\n")
cat(length(common_patient_ids))

The patient ids which appear in both the methylation and the gene expression data files:
 [1] "TCGA-28-5209" "TCGA-26-5134" "TCGA-06-AABW" "TCGA-28-5204" "TCGA-06-0221"
 [6] "TCGA-14-1034" "TCGA-32-5222" "TCGA-76-4928" "TCGA-14-0736" "TCGA-06-0152"
[11] "TCGA-12-5295" "TCGA-26-1442" "TCGA-06-0210" "TCGA-32-1980" "TCGA-06-5412"
[16] "TCGA-06-5856" "TCGA-76-4929" "TCGA-41-5651" "TCGA-26-5139" "TCGA-28-5218"
[21] "TCGA-06-5417" "TCGA-06-0125" "TCGA-76-4931" "TCGA-28-5215" "TCGA-28-5207"
[26] "TCGA-19-1389" "TCGA-26-5136" "TCGA-26-5132" "TCGA-06-5416" "TCGA-06-5411"
[31] "TCGA-06-0171" "TCGA-06-0211" "TCGA-15-1444" "TCGA-12-5299" "TCGA-06-5418"
[36] "TCGA-06-5859" "TCGA-28-5208" "TCGA-14-0781" "TCGA-19-4065" "TCGA-19-0957"
[41] "TCGA-14-1402" "TCGA-06-5858" "TCGA-76-4926" "TCGA-06-5408" "TCGA-76-4927"
[46] "TCGA-06-1804" "TCGA-76-4932" "TCGA-28-5216" "TCGA-26-5133" "TCGA-06-5415"
[51] "TCGA-28-2510" "TCGA-06-0190" "TCGA-76-4925" "TCGA-06-5410" "TCGA-19-5960"
[56] "TCGA-06-5413" "TCGA-06-54

As we can see from the output above, both methylation data files and gene expression data files are available for a total of 60 different patients.
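
To illustrate how 'intersect()' behaves when a patient id occurs in several files, here is a small sketch with made-up ids (the vectors 'toy_methylation_ids' and 'toy_expression_ids' are hypothetical). It also uses 'setdiff()' to list the ids that only occur in one of the two vectors.

```r
# Hypothetical patient ids; repeated ids stand for patients with several files.
toy_methylation_ids <- c("P1", "P1", "P2", "P3")
toy_expression_ids <- c("P2", "P3", "P3", "P4")

# intersect() discards duplicates, so each common patient id is returned once.
toy_common_ids <- intersect(toy_methylation_ids, toy_expression_ids)

# setdiff() returns the ids which only occur in the first vector.
toy_methylation_only_ids <- setdiff(toy_methylation_ids, toy_expression_ids)

print(toy_common_ids)
print(toy_methylation_only_ids)
```

Because duplicates are discarded, the length of the result counts patients rather than files.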

### Retrieving the files of patients who have both types of files

Now that we know which patients appear in both the methylation and gene expression data files, we want to retrieve, from both data tables we defined before, the files of which the patient id appears in the list of common patient ids.

To achieve this, we define new queries for both data types. These are the same queries as before, except that the 'barcode' parameter is now included and set equal to 'common_patient_ids'.

In [13]:
# The query to retrieve the methylation data where only the patients appearing in the 'common_patient_ids' are included.
query_methylation_data_common <- GDCquery(
    project = "TCGA-GBM",
    data.category = "DNA Methylation",
    legacy = FALSE,
    platform = "Illumina Human Methylation 450",
    data.type = "Masked Intensities",
    barcode = common_patient_ids
)

--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

--------------------

ooo By platform

ooo By data.type

ooo By barcode

----------------

oo Checking data

----------------

ooo Checking if there are duplicated cases


ooo Checking if there are results for the query

-------------------

o Preparing output

-------------------



In [14]:
# The query to retrieve the gene expression data where only the patients appearing in the 'common_patient_ids' are included.
query_expression_data_common <- GDCquery(
    project = "TCGA-GBM",
    data.category = "Transcriptome Profiling",
    data.type = "Gene Expression Quantification", 
    workflow.type = "STAR - Counts",
    barcode = common_patient_ids
)

--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

--------------------

ooo By data.type

ooo By workflow.type

ooo By barcode

----------------

oo Checking data

----------------

ooo Checking if there are duplicated cases

ooo Checking if there are results for the query

-------------------

o Preparing output

-------------------



### Collecting the data tables after filtering out non-common patients

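Now that we have redefined the queries to only include files of which the patient ids appear in the 'common_patient_ids', we can retrieve all of the files including some additional metadata by passing the results of each query to the function 'data.table()'. The first parameter denotes which query results should be converted into a data table. Note that 'data.table()' has no 'options' parameter that controls how the table is displayed: the list passed as 'options' is instead stored as an extra list column named 'options', whose three values (TRUE, TRUE and 5) are recycled over the rows. This explains the 'options' column in the tables below and the recycling warning raised for the gene expression data, as its 67 rows are not a multiple of three.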

In [15]:
# Retrieving the methylation data files of which the patient ids appear in the 'common_patient_ids'.
methylation_data_files_common = data.table(
    getResults(query_methylation_data_common),
    options = list(scrollX = TRUE, keys = TRUE, pageLength = 5)
)

In [16]:
# Retrieving the gene expression data files of which the patient ids appear in the 'common_patient_ids'.
gene_expression_data_files_common = data.table(
    getResults(query_expression_data_common),
    options = list(scrollX = TRUE, keys = TRUE, pageLength = 5)
)

"Item 2 has 3 rows but longest item has 67; recycled with remainder."


We can now display both of these data tables.

In [17]:
cat("The data table featuring the methylation data files of which the patient ids appear in the 'common_patient_ids':")
methylation_data_files_common

The data table featuring the methylation data files of which the patient ids appear in the 'common_patient_ids':

In [18]:
cat("The data table featuring the gene expression data files of which the patient ids appear in the 'common_patient_ids':")
gene_expression_data_files_common

The data table featuring the gene expression data files of which the patient ids appear in the 'common_patient_ids':

id,data_format,cases,access,file_name,submitter_id,data_category,type,file_size,created_datetime,⋯,analysis_state,analysis_submitter_id,analysis_workflow_link,analysis_workflow_type,analysis_workflow_version,sample_type,is_ffpe,cases.submitter_id,sample.submitter_id,options
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<int>,<chr>,⋯,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<lgl>,<chr>,<chr>,<list>
9c73dcae-262a-4473-a999-3ec148b733ed,TSV,TCGA-28-5209-01A-01R-1850-01,open,79b783ac-103f-47a6-bc4b-8498a0be46ac.rna_seq.augmented_star_gene_counts.tsv,820a9a50-b6f6-4e6c-a497-c7dc1e514b33,Transcriptome Profiling,gene_expression,4242247,2021-12-13T16:57:50.107914-06:00,⋯,released,79b783ac-103f-47a6-bc4b-8498a0be46ac_star__counts,https://github.com/NCI-GDC/gdc-rnaseq-cwl/blob/5d8c131bbff59fb0c969217fc1d44e6d1503cd1f/rnaseq-star-align/star2pass.rnaseq_harmonization.cwl,STAR - Counts,5d8c131bbff59fb0c969217fc1d44e6d1503cd1f,Primary Tumor,,TCGA-28-5209,TCGA-28-5209-01A,TRUE
f93cd112-520d-408d-a15d-3451f2bacdfd,TSV,TCGA-26-5134-01A-01R-1850-01,open,a8050792-d0f5-4b8f-aea2-0a303f40b1ca.rna_seq.augmented_star_gene_counts.tsv,be3905a4-6bc4-44b7-8d22-a0a671f7802e,Transcriptome Profiling,gene_expression,4206919,2021-12-13T16:57:27.624883-06:00,⋯,released,a8050792-d0f5-4b8f-aea2-0a303f40b1ca_star__counts,https://github.com/NCI-GDC/gdc-rnaseq-cwl/blob/5d8c131bbff59fb0c969217fc1d44e6d1503cd1f/rnaseq-star-align/star2pass.rnaseq_harmonization.cwl,STAR - Counts,5d8c131bbff59fb0c969217fc1d44e6d1503cd1f,Primary Tumor,,TCGA-26-5134,TCGA-26-5134-01A,TRUE
477a4ae1-84c0-49f2-b3b3-4015b8c26f18,TSV,TCGA-06-AABW-11A-31R-A36H-07,open,4250878d-ba0b-47e1-96c6-f2331f93ce37.rna_seq.augmented_star_gene_counts.tsv,76effa8e-5d82-4893-a30e-720e6e273a78,Transcriptome Profiling,gene_expression,4237877,2021-12-13T17:01:33.058225-06:00,⋯,released,4250878d-ba0b-47e1-96c6-f2331f93ce37_star__counts,https://github.com/NCI-GDC/gdc-rnaseq-cwl/blob/5d8c131bbff59fb0c969217fc1d44e6d1503cd1f/rnaseq-star-align/star2pass.rnaseq_harmonization.cwl,STAR - Counts,5d8c131bbff59fb0c969217fc1d44e6d1503cd1f,Solid Tissue Normal,,TCGA-06-AABW,TCGA-06-AABW-11A,5
5a2427b9-3e21-4ae3-bde7-06efcdf2bea3,TSV,TCGA-28-5204-01A-01R-1850-01,open,b45c00da-6d61-4b10-befa-dc1123f955a9.rna_seq.augmented_star_gene_counts.tsv,a959d266-fa89-44d5-b1d9-0de78c677fcc,Transcriptome Profiling,gene_expression,4240477,2021-12-13T17:01:57.463417-06:00,⋯,released,b45c00da-6d61-4b10-befa-dc1123f955a9_star__counts,https://github.com/NCI-GDC/gdc-rnaseq-cwl/blob/5d8c131bbff59fb0c969217fc1d44e6d1503cd1f/rnaseq-star-align/star2pass.rnaseq_harmonization.cwl,STAR - Counts,5d8c131bbff59fb0c969217fc1d44e6d1503cd1f,Primary Tumor,,TCGA-28-5204,TCGA-28-5204-01A,TRUE
d8b840c7-e6cd-4cf9-824f-131fb18bf3b4,TSV,TCGA-06-0221-02A-11R-2005-01,open,08dce278-ba8d-4ba0-90a1-e40820fd1740.rna_seq.augmented_star_gene_counts.tsv,03ffbe63-bb1c-4ddf-8d87-a17ffb258ba1,Transcriptome Profiling,gene_expression,4249712,2021-12-13T17:05:47.511774-06:00,⋯,released,08dce278-ba8d-4ba0-90a1-e40820fd1740_star__counts,https://github.com/NCI-GDC/gdc-rnaseq-cwl/blob/5d8c131bbff59fb0c969217fc1d44e6d1503cd1f/rnaseq-star-align/star2pass.rnaseq_harmonization.cwl,STAR - Counts,5d8c131bbff59fb0c969217fc1d44e6d1503cd1f,Recurrent Tumor,,TCGA-06-0221,TCGA-06-0221-02A,TRUE
7f21d9fd-fd8a-44ec-b01f-93ded2b6babf,TSV,TCGA-14-1034-01A-01R-1849-01,open,64d208b9-6a15-4077-885f-30d9cdedf147.rna_seq.augmented_star_gene_counts.tsv,0513a62a-3635-4ce3-b5a8-071ad7ba0831,Transcriptome Profiling,gene_expression,4229310,2021-12-13T16:54:58.730301-06:00,⋯,released,64d208b9-6a15-4077-885f-30d9cdedf147_star__counts,https://github.com/NCI-GDC/gdc-rnaseq-cwl/blob/5d8c131bbff59fb0c969217fc1d44e6d1503cd1f/rnaseq-star-align/star2pass.rnaseq_harmonization.cwl,STAR - Counts,5d8c131bbff59fb0c969217fc1d44e6d1503cd1f,Primary Tumor,,TCGA-14-1034,TCGA-14-1034-01A,5
87f40fda-2a48-4cb6-8b65-e1b043661cce,TSV,TCGA-14-1034-02B-01R-2005-01,open,a400f64e-ec00-4cc0-8110-38b2e349c7cf.rna_seq.augmented_star_gene_counts.tsv,a34feb57-9c1c-4e83-8928-c7930d631c3b,Transcriptome Profiling,gene_expression,4232820,2021-12-13T16:57:13.598988-06:00,⋯,released,a400f64e-ec00-4cc0-8110-38b2e349c7cf_star__counts,https://github.com/NCI-GDC/gdc-rnaseq-cwl/blob/5d8c131bbff59fb0c969217fc1d44e6d1503cd1f/rnaseq-star-align/star2pass.rnaseq_harmonization.cwl,STAR - Counts,5d8c131bbff59fb0c969217fc1d44e6d1503cd1f,Recurrent Tumor,,TCGA-14-1034,TCGA-14-1034-02B,TRUE
20db3a49-ca35-4562-bc96-7e98c381830f,TSV,TCGA-32-5222-01A-01R-1850-01,open,1a61593f-909e-46f4-8cfa-9e23bc050033.rna_seq.augmented_star_gene_counts.tsv,08eb7aae-bb31-4858-aba4-d6cf278f6945,Transcriptome Profiling,gene_expression,4244005,2021-12-13T16:54:56.031582-06:00,⋯,released,1a61593f-909e-46f4-8cfa-9e23bc050033_star__counts,https://github.com/NCI-GDC/gdc-rnaseq-cwl/blob/5d8c131bbff59fb0c969217fc1d44e6d1503cd1f/rnaseq-star-align/star2pass.rnaseq_harmonization.cwl,STAR - Counts,5d8c131bbff59fb0c969217fc1d44e6d1503cd1f,Primary Tumor,,TCGA-32-5222,TCGA-32-5222-01A,TRUE
3bd5dd10-0de9-495a-a94a-c57ecd63c097,TSV,TCGA-76-4928-01B-01R-1850-01,open,1069403e-ef6a-43d4-b62d-3249e5a8e034.rna_seq.augmented_star_gene_counts.tsv,79d42e1d-0229-4932-b2bc-98dffeb01147,Transcriptome Profiling,gene_expression,4247314,2021-12-13T17:03:18.432380-06:00,⋯,released,1069403e-ef6a-43d4-b62d-3249e5a8e034_star__counts,https://github.com/NCI-GDC/gdc-rnaseq-cwl/blob/5d8c131bbff59fb0c969217fc1d44e6d1503cd1f/rnaseq-star-align/star2pass.rnaseq_harmonization.cwl,STAR - Counts,5d8c131bbff59fb0c969217fc1d44e6d1503cd1f,Primary Tumor,,TCGA-76-4928,TCGA-76-4928-01B,5
c66cf9f2-a9c2-4b4b-8281-ac1593e87246,TSV,TCGA-14-0736-02A-01R-2005-01,open,a261b442-2909-4aae-8c65-7a16b1f7d905.rna_seq.augmented_star_gene_counts.tsv,98a5d06b-23d5-47ad-a739-46d56a74c4ec,Transcriptome Profiling,gene_expression,4222639,2021-12-13T16:55:15.721391-06:00,⋯,released,a261b442-2909-4aae-8c65-7a16b1f7d905_star__counts,https://github.com/NCI-GDC/gdc-rnaseq-cwl/blob/5d8c131bbff59fb0c969217fc1d44e6d1503cd1f/rnaseq-star-align/star2pass.rnaseq_harmonization.cwl,STAR - Counts,5d8c131bbff59fb0c969217fc1d44e6d1503cd1f,Recurrent Tumor,,TCGA-14-0736,TCGA-14-0736-02A,TRUE


### Verifying whether each sample within the 'methylation_data_files_common' occurs at least twice

We expect that the methylation data for each sample is split into two files, since there is one file for the green channel and one file for the red channel (we do not need to check this for the gene expression data as each sample is captured by only a single file). To verify this, we can call the function 'table()' to retrieve the number of occurrences of each 'analysis_submitter_id' within the 'methylation_data_files_common' data table. Since the two channel files of a sample are submitted together, they share the same 'analysis_submitter_id', so we expect every count to equal two. We can therefore loop over every element in the table and verify that its count equals two and that the two files consist of one green channel file and one red channel file.

In [19]:
# Retrieving the number of occurrences within the 'methylation_data_files_common' data table containing the files for each 
# of the 'analysis_submitter_id'.
methylation_analysis_submitter_id_counts <- table(methylation_data_files_common$analysis_submitter_id)

# Looping over every 'analysis_submitter_id' and verifying whether it has exactly two files which are made up out of one 
# green channel file and one red channel file. The loop variable is deliberately not named 'analysis_submitter_id': inside 
# the data table brackets that name refers to the column, so the comparison would otherwise be TRUE for every row.
for (current_analysis_submitter_id in names(methylation_analysis_submitter_id_counts)) {
  sample_channels <- methylation_data_files_common[analysis_submitter_id == current_analysis_submitter_id]$channel
  if (length(sample_channels) != 2 || !setequal(sample_channels, c("Green", "Red"))) {
    stop("Not every sample is split into two files")
  }
}

cat("Every sample is split into two files: a green channel file and a red channel file")

Every sample is split into two files: a green channel file and a red channel file
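
The same kind of check can be expressed without an explicit loop. The sketch below uses base R on a made-up table (the data frame 'toy_methylation_files' and the helper 'has_green_and_red' are hypothetical) and reports which ids fail the check instead of stopping at the first failure.

```r
# Hypothetical file table in which sample 's3' wrongly has two green channel files.
toy_methylation_files <- data.frame(
    analysis_submitter_id = c("s1", "s1", "s2", "s2", "s3", "s3"),
    channel = c("Green", "Red", "Red", "Green", "Green", "Green"),
    stringsAsFactors = FALSE
)

# A sample is valid if it has exactly two files: one green and one red channel file.
has_green_and_red <- function(channels) {
    length(channels) == 2 && setequal(channels, c("Green", "Red"))
}

# Applying the check to the channels of every 'analysis_submitter_id'.
toy_pair_is_valid <- tapply(
    toy_methylation_files$channel,
    toy_methylation_files$analysis_submitter_id,
    has_green_and_red
)

# Collecting the ids which do not consist of one green and one red channel file.
toy_invalid_ids <- names(toy_pair_is_valid)[!toy_pair_is_valid]
print(toy_invalid_ids)
```

Returning the failing ids makes it easier to inspect the offending files than a single error message would.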

### Identifying which patients have multiple data files for a single data type

As we can see from the two data tables outputted above, the numbers of rows are not equal: there are 144 files present in the methylation data table (corresponding to 72 samples as each sample consists of two files: one containing the green channel data and one containing the red channel data) and 67 files present in the gene expression data table. This difference in the number of files (or rather in the number of samples) is potentially caused by multiple files of the same data type existing for some patient ids. For example, the gene expression data table above contains both a 'Primary Tumor' and a 'Recurrent Tumor' file for patient TCGA-14-1034, so multiple files for one patient most likely stem from different samples of that patient.

We can find which patients have this property by performing the following steps. First, the function 'table()' is called to retrieve the number of occurrences of each 'cases.submitter_id' within the data table containing the files. Next, we can index the resulting table to find which 'cases.submitter_id' occur more than a certain number of times. For the 'methylation_data_files_common' data table, this number is 2, as two files exist for each sample (one for the green channel and one for the red channel). For the 'gene_expression_data_files_common' data table, this number is 1, as only a single file exists for each sample.

In [20]:
# Retrieving the number of occurrences within the 'methylation_data_files_common' data table containing the files for each 
# of the 'cases.submitter_id'.
methylation_cases_submitter_id_counts <- table(methylation_data_files_common$cases.submitter_id)

# Retrieving the cases for which more than 2 files are present in the 'methylation_data_files_common' data table meaning 
# that multiple samples could be present for these cases.
methylation_cases_more_than_2_files <- methylation_cases_submitter_id_counts[methylation_cases_submitter_id_counts > 2]

cat("Denoting which patient ids have more than 2 files in the 'methylation_data_files_common' data table:\n")
print(names(methylation_cases_more_than_2_files))

Denoting which patient ids have more than 2 files in the 'methylation_data_files_common' data table:
 [1] "TCGA-06-0125" "TCGA-06-0152" "TCGA-06-0171" "TCGA-06-0190" "TCGA-06-0210"
 [6] "TCGA-06-0211" "TCGA-06-0221" "TCGA-14-0736" "TCGA-14-1402" "TCGA-19-0957"
[11] "TCGA-19-1389" "TCGA-19-4065"


In [21]:
# Retrieving the number of occurrences within the 'gene_expression_data_files_common' data table containing the files for 
# each of the 'cases.submitter_id'.
gene_expression_cases_submitter_id_counts <- table(gene_expression_data_files_common$cases.submitter_id)

# Retrieving the cases for which more than 1 file is present in the 'gene_expression_data_files_common' data table meaning 
# that multiple samples could be present for these cases.
gene_expression_cases_more_than_1_file <- gene_expression_cases_submitter_id_counts[gene_expression_cases_submitter_id_counts > 1]

cat("Denoting which patient ids have more than 1 file in the 'gene_expression_data_files_common' data table:\n")
print(names(gene_expression_cases_more_than_1_file))

Denoting which patient ids have more than 1 file in the 'gene_expression_data_files_common' data table:
[1] "TCGA-06-0125" "TCGA-06-0190" "TCGA-06-0210" "TCGA-06-0211" "TCGA-14-1034"
[6] "TCGA-19-4065"
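
The counts above are counts of files, which is why the two data types need different thresholds. As an alternative sketch, the distinct samples per patient can be counted instead, so that the same threshold of 1 applies to both data types. The table 'toy_sample_files' below is made up; only its two column names are taken from the real data tables.

```r
# Hypothetical methylation-like table: every sample contributes two channel files.
toy_sample_files <- data.frame(
    cases.submitter_id = c("P1", "P1", "P1", "P1", "P2", "P2"),
    sample.submitter_id = c("P1-01A", "P1-01A", "P1-02A", "P1-02A", "P2-01A", "P2-01A"),
    stringsAsFactors = FALSE
)

# Dropping repeated (patient, sample) pairs so the two channel files of a sample count once.
toy_unique_samples <- unique(toy_sample_files[, c("cases.submitter_id", "sample.submitter_id")])

# Counting the distinct samples of every patient.
toy_samples_per_patient <- table(toy_unique_samples$cases.submitter_id)

# The patients which have more than one distinct sample.
toy_patients_multiple_samples <- names(toy_samples_per_patient[toy_samples_per_patient > 1])
print(toy_patients_multiple_samples)
```

Note that this counts samples only: several files belonging to one and the same sample would not be flagged by it.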


Next, we can use these patient ids to retrieve all of the files belonging to those patients from the 'methylation_data_files_common' and 'gene_expression_data_files_common' data tables.
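
Selecting all rows that belong to a set of patients can be done either with a loop over the patients or with a single '%in%' filter. The sketch below compares both on a made-up table (the data frame 'toy_patient_files' and the vector 'toy_patients' are hypothetical) and uses base R data frames rather than data tables.

```r
# Hypothetical file table and the patients of interest.
toy_patient_files <- data.frame(
    cases.submitter_id = c("P1", "P2", "P1", "P3"),
    file_name = c("f1", "f2", "f3", "f4"),
    stringsAsFactors = FALSE
)
toy_patients <- c("P3", "P1")

# Loop version: the matching rows of every patient are appended one patient at a time.
toy_loop_rows <- data.frame()
for (patient in toy_patients) {
    toy_loop_rows <- rbind(toy_loop_rows, toy_patient_files[toy_patient_files$cases.submitter_id == patient, ])
}

# Vectorised version: a single filter which keeps the original row order.
toy_in_rows <- toy_patient_files[toy_patient_files$cases.submitter_id %in% toy_patients, ]

print(toy_loop_rows$file_name)
print(toy_in_rows$file_name)
```

Both versions select the same files. They differ in row order: the loop groups the rows per patient, whereas the filter keeps the order of the original table.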

In [22]:
# Retrieving all of the files of the 'methylation_data_files_common' belonging to the patients which have more than 2 files.
methylation_all_duplicate_patient_rows <- data.table()
for (patient in names(methylation_cases_more_than_2_files)) {
    methylation_matching_rows <- methylation_data_files_common[cases.submitter_id == patient]
    methylation_all_duplicate_patient_rows <- rbind(methylation_all_duplicate_patient_rows, methylation_matching_rows)
}
cat("The files of the 'methylation_data_files_common' belonging to the patients which have more than 2 files:")
methylation_all_duplicate_patient_rows

The files of the 'methylation_data_files_common' belonging to the patients which have more than 2 files:

id,data_format,cases,access,file_name,channel,submitter_id,data_category,type,platform,⋯,analysis_state,analysis_submitter_id,analysis_workflow_link,analysis_workflow_type,analysis_workflow_version,sample_type,is_ffpe,cases.submitter_id,sample.submitter_id,options
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<lgl>,<chr>,<chr>,<list>
b1e3ad28-42c6-4ecc-8926-95556d9c0f38,IDAT,TCGA-06-0125-01A-01D-A45W-05,open,964f8f23-7801-412f-981d-5ee34ffc6dd1_noid_Grn.idat,Green,4c0bd9e1-7905-470e-a340-70346d315f50,DNA Methylation,masked_methylation_array,Illumina Human Methylation 450,⋯,released,964f8f23-7801-412f-981d-5ee34ffc6dd1,https://github.com/NCI-GDC/sesame-cwl/blob/7732f58193690d65dc535c9b2a4f1071113884c5/workflows/gdc_sesame_workflow.cwl,SeSAMe Methylation Beta Estimation,7732f58193690d65dc535c9b2a4f1071113884c5,Primary Tumor,,TCGA-06-0125,TCGA-06-0125-01A,5
dabbed4c-2517-4aa1-a0dd-b9eef17f7522,IDAT,TCGA-06-0125-02A-11D-2004-05,open,70d41610-9d05-4101-84c0-ed3280f1656c_noid_Red.idat,Red,5985305d-b9ce-45c6-b12f-daba69fa25fe,DNA Methylation,masked_methylation_array,Illumina Human Methylation 450,⋯,released,70d41610-9d05-4101-84c0-ed3280f1656c,https://github.com/NCI-GDC/sesame-cwl/blob/7732f58193690d65dc535c9b2a4f1071113884c5/workflows/gdc_sesame_workflow.cwl,SeSAMe Methylation Beta Estimation,7732f58193690d65dc535c9b2a4f1071113884c5,Recurrent Tumor,,TCGA-06-0125,TCGA-06-0125-02A,TRUE
2b07e03f-5cc7-49f5-a41e-fa2b7b3a3e88,IDAT,TCGA-06-0125-02A-11D-2004-05,open,70d41610-9d05-4101-84c0-ed3280f1656c_noid_Grn.idat,Green,30799da8-cc98-485b-a7ef-a5335ac951f9,DNA Methylation,masked_methylation_array,Illumina Human Methylation 450,⋯,released,70d41610-9d05-4101-84c0-ed3280f1656c,https://github.com/NCI-GDC/sesame-cwl/blob/7732f58193690d65dc535c9b2a4f1071113884c5/workflows/gdc_sesame_workflow.cwl,SeSAMe Methylation Beta Estimation,7732f58193690d65dc535c9b2a4f1071113884c5,Recurrent Tumor,,TCGA-06-0125,TCGA-06-0125-02A,TRUE
a6e30d93-e50e-4e80-8bb0-c8f700c6534e,IDAT,TCGA-06-0125-01A-01D-A45W-05,open,964f8f23-7801-412f-981d-5ee34ffc6dd1_noid_Red.idat,Red,f075fc05-0b6d-40be-a076-863222013727,DNA Methylation,masked_methylation_array,Illumina Human Methylation 450,⋯,released,964f8f23-7801-412f-981d-5ee34ffc6dd1,https://github.com/NCI-GDC/sesame-cwl/blob/7732f58193690d65dc535c9b2a4f1071113884c5/workflows/gdc_sesame_workflow.cwl,SeSAMe Methylation Beta Estimation,7732f58193690d65dc535c9b2a4f1071113884c5,Primary Tumor,,TCGA-06-0125,TCGA-06-0125-01A,5
2b868a2e-54aa-42c4-80d9-478303d8ffc3,IDAT,TCGA-06-0152-01A-02D-A45W-05,open,15e80be2-4416-4480-a2b3-3c235e9da8ca_noid_Red.idat,Red,84dbff7e-9276-4961-84ed-f6ac53fcff83,DNA Methylation,masked_methylation_array,Illumina Human Methylation 450,⋯,released,15e80be2-4416-4480-a2b3-3c235e9da8ca,https://github.com/NCI-GDC/sesame-cwl/blob/7732f58193690d65dc535c9b2a4f1071113884c5/workflows/gdc_sesame_workflow.cwl,SeSAMe Methylation Beta Estimation,7732f58193690d65dc535c9b2a4f1071113884c5,Primary Tumor,,TCGA-06-0152,TCGA-06-0152-01A,TRUE
05f6787f-4f64-4eb8-9589-8fb02fe61f86,IDAT,TCGA-06-0152-01A-02D-A45W-05,open,15e80be2-4416-4480-a2b3-3c235e9da8ca_noid_Grn.idat,Green,98fe85e9-b00b-4619-8f27-9f8efcd6ccc7,DNA Methylation,masked_methylation_array,Illumina Human Methylation 450,⋯,released,15e80be2-4416-4480-a2b3-3c235e9da8ca,https://github.com/NCI-GDC/sesame-cwl/blob/7732f58193690d65dc535c9b2a4f1071113884c5/workflows/gdc_sesame_workflow.cwl,SeSAMe Methylation Beta Estimation,7732f58193690d65dc535c9b2a4f1071113884c5,Primary Tumor,,TCGA-06-0152,TCGA-06-0152-01A,5
80957b25-63a3-45ed-bfc3-943b57c0e640,IDAT,TCGA-06-0152-02A-01D-2004-05,open,416ac016-2058-4ffb-9adc-5cee517670f8_noid_Red.idat,Red,580144d7-ed5b-450a-a8fd-4943e9ee65f7,DNA Methylation,masked_methylation_array,Illumina Human Methylation 450,⋯,released,416ac016-2058-4ffb-9adc-5cee517670f8,https://github.com/NCI-GDC/sesame-cwl/blob/7732f58193690d65dc535c9b2a4f1071113884c5/workflows/gdc_sesame_workflow.cwl,SeSAMe Methylation Beta Estimation,7732f58193690d65dc535c9b2a4f1071113884c5,Recurrent Tumor,,TCGA-06-0152,TCGA-06-0152-02A,TRUE
b8b87abf-2e67-4a69-9a54-8754346e2833,IDAT,TCGA-06-0152-02A-01D-2004-05,open,416ac016-2058-4ffb-9adc-5cee517670f8_noid_Grn.idat,Green,bbe63f60-83d8-4301-b380-665a39c7e7b3,DNA Methylation,masked_methylation_array,Illumina Human Methylation 450,⋯,released,416ac016-2058-4ffb-9adc-5cee517670f8,https://github.com/NCI-GDC/sesame-cwl/blob/7732f58193690d65dc535c9b2a4f1071113884c5/workflows/gdc_sesame_workflow.cwl,SeSAMe Methylation Beta Estimation,7732f58193690d65dc535c9b2a4f1071113884c5,Recurrent Tumor,,TCGA-06-0152,TCGA-06-0152-02A,TRUE
8b2339c0-d2d6-4c9c-b8f3-e48917057a79,IDAT,TCGA-06-0171-01A-02D-A45W-05,open,d23ad165-d58c-46fd-8ffa-938675f55ee1_noid_Grn.idat,Green,54787ace-894c-4bdb-8593-fcacb1068696,DNA Methylation,masked_methylation_array,Illumina Human Methylation 450,⋯,released,d23ad165-d58c-46fd-8ffa-938675f55ee1,https://github.com/NCI-GDC/sesame-cwl/blob/7732f58193690d65dc535c9b2a4f1071113884c5/workflows/gdc_sesame_workflow.cwl,SeSAMe Methylation Beta Estimation,7732f58193690d65dc535c9b2a4f1071113884c5,Primary Tumor,,TCGA-06-0171,TCGA-06-0171-01A,TRUE
b3351c4b-26d9-498e-aec1-736105afac86,IDAT,TCGA-06-0171-02A-11D-2004-05,open,e410fe15-33c0-4381-aebe-0b5a4d8c5b68_noid_Grn.idat,Green,2497622d-ae7a-48a1-8885-f929330d9dd1,DNA Methylation,masked_methylation_array,Illumina Human Methylation 450,⋯,released,e410fe15-33c0-4381-aebe-0b5a4d8c5b68,https://github.com/NCI-GDC/sesame-cwl/blob/7732f58193690d65dc535c9b2a4f1071113884c5/workflows/gdc_sesame_workflow.cwl,SeSAMe Methylation Beta Estimation,7732f58193690d65dc535c9b2a4f1071113884c5,Recurrent Tumor,,TCGA-06-0171,TCGA-06-0171-02A,TRUE


In [23]:
# Retrieving all of the files of the 'gene_expression_data_files_common' belonging to the patients which have multiple files.
gene_expression_all_duplicate_patient_rows <- data.table()
for (patient in names(gene_expression_cases_more_than_1_file)) {
    gene_expression_matching_rows <- gene_expression_data_files_common[cases.submitter_id == patient]
    gene_expression_all_duplicate_patient_rows <- rbind(gene_expression_all_duplicate_patient_rows, gene_expression_matching_rows)
}
cat("The files of the 'gene_expression_data_files_common' belonging to the patients which have multiple files:")
gene_expression_all_duplicate_patient_rows

The files of the 'gene_expression_data_files_common' belonging to the patients which have multiple files:

id,data_format,cases,access,file_name,submitter_id,data_category,type,file_size,created_datetime,⋯,analysis_state,analysis_submitter_id,analysis_workflow_link,analysis_workflow_type,analysis_workflow_version,sample_type,is_ffpe,cases.submitter_id,sample.submitter_id,options
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<int>,<chr>,⋯,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<lgl>,<chr>,<chr>,<list>
4c8938ac-74ea-46f3-af36-8aa38d01e380,TSV,TCGA-06-0125-02A-11R-2005-01,open,a01b2990-48f1-4513-8438-df7d1c39b51f.rna_seq.augmented_star_gene_counts.tsv,d539f07a-949e-44dd-acb7-714b3b6b33ab,Transcriptome Profiling,gene_expression,4239356,2021-12-13T17:00:42.710877-06:00,⋯,released,a01b2990-48f1-4513-8438-df7d1c39b51f_star__counts,https://github.com/NCI-GDC/gdc-rnaseq-cwl/blob/5d8c131bbff59fb0c969217fc1d44e6d1503cd1f/rnaseq-star-align/star2pass.rnaseq_harmonization.cwl,STAR - Counts,5d8c131bbff59fb0c969217fc1d44e6d1503cd1f,Recurrent Tumor,,TCGA-06-0125,TCGA-06-0125-02A,5
6827c4ab-4d67-41a3-9e37-b540f5d7c703,TSV,TCGA-06-0125-01A-01R-1849-01,open,e1757a20-2d6f-4aee-bafb-804302b448ea.rna_seq.augmented_star_gene_counts.tsv,2d6bd39e-cd51-4ecc-b888-d3c28605807d,Transcriptome Profiling,gene_expression,4251418,2021-12-13T17:01:03.414593-06:00,⋯,released,e1757a20-2d6f-4aee-bafb-804302b448ea_star__counts,https://github.com/NCI-GDC/gdc-rnaseq-cwl/blob/5d8c131bbff59fb0c969217fc1d44e6d1503cd1f/rnaseq-star-align/star2pass.rnaseq_harmonization.cwl,STAR - Counts,5d8c131bbff59fb0c969217fc1d44e6d1503cd1f,Primary Tumor,,TCGA-06-0125,TCGA-06-0125-01A,TRUE
ba9afa8e-405c-49b7-bc66-8ab1af1f942e,TSV,TCGA-06-0190-01A-01R-1849-01,open,dad61e18-e3f1-4beb-b3c3-ae434e35af2d.rna_seq.augmented_star_gene_counts.tsv,2495495d-d5e2-4974-b959-bb871e69c657,Transcriptome Profiling,gene_expression,4242874,2021-12-13T16:58:53.041980-06:00,⋯,released,dad61e18-e3f1-4beb-b3c3-ae434e35af2d_star__counts,https://github.com/NCI-GDC/gdc-rnaseq-cwl/blob/5d8c131bbff59fb0c969217fc1d44e6d1503cd1f/rnaseq-star-align/star2pass.rnaseq_harmonization.cwl,STAR - Counts,5d8c131bbff59fb0c969217fc1d44e6d1503cd1f,Primary Tumor,,TCGA-06-0190,TCGA-06-0190-01A,TRUE
9916e085-1027-4c91-9729-93dc36b71c86,TSV,TCGA-06-0190-02A-01R-2005-01,open,2c25aa49-879a-4023-80b4-84df717dc537.rna_seq.augmented_star_gene_counts.tsv,1ae05351-04c6-4d20-a906-31e9186f0907,Transcriptome Profiling,gene_expression,4236023,2021-12-13T16:59:24.758343-06:00,⋯,released,2c25aa49-879a-4023-80b4-84df717dc537_star__counts,https://github.com/NCI-GDC/gdc-rnaseq-cwl/blob/5d8c131bbff59fb0c969217fc1d44e6d1503cd1f/rnaseq-star-align/star2pass.rnaseq_harmonization.cwl,STAR - Counts,5d8c131bbff59fb0c969217fc1d44e6d1503cd1f,Recurrent Tumor,,TCGA-06-0190,TCGA-06-0190-02A,TRUE
74f3f3cb-d6f1-4e4f-8c76-7ebeefa0dbcf,TSV,TCGA-06-0210-01A-01R-1849-01,open,1483c347-bb2c-4678-af16-163e4fc1791d.rna_seq.augmented_star_gene_counts.tsv,2a72f3d6-8afc-4fd3-9df4-6b7c944f3e72,Transcriptome Profiling,gene_expression,4244621,2021-12-13T17:01:06.778467-06:00,⋯,released,1483c347-bb2c-4678-af16-163e4fc1791d_star__counts,https://github.com/NCI-GDC/gdc-rnaseq-cwl/blob/5d8c131bbff59fb0c969217fc1d44e6d1503cd1f/rnaseq-star-align/star2pass.rnaseq_harmonization.cwl,STAR - Counts,5d8c131bbff59fb0c969217fc1d44e6d1503cd1f,Primary Tumor,,TCGA-06-0210,TCGA-06-0210-01A,TRUE
75a95ec8-b67a-4e0c-a959-61344628ab11,TSV,TCGA-06-0210-02A-01R-2005-01,open,9ea6219c-d1fb-4f5b-bdab-c3492f180ac2.rna_seq.augmented_star_gene_counts.tsv,3cca0708-22c9-4571-8c83-58fd216ff580,Transcriptome Profiling,gene_expression,4243645,2021-12-13T17:03:01.821086-06:00,⋯,released,9ea6219c-d1fb-4f5b-bdab-c3492f180ac2_star__counts,https://github.com/NCI-GDC/gdc-rnaseq-cwl/blob/5d8c131bbff59fb0c969217fc1d44e6d1503cd1f/rnaseq-star-align/star2pass.rnaseq_harmonization.cwl,STAR - Counts,5d8c131bbff59fb0c969217fc1d44e6d1503cd1f,Recurrent Tumor,,TCGA-06-0210,TCGA-06-0210-02A,5
74895f90-43ef-4d2e-ab0a-9e7177f9d27c,TSV,TCGA-06-0211-02A-02R-2005-01,open,bb335256-50b2-473f-a886-b62b7441c436.rna_seq.augmented_star_gene_counts.tsv,c77ecbc8-57d2-4545-b6c6-7d6bdf594664,Transcriptome Profiling,gene_expression,4240739,2021-12-13T17:05:12.807665-06:00,⋯,released,bb335256-50b2-473f-a886-b62b7441c436_star__counts,https://github.com/NCI-GDC/gdc-rnaseq-cwl/blob/5d8c131bbff59fb0c969217fc1d44e6d1503cd1f/rnaseq-star-align/star2pass.rnaseq_harmonization.cwl,STAR - Counts,5d8c131bbff59fb0c969217fc1d44e6d1503cd1f,Recurrent Tumor,,TCGA-06-0211,TCGA-06-0211-02A,TRUE
574174d5-f635-414b-ae70-afdd6ec6ddcf,TSV,TCGA-06-0211-01B-01R-1849-01,open,cebc70af-a3df-45ef-9656-050f382f64e8.rna_seq.augmented_star_gene_counts.tsv,d83aa4cd-56a6-4dd3-9009-ed9659f2b3bc,Transcriptome Profiling,gene_expression,4248101,2021-12-13T17:04:43.818977-06:00,⋯,released,cebc70af-a3df-45ef-9656-050f382f64e8_star__counts,https://github.com/NCI-GDC/gdc-rnaseq-cwl/blob/5d8c131bbff59fb0c969217fc1d44e6d1503cd1f/rnaseq-star-align/star2pass.rnaseq_harmonization.cwl,STAR - Counts,5d8c131bbff59fb0c969217fc1d44e6d1503cd1f,Primary Tumor,,TCGA-06-0211,TCGA-06-0211-01B,5
bba8d1b1-ee43-49dd-8048-78ee57bddca1,TSV,TCGA-06-0211-01A-01R-1849-01,open,4fa49452-cb75-46c0-a633-42b3f7c4361b.rna_seq.augmented_star_gene_counts.tsv,f7f52a15-5fe8-429c-8e3c-9d6293e822c0,Transcriptome Profiling,gene_expression,4242161,2021-12-13T17:03:55.825546-06:00,⋯,released,4fa49452-cb75-46c0-a633-42b3f7c4361b_star__counts,https://github.com/NCI-GDC/gdc-rnaseq-cwl/blob/5d8c131bbff59fb0c969217fc1d44e6d1503cd1f/rnaseq-star-align/star2pass.rnaseq_harmonization.cwl,STAR - Counts,5d8c131bbff59fb0c969217fc1d44e6d1503cd1f,Primary Tumor,,TCGA-06-0211,TCGA-06-0211-01A,TRUE
7f21d9fd-fd8a-44ec-b01f-93ded2b6babf,TSV,TCGA-14-1034-01A-01R-1849-01,open,64d208b9-6a15-4077-885f-30d9cdedf147.rna_seq.augmented_star_gene_counts.tsv,0513a62a-3635-4ce3-b5a8-071ad7ba0831,Transcriptome Profiling,gene_expression,4229310,2021-12-13T16:54:58.730301-06:00,⋯,released,64d208b9-6a15-4077-885f-30d9cdedf147_star__counts,https://github.com/NCI-GDC/gdc-rnaseq-cwl/blob/5d8c131bbff59fb0c969217fc1d44e6d1503cd1f/rnaseq-star-align/star2pass.rnaseq_harmonization.cwl,STAR - Counts,5d8c131bbff59fb0c969217fc1d44e6d1503cd1f,Primary Tumor,,TCGA-14-1034,TCGA-14-1034-01A,5
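
The row-by-row 'rbind()' in the loop above can probably be replaced by a single '%in%' filter. The following is a minimal sketch in base R (so that it runs without additional packages); the data frame 'files' and its file names are hypothetical stand-ins for 'gene_expression_data_files_common', not the real data.

```r
# Toy stand-in for 'gene_expression_data_files_common' (hypothetical file names).
files <- data.frame(
    cases.submitter_id = c("TCGA-06-0125", "TCGA-06-0125", "TCGA-06-1804",
                           "TCGA-06-0190", "TCGA-06-0190"),
    file_name = c("a.tsv", "b.tsv", "c.tsv", "d.tsv", "e.tsv"),
    stringsAsFactors = FALSE
)

# Counting the files per patient and keeping the patients with more than one file.
files_per_patient <- table(files$cases.submitter_id)
patients_multiple_files <- names(files_per_patient[files_per_patient > 1])

# Selecting all rows of these patients in a single step.
duplicate_patient_rows <- files[files$cases.submitter_id %in% patients_multiple_files, ]
print(duplicate_patient_rows)
```

Unlike the loop, this keeps the rows in their original order instead of grouping them per patient, so a sort on 'cases.submitter_id' may be needed afterwards.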


At this point, we have retrieved, for both data tables, all of the rows that belong to patients (identified by the 'cases.submitter_id') who have multiple samples within the same data table.

### Handling the files for which their patients have multiple samples for a single data type

Our aim now is to create a reference table which contains the patient id together with the corresponding methylation data files and the gene expression data file. To see which methylation file corresponds to which gene expression file (which can only happen when they belong to the same patient), the submitter_id can be checked to see whether it matches between two files. This is relatively straightforward to do, but the patients who have multiple samples within the same data table require some processing beforehand. We first create a reference table that denotes, for each patient id, whether the patient has multiple samples within each of the two data tables. This is done by checking whether the patient id appears in 'methylation_all_duplicate_patient_rows' or in 'gene_expression_all_duplicate_patient_rows'.

In [24]:
# Creating a reference table for the patient ids to denote whether they have multiple samples within each of the two data 
# tables.
reference_table_patients_multiple_samples <- data.table(case_id = character(), multiple_methylation_samples = logical(), multiple_expression_samples = logical())

# We first need to retrieve all of the different patient ids such that we can later attach TRUE or FALSE values to them for 
# the second and third columns. This can be achieved by calling the function 'unique()' and we only need to apply this to 
# one of the data tables as at this point both data tables contain the same patient ids (just a different number of 
# occurrences).
unique_patient_ids <- unique(methylation_data_files_common$cases.submitter_id)

# Looping over all the 'unique_patient_ids' and adding to the 'reference_table_patients_multiple_samples' whether for the 
# patient there are multiple samples present in the methylation data and gene expression data.
for (index in seq_along(unique_patient_ids)) {
    current_patient_id <- unique_patient_ids[index]
    
    # Checking whether the patient id appears in the 'methylation_all_duplicate_patient_rows[["cases.submitter_id"]]' 
    # column.
    if (current_patient_id %in% methylation_all_duplicate_patient_rows[["cases.submitter_id"]]) {
        temp_methylation <- TRUE
    } else {
        temp_methylation <- FALSE
    }
    # Checking whether the patient id appears in the 'gene_expression_all_duplicate_patient_rows[["cases.submitter_id"]]' 
    # column.
    if (current_patient_id %in% gene_expression_all_duplicate_patient_rows[["cases.submitter_id"]]) {
        temp_gene_expression <- TRUE
    } else {
        temp_gene_expression <- FALSE
    }
    
    # Creating a new row and adding it to the 'reference_table_patients_multiple_samples'. 
    new_patient_row <- data.table(case_id = current_patient_id, multiple_methylation_samples = temp_methylation, multiple_expression_samples = temp_gene_expression)
    reference_table_patients_multiple_samples <- rbind(reference_table_patients_multiple_samples, new_patient_row)
}

# Sorting the 'reference_table_patients_multiple_samples' data table by the 'case_id' column in ascending order.
reference_table_patients_multiple_samples <- reference_table_patients_multiple_samples[order(case_id),]

# Displaying the reference table.
cat("The reference table for the patient ids to denote whether they have multiple samples within each of the two data tables:")
reference_table_patients_multiple_samples

The reference table for the patient ids to denote whether they have multiple samples within each of the two data tables:

case_id,multiple_methylation_samples,multiple_expression_samples
<chr>,<lgl>,<lgl>
TCGA-06-0125,True,True
TCGA-06-0152,True,False
TCGA-06-0171,True,False
TCGA-06-0190,True,True
TCGA-06-0210,True,True
TCGA-06-0211,True,True
TCGA-06-0221,True,False
TCGA-06-1804,False,False
TCGA-06-5408,False,False
TCGA-06-5410,False,False
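
As '%in%' is vectorized, the reference table above could likely be built without a loop. The sketch below uses base R and three patient ids taken from the table above; the vectors 'methylation_duplicate_ids' and 'gene_expression_duplicate_ids' are hypothetical stand-ins for the 'cases.submitter_id' columns of the two duplicate-row tables.

```r
# Stand-ins (assumed for illustration) for the unique patient ids and for the
# patient ids that occur in the two duplicate-row tables.
unique_patient_ids <- c("TCGA-06-5408", "TCGA-06-0125", "TCGA-06-0152")
methylation_duplicate_ids <- c("TCGA-06-0125", "TCGA-06-0152")
gene_expression_duplicate_ids <- c("TCGA-06-0125")

# '%in%' returns one TRUE/FALSE value per patient id, so both columns are
# filled in a single step.
reference_table <- data.frame(
    case_id = unique_patient_ids,
    multiple_methylation_samples = unique_patient_ids %in% methylation_duplicate_ids,
    multiple_expression_samples = unique_patient_ids %in% gene_expression_duplicate_ids,
    stringsAsFactors = FALSE
)

# Sorting by the 'case_id' column in ascending order, as in the cell above.
reference_table <- reference_table[order(reference_table$case_id), ]
print(reference_table)
```

This avoids growing the table with 'rbind()' inside the loop, which matters little for a few hundred patients but keeps the code shorter.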


### Identifying which files belong to which patient

Now that we have a reference table that denotes, for each patient id, whether the patient has multiple samples within each of the two data tables, we can go over this table to determine which files belong together for each patient, i.e., the methylation data file for the green channel, the methylation data file for the red channel and the gene expression data file. To do this, we first create a new reference table that stores the matching files, so that we can later download all of these files while keeping track of which files (which data) belong together. Then, we loop over all of the case_ids in the reference table and retrieve three lists of entries that have this 'case_id' as their 'cases.submitter_id': the entries of the 'methylation_data_files_common' data table for the green channel, the entries of the same data table for the red channel, and the entries of the 'gene_expression_data_files_common' data table. Next, we look for similarities between these records in order to match them correctly, as the lists may contain multiple records (and thus multiple files), and even for three lists with a single file each there is no guarantee that the files match. The match can be found by looking at the 'cases' column in the data tables: when three files match across the two data tables, their 'cases' values start with the same characters. For example:

The case_id TCGA-26-5136 relates to two files in the 'methylation_data_files_common' data table and to one file in the 'gene_expression_data_files_common' data table. These files have the 'cases' values: 
* TCGA-26-5136-01B-01D-1481-05 (the methylation file for the green channel)
* TCGA-26-5136-01B-01D-1481-05 (the methylation file for the red channel)
* TCGA-26-5136-01B-01R-1850-01 (the gene expression file)

As we can see, the first 19 characters of the 'cases' values are the same: TCGA-26-5136-01B-01. This prefix differs between samples, but the files that belong together share it.

Once such a match has been found, the corresponding files in both the 'methylation_data_files_common' data table and the 'gene_expression_data_files_common' data table are retrieved and added to a new reference table to be used later to download all of the files.
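
The prefix comparison described above can be sketched in a few lines of base R, using the three 'cases' values of the example. In the TCGA barcode layout, the first 19 characters cover the project, tissue source site, participant, sample type with vial, and portion, while the 20th character is the analyte (D for DNA, R for RNA), which is why the methylation and gene expression values diverge from that point on.

```r
# The three 'cases' values of the example patient TCGA-26-5136.
cases_values <- c(
    "TCGA-26-5136-01B-01D-1481-05",  # methylation file, green channel
    "TCGA-26-5136-01B-01D-1481-05",  # methylation file, red channel
    "TCGA-26-5136-01B-01R-1850-01"   # gene expression file
)

# Keeping only the first 19 characters of every value.
prefixes <- substring(cases_values, 1, 19)
print(prefixes)

# The three files belong together when all prefixes are identical.
all_match <- length(unique(prefixes)) == 1
print(all_match)
```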

In [25]:
# Creating a reference table for which files of which patients match to later download them.
reference_table_files_per_patient <- data.table(case_id = character(), methylation_file_green_channel = character(), methylation_file_red_channel = character(), gene_expression_file = character())

# Looping over all of the patients present in 'reference_table_patients_multiple_samples'.
for (i in seq_along(reference_table_patients_multiple_samples$case_id)) {
    # For both the methylation and gene expression data, we retrieve all of the records in the common data tables. For each
    # of these records, the substring on which we try to match the records is added to a list.
    methylation_cases_green_substrings <- list()
    methylation_cases_green = methylation_data_files_common[cases.submitter_id == reference_table_patients_multiple_samples[i, case_id] & channel == "Green"]
    for (j in seq_along(methylation_cases_green$id)) {
        methylation_cases_green_substrings[[j]] = substring(methylation_cases_green[j, cases], 1, 19)
    }
    methylation_cases_red_substrings <- list()
    methylation_cases_red = methylation_data_files_common[cases.submitter_id == reference_table_patients_multiple_samples[i, case_id] & channel == "Red"]
    for (j in seq_along(methylation_cases_red$id)) {
        methylation_cases_red_substrings[[j]] = substring(methylation_cases_red[j, cases], 1, 19)
    }
    gene_expression_cases_substrings <- list()
    gene_expression_cases = gene_expression_data_files_common[cases.submitter_id == reference_table_patients_multiple_samples[i, case_id]]
    for (j in seq_along(gene_expression_cases$id)) {
        gene_expression_cases_substrings[[j]] = substring(gene_expression_cases[j, cases], 1, 19)
    }
    
    # Comparing each of the entries in the lists of substrings that we try to match on, and if these are similar, we add 
    # the corresponding files to a new reference table.
    for (k in seq_along(methylation_cases_green_substrings)) {
        for (l in seq_along(methylation_cases_red_substrings)) {
            for (m in seq_along(gene_expression_cases_substrings)) {
                if (methylation_cases_green_substrings[[k]] == methylation_cases_red_substrings[[l]] && methylation_cases_green_substrings[[k]] == gene_expression_cases_substrings[[m]]) {
                    methylation_file_green_channel = (methylation_data_files_common[grepl(methylation_cases_green_substrings[[k]], cases) & channel == "Green"])[1, file_name]
                    methylation_file_red_channel = (methylation_data_files_common[grepl(methylation_cases_red_substrings[[l]], cases) & channel == "Red"])[1, file_name]
                    gene_expression_file = (gene_expression_data_files_common[grepl(gene_expression_cases_substrings[[m]], cases)])[1, file_name]

                    new_patient_row = data.table(
                        case_id = methylation_cases_green_substrings[[k]],
                        methylation_file_green_channel = methylation_file_green_channel,
                        methylation_file_red_channel = methylation_file_red_channel,
                        gene_expression_file = gene_expression_file
                    )
                    reference_table_files_per_patient <- rbind(reference_table_files_per_patient, new_patient_row)
                } else {
                    cat("The following (substrings of the) cases ids do not match:\n")
                    print(methylation_cases_green_substrings[[k]])
                    print(methylation_cases_red_substrings[[l]])
                    print(gene_expression_cases_substrings[[m]])
                }
            }
        }
    }
}

# Displaying the reference table.
cat("The reference table for which files of which patients match to later download them:")
reference_table_files_per_patient

# The path where the 'reference_table_files_per_patient' data table should be stored.
path <- file.path(data_directory, "/reference_table_files_per_patient.csv")

# If the 'path' defined above already points to a file, the data table is not written again and a message is displayed 
# informing that the file already exists.
if (file.exists(path)) {
    cat(paste("There is already a file present at the path: ", path))
} else {
    # Writing the data table to a CSV file.
    data.table::fwrite(reference_table_files_per_patient, path)
    cat(paste("The file has been created at the path: ", path))
}

The following (substrings of the) cases ids do not match:
[1] "TCGA-06-0125-01A-01"
[1] "TCGA-06-0125-02A-11"
[1] "TCGA-06-0125-02A-11"
The following (substrings of the) cases ids do not match:
[1] "TCGA-06-0125-01A-01"
[1] "TCGA-06-0125-02A-11"
[1] "TCGA-06-0125-01A-01"
The following (substrings of the) cases ids do not match:
[1] "TCGA-06-0125-01A-01"
[1] "TCGA-06-0125-01A-01"
[1] "TCGA-06-0125-02A-11"
The following (substrings of the) cases ids do not match:
[1] "TCGA-06-0125-02A-11"
[1] "TCGA-06-0125-02A-11"
[1] "TCGA-06-0125-01A-01"
The following (substrings of the) cases ids do not match:
[1] "TCGA-06-0125-02A-11"
[1] "TCGA-06-0125-01A-01"
[1] "TCGA-06-0125-02A-11"
The following (substrings of the) cases ids do not match:
[1] "TCGA-06-0125-02A-11"
[1] "TCGA-06-0125-01A-01"
[1] "TCGA-06-0125-01A-01"
The following (substrings of the) cases ids do not match:
[1] "TCGA-06-0152-01A-02"
[1] "TCGA-06-0152-01A-02"
[1] "TCGA-06-0152-02A-01"
The following (substrings of the) cases ids do n

case_id,methylation_file_green_channel,methylation_file_red_channel,gene_expression_file
<chr>,<chr>,<chr>,<chr>
TCGA-06-0125-01A-01,964f8f23-7801-412f-981d-5ee34ffc6dd1_noid_Grn.idat,964f8f23-7801-412f-981d-5ee34ffc6dd1_noid_Red.idat,e1757a20-2d6f-4aee-bafb-804302b448ea.rna_seq.augmented_star_gene_counts.tsv
TCGA-06-0125-02A-11,70d41610-9d05-4101-84c0-ed3280f1656c_noid_Grn.idat,70d41610-9d05-4101-84c0-ed3280f1656c_noid_Red.idat,a01b2990-48f1-4513-8438-df7d1c39b51f.rna_seq.augmented_star_gene_counts.tsv
TCGA-06-0152-02A-01,416ac016-2058-4ffb-9adc-5cee517670f8_noid_Grn.idat,416ac016-2058-4ffb-9adc-5cee517670f8_noid_Red.idat,f5e230a7-b044-4190-8b61-36260d8cd54c.rna_seq.augmented_star_gene_counts.tsv
TCGA-06-0171-02A-11,e410fe15-33c0-4381-aebe-0b5a4d8c5b68_noid_Grn.idat,e410fe15-33c0-4381-aebe-0b5a4d8c5b68_noid_Red.idat,52f507d3-113b-4995-9b7a-8677e7812ef3.rna_seq.augmented_star_gene_counts.tsv
TCGA-06-0190-01A-01,e1f76540-4db5-4efb-aabb-00073f0dbb82_noid_Grn.idat,e1f76540-4db5-4efb-aabb-00073f0dbb82_noid_Red.idat,dad61e18-e3f1-4beb-b3c3-ae434e35af2d.rna_seq.augmented_star_gene_counts.tsv
TCGA-06-0190-02A-01,fd19aeb9-4ee0-4243-8610-4e96fb6aa9d5_noid_Grn.idat,fd19aeb9-4ee0-4243-8610-4e96fb6aa9d5_noid_Red.idat,2c25aa49-879a-4023-80b4-84df717dc537.rna_seq.augmented_star_gene_counts.tsv
TCGA-06-0210-01A-01,8eb46a29-b50d-4db6-b12e-9faa644567a7_noid_Grn.idat,8eb46a29-b50d-4db6-b12e-9faa644567a7_noid_Red.idat,1483c347-bb2c-4678-af16-163e4fc1791d.rna_seq.augmented_star_gene_counts.tsv
TCGA-06-0210-02A-01,89b8397c-b111-4ca0-ae7c-d3d88f375dc0_noid_Grn.idat,89b8397c-b111-4ca0-ae7c-d3d88f375dc0_noid_Red.idat,9ea6219c-d1fb-4f5b-bdab-c3492f180ac2.rna_seq.augmented_star_gene_counts.tsv
TCGA-06-0211-01A-01,05555697-7e2e-4ae8-892d-f54d2b656d24_noid_Grn.idat,05555697-7e2e-4ae8-892d-f54d2b656d24_noid_Red.idat,4fa49452-cb75-46c0-a633-42b3f7c4361b.rna_seq.augmented_star_gene_counts.tsv
TCGA-06-0211-02A-02,ca69ff04-d622-472d-901a-b59821109688_noid_Grn.idat,ca69ff04-d622-472d-901a-b59821109688_noid_Red.idat,bb335256-50b2-473f-a886-b62b7441c436.rna_seq.augmented_star_gene_counts.tsv


There is already a file present at the path:  C:/Users/laure/OneDrive/Documenten/Bachelor Thesis Data/original_data//reference_table_files_per_patient.csv
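
As an alternative to the triple loop above, the matching could be expressed as two inner joins on the 19-character prefix. The sketch below is written in base R under stated assumptions: the data frames 'methylation' and 'gene_expression', their barcodes and their file names are hypothetical toy values and do not come from the real data tables.

```r
# Toy stand-ins (hypothetical barcodes and file names) for the two common data tables.
methylation <- data.frame(
    cases = c("TCGA-00-0001-01A-01D-0000-05", "TCGA-00-0001-01A-01D-0000-05",
              "TCGA-00-0001-02A-11D-0000-05", "TCGA-00-0001-02A-11D-0000-05"),
    channel = c("Green", "Red", "Green", "Red"),
    file_name = c("p_Grn.idat", "p_Red.idat", "r_Grn.idat", "r_Red.idat"),
    stringsAsFactors = FALSE
)
gene_expression <- data.frame(
    cases = c("TCGA-00-0001-02A-11R-0000-01", "TCGA-00-0001-03A-01R-0000-01"),
    file_name = c("r_counts.tsv", "unmatched_counts.tsv"),
    stringsAsFactors = FALSE
)

# Adding the 19-character prefix of the 'cases' value as the matching key.
methylation$case_id <- substring(methylation$cases, 1, 19)
gene_expression$case_id <- substring(gene_expression$cases, 1, 19)

# One table per file type, each with the key and a descriptive file column.
green <- methylation[methylation$channel == "Green", c("case_id", "file_name")]
red <- methylation[methylation$channel == "Red", c("case_id", "file_name")]
expression <- gene_expression[, c("case_id", "file_name")]
names(green)[2] <- "methylation_file_green_channel"
names(red)[2] <- "methylation_file_red_channel"
names(expression)[2] <- "gene_expression_file"

# Inner joins keep only the prefixes that occur in all three tables.
matched_files <- merge(merge(green, red, by = "case_id"), expression, by = "case_id")
print(matched_files)
```

In this toy example only the prefix 'TCGA-00-0001-02A-11' has all three files, so a single row remains. Note that if several files shared the same prefix, 'merge()' would return every combination, so such cases would still need separate handling.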

### Downloading the files

Now that we know which files we need to download, we can actually download them using the function 'GDCdownload()'. As this function takes a query as a parameter, we first have to build a query containing only the desired file, which can then be passed to the 'GDCdownload()' function. The files will be stored in a folder created by TCGAbiolinks, as I was not able to adjust the destination directory with this function, but they will be moved manually to the 'data' folder present in the parent directory of the current folder 'Preprocessing'.
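
The manual move mentioned above could also be scripted. The following is only a sketch in base R: the two folders are temporary, hypothetical stand-ins, the placeholder files are empty, and the nested layout of the download folder is an assumption rather than a description of what TCGAbiolinks actually creates.

```r
# Hypothetical stand-ins for the download folder and for the 'data' folder.
download_directory <- file.path(tempdir(), "GDCdata_example")
data_directory_example <- file.path(tempdir(), "data_example")
dir.create(file.path(download_directory, "nested", "folder"),
           recursive = TRUE, showWarnings = FALSE)
dir.create(data_directory_example, showWarnings = FALSE)

# Creating two empty placeholder files that mimic downloaded files.
file.create(file.path(download_directory, "nested", "folder", "example_noid_Grn.idat"))
file.create(file.path(download_directory, "nested", "example.rna_seq.augmented_star_gene_counts.tsv"))

# Collecting all IDAT and TSV files, regardless of how deeply they are nested.
downloaded_files <- list.files(download_directory, pattern = "\\.(idat|tsv)$",
                               recursive = TRUE, full.names = TRUE)

# Copying every file directly into the target folder under its own file name.
copied <- file.copy(downloaded_files,
                    file.path(data_directory_example, basename(downloaded_files)),
                    overwrite = TRUE)
print(list.files(data_directory_example))
```

Copying instead of moving keeps the original download folder intact, so files that were already downloaded are still recognized on a later run.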

In [26]:
# Looping over all the files and downloading them.
for (i in seq_along(reference_table_files_per_patient$case_id)) {
    
    # Retrieving the file names of the current row and the corresponding full 'cases' values (barcodes), as these are 
    # needed for the queries below.
    methylation_file_green = reference_table_files_per_patient[i, methylation_file_green_channel]
    methylation_file_red = reference_table_files_per_patient[i, methylation_file_red_channel]
    gene_expression_file = reference_table_files_per_patient[i, gene_expression_file]
    methylation_case_id_green = (methylation_data_files_common[file_name == methylation_file_green])[1, cases]
    methylation_case_id_red = (methylation_data_files_common[file_name == methylation_file_red])[1, cases]
    gene_expression_case_id = (gene_expression_data_files_common[file_name == gene_expression_file])[1, cases]
    
    # Query to retrieve the current methylation file for the green channel.
    methylation_file_query_green <- GDCquery(project = "TCGA-GBM",
        data.category = "DNA Methylation",
        legacy = FALSE,
        platform = "Illumina Human Methylation 450",
        data.type = "Masked Intensities",
        barcode = c(methylation_case_id_green)
    )
    
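    # Query to retrieve the current methylation file for the red channel. As the query filters on the barcode and not on 
    # the channel, it uses the same filters as the query for the green channel above.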
    methylation_file_query_red <- GDCquery(project = "TCGA-GBM",
        data.category = "DNA Methylation",
        legacy = FALSE,
        platform = "Illumina Human Methylation 450",
        data.type = "Masked Intensities",
        barcode = c(methylation_case_id_red)
    )

    # Query to retrieve the current gene expression file.
    gene_expression_file_query <- GDCquery(project = "TCGA-GBM",
        data.category = "Transcriptome Profiling",
        data.type = "Gene Expression Quantification", 
        workflow.type = "STAR - Counts",
        barcode = c(gene_expression_case_id)
    )
    
    # Downloading the methylation and gene expression files.                   
    GDCdownload(methylation_file_query_green)
    GDCdownload(methylation_file_query_red)
    GDCdownload(gene_expression_file_query)
}

--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

--------------------

ooo By platform

ooo By data.type

ooo By barcode

----------------

oo Checking data

----------------

ooo Checking if there are duplicated cases


ooo Checking if there are results for the query

-------------------

o Preparing output

-------------------

--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

---------------

Downloading: 8.4 MB     

Downloading data for project TCGA-GBM

Of the 2 files for download 2 already exist.

All samples have been already downloaded

Downloading data for project TCGA-GBM

GDCdownload will download: 4.251418 MB

Downloading as: e1757a20-2d6f-4aee-bafb-804302b448ea.rna_seq.augmented_star_gene_counts.tsv





Downloading: 8.5 MB     

Downloading data for project TCGA-GBM

Of the 2 files for download 2 already exist.

All samples have been already downloaded

Downloading data for project TCGA-GBM

GDCdownload will download: 4.239356 MB

Downloading as: a01b2990-48f1-4513-8438-df7d1c39b51f.rna_seq.augmented_star_gene_counts.tsv





Downloading: 8.4 MB     

Downloading data for project TCGA-GBM

Of the 2 files for download 2 already exist.

All samples have been already downloaded

Downloading data for project TCGA-GBM

GDCdownload will download: 4.251732 MB

Downloading as: f5e230a7-b044-4190-8b61-36260d8cd54c.rna_seq.augmented_star_gene_counts.tsv





Downloading: 8.5 MB     

Downloading data for project TCGA-GBM

Of the 2 files for download 2 already exist.

All samples have been already downloaded

Downloading data for project TCGA-GBM

GDCdownload will download: 4.246034 MB

Downloading as: 52f507d3-113b-4995-9b7a-8677e7812ef3.rna_seq.augmented_star_gene_counts.tsv





Downloading: 8.5 MB     

Downloading data for project TCGA-GBM

Of the 2 files for download 2 already exist.

All samples have been already downloaded

Downloading data for project TCGA-GBM

GDCdownload will download: 4.242874 MB

Downloading as: dad61e18-e3f1-4beb-b3c3-ae434e35af2d.rna_seq.augmented_star_gene_counts.tsv





Downloading: 8.5 MB     

Downloading data for project TCGA-GBM

Of the 2 files for download 2 already exist.

All samples have been already downloaded

Downloading data for project TCGA-GBM

GDCdownload will download: 4.236023 MB

Downloading as: 2c25aa49-879a-4023-80b4-84df717dc537.rna_seq.augmented_star_gene_counts.tsv





Downloading: 8.5 MB     

Downloading data for project TCGA-GBM

Of the 2 files for download 2 already exist.

All samples have been already downloaded

Downloading data for project TCGA-GBM

GDCdownload will download: 4.244621 MB

Downloading as: 1483c347-bb2c-4678-af16-163e4fc1791d.rna_seq.augmented_star_gene_counts.tsv





--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

---------------

Downloading: 8.4 MB     

Downloading data for project TCGA-GBM

Of the 2 files for download 2 already exist.

All samples have been already downloaded

Downloading data for project TCGA-GBM

GDCdownload will download: 4.243645 MB

Downloading as: 9ea6219c-d1fb-4f5b-bdab-c3492f180ac2.rna_seq.augmented_star_gene_counts.tsv





--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

---------------

Downloading: 8.6 MB     

Downloading data for project TCGA-GBM

Of the 2 files for download 2 already exist.

All samples have been already downloaded

Downloading data for project TCGA-GBM

GDCdownload will download: 4.242161 MB

Downloading as: 4fa49452-cb75-46c0-a633-42b3f7c4361b.rna_seq.augmented_star_gene_counts.tsv





--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

---------------

Downloading: 8.5 MB     

Downloading data for project TCGA-GBM

Of the 2 files for download 2 already exist.

All samples have been already downloaded

Downloading data for project TCGA-GBM

GDCdownload will download: 4.240739 MB

Downloading as: bb335256-50b2-473f-a886-b62b7441c436.rna_seq.augmented_star_gene_counts.tsv





--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

---------------

Downloading: 8.5 MB     

Downloading data for project TCGA-GBM

Of the 2 files for download 2 already exist.

All samples have been already downloaded

Downloading data for project TCGA-GBM

GDCdownload will download: 4.249712 MB

Downloading as: 08dce278-ba8d-4ba0-90a1-e40820fd1740.rna_seq.augmented_star_gene_counts.tsv





--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

---------------

Downloading: 8.5 MB     

Downloading data for project TCGA-GBM

Of the 2 files for download 2 already exist.

All samples have been already downloaded

Downloading data for project TCGA-GBM

GDCdownload will download: 4.261799 MB

Downloading as: f489cec9-6f4b-4364-b85a-d34cbc8c2015.rna_seq.augmented_star_gene_counts.tsv





--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

---------------

Downloading: 8.5 MB     

Downloading data for project TCGA-GBM

Of the 2 files for download 2 already exist.

All samples have been already downloaded

Downloading data for project TCGA-GBM

GDCdownload will download: 4.248202 MB

Downloading as: 5763b7c5-89cb-49c4-87aa-bad01f28b541.rna_seq.augmented_star_gene_counts.tsv





--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

---------------

Downloading: 8.6 MB     

Downloading data for project TCGA-GBM

Of the 2 files for download 2 already exist.

All samples have been already downloaded

Downloading data for project TCGA-GBM

GDCdownload will download: 4.244297 MB

Downloading as: 5237a678-9f82-4aa3-a874-39179cff6ba5.rna_seq.augmented_star_gene_counts.tsv





--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

---------------

Downloading: 8.5 MB       

Downloading data for project TCGA-GBM

Of the 2 files for download 2 already exist.

All samples have been already downloaded

Downloading data for project TCGA-GBM

GDCdownload will download: 4.25756 MB

Downloading as: 44aa45ef-4bcd-47eb-a2ef-68b614d00a51.rna_seq.augmented_star_gene_counts.tsv





--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

---------------

Downloading: 8.6 MB       

Downloading data for project TCGA-GBM

Of the 2 files for download 2 already exist.

All samples have been already downloaded

Downloading data for project TCGA-GBM

GDCdownload will download: 4.246236 MB

Downloading as: 6824df5a-832c-4a45-b654-05ba9ac03369.rna_seq.augmented_star_gene_counts.tsv





--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

---------------

Downloading: 8.5 MB     

Downloading data for project TCGA-GBM

Of the 2 files for download 2 already exist.

All samples have been already downloaded

Downloading data for project TCGA-GBM

GDCdownload will download: 4.251398 MB

Downloading as: 251de1f4-0bc3-4b44-af08-1abc9c720f67.rna_seq.augmented_star_gene_counts.tsv





--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

---------------

Downloading: 8.6 MB       

Downloading data for project TCGA-GBM

Of the 2 files for download 2 already exist.

All samples have been already downloaded

Downloading data for project TCGA-GBM

GDCdownload will download: 4.245339 MB

Downloading as: 45ab4192-f1bb-46cd-8e85-c2341d67642f.rna_seq.augmented_star_gene_counts.tsv





--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

---------------

Downloading: 8.6 MB     

Downloading data for project TCGA-GBM

Of the 2 files for download 2 already exist.

All samples have been already downloaded

Downloading data for project TCGA-GBM

GDCdownload will download: 4.268629 MB

Downloading as: f60c516c-6be7-4dda-9e46-46a87039c099.rna_seq.augmented_star_gene_counts.tsv





--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

---------------

Downloading: 8.6 MB     

Downloading data for project TCGA-GBM

Of the 2 files for download 2 already exist.

All samples have been already downloaded

Downloading data for project TCGA-GBM

GDCdownload will download: 4.236559 MB

Downloading as: ce5fb607-99c5-457f-9022-df38b540fce6.rna_seq.augmented_star_gene_counts.tsv





--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

---------------

Downloading: 8.6 MB     

Downloading data for project TCGA-GBM

Of the 2 files for download 2 already exist.

All samples have been already downloaded

Downloading data for project TCGA-GBM

GDCdownload will download: 4.202046 MB

Downloading as: c0b1e5a6-3308-43c6-9e76-feda9f67c1a0.rna_seq.augmented_star_gene_counts.tsv





--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

---------------

Downloading: 8.6 MB     

Downloading data for project TCGA-GBM

Of the 2 files for download 2 already exist.

All samples have been already downloaded

Downloading data for project TCGA-GBM

GDCdownload will download: 4.24205 MB

Downloading as: f426768e-744a-4bed-bb8e-b5957a350bf2.rna_seq.augmented_star_gene_counts.tsv





--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

---------------

Downloading: 8.5 MB     

Downloading data for project TCGA-GBM

Of the 2 files for download 2 already exist.

All samples have been already downloaded

Downloading data for project TCGA-GBM

GDCdownload will download: 4.250864 MB

Downloading as: 6ff649cb-781f-4d71-b06f-3504d95879ad.rna_seq.augmented_star_gene_counts.tsv





--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

---------------

Downloading: 8.6 MB       

Downloading data for project TCGA-GBM

Of the 2 files for download 2 already exist.

All samples have been already downloaded

Downloading data for project TCGA-GBM

GDCdownload will download: 4.248662 MB

Downloading as: 325f91c9-70b9-4285-a7ff-0b7b6280a8ed.rna_seq.augmented_star_gene_counts.tsv





--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

---------------

Downloading: 8.5 MB     

Downloading data for project TCGA-GBM

Of the 2 files for download 2 already exist.

All samples have been already downloaded

Downloading data for project TCGA-GBM

GDCdownload will download: 4.249092 MB

Downloading as: 1f027bee-5316-4d9f-a566-41890da18493.rna_seq.augmented_star_gene_counts.tsv





--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

---------------

Downloading: 8.5 MB     

Downloading data for project TCGA-GBM

Of the 2 files for download 2 already exist.

All samples have been already downloaded

Downloading data for project TCGA-GBM

GDCdownload will download: 4.237877 MB

Downloading as: 4250878d-ba0b-47e1-96c6-f2331f93ce37.rna_seq.augmented_star_gene_counts.tsv





--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

---------------

Downloading: 8.6 MB     

Downloading data for project TCGA-GBM

Of the 2 files for download 2 already exist.

All samples have been already downloaded

Downloading data for project TCGA-GBM

GDCdownload will download: 4.2259 MB

Downloading as: 5a2fb8ba-a509-4216-83b5-174c1ee94c17.rna_seq.augmented_star_gene_counts.tsv





--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

---------------

Downloading: 8.6 MB     

Downloading data for project TCGA-GBM

Of the 2 files for download 2 already exist.

All samples have been already downloaded

Downloading data for project TCGA-GBM

GDCdownload will download: 4.241176 MB

Downloading as: 69804564-72f3-443e-a45e-add43d917b2c.rna_seq.augmented_star_gene_counts.tsv





--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

---------------

Downloading: 8.4 MB     

Downloading data for project TCGA-GBM

Of the 2 files for download 2 already exist.

All samples have been already downloaded

Downloading data for project TCGA-GBM

GDCdownload will download: 4.222639 MB

Downloading as: a261b442-2909-4aae-8c65-7a16b1f7d905.rna_seq.augmented_star_gene_counts.tsv





--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

---------------

Downloading: 8.5 MB       

Downloading data for project TCGA-GBM

Of the 2 files for download 2 already exist.

All samples have been already downloaded

Downloading data for project TCGA-GBM

GDCdownload will download: 4.245026 MB

Downloading as: b134958f-ae1e-4309-9cfc-16b443ab10c8.rna_seq.augmented_star_gene_counts.tsv





--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

---------------

Downloading: 8.5 MB       

Downloading data for project TCGA-GBM

Of the 2 files for download 2 already exist.

All samples have been already downloaded

Downloading data for project TCGA-GBM

GDCdownload will download: 4.23282 MB

Downloading as: a400f64e-ec00-4cc0-8110-38b2e349c7cf.rna_seq.augmented_star_gene_counts.tsv





--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

--------------------

ooo By platform

ooo By data.type

ooo By barcode

----------------

oo Checking data

----------------

ooo Checking if there are duplicated cases


ooo Checking if there are results for the query

-------------------

o Preparing output

-------------------

--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

---------------

Downloading: 8.5 MB     

Downloading data for project TCGA-GBM

Of the 2 files for download 2 already exist.

All samples have been already downloaded

Downloading data for project TCGA-GBM

GDCdownload will download: 4.234574 MB

Downloading as: c0c076b8-8559-4403-ae4d-bfdf57522b59.rna_seq.augmented_star_gene_counts.tsv





--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

--------------------

ooo By platform

ooo By data.type

ooo By barcode

----------------

oo Checking data

----------------

ooo Checking if there are duplicated cases


ooo Checking if there are results for the query

-------------------

o Preparing output

-------------------

--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

---------------

Downloading: 8.4 MB     

Downloading data for project TCGA-GBM

Of the 2 files for download 2 already exist.

All samples have been already downloaded

Downloading data for project TCGA-GBM

GDCdownload will download: 4.250525 MB

Downloading as: f54636e5-d823-4bc2-9475-2f4f81879067.rna_seq.augmented_star_gene_counts.tsv





--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

--------------------

ooo By platform

ooo By data.type

ooo By barcode

----------------

oo Checking data

----------------

ooo Checking if there are duplicated cases


ooo Checking if there are results for the query

-------------------

o Preparing output

-------------------

--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

---------------

Downloading: 8.5 MB     

Downloading data for project TCGA-GBM

Of the 2 files for download 2 already exist.

All samples have been already downloaded

Downloading data for project TCGA-GBM

GDCdownload will download: 4.242748 MB

Downloading as: fd0dc2d4-4a8e-4f37-8e1c-320193ace083.rna_seq.augmented_star_gene_counts.tsv





--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

--------------------

ooo By platform

ooo By data.type

ooo By barcode

----------------

oo Checking data

----------------

ooo Checking if there are duplicated cases


ooo Checking if there are results for the query

-------------------

o Preparing output

-------------------

--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

---------------

Downloading: 8.5 MB     

Downloading data for project TCGA-GBM

Of the 2 files for download 2 already exist.

All samples have been already downloaded

Downloading data for project TCGA-GBM

GDCdownload will download: 4.231523 MB

Downloading as: e88178d8-35c0-4578-b52b-30ebe4847b6e.rna_seq.augmented_star_gene_counts.tsv





--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

--------------------

ooo By platform

ooo By data.type

ooo By barcode

----------------

oo Checking data

----------------

ooo Checking if there are duplicated cases


ooo Checking if there are results for the query

-------------------

o Preparing output

-------------------

--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

---------------

Downloading: 8.5 MB     

Downloading data for project TCGA-GBM

Of the 2 files for download 2 already exist.

All samples have been already downloaded

Downloading data for project TCGA-GBM

GDCdownload will download: 4.252927 MB

Downloading as: 4a91d152-b46f-4721-a868-7ed491f9af56.rna_seq.augmented_star_gene_counts.tsv





--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

--------------------

ooo By platform

ooo By data.type

ooo By barcode

----------------

oo Checking data

----------------

ooo Checking if there are duplicated cases


ooo Checking if there are results for the query

-------------------

o Preparing output

-------------------

--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

---------------

Downloading: 8.5 MB     

Downloading data for project TCGA-GBM

Of the 2 files for download 2 already exist.

All samples have been already downloaded

Downloading data for project TCGA-GBM

GDCdownload will download: 4.241098 MB

Downloading as: 664dc12a-5d76-4efb-bb33-f0f58c937062.rna_seq.augmented_star_gene_counts.tsv





--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

--------------------

ooo By platform

ooo By data.type

ooo By barcode

----------------

oo Checking data

----------------

ooo Checking if there are duplicated cases


ooo Checking if there are results for the query

-------------------

o Preparing output

-------------------

--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

---------------

Downloading: 8.5 MB     

Downloading data for project TCGA-GBM

Of the 2 files for download 2 already exist.

All samples have been already downloaded

Downloading data for project TCGA-GBM

GDCdownload will download: 4.254634 MB

Downloading as: ec1ada4a-34c5-4d5f-a75f-25fc45d05a28.rna_seq.augmented_star_gene_counts.tsv





--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

--------------------

ooo By platform

ooo By data.type

ooo By barcode

----------------

oo Checking data

----------------

ooo Checking if there are duplicated cases


ooo Checking if there are results for the query

-------------------

o Preparing output

-------------------

--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

---------------

Downloading: 8.5 MB     

Downloading data for project TCGA-GBM

Of the 2 files for download 2 already exist.

All samples have been already downloaded

Downloading data for project TCGA-GBM

GDCdownload will download: 4.248667 MB

Downloading as: c47a6a50-b448-4736-b29d-bea454e28085.rna_seq.augmented_star_gene_counts.tsv





--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

--------------------

ooo By platform

ooo By data.type

ooo By barcode

----------------

oo Checking data

----------------

ooo Checking if there are duplicated cases


ooo Checking if there are results for the query

-------------------

o Preparing output

-------------------

--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

---------------

Downloading: 8.6 MB       

Downloading data for project TCGA-GBM

Of the 2 files for download 2 already exist.

All samples have been already downloaded

Downloading data for project TCGA-GBM

GDCdownload will download: 4.247089 MB

Downloading as: 139fe07b-05fb-4bc5-b6e2-179269e7ca51.rna_seq.augmented_star_gene_counts.tsv





--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

--------------------

ooo By platform

ooo By data.type

ooo By barcode

----------------

oo Checking data

----------------

ooo Checking if there are duplicated cases


ooo Checking if there are results for the query

-------------------

o Preparing output

-------------------

--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

---------------

Downloading: 8.6 MB       

Downloading data for project TCGA-GBM

Of the 2 files for download 2 already exist.

All samples have been already downloaded

Downloading data for project TCGA-GBM

GDCdownload will download: 4.240457 MB

Downloading as: d0d080e3-b40d-466e-8b51-58ee3f463667.rna_seq.augmented_star_gene_counts.tsv





--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

--------------------

ooo By platform

ooo By data.type

ooo By barcode

----------------

oo Checking data

----------------

ooo Checking if there are duplicated cases


ooo Checking if there are results for the query

-------------------

o Preparing output

-------------------

--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

---------------

Downloading: 8.6 MB     

Downloading data for project TCGA-GBM

Of the 2 files for download 2 already exist.

All samples have been already downloaded

Downloading data for project TCGA-GBM

GDCdownload will download: 4.206919 MB

Downloading as: a8050792-d0f5-4b8f-aea2-0a303f40b1ca.rna_seq.augmented_star_gene_counts.tsv





--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

--------------------

ooo By platform

ooo By data.type

ooo By barcode

----------------

oo Checking data

----------------

ooo Checking if there are duplicated cases


ooo Checking if there are results for the query

-------------------

o Preparing output

-------------------

--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

---------------

Downloading: 8.6 MB     

Downloading data for project TCGA-GBM

Of the 2 files for download 2 already exist.

All samples have been already downloaded

Downloading data for project TCGA-GBM

GDCdownload will download: 4.236301 MB

Downloading as: 4a44bd40-1141-4524-a35f-f29ee37f6c63.rna_seq.augmented_star_gene_counts.tsv





--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

--------------------

ooo By platform

ooo By data.type

ooo By barcode

----------------

oo Checking data

----------------

ooo Checking if there are duplicated cases


ooo Checking if there are results for the query

-------------------

o Preparing output

-------------------

--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

---------------

Downloading: 8.6 MB       

Downloading data for project TCGA-GBM

Of the 2 files for download 2 already exist.

All samples have been already downloaded

Downloading data for project TCGA-GBM

GDCdownload will download: 4.220583 MB

Downloading as: aab4bee0-b7a4-4671-b7e3-e11f11abe0e7.rna_seq.augmented_star_gene_counts.tsv





--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

--------------------

ooo By platform

ooo By data.type

ooo By barcode

----------------

oo Checking data

----------------

ooo Checking if there are duplicated cases


ooo Checking if there are results for the query

-------------------

o Preparing output

-------------------

--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

---------------

Downloading: 8.6 MB     

Downloading data for project TCGA-GBM

Of the 2 files for download 2 already exist.

All samples have been already downloaded

Downloading data for project TCGA-GBM

GDCdownload will download: 4.243121 MB

Downloading as: d83433da-3354-4e58-94f6-52d56d296363.rna_seq.augmented_star_gene_counts.tsv





--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

--------------------

ooo By platform

ooo By data.type

ooo By barcode

----------------

oo Checking data

----------------

ooo Checking if there are duplicated cases


ooo Checking if there are results for the query

-------------------

o Preparing output

-------------------

--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

---------------

Downloading: 8.5 MB     

Downloading data for project TCGA-GBM

Of the 2 files for download 2 already exist.

All samples have been already downloaded

Downloading data for project TCGA-GBM

GDCdownload will download: 4.247998 MB

Downloading as: 9b1f39e2-a587-4fde-afd2-e0808a7766c1.rna_seq.augmented_star_gene_counts.tsv





--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

--------------------

ooo By platform

ooo By data.type

ooo By barcode

----------------

oo Checking data

----------------

ooo Checking if there are duplicated cases


ooo Checking if there are results for the query

-------------------

o Preparing output

-------------------

--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

---------------

Downloading: 8.7 MB     

Downloading data for project TCGA-GBM

Of the 2 files for download 2 already exist.

All samples have been already downloaded

Downloading data for project TCGA-GBM

GDCdownload will download: 4.240477 MB

Downloading as: b45c00da-6d61-4b10-befa-dc1123f955a9.rna_seq.augmented_star_gene_counts.tsv





--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

--------------------

ooo By platform

ooo By data.type

ooo By barcode

----------------

oo Checking data

----------------

ooo Checking if there are duplicated cases


ooo Checking if there are results for the query

-------------------

o Preparing output

-------------------

--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

---------------

Downloading: 8.6 MB     

Downloading data for project TCGA-GBM

Of the 2 files for download 2 already exist.

All samples have been already downloaded

Downloading data for project TCGA-GBM

GDCdownload will download: 4.24883 MB

Downloading as: 37bae717-22b7-4a28-8768-ab83c62ad8ee.rna_seq.augmented_star_gene_counts.tsv





--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

--------------------

ooo By platform

ooo By data.type

ooo By barcode

----------------

oo Checking data

----------------

ooo Checking if there are duplicated cases


ooo Checking if there are results for the query

-------------------

o Preparing output

-------------------

--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

---------------

Downloading: 8.6 MB     

Downloading data for project TCGA-GBM

Of the 2 files for download 2 already exist.

All samples have been already downloaded

Downloading data for project TCGA-GBM

GDCdownload will download: 4.245636 MB

Downloading as: 71390b5a-ecd0-475f-8582-8bfb47bb844c.rna_seq.augmented_star_gene_counts.tsv





--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

--------------------

ooo By platform

ooo By data.type

ooo By barcode

----------------

oo Checking data

----------------

ooo Checking if there are duplicated cases


ooo Checking if there are results for the query

-------------------

o Preparing output

-------------------

--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

---------------

Downloading: 8.6 MB       

Downloading data for project TCGA-GBM

Of the 2 files for download 2 already exist.

All samples have been already downloaded

Downloading data for project TCGA-GBM

GDCdownload will download: 4.242247 MB

Downloading as: 79b783ac-103f-47a6-bc4b-8498a0be46ac.rna_seq.augmented_star_gene_counts.tsv





--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

--------------------

ooo By platform

ooo By data.type

ooo By barcode

----------------

oo Checking data

----------------

ooo Checking if there are duplicated cases


ooo Checking if there are results for the query

-------------------

o Preparing output

-------------------

--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

---------------

Downloading: 8.6 MB       

Downloading data for project TCGA-GBM

Of the 2 files for download 2 already exist.

All samples have been already downloaded

Downloading data for project TCGA-GBM

GDCdownload will download: 4.218805 MB

Downloading as: 6b59c7e7-11f7-4326-a849-2e1a6bb9f3e7.rna_seq.augmented_star_gene_counts.tsv





--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

--------------------

ooo By platform

ooo By data.type

ooo By barcode

----------------

oo Checking data

----------------

ooo Checking if there are duplicated cases


ooo Checking if there are results for the query

-------------------

o Preparing output

-------------------

--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

---------------

Downloading: 8.6 MB     

Downloading data for project TCGA-GBM

Of the 2 files for download 2 already exist.

All samples have been already downloaded

Downloading data for project TCGA-GBM

GDCdownload will download: 4.254868 MB

Downloading as: 41bdf255-88ed-4346-9033-7813b2c97d84.rna_seq.augmented_star_gene_counts.tsv





--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

--------------------

ooo By platform

ooo By data.type

ooo By barcode

----------------

oo Checking data

----------------

ooo Checking if there are duplicated cases


ooo Checking if there are results for the query

-------------------

o Preparing output

-------------------

--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

---------------

Downloading: 8.6 MB       

Downloading data for project TCGA-GBM

Of the 2 files for download 2 already exist.

All samples have been already downloaded

Downloading data for project TCGA-GBM

GDCdownload will download: 4.236915 MB

Downloading as: 7953f14d-563a-4c41-882e-cce3821547a3.rna_seq.augmented_star_gene_counts.tsv





--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

--------------------

ooo By platform

ooo By data.type

ooo By barcode

----------------

oo Checking data

----------------

ooo Checking if there are duplicated cases


ooo Checking if there are results for the query

-------------------

o Preparing output

-------------------

--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

---------------

Downloading: 8.6 MB     

Downloading data for project TCGA-GBM

Of the 2 files for download 2 already exist.

All samples have been already downloaded

Downloading data for project TCGA-GBM

GDCdownload will download: 4.233834 MB

Downloading as: 13da7123-db39-4bab-9521-daef5d3030ae.rna_seq.augmented_star_gene_counts.tsv





--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

--------------------

ooo By platform

ooo By data.type

ooo By barcode

----------------

oo Checking data

----------------

ooo Checking if there are duplicated cases


ooo Checking if there are results for the query

-------------------

o Preparing output

-------------------

--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

---------------

Downloading: 8.6 MB     

Downloading data for project TCGA-GBM

Of the 2 files for download 2 already exist.

All samples have been already downloaded

Downloading data for project TCGA-GBM

GDCdownload will download: 4.243153 MB

Downloading as: 32291119-ef4e-4041-9949-dc88dfce459a.rna_seq.augmented_star_gene_counts.tsv





--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

--------------------

ooo By platform

ooo By data.type

ooo By barcode

----------------

oo Checking data

----------------

ooo Checking if there are duplicated cases


ooo Checking if there are results for the query

-------------------

o Preparing output

-------------------

--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

---------------

Downloading: 8.5 MB     

Downloading data for project TCGA-GBM

Of the 2 files for download 2 already exist.

All samples have been already downloaded

Downloading data for project TCGA-GBM

GDCdownload will download: 4.247997 MB

Downloading as: e4bd9518-37b8-4d39-b717-7c750f311eba.rna_seq.augmented_star_gene_counts.tsv





--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

--------------------

ooo By platform

ooo By data.type

ooo By barcode

----------------

oo Checking data

----------------

ooo Checking if there are duplicated cases


ooo Checking if there are results for the query

-------------------

o Preparing output

-------------------

--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

---------------

Downloading: 8.6 MB       

Downloading data for project TCGA-GBM

Of the 2 files for download 2 already exist.

All samples have been already downloaded

Downloading data for project TCGA-GBM

GDCdownload will download: 4.244005 MB

Downloading as: 1a61593f-909e-46f4-8cfa-9e23bc050033.rna_seq.augmented_star_gene_counts.tsv





--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

--------------------

ooo By platform

ooo By data.type

ooo By barcode

----------------

oo Checking data

----------------

ooo Checking if there are duplicated cases


ooo Checking if there are results for the query

-------------------

o Preparing output

-------------------

--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

---------------

Downloading: 8.5 MB     

Downloading data for project TCGA-GBM

Of the 2 files for download 2 already exist.

All samples have been already downloaded

Downloading data for project TCGA-GBM

GDCdownload will download: 4.243355 MB

Downloading as: d620997f-b99e-4e20-a20d-d5b67ba3ce4b.rna_seq.augmented_star_gene_counts.tsv





--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

--------------------

ooo By platform

ooo By data.type

ooo By barcode

----------------

oo Checking data

----------------

ooo Checking if there are duplicated cases


ooo Checking if there are results for the query

-------------------

o Preparing output

-------------------

--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

---------------

Downloading: 8.6 MB     

Downloading data for project TCGA-GBM

Of the 2 files for download 2 already exist.

All samples have been already downloaded

Downloading data for project TCGA-GBM

GDCdownload will download: 4.249885 MB

Downloading as: e36091c6-d62c-4997-bb9f-7192b73f12ed.rna_seq.augmented_star_gene_counts.tsv





--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

--------------------

ooo By platform

ooo By data.type

ooo By barcode

----------------

oo Checking data

----------------

ooo Checking if there are duplicated cases


ooo Checking if there are results for the query

-------------------

o Preparing output

-------------------

--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

---------------

Downloading: 8.6 MB     

Downloading data for project TCGA-GBM

Of the 2 files for download 2 already exist.

All samples have been already downloaded

Downloading data for project TCGA-GBM

GDCdownload will download: 4.252362 MB

Downloading as: a0cfa49e-93dd-48c9-8a9a-c3dfdb081ae4.rna_seq.augmented_star_gene_counts.tsv





--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

--------------------

ooo By platform

ooo By data.type

ooo By barcode

----------------

oo Checking data

----------------

ooo Checking if there are duplicated cases


ooo Checking if there are results for the query

-------------------

o Preparing output

-------------------

--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

---------------

Downloading: 8.6 MB     

Downloading data for project TCGA-GBM

Of the 2 files for download 2 already exist.

All samples have been already downloaded

Downloading data for project TCGA-GBM

GDCdownload will download: 4.231886 MB

Downloading as: 54c6eff1-866a-4b26-9778-6b085fc25123.rna_seq.augmented_star_gene_counts.tsv





--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

--------------------

ooo By platform

ooo By data.type

ooo By barcode

----------------

oo Checking data

----------------

ooo Checking if there are duplicated cases


ooo Checking if there are results for the query

-------------------

o Preparing output

-------------------

--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

---------------

Downloading: 8.6 MB       

Downloading data for project TCGA-GBM

Of the 2 files for download 2 already exist.

All samples have been already downloaded

Downloading data for project TCGA-GBM

GDCdownload will download: 4.247314 MB

Downloading as: 1069403e-ef6a-43d4-b62d-3249e5a8e034.rna_seq.augmented_star_gene_counts.tsv





--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

--------------------

ooo By platform

ooo By data.type

ooo By barcode

----------------

oo Checking data

----------------

ooo Checking if there are duplicated cases


ooo Checking if there are results for the query

-------------------

o Preparing output

-------------------

--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

---------------

Downloading: 8.6 MB     

Downloading data for project TCGA-GBM

Of the 2 files for download 2 already exist.

All samples have been already downloaded

Downloading data for project TCGA-GBM

GDCdownload will download: 4.245935 MB

Downloading as: ff86d680-8a35-4f19-a850-eca08dfd3d48.rna_seq.augmented_star_gene_counts.tsv





--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

--------------------

ooo By platform

ooo By data.type

ooo By barcode

----------------

oo Checking data

----------------

ooo Checking if there are duplicated cases


ooo Checking if there are results for the query

-------------------

o Preparing output

-------------------

--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

---------------

Downloading: 8.6 MB     

Downloading data for project TCGA-GBM

Of the 2 files for download 2 already exist.

All samples have been already downloaded

Downloading data for project TCGA-GBM

GDCdownload will download: 4.251659 MB

Downloading as: 5b56daff-2531-46b1-a277-86b88b878c61.rna_seq.augmented_star_gene_counts.tsv





--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

--------------------

ooo By platform

ooo By data.type

ooo By barcode

----------------

oo Checking data

----------------

ooo Checking if there are duplicated cases


ooo Checking if there are results for the query

-------------------

o Preparing output

-------------------

--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-GBM

--------------------

oo Filtering results

---------------

Downloading: 8.6 MB     

Downloading data for project TCGA-GBM

Of the 2 files for download 2 already exist.

All samples have been already downloaded

Downloading data for project TCGA-GBM

GDCdownload will download: 4.238794 MB

Downloading as: 9cdfbc8a-c640-4e27-922c-053bc7fce262.rna_seq.augmented_star_gene_counts.tsv





### Checking whether all files have been downloaded

Before completing this notebook, we need to verify that all of the necessary files have indeed been downloaded. We do this by loading the names of the stored files and comparing them against the files listed in the 'reference_table_files_per_patient'.

In [26]:
# Defining the directories where the methylation and gene expression data is stored.
methylation_files_directory = paste0(data_directory, "/methylation_files")
gene_expression_files_directory = paste0(data_directory, "/gene_expression_files")

# Loading the names of the files by using the 'list.files()' function.
methylation_file_names <- list.files(methylation_files_directory)
gene_expression_file_names <- list.files(gene_expression_files_directory)

# First checking whether the lengths of the file name lists correspond to the number of patients in the 
# 'reference_table_files_per_patient'. Here we subtract 1 from the length of each file name list, as a hidden
# configuration file called 'Desktop.ini' is present in these folders on Microsoft Windows and is also retrieved by the
# 'list.files()' function. For the methylation data, we expect the number of files to be equal to twice the number of
# patients, as each sample has two methylation files (green and red channel) but only one gene expression file.
if (length(methylation_file_names)-1 == length(reference_table_files_per_patient$case_id)*2) {
    cat("The length of the 'methylation_file_names' corresponds to the number of patients in the 'reference_table_files_per_patient'.\n")
} else {
    cat("The length of the 'methylation_file_names' does not correspond to the number of patients in the 'reference_table_files_per_patient'.\n")
}

if (length(gene_expression_file_names)-1 == length(reference_table_files_per_patient$case_id)) {
    cat("The length of the 'gene_expression_file_names' corresponds to the number of patients in the 'reference_table_files_per_patient'.\n")
} else {
    cat("The length of the 'gene_expression_file_names' does not correspond to the number of patients in the 'reference_table_files_per_patient'.\n")
}

The length of the 'methylation_file_names' corresponds to the number of patients in the 'reference_table_files_per_patient'.
The length of the 'gene_expression_file_names' corresponds to the number of patients in the 'reference_table_files_per_patient'.


As we can see, the lengths match, so we can proceed to checking whether every file listed in the 'reference_table_files_per_patient' has been downloaded. This is done by looping over all of the files in the 'reference_table_files_per_patient' and checking whether each one appears among the loaded file names.

In [27]:
# Flag that is set to TRUE as soon as a file listed in the reference table turns out to be missing.
file_not_downloaded_found = FALSE

# Looping over all the files present in the 'reference_table_files_per_patient' and checking whether these can be found in 
# the downloaded/loaded file names.
for (i in seq_along(reference_table_files_per_patient$case_id)) {
    # Retrieving the methylation files and gene expression file for the current 'case_id'.
    methylation_file_green = reference_table_files_per_patient[i, methylation_file_green_channel]
    methylation_file_red = reference_table_files_per_patient[i, methylation_file_red_channel]
    gene_expression_file = reference_table_files_per_patient[i, gene_expression_file]
    
    # Checking whether the methylation and gene expression files found for the current 'case_id' have been downloaded.
    if (!(methylation_file_green %in% methylation_file_names)) {
        cat("The following methylation file for the green channel has not been downloaded:\n")
        print(methylation_file_green)
        file_not_downloaded_found = TRUE
    }
    
    if (!(methylation_file_red %in% methylation_file_names)) {
        cat("The following methylation file for the red channel has not been downloaded:\n")
        print(methylation_file_red)
        file_not_downloaded_found = TRUE
    }
    
    if (!(gene_expression_file %in% gene_expression_file_names)) {
        cat("The following gene_expression file has not been downloaded:\n")
        print(gene_expression_file)
        file_not_downloaded_found = TRUE
    }
}

# Reporting success only when no missing file was encountered in the loop above.
if (!file_not_downloaded_found) {
    cat("No files have been found in the 'reference_table_files_per_patient' that are not downloaded.")
}

No files have been found in the 'reference_table_files_per_patient' that are not downloaded.
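
The same check can also be expressed without an explicit loop. The following is a minimal sketch, not part of the original analysis: the helper 'find_missing_files()' and the toy file names are hypothetical and only illustrate the idea of comparing the expected file names with the names returned by 'list.files()' using 'setdiff()'. It also filters out 'Desktop.ini' by name instead of subtracting 1 from the length.

```r
# Hypothetical helper (an assumption, not part of the original notebook): returns the
# expected file names that do not occur in a directory listing.
find_missing_files <- function(expected_files, present_files) {
    # Dropping the hidden Windows configuration file, regardless of its capitalisation.
    present_files <- present_files[tolower(present_files) != "desktop.ini"]
    # 'setdiff()' keeps the elements of the first vector that are absent from the second.
    setdiff(expected_files, present_files)
}

# Toy data standing in for the reference table columns and the 'list.files()' output.
expected <- c("a_Grn.idat", "a_Red.idat", "b_Grn.idat")
present <- c("a_Grn.idat", "a_Red.idat", "Desktop.ini")

missing_files <- find_missing_files(expected, present)
print(missing_files)
```

With the variables of this notebook, the first argument would be, for example, the combined 'methylation_file_green_channel' and 'methylation_file_red_channel' columns of the 'reference_table_files_per_patient', and the second argument would be 'methylation_file_names'. An empty result then means that no file is missing.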

As we can see, all of the required methylation and gene expression files have been downloaded. We can now proceed to the next step, which consists of further processing the methylation and gene expression files. This includes reorganizing the files so that one file contains all of the methylation data and one file contains all of the gene expression data. This next step can be found in the notebooks in the same directory, 'Preprocessing', named 'Further Processing Methylation Files Part 1', 'Further Processing Methylation Files Part 2', and 'Further Processing Gene Expression Files'.