#tcga-expedition
The Cancer Genome Atlas Project (TCGA) is a National Cancer Institute effort to profile about 24K cases of 33 different tumor types using genomic platforms and to make these data, both raw and processed, available to all researchers. TCGA data currently exceed 1.4 petabytes in size and include whole genome sequence (WGS), whole exome sequence, methylation, RNA expression, proteomic, and clinical datasets. Publicly accessible TCGA data are released through public portals, but navigating and using data obtained from these sites presents many challenges.
We developed TCGAExpedition to support the research community focused on computational methods for cancer research. Data obtained, versioned, and archived using TCGAExpedition support analysis with third-party tools as well as command-line access at high-performance computing facilities. The TCGAExpedition software consists of a set of scripts written in Bash, Python, and Java that download, extract, harmonize, version, and store all TCGA data and metadata. The software generates a versioned, participant- and sample-centered local TCGA data directory with metadata structures that directly reference both the local data files and the original data files. The software also supports flexible searches of the data via a web portal, user-centric data tracking tools, and data provenance tools.
See TCGAExpedition installation and demo video.
Users accessing controlled data will need a dbGaP Data Use Certificate.
#System Requirements
Unix 64-bit. Java 1.7
#Installation
####1. Select and install one of the supported storage backends
- PostgreSQL: http://www.postgresql.org/download/
- Virtuoso: http://virtuoso.openlinksw.com/dataspace/doc/dav/wiki/Main/VOSDownload
NOTE: We recommend PostgreSQL; it is much faster than the RDF store (Virtuoso).
####2. Configure
Set the parameters in the resources/tcgaexpedition.conf file.
Required:
- TCGA credentials. Leave blank if you download public data only:
  tcga.user
  tcga.pwd
- Storage selection. Uncomment 'postgres' or 'virtuoso':
  storage.name=postgres
  #storage.name=virtuoso
- Access parameters for PostgreSQL or Virtuoso
- Local repository location:
  repository.home
Optional:
- Email used to send updates to the web portal subscribers
- Sender and receiver emails for notifications about ambiguous tissue source site names
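As a quick sanity check before a run, the required parameters listed above could be verified with a small script. This is only a sketch under assumptions: it assumes the configuration file uses plain `key=value` lines with `#` marking comments (as the `#storage.name=virtuoso` line suggests), and the rule that `tcga.user` and `tcga.pwd` must be set together is an illustrative assumption, not documented behavior. The function names are hypothetical and are not part of TCGAExpedition.

```python
from pathlib import Path

# Parameter names taken from the configuration list above.
REQUIRED_KEYS = ("storage.name", "repository.home")
SUPPORTED_STORAGE = ("postgres", "virtuoso")


def parse_conf(text):
    """Parse key=value lines; blank lines and lines starting with '#' are skipped."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def check_conf(settings):
    """Return a list of problems; an empty list means none were found."""
    problems = []
    for key in REQUIRED_KEYS:
        if not settings.get(key):
            problems.append(f"missing required parameter: {key}")
    storage = settings.get("storage.name")
    if storage and storage not in SUPPORTED_STORAGE:
        problems.append(f"unsupported storage.name: {storage}")
    # Assumption: credentials are either both set or both blank (public data only).
    if bool(settings.get("tcga.user")) != bool(settings.get("tcga.pwd")):
        problems.append("tcga.user and tcga.pwd must be set together or both left blank")
    return problems


if __name__ == "__main__":
    conf_path = Path("resources/tcgaexpedition.conf")
    if conf_path.exists():
        for problem in check_conf(parse_conf(conf_path.read_text())):
            print(problem)
```

Running the script from the project root prints one line per problem and prints nothing when the checked parameters look consistent.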
#How to Run
Usage: java -jar tcgaExpedition-<vx.x.x>.jar --diseaseList <list> --analysistype <string> --accesstype <string>
================================================================================
--diseaseList Comma-separated list of disease abbreviations. Use ALL for the whole data set.
--analysistype See the table below for available analysis types based on the data source.
--accesstype Use public or controlled. Access type depends on analysis type and data level. See the table below.
Example
Download clinical data for brca:
java -jar tcgaExpedition-<vx.x.x>.jar --diseaseList brca --analysistype clinical --accesstype public
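For scripted or batch runs, the command can be assembled programmatically. The following is a minimal sketch that builds the argument list from the usage line above and optionally launches it. The helper names `build_command` and `run` are hypothetical, and the jar file name is a placeholder that must be replaced with the actual release file.

```python
import subprocess


def build_command(jar_path, diseases, analysis_type, access_type):
    """Build the argument list matching the documented usage line."""
    if access_type not in ("public", "controlled"):
        raise ValueError("access type must be 'public' or 'controlled'")
    return [
        "java", "-jar", jar_path,
        "--diseaseList", ",".join(diseases),  # comma-separated abbreviations, or ALL
        "--analysistype", analysis_type,
        "--accesstype", access_type,
    ]


def run(command):
    """Launch the download and raise CalledProcessError on a non-zero exit code."""
    return subprocess.run(command, check=True)


if __name__ == "__main__":
    # Placeholder jar name; substitute the actual release version.
    cmd = build_command("tcgaExpedition-<vx.x.x>.jar", ["brca"], "clinical", "public")
    print(" ".join(cmd))
    # run(cmd)  # uncomment to start the download
```

Passing the arguments as a list avoids shell quoting issues with analysis types that contain parentheses, such as cnv_(snp_array).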
#Available Analysis / Access Types
DataSource | AnalysisType | AccessType | Level |
--- | --- | --- | --- |
TCGA | clinical | public | 2 |
TCGA | cnv_(cn_array) | public | 1,2,3 |
TCGA | cnv_(snp_array) | controlled | 1,2 |
TCGA | cnv_(snp_array) | public | 3 |
TCGA | cnv_(low_pass_dnaseq) | controlled | 2 |
TCGA | cnv_(low_pass_dnaseq) | public | 3 |
TCGA | dna_methylation | public | 1,2,3 |
TCGA | expression_gene | public | 1,2,3 |
TCGA | expression_protein | public | 0,1,2,3 |
TCGA | fragment_analysis | controlled | 1 |
TCGA | images | public | 1 |
TCGA | mirnaseq | public | 3 |
TCGA | protected_mutations | controlled | 2 |
TCGA | protected_mutations_maf | public | 2 |
TCGA | rnaseq | controlled | 2 |
TCGA | rnaseq | public | 3 |
TCGA | rnaseqv2 | public | 3 |
TCGA | somatic_mutations | public | 2 |
Firehose | CN_Level4 | controlled | 4 |
Georgetown | mass_spectrometry | public | 4 |
cgHub* | bisulfite-seq_(cghub) | controlled | 1 |
cgHub* | mirna-seq_(cghub) | controlled | 1 |
cgHub* | rna-seq_(cghub) | controlled | 1 |
cgHub* | validation_(cghub) | controlled | 1 |
cgHub* | wgs_(cghub) | controlled | 1 |
cgHub* | wxs_(cghub) | controlled | 1 |
\* coming soon
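Because the valid access type depends on both the analysis type and the data level, it can help to check a combination against the table before starting a long download. The sketch below transcribes only a few TCGA rows from the table above into a lookup; it is an illustration, not part of TCGAExpedition, and the names `ACCESS_LEVELS`, `access_types_for`, and `levels_for` are hypothetical.

```python
# (analysis type, access type) -> data levels, transcribed from a subset of the table above.
ACCESS_LEVELS = {
    ("clinical", "public"): (2,),
    ("cnv_(snp_array)", "controlled"): (1, 2),
    ("cnv_(snp_array)", "public"): (3,),
    ("rnaseq", "controlled"): (2,),
    ("rnaseq", "public"): (3,),
    ("rnaseqv2", "public"): (3,),
    ("somatic_mutations", "public"): (2,),
}


def access_types_for(analysis_type):
    """Return the access types listed for an analysis type, sorted alphabetically."""
    return sorted({access for (analysis, access) in ACCESS_LEVELS if analysis == analysis_type})


def levels_for(analysis_type, access_type):
    """Return the data levels for a combination, or an empty tuple if it is not listed."""
    return ACCESS_LEVELS.get((analysis_type, access_type), ())


if __name__ == "__main__":
    print(access_types_for("rnaseq"))
    print(levels_for("cnv_(snp_array)", "controlled"))
```

An empty result from `levels_for` means the combination does not appear in the transcribed subset, so the full table should be consulted before concluding that it is unsupported.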
#License
[GPLv2](http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html)