
#tcga-expedition

The Cancer Genome Atlas Project (TCGA) is a National Cancer Institute effort to profile about 24K cases of 33 different tumor types using genomic platforms and to make these data, both raw and processed, available to all researchers. TCGA data currently exceed 1.4 petabytes and include whole genome sequence (WGS), whole exome sequence, methylation, RNA expression, proteomic, and clinical datasets. Publicly accessible TCGA data are released through public portals, but navigating and using data obtained from these sites poses many challenges.

We developed TCGAExpedition to support the research community focused on computational methods for cancer research. Data obtained, versioned, and archived using TCGAExpedition support analysis with third-party tools as well as command-line access at high-performance computing facilities. TCGAExpedition consists of a set of scripts written in Bash, Python, and Java that download, extract, harmonize, version, and store all TCGA data and metadata. The software generates a versioned, participant- and sample-centered local TCGA data directory with metadata structures that directly reference both the local data files and the original data files. It also supports flexible searches of the data via a web portal, user-centric data tracking tools, and data provenance tools.

See TCGAExpedition installation and demo video.

#Access Requirements

Users accessing controlled data need a dbGaP Data Use Certificate.

#System Requirements

Unix 64-bit. Java 1.7

#Installation

####1. Select and install one of the supported storage back ends (PostgreSQL or Virtuoso)

NOTE: We recommend PostgreSQL; it is much faster than the RDF store.

####2. Configure

Set the following parameters in the resources/tcgaexpedition.conf file:

Required:

  • TCGA credentials. Leave blank if you download public data only

    tcga.user

    tcga.pwd

  • Storage selection. Uncomment 'postgres' or 'virtuoso'

    storage.name=postgres

    #storage.name=virtuoso

  • Set access parameters for PostgreSQL or Virtuoso

  • Set local repository location

    repository.home

Optional:

  • Email used to send updates to the web portal subscribers

  • Sender and receiver emails for notification about ambiguous tissue source site names
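
Before a first run it can help to confirm that the required parameters listed above are actually set. The following is a minimal sketch, not part of TCGAExpedition: it assumes the configuration file uses plain `key=value` lines with `#` comments (as the storage selection lines suggest), the helper names `parse_conf` and `missing_required` are hypothetical, and `/data/tcga` is a placeholder path.

```python
# Keys named in the "Required" list; credentials may stay blank for public data.
REQUIRED_KEYS = ("storage.name", "repository.home")
CREDENTIAL_KEYS = ("tcga.user", "tcga.pwd")


def parse_conf(text):
    """Parse key=value lines, skipping blank lines and '#' comments (assumed format)."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def missing_required(settings, controlled=False):
    """Return the required keys that are absent or empty."""
    required = list(REQUIRED_KEYS)
    if controlled:
        # Controlled-access downloads also need TCGA credentials.
        required += list(CREDENTIAL_KEYS)
    return [key for key in required if not settings.get(key)]


# Hypothetical file content; in practice read resources/tcgaexpedition.conf.
sample = """\
tcga.user=
tcga.pwd=
storage.name=postgres
#storage.name=virtuoso
repository.home=/data/tcga
"""

settings = parse_conf(sample)
print(missing_required(settings))
print(missing_required(settings, controlled=True))
```

With blank credentials the sample passes the check for public downloads but reports `tcga.user` and `tcga.pwd` as missing when controlled access is requested.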

#How to Run

Usage: java -jar tcgaExpedition-<vx.x.x>.jar --diseaseList <list> --analysistype <string> --accesstype <string>
================================================================================
--diseaseList           Comma separated list of disease abbreviations. Use ALL for the whole data set.
--analysistype          See table below for available analysis types based on the data source.
--accesstype            Use public or controlled. Access type depends on analysis type and data level. See table below.

Example

Download public clinical data for brca:

java -jar tcgaExpedition-<vx.x.x>.jar --diseaseList brca --analysistype clinical --accesstype public
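
When runs are scripted, the command line can be assembled programmatically. The sketch below is an illustration under assumptions: it uses only the three options from the usage text above, the function name `build_command` is hypothetical, and the jar name keeps the `<vx.x.x>` placeholder because the actual version is not given here.

```python
import shlex


def build_command(jar_path, diseases, analysis_type, access_type):
    """Assemble the argument list for one run, using the documented options."""
    if access_type not in ("public", "controlled"):
        raise ValueError("access type must be 'public' or 'controlled'")
    return [
        "java", "-jar", jar_path,
        "--diseaseList", ",".join(diseases),  # comma-separated abbreviations
        "--analysistype", analysis_type,
        "--accesstype", access_type,
    ]


cmd = build_command("tcgaExpedition-<vx.x.x>.jar", ["brca", "acc"], "clinical", "public")
# Print a shell-quoted form for inspection; to launch it, pass the list
# to subprocess.run(cmd, check=True) once the real jar name is filled in.
print(shlex.join(cmd))
```

Passing the arguments as a list avoids shell quoting problems with analysis types that contain parentheses, such as `cnv_(snp_array)`.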

#Available Analysis / Access Types

| DataSource | AnalysisType | AccessType | Level |
|---|---|---|---|
| TCGA | cnv_(cn_array) | public | 1,2,3 |
| TCGA | clinical | public | 2 |
| TCGA | cnv_(snp_array) | controlled | 1,2 |
| TCGA | cnv_(snp_array) | public | 3 |
| TCGA | cnv_(low_pass_dnaseq) | controlled | 2 |
| TCGA | cnv_(low_pass_dnaseq) | public | 3 |
| TCGA | dna_methylation | public | 1,2,3 |
| TCGA | expression_gene | public | 1,2,3 |
| TCGA | expression_protein | public | 0,1,2,3 |
| TCGA | fragment_analysis | controlled | 1 |
| TCGA | images | public | 1 |
| TCGA | mirnaseq | public | 3 |
| TCGA | protected_mutations | controlled | 2 |
| TCGA | protected_mutations_maf | public | 2 |
| TCGA | rnaseq | controlled | 2 |
| TCGA | rnaseq | public | 3 |
| TCGA | rnaseqv2 | public | 3 |
| TCGA | somatic_mutations | public | 2 |
| Firehose | CN_Level4 | controlled | 4 |
| Georgetown | mass_spectrometry | public | 4 |
| cgHub\* | bisulfite-seq_(cghub) | controlled | 1 |
| cgHub\* | mirna-seq_(cghub) | controlled | 1 |
| cgHub\* | rna-seq_(cghub) | controlled | 1 |
| cgHub\* | validation_(cghub) | controlled | 1 |
| cgHub\* | wgs_(cghub) | controlled | 1 |
| cgHub\* | wxs_(cghub) | controlled | 1 |

\* coming soon
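
Because the valid access type depends on both the analysis type and the data level, a small lookup can catch invalid combinations before a long download starts. This is a hedged sketch: the `AVAILABLE` mapping copies only a few rows from the table above, and the names `levels_for` and `access_types_for` are hypothetical, not part of TCGAExpedition.

```python
# (data source, analysis type, access type) -> data levels.
# Only a subset of the table is copied here for illustration.
AVAILABLE = {
    ("TCGA", "clinical", "public"): (2,),
    ("TCGA", "cnv_(snp_array)", "controlled"): (1, 2),
    ("TCGA", "cnv_(snp_array)", "public"): (3,),
    ("TCGA", "rnaseq", "controlled"): (2,),
    ("TCGA", "rnaseq", "public"): (3,),
    ("TCGA", "rnaseqv2", "public"): (3,),
}


def levels_for(analysis_type, access_type, source="TCGA"):
    """Return the data levels for a combination, or an empty tuple if unlisted."""
    return AVAILABLE.get((source, analysis_type, access_type), ())


def access_types_for(analysis_type, source="TCGA"):
    """Return the access types listed for an analysis type, sorted by name."""
    return sorted(
        access
        for (src, analysis, access) in AVAILABLE
        if src == source and analysis == analysis_type
    )


print(levels_for("rnaseq", "public"))
print(levels_for("clinical", "controlled"))
print(access_types_for("cnv_(snp_array)"))
```

An empty result from `levels_for` means the combination is not in the copied subset, so a complete check would need every row of the table.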

#License

[GPLv2](http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html)
