transmart-etlgen
Tool for generating a tranSMART ETL mapping file from a TCGA clinical biotab file.
The Cancer Genome Atlas (TCGA) collects clinical and biospecimen information for all qualified patients, submitted as one XML file per patient. These XML files are converted into tab-delimited text files, or "biotabs", which contain collated clinical and biospecimen data for the patients. Each clinical biotab file contains header metadata for each column of data, i.e., the field name and the common data element ID (CDE_ID), or public ID. This structured metadata is the source used to obtain the reference mappings: the translation process parses it into its discrete parts.

Using the CDE_ID as the key, a RESTful call is made to the ITCR semantic metadata mapping service. This service contains mapped vocabularies for all the TCGA data elements and the North American Association of Central Cancer Registries (NAACCR) data elements, as well as their mappings to the DeepPhe ontology. This ensures that related data elements are mapped consistently across these common sources and provides further consistency across cancers in general. The mapping data returned by the ITCR service includes the preferred data label, permissible values, and ontology classifier, among other information.

This metadata is used to describe the key elements (i.e., category code, data label, and controlled vocabulary code) that map the clinical biotab file. The result is a tab-delimited text file that provides the specification used by the standard ETL scripts, provided by tranSMART, to read the clinical biotab files and load the data directly into the tranSMART repository.
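To make the header-parsing step concrete, here is a minimal shell sketch of pairing each column with its field name and CDE_ID. It is not the tool's actual (Java) implementation. The two-row header layout, the `CDE_ID:` prefix format, and every sample value are assumptions made up for illustration, and the lookup against the ITCR service is left out entirely.

```shell
#!/bin/sh
# Sketch only: the two-row header layout and all sample values below are
# made up for illustration; real biotab headers may differ.
set -eu

biotab=$(mktemp)
# Row 1: field names. Row 2: CDE identifiers. Row 3: one data row.
printf 'bcr_patient_barcode\tgender\tvital_status\n' > "$biotab"
printf 'CDE_ID:0000001\tCDE_ID:0000002\tCDE_ID:0000003\n' >> "$biotab"
printf 'PATIENT-0001\tFEMALE\tAlive\n' >> "$biotab"

# Emit one line per column: column number, field name, bare CDE_ID.
columns=$(awk -F'\t' '
    NR == 1 { for (i = 1; i <= NF; i++) name[i] = $i; n = NF }
    NR == 2 {
        for (i = 1; i <= n; i++) {
            id = $i
            sub(/^CDE_ID:/, "", id)   # strip the assumed prefix
            printf "%d\t%s\t%s\n", i, name[i], id
        }
        exit                          # data rows are not needed here
    }
' "$biotab")

printf '%s\n' "$columns"
rm -f "$biotab"
```

Each emitted line carries the column number and the key that would be sent to the mapping service; the category code, data label, and controlled vocabulary code would then come from the service's response.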
Compile
mvn package
Run
java -jar etlgen-0.0.1.jar <tcgaBiotabFile>
See the following for tranSMART mapping file specs or TCGA
Quick Instructions on Loading Data Into tranSMART
from \transmart-data
- $ source vars (sets up the required environment variables; needed only once per session)
- Create a study directory structure under \transmart-data\samples\studies. It should look like this:
\tcga-brca (any study name)
|-- \clinical (must be named this)
| |-- tcga-biotab.txt (biotab containing the data)
| |-- tcga-map-spec.txt (mapping spec created from etlgen)
|-- clinical.params (text file listing the input files, see below *)
_*Content of the clinical.params file_
COLUMN_MAP_FILE=tcga-map-spec.txt
WORD_MAP_FILE=x
RECORD_EXCLUSION_FILE=x
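The layout and params file described above could be set up with a short script along these lines. This is a sketch under assumptions: `STUDY` and `STUDIES_DIR` are hypothetical variable names, the empty files created with `touch` only stand in for the real biotab and the etlgen output, and `clinical.params` is placed at the study level beside `clinical`, which is one reading of the tree shown above.

```shell
#!/bin/sh
# Sketch only: STUDY, STUDIES_DIR and the placeholder files are illustrative;
# copy in the real biotab and the etlgen mapping spec instead of using touch.
set -eu

STUDY=tcga-brca
# Defaults to a scratch directory; point it at the studies directory
# of your transmart-data checkout to use it for real.
STUDIES_DIR=${STUDIES_DIR:-$(mktemp -d)}

mkdir -p "$STUDIES_DIR/$STUDY/clinical"
touch "$STUDIES_DIR/$STUDY/clinical/tcga-biotab.txt"     # biotab data
touch "$STUDIES_DIR/$STUDY/clinical/tcga-map-spec.txt"   # etlgen output

# The params file names the mapping spec; x marks files that are not used.
cat > "$STUDIES_DIR/$STUDY/clinical.params" <<'EOF'
COLUMN_MAP_FILE=tcga-map-spec.txt
WORD_MAP_FILE=x
RECORD_EXCLUSION_FILE=x
EOF

# Show the resulting tree.
find "$STUDIES_DIR/$STUDY" | sort
```

Running it without `STUDIES_DIR` set builds the tree in a temporary directory, which is a safe way to check the layout before touching a real checkout.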
Tar up the study directory (from inside the directory where the study was created)
At the study level (i.e., tcga-brca), tar the contents (e.g., from inside \transmart-data\samples\studies\tcga-brca):
- tar -cJf [study]_clinical.tar.xz * (e.g., tar -cJf tcga-brca_clinical.tar.xz *)
__Load data using the tranSMART ETL scripts__
_From the \transmart-data level_
$ make -C samples/{oracle,postgres} load_clinical_[study] // use either postgres or oracle, depending on which is installed on the VM (e.g., make -C samples/postgres load_clinical_tcga-brca)