08. Gene Variant Data (vcf)

Gene variant data in VCF format is loaded from the VCFDataToUpload directory. Loading VCF data with transmart-data is described on the tranSMART wiki.

A VCF data file is exactly that: gene variant data in VCF format, produced as output by a gene variant software package such as PLINK. Do not open a VCF data file in a desktop text editing app. Doing so breaks the format, and the file can no longer be loaded.

vcf Data (sample)

Mapping file

Unlike mapping files for HDD data, the VCF data mapping file includes two comment lines.

#STUDY_TITLE: Complete Genomics_Breast Cancer

Tumor_Blood_Pair1 1 Biomarker_Data+ Single_Nucleotide_Polymorphism+ Variant_Call_Format_(VCF)
Tumor_Blood_Pair2 2 Biomarker_Data+ Single_Nucleotide_Polymorphism+ Variant_Call_Format_(VCF)

NOTE 1: VCF data is loaded into DE_VARIANT_DATASET. The DATASET_ID column in this table is limited to 50 characters. For VCF data, DATASET_ID is the Study ID concatenated with the name of the actual VCF data file. Make sure the file name is short enough that, together with the Study ID, it does not exceed 50 characters.
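
The 50-character limit from NOTE 1 can be checked before loading. The following is a minimal sketch, not part of tMDataLoader: the function names and the Study ID "BRCA_CG" are hypothetical, and it assumes plain concatenation with no separator and the full file name including its extension. If the loader drops the extension, the check is simply stricter than necessary.

```python
from pathlib import Path

# Limit on DE_VARIANT_DATASET.DATASET_ID stated in NOTE 1.
MAX_DATASET_ID_LEN = 50


def dataset_id_length(study_id: str, vcf_path: str) -> int:
    """Length of Study ID + VCF file name.

    Assumption: plain concatenation, full file name with extension.
    """
    return len(study_id) + len(Path(vcf_path).name)


def fits_dataset_id(study_id: str, vcf_path: str) -> bool:
    """True if the combined length stays within the 50-character limit."""
    return dataset_id_length(study_id, vcf_path) <= MAX_DATASET_ID_LEN


if __name__ == "__main__":
    # "BRCA_CG" is a made-up Study ID used only for illustration.
    for name in ["Tumor_Blood_Pair1.vcf", "Tumor_Blood_Pair2.vcf"]:
        status = "OK" if fits_dataset_id("BRCA_CG", name) else "TOO LONG"
        print(name, dataset_id_length("BRCA_CG", name), status)
```

Running such a check over every file in VCFDataToUpload before a load is a cheap way to catch over-long names early.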

NOTE 2: tMDataLoader can load VCF data from either one merged file or multiple files (e.g. one data file per subject). If the path in category_cd is the same for all samples, multiple VCF files are loaded into one node. By default, without category_cd, a new node is created for each VCF file. If VCF data is supplied in multiple files, all files are loaded in parallel, which reduces the loading time.
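
The node grouping rule from NOTE 2 can be illustrated with a small sketch. This is not tMDataLoader code: the function plan_nodes and the placeholder key used for default nodes are hypothetical, and the actual name of a default node is not specified here. The sketch only shows which files end up together.

```python
from collections import defaultdict


def plan_nodes(samples):
    """Group VCF files by the node they would be loaded into.

    samples: list of (vcf_file, category_cd) pairs; category_cd may be None.
    Returns a dict mapping a node key to the list of files under that node.
    """
    nodes = defaultdict(list)
    for vcf_file, category_cd in samples:
        if category_cd:
            # Files sharing the same category_cd path share one node.
            key = category_cd
        else:
            # Without category_cd each file gets its own node.
            # The key below is a placeholder, not the real node name.
            key = f"<default node for {vcf_file}>"
        nodes[key].append(vcf_file)
    return dict(nodes)


if __name__ == "__main__":
    path = "Biomarker_Data+ Single_Nucleotide_Polymorphism+ Variant_Call_Format_(VCF)"
    shared = plan_nodes([("pair1.vcf", path), ("pair2.vcf", path)])
    separate = plan_nodes([("pair1.vcf", None), ("pair2.vcf", None)])
    print(len(shared), "node(s) with a shared category_cd")
    print(len(separate), "node(s) without category_cd")
```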

