ThuenenFG/varianttools

Variant Tools

A tool for generating variant matrices for multiple individuals from variant calls produced by the CLC Genomics Workbench (CLC GWB).

For all processing options, type ./varianttools -h in the bin directory.

Three input options are mandatory:

  • A fasta file containing the reference sequence (option -f file).
  • A directory containing a set of csv files with the called variants of different individuals (option -i dir).
  • The data type of the input data which can either be "SNP" or "INDEL" (option -t type).
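The three mandatory options, plus the optional -m minimum coverage threshold, can be assembled into a call like the one sketched below. This is only an illustration: the helper name build_command and the file names reference.fasta and variants are hypothetical, and only the flags -f, -i, -t and -m are taken from this README.

```python
import subprocess


def build_command(fasta, input_dir, variant_type, min_coverage=None,
                  executable="./varianttools"):
    """Build the argument list for a varianttools run.

    Hypothetical helper; only the flags -f, -i, -t and -m come from the README.
    """
    if variant_type not in ("SNP", "INDEL"):
        raise ValueError('variant_type must be "SNP" or "INDEL"')
    cmd = [executable, "-f", str(fasta), "-i", str(input_dir), "-t", variant_type]
    if min_coverage is not None:
        # Omitting -m leaves the tool's documented default in place.
        cmd += ["-m", str(min_coverage)]
    return cmd


def run(cmd, bin_dir="bin"):
    """Run the command from the bin directory (assumed location of the executable)."""
    return subprocess.run(cmd, cwd=bin_dir, check=True)


if __name__ == "__main__":
    # Placeholder paths; replace them with real input files before calling run().
    print(build_command("reference.fasta", "variants", "SNP", min_coverage=8))
```

Returning the argument list instead of a shell string avoids quoting problems with paths that contain spaces.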

Optionally, a directory containing a set of coverage files in csv format can be specified. Their headers have to correspond to the format generated by the CLC GWB coverage export.

The headers of the SNP and INDEL input csv files have to conform to the table header specifications for called variants of CLC GWB version 7.5 or higher.

In addition to the variant matrix, a log file is created. The variant matrix in csv format contains the following headers:

  • Mapping, The reference or contig name
  • Reference Position
  • Type, Type of variant (SNP, INDEL, Deletion, Insertion)
  • Length of variant
  • Reference base
  • alternative base for each individual
  • coverage for every individual at the reference position
  • Number Of Individual Allels Deviating From Reference, the number of variants found across all individuals at a specific genomic position.
  • Number Of Allels Matching The Reference With Minimal Coverage, the number of individuals at this position in which no variant has been called and whose coverage reaches a minimum threshold. The threshold can be set with the -m option (default is 8).
  • Critical Forward Reverse Balance, an indicator for systematic sequencing errors that describes how the reads supporting the called variant are split between forward and reverse reads. Ideally the balance is 0.5. A sequencing error that occurs on only one strand pushes the balance toward zero or one. Only values lower than 0.2 or higher than 0.8 are shown in the output table. The value is averaged over all individuals showing the variant.
  • Left Flank
  • Allel, e.g. [G/A/C], where G is the reference base and A and C are the alternative bases.
  • Right Flank
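The Critical Forward Reverse Balance column can be illustrated with a short sketch. The README does not state the exact formula, so the code assumes that the balance is the fraction of forward reads among all reads supporting the variant, and that the per-individual values are averaged before the 0.2 and 0.8 cut-offs are applied. The function names are hypothetical.

```python
def forward_reverse_balance(forward, reverse):
    """Fraction of forward reads among the supporting reads (assumed definition)."""
    total = forward + reverse
    if total == 0:
        return None  # no supporting reads, balance undefined
    return forward / total


def critical_balance(read_counts, low=0.2, high=0.8):
    """Average the balance over individuals; report it only if it is critical.

    read_counts holds one (forward, reverse) pair per individual showing
    the variant. Returns None when the averaged balance is not critical.
    """
    balances = []
    for forward, reverse in read_counts:
        balance = forward_reverse_balance(forward, reverse)
        if balance is not None:
            balances.append(balance)
    if not balances:
        return None
    mean = sum(balances) / len(balances)
    # Only values below 0.2 or above 0.8 appear in the output table.
    return mean if mean < low or mean > high else None


if __name__ == "__main__":
    print(critical_balance([(9, 1), (10, 0)]))  # strongly forward-biased
    print(critical_balance([(5, 5), (6, 4)]))   # balanced, not reported
```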

The flanking sequences are calculated dynamically from two sequence distance thresholds. With the default values of 75 and 50, no flanking sequence is given if another variant lies within 50 bp of the called variant. If another variant lies within 75 bp, a flanking sequence of 50 bp is created. Otherwise, the flanking sequence is 75 bp long.
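The flank length rule can be written as a small function. This is a minimal sketch, not the tool's implementation: the function name and parameters are hypothetical, and treating "within" as inclusive (a neighbour at exactly 50 bp or 75 bp counts as inside the threshold) is an assumption.

```python
def flank_length(distance_to_nearest_variant, long_threshold=75, short_threshold=50):
    """Return the flank length in bp for a called variant.

    distance_to_nearest_variant is the distance in bp to the closest other
    variant on the same side, or None if there is no neighbouring variant.
    """
    if distance_to_nearest_variant is None:
        return long_threshold
    if distance_to_nearest_variant <= short_threshold:
        return 0  # neighbour too close: no flanking sequence is given
    if distance_to_nearest_variant <= long_threshold:
        return short_threshold  # neighbour within 75 bp: 50 bp flank
    return long_threshold  # no neighbour within 75 bp: full 75 bp flank


if __name__ == "__main__":
    for distance in (30, 60, 120, None):
        print(distance, flank_length(distance))
```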
