Skip to content
/ gatk Public
forked from broadgsa/gatk

Daniel Klevebring's GATK Repository: contains the core MIT-licensed GATK framework, and custom tools, free for all uses.

Notifications You must be signed in to change notification settings

Q-KIM/gatk

 
 

Repository files navigation

The Genome Analysis Toolkit

Build Status

This is my public version of the (MIT licensed) GATK source tree with added custom walkers.

SomaticPindelFilter

Filtering of variants from reported from pindel. Performs a Fisher's Exact test for each variants and Benjamini-Hochberg adjusts all p-values. Prints variants with adjusted p < cutoff (settable).

TL;DR

java -jar GenomeAnalysisTK.jar -T SomaticPindelFilter -V pindel_variants.vcf -o out.vcf -TID TUMOR_ID -NID NORMAL_ID -R reference.fa

Parameters

Arguments for SomaticPindelFilter:

 -V,--variant <variant>                                        Select variants from this VCF file
 -TID,--tumorid <tumorid>                                      Sample Name of the tumor
 -NID,--normalid <normalid>                                    Sample Name of the normal
 -o,--out <out>                                                File to which variants should be written
 -MIN_DP_N,--minCoverageNormal <minCoverageNormal>             Minimum depth required in the normal, default 25
 -MIN_DP_T,--minCoverageTumor <minCoverageTumor>               Minimum depth required in the tumor, default 25
 -MAX_DP_N,--maxCoverageNormal <maxCoverageNormal>             Maximum depth required in the normal, default 1000
 -MAX_DP_T,--maxCoverageTumor <maxCoverageTumor>               Maximum depth required in the tumor, default 1000
 -MAX_N_FRAC,--maxNormalFraction <maxNormalFraction>           Maximum fraction allowed in the normal, default 0.15
 -ADJ_P_CUTOFF,--AdjustedPValueCutoff <AdjustedPValueCutoff>   Cutoff for the adjusted p values, default 0.05

Note that with a clean normal sample (like DNA from blood), -MAX_N_FRAC can ususally be set way lower than 0.15. The likeliehood that a sequencing error in the normal sample resulting in an indel at the exact same place as a true somatic indel in the tumor is likely very very low. The error rate of the index reads likely sets the limit for this (reads that are misclassified to the tumor sample due to low quality of the index).

Where does the variants come from?

I run pindel with a config file like so:

pindel -f reference.fa -i configfile.txt -o pindelPrefix
pindel2vcf -P pindelPrefix --gatk_compatible -r reference.fa \
           -R human_g1k_v37_decoy -d 20140127 -v pindel_variants.vcf --compact_output_limit 15

Please refer to the pindel website for further information on how to run pindel. Of note, if --compact_output_limit 15 is omitted, the REF and ALT fields of the vcf file can be several Mb in size for long inversions, insertions or deletion. That results in a large and possibly unparsable vcf file (baaaaad).


SomaticIndelDetectorFilter

Filtering of variants from reported from IndelGenotyperV2 (the now very old Broad Cancer tool). Performs a Fisher's Exact test for each variants and Benjamini-Hochberg adjusts all p-values. Prints variants with adjusted p < cutoff (settable).

TL;DR

java -jar GenomeAnalysisTK.jar -T SomaticIndelDetectorFilter -V indelDetector_variants.vcf -o out.vcf -TID TUMOR_ID -NID NORMAL_ID -R reference.fa

Parameters

Arguments for SomaticIndelDetectorFilter:

-V,--variant <variant>                                        Select variants from this VCF file
-TID,--tumorid <tumorid>                                      Sample Name of the tumor
-NID,--normalid <normalid>                                    Sample Name of the normal
-o,--out <out>                                                File to which variants should be written
-MIN_DP_N,--minCoverageNormal <minCoverageNormal>             Minimum depth required in the normal
-MIN_DP_T,--minCoverageTumor <minCoverageTumor>               Minimum depth required in the tumor
-MAX_DP_N,--maxCoverageNormal <maxCoverageNormal>             Maximum depth required in the normal
-MAX_DP_T,--maxCoverageTumor <maxCoverageTumor>               Maximum depth required in the tumor
-MAX_N_FRAC,--maxNormalFraction <maxNormalFraction>           Maximum fraction allowed in the normal
-ADJ_P_CUTOFF,--AdjustedPValueCutoff <AdjustedPValueCutoff>   Cutoff for the adjusted p values

SelectPathogenicVariants

Walker to select pathogenic variants from clinvar.

TL;DR

To select variants from certain genes, run

 java -jar ./public/external-example/target/external-example-1.0-SNAPSHOT.jar -T SelectPathogenicVariants \
 -V public/testdata/dakl/clinvar_20140807.vcf -R ~/genome/human_g1k_v37_decoy.fasta -o ~/tmp/tmp.vcf -G BRCA1 -G BRCA2

To select pathogenic variants from all genes, run

java -jar ./public/external-example/target/external-example-1.0-SNAPSHOT.jar -T SelectPathogenicVariants \
-V public/testdata/dakl/clinvar_20140807.vcf -R ~/genome/human_g1k_v37_decoy.fasta -o ~/tmp/tmp.vcf -G BRCA1 -G BRCA2

Parameters

Arguments for SelectPathogenicVariants: -V,--variant Input VCF file -o,--out File to which variants should be written -G,--genes String of gene names from which variants are selected. Multiple allowed. If not specified, all genes are selected -GI,--GENEINFO Name of the field containing gene info.


TelomereQuant

Walker to quantify the telomeric content in a sample. The walker calculates the following:

  • The number of reads that has at least NR repeats of the telomere repeat sequence (TTAGGG or CCCTAA) in them (ref).
  • The total number of repeats seen, only counting repeats in reads that has NR x teloseq. Reads that have fewer than NR x teloseq are not counted to avoid false positives.
  • Total bases sequenced

The walker allows for multiple samples to be given within a single BAM file, and will calculate these figures for each sample given (using read groups).

TL;DR

<<<<<<< HEAD java -jar ./public/external-example/target/external-example-1.0-SNAPSHOT.jar -T TelomereQuant -I mybam.bam -o outfile.txt

-jar ./public/external-example/target/external-example-1.0-SNAPSHOT.jar -T TelomereQuant -R reference.fa -I mybam.bam -o outfile.txt

FETCH_HEAD

Parameters

Arguments for TelomereQuant:

-o,--out <out>                                An output file created by the walker.  Will overwrite contents if file
                                              exists
-NR,--number_of_repeats <number_of_repeats>   Number of repeats of telomere sequence required for a read to be
                                              classified as telomeric. Default 4.

HeterozygoteConcordance

Walker to calculate the number of heterozygote SNPs in a VCF file that have read support in a BAM file.

I use this to verify the identity of a tumor and normal exome or panel pairs. Variants are called in the normal sample, and then used as input (-V) to the walker.

The output is a table with the total number of variants, total hz variants and number of hz variants that have read support in the BAM, broken down per sample in the BAM file.

TL;DR

java -jar ./public/external-example/target/external-example-1.0-SNAPSHOT.jar -T HeterozygoteConcordance -I tumor.bam -V normalvariants.vcf -R reference.fa -sid VcfSampleToUse -o output.txt

Parameters

Arguments for HeterozygoteConcordance:

 -I,--input_file <input_file>   Input file containing sequence data (SAM or BAM) (required)
 -sid,--VCFSample <VCFSample>   Sample ID from which to get HZ calls. Must be present in the VCF. (required)
 -V,--variant <variant>         Input VCF file (required)
 -o,--out <out>                 File to which output should be written
 -md,--mindepth <mindepth>      Minimum BAM depth of a position to be considered (default 30)
 -up,--upper <upper>            Upper limit over which hom alt GTs are called (default .9)
 -lo,--lower <lower>            Lower limit over which hom ref GTs are called (default .1)

Compile me

You need maven to compile the GATK.

git clone https://github.com/dakl/gatk.git gatk-klevebring
cd gatk-klevebring

I usually skip compiling Queue:

mvn verify -P\!queue
cp ./public/external-example/target/external-example-1.0-SNAPSHOT.jar ./GenomeAnalysisTK-Klevebring.jar

The resulting jar file is ./public/external-example/target/external-example-1.0-SNAPSHOT.jar

Not working as expected?

Please report any bugs in the issue tracker on this site.

See also

See http://www.broadinstitute.org/gatk/

About

Daniel Klevebring's GATK Repository: contains the core MIT-licensed GATK framework, and custom tools, free for all uses.

Resources

Stars

Watchers

Forks

Packages

No packages published