Skip to content

Frequently Asked Questions

tamsen edited this page Aug 23, 2016 · 2 revisions

For AD, is there any filtering on allele counts prior to calculation of depth? For example, if we calculate raw depth from the BAM file (no filtering based on alignment quality) will we get the same counts as is shown in AF?

The AD field in the .vcf file reflects any filtering done by the somatic variant caller.

In particular:

  • Reads with low mapping quality (below 1) are not included

  • Individual base calls with low base call quality (by default, below 20; this is override-able with the MinQScore setting)

Because of this read- and base-level filtering, the raw depth from the bam file will often be greater than the value reported in the AD field.

How is the LowVariantFreq frequency filter calculated? Is it VF < 5 % or min (VF, 1-VF) < 0.05, or something else?

This filter is applied to variants where the variant frequency (VF) is below a threshold. For example, if VariantFrequencyFilterCutoff is set to 0.05 in the sample sheet, then any variants with VF < 0.05 will be filtered (flagged with LowVF in the FILTER column of the .vcf file)

Can you please go through this example VF calculation:

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 09401982
chr17 40474482 . T A 100.00 PASS DP=30709 GT:GQ:AD:VF:NL:SB:GQX 0\1 : 100:27165,3488:0.1136:20:-100.0000:100

VF = 3488 / 30709 = 0.11358

VarInfo.VarFrequency = VarInfo.VarSupport / VarInfo.Coverage;

For NL, how is the basecall noise level calculated?

The Somatic Variant Caller by default uses a Noise Level of Q=20 or 1% for all sites. This is because we apply a basecall quality filter of 20 to the reads used in our calculations. This NoiseLevel, along with the Allele Depth, is used in our variant quality scoring model, in order to determine if a variant is significant. In future versions we may use a more complex noise model, and so NL would not be 20 for all sites.

What is the difference between MinimumCoverageDepthEmitCutoff and MinimumDepthCutoff . I understand that MininumDepthCutoff is a post-processing step, but how does the post-processing step differ from the regular Pisces step? Does the MinimumCoverageDepthEmitCutoff determine whether it will be a reference 0/0 or a NoCall ./. and then the MinimumDepthCutoff applies the LowDP filter?

See the "-c" argument, in the Option section of this doc, for how Pisces filters for coverage. Typically, when Isas calls Pisces, MinimumCoverageDepthEmitCutoff is passed in from Isas as the -c argument to Pisces, and MininumDepthCutoff is applied to set the LowDP filter on top of that, in Isas post processing.

When is the quality filter applied? (We assume the filter is applied post-alignment, per your response to question #1, but some other 3rd party applications rely on pre-alignment filtering and we wish to confirm.)

In illumina’s MSR pipeline, any reads failing the chastity filter are excluded entirely from analysis (they are omitted from the FASTQ files, so they aren't aligned or included in variant calling)

The somatic variant caller skips any reads with mapping quality < 1, and and when computing coverage, the somatic variant caller excludes individual base calls with quality < 20

Are there any calculations in the VCF that do not use basecall quality filtered counts? We assume not but, if so, which?

The output of the somatic variant caller is all based on post-filtering data. (In particular: A depth of 100 indicates the site has 100 PF reads, with mapping quality >0, with base call quality >=20).

Does the MinQScore setting used by Somatic Variant caller (specified in the SampleSheet file) refer specifically to basecall quality filter (or does it also apply to other quality scores, such as alignment, or genotype)?

MinQScore refers specifically to the base call quality score (from the FASTQ file). For example, if an otherwise high-quality read has a single base call with quality 5, the somatic variant caller won't use that base call (but can use the rest of the base calls from the read)

How is strand bias calculated?

To compute the strand bias, we start by evaluating evidence for the variant using only the reads from one strand at a time, applying the same Poisson probability distributions that are used to compute the Q-score of the variant (using pooled data from both strands). This data is used to compute a test statistic based on likelihood ratios. The directional bias for the strand 1 is computed using this formula:

Direction1Bias = (Direction1.ChanceVarFreqGreaterThanZero * Direction2.ChanceFalsePos) / OverallStats.ChanceVarFreqGreaterThanZero

The strand 2 bias is computed similarly. The maximum of these two test statistics, P, is then compared to a threshold, and if the test statistic is greater than the threshold, the variant receives the SB filter. (The default threshold is 0.5, and is configured using the StrandBiasFilter sample sheet setting. Setting StrandBiasFilter to a value closer to 1 makes filtering more permissive) The value SB in the .vcf file is a scaled version of this test statistic: 10 * log10(P). Note, that the 0.5 threshold is -3.01 in the scaled coordinate system. Therefore, is the scaled SB is greater than -3.01, we consider the variant to fail due to strand bias.

(This scaled value is also the strand bias metric reported from the GATK variant caller)

Is it fair to assume that Somatic Variant caller will never call variants in the shoulders, due to strand bias (except where amplicons overlap), so long as the StrandBiasFilter setting is properly configured?

For the truseq amplicon workflow in particular, variants covered by a single strand will still be called (and not filtered). A special case for strand bias is the case where a loci only has coverage in one strand direction. Normally variants at these loci are always flagged with the SB filter, but in the case of the Custom Amplicon workflow, these variants are permitted. (This is because for TSCA, some positions may always be covered on only one strand)

I’m not sure I’m comfortable with the exception to the ‘evidence from one strand only’ exclusion for Custom Amplicon data. I may be misunderstanding but I imagine a scenario where reads-1 and 2 (+ and - strand let’s say) overlap, covering a position on both strands, but only one strand shows non-reference calls, thereby providing highly biased support for the variant. I would like this to NOT be called, or at least heavily penalised and therefore filterable.

So, the question is, does “variants covered by a single strand will still be called” mean where there are only reads aligning to the position on one strand, or where there are non-reference calls on only one strand?

Good question! In the former case (coverage only available in one direction), we will not consider as biased if doing Custom Amplicon, because we expect this to happen for that workflow. In the latter case (coverage available in both directions, but reference reads all in one direction) that is considered biased, no matter what workflow you are doing . So, yes, we take care of the scenario you imagine.

We often have these “top hat” shaped coverage distributions for Custom Amplicon. If we have coverage (not just variant support) in both directions we apply the strand bias filter. If we only have coverage in one direction (like on the brims of the hat) we just give the variant a free pass with regards to strand bias.

So, yes, the situation you are describing, with an overlap on both strands, but the variant only appearing in one direction, would (depending on the coverage depth) be considered very biased. The algorithm is such that if the coverage is high (and the observed bias is very consistent through the reads at that loci) we consider the variant very biased. However, if you only had a low number of reads, say two reads, one in each direction, and the bias would be more modest, because you do not have much evidence either way.

If we are not doing Custom Amplicon, then we require variants to be present in both directions. (No free pass on the “brims” !)

(When running the somatic caller at the command line, this is configurable with the –fo flag.)

Let me get this right. 0 SB means very stand biased? Isn't that kind of backwards?

Yes. I (with reservations) originally elected for it because that is what I thought that users would be most familiar with. (and since then, GATK has abandoned the likelihood based SB estimation, with the counter-intuitive mapping, and gone for a Fisher’s exact test. Not a bad idea, except FE can be computationally unstable.) I have yet to find a SB algorithm I am totally happy with.

For AF, is allele frequency ever used for TruSeq amplicon cancer panel? It is defined in the .VCF but doesn't appear in our data files.

That’s a very good point. We will remove that header from the .vcf file in a future release.

We opted to report VF (for Variant Frequency) from the somatic variant caller, as opposed to using AF (for Allele Frequency), to emphasize that we are reporting the fraction of reads supporting the variant. (The "AF" is more typically used in a population genetics to give the frequency of an allele among many diploid individuals. Typically "AF" is reported as exactly 0.5 for a single individual, even if the fraction of reads with the variant is not precisely half)

What is the limit on indel length that SVC can detect?

Short answer: The limit on indel length is determined by the aligner.

Long answer: The SVC can detect any indels that are present in the .bam files output by the aligner. The SVC is typically used with the Custom Amplicon Workflow, and uses our Custom Amplicon Aligner, which uses a banded Smith Waterman alignment algorithm. The Smith Waterman bandwidth, which constrains the maximum indel length, is set to 10 in MSR 2.1 and below. After that , it was raised to 25, and made configurable.

What is a reasonable maximum value for CustomAmpliconAlignerMaxIndelSize?

We’ve tested as high as 25 with good results, but you could go higher. Its a trade-off between alignment time and indel length, and right now we are thinking 25 is a nice balance. But this is all upstream of variant calling. The variant caller itself has no limit.

Exactly how is the PB is calculated? (this is specific to Amplicon-DS)

A variant gets PB bias tag if it is in one pool and not the other, or if it shows up in both pools but is significantly biased (think 3% in one and 50%, both with good coverage, -> biased).

To figure out how biased is too biased, we use the exact same alg and threshold as with strand-bias (except substitute “pool” for “strand” in the text. The math is the same). See answer #4 for the details.

Can I use the Isas sample sheet to pass Pisces any cmd line options?

Yes, since Isas 2.6.18, samplesheet setting CustomParametersPisces is enabled. ie, "CustomParametersPisces,-b 42 -c 75". Please, for dev use only.

Can I force Isas to use my favorite version of Pisces?

Yes, if the API is compatible with the version of Isas you have. See the Isas SampleSheet link, "Internal-use" [Settings] section for more info.

What other Pisces settings can I tweek with the Isas sample sheet?

See the Isas SampleSheet documentation.

Can we get a version of the somatic caller with the hard coded lower detection threshold of 1% reduced or removed?

Technically, you already have one. There is no hardcoded lower detection threshold. 1% is the default that is easily defended, but one can certainly go less that 1% given you have the data quality to support that. You just need to pass in the desired min freq threshold, and also tighten the basecall quality filter (so the noise model actually believes it can call lower than 1%). And you must have basecalls that pass the filter threshold you chose to apply.

Note that tightening the -b in response to a low -f is now done automatically in Isas v2.6.15 (in the Pisces wrapper, not in Pisces itself). For example:

So, normally the Pisces default emit freq is 0.01 and the basecall Q score is 20. Isas filters variant calls at q30 (variant call quality score here, not basecall quality score).

Now, suppose the Pisces emit freq is 0.001 adjusted and the basecall Q score is forced up to 30* . Isas still filters variant calls at q30, based on a 0.1% noise model.

So a 1% variant with the old settings would get the same confidence as a 0.1% with the new settings (ie, you have a 50/50 shot its noise), and the variant filtering will take care of that (so a variant with qscore q30 with the old settings has the same meaning a q30 with the new parameters.).

Relevant Isas sample sheet settings are:

Isas setting Pisces setting example
minqscore -b 20
variantfrequencyemitcutoff -f 0.01
CustomParametersPisces [any] *

*example use is: "CustomParametersPisces,-b 42 -c 75 -a 10" etc..

The following Isas settings manipulate post processing, and are not passed to Pisces.

Isas setting Post processing affect example
variantfrequencyfiltercutoff variant frequency filter 0.05
variantminimumgqcutoff variant q score filter 30
variantminimumgqxcutoff variant q score filter 30

Why are there two gq cutoffs in ISAS?

GQ and GQX for Pisces are the same value. This is not the case for all VarCallers. As a somatic caller, Pisces does not have any diploid genotyping model. The Q scores simply reflect confidence that a variant exists, as opposed to the observations being due to noise.

Is this based on observations in both pull-down and non-amplicon data or the latter? We assumed that this would be true for amplicon data but we expect enrichment data to have fewer sample prep induced errors. We have noticed that we can call variants with AF below 1% if we change base quality but I don’t understand why, does Pisces use base quality internally or does it assume that all bases have the same base quality (i.e. set by the basecall quality filter).

Yes, it gets used internally. We only accept bases with acceptable quality. And then, in our noise model, we assume all bases have this quality or above ( a reasonable assumption, given the filter)

So, if you assume 1/100 noise you end up with quite low confidences in an observed 1% variant,. The probability that it is a true variant is less than 50%. Therefore the Q score is dismal, and it will get filtered.

Conversely, if you assume 1/1000 noise you end up with passable confidence at 1% variant.

Does the somatic caller have an upper threshold for allowed variant frequency? I.e. would it find the normal germline calls that have VFs of 50% or 100%? Or is it focused on low-frequency variants and would ignore the high-frequency ones? In the latter case, would it basically report noise if I ran it on a germline sample?

It has no upper limit. It will pick off low frequency sample noise, het and hom SNVs. Note the “het” designation for the somatic caller is a little different than what you might normally think of as a het, in healthy tissue. Since tumors tend to be heterogenous tissue, we allow various levels of ‘het’ calls. “Hets” are called as long as there is sufficient evidence (support across the reads), even at low frequency.

Does Pisces filter PCR Duplicates?

Yes, if the reads are marked as such in the bam. This has been on since Isas version 2.5.18. (~Pisces 3.2.4 or similar.)

Can I turn off QScore Recalibration?

Yes. Recalibration is a post-processing step in our pipelines. it is separate from the Pisces application. We hope to make that code available soon.

Can I turn on long range phasing?

This setting is not yet exposed in our pipelines for Pisces. Somatic long range phasing is separate from the Pisces application. We hope to make that code available soon.

What is Isas? It pervades this documentation.

Isas (formerly known as Isis) is our illumina secondary analysis software pipeline. Variant calling is a small part of this pipeline.

Are Pisces and the illumina Somatic Variant Caller the same thing?

Yes. Technically, Pisces is the project name, and the main executable was originally named CallSomaticVariants.

General

5.2.10

5.2.9

5.2.7

5.2.5

5.2.0

5.1.6

5.1.3

Clone this wiki locally