(PR #2 of 2) Determine concordance between MuTect2 and Strelka2 SNV calls #76
Nice job on this! I have a few questions about moving forward:
1. Are you suggesting using just concordant SNVs/indels or using only those from Strelka2?
2. Prior to recommending something for 1), I think we should assess the pathogenicity of the Mutect2-only variants. It would be useful to see these plots for nonsynonymous, coding mutations only. For the purposes of oncoprints and TMB, for example, we will not use synonymous variants anyway. Then, of those predicted damaging nonsynonymous, coding mutations called only by Mutect2, are any in putative oncogenes/tumor suppressor genes? Whether the SNV is annotated in COSMIC is probably the easiest quick check, but we also have some gene lists if interested. For example, if we still see a low-level TP53 mutation in the DNA binding domain or a canonical RAS mutation, we should retain these for downstream analysis.
@jharenza: would you be worried in that scenario that you'd be including variants that are labeled pathogenic but that aren't really there? That would be my biggest concern, since Mutect2 looks very aggressive at calling even low-VAF things. These seem more likely to be enriched with errors (whether they are pathogenic or not). If we're saying that we should retain anything from Mutect2 labeled pathogenic, then we're basically just saying we should do the union of Strelka2 and Mutect2.
I think it depends on whether these are likely true subclonal mutations or artifacts. I'm not clear how many of the potentially damaging mutations are in hotspots/are recurrent (COSMIC) vs one-off, potentially private mutations. E.g., if they are mostly the latter, then it doesn't make sense to keep them. We can look at some of these low-level nonsynonymous damaging mutations in IGV using the BAM files here: https://cavatica.sbgenomics.com/u/cavatica/pbta-cbttc/ or https://cavatica.sbgenomics.com/u/cavatica/pbta-pnoc003/?
If someone is willing/able to look at the set in the BAM files, then I think we should ask @cansav09 to produce a list of pathogenic ones that are Mutect2 only, Strelka2 only, and from both callers. How many would folks be willing to look at? We could ask @cansav09 to create files that dump up to that number, spread across the three categories. Perhaps even better if we randomly shuffle the three lists together before handing them to whoever is looking by hand, and we keep that person blinded.
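A minimal sketch of the blinded shuffle suggested above, in plain Python. The function name, the category labels, and the `V001`-style review IDs are hypothetical and only for illustration; the real lists would come from whatever files @cansav09 produces.

```python
import random


def build_blinded_review_list(variants_by_category, max_total, seed=0):
    """Sample up to max_total variants spread across categories, then shuffle.

    Returns (blinded, key): `blinded` goes to the reviewer and carries no
    category information; `key` maps each review ID back to its category
    and is kept by someone else until review is finished.
    """
    rng = random.Random(seed)  # fixed seed so the list can be regenerated
    categories = sorted(variants_by_category)
    per_category = max_total // len(categories)

    sampled = []
    for category in categories:
        pool = variants_by_category[category]
        # Take at most per_category variants; small categories give all they have.
        chosen = rng.sample(pool, min(per_category, len(pool)))
        sampled.extend((category, variant) for variant in chosen)

    # Mix the three lists together so order reveals nothing about the caller.
    rng.shuffle(sampled)

    blinded = []
    key = {}
    for i, (category, variant) in enumerate(sampled, start=1):
        review_id = f"V{i:03d}"
        blinded.append((review_id, variant))
        key[review_id] = category
    return blinded, key
```

The reviewer only ever sees `blinded`; the categories are joined back in from `key` after the IGV calls have been recorded.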
Let me discuss with @yuankunzhu today during our sprint planning. @cansav09 can you give us an N of variants that would need to be inspected?
@jharenza Couple clarification questions I have after this discussion:
@cansav09
After exploring the data some more, I can't find any synonymous mutations that have been assigned a "damaging" label. However, just to confirm, which MAF fields are most dependable for determining synonymous vs nonsynonymous? It appears all the categories in "Variant Classification" are nonsynonymous variant categories, so if there is an NA in this field, I am assuming that means the variant is synonymous. Is there a more exact approach you suggest for identifying synonymous vs nonsynonymous?
If the above assumptions about synonymous mutations and the damaging labels are correct, then the numbers of possibly and probably damaging nonsynonymous variants in each category are the following:
…ellka' into cansav09/30-part2_mutect2-vs-strellka
…av09/30-part2_mutect2-vs-strellka
Hi @cansav09! We decided during our sprint planning that we will run two additional variant calling algorithms, Lancet and VarDict, on the PBTA data and create a consensus call set from calls made by 2+ algorithms. @migbro was testing this intermittently, and we recently agreed on methods for creating consensus calls using a gold standard dataset, so it looks like we can apply this to PBTA to help with confidence in calls. We can release this as V3 once we have run these pipelines; running them should happen during this sprint. For now, I think we can close this PR.
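To make the "calls from 2+ algorithms" idea concrete, here is a rough sketch in plain Python. It is not the method @migbro tested; the `(chrom, pos, ref, alt)` key and the function name are assumptions chosen for illustration.

```python
from collections import defaultdict


def consensus_calls(calls_by_caller, min_callers=2):
    """Keep variants reported by at least min_callers distinct callers.

    calls_by_caller maps a caller name to an iterable of variant keys.
    A key here is assumed to be (chrom, pos, ref, alt); any hashable
    identifier shared across callers would work the same way.
    """
    support = defaultdict(set)
    for caller, calls in calls_by_caller.items():
        for variant_key in calls:
            support[variant_key].add(caller)
    # Report each retained variant with the sorted list of supporting callers.
    return {
        variant_key: sorted(callers)
        for variant_key, callers in support.items()
        if len(callers) >= min_callers
    }
```

Raising `min_callers` from 2 to 3 trades sensitivity for confidence, which is the kind of threshold comparison a gold standard dataset can inform.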
@jharenza: Nice! So here's my summary of next steps:
In that case:
Is that aligned with your thinking?
LGTM 👍.
After reviewing this PR, I do have a question for you or @jaclyn-taroni. Now that we've implemented the Docker container, are we still including the package installation steps, like those in the Set Up section of the R Markdown in this PR?
Yes. I had asked @jaclyn-taroni this elsewhere. I think the thought is that if people don't use the Docker container (which they might not), the package installation steps are still useful for them to have as a reference. It doesn't hurt anything, basically.
Sounds good, thanks!
For reference: #56 (comment)
Based on the plan laid out by @cgreene I will go ahead and merge this for now.
@migbro did some preliminary comparisons with 5 algorithms, we chose 4, and then assessed the percent of true calls obtained using a consensus of 2/4 and 3/4, so perhaps we can combine efforts and add to your analyses some things we have learned with the new tools. There are some nuances with VarDict: many more variant calls and more false positives.
Purpose/implementation
This notebook analyzes the overlap and distinctions between MuTect2 and
Strelka2 results after having set the data up in #69.
If you'd like to see the html output without having to checkout this branch, go here.
Issue
It closes issue #30 in OpenPBTA.
Directions for reviewers
Due to some of the findings here with MuTect2 data, I suggest we move
forward with only Strelka2 data OR move forward with the variants that are
detected by both algorithms.
Results
MuTect2 and Strelka2 detect 55,808 of the same variants. A shared variant is defined as one having the same Hugo gene symbol, base change, chromosomal start site, and sample of origin (see notebook 01-set-up.Rmd for more details). If moving forward we want only the most reliably called variants, this set of 55,808 variants would give us plenty to work with.
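As an illustration of that matching rule (the notebook itself is R Markdown; this is only a sketch in Python, not the notebook's code), each call can be reduced to a four-part key and the two callers compared with set operations. `Hugo_Symbol`, `Start_Position`, and `Tumor_Sample_Barcode` are standard MAF column names; `base_change` is a hypothetical column standing in for however 01-set-up.Rmd encodes the base change.

```python
def variant_key(record):
    """Reduce a MAF-like record (a dict) to the four fields used for matching."""
    return (
        record["Hugo_Symbol"],           # Hugo gene symbol
        record["base_change"],           # hypothetical column, e.g. "C>T"
        record["Start_Position"],        # chromosomal start site
        record["Tumor_Sample_Barcode"],  # sample of origin
    )


def partition_calls(mutect2_records, strelka2_records):
    """Split calls into shared, MuTect2-only, and Strelka2-only key sets."""
    mutect2_keys = {variant_key(r) for r in mutect2_records}
    strelka2_keys = {variant_key(r) for r in strelka2_records}
    return {
        "both": mutect2_keys & strelka2_keys,
        "mutect2_only": mutect2_keys - strelka2_keys,
        "strelka2_only": strelka2_keys - mutect2_keys,
    }
```

Under this definition, the size of the `both` set corresponds to the shared-variant count reported above.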
MuTect2 and Strelka2 agree closely in their variant allele frequency (VAF) calculations.
This is good regardless of our choices moving forward.
Variants detected only by MuTect2 have a particularly low VAF compared to
variants detected only by Strelka2. (See VAF Violin plots).
These density plots suggest some of these MuTect2 calls may be noise.
Although these low-VAF MuTect2 calls could be true variants, our further
analyses would probably benefit from a more robust, higher confidence set of
variants.
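For readers who want to reproduce the VAF comparison outside the notebook, a small Python sketch follows. It assumes VAF is computed as alt reads over alt plus ref reads from the standard MAF columns `t_alt_count` and `t_ref_count`, and that each record carries a hypothetical `category` field (for example `mutect2_only`, `strelka2_only`, `both`); the notebook's own calculation may differ.

```python
from statistics import median


def vaf(t_alt_count, t_ref_count):
    """Variant allele frequency as alt / (alt + ref); None if no reads."""
    total = t_alt_count + t_ref_count
    return t_alt_count / total if total else None


def median_vaf_by_category(records):
    """Median VAF per caller category for a list of MAF-like dicts."""
    grouped = {}
    for record in records:
        value = vaf(record["t_alt_count"], record["t_ref_count"])
        if value is not None:  # skip sites with zero coverage
            grouped.setdefault(record["category"], []).append(value)
    return {category: median(values) for category, values in grouped.items()}
```

A markedly lower median for the MuTect2-only category than for the other two would match the pattern described above.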
MuTect2 also registers dinucleotide and larger variants where Strelka2 seems
to break these variants into their single nucleotide changes.
In these analyses, these base changes have been grouped together and collectively
called `long_changes`. The higher base resolution of Strelka2, and its ability to parse the SNVs apart from each other, is more useful to us for this particular analysis, as the larger structural variants are better detected in the Manta or LUMPY analyses.
Docker and continuous integration
Pull Request Check List: