(PR #2 of 2) Determine concordance between MuTect2 and Strelka2 SNV calls #76

cansavvy · 2019-08-22T14:13:21Z

Purpose/implementation

This notebook analyzes the overlap and distinctions between MuTect2 and
Strelka2 results, after the data were set up in #69.

If you'd like to see the html output without having to checkout this branch, go here.

Issue

It closes issue #30 in OpenPBTA.

Directions for reviewers

Due to some of the findings here with MuTect2 data, I suggest we move
forward with only Strelka2 data OR with the variants that are
detected by both algorithms.

Do these conclusions make sense based on the analyses shown?
Are there any additional analyses we need?

Results

MuTect2 and Strelka2 detect 55,808 of the same variants. Two calls count as the same variant when they share the same Hugo gene symbol, base change, chromosomal start site, and sample of origin (see notebook 01-set-up.Rmd for more details). If we want only the most reliably called variants moving forward, this set of 55,808 variants would give us plenty to work with.
MuTect2 and Strelka2 agree closely in their variant allele frequency (VAF) calculations.
This is good regardless of our choices moving forward.
Variants detected only by MuTect2 have a particularly low VAF compared to
variants detected only by Strelka2 (see the VAF violin plots).
These density plots suggest some of these MuTect2 calls may be noise.
Although these low-VAF MuTect2 calls could be identifying true variants, our further
analyses would probably benefit from a more robust, higher-confidence set of
variants.
MuTect2 also registers dinucleotide and larger variants, where Strelka2 seems
to break these variants into their single-nucleotide changes.
In these analyses, these base changes have been grouped together and collectively
called long_changes. The higher base resolution of Strelka2, and its ability to parse the SNVs apart from each other, is more useful to us for this particular analysis, as the larger structural variants are better detected in the Manta or LUMPY analyses.
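
For readers who want the concordance definition above in concrete terms, here is a minimal Python sketch (the notebooks themselves are R Markdown, so this is not the code in this PR). It assumes each call has already been reduced to the four matching fields; the `VariantKey` type, the `split_by_caller` helper, and the toy records are all hypothetical and made up for illustration.

```python
from typing import NamedTuple


class VariantKey(NamedTuple):
    """Hypothetical matching key: the four fields used to call two variants 'the same'."""
    hugo_symbol: str
    base_change: str
    start_position: int
    sample_id: str


def split_by_caller(mutect2_calls, strelka2_calls):
    """Partition calls into concordant and caller-specific sets using set operations."""
    mutect2 = set(mutect2_calls)
    strelka2 = set(strelka2_calls)
    return {
        "both": mutect2 & strelka2,
        "mutect2_only": mutect2 - strelka2,
        "strelka2_only": strelka2 - mutect2,
    }


# Toy, made-up records (not real PBTA data).
mutect2_calls = [
    VariantKey("GENE_A", "C>T", 100, "sample_1"),
    VariantKey("GENE_B", "G>A", 200, "sample_1"),
]
strelka2_calls = [
    VariantKey("GENE_A", "C>T", 100, "sample_1"),
    # Same change in a different sample does not count as concordant.
    VariantKey("GENE_A", "C>T", 100, "sample_2"),
    VariantKey("GENE_C", "T>A", 300, "sample_2"),
]

groups = split_by_caller(mutect2_calls, strelka2_calls)
for name, variants in groups.items():
    print(name, len(variants))
```

Matching on an exact tuple is strict: a dinucleotide call from one caller will not match the two single-nucleotide calls another caller reports for the same site, which is one reason the long_changes group is handled separately.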

Docker and continuous integration

The dependencies required to run the code in this pull request have been added to the project Dockerfile. (Requirements are the same as those added for (PR #1 of 2) Determine concordance between MuTect2 and Strelka2 SNV calls #69).
This analysis has been added to continuous integration.

Pull Request Check List:

Run a linter
Set the seed (if applicable)
Comments and/or documentation up to date
Double checked paths
Spell check any Rmd file or md file
Restart R and run all notebooks fresh and save
Connect the pertinent issues on ZenHub
Updated Dockerfile and built Docker image successfully

jharenza

Nice job on this! I have a few questions about moving forward:

1. Are you suggesting using just concordant SNVs/indels or using only those from Strelka2?
2. Prior to recommending something for 1), I think we should assess the pathogenicity of the Mutect only variants. It would be useful to see these plots for nonsynonymous, coding mutations only. For the purposes of oncoprints and TMB, for example, we will not use synonymous variants anyway. Then, of those predicted damaging nonsynonymous, coding mutations called only from Mutect2, are these in any putative oncogenes/tumor suppressor genes (whether the SNV is annotated in COSMIC is probably the easiest way to do a quick check, but we also have some genelists if interested)? For example, if we still see a low-level TP53 mutation in the DNA binding domain or a canonical RAS mutation, we should retain these for downstream analysis.

cgreene · 2019-08-23T17:40:23Z

@jharenza : would you be worried in that scenario that you'd be including variants that are labeled pathogenic but that aren't really there? That would be my biggest concern, since Mutect2 looks very aggressive at calling even low VAF things. These seem more likely to be enriched with errors (whether they are pathogenic or not). If we're saying that we should retain anything from Mutect2 labeled pathogenic, then we're basically just saying we should do the union of Strelka2 and Mutect2.

jharenza · 2019-08-23T20:15:20Z

> @jharenza : would you be worried in that scenario that you'd be including variants that are labeled pathogenic but that aren't really there? That would be my biggest concern, since Mutect2 looks very aggressive at calling even low VAF things. These seem more likely to be enriched with errors (whether they are pathogenic or not). If we're saying that we should retain anything from Mutect2 labeled pathogenic, then we're basically just saying we should do the union of Strelka2 and Mutect2.

I think it depends on whether these are likely true subclonal mutations or artifacts. I'm not clear how many of the potentially damaging mutations are in hotspot/are recurrent (COSMIC) vs one-off, potentially private mutations. Eg: if they are mostly the latter, then it doesn't make sense to keep them. We can look at some of these low level nonsynonymous damaging mutations in IGV using the BAM files here: https://cavatica.sbgenomics.com/u/cavatica/pbta-cbttc/ or https://cavatica.sbgenomics.com/u/cavatica/pbta-pnoc003/?

cgreene · 2019-08-23T20:23:46Z

If someone is willing/able to look at the set in the BAM files then I think we should ask for @cansav09 to produce a list of pathogenic ones that are Mutect2 only, Strelka2 only, and both callers.

How many would folks be willing to look at? We could ask @cansav09 to create files that dump up to that number, spread across the three categories? Perhaps even better if we randomly shuffle the three lists together before handing them to whoever is looking by hand, and we keep that person blinded.
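
One way the blinded hand-off could work is sketched below in Python. This is only an illustration under stated assumptions: `build_blinded_review_list`, its parameters, and the even split of the review budget across categories are hypothetical choices, not something already agreed in this thread.

```python
import random


def build_blinded_review_list(variants_by_category, max_total, seed=2019):
    """Sample up to max_total variants, spread evenly across categories, then shuffle.

    Returns (blinded, answer_key): `blinded` goes to the reviewer with no
    category labels; `answer_key` pairs each variant with its category and
    stays with whoever prepared the list.
    """
    rng = random.Random(seed)  # fixed seed so the list can be regenerated
    categories = sorted(variants_by_category)
    per_category = max_total // len(categories)

    answer_key = []
    for category in categories:
        variants = list(variants_by_category[category])
        n = min(per_category, len(variants))
        for variant in rng.sample(variants, n):
            answer_key.append((variant, category))

    # Shuffle so position in the list says nothing about the category.
    rng.shuffle(answer_key)
    blinded = [variant for variant, _ in answer_key]
    return blinded, answer_key


# Toy identifiers standing in for variant records.
example = {
    "mutect2_only": [f"m{i}" for i in range(5)],
    "strelka2_only": [f"s{i}" for i in range(5)],
    "both": [f"b{i}" for i in range(5)],
}
blinded, answer_key = build_blinded_review_list(example, max_total=6)
print(blinded)
```

Keeping the answer key separate from the blinded list lets the manual calls be scored against the categories afterwards.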

jharenza · 2019-08-26T11:54:44Z

> If someone is willing/able to look at the set in the BAM files then I think we should ask for @cansav09 to produce a list of pathogenic ones that are Mutect2 only, Strelka2 only, and both callers.
>
> How many would folks be willing to look at? We could ask @cansav09 to create files that dump up to that number, spread across the three categories? Perhaps even better if we randomly shuffle the three lists together before handing them to whoever is looking by hand, and we keep that person blinded.

Let me discuss with @yuankunzhu today during our sprint planning. @cansav09 can you give us an N of variants that would need to be inspected?

cansavvy · 2019-08-26T12:32:54Z

> @cansav09 can you give us an N of variants that would need to be inspected?

@jharenza Couple clarification questions I have after this discussion:

1. Did you want these numbers for the non-synonymous variants only? Or all variants?
2. Also did you want “possibly_damaging” category variants to be included or only “probably_damaging” variants?
3. Or did you prefer I use something besides PolyPhen, like SIFT or ClinVar?
4. For linking back to BAM files what information would you like included in these lists to make this easiest to trace back? Or should I just include all the information and you guys will figure it out?

jharenza · 2019-08-26T12:42:09Z

@cansav09

1. Was thinking non-synonymous only, because I am not sure if any synonymous would be damaging - I guess we can do a quick check on that.
2&3) Possibly and probably damaging/deleterious from at least 1 of the three tools you mention
4. If you gave the variant coordinates, nucleotide change, and biospecimen ID (which is probably the TSB), then that should suffice.
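
The "at least 1 of the three tools" rule could look roughly like the Python sketch below. The dictionary keys (`polyphen`, `sift`, `clinvar`, `chromosome`, `start`, `base_change`, `biospecimen_id`) and the SIFT and ClinVar label strings are assumptions for illustration; only the PolyPhen labels "possibly_damaging" and "probably_damaging" come from this thread, and the real MAF columns may be named differently.

```python
# Labels treated as damaging per tool. The SIFT and ClinVar strings are
# assumed spellings and should be checked against the actual annotations.
DAMAGING_LABELS = {
    "polyphen": {"possibly_damaging", "probably_damaging"},
    "sift": {"deleterious"},
    "clinvar": {"pathogenic", "likely_pathogenic"},
}


def is_flagged_damaging(variant):
    """True if at least one tool assigns the variant a damaging label.

    A missing annotation (absent key or None) never counts as damaging.
    """
    return any(
        variant.get(tool) in labels for tool, labels in DAMAGING_LABELS.items()
    )


def review_record(variant):
    """Keep only the fields requested for tracing a call back to its BAM file."""
    return {
        "chromosome": variant["chromosome"],
        "start": variant["start"],
        "base_change": variant["base_change"],
        "biospecimen_id": variant["biospecimen_id"],
    }


# Toy, made-up variants (not real PBTA data).
variants = [
    {"chromosome": "chr1", "start": 100, "base_change": "C>T",
     "biospecimen_id": "BS_1", "polyphen": "benign", "sift": "deleterious"},
    {"chromosome": "chr2", "start": 200, "base_change": "G>A",
     "biospecimen_id": "BS_2", "polyphen": "benign", "sift": "tolerated"},
    {"chromosome": "chr3", "start": 300, "base_change": "T>A",
     "biospecimen_id": "BS_3", "polyphen": None},
]

to_review = [review_record(v) for v in variants if is_flagged_damaging(v)]
print(to_review)
```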

cansavvy · 2019-08-26T15:10:23Z

> Was thinking non-synonymous only, because I am not sure if any synonymous would be damaging - I guess we can do a quick check on that.
> 2&3) Possibly and probably damaging/deleterious from at least 1 of the three tools you mention

After exploring the data some more, I can't find any synonymous mutations that have been assigned a "damaging" label. However, just to confirm, what MAF fields are most dependable for determining synonymous vs nonsynonymous? It appears all the categories in "Variant Classification" are nonsynonymous variant categories, so if there is an NA in this category, I am assuming that means it is synonymous? Is there a more exact approach you suggest for identifying synonymous vs nonsynonymous? This approach of NAs feels risky and potentially inaccurate to me.

If the above assumptions about synonymous mutations and the damaging labels are correct, then the numbers of possibly and probably damaging non-synonymous variants for each category are as follows:

MuTect2 6678 
Strelka2 7842 
Both ~25k

jharenza · 2019-08-27T11:49:57Z

Hi @cansav09! We decided during our sprint planning that we will run two additional variant calling algorithms, Lancet and Vardict, on the PBTA data and create a consensus call set with calls from 2+ algorithms. @migbro was testing this intermittently and we recently came up with some agreement and methods for creating consensus calls using a gold standard dataset, so it looks like we can apply this to PBTA to help with confidence in calls. We can release as V3 once we have run these pipelines; running them should happen during this sprint. For now, I think we can close this PR.

cgreene · 2019-08-27T11:59:29Z

@jharenza : Nice! So here's my summary of next steps:

1. We merge this PR (potentially after @cbethell reviews since it looks like there's an outstanding review request)
2. During the next two weeks, Lancet and Vardict calls will be added to the OpenPBTA data.
3. After that point, we'll ask @cansav09 to update this analysis for the four callers.
4. Based on those results, we'll figure out a path forward. It may involve some manual review (the gold standard?). If so, we'll use calls in the categories we discussed (combined for & blinded review as noted above).

In that case:

1. Merge this.
2. Close Evaluate concordance between Mutect2 and Strelka2 and decide on next steps #30 with a short summary of the next steps (this summary).
3. Open a new issue to track the re-analysis with the four callers.

Is that aligned with your thinking?

cbethell

LGTM 👍.

After reviewing this PR, I do have a question for you or @jaclyn-taroni. Now that we’ve implemented the Docker container are we still including the install packages steps like in the Set Up section of the R markdown in this PR?

cansavvy · 2019-08-27T12:30:29Z

> Now that we’ve implemented the Docker container are we still including the install packages steps like in the Set Up section of the R markdown in this PR?

Yes. I had asked @jaclyn-taroni this elsewhere, I think the thought is that if people don’t use the docker container (which they might not), the package installation steps are still useful for them to have as reference. It doesn’t hurt anything basically.

cbethell · 2019-08-27T12:39:42Z

> > Now that we’ve implemented the Docker container are we still including the install packages steps like in the Set Up section of the R markdown in this PR?
>
> Yes. I had asked @jaclyn-taroni this elsewhere, I think the thought is that if people don’t use the docker container (which they might not), the package installation steps are still useful for them to have as reference. It doesn’t hurt anything basically.

Sounds good, thanks!

jaclyn-taroni · 2019-08-27T12:39:52Z

For reference: #56 (comment)

cansavvy · 2019-08-27T15:04:54Z

Based on the plan laid out by @cgreene I will go ahead and merge this for now.

jharenza · 2019-08-27T15:29:01Z

> @jharenza : Nice! So here's my summary of next steps:
>
> 1. We merge this PR (potentially after @cbethell reviews since it looks like there's an outstanding review request)
> 2. During the next two weeks, Lancet and Vardict calls will be added to the OpenPBTA data.
> 3. After that point, we'll ask @cansav09 to update this analysis for the four callers.
> 4. Based on those results, we'll figure out a path forward. It may involve some manual review (the gold standard?). If so, we'll use calls in the categories we discussed (combined for & blinded review as noted above).
>
> In that case:
>
> 1. Merge this.
> 2. Close Evaluate concordance between Mutect2 and Strelka2 and decide on next steps #30 with a short summary of the next steps (this summary).
> 3. Open a new issue to track the re-analysis with the four callers.
>
> Is that aligned with your thinking?

@migbro did some preliminary comparisons with 5 algorithms, from which we chose 4, and then assessed the percent of true calls obtained using a consensus of 2/4 and 3/4 callers. Perhaps we can combine efforts and add to your analyses some things we have learned with the new tools. There are some nuances with Vardict: many more variant calls and more false positives.
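
To make the "consensus of 2/4, 3/4" idea concrete, here is a minimal Python sketch of keeping variants supported by at least k callers. It is not the method @migbro tested; `consensus_calls`, the caller names used as dictionary keys, and the toy variant identifiers are all hypothetical.

```python
from collections import defaultdict


def consensus_calls(calls_by_caller, min_callers):
    """Return the variants reported by at least `min_callers` distinct callers."""
    support = defaultdict(set)
    for caller, calls in calls_by_caller.items():
        for variant in calls:
            # A set of caller names means duplicates within one caller count once.
            support[variant].add(caller)
    return {
        variant
        for variant, callers in support.items()
        if len(callers) >= min_callers
    }


# Toy variant identifiers (made up); in practice these would be matching keys
# such as gene symbol, base change, start site, and sample.
calls_by_caller = {
    "mutect2": ["v1", "v2", "v3"],
    "strelka2": ["v1", "v2"],
    "lancet": ["v1", "v4"],
    "vardict": ["v1", "v2", "v4", "v5"],
}

two_of_four = consensus_calls(calls_by_caller, min_callers=2)
three_of_four = consensus_calls(calls_by_caller, min_callers=3)
print(sorted(two_of_four), sorted(three_of_four))
```

Raising `min_callers` trades sensitivity for confidence, which matters if one caller contributes many more calls and false positives than the others.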

Candace Savonen added 7 commits August 21, 2019 11:53

One last thing

70ae412

Refresh notebooks

42371e5

Merge branch 'cansav09/30-mutect-vs-strelka' into cansav09/30-part2_m…

9da3dea

…utect2-vs-strellka

Add PolyPhen plot

56a0968

add to circle CI

62b77b7

Add VennDiagrams to Dockerfile

85d0ab5

Fix a Dockerfile prob

66726f1

cansavvy requested a review from cbethell August 22, 2019 14:31

cgreene reviewed Aug 22, 2019

Edit CI analysis to be under one header

358b2da

cansavvy requested a review from jharenza August 22, 2019 15:20

Put both notebooks in same command

c7fccb5

jaclyn-taroni mentioned this pull request Aug 22, 2019

How we add completed analysis modules to CI #77

Closed

cansavvy added 2 commits August 22, 2019 12:05

Merge branch 'master' into cansav09/30-part2_mutect2-vs-strellka

b1d1895

Merge branch 'master' into cansav09/30-part2_mutect2-vs-strellka

25fa5f4

jharenza suggested changes Aug 23, 2019

Merge branch 'master' into cansav09/30-part2_mutect2-vs-strellka

d03ddd3

cansavvy and others added 5 commits August 26, 2019 11:10

Merge branch 'master' into cansav09/30-part2_mutect2-vs-strellka

54faa17

Push changes to NA handling

bf497d6

Merge remote-tracking branch 'origin/cansav09/30-part2_mutect2-vs-str…

20a3c8e

…ellka' into cansav09/30-part2_mutect2-vs-strellka

Merge branch 'cansav09/30-part2_mutect2-vs-strellka' into origin/cans…

5e5399c

…av09/30-part2_mutect2-vs-strellka

Add writing variants to files and plots saving

2d05853

cansavvy mentioned this pull request Aug 26, 2019

Evaluate concordance between Mutect2 and Strelka2 and decide on next steps #30

Closed

Merge branch 'master' into cansav09/30-part2_mutect2-vs-strellka

a6b49db

jharenza approved these changes Aug 27, 2019

cbethell approved these changes Aug 27, 2019

cansavvy merged commit 4f81efb into AlexsLemonade:master Aug 27, 2019

jaclyn-taroni mentioned this pull request Aug 25, 2020

molecular_subtype is different for the same sample with different experimental_strategy #735

Closed
