This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

(PR #2 of 2) Determine concordance between MuTect2 and Strelka2 SNV calls #76

Merged


Collaborator

@cansavvy cansavvy commented Aug 22, 2019

Purpose/implementation

This notebook analyzes the overlap and differences between MuTect2 and
Strelka2 results, using the data set up in #69.

If you'd like to see the html output without having to checkout this branch, go here.

Issue

It closes issue #30 in OpenPBTA.

Directions for reviewers

Due to some of the findings here with Mutect2 data, I suggest we should move
forward with only Strelka2 data OR move forward with the variants that are
detected by both algorithms.

  1. Do these conclusions make sense based on the analyses shown?
  2. Are there any additional analyses we need?

Results

  • MuTect2 and Strelka2 detect 55,808 of the same variants. A shared variant is defined as one with the same Hugo gene symbol, base change, chromosomal start site, and sample of origin (see notebook 01-set-up.Rmd for more details). If we want to move forward with only the most reliably called variants, this set of 55,808 variants would give us plenty to work with.

  • MuTect2 and Strelka2 agree closely in their variant allele frequency (VAF) calculations.
    This is good regardless of our choices moving forward.

  • Variants detected only by MuTect2 have a particularly low VAF compared to
    variants detected only by Strelka2 (see the VAF violin plots).
    These plots suggest some of these MuTect2 calls may be noise.
    Although these low-VAF MuTect2 calls could be true variants, our further
    analyses would probably benefit from a more robust, higher-confidence set of
    variants.

  • MuTect2 also registers dinucleotide and larger variants where Strelka2 seems
    to break these variants into their single nucleotide changes.
    In these analyses, these base changes have been grouped together and collectively
    called long_changes. The higher base resolution of Strelka2, and its ability to separate the SNVs from each other, is more useful to us for this particular analysis, as the larger structural variants are better detected in the Manta or LUMPY analyses.
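
For readers who want the concordance definition from the first bullet spelled out, here is a minimal sketch in plain Python. It is not the notebook's code (the notebook is R Markdown); the field names, the `partition_calls` helper, and the toy records are all assumptions made for illustration.

```python
from typing import NamedTuple


class VariantKey(NamedTuple):
    """The four fields used to decide that two callers report the same variant."""
    hugo_symbol: str
    base_change: str
    start_position: int
    sample_id: str


def partition_calls(mutect2, strelka2):
    """Split calls into shared, MuTect2-only, and Strelka2-only sets."""
    mutect2, strelka2 = set(mutect2), set(strelka2)
    return {
        "both": mutect2 & strelka2,
        "mutect2_only": mutect2 - strelka2,
        "strelka2_only": strelka2 - mutect2,
    }


# Toy records (hypothetical values, not taken from the PBTA data).
mutect2_calls = [
    VariantKey("TP53", "C>T", 100, "S1"),
    VariantKey("KRAS", "G>A", 200, "S1"),
]
strelka2_calls = [
    VariantKey("TP53", "C>T", 100, "S1"),
    VariantKey("BRAF", "T>A", 300, "S2"),
]

groups = partition_calls(mutect2_calls, strelka2_calls)
counts = {name: len(keys) for name, keys in groups.items()}
print(counts)
```

Matching on the full four-field key means the same base change in a different sample, or at a different start site, counts as a different variant.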

Docker and continuous integration

Pull Request Check List:

  • Run a linter
  • Set the seed (if applicable)
  • Comments and/or documentation up to date
  • Double checked paths
  • Spell check any Rmd file or md file
  • Restart R and run all notebooks fresh and save
  • Connect the pertinent issues on ZenHub
  • Updated Dockerfile and built Docker image successfully

@cansavvy cansavvy requested a review from cbethell August 22, 2019 14:31
@cansavvy cansavvy requested a review from jharenza August 22, 2019 15:20
Collaborator

@jharenza jharenza left a comment


Nice job on this! I have a few questions about moving forward:

  1. Are you suggesting using just concordant SNVs/indels or using only those from Strelka2?
  2. Prior to recommending something for 1), I think we should assess the pathogenicity of the Mutect only variants. It would be useful to see these plots for nonsynonymous, coding mutations only. For the purposes of oncoprints and TMB, for example, we will not use synonymous variants anyway. Then, of those predicted damaging nonsynonymous, coding mutations called only from Mutect2, are these in any putative oncogenes/tumor suppressor genes (whether the SNV is annotated in COSMIC is probably the easiest way to do a quick check, but we also have some genelists if interested)? For example, if we still see a low-level TP53 mutation in the DNA binding domain or a canonical RAS mutation, we should retain these for downstream analysis.

@cgreene
Collaborator

cgreene commented Aug 23, 2019

@jharenza : would you be worried in that scenario that you'd be including variants that are labeled pathogenic but that aren't really there? That would be my biggest concern, since Mutect2 looks very aggressive at calling even low VAF things. These seem more likely to be enriched with errors (whether they are pathogenic or not). If we're saying that we should retain anything from Mutect2 labeled pathogenic, then we're basically just saying we should do the union of Strelka2 and Mutect2.

@jharenza
Collaborator

@jharenza : would you be worried in that scenario that you'd be including variants that are labeled pathogenic but that aren't really there? That would be my biggest concern, since Mutect2 looks very aggressive at calling even low VAF things. These seem more likely to be enriched with errors (whether they are pathogenic or not). If we're saying that we should retain anything from Mutect2 labeled pathogenic, then we're basically just saying we should do the union of Strelka2 and Mutect2.

I think it depends on whether these are likely true subclonal mutations or artifacts. I'm not clear how many of the potentially damaging mutations are in hotspots or are recurrent (COSMIC) vs. one-off, potentially private mutations. E.g., if they are mostly the latter, then it doesn't make sense to keep them. We can look at some of these low-level nonsynonymous damaging mutations in IGV using the BAM files here: https://cavatica.sbgenomics.com/u/cavatica/pbta-cbttc/ or https://cavatica.sbgenomics.com/u/cavatica/pbta-pnoc003/?

@cgreene
Collaborator

cgreene commented Aug 23, 2019

If someone is willing/able to look at the set in the BAM files then I think we should ask for @cansav09 to produce a list of pathogenic ones that are Mutect2 only, Strelka2 only, and both callers.

How many would folks be willing to look at? We could ask @cansav09 to create files that dump up to that number, spread across the three categories? Perhaps even better if we randomly shuffle the three lists together before handing them to whoever is looking by hand, and we keep that person blinded.
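
A rough sketch of the blinded hand-off described above, assuming each category is just a list of variant identifiers. The function name, the review-ID format, and the fixed seed are hypothetical choices for illustration, not an agreed design.

```python
import random


def build_blinded_review(categories, max_total, seed=2019):
    """Sample evenly across categories, then shuffle the samples together.

    Returns (review_list, answer_key). The reviewer sees only review_list;
    answer_key maps each review ID back to its category and stays hidden
    until the manual review is finished.
    """
    rng = random.Random(seed)  # seeded so the same lists can be regenerated
    per_category = max_total // len(categories)
    sampled = []
    for category, variants in categories.items():
        n = min(per_category, len(variants))
        sampled.extend((category, v) for v in rng.sample(variants, n))
    rng.shuffle(sampled)

    review_list, answer_key = [], {}
    for i, (category, variant) in enumerate(sampled, start=1):
        review_id = f"V{i:04d}"
        review_list.append((review_id, variant))
        answer_key[review_id] = category
    return review_list, answer_key


# Toy identifiers (hypothetical).
categories = {
    "mutect2_only": ["m1", "m2", "m3", "m4"],
    "strelka2_only": ["s1", "s2", "s3"],
    "both": ["b1", "b2", "b3", "b4", "b5"],
}
review_list, answer_key = build_blinded_review(categories, max_total=9)
print(len(review_list), "variants to review")
```

Keeping the answer key in a separate object (or file) is what keeps the reviewer blinded: the review list carries no caller information at all.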

@jharenza
Collaborator

If someone is willing/able to look at the set in the BAM files then I think we should ask for @cansav09 to produce a list of pathogenic ones that are Mutect2 only, Strelka2 only, and both callers.

How many would folks be willing to look at? We could ask @cansav09 to create files that dump up to that number, spread across the three categories? Perhaps even better if we randomly shuffle the three lists together before handing them to whoever is looking by hand, and we keep that person blinded.

Let me discuss with @yuankunzhu today during our sprint planning. @cansav09 can you give us an N of variants that would need to be inspected?

@cansavvy
Collaborator Author

@cansav09 can you give us an N of variants that would need to be inspected?

@jharenza Couple clarification questions I have after this discussion:

  1. Did you want these numbers for the non-synonymous variants only? Or all variants?

  2. Also did you want “possibly_damaging” category variants to be included or only “probably_damaging” variants?

  3. Or did you prefer I use something besides PolyPhen, like SIFT or ClinVar?

  4. For linking back to BAM files what information would you like included in these lists to make this easiest to trace back? Or should I just include all the information and you guys will figure it out?

@jharenza
Collaborator

@cansav09

  1. Was thinking non-synonymous only, because I am not sure if any synonymous would be damaging - I guess we can do a quick check on that.
  2 & 3. Possibly and probably damaging/deleterious from at least 1 of the three tools you mention.
  4. If you gave the variant coordinates, nucleotide change, and biospecimen ID (which is probably the TSB), then that should suffice.
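
The selection rule in these answers could be sketched roughly as below. Everything here is an assumption for illustration: the dictionary keys, the exact label strings for each tool, the `label(score)` annotation format, and the list of nonsynonymous classes would all need to be checked against the actual MAF columns.

```python
# Assumed label sets; real annotation values may differ.
DAMAGING_LABELS = {
    "polyphen": {"possibly_damaging", "probably_damaging"},
    "sift": {"deleterious"},
    "clinvar": {"pathogenic", "likely_pathogenic"},
}
# Assumed (partial) allow-list of nonsynonymous classes.
NONSYNONYMOUS = {"Missense_Mutation", "Nonsense_Mutation", "Nonstop_Mutation"}


def label_of(annotation):
    """Return the lowercase label from a value such as 'probably_damaging(0.99)'."""
    if not annotation:
        return None
    return annotation.split("(", 1)[0].strip().lower()


def is_damaging(variant):
    """True if at least one of the three tools flags the variant."""
    return any(
        label_of(variant.get(tool)) in labels
        for tool, labels in DAMAGING_LABELS.items()
    )


def review_rows(variants):
    """Yield (chromosome, start, base_change, biospecimen_id) for kept variants."""
    for v in variants:
        if v.get("variant_classification") in NONSYNONYMOUS and is_damaging(v):
            yield (v["chromosome"], v["start"], v["base_change"], v["biospecimen_id"])


# Toy records (hypothetical values).
variants = [
    {"chromosome": "chr17", "start": 100, "base_change": "C>T", "biospecimen_id": "BS_1",
     "variant_classification": "Missense_Mutation",
     "polyphen": "probably_damaging(0.99)", "sift": "tolerated(0.2)"},
    {"chromosome": "chr1", "start": 200, "base_change": "G>A", "biospecimen_id": "BS_2",
     "variant_classification": "Silent", "polyphen": None, "sift": None},
    {"chromosome": "chr2", "start": 300, "base_change": "A>G", "biospecimen_id": "BS_3",
     "variant_classification": "Missense_Mutation",
     "polyphen": "benign(0.01)", "sift": "tolerated(0.5)"},
    {"chromosome": "chr7", "start": 400, "base_change": "T>A", "biospecimen_id": "BS_4",
     "variant_classification": "Missense_Mutation",
     "polyphen": None, "sift": "deleterious(0.01)"},
]
kept = list(review_rows(variants))
print(kept)
```

Using an explicit allow-list for the nonsynonymous classes means a record with a missing classification is dropped rather than guessed at.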

@cansavvy
Collaborator Author

cansavvy commented Aug 26, 2019

Was thinking non-synonymous only, because I am not sure if any synonymous would be damaging - I guess we can do a quick check on that.
2&3) Possibly and probably damaging/deleterious from at least 1 of the three tools you mention

After exploring the data some more, I can't find any synonymous mutations that have been assigned a "damaging" label. However, just to confirm, what MAF fields are most dependable for determining synonymous vs nonsynonymous? It appears all the categories in "Variant Classification" are nonsynonymous variant categories, so if there is an NA in this category, I am assuming that means it is synonymous? Is there a more exact approach you suggest for identifying synonymous vs nonsynonymous? This approach of NAs feels risky and potentially inaccurate to me.

If the above assumptions about synonymous mutations and the damaging labels are correct, then the numbers of possibly and probably damaging non-synonymous variants in each category are the following:

MuTect2 6678 
Strelka2 7842 
Both ~25k

@jharenza
Collaborator

Hi @cansav09! We decided during our sprint planning that we will run two additional variant calling algorithms, Lancet and Vardict, on the PBTA data and create a consensus call set with calls from 2+ algorithms. @migbro was testing this intermittently and we recently came up with some agreement and methods for creating consensus calls using a gold standard dataset, so it looks like we can apply this to PBTA to help with confidence in calls. We can release as V3 once we have run these pipelines; running them should happen during this sprint. For now, I think we can close this PR.

@cgreene
Collaborator

cgreene commented Aug 27, 2019

@jharenza : Nice! So here's my summary of next steps:

  • We merge this PR (potentially after @cbethell reviews since it looks like there's an outstanding review request)
  • During the next two weeks, Lancet and Vardict calls will be added to the OpenPBTA data.
  • After that point, we'll ask @cansav09 to update this analysis for the four callers.
  • Based on those results, we'll figure out a path forward. It may involve some manual review (the gold standard?). If so, we'll use calls in the categories we discussed (combined & blinded for review, as noted above).

In that case:

  1. Merge this.
  2. Close Evaluate concordance between Mutect2 and Strelka2 and decide on next steps #30 with a short summary of the next steps (this summary).
  3. Open a new issue to track the re-analysis with the four callers.

Is that aligned with your thinking?

Contributor

@cbethell cbethell left a comment


LGTM 👍.

After reviewing this PR, I do have a question for you or @jaclyn-taroni. Now that we’ve implemented the Docker container are we still including the install packages steps like in the Set Up section of the R markdown in this PR?

@cansavvy
Collaborator Author

Now that we’ve implemented the Docker container are we still including the install packages steps like in the Set Up section of the R markdown in this PR?

Yes. I had asked @jaclyn-taroni this elsewhere; I think the thought is that if people don’t use the Docker container (which they might not), the package installation steps are still useful for them to have as a reference. Basically, it doesn’t hurt anything.

@cbethell
Contributor

Now that we’ve implemented the Docker container are we still including the install packages steps like in the Set Up section of the R markdown in this PR?

Yes. I had asked @jaclyn-taroni this elsewhere, I think the thought is that if people don’t use the docker container (which they might not), the package installation steps are still useful for them to have as reference. It doesn’t hurt anything basically.

Sounds good, thanks!

@jaclyn-taroni
Member

For reference: #56 (comment)

@cansavvy
Collaborator Author

Based on the plan laid out by @cgreene I will go ahead and merge this for now.

@cansavvy cansavvy merged commit 4f81efb into AlexsLemonade:master Aug 27, 2019
@jharenza
Collaborator

jharenza commented Aug 27, 2019

@jharenza : Nice! So here's my summary of next steps:

  • We merge this PR (potentially after @cbethell reviews since it looks like there's an outstanding review request)
  • During the next two weeks, Lancet and Vardict calls will be added to the OpenPBTA data.
  • After that point, we'll ask @cansav09 to update this analysis for the four callers.
  • Based on those results, we'll figure out a path forward. It may involve some manual review (the gold standard?). If so, we'll use calls in the categories we discussed (combined & blinded for review, as noted above).

In that case:

  1. Merge this.
  2. Close Evaluate concordance between Mutect2 and Strelka2 and decide on next steps #30 with a short summary of the next steps (this summary).
  3. Open a new issue to track the re-analysis with the four callers.

Is that aligned with your thinking?

@migbro did some preliminary comparisons with 5 algorithms, we chose 4, and then assessed percent true calls obtained using consensus of 2/4, 3/4, so perhaps we can combine efforts and add to your analyses some things we have learned with the new tools. There are some nuances with Vardict - many more variant calls and more false positives.
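
To make the "consensus of 2/4, 3/4" comparison concrete, here is a small sketch of how such a check might look. It is not @migbro's pipeline; the helper names, the variant identifiers, and the toy truth set are assumptions for illustration only.

```python
from collections import Counter


def consensus_calls(calls_by_caller, min_callers):
    """Return the variants reported by at least `min_callers` callers."""
    support = Counter()
    for calls in calls_by_caller.values():
        support.update(set(calls))  # each caller counts once per variant
    return {variant for variant, n in support.items() if n >= min_callers}


def fraction_true(calls, truth):
    """Fraction of the calls that are present in the truth set."""
    return len(calls & truth) / len(calls) if calls else 0.0


# Toy call sets and gold standard (hypothetical identifiers).
calls_by_caller = {
    "mutect2": {"v1", "v2", "v3", "v5"},
    "strelka2": {"v1", "v2", "v4"},
    "lancet": {"v1", "v2", "v3"},
    "vardict": {"v1", "v3", "v5", "v6"},
}
truth = {"v1", "v2", "v3"}

two_of_four = consensus_calls(calls_by_caller, min_callers=2)
three_of_four = consensus_calls(calls_by_caller, min_callers=3)
print(fraction_true(two_of_four, truth), fraction_true(three_of_four, truth))
```

In this toy example, raising the threshold from 2 to 3 callers drops the one call that is absent from the truth set; whether the same trade-off holds on the real data is exactly what the gold-standard comparison would show.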
