Skip to content
This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

Updated analysis: snv-callers #275

Closed
jashapiro opened this issue Nov 15, 2019 · 11 comments
Closed

Updated analysis: snv-callers #275

jashapiro opened this issue Nov 15, 2019 · 11 comments
Assignees

Comments

@jashapiro
Copy link
Member

jashapiro commented Nov 15, 2019

What analysis module should be updated and why?

snv-callers: The generation of the SNV consensus

Why should the module be updated?

The current SNV consensus strategy will lose some multinucleotide mutations due to differences between strelka2 and other callers. As noted in https://www.biorxiv.org/content/biorxiv/early/2019/04/30/623702.full.pdf, strelka does not make any multinucleotide (MNV) calls, instead calling as strings of SNVs, whereas other callers make DNP, TNP and ONP calls for di-, tri- and quad+ nucleotide variants, respectively.

What changes need to be made? Please provide enough detail for another participant to make the update.

There seem to be two main options:

  1. Split all MNVs from mutect and lancet before they hit the database that the consensus list is built from in scripts/01-setup_db.py. This may be a bit simpler to implement at the outset, but does lose haplotype information.
  2. Build the SNV consensus as we currently do in scripts/02-merge_callers.R, then write a second query to pull just the MNVs from callers that create them, split those up, and look for confirmation in the strelka data. This will make the consensus creation code a bit uglier, but may end up quicker overall, and can also be made to preserve haplotypes more easily.

I will be pursuing option 2 to start.

What input data should be used? Which data were used in the version being updated?

v10, when it arrives

When do you expect the revised analysis will be completed?

Early the week of 11/18

Who will complete the updated analysis?

@jashapiro, with assistance as needed from @cansavvy

@jashapiro
Copy link
Member Author

Note that this was previously also discussed in #30 (comment)

@migbro
Copy link
Contributor

migbro commented Nov 15, 2019

Hi all,
I was asked to post my attempt within the DRC to tackle this issue. It's in cwl, but with the meat of the work done injected as python code, so shouldn't be too hard to suss out.
The steps I go through are first, normalize each vcf following this protocol:
https://github.com/kids-first/kf-somatic-workflow/blob/master/tools/normalize_vcf.cwl

Then I build the mnps using this tool:
https://github.com/kids-first/kf-somatic-workflow/blob/master/tools/prep_mnp_variants.cwl

Then I take those results and run the consensus caller on them:
https://github.com/kids-first/kf-somatic-workflow/blob/master/tools/bcbio_variant_recall_ensemble.cwl

These are all tied together, using this workflow script:
https://github.com/kids-first/kf-somatic-workflow/blob/master/workflow/kfdrc_consensus_calling.cwl

I haven't had a chance to detail this out in the README, playing catch up with that!
As a general description I normalize each vcf, then I take the snp portion of the strelka2 vcf and put that in a file. Then I take all of the mnps from of all the other vcfs, put them in a table, and I hash them by position. Next, I go through each snp in the strelka2 snp file. If the position lines up in the hash, I look to see if any of the mnps are covered by strelka2 calls, and use the longest one (not sure if there are too many cases where one says AT and another ATG for a given position for example). Then I omit the snvs used to create the mnp, and output a new strelka2 file that is snps, reconstituted mnps, and indels, then pass along to consensus calling. In the normalize tool, there is an option to strip the existing annotation from the vcf (should be INFO/CSQ, not INFO/ANN for ours vcfs) so that when the new consensus vcf is made., vcf2maf can be run again to create fresh annotations and maf files. If there are any Q's, feel free to ask!

@jashapiro
Copy link
Member Author

jashapiro commented Nov 15, 2019

Thanks @migbro! That is very close to what I was going to be doing, so that is really helpful.
I haven't had a chance to look at the code yet, but I do have a couple questions right off the bat...
Does the final VCF/MAF preserve the read count info for reference and alternate?

The reason I am asking is really for my second question: Because it seems like if your code is already working well, should we just drop in replace (or supplement) the strelka MAF file with the one you generated that includes MNVs in the data download? That would definitely be the least effort at our end, especially since you have already done most (all?) of the work! Then we would just rerun to make updated consensus files.

@migbro
Copy link
Contributor

migbro commented Nov 15, 2019

Hmmm, I'll have to get back to you on that one. What I do see, before reannotation, is that the consensus caller seems to keep the genotype and depth information from the first caller of the consensus for each call. So, the format ends up varying. a bit, but you'll see something like:

1       16181842        .       C       TAATACATTATATAAGTATGTACTT       38.59   PASS    CALLERS=lancet,vardict;FETS=38.5932;KMERSIZE=25;LEN=24;OLD_VARIANT=1:16181841:AC/ATAATACATTATATAAGTATGTACTT;SB=10.0976;SOMATIC;TYPE=complex     GT:AD:DP:SA:SR  0/0:47,0:47:0,0:22,25   0/1:32,11:43:2,9:14,18
1       16303976        .       T       A       .       PASS    CALLERS=strelka2,mutect2,lancet,vardict;DP=131;MQ=59.08;MQ0=3;NT=ref;QSS=72;QSS_NT=72;ReadPosRankSum=-0.38;SGT=TT->AT;SNVSB=0;SOMATIC;SomaticEVS=9.2;TQSS=2;TQSS_NT=2   AU:CU:DP:FDP:GU:SDP:SUBDP:TU    0,0:0,0:59:0:0,0:0:0:59,68      5,5:0,0:48:1:0,0:0:0:42,57
1       16547522        .       G       T       .       PASS    CALLERS=strelka2,mutect2,lancet,vardict;DP=132;MQ=60;MQ0=0;NT=ref;QSS=167;QSS_NT=3070;ReadPosRankSum=0.74;SGT=GG->GT;SNVSB=0;SOMATIC;SomaticEVS=19.73;TQSS=1;TQSS_NT=1  AU:CU:DP:FDP:GU:SDP:SUBDP:TU    1,1:0,0:65:1:63,66:0:0:0,0      0,0:0,0:60:0:28,29:0:0:32,36

as an example snippet from the benchmarking set. So, for strelka2 supported mnps, since trying to recalculate DP and GT for strelkla2 would be tricky, I use the information that would essentially be whatever the first base pair of that mnp was. Hopefully this makes sense. I can follow up maybe this weekend or Monday on a what a final vcf/maf would look like, but I imagine it's basically the same with an annotation. I don't think VEP would recalculate depth, but maybe it does!

@jashapiro
Copy link
Member Author

I guess I am asking about before the final consensus, as I think we want to keep the callers separate for now. It sounds like you are saying you use the first bp of the MNP in the strelka VCF, which I think is probably good choice. Is that right?

What happens with the consensus is always going to be a challenge to settle on.

@migbro
Copy link
Contributor

migbro commented Nov 18, 2019

Hi @jashapiro , yes, the first base pair, if and when an mnp is built with strelka2, is used to inform read depth and genotype information. As for the final file, the good news is that the vcf2maf script that we use from MSKCC, in additional to annotating with VEP, also seem to "standardize" the formatting of the AD and DP files in the annotated vcf.

@jaclyn-taroni
Copy link
Member

Closed by #279 or should we keep this open for some reason @jashapiro ?

@jashapiro
Copy link
Member Author

Mostly closed. The only remaining question is whether to reconstitute MNVs. I would probably close this with #293, and we can file a new issue if MNV reconstruction is desired later.

@jaclyn-taroni
Copy link
Member

Sounds good!

@cansavvy
Copy link
Collaborator

Are we able to close this issue since the MNV part of the snv caller analysis has been implemented? Are there any un-addressed aspects of this issue?

@jaclyn-taroni
Copy link
Member

Looks like closed with #293

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

4 participants