Updated analysis: snv-callers #275

jashapiro · 2019-11-15T20:25:17Z

What analysis module should be updated and why?

snv-callers: The generation of the SNV consensus

Why should the module be updated?

The current SNV consensus strategy will lose some multinucleotide mutations due to differences between strelka2 and other callers. As noted in https://www.biorxiv.org/content/biorxiv/early/2019/04/30/623702.full.pdf, strelka does not make any multinucleotide (MNV) calls, instead calling as strings of SNVs, whereas other callers make DNP, TNP and ONP calls for di-, tri- and quad+ nucleotide variants, respectively.

What changes need to be made? Please provide enough detail for another participant to make the update.

There seem to be two main options:

Split all MNVs from mutect and lancet before they hit the database that the consensus list is built from in scripts/01-setup_db.py. This may be a bit simpler to implement at the outset, but does lose haplotype information.
Build the SNV consensus as we currently do in scripts/02-merge_callers.R, then write a second query to pull just the MNVs from callers that create them, split those up, and look for confirmation in the strelka data. This will make the consensus creation code a bit uglier, but may end up quicker overall, and can also be made to preserve haplotypes more easily.

I will be pursuing option 2 to start.

What input data should be used? Which data were used in the version being updated?

v10, when it arrives

When do you expect the revised analysis will be completed?

Early the week of 11/18

Who will complete the updated analysis?

@jashapiro, with assistance as needed from @cansavvy

The text was updated successfully, but these errors were encountered:

jashapiro · 2019-11-15T20:31:52Z

Note that this was previously also discussed in #30 (comment)

migbro · 2019-11-15T21:44:52Z

Hi all,
I was asked to post my attempt within the DRC to tackle this issue. It's in cwl, but with the meat of the work done injected as python code, so shouldn't be too hard to suss out.
The steps I go through are first, normalize each vcf following this protocol:
https://github.com/kids-first/kf-somatic-workflow/blob/master/tools/normalize_vcf.cwl

Then I build the mnps using this tool:
https://github.com/kids-first/kf-somatic-workflow/blob/master/tools/prep_mnp_variants.cwl

Then I take those results and run the consensus caller on them:
https://github.com/kids-first/kf-somatic-workflow/blob/master/tools/bcbio_variant_recall_ensemble.cwl

These are all tied together, using this workflow script:
https://github.com/kids-first/kf-somatic-workflow/blob/master/workflow/kfdrc_consensus_calling.cwl

I haven't had a chance to detail this out in the README, playing catch up with that!
As a general description I normalize each vcf, then I take the snp portion of the strelka2 vcf and put that in a file. Then I take all of the mnps from of all the other vcfs, put them in a table, and I hash them by position. Next, I go through each snp in the strelka2 snp file. If the position lines up in the hash, I look to see if any of the mnps are covered by strelka2 calls, and use the longest one (not sure if there are too many cases where one says AT and another ATG for a given position for example). Then I omit the snvs used to create the mnp, and output a new strelka2 file that is snps, reconstituted mnps, and indels, then pass along to consensus calling. In the normalize tool, there is an option to strip the existing annotation from the vcf (should be INFO/CSQ, not INFO/ANN for ours vcfs) so that when the new consensus vcf is made., vcf2maf can be run again to create fresh annotations and maf files. If there are any Q's, feel free to ask!

jashapiro · 2019-11-15T21:56:00Z

Thanks @migbro! That is very close to what I was going to be doing, so that is really helpful.
I haven't had a chance to look at the code yet, but I do have a couple questions right off the bat...
Does the final VCF/MAF preserve the read count info for reference and alternate?

The reason I am asking is really for my second question: Because it seems like if your code is already working well, should we just drop in replace (or supplement) the strelka MAF file with the one you generated that includes MNVs in the data download? That would definitely be the least effort at our end, especially since you have already done most (all?) of the work! Then we would just rerun to make updated consensus files.

migbro · 2019-11-15T22:29:36Z

Hmmm, I'll have to get back to you on that one. What I do see, before reannotation, is that the consensus caller seems to keep the genotype and depth information from the first caller of the consensus for each call. So, the format ends up varying. a bit, but you'll see something like:

1       16181842        .       C       TAATACATTATATAAGTATGTACTT       38.59   PASS    CALLERS=lancet,vardict;FETS=38.5932;KMERSIZE=25;LEN=24;OLD_VARIANT=1:16181841:AC/ATAATACATTATATAAGTATGTACTT;SB=10.0976;SOMATIC;TYPE=complex     GT:AD:DP:SA:SR  0/0:47,0:47:0,0:22,25   0/1:32,11:43:2,9:14,18
1       16303976        .       T       A       .       PASS    CALLERS=strelka2,mutect2,lancet,vardict;DP=131;MQ=59.08;MQ0=3;NT=ref;QSS=72;QSS_NT=72;ReadPosRankSum=-0.38;SGT=TT->AT;SNVSB=0;SOMATIC;SomaticEVS=9.2;TQSS=2;TQSS_NT=2   AU:CU:DP:FDP:GU:SDP:SUBDP:TU    0,0:0,0:59:0:0,0:0:0:59,68      5,5:0,0:48:1:0,0:0:0:42,57
1       16547522        .       G       T       .       PASS    CALLERS=strelka2,mutect2,lancet,vardict;DP=132;MQ=60;MQ0=0;NT=ref;QSS=167;QSS_NT=3070;ReadPosRankSum=0.74;SGT=GG->GT;SNVSB=0;SOMATIC;SomaticEVS=19.73;TQSS=1;TQSS_NT=1  AU:CU:DP:FDP:GU:SDP:SUBDP:TU    1,1:0,0:65:1:63,66:0:0:0,0      0,0:0,0:60:0:28,29:0:0:32,36

as an example snippet from the benchmarking set. So, for strelka2 supported mnps, since trying to recalculate DP and GT for strelkla2 would be tricky, I use the information that would essentially be whatever the first base pair of that mnp was. Hopefully this makes sense. I can follow up maybe this weekend or Monday on a what a final vcf/maf would look like, but I imagine it's basically the same with an annotation. I don't think VEP would recalculate depth, but maybe it does!

jashapiro · 2019-11-15T22:43:42Z

I guess I am asking about before the final consensus, as I think we want to keep the callers separate for now. It sounds like you are saying you use the first bp of the MNP in the strelka VCF, which I think is probably good choice. Is that right?

What happens with the consensus is always going to be a challenge to settle on.

migbro · 2019-11-18T19:47:36Z

Hi @jashapiro , yes, the first base pair, if and when an mnp is built with strelka2, is used to inform read depth and genotype information. As for the final file, the good news is that the vcf2maf script that we use from MSKCC, in additional to annotating with VEP, also seem to "standardize" the formatting of the AD and DP files in the annotated vcf.

jaclyn-taroni · 2019-11-27T15:02:27Z

Closed by #279 or should we keep this open for some reason @jashapiro ?

jashapiro · 2019-11-27T15:05:52Z

Mostly closed. The only remaining question is whether to reconstitute MNVs. I would probably close this with #293, and we can file a new issue if MNV reconstruction is desired later.

jaclyn-taroni · 2019-11-27T15:06:21Z

Sounds good!

cansavvy · 2019-12-12T20:51:41Z

Are we able to close this issue since the MNV part of the snv caller analysis has been implemented? Are there any un-addressed aspects of this issue?

jaclyn-taroni · 2019-12-12T20:52:59Z

Looks like closed with #293

jashapiro added the updated analysis label Nov 15, 2019

jashapiro self-assigned this Nov 15, 2019

jashapiro mentioned this issue Nov 18, 2019

Jashapiro/mnv consensus #279

Merged

2 tasks

This was referenced Nov 22, 2019

Proposed Analysis: Getting a consensus set of SNV mutations #161

Closed

Planned data release: V11 #287

Closed

yuankunzhu mentioned this issue Nov 26, 2019

V11 Release #293

Merged

jashapiro mentioned this issue Dec 2, 2019

Update Documentation to Reflect SNV refactoring #304

Merged

2 tasks

jaclyn-taroni closed this as completed Dec 12, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Updated analysis: snv-callers #275

Updated analysis: snv-callers #275

jashapiro commented Nov 15, 2019 •

edited

jashapiro commented Nov 15, 2019

migbro commented Nov 15, 2019

jashapiro commented Nov 15, 2019 •

edited

migbro commented Nov 15, 2019

jashapiro commented Nov 15, 2019

migbro commented Nov 18, 2019

jaclyn-taroni commented Nov 27, 2019

jashapiro commented Nov 27, 2019

jaclyn-taroni commented Nov 27, 2019

cansavvy commented Dec 12, 2019

jaclyn-taroni commented Dec 12, 2019

Updated analysis: snv-callers #275

Updated analysis: snv-callers #275

Comments

jashapiro commented Nov 15, 2019 • edited

What analysis module should be updated and why?

Why should the module be updated?

What changes need to be made? Please provide enough detail for another participant to make the update.

What input data should be used? Which data were used in the version being updated?

When do you expect the revised analysis will be completed?

Who will complete the updated analysis?

jashapiro commented Nov 15, 2019

migbro commented Nov 15, 2019

jashapiro commented Nov 15, 2019 • edited

migbro commented Nov 15, 2019

jashapiro commented Nov 15, 2019

migbro commented Nov 18, 2019

jaclyn-taroni commented Nov 27, 2019

jashapiro commented Nov 27, 2019

jaclyn-taroni commented Nov 27, 2019

cansavvy commented Dec 12, 2019

jaclyn-taroni commented Dec 12, 2019

jashapiro commented Nov 15, 2019 •

edited

jashapiro commented Nov 15, 2019 •

edited