open-cravat octopus vs strelka #64
Comments
Am I going crazy? There was a useful comment here from @rkimoakbioinformatics about the 3' rule.
@cariaso Sorry, I deleted my comments because I was afraid of giving confusing information and wanted to write a clearer one. I'll write it below. OpenCRAVAT normalizes input variants with the 3' rule and then feeds the normalized variants to annotators. Thus,
As I said, I am not sure if this is an answer to your question. |
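For readers following along, the 3' rule mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not OpenCRAVAT's actual code: the function name `shift_deletion_3prime`, the 0-based coordinates, and the toy reference `ACTGTGTGA` are all hypothetical.

```python
def shift_deletion_3prime(ref_seq: str, pos: int, length: int) -> int:
    """Return the right-most (3'-most) 0-based start of an equivalent deletion.

    Deleting ref_seq[pos:pos+length] gives the same sequence as deleting
    ref_seq[pos+1:pos+length+1] exactly when the base leaving the window
    equals the base entering it.
    """
    while pos + length < len(ref_seq) and ref_seq[pos] == ref_seq[pos + length]:
        pos += 1
    return pos


def apply_deletion(ref_seq: str, pos: int, length: int) -> str:
    """Apply the deletion, so equivalent placements can be compared."""
    return ref_seq[:pos] + ref_seq[pos + length:]


# Hypothetical toy reference with a TG repeat (not taken from the real data).
ref = "ACTGTGTGA"
shifted = shift_deletion_3prime(ref, 2, 2)  # deletion of TG, called at index 2
print(shifted, apply_deletion(ref, 2, 2) == apply_deletion(ref, shifted, 2))
```

In this toy case a TG deletion called anywhere inside the repeat shifts to the same 3'-most start, which is why two callers that place the deletion differently can still end up as the same normalized variant.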
It's not a final answer, but it definitely helps. I'm finding it a bit difficult to keep these straight, so I'm assigning names S1, S2, O1, O2 and attempting to restate what you've written. I'm not agreeing or disagreeing. strelka VCF:
opencravat 3' normalized strelka: octopus VCF:
opencravat 3' normalized octopus: Although
I see clear evidence of octopus working as you've described, with O1 & O2 as expected as 2 rows in the variant table. The only entry in the variant table near this location looks like this:
Should there be 1 or 2 entries in the variant table? |
There should be 2 entries in the variant table. The following is the test with which I saw two entries. The input and the tsv report file are attached.
In the tsv file and also with |
And indeed there are. I'd failed to notice one of them, because there remains a crucial difference between the results of strelka and octopus. For strelka the results are just as you say, with 1 variant at however when the same process is repeated with a VCF file based on the 2 octopus VCF lines I provided above, there are again 2 variants, however they both have the same Can you please repeat your
Do you see what I see, 2 variants with different |
I see the following from an oc run with your two octopus VCF format variants:
There are a few things here:
What do you think? |
More seriously, or perhaps just more productively, I really don't know. I've been vaguely aware of these sorts of issues for quite a while, but until you've got a concrete case it's hard to make progress. I think this is useful in that way. No tool or database alone can solve this, and any industry-wide fix is likely to be a very slow slog. It's quite disheartening, but I expect that I can now use this thread to raise related bug reports in the https://github.com/Illumina/strelka https://github.com/luntergroup/octopus repos, and perhaps raise it with ClinVar. I suppose it also speaks to the need for an independent tool to do the 3' shifting. Otherwise there will be a lot of independent, and slightly different, solutions to the problem. Even with that, there is a timing issue very akin to
Ok. Meanwhile, the OpenCRAVAT team will discuss the possibility of applying a unified process to all resources for consistency. If this strelka, octopus, and ClinVar case is solved within OpenCRAVAT, I'll post an update.
I'll also mention that while HGVS seems to dictate 3', the VCF spec seems to necessitate a 5' 'upstream' base before indels (left-aligned might be a more correct term). For this reason, when I've hacked together partial solutions to this in the past, I've done 5' shifting. Perhaps what you're imagining is a post-VCF step and it doesn't matter. But if there were a generic tool for shifting VCFs, it would likely take a VCF as input and spit one out. If so, supporting both 5' and 3' requirements might make it unnecessarily large, and perhaps raise pathological issues with the difference between the individual and the chosen reference.
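A rough sketch of the 5' shifting described above, written VCF-style with one anchor base before the deleted bases. This is an assumption-laden illustration rather than what strelka, octopus, or bcftools actually do; `left_align_deletion`, `to_vcf_deletion`, and the toy reference are hypothetical names and data.

```python
def left_align_deletion(ref_seq: str, pos: int, length: int) -> int:
    """Return the left-most (5'-most) 0-based start of an equivalent deletion."""
    # Moving the window one base left is neutral when the base entering on
    # the left equals the base leaving on the right.
    while pos > 0 and ref_seq[pos - 1] == ref_seq[pos + length - 1]:
        pos -= 1
    return pos


def to_vcf_deletion(ref_seq: str, pos: int, length: int) -> tuple[int, str, str]:
    """Left-align, then return (POS, REF, ALT) with a preceding anchor base.

    POS is 1-based and points at the anchor base, following the VCF
    convention for indels.
    """
    start = left_align_deletion(ref_seq, pos, length)
    if start == 0:
        # No upstream base exists to anchor on; this sketch does not handle it.
        raise ValueError("deletion at sequence start has no upstream anchor base")
    anchor = ref_seq[start - 1]
    return start, anchor + ref_seq[start:start + length], anchor


# Hypothetical toy reference; the TG deletion is given at its 3'-most placement.
print(to_vcf_deletion("ACTGTGTGA", 6, 2))
```

Note that the 1-based POS of the anchor equals the 0-based start of the deleted bases, which is an easy place for off-by-one mistakes when converting between the 3' and 5' conventions.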
https://genome.sph.umich.edu/wiki/Variant_Normalization https://annovar.openbioinformatics.org/en/latest/articles/VCF/
more notes that seem relevant https://samtools.github.io/bcftools/bcftools.html#norm https://github.com/Janchorizo/ban-vcf https://github.com/freebayes/freebayes suggests others that might be relevant
I'm seeing a broad consensus that left-aligning is the most common convention for VCFs (confirmed for octopus, but see the comments on why this isn't ideal at luntergroup/octopus#172 (comment) ) and I suspect that is what both strelka and ClinVar do.
https://pypi.org/project/bioutils/ 'expanded' is new to me, and is used by https://github.com/ga4gh/vrs-python & https://github.com/ga4gh/vrs. The idea is that any ambiguous positioning is expanded to cover the entire range of ambiguous positions so that comparisons are easier.
becomes
and
becomes
note the difference between |
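The expansion idea can be illustrated without any library. The sketch below is not the bioutils or vrs-python implementation; `expand_deletion`, the 0-based half-open coordinates, and the toy reference are assumptions made for illustration only.

```python
def expand_deletion(ref_seq: str, pos: int, length: int) -> tuple[int, int, str, str]:
    """Expand a deletion to cover every position where it could be placed.

    Returns (start, end, ref_allele, alt_allele) over the 0-based, half-open
    interval [start, end) spanning all equivalent placements.
    """
    # Left-most equivalent start.
    left = pos
    while left > 0 and ref_seq[left - 1] == ref_seq[left + length - 1]:
        left -= 1
    # Right-most equivalent start.
    right = pos
    while right + length < len(ref_seq) and ref_seq[right] == ref_seq[right + length]:
        right += 1
    end = right + length
    ref_allele = ref_seq[left:end]
    # Every placement inside the region gives the same result, so drop the
    # first `length` bases of the region to get the alternate allele.
    alt_allele = ref_allele[length:]
    return left, end, ref_allele, alt_allele


# Hypothetical toy reference with a TG repeat.
ref = "ACTGTGTGA"
print(expand_deletion(ref, 4, 2))  # deletion of TG
print(expand_deletion(ref, 2, 4))  # deletion of TGTG
```

On this toy input both deletions expand to the same interval and differ only in the alternate allele, so a comparison no longer depends on whether the caller shifted left or right.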
Similar to #63 but not hg19 specific. I have a BAM on which I've used both
https://github.com/luntergroup/octopus
and
https://github.com/Illumina/strelka
They agree on what they found (deletion of TG & deletion of TGTG) but disagree on how to represent that in the output. In particular
strelka says
while octopus instead writes that as 2 rows
The octopus team seems to view their representation as a feature, not a bug:
https://github.com/luntergroup/octopus#output-format
however numerous parts of opencravat behave quite differently when processing these. Here are some of the differences I've noted.