Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

open-cravat octopus vs strelka #64

Closed
cariaso opened this issue Apr 15, 2021 · 15 comments
Closed

open-cravat octopus vs strelka #64

cariaso opened this issue Apr 15, 2021 · 15 comments

Comments

@cariaso
Copy link

cariaso commented Apr 15, 2021

Similar to #63 but not hg19 specific. I have a BAM on which I've used both

https://github.com/luntergroup/octopus
and
https://github.com/Illumina/strelka

They agree on what they found (deletion of TG & deletion of TGTG) but disagree on how to represent that in the output. In particular

strelka says

chr7    74053320        .       CTGTG   CTG,C   278     PASS    CIGAR=1M2D2M,1M4D;RU=TG,TG;REFREP=19,19;IDREP=18,17;MQ=60       GT:GQ:GQX:DPI:AD:ADF:ADR:FT:PL  1/2:65:19:36:1,12,9:0,2,1:1,10,8:PASS:282,102,66,142,0,123

while octopus instead writes that as 2 rows

chr7  74053320        .       CTG     C,*     266.45  PASS    AC=1,1;AN=2;DP=37;MQ=59;NS=1    GT:GQ:DP:MQ:PS:PQ:FT    2|1:93:37:59:74053320:100:PASS
chr7  74053320        .       CTGTG   C,*     48.37   PASS    AC=1,1;AN=2;DP=37;MQ=59;NS=1    GT:GQ:DP:MQ:PS:PQ:FT    1|2:48:37:59:74053320:100:PASS

the octopus team seems to view their representation as a feature, not a bug
https://github.com/luntergroup/octopus#output-format

however numerous parts of opencravat behave quite differently when processing these. Here are some of the differences I've noted.

{
  "base__all_mappings": [
    "{\"ELN\": [[\"P15502\", \"\", \"INT\", \"ENST00000252034.12\", \"c.1096+47_1096+50del\"], [\"P15502\", \"\", \"INT\", \"ENST00000320399.10\", \"c.1096+47_1096+50del\"], [\"\", \"\", \"INT\", \"ENST00000320492.11\", \"c.988+47_988+50del\"], [\"P15502\", \"\", \"INT\", \"ENST00000357036.9\", \"c.1111+47_1111+50del\"], [\"P15502\", \"\", \"INT\", \"ENST00000358929.8\", \"c.1096+47_1096+50del\"], [\"P15502\", \"\", \"INT\", \"ENST00000380553.8\", \"c.745+47_745+50del\"], [\"P15502\", \"\", \"INT\", \"ENST00000380562.8\", \"c.1096+47_1096+50del\"], [\"P15502\", \"\", \"INT\", \"ENST00000380575.8\", \"c.1066+47_1066+50del\"], [\"P15502\", \"\", \"INT\", \"ENST00000380576.9\", \"c.1096+47_1096+50del\"], [\"P15502\", \"\", \"INT\", \"ENST00000380584.8\", \"c.1054+47_1054+50del\"], [\"\", \"\", \"INT\", \"ENST00000414324.5\", \"c.1081+47_1081+50del\"], [\"P15502\", \"\", \"INT\", \"ENST00000429192.5\", \"c.1111+47_1111+50del\"], [\"\", \"\", \"INT\", \"ENST00000445912.5\", \"c.1096+47_1096+50del\"], [\"\", \"\", \"INT\", \"ENST00000458204.5\", \"c.1066+47_1066+50del\"], [\"\", \"\", \"INT,PTR\", \"ENST00000466878.5\", \"\"], [\"\", \"\", \"INT\", \"ENST00000621115.4\", \"c.964+47_964+50del\"]]}",
    "{\"ELN\": [[\"P15502\", \"\", \"INT\", \"ENST00000252034.12\", \"c.1096+49_1096+50del\"], [\"P15502\", \"\", \"INT\", \"ENST00000320399.10\", \"c.1096+49_1096+50del\"], [\"\", \"\", \"INT\", \"ENST00000320492.11\", \"c.988+49_988+50del\"], [\"P15502\", \"\", \"INT\", \"ENST00000357036.9\", \"c.1111+49_1111+50del\"], [\"P15502\", \"\", \"INT\", \"ENST00000358929.8\", \"c.1096+49_1096+50del\"], [\"P15502\", \"\", \"INT\", \"ENST00000380553.8\", \"c.745+49_745+50del\"], [\"P15502\", \"\", \"INT\", \"ENST00000380562.8\", \"c.1096+49_1096+50del\"], [\"P15502\", \"\", \"INT\", \"ENST00000380575.8\", \"c.1066+49_1066+50del\"], [\"P15502\", \"\", \"INT\", \"ENST00000380576.9\", \"c.1096+49_1096+50del\"], [\"P15502\", \"\", \"INT\", \"ENST00000380584.8\", \"c.1054+49_1054+50del\"], [\"\", \"\", \"INT\", \"ENST00000414324.5\", \"c.1081+49_1081+50del\"], [\"P15502\", \"\", \"INT\", \"ENST00000429192.5\", \"c.1111+49_1111+50del\"], [\"\", \"\", \"INT\", \"ENST00000445912.5\", \"c.1096+49_1096+50del\"], [\"\", \"\", \"INT\", \"ENST00000458204.5\", \"c.1066+49_1066+50del\"], [\"\", \"\", \"INT,PTR\", \"ENST00000466878.5\", \"\"], [\"\", \"\", \"INT\", \"ENST00000621115.4\", \"c.964+49_964+50del\"]]}",
    "{\"ELN\": [[\"P15502\", \"\", \"INT\", \"ENST00000252034.12\", \"c.1096+47_1096+50del\"], [\"P15502\", \"\", \"INT\", \"ENST00000320399.10\", \"c.1096+47_1096+50del\"], [\"\", \"\", \"INT\", \"ENST00000320492.11\", \"c.988+47_988+50del\"], [\"P15502\", \"\", \"INT\", \"ENST00000357036.9\", \"c.1111+47_1111+50del\"], [\"P15502\", \"\", \"INT\", \"ENST00000358929.8\", \"c.1096+47_1096+50del\"], [\"P15502\", \"\", \"INT\", \"ENST00000380553.8\", \"c.745+47_745+50del\"], [\"P15502\", \"\", \"INT\", \"ENST00000380562.8\", \"c.1096+47_1096+50del\"], [\"P15502\", \"\", \"INT\", \"ENST00000380575.8\", \"c.1066+47_1066+50del\"], [\"P15502\", \"\", \"INT\", \"ENST00000380576.9\", \"c.1096+47_1096+50del\"], [\"P15502\", \"\", \"INT\", \"ENST00000380584.8\", \"c.1054+47_1054+50del\"], [\"\", \"\", \"INT\", \"ENST00000414324.5\", \"c.1081+47_1081+50del\"], [\"P15502\", \"\", \"INT\", \"ENST00000429192.5\", \"c.1111+47_1111+50del\"], [\"\", \"\", \"INT\", \"ENST00000445912.5\", \"c.1096+47_1096+50del\"], [\"\", \"\", \"INT\", \"ENST00000458204.5\", \"c.1066+47_1066+50del\"], [\"\", \"\", \"INT,PTR\", \"ENST00000466878.5\", \"\"], [\"\", \"\", \"INT\", \"ENST00000621115.4\", \"c.964+47_964+50del\"]]}"
  ],
  "base__cchange": [
    "c.1096+47_1096+50del",
    "c.1096+49_1096+50del",
    "c.1096+47_1096+50del"
  ],
  "base__ref_base": [
    "TGTG",
    "TG",
    "TGTG"
  ],
  "base__uid": [
    2292288,
    2229240,
    2229241
  ],
  "clinvar__disease_names": [
    "not specified",
    "Supravalvar aortic stenosis|Cutis laxa, autosomal dominant",
    "not specified"
  ],
  "clinvar__disease_refs": [
    "MedGen:CN169374",
    "Human Phenotype Ontology:HP:0004381,MONDO:MONDO:0008504,MedGen:C0003499,OMIM:185500,Orphanet:ORPHA3193,SNOMED CT:268185002|MONDO:MONDO:0019571,MedGen:C0268350,Orphanet:ORPHA90348,SNOMED CT:111388003",
    "MedGen:CN169374"
  ],
  "clinvar__id": [
    "402825",
    "360646",
    "402825"
  ],
  "clinvar__sig": [
    "Benign",
    "Uncertain significance",
    "Benign"
  ],
  "extra_vcf_info__AC": [
    null,
    1,
    1
  ],
  "extra_vcf_info__AN": [
    null,
    2,
    2
  ],
  "extra_vcf_info__CIGAR": [
    "1M4D",
    null,
    null
  ],
  "extra_vcf_info__DP": [
    null,
    37,
    37
  ],
  "extra_vcf_info__IDREP": [
    17,
    null,
    null
  ],
  "extra_vcf_info__MQ": [
    60,
    59,
    59
  ],
  "extra_vcf_info__NS": [
    null,
    1,
    1
  ],
  "extra_vcf_info__REFREP": [
    19,
    null,
    null
  ],
  "extra_vcf_info__RU": [
    "TG",
    null,
    null
  ],
  "extra_vcf_info__ref": [
    "CTGTG",
    "CTG",
    "CTGTG"
  ],
  "gnomad3__af": [
    0.543651,
    0.0472857,
    0.543651
  ],
  "gnomad3__af_afr": [
    0.48686,
    0.0873048,
    0.48686
  ],
  "gnomad3__af_asj": [
    0.604372,
    0.0314233,
    0.604372
  ],
  "gnomad3__af_eas": [
    0.569697,
    0.054698,
    0.569697
  ],
  "gnomad3__af_fin": [
    0.573476,
    0.0422155,
    0.573476
  ],
  "gnomad3__af_lat": [
    0.602031,
    0.0405933,
    0.602031
  ],
  "gnomad3__af_nfe": [
    0.558495,
    0.025235,
    0.558495
  ],
  "gnomad3__af_oth": [
    0.55107,
    0.0501949,
    0.55107
  ],
  "gnomad3__af_sas": [
    0.547005,
    0.044838,
    0.547005
  ],
  "vcfinfo__af": [
    "0.4090909090909091",
    null,
    null
  ],
  "vcfinfo__alt_reads": [
    "9",
    null,
    null
  ],
  "vcfinfo__phred": [
    "278",
    "266.45",
    "48.37"
  ],
  "vcfinfo__tot_reads": [
    "22",
    "37",
    "37"
  ]
}```


Does the opencravat team feel that octopus is "just plain wrong"? "not supported"? "cutting edge"? something else?
Do these different opencravat outputs reflect a limitation of opencravat?
If so, does it seem fixable? or is it an unavoidable limitation of opencravat architecture?
@cariaso cariaso changed the title open-cravat does seem to like octopus VCF open-cravat does not seem to like octopus VCF Apr 15, 2021
@cariaso
Copy link
Author

cariaso commented Apr 15, 2021

am I going crazy? there was useful comment here from @rkimoakbioinformatics about the 3' rule
https://varnomen.hgvs.org/recommendations/checklist/#:~:text=The%203'%20rule%20%2D%20do%20you,amino%20acid

@rkimoakbioinformatics
Copy link
Contributor

@cariaso Sorry, I deleted my comments because I was afraid of possibly giving confusing information and wanted to write a clearer one again. I'll write below.

OpenCRAVAT normalizes input variants with the 3' rule and then feed the normalized variants to annotators. Thus,

  • The first input: strelka input chr7:74053320:CTGTG>CTG becomes chr7:74053323:TGdel after normalization with the 3' rule. octopus input chr7:74053320:CTG>C becomes chr7:74053321:TGdel after normalization with the 3' rule. Although these two variants look different after normalization with the 3' rule, when the hg38 mapper processes them, the mapper looks at the bases around the variants to apply the 3' rule further. This step by the mapper ensures that correct HGVS-format consequences are found. As a result, OpenCRAVAT produces the same variant consequences for the both variants (e.g. ENST00000252034.12 intron_variant c.1096+49_1096+50del).
  • The second input: strelka input chr7:74053320:CTGTG>C becomes chr7:74053321:TGTGdel after normalization with the 3' rule. octopus input chr7:74053320:CTGTG>C becomes chr7:74053321:TGTGdel after normalization with the 3' rule. Both are the same and OpenCRAVAT produces the same variant consequences (e.g. ENST00000252034.12 intron_variant c.1096+47_1096+50del).

As I said, I am not sure if this is an answer to your question.

@cariaso
Copy link
Author

cariaso commented Apr 16, 2021

I am not sure if this is an answer to your question.

It's not a final answer, but it definitely helps. I'm finding it a bit difficult to keep these straight, so I'm assigning names S1, S2, O1, O2 and attempting to restate what you've written. I'm not agreeing or disagreeing.

strelka VCF:

chr7    74053320        .       CTGTG   CTG,C   278     PASS    CIGAR=1M2D2M,1M4D;RU=TG,TG;REFREP=19,19;IDREP=18,17;MQ=60       GT:GQ:GQX:DPI:AD:ADF:ADR:FT:PL  1/2:65:19:36:1,12,9:0,2,1:1,10,8:PASS:282,102,66,142,0,123

opencravat 3' normalized strelka:
S1 chr7:74053320:CTGTG>CTG becomes chr7:74053323:TGdel
S2 chr7:74053320:CTGTG>C becomes chr7:74053321:TGTGdel

octopus VCF:

chr7  74053320        .       CTG     C,*     266.45  PASS    AC=1,1;AN=2;DP=37;MQ=59;NS=1    GT:GQ:DP:MQ:PS:PQ:FT    2|1:93:37:59:74053320:100:PASS
chr7  74053320        .       CTGTG   C,*     48.37   PASS    AC=1,1;AN=2;DP=37;MQ=59;NS=1    GT:GQ:DP:MQ:PS:PQ:FT    1|2:48:37:59:74053320:100:PASS

opencravat 3' normalized octopus:
O1 chr7:74053320:CTG>C becomes chr7:74053321:TGdel
O2 chr7:74053320:CTGTG>C becomes chr7:74053321:TGTGdel

Although S1 & O1 look different after normalization with the 3' rule, when the hg38 mapper processes them, the mapper looks at the bases around the variants to apply the 3' rule further. This step by the mapper ensures that correct HGVS-format consequences are found. As a result, OpenCRAVAT produces the same variant consequences for the both variants (e.g. ENST00000252034.12 intron_variant c.1096+49_1096+50del).

S2 & O2 are the same and OpenCRAVAT produces the same variant consequences (e.g. ENST00000252034.12 intron_variant c.1096+47_1096+50del).

@cariaso
Copy link
Author

cariaso commented Apr 16, 2021

I see clear evidence of octopus working as you've described, with O1 & O2 as expected as 2 rows in the variant table.
However for Strelka I'm only seeing evidence of S2.

The only entry in the variant table near this location looks like this:

  base__uid = 2292288
  base__chrom = chr7
  base__pos = 74053321
  base__ref_base = TGTG
  base__alt_base = -
  base__hugo = ELN
  base__transcript = ENST00000252034.12
  base__so = INT
  base__cchange = c.1096+47_1096+50del
  base__all_mappings = {"ELN": [["P15502", "", "INT", "ENST00000252034.12", "c.1096+47_1096+50del"], ["P15502", "", "INT", "ENST00000320399.10", "c.1096+47_1096+50del"], ["", "", "INT", "ENST00000320492.11", "c.988+47_988+50del"], ["P15502", "", "INT", "ENST00000357036.9", "c.1111+47_1111+50del"], ["P15502", "", "INT", "ENST00000358929.8", "c.1096+47_1096+50del"], ["P15502", "", "INT", "ENST00000380553.8", "c.745+47_745+50del"], ["P15502", "", "INT", "ENST00000380562.8", "c.1096+47_1096+50del"], ["P15502", "", "INT", "ENST00000380575.8", "c.1066+47_1066+50del"], ["P15502", "", "INT", "ENST00000380576.9", "c.1096+47_1096+50del"], ["P15502", "", "INT", "ENST00000380584.8", "c.1054+47_1054+50del"], ["", "", "INT", "ENST00000414324.5", "c.1081+47_1081+50del"], ["P15502", "", "INT", "ENST00000429192.5", "c.1111+47_1111+50del"], ["", "", "INT", "ENST00000445912.5", "c.1096+47_1096+50del"], ["", "", "INT", "ENST00000458204.5", "c.1066+47_1066+50del"], ["", "", "INT,PTR", "ENST00000466878.5", ""], ["", "", "INT", "ENST00000621115.4", "c.964+47_964+50del"]]}
  clinvar__sig = Benign
  clinvar__disease_refs = MedGen:CN169374
  clinvar__disease_names = not specified
  clinvar__rev_stat = criteria provided, single submitter
  clinvar__id = 402825 
  extra_vcf_info__pos = 74053320
  extra_vcf_info__ref = CTGTG
  extra_vcf_info__alt = C
  extra_vcf_info__CIGAR = 1M4D
  extra_vcf_info__RU = TG
  extra_vcf_info__REFREP = 19
  extra_vcf_info__IDREP = 17
  extra_vcf_info__MQ = 60
  gnomad3__af = 0.543651
  gnomad3__af_afr = 0.48686
  gnomad3__af_asj = 0.604372
  gnomad3__af_eas = 0.569697
  gnomad3__af_fin = 0.573476
  gnomad3__af_lat = 0.602031
  gnomad3__af_nfe = 0.558495
  gnomad3__af_oth = 0.55107
  gnomad3__af_sas = 0.547005
  vcfinfo__phred = 278
  vcfinfo__filter = PASS
  vcfinfo__zygosity = het
  vcfinfo__alt_reads = 9
  vcfinfo__tot_reads = 22
  vcfinfo__af = 0.4090909090909091
  tagsampler__numsample = 1
  tagsampler__samples = HG02219
  tagsampler__tags = 

Should there be 1 or 2 entries in the variant table?
If only 1, how should it represent S1 & S2?

@cariaso cariaso changed the title open-cravat does not seem to like octopus VCF open-cravat for GT=1/2 (het, two distinct both non-ref) Apr 16, 2021
@rkimoakbioinformatics
Copy link
Contributor

There should be 2 entries in the variant table. The following is the test with which I saw two entries. The input and the tsv report file are attached.

oc run strelka.vcf --skip annotator -t tsv

In the tsv file and also with oc gui strelka.vcf.sqlite, two entries are in the result. By the way, vcf-converter version was 2.0.2.

strelka.zip

@cariaso
Copy link
Author

cariaso commented Apr 18, 2021

There should be 2 entries in the variant table.

And indeed there are. I'd failed to notice one of them, because there remains a crucial difference between the results of strelka and octopus.

For strelka the results are just as you say, with 1 variant at
base__pos = 74053323
and the other at
base__pos = 74053321

however when the same process is repeated with a VCF files based on the 2 octopus VCF lines I provided above, there are again 2 variants, however they both have the same
base_pos = 74053321

Can you please repeat your oc run command from the prior comment, with this octopus VCF:

chr7  74053320        .       CTG     C,*     266.45  PASS    AC=1,1;AN=2;DP=37;MQ=59;NS=1    GT:GQ:DP:MQ:PS:PQ:FT    2|1:93:37:59:74053320:100:PASS
chr7  74053320        .       CTGTG   C,*     48.37   PASS    AC=1,1;AN=2;DP=37;MQ=59;NS=1    GT:GQ:DP:MQ:PS:PQ:FT    1|2:48:37:59:74053320:100:PASS

Do you see what I see, 2 variants with different base__pos?
If so, equivalent inputs produce variants reported at different locations, with secondary effects on clinvar__disease_names and numerous other fields.

@cariaso cariaso changed the title open-cravat for GT=1/2 (het, two distinct both non-ref) open-cravat octopus vs strelka Apr 18, 2021
@rkimoakbioinformatics
Copy link
Contributor

I see the following from an oc run with your two octopus VCF format variants:

1   chr7    74053321    TG  -
2   chr7    74053321    TGTG    -

There are a few things here:

  • strelka and octopus actually give different decision on the first variant. strelka says "chr7 74053320 CTGTG CTG" meaning that positions 74053323 to 74053324 were deleted, while octopus says "chr7 74053320 CTG C" meaning that positions 74053321 to 74053322 were deleted.

  • Before mapping to transcripts, OpenCRAVAT does its best to standardize input position, reference allele, and alternate allele within the limit of what is given as input. Thus, at this stage, OpenCRAVAT does not know that whether the 3' rule can be further applied to chr7 74053321 TG - or not. To be fair, as far as I know, no other tool inquires that, either.

  • Further complication is that, if we follow HGVS's 3' rule, strelka's chr7 74053320 CTGTG CTG and octopus' chr7 74053320 CTG C are both wrong. There are 19 repeats of TG which span from the position 74053321 to the position 74053358. Thus, according to HGVS's 3' rule, the correct representation would be chr7 74053357 TG -, not 74053321 nor 74053323 (OpenCRAVAT's mapper follows these repeats and outputs correct cDNA and protein changes).

  • The issue here is that there is not yet any standard which all variant callers and annotation databases agree to use. If the HGVS 3' rule was considered as the standard, strelka and octopus would be both wrong. If ClinVar would adopt the 3' rule, both strelka and octopus variants wouldn't be found in ClinVar.

  • Maybe internally applying normalization on input variants and annotation databases is a solution and this can be the value that such a tool as OpenCRAVAT can contribute, although there can be the issue of the confusion coming from the difference between an annotation source's original data and its OpenCRAVAT-processed version. Also, it would be quite a large work.

What do you think?

@cariaso
Copy link
Author

cariaso commented Apr 20, 2021

What do you think?

https://www.youtube.com/watch?v=dsx2vdn7gpY

@cariaso
Copy link
Author

cariaso commented Apr 20, 2021

more seriously, or perhaps just more productively, I really don't know. I've been vaguely aware of these sorts of issues for a quite a while, but until you've got a concrete case it's hard to make progress. I think this is useful in that way. No tool, nor any database alone can solve this. And any industry wide fix is likely to be a very slow slog. It's quite disheartening, but I expect that I can now use this thread to raise related bug reports in the https://github.com/Illumina/strelka https://github.com/luntergroup/octopus repos, and perhaps raise it to clinvar.

I suppose it also speaks to the need for an independent tool to do the 3' shifting. Otherwise there will be a lot of independent, and slightly different solutions to the problem.

Even with that, there is a timing issue very akin to
https://en.wikipedia.org/wiki/Dagen_H
which would be impossible in practice

Illumina/strelka#193
luntergroup/octopus#172

@rkimoakbioinformatics
Copy link
Contributor

Ok. Meanwhile, OpenCRAVAT team will discuss the possibility of applying a unified process on all resources for consistency. If this strelka, octopus, and ClinVar case is solved within OpenCRAVAT, I'll post an update.

@cariaso
Copy link
Author

cariaso commented Apr 20, 2021

I'll also mention that while HGVS seems to dictate 3', the VCF spec seems to necessitate a 5' 'upstream' base before indels (it seems left aligned might be a more correct term). For this reason, when I've adhoced partial solutions to this, In the past I've done 5' shifting. Perhaps what you're imagining is a post-VCF step and it doesn't matter. But if there was a generic tool for shifting VCFs, it would likely take a VCF as input and spit one out. If so, having both 5' and 3' requirements might end up quite unnecessarily large, and perhaps raise pathologic issues with the difference between the individual vs the chosen reference.

@cariaso
Copy link
Author

cariaso commented Apr 20, 2021

https://genome.sph.umich.edu/wiki/Variant_Normalization
argues for 5' normalization, and indicates that this tool can do the job
https://github.com/atks/vt
https://genome.sph.umich.edu/wiki/Vt#Normalization

https://annovar.openbioinformatics.org/en/latest/articles/VCF/

There is no community consensus yet on how to describe an indel in an unique way. Many users prefer to do a left-normalization. Left-normalization means that the start position of a variant should be shifted to the left utill it is no longer possible to do so, so that the smaller the number, the better. However, HGVS clearly specifies that left-normalization would be performed on cDNA (mRNA) coordinate, which means that right-normalization is required for half the genes in human genome. We will just have to accept the fact that people do not agree with each other at this point.

Currently, the following databases in ANNOVAR are left-normalized so that users can directly use them to compare to your left-normalized variant file: ... clinvar_20150330 ....

@cariaso
Copy link
Author

cariaso commented Apr 20, 2021

more notes that seem relevant

https://samtools.github.io/bcftools/bcftools.html#norm

https://github.com/Janchorizo/ban-vcf

https://github.com/freebayes/freebayes suggests others that might be relevant

decompose the output with tools in vcflib, vt, vcf-tools, bcftools, GATK, or Picard.

@cariaso
Copy link
Author

cariaso commented Apr 20, 2021

I'm seeing a broad consensus that left-aligning is the most common for vcfs (confirmed for octopus but see comments on why this isn't ideal at luntergroup/octopus#172 (comment) ) and I suspect that is what both strelka and clinvar do.

@cariaso
Copy link
Author

cariaso commented Apr 23, 2021

https://pypi.org/project/bioutils/
bioutils.normalize – allele normalization (left shuffle, right shuffle, expanded, vcf)

expanded is new to me, and used by https://github.com/ga4gh/vrs-python & https://github.com/ga4gh/vrs the idea being that any ambiguous positioning will be expanded to cover the entire range of ambiguous positions so that comparisons are easier.

ga4gh:VA.rUTGEnc_US6EQSAh64vrXMyDVxq_TWL2
NC_000007.14:g.74053331_74053332del

becomes

ga4gh:VA.cTEuSY_2j4R7UFLTj_Rh10JjVDayk_vr
NC_000007.14:g.74053321_74053359delinsTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT

and

ga4gh:VA.GaUmNc2yxPnonnHIqiWOG1iiPJJKSzB0
NC_000007.14:g.74053321_74053322del

becomes

ga4gh:VA.cTEuSY_2j4R7UFLTj_Rh10JjVDayk_vr
NC_000007.14:g.74053321_74053359delinsTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT

note the difference between
NC_000007.14:g.74053331_74053332del
NC_000007.14:g.74053321_74053322del
ending up with the same expanded form

@kmoad kmoad closed this as completed Mar 9, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants