Use of parentheses for predicted consequences #18

jtdendunnen · 2023-02-09T15:36:40Z

Maintainer

The current HGVS nomenclature recommendations state: predicted consequences, i.e. without experimental evidence (no RNA or protein sequence analysed), should be given in parentheses, e.g. p.(Arg727Ser).

People have proposed to get rid of this rule and use p.Arg727Ser irrespective of whether there is experimental evidence or not. In my opinion, this is not a good idea; the use of "()" is a standard HGVS format to indicate uncertainty. When people do not want to discriminate between predicted and not predicted, they can use Arg727Ser, i.e. without the "p.", which means not following the HGVS recommendations.

ifokkema · 2023-02-14T15:45:48Z

Maintainer

The most important issue for me is that this proposal would create inconsistency and, therefore, will need very strong arguments why the parentheses should be removed for protein descriptions. Parentheses are used throughout the HGVS nomenclature to indicate uncertainty and/or that values are inferred.

On DNA, RNA, and protein level, parentheses indicate that the variant was inferred from another observation, including when the variant was mapped from animal models, e.g., p.(Arg727Ser) when the variant was predicted from a change at the DNA level.
Parentheses surround uncertain positions, e.g., c.(3996_4196)_(5090_5284)del. Note that on the protein level, this is also used, e.g., p.(Ala123_Pro131)insX[4].
Parentheses surround uncertain ranges of lengths, e.g., g.32717298_32717299insN[(80_120)].
In alleles, uncertain configurations (cis or trans) of variants are indicated by putting parentheses around the separator, e.g., c.2376G>C(;)3103del.
In alleles, the uncertainty between a homozygous or hemizygous variant is indicated using parentheses, e.g., c.2376G>C(;)(2376G>C).
There might be more that I forgot to mention.
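
To make the list above concrete, here is a rough Python sketch that labels which role the parentheses play in a description. The regular expressions, the function name `parenthesis_roles`, and the role labels are all made up for illustration. They only cover the example shapes quoted above and are in no way a parser for the full HGVS syntax.

```python
import re

# Heuristic patterns; each one only covers the example shapes listed above.
WHOLE_VARIANT = re.compile(r"^[a-z]\.\([^()]+\)$")      # p.(Arg727Ser)
UNCERTAIN_POSITION = re.compile(                         # c.(3996_4196)_(5090_5284)del
    r"(?<!\[)\([A-Za-z]*[\d?]+_[A-Za-z]*[\d?]+\)"
)
UNCERTAIN_LENGTH = re.compile(r"\[\(\d+_\d+\)\]")        # insN[(80_120)]
UNCERTAIN_ZYGOSITY = re.compile(r"\(;\)\([^()]+\)$")     # c.2376G>C(;)(2376G>C)
UNCERTAIN_PHASE = re.compile(r"\(;\)")                   # c.2376G>C(;)3103del


def parenthesis_roles(description: str) -> set[str]:
    """Return the (hypothetical) role labels for the parentheses found."""
    # Drop an optional reference sequence such as "NM_003002.4:".
    body = description.split(":", 1)[-1]
    roles = set()
    if WHOLE_VARIANT.match(body):
        roles.add("predicted/inferred variant")
    if UNCERTAIN_POSITION.search(body):
        roles.add("uncertain position")
    if UNCERTAIN_LENGTH.search(body):
        roles.add("uncertain length")
    if UNCERTAIN_ZYGOSITY.search(body):
        roles.add("homozygous or hemizygous")
    elif UNCERTAIN_PHASE.search(body):
        roles.add("uncertain cis/trans")
    return roles
```

Even this toy version needs a separate pattern per role, which illustrates how many distinct meanings the same character currently carries.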

Removing the rule to use parentheses on the protein level when the variant has only been detected on the DNA level would create an illogical inconsistency in the nomenclature. It would also take away part of the descriptive power of the nomenclature, which can now indicate on DNA, RNA, and protein level that a change is predicted/inferred and not observed directly. Serious issues will need to be raised to justify this removal.

Issues that were raised, to my awareness:

``They're not used''.

This is factually incorrect, so I'm not sure how this was measured. Also, this argument would mean that all features of the HGVS nomenclature that seemingly aren't used (much?) should be removed. However, this feature is used to a great extent. LOVD, Mutalyzer, and VariantValidator all use parentheses and use them to indicate when protein changes are predictions. On the other hand, ClinVar doesn't seem to value receiving protein descriptions much at all - their submission template doesn't have a field for protein descriptions, only one for alternative descriptions. They actually state providing the protein level is not necessary because they always provide predictions, so we have to assume all protein descriptions in ClinVar are predictions. I was told ClinVar doesn't use parentheses in the generated protein descriptions because submitters weren't providing them. This doesn't surprise me since there isn't a protein change field (sounds like a chicken or egg problem). In my opinion, it would be more logical to notify a few large annotation services that their generated protein descriptions aren't valid HGVS; Ensembl VEP, for example, has in the past responded positively when I highlighted bugs in their generated HGVS descriptions.

``It's a common complaint.''

I'm not sure who complains here or what the issues are that they encounter. Users of Mutalyzer or Variant Validator all get protein descriptions with parentheses, so I assume these are users of other annotation/prediction tools that then end up not having parentheses. Perhaps they don't want to add them manually. In that case, the most logical course of action is to file a bug with their annotation software. A skilled programmer familiar with the annotation software would need only a few minutes to make sure that all predicted output will use parentheses from now on, so this can't be a real issue. If, however, they're running into problems finding data associated with this protein description in databases, several issues could be causing this. See the next point.

``It makes matching in databases hard''.

Let's actually look at that. It only takes a few minutes for a skilled programmer familiar with the software to make sure the matching is performed regardless of using parentheses or not. However, the biggest problem is that the protein change is the least consistent of all descriptions across databases. E.g., ClinVar uses the I1803V format as the primary identifier for protein change, ANNOVAR uses p.I1803V, and Ensembl VEP uses p.Ile1803Val. Also, Ter is often described as * combined with three-letter code amino acids. The parentheses really aren't the problem here. Most databases yield the best results when matching variants on the DNA level, but if this is, for whatever reason, a problem, then databases should standardize their use of protein descriptions in general and/or build converters.
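
As an illustration of the kind of converter meant here, below is a minimal Python sketch that rewrites simple substitutions such as I1803V, p.I1803V, and p.Ile1803Val into one three-letter form, with * rewritten as Ter. The function name `to_three_letter` is hypothetical, only plain substitutions are handled, and parentheses are kept exactly as they were found in the input.

```python
import re

ONE_TO_THREE = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
    "*": "Ter",
}
THREE_LETTER = set(ONE_TO_THREE.values())

# Optional "p.", optional parentheses, then <ref><position><alt>.
SUBSTITUTION = re.compile(
    r"^(?:p\.)?(?P<open>\()?"
    r"(?P<ref>[A-Z][a-z]{2}|[A-Z*])(?P<pos>\d+)(?P<alt>[A-Z][a-z]{2}|[A-Z*])"
    r"(?P<close>\))?$"
)


def _three(code: str) -> str:
    """Map a one-letter code (or *) to three letters; validate the rest."""
    if code in ONE_TO_THREE:
        return ONE_TO_THREE[code]
    if code in THREE_LETTER:
        return code
    raise ValueError(f"unknown amino acid code: {code!r}")


def to_three_letter(description: str) -> str:
    """Rewrite a simple protein substitution in three-letter code."""
    m = SUBSTITUTION.match(description.strip())
    if m is None or bool(m.group("open")) != bool(m.group("close")):
        raise ValueError(f"not a simple substitution: {description!r}")
    core = f"{_three(m.group('ref'))}{m.group('pos')}{_three(m.group('alt'))}"
    return f"p.({core})" if m.group("open") else f"p.{core}"
```

With this, the three spellings of the same change from the examples above end up as the same string, so they can be compared directly.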

In addition to all points above, the proposal to change the HGVS nomenclature and remove the use of parentheses in protein descriptions creates a win/lose situation. There are people, databases, and systems relying on this specific use of parentheses, and this option would be taken away. The simplest solution, costing a mere fraction of the time already spent on this discussion, is to get databases that care about matching to implement a parentheses-agnostic method of matching the input with the database contents. It'll often be only one line of code, and it also solves the same problem on DNA and RNA levels, where parentheses are also used to indicate predictions. This would create a win/win situation. The guidelines will still be consistent between DNA, RNA, and protein level. No time is wasted on changing the guidelines, updating systems (LOVD, Mutalyzer, Variant Validator), and checking and correcting hundreds of thousands of data entries in LOVD. People using and relying on the parentheses can still show the difference between predictions and confirmed variants. And matching no longer relies on parentheses.
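
A sketch of what such parentheses-agnostic matching could look like in Python, assuming the database stores descriptions as plain strings. The names `match_key`, `database`, and `lookup` are hypothetical, and only parentheses that wrap the whole change are removed; parentheses for uncertain positions or uncertain phase are left alone.

```python
import re

# Parentheses around the entire change, with an optional reference prefix.
_WRAPPED = re.compile(r"^((?:[^:]+:)?[cgmnopr]\.)\(([^()]+)\)$")


def match_key(description: str) -> str:
    """Return the description without whole-variant prediction parentheses."""
    return _WRAPPED.sub(r"\1\2", description.strip())


# Toy database: index the stored entries by their parentheses-free key.
database = {
    match_key(entry): entry
    for entry in ["p.(Arg727Ser)", "NM_003002.4:c.274G>T"]
}


def lookup(query: str):
    """Find a stored entry regardless of prediction parentheses."""
    return database.get(match_key(query))
```

The stored entries keep their parentheses, so the predicted/observed distinction is not lost; only the comparison ignores it.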

jfjlaros · 2023-02-15T14:11:14Z

Maintainer

From what I recall from previous conversations, the main argument against the use of parentheses is that they convey no information about variants themselves, but rather about how variants were identified. I can see why people would like to keep these concepts separated, but I can also see why others find it important to keep annotation close by.

Related to the above, the use of parentheses introduces a minor inconvenience in normalisation procedures because annotation needs to be preserved, e.g., normalisation of NM_003002.4:c.(274G>T) and NM_003002.4:c.274G>T give different results. This should not be a major problem for databases though; if variant descriptions are parsed at submission time and stored in a computer readable format, annotation could be extracted and stored separately. Other forms of expressing uncertainty (uncertain positions, unspecified insertions, etc.) may be more challenging, up to the point where such descriptions can only be checked for syntactic correctness.
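
A minimal Python sketch of that idea, under the assumption that only whole-variant parentheses count as annotation. The `StoredVariant` record and both function names are hypothetical; a real system would use a proper HGVS parser rather than a regular expression.

```python
import re
from dataclasses import dataclass

_DESCRIPTION = re.compile(r"^(?P<prefix>(?:[^:]+:)?[cgmnopr]\.)(?P<body>.+)$")
_WRAPPED = re.compile(r"\(([^()]+)\)")


@dataclass(frozen=True)
class StoredVariant:
    prefix: str      # e.g. "NM_003002.4:c."
    change: str      # e.g. "274G>T", always stored without the annotation
    predicted: bool  # True if the submission wrapped the change in parentheses


def parse_submission(text: str) -> StoredVariant:
    """Split a submitted description into the change and its annotation."""
    m = _DESCRIPTION.match(text.strip())
    if m is None:
        raise ValueError(f"unrecognised description: {text!r}")
    wrapped = _WRAPPED.fullmatch(m.group("body"))
    if wrapped:
        return StoredVariant(m.group("prefix"), wrapped.group(1), True)
    return StoredVariant(m.group("prefix"), m.group("body"), False)


def render(variant: StoredVariant) -> str:
    """Rebuild the description, restoring the parentheses if needed."""
    if variant.predicted:
        return f"{variant.prefix}({variant.change})"
    return f"{variant.prefix}{variant.change}"
```

Normalisation can then operate on `prefix` and `change` alone, while `render` puts the annotation back when the description is displayed.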

Concerns about the adoption of parentheses are valid in principle as the amount of information conveyed by them is a function of the consistency of their use. Perhaps database owners and/or curators can estimate to which extent this concern is relevant in practice.

Personally, I have no strong opinion about the use of parentheses. I do have some concerns about the clarity (and perhaps internal consistency) of the relevant guidelines in their current form. I may come back to this later, pending the outcome of this discussion.

murphyte · 2023-02-27T16:22:08Z

The inherent issue I see with the use of parentheses is that the specification infers meaning from the lack of markup. That is, the lack of parentheses means some non-typical analysis (sequencing at the RNA or protein level) was done. However, that can't be distinguished from simple omission, and there's no way to independently determine the method of analysis (other than reading a paper if there is one describing the variant), so we functionally have to resort to considering everything to be predicted, and therefore assume that parentheses should be added for all r. and p. descriptions. To me that means they confer no additional information.

Furthermore, while the original request was to consider the use of parentheses with protein descriptions, I believe the same issue applies with their use in DNA and RNA descriptions. The only difference is that most sequencing is done on DNA, so parentheses aren't called for with g. and c. descriptions, and r. descriptions are rarely used in the literature and submissions.

I also find the definition for parentheses isn't very clear.
from http://varnomen.hgvs.org/recommendations/general/:

“( )” (parentheses) are used to indicate uncertainties and predicted consequences; NC_000023.9:g.(123456_234567)_(345678_456789)del, p.(Ser123Arg)
descriptions should make clear whether the change was experimentally determined or theoretically deduced by giving predicted consequences in parentheses

So there are two separable meanings (uncertainty vs predicted), and the meaning of theoretically deduced isn't explicitly spelled out such that users may not realize how it should apply for g., c., and r. expressions.

Taking a look at how parentheses are treated in Mutalyzer and Variant Validator:

NM_003002.4:c.274G>T
Mutalyzer adds parentheses for r. and p. expressions, fitting the guidelines
Variant Validator doesn't offer an r. expression, but does add parentheses to p. also fitting the guidelines
NM_003002.4:c.(274G>T)
Mutalyzer keeps the parentheses for r. and p., but strips them for mapping to g. which seems off-spec
Variant Validator rejects it as an "uncertain position" which I think comes from multiple usage of parentheses in the guidelines
NM_003002.4:r.274g>u
Mutalyzer doesn't offer any mapping to c. or g.
Variant Validator maps to c. and g. without parentheses, which I believe is off-spec
NP_002993.1:p.Asp92Tyr
Mutalyzer automatically adds parentheses
Variant Validator says it's fine

So both fit spec for p. with the assumption that parentheses should always be present (in which case I'd argue that they're not really adding anything), and the mapping behavior between c., g., and r. isn't always adding parentheses when changing molecule types.

The potential value in knowing if an r. expression is based on RNA sequencing vs inferred from DNA sequencing would arise from cases of RNA editing changing the sequence. Similarly, the potential value for p. expressions would come from RNA editing or rare cases of alternate tRNA usage (e.g. translation of stop codons as selenocysteine or some other amino acid in the case of stop-codon readthrough). Otherwise inferring from DNA is reliable. If the HGVS spec was originally stated the other way, such that addition of some markup meant direct sequencing of the molecule had been done, then it would be useful. But as it stands it seems any logic must assume that everything is from DNA sequencing by default, and lack of parentheses is an error of omission, rendering the markup uninformative.

ifokkema Feb 28, 2023
Maintainer

The inherent issue I see with the use of parentheses is that the specification infers meaning from the lack of markup. That is, the lack of parentheses means some non-typical analysis (sequencing at the RNA or protein level) was done. However, that can't be distinguished from simple omission, (...)

This is a valid point. Although I would argue that the ones omitting the parentheses should simply stop omitting them and start following the recommendations (ClinVar, VEP, ANNOVAR, etc), I do agree it would have been more clear to have a way to indicate whether the variant was observed or inferred by choosing between two different notations of which one is not an omission of something. However, creating a new rule that states that observed variants should then use something other than parentheses to indicate the observed state, would mean that all DNA-level variants observed on DNA level should then also include that new character. And I'm pretty sure this won't happen if certain systems now already don't apply the recommendations fully.

(...) and there's no way to independently determine the method of analysis (other than reading a paper if there is one describing the variant), so we functionally have to resort to considering everything to be predicted, and therefore assume that parentheses should be added for all r. and p. descriptions.

That's not always true. LOVD uses parentheses when the variant is inferred or no parentheses when the variant is observed on the RNA level. ClinVar has hidden the method to provide the protein level in submissions quite well, so I'm not sure if people know how to provide these (I didn't). They actually state providing the protein level is not necessary because they always provide predictions; however, these sometimes make no sense (e.g., the variant is proven to cause splicing issues because the variant has been sequenced on RNA level). Mutalyzer and Variant Validator provide parentheses when the input is DNA, and Variant Validator allows RNA input and then drops the parentheses, following the recommendations.

Furthermore, while the original request was to consider the use of parentheses with protein descriptions, I believe the same issue applies with their use in DNA and RNA descriptions. The only difference is that most sequencing is done on DNA, so parentheses aren't called for with g. and c. descriptions, and r. descriptions are rarely used in the literature and submissions.

c. and g. descriptions will carry parentheses when the description is inferred, e.g., for some functional in-vitro assays or when variants were mapped from animal models. LOVD has plenty of examples.

I also find the definition for parentheses isn't very clear. From http://varnomen.hgvs.org/recommendations/general/:

“( )” (parentheses) are used to indicate uncertainties and predicted consequences; NC_000023.9:g.(123456_234567)_(345678_456789)del, p.(Ser123Arg)

descriptions should make clear whether the change was experimentally determined or theoretically deduced by giving predicted consequences in parentheses

So there are two separable meanings (uncertainty vs predicted), and the meaning of theoretically deduced isn't explicitly spelled out such that users may not realize how it should apply for g., c., and r. expressions.

Jeroen had a similar remark, so this will need to be clarified if it's decided to keep the parentheses.

Taking a look at how parentheses are treated in Mutalyzer and Variant Validator:

NM_003002.4:c.274G>T
Mutalyzer adds parentheses for r. and p. expressions, fitting the guidelines
Variant Validator doesn't offer an r. expression, but does add parentheses to p. also fitting the guidelines

NM_003002.4:c.(274G>T)
Mutalyzer keeps the parentheses for r. and p., but strips them for mapping to g. which seems off-spec
Variant Validator rejects it as an "uncertain position" which I think comes from multiple usage of parentheses in the guidelines

Good catch. I remember discussing this with the VV team before; I believe they plan to correct this.

NM_003002.4:r.274g>u
Mutalyzer doesn't offer any mapping to c. or g.
Variant Validator maps to c. and g. without parentheses, which I believe is off-spec

The lack of parentheses on c. level is a good catch. I guess indeed, it should be included. The lack of parentheses on the protein level is intended, as the recommendations state that the protein change consequence of a variant observed on RNA level does not require parentheses. Jeroen also highlighted that this is not according to the rule that inferred variants should be described using parentheses, but as protein sequencing is almost never performed, the analysis on RNA level is commonly used as evidence for a protein level variant. This is also why the current nomenclature approves this.

NP_002993.1:p.Asp92Tyr
Mutalyzer automatically adds parentheses

That would be incorrect, as the input is without parentheses. If the input is specified to be observed, Mutalyzer should not change it to inferred.

Variant Validator says it's fine

So both fit spec for p. with the assumption that parentheses should always be present (in which case I'd argue that they're not really adding anything), (...)

That's not correct - they add parentheses when they must, i.e., when they're producing predictions.

(...) and the mapping behavior between c., g., and r. isn't always adding parentheses when changing molecule types.

This is indeed something to be looked at.

The potential value in knowing if an r. expression is based on RNA sequencing vs inferred from DNA sequencing would arise from cases of RNA editing changing the sequence.

Actually, by far, the most common issue is splicing effects. Even simple substitutions in the exon are known to be able to cause splicing defects (exon exclusion, intron retention, etc), which can only be observed on RNA or protein level. So every protein description derived from a DNA description is a prediction that is too "careless" to be considered equal to a protein prediction of a variant observed on the RNA level.

Similarly, the potential value for p. expressions would come from RNA editing or rare cases of alternate tRNA usage (e.g. translation of stop codons as selenocysteine or some other amino acid in the case of stop-codon readthrough). Otherwise inferring from DNA is reliable.

From RNA, yes; from DNA, no (see above).

If the HGVS spec was originally stated the other way, such that addition of some markup meant direct sequencing of the molecule had been done, then it would be useful.

I agree - I think it was done this way to save space on all DNA variants found on DNA level. Well, or perhaps because when the recommendations came to exist, they were based on expressions already in use. And I guess that at that time, nobody thought to annotate variant descriptions with additional characters because the variant was actually observed. That likely was considered the standard. But that's just a theory on my side.

But as it stands it seems any logic must assume that everything is from DNA sequencing by default, and lack of parentheses is an error of omission, rendering the markup uninformative.

That depends on the source of the data (see above).

ahwagner · 2023-11-30T22:43:02Z

Maintainer

When this is discussed, I would like us to consider parallelism between the decision here and inferred sequence junctions, as discussed in #15.

heidirehm · 2024-03-02T23:50:20Z

It seems this issue is still not resolved and as such, I would like to request a vote of at least the current HGVS committee, if not engagement of the external community, on the topic. I'd be happy to help with this engagement. However, I would like to clarify the request, which is not to remove parentheses, but to instead create consistency in their use. The current recommendation requires varied use of parentheses depending on whether there is scientific evidence that a protein is produced with the predicted change.

Lack of evidence: c.176C>G p.(Thr59Arg)
Evidence to prove protein effect: c.176C>G p.Thr59Arg
However, this varied use leads to many downstream challenges as listed below and therefore I propose that parentheses should always be used and the c./p. format be consistent with the format used in ClinVar: NM_004004.6(GJB2):c.101T>C (p.Met34Thr).

Challenges with varied parentheses use. Please note I included Terrence Murphy's argument as well.

It is not straightforward to discern when a protein impact prediction has been proven versus when some other outcome is happening, so having only two options (parentheses present or absent) is insufficient: there is often a gradation of evidence, from no knowledge to complete confidence and everything in between. There are also insufficient nomenclature rules to represent all of the possible actual outcomes of various DNA changes in terms of what actually happens to a gene and its protein products. The HGVS nomenclature system is not the best place to convey the complexity of this information. Instead, a better place is the evidence summary that accompanies the classification of a variant, where varying degrees of certainty, and the evidence supporting one’s claim, can be appropriately conveyed.
The current HGVS specification infers meaning from the lack of markup. The lack of parentheses means some non-typical analysis (sequencing at the RNA or protein level) was done. However, that cannot be distinguished from simple omission.
Software used to annotate and validate HGVS descriptions often simply assumes everything is a prediction, conferring no useful information or incorrect information.
The two major databases used for variant classification, ClinVar and gnomAD, do not use parentheses in the manner HGVS recommends. They are not used when representing the c. and p. nomenclature separately and they are used outside of the p. description entirely when the variant’s full nomenclature is represented: NM_004004.6(GJB2):c.101T>C (p.Met34Thr). Parentheses are also rarely used in other databases or publications unless the journal hires staff to specifically police the issue. Clinical labs also generally do not use them in the manner recommended when reporting variants to patients.
Putting a parenthesis between the p. and the amino acid change disrupts the nomenclature and, given the inconsistent use of the parentheses, makes it more difficult to computationally search and find matches. Consistency is needed to improve the standardization of variant nomenclature, critical for scalable genomics.
The most commonly written HGVS nomenclature includes both c. and p. written as “c.176C>G p.Thr59Arg” or “c.176C>G p.(Thr59Arg)” leaving no good separation between the two. This makes it confusing to know when the c. portion stops and then p. continues and many will add a comma, hash, semicolon or other incorrect symbol to try to create better separation, introducing inconsistency and confusion given other recommended uses of these symbols.
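
For reference, the combined format proposed above can be split unambiguously with a single pattern. The Python sketch below is inferred only from the one example NM_004004.6(GJB2):c.101T>C (p.Met34Thr); the pattern and the function name `parse_combined` are assumptions, not a documented ClinVar grammar.

```python
import re

# transcript(gene):c.change, optionally followed by " (p.change)".
COMBINED = re.compile(
    r"^(?P<transcript>[A-Z]{2}_\d+\.\d+)\((?P<gene>[^()]+)\):"
    r"(?P<coding>c\.\S+)"
    r"(?:\s+\((?P<protein>p\.[^()]+)\))?$"
)


def parse_combined(text: str):
    """Split a combined c./p. string into its parts, or return None."""
    m = COMBINED.match(text.strip())
    return m.groupdict() if m else None
```

Note that in this layout the parentheses only delimit the p. part, so they no longer say anything about whether the protein change was observed or predicted.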

ahwagner Aug 19, 2024
Maintainer

Just to avoid confusion; the original request was discussed and voted on during the meeting of June 12th, 2023.

I was present for that meeting, but don't recall a vote on this issue at that time or in any subsequent meeting I participated in. Looking at the minutes from 2023-06-12, it appears that the discussion ran over and we moved on without conclusion (which matches my recollection of that conversation).

Relevant to this, I think it is important that the HVNC formalizes the taking and recording of votes on HGVS technical issues. Discussion #179 has an associated open discussion thread specifically on voting to help shape this.

ifokkema Aug 21, 2024
Maintainer

I recommend we focus on this point in our discussion of this issue. We, as a committee, should first definitively decide whether conveying experimental certainty belongs in the HGVS nomenclature. I agree that we will want to be consistent across the listed scenarios.

IMO, the discussion should include:

If so, should the current nomenclature be made more consistent? (i.e., parentheses vs question marks)
If not, is the removal of this option justified before an alternative is present, as it will create an immense amount of issues?

I was present for that meeting, but don't recall a vote on this issue at that time or in any subsequent meeting I participated in. Looking at the minutes from 2023-06-12, it appears that the discussion ran over and we moved on without conclusion (which matches my recollection of that conversation).

Relevant to this, I think it is important that the HVNC formalizes the taking and recording of votes on HGVS technical issues. Discussion #179 has an associated open discussion thread specifically on voting to help shape this.

We should perhaps record and store meetings long-term and/or improve the notes made; I can't even find minutes for the August 7th and October 2nd meetings from 2023, even though my time tracker indicates that I worked on the minutes after the meetings. So it looks like they have gotten lost somewhere.

Either way, although the minutes indeed show internal disagreement, they also state that it was decided that Johan would draft a group response. So even if, as you say, there was no voting done, we apparently agreed that Johan would draft our response. That does not seem to make sense to me if we moved on without coming to a conclusion. 🤔
That draft only came Nov 27, 2023, and contains "the [HVNC] has finally come to a decision (...). The HVNC decision is to not drop this recommendation." Members had until Dec 11th to send in their comments, but, according to Johan, none came. The December 2023 minutes also contain no pushback on this. To me, all of this supports that we, in fact, did come to a conclusion.

ahwagner Aug 21, 2024
Maintainer

I recommend we focus on this point in our discussion of this issue. We, as a committee, should first definitively decide whether conveying experimental certainty belongs in the HGVS nomenclature. I agree that we will want to be consistent across the listed scenarios.

IMO, the discussion should include:

If so, should the current nomenclature be made more consistent? (i.e., parentheses vs question marks)

If not, is the removal of this option justified before an alternative is present, as it will create an immense amount of issues?

Good points, I agree, let's include.

I was present for that meeting, but don't recall a vote on this issue at that time or in any subsequent meeting I participated in. Looking at the minutes from 2023-06-12, it appears that the discussion ran over and we moved on without conclusion (which matches my recollection of that conversation).
Relevant to this, I think it is important that the HVNC formalizes the taking and recording of votes on HGVS technical issues. Discussion #179 has an associated open discussion thread specifically on voting to help shape this.

We should perhaps record and store meetings long-term and/or improve the notes made; I can't even find minutes for the August 7th and October 2nd meetings from 2023, even though my time tracker indicates that I worked on the minutes after the meetings. So it looks like they have gotten lost somewhere.

I enthusiastically support this idea! I would appreciate your addition to the proposals in #179, I hope that we can collectively shape our preferred practices for our decision making processes there.

Either way, although the minutes indeed show internal disagreement, they also state that it was decided that Johan would draft a group response. So even if, as you say, there was no voting done, we apparently agreed that Johan would draft our response. That does not seem to make sense to me if we moved on without coming to a conclusion. 🤔

That draft only came Nov 27, 2023, and contains "the [HVNC] has finally come to a decision (...). The HVNC decision is to not drop this recommendation." Members had until Dec 11th to send in their comments, but, according to Johan, none came. The December 2023 minutes also contain no pushback on this. To me, all of this supports that we, in fact, did come to a conclusion.

Thanks for digging into this. As you say, the minutes showed internal disagreement, right up until Johan suggested we move on for time. From what I recall, Johan had suggested we work it out asynchronously in writing. How that morphed between that point and the late November email draft indicating a consensus opinion is unclear to me, which underscores the suggestion to better document meetings and decisions. I missed the meetings indicated due to medical leave so not to say consensus wasn't reached, only that I am unaware of if, when, and how a consensus opinion was formed.

Clearly, however, there is more to discuss, and hopefully we can put this discussion to rest at our October meeting.

jfjlaros Aug 22, 2024
Maintainer

For the record, I cannot recall a vote on this subject either.

ifokkema Sep 18, 2024
Maintainer

IMO, the discussion should include:

If so, should the current nomenclature be made more consistent? (i.e., parentheses vs question marks)

If not, is the removal of this option justified before an alternative is present, as it will create an immense amount of issues?

Good points, I agree, let's include.

I wanted to add this to the latest agenda, but I have no edit rights. Related to this, I suggest:

Providing some form of edit access (even just for suggesting stuff is fine) for all HVNC members.
Using the previous agenda as the template for the agenda for the next meeting. Obviously, subjects fully discussed can then be removed from the new agenda, but copying the whole previous agenda makes it unlikely we forget stuff that was left undiscussed the last time or what we wrote down we should discuss the next time, as well as the tasks from the last meeting. I feel we often forget to check the previous agenda, and this would make it a no-brainer.

We should perhaps record and store meetings long-term and/or improve the notes made; I can't even find minutes for the August 7th and October 2nd meetings from 2023, even though my time tracker indicates that I worked on the minutes after the meetings. So it looks like they have gotten lost somewhere.

I enthusiastically support this idea! I would appreciate your addition to the proposals in #179, I hope that we can collectively shape our preferred practices for our decision making processes there.

Thanks for that! I left some comments there. It would be good to get that all formalized and well-documented, especially considering we can't seem to remember exactly what happened, and minutes were even lost. I look forward to closing and/or properly documenting the open issues and discussions that are now floating around without decisions being made 😅

Either way, although the minutes indeed show internal disagreement, they also state that it was decided that Johan would draft a group response. So even if, as you say, there was no voting done, we apparently agreed that Johan would draft our response. That does not seem to make sense to me if we moved on without coming to a conclusion. 🤔
That draft only came Nov 27, 2023, and contains "the [HVNC] has finally come to a decision (...). The HVNC decision is to not drop this recommendation." Members had until Dec 11th to send in their comments, but, according to Johan, none came. The December 2023 minutes also contain no pushback on this. To me, all of this supports that we, in fact, did come to a conclusion.

Thanks for digging into this. As you say, the minutes showed internal disagreement, right up until Johan suggested we move on for time. From what I recall, Johan had suggested we work it out asynchronously in writing. How that morphed between that point and the late November email draft indicating a consensus opinion is unclear to me, which underscores the suggestion to better document meetings and decisions. I missed the meetings indicated due to medical leave so not to say consensus wasn't reached, only that I am unaware of if, when, and how a consensus opinion was formed.

Indeed, we should obviously be able to reproduce what was decided and when instead of the guesswork that's now involved. I really like your idea of using Google Forms for voting. We should perhaps even do this for all decisions made unless the decision is unanimous and won't be "challenged" later. This wouldn't necessarily require much overhead; we can create a generic form with one single input field and a yes/no/abstain button. For each decision, we can then agree on what goes into the input field so we can easily group the responses in that single form we use throughout all meetings. But I'm getting ahead of myself...

jfjlaros · 2024-07-03T19:15:18Z

jfjlaros
Jul 3, 2024
Maintainer

My few cents.

Storing this information in free text fields without proper standards is not a solution [...]

To me, this sounds like an argument for a separate column to explicitly indicate prediction status, instead of having two independent types of information in one field.

Unfortunately, the HGVS nomenclature allows multiple descriptions for the same sequence on the protein level.

Both this and the converse are true (different effects may share the same description). I would be in favour of solving these issues (i.e., a variant description should be lossless and only depend on the reference and modified sequence). The possibility of finding DNA or RNA variants that share the same (predicted) effect would be a great asset.

1 reply

ifokkema Jul 3, 2024
Maintainer

My few cents.

Storing this information in free text fields without proper standards is not a solution [...]

To me, this sounds like an argument for a separate column to explicitly indicate prediction status, instead of having two independent types of information in one field.

To function as an alternative solution, this would need to be standardized (so we all can/will use it without leaving us with the same situation we're already in), and usable in databases as well as journals (free text). The current solution allows for just that.

Unfortunately, the HGVS nomenclature allows multiple descriptions for the same sequence on the protein level.

Both this and the converse are true (different effects may share the same description). I would be in favour of solving these issues (i.e., a variant description should be lossless and only depend on the reference and modified sequence). The possibility of finding DNA or RNA variants that share the same (predicted) effect would be a great asset.

Do you have an example? I'm not sure what you mean.

jfjlaros · 2024-07-06T10:37:06Z

jfjlaros
Jul 6, 2024
Maintainer

Do you have an example?

Variants NM_003002.4:c.39_40insATGGG and NM_003002.4:c.39_40insATGCC lead to different protein sequences, but their protein description is p.(Leu14Metfs*3) in both cases.

A description is lossless when it contains all the information needed to reconstruct the modified (or observed) sequence. We could describe the effect of the first variant above as a delins to realise this. E.g., p.(Leu14_Leu159delinsMetGly).

Variants NM_003002.4:c.37del and NM_003002.4:c.36_37insCCCTAG lead to protein descriptions p.(Ala13Profs*2) and p.(Ala13_Leu159delinsPro) respectively, while their protein sequences are identical. A database capable of linking these two variants would be very useful I think.
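
As a rough illustration of the ambiguity described above, here is a minimal Python sketch. It does not use the real NM_003002.4 sequences: the reference and observed protein sequences are made-up toy strings, and the function names (`frameshift_description`, `delins_description`) are hypothetical helpers, not part of any HGVS tool. It only shows why the short "fs" form can be identical for two different proteins while a delins form keeps them apart.

```python
# Toy sketch: the short frameshift form is lossy, a delins form is not.
# All sequences below are invented for illustration only.

AA3 = {
    "A": "Ala", "C": "Cys", "D": "Asp", "E": "Glu", "F": "Phe",
    "G": "Gly", "H": "His", "I": "Ile", "K": "Lys", "L": "Leu",
    "M": "Met", "N": "Asn", "P": "Pro", "Q": "Gln", "R": "Arg",
    "S": "Ser", "T": "Thr", "V": "Val", "W": "Trp", "Y": "Tyr",
}


def first_difference(ref: str, obs: str) -> int:
    """Return the 0-based index of the first residue that differs."""
    for i, (r, o) in enumerate(zip(ref, obs)):
        if r != o:
            return i
    raise ValueError("no differing residue found in the overlapping part")


def frameshift_description(ref: str, obs: str) -> str:
    """Short form: first changed residue plus distance to the new stop."""
    i = first_difference(ref, obs)
    # The first changed residue counts as 1; the stop follows the last
    # residue of the observed sequence.
    stop_offset = len(obs) - i + 1
    return f"p.({AA3[ref[i]]}{i + 1}{AA3[obs[i]]}fs*{stop_offset})"


def delins_description(ref: str, obs: str) -> str:
    """Lossless form: everything from the first change to the end is replaced."""
    i = first_difference(ref, obs)
    inserted = "".join(AA3[a] for a in obs[i:])
    return f"p.({AA3[ref[i]]}{i + 1}_{AA3[ref[-1]]}{len(ref)}delins{inserted})"


REF = "MAVLKRTSGW"   # hypothetical reference protein (stop not shown)
OBS_A = "MAVMG"      # hypothetical product of one frameshift
OBS_B = "MAVMP"      # hypothetical product of a different frameshift

print(frameshift_description(REF, OBS_A))  # p.(Leu4Metfs*3)
print(frameshift_description(REF, OBS_B))  # p.(Leu4Metfs*3)
print(delins_description(REF, OBS_A))      # p.(Leu4_Trp10delinsMetGly)
print(delins_description(REF, OBS_B))      # p.(Leu4_Trp10delinsMetPro)
```

The sketch ignores extensions, start-codon changes, and the case where the observed sequence is a plain truncation of the reference; it is only meant to make the "lossless" point concrete.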

1 reply

ifokkema Jul 9, 2024
Maintainer

Ah, frameshift descriptions. Yes, these are, indeed, ambiguous.
Your last example of multiple protein descriptions for the same molecule was exactly the point that I was trying to make.

heidirehm · 2024-08-05T12:47:05Z

heidirehm
Aug 5, 2024

Point three ("software used to annotate and validate HGVS validation software often simply assumes everything is a prediction, conferring no useful information or incorrect information") — Sorry, I don't understand what is meant here.

Annotation tools (e.g. VEP) just spit out HGVS nomenclature with or without parentheses per command line, not by knowing which is experimentally correct. So these annotations never convey actual knowledge. For example, if you examine VEP documentation, here: https://useast.ensembl.org/info/docs/tools/vep/script/vep_options.html the tool simply forces all p. to have parentheses. Force --hgvs to return the HGVSp notation in predicted format. For example, ENSP00000233741.4:p.Thr367AsnfsTer13 will be returned as ENSP00000233741.4:p.(Thr367AsnfsTer13).

Point four ("the two major databases used for variant classification (...) do not use parentheses") is not an argument to then remove things from the nomenclature. The reasons why these two databases chose not to adhere to the standards are important. I don't actually know what reasons they had. Is that documented somewhere?

This is a valid question but the answer is that it was based on extensive input from the community. My first ClinGen grant funded many focus groups and efforts to work with ClinVar to provide guidance to how ClinVar represented information. For gnomAD, we also routinely interact with the community and perform user input surveys to gather input on how people want information provided. There is overwhelming objection to the parentheses usage in the HGVS rules.

Point five ("makes it more difficult to computationally search and find matches given the inconsistent use of the parentheses") is not a valid point. As in, it's like blowing up your house to kill the spider sitting on the wall. Sure, it "solves" the problem... but it doesn't, actually, and creates many more problems in the process. If the parentheses were an actual problem in this situation, it would be fixed in less time than it took me to write this post, and it wouldn't actually solve the problem that's mentioned. See further below for more on this.

I'm not sure I follow. You'd have to deploy custom search algorithms for every search tool used to find information on variation including common ones (PubMed, Google Scholar, etc) to thousands that are more obscure.

Storing this information in free text fields without proper standards is not a solution [...]

The truth is that both free text information and structured information are needed. For example, for all of ClinGen's expert panels, we provide both structured and free text evidence. If you follow this link: https://erepo.clinicalgenome.org/evrepo/ui/classification/e8b1ce82-bc97-4baf-b442-294c2a6849cd you will see the structured evidence codes: PM3_Very Strong PS4 PP1 PP3 are listed as applicable and all others listed as not applicable. These codes are useful for AI/ML-based uses of the aggregate data in ClinVar, but don't begin to convey the nuances of the actual complex data that has been assembled on this variant and the rationale for how the committee classified the variant with respect to pathogenicity. You'll also see that many labs that submit to ClinVar take a similar approach in applying both structured evidence codes as well as free text. But in focus groups led by ClinVar, the community overwhelmingly highlighted the free text evidence descriptions submitted by the labs as the most useful information that ClinVar provides, far more useful than the structured evidence codes. No one ever relies on the HGVS nomenclature parentheses to convey information given how little it is used and when used, is largely based on automated application, not actually figuring out the experimental data. In the new ACMG guidelines we are currently working on, we will have more codes that will differentiate both predicted impacts from splicing prediction algorithms like SpliceAI, Pangolin, etc as well as missense in silico predictors and then separate codes for conveying many different types of experimentally derived data. All of these codes will be structured and have point values along with each code to convey the strength of evidence in addition to the type of evidence. This is where this type of information should be collected, not in the nomenclature.

3 replies

ifokkema Aug 21, 2024
Maintainer

Point three ("software used to annotate and validate HGVS validation software often simply assumes everything is a prediction, conferring no useful information or incorrect information") — Sorry, I don't understand what is meant here.

Annotation tools (e.g. VEP) just spit out HGVS nomenclature with or without parentheses per command line, not by knowing which is experimentally correct. So these annotations never convey actual knowledge. For example, if you examine VEP documentation, here: https://useast.ensembl.org/info/docs/tools/vep/script/vep_options.html the tool simply forces all p. to have parentheses. Force --hgvs to return the HGVSp notation in predicted format. For example, ENSP00000233741.4:p.Thr367AsnfsTer13 will be returned as ENSP00000233741.4:p.(Thr367AsnfsTer13).

I don't understand the problem — VEP would have to check, e.g., LOVD, to see whether protein changes have been experimentally verified. Also, there is a possibility that one patient's results are not necessarily another patient's results. Anyway, since VEP only creates in silico predictions, parentheses should be used according to the HGVS Nomenclature. As such, the parentheses clearly do convey actual knowledge; they convey that VEP has simply created predictions and that none of these predictions are validated experimentally.

Why not labeling predictions as such is a problem can be seen, e.g., here: https://groups.google.com/g/hgvs-nomenclature/c/xHKR4xwu-Zw/m/k4uSxcSHBwAJ
An exon 15 deletion in MSH2 is described as "synonymous" by a user simply because ClinVar calls it synonymous. ClinVar calls it synonymous because they only generate predictions, without labeling them as such, not by using parentheses and not by using some other method (in a structured or free-text manner). That is a problem that could (and should, if you ask me) be fixed quickly, as it's confusing users (or worse).

Point four ("the two major databases used for variant classification (...) do not use parentheses") is not an argument to then remove things from the nomenclature. The reasons why these two databases chose not to adhere to the standards are important. I don't actually know what reasons they had. Is that documented somewhere?

This is a valid question but the answer is that it was based on extensive input from the community. My first ClinGen grant funded many focus groups and efforts to work with ClinVar to provide guidance to how ClinVar represented information. For gnomAD, we also routinely interact with the community and perform user input surveys to gather input on how people want information provided. There is overwhelming objection to the parentheses usage in the HGVS rules.

The value of this "overwhelming objection" fully depends on the reasoning behind it and the perceived advantages and disadvantages of this option. E.g., if somebody thinks that you can't find variants using Google because of the parentheses, and this will be solved by adapting the nomenclature, and they don't think there are any negative side effects, then what value would their objection have? It's based on incorrect information and lack of experience with actually using the difference (more on that near the end below).

Point five ("makes it more difficult to computationally search and find matches given the inconsistent use of the parentheses") is not a valid point. As in, it's like blowing up your house to kill the spider sitting on the wall. Sure, it "solves" the problem... but it doesn't, actually, and creates many more problems in the process. If the parentheses were an actual problem in this situation, it would be fixed in less time than it took me to write this post, and it wouldn't actually solve the problem that's mentioned. See further below for more on this.

I'm not sure I follow. You'd have to deploy custom search algorithms for every search tool used to find information on variation including common ones (PubMed, Google Scholar, etc) to thousands that are more obscure.

None of these are able to understand HGVS expressions and, therefore, they can't see the difference between, e.g., p.Thr367AsnfsTer13, p.Thr367Asnfs*13, p.Thr367Asnfs, p.T367Nfs*13, p.T367Nfs, a delins format that describes the same protein change, or ClinVar's T367fs. I, therefore, argue that databases should be used if people are interested in finding variant observations. Parentheses are not the problem here, or well, they are such a small part that removing them fixes nothing (Google Search ignores the parentheses and shows results both with and without parentheses). Ambiguity within the HGVS recommendations for protein changes, the myriad of software generating incorrect HGVS expressions (ClinVar included), and the fact that a string search is performed but a search for structured information is expected, are all problems here.
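
To make the point about string search versus structured search concrete, here is a minimal Python sketch of what a database could do before matching. The function name `normalize_protein` is hypothetical, and the rules are deliberately simplistic assumptions (drop an optional reference and the "p." prefix, record whole-description parentheses as a flag, write "Ter" as "*", expand one-letter codes). It is not a validator and not how any existing tool works.

```python
import re

AA1_TO_3 = {
    "A": "Ala", "C": "Cys", "D": "Asp", "E": "Glu", "F": "Phe",
    "G": "Gly", "H": "His", "I": "Ile", "K": "Lys", "L": "Leu",
    "M": "Met", "N": "Asn", "P": "Pro", "Q": "Gln", "R": "Arg",
    "S": "Ser", "T": "Thr", "V": "Val", "W": "Trp", "Y": "Tyr",
}

# Known three-letter codes are tried first, so that e.g. "Nfs" is not
# mistaken for a three-letter residue; then single capitals, then anything.
_TOKEN = re.compile(
    "|".join(sorted(set(AA1_TO_3.values()) | {"Ter"})) + r"|[A-Z]|."
)


def _canonical_token(match: re.Match) -> str:
    token = match.group(0)
    if token == "Ter":
        return "*"
    # One-letter codes are expanded; everything else is kept unchanged.
    return AA1_TO_3.get(token, token)


def normalize_protein(description: str) -> tuple[str, bool]:
    """Return (canonical change, predicted flag) for a protein description."""
    body = description.strip()
    if ":" in body:                      # drop a reference such as "ENSP...:"
        body = body.split(":", 1)[1]
    if body.startswith("p."):
        body = body[2:]
    # Only parentheses around the whole description mark a prediction here.
    predicted = body.startswith("(") and body.endswith(")")
    if predicted:
        body = body[1:-1]
    return _TOKEN.sub(_canonical_token, body), predicted


for text in ("p.Thr367AsnfsTer13", "p.(Thr367Asnfs*13)", "p.T367Nfs*13"):
    print(text, "->", normalize_protein(text))
```

All three inputs reduce to the same key, `Thr367Asnfs*13`, and the parentheses survive as a separate flag instead of being lost. Forms such as p.Thr367Asnfs or T367fs still produce different keys, because they carry less information, which is the part that removing parentheses would not fix.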

Storing this information in free text fields without proper standards is not a solution [...]

The truth is that both free text information and structured information are needed. For example, for all of ClinGen's expert panels, we provide both structured and free text evidence. If you follow this link: https://erepo.clinicalgenome.org/evrepo/ui/classification/e8b1ce82-bc97-4baf-b442-294c2a6849cd you will see the structured evidence codes: PM3_Very Strong PS4 PP1 PP3 are listed as applicable and all others listed as not applicable. These codes are useful for AI/ML-based uses of the aggregate data in ClinVar, but don't begin to convey the nuances of the actual complex data that has been assembled on this variant and the rationale for how the committee classified the variant with respect to pathogenicity. You'll also see that many labs that submit to ClinVar take a similar approach in applying both structured evidence codes as well as free text. But in focus groups led by ClinVar, the community overwhelmingly highlighted the free text evidence descriptions submitted by the labs as the most useful information that ClinVar provides, far more useful than the structured evidence codes.

Of course, and therefore, they should both be available. However, you assume here that they should also both be stored. But that depends on what information is conveyed. Further details of structured information should, of course, be stored separately, as the structured information does not explain everything. But a free-text expression of structured information should not: not only is it against common database normalization practices, but it also prevents solving the problem at hand. For example, it makes no sense to store both the "US" country code and "United States of America" as human-readable text. They are the same. The system will store "US", but can show "United States of America" (or both) to the human reader. Currently, the HGVS nomenclature is, in reality, both structured information and free-text. That is now causing this discussion. That's also why the common search engines you mentioned are all unable to reliably do 1:1 matching (I'm focusing on protein changes here). You seek to adapt the human-readable text, but in reality, this removes structured information (that the variant description is a prediction). You are fully right in saying that what kind of prediction isn't stored, and this can (or should) be improved upon. You're also right that in a large number of cases, the lack of parentheses doesn't indicate that the variant is not a prediction. This fully depends on the domain, though. Within LOVD, the lack of parentheses means the data is not a prediction. In ClinVar, everything is a prediction but isn't indicated as such. In many papers, it's unknown. However, none of this is solved by dropping the ability to store this structured information. That would just cause an immense amount of issues while not actually solving the issue that you can't match 1:1. That issue remains.
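
The country-code analogy can be sketched in a few lines of Python. This is only an illustration under assumptions: the class name `ProteinConsequence` and its fields are hypothetical and do not describe LOVD's, ClinVar's, or anyone else's actual schema. The idea is that the system stores the change and the prediction status as separate structured values, and renders the human-readable HGVS string from them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProteinConsequence:
    """Hypothetical structured record for a protein-level consequence."""
    change: str       # e.g. "Met34Thr", stored without "p." or parentheses
    predicted: bool   # True when no RNA or protein was analysed

    def hgvs(self) -> str:
        """Render the human-readable form from the stored values."""
        return f"p.({self.change})" if self.predicted else f"p.{self.change}"

    def same_change(self, other: "ProteinConsequence") -> bool:
        """Match on the change itself, regardless of prediction status."""
        return self.change == other.change


inferred = ProteinConsequence("Met34Thr", predicted=True)
observed = ProteinConsequence("Met34Thr", predicted=False)

print(inferred.hgvs())                 # p.(Met34Thr)
print(observed.hgvs())                 # p.Met34Thr
print(inferred.same_change(observed))  # True
```

With this kind of storage, 1:1 matching works on `change`, while the distinction between predicted and observed stays available for those who need it.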

No one ever relies on the HGVS nomenclature parentheses to convey information given how little it is used and when used, is largely based on automated application, not actually figuring out the experimental data.

I don't understand the issue with the automated application of parentheses — all in silico predicted protein descriptions should use parentheses. That's how they indicate they're predictions.

But in order to keep this conversation scientific, I suggest not using terms like "no one", "everybody", "always", "(n)ever", etc. They are incorrect and misleading. Even "little" is misleading, as it's extremely biased and domain-specific. Please note that we are from different worlds; not only geographically, but also scientifically. Both our opinions are colored by our surroundings. I'm surrounded by people who do use parentheses for predictions, who do understand what the lack of parentheses means, and who do actively seek RNA-validated data, and, therefore, highly value exactly that data in LOVD that will become rather useless should the ability to indicate the difference between predictions and validated data be removed. Since ClinVar doesn't have dedicated fields for RNA or protein changes and even discourages the submission of protein changes in the documentation, I can understand that in the ClinVar ecosystem, there is no value to something that doesn't exist there. However, please respect those who are making the difference between predicted and lab-validated data, sometimes even for multiple decades already. You don't have to understand or agree with them, but denying their existence does not help this conversation.

In the new ACMG guidelines we are currently working on, we will have more codes that will differentiate both predicted impacts from splicing prediction algorithms like SpliceAI, Pangolin, etc as well as missense in silico predictors and then separate codes for conveying many different types of experimentally derived data. All of these codes will be structured and have point values along with each code to convey the strength of evidence in addition to the type of evidence. This is where this type of information should be collected, not in the nomenclature.

That sounds like a good solution to me that would (as I previously suggested) create a win-win situation; is there a release date yet? That would mean that we could stop spending time on long discussions but, instead, wait for that to come out, implement it so we have a proper solution, and then prepare the nomenclature changes.

heidirehm Aug 22, 2024

The basic framework is already finished, though it won't be published until 2025. We are just finessing point values, code labels, etc. and piloting with labs. It's been presented at ACMG, ESHG, and ASHG. So for the purposes of deciding what systems the evidence should be stored in, we are far enough along to make that decision. Also, with respect to your other comments, we clearly disagree and I am also generalizing the feedback I have received over the last 2 decades. But this is not a decision that you and I should make, nor Johan alone, nor any single person. It should be informed by the broader community, which is why I originally asked for the committee to discuss this as a group and for a survey of the broader community to be done. I am still asking for that.

ifokkema Sep 18, 2024
Maintainer

The basic framework is already finished, though it won't be published until 2025. We are just finessing point values, code labels, etc. and piloting with labs. It's been presented at ACMG, ESHG, and ASHG. So for the purposes of deciding what systems the evidence should be stored in, we are far enough along to make that decision.

Good! However, knowing when it will be released will help the discussion, as knowing when an alternative is available can help decide whether to move forward with updating the recommendations or not.

Also, with respect to your other comments, we clearly disagree (...)

About what part? I provided evidence for some of my points; if there is counterevidence or I overlooked something, I would like to know, of course. If there are some parts that we do agree on, that would be good to know, too (e.g., saving information on whether something is predicted in a structured way). Having some kind of common ground could bring us closer to a solution.

jtdendunnen · 2024-10-03T08:28:52Z

jtdendunnen
Oct 3, 2024
Maintainer Author

Heidi suggests: I propose that parentheses should always be used and the c./p. format be consistent with the format used in ClinVar: NM_004004.6(GJB2):c.101T>C (p.Met34Thr).

1. Please note that "NM_004004.6(GJB2):c.101T>C" does not follow HGVS nomenclature; "NM_004004.6:c.101T>C" does. HGVS nomenclature does not allow the use of gene symbols in variant descriptions.
2. To make this complete, RNA is missing. Would this be NM_004004.6:c.101T>C (r.101u>c p.Met34Thr), meaning both "r.101u>c" and "p.Met34Thr" are predicted consequences? Current nomenclature would be NM_004004.6:c.101T>C r.(101u>c) p.(Met34Thr).
3. When RNA has been sequenced (RT-PCR) and the presence of r.101u>c detected, how would this be indicated using the format Heidi suggests? For this example it might not be so relevant, but it will be relevant for variants affecting splicing, where I need to be able to distinguish whether RNA was analysed or not.

1 reply

heidirehm Oct 19, 2024

@jtdendunnen Yes, I'm aware that the gene symbol is not part of the HGVS nomenclature, but others have proposed that it would make the variant recognizable by showing which gene it's in, since the transcript ID is not human recognizable. This was raised by Ada Hamosh. On point 3, to convey evidence about a variant, I suggest using the ACMG/AMP guidelines. Currently, there are 26 codes, each with strength modifiers, to convey different types of evidence. In the new framework highlighted here, https://tinyurl.com/ACMGv4, there will be even more detail, including results from RNA sequencing and the evidence obtained to understand the variant impact. Trying to include the enormous array of variant evidence in the name is untenable.

jtdendunnen · 2024-11-25T16:23:34Z

jtdendunnen
Nov 25, 2024
Maintainer Author

Regarding the first issue, simply use "GJB2 NM_004004.6:c.101T>C": you list the gene (for human recognition) and the variant description following the HGVS nomenclature standard.

Regarding point 3, we will not agree. ACMG covers variant classification and HGVS nomenclature covers variant description, and the HGVS nomenclature states that predicted descriptions should be given using parentheses.

1 reply

heidirehm Nov 25, 2024

When you say "we will not agree", who are you referring to? Has this been discussed and voted on by the HGVS committee? Has there been community engagement/input?
