Explicit support for global terminal modifications #6

Open
RalfG opened this issue Mar 18, 2024 · 6 comments

Comments

@RalfG
Collaborator

RalfG commented Mar 18, 2024

Fixed modifications, such as carbamidomethylation of C, can be written as a global modification (section 4.6.2). For instance:

<[Carbamidomethyl]@C>ATPEILTCNSIGCLK

However, it is not explicitly stated whether global terminal modifications are supported, and if so, which "target tags" should be used. I would use this in the case of isobaric labeling modifications. For instance:

<[TMT6plex]@K,N-term>ATPEILTCNSIGCLK

This would be equivalent to:

[TMT6plex]-ATPEILTCNSIGCLK[TMT6plex]

This would require a definition of the tags to be used for terminal modifications, for example N-term and C-term.
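
To make the intended equivalence concrete, here is a minimal Python sketch that expands a single global rule into explicit per-position modifications. It assumes the proposed (not yet official) `N-term` and `C-term` target tags, handles only one rule in front of a plain, otherwise unmodified sequence, and matches residues case-sensitively. `GLOBAL_RULE` and `expand_global_rule` are hypothetical names, not part of any existing library.

```python
import re

# Hypothetical pattern for one global rule in front of a plain sequence,
# e.g. "<[TMT6plex]@K,N-term>ATPEILTCNSIGCLK".
GLOBAL_RULE = re.compile(
    r"^<\[(?P<mod>[^\]]+)\]@(?P<targets>[^>]+)>(?P<seq>[A-Za-z]+)$"
)


def expand_global_rule(proforma: str) -> str:
    """Rewrite one global rule as explicit per-position modifications."""
    match = GLOBAL_RULE.match(proforma)
    if match is None:
        raise ValueError(f"unsupported input: {proforma!r}")
    mod = match["mod"]
    targets = match["targets"].split(",")
    seq = match["seq"]

    # Single-letter targets are residues; "N-term"/"C-term" are the
    # terminal tags proposed in this issue (an assumption, not spec text).
    residues = {t for t in targets if len(t) == 1}
    body = "".join(f"{aa}[{mod}]" if aa in residues else aa for aa in seq)
    prefix = f"[{mod}]-" if "N-term" in targets else ""
    suffix = f"-[{mod}]" if "C-term" in targets else ""
    return f"{prefix}{body}{suffix}"


print(expand_global_rule("<[TMT6plex]@K,N-term>ATPEILTCNSIGCLK"))
```

With the TMT example above, this prints the explicit form shown earlier in this comment.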

RalfG added a commit to compomics/psm_utils that referenced this issue Mar 18, 2024
For now using a workaround while waiting for official support and an implementation in Pyteomics. See HUPO-PSI/ProForma#6
@edeutsch

edeutsch commented Apr 5, 2024

This is currently legal:
[TMT6plex]-ATPEILTCNSIGCLK[TMT6plex]
<[TMT6plex]@k>[TMT6plex]-ATPEILTCNSIGCLK

Options to extend:
<[TMT6plex]@k,N-term>ATPEILTCNSIGCLK (use 'N-term' and 'C-term')
<[TMT6plex]@k,n>ATPEILTCNSIGCLK (use lower case n and c)
<[TMT6plex]@k,^>ATPEILTCNSIGCLK (use ^ for N-term and $ for C-term)

ProForma currently allows amino acids to be lower case, so the second option is not a good idea.
Seems like the preferred format would be:
<[TMT6plex]@k,N-term>ATPEILTCNSIGCLK

If we wanted to support N-term amino acids:
<[TMT6plex]@k,N-term,N-term A>ATPEILTCASIGCLK
<[TMT6plex]@k,N-term,N-term(A)>ATPEILTCASIGCLK

<[TMT6plex]@k,N-term,N-term:AS>ATPEILTCASIGCLK
<[TMT6plex]@k,N-term,N-term(AS)>ATPEILTCASIGCLK

After discussion, we ended up with:
<[TMT6plex]@k,N-term>ATPEILTCASIGCLK
<[TMT6plex]@k,N-term:A,N-term:S>ATPEILD[U:Cation:Fe[III]]CASIGCLK
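
As a rough illustration of the target syntax the discussion settled on, the following Python sketch splits the part after `@` into (terminal, residue) pairs. It assumes the `N-term`/`C-term` tags and the `:` separator from the examples above; `Target`, `TERMINALS` and `parse_targets` are hypothetical names, and this is not how any particular parser implements it.

```python
from typing import List, NamedTuple, Optional

# Terminal tags as proposed in this thread (an assumption, not spec text).
TERMINALS = ("N-term", "C-term")


class Target(NamedTuple):
    terminal: Optional[str]  # "N-term", "C-term", or None for any position
    residue: Optional[str]   # single amino acid letter, or None for any residue


def parse_targets(spec: str) -> List[Target]:
    """Parse a target list such as 'k,N-term:A,N-term:S'."""
    targets: List[Target] = []
    for token in spec.split(","):
        head, sep, tail = token.partition(":")
        if head in TERMINALS:
            # Unpacked syntax: at most one residue after the colon.
            if sep and not (len(tail) == 1 and tail.isalpha()):
                raise ValueError(f"expected one residue after ':' in {token!r}")
            targets.append(Target(terminal=head, residue=tail or None))
        elif not sep and len(head) == 1 and head.isalpha():
            targets.append(Target(terminal=None, residue=head))
        else:
            raise ValueError(f"unrecognised target: {token!r}")
    return targets


print(parse_targets("k,N-term:A,N-term:S"))
```

A packed token such as `N-term:AS` is rejected here on purpose, matching the unpacked form chosen in the meeting.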

Discuss again with other ProForma 2.0 stakeholders

Other potential things to change:

  • Clearly specify the order of <>{}[] at the front
  • This is not clearly defined in the text of the spec. Update to specify more clearly.

@douweschulte

As a follow-up thought after the meeting: I would argue for @N-term:ABC as a valid representation of a modification on the N-terminus when the N-terminal residue is alanine (A), ambiguous aspartic acid/asparagine (B), or cysteine (C). The idea in the meeting itself was to disallow this form and instead use @N-term:A,N-term:B,N-term:C, which is a slightly simpler grammar.

My argument for allowing the first form is that it is easier to type out. It makes the grammar slightly more complex, but no dividing character is needed: the rule is that every alphabetic character following the colon is a location where the modification can be placed. The parser logic is not much more complex, because it already has to check whether the character following the first amino acid is a comma; with this addition it just keeps taking input until the next comma.

As for the intermediate representation used by any program that handles ProForma notation, I would argue there is no difference between the two syntaxes. Any program able to handle the N-term:A,N-term:B notation can therefore handle the N-term:AB notation without any changes to the code (except for the parser, of course).
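
One way to picture this argument is a small pre-processing step that rewrites the packed form into the unpacked one, so everything downstream of the parser stays the same. This is only a Python sketch under the assumption that packed residues are plain alphabetic characters after the colon; `unpack_targets` is a hypothetical helper, not code from any existing parser.

```python
def unpack_targets(spec: str) -> str:
    """Rewrite packed targets (N-term:ABC) as unpacked ones (N-term:A,N-term:B,...)."""
    unpacked = []
    for token in spec.split(","):
        head, sep, tail = token.partition(":")
        if sep and len(tail) > 1:
            if not tail.isalpha():
                raise ValueError(f"only letters may follow ':' in {token!r}")
            # Every letter after the colon becomes its own target.
            unpacked.extend(f"{head}:{aa}" for aa in tail)
        else:
            # Already unpacked (or a plain residue / bare terminal): keep as is.
            unpacked.append(token)
    return ",".join(unpacked)


print(unpack_targets("k,N-term:ABC"))
```

Input that is already unpacked passes through unchanged, so the same downstream code handles both notations.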

But I am quite interested to hear about the feasibility from the other people writing ProForma parsers. This mostly reflects how my parser is written and it might be harder if you are using other libraries or parser generators.

@mobiusklein
Collaborator

My argument against the N-term:ABC notation, or packing for ease of reference, is that it introduces an extra layer of complexity and it introduces a second way of specifying a list of amino acid targets. The first is colored by my own implementation choices, but suppose we have the following abstract types:

class ModificationRule {
  modification: Modification
  targets: List<ModificationTarget>
}

class ModificationTarget {
  amino_acid: String | null
  terminal: String | null
}

This fully covers the first existing usage, where each amino acid is a separate ModificationTarget. If we allow packing we now need to allow a ModificationTarget to cover multiple amino acids, or we need to add an extra step after parsing where we split those overloaded targets into separate entries. If we allow variadic ModificationTargets, then we break an implicit contract that a target is about a single amino acid. If we do introduce an intermediate splitting step, we break the 1:1 assumption between syntax and representation, and unless you implement rule merging, N-term:ABC may then be rendered N-term:A,N-term:B,N-term:C. ProForma explicitly doesn't advocate standard canonicalization rules, but round-tripping is nice to have.
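
The round-tripping point can be sketched in Python as follows. The dataclass mirrors the abstract `ModificationTarget` type above; `render_unpacked` and `render_merged` are hypothetical helpers written only for this illustration, and the merged output uses the packed notation that is under discussion here, not something the spec defines.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ModificationTarget:
    amino_acid: Optional[str] = None
    terminal: Optional[str] = None


def render_target(target: ModificationTarget) -> str:
    if target.terminal and target.amino_acid:
        return f"{target.terminal}:{target.amino_acid}"
    return target.terminal or target.amino_acid or ""


def render_unpacked(targets: List[ModificationTarget]) -> str:
    """One token per target: the 1:1 mapping between syntax and representation."""
    return ",".join(render_target(t) for t in targets)


def render_merged(targets: List[ModificationTarget]) -> str:
    """Rule merging: pack residues that share a terminal into one token."""
    parts: List[str] = []
    slot: Dict[str, int] = {}  # terminal -> index of its packed token in parts
    for t in targets:
        if t.terminal and t.amino_acid:
            if t.terminal not in slot:
                slot[t.terminal] = len(parts)
                parts.append(f"{t.terminal}:")
            parts[slot[t.terminal]] += t.amino_acid
        else:
            parts.append(render_target(t))
    return ",".join(parts)


# What a parser might hold after splitting a packed "N-term:ABC".
split = [ModificationTarget(amino_acid=aa, terminal="N-term") for aa in "ABC"]
print(render_unpacked(split))
print(render_merged(split))
```

Without the merging step, a string that was read as `N-term:ABC` is written back in the unpacked form, which is the loss of round-tripping described above.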

The second concern is about how syntax maps to semantics. Suppose I write N-term:ABC, and then say "Ah, but I also need this rule to target Z, X and Q when they are not on the N-terminus". The spec says I should then write Z,X,Q,N-term:ABC. But I just packed ABC together, so why can't I write ZXQ,N-term:ABC? Or I might write Z,X,Q,N-term:A,B,C because I think I have a list of targets.

Neither problem is intractable, and others may implement things in such a way that this is not an issue.

@douweschulte

I do the grouping internally already, so for me on the parser side there is no problem. But your second argument on semantics I fully agree with. So that leaves me in favour of the unpacked syntax.

douweschulte added a commit to snijderlab/rustyms that referenced this issue Apr 10, 2024
Support for diagnostic ions from labile modifications
Bit of refactoring for proforma parse code
@edeutsch

Original intent:
AC[Carbamidomethyl]AHC[Carbamidomethyl]HAC[Carbamidomethyl]FC[Carbamidomethyl]AC[Carbamidomethyl]
<[Carbamidomethyl]@C>ACAHCHACFCAC

<[Carbamidomethyl]@C>AAHHAFA
(should this be legal? It is according to the current spec, but does it violate the spirit of what ProForma was trying to do?)

Do we want to amend the specification to ProForma 2.1 to clarify these things?
Or should we have an addendum document that clarifies things in ProForma 2.0 that were not clearly specified?

Douwe's code has the capability to read a ProForma string that has all the fixed modifications prefixed and normalizes it to what is actually in the peptide.

If we added the N-term support, it would be a breaking change, and would be ProForma 2.1

TODO: Start a Google doc in which we start documenting and resolving these various open issues, including #8 and #9
TODO: Juan will put ProForma 2.0 into an editable Google doc
TODO: Douwe will create a Google doc that is an addendum/clarification of 2.0

@bittremieux

bittremieux commented Apr 14, 2024

<[Carbamidomethyl]@C>AAHHAFA
(should this be legal? It is according to the current spec, but does it violate the spirit of what ProForma was trying to do?)

This is fine imo.

mobiusklein added a commit to mobiusklein/pyteomics that referenced this issue Apr 20, 2024
**Modification caching**
All ModificationResolver types now use an in-memory cache for
resolved modification definitions, reducing overhead of resolving
the same rule over and over again.

Sub-classes should move their implementation of `resolve` to the
`_resolve_impl` method, otherwise the cache will not be used.

To disable the cache for a resolver instance, call `resolver.enable_caching(False)`.
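
The caching pattern described here can be pictured with a generic Python sketch. It reuses the method names from this commit message (`resolve`, `_resolve_impl`, `enable_caching`) but is not the Pyteomics implementation; `CachingResolver` and `CountingResolver` are hypothetical classes written only to show why sub-classes that override `resolve` directly would bypass the cache.

```python
from typing import Dict, Optional


class CachingResolver:
    """Hypothetical base class; not the Pyteomics implementation."""

    def __init__(self) -> None:
        self._cache: Optional[Dict[str, dict]] = {}

    def enable_caching(self, flag: bool = True) -> None:
        # Dropping the dict disables caching; a fresh dict re-enables it.
        if not flag:
            self._cache = None
        elif self._cache is None:
            self._cache = {}

    def resolve(self, name: str) -> dict:
        # The cache lives here, so sub-classes must put their lookup
        # logic in _resolve_impl for it to take effect.
        if self._cache is None:
            return self._resolve_impl(name)
        if name not in self._cache:
            self._cache[name] = self._resolve_impl(name)
        return self._cache[name]

    def _resolve_impl(self, name: str) -> dict:
        raise NotImplementedError


class CountingResolver(CachingResolver):
    """Toy sub-class that counts how often the real lookup runs."""

    def __init__(self) -> None:
        super().__init__()
        self.lookups = 0

    def _resolve_impl(self, name: str) -> dict:
        self.lookups += 1
        return {"name": name}


resolver = CountingResolver()
resolver.resolve("TMT6plex")
resolver.resolve("TMT6plex")
print(resolver.lookups)
```

In this sketch the second call is served from the cache, so the lookup counter only advances once until caching is switched off.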

**Constant terminal modifications**
This implements support for the syntax discussed in HUPO-PSI/ProForma#6
to include constant modification rules that apply to specific sequence
terminals with or without specific amino acids.