SEP 033: Concrete descriptions of non-canonical DNA, RNA, and proteins #77

Open
jonrkarr opened this issue Jun 21, 2019 · 9 comments

@jonrkarr
Contributor

commented Jun 21, 2019

Full details in the pull request, in the file https://github.com/SynBioDex/SEPs/blob/master/sep_033.md

@palchicz changed the title from "SEP XXX -- Concrete descriptions of non-canonical DNA, RNA, and proteins" to "SEP 033: Concrete descriptions of non-canonical DNA, RNA, and proteins" on Jun 22, 2019

@jakebeal

Contributor

commented Jul 14, 2019

@jonrkarr In theory, supporting BpForms seems pretty straightforward to me: it's just another textual sequence format.

Some things about the recommended implementation in the SEP are not clear to me, however:

  • Is the recommendation to add BpForms as an optional format or to replace IUPAC as the recommended format?
  • The SEP recommends incorporating the BpForms library into SBOL libraries. BpForms currently appears to be available only in Python, but SBOL libraries are available in multiple languages (Java, Python, C++, JavaScript, F#). How do you recommend navigating this issue?
@jonrkarr

Contributor Author

commented Jul 15, 2019

Jake, thanks for the clarifying questions. I'll revise the SEP to try to address these questions.

  • Yes, the proposal is to add BpForms as a recommended encoding for non-canonical DNA, RNA, and proteins. More concretely, the proposal is simply to add BpForms to Table 1 of the SBOL specification.

    Although BpForms can represent any IUPAC sequence, we are not recommending replacing IUPAC with BpForms because BpForms is not, at least not yet, a standard. Allowing both IUPAC and BpForms would let SBOL continue to support existing users while also accommodating users who require more chemical information and precision.

  • The proposal is simply to add BpForms to Table 1 of the SBOL specification.

    Going forward, I think it would be useful to incorporate the capabilities to describe and validate non-canonical DNA, RNA, and proteins more directly into SBOL/libSBOL. One potential way is to incorporate the BpForms software into each flavor of libSBOL. Another potential way is to encode non-canonical DNA, RNA, and proteins into RDF. Because there are multiple ways to achieve this, I think this is something that the community should discuss. Until then, users who need to capture more chemical detail could use the BpForms encoding. This would allow users to begin exploring use cases that require more chemical information, which could anchor discussion about how SBOL should proceed.

    Although I think it would be helpful to incorporate the interpretation of BpForms and the other encodings (SMILES, IUPAC, IUBMB) into libSBOL (e.g., this would enable verification of encoded strings), I am not recommending incorporating the BpForms software into libSBOL at this time because libSBOL does not currently have the ability to interpret the other sequence encodings. To maintain this separation, the BpForms software should remain separate from libSBOL.

@jakebeal

Contributor

commented Jul 15, 2019

Thanks: this clarification is very helpful.

If we're going to support adding BpForms to Table 1, then we'll definitely need to have some form of support for dealing with BpForms in the SBOL libraries. The full BpForms library might not be necessary, but a number of library operations depend on the ability to reason about locations in sequence strings, e.g., to annotate a location, to check if a sub-component is correctly aligned, or to compose sequences together (in ways more complex than just concatenating).

If I understand correctly, BpForms does not have a 1-to-1 alignment between string index and sequence location. Is there a simple way of computing locations in BpForms, or does that require a significant amount of the BpForms library code?

@jonrkarr

Contributor Author

commented Jul 15, 2019

Correct, there is no one-to-one alignment between string indices and sequence locations. This is also true of SMILES. However, the alignment is simple (the monomer indices increase from left to right as in IUPAC). We can provide a function to compute locations which could be incorporated into libSBOL. This just requires counting the number of single characters, square brackets, and curly brackets. This can be done without fully interpreting encoded sequences with the BpForms library.
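
A minimal sketch of such a counting function is below. It is not the BpForms library's API: `monomer_spans` and `monomer_at` are hypothetical names. It also rests on some assumptions about the syntax: each monomer is either a single character, a `{...}` group, or a `[...]` group; double-quoted strings inside a group may themselves contain brackets; and a top-level `|` starts sequence-level attributes that contain no monomers. It does not validate the sequence or handle escaped quotes.

```python
def monomer_spans(seq: str) -> list[tuple[int, int]]:
    """Return the (start, end) string offsets of each monomer, left to right."""
    spans = []
    i, n = 0, len(seq)
    while i < n:
        ch = seq[i]
        if ch.isspace():
            i += 1  # whitespace between monomers is not a monomer
        elif ch == "|":
            break  # assumption: top-level '|' begins sequence-level attributes
        elif ch in "[{":
            close = "]" if ch == "[" else "}"
            start, depth, in_quotes = i, 0, False
            while i < n:
                c = seq[i]
                if c == '"':
                    in_quotes = not in_quotes  # brackets inside quotes are ignored
                elif not in_quotes:
                    if c == ch:
                        depth += 1
                    elif c == close:
                        depth -= 1
                        if depth == 0:
                            break
                i += 1
            if i >= n:
                raise ValueError(f"unclosed '{ch}' at offset {start}")
            i += 1  # step past the closing bracket
            spans.append((start, i))
        else:
            spans.append((i, i + 1))  # single-character monomer
            i += 1
    return spans


def monomer_at(seq: str, location: int) -> str:
    """Return the encoded text of the monomer at a 1-based sequence location."""
    spans = monomer_spans(seq)
    if not 1 <= location <= len(spans):
        raise IndexError(f"location {location} is outside 1..{len(spans)}")
    start, end = spans[location - 1]
    return seq[start:end]
```

For an illustrative string such as `'AC{dI}G[id: "x" | structure: "C[O-]"]T'`, `monomer_at(s, 3)` returns `'{dI}'`, and `len(monomer_spans(s))` gives the sequence length in monomers. Because the function only scans brackets and quotes, it could be ported to the other libSBOL languages without a parser generator.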

As an aside, the SMILES situation is more complicated because there are multiple versions of SMILES, and different software packages number atoms differently when they interpret SMILES. For example, OpenBabel and Marvin assign different atom numbers for dGMP (OC1CC(OC1COP(=O)([O-])[O-])n1cnc2c1nc(N)[nH]c2=O). In BpForms, we deal with this by specifying that the atom numbers are in the basis established by OpenBabel.

@cjmyers

Contributor

commented Jul 15, 2019

@jonrkarr

Contributor Author

commented Jul 15, 2019

@jonrkarr

Contributor Author

commented Jul 15, 2019

I updated the SEP to clarify these questions. See PR #83.

@cjmyers

Contributor

commented Jul 16, 2019

@jonrkarr

Contributor Author

commented Jul 16, 2019

The grammar is written for Lark in EBNF (with a few slight modifications described here).

We plan to make a more standard version of the grammar that can be used with other parser generators. I think this will only require a few small changes. Several parser generators for C, C++, Java, Python, etc. support EBNF, such as:

  • ANTLR
  • Beaver
  • Coco/R
  • Grammatica
  • JavaCC
  • UniCC

If you have a preferred library, we can target that.
