SEP 033: Concrete descriptions of non-canonical DNA, RNA, and proteins #77

Open
jonrkarr opened this issue Jun 21, 2019 · 14 comments

@jonrkarr
Contributor

jonrkarr commented Jun 21, 2019

Full details in the pull request, in the file https://github.com/SynBioDex/SEPs/blob/master/sep_033.md

@palchicz palchicz changed the title from "SEP XXX -- Concrete descriptions of non-canonical DNA, RNA, and proteins" to "SEP 033: Concrete descriptions of non-canonical DNA, RNA, and proteins" Jun 22, 2019
@palchicz palchicz assigned ghost Jun 25, 2019
@jakebeal
Contributor

@jonrkarr In theory, supporting BpForms seems pretty straightforward to me: it's just another textual sequence format.

Some things about the recommended implementation in the SEP are not clear to me, however:

  • Is the recommendation to add BpForms as an optional format or to replace IUPAC as the recommended format?
  • The SEP recommends incorporating the BpForms library into SBOL libraries. BpForms currently appears to be available only in Python, but SBOL libraries are available in multiple languages (Java, Python, C++, JavaScript, F#). How do you recommend navigating this issue?

@jonrkarr
Contributor Author

Jake, thanks for the clarifying questions. I'll revise the SEP to try to address these questions.

  • Yes, the proposal is to add BpForms as a recommended encoding for non-canonical DNA, RNA, and proteins. More concretely, the proposal is simply to add BpForms to Table 1 of the SBOL specification.

    Although BpForms can represent any IUPAC sequence, BpForms is not a standard, at least not yet, so we are not recommending replacing IUPAC with BpForms. Allowing both IUPAC and BpForms would let SBOL continue to support existing users while accommodating users who require more chemical information and precision.

  • The proposal is simply to add BpForms to Table 1 of the SBOL specification.

    Going forward, I think it would be useful to incorporate the capabilities to describe and validate non-canonical DNA, RNA, and proteins more directly into SBOL/libSBOL. One potential way is to incorporate the BpForms software into each flavor of libSBOL. Another potential way is to encode non-canonical DNA, RNA, and proteins into RDF. Because there are multiple ways to achieve this, I think this is something that the community should discuss. Until then, users who need to capture more chemical detail could use the BpForms encoding. This would allow users to begin exploring use cases that require more chemical information, which could anchor discussion about how SBOL should proceed.

    Although I think it would be helpful to incorporate the interpretation of BpForms and the other encodings (SMILES, IUPAC, IUBMB) into libSBOL (e.g., this would enable verification of encoded strings), I am also not recommending incorporating the BpForms software into libSBOL at this time, because libSBOL does not currently have the ability to interpret the other sequence encodings. To follow this separation, the BpForms software should remain separate from libSBOL.

@jakebeal
Contributor

Thanks: this clarification is very helpful.

If we're going to support adding BpForms to Table 1, then we'll definitely need to have some form of support for dealing with BpForms in the SBOL libraries. The full BpForms library might not be necessary, but a number of library operations depend on the ability to reason about locations in sequence strings, e.g., to annotate a location, to check if a sub-component is correctly aligned, or to compose sequences together (in ways more complex than just concatenating).

If I understand correctly, BpForms does not have a 1-to-1 alignment between string index and sequence location. Is there a simple way of computing locations in BpForms, or does that require a significant amount of the BpForms library code?

@jonrkarr
Contributor Author

Correct, there is no one-to-one alignment between string indices and sequence locations. This is also true of SMILES. However, the alignment is simple (the monomer indices increase from left to right, as in IUPAC). We can provide a function to compute locations which could be incorporated into libSBOL. This just requires counting the number of single characters, square brackets, and curly brackets. This can be done without fully interpreting encoded sequences with the BpForms library.
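To make the counting idea concrete, here is a minimal Python sketch. It is not part of the BpForms library, and the function names `monomer_spans` and `string_range` are hypothetical. It assumes each monomer is written as a single character, a `{...}` group, or a `[...]` group, and that a closing bracket inside a double-quoted string (such as an embedded SMILES string) does not end the group. It deliberately ignores any other features of the grammar, and the monomer code `m2C` in the usage note is only illustrative.

```python
def monomer_spans(seq):
    """Return one (start, end) string-index pair per monomer, left to right.

    Simplifying assumptions (not taken from the BpForms grammar itself):
    a monomer is a single character, a {...} group, or a [...] group, and
    double-quoted text inside a group may contain brackets.
    """
    closers = {"{": "}", "[": "]"}
    spans = []
    i = 0
    while i < len(seq):
        ch = seq[i]
        if ch.isspace():
            # Whitespace between monomers does not count as a monomer.
            i += 1
            continue
        if ch in closers:
            close = closers[ch]
            j = i + 1
            in_quotes = False
            # Scan to the matching closer, skipping over quoted text.
            while j < len(seq):
                if seq[j] == '"':
                    in_quotes = not in_quotes
                elif seq[j] == close and not in_quotes:
                    break
                j += 1
            if j >= len(seq):
                raise ValueError(f"unclosed {ch!r} at index {i}")
            spans.append((i, j + 1))
            i = j + 1
        else:
            # Any other character is treated as a one-character monomer.
            spans.append((i, i + 1))
            i += 1
    return spans


def string_range(seq, first, last):
    """Map 1-based, inclusive monomer positions to a (start, end) string slice."""
    spans = monomer_spans(seq)
    return spans[first - 1][0], spans[last - 1][1]
```

Under these assumptions, `string_range("AC{m2C}GT", 3, 4)` yields the slice bounds covering `{m2C}G`, which is the kind of lookup a library would need in order to annotate a location or check sub-component alignment.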

As an aside, the SMILES situation is more complicated because there are multiple versions of SMILES. Different software packages assign different numbers to atoms when they interpret SMILES. For example, OpenBabel and Marvin assign different atom numbers for dGMP (OC1CC(OC1COP(=O)([O-])[O-])n1cnc2c1nc(N)[nH]c2=O). In BpForms, we deal with this by specifying that the atom numbers are in the basis established by OpenBabel.

@cjmyers
Contributor

cjmyers commented Jul 15, 2019 via email

@jonrkarr
Contributor Author

jonrkarr commented Jul 15, 2019 via email

@jonrkarr
Contributor Author

I updated the SEP to clarify these questions. See PR #83.

@cjmyers
Contributor

cjmyers commented Jul 16, 2019 via email

@jonrkarr
Contributor Author

jonrkarr commented Jul 16, 2019

The grammar is written for Lark in EBNF (with a few slight modifications described here).

We plan to make a more standard version of the grammar that can be used with other parser generators. I think this will only require a few small changes. There are several parser generators for C, C++, Java, Python, etc. that support EBNF, such as:

  • ANTLR
  • Beaver
  • Coco/R
  • Grammatica
  • JavaCC
  • UniCC

If you have a preferred library, we can target that.

@cjmyers cjmyers added this to the SBOL 2.4 milestone Jul 23, 2019
@palchicz palchicz modified the milestone: SBOL 2.4 Jul 23, 2019
@graik
Contributor

graik commented Dec 11, 2019

This proposes that the sequence string of an SBOL record could embed an orthogonal data structure that has nothing to do with either RDF or SBOL. BpForms was, it seems, only proposed this year. No offense, Jon, but writing this SEP while also being the author of BpForms... I think this mostly serves the promotion of the BpForms project.

I believe it would be a bad idea to allow a complex and, so far, non-standard syntax to enter SBOL sequence strings. Right now, sequence strings are IUPAC compatible and can be parsed and processed by pretty much any bioinformatics tool. With this SEP, this compatibility would be broken.

Note also that all examples in the SEP are highly specialized natural modifications of natural molecules. None of the examples comes from engineered systems. I would argue that there is no urgent need to describe this kind of non-canonical chemistry in bioengineering (and we do support SMILES, which is a standard already supported by many tools). It would be nice if we could describe a handful of very common modifications, but that could be achieved much more easily (and without breaking the sequence record).

@cjmyers
Contributor

cjmyers commented Dec 11, 2019

@graik You are correct that this SEP is about promoting BpForms for expressing this type of information. Even without this SEP, it is already perfectly legal to use BpForms in SBOL. The Sequence encoding types are not restricted to IUPAC and SMILES; those are simply the only two that the SBOL community currently recommends. This SEP does not say that people must use BpForms. Rather, it says that if you want to express the types of structure BpForms can represent, it is a suggested way to do so. BpForms would not replace any of our existing encodings; it would be an alternative encoding.

To be honest, I'm not sure how useful this is to synthetic biologists or how often this type of information needs to be recorded. Since it is already allowed, the real question is whether there is a better way to express this information.

@cjmyers cjmyers assigned jamesamcl and unassigned ghost Dec 12, 2019
@jamesamcl
Member

I am reviewing this on behalf of the SBOL editors. From what I can gather, there are no outstanding questions about the content of the SEP, but only about how useful BpForms would be - something I don't think is really up to us to decide.

We can either take this to a vote now for SBOL 2.4, or we can defer it until after 3.0. @jonrkarr - do you have a preference?

@jonrkarr
Contributor Author

jonrkarr commented Dec 14, 2019 via email

@cjmyers
Contributor

cjmyers commented Dec 16, 2019

@jonrkarr Given the concerns raised, I think maybe we should defer this to SBOL3 for now. It can already be used by SBOL2 now, since this is just another possible sequence encoding. I think to advocate its use as a best practice, the community would need to see some (perhaps at least 3) uses of BpForms in synthetic biology examples. If you or collaborators have some examples to share, please do so. We also started a conversation earlier about using constraints and/or interactions to represent some elements of BpForms. This conversation would be interesting to begin again, especially if it may suggest some changes that we should consider for SBOL3. Thanks again for your contributions.

@cjmyers cjmyers modified the milestones: SBOL 2.3.1, SBOL 3.0 Dec 16, 2019
@jakebeal jakebeal modified the milestones: SBOL 3.0, SBOL 3.1 Jan 27, 2020