SEP 033: Concrete descriptions of non-canonical DNA, RNA, and proteins #77
@jonrkarr In theory, supporting BpForms seems pretty straightforward to me; it's just another textual sequence format. However, some things about the recommended implementation in the SEP are not clear to me:
|
Jake, thanks for the clarifying questions. I'll revise the SEP to try to address these questions.
|
Thanks: this clarification is very helpful. If we're going to support adding BpForms to Table 1, then we'll definitely need to have some form of support for dealing with BpForms in the SBOL libraries. The full BpForms library might not be necessary, but a number of library operations depend on the ability to reason about locations in sequence strings, e.g., to annotate a location, to check if a sub-component is correctly aligned, or to compose sequences together (in ways more complex than just concatenating). If I understand correctly, BpForms does not have a 1-to-1 alignment between string index and sequence location. Is there a simple way of computing locations in BpForms, or does that require a significant amount of the BpForms library code? |
Correct, there is no one-to-one alignment between string indices and sequence locations. This is also true of SMILES. However, the alignment is simple (the monomer indices increase from left to right as in IUPAC). We can provide a function to compute locations which could be incorporated into libSBOL. This just requires counting the number of single characters, square brackets, and curly brackets. This can be done without fully interpreting encoded sequences with the BpForms library.
As an aside, the SMILES situation is more complicated because there are multiple versions of SMILES. Different software packages assign different numbers to atoms when they interpret SMILES. For example, OpenBabel and Marvin assign different atom numbers for dGMP (OC1CC(OC1COP(=O)([O-])[O-])n1cnc2c1nc(N)[nH]c2=O). In BpForms, we deal with this by specifying that the atom numbers are in the basis established by OpenBabel. |
I'm not concerned about the one-to-one alignment issue. Our libraries already allow SMILES, which, as Jonathan points out, does not have this alignment. We only require it for IUPAC sequences, so we could continue to do so. Indeed, I would expect each object to still have an IUPAC sequence. The BpForms sequence, if provided, would be added information.
I would, though, want the libraries to at least be able to tell whether a BpForms-encoded string is syntactically correct. libSBOLj at least does this for SMILES. I'm not sure if the other libraries do, but for validation I would want to have this ability at least. Do you have a grammar file? Or could you provide a Java library that we could link to that would validate the syntax?
|
It seems like the encodings need to have well-defined locations, which BpForms has. BpForms uses the same location conventions as IUPAC and SMILES. These are also fairly human-readable, as the locations increase monotonically from left to right.
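The location computation described earlier in the thread (counting single characters, square brackets, and curly brackets) could be sketched roughly as follows. This is only an illustration, not the BpForms implementation: the function names `monomer_spans` and `span_of_position` are hypothetical, and the sketch assumes that a monomer is written either as a single character, as a `{...}` identifier, or as a `[...]` inline definition that may contain double-quoted strings (for example a SMILES structure with its own brackets). Escaped quotes and any attributes that follow the monomer sequence are not handled.

```python
from typing import List, Tuple


def monomer_spans(seq: str) -> List[Tuple[int, int]]:
    """Return one (start, end) string-index span per monomer, left to right.

    Assumed token shapes (not taken from the BpForms grammar):
    a single character, a {...} group, or a [...] group.
    """
    closers = {"{": "}", "[": "]"}
    spans: List[Tuple[int, int]] = []
    i, n = 0, len(seq)
    while i < n:
        ch = seq[i]
        if ch.isspace():
            i += 1  # whitespace does not count as a monomer
        elif ch in closers:
            close = closers[ch]
            j = i + 1
            in_quotes = False
            # Scan to the matching closer, ignoring anything inside "..."
            while j < n and (in_quotes or seq[j] != close):
                if seq[j] == '"':
                    in_quotes = not in_quotes
                j += 1
            if j >= n:
                raise ValueError(f"unclosed {ch!r} at string index {i}")
            spans.append((i, j + 1))
            i = j + 1
        elif ch in "}]":
            raise ValueError(f"unexpected {ch!r} at string index {i}")
        else:
            spans.append((i, i + 1))  # single-character monomer
            i += 1
    return spans


def span_of_position(seq: str, position: int) -> Tuple[int, int]:
    """Map a 1-based monomer position to its (start, end) string span."""
    spans = monomer_spans(seq)
    if not 1 <= position <= len(spans):
        raise IndexError(f"position {position} is outside 1..{len(spans)}")
    return spans[position - 1]
```

With these assumptions, `span_of_position("AC{m2A}G", 3)` returns the span of the `{m2A}` group, so a library could annotate monomer 3 without interpreting what `m2A` means chemically.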
Yes, there's a grammar. FYI, there are a few more sophisticated validations that aren't encoded in the grammar, for example, verifying that 5' caps only appear at the last position. SMILES has the same issue: it is possible to encode molecules in SMILES that are not physically realistic. I think grammar validation would be a good start. Ideally, validation would go further to verify that the encoded molecules are realistic.
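The two tiers of validation mentioned here, a syntax check followed by rules the grammar cannot express, might be layered as in the sketch below. It is a toy under stated assumptions, not the BpForms grammar: the token pattern accepts only single letters and `{...}` identifiers (inline `[...]` definitions are left out for brevity), and the monomer identifier `{cap}` and the function names are hypothetical.

```python
import re
from typing import Iterable, List

# Assumed token shapes: one letter, or a brace-delimited identifier.
TOKEN = re.compile(r"\{[^{}\s]+\}|[A-Za-z]")


def tokenize(seq: str) -> List[str]:
    """Stage 1 (syntax): split into monomer tokens or raise ValueError."""
    tokens: List[str] = []
    pos = 0
    while pos < len(seq):
        match = TOKEN.match(seq, pos)
        if match is None:
            raise ValueError(f"syntax error at string index {pos}")
        tokens.append(match.group())
        pos = match.end()
    return tokens


def check_position_rules(tokens: List[str], last_only: Iterable[str]) -> List[str]:
    """Stage 2 (semantics): report monomers that break a position rule.

    `last_only` lists tokens that may appear only at the last position,
    mirroring the kind of rule described in the comment above.
    """
    restricted = set(last_only)
    errors: List[str] = []
    for index, token in enumerate(tokens, start=1):
        if token in restricted and index != len(tokens):
            errors.append(
                f"{token} at monomer position {index} is only allowed "
                "at the last position"
            )
    return errors
```

A string such as `A{cap}C` passes the first stage but fails the second, which is the distinction being drawn: a grammar check alone would accept it.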
|
I updated the SEP to clarify these questions. See PR #83. |
Sounds good to me. Do you have a Java library that we can use for this?
Chris
|
The grammar is written for Lark in EBNF (with a few slight modifications described here). We plan to make a more standard version of the grammar that can be used with other parser generators. I think this will only require a few small changes. There are several parser generators for C, C++, Java, Python, etc. that support EBNF such as:
If you have a preferred library, we can target that. |
This proposes that the sequence string of an SBOL record could be embedded with an orthogonal data structure that has nothing to do with either RDF or SBOL. BpForms has, it seems, only been proposed this year. No offense, Jon, but writing this SEP while also being the author of BpForms... I think this mostly serves the promotion of the BpForms project. I believe it would be a bad idea to allow a complex and, so far, non-standard syntax to enter SBOL sequence strings. Right now, sequence strings are IUPAC-compatible and can be parsed and processed by pretty much any bioinformatics tool. This SEP would break that compatibility. Note also that all examples in the SEP are highly specialized natural modifications of natural molecules. None of the examples comes from engineered systems. I would argue that there is no urgent need to describe this kind of non-canonical chemistry in bioengineering (and we do support SMILES, a standard already supported by many tools). It would be nice if we could describe a handful of very common modifications, but that could be achieved much more easily (and without breaking the sequence record). |
@graik You are correct that this SEP is about promoting BpForms for expressing this type of information. Even without this SEP, it is already perfectly legal to use BpForms in SBOL. The Sequence encoding types are not restricted to IUPAC and SMILES; those are just the only two that the SBOL community currently recommends. This SEP does not say that people must use BpForms. Rather, it says that if you want to express the types of structure that BpForms can represent, it is a suggested way to do so. BpForms would not replace any of our existing encodings; it would be an alternative encoding. To be honest, I'm not sure how useful this is to synthetic biologists or how often this type of information needs to be recorded. Since it is already allowed, the real question is whether there is a better way to express this information. |
I am reviewing this on behalf of the SBOL editors. From what I can gather, there are no outstanding questions about the content of the SEP, only about how useful BpForms would be, which I don't think is really up to us to decide. We can either take this to a vote now for SBOL 2.4, or we can defer it until after 3.0. @jonrkarr - do you have a preference? |
Hi James,
Whatever the editors recommend works for me.
Regarding the utility, this SEP is about enhancing the chemical precision of macromolecules in synthetic designs. While some projects may not need more precision at the current early stage of synthetic biology, the SEP is about facilitating more chemical precision for the projects which need it, which I think will grow in number in the future. This could range from describing critical RNA and protein modifications to describing entirely new genetic codes. For example, we have a project to build cells with mirror chiral proteins that would require something like BpForms to describe the design for the cells. At the moment, there does not appear to be a concrete way to describe modifications within SBOL, let alone designs that involve new genetic codes with new amino acids and/or new peptide bonds.
I think one underlying question for the community is whether SBOL should be able to capture designs for entirely new organisms that may depart significantly from natural biology and that may involve entirely new parts. For example, should SBOL be able to describe organisms that involve new genetic codes? In that case, I think it will be important for SBOL to capture more information that is normally implicit in our shared biochemical knowledge.
Regards
Jonathan
|
@jonrkarr Given the concerns raised, I think maybe we should defer this to SBOL3 for now. It can already be used by SBOL2 now, since this is just another possible sequence encoding. I think to advocate its use as a best practice, the community would need to see some (perhaps at least 3) uses of BpForms in synthetic biology examples. If you or collaborators have some examples to share, please do so. We also started a conversation earlier about using constraints and/or interactions to represent some elements of BpForms. This conversation would be interesting to begin again, especially if it may suggest some changes that we should consider for SBOL3. Thanks again for your contributions. |
Full details in the pull request, in the file https://github.com/SynBioDex/SEPs/blob/master/sep_033.md