SEP 020 -- Linking experimental results with Biological clones and replicates
|Title||SEP 020 -- Linking experimental results with Biological clones and replicates|
|Authors||Raik Gruenberg (firstname.lastname@example.org)|
|Editor||Nicholas Roehner (email@example.com)|
|Replaces||SEP 014, SEP 016, SEP 017, SEP 019|
Experimental information recorded for one engineered clone of cells or a batch of biomaterial is often not representative of the behavior of another clone or batch. The reason is additional variables (known or unknown), such as mutations or stochastic biochemical events, that lie outside the scope of the actual design. Bioengineers are therefore used to tracking the identity of clones, batches or replicates of biological materials whenever experimental results are collected. The Implementation class (previously defined in SEP 19) is here modified in order to allow a reliable and straightforward identification of cell lines, bacterial clones and other replicates of biomaterial. In contrast to SEP 19, the present proposal introduces a mandatory and direct link ('design') from
Implementation to its intended design. SEP 20 therefore guarantees that we always know What a clone or batch is. Information about How something was constructed may be added using prov-O (provenance ontology) records just as described in SEP 19.
Table of Contents
- 1. Rationale
- 2. Specification
- 2.1 Implementation
- 2.2 Best Practices
- 2.3 Validation Rules
- 3. Examples
- 4. Backwards Compatibility
- 5. Discussion
- 6. Relation to Other SEPs
If I am given a sample of something, the most important question is: "What is it?". SEP 19 gives no less than three different ways of answering this question: (1) wasDerivedFrom, (2) built, (3) design via
Activity. This would be confusing enough but, in fact, none of these redundant links is mandatory and it is not clear which one is preferred. Therefore, while SEP 19 makes a very useful first attempt to record workflow information ("How was this made?" or "How was this measured?"), the arguably more basic information about "What is this?" may be ambiguous, in conflict with other statements, or completely missing. This is the main issue addressed by the present proposal.
The main focus of SEP 20 lies therefore on the
Implementation class. However, the SBOL editors have demanded a clear choice between SEP 19 and SEP 20 and I am therefore also including workflow-related information in this proposal. These are taken
directly from SEP 19 and should be interpreted as preliminary guidelines, to be tested and further refined by practical use.
The main goals of this proposal are therefore:
- Identify cell lines (clones, replicates, batches) as needed for the exchange of biomaterials (e.g. from AddGene, ATCC, partsregistry) and as currently required by journals for experimental papers
- Allow experimental results to be attached to clones/batches
- Guarantee that each record of a clone or batch always points to exactly one authoritative design
- Workflow & provenance information make up an additional layer of meta information rather than being required
- Ensure that the link from Experimental data through clone/batch to (intended) design is as direct and unambiguous as possible and does not require iterations over chains of events / activities (with the associated risk of missing links)
Implementation represents a real, physical instance of a synthetic biological construct which may be associated with one or, typically, several actual laboratory samples. Note that the representation of individual samples (as in LIMS systems) remains outside the scope of SBOL. Whether or not two samples are considered to belong to the same batch or at which point they are split into separate
Implementation instances is decided by the user and will be a matter of community best practices in a given area of bioengineering.
An Implementation MUST be linked back to its intended design, either a ComponentDefinition or a ModuleDefinition, using a mandatory field design (this is the main difference to SEP 19). An Implementation MAY also link, via an optional field built, to an additional ComponentDefinition which represents a more detailed picture of its actual structure as it was realized in the laboratory.
Figure 1: Diagram of the
Implementation class and its associated properties
Each Implementation record MUST have exactly one design property which MUST point to a ComponentDefinition or a ModuleDefinition. This Component/ModuleDefinition is interpreted as the intended design. Sister clones of a bioengineering experiment will therefore always point to the same
ComponentDefinition regardless of whether or not they accurately implement this design.
If, after creation of a clone, additional design information becomes available that equally applies to all clones of an experiment, then the design
ComponentDefinition instance may be updated in-place. Alternatively the
design field of all affected
Implementations may be changed to point to a new version of the
Component/ModuleDefinition. In both cases, previous versions and reasons for the change can, optionally, be tracked using established prov-O versioning (e.g.
ComponentDefinition.wasDerivedFrom) as already prescribed by SBOL.
The optional Implementation.built property references one ComponentDefinition that describes the, possibly deviating, actual physical composition of a realized construct. built MUST NOT reference the same object that design points to. Instead, if no extra or deviating structural information exists for a given clone, the built field is simply left unset. [This is a difference to SEP 19]
built structure should be specific for this particular
Implementation. For example, it may be based on the sequencing of this particular clone.
Implementation.built SHOULD NOT be used to introduce general corrections to the original design (see 2.1.1 above for how this should be handled instead).
Once created, the
uri identifier of an
Implementation instance MUST NOT change. The
Implementation record is mutable and may change over time as new information is added or corrected. However, since
Implementation represents a physical object in the real world, changes to the information about this physical object MUST NOT lead to a change of its unique identifier.
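The three rules above (mandatory design, optional built that never repeats design, and a stable uri) can be sketched as a small Python class. This is only an illustration under stated assumptions: the class layout, the use of plain URI strings, and all example URIs are hypothetical and do not correspond to any existing SBOL library API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Implementation:
    """Hypothetical in-memory record of a clone or batch (not a library API)."""

    uri: str                      # identifier of the physical object, fixed at creation
    design: str                   # mandatory URI of the intended Component/ModuleDefinition
    built: Optional[str] = None   # optional URI of the actually realized structure

    def __post_init__(self) -> None:
        # The design link is mandatory.
        if not self.design:
            raise ValueError("Implementation requires a design reference")
        # built is only used for extra or deviating structural information.
        if self.built is not None and self.built == self.design:
            raise ValueError("built must not repeat design; leave it unset instead")

    def __setattr__(self, name: str, value: object) -> None:
        # The record is mutable, but its identifier is not.
        if name == "uri" and "uri" in self.__dict__:
            raise AttributeError("the uri of an Implementation MUST NOT change")
        super().__setattr__(name, value)


# Example URIs are made up for illustration.
clone = Implementation(
    uri="http://example.org/impl/clone_1",
    design="http://example.org/cd/expression_plasmid",
)
# New information (e.g. from sequencing this clone) can be added later.
clone.built = "http://example.org/cd/expression_plasmid_as_sequenced"
```

Note that the sketch only checks built against design at creation time; a fuller implementation would repeat that check whenever either field is updated.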
The following Best Practices are currently considered guidelines which may change in future SBOL releases.
Implementation MAY link to experimental data using the
attachment field approved in SEP 18. It is RECOMMENDED that
sbol:Attachment instances containing experimental data are further annotated by prov-O activity records as shown in the following Figure.
Figure 2: Association of experimental data with
Implementation (Clones/Batches) and intended Design.
Attachments MAY alternatively be grouped into
Collections. Details will have to be determined by practical use.
Detailed information about how a given construct or organism was built in the lab can be encoded in a
prov-O:Activity record. [Arguably, this should be described in a separate SEP. The detailed description of experiments and protocols is beyond the scope of this SEP]. All sister clones of a construction experiment SHOULD point to the same
Activity record. Clones that implement the same design but were produced independently (e.g. in a different lab) SHOULD NOT point to the same
Activity record, even if the assembly protocol happened to be identical.
Figure 3: Representation of construct or organism assembly.
Activity MAY also have a
Usage:design record pointing to the design
ComponentDefinition. This link is redundant with the
Implementation.design field and therefore optional [A difference to SEP 19]. If present, it MUST point to the same
Component/ModuleDefinition instance as the Implementation.design field.
prov:wasDerivedFrom references are used to express that one batch or clone was directly created from another one. It is RECOMMENDED that this should be accompanied by a
prov:Activity record that outlines how one batch was derived from the other. However, we do not yet have a standard for formulating these Activity records. Examples are: transformation of DNA into bacteria to generate a new cell stock or purification of DNA or proteins from a given cell stock. Also the simple splitting of cell lines or inoculation of a new culture from a given bacterial stock may or may not warrant the definition of a new
Figure 4: Documentation of simple cell line or clone ancestry.
It is RECOMMENDED that
Implementation.wasDerivedFrom is exclusively used for experimental strain/batch ancestry relations. This is in contrast to SEP 19, where the same term is also used to link
Implementation to its design
Note: This solution may create conflicts or ambiguities with SBOL versioning. Other alternatives are being discussed.
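As a rough illustration of how a tool might read such ancestry links, the following Python sketch walks wasDerivedFrom references from a clone back to its earliest recorded ancestor. It assumes, purely for illustration, that each Implementation has at most one physical parent and that the links are available as a plain dictionary; the function name and the sample identifiers are hypothetical.

```python
from typing import Dict, List


def lineage(start: str, was_derived_from: Dict[str, str]) -> List[str]:
    """Return the chain of ancestors of `start`, beginning with `start` itself.

    `was_derived_from` maps an Implementation URI to the URI of the
    batch or clone it was directly created from (hypothetical layout).
    """
    chain = [start]
    seen = {start}
    current = start
    while current in was_derived_from:
        parent = was_derived_from[current]
        if parent in seen:
            # A physical sample cannot be its own ancestor.
            raise ValueError(f"cyclic ancestry detected at {parent}")
        chain.append(parent)
        seen.add(parent)
        current = parent
    return chain


# Illustrative ancestry: a colony is picked, stocked, and re-stocked.
ancestry = {
    "impl/glycerol_stock_2": "impl/glycerol_stock_1",
    "impl/glycerol_stock_1": "impl/colony_A",
}
history = lineage("impl/glycerol_stock_2", ancestry)
```

The walk stops silently at the first ancestor that has no recorded parent, which is also what happens when a parent record is simply absent from the document.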
The use of prov-O vocabulary together with
Implementation is encouraged but optional. Parsing of prov-O information MUST NOT be needed for (1) the simple identification of a clone or (2) determining what structure (i.e. sequence) it is supposed to have. Further details on the use of prov-O for workflow documentation are suggested in the Example section below.
In line with SEP 19 (see detailed discussion in 2.3.2), we RECOMMEND as a best practice that objects linked by Activities not be successive versions (of the computational record) of each other, though we leave this at the discretion of users and library developers.
Not much is needed beyond the strict data model definition:
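- Implementation.built SHOULD NOT point to the same object as Implementation.design.
- If a "Built" Activity has a Usage:design record, then this Usage:design MUST point to the same Component/ModuleDefinition as Implementation.design.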
SEP 19 prescribes a host of additional validation rules related to prov-O usage. Some of these may apply here as well.
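A validator for the rules of this proposal could look roughly like the following Python sketch. The dictionary layout, the function name, and the message texts are assumptions made for illustration; they are not taken from any existing SBOL validator.

```python
from typing import Dict, List, Optional


def validate_implementation(
    impl: Dict[str, Optional[str]],
    usage_design: Optional[str] = None,
) -> List[str]:
    """Check one Implementation record (hypothetical dict layout).

    `impl` holds the URIs stored under "design" and "built";
    `usage_design` is the target of an optional Usage:design record
    found on the associated "Built" Activity, if any.
    """
    problems: List[str] = []
    design = impl.get("design")
    built = impl.get("built")

    if not design:
        problems.append("error: design is mandatory")
    if built is not None and built == design:
        problems.append("warning: built points to the same object as design")
    if usage_design is not None and usage_design != design:
        problems.append("error: Usage:design differs from Implementation.design")
    return problems


# A correct record: design only, with a consistent optional Usage:design.
ok = validate_implementation({"design": "cd/plasmid"}, usage_design="cd/plasmid")

# A record that breaks both rules.
bad = validate_implementation(
    {"design": "cd/plasmid", "built": "cd/plasmid"},
    usage_design="cd/other_plasmid",
)
```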
The first example covers the, presumably, most common use case of transmitting information about one particular plasmid which may be shipped between labs or obtained from the iGEM registry or AddGene:
Example 1a: Two scenarios for a single plasmid are shown. Top: by far the most common use case, where the plasmid corresponds exactly to the given design. Bottom: a scenario that may occur, for example, in intermediate stages of assembly processes, where a plasmid is tracked and found not to correspond exactly to its design.
It is instructive to compare the above example to exactly the same information expressed according to SEP 19. Most UML diagrams in SEP 19 are highly simplified. So here is the same example with everything spelled out according to SEP 19 and the current prov-O documentation:
SEP 19 Example: The exact same data as in Example 1a shown in the notation suggested by SEP 19. Note that the same information expressed by a single
design field in SEP 20, requires, in SEP 19, 6 prov-O fields, 2 prov-O class instances, and one novel sbol:"design" term.
Returning to SEP 20, the above (correct) plasmid may be transformed into an E. coli expression strain for protein production. A cell stock of the single colony picked for the expression culture will be kept for future reference. Let's again assume something went wrong and we later discover that there was a genomic mutation in this particular clone. In other words, the cell stock faithfully implements the core of the design (the expression plasmid) but its genome deviates from the original design (and from its sister clones). It may still be used in subsequent experiments and the deviation is documented using the built field.
Example 1b: A clone of a bacterial strain harboring the same expression plasmid but with a novel genomic mutation.
Note: SBOL currently has no clear rule for the identification of organisms. Presumably, a plasmid within a cell will be represented by a
ComponentDefinition for the plasmid, contained within a
ModuleDefinition somehow representing the cell. The given scheme is thus merely a suggestion.
The competing SEP 19 emphasizes the documentation of an idealized design - build - test workflow. I am skeptical as to how realistic this is at this point. For example, automated model building or automated sequence design from models is still the exception. In my experience, models are (if at all) built after a design has been made. I therefore find it problematic to connect Experimental Results -> Implementation, Implementation -> Design and Design -> Model with
prov:wasDerivedFrom links as suggested in SEP 19. In 95% of use cases, this will be confusing or outright false. Moreover, SBOL already specifies proper fields for connecting Design and Model as well as Attachment.
The following is therefore, according to this proposal, the minimal representation of the actual result of a design - build - test workflow:
Example 2a: A minimal workflow representation without use of prov-O meta data. This example provides the essential link from experimental result to clone, design and theoretical model.
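The chain described in Example 2a can be followed with direct lookups only, without parsing any prov-O records. The Python sketch below illustrates this with a hypothetical in-memory document; the dictionary keys ("attachments", "design", "models") and all URIs are invented for the illustration and are not meant to mirror the SBOL serialization.

```python
from typing import Dict, List, Optional

# Hypothetical document content; every URI and key is illustrative only.
implementations: Dict[str, dict] = {
    "impl/clone_7": {
        "design": "md/expression_strain",
        "attachments": ["att/growth_curve"],
    },
}
designs: Dict[str, dict] = {
    "md/expression_strain": {"models": ["model/expression_ode"]},
}


def describe_result(
    attachment_uri: str,
    impls: Dict[str, dict],
    defs: Dict[str, dict],
) -> Optional[Dict[str, object]]:
    """Resolve experimental result -> clone -> design -> model."""
    for impl_uri, impl in impls.items():
        if attachment_uri in impl["attachments"]:
            design_uri = impl["design"]      # single mandatory pointer
            models: List[str] = defs[design_uri]["models"]
            return {"clone": impl_uri, "design": design_uri, "models": models}
    return None  # the result is not attached to any known clone


summary = describe_result("att/growth_curve", implementations, designs)
```

Each step is a single dictionary access, so no chain of activities has to be traversed and no intermediate record can be missing.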
In line with SEP 19, rich meta data may be added to annotate the factual workflow result with "activity" records:
Example 2b: A tentative full workflow representation. Prov-O activities provide an orthogonal layer of meta information on top of the native SBOL chain of objects. Note: the prov:usage terms "design", "build", "test", "learn" have been suggested in sboldev discussions and are defined in SEP 19. "biomaterial" would be a new term which, in my view, better represents the relation between experimental protocols and clones/batches. The OBI ontology for biomedical investigations contains many terms that may be of more specific use here.
Note: This example has been adapted from Example 2 in SEP 19. The main difference is that
wasDerivedFrom links are replaced by
attachment. Other changes to the prov-O layout are suggestions only.
Please refer to section 2.2.2.
The following example is taken 1:1 from SEP 19. There are only two modifications: (1) designs are linked via
design rather than
built, (2) there is no "wasDerivedFrom" link between Implementation and design.
Example 4: An initial plasmid kit based on part designs from the iGEM Registry is replicated and split into different samples. Note, contrary to the use (in blue) in this example,
Implementation is not actually supposed to be equated with "Sample".
Implementation defines clones or batches that are typically distributed over many samples and only occasionally will each sample be unique enough to warrant its own
Implementation. However, as SBOL currently does not have a term for "Sample", the given scheme is a workaround.
The following example is taken 1:1 from SEP 19 (Example 6). As before, the data model is somewhat simplified by the fact that there is only one link between Implementation and
ComponentDefinition. In other words,
design replaces the redundant use of
Example 5: Two sets of overlapping constructs that implement the components of a CRISPR-based circuit are co-transfected to implement two different versions of the circuit.
Note: Arguably, the
wasDerivedFrom links (shown in gray color) from the CRISPR construct "mixture" or circuits on the right to the individual genes on the left can be removed. The same information is already provided by the ModuleDefinition. Alternatively, the sub-implementations should be linked together through a more expressive gene assembly "prov:Activity" activity (as outlined in section 2.2.2).
This proposal does not affect backwards compatibility.
SEP 19 relies heavily on prov-O provenance ontology terms which are freely mixed with SBOL terms. SEP 20 instead tries to establish a clear separation of concerns: SBOL terms are used to describe the core information of What we are looking at. Prov-O terms are used to add an optional layer of information about How something was created. In practice, this means the proper SBOL terms
attachment are used to link Experimental Results,
Model (the "What").
SEP 19 gives no less than three different ways of answering the "What are we looking at" question: (1) wasDerivedFrom, (2) built, (3) design via
Activity. None of these redundant links is mandatory and it is not clear which one is preferred.
wasDerivedFrom has the serious issue that it is also being used for (i) computational version tracking and (ii) tracking of clonal relationships between samples; moreover, (iii) it could be used for pointing to structural building blocks (Implementations) used during construction.
Built seems to be the most obvious and straightforward choice but it does NOT fit either because it is explicitly reserved for "actual" structure or "actual function". Upon creation of a batch or clone in the lab, the actual physical structure is not yet known (functional behavior may never be known, is a can of worms to even specify, and is irrelevant for the question of what something is). As we MUST NOT create an Implementation record that is not pointing to its content -- after all I am holding a physical sample in my hands and have to say what the heck it is -- this only leaves us with "design".
design according to SEP 19 is formulated using a prov-O record comprising around 8 different classes and fields. The link may furthermore be indirect (going through "parent" Implementations), which means it needs to be inferred by iterating over a potential chain of Implementation instances. This chain may easily become interrupted as there is no guarantee or mechanism to ensure that all parent Implementations are in the same SBOL document or even accessible elsewhere.
SEP 20 solves these issues with a direct and mandatory field
design. All use of provenance ontology terms and constructs is made optional.
As a side effect, this leads to a massive reduction of prov-O overhead and reference redundancy. In SEP 19, the expression of the design link via a prov-O
Activity comprises a massive overhead of, in many cases redundant, fields and classes (8 or 9 interlinked entities, likely translating to 20 to 40 lines of XML code) only so that additional meta information could theoretically be added on top of that. SEP 20 replaces this by a single pointer. This will be sufficient for most use cases. If, and only if, additional meta information is available, prov-O
Activities can be layered on top of that.
The high number of redundant references in the SEP 19 prov-O model (see the Figure in Example 1 above) also requires a very high number of new validation rules that verify that two different but equivalent prov-O pointers indeed reference the same
ComponentDefinition object. Realistically, we have to assume SBOL documents will not always meet all validation rules. SEP 19 therefore creates an abundance of possibilities for serious bugs where software tools receive conflicting references.
The use of
Implementation.wasDerivedFrom for strain ancestry (section 2.2.3) creates a potential conflict with versioning:
wasDerivedFrom is currently also used to link different (computational) versions of SBOL records. (This problem is even more acute in SEP 19 where
wasDerivedFrom is heavily used for non-versioning related links.) This issue requires further discussion. One possible alternative to
wasDerivedFrom would be the use of
generatedBy pointing to an
Activity. This Activity could then be annotated with an OBI term describing the exact experiment and relation used to derive one clone from another.
- Define a new "sbol:ancestor" (or similar) term to be associated with a prov:Usage record within the activity.
- Invent a custom SBOL term for physical ancestry of batches and strains.
A mostly philosophical difference between this proposal and SEP 19 is that SEP 19 considers the
Implementation to be "generated" or "derived" from the design. In my view, a given
Implementation either implements a design or it does not. I would argue that there should not be any
wasDerivedFrom link between
Implementation and design. One could easily argue that a given implementation has been primarily generated not from a design but from the physical building blocks that it was assembled from.
However, this argument does not rule out custom "usage" links from
activity records to
SEP 16 suggested the introduction of an
Implementation.status field pointing to one of a fixed set of possible values: "validated_correct", "validated_incorrect", "not_validated", "validated_ambiguous". This field was explicitly only supposed to deal with sequence validation without touching the more complicated issue whether or not a given clone in fact "works" as expected.
The field may indeed be required if we insist that
design always points to the intended design and
built is only used if there are known deviations from this design. In this case it becomes difficult to distinguish between a clone that has been shown to be exactly as specified in the design and another clone for which this is simply not yet known. Also, clones that are shown to be incorrect (e.g. fail sequencing) could be more easily flagged without always attaching the underlying sequencing information.
On the other hand, this kind of information can only be an (albeit useful) shortcut and would almost always benefit from additional meta information (how was a batch validated, how was the result evaluated and by whom?). For this reason, and in order to minimize differences with SEP 19, it has not been included here.
This proposal is a merge of SEP 16 and SEP 19 after discussion on the GitHub issue tracker, on the "Design-Build-Test" thread (https://groups.google.com/forum/#!topic/sbol-dev/AnpwJP2_f5A), and during a conference call in November 2017.
SEP 20 relies on SEP 18 (definition of "Attachment").
SEP 20 is meant to replace SEP 19.
To the extent possible under law, SBOL developers have waived all copyright and related or neighboring rights to SEP 020. This work is published from: United States.