SEP 013 -- Sequence insertion and replacement
|Title||Sequence insertion and replacement|
|Authors||Jacob Beal (firstname.lastname@example.org), Raik Gruenberg|
A new, optional,
Component.sourceLocation field is
Component.sourceLocation allows only a portion, rather than the
entirety, of a ComponentDefinition's sequence to be included in the Component's
parent. This serves two related purposes: (1) To support the insertion of parts
within a larger plasmid or genome "backbone". Typically, this will replace a
small segment of the backbone which, currently, cannot be expressed in SBOL. (2)
To support the "trimming" of part boundaries as it may happen during DNA
Table of Contents
- 1. Rationale
- 2. Specification
- 3. Example or Use Case
- 4. Backwards Compatibility
- 5. Discussion
- 6. Competing SEPs
The SBOL sequence data model so far assumes that parts are joined together such that A + B + C gives exactly ABC. However, there are several very common scenarios where the final sequence of a genetic design is not exactly the sum of its parts:
(1) A designed sequence (e.g. synthetic gene or operon with many genes) is inserted into a common plasmid backbone or into a genome. Typically, a small but variable segment of the target vector is replaced in the process. This insert + backbone scheme probably accounts for 90% of all genetic designs published until today. Depending on assembly method, the insertion site and the extent of replacement often is variable and cannot be predicted by the designer or publisher of the backbone part.
(2) A larger sequence is assembled from re-usable fragments (e.g. promoter, RBS, CDS, terminator) but the original fragments are "trimmed" in the process. This may be a design decision (e.g. "Let's shorten this linker sequence") or a mere technical consequence of DNA assembly. For example, neighboring parts may have been registered with a complementary homology region, one copy of which disappears during assembly.
1.1 current situation
SBOL can already cleanly represent complete insertions (without replacement),
such as ABC + X --> ABXC, by using the ability to put multiple
Locations on a
Location records are interpreted as the part
being split up in the parent (product) sequence. This is equivalent to the
"JOIN" statement in genbank.
At least in theory, complete insertions of a part can be handled with a
statement such that ABC -> AXC. However, this requires that the exact
replacement site is aready annotated as a part when the backbone part is first
published. This is, in practice, rarely the case, with the notable exception of
certain highly structured Golden Gate cloning approaches such as MoClo.
Deletions such as ABC -> AC are not handled by current SBOL. Likewise, part boundaries cannot be changed.
The basic idea of this proposal is that a Component should be able to include only a portion of its source sequence rather than necessarily the entirety of it. To do this, we add a new optional field, Component.sourceLocations which can be used to specify which portions of the source's sequence are being used.
On the vector or "backbone" side, this allows us to "lose" some basepairs around the insertion site. On the "insert" side, this allows us to trim the source part before inserting or adding it to the final product.
On Component, add an optional field "sourceLocations" with cardinality 0..*, which contains a set of Locations.
If Component.sourceLocations is not set, then the whole sequence is assumed to be included.
If Component.sourceLocations is set, then the relationship between the original sequence and the sequence included is defined precisely complementary to SequenceAnnotation.locations.
[SBOL 3.x]: If a
Component has one or more
locations, the sequence range or
sequence ranges defined by Component.sourceLocation MUST add up to exactly the
same length as the range(s) covered by the Component.location record(s).
[SBOL 2.x]: For each Component.sourceLocation, if there are one or more locations associated in the containing ComponentDefinition (via SequenceAnnotation.location) then those locations must define a sequence range of exactly the same length.
Rule sbol-10520 will need to be relaxed to account for partial sequence inclusion. Additional new rules will need to be added, symmetric with those for SequenceAnnotation.locations
3.1 insertion into a plasmid
Let's assume we want to describe a plasmid "myPlasmid" which has been created by inserting a single part "cargo" into a common plasmid backbone called "backbone". The insert is 50 bp long and will replace 10 base pairs of the vector backbone.
CD backbone sequence = Sequence of length 200 CD cargo sequence = Sequence of length 50 CD myPlasmid sequence = final Sequence of length 240 Component insert definition -> cargo location = Location 101..150 Component backbone definition -> backbone location = Location 1..100, 151..240 sourceLocation = Location 1..100, 111..200
Note that Component "backbone" can be immediately validated by checking that each of the sourceLocations has a (target) location of exactly matching length. In this particular example, location records cover the full sequence without gaps. The final (product) sequence of "myPlasmid" could therefore be constructed from the two components and there is, stricly speaking, no need to list it again.
Below is a UML object diagram that is another illustration the above example. Note
3.2 BioBrick assembly
The BioBrick cloning standard defines a DNA prefix and suffix sequence to allow iterative cloning:
PREFIX[part A]SUFFIX + PREFIX[part B]SUFFIX --> PREFIX[part A]scar[part B]SUFFIX
However, the original standard (RFC 10) did not support protein fusions -- the scar sequence was 4 bp long and contained a STOP codon. An alternative Biobrick standard (RFC 25) therefore extended the original prefix and suffix with an additional set of "inner" restriction sites that could be used for protein fusions. RFC 25 is therefore compatible with both RFC 10 assembly and with protein fusions. A schematic overview:
PREFIXo0o~[part A]~xxxSUFFIX + ... --> PREFIXo0o~[part A]~~[part B]~xxxSUFFIX
The iGEM parts registry, for a long time, only supported RFC 10 parts. RFC 25 parts are therefore often registered like this:
That means, the inner part of the RFC 25 prefix (
TGGCCGGC) and suffix
ACCGGTTAAT) are included in the part definition. Using
product of assembling part A and B can nevertheless be described in SBOL:
CD partA sequence = TGGCCGGC part A ACCGGTTAAT CD partB sequence = TGGCCGGC part B ACCGGTTAAT CD part AB sequence = TGGCCGGC part A ACCGGC part B ACCGGTTAAT Component A sourceLocation 9..110 location 9..110 Component B sourceLocation 9..110 location 117..128
CD ... ComponentDefinition
Note: The example assumes part A and B are both exactly 100 bp long. The
ACCGGC is not assigned to any sub-part definition as it does
not carry any function. The description of part AB therefore MUST include a
Similar issues with the definition of part boundaries may also arise in modern
Golden Gate based assembly standards such as MoClo and GoldenBraid. MoClo and
GoldenBraid define similar "prefix" and "suffix" sequences with IIS restriction
sites, custom 4 bp overhangs as well as "filling" nucleotides. Depending on
software and user preference, prefixes, suffixes and assembly overhangs may
either be considered to belong to the part or not. Consequently,
sourceLocation gives the flexibility to cleanly describe DNA assembly products
from such parts in either scenario.
This SEP is backward compatible, in the sense that all prior SBOL 2.x files remain valid and retain their exact same meaning.
An old tool reading a file that uses Component.sourceLocation, however, will not know how to correctly interpret the relationship between the two sequences affected by Component.sourceLocation, however, and should report violation of rule sbol-10520.
This seems unlikely to be a major issue for two reasons:
- sbol-10520 is a "best practices" rule, not a requirement
- Any well-designed tool should react by refusing to make inferences about a sequence relationship that it does not understand, thus at least not damaging files.
There are at least two scenarios in which an SBOL 2.1 compatible tool may create a wrong sequence when presented with a Component.sourceLocation field:
- The tool interprets the unknown
sourceLocationfield as a custom annotation (ignored but passed on)
- There is no
locationrecord on the Component's SequenceAnnotation, e.g. because the sequence is defined using relative Sequence constraints.
Another failure scenario is that there is a
location record but the tool
choses to ignore the fact that the length defined by
location and the length
of the sub-part are not matching.
Although we need sourceLocation to be available at Component, it is worth considering whether it should be placed on its parent, ComponentInstance, in order to make sourceLocation available to FunctionalComponent as well.
A ModuleDefinition, however, has no Sequence for a "partial import" to be targeting. Since it is a different type of representation, every ComponentDefinition that is imported must be imported entirely. The "partial import" analog would not be on FunctionalComponent but some form of "partial import" field on Module. We are not currently aware of any compelling case for such, however (ModuleDefinitions do not have the equivalent of the massive sequence information associated ComponentDefinitions), we do not propose any such modification at this time, but note that it would be reasonable to consider in the future if sufficient motivation appears.
A competing proposal has been put forth (though not yet formalized into an SEP), for representing sequence modifications through MapsTo. The basic idea of this proposal is to make use of the "useLocal" option on MapsTo to specify a modification of the sequence. A basic example would look like this:
CD InsertionSite1 ... CD InsertionSite2 ... CD InsertionSite3 ... CD backbone Component InsertionSite1 Location 11..20 Component InsertionSite2 Location 12..19 Component InsertionSite3 Location 10..21 Sequence of length 200 bases CD cargo Sequence of length 20 bases CD MyInsertion Component MyCargo Location 11..30 Definition cargo Component MyBackBone Location 1..9, 31..210 Definition backbone MapsTo Local MyCargo Remote InsertionSite1
For deletion, we can either use a sequence of length 0 or we can allow a MapsTo to have no local, implying deletion (change to the current data model). This approach is advantageous in that it can be technically implemented without any modification of the data model (using empty sequences). Conceptually, however, the MapsTo approach represents a major kludge, in that it requires the introduction of a potentially large number of ComponentDefinitions without any particular engineering meaning, that only exist to represent statements about potential insertion sites.
Our community has already strongly rejected this idea in SEP004, which allows us to represent small statements about sequence features, like targets and cut sites, without having to separate those fragments (which do not have an independent existence) as ComponentDefinition. We thus have a clear precedent for "ComponentDefinition" representing a unit of engineering modularity, rather than a unit of sequence information. Moreover, under the MapsTo approach, describing a new ComponentDefinition forces one to modify an existing ComponentDefinition, which seems to be a clear violation of modularity. There are also technical issues, such as interaction with sbol-10520 and sbol-10526, but the major challenge is the conceptual change implied by the use of MapsTo in this fashion.
It may or may not be possible to express the extraction of a part using Prov-O provenance terms. Example 3.1 could potentially be re-formulated like this:
CD backbone ## original backbone record; defined by somebody else sequence = Sequence of length 200 CD myBackbone ## re-definition to remove insertion site sequence = Sequence of length 190 prov:wasDerivedFrom -> backbone prov:wasGeneratedBy -> ??SequenceExtract?? ??range?? 1..100 ??range?? 111..200 CD cargo sequence = Sequence of length 50 CD myPlasmid sequence = final Sequence of length 240 Component insert definition -> cargo location = Location 101..150 Component backbone definition -> myBackbone location = Location 1..100, 151..240
Equivalent "activities" for "??SequenceExtract?? and ??range?? may or may not
exist in the prov-O ontology. SBOL could define a custom activity
"SequenceExtract" and use
Location records within this activity. This would be
equivalent to the
sourceLocation proposal but would be much more verbose and
indirect. An additional disadvantage may be that the truncated part may, in
fact, be defined in another document which makes it more complicated to validate
that extracted sequence and the target sequence defined by
location have exactly
the same length.
In addition, it may be possible to encode Example 3.1 using both the sourceLocation property and Prov-O in a complementary fashion. In the UML object diagram below, the insertion of a cargo ComponentDefinition (pink) into a backbone ComponentDefinition (also pink) is specified by making them sub-Components (orange) of a plasmid ComponentDefinition (red) with associated Component sourceLocations (orange) and Sequence Annotation locations (yellow) as described earlier. The insertion is also specified in this diagram by an in silico assembly Activity with Usages (orange) of the cargo and backbone, plus a Plan (also orange) for their assembly. In this way, a tool that does not fully support DNA assembly could still interpret the composition of a new design in terms of non-modular pieces of previous designs by referring to the location and sourceLocation properties, while a tool that does support DNA assembly could expose a richer view of the biochemical mechanisms by which this composition could occur by referring to the Usages and assembly Plan.
None at present.
To the extent possible under law, SBOL developers has waived all copyright and related or neighboring rights to SEP 013. This work is published from: United States.