SEP 014 -- Using SBOL to Model the Design-Build-Test Cycle #31

Closed
bbartley opened this issue Jul 10, 2017 · 40 comments

Comments

@bbartley
Contributor

Linking experimental data with SBOL designs is becoming critical to a number of important projects. Therefore this SEP introduces a Design-Build-Test data model for SBOL. SEP 14 is here.

@cjmyers
Contributor

cjmyers commented Jul 13, 2017

  1. I’m actually open to the idea of productionStatus being a new field, since if we need this for ModuleDefinition, then we would need to add type anyway. That being said, we may want to add type to MD anyway to be able to say if a MD represents a host. For now, I’m guessing we do this with a TaxonId annotation. Could this field be a Boolean? Do we have more than two potential values? Does it need to be 0..* or should it be 0..1?

  2. In the Test class, “hashes” should be “hash”, and the format URL link to NuML is broken. Any idea of a good ontology for formats? I would actually prefer that this was a required field. Not sure whether the URI Plan class works for protocol or not.

  3. Prefer not to have a validation rule for Test not being allowed for designs. I think we should be open to using Test to point to simulation data too.

  4. I’m not excited about Test being linked to from CD, since I would prefer they indicate the host/environment of the test using a MD. However, I imagine that we might need to allow this for those reluctant to use MD class. However, I’m really unsure what it means to test a CD independent of a host/environment. On second thought, maybe this is a good way to motivate the MD class to those working at the CD level.

@graik
Contributor

graik commented Jul 14, 2017 via email

@cjmyers
Contributor

cjmyers commented Jul 15, 2017

I was thinking a bit more about the Test class, and I thought of a potential problem. The current idea assumes one protocol produces one data file. I think that may not be true. One experiment I could imagine could produce more than one set of data, or certainly multiple representations of the data. For example, you might want to attach to a “Test”, a link to the raw data, a link to the processed data, and a link to a graphical representation of the processed data. In order to address this issue, I considered perhaps what we want to do is:

  1. Formalize the Attachment class that SynBioHub is using as proper SBOL, and allow all TopLevel (or Identified) objects to be able to reference Attachments.
  2. Reduce the Test class to a protocol and it would then be able to have 0 or more Attachments for the data.

However, if Test is only a protocol, then I was wondering if PROV-O actually is the solution. Namely, we could do something like:

  1. Link a ModuleDefinition to an Attachment that includes raw data, the Attachment would have a wasGeneratedBy pointing to an Activity that references the protocol used to generate this data.

  2. Have a 2nd Attachment which is the processed data that has a wasGeneratedBy pointing to the Activity that processes the raw data.

  3. Have a 3rd Attachment which is the graphical representation that has a wasGeneratedBy pointing to the Activity that graphs the processed data.

Essentially use PROV-O to stitch the entire design-build-test flow together.
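A minimal sketch of how such a chain might be walked, assuming each Attachment carries a wasGeneratedBy link and each Activity records the Attachments it used. The class layout, the `protocol` and `used` fields, and the `ex:` URIs are hypothetical illustrations, not part of SBOL or SynBioHub.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Activity:
    """Stand-in for a prov:Activity (hypothetical layout)."""
    uri: str
    protocol: Optional[str] = None  # assumed link to a protocol description
    used: List["Attachment"] = field(default_factory=list)  # prov:used


@dataclass
class Attachment:
    """Stand-in for an Attachment that points at a data file."""
    uri: str
    was_generated_by: Optional[Activity] = None  # prov:wasGeneratedBy


def activity_chain(attachment):
    """Collect activity URIs from this attachment back to the raw data.

    Only the first `used` input of each activity is followed, which is
    enough for the linear raw -> processed -> graph flow described above.
    """
    chain = []
    current = attachment
    while current is not None and current.was_generated_by is not None:
        activity = current.was_generated_by
        chain.append(activity.uri)
        current = activity.used[0] if activity.used else None
    return chain


raw = Attachment("ex:raw_data", Activity("ex:run_assay", protocol="ex:assay_protocol"))
processed = Attachment("ex:processed_data", Activity("ex:process", used=[raw]))
plot = Attachment("ex:plot", Activity("ex:graph", used=[processed]))
print(activity_chain(plot))
```

Starting from the graphical representation, the walk recovers the graphing, processing, and assay activities in that order, so a tool could trace any file back to the protocol that produced the raw data.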

@bbartley
Contributor Author

Responding to Raik first.

Sample... has some basic history of how it was derived from other samples.

Assuming this SEP is enacted, sample history can in fact be captured using the PROV-O classes which are already part of the data model.

In my own practice, it is absolutely crucial that, e.g., sequencing data are NOT directly linked to a plasmid record but are instead linked to a SAMPLE object.

This is one of the motivations for this SEP, and your experience corroborates that of myself and the other authors. It is necessary to distinguish what the user intended to build (design) from what the user actually built (build). In this SEP, we represent a sample by using a ComponentDefinition with productionStatus:build.

An experimentalist needs to know which sample an experiment was performed on. Each clone (of cells or DNA derived from those cells) potentially has unknown mutations, samples become corrupted or mixed up etc pp and we may need to re-validate them or want to re-use them later.

What you describe is encompassed by this SEP. See Example 1. A Test can be associated with a ComponentDefinition representing a clone.

In any case, ComponentDefinition should not become mixed up with this concept of an experimental sample. It already is too broadly defined as it stands.

There are two main reasons for using a ComponentDefinition to represent a sample:

  • In some cases, it may be necessary to use SequenceAnnotations or Components to describe the substructure of a sample, especially when the sample does not match the target. Therefore it is advantageous to use ComponentDefinitions to represent both a design and a build (sample). For further discussion, see the third paragraph under Production Status.
  • The consensus sequence for a given plasmid clone or sample is represented by the Sequence object that is associated with the ComponentDefinition representing the build. See Example 1. The target sequence is represented by a Sequence associated with a design.

Thanks,
Bryan

@bbartley
Contributor Author

Now responding to Chris

One experiment I could imagine could produce more than one set of data, or certainly multiple representations of the data.

Agreed.

Formalize the Attachment class that SynBioHub is using as proper SBOL, and allow all TopLevel (or Identified) objects to be able to reference Attachments.

Specification of the Attachment class goes beyond the scope of this SEP. However, this SEP is compatible with that vision. See Relation to Other Proposals for discussion of Attachments and associated metadata.

Reduce the Test class to a protocol and it would then be able to have 0 or more Attachments for the data.

The latest revision to this SEP does essentially this. All metadata has been stripped from the Test class. Currently, a Test refers directly to external files through its attachments property. No metadata is specified. Tooling will have to infer the data type of the attachment from its file extension, but in the short term this should be workable.

However, the attachments property could be easily co-opted in the future to refer to Attachment objects which contain important metadata about an external file link.
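As a rough illustration of the extension-based inference mentioned above, a tool might do something like the following. The `EXTENSION_HINTS` table and the function name are assumptions made up for this sketch; a real tool would need its own, much longer mapping.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Illustrative mapping only; not defined by the SEP.
EXTENSION_HINTS = {
    ".ab1": "Sanger sequencing trace",
    ".fastq": "sequencing reads",
    ".fcs": "flow cytometry data",
    ".csv": "tabular data",
}


def guess_data_type(attachment_uri):
    """Guess what an attached file contains from the extension in its URI.

    Returns None when the extension is missing or unknown, which is the
    case where formally specified metadata would be needed instead.
    """
    suffix = PurePosixPath(urlparse(attachment_uri).path).suffix.lower()
    return EXTENSION_HINTS.get(suffix)


print(guess_data_type("https://example.org/data/clone1.ab1"))
print(guess_data_type("https://example.org/data/results"))
```

The second call returns None, which shows the limit of this approach: once a link has no recognizable extension, only explicit metadata can say what the file is.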

@graik
Contributor

graik commented Jul 20, 2017 via email

@cjmyers
Contributor

cjmyers commented Jul 20, 2017 via email

@bbartley
Contributor Author

How do you want to encode what buffer a DNA molecule is stored in? How do you want to encode the concentration of it? How do you want to encode the fact that there is a mixture of molecules (each with its own concentration) within this sample?

That is beyond the scope of this SEP. These issues are important, but we won't quickly agree on how to represent a sample. The issue addressed by this SEP is more fundamental -- does a ComponentDefinition represent a concept in the user's head, or is it actually describing the structure of an entity in the real world. Design, Build, Test. What stage of the synthetic biology life cycle are we in?

(Also, I think I'm using the word clone slightly differently than you. I'm using it to refer to a plasmid clone, such as you might isolate during the sequence verification process. I'm not using it to refer to a cell clone or freezer stock.)

Thanks
Bryan

@cjmyers
Contributor

cjmyers commented Jul 20, 2017 via email

@bbartley
Contributor Author

UML updated.

Perhaps others can comment on whether an Attachment class should be included in this SEP.

@jamesamcl
Member

The problem is that Attachments are not simple. We can't just take the SynBioHub idea of Attachment and formalize it into SBOL directly. For example, SynBioHub attachments don't provide any information about where to retrieve the attachment from, only the file hash. We also need to decide how to represent the type of the file (e.g. MIME types), etc.

Also, in SynBioHub, Attachments can be attached to absolutely anything, so they are not just related to the Test class, which I think puts them beyond the scope of this SEP.
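To make the open questions concrete, here is a sketch of the minimum an attachment record would probably need: a retrieval location, a content hash, and a file type. The field names and the choice of SHA-256 are assumptions for illustration only and do not describe what SynBioHub stores.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class AttachmentRecord:
    """Hypothetical attachment metadata; not an SBOL or SynBioHub class."""
    source: str      # where the file can be retrieved from
    sha256: str      # hex digest of the file content (algorithm is an assumption)
    media_type: str  # file type, e.g. a MIME type


def describe(source, content, media_type):
    """Build a record for `content` (bytes) hosted at `source`."""
    return AttachmentRecord(source, hashlib.sha256(content).hexdigest(), media_type)


def matches(record, content):
    """Check that retrieved bytes are the file the record describes."""
    return hashlib.sha256(content).hexdigest() == record.sha256


record = describe("https://example.org/plate1.csv", b"well,od600\nA1,0.42\n", "text/csv")
print(matches(record, b"well,od600\nA1,0.42\n"))
```

With only the hash, a client can verify a file but cannot find it; with only the location, it can find a file but cannot tell whether it has changed.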

@cjmyers
Contributor

cjmyers commented Jul 21, 2017 via email

@graik
Contributor

graik commented Jul 21, 2017 via email

@bbartley
Contributor Author

Hi Raik, Chris

What we found at HARMONY is that we don't have a clear consensus about how to represent samples. Chris is not alone in arguing that ModuleDefinition might be used to represent details about a sample. That is why I deliberately chose to limit the scope of this SEP. Its purpose is not to describe samples in detail. However, it does support sequence verification workflows. From my point of view, that is fundamental.

Best
Bryan

@bbartley
Contributor Author

Hi Chris,

James: are you willing to put forward an SEP for attachments then in short order. We should get that one approved before approving the experimental data one, since it will depend on it.

From your point of view, why is an Attachment class REQUIRED in order to implement this SEP?

I see this current revision as entirely workable, with an easy migration path towards adding Attachments to the data model in a future proposal.

  • In this revision we add an attachments property to a Test. If we later move attachments to a TopLevel property, then this is a very easy modification to the libraries, and SBOL files with an attachments property on Test will continue to be backward compatible.
  • In many cases, I can easily infer the purpose of an attached file from its file extension, even without formally specified metadata about an attachment
  • The immediate problem that Jake described, which motivated this proposal, was linking experimental data to designs. Not data exchange. This SEP provides an immediate and workable solution to that problem.
  • Any metadata that Jake needs about his file attachments can be added as generic annotations

@jakebeal I think your input might be a critical tie breaker on this. Do you see the current SEP as workable, or would you like us to work out the semantics of Attachments?

@jakebeal
Contributor

Here is my take. I believe that SBOL's value comes primarily as an "integration hub" for linking different aspects of biological engineering workflows. This means that, while we do not want to get down into the weeds of LIMS systems, metrology, and experimental data exchange, we do need to be able to represent the critical engineering decisions associated with them.

To this end, I see a high degree of value in being able to distinguish between "idealized" engineered artifacts (design) and realized instances. Critically, PROV-O lets us link these cleanly, as well as potentially attaching protocol descriptions to explain how we got from a design to a sample. PROV-O also lets us cleanly link an intended design to a realized design.

I also see it as worthwhile to allow this distinction to be attached to both ModuleDefinition and ComponentDefinition. The key point of this distinction is not the fine details of what happens in the lab (I agree those are best left to LIMS systems), but to have a clean representation of critical engineering decisions. For example, consider Raik's example of DNA being stored in a particular liquid buffer. We should be able to represent this in two different ways:

  1. Here is some DNA, stored in a way we expect to not have to care about as long as you do it "normally." Here the design would be represented by a ComponentDefinition.
  2. Here is some DNA whose storage medium is an unusual and notable part of the design. Here the design would be represented by a ModuleDefinition, which includes the media as a FunctionalComponent.

Thinking about it from this perspective, my expectation is that when it comes to physical samples, ModuleDefinition is more likely to be useful for talking about experiments with actual cells, while ComponentDefinition is more likely to be used for talking about a construction process and verification.

So far, so good, and I think without any controversy.

As I am working out more use cases, however, I am becoming uncomfortable with the particulars of this proposal, and my discomforts are leading me to an alternative that I think is still quite simple. Here are some of my sources of discomfort:

  • Let's say I'm building a simple circuit, described by a ModuleDefinition. When I create a sample to test, that involves copying the ModuleDefinition, so that we can mark it as now being a physical object. But I don't really know what I've got in that sample, only what I intended to have. Later, I might measure it and confirm it or find I need to adjust the model, but until I do so the "physical" ModuleDefinition is still really an intention and not a known reality.

  • Sometimes we build something, it works, then we sequence it and find out what worked was actually a beneficial mutant. We then add that to the collection of ComponentDefinitions, where it goes from being physical back to being a design.

These are pointing me toward a conclusion that while I think the (extremely simple) information we're trying to encode is the right information, we need to make an adjustment in the representation. Since this comment is getting super-long, I will follow with another comment with my new proposal.

@jakebeal
Contributor

Here is my alternate proposal, which tries to capture the same information with the following two differences:

  1. A cleaner distinction between intention and reality
  2. Designs aren't forked until you actually know they differ from their original.

New classes, with their fields:

  • Sample: this represents something physical
    • field: specification [1]: link to a ComponentDefinition or ModuleDefinition
    • field: data [0 .. *]: links to Data objects

In my proposal, the Sample class plays exactly the same role as the productionStatus field in the current proposal. A Sample is equivalent to a derived ComponentDefinition / ModuleDefinition with its productionStatus set to build. The difference is that we don't have to copy all of the sub-structure of the CD/MD, just link to it. If the reality turns out to be different, then we can fork the CD/MD then, using PROV-O to link just as we would have before. We can also use PROV-O to link the Sample to its intended CD/MD, in order to represent the whole process: "Sample X was supposed to be an instance of ModuleDefinition A, but instead I ended up with ModuleDefinition A'"

The data field is identical to tests, just renamed to follow my proposed adjustment to that class.

  • Protocol: this is a placeholder class like Model, linking to an external protocol specification
    • field: source [1]: URI
    • field: language [1]: URI

The fields of this class (and Data, below) are modeled exactly after Model. At some later point we may add more fields, but not in this proposal. The idea is that Protocol gets used as part of PROV-O links talking about the derivation of a Sample from a ComponentDefinition or ModuleDefinition or of one Sample from another Sample.

  • Data: this is a placeholder class like Model, linking to an external data object / file / or collection
    • field: source [1]: URI
    • field: format [1]: URI

Mostly, I just renamed Test here to expand the notion that data can come from any stage of sample manipulation, not just a "testing" stage. We don't need the protocol field because it can be embedded with PROV-O if desired, just as for the samples. I also propose dropping the fields focused on data transport.
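The three classes above could be sketched as plain records like this. Only the field names and cardinalities come from the proposal; the `uri` identity field, the Python representation, and the `ex:` URIs are assumptions added for the sketch.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Data:
    uri: str
    source: str  # [1] URI of the external data object, file, or collection
    format: str  # [1] URI identifying the data format


@dataclass
class Protocol:
    uri: str
    source: str    # [1] URI of the external protocol specification
    language: str  # [1] URI identifying the protocol language


@dataclass
class Sample:
    uri: str
    specification: str  # [1] URI of a ComponentDefinition or ModuleDefinition
    data: List[Data] = field(default_factory=list)  # [0..*] links to Data


# Many samples can point at one design without copying its sub-structure.
design_uri = "ex:ModuleDefinition_A"
samples = [Sample(f"ex:sample_{i}", design_uri) for i in range(3)]
samples[0].data.append(
    Data("ex:data_0", "https://example.org/plate1.csv", "ex:format/csv")
)
print([s.specification for s in samples])
```

Each Sample is only a lightweight pointer: the full description stays in the single ModuleDefinition until a measurement shows that a sample differs from it.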

@graik
Contributor

graik commented Jul 21, 2017 via email

@jakebeal
Contributor

I'm not deeply attached to the name. Let's talk about the data model first, however, and then make sure we get the best synonym.

@graik
Contributor

graik commented Jul 21, 2017 via email

@jakebeal
Contributor

Exactly.

@bbartley
Contributor Author

bbartley commented Jul 21, 2017

Hi Jake,

I'm not sure all your criticisms are fair, and therefore I don't see a need for a new proposal. Please see my response to your comments below.

...until I do so the "physical" ModuleDefinition is still really an intention and not a known reality.

I don't understand this. If something is "physical", it is real.

Sometimes we build something, it works, then we sequence it and find out what worked was actually a beneficial mutant. We then add that to the collection of ComponentDefinitions, where it goes from being physical back to being a design.

I feel like this use case is perfectly accommodated by the current SEP. I thought I had explained it in the text, but I see now I just half explained it. Anyway, I don't think this is really a problem, and I can update the SEP to explain this in more detail.

In your proposal, you state:

The difference is that we don't have to copy all of the sub-structure of the CD/MD, just link to it.

This is already explicitly stated in the current proposal. There is also a pretty clear UML diagram of this in Example 1:

For a given design many builds may be generated. In general, the design should serve as a reference to which builds are compared either for quality control (sequence verification) or comparison of observed versus expected output (experimental data vs. model predictions). Therefore, as a best practice, a user SHOULD NOT recursively copy all the Components and Modules which describe the compositional hierarchy of a design over to each new build generated, as this would be inefficient and redundant. A build object SHOULD be a simple ComponentDefinition or ModuleDefinition containing no subparts

A cleaner distinction between intention and reality

This argument has little weight with me now. Some of us argued at HARMONY for taking a more explicit, knowledge-representation approach. There were 3 possibilities discussed:

  1. Derive Design and Build from CD. That was the original proposal.
  2. Add new TopLevel classes, for Design and Build, and reference a CD or MD from them. This is similar to your approach here with Sample.
  3. Use an ontology term, because in the future we might want to add more detailed stages other than design and build.

We went with 3, which was a concession to you, Jake! Now it appears we are back to something like option 2.

I just renamed Test to expand the notion that data can come from any stage of sample manipulation

Can you provide an example of sample manipulation that would not qualify as a Test?

One thing I would like to emphasize. Our current proposal defines clear semantics about where data should be attached. A Test class represents empirical data. A Model represents simulation data. These each occupy a special place in the Design-Build-Test-Learn cycle (see Example 3). I think it is very important that Test and Model remain conceptually distinct and explicit. What would make sense to me is deriving both Test and Model from an abstract Data class.

Furthermore, this SEP has another clear semantic about where data should be attached. Structural data (including sequence verification data) should be attached to CD. Characterization data should be attached to MD. This means client tooling has a very good idea where to look for certain kinds of data. I feel like this is an important consideration, since we seem to be discussing adding Data or Attachments to arbitrary SBOL objects.

Regards,
Bryan

@jakebeal
Contributor

Let me focus on the heart of my concern, which is my discomfort with exactly this part of the proposal:

Therefore, as a best practice, a user SHOULD NOT recursively copy all the Components and Modules which describe the compositional hierarchy of a design over to each new build generated, as this would be inefficient and redundant. A build object SHOULD be a simple ComponentDefinition or ModuleDefinition containing no subparts

With this best practice, we would be recommending effectively using a ComponentDefinition or ModuleDefinition only as a pointer to another "master" copy, by means of the PROV-O link. That is a very different usage than we have ever had previously. Critically, the ComponentDefinition (or, equivalently ModuleDefinition) is no longer "self-contained," in the sense that you can find out what it is just by looking at child Components, Sequences, etc. Moreover, wasDerivedFrom can have multiple links, per SEP012 --- what does it mean if we link an "empty" ComponentDefinition to multiple sources by wasDerivedFrom? How do we even reason about this or effectively detect it? Is this new usage limited only to "design"/"build" relations or can it relate between two designs as well?
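To show where the reasoning gets difficult, here is a sketch of what a tool would have to do to find the structure behind a "shallow" build. The `ComponentDefinition` stand-in, the `registry` lookup, and the rule that exactly one wasDerivedFrom link is required are all assumptions for illustration; the SEP does not define this behavior.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ComponentDefinition:
    """Simplified stand-in holding only the properties relevant here."""
    uri: str
    components: List[str] = field(default_factory=list)
    sequences: List[str] = field(default_factory=list)
    was_derived_from: List[str] = field(default_factory=list)


def is_shallow(cd):
    """A CD with no child Components and no Sequences describes nothing itself."""
    return not cd.components and not cd.sequences


def resolve_structure(cd, registry):
    """Follow wasDerivedFrom until a CD that carries structure is found.

    Raises ValueError when a shallow CD has zero or several sources,
    since there is then no single "master" copy to consult.
    """
    seen = set()
    while is_shallow(cd):
        if cd.uri in seen:
            raise ValueError(f"{cd.uri}: wasDerivedFrom cycle")
        seen.add(cd.uri)
        if len(cd.was_derived_from) != 1:
            raise ValueError(
                f"{cd.uri}: {len(cd.was_derived_from)} wasDerivedFrom links"
            )
        cd = registry[cd.was_derived_from[0]]
    return cd


design = ComponentDefinition("ex:design", sequences=["ex:design_seq"])
build = ComponentDefinition("ex:build", was_derived_from=["ex:design"])
registry = {cd.uri: cd for cd in (design, build)}
print(resolve_structure(build, registry).uri)
```

A shallow CD derived from two sources makes `resolve_structure` fail, which is one concrete form of the question of how to reason about multiple wasDerivedFrom links.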

I know that my position was different last month, but as I've been working through my use cases, I've been getting progressively more uncomfortable with the repurposing of ComponentDefinition and ModuleDefinition as a sort of proxy pointer. I feel that this is a larger change of the meaning of the data model than is being accounted for, but if we prohibit this usage, then we have lots of cloning and the problems of describing something before we can verify it.

This is the core of my concerns, and I believe this issue needs to be addressed one way or another.

@bbartley
Contributor Author

Hi Jake,

With this best practice, we would be recommending effectively using a ComponentDefinition or ModuleDefinition only as a pointer to another "master" copy, by means of the PROV-O link

This is not the only reason we are using CD or MD to represent builds. The other reason we are using CD and MD to represent builds (discussed in the SEP, and the comments above) is as follows:

  • In some cases, it may be necessary to use SequenceAnnotations or Components to describe the substructure of a sample or annotate it, especially when the sample does not match the target. This is similar to a use case you cited earlier: Sometimes we build something, it works, then we sequence it and find out what worked was actually a beneficial mutant. We then add that to the collection of ComponentDefinitions, where it goes from being physical back to being a design.
  • The consensus sequence for a given plasmid clone or sample is represented by the Sequence object that is associated with the ComponentDefinition representing the build. See Example 1. The target sequence is represented by a Sequence associated with a design.

With regard to concerns about the wasDerivedFrom field, there is indeed ambiguity in how a wasDerivedFrom property may be interpreted. I think those are deeper issues that go beyond the scope of this SEP. The examples you cite sound like edge cases to me. Also, nothing in this proposal is outside the recommended usage of wasDerivedFrom. The W3C spec is as follows:
"A derivation is a transformation of an entity into another, an update of an entity resulting in a new one, or the construction of a new entity based on a pre-existing entity."

In earlier versions of the SEP, these ambiguities were not an issue, because we favored explicit naming of classes, similar to the approach you took here with the Sample class. Also, I'd like to point out that specification of your Sample class...

Sample: this represents something physical
field: specification [1]: link to a ComponentDefinition or ModuleDefinition
field: data [0 .. *]: links to Data objects

...is pretty much identical to the Build class in the original proposal! Except design has changed to specification, test has changed to data, and Build has changed to Sample. So much for design-build-test! This is very ironic to me.

@graik
Contributor

graik commented Jul 22, 2017 via email

@jakebeal
Contributor

Hi, Bryan:

With this best practice, we would be recommending effectively using a ComponentDefinition or ModuleDefinition only as a pointer to another "master" copy, by means of the PROV-O link

This is not the only reason we are using CD or MD to represent builds. [snip]

I agree, and I also agree that we must be able to describe the contents of samples. However, there is more than one way to achieve this. The fact that CD and MD are convenient in other ways does not affect my concerns about the change of semantics needed in order to use an empty CD/MD as a pointer.

As I approach the question of solutions, it is indeed true that my thoughts do have a good deal of commonality with your original proposal. I would not view this as reverting, but as "spiraling up" to a view that includes the good parts of both the old and new proposal. The key differences in what I am proposing are (again, not worrying about names):

  • I see CD and MD as sufficient to represent design, without need for a new class.
  • The model allows multiple "stages" of building, and testing can happen at any or all of them.

There are other minor differences, but indeed, I have come around to the view expressed by both yourself and Raik that it is valuable to have not just a field but a whole separate class to represent a physical object, so that we can have lightweight "pointers" for representing large numbers of samples.

@bbartley
Contributor Author

I think it is perfectly fine to change opinions -- that's the difference between open discussion and ideological debate ;) So let's please remember that we are all looking for a good solution here and fair or unfair has nothing to do with it.

No problem with changing opinions. However, this feels like we are going in circles instead of converging. I hope that we are indeed "spiraling up" as Jake said. At this stage, an entirely new proposal might solve some issues, but at the same time it will likely introduce new issues, or worse re-introduce old issues which have already been discussed.

Fundamentally, a CD represents structure. IMHO, I should be able to use a CD to describe real, physical, manufactured structures, as well as theoretical, conceptual structures. When we start talking about Samples then the issue gets convoluted. I'm not trying to use CD to describe samples, I'm trying to use it to describe structure. I hope that any forthcoming proposal is at least consistent with this fundamental interpretation.

@jakebeal
Contributor

I absolutely agree with you that a CD represents structure, and that we should be able to use it to describe real, physical structures. That is exactly why I want to not use "shallow" CDs as pointers to "real" CDs. Likewise for MDs.

I think we need to separate the "pointer" as a separate class, whatever the right name turns out to be, whether it be "Sample" or "Build" or "Aliquot" or "PhysicalThing" or whatever else might be the best fit for a representation of a physically instantiated design that somehow points to a CD or MD that describes it fully.

@graik
Contributor

graik commented Jul 22, 2017 via email

@cjmyers
Contributor

cjmyers commented Jul 23, 2017 via email

@graik
Contributor

graik commented Jul 23, 2017 via email

@jakebeal
Contributor

I had an insight --- I think the productionStatus field (possibly renamed) needs to be on the Implementation / Sample / Build, rather than on the ComponentDefinition / ModuleDefinition.

The reason we are making these "pointers" is to be able to make distinctions like "this is a real thing" vs. "this is an intention." In all of the proposals that have been made, we would be using a single CD/MD to provide the full-detail description of both a design and many actual samples --- the "no-content copies" are then allowing us to distinguish the physical/virtual nature of the different instances. So no matter what we do, we need to have a productionStatus associated with each sample, rather than with the full-detail CD/MD.

We can do this without having a "no-content copy" if we associate the field with the pointer to the design in the Implementation / Sample / Build, something like:

Implementation

  • design [1] -> ComponentDefinition / ModuleDefinition
  • designStatus [1] -> (#designIntent, #confirmedInSample)
  • test [0 .. *] -> Test
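A sketch of how this could look in practice, assuming a simple record type. The three fields and the two status values follow the outline above; the `uri` field, the default status, and the `confirm` helper are hypothetical additions for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class DesignStatus(Enum):
    DESIGN_INTENT = "#designIntent"
    CONFIRMED_IN_SAMPLE = "#confirmedInSample"


@dataclass
class Implementation:
    uri: str
    design: str  # [1] URI of a ComponentDefinition / ModuleDefinition
    # [1]; defaulting to "intent" is an assumption of this sketch
    design_status: DesignStatus = DesignStatus.DESIGN_INTENT
    tests: List[str] = field(default_factory=list)  # [0..*] URIs of Tests


def confirm(impl, verified_design, test_uri):
    """Record a test and point the implementation at the confirmed design.

    `verified_design` may differ from the intended design, e.g. when
    sequencing reveals a mutant that was forked into its own CD.
    """
    impl.tests.append(test_uri)
    impl.design = verified_design
    impl.design_status = DesignStatus.CONFIRMED_IN_SAMPLE
    return impl


impl = Implementation("ex:impl_1", "ex:CD_A")
print(impl.design_status.value)
confirm(impl, "ex:CD_A_mutant", "ex:sequencing_test")
print(impl.design, impl.design_status.value)
```

Because the status lives on the pointer, the full-detail CD/MD never needs a "no-content copy", and the same design can be shared by an unverified sample and a confirmed one.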

@graik
Contributor

graik commented Jul 23, 2017 via email

@bbartley
Contributor Author

Some good suggestions from everybody here.

  • I'm totally on board with using a pointer class for the Build / Sample / Implementation.
  • I buy Jake's argument that productionStatus field should be on the Implementation / Sample / Build
  • But I'm also open to Raik's suggestion that there be a productionStatus at each design, build, test stage. See following.
  • The iGEM registry / SynBioHub has several different status fields that map to these in some way. For example, I think igem#experience : Works maps very nicely to the Test stage. There are also #partStatus, #sampleStatus, and #status. It would be nice to know what the full range of values are for these fields.
  • According to Jake's spec we do not need a pointer class for Design. So what if a build error causes an interesting mutation which leads to a new design? I think we are in agreement that this is an important use case. However, when I try to diagram out this scenario with Jake's spec, I'm not sure it works. A Build in this case would have to reference two Designs, the original target design and the mutant design. Correct me if I'm wrong.
  • I like how Chris proposed to add an explicit Test class for experimental data and a more abstract Data class which will be used in the future for other kinds of file attachments. Leaving the properties to source and format for now sounds great. There is a "keep it simple, stupid" (KISS) mentality that I like about this approach. We need James and Newcastle on board with this though.

Cheers,
Bryan

@cjmyers
Contributor

cjmyers commented Jul 26, 2017 via email

@cjmyers
Contributor

cjmyers commented Jul 26, 2017 via email

@graik
Contributor

graik commented Jul 26, 2017 via email

@cjmyers
Contributor

cjmyers commented Jul 26, 2017 via email

@graik
Contributor

graik commented Jul 26, 2017 via email

@palchicz
Contributor

Closing in accordance with changes to SEP issue tracking rules detailed in SEP 001 bcbbcab#diff-44cec2aabf4c066f9a54ac4ef6634b9b
