SEP 021 -- Experiments and Experimental Data #50

Open
jakebeal opened this Issue Jun 19, 2018 · 24 comments

7 participants
@jakebeal
Contributor

jakebeal commented Jun 19, 2018

Linking experimental data to samples and genetic designs is becoming critical to many synthetic biology projects.

We thus propose to add two new classes, Experiment and ExperimentalData, to enable users to aggregate Attachment objects (used to link experimental data files) in a more domain-specific manner than just using the SBOL Collection class. We also propose to modify the best-practice validation rules that govern the specification of provenance relationships between objects in a design-build-test-learn cycle.

Full details in: https://github.com/SynBioDex/SEPs/blob/master/sep_021.md
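
For readers who want a concrete picture of the proposed grouping, here is a minimal sketch in plain Python. It is not the SEP's normative definition: the field names (`source`, `attachments`, `experimental_data`) and the example URIs are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Attachment:
    """A link to an experimental data file (field names are assumptions)."""
    uri: str
    source: str


@dataclass
class ExperimentalData:
    """Groups the Attachment objects that belong to one experimental result."""
    uri: str
    attachments: List[Attachment] = field(default_factory=list)


@dataclass
class Experiment:
    """Groups the ExperimentalData objects that belong to one experiment."""
    uri: str
    experimental_data: List[ExperimentalData] = field(default_factory=list)

    def all_attachments(self) -> List[Attachment]:
        """Flatten every Attachment reachable from this Experiment."""
        return [a for d in self.experimental_data for a in d.attachments]


# Hypothetical URIs, for illustration only.
fcs = Attachment("http://example.org/attachment/sample1_fcs",
                 "http://example.org/files/sample1.fcs")
data = ExperimentalData("http://example.org/data/sample1", [fcs])
experiment = Experiment("http://example.org/experiment/interlab", [data])
```

Compared with a generic Collection, the two levels make it explicit which files belong to which result, and which results belong to which experiment.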

@jakebeal jakebeal added this to the SBOL 2.3 milestone Jun 19, 2018

jamesscottbrown added a commit to jamesscottbrown/SEPs that referenced this issue Jun 19, 2018

Add links from SEPs to issues
Whilst SEP and Issue numbers initially matched, they have now
significantly diverged: SEP 021 is Issue SynBioDex#50.
@graik

Contributor

graik commented Jul 3, 2018

I agree with the data model (Experiment -attachment-> Experimental data) but have some concerns when looking at the example (3.2 Interlab Example):

  1. The link ExperimentalData --generatedBy--> Implementation is redundant and misleading -- experimental data are generally not (yet?) generated by biological samples. Our engineered cells are not quite intelligent enough for that yet. I don't think any such link is necessary at all because the connection to biological batches will be created through the parent Experiment record.

  2. The link Experiment --prov:wasDerivedFrom--> Implementation is very confusing. How can a human activity (the experiment) be derived from some batch of biomaterial? Clearly, an experiment will never be another version of a biological sample.

Moreover, the connection from Experiment to Implementation needs context information. What was the biomaterial used for in the experiment? Instead of using prov:wasDerivedFrom, we could formulate a prov:Activity (Usage?) record that tells us for which purpose a certain biological material was used in the Experiment. There are also arguments to be made for creating a more custom SBOL feature instead of using the more generic PROV-O.

@jakebeal

Contributor

jakebeal commented Jul 4, 2018

I think you're definitely right that 1 is a problem: those links should be prov:wasDerivedFrom rather than prov:wasGeneratedBy. We might also add Activity objects for which prov:wasGeneratedBy would be appropriate, but that doesn't negate or remove the prov:wasDerivedFrom link.
@chrisAta would you be willing to update your diagram accordingly?

For 2, perhaps the problem is that the name should be something other than Experiment, since that suggests an activity, when what we are really intending is just to have a more specialized sort of Collection. Maybe you can suggest a better name?

@graik

Contributor

graik commented Jul 4, 2018

The problem with this use of prov:wasDerivedFrom is that, according to my limited knowledge (and Google searches), it conflicts with the definition of this field in PROV-O. wasDerivedFrom is supposed to connect Entities of the same type, or things that are at least similar enough that one can indeed be considered another version or offspring of the other.

Implementation represents a batch of biochemical material. Experimental Data (or collections or Experiments) are not another version of this biochemical material. They are completely different things. By contrast, different biomaterials are used in different roles for the generation of experimental data. Example: a certain E. coli strain may be used in the role of competent cell to transform and amplify certain plasmid DNA -- in this case there are two Implementations (cell and DNA), one experiment (the transformation) and a varying amount of experimental data (ranging from nothing, via colony counts, to pictures of colonies or microscopy images).

Concerning 2, I don't think any link is needed as long as there is a clearly defined link from data to Experiment(Collection) and from there to Implementation (used for the Experiment).

@jakebeal

Contributor

jakebeal commented Jul 4, 2018

The PROV-O definition of wasDerivedFrom is, in total:

"A derivation is a transformation of an entity into another, an update of an entity resulting in a new one, or the construction of a new entity based on a pre-existing entity."

No requirement regarding type is stated or implied, and "construction of a new entity based on a pre-existing entity" covers our cases quite well. Furthermore, we are already using wasDerivedFrom links in this manner in SBOL 2.2, per the adopted SEP 019.

With regards to the linking of Experiment, I think I can see your point. There are cases where we might want some additional link, but those don't necessarily need to be standardized yet.

@graik

Contributor

graik commented Jul 5, 2018

Thanks for digging out the official definition!

The key word is "transformation". If one reads this definition carefully, it is obvious that wasDerivedFrom is meant for versioning, conversions or child-parent relations that all connect things that are reasonably similar. The way it is used in the SEP 21 example is not compatible with that. SEP 19 is getting this wrong too, and that was one of my points of criticism.

Experimental data are not "derived from" some biological material. At best, experimental data are derived from actual experiments. The experiments, in turn, use various biomaterials for various purposes (roles).

Example: A CRISPR knockout screen is hitting all Kinase genes in the human genome. The experiment uses: (1) A library with 849 gRNA constructs (each with its own CD and Implementation/Batch), (2) a certain batch of biological growth media (another Implementation instance), (3) a specific human cell line from a specific collection (Implementation again), (4) a purified Cas9 protein from a specific purification batch (Implementation) ... Everything is mixed, incubated, FACS-sorted (perhaps with on-the-fly microscopy for each cell), deep sequenced.
Are the experimental results "derived" from the cell line? Or from any of the individual gRNA batches? Or from the growth medium? Or from the protein? Even the question intuitively doesn't make sense.

Moreover, it would be very confusing if the cell used in the experiment were connected through the same relationship as the protein, all 849 gRNAs, or the growth medium, without any context information. What we should do instead is have the Experiment Activity point to the cell line via a Usage record. This Usage record can then carry a role such as competent cell, or whatever we can find in protocol ontologies. Preferably, this should be the only connection. It would allow direct queries for, e.g., the cell line used in the experiment, or the medium, etc.
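
To make the kind of query described above concrete, here is a rough sketch using plain Python tuples as triples. The properties prov:qualifiedUsage, prov:entity and prov:hadRole are taken from PROV-O; the `http://example.org/` namespace, the role URIs and the helper names are hypothetical and would in practice come from SBOL or a protocol ontology.

```python
PROV = "http://www.w3.org/ns/prov#"
EX = "http://example.org/"  # hypothetical namespace for illustration


def usage_triples(activity, usage, entity, role):
    """Return the three triples that describe one qualified usage."""
    return [
        (activity, PROV + "qualifiedUsage", usage),
        (usage, PROV + "entity", entity),
        (usage, PROV + "hadRole", role),
    ]


def entities_with_role(triples, activity, role):
    """Find the entities that `activity` used in the given role."""
    usages = {o for s, p, o in triples
              if s == activity and p == PROV + "qualifiedUsage"}
    matching = {s for s, p, o in triples
                if s in usages and p == PROV + "hadRole" and o == role}
    return sorted(o for s, p, o in triples
                  if s in matching and p == PROV + "entity")


# One Experiment Activity that uses three Implementations in different roles.
screen = EX + "activity/crispr_screen"
graph = []
for name, role in [("cell_line_batch", "cell_line"),
                   ("growth_medium_batch", "growth_medium"),
                   ("cas9_batch", "nuclease")]:
    graph += usage_triples(screen, EX + "usage/" + name,
                           EX + "implementation/" + name, EX + "role/" + role)

cell_lines = entities_with_role(graph, screen, EX + "role/cell_line")
```

Because every Implementation is reached through its own Usage record, the role is attached to the link itself, which is what makes a question like "which cell line was used?" answerable.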

@jakebeal

Contributor

jakebeal commented Jul 5, 2018

I'm afraid that I have a different perspective here:

  1. I see "or" not "and" in the definition
  2. I very much see a .fcs file from a flow cytometer as something that I have derived from a particular biological sample by means of a measurement Activity.

Similarly, for your example, I see a sequence of Implementations that are connected by various protocol step Activities, splitting and joining as appropriate, and with measurement activities applied to two of those Implementation objects---they don't necessarily have design information associated, but we most certainly know their derivation relationship to things that did have specific designs (the library, the growth media, and the cell line). We are representing library screening operations very similar to this in SD2 right now, and the linkages are working quite well.

@graik

Contributor

graik commented Jul 5, 2018

@jakebeal

Contributor

jakebeal commented Jul 5, 2018

It seems to me that this discussion has drifted off of the actual questions of SEP 021 and is returning to debate on SEP 019 (which may be entirely reasonable, but deserves its own thread on sbol-dev). If we fix the diagram to include the elided activities and delete the wasDerivedFrom links, then SEP 021 is consistent with SEP 019.

Do you have any problem with SEP 021 separate from these issues?

@cjmyers

Contributor

cjmyers commented Jul 5, 2018

I think I land somewhere in the middle on this one. While I don't think it is technically wrong to use wasDerivedFrom links in this way, I don't think it should be the preferred practice. In the Interlab example, why not use the Usage links from the Characterization Activity? I think it is particularly critical in this case as this is being generated from 3 different Implementations, and the role of each in this activity should be documented. On the other hand, for the ExperimentalData to Implementation links, since there is only a single object being linked, the wasDerivedFrom may be acceptable.

I do understand why the term feels awkward as Raik points out. However, the definition is vague enough that it could be applied to link entities of different types.

I think the term to use when they must be entities of the same type is wasRevisionOf:

https://www.w3.org/TR/prov-o/#wasRevisionOf

"A revision is a derivation for which the resulting entity is a revised version of some original. The implication here is that the resulting entity contains substantial content from the original. Revision is a particular case of derivation."

wasDerivedFrom is a super-property of this, which I think is defined more liberally:

https://www.w3.org/TR/prov-o/#wasDerivedFrom

Finally, wasInfluencedBy is in turn a super-property of wasDerivedFrom, and is even more open:

https://www.w3.org/TR/prov-o/#wasInfluencedBy

A compromise might be to use the term "wasInfluencedBy" instead for these direct connections between entities of different types. Another option is to introduce wasRevisionOf for the cases where they must be the same entity type. Or we can do both.
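
As a small illustration of how the three PROV-O terms nest, the sketch below encodes the sub-property chain as a dictionary and walks it. The property relationships are those given in PROV-O; the helper names are made up for this example.

```python
# Each PROV-O property mapped to its direct super-property.
SUPER_PROPERTY = {
    "prov:wasRevisionOf": "prov:wasDerivedFrom",
    "prov:wasDerivedFrom": "prov:wasInfluencedBy",
}


def generalizations(prop):
    """Return `prop` followed by every property it specializes, narrowest first."""
    chain = [prop]
    while chain[-1] in SUPER_PROPERTY:
        chain.append(SUPER_PROPERTY[chain[-1]])
    return chain


def implies(specific, general):
    """True if asserting `specific` also entails `general`."""
    return general in generalizations(specific)
```

For example, `implies("prov:wasRevisionOf", "prov:wasInfluencedBy")` holds, while the reverse does not, so a tool that only understands wasInfluencedBy would still see links written with either of the narrower terms.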

@jakebeal

Contributor

jakebeal commented Jul 5, 2018

@cjmyers @graik Can we please move the discussion of linking practices to SBOL-dev or to an SEP that actually addresses that issue?

@bbartley

Contributor

bbartley commented Jul 5, 2018

I think some of this awkwardness with provenance links could be avoided simply by defining Experiment as a specialized subclass of Activity. ExperimentalData would then be generated by Experiment activities. In addition Experiments would link directly to their Plans. In natural language we might express such relationships as "Some experimental data was generated by an experiment using an Implementation"
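
A minimal sketch of this idea, assuming simplified stand-ins for the PROV-O and SBOL classes (the field names `plan`, `used` and `was_generated_by`, and the `ex:` identifiers, are illustrative rather than taken from any specification):

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Activity:
    """Minimal stand-in for prov:Activity."""
    uri: str
    plan: Optional[str] = None                     # URI of the associated Plan
    used: List[str] = field(default_factory=list)  # URIs of Implementations used


@dataclass
class Experiment(Activity):
    """In this sketch an Experiment is simply a specialized Activity."""


@dataclass
class ExperimentalData:
    uri: str
    was_generated_by: Optional[Experiment] = None


def describe(data: ExperimentalData) -> str:
    """Render the provenance chain as the natural-language sentence above."""
    exp = data.was_generated_by
    if exp is None:
        return f"{data.uri} has no recorded experiment"
    return f"{data.uri} was generated by {exp.uri} using {', '.join(exp.used)}"


experiment = Experiment("ex:experiment1", plan="ex:protocol1",
                        used=["ex:implementation1"])
data = ExperimentalData("ex:data1", experiment)
```

Here the only link on the data is to the Experiment, and the Implementation is reached through it, so no direct data-to-sample derivation link is needed.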

@jakebeal

Contributor

jakebeal commented Jul 6, 2018

@bbartley I would be comfortable with that.

@eoberortner

Member

eoberortner commented Jul 6, 2018

@jakebeal

Contributor

jakebeal commented Jul 6, 2018

@eoberortner I think that you make a very good point. We would, however, most definitely still need an ExperimentalData class to handle bundling and slicing of data files.

@eoberortner

Member

eoberortner commented Jul 6, 2018

@jakebeal

Contributor

jakebeal commented Jul 6, 2018

I'd like to get clearer agreement on the substance before opening up naming discussions.

@cjmyers

Contributor

cjmyers commented Jul 6, 2018

+1 for making Experiment actually be an Activity. Indeed, we are already doing this for simulation experiments in the PROV-O generated by iBioSim. Our plan is the SED-ML file, which contains the instructions for the simulation. Here is an example:

https://synbiohub.programmingbiology.org/public/Cello_VPRGeneration_Paper/Cello_VPRGeneration_Paper_collection/1

This is a Collection of Collections. Each sub-Collection is a Collection of Simulation Data (so similar to your ExperimentalData), for example (the so-called Rule 30 example):

https://synbiohub.programmingbiology.org/public/Cello_VPRGeneration_Paper/Model_0x78_collection/1

Now that I think about it, the wasGeneratedBy should likely be on this Collection, but it is currently on each Attachment instead, such as:

https://synbiohub.programmingbiology.org/public/Cello_VPRGeneration_Paper/circuit_0x78_simulation__graph_png/1

Finally, here is the Simulation Experiment Activity:

https://synbiohub.programmingbiology.org/public/Cello_VPRGeneration_Paper/Model_0x78_iBioSim_activity/1

While SEP 027 to add a type would be useful, I don't think it is a prerequisite to taking this approach for Experiment now.

What I'm still wondering, though, is why the ExperimentalData class is needed. We are using Collection for the same thing. The only difference appears to be a restriction that the members are Attachments. However, what if the experimental data is something else, for example, a Sequence object generated by a sequence analysis? I could see this, for example, in the Addgene conversion.

@graik

Contributor

graik commented Jul 6, 2018

@bbartley

> I think some of this awkwardness with provenance links could be avoided simply by defining Experiment as a specialized subclass of Activity. ExperimentalData would then be generated by Experiment activities. In addition Experiments would link directly to their Plans. In natural language we might express such relationships as "Some experimental data was generated by an experiment using an Implementation"

I very much agree. This would be a very clean solution.
@cjmyers The ExperimentalData class seems like a good idea because it will bundle file attachments that are very different from the SBOL objects that are typically bundled in a Collection. My understanding is that this would more likely be the raw sequencing traces rather than an SBOL sequence object.

@cjmyers

Contributor

cjmyers commented Jul 6, 2018

Can we use ExperimentalData to bundle Attachments for simulation experiments? If so, I could get behind this. I suggest though that the class inherit from Collection. This would make it a Collection with an extra validation rule that all members are Attachment objects. Is there any other difference than this? Seems like an added complication to the data model without a lot in return. What am I missing?
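
To illustrate what "a Collection with an extra validation rule" could look like, here is a rough sketch. The class and method names are simplified stand-ins, not the API of any SBOL library.

```python
class Attachment:
    def __init__(self, uri):
        self.uri = uri


class Sequence:
    def __init__(self, uri):
        self.uri = uri


class Collection:
    """Generic grouping: any object may be a member."""

    def __init__(self, uri, members=None):
        self.uri = uri
        self.members = list(members or [])

    def validate(self):
        """Return a list of validation error messages (empty if valid)."""
        return []


class ExperimentalData(Collection):
    """A Collection whose members must all be Attachment objects."""

    def validate(self):
        errors = super().validate()
        for member in self.members:
            if not isinstance(member, Attachment):
                errors.append(f"{self.uri}: member {member.uri} is not an Attachment")
        return errors


ok = ExperimentalData("ex:data1", [Attachment("ex:trace1")])
bad = ExperimentalData("ex:data2", [Attachment("ex:trace2"), Sequence("ex:seq1")])
```

In this reading the subclass adds nothing but the member-type check, which is exactly the trade-off being questioned.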

@jakebeal

Contributor

jakebeal commented Jul 6, 2018

@cjmyers I have no objection to applying ExperimentalData to simulation experiments as well as wetlab experiments. I likewise would not have any objection to inheriting from both Activity and Collection.

The key thing that I still want to preserve is the ability to expand in the future to support pulling out only fragments of data from a file. A typical use case for this is plate reader data, in which one output file contains readings from many samples. We are not proposing to standardize this yet, but are already doing it via annotations: in this case a single Attachment is linked from many ExperimentalData objects, each annotated with the coordinates associated with the Implementation linked via PROV-O.
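
The plate reader case might be pictured roughly as follows. This is only a sketch: the `well` annotation, the field names, the `ex:` identifiers and the readings are placeholders invented for illustration, not the annotation scheme actually in use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attachment:
    uri: str


@dataclass(frozen=True)
class ExperimentalData:
    uri: str
    attachment: Attachment  # the shared plate reader output file
    well: str               # annotation: coordinate inside that file
    derived_from: str       # URI of the Implementation (sample) in that well


def readings_for(sample, data_objects, parsed_file):
    """Pull out only the readings that belong to one sample."""
    return [parsed_file[d.well] for d in data_objects if d.derived_from == sample]


plate = Attachment("ex:plate_run_1")
# Placeholder values standing in for the parsed contents of the file.
parsed = {"A1": 0.12, "A2": 0.47, "B1": 0.33}
slices = [
    ExperimentalData("ex:data_A1", plate, "A1", "ex:sample_1"),
    ExperimentalData("ex:data_A2", plate, "A2", "ex:sample_2"),
    ExperimentalData("ex:data_B1", plate, "B1", "ex:sample_1"),
]

sample_1_readings = readings_for("ex:sample_1", slices, parsed)
```

Each ExperimentalData object acts as a per-sample view onto the one shared file, which a bare Attachment link from the Experiment could not express.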

@NeilWipat

Collaborator

NeilWipat commented Oct 11, 2018

Update as of COMBINE 2018

Consensus has been reached that this SEP should be released for a vote. Curtis has volunteered to be the editor in charge.

@cjmyers

Contributor

cjmyers commented Oct 11, 2018

There is some inconsistency in the SEP. The UML says 1..* ExperimentalData objects, but the text says it is optional and refers to 0 or more URIs. Also, I'm unsure of the relationship between Figures 1 and 2. Since an Experiment can link to Attachments already, is the ExperimentalData class necessary? What does this class represent?

@nroehner

Member

nroehner commented Oct 16, 2018

Thanks for catching that, Chris. I have updated the UML diagram so that the experimentalData property of the Experiment class has a cardinality of 0..*

Yes, both the Experiment and ExperimentalData classes are needed. We need a class (Experiment) to group experimental results (ExperimentalData) with reference to some experimental design (not specified by this SEP). It is acceptable for an Experiment to group Attachments, but if it does not also refer to ExperimentalData objects that sub-group these attachments, then we don't have points to link in samples for different experimental conditions using PROV-O relationships.

@cjmyers

Contributor

cjmyers commented Oct 16, 2018

OK, sounds good. It would be good to make this motivation clear in the spec, since there was some discussion at COMBINE about whether the classes were needed or whether attachment links alone would suffice.
