
KUDOS and the question(s) #2

Open
yarikoptic opened this issue Nov 1, 2019 · 4 comments

@yarikoptic

Hi Lion,

I was thrilled to find ASDF, and SEIS-PROV used by it, while looking around at how provenance is captured in standardized (for the field) data storage formats (like ASDF) and types.
In our case it is the NWB file format for neural data, which, like ASDF, is HDF5-based. ATM it has no provisions for provenance capture, so I was intrigued by ASDF storing full PROV records inside the .hdf5.

Based on your experience with establishing SEIS-PROV (well done! I really loved going through http://seismicdata.github.io/SEIS-PROV/), the fact that it has been out there for years now, and somewhat reflecting on the fact that you seem to be the sole developer/contributor to it, and that there has been no recent activity to further its development etc, I wondered:

  • PROV (like the rest of semantic web markup) is not really easily digestible by humans, with all the random IDs etc. JSON-LD and other serializations made things better, but not really easy. That is why there is often a seductive power in "let's just come up with some schema which could be compatible with PROV, i.e. that we could convert to a PROV representation if needed; or which would just be more useful to humans instead of computers". Do you still feel that "native" PROV in ASDF was the way to go?

  • do higher-level user tools in the field use PROV information, e.g. for visualization or querying by mere mortals for pragmatic benefit (e.g. just listing the types of filtering done on the data with the parameters used etc)?

  • did you see, or could you refer to, specific pragmatic (goal-driven, not just demos of "what could be done") use cases / studies / benefits from having PROV in ASDF?

In our case with NWB, our immediate "prototypical" use case is the brand new https://github.com/spikodrome/ project, where one of the goals is to compare results between different spike sorting algorithms and human curators. So we are now thinking about how to capture provenance information on preprocessing + spike sorting in NWB files (ref: https://github.com/SpikeInterface/spikeextractors/issues/290), hence I decided to ask these question(s) ;)

If you would prefer to reply in private -- debian at onerussian.com should work ;)

Thanks in advance and Kudos on all great work!

@krischer
Member

krischer commented Nov 1, 2019

Hi Yarik,

thanks for the nice words!

I'm not entirely sure I'd consider SEIS-PROV a success - people are always very interested in it, but as you've noticed there is little actual activity going on. That being said, ASDF is being used a fair amount and continues to grow, so maybe SEIS-PROV will be rejuvenated at some point.

PROV (like the rest of semantic web markup) is not really easily digestible by humans, with all the random IDs etc. JSON-LD and other serializations made things better, but not really easy. That is why there is often a seductive power in "let's just come up with some schema which could be compatible with PROV, i.e. that we could convert to a PROV representation if needed; or which would just be more useful to humans instead of computers". Do you still feel that "native" PROV in ASDF was the way to go?

Yes - I still think that "native" PROV is the way to go. Any non-trivial provenance description will become a directed graph. As soon as there is a graph there must be unique ids of some form, and at that point it can really only be understood by humans by looking at a graphical representation. At that point I no longer care too much about the data format representation, and using an existing standard is IMHO always the right choice as it comes with libraries and tools to do these visualizations (amongst other things).

And PROV-N (https://www.w3.org/TR/2013/REC-prov-n-20130430/) is easy enough to read and I cannot envision something simpler that could still represent a DAG.
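For illustration, here is a minimal sketch of such a DAG built with the generic Python prov package; the ex: namespace, ids, and filter parameters are made up, and nothing here is SEIS-PROV-specific:

```python
# A tiny provenance DAG with the generic Python "prov" package; the ex:
# namespace, ids, and filter parameters are invented for illustration.
from prov.model import ProvDocument

doc = ProvDocument()
doc.add_namespace("ex", "http://example.org/provenance/")

raw = doc.entity("ex:raw_waveform")
filtered = doc.entity("ex:filtered_waveform")
bandpass = doc.activity("ex:bandpass_filter",
                        other_attributes={"ex:freqmin": 0.01, "ex:freqmax": 0.1})

doc.used(bandpass, raw)                  # the activity consumed the raw data
doc.wasGeneratedBy(filtered, bandpass)   # ... and produced the filtered data
doc.wasDerivedFrom(filtered, raw)

print(doc.get_provn())                   # the human-readable PROV-N form
```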

Also this is very easy to implement while still being powerful. You might have noticed that ASDF embeds all kinds of XML formats, so we've chosen the PROV XML representation, but embedding others would also work. We chose to directly embed the encoded byte representation of the XML files - this might seem ugly and cumbersome, but it has the big advantage that one no longer has to deal with text encodings and other nasty issues, as this is all handled by the underlying (in this case XML) parsing engine. And it's actually more efficient compared to storing it in deeply nested HDF5 groups and attributes, which are really slow to query.
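A rough sketch of that byte-embedding idea with h5py (the group/dataset names and the placeholder payload here are illustrative, not necessarily ASDF's exact layout):

```python
# Store a serialized PROV-XML document as one opaque byte array in HDF5;
# names and payload are hypothetical, for illustration only.
import numpy as np
import h5py

# Any serialized PROV-XML document would do; here a trivial placeholder.
xml_bytes = b'<?xml version="1.0"?><prov:document xmlns:prov="http://www.w3.org/ns/prov#"/>'

with h5py.File("example.h5", "w") as f:
    # One flat byte array instead of deeply nested groups/attributes:
    # fast to write and read, and text encoding stays the XML parser's problem.
    f.create_dataset("Provenance/prov_0",
                     data=np.frombuffer(xml_bytes, dtype=np.uint8))

with h5py.File("example.h5", "r") as f:
    restored = f["Provenance/prov_0"][()].tobytes()  # hand these bytes to an XML parser
```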

  • do higher-level user tools in the field use PROV information, e.g. for visualization or querying by mere mortals for pragmatic benefit (e.g. just listing the types of filtering done on the data with the parameters used etc)?

  • did you see, or could you refer to, specific pragmatic (goal-driven, not just demos of "what could be done") use cases / studies / benefits from having PROV in ASDF?

I'll answer these two together. In our field, like I imagine in others, provenance and reproducibility are things everyone likes talking about, but few people actually tackle them in generally useful ways.

Thus no to both. It has not happened yet, and I feel like it would only happen if all provenance acquisition and storage happened fully automatically, without ANY additional work and friction for scientists and users.

The closest I got to this was implementing automatic provenance tracking in ObsPy (https://github.com/obspy/obspy/wiki), a standard tool in our field. It is not merged because a few edge cases need some more work, but maybe I'll finish it at some point.

It is probably possible to implement something similar for your use case. Some inspiration: https://github.com/krischer/obspy/blob/obspy_provenance/obspy/core/provenance.py and a decorator for all the processing methods that actually tracks the provenance: https://github.com/krischer/obspy/blob/obspy_provenance/obspy/core/trace.py#L224
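In case the branch is hard to follow, the gist of the decorator idea is something like this minimal sketch (inspired by, but not copied from, the ObsPy code above; it assumes each object carries a provenance list):

```python
# Minimal sketch: wrap each processing method and append a provenance record
# with the function name, parameters, and a timestamp. A real implementation
# would emit proper PROV records instead of plain dicts.
import functools
from datetime import datetime, timezone

def tracks_provenance(func):
    """Append a provenance record for every call of the wrapped method."""
    @functools.wraps(func)
    def wrapper(self, *args, **kwargs):
        result = func(self, *args, **kwargs)
        self.provenance.append({
            "activity": func.__name__,             # what was done
            "params": {"args": args, **kwargs},    # with which parameters
            "time": datetime.now(timezone.utc).isoformat(),
        })
        return result
    return wrapper

class Recording:
    def __init__(self, samples):
        self.samples = samples
        self.provenance = []   # filled in automatically by the decorator

    @tracks_provenance
    def bandpass(self, freqmin=0.01, freqmax=0.1):
        ...  # actual signal processing would go here
```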

Let me know if this does not help or you need more information. I'm always happy to answer questions!

@yarikoptic
Author

Thank you Lion for the quick and informative reply! I have yet to digest it fully, while also researching PROV-JSON and PROV-JSONLD (poster, datalad issue which pointed me to it), but let me comment on one aspect (sorry if it sounds like a sales speech... it would be partially true ;)):

... In our field, like I imagine in others, provenance and reproducibility are things everyone likes talking about but few people actually tackle it in generally useful ways.

;) well, for me "provenance" previously was 1) a Debian release + the list of packages we used in the study; 2) bash history. So it was non-standardized, largely "human readable, and machine actionable if you know how to run debootstrap + apt-get and grep the history".

Currently it is a container + a list of inputs/outputs and the command to execute the container, with some facilitation from DataLad (disclaimer - I am involved there). With everything (code/data/containers) tracked in VCS, we have not only provenance but pragmatic ways to reproduce. See e.g. the example in https://github.com/ReproNim/containers#a-typical-workflow .

And I do use that functionality pretty regularly - it is digestible during "data archeological" expeditions in git history, and it is machine actionable (see datalad rerun). So, again - not standardized, but minimally sufficient to achieve reproducibility. The actual "provenance" information is spread out (the git log record, git-annex information on container checksum/location where to get it, information embedded in the container about its components which could be discovered with aux tools). So the user is exposed only to the minimal set of provenance needed to reproduce, but all of it can be recovered if needed.
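For a flavor of what that capture/re-execution round trip looks like, here is a minimal sketch via DataLad's Python API (the script name and paths are hypothetical):

```python
# Record a command together with its inputs/outputs as a commit in git
# history, then re-execute it from that record; paths and the script name
# are hypothetical.
import datalad.api as dl

dl.run(
    "python sort_spikes.py {inputs} {outputs}",  # {inputs}/{outputs} expanded by DataLad
    dataset=".",
    inputs=["raw/recording.nwb"],
    outputs=["derived/sorted.nwb"],
    message="spike sorting run",
)

# Later, anyone with the dataset can machine-actionably redo the computation:
dl.rerun(revision="HEAD", dataset=".")
```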

@krischer
Member

krischer commented Nov 4, 2019

That sounds pretty nice and useful indeed! And I totally agree that container-based workflows are the most useful and practical choice to get things done. And one could certainly argue that a VCS introduces a graph structure into your system ;-)

SEIS-PROV tried to establish something more akin to a system-independent "light" provenance. I've actually written a few sentences about its original goals at the end of this page: http://seismicdata.github.io/SEIS-PROV/motivation.html#goal-of-seis-prov

@yarikoptic
Author

I did notice that goal when I initially looked at the docs, and liked that hint of pragmatism ;-)
