Capsulization and Packaging of Replication Objects in Dataverse #6085

Open
djbrooke opened this issue Aug 9, 2019 · 13 comments

Contributor

@djbrooke djbrooke commented Aug 9, 2019

We'll need to talk about the specific steps with @atrisovic when she gets here, but I'm putting in this placeholder for now. We'd like to evaluate how we can better support/display capsules in Dataverse, such as those used by:

Contributor

@TaniaSchlatter TaniaSchlatter commented Sep 24, 2019

I'm interested in learning what is different about this type of object compared with others, to help get at UI details and possibilities. What are the content items related to this type of file/object?

Also, what do users expect (if anything) about this type of object? Do they expect to see it as a unit like a package file, or like a container (folder) with contents?

Contributor Author

@djbrooke djbrooke commented Oct 9, 2019

Hi @TaniaSchlatter - thanks for talking about this briefly earlier today.

As an example, take this replication dataset:

https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/Y3XHB6

We'd want to let depositors upload a capsule (data, code, provenance, compute environment) of this dataset as one object so that replication tools such as Code Ocean can run it. At the same time, we'd also want it unzipped and displayed as it is now, so that the individual files can take advantage of our external tool infrastructure and so that users who are interested in only some data files but not the analysis (or vice versa) can pick and choose. I liked your idea of adding the capsule view as a third view here, with an appropriate display once it's selected:

[Screenshot: Screen Shot 2019-10-08 at 10 20 35 PM]

To answer your question about what I'd expect users to be able to do with it, I'd expect it to be downloaded by a user through the UI/API or some tool using our API.
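As a rough illustration of the "some tool using our API" case, the Python sketch below builds the URL for the dataset-level download endpoint of the Dataverse Data Access API, which returns all files of a dataset as one zip. It assumes a Dataverse release that offers that endpoint and uses the DOI from the example above. Nothing here is capsule-specific, since no capsule endpoint has been defined in this discussion; the helper names `dataset_zip_url` and `download_dataset` are made up for the example.

```python
import shutil
from urllib.parse import urlencode
from urllib.request import urlopen


def dataset_zip_url(server: str, persistent_id: str) -> str:
    """Build the 'download all files as a zip' URL for a dataset DOI."""
    query = urlencode({"persistentId": persistent_id})
    return f"{server.rstrip('/')}/api/access/dataset/:persistentId/?{query}"


def download_dataset(server: str, persistent_id: str, dest: str) -> str:
    """Stream the zip to a local file (hypothetical helper; needs network access)."""
    with urlopen(dataset_zip_url(server, persistent_id)) as resp, open(dest, "wb") as out:
        shutil.copyfileobj(resp, out)
    return dest


url = dataset_zip_url("https://dataverse.harvard.edu", "doi:10.7910/DVN/Y3XHB6")
print(url)
```

A capsule download could plausibly follow the same pattern with a different path, but that path is an open design question here.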

I'm checking with our hosting team at Harvard about how much storage cost we're racking up a month to try and determine the implications of hosting two versions of each dataset.

A question from an architecture standpoint is whether we keep the full environment with each dataset or package up the appropriate environment at the time the capsule is requested. I do not know which is preferred from a preservation standpoint or from an efficiency standpoint. If we keep the full environment for each dataset (1000 copies of Stata 14 or whatever :)) there may be further storage cost issues. But if we keep each capsule together with its environment and everything else, we can probably serve them from S3 more easily.
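To make the storage trade-off a bit more concrete, here is a back-of-the-envelope sketch in Python. Every number in it (dataset count, content size, environment size, number of distinct environments) is an invented placeholder, not a Harvard Dataverse figure; the only point is how the two strategies scale.

```python
def storage_gb(n_datasets, content_gb, env_gb, n_distinct_envs=None):
    """Estimate total storage in GB for capsules.

    n_distinct_envs=None: every capsule embeds its own copy of the environment.
    n_distinct_envs=k:    each of k environments is stored once and attached
                          to a capsule only when the capsule is requested.
    """
    if n_distinct_envs is None:
        return n_datasets * (content_gb + env_gb)
    return n_datasets * content_gb + n_distinct_envs * env_gb


# Placeholder assumptions: 1000 datasets, 0.5 GB of data and code each,
# 2 GB per environment image, 20 distinct environments overall.
embedded = storage_gb(1000, 0.5, 2.0)
shared = storage_gb(1000, 0.5, 2.0, n_distinct_envs=20)
print(embedded, shared)  # 2500.0 540.0
```

Under these made-up inputs the embedded approach needs several times the storage, but it keeps each capsule self-contained, which is the preservation and S3-serving argument above.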

Member

@atrisovic atrisovic commented Nov 4, 2019

Arbitrarily chosen examples of research capsules from Code Ocean:

Member

@atrisovic atrisovic commented Nov 4, 2019

User interface of capsules stored on Docker Hub:

[Image]

https://hub.docker.com/_/r-base

@pdurbin pdurbin self-assigned this Nov 4, 2019
Member

@pdurbin pdurbin commented Nov 4, 2019

I checked in with @Xarthisius and here are three examples of capsules (which they call "tales") created with Whole Tale:

He also said,

"there's nothing special about Tales/Capsules published somewhere, apart from the fact that they have DOI.

you can go to https://dashboard.wholetale.org and export file as BagIt (zip) locally

the content is the same as the data we "publish" to external repository, i.e. that would be the thing that would land in Dataverse"

And to that I would add that from https://dev2.dataverse.org anyone is able to create a dataset and click the "Explore" button to play with it in Whole Tale and then click "Export as BagIt" like in the screenshot below.

[Screenshot: Screen Shot 2019-11-04 at 3 46 54 PM]

This is what I got from my dataset when I exported it as BagIt:

$ unzip 5dc089a87bf5ca3bf549e3dd.zip 
Archive:  5dc089a87bf5ca3bf549e3dd.zip
 extracting: 5dc089a87bf5ca3bf549e3dd/data/workspace/dataverse-irc-metrics-master/data/irclog.tsv  
 extracting: 5dc089a87bf5ca3bf549e3dd/data/workspace/dataverse-irc-metrics-master/apt.txt  
 extracting: 5dc089a87bf5ca3bf549e3dd/data/workspace/dataverse-irc-metrics-master/index.ipynb  
 extracting: 5dc089a87bf5ca3bf549e3dd/data/workspace/dataverse-irc-metrics-master/install.R  
 extracting: 5dc089a87bf5ca3bf549e3dd/data/workspace/dataverse-irc-metrics-master/README.md  
 extracting: 5dc089a87bf5ca3bf549e3dd/data/workspace/dataverse-irc-metrics-master/runtime.txt  
 extracting: 5dc089a87bf5ca3bf549e3dd/data/workspace/dataverse-irc-metrics-master/superuser_graph.ipynb  
 extracting: 5dc089a87bf5ca3bf549e3dd/data/workspace/dataverse-irc-metrics-master/superuser_graph-monthly.ipynb  
 extracting: 5dc089a87bf5ca3bf549e3dd/data/workspace/index.ipynb  
 extracting: 5dc089a87bf5ca3bf549e3dd/run-local.sh  
 extracting: 5dc089a87bf5ca3bf549e3dd/data/LICENSE  
 extracting: 5dc089a87bf5ca3bf549e3dd/README.md  
 extracting: 5dc089a87bf5ca3bf549e3dd/bagit.txt  
 extracting: 5dc089a87bf5ca3bf549e3dd/bag-info.txt  
 extracting: 5dc089a87bf5ca3bf549e3dd/fetch.txt  
 extracting: 5dc089a87bf5ca3bf549e3dd/manifest-md5.txt  
 extracting: 5dc089a87bf5ca3bf549e3dd/manifest-sha256.txt  
 extracting: 5dc089a87bf5ca3bf549e3dd/metadata/environment.json  
 extracting: 5dc089a87bf5ca3bf549e3dd/metadata/manifest.json  
 extracting: 5dc089a87bf5ca3bf549e3dd/tagmanifest-md5.txt  
 extracting: 5dc089a87bf5ca3bf549e3dd/tagmanifest-sha256.txt  

The files I uploaded to Dataverse are at https://github.com/pdurbin/dataverse-irc-metrics and they are shown in a folder called "dataverse-irc-metrics-master" above. To get them into Dataverse, I downloaded my GitHub repo as a zip and added it to my dataset.
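Since the export above includes `manifest-sha256.txt`, a repository receiving such a bag could check the payload before accepting it. The following is a minimal Python sketch of that check, assuming the standard BagIt manifest layout (one `<checksum> <path>` pair per line, paths relative to the bag root). The function name `verify_bag` is made up for the example, and the sketch ignores BagIt's percent-encoding of unusual characters in paths. Any file that is only referenced through `fetch.txt` and has not been fetched yet would be reported as a failure.

```python
import hashlib
from pathlib import Path


def verify_bag(bag_dir, algorithm="sha256"):
    """Return the manifest paths that are missing or whose checksum differs."""
    bag = Path(bag_dir)
    manifest = bag / f"manifest-{algorithm}.txt"
    failures = []
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Split on the first run of whitespace so paths may contain spaces.
        expected, rel_path = line.split(maxsplit=1)
        target = bag / rel_path
        if not target.is_file():
            failures.append(rel_path)
            continue
        actual = hashlib.new(algorithm, target.read_bytes()).hexdigest()
        if actual != expected.lower():
            failures.append(rel_path)
    return failures
```

For the bag unzipped above, `verify_bag("5dc089a87bf5ca3bf549e3dd")` would return an empty list if every file listed in the manifest is present and intact.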

@pdurbin pdurbin removed their assignment Nov 4, 2019
Contributor Author

@djbrooke djbrooke commented Nov 4, 2019

Meeting notes from 11/4 below. I see everyone has been completing action items as I've been out walking the dog :)

https://docs.google.com/document/d/1hF93XtIkacD6HE0koeoBtqk9FUfJhalhd6EtvD4nlnk/edit

My one item was to get some more details on Renku and I'm working on setting up a meeting this week. Generally, once we have examples of capsules and capsule-equivalents from around the community, we'll get back together.

Member

@pdurbin pdurbin commented Nov 5, 2019

whole-tale/whole-tale#53 is the "publishing tales/capsules from Whole Tale to Dataverse" issue to track and there are lots of great screenshots in there.

Member

@pdurbin pdurbin commented Nov 5, 2019

> My one item was to get some more details on Renku and I'm working on setting up a meeting this week.

Here is where Renku is tracking this: SwissDataScienceCenter/renku-python#668

Member

@pdurbin pdurbin commented Nov 6, 2019

There was so much great information, screenshots and chatter yesterday from @craig-willis in whole-tale/whole-tale#53 that I suggested to him that we should consider scheduling a call with Whole Tale to get their take on depositing capsules into Dataverse.

@craig-willis maybe we should schedule the 3rd Open Science Infrastructure working group call? whole-tale/whole-tale#61

Or maybe we could ask @KirstieJane if we could dedicate a future "Turing Way online Collaboration Cafe" to the topic of depositing capsules into data repositories? Here are the upcoming dates and times: https://github.com/alan-turing-institute/the-turing-way/blob/master/project_management/online-collaboration-cafe.md#dates-and-start-times . I did my best to introduce Dataverse to the Turing Way community about a month ago in https://www.youtube.com/watch?v=HIIJvDZ8pzw . The advantage of the collaboration cafe is that the meetings are recorded so I can very easily add them to DataverseTV: https://github.com/IQSS/dataverse-tv 😄 If the call is recorded, we'll have much more reach.

Contributor

@craig-willis craig-willis commented Nov 7, 2019

@pdurbin I'm happy to try to coordinate a call or participate in a related community call. I do now have access to Zoom for recording, but may not have the reach of the "Collaboration Cafe".

@TaniaSchlatter TaniaSchlatter added this to UI/UX Design 💡📝 in IQSS/dataverse Nov 18, 2019
Contributor

@TaniaSchlatter TaniaSchlatter commented Nov 20, 2019

I've started to add images of capsules and notes from discussions to a presentation doc. If you have a representative image, you can add it here:
https://docs.google.com/presentation/d/16Blkgb1ozjIijx-jv_3QtvgQhAlvHXDiH5ZjtK5WJrw/edit?usp=sharing

Contributor Author

@djbrooke djbrooke commented Feb 10, 2020

Document with a proposed approach, comments welcome:
https://docs.google.com/document/d/1xG8xAcPSOe1xCWUlhj46AKrK4MAZbY6ed96yBKHCXiA/edit

Contributor

@craig-willis craig-willis commented Feb 11, 2020

@stain might have some constructive input from the RO-crate perspective, if not already involved.

@DS-INRA DS-INRA added this to Interest in Data INRAE Jan 12, 2021