
File Hierarchy: I want to be able to preserve my dataset's files' directory structure, for easy import, computation, and navigation. #2249

Closed
eaquigley opened this issue Jun 11, 2015 · 51 comments

Comments

@eaquigley
Contributor

User community request: the ability to organize files in a dataset hierarchically, so that when a user exports files from Google Drive, Dropbox, OSF (or any other location), the file structure is maintained without causing the user extra work on the Dataverse side.

@eaquigley eaquigley added the labels "UX & UI: Design", "Type: Suggestion", "Status: Design", and "Feature: File Upload & Handling" Jun 11, 2015
@eaquigley eaquigley self-assigned this Jun 11, 2015
@dpwrussell

One highly related thing, which is extremely prevalent in microscopy and, I would guess, other fields: in addition to encoding metadata into the directory hierarchy, people have also encoded it into the filenames, usually underscore-separated.

E.g. /usr/people/bioc0759/data/EB1-posterior-polarity/EB1-Colcemid-UV-inactivation/RMP_20090228_colcemid_UV_inactivation/rmp_20090228_colcemid-6hrs_EB1EB1_stg9_Az_18_R3D.dv

Some of this will be important metadata, some of it may not be. The ability to automatically or semi-automatically import some of this metadata (but hopefully not the junk) in the form of tag annotations sounds useful so that search/filtering can make use of them.
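To make the idea concrete, here is a minimal, illustrative Python sketch (not a Dataverse feature) of semi-automatic tag extraction from a path like the one above. It assumes underscores and hyphens are the separators within filenames and sets aside purely numeric tokens (dates, counters) for a curator to review as probable junk:

```python
import re

def candidate_tags(path):
    """Split a file path into candidate tag tokens.

    Directory names are taken whole; the filename stem is split on
    underscores and hyphens. Purely numeric tokens are returned
    separately so a curator can decide whether to keep them.
    """
    parts = path.strip("/").split("/")
    stem = re.sub(r"\.[^.]+$", "", parts[-1])  # drop the file extension
    tokens = parts[:-1] + re.split(r"[_\-]", stem)
    words = [t for t in tokens if t and not t.isdigit()]
    numbers = [t for t in tokens if t.isdigit()]
    return words, numbers

words, numbers = candidate_tags(
    "EB1-Colcemid-UV-inactivation/RMP_20090228_colcemid_UV_inactivation/"
    "rmp_20090228_colcemid-6hrs_EB1EB1_stg9_Az_18_R3D.dv"
)
```

A real importer would add per-field heuristics (recognizing dates, stage numbers, channel names), but even this split surfaces tokens like `colcemid` and `stg9` that are worth turning into tag annotations.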

@dpwrussell

It might be useful to have a look at this tool, built for the Open Microscopy Environment. The UI is not beautiful, but it does this kind of metadata extraction into Tag Annotations: https://www.openmicroscopy.org/site/products/partner/omero.webtagging/

There is also a "search" tool which should really be called "navigation" because it allows the user to browse the graph of tags from any origin point. This resembled filesystem navigation somewhat and seemed to satisfy some users.

Caveat emptor: because the tags are stored in a relational DB, the queries behind this navigation can get very slow when there are large numbers of tags and/or large amounts of data tagged with them. Storing and updating a graph DB for this functionality would be ideal for making these queries performant.

@scolapasta scolapasta modified the milestone: In Review Jun 30, 2015
@eaquigley
Contributor Author

@eaquigley eaquigley modified the milestones: In Review, In Design Aug 14, 2015
@mercecrosas mercecrosas modified the milestones: In Design, In Review Nov 30, 2015
@pdurbin
Member

pdurbin commented Jan 15, 2016

Feedback from @pameyer: "preserving file naming and directory structure (with the exception of files.sha which holds the checksums) is important for users downloading the dataset, and doing computation locally on it".

Mostly I just want to make it clear that download is a use case. (We probably need a separate issue to talk about running computation on files.) In the FRD above this is currently a question ("Do these carry over into a folder structure when downloaded as a zip?") and the answer for many users, I think, is that they want/expect to be able to upload a zip and later download a zip that has the same directory structure inside it. Some months ago @cchoirat was talking about the importance of this for her (though she may not have been talking about zip files specifically). It's a common expectation. Right now Dataverse flattens your files into a single namespace/directory on upload.

@scolapasta scolapasta removed this from the Not Assigned to a Release milestone Jan 28, 2016
@leeper
Member

leeper commented Mar 1, 2016

I think this would be really valuable. It was how things worked in versions < 4.0, as I recall, and currently it makes it somewhat unpredictable what will happen when uploading a project (e.g., via the API).

One possibility might be to do what S3 does with object keys that can have slashes in them:

Note that the Amazon S3 data model is a flat structure: you create a bucket, and the bucket stores objects. There is no hierarchy of subbuckets or subfolders; however, you can infer logical hierarchy using keyname prefixes and delimiters as the Amazon S3 console does. The Amazon S3 console supports a concept of folders.

The examples they give of object keys are:

Development/Projects1.xls
Finance/statement1.pdf
Private/taxdocument.pdf
s3-dg.pdf

This would allow a "flat" Dataset to contain files that can be batch downloaded into a hierarchical structure. Of course, I don't know if that works on the backend.
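The S3-style approach can be sketched in a few lines of Python: the dataset stays flat, and a folder tree is inferred purely from the delimiters in the keys, the way the S3 console does. This is an illustration of the idea, not Dataverse code:

```python
def tree_from_keys(keys, delimiter="/"):
    """Infer a folder tree from flat object keys.

    Returns a nested dict: folder names map to subtrees, and
    files are leaves that map to None.
    """
    root = {}
    for key in keys:
        node = root
        *folders, filename = key.split(delimiter)
        for folder in folders:
            node = node.setdefault(folder, {})
        node[filename] = None  # leaf: a file
    return root

tree = tree_from_keys([
    "Development/Projects1.xls",
    "Finance/statement1.pdf",
    "Private/taxdocument.pdf",
    "s3-dg.pdf",
])
```

Since the hierarchy is derived, nothing about the flat storage model has to change; a batch download can walk this tree to lay files out into real directories.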

@pdurbin
Member

pdurbin commented Mar 1, 2016

This issue was raised yesterday by @pameyer and others from @sbgrid. @bmckinney, if you want, you could assign yourself to this issue to at least think about it. I remember @dpwrussell of OMERO fame talking about it during the 2015 Dataverse Community Meeting.

@leeper you're right. From what I've heard from @landreev, in the DVN 3.x days a zip download would sort of reconstruct the file system hierarchy. I'm obviously fuzzy on the details.

@wddabc

wddabc commented May 1, 2016

I'm wondering whether there is a way to upload an entire directory (for example, by dragging a folder; currently, only dragging a file is supported) so that the structure is maintained. The user could then browse the directories and files by simply clicking through, as on Dropbox and GitHub, without explicitly downloading and unzipping the data.

@jeisner

jeisner commented May 3, 2016

👍 on this request. The directory structure is often important. Sometimes there are even multiple subdirectories that contain identically named files, e.g., for different experimental subjects or different versions of an experiment.

@pdurbin is correct that download is a use case. So is online browsing of the dataset to get a feel for what's there -- the directory structure provides very useful organization.

@TaniaSchlatter
Member

In the short term, we are considering using the file hierarchy information as metadata, stored in the database, rather than having the files in a hierarchy on disk. This would allow users to view the hierarchy in a preview, with the file display in the table (adding filtering and sorting capabilities).

  • Users could add or move things around from the UI by specifying or editing the file's path.

  • On download, the original structure is recreated in the zip file they download.

This doesn't address everything desired, but we are interested in getting comments on this proposal. Here is a more detailed description:

Depositor drags a zip (not a double zip) into the dataset. The zip will be unpacked with its directory structure preserved (see #3448). Individual files will be ingested (if necessary) and displayed just like any other file in a dataset – flat. Individual files can be downloaded. If all or any files are downloaded, the hierarchy will be re-created in a zip, matching the structure of the file that was uploaded in the first place.

A user wanting to access data selects “Download all” and downloads the original zip hierarchy. The system behavior is transparent to depositors.
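A minimal sketch of the download side of this proposal, assuming each flat-stored file carries its original path as a plain metadata string (the storage IDs and filenames here are illustrative, not Dataverse internals):

```python
import io
import zipfile

# Flat storage: each stored object pairs its bytes with the
# directory path recorded as metadata at upload time.
files = {
    "data001": (b"id,score\n1,42\n", "Original-Data/scores.csv"),
    "data002": (b"clean <- read.csv('scores.csv')\n", "Command-Files/clean.R"),
    "data003": (b"Read me first.\n", "README.txt"),  # no subfolder
}

def zip_with_hierarchy(files):
    """Write flat-stored files into an in-memory zip, restoring
    each file's recorded directory path."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for content, path in files.values():
            zf.writestr(path, content)
    buf.seek(0)
    return buf

archive = zipfile.ZipFile(zip_with_hierarchy(files))
```

Because the hierarchy lives only in metadata, editing a file's path in the UI changes nothing on disk; the next "Download all" simply writes the new paths into the zip.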

Add files

  • User deposits a new file and adds a path to an input field to put the file in to or out of the structure.
  • If new .zip files are added: either a path is specified where their contents will be placed, or, if no path is specified, the new .zip is provided as a separate file inside the zip that's downloaded.
  • A new subdirectory can be created by assigning it to a file (empty subdirectories cannot be created).

Move files
Similar to the above: provide a way to edit the file's path.

Versioning
Consider treating a move or addition as a metadata change and displaying a new version in the version table.
File removal – same as for any other file.

View hierarchy
Show a “preview” of the hierarchical contents of the dataset.

  • The system generates a text or image file that can be previewed (popup, tab, replacing the thumbnail, in addition to the thumbnail…). Open questions: what format should the hierarchy preview take – image or text? What if there are a large number of files? Is an interactive preview reasonably doable?

Replace/Unzip existing .zip
For existing double .zip files: users can delete the original .zip and then upload a single .zip.

Download
For Stata files, add a toggle for original vs. ingested, or decide to show only one? What about the download limit – how might that affect download? Can we leverage the S3/large/package file download UI?

@pdurbin
Member

pdurbin commented Apr 5, 2019

@dpwrussell @pameyer @leeper @wddabc @jeisner @nmedeiro @christophergandrud @pdeffebach @setgree @mdehollander (and anyone else who is following this issue) good news! Dataverse 4.12 has support for organizing files into folders!

Can you all please try it out at https://demo.dataverse.org and give us feedback? Here are some screenshots that show how to introduce a folder hierarchy to your dataset's files:

[Two screenshots: introducing a folder hierarchy to a dataset's files]

This feature is documented as "File Path" at http://guides.dataverse.org/en/4.12/user/dataset-management.html#file-path and here's a screenshot of the docs:

[Screenshot: the "File Path" section of the docs]

Please just leave a comment below! Thanks!
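For scripted deposits, the same path is exposed in the native API as `directoryLabel` in the `jsonData` sent alongside an uploaded file (see the API guide). A sketch of building that payload – the server URL and DOI below are placeholders, and the actual multipart POST is described in a comment rather than executed:

```python
import json

# Placeholder values: substitute your installation, API token, and dataset DOI.
SERVER = "https://demo.dataverse.org"
PERSISTENT_ID = "doi:10.70122/FK2/EXAMPLE"

# The native API's add-file endpoint takes a "jsonData" form field;
# "directoryLabel" is the file path shown as "File Path" in the UI.
json_data = json.dumps({
    "description": "Survey responses",
    "directoryLabel": "Original-Data",
})

# The upload itself (not performed here) would be a multipart POST to
# f"{SERVER}/api/datasets/:persistentId/add?persistentId={PERSISTENT_ID}"
# with the file and the jsonData field, authenticated via the
# X-Dataverse-key header.
```

This lets a deposit script recreate a local directory tree in the dataset by walking the tree and sending each file's relative parent directory as its `directoryLabel`.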

@mdehollander

@pdurbin Good that this works now with a zip file. Ideally I would like to see it also working with drag & drop, and to be able to browse through folders in the interface instead of seeing the folder name listed for each file. But hey, thanks for making this possible already!

@pdurbin
Member

pdurbin commented Apr 5, 2019

@mdehollander great suggestion! Please feel free to open a new issue for this.

Everyone, while I'm writing I'll mention that I also wrote about the progress so far in this "Control over dataset file hierarchy + directory structure (new feature in Dataverse 4.12)" thread and feedback is welcome there as well: https://groups.google.com/d/msg/dataverse-community/8gn5pq0cVc0/MCMQAQHRAQAJ

If anyone wants to reply via Twitter, I would suggest piling on to one of these tweets:

We're currently working on "Enable the display of file hierarchy metadata on the dataset page" in #5572.

@nmedeiro

nmedeiro commented Apr 5, 2019 via email

@pdurbin
Member

pdurbin commented Apr 5, 2019

@nmedeiro fantastic! If you have a public sample zip file with a folder hierarchy that you give to your students that we can also use in our own testing, please let us know where to download it. 😄

Yes, I've been thinking that this is an important step toward more automated reproducibility. Code Ocean, for example, wants a "data" folder and "code" folder, as I wrote about at #4714 (comment) . Here's a screenshot:

[Screenshot: Code Ocean's "code" and "data" folders]

@nmedeiro

nmedeiro commented Apr 5, 2019 via email

@pdurbin
Member

pdurbin commented Apr 5, 2019

@nmedeiro thanks! It's only 6.5 MB. Can I make it public by attaching it to this issue?

@nmedeiro

nmedeiro commented Apr 5, 2019 via email

@pdurbin
Member

pdurbin commented Apr 5, 2019

@nmedeiro thanks! Here it is: dataverse_files.zip

Inside the "Replication Documentation for Midlife Crisis Paper" directory are the following files:

Original-Data/importable-pew.dta
Original-Data/original-pew.sav
Original-Data/original-wdi.xlsx
Command-Files/5-analysis.do
Command-Files/4-data-appendix.do
Analysis-Data/country-analysis.dta
Analysis-Data/individual-analysis.dta
Introduction to the Tier Protocol (v. 3.0) demo project. 2017-07-11.pdf
Introduction to the Tier Protocol (v. 3.0) demo project. 2017-07-11.docx

@djbrooke
Contributor

Thanks all for the feedback as we evaluated and implemented this in Dataverse. Very exciting to see this feature added.

#5572 (view hierarchy) has been merged and will be included in the next release. Retaining file hierarchy for zips and editing the hierarchy were added in previous releases, so I'm closing this issue.

@pdurbin
Member

pdurbin commented May 10, 2019

@nmedeiro here's how the files and folders look in the "tree" view we shipped in Dataverse 4.13:

[Screenshot: the files and folders in the "tree" view]

Thanks again!

@nmedeiro

nmedeiro commented May 10, 2019 via email

@setgree

setgree commented Feb 16, 2024

Hi, so the canonical solution to this problem is to upload a zip file? I was trying to upload some files and folders recently -- which I've organized carefully in order to ensure reproducibility -- and I was unable to figure out how to upload the files in a nested way.

@pdurbin
Member

pdurbin commented Feb 16, 2024

@setgree

setgree commented Feb 16, 2024

Thank you, I appreciate your quick response. This answer surprised me. IMHO:

  1. reproducibility is a core goal/function of Dataverse;

  2. good organizational hygiene is essential for computational reproducibility;

  3. The tools you have shared are all, IMO, workarounds. They are not integrated into the default way a person would use Dataverse – which I understand to be (a) uploading files via the browser interface, (b) minting a DOI, and (c) putting that DOI in the accompanying paper – nor surfaced to a user trying to upload files.

Does the Dataverse team intend to integrate folder preservation into the default flow, or is the team happy with the way things stand?

(Perhaps this has been discussed elsewhere, my apologies if I missed it)

@qqmyers
Member

qqmyers commented Feb 16, 2024

FWIW: there are potential security issues in allowing web apps to scan your disk for files. DVWebloader uses a de facto standard, supported by most browsers, to upload a whole directory after the user clicks OK in a browser-mandated popup. (Conversely, when the user specifies the exact files involved, as in our normal upload, no popup is required, but the app doesn't get to know the local path.) I'm sure that with the work on creating a new React front end for Dataverse, we'll look at supporting directories more cleanly where possible. (Also, regarding surfacing: when DVWebloader is installed, the upload page shows an 'Upload a Folder' option, so it is visible.)

@pdurbin
Member

pdurbin commented Feb 16, 2024

As @qqmyers says, DVWebloader already integrates folder preservation into the default flow, but it's an optional component that needs to be installed (see https://guides.dataverse.org/en/6.1/user/dataset-management.html#folder-upload ). If you're curious what it looks like, there are some screenshots in this pull request:

And yes, I agree that when we get to implementing file upload in the new frontend ( https://github.com/IQSS/dataverse-frontend ), we should strongly consider folder upload. Better reproducibility without workarounds. 100%.

@setgree all this is to say, yes, we are fully supportive of your ideas! 😄

As far as things being discussed elsewhere, a good place for discussion is https://groups.google.com/g/dataverse-community or https://chat.dataverse.org . You are very welcome to join and post!

@jeisner

jeisner commented Feb 16, 2024

Just a remark that if Dataverse were being built today, it would undoubtedly be built on top of git. Obviously git already handles all of the concerns above, including directory structure and avoiding duplicate storage between similar versions, so reinventing all the functionality may be unnecessary.

To use git-Dataverse, a project would need to host its own git repo anywhere else. It could be a private repo. That repo would tag a small number of revisions as releases. git-Dataverse would then host a public, read-only "sparse mirror" that contained only the release revisions (and only the public parts of them) but was guaranteed to be archival, which is the point of Dataverse, I think? So a user of the sparse mirror could download a snapshot -- or could download the whole sparse mirror and see the diffs between releases.

I am not sure how to construct such a sparse mirror, which collapses the intermediate history between releases and removes private material from each release. However, https://github.com/newren/git-filter-repo looks like a possible starting point.

BTW, this is a feature that I could imagine github providing -- a kind of compromise between public and private repos -- but maybe they don't do this because they want to encourage open-source development, with fully public repos. Even if they did provide it, Dataverse may support bigger datasets and may have other features I don't know about.

@pdurbin
Member

pdurbin commented Feb 16, 2024

@jeisner ha, this reminds me of my "a thought experiment: datasets as git repos" (doc, email) from 10 years ago. I even made a little logo:

[Screenshot: the logo]

That is to say, we thought about it! OSF is actually built on top of git and they dissuaded us from doing the same when we did our big rewrite back then.

Dataverse is still in the git game though! See https://github.com/datalad/datalad-dataverse for a recent integration. I'm planning on learning more in Germany in April at distribits.

Anyway, the Google Group and the chat room are good places to talk about this. Please feel free to kick off a discussion! 😄
