
File Hierarchy: I want to be able to preserve my dataset's files' directory structure, for easy import, computation, and navigation. #2249

Closed
eaquigley opened this issue Jun 11, 2015 · 51 comments

Comments

@eaquigley
Contributor

User community request: the ability to organize files in a dataset hierarchically, so that when a user exports files from Google Drive, Dropbox, OSF (or any other location), the file structure is maintained without causing the user extra work on the Dataverse side.

@eaquigley eaquigley added the labels "UX & UI: Design", "Type: Suggestion", "Status: Design", and "Feature: File Upload & Handling" Jun 11, 2015
@eaquigley eaquigley self-assigned this Jun 11, 2015
@dpwrussell

One highly related thing, which is extremely prevalent in microscopy and, I would guess, other fields: in addition to encoding metadata into the directory hierarchy, people have also encoded it into the filenames, usually underscore-separated.

E.g. /usr/people/bioc0759/data/EB1-posterior-polarity/EB1-Colcemid-UV-inactivation/RMP_20090228_colcemid_UV_inactivation/rmp_20090228_colcemid-6hrs_EB1EB1_stg9_Az_18_R3D.dv

Some of this will be important metadata, some of it may not be. The ability to automatically or semi-automatically import some of this metadata (but hopefully not the junk) in the form of tag annotations sounds useful so that search/filtering can make use of them.
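To make the idea concrete, here is a minimal, illustrative Python sketch (not a Dataverse feature) of semi-automatic tag extraction from a path like the one above. It assumes underscores and hyphens are the separators within filenames and sets aside purely numeric tokens (dates, counters) for a curator to review as probable junk:

```python
import re

def candidate_tags(path):
    """Split a file path into candidate tag tokens.

    Directory names are taken whole; the filename stem is split on
    underscores and hyphens. Purely numeric tokens are returned
    separately so a curator can decide whether to keep them.
    """
    parts = path.strip("/").split("/")
    stem = re.sub(r"\.[^.]+$", "", parts[-1])  # drop the file extension
    tokens = parts[:-1] + re.split(r"[_\-]", stem)
    words = [t for t in tokens if t and not t.isdigit()]
    numbers = [t for t in tokens if t.isdigit()]
    return words, numbers

words, numbers = candidate_tags(
    "EB1-Colcemid-UV-inactivation/RMP_20090228_colcemid_UV_inactivation/"
    "rmp_20090228_colcemid-6hrs_EB1EB1_stg9_Az_18_R3D.dv"
)
```

A real importer would add per-field heuristics (recognizing dates, stage numbers, channel names), but even this split surfaces tokens like `colcemid` and `stg9` that are worth turning into tag annotations.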

@dpwrussell

It might be useful to have a look at this tool, built for the Open Microscopy Environment. The UI is not beautiful, but it does this kind of metadata extraction into Tag Annotations: https://www.openmicroscopy.org/site/products/partner/omero.webtagging/

There is also a "search" tool which should really be called "navigation" because it allows the user to browse the graph of tags from any origin point. This resembled filesystem navigation somewhat and seemed to satisfy some users.

Caveat emptor: because the tags are stored in a relational DB, the queries behind this navigation can get very slow when there are large numbers of tags and/or large amounts of data tagged with them. Storing and updating a graph DB for this functionality would be ideal for making these queries performant.

@scolapasta scolapasta modified the milestone: In Review Jun 30, 2015
@eaquigley
Contributor Author

@eaquigley eaquigley modified the milestones: In Review, In Design Aug 14, 2015
@mercecrosas mercecrosas modified the milestones: In Design, In Review Nov 30, 2015
@pdurbin
Member

pdurbin commented Jan 15, 2016

Feedback from @pameyer: "preserving file naming and directory structure (with the exception of files.sha which holds the checksums) is important for users downloading the dataset, and doing computation locally on it".

Mostly I just want to make it clear that download is a use case. (We probably need a separate issue to talk about running computation on files.) In the FRD above this is currently a question ("Do these carry over into a folder structure when downloaded as a zip?") and the answer for many users, I think, is that they want/expect to be able to upload a zip and later download a zip that has the same directory structure inside it. Some months ago @cchoirat was talking about the importance of this for her (though she may not have been talking about zip files specifically). It's a common expectation. Right now Dataverse flattens your files into a single namespace/directory on upload.

@scolapasta scolapasta removed this from the Not Assigned to a Release milestone Jan 28, 2016
@leeper
Member

leeper commented Mar 1, 2016

I think this would be really valuable. It was how things worked in versions < 4.0, as I recall, and currently it makes it somewhat unpredictable what will happen when uploading a project (e.g., via the API).

One possibility might be to do what S3 does with object keys that can have slashes in them:

Note that the Amazon S3 data model is a flat structure: you create a bucket, and the bucket stores objects. There is no hierarchy of subbuckets or subfolders; however, you can infer logical hierarchy using keyname prefixes and delimiters as the Amazon S3 console does. The Amazon S3 console supports a concept of folders.

The examples they give of object keys are:

Development/Projects1.xls
Finance/statement1.pdf
Private/taxdocument.pdf
s3-dg.pdf

This would allow a "flat" Dataset to contain files that can be batch downloaded into a hierarchical structure. Of course, I don't know if that works on the backend.
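The S3-style approach can be sketched in a few lines of Python: the dataset stays flat, and a folder tree is inferred purely from the delimiters in the keys, the way the S3 console does. This is an illustration of the idea, not Dataverse code:

```python
def tree_from_keys(keys, delimiter="/"):
    """Infer a folder tree from flat object keys.

    Returns a nested dict: folder names map to subtrees, and
    files are leaves that map to None.
    """
    root = {}
    for key in keys:
        node = root
        *folders, filename = key.split(delimiter)
        for folder in folders:
            node = node.setdefault(folder, {})
        node[filename] = None  # leaf: a file
    return root

tree = tree_from_keys([
    "Development/Projects1.xls",
    "Finance/statement1.pdf",
    "Private/taxdocument.pdf",
    "s3-dg.pdf",
])
```

Since the hierarchy is derived, nothing about the flat storage model has to change; a batch download can walk this tree to lay files out into real directories.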

@pdurbin
Member

pdurbin commented Mar 1, 2016

This issue was raised yesterday by @pameyer and others from @sbgrid. @bmckinney, if you want, you could assign yourself to this issue to at least think about it. I remember @dpwrussell of OMERO fame talking about it during the 2015 Dataverse Community Meeting.

@leeper you're right. From what I've heard from @landreev, in the DVN 3.x days a zip download would sort of reconstruct the file system hierarchy. I'm obviously fuzzy on the details.

@wddabc

wddabc commented May 1, 2016

I'm wondering whether there is a way to upload an entire directory (for example, by dragging a folder; currently, only dragging a file is supported) so that the structure is maintained. The user could then browse the directories and files by simply clicking through, as on Dropbox and GitHub, without explicitly downloading and unzipping the data.

@jeisner

jeisner commented May 3, 2016

👍 on this request. The directory structure is often important. Sometimes there are even multiple subdirectories that contain identically named files, e.g., for different experimental subjects or different versions of an experiment.

@pdurbin is correct that download is a use case. So is online browsing of the dataset to get a feel for what's there -- the directory structure provides very useful organization.

@TaniaSchlatter
Member

In the short term, we are considering using the file hierarchy information as metadata, stored in the database, rather than having the files in a hierarchy on disk. This would allow users to view the hierarchy in a preview, with the file display in the table (adding filtering and sorting capabilities).

  • Users could add or move things around from the UI by specifying or editing the file's path.

  • On download, the original structure is recreated in the zip file they download.

This doesn't address everything desired, but we are interested in getting comments on this proposal. Here is a more detailed description:

Depositor drags a zip (not a double zip) into the dataset. The zip will be unpacked with its directory structure preserved (see #3448). Individual files will be ingested (if necessary) and displayed just like any other file in a dataset – flat. Individual files can be downloaded. If all or any files are downloaded, the hierarchy will be re-created in a zip, matching the structure of the file that was uploaded in the first place.

A user wanting to access data selects “Download all” and downloads the original zip hierarchy. The system behavior is transparent to depositors.
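A minimal sketch of the download side of this proposal, assuming each flat-stored file carries its original path as a plain metadata string (the storage IDs and filenames here are illustrative, not Dataverse internals):

```python
import io
import zipfile

# Flat storage: each stored object pairs its bytes with the
# directory path recorded as metadata at upload time.
files = {
    "data001": (b"id,score\n1,42\n", "Original-Data/scores.csv"),
    "data002": (b"clean <- read.csv('scores.csv')\n", "Command-Files/clean.R"),
    "data003": (b"Read me first.\n", "README.txt"),  # no subfolder
}

def zip_with_hierarchy(files):
    """Write flat-stored files into an in-memory zip, restoring
    each file's recorded directory path."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for content, path in files.values():
            zf.writestr(path, content)
    buf.seek(0)
    return buf

archive = zipfile.ZipFile(zip_with_hierarchy(files))
```

Because the hierarchy lives only in metadata, editing a file's path in the UI changes nothing on disk; the next "Download all" simply writes the new paths into the zip.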

Add files

  • User deposits a new file and adds a path to an input field to put the file in to or out of the structure.
  • If new .zip files are added: either a path is specified where their contents will be placed, or, if no path is specified, the new .zip is provided as a separate file inside the zip that's downloaded.
  • A new subdirectory can be created by assigning it to a file (empty subdirectories cannot be created).

Move files
Similar to the above: provide a way to edit the file's path.

Versioning
Consider treating a move or addition as a metadata change and displaying a new version in the version table.
File removal – same as for any other file.

View hierarchy
Show a “preview” of the hierarchical contents of the dataset.

  • The system generates a text or image file that can be previewed (popup, tab, replacing the thumbnail, in addition to the thumbnail…). Open questions: what format should the hierarchy preview take – image or text? What if there are a large number of files? Is an interactive preview reasonably doable?

Replace/Unzip existing .zip
For existing double .zip files: users can delete the original .zip and then upload a single .zip.

Download
For Stata files, add a toggle for original vs. ingested, or decide to show only one? What about the download limit – how might that affect download? Can we leverage the S3/large/package file download UI?

@pdurbin
Member

pdurbin commented Apr 5, 2019

@dpwrussell @pameyer @leeper @wddabc @jeisner @nmedeiro @christophergandrud @pdeffebach @setgree @mdehollander (and anyone else who is following this issue) good news! Dataverse 4.12 has support for organizing files into folders!

Can you all please try it out at https://demo.dataverse.org and give us feedback? Here are some screenshots that show how to introduce a folder hierarchy to your dataset's files:

[Two screenshots: introducing a folder hierarchy to a dataset's files]

This feature is documented as "File Path" at http://guides.dataverse.org/en/4.12/user/dataset-management.html#file-path and here's a screenshot of the docs:

[Screenshot: the "File Path" section of the docs]

Please just leave a comment below! Thanks!
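For scripted deposits, the same path is exposed in the native API as `directoryLabel` in the `jsonData` sent alongside an uploaded file (see the API guide). A sketch of building that payload – the server URL and DOI below are placeholders, and the actual multipart POST is described in a comment rather than executed:

```python
import json

# Placeholder values: substitute your installation, API token, and dataset DOI.
SERVER = "https://demo.dataverse.org"
PERSISTENT_ID = "doi:10.70122/FK2/EXAMPLE"

# The native API's add-file endpoint takes a "jsonData" form field;
# "directoryLabel" is the file path shown as "File Path" in the UI.
json_data = json.dumps({
    "description": "Survey responses",
    "directoryLabel": "Original-Data",
})

# The upload itself (not performed here) would be a multipart POST to
# f"{SERVER}/api/datasets/:persistentId/add?persistentId={PERSISTENT_ID}"
# with the file and the jsonData field, authenticated via the
# X-Dataverse-key header.
```

This lets a deposit script recreate a local directory tree in the dataset by walking the tree and sending each file's relative parent directory as its `directoryLabel`.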

@mdehollander

@pdurbin Good that this works now with a zip file. Ideally I would like to see it also working with drag & drop, and to be able to browse through folders in the interface instead of seeing the folder name listed for each file. But hey, thanks for making this possible already!

@pdurbin
Member

pdurbin commented Apr 5, 2019

@mdehollander great suggestion! Please feel free to open a new issue for this.

Everyone, while I'm writing I'll mention that I also wrote about the progress so far in this "Control over dataset file hierarchy + directory structure (new feature in Dataverse 4.12)" thread and feedback is welcome there as well: https://groups.google.com/d/msg/dataverse-community/8gn5pq0cVc0/MCMQAQHRAQAJ

If anyone wants to reply via Twitter, I would suggest piling on to one of these tweets:

We're currently working on "Enable the display of file hierarchy metadata on the dataset page" in #5572.

@nmedeiro

nmedeiro commented Apr 5, 2019 via email

@pdurbin
Member

pdurbin commented Apr 5, 2019

@nmedeiro fantastic! If you have a public sample zip file with a folder hierarchy that you give to your students that we can also use in our own testing, please let us know where to download it. 😄

Yes, I've been thinking that this is an important step toward more automated reproducibility. Code Ocean, for example, wants a "data" folder and "code" folder, as I wrote about at #4714 (comment) . Here's a screenshot:

[Screenshot: Code Ocean's "code" and "data" folders]

@nmedeiro

nmedeiro commented Apr 5, 2019 via email

@pdurbin
Member

pdurbin commented Apr 5, 2019

@nmedeiro thanks! It's only 6.5 MB. Can I make it public by attaching it to this issue?

@nmedeiro

nmedeiro commented Apr 5, 2019 via email

@pdurbin
Member

pdurbin commented Apr 5, 2019

@nmedeiro thanks! Here it is: dataverse_files.zip

Inside the "Replication Documentation for Midlife Crisis Paper" directory are the following files:

Original-Data/importable-pew.dta
Original-Data/original-pew.sav
Original-Data/original-wdi.xlsx
Command-Files/5-analysis.do
Command-Files/4-data-appendix.do
Analysis-Data/country-analysis.dta
Analysis-Data/individual-analysis.dta
Introduction to the Tier Protocol (v. 3.0) demo project. 2017-07-11.pdf
Introduction to the Tier Protocol (v. 3.0) demo project. 2017-07-11.docx

@djbrooke
Contributor

Thanks all for the feedback as we evaluated and implemented this in Dataverse. Very exciting to see this feature added.

#5572 (view hierarchy) has been merged and will be included in the next release. Retaining file hierarchy for zips and editing the hierarchy were added in previous releases, so I'm closing this issue.

@pdurbin
Member

pdurbin commented May 10, 2019

@nmedeiro here's how the files and folders look in the "tree" view we shipped in Dataverse 4.13:

[Screenshot: the files and folders in the "tree" view]

Thanks again!

@nmedeiro

nmedeiro commented May 10, 2019 via email

@setgree

setgree commented Feb 16, 2024

Hi, so the canonical solution to this problem is to upload a zip file? I was trying to upload some files and folders recently -- which I've organized carefully in order to ensure reproducibility -- and I was unable to figure out how to upload the files in a nested way.

@pdurbin
Member

pdurbin commented Feb 16, 2024

@setgree

setgree commented Feb 16, 2024

Thank you, I appreciate your quick response. This answer surprised me. IMHO:

  1. reproducibility is a core goal/function of Dataverse;

  2. good organizational hygiene is essential for computational reproducibility;

  3. The tools you have shared are all, IMO, workarounds. They are not integrated into the default way a person would use Dataverse – which I understand to be (a) uploading files via the browser interface, (b) minting a DOI, and (c) putting that DOI in the accompanying paper – nor surfaced to a user trying to upload files.

Does the Dataverse team intend to integrate folder preservation into the default flow, or is the team happy with the way things stand?

(Perhaps this has been discussed elsewhere, my apologies if I missed it)

@qqmyers
Member

qqmyers commented Feb 16, 2024

FWIW: there are potential security issues in allowing web apps to scan your disk for files. DVWebloader uses a de facto standard, supported by most browsers, to upload a whole directory after the user clicks OK in a browser-mandated popup. (Conversely, when the user specifies the exact files involved, as in our normal upload, no popup is required, but the app doesn't get to know the local path.) I'm sure that with the work on creating a new React front end for Dataverse, we'll look at supporting directories more cleanly where possible. (Also, regarding surfacing: when DVWebloader is installed, the upload page shows an 'Upload a Folder' option, so it is visible.)

@pdurbin
Member

pdurbin commented Feb 16, 2024

As @qqmyers says, DVWebloader already integrates folder preservation into the default flow, but it's an optional component that needs to be installed (see https://guides.dataverse.org/en/6.1/user/dataset-management.html#folder-upload ). If you're curious what it looks like, there are some screenshots in this pull request:

And yes, I agree that when we get to implementing file upload in the new frontend ( https://github.com/IQSS/dataverse-frontend ), we should strongly consider folder upload. Better reproducibility without workarounds. 100%.

@setgree all this is to say, yes, we are fully supportive of your ideas! 😄

As far as things being discussed elsewhere, a good place for discussion is https://groups.google.com/g/dataverse-community or https://chat.dataverse.org . You are very welcome to join and post!

@jeisner

jeisner commented Feb 16, 2024

Just a remark that if Dataverse were being built today, it would undoubtedly be built on top of git. Obviously git already handles all of the concerns above, including directory structure and avoiding duplicate storage between similar versions, so reinventing all the functionality may be unnecessary.

To use git-Dataverse, a project would need to host its own git repo anywhere else. It could be a private repo. That repo would tag a small number of revisions as releases. git-Dataverse would then host a public, read-only "sparse mirror" that contained only the release revisions (and only the public parts of them) but was guaranteed to be archival, which is the point of Dataverse, I think? So a user of the sparse mirror could download a snapshot -- or could download the whole sparse mirror and see the diffs between releases.

I am not sure how to construct such a sparse mirror, which collapses the intermediate history between releases and removes private material from each release. However, https://github.com/newren/git-filter-repo looks like a possible starting point.

BTW, this is a feature that I could imagine github providing -- a kind of compromise between public and private repos -- but maybe they don't do this because they want to encourage open-source development, with fully public repos. Even if they did provide it, Dataverse may support bigger datasets and may have other features I don't know about.

@pdurbin
Member

pdurbin commented Feb 16, 2024

@jeisner ha, this reminds me of my "a thought experiment: datasets as git repos" (doc, email) from 10 years ago. I even made a little logo:

[Screenshot: the logo]

That is to say, we thought about it! OSF is actually built on top of git and they dissuaded us from doing the same when we did our big rewrite back then.

Dataverse is still in the git game though! See https://github.com/datalad/datalad-dataverse for a recent integration. I'm planning on learning more in Germany in April at distribits.

Anyway, the Google Group and the chat room are good places to talk about this. Please feel free to kick off a discussion! 😄
