[CDV] Filenames in underlying storage should be human readable #4041

jeremyfreudberg · 2017-08-03T17:47:18Z

#2909 (comment) affirmed that in Cloud Dataverse a filename in the underlying storage (Swift) would be a "filesystem name", which is unique, but also not human-readable.

The lack of a true rename operation in Swift, worries about uniqueness, and the fact that the Dataverse download API preserves meaningful filenames anyway meant that at the time we were satisfied with the solution of non-pretty names in Swift.

Two specific scenarios where pretty filenames are wanted/needed:

BigData compute task globbing across multiple files, e.g. input to a Spark job is swift://container/*.csv
Ben Lewis has pointed out the value of optimal/detailed naming in the GeoMesa context; we need an easy way to identify time chunks, so a meaningful file listing would be helpful

More generally, the relevant scenarios can be summarized as any time someone or some service uses the Swift API to download files. We are currently dreaming up more applications (besides Hadoop/Spark via Sahara) which would prefetch files from the Swift endpoint for the user to play with using compute. In the current state of CDV, the user wouldn't be able to tell what's going on, since they would receive a whole bunch of randomly named files (anything bundled with the dataset, not just raw data) with no way to tell what's what.

Worth noting that these concerns are especially relevant for larger datasets -- direct access through the Swift API instead of the Dataverse API is crucial in that case.

This discussion also ties into a larger discussion about how dataset versioning is reflected on the Swift side of things.

pameyer · 2017-08-03T18:18:04Z

This is something that would be useful for us as well. The first of these scenarios is also relevant to compute tasks on data files / large datasets in Dataverse, regardless of the underlying storage (Swift, POSIX, S3, etc.), or more generically whenever Dataverse is sharing storage with another system.

jeremyfreudberg · 2017-08-15T18:10:07Z

Simplest solution (via @scolapasta and @ferrys) is to include a file in the container that gives the mapping of pretty filenames to actual filenames.

Also, to allow for the globbing (e.g. *.csv) use case that I mentioned, we can keep the file extension but replace the file name with the unique identifier.
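
To make the two ideas above concrete, here is a minimal sketch in Python. It is not Dataverse code: the helper names (`storage_name`, `build_manifest`), the use of a random UUID as the unique identifier, and the JSON layout of the mapping file are all assumptions made for illustration. It only shows that keeping the extension lets a `*.csv` glob still match, while a mapping file recovers the pretty names.

```python
import fnmatch
import json
import os
import uuid


def storage_name(pretty_name, identifier):
    """Replace the base name with the unique identifier, keeping the extension."""
    _, ext = os.path.splitext(pretty_name)
    return identifier + ext


def build_manifest(pretty_names):
    """Map each pretty filename to a (hypothetical) storage name."""
    # Assumption: a random UUID stands in for whatever unique
    # identifier the storage layer really assigns.
    return {name: storage_name(name, uuid.uuid4().hex) for name in pretty_names}


files = ["temps_2017-01.csv", "temps_2017-02.csv", "README.txt"]
manifest = build_manifest(files)

# The mapping would be stored as one more object in the container.
manifest_json = json.dumps(manifest, indent=2)

# A glob such as *.csv still works, because the extension survives.
csv_objects = fnmatch.filter(list(manifest.values()), "*.csv")

# Reverse lookup: from a storage name back to the pretty name.
pretty_by_storage = {stored: pretty for pretty, stored in manifest.items()}
```

A consumer that lists the container would fetch the mapping object first and use something like `pretty_by_storage` to label what it downloads. Note that this sketch keys the mapping on the pretty name alone, so two files with the same pretty name would collide; a real mapping would need to account for that.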

Still to be determined is compatibility with GeoMesa/Accumulo/etc. use cases.

pdurbin · 2022-09-30T02:31:31Z

Just a note that with the new Globus support, filenames are mapped from their weird underlying names on S3 to the human readable names one would expect. This is the PR:

GDCC/Globus and Big Data Support #8891

pdurbin · 2022-10-09T22:55:59Z

@jeremyfreudberg are you still interested in this? Thanks.

jeremyfreudberg · 2022-10-11T18:22:17Z

@pdurbin I haven't worked on Dataverse stuff since 2018 (but if you're hiring...). I'm not sure whether the MOC folks are still interested in this. If from your end it makes sense to simply close out this issue, then please do so.

djbrooke added the Status: Backlog label Sep 7, 2017

djbrooke removed the Status: Backlog label Feb 11, 2018

pdurbin added the Component: Code Infrastructure formerly "Feature: Code Infrastructure" label Oct 13, 2018

djbrooke added this to Inbox 🗄 in IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) May 8, 2019

pdurbin added Feature: File Upload & Handling Status: Still Interested? labels Oct 9, 2022

mreekie removed the Status: Still Interested? label Jan 10, 2023

pdurbin closed this as completed Oct 7, 2023
