[CDV] Filenames in underlying storage should be human readable #4041

Closed
jeremyfreudberg opened this issue Aug 3, 2017 · 5 comments

@jeremyfreudberg

#2909 (comment) affirmed that in Cloud Dataverse a filename in the underlying storage (Swift) would be a "filesystem name", which is unique, but also not human-readable.

The lack of a true rename operation in Swift, worries about uniqueness, and the fact that the Dataverse download API preserves meaningful filenames anyway meant that at the time we were satisfied with the solution of non-pretty names in Swift.

Two specific scenarios where pretty filenames are wanted/needed:

  • BigData compute task globbing across multiple files, e.g. input to a Spark job is swift://container/*.csv
  • Ben Lewis has pointed out the value of optimal/detailed naming in the GeoMesa context; we need an easy way to identify time chunks, so a meaningful file listing would be helpful

More generally, the relevant scenarios can be summarized as any time someone or some service uses the Swift API to download files. We are currently dreaming up more applications (besides Hadoop/Spark via Sahara) which would prefetch files from the Swift endpoint for the user to play with using compute. In the current state of CDV, the user wouldn't be able to tell what's going on, since they would receive a whole bunch of randomly named files (anything bundled with the dataset, not just raw data) with no way to tell what's what.

Worth noting that these concerns are especially relevant for larger datasets -- direct access through the Swift API instead of the Dataverse API is crucial in that case.

This discussion also ties into a larger discussion about how dataset versioning is reflected on the Swift side of things.

@pameyer
Contributor

pameyer commented Aug 3, 2017

This is something that would be useful for us as well. The first of these scenarios is also relevant to compute tasks on data files and large datasets in Dataverse, regardless of the underlying storage (Swift, POSIX, S3, etc.), or more generally whenever Dataverse is sharing storage with another system.

@jeremyfreudberg
Author

Simplest solution (via @scolapasta and @ferrys) is to include a file in the container that gives the mapping of pretty filenames to actual filenames.

Also, to allow for the globbing (e.g. *.csv) use case that I mentioned, we can keep the file extension but replace the file name with the unique identifier.
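A minimal sketch of the two ideas above, not Dataverse's actual implementation: the function names, the use of a random hex identifier, and the sample filenames are all assumptions made for illustration. It keeps each file's extension on an otherwise opaque storage name, builds a pretty-name-to-storage-name mapping that could be serialized and placed in the container, and shows that a `*.csv` glob still selects the right objects.

```python
import fnmatch
import json
import os
import uuid


def storage_name(pretty_name, unique_id=None):
    """Return an opaque storage name that keeps the original extension.

    unique_id is hypothetical; a random hex string is used when none is given.
    """
    _, ext = os.path.splitext(pretty_name)
    unique_id = unique_id or uuid.uuid4().hex
    # Lowercase the extension so a single glob such as *.csv matches all CSVs.
    return f"{unique_id}{ext.lower()}"


def build_manifest(pretty_names):
    """Map each human-readable name to its storage name."""
    return {name: storage_name(name) for name in pretty_names}


# Hypothetical dataset contents.
manifest = build_manifest(["survey_2017.csv", "codebook.pdf", "tracks_jan.CSV"])

# This JSON document is what would be stored alongside the data files.
manifest_json = json.dumps(manifest, indent=2, sort_keys=True)

# A compute job can still glob by extension over the opaque names.
csv_objects = fnmatch.filter(list(manifest.values()), "*.csv")
```

Note that a plain dictionary assumes pretty names are unique within a dataset; files with the same name in different folders would need a path-qualified key.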

Still to be determined is compatibility with the GeoMesa/Accumulo/etc. use cases.

@pdurbin
Member

pdurbin commented Sep 30, 2022

Just a note that with the new Globus support, filenames are mapped from their weird underlying names on S3 to the human readable names one would expect. This is the PR:

@pdurbin
Member

pdurbin commented Oct 9, 2022

@jeremyfreudberg are you still interested in this? Thanks.

@jeremyfreudberg
Author

@pdurbin I haven't worked on Dataverse stuff since 2018 (but if you're hiring...). I'm not sure whether the MOC folks are still interested in this. If from your end it makes sense to simply close out this issue, then please do so.
