Lazy artifact without unpacking (non-tarball) #2764

jonas-schulze · 2021-10-12T18:03:53Z

I would like to "deliver" a .mat data set that I need during package testing as an artifact. The data set happens to be hosted already, though as is and not as a tar.gz. IIRC .mat files support compression on their own, so wrapping them in a tar.gz feels odd.

How do I declare a lazy artifact (containing only a single file) that doesn't need to be unpacked?

If that's not possible (yet), I would like to propose adding a new keyword unpack (default: true, which matches the current behavior) to Artifacts.toml.

Somewhat related:

Document automatic unpacking of artifacts #1467 (suggested to me that unpacking does not always happen)
better API for generating artifact from tarball URL #1950 (would need the keyword as well)
https://discourse.julialang.org/t/creating-artifacts-toml-for-existing-tarball/33365

DilumAluthge · 2021-10-13T02:36:07Z

The DataDeps.jl package (https://github.com/oxinabox/DataDeps.jl) might be a good solution for your use case.

KristofferC · 2021-10-13T09:51:48Z

In theory we could look at the magic bytes to see if it is a gzipped file, otherwise, assume it is uncompressed..?
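A minimal sketch of the magic-byte check suggested here, written in Python purely for illustration (this is not Pkg's code, and Pkg itself is written in Julia). It relies on the fact that every gzip stream starts with the two bytes 0x1f 0x8b; the helper name `looks_gzipped` and the sample payloads are hypothetical.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream (RFC 1952)


def looks_gzipped(data: bytes) -> bool:
    """Return True if the payload starts with the gzip magic bytes."""
    return data[:2] == GZIP_MAGIC


# Hypothetical payloads: a gzip-compressed blob and the start of a raw file.
compressed = gzip.compress(b"some tarball bytes")
raw = b"MATLAB 5.0 MAT-file"

print(looks_gzipped(compressed))  # True
print(looks_gzipped(raw))         # False
```

A check like this only separates gzip from non-gzip downloads. It cannot tell whether the decompressed payload is a tarball, so it would not by itself decide whether to unpack.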

StefanKarpinski · 2021-10-13T13:43:55Z

Another layer of compression shouldn't really hurt though, and you can use gzip -1 to minimize the effort.

jonas-schulze · 2021-10-13T13:50:32Z

Ref oxinabox/DataDeps.jl#113

KristofferC · 2021-10-13T14:01:27Z

> Another layer of compression shouldn't really hurt though

But then you need to rehost the files.

simonbyrne · 2023-06-03T18:55:49Z

Agreed, there are many dataset hosting providers which expect you to upload the file directly, rather than uploading a tarball wrapping a file.

StefanKarpinski · 2023-06-06T16:10:17Z

If we allow artifacts to be arbitrary container and non-container formats with arbitrary compression schemes, there's really a never-ending stream of features that would have to be added, which is not something I think it's acceptable to do with a feature like artifacts that's built into the package manager.

Consider something apparently simple like allowing artifacts to be just a single file. This seems straightforward enough: you just use the git blob hash of the file as its content address and put the file at the artifact path instead of an extracted artifact directory like we do currently. So the path to this file will be something like ~/.julia/artifacts/a01fab9ad601903eaa0290a41c6a796525313337. However, many use cases of files require that the file name have a correct extension and a reasonable file name like data.mat. The current answer to that is genuinely simple if not always convenient: the artifact is a directory containing the single file data.mat. If we're trying to support an artifact being a single file with this extension/file name requirement, we'd need to start adding features: in this case an option to say that the actual path to the artifact is inside the usual top-level location at ~/.julia/artifacts/a01fab9ad601903eaa0290a41c6a796525313337/data.mat. But then the artifact isn't actually content-addressed anymore: you need to know the content hash and the path inside of the directory, and if two different artifact files referenced the same content address with a different hash, then they could extract the data to a different location. So even something simple like "let an artifact be a single file" leads to a whole can of worms. The simplest option is just to require it to be a directory, which is what we've done.

Different compression and container formats are more reasonable, imo, since they only complicate the model of how to deliver an artifact, rather than complicating the model of what an artifact is. The main issue with that is that Pkg needs to be able to extract other container formats. Julia is shipped with the dependencies required to decompress and extract tarballs, but we don't really want to add more dependencies to Julia for every format someone happens to want to use. But we could have a plugin system where a download stanza specifies a registered package/function for handling the content of the download stanza, and then lets the package acquire the artifact content however one wants.

For example, we could support downloading a single file something like this:

[data_mat]
git-tree-sha1 = "83f7499f0e79ac39a1a34d3e6ac119f5389ee66d"

    [[data_mat.download]]
    plugin = "FileArtifacts"
    url = "https://example.com/path/to/data.mat"
    sha256 = "ab2332e1005836afb236bf8515adf1b0522b640a51c9b8a401d64e3f5fc4478c"

What this would do is use the package called FileArtifacts (which must appear in the Project.toml file of the package where the Artifacts.toml file lives) to download the data_mat artifact. It would do the following:

1. Download the URL https://example.com/path/to/data.mat
2. Check that the SHA256 hash of the file is ab23...4478c
3. Save the file as data.mat (derived from the URL) in an empty directory
4. Compute the git-tree-sha1 of the directory (not the file) and make sure it's 83f7...e66d
5. Install the artifact directory at ~/.julia/artifacts/83f7...e66d

The end result is that data.mat can be found at ~/.julia/artifacts/83f7...e66d/data.mat. People could implement artifact downloaders for zip files, different compression formats, etc.
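The hashing steps above can be sketched as follows. This is a Python illustration under stated assumptions, not the proposed plugin or any existing Pkg API: the function names are hypothetical, the download and install steps are left out, and the file is assumed to be stored non-executable (git mode 100644). It follows git's object format, where a blob hashes as `blob <size>\0<content>` and a tree as `tree <size>\0<entries>`.

```python
import hashlib


def git_blob_sha1(content: bytes) -> bytes:
    """Raw 20-byte git blob hash of a file's content."""
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).digest()


def single_file_tree_sha1(filename: str, content: bytes) -> str:
    """Hex git tree hash of a directory holding exactly one regular file.

    Assumption: the file is non-executable, so its git mode is 100644.
    """
    entry = b"100644 " + filename.encode("utf-8") + b"\0" + git_blob_sha1(content)
    header = b"tree %d\0" % len(entry)
    return hashlib.sha1(header + entry).hexdigest()


def verify_single_file_artifact(filename: str, content: bytes,
                                expected_sha256: str,
                                expected_tree_sha1: str) -> bool:
    """Steps 2 and 4: check the download hash, then the directory tree hash."""
    if hashlib.sha256(content).hexdigest() != expected_sha256:
        return False
    return single_file_tree_sha1(filename, content) == expected_tree_sha1


# Hypothetical payload standing in for the downloaded data.mat.
payload = b"hello world\n"
sha256 = hashlib.sha256(payload).hexdigest()
tree = single_file_tree_sha1("data.mat", payload)
print(verify_single_file_artifact("data.mat", payload, sha256, tree))  # True
```

Note that the tree hash covers the file name as well as the content, so the same bytes saved as data.mat and as other.mat yield different git-tree-sha1 values, whereas the blob hash depends on the content alone.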

This is the way forward, but I'm not sure I really want to do this. Among other things, this would entail either not serving such artifacts through the package server system, or running arbitrary package code for artifact downloading in the package server system. Neither option is super appealing to me. We could maybe approve specific packages as "blessed" downloaders that we allow running on the package servers.

jonas-schulze · 2023-06-09T13:46:53Z

> But then the artifact isn't actually content-addressed anymore: you need to know the content hash and the path inside of the directory, [...]

Isn't this exactly what is already required from a user's perspective? In order to access anything from an artifact, the user has to call joinpath(artifact"foo", "data.mat"). Here, artifact"foo" resolves to the content-addressed hash of the directory and data.mat is the object within it that the user is actually interested in.

> [...] and if two different artifact files referenced the same content address with a different hash, then they could extract the data to a different location.

I think I don't quite understand what you mean. If two artifact files (does this refer to "descriptors", i.e. artifact"foo" and artifact"bar"?) refer to the same content, they will by design resolve to the same hash, won't they?

The considerations you described sound more like implementation details to me -- no offense. All I am asking for is an option to skip a certain part of the download/registration/creation process of an artifact, namely archive inflation. I am not questioning what an artifact is. An artifact remains a single file before and during download (a compressed or uncompressed tarball, or an arbitrary file) which becomes a content-addressed directory. This doesn't change at all. And from a user's perspective it doesn't change either. The user shouldn't need to care how the content hash comes to be, because a user never gets in touch with it anyway. This is a detail hidden within artifact"foo", as it should be.

simonbyrne changed the title ~~Lazy artifact without unpacking~~ Lazy artifact without unpacking (non-tarball) Jun 3, 2023

chengchingwen mentioned this issue Oct 6, 2023

support tiktoken chengchingwen/BytePairEncoding.jl#7

Merged

marcom mentioned this issue Jan 11, 2024

Simplify Initial Setup for LLM Newcomers Using llama.jl Package marcom/Llama.jl#2

Open
