
Pkg + BinaryProvider #841

Closed
staticfloat opened this issue Oct 18, 2018 · 80 comments · Fixed by #1277

@staticfloat (Member) commented Oct 18, 2018

The ponderous forms of Pkg and BinaryProvider slowly intermesh; Hulking behemoths merging their forms like waves from two separate oceans breaking upon the same shore. The silhouette of one blends seamlessly into the shadow of the other, a möbius strip of darkness and light, beginning and ending within itself.

Let's talk about the possible merging of BinaryProvider and Pkg, to integrate the binary installation story to unheard-of levels. Whereas:

  • Binary installation for us is now as simple as unpacking a tarball
  • Pkg knows how to unpack tarballs

I suggest that we do away with the weird indirection we currently have with packages using build.jl files to download tarballs, and instead integrate these downloads into Pkg completely. This implies that we:

  • Create a new concept within Pkg, that of a Binary Artifact. The main difference between a Binary Artifact and a Package is that Packages are platform-independent, while Binary Artifacts are necessarily not. We would need to port over the same kind of platform-matching code as is in BP right now, e.g. dynamically choosing the most specific matching tarball based on the currently running Julia. (See choose_download() within BP for more, and the sketch after this list.)

  • Modify BinaryBuilder output to generate Binary Artifacts that are then directly imported into the General Registry. The Binary Artifacts contain within them a small amount of Julia code; things like setting environment variables, mappings from LibraryProduct to actual .so file, functions to run an ExecutableProduct, etc... This is all auto-generated by BinaryBuilder.

  • Change client packages to simply declare a dependency upon these Binary Artifacts when they require a library. E.g. FLAC.jl would declare a dependency upon FLAC_jll, which itself declares a dependency upon Ogg_jll, and so on and so forth.

  • Eliminate the Pkg.build() step for these packages, as the build will be completed by the end of the download step. (We can actually just bake the deps.jl file into the Binary Artifact, as we are using relative paths anyway)
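
For illustration, here is a minimal sketch of the kind of platform matching choose_download() does today; the triplet strings and the host_triplet() helper are placeholders, not BP's actual implementation:

# Placeholder sketch of platform matching; not BP's real choose_download().
function host_triplet()
    arch = String(Sys.ARCH)            # e.g. "x86_64", "i686", "armv7l"
    if Sys.iswindows()
        os = "w64-mingw32"
    elseif Sys.isapple()
        os = "apple-darwin14"
    else
        os = "linux-gnu"
    end
    return string(arch, "-", os)
end

function choose_tarball(downloads::Dict{String,String})
    key = host_triplet()
    haskey(downloads, key) && return downloads[key]
    # Fall back to any triplet with a matching architecture if there is no exact hit.
    for (triplet, url) in downloads
        startswith(triplet, String(Sys.ARCH)) && return url
    end
    error("no compatible tarball found for $key")
end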

Please discuss.

@staticfloat (Member, Author)

Okay, let's get started on the first bullet point of this list: defining a BinaryArtifact type within Pkg. We need to create a new datatype within Pkg that represents not a Julia package but a BinaryArtifact, which is distinct in the following ways:

  • BinaryArtifacts are chosen not only by version, but also by runtime-reflected properties (CPU architecture, OS, libgfortran version, etc....)
  • Allow packages to list BinaryArtifacts as something they require, complete with version bounds.
  • Provide an interface for BinaryArtifacts to either "export code" or "bundle metadata". Things like "LibFoo.jll exports the abspath location of libfoo.so", or a wrapper function that sets environment variables before invoking Git.jll's bundled git.exe.

@00vareladavid (Contributor)

I guess we can create an AbstractDependency type with PackageSpec and BinaryArtifact as subtypes? Then we replace most current occurrences of PackageSpec with AbstractDependency.

@00vareladavid (Contributor)

Is the idea to download a BinaryArtifact and then key into it with runtime information to determine what tarballs should be downloaded? Or is a BinaryArtifact the tarball itself?

@StefanKarpinski (Member)

How about just calling it Dependency since we're not going to have Dependency <: AbstractDependency, we're going to have PackageSpec, BinaryArtifact <: Dependency.
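
For concreteness, the type layout being suggested might look something like this; a rough sketch only, with placeholder fields rather than a settled design:

# Sketch of the suggested hierarchy; field names are illustrative placeholders.
abstract type Dependency end

Base.@kwdef mutable struct PackageSpec <: Dependency
    name::Union{Nothing,String} = nothing
    uuid::Union{Nothing,Base.UUID} = nothing
    version::Union{Nothing,VersionNumber} = nothing
end

Base.@kwdef mutable struct BinaryArtifact <: Dependency
    name::String = ""
    version::Union{Nothing,VersionNumber} = nothing
    # Runtime-reflected selectors: CPU architecture, OS, libgfortran version, ...
    platform::Dict{String,String} = Dict{String,String}()
end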

@00vareladavid (Contributor)

Ok, and these types of nodes will be mostly indistinguishable until we hit what is currently build_versions, at which point we key into them with runtime information (i.e. choose_download) to determine the exact tarball which needs to be set up. Is that roughly the plan?

@staticfloat (Member, Author)

Sounds reasonable to me; I'd be happy to discuss this further and nail down more of an implementation plan during the Pkg call tomorrow?

@StefanKarpinski (Member) commented Jan 15, 2019

Version constraints are against the version of the library, not the version of the thing that builds the library. But you want to be able to lock down a specific build of a library. But a specific build is completely platform-specific. There are some layers of versioning:

  1. Artifact identity. The exact identity of the binary artifact that was used in a configuration. We want to record this or be able to reconstruct it somehow, but it's too specific.
  2. Build script version. The version of the build script that produces that binary artifact. This will typically support multiple different platforms. This is probably what should be in the manifest.
  3. Library version. The version of the library that the build script is building. This is what compatibility constraints should work with.

Is this correct and complete? The artifact identity should be completely determined by some "system properties" tuple that captures all the things that determine which artifact generated by a build script one needs. The end user mostly only needs to care about the library version, which is what determines its API and therefore usage. There might, however, be situations where one needs compatibility constraints on both the library version and the build script version: e.g. an older build was configured in some way that makes the resulting artifact unusable in certain ways.

@StefanKarpinski (Member)

Does a given version of a build script always produce just a single version of a given library?

@stevengj (Member) commented Jan 22, 2019

How would this work with packages that use BinaryProvider but fall back to compiling from source if a working binary is not available (typically for less-popular Linux distros)? e.g. ZMQ or Blosc IIRC. You need some kind of optional-dependency support, it seems, or support for a source “platform”.

@staticfloat (Member, Author)

For building from source, we will support it manually by allowing users to dev a jll package; they then just need to copy their .so files into that directory. This is analogous to allowing users to modify their .jl files within a dev'ed Julia package.
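
For concreteness, a hypothetical version of that manual workflow (package name and paths purely illustrative):

pkg> dev LibFoo_jll                  # get a local, editable copy of the jll package

shell> make -C ~/src/libfoo          # build the library yourself, however you like

shell> cp ~/src/libfoo/libfoo.so ~/.julia/dev/LibFoo_jll/deps/usr/lib/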

I do not think we should ever build from source automatically. Looking at ZMQ, it looks like you have full platform coverage; under what circumstances are you compiling?

@ararslan (Member)

Another example to add to Steven's list is SpecialFunctions, which falls back to BinDeps when a binary isn't available from BinaryProvider. Once upon a time that was used on FreeBSD, before we had FreeBSD support in BinaryProvider, but now I don't know when it's used aside from on demand on CI.

@stevengj (Member) commented Jan 24, 2019

Looking at ZMQ, it looks like you have full platform coverage; under what circumstances are you compiling?

We needed it on CentOS, for example (JuliaInterop/ZMQ.jl#176), because of JuliaPackaging/BinaryBuilder.jl#230.

There are an awful lot of Unix flavors out there, and it's nice to have a compilation fallback.

@StefanKarpinski (Member)

Regardless of the many UNIX variations, the only things you really need are the right executable format and the right libc, which we can pretty much cover at this point.

@stevengj (Member)

And the right libstdc++, which is apparently harder to cover.

(This was why I had to enable source builds for ZMQ and Blosc. Are we confident that this is fixed, or are we happy to go back to breaking installs for any package that calls a C++ library?)

@staticfloat (Member, Author)

I think our libstdc++ problems should be largely solved now that JuliaPackaging/BinaryBuilder.jl#253 has been merged. We now build with GCC 4.8.5 by default, using a libstdc++ version of 3.4.18, so we are guaranteed to work with anything at least as new as that. I'm not entirely sure it's possible to build Julia with GCC earlier than 4.8 at the moment (the Julia README still says GCC 4.7+, but I'm pretty sure LLVM requires GCC 4.8+), so this seems like a pretty safe bet to me. I would be eager to hear how users are running Julia with a version of libstdc++ older than 3.4.18.

@stevengj (Member)

Should JuliaPackaging/BinaryBuilder.jl#230 be closed then?

@staticfloat (Member, Author)

Yes I think so.

@Petr-Hlavenka

I'm very supportive of managing binary artifacts with Pkg. I'd just like to point out that the implementation of library loading should be flexible enough to include some strategy for AOT compilation and deployment (to a different computer). An app deployed to a different computer will have to load libraries from different locations, and the hardcoding of paths in deps.jl makes this pretty difficult, see JuliaPackaging/BinaryProvider.jl#140. The best way would be either to not have deps.jl at all, or to not store an absolute path to the library.

@StefanKarpinski (Member)

Yes, that's the plan: you declare what you need, referring to it by platform-independent identity rather than generating it explicitly and then hardcoding its location, and you let Pkg figure out the best way to get you what you need and tell you where it is.

@staticfloat (Member, Author) commented Mar 7, 2019

Progress! There is some code behind this post, while other things remain vaporware; the aspiration is to strike up some discussion on whether these are the aesthetics we want.

  • Building a builder repository now results in the tarballs (typically uploaded to a GitHub release like this one) as well as an Artifact.toml. These currently look something like this:
name = "JpegTurbo_jll"
uuid = "7e164b9a-ae9a-5a84-973f-661589e6cf70"
version = "2.0.1"

[artifacts.arm-linux-gnueabihf]
hash = "45674d19e63e562be8a794249825566f004ea194de337de615cb5cab059e9737"
url = "https://github.com/JuliaPackaging/Yggdrasil/releases/download/JpegTurbo-v2.0.1/JpegTurbo.v2.0.1.arm-linux-gnueabihf.tar.gz"

    [artifacts.arm-linux-gnueabihf.products]
    djpeg = "bin/djpeg"
    libjpeg = "lib/libjpeg.so"
    libturbojpeg = "lib/libturbojpeg.so"
    jpegtran = "bin/jpegtran"
    cjpeg = "bin/cjpeg"

[artifacts.i686-w64-mingw32]
hash = "c2911c98f9cadf3afe84224dfc509b9e483a61fd4095ace529f3ae18d2e68858"
url = "https://github.com/JuliaPackaging/Yggdrasil/releases/download/JpegTurbo-v2.0.1/JpegTurbo.v2.0.1.i686-w64-mingw32.tar.gz"

    [artifacts.i686-w64-mingw32.products]
    djpeg = "bin/djpeg.exe"
    libjpeg = "bin/libjpeg-62.dll"
    libturbojpeg = "bin/libturbojpeg.dll"
    jpegtran = "bin/jpegtran.exe"
    cjpeg = "bin/cjpeg.exe"
...
  • My plan is to embed this file into the Registry in the same way that Project.toml files are embedded right now. Artifacts will be analogous to Project.toml files with the following similarities/differences:

    • They will contain Compat.toml, Deps.toml and Versions.toml entries, which will function exactly the same as a normal Registry entry, except that the downstream DAG of Artifacts can only contain other Artifacts; an Artifact cannot depend on a general Julia package, so in that sense the dependency links are restricted somewhat.
    • They will not contain Manifest.toml, Project.toml or Package.toml, only the aforementioned Artifact.toml. This is mostly for simplicity; I don't see why we need these, but I am aware that I may not be thinking this through completely.
  • Pkg is now binary platform-aware, by essentially gutting code from BinaryProvider to instead live inside of Pkg. This allows me to ask things like "what is the ABI-aware triplet of the currently-running host?" (you now get that by calling Pkg.triplet(Pkg.platform_abi_key())).

  • When the user expresses a dependency on one of these Artifact objects (e.g. through Pkg.add("LibFoo_jll")) it will get added to the dependency graph as usual, but when being concretized into a URL to be downloaded, an extra step of indirection is applied by reaching into the Artifact.toml's dictionary, finding dict["artifacts"][triplet(platform_abi_key())] and using the embedded entries as the url and hash to download and unpack into a directory somewhere (see the sketch after the code examples below).

  • After downloading and unpacking the binaries, Pkg will generate a wrapper Julia package that exposes an API to "get at" these files, so that client code (such as LibFoo.jl, the fictitious julia-code side of things) can use it in as natural a way as possible. Example generated Julia code:

# LibFoo_jll/src/LibFoo_jll.jl
# Autogenerated code, do not modify
module LibFoo_jll
using Libdl

# Chain other dependent jll packages here, as necessary
using LibBar_jll

# This is just the `artifacts` -> platform_key() -> `products` mappings embedded in `Artifact.toml` above
const libfoo = abspath(joinpath(@__DIR__, "..", "deps", "usr", "lib", "libfoo.so"))
const fooifier = abspath(joinpath(@__DIR__, "..", "deps", "usr", "bin", "fooifier"))

# This is critical, as it allows a dependency that `libfoo.so` has on `libbar.so` to be satisfied.
# It does mean that we pretty much never dlclose() things though.
handles = []
function __init__()
    # Explicitly link in library products so that we can construct a necessary dependency tree
    for lib_product in (libfoo,)
        push!(handles, Libdl.dlopen(lib_product))
    end
end
end

Example Julia package client code:

# LibFoo.jl/src/LibFoo.jl

import LibFoo_jll

function fooify(a, b)
    return ccall((:fooify, LibFoo_jll.libfoo), Cint, (Cint, Cint), a, b)
end
...
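
Tying those pieces together, here is a rough sketch (not actual Pkg code) of the indirection described above: key into the Artifact.toml by host triplet, then verify the download before unpacking. It assumes only Pkg.TOML, Base.download and the SHA stdlib.

using Pkg, SHA

# Look up the url/hash/products stanza for a given platform triplet in an Artifact.toml.
function artifact_download_info(artifact_toml::AbstractString, triplet::AbstractString)
    toml = Pkg.TOML.parsefile(artifact_toml)
    entry = toml["artifacts"][triplet]          # e.g. "arm-linux-gnueabihf"
    return (url = entry["url"], hash = entry["hash"], products = entry["products"])
end

# Download the tarball and check its SHA256 before it gets unpacked anywhere.
function fetch_and_verify(url::AbstractString, expected_sha256::AbstractString)
    tarball = download(url)
    actual = bytes2hex(open(sha256, tarball))
    actual == lowercase(expected_sha256) || error("hash mismatch for $url")
    return tarball
end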

@StefanKarpinski (Member)

I like it in general. I'll have to think for a bit about the structure of the artifacts file. There's a consistent compression scheme used by Deps.toml and Compat.toml; we'll want to use the same compression scheme for the artifact data in the registry, which somewhat informs how you want to structure the data in the file as well.

Do you think we'll eventually want to teach ccall about libraries so that we can just write ccall(:libfoo, ...) and have it know to find the LibFoo shared library? That seems like the nicest interface to this possible—just declare the dependency in your project file and ccall it with the right name and everything just works.

@staticfloat (Member, Author)

That seems like the nicest interface to this possible—just declare the dependency in your project file and ccall it with the right name and everything just works.

I am actively shying away from teaching Pkg/Base too much about dynamic libraries; it's a deep rabbit hole. In this proposal I'm not even baking in the platform-specific library-searching awareness (e.g. "look for libraries in bin on Windows, lib elsewhere"). I want to keep Pkg as simple as possible.

On the other hand, I would like it if dlopen() was able to tell me, for instance, that trying to use libqt on a Linux system that doesn't have X11 installed already isn't going to work. It would know this because it would try to dlopen("libqt.so") and fail, and it would inspect the dependency tree and notice that libx11.so was not findable. This is all possible with not much new code written, but it does mean that we need to bring in things like ObjectFile.jl into Base, and that's a lot of code.

It would be nice if we could do things like search for packages that contain libfoo.so. That's actually one advantage to listing everything out in the Artifact.toml within the registry like that.

@staticfloat (Member, Author)

There's a consistent compression scheme used by Deps.toml and Compat.toml

I'm not entirely sure what you mean by this, but I will await your instruction. I have no strong opinions over the Artifact.toml organization, except for the vague feeling that I want to make it as small as possible to avoid bloating the registry and making things slow to download/install/parse/search.

@Petr-Hlavenka commented Mar 8, 2019

After downloading and unpacking the binaries, Pkg will generate a wrapper Julia package that exposes an API to "get at" these files, so that client code (such as LibFoo.jl, the fictitious julia-code side of things) can use it in as natural a way as possible. Example generated Julia code:

const libfoo = abspath(joinpath(@__DIR__, "..", "deps", "usr", "lib", "libfoo.so"))
const fooifier = abspath(joinpath(@__DIR__, "..", "deps", "usr", "bin", "fooifier"))

This automatic wrapper generation, with const assigning the absolute path, is exactly the thing that prevents AOT with deployment to a different computer. So during AOT, PackageCompiler will need to modify every single artifact wrapper package to get rid of the baked-in absolute path.

If the code is auto-generated, why can't this functionality be part of some function or macro call that would open the handles and generate the const paths on the fly? In that case PackageCompiler could just pre-collect all the artifacts into a "deployment depot" and let dlopen reach for this "configurable" path, or redefine this const-path generator for the AOT build.

And is the constantness of the lib path really necessary for an efficient ccall?
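
One purely illustrative way to avoid the baked-in absolute path (this is not the generated code, just a sketch of the idea): resolve the prefix at call time, so an AOT bundle can point it somewhere else. The LIBFOO_ARTIFACT_DIR override below is hypothetical.

module LibFoo_jll_relocatable

using Libdl

# Only the relative location is fixed; the prefix is computed when needed.
const libfoo_relpath = joinpath("deps", "usr", "lib", "libfoo." * Libdl.dlext)

# Defaults to this package's own directory, but a deployed app could override it.
artifact_prefix() = get(ENV, "LIBFOO_ARTIFACT_DIR", abspath(joinpath(@__DIR__, "..")))
libfoo_path() = joinpath(artifact_prefix(), libfoo_relpath)

const handles = Any[]
function __init__()
    # dlopen at load time, as in the generated wrapper above, but via the computed path.
    push!(handles, Libdl.dlopen(libfoo_path()))
end

end # module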

@staticfloat (Member, Author)

I should have said will result in those package versions being broken if they are add'd

That's what I'm saying is wrong; you're saying "if I delete Foo that Bar depends on, then try to add Bar, it will fail because Foo is missing". That's not how Pkg works; when you want to install Bar, it will automatically install Foo because it knows that Foo is a dependency of Bar. That's how artifacts will work as well; all the installation happens at ] add time, not ] build time (we're explicitly moving away from being able to have mutable state; this means that everything needs to be installed by the time you finish the Pkg.add() operation).

@StefanKarpinski (Member) commented Jun 13, 2019

Right, that is what I am saying will break.

Yeah, there's no good reason to support that.

I'm also having trouble coming up with realistic scenarios where you need to clean out packages but not artifacts or vice versa. But the operation proceeds in two fairly separate phases:

  1. Figure out which packages are no longer referenced by any manifests and delete them.
  2. Figure out which artifacts are no longer referenced by any installed packages and delete them.

You can do one or the other independently and not break things, or one then the other, which should be the default and cleans up the most space.
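
A toy sketch of those two phases over plain in-memory data (real Pkg gc would of course read manifests and the artifact store from disk):

function two_phase_gc(manifests::Vector{Set{String}},
                      installed_pkgs::Set{String},
                      pkg_artifacts::Dict{String,Set{String}},
                      installed_artifacts::Set{String})
    # Phase 1: packages not referenced by any known manifest are garbage.
    used_pkgs = isempty(manifests) ? Set{String}() : reduce(union, manifests)
    stale_pkgs = setdiff(installed_pkgs, used_pkgs)
    # Phase 2: artifacts not referenced by any surviving installed package are garbage.
    used_artifacts = Set{String}()
    for pkg in setdiff(installed_pkgs, stale_pkgs)
        union!(used_artifacts, get(pkg_artifacts, pkg, Set{String}()))
    end
    stale_artifacts = setdiff(installed_artifacts, used_artifacts)
    return stale_pkgs, stale_artifacts
end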

@StefanKarpinski (Member) commented Jun 13, 2019

One thing that I really like about this new approach that occurred to me is that by not having artifacts inside of packages, it allows artifacts to live in different depots than packages do. So you could have a pre-installed system copy of an artifact that is used by one or more user-installed copies of a package. That's quite cool, and potentially useful, imo.

@oxinabox (Contributor) commented Jun 13, 2019

That's not how Pkg works; when you want to install Bar, it will automatically install Foo because it knows that Foo is a dependency of Bar. That's how artifacts will work as well;

Right, ok, I had the picture wrong in my head.
I thought artifacts would not resolve just like packages do (since they don't have UUIDs or versions), but I guess there is indeed nothing stopping that.

One thing that I really like about this new approach that occurred to me is that by not having artifacts inside of packages, it allows artifacts to live in different repos than packages do. So you could have a pre-installed system copy of an artifact that is used by one or more user-installed copies of a package. That's quite cool, and potentially useful, imo.

That is nice. The DataDeps way of doing the same is a bit scary and unsafe, and kind of encourages being unsafe (it will probably have to change eventually; I am now super sold on this whole naming-things-by-their-SHA idea); DataDeps just uses the name.
But because the artifacts are identified by SHA... (on further thought, I assume the SHA is available even after unpacking? Because it will be used as a folder name?)

Ok cool things are much clearer now.

@StefanKarpinski (Member)

(since they don't have UUIDs or versions)

They don't need UUIDs or versions because they're content-addressed. You don't really care if one libfoo is "the same artifact" as a different libfoo—they're either the same data or they aren't.

it allows artifacts to live in different repos than packages do.

Oops, I meant "depots" not "repos".

But because the artifacts are identified by SHA... (on further thought, I assume the SHA is available even after unpacking? Because it will be used as a folder name?)

This comment was about keeping metadata about artifacts around after they're installed so that you know what the SHA etc. was. I'm not really sure about how to structure the thing that goes at ~/.julia/artifacts/libfoo/$slug: you want the actual artifact content somewhere but you also want a bit of metadata about it. This is complicated by the possibility that it is sometimes just a single file and sometimes a folder that we've extracted from an archive. @oxinabox, @staticfloat, do you guys have any thoughts about the structure of these? What would the layout be?

@StefanKarpinski (Member)

I'm removing the "speculative" label because this is getting pretty concrete at this point. Some updates from Slack discussion:

  • We should identify artifacts by their on-disk content, not the archive hash. After all, the former is the definitive thing that offers no wiggle room, whereas many different archives can produce the same on-disk content. That means we should have a git-tree-sha1 field in each artifact stanza, much like we do in package manifest stanzas. We may want to think about how artifact stanzas mirror manifest stanzas in other ways as well. (A toy sketch of this content-addressed behavior follows this list.)

  • As a corollary of the above, you can potentially have different ways of acquiring the same exact artifact—different download URLs, different archive hashes. I previously thought that we should keep metadata about artifacts somewhere with the artifact, but with this design change I'm not so sure. After all, the one true defining characteristic of an artifact is its tree hash and you can always recompute that from it on disk—and if the slug from that hash doesn't match, then you have a corrupted artifact that you shouldn't use anyway.

  • Maybe we want to keep a log of artifact downloads somewhere like ~/.julia/logs/artifact_usage.toml: a record of what package triggered the install of an artifact, whether it was already installed or not, where it would have been downloaded from, etc.

  • We still want to record a SHA256 hash of the downloaded, pre-extraction state of each artifact so that we can verify the download before extracting it, but this is no longer how we identify it.

  • I'm still not fully clear on how we should do artifact variant selection. @staticfloat's platform string approach or my more verbose dict approach. This is one of the last things to be decided.
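
As a toy illustration of that content-addressed behavior, where the slug scheme, directory layout, and install! callback are all placeholders rather than a settled design:

function ensure_artifact(name::AbstractString, tree_sha1::AbstractString, install!::Function)
    slug = tree_sha1[1:8]                        # placeholder slug scheme
    dir = joinpath(homedir(), ".julia", "artifacts", name, slug)
    # Content-addressed: if the tree is already there, it does not matter which
    # package or which download method put it there; just use it.
    isdir(dir) && return dir
    mkpath(dir)
    install!(dir)                                # fetch + extract via any of the listed methods
    return dir
end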

@simonbyrne (Contributor)

  • I'm still not fully clear on how we should do artifact variant selection. @staticfloat's platform string approach or my more verbose dict approach. This is one of the last things to be decided.

The advantage of the dict approach is that it is more extensible should additional keys be required in future.

@staticfloat (Member, Author)

I’m very willing to use a dict-based approach. There’s no inherent advantage to the string format other than compactness (and the ability to fit within a filename); living within the Artifact.toml, we have access to richer data structures, so we should just use them.

@visr (Contributor) commented Jun 14, 2019

Great work on the design. I want to bring up a point about build variants that I was thinking about. Curious about your thoughts.

If I understand correctly, the LibFoo_jll binary variant that is selected is based on its version and on system properties only. Is there any other way for the user to pick a different build that is not full manual dev mode? Or should they create a separate LibFoo_with_x_enabled_jll, fork LibFoo.jl, and change the Artifact.toml to use LibFoo_with_x_enabled_jll instead? A concrete example is SQLite with the R*tree module enabled, which perhaps does not make sense as a default, but could be requested specifically in the Artifact.toml of a project or package. Although you'd probably still want to use it through the same Julia wrapper package (SQLite.jl), which would need to know that you want to use a different variant of the binary. Similarly, we could make a default GDAL install small, with only the most commonly needed formats, but allow a user to explicitly request a large full variant instead (issue ref). Right now I don't see a way to do that other than dev'ing everything and putting all the artifacts in manually. Not sure how big of a can of worms this is, though.

@StefanKarpinski (Member) commented Jun 14, 2019

So, the latest sketch of the way Artifacts.toml will look:

[dataset-A]
git-tree-sha1 = "e445efb1f3e2bffc06e349651f13729e6f7aeaaf"
basename = "dataset-A.csv"

  [dataset-A.download]
  sha256 = "b2ebe09298004f91b988e35d633668226d71995a84fbd12fea2b08c1201d427f"
  url = [ # multiple URLs to try
      "https://server1.com/path/to/dataset.csv",
      "https://server2.com/path/to/dataset.csv",
  ]

[nlp-model-1]
git-tree-sha1 = "dccae443aeddea507583c348d8f082d5ed5c5e55"
basename = "nlp-model-1.onnx"

  [[nlp-model-1.download]] # multiple ways to download
  sha256 = "5dc925ffbda11f7e87f866351bf859ee7cbe8c0c7698c4201999c40085b4b980"
  url = "https://server1.com/nlp-model-1.onnx.gz"
  extract = "gzip" # decompress file

  [[nlp-model-1.download]]
  sha256 = "9f45411f32dcc332331ff244504ca12ee0b402e00795ab719612a46b7fb24216"
  url = "https://server2.com/nlp-model-1.onnx"

[[libfoo]]
git-tree-sha1 = "05d42b0044984825ae286ebb9e1fc38ed2cce80a"
os = "Linux"
arch = "armv7l"

  [libfoo.download]
  sha256 = "19e7370ab1819d45c6126d5017ba0889bd64869e1593f826c6075899fb1c0a38"
  url = "https://server.com/libfoo/Linux-armv7l/libfoo-1.2.3.tar.gz"
  extract = ["gzip", "tar"] # outermost first or last?

[[libfoo]]
git-tree-sha1 = "c2dc12a509eec2236e806569120e72058579ba19"
os = "Windows"
arch = "i686"

  [libfoo.download]
  sha256 = "95683bb088e35743966d1ea8b242c2694b57155c8084a406b29aecd81b4b6c92"
  url = "https://server.com/libfoo/Windows-i686/libfoo-1.2.3.zip"
  extract = "zip"

[[libfoo]]
git-tree-sha1 = "d633f5f44b06d810d75651a347cae945c3b7f23d"
os = "macOS"
arch = "x86_64"

  [libfoo.download]
  sha256 = "b65f08c0e4d454e2ff9298c5529e512b1081d0eebf46ad6e3364574e0ca7a783"
  url = "https://server.com/libfoo/macOS-x86_64/libfoo-1.2.3.xz"
  extract = ["xz", "tar"]

Some features of this sketch:

  • the git-tree-sha1 is the defining key of each artifact variant—it must be present
    • this is the tree hash of the final extracted artifact as it appears on disk
    • if there is already an artifact in the corresponding location, there is no need to reinstall the artifact
    • multiple different packages can use the same artifact and describe different ways to get it—the only thing that matters is the bits on disk, which is what this is a hash of; as long as those are the same, just use it, it doesn't matter how it got there
  • an optional basename key can be given for an artifact variant
    • if it is absent, the extracted artifact tree will be installed at ~/.julia/artifacts/$name/$slug
    • if it is present, the extracted artifact tree will be installed at ~/.julia/artifacts/$name/$slug/$basename
    • this is intended to handle situations where the name of the artifact file is significant to some consumer, e.g. a reader that expects a CSV file to have the .csv extension
    • this could also be handled by putting the basename part inside of artifact, but there may be cases where we want to download artifacts as-is and therefore cannot control their structure
  • top-level keys in artifact stanzas with multiple variants are variant selectors (os, arch, etc.); a selection sketch follows this list
  • each artifact variant has one or more download stanzas which describe a way to get it
    • there can be one or more url values in a download stanza—this is just a shorthand for giving multiple identical download stanzas that only differ by URL since that will be a common case
    • download stanzas have a sha256 entry, which gives the SHA256 hash of the downloaded file; this may be different for different download methods for the same artifact since it may be archived or compressed differently; this hash allows checking download correctness before extracting.
    • download stanzas may have an extract entry which indicates how to extract the actual artifact tree from the download; it can be a string to indicate a single extraction step or an array of string to indicate a sequence of extraction steps; these can only be selected from a set of known extraction steps, e.g. tar, gz, bz2, xz, zip; by default, no extraction is performed
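
For illustration, matching those variant selectors against the host could look roughly like this; this is not Pkg code, and the os/arch normalization is a guess:

using Pkg

host_os() = Sys.iswindows() ? "Windows" : Sys.isapple() ? "macOS" : "Linux"
host_arch() = String(Sys.ARCH)            # e.g. "x86_64", "i686", "armv7l"

function select_variant(artifacts_toml::AbstractString, name::AbstractString)
    entry = Pkg.TOML.parsefile(artifacts_toml)[name]
    entry isa Vector || return entry      # single-variant artifact, e.g. dataset-A
    for variant in entry
        get(variant, "os", host_os()) == host_os() &&
            get(variant, "arch", host_arch()) == host_arch() &&
            return variant
    end
    return nothing                        # no compatible variant for this host
end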

@StefanKarpinski (Member)

I'm not so sure if the basename bit is necessary or a good idea. Maybe it isn't—it does mean that all users of an artifact must not only agree on the git-tree-sha1 but also the basename, which gives me pause. Maybe this should be a feature of the download instead, e.g. prefix = "dataset-A.csv"?

@oxinabox (Contributor) commented Jun 14, 2019

Part of the download seems right.
If it was a tarball containing a CSV with that name, and it was untarballed, then that should be the same as a CSV that was downloaded, lost its name (because Base.download does not know how to negotiate names, or the webserver was bad), and then had its name put back by postprocessing.
(Probably not prefix though; maybe localfilename?)
It should be mutually exclusive with extract.
So it would be nice to express both extract and the setting of the name as values for a single option.

Edit: Oh, but we might want to allow .csv.gz and have that be extracted to a .csv.
Still, putting this into the realm of postfetch feels right.

@StefanKarpinski (Member)

Maybe call it basename but put it in the download section and have it mean that the download will be extracted to ~/.julia/artifacts/$name/$slug/$basename. The thing that's git tree hashed is the entire tree at ~/.julia/artifacts/$name/$slug, which in that situation would be $basename and whatever it contains. Updated sketch with this scheme:

[dataset-A]
git-tree-sha1 = "e445efb1f3e2bffc06e349651f13729e6f7aeaaf"

  [dataset-A.download]
  basename = "dataset-A.csv"
  sha256 = "b2ebe09298004f91b988e35d633668226d71995a84fbd12fea2b08c1201d427f"
  url = [ # multiple URLs to try
      "https://server1.com/path/to/dataset.csv",
      "https://server2.com/path/to/dataset.csv",
  ]

[nlp-model-1]
git-tree-sha1 = "dccae443aeddea507583c348d8f082d5ed5c5e55"

  [[nlp-model-1.download]] # multiple ways to download
  basename = "nlp-model-1.onnx"
  sha256 = "5dc925ffbda11f7e87f866351bf859ee7cbe8c0c7698c4201999c40085b4b980"
  url = "https://server1.com/nlp-model-1.onnx.gz"
  extract = "gzip" # decompress file

  [[nlp-model-1.download]]
  basename = "nlp-model-1.onnx"
  sha256 = "9f45411f32dcc332331ff244504ca12ee0b402e00795ab719612a46b7fb24216"
  url = "https://server2.com/nlp-model-1.onnx"

[[libfoo]]
git-tree-sha1 = "05d42b0044984825ae286ebb9e1fc38ed2cce80a"
os = "Linux"
arch = "armv7l"

  [libfoo.download]
  sha256 = "19e7370ab1819d45c6126d5017ba0889bd64869e1593f826c6075899fb1c0a38"
  url = "https://server.com/libfoo/Linux-armv7l/libfoo-1.2.3.tar.gz"
  extract = ["gzip", "tar"] # outermost first or last?

[[libfoo]]
git-tree-sha1 = "c2dc12a509eec2236e806569120e72058579ba19"
os = "Windows"
arch = "i686"

  [libfoo.download]
  sha256 = "95683bb088e35743966d1ea8b242c2694b57155c8084a406b29aecd81b4b6c92"
  url = "https://server.com/libfoo/Windows-i686/libfoo-1.2.3.zip"
  extract = "zip"

[[libfoo]]
git-tree-sha1 = "d633f5f44b06d810d75651a347cae945c3b7f23d"
os = "macOS"
arch = "x86_64"

  [libfoo.download]
  sha256 = "b65f08c0e4d454e2ff9298c5529e512b1081d0eebf46ad6e3364574e0ca7a783"
  url = "https://server.com/libfoo/macOS-x86_64/libfoo-1.2.3.xz"
  extract = ["xz", "tar"]

@oxinabox (Contributor)

I think we need more thought.

What is so "base" about basename, anyway?

It should only matter for things that are not tarballs or zips.
I kinda think it shouldn't ever exist for the other cases?

Or at least I am not sure what it would do in those cases.

It would help to understand how it interacts with

extract = ["tar", "gz"]
vs.
extract = ["gz"] on a csv
vs.
extract = [] on a csv

Are we thinking that tarballs extract to become one folder and we then rename that folder?
Or are we thinking that tarballs become a collection of files?
I was thinking the latter, but now I think I am wrong?

@StefanKarpinski (Member) commented Jun 15, 2019

basename is just the traditional Unix name for the last part of a path. A better scheme for this would be good.

@StefanKarpinski (Member)

Idea: basename could be an extraction step, but I'm not sure how to express this. Rough attempt:

[dataset-A.download]
sha256 = "b2ebe09298004f91b988e35d633668226d71995a84fbd12fea2b08c1201d427f"
url = [ # multiple URLs to try
   "https://server1.com/path/to/dataset.csv",
   "https://server2.com/path/to/dataset.csv",
]
extract = { rename = "dataset.csv" }

That's not quite right though since I don't think you can put a dict in an array.

@oxinabox (Contributor)

That is what I was saying.

@StefanKarpinski (Member)

Only took me four days for the same thing to occur to me 😁

@oxinabox (Contributor)

[dataset-A.download]
sha256 = "b2ebe09298004f91b988e35d633668226d71995a84fbd12fea2b08c1201d427f"
url = [ # multiple URLs to try
   "https://server1.com/path/to/dataset.csv.gz",
   "https://server2.com/path/to/dataset.csv.gz",
]
postfetch.extract = ["gz"]
postfetch.rename = "dataset.csv"

With the stern rule that rename always occurs after extract, and that omitting either results in an identity/no-op.

@staticfloat (Member, Author)

Having thought about this for a bit, I am uncomfortable with the coupling between rename and git-tree-sha1 (if you change rename you're going to need to change git-tree-sha1). I'm also uncomfortable with how rename doesn't make sense when dealing with a .tar.gz, since if you're going to extract a file, you kind of don't care what the .tar.gz file's filename was, and renaming something after extracting doesn't make sense in that case.

I think I would rather have extraction only be an option in the well-defined case, where we have a container (like a .tar.gz) with a file structure stored within it; this would make extract and basename mutually exclusive: either you're extracting things, or you're downloading a single file. basename will still interact with git-tree-sha1, but I'm willing to forgive that.

For more complex usecases, I think I would rather push this off onto a more advanced Pkg concept, which I have helpfully written up a big "thing" about over here: #1234 (whooo I got a staircase issue number! Lucky day!). Even if that's not something we want in Pkg, I still think restricting the flexibility here is going to help us keep a sane, simple design.

@StefanKarpinski (Member)

Making extract and basename mutually exclusive extraction options seems sane to me. Maybe in that case calling the option filename would be more obvious than basename which felt more applicable to both files and directories, but of course in the case of a directory, there’s no need for an option to control the name.

When it comes to extraction, we should be very strict about how extraction is allowed: it should only ever produce files under the target location. I know some archive formats allow other destinations, which we should make sure to prevent.

@staticfloat (Member, Author)

Yeah, I like filename better as well.

When it comes to extraction, we should be very strict about how extraction is allowed: it should only ever produce files under the target location. I know some archive formats allow other destinations, which we should make sure to prevent.

I want to make sure that extraction can work everywhere. Right now with .tar.gz we have pretty good support (since we bundle 7zip with Julia); if we allow people to download non-BB-generated things, we may want to widen that to .zip and .tar.bz2 as well (which would also be pretty well supported). Beyond that, there is some desire for .tar.xz just because it compresses pretty well, but the long tail of distro support doesn't have our backs on that one quite yet. We could conceivably ship binaries of tar and xz for all platforms, add them as a lazy Artifact to Pkg itself (stored in a .tar.gz, of course, haha), and then we'd be able to do it... but for now, I argue let's just stick with a small subset of things we already know work.

staticfloat mentioned this issue Aug 1, 2019
bors bot added a commit that referenced this issue Aug 15, 2019
1277: Add Artifacts to Pkg r=StefanKarpinski a=staticfloat

This adds the artifacts subsystem to Pkg, [read this WIP blog post](https://github.com/JuliaLang/www.julialang.org/pull/417/files?short_path=514f74c#diff-514f74c34d50677638b76f65d910ad17) for more details.  Closes #841 and #1234.

This PR still needs:

- [x] A `pkg> gc` hook that looks at the list of projects that we know about, examines which artifacts are bound, and marks all that are unbound.  Unbound artifacts that have been continuously unbound for a certain time period (e.g. one month, or something like that) will be automatically reaped.
- [x] Greater test coverage (even without seeing the codecov report, I am certain of this), especially as related to the installation of platform-specific binaries.
- [x] `Overrides.toml` support for global overrides of artifact locations

Co-authored-by: Elliot Saba <staticfloat@gmail.com>