Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

multiple packages using a type of file? #46

Open
sbromberger opened this issue Dec 6, 2015 · 15 comments
Open

multiple packages using a type of file? #46

sbromberger opened this issue Dec 6, 2015 · 15 comments

Comments

@sbromberger
Copy link

We have several packages that might handle a specific type of file (and will return an object based on those file's contents). Does the presence of the centralized registry mean that only one package can "register" a dependency for that type of file for use in FileIO?

To use an existing example, it looks like PNG requires ImageMagick. But if I'm working on my own package that uses PNGs, and don't use ImageMagick myself in my package, does this mean that if I want to use FileIO for loading PNGs, I'm stuck with requiring my users to download (if not use) ImageMagick?

Sorry for the basic questions. My use case is with graph and matrix file formats, and there are many packages out there that use them in completely different ways.

@SimonDanisch
Copy link
Member

FileIO allows to register multiple libraries for one format.
Here are the lines that handle this:
https://github.com/JuliaIO/FileIO.jl/blob/master/src/loadsave.jl#L72
As you can see, it tries to load them and the first library that is installed and able to load without error will be chosen.
We could also introduce some mechanism to force using one specific library. Maybe via checking if one of the applicable libraries is already loaded.

This is how you can register multiple libraries for one file format:
https://github.com/JuliaIO/FileIO.jl/blob/master/src/registry.jl#L65

Best,
Simon

@nstiurca
Copy link

This is also an issue for me. In my case, I am writing a package to deal with PLY files in a way different than MeshIO.jl

My suggestion is that perhaps FileIO.jl can have some set_preffered_loader/set_preferred_saver. Loading a package (which is presumably already registered with FileIO) can call those functions to set itself as the default. So in my code, I can write

using FiloIO
using PLY # this is my package which calls the functions I suggested
data = open("foo.ply")    # uses PLY.jl loading even if MeshIO.jl is default package for PLY files

@SimonDanisch
Copy link
Member

Can you explain what your package is doing differently? Would it make sense to generally prefer it to the one in MeshIO?
The easiest solution might be, that you just prefer IO packages, that are explicitly loaded by the user.
So your code snippet would already work without set_preferred_loader (which I think shouldn't be called in an IO library itself).

If you like, you can supply a pull request for this! Should be quite straight forward to first search for already included modules in the IO list in https://github.com/JuliaIO/FileIO.jl/blob/master/src/loadsave.jl#L75

@nstiurca
Copy link

My package builds Julia types to correspond directly to whatever is
described in the PLY header, and uses generated functions to read those
types. Conversely, any type of data consisting of elements with numeric
fields or arrays of numeric fields can be written to a PLY file directly,
again using generated functions.

I haven't implemented binary format IO yet so that's why I haven't
submitted a pull request yet.
On Apr 28, 2016 7:08 AM, "Simon" notifications@github.com wrote:

Can you explain what your package is doing differently? Would it make
sense to generally prefer it to the one in MeshIO?
The easiest solution might be, that you just prefer IO packages, that are
explicitly loaded by the user.
So your code snippet would already work without set_preferred_loader
(which I think shouldn't be called in an IO library itself).

If you like, you can supply a pull request for this! Should be quite
straight forward to first search for already included modules in the IO
list in
https://github.com/JuliaIO/FileIO.jl/blob/master/src/loadsave.jl#L75


You are receiving this because you commented.
Reply to this email directly or view it on GitHub
#46 (comment)

@c42f
Copy link

c42f commented Oct 27, 2016

Oh, hmm, well this is somewhat embarrassing, but I also implemented a ply IO library (https://github.com/FugroRoames/PlyIO.jl) and came to FileIO to discuss exactly the same issue!

It's an interesting one, because MeshIO and PlyIO have quite different views on what the user wants from loading a ply file: PlyIO models ply as a generic container of arbitrarily named elements and properties (essentially what you see in the header), while MeshIO wants a fixed format, and will interpret that as a mesh for you. Unfortunately, it's not really possible to guess which one of these the user will actually want, because they'll want different things in different contexts.

I think this requires an extension of the load interface. What about using a type as a first argument to hint at which library to call:

using MeshIO
using PlyIO

a_mesh = load(AbstractMesh, "test.ply")
a_ply = load(Ply, "test.ply")

a_something = load("test.ply") # Fall back to the current precedence system; I guess FileIO will need to guess what people want "most of the time"...

A generic fallback is straightforward:

load{T}(::Type{T}, args...; kwargs...) = load(args...; kwargs...)

@c42f
Copy link

c42f commented Oct 27, 2016

Perhaps the generic fallback could be a bit smarter, to avoid some of the issues with Type invariance:

load{T}(::Type{T}, args...; kwargs...) = load(supertype(T), args...; kwargs...)
load(::Type{Any}, args...; kwargs...) = load(args...; kwargs...)

@SimonDanisch
Copy link
Member

Hm, I'm not a 100% sure if the type you want should influence the loader library. In fact, I'm using load in a very similar way, but defined as something like:

load(T, args...) = convert(T, load(args...))

I'd argue, that you might want this behavior, since you still want a way to enforce type stability even when loaders switch under the hood.
Also, there might be loaders returning the same type, but you still prefer the non default loader for whatever reason.
So maybe rather:

load{T}(M::Module, args...; kwargs...) = M.load(args...; kwargs...)
load(file, args...; kwargs...) = load(figure_out_module(file), args...; kwargs...)

@c42f
Copy link

c42f commented Oct 29, 2016

The module version is interesting, but I don't think it solves the whole problem: Requiring the user to mention the backend loader module has the wrong semantics from a library user's point of view. If I'm loading data from a file, I primarily care about which type of things I can do with the resulting data object. On the other hand, I don't care so much about which module was involved in the loading.

Achieving this abstraction of "just go find a good library to load this" seems to be the whole problem that FileIO is trying to solve, but it currently lacks specificity on the semantics of returned objects.

Let's consider: loading images vs arrays. These two types of objects are quite similar in many ways, with a somewhat interchangeable data model. However, they're not quite the same. I'd like to be able to achieve the equivalent of the following with FileIO:

filename = "whatever.blah"
an_image = load_image(filename)
an_array = load_array(filename)

Semantically, I now know that an_image is a thing which behaves like an image, and an_array is a thing which behaves like an array. My next code should be able to assume this is true and carry on without further checks.

So, how do we achieve this? It's tricky right? In fact it seems to be a kind of multiple dispatch. We have as input:

  • The user's desired semantic (modeled as a type, or a trait?)
  • The file's format (only available at runtime)

Putting these two things together, FileIO should maintain the machinery and registry to find a good IO module. This way, you can in principle do cool things like load a CSV containing a greyscale image as an actual Image type! You don't need to first load it as a DataFrame, then convert to an array and finally put that inside an image data structure (just for argument's sake).

So, with all that in mind, I think I'd like to be able to write:

filename = "whatever.csv"
an_image = load(Image, filename)
a_matrix = load(Matrix, filename)

With this kind of design, it seems like individual packages can't directly overload the load(T, filename) methods, as multiple packages may be able to load type T. This might be a good place to insert some of your module-based design in the backend?

@mauro3
Copy link

mauro3 commented Jan 18, 2017

Why not use, say, load(File(format"PNG-ARRAY", "my.png")) to specify that I want to use a different loader to the standard one?

@SimonDanisch
Copy link
Member

You mean return type?

@mauro3
Copy link

mauro3 commented Jan 18, 2017

The loader of format"PNG-ARRAY" returns an array, whereas the loader format"PNG" returns an image type.

@SimonDanisch
Copy link
Member

hm... but return type and loader library are not strictly coupled!

@mauro3
Copy link

mauro3 commented Jan 18, 2017

I see (well not quite), in that case that does not work.

@SimonDanisch
Copy link
Member

That's especially the case for png's, where all IO libraries only return Arrays on the lower level, but then get conveniently converted to an Image. Also, asserting that you get AbstractMatrix{RGB} seems to be much more usable, and is not coupled to the IO library.
On another note, I would find it quite annoying, that you'd need to remember two things, the Julia type AND the corresponding format string ;)

@plut
Copy link

plut commented Nov 17, 2021

So, after almost 6 years this still has not been solved (and yes, I just did write yet another .ply file reader).

A way to ease the dispatch would be to let the user decide which package gets the load function for a given file format by simply looking at which packages are loaded. Namely, if two existing packages, say Corge and Grault both read .foo files (each one outputting to a different data type), it is usually safe to assume that, if the user has said using Corge but not using Grault, then they are using Corge's data types and thus want to load files using Corge.load.

This could be implemented by e.g. exposing a couple of functions, FileIO.register_loader and FileIO.register_saver, and letting Corge.__init__() call FileIO.register_loader(format"foo"=>my_foo_reading_function) as needed. This also allows detecting conflicts (if register_loader(format"foo") has been called twice; even better, be lazy and only detect this when actually opening a ".foo" file). So in that case, the behaviour is not undefined; rather, an exception is thrown.

If a conflict exists, the only reasonable way to solve it is by hand (since only the end-user knows which type they really want, etc.); a possible API for it is letting any package with two conflicting dependencies call e.g. FileIO.chose_loader(format"foo" => Corge) and/or exposing a secondary method load("a.foo", Grault) for ad-hoc invocation of a non-default loading method.

The final result for this is that the end-user, when they load only non-conflicting packages, still has a nice load("a.foo") interface. If they load conflicting packages, a warning/error is emitted, and they can choose their preferred package by hand. Also, only packages which are actually installed on the system are ever called (even better, those packages actually called are manually loaded in the project and thus listed as dependencies). Finally, of course the weird global registry table disappears. (It could even be optionally re-implemented for compatibility by wrapping it as a sub-package of FileIO with all the packages as dependencies).

(Final small bonus: dispensing with global registry kills the chicken-and-egg problem of #124)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants