Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Major rewrite for correctness, performance #295

Merged
merged 7 commits into from
Mar 3, 2021
Merged

Major rewrite for correctness, performance #295

merged 7 commits into from
Mar 3, 2021

Conversation

timholy
Copy link
Member

@timholy timholy commented Mar 1, 2021

This package was started in 2015 (back in the Julia 0.3 or 0.4 days), and a lot
has changed since then. It's never reallynot recently gotten a serious freshening.
This rewrite has several goals.

Improving the robustness of package/Module identification

In modern versions of Julia, the package manager uses name/UUID combinations
to identify packages. This is far more robust and flexible than older strategies
for specifying packages. FileIO doesn't do this: it identifies modules by their
name only. We should adopt the new approach here: going forward (once the
deprecation period has passed and we release FileIO v2), all not-yet-loaded
modules must be specified by name/UUID.

There are some cases--often used in tests or transiently during development of
a new I/O package--where the handler isn't a registered package, and so there's
no UUID available. Currently we try to look up the module based on a name::Symbol.
It used to be that most modules were loaded into Main, then Julia switched
to Base.__toplevel__; currently we search both, since modules defined in the
REPL or tests might still live in Main.
Of course, even back in the old days, sub-modules could not be found in Main,
so the current system can't handle submodules.

To address the need for specifying modules that aren't packages, while
improving both correctness and flexibility, this PR allows you to
specify it by (duh) the module itself rather than the name of the module.

The combination of using either the module itself or a name/UUID combination
means that we can replace a lot of brittle & slow code. When we have the module,
we're done; when we have a name/UUID combination, we just call Base.require
to get the module. It even checks for us whether the module is already loaded, and
when it is it's just a single Dict lookup. This should be more robust and considerably
faster than

function _findmod(f::Symbol)
for (u,v) in Base.loaded_modules
(Symbol(v) == f) && return u
end
nothing
end
function topimport(modname)
@eval Base.__toplevel__ import $modname
u = _findmod(modname)
@eval $modname = Base.loaded_modules[$u]
end
function checked_import(pkg::Symbol)
lock(load_locker) do
# kludge for test suite
if isdefined(Main, pkg)
m1 = getfield(Main, pkg)
isa(m1, Module) && return m1
end
if isdefined(FileIO, pkg)
m1 = getfield(FileIO, pkg)
isa(m1, Module) && return m1
end
m = _findmod(pkg)
m == nothing || return Base.loaded_modules[m]
topimport(pkg)
return Base.loaded_modules[_findmod(pkg)]
end
end
.

To help transition existing users to the new system, this has
"depwarn"-code to look for the module based on its name. It searches:

  1. The user's current Pkg environment
  2. Main
  3. All registries used by the user

One key (breaking) difference is that this lookup is now done during add_format
rather than when the user tries to load or save a file. This is obviously
better for runtime efficiency, but it does change the point in the code where
an error occurs. One of the relatively few changes to the tests addresses this change.

Summary: the new system is strictly more flexible than the old one, since we could
never previously support sub-modules. It is also strictly more correct since
the registry now specifies precisely what it means by ImageIO.
There is depwarn-code to help existing users transition, and the only known breakages
only concern the specific point in the code from which an error would be thrown.
Finally, it should be substantially more performant.

Improving performance and reducing latency with better inferrability

In the original design of this package, load and save were designed to be
specialized by packages. To allow format-specific dispatch, we encoded the
file format into the type system using types like DataFormat{:PNG}.
However, at a certain point we switched to calling module-specific
unexported load and save methods. As a consequence, we don't really
need to encode the format in the type system, we can just use a runtime
value. Indeed, the downside of using the type system is that having each
format be a separate type makes it impossible to infer types. This hurts the
runtime performance, increases latency due to unnecessary method specialization
by the compiler, and increases the risk of invalidation.

However, one way in which we may under-specialize is for the filename.
#259 removed the type-specification of the filename to support types
defined in FilePathsBase. That's a nice change, but this package does quite
a lot of manipulation based on file name, and having the type be non-inferrable
has some downsides.

Finally, several of the container types have historically been poorly-specified,
e.g., const magic_list = Vector{Pair}().

This rewrite tries to straddle two goals: improving internal inferrability
while maintaining backwards compatibility. The strategy taken is to try to
wait until the last possible moment to construct non-inferrable objects---to wait
until the results are reported back to the caller.
In this rewrite, the data format is encoded internally just as a Symbol,
and the file is passed around as a separate object. This prevents one from
needing to specialize on the data format while preserving inferrability for the file.
To move towards a world in which we could infer the type of the filename,
this adds a parameter to our existing types.

There are a couple of minor changes to internal types, and this forced a couple of
changes to the tests. Most significantly, File{fmt} is no longer a concrete
type, because File got a second type-parameter to encode the filename type.
To prevent inference failures due to varying-length tuples, this also transitions
all magic bytes from NTuple{N,UInt8} to Vector{UInt8}.

Overall consequences

As a case study, with the existing FileIO release, I get ~50us to load a
10x10 RGB png file. With this version, it's ~25us. It's remarkable that some
of the current bad things about the code here can compete with I/O as a source
of slowness, but there you have it. Of course for a larger image it becomes
increasingly I/O dominated.

I've not measured latency yet, because this currently removes all precompile
directives. I'll add those back in after addressing review comments, since I expect
that could change the precompile directives too.

TODOs

These will probably be separate PRs to enhance the reviewability of this one.

  • Update tests that trigger depwarns (currently left unchanged to show what's breaking and what's not)
  • Put all tests in testsets (will obviate Avoid creating global variables #271)
  • Update the precompiles
  • Switch to documenter and add doctests (to ensure that documentation is accurate)

Breaking or not?

Since I think all the breakages (check the changes to the test files) are essentially things that should
only be used internally, I don't think this needs to be FileIO v2.0.0; instead, it can be FileIO v1.5,
and when we delete the deprecations we can go to FileIO v2.0.0. However, there is one important
caveat: ImageIO exploited FileIO internals and some of these internals are changing.
(See JuliaIO/ImageIO.jl#22.) I don't see any way of preventing a breakage for
people who have current versions of ImageIO. Perhaps we could retrospectively place an upper bound
on the FileIO version at 1.4 in the registry?
(see below)

@timholy
Copy link
Member Author

timholy commented Mar 1, 2021

Oh, another thing this does is warn packages that extend FileIO.load etc. to stop doing that. That might pave the way for changing File and Stream so that they don't use the type-system to encode the format.

@timholy
Copy link
Member Author

timholy commented Mar 1, 2021

One more: @SimonDanisch, this rewrite revealed that there are still shader formats registered to GLAbstraction, which AFAICT is not a registered package. Delete? Or something else?

@SimonDanisch
Copy link
Member

Amazing, thanks for taking the time to do this :)

this rewrite revealed that there are still shader formats registered to GLAbstraction

Oh wow, definitely delete!

@tlnagy
Copy link
Contributor

tlnagy commented Mar 1, 2021

Super excited for this change! Out of curiosity, I'm not sure I understand why the following is true:

Indeed, the downside of using the type system is that having each format be a separate type makes it impossible to infer types. This hurts the runtime performance, increases latency due to unnecessary method specialization by the compiler, and increases the risk of invalidation.

@timholy
Copy link
Member Author

timholy commented Mar 1, 2021

It's a lot like this:

julia> foo(n) = Val(n)
foo (generic function with 1 method)

julia> @code_warntype foo(3)
MethodInstance for foo(::Int64)
  from foo(n) in Main at REPL[1]:1
Arguments
  #self#::Core.Const(foo)
  n::Int64
Body::Val{_A} where _A
1%1 = Main.Val(n)::Val{_A} where _A
└──      return %1

(with lots of red). There's no way to infer the result type. In contrast, if you keep things in the value domain (use n rather than Val(n)) then everything stays inferrable and lower latency because the compiler won't specialize so many methods that consume the output of foo.

Using sym::Symbol to encode the file format internally via querysym means that its callers can infer its return type. In contrast, query returns a File{DataFormat{sym}} and so the result-type of query is not inferrable.

This still creates a File{DataFormat{sym}} right before "exiting" FileIO, which breaks inference. But if you're calling load or save you're using invokelatest anyway (to circumvent world-age issues) so it basically doesn't matter.

@timholy
Copy link
Member Author

timholy commented Mar 1, 2021

OK, after a bit of ugly Pkg-wrangling across Julia versions we're passing tests. I might be able to tackle the whole list above in the morning. Should I implement the precompiles, or should I hold off a while to give someone a chance to review it? I'm happy leaving this open a week or so if needed. However, I also know it's not a fun PR to review, so no worries either way.

@codecov
Copy link

codecov bot commented Mar 2, 2021

Codecov Report

Merging #295 (412fdc6) into master (71bdffe) will increase coverage by 3.30%.
The diff coverage is 90.03%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #295      +/-   ##
==========================================
+ Coverage   85.37%   88.68%   +3.30%     
==========================================
  Files           9       10       +1     
  Lines         588      592       +4     
==========================================
+ Hits          502      525      +23     
+ Misses         86       67      -19     
Impacted Files Coverage Δ
src/FileIO.jl 100.00% <ø> (ø)
src/deprecated.jl 47.82% <47.82%> (ø)
src/error_handling.jl 80.00% <50.00%> (+26.66%) ⬆️
src/registry.jl 85.83% <90.90%> (+0.23%) ⬆️
src/registry_setup.jl 97.29% <96.49%> (+2.24%) ⬆️
src/query.jl 95.90% <97.08%> (+5.46%) ⬆️
src/loadsave.jl 98.71% <98.63%> (+10.17%) ⬆️
src/precompile.jl 93.75% <100.00%> (+20.28%) ⬆️
src/types.jl 88.23% <100.00%> (+6.41%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 71bdffe...412fdc6. Read the comment docs.

@timholy
Copy link
Member Author

timholy commented Mar 2, 2021

I've got the docs ready but I think it's probably best to submit them as a second PR. The docs would detract from the reviewability of this PR since they are mostly a move from the README, making for big linecount changes with little practical effect.

The same goes for the rest of my TODOs (with the possible exception of fixing the depwarns). So I might just wait and submit those separately.

@timholy
Copy link
Member Author

timholy commented Mar 2, 2021

In the new file deprecated.jl, I've restored the functions that older versions of ImageIO expect, so we shouldn't have to fear a non-breaking bump and we probably don't need to do any registry shenanigans.

@timholy timholy force-pushed the teh/rewrite branch 3 times, most recently from 3ed09c5 to 44bbf99 Compare March 3, 2021 10:41
@timholy timholy changed the title [RFC] Major rewrite for correctness, performance [Major rewrite for correctness, performance Mar 3, 2021
@timholy timholy changed the title [Major rewrite for correctness, performance Major rewrite for correctness, performance Mar 3, 2021
@IanButterworth
Copy link
Member

I haven't taken more than a cursory glance at the diff here, but your rationale sounds good
I echo Simon's comment

Amazing, thanks for taking the time to do this :)

@timholy
Copy link
Member Author

timholy commented Mar 3, 2021

OK! I'll fix the depwarns from the test suite and then merge.

This package was started in 2015 (back in the Julia 0.3 or 0.4 days), and a lot
has changed since then. It's never really gotten a serious freshening.
This rewrite has several goals.

Improving the robustness of package/Module identification
---------------------------------------------------------

In modern versions of Julia, the package manager uses name/UUID combinations
to identify packages. This is far more robust and flexible than older strategies
for specifying packages. FileIO doesn't do this: it identifies modules by their
name only. We should adopt the new approach here: going forward (once the
deprecation period has passed and we release FileIO v2), all not-yet-loaded
modules must be specified by name/UUID.

There are some cases--often used in tests or transiently during development of
a new I/O package--where the handler *isn't* a registered package, and so there's
no UUID available. Currently we try to look up the module based on a `name::Symbol`.
It used to be that most modules were loaded into `Main`, then Julia switched
to `Base.__toplevel__`; currently we search both, since modules defined in the
REPL or tests might still live in `Main`.
Of course, even back in the old days, sub-modules could not be found in `Main`,
so the current system can't handle submodules.

To address the need for specifying modules that aren't packages, while
improving both correctness and flexibility, this PR allows you to
specify it by (duh) the module itself rather than the name of the module.

The combination of using either the module itself or a name/UUID combination
means that we can replace a lot of brittle & slow code. When we have the module,
we're done; when we have a name/UUID combination, we just call `Base.require`
to get the module. It even checks for us whether the module is already loaded.
End of story.

To help transition existing users to the new system, this has
"depwarn"-code to look for the module based on its name. It searches:

1. the currently-loaded modules
2. `Main`
3. The user's current `Pkg` environment

One key (breaking) difference is that this lookup is now done during `add_format`
rather than when the user tries to `load` or `save` a file. This is obviously
better for runtime efficiency, but it does change the point in the code where
an error occurs. One of the relatively changes to the tests addresses this change.

**Summary**: the new system is strictly more flexible than the old one, since we could
never previously support sub-modules. It is also strictly more correct since
the registry now specifies precisely what it means by `ImageIO`.
There is depwarn-code to help existing users transition, and the only known breakages
only concern the specific point in the code from which an error would be thrown.

Improving performance and reducing latency with better inferrability
--------------------------------------------------------------------

In the original design of this package, `load` and `save` were designed to be
specialized by packages. To allow format-specific dispatch, we encoded the
file format into the type system using types like `DataFormat{:PNG}`.
However, at a certain point we switched to calling module-specific
unexported `load` and `save` methods. As a consequence, we don't really
need to encode the format in the type system, we can just use a runtime
value. Indeed, the downside of using the type system is that having each
format be a separate type makes it impossible to infer types. This hurts the
runtime performance, increases latency due to unnecessary method specialization
by the compiler, and increases the risk of invalidation.

However, one way in which we may *under*-specialize is for the filename.
defined in `FilePathsBase`. That's a nice change, but this package does quite
a lot of manipulation based on file name, and having the type be non-inferrable
has some downsides.

Finally, several of the container types have historically been poorly-specified,
e.g., `const magic_list = Vector{Pair}()`.

This rewrite tries to straddle two goals: improving internal inferrability
while maintaining backwards compatibility. The strategy taken is to try to
wait until the last possible moment to construct non-inferrable objects---to wait
until the results are reported back to the caller.
In this rewrite, the data format is encoded internally just as a `Symbol`,
and the file is passed around as a separate object. This prevents one from
needing to specialize on the data format while preserving inferrability for the file.

There are a couple of minor changes to internal types, and this forced a couple of
changes to the tests. Most significantly, `File{fmt}` is no longer a concrete
type, because `File` got a second type-parameter to encode the filename type.
To prevent inference failures due to varying-length tuples, this also transitions
all magic bytes from `NTuple{N,UInt8}` to `Vector{UInt8}`.

As a case study, with the existing FileIO release, I get ~50us to load a
10x10 RGB png file. With this version, it's ~25us. It's remarkable that inference
can compete with I/O as a source of slowness, but there you have it.
This is in preparation for adding Documenter docs, but it's
useful on its own.
This will improve printing in documentation tables
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants