Major rewrite for correctness, performance #295

timholy · 2021-03-01T15:10:25Z

This package was started in 2015 (back in the Julia 0.3 or 0.4 days), and a lot
has changed since then. It hasn't recently gotten a serious freshening.
This rewrite has several goals.

Improving the robustness of package/Module identification

In modern versions of Julia, the package manager uses name/UUID combinations
to identify packages. This is far more robust and flexible than older strategies
for specifying packages. FileIO doesn't do this: it identifies modules by their
name only. We should adopt the new approach here: going forward (once the
deprecation period has passed and we release FileIO v2), all not-yet-loaded
modules must be specified by name/UUID.

There are some cases--often used in tests or transiently during development of
a new I/O package--where the handler isn't a registered package, and so there's
no UUID available. Currently we try to look up the module based on a name::Symbol.
It used to be that most modules were loaded into Main, then Julia switched
to Base.__toplevel__; currently we search both, since modules defined in the
REPL or tests might still live in Main.
Of course, even back in the old days, sub-modules could not be found in Main,
so the current system can't handle submodules.

To address the need for specifying modules that aren't packages, while
improving both correctness and flexibility, this PR allows you to
specify such a handler by (duh) the module itself rather than by the name of the module.

The combination of using either the module itself or a name/UUID combination
means that we can replace a lot of brittle & slow code. When we have the module,
we're done; when we have a name/UUID combination, we just call Base.require
to get the module. Base.require even checks for us whether the module is already loaded, and
when it is, the call amounts to a single Dict lookup. This should be more robust and considerably
faster than

FileIO.jl/src/loadsave.jl

Lines 7 to 35 in 1df5033

    
    function _findmod(f::Symbol)
        for (u,v) in Base.loaded_modules
            (Symbol(v) == f) && return u
        end
        nothing
    end
    function topimport(modname)
        @eval Base.__toplevel__  import $modname
        u = _findmod(modname)
        @eval $modname = Base.loaded_modules[$u]
    end
    function checked_import(pkg::Symbol)
        lock(load_locker) do
            # kludge for test suite
            if isdefined(Main, pkg)
                m1 = getfield(Main, pkg)
                isa(m1, Module) && return m1
            end
            if isdefined(FileIO, pkg)
                m1 = getfield(FileIO, pkg)
                isa(m1, Module) && return m1
            end
            m = _findmod(pkg)
            m == nothing || return Base.loaded_modules[m]
            topimport(pkg)
            return Base.loaded_modules[_findmod(pkg)]
        end
    end

.
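As a rough illustration of the two resolution paths described above (not FileIO's actual code), a handler given as a Module can be returned as-is, while a name/UUID pair is handed to Base.require. The helper name resolve_handler is hypothetical, and the Dates standard library merely stands in for an I/O package; its name and UUID are looked up with Base.identify_package only to avoid hard-coding a UUID here.

```julia
# Hypothetical helper, for illustration only; FileIO's internals may differ.

# A handler that is already a Module needs no lookup at all.
resolve_handler(m::Module) = m

# A name/UUID pair is turned into a PkgId and passed to Base.require, which
# returns the module, loading it first if it is not already loaded.
function resolve_handler(spec::Pair{String,Base.UUID})
    name, uuid = spec
    return Base.require(Base.PkgId(uuid, name))
end

# Usage: the Dates stdlib stands in for an I/O package.
dates_id = Base.identify_package("Dates")   # PkgId carrying name and UUID
dates_mod = resolve_handler(dates_id.name => dates_id.uuid)
```

Because dispatch separates the two cases, a submodule or a module defined in a test can be passed directly, with no name-based search.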

To help transition existing users to the new system, this has
"depwarn"-code to look for the module based on its name. It searches:

1. The user's current Pkg environment
2. Main
3. All registries used by the user

One key (breaking) difference is that this lookup is now done during add_format
rather than when the user tries to load or save a file. This is obviously
better for runtime efficiency, but it does change the point in the code where
an error occurs. One of the relatively few changes to the tests addresses this change.
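To make the change in error timing concrete, here is a minimal sketch of a registry that resolves the package name at registration time. The function register_format!, the registry Dict, and the package name "NoSuchPackageForIllustration" are all hypothetical, and the sketch only covers the lookup in the current environment (not Main or the registries).

```julia
# Hypothetical registry and function names; this only sketches the timing of the lookup.
function register_format!(registry::Dict{Symbol,Base.PkgId}, fmt::Symbol, pkgname::String)
    # Resolve the name against the active environment right away, so that a
    # bad name fails here rather than at the first load/save call.
    id = Base.identify_package(pkgname)
    id === nothing && throw(ArgumentError("package $pkgname not found in the current environment"))
    registry[fmt] = id
    return id
end

registry = Dict{Symbol,Base.PkgId}()
register_format!(registry, :DATES, "Dates")        # resolves: Dates is a stdlib

# A misspelled or uninstalled package is reported at registration time.
failed_early = try
    register_format!(registry, :BOGUS, "NoSuchPackageForIllustration")
    false
catch err
    err isa ArgumentError
end
```

With this arrangement, later load or save calls only need a Dict lookup, which is the runtime benefit described above.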

Summary: the new system is strictly more flexible than the old one, since we could
never previously support sub-modules. It is also strictly more correct since
the registry now specifies precisely what it means by ImageIO.
There is depwarn-code to help existing users transition, and the known breakages
concern only the specific point in the code from which an error would be thrown.
Finally, it should be substantially more performant.

Improving performance and reducing latency with better inferrability

In the original design of this package, load and save were designed to be
specialized by packages. To allow format-specific dispatch, we encoded the
file format into the type system using types like DataFormat{:PNG}.
However, at a certain point we switched to calling module-specific
unexported load and save methods. As a consequence, we don't really
need to encode the format in the type system; we can just use a runtime
value. Indeed, the downside of using the type system is that having each
format be a separate type makes it impossible to infer types. This hurts the
runtime performance, increases latency due to unnecessary method specialization
by the compiler, and increases the risk of invalidation.

However, one way in which we may under-specialize is for the filename.
#259 removed the type-specification of the filename to support types
defined in FilePathsBase. That's a nice change, but this package does quite
a lot of manipulation based on file name, and having the type be non-inferrable
has some downsides.

Finally, several of the container types have historically been poorly-specified,
e.g., const magic_list = Vector{Pair}().

This rewrite tries to straddle two goals: improving internal inferrability
while maintaining backwards compatibility. The strategy taken is to try to
wait until the last possible moment to construct non-inferrable objects---to wait
until the results are reported back to the caller.
In this rewrite, the data format is encoded internally just as a Symbol,
and the file is passed around as a separate object. This prevents one from
needing to specialize on the data format while preserving inferrability for the file.
To move towards a world in which we could infer the type of the filename,
this adds a parameter to our existing types.
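The strategy can be sketched as follows. The types DataFormat and File below are simplified stand-ins defined just for this example (they are not FileIO's definitions), and detect and query_like are hypothetical names: the internal step keeps the format as a Symbol and so has a concrete inferred return type, while the type-encoded object is built only at the boundary.

```julia
# Simplified stand-ins for illustration; these are NOT FileIO's definitions.
struct DataFormat{sym} end
struct File{F<:DataFormat,Name}
    filename::Name
end

# Internal step: the format is a runtime Symbol, so the return type
# (a Symbol paired with the filename) does not depend on the file's contents or name.
function detect(filename::String)
    ext = lowercase(last(splitext(filename)))
    sym = ext == ".png" ? :PNG : ext == ".csv" ? :CSV : :UNKNOWN
    return sym, filename
end

# Boundary step: the Symbol is lifted into the type domain only when the
# result is handed back to the caller, so only this function is non-inferrable.
function query_like(filename::String)
    sym, name = detect(filename)
    return File{DataFormat{sym},typeof(name)}(name)
end

internal_rt = only(Base.return_types(detect, (String,)))
boundary_rt = only(Base.return_types(query_like, (String,)))
result = query_like("image.png")
```

Here internal_rt should be a concrete tuple type, whereas boundary_rt is abstract because the format parameter is only known at runtime.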

There are a couple of minor changes to internal types, and this forced a couple of
changes to the tests. Most significantly, File{fmt} is no longer a concrete
type, because File got a second type-parameter to encode the filename type.
To prevent inference failures due to varying-length tuples, this also transitions
all magic bytes from NTuple{N,UInt8} to Vector{UInt8}.
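A small sketch of why the container and magic-byte types matter. The element type Pair{Vector{UInt8},Symbol} and the function match_magic are assumptions made for this example, not necessarily what FileIO uses: tuples of different lengths are different types, so a list of them cannot have a concrete element type, whereas byte vectors of any length share one type.

```julia
# Magic numbers of different lengths have different tuple types...
png_tuple = (0x89, 0x50, 0x4e, 0x47)
gif_tuple = (0x47, 0x49, 0x46)
different_tuple_types = typeof(png_tuple) != typeof(gif_tuple)

# ...but the same Vector{UInt8} type, so the list can be concretely typed.
# (Hypothetical element type, chosen for illustration.)
loose = Vector{Pair}()
tight = Pair{Vector{UInt8},Symbol}[
    UInt8[0x89, 0x50, 0x4e, 0x47] => :PNG,
    UInt8[0x47, 0x49, 0x46]       => :GIF,
]

# Return the format whose magic bytes prefix `header`, or `nothing`.
function match_magic(list::Vector{Pair{Vector{UInt8},Symbol}}, header::Vector{UInt8})
    for (magic, sym) in list
        n = length(magic)
        if length(header) >= n && view(header, 1:n) == magic
            return sym
        end
    end
    return nothing
end

found = match_magic(tight, UInt8[0x47, 0x49, 0x46, 0x38, 0x39, 0x61])
```

With the concretely typed list, the loop variables magic and sym have known types inside match_magic, and the function's result is a small Union of Symbol and Nothing.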

Overall consequences

As a case study, with the existing FileIO release, I get ~50us to load a
10x10 RGB png file. With this version, it's ~25us. It's remarkable that some
of the current bad things about the code here can compete with I/O as a source
of slowness, but there you have it. Of course for a larger image it becomes
increasingly I/O dominated.

I've not measured latency yet, because this currently removes all precompile
directives. I'll add those back in after addressing review comments, since I expect
that could change the precompile directives too.

TODOs

These will probably be separate PRs to enhance the reviewability of this one.

Update tests that trigger depwarns (currently left unchanged to show what's breaking and what's not)
Put all tests in testsets (will obviate #271, "Avoid creating global variables")
Update the precompiles
Switch to Documenter and add doctests (to ensure that documentation is accurate)

Breaking or not?

Since I think all the breakages (check the changes to the test files) are essentially things that should
only be used internally, I don't think this needs to be FileIO v2.0.0; instead, it can be FileIO v1.5,
and when we delete the deprecations we can go to FileIO v2.0.0. However, there is one important
caveat: ImageIO exploited FileIO internals and some of these internals are changing.
(See JuliaIO/ImageIO.jl#22.) I don't see any way of preventing a breakage for
people who have current versions of ImageIO. Perhaps we could retrospectively place an upper bound
on the FileIO version at 1.4 in the registry? (see below)

timholy · 2021-03-01T15:16:39Z

Oh, another thing this does is warn packages that extend FileIO.load etc. to stop doing that. That might pave the way for changing File and Stream so that they don't use the type-system to encode the format.

timholy · 2021-03-01T15:20:17Z

One more: @SimonDanisch, this rewrite revealed that there are still shader formats registered to GLAbstraction, which AFAICT is not a registered package. Delete? Or something else?

SimonDanisch · 2021-03-01T15:20:55Z

Amazing, thanks for taking the time to do this :)

> this rewrite revealed that there are still shader formats registered to GLAbstraction

Oh wow, definitely delete!

tlnagy · 2021-03-01T20:11:10Z

Super excited for this change! Out of curiosity, I'm not sure I understand why the following is true:

> Indeed, the downside of using the type system is that having each format be a separate type makes it impossible to infer types. This hurts the runtime performance, increases latency due to unnecessary method specialization by the compiler, and increases the risk of invalidation.

timholy · 2021-03-01T21:54:43Z

It's a lot like this:

julia> foo(n) = Val(n)
foo (generic function with 1 method)

julia> @code_warntype foo(3)
MethodInstance for foo(::Int64)
  from foo(n) in Main at REPL[1]:1
Arguments
  #self#::Core.Const(foo)
  n::Int64
Body::Val{_A} where _A
1 ─ %1 = Main.Val(n)::Val{_A} where _A
└──      return %1

(with lots of red). There's no way to infer the result type. In contrast, if you keep things in the value domain (use n rather than Val(n)) then everything stays inferrable, and latency is lower because the compiler won't specialize so many methods that consume the output of foo.

Using sym::Symbol to encode the file format internally via querysym means that its callers can infer its return type. In contrast, query returns a File{DataFormat{sym}} and so the result-type of query is not inferrable.

This still creates a File{DataFormat{sym}} right before "exiting" FileIO, which breaks inference. But if you're calling load or save you're using invokelatest anyway (to circumvent world-age issues) so it basically doesn't matter.

timholy · 2021-03-01T23:54:12Z

OK, after a bit of ugly Pkg-wrangling across Julia versions we're passing tests. I might be able to tackle the whole list above in the morning. Should I implement the precompiles, or should I hold off a while to give someone a chance to review it? I'm happy leaving this open a week or so if needed. However, I also know it's not a fun PR to review, so no worries either way.

codecov · 2021-03-02T14:18:51Z

Codecov Report

Merging #295 (412fdc6) into master (71bdffe) will increase coverage by 3.30%.
The diff coverage is 90.03%.

@@            Coverage Diff             @@
##           master     #295      +/-   ##
==========================================
+ Coverage   85.37%   88.68%   +3.30%     
==========================================
  Files           9       10       +1     
  Lines         588      592       +4     
==========================================
+ Hits          502      525      +23     
+ Misses         86       67      -19

| Impacted Files | Coverage Δ |
|---|---|
| src/FileIO.jl | 100.00% <ø> (ø) |
| src/deprecated.jl | 47.82% <47.82%> (ø) |
| src/error_handling.jl | 80.00% <50.00%> (+26.66%) |
| src/registry.jl | 85.83% <90.90%> (+0.23%) |
| src/registry_setup.jl | 97.29% <96.49%> (+2.24%) |
| src/query.jl | 95.90% <97.08%> (+5.46%) |
| src/loadsave.jl | 98.71% <98.63%> (+10.17%) |
| src/precompile.jl | 93.75% <100.00%> (+20.28%) |
| src/types.jl | 88.23% <100.00%> (+6.41%) |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

timholy · 2021-03-02T14:41:44Z

I've got the docs ready but I think it's probably best to submit them as a second PR. The docs would detract from the reviewability of this PR since they are mostly a move from the README, making for big linecount changes with little practical effect.

The same goes for the rest of my TODOs (with the possible exception of fixing the depwarns). So I might just wait and submit those separately.

timholy · 2021-03-02T14:47:21Z

In the new file deprecated.jl, I've restored the functions that older versions of ImageIO expect, so we shouldn't have to fear a non-breaking bump and we probably don't need to do any registry shenanigans.

IanButterworth · 2021-03-03T11:55:15Z

I haven't taken more than a cursory glance at the diff here, but your rationale sounds good.
I echo Simon's comment:

> Amazing, thanks for taking the time to do this :)

timholy · 2021-03-03T11:58:05Z

OK! I'll fix the depwarns from the test suite and then merge.

This was referenced Mar 1, 2021

Don't rely on FileIO internals JuliaIO/ImageIO.jl#22

Merged

ImageIO Benchmarks JuliaIO/ImageIO.jl#21

Open

timholy force-pushed the teh/rewrite branch from f92743f to 64eab01 Compare March 2, 2021 14:17

timholy mentioned this pull request Mar 2, 2021

Add ImageIO for normal TIFF files #290

Merged

timholy force-pushed the teh/rewrite branch 3 times, most recently from 3ed09c5 to 44bbf99 Compare March 3, 2021 10:41

timholy changed the title ~~[RFC] Major rewrite for correctness, performance~~ [Major rewrite for correctness, performance Mar 3, 2021

timholy changed the title ~~[Major rewrite for correctness, performance~~ Major rewrite for correctness, performance Mar 3, 2021

timholy mentioned this pull request Mar 3, 2021

Add precompile helper tlnagy/TiffImages.jl#35

Merged

timholy force-pushed the teh/rewrite branch from 24f531c to b3e4558 Compare March 3, 2021 12:20

timholy added 7 commits March 3, 2021 06:58

Improve docstrings

ea3e9af

This is in preparation for adding Documenter docs, but it's useful on its own.

Use strings for magic bytes where possible

6a5b688

This will improve printing in documentation tables

Add precompiles

c25f5cd

Delete cruft & improve test coverage

50cc444

Switch badge to codecov

714b915

Eliminate test depwarns

412fdc6

timholy force-pushed the teh/rewrite branch from b3e4558 to 412fdc6 Compare March 3, 2021 13:02

timholy merged commit b0932bc into master Mar 3, 2021

timholy deleted the teh/rewrite branch March 3, 2021 13:14

This was referenced Mar 4, 2021

Remove FileIO cruft from GLAbstraction JuliaPlots/GLMakie.jl#164

Merged

'dataset' suddenly stopped working with 'ISLR' JuliaStats/RDatasets.jl#117

Open
