Archiving DataLoaders.jl #90
I'm having second thoughts on this, since I've heard complaints about MLUtils.jl now being too heavy for someone who needs only basic utilities like a train/test split. For instance, the import latency:

```julia
julia> @time using MLUtils  # on master
  2.090915 seconds (2.10 M allocations: 124.960 MiB, 3.50% gc time, 68.02% compilation time)

julia> @time using MLUtils  # after removing FLoops and FoldsThreads
  0.928274 seconds (890.85 k allocations: 53.444 MiB, 2.36% gc time, 73.03% compilation time)
```
Maybe we can keep DataLoaders.jl, move it to the JuliaML org, copy and paste the implementation from here, and mark a breaking release. Then we have Flux reexport it. For MLUtils, we keep […]. As a further breakdown, I also suggest moving […]. Or we just go on with the plan outlined in the OP. What do you think @lorenzoh @darsnack @ToucheSir?
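The reexport part of that plan could look roughly like the following toy sketch. All module and type names here are stand-ins, not the real package code: a "DataLoaders" package owns the loader type, and a "Flux" package reexports it so end users see no change.

```julia
# Toy sketch (hypothetical names): one package owns the implementation,
# another reexports it unchanged.
module DataLoadersToy
    struct DataLoader{T}
        data::T
        batchsize::Int
    end
    export DataLoader
end

module FluxToy
    using ..DataLoadersToy: DataLoader  # pull in the implementation
    export DataLoader                   # ...and reexport it unchanged
end

using .FluxToy
dl = DataLoader(1:10, 2)  # users only ever loaded "Flux"
```

With this layout, loading Flux alone would give access to `DataLoader`, while MLUtils could drop the heavy parallel dependencies.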
Right, I didn't think of the latency. I think what you're proposing makes sense, and makes for an easier migration as well.
My opinion is that we should really consider how many people are annoyed by this latency issue and whether this is a real demand that justifies dividing work again into multiple repositories. This premature modularization is a serious issue in Julia organizations, where people split things into packages because one user commented about load time. At some point we all hope that Julia will fix the latency issue, so let's think carefully before deciding to go separate ways. Can you think of users of DataLoaders.jl that are not users of MLUtils.jl? If the answer is yes, then we have a more solid argument to keep these efforts separate with such a tiny community of maintainers.
How much can we reduce import times without removing core deps (FLoops can probably go, though, since it's mostly used for syntax sugar)? I don't think this package has any precompilation, for example.
I agree with Brian that we should look into reducing the load time contributed by FLoops first. It would be nicer as a user to not have to load different packages to set up the data pipeline. I think the main complaint re: load times is packages that want to extend the interface, not user load times. For this, let's see how far people get with the […]
As another data point, the import time difference is also quite big on my machine:

```julia
julia> @time using MLUtils
  0.222900 seconds (468.17 k allocations: 27.956 MiB, 1.72% compilation time)                 # no parallel deps
  1.010500 seconds (1.67 M allocations: 95.612 MiB, 1.94% gc time, 43.90% compilation time)   # master
  0.995093 seconds (1.56 M allocations: 88.791 MiB, 46.31% compilation time)                  # without FLoops
```

Edit: added time without FLoops.jl.
I can think of many users of MLUtils not being users of DataLoaders.
True. This is mitigated, though, by the fact that Flux's users just import Flux and have everything they need, without having to know anything about MLUtils or DataLoaders. Non-deep-learning users can just use an […]
So removing FLoops doesn't help much. As for the maintenance burden of another repo, I expect DataLoaders.jl to be very low activity; we will hardly change anything until we start experimenting with other parallel loading strategies.
What about precompilation? I don't think any packages in JuliaML use it.
Precompilation would help with TTFX but not with speeding up […]
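A toy illustration of that distinction (stand-in module, not MLUtils code): a `precompile` call caches a compiled specialization for given argument types, trimming time-to-first-call for those calls, but `using` still has to load the module itself, so import latency is largely unaffected.

```julia
# Toy sketch: precompilation targets time-to-first-call (TTFX), not
# `using` time. In a real package, the precompile call runs during
# package precompilation and the specialization is cached to disk;
# in a plain script it just compiles eagerly at load time.
module PrecompileToy
    addone(x) = x + 1
    precompile(addone, (Int,))  # cache/compile addone(::Int) up front
end

using .PrecompileToy: addone
addone(41)  # first call can reuse the already-compiled specialization
```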
I thought it was part of SciML/DifferentialEquations.jl#786, but reading back through SciML/DifferentialEquations.jl#786 (comment), it appears invalidations + Requires were more important for import times. Has anyone tried running […]?
This is what I get:

```julia
julia> @time_imports using MLUtils
     10.4 ms  ┌ MacroTools
     17.8 ms  ┌ ZygoteRules 34.58% compilation time
      0.6 ms  ┌ DefineSingletons
      0.3 ms  ┌ IteratorInterfaceExtensions
      0.9 ms  ┌ TableTraits
      1.1 ms  ┌ DelimitedFiles
      3.3 ms  ┌ Compat
      1.2 ms  ┌ Requires
      0.4 ms  ┌ DataValueInterfaces
      1.0 ms  ┌ DataAPI
     11.2 ms  ┌ Tables
      1.0 ms  ┌ ConstructionBase
     16.2 ms  ┌ Setfield 29.96% compilation time (22% recompilation)
     13.6 ms  ┌ InitialValues
    127.6 ms  ┌ BangBang 55.14% compilation time (2% recompilation)
     10.0 ms  ┌ FunctionWrappers
     51.4 ms  ┌ DataStructures
      0.4 ms  ┌ CompositionsBase
     15.4 ms  ┌ Accessors 24.46% compilation time
      1.5 ms  ┌ ContextVariablesX
     60.8 ms  ┌ ChainRulesCore
     62.5 ms  ┌ ChangesOfVariables
      0.8 ms  ┌ InverseFunctions
      2.8 ms  ┌ DocStringExtensions 52.48% compilation time
      3.6 ms  ┌ IrrationalConstants
      0.7 ms  ┌ StatsAPI
      0.7 ms  ┌ SortingAlgorithms
      1.4 ms  ┌ LogExpFunctions
      6.1 ms  ┌ Missings
     38.9 ms  ┌ StatsBase 3.78% compilation time
      6.9 ms  ┌ MicroCollections
      0.7 ms  ┌ Adapt
      0.7 ms  ┌ ArgCheck
      6.4 ms  ┌ SplittablesBase
     34.7 ms  ┌ Baselet
     85.8 ms  ┌ Transducers 10.24% compilation time
     28.4 ms  ┌ MLStyle
      1.1 ms  ┌ PrettyPrint
      5.1 ms  ┌ ShowCases
    265.3 ms  ┌ FoldsThreads 86.56% compilation time
      1.2 ms  ┌ FLoopsBase
      0.7 ms  ┌ NameResolution
      3.4 ms  ┌ JuliaVariables
      6.1 ms  ┌ FLoops
    747.3 ms  MLUtils 43.14% compilation time (<1% recompilation)
```
We could do the lazy import (or require import) thing we do in MLDatasets.jl also here for […]
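For reference, the Requires.jl lazy-loading pattern being alluded to looks roughly like the following. This is a non-runnable sketch: the UUID is a placeholder (not FLoops' real one) and the included file name is hypothetical.

```julia
# Sketch of lazy loading with Requires.jl: the heavy dependency's code
# paths are only loaded once the user imports that dependency themselves.
# The UUID below is a placeholder; the file name is hypothetical.
using Requires

function __init__()
    @require FLoops = "00000000-0000-0000-0000-000000000000" begin
        include("parallel_dataloader.jl")  # parallel loading code, loaded on demand
    end
end
```

The trade-off is that precompilation cannot cover the conditionally loaded code, which is one reason Requires itself came up in the invalidation discussion mentioned above.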
Some finer-grained timing locally suggests that the majority of import overhead for BangBang and FoldsThreads is self time and not from dependencies (unless there is significant invalidation happening, but that does not appear to be the case). It's not clear to me why this is, though. @tkf, do you have any thoughts?
Linked issue: `getobs` from an array of graphs (JuliaGraphs/GraphNeuralNetworks.jl#170)
I'm waiting for the outcome of this discussion before tagging a new release that adds the 'parallel' API to DataLoader, which we may have to break in the future if the parallel implementation is moved out. But now a few changes have accumulated, most importantly bug fixes like #97, so we should reach a decision soon.
Would it be possible to add […]?

Thinking through it, I think it would be better to have […].

Re being able to implement the data container interface without taking on a large dep: I think […]
I share Lorenz's view here about waiting to see what people say. Also, keeping it in means we can release now, so I suggest doing that and moving this discussion to the next cycle. The long load times + […]
Yes, the load time is already released.

OK, let's release then.

Yes, please.
Now that MLUtils.jl has feature parity with `DataLoaders.DataLoader`, it's time to deprecate DataLoaders.jl. We should:

- […] `DataLoaders.DataLoader` and `MLUtils.DataLoader`
- […] `torch.utils.data.DataLoader` and the comparison to PyTorch
- […] (`MLDatasets.FileDataset` and `mapobs` instead of custom struct)
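A deprecation release along these lines could be quite small. The following is a toy sketch with stand-in module names (not the real DataLoaders.jl or MLUtils code): the old constructor emits a depwarn and forwards to the new implementation.

```julia
# Toy sketch of a deprecation shim; all names are stand-ins.
module NewImpl  # plays the role of MLUtils
    struct DataLoader{T}
        data::T
        batchsize::Int
    end
end

module OldPkg  # plays the role of the deprecated DataLoaders.jl
    using ..NewImpl

    function DataLoader(data; batchsize::Int = 1)
        # Warn once per call site (only shown with --depwarn=yes by default).
        Base.depwarn(
            "OldPkg.DataLoader is deprecated, use NewImpl.DataLoader instead.",
            :DataLoader,
        )
        return NewImpl.DataLoader(data, batchsize)
    end
end

dl = OldPkg.DataLoader(1:4; batchsize = 2)  # old call still works, forwarded
```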