A better default dataids(::AbstractArray) #26237

mbauman · 2018-02-27T19:45:12Z

~~I was previously calling objectid -- I should have been using pointer_from_objref. The former is a hash, the latter is much quicker and easier to compute.~~

This PR now implements a more robust default dataids definition that attempts to dynamically dig into custom array struct fields. This started as the simpler optimization struck above; that was split off and merged separately at #29854.

base/abstractarray.jl

martinholters · 2018-07-19T07:45:46Z

How about:

@generated function dataids(A::AbstractArray)
    exprs = Expr[]
    if A.mutable
        push!(exprs, :(UInt(pointer(A))))
    end
    append!(exprs, Expr(:..., :(dataids(A.$f))) for f in fieldnames(A))
    return :(tuple($(exprs...)))
end

I think this subsumes the custom definitions I've added to StaticArrays (https://github.com/JuliaArrays/StaticArrays.jl/pull/376/files) and also would make many of the methods in Base obsolete. Would that make sense?

vtjnash · 2018-07-19T16:25:00Z

That has no reason to be a generated function, it's significantly more complicated, less inferrable, and wrong on v0.7 (due to getproperty overloading):

function dataids(A::AbstractArray)
    ids = ntuple(i -> dataids(getfield(A, f)), Val(nfields(A)))
    if !isimmutable(A)
        ids = (UInt(pointer(A)), ids...)
    end
    return ids
end

But I think pointer(A) is wrong here – we expect that most Arrays will not implement pointer, and if they happen to do so, we don't expect it to necessarily point to the first element of the base array.

martinholters · 2018-07-20T06:54:47Z

Yes, good points. Should probably be pointer_from_objref (together with the existing specialization for Array still using pointer). However, the ids need to be flattened, so how about this:

function dataids(A::AbstractArray)
     ids = _splatmap(dataids, ntuple(i -> getfield(A, i), Val(nfields(A))))
     if !isimmutable(A)
          ids = (UInt(pointer_from_objref(A)), ids...)
     end
     return ids
end

BTW, ntuple(i -> dataids(getfield(A, i)), Val(nfields(A))) cannot be inferred, but map(dataids, ntuple(i -> getfield(A, i), Val(nfields(A)))) can.

mbauman · 2018-07-20T18:21:11Z

That's awesome Martin… but unfortunately it's running into order-dependent type inference. That may also be why you're seeing a difference between map and ntuple.

These things are rather frustrating, and I always burn more time on it than I'd like when I run into it. I've pushed the version with broken tests… but if I re-evaluate the exact same definition at the REPL and incrementally test it then it works. My understanding from prior issues is that this means that we're sometimes getting a "better" result than expected and that we cannot count on it.

vtjnash · 2018-07-20T20:11:13Z

Yes, this abstract method is probably guaranteed never to be inferred reliably. We should probably try to avoid forming a Tuple of the field elements though, since this can be very slow and costly, and instead form a tuple of the dataids results, then flatten / flatmap(identity) / _splatmap(identity).

mbauman · 2018-07-20T21:15:19Z

Is your thinking that we should be forming a homogenous tuple immediately? I'm not sure that's possible, either, as each element of that tuple will be a result from dataids — which is a varying-length tuple itself. I tried making an nsplattuple function as the trivial modification of ntuple (splatting the elements), but that also fails once you add one layer of nesting.

I had really been trying to avoid a generated function here, but I think that's the only path forward.

martinholters · 2018-07-23T07:03:16Z

Some observations:

Explicitly marking _splatmap as @inline helps inference. Not sure it's sufficient for all tests to pass, though.
Tridiagonal needs a specialized dataids due to the du2 field which may be uninitialized. (I guess we can just ignore it for dataids.)
Thinking about it, I think we only need pointer_from_objref(A) if ids is otherwise empty. (If two objects have the same pointer_from_objref (PFO), all their other dataids entries are equal, too, because the objects are identical. If they have different PFOs, we still need to compare all other dataids, too, whether any of them are equal.)
While this change will make the default dataids give a correct result more often, it is nevertheless a breaking change (e.g. the Tridiagonal problem), which seems unavoidable.
It's OK if it's not inferable in all cases (especially ff we accept this to be breaking)---it's just a fallback.

martinholters · 2018-08-22T12:17:02Z

Looking at the inferabilty issue of dataids(M1(M2([1],2))), I believe it is due to the recursion with this call chain:

dataids(::M1{M2{Array{Int64,1},Int64}})
_splatmap(dataids, ::Tuple{M2{Array{Int64,1},Int64}})
dataids(::M2{Array{Int64,1},Int64})
_splatmap(dataids, ::Tuple{Array{Int64,1},Int64})

Here, Tuple{Array{Int64,1},Int64} is not seen as less complex than Tuple{M2{Array{Int64,1},Int64}}, so inference bails out. OTOH, the @generated version foregos the _splatmap calls, and the argument type of dataids indeed does get less complex here.

Note that if some datatype Foo <: AbstractArray has a field that is an AbstractArray, the type of which is not carried as a type parameter of Foo, and neither has a custom dataids implementation, even the @generated version fails to infer. However, the pattern WrapperType{...,WrappedArrayType,...} is so common that I'd say it's justified to use a @generated function here. I'll push an update momentarily.

base/abstractarray.jl

martinholters · 2018-08-22T15:56:03Z

Let's check the fallout here:
@nanosoldier runbenchmarks(ALL, vs = ":master")

I expect regressions for three possible reasons:

More work in aliasing detection due to more entries in the dataids. Seems unavoidable if we aim at more precise aliasing detection.
Aliasing detected where previously it was not. Would need to be examined whether correct or not.
Type-instability of the new implementation.

Will be fun working through the regressions to classify them. Let's hope we don't see any 😄

martinholters · 2018-09-10T06:29:48Z

@nanosoldier runbenchmarks(ALL, vs=":master")

nanosoldier · 2018-09-10T13:26:36Z

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan

mbauman · 2018-09-10T22:20:21Z

Hooray. That looks great. Now the only question is that access of #undef fields. It'd be easy (and efficient) to just add an isdefined check for them… but it ends up making the whole function type unstable since we can no longer infer the number of ids. I'm inclined to deal with the slight breakage and tag this for 1.1. Thoughts?

StefanKarpinski · 2018-09-10T23:18:39Z

I'm inclined to deal with the slight breakage and tag this for 1.1. Thoughts?

Fine but we'll have to run PkgEval first and make sure that it doesn't actually break anything.

martinholters · 2018-09-11T06:14:49Z

I'm pretty amazed the benchmarks went so smoothly. How does one run PkgEval? @Keno runpkgeval(ALL, vs=":master")?

mbauman · 2023-08-07T18:12:46Z

Coming back to this, it appears we still had two outstanding classes of error in the ecosystem:

structs with #undef fields (like Tridiagonal). These throw errors now.
Custom arrays who depend upon non-array-like fields to store their data (like DefaultArray's Dict). These might have happened to participate in aliasing decently well before, but would now no longer detect aliasing. We can play whack-a-mole and define dataids(::AbstractDictionary) (as I did in e250233) but it's worth noting that there may be more moles. One potential solution would be to always append the struct's own objectid to the dataids tuple, but that has a not-insignificant cost to it that is what we were trying to avoid in this PR in the first place.

Finally, @sostock had objected to this because it performs fancy automagic aliasing detection but doesn't provide the same sort of automatics in the unaliascopy implementation. But, honestly, I don't find that to be so problematic for three reasons: (1) the errors only occur in cases where potentially problematic aliasing is already happening, (2) the error message clearly say how to fix it, and (3) most custom arrays do indeed work just fine with the default unaliascopy implementation — it'll only fail if copy(::T) !isa T.

I've rebased; let's see what has happened in the ecosystem since 2020:

@nanosoldier runtests()

nanosoldier · 2023-08-08T11:38:46Z

The package evaluation job you requested has completed - possible new issues were detected.
The full report is available.

only use pointer_from_objref for Dict itself; otherwise use objectid as some AbstractDicts are immutable

mbauman · 2023-08-08T13:27:32Z

There does still seem to be an issue in MeshArray's copy implementation, but I also had an issue here. Let's see how much better this gets:

@nanosoldier runtests(["NCTiles", "PointCloudRasterizers", "MetadataArrays", "Pyehtim", "Arrow", "AxisArrayTables", "MLJDecisionTreeInterface", "PlanktonIndividuals", "VoronoiFVM", "Jets", "Decapodes", "OndaEDF", "RegularizedProblems", "VoronoiFVMDiffEq", "Legolas", "GPUArrays", "OutlierDetection", "DSP", "BayesianQuadrature", "DiffEqDevTools", "EvoTrees", "MIRTjim", "HOODESolver", "DistributedJets", "Reel", "GMT", "MeshArrays", "Onda", "DFWannier", "MovingBoundaryProblems1D", "MagNav", "ClimaTimeSteppers", "HomotopyContinuation", "ConformalPrediction", "MixtureDensityNetworks"])

nanosoldier · 2023-08-09T01:37:55Z

Your job failed. Consult the server logs for more details (cc @maleadt).

maleadt · 2023-08-09T06:32:07Z

Failed to compile (bootstrap); LoadError(at "compiler/compiler.jl" line 3: LoadError(at "abstractarray.jl" line 1551: UndefVarError(var=:Dict))).

mbauman · 2023-08-09T12:43:54Z

Yeah, bootstrap problems — serves me right for making changes blind. 4832890 fixes it, so let's try again:

@nanosoldier runtests(["NCTiles", "PointCloudRasterizers", "MetadataArrays", "Pyehtim", "Arrow", "AxisArrayTables", "MLJDecisionTreeInterface", "PlanktonIndividuals", "VoronoiFVM", "Jets", "Decapodes", "OndaEDF", "RegularizedProblems", "VoronoiFVMDiffEq", "Legolas", "GPUArrays", "OutlierDetection", "DSP", "BayesianQuadrature", "DiffEqDevTools", "EvoTrees", "MIRTjim", "HOODESolver", "DistributedJets", "Reel", "GMT", "MeshArrays", "Onda", "DFWannier", "MovingBoundaryProblems1D", "MagNav", "ClimaTimeSteppers", "HomotopyContinuation", "ConformalPrediction", "MixtureDensityNetworks"])

nanosoldier · 2023-08-09T14:49:49Z

The package evaluation job you requested has completed - possible new issues were detected.
The full report is available.

mbauman · 2023-08-09T16:12:36Z

OK, here's the ecosystem roundup:

The GPUArrays crash looks real, but I can't immediately see why it's happening (Internal error: encountered unexpected error in runtime: ErrorException("unhandled TypeVar")).
DSP needs an implementation of Base.unaliascopy(A::PermutedDimsArray).
The MeshArrays issue (Copying a 1-element "empty" gcmvector errors JuliaClimate/MeshArrays.jl#103) propagated to NCTiles and PlanktonIndividuals.
GMT looks unrelated?
Jets, DistributedJets, and MetadataArrays all hit a similar stack overflow that previously had plagued MeshArrays. Perhaps this is something we should more directly tackle?
AxisArrayTables needs to implement Base.unaliascopy(A::AxisArrayTable)::typeof(A) and is politely informed of such.

mbauman · 2023-08-09T18:13:15Z

Jets, DistributedJets, and MetadataArrays all hit a similar stack overflow (🥁) that previously had plagued MeshArrays. Perhaps this is something we should more directly tackle?

Ah, I see, these are the inverse case of DefaultArrays — they create similar arrays that intentionally share a dictionary. Unlike DefaultArrays, however, this dictionary is not mutated by setindex!, so ideally it would not be a part of their dataids.

So that means that we have four choices:

Define dataids(::AbstractDict). This ensures that DefaultArrays (and other arrays like it) continue to participate in aliasing detection. But it breaks every single broadcasting expression for Jets, MetadataArrays and others like it.
Document that this recurses through arrays only by default. This would, however, mean that DefaultArrays' dataids would change from being their objectid (and thus catching some egregious cases of aliasing) to being the empty tuple.
a. Submit a PR to DefaultArrays to implement their dataids and be done with it
b. Decide to always append the objectid as part of the default dataids tuple. Unfortunately, this would then fail to address the performance issue AbstractArrays in broadcast falls back to slow objectid when checking mightalias #28178.
c. Only use the objectid in cases where the dataids would otherwise be empty. This would indeed catch DefaultArrays... and that's probably representative of private code as well.

I was initially leaning towards 2c, but when I went to implement it I realized that it means that we're recursively adding that objectid to all array fields, too, even ones that are truly immutable. For example, it'd append a hash of a ::UnitRange field to the dataids of a struct that uses a range as a field to define its axes, and then see all arrays with the same axes as aliasing each other. That'd be pretty terrible.

So I think 2a is probably the only realistic path forward?

mbauman · 2023-08-09T21:53:41Z

Let's try Pkgeval again, this time also adding back some of the ones that failed when Martin had run it before:

@nanosoldier runtests(["AES", "AbstractLogic", "DefaultArrays", "NCTiles", "PointCloudRasterizers", "MetadataArrays", "Pyehtim", "Arrow", "AxisArrayTables", "MLJDecisionTreeInterface", "PlanktonIndividuals", "VoronoiFVM", "Jets", "Decapodes", "OndaEDF", "RegularizedProblems", "VoronoiFVMDiffEq", "Legolas", "GPUArrays", "OutlierDetection", "DSP", "BayesianQuadrature", "DiffEqDevTools", "EvoTrees", "MIRTjim", "HOODESolver", "DistributedJets", "Reel", "GMT", "MeshArrays", "Onda", "DFWannier", "MovingBoundaryProblems1D", "MagNav", "ClimaTimeSteppers", "HomotopyContinuation", "ConformalPrediction", "MixtureDensityNetworks"])

nanosoldier · 2023-08-10T05:00:27Z

The package evaluation job you requested has completed - possible new issues were detected.
The full report is available.

rafaqz · 2023-10-05T19:03:00Z

stdlib/LinearAlgebra/src/tridiag.jl

@@ -591,6 +591,8 @@ similar(M::Tridiagonal, ::Type{T}, dims::Union{Dims{1},Dims{2}}) where {T} = sim
 # Operations on Tridiagonal matrices
 copyto!(dest::Tridiagonal, src::Tridiagonal) = (copyto!(dest.dl, src.dl); copyto!(dest.d, src.d); copyto!(dest.du, src.du); dest)

+Base.dataids(A::Tridiagonal) = (Base.dataids(A.dl), Base.dataids(A.d), Base.dataids(A.du))


Is the idea here that any non-base package also needs to define a similar dataids method to avoid an error with undef fields? (when they have those)

Yes, they would need to (or otherwise not have the undef fields).

Maybe we should mention that in the docs along with the other reasons to define custom dataids

I mean its obscure and no one should do it, but its technically a breaking change

rafaqz · 2023-10-05T19:05:02Z

base/abstractarray.jl

+    if @generated
+        :(ids = tuple($([:(dataids(getfield(A, $i))...) for i in 1:fieldcount(A)]...)))
+    else
+        ids = _splatmap(dataids, ntuple(i -> getfield(A, i), Val(nfields(A))))


Do we need to check all registered packages for undef fields in AbstractArray before merging this?

The PkgEval runs didn't come up with any such arrays to my reading, which is indeed a bit surprising (but welcome)!

Haha maybe its literally only Tridiagonal.

How certain are we than package tests actually call dataids somewhere on all AbstractArray?

Good question — they'll hit it if they touch any of the builtin implementations of broadcasting, view, or reshape... which I'd rate as fairly likely (even if not explicitly as a targeted test, they'd probably do at least one of those in some implementation). I'd think, anyhow.

Yes it sounds unlikely none of those are called.

rafaqz · 2023-10-05T20:01:32Z

One comment: it's possible the recursive method will lead to false positives in the ecosystem that will have performance consequences.

I would guess in AxisArrays.jl as it doesn't define dataids. DimensionalData.jl does define it, but this change would be completely debilitating if it didn't. Its very common to have the same lookup vectors attached to the source and dest arrays in a broadcast.

mbauman · 2023-10-05T21:21:58Z

So to summarize the status quo of the PR as is implemented:

This tries to get a better default implementation of by looking at all fields that contain AbstractArrays and recursively ask for their dataids.
- What does better mean?
  - This should improve performance in the case that two arrays do not alias (avoiding a more expensive objectid hash in many more cases).
  - This does address some of unalias's false negatives and correctly flag aliasing arrays for copies. For example, this correctly identified more true positives for PermutedDimsArray, LogicalIndex, AxisArrayTable (these were identified because they need a unaliascopy definition). There are likely more that happily satisfy copy(::T)::T and thus don't bubble up.
- When is this not better?
  - This does introduce new unalias false positives. For example, AxisArrays will use this default implementation which will naively consider its axes as part of the "data" portion of the array (when it is not mutated by setindex!). MeshArrays would also see new false positives. This has performance impacts, but hopefully they are somewhat obvious when troublesome (e.g., profiling should show time being spent in an evocatively-named unaliascopy).
  - This does introduce new unalias false negatives. For example, DefaultArrays use a Dict to store its data; this will not be considered as part of its dataid. The existing implementation treats DefaultArray as an opaque blob and happily identifies a shared dict. This would cause semantic/correctness regressions. We can mitigate known cases like these by helping them implement dataids, but finding them may be challenging.
  - I don't believe this would meaningfully correct any existing false positives.
  - Arrays with #undef fields will 💣, but PkgEval did not uncover any such arrays beyond Tridiagonal.

That's pretty much how this got stuck. Are the above tradeoffs worthwhile? After five years I don't have anything better.

mbauman mentioned this pull request Feb 27, 2018

Improve the codegen of objectid #26236

Closed

mbauman added domain:arrays [a, r, r, a, y, s] performance Must go faster labels Feb 27, 2018

yuyichao reviewed Feb 27, 2018

View reviewed changes

base/abstractarray.jl Show resolved Hide resolved

KristofferC mentioned this pull request Jul 19, 2018

AbstractArrays in broadcast falls back to slow objectid when checking mightalias #28178

Open

mbauman force-pushed the mb/thisshouldhaveoccurredtome branch from 2359f32 to eee1b77 Compare July 20, 2018 18:22

mbauman commented Aug 22, 2018

View reviewed changes

base/abstractarray.jl Show resolved Hide resolved

This comment has been minimized.

Sign in to view

martinholters added 3 commits August 7, 2023 13:29

Provide custom dataids for Tridiagonal

77f7a92

Further refine fallback dataids

ae2d5bc

Add custom implementation of unaliascopy(::LogicalIndex)

ef1d351

mbauman force-pushed the mb/thisshouldhaveoccurredtome branch from 1c153b2 to c9b5705 Compare August 7, 2023 17:30

mbauman added 2 commits August 7, 2023 14:09

add dataids(::AbstractDictionary)

83d9a1c

update docstring

67348e0

mbauman force-pushed the mb/thisshouldhaveoccurredtome branch from c9b5705 to 67348e0 Compare August 7, 2023 18:09

more careful dict handling

5040206

only use pointer_from_objref for Dict itself; otherwise use objectid as some AbstractDicts are immutable

mbauman mentioned this pull request Aug 8, 2023

Copying a 1-element "empty" gcmvector errors JuliaClimate/MeshArrays.jl#103

Open

remove dataids(::Dict) micro-optimization

4832890

implement unaliascopy(A::PermutedDimsArray)

1ba7f18

mbauman added 2 commits August 9, 2023 14:56

remove dataids(::AbstractDict)

ef31a5f

slightly improved dataids docstring

c5d6f31

mbauman requested review from martinholters and sostock August 9, 2023 19:09

rafaqz reviewed Oct 5, 2023

View reviewed changes

vtjnash mentioned this pull request Nov 26, 2023

Avoid unalias() for matching Array{T} when broadcasting #52286

Closed

A better default dataids(::AbstractArray) #26237

Are you sure you want to change the base?

A better default dataids(::AbstractArray) #26237

Conversation

mbauman commented Feb 27, 2018 • edited Loading

martinholters commented Jul 19, 2018

vtjnash commented Jul 19, 2018

martinholters commented Jul 20, 2018

mbauman commented Jul 20, 2018

vtjnash commented Jul 20, 2018

mbauman commented Jul 20, 2018

martinholters commented Jul 23, 2018

martinholters commented Aug 22, 2018

martinholters commented Aug 22, 2018

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

martinholters commented Sep 10, 2018

nanosoldier commented Sep 10, 2018

mbauman commented Sep 10, 2018

StefanKarpinski commented Sep 10, 2018

martinholters commented Sep 11, 2018

mbauman commented Aug 7, 2023 • edited Loading

nanosoldier commented Aug 8, 2023

mbauman commented Aug 8, 2023 • edited Loading

nanosoldier commented Aug 9, 2023

maleadt commented Aug 9, 2023

mbauman commented Aug 9, 2023

nanosoldier commented Aug 9, 2023

mbauman commented Aug 9, 2023 • edited Loading

mbauman commented Aug 9, 2023 • edited Loading

mbauman commented Aug 9, 2023

nanosoldier commented Aug 10, 2023

rafaqz Oct 5, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rafaqz commented Oct 5, 2023 • edited Loading

mbauman commented Oct 5, 2023 • edited Loading

mbauman commented Feb 27, 2018 •

edited

Loading

mbauman commented Aug 7, 2023 •

edited

Loading

mbauman commented Aug 8, 2023 •

edited

Loading

mbauman commented Aug 9, 2023 •

edited

Loading

mbauman commented Aug 9, 2023 •

edited

Loading

rafaqz Oct 5, 2023 •

edited

Loading

rafaqz commented Oct 5, 2023 •

edited

Loading

mbauman commented Oct 5, 2023 •

edited

Loading