
Added some methods to help support LoopVectorization.jl #61

Merged (53 commits) on Sep 19, 2020

Conversation

chriselrod
Collaborator

The methods are:

  • can_avx(f), a function whitelist for @avx support: it returns whether f is known to be safe to call inside an @avx loop.
  • is_cpu_column_major. The name should be self-explanatory; figured it'd be best to go with something unambiguous.
  • stridelayout. Provides a description of the memory layout.

Tagging @LaurentPlagne, as we discussed the intention of stridelayout earlier.
stridelayout is meant to decouple the axis that is good for SIMD (indicated by contig) from cache-friendliness (lower stride ranks). An additional goal is to decouple these from the array's interface, so that we can easily optimize the memory layouts of our arrays without having to touch any of the code.
This is facilitated by LoopVectorization, which will adjust the order of loops as necessary, making our code generic yet still optimal with respect to memory layout.

If you think of a Vector{ComplexF64} as a 2xN Matrix{Float64}, then a StructOfArrays is essentially just a transpose, to allow you to interact with the array as before even though the memory layout is now essentially a Nx2 Matrix{Float64}, which should generally be much friendlier to SIMD.
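
As a rough plain-Base sketch of the two layouts (using reinterpret and permutedims purely for illustration, not an actual StructOfArrays type):

x = rand(ComplexF64, 8);
aos = reshape(reinterpret(Float64, x), 2, :)  # 2×8: re/im interleaved, the 2xN layout
soa = permutedims(aos)                        # 8×2 copy: all re, then all im, i.e. the Nx2 layout
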
However, the memory accesses for the real and imaginary parts may now be far apart. While that's unlikely to be a problem in this example, the same isn't necessarily true for more general transformations of large arbitrary-dimensional arrays.
In this example, contig, batch = 2, 4 yields a memory layout of 4 x 2 x (N/4). That is, with the AVX instruction set, we could easily SIMD-load 4 contiguous real or imaginary elements at a time, and the subsequent loads would still be contiguous and thus likely to trigger the DCU hardware prefetcher, even in more complicated examples.
As another example, for AVX matrix multiplication A * B with contiguous-access packing, our m_c x k_c packed blocks of A would be reshaped into 8 x k_c x (m_c/8). That is, it would correspond to stridelayout(packed_A) == (1,8,(2,1)): the m_c axis is contiguous with batches of size 8, but has a higher stride rank than the k_c axis.

Being able to simply try out a bunch of different memory layouts without needing to change the code for correctness (thanks to API-memory layout decoupling) or performance (thanks to LoopVectorization's loop-reordering) should make this critical aspect of performance optimization much simpler, and even open the door to automation.

README.md Outdated
@@ -112,6 +112,21 @@ If `step` of instances of type `T` are known at compile time, return that step.
Otherwise, returns `nothing`. For example, `known_step(UnitRange{Int})` returns
`one(Int)`.

## is_cpu_column_major(::Type{T})
Member

why cpu?

Member

Do we also want to consider accommodating other memory layouts that have been used in improving performance of array operations like these?

Collaborator Author

@ChrisRackauckas
"cpu" because I wanted it to be false for GPUArrays.
Given that AbstractDeviceArrays <: DenseArray in order to opt into the StrideArray interface (even though LinearAlgebra routines presumably don't work on them), I wanted to make it clear that they shouldn't opt into this.

@Tokazama
The stridelayout function can efficiently handle row-major layouts, or increasing vs. decreasing strides, e.g.:

julia> using ArrayInterface: stridelayout

julia> is_row_major(x) = stridelayout(x) === (2,1,(2,1))
 is_row_major (generic function with 1 method)

julia> B = rand(3,4);

julia> @code_typed is_row_major(B')
CodeInfo(
1 ─     return true
) => Bool

julia> @code_typed is_row_major(B)
CodeInfo(
1 ─     return false
) => Bool

But perhaps I should add density. Knowing that certain dims are dense could potentially allow subsets of loops to undergo CartesianIndexing -> LinearIndexing style loop fusion, even if on the whole the IndexStyle is IndexCartesian().
Or, as a correction for

julia> IndexStyle(typeof(B))
 IndexLinear()

julia> IndexStyle(typeof(B'))
 IndexCartesian()

Maybe all of the arrays involved are transposed (and thus IndexCartesian()), but if they're still dense, we could use linear indexing anyway.
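As a small sketch of that idea (a hypothetical helper, not part of this PR): when we know the wrapper is dense, we can traverse the parent's memory linearly even though the wrapper itself is IndexCartesian().

using LinearAlgebra

function sum_dense(At::Union{Adjoint,Transpose})
    A = parent(At)                    # same dense memory, but IndexLinear
    s = zero(eltype(A))
    @inbounds for i in eachindex(A)   # linear traversal; visit order differs from At, which is fine for a sum
        s += A[i]
    end
    return s
end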

Member

Yeah, I was specifically thinking of dense arrays but didn't want to chance overlooking something else. Using IndexStyle to accomplish this is pretty clever.

Collaborator Author

I added density info on a per-stride basis.

julia> A = rand(3,4,5);

julia> stridelayout(PermutedDimsArray(A,(3,1,2)))
 (2, 1, (3, 1, 2), (true, true, true))

julia> stridelayout(@view(PermutedDimsArray(A,(3,1,2))[2,1:2,:]))
 (1, 1, (1, 2), (true, false))

julia> stridelayout(@view(PermutedDimsArray(A,(3,1,2))[2:3,1:2,:]))
 (2, 1, (3, 1, 2), (false, false, true))

Unfortunately, we can't tell from type info whether a slice of an array was partial or complete:

julia> s2 = Base.Slice(Base.OneTo(2))
 Base.Slice(Base.OneTo(2))

julia> view(A,s2,s2,s2)
 2×2×2 view(::Array{Float64,3}, :, :, :) with eltype Float64:
[:, :, 1] =
 0.329658  0.543774
 0.350255  0.0817058

[:, :, 2] =
 0.310211  0.126501
 0.587191  0.884039

as Base.Slice{Base.OneTo{Int64}} is the type of a : slice. This means we have to be very conservative with base SubArrays, but packages (like PaddedMatrices.jl) might want to provide more compile time info for their types.
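To make the ambiguity concrete (continuing the session above), the partial slice has exactly the same type as a full : view:

julia> typeof(view(A, s2, s2, s2)) === typeof(view(A, :, :, :))
true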

The finer granularity would allow for the partial linear-vs-cartesian indexing.
Other reasons to have this information:

  • using linear indexing to decide what to do for something like a vec implementation is a little goofy.
  • LoopVectorization often spills scalar integer registers (i.e., r* registers). Many of these registers are used for holding array strides and integer multiples of those strides. E.g., an AVX512 column-major matmul kernel for C = A * B wants to hold stride(A,2), 3stride(A,2), 5stride(A,2), stride(C,2), 3stride(C,2), and 5stride(C,2). So if we knew that stride(A,2) == stride(C,2), we could free 3 registers, save a couple of calculations, and avoid a bunch of integer loads/stores from the stack thanks to fewer spills. I'm not sure how to implement this optimization, or how to let a user guarantee stride equality, given that code like the following is 100% valid:
A = rand(4,5);
C = rand(40,50);
for m in axes(A,1), n in axes(A,2)
    C[m,n] = 2A[m,n]
end

So knowing that A and C are both dense still isn't enough to guarantee that the strides are equal (here strides(A) == (1, 4) while strides(C) == (1, 40)).

Not particularly relevant to this issue, but I wonder if there'd be a good way to communicate size equality. Users will often check for equal sizes before a loop, and eachindex will guarantee this as well:

for I in eachindex(A,C) # checks for equivalent size
   
end

Because eachindex isn't always appropriate (e.g., when you don't have exact correspondence between the axes of the arrays), I think it'd be good to have something like an eachindexalongaxis:

function mymatmulkernel!(C, A, B)
    @avx for m in eachindexalongaxis((C,A),(1,1)), n in eachindexalongaxis((C,B),(2,2)), k in eachindexalongaxis((A,B),(2,1))
        C[m,n] += A[m,k] * B[k,n]
    end
end

where it checks sizes. eachindexalongaxis or each_index_along_axis doesn't really roll off the tongue, and I don't know where such a function should live.
I'd also have to think a little more about how to use it and actually pass the dimension-equality information (to be combined with density guarantees) to deduce stride equality. The hacky way would be to just pattern-match eachindex and eachindexalongaxis in the macro, handling dimension equality through the expression, while handling density info based on types.
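For reference, a minimal sketch of the size-checking behavior (the name matches the usage above; the implementation is only illustrative):

function eachindexalongaxis(arrays::Tuple, dims::Tuple)
    ax = axes(first(arrays), first(dims))
    for (A, d) in zip(arrays, dims)
        axes(A, d) == ax || throw(DimensionMismatch("arrays have mismatched axes along the given dims"))
    end
    return ax
end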

Member

I'll make a separate issue about the axes-specific indexing.

Member

The only thing I have left to say about this one is that you may want to include some explanation or a link so people understand the "cpu" part. It's obviously not a huge deal, but I anticipate it becoming a common question that's irksome to continually explain.

Collaborator Author

How about a comment simply saying that it should return false for any sort of GPUArray?

Member

That would work.

README.md Outdated
The elements of the tuple include
- `contig`: The axis with contiguous elements. `contig == -1` indicates no axis is contiguous. `striderank[contig]` does not necessarily equal `1`.
- `batch`: indicates the number of contiguous elements. That is, if `batch == 16`, then axis `contig` will contain batches of 16 contiguous elements interleaved with axis `findfirst(isone.(striderank))`.
- `striderank` indicates the rank of the given stride with respect to the others. If for `A::T` we have `striderank[i] > striderank[j]`, then `stride(A,i) > stride(A,j)`.
Member

add can_avx?

@chriselrod
Collaborator Author

This would probably be good to merge, but I am currently rewriting VectorizationBase and SIMDPirates (combining them into one library, just VectorizationBase) and the parts of LoopVectorization that will need to change to properly support this and a few other modifications.
I've also started rewriting PaddedMatrices, which will take advantage of these features and allow users to easily change the underlying memory layout.

They'll all support/make use of ArrayInterface.
I figured I'd leave this open until I'm closer to completion to give others the chance to comment, and myself the chance to realize that maybe something more/else is needed.

Once all that is done, I'll finish the fast convolution example.

Another change I'm working on (that contributes a lot to the overall code-churn) is representing unrolled vector operations with a type, rather than successive expressions. This will have a few advantages, e.g.

  1. It'll interleave the operations within called functions. Many called functions, e.g. tanh, are composed of a long chain of operations. Instead of executing 4 tanh calls back to back, interleaving the individual operations of those 4 calls allows for better out-of-order execution, and that interleaving would happen automatically with the unrolled type.
  2. By applying operations group-wise, we can define specific overloads to optimize certain cases when possible. For example, replacing certain non-contiguous loads/stores with contiguous ones + sequences of shuffles.
    To get full use out of this, it'd also sometimes start lumping all the distinct loads/stores from an array together in that bundled operation. That way something like this:
@avx for i in 1:length(x) >>> 1
    a = x[2i-1]
    b = x[2i]
    # do something with `a` and `b`
end

which currently results in 2 gathers, 1 each for a and b, can be optimized into 2 regular loads + 2 shuffles.

@timholy
Member

timholy commented Aug 19, 2020

Personally I'd be tempted to orthogonalize traits as much as possible. So something like Device(::Type) = CPU() would mean it should be handled on the CPU, and then the striding would be a separate trait. There's a stab at this kind of interface in https://github.com/timholy/ComputationalResources.jl but it hasn't seen widespread usage. I'd be happy to have anything useful in that package migrate here if folks want.

Back to the orthogonality, for the trait depot you can do something like this:

myalgorithm(A::AbstractArray) = _myalgorithm((Device(A), StridedLayout(A), some_other_trait(A)), A)
_myalgorithm(::Tuple{CPU,Contiguous{1},<:Any}, A) = # implementation for is_cpu_columnmajor
_myalgorithm(::Tuple{CPU,Contiguous{2},<:Any}, A) = # implementation for row-major
...

or if you want to do some things using constant propagation of course there's the lovely tuple-destructuring syntax:

_myalgorithm((dev, layout, other)::Tuple{CPU,<:Any,<:Any}, A) = if layout === (1, (whatever)) ... else ... end

though it would be worth double-checking whether constant-propagation works flawlessly here.

is_cpu_column_major(::Type{T})

Does an Array of type `T` point to column major memory in the cpu's address space?
If `is_cpu_column_major(typeof(A))` return `true` and the element type is a primite
Member

is "primite" supposed to be "primitive"?

"""
stridelayout(::Type{T}) -> (contig, batch, striderank, axesdense)

Descrive the memory layout of a strided container of type `T`. If unknown or not strided, returns `nothing`.
Member

Descrive -> Describe

@Tokazama
Member

@timholy, are you proposing that stride rank, batch, and contiguousness be separate traits, or just device and stride layout?

@timholy
Member

timholy commented Aug 19, 2020

More the general principle that this_and_that traits are probably better split into their component parts. I acknowledge that some distinctions are fuzzy.

@chriselrod
Collaborator Author

chriselrod commented Aug 19, 2020

I definitely agree with replacing is_cpu_column_major with Device.

I'll think more about striderank. Orthogonalization is a good thing.

I'll try to start implementing some functions using this tonight.
Relying on constant prop makes me nervous, and I've been trying to abstain from generated functions as much as is reasonably possible (relying more on @eval in for loops, but I don't really know how this compares in precompilation time, package load time, or latency).

That does mean I'll likely want to make all of these types.
One thing I have been doing, for example, is defining functions that calculate linear indices from Cartesian indices using the following pattern (e.g., a getindex(A, i::Vararg) call uses tdot(i, strides(A)) to take the dot product of the Cartesian indices and the array strides):

tdot(a::Tuple{A}, b::Tuple{B}) where {A,B} = first(a) * first(b)
tdot(a::Tuple, b::Tuple) = first(a) * first(b) + tdot(Base.tail(a), Base.tail(b))

with modifications for each stridedpointer type to avoid promoting the _MM type (used to represent vector loads) for contiguous dimensions into a vector of indices.
When the final load/store happens, a vector of indices will use a slow gather/scatter operation to load from/store to each element of that vector, while the _MM type results in fast contiguous loads/stores.

There were special AbstractStridedPointers representing different memory layouts, e.g. one for column major (avoid promoting _MM => Vec for the first dim only), another for transposed column major (avoid promoting for the second dim), another for SubArrays without a contiguous dim (always promote), etc.

So a goal is to replace all of these with a single StridedPointer type, parameterized by information from ArrayInterface. To add support, one then just has to define Device and stridelayout (or whatever stridelayout is broken up into), aside from the usual methods like strides.
Making these types would help

julia> struct Contiguous{N} end; Base.@pure Contiguous(N::Int) = Contiguous{N}.instance
 Contiguous

julia> totuple(::Contiguous{N}, ::Val{D}) where {N,D} = ntuple(d -> Val(d == N), Val(D))
 totuple (generic function with 1 method)

julia> @code_warntype totuple(Contiguous(2), Val(4))
Variables
  #self#::Core.Const(totuple, false)
  #unused#@_2::Core.Const(Contiguous{2}(), false)
  #unused#@_3::Core.Const(Val{4}(), false)
  #5::var"#5#6"{2}

Body::Tuple{Val{false},Val{true},Val{false},Val{false}}
1 ─ %1 = Core.apply_type(Main.:(var"#5#6"), $(Expr(:static_parameter, 1)))::Core.Const(var"#5#6"{2}, false)
│        (#5 = %new(%1))
│   %3 = #5::Core.Const(var"#5#6"{2}(), false)
│   %4 = Main.Val($(Expr(:static_parameter, 2)))::Core.Const(Val{4}(), false)
│   %5 = Main.ntuple(%3, %4)::Core.Const((Val{false}(), Val{true}(), Val{false}(), Val{false}()), false)
└──      return %5

The totuple function works and is inferred.
Thoughts on the reliability of this, vs adding something like @generated or Base.@pure?

Now I'd be able to do something like

tdot(a::Tuple{A}, b::Tuple{B}, ::Tuple{Val{false}}) where {A,B} = first(a) * first(b)
tdot(a::Tuple, b::Tuple, c::Tuple{Val{false},Vararg}) = first(a) * first(b) + tdot(Base.tail(a), Base.tail(b), Base.tail(c))
tdot(a::Tuple{A}, b::Tuple{B}, ::Tuple{Val{true}}) where {A,B} = mul_no_promote_MM(first(a), first(b))
tdot(a::Tuple, b::Tuple, c::Tuple{Val{true},Vararg}) = mul_no_promote_MM(first(a), first(b)) + tdot(Base.tail(a), Base.tail(b), Base.tail(c))

I may separate Contiguous, BatchSize, StrideRank, and Dense as well, depending on feedback and my thoughts as I start implementing these.
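A rough sketch of what those separate traits could look like as singleton types (the names follow the sentence above; the representations are illustrative, not a settled API):

struct Contiguous{N} end    # index of the axis with contiguous elements
struct BatchSize{N} end     # number of contiguous elements per batch
struct StrideRank{R} end    # relative rank of each axis's stride
struct Dense{D} end         # which axes are densely packed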

@Tokazama
Member

I had some quick things I tested when developing the indices stuff to make sure it was flexible enough for what you're doing. I'll share some of it when I get to my computer.

@chriselrod
Collaborator Author

It's a little goofy that Device returns nothing for tuples or SArrays, so I'm open to a better name if there's one that would make this more clear.

What I'm going to use it for is just to know whether or not pointer works.

@Tokazama
Member

Most of what follows was motivated by the comment you had under stride_rank and dense_dims where you said that there needs to be a better way of doing things, but it looks like it may be helpful in other places too. I wonder if what's missing is a simple way to map a dimension argument to the parent dimension and characterize the indices.

If we index something with x, does it result in a single element (i.e., will that dimension be dropped)?

is_element(x) = is_element(typeof(x))
is_element(::Type{T}) where {T<:Number} = true
is_element(::Type{T}) where {T<:CartesianIndex} = true
is_element(::Type{T}) where {T} = false
@generated is_element(::Type{T}) where {T<:Tuple} = (map(is_element, I.parameters)...,)

Generic way to move to the parent dimensions space.

to_parent_dim(::Type{T}, ::Val{dim}) where {T,dim} = to_parent_dim(T, dim)
to_parent_dim(::Type{T}, dim::Integer) where {T<:AbstractArray} = dim
to_parent_dim(::Type{T}, dim::Integer) where {T<:Union{<:Adjoint,<:Transpose}} = (2, 1)[dim]
to_parent_dim(::Type{<:PermutedDimsArray{T,N,I}}, i::Integer) where {T,N,I} = I[i]
function to_parent_dim(::Type{<:SubArray{T,N,A,I}}, dim::Integer) where {T,N,A,I}
    # if a view is created with something that maps to a single value then a dimension
    # is dropped. We add all dimensions dropped up to i
    return i + accumulate(+, is_element(I))[dim]
end

Then stride_rank becomes something like

stride_rank(::Type{T}, ::Val{dim}) where {T,dim} = stride_rank(T, dim)
stride_rank(::Type{T}, dim::Integer) where {T} = nothing
stride_rank(::Type{T}, dim::Integer) where {T<:Array} = i
function stride_rank(::Type{T}, i::Integer) where {T<:AbstractArray}
    if parent_type(T) <: T  # assume no parent
        return nothing
    else
        return stride_rank(parent_type(T), to_parent_dim(T, dim))
    end
end

I'm not sure if this is what you had in mind as an easier/better way for getting at those values, so if it's not helpful feel free to ignore it.

@chriselrod
Collaborator Author

That implementation (with a couple fixes below) is a bit more elegant

is_element(x) = is_element(typeof(x))
is_element(::Type{T}) where {T<:Number} = true
is_element(::Type{T}) where {T<:CartesianIndex} = true
is_element(::Type{T}) where {T} = false
@generated is_element(::Type{T}) where {T<:Tuple} = (map(is_element, T.parameters)...,)

to_parent_dim(::Type{T}, ::Val{dim}) where {T,dim} = to_parent_dim(T, dim)
to_parent_dim(::Type{T}, dim::Integer) where {T<:AbstractArray} = dim
to_parent_dim(::Type{T}, dim::Integer) where {T<:Union{<:Adjoint,<:Transpose}} = (2, 1)[dim]
to_parent_dim(::Type{<:PermutedDimsArray{T,N,I}}, i::Integer) where {T,N,I} = I[i]
function to_parent_dim(::Type{<:SubArray{T,N,A,I}}, dim::Integer) where {T,N,A,I}
    # if a view is created with something that maps to a single value then a dimension
    # is dropped. We add all dimensions dropped up to i
    return dim + accumulate(+, is_element(I))[dim]
end

stride_rank(::Type{T}, ::Val{dim}) where {T,dim} = stride_rank(T, dim)
stride_rank(::Type{T}, dim::Integer) where {T} = nothing
stride_rank(::Type{T}, dim::Integer) where {T<:Array} = dim
function stride_rank(::Type{T}, dim::Integer) where {T<:AbstractArray}
    if parent_type(T) <: T  # assume no parent
        return nothing
    else
        return stride_rank(parent_type(T), to_parent_dim(T, dim))
    end
end

stride_rank(::Type{T}) where {D, T <: AbstractArray{<:Any,D}} = ntuple(d -> stride_rank(T, d), Val{D}())
stride_rank(A) = stride_rank(typeof(A))

The difference in behavior is whether stride ranks should map to the original parents, or whether sort(collect(stride_rank)) == 1:length(stride_rank):

julia> A = rand(2,3,4,5);

julia> stride_rank(@view(PermutedDimsArray(A, (3,2,4,1))[:,1,:,:]))
(3, 4, 1)

julia> ArrayInterface.stride_rank(@view(PermutedDimsArray(A, (3,2,4,1))[:,1,:,:]))
(2, 3, 1)

julia> stride_rank(@view(PermutedDimsArray(@view(PermutedDimsArray(A, (3,2,4,1))[:,1,:,:]), (2,3,1))[:,:,1])')
(1, 4)

julia> ArrayInterface.stride_rank(@view(PermutedDimsArray(@view(PermutedDimsArray(A, (3,2,4,1))[:,1,:,:]), (2,3,1))[:,:,1])')
(1, 2)

While the (1, 4) provides the same information + more, it may be less convenient to use in that it won't immediately give us the order of the ranks in the array view we have.
But maybe that's okay. I could write a function to produce (1,2) from (1,4), but for some of the use cases I had in mind, the (1,4) is probably actually more useful. E.g., for ordering loops, the (1,4) lets us know that the stride across dim 2 is much larger than across dim 1, so it seems like it more faithfully represents the question it's answering...
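
Such a function could be a one-liner, e.g. (hypothetical name, not part of this PR):

rank_to_sortperm(r::NTuple{N,Int}) where {N} = ntuple(i -> sum(r .<= r[i]), Val(N))

rank_to_sortperm((1, 4))     # (1, 2)
rank_to_sortperm((3, 4, 1))  # (2, 3, 1)
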
Or, what if we had stride_rank(A) == (2,1,3) and contiguous_axis(A) == Contiguous{1}()?
This would mean that axis 1 has contiguous batches, but the stride between these batches is larger than the stride for axis 2.
Now if we take B = @view(A[:,2,:]), having stride_rank(B) == (1,2) with contiguous_axis(B) == Contiguous{1}() would be misleading with respect to the actual layout. contiguous_batch_size(B) !== ContiguousBatch{0}() would be a giveaway that this is the case, but it still seems likely to result in bugs.

With all that in mind, I decided I like that behavior better and updated the PR.

@Tokazama
Member

> Should I break consistency and call it offsets?

I'd be fine with that.

> Perhaps it should be ArrayInterface.size and ArrayInterface.strides?

I'd be fine with that. I hope that eventually we can just have size(::StaticArray) -> Tuple{Vararg{<:Static}} and methods that only accept Int will be generalized to accept Static. That way we don't have to specify all of those variations in here. I opened up this issue because ultimately we need a more formal interface for referring to dimensions in Base, and it could allow us to pass Static to methods like axes without redefining it for every array.

@c42f
Member

c42f commented Sep 16, 2020

> I hope that eventually we can just have size(::StaticArray) -> Tuple{Vararg{<:Static}} and methods that only accept Int will be generalized to accept Static.

Agreed this is the right way forward in the longer term. See also JuliaArrays/StaticArrays.jl#807

Speaking of static integers, did you consider using StaticNumbers.StaticInteger?

@chriselrod
Collaborator Author

chriselrod commented Sep 16, 2020

> juliamaster -O3 -e "@time using ArrayInterface"
  0.099558 seconds (126.48 k allocations: 8.313 MiB)

> juliamaster -O3 -e "@time using StaticNumbers"
  0.301221 seconds (635.21 k allocations: 35.458 MiB)

> juliamaster -O3 -e "@time using StaticNumbers, ArrayInterface"
  0.808027 seconds (1.47 M allocations: 82.250 MiB, 1.13% gc time)

> julia -O3 -e "@time using ArrayInterface"
  0.177640 seconds (294.34 k allocations: 16.299 MiB, 2.82% gc time)

> julia -O3 -e "@time using StaticNumbers"
  0.177957 seconds (358.17 k allocations: 20.333 MiB, 2.80% gc time)

> julia -O3 -e "@time using StaticNumbers, ArrayInterface"
  0.925970 seconds (3.28 M allocations: 162.273 MiB, 2.31% gc time)

The load time seems relatively high, although much of it is probably Requires. Yet it's peculiar that it's worse than additive.
ArrayInterface was on the current PR, which includes our simple StaticNumber implementation.

But I wouldn't be opposed to us more broadly getting behind a package like StaticNumbers. If another project is using StaticNumbers, loading a half dozen different StaticNumber implementations would obviously be slower than loading just 1, besides the duplication of effort.

@Tokazama
Member

Additionally, supporting static numbers other than Int was discussed. I think the conclusion was that it would make sense if it added something in a meaningful way. However, the need for anything besides Int seems like a pretty rare situation.

@mateuszbaran
Contributor

That time added by StaticNumbers is largely caused by invalidations. After (breakingly) removing all sources of them I have:

➜  julia git:(master) ./julia -O3 -e "@time using StaticNumbers, ArrayInterface"
  0.260417 seconds (150.04 k allocations: 10.864 MiB)
➜  julia git:(master) ./julia -O3 -e "@time using ArrayInterface" 
  0.191187 seconds (63.91 k allocations: 4.931 MiB)

It doesn't look bad. The constructor of Int64 from static numbers seems to be the biggest problem.

@chriselrod
Collaborator Author

chriselrod commented Sep 16, 2020

There are a couple invalidations in ArrayInterface itself:

julia> using SnoopCompile

julia> invalidations = @snoopr using ArrayInterface;

julia> trees = invalidation_trees(invalidations)
2-element Vector{SnoopCompile.MethodInvalidations}:
 inserting promote_rule(::Type{Bool}, ::Type{S}) where S<:ArrayInterface.Static in ArrayInterface at /home/chriselrod/.julia/dev/ArrayInterface/src/static.jl:34 invalidated:
   backedges: 1: superseding promote_rule(::Type{Bool}, ::Type{T}) where T<:Number in Base at bool.jl:4 with MethodInstance for promote_rule(::Type{Bool}, ::Type{var"#s431"} where var"#s431"<:Integer) (2 children)

 inserting convert(::Type{T}, ::ArrayInterface.Static{N}) where {T<:Number, N} in ArrayInterface at /home/chriselrod/.julia/dev/ArrayInterface/src/static.jl:18 invalidated:
   backedges: 1: superseding convert(::Type{T}, x::Number) where T<:Number in Base at number.jl:7 with MethodInstance for convert(::Type{Int8}, ::Integer) (1 children)
              2: superseding convert(::Type{T}, x::Number) where T<:Number in Base at number.jl:7 with MethodInstance for convert(::Type{UInt8}, ::Integer) (1 children)
              3: superseding convert(::Type{T}, x::Number) where T<:Number in Base at number.jl:7 with MethodInstance for convert(::Type{Int64}, ::Integer) (9 children)
              4: superseding convert(::Type{T}, x::Number) where T<:Number in Base at number.jl:7 with MethodInstance for convert(::Type{UInt64}, ::Integer) (27 children)

Can we do anything about them?

I removed the Val(::Static{N}) definition.

@chriselrod
Collaborator Author

I'd like to set the goal of merging this on Friday.

I'm open to StaticNumbers. While all we need are static integers, it's less code to just do using StaticNumbers: StaticInteger.

@Tokazama
Member

I'm not completely against StaticNumbers.jl (I've even brought it up in the past), but I'm not 100% sold on it. Even if we decide the niche use case for other static numbers is sufficient to justify supporting them, I'm not sure we're ready to handle things here like StaticNumber:StaticNumber:StaticNumber, which are a lot more complicated to properly support.

If we do decide to support all static numbers I think that's going to have to be a separate, large PR. Once you support different types of numbers for ranges that have different step sizes it gets pretty complicated (e.g., safely handling code with smaller byte sizes, handling high precision number types, accounting for numbers with units attached, etc.).

It seems like the most straightforward approach is to just use Static for now, and then if it becomes clear that we need more static number support we can make the change. If methods like ArrayInterface.size and ArrayInterface.static_first are used as intended (generic support for static Int), then most other packages should be agnostic to the change. If we're really worried about it we could even just change Static to StaticInteger so it's a matter of using StaticNumbers: StaticInteger if the time should ever come that we want to depend on StaticNumbers.

@c42f
Member

c42f commented Sep 17, 2020

> It seems like the most straightforward approach is to just use Static for now and then if it becomes clear that we need more static number support we can make the change.

FWIW I think it's completely fine to use Static for now, and I hesitate to depend on StaticNumbers in StaticArrays for similar reasons:

  • We really only need a small subset of stable core functionality — StaticInteger, static() and perhaps a couple of other things.
  • Load time
  • StaticNumbers currently extends StaticArrays via Requires.jl. So disentangling that would be a little work.

Perhaps we should make a StaticNumbersBase.jl to define only the core functionality. @perrutquist what are your thoughts?

@perrutquist

perrutquist commented Sep 17, 2020

I think a StaticNumbersBase.jl package would be a good idea. I've been thinking for a while that I need to split up that package. (It started out as a generalization of Zeros.jl, and hence with real numbers, but the integer use-case is fairly orthogonal to that.)

Maybe even better if we could have a StaticBase.jl which also contains stubs and abstract types for static arrays and iterators? That way it would be easier to define interaction without using Requires.

However, it seems that most of the load time (and post-load slowdown) in StaticNumbers.jl comes from a single method definition: Int(::StaticInteger). Either there is a way to fix that, or load-times will be a problem for any implementation of static integers that can be used to construct an Int. cf: perrutquist/StaticNumbers.jl#11

@chriselrod
Collaborator Author

chriselrod commented Sep 17, 2020

I don't really have any experience here, but this by itself at least doesn't seem to take very long on Julia 1.6:

julia> using SnoopCompile

julia> invalidations = @snoopr begin
       struct Static{N} <: Integer end
       (::Type{T})(::Static{N}) where {T <: Number, N} = T(N)
       end;

julia> invalidation_trees(invalidations)
1-element Vector{SnoopCompile.MethodInvalidations}:
 inserting (::Type{T})(::Static{N}) where {T<:Number, N} in Main at REPL[2]:3 invalidated:
   mt_backedges:  1: signature Tuple{Type{Int64}, Integer} triggered MethodInstance for findnext(::Pkg.Types.var"#22#23"{String}, ::AbstractString, ::Integer) (0 children)
                  2: signature Tuple{Type{Int64}, Integer} triggered MethodInstance for _array_for(::Type{T}, ::Any, ::Base.HasLength) where T (0 children)
                  3: signature Tuple{Type{Int64}, Integer} triggered MethodInstance for to_shape(::Integer) (0 children)
                  4: signature Tuple{Type{Int64}, Integer} triggered MethodInstance for _array_for(::DataType, ::Any, ::Base.HasLength) (0 children)
                  5: signature Tuple{Type{Int64}, Integer} triggered MethodInstance for _similar_for(::Vector{_A} where _A, ::Type, ::Base.Generator{_A, Base.var"#811#813"} where _A, ::Base.HasLength) (0 children)
                  6: signature Tuple{Type{Int64}, Integer} triggered MethodInstance for _similar_for(::Vector{_A} where _A, ::DataType, ::Base.Generator{_A, Base.var"#811#813"} where _A, ::Base.HasLength) (0 children)
                  7: signature Tuple{Type{Int64}, Integer} triggered MethodInstance for _array_for(::Type{Base.RefValue{Any}}, ::Any, ::Base.HasLength) (0 children)
                  8: signature Tuple{Type{UInt16}, Integer} triggered MethodInstance for connect!(::Sockets.TCPSocket, ::Sockets.IPv4, ::Integer) (0 children)
                  9: signature Tuple{Type{UInt16}, Integer} triggered MethodInstance for connect!(::Sockets.TCPSocket, ::Sockets.IPv6, ::Integer) (0 children)
                 10: signature Tuple{Type{UInt32}, Integer} triggered MethodInstance for VersionNumber(::Integer, ::Int64, ::Int64, ::Tuple{}, ::Tuple{}) (1 children)
                 11: signature Tuple{Type{Int64}, Integer} triggered MethodInstance for manage(::Distributed.SSHManager, ::Integer, ::Distributed.WorkerConfig, ::Symbol) (1 children)
                 12: signature Tuple{Type{UInt64}, Integer} triggered MethodInstance for convert(::Type{UInt64}, ::Integer) (7 children)
                 13: signature Tuple{Type{Int64}, Integer} triggered MethodInstance for prevind(::String, ::Integer) (11 children)
                 14: signature Tuple{Type{Int64}, Integer} triggered MethodInstance for full_va_len(::Core.SimpleVector) (784 children)

Although 784 sounds like a lot of children (for full_va_len(::Core.SimpleVector)).

It seems odd to me that

julia> struct Static{N} <: Integer end

julia> @which Int(Static{5}())
ERROR: no unique matching method found for the specified argument types

there are no methods, yet we can't insert one without invalidating others?

I'd be in favor of a StaticBase.

While I like the brevity of Static, we could rename it to StaticInteger so that we could swap the underlying implementation for a dependency on another package without breaking dependents.
(While Static is already on ArrayInterface master, there haven't been any releases featuring it yet.)

@perrutquist

I'm only speculating, but I think the invalidations come from methods that were defined when Int was only defined for a small number of Integer types, so methods that called Int(x::Integer) could be compiled with the assumption that x belonged to that small type union.
I also think that most of the slowdown does not happen immediately when the constructor is defined, but rather in the code that runs after it, as methods are recompiled.

If you're thinking of renaming Static, then maybe a good name would be StaticSigned or StaticInt to signify that the argument is always a Signed or an Int. In retrospect, it was a mistake to create StaticInteger{X}, since it is impossible to dispatch on the type of X, and this is causing me a big headache in trying to deal with unsigned integers. (In a future release, StaticInteger will likely be a type union.)
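To illustrate that limitation with a stand-in type (not StaticNumbers' actual definitions): the parameter X is a value rather than a type, so a subtype bound on it never matches, and signed vs. unsigned can only be distinguished by inspecting the value.

struct MyStaticInteger{X} <: Integer end                   # stand-in, value parameter X

issigned(::MyStaticInteger{X}) where {X<:Signed} = true    # never applicable: X is a value, not a type
issigned(::MyStaticInteger{X}) where {X} = X isa Signed    # forced to inspect the value instead

issigned(MyStaticInteger{1}())     # true
issigned(MyStaticInteger{0x01}())  # false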

@Tokazama
Member

I played with static numbers pretty extensively a while ago. There are all sorts of quirky things that will get you if you're not careful. I recall getting tripped up on ensuring type stability with a certain type of log method.

@chriselrod
Collaborator Author

chriselrod commented Sep 18, 2020

You have to be careful with StaticInts and heterogeneously-typed tuples in general, which ArrayInterface.size and ArrayInterface.static are likely to produce.

@Tokazama
Member

> You have to be careful with StaticInts and heterogeneously-typed tuples in general, which ArrayInterface.size and ArrayInterface.static are likely to produce.

I'm assuming you mean the difference between NTuple{N,<:Integer} and Tuple{Vararg{<:Integer,N}}?

@chriselrod
Collaborator Author

chriselrod commented Sep 18, 2020

I meant

julia> using StaticArrays, ArrayInterface

julia> A = @SMatrix rand(4,3);

julia> @code_warntype getindex(ArrayInterface.size(A), 1)
Variables
  #self#::Core.Compiler.Const(getindex, false)
  t::Core.Compiler.Const((ArrayInterface.StaticInt{4}(), ArrayInterface.StaticInt{3}()), false)
  i::Int64

Body::Union{ArrayInterface.StaticInt{4}, ArrayInterface.StaticInt{3}}
1 ─      nothing
│   %2 = Base.getfield(t, i, $(Expr(:boundscheck)))::Union{ArrayInterface.StaticInt{4}, ArrayInterface.StaticInt{3}}
└──      return %2

julia> @code_warntype getindex(ArrayInterface.size(A), ArrayInterface.StaticInt(1))
Variables
  #self#::Core.Compiler.Const(getindex, false)
  t::Core.Compiler.Const((ArrayInterface.StaticInt{4}(), ArrayInterface.StaticInt{3}()), false)
  i::Core.Compiler.Const(ArrayInterface.StaticInt{1}(), false)

Body::ArrayInterface.StaticInt{4}
1 ─      nothing
│   %2 = Base.convert(Base.Int, i)::Core.Compiler.Const(1, false)
│   %3 = Base.getfield(t, %2, $(Expr(:boundscheck)))::Core.Compiler.Const(ArrayInterface.StaticInt{4}(), false)
└──      return %3

You need to make sure the indices into these tuples const-prop so that the result can be inferred.
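Concretely (continuing the session above), this is the difference between indexing with a runtime Int and with a StaticInt; only the latter carries the index in the type, so it infers regardless of constant propagation:

julia> sz = ArrayInterface.size(A);          # (StaticInt{4}(), StaticInt{3}())

julia> f_runtime(t, i) = t[i];               # infers Union{StaticInt{4}, StaticInt{3}} unless i const-props

julia> f_static(t) = t[ArrayInterface.StaticInt(1)];  # infers StaticInt{4}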

chriselrod merged commit 1e1f4a5 into JuliaArrays:master on Sep 19, 2020
Tokazama mentioned this pull request on Jan 6, 2022