Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

eliminate the StoredArray/AbstractArray distinction #6212

Closed
stevengj opened this issue Mar 19, 2014 · 63 comments · Fixed by #6258
Closed

eliminate the StoredArray/AbstractArray distinction #6212

stevengj opened this issue Mar 19, 2014 · 63 comments · Fixed by #6258
Labels
needs decision A decision on this change is needed
Milestone

Comments

@stevengj
Copy link
Member

I don't see what useful purpose is served by having this distinction in the type system, introduced by #987 (and see also the discussion in #5810). According to the description in the manual:

The AbstractArray type includes anything vaguely array-like, and implementations of it might be quite different from conventional arrays. For example, elements might be computed on request rather than stored. Or, it might not be possible to assign or access every array location.

StoredArray is an abstract subtype of AbstractArray intended to include all arrays that behave like memories: all elements are independent, can be accessed, and (for mutable arrays) all elements can be assigned.

the main difference between a StoredArray and an AbstractArray is that in the latter, the elements are computed or requested on the fly whereas in the former they are "stored". But this seems like an implementation detail.... why do we want to have this distinction in the type system? Under what circumstances would you dispatch on the difference between computed and stored values?

The other difference is that, in an AbstractArray, not all elements may be accessible. (But in this case, shouldn't it be an Associative type?) If you can't even access all elements, what non-trivial array-like methods could possibly operate on an AbstractArray? And if you can't write a non-trivial method for it, why even have the type at all?

My suggestion would be to just drop this distinction, and make StoredArray === AbstractArray. And suggest that every concrete AbstractArray type should, at minimum, provide size and getindex (and setindex! if it is mutable), including a single-index variant.

@jiahao
Copy link
Member

jiahao commented Mar 19, 2014

See my comment in #5810. There is a useful distinction to draw between arrays whose matrix elements are explicitly stored (StoredArrays) vs objects that can behave as matrices, like QRCompactWY or any of the other compact QR objects, where the matrix elements are not stored explicitly and so calling getindex on them would incur a nontrivial amount of computation. O(1) vs O(N^2) seems like a useful distinction.

@lindahua
Copy link
Contributor

Here are two questions:

  • Is it useful to make type distinction based on the computational cost of getindex?
  • If it is the case, what are the criteria to define nontrivial amount of computation? Perhaps, a natural choice is O(1). Under this criterion, then sparse arrays should not be considered as StoredArray (which definitely require more than O(1) to access an element).

@jiahao
Copy link
Member

jiahao commented Mar 19, 2014

The crux of the matter, it seems, is whether the type system should only be allowed to delineate what is permitted (a descriptive approach), or whether it should encode preferences as to what should or should not be done (a prescriptive approach). In the former approach, cost doesn't really enter into the design decision. In the latter, structuring the type hierarchy in a particular way can be suggestive of underlying computational complexity. Should the design decision be, we'll be general enough to let you do it automatically but you just have to know that you have no guarantee on cost in practice, or that we know that certain operations on certain types will be terribly slow, so if you want to write this method you should know what you are getting into?

@StefanKarpinski
Copy link
Sponsor Member

I generally favor not providing default implementations that we know are going to have terrible performance. This is a bit of a judgement call, but I feel that it more often than not just leads to people trying something and filing issues about the performance.

@stevengj
Copy link
Member Author

If getindex is not reasonably fast, then I don't think it should be an array. "Array" in computer science implies random access.

On a more practical note, if you can't assume that getindex is reasonably fast, how could you implement any nontrivial method for AbstractArrays? And if you can't write a nontrivial method for it, why bother having that type in the first place?

@jiahao
Copy link
Member

jiahao commented Mar 19, 2014

Array" in computer science implies random access.

But isn't the point that AbstractArrays are not Arrays, but that they are some sort of generalization thereof?

if you can't assume that getindex is reasonably fast, how could you implement any nontrivial method for AbstractArrays?

You can define *(QRCompactWY, Matrix) quite easily without ever calling getindex on QRCompactWY to get its matrix elements explicitly, because it is faster to work out the effects of elementary Householder reflectors on the target matrix.

@stevengj
Copy link
Member Author

@jiahao, an Array is a specific realization of a random-access array data structure, but there are many others. Encapsulating those others should be the point of AbstractArray.

The point is to define a useful generalization, not just "some sort" of generalization. It has to be a generalization that is still specific enough that I can write a non-trivial f(A::AbstractArray). It has to be a distinction that it is useful to dispatch on, e.g. my f(A::AbstractArray) and f(A::StoredArray) and f(itr) are different.

Whether you can define a *(QRCompactWY, Matrix) method is irrelevant; the question is, what nontrivial AbstractMatrix methods can you pass QRCompactWY to, not what methods you can write specifically for QRCompactWY. Furthermore, we already had the discussion concluding that not all linear operators can be AbstractArray subtypes, nor do they need to be, any more than all AbstractArray subtypes represent linear operators. And because of the lack of multiple inheritance, it is not useful to define a LinearOperator abstract type.

@StefanKarpinski
Copy link
Sponsor Member

Technically, you can define getindex for any linear operator L – to get the L[j,k] just apply the operator to e_k and see what's at index j in the result. Of course, that requires knowing what size e_k should be. But if L is really a linear operator, then it has a well defined domain and codomain. When we call the I thing a "linear operator", we're using the term somewhat loosely – it's not really a linear operator but a whole family of linear operators.

@jiahao
Copy link
Member

jiahao commented Mar 19, 2014

an Array is a specific realization of a random-access data structure, but there are many others. Encapsulating those others should be the point of AbstractArray.

You could make exactly the same arguments with StoredArray. You clearly have your point of view with regard to linear operators, and I don't really want to get into the abstract black box function issue right now, but QRCompactWY is exactly the kind of thing that refuses to conform to being a pure linear operator or StoredArray. It is a convenient representation of a particular kind of orthogonal/unitary matrix. You can think of it as a representation of a linear operator, but that is not really the point. QRCompactWY can be interconverted with its Matrix exactly without losing anything. It is a pure performance hack for a very particular kind of matrix.

@stevengj
Copy link
Member Author

@StefanKarpinski, technically, you cannot define getindex for any linear operator, because linear operators might be infinite-dimensional in a non-countable way. (Imagine a derivative operator for a chebfun-like package.) Furthermore, even for finite-dimensional linear operators, when they are defined implicitly it is often simply not useful to define a getindex function, because you never in practice access the individual elements. (e.g. my FFT plans in #6193 are finite-dimensional linear operators, but they are not AbstractMatrix subtypes, in part because it would be a PITA to compute matrix elements for arbitrary multidimensional transforms, and would serve no useful purpose.)

Furthermore, this is not just "my" point of view. It is the point of view already adopted in Base. For example, our Factorization objects are clearly finite-dimensional linear operators, but they are not AbstractMatrix subtypes.

@jiahao, you can only make the same arguments about StoredArray if you don't assume it is random access. My suggestion is to define AbstractArray == StoredArray, require size and getindex, and suggest that getindex should be fast enough to treat it as a random-access data structure.

For example, random access allows us to distinguish between sum(x::AbstractArray) and sum(itr), because the former can use pairwise summation. If we accept your definition, however, then our current sum(x::AbstractArray) is potentially a total disaster (imagine a LinkedList <: AbstractVector type). This should scare you. (With my definition, it is the programmer's fault if they ignore the manual and make LinkedList <: AbstractVector. With your definition, it is our fault that we provided a disastrous sum implementation.)

@StefanKarpinski
Copy link
Sponsor Member

I suspect that we won't be implementing linear operators on uncountable spaces any time soon and as long as you can enumerate your basis, that approach works. I had two points:

  1. I isn't really a linear operator since if it were, it would have a well-defined size.
  2. You can define getindex even for black-box transformations, it's just a matter of cost.

I'm not really arguing for or against elimination – I don't know – I'm just making some points.

@stevengj
Copy link
Member Author

And you still haven't answered my question: with your suggested definitions, what non-trivial f(::AbstractArray) function can I possibly write? Especially if you want to distinguish from the necessarily duck-typed f(iterable) or f(linearoperator)?

(With regards to QRCompactWY, I'm suggesting that should not be a subtype of AbstractMatrix at all, any more than QR <: Factorization <: Any is.)

@lindahua
Copy link
Contributor

If getindex is not reasonably fast, then I don't think it should be an array. "Array" in computer science implies random access.

@stevengj Do you think sparse matrices as arrays? Doing random access over a sparse matrix/vector is not less worse than doing random access over a linked list.

@stevengj
Copy link
Member Author

Lookup of an arbitrary (i,j) index on an MxN CSC sparse matrix with at most K elements per column can be done in O(log K) time, which is close enough to random access, and is far better than the O(N) of a linked list of N elements.

But if you have a sparse matrix in IJ format with no ordering, e.g. just a linked list of (i,j,value), then no, I would not think of that as an array data structure. Not all arrays are matrices, and not all matrices are stored in arrays.

@StefanKarpinski
Copy link
Sponsor Member

So maybe logarithmic or better (with low constant factor) is the right definition? That means that, e.g. binary search is admissible. There's not much besides direct indexing that is actually O(1), so if we required O(1), then we would almost necessarily be limiting ourselves to plain old dense arrays.

@stevengj
Copy link
Member Author

Yes, I would say that we should recommend Õ(1) access for AbstractArray types, where Õ means ignoring polylog factors.

@lindahua
Copy link
Contributor

For many array (or array-like objects), even getindex is costly, it is still able to perform a lot of useful operations.

Take sum, for example. We can write:

function sum(a::DenseArray)
    # ... default implementation ...
end

function sum(a::AbstractArray)
    tmp = copy!(similar(a), a)
    return sum(tmp)
end

Even if getindex(a, i, j) is costly, it is often the case that copying the array takes amortized O(1) for each element. In such cases, a is still usable as an array in many useful applications.

@stevengj
Copy link
Member Author

@lindahua, it seems like you have an infinite dispatch loop, since copy!(similar(a), a) is likely to return something with the same type as a (the documentation for similar is Create an uninitialized array of the same type as the given array). A more correct implementation would seem to be:

sum(a) = sum(collect(a))

but in this case there is no point in restricting it to AbstractArray at all; you might as well support any iterable collection.

@lindahua
Copy link
Contributor

Maybe I should write it this way to make the example more clear

function sum{T}(a::AbstractArray{T}, region)
    tmp = copy!(Array(T, size(a)), a)
    return sum(tmp, region)
end

I don't think collect would do the job when dimensions are involved. (collect only works for 1D case).

@stevengj
Copy link
Member Author

@lindahua, your function would still be O(N^2) for a linked list.

I'm fine if you want to set the bar for AbstractArray lower, to Õ(1) amortized time, rather than Õ(1) worst-case time. That seems like the bare minimum for useful algorithms.

(But for Õ(1) amortized access, I don't see the point in making a copy. Why not just call sum directly?)

@lindahua
Copy link
Contributor

I suggest the following requirement for AbstractArray and StoredArray:

AbstractArray requires:

  • size(a)
  • getindex(a, i...), the operation can be very costly
  • copy!(dst::Array, a), with amortized cost Õ(1) per element.

StoredArray additionally requires:

  • getindex(a, i...) to be efficient enough, perhaps Õ(1)?

@lindahua
Copy link
Contributor

My sum function takes O(n) for linked list if you do the copy! in a correct way.

The copy! can be specialized for each special kind of arrays. For linked list, it can be just traversing from head to tail.

@stevengj
Copy link
Member Author

This seems like a very fine distinction. Why not just require Õ(1) amortized access per element, assuming many elements are accessed?

Otherwise, pretty much all non-trivial AbstractVector functions will have to call copy! first, which seems obnoxious. If the user is willing to do that, I would prefer to make the choice explicit: they can just call copy! themselves. It seems rather unexpected and undesirable that sum on a large data structure will silently make a copy.

If you want a LinkedList type, you should just make it an iterable collection. That's what iterables are for. Why do we need an AbstractVector definition that is so broad that it includes LinkedList, and as a consequence every non-trivial AbstractVector operation involves making a copy first?

@lindahua
Copy link
Contributor

My point is that something is qualified as an instance of AbstractArray, if it has a size, an a way (efficient enough) is provided to convert/copy it to an ordinary array.

The threshold can be discussed. The distinction between AbstractArray and StoredArray is that the latter one can support efficient random access, while the former only requires that it can be converted to an array efficient enough (which subsumes a much broader family of things, while is enough to support many useful operations for arrays with reasonable performance).

@lindahua
Copy link
Contributor

In this way, we can define sum and many other useful functions through dispatch based on such distinctions:

function sum(a::StoredArray)
     # using getindex
end

function sum(a::AbstractArray)
    # first copy/convert it to an ordinary array, and then do sum
end

@stevengj
Copy link
Member Author

@lindahua, I agree that you can define an AbstractArray/StoredArray distinction in that way. I'm not convinced that we should.

It seems very unpleasant to me to have an "array" type such that all non-trivial methods on it first make a copy. As a user, I wouldn't expect that sum-ming a large data structure would make a copy of it first.

At the very least, we shouldn't call it an AbstractArray in that case. We should call it something else, maybe CopyBeforeUsing :-). Calling it AbstractArray is just a trap for the unwary if indexing is not fast (at least on average).

@lindahua
Copy link
Contributor

That's just a default fallback. One can always specialize on its own special array types if there exist specialized ways that are more efficient.

@stevengj
Copy link
Member Author

@lindahua, I agree, but that is not the documented meaning of DenseArray right now. The manual only says:

DenseArray is a further abstract subtype of StoredArray. Arrays of this type have storage for every possible index, and provide uniform access performance for all elements.

We should update the documentation to be something like your definition: subtypes must define strides and convert(Ptr{T}, a), pointing to a memory layout corresponding to the strides.

Or we could have a slightly weaker definition: at least require strides to give the conversion between linear indices and multi-indices, and recommend but don't require a pointer conversion with the corresponding memory layout. e.g. my pure-Julia multi-dimensional FFT implementation requires strides to convert between linear and multi-indices (in order to call the 1d FFT algorithm along individual dimensions), but doesn't require a C pointer.

But probably your stronger definition, specifying the memory layout, is best. Otherwise, there is no way to know the memory layout that is pointed to by pointer(a), and calling a BLAS (or FFTW) function on a StridedArray (DenseArray and SubArrays thereof) is extremely unsafe.

@JeffBezanson
Copy link
Sponsor Member

Ok. That definition is a fully compatible refinement of the current description of DenseArray.
I certainly think that if a DenseArray implements pointer then the memory layout is defined in this way. There may be dense arrays that don't provide native pointers, but I agree that feels like a fairly abstract notion.

@lindahua
Copy link
Contributor

Think about it more, I agree that the distinction between AbstractArray and StoredArray is not clear enough to justify two distinct types. Such confusion would probably cause inconsistency and misuse in practice.

A reasonable way would be to maintain two levels of abstract array types in Base: AbstractArray at the top level, DenseArray and AbstractSparseArray at the second level.

  • Minimum requirements for AbstractArray: size and getindex. In particular, we recommend that getindex should be efficient enough (with time complexity Õ(1)). Generic methods such as eltype, ndims, length, start, next, done can then be defined based on this requirement.
  • Types that represent factorization should be under a Factorization abstract type. We may provide a function array (or implement convert) to convert them to ordinary arrays. I don't see much benefit of having them under AbstractArray. For example, a factorization shouldn't be shown like an array, instead, we should implement a show method for Factorization that shows the factorized parts (instead of the product).
  • Subtypes of DenseArrayshould additionally provide two methods pointer and strides. It should be required that the entire array resides in memory and the memory layout conforms to the specification given by pointer and strides.
  • Subtypes of AbstractSparseArray should additionally provide two methods nfilled and findnz. The findnz method should return (I, J, V) in O(nfilled(a)) time. Methods that rely only on non-zero elements, e.g. matrix product and sum, should be specialized for sparse array types.

Additional notes:

  • I suggest AbstractSparseArray to be renamed to SparseArray (to be consistent with the naming of DenseArray). We don't have a concrete type called SparseArray, so this renaming should be straightforward.
  • StridedArray might be removed in near future (say 0.4) when array views land. When this is done, we will have strided array view types to be subtypes of DenseArray that provide views into dense arrays. We will also have more generic view types to be subtypes of AbstractArray to provide views into generic arrays.

@JeffBezanson
Copy link
Sponsor Member

I like it.
SparseArray sounds like a sparse version of Array --- a concrete, n-d array type but with sparse storage.

@lindahua
Copy link
Contributor

Or perhaps the other way round: DenseArray ==> AbstractDenseArray.

Anyway, this is a minor issue. I am fine if we keep the current names as they are. I think we should make the decision sooner better than later (preferably before 0.3 release). Then we have a clear guideline to clean up things.

@carlobaldassi
Copy link
Member

Subtypes of DenseArray should additionally provide two methods pointer and strides. It should be required that the entire array resides in memory and the memory layout conforms to the specification given by pointer and strides.

Looking at concrete examples currently in Base:

julia> subtypes(DenseArray)
6-element Array{Any,1}:
 Array{FieldValue,1}
 Array{T,N}         
 BitArray{1}        
 BitArray{2}        
 BitArray{N}        
 SharedArray{T,N} 

I wonder how the pointer and strides specification would play out for BitArrays. There's no way you can get a pointer to the storage and increment it to get the different elements, you get chunks of at least 8 bits at a time.

@lindahua
Copy link
Contributor

The point of DenseArray is that it guarantees the strided memory layout, then a lot of useful things can then be done thereon. For example,

  • call C-functions (e.g. BLAS & LAPACK) that accepts strided memory layout
  • construct & use strided array views

BitArrays have a very special way to store elements, which clearly do not fall in the definition of DenseArray above, and functions that rely on regular strided layouts cannot be apply to them in general.

Hence, I think the solution is pretty straightforward -- BitArray should be a subtype of AbstractArray{Bool}, as there is little benefit to categorize it under DenseArray.

@carlobaldassi
Copy link
Member

One of the original goals of revising the array hierarchy was to avoid code duplication between Arrays and BitArrays. It is true however that at the moment no methods use DenseArray in the signature. Currently, the definition of StridedArray uses DenseArray and therefore BitArray <: StridedArray, but there should be no harm in separating them again from what I can tell (I just noticed by grepping through Base that there are some methods which still use Union(StridedArray,BitArray) in the signature and which could have been simplified).

As far as the name goes, on the other hand, it's hard to think of something denser than a BitArray...

@lindahua
Copy link
Contributor

perhaps call it RegularArray or something?

To me, the choice of these names is a relatively minor issue. It is more important to give clear meaning/semantics to these types, and stick to them in practice.

@stevengj stevengj added this to the 0.3 milestone Mar 20, 2014
@stevengj
Copy link
Member Author

Marking this for 0.3, since we really shouldn't release 0.3 with all the new StoredArray stuff just to change things again in the next release.

@stevengj
Copy link
Member Author

+1 for RegularArray. Overall, @lindahua's summary sounds good to me.

@lindahua
Copy link
Contributor

My search through the codebase shows that Union(StridedArray,BitArray) only appears here:
https://github.com/JuliaLang/julia/blob/master/base/bitarray.jl#L1490-L1533

These codes use nothing more than getindex and length, and should perfectly apply to AbstractArray in general.

@Jutho
Copy link
Contributor

Jutho commented Mar 20, 2014

I like the latest conclusion. Just to complicate matters a bit, the recently introduced UniformScaling <: AbstractArray but doesn't have size defined (nor getindex).

@lindahua
Copy link
Contributor

If we settle on this, then UniformScaling will not be a sub type of AbstractMatrix.

@andreasnoack
Copy link
Member

@Jutho getindex is defined for UniformScaling. size is not, because UniformScaling works perfectly well as a matrix without the size method. However, @lindahua is right that if it is decided that any subtype of AbstractArray should support size the UniformScaling shouldn't be a subtype anymore.

@nalimilan
Copy link
Member

+1 for the proposal; though about the naming RegularArray sounds too much like it has special mathematical properties. AbstractStridedArray, maybe?

@stevengj
Copy link
Member Author

AbstractStridedArray is confusing because makes it sound like StridedArray is concrete, however. RegularArray should be clear enough once it is defined in the manual.

stevengj added a commit to stevengj/julia that referenced this issue Mar 24, 2014
stevengj added a commit to stevengj/julia that referenced this issue Mar 24, 2014
stevengj added a commit to stevengj/julia that referenced this issue Mar 31, 2014
stevengj added a commit to stevengj/julia that referenced this issue Mar 31, 2014
stevengj added a commit to stevengj/julia that referenced this issue Mar 31, 2014
stevengj added a commit to stevengj/julia that referenced this issue Mar 31, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs decision A decision on this change is needed
Projects
None yet
Development

Successfully merging a pull request may close this issue.