Introduce Buffer type and make Array an abstraction on top of it #12447

malmaud · 2015-08-04T01:35:57Z

There's been some talk of switching to a model for arrays with two parts (and similar for strings) as part of Arrrayageddon:

A memory buffer type, which is just a linear set of bytes
An array, which holds a reference to its backing memory buffer and stores shape and stride information.

A few advantages:

The distinction between and a subarray becomes smaller (or disappears): both are just views on a memory buffer.
Array logic could be totally moved from C to Julia - only the buffer type would be defined in C.
It's easier to reason about when different array instances are aliasing memory - if they have the same backing buffer, they're aliased.
You could dynamically change the backing buffer of an array.

This is essentially the model that Torch uses, where the memory buffer is the Storage type and the array abstraction is the Tensor type.

quinnj · 2015-08-04T02:31:41Z

+100. To expound on the conversation between @malmaud, @StefanKarpinski, @carnaval, and myself.

Buffer would be a fixed-size linear byte buffer tied closely to actual memory
Array and String, for example, would be interface layers on top of these byte Buffer types
We could also do a custom allocator for Buffer and perhaps specialized GC (like ref counts on individual elements instead of the entire buffer/page)
IOBuffer could also be reworked to use this low-level Buffer type for tighter IO control
We didn't really come to a final conclusion, but there could perhaps be an integration with NTuple here that would create a MutableBuffer and ImmutableBuffer types that share a common interface (related to Julep: More support for working with immutables #11902)

ScottPJones · 2015-08-04T02:48:49Z

+100 also, love that it would fit in perfectly with what @quinnj has been working on in https://github.com/quinnj/Strings.jl.
Would it be possible for short buffers to be "inline" instead of pointed to, i.e. like @StefanKarpinski's bytevec stuff?

JeffBezanson · 2015-08-04T02:54:22Z

We should definitely start experimenting with an Array defined mostly in julia, mixing in some SubArray features.

How should we handle arrays of references? I don't think we want to have explicit write barriers in julia code for example.

If the storage of an Array can be switched at any time, you have extra pointer loads to worry about, and aliasing issues.

Another issue is in-line allocation of small arrays. Ideally arrays and strings would not mandate 2 objects for each.

timholy · 2015-08-04T03:06:56Z

The distinction between and a subarray becomes smaller (or disappears): both are just views on a memory buffer

Small correction: that's only true for SubArray{T,N,Array{T,...}}. But our SubArrays are much more flexible than that. We can have views on things that don't even exist in memory (e.g., a view on (i,j)->i^2+j^2, if we want).

That small correction aside, I am enthusiastic about this proposal, including mixing in some features of SubArrays into an Array-as-buffer-wrapper type. A big advantage is that FixedSizeArrays become much easier to implement because we'd finally have what's essentially a mutable Tuple.

However, I'm not so sure about switching the backing buffer of the array, and I think Jeff's points deserve careful thought.

ScottPJones · 2015-08-04T03:31:41Z

@malmaud, what use cases do you see for dynamically switching the backing buffer of the array?
I can see a lot of uses for having all sorts of backing buffers, but dynamic switching sounds scary.

quinnj · 2015-08-04T03:50:59Z

Yeah, I'm not too sure on the need to be able to switch out backing buffers. I mean I would say we just make

immutable Array
  ...
end

Let's see, what else do we currently store in Array, element-size? length? is-shared? is-aligned? dimensions? Would we really need the ability to change any of those fields? I would think since Array is now a wrapper type, making it immutable would work since it's fairly cheap to create a new instance pointing to the same data (for example)

simonster · 2015-08-04T04:25:15Z

Right now we allow changing data pointers for unshared vectors, which we use when we have to reallocate, i.e. for push!. But we don't allow changing data pointers for other ND arrays, and the compiler knows that and can hoist array pointer loads in more cases. It seems difficult to maintain these semantics without splitting Vector and Array into two separate types, since we'd need to be able to control mutability of the type based on the number of dimensions. OTOH it may turn out that with appropriate TBAA etc. we can hoist the pointer loads in the cases where it matters for performance even if the storage can be changed at any time.

ViralBShah · 2015-08-04T05:24:56Z

This is very timely for 0.5. We used to have a Buffer in the very early days, before the Julia public release, and it was quite nice to be able to define Array itself in Julia.

malmaud · 2015-08-04T12:13:22Z

I don't personally see much use for switching out the buffer - I think @johnmyleswhite got feedback from a serious Torch user that it would be useful, but I don't really know what for.

carnaval · 2015-08-04T13:40:31Z

I don't think having a write barrier builtin from julia is too bad. I'm actually somewhat surprised we never needed it. However this hypothetical buffer object will definitely need a flag to indicate whether it's full of pointers or bits for the gc, ruining a bit the illusion of dumb bytes, but well, there is no way around that.

It would be interesting to see if we can provide the feature of inline small buffer allocation in a first class (and maybe more general) way. It's kinda contrary to the local philosophy to have builtin Arrays be variable length but no one else can.

I would be fine if resizing was limited to another type than Array (ArrayList ?) but I feel this opinion is not so popular. That way paying the pointer indirection cost is a conscious decision (that I would venture you don't need to do very often). Maybe we could even try to support nd-resizing that way, to add lines and columns.

mbauman · 2015-08-04T14:10:54Z

I've definitely thought about this before, too, and would love to see Array defined in Julia. But I think that it will definitely require much more flexible Julia type definitions to get competitive performance. Here's where I've run into these sorts of performance issues in the past:

Two allocations for small objects. This isn't terribly unlike my attempts to make the BitArray struct immutable with a RefValue{Int} field to allow for mutable length. Allocating that extra tiny object was a 3x penalty for small BitArrays (<64 bits). WIP: Allow constant fields in mutable types #11430 (comment)
Performance from data locality. This is related to the above, but just as important. The ability of the builtin Array to put its data immediately following the header is extremely cache friendly, and I think it makes a big difference. Check out this StackOverflow answer where I created an unsafe Array header that pointed to the data as a super-lightweight column view. The good old allocating getindex still beat it after the cache warmed up!
Mutable length vectors, constant size arrays. This can be worked around, but it's ugly: WIP: Allow constant fields in mutable types #11430 (comment). A somewhat-related benchmark here is back when I foisted the manual pointer hoist for BitArrays when I was trying to prevent a GC frame. That caused a 20% performance regression in some cases: Speed up scalar BitArray indexing by ~25% #11403 (comment). But as @simonster mentions, there are also other approaches here, and perhaps TBAA becomes easier with a separate storage type.

argriffing · 2015-08-04T19:32:40Z

A memory buffer type, which is just a linear set of bytes

An array, which holds a reference to its backing memory buffer and stores shape and stride information.

A few advantages:
[...]
3) It's easier to reason about when different array instances are aliasing memory - if they have the same backing buffer, they're aliased.

What if someone puts their whole system in a tensor and only ever operates on views of its backing memory buffer? Under the reasoning above, wouldn't all of these array instances be considered to be aliasing memory with each other?

johnmyleswhite · 2015-08-04T19:38:06Z

That seems to mostly depend on whether the views are disjoint.

argriffing · 2015-08-04T19:44:45Z

That seems to mostly depend on whether the views are disjoint.

Right, 3) It's easier to reason about when different array instances are aliasing memory - if they have the same backing buffer, they're aliased. doesn't sound like it depends on whether the views are disjoint if they share the same backing memory buffer. It can be nontrivial to check for the disjointness because of shapes and strides stuff, and there's an active numpy issue related to this.

johnmyleswhite · 2015-08-04T19:47:39Z

That's all true. But how is that different from the status quo?

mbauman · 2015-08-04T19:47:46Z

I think the point is that we'll use different types that represent aliased (or possibly aliased) data. Currently reshape shares data between two Array objects, which is definitely something I want to move away from.

argriffing · 2015-08-04T19:49:25Z

how is that different from the status quo?

Honestly I don't know the status quo in Julia, I was just adding 2 cents because I saw something wrong (or at least questionable) on the internet :)

quinnj · 2015-08-12T20:13:05Z

@mbauman, maybe we need reshape and reshape!?

KristofferC · 2015-08-12T20:19:51Z

@quinnj that makes a lot of sense to me

mbauman · 2015-08-12T20:24:24Z

See #10507. I think reshape should always return a view (like indexing will) and ideally it'll be performant enough to not need reshape!. (Edit to add: allowing an operation like reshape! means that the dimensions of an array are not constant… which may have bigger performance effects everywhere.)

quinnj · 2015-08-12T20:28:08Z

Ultimately though, will the proposal there return a ReshapeArray instead of a regular Array? I feel like a general trend has been figuring out how to return the same types with all these operations, utilizing shared data (i.e. getindex returning a String or Array instead of SubString/SubArray). I've probably missed a key discussion here or there on this though, so feel free to correct me on this.

timholy · 2015-08-12T20:50:37Z

There's no way to return an Array from certain reshape operations unless you copy data. So we will need a special type.

Performance is indeed my big worry with the ReshapedArrays. It seems to be a much harder problem than I expected.

quinnj · 2015-08-12T21:01:52Z

@timholy, but if Array is defined in Julia, could we have more control over always returning an Array? What are the "certain operations" where we need to copy data?

mbauman · 2015-08-12T21:47:28Z

It'd be very difficult to shoehorn all of SubArray and ReshapedArray (performantly!) into Array without slowing down the normal, non-view case. You probably could make it work with a bunch of type parameters, but then it may as well be a different type.

By exposing the Buffer type, we could hoist the Array's buffer directly into a field of the view objects, which should help with their performance (if LLVM optimizations don't beat us to the punch). It could also mitigate some of the MAX_TYPE_DEPTH and over-specialization issues that we'll probably start running into with lots of deeply nested type parameter lists flying around.

The certain operations are some combinations of reshapes of sub-views, if I remember correctly.

quinnj · 2015-08-12T22:02:39Z

I'll certainly defer to our resident array ninjas on the subtleties of reshape and sub operations. I'm going to make a probably poor attempt at least for the Buffer type soon; I figure we can start as small as possible and build from there.

argriffing · 2015-08-12T22:06:05Z

What are the "certain operations" where we need to copy data?

Here's an example from numpy: numpy/numpy#6166 (comment), where I think the generalized transpose is like http://docs.julialang.org/en/latest/stdlib/arrays/?#Base.permutedims in Julia.

simonster · 2015-08-12T22:06:23Z

Indexing of SubArrays already has zero cost over indexing of ordinary Arrays, so I think we could shoehorn it all into Array if we wanted to, but there may be reasons not to. LLVM can also already hoist the load of the array pointer from SubArray out of loops in most/all cases where it matters because they are immutable, so I don't expect improved performance for that case. Rather, I expect that trying to implement Arrays in Julia will force us to sort out issues that currently affect other AbstractArray types, e.g. the stupid GC root-related performance issues and unhoisted array pointer loads for BitArrays.

timholy · 2015-08-12T23:14:32Z

The biggest issue is that the there is no (nice) view type that is closed under both reshaping and taking a cartesian subset. The only type that is closed is one that stores one memory-offset per element, and that would be horrible. See #9874 (comment).

JeffBezanson · 2016-08-09T22:06:58Z

Tentatively marking as 0.6 to encourage experimentation in that timeframe.

vtjnash · 2016-08-10T03:23:58Z

Agreed, that should definitely be the plan. I think this needed the stack allocation and an inline immutable PRs to be merged first to make this a perform.

DemiMarie · 2016-08-10T03:35:27Z

I think the current cache behavior (excellent) needs to be kept, which means that the Array data must still immediately follow the header.

andyferris · 2016-10-28T06:51:35Z

I was looking for progress on this since @StefanKarpinski announced there is a feature freeze quite soon!

In principle this doesn't seem to hard for isbits element types... I took an attempt at it here using a native-Julia Buffer and performance didn't seem too bad at first glance (same as Array, but I'm sure that's not true for all functions).

~~Unfortunately, I get crashes sometimes where my finalizers call free...~~ (bug fixed)

andyferris · 2016-10-28T08:10:25Z

I forgot to mention my approach only works nicely for element types which are isbits. Amongst other things, some control over GC would be needed for non-isbits types. One day I'd like to learn more about Julia's GC.

timholy · 2016-10-29T09:30:11Z

Neat experiment. But finalizers are yucky, would be much better if this used julia's own GC to manage the buffer.

andyferris · 2016-10-29T11:25:36Z

But finalizers are yucky

OK I admit I didn't know much about how they are implemented. I've been reading around - is the main problem here that they are slow? Would type-based finalization fix the performance problem? I can kind-of see that deferred finalization of many resources is not desirable, but our GC already does that for memory, which is what this finalizer is all about...

would be much better if this used julia's own GC to manage the buffer.

This is no doubt true, and makes more sense. On the other hand, having the ability to boss around the GC from julia code could also be pretty awesome (or potentially nightmarish). However, asking it for a buffer (and telling it whether it is full of data or full of pointers it should scan) is probably precisely the only sane thing you can boss it around to do from Julia code...

I really should read the code for the GC one day, but here's a question. If the buffer contains arrays where some fields are data and some fields are references to other Julia objects, how does the GC know which parts of the buffer to scan? (Is this only a problem in the future when non-isbits immutables get inlined?)

yuyichao · 2016-10-29T12:12:31Z

is the main problem here that they are slow? Would type-based finalization fix the performance problem?

No and no.

I can kind-of see that deferred finalization of many resources is not desirable, but our GC already does that for memory, which is what this finalizer is all about...

No the GC keep track of the memory usage, but not resources behind finalizer.

On the other hand, having the ability to boss around the GC from julia code could also be pretty awesome (or potentially nightmarish).

I'd prefer not giving people access to it unless a really convincing use case is made.

However, asking it for a buffer (and telling it whether it is full of data or full of pointers it should scan) is probably precisely the only sane thing you can boss it around to do from Julia code...

Yes.

If the buffer contains arrays where some fields are data and some fields are references to other Julia objects, how does the GC know which parts of the buffer to scan?

The buffer is always well typed.

(Is this only a problem in the future when non-isbits immutables get inlined?)

Which is a requirement for this change.

andyferris · 2016-10-29T12:42:10Z

The buffer is always well typed.

Ahh! Thanks, I was just beginning to think this is the only way it would make sense. So a future Buffer{T} would be a lot like (a non-resizeable) Vector{T} but with almost no methods defined on it (just length, eltype, load/getindex, and store!/setindex!, say)? I guess being accessible to the GC means it is more restrictive than a buffer in C++ would be (where you can store stuff in whatever memory pattern you like and cast the data to the appropriate type later - whereas in Julia a "generic" buffer (e.g. Buffer{UInt8}) could only be useful for isbits types).

(Is this only a problem in the future when non-isbits immutables get inlined?)

Which is a requirement for this change.

The circular nature didn't escape me :) It's almost cute. Though, strictly, if it is true that Vector currently uses a double indirection, then I suppose that a Buffer backed Array could be viable already, no?

yuyichao · 2016-10-29T14:17:07Z

Though, strictly, if it is true that Vector currently uses a double indirection, then I suppose that a Buffer backed Array could be viable already, no?

No.

vtjnash · 2016-12-22T18:15:18Z

This'll happen in the 1.0 release. It should perhaps have had more attention in 0.6 so we could have it on master longer, but hard stop is really just the 1.0 release.

Seelengrab · 2022-04-04T10:26:57Z

This would be useful for buffered Channels, to remove the popfirst! ccall & resizing overhead by implementing a ringbuffer on top of this. With the changes to compiler & GC over the past few years, has the situation changed in regards to this being viable/an option before 2.0? Have new possibilities for implementing this cropped up?

nsajko · 2024-04-20T11:35:21Z

I think this is fixed now on the v1.11 prereleases due to #51319.

shashi closed this as completed Aug 4, 2015

shashi reopened this Aug 4, 2015

simonster changed the title ~~Introduce Buffer type and make Array an abstraction on top of it~~ Introduce Buffer type and make Array an abstraction on top of it Aug 12, 2015

mbauman mentioned this issue Sep 15, 2015

Arraypocalypse Now and Then #13157

Closed

27 tasks

JeffBezanson added this to the 0.6.0 milestone Aug 9, 2016

yuyichao mentioned this issue Aug 19, 2016

Add 'isassigned' for RefValue #18082

Merged

simonster mentioned this issue Sep 3, 2016

Deprecate and remove isdefined(a::Array, index::Int) #18346

Closed

JeffBezanson mentioned this issue Oct 14, 2016

avoid more tuple allocations in Array constructors #18918

Merged

quinnj mentioned this issue Nov 17, 2016

improper types when using new DataFrames JuliaData/Feather.jl#30

Closed

vtjnash modified the milestones: 1.0, 0.6.0 Dec 22, 2016

JeffBezanson modified the milestones: 2.0+, 1.0 May 2, 2017

maleadt mentioned this issue Feb 2, 2018

expose UVM API JuliaGPU/CUDAdrv.jl#93

Merged

tkf mentioned this issue Oct 16, 2018

Improve aliasing detection of wrappers of SubArrays #29546

Open

brenhinkeller added the domain:arrays [a, r, r, a, y, s] label Nov 21, 2022

nsajko closed this as completed Apr 20, 2024

Introduce Buffer type and make Array an abstraction on top of it #12447

Introduce Buffer type and make Array an abstraction on top of it #12447

Comments

malmaud commented Aug 4, 2015

quinnj commented Aug 4, 2015

ScottPJones commented Aug 4, 2015

JeffBezanson commented Aug 4, 2015

timholy commented Aug 4, 2015

ScottPJones commented Aug 4, 2015

quinnj commented Aug 4, 2015

simonster commented Aug 4, 2015

ViralBShah commented Aug 4, 2015

malmaud commented Aug 4, 2015

carnaval commented Aug 4, 2015

mbauman commented Aug 4, 2015

argriffing commented Aug 4, 2015

johnmyleswhite commented Aug 4, 2015

argriffing commented Aug 4, 2015

johnmyleswhite commented Aug 4, 2015

mbauman commented Aug 4, 2015

argriffing commented Aug 4, 2015

quinnj commented Aug 12, 2015

KristofferC commented Aug 12, 2015

mbauman commented Aug 12, 2015

quinnj commented Aug 12, 2015

timholy commented Aug 12, 2015

quinnj commented Aug 12, 2015

mbauman commented Aug 12, 2015

quinnj commented Aug 12, 2015

argriffing commented Aug 12, 2015

simonster commented Aug 12, 2015

timholy commented Aug 12, 2015

JeffBezanson commented Aug 9, 2016

vtjnash commented Aug 10, 2016

DemiMarie commented Aug 10, 2016

andyferris commented Oct 28, 2016 • edited Loading

andyferris commented Oct 28, 2016

timholy commented Oct 29, 2016

andyferris commented Oct 29, 2016

yuyichao commented Oct 29, 2016

andyferris commented Oct 29, 2016

yuyichao commented Oct 29, 2016

vtjnash commented Dec 22, 2016

Seelengrab commented Apr 4, 2022

nsajko commented Apr 20, 2024

andyferris commented Oct 28, 2016 •

edited

Loading