RFC: Nullables as collections #16961

TotalVerb · 2016-06-16T05:12:29Z

This implements map, filter, and broadcast for Nullable arguments.

cc: @nalimilan @johnmyleswhite @hayd @vchuravy

hayd · 2016-06-16T05:58:47Z

base/nullable.jl

+    elseif !any(isnull, xs)
+        Nullable(map(f, map(unsafe_getindex, xs)...))
+    else
+        throw(DimensionMismatch("expected all null or all nonnull"))


Is this a standard definition of map on options? I expected any(isnull, xs) && return Nullable().

Among languages I know of (or could find on Google) with multi-argument map...

OCaml: option type does not support multi-argument map
(most) Lisps: no real option type, but cons is often used in its place; behaves like the current implementation
Python: no option type
Haskell: zipWith does not apply to options

The intersection of multi-argument map and option types is quite small. This makes some degree of sense, as a lot of languages with option types are either OO (where multi-argument map often doesn't make sense) or curried.

In absence of a real precedent, I think it's a better idea to copy the behaviour of actual collections, rather than introduce special cases:

julia> map(+, Int[], [1])
ERROR: DimensionMismatch("dimensions must match")
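
The all-or-nothing rule from the diff above can be sketched in current Julia. This is a minimal sketch, not the PR's code: Nullable no longer exists in Base, so `Some`/`nothing` stand in for non-null/null here, and `strictmap` is a hypothetical name. It assumes Julia 1.1 or later for `isnothing`.

```julia
# Hypothetical sketch: `Some(x)` plays a non-null Nullable, `nothing` a null one.
function strictmap(f, xs::Union{Some, Nothing}...)
    if all(isnothing, xs)
        return nothing                            # every input is null
    elseif !any(isnothing, xs)
        return Some(f(map(something, xs)...))     # unwrap, apply, rewrap
    else
        # Mixed inputs are treated like a length mismatch between arrays.
        throw(DimensionMismatch("expected all null or all nonnull"))
    end
end

strictmap(+, Some(1), Some(2))      # Some(3)
strictmap(+, nothing, nothing)      # nothing
# strictmap(+, Some(1), nothing)    # throws DimensionMismatch
```

The mixed case fails fast, which is the array-like behaviour argued for here; it does not propagate the null.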

Your argument about consistency convinced me, but I think the error message should also say something like "use broadcast instead". Basically, map shouldn't be used for lifting Nullable, it's mostly implemented for completeness (and maybe as a way to ensure fail-fast when you don't want a null to propagate).

I would tend to think like @hayd on this one. In Haskell, zipWith on lists has no problem with one list being shorter than the other; that makes a lot of sense, and I don't see why our map doesn't act similarly, both for arrays and nullables. Also, when lifting an operation on two or more nullables via map, I would naturally interpret getting a null back as an "error": the computation couldn't be done because of missing values. Having an error thrown when only some of the inputs are null is redundant, as you now have two different ways to be notified of a failed computation, and you have to check both for a null result and for a possible thrown exception. Maybe in Haskell can be used to encode an error at the type level, and lifting f on Maybe would be naturally implemented as

map2 f mx my = do
    x <- mx
    y <- my
    return $ f x y

which returns Nothing if one or more inputs are Nothing (that said, applying map2 on lists doesn't correspond to our multiple-list-arguments map).
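
For comparison, the Maybe-style lifting described above (a null result if any input is null, and no exception) might look like this in current Julia. It is a sketch only: `lift` is a hypothetical helper, `Some`/`nothing` stand in for Nullable, and Julia 1.1 or later is assumed for `isnothing`.

```julia
# Hypothetical helper: the result is `nothing` if any argument is `nothing`;
# otherwise the payloads are unwrapped, passed to f, and the result rewrapped.
lift(f, args...) = any(isnothing, args) ? nothing : Some(f(map(something, args)...))

lift(+, Some(1), Some(2))   # Some(3)
lift(+, Some(1), nothing)   # nothing
```

With this definition the caller only has to check the result for null, which is the single notification channel asked for above.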

We have broadcast for what you describe. I think there's been discussion somewhere about replacing map with it. But let's not make this PR even more complex by opening this debate: the goal here is to be consistent with arrays, whether that behavior is good or not.

Ah indeed, I hadn't noticed the broadcast behavior. But still, I don't see why map should be defined for Nullable with the view that they are containers. For example, NaN+1 doesn't throw an error. I'd then rather not define map until real cases have shown what the semantics should be (or has that happened already?). Anyway, I'm not competent, so I won't argue more. I would just love one day to have some docs on map vs broadcast (the concept of broadcast is still relatively alien to me).

There is really no harm in making mixed map an error for now. If needed, it can always be relaxed... though I don't like broadcasting in map.

nalimilan · 2016-06-16T08:30:51Z

Thanks. I agree it's OK to start with an unoptimized version, and add a fast path for safe isbits types like Base's number types (for which we can avoid a branch) later. Cf. JuliaStats/NullableArrays.jl#116 (comment).

I'll leave it to people more familiar with the broadcast machinery to say whether the general approach is good. One thing I wonder is how broadcast works on arguments mixing nullables, numbers and arrays. Could you add tests for that too?

tkelman · 2016-06-16T10:15:29Z

test/nullable.jl

+
+s = 0
+for x in Nullable{Int}()
+    s += x


this does not seem desirable to me at all

tkelman · 2016-06-16T10:20:42Z

Why should nullables be iterable? This seems like going further down making scalars iterable, but with more conflating of nullability with container emptiness. I think it's clearer if the concepts are kept separate.

nalimilan · 2016-06-16T10:33:50Z

The main feature is to give f.(x) lifting semantics via broadcast. Other methods seem less useful to me, but can be interesting to have for consistency. This also seems consistent with how other languages treat nullable/option types.

tkelman · 2016-06-16T10:48:25Z

Should be able to opt in to the syntactic sugar without making them iterable. This would lead to far more methods which aren't expecting to deal with nullables trying to operate on them, which I imagine is the intent here but I see that going wrong often and would rather prefer being explicit. Iteration would suddenly discard the possibility of null values, or silently do nothing instead of visibly flagging that a null was present. Doing nothing on a null value isn't always appropriate, encoding it as part of the behavior strikes me as wrong.

nalimilan · 2016-06-16T10:58:10Z

f.(x) needs to always call broadcast since that's just syntactic sugar. So we need at least to implement broadcast on nullables.

I have no strong opinion on implementing the more general interface, but I figure e.g. Scala and Rust had good reasons to do that. @johnmyleswhite?

TotalVerb · 2016-06-16T13:23:09Z

I intentionally left out broadcast between arrays and nullables. This operation is ambiguous because the return type could be Nullable{Array} or Array{Nullable} (although I admit the latter is almost always more useful). The original reason for leaving out this operation was a consequence of thinking of nullables as infinite-dimensional arrays in the original PR: arrays must be promoted in dimension to broadcast to nullable, and nullable must be promoted in size to broadcast to array.

Now that that interpretation has been dispensed with, it could be worth reconsidering whether Array{Nullable} is an acceptable return type. Though, it is a bit unfortunate that the return type would not be a NullableArray.

TotalVerb · 2016-06-16T13:30:30Z

@tkelman The Julia documentation is pretty explicit that Nullable types are container-like:

In many settings, you need to interact with a value of type T that may or may not exist. To handle these settings, Julia provides a parametric type called Nullable{T}, which can be thought of as a specialized container type that can contain either zero or one values.

This was why I filed #16889—being thought of as a container type is useless if they don't behave as containers. The container property is important, and is what (functionally) distinguishes Nullable{T} from Union{T, Void} (though the second one is much slower with the current compiler).

TotalVerb · 2016-06-16T13:37:12Z

There are effectively four interpretations of a Nullable{T} out there:

Interpretation 1: Nullable{T} is Union{T, Void}. This is how many other dynamic languages do it: Python, JavaScript, etc.
Interpretation 2: Nullable{T} is a "dumb option type" that isn't much more powerful than Union{T, Void}. This is really only used in static languages, where the compiler provides type checking.
Interpretation 3: Nullable{T} is an option type that behaves as a collection. This is how some functional languages and most OO languages do it.
Interpretation 4: Nullable{T} is a monad. This is only done in functional languages, and only by few of them.

The current behaviour is interpretation 2. I would argue that interpretation 3 makes more sense for Julia.
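
Interpretation 3 can be made concrete with a small sketch in current Julia. `Opt` and `total` are hypothetical names, not the historical Base.Nullable or anything from this PR; the sketch only shows what "behaves as a collection of zero or one elements" means for iteration.

```julia
# Hypothetical zero-or-one container; not the historical Base.Nullable.
struct Opt{T}
    value::Union{Some{T}, Nothing}
end

# Iteration protocol: yield the payload once if present, otherwise finish at once.
function Base.iterate(o::Opt, done::Bool = false)
    (done || o.value === nothing) && return nothing
    return (something(o.value), true)
end
Base.length(o::Opt) = o.value === nothing ? 0 : 1
Base.eltype(::Type{Opt{T}}) where {T} = T

# Sum whatever the container yields.
function total(o)
    s = 0
    for x in o
        s += x
    end
    return s
end

total(Opt{Int}(Some(5)))    # 5
total(Opt{Int}(nothing))    # 0: the null is skipped without any signal
```

The second call is the case objected to earlier in the thread: the loop body never runs, so a null input is indistinguishable from "nothing to do".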

tkelman · 2016-06-16T13:58:46Z

The container property is important

I'm not convinced. Why? We have plenty of "container" wrapper types that don't act like iterable collections.

mschauer · 2016-06-16T15:25:47Z

Nullable started as a very minimal type "with a very minimal interface with the hope that this will encourage you to resolve the uncertainty of whether a value is missing as soon as possible", to cite @johnmyleswhite. At some point it was even discussed that Nullable{Nullable{T}} could be collapsed to Nullable{T} (compare Union{Union{T,Void},Void} == Union{T,Void}). That indicates to me that there is reason to be cautious about committing to the "container with one or zero elements" interpretation.

eschnett · 2016-06-16T16:30:52Z

There are a few more interpretations of Nullable{T} that are interesting in certain cases:

encode missing values (see NA)
encode defaults (adding a null value corresponds to adding zero)
encode errors; adding a null value leads to an error (i.e. a null value) [that's the monadic interpretation]

At one point, Julia could begin to offer different types with different behaviour. There could be an Optional type with defaulting behaviour, a Maybe type that is monadic and container-like, etc. They would all be built onto the same low-level (and bare-bones) implementation.

Here, instead of adding more container-ness to Nullable (which seems to be counter-productive for some uses), we could instead add a Maybe type that is container-like, and which supports arithmetic operations, treating null values as errors, as is suggested here.

An Optional type (all names to be bikeshedded, of course) could also implement arithmetic, but would treat missing values as neutral, not as error.

What hasn't happened so far is to make a strong case for these types, and to prototype them in a package. I'd expect some of these to become widely used, others to remain obscure.

nalimilan · 2016-06-16T21:07:00Z

Here, instead of adding more container-ness to Nullable (which seems to be counter-productive for some uses),

"Counter-productive" sounds quite strong. Do you have any evidence of a case where different interpretations really conflict?

What hasn't happened so far is to make a strong case for these types, and to prototype them in a package. I'd expect some of these to become widely used, others to remain obscure.

Precisely, a lot of work has gone into NullableArrays, and we're seeing the current limitations of Nullable support in Base while porting DataFrames to it (i.e. with the NA interpretation): JuliaData/DataFrames.jl#994. The use case is well defined now, and we know we need at least a good support for lifting semantics. So far the best solution I can see is to use f.(x) (hence broadcast) for that. Whether we need to treat Nullable as a container in all cases remains to be seen, but that could be a logical extension.

johnmyleswhite · 2016-12-29T17:40:04Z

Thanks again for making this happen, @TotalVerb!

tkelman · 2016-12-30T01:36:59Z

We should have checked the appveyor (edit: travis too!) log more carefully. This somehow didn't cause the tests to register as failing, but something strange is up here:

From worker 1:		From worker 1:	Worker 1 failed running test broadcast:
UndefVarError: TestMain_broadcast not defined
 in deserialize(::Base.ClusterSerializer{TCPSocket}, ::Type{Module}) at .\serialize.jl:601
 in handle_deserialize(::Base.ClusterSerializer{TCPSocket}, ::Int32) at .\serialize.jl:580
 in deserialize(::Base.ClusterSerializer{TCPSocket}) at .\serialize.jl:540
 in deserialize_datatype(::Base.ClusterSerializer{TCPSocket}) at .\serialize.jl:835
 in handle_deserialize(::Base.ClusterSerializer{TCPSocket}, ::Int32) at .\serialize.jl:570
 in deserialize(::Base.ClusterSerializer{TCPSocket}) at .\serialize.jl:540
 in ntuple(::Base.Serializer.##1#2{Base.ClusterSerializer{TCPSocket}}, ::Int64) at .\tuple.jl:79
 in handle_deserialize(::Base.ClusterSerializer{TCPSocket}, ::Int32) at .\serialize.jl:561
 in deserialize(::Base.ClusterSerializer{TCPSocket}, ::DataType) at .\serialize.jl:894
 in deserialize_datatype(::Base.ClusterSerializer{TCPSocket}) at .\serialize.jl:848
 in handle_deserialize(::Base.ClusterSerializer{TCPSocket}, ::Int32) at .\serialize.jl:570
 in deserialize(::Base.ClusterSerializer{TCPSocket}, ::DataType) at .\serialize.jl:894
 in deserialize_datatype(::Base.ClusterSerializer{TCPSocket}) at .\serialize.jl:848
 in handle_deserialize(::Base.ClusterSerializer{TCPSocket}, ::Int32) at .\serialize.jl:570
 in deserialize_array(::Base.ClusterSerializer{TCPSocket}) at .\serialize.jl:718
 in handle_deserialize(::Base.ClusterSerializer{TCPSocket}, ::Int32) at .\serialize.jl:568
 in deserialize(::Base.ClusterSerializer{TCPSocket}, ::DataType) at .\serialize.jl:894
 in deserialize_datatype(::Base.ClusterSerializer{TCPSocket}) at .\serialize.jl:848
 in handle_deserialize(::Base.ClusterSerializer{TCPSocket}, ::Int32) at .\serialize.jl:570
 in deserialize(::Base.ClusterSerializer{TCPSocket}, ::DataType) at .\serialize.jl:894
 in deserialize_datatype(::Base.ClusterSerializer{TCPSocket}) at .\serialize.jl:848
 in handle_deserialize(::Base.ClusterSerializer{TCPSocket}, ::Int32) at .\serialize.jl:570
 in deserialize(::Base.ClusterSerializer{TCPSocket}, ::DataType) at .\serialize.jl:894
 in deserialize_datatype(::Base.ClusterSerializer{TCPSocket}) at .\serialize.jl:848
 in handle_deserialize(::Base.ClusterSerializer{TCPSocket}, ::Int32) at .\serialize.jl:570
 in deserialize_msg(::Base.ClusterSerializer{TCPSocket}, ::Type{Base.ResultMsg}) at .\multi.jl:120
 in deserialize_msg(::Base.ClusterSerializer{TCPSocket}) at .\multi.jl:130
 in message_handler_loop(::TCPSocket, ::TCPSocket, ::Bool) at .\multi.jl:1371
 in process_tcp_streams(::TCPSocket, ::TCPSocket, ::Bool) at .\multi.jl:1328
 in (::Base.##551#552{TCPSocket,TCPSocket,Bool})() at .\event.jl:66

TotalVerb · 2016-12-30T01:48:07Z

Oh dear. Let me see if I can reproduce this locally.

tkelman · 2016-12-30T01:54:30Z

I get a failure that bisects to here both on Windows and Linux from just running make test-broadcast - the issue above might be something wrong with the test system's handling of that failure?

TotalVerb · 2016-12-30T02:00:34Z

The failure isn't in the travis logs for #19745, so it could be the known typo in this PR substituting S for f, which was corrected in that one.

tkelman · 2016-12-30T02:17:13Z

That seems like a good explanation to me. There's a test system bug that it was able to continue and be marked as a success here then. Manifests only when tests are run in parallel. @yuyichao any idea?

JeffBezanson · 2016-12-30T19:45:28Z

@johnmyleswhite I recall a discussion where we agreed that nullables are not containers. However, that aspect of this --- e.g. defining map on Nullable --- is probably ok.

But I'm very much against the broadcast behavior. Making broadcast also implicitly lift the function to nullables has nothing to do with treating nullables as containers. For example

julia> println.([[1],[2]])
[1]
[2]

julia> println.([Nullable(1), Nullable(2)])
1
2

So, at the very least I find the title of this PR misleading.

I think it's unacceptable for f.(xs) to "take apart" more than one layer of container (differing from map). Making 1 .+ Nullable(2) work might be fine; that is what I expected "Nullables as collections" to mean. Can we please take out the extra, horribly confusing behavior?
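
The distinction drawn here, one layer of container versus two, can be illustrated with a sketch in current Julia. `one_layer` and `two_layer` are hypothetical names, `Some` stands in for a non-null Nullable, and rewrapping the result in `two_layer` is an assumption about what the lifted call returns.

```julia
# f sees each array element as-is, wrapper included (what map does).
one_layer(f, xs::AbstractArray) = map(f, xs)

# f sees the payload inside each element; nulls pass through untouched.
two_layer(f, xs::AbstractArray) =
    map(x -> x === nothing ? nothing : Some(f(something(x))), xs)

xs = [Some(1), Some(2)]
one_layer(x -> x isa Some, xs)   # [true, true]: f received the wrappers
two_layer(x -> x + 1, xs)        # f received the payloads 1 and 2
```

The objection is that f.(xs) behaved like `two_layer` for arrays of nullables while map behaved like `one_layer`.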

ararslan · 2016-12-30T20:15:26Z

we agreed that nullables are not containers

If I understand the current thinking correctly, I believe treating Nullables as containers is more of a stopgap until we can get better Unions so that nullables can be a simple union rather than a container.

cf. JuliaLang/Juleps#21

JeffBezanson · 2016-12-30T22:23:54Z

base/broadcast.jl


-_broadcast_type(f, T::Type, As...) = Base._return_type(f, typestuple(T, As...))
-_broadcast_type(f, A, Bs...) = Base._default_eltype(Base.Generator{ziptype(A, Bs...), ftype(f, A, Bs...)})
+# nullables need to be treated like scalars sometimes and like containers


The occurrence of "sometimes" here definitely raises a red flag.

pabloferz · 2016-12-30T22:30:01Z

I have a PR to address @JeffBezanson's concerns. I can submit it within the next hour.

TotalVerb · 2016-12-30T23:05:57Z

Thanks @JeffBezanson for the comments. Indeed your concerns were noted and discussed previously around #16961 (comment) and the (weak) consensus seemed to be that "nullables go in arrays, and not the other way around" as a justification for this special case, but I admit that it is not the prettiest solution. Let's see if @pabloferz has a better one.

johnmyleswhite · 2016-12-31T00:08:19Z

I think it's unacceptable for f.(xs) to "take apart" more than one layer of container (differing from map).

FWIW, I'm ok with either approach. This "taking apart" is the behavior that Erik Meijer and Eric Lippert were encouraging us to adopt, but that's because their vision of a dot operator for C# always involved flattening for all containers-of-containers and not just containers-of-nullables.

For now I think we can remove this behavior since it's not very important for most use cases.

nalimilan · 2016-12-31T14:52:48Z

We can get rid of this behavior for now, but that seems to imply we'll need to keep a special array type like NullableArray even after moving to a Union type to represent nullables. Else we would lose the ability to use element-wise operators for lifting over arrays.

TotalVerb · 2016-12-31T21:18:59Z

@nalimilan I agree that we will need to keep NullableArray.

Related to this point, @ScottPJones had the following comment to make:

I just wanted to point out that having some sort of NullableArray type will still be important for performance reasons, because (IIUC) for arrays of immutable types and bitstypes, the Union type would need at least to take a pointer (or more, to preserve alignment) size of extra memory per element, while a NullableArray only takes 1 byte extra per element.

Note: I think that it would be good to actually have a parameterized NullableArray type, where anything that is an abstract bit array could be used, because in many cases a BitVector or BitArray would be more performant (both space- and time-wise) than using an Array{Bool}, and it would allow for using a sparse bit array for even further space savings if the number of null elements is small compared to the size of the array.
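
The parameterized layout suggested in this note could be sketched as follows. `MaskedVector` and `getvalue` are hypothetical names and this is not the NullableArrays.jl implementation; it only shows payloads stored next to a null mask whose type is a parameter.

```julia
# Hypothetical layout: payloads plus a parallel null mask of any Bool vector type.
struct MaskedVector{T, M<:AbstractVector{Bool}}
    values::Vector{T}
    isnull::M
end

# Read element i as `nothing` (null) or `Some(value)`.
getvalue(v::MaskedVector, i::Integer) = v.isnull[i] ? nothing : Some(v.values[i])

bytemask = [false, true, false]          # Vector{Bool}: one byte per element
bitmask = falses(3); bitmask[2] = true   # BitVector: bits packed into chunks

dense  = MaskedVector([10, 20, 30], bytemask)
packed = MaskedVector([10, 20, 30], bitmask)

getvalue(dense, 1)    # Some(10)
getvalue(packed, 2)   # nothing
sizeof(bytemask)      # 3
```

Because the mask type is a parameter, a sparse Bool vector could be substituted in the same way when nulls are rare.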

I think future progress on this will be on the package side.

Sacha0 · 2016-12-31T21:19:15Z

First off, much thanks for your great work in / persistence with this pull request @TotalVerb! :)

(pending a comment from Sacha on future cleanups he has in mind)

In brief: For better extensibility and maintainability of broadcast, we should consider a different approach to evolving the Base.Broadcast module than that taken in this pull request.

Specifically, this pull request introduced additional type-specific functionality into Base.Broadcast, and tightly coupled that functionality with existing generic infrastructure, making broadcast more difficult to extend and maintain. Heading the opposite direction seems advantageous in the long run: Focusing Base.Broadcast on generic, readily extensible infrastructure, allowing (and encouraging) type-specific functionality to live outside of Base.Broadcast (and outside of Base altogether as appropriate).

Not certain whether I will have time to comment in detail prior to travel tomorrow morning. Most specific concerns I have are happily addressed by #19745 and #19787.

Cross refs: @andreasnoack expresses similar sentiments in #19787 (comment), @TotalVerb in #19745 (review), and I in #19745 (review) and #19723 (comment).

Best!

nalimilan · 2017-01-01T10:44:41Z

I just wanted to point out that having some sort of NullableArray type will still be important for performance reasons, because (IIUC) for arrays of immutable types and bitstypes, the Union type would need at least to take a pointer (or more, to preserve alignment) size of extra memory per element, while a NullableArray only takes 1 byte extra per element.

No, AFAIK @vtjnash's plan is to optimize Union{T, Void} so that the memory representation is as efficient as a NullableArray. See this paragraph in @johnmyleswhite's Julep.


andreasnoack mentioned this pull request Dec 29, 2016: Use containertype to determine array type for array broadcast #19745 (Merged)

JeffBezanson reviewed Dec 30, 2016

Sacha0 mentioned this pull request Dec 30, 2016: Simplify and extend broadcast eltype promotion mechanism #19723 (Merged)

pabloferz mentioned this pull request Dec 31, 2016: Limit broadcast mechanism over Nullables #19787 (Merged)

tkelman mentioned this pull request Dec 31, 2016: LLVM 3.9 miss-compiles inlined broadcast for Array & SubArray #19792 (Closed)

JeffBezanson mentioned this pull request Jan 1, 2017: temporarily disable LLVM assertions on travis #19803 (Closed)

nalimilan mentioned this pull request Jan 31, 2017: Implement more operators on Nullable with lifting semantics #19034 (Closed)

davidanthoff mentioned this pull request Feb 7, 2017: Nested . syntax #20502 (Open)

Sacha0 mentioned this pull request Feb 22, 2017: Wishlist: API and documentation for extending broadcast #20740 (Closed)

This was referenced Apr 5, 2017: Refactor null_safe_op to workaround codegen changes #21290 (Closed); Codegen change during 0.6 development cycle #21305 (Closed)

nalimilan mentioned this pull request Sep 8, 2017: Replace Nullable{T} with Union{Some{T}, Void} #23642 (Merged)

TotalVerb mentioned this pull request Sep 11, 2017: Missing nullable operations? #16889 (Closed)

davidanthoff mentioned this pull request Jul 10, 2017: Various issues in other repos to re-examine queryverse/DataValues.jl#18 (Open)

bramtayl mentioned this pull request Mar 31, 2022: Implements flatmap #44792 (Merged)
