Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: Nullables as collections #16961

Merged
merged 10 commits into from
Dec 29, 2016

Conversation

TotalVerb
Copy link
Contributor

@TotalVerb TotalVerb commented Jun 16, 2016

This implements map, filter, and broadcast for Nullable arguments.

cc: @nalimilan @johnmyleswhite @hayd @vchuravy

elseif !any(isnull, xs)
Nullable(map(f, map(unsafe_getindex, xs)...))
else
throw(DimensionMismatch("expected all null or all nonnull"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a standard definition of map on options? I expected any(isnull, xs) && return Nullable().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Among languages I know of (or could find on Google) with multi-argument map...

OCaml: option type does not support multi-argument map
(most) lisps: no real option type, but cons used in its place often. behaves like current implementation
Python: no option type
Haskell: zipWith does not apply to options

The intersection of multi-argument map and option types is quite small. This makes some degree of sense, as a lot of languages with option types are either OO (where multi-argument map often doesn't make sense) or curried.

In absence of a real precedent, I think it's a better idea to copy the behaviour of actual collections, rather than introduce special cases:

julia> map(+, Int[], [1])
ERROR: DimensionMismatch("dimensions must match")

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your argument about consistency convinced me, but I think the error message should also say something like "use broadcast instead". Basically, map shouldn't be used for lifting Nullable, it's mostly implemented for completeness (and maybe as a way to ensure fail-fast when you don't want a null to propagate).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would tend to think like @hayd on this one. In Haskell, zipWith on lists has no problem with one list shorter than the other, it makes a lot of sense and I don't see why our map don't act similarly, both for arrays and nullables. Also, when lifting via map an operation on two or more nullables, I would naturally interpret getting a isnull back as an "error": the computation couldn't be done because of missing values. Having an error thrown when the inputs are not all isnull is redundant, as you have now to different ways to be notified of a failed computation, and you have to check both for an isnull result and for a possible thrown exception. Maybe in Haskell can be used to encode an error at the type level, and lifting f on Maybe would be naturally implemented as

map2 f mx my = do
    x <- mx
    y <- my
    return $ f x y

which returns Nothing if one or more inputs are Nothing (that said, applying map2 on lists doesn't correspond to our multiple-list-arguments map).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have broadcast for what you describe. I think there's been discussion somewhere about replacing map with it. But let's not make this PR even more complex by opening this debate: the goal here is to be consistent with arrays, whether that behavior is good or not.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah indeed I hadn't notice the broadcast behavior. But still, I don't see why map should be defined for Nullable with the view that they are containers. For example NaN+1 doesn't throw an error. I'd then rather not define map until real cases has shown what should be the semantics (or has it happened already?) Anyway, I'm not competent so won't argue more. I would just love one day some docs on map vs broadcast (the concept of broadcast is still relatively alien to me).

Copy link
Contributor Author

@TotalVerb TotalVerb Jun 17, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is really no harm in making mixed map an error for now. If needed, it can always be relaxed... though I don't like broadcasting in map.

@nalimilan
Copy link
Member

Thanks. I agree it's OK to start with an unoptimized version, and add a fast path for safe isbits types like Base's number types ` (for which we can avoid a branch) later. Cf. JuliaStats/NullableArrays.jl#116 (comment).

I'll leave it to people more familiar with the broadcast machinery to say whether the general approach is good. One thing I wonder is how broadcast works on arguments mixing nullables, numbers and arrays. Could you add tests for that too?


s = 0
for x in Nullable{Int}()
s += x
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this does not seem desirable to me at all

@tkelman
Copy link
Contributor

tkelman commented Jun 16, 2016

Why should nullables be iterable? This seems like going further down making scalars iterable, but with more conflating of nullability with container emptiness. I think it's clearer if the concepts are kept separate.

@nalimilan
Copy link
Member

The main feature is to give f.(x) lifting semantics via broadcast. Other methods seem less useful to me, but can be interesting to have for consistency. This also seems consistent to how other languages treat nullable/option types.

@tkelman
Copy link
Contributor

tkelman commented Jun 16, 2016

Should be able to opt in to the syntactic sugar without making them iterable. This would lead to far more methods which aren't expecting to deal with nullables trying to operate on them, which I imagine is the intent here but I see that going wrong often and would rather prefer being explicit. Iteration would suddenly discard the possibility of null values, or silently do nothing instead of visibly flagging that a null was present. Doing nothing on a null value isn't always appropriate, encoding it as part of the behavior strikes me as wrong.

@nalimilan
Copy link
Member

f.(x) needs to always call broadcast since that's just syntactic sugar. So we need at least to implement broadcast on nullables.

I have no strong opinion on implementing the more general interface, but I figure e.g. Scala and Rust had good reasons to do that. @johnmyleswhite?

@TotalVerb
Copy link
Contributor Author

TotalVerb commented Jun 16, 2016

I intentionally left broadcast between arrays and nullables. This operation is ambiguous because the return type could be Nullable{Array} or Array{Nullable} (although I admit the latter is almost always more useful). The original reason for leaving out this operation was a consequence of thinking of nullables as infinite-dimensional arrays in the original PR: arrays must be promoted in dimension to broadcast to nullable, and nullable must be promoted in size to broadcast to array.

Now that that interpretation has been dispensed with, it could be worth reconsidering whether Array{Nullable} is an acceptable return type. Though, it is a bit unfortunate that the return type would not be a NullableArray.

@TotalVerb
Copy link
Contributor Author

@tkelman The Julia documentation is pretty explicit that Nullable types are container-like:

In many settings, you need to interact with a value of type T that may or may not exist. To handle these settings, Julia provides a parametric type called Nullable{T}, which can be thought of as a specialized container type that can contain either zero or one values.

This was why I filed #16889—being thought of as a container type is useless if they don't behave as containers. The container property is important, and is what (functionally) distinguishes Nullable{T} from Union{T, Void} (though the second one is much slower with the current compiler).

@TotalVerb
Copy link
Contributor Author

TotalVerb commented Jun 16, 2016

There are effectively four interpretations of a Nullable{T} out there:

  • Interpretation 1: Nullable{T} is Union{T, Void}. This is how many other dynamic languages do it: Python, JavaScript, etc.
  • Interpretation 2: Nullable{T} is a "dumb option type" that isn't much more powerful than Union{T, Void}. This is really only used in static languages, where the compiler provides type checking.
  • Interpretation 3: Nullable{T} is an option type that behaves as a collection. This is how some functional languages and most OO languages do it.
  • Interpretation 4: Nullable{T} is a monad. This is only done in functional languages, and only by few of them.

The current behaviour is interpretation 2. I would argue that interpretation 3 makes more sense for Julia.

@tkelman
Copy link
Contributor

tkelman commented Jun 16, 2016

The container property is important

I'm not convinced. Why? We have plenty of "container" wrapper types that don't act like iterable collections.

@mschauer
Copy link
Contributor

mschauer commented Jun 16, 2016

Nullable started as a very minimal type "with a very minimal interface with the hope that this will encourage you to resolve the uncertainty of whether a value is missing as soon as possible" to cite @johnmyleswhite . As some point it was even discussed that Nullable{Nullable{T}} could be collapsed to Nullable{T} (compare Union{Union{T,Void},Void} == Union{T,Void}). That indicates to me that there is reason to be cautious with a commitment to the "container with one or zero elements" interpretation.

@eschnett
Copy link
Contributor

There are a few more interpretation of Nullable{T} that are interesting in certain cases:

  • encode missing values (see NA)
  • encode defaults (adding a null value corresponds to adding zero)
  • encode errors; adding a null value leads to an error (i.e. a null value) [that's the monadic interpretation]

At one point, Julia could begin to offer different types with different behaviour. There could be an Optional type with defaulting behaviour, a Maybe type that is monadic and container-like, etc. They would all be built onto the same low-level (and bare-bones) implementation.

Here, instead of adding more container-ness to Nullable (which seems to be counter-productive for some uses), we could instead add a Maybe type that is container-like, and which supports arithmetic operations, treating null values as errors, as is suggested here.

An Optional type (all names to be bikeshedded, of course) could also implement arithmetic, but would treat missing values as neutral, not as error.

What hasn't happened so far is to make a strong case for these types, and to prototype them in a package. I'd expect some of these to become widely used, others to remain obscure.

@nalimilan
Copy link
Member

Here, instead of adding more container-ness to Nullable (which seems to be counter-productive for some uses),

"Counter-productive" sounds quite strong. Do you have any evidence of a case where different interpretations really conflict?

What hasn't happened so far is to make a strong case for these types, and to prototype them in a package. I'd expect some of these to become widely used, others to remain obscure.

Precisely, a lot of work has gone into NullableArrays, and we're seeing the current limitations of Nullable support in Base while porting DataFrames to it (i.e. with the NA interpretation): JuliaData/DataFrames.jl#994. The use case is well defined now, and we know we need at least a good support for lifting semantics. So far the best solution I can see is to use f.(x) (hence broadcast) for that. Whether we need to treat Nullable as a container in all cases remains to be seen, but that could be a logical extension.

@johnmyleswhite
Copy link
Member

Thanks again for making this happen, @TotalVerb!

@tkelman
Copy link
Contributor

tkelman commented Dec 30, 2016

We should have checked the appveyor (edit: travis too!) log more carefully. This somehow didn't cause the tests to register as failing, but something strange is up here:

From worker 1:		From worker 1:	Worker 1 failed running test broadcast:
UndefVarError: TestMain_broadcast not defined
 in deserialize(::Base.ClusterSerializer{TCPSocket}, ::Type{Module}) at .\serialize.jl:601
 in handle_deserialize(::Base.ClusterSerializer{TCPSocket}, ::Int32) at .\serialize.jl:580
 in deserialize(::Base.ClusterSerializer{TCPSocket}) at .\serialize.jl:540
 in deserialize_datatype(::Base.ClusterSerializer{TCPSocket}) at .\serialize.jl:835
 in handle_deserialize(::Base.ClusterSerializer{TCPSocket}, ::Int32) at .\serialize.jl:570
 in deserialize(::Base.ClusterSerializer{TCPSocket}) at .\serialize.jl:540
 in ntuple(::Base.Serializer.##1#2{Base.ClusterSerializer{TCPSocket}}, ::Int64) at .\tuple.jl:79
 in handle_deserialize(::Base.ClusterSerializer{TCPSocket}, ::Int32) at .\serialize.jl:561
 in deserialize(::Base.ClusterSerializer{TCPSocket}, ::DataType) at .\serialize.jl:894
 in deserialize_datatype(::Base.ClusterSerializer{TCPSocket}) at .\serialize.jl:848
 in handle_deserialize(::Base.ClusterSerializer{TCPSocket}, ::Int32) at .\serialize.jl:570
 in deserialize(::Base.ClusterSerializer{TCPSocket}, ::DataType) at .\serialize.jl:894
 in deserialize_datatype(::Base.ClusterSerializer{TCPSocket}) at .\serialize.jl:848
 in handle_deserialize(::Base.ClusterSerializer{TCPSocket}, ::Int32) at .\serialize.jl:570
 in deserialize_array(::Base.ClusterSerializer{TCPSocket}) at .\serialize.jl:718
 in handle_deserialize(::Base.ClusterSerializer{TCPSocket}, ::Int32) at .\serialize.jl:568
 in deserialize(::Base.ClusterSerializer{TCPSocket}, ::DataType) at .\serialize.jl:894
 in deserialize_datatype(::Base.ClusterSerializer{TCPSocket}) at .\serialize.jl:848
 in handle_deserialize(::Base.ClusterSerializer{TCPSocket}, ::Int32) at .\serialize.jl:570
 in deserialize(::Base.ClusterSerializer{TCPSocket}, ::DataType) at .\serialize.jl:894
 in deserialize_datatype(::Base.ClusterSerializer{TCPSocket}) at .\serialize.jl:848
 in handle_deserialize(::Base.ClusterSerializer{TCPSocket}, ::Int32) at .\serialize.jl:570
 in deserialize(::Base.ClusterSerializer{TCPSocket}, ::DataType) at .\serialize.jl:894
 in deserialize_datatype(::Base.ClusterSerializer{TCPSocket}) at .\serialize.jl:848
 in handle_deserialize(::Base.ClusterSerializer{TCPSocket}, ::Int32) at .\serialize.jl:570
 in deserialize_msg(::Base.ClusterSerializer{TCPSocket}, ::Type{Base.ResultMsg}) at .\multi.jl:120
 in deserialize_msg(::Base.ClusterSerializer{TCPSocket}) at .\multi.jl:130
 in message_handler_loop(::TCPSocket, ::TCPSocket, ::Bool) at .\multi.jl:1371
 in process_tcp_streams(::TCPSocket, ::TCPSocket, ::Bool) at .\multi.jl:1328
 in (::Base.##551#552{TCPSocket,TCPSocket,Bool})() at .\event.jl:66

@TotalVerb
Copy link
Contributor Author

Oh dear. Let me see if I can reproduce this locally.

@tkelman
Copy link
Contributor

tkelman commented Dec 30, 2016

I get a failure that bisects to here both on Windows and Linux from just running make test-broadcast - the issue above might be something wrong with the test system's handling of that failure?

@TotalVerb
Copy link
Contributor Author

The failure isn't in the travis logs for #19745, so it could be the known typo in this PR substituting S for f, which was corrected in that one.

@tkelman
Copy link
Contributor

tkelman commented Dec 30, 2016

That seems like a good explanation to me. There's a test system bug that it was able to continue and be marked as a success here then. Manifests only when tests are run in parallel. @yuyichao any idea?

@JeffBezanson
Copy link
Sponsor Member

JeffBezanson commented Dec 30, 2016

@johnmyleswhite I recall a discussion where we agreed that nullables are not containers. However, that aspect of this --- e.g. defining map on Nullable --- is probably ok.

But I'm very much against the broadcast behavior. Making broadcast also implicitly lift the function to nullables has nothing to do with treating nullables as containers. For example

julia> println.([[1],[2]])
[1]
[2]

julia> println.([Nullable(1), Nullable(2)])
1
2

So, at the very least I find the title of this PR misleading.

I think it's unacceptable for f.(xs) to "take apart" more than one layer of container (differing from map). Making 1 .+ Nullable(2) work might be fine; that is what I expected "Nullables as collections" to mean. Can we please take out the extra, horribly confusing behavior?

@ararslan
Copy link
Member

we agreed that nullables are not containers

If I understand the current thinking correctly, I believe treating Nullables as containers is more of a stopgap until we can get better Unionss so that nullables can be a simple union rather than a container.

cf. JuliaLang/Juleps#21


_broadcast_type(f, T::Type, As...) = Base._return_type(f, typestuple(T, As...))
_broadcast_type(f, A, Bs...) = Base._default_eltype(Base.Generator{ziptype(A, Bs...), ftype(f, A, Bs...)})
# nullables need to be treated like scalars sometimes and like containers
Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The occurrence of "sometimes" here definitely raises a red flag.

@pabloferz
Copy link
Contributor

I have a PR to address @JeffBezanson's concerns. I can submit it within the next hour.

@TotalVerb
Copy link
Contributor Author

TotalVerb commented Dec 30, 2016

Thanks @JeffBezanson for the comments. Indeed your concerns were noted and discussed previously around #16961 (comment) and the (weak) consensus seemed to be that "nullables go in arrays, and not the other way around" as a justification for this special case, but I admit that it is not the prettiest solution. Let's see if @pabloferz has a better one.

@johnmyleswhite
Copy link
Member

I think it's unacceptable for f.(xs) to "take apart" more than one layer of container (differing from map).

FWIW, I'm ok with either approach. This "taking apart" is the behavior that Erik Meijer and Eric Lippert were encouraging us to adopt, but that's because their vision of a dot operator for C# always involved flattening for all containers-of-containers and not just containers-of-nullables.

For now I think we can remove this behavior since it's not very important for most use cases.

@nalimilan
Copy link
Member

We can get rid of this behavior for now, but that seems to imply we'll need to keep a special array type like NullableArray even after moving to a Union type to represent nullables. Else we would lose the ability to use element-wise operators for lifting over arrays.

@TotalVerb
Copy link
Contributor Author

@nalimilan I agree that we will need to keep NullableArray.

Related to this point, @ScottPJones had the following comment to make;

I just wanted to point out that having some sort of NullableArray type will still be important for performance reasons, because (IIUC) for arrays of immutable types and bitstypes, the Union type would need at least to take a pointer (or more, to preserve alignment) size of extra memory per element, while a NullableArray only takes 1 byte extra per element.

Note: I think that it would be good to actually have a parameterized NullableArray type, where anything that is an abstract bit array could be used (because in many cases, a BitVector or BitArray would be more performant (both space and timewise) than using an Array{Bool}, and would allow for using a sparse bit array for even further space savings, if the number of Null elements is small compared to the size of the array.

I think future progress on this will be on the package side.

@Sacha0
Copy link
Member

Sacha0 commented Dec 31, 2016

First off, much thanks for your great work in / persistence with this pull request @TotalVerb! :)

(pending a comment from Sacha on future cleanups he has in mind)

In brief: For better extensibility and maintainability of broadcast, we should consider a different approach to evolving the Base.Broadcast module than that taken in this pull request.

Specifically, this pull request introduced additional type-specific functionality into Base.Broadcast, and tightly coupled that functionality with existing generic infrastructure, making broadcast more difficult to extend and maintain. Heading the opposite direction seems advantageous in the long run: Focusing Base.Broadcast on generic, readily extensible infrastructure, allowing (and encouraging) type-specific functionality to live outside of Base.Broadcast (and outside of Base altogether as appropriate).

Not certain whether I will have time to comment in detail prior to travel tomorrow morning. Most specific concerns I have are happily addressed by #19745 and #19787.

Cross refs: @andreasnoack expresses similar sentiments in #19787 (comment), @TotalVerb in #19745 (review), and I in #19745 (review) and #19723 (comment).

Best!

@nalimilan
Copy link
Member

I just wanted to point out that having some sort of NullableArray type will still be important for performance reasons, because (IIUC) for arrays of immutable types and bitstypes, the Union type would need at least to take a pointer (or more, to preserve alignment) size of extra memory per element, while a NullableArray only takes 1 byte extra per element.

No, AFAIK @vtjnash's plan is to optimize Union{T, Void} so that the memory representation is as efficient as a NullableArray. See this paragraph in @johnmyleswhite's Julep.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain:missing data Base.missing and related functionality
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet