Replace Nullable{T} with Union{Some{T}, Void} #23642

Merged
merged 16 commits into from
Dec 15, 2017

nalimilan
Member

@nalimilan nalimilan commented Sep 8, 2017

This is a first proposal to fix #22682. Tests pass but I haven't updated docs and the broadcast code needs checking. What this PR does currently is:

  • Deprecate Nullable, moved to a new Nullables.jl package
  • Introduce Null/null from Nulls.jl, Some{T} and Optional{T} = Union{Null, Some{T}} as a replacement for Nullable in cases where either 1) one wants to force explicit unwrapping and/or 2) one needs to allow wrapping null inside an Option and be able to distinguish it from a null (typically, for tryget on dictionaries or tryparse on a nullable value). Union{T, Null} is still available to represent missing values, and will allow silent propagation of nulls once we implement operators on Null.
  • Rename get(::Nullable[, y]) to unwrap(::Some[, y]): this is not strictly needed, but it allows seeing the extent to which it's used in Base, and evaluating how much simpler the code would be if we used Union{Null, T} rather than Union{Null, Some{T}} (and therefore didn't require unwrapping)
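
A minimal sketch of the proposal in the list above, written in current Julia syntax. Null, null, Option, isnull and unwrap are defined locally for illustration and simply follow the names used in this description (they are not necessarily the API that ended up being merged), and tryget is a hypothetical helper. Only Some is taken from Base.

```julia
# Illustrative stand-ins for the names used in this proposal (defined locally).
struct Null end
const null = Null()

# Either no value (`null`) or a wrapped value (`Some(x)`).
const Option{T} = Union{Null, Some{T}}

isnull(::Null) = true
isnull(::Some) = false

# Explicit unwrapping, with an optional default for the null case.
unwrap(x::Some) = x.value
unwrap(x::Some, default) = x.value
unwrap(::Null, default) = default

# Hypothetical dictionary lookup that can tell "key absent" from "key maps to null".
tryget(d::AbstractDict, key) = haskey(d, key) ? Some(d[key]) : null

d = Dict{String,Any}("a" => 1, "b" => null)

present = tryget(d, "a")   # Some(1)
wrapped = tryget(d, "b")   # Some(null): the key exists and its value is null
absent  = tryget(d, "c")   # null: the key does not exist
```

The wrapper is what keeps the second and third lookups distinguishable; with a plain Union{Null, T} both would come back as null.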

Issues to discuss include:

  • Naming: Option (Rust, Scala, ML, F#) could be Optional (Swift, C++17, Java 8) or Maybe (Haskell). It seems that the two former are the most common. Some (used by Rust, Scala, ML, F#, Swift) could also be Value or Just (as in Haskell).
  • Whether to support convert(Some{T}, ::T). At first I thought it was convenient (we allowed it for Nullable), but I realized it wasn't a great idea since it implies that convert(Option{T}, x) gives Some(x) in general but null when x = null, which is a trap transforming a wrapped null (which should be Some(null)) into a null Option, defeating the purpose of using a wrapper in the first place.
  • What to do with broadcast: things like broadcast(exp, Some(1)) work, but an error is thrown for broadcast(exp, null), which makes this operation mostly useless. We can fix this by defining operators on Null (just like in Nulls.jl), but that won't work for all functions. We should be able to add a special rule for Null to the broadcast machinery, that's probably worth it since that's one of the main features of Option in many languages.
  • Whether to introduce another form of missing value for Option in addition to Void/nothing and Null/null, which could be None/none for consistency with other languages. The main argument in favor of this would be to distinguish clearly Option from ?T/Union{Null, T}, since the latter is intended for missing values in data with automatic propagation of nulls. But I think most people were opposed to this in the discussion.
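
To make the conversion trap from the second point concrete, here is a small sketch. Null and null are defined locally, and implicit_option is a hypothetical stand-in for what convert(Option{T}, x) would do if conversion from T were allowed; it is not a method from this PR.

```julia
struct Null end
const null = Null()

# Hypothetical implicit conversion: wrap any value, but pass `null` through unchanged.
implicit_option(x) = Some(x)
implicit_option(::Null) = null

implicit_option(1)      # Some(1)
implicit_option(null)   # null, although a wrapped null would have to be Some(null)

# Only explicit wrapping can express "a value is present, and that value is null".
explicit = Some(null)
```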

@nalimilan nalimilan added the domain:missing data Base.missing and related functionality label Sep 8, 2017
base/null.jl Outdated

convert(::Type{Option{T}}, ::Null) where {T} = null
convert(::Type{Option }, ::Null) = null
# FIXME: find out why removing these two methods makes addprocs() fail
Member Author

This has turned out to be very hard to debug. Without these two methods, addprocs(2) hangs and finally times out, leaving two julia processes using 100% CPU in the background. Replacing all occurrences of nothing with null in base/distributed/* didn't fix it. The output isn't very helpful. Any help on how to debug this is welcome!

julia> addprocs(2)
ERROR: Timed out waiting to read host:port string from worker.
read_worker_host_port(::Pipe) at ./distributed/cluster.jl:302
connect(::Base.Distributed.LocalManager, ::Int64, ::WorkerConfig) at ./distributed/managers.jl:395
create_worker(::Base.Distributed.LocalManager, ::WorkerConfig) at ./distributed/cluster.jl:507
setup_launched_worker(::Base.Distributed.LocalManager, ::WorkerConfig, ::Array{Int64,1}) at ./distributed/cluster.jl:453
(::getfield(Base.Distributed, Symbol("##41#44")){Base.Distributed.LocalManager,WorkerConfig})() at ./task.jl:335

...and 1 more exception(s).

Stacktrace:
 [1] sync_end() at ./task.jl:287
 [2] macro expansion at ./task.jl:303 [inlined]
 [3] #addprocs_locked#38(::Array{Any,1}, ::Function, ::Base.Distributed.LocalManager) at ./distributed/cluster.jl:407
 [4] #addprocs#37(::Array{Any,1}, ::Function, ::Base.Distributed.LocalManager) at ./distributed/cluster.jl:371
 [5] #addprocs#257(::Bool, ::Array{Any,1}, ::Function, ::Int64) at ./distributed/managers.jl:314
 [6] addprocs(::Int64) at ./distributed/managers.jl:313

Member

Perhaps related to automatic conversion thru fallback constructor? #23273

Member Author

Generally this rather happens when assigning to an Option field. It's quite rare to call Option(x).

I've finally found a way to see the backtrace from the workers, by adding print statements to read_worker_host_port. Turns out the TCPSocket inner constructor used nothing to set a Nullable field to null. Now it works!

"boundscheck", "error", "cartesian", "asmvariant", "osutils",
"channels", "iostream", "specificity", "codegen", "codevalidation",
# FIXME: fix ambiguities
"ambiguous"
Member Author

This test fails due to ambiguities between the following methods:

convert(::Type{Union{Null, Some{T}}}, x::Some) where T
convert(::Type{Union{Null, Some{T}}}, ::Void) where T
convert(::Type{Union{Null, Some{T}}}, ::Null) where T

How can I fix this? My attempts have not been successful so far.

@nalimilan nalimilan mentioned this pull request Sep 8, 2017
@StefanKarpinski
Sponsor Member

Is Some a standard name for this immutable wrapper type? We had also discussed Value, but that might be confusing with Val (which is also a confusing name, but that's another issue).

@ararslan
Member

ararslan commented Sep 8, 2017

Thanks so much for working on this! It's exciting to see this finally coming to fruition.

My take on the issues you've enumerated:

Naming:

  • +1 for Some over Value, since Value is too similar to Val.
  • +1 for Maybe over Option and Optional. Option would conflict with #23637 (REPL: add an Option type to handle REPL options) and in my opinion Optional isn't terribly descriptive. To me, Maybe{T} says "maybe this is some T and maybe it's null."

Supporting conversion to Some{T}:

  • I don't see why convert(Some{T}, x) implies that convert(Option{T}, x) should work. In my mind, it would only make sense to allow conversion to Option{T} if both conversion to Some{T} and conversion to Null were allowed, but I don't think it makes sense to allow conversion to Null in the general case (just the no-op convert(Null, null)).

Broadcasting:

  • I don't think we should define broadcast/map to do unwrapping of any kind. Those functions are for vectorization and frankly I think it was a bad decision to use broadcast for Nullable.
  • broadcast(exp, null) should just be exp(null) since null is a scalar, and we should define exp(null) to be null. The same should be done for many other Base functions as a way to indicate that they know how to deal with missing values.
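
A rough sketch of what treating null as a scalar could look like in current Julia. Null and null are defined locally for illustration; Base.broadcastable is the hook that current Julia versions provide for declaring how a value takes part in broadcasting, and is used here as an assumption about how this suggestion might be implemented.

```julia
struct Null end
const null = Null()

# Treat `null` as a scalar in broadcasting, as is done for numbers.
Base.broadcastable(n::Null) = Ref(n)

# Opt `exp` in to propagating missingness.
Base.exp(::Null) = null

broadcast(exp, null)   # same result as exp(null)
```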

Yet another type:

  • I don't understand why that would be necessary. -1 for adding a None/none. Seems like we have our bases covered with Union{T, Null} and Union{Some{T}, Null}.

@ararslan
Member

ararslan commented Sep 8, 2017

Is Some a standard name for this immutable wrapper type?

As one data point it's what Rust uses. Not sure what else uses it.

@nalimilan
Member Author

nalimilan commented Sep 8, 2017

Is Some a standard name for this immutable wrapper type? We had also discussed Value, but that might be confusing with Val (which is also a confusing name, but that's another issue).

Yes, according to Wikipedia, Some is used by Rust, Scala, ML and F# (Swift uses some). Another common term is Just (in Haskell notably).

EDIT: Value doesn't seem to be used by any language.

@ararslan ararslan added this to the 1.0 milestone Sep 8, 2017
@ararslan ararslan added the needs news A NEWS entry is required for this change label Sep 8, 2017
base/null.jl Outdated
an `Option` object is null, and [`unwrap`](@ref) to access the value inside the `Some`
object if not.
"""
Option{T} = Union{Null, Some{T}}
Member

This doesn't need a const, right?

@nalimilan
Member Author

nalimilan commented Sep 8, 2017

Naming:

+1 for Some over Value, since Value is too similar to Val.
+1 for Maybe over Option and Optional. Option would conflict with #23637 and in my opinion Optional isn't terribly descriptive. To me, Maybe{T} says "maybe this is some T and maybe it's null."

Ah, good point about #23637. Though the issue with Maybe is that languages that have it (like Haskell) seem to use Nothing or nothing to denote no value. That could cause confusion with our nothing. OTOH languages which have Option or Optional often use None, which differs from our Null but at least does not exist at all in the language. So I guess I'd prefer Optional.

Supporting conversion to Some{T}:

I don't see why convert(Some{T}, x) implies that convert(Option{T}, x) should work. In my mind, it would only make sense to allow conversion to Option{T} if both conversion to Some{T} and conversion to Null were allowed, but I don't think it makes sense to allow conversion to Null in the general case (just the no-op convert(Null, null)).

Why not, but there are probably very few cases in which one would like to convert from T to Some{T}. The most common case (see the diff) is to assign to an Option{T} field. So basically the decisive issue is whether or not to allow conversion to Option.

Broadcasting:

I don't think we should define broadcast/map to do unwrapping of any kind. Those functions are for vectorization and frankly I think it was a bad decision to use broadcast for Nullable.

Yet in most languages supporting Option, the recommended way of working with them is to use operations on collections like map/broadcast, considering them as zero- or one-element containers. See for example the use of flatMap in Scala. I wasn't a fan of treating Nullable as a container, but now that it's clear missing data should be represented using Union{T, Null}, I don't have any problem with treating Option like that. Cf. also the long discussion for Nullable at #16961.

Anyway, we can always start without this and add it later if needed. That feature doesn't seem to be used in Base, which might indicate that it's not essential (but be sure people will ask for it). But I'll have to actively remove code to prevent it from working at all, since the basics already exist due to the Nullable legacy.

@rfourquet
Member

Great! I didn't closely follow this switch-away-from-nullable effort, so I don't see why we wouldn't have Optional{T} = Union{Void, Some{T}}? I understand that Null is needed for Union{Null, T}, because T could be Void (I think?), but for optional we might as well go with the simpler option, Void. After all, assuming nothing is a must-know in Julia, someone could then learn Optional with minimal overhead; Null would be only for the "data scientist working with missing values". And isn't it an advantage to completely separate those two types, which were previously merged in Nullable? (genuine question)

@ararslan
Member

ararslan commented Sep 8, 2017

The problem with using Void is that it gives the type a different meaning. nothing is what's returned from loops, print, and other things like that, so it would be weird to say that if a value isn't present, it's the same as what you'd get from assigning to the result of print.

@nalimilan
Member Author

@rfourquet We discussed that a bit in #22682. There's no intrinsic reason to use Null rather than Void here, since T could equally be Null or Void (the solution to that is to use Union{Null, Some{T}} or Union{Void, Some{T}}, see above). It's mostly a matter of convention. Since null is already meant to denote the absence of a value, it sounds logical to use it. We need isnull and get/unwrap anyway, so better use it both for Option{T} and for Union{Null, T}. That way we can keep nothing for its only documented use, that is as the return value of blocks which do not return anything.

That said, one small drawback of reusing Null here is that we plan to define things like exp(::Null) = null to allow for propagation of missing values. That means that exp(::Option{T}) would throw an error when a value is present (i.e. with a Some{T} argument), but would succeed when the value is missing (null argument). That's counter-intuitive, though in practice it's probably not a big deal since people generally test their code with valid values -- what they forget to test is when the value is missing.
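
The asymmetry can be sketched as follows. Null and null are defined locally for illustration, and the exp(::Null) method stands in for the propagation rule we plan to define; it is not code from this PR.

```julia
struct Null end
const null = Null()

# The planned propagation rule for missing values.
Base.exp(::Null) = null

# An Option-style value that is absent succeeds silently...
absent_result = exp(null)

# ...while one that is present fails, because no `exp(::Some)` method exists.
present_failed = try
    exp(Some(1.0))
    false
catch err
    err isa MethodError
end
```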

@jamesonquinn
Contributor

jamesonquinn commented Sep 8, 2017

I like all the general ideas here.

My somewhat non-bikeshedding comment: we should definitely have different values for "or maybe not" (that is, Union{Some{T}, XXX}) and "na" (that is ?T = Union{T, YYY}). The whole point of these two objects is that they have different semantics. For instance, == and === should work the same for the former, not for the latter.

My purely bikeshedding comments:

For ?T = Union{T, YYY} (data scientist's null), I think YYY should be na::NA, after R. Relatively well-known and well-understood, name explains pretty much what it means; data that's not available.

For Union{Some{T}, XXX} (computer scientist's null), I think XXX should be either nothing::Void (because why multiply our null-like values for no reason), or if not, either empty::Empty or null::Null. Or maybe none::None, by contrast with Some.

@TotalVerb
Contributor

My two cents:

  • No conversion from T to Some{T} is needed. This was useful in cases where you have a field/array of a known type that is nullable, but those cases can be replaced by Union{T, Null} instead of Option{T}. Cases where wrapped nulls may be necessary always require explicit wrapping instead of implicit conversion to be correct.
  • Shouldn't we define broadcast(f, ::Null) = null, which seems to be simple and correct?

@TotalVerb
Contributor

Another option is Union{Tuple{T}, Tuple{}} which gives covariance and the broadcast behavior for free. But this is perhaps a little too much of a pun.
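
For illustration, a short sketch of the tuple idea in current Julia. TupleOption is a hypothetical alias name; everything else is plain Base behavior.

```julia
# Zero- or one-element tuples as an Option-like type (alias name is illustrative).
const TupleOption{T} = Union{Tuple{T}, Tuple{}}

present = (0.0,)
absent  = ()

# Broadcasting works without any extra method definitions.
exp.(present)   # a one-element tuple holding exp(0.0)
exp.(absent)    # an empty tuple

# Tuples are covariant in their parameters, unlike parametric structs such as Some.
Tuple{Int} <: Tuple{Real}   # true
Some{Int} <: Some{Real}     # false
```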

@nalimilan
Member Author

@jamesonquinn The ship has sailed on using null for missing values. We're not going to change DataFrames to use na at this point.

My somewhat non-bikeshedding comment: we should definitely have different values for "or maybe not" (that is, Union{Some{T}, XXX}) and "na" (that is ?T = Union{T, YYY}). The whole point of these two objects is that they have different semantics. For instance, == and === should work the same for the former, not for the latter.

What do you mean? What exact behavior would you expect these operators would do in each case?

@nalimilan
Member Author

Shouldn't we define broadcast(f, ::Null) = null, which seems to be simple and correct?

@TotalVerb That wouldn't help with more complex cases like x .+ y, right? We need something more general if we want to support broadcast on Option.

@rfourquet
Member

Thanks for your answers. I have now read a bit more of the threads, in which some of my impressions had already been expressed:

  1. Whether we want it or not, I believe Union{T, Void} will be the de facto go-to solution for software-null: it's already used (even in Base, maybe from a time predating Nullable), is extremely simple, has become efficient, and most of the time there is no ambiguity about the meaning of nothing (i.e. it's not a possible value of T).
  2. I didn't find a clear comparison of the merits of Union{Some{T}, None} (let's call it None here to think independently of data-null) vs a Maybe type which could be implemented exactly like Nullable, or alternatively:
struct Maybe{T}
    value::Union{Some{T}, None}
    Maybe{T}() where T = new(None())
    Maybe{T}(x) where T = new(Some(x))
end

I can't come up with solid arguments for now, but I feel that it's my preferred solution:

  • It allows dealing with a well-specified type, Maybe, which is the only newly introduced name (Some remains an implementation detail). Otherwise, I'm pretty sure we will have const Maybe{T} = Union{Some{T}, None} where T, which makes two new names. And that is without counting None, if it's different from Void and from Null...
  • it seems better for type stability; even though performance would not be a concern anymore in this case, type stability helps reasoning about the code; I also imagine having a typed absent-value can come in handy.
  • Some{T} considered independently of the Union appears strange and meaningless (imagine finding a function definition like f(x::Some{Int}) = ...). Also, you cannot prevent someone from using Union{Some{T}, Void} or even Union{Some{T}, CustomNone}. Given point 1, Union{Some{T}, Void} seems to offer very little value over Union{T, Void} when no ambiguity is possible. I guess my point is that fragmentation of usages can't be avoided with the Union{Some{T}, None} solution. With Maybe{T}, it's not possible to split one component out of the "context of optional value".
  • Some being an implementation detail, None also is, and can as well be Void after all, but it doesn't matter.

PS: note that I saw, e.g. here, the similar proposal struct Maybe{T}; x::Union{T, None}; end, which however cannot distinguish an "absent value" from a "present None()"; I guess it's a simple oversight, as it was presented as an alternative implementation of Nullable.
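
To illustrate the difference between the two layouts, here is a self-contained sketch in current Julia syntax. None, FlatMaybe and hasvalue are hypothetical names introduced only for this example, and the constructors wrap with Some{T}(x) explicitly so that the field conversion is unambiguous.

```julia
struct None end   # illustrative "no value" marker

# Variant from point 2: the payload is always wrapped in Some.
struct Maybe{T}
    value::Union{Some{T}, None}
    Maybe{T}() where {T} = new{T}(None())
    Maybe{T}(x) where {T} = new{T}(Some{T}(x))
end

# Variant from the PS: the payload is stored unwrapped.
struct FlatMaybe{T}
    x::Union{T, None}
end

hasvalue(m::Maybe) = m.value isa Some

hasvalue(Maybe{Any}())         # false: absent
hasvalue(Maybe{Any}(None()))   # true: a present None()

# With the flat layout, a present None() is stored exactly like an absent value.
FlatMaybe{Any}(None()).x === None()
```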

@nalimilan
Member Author

@rfourquet I don't understand the point of defining Maybe{T} as you do above. Using Some inside a struct is redundant (you wrap values twice). You seem to be arguing in favor of keeping the current Nullable instead. That approach shares the same limitations as Nullable, in particular the "counter-factual T" issue (i.e. you need to compute T even when you don't have any value, which is sometimes problematic, cf. #22682).

Regarding the fact that people could define their custom Union{CustomNull, Some{T}}, I really don't see why they would do that when we already provide Null for that. They could be tempted to use Void, but isnull wouldn't work for it. Finally, in some cases using a different type instead of Null could be useful, e.g. to implement a Result type or to distinguish multiple kinds of nulls.

@rfourquet
Member

I interpreted the last paragraph of this comment as meaning that the Union representation can be more efficient than the Nullable implementation, hence the idea to wrap the Union to replace Nullable, but I guess this is irrelevant to the design question and I should have refrained from going into this detail: in other words, I was indeed arguing in favor of a rebranded Nullable.

I was not aware of the "counterfactual return type problem" (presented there), and I agree that it's quite convincing.

@jamesonquinn
Contributor

jamesonquinn commented Sep 9, 2017

@nalimilan Using na as the DS null and none as the CS one, purely for purposes of illustration, I'd expect na == na to return na (just as 3. * na would), while none == none and na === na (and NA == NA, that is, the type) would both be true.

This is merely one of many examples where I think the semantics of these values would be different.
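
For comparison, these semantics can be tried out in Julia 1.x, using the missing and nothing values of those releases as stand-ins for na and none (an analogy for illustration only; neither name is part of this PR's diff):

```julia
# `missing` plays the role of the data-science null ("na" above),
# `nothing` the role of the computer-science null ("none" above).

missing == missing     # propagates: the result is itself missing
3.0 * missing          # also missing
missing === missing    # true: identity comparison returns an ordinary Bool
Missing == Missing     # true: comparing the types is ordinary too

nothing == nothing     # true: no propagation for the software null
nothing === nothing    # true
```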

@nalimilan
Member Author

@nalimilan Using na as the DS null and none as the CS one, purely for purposes of illustration, I'd expect na == na to return na (just as 3. * na would), while none == none and na === na (and NA == NA, that is, the type) would both be true.

OK, I see. I wanted this behavior (three-valued logic) too for a long time, see JuliaStats/NullableArrays.jl#85 for a long discussion about that. Unfortunately, it turned out to be quite inconvenient, since Julia expects Bool in many places. After moving away from Nullable{Bool} to Union{Null, Bool}, we might be able to revisit this decision since at least all operations expecting a Bool would work in the absence of nulls even if null == null returned null; not sure. Anyway, we should have this discussion in a different issue I think, as it tends to generate long debates.

@nalimilan
Member Author

BTW, I've just created the Nullables.jl package by taking all code and tests from Base. It works on Julia 0.6 and 0.7. Comments welcome (please file issues there); at some point it should probably be moved to JuliaArchive.

@ararslan
Member

Shouldn't we define broadcast(f, ::Null) = null, which seems to be simple and correct?

No, that's incorrect for things like isnull. Yes, we can specialize on broadcast(::typeof(isnull), ::Null), but there are likely other examples that would come up and have a subtly wrong fallback.

@nalimilan
Member Author

I think this broadcast definition is fine for isnull too. isnull.(Some(1)) should return Some(false) and isnull.(Some(null)) should return true, but isnull.(null) should return null, i.e. propagate missingness. That's very different from calling isnull(null), where you want to test whether the Option itself is null: when broadcasting, you want to apply the function to the wrapped value, or return null if there's no value. This is actually equivalent to how it works for Nullable currently.

Anyway, the discussion about broadcast would better be separated, since we are not required to support it with Option.

@ararslan
Member

I'd prefer we not support broadcast with Option(al)/Maybe anyway.

Use it everywhere get() was called and nothing would not trigger an error immediately.
MaybeValue is just a simplified version of Nullable reserved for internal
use which wraps a value or no value in a type-stable way. Information
about whether a value is present or not is carried separately.
@rfourquet
Member

Alléluia 🎉

@quinnj
Member

quinnj commented Dec 16, 2017

Fantastique!

@ararslan
Member

Excellent work here, Milan, and thanks for sticking through so many rebases. You're a hero!

@mlhetland
Contributor

There are two contradictory entries in NEWS.md, with similar text except the instruction to use Void or Nothing as the type for nothing, and one mentioning the removal of NullException.

@nalimilan
Member Author

Good catch @mlhetland. See #25573.

@omus omus mentioned this pull request Nov 25, 2022
Keno pushed a commit that referenced this pull request Jun 5, 2024
Also add coalesce() function to return first non-nothing value and unwrap Some objects. Use the notnothing() function internally where it makes sense to assert that the result is different from nothing.

Use custom MaybeValue wrapper for ProductIterator to work around a performance regression due to type instability (information about whether a value is present or not is carried separately).
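
As a rough illustration of the "first non-nothing value, unwrapping Some" behavior described in the commit message: in Julia 1.x releases the Base function with these semantics is named something (coalesce there handles missing instead), so the sketch below uses that name. It shows the semantics only and is not code from the commit.

```julia
# First argument that is not `nothing`; a `Some` wrapper is removed from the result.
a = something(nothing, 1)            # 1
b = something(Some(nothing), 1)      # nothing: the wrapped nothing counts as a value
c = something(nothing, Some(2), 3)   # 2
```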
Labels
domain:missing data Base.missing and related functionality
Development

Successfully merging this pull request may close these issues.

The fate of Nullable