
Implement more operators on Nullable with lifting semantics #19034

Closed

Conversation

nalimilan
Member

This defines all arithmetic operators (plus a few others) for Nullable, with lifting semantics.

This is the next step after #18304: it implements most operators from #16988. I left out comparison operators for which it's not yet clear whether we want to return Bool or Nullable{Bool}; they will constitute the next step. The semantics for the present PR are quite clear (identical to those of lift in #18758): return Nullable{T}(op(x, y)) if neither argument is null, and Nullable{T}() if one of them is. T is chosen via promote_op to ensure type stability.
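
(For concreteness, a minimal sketch of that rule in Julia 0.5-era syntax; lifted_plus is an illustrative name, since the PR itself extends +, -, etc. directly via metaprogramming:)

# Hedged sketch of the lifting semantics described above, not the PR's exact code.
function lifted_plus{T,S}(x::Nullable{T}, y::Nullable{S})
    R = Base.promote_op(+, T, S)          # result type chosen for type stability
    if isnull(x) || isnull(y)
        return Nullable{R}()              # null propagates
    else
        return Nullable{R}(get(x) + get(y))
    end
end

lifted_plus(Nullable(1), Nullable(2))      # Nullable(3)
lifted_plus(Nullable(1), Nullable{Int}())  # Nullable{Int}()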

These are currently shipped by NullableArrays.jl, but this is not ideal as it's really type piracy, and other packages like Query.jl need them even without using NullableArrays. The idea is to include them in Base, and then use the NullableOps.jl package for Julia 0.4 and 0.5 (just like Compat).

Cc: @johnmyleswhite @davidagold @quinnj @davidanthoff @TotalVerb @vchuravy

This defines all arithmetic operators, plus a few others, for Nullable,
with lifting semantics.
@nalimilan added the missing data label on Oct 19, 2016
@davidanthoff
Contributor

Fantastic! Can't wait to see these in NullableOps.jl as well, so that I can remove my type piracy from Query.jl (which I'm also guilty of).

Maybe not the right place to discuss this, but just a follow-up question: we don't have agreement on methods for mixed Nullable and standard values, right? I would love to also see method definitions that handle things like 3 + Nullable(4) etc. But I think @johnmyleswhite was not in favor? If there is disagreement, maybe it's worth opening a new issue where we can discuss this and hopefully get some more perspective from a wider group?

@tkelman
Contributor

tkelman commented Oct 19, 2016

Jeff's points from #16988 (comment) still stand, don't they? Wouldn't using a single higher-order generalizable strategy in all cases be more consistent than having a hard-coded non-extensible set of privileged operators that have these methods defined?

@TotalVerb
Contributor

I still lean toward using .+ and .- as lifted operators. There was another proposal to use ?+ and ?-. But there needs to be a short syntax... lift(+)(x, y) is just not helpful.

@davidagold
Contributor

I, too, don't see how this is terribly different from the vectorization problem, and we know how that was resolved.

@JeffBezanson
Member

Maybe lift(+) isn't "helpful", but reality is not always helpful. I find the idea that all code and all functions must deal with Nullable unhelpful.

@TotalVerb
Contributor

@JeffBezanson That is similar to the argument that was used for vectorized functions, and I agree with it. But we came to the conclusion that .+, .*, etc. are useful, so the direction is to allow a shorter syntax.

I don't have a problem with lift(+) but that should not be the only way to lift + over nullable values. It would be great if this was accessible in a more natural way; compare

x = x .+ y .* z

with

x = lift(+)(x, lift(*)(y, z))

where the readability difference is night and day.

@TotalVerb
Contributor

I am also of the opinion that Nullable{T <: Number} really isn't so different from Number itself. It's a process more similar to augmenting a type with a NaN-like value which propagates and represents some unknown, missing, or non-standard value. I don't think many people will like having to write lift(+)(x, y) for what seem to be straightforward arithmetic operations on numeric values.

@JeffBezanson
Member

But not everybody has Nullable values. Nor are operations on Nullables uniquely defined. I also believe it's better to eliminate null values as early as possible rather than threading them through every operation. For example

if isnull(x)
    # skip this value
else
    y = sin(2*get(x) + 3)
end

rather than relying on +, *, and sin to do what you want on null.

You can also use tricks like + = lift(Base.:+) (either globally or let-bound) to get better syntax.
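
(A rough sketch of what such a lift helper could look like, following the idea of #18758; this is an assumption for illustration, not a definition that currently exists in Base:)

# Illustrative two-argument lift, roughly in the spirit of #18758 (Julia 0.5-era syntax).
unwrap{T}(::Type{Nullable{T}}) = T    # helper to recover T from Nullable{T}

function lift(f)
    (x::Nullable, y::Nullable) -> begin
        R = Base.promote_op(f, unwrap(typeof(x)), unwrap(typeof(y)))
        isnull(x) || isnull(y) ? Nullable{R}() : Nullable{R}(f(get(x), get(y)))
    end
end

lift(Base.:+)(Nullable(1), Nullable(2))      # Nullable(3)
lift(Base.:+)(Nullable(1), Nullable{Int}())  # Nullable{Int}()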

@davidanthoff
Contributor

I find the idea that all code and all functions must deal with Nullable unhelpful.

That is not the proposal here. The proposal is to add a select few methods for the most common arithmetic operators to work with Nullables. Essentially what C# has had for a decade.

I also believe it's better to eliminate null values as early as possible rather than threading them through every operation.

That simply doesn't square with the requirements of folks who do data analysis work... In some situations that can be done, but there are tons of situations where that is just not what a data analyst needs to do, and we need to provide a solution for those cases if we want Julia to play in the data science space.

The analogy to the vectorized function story has come up a couple of times. Here is how I see that: back when vectorized functions were added, no one had come up with the great idea that is now the . syntax. Thank goodness no one argued at that point "let's not put vectorized functions in because we don't have a general solution for that problem". Instead they were put in, they annoyed everyone, eventually someone came up with a better idea, and they were replaced. But in the meantime Julia was usable and provided what folks needed to get stuff done. I think we should follow the same process for lifting. No one has had a good general idea that is the equivalent of the . syntax. So let's, for now, put in the manual, not-so-elegant solution so that packages like Query.jl actually stand a chance of solving a really, really dire need in the data handling world. Once someone has a better idea, we can still switch over to it.

@andyferris
Member

I am also of the opinion that Nullable{T <: Number} really isn't so different from Number itself.

This really resonates with me. It's just providing NaN for Ints and Bools and every other type. If you really want me to think of it as a container that I can only sometimes open, I could probably live with that, but my mental model of Nullable just hasn't evolved that way.

Thank goodness no one argued at that point "lets not put vectorized functions in because we don't have a general solution for that problem". Instead they were put in, they annoyed everyone, eventually someone came up with a better idea, and they were replaced.

Right. For example, in future versions of Julia with traits, we might be able to transfer the traits of T to Nullable{T}. We just won't know how this will play out until people have had time to play with it and Julia evolves.

I left out comparison operators for which it's not yet clear whether we want to return Bool or Nullable{Bool}

Similarly, if my mental model of a Nullable is similar to NaN for Float64, then that provides the answer (Bool output, with == and isequal behaving like those with NaN inputs, same for < and isless). If it's a container, then .== should return the same container type (Nullable{Bool}). We need to choose one of these mental models, and this is the fundamental issue.
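
(To make the two models concrete: under the NaN-like model, comparisons would behave roughly as Float64 comparisons involving NaN do today:)

NaN == NaN           # false
NaN < 1.0            # false
isequal(NaN, NaN)    # true
# Under the container model, something like Nullable(1) == Nullable{Int}() would
# instead return Nullable{Bool}(), a comparison result you cannot branch on directly.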

In conclusion, Julia has allowed us to abstract away so much better than any other language I have used, which means generic functions generally work very well (i.e. for a large range of input types). Requiring the user to explicitly implement higher-order lifting (even with a single character like ?) is kind of the opposite - I always have to keep in mind whether I have T or Nullable{T}. That feels more like C++ to me.

@andyferris
Member

andyferris commented Oct 20, 2016

I also believe it's better to eliminate null values as early as possible rather than threading them through every operation.

@JeffBezanson my opinion is that your example is a great demonstration of an optimization a user might additionally provide. But Julia is usually very expressive in the sense that I can achieve a lot at the REPL with just a few lines of code. I don't want to be forced to type all that boilerplate at the REPL when y = sin(2x+3) suffices, and I don't want to be forced to provide more methods for my functions than strictly necessary to be semantically correct (I can add them later, once a prototype is working, as optimizations where necessary).

@nalimilan
Member Author

Jeff's points from #16988 (comment) still stand, don't they? Wouldn't using a single higher-order generalizable strategy in all cases be more consistent than having a hard-coded non-extensible set of privileged operators that have these methods defined?

@tkelman My position with regard to the argument that we're repeating the vectorization mistake: the plan is to provide lifted versions only for standard operators, i.e. those for which we deemed a short element-wise form (.+, etc.) necessary. After all, operators exist because we need a short way of expressing formulas. This limited lifting is what C# does, and C# is not a hacky stats-only language, so it's not totally unreasonable.

Just like for vectorization (broadcast), all other functions will need to be explicitly lifted via the lift mechanism from #18758. Of course in the future it would be great to find a more general short syntax like dot operations for vectorization, but that's yet another issue...

But not everybody has Nullable values. Nor are operations on Nullables uniquely defined.

@JeffBezanson Maybe, but those who have nullable values use them intensively and cannot be satisfied with a verbose syntax. Also, lifted operators on Nullable don't hurt people who don't use them, so I don't really see the problem.

Operations are "uniquely defined" when you accept null propagation: lifting semantics are well-defined in several languages.

@JeffBezanson
Member

Why are abs, abs2, sqrt, and cbrt included?

@nalimilan
Member Author

sqrt and cbrt have their own operator symbols (√ and ∛), that's why I included them. abs and abs2 probably shouldn't be there, though. I'm not opposed to removing them (those two, or all four).

@JeffBezanson
Member

Could somebody point me to a compelling real-world example where this helps? It still seems marginal to me. I would expect, for example, to call a function to give me just the non-null values from a vector, at which point I have a normal numeric vector and can do anything.

This strikes me as a minefield of special cases. maximum and minimum work, but std, var, and mean do not. Yet to provide this limited set of functionality, this PR adds over 130 methods.

The C# precedent is a bit compelling; if everybody thinks that worked out well then it's certainly a good data point.

@nalimilan
Member Author

A typical example came up today in a question to the mailing list: subsetting a data frame. With the current design, df[df[:y] .> 1000, :] works, but it fails after the port to Nullable. Likewise, df[:newcol] = df[:y] + 2 currently works, but it doesn't with Nullable. In a nutshell, there are many cases where you need to propagate null values, not just remove/skip them.

We want to provide user-friendly macros like DataFramesMeta, Query and StructuredQueries to make working with data easy, efficient and flexible. These could perform automatic lifting, but I think supporting these basic operations is still very useful in many cases (and I think @davidanthoff doesn't want to do automatic lifting with Query, to remain close to standard Julia syntax). I'm afraid people are going to try this and find Julia frustrating if they cannot do that: I wouldn't feel confident switching data frames to Nullable without it.

Nullable{R}($op(x.value, y.value))
end
end
$op(x::Nullable{Union{}}, y::Nullable{Union{}}) = Nullable()
Member

Are the definitions with Union{} really necessary? I thought promote_op would handle that.

Member Author

The problem is that Union{} <: NullSafeTypes, so it would be considered as safe and we would try applying the operation even if the value field isn't defined. One alternative is to define null_safe_op for Union{} for each operation (to avoid ambiguities), which isn't cleaner. Not sure whether there's a better way of doing this.

Member

That could potentially be fixed by defining NullSafeTypes as Union{Type{Int8}, Type{Int16}, ...}.
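
(The subtyping distinction that makes this work:)

Union{} <: Union{Int8, Int16}                    # true: Union{} is a subtype of every type
Type{Union{}} <: Union{Type{Int8}, Type{Int16}}  # false: wrapping in Type{...} excludes it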

Member Author

Ah, interesting. I will have a look at this solution.

@JeffBezanson
Member

I was hoping for something more real-world than df[:newcol] = df[:y] + 2. Granted, + is important, but you're going to need to learn and use a more general approach pretty quickly.

@dmbates
Member

dmbates commented Oct 20, 2016

@JeffBezanson It is even more disappointing to a new user that

y + 2

is fine when y is a numeric array but then

df[:y] = y
df[:y] + 2

would throw an error unless there is a method for + and Nullable types. The reason is that

df[:y] = y

is implemented as

df[:y] = NullableArray(y)

I feel that this design decision in the DataFrames package to transform arrays to NullableArrays unconditionally is a bad one, but I seem to be in the minority.

@JeffBezanson
Member

I feel that this design decision in the DataFrames package to transform arrays to NullableArrays unconditionally is a bad design

+100 to that. If you're lucky enough to have no missing data, I don't see why Nullable should be forced on you. CSV.jl moved to making Nullable optional, and AFAICT everything got better and everybody is happy.

Random thought: maybe we can use the C# "safe navigation" operator ?. somehow. f?.(x) is a natural possibility (currently doesn't parse). Of course that doesn't really handle infix operators though.

@davidagold
Contributor

Operations on Nullables seem to be most present in manipulations of tabular data. We're already rolling out tabular data manipulation interfaces based on macros. We can use macros to replace arbitrary calls f(xs...) with lift(f, xs...). If we're going to encourage users to use these macro-based interfaces anyway, why not just include "automatic" higher-order lifting?

Re: the C# analogy: Does C# define mixed-signature lifted operators, e.g. +(x::Nullable{Int}, y::Int)? The consensus among those who support method-extension lifting seems to be that we need these mixed-signature methods. If C# doesn't define them, then arguably we're trying to do something different with method-extension lifting, and we should understand what that difference amounts to.

People seem to believe this PR is necessary, so I won't oppose it beyond what I've said here and elsewhere. But I do predict that merging this PR commits us to an undetermined amount of maintenance for this strategy. I don't intend to rely on it.

@quinnj
Member

quinnj commented Oct 20, 2016

I actually agree with @JeffBezanson here that it feels much more sensible and long-term to come up with a way to be able to apply lifting semantics to any function call, a la vectorization.

x +? y  #  => gets lowered to lift(+)(x, y)

@nalimilan
Member Author

nalimilan commented Oct 20, 2016

We can certainly stop converting to NullableArray by default, but that's kind of orthogonal to the question of what happens with nullables.

Random thought: maybe we can use the C# "safe navigation" operator ?. somehow. f?.(x) is a natural possibility (currently doesn't parse). Of course that doesn't really handle infix operators though.

Or maybe just f?(x)? That's interesting as a compact and generic syntax. Though as you note we would still need support for infix operators: would things like ?+, ?-, etc. fit the bill, or are they too weird? How would we allow element-wise lifted operations? .?< sounds too much like a smiley...

EDIT: maybe lifted operators ?+, etc. could be element-wise too, since there's little chance you'd do matrix multiplication like X::Nullable{Matrix{Float64}} * Y::Nullable{Matrix{Float64}}.

@nalimilan
Member Author

@davidagold C# actually allows mixing nullables and scalar: https://msdn.microsoft.com/en-us/library/2cf62fcy(v=vs.140).aspx#Anchor_4
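
(In Julia terms the mixed case would amount to something like the following, shown purely to illustrate the question, not as part of this PR; it assumes the Nullable-Nullable methods already exist:)

import Base: +
+(x::Nullable, y::Number) = x + Nullable(y)   # wrap the plain value, then reuse
+(x::Number, y::Nullable) = Nullable(x) + y   # the Nullable-Nullable method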

@TotalVerb
Contributor

Is there a consensus that the existing broadcasting operators will not be appropriate for this? Before introducing new syntax we should consider whether . syntax can be reused.

@nalimilan
Member Author

@TotalVerb But then how do you apply a lifted element-wise operation like x::NullableArray .+ 2 or x::NullableArray .+ Nullable(2)?

@TotalVerb
Contributor

TotalVerb commented Oct 20, 2016

I think we should just make broadcast recursive (in the sense that transpose or norm are recursive). This would be a departure from map, but I don't see much downside. Then you just apply lifted element-wise with x .+ 2.

And lifted elementwise will be a problem anyway, right? Even if we have abs, etc. built-in, what if someone now wants besselj0 to be lifted across all elements?

@ararslan
Member

Just kind of thinking aloud, if I may...

I can kind of see both sides here, though I think I'm leaning toward agreeing with Jeff. It seems good to know what you're doing if your data are nullable, in which case having to more explicitly deal with null values via isnull seems appropriate, if a bit verbose. However, people who know what they're doing may (understandably!) get annoyed quickly without a more automatic way of dealing with nulls.

I don't think that functions, especially those in Base, should have to be able to handle nulls by default. And treating a certain group specially seems somewhat arbitrary and is perhaps a slippery slope. While introducing a convenient syntax that can be applied anywhere is tempting, something like ?+ or +? seems really obscure to me. Plus, any code that has to deal heavily with nulls will look very confusing. Imagine:

f?(x) ?+ g?(x) ?> 0 ? x ?+ 1 : y

At the same time, reusing the broadcast syntax feels like a weird pun if you aren't used to thinking of Nullables as containers. Though maybe if you do tend to think of them as containers then it's the natural choice, not sure.

Personally I think things like this could be dealt with in packages that provide abstractions and conveniences like automatic lifting by default, and Base shouldn't have to be concerned with it. People who want to deal with raw Nullables on their own can use existing Base functions, such as get and isnull. That doesn't seem like much of a problem to me, though I could very well be proved wrong.

@johnmyleswhite
Member

That's a good point. I personally am ok with adding these specific operations given the C# precedent, although I continue to think the mixed-type case is a good bit weirder than either the pure no-nullable-arguments case or the pure all-nullable-arguments case. I'd want to know when a reduction operation on an array like [1, 2, 3, 4, NULL] might end up producing a nullable type output rather than an error, but one could address that by linting against the use of certain types of abstractly-typed arrays.

@johnmyleswhite
Member

Having chimed in already, let me explain my major concern with these kinds of explicit method definitions for nullable types: they encourage people to define functions on nullables rather than employing a generic lifting strategy like dot-broadcast notation. My main fear with such ad hoc method definitions is that we'll see examples in the wild in which f(x::Nullable{T}) has behavior that subtly differs from the behavior of lift(f)(x::T), even though there's no obvious reason for that discrepancy. I think that would be a very bad state to end up in.

I believe all of the operations in this PR will match the standard lifting semantics one would expect (and so I'm ok with merging this PR), but I do worry about end-users deciding that their code would be cleaner if they only had access to a lifted version of gamma or some other function we don't lift by default. As Tony keeps noting, I worry that what I generally view as a white-list style approach to lifting individual functions by explicit method definition leads us back down the dark road we already took with both NA and vectorized functions in the past. But the C# precedent is very compelling as the designers of C# are very thoughtful and careful people. So a limited white-list is ok with me as long as everyone agrees that we should encourage almost all users to use generic lifting strategies rather than ad hoc method definitions almost all the time.

@nalimilan
Member Author

I fully agree with @johnmyleswhite's concerns. I think the best solution to discourage people from defining specialized methods on Nullable (beyond the small set of operators from Base) is to offer a nice syntax for lifting like f?(x). As recent experience shows, there's zero incentive to write vectorized methods in 0.5 now that we have the f.(x) syntax. The same will happen with Nullable if we can find the right mechanism.

It could still be interesting to get input from somebody familiar with the design of C#. If anybody knows such a person...

@johnmyleswhite
Member

I might be able to get Erik Meijer to chime in if we had specific questions to answer that I could e-mail him.

@JeffBezanson
Member

I would ask if they were overall happy with allowing operators on nullables. Did it hide bugs? Were people annoyed or confused about the specific set of operations supported? Do they wish they could have applied the idea more widely? If they could do it again, would there be more kinds of nullable?

@davidanthoff
Contributor

For Query.jl I've now completely changed strategy around this: I'm no longer using Nullable; instead I followed @andyferris's suggestion and use a data-friendly missing value type internally. The package still interops fine with sources and sinks that use Nullable. That strategy essentially solves all my problems around lifted functions for now, so at least from Query.jl's point of view I don't care anymore whether there are lifted versions of anything in Base for Nullable. See also JuliaData/Roadmap.jl#3

Having said that, I have to say that I'm not sold at all on the general lifting approach, mainly because I don't see any workable proposal on the table for it. I've written up my main objections to the current approaches here: queryverse/Query.jl#71, in case anyone is interested ;) The TL;DR version is that I believe that, unlike in the vectorization story, there is no one lifting semantics that should be applied to all functions. There are more points in the linked issue, but that is the main one.

Of course, this still leaves the question of the API for DataFrames if one doesn't want to use a query framework. I actually think the current strategy for that is a major, major mistake that will make the whole data stack in Julia even less usable than it was before. That discussion is happening here. But I think solving that with Nullables would require many more lifted versions in Base than this PR. Given the resistance to that, I don't see how that can work out. And just to be sure, I was in favor of the current strategy, so this is not an "I told you so" comment. I changed my mind over the last week or so.

@davidanthoff
Contributor

@johnmyleswhite If you are in touch with Erik it would actually be fantastic if you could also simply ask him what he would have done differently in the design of LINQ in hindsight? It is a broad question, but Query.jl is very, very close to LINQ and getting any insight into what he thinks they messed up in the original design would be unbelievably helpful at this point in Query.jl's life.

@JeffBezanson
Member

I don't consider defining a new null type and then adding methods to "every" function for it materially different from defining those same methods for Nullable. That Nullable is in Base is not really the relevant issue to me. If people want both a NULL-like type and a NaN-like type that's fine, but the approach of adding methods to "every" function is questionable in both cases. It's the difference between writing for all n, n+1 > n, and exhaustively listing examples "for all values of n needed in practice".

@ihnorton
Member

Another Eric (Lippert, formerly on the C# team and ECMAScript committee) has some interesting comments about C#'s Nullable and lifting -- design constraints, optimizations, and some issues encountered:

http://stackoverflow.com/a/9013171
http://softwareengineering.stackexchange.com/a/237746
http://stackoverflow.com/questions/18342943/serious-bugs-with-lifted-nullable-conversions-from-int-allowing-conversion-from/18343264#comment26929041_18342943
https://ericlippert.com/2012/12/20/nullable-micro-optimizations-part-one/

In particular: https://blogs.msdn.microsoft.com/ericlippert/2007/06/27/what-exactly-does-lifted-mean/

Of note there, with respect to C#'s lifting as a positive precedent for this PR:

I regret the confusion. I do not believe there is any particular sensible reason for these inconsistencies. Rather, the details were changed so many times over the years as the nullable feature was developed that these sorts of subtle problems crept into the spec and were never expunged. Though of course all of us have as a goal that the standard be a model of clarity and permanence, it is fundamentally a working, evolving, imperfect document; these kinds of things will happen. Hopefully in the next version of the standard some of these sorts of details will be tidied up.

(of course, IIUC, some of these issues were indeed cleaned up in later standard revisions)

@nalimilan
Member Author

These are interesting references. Note though that the "confusion" he apologizes for only concerns the use of the term "lifted" for operations which are actually not lifted in the strict sense: he's talking about comparison operators (==, >, etc.), which return Bool instead of Nullable{Bool} and are somewhat loosely called lifted operators in C#.

Nowhere does he actually say that lifting arithmetic operators by default was a mistake. We can even read the post as affirming the contrary: he apologizes for bad terminology, which would have been a natural occasion to say the design itself was inconsistent if he thought so.

@johnmyleswhite
Member

I had a meeting with Erik Meijer and Eric Lippert today about these issues.

Before the meeting, Erik Meijer encouraged reading his paper on Cω and the power of the dot syntax, which is very close to our new broadcast syntax. The main distinction is that Erik thinks of things as e.m(a) where e is an expression, m is a method on e and a contains the method arguments. The big idea is that you should automatically lift methods on scalars to collections. For example, field access on a value p of type Person (where the type has a field name) is done with p.name, but this is also lifted to arrays of persons so that ps.name transforms into map(p -> p.name, ps). This is also done to Nullables, which are treated as 0-or-1-element containers.

The distinction from what we offer or plan to offer (aside from lifting field access) is that this lifting operation also automatically flattens the results. This is linked to a comment Jameson recently made in an offline discussion about how our dot syntax should apply to Array{Nullable}: we want this notation to not only map over each element of the array, but we also want a second iteration of mapping to happen to each element along the way. Alternatively, the example Erik offered was mapping a function like names(p) which returns a vector of strings: this function would be automatically flattened so that ps.names would generate a vector of strings rather than a vector of vectors of strings.
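
(A rough Julia analogue of that lift-and-flatten behavior, with purely illustrative types; nothing here is proposed API:)

immutable Person                 # Julia 0.5-era syntax
    name::String
    nicknames::Vector{String}
end

ps = [Person("Ada", ["the Countess"]), Person("Grace", ["Amazing Grace"])]

map(p -> p.name, ps)                               # lifted field access: Vector{String}
reduce(vcat, String[], map(p -> p.nicknames, ps))  # lifted and flattened: Vector{String}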

These ideas did not generally make it into C# and stayed only in the Cω prototype.

In regards to C#'s development, Eric Lippert first said the following in an e-mail before we met in person to discuss the same issues:

· How does C# generate code for lifted operators? How does it optimize them? I wrote all the code in Roslyn that does that, and documented it here:

https://ericlippert.com/2012/12/20/nullable-micro-optimizations-part-one/
https://ericlippert.com/2012/12/27/nullable-micro-optimization-part-two/
https://ericlippert.com/2013/01/03/nullable-micro-optimization-part-three/
https://ericlippert.com/2013/01/07/nullable-micro-optimization-part-four/
https://ericlippert.com/2013/01/10/nullable-micro-optimization-part-five/
https://ericlippert.com/2013/01/15/nullable-micro-optimizations-part-six/
https://ericlippert.com/2013/01/17/nullable-micro-optimizations-part-seven/
https://ericlippert.com/2013/01/21/nullable-micro-optimization-part-eight/

· What, if any, are the semantics of lifted logical operations on Booleans?

https://blogs.msdn.microsoft.com/ericlippert/2012/03/26/null-is-not-false/
https://blogs.msdn.microsoft.com/ericlippert/2012/04/12/null-is-not-false-part-two/

· I would ask if they were overall happy with allowing operators on nullables.

Overall, yes, I think the general consensus is that nullable lifting on operators is good in the aggregate. It’s a nice concept.

In the specifics… oh my goodness, what a godawful mess. The specification for lifting on user-defined conversions in C# is a mass of special cases and confusion, and the implementation frequently diverges greatly from the specification. I was unable to reconcile them in Roslyn, and tried my best to preserve the awful behavior of C# 6 in Roslyn.

I would not recommend copying the C# rules for lifting conversions to nullable, and certainly do not attempt to match the actual compiler behavior.

For excessive details regarding the specified rules and how C# departs from them, read all the comments in these files:

https://github.com/dotnet/roslyn/blob/master/src/Compilers/CSharp/Portable/Binder/Semantics/Conversions/UserDefinedExplicitConversions.cs
https://github.com/dotnet/roslyn/blob/master/src/Compilers/CSharp/Portable/Binder/Semantics/Conversions/UserDefinedImplicitConversions.cs

Pre-roslyn there were many many bugs – not just the compiler not following the spec, but the compiler also producing plain crazy output. Example:

http://stackoverflow.com/questions/6256847/curious-null-coalescing-operator-custom-implicit-conversion-behaviour/6271607#6271607

· Did it hide bugs?

It’s not so much that it hides bugs as it creates unexpected gotchas. There are a number of lifting rules that I would have tweaked. For example, in C# an explicit null on either side of an equality test is handled specially, but the compiler is occasionally not smart enough to notice that

if (some_non_nullable_type == null)  foo();

has an unreachable path.

http://stackoverflow.com/questions/2177850/guid-null-should-not-be-allowed-by-the-compiler

This is just dumb; I think I broke the warning by accident in C# 3 and we never adequately fixed it.

There are a number of odd gotchas like that where small tweaks or simplifications could have found bugs.

· Were people annoyed or confused about the specific set of operations supported?

This isn’t a big source of user complaints to my knowledge.

· Do they wish they could have applied the idea more widely?

There was a lot of feeling over the years that it would have been nice to lift the member access operator, “.” – which the C# team finally did by making an explicit ?. operator.

· If they could do it again, would there be more kinds of nullable?

Visual Basic 6 had Null (database null), Nothing (invalid object reference), Empty (default value) , Undefined (an uninitialized variant), Missing (the value passed when an optional argument is omitted) and Error types. OMG WHAT A MESS. No one understood the distinctions between them, how they compared, what the semantics were, and so on. I spent so much time in the VBScript runtime getting all this nonsense sorted out and I will never get that time back.

If we had to do it all over again, I think I would want to make the connection between null reference and null value stronger and more consistent. Starting with nullable reference types, then grafting on nullable value types one version later, and then attempting to do non-nullable reference types a decade after that, made for a bit of a mess. I wouldn’t want to multiply the number of kinds of nulls.

And while we’re dreaming, it also would have been nice to be really, really clear about the intended semantics of null. “Unknown”, “missing” and “invalid” all have subtly different semantics but we use null for all of them. (“True or unknown” is definitely true, but “true or invalid” could arguably resolve to invalid.)

I think it might have led to fewer user problems had we settled on one and really clearly sent that message to developers.

Building off of these points in person, we discussed our broadcast syntax and how it relates to Erik Meijer's ideas for Cω. Both Erik and Eric were in favor of using our dot notation to automatically lift functions on Nullables. In addition, they thought that most binary operators like + should be equivalent to .+ and therefore automatically provide lifted semantics.

They also suggested implementing automatic flattening. We'll need to think through those details for how they'd apply in Julia.

In general, they note that our dot syntax should continue down the road @nalimilan has already headed by transforming f.(x) into broadcast(lift(f), x), which provides the natural lifting semantics that map null to null by default, but allows special functions to behave differently.

In that regard, the functions that deserve special semantics are get, isnull, ==, ===, and Boolean functions like & and |, which should probably implement three-valued logic as they do in C#.
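
(For reference, a sketch of what three-valued & on Nullable{Bool} would look like under the SQL/C# rule described above; and3 is a hypothetical name, not code from this PR:)

# false & anything is false; true & null and null & null are null.
function and3(x::Nullable{Bool}, y::Nullable{Bool})
    if (!isnull(x) && !get(x)) || (!isnull(y) && !get(y))
        return Nullable(false)
    elseif isnull(x) || isnull(y)
        return Nullable{Bool}()
    else
        return Nullable(true)
    end
end

and3(Nullable(false), Nullable{Bool}())   # Nullable(false)
and3(Nullable(true), Nullable{Bool}())    # Nullable{Bool}()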

Important things we should do differently than C#:

  • We should insist that x == null and x == y both use natural lifting and therefore generate null when y is null. We should not allow any sort of truthiness. They were not able to do this in C# because backwards compatibility with reference types required them to preserve the tradition that x == null generates a bool. Instead, we should do things cleanly and use the special function isnull(x) to generate Bool, while == should generate Nullable{Bool}.

More broadly, they suggested we should choose one semantic interpretation of null and stick with it everywhere. They suggested sticking with database semantics for null and accepting that this will end up conflating database null with things like null pointers and invalid values that one might otherwise prefer not to propagate. If one wants to avoid Tony Hoare's billion-dollar mistake, this should be done by discouraging the use of nullable when they are not needed and using static analysis tooling to detect and warn about the misuse of nullables.

Finally, they suggested following C#'s lead and defining three-valued logic for non-short circuiting Boolean functions, but not allowing nullable values at all with short-circuiting Boolean operators. Eric's blog posts linked above detail the arguments for this distinction and I find his position very compelling.

I believe that summarizes all of my communications with them on the topic. Happy to try translating all of this content and some recent offline conversations I've had with other committers into a formal design for where we want Nullable to end up.

@davidanthoff
Contributor

@johnmyleswhite Cool, thanks, this is fantastic! And will take a while to digest (at least for me).

One quick question on something specific:

We should insist that x == null and x == y both use natural lifting and therefore generate null when y is null

Did they give a reason for that recommendation?

@johnmyleswhite
Member

I'm not sure, except insofar as returning a nullable prevents one from using that idiom in condition clauses like if and while.

@nalimilan
Member Author

Thanks for writing this detailed summary. Sounds like a great plan in general, and it's reassuring that they agree with our previous decisions, while giving insights on the ones we were unsure about.

I just have a small reservation regarding this:

Finally, they suggested following C#'s lead and defining three-valued logic for non-short circuiting Boolean functions, but not allowing nullable values at all with short-circuiting Boolean operators. Eric's blog posts linked above detail the arguments for this distinction and I find his position very compelling.

Eric's blog post only gives one argument in favor of not supporting Nullable with short-circuiting operators: to preserve (in x && y) "the nice property that y is only evaluated if x is true". As noted in a comment on that post, it's not obvious that this property is really better than "the second operand is evaluated if and only if it is necessary to determine the answer". Actually, the latter sounds much more natural to me, as it works for both && and ||, while the former must be reversed to apply to ||. With that definition, null can be supported with short-circuiting operators -- unless we find other reasons not to.

Anyway, that's really a detail and it can be changed later without breaking backward compatibility, so I don't think we need to decide this right now. But I just wanted to mention it after reading the blog post.

@tkelman
Contributor

tkelman commented Dec 9, 2016

Short circuiting operators are control flow. Allowing nullables with "if necessary" semantics would be equivalent to truthiness in if statements.

Would it achieve the same goals of avoiding misuse if non-lifted boolean operations on nullables were "not comparable" errors, and opt-in lifting could give you 3VL when requested? Default 3VL seems like it might put us in an odd place where some generic code might do a sane thing on nullable inputs, but anything that used comparison results in control flow would error until explicit null handling was added. Do we want to require explicit null handling everywhere, somewhere (control flow only?), or nowhere?

@johnmyleswhite
Member

Default 3VL seems like it might put us in an odd place where some generic code might do a sane thing on nullable inputs

Is this right? Since 3VL logic generates Nullable{Bool}, it can't be used in control flow anywhere even when the wrapped value is true or false?

@tkelman
Contributor

tkelman commented Dec 9, 2016

julia> if Nullable(true)
       println("ok")
       end
ERROR: TypeError: non-boolean (Nullable{Bool}) used in boolean context

Code that uses control flow would not work unmodified if it gets a nullable in the condition (unless we changed the above behavior). It's code that doesn't use any comparison results in control flow (not the biggest subset, but "some") that would sometimes give you a useful Nullable answer, without needing to change anything if you have 3VL auto lifting.

@johnmyleswhite
Member

Code that uses control flow would not work unmodified if it gets a nullable in the condition (unless we changed the above behavior). It's code that doesn't use any comparison results in control flow (not the biggest subset, but "some") that would sometimes give you a useful Nullable answer, without needing to change anything if you have 3VL auto lifting.

I'm still confused. 3VL would apply to Nullable{Bool}, not Bool, so code like your example doesn't work now and wouldn't work after implementing 3VL. What changes in that regard?

@tkelman
Contributor

tkelman commented Dec 10, 2016

My point isn't that 3VL changes the behavior of code that has nullable comparisons in control flow. It's that it changes the behavior of code that doesn't - you would start getting nullable from comparisons as the output of some functions, if they managed not to error. You couldn't use the output in logical indexing or other places that expect a boolean though. So what advantage does auto lifting give, how often will the 3VL result actually be what you wanted? Generic code is probably written assuming comparisons return boolean, and that seems like an important part of the contract of what those operators should mean generically.

@nalimilan
Member Author

I agree it doesn't bring a very clear advantage if nullable is still not supported in control flow. Though it doesn't hurt either, since code would have failed even earlier without it.

Anyway, my point was that short-circuiting 3VL can be given clear and quite natural semantics, so that's not an argument for not implementing them. There may well be other arguments not to implement it, like the fact that it's not very useful in isolation.

@nalimilan
Member Author

Closing since we now have #16961.
