Change comparison operators to return null when one of the operands is null #33

nalimilan · 2017-09-27T16:50:48Z

This is consistent with DataArrays, SQL and R, is more consistent with the null-propagating behavior used with other operations, and is safer in case null values were not expected (as an error will be thrown).

isequal() and isless() still return true or false.

After discussing this, it appears returning null is the safest and most standard behavior for data analysis. For reference, previous discussions happened at JuliaLang/julia#19034 (comment) (and following comments), JuliaStats/NullableArrays.jl#85 and JuliaData/DataFramesMeta.jl#58.

…s null This is consistent with DataArrays, SQL and R, is more consistent with the null-propagating behavior used with other operations, and is safer in case null values were not expected (as an error will be thrown). isequal() and isless() still return true or false.

codecov-io · 2017-09-27T16:50:50Z

Codecov Report

Merging #33 into master will decrease coverage by 1.63%.
The diff coverage is 93.33%.

@@            Coverage Diff             @@
##           master      #33      +/-   ##
==========================================
- Coverage     100%   98.36%   -1.64%     
==========================================
  Files           1        1              
  Lines          50       61      +11     
==========================================
+ Hits           50       60      +10     
- Misses          0        1       +1

Impacted Files	Coverage Δ
src/Nulls.jl	`98.36% <93.33%> (-1.64%)`	⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 53878ba...43b130d. Read the comment docs.

nalimilan · 2017-09-27T16:57:17Z

Coverage fall seems spurious since the offending line is clearly tested.

ararslan · 2017-09-27T18:36:37Z

I imagine this will be fairly breaking for any packages currently relying on Nulls. We may have wanted to ensure we get a patch release out with any non-breaking changes (if any exist) before merging this.

nalimilan · 2017-09-27T18:53:41Z

Yes, in particular we need to fix DataFrames and add upper bounds. The positive side is that it will be less breaking for people migrating from the current DataFrames release.

davidanthoff · 2017-09-28T20:02:35Z

I only saw this now. I'm still of the same opinion that I was a year ago: I think this is not a good choice. In fact, one of the main reasons I created DataValue in the first place and used that in Query.jl was that I feel very strongly that I need a type for missing data that returns a Bool for predicates and doesn't throw users into 3VL land whenever they use a predicate. The DataValue semantics have been in use in Query.jl for about a year now and have worked great, so if anything, that experience has convinced me even more that it was the right choice.

ararslan · 2017-09-28T20:21:35Z

I kind of agree, though as much as I really dislike 3VL, it's what's familiar to users of R and SQL, so it would be potentially more confusing to newcomers to have our own conventions. Personally I prefer NaN semantics, where nothing is equal to null, not even null, but this can cause issues in practice, especially when selecting groups for an analysis.

nalimilan · 2017-09-28T20:25:16Z

Yes, I know you still don't like this. But we've discussed it again, based on how R, SQL and DataArrays work, and given the feedback we've had from Erik Lippert at JuliaLang/julia#19034 (comment). We also considered the fact that we generally do not drop/skip nulls silently (e.g. with sum), which 1 == null -> null actually implies in cases like x = [1, null]; x[x> 1]. Always propagating nulls is the safest solution, in particular for scientific work, as it forces people to think about how nulls are treated. Finally, it makes DSLs like Query more consistent when translated to SQL, which isn't negligible.

davidanthoff · 2017-10-07T00:28:18Z

I only saw this now. I'm still of the same opinion that I was a year ago: I think this is not a good choice.

Actually, I take that back. One of the main reasons I didn't like this for Nullable/DataValue is that I wanted something like if DataValue(3)==3 to just work. But that of course is not an issue with the Union{T,Null} approach.

I still don't like the SQL behavior of dropping rows for which a filter predicate returns null, but that is of course independent of the definition of these operators. So if we manage to sort out the Nulls.jl/Query.jl compat story, we could just make any filter throw an error if the filter predicate returns a null, exactly like an if statement throws an error if it sees a ``null` value.

nalimilan · 2017-10-07T11:50:05Z

Actually, I take that back. One of the main reasons I didn't like this for Nullable/DataValue is that I wanted something like if DataValue(3)==3 to just work. But that of course is not an issue with the Union{T,Null} approach.

Great! Indeed that's exactly the advantage of Union{T, Null}: it behaves strictly as þ when there are no nulls.

I still don't like the SQL behavior of dropping rows for which a filter predicate returns null, but that is of course independent of the definition of these operators. So if we manage to sort out the Nulls.jl/Query.jl compat story, we could just make any filter throw an error if the filter predicate returns a null, exactly like an if statement throws an error if it sees a ``null` value.

Yeah, I agree that silently dropping nulls is inconsistent with what we do elsewhere, and apparently it also confuses SQL users. I have a few ideas about how to handle this with arguments to where/filter, let's discuss this at JuliaData/DataFramesMeta.jl#58 (where some discussion has happened already).

nalimilan mentioned this pull request Sep 27, 2017

Port to Nulls.jl JuliaStats/DataArrays.jl#288

Merged

quinnj approved these changes Sep 27, 2017

View reviewed changes

quinnj merged commit 03edaac into master Sep 27, 2017

quinnj deleted the nl/cmp branch September 27, 2017 18:05

cjprybol mentioned this pull request Sep 29, 2017

README mistake? #35

Closed

This was referenced Sep 30, 2017

Restore old == using three-valued logic, separate from isequal() JuliaData/DataFrames.jl#1242

Merged

(NA == NA) should return true and (NA == 1) should return false JuliaData/DataFrames.jl#1244

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Change comparison operators to return null when one of the operands is null #33

Change comparison operators to return null when one of the operands is null #33

nalimilan commented Sep 27, 2017 •

edited

Loading

codecov-io commented Sep 27, 2017 •

edited

Loading

nalimilan commented Sep 27, 2017

ararslan commented Sep 27, 2017

nalimilan commented Sep 27, 2017

davidanthoff commented Sep 28, 2017

ararslan commented Sep 28, 2017

nalimilan commented Sep 28, 2017

davidanthoff commented Oct 7, 2017

nalimilan commented Oct 7, 2017

Change comparison operators to return null when one of the operands is null #33

Change comparison operators to return null when one of the operands is null #33

Conversation

nalimilan commented Sep 27, 2017 • edited Loading

codecov-io commented Sep 27, 2017 • edited Loading

Codecov Report

nalimilan commented Sep 27, 2017

ararslan commented Sep 27, 2017

nalimilan commented Sep 27, 2017

davidanthoff commented Sep 28, 2017

ararslan commented Sep 28, 2017

nalimilan commented Sep 28, 2017

davidanthoff commented Oct 7, 2017

nalimilan commented Oct 7, 2017

nalimilan commented Sep 27, 2017 •

edited

Loading

codecov-io commented Sep 27, 2017 •

edited

Loading