Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change comparison operators to return null when one of the operands is null #33

Merged
merged 1 commit into from
Sep 27, 2017

Conversation

nalimilan
Copy link
Member

@nalimilan nalimilan commented Sep 27, 2017

This is consistent with DataArrays, SQL and R, is more consistent with the null-propagating behavior used with other operations, and is safer in case null values were not expected (as an error will be thrown).

isequal() and isless() still return true or false.


After discussing this, it appears returning null is the safest and most standard behavior for data analysis. For reference, previous discussions happened at JuliaLang/julia#19034 (comment) (and following comments), JuliaStats/NullableArrays.jl#85 and JuliaData/DataFramesMeta.jl#58.

…s null

This is consistent with DataArrays, SQL and R, is more consistent with
the null-propagating behavior used with other operations, and is safer
in case null values were not expected (as an error will be thrown).

isequal() and isless() still return true or false.
@codecov-io
Copy link

codecov-io commented Sep 27, 2017

Codecov Report

Merging #33 into master will decrease coverage by 1.63%.
The diff coverage is 93.33%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #33      +/-   ##
==========================================
- Coverage     100%   98.36%   -1.64%     
==========================================
  Files           1        1              
  Lines          50       61      +11     
==========================================
+ Hits           50       60      +10     
- Misses          0        1       +1
Impacted Files Coverage Δ
src/Nulls.jl 98.36% <93.33%> (-1.64%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 53878ba...43b130d. Read the comment docs.

@nalimilan
Copy link
Member Author

Coverage fall seems spurious since the offending line is clearly tested.

@quinnj quinnj merged commit 03edaac into master Sep 27, 2017
@quinnj quinnj deleted the nl/cmp branch September 27, 2017 18:05
@ararslan
Copy link
Member

I imagine this will be fairly breaking for any packages currently relying on Nulls. We may have wanted to ensure we get a patch release out with any non-breaking changes (if any exist) before merging this.

@nalimilan
Copy link
Member Author

Yes, in particular we need to fix DataFrames and add upper bounds. The positive side is that it will be less breaking for people migrating from the current DataFrames release.

@davidanthoff
Copy link

I only saw this now. I'm still of the same opinion that I was a year ago: I think this is not a good choice. In fact, one of the main reasons I created DataValue in the first place and used that in Query.jl was that I feel very strongly that I need a type for missing data that returns a Bool for predicates and doesn't throw users into 3VL land whenever they use a predicate. The DataValue semantics have been in use in Query.jl for about a year now and have worked great, so if anything, that experience has convinced me even more that it was the right choice.

@ararslan
Copy link
Member

I kind of agree, though as much as I really dislike 3VL, it's what's familiar to users of R and SQL, so it would be potentially more confusing to newcomers to have our own conventions. Personally I prefer NaN semantics, where nothing is equal to null, not even null, but this can cause issues in practice, especially when selecting groups for an analysis.

@nalimilan
Copy link
Member Author

Yes, I know you still don't like this. But we've discussed it again, based on how R, SQL and DataArrays work, and given the feedback we've had from Erik Lippert at JuliaLang/julia#19034 (comment). We also considered the fact that we generally do not drop/skip nulls silently (e.g. with sum), which 1 == null -> null actually implies in cases like x = [1, null]; x[x> 1]. Always propagating nulls is the safest solution, in particular for scientific work, as it forces people to think about how nulls are treated. Finally, it makes DSLs like Query more consistent when translated to SQL, which isn't negligible.

@davidanthoff
Copy link

I only saw this now. I'm still of the same opinion that I was a year ago: I think this is not a good choice.

Actually, I take that back. One of the main reasons I didn't like this for Nullable/DataValue is that I wanted something like if DataValue(3)==3 to just work. But that of course is not an issue with the Union{T,Null} approach.

I still don't like the SQL behavior of dropping rows for which a filter predicate returns null, but that is of course independent of the definition of these operators. So if we manage to sort out the Nulls.jl/Query.jl compat story, we could just make any filter throw an error if the filter predicate returns a null, exactly like an if statement throws an error if it sees a ``null` value.

@nalimilan
Copy link
Member Author

Actually, I take that back. One of the main reasons I didn't like this for Nullable/DataValue is that I wanted something like if DataValue(3)==3 to just work. But that of course is not an issue with the Union{T,Null} approach.

Great! Indeed that's exactly the advantage of Union{T, Null}: it behaves strictly as þ when there are no nulls.

I still don't like the SQL behavior of dropping rows for which a filter predicate returns null, but that is of course independent of the definition of these operators. So if we manage to sort out the Nulls.jl/Query.jl compat story, we could just make any filter throw an error if the filter predicate returns a null, exactly like an if statement throws an error if it sees a ``null` value.

Yeah, I agree that silently dropping nulls is inconsistent with what we do elsewhere, and apparently it also confuses SQL users. I have a few ideas about how to handle this with arguments to where/filter, let's discuss this at JuliaData/DataFramesMeta.jl#58 (where some discussion has happened already).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants