nulls in expressions #905
Comments
Thanks for writing this up @aljazerzen! Very clear. One option is to only apply this to
column == true -> column IS TRUE (do)
column == false -> column IS FALSE (do)
column == 15 -> COALESCE(column = 15, FALSE) (don't do)
A couple of points:
|
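The three rewrites listed in the comment above, together with the `x == null` to `x IS NULL` rule from the issue description, can be written down as a small rule table. The sketch below is a hypothetical Python helper: `translate_eq` and its `coalesce_scalars` flag are invented for illustration and are not part of the PRQL compiler. It only renders SQL text for the cases named in this thread.

```python
def translate_eq(column: str, literal: str, coalesce_scalars: bool = False) -> str:
    """Hypothetical helper: render `column == literal` as SQL text."""
    lowered = literal.lower()
    # Comparisons against true / false / null become IS tests (the "do" cases).
    if lowered in ("true", "false", "null"):
        return f"{column} IS {lowered.upper()}"
    plain = f"{column} = {literal}"
    # The "don't do" case: wrap scalar comparisons so they never emit NULL.
    if coalesce_scalars:
        return f"COALESCE({plain}, FALSE)"
    return plain


print(translate_eq("column", "true"))
print(translate_eq("column", "15"))
print(translate_eq("column", "15", coalesce_scalars=True))
```

With the flag off, scalar comparisons keep SQL's behaviour of yielding NULL for a NULL operand; with it on, they always yield a boolean.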
Your first point is the major one: when [...]
But, a consequence is that current status is quite close to what we want, even without translating [...]
To close this issue completely, I think that to coalesce correctly in your 3rd point, we would have to know what value of scalar [...]
In summary, right now we are very close to done, but to finish now it would be a herculean effort. One easy thing now, is to translate [...] |
I'm not quite following all the points in this discussion but I don't think the following is the case:
On the PRQL playground the above query translates to:
SELECT
  employees.*,
  last_salary = 1160 AS has_minimum_wage
FROM
  employees
which will emit [...]
What I recall from our last discussion on this topic, is that [...]
from employees
filter last_salary == null
which gets translated to
SELECT
  employees.*
FROM
  employees
WHERE
  last_salary IS NULL
which is presumably what you want since a query where the filter condition emits only [...]
Therefore I think everything is fine as it is and no change is necessary. What am I missing? |
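To make the difference between the two possible translations of `filter last_salary == null` concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table contents and the `names` helper are made up for illustration; only the `IS NULL` versus `= NULL` contrast comes from the discussion.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, last_salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("ann", 1160), ("bob", 2000), ("cy", None)],  # cy has no salary yet
)


def names(where_clause):
    """Return employee names matching a WHERE clause (illustrative helper)."""
    rows = conn.execute(
        f"SELECT name FROM employees WHERE {where_clause} ORDER BY name"
    )
    return [name for (name,) in rows]


# Translation used by PRQL for `== null`: matches the row with the missing salary.
is_null_rows = names("last_salary IS NULL")
# Naive translation: the comparison yields NULL for every row, so nothing matches.
eq_null_rows = names("last_salary = NULL")

print(is_null_rows, eq_null_rows)
```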
Well, you are looking at it from a "how to translate" point of view:
This point of view works for now, but consider we add variables and evaluation into PRQL:
let my_null = null;
let my_number = 8;
from numbers # -> FROM numbers
derive c = b + my_number # -> SELECT b + 8 AS c
filter foo == my_null # -> WHERE ?
Now should the last transform translate to [...]
But what if [...]
What if we instead had [...]
My point here is that this point of view does not cover all the cases that may arise in the future, which may lead to some language invariants being broken. For example, one would expect that these two are the same: [...]
Instead, I propose we decide what the results [...]
On the PRQL playground the above query translates to:
SELECT
  employees.*,
  last_salary = 1160 AS has_minimum_wage
FROM
  employees
which will emit NULLs just as expected.
What I propose is to expect this query to emit FALSEs instead of NULLs. I'm basically asking for permission to translate all [...] |
Thanks for the clarification. I'm not sure about all the implications when you add variables and expression evaluation to the compiler. I would suggest we go with postponing this as you suggested (and stick to the YAGNI principle until then):
With regard to the larger point of creating our own semantics around [...]
Also, in many applications it's important to be able to distinguish missing data, i.e. NULL or NA, from falsey values, so an automatic translation sounds like a bad idea to me. |
So carrying on with the previous example
from employees
derive [
  has_minimum_wage = last_salary == 1160
]
I want this to emit [...]
from employees
derive [
  has_minimum_wage = last_salary == 1160 ?? false
]
which gets translated to
SELECT
  employees.*,
  COALESCE(last_salary = 1160, false) AS has_minimum_wage
FROM
  employees
If you put that [...] |
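The two SQL forms in this comment can be compared side by side. The following is a minimal sketch with Python's built-in `sqlite3`; the sample rows are invented, and `0` stands in for `false` because SQLite represents booleans as integers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, last_salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("ann", 1160), ("bob", 2000), ("cy", None)],  # cy has no salary yet
)

# Plain comparison: NULL salary gives a NULL result (None in Python).
plain = conn.execute(
    "SELECT name, last_salary = 1160 FROM employees ORDER BY name"
).fetchall()

# Coalesced comparison: NULL result is replaced by false (0).
coalesced = conn.execute(
    "SELECT name, COALESCE(last_salary = 1160, 0) FROM employees ORDER BY name"
).fetchall()

print(plain)
print(coalesced)
```

The only row that differs is the one with the missing salary, which is exactly the case the thread is arguing about.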
You'd have to use an (unimplemented) ternary operator:
from employees
derive [
  has_minimum_wage = last_salary == null ? null : last_salary == 1160
]
I see that is worse than what we have now, but the problem is that we are currently in-between the two camps:
So we picked the best of the two worlds and haven't committed to one of them fully. I'm saying that this may come back and bite us, but for now we can leave this as is. |
A minimal example of the problem:
... evaluates to |
So we should just type juggle and coerce values into something else and hope it's what was intended? So 10 problems + [...]
Returning an error is not the same as "throwing" an error, and is not the same as a Null Pointer Exception (NPE). The PRQL compiler can return a list of error messages. |
I don't agree that it should evaluate to [...]
Moreover
from [{a = 1, b = null}]
select {a == b, b == b, b == null}
evaluates to [...]
which I think is perfect. We're keeping the SQL 3-value logic around NULL while providing the [...]
It's the last one that's really the odd one out so why do I think that's justified? The first two behaviours around [...]
In my mind the last one is different because what could [...]
I agree this is a bit inconsistent and will probably trip up newcomers (and maybe others as well). However I think it's an acceptable price in this instance. If people think it's too confusing then we may have to abandon that syntactic shortcut (but let's discuss that first). However I think we can't have [...] |
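The result table for this query did not survive in the thread, so the sketch below shows what SQLite returns for the analogous SQL, assuming `a == b` and `b == b` are translated to `=` and `b == null` to `IS NULL` as described elsewhere in this discussion. It uses Python's built-in `sqlite3` and is not PRQL compiler output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One row with a = 1 and b = NULL, mirroring `from [{a = 1, b = null}]`.
row = conn.execute(
    "SELECT a = b, b = b, b IS NULL FROM (SELECT 1 AS a, NULL AS b)"
).fetchone()

# The first two comparisons involve NULL and yield NULL (None in Python);
# only the IS NULL test yields a definite true (1).
print(row)
```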
@vanillajonathan do I understand you correctly: [...]
@snth I have a strong opinion against 3-value logic and an even stronger one against our current in-between behavior. That's because with current semantics, it is not possible to easily "extract a variable":
... here, one would expect that comparing [...]
I don't understand what you mean with "lose access to a whole lot of data representations around missing values". The proposed change in this issue would only change the "default" behavior of what happens when an operand is null, but one can still explicitly state that behavior:
... or make a helper function for it. So I don't think the convenience of 3-value logic in some cases is worth the massive inconsistency in a few others. |
@aljazerzen Yeah, Python seems to handle this nicely.
In [1]: 3 + None
-----------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[1], line 1
----> 1 3 + None
TypeError: unsupported operand type(s) for +: 'int' and 'NoneType' |
Just to add an example (because I lost it when my browser restarted today):
let to_int = func x -> case [x=='null' => null, true => (x | as int)]
let sentinel = -999
from_text format:csv """i,a_i,b_i
0,0,0
1,1,10
2,2,null
3,null,30
4,null,null
"""
sort i
select {a=a_i|to_int, b=b_i|to_int, current=a==b, proposed=(a??sentinel)==(b??sentinel)}
which yields
I will come back to add some text as to why I think the proposed behaviour of treating null as a well defined value is a problem. In the meantime, it would be easy enough to add the following function for anyone that really wants the proposed behaviour:
let eq_null = func y z s:sentinel -> (y??s)==(z??s)
and then
derive eq_null=(eq_null a b s:-1)
would give:
|
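The result tables from the comment above were lost, so here is a small pure-Python simulation of the two columns on the same five rows. It is a sketch under assumptions: `eq_three_valued` models SQL's `=` (NULL in, NULL out) as the "current" behaviour, and `eq_with_sentinel` models the `(a ?? sentinel) == (b ?? sentinel)` expression; both function names are made up here.

```python
from typing import Optional

SENTINEL = -999  # same sentinel value as in the PRQL example


def eq_three_valued(a: Optional[int], b: Optional[int]) -> Optional[bool]:
    """Model of SQL `a = b`: any NULL operand makes the result NULL."""
    if a is None or b is None:
        return None
    return a == b


def eq_with_sentinel(a: Optional[int], b: Optional[int], sentinel: int = SENTINEL) -> bool:
    """Model of `(a ?? sentinel) == (b ?? sentinel)`: always a boolean."""
    left = sentinel if a is None else a
    right = sentinel if b is None else b
    return left == right


rows = [(0, 0), (1, 10), (2, None), (None, 30), (None, None)]
table = [(a, b, eq_three_valued(a, b), eq_with_sentinel(a, b)) for a, b in rows]

for a, b, current, proposed in table:
    print(a, b, current, proposed)
```

One caveat of the sentinel approach is that it silently gives the wrong answer if the sentinel value ever occurs as real data.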
TIL about <expression> IS [NOT] DISTINCT FROM <expression>
See Markus Winand's page: NULL-Aware Comparison: is [not] distinct from
This yields a better implementation for dialects where this is supported:
let is_distinct = func a1 b1 -> s"({a1} IS DISTINCT FROM {b1})"
let not_distinct = func a2 b2 -> s"({a2} IS NOT DISTINCT FROM {b2})"
For dialects where [...]
let eq_null = func a3 b3 -> case [a3==b3 || (a3==null && b3==null) => 1, true => 0] == 1
Adding this to my previous example:
derive {eq_null=(eq_null a b), not_distinct=(not_distinct a b), is_distinct=(is_distinct a b)}
we get:
|
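As a rough check that the `CASE` fallback agrees with a NULL-aware comparison, the sketch below runs both on the five example rows in SQLite via Python's `sqlite3`. It assumes SQLite's `IS` operator as a stand-in for `IS NOT DISTINCT FROM` (SQLite documents `IS` as behaving like `=` except that two NULLs compare as equal); other dialects may need the standard spelling.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(0, 0), (1, 10), (2, None), (None, 30), (None, None)],
)

query = """
SELECT
  a IS b AS null_safe_eq,
  CASE WHEN a = b OR (a IS NULL AND b IS NULL) THEN 1 ELSE 0 END = 1 AS eq_null
FROM t
ORDER BY rowid
"""
results = conn.execute(query).fetchall()

# Each tuple is (null_safe_eq, eq_null); neither column ever contains NULL.
print(results)
```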
Great overview. It showcases the problem with 3-value logic: there are so many more possible comparisons, which creates a need for different comparison operators. This then spirals out of control and you end up with a bizarre amount of "is equal" operators. |
This page has an even better overview of The Three-Valued Logic of SQL. I haven't read all of it yet but one conclusion I came away with from that article is that even if you patch the [...]
Thinking on this over the last day or two, I think I have a proposal that could suit all of us. First I want to say a bit about the three-valued logic of missing or unknown values and its importance to data analysis. Disclaimer: I had ChatGPT write this out for me based on my prompt.
Three-Valued Logic in Data Analysis
The concept of "three-valued logic" in data analysis, particularly in the context of handling missing values, is a fundamental aspect across various data processing systems like SQL, Excel, R, and Python's Pandas. Let's explore why this behavior is not only intentional but crucial.
Key Points of Three-Valued Logic in Data Analysis
Detailed Exploration
1. Representation of Uncertainty
2. Preventing Misleading Interpretations
3. Data Integrity and Quality
4. Flexibility in Data Analysis
Conclusion
The implementation of three-valued logic across various data systems is a deliberate choice to accurately represent the realities of data collection and analysis.
Proposal
The benefits I see with this are:
|
One thing that bothered me while reading this discussion was the handling of nulls in joins. |
FYI, Julia has |
Good point @eitsupi. I'm actually surprised that there is no (direct) mention of it on The Three-Valued Logic of SQL page. Digging through that page I did find the following though:
In Footnote 2:
and the
That's why I think trying to build our own, different behaviour around NULL handling on top of this would be a Herculean task and would likely lead to inconsistencies. My recommendation therefore is to keep the standard behaviour but provide conveniences/ergonomics for common tasks, e.g. through something like a
as @aljazerzen said. |
@snth Please refrain from adding generated text into comments as it adds unnecessary clutter and can be engineered to provide any opinion.
I'm not sold on the idea of multiple comparison operators. I see the use-case, but I think that it could be implemented in a slightly more verbose, but far more consistent way.
I want to point out that removing 3-value-logic from PRQL would not imply that PRQL is incapable of expressing uncertainty or missing data. It only means that comparison operators (and a few others) treat [...]
This means that one can still use the "bubbling-NULL-up" behavior with [...]
What I mean with "bizarre amount of is equal operators" is that following the same logic as to add [...]
I would much rather make operators as simple as possible, as orthogonal as possible, so they can be composed together. In this case, this means that comparison operators return either a [...] |
Continuing discussion from Discord: https://discord.com/channels/936728116712316989/1001415848902283284
In short, we have to decide for each operator how it handles null inputs.
My guiding principle would be that we should not throw errors (NPEs suck!) and that we should not coalesce nulls into 0 or false or any other falsy value.
For +, I'd say it's quite obvious then:
But a problem arises with ==. Because we translate x == null into x IS NULL, this would be a logical consequence:
If we choose this behavior, == will never emit a null, but only true/false.
But as @mklopets and @max-sixty pointed out, that would lead to divergence from expected SQL behavior, when using == as:
... where employees who have not received a salary have last_salary set to null. In these cases, == would just emit false, while in SQL last_salary = 1160 would emit null.
I think that this divergence is actually a good thing - we decided to treat null as a value and not a concept of UNKNOWN. This decision now is a logical consequence and I think we should stick to it.
In practice, translations would be: