Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nulls in expressions #905

Open
aljazerzen opened this issue Aug 3, 2022 · 20 comments
Open

nulls in expressions #905

aljazerzen opened this issue Aug 3, 2022 · 20 comments
Labels
language-design Changes to PRQL-the-language needs-discussion Undecided dilemma

Comments

@aljazerzen
Copy link
Member

Continuing discussion from Discord: https://discord.com/channels/936728116712316989/1001415848902283284

In short, we have to decide for each operator how it handles null inputs.

My guiding principle would be that we should not throw errors (NPEs suck!) and that we should not coalesce nulls into 0 or false or any other falsy value.

For +, I'd say it's quite obvious then:

   1 + 2    -> 3
null + 2    -> null
   1 + null -> null
null + null -> null

But a problem arises with ==. Because we translate x == null into x IS NULL, this would be a logical consequence:

   1 == 2    -> false
   1 == null -> 1 IS NULL -> false
null == 2    -> 2 IS NULL -> false
null == null -> NULL IS NULL -> true

If we choose this behavior, == will never emit a null, but only true/false.

But as @mklopets and @max-sixty pointed out, that would lead to divergence from expected SQL behavior, when using == as:

from employees
derive [
  has_minimum_wage = last_salary == 1160
]

... where employees who have not received a salary have last_salary set to null. In these cases, == would just emit false, while in SQL last_salary = 1160 would emit null.

I think that this divergence is actually a good thing - we decided to treat null as a value and not a concept of UNKNOWN. This decision now is a logical consequence and I think we should stick to it.


In practice, translations would be:

column == null    -> column IS NULL (done)
column != null    -> column IS NOT NULL (done)
column == true    -> column IS TRUE (todo)
column == false   -> column IS FALSE (todo)
column == 15      -> COALESCE(column = 15, FALSE)    (todo)
column == 'foo'   -> COALESCE(column = 'foo', FALSE) (todo)
column == 'foo'   -> COALESCE(column = 'foo', FALSE) (todo)
@max-sixty
Copy link
Member

max-sixty commented Aug 4, 2022

Thanks for writing this up @aljazerzen ! Very clear.

One option is to only apply this to true & false — i.e.

column == true    -> column IS TRUE (do)
column == false   -> column IS FALSE (do)
column == 15      -> COALESCE(column = 15, FALSE)    (don't do)

A couple of points:

  • This only applies in SQL Select statements — other comparisons require a true — e.g. joins, so it's not that disruptive to the outputted SQL
  • Applying coalesce to true & false values doesn't preclude any behavior — people can use foo or not foo rather than foo == true / foo == false — in SQL foo = TRUE is the same as foo.
  • Applying coalesce only to true & false values means that x == true and x == column_that_contains_true will have a different result, assuming we won't always have types in columns
  • Applying coalesce to all values does preclude behavior, I think — how would someone represent foo = 5, if they want nulls passed along?

@aljazerzen
Copy link
Member Author

Your first point is the major one: when == is used in WHERE, HAVING, JOIN or CASE, nulls in SQL are treated as falsey values anyway and COALESCE(..., FALSE) has no effect. It only has effect in SELECTs. Which means that ideally, we would emit COALESCE only in SELECTs to prevent SQL complexity explosion.

But, a consequence is that current status is quite close to what we want, even without translating column == true and column == false as you suggested. Which is why I suggest we postpone this whole issue until we begin polishing the compiler.

To close this issue completely, I think that to coalesce correctly in your 3rd point, we would have to know what value of scalar x is. Even if it is a variable computed from data. For this, we would need #643.

In summary, right now we are very close to done, but to finish now it would be a herculean effort. One easy thing now, is to translate column == true.

@snth
Copy link
Member

snth commented Aug 10, 2022

I'm not quite following all the points in this discussion but I don't think the following is the case:

If we choose this behavior, == will never emit a null, but only true/false.

But as @mklopets and @max-sixty pointed out, that would lead to divergence from expected SQL behavior, when using == as:

from employees
derive [
  has_minimum_wage = last_salary == 1160
]

... where employees who have not received a salary have last_salary set to null. In these cases, == would just emit false, while in SQL last_salary = 1160 would emit null.

On the PRQL playground the above query translates to:

SELECT
  employees.*,
  last_salary = 1160 AS has_minimum_wage
FROM
  employees

which will emit NULLs just as expected.

What I recall from our last discussion on this topic, is that == null only ever gets special treatment when there is a literal null in the PRQL query which we concluded is what you want, as for example in something like the following:

from employees
filter last_salary == null

which gets translated to

SELECT
  employees.*
FROM
  employees
WHERE
  last_salary IS NULL

which is presumably what you want since a query where the filter condition emits only NULLs and returns no rows would seem pointless.

Therefore I think everything is fine as it is and no change is necessary.

What am I missing?

@aljazerzen
Copy link
Member Author

Well you are looking at it from "how to translate" point of view:

  • if there is == null in PRQL -> translate into IS NULL
  • if there is == x in PRQL -> translate into = x

This point of view works for now, but consider we add variables and evaluation into PRQL:

let my_null = null;
let my_number = 8;

from numbers              # -> FROM numbers
derive c = b + my_number  # -> SELECT b + 8 AS c
filter foo == my_null     # -> WHERE ?

Now should last transform translate to WHERE foo = null or WHERE foo IS NULL? I'd argue that is should be IS NULL, because the other one does not make any sense.

But what if my_null is a computed value? Something like let my_null = s"SELECT MAX(bar) FROM another_table"? If it is null, we have to translate to IS NULL, otherwise to = my_null?

What if we instead had filter foo == baz (where baz is from table numbers)? If baz is null, we will compare foo = null, which will be null, which would never happen before.

My point here is that this point of view does not cover all the cases that may arise in the future, which may lead some language invariants to be broken. For example one would expect that these two are the same: x == null with x == my_null (where let my_null = null), but they may not be.

Instead, I propose we decide what the results == should be, and then follow that regardless of whether operands are literals, variables or columns.


On the PRQL playground the above query translates to:

SELECT
  employees.*,
  last_salary = 1160 AS has_minimum_wage
FROM
  employees

which will emit NULLs just as expected.

What I propose is to expect this query to emit FALSEs instead of NULLs. I'm basically asking for permission to translate all x == y into COALESCE(x = y, FALSE).

@snth
Copy link
Member

snth commented Aug 10, 2022

Thanks for the clarification.

I'm not sure about all the implications when you add variables and expression evaluation to the compiler. I would suggest we go with postponing this as you suggested (and stick to the YAGNI principle until then):

Which is why I suggest we postpone this whole issue until we begin polishing the compiler.


With regards to the larger point of creating our own semantics around NULL, I would be very wary of that and would vote against it. I personally see PRQL as a convenient way to write SQL (perhaps data transformations more generally with Substrait) and would want to retain the ability to map 1-to-1 onto the underlying SQL constructs. The logic of NULLs and NAs is tricky and I think you can easily get yourself into a mess there. I'm not even sure that it's always treated the same in all underlying systems. One could argue that therefore that could be a way for us to improve upon SQL but I am skeptical. See for example this additional new NULL behaviour in indexes in postgres 15. That doesn't affect us directly but I'm just saying that given how many subtleties there are in this space, let's not add another layer of complexity by introducing more magic behaviour between PRQL and SQL.

Also in many applications, it's important to be able to distinguish missing data, i.e. NULL or NA, from falsey values so an automatic translation sounds like a bad idea to me.

@snth
Copy link
Member

snth commented Aug 10, 2022

So carrying on with the previous example

from employees
derive [
  has_minimum_wage = last_salary == 1160
]

I want this to emit NULLs because that tells me that I have missing data. If I want this to purely give me a true\false answer, then PRQL already makes it easy enough as I can simply write

from employees
derive [
  has_minimum_wage = last_salary == 1160 ?? false
]

which gets translated to

SELECT
  employees.*,
  COALESCE(last_salary = 1160, false) AS has_minimum_wage
FROM
  employees

If you put that COALESCE in automatically then how can I recover that missing value information?

@aljazerzen
Copy link
Member Author

You'd have to use an (unimplemented) ternary operator:

from employees
derive [
  has_minimum_wage = last_salary == null ? null : last_salary == 1160
]

I see that is worse than what we have now, but problem is that we are currently in-between the two camps:

  • null means unknown (what SQL does)
  • null is a value (what we use for column == null comparison)

So we picked the best of the two worlds and haven't committed to one of them fully. I'm saying that this may come back and bite us, but for now we can leave this as is.

@aljazerzen aljazerzen added the language-design Changes to PRQL-the-language label Dec 15, 2022
@aljazerzen aljazerzen added the needs-discussion Undecided dilemma label Jan 8, 2023
@aljazerzen aljazerzen removed the needs-discussion Undecided dilemma label Jan 17, 2023
@aljazerzen
Copy link
Member Author

A minimal example of the problem:

from [{a = null, b = null}]
select a == b

... evaluates to [{null}] in playground at the moment, but it should evaluate to [{true}].

@aljazerzen aljazerzen added the needs-discussion Undecided dilemma label Dec 18, 2023
@vanillajonathan
Copy link
Collaborator

My guiding principle would be that we should not throw errors (NPEs suck!) and that we should not coalesce nulls into 0 or false or any other falsy value.

So we should just type juggle and coerce values into something else and hope its what was intended?
Then the user uses the result set to drive some decision which results in catastrophe?

So 10 problems + null is null, hence no problem?
So 100 million dollars + null is null, hence its free?

Returning an error is not the same as "throwing" an error, and is not the same as a Null Pointer Exception (NPE). The PRQL compiler can return a list of error messages.

@snth
Copy link
Member

snth commented Dec 18, 2023

from [{a = null, b = null}]
select a == b

... evaluates to [{null}] in playground at the moment, but it should evaluate to [{true}].

I don't agree that it should evaluate to [{true}] and I think it actually returns the correct result.

Moreover

from [{a = 1, b = null}]
select {a == b, b == b, b == null}

evaluates to

(a = b) (b = b) (b IS NULL)
null null true

which I think is perfect. We're keeping the SQL 3-value logic around NULL while providing the b==null shortcut.

It's the last one that's really the odd one out so why do I think that's justified?

The first two behaviours around a==b and b==b have to stay because otherwise we lose out on the whole functionality around missing values. So the question is can we allow b==null or do we need to introduce another operator like IS in SQL?

In my mind the last one is different because what could b==null return? It could either return a column of NULLs (as per the behaviour of the first two examples) but why would you ever want that because you could just use select {null} for that. Therefore comparing a column to a literal null is something that you would never want. Therefore that syntax can be used for something else. Hence we can use it to filter the rows for where null values are found.

I agree this is a bit inconsistent and will probably trip up newcomers (and maybe others as well). However I think it's an acceptable price in this instance. If people think it's too confusing then we may have to abandon that syntactic shortcut (but let's discuss that first). However I think we can't have b==b return true because then we lose access to a whole lot of data representations around missing values.

@aljazerzen
Copy link
Member Author

@vanillajonathan do I understand your correctly:
You propose we make operators such as + not accept null values and report that as compilation error.
This would then require null handling before the operator.


@snth I have strong opinion against 3-value logic and an even stronger one against our current in-between behavior.

That's because with current semantics, it is not possible to easily "extract a variable":

from x
select {a, b = null}
select {a == null, a == b}

... here, one would expect that comparing a to null would be the same as comparing to a variable that is just null. But that is not the case.

I don't understand what you mean with "lose access to a whole lot of data representations around missing values". The proposed change in this issue would only change the "default" behavior of what happens when an operand is null, but one can still explicitly state that behavior:

case [a == null or b == null => null, true => a == b]

... or make a helper function for it.

So I don't think the convenience of 3-value logic in some cases is worth the massive inconsistency in a few others.

@vanillajonathan
Copy link
Collaborator

@aljazerzen Yeah, Python seems to handle this nicely.

In [1]: 3 + None
-----------------------------------------------------------------------
TypeError                             Traceback (most recent call last)
Cell In[1], line 1
----> 1 3 + None

TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'

@snth
Copy link
Member

snth commented Dec 20, 2023

Just to add an example (because I lost it when my browser restarted today):

let to_int = func x -> case [x=='null' => null, true => (x | as int)]
let sentinel = -999

from_text format:csv """i,a_i,b_i
0,0,0
1,1,10
2,2,null
3,null,30
4,null,null
"""
sort i
select {a=a_i|to_int, b=b_i|to_int, current=a==b, proposed=(a??sentinel)==(b??sentinel)}

which yields

a b current proposed
0 0 true true
1 10 false false
2 null null false
null 30 null false
null null null true

I will come back to add some text as to why I think the proposed behaviour of treating null as a well defined value is a problem.

In the meantime, it would be easy enough to add the following function for anyone that really wants the proposed behaviour:

let eq_null = func y z s:sentinel -> (y??s)==(z??s)

and then

derive eq_null=(eq_null a b s:-1)

would give:

a b current proposed eq_null
0 0 true true true
1 10 false false false
2 null null false false
null 30 null false false
null null null true true

@snth
Copy link
Member

snth commented Dec 21, 2023

TIL about NULL aware comparisons in SQL using

<expression> IS [NOT] DISTINCT FROM <expression>

See Markus Winand's page: NULL-Aware Comparison: is [not] distinct from

This yields a better implementation for dialects where this is supported:

let is_distinct = func a1 b1 -> s"({a1} IS DISTINCT FROM {b1})"
let not_distinct = func a2 b2 -> s"({a2} IS NOT DISTINCT FROM {b2})"

For dialects where IS [NOT] DISTINCT FROM is not implemented (according to that page MariaDB, MySQL, and Oracle), Markus also has a suggestion for a better implementation of eq_null that doesn't rely on a sentinel value (since that could collide with values from the value domain):

let eq_null = func a3 b3 -> case [a3==b3 || (a3==null && b3==null) => 1, true => 0] == 1

Adding this to my previous example:

derive {eq_null=(eq_null a b), not_distinct=(not_distinct a b), is_distinct=(is_distinct a b)}

we get:

a b current proposed eq_null not_distinct is_distinct
0 0 true true true true false
1 10 false false false false true
2 null null false false false true
null 30 null false false false true
null null null true true true false

@aljazerzen
Copy link
Member Author

Great overview.

It showcases the problem with 3-value logic: there are so many more possible comparisons which make up a need to have different comparison operators. This then spirals out of control and you end up with a bizzare amount of "is equal" operators.

@snth
Copy link
Member

snth commented Dec 21, 2023

This page has an even better overview of The Three-Valued Logic of SQL.

I haven't read all of it yet but one conclusion I came away with from that article is that even if you patch the == operator to treat two NULLs as equal, the behaviour for other parts of SQL will be different and you will have a massive task to patch those all consistently.

Thinking on this over the last day or two, I think I have a proposal that could suit all of us. First I want to say a bit about the three-valued logic of missing or unknown values and its importance to data analysis.


Disclaimer: I had ChatGPT write this out for me based on my prompt.

Three-Valued Logic in Data Analysis

The concept of "three-valued logic" in data analysis, particularly in the context of handling missing values, is a fundamental aspect across various data processing systems like SQL, Excel, R, and Python's Pandas. Let's explore why this behavior is not only intentional but crucial.

Key Points of Three-Valued Logic in Data Analysis

  1. Representation of Uncertainty: Missing values (NULL in SQL, #N/A in Excel, NA in R, NaN in Pandas) represent uncertainty or absence of data. In standard binary logic, values are either true or false. The inclusion of a third state (missing or unknown) acknowledges that not all information is always available or applicable.

  2. Preventing Misleading Interpretations: By not equating missing values (e.g., NULL = NULL as TRUE), these systems avoid creating misleading or incorrect interpretations. This approach forces analysts to consciously handle these unknowns rather than making automatic assumptions.

  3. Data Integrity and Quality: It emphasizes the importance of data quality and encourages proper data management practices, making users aware of and responsive to missing or uncertain data.

  4. Flexibility in Data Analysis: It allows for more nuanced analysis and decision-making processes, accommodating scenarios where incomplete data is a significant factor.

Detailed Exploration

1. Representation of Uncertainty

  • In real-world scenarios, data often comes with gaps. For instance, a survey response might be blank, or a sensor might fail to record a reading.
  • In systems like SQL or Pandas, a distinct state for 'unknown' or 'not applicable' ensures that these gaps are recognized and not conflated with actual data.

2. Preventing Misleading Interpretations

  • Consider an operation like averaging. Including NULL values as zeros would skew results. By keeping them separate, calculations remain accurate for the available data.
  • In SQL, 'NULL = NULL' being NULL instead of TRUE acknowledges that two unknowns can't be definitively said to be equal. It’s like saying, “Are two things I know nothing about the same? I don’t know.”

3. Data Integrity and Quality

  • This approach makes missing data explicitly visible. Analysts can't ignore or overlook these gaps, prompting better data cleaning and preprocessing.
  • It also aids in identifying data collection issues or areas where more data gathering is necessary.

4. Flexibility in Data Analysis

  • In statistical modeling or machine learning, different strategies can be employed for handling missing data, like imputation, deletion, or using models that can inherently handle it.
  • This flexibility is crucial for robust, realistic models that reflect the complexities of real-world data.

Conclusion

The implementation of three-valued logic across various data systems is a deliberate choice to accurately represent the realities of data collection and analysis.


Proposal

  • Introduce another operator for the " IS NOT DISTINCT FROM " SQL operator (and the eq_null where it is not implemented), for example this could be named something like ==? (since it handles a NULL related behaviour).
  • Keep == for the standard SQL = behaviour. This implies that we have to remove the X == null => X IS NULL behaviour and henceforth X == null -> NULL.
  • Point out in the docs that in order to filter for NULLs, you have to use ==?, i.e.
    from employees
    filter manager ==? null

The benefits I see with this are:

  • It keeps the standard three-valued logic of unknowns which is a key part of data analysis.
  • Almost keeps the convenience of the x == null behaviour (you only have to add a single ? character) but makes things consistent again.
  • We gain a useful ==? operator which ergonomically allows us to test for NULLs in corresponding places.
  • Corner cases of NULL behaviour around different database behaviour edges are preserved so the experts can utilise their knowledge/experience and the dumbfounded can consult the documentation of their database/query engine for the documented behaviour.
  • This saves us from covering a potentially massive API surface of NULL behaviour corner cases.
  • We don't introduce hard-to-debug errors which are interactions of bugs/unimplemented parts of PRQL behaviour and some database behaviour oddity.

@eitsupi
Copy link
Member

eitsupi commented Dec 21, 2023

One thing that bothered me while reading this discussion was the handling of nulls in joins.
(It may be similar to the sort order problem that was found somewhere.)

@eitsupi
Copy link
Member

eitsupi commented Dec 22, 2023

FYI, Julia has ===.
https://docs.julialang.org/en/v1/manual/missing/

@snth
Copy link
Member

snth commented Dec 22, 2023

One thing that bothered me while reading this discussion was the handling of nulls in joins.
(It may be similar to the sort order problem that was found somewhere.)

Good point @eitsupi . I'm actually surprised that there is no (direct) mention of it on the The Three-Valued Logic of SQL. Digging through that page I did find the following though:

In Footnote 2:

Many other features are defined in terms of where—e.g. the join’s on and using clauses (ISO/IEC 9075-2:2023 §7.10 SR 1)

and the WHERE behaviour is discussed in General Rule: where, having, when, etc.:

The where, having, and when clauses (e.g. in case expressions) require true conditions.2 It is not enough that a condition is not false.

The result of the following query is therefore always the empty set:

SELECT col
  FROM t
 WHERE col = NULL

The result of the equals comparison to null is always unknown. The where clause thus rejects all rows.

Use the is null predicate to search for null values:

WHERE col IS NULL

That's why I think trying to build our own, different behaviour around NULL handling on top of this would be a Herculean task and would likely lead to inconsistencies. My recommendation therefore is to keep the standard behaviour but provide conveniences/ergonomics for common tasks, e.g. through something like a ==? or === operator to easily find and identify NULL values without getting to the point where

This then spirals out of control and you end up with a bizzare amount of "is equal" operators.

as @aljazerzen said.

@aljazerzen
Copy link
Member Author

@snth Please refrain from adding generated text into comments as it adds unnecessary clutter and can be engineered to provide any opinion.


I'm not sold on the idea of multiple comparison operators. I see the use-case, but I think that it could be implemented with a slightly more verbose, but way more consistent way.

I want to point out that removing 3-value-logic from PRQL would not imply that PRQL is incapable of expressing uncertainty or missing data. It only means that comparison operators (and a few other) treat null as any other value, without special behavior, as it would be a sentinel value.

This means that one can still use the "bubbling-NULL-up" behavior with case or a helper function.


What I mean with "bizzare amount of is equal operators" is that following the same logic as to add ==? and === would also hint toward adding <=?, <?, >=? and >?. Going even further, it would also make sense to add +?, -?, *?, /? and //?.


I would much rather make operators as simple as possible, as orthogonal as possible, so they can be composed together. In this case, this means that comparison operators return either a true or a false and don't need a table with 5 rows to express their behavior.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
language-design Changes to PRQL-the-language needs-discussion Undecided dilemma
Projects
None yet
Development

No branches or pull requests

5 participants