Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Same column name in multiple tables causes buggy behaviour #820

Closed
mklopets opened this issue Jul 18, 2022 · 24 comments · Fixed by #904
Closed

Same column name in multiple tables causes buggy behaviour #820

mklopets opened this issue Jul 18, 2022 · 24 comments · Fixed by #904
Labels
bug Invalid compiler output or panic compiler priority

Comments

@mklopets
Copy link
Collaborator

mklopets commented Jul 18, 2022

When multiple PRQL tables share an identical column name (e.g. id): joining them together causes the compiler to reuse the meaning of one id in multiple places (even where it's incorrect).

This reduced example should illustrate it best:

table x = (
  from x_orig # table with columns: id
  derive [
    x_id = id
  ]
  select [x_id]
)

table y = (
  from y_orig # table with columns: id, x_id
  derive [
    y_id = id
  ]
  select [y_id, x_id]
)

from x
join y [x_id]

compiled SQL:

WITH x AS (
  SELECT
    id AS x_id
  FROM
    x_orig
),
y AS (
  SELECT
    x_id AS y_id, --------- this should instead be `id AS y_id`
    x_id
  FROM
    y_orig
)
SELECT
  x.*,
  y.*,
  x_id
FROM
  x
  JOIN y USING(x_id)

Note that if the PRQL derive statement of y_id = id is changed to y_id = y_orig.id, then the buggy output SQL line correctly becomes id AS y_id. But, we can't expect users to always add these (unless this becomes an explicit requirement)

@mklopets mklopets added bug Invalid compiler output or panic compiler labels Jul 18, 2022
@max-sixty
Copy link
Member

Thanks a lot @mklopets

This does look like a bug.

@aljazerzen has been carrying the compiler and so I'm working to get up to speed with it, so I suspect it won't be a super quick fix; I should have some time later in the week though, depending on how difficult it is to fix.

I recognize this isn't ideal, but this will work around it in the short term:

table y = (
  from y_orig # table with columns: id, x_id
  derive [
-    y_id = id
+    y_id = y_orig.id
  ]
  select [y_id, x_id]
)

@mklopets
Copy link
Collaborator Author

mklopets commented Jul 18, 2022

For hard-to-explain (or understand, lol) reasons, this suggested workaround wasn't actually super easy to implement for us in our specific (quite complex) real-world scenario.

But, what does work for us (and it's kind of a worse solution, and I assume it won't work in all cases) is using S-strings to bypass all of PRQL's magic around it sometimes replacing identifiers:

table x = (
  from x_orig # table with columns: id
  derive [
-    x_id = id
+    x_id = s"id"
  ]
  select [x_id]
)

table y = (
  from y_orig # table with columns: id, x_id
  derive [
-    y_id = id
+    y_id = s"id"
  ]
  select [y_id, x_id]
)

from x
join y [x_id]

outputs

WITH x AS (
  SELECT
    id AS x_id
  FROM
    x_orig
),
y AS (
  SELECT
    id AS y_id,
    x_id
  FROM
    y_orig
)
SELECT
  x.*,
  y.*,
  x_id
FROM
  x
  JOIN y USING(x_id)

@max-sixty
Copy link
Member

OK, congrats on the creative solution, but the current state is not great.

I think @aljazerzen is out for a while, so I'll try and get deeper into this, but it may take me some time to get up to speed. I'll some proper time this weekend if I don't make progress before then.

@mklopets
Copy link
Collaborator Author

mklopets commented Jul 19, 2022

Another (simplified) case we ran into this with (the same S-string workaround technically works, but I don't think @max-sixty's original suggestion applies):

table x = (
  from orig_x
  join foo [orig_x.foo_id==foo.foo_id]
)

table y = (
  from orig_y
  join foo [orig_y.foo_id==foo.foo_id] # workaround-able using ==s"foo.foo_id"
)

from y

compiles to

WITH x AS (
  SELECT
    orig_x.*,
    foo.*
  FROM
    orig_x
    JOIN foo ON orig_x.foo_id = foo.foo_id
),
y AS (
  SELECT
    orig_y.*,
    foo.*
  FROM
    orig_y
    JOIN foo ON orig_y.foo_id = x.foo_id ------------ the right-hand side here should be foo.foo_id
)
SELECT
  y.*
FROM
  y

@mklopets
Copy link
Collaborator Author

mklopets commented Jul 31, 2022

To add to what I said previously: the one case I'm completely unable to get a workaround working for, is when we also do some grouping (group doesn't accept S-strings unlike select/derive):

table x = (
  from orig_x
  join foo [orig_x.foo_id==foo.foo_id]
)

table y = (
  from orig_y
  join foo [orig_y.foo_id==foo.foo_id] <------------ this gets translated incorrectly
  group [foo.foo_id] ( # can't use S-strings in here <------------- and so does this
    aggregate [count]
  )
)

from y

@max-sixty
Copy link
Member

As discussed on discord, I'll prioritize this. Thanks a lot for the cases @mklopets , and for saying what's important on your end.

It's in the area of the code that I'm less familiar with, so I'm working through it.

@max-sixty
Copy link
Member

max-sixty commented Aug 2, 2022

FYI I've reduced the example above down a bit:

table x = (
  from x_table
  select only_in_x = foo
)

table y = (
  from y_table
  select foo
)

from x
join y [id]
WITH x AS (
  SELECT
    foo AS only_in_x
  FROM
    x_table
),
y AS (
  SELECT
    only_in_x AS foo  # wrong!
  FROM
    y_table
)
SELECT
  x.*,
  y.*,
  id
FROM
  x
  JOIN y USING(id)

@max-sixty
Copy link
Member

max-sixty commented Aug 3, 2022

Update:

We keep a name -> location mapping in order to resolve where each name comes from. We lookup the name here: https://github.com/prql/prql/blob/0.2.5/prql-compiler/src/semantic/name_resolver.rs#L498-L505

The problem is that we don't take into account what scopes are available at that point, so in the example above, we assume that foo in y is actually in x, since we're already added a mapping of foo to the x table.

I'll think about what the best approach is — possibly we encode into the caller the available scopes, though it means passing much more context through name resolution.

(@aljazerzen IIUC you have very intermittent internet — if you do get access then lmk your thoughts on this with higher priority than anything else! Or I'll send you a Starlink...)


On a meta level, this is the first issue I've really struggled to debug — and beyond just being less familiar with this piece of code — it's not possible to reduce the size of the example down that much (i.e. unlike parsing, we can't just break it down further), the data structures are much larger, there are dozens of other objects running through the same code, some named the same, so logging is verbose.

Possibly the approach to testing we've been successful with so far — mostly "given this PRQL am I getting this {AST,SQL}" isn't sufficient, and we need to work with the inner objects more directly, like more old school unit testing. e.g. represent a tiny Context object (no stdlib, for brevity), add a node, and then snaphot the result. I'll continue thinking, very open to feedback.

max-sixty added a commit that referenced this issue Aug 3, 2022
I think to fix #820 we're going to need to know "where we are" in the
folding — what tables we've seen before, what is currently visible;
which I think means we want to control how each table is parsed from a
higher method.

So this:
- Adds a `fold_query` method
- Moves some functions from taking nodes to taking a `Query` object. I'm
  not that confident this is better everywhere, so we can revert parts.
@aljazerzen
Copy link
Member

Ok, I took a look a this and I think I know where the problem is.

In the example above, both foos are resolved as same ExternRef (reference to an actual table column). As you suspected, this could happen because Scope is not cleared between different table definitions. I assume that the problem is with this function which is probably also causing some I other bug I cannot find at the moment.

When I discovered that bug, I tried to write an easy fix, but it involved pulling some data from other compiled stages and it was too much trouble for the part of the code base that will get rewritten on semantic branch.

I won't be fixing this now, but I can speed up work on semantic.


I've also found PRQL -> SQL debugging approach quite lacking. I fall back on dbg! and prql-compiler debug command, which is quite good at explaining what gets resolved into what.

@max-sixty
Copy link
Member

As you suspected, this could happen because Scope is not cleared between different table definitions. I assume that the problem is with this function which is probably also causing some I other bug I cannot find at the moment.

Ah great, thanks for pointing me towards that.

I think we want something similar but not exactly this — this function clears the context (e.g. after a select) — we want to "clear and then add back" after the table. We already have this for "within_{group;window;aggregate}", this is a bit more complicated since we need to know what's available, which requires some context; e.g.

table x = (
  from x_table
  select only_in_x = foo
)

table y = (
  from y_table
  join x_table
  select foo   # now foo could come from `x_table`
)

Maybe we want to generalize those within_X into namespaces which we can opt into depending on what's available.

@max-sixty
Copy link
Member

max-sixty commented Aug 3, 2022

I've also found PRQL -> SQL debugging approach quite lacking. I fall back on dbg! and prql-compiler debug command, which is quite good at explaining what gets resolved into what.

Yeah, just to give you an idea of how inefficient I'm being — here's the current dbg diff!

diff --git a/prql-compiler/src/semantic/name_resolver.rs b/prql-compiler/src/semantic/name_resolver.rs
index 11c61bdd..9a9ebaee 100644
--- a/prql-compiler/src/semantic/name_resolver.rs
+++ b/prql-compiler/src/semantic/name_resolver.rs
@@ -97,6 +97,14 @@ fn fold_nodes(&mut self, items: Vec<Node>) -> Result<Vec<Node>> {
     }
 
     fn fold_node(&mut self, mut node: Node) -> Result<Node> {
+        if node
+            .item
+            .as_ident()
+            .map_or(false, |x| x == "foo" || x == "y")
+        {
+            dbg!(&node);
+        }
+
         let r = match node.item {
             Item::FuncCall(ref func_call) => {
                 // find declaration
@@ -104,16 +112,20 @@ fn fold_node(&mut self, mut node: Node) -> Result<Node> {
                     self.lookup_variable(&func_call.name, node.span)
                         .map_err(|e| Error::new(Reason::Simple(e)).with_span(node.span))?,
                 );
-
                 self.fold_function_call(node)?
             }
 
             Item::Ident(ref ident) => {
+                if node.item.as_ident().map_or(false, |x| x == "foo") {
+                    dbg!(&node);
+                }
                 node.declared_at = Some(
                     (self.lookup_variable(ident, node.span))
                         .map_err(|e| Error::new(Reason::Simple(e)).with_span(node.span))?,
                 );
-
+                if node.item.as_ident().map_or(false, |x| x == "foo") {
+                    dbg!(&node);
+                }
                 // convert ident to function without args
                 let decl = &self.context.declarations.0[node.declared_at.unwrap()].0;
                 if matches!(decl, Declaration::Function(_)) {
@@ -122,6 +134,9 @@ fn fold_node(&mut self, mut node: Node) -> Result<Node> {
                         args: vec![],
                         named_args: Default::default(),
                     });
+                    if node.item.as_ident().map_or(false, |x| x == "foo") {
+                        dbg!(&node);
+                    }
                     self.fold_function_call(node)?
                 } else {
                     node
@@ -263,6 +278,8 @@ fn fold_join_filter(&mut self, filter: JoinFilter) -> Result<JoinFilter> {
 
     fn fold_table(&mut self, mut table: Table) -> Result<Table> {
         // fold pipeline
+        dbg!("folding table", &table.name);
+        // Run a separate name resolution process for the table, to ensure
         table.pipeline = Box::new(self.fold_node(*table.pipeline)?);
 
         // declare table
@@ -507,6 +524,9 @@ fn validate_function_call(
     pub fn lookup_variable(&mut self, ident: &str, span: Option<Span>) -> Result<usize, String> {
         let (namespace, variable) = split_var_name(ident);
 
+        if ident == "foo" {
+            dbg!(&self.context.scope.variables);
+        }
         if let Some(decls) = self.context.scope.variables.get(ident) {
             // lookup the inverse index
 
@@ -514,7 +534,7 @@ pub fn lookup_variable(&mut self, ident: &str, span: Option<Span>) -> Result<usi
                 0 => unreachable!("inverse index contains empty lists?"),
 
                 // single match, great!
-                1 => Ok(decls.iter().next().cloned().unwrap()),
+                1 => Ok(dbg!(decls.iter().next().cloned().unwrap())),
 
                 // ambiguous
                 _ => {

After sleeping on it, I think we should do something like the above suggestion — let us engage / test with these inner APIs without needing a full PRQL statement. That means making them easy to create, easy to serialize, etc. So a test can be:

fn test_context() {
    context = Context::new();
    let names = resolve_names(<small AST>);
    assert_snapshot(names, @"<a couple of lines long>");
}

...and then we can test it more directly

@max-sixty
Copy link
Member

Closed by #908

@mklopets
Copy link
Collaborator Author

mklopets commented Aug 19, 2022

it looks like #908 greatly improved the situation but there's still similar buggy behaviour:

table y = (
  from x
  select [
    x_id = x.x_id,
  ]
)

from x
join y [something]

select [
  x_id = x.x_id # this should compile to selecting x.x_id, not y.x_id 
]

currently compiles to

WITH y AS (
  SELECT
    x_id
  FROM
    x
)
SELECT
  y.x_id <--------------------------- this is incorrect, should be `x.x_id`
FROM
  x
  JOIN y USING(something)

this is problematic e.g. if left joining y – y.x_id is null for rows that don't match there (this example code query isn't perfect, but our real query is hundreds of lines long)

@mklopets
Copy link
Collaborator Author

@max-sixty should we reopen this issue (see prev comment)? This still has us using a looot of S-strings that shouldn't be necessary

@max-sixty
Copy link
Member

Sorry, yes, thanks for raising. FWIW this is still a thorn for me, we haven't forgotten it. (Though work has slowed here, to some extent this has become dependent on the semantic rewrite).

@max-sixty max-sixty reopened this Oct 11, 2022
@aljazerzen
Copy link
Member

aljazerzen commented Oct 12, 2022

I'll explain what I think the resolver should be doing on your example max.
Edit: I changed the example a bit (see join). Original example is explained below.

table x = (
  from x_table
  select only_in_x = foo
)

table y = (
  from y_table
  join x
  select foo   # now foo could come from `x_table`
)

When resolving select for table x, scope should contain only _frame.*, which points to table x_table. This means that foo should resolve into ExternRef x_table.foo.

After the table x is finished, it should be stored into Context somehow (this part is not implemented!).

Now, in table y, from should add * to _frame namespace and it should point to table y_table. But join should not add * to _frame, because we know that x contains only variable only_in_x. Thus, after join scope should contain:

_frame:
- * which points to y_table
- only_in_x which points to expression which contains ref to `x.foo`

When resolving select of y, name foo should match * and resolve into y_table.foo.

@mklopets
Copy link
Collaborator Author

@aljazerzen I might be missing something (and I don't know how any of this works under the hood) but why would only_in_x be in scope after starting with y_table and joining on x_table (not x)?

The last sentence in your comment was confusing for me as well: shouldn't foo be resolved into x_table.foo (not y_table)?

@aljazerzen
Copy link
Member

aljazerzen commented Oct 12, 2022

Oh, I made a mistake - I thought that we are joining with x, not x_table. I'll update my previous comment and also add commentary on what should happen when joining with x_table:

table x = (
  from x_table
  select only_in_x = foo
)

table y = (
  from y_table
  join x_table
  select foo   # now foo could come from `x_table`
)

For the first table, everything is the same as the first example.

In table y, from should add * to _frame namespace and it should point to table y_table. join should similarly add * to _frame, which should now have two _frame.*.

When resolving select, name foo should match _frame.*, which refers to two variables. I once made a workaround for this case and allow name foo not to be matched to any table, but be translated directly into foo. Rationale behind this is that maybe, this is a column from one table only, which the database could resolve. The code for that is here.

But now I see that this must raise an error due to ambiguity, because:

  • we cannot just pass name through without resolving it, because this would mean that some columns of our IR are not resolved and we don't know what they are.
  • conceptually, this means that sometimes, PRQL semantics for name resolution fallback to SQL semantics

This see this error be annoying, but the situation would be improved if we add table defintions in the future. All this could be avoided, if the compiler would have the knowledge of all columns in both tables.


We could maybe go further and infer the frame of table x_table and figure out that foo could match x_table.foo, but this should still produce an error becuase foo could still match two variables: x_table.foo and y_table.*.

@mklopets
Copy link
Collaborator Author

On the semantic rewrite front: is there any way me or @priithaamer could help out to speed this up? This issue is super important for us. FWIW: I'll also be joining the call this Sunday.

@max-sixty
Copy link
Member

I wrote this on Discord so I'll add here: I would vote to do the "closest" thing, as long as it's possible to extend in the future. So if that means raising an error for ambiguities for the moment, then that's a reasonable tradeoff.

In particular, it's quite easy to work around in the case it's in y_table:

table y = (
  from y_table
+ select foo
  join x_table
- select foo
)

...and not too difficult if it's in x_table — either another table, or x_table.foo.


With more rigidity, I worry more about flexibility in s-strings & functions than more simple ambiguities like the case above given how easy it is work around.

For example, an adjusted example on the dbt-prql repo:

func filter_amount method -> s"sum(case when payment_method = '{method}' then amount end) as {method}_amount"

from payments
group order_id (
  aggregate [filter_amount bank_transfer]
)
select bank_transfer_amount   # this might not be resolvable

If we are trying to find a way for this to work while maintaining a good conceptual framework, I wonder whether it would be possible to have a notion of "I don't know where this comes from, so I'm just going to put foo there". On discord I asked whether it would be possible to have:

table y = (
  from y_table
  join x_table
  select foo   # <- this isn't resolvable to a table, so we put it in a scope 
  # attached to this line. 
  # In a way, we treat it similarly to `derive foo = 5` (or even `derive foo = foo`!)
)

(but as above, if @aljazerzen doesn't think it's worth pursuing now, I would vote to push ahead, no worries about having 100% agreement)

max-sixty added a commit that referenced this issue Oct 23, 2022
This is a temporary workaround for @mklopets & team, while we resolve #820.

@aljazerzen -- not sure if you think this is an acceptable tradeoff? I'll wait for your review.
@mklopets
Copy link
Collaborator Author

mklopets commented Nov 23, 2022

Here's one more example of something where working around this issue is very difficult (if not impossible):

from x

filter field > 0 # this causes an intermediary table_0 to be created

join y [something]

select [
  # if this has to use an S-string because of issue #820,
  # then this incorrectly outputs x.field instead of table_0.field,
  # which will error: missing FROM-clause entry for table "x"
  field = s"{x}.field"
]

If #820 wasn't a thing, then there would be no need for the S-string, but if we do need it (due to this very same issue here), then this will compile to:

WITH table_0 AS (
  SELECT
    x.*
  FROM
    x
  WHERE
    field > 0
)
SELECT
  x.field AS field <---------------- this will error: missing FROM-clause entry for table "x"
FROM
  table_0
  JOIN y USING(something)

as a terrible workaround, one could do:

-  field = s"{x}.field"
+  field = s"{x.*}.field"

which compiles to table_0.*.field; because there are no valid uses for .*. (AFAIK), one could theoretically replace every occurrence of .*. in the compiled SQL with .

@max-sixty
Copy link
Member

@mklopets if you want to give main a drive with these examples and let us know whether everything works / things are still broken, that would be great!

@mklopets
Copy link
Collaborator Author

mklopets commented Nov 29, 2022

@max-sixty I'm already on it since Sunday – looking good so far; the behaviour from my last comment here still remains, but it'll no longer affect us once we remove most S-strings anyway

anywho – I'll definitely ping and/or close the issue when confirmed that all is good!

@mklopets
Copy link
Collaborator Author

This was truly fixed with v0.3.0!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Invalid compiler output or panic compiler priority
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants