Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Resolve opaqueness of idents #1535

Closed
max-sixty opened this issue Jan 16, 2023 · 3 comments · Fixed by #2563
Closed

Resolve opaqueness of idents #1535

max-sixty opened this issue Jan 16, 2023 · 3 comments · Fixed by #2563
Labels
compiler language-design Changes to PRQL-the-language priority

Comments

@max-sixty
Copy link
Member

max-sixty commented Jan 16, 2023

Edit: Renamed from "Resolve idents based on type"; new proposal below. Original issue follows:


Continuing on from #1516

Currently backticks are required in:

from `schema.table`

...which compiles to

SELECT * FROM 'schema'.'table' -- (or just `schema.table`, no backticks, but *not* 'schema.table' with quotes)

That's because the resolver will try to resolve...

from schema.table

...as a namespace schema and a column name table, which then breaks.


I think ideally — we would instead resolve the argument based on its type:

  • from & join would resolve to namespaces
  • select & derive etc would resolve to columns

Then:

  • from schema.table would be parsed as a single namespace schema.table, not a namespace schema and column table.
  • select table.column would still be parsed as a namespace table & column column, because that's typed as a column

And then:

  • any ident in backticks could be opaque — we wouldn't need to split by ., clearing up feat: Parse idents with * as quoted #1516
  • we'd take from schema.table, not from `schema.table` and produce the correct thing
@max-sixty
Copy link
Member Author

@aljazerzen as discussed

@max-sixty
Copy link
Member Author

max-sixty commented Mar 21, 2023

We discussed this on the dev call and I said I'd think more about the tradeoffs. I think this issue demonstrates some weaknesses is our current name handling. We have a tri-lemma; we want:

  • The ability to specify an identifier as opaque.
    • This:
      from `https://example.com/foo.parquet` 
      should definitely not compile to
      FROM "https://example"."com/foo".parquet
  • To not need to manage database schema hierarchy in PRQL
    • Ideally we could avoid managing more than a) Table and b) Column.
      • Specifically, all identifiers are $namespace.$column, where $namespace can be opaque
    • i.e. from foo.bar can be handled as the table foo.bar; select baz.foo.bar can just be the table baz.foo & the column bar.
    • NB: Databases require breaking up the idents when quoted — i.e. SELECT * FROM "foo.bar" will try and query a Table named foo.bar, not a schema named foo and table named bar
      • So currently we split up the namespace by periods when compiling to SQL
  • Names work similarly wherever they're used
    • This was the point @aljazerzen made on the dev call — with the approach this issue proposed above, we'd be handling identifiers differently depending on where they were used. from schema.table would not attempt to split schema.table, but select table.column would attempt to look up column in table.

The trilemma is that we can't avoid splitting https://example.com/foo.parquet, while splitting select baz.foo.bar into "bar"."foo"."bar", unless we handle select and from differently, which we don't want to do.


I think the current state is quite bad, and it's worth adjusting & investing to fix it, even if it means breaking things.

The only option I can think of is that we give up the "To not need to manage database schema hierarchy in PRQL". So, we would:

  • Maintain a nested namespace of objects
    • The biggest open question is whether this is quagmire.... Do we need to start understanding whether bar in from foo.bar.baz | select bar.baz is the same in both references??
  • Anything within backticks would be opaque
    • So from `https://example.com/foo.parquet` would compile correctly
  • To compile to FROM "project-foo".dataset.table we'd supply from `project-foo`.dataset.table
    • and not `from `project-foo.dataset.table` as it is now

As long as it's do-able, I think this would be an excellent result — a simpler model, with fewer surprises, and much better for things like URLs.

The main internal change is that Namespaces would need to become hierarchical. Any thoughts? We can discuss when \@aljazerzen is back

@max-sixty max-sixty changed the title Resolve idents based on type Resolve opaqueness of idents Mar 22, 2023
max-sixty added a commit to max-sixty/prql that referenced this issue Mar 24, 2023
I'm not sure this was correct prior; at least it didn't pass these tests, and I couldn't work out why it started with `!`.

This is part of the refactoring in PRQL#1535
max-sixty added a commit that referenced this issue Mar 24, 2023
I'm not sure this was correct prior; at least it didn't pass these tests, and I couldn't work out why it started with `!`.

This is part of the refactoring in #1535
max-sixty added a commit to max-sixty/prql that referenced this issue Mar 24, 2023
Part of PRQL#1535, this changes our Idents in PL from `$namespace.$name` to an arbitrary hierachy.

It's full of `.clone()` & `.into()` -- I've been doing this by replacing one definition and then adding lots of `.into()` untils it passes (not the most conceptual work!). Doing it incrementally at least means I can't end up in a quagmire of not knowing why the new version doesn't work; it's easy to revert to the last known good state. We can do another pass to improve the rust / reduce allocations.

The plan here would be to:
- replace any other usages in the compilation prior to the ident being resolved. Once it's resolved, it doesn't necessarily need a full hierarchy. Possibly we can remove the old `Ident` / rename the new one to `Ident`.
- adjust how the resolution works so we can have arbitrary hierarchies of schema (`a.b.c`). Still some work to think about how this should work (some initial comments in PRQL#1535)
- make backticks fully opaque, so PRQL#1535 works -- then using parquet files will be easy, we won't need to quote schemas, the semantics will be simple & consistent
@max-sixty
Copy link
Member Author

I'm closing #2305 but pasting the next steps from that here, as those are still required:

  • adjust how the resolution works so we can have arbitrary hierarchies of schema (a.b.c). Still some work to think about how this should work (some initial comments in Resolve opaqueness of idents #1535)
  • make backticks fully opaque, so Resolve opaqueness of idents #1535 works -- then using parquet files will be easy, we won't need to quote schemas, the semantics will be simple & consistent

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler language-design Changes to PRQL-the-language priority
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants