Skip to content

feat!: require explict database references#4223

Merged
aljazerzen merged 6 commits into
mainfrom
feat-no-from
Feb 16, 2024
Merged

feat!: require explict database references#4223
aljazerzen merged 6 commits into
mainfrom
feat-no-from

Conversation

@aljazerzen
Copy link
Copy Markdown
Member

@aljazerzen aljazerzen commented Feb 13, 2024

This PR removes std.from and renames default_db module to from. This is the effect:

from.x
join from.y (==id)

@aljazerzen
Copy link
Copy Markdown
Member Author

I've tried a few different ways to implement the special case name-resolving behavior of std.from and std.join, but I could not consider any of them as anything else than a hack that works only most of the time.

So I give up of trying to make this behavior work and am instead removing it.

I don't accept this PR to be merged as it has arguably worse aesthetics than what we had before. If it doesn't I don't mind, but I would continue to develop a fork of PRQL as my pet language.

As @max-sixty pointed out the other day, current state of prqlc is such that most things work, but when you get off the beaten track, things start to break with weird error messages. I attribute this to the fact that we have a few "hackish" parts of the codebase, which are necessary for a few features that I consider "inconsistent with the rest of the language".

It is probably possible to make these feature work correctly, but I don't have the motivation to spend enormous amounts of time to do that.

@max-sixty
Copy link
Copy Markdown
Member

I don't accept this PR to be merged as it has arguably worse aesthetics than what we had before. If it doesn't I don't mind, but I would continue to develop a fork of PRQL as my pet language.

Can we discuss more?1 Happy to speak this week on Discord.

You're contributed more to PRQL than anyone, and so your view should have a huge weight. I do think it would be a shame for both PRQL and PRQL-pet-lang — one would lose most of its code momentum, and the other would lose most of its community momentum. Projects really need both to be successful. I accept we've been a bit stuck for a while, without the user momentum we'd expect from the community + code momentum.


I admittedly don't get the from.x. We did initially have no from as an aesthetic choice, but had broad consensus that leading transforms on every line were easier to read. This basically removes that, and mandates a form of default_db.foo throughout.

If we're looking for smaller changes interface changes which achieve the same goal, could we restrict from foo to always be from default_db.foo And then a prefix is required to escape that? (I'm not sure this is great but possibly there are similar things)


As @max-sixty pointed out the other day, current state of prqlc is such that most things work, but when you get off the beaten track, things start to break with weird error messages. I attribute this to the fact that we have a few "hackish" parts of the codebase, which are necessary for a few features that I consider "inconsistent with the rest of the language".

If we think we can't get PRQL to be robust with the current strategy and we need to make bigger changes, there are things we could do to restrict the scope and make it more robust.

For example at the extreme, we could force having a schema in order to compile PRQL. This would be a big change, and would significantly reduce its flexibility, but it would also have the potential to become really robust in the short-medium term. (And it wouldn't completely preclude us from removing this constraint in the future)

Footnotes

  1. FWIW I do think it's a bit extreme to say "if this isn't merged then I'm forking", but I realize we haven't spoken in a while, and I haven't contributed to the compiler itself at all recently, so maybe this is what's necessary to inspire the change...

@eitsupi
Copy link
Copy Markdown
Member

eitsupi commented Feb 14, 2024

Thank you for the enormous amount of work and consideration on these.
IIUC, I feel that it might make sense to remove it because currently from is not doing anything (and may work without it).

However, I feel that there is a better alternative to the database than from.<table_name>.
For example, dummy.<table_name>.

@aljazerzen
Copy link
Copy Markdown
Member Author

Mentioning the word "fork" was probably not very wise of me :D Please understand my frustration when rewriting the name resolution for the 3rd time and hitting a wall again.

Of course we can talk about what to do about the issue of from and how to make it more ergonomic, but at least for now, I will drop the restriction that "the name resolution of an expression depends on the function call that the expression is in".

In practical terms for PRQL, this means that in this example:

let b = read_parquet a

let c = from a

... variable a will use the same name resolution rules regardless of it is used in read_parquet or in from.

Currently this does not hold, as the example above would report a missing error for a in read_parquet, but would infer that a is a table when used in from.

My solution is simple: require the module name in all table references.

In the example above, that would mean this:

let b = read_parquet a

let c = from default_db.a

... because default_db is currently the name of the module that infers all table declarations by default.

(This is the required part, now follows what can easily be changed)


To make this more ergonomic:

  • I've removed from as it is really a no-op now,
  • I've also changed name default_db to from, so keep the resemblance to the current aesthetics.
# proposed change
from.my_table
select {a, b}
join from.other_table

But this is not the only way, we could have:

# alternative 1
db.my_table
select {a, b}
join db.other_table

... or ...

# alternative 2
from db.my_table
select {a, b}
join db.other_table

@max-sixty
Copy link
Copy Markdown
Member

Please understand my frustration when rewriting the name resolution for the 3rd time and hitting a wall again.

Totally, no stress my end.

And to reiterate — sometimes saying the extreme thing is the way to get change! It's often underrated, particularly when others aren't listening well enough.


In the example above, that would mean this:

let b = read_parquet a

let c = from default_db.a

Yes, good point.


If we really have to choose something in the list above, I would vote for:

# alternative 2
from db.my_table
select {a, b}
join db.other_table

I think the natural question from this is — "but why not just assume the db. unless it's ambiguous, i.e. unless it's an existing variable"?

I think the answer is — "because we want to resolve the a before we know it's part of from", is that right? (And we don't want to assume that in derive x = y, y is default_db.y...)

In thinking through this, I'm less opposed to it than I was. I do feel like we lose some of the beautiful simplicity that drew people in to PRQL. But plausibly prefixing db. to every table name isn't that bad.

Two hail-marys before we make the change:

  • Might types help here at all? Specifically that would do name resolution after we know types; so that from a would be typed as a table, while derive x = y wouldn't be
  • Would using .column syntax help, jq-style? I don't think so, because a still needs to be resolved between a table and a declared variable.

I really don't get the from.my_table, I don't think from works well as a namespace...


If we also want to drop from, I'm not strongly opposed to it. But folks do generally seem to like it. I think it partly depends on whether we're a simple language with each line starting with pre-existing transforms, like:

from foo      # nice to have each line starting with a transform
derive (bar = baz + 5)
...etc

...or we're a more complicated language with variables and custom functions, like:

let foo = # some table definition
let bar = # some function

foo | bar     # lines aren't starting with transforms anyway

@aljazerzen
Copy link
Copy Markdown
Member Author

"but why not just assume the db.?"
I think the answer is — "because we want to resolve the a before we know it's part of from", is that right? (And we don't want to assume that in derive x = y, y is default_db.y...)

Exactly, well explained.


Might types help here at all? Specifically that would do name resolution after we know types; so that from a would be typed as a table, while derive x = y wouldn't be

I think you are right, we it would not help much. Ideally, name resolution would happen before and independent of type resolution. It would be weird that adding a type annotation would make some name point to completely different variable.


Would using .column syntax help, jq-style? I don't think so, because a still needs to be resolved between a table and a declared variable.

Yup, it wouldn't solve this issue.


I really don't get the from.my_table, I don't think from works well as a namespace...

Looking at code examples with from.x (two days after I made this PR) it does seem very weird. Using a preposition for a namespace now feels like a crime.


If we also want to drop from, I'm not strongly opposed to it. But folks do generally seem to like it

I'm not opposed to it either. It is unnecessary, but it does look nice:

from db.x
join db.y
into a

from a

... as opposed to:

db.x
join db.y
into a

a

Ok, I'll update this PR to "alternative 2".

@aljazerzen
Copy link
Copy Markdown
Member Author

Another thing: std.from would have a utility of easily adding type annotations. If you start a pipeline with std.from, that would signal that this pipeline operates on relations and not arrays of ints for example.

@snth
Copy link
Copy Markdown
Member

snth commented Feb 15, 2024

I'm in favour of "alternative 2" as well.

"but why not just assume the db.?"

I agree with the name resolution reasons discussed above (although I'm not really qualified to comment on them the explanations do make sense), but more importantly for me (given that I mostly wear the "Advanced User" rather than "Compiler Engineer" hat), I think it improves the user experience as well. In particular it makes it clearer what are database tables/views and what are derived relations. I was going to add an example for this but Aljaz has already provided the perfect one:

from db.x
join db.y
into a

from a

I think this nicely separates derived relation a from database relations x and y.

db is a nice, short prefix that's easy to type and perfectly communicates that this is a database relation that we're dealing with.


I would strongly prefer to keep the from, mostly from an aesthetic point of view but I think there might be a technical reason as well, namely how would aliasing work otherwise?

from x=db.long_legacy_system_table_name
join y=db.other_long_legacy_system_table_name
...

What would that be without from? Maybe

x=db.long_legacy_system_table_name
join y=db.other_long_legacy_system_table_name
...

but then one might think shouldn't that have a let in front of the first line but that would actually change the meaning then and would no longer "pipe" records into the join statement, i.e. the following means something different (and presumably is an error?):

let x=db.long_legacy_system_table_name
join y=db.other_long_legacy_system_table_name
...

Where would the alias go with the from.my_table syntax? Maybe

x=from.my_table
...

but that has the same let problem as the previous example.

@snth
Copy link
Copy Markdown
Member

snth commented Feb 15, 2024

Of course aliasing could be solved with separate let declarations but that seems more clumsy to me, e.g. compare

from x=db.long_legacy_system_table_name
join y=db.other_long_legacy_system_table_name
...

with

let x=db.long_legacy_system_table_name

x
join y=db.other_long_legacy_system_table_name
...

And what about role playing aliases, e.g. manager and employee?

let employee=employees

employee
join manager=employees (employee.manager_id==manager.id)
...

Again, I think the following is much better:

from employee=employees
join manager=employees (employee.manager_id==manager.id)

@aljazerzen aljazerzen changed the title feat!: remove from and rename default_db to from feat!: require explict database references Feb 16, 2024
@aljazerzen aljazerzen enabled auto-merge (squash) February 16, 2024 17:37
@aljazerzen aljazerzen merged commit 3dcace7 into main Feb 16, 2024
@aljazerzen aljazerzen deleted the feat-no-from branch February 16, 2024 17:48
@max-sixty
Copy link
Copy Markdown
Member

I'll help clear this up

We should run pr-nightly for these sorts of PRs, since the bindings will break without their paths being changed; so the tests won't run automatically

@max-sixty
Copy link
Copy Markdown
Member

#4241

max-sixty added a commit to max-sixty/prql that referenced this pull request Mar 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants