Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rename select to map #3782

Closed
aljazerzen opened this issue Nov 9, 2023 · 23 comments
Closed

Rename select to map #3782

aljazerzen opened this issue Nov 9, 2023 · 23 comments
Labels
language-design Changes to PRQL-the-language

Comments

@aljazerzen
Copy link
Member

What's up?

Since relations are just arrays of tuples, select is just mapping each tuple into a new tuple. This is equivalent to behavior of .map() function from other languages:

  • Python: map()
  • JavaScript: Array.map,
  • Rust: Iterator.map,
  • Java: Stream.map,

Should we rename select to map?

The major downside is that right now, PRQL looks very similar to SQL, thanks to many similar names of PRQL transforms and SQL clauses.

@aljazerzen aljazerzen added the language-design Changes to PRQL-the-language label Nov 9, 2023
@richb-hanover
Copy link
Contributor

The major downside is that right now, PRQL looks very similar to SQL, ...

I am going to second this concern. Its similarity to SQL also makes it feel peculiar on first inspection. The newcomer's internal monologue might be:

  • from - It has to be the first statement... Hmmm, that's interesting...
  • sort - I guess that will turn into ORDER BY
  • filter - and that'll become WHERE (or maybe HAVING...)
  • group - I bet that's GROUP BY

But with select readers will say, "AHAH! I know exactly what that does!" and they'd be right.

I don't have the confidence that map would elicit such a response. I'd stick with select. Thanks for listening

@max-sixty
Copy link
Member

max-sixty commented Nov 10, 2023

When we say artists | map name, is that a traditional form of map, though? name is a function that extracts an item from a tuple?

(I see that some of select is similar to map, so I'm not dismissing — starting with the strongest argument against)

Worth checking out jq's map

@eitsupi
Copy link
Member

eitsupi commented Nov 10, 2023

I too think that map is generally familiar as a higher-order function.
https://en.wikipedia.org/wiki/Map_(higher-order_function)

However, the way it is used here does not necessarily take a function as an argument, so it seems strange for me.

@richb-hanover
Copy link
Contributor

Thanks @eitsupi for voicing another concern that I didn't express: map(foo) takes a function. In select { foo, bar }, these are not functions, so it the understanding of map doesn't transfer too well.

To expand on my earlier note, I want to say that we need to overcome people's reservations about PRQL as an alternative to SQL.

When I first read about PRQL, I remember thinking, "PRQL seems too good to be true... I wonder how much new stuff I'll have to learn." Keeping select as a direct analogue to SQL's SELECT retains a critical piece of familiarity for newcomers.

@snth
Copy link
Member

snth commented Nov 10, 2023

I guess @aljazerzen 's point might be that select is really a mix of two things at the moment:

  • projection
  • column expressions which are really an application of map to arrays of rows/records.

So

select { alias = func col1 col2 ... coln }

is really shorthand for something like (in Pandas notation to try and avoid ambiguity):

df['alias'] = map( lambda row: func(row['col1'], row['col2'], ..., row['coln']), df.iterrows() )

We get a fair amount of the question "Why have derive when we have select?" which also points at this overlap. I think this also touches on #2723 . There was another previous discussion on namespacing and whether all columns should be prefixed with something like .. The best implementation of something like this that I have seen is the each keyword/construct in Power Query M Language (https://learn.microsoft.com/en-us/powerquery-m/m-spec-functions#simplified-declarations) which makes this explicit while keeping the syntax short and concise (in contrast to the verbosity of the rest of M lang).

This also came up a lot in my discussions at ApacheCon with Julian Hyde and Mihai Budiu about the "two language problem" of SQL: there's the query language that we all think of as SQL (FROM, SELECT, ...) and then there's the expression language for working with column values. Opinion was divided on whether this is necessary and should be maintained or a better language would eliminate this.

I think these are big questions and we should leave things be as they are until we have really formalised the type system.

@snth
Copy link
Member

snth commented Nov 10, 2023

A shorter answer is that I would suggest rather keeping map for working with array values in columns. I don't know if that's actually implemented anywhere.

Looks like Postgres doesn't have it (https://www.postgresql.org/docs/16/functions-array.html) and the usual solution is to UNNEST into rows, work on those and then array_agg again:
https://dba.stackexchange.com/questions/183370/map-functions-over-array-with-postgres

@aljazerzen
Copy link
Member Author

aljazerzen commented Nov 10, 2023

I see a lot of backpressure, but I have to say I had expected it on this one. I don't expect this will get accepted, but I want to explain my motivation, as it is deeper than a mere rename.


However, the way it is used here does not necessarily take a function as an argument, so it seems strange for me.

This is exactly my point: I do see select (and most of the other transforms) as high order functions. That's because I understand the first argument of select as an "implicit lambda function". Let me explain.

If we try to express my_rel | select {col_a + 1} in Python syntax, it would look like this:

select((col_a + 1,), my_rel)

The pipelined value my_rel becomes the last argument, select is a function, {col_a + 1} is its first argument. The problem with this notation is that col_a refers to something within my_rel. Python would not allow this, but we want to, since this is what makes PRQL useful.

My explanation of why this is allowed (other than saying "select just works this way"), says that in Python notation, the query would actually be expressed like this:

select(lambda row: (row.col_a + 1, ), my_rel)

What I'm saying is that the PRQL expression {col_a + 1} is an implicit lambda function, which could be expanded to:

func row -> (
   use row.*;
   {col_a + 1}
)

You might all wonder why I chose to spend time on mental gymnastics like this, since we do have bug reports open. When talking about PRQL, I like to say that it has "consistent semantics" and "compact name resolution rules". But in reality, this is not true - we do have some quite obscure rules for name resolution. We had a lot of discussion about "which column names are available where", "when is a column name overridden?", "how can names of relations in a query be used?" and it all boils down to a single question:

How to refer to things within a pipeline?

  • Python DataFrame libraries (i.e. Pandas and friends) say "don't worry, just use strings". So they have my_rel.assign(col_b=my_rel['col_a'] + 1), but they lose type-checking capabilities.
  • R DataFrame libraries take similar approach, but exploit weird R late-resolution rules that I don't fully understand. In the end, you can write my_dataframe %>% mutate(col_a + 1) and have the same downsides as Python approach.
  • SQL backes-in their name resolution rules into each clause, each one a bit different than the others. This does allow type-checking, but has obvious downsides.

So I think that the defining feature of PRQL is how we answers this question.

And I don't want it to be, what it currently is:

  • When a last argument of a function call is a relation, all of its columns are added into the resolution scope of the proceeding arguments under name this.
  • And similarly for the first argument, but under the name that.
  • And also, this and that can be omitted.
  • And also, column names can be prefixed with relation names.
  • And we have operator broadcasting over arrays that works as you would expect, except when you have nested arrays (i.e. when a cell in a dataframe contains an array).

I think that the "implicit lambda functions" can define our name resolution and transforms much more elegantly, but unfortunately not exactly the same as they are now.


This has now become a long rant, with an intention to explain my major concern with language design of PRQL. If this problem resonates with anyone, we can continue this discussion, but otherwise it does take a lot of time to write down.

@eitsupi
Copy link
Member

eitsupi commented Nov 11, 2023

@aljazerzen Thanks for your detailed explanation.
My question is the same as the suggestion by @snth that the name map should be used here.
In other words, if we change select to map, what happens to derive?

@eitsupi

This comment has been minimized.

@aljazerzen

This comment has been minimized.

@aljazerzen
Copy link
Member Author

What I really proposing is making map for arrays and select for relations equivalent.

Derive is very similar, but does have a bit different functionality, so I don't mind keeping it.

@eitsupi
Copy link
Member

eitsupi commented Nov 11, 2023

I see.

One question I have is, does this have anything to do with row-oriented or column-oriented?
In other words, I feel that in the column oriented case, the table is like a tuple of arrays, not an array of tuples.

[
  {a=5, b=false},
  {a=6, b=true},
]

v.s.

{
  a = [5, 6],
  b = [false, true],
}

R's data.frame and pandas.DataFrame are actually the latter, right?

@aljazerzen
Copy link
Member Author

Yes, dataframes are mostly the latter (row-oriented), but not all of them. The same applies to SQL databases - traditionally, they are row-oriented, but some of them opt for columnar storage.

Ultimately, this does not matter for the query language - we can say that relations as row-oriented, but then compile all operations into columnar operations (i.e. regular SQL).

@eitsupi
Copy link
Member

eitsupi commented Nov 14, 2023

I am wondering if it is possible to define select, derive, etc. using the new map function.
In other words, do we need the following use row.*; section in just the map function?

func row -> (
   use row.*;
   {col_a + 1}
)

Possibly this is a process specific to something like select and should not be included in the plain map function?

@aljazerzen
Copy link
Member Author

if it is possible to define select, derive, etc. using the new map function

Yes, that is very possible, I'd say that you could define them as:

let select = func expr rel -> (map expr rel)
let derive = func expr rel -> (map {this, expr} rel)

do we need the following use row.*; section in just the map function

No, that's a general concept that would be used to explain behavior of following transforms:

  • select
  • derive
  • sort
  • filter
  • aggregate
  • group (the by parameter)
  • join (the cond parameter)

Missing here are:

  • take (not dependent on the relation contents)
  • window (not dependent on the relation contents)
  • append (works with relation vars only)
  • loop (works with relation vars and a function)

@eitsupi
Copy link
Member

eitsupi commented Nov 14, 2023

Is automatic resolution of top-level names useful when considering things like nested tables or json type colums?
I suppose it may better that no such functionality exists in the map function and that it would be better to define another function like select we need it.

@aljazerzen
Copy link
Member Author

I don't understand your last comment.

@eitsupi
Copy link
Member

eitsupi commented Nov 14, 2023

I don't understand your last comment.

I meant that I am concerned that the automatic insertion of something like use row.*; when using the map function on nested structures might make things confusing.

@snth
Copy link
Member

snth commented Nov 14, 2023

@aljazerzen Thank you for taking the time to write out the details in #3782 (comment).

I agree a lot with your drive to define a consistent semantics and simplify the name resolution rules. My main objection at this stage is to the choice of name map. I might be more open to mutate as in dplyr or something like that.

Probably more importantly though, I think the naming is probably a secondary concern and we should figure out the correct semantics and name resolution rules. We could do that under the banner of select and change the name to map later if it became clear at that stage that map was a better description of the operations performed.


From your list of name resolution rules, the one that I would really like to remove is:

  • And also, column names can be prefixed with relation names.

SQL needs this queries essentially have one scope / namespace (at least at the top level and ignoring things like subqueries) and there is no sequencing or logical flow.

PRQL is different as each transform acts on the current relation. So while it is sometimes useful to prefix columns with the relation name in order to disambiguate duplicate column names after a join, I see this actually as an indication that we haven't clearly defined the semantics of join yet. So in my view this is a carry over from SQL and should not be retained.

I see at least two problems with keeping this:

  • columns may not be in scope at the time that they are referenced, and
  • it prevents us from easily mapping onto DataFrame backends which would be an easy target otherwise.

An example will probably make it clearer:

from a
join b (==id)
select {a.x, b.y, b.z}
filter b.y > a.y

Say column y is common to both tables. The problem is that in the last line a.y is no longer in scope/accessible. Now of course the compiler could catch that and throw an error. However to me the point is that does it even make sense to refer to relations from previous steps because each transform should only act on the current relation this.

My sense is that we've kept this around because in line 3 we have a problem in that we don't know which y we want. We went back and forth on whether the prefixes are required for relations in joins etc... but all of that was kicking the can down the road.

The problem is that after the join in line 2, the relation this is actually not properly defined because we don't know which y column we're keeping. SQL is lax about this and allows duplicated column names in the output. I don't think we should keep that and should rather find a different solution.

Some approaches around this that I've seen:

  • M lang: The result tuple is {id=a.id, x=a.x, y=a.y, b=[{b.id, b.y, b.z}]}, i,e. it appends a single column named b which contains a table of all matching rows in b for the current row in a.
  • Morel: The result tuple is {a=[{a.id, a.x, .y}], b=[{b.id, b.y, b.z}]}, it does the same for a and b which restores parity for RIGHT and OUTER joins.
  • Pandas: IIRC allows you to specify a prefix. I think without that it probably allows duplicate column names like SQL.
  • LINQ: I believe SelectMany takes as a second argument a lambda of the form (a, b) -> {a.x, ..., b.z} which forces you to specify the output tuple for each row match found.

I think the LINQ approach is the best as that was specifically designed to be that way by Erik Meijer in order to make LINQ monadic and in fact I believe that second lambda is (or is related to) the bind operation in the Monad definition. This forces you to make the decision about which columns to keep at the point of defining the join which seems right as all other approaches kick the can down the road to some extent.

If we were to adopt this we would need some additional syntax:

  • join would need to take another argument after the join condition.
  • This third argument could be a lambda, but that's probably too verbose and not required, or a tuple definition as would usually be supplied to select.
  • The lambda / tuple definition would receive two arguments: this and the name of the relation being joined
    • The name of the first argument has to be this as it could be a multi-step pipeline so no relation name will be available in general.
    • The second relation being joined will be given by name so that could be used or otherwise one can just use that.
  • The tuple could default to {this.*, that.*} and could be omitted, only being required to be given when there is a column name clash.
  • With those changes there should be no more need for relation name prefixes.
from a
join b (==id) {x, b.y, b.z}

equivalent to

from a
join b (==id) {this.x, that.y, that.z}

Sorry, this turned out to be a rather longer post as well. That wasn't actually intended and some of these ideas just developed as I went along.

@richb-hanover
Copy link
Contributor

@snth's suggestion of having join use a tuple to express the columns in its result is interesting. It seems to clear up the mystery of when an earlier table's name flows through the join ("Never") A couple comments:

  • I'm not entirely sure if it's related, but What columns does group pass through? #3490 talks about the columns that are passed through by group ()

  • ergonomics: how do we expect people to handle joining two wide tables that might have many columns with the same name. It would be an enormous hassle to enumerate all the columns that might be interesting, especially as the analyst is just getting started in the investigation.

  • a possible solution is using { this.*, that.* } It's convenient, but how can the person disambiguate between this.foo and that.foo? Perhaps the tuple could use { afoo = this.foo, this.*, that.* } to assign a name to a specific table's column. Does repeating a column break any rules - PRQL, SQL, or otherwise?

  • Although you have convinced me that select { foo = sum(bar) } is a map() operation in disguise, I still strongly favor retaining the select keyword. (I wasn't there when they were designing the first version of SQL, but I suspect that they were so pleased to be able to type statements that actually implement relational algebra that they figured "We're done!" Besides, the other "modern" programming languages at the time were COBOL and FORTRAN - neither of which are models of expressive beauty.) But if we could have whispered in their ear, "Psst, foo = sum(bar) is a whole lot more understandable than SELECT SUM(BAR) AS FOO", they might have agreed...

@aljazerzen
Copy link
Member Author

I meant that I am concerned that the automatic insertion of something like use row.*; when using the map function on nested structures might make things confusing.

That's a good point, yeah.


This issue is not about renaming select anymore, so we can shelve that for when implicit lambda functions are a thing. But my intention of bringing our name resolution into question seems to be working!


Regarding the join: I would classify current behavior of join as your "Morel" option.

It is not directly inherited from SQL, as in fact does work in the way that you say it should. Your example reports that a.y does not exist at line 4.

Also your two concerns might not that concerning.

  • columns that are not in scope at the time that they are referenced will (and already do) cause an error,
  • it would not be that hard to compile RQ into transformations of a DataFrame, as RQ is much more explicit about which columns are present where.

Regardless, I like your proposal of a post-join-tuple-merger function.
It makes sense and nicely deals away with the need for that.
My concern is that we are making join less orthogonal, as it would also perform what select already does.

@snth
Copy link
Member

snth commented Nov 15, 2023

Regardless, I like your proposal of a post-join-tuple-merger function.
It makes sense and nicely deals away with the need for that.

After reading your comment I was thinking that it might even remove the need for this as well.

Referring to the docs at Identifiers & keywords >> this & that, the only remaining need for this there is for things like count this. Should we consider bringing * back for that since it's more ergonomic (less to type)?

In the join examples, this can always be covered by an absence of prefix which clearly refers to the current relation and that by the name of the relation being joined. So adapting the example from the documentation page referenced above:

from invoices
join tracks (track_id==tracks.id)

My concern is that we are making join less orthogonal, as it would also perform what select already does.

I do see that, and I don't necessarily have a good answer yet but my gut feeling is that what you gain in consistency makes it worth it.

I thought that the following simplified version of my previous example would prove my point but to my surprise this actually works in the playground:

let a = [{id=1, x=1}]
let b = [{id=1, x=10}]

from a
join b (==id)
#select {a.id, b.x}

which produces

id	x	id	x
1	1	1	10

I guess that's because SQL is very lax and allows duplicate column names. However if we want PRQL to be a general language that can apply in many diverse data processing examples then I don't think we'll have that available. For example say you wanted to use PRQL as a DSL for streams of structs in Rust or JSON objects. In both those cases field names would need to be unique so from a | join b (==id) would not be a valid pipeline. Now that contradicts a statement that we also often make, which is that you should be able to cut the pipeline at any point and the "head" should still be a valid pipeline.

So I hear your concern about duplicating the behaviour of select but it seems to me that it is required here.

@aljazerzen
Copy link
Member Author

aljazerzen commented Dec 18, 2023

The topic of this issue is all over the place.

I deem my proposal rejected with good reasons, @snth, I suggest you open a new issue about post-join-mapping.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
language-design Changes to PRQL-the-language
Projects
None yet
Development

No branches or pull requests

5 participants