Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Describe how identifiers in SELECT queries are resolved #23194

Closed
alexey-milovidov opened this issue Apr 16, 2021 · 24 comments
Closed

Describe how identifiers in SELECT queries are resolved #23194

alexey-milovidov opened this issue Apr 16, 2021 · 24 comments
Assignees
Labels
development Developement process & source code & implementation details nightmare Difficulty level

Comments

@alexey-milovidov
Copy link
Member

alexey-milovidov commented Apr 16, 2021

Query analysis in ClickHouse is more complicated than it is in standard SQL due to the following extensions:

  • aliases can be defined and used in any part of the query;
  • expressions can be used in any part of the query;
  • there is support for complex columns with subcolumns (although standard SQL also has some sort of it);
  • there is ARRAY JOIN;
  • lambda functions and higher order functions.

These extensions are very convenient for writing SQL queries we don't regret of them :)

But it is not convenient from implementation standpoint - many things are in different places in code and are underspecified. It makes further development difficult.

Let's try to describe how identifiers in SELECT queries are resolved.
Then we will have a chance to implement it in more consistent, easy to maintain fashion.

I will try to describe how it works right now and how it is supposed to work after upcoming extensions.
The text below is only for ClickHouse experts.


  1. Identifier can correspond to a column in a table in section FROM:
SELECT column FROM table
  1. Identifier can be unqualified, qualified by table or table alias:
SELECT table.column FROM table
SELECT t.column FROM table AS t
  1. A table can be qualified with database name:
SELECT database.table.column FROM database.table
  1. A question: should we allow to qualify column identifier with default database if it's not qualified in table identifier?
SELECT database.table.column FROM table
  1. In case of multiple tables in a query, unqualified identifer will be searched in every table. If it is found in only one table, it will be resolved to this table. In case of ambiguity, exception should be thrown.
SELECT t1.x, t2.x, y FROM (SELECT x, y) AS t1, (SELECT x, z) AS t2
  1. If query does not contain FROM section, it behaves the same way as if FROM system.one is present.
SELECT 1;
SELECT dummy;
SELECT one.dummy;
SELECT system.one.dummy;

This query also works:

SELECT *

But we may remove the support for asterisk expansion in case if section FROM is omitted.

This query works in current version of ClickHouse, but we are not proud of this fact:

SELECT ***
  1. Identifier can consist of multiple components and can be resolved to a column or a subcolumn of arbitrarily nested path or a table.
SELECT * FROM t WHERE x IN table;
SELECT * FROM t WHERE x IN db.table;
SELECT * FROM t WHERE dictHas(table, x);
SELECT * FROM t WHERE joinGet(db.table, x);

In this example, x is resolved to a column in table t, and table, db.table and t are resolved to tables.

SELECT nest.key.subkey FROM t

In this example, nest.key.subkey is subcolumn of complex column nest. Ambiguity is possible. In case of ambiguity, exception should be thrown.

  1. Table expression in FROM section can contain arbitrary number of ARRAY JOIN or LEFT ARRAY JOIN expressions.
    On the left is a table and on the right is a column of Array type.
SELECT * FROM (SELECT [1, 2, 3] AS arr) ARRAY JOIN arr

A column to array join can be specified by identifier (arr in the example above) or as immediate array (literal):

SELECT * FROM table ARRAY JOIN [1, 2, 3] AS arr

If it is specified as literal, alias is required. A question: we may not require alias.

A column to array join can be resolved to complex column (nested data structure) as well:

CREATE TEMPORARY TABLE test1 (a String, nest Nested(x String, y String));

SELECT a, nest.* FROM test1 ARRAY JOIN nest;
SELECT a, n.* FROM test1 ARRAY JOIN nest AS n;

Nested data type is a syntactic sugar for Array(Tuple(...)):

CREATE TEMPORARY TABLE test1 (a String, nest Array(Tuple(x String, y String)));

SELECT a, nest.* FROM test1 ARRAY JOIN nest;
SELECT a, n.* FROM test1 ARRAY JOIN nest AS n;

If alias is specified for the right hand side of array join: ARRAY JOIN arr AS joined, the joined alias will reference to array elements and the original name arr will reference to the original value.

If alias is not specified for the right hand side of array join: ARRAY JOIN arr, the original name arr (can be qualified or not) will reference to the array elements and the original arrays cannot be referenced.

SELECT table.nest FROM table ARRAY JOIN nest
  1. Aliases can be specified for every expression in a query. Within the boundaries of one SELECT query/subquery, alias is globally visible, even before it's declaration.
SELECT 1 AS x, x, x + 1;
SELECT x, x + 1, 1 AS x;
SELECT x, 1 + (2 + (3 AS x));
  1. Aliases defined outside of subquery are not visible in subqueries (but see below).

Unless they are specified in WITH clause and enable_global_with_statement is set.

SELECT 123 AS x FROM (SELECT a, x FROM (SELECT 1 AS a, 2 AS b)); -- invalid query, identifier `x` in subquery is not found.
SELECT 123 AS x, (SELECT x) AS y; -- does not work even for scalar subqueries.
SELECT 123 AS x, 123 IN (SELECT x); -- does not work even for IN subquery either.
SET enable_global_with_statement = 1; -- this is by default, BTW.

WITH 123 AS x SELECT 1 FROM (SELECT a, x FROM (SELECT 1 AS a, 2 AS b)); -- ok, identifier `x` in subquery is defined in WITH section.

-- here we refer to table `t` (defined as subquery) three times, one of them inside another scalar subquery.
WITH t AS (SELECT 1) SELECT t, (SELECT * FROM t) FROM t;

We want to change it in the following way:

  • aliases and columns from the outside will be visible inside subqueries;
  • ambiguity will be resolved by scoping - the nearest name to the current scope is selected;
  • columns in current scope are preferred for disambiguation to aliases from the outer scope;
  • this is needed to allow correlated subqueries;
  • the visibility of outer aliases inside FROM subqueries is mostly pointless but still usable for scalar values;
  1. Aliases defined in subquery are not visible outside of subquery (but see clarification below).
SELECT x FROM (SELECT y FROM table WHERE (1 AS x) = y) AS t; -- throws, because x is not visible outside.
SELECT t.x FROM (SELECT * FROM (SELECT 1 AS x) AS t); -- throws, because the table name `t` is not visible outside.
SELECT x FROM (SELECT * FROM (SELECT 1 AS x) AS t); -- ok.

A question: should we allow to refer to aliases from subqueries by the names qualified by subquery name?

SELECT t.x FROM (SELECT a, a + (1 AS x) AS b) t
  1. Only the names of SELECT expression (a.k.a projection names, correlation names) are "exported" outside as table columns:
SELECT t.x FROM (SELECT 1 AS x) AS t;
SELECT t.x FROM (SELECT x FROM tbl) AS t;
SELECT x FROM (SELECT x FROM tbl) AS t;
SELECT tbl.x FROM (SELECT x FROM tbl) AS t; -- this is wrong, the `tbl` name is not exported
SELECT t2.x FROM (SELECT x FROM tbl AS t2) AS t; -- this is also wrong, the `t2` alias is not exported

Nothing is "exported" from scalar subqueries or IN subqueries, subquery expressions:

SELECT x, (SELECT 1 AS x); -- does not work, `x` is not visible;
SELECT x IN (SELECT 1 AS x); -- does not work either;
SELECT x IN (SELECT 1 AS x) FROM (SELECT 1 AS x); -- this will work, but keep in mind that there are two different `x`.
  1. Aliases can be ambiguous with column names. In case of ambiguity inside the boundaries of one query/subquery, it is resolved according to the value of setting prefer_column_name_to_alias.
SELECT x + 1 AS x, x FROM (SELECT 1 AS x);

This query will return either 2, 1 or 2, 2 depending on the value of prefer_column_name_to_alias.

SELECT x, x + 1 AS x FROM (SELECT 1 AS x);

The order of alias usage and alias definition does not matter (this may be a subject to change).

  1. If more than one alias with the same name are defined inside the boundaries of one query/subquery and these aliases refers to semantically different expressions, exception should be thrown.
SELECT 1 AS x, 2 AS x; -- throws exception.
  1. Identifier can appear as a formal parameter in a lambda function:
SELECT arrayMap(x -> x + 1, arr)

Here x is a formal parameter.

Parameters to lambda functions are scoped. Their names can be ambiguous to other identifiers. In case of ambiguity, parameter of lambda function is preferred. The nearest to the current scope parameter is preferred.

SELECT x, arrayMap((x, y) -> x[1] + y + arrayFirst(x -> x != y, x), arr) FROM (SELECT 1 AS x, [([1, 2], 3), ([4, 5], 6)] AS arr);

This query disambiguated as follows:

SELECT x1, arrayMap((x2, y2) -> x2[1] + y2 + arrayFirst(x3 -> x3 != y2, x2), arr) FROM (SELECT 1 AS x1, [([1, 2], 3), ([4, 5], 6)] AS arr);

In expression x -> expr(x) the scope of x is expr(x). So, the expression arrayMap(x -> x, x) should be read as arrayMap(x1 -> x1, x2).

  1. Aliases should refer to identifiers by their semantical meaning, not by the name or syntactic construction.

If alias referes to an expression that is using a name that is unavailable in the place where alias is substituted, exception should be thrown.

SELECT arrayMap(x -> [y * 2, (x + 1) AS y, 1 AS z], arr), y;

The last y refers to an expression that depends on x that is formal parameter of lambda function and is not available out of its scope.

SELECT arrayMap(x -> [y * 2, (x + 1) AS y, 1 AS z], arr), z;

This will work as alias z is not scoped.

SELECT arrayMap(x -> (x + 1) AS y, arr1), arrayMap(x -> (x || 'hello') AS y, arr);

This should throw exception as aliases y refers to different expressions.
The scope of formal parameter x is not related to any aliases that are using x.

  1. If identifier can be resolved to a table or a column in a table, ambiguity is resolved in the following way:

In right hand side of ARRAY JOIN it is resolved to a column.
In other parts of FROM it is resolved to a table.
In all other places it is resolved to a column. This includes right hand side of IN operator, dictGet, joinGet.

SELECT x FROM x ARRAY JOIN x WHERE x IN x

This is disambiguated as follows:

SELECT col FROM table ARRAY JOIN col WHERE col IN col

Note: we can make more cases when we simply throw exception in case of ambiguity.

  1. In case of ambiguity with subcolumns of complex columns, exception should be thrown.

The previous rule can be theoretically applied to subcolumns of complex columns as well:

CREATE TABLE db.tbl
(
    col String,
    db Nested
    (
        tbl Nested
        (
            col String
        )
    )
)
ENGINE = Memory;

SELECT db.tbl.col FROM db.tbl

In this example, db.tbl.col can be resolved to the name of subcolumn of the Nested column.
But it will be much better to simply throw exception in all cases of ambiguity with subcolumns.

  1. Asterisk or qualified asterisk can be specified in any place where a list of expressions is possible.
SELECT * FROM table; -- ok
SELECT f(*) FROM table; -- ok
SELECT * + * FROM table; -- ok, though looks weird
SELECT * GROUP BY *; -- ok
SELECT * ORDER BY *; -- not ok as every component of ORDER BY may contain ASC/DESC and COLLATE; though can be supported in some sense
SELECT * WHERE *; -- not ok as WHERE contains single expression, not a list
  1. Unqualified asterisk is expanded to all the columns from all tables in section FROM in current scope; without virtual (hidden) columns; without ALIAS and MATERIALIZED columns unless it's changed by the setting asterisk_include_materialized_columns, asterisk_include_alias_columns.

In case of ambiguous column names in tables, asterisk is expanded to an identifier qualified by table name (and if table name is ambiguous, also by database name). In case of no ambiguity, unqualified name is used.

SELECT * FROM (SELECT a) AS t, (SELECT b) AS u;
-- equivalent to:
SELECT a, b FROM (SELECT a) AS t, (SELECT b) AS u;
SELECT * FROM (SELECT a) AS t, (SELECT a) AS u;
-- equivalent to:
SELECT t.a, u.a FROM (SELECT a) AS t, (SELECT a) AS u;
SELECT * FROM (SELECT a) AS db1.t, (SELECT a) AS db2.t;
-- equivalent to:
SELECT db1.t.a, db2.t.a FROM (SELECT a) AS db1.t, (SELECT a) AS db2.t;
  1. Qualified asterisk is expanded to all the columns from a table (with the same exceptions as for unqualified asterisk) or all subcolumns of a complex column or subcolumn of a table.

A question. Should we allow asterisk qualified by database only?

SELECT db.* FROM db.t1, db.t2; -- it's in question whether this should work.
SELECT nest.* FROM table; -- expansion of complex column.
SELECT nest FROM table; -- you can select all complex column at once as a single tuple / array of tuples.

For compatibility reasons, asterisk expansion also expands all complex columns recursively, so they are flattened into multiple columns. This maybe a subject for adjustment by settings.

CREATE TABLE t
(
    x String,
    nest Nested
    (
        a String,
        b String
    )
) ENGINE = Memory;

SELECT * FROM t;

-- equivalent to:
SELECT x, nest.* FROM t;

-- equivalent to:
SELECT x, nest.a, nest.b FROM t;
  1. Qualified asterisk as well as named tuple access operator can be applied to expressions.
SELECT generateTuple().1; -- simple tuple access operator
SELECT generateTuple().hello; -- named tuple or complex column access operator - can be applied to Nested type as well as Array of named Tuple
SELECT generateTuple().*; -- expansion of a tuple or complex column with asterisk

SELECT CAST(('hello', 1) AS Tuple(name String, count UInt32)).*
  1. Untuple expression is an alternative way to write asterisk expansion.
SELECT untuple(CAST(('hello', 1) AS Tuple(name String, count UInt32))); -- will give two columns `name` and `count`.
  1. Asterisk expansion can be modified by column transformers expressions.
SELECT * EXCEPT('hello|world');
SELECT t.* EXCEPT(hello, world);
SELECT db.t.* REPLACE(x + 1 AS x);

Column transformers can be chained:

SELECT * EXCEPT(hello) REPLACE(x + 1 AS x);
  1. COLUMNS expression or qualified COLUMNS expression can be used in place of asterisk or qualified asterisk. It is processed similarly but with the specified filter.
SELECT COLUMNS('^test_') FROM table;
SELECT t.COLUMNS('^test_') FROM t, u;
SELECT t.COLUMNS('^test_') EXCEPT (test_hello, test_world) FROM t, u;

The filter in form of regular expression is applied to unqualified names from the corresponding tables.

  1. Expressions with identical "correlation names" inside subqueries are not allowed.
SELECT * FROM (SELECT x, x FROM (SELECT 1 AS x)); -- will throw;
SELECT x FROM (SELECT x, x FROM (SELECT 1 AS x)); -- will throw;
SELECT 1 FROM (SELECT x, x FROM (SELECT 1 AS x)); -- will throw;
  1. Unnamed expressions inside subqueries are automatically named in implementation specific way.
SELECT `plus(1, 2)` FROM (SELECT 1 + 2); -- this works for now but not guaranteed.

A question. We can leave this automatic names only for outermost query and simply not allow to refer unnamed expressions from subqueries.

  1. Identifiers can be resolved to functions. Lambda expressions can be aliased. (proposal)
SELECT arrayMap(plus, arr1, arr2);
SELECT x -> x + 1 AS fun, arrayMap(fun, arr);

This is needed to allow SQL UDFs.

Aggregate functions can be found as well:

SELECT arrayReduce(sum, arr);
-- currently works only this way:
SELECT arrayReduce('sum', arr);

In case of ambiguity, function is less preferred than alias or column name or table name.

  1. Aliases specified inside table function expression are not visible outside of table function expression.
SELECT x FROM mysql(('host' AS x) || '123', port, db, table); -- does not work as `x` is not visible outside.
  1. Table may have ALIAS columns.

These columns can be referred as usual columns, but they are expanded during query analysis (for better possibilities of optimization).

CREATE TEMPORARY TABLE t
(
    x UInt8, 
    y ALIAS x + 1
);

SELECT y FROM t;

If an expression of ALIAS column defines other aliases, these aliases are only visible inside this expression.

CREATE TEMPORARY TABLE t
(
    x UInt8, 
    y ALIAS ((x + 1) AS z) + 1
);

SELECT x, y, z FROM t; -- does not work, because `z` is not visible.

-- this is correct:
CREATE TEMPORARY TABLE t
(
    x UInt8, 
    y ALIAS z + 1,
    z ALIAS x + 1
);

ALIAS columns may refer other ALIAS columns in the same table. The order of ALIAS columns does not matter for that.

@alexey-milovidov alexey-milovidov added the development Developement process & source code & implementation details label Apr 16, 2021
@vdimir
Copy link
Member

vdimir commented May 24, 2021

#24395

@alexey-milovidov
Copy link
Member Author

Should also consider #6571
Sometimes we need to locate tables and query their structure on remote servers.

@CurtizJ
Copy link
Member

CurtizJ commented Jun 25, 2021

Do we want to support UTF identifiers? #25594

@alexey-milovidov
Copy link
Member Author

@CurtizJ I have no idea why they are not already supported.
Our identifiers should be charset-agnostic.

@alexey-milovidov
Copy link
Member Author

See also #18185.

@alexey-milovidov
Copy link
Member Author

See also #26030.

@vdimir
Copy link
Member

vdimir commented Jul 27, 2021

Notes about current JOIN behavior

Let's consider two tables:

CREATE TABLE t1 (id Int, s String, v1 String) ENGINE = TinyLog;
CREATE TABLE t2 (id Int, s String, v2 String) ENGINE = TinyLog;
  1. SELECT * FROM t1 FULL JOIN t2 ON t1.id = t2.id;
    Yields columns with names: id, s, v1, t2.id, t2.s, v2.
    Duplicating names in right table qualified with t2.

  2. SELECT s, t1.s, t2.s FROM t1 FULL JOIN t2 ON t1.id = t2.id;
    Yields columns with names: s, s, t2.s
    Qualified and non-qualified names from left table are the same.

  3. SELECT v2, t2.v2 FROM t1 FULL JOIN t2 ON t1.id = t2.id;
    Yields columns with names: v2, v2
    If column presented only in right table, so unqualified name refers to column from right table.

  4. SELECT id, t1.id FROM t1 FULL JOIN t2 USING id;
    Yields id, id.
    But this id column differs from JOIN ON case: it contains concatenated values from both left and right table, and doesn't contain nulls or defaults (non joined rows from left filled with values from right). So, type of this column have to hold all possible values from left and right type for FULL join. Now it have result type is super-type regardless join kind.

  5. SELECT id, t1.id, t2.id FROM t1 FULL JOIN t2 USING id;
    Yields id, id, t2.id
    It's weird case. If we add qualified name for right table, behavior changed to same as for JOIN ON, column id refers to column for left table with nulls or defaults where not joined.

  6. SELECT * FROM (SELECT * FROM t1) FULL JOIN (SELECT * FROM t2) USING (id) SETTINGS joined_subquery_requires_alias = 0;
    Yields: id, s, v1, s, v2
    With joined_subquery_requires_alias = 0 unqualified columns returned from left table and unqualified non-duplicated columns from right.
    Duplicated columns from right table replaced with columns from left table, so we get two s columns from left.

So,

SELECT t1.s FROM ( SELECT t1.s FROM t1 FULL JOIN t2 ON t1.id = t2.id ); -- returns error, column resolved to `s`
SELECT s FROM ( SELECT t1.s FROM t1 FULL JOIN t2 ON t1.id = t2.id ); -- ok

SELECT t2.s FROM ( SELECT t2.s FROM t1 FULL JOIN t2 ON t1.id = t2.id ); -- ok
SELECT s FROM ( SELECT t2.s FROM t1 FULL JOIN t2 ON t1.id = t2.id ); -- error

SELECT t1.v1 FROM ( SELECT t1.v1 FROM t1 FULL JOIN t2 ON t1.id = t2.id ); -- returns error, column resolved to `v1`
SELECT v1 FROM ( SELECT t1.v1 FROM t1 FULL JOIN t2 ON t1.id = t2.id ); -- ok

SELECT t2.v2 FROM ( SELECT t2.v2 FROM t1 FULL JOIN t2 ON t1.id = t2.id );  -- returns error, column resolved to `v2`
SELECT v2 FROM ( SELECT t2.v2 FROM t1 FULL JOIN t2 ON t1.id = t2.id ); -- ok

As for return type, it should be the same as in original table except case with USING described above and if join_use_nulls is set, type in output is changed to Nullable for left table in case of LEFT join, for right table in case of RIGHT join, for both tables in case of FULL join.

@den-crane
Copy link
Contributor

den-crane commented Jul 13, 2022

psql -- OK

with d as (select 'key'::Varchar(255) c, 'x'::Varchar(255) s)
SELECT r1, c as r2
FROM ( 
       SELECT t as s, c as r1
       FROM ( SELECT 'y'::Varchar(255) as t, 'x'::Varchar(255) as s) t1
       LEFT JOIN d USING (s)
     ) t2
LEFT JOIN d using (s)
 r1  | r2
-----+----
 key |

CH -- not OK

with d as (select 'key'::Varchar(255) c, 'x'::Varchar(255) s)
SELECT r1, c as r2
FROM ( 
       SELECT t as s, c as r1
       FROM ( SELECT 'y'::Varchar(255) as t, 'x'::Varchar(255) as s) t1
       LEFT JOIN d USING (s)
     ) t2
LEFT JOIN d using (s);
┌─r1─┬─r2─┐
│    │    │
└────┴────┘

t as s -- t is aliased to s, but the external query uses another s column (not aliased one).

@alexey-milovidov
Copy link
Member Author

#11757
#9954

@filimonov
Copy link
Contributor

#40955

@alexey-milovidov
Copy link
Member Author

#36973

@den-crane
Copy link
Contributor

#41964

@alexey-milovidov
Copy link
Member Author

#32639

@alexey-milovidov
Copy link
Member Author

#37729

@den-crane
Copy link
Contributor

#44153

@den-crane
Copy link
Contributor

#44365

@den-crane
Copy link
Contributor

#48308

@den-crane
Copy link
Contributor

#55803 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
development Developement process & source code & implementation details nightmare Difficulty level
Projects
None yet
Development

No branches or pull requests

8 participants