Summary
When a SELECT-list alias shadows an underlying column name and the same name is referenced elsewhere in the same SELECT, Embucket resolves the reference to the alias instead of the FROM-clause column. This silently produces wrong results — no error, no warning. ANSI SQL and Snowflake both resolve the reference to the FROM-clause column.
Minimal standalone reproducer
WITH s AS (
SELECT 'S1' AS sid,
TIMESTAMP '2020-01-01 00:00:00' AS start_tstamp,
TIMESTAMP '2020-01-01 00:00:00' AS user_start_tstamp
UNION ALL
SELECT 'S2' AS sid,
TIMESTAMP '2020-01-01 05:00:00' AS start_tstamp,
TIMESTAMP '2020-01-01 00:00:00' AS user_start_tstamp
)
SELECT user_start_tstamp AS start_tstamp, -- alias shadows column
MAX(CASE WHEN start_tstamp = user_start_tstamp THEN sid END) AS first_sid
FROM s
GROUP BY user_start_tstamp;
-- Snowflake result:
-- start_tstamp=2020-01-01 00:00:00, first_sid='S1' (correct: only S1 satisfies the CASE condition)
-- Embucket result:
-- start_tstamp=2020-01-01 00:00:00, first_sid='S2' (wrong: CASE degenerates to always-true, MAX picks 'S2')
Removing the AS start_tstamp alias (or renaming it) makes Embucket return 'S1' correctly. So the bug is specifically in how Embucket resolves the reference when the SELECT list introduces an alias whose name collides with an existing column name.
Root cause analysis
In the reproducer, the CASE has start_tstamp = user_start_tstamp:
- Correct behavior:
start_tstamp resolves to the FROM table's column (which differs per row), so only S1's row satisfies the predicate, and MAX returns 'S1'.
- Observed Embucket behavior:
start_tstamp resolves to the SELECT-list alias user_start_tstamp, so the condition becomes user_start_tstamp = user_start_tstamp — always true. MAX(CASE WHEN TRUE THEN sid END) degenerates to MAX(sid), which picks lexicographically largest 'S2'.
Per ANSI SQL (and every Snowflake-compatible engine I'm aware of), SELECT-list aliases are not visible to other expressions in the same SELECT clause, except in the GROUP BY / ORDER BY / HAVING clauses of some dialects. Referencing the alias name inside a SELECT-list aggregate expression must resolve to the underlying column.
Impact surfaced in a real pipeline
Found while running dbt-snowplow-web on Embucket vs Snowflake against a shared S3 Tables Iceberg source. The snowplow_web_users_aggs model does exactly this anti-pattern:
SELECT domain_userid,
user_start_tstamp AS start_tstamp, -- alias shadows column
user_end_tstamp AS end_tstamp, -- alias shadows column
MAX(CASE WHEN start_tstamp = user_start_tstamp THEN domain_sessionid END) AS first_domain_sessionid,
MAX(CASE WHEN end_tstamp = user_end_tstamp THEN domain_sessionid END) AS last_domain_sessionid,
SUM(page_views) AS page_views,
COUNT(DISTINCT domain_sessionid) AS sessions,
SUM(engaged_time_in_s) AS engaged_time_in_s
FROM snowplow_web_users_sessions_this_run
GROUP BY 1, 2, 3
On Embucket both first_domain_sessionid and last_domain_sessionid collapse to MAX(domain_sessionid) over the user's sessions, making them identical. They should point to the user's chronologically first and last sessions respectively.
Observed impact at 609,671 input events:
- 8,166 of 35,094 users (23%) have a different
first_domain_sessionid between Embucket and Snowflake.
- Downstream, the
snowplow_web_users derived table's first_page_url, first_page_title, first_page_urlhost, first_geo_country, first_geo_city, first_br_lang, referrer, etc. are wrong on Embucket for every affected user. last_* fields are correct because MAX(sid) happens to coincide with the last session's sid.
Per-user spot check for 5205d181-a915-4379-8e6c-15862d72b9ed (4 sessions):
domain_sessionid start_tstamp end_tstamp
0d9e647e-a5ff-434a-a964-58b324927e02 2026-04-22 15:00:43.863 2026-04-22 15:00:43.863 <-- first
9ad7f7bb-af95-4743-966d-2572046e90cc 2026-04-22 15:01:14.335 2026-04-22 15:01:34.179
65a62129-6d8c-4dd6-99bb-9420fa2d16ef 2026-04-22 15:01:17.089 2026-04-22 15:01:17.089
b96cd902-b6e1-4072-af17-dedb4e6e501c 2026-04-22 15:13:31.538 2026-04-22 15:13:31.538 <-- last, lex-max UUID
user_start_tstamp = 2026-04-22 15:00:43.863
user_end_tstamp = 2026-04-22 15:13:31.538
first_domain_sessionid:
Snowflake: 0d9e647e-... (correct)
Embucket: b96cd902-... (wrong -- lex-max UUID)
last_domain_sessionid:
both: b96cd902-... (correct by coincidence)
Both engines see byte-identical source rows (scratch users_sessions_this_run has the same 4 rows for this user on both); the divergence enters at the aggregation.
Ruling out the obvious alternatives
- Not a tie-break on tied
start_tstamp: zero users have two sessions sharing start_tstamp in either engine (confirmed via COUNT(DISTINCT domain_sessionid) WHERE start_tstamp = user_start_tstamp GROUP BY user_start_tstamp HAVING COUNT > 1 returning 0 on both).
- Not a grouping issue:
user_start_tstamp and user_end_tstamp are identical between engines after GROUP BY.
- Not an issue with MAX over NULLs: a direct query against
users_sessions_this_run with the same aggregation but without the alias shadow returns the correct result on Embucket.
Expected fix
Inside the SELECT clause, column references (and especially references inside aggregate function arguments like MAX(CASE WHEN <expr> ...)) must resolve against the FROM-clause relation, not the SELECT-list aliases. SELECT-list aliases are visible only in ORDER BY, GROUP BY (dialect-dependent), and HAVING.
Severity
High. Silently wrong results with no error. Any production SQL of the form
SELECT some_col AS other_col_name, ..., AGG(CASE WHEN other_col_name ... THEN ... END)
will produce subtly wrong aggregates on Embucket and correct ones on Snowflake — the exact scenario where parity testing with a reference engine is needed to catch it.
Environment
Embucket version: v0.2.2 (Lambda-deployed). DataFusion-based SQL planner.
Related context
Full investigation report at Embucket/embucket-snowplow#6 (specs/2026-04-22-embucket-snowflake-investigation.md, "Finding B / Finding 4").
Summary
When a SELECT-list alias shadows an underlying column name and the same name is referenced elsewhere in the same SELECT, Embucket resolves the reference to the alias instead of the FROM-clause column. This silently produces wrong results — no error, no warning. ANSI SQL and Snowflake both resolve the reference to the FROM-clause column.
Minimal standalone reproducer
Removing the
AS start_tstampalias (or renaming it) makes Embucket return'S1'correctly. So the bug is specifically in how Embucket resolves the reference when the SELECT list introduces an alias whose name collides with an existing column name.Root cause analysis
In the reproducer, the CASE has
start_tstamp = user_start_tstamp:start_tstampresolves to the FROM table's column (which differs per row), so only S1's row satisfies the predicate, and MAX returns'S1'.start_tstampresolves to the SELECT-list aliasuser_start_tstamp, so the condition becomesuser_start_tstamp = user_start_tstamp— always true. MAX(CASE WHEN TRUE THEN sid END) degenerates to MAX(sid), which picks lexicographically largest'S2'.Per ANSI SQL (and every Snowflake-compatible engine I'm aware of), SELECT-list aliases are not visible to other expressions in the same SELECT clause, except in the GROUP BY / ORDER BY / HAVING clauses of some dialects. Referencing the alias name inside a SELECT-list aggregate expression must resolve to the underlying column.
Impact surfaced in a real pipeline
Found while running dbt-snowplow-web on Embucket vs Snowflake against a shared S3 Tables Iceberg source. The
snowplow_web_users_aggsmodel does exactly this anti-pattern:On Embucket both
first_domain_sessionidandlast_domain_sessionidcollapse toMAX(domain_sessionid)over the user's sessions, making them identical. They should point to the user's chronologically first and last sessions respectively.Observed impact at 609,671 input events:
first_domain_sessionidbetween Embucket and Snowflake.snowplow_web_usersderived table'sfirst_page_url,first_page_title,first_page_urlhost,first_geo_country,first_geo_city,first_br_lang,referrer, etc. are wrong on Embucket for every affected user.last_*fields are correct because MAX(sid) happens to coincide with the last session's sid.Per-user spot check for
5205d181-a915-4379-8e6c-15862d72b9ed(4 sessions):Both engines see byte-identical source rows (scratch
users_sessions_this_runhas the same 4 rows for this user on both); the divergence enters at the aggregation.Ruling out the obvious alternatives
start_tstamp: zero users have two sessions sharingstart_tstampin either engine (confirmed viaCOUNT(DISTINCT domain_sessionid) WHERE start_tstamp = user_start_tstamp GROUP BY user_start_tstamp HAVING COUNT > 1returning 0 on both).user_start_tstampanduser_end_tstampare identical between engines after GROUP BY.users_sessions_this_runwith the same aggregation but without the alias shadow returns the correct result on Embucket.Expected fix
Inside the SELECT clause, column references (and especially references inside aggregate function arguments like
MAX(CASE WHEN <expr> ...)) must resolve against the FROM-clause relation, not the SELECT-list aliases. SELECT-list aliases are visible only inORDER BY,GROUP BY(dialect-dependent), andHAVING.Severity
High. Silently wrong results with no error. Any production SQL of the form
will produce subtly wrong aggregates on Embucket and correct ones on Snowflake — the exact scenario where parity testing with a reference engine is needed to catch it.
Environment
Embucket version:
v0.2.2(Lambda-deployed). DataFusion-based SQL planner.Related context
Full investigation report at Embucket/embucket-snowplow#6 (
specs/2026-04-22-embucket-snowflake-investigation.md, "Finding B / Finding 4").