Skip to content

Column reference inside aggregate CASE resolves to SELECT-list alias instead of FROM-column (silently wrong result) #131

@rampage644

Description

@rampage644

Summary

When a SELECT-list alias shadows an underlying column name and the same name is referenced elsewhere in the same SELECT, Embucket resolves the reference to the alias instead of the FROM-clause column. This silently produces wrong results — no error, no warning. ANSI SQL and Snowflake both resolve the reference to the FROM-clause column.

Minimal standalone reproducer

WITH s AS (
  SELECT 'S1' AS sid,
         TIMESTAMP '2020-01-01 00:00:00' AS start_tstamp,
         TIMESTAMP '2020-01-01 00:00:00' AS user_start_tstamp
  UNION ALL
  SELECT 'S2' AS sid,
         TIMESTAMP '2020-01-01 05:00:00' AS start_tstamp,
         TIMESTAMP '2020-01-01 00:00:00' AS user_start_tstamp
)
SELECT user_start_tstamp AS start_tstamp,                                        -- alias shadows column
       MAX(CASE WHEN start_tstamp = user_start_tstamp THEN sid END) AS first_sid
FROM s
GROUP BY user_start_tstamp;

-- Snowflake result:
--   start_tstamp=2020-01-01 00:00:00, first_sid='S1'   (correct: only S1 satisfies the CASE condition)
-- Embucket result:
--   start_tstamp=2020-01-01 00:00:00, first_sid='S2'   (wrong: CASE degenerates to always-true, MAX picks 'S2')

Removing the AS start_tstamp alias (or renaming it) makes Embucket return 'S1' correctly. So the bug is specifically in how Embucket resolves the reference when the SELECT list introduces an alias whose name collides with an existing column name.

Root cause analysis

In the reproducer, the CASE has start_tstamp = user_start_tstamp:

  • Correct behavior: start_tstamp resolves to the FROM table's column (which differs per row), so only S1's row satisfies the predicate, and MAX returns 'S1'.
  • Observed Embucket behavior: start_tstamp resolves to the SELECT-list alias user_start_tstamp, so the condition becomes user_start_tstamp = user_start_tstamp — always true. MAX(CASE WHEN TRUE THEN sid END) degenerates to MAX(sid), which picks lexicographically largest 'S2'.

Per ANSI SQL (and every Snowflake-compatible engine I'm aware of), SELECT-list aliases are not visible to other expressions in the same SELECT clause, except in the GROUP BY / ORDER BY / HAVING clauses of some dialects. Referencing the alias name inside a SELECT-list aggregate expression must resolve to the underlying column.

Impact surfaced in a real pipeline

Found while running dbt-snowplow-web on Embucket vs Snowflake against a shared S3 Tables Iceberg source. The snowplow_web_users_aggs model does exactly this anti-pattern:

SELECT domain_userid,
       user_start_tstamp AS start_tstamp,    -- alias shadows column
       user_end_tstamp   AS end_tstamp,      -- alias shadows column
       MAX(CASE WHEN start_tstamp = user_start_tstamp THEN domain_sessionid END) AS first_domain_sessionid,
       MAX(CASE WHEN end_tstamp   = user_end_tstamp   THEN domain_sessionid END) AS last_domain_sessionid,
       SUM(page_views)                    AS page_views,
       COUNT(DISTINCT domain_sessionid)   AS sessions,
       SUM(engaged_time_in_s)             AS engaged_time_in_s
FROM snowplow_web_users_sessions_this_run
GROUP BY 1, 2, 3

On Embucket both first_domain_sessionid and last_domain_sessionid collapse to MAX(domain_sessionid) over the user's sessions, making them identical. They should point to the user's chronologically first and last sessions respectively.

Observed impact at 609,671 input events:

  • 8,166 of 35,094 users (23%) have a different first_domain_sessionid between Embucket and Snowflake.
  • Downstream, the snowplow_web_users derived table's first_page_url, first_page_title, first_page_urlhost, first_geo_country, first_geo_city, first_br_lang, referrer, etc. are wrong on Embucket for every affected user. last_* fields are correct because MAX(sid) happens to coincide with the last session's sid.

Per-user spot check for 5205d181-a915-4379-8e6c-15862d72b9ed (4 sessions):

domain_sessionid                       start_tstamp           end_tstamp
0d9e647e-a5ff-434a-a964-58b324927e02   2026-04-22 15:00:43.863  2026-04-22 15:00:43.863   <-- first
9ad7f7bb-af95-4743-966d-2572046e90cc   2026-04-22 15:01:14.335  2026-04-22 15:01:34.179
65a62129-6d8c-4dd6-99bb-9420fa2d16ef   2026-04-22 15:01:17.089  2026-04-22 15:01:17.089
b96cd902-b6e1-4072-af17-dedb4e6e501c   2026-04-22 15:13:31.538  2026-04-22 15:13:31.538   <-- last, lex-max UUID

user_start_tstamp = 2026-04-22 15:00:43.863
user_end_tstamp   = 2026-04-22 15:13:31.538

first_domain_sessionid:
  Snowflake: 0d9e647e-...   (correct)
  Embucket:  b96cd902-...   (wrong -- lex-max UUID)

last_domain_sessionid:
  both: b96cd902-...         (correct by coincidence)

Both engines see byte-identical source rows (scratch users_sessions_this_run has the same 4 rows for this user on both); the divergence enters at the aggregation.

Ruling out the obvious alternatives

  • Not a tie-break on tied start_tstamp: zero users have two sessions sharing start_tstamp in either engine (confirmed via COUNT(DISTINCT domain_sessionid) WHERE start_tstamp = user_start_tstamp GROUP BY user_start_tstamp HAVING COUNT > 1 returning 0 on both).
  • Not a grouping issue: user_start_tstamp and user_end_tstamp are identical between engines after GROUP BY.
  • Not an issue with MAX over NULLs: a direct query against users_sessions_this_run with the same aggregation but without the alias shadow returns the correct result on Embucket.

Expected fix

Inside the SELECT clause, column references (and especially references inside aggregate function arguments like MAX(CASE WHEN <expr> ...)) must resolve against the FROM-clause relation, not the SELECT-list aliases. SELECT-list aliases are visible only in ORDER BY, GROUP BY (dialect-dependent), and HAVING.

Severity

High. Silently wrong results with no error. Any production SQL of the form

SELECT some_col AS other_col_name, ..., AGG(CASE WHEN other_col_name ... THEN ... END)

will produce subtly wrong aggregates on Embucket and correct ones on Snowflake — the exact scenario where parity testing with a reference engine is needed to catch it.

Environment

Embucket version: v0.2.2 (Lambda-deployed). DataFusion-based SQL planner.

Related context

Full investigation report at Embucket/embucket-snowplow#6 (specs/2026-04-22-embucket-snowflake-investigation.md, "Finding B / Finding 4").

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions