Column reference inside aggregate CASE resolves to SELECT-list alias instead of FROM-column (silently wrong result)

## Summary

When a SELECT-list alias shadows an underlying column name and the same name is referenced elsewhere in the same SELECT, Embucket resolves the reference to the alias instead of the FROM-clause column. This silently produces wrong results — no error, no warning. ANSI SQL and Snowflake both resolve the reference to the FROM-clause column.

## Minimal standalone reproducer

```sql
WITH s AS (
  SELECT 'S1' AS sid,
         TIMESTAMP '2020-01-01 00:00:00' AS start_tstamp,
         TIMESTAMP '2020-01-01 00:00:00' AS user_start_tstamp
  UNION ALL
  SELECT 'S2' AS sid,
         TIMESTAMP '2020-01-01 05:00:00' AS start_tstamp,
         TIMESTAMP '2020-01-01 00:00:00' AS user_start_tstamp
)
SELECT user_start_tstamp AS start_tstamp,                                        -- alias shadows column
       MAX(CASE WHEN start_tstamp = user_start_tstamp THEN sid END) AS first_sid
FROM s
GROUP BY user_start_tstamp;

-- Snowflake result:
--   start_tstamp=2020-01-01 00:00:00, first_sid='S1'   (correct: only S1 satisfies the CASE condition)
-- Embucket result:
--   start_tstamp=2020-01-01 00:00:00, first_sid='S2'   (wrong: CASE degenerates to always-true, MAX picks 'S2')
```

Removing the `AS start_tstamp` alias (or renaming it) makes Embucket return `'S1'` correctly. So the bug is specifically in how Embucket resolves the reference when the SELECT list introduces an alias whose name collides with an existing column name.

## Root cause analysis

In the reproducer, the CASE has `start_tstamp = user_start_tstamp`:
- Correct behavior: `start_tstamp` resolves to the FROM table's column (which differs per row), so only S1's row satisfies the predicate, and MAX returns `'S1'`.
- Observed Embucket behavior: `start_tstamp` resolves to the SELECT-list alias `user_start_tstamp`, so the condition becomes `user_start_tstamp = user_start_tstamp` — always true. MAX(CASE WHEN TRUE THEN sid END) degenerates to MAX(sid), which picks lexicographically largest `'S2'`.

Per ANSI SQL (and every Snowflake-compatible engine I'm aware of), SELECT-list aliases are **not** visible to other expressions in the same SELECT clause, except in the GROUP BY / ORDER BY / HAVING clauses of some dialects. Referencing the alias name inside a SELECT-list aggregate expression must resolve to the underlying column.

## Impact surfaced in a real pipeline

Found while running dbt-snowplow-web on Embucket vs Snowflake against a shared S3 Tables Iceberg source. The `snowplow_web_users_aggs` model does exactly this anti-pattern:

```sql
SELECT domain_userid,
       user_start_tstamp AS start_tstamp,    -- alias shadows column
       user_end_tstamp   AS end_tstamp,      -- alias shadows column
       MAX(CASE WHEN start_tstamp = user_start_tstamp THEN domain_sessionid END) AS first_domain_sessionid,
       MAX(CASE WHEN end_tstamp   = user_end_tstamp   THEN domain_sessionid END) AS last_domain_sessionid,
       SUM(page_views)                    AS page_views,
       COUNT(DISTINCT domain_sessionid)   AS sessions,
       SUM(engaged_time_in_s)             AS engaged_time_in_s
FROM snowplow_web_users_sessions_this_run
GROUP BY 1, 2, 3
```

On Embucket both `first_domain_sessionid` and `last_domain_sessionid` collapse to `MAX(domain_sessionid)` over the user's sessions, making them identical. They should point to the user's chronologically first and last sessions respectively.

Observed impact at 609,671 input events:
- **8,166 of 35,094 users (23%)** have a different `first_domain_sessionid` between Embucket and Snowflake.
- Downstream, the `snowplow_web_users` derived table's `first_page_url`, `first_page_title`, `first_page_urlhost`, `first_geo_country`, `first_geo_city`, `first_br_lang`, `referrer`, etc. are wrong on Embucket for every affected user. `last_*` fields are correct because MAX(sid) happens to coincide with the last session's sid.

Per-user spot check for `5205d181-a915-4379-8e6c-15862d72b9ed` (4 sessions):

```
domain_sessionid                       start_tstamp           end_tstamp
0d9e647e-a5ff-434a-a964-58b324927e02   2026-04-22 15:00:43.863  2026-04-22 15:00:43.863   <-- first
9ad7f7bb-af95-4743-966d-2572046e90cc   2026-04-22 15:01:14.335  2026-04-22 15:01:34.179
65a62129-6d8c-4dd6-99bb-9420fa2d16ef   2026-04-22 15:01:17.089  2026-04-22 15:01:17.089
b96cd902-b6e1-4072-af17-dedb4e6e501c   2026-04-22 15:13:31.538  2026-04-22 15:13:31.538   <-- last, lex-max UUID

user_start_tstamp = 2026-04-22 15:00:43.863
user_end_tstamp   = 2026-04-22 15:13:31.538

first_domain_sessionid:
  Snowflake: 0d9e647e-...   (correct)
  Embucket:  b96cd902-...   (wrong -- lex-max UUID)

last_domain_sessionid:
  both: b96cd902-...         (correct by coincidence)
```

Both engines see byte-identical source rows (scratch `users_sessions_this_run` has the same 4 rows for this user on both); the divergence enters at the aggregation.

## Ruling out the obvious alternatives

- Not a tie-break on tied `start_tstamp`: zero users have two sessions sharing `start_tstamp` in either engine (confirmed via `COUNT(DISTINCT domain_sessionid) WHERE start_tstamp = user_start_tstamp GROUP BY user_start_tstamp HAVING COUNT > 1` returning 0 on both).
- Not a grouping issue: `user_start_tstamp` and `user_end_tstamp` are identical between engines after GROUP BY.
- Not an issue with MAX over NULLs: a direct query against `users_sessions_this_run` with the same aggregation but without the alias shadow returns the correct result on Embucket.

## Expected fix

Inside the SELECT clause, column references (and especially references inside aggregate function arguments like `MAX(CASE WHEN <expr> ...)`) must resolve against the FROM-clause relation, not the SELECT-list aliases. SELECT-list aliases are visible only in `ORDER BY`, `GROUP BY` (dialect-dependent), and `HAVING`.

## Severity

High. Silently wrong results with no error. Any production SQL of the form

```sql
SELECT some_col AS other_col_name, ..., AGG(CASE WHEN other_col_name ... THEN ... END)
```

will produce subtly wrong aggregates on Embucket and correct ones on Snowflake — the exact scenario where parity testing with a reference engine is needed to catch it.

## Environment

Embucket version: `v0.2.2` (Lambda-deployed). DataFusion-based SQL planner.

## Related context

Full investigation report at https://github.com/Embucket/embucket-snowplow/pull/6 (`specs/2026-04-22-embucket-snowflake-investigation.md`, "Finding B / Finding 4").

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Column reference inside aggregate CASE resolves to SELECT-list alias instead of FROM-column (silently wrong result) #131

Summary

Minimal standalone reproducer

Root cause analysis

Impact surfaced in a real pipeline

Ruling out the obvious alternatives

Expected fix

Severity

Environment

Related context

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Column reference inside aggregate CASE resolves to SELECT-list alias instead of FROM-column (silently wrong result) #131

Description

Summary

Minimal standalone reproducer

Root cause analysis

Impact surfaced in a real pipeline

Ruling out the obvious alternatives

Expected fix

Severity

Environment

Related context

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions