DATEDIFF('second', a, b) returns CEIL of duration instead of boundary-count (deviates from Snowflake semantics)

## Summary

`DATEDIFF('second', a, b)` on Embucket returns `CEIL((b - a) / 1 second)` — the ceiling of the true fractional duration. Snowflake's documented semantics is boundary-count: `floor(epoch_seconds(b)) - floor(epoch_seconds(a))`. Any positive sub-second difference rounds up to 1 on Embucket. No error; silently wrong result.

## Minimal standalone reproducer

Run the same query on both engines:

```sql
SELECT DATEDIFF('second',
                TIMESTAMP '2020-01-01 00:00:00.100',
                TIMESTAMP '2020-01-01 00:00:00.900');
-- Snowflake: 0   (same second bucket, 0 boundaries crossed)
-- Embucket:  1   (CEIL(0.8) = 1)
```

## Full test matrix

| a | b | true duration | Embucket | Snowflake |
|---|---|---|---|---|
| `00:00:00.500` | `00:00:00.500` | 0.0 sec | 0 | 0 |
| `00:00:00.100` | `00:00:00.101` | 0.001 sec same sec | **1** | 0 |
| `00:00:00.200` | `00:00:00.700` | 0.5 sec same sec | **1** | 0 |
| `00:00:00.750` | `00:00:01.250` | 0.5 sec straddles boundary | 1 | 1 |
| `00:00:00.900` | `00:00:01.100` | 0.2 sec straddles boundary | 1 | 1 |
| `00:00:00.500` | `00:00:01.500` | 1.0 sec | 1 | 1 |
| `00:00:00.250` | `00:00:01.750` | 1.5 sec | **2** | 1 |
| `00:00:00.000` | `00:00:02.000` | 2.0 sec exact | 2 | 2 |
| `00:00:00.500` | `00:00:02.900` | 2.4 sec | **3** | 2 |
| `00:00:00.000` | `00:00:02.500` | 2.5 sec | **3** | 2 |
| `00:00:00.999` | `00:01:00.000` | 59.001 sec | 60 | 60 |

## Formulas

- **Snowflake (correct / documented):** `DATEDIFF('second', a, b) = floor(epoch_seconds(b)) - floor(epoch_seconds(a))`. Counts how many `second`-boundaries are crossed. Returns 0 for two timestamps in the same second regardless of sub-second distance.
- **Embucket (current):** `DATEDIFF('second', a, b) = CEIL((b - a) / 1 second)`. Returns the ceiling of the true elapsed seconds. Returns 1 for any positive sub-second difference.

The engines agree only when both endpoints sit on exact second boundaries, or when the ceil value happens to equal the boundary count.

## Snowflake reference

Snowflake docs on `DATEDIFF` (https://docs.snowflake.com/en/sql-reference/functions/datediff): "Calculates the difference between two date, time, or timestamp expressions based on the date or time part requested. The function returns the result of subtracting the second argument from the third argument." The result is the count of `date_or_time_part` boundaries crossed, not the real elapsed duration. This is the same semantics as Microsoft SQL Server's `DATEDIFF`, BigQuery's `DATETIME_DIFF`, and Databricks' `DATEDIFF`.

## Impact surfaced in a real pipeline

Found while running dbt-snowplow-web on Embucket vs Snowflake against a shared S3 Tables Iceberg source. Snowplow's `snowplow_web_page_views_this_run` computes:

```sql
DATEDIFF('second', p.derived_tstamp, COALESCE(t.end_tstamp, p.derived_tstamp)) AS absolute_time_in_s
```

Aggregate inflation on Embucket, 609,671 input events:
- `snowplow_web_page_views.sum_absolute_time_in_s`: +1.25% over Snowflake (2,058,065 vs 2,032,667).
- `snowplow_web_sessions.sum_absolute_time_in_s`: +0.85% over Snowflake (13,149,140 vs 13,037,353).

Direction is strictly one-sided (Embucket ≥ Snowflake always), consistent with the formula.

Spot checks:
- Session `7da3c35f-5565-4903-88df-bb3c19a918f9`: Embucket 231 sec, Snowflake 230 sec; all other columns identical.
- Page view `00005058-c0e9-4e2a-8c2a-24db2e1f8fa4`: Embucket 40, Snowflake 39; all else equal.

## Expected fix

`DATEDIFF('second', a, b)` should return `floor(epoch_seconds(b)) - floor(epoch_seconds(a))`. Presumably generalizable: for `DATEDIFF(part, a, b)`, truncate both endpoints to `part` precision first, then compute the integer difference in `part` units. This is the semantics all Snowflake-compatible engines implement.

Same fix should be applied consistently to every time unit (minute, hour, day, etc.) — worth auditing whether other units already use boundary-count or also use CEIL-of-duration.

## Environment

Embucket version: `v0.2.2` (Lambda-deployed; `MEM_POOL_SIZE_MB=9216`, `MEM_POOL_TYPE=greedy`). DataFusion-based. Reproduced via the embucket SQL endpoint on an ARM64 Lambda, 10 GB memory.

## Related context

Found as part of the broader Embucket-vs-Snowflake parity investigation at https://github.com/Embucket/embucket-snowplow/pull/6 (full report in `specs/2026-04-22-embucket-snowflake-investigation.md`, "Finding A / Finding 3").

a	b	true duration	Embucket	Snowflake
`00:00:00.500`	`00:00:00.500`	0.0 sec	0	0
`00:00:00.100`	`00:00:00.101`	0.001 sec same sec	1	0
`00:00:00.200`	`00:00:00.700`	0.5 sec same sec	1	0
`00:00:00.750`	`00:00:01.250`	0.5 sec straddles boundary	1	1
`00:00:00.900`	`00:00:01.100`	0.2 sec straddles boundary	1	1
`00:00:00.500`	`00:00:01.500`	1.0 sec	1	1
`00:00:00.250`	`00:00:01.750`	1.5 sec	2	1
`00:00:00.000`	`00:00:02.000`	2.0 sec exact	2	2
`00:00:00.500`	`00:00:02.900`	2.4 sec	3	2
`00:00:00.000`	`00:00:02.500`	2.5 sec	3	2
`00:00:00.999`	`00:01:00.000`	59.001 sec	60	60

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DATEDIFF('second', a, b) returns CEIL of duration instead of boundary-count (deviates from Snowflake semantics) #130

Summary

Minimal standalone reproducer

Full test matrix

Formulas

Snowflake reference

Impact surfaced in a real pipeline

Expected fix

Environment

Related context

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

DATEDIFF('second', a, b) returns CEIL of duration instead of boundary-count (deviates from Snowflake semantics) #130

Description

Summary

Minimal standalone reproducer

Full test matrix

Formulas

Snowflake reference

Impact surfaced in a real pipeline

Expected fix

Environment

Related context

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions