Add Apache Spark dialect, observe-fork options, and BigQuery COALESCE fix#8

Merged
richardwooding merged 2 commits into main from port-spark-and-observe-options
Apr 28, 2026

Conversation

@richardwooding
Contributor

Summary

Ports recent feature work from cel2sql Go (PRs #117, #113) and cel2sql4j (PRs #10, #9) so pycel2sql is back at feature parity with both siblings.

What's in this PR

  • Apache Spark SQL dialect (sixth dialect): ? positional params, RLIKE, get_json_object, concat(), array_contains (arg-order swap), 0-indexed arrays, EXPLODE / (SELECT collect_list(...)), format_string(), (dayofweek(t) - 1) for day-of-week. JSON array membership raises (the dialect contract doesn't expose the candidate element needed to build a boolean predicate). Regex passes through to Python's re engine after the same ReDoS validators DuckDB / Postgres use. Reserved-keyword + 128-char identifier validation.
  • json_variables option: declare CEL variable names as flat JSONB columns. context.host and context["host"] both emit dialect-specific JSON extraction. Nested chains supported (tags.corpus.section → tags->'corpus'->>'section'). Comprehension iter vars shadow json_variables. has(jsonvar.key) emits write_json_existence.
  • column_aliases option: map CEL identifier name → SQL column name. The alias is validated against the dialect's identifier rules.
  • param_start_index option (convert_parameterized only): shift the placeholder counter so the first parameter is $5 / @p5 instead of $1. Values < 1 clamp to 1. Positional-? dialects ignore the index in placeholder text but preserve parameter ordering.
  • Dialect.write_format() per-dialect dispatch (fixes latent bug exposed by adding Spark): Postgres + BigQuery emit FORMAT(...), SQLite + DuckDB emit printf(...), Spark emits format_string(...), MySQL raises UnsupportedDialectFeatureError. Previously every dialect emitted FORMAT(...), which is invalid SQL on SQLite / DuckDB / MySQL / Spark.
  • BigQuery COALESCE NULL-safety fix: write_array_length and write_json_array_length now wrap output in COALESCE(..., 0), matching the four other dialects. Ports cel2sql4j PR #9 / cel2sql4j upstream commit 1689adc.
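The nested JSON chain emission described under json_variables can be sketched for the Postgres case; the helper name and signature below are illustrative, not the real pycel2sql internals:

```python
def pg_json_path(column: str, keys: list[str]) -> str:
    """Emit a Postgres JSON extraction chain for a json_variables column.

    All but the last key use -> (which returns jsonb); the final key uses
    ->> so the overall expression yields text.
    """
    parts = [column] + [f"'{k}'" for k in keys]
    inner = "->".join(parts[:-1])
    return f"{inner}->>{parts[-1]}"
```

For example, `pg_json_path("tags", ["corpus", "section"])` produces `tags->'corpus'->>'section'`, matching the nested-chain example above.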

Test results

  • 704 unit tests pass (607 baseline + 97 new across tests/test_spark.py (47 tests), tests/test_options.py (26 tests), and per-dialect format() cases in tests/test_string_functions.py (5 tests) plus a BigQuery COALESCE assertion).
  • Ruff clean.
  • Mypy non-lark errors unchanged from baseline (6); one added lark Tree[Any] generic-arg note matches the existing pattern documented in CLAUDE.md.

Docs

  • README: Spark badge, Conversion Options section (json_variables / column_aliases / param_start_index / format() per-dialect table), updated placeholder table, sixth-dialect imports.
  • CLAUDE.md: dialect count bumped to six, Spark added to the dialect-files list and Dialect Differences bullets, new options documented under Important Conventions.

Follow-up scheduled

A one-time remote agent fires on Mon 11 May 2026 to verify the BigQuery COALESCE wrap against the BigQuery emulator (PYCEL2SQL_TEST_BIGQUERY=1). It will post a confirmation comment on success, or open an issue on failure. Routine: trig_01XUGxNA7onedEDdRnuK4ixL.

Test plan

  • uv run pytest tests/ --ignore=tests/integration -v — 704 passing
  • uv run ruff check src/ tests/ — clean
  • uv run mypy src/pycel2sql/ — no NEW non-lark errors
  • Optional: PYCEL2SQL_TEST_BIGQUERY=1 uv run pytest tests/integration/ -v (deferred to scheduled follow-up agent)
  • Manual smoke test for each new option in a Python REPL (json_variables, column_aliases, param_start_index, Spark format() → format_string)

🤖 Generated with Claude Code

Add Apache Spark dialect, observe-fork options, and BigQuery COALESCE fix

Ports recent feature work from cel2sql Go (PRs #117, #113) and cel2sql4j
(PRs #10, #9) so pycel2sql is back at feature parity with both siblings.

New dialect:
- Apache Spark SQL (sixth dialect) — backtick-free identifiers, ?
  positional params, RLIKE, get_json_object, concat(), array_contains
  (arg-order swap), 0-indexed arrays, EXPLODE / (SELECT collect_list(...)),
  format_string(), (dayofweek(t) - 1) day-of-week adjustment, JSON-array
  membership raises (no boolean predicate available). Regex passes through
  to Python's re engine after the same ReDoS validators DuckDB / Postgres
  use; reserved-keyword + 128-char identifier validation.

New conversion options (mirrors cel2sql4j ConvertOptions):
- json_variables: declare CEL variable names as flat JSONB columns; dot
  and bracket notation emit dialect-specific JSON extraction
  (Postgres ->>, BigQuery JSON_VALUE, etc.). Nested chains supported
  (tags.corpus.section -> tags->'corpus'->>'section'). Comprehension
  iter vars shadow json_variables. has(jsonvar.key) emits write_json_existence.
- column_aliases: map CEL identifier -> SQL column name. Alias is
  validated against the dialect's identifier rules.
- param_start_index (convert_parameterized only): shift the placeholder
  counter so the first param is $5 / @p5 instead of $1. Values < 1 clamp
  to 1. Positional-? dialects ignore the index in placeholder text.
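The clamping and positional-? semantics above can be illustrated with a small standalone sketch; the function name and style argument are assumptions, not the real API:

```python
def placeholders(count: int, start_index: int = 1, style: str = "dollar") -> list[str]:
    # Values < 1 clamp to 1, per the param_start_index contract.
    start = max(1, start_index)
    if style == "qmark":
        # Positional-? dialects ignore the index in placeholder text,
        # but parameter ordering is still preserved by the caller.
        return ["?"] * count
    prefix = {"dollar": "$", "at": "@p"}[style]
    return [f"{prefix}{start + i}" for i in range(count)]
```

With `start_index=5`, the first placeholder becomes `$5` (or `@p5`) rather than `$1`.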

Per-dialect format() dispatch (fixes a latent bug exposed by adding Spark):
- Dialect.write_format() abstract method.
- Postgres + BigQuery emit FORMAT('...', ...).
- SQLite + DuckDB emit printf('...', ...).
- Spark emits format_string('...', ...).
- MySQL raises UnsupportedDialectFeatureError (no portable equivalent).
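A minimal sketch of the dispatch shape described above — class and method names follow the commit message, but the bodies and exception class are illustrative, not the real implementation:

```python
from abc import ABC, abstractmethod


class UnsupportedDialectFeatureError(Exception):
    """Illustrative stand-in for pycel2sql's error type."""


class Dialect(ABC):
    @abstractmethod
    def write_format(self, fmt: str, args: list[str]) -> str: ...


class PostgresDialect(Dialect):
    def write_format(self, fmt, args):
        return f"FORMAT({fmt}, {', '.join(args)})"


class SQLiteDialect(Dialect):  # DuckDB behaves analogously
    def write_format(self, fmt, args):
        return f"printf({fmt}, {', '.join(args)})"


class SparkDialect(Dialect):
    def write_format(self, fmt, args):
        return f"format_string({fmt}, {', '.join(args)})"


class MySQLDialect(Dialect):
    def write_format(self, fmt, args):
        raise UnsupportedDialectFeatureError("format() has no MySQL equivalent")
```

Making write_format() abstract is what surfaced the latent bug: every dialect is now forced to declare its own spelling instead of inheriting FORMAT(...).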

Bug fix:
- BigQuery write_array_length and write_json_array_length now wrap output
  in COALESCE(..., 0) for NULL safety, matching the four other dialects.
  Mirrors cel2sql4j PR #9 / cel2sql4j upstream commit 1689adc.
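The NULL-safe wrap can be sketched as follows; the function name mirrors the description above, but the body is illustrative:

```python
def write_array_length(expr: str) -> str:
    # ARRAY_LENGTH(NULL) is NULL in BigQuery; COALESCE collapses it to 0
    # so size comparisons behave like the other dialects.
    return f"COALESCE(ARRAY_LENGTH({expr}), 0)"
```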

Tests: 704 unit tests pass (607 baseline + 97 new across tests/test_spark.py,
tests/test_options.py, and per-dialect format() cases in
tests/test_string_functions.py). Ruff clean. Mypy non-lark errors unchanged
(6); 1 added lark Tree[Any] generic-arg note matches the existing pattern
documented in CLAUDE.md.

Docs: README adds Spark badge, Conversion Options section
(json_variables / column_aliases / param_start_index / format() per-dialect
table), updated placeholder table with Spark, sixth-dialect imports.
CLAUDE.md bumps dialect count to six, adds Spark to the file list +
Dialect Differences bullets, documents the new options under
Important Conventions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@richardwooding richardwooding requested a review from Copilot April 27, 2026 21:24

Copilot AI left a comment


Pull request overview

Ports feature parity updates from the Go/Java sibling projects into pycel2sql by adding an Apache Spark dialect, new conversion options (JSON-root variables, identifier aliases, param start offset), and fixing BigQuery array-length NULL behavior.

Changes:

  • Added SparkDialect with Spark-specific SQL generation (positional ? params, RLIKE, get_json_object, array_contains, format_string, EXPLODE/collect_list, 0-based arrays, Spark identifier validation).
  • Introduced conversion options: json_variables, column_aliases, and param_start_index (for convert_parameterized).
  • Implemented per-dialect format() dispatch via Dialect.write_format() and fixed BigQuery COALESCE(..., 0) wrapping for array length functions.

Reviewed changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| src/pycel2sql/dialect/spark.py | New Spark SQL dialect implementation including regex validation, JSON extraction, arrays, and comprehensions. |
| src/pycel2sql/dialect/_base.py | Adds new DialectName.SPARK and makes write_format() an abstract requirement for dialects. |
| src/pycel2sql/dialect/__init__.py | Exports SparkDialect and wires it into get_dialect(). |
| src/pycel2sql/dialect/postgres.py | Implements write_format() using FORMAT(...). |
| src/pycel2sql/dialect/bigquery.py | Adds COALESCE(..., 0) wrapping for array lengths and implements write_format() using FORMAT(...). |
| src/pycel2sql/dialect/sqlite.py | Implements write_format() using printf(...). |
| src/pycel2sql/dialect/duckdb.py | Implements write_format() using printf(...). |
| src/pycel2sql/dialect/mysql.py | Implements write_format() by raising UnsupportedDialectFeatureError. |
| src/pycel2sql/_converter.py | Adds support for json_variables/column_aliases/param_start_index, JSON-root routing, aliasing, and delegates format() generation to the dialect. |
| src/pycel2sql/__init__.py | Public API wiring for new options and refactors shared kwargs building. |
| tests/test_spark.py | New Spark-specific unit test suite. |
| tests/test_options.py | New tests covering json_variables, column_aliases, and param_start_index. |
| tests/test_string_functions.py | Adds per-dialect format() dispatch assertions. |
| tests/test_bigquery.py | Updates BigQuery array length expectation to include COALESCE. |
| tests/test_dialect_parametrized.py | Adds Spark to the parametrized dialect matrix. |
| tests/conftest.py | Adds spark_dialect fixture and includes Spark in ALL_DIALECTS. |
| README.md | Documents Spark support and the new options / per-dialect format() behavior. |
| CLAUDE.md | Updates project guidance to include Spark and the new options / format dispatch behavior. |


Comment thread src/pycel2sql/dialect/spark.py Outdated
Comment on lines +235 to +239
if "(?m" in pattern or "(?s" in pattern or "(?-" in pattern:
    raise InvalidRegexPatternError(
        "invalid pattern in expression",
        "inline flags other than (?i) are not supported in Spark regex",
    )

Copilot AI Apr 27, 2026


The inline-flag validation misses combined flag groups like (?im) / (?mi): the current substring checks only catch (?m/(?s) when those flags appear first. This allows non-i flags through despite the intent to reject them. Consider parsing (?[a-zA-Z]+) groups and rejecting any set that includes flags other than i (including combined forms).

Contributor Author


Fixed in 6eb6436. The check now uses _INLINE_FLAG_GROUP_RE = re.compile(r"\(\?([a-zA-Z]+)[:)]") to extract the flag-letter set from each (?<flags>) or (?<flags>: group and rejects any group containing flags other than i. The bare (?- clear form is also rejected. Added parametrized tests for (?im), (?mi), (?ix:foo), (?-i), plus a sanity check that (?i)foo alone still passes.
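A standalone sketch of the fixed check — the regex is copied from the comment above, while the surrounding function name is illustrative:

```python
import re

# Extracts the flag-letter set from each (?<flags>) or (?<flags>: group.
_INLINE_FLAG_GROUP_RE = re.compile(r"\(\?([a-zA-Z]+)[:)]")


def spark_inline_flags_ok(pattern: str) -> bool:
    # The bare (?-...) clear-flag form is rejected outright.
    if "(?-" in pattern:
        return False
    # Reject any flag group containing letters other than 'i'.
    return all(
        set(flags) <= {"i"} for flags in _INLINE_FLAG_GROUP_RE.findall(pattern)
    )
```

Unlike the original substring checks, this catches combined groups such as `(?im)`, `(?mi)`, and `(?ix:foo)` while still allowing `(?i)foo`.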

Comment thread tests/test_string_functions.py Outdated

class TestFormatPerDialect:
    """format() dispatches per dialect: FORMAT for Postgres/BigQuery,
    printf for SQLite/DuckDB, raises for MySQL."""

Copilot AI Apr 27, 2026


The docstring for TestFormatPerDialect is now incomplete: it documents FORMAT/printf/MySQL-raises, but this PR also adds Spark's format_string(...) mapping. Please update the docstring so it matches the actual per-dialect behavior being tested elsewhere in the suite.

Suggested change:
-     printf for SQLite/DuckDB, raises for MySQL."""
+     format_string for Spark, printf for SQLite/DuckDB, raises for MySQL."""

Contributor Author


Fixed in 6eb6436 — applied your suggested wording verbatim.

Comment thread src/pycel2sql/dialect/spark.py Outdated
f"Spark dialect does not support multi-dimensional array "
f"length (dimension={dimension})",
)
# Spark size() returns -1 for null; COALESCE collapses to 0.

Copilot AI Apr 27, 2026


Comment says size() returns -1 for NULL, but in Spark SQL size(NULL) evaluates to NULL. The COALESCE wrapper is still correct, but the comment should be updated to avoid misleading future maintainers.

Suggested change:
- # Spark size() returns -1 for null; COALESCE collapses to 0.
+ # In Spark SQL, size(NULL) evaluates to NULL; COALESCE converts that to 0.

Contributor Author


Fixed in 6eb6436 — applied your suggested wording verbatim. Good catch: the -1 claim was carried over from a different language port and is wrong for Spark SQL.

Comment thread src/pycel2sql/_converter.py Outdated
)

- # Convert %d, %f etc. to %s for SQL FORMAT()
+ # Convert %d, %f etc. to %s for printf-style dialects

Copilot AI Apr 27, 2026


This comment says numeric specifiers are converted to %s "for printf-style dialects", but the conversion is applied unconditionally before dispatching to Dialect.write_format() (including Postgres/BigQuery/Spark). Update the comment to reflect the actual behavior, or make the conversion dialect-specific if that was the intent.

Suggested change:
- # Convert %d, %f etc. to %s for printf-style dialects
+ # Normalize supported numeric specifiers (%d, %f, etc.) to %s before
+ # dispatching to Dialect.write_format().

Contributor Author


Fixed in 6eb6436. Updated comment to "Normalize supported numeric specifiers (%d, %f, etc.) to %s before dispatching to Dialect.write_format()". Kept the conversion unconditional (rather than making it dialect-specific) because %s is universally accepted by FORMAT/printf/format_string and avoids type-coercion mismatches when CEL arguments do not match the original specifier exactly.
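The unconditional normalization can be sketched as follows; the regex and helper name are assumptions, not the real converter code:

```python
import re


def normalize_format_specifiers(fmt: str) -> str:
    # %d and %f become %s before dispatch to Dialect.write_format();
    # %s is accepted by FORMAT, printf, and format_string alike.
    return re.sub(r"%[df]", "%s", fmt)
```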

Four review-comment fixes:

1. Spark inline-flag check now catches combined flag groups. The previous
   substring-only check (`"(?m" in pattern`) missed `(?im)` / `(?mi)` /
   `(?ix:...)`. New `_INLINE_FLAG_GROUP_RE` parses each `(?<flags>)` or
   `(?<flags>:` group and rejects any flag set containing letters other
   than `i`. The bare `(?-` clear-flag form remains explicitly rejected.

2. Update tests/test_string_functions.py TestFormatPerDialect docstring
   to mention Spark's format_string() mapping alongside the four other
   dialect dispatches.

3. Fix misleading comment in Spark write_array_length: size(NULL)
   evaluates to NULL in Spark SQL, not -1. The COALESCE wrap is still
   correct; only the comment was wrong.

4. Update _visit_format() comment in _converter.py to reflect that the
   %d/%f → %s normalization runs unconditionally before
   Dialect.write_format() dispatch — not "for printf-style dialects".

Tests: 709 unit tests pass (704 + 5 new flag-group rejection cases).
Ruff clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@richardwooding richardwooding merged commit be6f12d into main Apr 28, 2026
7 checks passed
richardwooding added a commit that referenced this pull request Apr 28, 2026
Captures the workflow for syncing features and bug fixes from upstream
cel2sql Go (the canonical source) and cross-referencing cel2sql4j Java.
The recent v3.7.0/v3.7.1 sync (PR #8: Spark dialect + observe-fork
options + BigQuery COALESCE fix) followed this exact shape.

Layout under .claude/skills/port-from-upstream/:
- SKILL.md (~120 lines) — quick start, two-upstream model, pre-port
  verification (grep first), out-of-scope concerns mirrored from
  cel2sql4j's same decisions, per-commit workflow, when-to-split-PRs
  trade-off, verification.
- references/go-to-python-idioms.md — Go-to-Python mapping table
  (closures, errors, struct-and-interface → ABC + dataclass, generics,
  strings, maps, time/duration, reflection). Distilled from the actual
  edits in PR #8.
- references/java-to-python-cross-refs.md — places where cel2sql4j's
  Java solution was a closer template than the Go original (per-dialect
  writeFormat, ConvertOptions shape, single-functional-interface lambda
  vs Go closure, dual-message exception base, shared skipped concerns).
- scripts/list_upstream_changes.sh — lists upstream cel2sql commits and
  tags, plus cel2sql4j commits in the same time window. Auto-detects
  since-sha from the most recent "Port" / "backport" commit on pycel2sql
  main. Fails clearly when either sibling repo isn't checked out.

Lints clean against .claude/skills/skill-authoring/scripts/lint_skill.py.

Co-authored-by: Richard Wooding <richardwooding@Richards-Virtual-Machine.local>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
