Add Apache Spark dialect, observe-fork options, and BigQuery COALESCE fix #8
Conversation
Ports recent feature work from cel2sql Go (PRs #117, #113) and cel2sql4j (PRs #10, #9) so pycel2sql is back at feature parity with both siblings.

New dialect:
- Apache Spark SQL (sixth dialect): backtick-free identifiers, `?` positional params, `RLIKE`, `get_json_object`, `concat()`, `array_contains` (arg-order swap), 0-indexed arrays, `EXPLODE` / `(SELECT collect_list(...))`, `format_string()`, `(dayofweek(t) - 1)` day-of-week adjustment. JSON-array membership raises (no boolean predicate available). Regex passes through to Python's `re` engine after the same ReDoS validators DuckDB / Postgres use; reserved-keyword + 128-char identifier validation.

New conversion options (mirrors cel2sql4j ConvertOptions):
- `json_variables`: declare CEL variable names as flat JSONB columns; dot and bracket notation emit dialect-specific JSON extraction (Postgres `->>`, BigQuery `JSON_VALUE`, etc.). Nested chains supported (`tags.corpus.section` → `tags->'corpus'->>'section'`). Comprehension iter vars shadow `json_variables`. `has(jsonvar.key)` emits `write_json_existence`.
- `column_aliases`: map CEL identifier → SQL column name. The alias is validated against the dialect's identifier rules.
- `param_start_index` (`convert_parameterized` only): shift the placeholder counter so the first param is `$5` / `@p5` instead of `$1`. Values < 1 clamp to 1. Positional-`?` dialects ignore the index in placeholder text.

Per-dialect `format()` dispatch (fixes a latent bug exposed by adding Spark):
- `Dialect.write_format()` abstract method.
- Postgres + BigQuery emit `FORMAT('...', ...)`.
- SQLite + DuckDB emit `printf('...', ...)`.
- Spark emits `format_string('...', ...)`.
- MySQL raises `UnsupportedDialectFeatureError` (no portable equivalent).

Bug fix:
- BigQuery `write_array_length` and `write_json_array_length` now wrap output in `COALESCE(..., 0)` for NULL safety, matching the four other dialects. Mirrors cel2sql4j PR #9 / cel2sql4j upstream commit 1689adc.
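The `param_start_index` semantics above (values below 1 clamp to 1; positional-`?` dialects keep parameter ordering but omit the index from the placeholder text) can be sketched as a standalone helper. The function name and `style` values here are hypothetical illustration, not the converter's real internals:

```python
def make_placeholder_emitter(style: str, start_index: int = 1):
    """Return a function that yields successive SQL parameter placeholders.

    style: "dollar" -> $1, $2, ... (Postgres-style),
           "at"     -> @p1, @p2, ... (BigQuery-style),
           anything else -> bare "?" (index ignored in text, order preserved).
    Values below 1 clamp to 1, mirroring param_start_index semantics.
    """
    counter = max(1, start_index)  # clamp, per the option's documented behavior

    def next_placeholder() -> str:
        nonlocal counter
        if style == "dollar":
            text = f"${counter}"
        elif style == "at":
            text = f"@p{counter}"
        else:
            text = "?"  # positional dialects: no index in placeholder text
        counter += 1
        return text

    return next_placeholder


emit = make_placeholder_emitter("dollar", start_index=5)
print(emit(), emit())  # $5 $6
```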
Tests: 704 unit tests pass (607 baseline + 97 new across `tests/test_spark.py`, `tests/test_options.py`, and per-dialect `format()` cases in `tests/test_string_functions.py`). Ruff clean. Mypy non-lark errors unchanged (6); 1 added lark `Tree[Any]` generic-arg note matches the existing pattern documented in CLAUDE.md.

Docs: README adds Spark badge, Conversion Options section (`json_variables` / `column_aliases` / `param_start_index` / `format()` per-dialect table), updated placeholder table with Spark, sixth-dialect imports. CLAUDE.md bumps dialect count to six, adds Spark to the file list + Dialect Differences bullets, documents the new options under Important Conventions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull request overview
Ports feature parity updates from the Go/Java sibling projects into pycel2sql by adding an Apache Spark dialect, new conversion options (JSON-root variables, identifier aliases, param start offset), and fixing BigQuery array-length NULL behavior.
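As an illustration of the nested JSON-root extraction this PR adds (the `tags.corpus.section` → `tags->'corpus'->>'section'` example from the description), here is a minimal sketch for the Postgres case. The helper name is hypothetical; only the Postgres `->` / `->>` operators are real:

```python
def postgres_json_path(root: str, keys: list[str]) -> str:
    """Emit Postgres JSON extraction for a nested json_variables chain.

    Intermediate keys use -> (returns jsonb); the final key uses ->>
    (returns text), matching the example in the PR description.
    """
    if not keys:
        return root
    parts = [root]
    for key in keys[:-1]:
        parts.append(f"->'{key}'")      # keep jsonb for intermediate hops
    parts.append(f"->>'{keys[-1]}'")    # extract the leaf as text
    return "".join(parts)


print(postgres_json_path("tags", ["corpus", "section"]))
# tags->'corpus'->>'section'
```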
Changes:
- Added `SparkDialect` with Spark-specific SQL generation (positional `?` params, `RLIKE`, `get_json_object`, `array_contains`, `format_string`, `EXPLODE` / `collect_list`, 0-based arrays, Spark identifier validation).
- Introduced conversion options: `json_variables`, `column_aliases`, and `param_start_index` (for `convert_parameterized`).
- Implemented per-dialect `format()` dispatch via `Dialect.write_format()` and fixed BigQuery `COALESCE(..., 0)` wrapping for array length functions.
Reviewed changes
Copilot reviewed 18 out of 18 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `src/pycel2sql/dialect/spark.py` | New Spark SQL dialect implementation including regex validation, JSON extraction, arrays, and comprehensions. |
| `src/pycel2sql/dialect/_base.py` | Adds new `DialectName.SPARK` and makes `write_format()` an abstract requirement for dialects. |
| `src/pycel2sql/dialect/__init__.py` | Exports `SparkDialect` and wires it into `get_dialect()`. |
| `src/pycel2sql/dialect/postgres.py` | Implements `write_format()` using `FORMAT(...)`. |
| `src/pycel2sql/dialect/bigquery.py` | Adds `COALESCE(..., 0)` wrapping for array lengths and implements `write_format()` using `FORMAT(...)`. |
| `src/pycel2sql/dialect/sqlite.py` | Implements `write_format()` using `printf(...)`. |
| `src/pycel2sql/dialect/duckdb.py` | Implements `write_format()` using `printf(...)`. |
| `src/pycel2sql/dialect/mysql.py` | Implements `write_format()` by raising `UnsupportedDialectFeatureError`. |
| `src/pycel2sql/_converter.py` | Adds support for `json_variables` / `column_aliases` / `param_start_index`, JSON-root routing, aliasing, and delegates `format()` generation to the dialect. |
| `src/pycel2sql/__init__.py` | Public API wiring for new options and refactors shared kwargs building. |
| `tests/test_spark.py` | New Spark-specific unit test suite. |
| `tests/test_options.py` | New tests covering `json_variables`, `column_aliases`, and `param_start_index`. |
| `tests/test_string_functions.py` | Adds per-dialect `format()` dispatch assertions. |
| `tests/test_bigquery.py` | Updates BigQuery array length expectation to include `COALESCE`. |
| `tests/test_dialect_parametrized.py` | Adds Spark to the parametrized dialect matrix. |
| `tests/conftest.py` | Adds `spark_dialect` fixture and includes Spark in `ALL_DIALECTS`. |
| `README.md` | Documents Spark support and the new options / per-dialect `format()` behavior. |
| `CLAUDE.md` | Updates project guidance to include Spark and the new options / `format` dispatch behavior. |
| if "(?m" in pattern or "(?s" in pattern or "(?-" in pattern: | ||
| raise InvalidRegexPatternError( | ||
| "invalid pattern in expression", | ||
| "inline flags other than (?i) are not supported in Spark regex", | ||
| ) |
The inline-flag validation misses combined flag groups like `(?im)` / `(?mi)`: the current substring checks only catch `(?m` / `(?s` when those flags appear first, so non-`i` flags slip through despite the intent to reject them. Consider parsing `(?[a-zA-Z]+)` groups and rejecting any set that includes flags other than `i` (including combined forms).
Fixed in 6eb6436. The check now uses `_INLINE_FLAG_GROUP_RE = re.compile(r"\(\?([a-zA-Z]+)[:)]")` to extract the flag-letter set from each `(?<flags>)` or `(?<flags>:` group and rejects any group containing flags other than `i`. The bare `(?-` clear form is also rejected. Added parametrized tests for `(?im)`, `(?mi)`, `(?ix:foo)`, `(?-i)`, plus a sanity check that `(?i)foo` alone still passes.
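For illustration, the described check can be reproduced as a minimal standalone sketch, using `ValueError` in place of the project's `InvalidRegexPatternError`:

```python
import re

# Matches "(?imsx)" flag groups and "(?i:...)" scoped groups;
# the captured letters are the flag set for that group.
_INLINE_FLAG_GROUP_RE = re.compile(r"\(\?([a-zA-Z]+)[:)]")


def validate_spark_inline_flags(pattern: str) -> None:
    """Reject inline regex flags other than (?i), including combined forms."""
    if "(?-" in pattern:  # bare clear-flag form, e.g. (?-i)
        raise ValueError("inline flags other than (?i) are not supported")
    for match in _INLINE_FLAG_GROUP_RE.finditer(pattern):
        if set(match.group(1)) - {"i"}:  # any letter beyond "i" is rejected
            raise ValueError("inline flags other than (?i) are not supported")


validate_spark_inline_flags("(?i)foo")  # passes: case-insensitivity alone is fine
```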
```python
class TestFormatPerDialect:
    """format() dispatches per dialect: FORMAT for Postgres/BigQuery,
    printf for SQLite/DuckDB, raises for MySQL."""
```
The docstring for TestFormatPerDialect is now incomplete: it documents FORMAT/printf/MySQL-raises, but this PR also adds Spark's format_string(...) mapping. Please update the docstring so it matches the actual per-dialect behavior being tested elsewhere in the suite.
```diff
-    printf for SQLite/DuckDB, raises for MySQL."""
+    format_string for Spark, printf for SQLite/DuckDB, raises for MySQL."""
```
Fixed in 6eb6436 — applied your suggested wording verbatim.
| f"Spark dialect does not support multi-dimensional array " | ||
| f"length (dimension={dimension})", | ||
| ) | ||
| # Spark size() returns -1 for null; COALESCE collapses to 0. |
Comment says `size()` returns -1 for NULL, but in Spark SQL `size(NULL)` evaluates to NULL. The `COALESCE` wrapper is still correct, but the comment should be updated to avoid misleading future maintainers.
```diff
-# Spark size() returns -1 for null; COALESCE collapses to 0.
+# In Spark SQL, size(NULL) evaluates to NULL; COALESCE converts that to 0.
```
Fixed in 6eb6436 — applied your suggested wording verbatim. Good catch: the -1 claim was carried over from a different language port and is wrong for Spark SQL.
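A minimal sketch of the corrected emission, assuming a string-returning helper (the real `write_array_length` signature and receiver are not shown in this thread):

```python
def spark_write_array_length(column_sql: str) -> str:
    """Emit Spark SQL for CEL size() on an array column.

    In Spark SQL, size(NULL) evaluates to NULL (with the modern default for
    spark.sql.legacy.sizeOfNull), so the result is wrapped in
    COALESCE(..., 0) for NULL safety, matching the other dialects.
    """
    return f"COALESCE(size({column_sql}), 0)"


print(spark_write_array_length("tags"))  # COALESCE(size(tags), 0)
```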
```diff
-# Convert %d, %f etc. to %s for SQL FORMAT()
+# Convert %d, %f etc. to %s for printf-style dialects
```
This comment says numeric specifiers are converted to %s "for printf-style dialects", but the conversion is applied unconditionally before dispatching to Dialect.write_format() (including Postgres/BigQuery/Spark). Update the comment to reflect the actual behavior, or make the conversion dialect-specific if that was the intent.
```diff
-# Convert %d, %f etc. to %s for printf-style dialects
+# Normalize supported numeric specifiers (%d, %f, etc.) to %s before
+# dispatching to Dialect.write_format().
```
Fixed in 6eb6436. Updated the comment to "Normalize supported numeric specifiers (%d, %f, etc.) to %s before dispatching to Dialect.write_format()". Kept the conversion unconditional (rather than making it dialect-specific) because `%s` is universally accepted by `FORMAT` / `printf` / `format_string` and avoids type-coercion mismatches when CEL arguments do not match the original specifier exactly.
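The unconditional normalization can be sketched as follows. This is a hypothetical helper; the real converter's supported specifier set may differ:

```python
def normalize_format_specifiers(fmt: str) -> str:
    """Rewrite numeric-style specifiers (%d, %f, %x, ...) to %s, leaving
    %s itself and the %% escape untouched."""
    out = []
    i = 0
    while i < len(fmt):
        ch = fmt[i]
        if ch == "%" and i + 1 < len(fmt):
            nxt = fmt[i + 1]
            if nxt in "%s":
                out.append(ch + nxt)  # keep %% and %s as-is
            else:
                out.append("%s")      # %d, %f, %x, ... all become %s
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


print(normalize_format_specifiers("id=%d ratio=%f name=%s 100%%"))
# id=%s ratio=%s name=%s 100%%
```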
Four review-comment fixes:

1. Spark inline-flag check now catches combined flag groups. The previous substring-only check (`"(?m" in pattern`) missed `(?im)` / `(?mi)` / `(?ix:...)`. New `_INLINE_FLAG_GROUP_RE` parses each `(?<flags>)` or `(?<flags>:` group and rejects any flag set containing letters other than `i`. The bare `(?-` clear-flag form remains explicitly rejected.
2. Update tests/test_string_functions.py `TestFormatPerDialect` docstring to mention Spark's `format_string()` mapping alongside the four other dialect dispatches.
3. Fix misleading comment in Spark `write_array_length`: `size(NULL)` evaluates to NULL in Spark SQL, not -1. The `COALESCE` wrap is still correct; only the comment was wrong.
4. Update the `_visit_format()` comment in _converter.py to reflect that the `%d`/`%f` → `%s` normalization runs unconditionally before `Dialect.write_format()` dispatch, not "for printf-style dialects".

Tests: 709 unit pass (704 + 5 new flag-group rejection cases). Ruff clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Captures the workflow for syncing features and bug fixes from upstream cel2sql Go (the canonical source) and cross-referencing cel2sql4j Java. The recent v3.7.0/v3.7.1 sync (PR #8: Spark dialect + observe-fork options + BigQuery COALESCE fix) followed this exact shape.

Layout under .claude/skills/port-from-upstream/:
- SKILL.md (~120 lines): quick start, two-upstream model, pre-port verification (grep first), out-of-scope concerns mirrored from cel2sql4j's same decisions, per-commit workflow, when-to-split-PRs trade-off, verification.
- references/go-to-python-idioms.md: Go-to-Python mapping table (closures, errors, struct-and-interface → ABC + dataclass, generics, strings, maps, time/duration, reflection). Distilled from the actual edits in PR #8.
- references/java-to-python-cross-refs.md: places where cel2sql4j's Java solution was a closer template than the Go original (per-dialect writeFormat, ConvertOptions shape, single-functional-interface lambda vs Go closure, dual-message exception base, shared skipped concerns).
- scripts/list_upstream_changes.sh: lists upstream cel2sql commits and tags, plus cel2sql4j commits in the same time window. Auto-detects the since-sha from the most recent "Port" / "backport" commit on pycel2sql main. Fails clearly when either sibling repo isn't checked out.

Lints clean against .claude/skills/skill-authoring/scripts/lint_skill.py.

Co-authored-by: Richard Wooding <richardwooding@Richards-Virtual-Machine.local>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Ports recent feature work from cel2sql Go (PRs #117, #113) and cel2sql4j (PRs #10, #9) so pycel2sql is back at feature parity with both siblings.
What's in this PR
- Apache Spark SQL dialect (the sixth): `?` positional params, `RLIKE`, `get_json_object`, `concat()`, `array_contains` (arg-order swap), 0-indexed arrays, `EXPLODE` / `(SELECT collect_list(...))`, `format_string()`, `(dayofweek(t) - 1)` for day-of-week. JSON array membership raises (the dialect contract doesn't expose the candidate element needed to build a boolean predicate). Regex passes through to Python's `re` engine after the same ReDoS validators DuckDB / Postgres use. Reserved-keyword + 128-char identifier validation.
- `json_variables` option: declare CEL variable names as flat JSONB columns. `context.host` and `context["host"]` both emit dialect-specific JSON extraction. Nested chains supported (`tags.corpus.section` → `tags->'corpus'->>'section'`). Comprehension iter vars shadow `json_variables`. `has(jsonvar.key)` emits `write_json_existence`.
- `column_aliases` option: map CEL identifier name → SQL column name. The alias is validated against the dialect's identifier rules.
- `param_start_index` option (`convert_parameterized` only): shift the placeholder counter so the first parameter is `$5` / `@p5` instead of `$1`. Values < 1 clamp to 1. Positional-`?` dialects ignore the index in placeholder text but preserve parameter ordering.
- `Dialect.write_format()` per-dialect dispatch (fixes latent bug exposed by adding Spark): Postgres + BigQuery emit `FORMAT(...)`, SQLite + DuckDB emit `printf(...)`, Spark emits `format_string(...)`, MySQL raises `UnsupportedDialectFeatureError`. Previously every dialect emitted `FORMAT(...)`, which is invalid SQL on SQLite / DuckDB / MySQL / Spark.
- BigQuery `COALESCE` NULL-safety fix: `write_array_length` and `write_json_array_length` now wrap output in `COALESCE(..., 0)`, matching the four other dialects. Ports cel2sql4j PR #9 / cel2sql4j upstream commit 1689adc.

Test results
- New tests across `tests/test_spark.py` (47 tests), `tests/test_options.py` (26 tests), and per-dialect `format()` cases in `tests/test_string_functions.py` (5 tests), plus a BigQuery COALESCE assertion.
- 1 added lark `Tree[Any]` generic-arg note matches the existing pattern documented in CLAUDE.md.

Docs
- README adds a Conversion Options section (`json_variables` / `column_aliases` / `param_start_index` / `format()` per-dialect table), updated placeholder table, sixth-dialect imports.

Follow-up scheduled
A one-time remote agent fires on Mon 11 May 2026 to verify the BigQuery `COALESCE` wrap against the BigQuery emulator (`PYCEL2SQL_TEST_BIGQUERY=1`). It will post a confirmation comment on success, or open an issue on failure. Routine: `trig_01XUGxNA7onedEDdRnuK4ixL`.

Test plan
- `uv run pytest tests/ --ignore=tests/integration -v`: 704 passing
- `uv run ruff check src/ tests/`: clean
- `uv run mypy src/pycel2sql/`: no NEW non-lark errors
- `PYCEL2SQL_TEST_BIGQUERY=1 uv run pytest tests/integration/ -v`: deferred to scheduled follow-up agent
- New behaviors covered: `json_variables`, `column_aliases`, `param_start_index`, Spark `format()` → `format_string`

🤖 Generated with Claude Code