Add Apache Spark dialect, observe-fork options, and BigQuery COALESCE fix#8

Merged
richardwooding merged 2 commits into main from port-spark-and-observe-options
Apr 28, 2026

Conversation

@richardwooding
Contributor

Summary

Ports recent feature work from cel2sql Go (PRs #117, #113) and cel2sql4j (PRs #10, #9) so pycel2sql is back at feature parity with both siblings.

What's in this PR

  • Apache Spark SQL dialect (sixth dialect): ? positional params, RLIKE, get_json_object, concat(), array_contains (arg-order swap), 0-indexed arrays, EXPLODE / (SELECT collect_list(...)), format_string(), (dayofweek(t) - 1) for day-of-week. JSON array membership raises (the dialect contract doesn't expose the candidate element needed to build a boolean predicate). Regex passes through to Python's re engine after the same ReDoS validators DuckDB / Postgres use. Reserved-keyword + 128-char identifier validation.
  • json_variables option: declare CEL variable names as flat JSONB columns. context.host and context["host"] both emit dialect-specific JSON extraction. Nested chains supported (tags.corpus.section → tags->'corpus'->>'section'). Comprehension iter vars shadow json_variables. has(jsonvar.key) emits write_json_existence.
  • column_aliases option: map CEL identifier name → SQL column name. The alias is validated against the dialect's identifier rules.
  • param_start_index option (convert_parameterized only): shift the placeholder counter so the first parameter is $5 / @p5 instead of $1. Values < 1 clamp to 1. Positional-? dialects ignore the index in placeholder text but preserve parameter ordering.
  • Dialect.write_format() per-dialect dispatch (fixes latent bug exposed by adding Spark): Postgres + BigQuery emit FORMAT(...), SQLite + DuckDB emit printf(...), Spark emits format_string(...), MySQL raises UnsupportedDialectFeatureError. Previously every dialect emitted FORMAT(...), which is invalid SQL on SQLite / DuckDB / MySQL / Spark.
  • BigQuery COALESCE NULL-safety fix: write_array_length and write_json_array_length now wrap output in COALESCE(..., 0), matching the four other dialects. Ports cel2sql4j PR #9 / cel2sql4j upstream commit 1689adc.
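The nested JSON chain emission described under json_variables can be sketched for the Postgres case; the helper name and signature below are illustrative, not the real pycel2sql internals:

```python
def pg_json_path(column: str, keys: list[str]) -> str:
    """Emit a Postgres JSON extraction chain for a json_variables column.

    All but the last key use -> (which returns jsonb); the final key uses
    ->> so the overall expression yields text.
    """
    parts = [column] + [f"'{k}'" for k in keys]
    inner = "->".join(parts[:-1])
    return f"{inner}->>{parts[-1]}"
```

For example, `pg_json_path("tags", ["corpus", "section"])` produces `tags->'corpus'->>'section'`, matching the nested-chain example above.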

Test results

  • 704 unit tests pass (607 baseline + 97 new across tests/test_spark.py (47 tests), tests/test_options.py (26 tests), and per-dialect format() cases in tests/test_string_functions.py (5 tests) plus a BigQuery COALESCE assertion).
  • Ruff clean.
  • Mypy non-lark errors unchanged from baseline (6); one added lark Tree[Any] generic-arg note matches the existing pattern documented in CLAUDE.md.

Docs

  • README: Spark badge, Conversion Options section (json_variables / column_aliases / param_start_index / format() per-dialect table), updated placeholder table, sixth-dialect imports.
  • CLAUDE.md: dialect count bumped to six, Spark added to the dialect-files list and Dialect Differences bullets, new options documented under Important Conventions.

Follow-up scheduled

A one-time remote agent fires on Mon 11 May 2026 to verify the BigQuery COALESCE wrap against the BigQuery emulator (PYCEL2SQL_TEST_BIGQUERY=1). It will post a confirmation comment on success, or open an issue on failure. Routine: trig_01XUGxNA7onedEDdRnuK4ixL.

Test plan

  • uv run pytest tests/ --ignore=tests/integration -v — 704 passing
  • uv run ruff check src/ tests/ — clean
  • uv run mypy src/pycel2sql/ — no NEW non-lark errors
  • Optional: PYCEL2SQL_TEST_BIGQUERY=1 uv run pytest tests/integration/ -v (deferred to scheduled follow-up agent)
  • Manual smoke test for each new option in a Python REPL (json_variables, column_aliases, param_start_index, Spark format() → format_string)

🤖 Generated with Claude Code

Add Apache Spark dialect, observe-fork options, and BigQuery COALESCE fix

Ports recent feature work from cel2sql Go (PRs #117, #113) and cel2sql4j
(PRs #10, #9) so pycel2sql is back at feature parity with both siblings.

New dialect:
- Apache Spark SQL (sixth dialect) — backtick-free identifiers, ?
  positional params, RLIKE, get_json_object, concat(), array_contains
  (arg-order swap), 0-indexed arrays, EXPLODE / (SELECT collect_list(...)),
  format_string(), (dayofweek(t) - 1) day-of-week adjustment, JSON-array
  membership raises (no boolean predicate available). Regex passes through
  to Python's re engine after the same ReDoS validators DuckDB / Postgres
  use; reserved-keyword + 128-char identifier validation.

New conversion options (mirrors cel2sql4j ConvertOptions):
- json_variables: declare CEL variable names as flat JSONB columns; dot
  and bracket notation emit dialect-specific JSON extraction
  (Postgres ->>, BigQuery JSON_VALUE, etc.). Nested chains supported
  (tags.corpus.section -> tags->'corpus'->>'section'). Comprehension
  iter vars shadow json_variables. has(jsonvar.key) emits write_json_existence.
- column_aliases: map CEL identifier -> SQL column name. Alias is
  validated against the dialect's identifier rules.
- param_start_index (convert_parameterized only): shift the placeholder
  counter so the first param is $5 / @p5 instead of $1. Values < 1 clamp
  to 1. Positional-? dialects ignore the index in placeholder text.
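The clamping and positional-? semantics above can be illustrated with a small standalone sketch; the function name and style argument are assumptions, not the real API:

```python
def placeholders(count: int, start_index: int = 1, style: str = "dollar") -> list[str]:
    # Values < 1 clamp to 1, per the param_start_index contract.
    start = max(1, start_index)
    if style == "qmark":
        # Positional-? dialects ignore the index in placeholder text,
        # but parameter ordering is still preserved by the caller.
        return ["?"] * count
    prefix = {"dollar": "$", "at": "@p"}[style]
    return [f"{prefix}{start + i}" for i in range(count)]
```

With `start_index=5`, the first placeholder becomes `$5` (or `@p5`) rather than `$1`.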

Per-dialect format() dispatch (fixes a latent bug exposed by adding Spark):
- Dialect.write_format() abstract method.
- Postgres + BigQuery emit FORMAT('...', ...).
- SQLite + DuckDB emit printf('...', ...).
- Spark emits format_string('...', ...).
- MySQL raises UnsupportedDialectFeatureError (no portable equivalent).
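A minimal sketch of the dispatch shape described above — class and method names follow the commit message, but the bodies and exception class are illustrative, not the real implementation:

```python
from abc import ABC, abstractmethod


class UnsupportedDialectFeatureError(Exception):
    """Illustrative stand-in for pycel2sql's error type."""


class Dialect(ABC):
    @abstractmethod
    def write_format(self, fmt: str, args: list[str]) -> str: ...


class PostgresDialect(Dialect):
    def write_format(self, fmt, args):
        return f"FORMAT({fmt}, {', '.join(args)})"


class SQLiteDialect(Dialect):  # DuckDB behaves analogously
    def write_format(self, fmt, args):
        return f"printf({fmt}, {', '.join(args)})"


class SparkDialect(Dialect):
    def write_format(self, fmt, args):
        return f"format_string({fmt}, {', '.join(args)})"


class MySQLDialect(Dialect):
    def write_format(self, fmt, args):
        raise UnsupportedDialectFeatureError("format() has no MySQL equivalent")
```

Making write_format() abstract is what surfaced the latent bug: every dialect is now forced to declare its own spelling instead of inheriting FORMAT(...).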

Bug fix:
- BigQuery write_array_length and write_json_array_length now wrap output
  in COALESCE(..., 0) for NULL safety, matching the four other dialects.
  Mirrors cel2sql4j PR #9 / cel2sql4j upstream commit 1689adc.
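The NULL-safe wrap can be sketched as follows; the function name mirrors the description above, but the body is illustrative:

```python
def write_array_length(expr: str) -> str:
    # ARRAY_LENGTH(NULL) is NULL in BigQuery; COALESCE collapses it to 0
    # so size comparisons behave like the other dialects.
    return f"COALESCE(ARRAY_LENGTH({expr}), 0)"
```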

Tests: 704 unit tests pass (607 baseline + 97 new across tests/test_spark.py,
tests/test_options.py, and per-dialect format() cases in
tests/test_string_functions.py). Ruff clean. Mypy non-lark errors unchanged
(6); 1 added lark Tree[Any] generic-arg note matches the existing pattern
documented in CLAUDE.md.

Docs: README adds Spark badge, Conversion Options section
(json_variables / column_aliases / param_start_index / format() per-dialect
table), updated placeholder table with Spark, sixth-dialect imports.
CLAUDE.md bumps dialect count to six, adds Spark to the file list +
Dialect Differences bullets, documents the new options under
Important Conventions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@richardwooding richardwooding requested a review from Copilot April 27, 2026 21:24

Copilot AI left a comment


Pull request overview

Ports feature parity updates from the Go/Java sibling projects into pycel2sql by adding an Apache Spark dialect, new conversion options (JSON-root variables, identifier aliases, param start offset), and fixing BigQuery array-length NULL behavior.

Changes:

  • Added SparkDialect with Spark-specific SQL generation (positional ? params, RLIKE, get_json_object, array_contains, format_string, EXPLODE/collect_list, 0-based arrays, Spark identifier validation).
  • Introduced conversion options: json_variables, column_aliases, and param_start_index (for convert_parameterized).
  • Implemented per-dialect format() dispatch via Dialect.write_format() and fixed BigQuery COALESCE(..., 0) wrapping for array length functions.

Reviewed changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| src/pycel2sql/dialect/spark.py | New Spark SQL dialect implementation including regex validation, JSON extraction, arrays, and comprehensions. |
| src/pycel2sql/dialect/_base.py | Adds new DialectName.SPARK and makes write_format() an abstract requirement for dialects. |
| src/pycel2sql/dialect/__init__.py | Exports SparkDialect and wires it into get_dialect(). |
| src/pycel2sql/dialect/postgres.py | Implements write_format() using FORMAT(...). |
| src/pycel2sql/dialect/bigquery.py | Adds COALESCE(..., 0) wrapping for array lengths and implements write_format() using FORMAT(...). |
| src/pycel2sql/dialect/sqlite.py | Implements write_format() using printf(...). |
| src/pycel2sql/dialect/duckdb.py | Implements write_format() using printf(...). |
| src/pycel2sql/dialect/mysql.py | Implements write_format() by raising UnsupportedDialectFeatureError. |
| src/pycel2sql/_converter.py | Adds support for json_variables/column_aliases/param_start_index, JSON-root routing, aliasing, and delegates format() generation to the dialect. |
| src/pycel2sql/__init__.py | Public API wiring for new options and refactors shared kwargs building. |
| tests/test_spark.py | New Spark-specific unit test suite. |
| tests/test_options.py | New tests covering json_variables, column_aliases, and param_start_index. |
| tests/test_string_functions.py | Adds per-dialect format() dispatch assertions. |
| tests/test_bigquery.py | Updates BigQuery array length expectation to include COALESCE. |
| tests/test_dialect_parametrized.py | Adds Spark to the parametrized dialect matrix. |
| tests/conftest.py | Adds spark_dialect fixture and includes Spark in ALL_DIALECTS. |
| README.md | Documents Spark support and the new options / per-dialect format() behavior. |
| CLAUDE.md | Updates project guidance to include Spark and the new options / format dispatch behavior. |


Comment thread src/pycel2sql/dialect/spark.py Outdated
Comment on lines +235 to +239
if "(?m" in pattern or "(?s" in pattern or "(?-" in pattern:
    raise InvalidRegexPatternError(
        "invalid pattern in expression",
        "inline flags other than (?i) are not supported in Spark regex",
    )

Copilot AI Apr 27, 2026


The inline-flag validation misses combined flag groups like (?im) / (?mi): the current substring checks only catch (?m/(?s) when those flags appear first. This allows non-i flags through despite the intent to reject them. Consider parsing (?[a-zA-Z]+) groups and rejecting any set that includes flags other than i (including combined forms).

Contributor Author


Fixed in 6eb6436. The check now uses _INLINE_FLAG_GROUP_RE = re.compile(r"\(\?([a-zA-Z]+)[:)]") to extract the flag-letter set from each (?<flags>) or (?<flags>: group and rejects any group containing flags other than i. The bare (?- clear form is also rejected. Added parametrized tests for (?im), (?mi), (?ix:foo), (?-i), plus a sanity check that (?i)foo alone still passes.
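A standalone sketch of the fixed check — the regex is copied from the comment above, while the surrounding function name is illustrative:

```python
import re

# Extracts the flag-letter set from each (?<flags>) or (?<flags>: group.
_INLINE_FLAG_GROUP_RE = re.compile(r"\(\?([a-zA-Z]+)[:)]")


def spark_inline_flags_ok(pattern: str) -> bool:
    # The bare (?-...) clear-flag form is rejected outright.
    if "(?-" in pattern:
        return False
    # Reject any flag group containing letters other than 'i'.
    return all(
        set(flags) <= {"i"} for flags in _INLINE_FLAG_GROUP_RE.findall(pattern)
    )
```

Unlike the original substring checks, this catches combined groups such as `(?im)`, `(?mi)`, and `(?ix:foo)` while still allowing `(?i)foo`.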

Comment thread tests/test_string_functions.py Outdated

class TestFormatPerDialect:
    """format() dispatches per dialect: FORMAT for Postgres/BigQuery,
    printf for SQLite/DuckDB, raises for MySQL."""

Copilot AI Apr 27, 2026


The docstring for TestFormatPerDialect is now incomplete: it documents FORMAT/printf/MySQL-raises, but this PR also adds Spark's format_string(...) mapping. Please update the docstring so it matches the actual per-dialect behavior being tested elsewhere in the suite.

Suggested change:
-     printf for SQLite/DuckDB, raises for MySQL."""
+     format_string for Spark, printf for SQLite/DuckDB, raises for MySQL."""

Contributor Author


Fixed in 6eb6436 — applied your suggested wording verbatim.

Comment thread src/pycel2sql/dialect/spark.py Outdated
f"Spark dialect does not support multi-dimensional array "
f"length (dimension={dimension})",
)
# Spark size() returns -1 for null; COALESCE collapses to 0.

Copilot AI Apr 27, 2026


Comment says size() returns -1 for NULL, but in Spark SQL size(NULL) evaluates to NULL. The COALESCE wrapper is still correct, but the comment should be updated to avoid misleading future maintainers.

Suggested change:
- # Spark size() returns -1 for null; COALESCE collapses to 0.
+ # In Spark SQL, size(NULL) evaluates to NULL; COALESCE converts that to 0.

Contributor Author


Fixed in 6eb6436 — applied your suggested wording verbatim. Good catch: the -1 claim was carried over from a different language port and is wrong for Spark SQL.

Comment thread src/pycel2sql/_converter.py Outdated
)

- # Convert %d, %f etc. to %s for SQL FORMAT()
+ # Convert %d, %f etc. to %s for printf-style dialects

Copilot AI Apr 27, 2026


This comment says numeric specifiers are converted to %s "for printf-style dialects", but the conversion is applied unconditionally before dispatching to Dialect.write_format() (including Postgres/BigQuery/Spark). Update the comment to reflect the actual behavior, or make the conversion dialect-specific if that was the intent.

Suggested change:
- # Convert %d, %f etc. to %s for printf-style dialects
+ # Normalize supported numeric specifiers (%d, %f, etc.) to %s before
+ # dispatching to Dialect.write_format().

Contributor Author


Fixed in 6eb6436. Updated comment to "Normalize supported numeric specifiers (%d, %f, etc.) to %s before dispatching to Dialect.write_format()". Kept the conversion unconditional (rather than making it dialect-specific) because %s is universally accepted by FORMAT/printf/format_string and avoids type-coercion mismatches when CEL arguments do not match the original specifier exactly.
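The unconditional normalization can be sketched as follows; the regex and helper name are assumptions, not the real converter code:

```python
import re


def normalize_format_specifiers(fmt: str) -> str:
    # %d and %f become %s before dispatch to Dialect.write_format();
    # %s is accepted by FORMAT, printf, and format_string alike.
    return re.sub(r"%[df]", "%s", fmt)
```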

Four review-comment fixes:

1. Spark inline-flag check now catches combined flag groups. The previous
   substring-only check (`"(?m" in pattern`) missed `(?im)` / `(?mi)` /
   `(?ix:...)`. New `_INLINE_FLAG_GROUP_RE` parses each `(?<flags>)` or
   `(?<flags>:` group and rejects any flag set containing letters other
   than `i`. The bare `(?-` clear-flag form remains explicitly rejected.

2. Update tests/test_string_functions.py TestFormatPerDialect docstring
   to mention Spark's format_string() mapping alongside the four other
   dialect dispatches.

3. Fix misleading comment in Spark write_array_length: size(NULL)
   evaluates to NULL in Spark SQL, not -1. The COALESCE wrap is still
   correct; only the comment was wrong.

4. Update _visit_format() comment in _converter.py to reflect that the
   %d/%f → %s normalization runs unconditionally before
   Dialect.write_format() dispatch — not "for printf-style dialects".

Tests: 709 unit tests pass (704 + 5 new flag-group rejection cases).
Ruff clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@richardwooding richardwooding merged commit be6f12d into main Apr 28, 2026
7 checks passed
richardwooding added a commit that referenced this pull request Apr 28, 2026
Captures the workflow for syncing features and bug fixes from upstream
cel2sql Go (the canonical source) and cross-referencing cel2sql4j Java.
The recent v3.7.0/v3.7.1 sync (PR #8: Spark dialect + observe-fork
options + BigQuery COALESCE fix) followed this exact shape.

Layout under .claude/skills/port-from-upstream/:
- SKILL.md (~120 lines) — quick start, two-upstream model, pre-port
  verification (grep first), out-of-scope concerns mirrored from
  cel2sql4j's same decisions, per-commit workflow, when-to-split-PRs
  trade-off, verification.
- references/go-to-python-idioms.md — Go-to-Python mapping table
  (closures, errors, struct-and-interface → ABC + dataclass, generics,
  strings, maps, time/duration, reflection). Distilled from the actual
  edits in PR #8.
- references/java-to-python-cross-refs.md — places where cel2sql4j's
  Java solution was a closer template than the Go original (per-dialect
  writeFormat, ConvertOptions shape, single-functional-interface lambda
  vs Go closure, dual-message exception base, shared skipped concerns).
- scripts/list_upstream_changes.sh — lists upstream cel2sql commits and
  tags, plus cel2sql4j commits in the same time window. Auto-detects
  since-sha from the most recent "Port" / "backport" commit on pycel2sql
  main. Fails clearly when either sibling repo isn't checked out.

Lints clean against .claude/skills/skill-authoring/scripts/lint_skill.py.

Co-authored-by: Richard Wooding <richardwooding@Richards-Virtual-Machine.local>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
