Add Apache Spark SQL dialect#10

Merged
richardwooding merged 2 commits into main from feat/spark-dialect on Apr 27, 2026
Conversation

@richardwooding
Contributor

Summary

Ports the Apache Spark dialect that landed upstream in cel2sql v3.7.0 (commit ef505ab), bringing cel2sql4j to feature parity on dialect coverage. Branched from main; depends on nothing in #9 (PR A).

New files

  • dialect/spark/SparkDialect.java — full Dialect + IndexAdvisor impl
  • dialect/spark/SparkRegex.java — RE2 → Java regex passthrough with ReDoS safety checks (Spark runs on the JVM, so patterns are unchanged after validation)
  • dialect/spark/SparkValidation.java — identifier rules + reserved-keyword set
  • DialectName.SPARK
  • Cel2SqlSparkTest — 18 unit cases covering output, regex, identifier validation

Spark-specific output

  • Strings: concat(a, b), LOCATE(needle, haystack) > 0, RLIKE
  • Arrays: array(...), array_contains(arr, x), COALESCE(size(arr), 0)
  • JSON: get_json_object, from_json(..., 'ARRAY<STRING>') + EXPLODE for arrays
  • Timestamps: EXTRACT(... FROM ts), with (dayofweek(ts) - 1) for getDayOfWeek (Spark uses 1=Sunday; CEL uses 0=Sunday)
  • Placeholders: positional ?
  • Multi-dimensional arrays: rejected — Spark has no portable equivalent
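The day-of-week adjustment can be sketched as a small emitter helper. This is a hypothetical illustration only — `SparkDialect`'s actual method signatures come from the Dialect interface and will differ:

```java
// Sketch of the getDayOfWeek emission: Spark's dayofweek() returns 1..7
// with 1 = Sunday, while CEL's getDayOfWeek() expects 0..6 with 0 = Sunday,
// so the dialect wraps the call in a "- 1" adjustment.
class SparkDayOfWeekSketch {
    // Emits the Spark SQL fragment for CEL's getDayOfWeek(ts).
    static String getDayOfWeekSql(String tsExpr) {
        return "(dayofweek(" + tsExpr + ") - 1)";
    }

    // The value-level mapping the emitted fragment performs.
    static int sparkToCel(int sparkDayOfWeek) {
        return sparkDayOfWeek - 1; // Spark 1=Sunday -> CEL 0=Sunday
    }
}
```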

Index advisor

recommendIndex() always returns null. Spark indexing is storage-layer specific (Delta Z-order vs Iceberg sort vs plain Parquet) and not portable as a single SQL recommendation. Implementing the interface (rather than omitting it) prevents Cel2Sql.analyzeQuery() from silently falling back to Postgres recommendations on a Spark dialect.
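A minimal sketch of that design choice — implement the advisor and decline explicitly, rather than omit it and inherit another dialect's advice. The class and method names here are illustrative stand-ins, not the real `IndexAdvisor` API:

```java
// Sketch: a no-op index advisor. Spark indexing is storage-layer specific
// (Delta Z-order, Iceberg sort orders, plain Parquet), so no single SQL
// recommendation is portable: always return null, never a recommendation.
class SparkIndexAdvisorSketch {
    // Callers see an explicit "no recommendation" instead of silently
    // falling back to Postgres-style advice.
    static String recommendIndex(String column) {
        return null;
    }
}
```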

Out of scope (mirrors upstream)

  • Spark JDBC type provider (Go-side spark/provider.go) — not portable; Java users construct Schema directly
  • Spark Testcontainers integration tests — heavy; will land separately if needed

Test plan

  • ./gradlew test — all 388 unit tests pass (370 existing + 18 new Spark)
  • ./gradlew clean build
  • CI green on this branch

🤖 Generated with Claude Code

Ports the Spark dialect added to upstream cel2sql in v3.7.0 (commit ef505ab),
bringing cel2sql4j to six dialects, matching upstream's coverage.

New files:
- dialect/spark/SparkDialect.java — full Dialect impl + IndexAdvisor (no-op
  recommendations; Spark indexing is storage-layer specific)
- dialect/spark/SparkRegex.java — RE2/Java-regex passthrough validator
  (Spark uses java.util.regex, same engine cel2sql4j compiles against, so
  patterns pass through unchanged after ReDoS safety checks)
- dialect/spark/SparkValidation.java — identifier validation + reserved
  keyword set (~110 Spark SQL keywords, lowercased)
- DialectName enum: add SPARK
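The passthrough-with-safety-check idea might look roughly like this. It is a sketch under stated assumptions: the real `SparkRegex` checks are likely more thorough, and the nested-quantifier heuristic below is illustrative, not the library's actual rule set:

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Sketch: because Spark evaluates RLIKE with java.util.regex — the same
// engine this library compiles against — a validated pattern can pass
// through unchanged. Before passing it through, reject patterns that are
// syntactically invalid or show an obvious ReDoS shape.
class SparkRegexSketch {
    // Crude nested-quantifier detector, e.g. (a+)+ — illustrative only.
    private static final Pattern NESTED_QUANTIFIER =
            Pattern.compile(".*\\([^)]*[+*][^)]*\\)[+*].*");

    static boolean isSafe(String pattern) {
        try {
            Pattern.compile(pattern); // syntax check
        } catch (PatternSyntaxException e) {
            return false;
        }
        return !NESTED_QUANTIFIER.matcher(pattern).matches();
    }

    // Returns the pattern unchanged after validation.
    static String passthrough(String pattern) {
        if (!isSafe(pattern)) {
            throw new IllegalArgumentException("unsafe or invalid pattern: " + pattern);
        }
        return pattern;
    }
}
```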

Spark-specific output:
- Strings: concat(a, b), LOCATE/RLIKE
- Arrays: array(...), array_contains, COALESCE(size(arr), 0)
- JSON: get_json_object / from_json + EXPLODE for arrays
- Timestamps: EXTRACT(... FROM ts), with (dayofweek(t) - 1) for getDayOfWeek
  to convert Spark's 1=Sunday to CEL's 0=Sunday
- Placeholders: positional ?
- Multi-dim ARRAY_LENGTH throws (Spark has no portable equivalent)

IndexAdvisor.recommendIndex returns null for all patterns — analyzeQuery
on Spark returns empty recommendations rather than falling back to Postgres
advice, matching upstream's `SupportsIndexAnalysis returns false` semantics.

Tests: Cel2SqlSparkTest covers basic comparisons, string functions, regex,
array membership, timestamp extraction, parameterized output, identifier
validation, and the regex security checks. 18 test cases.

Docs:
- README: Spark badge, 6-dialect feature claim, supported-dialects table row,
  dialect-import example
- CLAUDE.md: dialect-comparison table now has 6 columns; Spark notes paragraph
  covering the regex passthrough, day-of-week adjustment, and the no-op
  index advisor

Out of scope (mirrors upstream):
- Spark JDBC type provider (Go-side `spark/provider.go`) — not portable;
  Java users construct Schema directly
- Spark Testcontainers integration tests — heavy; will land separately
  if needed

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI left a comment


Pull request overview

This PR ports the Apache Spark SQL dialect into cel2sql4j, aligning the Java implementation with upstream cel2sql dialect coverage and enabling Spark-specific SQL generation (syntax, regex handling, identifier validation, and analysis behavior).

Changes:

  • Added a new Spark dialect implementation (SparkDialect) plus Spark-specific regex and identifier validation helpers.
  • Extended the dialect enum (DialectName) and updated documentation to list Spark as a supported dialect.
  • Added a Spark-focused test suite (Cel2SqlSparkTest) covering core translation behaviors and dialect validation.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
src/main/java/com/spandigital/cel2sql/dialect/spark/SparkDialect.java Implements Spark-specific SQL emission for strings/arrays/json/timestamps and provides a no-op IndexAdvisor behavior.
src/main/java/com/spandigital/cel2sql/dialect/spark/SparkRegex.java Adds Spark regex validation/passthrough with ReDoS-oriented guards.
src/main/java/com/spandigital/cel2sql/dialect/spark/SparkValidation.java Adds Spark identifier rules and reserved keyword validation.
src/main/java/com/spandigital/cel2sql/dialect/DialectName.java Adds SPARK dialect name.
src/test/java/com/spandigital/cel2sql/Cel2SqlSparkTest.java Adds unit tests for Spark SQL output, parameter placeholders, index analysis behavior, and validation.
README.md Documents Spark as a supported dialect and adds usage + capability table row.
CLAUDE.md Updates internal contributor docs to include Spark dialect notes and capability table.


Comment on lines +229 to +239
// Mirrors SQLite's `lhs = (SELECT value FROM json_each(...))` pattern.
w.append("(SELECT col FROM (SELECT EXPLODE(from_json(");
writeExpr.write();
w.append(", 'ARRAY<STRING>')) AS col) t)");
}

@Override
public void writeNestedJSONArrayMembership(StringBuilder w, SqlWriter writeExpr) throws ConversionException {
w.append("(SELECT col FROM (SELECT EXPLODE(from_json(");
writeExpr.write();
w.append(", 'ARRAY<STRING>')) AS col) t)");

Copilot AI Apr 27, 2026


writeJSONArrayMembership currently emits a scalar subquery that can return multiple rows (one per element from EXPLODE(...)). In Spark SQL, scalar subqueries must return at most one row, so this will fail at runtime for non-empty arrays with >1 element and it also doesn't implement boolean membership semantics. Consider generating a boolean membership predicate (e.g., EXISTS (...) / array_contains(...)) or adjusting the converter/dialect contract so the element expression can be incorporated correctly.

Suggested change
-// Mirrors SQLite's `lhs = (SELECT value FROM json_each(...))` pattern.
-w.append("(SELECT col FROM (SELECT EXPLODE(from_json(");
-writeExpr.write();
-w.append(", 'ARRAY<STRING>')) AS col) t)");
+throw new ConversionException(
+    "Spark JSON array membership requires a boolean predicate such as array_contains(...) or EXISTS (...), "
+    + "but the current dialect contract does not provide the candidate element needed to build one.");
 }

 @Override
 public void writeNestedJSONArrayMembership(StringBuilder w, SqlWriter writeExpr) throws ConversionException {
-w.append("(SELECT col FROM (SELECT EXPLODE(from_json(");
-writeExpr.write();
-w.append(", 'ARRAY<STRING>')) AS col) t)");
+throw new ConversionException(
+    "Spark nested JSON array membership requires a boolean predicate such as array_contains(...) or EXISTS (...), "
+    + "but the current dialect contract does not provide the candidate element needed to build one.");
Contributor Author


Took your throw-with-error suggestion in 740d544. The Dialect contract doesn't pass the candidate element through, so we can't emit the correct array_contains(from_json(...), elem) form here without an interface change. Failing fast at conversion time is preferable to the silent multi-row scalar subquery that Spark rejects at runtime. Native ARRAY<T> columns continue to work via writeArrayMembership (the standard array_contains path). Added a unit test asserting the throw.

Comment on lines +229 to +239
// Mirrors SQLite's `lhs = (SELECT value FROM json_each(...))` pattern.
w.append("(SELECT col FROM (SELECT EXPLODE(from_json(");
writeExpr.write();
w.append(", 'ARRAY<STRING>')) AS col) t)");
}

@Override
public void writeNestedJSONArrayMembership(StringBuilder w, SqlWriter writeExpr) throws ConversionException {
w.append("(SELECT col FROM (SELECT EXPLODE(from_json(");
writeExpr.write();
w.append(", 'ARRAY<STRING>')) AS col) t)");

Copilot AI Apr 27, 2026


writeNestedJSONArrayMembership has the same scalar-subquery issue as writeJSONArrayMembership: SELECT col FROM (SELECT EXPLODE(...)) can yield multiple rows and will error in Spark when used as a scalar expression, and it doesn't produce a boolean membership test. This should be rewritten to a boolean predicate (e.g., EXISTS with a filter) or the surrounding converter logic should be updated to use a Spark-friendly membership form.

Suggested change
-// Mirrors SQLite's `lhs = (SELECT value FROM json_each(...))` pattern.
-w.append("(SELECT col FROM (SELECT EXPLODE(from_json(");
-writeExpr.write();
-w.append(", 'ARRAY<STRING>')) AS col) t)");
+w.append("array_contains(from_json(");
+writeExpr.write();
+w.append(", 'ARRAY<STRING>'), ").append(jsonFunc).append(')');
 }

 @Override
 public void writeNestedJSONArrayMembership(StringBuilder w, SqlWriter writeExpr) throws ConversionException {
-w.append("(SELECT col FROM (SELECT EXPLODE(from_json(");
-writeExpr.write();
-w.append(", 'ARRAY<STRING>')) AS col) t)");
+w.append("COALESCE(size(from_json(");
+writeExpr.write();
+w.append(", 'ARRAY<STRING>')), 0) > 0");
Contributor Author


Same fix in 740d544. The array_contains(from_json(...), jsonFunc) rewrite you suggested would emit incorrect SQL because jsonFunc is the JSON helper function name (e.g. jsonb_array_elements_text), not the candidate element — that information isn't plumbed through this Dialect method. Throwing is the honest answer; users with native ARRAY<T> columns get the working array_contains path via writeArrayMembership.

Comment on lines +20 to +22
* test cases: RLIKE for regex, array_contains for membership, size+COALESCE for
* length, get_json_object for JSON access, and the dayofweek - 1 adjustment
* for {@code getDayOfWeek}.

Copilot AI Apr 27, 2026


The class Javadoc mentions testing JSON access via get_json_object, but the test cases in this file don't currently assert any JSON translations. Either add a representative JSON access case (so Spark's JSON behavior is covered) or update the comment to avoid implying coverage that isn't present.

Suggested change
-* test cases: RLIKE for regex, array_contains for membership, size+COALESCE for
-* length, get_json_object for JSON access, and the dayofweek - 1 adjustment
-* for {@code getDayOfWeek}.
+* test cases covered here: RLIKE for regex, array_contains for membership,
+* size+COALESCE for length, and the dayofweek - 1 adjustment for
+* {@code getDayOfWeek}.

Contributor Author


Trimmed the Javadoc claim in 740d544 (you're right — the JSON branch wasn't covered, and the existing tests assert RLIKE / array_contains / size+COALESCE / dayofweek). The new JSON test verifies the throw on JSON array membership.

import org.junit.jupiter.params.provider.Arguments;
import org.junit.jupiter.params.provider.MethodSource;

import java.util.List;

Copilot AI Apr 27, 2026


java.util.List is imported but not used in this test class. Removing the unused import will keep the file clean and avoid compiler warnings.

Suggested change
-import java.util.List;

Contributor Author


Removed the unused java.util.List import in 740d544.

- Spark writeJSONArrayMembership / writeNestedJSONArrayMembership now throw a
  clear ConversionException instead of emitting a multi-row scalar subquery
  that Spark rejects at runtime. The Dialect contract doesn't pass the
  candidate element through, so we can't rewrite to the correct
  array_contains(from_json(...), elem) form here. Failing fast at conversion
  time is honest; runtime SQL failure was the previous (also broken) outcome.
  Native ARRAY<T> columns continue to work via writeArrayMembership.
- Cel2SqlSparkTest: drop unused java.util.List import; trim the class
  Javadoc to match what the tests actually assert (the JSON-access claim
  was overstated). New test verifies the JSON-array-membership throw.
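The shape of the fix can be sketched as follows. These are hypothetical helper methods, not the real Dialect-interface signatures; they only illustrate the fail-fast-vs-working-path split:

```java
// Sketch: native ARRAY<T> membership keeps the working array_contains
// emission, while JSON array membership fails fast at conversion time,
// because the dialect contract doesn't supply the candidate element
// needed to build array_contains(from_json(...), elem).
class SparkMembershipSketch {
    // Working path for native ARRAY<T> columns.
    static String writeArrayMembership(String arrayExpr, String elemExpr) {
        return "array_contains(" + arrayExpr + ", " + elemExpr + ")";
    }

    // JSON path: throwing here beats emitting a multi-row scalar
    // subquery that Spark would only reject at runtime.
    static String writeJsonArrayMembership(String jsonExpr) {
        throw new UnsupportedOperationException(
                "Spark JSON array membership needs the candidate element "
                + "to build array_contains(from_json(...), elem)");
    }
}
```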

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@richardwooding richardwooding merged commit 0bc7802 into main Apr 27, 2026
4 checks passed
richardwooding added a commit that referenced this pull request Apr 27, 2026
…#11)

Three new project-local skills under .claude/skills/, each authored against
the existing skill-authoring conventions and passing the lint script:

- add-cel-function: walks adding a new CEL function to all 6 dialects
  (constant + dispatch + Dialect interface method + 6 impls + tests).
  Includes references/sql-writer-pattern.md and references/conversion-exception.md.
- add-sql-dialect: walks adding a new SQL dialect (e.g. CockroachDB,
  Snowflake). Includes a scaffold_dialect.sh script that copies an
  existing dialect package and renames classes via sed, plus references/
  dialect-method-checklist.md and references/test-files.md.
- port-from-upstream: walks porting from the Go upstream cel2sql repo.
  Includes a list_upstream_changes.sh script that auto-detects the last
  port commit on cel2sql4j and lists candidate upstream commits, plus
  references/go-to-java-idioms.md and references/two-pr-shape.md.

Also includes the previously-uncommitted skill-authoring skill that was
staged on main (the foundation these new skills are authored against).

While running the build-verification step on this PR, discovered that
SparkDialect (PR #10) was merged on top of PR #9 without rebasing and
silently lost the writeFormat method that PR #9 added to the Dialect
interface — main currently does not compile. Fixed in this PR by
adding SparkDialect.writeFormat using Spark's format_string() function
(its printf-equivalent), and added a Spark case to Cel2SqlOptionsTest's
formatTests.

All 392 unit tests pass; ./gradlew clean build.
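The format_string-based emission described above might be sketched like this. The helper below is hypothetical — the real writeFormat works against the Dialect interface's writer types — but the shape of the emitted SQL follows the commit's description:

```java
import java.util.List;

// Sketch: CEL's format call maps onto Spark's format_string(), Spark's
// printf equivalent, by emitting the template as a quoted literal
// followed by the already-rendered argument SQL fragments.
class SparkFormatSketch {
    static String writeFormat(String template, List<String> argSql) {
        StringBuilder sb = new StringBuilder("format_string('");
        sb.append(template.replace("'", "''")); // escape single quotes for SQL
        sb.append("'");
        for (String arg : argSql) {
            sb.append(", ").append(arg);
        }
        return sb.append(")").toString();
    }
}
```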

Co-authored-by: Richard Wooding <richardwooding@Richards-Virtual-Machine.local>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>