Add Apache Spark SQL dialect #10
Conversation
Ports the Spark dialect added to upstream cel2sql in v3.7.0 (commit ef505ab), making cel2sql4j a six-dialect library matching upstream's coverage.

New files:
- `dialect/spark/SparkDialect.java` — full `Dialect` impl + `IndexAdvisor` (no-op recommendations; Spark indexing is storage-layer specific)
- `dialect/spark/SparkRegex.java` — RE2/Java-regex passthrough validator (Spark uses java.util.regex, the same engine cel2sql4j compiles against, so patterns pass through unchanged after ReDoS safety checks)
- `dialect/spark/SparkValidation.java` — identifier validation + reserved keyword set (~110 Spark SQL keywords, lowercased)
- `DialectName` enum: add `SPARK`

Spark-specific output:
- Strings: `concat(a, b)`, `LOCATE`/`RLIKE`
- Arrays: `array(...)`, `array_contains`, `COALESCE(size(arr), 0)`
- JSON: `get_json_object` / `from_json` + `EXPLODE` for arrays
- Timestamps: `EXTRACT(... FROM ts)`, with `(dayofweek(t) - 1)` for `getDayOfWeek` to convert Spark's 1=Sunday to CEL's 0=Sunday (sketched just after this message)
- Placeholders: positional `?`
- Multi-dim `ARRAY_LENGTH` throws (Spark has no portable equivalent)

`IndexAdvisor.recommendIndex` returns null for all patterns — `analyzeQuery` on Spark returns empty recommendations rather than falling back to Postgres advice, matching upstream's `SupportsIndexAnalysis returns false` semantics.

Tests: `Cel2SqlSparkTest` covers basic comparisons, string functions, regex, array membership, timestamp extraction, parameterized output, identifier validation, and the regex security checks. 18 test cases.

Docs:
- README: Spark badge, 6-dialect feature claim, supported-dialects table row, dialect-import example
- CLAUDE.md: dialect-comparison table now has 6 columns; Spark notes paragraph covering the regex passthrough, day-of-week adjustment, and the no-op index advisor

Out of scope (mirrors upstream):
- Spark JDBC type provider (Go-side `spark/provider.go`) — not portable; Java users construct `Schema` directly
- Spark Testcontainers integration tests — heavy; will land separately if needed

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
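To make the day-of-week adjustment concrete, the emission might look roughly like the sketch below. It follows the `StringBuilder`/`SqlWriter` pattern visible in the diff hunks later in this thread, but the exact method name in the `Dialect` interface is an assumption:

```java
// Sketch only: the actual method name in the Dialect interface is assumed.
@Override
public void writeGetDayOfWeek(StringBuilder w, SqlWriter writeExpr) throws ConversionException {
    // Spark's dayofweek() returns 1 for Sunday through 7 for Saturday;
    // CEL's getDayOfWeek() expects 0 for Sunday, so subtract 1.
    w.append("(dayofweek(");
    writeExpr.write();
    w.append(") - 1)");
}
```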
Pull request overview
This PR ports the Apache Spark SQL dialect into cel2sql4j, aligning the Java implementation with upstream cel2sql dialect coverage and enabling Spark-specific SQL generation (syntax, regex handling, identifier validation, and analysis behavior).
Changes:
- Added a new Spark dialect implementation (`SparkDialect`) plus Spark-specific regex and identifier validation helpers.
- Extended the dialect enum (`DialectName`) and updated documentation to list Spark as a supported dialect.
- Added a Spark-focused test suite (`Cel2SqlSparkTest`) covering core translation behaviors and dialect validation.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| src/main/java/com/spandigital/cel2sql/dialect/spark/SparkDialect.java | Implements Spark-specific SQL emission for strings/arrays/json/timestamps and provides a no-op IndexAdvisor behavior. |
| src/main/java/com/spandigital/cel2sql/dialect/spark/SparkRegex.java | Adds Spark regex validation/passthrough with ReDoS-oriented guards. |
| src/main/java/com/spandigital/cel2sql/dialect/spark/SparkValidation.java | Adds Spark identifier rules and reserved keyword validation (see the sketch after this table). |
| src/main/java/com/spandigital/cel2sql/dialect/DialectName.java | Adds SPARK dialect name. |
| src/test/java/com/spandigital/cel2sql/Cel2SqlSparkTest.java | Adds unit tests for Spark SQL output, parameter placeholders, index analysis behavior, and validation. |
| README.md | Documents Spark as a supported dialect and adds usage + capability table row. |
| CLAUDE.md | Updates internal contributor docs to include Spark dialect notes and capability table. |
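The identifier rules in `SparkValidation.java` (table above) might take roughly this shape. This is a hedged sketch: the method name, the identifier pattern, and the keyword entries are assumptions — the PR states only that there are ~110 lowercased Spark SQL keywords:

```java
// Sketch only: method name, identifier pattern, and keyword set are assumptions.
private static final java.util.Set<String> RESERVED_KEYWORDS =
        java.util.Set.of("select", "from", "where", "table" /* … ~110 in total */);

static void validateIdentifier(String ident) throws ConversionException {
    // Basic identifier shape: letters, digits, underscores; no leading digit.
    if (!ident.matches("[A-Za-z_][A-Za-z0-9_]*")) {
        throw new ConversionException("invalid Spark SQL identifier: " + ident);
    }
    // Keywords are stored lowercased, so normalize before the lookup.
    if (RESERVED_KEYWORDS.contains(ident.toLowerCase(java.util.Locale.ROOT))) {
        throw new ConversionException("reserved Spark SQL keyword: " + ident);
    }
}
```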
```java
        // Mirrors SQLite's `lhs = (SELECT value FROM json_each(...))` pattern.
        w.append("(SELECT col FROM (SELECT EXPLODE(from_json(");
        writeExpr.write();
        w.append(", 'ARRAY<STRING>')) AS col) t)");
    }

    @Override
    public void writeNestedJSONArrayMembership(StringBuilder w, SqlWriter writeExpr) throws ConversionException {
        w.append("(SELECT col FROM (SELECT EXPLODE(from_json(");
        writeExpr.write();
        w.append(", 'ARRAY<STRING>')) AS col) t)");
```
writeJSONArrayMembership currently emits a scalar subquery that can return multiple rows (one per element from EXPLODE(...)). In Spark SQL, scalar subqueries must return at most one row, so this will fail at runtime for non-empty arrays with >1 element and it also doesn't implement boolean membership semantics. Consider generating a boolean membership predicate (e.g., EXISTS (...) / array_contains(...)) or adjusting the converter/dialect contract so the element expression can be incorporated correctly.
Suggested change:

```diff
-        // Mirrors SQLite's `lhs = (SELECT value FROM json_each(...))` pattern.
-        w.append("(SELECT col FROM (SELECT EXPLODE(from_json(");
-        writeExpr.write();
-        w.append(", 'ARRAY<STRING>')) AS col) t)");
+        throw new ConversionException(
+                "Spark JSON array membership requires a boolean predicate such as array_contains(...) or EXISTS (...), "
+                        + "but the current dialect contract does not provide the candidate element needed to build one.");
     }

     @Override
     public void writeNestedJSONArrayMembership(StringBuilder w, SqlWriter writeExpr) throws ConversionException {
-        w.append("(SELECT col FROM (SELECT EXPLODE(from_json(");
-        writeExpr.write();
-        w.append(", 'ARRAY<STRING>')) AS col) t)");
+        throw new ConversionException(
+                "Spark nested JSON array membership requires a boolean predicate such as array_contains(...) or EXISTS (...), "
+                        + "but the current dialect contract does not provide the candidate element needed to build one.");
```
Took your throw-with-error suggestion in 740d544. The Dialect contract doesn't pass the candidate element through, so we can't emit the correct array_contains(from_json(...), elem) form here without an interface change. Failing fast at conversion time is preferable to the silent multi-row scalar subquery that Spark rejects at runtime. Native ARRAY<T> columns continue to work via writeArrayMembership (the standard array_contains path). Added a unit test asserting the throw.
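For illustration, if the `Dialect` contract were extended as described, the correct emission could look like the sketch below. The `writeElem` parameter is hypothetical — it is exactly the piece the current interface does not provide:

```java
// Hypothetical signature: writeElem (the candidate element) is NOT part of
// the current Dialect interface; this shows the interface change discussed above.
@Override
public void writeJSONArrayMembership(StringBuilder w, SqlWriter writeElem, SqlWriter writeArr)
        throws ConversionException {
    w.append("array_contains(from_json(");
    writeArr.write();            // the JSON column/expression holding the array
    w.append(", 'ARRAY<STRING>'), ");
    writeElem.write();           // the candidate element being tested for membership
    w.append(')');
}
```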
```java
        // Mirrors SQLite's `lhs = (SELECT value FROM json_each(...))` pattern.
        w.append("(SELECT col FROM (SELECT EXPLODE(from_json(");
        writeExpr.write();
        w.append(", 'ARRAY<STRING>')) AS col) t)");
    }

    @Override
    public void writeNestedJSONArrayMembership(StringBuilder w, SqlWriter writeExpr) throws ConversionException {
        w.append("(SELECT col FROM (SELECT EXPLODE(from_json(");
        writeExpr.write();
        w.append(", 'ARRAY<STRING>')) AS col) t)");
```
writeNestedJSONArrayMembership has the same scalar-subquery issue as writeJSONArrayMembership: SELECT col FROM (SELECT EXPLODE(...)) can yield multiple rows and will error in Spark when used as a scalar expression, and it doesn't produce a boolean membership test. This should be rewritten to a boolean predicate (e.g., EXISTS with a filter) or the surrounding converter logic should be updated to use a Spark-friendly membership form.
Suggested change:

```diff
-        // Mirrors SQLite's `lhs = (SELECT value FROM json_each(...))` pattern.
-        w.append("(SELECT col FROM (SELECT EXPLODE(from_json(");
-        writeExpr.write();
-        w.append(", 'ARRAY<STRING>')) AS col) t)");
+        w.append("array_contains(from_json(");
+        writeExpr.write();
+        w.append(", 'ARRAY<STRING>'), ").append(jsonFunc).append(')');
     }

     @Override
     public void writeNestedJSONArrayMembership(StringBuilder w, SqlWriter writeExpr) throws ConversionException {
-        w.append("(SELECT col FROM (SELECT EXPLODE(from_json(");
-        writeExpr.write();
-        w.append(", 'ARRAY<STRING>')) AS col) t)");
+        w.append("COALESCE(size(from_json(");
+        writeExpr.write();
+        w.append(", 'ARRAY<STRING>')), 0) > 0");
```
Same fix in 740d544. The array_contains(from_json(...), jsonFunc) rewrite you suggested would emit incorrect SQL because jsonFunc is the JSON helper function name (e.g. jsonb_array_elements_text), not the candidate element — that information isn't plumbed through this Dialect method. Throwing is the honest answer; users with native ARRAY<T> columns get the working array_contains path via writeArrayMembership.
```java
 * test cases: RLIKE for regex, array_contains for membership, size+COALESCE for
 * length, get_json_object for JSON access, and the dayofweek - 1 adjustment
 * for {@code getDayOfWeek}.
```
The class Javadoc mentions testing JSON access via get_json_object, but the test cases in this file don't currently assert any JSON translations. Either add a representative JSON access case (so Spark's JSON behavior is covered) or update the comment to avoid implying coverage that isn't present.
Suggested change:

```diff
- * test cases: RLIKE for regex, array_contains for membership, size+COALESCE for
- * length, get_json_object for JSON access, and the dayofweek - 1 adjustment
- * for {@code getDayOfWeek}.
+ * test cases covered here: RLIKE for regex, array_contains for membership,
+ * size+COALESCE for length, and the dayofweek - 1 adjustment for
+ * {@code getDayOfWeek}.
```
Trimmed the Javadoc claim in 740d544 (you're right — the JSON branch wasn't covered, and the existing tests assert RLIKE / array_contains / size+COALESCE / dayofweek). The new JSON test verifies the throw on JSON array membership.
```java
import org.junit.jupiter.params.provider.Arguments;
import org.junit.jupiter.params.provider.MethodSource;

import java.util.List;
```
java.util.List is imported but not used in this test class. Removing the unused import will keep the file clean and avoid compiler warnings.
Suggested change:

```diff
-import java.util.List;
```
Removed the unused java.util.List import in 740d544.
- Spark `writeJSONArrayMembership` / `writeNestedJSONArrayMembership` now throw a clear `ConversionException` instead of emitting a multi-row scalar subquery that Spark rejects at runtime. The Dialect contract doesn't pass the candidate element through, so we can't rewrite to the correct `array_contains(from_json(...), elem)` form here. Failing fast at conversion time is honest; runtime SQL failure was the previous (also broken) outcome. Native `ARRAY<T>` columns continue to work via `writeArrayMembership`.
- `Cel2SqlSparkTest`: drop unused `java.util.List` import; trim the class Javadoc to match what the tests actually assert (the JSON-access claim was overstated). New test verifies the JSON-array-membership throw.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…#11)

Three new project-local skills under .claude/skills/, each authored against the existing skill-authoring conventions and passing the lint script:

- add-cel-function: walks adding a new CEL function to all 6 dialects (constant + dispatch + Dialect interface method + 6 impls + tests). Includes references/sql-writer-pattern.md and references/conversion-exception.md.
- add-sql-dialect: walks adding a new SQL dialect (e.g. CockroachDB, Snowflake). Includes a scaffold_dialect.sh script that copies an existing dialect package and renames classes via sed, plus references/dialect-method-checklist.md and references/test-files.md.
- port-from-upstream: walks porting from the Go upstream cel2sql repo. Includes a list_upstream_changes.sh script that auto-detects the last port commit on cel2sql4j and lists candidate upstream commits, plus references/go-to-java-idioms.md and references/two-pr-shape.md.

Also includes the previously uncommitted skill-authoring skill that was staged on main (the foundation these new skills are authored against).

While running the build-verification step on this PR, discovered that SparkDialect (PR #10) was merged on top of PR #9 without rebasing and silently lost the writeFormat method that PR #9 added to the Dialect interface — main currently does not compile. Fixed in this PR by adding SparkDialect.writeFormat using Spark's format_string() function (its printf equivalent; see the sketch after this message), and added a Spark case to Cel2SqlOptionsTest's formatTests. All 392 unit tests pass; ./gradlew build clean.

Co-authored-by: Richard Wooding <richardwooding@Richards-Virtual-Machine.local>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
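A minimal sketch of that writeFormat fix. The exact signature PR #9 added to the Dialect interface isn't shown in this thread, so the parameter shape below is assumed from the other write* methods; only the use of format_string() is stated:

```java
// Sketch only: the real writeFormat signature added in PR #9 may differ.
@Override
public void writeFormat(StringBuilder w, SqlWriter writeArgs) throws ConversionException {
    // format_string() is Spark SQL's printf-style formatter,
    // e.g. format_string('%s-%d', name, id).
    w.append("format_string(");
    writeArgs.write();  // emits the format string followed by its arguments
    w.append(')');
}
```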
Summary
Ports the Apache Spark dialect that landed upstream in cel2sql v3.7.0 (commit ef505ab), bringing cel2sql4j to feature parity on dialect coverage. Branched from main; depends on nothing in #9 (PR A).

New files
- `dialect/spark/SparkDialect.java` — full `Dialect` + `IndexAdvisor` impl
- `dialect/spark/SparkRegex.java` — RE2 → Java regex passthrough with ReDoS safety checks (Spark runs on the JVM, so patterns are unchanged after validation; see the sketch after this list)
- `dialect/spark/SparkValidation.java` — identifier rules + reserved-keyword set
- `DialectName.SPARK`
- `Cel2SqlSparkTest` — 18 unit cases covering output, regex, identifier validation
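A sketch of the passthrough validation described in the SparkRegex bullet above. The method name and the nested-quantifier heuristic are assumptions; the PR states only that patterns pass through unchanged after compiling cleanly and clearing ReDoS checks:

```java
// Sketch only: method name and the specific ReDoS heuristic are assumptions.
static String validate(String pattern) throws ConversionException {
    // Spark's RLIKE evaluates with java.util.regex — the same engine available
    // here on the JVM — so a pattern that compiles can pass through verbatim.
    try {
        java.util.regex.Pattern.compile(pattern);
    } catch (java.util.regex.PatternSyntaxException e) {
        throw new ConversionException("invalid regex pattern: " + e.getMessage());
    }
    // A crude ReDoS guard: reject quantified groups that are themselves
    // quantified, e.g. (a+)+, which can backtrack catastrophically.
    if (pattern.matches(".*\\([^)]*[+*]\\)[+*].*")) {
        throw new ConversionException("regex risks catastrophic backtracking: " + pattern);
    }
    return pattern;
}
```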
Spark-specific output

- Strings: `concat(a, b)`, `LOCATE(needle, haystack) > 0`, `RLIKE`
- Arrays: `array(...)`, `array_contains(arr, x)`, `COALESCE(size(arr), 0)`
- JSON: `get_json_object`, `from_json(..., 'ARRAY<STRING>')` + `EXPLODE` for arrays
- Timestamps: `EXTRACT(... FROM ts)`, with `(dayofweek(ts) - 1)` for `getDayOfWeek` (Spark uses 1=Sunday; CEL uses 0=Sunday)
- Placeholders: positional `?`
Index advisor

`recommendIndex()` always returns `null`. Spark indexing is storage-layer specific (Delta Z-order vs Iceberg sort vs plain Parquet) and not portable as a single SQL recommendation. Implementing the interface (rather than omitting it) prevents `Cel2Sql.analyzeQuery()` from silently falling back to Postgres recommendations on a Spark dialect.
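A minimal sketch of that no-op advisor. The recommendation and pattern type names are assumed — only the fact that recommendIndex() returns null for every pattern is stated in the PR:

```java
// Sketch only: IndexRecommendation and QueryPattern are assumed type names.
@Override
public IndexRecommendation recommendIndex(QueryPattern pattern) {
    // Spark indexing lives in the storage layer (Delta Z-order, Iceberg sort
    // order; plain Parquet has none), so no single SQL recommendation is
    // portable. Returning null makes analyzeQuery() report no recommendations
    // instead of falling back to Postgres advice.
    return null;
}
```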
Out of scope (mirrors upstream)

- Spark JDBC type provider (Go-side `spark/provider.go`) — not portable; Java users construct `Schema` directly
- Spark Testcontainers integration tests — heavy; will land separately if needed
Test plan

- `./gradlew test` — all 388 unit tests pass (370 existing + 18 new Spark)
- `./gradlew build` clean

🤖 Generated with Claude Code