Add Apache Spark SQL dialect#10

Merged
richardwooding merged 2 commits into main from feat/spark-dialect on Apr 27, 2026
Conversation

@richardwooding
Contributor

Summary

Ports the Apache Spark dialect that landed upstream in cel2sql v3.7.0 (commit ef505ab), bringing cel2sql4j to feature parity on dialect coverage. Branched from main; depends on nothing in #9 (PR A).

New files

  • dialect/spark/SparkDialect.java — full Dialect + IndexAdvisor impl
  • dialect/spark/SparkRegex.java — RE2 → Java regex passthrough with ReDoS safety checks (Spark runs on the JVM, so patterns are unchanged after validation)
  • dialect/spark/SparkValidation.java — identifier rules + reserved-keyword set
  • DialectName.SPARK
  • Cel2SqlSparkTest — 18 unit cases covering output, regex, identifier validation

Spark-specific output

  • Strings: concat(a, b), LOCATE(needle, haystack) > 0, RLIKE
  • Arrays: array(...), array_contains(arr, x), COALESCE(size(arr), 0)
  • JSON: get_json_object, from_json(..., 'ARRAY<STRING>') + EXPLODE for arrays
  • Timestamps: EXTRACT(... FROM ts), with (dayofweek(ts) - 1) for getDayOfWeek (Spark uses 1=Sunday; CEL uses 0=Sunday)
  • Placeholders: positional ?
  • Multi-dimensional arrays: rejected — Spark has no portable equivalent
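The day-of-week adjustment can be sketched as a small emitter helper. This is a hypothetical illustration only — `SparkDialect`'s actual method signatures come from the Dialect interface and will differ:

```java
// Sketch of the getDayOfWeek emission: Spark's dayofweek() returns 1..7
// with 1 = Sunday, while CEL's getDayOfWeek() expects 0..6 with 0 = Sunday,
// so the dialect wraps the call in a "- 1" adjustment.
class SparkDayOfWeekSketch {
    // Emits the Spark SQL fragment for CEL's getDayOfWeek(ts).
    static String getDayOfWeekSql(String tsExpr) {
        return "(dayofweek(" + tsExpr + ") - 1)";
    }

    // The value-level mapping the emitted fragment performs.
    static int sparkToCel(int sparkDayOfWeek) {
        return sparkDayOfWeek - 1; // Spark 1=Sunday -> CEL 0=Sunday
    }
}
```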

Index advisor

recommendIndex() always returns null. Spark indexing is storage-layer specific (Delta Z-order vs Iceberg sort vs plain Parquet) and not portable as a single SQL recommendation. Implementing the interface (rather than omitting it) prevents Cel2Sql.analyzeQuery() from silently falling back to Postgres recommendations on a Spark dialect.
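A minimal sketch of that design choice — implement the advisor and decline explicitly, rather than omit it and inherit another dialect's advice. The class and method names here are illustrative stand-ins, not the real `IndexAdvisor` API:

```java
// Sketch: a no-op index advisor. Spark indexing is storage-layer specific
// (Delta Z-order, Iceberg sort orders, plain Parquet), so no single SQL
// recommendation is portable: always return null, never a recommendation.
class SparkIndexAdvisorSketch {
    // Callers see an explicit "no recommendation" instead of silently
    // falling back to Postgres-style advice.
    static String recommendIndex(String column) {
        return null;
    }
}
```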

Out of scope (mirrors upstream)

  • Spark JDBC type provider (Go-side spark/provider.go) — not portable; Java users construct Schema directly
  • Spark Testcontainers integration tests — heavy; will land separately if needed

Test plan

  • ./gradlew test — all 388 unit tests pass (370 existing + 18 new Spark)
  • ./gradlew clean build
  • CI green on this branch

🤖 Generated with Claude Code

Ports the Spark dialect added to upstream cel2sql in v3.7.0 (commit ef505ab),
bringing cel2sql4j to six dialects, matching upstream's coverage.

New files:
- dialect/spark/SparkDialect.java — full Dialect impl + IndexAdvisor (no-op
  recommendations; Spark indexing is storage-layer specific)
- dialect/spark/SparkRegex.java — RE2/Java-regex passthrough validator
  (Spark uses java.util.regex, same engine cel2sql4j compiles against, so
  patterns pass through unchanged after ReDoS safety checks)
- dialect/spark/SparkValidation.java — identifier validation + reserved
  keyword set (~110 Spark SQL keywords, lowercased)
- DialectName enum: add SPARK
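The passthrough-with-safety-check idea might look roughly like this. It is a sketch under stated assumptions: the real `SparkRegex` checks are likely more thorough, and the nested-quantifier heuristic below is illustrative, not the library's actual rule set:

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Sketch: because Spark evaluates RLIKE with java.util.regex — the same
// engine this library compiles against — a validated pattern can pass
// through unchanged. Before passing it through, reject patterns that are
// syntactically invalid or show an obvious ReDoS shape.
class SparkRegexSketch {
    // Crude nested-quantifier detector, e.g. (a+)+ — illustrative only.
    private static final Pattern NESTED_QUANTIFIER =
            Pattern.compile(".*\\([^)]*[+*][^)]*\\)[+*].*");

    static boolean isSafe(String pattern) {
        try {
            Pattern.compile(pattern); // syntax check
        } catch (PatternSyntaxException e) {
            return false;
        }
        return !NESTED_QUANTIFIER.matcher(pattern).matches();
    }

    // Returns the pattern unchanged after validation.
    static String passthrough(String pattern) {
        if (!isSafe(pattern)) {
            throw new IllegalArgumentException("unsafe or invalid pattern: " + pattern);
        }
        return pattern;
    }
}
```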

Spark-specific output:
- Strings: concat(a, b), LOCATE/RLIKE
- Arrays: array(...), array_contains, COALESCE(size(arr), 0)
- JSON: get_json_object / from_json + EXPLODE for arrays
- Timestamps: EXTRACT(... FROM ts), with (dayofweek(t) - 1) for getDayOfWeek
  to convert Spark's 1=Sunday to CEL's 0=Sunday
- Placeholders: positional ?
- Multi-dim ARRAY_LENGTH throws (Spark has no portable equivalent)

IndexAdvisor.recommendIndex returns null for all patterns — analyzeQuery
on Spark returns empty recommendations rather than falling back to Postgres
advice, matching upstream's `SupportsIndexAnalysis returns false` semantics.

Tests: Cel2SqlSparkTest covers basic comparisons, string functions, regex,
array membership, timestamp extraction, parameterized output, identifier
validation, and the regex security checks. 18 test cases.

Docs:
- README: Spark badge, 6-dialect feature claim, supported-dialects table row,
  dialect-import example
- CLAUDE.md: dialect-comparison table now has 6 columns; Spark notes paragraph
  covering the regex passthrough, day-of-week adjustment, and the no-op
  index advisor

Out of scope (mirrors upstream):
- Spark JDBC type provider (Go-side `spark/provider.go`) — not portable;
  Java users construct Schema directly
- Spark Testcontainers integration tests — heavy; will land separately
  if needed

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI left a comment


Pull request overview

This PR ports the Apache Spark SQL dialect into cel2sql4j, aligning the Java implementation with upstream cel2sql dialect coverage and enabling Spark-specific SQL generation (syntax, regex handling, identifier validation, and analysis behavior).

Changes:

  • Added a new Spark dialect implementation (SparkDialect) plus Spark-specific regex and identifier validation helpers.
  • Extended the dialect enum (DialectName) and updated documentation to list Spark as a supported dialect.
  • Added a Spark-focused test suite (Cel2SqlSparkTest) covering core translation behaviors and dialect validation.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
src/main/java/com/spandigital/cel2sql/dialect/spark/SparkDialect.java Implements Spark-specific SQL emission for strings/arrays/json/timestamps and provides a no-op IndexAdvisor behavior.
src/main/java/com/spandigital/cel2sql/dialect/spark/SparkRegex.java Adds Spark regex validation/passthrough with ReDoS-oriented guards.
src/main/java/com/spandigital/cel2sql/dialect/spark/SparkValidation.java Adds Spark identifier rules and reserved keyword validation.
src/main/java/com/spandigital/cel2sql/dialect/DialectName.java Adds SPARK dialect name.
src/test/java/com/spandigital/cel2sql/Cel2SqlSparkTest.java Adds unit tests for Spark SQL output, parameter placeholders, index analysis behavior, and validation.
README.md Documents Spark as a supported dialect and adds usage + capability table row.
CLAUDE.md Updates internal contributor docs to include Spark dialect notes and capability table.


Comment on lines +229 to +239
// Mirrors SQLite's `lhs = (SELECT value FROM json_each(...))` pattern.
w.append("(SELECT col FROM (SELECT EXPLODE(from_json(");
writeExpr.write();
w.append(", 'ARRAY<STRING>')) AS col) t)");
}

@Override
public void writeNestedJSONArrayMembership(StringBuilder w, SqlWriter writeExpr) throws ConversionException {
w.append("(SELECT col FROM (SELECT EXPLODE(from_json(");
writeExpr.write();
w.append(", 'ARRAY<STRING>')) AS col) t)");

Copilot AI Apr 27, 2026


writeJSONArrayMembership currently emits a scalar subquery that can return multiple rows (one per element from EXPLODE(...)). In Spark SQL, scalar subqueries must return at most one row, so this will fail at runtime for non-empty arrays with >1 element and it also doesn't implement boolean membership semantics. Consider generating a boolean membership predicate (e.g., EXISTS (...) / array_contains(...)) or adjusting the converter/dialect contract so the element expression can be incorporated correctly.

Suggested change
-// Mirrors SQLite's `lhs = (SELECT value FROM json_each(...))` pattern.
-w.append("(SELECT col FROM (SELECT EXPLODE(from_json(");
-writeExpr.write();
-w.append(", 'ARRAY<STRING>')) AS col) t)");
+throw new ConversionException(
+    "Spark JSON array membership requires a boolean predicate such as array_contains(...) or EXISTS (...), "
+    + "but the current dialect contract does not provide the candidate element needed to build one.");
 }

 @Override
 public void writeNestedJSONArrayMembership(StringBuilder w, SqlWriter writeExpr) throws ConversionException {
-w.append("(SELECT col FROM (SELECT EXPLODE(from_json(");
-writeExpr.write();
-w.append(", 'ARRAY<STRING>')) AS col) t)");
+throw new ConversionException(
+    "Spark nested JSON array membership requires a boolean predicate such as array_contains(...) or EXISTS (...), "
+    + "but the current dialect contract does not provide the candidate element needed to build one.");
Contributor Author


Took your throw-with-error suggestion in 740d544. The Dialect contract doesn't pass the candidate element through, so we can't emit the correct array_contains(from_json(...), elem) form here without an interface change. Failing fast at conversion time is preferable to the silent multi-row scalar subquery that Spark rejects at runtime. Native ARRAY<T> columns continue to work via writeArrayMembership (the standard array_contains path). Added a unit test asserting the throw.

Comment on lines +229 to +239
// Mirrors SQLite's `lhs = (SELECT value FROM json_each(...))` pattern.
w.append("(SELECT col FROM (SELECT EXPLODE(from_json(");
writeExpr.write();
w.append(", 'ARRAY<STRING>')) AS col) t)");
}

@Override
public void writeNestedJSONArrayMembership(StringBuilder w, SqlWriter writeExpr) throws ConversionException {
w.append("(SELECT col FROM (SELECT EXPLODE(from_json(");
writeExpr.write();
w.append(", 'ARRAY<STRING>')) AS col) t)");

Copilot AI Apr 27, 2026


writeNestedJSONArrayMembership has the same scalar-subquery issue as writeJSONArrayMembership: SELECT col FROM (SELECT EXPLODE(...)) can yield multiple rows and will error in Spark when used as a scalar expression, and it doesn't produce a boolean membership test. This should be rewritten to a boolean predicate (e.g., EXISTS with a filter) or the surrounding converter logic should be updated to use a Spark-friendly membership form.

Suggested change
-// Mirrors SQLite's `lhs = (SELECT value FROM json_each(...))` pattern.
-w.append("(SELECT col FROM (SELECT EXPLODE(from_json(");
-writeExpr.write();
-w.append(", 'ARRAY<STRING>')) AS col) t)");
+w.append("array_contains(from_json(");
+writeExpr.write();
+w.append(", 'ARRAY<STRING>'), ").append(jsonFunc).append(')');
 }

 @Override
 public void writeNestedJSONArrayMembership(StringBuilder w, SqlWriter writeExpr) throws ConversionException {
-w.append("(SELECT col FROM (SELECT EXPLODE(from_json(");
-writeExpr.write();
-w.append(", 'ARRAY<STRING>')) AS col) t)");
+w.append("COALESCE(size(from_json(");
+writeExpr.write();
+w.append(", 'ARRAY<STRING>')), 0) > 0");
Contributor Author


Same fix in 740d544. The array_contains(from_json(...), jsonFunc) rewrite you suggested would emit incorrect SQL because jsonFunc is the JSON helper function name (e.g. jsonb_array_elements_text), not the candidate element — that information isn't plumbed through this Dialect method. Throwing is the honest answer; users with native ARRAY<T> columns get the working array_contains path via writeArrayMembership.

Comment on lines +20 to +22
* test cases: RLIKE for regex, array_contains for membership, size+COALESCE for
* length, get_json_object for JSON access, and the dayofweek - 1 adjustment
* for {@code getDayOfWeek}.

Copilot AI Apr 27, 2026


The class Javadoc mentions testing JSON access via get_json_object, but the test cases in this file don't currently assert any JSON translations. Either add a representative JSON access case (so Spark's JSON behavior is covered) or update the comment to avoid implying coverage that isn't present.

Suggested change
-* test cases: RLIKE for regex, array_contains for membership, size+COALESCE for
-* length, get_json_object for JSON access, and the dayofweek - 1 adjustment
-* for {@code getDayOfWeek}.
+* test cases covered here: RLIKE for regex, array_contains for membership,
+* size+COALESCE for length, and the dayofweek - 1 adjustment for
+* {@code getDayOfWeek}.

Contributor Author


Trimmed the Javadoc claim in 740d544 (you're right — the JSON branch wasn't covered, and the existing tests assert RLIKE / array_contains / size+COALESCE / dayofweek). The new JSON test verifies the throw on JSON array membership.

import org.junit.jupiter.params.provider.Arguments;
import org.junit.jupiter.params.provider.MethodSource;

import java.util.List;

Copilot AI Apr 27, 2026


java.util.List is imported but not used in this test class. Removing the unused import will keep the file clean and avoid compiler warnings.

Suggested change
-import java.util.List;

Contributor Author


Removed the unused java.util.List import in 740d544.

- Spark writeJSONArrayMembership / writeNestedJSONArrayMembership now throw a
  clear ConversionException instead of emitting a multi-row scalar subquery
  that Spark rejects at runtime. The Dialect contract doesn't pass the
  candidate element through, so we can't rewrite to the correct
  array_contains(from_json(...), elem) form here. Failing fast at conversion
  time is honest; runtime SQL failure was the previous (also broken) outcome.
  Native ARRAY<T> columns continue to work via writeArrayMembership.
- Cel2SqlSparkTest: drop unused java.util.List import; trim the class
  Javadoc to match what the tests actually assert (the JSON-access claim
  was overstated). New test verifies the JSON-array-membership throw.
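The shape of the fix can be sketched as follows. These are hypothetical helper methods, not the real Dialect-interface signatures; they only illustrate the fail-fast-vs-working-path split:

```java
// Sketch: native ARRAY<T> membership keeps the working array_contains
// emission, while JSON array membership fails fast at conversion time,
// because the dialect contract doesn't supply the candidate element
// needed to build array_contains(from_json(...), elem).
class SparkMembershipSketch {
    // Working path for native ARRAY<T> columns.
    static String writeArrayMembership(String arrayExpr, String elemExpr) {
        return "array_contains(" + arrayExpr + ", " + elemExpr + ")";
    }

    // JSON path: throwing here beats emitting a multi-row scalar
    // subquery that Spark would only reject at runtime.
    static String writeJsonArrayMembership(String jsonExpr) {
        throw new UnsupportedOperationException(
                "Spark JSON array membership needs the candidate element "
                + "to build array_contains(from_json(...), elem)");
    }
}
```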

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@richardwooding richardwooding merged commit 0bc7802 into main Apr 27, 2026
4 checks passed
richardwooding added a commit that referenced this pull request Apr 27, 2026
…#11)

Three new project-local skills under .claude/skills/, each authored against
the existing skill-authoring conventions and passing the lint script:

- add-cel-function: walks adding a new CEL function to all 6 dialects
  (constant + dispatch + Dialect interface method + 6 impls + tests).
  Includes references/sql-writer-pattern.md and references/conversion-exception.md.
- add-sql-dialect: walks adding a new SQL dialect (e.g. CockroachDB,
  Snowflake). Includes a scaffold_dialect.sh script that copies an
  existing dialect package and renames classes via sed, plus references/
  dialect-method-checklist.md and references/test-files.md.
- port-from-upstream: walks porting from the Go upstream cel2sql repo.
  Includes a list_upstream_changes.sh script that auto-detects the last
  port commit on cel2sql4j and lists candidate upstream commits, plus
  references/go-to-java-idioms.md and references/two-pr-shape.md.

Also includes the previously-uncommitted skill-authoring skill that was
staged on main (the foundation these new skills are authored against).

While running the build-verification step on this PR, discovered that
SparkDialect (PR #10) was merged on top of PR #9 without rebasing and
silently lost the writeFormat method that PR #9 added to the Dialect
interface — main currently does not compile. Fixed in this PR by
adding SparkDialect.writeFormat using Spark's format_string() function
(its printf-equivalent), and added a Spark case to Cel2SqlOptionsTest's
formatTests.

All 392 unit tests pass; ./gradlew clean build.
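The format_string-based emission described above might be sketched like this. The helper below is hypothetical — the real writeFormat works against the Dialect interface's writer types — but the shape of the emitted SQL follows the commit's description:

```java
import java.util.List;

// Sketch: CEL's format call maps onto Spark's format_string(), Spark's
// printf equivalent, by emitting the template as a quoted literal
// followed by the already-rendered argument SQL fragments.
class SparkFormatSketch {
    static String writeFormat(String template, List<String> argSql) {
        StringBuilder sb = new StringBuilder("format_string('");
        sb.append(template.replace("'", "''")); // escape single quotes for SQL
        sb.append("'");
        for (String arg : argSql) {
            sb.append(", ").append(arg);
        }
        return sb.append(")").toString();
    }
}
```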

Co-authored-by: Richard Wooding <richardwooding@Richards-Virtual-Machine.local>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>