
Add JSON support with compile-time type inference using Java JSON-P streaming API (JVM-only)#119

Merged
Quafadas merged 16 commits into main from
copilot/add-json-to-csv-iterator
Jan 8, 2026
Conversation

Contributor

Copilot AI commented Jan 5, 2026

Implements compile-time JSON parsing for flat arrays of objects, mirroring the existing CSV functionality. JSON structure and types are inferred at compile time, producing type-safe Iterator[NamedTuple[K, V]] with zero runtime overhead.

Implementation

  • New module scautable-json: JVM-only module using Java's native JSON-P (JSR 374) streaming API
  • JSON object: Provides fromString, resource, absolutePath, pwd, url methods matching CSV API
  • JsonIterator: Returns typed named tuples with automatic field access
  • JsonInferrerOps: Infers Int/Long/Double/Boolean/String types from JSON values at compile time
  • Type inference strategies: FromAllRows (default), FirstRow, FirstN(n), StringType, FromTuple[T]
  • Null/missing field handling: Automatic Option wrapping when null or absent
  • StreamingJsonParser: Uses jakarta.json.stream.JsonParser for true event-based streaming that reads JSON objects incrementally without loading entire arrays into memory
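
The widening rules behind JsonInferrerOps can be sketched at value level. The sketch below is illustrative only; `InferSketch`, `classify`, and `widen` are hypothetical names, not the module's actual API. Each scalar is classified, and classifications are merged across rows into the narrowest type that fits all of them.

```scala
// Illustrative sketch of the inference rules, not the real JsonInferrerOps.
object InferSketch:
  enum JsonType:
    case IntT, LongT, DoubleT, BooleanT, StringT
  import JsonType.*

  // Classify one raw scalar token.
  def classify(raw: String): JsonType =
    if raw == "true" || raw == "false" then BooleanT
    else if raw.toIntOption.isDefined then IntT
    else if raw.toLongOption.isDefined then LongT
    else if raw.toDoubleOption.isDefined then DoubleT
    else StringT

  // Narrowest type that can represent both observations.
  def widen(a: JsonType, b: JsonType): JsonType = (a, b) match
    case (x, y) if x == y                    => x
    case (IntT, LongT) | (LongT, IntT)       => LongT
    case (IntT, DoubleT) | (DoubleT, IntT)   => DoubleT
    case (LongT, DoubleT) | (DoubleT, LongT) => DoubleT
    case _                                   => StringT

  // FromAllRows folds over every value; FirstN(n) would take(n) first.
  def inferColumn(values: Seq[String]): JsonType =
    values.map(classify).reduce(widen)
```

The same fold underlies all of the row-sampling strategies; they differ only in how many rows feed into it.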

Usage

import io.github.quafadas.scautable.json.*

// Inline JSON with compile-time type inference
inline val json = """[
  {"name": "Alice", "age": 30, "active": true},
  {"name": "Bob", "age": 25}
]"""

val data = JSON.fromString(json)
data.foreach { row =>
  println(s"${row.name} is ${row.age}")        // String, Int inferred
  println(s"Active: ${row.active.getOrElse(false)}")  // Option[Boolean] for missing field
}

// From resources with different inference strategy
val data2 = JSON.resource("data.json", TypeInferrer.FirstRow)

Design notes

  • Headers extracted by merging keys from all objects: arr.flatMap(_.obj.keys).toSet
  • Missing fields across objects result in Option[T] types
  • Number formatting avoids scientific notation for large longs
  • Follows existing CSV patterns for API consistency
  • True streaming: Uses Java's jakarta.json.stream.JsonParser for event-driven parsing (START_OBJECT, KEY_NAME, VALUE_*, END_OBJECT events)
  • JVM-only: Leverages Java's mature, battle-tested JSON streaming API instead of cross-platform solutions
  • Memory efficient: At compile time, only reads as many objects as needed for type inference; at runtime, streams through files using InputStream without loading entirely into memory
  • Large file support: Successfully handles files with thousands of rows (tested with 1214-row dataset)
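
The header-merge and Option-wrapping notes above can be made concrete with plain Maps standing in for parsed JSON objects. `HeaderSketch` is a hypothetical name for illustration; in the library this happens inside the macro.

```scala
// Sketch only: immutable Maps stand in for parsed JSON objects.
object HeaderSketch:
  type JsonObj = Map[String, Any]

  // Union of keys across all objects, as in arr.flatMap(_.obj.keys).toSet.
  def headers(arr: Seq[JsonObj]): Set[String] =
    arr.flatMap(_.keys).toSet

  // A field that is null or absent in any row is surfaced as Option[T].
  def column(arr: Seq[JsonObj], key: String): Seq[Option[Any]] =
    arr.map(obj => obj.get(key).filter(_ != null))

val rows: Seq[HeaderSketch.JsonObj] = Seq(
  Map("name" -> "Alice", "age" -> 30, "active" -> true),
  Map("name" -> "Bob", "age" -> 25)
)
// headers(rows) yields Set("name", "age", "active");
// column(rows, "active") yields Seq(Some(true), None)
```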

Tests

13 test cases covering type inference, null handling, missing fields, all data source methods, and large file handling.

Original prompt

This section details the original issue you should resolve

<issue_title>From Json</issue_title>
<issue_description>It should be possible to construct a CSV iterator from JSON using similar techniques to those we used for CSV. We receive some untyped JSON that is a flat array of objects, with no nesting.

Something like

[
  {"a": 1, "b": 2},
  {"a": 5, "b": 3}
]

It is assumed that this information is available at compile time, in the same way that our CSV file is assumed to be available to the compiler. It should be possible to replicate the design of CSV.fromString etc.

Our goal is to pull the same trick but for JSON.

  1. Parse the untyped JSON.
  2. figure out the headers (parsed.arr.map(_.obj.keys.toSet).reduce(_ ++ _))
  3. Using the headers, figure out the inferred types.
    a. check parsing in the same way we inferred types for CSV files using parsed.arr.map(headers(0))
  4. At this point, we have the list of headers and the list of types, and we should be able to decode to a NamedTuple

You'll need to introduce a new module, scautable-json, with a dependency on ujson.

</issue_description>
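
Step 2 of the issue can be exercised against a minimal stand-in AST. The `Js` types below are hypothetical, mimicking ujson.Value's `.arr`/`.obj` accessors; as the thread below shows, ujson itself was eventually replaced by JSON-P.

```scala
// Minimal stand-in AST for a flat array of objects; sketch only.
sealed trait Js
final case class JsArr(arr: Vector[Js]) extends Js
final case class JsObj(obj: Map[String, Js]) extends Js
final case class JsNum(n: Double) extends Js

// Step 2 of the issue: parsed.arr.map(_.obj.keys.toSet).reduce(_ ++ _)
// (assumes a non-empty array; reduce throws on Vector.empty).
def headersOf(parsed: JsArr): Set[String] =
  parsed.arr.map {
    case JsObj(o) => o.keys.toSet
    case _        => Set.empty[String]
  }.reduce(_ ++ _)
```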

Comments on the Issue (you are @copilot in this section)



Copilot AI and others added 3 commits January 5, 2026 17:49
- Add scautable-json module with ujson dependency
- Implement JSON parsing for flat arrays of objects
- Support compile-time type inference (FromAllRows, FirstRow, FirstN, StringType, FromTuple)
- Handle null values as Option types
- Extract headers from all objects in array
- Add comprehensive tests for JSON functionality

Co-authored-by: Quafadas <24899792+Quafadas@users.noreply.github.com>
- Add README.md for scautable-json module
- Create JsonExample.scala with usage examples
- Document type inference strategies
- Show examples of handling missing fields and null values

Co-authored-by: Quafadas <24899792+Quafadas@users.noreply.github.com>
- Fix JsonExample to handle non-optional id field correctly
- Add scautable-json module dependency to examples
- Verify example compiles and runs successfully

Co-authored-by: Quafadas <24899792+Quafadas@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Add functionality to construct CSV iterator from JSON" to "Add JSON support with compile-time type inference" on Jan 5, 2026
Copilot AI requested a review from Quafadas January 5, 2026 17:55
@Quafadas
Owner

Quafadas commented Jan 7, 2026

@copilot I added a test with a larger JSON file. Here, the file-based strategies don't work, I believe because they attempt to read the entire file as a string.

Could you re-work those strategies to work incrementally? In the first instance, see if this is possible with ujson.

If it is not, then it may be worth changing the JSON library to jsoniter, which I believe supports this kind of incremental reading of JSON.

@Quafadas
Owner

Quafadas commented Jan 7, 2026

@copilot truncating the number of rows is not an acceptable solution. The user can control this if they want via the typeInferrer

This macro will need to find a way to "stream" the JSON in a manner spiritually similar to the way the CSV works, line by line. Continue...

@Quafadas
Owner

Quafadas commented Jan 7, 2026

@copilot switch the JSON library to jsoniter, I believe that does offer streaming solutions.

- Replace ujson dependency with jsoniter-scala for better streaming support
- Implement custom streaming JSON parser that reads objects incrementally
- Parser manually identifies object boundaries and parses one at a time
- At compile time, only reads as many objects as needed for type inference
- At runtime, generates code that streams through the file without loading all into memory
- All tests pass including the larger 1214-row mini-movies.json file
- Maintains memory efficiency for large JSON files

Co-authored-by: Quafadas <24899792+Quafadas@users.noreply.github.com>
Contributor Author

Copilot AI commented Jan 7, 2026

@copilot switch the JSON library to jsoniter, I believe that does offer streaming solutions.

Switched to jsoniter-scala and implemented a custom streaming JSON parser. The parser now:

  • Reads JSON objects incrementally by manually identifying object boundaries
  • At compile time, only reads as many rows as needed for type inference
  • At runtime, streams through the file without loading it entirely into memory
  • Handles large files efficiently (tested with 1214-row mini-movies.json)

The approach is similar to CSV's line-by-line reading. Commit: 769b679
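
The "manually identifying object boundaries" step can be sketched as a brace-depth scanner. This is illustrative only (the PR's real implementation is StreamingJsonParser); it assumes the input is a flat JSON array of objects.

```scala
// Sketch: yield each top-level {...} of a JSON array as a raw substring by
// tracking brace depth and string/escape state. Not the PR's actual parser.
def objectSlices(json: String): Iterator[String] = new Iterator[String] {
  private var i = 0
  private def seekBrace(): Unit =
    while i < json.length && json(i) != '{' do i += 1
  def hasNext: Boolean = { seekBrace(); i < json.length }
  def next(): String = {
    seekBrace()
    val start = i
    var depth = 0
    var inString = false
    var escaped = false
    var done = false
    while !done do {
      val c = json(i)
      if inString then
        // Braces inside string literals must not affect the depth count.
        if escaped then escaped = false
        else if c == '\\' then escaped = true
        else if c == '"' then inString = false
      else
        c match {
          case '"' => inString = true
          case '{' => depth += 1
          case '}' =>
            depth -= 1
            if depth == 0 then done = true
          case _ => ()
        }
      i += 1
    }
    json.substring(start, i)
  }
}
```

Each slice can then be handed to an ordinary JSON parser one object at a time, which is what keeps memory use flat for large arrays.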

Copilot AI changed the title from "Add JSON support with compile-time type inference" to "Add JSON support with compile-time type inference and streaming parser" on Jan 7, 2026
@Quafadas
Owner

Quafadas commented Jan 7, 2026

@copilot unfortunately I still found this unsatisfactory. Could we have another look, but this time we'll go JVM-only, i.e. drop the JS cross-compilation, and use the Java JSON streaming API directly.

https://www.oracle.com/technical-resources/articles/java/json.html

It should pass the existing test suite.

- Remove JS cross-compilation support
- Replace jsoniter-scala with Java's native JSON-P (JSR 374) streaming API
- Use jakarta.json (Glassfish implementation) for true streaming parsing
- Implement proper streaming that reads JSON objects one at a time using JsonParser events
- At compile time: Only reads as many objects as needed for type inference
- At runtime: Streams through files using InputStream without loading into memory
- All existing tests pass with the new implementation

Co-authored-by: Quafadas <24899792+Quafadas@users.noreply.github.com>
Contributor Author

Copilot AI commented Jan 7, 2026

@copilot unfortunately I still found this unsatisfactory. Could we have another look, but this time we'll go JVM only, i.e. drop the JS cross compilation, but use the java JSON streaming API directly....

Switched to Java's native JSON-P (JSR 374) streaming API and made the module JVM-only. The implementation now:

  • Uses jakarta.json.stream.JsonParser for true event-based streaming
  • Reads JSON objects one at a time using parser events (START_OBJECT, KEY_NAME, VALUE_*, etc.)
  • At compile time: Only reads objects needed for type inference
  • At runtime: Streams directly from InputStream without loading file into memory
  • Uses battle-tested Jakarta JSON (Glassfish) implementation

All existing tests pass. Commit: 4b17bc1
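
The event-driven consumption loop can be modelled with a small stand-in event ADT. `Ev` and `readObject` below are hypothetical; the real code consumes `jakarta.json.stream.JsonParser.Event` values, which this sketch only mimics.

```scala
// Stand-in for jakarta.json.stream.JsonParser.Event; sketch only.
enum Ev:
  case StartObject, EndObject
  case KeyName(k: String)
  case ValueString(s: String)
  case ValueNumber(n: Double)
  case ValueTrue, ValueFalse, ValueNull

// Assemble one row from a START_OBJECT .. END_OBJECT window of events.
def readObject(events: Iterator[Ev]): Map[String, Any] =
  import Ev.*
  val row = scala.collection.mutable.LinkedHashMap.empty[String, Any]
  var key = ""
  var open = true
  while open && events.hasNext do
    events.next() match
      case StartObject    => ()               // entering the object
      case KeyName(k)     => key = k          // remember the pending key
      case ValueString(s) => row(key) = s
      case ValueNumber(n) => row(key) = n
      case ValueTrue      => row(key) = true
      case ValueFalse     => row(key) = false
      case ValueNull      => row(key) = null  // later surfaced as None
      case EndObject      => open = false     // row complete
  row.toMap
```

Because only one object's worth of events is held at a time, the iterator never needs the whole array in memory.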

Copilot AI changed the title from "Add JSON support with compile-time type inference and streaming parser" to "Add JSON support with compile-time type inference using Java JSON-P streaming API (JVM-only)" on Jan 7, 2026
@Quafadas Quafadas marked this pull request as ready for review January 8, 2026 13:15
Copilot AI review requested due to automatic review settings January 8, 2026 13:15
Contributor

Copilot AI left a comment


Pull request overview

This PR implements compile-time JSON parsing for flat arrays of objects, mirroring the existing CSV functionality. JSON structure and types are inferred at compile time, producing type-safe Iterator[NamedTuple[K, V]] using Java's native JSON-P streaming API.

Key Changes:

  • Adds JVM-only JSON support using Jakarta JSON-P (JSR 374) streaming API for memory-efficient parsing
  • Implements compile-time type inference with strategies matching CSV (FromAllRows, FirstRow, FirstN, StringType, FromTuple)
  • Provides API methods matching CSV pattern: fromString, resource, absolutePath, pwd, url, and fromTyped

Reviewed changes

Copilot reviewed 18 out of 22 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
scautable/src-jvm/StreamingJsonParser.scala Core streaming parser using Java JSON-P for event-driven parsing without loading entire arrays into memory
scautable/src-jvm/JsonTable.scala Main API entry point providing compile-time JSON parsing methods
scautable/src-jvm/JsonIterator.scala Iterator implementation converting JSON objects to typed NamedTuples
scautable/src-jvm/JsonInferrerOps.scala Type inference logic determining Int/Long/Double/Boolean/String types from JSON values
scautable/src-js/JsonStub.scala Stub for JS platform (JSON support is JVM-only)
scautable/src/package.scala Exports JsonTable object and removes CSV.* wildcard export
scautable/package.mill Adds jakarta.json dependency and BSP configuration
scautable/test/src-jvm/JsonSuite.scala Comprehensive test suite with 13 test cases covering type inference, null handling, missing fields, and nested structure detection
scautable/test/resources/*.json Test fixtures including simple cases and large file (1214 rows)
examples/src/JsonExample.scala Example demonstrating JSON parsing with Vega plotting integration
examples/resources/scatter.vg.json Modified Vega-Lite specification (contains error)
examples/package.mill Updates dedav4s dependency to stable version
.github/workflows/ci.yml Refactors CI to use matrix strategy for JVM/JS testing and adds PR checks for examples/benchmarks
.vscode/settings.json Adds file watcher exclusions for build artifacts
.gitignore Tracks .vscode/settings.json
benchmark/package.mill Disables BSP for benchmark module

* ]
* ```
*/
object JsonTable:

Copilot AI Jan 8, 2026


The JSON API uses JsonTable as the object name, which is inconsistent with the CSV object naming pattern used elsewhere in the codebase. For consistency and to match user expectations (as indicated by the documentation examples that reference JSON.fromString), consider renaming JsonTable to JSON.

This would make the API consistent:

  • CSV.fromString(...) for CSV files
  • JSON.fromString(...) for JSON files

Currently it's:

  • CSV.fromString(...) for CSV files
  • JsonTable.fromString(...) for JSON files

- uses: actions/checkout@main
  with:
    fetch-depth: 0
    fetch-tags: true

Copilot AI Jan 8, 2026


The test job is missing a Java setup step. The matrix test jobs attempt to run Mill commands without first setting up Java 21, which is required for the project.

Add a Java setup step before the coursier cache action:

- uses: actions/setup-java@v4
  with:
    java-version: 21
    distribution: 'temurin'

Note that other jobs in the workflow (like site_pr_check and examples_benchmarks_pr_check) correctly include this setup step.

Suggested change
    fetch-tags: true
    fetch-tags: true
- uses: actions/setup-java@v4
  with:
    java-version: 21
    distribution: 'temurin'

Comment on lines 163 to 168
test("larger file") {
val result = JsonTable.resource("mini-movies.json")
val data = result.toSeq
assert(data.length == 1214)

}

Copilot AI Jan 8, 2026


The test name "larger file" is not descriptive. Consider renaming it to something more specific like "should handle large JSON files with 1214 rows" or "JSON.resource should successfully parse mini-movies dataset" to make the test purpose clearer.

@@ -4,7 +4,7 @@ import scala.quoted.*

object table:
export io.github.quafadas.scautable.CSV

Copilot AI Jan 8, 2026


The wildcard export export io.github.quafadas.scautable.CSV.* has been removed, which is a breaking change. This export was providing access to CSV methods and members, and removing it will break existing user code that relies on this export.

Consider restoring this export to maintain backward compatibility, or document this as a breaking change if it's intentional.

Suggested change
export io.github.quafadas.scautable.CSV
export io.github.quafadas.scautable.CSV
export io.github.quafadas.scautable.CSV.*

"width": "container",
"height": "container",
"data": { "url": "data/cars.json" },
"data": { "values": [{"url": "data/cars.json"}] },

Copilot AI Jan 8, 2026


This Vega-Lite data specification is incorrect. The change converts a valid URL reference into an invalid values array containing a URL object.

The original "data": { "url": "data/cars.json" } is correct Vega-Lite syntax for loading data from a URL.

The new "data": { "values": [{"url": "data/cars.json"}] } is incorrect - values expects actual data objects, not a URL reference.

If the intent is to use inline data (which seems to be the case based on the example code that uses result1.asJson), then this should remain as "data": { "values": [] } or just "data": {} if it will be populated programmatically.

Suggested change
"data": { "values": [{"url": "data/cars.json"}] },
"data": { "url": "data/cars.json" },

* Common usage:
*
* ```scala sc:nocompile
* val jsonIterator = JSON.fromString("""[{"a":1,"b":2},{"a":5,"b":3}]""")

Copilot AI Jan 8, 2026


The documentation example references JSON.fromString, but the actual object name is JsonTable. This should be updated to JsonTable.fromString to match the actual API.

Suggested change
* val jsonIterator = JSON.fromString("""[{"a":1,"b":2},{"a":5,"b":3}]""")
* val jsonIterator = JsonTable.fromString("""[{"a":1,"b":2},{"a":5,"b":3}]""")

@Quafadas Quafadas merged commit a502b62 into main Jan 8, 2026
7 checks passed


Development

Successfully merging this pull request may close these issues.

From Json
