Skip to content

Data declaration extensions#216

Merged
comnik merged 10 commits intomainfrom
ncg-load-cdc
Feb 26, 2026
Merged

Data declaration extensions#216
comnik merged 10 commits intomainfrom
ncg-load-cdc

Conversation

@comnik
Copy link
Collaborator

@comnik comnik commented Feb 25, 2026

RelEDB → EDB, dropping the "Rel" prefix which we've been using to indicate constructs that will go away once we drop Rel support. This is not the case for EDB references.

CSVColumn → GNFColumn, generalizing to make column declarations reusable between CSV and Iceberg data. GNFColumn can represent nested paths like /:append/:METADATA$KEY.

Generalize Snapshot to support multiple mappings. The snapshot implementation on the engine is more awkward than I had hoped for, and this allows us to at least do it more efficiently.

@comnik comnik self-assigned this Feb 25, 2026
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude says: the grammar's gnf_column_path rule indexes into a Sequence[String] with $$[0], which requires the sequence/list indexing support added in that file.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, GetElement was just for tuple types. We never had a need for indexing into lists, until now.

@comnik comnik marked this pull request as ready for review February 25, 2026 21:41
| "[" STRING* "]"
construct: $$ = $2
deconstruct if builtin.length($$) != 1:
$2: Sequence[String] = $$
Copy link
Contributor

@minsungc minsungc Feb 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there semantics for when gnf_column_path is a sequence of length 1? Is it something we can explicitly disallow?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the common case is a path of length 1 (i.e. just the column name). We do support it, we just don't require wrapping it in [].

Copy link
Contributor

@minsungc minsungc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quick clarifying question otherwise LGTM!

Copy link
Contributor

@davidwzhao davidwzhao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, thanks!

CSVLocator locator = 1;
CSVConfig config = 2;
repeated CSVColumn columns = 3;
repeated GNFColumn columns = 3;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if at some point we can unify or better distinguish between columns used for data ingestion vs data export? We currently also have ExportCSVColumn which is actually pretty similar to GNFColumn, just without the types field.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My work on NOT NULL constraints (inclusion dependencies) may have adjacencies to this
consideration. Please keep me in the loop.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huh, yeah good point. Maybe even lower than that, it seems that we have the need for a mapping from RelationId to a String[] path (possibly of length 1) in several places, most recently the SnapshotMapping.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉 cool preview of the future ahead

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dress for the job you want... ;)

Copy link
Contributor

@nystrom nystrom left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks!

@nystrom
Copy link
Contributor

nystrom commented Feb 26, 2026

Looks like you need to rebuild the printers

Copy link
Contributor

@staworko staworko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing on my side just a small question to check if I'm missing something

rel_edb
: "(" "rel_edb" relation_id rel_edb_path rel_edb_types ")"
construct: $$ = logic.RelEDB(target_id=$3, path=$4, types=$5)
edb
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As much as I love Rel, I like the removal of references to rel_ from the raicode.

snapshot
: "(" "snapshot" rel_edb_path relation_id ")"
construct: $$ = transactions.Snapshot(destination_path=$3, source_relation=$4)
snapshot_mapping
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see what has changed here except for the ability to do a snapshot of multiple
relations at once. Is that the whole change? Reading on.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's the whole change, yeah, multiple snapshots in one action 👍 Its a little awkward to have to do this rather than just having multiple snapshot actions in the same epoch, but it simplifies the implementation on the engine side.

CSVLocator locator = 1;
CSVConfig config = 2;
repeated CSVColumn columns = 3;
repeated GNFColumn columns = 3;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My work on NOT NULL constraints (inclusion dependencies) may have adjacencies to this
consideration. Please keep me in the loop.

(snapshot ["my_edb"] :my_rel)
(snapshot ["database" "table"] :computed)
(snapshot ["schema" "namespace" "relation"] :big_signed))
(snapshot
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I still don't see any big change other than multiple relation at one snapshot rather
than one per snapshot. Am I missing something?

Naturally, there are conceivable reason why this is smarter e.g., triggering snapshot
creation comes with an overhead. but this doesn't need any additional explanation.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my comment above, it has to do with some awkwardness in how to implement snapshot on the engine, because it is kind of both, a read and a write, which doesn't fit well into Arroyo's epoch model. There would be ways to implement it in a nicer way, where could just support many single snapshot actions in one epoch, but not in the short term.

Copy link
Contributor

@staworko staworko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@comnik comnik merged commit 5e27319 into main Feb 26, 2026
6 checks passed
@comnik comnik deleted the ncg-load-cdc branch February 26, 2026 11:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants