
Exclude all rows or limit n rows per stream #13

Open
aaronsteers opened this issue Apr 8, 2022 · 2 comments
Labels
enhancement New feature or request good first issue Good for newcomers

Comments

aaronsteers (Contributor) commented Apr 8, 2022

Sometimes it's useful to run an EL pipeline with zero rows in order to create the target tables but not yet load data into them.

In the SDK for taps, we recently added --test=schema to emit only the SCHEMA messages without emitting any RECORD messages.

For non-SDK taps, it would be helpful to have a similar option that excludes all rows or all rows past the first n records.

Note:

  • For safety, we should probably also not pass along any STATE messages when running in this mode. Since we'd be dropping records intentionally, we would not want to pass along a bookmark implying that records had been written when they were never actually emitted.
  • To the extent that the tap is still having to process records, there's still a performance hit from having to read all the records from source. However, most pipelines are constrained on target performance, so by skipping the write process, there should still be significant gains for many/most use cases.
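To make the idea concrete, here is a minimal sketch of such a filter as a standalone pass-through (hypothetical; the function name and usage are illustrative, not part of the SDK or Meltano). It forwards SCHEMA messages, caps RECORD messages at n per stream, and drops STATE messages entirely, per the safety note above:

```python
import json
import sys


def limit_singer_stream(lines, max_records=0):
    """Yield Singer messages from an iterable of JSON lines:
    keep SCHEMA, cap RECORDs at max_records per stream, and drop
    STATE so no bookmark implies records that were never written."""
    counts = {}
    for line in lines:
        try:
            msg = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON noise such as log lines
        mtype = msg.get("type")
        if mtype == "SCHEMA":
            yield line
        elif mtype == "RECORD":
            stream = msg.get("stream")
            counts[stream] = counts.get(stream, 0) + 1
            if counts[stream] <= max_records:
                yield line
        # STATE (and anything else) is intentionally dropped


if __name__ == "__main__":
    # Hypothetical usage between tap and target: tap | this script | target
    for out in limit_singer_stream(sys.stdin, max_records=0):
        sys.stdout.write(out if out.endswith("\n") else out + "\n")
```

With `max_records=0` this creates the target tables from SCHEMA messages alone; a positive value lets through a small sample per stream.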
@aaronsteers aaronsteers added the good first issue Good for newcomers label Apr 8, 2022
@edgarrmondragon edgarrmondragon added the enhancement New feature or request label Apr 14, 2022
teej commented Apr 29, 2022

There are a few use cases where this has come up:

  1. When trying out Meltano for the first time for a new source, I want to ensure it works for all the data types I use in that source, and that it maps those data types to something reasonable in my destination.
  2. For my workloads – Postgres tables with 10B+ rows representing 5+ TBs of data – Meltano does not replicate in a reasonable amount of time. As a workaround, I want Meltano to create the table, I manually backfill in a performant way, and then I let Meltano take over for ongoing replication.

dcowden commented Dec 14, 2023

We are running into this issue as well. For tables with INCREMENTAL replication, workarounds are available: some taps support a start date, which can be used to artificially fetch only a few recent rows, and it is also possible to use state suffixes and state manipulation to set the bookmark to a recent value for testing. None of these are great: they are fiddly, and they do not achieve the goal of getting N records, since each table needs a different state value to guarantee that any rows come back.

The big issue, however, is with FULL_TABLE replication, where state is ignored and we have found no workarounds.
We tried using filters in meltano-map-transformer, but this doesn't prevent selecting all of the rows from the source; it only prevents them from reaching the target.

One solution is to handle this in the tap: if the Singer spec had contemplated testing, a --test mode that always retrieves the top N rows would have been great. But that ship has sailed.

Meltano could handle this by providing a configuration option to quit after N rows. I imagine that would go somewhere around here, as an additional future that waits for a given number of rows from the tap:

https://github.com/meltano/meltano/blob/main/src/meltano/core/runner/singer.py#L119

That'd be a change to meltano core, not meltano-map-transform, so I realize this post is somewhat in the wrong place. But since others had run into this issue here, I figured I'd post here first.
