
Exclude all rows or limit n rows per stream #13

Open
aaronsteers opened this issue Apr 8, 2022 · 2 comments
Labels
enhancement New feature or request good first issue Good for newcomers

Comments

aaronsteers (Contributor) commented Apr 8, 2022

Sometimes it's useful to run an EL pipeline with zero rows in order to create the target tables but not yet load data into them.

In the SDK for taps, we recently added --test=schema to emit only the SCHEMA messages without emitting any RECORD messages.

For non-SDK taps, it would be helpful to have a similar option that excludes all rows or all rows past the first n records.

Note:

  • For safety, we should probably also not pass along any STATE messages when running in this mode. Since we'd be dropping records intentionally, we would not want to pass along a bookmark implying that records had been written when they were never actually emitted.
  • To the extent that the tap is still having to process records, there's still a performance hit from having to read all the records from source. However, most pipelines are constrained on target performance, so by skipping the write process, there should still be significant gains for many/most use cases.
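To make the idea concrete, here is a minimal sketch of such a filter as a standalone pass-through (hypothetical; the function name and usage are illustrative, not part of the SDK or Meltano). It forwards SCHEMA messages, caps RECORD messages at n per stream, and drops STATE messages entirely, per the safety note above:

```python
import json
import sys


def limit_singer_stream(lines, max_records=0):
    """Yield Singer messages from an iterable of JSON lines:
    keep SCHEMA, cap RECORDs at max_records per stream, and drop
    STATE so no bookmark implies records that were never written."""
    counts = {}
    for line in lines:
        try:
            msg = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON noise such as log lines
        mtype = msg.get("type")
        if mtype == "SCHEMA":
            yield line
        elif mtype == "RECORD":
            stream = msg.get("stream")
            counts[stream] = counts.get(stream, 0) + 1
            if counts[stream] <= max_records:
                yield line
        # STATE (and anything else) is intentionally dropped


if __name__ == "__main__":
    # Hypothetical usage between tap and target: tap | this script | target
    for out in limit_singer_stream(sys.stdin, max_records=0):
        sys.stdout.write(out if out.endswith("\n") else out + "\n")
```

With `max_records=0` this creates the target tables from SCHEMA messages alone; a positive value lets through a small sample per stream.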
@aaronsteers aaronsteers added the good first issue Good for newcomers label Apr 8, 2022
@edgarrmondragon edgarrmondragon added the enhancement New feature or request label Apr 14, 2022
teej commented Apr 29, 2022

There are a few use cases where this has come up:

  1. When trying out Meltano for the first time for a new source, I want to ensure it works for all the data types I use in that source, and that it maps those data types to something reasonable in my destination.
  2. For my workloads – Postgres tables with 10B+ rows representing 5+ TBs of data – Meltano does not replicate in a reasonable amount of time. As a workaround, I want Meltano to create the table, I manually backfill in a performant way, and then I let Meltano take over for ongoing replication.

dcowden commented Dec 14, 2023

We are running into this issue as well. For tables with INCREMENTAL replication, workarounds are available: some taps support a start date, which can be used to artificially fetch only a few recent rows, and it is also possible to use state suffixes and state manipulation to set the bookmark to a recent value for testing. None of these are great: they are fiddly, and they do not achieve the goal of getting N records, since each table needs a different state value to guarantee that any rows come back.

The big issue, however, is with FULL_TABLE replication, where state is ignored and we have found no workarounds.
We tried using filters in meltano-map-transformer, but this doesn't prevent selecting all of the rows from the source; it only prevents them from reaching the target.

One solution is to handle this in the tap: if the Singer spec had contemplated testing, a --test mode that always retrieves the top N rows would have been great. But that ship has sailed.

Meltano could handle this by providing a configuration option to quit after N rows. I imagine that would go somewhere around here, as an additional future that waits for a given number of rows from the tap:

https://github.com/meltano/meltano/blob/main/src/meltano/core/runner/singer.py#L119

That'd be a change to meltano core, not meltano-map-transform, so I realize this post is somewhat in the wrong place. But since others had run into this issue here, I figured I'd post here first.
