Replication key field finished_at of runs stream can sometimes be null #213

Open
edgarrmondragon opened this issue Sep 8, 2023 · 3 comments
Labels
bug Something isn't working help wanted Extra attention is needed

Comments

@edgarrmondragon
Member

From a Slack conversation:

  1. Incremental replication where the replication key is sometimes null:
    Again in tap-dbt, the runs stream is set to replicate incrementally using finished_at as the replication key. However this field is sometimes NULL for our runs.

Are there any workarounds for this, aside from tweaking our local code to replicate the runs table with full table replication?

I need to dig into the API docs to see what's going on and maybe come up with a workaround (other than overriding the replication method in the Singer catalog).
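For reference, the catalog override mentioned above could look roughly like the sketch below. It is only an illustration: the `runs` stream ID, the sample catalog and the `force_full_table` helper are assumptions, and it has not been verified which of the two fields tap-dbt actually honors, so the sketch sets both the stream-level `replication-method` metadata and a top-level `replication_method` field.

```python
import copy

# Hypothetical minimal catalog; a real one comes from the tap's discovery output.
CATALOG = {
    "streams": [
        {
            "tap_stream_id": "runs",  # assumed stream ID for the runs stream
            "metadata": [
                {"breadcrumb": [], "metadata": {"replication-method": "INCREMENTAL"}},
            ],
        }
    ]
}


def force_full_table(catalog: dict, stream_id: str) -> dict:
    """Return a copy of the catalog with `stream_id` switched to FULL_TABLE."""
    patched = copy.deepcopy(catalog)
    for stream in patched.get("streams", []):
        if stream.get("tap_stream_id") != stream_id:
            continue
        # Assumption: some taps read this top-level field instead of the metadata.
        stream["replication_method"] = "FULL_TABLE"
        # Stream-level metadata is the entry with an empty breadcrumb.
        for entry in stream.setdefault("metadata", []):
            if entry.get("breadcrumb") == []:
                entry.setdefault("metadata", {})["replication-method"] = "FULL_TABLE"
                break
        else:
            # No stream-level entry existed, so add one.
            stream["metadata"].append(
                {"breadcrumb": [], "metadata": {"replication-method": "FULL_TABLE"}}
            )
    return patched


patched = force_full_table(CATALOG, "runs")
```

The patched catalog would then be passed to the tap in place of the discovered one. The input catalog is left untouched because the helper works on a deep copy.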

Help from other users of this tap is more than welcome!


@edgarrmondragon added the help wanted and bug labels on Sep 8, 2023
@mjsqu
Contributor

mjsqu commented Sep 10, 2023

The finished_at property was chosen for the runs stream because it is one of the keys that API requests can be ordered by. Unfortunately, the ordering keys do not appear to be documented, but they can be discovered by requesting the following endpoints:

  • api/v2/accounts/1/runs/?order_by=-updated_at
  • api/v2/accounts/1/runs/?order_by=-finished_at

At our site, the first request (order_by=-updated_at) is rejected with a 400 response whose reason lists the valid order_by keys:

{
    "status": {
        "code": 400,
        "is_success": false,
        "user_message": "The request was invalid. Please double check the provided data and try again.",
        "developer_message": ""
    },
    "data": {
        "reason": "Invalid order_by value. Use one of [id, created_at, finished_at, -id, -created_at, -finished_at] instead."
    }
}

So the runs endpoint can be ordered, ascending or descending, by:

  • id
  • created_at
  • finished_at
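As a rough sketch of how the valid keys could be pulled out of that error response programmatically, the snippet below parses the `reason` string from the body quoted above. The `parse_order_by_keys` helper is hypothetical, and the approach assumes the API keeps the bracketed, comma-separated format in its message, which is not guaranteed.

```python
import json
import re

# Error body trimmed from the 400 response quoted above.
ERROR_BODY = """
{
    "status": {"code": 400, "is_success": false},
    "data": {
        "reason": "Invalid order_by value. Use one of [id, created_at, finished_at, -id, -created_at, -finished_at] instead."
    }
}
"""


def parse_order_by_keys(body: str) -> list[str]:
    """Return the order_by values listed between brackets in the `reason` string."""
    reason = json.loads(body)["data"]["reason"]
    match = re.search(r"\[([^\]]*)\]", reason)
    if match is None:
        return []
    return [key.strip() for key in match.group(1).split(",") if key.strip()]


keys = parse_order_by_keys(ERROR_BODY)
# A leading "-" marks descending order; strip it to get the distinct field names.
fields = sorted({key.lstrip("-") for key in keys})
print(keys)
print(fields)
```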

Helpful links:

@mjsqu
Contributor

mjsqu commented Sep 10, 2023

The problem with using created_at is that the following scenario may occur:

  • A new run id=1234 is created at 10am
  • Another run id=1235 is created at 10:05am
  • At 11am, both runs are still active. The tap runs without a state bookmark and extracts all runs. It stores the highest created_at value as 10:05am
  • At 11:15am, run id=1234 finishes, the dbt Cloud record is updated with finishing status, finished_at etc.
  • At 11:30am the tap runs in incremental mode. It checks off the records in reverse created_at order and stops when it reaches 10:05am - creating and outputting a final RECORD message containing id=1235.
  • The updated status of id=1234 is not extracted because the created_at value for that run is 10am, before the bookmark value.

I think that makes sense, but please feel free to check my logic.
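To make the scenario easier to check, here is a small simulation of it. This is a sketch under stated assumptions, not the tap's actual code: the `sync` and `ts` helpers, the calendar date and the `status` values are made up for illustration, and the stop condition simply mirrors the description above (walk runs in descending created_at order and stop at the first run older than the bookmark).

```python
from datetime import datetime


def ts(hhmm: str) -> datetime:
    """Build a timestamp on an arbitrary fixed day from an HH:MM string."""
    return datetime.strptime(f"2023-09-10 {hhmm}", "%Y-%m-%d %H:%M")


def sync(runs, bookmark):
    """Emit runs in descending created_at order until one is older than the bookmark.

    Returns (emitted_records, new_bookmark).
    """
    emitted = []
    for run in sorted(runs, key=lambda r: r["created_at"], reverse=True):
        if bookmark is not None and run["created_at"] < bookmark:
            break  # everything after this point is older than the bookmark
        emitted.append(dict(run))  # copy, so later updates do not alter past output
    new_bookmark = max((r["created_at"] for r in emitted), default=bookmark)
    return emitted, new_bookmark


runs = [
    {"id": 1234, "created_at": ts("10:00"), "finished_at": None, "status": "running"},
    {"id": 1235, "created_at": ts("10:05"), "finished_at": None, "status": "running"},
]

# 11:00, first sync without a bookmark: both runs are extracted.
first, bookmark = sync(runs, None)

# 11:15, run 1234 finishes and its record is updated in dbt Cloud.
runs[0].update(finished_at=ts("11:15"), status="finished")

# 11:30, incremental sync: only run 1235 is re-emitted.
second, bookmark = sync(runs, bookmark)

print([r["id"] for r in first])
print([r["id"] for r in second])
```

In this simulation the second sync never reaches id=1234, so the only copy of that run ever emitted still has finished_at set to None.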

I was motivated to create an incremental replication method for the runs endpoint because we have a lot of job runs at our site. However, if you have lower volumes of runs, full_table-style replication may be preferable. Is it possible to select that style of replication and override the incremental method?

@mjsqu
Contributor

mjsqu commented Sep 11, 2023

Just noted the Slack comment said:

Are there any workarounds for this, aside from tweaking our local code to replicate the runs table with full table replication?

Which invalidates the final paragraph of my previous comment.
