Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Re-work FlowETL QA check discovery #6497

Open
jc-harrison opened this issue Apr 18, 2024 · 1 comment
Open

Re-work FlowETL QA check discovery #6497

jc-harrison opened this issue Apr 18, 2024 · 1 comment

Comments

@jc-harrison
Copy link
Member

FlowETL's current approach for adding QA checks to a DAG is to scan through all files in the template_searchpath, and create QACheckOperator tasks for all *.sql templates that contain "qa_checks" somewhere in the (relative) filepath. This is convenient for dynamically picking up QA checks, but has some downsides:

  • The approach of searching through all templates is a little fragile. In particular, if the same QA check file appears relative to multiple searchpaths (e.g. if both /path/to/dag_folder and /path/to/dag_folder/subdir are in the template searchpath), that QA check will be picked up twice.
  • The requirement for "qa_checks" to appear in the relative path can lead to some non-intuitive behaviour - e.g. if I define a QA check in /my/additional/qa_checks/my_check.sql and then set additional_qa_check_paths=/my/additional/qa_checks/, my QA check will not be picked up.
  • Because the default QA checks are not in the default template searchpath, we implicitly add <flowetl_module_install_path>/qa_checks to the template searchpath. Since the template searchpath setting applies to the entire DAG, there's potential for this to have unintended side-effects on template fetching in other tasks within the DAG.

Aside from these potential pitfalls, there are other features that might be nice to have which are not possible with the current approach. E.g.:

  • Excluding some QA checks (e.g. if some of the default QA checks are not applicable because they apply to a field that isn't in the data we're ingesting, or perhaps we want to run some of the checks at one stage in a DAG and the rest at a later stage)
  • We may perhaps want to override one of the default QA checks with a modified check in some circumstances

It would be good to re-design the QA check discovery mechanism to avoid these shortcomings and allow more explicit control over check discovery when it's useful, without losing all of the convenience of the dynamic QA check discovery and injection of default QA checks.

@jc-harrison
Copy link
Member Author

Ideally the solution here would no longer rely on us knowing the location of the DAG folder during DAG creation, so that we can remove the hack introduced in #6496.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant