Re-work FlowETL QA check discovery #6497

jc-harrison · 2024-04-18T17:24:37Z

FlowETL's current approach for adding QA checks to a DAG is to scan through all files in the template_searchpath, and create QACheckOperator tasks for all *.sql templates that contain "qa_checks" somewhere in the (relative) filepath. This is convenient for dynamically picking up QA checks, but has some downsides:

The approach of searching through all templates is a little fragile. In particular, if the same QA check file appears relative to multiple searchpaths (e.g. if both /path/to/dag_folder and /path/to/dag_folder/subdir are in the template searchpath), that QA check will be picked up twice.
The requirement for "qa_checks" to appear in the relative path can lead to some non-intuitive behaviour - e.g. if I define a QA check in /my/additional/qa_checks/my_check.sql and then set additional_qa_check_paths=/my/additional/qa_checks/, my QA check will not be picked up.
Because the default QA checks are not in the default template searchpath, we implicitly add <flowetl_module_install_path>/qa_checks to the template searchpath. Since the template searchpath setting applies to the entire DAG, there's potential for this to have unintended side-effects on template fetching in other tasks within the DAG.

Aside from these potential pitfalls, there are other features that might be nice to have which are not possible with the current approach. E.g.:

Excluding some QA checks (e.g. if some of the default QA checks are not applicable because they apply to a field that isn't in the data we're ingesting, or perhaps we want to run some of the checks at one stage in a DAG and the rest at a later stage)
We may perhaps want to override one of the default QA checks with a modified check in some circumstances

It would be good to re-design the QA check discovery mechanism to avoid these shortcomings and allow more explicit control over check discovery when it's useful, without losing all of the convenience of the dynamic QA check discovery and injection of default QA checks.

The text was updated successfully, but these errors were encountered:

jc-harrison · 2024-04-22T10:10:07Z

Ideally the solution here would no longer rely on us knowing the location of the DAG folder during DAG creation, so that we can remove the hack introduced in #6496.

jc-harrison added refactoring FlowETL labels Apr 18, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Re-work FlowETL QA check discovery #6497

Re-work FlowETL QA check discovery #6497

jc-harrison commented Apr 18, 2024

jc-harrison commented Apr 22, 2024

Re-work FlowETL QA check discovery #6497

Re-work FlowETL QA check discovery #6497

Comments

jc-harrison commented Apr 18, 2024

jc-harrison commented Apr 22, 2024