Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

get_qa_checks duplicates #6494

Closed
jc-harrison opened this issue Apr 15, 2024 · 3 comments · Fixed by #6496
Closed

get_qa_checks duplicates #6494

jc-harrison opened this issue Apr 15, 2024 · 3 comments · Fixed by #6496
Labels
bug Something isn't working FlowETL

Comments

@jc-harrison
Copy link
Member

Describe the bug
Suppose we have the following file structure:

- dags_root/
    - qa_checks/
        - my_check.sql
    - my_dag.py

where 'my_dag.py' is a DAG definition file that calls flowetl.util.get_qa_checks(). If we mount 'dags_root/' to '/opt/airflow/dags/etl/' in a FlowETL container, importing the DAG fails with:

Broken DAG: [/opt/airflow/dags/etl/my_dag.py] Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/baseoperator.py", line 811, in __init__
    task_group.add(self)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/utils/task_group.py", line 231, in add
    raise DuplicateTaskIdFound(f"{node_type} id '{key}' has already been added to the DAG")
airflow.exceptions.DuplicateTaskIdFound: Task id 'my_check' has already been added to the DAG

On closer inspection, template_paths within get_qa_checks() evaluates to:

[PosixPath('etl/qa_checks/my_check.sql'), PosixPath('qa_checks/my_check.sql')]
@jc-harrison jc-harrison added bug Something isn't working FlowETL labels Apr 15, 2024
@jc-harrison
Copy link
Member Author

The problem here is that we're adding settings.DAGS_FOLDER to the template searchpath, and the default searchpath is dag.folder (which in our case will be {settings.DAGS_FOLDER}/etl) - so jinja_env.list_templates() finds matching templates qa_checks/my_check.sql (relative to dag.folder) and etl/qa_checks/my_check.sql (relative to settings.DAGS_FOLDER).

I think we should not be adding settings.DAGS_FOLDER to the template searchpath - dag.folder (which will always be either settings.DAGS_FOLDER or a subdirectory of it) is automatically in the template searchpath, so explicitly adding settings.DAGS_FOLDER will always either have no effect or introduce duplicate templates. And I don't think there is any good reason we would want to add the root dags folder to the searchpath if the dag file is not in the root dags folder.

@jc-harrison
Copy link
Member Author

Ah, no, I see the problem now - we're adding settings.DAGS_FOLDER because, if we create a DAG using the flowetl.util.create_dag function, dag.folder is the path to the flowetl module when the DAG is defined (although I think it becomes the path to the actual DAG definition file later). So if we don't add settings.DAGS_FOLDER to the searchpath, then a DAG defined using flowetl.util.create_dag won't pick up any custom QA checks in the DAG folder.

But if we do add settings.DAGS_FOLDER, then a DAG defined explicitly (i.e. not using flowetl.util.create_dag) in a sub-folder of the DAGs folder will pick up custom QA checks twice - hence the issue.

I think the fundamental problem here is that we can't rely on the set of templates found when the DAG is defined being the same as those that are available when the DAG runs. It seems to me that, if we want to keep the behaviour of dynamically adding tasks for all QA check templates present, we need to do so during DAG execution rather than DAG definition. Perhaps we can do that using mapped tasks. Alternatively, maybe we can make this work by dropping the auto-detection of QA checks in the DAGs folder, and instead require specifying additional_qa_check_paths when defining the DAG if we want to include QA checks other than the defaults.

@greenape
Copy link
Member

Could we just skip adding it if dag.folder != Path(__file__).parent?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working FlowETL
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants