Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid duplicate QA check tasks in FlowETL #6496

Merged
merged 19 commits into from
Apr 22, 2024
Merged

Avoid duplicate QA check tasks in FlowETL #6496

merged 19 commits into from
Apr 22, 2024

Conversation

jc-harrison
Copy link
Member

@jc-harrison jc-harrison commented Apr 17, 2024

Closes #6494

I have:

  • Formatted any Python files with black
  • Brought the branch up to date with master
  • Added any relevant Github labels
  • Added tests for any new additions
  • Added or updated any relevant documentation
  • Added an Architectural Decision Record (ADR), if appropriate
  • Added an MPLv2 License Header if appropriate
  • Updated the Changelog

Description

  • Changes get_qa_checks() to only add settings.DAGS_FOLDER to the template searchpath if dag.folder does not already point to the correct DAG folder (which is the case if a DAG is created by the create_dag() function).
  • Changes the default_path in get_qa_checks() to always be Path(__file__).parent, rather than being either Path(__file__).parent or Path(__file__).parent / "qa_checks" depending on whether or not the create_dag() function is being used. An upshot of this is that the default QA check files no longer need the additional level of nesting that was required to ensure the template paths containd the string "qa_checks".
  • Fixes flowetl.util.create_dag so that the returned DAG has the correct dag.folder
  • Removes awkward workaround in flowetl.util.get_qa_checks that was necessary due to the dag.folder sometimes being incorrect
  • Adds integration tests for loading QA checks from the DAG folder

Overall, I think we still want to re-work the QA check discovery more substantially - the template-path-contains-"qa_checks" criterion and the silent addition of a path within the FlowETL module to the template searchpath both leave us open to unexpected and hard-to-debug bugs, and it would be useful to have more explicit control over which QA checks are added to a DAG. But this PR fixes the immediate issue, to enable the use of custom QA checks again.

Note: it would still be possible to get duplicates if the user defining a DAG specifies a template_searchpath or additional_qa_check_paths containing either a parent or child of the DAG folder - this PR does not prevent errors in that situation.

Copy link

cypress bot commented Apr 17, 2024

Passing run #22056 ↗︎

0 4 0 0 Flakiness 0

Details:

One more attempt at fixing test
Project: FlowAuth Commit: 601b192927
Status: Passed Duration: 00:41 💡
Started: Apr 18, 2024 4:01 PM Ended: Apr 18, 2024 4:01 PM

Review all test suite changes for PR #6496 ↗︎

@jc-harrison jc-harrison added bug Something isn't working FlowETL labels Apr 17, 2024
Copy link

codecov bot commented Apr 18, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 92.31%. Comparing base (7e8e63d) to head (601b192).

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #6496   +/-   ##
=======================================
  Coverage   92.30%   92.31%           
=======================================
  Files         268      268           
  Lines       10583    10586    +3     
  Branches      855      855           
=======================================
+ Hits         9769     9772    +3     
  Misses        676      676           
  Partials      138      138           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@jc-harrison
Copy link
Member Author

I've changed to a different approach from the one I started off using. The underlying problem here was that Airflow assumes the DAG file (i.e. the file that needs to be imported to define the DAG) is the file in which the DAG constructor is called, but when using FlowETL's create_dag function the DAG file is actually one frame back (i.e. the file in which create_dag is called, not the file that defines create_dag). Since dag.folder (i.e. the parent of the DAG file) is the default template searchpath, we had to add a workaround in get_qa_checks for the case that dag.folder wasn't actually pointing to the DAG folder (which happened if the DAG was created using create_dag), and that workaround resulted in QA checks defined in the DAG folder being picked up twice. For the same reason, we also had to add the entire contents of the FlowETL module to the template searchpath rather than just the path to the default QA checks.

To get around this, I've updated create_dag to explicitly set dag.fileloc to be the file location from which create_dag is called. This feels a bit hacky, but it's exactly the same hack that's used in DAG.__init__(), and overall I think this is the cleanest solution short of a more substantial re-work of the QA check discovery mechanism to avoid needing to walk the entire template searchpath during DAG creation.

Copy link
Member

@greenape greenape left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find this solution distressing, but the best available option as it stands.

@jc-harrison jc-harrison merged commit ed133ee into master Apr 22, 2024
40 checks passed
@jc-harrison jc-harrison deleted the flowetl-qa-fix branch April 22, 2024 10:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working FlowETL
Projects
None yet
Development

Successfully merging this pull request may close these issues.

get_qa_checks duplicates
2 participants