Skip to content

Conversation

@jonavellecuerdo
Copy link
Contributor

@jonavellecuerdo jonavellecuerdo commented Oct 29, 2024

Purpose and background context

Create a central utility function to parse TIMDEX "run" details from names of files that are used/created throughout the Transmogrifier A/B Diff workflow. This function replaces the now deprecated parse_x functions that were previously implemented within core function modules. Here are the main updates:

How can a reviewer manually see the effects of these changes?

Run make test and confirm all unit tests are passing.

Includes new or updated dependencies?

NO

Changes expectations for external applications?

NO

What are the relevant tickets?

Developer

  • All new ENV is documented in README
  • All new ENV has been added to staging and production environments
  • All related Jira tickets are linked in commit message(s)
  • Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

  • The commit message is clear and follows our guidelines (not just this PR message)
  • There are appropriate tests covering any new functionality
  • The provided documentation is sufficient for understanding any new functionality introduced
  • Any manual tests have been performed or provided examples verified
  • New dependencies are appropriate or there were no changes

Why these changes are being introduced:
* Several core functions rely on information about the TIMDEX
"run" details, which appear in the names of files produced
by the "extract" and "transform" steps of the TIMDEX pipeline.
This new util function can be used by core functions as needed,
reducing duplicated code.

How this addresses that need:
* Add util function parse_timdex_filename
* Update run_ab_transforms and collate_ab_transforms to use util
* Use TIMDEX source slug in collated_dataset

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-380
* https://mitlibraries.atlassian.net/browse/TIMX-365
@jonavellecuerdo jonavellecuerdo self-assigned this Oct 29, 2024
@jonavellecuerdo jonavellecuerdo marked this pull request as ready for review October 29, 2024 14:49
@jonavellecuerdo
Copy link
Contributor Author

@ehanson8 This is a small, "cleanup" body of work that @ghukill and I wanted to implement. Please feel free to take a look if you're interested!

Copy link
Contributor

@ghukill ghukill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not much too comment on, looks good and works as expected! Thanks for adding this.

r"^([\w\-]+?)-(\d{4}-\d{2}-\d{2})-(\w+)-extracted-records-to-index(?:_(\d+))?\.\w+$",
extract_filename,
def get_transformed_filename(filename_details: dict) -> str:
"""Get transformed filename using extract filename details."""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this update to use the parsed components for the output filename.

Copy link
Contributor

@ghukill ghukill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Requesting one small change, date to run-date and cadence to run-type as discussed for the TIMDEX parquet work.

Otherwise, looks good to me! Approval still stands with this change.

filename,
)

keys = ["source", "date", "cadence", "stage", "action", "index", "file_type"]
Copy link
Contributor

@ghukill ghukill Oct 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies! Missed this on first pass.

Can we update these now, as discussed we might do in the TIMDEX parquet work?

  • date --> run-date
  • cadence --> run-type

Also note: I think we should use a dash in the key name, to mirror the dash in the TIMDEX StepFunction usage. In code, particularly for DuckDB, we may end up using an underscore, but I'd vote for a dash in this utility function for symmetry.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is a comment from the engineering plan: https://mitlibraries.atlassian.net/wiki/spaces/IN/pages/edit-v2/4094296066?focusedCommentId=4141219863.

The name change is not performed yet in this engineering plan, but I think it was a good idea and will make that change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! Made the required updates. See latest commit 9425ce3. 🤓

* Update modules to use 'run-date' and 'run-type'
@coveralls
Copy link

Pull Request Test Coverage Report for Build 11578504284

Details

  • 20 of 22 (90.91%) changed or added relevant lines in 3 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-0.3%) to 96.386%

Changes Missing Coverage Covered Lines Changed/Added Lines %
abdiff/core/utils.py 9 11 81.82%
Totals Coverage Status
Change from base Build 11574719054: -0.3%
Covered Lines: 640
Relevant Lines: 664

💛 - Coveralls

Copy link
Contributor

@ghukill ghukill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

@jonavellecuerdo jonavellecuerdo merged commit 059c576 into main Oct 29, 2024
2 checks passed
@jonavellecuerdo jonavellecuerdo deleted the TIMX-380-365-add-util-parser-and-use-source-slug branch October 29, 2024 17:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants