Add utility function to parse details from S3 URIs and filenames #42

jonavellecuerdo · 2024-10-29T14:28:19Z

Purpose and background context

Create a central utility function to parse TIMDEX "run" details from names of files that are used/created throughout the Transmogrifier A/B Diff workflow. This function replaces the now deprecated parse_x functions that were previously implemented within core function modules. Here are the main updates:

Utility function utils.parse_timdex_filename
Use utility function output to create get_transformed_filename function in run_ab_transforms.
Use utility function output to name columns in the collated dataset in collate_ab_transforms.

How can a reviewer manually see the effects of these changes?

Run make test and confirm all unit tests are passing.

Includes new or updated dependencies?

NO

Changes expectations for external applications?

NO

What are the relevant tickets?

Developer

All new ENV is documented in README
All new ENV has been added to staging and production environments
All related Jira tickets are linked in commit message(s)
Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

The commit message is clear and follows our guidelines (not just this PR message)
There are appropriate tests covering any new functionality
The provided documentation is sufficient for understanding any new functionality introduced
Any manual tests have been performed or provided examples verified
New dependencies are appropriate or there were no changes

Why these changes are being introduced: * Several core functions rely on information about the TIMDEX "run" details, which appear in the names of files produced by the "extract" and "transform" steps of the TIMDEX pipeline. This new util function can be used by core functions as needed, reducing duplicated code. How this addresses that need: * Add util function parse_timdex_filename * Update run_ab_transforms and collate_ab_transforms to use util * Use TIMDEX source slug in collated_dataset Side effects of this change: * None Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/TIMX-380 * https://mitlibraries.atlassian.net/browse/TIMX-365

jonavellecuerdo · 2024-10-29T14:50:53Z

@ehanson8 This is a small, "cleanup" body of work that @ghukill and I wanted to implement. Please feel free to take a look if you're interested!

ghukill

Not much too comment on, looks good and works as expected! Thanks for adding this.

tests/test_utils.py

ghukill · 2024-10-29T15:03:44Z

abdiff/core/run_ab_transforms.py

-        r"^([\w\-]+?)-(\d{4}-\d{2}-\d{2})-(\w+)-extracted-records-to-index(?:_(\d+))?\.\w+$",
-        extract_filename,
+def get_transformed_filename(filename_details: dict) -> str:
+    """Get transformed filename using extract filename details."""


I like this update to use the parsed components for the output filename.

ghukill

Requesting one small change, date to run-date and cadence to run-type as discussed for the TIMDEX parquet work.

Otherwise, looks good to me! Approval still stands with this change.

ghukill · 2024-10-29T15:11:19Z

abdiff/core/utils.py

+        filename,
+    )
+
+    keys = ["source", "date", "cadence", "stage", "action", "index", "file_type"]


Apologies! Missed this on first pass.

Can we update these now, as discussed we might do in the TIMDEX parquet work?

date --> run-date

cadence --> run-type

Also note: I think we should use a dash in the key name, to mirror the dash in the TIMDEX StepFunction usage. In code, particularly for DuckDB, we may end up using an underscore, but I'd vote for a dash in this utility function for symmetry.

Here is a comment from the engineering plan: https://mitlibraries.atlassian.net/wiki/spaces/IN/pages/edit-v2/4094296066?focusedCommentId=4141219863.

The name change is not performed yet in this engineering plan, but I think it was a good idea and will make that change.

Good catch! Made the required updates. See latest commit 9425ce3. 🤓

* Update modules to use 'run-date' and 'run-type'

coveralls · 2024-10-29T16:44:13Z

Pull Request Test Coverage Report for Build 11578504284

Details

20 of 22 (90.91%) changed or added relevant lines in 3 files are covered.
No unchanged relevant lines lost coverage.
Overall coverage decreased (-0.3%) to 96.386%

Changes Missing Coverage	Covered Lines	Changed/Added Lines	%
abdiff/core/utils.py	9	11	81.82%

Totals
Change from base Build 11574719054:	-0.3%
Covered Lines:	640
Relevant Lines:	664

💛 - Coveralls

ghukill

Looks good!

abdiff/core/run_ab_transforms.py

jonavellecuerdo self-assigned this Oct 29, 2024

jonavellecuerdo marked this pull request as ready for review October 29, 2024 14:49

jonavellecuerdo requested a review from ghukill October 29, 2024 14:49

ghukill approved these changes Oct 29, 2024

View reviewed changes

ghukill requested changes Oct 29, 2024

View reviewed changes

Address comments in PR

9425ce3

* Update modules to use 'run-date' and 'run-type'

jonavellecuerdo requested a review from ghukill October 29, 2024 16:41

ghukill approved these changes Oct 29, 2024

View reviewed changes

ghukill reviewed Oct 29, 2024

View reviewed changes

abdiff/core/run_ab_transforms.py Show resolved Hide resolved

jonavellecuerdo merged commit 059c576 into main Oct 29, 2024
2 checks passed

jonavellecuerdo deleted the TIMX-380-365-add-util-parser-and-use-source-slug branch October 29, 2024 17:37

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add utility function to parse details from S3 URIs and filenames #42

Add utility function to parse details from S3 URIs and filenames #42

Uh oh!

jonavellecuerdo commented Oct 29, 2024 •

edited

Loading

Uh oh!

jonavellecuerdo commented Oct 29, 2024

Uh oh!

ghukill left a comment •

edited

Loading

Uh oh!

Uh oh!

ghukill Oct 29, 2024

Uh oh!

ghukill left a comment

Uh oh!

ghukill Oct 29, 2024 •

edited

Loading

Uh oh!

ghukill Oct 29, 2024

Uh oh!

jonavellecuerdo Oct 29, 2024

Uh oh!

coveralls commented Oct 29, 2024

Uh oh!

ghukill left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Add utility function to parse details from S3 URIs and filenames #42

Add utility function to parse details from S3 URIs and filenames #42

Uh oh!

Conversation

jonavellecuerdo commented Oct 29, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Purpose and background context

How can a reviewer manually see the effects of these changes?

Includes new or updated dependencies?

Changes expectations for external applications?

What are the relevant tickets?

Developer

Code Reviewer(s)

Uh oh!

jonavellecuerdo commented Oct 29, 2024

Uh oh!

ghukill left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

ghukill Oct 29, 2024

Choose a reason for hiding this comment

Uh oh!

ghukill left a comment

Choose a reason for hiding this comment

Uh oh!

ghukill Oct 29, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ghukill Oct 29, 2024

Choose a reason for hiding this comment

Uh oh!

jonavellecuerdo Oct 29, 2024

Choose a reason for hiding this comment

Uh oh!

coveralls commented Oct 29, 2024

Pull Request Test Coverage Report for Build 11578504284

Details

💛 - Coveralls

Uh oh!

ghukill left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

jonavellecuerdo commented Oct 29, 2024 •

edited

Loading

ghukill left a comment •

edited

Loading

ghukill Oct 29, 2024 •

edited

Loading