Add utility function to parse details from S3 URIs and filenames #42
Why these changes are being introduced:
* Several core functions rely on information about the TIMDEX "run" details, which appear in the names of files produced by the "extract" and "transform" steps of the TIMDEX pipeline. This new util function can be used by core functions as needed, reducing duplicated code.

How this addresses that need:
* Add util function parse_timdex_filename
* Update run_ab_transforms and collate_ab_transforms to use util
* Use TIMDEX source slug in collated_dataset

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-380
* https://mitlibraries.atlassian.net/browse/TIMX-365
Not much to comment on, looks good and works as expected! Thanks for adding this.
r"^([\w\-]+?)-(\d{4}-\d{2}-\d{2})-(\w+)-extracted-records-to-index(?:_(\d+))?\.\w+$",
extract_filename,
def get_transformed_filename(filename_details: dict) -> str:
    """Get transformed filename using extract filename details."""
I like this update to use the parsed components for the output filename.
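A minimal sketch of the idea in this thread: parse the extract filename once, then build the transformed filename from the parsed parts. The regular expression is the one from the diff context above. The helper parse_extract_filename, the transformed filename layout, and the ".json" extension are assumptions for illustration, not the PR's actual code. Key names use the 'run-date' and 'run-type' convention requested in this review.

```python
import re

# Pattern copied from the diff context above; it matches extract filenames only.
EXTRACT_PATTERN = re.compile(
    r"^([\w\-]+?)-(\d{4}-\d{2}-\d{2})-(\w+)-extracted-records-to-index(?:_(\d+))?\.\w+$"
)


def parse_extract_filename(extract_filename: str) -> dict:
    """Hypothetical helper: split an extract filename into its run details."""
    match = EXTRACT_PATTERN.match(extract_filename)
    if match is None:
        raise ValueError(f"Unexpected extract filename: {extract_filename}")
    source, run_date, run_type, index = match.groups()
    return {
        "source": source,
        "run-date": run_date,
        "run-type": run_type,
        "index": index,  # None when the filename has no "_NN" suffix
    }


def get_transformed_filename(filename_details: dict) -> str:
    """Get transformed filename using extract filename details."""
    # Assumed output layout: same prefix, "transformed" stage, JSON file type.
    index = f"_{filename_details['index']}" if filename_details["index"] else ""
    return (
        f"{filename_details['source']}-{filename_details['run-date']}-"
        f"{filename_details['run-type']}-transformed-records-to-index{index}.json"
    )
```

Building the output name from parsed components avoids string replacement on the input filename, so an unexpected filename fails loudly at parse time instead of producing a malformed output name.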
ghukill
left a comment
Requesting one small change: rename date to run-date and cadence to run-type, as discussed for the TIMDEX parquet work.
Otherwise, looks good to me! Approval still stands with this change.
abdiff/core/utils.py
Outdated
    filename,
)

keys = ["source", "date", "cadence", "stage", "action", "index", "file_type"]
Apologies! Missed this on first pass.
Can we update these now, as discussed we might do in the TIMDEX parquet work?
* date --> run-date
* cadence --> run-type
Also note: I think we should use a dash in the key name, to mirror the dash in the TIMDEX StepFunction usage. In code, particularly for DuckDB, we may end up using an underscore, but I'd vote for a dash in this utility function for symmetry.
Here is a comment from the engineering plan: https://mitlibraries.atlassian.net/wiki/spaces/IN/pages/edit-v2/4094296066?focusedCommentId=4141219863.
The name change is not performed yet in this engineering plan, but I think it was a good idea and will make that change.
Good catch! Made the required updates. See latest commit 9425ce3. 🤓
* Update modules to use 'run-date' and 'run-type'
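The outcome of this thread can be sketched as follows. This is an illustration only: the key list is the one from the diff with the two requested renames applied, while the regular expression, the handling of S3 URIs, and the to_column_names helper are assumptions rather than the PR's actual implementation.

```python
import re

# Key names after the rename; dashes mirror the TIMDEX StepFunction usage.
KEYS = ["source", "run-date", "run-type", "stage", "action", "index", "file_type"]

# Assumed pattern (not taken from the PR): one capture group per key, in order.
FILENAME_PATTERN = re.compile(
    r"^([\w\-]+?)-(\d{4}-\d{2}-\d{2})-(\w+)-(\w+)-records-to-(\w+?)(?:_(\d+))?\.(\w+)$"
)


def parse_timdex_filename(s3_uri_or_filename: str) -> dict:
    """Sketch: parse TIMDEX run details from an S3 URI or a bare filename."""
    # Keep only the last path segment so S3 URIs and filenames are handled alike.
    filename = s3_uri_or_filename.rsplit("/", maxsplit=1)[-1]
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"Could not parse TIMDEX filename: {filename}")
    return dict(zip(KEYS, match.groups()))


def to_column_names(details: dict) -> dict:
    """Swap dashes for underscores, e.g. for use as DuckDB column names."""
    return {key.replace("-", "_"): value for key, value in details.items()}
```

A usage example, with a hypothetical bucket name:

```python
details = parse_timdex_filename(
    "s3://example-bucket/alma/alma-2024-10-15-daily-extracted-records-to-index_01.xml"
)
print(details["run-date"], details["run-type"])
print(to_column_names(details))
```

Keeping the dashed names in the utility and converting only at the SQL boundary keeps the parsed keys symmetric with the pipeline's own naming.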
Pull Request Test Coverage Report for Build 11578504284 (Coveralls)
ghukill
left a comment
Looks good!
Purpose and background context
Create a central utility function to parse TIMDEX "run" details from names of files that are used/created throughout the Transmogrifier A/B Diff workflow. This function replaces the now deprecated parse_x functions that were previously implemented within core function modules. Here are the main updates:
* get_transformed_filename function in run_ab_transforms.
* collate_ab_transforms.
How can a reviewer manually see the effects of these changes?
Run make test and confirm all unit tests are passing.
Includes new or updated dependencies?
NO
Changes expectations for external applications?
NO
What are the relevant tickets?
Developer
Code Reviewer(s)