-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HRQB 13 - Refactor base Tasks #11
Conversation
Why these changes are being introduced: While working on building out actual chains of Tasks in Pipelines, it was determined that it would be beneficial to programmatically set the Target output paths for Tasks. To support this, a pipeline name will get passed throughout that help name these artifacts. This will allow a single Task to get reused in different pipelines, where the output Target will have a filename unique the pipeline. With these changes, it was also determined that testing could be simplified to use some Tasks as fixtures. How this addresses that need: * HRQBTask.path is a dynamic property that constructs a path based on a configurable targets directory, pipeline name, task stage, and task name * Task fixtures were created for extract, transform, load, and pipeline * PandasPickleTarget is limited to only returning DataFrames as it was determined that Series are likely not going to get used Side effects of this change: * None Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/HRQB-13
output/LookupTablesPipeline__load__UpsertJobTitles.pickle | ||
""" | ||
filename = ( | ||
"__".join( # noqa: FLY002 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The double underscore __
is used to support splitting on this filename if needed, where a single underscore could showup in components of the name. Splitting on the double underscore would give the:
- pipeline name
- pipeline stage
- task name
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would recommend noting the double underscore __
in the docstring because it's not easily distinguishable from a single underscore at a glance
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good overall but a few questions
def write(self, panda_object: PandasObject) -> None: | ||
panda_object.to_pickle(self.path) | ||
def write(self, df: pd.DataFrame) -> None: | ||
df.to_pickle(self.path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Optional: I know df
is a convention but always feel full names are preferable to abbreviations
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In a coming PR, you'll see that I agree and have some fixtures like colors_df
, names_df
, animals_df
, etc. But I think in a context where there is only and truly one dataframe getting used, that the convention of df
is okay, particularly when there is really nothing else known about the Dataframe.
output/LookupTablesPipeline__load__UpsertJobTitles.pickle | ||
""" | ||
filename = ( | ||
"__".join( # noqa: FLY002 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would recommend noting the double underscore __
in the docstring because it's not easily distinguishable from a single underscore at a glance
This is useful when a Task has multiple parent Tasks, to easily and precisely | ||
access a specific parent Task's output. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm admittedly still confused on the single vs multiple parent tasks but not sure if it's worth adding more documentation or if it will become clearer in the next few PRs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is an example coming in the next PR.
Here is a sneak peak at that, where the Task PrepareAnimals
has input from two Tasks ExtractAnimalColors
and ExtractAnimalNames
:
├── COMPLETE: AnimalsDebug()
├── COMPLETE: LoadAnimalsDebug(pipeline=AnimalsDebug, stage=Load, table_name=Animals)
├── COMPLETE: PrepareAnimals(pipeline=AnimalsDebug, stage=Transform, table_name=Animals)
├── COMPLETE: ExtractAnimalColors(table_name=, pipeline=AnimalsDebug, stage=Extract)
├── COMPLETE: ExtractAnimalNames(table_name=, pipeline=AnimalsDebug, stage=Extract)
A more realistic example might be getting job data from the data warehouse, and supervisor information from HR data, where a single Task then joines that information.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, that is a very clear example!
"""Return single parent Task Target, raise error if multiple parent Tasks. | ||
|
||
When used, this convenience method also helps reason about Tasks. Quite often, a | ||
Task is only expecting a single parent Task that will feed it data. In these | ||
scenarios, using self.single_input is not only convenient, but also codifies in | ||
code this assumption. If this Task were to receive multiple inputs in the future | ||
this method would then throw an error. | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perfect! Very understandable
This is useful when a Task has multiple parent Tasks, to easily and precisely | ||
access a specific parent Task's output. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, that is a very clear example!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me! Just one small comment re: updating a class docstring. The conversations between you and @ehanson8 were helpful for review. :)
Purpose and background context
While working on building out actual chains of Tasks in pipelines for the CLI, it was determined that it would be beneficial to programmatically set the output path for Task Targets.
By setting these dynamically, each Task will have a predictable output filename. Additionally, different pipelines can use the same Task without naming collisions in their Targets (which should be different), as all Tasks will now have a
pipeline
parameter that is used in the output filename. Lastly, by setting a baseTARGETS_DIRECTORY
env var, we can also control where targets are written to (though defaults tooutput/
directory).The most relevant change in this PR is
HRQBTask.path
here.This also presented an opportunity to rework the tests that exercised the base Tasks, by using actual fixtures of fictional Tasks that will also be helpful for testing pipelines, overall simplifying them.
In summary: this refactoring smooths out a couple of awkward edges around base Tasks that implementing pipelines and CLI commands revealed. The majority of changes are related to updated tests.
How can a reviewer manually see the effects of these changes?
The most telling change is initializing a Task and note the path is now dynamic.
Includes new or updated dependencies?
NO
Changes expectations for external applications?
NO
What are the relevant tickets?
Developer
Code Reviewer(s)