Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HRQB 13 - Refactor base Tasks #11

Merged
merged 3 commits into from
May 7, 2024
Merged

HRQB 13 - Refactor base Tasks #11

merged 3 commits into from
May 7, 2024

Conversation

ghukill
Copy link
Collaborator

@ghukill ghukill commented May 6, 2024

Purpose and background context

While working on building out actual chains of Tasks in pipelines for the CLI, it was determined that it would be beneficial to programmatically set the output path for Task Targets.

By setting these dynamically, each Task will have a predictable output filename. Additionally, different pipelines can use the same Task without naming collisions in their Targets (which should be different), as all Tasks will now have a pipeline parameter that is used in the output filename. Lastly, by setting a base TARGETS_DIRECTORY env var, we can also control where targets are written to (though defaults to output/ directory).

The most relevant change in this PR is HRQBTask.path here.

This also presented an opportunity to rework the tests that exercised the base Tasks, by using actual fixtures of fictional Tasks that will also be helpful for testing pipelines, overall simplifying them.

In summary: this refactoring smooths out a couple of awkward edges around base Tasks that implementing pipelines and CLI commands revealed. The majority of changes are related to updated tests.

How can a reviewer manually see the effects of these changes?

The most telling change is initializing a Task and note the path is now dynamic.

import os
from tests.fixtures.tasks.extract import ExtractAnimalNames

# init Task
task = ExtractAnimalNames(
    pipeline="Animals",
)

# show dynamically built path
task.path
# Out[3]: 'output/Animals__Extract__ExtractAnimalNames.pickle'

# modify/set TARGETS_DIRECTORY to see how it drives the path
os.environ['TARGETS_DIRECTORY'] = "/tmp/special/place"
task.path
# Out[6]: '/tmp/special/place/Animals__Extract__ExtractAnimalNames.pickle'

Includes new or updated dependencies?

NO

Changes expectations for external applications?

NO

What are the relevant tickets?

Developer

  • All new ENV is documented in README
  • All new ENV has been added to staging and production environments
  • All related Jira tickets are linked in commit message(s)
  • Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

  • The commit message is clear and follows our guidelines (not just this PR message)
  • There are appropriate tests covering any new functionality
  • The provided documentation is sufficient for understanding any new functionality introduced
  • Any manual tests have been performed or provided examples verified
  • New dependencies are appropriate or there were no changes

Why these changes are being introduced:

While working on building out actual chains of Tasks in Pipelines,
it was determined that it would be beneficial to programmatically
set the Target output paths for Tasks.  To support this, a pipeline name
will get passed throughout that help name these artifacts.  This will
allow a single Task to get reused in different pipelines, where the output
Target will have a filename unique the pipeline.

With these changes, it was also determined that testing could be
simplified to use some Tasks as fixtures.

How this addresses that need:
* HRQBTask.path is a dynamic property that constructs a path based
on a configurable targets directory, pipeline name, task stage, and
task name
* Task fixtures were created for extract, transform, load, and pipeline
* PandasPickleTarget is limited to only returning DataFrames as it was
determined that Series are likely not going to get used

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/HRQB-13
output/LookupTablesPipeline__load__UpsertJobTitles.pickle
"""
filename = (
"__".join( # noqa: FLY002
Copy link
Collaborator Author

@ghukill ghukill May 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The double underscore __ is used to support splitting on this filename if needed, where a single underscore could showup in components of the name. Splitting on the double underscore would give the:

  • pipeline name
  • pipeline stage
  • task name

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would recommend noting the double underscore __ in the docstring because it's not easily distinguishable from a single underscore at a glance

Copy link

@ehanson8 ehanson8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall but a few questions

def write(self, panda_object: PandasObject) -> None:
panda_object.to_pickle(self.path)
def write(self, df: pd.DataFrame) -> None:
df.to_pickle(self.path)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional: I know df is a convention but always feel full names are preferable to abbreviations

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a coming PR, you'll see that I agree and have some fixtures like colors_df, names_df, animals_df, etc. But I think in a context where there is only and truly one dataframe getting used, that the convention of df is okay, particularly when there is really nothing else known about the Dataframe.

output/LookupTablesPipeline__load__UpsertJobTitles.pickle
"""
filename = (
"__".join( # noqa: FLY002
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would recommend noting the double underscore __ in the docstring because it's not easily distinguishable from a single underscore at a glance

hrqb/base/task.py Outdated Show resolved Hide resolved
Comment on lines +70 to +71
This is useful when a Task has multiple parent Tasks, to easily and precisely
access a specific parent Task's output.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm admittedly still confused on the single vs multiple parent tasks but not sure if it's worth adding more documentation or if it will become clearer in the next few PRs

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is an example coming in the next PR.

Here is a sneak peak at that, where the Task PrepareAnimals has input from two Tasks ExtractAnimalColors and ExtractAnimalNames:

├── COMPLETE: AnimalsDebug()
   ├── COMPLETE: LoadAnimalsDebug(pipeline=AnimalsDebug, stage=Load, table_name=Animals)
      ├── COMPLETE: PrepareAnimals(pipeline=AnimalsDebug, stage=Transform, table_name=Animals)
         ├── COMPLETE: ExtractAnimalColors(table_name=, pipeline=AnimalsDebug, stage=Extract)
         ├── COMPLETE: ExtractAnimalNames(table_name=, pipeline=AnimalsDebug, stage=Extract)

A more realistic example might be getting job data from the data warehouse, and supervisor information from HR data, where a single Task then joines that information.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, that is a very clear example!

tests/conftest.py Show resolved Hide resolved
@ghukill ghukill requested a review from ehanson8 May 7, 2024 13:19
Comment on lines +54 to +61
"""Return single parent Task Target, raise error if multiple parent Tasks.

When used, this convenience method also helps reason about Tasks. Quite often, a
Task is only expecting a single parent Task that will feed it data. In these
scenarios, using self.single_input is not only convenient, but also codifies in
code this assumption. If this Task were to receive multiple inputs in the future
this method would then throw an error.
"""
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perfect! Very understandable

Comment on lines +70 to +71
This is useful when a Task has multiple parent Tasks, to easily and precisely
access a specific parent Task's output.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, that is a very clear example!

Copy link

@jonavellecuerdo jonavellecuerdo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! Just one small comment re: updating a class docstring. The conversations between you and @ehanson8 were helpful for review. :)

hrqb/base/target.py Show resolved Hide resolved
hrqb/base/task.py Show resolved Hide resolved
@ghukill ghukill merged commit 037043f into main May 7, 2024
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants