HRQB 13 - Refactor base Tasks #11

ghukill · 2024-05-06T19:46:28Z

Purpose and background context

While working on building out actual chains of Tasks in pipelines for the CLI, it was determined that it would be beneficial to programmatically set the output path for Task Targets.

By setting these dynamically, each Task will have a predictable output filename. Additionally, different pipelines can use the same Task without naming collisions in their Targets (which should be different), as all Tasks will now have a pipeline parameter that is used in the output filename. Lastly, by setting a base TARGETS_DIRECTORY env var, we can also control where targets are written (it defaults to the output/ directory).

The most relevant change in this PR is HRQBTask.path.

This also presented an opportunity to rework the tests that exercise the base Tasks. They now use actual fixtures of fictional Tasks, which simplifies the tests and will also be helpful for testing pipelines.

In summary: this refactoring smooths out a couple of awkward edges around base Tasks that implementing pipelines and CLI commands revealed. The majority of changes are related to updated tests.

How can a reviewer manually see the effects of these changes?

The most telling change can be seen by initializing a Task and noting that its path is now dynamic.

import os
from tests.fixtures.tasks.extract import ExtractAnimalNames

# init Task
task = ExtractAnimalNames(
    pipeline="Animals",
)

# show dynamically built path
task.path
# Out[3]: 'output/Animals__Extract__ExtractAnimalNames.pickle'

# modify/set TARGETS_DIRECTORY to see how it drives the path
os.environ['TARGETS_DIRECTORY'] = "/tmp/special/place"
task.path
# Out[6]: '/tmp/special/place/Animals__Extract__ExtractAnimalNames.pickle'
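For readers without the repository checked out, the behavior above can be approximated with a minimal sketch. This is not the actual HRQBTask implementation: the SketchTask class, its stage attribute, and the fixed ".pickle" extension are assumptions made for illustration. Only the TARGETS_DIRECTORY env var, the output/ default, and the pipeline__stage__name filename shape come from this PR.

```python
import os


class SketchTask:
    """Hypothetical stand-in for a base Task; not the real HRQBTask."""

    stage = "Extract"  # assumed class-level attribute for this sketch

    def __init__(self, pipeline):
        self.pipeline = pipeline

    @property
    def path(self):
        # Read the env var on every access so changes take effect immediately.
        targets_directory = os.environ.get("TARGETS_DIRECTORY", "output")
        # Join pipeline, stage, and task name with a double underscore.
        filename = "__".join([self.pipeline, self.stage, type(self).__name__])
        return os.path.join(targets_directory, filename + ".pickle")


class ExtractAnimalNames(SketchTask):
    """Same name as the test fixture, but defined here only for the sketch."""


task = ExtractAnimalNames(pipeline="Animals")
print(task.path)
```

Because the pipeline name is part of the filename, the same Task class used in a second pipeline would write to a different Target path.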

Includes new or updated dependencies?

NO

Changes expectations for external applications?

NO

What are the relevant tickets?

https://mitlibraries.atlassian.net/browse/HRQB-13

Developer

All new ENV is documented in README
All new ENV has been added to staging and production environments
All related Jira tickets are linked in commit message(s)
Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

The commit message is clear and follows our guidelines (not just this PR message)
There are appropriate tests covering any new functionality
The provided documentation is sufficient for understanding any new functionality introduced
Any manual tests have been performed or provided examples verified
New dependencies are appropriate or there were no changes

Why these changes are being introduced:

While working on building out actual chains of Tasks in Pipelines, it was determined that it would be beneficial to programmatically set the Target output paths for Tasks. To support this, a pipeline name will get passed throughout that helps name these artifacts. This will allow a single Task to get reused in different pipelines, where the output Target will have a filename unique to the pipeline. With these changes, it was also determined that testing could be simplified to use some Tasks as fixtures.

How this addresses that need:
* HRQBTask.path is a dynamic property that constructs a path based on a configurable targets directory, pipeline name, task stage, and task name
* Task fixtures were created for extract, transform, load, and pipeline
* PandasPickleTarget is limited to only returning DataFrames as it was determined that Series are likely not going to get used

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/HRQB-13

ghukill · 2024-05-06T19:50:51Z

hrqb/base/task.py

+            output/LookupTablesPipeline__load__UpsertJobTitles.pickle
+        """
+        filename = (
+            "__".join(  # noqa: FLY002


The double underscore __ is used to support splitting on this filename if needed, since a single underscore could show up in components of the name. Splitting on the double underscore would give the:

pipeline name

pipeline stage

task name
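The splitting described above could look roughly like the sketch below. The helper name split_target_filename is hypothetical and does not exist in the codebase. It assumes exactly three components and a single file extension.

```python
from pathlib import PurePosixPath


def split_target_filename(path):
    """Hypothetical helper: recover (pipeline, stage, task name) from a Target path."""
    # Drop the directory and the ".pickle" extension, keeping only the stem.
    stem = PurePosixPath(path).stem
    # Single underscores inside a component survive; only "__" separates parts.
    pipeline, stage, task_name = stem.split("__")
    return pipeline, stage, task_name


print(split_target_filename("output/LookupTablesPipeline__load__UpsertJobTitles.pickle"))
print(split_target_filename("output/Lookup_Tables__load__Upsert_Job_Titles.pickle"))
```

One caveat of this approach: a component that itself contains a double underscore would produce more than three parts, and the unpacking would raise a ValueError.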

ehanson8

Looks good overall but a few questions

ehanson8 · 2024-05-06T21:35:05Z

hrqb/base/target.py

-    def write(self, panda_object: PandasObject) -> None:
-        panda_object.to_pickle(self.path)
+    def write(self, df: pd.DataFrame) -> None:
+        df.to_pickle(self.path)


Optional: I know df is a convention but I always feel full names are preferable to abbreviations

In a coming PR, you'll see that I agree and have some fixtures like colors_df, names_df, animals_df, etc. But I think that in a context where there is truly only one DataFrame in use, the convention of df is okay, particularly when nothing else is known about the DataFrame.

ehanson8 · 2024-05-06T21:38:07Z

hrqb/base/task.py

+            output/LookupTablesPipeline__load__UpsertJobTitles.pickle
+        """
+        filename = (
+            "__".join(  # noqa: FLY002


I would recommend noting the double underscore __ in the docstring because it's not easily distinguishable from a single underscore at a glance

ehanson8 · 2024-05-06T21:53:44Z

hrqb/base/task.py

+        This is useful when a Task has multiple parent Tasks, to easily and precisely
+        access a specific parent Task's output.


I'm admittedly still confused on the single vs multiple parent tasks but not sure if it's worth adding more documentation or if it will become clearer in the next few PRs

There is an example coming in the next PR.

Here is a sneak peek at that, where the Task PrepareAnimals has input from two Tasks, ExtractAnimalColors and ExtractAnimalNames:

├── COMPLETE: AnimalsDebug()
├── COMPLETE: LoadAnimalsDebug(pipeline=AnimalsDebug, stage=Load, table_name=Animals)
├── COMPLETE: PrepareAnimals(pipeline=AnimalsDebug, stage=Transform, table_name=Animals)
├── COMPLETE: ExtractAnimalColors(table_name=, pipeline=AnimalsDebug, stage=Extract)
├── COMPLETE: ExtractAnimalNames(table_name=, pipeline=AnimalsDebug, stage=Extract)

A more realistic example might be getting job data from the data warehouse and supervisor information from HR data, where a single Task then joins that information.
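To make the multiple-parent case concrete, here is a rough sketch of the "jobs plus supervisors" scenario using plain Python data in place of DataFrames. Every name in it (InMemoryTarget, SketchTask, inputs_by_name, and the three Task classes) is hypothetical and chosen for illustration. The real base Task may expose parent outputs differently.

```python
class InMemoryTarget:
    """Hypothetical Target that keeps a Task's output in memory."""

    def __init__(self, rows):
        self.rows = rows

    def read(self):
        return self.rows


class SketchTask:
    """Hypothetical base Task; parents are listed explicitly for simplicity."""

    parents = ()

    def run(self):
        raise NotImplementedError

    def output(self):
        return InMemoryTarget(self.run())

    @property
    def inputs_by_name(self):
        # Key each parent's Target by the parent's class name for precise access.
        return {type(parent).__name__: parent.output() for parent in self.parents}


class ExtractJobs(SketchTask):
    def run(self):
        return [{"employee": "a", "job": "Librarian"}, {"employee": "b", "job": "Archivist"}]


class ExtractSupervisors(SketchTask):
    def run(self):
        return {"a": "s1", "b": "s2"}


class JoinJobsAndSupervisors(SketchTask):
    parents = (ExtractJobs(), ExtractSupervisors())

    def run(self):
        jobs = self.inputs_by_name["ExtractJobs"].read()
        supervisors = self.inputs_by_name["ExtractSupervisors"].read()
        # Attach each employee's supervisor to their job record.
        return [{**job, "supervisor": supervisors[job["employee"]]} for job in jobs]


joined = JoinJobsAndSupervisors().output().read()
print(joined)
```

Looking up a parent by name, rather than by position, keeps the joining Task readable when the number or order of parents changes.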

ehanson8 · 2024-05-07T14:36:40Z

hrqb/base/task.py

+        """Return single parent Task Target, raise error if multiple parent Tasks.
+
+        When used, this convenience method also helps reason about Tasks.  Quite often, a
+        Task is only expecting a single parent Task that will feed it data.  In these
+        scenarios, using self.single_input is not only convenient, but also codifies in
+        code this assumption.  If this Task were to receive multiple inputs in the future
+        this method would then throw an error.
+        """


Perfect! Very understandable
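The idea in that docstring can be sketched as follows. Only the name single_input comes from the diff above. The SketchTask class, the parent_targets attribute, the ValueError type, and the choice to also reject zero parents are assumptions made for this example.

```python
class SketchTask:
    """Hypothetical Task holding the Targets produced by its parent Tasks."""

    def __init__(self, parent_targets):
        self.parent_targets = list(parent_targets)

    @property
    def single_input(self):
        # Codify the assumption that exactly one parent Task feeds this Task.
        if len(self.parent_targets) != 1:
            message = f"Expected exactly one parent Task, found {len(self.parent_targets)}"
            raise ValueError(message)
        return self.parent_targets[0]


one_parent = SketchTask(["names.pickle"])
print(one_parent.single_input)

two_parents = SketchTask(["names.pickle", "colors.pickle"])
try:
    two_parents.single_input
except ValueError as error:
    print(error)
```

The benefit is that a Task written against one parent fails loudly if a second parent is added later, instead of silently reading only the first input.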

ehanson8 · 2024-05-07T14:37:40Z

hrqb/base/task.py

+        This is useful when a Task has multiple parent Tasks, to easily and precisely
+        access a specific parent Task's output.


Thank you, that is a very clear example!

jonavellecuerdo

Looks good to me! Just one small comment re: updating a class docstring. The conversations between you and @ehanson8 were helpful for review. :)

ghukill requested review from ehanson8 and jonavellecuerdo May 6, 2024 19:46

ghukill commented May 6, 2024

ehanson8 reviewed May 6, 2024

Additional docstrings from review feedback

3a9e289

ghukill requested a review from ehanson8 May 7, 2024 13:19

ehanson8 approved these changes May 7, 2024

jonavellecuerdo approved these changes May 7, 2024

Update docstrings for DataFrame only

ecc971b

ghukill merged commit 037043f into main May 7, 2024
4 checks passed
