Skip to content

Conversation

adrien-berchet
Copy link
Member

Proposal to improve the propagation of the parameters from the workflow to the subtasks. Before this PR, the parameters were updated after the task was registered by luigi, which could lead to some inconsistencies.

BREAKING CHANGE: The old API is deprecated. The subtask should be given by type instead of instance.

BREAKING CHANGE: The old API is deprecated.
@adrien-berchet adrien-berchet marked this pull request as draft January 3, 2022 15:27
@adrien-berchet
Copy link
Member Author

TODO: update the doc accordingly if this proposal is accepted.

@codecov
Copy link

codecov bot commented Jan 3, 2022

Codecov Report

Merging #10 (3eb449e) into main (750f9fb) will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff            @@
##              main       #10   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files            7         7           
  Lines          696       711   +15     
  Branches       133       133           
=========================================
+ Hits           696       711   +15     
Flag Coverage Δ
pytest 100.00% <100.00%> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
data_validation_framework/task.py 100.00% <100.00%> (ø)

@adrien-berchet adrien-berchet marked this pull request as ready for review January 3, 2022 15:47
@arnaudon
Copy link
Collaborator

arnaudon commented Jan 3, 2022

nice! So it will just be matter of removing () in the codes depending on that tool, right?


def inputs(self):
return {ValidationTask1(): {"col_name": "new_col_name_in_current_task"}}
return {ValidationTask1: {"col_name": "new_col_name_in_current_task"}}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've always wondered if we don't pass a dict, but a str, if we can assume both key/values are the same? it may be to confusing, not sure

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I don't know. It could be list of str but if one want to change one column name it might be even more verbose in the end. So I don't know but if you have any idea of an API for this we can try.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ooooh I had an idea tonight! Maybe we could split the process: we could have a step to rename the columns of the input dataframe and then the second step imports the columns from the input tasks. So in basic case we would have this:

    class MyTask(ElementValidationTask):
        validation_function = staticmethod(my_validation_function)
        output_columns = {'my_new_col': None}

        def inputs(self):
            return {
                PreviousTask: ["input_data"]
            }

And if we want to rename a column, we use a new mapping like this:

    class MyTask(ElementValidationTask):
        validation_function = staticmethod(my_validation_function)
        output_columns = {"my_new_col": None}
        rename_columns = {"input_data": "input_data_from_dataframe"}

        def inputs(self):
            return {
                PreviousTask: ["input_data"]
            }

What do you think?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah but there is an issue with this: we can't deal with two inputs with same column names.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes yes, forget it, I think it's fine as it is!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I am afraid we would have to to something quite confusing, like adding the column renaming feature as proposed above but still allow to rename the columns of the inputs. This could lead to something like this:

    class MyTask(ElementValidationTask):
        validation_function = staticmethod(my_validation_function)
        output_columns = {"my_new_col": None}
        rename_columns = {"input_data": "input_data_from_dataframe"}

        def inputs(self):
            return {
                PreviousTaskA: ["input_data", "other_col_A"],
                PreviousTaskB: {"input_data": "input_data_from_task_B", "other_col_B": "other_col_B"},
            }

Which might be even more complicated in the end, though it would be simpler in simple cases.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok :)

@adrien-berchet
Copy link
Member Author

nice! So it will just be matter of removing () in the codes depending on that tool, right?

Yeah, for simple cases it should just change this. If one was passing arguments to the constructor the change is a bit bigger but I think nobody does this yet :)

@adrien-berchet adrien-berchet merged commit c10cadb into main Jan 4, 2022
@adrien-berchet adrien-berchet deleted the new_input_api branch January 4, 2022 08:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants