-
Notifications
You must be signed in to change notification settings - Fork 1
New API for inputs #10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
BREAKING CHANGE: The old API is deprecated.
TODO: update the doc accordingly if this proposal is accepted. |
Codecov Report
@@ Coverage Diff @@
## main #10 +/- ##
=========================================
Coverage 100.00% 100.00%
=========================================
Files 7 7
Lines 696 711 +15
Branches 133 133
=========================================
+ Hits 696 711 +15
Flags with carried forward coverage won't be shown. Click here to find out more.
|
bc69b30
to
3eb449e
Compare
nice! So it will just be matter of removing () in the codes depending on that tool, right? |
|
||
def inputs(self): | ||
return {ValidationTask1(): {"col_name": "new_col_name_in_current_task"}} | ||
return {ValidationTask1: {"col_name": "new_col_name_in_current_task"}} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've always wondered if we don't pass a dict, but a str, if we can assume both key/values are the same? it may be to confusing, not sure
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I don't know. It could be list of str but if one want to change one column name it might be even more verbose in the end. So I don't know but if you have any idea of an API for this we can try.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ooooh I had an idea tonight! Maybe we could split the process: we could have a step to rename the columns of the input dataframe and then the second step imports the columns from the input tasks. So in basic case we would have this:
class MyTask(ElementValidationTask):
validation_function = staticmethod(my_validation_function)
output_columns = {'my_new_col': None}
def inputs(self):
return {
PreviousTask: ["input_data"]
}
And if we want to rename a column, we use a new mapping like this:
class MyTask(ElementValidationTask):
validation_function = staticmethod(my_validation_function)
output_columns = {"my_new_col": None}
rename_columns = {"input_data": "input_data_from_dataframe"}
def inputs(self):
return {
PreviousTask: ["input_data"]
}
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah but there is an issue with this: we can't deal with two inputs with same column names.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes yes, forget it, I think it's fine as it is!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I am afraid we would have to to something quite confusing, like adding the column renaming feature as proposed above but still allow to rename the columns of the inputs. This could lead to something like this:
class MyTask(ElementValidationTask):
validation_function = staticmethod(my_validation_function)
output_columns = {"my_new_col": None}
rename_columns = {"input_data": "input_data_from_dataframe"}
def inputs(self):
return {
PreviousTaskA: ["input_data", "other_col_A"],
PreviousTaskB: {"input_data": "input_data_from_task_B", "other_col_B": "other_col_B"},
}
Which might be even more complicated in the end, though it would be simpler in simple cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok :)
Yeah, for simple cases it should just change this. If one was passing arguments to the constructor the change is a bit bigger but I think nobody does this yet :) |
Proposal to improve the propagation of the parameters from the workflow to the subtasks. Before this PR, the parameters were updated after the task was registered by luigi, which could lead to some inconsistencies.
BREAKING CHANGE: The old API is deprecated. The subtask should be given by type instead of instance.