-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Numericalizer abc #227
base: master
Are you sure you want to change the base?
Numericalizer abc #227
Conversation
podium/preproc/numericalizer_abc.py
Outdated
def _finalize(self): | ||
# Subclasses should override this method to add custom | ||
# finalization logic | ||
pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Prefixing the public method with an underscore doesn't seem right. We should handle the finalization somehow internally (overriding self.finalize and calling it sets the _finalized
flag to True). Or we can give the _finalize
a more appropriate name that doesn't start with an underscore and leave the finalize
as-is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea to switch them around is good IMO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since there are tests that call this function, and the user may want to update and finalize a numericalizer before passing it to the Field
(can't think of an concrete example, but I think I saw an usecase like that), using _finalize
to call this seems a bit clumsy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO, to avoid confusion, the best approach would be to leave finalize
as the overridable default function and add something like a mark_finalized
function, which the finalize
function has to call in the end. I'd avoid fiddling with metaclasses and just raise an informative exception to the user about the call to mark_finalized
.
podium/preproc/numericalizer_abc.py
Outdated
def update(self, tokens: List[str]) -> None: | ||
pass | ||
|
||
def finalize(self) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does a simple custom function need this property? If not, does it make sense to transfer it here in the first place? Same goes for the eager property.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the general case, all numericalizers can be finalizable and eager. If subclasses don't need finalization, they don't have to override finalize
and nothing happens. There must be an API for the general case.
Remember, simple functional numericalizers in Field
get wrapped in a wrapper that takes care of the compatibility with NumericalizerABC
, so it's transparent to a user that wants to just use simple functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe I understand. But, my concern is that right now there is only a single subclass that uses finalization
and eagerness and that dictates the API.
I'm fine with the wrapping when there is a custom mapping.
# Conflicts: # podium/datasets/__init__.py # podium/datasets/dataset.py # podium/datasets/dataset_abc.py # podium/preproc/__init__.py # podium/storage/vocab.py
# Conflicts: # podium/storage/field.py
Ready for merge. The only changes from the last review round are the addition of |
import numpy as np | ||
|
||
|
||
class NumericalizerABC(ABC): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would rename to Numericalizer, to follow naming conventions in collections.abc
@@ -205,6 +207,16 @@ def remove_pretokenize_hooks(self): | |||
self._pretokenization_pipeline.clear() | |||
|
|||
|
|||
class NumericalizerCallableWrapper(NumericalizerABC): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NumericalizerCallableWrapper
seems too verbose. Plus, since this class is used internally, it makes sense to prefix its name with an underscore.
numericalization. | ||
""" | ||
|
||
def __init__(self, eager=True): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Type hint for eager is missing (eager: bool = True
)
err_msg = ( | ||
f"Field {name}: unsupported numericalizer type " | ||
f'"{type(numericalizer).__name__}"' | ||
) | ||
raise TypeError(err_msg) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would put this error message directly inside the TypeError.
def mark_finalized(self) -> None: | ||
"""Marks the field as finalized. This method must be called after finalization | ||
completes successfully.""" | ||
self._finalized = True | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Honestly, not a huge fan of mark_finalized
, but let's see what @mttk and @FilipBolt have to say.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd stay with finalize
as the method name to keep things consistent and make the internal variable: is_finalized
. Mark finalized feels redundant.
As discussed on the regular meeting, just leaving a note here that this is currently on hold. |
This is a WIP of the new
NumericalizerABC
interface used forVocab
and other vectorizers.The existing codebase required relatively minor changes to enable this.
Since this is WIP at the moment, it still requires implementing this interface in the existing vectorizers and handling the TODOs left in the code.
I'm creating this pull request so we can review and improve the general idea before any more serious work is done.
The interface itself is called
NumericalizerABC
and is contained in thenumericalizer_abc.py
file in thepreproc
package.Be aware that this PR is based on the
dataset-abc
branch, so the diff shows changes from both branches. Setshow changes from black correction
when inspecting files changed to avoid irrelevant changes.Focus on the
NumericalizerABC
,Field
andVocab
classes.