Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Numericalizer abc #227

Open
wants to merge 28 commits into
base: master
Choose a base branch
from
Open

Numericalizer abc #227

wants to merge 28 commits into from

Conversation

ivansmokovic
Copy link
Collaborator

@ivansmokovic ivansmokovic commented Nov 5, 2020

This is a WIP of the new NumericalizerABC interface used for Vocab and other vectorizers.
The existing codebase required relatively minor changes to enable this.

Since this is WIP at the moment, it still requires implementing this interface in the existing vectorizers and handling the TODOs left in the code.

I'm creating this pull request so we can review and improve the general idea before any more serious work is done.

The interface itself is called NumericalizerABC and is contained in the numericalizer_abc.py file in the preproc package.

Be aware that this PR is based on the dataset-abc branch, so the diff shows changes from both branches. Set show changes from black correction when inspecting files changed to avoid irrelevant changes.

Focus on the NumericalizerABC, Field and Vocab classes.

@ivansmokovic ivansmokovic added the feature New feature or request label Nov 5, 2020
@ivansmokovic ivansmokovic self-assigned this Nov 5, 2020
Comment on lines 16 to 19
def _finalize(self):
# Subclasses should override this method to add custom
# finalization logic
pass
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prefixing the public method with an underscore doesn't seem right. We should handle the finalization somehow internally (overriding self.finalize and calling it sets the _finalized flag to True). Or we can give the _finalize a more appropriate name that doesn't start with an underscore and leave the finalize as-is.

Copy link
Collaborator Author

@ivansmokovic ivansmokovic Nov 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea to switch them around is good IMO.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since there are tests that call this function, and the user may want to update and finalize a numericalizer before passing it to the Field (can't think of an concrete example, but I think I saw an usecase like that), using _finalize to call this seems a bit clumsy.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, to avoid confusion, the best approach would be to leave finalize as the overridable default function and add something like a mark_finalized function, which the finalize function has to call in the end. I'd avoid fiddling with metaclasses and just raise an informative exception to the user about the call to mark_finalized.

def update(self, tokens: List[str]) -> None:
pass

def finalize(self) -> None:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does a simple custom function need this property? If not, does it make sense to transfer it here in the first place? Same goes for the eager property.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the general case, all numericalizers can be finalizable and eager. If subclasses don't need finalization, they don't have to override finalize and nothing happens. There must be an API for the general case.

Remember, simple functional numericalizers in Field get wrapped in a wrapper that takes care of the compatibility with NumericalizerABC, so it's transparent to a user that wants to just use simple functions.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe I understand. But, my concern is that right now there is only a single subclass that uses finalization and eagerness and that dictates the API.

I'm fine with the wrapping when there is a custom mapping.

podium/datasets/dataset_abc.py Outdated Show resolved Hide resolved
# Conflicts:
#	podium/datasets/__init__.py
#	podium/datasets/dataset.py
#	podium/datasets/dataset_abc.py
#	podium/preproc/__init__.py
#	podium/storage/vocab.py
@ivansmokovic
Copy link
Collaborator Author

ivansmokovic commented Dec 10, 2020

Ready for merge. The only changes from the last review round are the addition of mark_finalized and documentation.

import numpy as np


class NumericalizerABC(ABC):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would rename to Numericalizer, to follow naming conventions in collections.abc

@@ -205,6 +207,16 @@ def remove_pretokenize_hooks(self):
self._pretokenization_pipeline.clear()


class NumericalizerCallableWrapper(NumericalizerABC):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NumericalizerCallableWrapper seems too verbose. Plus, since this class is used internally, it makes sense to prefix its name with an underscore.

numericalization.
"""

def __init__(self, eager=True):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Type hint for eager is missing (eager: bool = True)

Comment on lines +341 to +345
err_msg = (
f"Field {name}: unsupported numericalizer type "
f'"{type(numericalizer).__name__}"'
)
raise TypeError(err_msg)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would put this error message directly inside the TypeError.

Comment on lines +74 to +78
def mark_finalized(self) -> None:
"""Marks the field as finalized. This method must be called after finalization
completes successfully."""
self._finalized = True

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly, not a huge fan of mark_finalized, but let's see what @mttk and @FilipBolt have to say.

Copy link
Collaborator

@FilipBolt FilipBolt Dec 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd stay with finalize as the method name to keep things consistent and make the internal variable: is_finalized. Mark finalized feels redundant.

@FilipBolt
Copy link
Collaborator

As discussed on the regular meeting, just leaving a note here that this is currently on hold.

@mariosasko mariosasko mentioned this pull request Dec 27, 2020
10 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants