Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rename PretokenizationPipeline and PosttokenizationPipeline #244

Closed
FilipBolt opened this issue Dec 18, 2020 · 7 comments · Fixed by #300
Closed

Rename PretokenizationPipeline and PosttokenizationPipeline #244

FilipBolt opened this issue Dec 18, 2020 · 7 comments · Fixed by #300

Comments

@FilipBolt
Copy link
Collaborator

PosttokenizationPipeline and PretokenizationPipeline sounds too similar to pipeline.py module and should be renamed to be more related to hooks.

Suggestions:

  • PretokenizationHookManager / PosttokenizationHookManager
  • PretokenizationQueue / PosttokenizationQueue

Further, hook code could be extracted to a separate module and unified using a general HookQueue interface that defines add_hook and clear methods to avoid copying of those.

@mariosasko
Copy link
Collaborator

mariosasko commented Dec 18, 2020

I don't like the names ending with Manager because the logic that these components implement is too simple.

I think it's best to remove PosttokenizationPipeline and PretokenizationPipeline altogether and transfer all the logic inside Field and MultioutputField. Rn, Field.{add, remove}_hook just delegate the calls and it's fine to have a few lines of logic there.

The bigger issue is that MultioutputField is not a subclass of Field, so we are duplicating some logic.

@FilipBolt
Copy link
Collaborator Author

True, I would also like for MultioutputField to be somehow unified with Field (perhaps not directly as MultioutputField contains Fields).

I believe we had some offline discussion on this, and the way forward is to separate hooks out (and potentially rename them to some name, undecided yet), and leave Field and MultioutputField as it for the time being (as they require a larger overhaul).

@ivansmokovic
Copy link
Collaborator

I don't think MultioutputField should be looked at as a Field, as it's more or less just a wrapper to signify special handling during preprocessing.

@ivansmokovic
Copy link
Collaborator

Maybe simplify MultioutputField down to the bone, making it just a holder for fields/tokenizers and moving the logic into Field. This would allow us to remove the annoying [(name, (raw, tokenized))] return type for preprocess.

@mttk
Copy link
Member

mttk commented Dec 23, 2020

Why not just do the python thing and drop all suffixes, going with Pretokenization, Posttokenization?

@mariosasko
Copy link
Collaborator

Actually, I think the naming is not that relevant in this case. Both PosttokenizationPipeline and PretokenizationPipeline are only used internally by Field so keeping them private is the way to go imo.

@mttk
Copy link
Member

mttk commented Jan 8, 2021

I'd agree with @mariosasko 's previous comment as they are only used internally. We can underscore them and be OK with this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants