Refactor and simplify TokenGazetteer #109

Open
johann-petrak opened this issue May 17, 2021 · 2 comments

Comments

@johann-petrak
Collaborator

The constructor is currently very complex.
We need a way to specify everything that can be done or decided at init time that is easier to understand.

One part of constructing the gazetteer is handling the gazetteer list(s): unless the lists are already tokenized, the entries need to be tokenized. This is also necessary when using legacy Java GATE def/lst files.
Instead of doing this automatically, let's separate out that task and accept only already tokenized gazetteer lists at initialization (once created, they can be pickled for fast loading later).
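
A minimal sketch of that separation, assuming nothing about the eventual API: the helper `tokenize_entries` and the whitespace tokenizer are hypothetical stand-ins, and the tokenized list is simply pickled and loaded back with the standard library.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Callable, List


def tokenize_entries(
    entries: List[str], tokenizer: Callable[[str], List[str]]
) -> List[List[str]]:
    """Tokenize every gazetteer entry once, before any annotator is built.

    Hypothetical helper, not part of the existing code base.
    """
    return [tokenizer(entry) for entry in entries]


def whitespace_tokenizer(text: str) -> List[str]:
    """Stand-in tokenizer; a real one should match the document tokenizer."""
    return text.split()


entries = ["New York", "Los Angeles", "Vienna"]
tokenized = tokenize_entries(entries, whitespace_tokenizer)

# Pickle the tokenized list so later runs can skip tokenization entirely.
with tempfile.TemporaryDirectory() as tmpdir:
    path = Path(tmpdir) / "gazetteer.pickle"
    with open(path, "wb") as outfp:
        pickle.dump(tokenized, outfp)
    with open(path, "rb") as infp:
        reloaded = pickle.load(infp)
```

The gazetteer itself would then only ever see `reloaded`, so the tokenization step stays out of its constructor.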

@johann-petrak
Collaborator Author

We need to distinguish between the data structure and the annotator, maybe something like this:

# build the gazetteer data in one of these ways
tokdata = SimpleTokenGazetteerData.from_gate_def("somefile.def", tokenizer=sometok, case_sensitive=False)
tokdata = SimpleTokenGazetteerData("ownpickledformatfile")
tokdata = SimpleTokenGazetteerData.from_string_list(stringlist, tokenizer=sometok)
# then create the annotator from the data
annotator = SimpleTokenGazetteerAnnotator(tokdata, outset=someset, ...)

It is kind of ugly that this would mean each gazetteer annotator needs its own gazetteer data class. In theory we could use some kind of class nesting, but that would cause problems with type hinting, as the needed types are not defined for methods that use the nested class.
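
The split between data structure and annotator could be sketched roughly as follows. Everything here is an assumption for illustration: the class names `TokenGazetteerData` and `TokenGazetteerAnnotator`, the `longest_match` method, and the plain token-list input are hypothetical and much simpler than a real document-based annotator.

```python
from typing import Callable, Dict, List, Optional, Tuple

Tokenizer = Callable[[str], List[str]]


class TokenGazetteerData:
    """Hypothetical container that only stores tokenized entries."""

    def __init__(self, case_sensitive: bool = True):
        self.case_sensitive = case_sensitive
        self.entries: Dict[Tuple[str, ...], dict] = {}

    def _norm(self, token: str) -> str:
        return token if self.case_sensitive else token.lower()

    def add(self, tokens: List[str], features: Optional[dict] = None) -> None:
        self.entries[tuple(self._norm(t) for t in tokens)] = features or {}

    @classmethod
    def from_string_list(
        cls, strings: List[str], tokenizer: Tokenizer, case_sensitive: bool = True
    ) -> "TokenGazetteerData":
        data = cls(case_sensitive=case_sensitive)
        for string in strings:
            data.add(tokenizer(string))
        return data

    def longest_match(self, tokens: List[str], start: int) -> int:
        """Return the length of the longest entry starting at `start`, or 0."""
        best = 0
        max_len = max((len(key) for key in self.entries), default=0)
        for length in range(1, min(max_len, len(tokens) - start) + 1):
            key = tuple(self._norm(t) for t in tokens[start:start + length])
            if key in self.entries:
                best = length
        return best


class TokenGazetteerAnnotator:
    """Hypothetical annotator that only consumes ready-made gazetteer data."""

    def __init__(self, data: TokenGazetteerData, anntype: str = "Lookup"):
        self.data = data
        self.anntype = anntype

    def __call__(self, tokens: List[str]) -> List[Tuple[int, int, str]]:
        """Return non-overlapping (start, end, type) token spans, longest first."""
        matches = []
        i = 0
        while i < len(tokens):
            length = self.data.longest_match(tokens, i)
            if length:
                matches.append((i, i + length, self.anntype))
                i += length
            else:
                i += 1
        return matches


data = TokenGazetteerData.from_string_list(
    ["New York", "York"], tokenizer=str.split, case_sensitive=False
)
annotator = TokenGazetteerAnnotator(data)
matches = annotator("I love new york and York".split())
```

With this shape the annotator constructor takes only the data object and output settings, while all list loading and tokenization lives in the data class.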

@johann-petrak
Collaborator Author

OK, this is a bit more complex: we will eventually need a standard way of saving and serializing Annotators, and for that a single constructor is better. Also, in most cases we really only need to save a few parameters.
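
One way a single constructor could help with serialization, sketched under assumptions: the base class `SerializableAnnotator`, its `to_dict`/`from_dict` methods, and the parameter names are all hypothetical and not an existing API. The idea is that the few constructor parameters are recorded and are enough to re-create the instance.

```python
import json
from typing import Any, Dict


class SerializableAnnotator:
    """Hypothetical base class: remembers constructor kwargs for re-creation."""

    def __init__(self, **init_kwargs: Any):
        self._init_kwargs = dict(init_kwargs)

    def to_dict(self) -> Dict[str, Any]:
        return {"class": type(self).__name__, "kwargs": self._init_kwargs}

    @classmethod
    def from_dict(cls, saved: Dict[str, Any]) -> "SerializableAnnotator":
        # A single constructor means the saved kwargs fully define the instance.
        return cls(**saved["kwargs"])


class GazetteerAnnotator(SerializableAnnotator):
    """Hypothetical annotator with one constructor and few parameters."""

    def __init__(self, source: str, outset: str = "", case_sensitive: bool = True):
        super().__init__(source=source, outset=outset, case_sensitive=case_sensitive)
        self.source = source
        self.outset = outset
        self.case_sensitive = case_sensitive


annotator = GazetteerAnnotator("cities.pickle", outset="Gaz", case_sensitive=False)
saved = json.dumps(annotator.to_dict())
restored = GazetteerAnnotator.from_dict(json.loads(saved))
```

Alternative constructors such as `from_...` class methods would not fit this scheme as directly, since the saved parameters would no longer map onto one call signature.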

Also, we should maybe not have a default tokenizer, to make it clearer that the gazetteer list tokenizer must match the document tokenizer.
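
Dropping the default could look like the following sketch, where `TokenizedList` is a hypothetical class and the tokenizer is a required keyword-only argument, so forgetting it fails immediately instead of silently using a mismatched tokenizer.

```python
from typing import Callable, List


class TokenizedList:
    """Hypothetical gazetteer list that refuses to guess a tokenizer."""

    def __init__(self, entries: List[str], *, tokenizer: Callable[[str], List[str]]):
        # No default value: the caller must pass the same tokenizer
        # that is used for the documents.
        self.tokenizer = tokenizer
        self.entries = [tokenizer(entry) for entry in entries]


ok = TokenizedList(["New York"], tokenizer=str.split)

try:
    TokenizedList(["New York"])  # tokenizer omitted on purpose
    missing_tokenizer_rejected = False
except TypeError:
    missing_tokenizer_rejected = True
```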
