Refactor and simplify TokenGazetteer #109

johann-petrak · 2021-05-17T09:52:31Z

The constructor is very complex right now.
We need a way to specify everything that can be done or decided at init time that is easier to understand.

One part of constructing the gazetteer is dealing with the gazetteer list(s): unless we already have tokenized lists, the entries need to be tokenized. This is also necessary when using legacy Java GATE def/lst files.
Instead of doing this automatically, let's separate out that task and only accept already tokenized gazetteer lists when initializing (once those are created, they can be pickled for fast loading later).
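
A minimal sketch of what "tokenize once, pickle, load later" could look like, using only the standard library. The helper `tokenize_entries` and the whitespace tokenizer are hypothetical names made up for illustration and are not part of the existing code; a real setup would plug in the same tokenizer that is used for the documents.

```python
import pickle
import tempfile
from pathlib import Path


def whitespace_tokenizer(text):
    # Hypothetical stand-in for whatever tokenizer the documents use.
    return text.split()


def tokenize_entries(entries, tokenizer, case_sensitive=True):
    """Turn raw gazetteer strings into tuples of tokens, once, ahead of time."""
    tokenized = []
    for entry in entries:
        tokens = tokenizer(entry)
        if not case_sensitive:
            tokens = [t.lower() for t in tokens]
        if tokens:  # skip entries that produce no tokens, e.g. blank lines
            tokenized.append(tuple(tokens))
    return tokenized


entries = ["New York", "Vienna", "  ", "San Francisco Bay"]
tokenized = tokenize_entries(entries, whitespace_tokenizer, case_sensitive=False)

# Pickle the tokenized list so later runs can skip tokenization entirely.
path = Path(tempfile.mkdtemp()) / "gazetteer.pickle"
with open(path, "wb") as f:
    pickle.dump(tokenized, f)
with open(path, "rb") as f:
    reloaded = pickle.load(f)
```

With this split, the gazetteer constructor would only ever see something like `reloaded`, and the tokenizer choice is made explicitly in one place.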

johann-petrak · 2021-05-18T08:39:28Z

We need to distinguish between the data structure and the annotator, maybe something like this:

# one of
tokdata = SimpleTokenGazetteerData.from_gate_def("somefile.def", tokenizer=sometok, case_sensitive=False)
tokdata = SimpleTokenGazetteerData("ownpickledformatfile")
tokdata = SimpleTokenGazetteerData.from_string_list(stringlist, tokenizer=sometok)
# then
annotator = SimpleTokenGazetteerAnnotator(tokdata, outset=someset, ...)

It is kind of ugly that this would mean each gazetteer annotator needs its own gazetteer data class. In theory we could use some kind of class nesting, but that would cause problems with type hinting, as the needed types are not defined for methods that use the nested class.
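
To make the data/annotator split concrete, here is a rough, self-contained sketch. The class names are taken from the proposal in this comment, but everything inside them (the dict of token tuples, the `find_matches` method, the longest-match strategy) is an assumption for illustration only, not the actual implementation. It works on plain token lists rather than documents and annotation sets.

```python
class SimpleTokenGazetteerData:
    """Holds tokenized entries only; knows nothing about documents."""

    def __init__(self, entries=None, case_sensitive=True):
        # Maps a tuple of tokens to the data stored for that entry.
        self.entries = dict(entries or {})
        self.case_sensitive = case_sensitive
        self.max_len = max((len(k) for k in self.entries), default=0)

    @classmethod
    def from_string_list(cls, strings, tokenizer, case_sensitive=True):
        entries = {}
        for s in strings:
            toks = tokenizer(s)
            if not case_sensitive:
                toks = [t.lower() for t in toks]
            if toks:
                entries[tuple(toks)] = {"entry": s}
        return cls(entries, case_sensitive=case_sensitive)


class SimpleTokenGazetteerAnnotator:
    """Uses the data to find matches; here on a plain list of token strings."""

    def __init__(self, data):
        self.data = data

    def find_matches(self, tokens):
        """Return (start, end, entry_data) for the longest match at each position."""
        if not self.data.case_sensitive:
            tokens = [t.lower() for t in tokens]
        matches = []
        i = 0
        while i < len(tokens):
            found = None
            # Try the longest candidate span first, then shorter ones.
            for n in range(min(self.data.max_len, len(tokens) - i), 0, -1):
                key = tuple(tokens[i:i + n])
                if key in self.data.entries:
                    found = (i, i + n, self.data.entries[key])
                    break
            if found:
                matches.append(found)
                i = found[1]  # continue after the match
            else:
                i += 1
        return matches


tokdata = SimpleTokenGazetteerData.from_string_list(
    ["New York", "New York City", "Vienna"], tokenizer=str.split, case_sensitive=False
)
annotator = SimpleTokenGazetteerAnnotator(tokdata)
matches = annotator.find_matches("I flew from Vienna to New York City today".split())
```

Note that the data class here has to know `case_sensitive` so the annotator can normalize document tokens the same way, which illustrates how closely the two classes end up coupled.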

johann-petrak · 2021-06-28T07:55:17Z

OK, this is a bit more complex: eventually we will need a standard way of saving and serializing Annotators, and for that a single constructor is better. Also, we really save only a few parameters in most cases.

Also, we should maybe not have a default tokenizer, to make it clearer that the gazetteer list tokenizer should match the document tokenizer.
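
One possible shape for this, as a sketch only: a single constructor where `tokenizer` is a required keyword-only argument with no default, and where the plain init parameters are recorded so they could later be written out by whatever serialization mechanism is chosen. The class name `TokenGazetteerSketch`, the `init_params` attribute and the file name are hypothetical and only serve the illustration.

```python
class TokenGazetteerSketch:
    def __init__(self, source, *, tokenizer, case_sensitive=True, outset=""):
        # No default for tokenizer: the caller must pass the one used for the documents.
        self.tokenizer = tokenizer
        # Record the plain parameters a serializer could write out later.
        self.init_params = {
            "source": source,
            "case_sensitive": case_sensitive,
            "outset": outset,
        }


# Leaving out the tokenizer fails immediately instead of silently using a default.
try:
    TokenGazetteerSketch("cities.pickle")
except TypeError as err:
    missing_tokenizer_error = str(err)

gaz = TokenGazetteerSketch("cities.pickle", tokenizer=str.split, case_sensitive=False)
```

The tokenizer itself is deliberately left out of `init_params` in this sketch, since a callable may not be serializable in the same way as the simple parameters.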

johann-petrak mentioned this issue May 17, 2021

TokenGazetteer #106

Closed
