TokenGazetteer #106

Closed · mdorkhah opened this issue May 14, 2021 · 14 comments
Labels: bug Something isn't working

@mdorkhah

The TokenGazetteer does not annotate entries from the .lst file that include characters like #%.,/-+*@!

For example, if my .lst file looks like this:
Apple
A.P.L
App#

then the TokenGazetteer annotates only Apple in the document and ignores A.P.L and App#.

mdorkhah added the bug label on May 14, 2021
@johann-petrak (Collaborator)

Please give more information: which tokenizer are you using for the document and which for the gazetteer list?

The TokenGazetteer does not match strings but sequences of tokens, so an entry will only match if the token sequence for "A.P.L" is the same in the document and in the gazetteer list.

@mdorkhah (Author)

> Please give more information: which tokenizer are you using for the document and which for the gazetteer list?
>
> The TokenGazetteer does not match strings but sequences of tokens, so an entry will only match if the token sequence for "A.P.L" is the same in the document and in the gazetteer list.

I'm using the ANNIE tokenizer through the GateWorker like this:

    from gatenlp.gateworker import GateWorker

    gs = GateWorker(gatehome=Gate_path, java=Java_path + "/java.exe", port=port)
    gs.loadMavenPlugin("uk.ac.gate.plugins", "annie", "9.0")
    gpipe = gs.loadPipelineFromPlugin("uk.ac.gate.plugins", "annie", "/resources/ANNIE_with_defaults.gapp")
    gdoc = gs.pdoc2gdoc(doc)
    gcorp = gs.newCorpus()
    gcorp.add(gdoc)
    gpipe.setCorpus(gcorp)
    gpipe.execute()
    anniedoc = gs.gdoc2pdoc(gdoc)
    gs.close()

and then I use TokenGazetteer like this:

from gatenlp.processing.gazetteer import TokenGazetteer

tgaz = TokenGazetteer(path + ".def", fmt="gate-def", annset="", all=False, skip=True, outset=outset, outtype=detail)
gazdoc = tgaz(doc)

This works for Apple in my list but does not find matches for Apple#, Apple+ or Apple/

@johann-petrak (Collaborator)

As I said, the TokenGazetteer matches sequences of tokens: the sequence of tokens in your document is whatever the ANNIE tokenizer produces, but the sequence of tokens for each entry in your gazetteer list is (by default) whatever splitting on whitespace produces, and the two differ for entries with special characters, punctuation, etc.
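
For example (a minimal sketch in plain Python, not the gazetteer's actual matching code):

entry = "A.P.L"
print(entry.split())   # ['A.P.L'] - whitespace splitting yields a single token
# an ANNIE-style tokenizer instead produces something like
# ['A', '.', 'P', '.', 'L'] - five tokens, so the two sequences never line up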

The TokenGazetteer has a tokenizer parameter to specify a tokenizer that is used instead of splitting on whitespace, but in your case that cannot be used directly, as your tokenizer is not a Python tokenizer but runs in Java GATE via the worker.
TBH, this is a use case we have not thought of before :)

So to make this work for you, we need a workaround where all entries in your gazetteer list first get tokenized by the ANNIE tokenizer as well and are then stored and used as an already tokenized list. I will try to come up with a simple solution for this.

@johann-petrak (Collaborator) commented May 17, 2021

Here is a possible way to do this:

from gatenlp.processing.annotator import Annotator

class AnnieTokenizer(Annotator):
    """Wraps a Java GATE tokeniser PR so it can be called like a Python annotator."""
    def __init__(self, gateworker, tokeniserPR):
        self._gw = gateworker
        self._tok = tokeniserPR
        # a minimal Java-side pipeline that contains just the tokeniser PR
        self._ctrl = gateworker.jvm.gate.Factory.createResource("gate.creole.SerialAnalyserController")
        self._ctrl.add(tokeniserPR)
        self._corpus = gateworker.newCorpus()
        self._ctrl.setCorpus(self._corpus)
    def __call__(self, doc):
        # run the Java-side tokeniser over a copy of the document
        gdoc = self._gw.pdoc2gdoc(doc)
        self._corpus.add(gdoc)
        self._ctrl.execute()
        self._corpus.remove(gdoc)
        tmpdoc = self._gw.gdoc2pdoc(gdoc)
        # make sure we return the SAME document: copy the Tokens back over
        outset = doc.annset()
        for ann in tmpdoc.annset().with_type("Token"):
            outset.add_ann(ann)
        return doc

gs = GateWorker(gatehome=Gate_path, java=Java_path + "/java.exe", port=port)
gs.loadMavenPlugin("uk.ac.gate.plugins", "annie", "9.0")
gpipe = gs.loadPipelineFromPlugin("uk.ac.gate.plugins", "annie", "/resources/ANNIE_with_defaults.gapp")
gdoc = gs.pdoc2gdoc(doc)
gcorp = gs.newCorpus()
gcorp.add(gdoc)
gpipe.setCorpus(gcorp)
gpipe.execute()
anniedoc = gs.gdoc2pdoc(gdoc)

# get the annie tokenizer from the pipeline and wrap it in something usable for the token gazetteer
annietok = AnnieTokenizer(gs, gpipe.getPRs()[1])        
# create the token gazetteer using the ANNIE tokenizer
# IMPORTANT: this must be done before the gateworker gets closed as the gateworker is needed for creating the
# gazetteer instance
tgaz = TokenGazetteer(path + ".def", fmt="gate-def", tokenizer=annietok, annset="", all=False, skip=True, outset=outset, outtype=detail)

gs.close()

# gateworker should not be needed for just running the gazetteer
gazdoc = tgaz(doc)

@johann-petrak (Collaborator)

Closing this as it is not a bug.

The TokenGazetteer will get refactored and hopefully made easier to use in future versions; see #109.

@johann-petrak (Collaborator)

That should have been gateworker instead of gs. I have updated the original code above.

@mdorkhah (Author)

Yes, but it still doesn't work. The outset is empty.

def GazDet(doc,port):
    doc.annset("Resume").clear()
    gs = GateWorker(gatehome=Gate_path, java=Java_path + "/java.exe", port=port)
    gs.loadMavenPlugin("uk.ac.gate.plugins", "annie", "9.0")
    gpipe = gs.loadPipelineFromPlugin("uk.ac.gate.plugins", "annie", "/resources/ANNIE_with_defaults.gapp")
    gdoc = gs.pdoc2gdoc(doc)
    gcorp = gs.newCorpus()
    gcorp.add(gdoc)
    gpipe.setCorpus(gcorp)
    gpipe.execute()
    anniedoc = gs.gdoc2pdoc(gdoc)
    annietok = AnnieTokenizer(gs, gpipe.getPRs()[1]) 
    for detail in details:
        tgaz = TokenGazetteer(Details_path + "/" + detail + ".def", fmt="gate-def", tokenizer=annietok, annset="", all=False, skip=True, 
                              outset="Resume", outtype=detail)
        gazdoc = tgaz(doc)
    gazdoc.show()
    gs.close()
    return gazdoc

@johann-petrak (Collaborator)

I tested the code here and it worked with the version of gatenlp I am using (the error above was not detected because gs happened to be the same as gateworker when I ran it).
Which version are you using?
You could check what annotations annietok(Document("Apple and A.B.C.")) produces - this should return a document with Tokens as produced by the ANNIE tokenizer.

@mdorkhah (Author)

I ran "python -m pip install -U gatenlp[all]" so I guess my version is the latest one.

@johann-petrak (Collaborator)

The version you are using can easily be determined using gatenlp.__version__; if it is the latest release, it is 1.0.4.
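
For example (assuming gatenlp is installed):

import gatenlp
print(gatenlp.__version__)   # e.g. '1.0.4' for the latest release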

For testing, please make sure that the AnnieTokenizer.__call__ method returns doc (see updated code above).

As I said, the code I shared works for me, so if it does not work in your case, you need to find out where the problem may lie.
The first step is to check what I suggested earlier and find out which annotations you get when doing this:

doc = Document("A.B.C.  and Apple and Apple#")
annietok(doc)
print(doc.annset())

This should show a number of Tokens, e.g. for "A", ".", "B" ... and "Apple", "#".

@mdorkhah (Author)

My version is '1.0.5+snapshot'. I tried to install the newer version from the GitHub code, but nothing worked with that version.

@johann-petrak (Collaborator)

'1.0.5+snapshot' is a GitHub development version. Since that version gets updated regularly, it can be useful to also specify the commit hash, which you can get by running git rev-parse --short HEAD in the git repo.

I am using the very latest GitHub version here; it might be a good idea for you to try that one and report any problems, as that version is going to be part of the next release anyway.

@mdorkhah (Author)

The problem is that any tokenizer treats characters like . , + - / \ @ # % $ ! as separate tokens, which is reasonable.
But I need the tokenizer to also consider combinations of these characters with the preceding or following word. For example, for "Apple.Orange" it should produce Apple, Apple., .Orange, and Orange as tokens.

@johann-petrak (Collaborator)

If you have such rather specific requirements, you will probably have to implement your own approach to tokenizing the gazetteer list and/or document, or not use the token gazetteer at all.
What works best in your case depends very much on your specific rules and requirements.
Maybe you need a string-based gazetteer, maybe you need regular expressions.
Maybe you need to implement something completely new.

From what you have shared about your requirements so far, you could implement your own algorithm to convert the original list of gazetteer strings into lists of tokens, where it would be perfectly valid to generate more than one gazetteer entry per original string; see the sketch below. This is something that often makes sense in other contexts too.
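
A minimal sketch of that idea (the entry_to_tokens helper and its regex are hypothetical, only a rough approximation of ANNIE-style tokenization):

import re

def entry_to_tokens(entry):
    # split an entry into word and punctuation tokens, roughly the way an
    # ANNIE-style tokenizer would; hypothetical helper, not gatenlp API
    return re.findall(r"\w+|[^\w\s]", entry)

print(entry_to_tokens("A.P.L"))  # ['A', '.', 'P', '.', 'L']
print(entry_to_tokens("App#"))   # ['App', '#']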
