TokenGazetteer #106

Closed · mdorkhah opened this issue May 14, 2021 · 14 comments
Labels: bug Something isn't working

@mdorkhah

The TokenGazetteer does not annotate entries from the .lst file that include characters like #%.,/-+*@!

For example, if my .lst file looks like this:
Apple
A.P.L
App#

then the TokenGazetteer annotates only Apple in the document and ignores A.P.L and App#.

mdorkhah added the bug label on May 14, 2021
@johann-petrak (Collaborator)

Please give more information: which tokenizer are you using for the document and which for the gazetteer list?

The TokenGazetteer does not match strings but sequences of tokens, so an entry will only match if the token sequence for "A.P.L" is the same in the document and in the gazetteer list.

@mdorkhah (Author)

> Please give more information: which tokenizer are you using for the document and which for the gazetteer list?
>
> The TokenGazetteer does not match strings but sequences of tokens, so an entry will only match if the token sequence for "A.P.L" is the same in the document and in the gazetteer list.

I'm using the ANNIE tokenizer through the GateWorker like this:

    from gatenlp.gateworker import GateWorker

    gs = GateWorker(gatehome=Gate_path, java=Java_path + "/java.exe", port=port)
    gs.loadMavenPlugin("uk.ac.gate.plugins", "annie", "9.0")
    gpipe = gs.loadPipelineFromPlugin("uk.ac.gate.plugins", "annie", "/resources/ANNIE_with_defaults.gapp")
    gdoc = gs.pdoc2gdoc(doc)
    gcorp = gs.newCorpus()
    gcorp.add(gdoc)
    gpipe.setCorpus(gcorp)
    gpipe.execute()
    anniedoc = gs.gdoc2pdoc(gdoc)
    gs.close()

and then I use TokenGazetteer like this:

from gatenlp.processing.gazetteer import TokenGazetteer

tgaz = TokenGazetteer(path + ".def", fmt="gate-def", annset="", all=False, skip=True, outset=outset, outtype=detail)
gazdoc = tgaz(doc)

This works for Apple in my list but does not find matches for Apple#, Apple+ or Apple/

@johann-petrak (Collaborator)

As I said, the TokenGazetteer matches sequences of tokens: the sequence of tokens in your document is whatever the ANNIE tokenizer produces, but the sequence of tokens for each entry in your gazetteer list is (by default) whatever splitting on whitespace produces, and the two differ for entries with special characters, punctuation, etc.
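
For example (a minimal sketch in plain Python, not the gazetteer's actual matching code):

entry = "A.P.L"
print(entry.split())   # ['A.P.L'] - whitespace splitting yields a single token
# an ANNIE-style tokenizer instead produces something like
# ['A', '.', 'P', '.', 'L'] - five tokens, so the two sequences never line up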

The TokenGazetteer has a tokenizer parameter to specify a tokenizer that is used instead of splitting on whitespace, but in your case that cannot be used directly, as your tokenizer is not a Python tokenizer but runs in Java GATE via the worker.
TBH, this is a use case we have not thought of before :)

So to make this work for you, we need a workaround where all entries in your gazetteer list first get tokenized by the ANNIE tokenizer as well and are then stored and used as an already tokenized list. I will try to come up with a simple solution for this.

@johann-petrak (Collaborator) commented May 17, 2021

Here is a possible way to do this:

from gatenlp.processing.annotator import Annotator

class AnnieTokenizer(Annotator):
    """Wraps a Java GATE tokeniser PR so it can be called like a Python annotator."""
    def __init__(self, gateworker, tokeniserPR):
        self._gw = gateworker
        self._tok = tokeniserPR
        # a minimal Java-side pipeline that contains just the tokeniser PR
        self._ctrl = gateworker.jvm.gate.Factory.createResource("gate.creole.SerialAnalyserController")
        self._ctrl.add(tokeniserPR)
        self._corpus = gateworker.newCorpus()
        self._ctrl.setCorpus(self._corpus)
    def __call__(self, doc):
        # run the Java-side tokeniser over a copy of the document
        gdoc = self._gw.pdoc2gdoc(doc)
        self._corpus.add(gdoc)
        self._ctrl.execute()
        self._corpus.remove(gdoc)
        tmpdoc = self._gw.gdoc2pdoc(gdoc)
        # make sure we return the SAME document: copy the Tokens back over
        outset = doc.annset()
        for ann in tmpdoc.annset().with_type("Token"):
            outset.add_ann(ann)
        return doc

gs = GateWorker(gatehome=Gate_path, java=Java_path + "/java.exe", port=port)
gs.loadMavenPlugin("uk.ac.gate.plugins", "annie", "9.0")
gpipe = gs.loadPipelineFromPlugin("uk.ac.gate.plugins", "annie", "/resources/ANNIE_with_defaults.gapp")
gdoc = gs.pdoc2gdoc(doc)
gcorp = gs.newCorpus()
gcorp.add(gdoc)
gpipe.setCorpus(gcorp)
gpipe.execute()
anniedoc = gs.gdoc2pdoc(gdoc)

# get the annie tokenizer from the pipeline and wrap it in something usable for the token gazetteer
annietok = AnnieTokenizer(gs, gpipe.getPRs()[1])        
# create the token gazetteer using the ANNIE tokenizer
# IMPORTANT: this must be done before the gateworker gets closed as the gateworker is needed for creating the
# gazetteer instance
tgaz = TokenGazetteer(path + ".def", fmt="gate-def", tokenizer=annietok, annset="", all=False, skip=True, outset=outset, outtype=detail)

gs.close()

# gateworker should not be needed for just running the gazetteer
gazdoc = tgaz(doc)

@johann-petrak (Collaborator)

Closing this as it is not a bug.

The TokenGazetteer will get refactored and hopefully made easier to use in future versions; see #109.

@johann-petrak (Collaborator)

That should have been gateworker instead of gs. I have updated the original code above.

@mdorkhah (Author)

Yes, but it still doesn't work. The outset is empty.

def GazDet(doc,port):
    doc.annset("Resume").clear()
    gs = GateWorker(gatehome=Gate_path, java=Java_path + "/java.exe", port=port)
    gs.loadMavenPlugin("uk.ac.gate.plugins", "annie", "9.0")
    gpipe = gs.loadPipelineFromPlugin("uk.ac.gate.plugins", "annie", "/resources/ANNIE_with_defaults.gapp")
    gdoc = gs.pdoc2gdoc(doc)
    gcorp = gs.newCorpus()
    gcorp.add(gdoc)
    gpipe.setCorpus(gcorp)
    gpipe.execute()
    anniedoc = gs.gdoc2pdoc(gdoc)
    annietok = AnnieTokenizer(gs, gpipe.getPRs()[1]) 
    for detail in details:
        tgaz = TokenGazetteer(Details_path + "/" + detail + ".def", fmt="gate-def", tokenizer=annietok, annset="", all=False, skip=True, 
                              outset="Resume", outtype=detail)
        gazdoc = tgaz(doc)
    gazdoc.show()
    gs.close()
    return gazdoc

@johann-petrak (Collaborator)

I tested the code here and it worked with the version of gatenlp I am using (the error above was not detected because gs happened to be the same as gateworker when I ran it).
Which version are you using?
You could check what annotations annietok(Document("Apple and A.B.C.")) produces - this should return a document with Tokens as produced by the ANNIE tokenizer.

@mdorkhah (Author)

I ran "python -m pip install -U gatenlp[all]" so I guess my version is the latest one.

@johann-petrak (Collaborator)

The version you are using can easily be determined using gatenlp.__version__; if it is the latest release, it is 1.0.4.
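
For example (assuming gatenlp is installed):

import gatenlp
print(gatenlp.__version__)   # e.g. '1.0.4' for the latest release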

For testing, please make sure that the AnnieTokenizer.__call__ method returns doc (see updated code above).

As I said, the code I shared works for me, so if it does not work in your case, you need to find out where the problem may lie.
The first step is to check what I suggested earlier and find out which annotations you get when doing this:

doc = Document("A.B.C.  and Apple and Apple#")
annietok(doc)
print(doc.annset())

This should show a number of Tokens, e.g. for "A", ".", "B" ... and "Apple", "#".

@mdorkhah (Author)

My version is '1.0.5+snapshot'. I tried to install the newer version from the GitHub code, but nothing worked with that version.

@johann-petrak (Collaborator)

'1.0.5+snapshot' is a GitHub development version. Since that version gets updated regularly, it can be useful to also specify the commit hash, which you can get by running git rev-parse --short HEAD in the git repo.

I am using the very latest GitHub version here; it might be a good idea for you to try that one and report any problems, as that version is going to be part of the next release anyway.

@mdorkhah (Author)

The problem is that any tokenizer treats characters like . , + - / \ @ # % $ ! as separate tokens, which is reasonable.
But I need the tokenizer to also consider combinations of these characters with the preceding or following word. For example, for "Apple.Orange" it should produce Apple, Apple., .Orange, and Orange as tokens.

@johann-petrak (Collaborator)

If you have such rather specific requirements, you will probably have to implement your own approach to tokenizing the gazetteer list and/or document, or not use the token gazetteer at all.
What works best in your case depends very much on your specific rules and requirements.
Maybe you need a string-based gazetteer, maybe you need regular expressions.
Maybe you need to implement something completely new.

From what you have shared about your requirements so far, you could implement your own algorithm to convert the original list of gazetteer strings into lists of tokens, where it would be perfectly valid to generate more than one gazetteer entry per original string; see the sketch below. This is something that often makes sense in other contexts too.
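
A minimal sketch of that idea (the entry_to_tokens helper and its regex are hypothetical, only a rough approximation of ANNIE-style tokenization):

import re

def entry_to_tokens(entry):
    # split an entry into word and punctuation tokens, roughly the way an
    # ANNIE-style tokenizer would; hypothetical helper, not gatenlp API
    return re.findall(r"\w+|[^\w\s]", entry)

print(entry_to_tokens("A.P.L"))  # ['A', '.', 'P', '.', 'L']
print(entry_to_tokens("App#"))   # ['App', '#']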
