Frog can't deal with tokens that contain spaces #34

Closed
proycon opened this issue Jul 12, 2017 · 11 comments

@proycon
Member

proycon commented Jul 12, 2017

In historical Dutch, certain words may be written apart although they can be considered one token, e.g. "vol daen" (voldaan), and are represented as a single <w> in FoLiA. Would the various Frog modules (mblem, mbpos, etc.) be able to deal with spaces in tokens?

@proycon
Member Author

proycon commented Jul 12, 2017

(somewhat related to LanguageMachines/ucto#24, as ucto can't deal with this yet either)

@kosloot
Collaborator

kosloot commented Jul 17, 2017

At the moment this is not possible in Frog.
None of the modules (ucto, mbma, mblem, etc.) is capable of handling splits.
Nor do we have any plans to accomplish this.
The fastest (but non-trivial) way to handle splits is to post-process the Frog output with PiCCL/TiCCL and then rerun Frog on the hopefully resolved tokens.

@antalvdb
Member

Another fix would be to have a wrapper that takes any content that contains spaces, and iterates over the space-delimited tokens, processing them individually with MBLEM and/or MBMA. These analyses could then be collated.

An even rougher fix would be to delete the spaces before giving them as input to MBLEM or MBMA ("vol daen" -> "voldaen"). This solution is brute, but makes sense, because the presence of a space within a <w> tag directly indicates that the string in it is a single token in present-day Dutch.
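
As a rough illustration of the two fixes described above, here is a minimal Python sketch. The names `analyse_parts`, `despace` and `fake_analyse` are hypothetical; `fake_analyse` is only a stand-in for a call to MBLEM or MBMA and is not part of Frog.

```python
from typing import Callable, List


def analyse_parts(token: str, analyse: Callable[[str], str]) -> List[str]:
    """Wrapper fix: analyse each space-delimited part separately and collate."""
    return [analyse(part) for part in token.split()]


def despace(token: str) -> str:
    """Rough fix: delete all whitespace so the token becomes a single word."""
    return "".join(token.split())


def fake_analyse(word: str) -> str:
    """Hypothetical stand-in for an MBLEM/MBMA call; it only lowercases."""
    return word.lower()


print(analyse_parts("vol daen", fake_analyse))  # ['vol', 'daen']
print(despace("vol daen"))  # voldaen
```

The wrapper keeps the parts apart and leaves collation to the caller, while the de-spaced variant hands the modules a single word that they may or may not have seen before.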

@kosloot
Collaborator

kosloot commented Jul 17, 2017

The rough fix will lead to a lot of problems, I guess:
How do you determine which spaces are splits and which are not?
You would need a lexicon then, wouldn't you?

@antalvdb
Member

You would just treat "voldaen" as any word that you may or may not have seen before. You delete any whitespace.

The fact that there are one or more spaces within a <w> tag signals that apparently in present-day Dutch everything between <w> and </w> should be one word, without spaces.

@kosloot
Collaborator

kosloot commented Jul 17, 2017

This assumes that 'tags' or tokens have already been detected. But what should be done with running text?
'ik geloof niet dat dit werkt'
'ikgeloofnietdatditwerkt'

@proycon
Member Author

proycon commented Jul 17, 2017

I'm not in favour of the brute solution of deleting spaces and sidetracking the problem that way (= information loss). I think ideally Frog should be able to handle spaces in tokens just as any other character, i.e. Frog should be completely agnostic about it and just accept whatever the tokenizer delivers (it would still be one <w> after all). Am I right in thinking the source of this issue is that space is used as a delimiter in the underlying timbl modules, rather than tab?

@antalvdb
Member

I'm not sure if that is the issue - the underlying delimiter may well be the comma, and the modules may work with spaces just the same. Alternatively, the spaces could be rewritten to another character ("_") and the whole process may just work fine. Perhaps perform a test first?
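
The placeholder idea could be tested along these lines. This is only a sketch: `PLACEHOLDER`, `encode_token` and `decode_token` are hypothetical names, and it assumes the placeholder character never occurs in real tokens.

```python
PLACEHOLDER = "_"  # assumption: this character never occurs in the input tokens


def encode_token(token: str) -> str:
    """Replace token-internal spaces with the placeholder before processing."""
    if PLACEHOLDER in token:
        raise ValueError(f"token already contains {PLACEHOLDER!r}: {token!r}")
    return token.replace(" ", PLACEHOLDER)


def decode_token(token: str) -> str:
    """Restore the original spaces after processing."""
    return token.replace(PLACEHOLDER, " ")


tokens = ["een", "vol daen", "gevoel"]
line = " ".join(encode_token(t) for t in tokens)
print(line)  # een vol_daen gevoel

restored = [decode_token(t) for t in line.split(" ")]
print(restored == tokens)  # True
```

The round trip only works if the placeholder is unambiguous, which is why the sketch refuses tokens that already contain it.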

@proycon
Member Author

proycon commented Jul 17, 2017

I just ran into Frog indeed stumbling over spaces in tokens, as foreseen :) At first I thought it was a different issue, but it turns out to be this one, so I'll add it here:

frog-pos-tagger-:mismatch between number of <w> tags and the tagger result.                 
frog-pos-tagger-:words according to <w> tags:                                               
frog-pos-tagger-:w[0]= ‘                      
frog-pos-tagger-:w[1]= AA                     
frog-pos-tagger-:w[2]= (                      
frog-pos-tagger-:w[3]= Floris                 
frog-pos-tagger-:w[4]= van                    
frog-pos-tagger-:w[5]= der                    
frog-pos-tagger-:w[6]= )                      
frog-pos-tagger-:w[7]= een                    
frog-pos-tagger-:w[8]= der                    
frog-pos-tagger-:w[9]= edelen                 
frog-pos-tagger-:w[10]= die                   
frog-pos-tagger-:w[11]= in                    
frog-pos-tagger-:w[12]= 1415                  
frog-pos-tagger-:w[13]= Jan                   
frog-pos-tagger-:w[14]= van                   
frog-pos-tagger-:w[15]= Arkel                 
frog-pos-tagger-:w[16]= gevankelijk           
frog-pos-tagger-:w[17]= naar                  
frog-pos-tagger-:w[18]= 's                    
frog-pos-tagger-:w[19]= Hage                  
frog-pos-tagger-:w[20]= voerden               
frog-pos-tagger-:w[21]= ,                     
frog-pos-tagger-:w[22]= waarvoor              
frog-pos-tagger-:w[23]= zij                   
frog-pos-tagger-:w[24]= eene                  
frog-pos-tagger-:w[25]= goede                 
frog-pos-tagger-:w[26]= som                   
frog-pos-tagger-:w[27]= gelds                 
frog-pos-tagger-:w[28]= trokken               
frog-pos-tagger-:w[29]= .                     
frog-pos-tagger-:w[30]= ’                     
frog-pos-tagger-:words according to POS tagger:                                             
frog-pos-tagger-:word[0]='                    
frog-pos-tagger-:word[1]=AA                   
frog-pos-tagger-:word[2]=(                    
frog-pos-tagger-:word[3]=Floris               
frog-pos-tagger-:word[4]=van                  
frog-pos-tagger-:word[5]=der                  
frog-pos-tagger-:word[6]=)                    
frog-pos-tagger-:word[7]=een                  
frog-pos-tagger-:word[8]=der                  
frog-pos-tagger-:word[9]=edelen               
frog-pos-tagger-:word[10]=die                 
frog-pos-tagger-:word[11]=in                  
frog-pos-tagger-:word[12]=1415                
frog-pos-tagger-:word[13]=Jan                 
frog-pos-tagger-:word[14]=steen        <== THIS ONE MISSES IN THE OTHER       
frog-pos-tagger-:word[15]=van                 
frog-pos-tagger-:word[16]=Arkel               
frog-pos-tagger-:word[17]=gevankelijk         
frog-pos-tagger-:word[18]=naar                
frog-pos-tagger-:word[19]=de                  
frog-pos-tagger-:word[20]=Haag                
frog-pos-tagger-:word[21]=voerden             
frog-pos-tagger-:word[22]=,                   
frog-pos-tagger-:word[23]=waarvoor            
frog-pos-tagger-:word[24]=zij                 
frog-pos-tagger-:word[25]=een                 
frog-pos-tagger-:word[26]=goede               
frog-pos-tagger-:word[27]=som                 
frog-pos-tagger-:word[28]=geld                
frog-pos-tagger-:word[29]=trokken             
frog-pos-tagger-:word[30]=.                   
frog-pos-tagger-:word[31]='                   
frog-:problem frogging: aa__001biog01_01.tok.translated.folia.xml                           
frog-:POS tagger is confused IOB tagger is confused NER failed: '' AA ( Floris van der ) een der edelen die in 1415 Jan steen van Arkel gevankelijk naar de Haag voerden , waarvoor zij een goede som geld trokken . '' ==> ''//O AA//B-org (//O Floris//B-per van//I-per der//I-per )//I-per een//I-per der//I-per edelen//O die//O in//O 1415//O Jan//B-per steen//O van//O Arkel//B-loc gevankelijk//O naar//O de//O Haag//B-loc voerden//O ,//O waarvoor//O zij//O een//O goede//O som//O geld//O trokken//O .//O '//O ' 

The problem here is this word in the input document:

<w xml:id="aa__001biog01_01.TEI.2.text.body.div.p.152.s.1.w.14" class="WORD">
   <t>Jan</t>
   <t class="contemporary">Jan steen</t>
   <lemma class="WNT:M028633.ADD.948" set="https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/int_lemmaid_withcompounds.foliaset.ttl"/>
   <lemma class="jan steen" set="https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/int_lemmatext_withcompounds.foliaset.ttl"/>
   <metric class="modernisationsource" value="inthistlexicon"/>
</w>

So the word "Jan" modernises to "Jan steen" (which is obviously odd and wrong, but not the actual issue here). Frog runs on the contemporary layer and breaks on the space (as we expected). I'll just have to disallow multiword tokens in the moderniser for now (or do an ugly patch with another delimiter like underscore), but this will at some point come back to haunt us if we build a specialised tagger/lemmatiser for Nederlab with proper multiword support and want to run Frog on its output.
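
The count mismatch in the log can be reproduced with a small Python sketch. It assumes, purely for illustration, that the tagger receives the sentence as one space-joined string and splits it again on spaces; `tagger_view` is a hypothetical helper, not Frog's actual code.

```python
def tagger_view(tokens):
    """Join tokens with spaces and split again, as a space-delimited tagger would."""
    return " ".join(tokens).split(" ")


# One entry per <w> element; the third word carries an embedded space.
folia_words = ["in", "1415", "Jan steen", "van", "Arkel"]
tagged_words = tagger_view(folia_words)

print(len(folia_words), len(tagged_words))  # 5 6
print(tagged_words[2:4])  # ['Jan', 'steen']
```

A single multiword token is enough to shift every later index by one, which matches the extra `steen` entry in the tagger's word list above.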

proycon added bug and removed question labels Jul 17, 2017
proycon changed the title from "Can Frog deal with tokens that contain spaces?" to "Frog can't deal with tokens that contain spaces" Jul 17, 2017
proycon added a commit to LanguageMachines/foliautils that referenced this issue Jul 20, 2017
…variant, use a non-breaking narrow space instead of a real space to circumvent issue LanguageMachines/frog#34 (hacky and potentially confusing!)
@kosloot
Collaborator

kosloot commented Aug 24, 2017

There are a lot of issues at hand here.
First: in this case the 'sanity' check in Frog isn't aware of 'multiword' words.
I assume this can be fixed rather easily. (The error message is also confusing, because it uses the wrong textclass.)
But that is just a small part of the multitude of problems at hand.
Reverting to the original question:

  • Can MBLEM handle the word 'vol daen'?
    Not at present, but the modification is easy, yielding the lemma 'voldaen' (assuming someone magically comes up with training data).
  • Can MBMA handle the word 'vol daen'?
    Not at present, but the modification is 'easy', I think. I suggest just analyzing the 'de-spaced' token.
  • Can MBT handle 'vol daen'?
    No, and I think it is very hard to do, unless we just ignore that it was 1 token.
    How would MBT recognize the single token 'vol daen' in a sentence like:
    "Ik had een vol daen gevoel."?
    MBT is sentence-based and assumes space-delimited words/tokens.
  • All MBT-based modules in Frog (NER, Chunker) have the same problem.
    So to get this to work, MBT needs a complete rework, to get a variant that accepts sequences of tokens instead of sentences of words (and training data too).

Then there is also the reverse problem of run-ons, where 'voldaen' is to be split into 2 words.
The tagger has no clue; the lemmatizer will have no problems after the modifications.
But how and when do we merge this knowledge?

QUICK HACK proposal for multiple words in 1 <w>:
For tagging, we could remove the spaces, to ensure that 1 FoLiA word leads to 1 tag.

Example:

<s id="s.1">
	<w id="w.1">
	  <t>Een</t>
	</w>
	<w id="w.2">
	  <t>multi word</t>
	</w>
	<w id="w.3">
	  <t>test</t>
	</w>
</s>

We tag this as if the second word was:

	<w id="w.2">
	  <t>multiword</t>
	</w>

The adapted MBMA and MBLEM can be given the text including the space.

Result:

      <s xml:id="s.1">
        <w xml:id="w.1">
          <t>Een</t>
          <pos class="LID(onbep,stan,agr)" confidence="0.981771" head="LID">
            <feat class="onbep" subset="lwtype"/>
            <feat class="stan" subset="naamval"/>
            <feat class="agr" subset="npagr"/>
          </pos>
          <morphology>
            <morpheme>
              <t>een</t>
            </morpheme>
          </morphology>
          <lemma class="een"/>
        </w>
        <w xml:id="w.2">
          <t>multi word</t>
          <pos class="N(soort,ev,basis,onz,stan)" confidence="0.733484" head="N">
            <feat class="soort" subset="ntype"/>
            <feat class="ev" subset="getal"/>
            <feat class="basis" subset="graad"/>
            <feat class="onz" subset="genus"/>
            <feat class="stan" subset="naamval"/>
          </pos>
          <lemma class="multi word"/>
          <morphology>
            <morpheme>
              <t>multi word</t>
            </morpheme>
          </morphology>
        </w>
        <w xml:id="w.3">
          <t>test</t>
          <pos class="N(soort,ev,basis,zijd,stan)" confidence="0.789112" head="N">
            <feat class="soort" subset="ntype"/>
            <feat class="ev" subset="getal"/>
            <feat class="basis" subset="graad"/>
            <feat class="zijd" subset="genus"/>
            <feat class="stan" subset="naamval"/>
          </pos>
          <lemma class="test"/>
          <morphology>
            <morpheme>
              <t>test</t>
            </morpheme>
          </morphology>
        </w>
      </s>
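
The quick hack can be sketched in Python on the simplified example sentence. This is an illustration under assumptions, not Frog's implementation: the helpers `tagger_tokens` and `lemmatizer_tokens` are hypothetical, and real FoLiA documents use XML namespaces that this sketch ignores.

```python
import xml.etree.ElementTree as ET

# Simplified, namespace-free version of the example sentence.
SENTENCE = """
<s id="s.1">
  <w id="w.1"><t>Een</t></w>
  <w id="w.2"><t>multi word</t></w>
  <w id="w.3"><t>test</t></w>
</s>
"""


def tagger_tokens(sentence_xml: str) -> list:
    """Return one space-free token per <w>, so the tagger yields one tag per word."""
    root = ET.fromstring(sentence_xml.strip())
    return ["".join(w.findtext("t").split()) for w in root.iter("w")]


def lemmatizer_tokens(sentence_xml: str) -> list:
    """Return the text of each <w> unchanged, including embedded spaces."""
    root = ET.fromstring(sentence_xml.strip())
    return [w.findtext("t") for w in root.iter("w")]


print(tagger_tokens(SENTENCE))      # ['Een', 'multiword', 'test']
print(lemmatizer_tokens(SENTENCE))  # ['Een', 'multi word', 'test']
```

Both lists have one entry per `<w>`, so the tags can be mapped back onto the words by position even though the tagger never sees the space.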

@kosloot
Collaborator

kosloot commented Aug 28, 2017

So for now, this simple solution is implemented:
Frog now accepts FoLiA with embedded spaces. All spaces are removed for all taggers AND the parser, converting multiwords into single words.

kosloot closed this as completed Sep 18, 2017