Frog can't deal with tokens that contain spaces #34
Comments
(somewhat related to LanguageMachines/ucto#24, as ucto can't deal with this yet either) |
At the moment this is not possible in Frog. |
Another fix would be to have a wrapper that takes any content that contains spaces and iterates over the space-delimited tokens, processing them individually with MBLEM and/or MBMA. These analyses could then be collated. An even rougher fix would be to delete the spaces before giving the tokens as input to MBLEM or MBMA ("vol daen" -> "voldaen"). This solution is brute, but it makes sense, because the presence of a space within a <w> tag directly indicates that the string in it is a single token in present-day Dutch. |
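The two workarounds proposed in this comment could be sketched roughly as follows. This is only an illustration under stated assumptions: `analyse` is a hypothetical stand-in for a call into MBLEM or MBMA (it is not Frog's actual API), and how the per-part results should be collated is left to the caller.

```python
from typing import Callable, List


def analyse_by_parts(token: str, analyse: Callable[[str], str]) -> List[str]:
    """Wrapper fix: analyse each space-delimited part on its own.

    `analyse` is a hypothetical stand-in for MBLEM or MBMA.
    The per-part results are returned in order so the caller can collate them.
    """
    return [analyse(part) for part in token.split()]


def analyse_without_spaces(token: str, analyse: Callable[[str], str]) -> str:
    """Rough fix: delete all whitespace, then analyse the result as one word."""
    return analyse("".join(token.split()))


def fake_lemmatiser(word: str) -> str:
    # Placeholder analyser for the demonstration: it only lowercases its input.
    return word.lower()


if __name__ == "__main__":
    print(analyse_by_parts("vol daen", fake_lemmatiser))
    print(analyse_without_spaces("vol daen", fake_lemmatiser))
```

The first function keeps the information about where the spaces were; the second discards it, which is the information loss objected to elsewhere in this thread.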
The rough fix will lead to a lot of problems, I guess: |
You would just treat "voldaen" as any word that you may or may not have seen before. You delete any whitespace. The fact that there are one or more spaces within a <w> tag signals that in present-day Dutch everything between <w> and </w> should apparently be one word, without spaces. |
This assumes that 'tags' or tokens have already been detected. But what to do in running text? |
I'm not in favour of the brute solution of deleting spaces and sidetracking the problem that way (= information loss). I think ideally Frog should be able to handle spaces in tokens just as any other character, i.e. Frog should be completely agnostic about it and just accept whatever the tokenizer delivers (it would still be one |
I'm not sure if that is the issue - the underlying delimiter may well be the comma, and the modules may work with spaces just the same. Alternatively, the spaces could be rewritten as another character ("_") and the whole process may just work fine. Perhaps perform a test first? |
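A minimal sketch of the substitution idea, assuming the placeholder character never occurs in the input tokens (the check below enforces this). The function names are made up for illustration, and whether the Frog modules behave sensibly on the encoded form is exactly what would need testing.

```python
PLACEHOLDER = "_"  # assumption: this character does not occur in real tokens


def encode_spaces(token: str, placeholder: str = PLACEHOLDER) -> str:
    """Replace every space in a token so downstream modules see no spaces."""
    if placeholder in token:
        # Refuse ambiguous input: decoding could not tell the two apart.
        raise ValueError(f"token already contains {placeholder!r}: {token!r}")
    return token.replace(" ", placeholder)


def decode_spaces(token: str, placeholder: str = PLACEHOLDER) -> str:
    """Restore the original spaces after processing."""
    return token.replace(placeholder, " ")


if __name__ == "__main__":
    encoded = encode_spaces("vol daen")
    print(encoded)
    print(decode_spaces(encoded))
```

The round trip is lossless only as long as the placeholder is reserved, which is why the encoder rejects tokens that already contain it.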
I just ran into Frog indeed stumbling over spaces in tokens, as foreseen :) At first I thought it was another issue, but it turns out to be this one, so I'll add it here:
The problem here is this word in the input document:
<w xml:id="aa__001biog01_01.TEI.2.text.body.div.p.152.s.1.w.14" class="WORD">
<t>Jan</t>
<t class="contemporary">Jan steen</t>
<lemma class="WNT:M028633.ADD.948" set="https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/int_lemmaid_withcompounds.foliaset.ttl"/>
<lemma class="jan steen" set="https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/int_lemmatext_withcompounds.foliaset.ttl"/>
<metric class="modernisationsource" value="inthistlexicon"/>
</w>
So the word "Jan" modernises to "Jan steen" (which is obviously odd and wrong, but not the actual issue here). Frog runs on the contemporary layer and breaks on the space (as we expected). I'll just have to disallow multiword tokens in the moderniser for now (or do an ugly patch with another delimiter such as an underscore), but this will at some point come back to haunt us if we build a specialised tagger/lemmatiser for Nederlab with proper multiword support and want to run Frog on its output. |
…variant, use a non-breaking narrow space instead of a real space to circumvent issue LanguageMachines/frog#34 (hacky and potentially confusing!)
There are a lot of issues at hand here.
Then there is also the reverse problem of run-ons, where 'voldaen' is to be split into 2 words.
QUICK HACK proposal for multiple words in one <w>:
Example:
<s id="s.1">
<w id="w.1">
<t>Een</t>
</w>
<w id="w.2">
<t>multi word</t>
</w>
<w id="w.3">
<t>test</t>
</w>
</s>
We tag this as if the second word were:
<w id="w.2">
<t>multiword</t>
</w>
To the adapted MBMA and MBLEM, we can provide the text including the space. Result:
<s xml:id="s.1">
<w xml:id="w.1">
<t>Een</t>
<pos class="LID(onbep,stan,agr)" confidence="0.981771" head="LID">
<feat class="onbep" subset="lwtype"/>
<feat class="stan" subset="naamval"/>
<feat class="agr" subset="npagr"/>
</pos>
<morphology>
<morpheme>
<t>een</t>
</morpheme>
</morphology>
<lemma class="een"/>
</w>
<w xml:id="w.2">
<t>multi word</t>
<pos class="N(soort,ev,basis,onz,stan)" confidence="0.733484" head="N">
<feat class="soort" subset="ntype"/>
<feat class="ev" subset="getal"/>
<feat class="basis" subset="graad"/>
<feat class="onz" subset="genus"/>
<feat class="stan" subset="naamval"/>
</pos>
<lemma class="multi word"/>
<morphology>
<morpheme>
<t>multi word</t>
</morpheme>
</morphology>
</w>
<w xml:id="w.3">
<t>test</t>
<pos class="N(soort,ev,basis,zijd,stan)" confidence="0.789112" head="N">
<feat class="soort" subset="ntype"/>
<feat class="ev" subset="getal"/>
<feat class="basis" subset="graad"/>
<feat class="zijd" subset="genus"/>
<feat class="stan" subset="naamval"/>
</pos>
<lemma class="test"/>
<morphology>
<morpheme>
<t>test</t>
</morpheme>
</morphology>
</w>
</s> |
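One way to read the quick hack above, sketched in Python: for every `<w>`, the tagger gets the text with spaces removed, while the original text (including the space) is kept for MBMA and MBLEM. This is an illustration only, not Frog's implementation; it assumes the un-namespaced example markup from this comment (real FoLiA documents use an XML namespace), and the function name is made up.

```python
import xml.etree.ElementTree as ET

# The example sentence from the proposal, without the FoLiA namespace.
SENTENCE = """<s id="s.1">
  <w id="w.1"><t>Een</t></w>
  <w id="w.2"><t>multi word</t></w>
  <w id="w.3"><t>test</t></w>
</s>"""


def tagger_and_original_forms(sentence_xml: str):
    """Return (form for the tagger, original text) for every <w> element."""
    root = ET.fromstring(sentence_xml)
    pairs = []
    for w in root.iter("w"):
        text = w.findtext("t", default="")
        # The tagger sees the token as if the spaces were not there;
        # the original text is preserved for lemma and morphology.
        pairs.append((text.replace(" ", ""), text))
    return pairs


if __name__ == "__main__":
    for tagger_form, original in tagger_and_original_forms(SENTENCE):
        print(f"{tagger_form!r} <- {original!r}")
```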
So for now, this simple solution is implemented: |
In historical Dutch, certain words may be written apart although they can be considered one token, e.g. "vol daen" (voldaan), and are represented as a single <w> in FoLiA. Would the various Frog modules (mblem, mbpos etc.) be able to deal with spaces in tokens?