Value of TextContent disappears (empty string) upon add? #17

Closed
proycon opened this issue May 17, 2017 · 7 comments

@proycon
Member

proycon commented May 17, 2017

Something goes wrong when I add a TextContent with the value eologico*phijsico*metaphijsicum: libfolia adds an empty text content element instead! I have no idea what triggers this (perhaps the asterisk has a special meaning?); other words are processed fine.

I add TextContent as follows:
https://github.com/LanguageMachines/foliautils/blob/wordtranslate/src/FoLiA-wordtranslate.cxx#L134

Debug output; I explicitly check that I'm not passing an empty string (even after trimming):

$ FoLiA-wordtranslate --outputclass contemporary -d lexicon.1637-2010.250.lexserv.vandale.tsv -p preservation2010.txt -r rules.machine aa__001biog01_01.tok.folia.xml

Loading dictionary...
Loading preserve lexicon...
Loading rules...
DEBUG: target before sanity check 'eologico*phijsico*metaphijsicum'
DEBUG: target after sanity check 'eologico*phijsico*metaphijsicum'
DEBUG: text after adding textcontent ''
finished aa__001biog01_01.tok.folia.xml

@proycon
Member Author

proycon commented May 17, 2017

Now I can't reproduce the above debug output anymore (the text content shows up fine), but the serialisation to XML still has an empty text.

DEBUG: -- BEFORE APPEND --
DEBUG: target: 'eologico*phijsico*metaphijsicum'
DEBUG: after unicode encoding and decoding 'eologico*phijsico*metaphijsicum'
DEBUG: from textcontent 'eologico*phijsico*metaphijsicum'
DEBUG: length from textcontent  33
DEBUG: -- AFTER APPEND -- 
DEBUG: from textcontent 'eologico*phijsico*metaphijsicum'
DEBUG: length from textcontent  33
DEBUG: from word 'eologico*phijsico*metaphijsicum'
          <w xml:id="aa__001biog01_01.TEI.2.text.body.div.p.10802.s.1.w.2" class="WORD-COMPOUND">
            <t>theologico-physico-metaphysicum</t>
            <t class="contemporary"></t>
            <metric class="modernisationsource" value="rules"/>

@proycon
Member Author

proycon commented May 17, 2017

OK, the following debug output shows the problem, though I still have no idea why:

DEBUG: target: 'eologico*phijsico*metaphijsicum'
DEBUG: from textcontent 'eologico*phijsico*metaphijsicum'
DEBUG: length from textcontent  33
DEBUG: from word 'eologico*phijsico*metaphijsicum'
DEBUG: XML serialisation '<w xmlns="http://ilk.uvt.nl/folia" xml:id="aa__001biog01_01.TEI.2.text.body.div.p.10802.s.1.w.2" class="WORD-COMPOUND"><t>theologico-physico-metaphysicum</t><t class="contemporary"></t></w>'  

Debug code is committed: https://github.com/LanguageMachines/foliautils/blob/wordtranslate/src/FoLiA-wordtranslate.cxx#L143

@proycon
Member Author

proycon commented May 17, 2017

It also fails on the following input words, which my tool probably also mangles into strings with asterisks (incorrectly, but that's not the issue here):

  • Onderwierum-en-Westerdijkshorn
  • Hollandsch-Hoogduitsch-Israelitische
  • Arrondissements-kiescollegie
  • Kollumerland-en-Nieuw-Kruisland

@proycon
Member Author

proycon commented May 17, 2017

Hah, there seem to be two 0x00 bytes in front of the string! That would explain things. I should have counted the characters better :)
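
The effect of the two leading 0x00 bytes can be reproduced in isolation. The sketch below is not the FoLiA-wordtranslate or libfolia code; it only assumes that somewhere on the way to serialisation the text is handled as a NUL-terminated C string. `with_leading_nuls` and `c_string_length` are hypothetical helper names introduced for illustration.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper: prepend two 0x00 bytes, mimicking the mangled
// string described in this comment.
std::string with_leading_nuls(const std::string& text) {
    return std::string(2, '\0') + text;
}

// Length as seen by any API that takes a NUL-terminated char*:
// counting stops at the first 0x00 byte.
std::size_t c_string_length(const std::string& s) {
    return std::strlen(s.c_str());
}
```

For the word in this issue, the visible text is 31 characters long, `with_leading_nuls(...)` has `size()` 33 (matching the "length from textcontent 33" debug line), and `c_string_length` of that result is 0, which is consistent with the empty `<t class="contemporary"></t>` element.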

@proycon
Member Author

proycon commented May 17, 2017

Conclusion: this seems to happen when there are invalid characters in the string. I think it would be helpful if this could be caught and a warning printed when appending text, provided it's not too expensive.
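
A caller-side guard for the specific case seen here (0x00 bytes in the text) could look roughly like the sketch below. Both functions are hypothetical and are not part of libfolia; the check is a single linear scan over the bytes, and it only covers embedded NULs, not other kinds of invalid input.

```cpp
#include <iostream>
#include <string>

// Hypothetical helper, not a libfolia function:
// true if the text contains a 0x00 byte anywhere.
bool has_embedded_nul(const std::string& text) {
    return text.find('\0') != std::string::npos;
}

// Hypothetical wrapper a caller could run before appending text.
// Returns false and warns on stderr when the text contains a 0x00 byte.
bool check_text_before_append(const std::string& text) {
    if (has_embedded_nul(text)) {
        std::cerr << "WARNING: text contains a 0x00 byte; "
                  << "it may be truncated or emptied on serialisation\n";
        return false;
    }
    return true;
}
```

In a tool like the one in this issue, `check_text_before_append(target)` could be called just before the text content is added, so that the mangled words are reported instead of silently producing empty elements.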

@kosloot
Contributor

kosloot commented May 17, 2017

Checking whether a string is valid UTF-8 is quite expensive.
A 0 byte is not even 'that invalid': it is the C string terminator, so it yields 'empty' strings when it sits in front.
I didn't find an easy way to check validity.
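
To make the cost and the limits of such a check concrete, here is a minimal sketch of a UTF-8 well-formedness test. `is_valid_utf8` is a hypothetical helper, not libfolia or ICU code. It has to look at every byte of every string it is given, and it accepts a leading 0x00 byte, because U+0000 encoded as a single zero byte is well-formed UTF-8. A check of this kind alone would therefore not have flagged the string in this issue.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (sketch): true if s is well-formed UTF-8.
// Rejects stray or missing continuation bytes, overlong forms,
// surrogates (U+D800..U+DFFF) and code points above U+10FFFF.
bool is_valid_utf8(const std::string& s) {
    // Smallest code point that legitimately needs a sequence of this length.
    static const unsigned long min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;
        unsigned long cp = 0;
        if (b0 < 0x80)                { len = 1; cp = b0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
        else return false;              // continuation or invalid lead byte
        if (len > n - i) return false;  // sequence runs past the end
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min_cp[len]) return false;               // overlong encoding
        if (cp > 0x10FFFF) return false;                  // outside Unicode
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;   // surrogate
        i += len;
    }
    return true;
}
```

Whether a scan like this on every append is too expensive depends on the workload; the point of the sketch is only that catching embedded NULs needs a separate, explicit test.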

@kosloot
Contributor

kosloot commented Jun 26, 2017

OK. The problem occurred in a program that used the libicu API incorrectly, yielding invalid Unicode strings.
That is a quality-of-implementation problem in libicu, not in libfolia.

@kosloot kosloot closed this as completed Jun 26, 2017