docx creates nested runs (<w:r><w:t><w:r><w:t>), which are then invisible in the opened document #19

Closed
unhammer opened this issue Jun 3, 2022 · 3 comments

unhammer commented Jun 3, 2022

$ for y in yes no; do APERTIUM_TRANSFUSE=$y apertium -f docx -u -d . nob-nno  /tmp/in.docx >/tmp/ut.$y.docx; done

in.docx

With transfuse, we get this bit:

      <w:r>
        <w:rPr/>
        <w:t xml:space="preserve">
          <w:r>
            <w:rPr/>
            <w:t xml:space="preserve">Dette er såkalla «Sideloaded Add-ins». Dei nyttar eit webview, i praksis ein nettlesar</w:t>
          </w:r>
        </w:t>
      </w:r>

which Word (and LibreOffice) do not display when opening the document; presumably nested runs aren't allowed in OOXML.
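
A quick way to check an output file for this problem is to walk the document XML and report any w:r that sits inside another w:r. The following is only a minimal sketch: the function names are made up for illustration, and it assumes the main document part lives at the conventional word/document.xml path inside the .docx zip.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, normally bound to the "w" prefix.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
RUN = "{%s}r" % W_NS


def find_nested_runs(document_xml):
    """Return the text of every <w:r> that has another <w:r> as an ancestor.

    `document_xml` may be a str or bytes holding the XML of the document part.
    """
    root = ET.fromstring(document_xml)
    nested = []

    def walk(elem, inside_run):
        is_run = elem.tag == RUN
        if is_run and inside_run:
            nested.append("".join(elem.itertext()).strip())
        for child in elem:
            walk(child, inside_run or is_run)

    walk(root, False)
    return nested


def nested_runs_in_docx(path):
    """Hypothetical helper: run the check on a .docx file on disk.

    Assumes the main part is stored as word/document.xml.
    """
    with zipfile.ZipFile(path) as archive:
        return find_nested_runs(archive.read("word/document.xml"))
```

Calling nested_runs_in_docx("/tmp/ut.yes.docx") on the transfuse output from the command above should list the affected text, while an empty list means no run is nested in another.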

(Note: if I first re-save in.docx from LibreOffice, transfuse handles it fine, because LibreOffice merges all the runs in the input paragraph on saving, removing the proofErr elements.)

unhammer commented Jun 3, 2022

Strangely, the text isn't split in the input to the pipeline. The split point is at "nettleser som", and the input XML has

        <w:t xml:space="preserve">». De benytter et webview, i praksis en nettleser som er bygget inn i Office-programmene, for å vise innholdet sitt og utføre oppgavene sine. </w:t>

Here's what it looks like for the first step of the pipeline:

[transfuse:\/tmp\/transfuse-D7x3hD-8b_Y]

[tf-block:1-Zh6TUA]

Teknologi.[]

[tf-block:2-SKmwAw]

[[t:text:SyTAKg]]Dette er såkalte «Sideloaded Add-ins». De benytter et webview, i praksis en nettleser som er bygget inn i Office-programmene, for å vise innholdet sitt og utføre oppgavene sine.[[/]] .[]

Then, right after the full pipeline, we have word-bound tags galore:

[transfuse:\/tmp\/transfuse-D7x3hD-8b_Y]

[tf-block:1-Zh6TUA]

Teknologi.[]

[tf-block:2-SKmwAw]

[[t:text:SyTAKg]]Dette[[/]] [[t:text:SyTAKg]]er[[/]] [[t:text:SyTAKg]]såkalla[[/]] [[t:text:SyTAKg]]«[[/]][[t:text:SyTAKg]]Sideloaded[[/]] [[t:text:SyTAKg]]Add-[[/]][[t:text:SyTAKg]]ins[[/]][[t:text:SyTAKg]]»[[/]][[t:text:SyTAKg]].[[/]] [[t:text:SyTAKg]]Dei[[/]] [[t:text:SyTAKg]]nyttar[[/]] [[t:text:SyTAKg]]eit[[/]] [[t:text:SyTAKg]]webview[[/]][[t:text:SyTAKg]],[[/]] [[t:text:SyTAKg]]i[[/]] [[t:text:SyTAKg]]praksis[[/]] [[t:text:SyTAKg]]ein[[/]] [[t:text:SyTAKg]]ne[[t:text:SyTAKg]]ttl[[/]]esar[[/]] [[t:text:SyTAKg]]som[[/]] [[t:text:SyTAKg]]er[[/]] [[t:text:SyTAKg]]bygd inn[[/]] [[t:text:SyTAKg]]i[[/]] [[t:text:SyTAKg]]Office-[[/]][[t:text:SyTAKg]]programma[[/]][[t:text:SyTAKg]],[[/]] [[t:text:SyTAKg]]for[[/]] [[t:text:SyTAKg]]å[[/]] [[t:text:SyTAKg]]visa[[/]] [[t:text:SyTAKg]]innhaldet[[/]] [[t:text:SyTAKg]]sitt[[/]] [[t:text:SyTAKg]]og[[/]] [[t:text:SyTAKg]]utføra[[/]] [[t:text:SyTAKg]]oppgåvene[[/]] [[t:text:SyTAKg]]sine[[/]][[t:text:SyTAKg]].[[/]] .[]

The second-to-last step, before postgenerator, looks like
[[t:text:SyTAKg]]ne~tt[[/]][[t:text:SyTAKg]]lesar[[/]]
at the split-point, while after postgenerator we get
[[t:text:SyTAKg]]ne[[t:text:SyTAKg]]ttl[[/]]esar[[/]] [[t:text:SyTAKg]]som[[/]]
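
The nesting at the split point can also be found mechanically by tracking how many word-bound tags are open while scanning the stream. This is a rough sketch rather than anything from lttoolbox or transfuse: the function name is hypothetical, the tag syntax is taken only from the dumps above ([[...]] opens, [[/]] closes), and backslash-escaped brackets in the stream are ignored for simplicity.

```python
import re

# Matches an opening tag such as [[t:text:SyTAKg]] or the closing tag [[/]].
TAG = re.compile(r"\[\[(/|[^\[\]]+)\]\]")


def find_nested_wordblanks(stream):
    """Return the offsets of opening tags that start while another is still open."""
    depth = 0
    nested_at = []
    for match in TAG.finditer(stream):
        if match.group(1) == "/":
            # Closing tag: leave one level, never dropping below zero.
            depth = max(depth - 1, 0)
        else:
            if depth > 0:
                nested_at.append(match.start())
            depth += 1
    return nested_at
```

Run on the output of each pipeline step in turn, the first step that returns a non-empty list is the one that introduced the nesting.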

So is the issue here that postgenerator should not be creating these nested word blanks, or that transfuse should somehow know how to deal with nested word blanks?

unhammer commented Jun 3, 2022

@mr-martian does your apertium/lttoolbox#144 avoid nested word blanks in postgen?

@TinoDidriksen
Owner

The pipe may not yield nested structures, nor will Transfuse give it nested structures, so that looks like a bug in postgen.
