Dealing with whitespaces #21

moongazers · 2021-01-15T09:08:38Z

  let str = '<?xml version="1.0" encoding="UTF-8" standalone="yes"?><w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"><w:body><w:p><w:r><w:t>aaaaa</w:t></w:r><w:r><w:rPr><w:sz w:val="48"></w:sz></w:rPr><w:t>       </w:t></w:r><w:r><w:t>bbbbbb</w:t></w:r></w:p><w:sectPr><w:pgSz w:w="793.7008056640625" w:h="1122.5196533203125" w:orient="portrait"></w:pgSz><w:pgMar w:top="95.99999237060547" w:right="119.81102752685547" w:bottom="95.99999237060547" w:left="119.81102752685547" w:header="47.24409484863281" w:footer="47.24409484863281" w:gutter="0"></w:pgMar></w:sectPr></w:body></w:document>';
  const resut = txml.parse(str);

Note the <w:t> </w:t> part.
Checking the result object, that 'w:t' item has an empty 'children' array: children: [].

After reading source code, it seems that parseChildren function has the following lines:

var text = parseText()
if (text.trim().length > 0)
    children.push(text);
pos++;

text.trim()causes the issue. Is there any particular purpose 'trim()' is needed here? Or am I missing something in the process?

Thanks!

The text was updated successfully, but these errors were encountered:

TobiasNickel · 2021-01-19T03:44:54Z

yes, white spaces are usually not considered part of the content. so we can format an xml document to be more human readable.

however, I can see, that this is a lost of information. and when stringifying an object, it is no longer be exactly the same.

are the spaces to be part of the office document? I would really like to enable you to parse the content of such documents correctly.

moongazers · 2021-01-19T05:50:27Z

yes, white spaces are usually not considered part of the content. so we can format an xml document to be more human readable.

however, I can see, that this is a lost of information. and when stringifying an object, it is no longer be exactly the same.

are the spaces to be part of the office document? I would really like to enable you to parse the content of such documents correctly.

Yes they are part of document. It's a simplified version of http://officeopenxml.com/WPtext.php.

TobiasNickel · 2021-01-19T09:14:10Z

I really thank you for this question. You made me look deeper into a topic that I always pushed away, reading office documents.

I came up with the following to reproduce the issue, and yes, the condition to filter whitespace is a problem here.

var fs = require("fs");
var JSZip = require("jszip");
const txml = require('txml')

fs.readFile("wordpad.docx", async function(err, data) {
    if (err) throw err;
    const zip = await JSZip.loadAsync(data);
    
    console.log(Object.keys(zip.files))
    const textContent = await zip.file("word/document.xml").async("string");
    const content = txml.parse(textContent);
    console.log(textContent);
    console.log(JSON.stringify(content,undefined,'  '));
    const newContent = txml.stringify(content).split('></?xml>').join('?>\n')
    zip.file("content.xml", newContent, {string: true});
    const updatedZip = await zip.generateAsync({type: "nodebuffer"});
    fs.writeFileSync('wordpad2fromNode.docx',updatedZip);
});

I found when commenting the if statement, that the new written document is working fine. How would you like it when we add an option keepWhitespaces. I think this could be a minor update. I also still have to fix the type-definition. #20 . that can be done in one go.

moongazers · 2021-01-19T09:43:58Z

I really thank you for this question. You made me look deeper into a topic that I always pushed away, reading office documents.

I came up with the following to reproduce the issue, and yes, the condition to filter whitespace is a problem here.
var fs = require("fs");
var JSZip = require("jszip");
const txml = require('txml')

fs.readFile("wordpad.docx", async function(err, data) {
    if (err) throw err;
    const zip = await JSZip.loadAsync(data);
    
    console.log(Object.keys(zip.files))
    const textContent = await zip.file("word/document.xml").async("string");
    const content = txml.parse(textContent);
    console.log(textContent);
    console.log(JSON.stringify(content,undefined,'  '));
    const newContent = txml.stringify(content).split('></?xml>').join('?>\n')
    zip.file("content.xml", newContent, {string: true});
    const updatedZip = await zip.generateAsync({type: "nodebuffer"});
    fs.writeFileSync('wordpad2fromNode.docx',updatedZip);
});
I found when commenting the if statement, that the new written document is working fine. How would you like it when we add an option keepWhitespaces. I think this could be a minor update. I also still have to fix the type-definition. #20 . that can be done in one go.

Yes I did the same and so far so good. keepWhitespaces sounds good to me. Glad I can help :) Thanks again for the great lib!

ConradHughes · 2021-05-18T13:51:40Z

There's still a bug here actually:

var text = parseText();
if (text.trim().length > 0)
    children.push(text);

.. text.trim() does not modify text, so the text that gets pushed is the original text, including whitespace. The only thing that the old code achieved was to drop strings that are all whitespace! The issue is still there even with the keepWhitespaces patch. I think you need something like this:

const text = parseText();
if (keepWhitespace) // keepWhitespace will force even empty strings to be pushed - is this desirable?
    children.push(text);
else {
    const trimmed = text.trim();
    if (trimmed.length > 0)
        children.push(trimmed);
}

TobiasNickel · 2021-05-19T11:31:08Z

the current code is this: https://github.com/TobiasNickel/tXml/blob/master/tXml.js#L146

    var text = parseText()
    if (keepWhitespace || text.trim().length > 0)
        children.push(text);

with it, when the developer want to keepWhitespace the value is kept anyway. But I think that your solution is still better. because it will be more consistent with removing whitespace when not keeping whitespace.

The code for the next version has already updated to allow smaller bundles. this is an other great change for version 5. Thanks for your feedback. I think at the weekend I will prepare the version 5 update for npm.

ConradHughes · 2021-05-19T14:32:14Z

Cool. Just to be clear, what I mean is that the current code doesn't trim text, even though it thinks it does:

    var text = parseText()
    if (keepWhitespace || text.trim().length > 0)
        children.push(text);
//                    ^^^^--- not trimmed!

.. if I txml.parse('<tag> text </tag>', { simplify: true }), I get { tag: ' text ' }, where I think the expectation is that it should be { tag: 'text' }..

richcanvas · 2022-05-22T02:37:24Z

in fact office documents use xml:space to mark if whitespace is significant, such as this:
<w:t xml:space="preserve"> </w:t>
in this case, whitespace should be preserved.

TobiasNickel added a commit that referenced this issue Jul 10, 2021

keepWhitespace #21

df00535

TobiasNickel closed this as completed Dec 17, 2021

richcanvas mentioned this issue May 22, 2022

office document whitespace #42

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Dealing with whitespaces #21

Dealing with whitespaces #21

moongazers commented Jan 15, 2021 •

edited

Loading

TobiasNickel commented Jan 19, 2021

moongazers commented Jan 19, 2021

TobiasNickel commented Jan 19, 2021

moongazers commented Jan 19, 2021 •

edited

Loading

ConradHughes commented May 18, 2021 •

edited

Loading

TobiasNickel commented May 19, 2021

ConradHughes commented May 19, 2021

richcanvas commented May 22, 2022

Dealing with whitespaces #21

Dealing with whitespaces #21

Comments

moongazers commented Jan 15, 2021 • edited Loading

TobiasNickel commented Jan 19, 2021

moongazers commented Jan 19, 2021

TobiasNickel commented Jan 19, 2021

moongazers commented Jan 19, 2021 • edited Loading

ConradHughes commented May 18, 2021 • edited Loading

TobiasNickel commented May 19, 2021

ConradHughes commented May 19, 2021

richcanvas commented May 22, 2022

moongazers commented Jan 15, 2021 •

edited

Loading

moongazers commented Jan 19, 2021 •

edited

Loading

ConradHughes commented May 18, 2021 •

edited

Loading