Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dealing with whitespaces #21

Closed
moongazers opened this issue Jan 15, 2021 · 8 comments
Closed

Dealing with whitespaces #21

moongazers opened this issue Jan 15, 2021 · 8 comments

Comments

@moongazers
Copy link

moongazers commented Jan 15, 2021

  let str = '<?xml version="1.0" encoding="UTF-8" standalone="yes"?><w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"><w:body><w:p><w:r><w:t>aaaaa</w:t></w:r><w:r><w:rPr><w:sz w:val="48"></w:sz></w:rPr><w:t>       </w:t></w:r><w:r><w:t>bbbbbb</w:t></w:r></w:p><w:sectPr><w:pgSz w:w="793.7008056640625" w:h="1122.5196533203125" w:orient="portrait"></w:pgSz><w:pgMar w:top="95.99999237060547" w:right="119.81102752685547" w:bottom="95.99999237060547" w:left="119.81102752685547" w:header="47.24409484863281" w:footer="47.24409484863281" w:gutter="0"></w:pgMar></w:sectPr></w:body></w:document>';
  const resut = txml.parse(str);

Note the <w:t> </w:t> part.
Checking the result object, that 'w:t' item has an empty 'children' array: children: [].

After reading source code, it seems that parseChildren function has the following lines:

var text = parseText()
if (text.trim().length > 0)
    children.push(text);
pos++;

text.trim()causes the issue. Is there any particular purpose 'trim()' is needed here? Or am I missing something in the process?

Thanks!

@TobiasNickel
Copy link
Owner

yes, white spaces are usually not considered part of the content. so we can format an xml document to be more human readable.

however, I can see, that this is a lost of information. and when stringifying an object, it is no longer be exactly the same.

are the spaces to be part of the office document? I would really like to enable you to parse the content of such documents correctly.

@moongazers
Copy link
Author

yes, white spaces are usually not considered part of the content. so we can format an xml document to be more human readable.

however, I can see, that this is a lost of information. and when stringifying an object, it is no longer be exactly the same.

are the spaces to be part of the office document? I would really like to enable you to parse the content of such documents correctly.

Yes they are part of document. It's a simplified version of http://officeopenxml.com/WPtext.php.

@TobiasNickel
Copy link
Owner

I really thank you for this question. You made me look deeper into a topic that I always pushed away, reading office documents.

I came up with the following to reproduce the issue, and yes, the condition to filter whitespace is a problem here.

var fs = require("fs");
var JSZip = require("jszip");
const txml = require('txml')

fs.readFile("wordpad.docx", async function(err, data) {
    if (err) throw err;
    const zip = await JSZip.loadAsync(data);
    
    console.log(Object.keys(zip.files))
    const textContent = await zip.file("word/document.xml").async("string");
    const content = txml.parse(textContent);
    console.log(textContent);
    console.log(JSON.stringify(content,undefined,'  '));
    const newContent = txml.stringify(content).split('></?xml>').join('?>\n')
    zip.file("content.xml", newContent, {string: true});
    const updatedZip = await zip.generateAsync({type: "nodebuffer"});
    fs.writeFileSync('wordpad2fromNode.docx',updatedZip);
});

I found when commenting the if statement, that the new written document is working fine. How would you like it when we add an option keepWhitespaces. I think this could be a minor update. I also still have to fix the type-definition. #20 . that can be done in one go.

@moongazers
Copy link
Author

moongazers commented Jan 19, 2021

I really thank you for this question. You made me look deeper into a topic that I always pushed away, reading office documents.

I came up with the following to reproduce the issue, and yes, the condition to filter whitespace is a problem here.

var fs = require("fs");
var JSZip = require("jszip");
const txml = require('txml')

fs.readFile("wordpad.docx", async function(err, data) {
    if (err) throw err;
    const zip = await JSZip.loadAsync(data);
    
    console.log(Object.keys(zip.files))
    const textContent = await zip.file("word/document.xml").async("string");
    const content = txml.parse(textContent);
    console.log(textContent);
    console.log(JSON.stringify(content,undefined,'  '));
    const newContent = txml.stringify(content).split('></?xml>').join('?>\n')
    zip.file("content.xml", newContent, {string: true});
    const updatedZip = await zip.generateAsync({type: "nodebuffer"});
    fs.writeFileSync('wordpad2fromNode.docx',updatedZip);
});

I found when commenting the if statement, that the new written document is working fine. How would you like it when we add an option keepWhitespaces. I think this could be a minor update. I also still have to fix the type-definition. #20 . that can be done in one go.

Yes I did the same and so far so good. keepWhitespaces sounds good to me. Glad I can help :) Thanks again for the great lib!

@ConradHughes
Copy link

ConradHughes commented May 18, 2021

There's still a bug here actually:

var text = parseText();
if (text.trim().length > 0)
    children.push(text);

.. text.trim() does not modify text, so the text that gets pushed is the original text, including whitespace. The only thing that the old code achieved was to drop strings that are all whitespace! The issue is still there even with the keepWhitespaces patch. I think you need something like this:

const text = parseText();
if (keepWhitespace) // keepWhitespace will force even empty strings to be pushed - is this desirable?
    children.push(text);
else {
    const trimmed = text.trim();
    if (trimmed.length > 0)
        children.push(trimmed);
}

@TobiasNickel
Copy link
Owner

the current code is this: https://github.com/TobiasNickel/tXml/blob/master/tXml.js#L146

    var text = parseText()
    if (keepWhitespace || text.trim().length > 0)
        children.push(text);

with it, when the developer want to keepWhitespace the value is kept anyway. But I think that your solution is still better. because it will be more consistent with removing whitespace when not keeping whitespace.

The code for the next version has already updated to allow smaller bundles. this is an other great change for version 5. Thanks for your feedback. I think at the weekend I will prepare the version 5 update for npm.

@ConradHughes
Copy link

Cool. Just to be clear, what I mean is that the current code doesn't trim text, even though it thinks it does:

    var text = parseText()
    if (keepWhitespace || text.trim().length > 0)
        children.push(text);
//                    ^^^^--- not trimmed!

.. if I txml.parse('<tag> text </tag>', { simplify: true }), I get { tag: ' text ' }, where I think the expectation is that it should be { tag: 'text' }..

TobiasNickel added a commit that referenced this issue Jul 10, 2021
@richcanvas
Copy link

in fact office documents use xml:space to mark if whitespace is significant, such as this:
<w:t xml:space="preserve"> </w:t>
in this case, whitespace should be preserved.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants