Dealing with whitespaces #21
Comments
Yes, whitespace is usually not considered part of the content, which is why an XML document can be reformatted to be more human-readable. However, I can see that this is a loss of information: after parsing and stringifying, the document is no longer exactly the same. Are the spaces meant to be part of the office document? I would really like to enable you to parse the content of such documents correctly.
Yes, they are part of the document. It's a simplified version of http://officeopenxml.com/WPtext.php.
Thank you very much for this question. It made me look deeper into a topic I had always pushed away: reading office documents. I came up with the following to reproduce the issue, and yes, the condition that filters whitespace is a problem here.

const fs = require("fs");
const JSZip = require("jszip");
const txml = require("txml");

fs.readFile("wordpad.docx", async function (err, data) {
  if (err) throw err;
  // A .docx file is a zip archive; list the entries it contains.
  const zip = await JSZip.loadAsync(data);
  console.log(Object.keys(zip.files));
  // Parse the main document part and print both the raw XML and the parsed tree.
  const textContent = await zip.file("word/document.xml").async("string");
  const content = txml.parse(textContent);
  console.log(textContent);
  console.log(JSON.stringify(content, undefined, " "));
  // Stringify again, replacing the "></?xml>" emitted for the XML declaration with "?>".
  const newContent = txml.stringify(content).split("></?xml>").join("?>\n");
  zip.file("content.xml", newContent, { string: true });
  const updatedZip = await zip.generateAsync({ type: "nodebuffer" });
  fs.writeFileSync("wordpad2fromNode.docx", updatedZip);
});

I found when commenting the
Yes, I did the same, and so far so good.
There's still a bug here actually:
..
The current code is this: https://github.com/TobiasNickel/tXml/blob/master/tXml.js#L146

var text = parseText()
if (keepWhitespace || text.trim().length > 0)
    children.push(text);

With it, when the developer wants to
The code for the next version has already been updated to allow smaller bundles. This is another great change for version 5. Thanks for your feedback. I think I will prepare the version 5 update for npm at the weekend.
Cool. Just to be clear, what I mean is that the current code doesn't trim text, even though it thinks it does:

var text = parseText()
if (keepWhitespace || text.trim().length > 0)
    children.push(text);
//                ^^^^--- not trimmed!

.. if I
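To make the point above concrete, here is a minimal sketch comparing the quoted condition with one possible variant that stores the trimmed value. The function names `pushAsQuoted` and `pushTrimmed` are hypothetical and exist only for this illustration; this is not necessarily how tXml resolved it.

```javascript
// Behaviour as quoted in the thread: trim() is used only for the check,
// so the untrimmed text is what ends up in `children`.
function pushAsQuoted(children, text, keepWhitespace) {
  if (keepWhitespace || text.trim().length > 0) children.push(text);
  return children;
}

// One possible variant (an assumption, not the library's code):
// when whitespace is not kept, store the trimmed text instead.
function pushTrimmed(children, text, keepWhitespace) {
  if (keepWhitespace) {
    children.push(text); // keep the text exactly as parsed
    return children;
  }
  const trimmed = text.trim();
  if (trimmed.length > 0) children.push(trimmed);
  return children;
}

const raw = '\n  Hello  \n';
console.log(JSON.stringify(pushAsQuoted([], raw, false))); // ["\n  Hello  \n"]
console.log(JSON.stringify(pushTrimmed([], raw, false)));  // ["Hello"]
```

With `keepWhitespace` set to `true`, both functions store the text unchanged, so the two only differ in the default mode.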
In fact, office documents use xml:space to mark whether whitespace is significant, such as this:
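As a rough illustration of how `xml:space` could be honoured after parsing, the sketch below walks a tree and drops whitespace-only text unless an enclosing element declares `xml:space="preserve"`. It assumes the parser was asked to keep all whitespace and that nodes have the shape `{ tagName, attributes, children }`; the function name `normalizeWhitespace` is hypothetical and not part of tXml.

```javascript
// Sketch only: applies xml:space to an already parsed tree.
// Assumed node shape: { tagName, attributes, children }, where children
// are either strings (text) or nested nodes.
function normalizeWhitespace(node, preserve = false) {
  const space = node.attributes && node.attributes['xml:space'];
  // xml:space is inherited by descendants until it is overridden.
  if (space === 'preserve') preserve = true;
  else if (space === 'default') preserve = false;

  const children = [];
  for (const child of node.children || []) {
    if (typeof child === 'string') {
      if (preserve) {
        children.push(child); // whitespace is significant here
      } else if (child.trim().length > 0) {
        children.push(child.trim()); // drop formatting whitespace
      }
    } else {
      children.push(normalizeWhitespace(child, preserve));
    }
  }
  return { ...node, children };
}

// A run whose w:t holds a single significant space, surrounded by indentation.
const run = {
  tagName: 'w:r',
  attributes: {},
  children: [
    '\n  ',
    { tagName: 'w:t', attributes: { 'xml:space': 'preserve' }, children: [' '] },
    '\n',
  ],
};

const cleaned = normalizeWhitespace(run);
console.log(JSON.stringify(cleaned.children[0].children)); // [" "]
```

The indentation around `w:t` is removed, while the single space inside it survives because of the `preserve` declaration.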
Note the <w:t> </w:t> part. Checking the result object, that 'w:t' item has an empty 'children' array: children: [].

After reading the source code, it seems that the parseChildren function has the following lines:

text.trim() causes the issue. Is there any particular reason 'trim()' is needed here? Or am I missing something in the process?

Thanks!
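The effect described above can be reproduced without the parser. This small sketch wraps the condition quoted elsewhere in this thread in a hypothetical helper, `shouldKeep`, and applies it to the text inside `<w:t> </w:t>`.

```javascript
// Mirrors the quoted condition; `shouldKeep` is a name made up for this sketch.
function shouldKeep(text, keepWhitespace) {
  return keepWhitespace || text.trim().length > 0;
}

const wtText = ' '; // the text content of <w:t> </w:t>

console.log(shouldKeep(wtText, false));  // false: the space is dropped, children stays []
console.log(shouldKeep(wtText, true));   // true: the space is kept
console.log(shouldKeep('Hello', false)); // true: non-blank text is always kept
```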