Doc/Docx native format file generation #2
This would be a new feature, in that PPP could output .docx rather than just .doc. It would also improve the .doc output (should we choose to keep it), in that the generated .doc file would be in its native format rather than an .html document with a .doc file extension.
Currently, a .docx file needs to be created manually. Meanwhile, the .doc file produced by PPP is buggy, since it is just plain HTML underneath. This can be seen by opening the file in Google Docs or OpenOffice. If opened in MS Word, it appears to be fine, and opening and saving it in Word converts it into a format that is compatible with OpenOffice and Google Docs. Ideally, this process should be streamlined so that the user does not have to do it manually.
1) Lowriter for HTML to .doc/.docx conversion
2) Custom implementation of .doc/.docx generator using OOXML
3) Use workflow automation tools
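As a rough illustration of option 2, here is a minimal sketch of writing a .docx package by hand with Python's standard library. A .docx file is a ZIP archive of OOXML parts; the three parts below are about the smallest set that word processors generally accept. The function name `build_docx` is hypothetical, and the sketch only handles plain paragraphs (no tables or styles). Note that OOXML covers .docx only; the older binary .doc format is a different container and is not produced here.

```python
import io
import zipfile
from typing import Iterable
from xml.sax.saxutils import escape

# Declares the content type of each part in the package.
CONTENT_TYPES = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">'
    '<Default Extension="rels" '
    'ContentType="application/vnd.openxmlformats-package.relationships+xml"/>'
    '<Default Extension="xml" ContentType="application/xml"/>'
    '<Override PartName="/word/document.xml" '
    'ContentType="application/vnd.openxmlformats-officedocument.'
    'wordprocessingml.document.main+xml"/>'
    '</Types>'
)

# Points the package at its main document part.
ROOT_RELS = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Relationships '
    'xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
    '<Relationship Id="rId1" '
    'Type="http://schemas.openxmlformats.org/officeDocument/2006/'
    'relationships/officeDocument" Target="word/document.xml"/>'
    '</Relationships>'
)

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def build_docx(paragraphs: Iterable[str]) -> bytes:
    """Return the bytes of a minimal .docx with one run of text per paragraph."""
    body = "".join(
        '<w:p><w:r><w:t xml:space="preserve">%s</w:t></w:r></w:p>' % escape(text)
        for text in paragraphs
    )
    document = (
        '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
        '<w:document xmlns:w="%s"><w:body>%s</w:body></w:document>' % (W_NS, body)
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
        package.writestr("[Content_Types].xml", CONTENT_TYPES)
        package.writestr("_rels/.rels", ROOT_RELS)
        package.writestr("word/document.xml", document)
    return buffer.getvalue()
```

Writing the result to disk is then just `open("test.docx", "wb").write(build_docx(["Hello"]))`. A real generator would need tables, styles and numbering on top of this, which is where most of the work for this option would be.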
How to reproduce the "bug"*
*"Bug" is in quotation marks because it's not really a bug, just a difference in how different office engines convert a web page into a document.
Why does this happen? It seems that only MS Office's engine is capable of parsing this HTML in the desired way.
So this test.doc file becomes a valid document file if and only if MS Office repairs/parses it. In other words, when the file is first opened, MS Office doesn't "open" it, but "repairs/parses/converts/whatever" it.
And without this "re-saving" procedure, all we have is an HTML file manually renamed to a document file.
It's not a problem if the end user uses MS Office. It will be a problem if anybody uses any document editor/viewer other than MS Office, including Google Docs.
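One way to see the difference without opening any office program is to look at the first bytes of the file. The sketch below is only an illustration: the function name `sniff_document` is hypothetical, and it relies on the fact that a binary .doc is an OLE2 compound file (signature `D0 CF 11 E0 A1 B1 1A E1`), a .docx is a ZIP archive (signature `PK\x03\x04`), and a renamed web page starts with HTML markup.

```python
# Signature of the OLE2 compound file container used by binary .doc files.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")
# Local file header signature at the start of a ZIP archive (.docx, .odt).
ZIP_MAGIC = b"PK\x03\x04"


def sniff_document(data: bytes) -> str:
    """Classify file content as 'ole2', 'zip', 'html' or 'unknown'."""
    head = data[:1024]
    if head.startswith(OLE2_MAGIC):
        return "ole2"
    if head.startswith(ZIP_MAGIC):
        return "zip"
    # Skip a UTF-8 byte order mark and leading whitespace before checking markup.
    text = head.lstrip(b"\xef\xbb\xbf \t\r\n").lower()
    if text.startswith((b"<!doctype html", b"<html")):
        return "html"
    return "unknown"
```

Calling `sniff_document(open("test.doc", "rb").read())` on a renamed HTML file would be expected to report `html` rather than `ole2`.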
Thanks for all of the very specific details. This is useful.
Your MS Word looks the same as mine. On my computer at least, changing the file extension to .doc, then opening the file in MS Word, hitting save, and closing it did not change the file size. I have not tested in OpenOffice or Google Docs yet, though.
I'm using OSX High Sierra 10.13.3, MS Word 15.26, from 2016.
My assumption (of which I'm 99.99% certain, though) is that you didn't save the file at all. You open the file, you do nothing, you click "Save", and nothing happens, though it may seem as if it was saved (there were no changes, so there was nothing to save; in LibreOffice, for example, the "Save" button is inactive in this case, and it's kind of strange that the "Save" button is clickable in MS Office).
What I mean... Please take a look at the creation date of the file you open. Say it's "2018 June 12:25pm". Then open and "save" it, and check the date again: it'll be the same "12:25pm", so nothing was saved; it's the same old file.
How to save it without "saving as"? For example, add and then delete some character (a space, a full stop, etc.), so that there is some "change" in the document. Then press "Save".
And voilà: the file's date has changed, as well as its size (so the file was actually saved).
*I was struggling with this "saving but not saving" MS Office behaviour yesterday as well, but quickly noticed the unchanged date of the "saved" file.
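The same check can be scripted instead of done by eye. This is just a sketch with hypothetical names (`FileStamp`, `snapshot`, `was_rewritten`); it compares the modification time rather than the creation date, since that is what `os.stat` exposes portably, together with the file size.

```python
import os
from typing import NamedTuple


class FileStamp(NamedTuple):
    """Size and modification time of a file at one point in time."""
    size: int
    mtime_ns: int


def snapshot(path: str) -> FileStamp:
    """Record the current size and modification time of the file."""
    info = os.stat(path)
    return FileStamp(info.st_size, info.st_mtime_ns)


def was_rewritten(before: FileStamp, after: FileStamp) -> bool:
    """True if either the size or the modification time changed."""
    return before != after
```

Take `before = snapshot("test.doc")`, do the open/"save" in Word, then compare with `was_rewritten(before, snapshot("test.doc"))`.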
Ah, and one quite important thing to mention. After renaming test.html to test.doc and opening it in MS Office, MS Office will think that it's a "web document". After adding/deleting a space character (or making any other modification, so that the file is actually saved) and clicking "Save", it'll be saved successfully. But with a "web page" type! So, if we then open this saved file in Google Docs, it'll be rendered like this:
So, in order to produce a repaired, working doc file, it should not only be opened in MS Office, but also "Saved as" with a document type specified.
Possible ways to handle it
Disclaimer: these are just ideas/suggestions, since I have little to no experience in this field.
1. Do nothing.
2. Manually open each and every generated file in MS Office, re-save it, and only then send it to the end user.
3. Automation of the previous re-saving procedure.
4. Rebuilding the HTML output in such a way that any office software will render it in the desired way.
5. Creating .doc the right way.
6. Using some open text formats instead of .doc
7. .odt to .doc
8. Something else...
Current thoughts after a little bit of investigation.
So, the final goal is to produce a valid doc file automatically.
Right now we have an .html file renamed to a .doc file (so it's really just a broken doc file), and only Word can repair it well. Not a good workflow.
And back to OpenOffice. Let's assume it is possible to edit the HTML code in such a way that Open/Libre Office is able to parse it. Actually, I've already tested this, and there is a good chance it can be done. There are a lot of problems, though (OpenOffice ignores some HTML rules, like CSS styles for tables, doesn't ignore HTML comments, and so on and so forth).
But again, let's assume the edited/modified HTML file renders well in OpenOffice. Then what?
Then it'll be possible to produce a valid doc or docx file with just a single command:
So it will not be just a renamed html file anymore, but a valid doc file, which may be opened in any software (OpenOffice, MS Word, GoogleDocs, you name it...) without any problems. And the issue will be solved.
At this moment I consider this route the main (or even the only possible) option.
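A minimal sketch of what such a one-step conversion could look like when driven from a script, assuming LibreOffice is installed and its `soffice` (or `lowriter`) launcher is on PATH. It uses LibreOffice's `--headless`, `--convert-to` and `--outdir` options; the helper names `build_convert_command`, `expected_output` and `convert` are hypothetical, and this is not necessarily the exact command meant above. Some LibreOffice versions treat HTML input as a Writer/Web document and may need an explicit export filter (the `docx:<filter name>` form), which is untested here.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def build_convert_command(source: str, target_format: str = "docx",
                          outdir: str = ".", binary: str = "soffice") -> List[str]:
    """Build the argument list for a headless LibreOffice conversion."""
    return [binary, "--headless", "--convert-to", target_format,
            "--outdir", outdir, source]


def expected_output(source: str, target_format: str = "docx",
                    outdir: str = ".") -> Path:
    """Path LibreOffice is expected to write: same stem, new extension."""
    extension = target_format.split(":")[0]  # drop an optional filter name
    return Path(outdir) / (Path(source).stem + "." + extension)


def convert(source: str, target_format: str = "docx", outdir: str = ".") -> Path:
    """Run the conversion; raises if LibreOffice is missing or the call fails."""
    binary = shutil.which("soffice") or shutil.which("lowriter")
    if binary is None:
        raise FileNotFoundError("LibreOffice (soffice/lowriter) not found on PATH")
    command = build_convert_command(source, target_format, outdir, binary)
    subprocess.run(command, check=True)
    return expected_output(source, target_format, outdir)
```

For example, `convert("test.html", "docx", "out")` would run the conversion and return `out/test.docx`; passing `"doc"` instead asks for the older format.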
@tulvit I am in agreement. I would like to try this.
The only possible disadvantage of this route that I can see arises if we later try to implement a special feature where we allow a user to edit the Word document, save it, and then run a special command to merge their changes back into the original XlsForm Excel file. I feel like this would be much easier to implement if the underlying data structure of the document were HTML.
However, as there are no plans presently to implement that feature, let's not worry about it. I give you permission to proceed with your strategy.