Doc/Docx native format file generation #2

Open
joeflack4 opened this Issue Jun 18, 2018 · 6 comments

Contributor

joeflack4 commented Jun 18, 2018

Description

This would be a new feature in that PPP could output .docx rather than just .doc. It would also improve the .doc output (should we choose to keep it): the generated file would be in the native .doc format, rather than an .html document with a .doc file extension.

Currently, a .docx file needs to be created manually. Meanwhile, the .doc file produced by PPP is buggy, since it is just plain HTML underneath. This can be seen by opening the file in Google Docs or OpenOffice. If opened in MS Word, it appears to be fine, and opening and saving it in Word converts it into a format that is compatible with OpenOffice and Google Docs. Ideally, this process should be streamlined so that the user does not have to do this manually.

Possible solutions

1) Lowriter for HTML to .doc/.docx conversion

Example: lowriter --headless --convert-to docx ~/file.html

2) Custom implementation of .doc/.docx generator using OOXML

Although the older binary formats (.doc, .xls, and .ppt) continue to be supported by Microsoft, OOXML is now the default format of all Microsoft Office documents (.docx, .xlsx, and .pptx).
http://officeopenxml.com/
https://en.wikipedia.org/wiki/Office_Open_XML

3) Use workflow automation tools

https://blog.testproject.io/2016/12/22/open-source-test-automation-tools-for-desktop-applications/
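For option 2, a .docx file is a ZIP package of XML parts. The following is a minimal sketch of writing such a package with only the Python standard library. The function name `write_minimal_docx` is hypothetical and not part of PPP, and the three parts shown are, as far as I know, the smallest set a word processor needs. The result has not been checked here against Word, LibreOffice, or Google Docs, and PPP's real output (tables, forms) would need far more markup than plain paragraphs.

```python
import zipfile
from xml.sax.saxutils import escape

# Declares the content type of each part in the package.
CONTENT_TYPES = """<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
  <Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
  <Default Extension="xml" ContentType="application/xml"/>
  <Override PartName="/word/document.xml" ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>
</Types>"""

# Points the package at its main document part.
RELS = """<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
  <Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" Target="word/document.xml"/>
</Relationships>"""

DOCUMENT_TEMPLATE = """<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>{body}</w:body>
</w:document>"""


def write_minimal_docx(path, paragraphs):
    """Write a .docx containing one plain paragraph per input string."""
    body = "".join(
        '<w:p><w:r><w:t xml:space="preserve">{}</w:t></w:r></w:p>'.format(escape(text))
        for text in paragraphs
    )
    with zipfile.ZipFile(path, "w", zipfile.ZIP_DEFLATED) as package:
        package.writestr("[Content_Types].xml", CONTENT_TYPES)
        package.writestr("_rels/.rels", RELS)
        package.writestr("word/document.xml", DOCUMENT_TEMPLATE.format(body=body))
```

Calling `write_minimal_docx("test.docx", ["First paragraph", "Second paragraph"])` produces a file that is a real OOXML package rather than renamed HTML.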

@joeflack4 joeflack4 self-assigned this Jun 18, 2018

@joeflack4 joeflack4 changed the title from Adding .docx support and automating .doc integrity to Post-process generated file in MS Office before outputting it to the end user. (doc, docx) Jun 18, 2018

tulvit Jun 18, 2018

Member

How to reproduce the "bug"*

*"Bug" is in quotation marks because it's not really a bug, just the different capabilities of different office engines when converting a web page into a doc.

  • Generating test.html from test.xlsx via the following command:
    python3 -m ppp test.xlsx -p minimal > test.html

  • Creating two copies of this file, test1.html and test2.html, and manually changing their extensions to .doc, so there are two files: test1.doc and test2.doc (at this point they are identical copies).

  • Now let's open test2.doc in MS Office:
    image
    Looks good.

  • Now we'll re-save this file (currently open in MS Office) in doc or docx format; which of the two is not important. What is important: while saving, the document type should be set to a "word/text document" and not to "web page".

  • Ok, at this point we have 2 files. test1.doc, which is just a renamed test1.html and was never opened before, and test2.doc - which was renamed as well, but opened and resaved in MS Office. Now let's open both of these files in Open/Libre Office and GoogleDocs, and see the results.

  • Open/Libre Office
    test1.doc:
    image
    Broken document, not expected results.
    test2.doc:
    image
    Not bad!

  • GoogleDocs
    test1.doc:
    image
    Not as badly broken, but still not good at all.
    test2.doc:
    image
    And again, it's just fine.

Why does this happen? It seems that only MS Office's engine is capable of parsing this HTML in the desired way.

So, this test.doc file becomes a valid document file only when MS Office repairs/parses it. In other words, the first time the file is opened, MS Office doesn't just "open" it, but "repairs/parses/converts/whatever".

And without this "re-saving" procedure all we have is just an HTML file manually renamed to a document file.

It's not a problem if the end user uses MS Office. It will be a problem if anybody uses any document editor/viewer other than MS Office, including Google Docs.
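The difference between test1.doc and test2.doc can also be checked without opening any office suite, by looking at the first bytes of the file. The sketch below assumes three well-known signatures: the OLE2 compound-file header used by binary .doc, the ZIP header used by .docx, and an HTML doctype or `<html>` tag for a renamed web page. The function name `sniff_document_format` is hypothetical, and the check is a heuristic, not a validator.

```python
OLE2_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")  # binary .doc (compound file)
ZIP_MAGIC = b"PK\x03\x04"                        # .docx and other ZIP packages


def sniff_document_format(path):
    """Guess the real container format of a file, ignoring its extension."""
    with open(path, "rb") as handle:
        head = handle.read(1024)
    if head.startswith(OLE2_MAGIC):
        return "ole2"
    if head.startswith(ZIP_MAGIC):
        return "zip"
    # Skip a UTF-8 byte order mark and leading whitespace before looking for markup.
    text = head.lstrip(b"\xef\xbb\xbf \t\r\n").lower()
    if text.startswith((b"<!doctype html", b"<html")):
        return "html"
    return "unknown"
```

A file that was only renamed from .html should come back as "html", while a file re-saved as a Word document should come back as "ole2" (.doc) or "zip" (.docx).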

joeflack4 Jun 19, 2018

Contributor

Thanks for all of the very specific details. This is useful.

Your MS Word looks the same as mine. On my computer at least, changing file extension to .doc and then opening in MS Word and hitting save, and close, did not end up changing the file size. I did not test in Open Office or Google Docs yet, though.

I'm using OSX High Sierra 10.13.3, MS Word 15.26, from 2016.

@joeflack4 joeflack4 added the bug label Jun 19, 2018

tulvit Jun 19, 2018

Member

@joeflack4

On my computer at least, changing file extension to .doc and then opening in MS Word and hitting save, and close, did not end up changing the file size.

My assumption (of which I'm 99.99% certain, though) is that you didn't save the file at all. You open the file, do nothing, click "Save", and nothing happens, though it may seem as if it was saved. There were no changes, so there was nothing to save. In LibreOffice, for example, the "Save" button is inactive in this case, and it's kind of strange that the "Save" button is clickable in MS Office.

What I mean... Please take a look at the modification date of the file you are about to open. Say it's "2018 June 12:25pm". Then open and "save" it, and check the date again: it'll be the same "12:25pm", so nothing was saved, it's the same old file.

How to save it without "saving as"? For example, add and delete some character, a space/full stop/etc. So there will be some "change" in a document. Then press "save".

And voilà - file's date is changed, as well as the size (so the file was actually saved).

*I was struggling with this "saving but not saving" MS Office behaviour yesterday as well, but quickly noticed unchanged date of a "saved" file.
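The "did it really save?" check described above can be scripted. This is a small sketch that records a file's modification time and size before the Word step and compares them afterwards. The helper names `file_fingerprint` and `was_rewritten` are hypothetical.

```python
import os


def file_fingerprint(path):
    """Return (modification time in ns, size in bytes) for a file."""
    info = os.stat(path)
    return (info.st_mtime_ns, info.st_size)


def was_rewritten(path, before):
    """True if the file's timestamp or size differs from an earlier fingerprint."""
    return file_fingerprint(path) != before
```

Typical use: take `before = file_fingerprint("test2.doc")`, open and save the file in Word, then call `was_rewritten("test2.doc", before)`. A result of `False` suggests Word skipped the write because the document had no changes.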

Ah, and one quite important thing to mention. After renaming test.html to test.doc and opening it in MS Office, MS Office will think that it's a "web document". After adding/deleting a space character (or any other modification, so that the file is actually saved) and clicking "save", it'll be saved successfully. But with a "web page" type! So, if we then open this saved file in Google Docs, it'll be rendered like this:

image

So, in order to produce a repaired, working doc file, it should not only be opened in MS Office, but also "Saved as" with a document type specified.

tulvit Jun 19, 2018

Member

Possible ways to handle it

Disclaimer: these are just ideas/suggestions, since I have little to no experience in this field.

1. Do nothing.
If end users always open the provided file in MS Office, then there will be no problem at all. (Possibly, some new version of MS Office may render HTML in a different way, but let's assume that never happens.)

2. Manually open each and every generated file in MS Office, re-save it, and only then send it to the end user.
May work only if there are only a few users. For 10-100 users a day it'll take a whole day, with 100+ users it'll be just impossible.

3. Automation of the previous re-saving procedure.
Just the same as 2, but with scripts, not hands. I.e. after the HTML file is generated, it'll be opened in MS Office and then re-saved automatically, via the Windows API or some available tool/software.

4. Rebuilding the HTML output in such a way that any office software renders it in the desired way.
Making it much simpler and trying to fix particular bugs (the main problem with OpenOffice is rendering forms, and with Google Docs it is nested table cells). Makes sense, but not a good option either: it means fixing bug after bug, and still leaving out all other software (Polaris Office, WPS Office, dozens of them).

5. Creating .doc the right way.
Not via .html -> .doc, but generating the doc file directly. There are already quite a few such libraries (python-docx, PHPOffice), but last time I checked, all of them offered only basic operations like creating a header and adding paragraphs, which will not suit our needs.

6. Using some open text formats instead of .doc
ODT, I assume. I haven't investigated it so far, but I think it'll give more options to generate files via an API, and there should already be a lot of open-source solutions. And .odt files should work just fine in all office suites. So it's basically the same as 5, but .odt instead of .doc (and it allows us to leave MS Office out of the picture).

7. .odt to .doc
Basically the same as 6, but a little bit extended. As soon as we have a valid .odt file, I believe it should be feasible to convert it to the *.doc format the right way. If I'm not wrong, OpenOffice offers a CLI application as well, so it should be pretty straightforward.

8. Something else...
I guess, there are still some other options, and probably much better ones.
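Option 3 above (automating the re-save) could look roughly like the sketch below. It assumes a Windows machine with MS Word installed and the third-party pywin32 package, none of which is confirmed by this thread. The function names `word_format_for` and `resave_with_word` are hypothetical. The numeric values are Word's `WdSaveFormat` constants for a Word 97-2003 document and for the default .docx format. This has not been run against PPP output.

```python
import os

WD_FORMAT_DOCUMENT = 0           # wdFormatDocument: binary .doc
WD_FORMAT_DOCUMENT_DEFAULT = 16  # wdFormatDocumentDefault: .docx


def word_format_for(path):
    """Map a target file extension to the Word save-format constant."""
    formats = {".doc": WD_FORMAT_DOCUMENT, ".docx": WD_FORMAT_DOCUMENT_DEFAULT}
    extension = os.path.splitext(path)[1].lower()
    if extension not in formats:
        raise ValueError("unsupported target extension: " + extension)
    return formats[extension]


def resave_with_word(source, target):
    """Open `source` in Word and 'Save As' `target` with an explicit document type."""
    import win32com.client  # assumption: pywin32 is installed (Windows only)

    word = win32com.client.Dispatch("Word.Application")
    word.Visible = False
    try:
        document = word.Documents.Open(os.path.abspath(source))
        try:
            # Passing the format explicitly avoids saving as a "web page".
            document.SaveAs2(os.path.abspath(target), word_format_for(target))
        finally:
            document.Close(False)  # close without prompting to save changes
    finally:
        word.Quit()
```

The explicit format argument matters because, as described earlier in the thread, a plain "Save" keeps the "web page" type.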

tulvit Jun 27, 2018

Member

UPDATE

Current thoughts after a little bit of investigation.

So, the final goal is to produce a valid doc file automatically.

Right now we have an .html file renamed to .doc (so, effectively, it's just a broken doc file), and only Word can repair it well. Not a good workflow.

And back to OpenOffice. Let's assume it'll be possible to edit the HTML code in such a way that Open/Libre Office will be able to parse it. Actually, I've already tested it, and there is a good chance it can be done. There are a lot of problems, though (OpenOffice ignores some HTML rules, like CSS styles for tables, doesn't ignore HTML comments, and so on and so forth).
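One of the problems listed above, HTML comments not being ignored, could be handled by a preprocessing step before conversion. A minimal sketch, assuming a regular expression is good enough for PPP's generated HTML. The function name `strip_html_comments` is hypothetical. Note that it also removes comments inside `<script>` or `<style>` blocks and MS Office conditional comments, which may or may not be wanted.

```python
import re

# Non-greedy match so that each comment is removed separately; DOTALL covers
# comments that span several lines.
HTML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)


def strip_html_comments(html):
    """Return the HTML text with all <!-- ... --> comments removed."""
    return HTML_COMMENT.sub("", html)
```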

But again, let's assume edited/modified html file will be rendered in OpenOffice well, then what?

Then it'll be possible to produce a valid doc or docx file with a single command:

lowriter --headless --convert-to docx ~/file.html

So it will not be just a renamed html file anymore, but a valid doc file, which may be opened in any software (OpenOffice, MS Word, GoogleDocs, you name it...) without any problems. And the issue will be solved.
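Calling that command from Python could be sketched as follows. The flags `--headless`, `--convert-to`, and `--outdir` are LibreOffice command-line options. The function names and the assumption that `lowriter` is on the PATH are hypothetical. The output path is derived from LibreOffice's usual behaviour of writing `<input stem>.<format>` into the output directory (the current directory if none is given), and the existence check is there because the exit code alone may not reveal a failed conversion.

```python
import os
import subprocess


def build_convert_command(source, target_format="docx", outdir=None, binary="lowriter"):
    """Build the LibreOffice command line for a headless conversion."""
    command = [binary, "--headless", "--convert-to", target_format]
    if outdir is not None:
        command += ["--outdir", outdir]
    command.append(source)
    return command


def convert(source, target_format="docx", outdir=None):
    """Run the conversion and return the path of the converted file."""
    subprocess.run(build_convert_command(source, target_format, outdir), check=True)
    stem = os.path.splitext(os.path.basename(source))[0]
    output = os.path.join(outdir or os.getcwd(), stem + "." + target_format)
    if not os.path.exists(output):
        raise RuntimeError("conversion produced no output file: " + output)
    return output
```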

At this moment I consider this route the main (or even the only possible) option.

@joeflack4 joeflack4 added this to To do in PPP via automation Jun 28, 2018

joeflack4 Jun 29, 2018

Contributor

@tulvit I am in agreement. I would like to try this.

The only possible disadvantage of this route that I can see is if we later try to implement a special feature where we allow a user to edit the Word document, save it, and then run a special command to merge their changes back into the original XlsForm Excel file. I feel like this would be much easier to implement if the underlying data structure of the document were HTML.

However, as there are no plans presently to implement that feature, let's not worry about it. I give you permission to proceed with your strategy.

@joeflack4 joeflack4 added P3 and removed low priority labels Aug 17, 2018

@joeflack4 joeflack4 changed the title from Post-process generated file in MS Office before outputting it to the end user. (doc, docx) to Doc/Docx native format file generation Aug 20, 2018
