Windows html to docx fails to embed images in the docx file #316

Closed
saumzzz opened this issue Nov 29, 2022 · 18 comments · Fixed by #322

@saumzzz

saumzzz commented Nov 29, 2022

I have an HTML file that links to an image in the same folder. When converting from HTML to docx on Windows, the image is not embedded and pandoc emits the warning [WARNING] Could not fetch resource test.png: PandocResourceNotFound "test.png"

pypandoc-binary==1.10

html file:

<!DOCTYPE html>
<html lang="en">

<head>
  <title>Test Title</title>
  <meta name="viewport" content="width=device-width, initial-scale=1">
</head>

<body>

  <h1 class="section">Test Heading</h1>

  <div class="row">
    <img src="test.png" alt="test alt" />
  </div>

</body>

</html>

python script

import pypandoc
 
pypandoc.convert_file(
    'index.html',
    to='docx',
    format='html',
    outputfile='test.docx',
)

output of python test.py:
[WARNING] Could not fetch resource test.png: PandocResourceNotFound "test.png"

@JessicaTegner
Owner

Hi @saumcor
Have you tried setting pandoc's data directory, so that pandoc knows where to look for the image files?

@saumzzz
Author

saumzzz commented Nov 29, 2022

Hey @JessicaTegner
I tried both --data-dir and --resource-path (independently), as shown below, but the image still wasn't embedded and pandoc gave the same PandocResourceNotFound warning.

import pypandoc

extra_args = ['--data-dir=<windows path>']

pypandoc.convert_file(
    'index.html',
    to='docx',
    format='html',
    outputfile='test.docx',
    extra_args=extra_args,
)

@saumzzz
Author

saumzzz commented Dec 5, 2022

Hey @JessicaTegner, what's the issue here? Do you need additional info? Any solutions/workarounds?

@JessicaTegner
Owner

hi @saumcor sorry for not getting back to you :)

It seems a number of people have run into the same issue over time, but I still don't know what the root cause is.

@saumzzz
Author

saumzzz commented Dec 5, 2022

Hey @JessicaTegner, no worries, thanks for helping

@sanjass

sanjass commented Dec 24, 2022

I had the same issue and passing sandbox=False (default is True) fixed it for me.
So in your case it'd be

pypandoc.convert_file(
    'index.html',
    to='docx',
    format='html',
    outputfile='test.docx',
    extra_args=extra_args,
    sandbox=False,  # <----------add this
)

@JessicaTegner It seems that setting sandbox=False is not recommended in most cases, based on the docstring:
:param bool sandbox: Run pandoc in pandocs own sandbox mode, limiting IO operations in readers and writers to reading the files specified on the command line. Anyone using pandoc on untrusted user input should use this option. Note: This only does something, on pandoc >= 2.15 .
Do you have suggestions on how to avoid having to set sandbox to False and still have images working as expected?

@JessicaTegner
Owner

JessicaTegner commented Dec 25, 2022

@sanjass and others
This is the full explanation from the Pandoc User's Guide:

--sandbox
Run pandoc in a sandbox, limiting IO operations in readers and writers to reading the files specified on the command line. Note that this option does not limit IO operations by filters or in the production of PDF documents. But it does offer security against, for example, disclosure of files through the use of include directives. Anyone using pandoc on untrusted user input should use this option.
Note: some readers and writers (e.g., docx) need access to data files. If these are stored on the file system, then pandoc will not be able to find them when run in --sandbox mode and will raise an error. For these applications, we recommend using a pandoc binary compiled with the embed_data_files option, which causes the data files to be baked into the binary instead of being stored on the file system.

So there are two options:

  1. Disabling sandbox mode
  2. Using a pandoc binary compiled with the embed_data_files option, which is currently out of scope for this library.

I would be willing to consider alternatives, such as setting sandbox to false by default.

What do people think?

@sanjass

sanjass commented Jan 2, 2023

@JessicaTegner thanks for the prompt response. While I'm no expert on the implications of the options you provided, I don't think it's unreasonable to have sandbox=False by default as this would replicate the pandoc CLI usage more closely and avoid confusion.

Namely, when using pandoc directly one would have to explicitly provide --sandbox as a parameter in order to run in a sandbox mode, so the same can be true for pypandoc by explicitly requiring users to specify sandbox=True to get the sandbox effect. This way, if the users "go out of their way" to override the default value of the sandbox parameter, then they would have presumably read pandoc's documentation and know that they need to use embed_data_files option along with it (e.g. for conversion to docx), which should hopefully avoid errors such as the one in this issue.

In either case, more thorough documentation is needed, especially if we keep sandbox=True by default.

@JessicaTegner
Owner

@sanjass you are right. We should probably have sandbox set to False by default, to replicate the pandoc CLI.

@JessicaTegner
Owner

Update: After reading through the pandoc user manual, under "General options", it seems that the sandbox default is indeed true. If that's the case, pypandoc is currently doing the same as the pandoc CLI. In that case, we could add better documentation that references the pandoc user manual.

What do people think?

@sanjass

sanjass commented Jan 6, 2023

"General options", it seems that sandbox default behavior is indeed true

Hmm, that's weird. I found the line optSandbox = False in the pandoc code, under "Defaults for command-line options". A default of False would also make sense, since --sandbox sounds like an enabling flag (a "disabling" flag would hopefully be named --disable-sandbox or something similar).

When testing locally with pandoc version 2.19.2, it also seems sandbox is False by default. The way I tried this is as follows:

Given a sample.html file with content <img src="atom.jpeg" alt="atom_pic"> and an actual image named atom.jpeg in the testing directory:
Running pandoc sample.html -f html -t docx -o sample.docx works as expected (image is attached) while running pandoc sample.html -f html -t docx -o sample.docx --sandbox results in [WARNING] Could not fetch resource atom.jpeg: PandocResourceNotFound "atom.jpeg" and the image is not attached.
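
The comparison above can be scripted. The following is only a sketch: it assumes pandoc is on PATH and that sample.html and atom.jpeg sit in the current directory, and the helper names build_pandoc_cmd and run_comparison are made up for illustration.

```python
import shutil
import subprocess


def build_pandoc_cmd(src, out, sandbox=False):
    """Build the pandoc command line used in the comparison above."""
    cmd = ["pandoc", src, "-f", "html", "-t", "docx", "-o", out]
    if sandbox:
        # --sandbox is opt-in on the pandoc CLI
        cmd.append("--sandbox")
    return cmd


def run_comparison(src="sample.html"):
    """Run pandoc with and without --sandbox and collect stderr for each run.

    Returns a dict mapping the sandbox flag (False/True) to pandoc's stderr
    text, or None when no pandoc executable is found on PATH.
    """
    if shutil.which("pandoc") is None:
        return None
    results = {}
    for sandbox in (False, True):
        # Hypothetical output names, chosen so the two runs don't overwrite each other
        out = "sample_sandbox.docx" if sandbox else "sample.docx"
        proc = subprocess.run(
            build_pandoc_cmd(src, out, sandbox),
            capture_output=True,
            text=True,
        )
        results[sandbox] = proc.stderr
    return results
```

If the behaviour described above holds, the stderr collected for the sandboxed run should contain the PandocResourceNotFound warning, and the stderr for the other run should not.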

@JessicaTegner
Owner

Hmm, interesting. Yeah, in that case sandbox=False should be the default in pypandoc.

@JessicaTegner
Owner

@saumcor and @sanjass

I have added some test logic, replicating what OP had an issue with. This conversion, however, doesn't seem to produce any warnings or errors. Let me know what you think.

@saumzzz
Author

saumzzz commented Jan 25, 2023

Hey @JessicaTegner that seems to be in line with the behaviour of pandoc without the --sandbox flag. No warnings or errors, with the file getting embedded in the docx file.

@JessicaTegner
Owner

yes @saumcor but as you can see from the code, I didn't actually change anything, just wrote a test case for it, matching this issue
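
For anyone who wants to check the result beyond "no warnings", here is a possible sketch. It is not the test from the repository. It assumes the usual docx layout, where the file is a zip archive and embedded images are stored under word/media/, and the helper name embedded_media is made up for illustration.

```python
import zipfile


def embedded_media(docx_path):
    """Return the names of the files stored under word/media/ in a .docx archive."""
    with zipfile.ZipFile(docx_path) as archive:
        return [
            name
            for name in archive.namelist()
            if name.startswith("word/media/")
        ]
```

After converting the HTML from this issue, an empty list from embedded_media('test.docx') would suggest the image was dropped, and a non-empty list would suggest it was embedded.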

@sanjass

sanjass commented Jan 25, 2023

@JessicaTegner I didn't run the test so I can't confirm, but could it be that you're seeing a different outcome because of pandoc version? Based on L351-L353 sandbox=True only has an effect after pandoc version 2.15

@JessicaTegner
Owner

@sanjass yes, because "sandbox" was introduced in pandoc 2.15, so on earlier versions it has no effect. I tested with pandoc 2.19x
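
To make the version dependence concrete, here is a small sketch of a check for whether the sandbox setting can have any effect on a given pandoc version. The helper name sandbox_has_effect is made up for illustration, and this is not how pypandoc itself performs the check.

```python
def sandbox_has_effect(version):
    """Return True when the given pandoc version string is at least 2.15."""
    parts = []
    for piece in version.strip().split("."):
        if not piece.isdigit():
            # Stop at the first non-numeric component
            break
        parts.append(int(piece))
    # Pad so that a bare major version such as "3" compares as (3, 0)
    while len(parts) < 2:
        parts.append(0)
    # Compare only major and minor against the 2.15 threshold
    return tuple(parts[:2]) >= (2, 15)
```

The version string can be obtained from pypandoc.get_pandoc_version(), so sandbox_has_effect(pypandoc.get_pandoc_version()) would tell you whether the sandbox setting matters for the local install.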

@JessicaTegner
Owner

@saumcor and @sanjass if you check PR #322, the modified code should make this possible again.
