`UnicodeEncodeError` when docstring contain non-ascii characters #91

masci · 2023-03-14T14:16:55Z

Describe the bug

When a docstring contains a non-ASCII character, the conversion fails.

To Reproduce
Steps to reproduce the behavior:

create a Python file foo.py containing the following:

def foo():
    """
    Some converters tend to recognize clusters of letters as ligatures, such as "ﬀ" (double f)
    """
    pass

from the same folder, run pydoc-markdown -I . -m foo
see the error:

Traceback (most recent call last):
  File "/Users/massi/.virtualenvs/haystack/bin/pydoc-markdown", line 8, in <module>
    sys.exit(cli())
  File "/Users/massi/.virtualenvs/haystack/lib/python3.10/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/Users/massi/.virtualenvs/haystack/lib/python3.10/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/Users/massi/.virtualenvs/haystack/lib/python3.10/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/massi/.virtualenvs/haystack/lib/python3.10/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/Users/massi/.virtualenvs/haystack/lib/python3.10/site-packages/pydoc_markdown/main.py", line 383, in cli
    session.render(pydocmd)
  File "/Users/massi/.virtualenvs/haystack/lib/python3.10/site-packages/pydoc_markdown/main.py", line 132, in render
    modules = config.load_modules()
  File "/Users/massi/.virtualenvs/haystack/lib/python3.10/site-packages/pydoc_markdown/__init__.py", line 150, in load_modules
    modules.extend(loader.load())
  File "/Users/massi/.virtualenvs/haystack/lib/python3.10/site-packages/docspec_python/__init__.py", line 90, in load_python_modules
    yield parse_python_module(filename, module_name=module_name, options=options, encoding=encoding)
  File "/Users/massi/.virtualenvs/haystack/lib/python3.10/site-packages/docspec_python/__init__.py", line 131, in parse_python_module
    return parse_python_module(fpobj, fp, module_name, options, encoding)
  File "/Users/massi/.virtualenvs/haystack/lib/python3.10/site-packages/docspec_python/__init__.py", line 136, in parse_python_module
    return parser.parse(ast, filename, module_name)
  File "/Users/massi/.virtualenvs/haystack/lib/python3.10/site-packages/docspec_python/parser.py", line 300, in parse
    member = self.parse_declaration(module, node)
  File "/Users/massi/.virtualenvs/haystack/lib/python3.10/site-packages/docspec_python/parser.py", line 326, in parse_declaration
    return self.parse_funcdef(parent, node, False, decorations)
  File "/Users/massi/.virtualenvs/haystack/lib/python3.10/site-packages/docspec_python/parser.py", line 529, in parse_funcdef
    docstring = self.get_docstring_from_first_node(body)
  File "/Users/massi/.virtualenvs/haystack/lib/python3.10/site-packages/docspec_python/parser.py", line 799, in get_docstring_from_first_node
    return self.prepare_docstring(get_value(node.children[0]), parent)
  File "/Users/massi/.virtualenvs/haystack/lib/python3.10/site-packages/docspec_python/parser.py", line 898, in prepare_docstring
    docstring.content = docstring.content.encode("latin1").decode(
UnicodeEncodeError: 'latin-1' codec can't encode character '\ufb00' in position 77: ordinal not in range(256)

Expected behavior
No errors, as was the case with versions < 2.1.0.


NiklasRosenstein · 2023-03-15T10:19:21Z

Hey @masci, thanks for the bug report. Dang, it seems I didn't test this sufficiently and trusted StackOverflow a bit too much 👀

The encode/decode code here was introduced to convert a Python string literal into the actual string the Python interpreter would hold in memory (so when you write "foo\n" in your docstring, Docstring.content holds "foo\n" with a real newline instead of "foo\\n").

Unless there's another better working solution using the encode/decode logic, I suppose we need to manually parse the string and convert special character sequences.
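
For illustration, here is a minimal sketch of the round-trip described above and of how it breaks on the character from the report. The helper name `unescape_via_latin1` is made up for this example and is not part of docspec-python; only the `encode("latin1")` / `decode("unicode_escape")` pair comes from the traceback.

```python
def unescape_via_latin1(raw: str) -> str:
    """Interpret backslash escapes via the latin1/unicode_escape round-trip (sketch)."""
    return raw.encode("latin1").decode("unicode_escape")


# Text limited to Latin-1 works: the two characters backslash + n become a newline.
print(repr(unescape_via_latin1(r"foo\n")))

# U+FB00 (the "ff" ligature) has no Latin-1 encoding, so the first step already fails.
try:
    unescape_via_latin1("ligature \ufb00")
except UnicodeEncodeError as exc:
    print(type(exc).__name__, "-", exc.reason)
```

The failure happens in `encode`, before any escape handling takes place, which is why every docstring containing a character above U+00FF is affected regardless of whether it has escapes.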

masci · 2023-03-15T15:52:54Z

Thanks for following up! I'm not sure I fully follow the logic of the SO answer, but at some point I see

...
s.encode('latin1')         # To bytes, required by 'unicode-escape'
...

and I wonder, if the goal of that step is just to have bytes out of the original string, can't we just encode using something more flexible than latin1, like utf-8? Am I missing something?

NiklasRosenstein · 2023-03-15T17:14:39Z

The reason is that latin1 and unicode_escape seem to have a convenient overlap in escape character use, or something like that. But if latin1 can't encode everything, then it's no use either. 🤦

>>> 'ü'.encode('latin1')
b'\xfc'
>>> 'ü'.encode('latin1').decode('unicode_escape')
'ü'
>>> 'ü'.encode('utf-8')
b'\xc3\xbc'
>>> 'ü'.encode('utf-8').decode('unicode_escape')
'Ã¼'

NiklasRosenstein · 2023-03-15T23:02:10Z

It seems like you already found the PR and thus the StackOverflow answer I was referring to, but for reference: #83 and https://stackoverflow.com/a/58829514/791713

The best alternative that I can think of without re-implementing the decoding of raw strings is to use ast.literal_eval(). Actually that does appear rather elegant to me, in particular because the string we're dealing with will have the quotes around it.

    if s:
        s = ast.literal_eval(s)
        return Docstring(location, dedent_docstring(s).strip())
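
As a rough, self-contained sketch of that idea (not the actual patch): `ast.literal_eval()` takes the source text of the literal, quotes included, and returns the value the interpreter would produce. The helper name `evaluate_string_literal` and the type guard are assumptions added for this example.

```python
import ast


def evaluate_string_literal(literal_source: str) -> str:
    """Turn the source text of a string literal, quotes included, into its value."""
    value = ast.literal_eval(literal_source)
    if not isinstance(value, str):
        # Hypothetical guard: bytes literals and other literal kinds are rejected.
        raise TypeError(f"expected a string literal, got {type(value).__name__}")
    return value


# Source text as a parser would see it: quotes present, escape not yet interpreted.
literal_source = '"""ligature \ufb00, escape \\n"""'
print(repr(evaluate_string_literal(literal_source)))
```

Because no byte encoding is involved, characters outside Latin-1 pass through unchanged, and string prefixes such as `r"..."` are honored the same way the interpreter would honor them.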


NiklasRosenstein · 2023-03-15T23:30:44Z

Fixed in 2.1.2.

masci added the type: bug label Mar 14, 2023

NiklasRosenstein added a commit that referenced this issue Mar 15, 2023

fix: Fix #91 by using ast.literal_eval() instead of `encode(latin1).decode(unicode_escape)` method.

9a481e9

NiklasRosenstein linked a pull request Mar 15, 2023 that will close this issue

fix: Fix #91 by using ast.literal_eval() instead of encode(latin1).decode(unicode_escape) method. #92

Merged

NiklasRosenstein closed this as completed in #92 Mar 15, 2023

NiklasRosenstein added a commit that referenced this issue Mar 15, 2023

fix: Fix #91 by using ast.literal_eval() instead of `encode(latin1).decode(unicode_escape)` method. (#92)

d7d6be7
