encoding isn't a valid parameter to smart_open #142

Closed
jayantj opened this issue Nov 8, 2017 · 3 comments

jayantj commented Nov 8, 2017

This isn't a bug, but it can be slightly confusing.

Steps to reproduce

from smart_open import smart_open

print(open('cp852.tsv.txt', 'r', encoding='cp852').read())
# tímto	budeš
# budem	byli

print(smart_open('cp852.tsv.txt', 'r', encoding='cp852').read())
# UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa1 in position 1: invalid start byte

cp852.tsv.txt
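
For readers without the attachment, the sketch below recreates an equivalent file under Python 3. The contents are assumed from the output shown above (two tab-separated lines), and the temporary path is hypothetical. It illustrates why decoding as UTF-8 fails at position 1: in cp852 the letter í is the single byte 0xa1, which is not a valid UTF-8 start byte.

```python
import os
import tempfile

# Assumed contents, reconstructed from the output shown in the report.
text = "tímto\tbudeš\nbudem\tbyli\n"

# Hypothetical location; the original attachment is not reproduced here.
path = os.path.join(tempfile.mkdtemp(), "cp852.tsv.txt")
with open(path, "w", encoding="cp852") as fout:
    fout.write(text)

# Reading with the matching encoding round-trips the text.
with open(path, "r", encoding="cp852") as fin:
    decoded = fin.read()

# Decoding the same bytes as UTF-8 fails on the byte that encodes "í".
with open(path, "rb") as fin:
    raw = fin.read()
try:
    raw.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    error = exc
```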

The encoding parameter isn't mentioned anywhere in the documentation, so it isn't reasonable to expect the above to work. It would be helpful, though, to log a warning, raise an exception stating that it is an invalid argument, or support the parameter in the first place (if possible).
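
As an illustration of the "raise an exception" option, here is a minimal sketch of a wrapper that rejects keyword arguments it would otherwise ignore. The name `strict_open` and the `SUPPORTED_KWARGS` set are hypothetical and are not part of smart_open.

```python
import io

# Keyword arguments this hypothetical wrapper knows how to forward.
SUPPORTED_KWARGS = frozenset(["encoding", "errors", "newline", "buffering"])


def strict_open(path, mode="r", **kwargs):
    """Open a local file, rejecting keyword arguments that would be ignored."""
    unknown = sorted(set(kwargs) - SUPPORTED_KWARGS)
    if unknown:
        raise TypeError("unsupported keyword argument(s): %s" % ", ".join(unknown))
    return io.open(path, mode, **kwargs)


# Usage: a misspelled keyword fails loudly before any file is opened.
try:
    strict_open("example.txt", "r", enconding="cp852")  # deliberate typo
except TypeError as exc:
    message = str(exc)
```

Failing early like this trades leniency for a clear error message. Logging a warning instead would keep existing callers working.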

piskvorky (Owner) commented Nov 8, 2017

Definitely a bug -- smart_open must remain a drop-in replacement for open, for local filesystem usage.

Thanks for reporting.

@menshikh-iv we should be passing all extra (unknown) parameters straight to the underlying storage. I believe that's how smart_open has always worked, so I'm not sure what is causing the current bug.
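
For the local-file case, "passing all extra parameters straight to the underlying storage" could look roughly like the thin wrapper below. The name `passthrough_open` is hypothetical, and this is a sketch of the idea rather than smart_open's actual implementation.

```python
import io
import os
import tempfile


def passthrough_open(path, mode="r", **kwargs):
    """Forward every extra keyword argument to the underlying opener."""
    # io.open accepts encoding on both Python 2 and Python 3.
    return io.open(path, mode, **kwargs)


# Usage: the caller's encoding reaches io.open unchanged.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with passthrough_open(demo_path, "w", encoding="cp852") as fout:
    fout.write(u"budeš\n")
with passthrough_open(demo_path, "r", encoding="cp852") as fin:
    roundtrip = fin.read()
```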

jayantj (Author) commented Nov 8, 2017

This might have something to do with the fact that encoding isn't a valid argument to open in Python 2.
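
For context, io.open accepts encoding on both Python 2 and Python 3 (on Python 3 the built-in open is an alias for it). A small sketch with a hypothetical temporary file:

```python
import io
import os
import tempfile

sample_path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with io.open(sample_path, "wb") as fout:
    fout.write(b"t\xa1mto\n")  # "tímto" encoded as cp852

# io.open takes the encoding keyword regardless of the Python major version.
with io.open(sample_path, "r", encoding="cp852") as fin:
    line = fin.read()
```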

mpenkov (Collaborator) commented Nov 17, 2017

As @jayantj pointed out, the built-in open doesn't accept an encoding keyword argument in Py2, so this bug is only reproducible under Py3.

(venv)bash-3.2$ source venv/bin/activate
(venv)bash-3.2$ python bug.py
Traceback (most recent call last):
  File "bug.py", line 3, in <module>
    print(open('cp852.tsv.txt', 'r', encoding='cp852').read())
TypeError: 'encoding' is an invalid keyword argument for this function
(venv)bash-3.2$ source venv3/bin/activate
(venv3)bash-3.2$ python bug.py
tímto   budeš
budem   byli
Traceback (most recent call last):
  File "bug.py", line 7, in <module>
    print(smart_open('cp852.tsv.txt', 'r', encoding='cp852').read())
  File "/Users/misha/git/smart_open/venv3/bin/../lib/python3.5/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa1 in position 1: invalid start byte

The reason encoding isn't being picked up is that the encoding, which is a keyword argument, isn't being passed on to the file_smart_open function.
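
The failure mode can be sketched with two hypothetical stand-ins, `dispatch_open` and `local_open`. These are not smart_open's real functions or signatures. They only illustrate how a keyword argument that is accepted but never forwarded leads to the UnicodeDecodeError above.

```python
import io
import os
import tempfile


def local_open(path, mode="r", encoding="utf-8"):
    """Hypothetical stand-in for the local-file opener."""
    return io.open(path, mode, encoding=encoding)


def dispatch_open(path, mode="r", **kwargs):
    """Hypothetical dispatcher that reproduces the reported behaviour."""
    # kwargs (including encoding) are accepted here but never forwarded,
    # so local_open silently falls back to its UTF-8 default.
    return local_open(path, mode)


bug_path = os.path.join(tempfile.mkdtemp(), "cp852.txt")
with io.open(bug_path, "wb") as fout:
    fout.write(b"t\xa1mto\n")  # "tímto" encoded as cp852

try:
    with dispatch_open(bug_path, "r", encoding="cp852") as fin:
        fin.read()
    failed = False
except UnicodeDecodeError:
    failed = True
```

In this sketch, changing the call inside `dispatch_open` to `local_open(path, mode, **kwargs)` would let the caller's encoding through.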

I'll look into fixing this.
