Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

utf8 encoded symbols not cast as unicode strings in python v2.7 #44

Closed
pazz opened this issue Apr 3, 2014 · 17 comments
Closed

utf8 encoded symbols not cast as unicode strings in python v2.7 #44

pazz opened this issue Apr 3, 2014 · 17 comments
Assignees
Milestone

Comments

@pazz
Copy link

pazz commented Apr 3, 2014

Hi!
I've been using configobj a for while in my project. Due to other dependencies
(twisted mainly), the stable release still depends on python v2.7.
Recently, an issue was opened regarding utf8 encoded symbols in the config file:
pazz/alot#693
It turns out that this is due to configobj v5.0.x not behaving as previous versions
when it comes to utf8 encoded symbols:
Previously, those were always passed on as unicode strings, e.g. 🐜 became
u'\U0001f41c'. Now, the same symbol is read as '\xf0\x9f\x90\x9c'.
Obvously this makes trouble. Of course i can patch around this, but still,
i think configobj should behave exactly the same as prevous versions here.
Thanks!

@robdennis
Copy link
Member

Thanks for the report Patrick, we are definitely committed to maintaining compatibility across versions and this will definitely trigger a new patch-level release.

Unicode on py2 is a risk area due to the py3 changes, and clearly this particular case didn't have sufficient test coverage.

I'd like to take a crack at this tonight with the aim at releasing before the weekend. I'm on my phone, but when I get to a desktop, I'll move this into the latest milestone and give it a due date.

@robdennis robdennis added the bug label Apr 4, 2014
@robdennis robdennis added this to the 5.0.3 - next bug fix release milestone Apr 4, 2014
@robdennis robdennis self-assigned this Apr 4, 2014
@robdennis
Copy link
Member

Hello @pazz, wanted to let you know I'm working on this now, and before I got too deep, I wanted to confirm that my test data matched up to the problem:

this is me writing a utf-8 encoded string to a test cfg file.

>>> open('test.cfg', 'w').write(u'encoded = \\U0001f41c\nactual = \U0001f41c'.encode('utf-8'))
>>> string = open('test.cfg', 'r').read()
>>> print string
encoded = \U0001f41c
actual = <the ant looking character that chrome will not let me paste>
>>> repr(string)
'encoded = \\U0001f41c\nactual = \xf0\x9f\x90\x9c'
>>> unicode_string = unicode(open('test.cfg', 'r').read(), 'utf-8')
>>> print unicode_string
encoded = \U0001f41c
actual = <the ant looking character that chrome will not let me paste>
>>> repr(unicode_string)
u'encoded = \\U0001f41c\nactual = \U0001f41c'

so here's me opening up this cfg file using Configobj 4.7.2:

>>> import configobj
>>> configobj.__version__
'4.7.2'
>>> cfg = configobj.ConfigObj('test.cfg')
>>> cfg['actual']
'\xf0\x9f\x90\x9c'
>>> cfg['encoded']
'\\U0001f41c'

based on your description, it sounds like you'd expect the value of actual to be the unicode character associated with this repr: \U0001f41c

are there any options that I should be using or different test data to reproduce the correct behavior on 4.7.2?

edit: as a test, I gave it a shot with the utf-8 BOM:

>>> unicode(open('test.cfg', 'r').read(), 'utf-8')
u'\ufeffencoded = \\U0001f41c\nactual = \U0001f41c'
>>> cfg = configobj.ConfigObj('test.cfg')
>>> cfg['encoded']
'\\U0001f41c'
>>> cfg['actual']
'\xf0\x9f\x90\x9c'

with the same results. Sorry @pazz I feel like I'm a bit stuck here

@robdennis
Copy link
Member

to cpnfirm 4.7.2 at least works as documented I explicitly passed the encoding argument and I got this:

>>> cfg = configobj.ConfigObj('test.cfg', encoding='utf-8')
>>> cfg['actual']
u'\U0001f41c'
>>> cfg['encoded']
u'\\U0001f41c'

which seems to match up with what you expected.

To make sure I'm reading it right, is the error when you do or don't pass in the encoding argument? In checking your master, I had noticed that you're not setting it, though maybe I should check a different branch.

@kzuberi
Copy link

kzuberi commented Apr 4, 2014

I had a problem that might have the same cause, so i'll mention here instead of opening a separate bug. On python 2.7 with configobj 5, using your above test.cfg, if I just do:

>>> import configobj
>>> cfg = configobj.ConfigObj('test.cfg', encoding='utf8')
>>> cfg.write()

it blows up with UnicodeDecodeError. Works fine in configobj 4.7.2 (in my case had to downgrade again to keep my scripts working). Hope that helps somehow.

@robdennis
Copy link
Member

thanks kzuberi, I think that's enough to confirm there's a problem so I appreciate that

@robdennis
Copy link
Member

ok, I have good news, due to the changes in #39 necessary to close the (originally) windows-specific #34, the untagged master actually works correctly:

>>> cfg = [b'test = \xf0\x9f\x90\x9c']
>>> c = configobj.ConfigObj(cfg, encoding='utf8')
>>> c['test']
u'\U0001f41c'
>>> c.write()
['test = \xf0\x9f\x90\x9c']

note that the access of the element is unicode, but writing it out is using bytestrings (since this method is intended for writing configs to file).

I was holding off on a tag until we got access to a windows machine for running the test suite, but since the root cause is affecting everyone, there's no need to wait on it.

I did add a specific test for this and I'll look to tag tonight, as was originally discussed.

@robdennis robdennis removed the bug label Apr 4, 2014
robdennis added a commit to robdennis/configobj that referenced this issue Apr 4, 2014
robdennis added a commit that referenced this issue Apr 4, 2014
@robdennis
Copy link
Member

leaving this open until I tag, but @pazz it would be worth confirming that alot does use the encoding parameter

@EliAndrewC
Copy link
Member

I'll try to get a new Windows laptop soon, and there are two different projects (this one and MAGFest) that are being hobbled by the fact that I can't test things on Windows. No promises, but it's definitely increasing in priority.

@robdennis
Copy link
Member

5.0.3 released: https://pypi.python.org/pypi/configobj/5.0.3
also, due to #20 getting closed, you can once again do this:

>>> import configobj
>>> configobj.__version__
'5.0.3'

@pazz
Copy link
Author

pazz commented Apr 7, 2014

I still get different outputs unfortunately. Here is some more info.
I run this script (called cotest.py locally) on my config:

import configobj
print configobj.__version__
cfg = configobj.ConfigObj('.config/alot/config', encoding='utf-8')

t = cfg['tags']['bug']['translated']
print(repr(t))

The relevant part of the config looks like this:

...
[tags]
  [[bug]]
    translated = 🐜
...

The config file is utf8 encoded. i get

$python cotest.py                                                                                                      
4.7.2
u'\U0001f41c'
$
$pip install --user --upgrade configobj                                                                                
Downloading/unpacking configobj from https://pypi.python.org/packages/source/c/configobj/configobj-5.0.3.tar.gz#md5=a23f0ad1cbdcb7fc10fe15a28067fdd1
  Downloading configobj-5.0.3.tar.gz
  Running setup.py egg_info for package configobj

Requirement already up-to-date: six==1.5.2 in /usr/lib/python2.7/dist-packages (from configobj)
Installing collected packages: configobj
  Running setup.py install for configobj

Successfully installed configobj
Cleaning up...

$python cotest.py                                                                                 
5.0.3
'\xf0\x9f\x90\x9c'

If i understand the cofngiobj docs correctly, the encoding parameter is just relevant fr writing configs is it? because this is something i never actually do.

@pazz
Copy link
Author

pazz commented Apr 7, 2014

i can confirm that if i change line 2 in my script to

cfg = configobj.ConfigObj('.config/alot/config')

I.e., remove the encoding parameter, i get the same result.

@robdennis
Copy link
Member

Unfortunately this may be an area where the docs aren't really helpful. I'll need to review it. 

Encoding is what gets it parsed in as Unicode and since writing out config is using bytestrings, the encoding is used there to properly encode any Unicode values in the config.

I definitely appreciate your cotest.py and I though checking it on my phone is rough, there's nothing super obvious that explains this difference. Hopefully I can reproduce and a fix would trigger a 5.0.4.

I'll check it tonight, thanks

On Mon, Apr 7, 2014 at 7:49 AM, Patrick Totzke notifications@github.com
wrote:

i can confirm that if i change line 2 in my script to

cfg = configobj.ConfigObj('.config/alot/config')

I.e., remove the encoding parameter, i get the same result.

Reply to this email directly or view it on GitHub:
#44 (comment)

@robdennis robdennis reopened this Apr 8, 2014
@robdennis
Copy link
Member

reopened, because looking a @pazz 's provided code in a real webbrowser, it should totally work

@robdennis
Copy link
Member

ok, we have a smoking gun, thanks for the test code @pazz

Basically, because reasons (e.g. we inherited this), 99% of the tests are specifying configobj objects using a "list of strings" instead of a path to a file.

This test works:

    #issue #44
    def test_encoding_in_subsections(self, ant_cfg):
        c = cfg_lines(ant_cfg)
        cfg = ConfigObj(c, encoding='utf-8')

        assert isinstance(cfg['tags']['bug']['translated'], six.text_type)

And this does NOT

    #issue #44
    def test_encoding_in_config_files(self, request, ant_cfg):
        with NamedTemporaryFile(delete=False, mode='w') as cfg_file:
            cfg_file.write(ant_cfg.encode('utf-8'))
        request.addfinalizer(lambda : os.unlink(cfg_file.name))

        cfg = ConfigObj(cfg_file.name, encoding='utf-8')
        assert isinstance(cfg['tags']['bug']['translated'], six.text_type)

where ant_cfg is the cfg you posted, and cfg_lines is a helper method that splits a string into the list like the tests used to use

this means there's a difference in code paths between loading a config from path (e.g. the 99.9% actual use case) and the list of strings (90% test case) :( UGGGH.

Good news, now that this is reproducible, I feel good about turning this into 5.0.4 tonight. Thanks again @pazz

edit: forgot that I had a standing Tuesday appointment for some reason, but working on it now.

@robdennis
Copy link
Member

alright, so feeling good about this, but because I'd like to docs to actually build again, I'd like to fix #49 and tag 5.0.4 tomorrow night, which will "officially" close this.

@robdennis robdennis reopened this Apr 10, 2014
@pazz
Copy link
Author

pazz commented Apr 10, 2014

ok, i'm properly confused now: my issue seems to be solved (current my app runs ok with current master), but the cotest.py script now even gives the non-unicode output with the old 4.7.2 version...

[~] python cotest.py                                                                                                      
5.0.3
'\xf0\x9f\x90\x9c'
[~] rm -r ~/.local/lib/python2.7/site-packages/configobj*                                                                 
[~] python cotest.py                                                                                                      
4.7.2
'\xf0\x9f\x90\x9c'

I of course get the same output with the configobj version from pip (5.0.3) but that one makes my app crash

@robdennis
Copy link
Member

I don't have a good explanation based on what you posted here, but I am glad to here that your app runs against current master. I'll close this one for now since I'm tagging 5.0.4

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants