utf8 encoded symbols not cast as unicode strings in python v2.7 #44

pazz · 2014-04-03T10:50:39Z

Hi!
I've been using configobj a for while in my project. Due to other dependencies
(twisted mainly), the stable release still depends on python v2.7.
Recently, an issue was opened regarding utf8 encoded symbols in the config file:
pazz/alot#693
It turns out that this is due to configobj v5.0.x not behaving as previous versions
when it comes to utf8 encoded symbols:
Previously, those were always passed on as unicode strings, e.g. 🐜 became
u'\U0001f41c'. Now, the same symbol is read as '\xf0\x9f\x90\x9c'.
Obvously this makes trouble. Of course i can patch around this, but still,
i think configobj should behave exactly the same as prevous versions here.
Thanks!

The text was updated successfully, but these errors were encountered:

robdennis · 2014-04-03T13:06:03Z

Thanks for the report Patrick, we are definitely committed to maintaining compatibility across versions and this will definitely trigger a new patch-level release.

Unicode on py2 is a risk area due to the py3 changes, and clearly this particular case didn't have sufficient test coverage.

I'd like to take a crack at this tonight with the aim at releasing before the weekend. I'm on my phone, but when I get to a desktop, I'll move this into the latest milestone and give it a due date.

robdennis · 2014-04-04T20:21:24Z

Hello @pazz, wanted to let you know I'm working on this now, and before I got too deep, I wanted to confirm that my test data matched up to the problem:

this is me writing a utf-8 encoded string to a test cfg file.

>>> open('test.cfg', 'w').write(u'encoded = \\U0001f41c\nactual = \U0001f41c'.encode('utf-8'))
>>> string = open('test.cfg', 'r').read()
>>> print string
encoded = \U0001f41c
actual = <the ant looking character that chrome will not let me paste>
>>> repr(string)
'encoded = \\U0001f41c\nactual = \xf0\x9f\x90\x9c'
>>> unicode_string = unicode(open('test.cfg', 'r').read(), 'utf-8')
>>> print unicode_string
encoded = \U0001f41c
actual = <the ant looking character that chrome will not let me paste>
>>> repr(unicode_string)
u'encoded = \\U0001f41c\nactual = \U0001f41c'

so here's me opening up this cfg file using Configobj 4.7.2:

>>> import configobj
>>> configobj.__version__
'4.7.2'
>>> cfg = configobj.ConfigObj('test.cfg')
>>> cfg['actual']
'\xf0\x9f\x90\x9c'
>>> cfg['encoded']
'\\U0001f41c'

based on your description, it sounds like you'd expect the value of actual to be the unicode character associated with this repr: \U0001f41c

are there any options that I should be using or different test data to reproduce the correct behavior on 4.7.2?

edit: as a test, I gave it a shot with the utf-8 BOM:

>>> unicode(open('test.cfg', 'r').read(), 'utf-8')
u'\ufeffencoded = \\U0001f41c\nactual = \U0001f41c'
>>> cfg = configobj.ConfigObj('test.cfg')
>>> cfg['encoded']
'\\U0001f41c'
>>> cfg['actual']
'\xf0\x9f\x90\x9c'

with the same results. Sorry @pazz I feel like I'm a bit stuck here

robdennis · 2014-04-04T20:39:40Z

to cpnfirm 4.7.2 at least works as documented I explicitly passed the encoding argument and I got this:

>>> cfg = configobj.ConfigObj('test.cfg', encoding='utf-8')
>>> cfg['actual']
u'\U0001f41c'
>>> cfg['encoded']
u'\\U0001f41c'

which seems to match up with what you expected.

To make sure I'm reading it right, is the error when you do or don't pass in the encoding argument? In checking your master, I had noticed that you're not setting it, though maybe I should check a different branch.

kzuberi · 2014-04-04T20:40:34Z

I had a problem that might have the same cause, so i'll mention here instead of opening a separate bug. On python 2.7 with configobj 5, using your above test.cfg, if I just do:

>>> import configobj
>>> cfg = configobj.ConfigObj('test.cfg', encoding='utf8')
>>> cfg.write()

it blows up with UnicodeDecodeError. Works fine in configobj 4.7.2 (in my case had to downgrade again to keep my scripts working). Hope that helps somehow.

robdennis · 2014-04-04T20:43:06Z

thanks kzuberi, I think that's enough to confirm there's a problem so I appreciate that

robdennis · 2014-04-04T21:22:30Z

ok, I have good news, due to the changes in #39 necessary to close the (originally) windows-specific #34, the untagged master actually works correctly:

>>> cfg = [b'test = \xf0\x9f\x90\x9c']
>>> c = configobj.ConfigObj(cfg, encoding='utf8')
>>> c['test']
u'\U0001f41c'
>>> c.write()
['test = \xf0\x9f\x90\x9c']

note that the access of the element is unicode, but writing it out is using bytestrings (since this method is intended for writing configs to file).

I was holding off on a tag until we got access to a windows machine for running the test suite, but since the root cause is affecting everyone, there's no need to wait on it.

I did add a specific test for this and I'll look to tag tonight, as was originally discussed.

specific test for #44

robdennis · 2014-04-04T21:41:28Z

leaving this open until I tag, but @pazz it would be worth confirming that alot does use the encoding parameter

EliAndrewC · 2014-04-04T21:56:29Z

I'll try to get a new Windows laptop soon, and there are two different projects (this one and MAGFest) that are being hobbled by the fact that I can't test things on Windows. No promises, but it's definitely increasing in priority.

robdennis · 2014-04-05T03:37:03Z

5.0.3 released: https://pypi.python.org/pypi/configobj/5.0.3
also, due to #20 getting closed, you can once again do this:

>>> import configobj
>>> configobj.__version__
'5.0.3'

pazz · 2014-04-07T11:43:08Z

I still get different outputs unfortunately. Here is some more info.
I run this script (called cotest.py locally) on my config:

import configobj
print configobj.__version__
cfg = configobj.ConfigObj('.config/alot/config', encoding='utf-8')

t = cfg['tags']['bug']['translated']
print(repr(t))

The relevant part of the config looks like this:

...
[tags]
  [[bug]]
    translated = 🐜
...

The config file is utf8 encoded. i get

$python cotest.py                                                                                                      
4.7.2
u'\U0001f41c'
$
$pip install --user --upgrade configobj                                                                                
Downloading/unpacking configobj from https://pypi.python.org/packages/source/c/configobj/configobj-5.0.3.tar.gz#md5=a23f0ad1cbdcb7fc10fe15a28067fdd1
  Downloading configobj-5.0.3.tar.gz
  Running setup.py egg_info for package configobj

Requirement already up-to-date: six==1.5.2 in /usr/lib/python2.7/dist-packages (from configobj)
Installing collected packages: configobj
  Running setup.py install for configobj

Successfully installed configobj
Cleaning up...

$python cotest.py                                                                                 
5.0.3
'\xf0\x9f\x90\x9c'

If i understand the cofngiobj docs correctly, the encoding parameter is just relevant fr writing configs is it? because this is something i never actually do.

pazz · 2014-04-07T11:49:35Z

i can confirm that if i change line 2 in my script to

cfg = configobj.ConfigObj('.config/alot/config')

I.e., remove the encoding parameter, i get the same result.

robdennis · 2014-04-07T14:49:16Z

Unfortunately this may be an area where the docs aren't really helpful. I'll need to review it.

Encoding is what gets it parsed in as Unicode and since writing out config is using bytestrings, the encoding is used there to properly encode any Unicode values in the config.

I definitely appreciate your cotest.py and I though checking it on my phone is rough, there's nothing super obvious that explains this difference. Hopefully I can reproduce and a fix would trigger a 5.0.4.

I'll check it tonight, thanks

On Mon, Apr 7, 2014 at 7:49 AM, Patrick Totzke notifications@github.com
wrote:

i can confirm that if i change line 2 in my script to
cfg = configobj.ConfigObj('.config/alot/config')
I.e., remove the encoding parameter, i get the same result.

Reply to this email directly or view it on GitHub:
#44 (comment)

robdennis · 2014-04-08T00:55:20Z

reopened, because looking a @pazz 's provided code in a real webbrowser, it should totally work

robdennis · 2014-04-08T13:13:31Z

ok, we have a smoking gun, thanks for the test code @pazz

Basically, because reasons (e.g. we inherited this), 99% of the tests are specifying configobj objects using a "list of strings" instead of a path to a file.

This test works:

    #issue #44
    def test_encoding_in_subsections(self, ant_cfg):
        c = cfg_lines(ant_cfg)
        cfg = ConfigObj(c, encoding='utf-8')

        assert isinstance(cfg['tags']['bug']['translated'], six.text_type)

And this does NOT

    #issue #44
    def test_encoding_in_config_files(self, request, ant_cfg):
        with NamedTemporaryFile(delete=False, mode='w') as cfg_file:
            cfg_file.write(ant_cfg.encode('utf-8'))
        request.addfinalizer(lambda : os.unlink(cfg_file.name))

        cfg = ConfigObj(cfg_file.name, encoding='utf-8')
        assert isinstance(cfg['tags']['bug']['translated'], six.text_type)

where ant_cfg is the cfg you posted, and cfg_lines is a helper method that splits a string into the list like the tests used to use

this means there's a difference in code paths between loading a config from path (e.g. the 99.9% actual use case) and the list of strings (90% test case) :( UGGGH.

Good news, now that this is reproducible, I feel good about turning this into 5.0.4 tonight. Thanks again @pazz

edit: forgot that I had a standing Tuesday appointment for some reason, but working on it now.

robdennis · 2014-04-10T03:10:56Z

alright, so feeling good about this, but because I'd like to docs to actually build again, I'd like to fix #49 and tag 5.0.4 tomorrow night, which will "officially" close this.

pazz · 2014-04-10T07:34:43Z

ok, i'm properly confused now: my issue seems to be solved (current my app runs ok with current master), but the cotest.py script now even gives the non-unicode output with the old 4.7.2 version...

[~] python cotest.py                                                                                                      
5.0.3
'\xf0\x9f\x90\x9c'
[~] rm -r ~/.local/lib/python2.7/site-packages/configobj*                                                                 
[~] python cotest.py                                                                                                      
4.7.2
'\xf0\x9f\x90\x9c'

I of course get the same output with the configobj version from pip (5.0.3) but that one makes my app crash

robdennis · 2014-04-11T03:52:52Z

I don't have a good explanation based on what you posted here, but I am glad to here that your app runs against current master. I'll close this one for now since I'm tagging 5.0.4

robdennis added the bug label Apr 4, 2014

robdennis added this to the 5.0.3 - next bug fix release milestone Apr 4, 2014

robdennis self-assigned this Apr 4, 2014

robdennis removed the bug label Apr 4, 2014

robdennis added a commit to robdennis/configobj that referenced this issue Apr 4, 2014

specific test for DiffSK#44

5c68d79

robdennis added a commit that referenced this issue Apr 4, 2014

Merge pull request #45 from robdennis/master

e36be92

specific test for #44

robdennis closed this as completed Apr 5, 2014

robdennis reopened this Apr 8, 2014

robdennis mentioned this issue Apr 9, 2014

unit tests should typically using a config file path instead of a list of strings #50

Closed

robdennis closed this as completed in 62bb181 Apr 10, 2014

robdennis reopened this Apr 10, 2014

robdennis closed this as completed Apr 11, 2014

robdennis mentioned this issue Apr 26, 2014

encoding issue writing #55

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

utf8 encoded symbols not cast as unicode strings in python v2.7 #44

utf8 encoded symbols not cast as unicode strings in python v2.7 #44

pazz commented Apr 3, 2014

robdennis commented Apr 3, 2014

robdennis commented Apr 4, 2014

robdennis commented Apr 4, 2014

kzuberi commented Apr 4, 2014

robdennis commented Apr 4, 2014

robdennis commented Apr 4, 2014

robdennis commented Apr 4, 2014

EliAndrewC commented Apr 4, 2014

robdennis commented Apr 5, 2014

pazz commented Apr 7, 2014

pazz commented Apr 7, 2014

robdennis commented Apr 7, 2014

I.e., remove the encoding parameter, i get the same result.

robdennis commented Apr 8, 2014

robdennis commented Apr 8, 2014

robdennis commented Apr 10, 2014

pazz commented Apr 10, 2014

robdennis commented Apr 11, 2014

utf8 encoded symbols not cast as unicode strings in python v2.7 #44

utf8 encoded symbols not cast as unicode strings in python v2.7 #44

Comments

pazz commented Apr 3, 2014

robdennis commented Apr 3, 2014

robdennis commented Apr 4, 2014

robdennis commented Apr 4, 2014

kzuberi commented Apr 4, 2014

robdennis commented Apr 4, 2014

robdennis commented Apr 4, 2014

robdennis commented Apr 4, 2014

EliAndrewC commented Apr 4, 2014

robdennis commented Apr 5, 2014

pazz commented Apr 7, 2014

pazz commented Apr 7, 2014

robdennis commented Apr 7, 2014

I.e., remove the encoding parameter, i get the same result.

robdennis commented Apr 8, 2014

robdennis commented Apr 8, 2014

robdennis commented Apr 10, 2014

pazz commented Apr 10, 2014

robdennis commented Apr 11, 2014