utf8 encoded symbols not cast as unicode strings in python v2.7 #44
Comments
Thanks for the report Patrick, we are committed to maintaining compatibility across versions and this will definitely trigger a new patch-level release. Unicode on py2 is a risk area due to the py3 changes, and clearly this particular case didn't have sufficient test coverage. I'd like to take a crack at this tonight with the aim of releasing before the weekend. I'm on my phone, but when I get to a desktop, I'll move this into the latest milestone and give it a due date. |
Hello @pazz, wanted to let you know I'm working on this now, and before I got too deep, I wanted to confirm that my test data matched up to the problem: this is me writing a utf-8 encoded string to a test cfg file.
>>> open('test.cfg', 'w').write(u'encoded = \\U0001f41c\nactual = \U0001f41c'.encode('utf-8'))
>>> string = open('test.cfg', 'r').read()
>>> print string
encoded = \U0001f41c
actual = <the ant looking character that chrome will not let me paste>
>>> repr(string)
'encoded = \\U0001f41c\nactual = \xf0\x9f\x90\x9c'
>>> unicode_string = unicode(open('test.cfg', 'r').read(), 'utf-8')
>>> print unicode_string
encoded = \U0001f41c
actual = <the ant looking character that chrome will not let me paste>
>>> repr(unicode_string)
u'encoded = \\U0001f41c\nactual = \U0001f41c'
so here's me opening up this cfg file using ConfigObj 4.7.2:
>>> import configobj
>>> configobj.__version__
'4.7.2'
>>> cfg = configobj.ConfigObj('test.cfg')
>>> cfg['actual']
'\xf0\x9f\x90\x9c'
>>> cfg['encoded']
'\\U0001f41c'
based on your description, it sounds like you'd expect the value of are there any options that I should be using or different test data to reproduce the correct behavior on 4.7.2?
edit: as a test, I gave it a shot with the utf-8 BOM:
>>> unicode(open('test.cfg', 'r').read(), 'utf-8')
u'\ufeffencoded = \\U0001f41c\nactual = \U0001f41c'
>>> cfg = configobj.ConfigObj('test.cfg')
>>> cfg['encoded']
'\\U0001f41c'
>>> cfg['actual']
'\xf0\x9f\x90\x9c'
with the same results. Sorry @pazz I feel like I'm a bit stuck here |
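For readers following the repr() output in this comment, here is a minimal sketch of how the two spellings of the ant symbol relate. It uses Python 3 syntax (where `str` is the text type and `bytes` is the byte-string type) rather than the Python 2.7 sessions in this thread, and the variable names are made up for illustration.

```python
# The ant symbol from the test file, written as a text (unicode) string.
ant = "\U0001f41c"

# Encoding to UTF-8 yields the four bytes shown as '\xf0\x9f\x90\x9c'.
ant_bytes = ant.encode("utf-8")
print(ant_bytes)                          # b'\xf0\x9f\x90\x9c'

# Decoding those bytes recovers the single code point.
print(ant_bytes.decode("utf-8") == ant)   # True

# The escaped spelling on the 'encoded' line is plain ASCII text:
# a backslash, a 'U', and eight hex digits. It is not the symbol itself.
escaped = "\\U0001f41c"
print(len(escaped), len(ant))             # 10 1
```

So the question in this thread is whether a config value comes back as the decoded text form or as the raw UTF-8 bytes.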
to confirm 4.7.2 at least works as documented I explicitly passed the
>>> cfg = configobj.ConfigObj('test.cfg', encoding='utf-8')
>>> cfg['actual']
u'\U0001f41c'
>>> cfg['encoded']
u'\\U0001f41c'
which seems to match up with what you expected. To make sure I'm reading it right, is the error when you do or don't pass in the |
I had a problem that might have the same cause, so I'll mention it here instead of opening a separate bug. On python 2.7 with configobj 5, using your above test.cfg, if I just do:
>>> import configobj
>>> cfg = configobj.ConfigObj('test.cfg', encoding='utf8')
>>> cfg.write()
it blows up with UnicodeDecodeError. Works fine in configobj 4.7.2 (in my case I had to downgrade again to keep my scripts working). Hope that helps somehow. |
thanks kzuberi, I think that's enough to confirm there's a problem so I appreciate that |
ok, I have good news: due to the changes in #39 necessary to close the (originally) windows-specific #34, the untagged master actually works correctly:
>>> cfg = [b'test = \xf0\x9f\x90\x9c']
>>> c = configobj.ConfigObj(cfg, encoding='utf8')
>>> c['test']
u'\U0001f41c'
>>> c.write()
['test = \xf0\x9f\x90\x9c']
note that the access of the element is unicode, but writing it out is using bytestrings (since this method is intended for writing configs to file). I was holding off on a tag until we got access to a windows machine for running the test suite, but since the root cause is affecting everyone, there's no need to wait on it. I did add a specific test for this and I'll look to tag tonight, as was originally discussed. |
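The "unicode on access, bytestrings on write" behaviour described in this comment can be sketched with a toy parser. This is not configobj's implementation: `parse_lines` and `dump_lines` are hypothetical helpers, written in Python 3 syntax, that only show the decode-on-read and encode-on-write pattern.

```python
def parse_lines(raw_lines, encoding="utf-8"):
    """Hypothetical helper: decode 'key = value' byte lines into text."""
    parsed = {}
    for raw in raw_lines:
        key, _, value = raw.decode(encoding).partition("=")
        parsed[key.strip()] = value.strip()
    return parsed


def dump_lines(parsed, encoding="utf-8"):
    """Hypothetical helper: encode text values back into byte lines."""
    return ["{} = {}".format(k, v).encode(encoding) for k, v in parsed.items()]


raw = [b"test = \xf0\x9f\x90\x9c"]
config = parse_lines(raw)

# Values are text while held in memory...
print(type(config["test"]).__name__)   # str
# ...and become bytes again only when serialised for a file.
print(dump_lines(config) == raw)       # True
```

Keeping the encode and decode steps at the edges means everything in between can treat values as text.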
leaving this open until I tag, but @pazz it would be worth confirming that alot does use the encoding parameter |
I'll try to get a new Windows laptop soon, and there are two different projects (this one and MAGFest) that are being hobbled by the fact that I can't test things on Windows. No promises, but it's definitely increasing in priority. |
5.0.3 released: https://pypi.python.org/pypi/configobj/5.0.3
>>> import configobj
>>> configobj.__version__
'5.0.3' |
I still get different outputs unfortunately. Here is some more info.
import configobj
print configobj.__version__
cfg = configobj.ConfigObj('.config/alot/config', encoding='utf-8')
t = cfg['tags']['bug']['translated']
print(repr(t))
The relevant part of the config looks like this:
The config file is utf8 encoded. I get
If I understand the configobj docs correctly, the |
I can confirm that if I change line 2 in my script to
I.e., remove the encoding parameter, I get the same result. |
Unfortunately this may be an area where the docs aren't really helpful. I'll need to review it. Encoding is what gets it parsed in as Unicode, and since writing out config is using bytestrings, the encoding is used there to properly encode any Unicode values in the config. I definitely appreciate your cotest.py, and though checking it on my phone is rough, there's nothing super obvious that explains this difference. Hopefully I can reproduce it, and a fix would trigger a 5.0.4. I'll check it tonight, thanks |
reopened, because looking at @pazz's provided code in a real web browser, it should totally work |
ok, we have a smoking gun, thanks for the test code @pazz. Basically, because reasons (e.g. we inherited this), 99% of the tests are specifying configobj objects using a "list of strings" instead of a path to a file. This test works:
#issue #44
def test_encoding_in_subsections(self, ant_cfg):
    c = cfg_lines(ant_cfg)
    cfg = ConfigObj(c, encoding='utf-8')
    assert isinstance(cfg['tags']['bug']['translated'], six.text_type)
And this does NOT:
#issue #44
def test_encoding_in_config_files(self, request, ant_cfg):
    with NamedTemporaryFile(delete=False, mode='w') as cfg_file:
        cfg_file.write(ant_cfg.encode('utf-8'))
    request.addfinalizer(lambda: os.unlink(cfg_file.name))
    cfg = ConfigObj(cfg_file.name, encoding='utf-8')
    assert isinstance(cfg['tags']['bug']['translated'], six.text_type)
which means there's a difference in code paths between loading a config from a path (e.g. the 99.9% actual use case) and from a list of strings (the 90% test case) :( UGGGH. Good news, now that this is reproducible, I feel good about turning this into 5.0.4 tonight. Thanks again @pazz
edit: forgot that I had a standing Tuesday appointment for some reason, but working on it now. |
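One way to keep a path-versus-list gap like this from going unnoticed is to push both input styles through the same assertion. The sketch below shows that idea only. `load` is a hypothetical stand-in loader, not configobj's code, `ANT_CFG` is made-up test data, and the syntax is Python 3.

```python
import os
import tempfile

# Made-up test data: one key whose value is the ant symbol.
ANT_CFG = "translated = \U0001f41c\n"


def load(source, encoding="utf-8"):
    """Hypothetical loader: accepts a file path or a list of byte lines."""
    if isinstance(source, str):
        # Path input: read raw bytes, then share the decoding code below.
        with open(source, "rb") as handle:
            source = handle.read().splitlines()
    result = {}
    for raw in source:
        key, _, value = raw.decode(encoding).partition("=")
        result[key.strip()] = value.strip()
    return result


def check_both_sources():
    """Run the same assertions against list input and file input."""
    lines = ANT_CFG.encode("utf-8").splitlines()
    with tempfile.NamedTemporaryFile(delete=False, mode="wb") as cfg_file:
        cfg_file.write(ANT_CFG.encode("utf-8"))
    try:
        for source in (lines, cfg_file.name):
            value = load(source)["translated"]
            assert isinstance(value, str), source
            assert value == "\U0001f41c"
    finally:
        os.unlink(cfg_file.name)
    return True


print(check_both_sources())   # True
```

Because the path branch only converts the file into the same list of byte lines, both inputs end up in one decoding loop, so a regression in either shows up in the same check.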
alright, so feeling good about this, but because I'd like the docs to actually build again, I'd like to fix #49 and tag 5.0.4 tomorrow night, which will "officially" close this. |
ok, I'm properly confused now: my issue seems to be solved (currently my app runs ok with current master), but the
I of course get the same output with the configobj version from pip (5.0.3) but that one makes my app crash |
I don't have a good explanation based on what you posted here, but I am glad to hear that your app runs against current master. I'll close this one for now since I'm tagging 5.0.4 |
Hi!
I've been using configobj for a while in my project. Due to other dependencies
(twisted mainly), the stable release still depends on python v2.7.
Recently, an issue was opened regarding utf8 encoded symbols in the config file:
pazz/alot#693
It turns out that this is due to configobj v5.0.x not behaving as previous versions
when it comes to utf8 encoded symbols:
Previously, those were always passed on as unicode strings, e.g.
🐜
became u'\U0001f41c'. Now, the same symbol is read as '\xf0\x9f\x90\x9c'.
Obviously this causes trouble. Of course I can patch around this, but still,
I think configobj should behave exactly the same as previous versions here.
Thanks!
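For anyone who does need to patch around this in the meantime, one possible shape for such a workaround is to decode any byte-string values after loading. The sketch below uses Python 3 syntax and a plain nested dict as stand-in data. `decode_values` is a hypothetical helper, not part of configobj, and it deliberately ignores list values.

```python
def decode_values(section, encoding="utf-8"):
    """Return a copy of a nested dict with every bytes value decoded to text."""
    decoded = {}
    for key, value in section.items():
        if isinstance(value, dict):
            # Recurse into subsections.
            decoded[key] = decode_values(value, encoding)
        elif isinstance(value, bytes):
            decoded[key] = value.decode(encoding)
        else:
            # Already text (or some other type): leave untouched.
            decoded[key] = value
    return decoded


# Stand-in for a parsed config whose value came back as raw UTF-8 bytes.
loaded = {"tags": {"bug": {"translated": b"\xf0\x9f\x90\x9c"}}}
fixed = decode_values(loaded)
print(fixed["tags"]["bug"]["translated"] == "\U0001f41c")   # True
```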