New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal to not use unicode_literals #22

Closed
mitsuhiko opened this Issue Jan 5, 2014 · 15 comments

Comments

Projects
None yet
7 participants
@mitsuhiko

mitsuhiko commented Jan 5, 2014

I want to propose not using unicode_literals for this undertaking. Only a very low number of people are using Python 3.2 or older and being explicit about unicode strings makes Python code less error prone.

Accidentally upgrading docstrings and other things to unicode has very bad consequences which can go unnoticed for a really long time. I have seen people putting unicode strings into WSGI dictionaries, breaking pydoc because of unicode docstrings, Django breaking filesystem access due to accidentally using unicode paths etc.

I like the idea of python future a ton, and just started looking into it, but I am really not a fan of the idea of proposing people to use unicode literals.

@faassen

This comment has been minimized.

faassen commented Jan 6, 2014

So instead you suggest futurize instead adds an explicit 'u' to all unicode literals? And 'b' too presumably? This would still work on python 2.7 and python 3.3. Right?

@Julian

This comment has been minimized.

Julian commented Jan 6, 2014

Very strong +1.

If there's one thing I've learnt while porting things to Py3, it's that unicode_literals is a very very bad idea. You completely lose the ability to specify a native string literal, which quickly leads to API hell, since now you can't cope with APIs that migrated in Py2 to Py3 that take native strings without paying runtime costs calling functions to return native strings.

@edschofield

This comment has been minimized.

Contributor

edschofield commented Jan 7, 2014

Armin, thanks very much for your proposal, and also @Julian and @faassen for your feedback.

I would be open to changing the recommendation about unicode_literals, but I would like to understand the drawbacks more first.

Thanks for the heads-up about the pydoc issue with unicode docstrings. I wasn't aware of this before. It seems that a fix was committed to branch '2.7' of Python a few days ago (http://bugs.python.org/issue1065986), but of course it would be good to warn users of Python < 2.7.7.

Django itself recommends using unicode_literals as its top porting tip for Django developers. I ported mezzanine to a Py2/3 compatible codebase using unicode_literals. I saw some problems with Django 1.4.x (broken handling of unicode cookie keys), but these seemed to be fixed in 1.5. The whole process was mostly smooth and the diff is definitely smaller than it would have been with explicit u'' prefixes. The final codebase is also very clean; for instance, it requires only one code block like this:

if PY2:
    ...
else:
    ...

across the entire project. I believe the final codebase is simpler and cleaner than it would have been with explicit u'' prefixes everywhere, but I don't have hard evidence for this since I didn't also try a port using the other approach for comparison.

The performance hit of converting unicode literals to native strings when they really are needed is a fair point. The docs can definitely point this out as a drawback. I would also like to know which APIs require a native string. I previously started compiling a reference list of such cases here. Certainly more could be added. I think it is helpful to name and shame any API incompatibilities explicitly, whether unicode-related or not.

A major goal for future is to allow portable codebases to be as clean and maintainable as possible, with the philosophy of using Python 3 idioms wherever possible -- so as not to make the code uglier just to support Python 2. But correctness is an even more important goal, so this is worth more discussion. @faassen: Yes, the decision is primarily about the default behaviour of the futurize script. (I don't think this decision affects the future package itself.)

The current Ubuntu LTS and Debian stable have only Python 3.2, so it would be nice for futurize to continue to support unicode_literals as an option to support these users.

I will start by writing up these pros and cons as a Sphinx doc page on whether to use unicode_literals. I would be grateful for any further thoughts / arguments on the issue.

Thanks! :)

@edschofield

This comment has been minimized.

Contributor

edschofield commented Jan 7, 2014

In fact, deciding on whether to recommend unicode_literals does affect one rather important facet of using the future package itself: isinstance('', str) checks. Without unicode_literals, this result occurs on Py2:

>>> from future.builtins import *
>>> isinstance('', str)
False

which would be undesirable, since it is inconsistent with both Python 3 and Python 2 without the future import.

(The current design of the future types with respect to isinstance() checks is explained here.)

@edschofield

This comment has been minimized.

Contributor

edschofield commented Jan 8, 2014

I have written an initial draft of a section on the pros and cons of using unicode_literals in docs/imports.rst in master. I'd be grateful if you could please review it and give me your feedback, especially on any further drawbacks or sources of errors you are aware of. I'd also gratefully receive any pull requests.

Thanks again!

@mitsuhiko

This comment has been minimized.

mitsuhiko commented Jan 8, 2014

There are so many subtle problems that unicode_literal causes. For instance lots of people accidentally introduce unicode into filenames and that seems to work, until they are using it on a system where there are unicode characters in the filesystem path.

Some examples of what broke in Django through it:

There are more, but those are the ones a quick trac search on the Django tracker showed up.

@ztane

This comment has been minimized.

ztane commented Jan 9, 2014

Yeah, one of the nuisances of the WSGI spec is that the header values IIRC are the str or StringType on both py2 and py3. With unicode_literals this causes hard-to-spot bugs, as some WSGI servers might be more tolerant than others, but usually using unicode in python 2 for WSGI headers will cause the response to fail

@ncoghlan

This comment has been minimized.

ncoghlan commented Jan 9, 2014

+1 from me for avoiding the unicode_literals future, as it can have very strange side effects in Python 2 (and thanks Armin for raising the issue).

This is one of the key reasons I backed Armin's PEP 414 (which restored Unicode literals in Python 3.3)

I'll also take a look at Ed's review of the pros and cons - we may be able to do something clever with the isinstance overrides to handle the instance checking problem.

@edschofield

This comment has been minimized.

Contributor

edschofield commented Jan 9, 2014

@mitsuhiko: Thanks heaps for your research!

@ncoghlan and @ztane: Thanks for your feedback too. I have updated the docs again here. I'd appreciate your comments.

I can see the down-sides to using unicode_literals to port existing Python 2 code. I will attempt to remove it from futurize in favour of an incremental migration approach.

I can see a stronger argument for using unicode_literals in the case of backporting new or existing Python 3 code to Python 2; then there isn't the same risk of breaking an existing Python 2 API. I'd appreciate hearing about any real-world back-porting experiences with unicode_literals.

@edschofield

This comment has been minimized.

Contributor

edschofield commented Jan 9, 2014

@ncoghlan: __instancecheck__ is already being overridden in the metaclass of the future types, if that's what you mean. But I expect that reporting isinstance(native_py2_str, str) == True would be harmful in a Py3-style codebase where str means text.

@ncoghlan

This comment has been minimized.

ncoghlan commented Jan 9, 2014

Updated docs look good to me, and I agree it should be less quirky in the 3
-> 2/3 case than it is when going in the other direction.

I think you have a typo in the relevant bullet point though (point 4 in
favour).

@edschofield

This comment has been minimized.

Contributor

edschofield commented Jan 11, 2014

I have now released an update (v0.10.2) and uploaded the latest docs to http://python-future.org. The section on unicode_literals is here.

Thanks very much for your feedback!

@ncoghlan

This comment has been minimized.

ncoghlan commented Jan 11, 2014

Very nice! I think there may be still be a typo in the third "Benefits" bullet point, though: "The diff for a Python 2 -> 2/3 port may be smaller". If I'm understanding that point correctly, I believe it's talking about Python 3 -> 2/3 ports.

@edschofield

This comment has been minimized.

Contributor

edschofield commented Feb 27, 2014

An update has just gone out (v0.11.3) in which the futurize script no longer adds from __future__ import unicode_literals by default.

@untitaker untitaker referenced this issue Aug 14, 2015

Closed

Fix *-imports #123

vincentbernat added a commit to vincentbernat/cookiecutter that referenced this issue Aug 27, 2015

cli: don't use unicode_literals to please click
The reason are explained in this thread:
 PythonCharmers/python-future#22

Just remove unicode_literals from `cookiecutter/cli.py` as it is the
first use of click. However, it should also be removed from
`cookiecutter/prompt.py`.

perrygeo added a commit to perrygeo/python-rasterstats that referenced this issue Sep 10, 2015

vincentbernat added a commit to vincentbernat/cookiecutter that referenced this issue Sep 18, 2015

cli: don't use unicode_literals to please click
The reason are explained in this thread:
 PythonCharmers/python-future#22

Just remove unicode_literals from `cookiecutter/cli.py` and
`cookiecutter/prompt.py` as they are the first use of click.

mrgambal added a commit to mrgambal/vulyk that referenced this issue Oct 31, 2015

UnwashedMeme added a commit to iDigBio/idb-backend that referenced this issue Apr 15, 2016

Remove imports of unicode_literals
I had put some of these in place to try and inch towards python3
somewhat. But people argue this isn't actually a good strategy:
PythonCharmers/python-future#22

FGtatsuro added a commit to FGtatsuro/dotfiles that referenced this issue Nov 14, 2016

marthjod added a commit to marthjod/hipy that referenced this issue Jul 11, 2017

@spookylukey

This comment has been minimized.

spookylukey commented Sep 4, 2018

A very late reply to this issue, from someone surprised by the advice that click was giving me:

Armin mentions some examples of problems caused by unicode_literals in Django. If you check more carefully, ticket 2167, ticket 21358 and ticket 21627 are actually counter examples (the fixes involved adding unicode_literals) and ticket 20809 was marked as not-a-bug - the behaviour introduced by unicode_literals was welcome.

My own experience is that unicode_literals was the only way to fix a continual source of unicode errors in Python 2 code bases with large number of strings that were intended to be unicode, but people just kept forgetting to add the u prefix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment