I want to propose not using unicode_literals for this undertaking. Only a very low number of people are using Python 3.2 or older and being explicit about unicode strings makes Python code less error prone.
Accidentally upgrading docstrings and other things to unicode has very bad consequences which can go unnoticed for a really long time. I have seen people putting unicode strings into WSGI dictionaries, breaking pydoc because of unicode docstrings, Django breaking filesystem access due to accidentally using unicode paths etc.
I like the idea of python-future a ton, and just started looking into it, but I am really not a fan of recommending that people use unicode_literals.
So you suggest that futurize instead add an explicit 'u' prefix to all unicode literals? And 'b' too, presumably? This would still work on Python 2.7 and Python 3.3, right?
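To make the question concrete, here is a minimal sketch of what explicitly prefixed literals look like. It assumes Python 2.7 or Python 3.3+ (where the u'' prefix is accepted again); the variable names are only illustrative.

```python
# Explicit prefixes mean the same thing on Python 2.7 and Python 3.3+.
text = u"caf\u00e9"      # always text: unicode on Py2, str on Py3
data = b"caf\xc3\xa9"    # always bytes: str on Py2, bytes on Py3
native = "cafe"          # native string: bytes on Py2, text on Py3

# The text and bytes forms round-trip through UTF-8 on either version.
assert data.decode("utf-8") == text
assert text.encode("utf-8") == data

# An unprefixed literal is always an instance of the built-in str.
assert isinstance(native, str)
```

With this style an unprefixed literal stays available as the "native string" spelling, which is exactly what unicode_literals takes away.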
Very strong +1.
If there's one thing I've learnt while porting things to Py3, it's that unicode_literals is a very, very bad idea. You completely lose the ability to specify a native string literal, which quickly leads to API hell: you can no longer cope with APIs that take native strings on both Py2 and Py3 without paying the runtime cost of calling functions that return native strings.
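As a rough illustration of that runtime cost, this is the kind of helper a codebase ends up calling once native literals are gone. `to_native` is a hypothetical name for this sketch, not a function from the future package, and the default encoding is an assumption.

```python
import sys

PY2 = sys.version_info[0] == 2


def to_native(value, encoding="utf-8"):
    """Return value as the interpreter's native str type (hypothetical helper)."""
    if isinstance(value, str):
        # Already native: bytes-str on Py2, text-str on Py3.
        return value
    if PY2:
        # Py2: a unicode object has to be encoded to a byte string.
        return value.encode(encoding)
    # Py3: a bytes object has to be decoded to text.
    return value.decode(encoding)


# Every call site that needs a native string now pays for a function call.
header_name = to_native(u"Content-Type")
assert type(header_name) is str
```

Without unicode_literals the same call site is just the literal "Content-Type", with no helper and no call overhead.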
Armin, thanks very much for your proposal, and also @Julian and @faassen for your feedback.
I would be open to changing the recommendation about unicode_literals, but I would like to understand the drawbacks more first.
Thanks for the heads-up about the pydoc issue with unicode docstrings. I wasn't aware of this before. It seems that a fix was committed to branch '2.7' of Python a few days ago (http://bugs.python.org/issue1065986), but of course it would be good to warn users of Python < 2.7.7.
Django itself recommends using unicode_literals as its top porting tip for Django developers. I ported mezzanine to a Py2/3 compatible codebase using unicode_literals. I saw some problems with Django 1.4.x (broken handling of unicode cookie keys), but these seemed to be fixed in 1.5. The whole process was mostly smooth and the diff is definitely smaller than it would have been with explicit u'' prefixes. The final codebase is also very clean; for instance, it requires only one code block like this:
across the entire project. I believe the final codebase is simpler and cleaner than it would have been with explicit u'' prefixes everywhere, but I don't have hard evidence for this since I didn't also try a port using the other approach for comparison.
The performance hit of converting unicode literals to native strings when they really are needed is a fair point. The docs can definitely point this out as a drawback. I would also like to know which APIs require a native string. I previously started compiling a reference list of such cases here. Certainly more could be added. I think it is helpful to name and shame any API incompatibilities explicitly, whether unicode-related or not.
A major goal for future is to allow portable codebases to be as clean and maintainable as possible, with the philosophy of using Python 3 idioms wherever possible -- so as not to make the code uglier just to support Python 2. But correctness is an even more important goal, so this is worth more discussion. @faassen: Yes, the decision is primarily about the default behaviour of the futurize script. (I don't think this decision affects the future package itself.)
The current Ubuntu LTS and Debian stable have only Python 3.2, so it would be nice for futurize to continue to support unicode_literals as an option to support these users.
I will start by writing up these pros and cons as a Sphinx doc page on whether to use unicode_literals. I would be grateful for any further thoughts / arguments on the issue.
In fact, deciding on whether to recommend unicode_literals does affect one rather important facet of using the future package itself: isinstance('', str) checks. Without unicode_literals, this result occurs on Py2:
>>> from future.builtins import *
>>> isinstance('', str)
False
which would be undesirable, since it is inconsistent with both Python 3 and Python 2 without the future import.
(The current design of the future types with respect to isinstance() checks is explained here.)
I have written an initial draft of a section on the pros and cons of using unicode_literals in docs/imports.rst in master. I'd be grateful if you could please review it and give me your feedback, especially on any further drawbacks or sources of errors you are aware of. I'd also gratefully receive any pull requests.
There are so many subtle problems that unicode_literals causes. For instance, lots of people accidentally introduce unicode into filenames, and that seems to work until the code runs on a system where the filesystem path contains non-ASCII characters.
Some examples of what broke in Django through it:
There are more, but those are the ones a quick trac search on the Django tracker showed up.
Yeah, one of the nuisances of the WSGI spec is that the header values, IIRC, must be the native str type (StringType on Python 2) on both Py2 and Py3. With unicode_literals this causes hard-to-spot bugs, since some WSGI servers are more tolerant than others, but usually using unicode for WSGI headers on Python 2 will cause the response to fail.
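A minimal sketch of a guard that would catch this early, assuming the native-string rule for headers from PEP 3333. `check_native_headers` is a hypothetical helper written for this example, not part of any WSGI server or of the future package.

```python
def check_native_headers(headers):
    """Raise TypeError unless every header name and value is exactly native str."""
    for name, value in headers:
        for part in (name, value):
            # Exact type check: on Py2 this rejects unicode, on Py3 it rejects bytes.
            if type(part) is not str:
                raise TypeError(
                    "WSGI header parts must be native str, got %s: %r"
                    % (type(part).__name__, part)
                )
    return headers


# Unprefixed literals are native strings on both Python 2 and Python 3,
# as long as the module does not enable unicode_literals.
check_native_headers([("Content-Type", "text/plain")])
```

In a Python 2 module that enables unicode_literals, the same unprefixed literals would be unicode objects and this check would raise, which is the failure mode described above.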
+1 from me for avoiding the unicode_literals future, as it can have very strange side effects in Python 2 (and thanks Armin for raising the issue).
This is one of the key reasons I backed Armin's PEP 414 (which restored Unicode literals in Python 3.3).
I'll also take a look at Ed's review of the pros and cons - we may be able to do something clever with the isinstance overrides to handle the instance checking problem.
@mitsuhiko: Thanks heaps for your research!
@ncoghlan and @ztane: Thanks for your feedback too. I have updated the docs again here. I'd appreciate your comments.
I can see the down-sides to using unicode_literals to port existing Python 2 code. I will attempt to remove it from futurize in favour of an incremental migration approach.
I can see a stronger argument for using unicode_literals in the case of backporting new or existing Python 3 code to Python 2; then there isn't the same risk of breaking an existing Python 2 API. I'd appreciate hearing about any real-world back-porting experiences with unicode_literals.
@ncoghlan: __instancecheck__ is already being overridden in the metaclass of the future types, if that's what you mean. But I expect that reporting isinstance(native_py2_str, str) == True would be harmful in a Py3-style codebase where str means text.
I have now released an update (v0.10.2) and uploaded the latest docs to http://python-future.org. The section on unicode_literals is here.
Thanks very much for your feedback!
Very nice! I think there may still be a typo in the third "Benefits" bullet point, though: "The diff for a Python 2 -> 2/3 port may be smaller". If I'm understanding that point correctly, I believe it's talking about Python 3 -> 2/3 ports.
An update has just gone out (v0.11.3) in which the futurize script no longer adds from __future__ import unicode_literals by default.
cli: don't use unicode_literals to please click
The reasons are explained in this thread:
Just remove unicode_literals from `cookiecutter/cli.py` as it is the
first use of click. However, it should also be removed from
OTOH lets not use unicode_literals, see http://click.pocoo.org/5/pyth…
…on3/#unicode-literals and PythonCharmers/python-future#22
The reasons are explained in this thread:
Just remove unicode_literals from `cookiecutter/cli.py` and
`cookiecutter/prompt.py` as they are the first use of click.
DEL: removed ```from __future__ import unicode_literals``` from CLI f…
Remove imports of unicode_literals
I had put some of these in place to try and inch towards python3
somewhat. But people argue this isn't actually a good strategy: