# Writing 2/3-Compatible Python

Written for HERA CHAMP Camp — Santa Fe, NM — June 14, 2018, by Peter K. G. Williams (peter@newton.cx). Should be generally applicable, though.

## Heads up: the `2to3` tool

You may have heard that there is a command-line tool, `2to3`, that will edit your code to make it compatible with Python 3. The code it produces is **no longer compatible** with Python 2, though, so it doesn't suffice for our purposes. It is pretty clever, however, so when converting existing code it might work well to *start* by running `2to3`, then patch up its edits to match the recommendations below.

## Start every file like this:

In [1]:
# -*- coding: utf-8 -*-
# Copyright 2018 the HERA Project
# Licensed under the 2-clause BSD license (e.g.)

"""Docstring ..."""

from __future__ import absolute_import, division, print_function

In Python 2, this activates three key behaviors that make things compatible with Python 3.

To automatically activate this behavior in Jupyter, put the following in `~/.ipython/profile_default/ipython_config.py`:

```python
c.InteractiveShellApp.exec_lines = [
   'from __future__ import absolute_import, division, print_function',
]
```

Note, however, that if you plan on sharing your notebook with other people, they won't necessarily have this configuration set up, so you should explicitly put the `__future__` import at the top of the notebook to ensure that the environment is standardized.

## Print becomes a function

You've all heard this. Translating from old `print` to new `print()` is not complex.

In [2]:
bar = 123
print('foo', bar, 17) # the most basic usage is really just a matter of adding parens

print('incomplete line', end='') # roughly equivalent to `print 'incomplete line',`

import sys
print('to a file', file=sys.stderr) # equivalent to `print >>sys.stderr, 'to a file'`.

foo 123 17
incomplete line

to a file


## Import statements become absolutely referenced

Say that you have a package with a structure like this:

```
hera_foo/
   __init__.py
   utils.py
   science.py
   more/
      __init__.py
      ideas.py
```

In Python 2, the following statement in `utils.py` would import the sibling module `science.py`:

```python
import science
```

In Python 3 and 2/3-compatible Python 2, imports like this are "absolute", and Python will insist on interpreting such an import as asking to find a top-level package named `science`. Instead, you must write:

```python
from . import science
```

You can also write things like:

```python
from .science import do_integral
```

If you have several layers of modules there is a 'dot-dot' notation to go up one level. From the `ideas.py` you could write:

```python
from .. import utils
```
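
To see these relative imports working end to end, here is a minimal sketch that builds a throwaway copy of the layout above in a temporary directory and imports from it. The package name `hera_foo_demo` and the body of `do_integral` are made up for this illustration.

```python
import os
import sys
import tempfile

# Build a throwaway package on disk. The name `hera_foo_demo` and the
# function body are hypothetical, chosen only for this illustration.
root = tempfile.mkdtemp()
files = {
    'hera_foo_demo/__init__.py': '',
    'hera_foo_demo/science.py': 'def do_integral():\n    return 42\n',
    'hera_foo_demo/utils.py': 'from .science import do_integral\n',
    'hera_foo_demo/more/__init__.py': '',
    'hera_foo_demo/more/ideas.py': 'from .. import utils\n',
}
for relpath, source in files.items():
    path = os.path.join(root, *relpath.split('/'))
    if not os.path.isdir(os.path.dirname(path)):
        os.makedirs(os.path.dirname(path))
    with open(path, 'w') as f:
        f.write(source)

# Make the temporary directory importable, then import the nested module.
sys.path.insert(0, root)
from hera_foo_demo.more import ideas

# `ideas` reached up one level to `utils`, which reached sideways to `science`.
result = ideas.utils.do_integral()
```

The same source files import cleanly on both Python 2 and Python 3, because none of them relies on the implicit sibling lookup.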

Special bonus fun! If you want your Sphinx docs generation to be compatible with both Python 2 and 3, you need to stick a space in the dots somewhere if you have more than three levels of import (e.g. `from ... import`)



## Integer division needs to be made explicit

In plain Python 2, `1 / 2` equals zero. In Python 3 and compatible Python 2, `1 / 2` equals 0.5. The special operator `//` explicitly calls out integer division.

In [3]:
assert 1 // 2 == 0
assert 1 / 2 == 0.5

If you want an operator that is "integer division with integers, float division otherwise", use `past.utils.old_div`.
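
If you would rather not depend on the `past` package, the behavior is small enough to sketch by hand. The helper name `old_div_sketch` below is hypothetical; it is meant to mimic the Python 2 rule, not to be a drop-in copy of the library function.

```python
from __future__ import division

import numbers


def old_div_sketch(a, b):
    """Divide the way plain Python 2 did: floor division when both
    operands are integers, true division otherwise."""
    if isinstance(a, numbers.Integral) and isinstance(b, numbers.Integral):
        return a // b
    return a / b


int_result = old_div_sketch(7, 2)      # both integers: floor division
float_result = old_div_sketch(7.0, 2)  # one float: true division
```

In new code it is usually clearer to decide which of `//` or `/` you actually mean and write that directly.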

## Deprecated exception syntaxes are gone

The newer syntaxes have been supported forever and you should have been using them anyway.

In [4]:
try:
    raise Exception('oh no') # better than `raise Exception, 'oh no'`
except Exception as e: # better than `except Exception, e`
    pass

In [5]:
import six, sys # "six" is a module that helps you write 2/3-compatible code

try:
    try:
        assert False
    except:
        etype, evalue, etb = sys.exc_info()
        six.reraise(etype, evalue, etb) # replaces `raise etype, evalue, etb`
except Exception as e:
    print(e.__class__.__name__, e)

AssertionError 


There are other changes in this general ballpark (e.g. chaining of exceptions) but most of our code doesn't do anything fancier than this.
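
One related pattern needs no helper at all: if you re-raise from inside the `except` block that caught the exception, a bare `raise` works on both versions and keeps the original traceback. The function names below are made up for this sketch; `six.reraise` is still the tool for when you have stashed the `sys.exc_info()` triple and want to raise it later.

```python
import traceback


def risky():
    raise ValueError('oh no')


def call_with_cleanup():
    try:
        risky()
    except ValueError:
        # ... any cleanup would go here ...
        raise  # re-raises the exception currently being handled


try:
    call_with_cleanup()
except ValueError:
    # The formatted traceback still points into `risky`.
    report = traceback.format_exc()
```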

## The `int` and `long` types are unified

Did you even know that `long` was a type?

In [6]:
import six # once again, "six" helps you write 2/3-compatible code

q = 17

if isinstance(q, six.integer_types): # replaces `type(q) == int` and variants
    print('q is an int')

q is an int
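
If you are curious what `six.integer_types` amounts to, the following standard-library-only sketch builds an equivalent tuple by hand. The helper name `is_integer` is hypothetical.

```python
import sys

if sys.version_info[0] == 2:
    # `long` only exists on Python 2; this branch is never run on Python 3.
    integer_types = (int, long)  # noqa: F821
else:
    integer_types = (int,)


def is_integer(q):
    """True for any built-in integer, however large."""
    return isinstance(q, integer_types)


small_ok = is_integer(17)
big_ok = is_integer(2 ** 100)   # a `long` on Python 2, a plain `int` on Python 3
float_ok = is_integer(17.0)
```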


## Stringy things default to being Unicode

This is by far the most intellectually challenging change, but I'm hopeful that it will be relatively straightforward for us to deal with.

For what it's worth, Python 2 was really sloppy about the relevant matters, and Python 3 does a much better job of dealing with them. In Python 3 and 2/3-compatible code, you have to be more careful, but the code you write will be more correct. (The difference is often subtle if you mainly read and write English, but not if you mainly read and write Chinese or Sinhala or Thai or Arabic or Devanagari or ...)

Anyway. Both Python 2 and Python 3 have two types that you might think of as "string-like": "bytes" and "unicode".

A bytes value is an array of binary data. In both versions of Python, you can reliably get bytes by writing something like this:

In [7]:
import six

my_bytes = b'\xf0\x9f\x92\xa9' # `\xNN` encodes a byte in hexadecimal
print('len:', len(my_bytes))
print('is bytes?', isinstance(my_bytes, six.binary_type))

len: 4
is bytes? True


A unicode value stores some quantity of "text" in any language, construed broadly. In both versions of Python, you can reliably get Unicode by writing something like this:

In [8]:
import six

my_unicode = u'古池や蛙飛び込む水の音' # https://en.wikipedia.org/wiki/Haiku#Examples
print('len:', len(my_unicode))
print('is text?', isinstance(my_unicode, six.text_type))

len: 11
is text? True


**The key fact is that there are a variety of ways to convert Unicode into bytes, and/or bytes into Unicode.** You need to choose a "codec" to do the conversion. Conversions are not always possible: given a choice of encoding, some Unicode characters may not be representable (plain ASCII cannot express the letter "é") and some byte sequences may not be decodable (the byte `0xff` never appears in valid UTF-8).

The only codec we are ever going to care about is UTF-8, which is identical to ASCII for plain ASCII text but can also express any other Unicode text when needed.
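
As a quick sanity check of these claims, here is a small sketch with made-up sample strings. It shows that pure-ASCII text encodes to the same bytes under both codecs, and that both directions of conversion can fail.

```python
ascii_text = u'hello'
# For pure-ASCII text, UTF-8 and ASCII produce identical bytes.
same_bytes = ascii_text.encode('utf8') == ascii_text.encode('ascii')

accented = u'caf\xe9'  # the word "café"
utf8_bytes = accented.encode('utf8')       # "é" becomes two bytes
round_trip = utf8_bytes.decode('utf8') == accented

try:
    accented.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False   # ASCII has no way to express "é"

try:
    b'\xff'.decode('utf8')
    decode_ok = True
except UnicodeDecodeError:
    decode_ok = False  # this byte sequence is not valid UTF-8
```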

In [9]:
# This expresses bytes as a hexadecimal-encoded str in both Python 2 and 3.
# Just `repr()` gives different output depending on your Python major version.
# This is the tersest way I can find! See the Appendix for relevant discussion.
import six
bytes2hexstr = lambda b: ' '.join('%02x' % i for i in six.iterbytes(b))

uasb = my_unicode.encode('utf8')
assert isinstance(uasb, six.binary_type)
print('japanese text encoded as bytes in UTF8 encoding:\n\n', bytes2hexstr(uasb), '\n')
print('len:', len(uasb))

print()
uasb2 = my_unicode.encode('shift-jis')
assert isinstance(uasb2, six.binary_type)
print('japanese text encoded as bytes in Shift-JIS encoding:\n\n', bytes2hexstr(uasb2), '\n')
print('len:', len(uasb2))

print()
try:
    my_unicode.encode('ascii')
except Exception as e:
    print('could not represent the text in the ASCII encoding:', e)

japanese text encoded as bytes in UTF8 encoding:

 e5 8f a4 e6 b1 a0 e3 82 84 e8 9b 99 e9 a3 9b e3 81 b3 e8 be bc e3 82 80 e6 b0 b4 e3 81 ae e9 9f b3 

len: 33

japanese text encoded as bytes in Shift-JIS encoding:

 8c c3 92 72 82 e2 8a 5e 94 f2 82 d1 8d 9e 82 de 90 85 82 cc 89 b9 

len: 22

could not represent the text in the ASCII encoding: 'ascii' codec can't encode characters in position 0-10: ordinal not in range(128)


In [10]:
basu = my_bytes.decode('utf8')
assert isinstance(basu, six.text_type)
print('bytes converted into Unicode using UTF8 encoding:\n\n', basu, '\n')
print('len:', len(basu))

print()
try:
    my_bytes.decode('shift-jis')
except Exception as e:
    print('the bytes are not valid Shift-JIS:', e)

bytes converted into Unicode using UTF8 encoding:

 💩 

len: 1

the bytes are not valid Shift-JIS: 'shift_jis' codec can't decode byte 0xf0 in position 0: illegal multibyte sequence


**OK, here's where things get messy.**

In Python 2, the "bytes" type is good old `str`, and the Unicode type is called `unicode`. If you type a bare string, `'abcd'`, you get bytes/`str`.

In Python 3, the "bytes" type is named `bytes`, and the Unicode type is called `str`. If you type a bare string you get Unicode/`str`.

There are a lot of Python APIs that always take or return `str` in either Python 2 or 3: i.e., in Python 2 they want bytes, and in Python 3 they want Unicode. For instance, the `str()` and `repr()` functions always return `str`s. (See the Appendix for discussion of the relevant pitfalls, which chiefly arise when you `print()` bytes.)

(This is why I do not recommend using `from __future__ import unicode_literals`, which makes it so that in Python 2 a bare string `'abcd'` becomes of type `unicode`. This makes it challenging to write 2/3-compatible code because it breaks the invariant that `'abcd'` is always an object of type `str`.)

So the following invariants hold on both Python 2 and Python 3:

In [11]:
assert isinstance(b'1234', six.binary_type)
assert isinstance(u'1234', six.text_type)
assert isinstance('1234', str)

## I/O cares about Unicode vs. bytes

As it should! If you use the standard interfaces, you will get `str`, i.e., either bytes or Unicode depending on which Python version is active:

In [12]:
with open('/etc/passwd') as f:
    assert isinstance(f.readline(), str)

If you want text:

In [13]:
import io

with io.open('/etc/passwd', 'rt') as f: # 'rt' means 'read mode, text mode'
    assert isinstance(f.readline(), six.text_type)

If you want binary data:

In [14]:
import io

with io.open('/etc/passwd', 'rb') as f: # 'rb' means 'read mode, binary mode'
    assert isinstance(f.readline(), six.binary_type)

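The "text mode" read above automatically chooses a codec to convert the underlying binary file data into text. By default it uses your locale's preferred encoding, which is UTF-8 on most modern systems; pass `encoding='utf-8'` to `io.open()` if you want to be certain. The `codecs` module provides the tools to do these sorts of conversions of file streams yourself.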

Analogous patterns hold for writing files.
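
For concreteness, here is a minimal sketch of the writing side. The file names are made up and live in a temporary directory; the encoding is passed explicitly so the result does not depend on the locale.

```python
import io
import os
import tempfile

workdir = tempfile.mkdtemp()
text_path = os.path.join(workdir, 'notes.txt')   # hypothetical file names
binary_path = os.path.join(workdir, 'data.bin')

# Text mode: you must write Unicode, and io.open() encodes it for you.
with io.open(text_path, 'wt', encoding='utf-8') as f:
    f.write(u'caf\xe9')

# Binary mode: you must write bytes, and they go to disk untouched.
with io.open(binary_path, 'wb') as f:
    f.write(b'\x00\x01\x02')

# Read both files back as raw bytes to see what actually landed on disk.
with io.open(text_path, 'rb') as f:
    raw_text = f.read()
with io.open(binary_path, 'rb') as f:
    raw_binary = f.read()
```

Passing the wrong type to `f.write()` raises a `TypeError` on both versions, which is exactly the kind of early failure you want.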

**Double extra bonus fun time!!!** Python 2 has a hack to make it so that `print(u'unicode')` automatically chooses a codec and prints something correct.

This hack **does not work** if your program's output is redirected into a pipe or a file rather than the terminal. So `./my-unicode-printer.py` works, but `./my-unicode-printer.py >log.txt` doesn't. Weak sauce!

My `pwkit` package includes code that can set up your standard output streams to always accept Unicode on both Python 2 and 3: see docs [here](http://pwkit.readthedocs.io/en/latest/foundations/io/#unicode-safety) and [here](http://pwkit.readthedocs.io/en/latest/cli-tools/pwkit-cli-toplevel/#pwkit.cli.unicode_stdio). If you use this approach, you must then ensure that you *only* pass Unicode to `print()`, never bytes.
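
To illustrate the general idea (this is a sketch of the concept, not how `pwkit` implements it), you can wrap any binary stream in a text layer with a fixed codec. The helper name `utf8_text_writer` is hypothetical, and an in-memory `io.BytesIO` stands in for the real output stream.

```python
import io


def utf8_text_writer(binary_stream):
    """Wrap a binary stream so that it accepts Unicode and always emits UTF-8."""
    return io.TextIOWrapper(binary_stream, encoding='utf-8')


buf = io.BytesIO()            # stand-in for a real binary output stream
out = utf8_text_writer(buf)
out.write(u'caf\xe9')         # Unicode in ...
out.flush()
encoded = buf.getvalue()      # ... UTF-8 bytes out, regardless of locale
```

On Python 3 the binary stream underneath standard output is available as `sys.stdout.buffer`; Python 2 needs different plumbing, which is part of why a helper library is convenient here.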

## Lots of functions return iterators rather than lists

This is good for efficiency. To get a consistent API, use:

In [15]:
from six.moves import map, range, zip

# map returns a lazy iterator, not a list, in both 2 and 3 (a "map object" on Python 3):
print(type(map(lambda x: x * 2, [1, 2, 3])))

<class 'map'>


Reproducing the old behavior is easy:

In [16]:
print(list(map(lambda x: x * 2, [1, 2, 3])))
print(sorted(map(lambda x: x * 2, [3, 2, 1]))) # no need to listify if you're going to sort

[2, 4, 6]
[2, 4, 6]


This also holds true for (e.g.) `dict.keys()`, `dict.items()`, and so on.
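
A few dictionary idioms behave identically on both versions without any helper. The sketch below uses a made-up dictionary.

```python
d = dict(foo=1, bar=2, baz=3)

# Iterating over the dict itself yields its keys lazily on both versions.
seen = []
for key in d:
    seen.append(key)

# Wrap in list() only when you truly need a list, e.g. to index into it.
keys = list(d.keys())

# sorted() accepts any iterable, so no list() call is needed here.
pairs = sorted(d.items())
```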

If you used to use `dict.iterkeys()`, etc.:

In [17]:
import six

d = dict(foo=1, bar=2, baz=3)
for k in six.iterkeys(d):
    print(k)

foo
bar
baz


Likewise instead of `xrange`, switch to `range` but put the following at the top of your files:

In [18]:
from six.moves import range

Note that calling `list()` on something that is already a list is harmless (it just makes a shallow copy), so it's OK to be a bit sloppy about potentially double-listing things in Python 2.

## Metaclasses need some extra work

Monkey see, monkey do:

```python
from six import with_metaclass

class Form(with_metaclass(FormType, BaseForm)):
    pass
```
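
If you want to see what is going on underneath, note that a metaclass is just a callable that builds a class, so calling it directly also works on both versions. The definitions of `FormType` and `BaseForm` below are made-up stand-ins so that the sketch runs on its own; `six.with_metaclass` remains the more readable choice for real code.

```python
class FormType(type):
    """Hypothetical metaclass that records the name of every class it creates."""
    registry = []

    def __init__(cls, name, bases, namespace):
        super(FormType, cls).__init__(name, bases, namespace)
        FormType.registry.append(name)


class BaseForm(object):
    pass


# Calling the metaclass directly builds the class, with no version-specific syntax.
Form = FormType('Form', (BaseForm,), {'greeting': 'hello'})

form = Form()
```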

## Lots of other smaller-bore stuff

[Here's a good cheat sheet](http://python-future.org/compatible_idioms.html). Some things that might come up:

- `exec` statements
- `repr` via backticks
- Octal constants
- Various package names and locations (`StringIO`, `urllib`, ...)
- Custom iterator support
- Stringification of classes
- `raw_input` calls
- All sorts of other fun stuff.
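
A few of these have short standard-library-only answers. The sketch below is illustrative rather than exhaustive, and the URL and variable names are made up.

```python
import io

# Relocated modules: try the Python 3 location first, fall back to Python 2.
try:
    from urllib.parse import urlparse
except ImportError:
    from urlparse import urlparse

# `raw_input` was renamed to `input` in Python 3.
try:
    input_func = raw_input  # noqa: F821 -- Python 2 only
except NameError:
    input_func = input

mode = 0o755  # octal constants need the `0o` prefix; a bare `0755` fails on Python 3

buf = io.StringIO()  # `io.StringIO` exists on both versions and holds Unicode text
buf.write(u'text only')

host = urlparse('http://example.com/path').netloc

namespace = {}
exec('x = 1 + 1', namespace)  # the function-call form of `exec` works on both
```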

Of course, when in doubt, Google it! If you run into any esoteric problems, it's likely that someone has already done so and posted the fix on StackExchange.

# Appendix: More Bytes/Unicode Gotchas

When you start running on Python 3, you'll likely find that some output that your program emits starts looking like `b'hello world'` rather than just `hello world`. Here I try to explain the fundamental cause of what's going on.

Basically, the switch from `str`-is-bytes (Python 2) to `str`-is-Unicode (Python 3) has follow-on effects relating to the stringification of both bytes and unicode values.

One aspect of the issue relates to `repr`. Here's one way to think about it: in 2/3-compatible code, the `repr` of explicitly-typed bytes and Unicode depends on which major version of Python you're running. The `repr` of `str`, however, is the same.

```python
# ONLY TRUE IN PYTHON 2:
assert repr(b'123') == "'123'"
assert repr(u'123') == "u'123'"

# ONLY TRUE IN PYTHON 3:
assert repr(b'123') == "b'123'"
assert repr(u'123') == "'123'"

# TRUE IN BOTH:
assert repr('123') == "'123'"
```
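
When you want a display of bytes that looks the same on both versions, one option is to sidestep `repr` entirely and format the byte values yourself. The helper name `bytes_to_hex` is hypothetical; it relies on the fact that iterating over a `bytearray` yields integers on both Python 2 and Python 3.

```python
def bytes_to_hex(b):
    """Render bytes as space-separated hex pairs, as a native str."""
    return ' '.join('%02x' % i for i in bytearray(b))


hex_text = bytes_to_hex(b'\xf0\x9f\x92\xa9')
```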

Next, `str` also starts being funky, in an asymmetric way:

```python
# PYTHON 2:
assert str(b'123') == b'123' # generically true: `str(x) == x` if x is bytes

# PYTHON 3:
assert str(b'123') == repr(b'123') # generically true: `str(x) == repr(x)` if x is bytes
```

The last line is generally the problem: in Python 3, if you `print` a bytes value, it is stringified, which is equivalent to taking its `repr`, which gives you the value in the  `b''` formatting. Note that if your value was of the `str` type — i.e., either bytes or Unicode depending on the Python major version — everything would be OK. (In terms of unwanted extra characters from the `repr` operation, that is. From a purist standpoint, calling `print()` with non-Unicode is *always* incorrect, hence [pwkit.cli.unicode_stdio](http://pwkit.readthedocs.io/en/latest/cli-tools/pwkit-cli-toplevel/#pwkit.cli.unicode_stdio). It is *very* challenging to write Python code that is truly correct in these matters, though.)

The non-purist solution is to convert from bytes to `str` — i.e., a no-op on Python 2, and a decoding operation in Python 3. There is *probably* a nice built-in or `six` way to do this, but one approach is to use a short helper function:

In [19]:
import six

def bytes_to_str(b):
    """Convert bytes to the str type: a noop on Python 2, and a decode operation on Python 3.
    
    We hardcode the use of the UTF-8 codec. Depending on the context, it might be better
    to use ``sys.getdefaultencoding()``, but sometimes Python chooses ASCII as the default
    encoding.
    
    """
    if six.PY2: # `six.PY2` is True if we are running on Python 2
        return b
    return b.decode('utf8')

assert bytes_to_str(b'123') == '123'
print(bytes_to_str(b'123')) # will reliably give '123'

123


When writing binary data files, one often wants the inverse:

In [20]:
import six

def str_to_bytes(s):
    "Convert str to the bytes type: a noop on Python 2, and an encode operation on Python 3."
    
    if six.PY2:
        return s
    return s.encode('utf8')

assert str_to_bytes('123') == b'123'

More rarely, you'll have the symmetric problem: a function expects `str`, or auto-stringifies a value, and you have Unicode, leading to potential problems on Python 2. This occurs more rarely because (1) most of us are porting code from 2 → 2/3, not 3 → 2/3, and (2) Python 2 has various hacks to auto-encode Unicode into bytes when needed. However, if this problem does come up, functions analogous to those shown above can help.
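
Such a function might look like the following sketch. The name `unicode_to_str` is hypothetical, UTF-8 is hardcoded as in the helpers above, and the version check uses only the standard library (it is the same test that `six.PY2` performs).

```python
import sys

PY2 = sys.version_info[0] == 2


def unicode_to_str(u):
    """Convert Unicode to the str type: an encode operation on Python 2,
    and a noop on Python 3."""
    if PY2:
        return u.encode('utf8')
    return u


native = unicode_to_str(u'123')
```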