A [rebuttal](https://learnpythonthehardway.org/book/nopython3.html) against Python 3 was recently written by the (in)famous Zed Shaw, prompting [many](https://eev.ee/blog/2016/11/23/a-rebuttal-for-python-3/) responses full of arguments and counterarguments.

One particular topic that caught my eye was the `bytearray` vs `unicodearray` debate. I'll try to explicitly avoid the `str`/`string`/`bytes`/`unicode` naming, as it is (IMHO) confusing, but that's a debate for another time.

If you pay attention to the debates above, you might see that there are roughly two camps:

 - `bytearray` and `unicodearray` are two different things, and we should _never_ convert from one to the other. (That's roughly the pro-Python-3 camp.)
 - `bytearray` and `unicodearray` are similar enough in most cases that we should do the magic for users. 
 
 
I'm greatly exaggerating here, and the following argues neither for one side nor for the other. I have my personal preference about what I think is good, but that's irrelevant for now.

Note that both sides argue that _their_ preference is better for beginners. 

Let's try to deconstruct both arguments. The Python 3 camp argues that bytes and text are different and should be treated as such. One of the examples goes something like this:

Python 2:

```
>>> print("it is summer !"[::-1])
! remmus si ti
```

Seems to work fine... until you enter non-ASCII characters:

```
>>> print("C'est l'été !"[::-1])
! ��t��'l tse'C
```


In the second case, throwing `.encode()` and `.decode()` at the problem at random won't really fix things. Thus this camp argues that you _need_ to teach about text vs bytes early.
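For comparison, here is a small sketch of the same reversal on Python 3, where text and bytes are separate types. The variable names are illustrative only, and reversing code points is not a complete answer either (combining characters can still end up misplaced); the sketch only shows that slicing text and slicing its encoded bytes are different operations.

```python
text = "C'est l'été !"

# Reversing text works on code points, so the accented letters survive.
reversed_text = text[::-1]
print(reversed_text)

# Reversing the UTF-8 bytes splits the two-byte sequence that encodes 'é'.
raw = text.encode("utf-8")
reversed_raw = raw[::-1]

try:
    reversed_raw.decode("utf-8")
    decodes_cleanly = True
except UnicodeDecodeError:
    # The reversed byte string is no longer valid UTF-8.
    decodes_cleanly = False

print(decodes_cleanly)
```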


The second camp argues that in most cases decoding/encoding automatically as ASCII makes perfect sense, as the bytearrays are already human-readable as is. Programming is already relatively hard to teach to a beginner, and a human brain can usually keep track of 5 +/- 2 things at the same time. Asking users to remember (and understand) when they need to encode/decode is tough, especially if the difference between the two results seems minimal.

As an example, who among you has already **visually** read bytes coming from the network without decoding them to ASCII?

In [11]:
import requests_cache

In [13]:
requests_cache.install_cache('tmp.db.tmp')

Yes, you did!

In [14]:
import requests
requests.get('http://swapi.co/api/people/1').content

b'{"name":"Luke Skywalker","height":"172","mass":"77","hair_color":"blond","skin_color":"fair","eye_color":"blue","birth_year":"19BBY","gender":"male","homeworld":"http://swapi.co/api/planets/1/","films":["http://swapi.co/api/films/6/","http://swapi.co/api/films/3/","http://swapi.co/api/films/2/","http://swapi.co/api/films/1/","http://swapi.co/api/films/7/"],"species":["http://swapi.co/api/species/1/"],"vehicles":["http://swapi.co/api/vehicles/14/","http://swapi.co/api/vehicles/30/"],"starships":["http://swapi.co/api/starships/12/","http://swapi.co/api/starships/22/"],"created":"2014-12-09T13:50:51.644000Z","edited":"2014-12-20T21:17:56.891000Z","url":"http://swapi.co/api/people/1/"}'

In [15]:
requests.get('http://swapi.co/api/people/1').content.decode()

'{"name":"Luke Skywalker","height":"172","mass":"77","hair_color":"blond","skin_color":"fair","eye_color":"blue","birth_year":"19BBY","gender":"male","homeworld":"http://swapi.co/api/planets/1/","films":["http://swapi.co/api/films/6/","http://swapi.co/api/films/3/","http://swapi.co/api/films/2/","http://swapi.co/api/films/1/","http://swapi.co/api/films/7/"],"species":["http://swapi.co/api/species/1/"],"vehicles":["http://swapi.co/api/vehicles/14/","http://swapi.co/api/vehicles/30/"],"starships":["http://swapi.co/api/starships/12/","http://swapi.co/api/starships/22/"],"created":"2014-12-09T13:50:51.644000Z","edited":"2014-12-20T21:17:56.891000Z","url":"http://swapi.co/api/people/1/"}'

Go explain to a beginner that the two outputs above are different things, when that beginner might already be struggling with 0- (or 1-) based indexing and with the syntax of the language.
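To make the difference concrete without hitting the network, here is a minimal Python 3 sketch. The `payload` value is a shortened, hard-coded stand-in for the response above, and all names are illustrative.

```python
import json

# A shortened, hard-coded stand-in for the network response shown above.
payload = b'{"name":"Luke Skywalker","height":"172"}'
text = payload.decode("utf-8")

# On screen, only the leading b distinguishes the two reprs...
looks_alike = repr(payload)[1:] == repr(text)

# ...but for the computer they are different objects.
are_equal = payload == text      # bytes never compare equal to text
first_of_bytes = payload[0]      # indexing bytes gives an integer
first_of_text = text[0]          # indexing text gives a one-character string

record = json.loads(text)
print(looks_alike, are_equal, first_of_bytes, first_of_text, record["name"])
```

The two values print almost identically, yet comparison and indexing behave differently, which is exactly the kind of distinction a beginner has to keep in mind.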

## The system is not consistent

Many argue that Python 2 is inconsistent, as the implicit conversion makes weird things possible:

Python 2:

```
print("it's summer".encode().encode())
it's summer
```

Is `"it's summer"` bytes or unicode? Is `"it's summer".encode()` bytes or unicode? Well, you can't tell, because the question does not make sense. Hence the Python 3 change, which was a deliberate decision. It is neither good nor bad. It traded one set of functionalities (automatic conversion of bytes <-> text) for more invariants: I can't encode text twice.
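As a rough illustration of those invariants, the sketch below records what Python 3 does with the double `.encode()` from above and with a mixed concatenation. The `attempts` dictionary is only there to collect the results.

```python
attempts = {}

# Encoding text twice: on Python 3, bytes have no .encode() method.
try:
    "it's summer".encode().encode()
    attempts["encode twice"] = "ok"
except AttributeError as exc:
    attempts["encode twice"] = type(exc).__name__

# Mixing the two types is refused instead of being converted silently.
try:
    b"it's " + "summer"
    attempts["concatenate"] = "ok"
except TypeError as exc:
    attempts["concatenate"] = type(exc).__name__

print(attempts)
```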

The remaining problem is that `text` and `bytes` might now be very different for the _computer_, but they still look very similar to the user. And while computers are good at detecting a single little `b''` prefix, humans are not. The human brain _does_ group things that look similar, and it is _extremely hard_ to abstract away things you already know.

Can you, like a computer, play chess against yourself without your strategy as White affecting your strategy as Black?

Can you read `ЯUSSIAИ` without seeing a reversed R and a reversed N?

A computer will have no difficulties. 

Maybe we need to help the human by changing the representation of `bytearray`s?

### Changing the repr

First, remember that _there are good reasons_ for `bytearray`s to be represented by decoding them to something ASCII-like (non-ASCII bytes get hex-escaped). It is tremendously useful when working on network interfaces, or when trying to figure out what is in a binary blob, as there are _often_ pieces in ASCII. The type of a file starting with the hex bytes `89 50 4e 47 0d` is not obvious. But once you see that this corresponds in ASCII to `?PNG`, well, you know it's an image. This is not what we are trying to tackle.
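A minimal sketch of that use case: the eight-byte signature below is the one visible at the start of the PNG dump later in this post, and `looks_like_png` is a hypothetical helper written for this illustration, not part of any library.

```python
# The bytes quoted above, completed to the full eight-byte PNG signature.
PNG_SIGNATURE = bytes.fromhex("89504e470d0a1a0a")


def looks_like_png(data):
    """Return True if `data` starts with the PNG signature (hypothetical helper)."""
    return data.startswith(PNG_SIGNATURE)


# The default repr exposes the ASCII letters hidden in the hex bytes.
print(PNG_SIGNATURE)
print(looks_like_png(PNG_SIGNATURE + b"rest of the file"))
print(looks_like_png(b"GIF89a"))
```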

Regardless of the previous statement, what would happen if we were to change the `repr` of `bytearray`s to better represent what they are? IPython makes it easy to change an object's repr:

In [59]:
ip = get_ipython()
from binascii import hexlify
text_formatter = ip.display_formatter.formatters['text/plain']

In [60]:
def _print_bytestr(arg, p, cycle):
    p.text('BytesBytesBytesBytesBytes')        
text_formatter.for_type(bytes, _print_bytestr)

<function __main__._print_bytestr>

In [61]:
b"hello from bytes"

BytesBytesBytesBytesBytes

OK, not really useful, but originally bytes are zeros and ones:

In [64]:
def _print_bytestr(arg, p, cycle):
    p.text('<bytes  "'+bin(int(hexlify(arg).decode(), 16))+'" at {}>'.format(hex(id(arg))))        
text_formatter.for_type(bytes, _print_bytestr)

b'hello zeros and ones !'

<bytes  "0b1101000011001010110110001101100011011110010000001111010011001010111001001101111011100110010000001100001011011100110010000100000011011110110111001100101011100110010000000100001" at 0x10734c8f0>

A bit long, let's elide:

In [116]:
def ellide(s):
    if len(s) < 35 :
        return s
    else:
        return s[0:18]+'...'+s[-8:]

In [70]:
def _print_bytestr(arg, p, cycle):
    p.text('<bytes  "'+ellide(bin(int(hexlify(arg).decode(), 16)))+'" at {}>'.format(hex(id(arg))))        
text_formatter.for_type(bytes, _print_bytestr)

b'hello zeros and ones !'

<bytes  "0b1101000011001010...00100001" at 0x1073195a8>

Now you will definitely remember when an object is bytes!

In [73]:
requests.get('http://swapi.co/api/people/1').content

<bytes  "0b1111011001000100...01111101" at 0x107f08920>

Zeros and ones might be the most basic representation, but hex notation gives you more information in the same space:

In [76]:
def _print_bytestr(arg, p, cycle):
    p.text('<bytes  "'+ellide(hexlify(arg).decode())+'" at {}>'.format(hex(id(arg))))        
text_formatter.for_type(bytes, _print_bytestr)

<function __main__._print_bytestr>

In [75]:
requests.get('http://swapi.co/api/people/1').content

<bytes  "7b226e616d65223a22...312f227d" at 0x107e2eec0>

The point being that if you make `bytearray`s and `unicodearray`s different for the computer, you might (should?) at least make them visually different when displayed.

The above examples might not be useful, as many people don't fluently read binary or hex, but they are visually different.
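One caveat with the `bin(int(...))` approach is that it drops leading zero bits and raises a `ValueError` on empty bytes. A possible alternative, sketched here with the hypothetical helpers `to_bits` and `to_hex`, formats each byte separately so that every byte always shows as eight binary digits or two hex digits.

```python
from binascii import hexlify


def to_bits(data):
    """Eight binary digits per byte, so leading zeros are kept."""
    return ' '.join(format(byte, '08b') for byte in data)


def to_hex(data):
    """Two hex digits per byte; same digits as hexlify(data).decode()."""
    return data.hex()


sample = b'hi!'
print(to_bits(sample))
print(to_hex(sample))
print(to_hex(sample) == hexlify(sample).decode())
```

Either helper could be passed through `ellide` and plugged into the formatter functions above in place of the `bin(...)` or `hexlify(...)` expressions.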

## Other possible reprs

There is then a compromise to find between something useful and something visually distinctive.
While the `b'...'` repr is nice, it needs to go through some shenanigans for non-renderable and non-ASCII bytes:

In [84]:
print('\x00hi there !\n ünicoÐe !'.encode())

b'\x00hi there !\n \xc3\xbcnico\xc3\x90e !'


Why not make a full hex dump with the ASCII-like representation on the side?

In [85]:
import codecs

In [86]:
hexl = codecs.getencoder('hex')

In [87]:
def taken(it, n):
    l =[]
    try:
        for i in range(n):
            l.append(next(it))
    except StopIteration:
        pass
    if len(l) == 0:
        return []
    return l+[' '*len(l[0])]*(n-len(l))

In [88]:
def whitespace_filter(char):
    # show control characters, DEL (0x7f) and undecodable bytes (�) as '.'
    if (char < '\x20') or (char in {'�', '\x7f'}):
        return '.'
    else:
        return char

In [89]:
def hexrepr(ba):
    N = 8  # number of bytes shown per row
    hrep = hexl(ba)[0].decode('ascii')  # two hex digits per byte
    asrep = ba.decode('ascii', "replace") # replace by � what is non-ascii
    lt = len(ba)

    # iterate over the hex digits two by two, i.e. one item per byte
    it = iter([hrep[i:i+2] for i in range(0, len(hrep), 2)])

    asit = iter(map(whitespace_filter, asrep))
    ret = 'Bytes array of {} bytes:\n\n'.format(lt)
    ret = ret + 'hex: | 01 23 45 56 89 AB CD EF | ASCII  |\n'
    ret = ret + '-----|'+'-'*(3*N)+' |'+'-'*N+'|\n'
    # round up, so that no empty row is printed when lt is a multiple of N
    for i in range(0, (lt + N - 1)//N):
        ret = ret+'%03dx | '% i +' '.join(taken(it, N))+ " |"+''.join(taken(asit, N)) + '|\n'
    return ret
    
    
    

In [90]:
def _print_bytestr(arg, p, cycle):
    p.text(hexrepr(arg))        
text_formatter.for_type(bytes, _print_bytestr)

<function __main__._print_bytestr>

In [94]:
'The quick brown fox jumped over the lazy dogs, in ünicoÐe !'.encode()

Bytes array of 61 bytes:

hex: | 01 23 45 56 89 AB CD EF | ASCII  |
-----|------------------------ |--------|
000x | 54 68 65 20 71 75 69 63 |The quic|
001x | 6b 20 62 72 6f 77 6e 20 |k brown |
002x | 66 6f 78 20 6a 75 6d 70 |fox jump|
003x | 65 64 20 6f 76 65 72 20 |ed over |
004x | 74 68 65 20 6c 61 7a 79 |the lazy|
005x | 20 64 6f 67 73 2c 20 69 | dogs, i|
006x | 6e 20 c3 bc 6e 69 63 6f |n ..nico|
007x | c3 90 65 20 21          |..e !   |
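For readers who want to try this outside IPython, here is a more compact sketch of the same idea that slices the bytes directly instead of juggling iterators. The function name `hexdump` and its `width` parameter are assumptions made for this example; the header lines are left out for brevity.

```python
def hexdump(data, width=8):
    """Return a hex + ASCII dump of `data`, `width` bytes per row."""
    rows = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = ' '.join('{:02x}'.format(b) for b in chunk)
        # printable ASCII is 0x20..0x7e; everything else is shown as '.'
        ascii_part = ''.join(chr(b) if 0x20 <= b < 0x7f else '.' for b in chunk)
        # pad the last row so that the column separators stay aligned
        rows.append('{:03d}x | {} |{}|'.format(
            offset // width, hex_part.ljust(3 * width - 1), ascii_part.ljust(width)))
    return '\n'.join(rows)


print(hexdump('in ünicoÐe !'.encode()))
```

Since it returns a plain string, it could be registered with `text_formatter.for_type` in the same way as `hexrepr`.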


While it _seems_ to make little sense for text, it does make more sense for binary formats, like images:

In [99]:
with open('1pxwhiteatthebottom.png', 'rb') as f:
    data = f.read()
    
data[:50]

Bytes array of 50 bytes:

hex: | 01 23 45 56 89 AB CD EF | ASCII  |
-----|------------------------ |--------|
000x | 89 50 4e 47 0d 0a 1a 0a |.PNG....|
001x | 00 00 00 0d 49 48 44 52 |....IHDR|
002x | 00 00 01 b0 00 00 01 20 |....... |
003x | 08 06 00 00 00 d5 28 26 |......(&|
004x | 69 00 00 00 04 73 42 49 |i....sBI|
005x | 54 08 08 08 08 7c 08 64 |T....|.d|
006x | 88 00                   |..      |


But remember that the bytes behind a text make little sense without an encoding, so technically the first one makes little sense as well. It just _happens_ to display correctly on the screen after printing it.
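A short sketch of why the encoding matters: the same five bytes give two different texts depending on the codec used to decode them. UTF-8 and Latin-1 are picked here only as an example pair.

```python
raw = "été".encode("utf-8")

as_utf8 = raw.decode("utf-8")      # the text we started from
as_latin1 = raw.decode("latin-1")  # same bytes, different text

print(len(raw), len(as_utf8), len(as_latin1))
print(as_utf8)
print(as_latin1)
```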

In [125]:
def _print_bytestr(arg, p, cycle):
    p.text('<bytes '+ellide(repr(arg))+' at {}>'.format(hex(id(arg))))        
text_formatter.for_type(bytes, _print_bytestr)

<function __main__._print_bytestr>

In [129]:
'This is short and make it easy to see these are bytes !'.encode()

<bytes b'This is short an...bytes !' at 0x107c180e0>