# Differences between the bytes and str types

Python has two types that represent sequences of character data: bytes and str. A bytes instance contains raw, unmodified 8-bit values (often displayed using the ASCII encoding).

In [1]:
a = b'w\x69taj'
print(list(a))
print(a)

[119, 105, 116, 97, 106]
b'witaj'
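
Because a bytes instance is a sequence of integers, indexing and slicing behave differently from what the printed form suggests. The short sketch below illustrates this; the variable name `data` is only illustrative.

```python
data = b'w\x69taj'

# Indexing a bytes object returns an int in the range 0-255.
print(data[0])

# Slicing returns a new bytes object, even for a single element.
print(data[0:1])

# A bytes object can be rebuilt from a list of integers.
print(bytes([119, 105, 116, 97, 106]))
```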


A str instance contains so-called Unicode code points.

In [2]:
a = 'a\u0300 propos'
print(list(a))
print(a)

['a', '̀', ' ', 'p', 'r', 'o', 'p', 'o', 's']
à propos
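
To make the idea of a code point more concrete, the sketch below uses the standard `unicodedata` module. It shows that the accented letter above is stored as two code points and that NFC normalization composes them into one. The variable names are illustrative only.

```python
import unicodedata

decomposed = 'a\u0300'   # 'a' followed by COMBINING GRAVE ACCENT
composed = unicodedata.normalize('NFC', decomposed)

# One visible character, but two code points before normalization.
print(len(decomposed), [hex(ord(ch)) for ch in decomposed])

# After NFC normalization a single precomposed code point remains.
print(len(composed), [hex(ord(ch)) for ch in composed])
```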


The encode() method must be used to convert Unicode text (str) to binary data (bytes). The opposite conversion requires the decode() method.
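
A minimal round trip might look like the sketch below. It assumes UTF-8 as the encoding, which is a common choice but not the only one; the names `text` and `data` are illustrative.

```python
text = '\u00e0 propos'           # starts with the precomposed character à

data = text.encode('utf-8')      # str -> bytes
print(data)

restored = data.decode('utf-8')  # bytes -> str
print(restored == text)

# The non-ASCII character takes two bytes in UTF-8, so the lengths differ.
print(len(text), len(data))
```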

The first helper function takes a str or bytes argument and always returns a str.

In [3]:
def to_str(bytes_or_str):
    if isinstance(bytes_or_str, bytes):
        value = bytes_or_str.decode('utf-8')
    else:
        value = bytes_or_str
    return value

In [4]:
print(repr(to_str(b'foo')))
print(repr(to_str('bar')))

'foo'
'bar'


The next helper function takes a str or bytes argument and always returns bytes.

In [5]:
def to_bytes(bytes_or_str):
    if isinstance(bytes_or_str, str):
        value = bytes_or_str.encode('utf-8')
    else:
        value = bytes_or_str
    return value

In [6]:
print(repr(to_bytes(b'foo')))
print(repr(to_bytes('bar')))

b'foo'
b'bar'


Working with both types this way has two serious drawbacks. The first is that bytes and str instances seem to work in the same way, but they are not compatible with each other. Using the + operator, we can add bytes to bytes and str to str.

In [7]:
print(b'jeden'+ b'dwa')
print('jeden' + 'dwa')

b'jedendwa'
jedendwa


However, we cannot add a str instance to a bytes instance, or a bytes instance to a str instance.

In [8]:
print(b'jeden' + 'dwa')

TypeError: can't concat str to bytes

In [9]:
print('jeden' + b'dwa')

TypeError: can only concatenate str (not "bytes") to str
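
One way around these errors is to convert one operand explicitly before concatenating. The sketch below assumes UTF-8 and uses illustrative variable names.

```python
prefix = b'jeden'
suffix = 'dwa'

# Encode the str so that both operands are bytes.
as_bytes = prefix + suffix.encode('utf-8')
print(as_bytes)

# Or decode the bytes so that both operands are str.
as_text = prefix.decode('utf-8') + suffix
print(as_text)
```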

Using comparison operators, it is possible to compare bytes with bytes and str with str.

In [10]:
assert b'red' > b'blue'

In [11]:
assert 'red' > 'blue'

However, we cannot compare str with bytes, or bytes with str.

In [12]:
assert b'red' < 'blue'

TypeError: '<' not supported between instances of 'bytes' and 'str'

In [13]:
assert 'red' < b'blue'

TypeError: '<' not supported between instances of 'str' and 'bytes'

Comparing bytes and str instances for equality always evaluates to False, even if both values contain exactly the same characters.

In [14]:
print(b'foo' == 'foo')

False
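
To compare the content rather than the types, both values have to be brought to the same type first. The sketch below assumes the bytes hold UTF-8 encoded text; the names `raw` and `text` are illustrative.

```python
raw = b'foo'
text = 'foo'

# Different types never compare equal, whatever their content.
mixed_equal = (raw == text)

# After an explicit conversion the comparison is meaningful.
decoded_equal = (raw.decode('utf-8') == text)
encoded_equal = (raw == text.encode('utf-8'))

print(mixed_equal, decoded_equal, encoded_equal)
```

Running the interpreter with the `-b` option makes it issue a BytesWarning for such mixed comparisons, which can help to locate them.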


The % operator works with format strings of each type.

In [15]:
print(b'red %s' % b'blue')

b'red blue'


In [16]:
print('red %s' % 'blue')

red blue


However, we cannot pass a str instance to a bytes format string, because Python does not know which binary encoding to use.

In [17]:
print(b'red %s' % 'blue')

TypeError: %b requires a bytes-like object, or an object that implements __bytes__, not 'str'

We can, however, pass a bytes instance to a str format string using the % operator, although the result will be different than expected.

In [18]:
print('red %s' % b'blue')

red b'blue'


This code actually calls the __repr__() method of the bytes instance and substitutes the result for %s. That is why b'blue' appears in the output with its b prefix and quotation marks.
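
If the intended output is the plain text, the bytes value can be decoded before formatting. The sketch below assumes UTF-8 encoded content and uses illustrative names.

```python
value = b'blue'

# Formatting the bytes object directly embeds its repr.
unexpected = 'red %s' % value
print(unexpected)

# Decoding first yields the text that was probably intended.
expected = 'red %s' % value.decode('utf-8')
print(expected)
```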

The second problem is that file handle operations in Python default to requiring Unicode strings instead of raw bytes. The code below does not work.

In [19]:
with open('data.bin', 'w') as f:
    f.write(b'\xf1\xf2\xf3\xf4\xf5')

TypeError: write() argument must be str, not bytes

The exception is raised because the file was opened in text write mode ('w') instead of binary write mode ('wb'). When a file is opened in text mode, operations on the file handle expect str instances containing Unicode characters instead of bytes instances containing binary data.

In [20]:
with open('data.bin', 'wb') as f:
    f.write(b'\xf1\xf2\xf3\xf4\xf5')

The same problem occurs when reading data from a file.

In [21]:
with open('data.bin', 'r') as f:
    data = f.read()

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 0: invalid continuation byte

This operation fails because the file is opened in text read mode ('r') instead of binary read mode ('rb'). When a file handle is in text mode, it uses the system's default text encoding to interpret binary data, calling bytes.decode() when reading and str.encode() when writing.

In [22]:
with open('data.bin', 'rb') as f:
    data = f.read()

In [23]:
assert data == b'\xf1\xf2\xf3\xf4\xf5'
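
To check which encoding text mode is likely to fall back to on a given machine, the standard `locale` module can be queried, as in the sketch below. How open() picks its default varies between Python versions and settings (for example UTF-8 mode), so the printed value is a hint rather than a guarantee. The name `default_encoding` is illustrative.

```python
import locale

# Encoding that text-mode open() typically uses when no encoding= is given.
default_encoding = locale.getpreferredencoding(False)
print(default_encoding)
```

Passing the encoding parameter to open() explicitly avoids depending on this default.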

In [24]:
with open('data.bin', 'r', encoding='cp1252') as f:
    data = f.read()

In [25]:
assert data == 'ñòóôõ'

No UnicodeDecodeError is raised in this case, because cp1252 assigns a character to each of these byte values. How the file contents are interpreted as text therefore depends on the encoding applied to the raw bytes that are read.
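
The sketch below shows how the same bytes read back as different text under different encodings. It writes to a temporary directory instead of the working directory, and the choice of cp437 as the second encoding is only an assumption made for contrast.

```python
import os
import tempfile

raw = b'\xf1\xf2\xf3\xf4\xf5'
decoded = {}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'data.bin')

    # Binary mode stores the bytes unchanged.
    with open(path, 'wb') as f:
        f.write(raw)

    # Text mode decodes those bytes with whatever encoding is requested.
    for encoding in ('cp1252', 'cp437'):
        with open(path, 'r', encoding=encoding) as f:
            decoded[encoding] = f.read()

for encoding, text in decoded.items():
    print(encoding, repr(text))
```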