# The bytes class

In the previous tutorial the string class was examined and was seen to have a Unicode character as its fundamental unit. The bytes class, on the other hand, uses a byte as its fundamental unit. It was the foundation for text data in Python 2.

A computer stores data using a bit. A bit can be conceptualised as a single dip switch and has the values Off and On respectively as shown below. A switch being Off is typically denoted as 0 and On is typically denoted as 1.

<img src='./images/img_001.png' alt='img_001' width='400'/>

1 switch only gives 2**1 combinations ranging from 0:2 (inclusive of 0 and exclusive of 2) which is quite limited in application by itself.

It is therefore common to combine 8 switches together:

<img src='./images/img_002.png' alt='img_002' width='400'/>

This gives 2**8 combinations ranging from 0:256 (inclusive of 0 and exclusive of 256).
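As a rough sketch of the counting argument above, every pattern of 8 switches can be listed using the builtin `format` function. The variable name `patterns` is chosen here purely for illustration:

```python
# Every pattern of 8 switches, written as an 8 character string of 0s and 1s.
patterns = [format(number, '08b') for number in range(2 ** 8)]

print(len(patterns))   # number of distinct patterns
print(patterns[0])     # all switches Off
print(patterns[-1])    # all switches On
```

The first pattern corresponds to the integer 0 and the last to the integer 255.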

A bytes string is essentially a collection of these bytes:

<img src='./images/img_003.png' alt='img_003' width='400'/>

## ASCII Characters

The American Standard Code for Information Interchange (ASCII) maps each of the first 128 byte values to a physical command or an English character. Recall that the physical commands were used to control primitive computers that were essentially typewriter based:

<img src='./images/img_004.png' alt='img_004' width='600'/>

|byte|hex|num|command|
|---|---|---|---|
|00000000|00|000|null|
|00000001|01|001|start of heading|
|00000010|02|002|start of text|
|00000011|03|003|end of text|
|00000100|04|004|end of transmission|
|00000101|05|005|enquiry|
|00000110|06|006|acknowledge|
|00000111|07|007|bell|
|00001000|08|008|**backspace**|
|00001001|09|009|**horizontal tab**|
|00001010|0a|010|**new line**|
|00001011|0b|011|**vertical tab**|
|00001100|0c|012|**form feed**|
|00001101|0d|013|**carriage return**|
|00001110|0e|014|shift out|
|00001111|0f|015|shift in|
|00010000|10|016|data link escape|
|00010001|11|017|device control 1|
|00010010|12|018|device control 2|
|00010011|13|019|device control 3|
|00010100|14|020|device control 4|
|00010101|15|021|negative acknowledge|
|00010110|16|022|synchronous idle|
|00010111|17|023|end of transmission block|
|00011000|18|024|cancel|
|00011001|19|025|end of medium|
|00011010|1a|026|substitute|
|00011011|1b|027|**escape**|
|00011100|1c|028|file separator|
|00011101|1d|029|group separator|
|00011110|1e|030|record separator|
|00011111|1f|031|unit separator|
|00100000|20|032|**space**|

The remaining values, spanning up to half of the 256 possible values of a byte, map to the characters most commonly used in the English language:

|byte|hex|num|character|
|---|---|---|---|
|00100001|21|033|!|
|00100010|22|034|"|
|00100011|23|035|#|
|00100100|24|036|$|
|00100101|25|037|%|
|00100110|26|038|&|
|00100111|27|039|'|
|00101000|28|040|(|
|00101001|29|041|)|
|00101010|2a|042|*|
|00101011|2b|043|+|
|00101100|2c|044|,|
|00101101|2d|045|-|
|00101110|2e|046|.|
|00101111|2f|047|/|
|00110000|30|048|0|
|00110001|31|049|1|
|00110010|32|050|2|
|00110011|33|051|3|
|00110100|34|052|4|
|00110101|35|053|5|
|00110110|36|054|6|
|00110111|37|055|7|
|00111000|38|056|8|
|00111001|39|057|9|
|00111010|3a|058|:|
|00111011|3b|059|;|
|00111100|3c|060|<|
|00111101|3d|061|=|
|00111110|3e|062|>|
|00111111|3f|063|?|
|01000000|40|064|@|
|01000001|41|065|A|
|01000010|42|066|B|
|01000011|43|067|C|
|01000100|44|068|D|
|01000101|45|069|E|
|01000110|46|070|F|
|01000111|47|071|G|
|01001000|48|072|H|
|01001001|49|073|I|
|01001010|4a|074|J|
|01001011|4b|075|K|
|01001100|4c|076|L|
|01001101|4d|077|M|
|01001110|4e|078|N|
|01001111|4f|079|O|
|01010000|50|080|P|
|01010001|51|081|Q|
|01010010|52|082|R|
|01010011|53|083|S|
|01010100|54|084|T|
|01010101|55|085|U|
|01010110|56|086|V|
|01010111|57|087|W|
|01011000|58|088|X|
|01011001|59|089|Y|
|01011010|5a|090|Z|
|01011011|5b|091|[|
|01011100|5c|092|\|
|01011101|5d|093|]|
|01011110|5e|094|^|
|01011111|5f|095|_|
|01100000|60|096|`|
|01100001|61|097|a|
|01100010|62|098|b|
|01100011|63|099|c|
|01100100|64|100|d|
|01100101|65|101|e|
|01100110|66|102|f|
|01100111|67|103|g|
|01101000|68|104|h|
|01101001|69|105|i|
|01101010|6a|106|j|
|01101011|6b|107|k|
|01101100|6c|108|l|
|01101101|6d|109|m|
|01101110|6e|110|n|
|01101111|6f|111|o|
|01110000|70|112|p|
|01110001|71|113|q|
|01110010|72|114|r|
|01110011|73|115|s|
|01110100|74|116|t|
|01110101|75|117|u|
|01110110|76|118|v|
|01110111|77|119|w|
|01111000|78|120|x|
|01111001|79|121|y|
|01111010|7a|122|z|
|01111011|7b|123|{|
|01111100|7c|124|\||
|01111101|7d|125|}|
|01111110|7e|126|~|
|01111111|7f|127|**delete**|


The bytes string below is an example of the ASCII word 'hello':

In [84]:
bytes((0b01101000, 0b01100101, 0b01101100, 0b01101100, 0b01101111))

b'hello'

<img src='./images/img_003.png' alt='img_003' width='400'/>
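The columns of the ASCII tables above can be reproduced from an integer using format specifiers. The following is a minimal sketch; the selection of integers and the name `rows` are illustrative only:

```python
# Build the byte, hex, num and character columns for a few printable ASCII codes.
rows = []
for number in (33, 65, 97, 126):
    row = f'|{number:08b}|{number:02x}|{number:03d}|{chr(number)}|'
    rows.append(row)
    print(row)
```

Each printed row should match the corresponding row in the table.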

## Initialisation Signature

The initialisation signature can be viewed by inputting:

In [1]:
# bytes()

Or output to a cell using:

In [2]:
? bytes

Init signature: bytes(self, /, *args, **kwargs)
Docstring:
bytes(iterable_of_ints) -> bytes
bytes(string, encoding[, errors]) -> bytes
bytes(bytes_or_buffer) -> immutable copy of bytes_or_buffer
bytes(int) -> bytes object of size given by the parameter initialized with null bytes
bytes() -> empty bytes object

Construct an immutable array of bytes from:
  - an iterable yielding integers in range(256)
  - a text string encoded using the specified encoding
  - any object implementing the buffer API.
  - an integer
Type:           type
Subclasses:

The init signature shows five different ways of initialising a bytes instance:

In [3]:
# bytes(self, /, *args, **kwargs)
# bytes(iterable_of_ints) -> bytes
# bytes(string, encoding[, errors]) -> bytes
# bytes(bytes_or_buffer) -> immutable copy of bytes_or_buffer
# bytes(int) -> bytes object of size given by the parameter initialized with null bytes
# bytes() -> empty bytes object

As bytes is a fundamental datatype, the first way is instantiation of a bytes instance from an existing bytes instance, which is analogous to its counterpart in the string class.

The second way is by using an iterable of integers such as a tuple. Each integer must be a valid value for a byte. Recall that a byte has 8 switches and each switch has 2 states:

<img src='./images/img_002.png' alt='img_002' width='400'/>

Therefore there are:

In [4]:
2 ** 8

256

combinations. These span the range 0:256, inclusive of the lower bound 0 and exclusive of the upper bound 256, as zero-based indexing is utilised.

So for example:

In [5]:
integers = (104, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100, 33)
bytes(integers)

b'hello world!'

If the first number is examined recall that the decimal integer 104 in binary is:

In [6]:
'0b' + bin(104).removeprefix('0b').zfill(8)

'0b01101000'

And is an ASCII character:

In [7]:
chr(104)

'h'

And this bytes instance at the fundamental level consists of 12 bytes as shown:

In [8]:
for number in integers:
    print(str(number).center(8), end=' ')

print()
for number in integers:
    print(bin(number).removeprefix('0b').zfill(8), end=' ')

print()
for number in integers:
    print(chr(number).center(8), end=' ')

  104      101      108      108      111       32      119      111      114      108      100       33    
01101000 01100101 01101100 01101100 01101111 00100000 01110111 01101111 01110010 01101100 01100100 00100001 
   h        e        l        l        o                 w        o        r        l        d        !     

A tuple of binary integers can also be used to initialise a bytes instance:

In [87]:
bytes((0b01101000, 0b01100101, 0b01101100, 0b01101100, 0b01101111, 0b00100000, 0b01110111, 0b01101111, 0b01110010, 0b01101100, 0b01100100, 0b00100001))

b'hello world!'

If this tuple is examined directly, all the numbers are displayed in decimal:

In [93]:
(0b01101000, 0b01100101, 0b01101100, 0b01101100, 0b01101111, 0b00100000, 0b01110111, 0b01101111, 0b01110010, 0b01101100, 0b01100100, 0b00100001)

(104, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100, 33)

When every byte is a printable ASCII character it will be displayed instead of the byte sequence:

In [9]:
bytes(integers)

b'hello world!'

When the byte sequence contains whitespace characters:

In [10]:
integers = (9, 10, 11, 12, 13)
bytes(integers)

b'\t\n\x0b\x0c\r'

The ASCII escape characters for the tab, new line and carriage return display as \t, \n and \r. The vertical tab and form feed are instead displayed using a hexadecimal escape character. Recall that a byte has 8 bits and that each group of 4 bits can be represented as a hexadecimal character. In other words the byte is split into 2 halves and each half is represented by a hexadecimal character:

<img src='./images/img_005.png' alt='img_005' width='400'/>

For example the decimal number 12 in binary and hexadecimal is:

In [11]:
bin(12)

'0b1100'

In [12]:
hex(12)

'0xc'

This is displayed without the leading zeros, so the first half of the byte is not shown. These can be added for clarity:

In [89]:
'0b' + bin(12).removeprefix('0b').zfill(8)

'0b00001100'

In [88]:
'0x' + hex(12).removeprefix('0x').zfill(2)

'0x0c'

Stripping the prefix and zero-filling to a byte (8 bits or 2 hexadecimal characters):

In [13]:
bin(12).removeprefix('0b').zfill(8)

'00001100'

In [14]:
hex(12).removeprefix('0x').zfill(2)

'0c'

And:

In [15]:
print(bin(12).removeprefix('0b').zfill(8)[:4], bin(12).removeprefix('0b').zfill(8)[4:])
print(hex(12).removeprefix('0x').zfill(2)[:1].center(4), hex(12).removeprefix('0x').zfill(2)[1:].center(4))

0000 1100
 0    c  


The 16 possible combinations for the 4 bit sequence each map to the following hexadecimal characters:

|4 bit binary|hexadecimal character|decimal value|
|---|---|---|
|0000|0|0|
|0001|1|1|
|0010|2|2|
|0011|3|3|
|0100|4|4|
|0101|5|5|
|0110|6|6|
|0111|7|7|
|1000|8|8|
|1001|9|9|
|1010|a|10|
|1011|b|11|
|1100|c|12|
|1101|d|13|
|1110|e|14|
|1111|f|15|

Non-printable ASCII characters and byte values in the range 128:256, which are outwith the ASCII range, are also displayed using the hexadecimal escape character:

In [16]:
integers = (1, 2, 3)
bytes(integers)

b'\x01\x02\x03'

In [17]:
integers = (128, 129, 130)
bytes(integers)

b'\x80\x81\x82'
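The display rules above can be compared side by side with a small sketch. The four integers are chosen here as one example each of a printable character, a whitespace character, a control character and a value outwith the ASCII range:

```python
# Formal representation of a single byte for each kind of value.
representations = {}
for number in (104, 9, 1, 128):
    representations[number] = repr(bytes([number]))
    print(str(number).rjust(3), representations[number])
```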

And for clarity:

In [18]:
integers = (104, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100, 33)
bytes(integers)

for number in integers:
    print(str(number).center(8), end=' ')

print()
for number in integers:
    print(bin(number).removeprefix('0b').zfill(8), end=' ')

print()
for number in integers:
    print((r'\x' + hex(number).removeprefix('0x')).center(8), end=' ')

print()
for number in integers:
    print(chr(number).center(8), end=' ')
    

  104      101      108      108      111       32      119      111      114      108      100       33    
01101000 01100101 01101100 01101100 01101111 00100000 01110111 01101111 01110010 01101100 01100100 00100001 
  \x68     \x65     \x6c     \x6c     \x6f     \x20     \x77     \x6f     \x72     \x6c     \x64     \x21   
   h        e        l        l        o                 w        o        r        l        d        !     

A tuple of hexadecimal integers can also be used to initialise a bytes instance:

In [92]:
bytes((0x68, 0x65, 0x6c, 0x6c, 0x6f, 0x20, 0x77, 0x6f, 0x72, 0x6c, 0x64, 0x21))

b'hello world!'

If this tuple is examined directly, all the integers are displayed in decimal:

In [94]:
(0x68, 0x65, 0x6c, 0x6c, 0x6f, 0x20, 0x77, 0x6f, 0x72, 0x6c, 0x64, 0x21)

(104, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100, 33)

Because a byte is a configuration of 8 switches it has:

In [96]:
2 ** 8

256

combinations. These are from 0:256 (inclusive of the lower bound 0 and exclusive of the upper bound 256 as Python uses zero-based indexing). The maximum value when all switches are on is therefore 255:

<img src='./images/img_006.png' alt='img_006' width='400'/>

Integers that exceed 255 do not map to a byte and a ValueError is raised if an attempt is made to use them:

In [19]:
# bytes((256, ))

<span style='color:red'>ValueError</span>: bytes must be in range(0, 256)
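If the integers come from an untrusted source, the ValueError can be handled. The following is a minimal sketch; the function name `to_bytes_or_none` is hypothetical and not part of the bytes class:

```python
def to_bytes_or_none(numbers):
    """Return bytes(numbers), or None when a value is outside range(256)."""
    try:
        return bytes(numbers)
    except ValueError:
        return None


print(to_bytes_or_none((255, )))
print(to_bytes_or_none((256, )))
print(to_bytes_or_none((-1, )))
```

Negative integers are rejected in the same way as integers that exceed 255.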

A byte string can also be created from a string of the equivalent hexadecimal escape characters. Notice that the encoding also needs to be specified and in this case is 'ASCII'. Encoding is essentially a translation table that maps between bytes and Unicode characters:

In [20]:
greeting = bytes('\x68\x65\x6c\x6c\x6f\x20\x77\x6f\x72\x6c\x64\x21', encoding='ASCII')
greeting

b'hello world!'

A bytes sequence can be created by casting a Unicode string to bytes. Once again the encoding is specified as 'ASCII':

In [21]:
greeting = bytes('hello world!', encoding='ASCII')
greeting

b'hello world!'
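The string method `encode` gives the same result as casting with the bytes class. The sketch below also shows what happens when a character outwith the ASCII range is encoded as ASCII; the variable names are illustrative:

```python
text = 'hello world!'

# str.encode is the method form of casting a Unicode string to bytes.
same = text.encode(encoding='ascii') == bytes(text, encoding='ascii')
print(same)

# A character outwith the ASCII range cannot be encoded as ASCII.
error_name = None
try:
    'α'.encode(encoding='ascii')
except UnicodeEncodeError as error:
    error_name = type(error).__name__

print(error_name)
```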

When each character spans a single byte, as in the case of ASCII, the encoding is assumed to be ASCII and the bytes instance can be instantiated using shorthand notation. A prefix b is added which distinguishes it from a Unicode string:

In [22]:
b'hello world!'

b'hello world!'

In [23]:
hex(128)

'0x80'

## Encoding and Decoding

When a string was instantiated, an encoding translation table was selected; that is a table that maps the bytes sequence to a specific character. 

The most basic one is 'ASCII' where each character is encoded over 1 byte using only half of the possible values. ASCII is restricted to a small subset of English characters:

In [24]:
a_ascii = b'a'
a_ascii

b'a'

The current standard is the 8-bit Unicode Transformation Format 'UTF-8', which is adaptable and can use:
* 1 byte (8 bits) for an ASCII character
* 2 bytes (16 bits) for the more popular Unicode characters (e.g. from European alphabets such as Greek)
* 3 bytes (24 bits) for most other commonly used Unicode characters (e.g. many mathematical and unit symbols such as ㎛)
* 4 bytes (32 bits) for extended Unicode characters

In [25]:
a_utf8 = bytes('a', encoding='utf-8')
a_utf8

b'a'

In [26]:
alpha_utf8 = bytes('α', encoding='utf-8')
alpha_utf8

b'\xce\xb1'

In [27]:
um_utf8 = bytes('㎛', encoding='utf-8')
um_utf8

b'\xe3\x8e\x9b'
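The number of bytes UTF-8 uses for each character can be checked with a short loop. This is only a sketch; the emoji is an additional example, not taken from the cells above, of a character that needs 4 bytes:

```python
# Number of bytes used by UTF-8 for characters of increasing code point.
lengths = {}
for character in ('a', 'α', '㎛', '😀'):
    encoded = character.encode('utf-8')
    lengths[character] = len(encoded)
    print(character, len(encoded), encoded.hex())
```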

'UTF-16-LE' and 'UTF-16-BE' are alternative encodings where each of the more common characters occupies 2 bytes. Notice that an additional byte \\x00 is added at the back and front respectively; a zero byte needs to be added to each ASCII character so it spans 2 bytes:

In [28]:
a_le = bytes('a', encoding='utf-16-le')
a_le

b'a\x00'

In [29]:
a_be = bytes('a', encoding='utf-16-be')
a_be

b'\x00a'

In [30]:
alpha_le = bytes('α', encoding='utf-16-le')
alpha_le

b'\xb1\x03'

In [31]:
alpha_be = bytes('α', encoding='utf-16-be')
alpha_be

b'\x03\xb1'

If the correct encoding scheme (translation table) is used for the character, it will be decoded correctly:

In [32]:
a_utf8.decode(encoding='utf-8')

'a'

In [33]:
a_le.decode(encoding='utf-16-le')

'a'

If the wrong one is selected it can result in a UnicodeDecodeError:

In [34]:
# a_utf8.decode(encoding='utf-16-be')

<span style='color:red'>UnicodeDecodeError</span>: 'utf-16-be' codec can't decode byte 0x61 in position 0: truncated data

However it can also result in additional characters or in completely different characters being substituted:

In [35]:
a_le.decode(encoding='utf-8')

'a\x00'

In [36]:
alpha_le.decode(encoding='utf-16-be')

'넃'
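The optional `errors` parameter listed in the docstring controls what happens when bytes cannot be decoded. The following sketch uses a deliberately truncated byte sequence; the names `truncated`, `replaced` and `ignored` are illustrative:

```python
alpha_utf8 = bytes('α', encoding='utf-8')   # 2 bytes
truncated = alpha_utf8[:1]                   # only the first of the 2 bytes

# The default behaviour ('strict') raises an error.
try:
    truncated.decode(encoding='utf-8')
except UnicodeDecodeError as error:
    print(type(error).__name__)

# 'replace' substitutes the Unicode replacement character U+FFFD.
replaced = truncated.decode(encoding='utf-8', errors='replace')
# 'ignore' silently drops the bytes that cannot be decoded.
ignored = truncated.decode(encoding='utf-8', errors='ignore')

print(repr(replaced), repr(ignored))
```

Both alternatives lose information, so they are best reserved for cases where a partial result is preferable to an error.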

Unicode strings are more reliable than byte strings as they do not have the encoding problems. They should therefore be used preferentially in Python code where possible. 

Bytes are however commonly used when interfacing with basic hardware. If the bytes sequence is examined in binary, it is a series of 0 and 1 which is not very human readable:

In [37]:
for number in b'hello world!':
    print(bin(number).removeprefix('0b').zfill(8), end='')

011010000110010101101100011011000110111100100000011101110110111101110010011011000110010000100001

However in this form it is easy to transmit this information using a digital signal. A serial port for example has a signal pin which is configured to transmit or receive a digital time trace. A baud rate of 9600, for example, means 9600 bits are processed in a second. When the bit is 0, the voltage of the signal pin is LOW and when the bit is 1 the voltage of the signal pin is HIGH.
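The conversion between a bytes instance and a stream of bits can be sketched as follows. This is a simplification which assumes the bits are sent back to back; a real serial link typically adds framing such as start and stop bits, which are not modelled here:

```python
message = b'hello world!'

# Flatten the message into individual bits, most significant bit first.
bits = [int(bit) for number in message for bit in format(number, '08b')]

baud_rate = 9600  # bits per second, as in the example above
seconds = len(bits) / baud_rate
print(len(bits), seconds)

# Regroup the bits into groups of 8 to recover the original bytes.
recovered = bytes(
    int(''.join(str(bit) for bit in bits[i:i + 8]), 2)
    for i in range(0, len(bits), 8)
)
print(recovered)
```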

As humans it is easier to examine the byte sequence using hexadecimal and this can be done using the bytes method hex:

In [38]:
? bytes.hex

Docstring:
Create a string of hexadecimal numbers from a bytes object.

  sep
    An optional single character or byte to separate hex bytes.
  bytes_per_sep
    How many bytes between separators.  Positive values count from the
    right, negative values count from the left.

Example:
>>> value = b'\xb9\x01\xef'
>>> value.hex()
'b901ef'
>>> value.hex(':')
'b9:01:ef'
>>> value.hex(':', 2)
'b9:01ef'
>>> value.hex(':', -2)
'b901:ef'
Type:      method_descriptor

In [39]:
b'hello world!'.hex()

'68656c6c6f20776f726c6421'

In [40]:
for number in b'hello world!':
    print(bin(number).removeprefix('0b').zfill(8), end=' ')
    
print()

for number in b'hello world!':
    print(hex(number).removeprefix('0x').zfill(2).center(8), end=' ')

01101000 01100101 01101100 01101100 01101111 00100000 01110111 01101111 01110010 01101100 01100100 00100001 
   68       65       6c       6c       6f       20       77       6f       72       6c       64       21    
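The `sep` and `bytes_per_sep` parameters from the docstring give a similar grouped view without an explicit loop. A minimal sketch, assuming Python 3.8 or later where these parameters are available:

```python
greeting = b'hello world!'

# One space between every byte.
spaced = greeting.hex(' ')
print(spaced)

# One space between every group of 4 bytes.
grouped = greeting.hex(' ', 4)
print(grouped)
```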

It is recommended to decode a byte string to a Unicode string as early as possible in a Python program and to cast a Unicode string to a byte string as late as possible, just before transmitting it to basic hardware, in order to avoid encoding issues.
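This recommendation can be sketched as a function that only handles bytes at its boundaries. The function name `shout` and the processing it performs are hypothetical and chosen only for illustration:

```python
def shout(raw, encoding='utf-8'):
    """Decode on input, work with a Unicode string, encode on output."""
    text = raw.decode(encoding)      # decode as early as possible
    text = text.upper() + '!'        # all processing uses a Unicode string
    return text.encode(encoding)     # encode as late as possible


print(shout(b'hello world'))
print(shout(bytes('αβγ', encoding='utf-8')).decode('utf-8'))
```

Because the processing is carried out on a Unicode string, it behaves the same for the Greek characters as for the ASCII characters.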

The 16 bit (2 bytes) encoding can be examined in more detail as hex so it is human readable:

In [41]:
greek_be = bytes('αβγδε', encoding='UTF-16-BE').hex()
greek_be

'03b103b203b303b403b5'

In [42]:
greek_le = bytes('αβγδε', encoding='UTF-16-LE').hex()
greek_le

'b103b203b303b403b503'

In [43]:
greek = bytes('αβγδε', encoding='UTF-16').hex()
greek

'fffeb103b203b303b403b503'

And if groupings of 4 hexadecimal characters (2 bytes) are examined:

In [44]:
print('UTF-16-BE', end=':      ')

for i in range(0, len(greek_be), 4):
    print(greek_be[i:i+4], end=' ')

print()
print('UTF-16-LE', end=':      ')

for i in range(0, len(greek_le), 4):
    print(greek_le[i:i+4], end=' ')

print()
print('UTF-16', end=':    ')

for i in range(0, len(greek), 4):
    print(greek[i:i+4], end=' ')

UTF-16-BE:      03b1 03b2 03b3 03b4 03b5 
UTF-16-LE:      b103 b203 b303 b403 b503 
UTF-16:    fffe b103 b203 b303 b403 b503 

Notice the difference in the first character between UTF-16-BE and UTF-16-LE. For big endian it is encoded as 03 b1 with the big order (most significant) byte being displayed first. For little endian it is encoded as b1 03 with the little order (least significant) byte being displayed first.

An analogy is the number twelve. It is usually written in English using the big number first as 12. However if it were written with the little number first, it would be written as 21.
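The byte ordering can be checked against the Unicode code point of the character using the int class method `from_bytes`. This is a sketch with illustrative variable names:

```python
code_point = ord('α')  # the integer 945, which is 0x03b1 in hexadecimal

big = bytes('α', encoding='utf-16-be')
little = bytes('α', encoding='utf-16-le')

# Both byte sequences give the same integer when read with the matching order.
from_big = int.from_bytes(big, byteorder='big')
from_little = int.from_bytes(little, byteorder='little')
print(hex(code_point), hex(from_big), hex(from_little))

# Reading with the wrong order gives a different integer.
print(hex(int.from_bytes(little, byteorder='big')))
```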

Although big endian is used with decimal numbers, little endian was more commonly used for byte ordering, for example being the standard on Intel processors. There were early encoding/decoding issues due to the byte order when UTF-16 was first issued and as a result a UTF-16 Byte Order Mark (BOM) is typically prefixed. This BOM (ff fe above, indicating little endian) is seen when UTF-16 is specified but not when UTF-16-LE is specified; the two are otherwise identical.
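The BOM can be compared with the constants in the standard library `codecs` module. Which constant matches is assumed to depend on the byte order of the machine running the code, so the sketch below checks both:

```python
import codecs
import sys

encoded = 'αβγδε'.encode('utf-16')
bom = encoded[:2]

print(sys.byteorder)
print(bom == codecs.BOM_UTF16_LE, bom == codecs.BOM_UTF16_BE)

# The generic 'utf-16' decoder reads the BOM to select the byte order
# and does not include it in the decoded string.
decoded = encoded.decode('utf-16')
print(decoded)
```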

UTF-8 is a sequence of single bytes in a fixed order and therefore has no encoding issues regarding byte ordering. However there is a variation UTF-8-Sig which includes a BOM. This is commonly used in Microsoft applications:

In [45]:
greek_utf8 = bytes('αβγδε', encoding='UTF-8').hex()
greek_utf8

'ceb1ceb2ceb3ceb4ceb5'

In [46]:
greek_utf8sig = bytes('αβγδε', encoding='UTF-8-Sig').hex()
greek_utf8sig

'efbbbfceb1ceb2ceb3ceb4ceb5'
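When bytes with this BOM are decoded, the choice of codec determines whether the BOM is kept. A minimal sketch using the `codecs` module constant `BOM_UTF8`; the variable names are illustrative:

```python
import codecs

with_bom = bytes('αβγδε', encoding='utf-8-sig')
print(with_bom.startswith(codecs.BOM_UTF8))

# Plain 'utf-8' keeps the BOM as the character U+FEFF.
kept = with_bom.decode('utf-8')
# 'utf-8-sig' removes the BOM when it is present.
removed = with_bom.decode('utf-8-sig')

print(repr(kept))
print(repr(removed))
```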

The first standard encoding scheme was 'ASCII', which was fixed over 1 byte (8 bit). Recall a byte has:

In [47]:
2 ** 8

256

combinations although only half were used. Initially there were regional encoding schemes (translation tables) which mapped the second half of the combinations to regional characters. For example in the UK, 'Latin1' was used which includes the £ sign:

In [48]:
gb = bytes('£123.45', encoding='Latin1')
gb

b'\xa3123.45'

In [49]:
int('0xa3', base=16)

163

In [50]:
gb.decode(encoding='Latin1')

'£123.45'

This regional encoding scheme spanned the full byte. The problem with early regional encoding was that operating systems were often configured to use a different encoding scheme from that of the content, such as early websites. The ASCII characters were fixed and displayed correctly, but characters outwith ASCII were often substituted. For example:

In [51]:
gb.decode(encoding='Latin2')

'Ł123.45'

In [52]:
gb.decode(encoding='Latin3')

'£123.45'

In [53]:
gb.decode(encoding='Greek')

'£123.45'

In [54]:
gb.decode(encoding='Cyrillic')

'Ѓ123.45'

UTF-16 (16 bits) was created which has:

In [55]:
2 ** 16

65536

combinations and hence allowed more characters. ASCII characters have to be extended to 2 bytes as the byte size is fixed. The restriction in the number of characters led to UTF-32 (32 bits) giving:

In [56]:
2 ** 32

4294967296

combinations. ASCII characters have to be extended to 4 bytes as the byte size is fixed. UTF-8 is instead adaptable, using 1 byte for ASCII characters, 2 or 3 bytes for common Unicode characters and 4 bytes for extended Unicode characters. UTF-16 and UTF-32 had variations in byte order, an issue which does not arise in UTF-8. UTF-8 is the standard used for Unicode strings and will be used from now on.
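The difference between the fixed and adaptable byte sizes can be sketched by encoding the same characters with each scheme. The little endian variants are used here only so that no BOM is included in the count:

```python
# Number of bytes per character for each encoding scheme.
sizes = {}
for character in ('a', 'α'):
    for encoding in ('utf-8', 'utf-16-le', 'utf-32-le'):
        sizes[(character, encoding)] = len(character.encode(encoding))
        print(character, encoding.ljust(9), sizes[(character, encoding)])
```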

## Bytes Identifiers

The bytes class uses the design model of a Python object as seen by its method resolution order:

In [57]:
bytes.mro()

[bytes, object]

If the help function is used on bytes, details about all the identifiers will be given:

In [58]:
help(bytes)

Help on class bytes in module builtins:

class bytes(object)
 |  bytes(iterable_of_ints) -> bytes
 |  bytes(string, encoding[, errors]) -> bytes
 |  bytes(bytes_or_buffer) -> immutable copy of bytes_or_buffer
 |  bytes(int) -> bytes object of size given by the parameter initialized with null bytes
 |  bytes() -> empty bytes object
 |  
 |  Construct an immutable array of bytes from:
 |    - an iterable yielding integers in range(256)
 |    - a text string encoded using the specified encoding
 |    - any object implementing the buffer API.
 |    - an integer
 |  
 |  Methods defined here:
 |  
 |  __add__(self, value, /)
 |      Return self+value.
 |  
 |  __bytes__(self, /)
 |      Convert this value to exact type bytes.
 |  
 |  __contains__(self, key, /)
 |      Return key in self.
 |  
 |  __eq__(self, value, /)
 |      Return self==value.
 |  
 |  __ge__(self, value, /)
 |      Return self>=value.
 |  
 |  __getattribute__(self, name, /)
 |      Return getattr(self, name).
 |  
 | 

From the previous notebook, you'll recognise many of these identifiers as they are seen in the string class. For the most part the identifiers between the two classes are consistent however some will differ due to the difference in fundamental unit.

If the methods are compared:

In [59]:
for identifier in dir(str):
    isfunction = callable(getattr(str, identifier))
    isdatamodel = identifier[0] == '_'
    if (isfunction and not isdatamodel):
        print(identifier, end=' ')

capitalize casefold center count encode endswith expandtabs find format format_map index isalnum isalpha isascii isdecimal isdigit isidentifier islower isnumeric isprintable isspace istitle isupper join ljust lower lstrip maketrans partition removeprefix removesuffix replace rfind rindex rjust rpartition rsplit rstrip split splitlines startswith strip swapcase title translate upper zfill 

In [60]:
for identifier in dir(bytes):
    isfunction = callable(getattr(bytes, identifier))
    isdatamodel = identifier[0] == '_'
    if (isfunction and not isdatamodel):
        print(identifier, end=' ')

capitalize center count decode endswith expandtabs find fromhex hex index isalnum isalpha isascii isdigit islower isspace istitle isupper join ljust lower lstrip maketrans partition removeprefix removesuffix replace rfind rindex rjust rpartition rsplit rstrip split splitlines startswith strip swapcase title translate upper zfill 

And if a check is made for methods in the bytes class but not in the str class, there are only 3 additions:

In [61]:
for identifier in dir(bytes):
    isfunction = callable(getattr(bytes, identifier))
    isinstr = identifier in dir(str)
    isdatamodel = identifier[0] == '_'
    if (isfunction and not isdatamodel and not isinstr):
        print(identifier, end=' ')

decode fromhex hex 

Likewise if the data model methods are compared:

In [62]:
for identifier in dir(str):
    isfunction = callable(getattr(str, identifier))
    isdatamodel = identifier[0] == '_'
    if (isfunction and isdatamodel):
        print(identifier, end=' ')

__add__ __class__ __contains__ __delattr__ __dir__ __eq__ __format__ __ge__ __getattribute__ __getitem__ __getnewargs__ __getstate__ __gt__ __hash__ __init__ __init_subclass__ __iter__ __le__ __len__ __lt__ __mod__ __mul__ __ne__ __new__ __reduce__ __reduce_ex__ __repr__ __rmod__ __rmul__ __setattr__ __sizeof__ __str__ __subclasshook__ 

In [63]:
for identifier in dir(bytes):
    isfunction = callable(getattr(bytes, identifier))
    isdatamodel = identifier[0] == '_'
    if (isfunction and isdatamodel):
        print(identifier, end=' ')

__add__ __bytes__ __class__ __contains__ __delattr__ __dir__ __eq__ __format__ __ge__ __getattribute__ __getitem__ __getnewargs__ __getstate__ __gt__ __hash__ __init__ __init_subclass__ __iter__ __le__ __len__ __lt__ __mod__ __mul__ __ne__ __new__ __reduce__ __reduce_ex__ __repr__ __rmod__ __rmul__ __setattr__ __sizeof__ __str__ __subclasshook__ 

And if a check is made for the data model methods in the bytes class but not in the str class, there is only 1 addition:

In [64]:
for identifier in dir(bytes):
    isfunction = callable(getattr(bytes, identifier))
    isinstr = identifier in dir(str)
    isdatamodel = identifier[0] == '_'
    if (isfunction and isdatamodel and not isinstr):
        print(identifier, end=' ')

__bytes__ 
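The data model method \_\_bytes\_\_ defines the behaviour when an instance is cast using the bytes class. The following sketch defines it in a custom class; the class `Temperature` is hypothetical and exists only to illustrate the mechanism:

```python
class Temperature:
    """Hypothetical class used only to illustrate __bytes__."""

    def __init__(self, celsius):
        self.celsius = celsius

    def __bytes__(self):
        # Called when an instance is cast using bytes(instance).
        return str(self.celsius).encode('ascii')


reading = bytes(Temperature(21))
print(reading)
```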

Most methods and data model methods have the same names in the two classes and have consistent behaviour. The methods in the str class which return a new str instance will generally return a new bytes instance instead. The individual unit of a Unicode string is a Unicode character and the individual unit of a bytes instance is a byte. The data model method \_\_len\_\_, which maps to the builtins function len, will return how many of these individual units are in an instance:

In [65]:
english_b = b'abcde'
greek_b = bytes('αβγδε', encoding='UTF-8')
english_s = 'abcde'
greek_s = 'αβγδε'

Notice that the len of english_s and greek_s are consistently 5 as there are 5 Unicode characters:

In [66]:
len(english_s)

5

In [67]:
len(greek_s)

5

Notice that the len of english_b is also 5 as each ASCII character spans 1 byte, however greek_b has a length of 10 as each character spans 2 bytes:

In [68]:
len(english_b)

5

In [69]:
len(greek_b)

10

Recall that the data model identifier \_\_getitem\_\_ defines the behaviour when indexing with square brackets. For a string, the Unicode character corresponding to that index is returned. For a bytes instance, the byte in the form of an int is returned:

In [70]:
greek_s[0]

'α'

In [71]:
greek_b[0]

206

In [72]:
greek_b.hex()

'ceb1ceb2ceb3ceb4ceb5'

In [73]:
hex(206)

'0xce'

Slicing will instead return a bytes instance. The difference can be seen when 1 byte is selected using a slice:

In [74]:
greek_b[:1]

b'\xce'

The syntax is otherwise consistent with slicing in the string class:

In [75]:
greek_b[:2:]

b'\xce\xb1'

In [76]:
greek_b[::2]

b'\xce\xce\xce\xce\xce'

In [77]:
greek_b[1::2]

b'\xb1\xb2\xb3\xb4\xb5'
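The difference between indexing and slicing can be summarised in a short sketch. Iterating over a bytes instance behaves like indexing and yields int values:

```python
greek_b = bytes('αβγδε', encoding='UTF-8')

first_two = list(greek_b[:2])        # iterating yields int values
print(first_two)

print(type(greek_b[0]).__name__)     # indexing returns an int
print(type(greek_b[:1]).__name__)    # slicing returns a bytes instance

# An int from indexing can be wrapped in a list to rebuild a 1 byte instance.
print(bytes([greek_b[0]]) == greek_b[:1])
```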

Earlier the method hex was examined:

In [78]:
? greek_b.hex

Docstring:
Create a string of hexadecimal numbers from a bytes object.

  sep
    An optional single character or byte to separate hex bytes.
  bytes_per_sep
    How many bytes between separators.  Positive values count from the
    right, negative values count from the left.

Example:
>>> value = b'\xb9\x01\xef'
>>> value.hex()
'b901ef'
>>> value.hex(':')
'b9:01:ef'
>>> value.hex(':', 2)
'b9:01ef'
>>> value.hex(':', -2)
'b901:ef'
Type:      builtin_function_or_method

In [79]:
greek_b.hex()

'ceb1ceb2ceb3ceb4ceb5'

There is a class method fromhex, an alternative constructor which will create a bytes instance from a string of hexadecimal numbers:

In [80]:
? bytes.fromhex

Signature: bytes.fromhex(string, /)
Docstring:
Create a bytes object from a string of hexadecimal numbers.

Spaces between two numbers are accepted.
Example: bytes.fromhex('B9 01EF') -> b'\\xb9\\x01\\xef'.
Type:      builtin_function_or_method

Note that the class method is normally called from a class and returns an instance:

In [81]:
bytes.fromhex('ceb1ceb2ceb3ceb4ceb5')

b'\xce\xb1\xce\xb2\xce\xb3\xce\xb4\xce\xb5'
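The methods hex and fromhex are inverses of one another, which can be sketched as a round trip. The variable names are illustrative:

```python
original = bytes('αβγδε', encoding='UTF-8')

# hex followed by fromhex recovers the original bytes instance.
round_trip = bytes.fromhex(original.hex())
print(round_trip == original)

# Spaces between the hexadecimal numbers are accepted.
spaced = bytes.fromhex('ce b1 ce b2')
print(spaced.decode('utf-8'))
```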