# The bytes class

In the previous notebook the Unicode string class ```str``` was examined and was seen to have a Unicode character as its fundamental unit. The byte string class ```bytes```, on the other hand, uses a byte as its fundamental unit. The byte string was the foundation for text data in Python 2.

A computer stores data using bits. A bit can be conceptualised as a single dip switch with the values Off and On, as shown below. A single switch has the possible values ```0``` and ```1```, which is ```2 ** 1``` combinations, a total of ```2```. Note the combination ```0``` is included, so the range ```0:2``` is inclusive of the lower bound ```0``` and exclusive of the upper bound ```2```.

<img src='./images/img_001.png' alt='img_001' width='400'/>

The switch above represents ```2``` to the power of ```0``` and has a coefficient of ```0```, stored below as ```prefix```:

In [6]:
prefix = 0

In [7]:
prefix * (2 ** 0)

0

More typically ```8``` of these switches are combined into a single logical unit called a byte. A byte has ```2 ** 8``` combinations which is a total of ```256```. Note the combination ```0``` is included so ```0:256``` is inclusive of the lower bound ```0``` and exclusive of the upper bound ```256```.

<img src='./images/img_002.png' alt='img_002' width='400'/>

The decimal number can be calculated from the byte above. As with an ordinary decimal number, the units are on the right-hand side; what differs is that each digit represents a power of ```2``` (binary) as opposed to a power of ```10``` (decimal):

In [19]:
+ 0 * (2 ** 7) \
+ 1 * (2 ** 6) \
+ 1 * (2 ** 5) \
+ 0 * (2 ** 4) \
+ 1 * (2 ** 3) \
+ 0 * (2 ** 2) \
+ 0 * (2 ** 1) \
+ 0 * (2 ** 0) 

104

The byte above can be expressed as a binary number using the prefix ```0b```. In VSCode this prefix is highlighted in a different colour to make it obvious that the number is not a decimal number. If it is input into a cell, the decimal equivalent will be calculated. Decimal is the standard convention and has no prefix:

In [20]:
0b01101000

104

Leading zeros are normally omitted:

In [21]:
0b1101000

104

Although in this context, it is useful to leave them in place so all 8 bits in the byte can be visualised.

A byte string is essentially a collection of individual bytes. It is therefore called the ```bytes``` class:

<img src='./images/img_003.png' alt='img_003' width='400'/>

In [25]:
0b01101000

104

In [36]:
0b01100101

101

In [35]:
0b01101100

108

In [22]:
0b01101100

108

In [29]:
0b01101111

111

These numbers can be grouped into a collection such as a ```tuple```:

In [37]:
(0b01101000, 0b01100101, 0b01101100, 0b01101100, 0b01101111)

(104, 101, 108, 108, 111)

This ```tuple``` collection can be cast into ```bytes``` giving text information:

In [39]:
bytes((0b01101000, 0b01100101, 0b01101100, 0b01101100, 0b01101111))

b'hello'

In [40]:
bytes((104, 101, 108, 108, 111))

b'hello'

## ASCII Characters

The American Standard Code for Information Interchange (ASCII) maps each byte to a physical command or English character. Recall that the physical commands were used to control primitive computers that were essentially typewriter based:

<img src='./images/img_004.png' alt='img_004' width='600'/>

|byte|hex|num|command|
|---|---|---|---|
|00000000|00|000|null|
|00000001|01|001|start of heading|
|00000010|02|002|start of text|
|00000011|03|003|end of text|
|00000100|04|004|end of transmission|
|00000101|05|005|enquiry|
|00000110|06|006|acknowledge|
|00000111|07|007|bell|
|00001000|08|008|**backspace**|
|00001001|09|009|**horizontal tab**|
|00001010|0a|010|**new line**|
|00001011|0b|011|**vertical tab**|
|00001100|0c|012|**form feed**|
|00001101|0d|013|**carriage return**|
|00001110|0e|014|shift out|
|00001111|0f|015|shift in|
|00010000|10|016|data link escape|
|00010001|11|017|device control 1|
|00010010|12|018|device control 2|
|00010011|13|019|device control 3|
|00010100|14|020|device control 4|
|00010101|15|021|negative acknowledge|
|00010110|16|022|synchronous idle|
|00010111|17|023|end of transmission block|
|00011000|18|024|cancel|
|00011001|19|025|end of medium|
|00011010|1a|026|substitute|
|00011011|1b|027|**escape**|
|00011100|1c|028|file separator|
|00011101|1d|029|group separator|
|00011110|1e|030|record separator|
|00011111|1f|031|unit separator|
|00100000|20|032|**space**|

The remaining values, up to half of a byte's range (```127```), were assigned to the characters most commonly used in the English language.

|byte|hex|num|character|
|---|---|---|---|
|00100001|21|033|!|
|00100010|22|034|"|
|00100011|23|035|#|
|00100100|24|036|$|
|00100101|25|037|%|
|00100110|26|038|&|
|00100111|27|039|'|
|00101000|28|040|(|
|00101001|29|041|)|
|00101010|2a|042|*|
|00101011|2b|043|+|
|00101100|2c|044|,|
|00101101|2d|045|-|
|00101110|2e|046|.|
|00101111|2f|047|/|
|00110000|30|048|0|
|00110001|31|049|1|
|00110010|32|050|2|
|00110011|33|051|3|
|00110100|34|052|4|
|00110101|35|053|5|
|00110110|36|054|6|
|00110111|37|055|7|
|00111000|38|056|8|
|00111001|39|057|9|
|00111010|3a|058|:|
|00111011|3b|059|;|
|00111100|3c|060|<|
|00111101|3d|061|=|
|00111110|3e|062|>|
|00111111|3f|063|?|
|01000000|40|064|@|
|01000001|41|065|A|
|01000010|42|066|B|
|01000011|43|067|C|
|01000100|44|068|D|
|01000101|45|069|E|
|01000110|46|070|F|
|01000111|47|071|G|
|01001000|48|072|H|
|01001001|49|073|I|
|01001010|4a|074|J|
|01001011|4b|075|K|
|01001100|4c|076|L|
|01001101|4d|077|M|
|01001110|4e|078|N|
|01001111|4f|079|O|
|01010000|50|080|P|
|01010001|51|081|Q|
|01010010|52|082|R|
|01010011|53|083|S|
|01010100|54|084|T|
|01010101|55|085|U|
|01010110|56|086|V|
|01010111|57|087|W|
|01011000|58|088|X|
|01011001|59|089|Y|
|01011010|5a|090|Z|
|01011011|5b|091|[|
|01011100|5c|092|\|
|01011101|5d|093|]|
|01011110|5e|094|^|
|01011111|5f|095|_|
|01100000|60|096|`|
|01100001|61|097|a|
|01100010|62|098|b|
|01100011|63|099|c|
|01100100|64|100|d|
|01100101|65|101|e|
|01100110|66|102|f|
|01100111|67|103|g|
|01101000|68|104|h|
|01101001|69|105|i|
|01101010|6a|106|j|
|01101011|6b|107|k|
|01101100|6c|108|l|
|01101101|6d|109|m|
|01101110|6e|110|n|
|01101111|6f|111|o|
|01110000|70|112|p|
|01110001|71|113|q|
|01110010|72|114|r|
|01110011|73|115|s|
|01110100|74|116|t|
|01110101|75|117|u|
|01110110|76|118|v|
|01110111|77|119|w|
|01111000|78|120|x|
|01111001|79|121|y|
|01111010|7a|122|z|
|01111011|7b|123|{|
|01111100|7c|124|\||
|01111101|7d|125|}|
|01111110|7e|126|~|
|01111111|7f|127|delete (command)|
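
The columns in the tables above can be reproduced in code. The following is a minimal sketch that uses the builtin ```format``` function; the three numbers and the variable names are chosen only for illustration:

```python
# Rebuild a few rows of the ASCII table using format specifications
rows = []
for number in (65, 97, 104):
    byte = format(number, '08b')   # 8 binary digits, zero padded
    hexa = format(number, '02x')   # 2 lowercase hexadecimal digits
    num = format(number, '03d')    # 3 decimal digits, zero padded
    rows.append(f'|{byte}|{hexa}|{num}|{chr(number)}|')

print('\n'.join(rows))
```

Each printed row should match the corresponding row for ```A```, ```a``` and ```h``` in the table.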


## Initialisation Signature

Inputting the class name ```bytes``` with open parenthesis in a new code cell will display the initialisation signature as a popup balloon. Note that Jupyter Notebook and JupyterLab may require the additional keypress shift ```⇧``` and tab ```↹``` in order to invoke the popup balloon:

Alternatively using ```?``` to query the ```bytes``` class will display the docstring in the ipython cell output:

In [3]:
bytes?

Init signature: bytes(self, /, *args, **kwargs)
Docstring:     
bytes(iterable_of_ints) -> bytes
bytes(string, encoding[, errors]) -> bytes
bytes(bytes_or_buffer) -> immutable copy of bytes_or_buffer
bytes(int) -> bytes object of size given by the parameter initialized with null bytes
bytes() -> empty bytes object

Construct an immutable array of bytes from:
  - an iterable yielding integers in range(256)
  - a text string encoded using the specified encoding
  - any object implementing the buffer API.
  - an integer
Type:           type
Subclasses:     

Recall that the purpose of the initialisation signature is to provide the data required to initialise a new instance. This was covered in a previous notebook which discussed the ```object``` class and introduced the concept of object orientated programming (OOP).

To recap, during instantiation the datamodel static method ```__new__```, the constructor, is used to create the instance ```self```, and the datamodel instance method ```__init__```, the initialiser, is then invoked to initialise this instance with instance data.

For the byte string class, the initialisation signature shows alternative ways of supplying instance data for a byte string.

If the first way is examined:

* In Python parentheses ```( )``` are used to call a function and supply any necessary input arguments.
* The comma ```,``` is used as a delimiter to separate out any input arguments.
* In Python ```self``` is used to denote *this instance*. In other words a byte string can be constructed from an existing byte string instance; this is a special case as a byte string is a fundamental datatype.
* Any input argument before a ```/``` must be provided positionally.
* ```*args``` indicates a variable number of additional positional input arguments. These are typically not used for the byte string class.
* ```**kwargs``` indicates a variable number of additional named input arguments. These are typically not used for the byte string class.

The parameter ```self```, an existing byte string instance, cannot be provided using a named input argument:
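
A minimal sketch of this restriction, assuming CPython, where an unrecognised keyword is rejected with a ```TypeError```. The exact wording of the message may differ between versions, and the name ```outcome``` is only illustrative:

```python
# Attempting to pass the instance by name is rejected
try:
    bytes(self=b'hello world!')
except TypeError as error:
    outcome = type(error).__name__
    print(outcome, error, sep=': ')
```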

A bytestring instance can be instantiated by supplying an existing bytestring instance ```self``` to the bytestring class:

In [43]:
bytes(b'hello world!')

b'hello world!'

However because the byte string is a fundamental datatype it can also be instantiated shorthand using the following:

In [44]:
b'hello world!'

b'hello world!'

The second way is by using an iterable of integers such as a ```tuple```.

For this to work each integer must be a valid byte value. Recall that a byte looks like the following:

<img src='./images/img_002.png' alt='img_002' width='400'/>

A byte has ```2 ** 8``` combinations, which is a total of ```256```. Note the combination ```0``` is included, so ```0:256``` is inclusive of the lower bound ```0``` and exclusive of the upper bound ```256```. Therefore the integer value used to represent a byte must lie in the range ```0:256``` to be a valid byte.

A ```tuple``` with a single ```int``` element can be provided. Note that a trailing comma is required to distinguish a single element ```tuple``` from a numeric calculation using parentheses:

In [48]:
num = (97)

In [51]:
type(num)

int

In [52]:
archive = (97, )

In [53]:
type(archive)

tuple

From the above ASCII table the integer number ```97``` corresponds to the character ```a``` and this can be seen when the ```tuple``` is cast to ```bytes```:

In [55]:
bytes((97, ))

b'a'

If an integer equals or exceeds the upper bound ```256``` (the valid range is up to and exclusive of ```256``` due to zero-order indexing), a ```ValueError``` will display:
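
A short sketch of the error, catching it so the cell completes. The name ```outcome``` is only illustrative:

```python
# 256 lies outside range(256), so it cannot be stored in a single byte
try:
    bytes((256, ))
except ValueError as error:
    outcome = type(error).__name__
    print(outcome, error, sep=': ')
```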

Normally the ```tuple``` will contain more than one ```int``` and each of these will be a valid byte value:

In [6]:
integers = (104, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100, 33)
bytes(integers)

b'hello world!'

If the first number is examined, recall that it is the decimal integer ```104```. A Unicode string of the binary sequence of this byte can be obtained using the ```bin``` function:

In [57]:
bin(104)

'0b1101000'

It is useful to see this with leading zeros. To do this the Unicode string methods ```removeprefix``` and ```zfill``` can be used alongside Unicode string concatenation. You should be familiar with these from the last notebooks:

In [7]:
'0b' + bin(104).removeprefix('0b').zfill(8)

'0b01101000'

The Unicode string of the ASCII character corresponding to this byte can be retrieved using the ```chr``` function:

In [8]:
chr(104)

'h'

This bytes instance consists of 12 bytes. The decimal integer, binary integer (without prefix) and ASCII character can be shown. Note the space corresponding to a decimal integer of 32 is also an ASCII character:

In [9]:
for number in integers:
    print(str(number).center(8), end=' ')

print()
for number in integers:
    print(bin(number).removeprefix('0b').zfill(8), end=' ')

print()
for number in integers:
    print(chr(number).center(8), end=' ')

  104      101      108      108      111       32      119      111      114      108      100       33    
01101000 01100101 01101100 01101100 01101111 00100000 01110111 01101111 01110010 01101100 01100100 00100001 
   h        e        l        l        o                 w        o        r        l        d        !     

When every byte is a printable ASCII character it will be displayed instead of the byte sequence:

In [12]:
bytes(integers)

b'hello world!'

When the byte sequence contains a byte that maps to a whitespace character (with the exception of space), a command or is unmapped there is no corresponding character to display and an escape sequence instead displays:

In [13]:
integers = (9, 10, 11, 12, 13)
bytes(integers)

b'\t\n\x0b\x0c\r'

The tab ```\t``` (decimal 9), new line ```\n``` (decimal 10) and carriage return ```\r``` (decimal 13) are displayed using their ASCII escape characters. The vertical tab ```\x0b``` (decimal 11) and form feed ```\x0c``` (decimal 12) are instead displayed using a hexadecimal escape sequence of the form ```\x00```.

You may have noticed that binary ```0b00001100``` is not very human readable and it is easy to mistranscribe a binary number. To make it more human readable, the byte is split into 2 and each half byte is represented as a hexadecimal value:

<img src='./images/img_005.png' alt='img_005' width='400'/>

Recall **b**inary has ```2``` digits and the prefix ```0b```, decimal has ```10``` digits and no prefix because it is most commonly used. He**x**adecimal has ```16``` digits and the prefix ```0x```. As decimal only has ```10``` unique characters, in hexadecimal, the first ```6``` letters in the alphabet are used to supplement these giving ```16``` unique characters. Notice:

In [58]:
2 ** 4

16

Therefore each hexadecimal character perfectly maps to a 4 bit binary sequence:

|(0b) binary|(0x) hexadecimal character|decimal character|
|---|---|---|
|0000|0|0|
|0001|1|1|
|0010|2|2|
|0011|3|3|
|0100|4|4|
|0101|5|5|
|0110|6|6|
|0111|7|7|
|1000|8|8|
|1001|9|9|
|1010|a|10|
|1011|b|11|
|1100|c|12|
|1101|d|13|
|1110|e|14|
|1111|f|15|

The byte depicted above corresponds to the binary number:

In [71]:
0b00001100

12

The value returned in the cell output displays the decimal integer. The ```hex``` function can be used to cast a decimal integer into a Unicode string of a hexadecimal character:

In [72]:
hex(0b00001100)

'0xc'

The hexadecimal value is displayed without the leading zero, so the first half byte is not shown. This can be added for clarity:

In [74]:
'0x' + hex(12).removeprefix('0x').zfill(2)

'0x0c'

In [73]:
'0b' + bin(12).removeprefix('0b').zfill(8)

'0b00001100'

And for clarity:

In [75]:
print(bin(12).removeprefix('0b').zfill(8)[:4], bin(12).removeprefix('0b').zfill(8)[4:])
print(hex(12).removeprefix('0x').zfill(2)[:1].center(4), hex(12).removeprefix('0x').zfill(2)[1:].center(4))

0000 1100
 0    c  


Note that the ```hex``` function displays the Python interpreter's preference for lowercase characters, both for the hexadecimal prefix and for the hexadecimal characters themselves. Hexadecimal values are usually used when working at the fundamental hardware level and the preference for lowercase is to make hexadecimal values as human readable as possible. In the hexadecimal value spanning 4 bytes (8 hexadecimal characters) below, notice that the lowercase variant is much easier to transcribe:

In [68]:
0xabb4ab8a

2880744330

In [66]:
0xABB4AB8A

2880744330

Despite the preference for lowercase, Python will also accept hexadecimal values of uppercase and in the example above both cases return the same integer.

When the byte sequence contains a byte that maps to a whitespace character, a command or is unmapped there is no corresponding character to display and an escape sequence instead displays. This can be seen when using the first ```32``` integers and integers above ```127```:

In [83]:
integers = (0, 1, 2, 29, 30, 31, 128, 129, 130, 253, 254, 255)
bytes(integers)

b'\x00\x01\x02\x1d\x1e\x1f\x80\x81\x82\xfd\xfe\xff'

Notice that each byte has its own hexadecimal escape sequence with the prefix ```\x```, followed by two hexadecimal characters which, recall, span 1 byte. Because the escape sequence ```\x``` expects two hexadecimal characters, any leading zero must be supplied.
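
A brief sketch showing that a ```\x``` escape sequence in a bytes literal and the equivalent integer describe the same byte. The variable names are illustrative only:

```python
# Each \x escape is followed by exactly two hexadecimal characters
form_feed = b'\x0c'
same_byte = bytes((0x0c, ))
print(form_feed == same_byte)

# Iterating over a bytes instance yields the integer value of each byte
values = list(b'\x00\x0c\xff')
print(values)
```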

When a byte string contains byte sequences that map to characters these characters will be displayed instead of the hexadecimal escape sequence:

In [101]:
integers = (32, 65, 80, 120)
bytes(integers)

b' APx'

The tab ```\t```, newline ```\n```, carriage return ```\r``` and backslash ```\\``` itself all have single escape characters.

In [104]:
integers = (9, 10, 13, 92)
bytes(integers)

b'\t\n\r\\'

A byte string containing all of these can initially appear confusing:

In [105]:
integers = (0, 1, 2, 9, 10, 13, 29, 30, 31, 32, 65, 92, 97, 128, 129, 130, 253, 254, 255)
bytes(integers)

b'\x00\x01\x02\t\n\r\x1d\x1e\x1f A\\a\x80\x81\x82\xfd\xfe\xff'

Which is why the ```bytes``` class has method ```hex``` which returns a Unicode string of the hexadecimal values without any of the escape sequences:

In [107]:
bytes(integers).hex()

'000102090a0d1d1e1f20415c61808182fdfeff'

Note that the ```hex``` method of the ```bytes``` class differs from the ```builtins``` function ```hex```, which provides a ```0x``` prefix:

In [115]:
bytes((12, )).hex()

'0c'

In [113]:
hex(12)

'0xc'

The ```builtins``` function ```hex``` can process a **single** large integer that exceeds 1 byte:

In [118]:
hex(256)

'0x100'

Whereas the ```hex``` method of the ```bytes``` class processes **multiple** integers that are within the constraints of a byte:

In [116]:
bytes((12, 34)).hex()

'0c22'
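
The ```bytes``` class also has the class method ```fromhex```, which goes in the opposite direction. A small sketch of the round trip follows; the variable names are illustrative only:

```python
original = bytes((12, 34))
as_text = original.hex()             # Unicode string of hexadecimal characters
restored = bytes.fromhex(as_text)    # back to a bytes instance
print(as_text, restored == original)
```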

To recap:

In [108]:
integers = (104, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100, 33)
bytes(integers)

for number in integers:
    print(str(number).center(8), end=' ')

print()
for number in integers:
    print(bin(number).removeprefix('0b').zfill(8), end=' ')

print()
for number in integers:
    print((r'0x' + hex(number).removeprefix('0x')).center(8), end=' ')

print()
for number in integers:
    print((r'\x' + hex(number).removeprefix('0x')).center(8), end=' ')
    
print()
for number in integers:
    print((hex(number).removeprefix('0x')).center(8), end=' ')    

print()
for number in integers:
    print(chr(number).center(8), end=' ')
    

  104      101      108      108      111       32      119      111      114      108      100       33    
01101000 01100101 01101100 01101100 01101111 00100000 01110111 01101111 01110010 01101100 01100100 00100001 
  0x68     0x65     0x6c     0x6c     0x6f     0x20     0x77     0x6f     0x72     0x6c     0x64     0x21   
  \x68     \x65     \x6c     \x6c     \x6f     \x20     \x77     \x6f     \x72     \x6c     \x64     \x21   
   68       65       6c       6c       6f       20       77       6f       72       6c       64       21    
   h        e        l        l        o                 w        o        r        l        d        !     

A byte string can be instantiated from a Unicode string; however, the named parameter ```encoding``` needs to be supplied, which gives the instructions to encode each Unicode character as bytes. In this case ```'ASCII'``` is supplied, which requires all characters to be within the ASCII range:

In [120]:
bytes('hello', encoding='ASCII')

b'hello'

Not supplying ```encoding``` gives a ```TypeError```:
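
A short sketch of the error, catching it so the cell completes. The name ```outcome``` is only illustrative:

```python
# A Unicode string cannot be converted without knowing the encoding
try:
    bytes('hello')
except TypeError as error:
    outcome = type(error).__name__
    print(outcome, error, sep=': ')
```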

Supplying a Unicode character that is not ASCII and specifying ASCII encoding will give a ```UnicodeEncodeError```:
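
A sketch of the failure, using the Greek letter α as the non-ASCII character. In CPython the exception raised here is a ```UnicodeEncodeError```, because the string is being encoded rather than decoded. The second half uses the optional ```errors``` parameter listed in the docstring; the variable names are illustrative only:

```python
# α has no ASCII mapping, so strict encoding fails
try:
    bytes('α', encoding='ASCII')
except UnicodeEncodeError as error:
    outcome = type(error).__name__
    print(outcome, error, sep=': ')

# errors='replace' substitutes a question mark instead of raising
replaced = bytes('α', encoding='ASCII', errors='replace')
print(replaced)
```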

Encoding will be discussed in more detail shortly.

A bytes instance can be instantiated from an existing bytes instance:

bytes(bytes_or_buffer, /) -> immutable copy of bytes_or_buffer

It can also be instantiated from an instance of a closely related class, the ```bytearray``` class, which is a mutable counterpart of the immutable ```bytes``` class:

In [136]:
ba = bytearray(b'hello')

In [137]:
bytes(ba)

b'hello'

Immutable means the ```bytes``` sequence cannot be modified once constructed. Mutable means a ```bytearray``` sequence can be modified once constructed.
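
The difference can be sketched by attempting item assignment on both classes. The variable names are illustrative only:

```python
immutable = b'hello'
mutable = bytearray(b'hello')

# Item assignment is supported by bytearray...
mutable[0] = 72          # 72 is the byte value of 'H'

# ...but not by bytes, which raises a TypeError
try:
    immutable[0] = 72
except TypeError as error:
    outcome = type(error).__name__

print(mutable, immutable, outcome)
```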

A ```bytes``` instance can also be initialised from an integer. In this case, the integer specifies the number of bytes the instance will use and each of these bytes is initialised to ```0```:

For example a ```bytes``` instance occupying 1 byte can be instantiated:

In [128]:
bytes(1)

b'\x00'

And another one occupying 4 bytes can be instantiated:

In [132]:
bytes(4)

b'\x00\x00\x00\x00'

Using the ```bytes``` class without providing any instantiation data will create an empty ```bytes``` instance:


bytes() -> empty bytes object

In [138]:
bytes()

b''

Initialising an instance and then populating it is typically only done for a mutable class such as the ```bytearray```.

## Encoding and Decoding

When a bytes string was instantiated from a Unicode string, an encoding translation table was selected; that is a table that maps the bytes sequence to a specific character. 

The most basic one is ```'ASCII'``` where each character is encoded over ```1``` byte using only half of the possible values. ASCII is restricted to a small subset of English characters:

In [31]:
a_ascii = b'a'
a_ascii

b'a'

In [32]:
hex(ord('a'))

'0x61'

The current standard is the Unicode Transformation Format 8 ```'utf8'```, which is adaptable and can use:
* 1 byte (8 bits) for an ASCII character
* 2 bytes (16 bits) for the more popular Unicode characters (e.g. from European alphabets and commonly used mathematical symbols)
* 3-4 bytes (24-32 bits) for extended Unicode characters

In [143]:
a_utf8 = bytes('a', encoding='utf8')
a_utf8

b'a'

In [144]:
alpha_utf8 = bytes('α', encoding='utf8')
alpha_utf8

b'\xce\xb1'

In [148]:
um_utf8 = bytes('㎛', encoding='utf8')
um_utf8

b'\xe3\x8e\x9b'
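
The 4 byte case can be sketched in the same way. The character chosen here, the emoji with code point ```U+1F600```, is an assumption made for illustration and is written as an escape sequence so the cell does not depend on font support:

```python
# U+1F600 lies outside the Basic Multilingual Plane, so UTF-8 needs 4 bytes
emoji_utf8 = bytes('\U0001f600', encoding='utf8')
print(emoji_utf8, len(emoji_utf8))
```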

The Unicode Transformation Format 16 Big Endian ```'utf-16-be'``` was a previous standard where each of the commonly used characters occupies ```2``` bytes:

In [147]:
a_be = bytes('a', encoding='utf-16-be')
a_be

b'\x00a'

Because each character must occupy 2 bytes the leading zero byte ```\x00``` displays. BE is an abbreviation for Big Endian. Essentially the value sixteen would be written as ```10``` using big endian, which is consistent with the way numbers are written in English. In this notation the unit is last and the most significant digit, the biggest number, is written first (when reading left to right).

In [140]:
hex(16)

'0x10'

When ```'utf-16'``` was the standard, however, Intel created processors that operated using little endian. In little endian the value sixteen is written as ```01``` and the least significant digit, the little number, is supplied first (when reading left to right).

The same string that was examined using ```'utf-16-be'``` can be examined using ```'utf-16-le'```. Notice that the leading zero byte ```\x00``` is now trailing.

In [36]:
a_le = bytes('a', encoding='utf-16-le')
a_le

b'a\x00'

If the Greek letter is examined notice the byteorder is switched:

In [39]:
alpha_be = bytes('α', encoding='utf-16-be')
alpha_be

b'\x03\xb1'

In [38]:
alpha_le = bytes('α', encoding='utf-16-le')
alpha_le

b'\xb1\x03'

And notice that the ```utf-8``` encoding which also uses two bytes for this character gives a different value:

In [141]:
alpha_utf8 = bytes('α', encoding='utf-8')
alpha_utf8

b'\xce\xb1'

The ```bytes``` class has the ```decode``` method which can be used to decode a byte string to a Unicode string. It takes the input argument ```encoding```, which defaults to ```'utf-8'``` but is supplied explicitly here. When the correct encoding scheme is selected the character will be decoded properly:

In [152]:
a_utf8.decode(encoding='utf8')

'a'

In [142]:
alpha_utf8.decode(encoding='utf8')

'α'

When the wrong encoding system is used, a ```UnicodeDecodeError``` will often display. In this example the character occupies only one byte but is to be decoded using an encoding scheme that requires two bytes per character:
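
A sketch of this case, redefining ```a_utf8``` so the cell is self-contained and catching the error so the cell completes. The name ```outcome``` is only illustrative:

```python
a_utf8 = bytes('a', encoding='utf8')   # a single byte, b'a'

# One byte cannot form a complete 2 byte UTF-16 code unit
try:
    a_utf8.decode(encoding='utf-16-le')
except UnicodeDecodeError as error:
    outcome = type(error).__name__
    print(outcome, error, sep=': ')
```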

In other cases there is no error and unintended character replacement is carried out:

In [149]:
alpha_utf8.decode(encoding='utf-16-le')

'뇎'

In [150]:
alpha_utf8.decode(encoding='utf-16-be')

'캱'

Unicode strings are more reliable than byte strings as they do not have the encoding problems. They should therefore be used preferentially in Python code where possible. 

Bytes are however commonly used when interfacing with basic hardware. If the bytes sequence is examined in binary it is a series of ```0``` and ```1``` which is not very human readable:

In [45]:
for number in b'hello world!':
    print(bin(number).removeprefix('0b').zfill(8), end='')

011010000110010101101100011011000110111100100000011101110110111101110010011011000110010000100001

However, in this form it is very easy to transmit the information using a digital signal. A serial port, for example, has a signal pin which is configured to transmit or receive a digital time trace. A baud rate of ```9600```, for example, means ```9600``` bits are processed in a second. When the bit is ```0```, the voltage of the signal pin is LOW and when the bit is ```1``` the voltage of the signal pin is HIGH.

In [47]:
b'hello world!'.hex()

'68656c6c6f20776f726c6421'

It is recommended to decode a byte string to a Unicode string as early as possible in a Python program and to cast a Unicode string to a byte string as late as possible before transmitting it to basic hardware in order to avoid encoding issues.
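
This recommendation can be sketched as a small function. The function ```shout``` and its processing step are hypothetical and stand in for whatever work a real program carries out on the text:

```python
def shout(received: bytes, encoding: str = 'utf-8') -> bytes:
    """Hypothetical helper: decode on input, work with str, encode on output."""
    text = received.decode(encoding)      # decode as early as possible
    text = text.upper() + '!'             # all processing uses the Unicode string
    return text.encode(encoding)          # encode as late as possible

reply = shout(bytes('αβγ', encoding='utf-8'))
print(reply, reply.decode('utf-8'))
```

Keeping the byte strings at the boundary means the encoding only has to be chosen in two places.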

If the 16 bit (2 bytes) encoding is examined in more detail as hex so it is human readable, ```'utf-16-be'```, ```'utf-16-le'``` and ```'utf-16'``` can be compared:

In [155]:
greek_be = bytes('αβγδε', encoding='Utf-16-be').hex()
greek_be

'03b103b203b303b403b5'

In [156]:
greek_le = bytes('αβγδε', encoding='utf-16-le').hex()
greek_le

'b103b203b303b403b503'

In [157]:
greek = bytes('αβγδε', encoding='utf-16').hex()
greek

'fffeb103b203b303b403b503'

The differences are easier to see if groupings of 4 hexadecimal characters are examined; recall that it is 2 bytes per character in ```'utf-16'``` and each byte is in turn represented using 2 hexadecimal characters:

In [158]:
print('utf-16-be', end=':      ')

for i in range(0, len(greek_be), 4):
    print(greek_be [i:i+4], end=' ')

print()    
print('utf-16-le', end=':      ')

for i in range(0, len(greek_le), 4):
    print(greek_le [i:i+4], end=' ')  
    
print()    
print('utf-16', end=':    ')

for i in range(0, len(greek), 4):
    print(greek [i:i+4], end=' ')    

utf-16-be:      03b1 03b2 03b3 03b4 03b5 
utf-16-le:      b103 b203 b303 b403 b503 
utf-16:    fffe b103 b203 b303 b403 b503 

Notice the difference in the first character between ```utf-16-be``` and ```utf-16-le```; for big endian it is encoded as ```03 b1``` with the big order byte being displayed first. For little endian it is encoded as ```b1 03``` with the little order byte being displayed first.

```utf-16``` uses the byte order of the machine, which in the output above is the same as ```utf-16-le```, but it prefixes a Byte Order Marker (BOM). The BOM is the Unicode code point ```U+FEFF```. Here it appears as the hexadecimal values ```ff``` (255) and ```fe``` (254), the two largest values a byte can hold. Because the little order byte ```ff``` is first, the data that follows is little endian; a big endian BOM would appear as ```fe ff```:

In [166]:
0xff

255

In [169]:
0xfe

254
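
The relationship between the BOM and the byte order can be checked with the standard library ```codecs``` module, which exposes the encoded BOM for each byte order. This is a sketch; the value of ```little``` depends on the machine it is run on and the variable names are illustrative only:

```python
import codecs

# The encoded BOM for little endian and for big endian
print(codecs.BOM_UTF16_LE.hex(), codecs.BOM_UTF16_BE.hex())

data = bytes('αβγδε', encoding='utf-16')
little = data.startswith(codecs.BOM_UTF16_LE)   # True on a little endian machine

# Decoding with 'utf-16' reads the BOM, selects the byte order and drops the BOM
text = data.decode(encoding='utf-16')
print(little, text)
```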

```'utf8'``` is defined as a sequence of single bytes in a fixed order and therefore has no encoding issues regarding byte ordering. However, there is a variation ```'utf-8-sig'``` which includes a BOM. This is commonly used in Microsoft applications:

In [170]:
greek_utf8 = bytes('αβγδε', encoding='utf-8').hex()
greek_utf8

'ceb1ceb2ceb3ceb4ceb5'

In [171]:
greek_utf8sig = bytes('αβγδε', encoding='utf-8-sig').hex()
greek_utf8sig

'efbbbfceb1ceb2ceb3ceb4ceb5'

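The first three bytes, ```ef bb bf```, are the BOM encoded in UTF-8; they are followed by ```ce```, the first byte of the text itself.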

The first standard encoding scheme was 'ASCII', which was fixed over 1 byte (8 bit). Recall a byte has:

In [55]:
2 ** 8

256

combinations although only half were used. Initially there were regional encoding schemes (translation tables) which mapped the second half of the combinations to regional characters. For example in the UK, 'Latin1' was used which includes the £ sign:

In [56]:
gb = bytes('£123.45', encoding='Latin1')
gb

b'\xa3123.45'

In [57]:
int('0xa3', base=16)

163

In [58]:
gb.decode(encoding='Latin1')

'£123.45'

This regional encoding scheme spanned the full byte. The problem with early regional encoding was that an operating system was often configured to use a different encoding scheme from the content, such as early websites. The ASCII characters, which were fixed, displayed correctly, but characters outwith ASCII were often substituted. For example:

In [59]:
gb.decode(encoding='Latin2')

'Ł123.45'

In [60]:
gb.decode(encoding='Latin3')

'£123.45'

In [61]:
gb.decode(encoding='Greek')

'£123.45'

In [62]:
gb.decode(encoding='Cyrillic')

'Ѓ123.45'

UTF-16 (16 bits) was created which has:

In [63]:
2 ** 16

65536

combinations and hence allowed more characters. ASCII characters have to be extended to 2 bytes as the byte size is fixed. The restriction in the number of characters led to UTF-32 (32 bits) giving:

In [64]:
2 ** 32

4294967296

combinations. ASCII characters have to be extended to 4 bytes as the byte size is fixed. Finally UTF-8 was created, which is adaptable between 1 byte for ASCII characters, 2 bytes for common Unicode characters and 3-4 bytes for extended Unicode characters. UTF-16 and UTF-32 had variations in byte order, an issue that was also fixed in UTF-8. UTF-8 is the standard used for Unicode strings and will be used from now on.
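
The trade-off between the three schemes can be sketched by counting the bytes used for the characters examined earlier. The explicit big endian variants are an assumption made here so that no BOM is included in the count:

```python
# Number of bytes used for one character under each encoding scheme
sizes = {}
for character in ('a', 'α', '㎛'):
    sizes[character] = tuple(
        len(bytes(character, encoding=encoding))
        for encoding in ('utf-8', 'utf-16-be', 'utf-32-be')
    )
    print(character, sizes[character])
```

Only the UTF-8 column varies with the character; the other two are fixed at 2 and 4 bytes for these characters.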

## Bytes Identifiers

The bytes class uses the design model of a Python object as seen by its method resolution order:

In [65]:
bytes.mro()

[bytes, object]

If the help function is used on bytes, details about all the identifiers will be given:

In [66]:
help(bytes)

Help on class bytes in module builtins:

class bytes(object)
 |  bytes(iterable_of_ints) -> bytes
 |  bytes(string, encoding[, errors]) -> bytes
 |  bytes(bytes_or_buffer) -> immutable copy of bytes_or_buffer
 |  bytes(int) -> bytes object of size given by the parameter initialized with null bytes
 |  bytes() -> empty bytes object
 |  
 |  Construct an immutable array of bytes from:
 |    - an iterable yielding integers in range(256)
 |    - a text string encoded using the specified encoding
 |    - any object implementing the buffer API.
 |    - an integer
 |  
 |  Methods defined here:
 |  
 |  __add__(self, value, /)
 |      Return self+value.
 |  
 |  __bytes__(self, /)
 |      Convert this value to exact type bytes.
 |  
 |  __contains__(self, key, /)
 |      Return key in self.
 |  
 |  __eq__(self, value, /)
 |      Return self==value.
 |  
 |  __ge__(self, value, /)
 |      Return self>=value.
 |  
 |  __getattribute__(self, name, /)
 |      Return getattr(self, name).
 |  
 | 

From the previous notebook, you'll recognise many of these identifiers as they are seen in the string class. For the most part the identifiers between the two classes are consistent however some will differ due to the difference in fundamental unit.

If the methods are compared:

In [67]:
for identifier in dir(str):
    isfunction = callable(getattr(str, identifier))
    isdatamodel = identifier[0] == '_'
    if (isfunction and not isdatamodel):
        print(identifier, end=' ')

capitalize casefold center count encode endswith expandtabs find format format_map index isalnum isalpha isascii isdecimal isdigit isidentifier islower isnumeric isprintable isspace istitle isupper join ljust lower lstrip maketrans partition removeprefix removesuffix replace rfind rindex rjust rpartition rsplit rstrip split splitlines startswith strip swapcase title translate upper zfill 

In [68]:
for identifier in dir(bytes):
    isfunction = callable(getattr(bytes, identifier))
    isdatamodel = identifier[0] == '_'
    if (isfunction and not isdatamodel):
        print(identifier, end=' ')

capitalize center count decode endswith expandtabs find fromhex hex index isalnum isalpha isascii isdigit islower isspace istitle isupper join ljust lower lstrip maketrans partition removeprefix removesuffix replace rfind rindex rjust rpartition rsplit rstrip split splitlines startswith strip swapcase title translate upper zfill 

And if a check is made for methods in the bytes class but not in the str class, there are only 3 additions:

In [69]:
for identifier in dir(bytes):
    isfunction = callable(getattr(bytes, identifier))
    isinstr = identifier in dir(str)
    isdatamodel = identifier[0] == '_'
    if (isfunction and not isdatamodel and not isinstr):
        print(identifier, end=' ')

decode fromhex hex 
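The same comparison can be sketched with set operations, which also makes it easy to look in the other direction (methods in str but not in bytes). The helper name public_methods is hypothetical and used only for illustration, and the exact results depend on the Python version:

```python
def public_methods(cls):
    """Return the names of callable attributes of cls that do not start with an underscore."""
    return {name for name in dir(cls)
            if callable(getattr(cls, name)) and not name.startswith('_')}


str_methods = public_methods(str)
bytes_methods = public_methods(bytes)

# Set difference: names present in one class but missing from the other
bytes_only = sorted(bytes_methods - str_methods)
str_only = sorted(str_methods - bytes_methods)

print(bytes_only)
print(str_only)
```

Going by the listings above, the str-only names should include encode (the counterpart of decode) along with text-specific methods such as casefold, format and isidentifier, which have no meaning for raw bytes.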

Likewise if the data model methods are compared:

In [70]:
for identifier in dir(str):
    isfunction = callable(getattr(str, identifier))
    isdatamodel = identifier[0] == '_'
    if (isfunction and isdatamodel):
        print(identifier, end=' ')

__add__ __class__ __contains__ __delattr__ __dir__ __eq__ __format__ __ge__ __getattribute__ __getitem__ __getnewargs__ __getstate__ __gt__ __hash__ __init__ __init_subclass__ __iter__ __le__ __len__ __lt__ __mod__ __mul__ __ne__ __new__ __reduce__ __reduce_ex__ __repr__ __rmod__ __rmul__ __setattr__ __sizeof__ __str__ __subclasshook__ 

In [71]:
for identifier in dir(bytes):
    isfunction = callable(getattr(bytes, identifier))
    isdatamodel = identifier[0] == '_'
    if (isfunction and isdatamodel):
        print(identifier, end=' ')

__add__ __bytes__ __class__ __contains__ __delattr__ __dir__ __eq__ __format__ __ge__ __getattribute__ __getitem__ __getnewargs__ __getstate__ __gt__ __hash__ __init__ __init_subclass__ __iter__ __le__ __len__ __lt__ __mod__ __mul__ __ne__ __new__ __reduce__ __reduce_ex__ __repr__ __rmod__ __rmul__ __setattr__ __sizeof__ __str__ __subclasshook__ 

And if a check is made for the data model methods in the bytes class but not in the str class, there is only 1 addition:

In [72]:
for identifier in dir(bytes):
    isfunction = callable(getattr(bytes, identifier))
    isinstr = identifier in dir(str)
    isdatamodel = identifier[0] == '_'
    if (isfunction and isdatamodel and not isinstr):
        print(identifier, end=' ')

__bytes__ 

Most methods and data model methods share the same names between the two classes and behave consistently. This is because both classes are immutable collections and therefore follow the same design pattern. Methods that return a new str instance in the str class will generally return a new bytes instance in the bytes class. The individual unit of a Unicode string is a Unicode character and the individual unit of a bytes instance is a byte. The data model method \_\_len\_\_, which maps to the builtins function len, returns how many of these individual units are in an instance:

In [73]:
english_b = b'abcde'
greek_b = bytes('αβγδε', encoding='UTF-8')
english_s = 'abcde'
greek_s = 'αβγδε'

Notice that the len of both english_s and greek_s is 5, as each contains 5 Unicode characters:

In [74]:
len(english_s)

5

In [75]:
len(greek_s)

5

Notice that the len of english_b is also 5 as each ASCII character spans 1 byte, however greek_b has a length of 10 as each Greek character spans 2 bytes in UTF-8:

In [76]:
len(english_b)

5

In [77]:
len(greek_b)

10
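As a minimal sketch of why the two lengths differ, the str method encode and the bytes method decode can be used to move between the two classes. The variable names below are chosen for illustration only, and UTF-8 is assumed as the encoding throughout:

```python
text = 'αβγδε'
encoded = text.encode('utf-8')      # str -> bytes
decoded = encoded.decode('utf-8')   # bytes -> str

print(len(text), len(encoded))      # characters, then bytes
print(decoded == text)              # the round trip recovers the original str

# Number of bytes UTF-8 needs for an ASCII letter, a Greek letter and the euro sign
bytes_per_char = [len(character.encode('utf-8')) for character in 'aα€']
print(bytes_per_char)               # [1, 2, 3]
```

UTF-8 is a variable-width encoding, so the len of a bytes instance only matches the len of the corresponding str when every character is ASCII.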

Recall that the data model identifier \_\_getitem\_\_ defines the behaviour when indexing with square brackets. For a string, the Unicode character at that index is returned; for a bytes instance, the byte at that index is returned as an int:

In [78]:
greek_s[0]

'α'

In [79]:
greek_b[0]

206

In [80]:
greek_b.hex()

'ceb1ceb2ceb3ceb4ceb5'

In [81]:
hex(206)

'0xce'

In [82]:
0xce

206
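Iterating over a bytes instance behaves the same way as indexing and yields int values. The sketch below, which recreates greek_b so that it is self-contained, also passes those int values back to bytes, relying on the constructor form that accepts an iterable of integers in range(256):

```python
greek_b = 'αβγδε'.encode('utf-8')

first_two = list(greek_b[:2])   # iterating over bytes yields int values
print(first_two)                # [206, 177]

rebuilt = bytes(first_two)      # bytes accepts an iterable of ints in range(256)
print(rebuilt)                  # b'\xce\xb1'
print(rebuilt.decode('utf-8'))  # α
```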

Slicing, unlike indexing, returns a bytes instance. The difference can be seen when a slice selects a single byte:

In [83]:
greek_b[:1]

b'\xce'

The syntax is otherwise consistent with slicing in the str class:

In [84]:
greek_b[:2:]

b'\xce\xb1'

In [85]:
greek_b[::2]

b'\xce\xce\xce\xce\xce'

In [86]:
greek_b[1::2]

b'\xb1\xb2\xb3\xb4\xb5'
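Because a slice of a bytes instance counts bytes rather than characters, a slice can cut a multi-byte character in half. The following sketch, assuming UTF-8 and using illustrative variable names, decodes a slice that covers a whole character and one that covers only its first byte:

```python
greek_b = 'αβγδε'.encode('utf-8')

whole_char = greek_b[:2].decode('utf-8')   # both bytes of the first character
print(whole_char)                          # α

partial_error = None
try:
    greek_b[:1].decode('utf-8')            # only the first byte of the character
except UnicodeDecodeError as error:
    partial_error = error
    print(type(error).__name__)            # UnicodeDecodeError
```

This is worth keeping in mind when slicing encoded text: slice boundaries need to fall on character boundaries if the result is to be decoded.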

Earlier the method hex was examined:

In [87]:
? greek_b.hex

Docstring:
Create a string of hexadecimal numbers from a bytes object.

  sep
    An optional single character or byte to separate hex bytes.
  bytes_per_sep
    How many bytes between separators.  Positive values count from the
    right, negative values count from the left.

Example:
>>> value = b'\xb9\x01\xef'
>>> value.hex()
'b901ef'
>>> value.hex(':')
'b9:01:ef'
>>> value.hex(':', 2)
'b9:01ef'
>>> value.hex(':', -2)
'b901:ef'
Type:      builtin_function_or_method

In [88]:
greek_b.hex()

'ceb1ceb2ceb3ceb4ceb5'
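The optional sep and bytes_per_sep parameters described in the docstring can make the output easier to read. This is a small sketch applied to the Greek example; grouping by 2 happens to line up with the characters only because every character here spans exactly 2 bytes:

```python
greek_b = 'αβγδε'.encode('utf-8')

per_byte = greek_b.hex(' ')      # a separator between every byte
per_char = greek_b.hex(' ', 2)   # a separator between every 2 bytes

print(per_byte)                  # ce b1 ce b2 ce b3 ce b4 ce b5
print(per_char)                  # ceb1 ceb2 ceb3 ceb4 ceb5

# Equivalent formatting built from the individual int values
manual = ' '.join(f'{byte:02x}' for byte in greek_b)
print(manual == per_byte)        # True
```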

There is a class method fromhex, an alternative constructor that creates a bytes instance from a string of hexadecimal numbers:

In [89]:
? bytes.fromhex

Signature: bytes.fromhex(string, /)
Docstring:
Create a bytes object from a string of hexadecimal numbers.

Spaces between two numbers are accepted.
Example: bytes.fromhex('B9 01EF') -> b'\\xb9\\x01\\xef'.
Type:      builtin_function_or_method

Note that a class method is normally called from the class and returns a new instance:

In [90]:
bytes.fromhex('ceb1ceb2ceb3ceb4ceb5')

b'\xce\xb1\xce\xb2\xce\xb3\xce\xb4\xce\xb5'
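The methods hex and fromhex are inverses of one another, which the sketch below checks with a round trip. It also shows the spaced input mentioned in the docstring, and that a class method can be called from an instance, although calling it from the class is the usual style. The variable names are illustrative only:

```python
greek_b = 'αβγδε'.encode('utf-8')

round_trip = bytes.fromhex(greek_b.hex())   # bytes -> hex str -> bytes
print(round_trip == greek_b)                # True

spaced = bytes.fromhex('ce b1 ce b2')       # spaces between numbers are accepted
print(spaced.decode('utf-8'))               # αβ

from_instance = b''.fromhex('ceb1')         # works, but bytes.fromhex is clearer
print(from_instance)                        # b'\xce\xb1'
```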