# The bytes class

In the previous notebook the Unicode string class ```str``` was examined and was seen to have a Unicode character as its fundamental unit. The byte string class ```bytes```, on the other hand, uses a byte as its fundamental unit. The byte string was the foundation for text data in Python 2.

## Categorize_Identifiers Module

This notebook will use the functions ```dir2```, ```variables``` and ```view``` from the custom module ```categorize_identifiers```, which is found in the same directory as this notebook file. ```dir2``` is a variant of ```dir``` that groups identifiers into a ```dict``` under categories and ```variables``` is an IPython-based variable inspector. ```view``` is used to view a ```Collection``` in more detail:

In [1]:
from categorize_identifiers import dir2, variables, view

## Conception

A computer stores data using bits. A bit can be conceptualised as a single dip switch with the values Off and On, as shown below. A single switch has the possible values ```0``` and ```1```, which is ```2 ** 1``` combinations, a total of ```2```. Note the combination ```0``` is included so ```0:2``` is inclusive of the lower bound ```0``` and exclusive of the upper bound ```2```.

<img src='./images/img_001.png' alt='img_001' width='400'/>

More typically ```8``` of these switches are combined into a single logical unit called a byte. A byte has ```2 ** 8``` combinations which is a total of ```256```. Python uses zero-order indexing, so the number of possible values ```0:256``` is inclusive of the lower bound ```0``` and exclusive of the upper bound ```256```. Zero-order indexing was explored when indexing into a ```str``` instance and when initialising ```slice``` instances and ```range``` instances:

<img src='./images/img_002.png' alt='img_002' width='400'/>

The decimal number can be calculated from the byte above. Like an ordinary decimal number, the units are on the right hand side of the byte. What differs is that each digit represents a power of ```2``` (binary) as opposed to a power of ```10``` (decimal):

In [2]:
+ 0 * (2 ** 7) \
+ 1 * (2 ** 6) \
+ 1 * (2 ** 5) \
+ 0 * (2 ** 4) \
+ 1 * (2 ** 3) \
+ 0 * (2 ** 2) \
+ 0 * (2 ** 1) \
+ 0 * (2 ** 0) 

104
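
The place-value sum above can also be sketched programmatically. The names ```bits``` and ```total``` below are illustrative and not part of the original notebook:

```python
bits = '01101000'

# Pair each bit with its power of 2, starting from the units on the right
total = sum(int(bit) * 2 ** power for power, bit in enumerate(reversed(bits)))
print(total)               # 104

# The builtins int class carries out the same conversion when given base=2
print(int(bits, base=2))   # 104
```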

The byte above can be expressed as a binary number using the prefix ```0b```. This prefix is used to distinguish the number from a standard decimal integer. Notice that this prefix is highlighted in VSCode using a different colour to distinguish the prefix from the data. The decimal equivalent will be returned in the cell output:

In [3]:
0b01101000

104

Leading zeros are normally omitted:

In [4]:
0b1101000

104

Although in this context, it is useful to show the leading zeros, so all 8 bits in the byte can be visualised.

A byte string is essentially a collection of individual bytes:

<img src='./images/img_003.png' alt='img_003' width='400'/>

In [5]:
0b01101000

104

In [6]:
0b01100101

101

In [7]:
0b01101100

108

In [8]:
0b01101100

108

In [9]:
0b01101111

111

These numbers can be grouped into a collection such as a ```tuple```:

In [10]:
(0b01101000, 0b01100101, 0b01101100, 0b01101100, 0b01101111)

(104, 101, 108, 108, 111)

This ```tuple``` collection can be cast into ```bytes``` giving text information:

In [11]:
bytes((0b01101000, 0b01100101, 0b01101100, 0b01101100, 0b01101111))

b'hello'

In [12]:
bytes((104, 101, 108, 108, 111))

b'hello'

In [13]:
view(bytes((104, 101, 108, 108, 111)))

Index 	 Type                 	 Size   	 Value                         
0 	 int                  	 1      	 104                            	
1 	 int                  	 1      	 101                            	
2 	 int                  	 1      	 108                            	
3 	 int                  	 1      	 108                            	
4 	 int                  	 1      	 111                            	
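
The table above suggests that each element of a ```bytes``` instance is an ```int```. A minimal sketch using only ```builtins``` (the name ```data``` is illustrative) shows the same thing without the custom ```view``` function:

```python
data = bytes((104, 101, 108, 108, 111))

# Indexing a single element returns an int
print(data[0])       # 104

# Casting to a list shows every byte as an int
print(list(data))    # [104, 101, 108, 108, 111]

# Slicing returns another bytes instance
print(data[0:1])     # b'h'
```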


## ASCII Characters

Recall that the American Standard Code for Information Interchange (ASCII) maps each byte to a physical command or English character. The physical commands were used to control primitive computers that were essentially typewriter based:

<img src='./images/img_004.png' alt='img_004' width='600'/>

ASCII defines 128 codes, which span the first half of the possible byte values. The first 33 of these (decimal 0 to 32) are commands:

|byte|hex|num|command|
|---|---|---|---|
|00000000|00|000|null|
|00000001|01|001|start of heading|
|00000010|02|002|start of text|
|00000011|03|003|end of text|
|00000100|04|004|end of transmission|
|00000101|05|005|enquiry|
|00000110|06|006|acknowledge|
|00000111|07|007|bell|
|00001000|08|008|**backspace**|
|00001001|09|009|**horizontal tab**|
|00001010|0a|010|**new line**|
|00001011|0b|011|**vertical tab**|
|00001100|0c|012|**form feed**|
|00001101|0d|013|**carriage return**|
|00001110|0e|014|shift out|
|00001111|0f|015|shift in|
|00010000|10|016|data link escape|
|00010001|11|017|device control 1|
|00010010|12|018|device control 2|
|00010011|13|019|device control 3|
|00010100|14|020|device control 4|
|00010101|15|021|negative acknowledge|
|00010110|16|022|synchronous idle|
|00010111|17|023|end of transmission block|
|00011000|18|024|cancel|
|00011001|19|025|end of medium|
|00011010|1a|026|substitute|
|00011011|1b|027|**escape**|
|00011100|1c|028|file separator|
|00011101|1d|029|group separator|
|00011110|1e|030|record separator|
|00011111|1f|031|unit separator|
|00100000|20|032|**space**|

The remaining codes, spanning up to half of the possible byte values, contain the characters most commonly used in the English language:

|byte|hex|num|character|
|---|---|---|---|
|00100001|21|033|!|
|00100010|22|034|"|
|00100011|23|035|#|
|00100100|24|036|$|
|00100101|25|037|%|
|00100110|26|038|&|
|00100111|27|039|'|
|00101000|28|040|(|
|00101001|29|041|)|
|00101010|2a|042|*|
|00101011|2b|043|+|
|00101100|2c|044|,|
|00101101|2d|045|-|
|00101110|2e|046|.|
|00101111|2f|047|/|
|00110000|30|048|0|
|00110001|31|049|1|
|00110010|32|050|2|
|00110011|33|051|3|
|00110100|34|052|4|
|00110101|35|053|5|
|00110110|36|054|6|
|00110111|37|055|7|
|00111000|38|056|8|
|00111001|39|057|9|
|00111010|3a|058|:|
|00111011|3b|059|;|
|00111100|3c|060|<|
|00111101|3d|061|=|
|00111110|3e|062|>|
|00111111|3f|063|?|
|01000000|40|064|@|
|01000001|41|065|A|
|01000010|42|066|B|
|01000011|43|067|C|
|01000100|44|068|D|
|01000101|45|069|E|
|01000110|46|070|F|
|01000111|47|071|G|
|01001000|48|072|H|
|01001001|49|073|I|
|01001010|4a|074|J|
|01001011|4b|075|K|
|01001100|4c|076|L|
|01001101|4d|077|M|
|01001110|4e|078|N|
|01001111|4f|079|O|
|01010000|50|080|P|
|01010001|51|081|Q|
|01010010|52|082|R|
|01010011|53|083|S|
|01010100|54|084|T|
|01010101|55|085|U|
|01010110|56|086|V|
|01010111|57|087|W|
|01011000|58|088|X|
|01011001|59|089|Y|
|01011010|5a|090|Z|
|01011011|5b|091|[|
|01011100|5c|092|\|
|01011101|5d|093|]|
|01011110|5e|094|^|
|01011111|5f|095|_|
|01100000|60|096|`|
|01100001|61|097|a|
|01100010|62|098|b|
|01100011|63|099|c|
|01100100|64|100|d|
|01100101|65|101|e|
|01100110|66|102|f|
|01100111|67|103|g|
|01101000|68|104|h|
|01101001|69|105|i|
|01101010|6a|106|j|
|01101011|6b|107|k|
|01101100|6c|108|l|
|01101101|6d|109|m|
|01101110|6e|110|n|
|01101111|6f|111|o|
|01110000|70|112|p|
|01110001|71|113|q|
|01110010|72|114|r|
|01110011|73|115|s|
|01110100|74|116|t|
|01110101|75|117|u|
|01110110|76|118|v|
|01110111|77|119|w|
|01111000|78|120|x|
|01111001|79|121|y|
|01111010|7a|122|z|
|01111011|7b|123|{|
|01111100|7c|124|\||
|01111101|7d|125|}|
|01111110|7e|126|~|
|01111111|7f|127|delete (command)|


Recall the ```string``` module contains the printable ASCII characters:

In [14]:
import string

In the ```bytes``` class, the translation table for an ASCII character is always the same. Therefore instead of displaying the byte for that character, the ASCII character is shown:

In [15]:
string.printable

'0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'
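
As a quick check, assuming the ```ord``` function from the previous notebook, every printable character can be confirmed to sit in the first half of the byte. The name ```code_points``` is illustrative:

```python
import string

# The ordinal (decimal value) of every printable ASCII character
code_points = [ord(character) for character in string.printable]

print(len(code_points))        # 100
print(max(code_points) < 128)  # True
```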

## Initialisation Signature

The initialisation signature for the ```bytes``` class can be examined:

In [16]:
bytes?

Init signature: bytes(self, /, *args, **kwargs)
Docstring:
bytes(iterable_of_ints) -> bytes
bytes(string, encoding[, errors]) -> bytes
bytes(bytes_or_buffer) -> immutable copy of bytes_or_buffer
bytes(int) -> bytes object of size given by the parameter initialized with null bytes
bytes() -> empty bytes object

Construct an immutable array of bytes from:
  - an iterable yielding integers in range(256)
  - a text string encoded using the specified encoding
  - any object implementing the buffer API.
  - an integer
Type:           type
Subclasses:     bytes_

For the ```bytes``` string class, the docstring lists 5 alternative ways of supplying instance data beneath the generic initialisation signature:

```python
bytes(self, /, *args, **kwargs)
bytes(iterable_of_ints) -> bytes
bytes(string, encoding[, errors]) -> bytes
bytes(bytes_or_buffer) -> immutable copy of bytes_or_buffer
bytes(int) -> bytes object of size given by the parameter initialized with null bytes
bytes() -> empty bytes object
```

If the generic signature on the first line is examined:

```python
bytes(self, /, *args, **kwargs)
```

* The parentheses ```( )``` are used to call a function and supply any necessary input arguments.
* The comma ```,``` is used as a delimiter to separate out any input arguments.
* ```self``` is used to denote *this instance*, in other words the instance being constructed. It is supplied by Python itself and is not provided by the caller.
* Any input argument before a ```/``` must be provided positionally.
* ```*args``` indicates a variable number of additional positional input arguments. These are typically not used for initialisation of the ```bytes``` string class.
* ```**kwargs``` indicates a variable number of additional named input arguments. These are typically not used for initialisation of the ```bytes``` string class.

A ```bytes``` instance can be instantiated by supplying an existing ```bytes``` instance to the ```bytes``` class:

In [17]:
bytes(b'Hello World!')

b'Hello World!'

However because the ```bytes``` class is a fundamental datatype it can also be instantiated shorthand using the following:

In [20]:
b'Hello World!'

b'Hello World!'

All of the characters above in the ```bytes``` instance are ASCII printable characters. Therefore each ```byte``` in the ```bytes``` instance above is represented by its corresponding ASCII character:

In [25]:
view(b'Hello World!')

Index 	 Type                 	 Size   	 Value                         
0 	 int                  	 1      	 72                             	
1 	 int                  	 1      	 101                            	
2 	 int                  	 1      	 108                            	
3 	 int                  	 1      	 108                            	
4 	 int                  	 1      	 111                            	
5 	 int                  	 1      	 32                             	
6 	 int                  	 1      	 87                             	
7 	 int                  	 1      	 111                            	
8 	 int                  	 1      	 114                            	
9 	 int                  	 1      	 108                            	
10 	 int                  	 1      	 100                            	
11 	 int                  	 1      	 33                             	


In [19]:
string.printable

'0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'

A ```bytes``` instance can be initialised using an iterable such as a ```tuple``` of ```int``` instances:

```python
bytes(iterable_of_ints, /) -> bytes
```

For this to work each ```int``` must be a valid byte value. Recall that a byte looks like the following:

<img src='./images/img_002.png' alt='img_002' width='400'/>

And a byte has ```2 ** 8``` combinations which is a total of ```256```. A ```tuple``` with a single ```int``` element can be provided. Note that a trailing comma is required to distinguish a single element ```tuple``` from a numeric calculation using parenthesis:

In [28]:
num = (97)

In [29]:
type(num)

int

In [30]:
archive = (97, )

In [31]:
type(archive)

tuple

From the above ASCII table the integer ```97``` corresponds to the character ```a``` and this can be seen when the ```tuple``` is cast to ```bytes```:

In [32]:
bytes((97, ))

b'a'

When an ```int``` reaches or exceeds the upper bound ```256``` (valid values go up to and exclude ```256``` due to zero-order indexing), a ```ValueError``` is raised:

```python
bytes((256, ))
```
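
The boundary behaviour can be sketched with a ```try``` and ```except``` code block. The ```results``` dictionary is an illustrative name, and the negative value is included as an additional assumption to show that the lower bound is also enforced:

```python
results = {}

for value in (-1, 0, 255, 256):
    try:
        # Valid byte values are those in range(256)
        results[value] = bytes((value, ))
    except ValueError:
        results[value] = 'ValueError'

print(results[0])    # b'\x00'
print(results[255])  # b'\xff'
print(results[256])  # ValueError
print(results[-1])   # ValueError
```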

Normally the ```tuple``` will contain more than one ```int``` instance and each of these will be a valid byte value:

In [33]:
integers = (104, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100, 33)
bytes(integers)

b'hello world!'

Recall that the decimal ```int``` instance ```104``` can be represented in binary. The ```bin``` function will return this binary number as a Unicode ```str```:

In [34]:
bin(104)

'0b1101000'

It is useful to see this with leading zeros. To do this the ```str``` instance methods ```removeprefix``` and ```zfill``` can be used alongside ```str``` instance concatenation:

In [35]:
'0b' + bin(104).removeprefix('0b').zfill(8)

'0b01101000'
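
An alternative sketch, which is not the approach used in this notebook, relies on the format specification mini-language through the ```format``` function or a formatted string:

```python
# '08b' pads the binary digits with zeros to a width of 8, without a prefix
print(format(104, '08b'))     # 01101000

# '#' adds the 0b prefix, which counts towards the total width of 10
print(format(104, '#010b'))   # 0b01101000

# '02x' gives 2 lowercase hexadecimal characters
print(f'{104:02x}')           # 68
```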

The Unicode ```str``` instance corresponding to this byte can be retrieved using the ```chr``` function:

In [36]:
chr(104)

'h'

This ```bytes``` instance consists of ```12``` bytes. The decimal ```int```, binary integer (without prefix) and ASCII character can be shown. Note the space corresponding to a decimal integer of ```32``` is also an ASCII character:

In [37]:
for number in integers:
    print(str(number).center(8), end=' ')

print()
for number in integers:
    print(bin(number).removeprefix('0b').zfill(8), end=' ')

print()
for number in integers:
    print(chr(number).center(8), end=' ')

  104      101      108      108      111       32      119      111      114      108      100       33    
01101000 01100101 01101100 01101100 01101111 00100000 01110111 01101111 01110010 01101100 01100100 00100001 
   h        e        l        l        o                 w        o        r        l        d        !     

When every byte is a printable ASCII character it will be displayed instead of the byte sequence:

In [38]:
bytes(integers)

b'hello world!'

When the byte is not printable for example, the ```int``` ```0``` which corresponds to the non-printable command NULL, it is represented using a hexadecimal escape sequence:

In [39]:
bytes((0,))

b'\x00'

All the whitespace characters, with the exception of the space, are represented using an escape sequence. For the commonly used tab, newline and carriage return characters, there are the escape characters ```\t``` (decimal 9), ```\n``` (decimal 10) and ```\r``` (decimal 13). The vertical tab and form feed are less commonly used and are represented by their hexadecimal escape sequences ```\x0b``` (decimal 11) and ```\x0c``` (decimal 12):

In [40]:
string.whitespace

' \t\n\r\x0b\x0c'

In [41]:
integers = (9, 10, 11, 12, 13)
bytes(integers)

b'\t\n\x0b\x0c\r'

Binary ```0b00001100``` is not very human-readable and therefore can be mistranscribed by a human. To make a byte more human readable hexadecimal is introduced. In hexadecimal the byte is split into 2 and each half byte is represented as a hexadecimal character:

<img src='./images/img_005.png' alt='img_005' width='400'/>

Recall **b**inary has ```2``` digits and the prefix ```0b```, decimal has ```10``` digits and no prefix because it is the most commonly used numbering system. He**x**adecimal has ```16``` digits and the prefix ```0x```.

Hexadecimal takes the first ```10``` digits from decimal and supplements them with the first ```6``` letters in the alphabet. The number of combinations in half a byte is:

In [42]:
2 ** 4

16

Which means each hexadecimal character perfectly maps to a 4 bit binary sequence:

|(0b) binary|(0x) hexadecimal character|decimal character|
|---|---|---|
|0000|0|0|
|0001|1|1|
|0010|2|2|
|0011|3|3|
|0100|4|4|
|0101|5|5|
|0110|6|6|
|0111|7|7|
|1000|8|8|
|1001|9|9|
|1010|a|10|
|1011|b|11|
|1100|c|12|
|1101|d|13|
|1110|e|14|
|1111|f|15|

Although uppercase and lowercase can be used to represent a hexadecimal character, notice that the Python interpreter prefers lowercase:

In [43]:
integers = (11, 12)
bytes(integers)

b'\x0b\x0c'

The reason for this is lowercase is easier to transcribe:

```python
'ABB4AB8A'
```

When reading the above quickly, notice the similarity between A and 4 and B and 8:

```python
'abb4ab8a'
```

The lowercase characters are more clearly distinguished.

The following ```bytes``` instance:

<img src='./images/img_005.png' alt='img_005' width='400'/>

Is the binary number:

In [44]:
0b00001100

12

The value returned in the cell output displays the decimal integer. The ```hex``` function can be used to cast a decimal ```int``` into a Unicode ```str``` of a hexadecimal character:

In [45]:
hex(0b00001100)

'0xc'

The hexadecimal value is displayed without the leading zero, so the first half byte is not shown. This can be added for clarity:

In [46]:
'0x' + hex(12).removeprefix('0x').zfill(2)

'0x0c'

In [47]:
'0b' + bin(12).removeprefix('0b').zfill(8)

'0b00001100'

And for clarity:

In [48]:
print(bin(12).removeprefix('0b').zfill(8)[:4], bin(12).removeprefix('0b').zfill(8)[4:])
print(hex(12).removeprefix('0x').zfill(2)[:1].center(4), hex(12).removeprefix('0x').zfill(2)[1:].center(4))

0000 1100
 0    c  
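
The split into two half bytes can also be sketched arithmetically with the ```builtins``` function ```divmod```, since dividing by ```16``` separates the upper 4 bits from the lower 4 bits. The names ```high``` and ```low``` are illustrative:

```python
value = 0b00001100

# The quotient is the upper half byte and the remainder is the lower half byte
high, low = divmod(value, 16)
print(high, low)                            # 0 12

# Each half byte maps to a single hexadecimal character
print(format(high, 'x'), format(low, 'x'))  # 0 c
```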


When the byte sequence contains a byte that maps to a whitespace character, a non-printable command or is unmapped to an ASCII character there is no corresponding character to display and an escape sequence instead displays. This can be seen when using the first ```32``` integers and integers above ```127```:

In [49]:
integers = (0, 1, 2, 29, 30, 31, 128, 129, 130, 253, 254, 255)
bytes(integers)

b'\x00\x01\x02\x1d\x1e\x1f\x80\x81\x82\xfd\xfe\xff'

Notice that each character is inserted using its own hexadecimal escape sequence prefix ```\x``` and this instruction must include two hexadecimal characters i.e. include a leading zero where applicable. 

When a ```bytes``` instance contains byte sequences that map to characters these characters will be displayed instead of the hexadecimal escape sequence:

In [50]:
integers = (32, 65, 80, 120)
bytes(integers)

b' APx'

The tab ```\t```, newline ```\n```, carriage return ```\r``` and backslash ```\\``` itself all have single escape characters:

In [51]:
integers = (9, 10, 13, 92)
bytes(integers)

b'\t\n\r\\'

A ```bytes``` instance containing all of these initially appears confusing:

In [52]:
integers = (0, 1, 2, 9, 10, 13, 29, 30, 31, 32, 65, 92, 97, 128, 129, 130, 253, 254, 255)
bytes(integers)

b'\x00\x01\x02\t\n\r\x1d\x1e\x1f A\\a\x80\x81\x82\xfd\xfe\xff'

For this reason the ```bytes``` class has the method ```hex``` which returns a Unicode ```str``` instance of the hexadecimal values without any of the escape sequences:

In [53]:
bytes(integers).hex()

'000102090a0d1d1e1f20415c61808182fdfeff'

Note that the ```bytes``` class's ```hex``` method differs from the ```builtins``` function ```hex```, which provides a ```0x``` prefix:

In [54]:
bytes((12, )).hex()

'0c'

In [55]:
hex(12)

'0xc'

The ```builtins``` function ```hex``` can process a **single** large integer that exceeds 1 byte:

In [56]:
hex(256)

'0x100'

Whereas the ```bytes``` class's ```hex``` method processes **multiple** integers that are within the constraints of a byte:

In [57]:
bytes((12, 34)).hex()

'0c22'

The following ```bytes``` instance can be represented as a ```str``` of hexadecimal characters with 2 hexadecimal characters for each byte using the ```bytes``` class method ```hex```:

In [58]:
b'hello'.hex()

'68656c6c6f'

Notice when each of the ASCII characters is supplied using a hexadecimal escape sequence, the default representation simplifies the output displaying the ASCII character:

In [60]:
b'\x68\x65\x6c\x6c\x6f'

b'hello'

The ```bytes``` class has the class method ```fromhex``` which carries out the reverse operation and creates a ```bytes``` instance from a string of hexadecimal characters:

In [61]:
bytes.fromhex('68656c6c6f')

b'hello'
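
The round trip between ```hex``` and ```fromhex``` can be sketched as follows. This assumes Python 3.8 or later, where the ```hex``` method accepts a separator, and the names ```data``` and ```spaced``` are illustrative:

```python
data = b'hello'

# A separator makes the 2 hexadecimal characters per byte easier to read
spaced = data.hex(' ')
print(spaced)                          # 68 65 6c 6c 6f

# fromhex skips the whitespace, so the original bytes instance is recovered
print(bytes.fromhex(spaced) == data)   # True
```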

To recap:

In [62]:
integers = (104, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100, 33)
bytes(integers)

for number in integers:
    print(chr(number).center(8), end=' ')
    
print()
for number in integers:
    print(str(number).center(8), end=' ')

print()
for number in integers:
    print(bin(number).removeprefix('0b').zfill(8), end=' ')

print()
for number in integers:
    print((hex(number).removeprefix('0x')).center(8), end=' ')    

print()
for number in integers:
    print((r'0x' + hex(number).removeprefix('0x')).center(8), end=' ')

print()
for number in integers:
    print((r'\x' + hex(number).removeprefix('0x')).center(8), end=' ')
    


   h        e        l        l        o                 w        o        r        l        d        !     
  104      101      108      108      111       32      119      111      114      108      100       33    
01101000 01100101 01101100 01101100 01101111 00100000 01110111 01101111 01110010 01101100 01100100 00100001 
   68       65       6c       6c       6f       20       77       6f       72       6c       64       21    
  0x68     0x65     0x6c     0x6c     0x6f     0x20     0x77     0x6f     0x72     0x6c     0x64     0x21   
  \x68     \x65     \x6c     \x6c     \x6f     \x20     \x77     \x6f     \x72     \x6c     \x64     \x21   

A ```bytes``` instance can be instantiated from a Unicode ```str```, however the named parameter ```encoding``` needs to be supplied. This gives the instructions for encoding each Unicode character as bytes, including characters outside the ASCII range. In this case the simplest encoding ```'ascii'``` seen above is supplied, indicating all characters are within the ASCII range. Generally the current standard ```'utf-8'``` is used, which will be explained in more detail in the next section:

```python
bytes(string, /, encoding[, errors]) -> bytes
```

In [63]:
bytes('hello', encoding='ascii')

b'hello'

In [64]:
bytes('hello', encoding='utf-8')

b'hello'

Not supplying ```encoding``` gives a ```TypeError```:

```python
bytes('hello')
```

Supplying a Unicode character that is not ASCII and specifying ASCII encoding will give a ```UnicodeEncodeError```:

```python
bytes('α', encoding='ascii')
```
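
The optional ```errors``` parameter in the signature controls what happens when a character cannot be encoded. The sketch below uses three of the standard error handlers. The variable names are illustrative:

```python
try:
    bytes('α', encoding='ascii')
except UnicodeEncodeError as error:
    print(type(error).__name__)   # UnicodeEncodeError

# Substitute a question mark, drop the character, or write an escape sequence
replaced = bytes('α', encoding='ascii', errors='replace')
ignored = bytes('α', encoding='ascii', errors='ignore')
escaped = bytes('α', encoding='ascii', errors='backslashreplace')

print(replaced)   # b'?'
print(ignored)    # b''
print(escaped)    # b'\\u03b1'
```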

A ```bytes``` instance can be cast from an existing ```bytes``` instance:

```python
bytes(bytes_or_buffer, /) -> immutable copy of bytes_or_buffer
```

A ```bytes``` instance can be instantiated by casting a ```bytearray```. The ```bytearray``` is the mutable counterpart to the ```bytes``` class:

In [65]:
ba = bytearray(b'hello')

In [66]:
bytes(ba)

b'hello'

Immutable means the ```bytes``` sequence cannot be modified once constructed. Mutable means a ```bytearray``` sequence can be modified once constructed.
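
A minimal sketch of the difference, using illustrative names, is to attempt to replace the first byte in each:

```python
data = b'hello'

try:
    # bytes is immutable, so item assignment is not supported
    data[0] = 72
except TypeError as error:
    print('TypeError:', error)

# bytearray is mutable, so the first byte can be replaced with 72 (H)
ba = bytearray(data)
ba[0] = 72
print(ba)   # bytearray(b'Hello')
```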

A NULL ```bytes``` instance can also be initialised from an ```int```. The ```int``` specifies the number of NULL bytes and is not cast into an individual byte, as was seen when an ```int``` was provided via a ```tuple```:

```python
bytes(int, /) -> bytes object of size given by the parameter initialized with null bytes
```

For example a ```bytes``` instance occupying 1 byte can be instantiated:

In [67]:
bytes(1)

b'\x00'

And another one occupying 4 bytes can be instantiated:

In [68]:
bytes(4)

b'\x00\x00\x00\x00'
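
The difference between supplying an ```int``` directly and supplying it inside a ```tuple``` can be seen side by side in this short sketch:

```python
# An int on its own gives that many NULL bytes
print(bytes(4))        # b'\x00\x00\x00\x00'

# An int inside a tuple is cast to a single byte with that value
print(bytes((4, )))    # b'\x04'

print(len(bytes(4)), len(bytes((4, ))))   # 4 1
```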

Using the ```bytes``` class without providing any instantiation data will create an empty ```bytes``` instance:

```python
bytes() -> empty bytes object
```

In [69]:
bytes()

b''

Initialising NULL data and then populating it is more commonly done with the mutable counterpart, the ```bytearray```.

## Encoding and Decoding

When a ```bytes``` instance was instantiated from a Unicode ```str``` instance, an encoding translation table was selected; that is a table that maps the bytes sequence to a specific character. There have been many encoding standards developed throughout the years and by default the current standard ```'utf-8'``` should be used. The Unicode string is essentially locked to using only ```'utf-8'``` and as a consequence is far easier to use than the bytes string:

|encoding|bytes per character|bits per character|byte order|byte order marker BOM|
|---|---|---|---|---|
|'utf-8'|1, 2, 3, 4|8, 16, 24, 32|big endian| |
|'utf-8-sig'|1, 2, 3, 4|8, 16, 24, 32|big endian|efbbbf|
|'utf-32'|4|32|little endian|fffe0000|
|'utf-32-le'|4|32|little endian| |
|'utf-32-be'|4|32|big endian| |
|'utf-16'|2|16|little endian|fffe|
|'utf-16-le'|2|16|little endian| |
|'utf-16-be'|2|16|big endian| |
|'latin1'|1|8| ||
|'ascii'|1|8| ||

This section will go into explaining the concepts in the table above. 

### ASCII

The most basic one is ```'ascii'``` where each character is encoded over ```1``` byte using only half of the possible values. ASCII is restricted to a small subset of English characters:

In [None]:
a_ascii = b'a'
a_ascii

In [None]:
hex(ord('a'))

The ```'ascii'``` encoding scheme was originally developed using 7 bits with:

In [None]:
2 ** 7

for this reason the commands span over the range ```0:128``` (up to and excluding the upper bound of ```128```). This covers half the possible values of a byte:

In [None]:
2 ** 8

### Extended ASCII Variants

In the 1990s there were numerous regional encoding schemes (translation tables) which mapped the second half of the bytes to regional characters. 

In the UK, ```'latin1'``` was used which includes the ```£``` sign:

In [None]:
gb = bytes('£123.45', encoding='latin1')
gb

In [None]:
int('0xa3', base=16)

In [None]:
gb.decode(encoding='latin1')

This regional encoding scheme spanned over the full byte allowing the commonly used regional characters. 

The problem with early regional encoding was that operating systems and web browsers were often configured to use an encoding scheme that differed from the encoding the content itself was written in, and as a result non-ASCII characters were often incorrectly substituted. This can be seen for example by decoding the byte string which was originally encoded in ```'latin1'``` with ```'latin2'```, ```'latin3'```, ```'greek'``` and ```'cyrillic'```:

In [None]:
gb.decode(encoding='latin2')

In [None]:
gb.decode(encoding='latin3')

In [None]:
gb.decode(encoding='greek')

In [None]:
gb.decode(encoding='cyrillic')
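
The comparisons above can be gathered into a single sketch. The ```decoded``` dictionary is an illustrative name. Only the first byte ```\xa3``` is outside the ASCII range, so only the first character can differ between the encodings:

```python
gb = bytes('£123.45', encoding='latin1')
print(gb)   # b'\xa3123.45'

# Decode the same bytes using each regional translation table
decoded = {}
for encoding in ('latin1', 'latin2', 'latin3', 'greek', 'cyrillic'):
    decoded[encoding] = gb.decode(encoding=encoding)
    print(encoding.ljust(8), decoded[encoding])
```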

All of these formats should be considered as legacy formats.

### UTF-16

The Unicode Transformation Format ```'utf-16'``` was a previous standard where each character occupied ```2``` bytes, which is ```2 * 8``` bits and is where the ```16``` in the name comes from. Using 2 bytes instead of 1 byte per character increases the number of possible combinations to:

In [None]:
2 ** 16

Unfortunately there were variations of ```'utf-16'``` which often got confused. The simplest to understand is ```'utf-16-be'```:

In [None]:
a_be = bytes('a', encoding='utf-16-be')
a_be

In [None]:
a_be.hex()

Because each character must occupy 2 bytes, the leading zero byte ```\x00``` displays. BE is an abbreviation for Big Endian. Essentially the value sixteen is written as ```10``` in hexadecimal using big endian, which is consistent with the way numbers are written in English. In this notation the unit is last and the most significant digit, or biggest number, is written first (when reading left to right).

In [None]:
hex(16)

When ```'utf-16'``` was the standard, however, Intel created processors that operated using Little Endian. In Little Endian, the value sixteen is written as ```01``` and the least significant digit, or little number, is supplied first (when reading left to right).

The same string that was examined using ```'utf-16-be'``` can be examined using ```'utf-16-le'```. Notice that the leading zero byte ```\x00``` is now trailing:

In [None]:
a_le = bytes('a', encoding='utf-16-le')
a_le

In [None]:
a_le.hex()

If the Greek letter α is examined, notice the byte order is switched:

In [None]:
alpha_be = bytes('α', encoding='utf-16-be')
alpha_be

In [None]:
alpha_le = bytes('α', encoding='utf-16-le')
alpha_le

In [None]:
alpha_be.hex()

In [None]:
alpha_le.hex()
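
The same byte ordering can be sketched without any text encoding by using the ```int``` methods ```to_bytes``` and ```from_bytes```, which take a ```byteorder``` parameter. The name ```code_point``` is illustrative:

```python
code_point = ord('α')   # 945, which is 0x03b1

# Most significant byte first
print(code_point.to_bytes(2, byteorder='big').hex())      # 03b1

# Least significant byte first
print(code_point.to_bytes(2, byteorder='little').hex())   # b103

# Reading the little endian bytes back recovers the same integer
print(int.from_bytes(b'\xb1\x03', byteorder='little') == code_point)   # True
```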

Unintended character replacement is carried out when a character is encoded in ```'utf-16-le'``` and decoded using ```'utf-16-be'```:

In [None]:
alpha_le.decode(encoding='utf-16-be')

Due to the confusion between big endian and little endian, ```'utf-16'``` includes a Byte Order Marker (BOM). When encoding, Python's ```'utf-16'``` codec uses the platform's native byte order, which is little endian on most current hardware. The BOM is a 2 byte prefix that indicates the byte order:

In [None]:
bytes('', encoding='utf-16-le')

In [None]:
bytes('', encoding='utf-16')

The hex values of the BOM can be seen more clearly using the bytes strings ```hex``` method:

In [None]:
bytes('', encoding='utf-16').hex()

This has the hexadecimal values ```ff``` and ```fe```:

In [None]:
0xff

In [None]:
0xfe

These are the 2 largest values a byte can hold. The BOM can be conceptualised as a ```tuple``` of the form ```(255, 254)```:

In [None]:
bytes((255, 254))

And the order of the values in this 2 element tuple is reversed indicating little endian. ```(254, 255)``` would indicate big endian.
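
The standard library ```codecs``` module provides both markers as constants. The sketch below, with the illustrative name ```data```, also shows the ```'utf-16'``` decoder using a big endian BOM to select the byte order:

```python
import codecs

print(codecs.BOM_UTF16_LE.hex())   # fffe
print(codecs.BOM_UTF16_BE.hex())   # feff

# Prefix big endian data with the big endian BOM
data = codecs.BOM_UTF16_BE + bytes('α', encoding='utf-16-be')
print(data.hex())                  # feff03b1

# The decoder reads the BOM, selects the byte order and removes the BOM
print(data.decode(encoding='utf-16'))   # α
```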

If the 16 bit (2 byte) encoding is examined in more detail as hex so it is human readable, ```'utf-16-be'```, ```'utf-16-le'``` and ```'utf-16'``` can be compared:

In [None]:
be = bytes('abcdeαβγδε', encoding='utf-16-be').hex()
be

In [None]:
le = bytes('abcdeαβγδε', encoding='utf-16-le').hex()
le

In [None]:
bom_le = bytes('abcdeαβγδε', encoding='utf-16').hex()
bom_le

Each character is encoded over 2 bytes and each byte is in turn represented using 2 hexadecimal characters. The differences are therefore easier to see if groupings of 4 hexadecimal characters are examined:

In [None]:
print('char', end=':            ')
for i in 'abcdeαβγδε':
    print(i, end='    ')

print()
print('utf-16-be', end=':      ')

for i in range(0, len(be), 4):
    print(be[i:i+4], end=' ')

print()
print('utf-16-le', end=':      ')

for i in range(0, len(le), 4):
    print(le[i:i+4], end=' ')

print()
print('utf-16', end=':    ')

for i in range(0, len(bom_le), 4):
    print(bom_le[i:i+4], end=' ')

With 16 bits there are:

In [None]:
2 ** 16

combinations. These are not enough to cover characters from all the languages in the world.
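
As a rough illustration, the giraffe emoji used later in this notebook has an ordinal above the 16 bit limit. In Python's ```'utf-16-be'``` codec such a character is encoded over 4 bytes instead of 2. The name ```giraffe``` is illustrative:

```python
giraffe = '🦒'

# The ordinal does not fit within 16 bits
print(hex(ord(giraffe)))              # 0x1f992
print(ord(giraffe) > 2 ** 16 - 1)     # True

# The character therefore needs more than 2 bytes
print(len(bytes(giraffe, encoding='utf-16-be')))   # 4
```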

### UTF-32

Therefore ```'utf-32'``` was developed:

In [None]:
2 ** 32

This gives groupings of 4 bytes, which is 8 hexadecimal characters:

In [None]:
word = 'abαβ悤悥🦒🦓'

be = bytes(word, encoding='utf-32-be').hex()
le = bytes(word, encoding='utf-32-le').hex()
bom_le = bytes(word, encoding='utf-32').hex()


print('char', end=':               ')
for i in word:
    print(i, end='        ')

print()
print('utf-32-be', end=':          ')

for i in range(0, len(be), 8):
    print(be[i:i+8], end=' ')

print()
print('utf-32-le', end=':          ')

for i in range(0, len(le), 8):
    print(le[i:i+8], end=' ')

print()
print('utf-32', end=':    ')

for i in range(0, len(bom_le), 8):
    print(bom_le[i:i+8], end=' ')

### UTF-8

The main drawback of ```'utf-32'``` is that it requires a lot more memory per character. Moreover each ASCII character, for example, now has 6 leading (big endian) or trailing (little endian) hexadecimal zeros. ```'utf-8'``` was developed as an adaptable format to address these shortcomings and its characters can span over 1-4 bytes. ```'utf-8'``` is always big endian and instead of a BOM prefix, the first bits of each byte in a multibyte character follow a bit pattern:

In [None]:
print('1 byte:', end=' ')
for unicode_char in 'abcde':
    print(bytes(unicode_char, encoding='utf-8').hex(), end=' ')

print()
print('2 bytes:', end=' ')    
for unicode_char in 'αβγδε':
    print(bytes(unicode_char, encoding='utf-8').hex(), end=' ')
    
print()
print('3 bytes:', end=' ')  
for unicode_char in '悤悥悦悧您':
    print(bytes(unicode_char, encoding='utf-8').hex(), end=' ')

print()
print('4 bytes:', end=' ')  
for unicode_char in '🦒🦓🦔🦕🦖':
    print(bytes(unicode_char, encoding='utf-8').hex(), end=' ')


Generally:

* 1 byte is used for the ```'ascii'``` subset.
* 2 bytes are used for extended European characters, such as the Greek letters above. ```'utf-16'``` is not a subset, as ```'utf-8'``` uses a bit pattern that differs from ```'utf-16'```.
* 3 bytes are used for additional languages.
* 4 bytes are used for emojis. ```'utf-32'``` is not a subset, as ```'utf-8'``` uses a bit pattern that differs from ```'utf-32'```.
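The groupings above correspond to ranges of Unicode code points. The sketch below encodes the first and last code point of each range and counts the bytes; the boundary values are taken from the UTF-8 definition and the variable names are only illustrative:

```python
# First and last code point of each 1, 2, 3 and 4 byte range
boundaries = [0x00, 0x7f, 0x80, 0x7ff, 0x800, 0xffff, 0x10000, 0x10ffff]

byte_counts = {}
for code_point in boundaries:
    encoded = chr(code_point).encode('utf-8')
    byte_counts[code_point] = len(encoded)
    print(f'U+{code_point:06X}: {len(encoded)} byte(s), {encoded.hex()}')
```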

```'utf-8'``` uses some of the bits in each byte to identify whether a character spans 2, 3 or 4 bytes. For example the 2 byte character:

In [None]:
bin(0xceb1).removeprefix('0b').zfill(2 * 8)

has the bit pattern in bold of the form:

**110**01110 **10**110001

The 3 byte character:

In [None]:
bin(0xe682a4).removeprefix('0b').zfill(3 * 8)

has the bit pattern in bold of the form:

**1110**0110 **10**000010 **10**100100

The 4 byte character:

In [None]:
bin(0xf09fa692).removeprefix('0b').zfill(4 * 8)

has the bit pattern in bold of the form:

**11110**000 **10**011111 **10**100110 **10**010010
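The bits that are not in bold carry the Unicode code point. As a minimal sketch of how the patterns above can be read by hand, the hypothetical helper ```utf8_code_point``` below strips the marker bits and joins the remaining bits together. It is only an illustration and does not perform all the validation that the built-in decoder does:

```python
def utf8_code_point(encoded: bytes) -> int:
    """Recover the code point from the bytes of one 'utf-8' character."""
    first = encoded[0]
    if first >> 7 == 0b0:          # 0xxxxxxx: 1 byte (ASCII)
        return first
    if first >> 5 == 0b110:        # 110xxxxx: start of a 2 byte character
        payload = first & 0b00011111
    elif first >> 4 == 0b1110:     # 1110xxxx: start of a 3 byte character
        payload = first & 0b00001111
    elif first >> 3 == 0b11110:    # 11110xxx: start of a 4 byte character
        payload = first & 0b00000111
    else:
        raise ValueError('not a valid leading byte')

    for byte in encoded[1:]:
        if byte >> 6 != 0b10:      # 10xxxxxx: continuation byte
            raise ValueError('not a valid continuation byte')
        # Make room for 6 more bits and append the payload of this byte
        payload = (payload << 6) | (byte & 0b00111111)
    return payload


for char in 'aα悤🦒':
    code_point = utf8_code_point(char.encode('utf-8'))
    print(char, hex(code_point), code_point == ord(char))
```

The result for each character should match what the ```builtins``` function ```ord``` returns for the Unicode string.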

Since ```'utf-8'``` has a fixed byte order and characters encoded over multiple bytes follow a bit pattern, there is generally no need for a BOM.

### UTF-8-Sig

Despite ```'utf-8'``` not requiring a BOM, Microsoft often includes one in its products using the variation ```'utf-8-sig'```. The BOM may therefore be seen in data exported from popular Microsoft applications such as Notepad or Excel. The BOM can be seen by comparing the casting of an empty Unicode string to a byte string using ```'utf-8'``` and ```'utf-8-sig'``` respectively:

In [None]:
no_bom = bytes('', encoding='utf-8').hex()
no_bom

In [None]:
bom = bytes('', encoding='utf-8-sig').hex()
bom

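The BOM is ```efbbbf```, which is the same code point used for the BOM in ```'utf-16'``` (U+FEFF), encoded in ```'utf-8'```. Because ```'utf-8'``` has a fixed byte order, the BOM here does not indicate endianness and only acts as a signature that marks the data as ```'utf-8'```. The BOM can be seen in binary: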

In [None]:
bin(0xefbbbf).removeprefix('0b').zfill(3 * 8)

Notice it also has the bit pattern that indicates the BOM is encoded over 3 bytes:

**1110**1111 **10**111011 **10**111111

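```'utf-8'``` is the current standard and should be used by default. The Unicode string ```str``` class works with Unicode characters directly and uses ```'utf-8'``` as the default encoding in its ```encode``` method, so it is much easier to work with as there is usually no need to worry about encoding.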

When decoding data from another source that is in byte form, ```'utf-8'``` should be used by default to decode it. If an unwanted BOM appears at the start of the decoded data, then the data was probably saved by a Microsoft product using ```'utf-8-sig'``` and can be decoded with ```'utf-8-sig'``` instead.
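The effect of the BOM on decoding can be sketched as follows. The byte string here is created in the notebook as a stand-in for data exported from another application, and the variable names are only illustrative:

```python
import codecs

# Stand-in for data saved with a BOM by another application
data = 'hello'.encode('utf-8-sig')
print(data, data.startswith(codecs.BOM_UTF8))

# Decoding with 'utf-8' keeps the BOM as an invisible leading character
as_utf8 = data.decode('utf-8')
print(repr(as_utf8))

# Decoding with 'utf-8-sig' strips the BOM
as_utf8_sig = data.decode('utf-8-sig')
print(repr(as_utf8_sig))

# 'utf-8-sig' also accepts data that has no BOM
plain = b'hello'.decode('utf-8-sig')
print(repr(plain))
```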

## Byte Strings VS Unicode Strings

Unicode strings can be thought of as byte strings encoded in ```'utf-8'``` with the methods simplified to use a Unicode character as the fundamental unit as opposed to a byte. Moreover, since there is no choice of encoding to worry about, each character can be displayed natively, making it easier to read. For most Python programs a Unicode string should therefore be used instead of a byte string.

The binary sequence of an ```'ascii'``` string, which recall is a subset of ```'utf-8'```, is:

In [None]:
for number in b'hello world!':
    print(bin(number).removeprefix('0b').zfill(8), end='')

The binary sequence is not very human readable. However, it is very easy to transmit as a digital signal. A serial port for example has a signal pin which is configured to transmit or receive a digital time trace. A baud rate of ```9600``` for example means ```9600``` bits are processed in a second. When the bit is ```0```, the voltage of the signal pin is LOW and when the bit is ```1``` the voltage of the signal pin is HIGH.
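As a rough worked example, the time taken to send the data bits of this message at ```9600``` baud can be estimated. This sketch only counts the ```8``` data bits in each byte and ignores any start, stop or parity bits that a real serial link adds, so it is a lower bound:

```python
message = b'hello world!'
baud_rate = 9600  # bits per second, the example value used above

# Each byte contributes 8 data bits
bit_count = len(message) * 8

# Time in seconds to send only the data bits
seconds = bit_count / baud_rate
print(bit_count, seconds)
```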

In [None]:
b'hello world!'.hex()

Byte strings are therefore still used when directly interfacing with hardware, for example an Arduino. In such applications it is recommended to decode a byte string to a Unicode string as early as possible in a Python program and to encode a Unicode string to a byte string as late as possible before transmitting it to basic hardware, in order to avoid encoding issues.
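This recommendation can be sketched without any hardware attached. The function ```handle_message``` and the message format below are hypothetical; ```incoming``` stands in for bytes read from a device and no real serial library is used:

```python
def handle_message(received: bytes) -> bytes:
    """Hypothetical handler: decode early, work with str, encode late."""
    # Decode at the boundary, as soon as the bytes arrive
    text = received.decode('ascii').strip()

    # All processing is carried out on a Unicode string
    reply = f'ack: {text.upper()}'

    # Encode at the last moment, just before transmitting
    return reply.encode('ascii') + b'\n'


incoming = b'hello world!\r\n'  # stand-in for bytes read from a device
outgoing = handle_message(incoming)
print(outgoing)
```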

## Bytes Identifiers

The ```bytes``` class uses the design model of a Python object as seen by its method resolution order:

In [None]:
bytes.mro()

If the ```help``` function is used on bytes, details about all the identifiers will be given:

In [None]:
help(bytes)

From the previous notebook, you'll recognise many of these identifiers as they are seen in the ```str``` class. If the ```print_identifier_group``` function is imported from the custom ```helper_module```:

In [None]:
from helper_module import print_identifier_group

The identifiers can be compared between the ```str``` class and ```bytes``` class. Notice there are only a small number of identifiers in the ```bytes``` class not in the ```str``` class:

In [None]:
print_identifier_group(bytes, kind='all', second=str, show_unique_identifiers=True)

Most identifiers in the ```bytes``` class have the same names as those in the ```str``` class. Their behaviour is largely consistent but differs slightly due to the difference in fundamental unit:

In [None]:
print_identifier_group(bytes, kind='all', second=str, show_only_intersection_identifiers=True)

Most methods and datamodel methods have the same names between the two classes and have consistent behaviour. The reason for this is that both classes are immutable ```Collections``` and therefore follow the same design pattern. The immutable methods in the ```str``` class which return a new ```str``` instance will generally return a new ```bytes``` instance when their counterparts in the ```bytes``` class are used. The individual unit of a ```str``` instance is a Unicode character and the individual unit of a ```bytes``` instance is a byte. The datamodel method ```__len__``` (*dunder len*) will return how many of these individual units are in an instance. Recall that the ```builtins``` function ```len``` is preferentially used over the datamodel method ```__len__``` (*dunder len*) but under the hood the datamodel method defines the behaviour of ```len``` when used on this class:
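A minimal sketch of this consistency uses the method ```upper```, which exists in both classes. The second half shows one of the slight differences: the arguments supplied to a ```bytes``` method such as ```replace``` need to be byte strings as well, and supplying Unicode strings raises a ```TypeError```:

```python
text = 'abcde'
data = b'abcde'

# The counterpart methods return a new instance of their own class
upper_text = text.upper()
upper_data = data.upper()
print(type(upper_text).__name__, upper_text)
print(type(upper_data).__name__, upper_data)

# Both originals are unchanged because both classes are immutable
print(text, data)

# Arguments of bytes methods must also be bytes, not str
try:
    data.replace('a', 'z')
except TypeError as error:
    mixed_error = error
    print('TypeError:', error)

replaced = data.replace(b'a', b'z')
print(replaced)
```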

In [None]:
english_b = b'abcde'
greek_b = bytes('αβγδε', encoding='UTF-8')
english_s = 'abcde'
greek_s = 'αβγδε'

Notice that the ```len``` of ```english_s``` and ```greek_s``` are consistently ```5``` as there are ```5``` Unicode characters:

In [None]:
len(english_s)

In [None]:
len(greek_s)

Notice that the ```len``` of ```english_b``` is also ```5``` as each ASCII character spans ```1``` byte, however ```greek_b``` has a length of ```10``` as each character spans ```2``` bytes:

In [None]:
len(english_b)

In [None]:
len(greek_b)

Recall that the data model identifier ```__getitem__``` (*dunder getitem*) defines the behaviour when indexing with square brackets. For a ```str``` instance, the Unicode character corresponding to that index is returned. For a ```bytes``` instance on the other hand, the byte in the form of an ```int``` is returned:

In [None]:
greek_s[0]

In [None]:
greek_b[0]

In [None]:
greek_b.hex()

In [None]:
hex(206)

In [None]:
0xce
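Since indexing returns an ```int```, it can be useful to know how to get back to a ```bytes``` instance. The sketch below wraps the ```int``` in a ```list``` before casting; casting the ```int``` directly behaves differently, as it creates that many zero bytes:

```python
greek_b = bytes('αβγδε', encoding='utf-8')

# Iterating over a bytes instance also yields int values
values = list(greek_b)
print(values)

first = greek_b[0]

# A list of int values can be cast back into a bytes instance
single = bytes([first])
print(single)

# Casting an int directly creates that many zero bytes instead
zeros = bytes(3)
print(zeros)
```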

Slicing will instead return a ```bytes``` instance. The difference can be seen when ```1``` byte is selected from a slice:

In [None]:
greek_b[:1]

The syntax is otherwise consistent with slicing in the ```str``` class:

In [None]:
greek_b[:2:]

In [None]:
greek_b[::2]

In [None]:
greek_b[1::2]
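Because a slice of a ```bytes``` instance counts bytes and not characters, it can cut a multibyte character in half. The following sketch, using the same Greek example, shows that a slice covering a whole character can be decoded while a partial character raises a ```UnicodeDecodeError```:

```python
greek_b = bytes('αβγδε', encoding='utf-8')

# Both bytes of the first character decode cleanly
whole = greek_b[:2].decode('utf-8')
print(whole)

# Only the leading byte of the first character cannot be decoded
try:
    greek_b[:1].decode('utf-8')
except UnicodeDecodeError as error:
    partial_error = error
    print('UnicodeDecodeError:', error.reason)

# A step of 2 separates the leading bytes from the continuation bytes
lead_bytes = greek_b[::2]
continuation_bytes = greek_b[1::2]
print(lead_bytes.hex(), continuation_bytes.hex())
```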