# The bytes class

In the previous notebook the Unicode string class ```str``` was examined and was seen to have a Unicode character as its fundamental unit. The byte string class ```bytes```, on the other hand, uses a byte as its fundamental unit. The byte string was the foundation for text data in Python 2. 

A computer stores data using bits. A bit can be conceptualised as a single DIP switch which has the values Off and On, as shown below. A single switch has the possible values ```0``` and ```1```, which is ```2 ** 1``` combinations, a total of ```2```. Note the combination ```0``` is included so ```0:2``` is inclusive of the lower bound ```0``` and exclusive of the upper bound ```2```.

<img src='./images/img_001.png' alt='img_001' width='400'/>

The switch above represents ```2``` to the power of ```0``` and has a multiplier (named ```prefix``` below) of ```0```:

In [1]:
prefix = 0

In [2]:
prefix * (2 ** 0)

0

More typically ```8``` of these switches are combined into a single logical unit called a byte. A byte has ```2 ** 8``` combinations which is a total of ```256```. Note the combination ```0``` is included so ```0:256``` is inclusive of the lower bound ```0``` and exclusive of the upper bound ```256```.

<img src='./images/img_002.png' alt='img_002' width='400'/>

The decimal number can be calculated from the byte above. Like an ordinary decimal number, the units are on the right hand side of the byte. What differs is that each digit represents a power of ```2``` (binary) as opposed to a power of ```10``` (decimal):

In [3]:
+ 0 * (2 ** 7) \
+ 1 * (2 ** 6) \
+ 1 * (2 ** 5) \
+ 0 * (2 ** 4) \
+ 1 * (2 ** 3) \
+ 0 * (2 ** 2) \
+ 0 * (2 ** 1) \
+ 0 * (2 ** 0) 

104

The byte above can be expressed as a binary number using the prefix ```0b```. This prefix is used to distinguish the number from a standard decimal integer. Notice that this prefix is highlighted in VSCode using a different colour to distinguish the prefix from the data. The decimal equivalent will be returned in the cell output:

In [4]:
0b01101000

104

Leading zeros are normally omitted:

In [5]:
0b1101000

104

Although in this context, it is useful to leave them in place so all 8 bits in the byte can be visualised.
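As a hedged aside, the format specification ```'08b'``` is one way to keep the leading zeros when displaying a byte, and ```int``` with ```base=2``` converts the text back to an integer. The variable names below are illustrative only:

```python
# Display a byte value as 8 binary digits, keeping the leading zeros.
byte_value = 0b01101000
bits = format(byte_value, '08b')  # zero padded to a width of 8

# Convert the text of binary digits back into an integer.
restored = int(bits, base=2)

print(bits)
print(restored)
```

The text form carries no ```0b``` prefix, so it has to be told the base explicitly when it is converted back.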

A byte string is essentially a collection of individual bytes. It is therefore called the ```bytes``` class:

<img src='./images/img_003.png' alt='img_003' width='400'/>

In [6]:
0b01101000

104

In [7]:
0b01100101

101

In [8]:
0b01101100

108

In [9]:
0b01101100

108

In [10]:
0b01101111

111

These numbers can be grouped into a collection such as a ```tuple```:

In [11]:
(0b01101000, 0b01100101, 0b01101100, 0b01101100, 0b01101111)

(104, 101, 108, 108, 111)

This ```tuple``` collection can be cast into ```bytes``` giving text information:

In [12]:
bytes((0b01101000, 0b01100101, 0b01101100, 0b01101100, 0b01101111))

b'hello'

In [13]:
bytes((104, 101, 108, 108, 111))

b'hello'
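The cast also works in reverse. A minimal sketch, using illustrative variable names, showing that indexing or iterating over a ```bytes``` instance gives back the integers:

```python
greeting = bytes((104, 101, 108, 108, 111))

first = greeting[0]       # indexing returns an int, not a 1 byte bytes instance
values = tuple(greeting)  # iterating yields each byte as an int

print(first)
print(values)
```

This is the main behavioural difference from ```str```, where indexing returns a single character string.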

## ASCII Characters

The American Standard Code for Information Interchange (ASCII) maps each byte to a physical command or an English character. Recall that the physical commands were used to control primitive computers that were essentially typewriter based:

<img src='./images/img_004.png' alt='img_004' width='600'/>

|byte|hex|num|command|
|---|---|---|---|
|00000000|00|000|null|
|00000001|01|001|start of heading|
|00000010|02|002|start of text|
|00000011|03|003|end of text|
|00000100|04|004|end of transmission|
|00000101|05|005|enquiry|
|00000110|06|006|acknowledge|
|00000111|07|007|bell|
|00001000|08|008|**backspace**|
|00001001|09|009|**horizontal tab**|
|00001010|0a|010|**new line**|
|00001011|0b|011|**vertical tab**|
|00001100|0c|012|**form feed**|
|00001101|0d|013|**carriage return**|
|00001110|0e|014|shift out|
|00001111|0f|015|shift in|
|00010000|10|016|data link escape|
|00010001|11|017|device control 1|
|00010010|12|018|device control 2|
|00010011|13|019|device control 3|
|00010100|14|020|device control 4|
|00010101|15|021|negative acknowledge|
|00010110|16|022|synchronous idle|
|00010111|17|023|end of transmission block|
|00011000|18|024|cancel|
|00011001|19|025|end of medium|
|00011010|1a|026|substitute|
|00011011|1b|027|**escape**|
|00011100|1c|028|file separator|
|00011101|1d|029|group separator|
|00011110|1e|030|record separator|
|00011111|1f|031|unit separator|
|00100000|20|032|**space**|

The remaining values, spanning up to half the range of a byte, contain the characters most commonly used in the English language:

|byte|hex|num|character|
|---|---|---|---|
|00100001|21|033|!|
|00100010|22|034|"|
|00100011|23|035|#|
|00100100|24|036|$|
|00100101|25|037|%|
|00100110|26|038|&|
|00100111|27|039|'|
|00101000|28|040|(|
|00101001|29|041|)|
|00101010|2a|042|*|
|00101011|2b|043|+|
|00101100|2c|044|,|
|00101101|2d|045|-|
|00101110|2e|046|.|
|00101111|2f|047|/|
|00110000|30|048|0|
|00110001|31|049|1|
|00110010|32|050|2|
|00110011|33|051|3|
|00110100|34|052|4|
|00110101|35|053|5|
|00110110|36|054|6|
|00110111|37|055|7|
|00111000|38|056|8|
|00111001|39|057|9|
|00111010|3a|058|:|
|00111011|3b|059|;|
|00111100|3c|060|<|
|00111101|3d|061|=|
|00111110|3e|062|>|
|00111111|3f|063|?|
|01000000|40|064|@|
|01000001|41|065|A|
|01000010|42|066|B|
|01000011|43|067|C|
|01000100|44|068|D|
|01000101|45|069|E|
|01000110|46|070|F|
|01000111|47|071|G|
|01001000|48|072|H|
|01001001|49|073|I|
|01001010|4a|074|J|
|01001011|4b|075|K|
|01001100|4c|076|L|
|01001101|4d|077|M|
|01001110|4e|078|N|
|01001111|4f|079|O|
|01010000|50|080|P|
|01010001|51|081|Q|
|01010010|52|082|R|
|01010011|53|083|S|
|01010100|54|084|T|
|01010101|55|085|U|
|01010110|56|086|V|
|01010111|57|087|W|
|01011000|58|088|X|
|01011001|59|089|Y|
|01011010|5a|090|Z|
|01011011|5b|091|[|
|01011100|5c|092|\|
|01011101|5d|093|]|
|01011110|5e|094|^|
|01011111|5f|095|_|
|01100000|60|096|`|
|01100001|61|097|a|
|01100010|62|098|b|
|01100011|63|099|c|
|01100100|64|100|d|
|01100101|65|101|e|
|01100110|66|102|f|
|01100111|67|103|g|
|01101000|68|104|h|
|01101001|69|105|i|
|01101010|6a|106|j|
|01101011|6b|107|k|
|01101100|6c|108|l|
|01101101|6d|109|m|
|01101110|6e|110|n|
|01101111|6f|111|o|
|01110000|70|112|p|
|01110001|71|113|q|
|01110010|72|114|r|
|01110011|73|115|s|
|01110100|74|116|t|
|01110101|75|117|u|
|01110110|76|118|v|
|01110111|77|119|w|
|01111000|78|120|x|
|01111001|79|121|y|
|01111010|7a|122|z|
|01111011|7b|123|{|
|01111100|7c|124|\||
|01111101|7d|125|}|
|01111110|7e|126|~|
|01111111|7f|127|**delete**|
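The rows of the table above can be regenerated with the format specifications ```08b```, ```02x``` and ```03d```. This is a minimal sketch and ```ascii_row``` is a hypothetical helper name, not part of Python:

```python
def ascii_row(number):
    """Return one table row: byte, hex, num and character (hypothetical helper)."""
    return f'|{number:08b}|{number:02x}|{number:03d}|{chr(number)}|'

# The printable ASCII characters span 33 to 126 inclusive.
for number in range(33, 127):
    print(ascii_row(number))
```

The row for ```124``` contains the pipe character itself, which would need escaping before being pasted into a Markdown table.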


## Initialisation Signature

Inputting the class name ```bytes``` followed by an open parenthesis in a new code cell will display the initialisation signature as a popup balloon. Note that Jupyter Notebook and JupyterLab may require the additional keypress shift ```⇧``` and tab ```↹``` in order to invoke the popup balloon.

Alternatively using ```?``` to query the ```bytes``` class will display the docstring in the ipython cell output:

In [14]:
bytes?

Init signature: bytes(self, /, *args, **kwargs)
Docstring:     
bytes(iterable_of_ints) -> bytes
bytes(string, encoding[, errors]) -> bytes
bytes(bytes_or_buffer) -> immutable copy of bytes_or_buffer
bytes(int) -> bytes object of size given by the parameter initialized with null bytes
bytes() -> empty bytes object

Construct an immutable array of bytes from:
  - an iterable yielding integers in range(256)
  - a text string encoded using the specified encoding
  - any object implementing the buffer API.
  - an integer
Type:           type
Subclasses:     

Recall that the purpose of the initialisation signature is to provide the data required to initialise a new instance. This was covered in a previous notebook which discussed the ```object``` class and introduced the concept of object-oriented programming (OOP).

To recap, during instantiation the datamodel static method ```__new__```, which is the constructor, is used to create the instance ```self```. The datamodel instance method ```__init__```, which provides the initialisation signature, is then invoked to initialise this instance with instance data.

For the byte string class, the initialisation signature shows alternative ways of supplying instance data for a byte string.

If the first way is examined:

* In Python parentheses ```( )``` are used to call a function and supply any necessary input arguments.
* The comma ```,``` is used as a delimiter to separate out any input arguments.
* In Python ```self``` is used to denote *this instance*. In other words a byte string can be constructed from an existing byte string instance. This is a special case as a byte string is a fundamental datatype.
* Any input argument before a ```/``` must be provided positionally.
* ```*args``` indicates a variable number of additional positional input arguments. These are typically not used for the byte string class.
* ```**kwargs``` indicates a variable number of additional named input arguments. These are typically not used for the byte string class.

The parameter ```self```, an existing byte string instance, cannot be provided using a named input argument.

A byte string instance can be instantiated by supplying an existing byte string instance ```self``` to the byte string class:

In [15]:
bytes(b'hello world!')

b'hello world!'

However, because the byte string is a fundamental datatype, it can also be instantiated using the following shorthand:

In [16]:
b'hello world!'

b'hello world!'

The second way is by using an iterable of integers such as a ```tuple```.

For this to work each integer must be a valid byte value. Recall that a byte looks like the following:

<img src='./images/img_002.png' alt='img_002' width='400'/>

And a byte has ```2 ** 8``` combinations which is a total of ```256```. Note the combination ```0``` is included so ```0:256``` is inclusive of the lower bound ```0``` and exclusive of the upper bound ```256```. Therefore the integer value used to represent a byte must lie in the range ```0:256``` to be a valid byte.

A ```tuple``` with a single ```int``` element can be provided. Note that a trailing comma is required to distinguish a single element ```tuple``` from a numeric calculation using parentheses:

In [17]:
num = (97)

In [18]:
type(num)

int

In [19]:
archive = (97, )

In [20]:
type(archive)

tuple

From the above ASCII table the integer number ```97``` corresponds to the character ```a``` and this can be seen when the ```tuple``` is cast to ```bytes```:

In [21]:
bytes((97, ))

b'a'

If an integer outside the range ```0:256``` is supplied (the upper bound ```256``` is itself excluded because counting starts at ```0```), a ```ValueError``` is raised.
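The range check can be sketched as follows. The function name ```is_valid_byte_sequence``` is hypothetical and only wraps the ```ValueError``` in a ```try``` block so the behaviour can be inspected without halting the notebook:

```python
def is_valid_byte_sequence(integers):
    """Return True if every integer can be stored in a byte (hypothetical helper)."""
    try:
        bytes(integers)
    except ValueError:
        # Raised when any integer lies outside the range 0:256.
        return False
    return True

print(is_valid_byte_sequence((255, )))
print(is_valid_byte_sequence((256, )))
print(is_valid_byte_sequence((-1, )))
```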

Normally the ```tuple``` will contain more than one ```int``` and each of these will be a valid byte value:

In [22]:
integers = (104, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100, 33)
bytes(integers)

b'hello world!'

If the first number, the decimal integer ```104```, is examined, a Unicode string of the binary sequence of the byte can be obtained using the ```bin``` function:

In [23]:
bin(104)

'0b1101000'

It is useful to see this with leading zeros. To do this the Unicode string methods ```removeprefix``` and ```zfill``` can be used alongside Unicode string concatenation. You should be familiar with these from the previous notebooks:

In [24]:
'0b' + bin(104).removeprefix('0b').zfill(8)

'0b01101000'

The Unicode string of the ASCII character corresponding to this byte can be retrieved using the ```chr``` function:

In [25]:
chr(104)

'h'

This bytes instance consists of 12 bytes. The decimal integer, binary integer (without prefix) and ASCII character can be shown. Note the space corresponding to a decimal integer of 32 is also an ASCII character:

In [26]:
for number in integers:
    print(str(number).center(8), end=' ')

print()
for number in integers:
    print(bin(number).removeprefix('0b').zfill(8), end=' ')

print()
for number in integers:
    print(chr(number).center(8), end=' ')

  104      101      108      108      111       32      119      111      114      108      100       33    
01101000 01100101 01101100 01101100 01101111 00100000 01110111 01101111 01110010 01101100 01100100 00100001 
   h        e        l        l        o                 w        o        r        l        d        !     

When every byte is a printable ASCII character it will be displayed instead of the byte sequence:

In [27]:
bytes(integers)

b'hello world!'

When the byte sequence contains a byte that maps to a whitespace character (with the exception of space), a command or is unmapped there is no corresponding character to display and an escape sequence instead displays:

In [28]:
integers = (9, 10, 11, 12, 13)
bytes(integers)

b'\t\n\x0b\x0c\r'

The tab ```\t``` (decimal 9), new line ```\n``` (decimal 10) and carriage return ```\r``` (decimal 13) are displayed using their ASCII escape characters. The vertical tab ```\x0b``` (decimal 11) and form feed ```\x0c``` (decimal 12) are instead displayed using a hexadecimal escape sequence of the form ```\x00```.

You may have noticed that binary ```0b00001100``` is not very human readable and that it is easy to mistranscribe a binary number. To make it more human readable, the byte is split into 2 and each half byte is represented as a hexadecimal character:
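As a small illustration of the difference between input and display, Python also accepts the single escape characters ```\v``` and ```\f``` in a byte string literal. They are stored as the same bytes but are displayed back in the hexadecimal form. The variable names are illustrative:

```python
vertical_tab = b'\v'  # accepted on input, stored as decimal 11
form_feed = b'\f'     # accepted on input, stored as decimal 12

print(vertical_tab == b'\x0b')
print(form_feed == b'\x0c')
print(vertical_tab, form_feed)
```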

<img src='./images/img_005.png' alt='img_005' width='400'/>

Recall **b**inary has ```2``` digits and the prefix ```0b```, decimal has ```10``` digits and no prefix because it is most commonly used. He**x**adecimal has ```16``` digits and the prefix ```0x```. As decimal only has ```10``` unique characters, in hexadecimal, the first ```6``` letters in the alphabet are used to supplement these giving ```16``` unique characters. Notice:

In [29]:
2 ** 4

16

Therefore each hexadecimal character perfectly maps to a 4 bit binary sequence:

|(0b) binary|(0x) hexadecimal character|decimal character|
|---|---|---|
|0000|0|0|
|0001|1|1|
|0010|2|2|
|0011|3|3|
|0100|4|4|
|0101|5|5|
|0110|6|6|
|0111|7|7|
|1000|8|8|
|1001|9|9|
|1010|a|10|
|1011|b|11|
|1100|c|12|
|1101|d|13|
|1110|e|14|
|1111|f|15|

The byte depicted above corresponds to the binary number:

In [30]:
0b00001100

12

The value returned in the cell output displays the decimal integer. The ```hex``` function can be used to cast a decimal integer into a Unicode string of a hexadecimal character:

In [31]:
hex(0b00001100)

'0xc'

The hexadecimal value is displayed without the leading zero, so the first half byte is not shown. This can be added for clarity:

In [32]:
'0x' + hex(12).removeprefix('0x').zfill(2)

'0x0c'

In [33]:
'0b' + bin(12).removeprefix('0b').zfill(8)

'0b00001100'

And for clarity:

In [34]:
print(bin(12).removeprefix('0b').zfill(8)[:4], bin(12).removeprefix('0b').zfill(8)[4:])
print(hex(12).removeprefix('0x').zfill(2)[:1].center(4), hex(12).removeprefix('0x').zfill(2)[1:].center(4))

0000 1100
 0    c  
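The split into two half bytes can also be carried out numerically. A minimal sketch, assuming the illustrative names below, uses ```divmod``` by ```16``` so the quotient is the first half byte and the remainder is the second:

```python
byte_value = 0b00001100

# Quotient is the first (high) half byte, remainder is the second (low) half byte.
high, low = divmod(byte_value, 16)

# Each half byte indexes directly into the 16 hexadecimal characters.
hex_digits = '0123456789abcdef'
as_hex = hex_digits[high] + hex_digits[low]

print(high, low)
print(as_hex)
```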


Note that the ```hex``` function displays the Python interpreter's preference for lowercase characters for both the hexadecimal prefix and the hexadecimal characters themselves. Hexadecimal values are usually used when working at the fundamental hardware level and the preference for lowercase is to make hexadecimal values as human readable as possible. In the hexadecimal value spanning 4 bytes (8 hexadecimal characters) below, notice that the lowercase variant is much easier to transcribe:

In [35]:
0xabb4ab8a

2880744330

In [36]:
0xABB4AB8A

2880744330

Despite the preference for lowercase, Python will also accept hexadecimal values of uppercase and in the example above both cases return the same integer.

When the byte sequence contains a byte that maps to a whitespace character (with the exception of space), a command, or is unmapped, there is no corresponding character to display and an escape sequence displays instead. This can be seen when using the first ```32``` integers and integers above ```127```:

In [37]:
integers = (0, 1, 2, 29, 30, 31, 128, 129, 130, 253, 254, 255)
bytes(integers)

b'\x00\x01\x02\x1d\x1e\x1f\x80\x81\x82\xfd\xfe\xff'

Notice that each byte has its own hexadecimal escape sequence with the prefix ```\x``` followed by two hexadecimal characters, which recall span 1 byte. Because the escape sequence ```\x``` expects two hexadecimal characters, any leading zero must be supplied.

When a byte string contains byte sequences that map to characters these characters will be displayed instead of the hexadecimal escape sequence:

In [38]:
integers = (32, 65, 80, 120)
bytes(integers)

b' APx'

The tab ```\t```, new line ```\n```, carriage return ```\r``` and backslash ```\\``` itself all have single escape characters:

In [39]:
integers = (9, 10, 13, 92)
bytes(integers)

b'\t\n\r\\'

A byte string containing all of these can initially appear confusing:

In [40]:
integers = (0, 1, 2, 9, 10, 13, 29, 30, 31, 32, 65, 92, 97, 128, 129, 130, 253, 254, 255)
bytes(integers)

b'\x00\x01\x02\t\n\r\x1d\x1e\x1f A\\a\x80\x81\x82\xfd\xfe\xff'

This is why the ```bytes``` class has the method ```hex```, which returns a Unicode string of the hexadecimal values without any of the escape sequences:

In [41]:
bytes(integers).hex()

'000102090a0d1d1e1f20415c61808182fdfeff'

Note that the ```bytes``` class's ```hex``` method differs from the ```builtins``` function ```hex```, which provides a ```0x``` prefix and omits any leading zero:

In [42]:
bytes((12, )).hex()

'0c'

In [43]:
hex(12)

'0xc'

The ```builtins``` function ```hex``` can process a **single** large integer that exceeds 1 byte:

In [44]:
hex(256)

'0x100'

Whereas the ```bytes``` class's ```hex``` method processes **multiple** integers that are each within the constraints of a byte:

In [45]:
bytes((12, 34)).hex()

'0c22'

The following bytes string can be represented as a string of hexadecimal characters with 2 hexadecimal characters for each byte using the ```bytes``` class method ```hex```:

In [46]:
b'hello'.hex()

'68656c6c6f'

The ```bytes``` class has the class method ```fromhex``` which carries out the reverse operation and creates a ```bytes``` instance from a string of hexadecimal characters:

In [47]:
bytes.fromhex('68656c6c6f')

b'hello'
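A longer hexadecimal string is easier to read when the bytes are separated. As a sketch, assuming Python 3.8 or later, the ```hex``` method accepts a separator and ```fromhex``` ignores the whitespace when converting back. The variable names are illustrative:

```python
spaced = b'hello'.hex(' ')        # one space between each byte
restored = bytes.fromhex(spaced)  # whitespace between bytes is ignored

print(spaced)
print(restored)
```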

To recap:

In [48]:
integers = (104, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100, 33)
bytes(integers)

for number in integers:
    print(chr(number).center(8), end=' ')
    
print()
for number in integers:
    print(str(number).center(8), end=' ')

print()
for number in integers:
    print(bin(number).removeprefix('0b').zfill(8), end=' ')

print()
for number in integers:
    print((hex(number).removeprefix('0x')).center(8), end=' ')    

print()
for number in integers:
    print((r'0x' + hex(number).removeprefix('0x')).center(8), end=' ')

print()
for number in integers:
    print((r'\x' + hex(number).removeprefix('0x')).center(8), end=' ')
    


   h        e        l        l        o                 w        o        r        l        d        !     
  104      101      108      108      111       32      119      111      114      108      100       33    
01101000 01100101 01101100 01101100 01101111 00100000 01110111 01101111 01110010 01101100 01100100 00100001 
   68       65       6c       6c       6f       20       77       6f       72       6c       64       21    
  0x68     0x65     0x6c     0x6c     0x6f     0x20     0x77     0x6f     0x72     0x6c     0x64     0x21   
  \x68     \x65     \x6c     \x6c     \x6f     \x20     \x77     \x6f     \x72     \x6c     \x64     \x21   

A byte string can be instantiated from a Unicode string, however the named parameter ```encoding``` needs to be supplied. This gives the instructions used to encode each Unicode character as bytes, including any characters outside the ASCII range. In this case the simplest encoding ```'ascii'``` seen above is supplied, indicating all characters are within the ASCII range. Generally the current standard ```'utf-8'``` is used, which will be explained in more detail in the next section:

In [49]:
bytes('hello', encoding='ascii')

b'hello'

Not supplying ```encoding``` gives a ```TypeError```.

Supplying a Unicode character that is not ASCII and specifying ```'ascii'``` encoding gives a ```UnicodeEncodeError```.
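Both errors can be inspected without halting the notebook. In this minimal sketch ```error_name``` is a hypothetical helper that forwards its arguments to ```bytes``` and reports the name of any exception raised:

```python
def error_name(*args, **kwargs):
    """Return the name of the exception raised by bytes, or None (hypothetical helper)."""
    try:
        bytes(*args, **kwargs)
    except Exception as error:
        return type(error).__name__
    return None

print(error_name('hello'))                    # no encoding supplied
print(error_name('£', encoding='ascii'))      # character outside the ASCII range
print(error_name('hello', encoding='ascii'))  # valid, no exception
```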

A bytes instance can be instantiated from an existing bytes instance, as seen earlier with ```bytes(b'hello world!')```.

It can also be instantiated from an instance of a closely related class, the ```bytearray``` class, which is a mutable counterpart of the immutable ```bytes``` class:

In [50]:
ba = bytearray(b'hello')

In [51]:
bytes(ba)

b'hello'

Immutable means the ```bytes``` sequence cannot be modified once constructed. Mutable means a ```bytearray``` sequence can be modified once constructed.
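The difference can be sketched by attempting to assign a new value to the first byte of each. The variable names are illustrative and the ```TypeError``` is caught so that the cell runs to completion:

```python
immutable = b'hello'
mutable = bytearray(b'hello')

mutable[0] = ord('j')  # in-place modification of a bytearray is allowed

try:
    immutable[0] = ord('j')
    modified = True
except TypeError:
    # bytes does not support item assignment
    modified = False

print(mutable)
print(immutable, modified)
```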

A ```bytes``` instance can also be initialised from an integer. In this case, the integer specifies the number of bytes the instance will use and each of these bytes is initialised to ```0```.

For example a ```bytes``` instance occupying 1 byte can be instantiated:

In [52]:
bytes(1)

b'\x00'

And another one occupying 4 bytes can be instantiated:

In [53]:
bytes(4)

b'\x00\x00\x00\x00'

Using the ```bytes``` class without providing any instantiation data will create an empty ```bytes``` instance:

In [54]:
bytes()

b''

Initialising data and then populating it is typically only used for a mutable class such as the ```bytearray```.

## Encoding and Decoding

When a bytes string was instantiated from a Unicode string, an encoding translation table was selected; that is a table that maps the bytes sequence to a specific character. There have been many encoding standards developed throughout the years and by default the current standard ```'utf-8'``` should be used. The Unicode string is essentially locked to using only ```'utf-8'``` and as a consequence is far easier to use than the bytes string:

|encoding|bytes per character|bits per character|byte order|byte order marker BOM|
|---|---|---|---|---|
|'utf-8'|1, 2, 3, 4|8, 16, 24, 32|big endian| |
|'utf-8-sig'|1, 2, 3, 4|8, 16, 24, 32|big endian|efbbbf|
|'utf-32'|4|32|little endian|fffe0000|
|'utf-32-le'|4|32|little endian| |
|'utf-32-be'|4|32|big endian| |
|'utf-16'|2|16|little endian|fffe|
|'utf-16-le'|2|16|little endian| |
|'utf-16-be'|2|16|big endian| |
|'latin1'|1|8| ||
|'ascii'|1|8| ||

This section will go into explaining the concepts in the table above. 

### ASCII

The most basic one is ```'ascii'``` where each character is encoded over ```1``` byte using only half of the possible values. ASCII is restricted to a small subset of English characters:

In [55]:
a_ascii = b'a'
a_ascii

b'a'

In [56]:
hex(ord('a'))

'0x61'

The ```'ascii'``` encoding scheme was originally developed using 7 bits with:

In [57]:
2 ** 7

128

For this reason the commands and characters span the range ```0:128``` (up to and excluding the upper bound of ```128```). This covers half the possible values of a byte:

In [58]:
2 ** 8

256

### Extended ASCII Variants

In the 1990s there were numerous regional encoding schemes (translation tables) which mapped the second half of the bytes to regional characters. 

In the UK, ```'latin1'``` was used which includes the ```£``` sign:

In [59]:
gb = bytes('£123.45', encoding='latin1')
gb

b'\xa3123.45'

In [60]:
int('0xa3', base=16)

163

In [61]:
gb.decode(encoding='latin1')

'£123.45'

This regional encoding scheme spanned over the full byte allowing the commonly used regional characters. 

The problem with early regional encoding was that operating systems and web browsers were often configured to use an encoding scheme that differed from the encoding the content itself was written in and, as a result, non-ASCII characters were often incorrectly substituted. This can be seen for example by decoding the byte string which was originally encoded in ```'latin1'``` with ```'latin2'```, ```'latin3'```, ```'greek'``` and ```'cyrillic'```. Notice the ```£``` sign only survives where the alternative table happens to map the byte ```a3``` to the same character:

In [62]:
gb.decode(encoding='latin2')

'Ł123.45'

In [63]:
gb.decode(encoding='latin3')

'£123.45'

In [64]:
gb.decode(encoding='greek')

'£123.45'

In [65]:
gb.decode(encoding='cyrillic')

'Ѓ123.45'

All of these formats should be considered as legacy formats.

### UTF-16

The Unicode Transformation Format ```'utf-16'``` was a previous standard where each character occupied ```2``` bytes, which is ```2 * 8``` bits and is where the ```16``` in the name comes from. Using 2 bytes instead of 1 byte per character increases the number of possible combinations to:

In [66]:
2 ** 16

65536

Unfortunately there were variations of ```'utf-16'``` which often got confused. The simplest to understand is ```'utf-16-be'```:

In [67]:
a_be = bytes('a', encoding='utf-16-be')
a_be

b'\x00a'

In [68]:
a_be.hex()

'0061'

Because each character must occupy 2 bytes, the leading zero byte ```\x00``` displays. BE is an abbreviation for Big Endian. Essentially the value sixteen would be encoded as ```10``` using big endian, which is consistent with the way numbers are written in English. In this notation the unit is last and the most significant byte, or biggest number, is written first (when reading left to right).

In [69]:
hex(16)

'0x10'

When ```'utf-16'``` was the standard, Intel however created processors that operated using Little Endian. In Little Endian, the value sixteen is encoded as ```01``` and the least significant byte, or littlest number, is supplied first (when reading left to right).

The same string that was examined using ```'utf-16-be'``` can be examined using ```'utf-16-le'```. Notice that the leading zero byte ```\x00``` is now trailing:

In [70]:
a_le = bytes('a', encoding='utf-16-le')
a_le

b'a\x00'

In [71]:
a_le.hex()

'6100'

If the Greek letter ```α``` is examined, notice the byte order is switched:

In [72]:
alpha_be = bytes('α', encoding='utf-16-be')
alpha_be

b'\x03\xb1'

In [73]:
alpha_le = bytes('α', encoding='utf-16-le')
alpha_le

b'\xb1\x03'

In [74]:
alpha_be.hex()

'03b1'

In [75]:
alpha_le.hex()

'b103'

Unintended character replacement is carried out when a character is encoded in ```'utf-16-le'``` and decoded using ```'utf-16-be'```:

In [76]:
alpha_le.decode(encoding='utf-16-be')

'넃'

Due to the confusion between big endian and little endian, ```'utf-16'```, which is little endian by default on most current hardware, includes a Byte Order Marker (BOM). The BOM is a 2 byte prefix that indicates the byte order:

In [77]:
bytes('', encoding='utf-16-le')

b''

In [78]:
bytes('', encoding='utf-16')

b'\xff\xfe'

The hex values of the BOM can be seen more clearly using the bytes strings ```hex``` method:

In [79]:
bytes('', encoding='utf-16').hex()

'fffe'

This has the hexadecimal values ```ff``` and ```fe```:

In [80]:
0xff

255

In [81]:
0xfe

254

These are the 2 largest values a byte can hold. The BOM can be conceptualised as a ```tuple``` of the form ```(255, 254)```:

In [82]:
bytes((255, 254))

b'\xff\xfe'

The order of the values in this 2 element ```tuple``` indicates little endian. The reversed order ```(254, 255)``` would indicate big endian.

If the 16 bit (2 bytes) encoding is examined in more detail as hex so it is human readable, ```'utf-16-be'```, ```'utf-16-le'``` and ```'utf-16'``` can be compared:
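The two byte order marks are available as constants in the standard library ```codecs``` module, and the ```'utf-16'``` decoder uses whichever one it finds to select the byte order before discarding it. A minimal sketch with illustrative variable names:

```python
import codecs
import sys

print(sys.byteorder)  # byte order of the machine running the notebook
print(codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

# Prefix the same character with each BOM and matching byte order.
little = codecs.BOM_UTF16_LE + 'a'.encode('utf-16-le')
big = codecs.BOM_UTF16_BE + 'a'.encode('utf-16-be')

# The 'utf-16' decoder reads the BOM, picks the byte order and strips the BOM.
decoded_little = little.decode('utf-16')
decoded_big = big.decode('utf-16')

print(decoded_little, decoded_big)
```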

In [83]:
be = bytes('abcdeαβγδε', encoding='Utf-16-be').hex()
be

'0061006200630064006503b103b203b303b403b5'

In [84]:
le = bytes('abcdeαβγδε', encoding='utf-16-le').hex()
le

'61006200630064006500b103b203b303b403b503'

In [85]:
bom_le = bytes('abcdeαβγδε', encoding='utf-16').hex()
bom_le

'fffe61006200630064006500b103b203b303b403b503'

Each character is encoded over 2 bytes and each byte is in turn represented using 2 hexadecimal characters. The differences are therefore easier to see if groupings of 4 hexadecimal characters are examined:

In [86]:
print('char', end=':            ')
for i in 'abcdeαβγδε':
    print(i, end='    ')

print()
print('utf-16-be', end=':      ')

for i in range(0, len(be), 4):
    print(be [i:i+4], end=' ')

print()    
print('utf-16-le', end=':      ')

for i in range(0, len(le), 4):
    print(le [i:i+4], end=' ')  
    
print()    
print('utf-16', end=':    ')

for i in range(0, len(bom_le), 4):
    print(bom_le [i:i+4], end=' ')    

char:            a    b    c    d    e    α    β    γ    δ    ε    
utf-16-be:      0061 0062 0063 0064 0065 03b1 03b2 03b3 03b4 03b5 
utf-16-le:      6100 6200 6300 6400 6500 b103 b203 b303 b403 b503 
utf-16:    fffe 6100 6200 6300 6400 6500 b103 b203 b303 b403 b503 

With 16 bits there are:

In [87]:
2 ** 16

65536

combinations. These are not enough to cover characters from all the languages in the world.
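The limit can be seen with a character whose number exceeds ```2 ** 16```. As a sketch with illustrative variable names, such a character no longer fits in a single 2 byte unit and ```'utf-16-be'``` instead encodes it over 4 bytes:

```python
giraffe = '🦒'
code_point = ord(giraffe)  # the integer assigned to the character
encoded = giraffe.encode('utf-16-be')

print(hex(code_point))
print(code_point >= 2 ** 16)
print(encoded.hex(), len(encoded))
```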

### UTF-32

Therefore ```'utf-32'``` was developed:

In [88]:
2 ** 32

4294967296

This gives groupings of 4 bytes, which is 8 hexadecimal characters:

In [89]:
word = 'abαβ悤悥🦒🦓'

be = bytes(word, encoding='Utf-32-be').hex()
le = bytes(word, encoding='Utf-32-le').hex()
bom_le = bytes(word, encoding='Utf-32').hex()


print('char', end=':               ')
for i in word:
    print(i, end='        ')

print()
print('utf-32-be', end=':          ')

for i in range(0, len(be), 8):
    print(be [i:i+8], end=' ')

print()    
print('utf-32-le', end=':          ')

for i in range(0, len(le), 8):
    print(le [i:i+8], end=' ')  
    
print()    
print('utf-32', end=':    ')

for i in range(0, len(bom_le), 8):
    print(bom_le [i:i+8], end=' ')    

char:               a        b        α        β        悤        悥        🦒        🦓        
utf-32-be:          00000061 00000062 000003b1 000003b2 000060a4 000060a5 0001f992 0001f993 
utf-32-le:          61000000 62000000 b1030000 b2030000 a4600000 a5600000 92f90100 93f90100 
utf-32:    fffe0000 61000000 62000000 b1030000 b2030000 a4600000 a5600000 92f90100 93f90100 
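Because every character occupies exactly 4 bytes, the big endian form is simply the integer returned by ```ord``` written over 4 bytes. A minimal sketch, where ```utf32_value``` is a hypothetical helper name:

```python
def utf32_value(character):
    """Read the 4 bytes of a 'utf-32-be' character as one integer (hypothetical helper)."""
    encoded = character.encode('utf-32-be')
    return int.from_bytes(encoded, byteorder='big')

for character in 'aα悤🦒':
    print(character, hex(utf32_value(character)), utf32_value(character) == ord(character))
```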

### UTF-8

The main drawback of ```'utf-32'``` is that it requires a lot more memory per character. Moreover each ASCII character, for example, now has 6 leading (big endian) or trailing (little endian) hexadecimal zeros. ```'utf-8'``` was developed as an adaptable format to address these shortcomings and characters can span over 1-4 bytes. ```'utf-8'``` is always big endian and instead of a BOM prefix, the first bits of each byte in a multibyte character follow a bit pattern.

In [90]:
print('1 byte:', end=' ')
for unicode_char in 'abcde':
    print(bytes(unicode_char, encoding='utf-8').hex(), end=' ')

print()
print('2 bytes:', end=' ')    
for unicode_char in 'αβγδε':
    print(bytes(unicode_char, encoding='utf-8').hex(), end=' ')
    
print()
print('3 bytes:', end=' ')  
for unicode_char in '悤悥悦悧您':
    print(bytes(unicode_char, encoding='utf-8').hex(), end=' ')

print()
print('4 bytes:', end=' ')  
for unicode_char in '🦒🦓🦔🦕🦖':
    print(bytes(unicode_char, encoding='utf-8').hex(), end=' ')


1 byte: 61 62 63 64 65 
2 bytes: ceb1 ceb2 ceb3 ceb4 ceb5 
3 bytes: e682a4 e682a5 e682a6 e682a7 e682a8 
4 bytes: f09fa692 f09fa693 f09fa694 f09fa695 f09fa696 

Generally:

* 1 byte is the ```'ascii'``` subset.
* 2 bytes are used for extended European characters. ```'utf-16'``` is not a subset as ```'utf-8'``` uses a bit pattern that differs from ```'utf-16'```.
* 3 bytes are used for additional languages.
* 4 bytes are used for emojis. ```'utf-32'``` is not a subset as ```'utf-8'``` uses a bit pattern that differs from ```'utf-32'```.

```'utf-8'``` uses some of the bits to identify 2, 3 and 4 byte characters. For example the 2 byte character:

In [91]:
bin(0xceb1).removeprefix('0b').zfill(2 * 8)

'1100111010110001'

has the bit pattern in bold of the form:

**110**01110 **10**110001

The 3 byte character:

In [92]:
bin(0xe682a4).removeprefix('0b').zfill(3 * 8)

'111001101000001010100100'

has the bit pattern in bold of the form:

**1110**0110 **10**000010 **10**100100

In [93]:
bin(0xb)

'0b1011'

The 4 byte character:

In [94]:
bin(0xf09fa692).removeprefix('0b').zfill(4 * 8)

'11110000100111111010011010010010'

has the bit pattern in bold of the form:

**11110**000 **10**011111 **10**100110 **10**010010

Since ```'utf-8'``` is always big endian and characters encoded over multiple bytes have a bit pattern, there is generally no need for a BOM. 
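The bit patterns above can be used to decode a character by hand. This is a minimal sketch for illustration only: ```utf8_code_point``` is a hypothetical helper, it handles a single character and it omits the additional validity checks a real decoder carries out. The marker bits are masked off and the remaining bits are joined, 6 bits at a time:

```python
def utf8_code_point(encoded):
    """Decode a single 'utf-8' encoded character by hand (hypothetical helper)."""
    first = encoded[0]
    if first < 0b10000000:       # 0xxxxxxx: 1 byte, the ASCII subset
        return first
    if first >> 5 == 0b110:      # 110xxxxx: start of a 2 byte character
        value = first & 0b00011111
    elif first >> 4 == 0b1110:   # 1110xxxx: start of a 3 byte character
        value = first & 0b00001111
    elif first >> 3 == 0b11110:  # 11110xxx: start of a 4 byte character
        value = first & 0b00000111
    else:
        raise ValueError('not a valid leading byte')
    for byte in encoded[1:]:
        if byte >> 6 != 0b10:    # 10xxxxxx: every following byte
            raise ValueError('not a valid continuation byte')
        value = (value << 6) | (byte & 0b00111111)
    return value

for hex_string in ('61', 'ceb1', 'e682a4', 'f09fa692'):
    number = utf8_code_point(bytes.fromhex(hex_string))
    print(hex_string, hex(number), chr(number))
```

The integers recovered match the values seen in the ```'utf-32-be'``` row earlier, which is why both encodings describe the same characters despite the different bytes.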

### UTF-8-Sig

Despite ```'utf-8'``` not requiring a BOM, Microsoft often include one in their products using a variation ```'utf-8-sig'```, which may therefore be seen in data exported from popular Microsoft applications such as Notepad or Excel. The BOM can be seen by comparing the casting of an empty Unicode string to a byte string using ```'utf-8'``` and ```'utf-8-sig'``` respectively:

In [95]:
no_bom = bytes('', encoding='utf-8').hex()
no_bom

''

In [96]:
bom = bytes('', encoding='utf-8-sig').hex()
bom

'efbbbf'

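The BOM ```efbbbf``` is the ```'utf-8'``` encoding of the Unicode character U+FEFF, the same character that is used as the BOM in ```'utf-16'```, where it appears as ```feff``` (big endian) or ```fffe``` (little endian). In ```'utf-8'``` it does not indicate a byte order and acts only as a signature that marks the data as ```'utf-8'```. The BOM can be seen in binary: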

In [97]:
bin(0xefbbbf).removeprefix('0b').zfill(3 * 8)

'111011111011101110111111'

Notice it also has the byte pattern that indicates the BOM is encoded over 3 bytes:

**1110**1111 **10**111011 **10**111111
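As a quick check, the same three bytes can be produced by encoding the character U+FEFF directly. The standard library ```codecs``` module also exposes the BOM constants, which are used here only for comparison:

```python
import codecs

bom_character = '\ufeff'

# Encoding U+FEFF in utf-8 gives the three bytes seen above.
utf8_bom = bom_character.encode('utf-8')
print(utf8_bom.hex())                 # efbbbf
print(utf8_bom == codecs.BOM_UTF8)    # True

# The same character in utf-16 depends on the byte order.
print(codecs.BOM_UTF16_BE.hex())      # feff
print(codecs.BOM_UTF16_LE.hex())      # fffe
```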

```'utf-8'``` is the current standard and should be used by default. The Unicode string ```str``` class uses ```'utf-8'``` by default whenever it is encoded and is much easier to work with, as there is no need to worry about encoding while the text is being manipulated.

When decoding data from another source that is in byte form, ```'utf-8'``` should be used by default to decode it. If an unwanted BOM appears at the start when decoding data, then the data was probably processed in a Microsoft product with ```'utf-8-sig'```.
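A minimal sketch of the difference when decoding is shown below. The variable names are chosen for illustration. Decoding with ```'utf-8-sig'``` removes a leading BOM if one is present and otherwise behaves like ```'utf-8'```:

```python
data_with_bom = 'hello'.encode('utf-8-sig')
data_without_bom = 'hello'.encode('utf-8')

# Decoding with utf-8 keeps the BOM as an invisible first character.
kept = data_with_bom.decode('utf-8')
print(repr(kept))        # '\ufeffhello'

# Decoding with utf-8-sig strips the BOM.
stripped = data_with_bom.decode('utf-8-sig')
print(repr(stripped))    # 'hello'

# Data without a BOM is decoded unchanged by utf-8-sig.
unchanged = data_without_bom.decode('utf-8-sig')
print(repr(unchanged))   # 'hello'
```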

## Byte Strings VS Unicode Strings

Unicode strings can essentially be thought of as byte strings encoded in ```'utf-8'``` with the methods simplified to use a Unicode character as the fundamental unit as opposed to a byte. Moreover, since there is only one way to interpret each character, it can be displayed natively, making it easier to read. For most Python programs a Unicode string should therefore be used instead of a byte string.

The binary sequence of the ```'ascii'``` string, which recall is a subset of ```'utf-8'```, is:

In [98]:
for number in b'hello world!':
    print(bin(number).removeprefix('0b').zfill(8), end='')

011010000110010101101100011011000110111100100000011101110110111101110010011011000110010000100001

The binary sequence is not very human readable. However it is very easy to transmit as a digital signal. A serial port for example has a signal pin which is configured to transmit or receive a digital time trace. A baud rate of ```9600``` for example means ```9600``` bits are processed in a second. When the bit is ```0```, the voltage of the signal pin is LOW and when the bit is ```1``` the voltage of the signal pin is HIGH.
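As a rough worked example, the time taken to send the message at this baud rate can be estimated. The first figure counts only the data bits. The second assumes each byte is framed with one start bit and one stop bit, which is a common serial configuration but an assumption here:

```python
message = b'hello world!'
baud_rate = 9600

# 12 bytes of 8 bits each.
data_bits = len(message) * 8
print(data_bits)                      # 96
seconds_data_only = data_bits / baud_rate
print(seconds_data_only)              # 0.01

# Assumption: 1 start bit + 8 data bits + 1 stop bit per byte.
framed_bits = len(message) * 10
seconds_framed = framed_bits / baud_rate
print(seconds_framed)                 # 0.0125
```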

In [99]:
b'hello world!'.hex()

'68656c6c6f20776f726c6421'

Byte strings are therefore still used when directly interfacing with hardware, for example an Arduino. In such applications it is recommended to decode a byte string to a Unicode string as early as possible in a Python program and to encode a Unicode string to a byte string as late as possible before transmitting it to basic hardware, in order to avoid encoding issues.
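The decode early and encode late pattern can be sketched as follows. The functions ```read_from_device``` and ```write_to_device``` are hypothetical stand-ins for real hardware calls and the message format is invented for illustration:

```python
def read_from_device() -> bytes:
    """Hypothetical stand-in for a hardware read, returns raw bytes."""
    return b'temperature=21.5\r\n'


def write_to_device(payload: bytes) -> bytes:
    """Hypothetical stand-in for a hardware write, returns the payload for inspection."""
    return payload


# Decode as early as possible, straight after the read.
text = read_from_device().decode('utf-8').strip()

# All processing is carried out on Unicode strings.
name, value = text.split('=')
reply = f'ack {name} {float(value):.1f}'

# Encode as late as possible, just before the write.
sent = write_to_device((reply + '\n').encode('utf-8'))
print(sent)  # b'ack temperature 21.5\n'
```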

## Bytes Identifiers

The bytes class uses the design model of a Python object as seen by its method resolution order:

In [100]:
bytes.mro()

[bytes, object]

If the help function is used on bytes, details about all the identifiers will be given:

In [101]:
help(bytes)

Help on class bytes in module builtins:

class bytes(object)
 |  bytes(iterable_of_ints) -> bytes
 |  bytes(string, encoding[, errors]) -> bytes
 |  bytes(bytes_or_buffer) -> immutable copy of bytes_or_buffer
 |  bytes(int) -> bytes object of size given by the parameter initialized with null bytes
 |  bytes() -> empty bytes object
 |  
 |  Construct an immutable array of bytes from:
 |    - an iterable yielding integers in range(256)
 |    - a text string encoded using the specified encoding
 |    - any object implementing the buffer API.
 |    - an integer
 |  
 |  Methods defined here:
 |  
 |  __add__(self, value, /)
 |      Return self+value.
 |  
 |  __bytes__(self, /)
 |      Convert this value to exact type bytes.
 |  
 |  __contains__(self, key, /)
 |      Return key in self.
 |  
 |  __eq__(self, value, /)
 |      Return self==value.
 |  
 |  __ge__(self, value, /)
 |      Return self>=value.
 |  
 |  __getattribute__(self, name, /)
 |      Return getattr(self, name).
 |  
 | 

From the previous notebook, you'll recognise many of these identifiers as they are seen in the string class. If the ```print_identifier_group``` function is imported from the custom ```helper_module```:

In [102]:
from helper_module import print_identifier_group

The identifiers can be grouped into:

* datamodel attributes
* datamodel methods
* attributes
* methods

The identifier names can then be compared for consistency between the Unicode string ```str``` class and the bytes string ```bytes``` class:

In [103]:
print_identifier_group(bytes, kind='datamodel_attribute', second=str, show_unique_identifiers=True)

[]


In [104]:
print_identifier_group(bytes, kind='datamodel_attribute', second=str, show_only_intersection_identifiers=True)

['__doc__']


In [105]:
print_identifier_group(bytes, kind='datamodel_method', second=str, show_unique_identifiers=True)

['__bytes__']


In [106]:
print_identifier_group(bytes, kind='datamodel_method', second=str, show_only_intersection_identifiers=True)

['__add__', '__class__', '__contains__', '__delattr__', '__dir__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getnewargs__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__mod__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__rmod__', '__rmul__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__']


In [107]:
print_identifier_group(bytes, kind='attribute', second=str, show_unique_identifiers=True)

[]


In [108]:
print_identifier_group(bytes, kind='attribute', second=str, show_only_intersection_identifiers=True)

[]


In [109]:
print_identifier_group(bytes, kind='function', second=str, show_unique_identifiers=True)

['decode', 'fromhex', 'hex']


In [110]:
print_identifier_group(bytes, kind='function', second=str, show_only_intersection_identifiers=True)

['capitalize', 'center', 'count', 'endswith', 'expandtabs', 'find', 'index', 'isalnum', 'isalpha', 'isascii', 'isdigit', 'islower', 'isspace', 'istitle', 'isupper', 'join', 'ljust', 'lower', 'lstrip', 'maketrans', 'partition', 'removeprefix', 'removesuffix', 'replace', 'rfind', 'rindex', 'rjust', 'rpartition', 'rsplit', 'rstrip', 'split', 'splitlines', 'startswith', 'strip', 'swapcase', 'title', 'translate', 'upper', 'zfill']
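Since ```helper_module``` is a custom module, a similar comparison can be sketched using only ```builtins```. This is not the implementation of ```print_identifier_group```, it simply compares the names returned by ```dir``` using sets, and the exact names listed can vary between Python versions:

```python
bytes_names = set(dir(bytes))
str_names = set(dir(str))

# Identifiers found in one class but not the other.
only_in_bytes = sorted(bytes_names - str_names)
only_in_str = sorted(str_names - bytes_names)

# Identifiers found in both classes.
in_both = sorted(bytes_names & str_names)

print(only_in_bytes)
print(only_in_str)
print(len(in_both))
```

Unlike the grouped output above, this sketch does not separate datamodel identifiers from the other attributes and methods.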


For the most part the identifiers are consistent between the two classes, however the behaviour of some will differ due to the difference in fundamental unit.
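As a short illustration, a method with the same name returns an instance of the class it was called from, and a ```bytes``` method expects ```bytes``` arguments rather than ```str``` arguments:

```python
# The same method name returns an instance of the respective class.
upper_s = 'hello'.upper()
upper_b = b'hello'.upper()
print(upper_s)   # HELLO
print(upper_b)   # b'HELLO'

# Arguments supplied to a bytes method must also be bytes.
replaced_b = b'hello'.replace(b'l', b'L')
print(replaced_b)  # b'heLLo'

# Mixing the two classes raises a TypeError.
try:
    b'hello'.replace('l', 'L')
except TypeError as error:
    mixed_error = error
    print(type(error).__name__)  # TypeError
```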

Most methods and datamodel methods have the same names in the two classes and have consistent behaviour. The reason for this is that both classes are immutable ```Collections``` and therefore follow the same design pattern. The immutable methods in the ```str``` class which return a new ```str``` instance will generally return a new ```bytes``` instance when their counterparts in the ```bytes``` class are used. The individual unit of a Unicode string is a Unicode character and the individual unit of a bytes string is a byte. The datamodel method ```__len__``` (*dunder len*) will return how many of these individual units are in an instance. Recall that the ```builtins``` function ```len``` is preferentially used over the datamodel method ```__len__``` (*dunder len*) but under the hood the datamodel method defines the behaviour of ```len``` when used on this class:

In [111]:
english_b = b'abcde'
greek_b = bytes('αβγδε', encoding='UTF-8')
english_s = 'abcde'
greek_s = 'αβγδε'

Notice that the ```len``` of ```english_s``` and ```greek_s``` are consistently ```5``` as there are ```5``` Unicode characters:

In [112]:
len(english_s)

5

In [113]:
len(greek_s)

5

Notice that the ```len``` of ```english_b``` is also ```5``` as each ASCII character spans ```1``` byte, however ```greek_b``` has a length of ```10``` as each character spans ```2``` bytes:

In [114]:
len(english_b)

5

In [115]:
len(greek_b)

10
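The relationship between the two lengths can be checked by encoding each character individually. This is a small sketch and the variable name ```bytes_per_character``` is chosen for illustration:

```python
greek_s = 'αβγδε'
greek_b = greek_s.encode('utf-8')

# Number of bytes needed for each Unicode character.
bytes_per_character = [len(char.encode('utf-8')) for char in greek_s]
print(bytes_per_character)            # [2, 2, 2, 2, 2]

# The total matches the length of the bytes instance.
print(sum(bytes_per_character))       # 10
print(len(greek_b))                   # 10

# Decoding returns to 5 Unicode characters.
print(len(greek_b.decode('utf-8')))   # 5
```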

Recall that the datamodel method ```__getitem__``` (*dunder getitem*) defines the behaviour when indexing with square brackets. For a Unicode string, the Unicode character corresponding to that index is returned. For a bytes string on the other hand, the byte in the form of an ```int``` is returned:

In [116]:
greek_s[0]

'α'

In [117]:
greek_b[0]

206

In [118]:
greek_b.hex()

'ceb1ceb2ceb3ceb4ceb5'

In [119]:
hex(206)

'0xce'

In [120]:
0xce

206

Slicing will instead return a ```bytes``` instance. The difference can be seen when ```1``` byte is selected from a slice:

In [121]:
greek_b[:1]

b'\xce'

The syntax is otherwise consistent with slicing in the string class:

In [122]:
greek_b[:2:]

b'\xce\xb1'

In [123]:
greek_b[::2]

b'\xce\xce\xce\xce\xce'

In [124]:
greek_b[1::2]

b'\xb1\xb2\xb3\xb4\xb5'
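Since every Greek character here spans ```2``` bytes, slices of length ```2``` can be decoded back to individual characters. The following sketch relies on that fixed width, which would not hold for text that mixes characters of different byte lengths:

```python
greek_b = 'αβγδε'.encode('utf-8')

# Iterating over or indexing into a bytes instance gives int values.
print(list(greek_b[:2]))     # [206, 177]

# An iterable of int values can be cast back to a bytes instance.
print(bytes([206, 177]))     # b'\xce\xb1'

# Slice in steps of 2 bytes and decode each slice to a character.
characters = [greek_b[index:index + 2].decode('utf-8')
              for index in range(0, len(greek_b), 2)]
print(characters)            # ['α', 'β', 'γ', 'δ', 'ε']
```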