# Builtins Module: The Byte String Class (bytes)

In the previous notebook the Unicode string class ```str``` was examined and was seen to have a Unicode character as its fundamental unit. The byte string class ```bytes```, on the other hand, uses a byte as its fundamental unit. The byte string was the foundation for text data in Python 2. 

## Categorize_Identifiers Module

This notebook will use the functions ```dir2```, ```variables``` and ```view``` from the custom module ```categorize_identifiers```, which is found in the same directory as this notebook file. ```dir2``` is a variant of ```dir``` that groups identifiers into a ```dict``` under categories and ```variables``` is an IPython based variable inspector. ```view``` is used to view a ```Collection``` in more detail:

In [1]:
from categorize_identifiers import dir2, variables, view

## Bytes Conception

A computer stores data using a bit. A bit can be conceptualised as a single dip switch and has the values Off and On respectively as shown below. A single switch has the possible values ```0```, ```1``` which is ```2 ** 1``` combinations which is a total of ```2```. 

A single switch ranges between ```0:2```. Since Python uses zero-based indexing, the lower bound ```0``` is inclusive and the upper bound ```2``` is exclusive, i.e. up to and excluding ```2```:

<img src='./images/img_001.png' alt='img_001' width='400'/>

More typically ```8``` of these switches are combined into a single logical unit called a byte. A byte has ```2 ** 8``` combinations which is a total of ```256```, i.e. a byte comprises 8 bits and has ```0:256``` combinations:

<img src='./images/img_002.png' alt='img_002' width='400'/>

Each dipswitch represents a power of ```2```. The first dipswitch on the right hand side "8" represents the units (```2 ** 0```), the second dipswitch from the right "7" represents the power (```2 ** 1```), the third dipswitch from the right "6" represents the power (```2 ** 2```) and so on... The number above can therefore be calculated as a decimal number using:

In [2]:
+ 0 * (2 ** 7) \
+ 1 * (2 ** 6) \
+ 1 * (2 ** 5) \
+ 0 * (2 ** 4) \
+ 1 * (2 ** 3) \
+ 0 * (2 ** 2) \
+ 0 * (2 ** 1) \
+ 0 * (2 ** 0) 

104
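The weighted sum above can also be sketched as a loop over the individual bits. This is only a minimal illustration; the names ```bits``` and ```total``` are assumptions introduced here and are not part of the notebook:

```python
# Bit pattern from the dip switch image, most significant bit first
bits = '01101000'

total = 0
for position, bit in enumerate(reversed(bits)):
    # position 0 is the rightmost switch, which is worth 2 ** 0
    total += int(bit) * (2 ** position)

print(total)
# The builtin int also accepts a str of bits together with a base
print(int(bits, base=2))
```

Both approaches should agree, since each one multiplies every bit by its power of ```2``` and adds the results.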

The byte above can be expressed as a binary number using the prefix ```0b```. This prefix is used to distinguish base 2 from base 10, which is used by default for an ```int```. Notice that syntax highlighting marks the base 2 prefix. The decimal ```int``` will be returned in the cell output:

In [3]:
0b01101000

104

Leading zeros are normally omitted:

In [4]:
0b1101000

104

Although in this context, it is useful to show the leading zeros, so all 8 bits in the byte can be visualised.

A ```bytes``` instance is essentially a collection of individual bytes:

<img src='./images/img_003.png' alt='img_003' width='400'/>

Each byte above can be represented as a binary number and grouped into a ```tuple```:

In [5]:
(0b01101000, 0b01100101, 0b01101100, 0b01101100, 0b01101111)

(104, 101, 108, 108, 111)

This ```tuple``` collection can be cast into ```bytes``` giving text information:

In [6]:
bytes((0b01101000, 0b01100101, 0b01101100, 0b01101100, 0b01101111))

b'hello'

If this ```bytes``` instance is viewed, notice the value for each index is a byte which is represented by an ```int``` in decimal:

In [7]:
view(bytes((0b01101000, 0b01100101, 0b01101100, 0b01101100, 0b01101111)))

Index 	 Type                 	 Size   	 Value                         
0 	 int                  	 1      	 104                            	
1 	 int                  	 1      	 101                            	
2 	 int                  	 1      	 108                            	
3 	 int                  	 1      	 108                            	
4 	 int                  	 1      	 111                            	
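The values reported by ```view``` can be cross-checked with plain indexing. As a minimal sketch (the names ```greeting```, ```first```, ```head``` and ```values``` are assumptions for illustration), indexing a ```bytes``` instance returns an ```int```, whereas slicing returns another ```bytes``` instance:

```python
greeting = bytes((0b01101000, 0b01100101, 0b01101100, 0b01101100, 0b01101111))

first = greeting[0]      # a single index gives an int
head = greeting[0:1]     # a slice gives a bytes instance
values = list(greeting)  # iterating yields each byte as an int

print(first, head, values)
```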


## ASCII Characters

Recall that the American Standard Code for Information Interchange (ASCII) maps each byte to a physical command or English character. The physical commands were used to control primitive computers that were essentially typewriter based:

<img src='./images/img_004.png' alt='img_004' width='600'/>

ASCII defines 128 codes which span the first half of the byte. The first 33 of these (```0``` to ```32```) are commands and the space:

|byte|hex|num|command|
|---|---|---|---|
|00000000|00|000|null|
|00000001|01|001|start of heading|
|00000010|02|002|start of text|
|00000011|03|003|end of text|
|00000100|04|004|end of transmission|
|00000101|05|005|enquiry|
|00000110|06|006|acknowledge|
|00000111|07|007|bell|
|00001000|08|008|**backspace**|
|00001001|09|009|**horizontal tab**|
|00001010|0a|010|**new line**|
|00001011|0b|011|**vertical tab**|
|00001100|0c|012|**form feed**|
|00001101|0d|013|**carriage return**|
|00001110|0e|014|shift out|
|00001111|0f|015|shift in|
|00010000|10|016|data link escape|
|00010001|11|017|device control 1|
|00010010|12|018|device control 2|
|00010011|13|019|device control 3|
|00010100|14|020|device control 4|
|00010101|15|021|negative acknowledge|
|00010110|16|022|synchronous idle|
|00010111|17|023|end of transmission block|
|00011000|18|024|cancel|
|00011001|19|025|end of medium|
|00011010|1a|026|substitute|
|00011011|1b|027|**escape**|
|00011100|1c|028|file separator|
|00011101|1d|029|group separator|
|00011110|1e|030|record separator|
|00011111|1f|031|unit separator|
|00100000|20|032|**space**|

The remaining codes in the first half of the byte contain the characters most commonly used in the English language:

|byte|hex|num|character|
|---|---|---|---|
|00100001|21|033|!|
|00100010|22|034|"|
|00100011|23|035|#|
|00100100|24|036|$|
|00100101|25|037|%|
|00100110|26|038|&|
|00100111|27|039|'|
|00101000|28|040|(|
|00101001|29|041|)|
|00101010|2a|042|*|
|00101011|2b|043|+|
|00101100|2c|044|,|
|00101101|2d|045|-|
|00101110|2e|046|.|
|00101111|2f|047|/|
|00110000|30|048|0|
|00110001|31|049|1|
|00110010|32|050|2|
|00110011|33|051|3|
|00110100|34|052|4|
|00110101|35|053|5|
|00110110|36|054|6|
|00110111|37|055|7|
|00111000|38|056|8|
|00111001|39|057|9|
|00111010|3a|058|:|
|00111011|3b|059|;|
|00111100|3c|060|<|
|00111101|3d|061|=|
|00111110|3e|062|>|
|00111111|3f|063|?|
|01000000|40|064|@|
|01000001|41|065|A|
|01000010|42|066|B|
|01000011|43|067|C|
|01000100|44|068|D|
|01000101|45|069|E|
|01000110|46|070|F|
|01000111|47|071|G|
|01001000|48|072|H|
|01001001|49|073|I|
|01001010|4a|074|J|
|01001011|4b|075|K|
|01001100|4c|076|L|
|01001101|4d|077|M|
|01001110|4e|078|N|
|01001111|4f|079|O|
|01010000|50|080|P|
|01010001|51|081|Q|
|01010010|52|082|R|
|01010011|53|083|S|
|01010100|54|084|T|
|01010101|55|085|U|
|01010110|56|086|V|
|01010111|57|087|W|
|01011000|58|088|X|
|01011001|59|089|Y|
|01011010|5a|090|Z|
|01011011|5b|091|[|
|01011100|5c|092|\|
|01011101|5d|093|]|
|01011110|5e|094|^|
|01011111|5f|095|_|
|01100000|60|096|`|
|01100001|61|097|a|
|01100010|62|098|b|
|01100011|63|099|c|
|01100100|64|100|d|
|01100101|65|101|e|
|01100110|66|102|f|
|01100111|67|103|g|
|01101000|68|104|h|
|01101001|69|105|i|
|01101010|6a|106|j|
|01101011|6b|107|k|
|01101100|6c|108|l|
|01101101|6d|109|m|
|01101110|6e|110|n|
|01101111|6f|111|o|
|01110000|70|112|p|
|01110001|71|113|q|
|01110010|72|114|r|
|01110011|73|115|s|
|01110100|74|116|t|
|01110101|75|117|u|
|01110110|76|118|v|
|01110111|77|119|w|
|01111000|78|120|x|
|01111001|79|121|y|
|01111010|7a|122|z|
|01111011|7b|123|{|
|01111100|7c|124|\||
|01111101|7d|125|}|
|01111110|7e|126|~|
|01111111|7f|127|delete|


Recall the ```string``` module contains the printable ASCII characters:

In [8]:
import string

In the ```bytes``` class, the translation table for an ASCII character is always the same. Therefore in the formal representation instead of displaying the byte for that character, the ASCII character is shown:

In [9]:
string.printable

'0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'

## Initialisation Signature

The initialisation signature for the ```bytes``` class can be examined:

In [10]:
bytes?

Init signature: bytes(self, /, *args, **kwargs)
Docstring:     
bytes(iterable_of_ints) -> bytes
bytes(string, encoding[, errors]) -> bytes
bytes(bytes_or_buffer) -> immutable copy of bytes_or_buffer
bytes(int) -> bytes object of size given by the parameter initialized with null bytes
bytes() -> empty bytes object

Construct an immutable array of bytes from:
  - an iterable yielding integers in range(256)
  - a text string encoded using the specified encoding
  - any object implementing the buffer API.
  - an integer
Type:           type
Subclasses:     bytes_

For the ```bytes``` string class, the generic initialisation signature is followed by 5 alternative ways of supplying instance data:

```python
bytes(self, /, *args, **kwargs)
bytes(iterable_of_ints) -> bytes
bytes(string, encoding[, errors]) -> bytes
bytes(bytes_or_buffer) -> immutable copy of bytes_or_buffer
bytes(int) -> bytes object of size given by the parameter initialized with null bytes
bytes() -> empty bytes object
```

If the generic signature on the first line is examined:

```python
bytes(self, /, *args, **kwargs)
```

* The parentheses ```( )``` are used to call a function and supply any necessary input arguments.
* The comma ```,``` is used as a delimiter to separate out any input arguments.
* ```self``` is used to denote *this instance*, the instance being initialised. Because a byte string is a fundamental datatype, the instance data supplied can itself be an existing byte string instance.
* Any input argument before a ```/``` must be provided positionally.
* ```*args``` indicates a variable number of additional positional input arguments. These are typically not used for initialisation of the ```bytes``` string class.
* ```**kwargs``` indicates a variable number of additional named input arguments. These are typically not used for initialisation of the ```bytes``` string class.

A ```bytes``` instance can be instantiated by supplying an existing ```bytes``` instance ```self``` to the ```bytes``` class:

In [11]:
bytes(b'Hello World!')

b'Hello World!'

However because the ```bytes``` class is a fundamental datatype it can also be instantiated shorthand using the following:

In [12]:
b'Hello World!'

b'Hello World!'

All of the characters above in the ```bytes``` instance are ASCII printable characters. Therefore each byte value in the ```bytes``` instance above is represented by its corresponding ASCII character:

In [13]:
view(b'Hello World!')

Index 	 Type                 	 Size   	 Value                         
0 	 int                  	 1      	 72                             	
1 	 int                  	 1      	 101                            	
2 	 int                  	 1      	 108                            	
3 	 int                  	 1      	 108                            	
4 	 int                  	 1      	 111                            	
5 	 int                  	 1      	 32                             	
6 	 int                  	 1      	 87                             	
7 	 int                  	 1      	 111                            	
8 	 int                  	 1      	 114                            	
9 	 int                  	 1      	 108                            	
10 	 int                  	 1      	 100                            	
11 	 int                  	 1      	 33                             	


In [14]:
string.printable

'0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'

A ```bytes``` instance can be initialised using an iterable such as a ```tuple``` of ```int``` instances:

```python
bytes(iterable_of_ints, /) -> bytes
```

For this to work each ```int``` must be a valid byte value and recall that a byte looks like the following:

<img src='./images/img_002.png' alt='img_002' width='400'/>

Since a byte has ```2 ** 8``` combinations, which is a total of ```256```, the range is ```0:256```, inclusive of the lower bound and exclusive of the upper bound. Therefore the maximum value for an ```int``` instance in a ```bytes``` instance is ```255```. Note that a trailing comma is required to distinguish a single element ```tuple``` from a numeric calculation using parentheses:

In [15]:
num = (97)

In [16]:
archive = (97, )

In [17]:
variables()

Instance Name,Type,Size/Shape,Value
num,int,,97
archive,tuple,1.0,"(97,)"


From the above ASCII table the ```int``` instance ```97``` corresponds to the character ```a``` and this can be seen when the ```tuple``` is cast to ```bytes```:

In [18]:
bytes((97, ))

b'a'

When an ```int``` is outside the range ```0:256``` (the upper bound ```256``` is exclusive due to zero-based indexing), a ```ValueError``` is raised:

```python
bytes((256, ))
```
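The boundary can be probed without halting the notebook by wrapping the cast in a ```try``` block. This is a minimal sketch; the names ```caught``` and ```value``` are assumptions introduced for illustration:

```python
caught = []
for value in (255, 256, -1):
    try:
        bytes((value, ))
    except ValueError as error:
        # 256 and -1 both fall outside range(256)
        caught.append(value)
        print(value, error)
```

Only ```255``` should be accepted, as negative integers are rejected in the same way as integers that are too large.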

Normally the ```tuple``` will contain more than one ```int``` instance and each of these will be a valid byte value:

In [19]:
integers = (104, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100, 33)

In [20]:
bytes(integers)

b'hello world!'

Recall that the decimal ```int``` instance ```104``` can be represented in binary. The ```bin``` function will return this binary number as a Unicode ```str``` instance:

In [21]:
bin(104)

'0b1101000'

For conception it is helpful to see the leading zeros using the ```str``` instance methods ```removeprefix``` and ```zfill```, alongside ```str``` instance concatenation:

In [22]:
'0b' + bin(104).removeprefix('0b').zfill(8)

'0b01101000'

The Unicode ```str``` instance corresponding to this byte can be retrieved using the ```chr``` function:

In [23]:
chr(104)

'h'

This ```bytes``` instance consists of ```12``` individual byte units. The different representation for each byte unit can be examined below:

In [24]:
for number in integers:
    print(str(number).center(8), end=' ')

print()
for number in integers:
    print(bin(number).removeprefix('0b').zfill(8), end=' ')

print()
for number in integers:
    print(chr(number).center(8), end=' ')

  104      101      108      108      111       32      119      111      114      108      100       33    
01101000 01100101 01101100 01101100 01101111 00100000 01110111 01101111 01110010 01101100 01100100 00100001 
   h        e        l        l        o                 w        o        r        l        d        !     

When every byte is a printable ASCII character it will be displayed instead of the byte sequence:

In [25]:
bytes(integers)

b'hello world!'

When the byte is not printable, for example the ```int``` instance ```0``` which corresponds to the ASCII non-printable command NULL, it is represented using a hexadecimal escape sequence:

In [26]:
bytes((0,))

b'\x00'

All the whitespace characters, with the exception of the space, are represented using an escape character. For the tab, newline and carriage return characters that are commonly used, there are the escape characters ```\t``` (decimal 9), ```\n``` (decimal 10) and ```\r``` (decimal 13). The vertical tab and form feed are less commonly used and are represented by their hexadecimal escape sequences ```\x0b``` (decimal 11) and ```\x0c``` (decimal 12):

In [27]:
string.whitespace

' \t\n\r\x0b\x0c'

In [28]:
integers = (9, 10, 11, 12, 13)

In [29]:
bytes(integers)

b'\t\n\x0b\x0c\r'

Binary ```0b00001100``` is not very human-readable and therefore it is easy for a human to make transcription errors when dealing with binary. To make a byte more human readable the hexadecimal number system is introduced. In hexadecimal the byte is essentially split into 2 halves and each half byte is represented as a hexadecimal character:

<img src='./images/img_005.png' alt='img_005' width='400'/>

Recall **b**inary has ```2``` digits and the prefix ```0b```, decimal has ```10``` digits and no prefix because it is the most commonly used numbering system. He**x**adecimal has ```16``` digits and the prefix ```0x```. 

Hexadecimal takes the first ```10``` digits from decimal and supplements them with the first ```6``` letters in the alphabet. The number of combinations in half a byte is:

In [30]:
2 ** 4

16

As a consequence each hexadecimal character perfectly maps to a 4 bit (half a byte) binary sequence:

|(0b) binary|(0x) hexadecimal character|decimal character|
|---|---|---|
|0000|0|0|
|0001|1|1|
|0010|2|2|
|0011|3|3|
|0100|4|4|
|0101|5|5|
|0110|6|6|
|0111|7|7|
|1000|8|8|
|1001|9|9|
|1010|a|10|
|1011|b|11|
|1100|c|12|
|1101|d|13|
|1110|e|14|
|1111|f|15|
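The table above can be regenerated with the builtin ```format``` function. This is a minimal sketch under the assumption that the format specifications ```'04b'``` (4 binary digits with leading zeros) and ```'x'``` (lowercase hexadecimal) are acceptable substitutes for the ```bin``` and ```hex``` based approach used elsewhere in this notebook; the name ```rows``` is introduced for illustration:

```python
rows = []
for number in range(16):
    binary = format(number, '04b')     # 4 bits with leading zeros
    hexadecimal = format(number, 'x')  # single lowercase hexadecimal character
    rows.append((binary, hexadecimal, number))
    print(binary, hexadecimal, number)
```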

Although uppercase and lowercase can be used to represent a hexadecimal character, notice that the Python interpreter prefers lowercase:

In [31]:
integers = (11, 12)

In [32]:
bytes(integers)

b'\x0b\x0c'

A human is more likely to make a transcription error when reading uppercase hexadecimal sequences. For example:

```python
'ABB4AB8A'
```

When reading the above quickly, notice the similarity between A and 4 and between B and 8. The lowercase characters are more clearly distinguished:

```python
'abb4ab8a'
```

The following ```bytes``` instance:

<img src='./images/img_005.png' alt='img_005' width='400'/>

Is the binary number:

In [33]:
0b00001100

12

The value returned in the cell output displays the decimal integer. The ```hex``` function can be used to cast a decimal ```int``` into a Unicode ```str``` of a hexadecimal character:

In [34]:
hex(0b00001100)

'0xc'

The hexadecimal value is displayed without the leading zero, so the first half byte is not shown. This can be added for clarity:

In [35]:
'0x' + hex(12).removeprefix('0x').zfill(2)

'0x0c'

In [36]:
'0b' + bin(12).removeprefix('0b').zfill(8)

'0b00001100'

And for clarity:

In [37]:
print(bin(12).removeprefix('0b').zfill(8)[:4], bin(12).removeprefix('0b').zfill(8)[4:])
print(hex(12).removeprefix('0x').zfill(2)[:1].center(4), hex(12).removeprefix('0x').zfill(2)[1:].center(4))

0000 1100
 0    c  
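The same split into two half bytes can be sketched with bitwise operators instead of ```str``` slicing. This is an alternative illustration rather than the approach used in this notebook; the names ```byte```, ```high```, ```low``` and ```padded``` are assumptions:

```python
byte = 0b00001100

high = byte >> 4     # shift the top 4 bits down into the bottom 4 positions
low = byte & 0b1111  # mask off everything except the bottom 4 bits

print(format(high, '04b'), format(low, '04b'))
print(format(high, 'x').center(4), format(low, 'x').center(4))

# The '02x' format specification pads to two hexadecimal characters directly
padded = format(byte, '02x')
print(padded)
```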


When the byte sequence contains a byte that maps to a whitespace character, a non-printable command or is unmapped to an ASCII character there is no corresponding character to display and an escape sequence instead displays. This can be seen when using the first ```32``` integers and integers above ```127```:

In [38]:
integers = (0, 1, 2, 29, 30, 31, 128, 129, 130, 253, 254, 255)

In [39]:
bytes(integers)

b'\x00\x01\x02\x1d\x1e\x1f\x80\x81\x82\xfd\xfe\xff'

Notice that each character is inserted using its own hexadecimal escape sequence prefix ```\x``` and this instruction expects two hexadecimal characters so a leading zero must be included where applicable. 

When a ```bytes``` instance contains byte sequences that map to characters these characters will be displayed instead of the hexadecimal escape sequence:

In [40]:
integers = (32, 65, 80, 120)

In [41]:
bytes(integers)

b' APx'

The tab ```\t```, newline ```\n```, carriage return ```\r``` and backslash ```\\``` itself all have single escape characters (the ```\``` is shown as ```\\``` as the ```\``` is used to insert an escape character):

In [42]:
integers = (9, 10, 13, 92)

In [43]:
bytes(integers)

b'\t\n\r\\'

A ```bytes``` instance containing all of these may initially appear confusing:

In [44]:
integers = (0, 1, 2, 9, 10, 13, 29, 30, 31, 32, 65, 92, 97, 128, 129, 130, 253, 254, 255)

In [45]:
bytes_string = bytes(integers)

In [46]:
variables()

Instance Name,Type,Size/Shape,Value
num,int,,97
archive,tuple,1.0,"(97,)"
integers,tuple,19.0,"(0, 1, 2, 9, 10, 13, 29, 30, 31, 32, 65, 92, 97, 128, 129, 130, 253, 254, 255)"
number,int,,33
bytes_string,bytes,19.0,b'\x00\x01\x02\t\n\r\x1d\x1e\x1f A\\a\x80\x81\x82\xfd\xfe\xff'


In [47]:
view(bytes_string)

Index 	 Type                 	 Size   	 Value                         
0 	 int                  	 1      	 0                              	
1 	 int                  	 1      	 1                              	
2 	 int                  	 1      	 2                              	
3 	 int                  	 1      	 9                              	
4 	 int                  	 1      	 10                             	
5 	 int                  	 1      	 13                             	
6 	 int                  	 1      	 29                             	
7 	 int                  	 1      	 30                             	
8 	 int                  	 1      	 31                             	
9 	 int                  	 1      	 32                             	
10 	 int                  	 1      	 65                             	
11 	 int                  	 1      	 92                             	
12 	 int                  	 1      	 97                             	
13 	 int                  	 1

For this reason the ```bytes``` class has the method ```hex``` which returns a Unicode ```str``` instance of the hexadecimal values without any of the escape sequences:

In [48]:
bytes_string.hex()

'000102090a0d1d1e1f20415c61808182fdfeff'

Note that the ```hex``` method of the ```bytes``` class differs from the ```builtins``` function ```hex```, which provides a ```0x``` prefix:

In [49]:
bytes((12, )).hex()

'0c'

In [50]:
hex(12)

'0xc'

The ```builtins``` function ```hex``` can process a **single** large integer that exceeds 1 byte:

In [51]:
hex(256)

'0x100'

Whereas the ```hex``` method of the ```bytes``` class processes **multiple** integers that are within the constraints of a byte:

In [52]:
bytes((12, 34)).hex()

'0c22'

The following ```bytes``` instance can be represented as a ```str``` of hexadecimal characters, with 2 hexadecimal characters for each byte, using the ```bytes``` method ```hex```:

In [53]:
b'hello'.hex()

'68656c6c6f'

Notice when each of the ASCII characters is supplied using a hexadecimal escape sequence, the default representation simplifies the output displaying the ASCII character:

In [54]:
b'\x68\x65\x6c\x6c\x6f'

b'hello'

The ```bytes``` class has the class method ```fromhex``` which is an alternative constructor to create a ```bytes``` instance from a Unicode ```str``` instance of hexadecimal characters:

In [55]:
bytes.fromhex('68656c6c6f')

b'hello'
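The two methods are inverses of one another, which can be sketched with a round trip. The names ```original```, ```as_hex```, ```restored```, ```spaced_hex``` and ```spaced``` are assumptions introduced for illustration. The optional separator argument of ```hex``` assumes Python 3.8 or later:

```python
original = b'hello'

as_hex = original.hex()           # bytes to a str of hexadecimal characters
restored = bytes.fromhex(as_hex)  # and back again

# hex accepts an optional separator and fromhex ignores whitespace,
# so a spaced out form also survives the round trip
spaced_hex = original.hex(' ')
spaced = bytes.fromhex(spaced_hex)

print(as_hex, restored)
print(spaced_hex, spaced)
```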

The different ways of representing each byte in the ```bytes``` instance can be examined using:

In [56]:
integers = (104, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100, 33)
bytes(integers)

for number in integers:
    print(chr(number).center(8), end=' ')
    
print()
for number in integers:
    print(str(number).center(8), end=' ')

print()
for number in integers:
    print(bin(number).removeprefix('0b').zfill(8), end=' ')

print()
for number in integers:
    print((hex(number).removeprefix('0x')).center(8), end=' ')    

print()
for number in integers:
    print((r'0x' + hex(number).removeprefix('0x')).center(8), end=' ')

print()
for number in integers:
    print((r'\x' + hex(number).removeprefix('0x')).center(8), end=' ')
    


   h        e        l        l        o                 w        o        r        l        d        !     
  104      101      108      108      111       32      119      111      114      108      100       33    
01101000 01100101 01101100 01101100 01101111 00100000 01110111 01101111 01110010 01101100 01100100 00100001 
   68       65       6c       6c       6f       20       77       6f       72       6c       64       21    
  0x68     0x65     0x6c     0x6c     0x6f     0x20     0x77     0x6f     0x72     0x6c     0x64     0x21   
  \x68     \x65     \x6c     \x6c     \x6f     \x20     \x77     \x6f     \x72     \x6c     \x64     \x21   

A ```bytes``` instance can be instantiated from a Unicode ```str```, however the named parameter ```encoding``` needs to be supplied, which gives the instructions to encode each Unicode character, including any outwith the ASCII range, into bytes. When the simplest encoding ```'ascii'``` is supplied, all characters in the Unicode ```str``` must be within the ASCII range. Generally the current standard ```'utf-8'``` is used, which is adaptable and encodes each Unicode character in the Unicode ```str``` to 1-4 bytes in the returned ```bytes``` instance:

```python
bytes(string, /, encoding[, errors]) -> bytes
```

In [57]:
bytes('hello', encoding='ascii')

b'hello'

In [58]:
bytes('hello', encoding='utf-8')

b'hello'
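The 1-4 byte behaviour of ```'utf-8'``` can be sketched by encoding one character at a time. The characters chosen and the name ```lengths``` are assumptions for illustration:

```python
lengths = {}
for character in ('a', 'α', '悤', '🦒'):
    encoded = bytes(character, encoding='utf-8')
    lengths[character] = len(encoded)
    print(character, len(encoded), encoded.hex())
```

The ASCII character occupies a single byte, identical to its ```'ascii'``` encoding, while the other characters need progressively more bytes.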

Not supplying ```encoding``` gives a ```TypeError```:

```python
bytes('hello')
```

Supplying a Unicode character that is not ASCII and specifying ASCII encoding will give a ```UnicodeEncodeError```:

```python
bytes('α', encoding='ascii')
```
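The exception can be observed with a ```try``` block, and the optional ```errors``` parameter from the signature above changes what happens to a character that cannot be encoded. This is a minimal sketch using the standard error handlers ```'replace'``` and ```'ignore'```; the names ```failed```, ```replaced``` and ```ignored``` are assumptions:

```python
failed = False
try:
    bytes('α', encoding='ascii')
except UnicodeEncodeError as error:
    failed = True
    print(type(error).__name__)

# 'replace' substitutes a question mark, 'ignore' drops the character
replaced = bytes('α', encoding='ascii', errors='replace')
ignored = bytes('α', encoding='ascii', errors='ignore')
print(replaced, ignored)
```

Both handlers lose information, so they are generally a last resort compared to choosing an encoding such as ```'utf-8'``` that can represent the character.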

A ```bytes``` instance can be cast from an existing ```bytes``` instance:

```python
bytes(bytes_or_buffer, /) -> immutable copy of bytes_or_buffer
```

A ```bytes``` instance can be instantiated by casting a ```bytearray```. The ```bytearray``` is the mutable counterpart to the ```bytes``` class:

In [59]:
bytearray_string = bytearray(b'hello')

In [60]:
bytes_string = bytes(bytearray_string)

In [61]:
variables()

Instance Name,Type,Size/Shape,Value
num,int,,97
archive,tuple,1.0,"(97,)"
integers,tuple,12.0,"(104, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100, 33)"
number,int,,33
bytes_string,bytes,5.0,b'hello'
bytearray_string,bytearray,5.0,bytearray(b'hello')


A ```bytes``` instance of NULL bytes can also be initialised from an ```int```. The ```int``` is used to specify the number of NULL bytes and is not cast into an individual byte, as seen when an ```int``` is provided via a ```tuple```:

 ```python
 bytes(int, /) -> bytes object of size given by the parameter initialized with null bytes
 ```

For example a ```bytes``` instance occupying 1 byte can be instantiated:

In [62]:
bytes(1)

b'\x00'

And another one occupying 4 bytes can be instantiated:

In [63]:
bytes(4)

b'\x00\x00\x00\x00'

Using the ```bytes``` class without providing any instantiation data will create an empty ```bytes``` instance:

```python
bytes() -> empty bytes object
```

In [64]:
bytes()

b''

Initialising data and then populating is more commonly used for mutable datatypes. For an immutable datatype, the instance cannot be modified and the instance name instead gets reassigned to a new instance.
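The difference between modifying and reassigning can be sketched as follows. The names ```frozen``` and ```mutable``` are assumptions introduced for illustration:

```python
frozen = b'hello'
try:
    frozen[0] = 72  # item assignment is not supported by bytes
except TypeError as error:
    print(error)

mutable = bytearray(frozen)
mutable[0] = 72          # a bytearray can be changed in place
frozen = bytes(mutable)  # the name frozen is reassigned to a new bytes instance
print(frozen)
```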

## Encoding and Decoding

When a ```bytes``` instance was instantiated from a Unicode ```str``` instance, an encoding translation table was selected; that is, a table that maps each character to a specific byte sequence. There have been many encoding standards developed throughout the years and by default the current standard ```'utf-8'``` should be used. The Unicode ```str``` class works with Unicode characters directly and its ```encode``` method defaults to ```'utf-8'```, and as a consequence it is far easier to use than the ```bytes``` class for most text applications:

|encoding|bytes per character|bits per character|byte order|byte order marker BOM|
|---|---|---|---|---|
|'utf-8'|1, 2, 3, 4|8, 16, 24, 32|big endian| |
|'utf-8-sig'|1, 2, 3, 4|8, 16, 24, 32|big endian|efbbbf|
|'utf-32'|4|32|little endian|fffe0000|
|'utf-32-le'|4|32|little endian| |
|'utf-32-be'|4|32|big endian| |
|'utf-16'|2|16|little endian|fffe|
|'utf-16-le'|2|16|little endian| |
|'utf-16-be'|2|16|big endian| |
|'latin1'|1|8| ||
|'ascii'|1|8| ||

### ASCII

```'ascii'``` is the most basic translation table and each character is encoded over ```1``` byte using only half of the possible values. ASCII is restricted to a small subset of English characters:

In [65]:
a_ascii = b'a'

In [66]:
a_ascii

b'a'

In [67]:
hex(ord('a'))

'0x61'

The ```'ascii'``` encoding scheme was originally developed using 7 bits with:

In [68]:
2 ** 7

128

For this reason the commands span over the range ```0:128``` (up to and excluding the upper bound of ```128```). This covers half the possible values of a byte:

In [69]:
2 ** 8

256

### Extended ASCII Variants

In the 1990s there were numerous regional translation tables which mapped the second half of the bytes to regional characters. 

In the UK, ```'latin1'``` was used which includes the ```£``` sign:

In [70]:
gb = bytes('£123.45', encoding='latin1')

In [71]:
gb

b'\xa3123.45'

In [72]:
int('0xa3', base=16)

163

In [73]:
gb.decode(encoding='latin1')

'£123.45'

This regional encoding scheme spanned over the full byte allowing the commonly used regional characters. 

The problem with early regional encoding was that operating systems and browsers were often configured to use a regional encoding scheme that differed from the encoding scheme the content itself was written in, and as a result non-ASCII characters were often incorrectly substituted. This can be seen for example by decoding the ```bytes``` instance above, which was originally encoded in ```'latin1'```, with ```'latin2'```, ```'latin3'```, ```'greek'``` and ```'cyrillic'```:

In [74]:
gb.decode(encoding='latin2')

'Ł123.45'

In [75]:
gb.decode(encoding='latin3')

'£123.45'

In [76]:
gb.decode(encoding='greek')

'£123.45'

In [77]:
gb.decode(encoding='cyrillic')

'Ѓ123.45'
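The comparison above can be condensed into a loop, which makes it easier to see that the underlying bytes never change and only the translation table does. This is a minimal sketch; the name ```decoded``` is an assumption introduced for illustration:

```python
gb = bytes('£123.45', encoding='latin1')

decoded = {}
for encoding in ('latin1', 'latin2', 'latin3', 'greek', 'cyrillic'):
    # The same bytes are interpreted with a different translation table each time
    decoded[encoding] = gb.decode(encoding=encoding)
    print(encoding.ljust(8), gb.hex(), decoded[encoding])
```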

All of these formats should be considered as legacy formats.

### UTF-16

The Unicode Transformation Format ```'utf-16'``` was a previous standard where each character occupied ```2``` bytes, which is ```2 * 8``` bits and is where the ```16``` in the name comes from. Using 2 bytes instead of 1 byte per character increases the number of possible combinations to:

In [78]:
2 ** 16

65536

When we count using numbers we use big endian, for example the number twelve is represented using two decimal digits:

```python
12
```

This is big endian and the most significant digit 1 which corresponds to 10 is stated first followed by the digit 2 which corresponds to 2 units.

This number twelve could also be represented in little endian:

```python
21
```

In little endian, the least significant digit, the digit 2 which corresponds to 2 units is stated first followed by the most significant digit 1 which corresponds to 10.

When the ASCII character ```a``` is encoded using the ```'ascii'``` translation table it occupies a single byte, which recall is represented using two hexadecimal characters:

In [79]:
bytes('\x61', encoding='ascii')

b'a'

For ```'utf-16'```, each character must occupy two bytes. For an ASCII character the single byte that was encoded in ```'ascii'``` is taken as the least significant byte and is accompanied by the NULL byte which acts as the most significant byte. In big endian the most significant byte is placed first followed by the least significant byte:

In [80]:
b'\x00\x61'.hex()

'0061'

In little endian the least significant byte is instead displayed first, followed by the most significant byte:

In [81]:
b'\x61\x00'.hex()

'6100'

If the two byte instances are examined, the default representation assumes the ```bytes``` instance is using ```'ascii'``` encoding and so the NULL byte displays as an escape character and the ```'ascii'``` character displays:

In [82]:
b'\x00\x61' # big endian

b'\x00a'

In [83]:
b'\x61\x00' # little endian

b'a\x00'
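The two byte orders can also be sketched using the ```int``` methods ```to_bytes``` and ```from_bytes```, which take the byte order as an argument. This is an illustration alongside the notebook's approach; the names ```code_point```, ```big```, ```little``` and ```swapped``` are assumptions:

```python
code_point = ord('a')  # decimal 97, hexadecimal 0x61

big = code_point.to_bytes(2, byteorder='big')        # most significant byte first
little = code_point.to_bytes(2, byteorder='little')  # least significant byte first
print(big, little)

# Reading big endian bytes as if they were little endian gives a different number
swapped = int.from_bytes(big, byteorder='little')
print(hex(swapped))
```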

If the Unicode ```str``` instance ```'abc'``` is examined and encoded in ```'utf-16-be'```:

In [84]:
bytes('abc', encoding='utf-16-be')

b'\x00a\x00b\x00c'

Which looks like the following when the bytes corresponding to ASCII characters aren't processed:

In [85]:
b'\x00\x61\x00\x62\x00\x63' # big endian

b'\x00a\x00b\x00c'

If the Unicode ```str``` instance ```'abc'``` is instead encoded in ```'utf-16-le'```:

In [86]:
bytes('abc', encoding='utf-16-le')

b'a\x00b\x00c\x00'

In [87]:
b'\x61\x00\x62\x00\x63\x00' # little endian

b'a\x00b\x00c\x00'

When ```'utf-16'``` was introduced there was a deviation in the way processors handled characters that spanned over multiple bytes. Some processors used big endian and others used little endian. Intel, the most dominant processor manufacturer at the time favoured little endian. As there was confusion between the two variants of ```'utf-16'```, Microsoft favoured addition of a Byte Order Marker (BOM). The BOM is at the start of the ```bytes``` instance and like every character in ```'utf-16'``` will span over two bytes (4 hexadecimal characters):

In [88]:
bytes('abc', encoding='utf-16-le').hex()

'610062006300'

In [89]:
bytes('abc', encoding='utf-16').hex()

'fffe610062006300'

The BOM can be examined by casting an empty ```str``` instance:

In [90]:
bytes('', encoding='utf-16')

b'\xff\xfe'

In [91]:
bytes('', encoding='utf-16').hex()

'fffe'
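The BOM for each byte order is also available as a constant in the standard library ```codecs``` module. The sketch below, with illustrative variable names, suggests how the ```'utf-16'``` decoder uses the BOM to select the byte order and then discards it:

```python
import codecs

# the two possible byte order marks for 'utf-16'
print(codecs.BOM_UTF16_LE.hex(), codecs.BOM_UTF16_BE.hex())

# the same text prefixed with a BOM in each byte order
with_bom_le = codecs.BOM_UTF16_LE + bytes('abc', encoding='utf-16-le')
with_bom_be = codecs.BOM_UTF16_BE + bytes('abc', encoding='utf-16-be')

# 'utf-16' reads the BOM, selects the byte order and drops the BOM
decoded_le = with_bom_le.decode(encoding='utf-16')
decoded_be = with_bom_be.decode(encoding='utf-16')
print(decoded_le, decoded_be)
```

Both ```bytes``` instances decode to the same ```str``` instance even though their byte order differs.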

The ```str``` instance corresponding to the Greek letter alpha can be encoded in ```'utf-16-le'```:

In [92]:
alpha_le = bytes('α', encoding='utf-16-le')

The hexadecimal values can be examined:

In [93]:
alpha_le.hex()

'b103'

In [94]:
alpha_le

b'\xb1\x03'

This ```bytes``` instance can be decoded back to the original ```str``` instance using the correct encoding:

In [95]:
alpha_le.decode(encoding='utf-16-le')

'α'

If the incorrect encoding is used to decode, the wrong character is selected:

In [96]:
alpha_le.decode(encoding='utf-16-be')

'넃'

This is equivalent to:

In [97]:
b'\x03\xb1'.decode(encoding='utf-16-le')

'넃'

If a single byte encoding is used, each byte will be represented as a different character. One of the characters is a non-printable ASCII character so displays as ```\x03```:

In [98]:
alpha_le.decode(encoding='latin1')

'±\x03'

With 16 bits there are:

In [99]:
2 ** 16

65536

combinations. These are not enough to cover characters from all the languages in the world.
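A character with an ordinal value beyond this range can still be encoded in ```'utf-16'```, but only by combining two 2 byte units, so the fixed width of two bytes per character is lost. The short check below is offered as an illustration, using an emoji as the example character:

```python
giraffe = '🦒'

# the ordinal value is larger than the 65536 combinations of two bytes
print(hex(ord(giraffe)), ord(giraffe) >= 2 ** 16)

# so 'utf-16-be' needs 4 bytes for this single character
encoded = bytes(giraffe, encoding='utf-16-be')
print(len(giraffe), len(encoded), encoded.hex())
```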

### UTF-32

Therefore ```'utf-32'``` was developed, in which each character spans 32 bits, which is 4 bytes:

In [100]:
2 ** 32

4294967296

Like ```'utf-16'``` there are big endian, little endian and BOM variations:

In [101]:
bytes('\x61', encoding='utf-32-be'), bytes('\x61', encoding='utf-32-be').hex()

(b'\x00\x00\x00a', '00000061')

In [102]:
bytes('\x61', encoding='utf-32-le'), bytes('\x61', encoding='utf-32-le').hex()

(b'a\x00\x00\x00', '61000000')

In [103]:
bytes('\x61', encoding='utf-32'), bytes('\x61', encoding='utf-32').hex()

(b'\xff\xfe\x00\x00a\x00\x00\x00', 'fffe000061000000')

This gives groupings of 4 bytes, which is 8 hexadecimal characters:

In [104]:
word = 'abαβ悤悥🦒🦓'

be = bytes(word, encoding='utf-32-be').hex()
le = bytes(word, encoding='utf-32-le').hex()
bom_le = bytes(word, encoding='utf-32').hex()

# each character is padded to line up with its group of 8 hexadecimal characters
print('char', end=':               ')
for char in word:
    print(char, end='        ')

print()
print('utf-32-be', end=':          ')
for i in range(0, len(be), 8):
    print(be[i:i+8], end=' ')

print()
print('utf-32-le', end=':          ')
for i in range(0, len(le), 8):
    print(le[i:i+8], end=' ')

print()
# the shorter label places the additional BOM group to the left of the first character
print('utf-32', end=':    ')
for i in range(0, len(bom_le), 8):
    print(bom_le[i:i+8], end=' ')

char:               a        b        α        β        悤        悥        🦒        🦓        
utf-32-be:          00000061 00000062 000003b1 000003b2 000060a4 000060a5 0001f992 0001f993 
utf-32-le:          61000000 62000000 b1030000 b2030000 a4600000 a5600000 92f90100 93f90100 
utf-32:    fffe0000 61000000 62000000 b1030000 b2030000 a4600000 a5600000 92f90100 93f90100 
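The big endian row above suggests that the 4 bytes of each character are simply its ordinal value written as an integer. A small sketch to check this, with illustrative variable names:

```python
word = 'abαβ悤悥🦒🦓'

matches = []
for char in word:
    encoded = bytes(char, encoding='utf-32-be')
    # the 4 bytes read as a big endian integer
    value = int.from_bytes(encoded, byteorder='big')
    matches.append(value == ord(char))
    print(char, encoded.hex(), hex(ord(char)), value == ord(char))
```

Every comparison is expected to be ```True```, which is what makes ```'utf-32'``` simple to process but wasteful for ASCII text.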

### UTF-8

The main drawbacks of ```'utf-32'``` are that it requires a lot more memory per character, it has byte order issues and each ASCII character now needs to be accompanied by 3 NULL bytes. The current standard ```'utf-8'``` was developed as an adaptable format in which a character spans 1-4 bytes. The bytes of a character are always written in the same order, most significant first, so there are no big endian and little endian variants:

In [105]:
print('1 byte:', end=' ')
for unicode_char in 'abcde':
    print(bytes(unicode_char, encoding='utf-8').hex(), end=' ')

print()
print('2 bytes:', end=' ')    
for unicode_char in 'αβγδε':
    print(bytes(unicode_char, encoding='utf-8').hex(), end=' ')
    
print()
print('3 bytes:', end=' ')  
for unicode_char in '悤悥悦悧您':
    print(bytes(unicode_char, encoding='utf-8').hex(), end=' ')

print()
print('4 bytes:', end=' ')  
for unicode_char in '🦒🦓🦔🦕🦖':
    print(bytes(unicode_char, encoding='utf-8').hex(), end=' ')


1 byte: 61 62 63 64 65 
2 bytes: ceb1 ceb2 ceb3 ceb4 ceb5 
3 bytes: e682a4 e682a5 e682a6 e682a7 e682a8 
4 bytes: f09fa692 f09fa693 f09fa694 f09fa695 f09fa696 

Generally:

* 1 byte is used for the ```'ascii'``` subset.
* 2 bytes are used for extended European characters, including the Greek letters above. ```'utf-16'``` is not a subset as ```'utf-8'``` uses a byte pattern that differs from ```'utf-16'```.
* 3 bytes are used for additional languages.
* 4 bytes are used for emojis and other less common characters. ```'utf-32'``` is not a subset as ```'utf-8'``` uses a byte pattern that differs from ```'utf-32'```.

Under the hood, the leading bits of the first byte identify whether 1, 2, 3 or 4 bytes are used to encode a character, and every following byte begins with ```10```:

1 byte: **0**XXXXXXX (ASCII)

2 bytes: **110**XXXXX **10**XXXXXX

3 bytes: **1110**XXXX **10**XXXXXX **10**XXXXXX

4 bytes: **11110**XXX **10**XXXXXX **10**XXXXXX **10**XXXXXX


Four of the example characters above can be cast into binary and seen to follow the above pattern:

In [106]:
('1 byte', bin(0x61).removeprefix('0b').zfill(1 * 8))

('1 byte', '01100001')

In [107]:
('2 bytes', bin(0xceb1).removeprefix('0b').zfill(2 * 8))

('2 bytes', '1100111010110001')

In [108]:
('3 bytes', bin(0xe682a4).removeprefix('0b').zfill(3 * 8))

('3 bytes', '111001101000001010100100')

In [109]:
('4 bytes', bin(0xf09fa692).removeprefix('0b').zfill(4 * 8))

('4 bytes', '11110000100111111010011010010010')
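The pattern can be followed in reverse to recover the ordinal value of a character from its ```'utf-8'``` bytes. The function below is a simplified sketch written for this notebook (```utf8_to_ordinal``` is a hypothetical name, not part of Python) and it does not carry out every validity check that the builtin decoder performs:

```python
def utf8_to_ordinal(encoded: bytes) -> int:
    """Return the ordinal value of a single 'utf-8' encoded character."""
    first = encoded[0]
    if first >> 7 == 0b0:          # 0XXXXXXX
        length, value = 1, first
    elif first >> 5 == 0b110:      # 110XXXXX
        length, value = 2, first & 0b00011111
    elif first >> 4 == 0b1110:     # 1110XXXX
        length, value = 3, first & 0b00001111
    elif first >> 3 == 0b11110:    # 11110XXX
        length, value = 4, first & 0b00000111
    else:
        raise ValueError('not a valid first byte')
    if len(encoded) != length:
        raise ValueError('unexpected number of bytes')
    for byte in encoded[1:]:
        if byte >> 6 != 0b10:      # 10XXXXXX
            raise ValueError('not a valid continuation byte')
        # append the 6 payload bits of the continuation byte
        value = (value << 6) | (byte & 0b00111111)
    return value


for char in 'aα悤🦒':
    encoded = bytes(char, encoding='utf-8')
    print(char, encoded.hex(), hex(utf8_to_ordinal(encoded)), hex(ord(char)))
```

The last two columns should agree for each character, showing that only the bits marked X carry the ordinal value.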

### UTF-8-Sig

Since the bytes of a character in ```'utf-8'``` are always written in the same order and characters encoded over multiple bytes follow a byte pattern, there is generally no need for a BOM. 

Despite ```'utf-8'``` not requiring a BOM, Microsoft often includes one in its products using the variation ```'utf-8-sig'```, and a BOM may therefore be seen in data exported from popular Microsoft applications such as Notepad or Excel. The BOM can be seen by comparing the casting of an empty Unicode string to a byte string using ```'utf-8'``` and ```'utf-8-sig'``` respectively:

In [110]:
bytes('', encoding='utf-8').hex()

''

In [111]:
bytes('', encoding='utf-8-sig').hex()

'efbbbf'

```'utf-8'``` is the current standard and should be used by default. The Unicode string ```str``` class uses ```'utf-8'``` as its default encoding when casting to and from ```bytes``` and is much easier to work with, as each character can be handled without worrying about how it is encoded. 

When decoding data from another source that is in byte form, ```'utf-8'``` should be used by default to decode it.  If an unwanted BOM appears at the start when decoding data, then the data probably was processed using a Microsoft product with ```'utf-8-sig'```.
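The effect of an unwanted BOM, and one way to remove it, can be sketched as follows. The variable names are illustrative and the ```bytes``` instance stands in for data exported from another application:

```python
# data as it might be exported by an application that writes a BOM
exported = bytes('abc', encoding='utf-8-sig')
print(exported)

# decoding with 'utf-8' keeps the BOM as the character '\ufeff'
with_bom = exported.decode(encoding='utf-8')
print(repr(with_bom))

# decoding with 'utf-8-sig' removes it
without_bom = exported.decode(encoding='utf-8-sig')
print(repr(without_bom))

# 'utf-8-sig' also decodes data that has no BOM
plain = bytes('abc', encoding='utf-8').decode(encoding='utf-8-sig')
print(repr(plain))
```

Because ```'utf-8-sig'``` accepts data with or without a BOM when decoding, it can be a convenient choice when the origin of the data is uncertain.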

## Hardware

The ```bytes``` class was the main text class for Python 2. 

As ```'utf-8'``` became widely adopted as an encoding standard, the text datatype was redeveloped around Unicode characters, with ```'utf-8'``` as the default translation table. This simplified class allowed a Unicode character to be considered as the fundamental unit, as opposed to a byte. Python 3 introduced major changes over Python 2, in particular changes to the default text class. The ```str``` class is the default text class for Python 3 and should be used in most applications. The older text datatype, the ```bytes``` class, is still sometimes used when communicating directly with hardware. The ```bytes``` instance below can be transmitted over a serial port:

In [112]:
for number in b'hello world!':
    print(bin(number).removeprefix('0b').zfill(8), end='')

011010000110010101101100011011000110111100100000011101110110111101110010011011000110010000100001

A serial port, for example, has a signal pin which is configured to transmit or receive a digital time trace. A baud rate of ```9600```, for example, means ```9600``` bits are processed in a second. When the bit is ```0```, the voltage of the signal pin is LOW and when the bit is ```1``` the voltage of the signal pin is HIGH.
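The conversion between a ```bytes``` instance and a stream of bits can be sketched in both directions. This is a simplification that only regroups the data bits and ignores any additional framing bits, such as start and stop bits, that a real serial link adds:

```python
message = b'hello world!'

# each byte is an int which is formatted as 8 binary digits
bits = ''.join(format(number, '08b') for number in message)

# the receiver regroups the bits into bytes, 8 bits at a time
received = bytes(int(bits[i:i+8], base=2) for i in range(0, len(bits), 8))

print(len(bits), received)
```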

```bytes``` instances are therefore still used when directly interfacing with hardware, for example an Arduino. In such applications it is recommended to decode a ```bytes``` instance to a Unicode ```str``` instance as early as possible in a Python program and to cast the Unicode ```str``` instance back to a ```bytes``` instance as late as possible before transmitting it, which reduces the possibility of encoding issues.
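This recommendation can be sketched with a small function. ```handle_reading``` and the message format are hypothetical, and a ```bytes``` literal stands in for data that would normally be read from the serial port:

```python
def handle_reading(raw: bytes) -> bytes:
    """Hypothetical handler: bytes in, bytes out, str in between."""
    # decode as early as possible
    text = raw.decode(encoding='utf-8').strip()

    # all processing is carried out on the str instance
    reply = f'received: {text.upper()}'

    # encode as late as possible, just before transmitting
    return bytes(reply + '\n', encoding='utf-8')


# stand in for bytes read from a serial port
incoming = b'temp=21\r\n'
outgoing = handle_reading(incoming)
print(outgoing)
```

Keeping the encoding and decoding at the boundary of the program means the rest of the code only ever deals with ```str``` instances.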

## Bytes Identifiers

If the identifiers of the ```bytes``` class and the ```str``` class are examined, they are seen to be largely consistent:

In [113]:
dir2(bytes, object, unique_only=True)

{'method': ['capitalize',
            'center',
            'count',
            'decode',
            'endswith',
            'expandtabs',
            'find',
            'fromhex',
            'hex',
            'index',
            'isalnum',
            'isalpha',
            'isascii',
            'isdigit',
            'islower',
            'isspace',
            'istitle',
            'isupper',
            'join',
            'ljust',
            'lower',
            'lstrip',
            'maketrans',
            'partition',
            'removeprefix',
            'removesuffix',
            'replace',
            'rfind',
            'rindex',
            'rjust',
            'rpartition',
            'rsplit',
            'rstrip',
            'split',
            'splitlines',
            'startswith',
            'strip',
            'swapcase',
            'title',
            'translate',
            'upper',
            'zfill'],
 'datamodel_method': ['__add__',
     

In [114]:
dir2(str, object, unique_only=True)

{'method': ['capitalize',
            'casefold',
            'center',
            'count',
            'encode',
            'endswith',
            'expandtabs',
            'find',
            'format',
            'format_map',
            'index',
            'isalnum',
            'isalpha',
            'isascii',
            'isdecimal',
            'isdigit',
            'isidentifier',
            'islower',
            'isnumeric',
            'isprintable',
            'isspace',
            'istitle',
            'isupper',
            'join',
            'ljust',
            'lower',
            'lstrip',
            'maketrans',
            'partition',
            'removeprefix',
            'removesuffix',
            'replace',
            'rfind',
            'rindex',
            'rjust',
            'rpartition',
            'rsplit',
            'rstrip',
            'split',
            'splitlines',
            'startswith',
            'strip',
            'swa

The unique identifiers can be examined for each class. The ```str``` method ```encode``` casts a ```str``` instance to a ```bytes``` instance using a specified translation table to encode with. The ```bytes``` method ```decode``` casts a ```bytes``` instance to a ```str``` instance using a specified translation table to decode with. The class method ```fromhex``` and the instance method ```hex``` can be used to cast from a Unicode ```str``` instance of hexadecimal characters and to ```return``` a Unicode ```str``` instance of hexadecimal characters respectively:

In [115]:
dir2(bytes, str, unique_only=True)

{'method': ['decode', 'fromhex', 'hex'],
 'datamodel_method': ['__buffer__', '__bytes__']}


In [116]:
dir2(str, bytes, unique_only=True)

{'method': ['casefold',
            'encode',
            'format',
            'format_map',
            'isdecimal',
            'isidentifier',
            'isnumeric',
            'isprintable']}


Some ```str``` methods related to formatting and methods which check groupings of Unicode characters don't have a counterpart available in the ```bytes``` class.
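The methods that are unique to each class can be tried together as a round trip. The following is a brief sketch with illustrative variable names:

```python
text = 'αβγ'

# str -> bytes -> str
encoded = text.encode(encoding='utf-8')
decoded = encoded.decode(encoding='utf-8')

# bytes -> str of hexadecimal characters -> bytes
hex_text = encoded.hex()
restored = bytes.fromhex(hex_text)

print(encoded, decoded, hex_text, restored == encoded)

# identifiers that are only available in one of the two classes
print(hasattr(bytes, 'encode'), hasattr(str, 'decode'), hasattr(bytes, 'format'))
```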

The consistent identifiers behave consistently between the two classes. The counterparts to ```str``` methods that return a ```str``` instance will instead return a ```bytes``` instance, or a single byte, which, recall, is an ```int``` in the range ```0:256```. Suppose the following four instances are instantiated:

In [117]:
english_b = b'abcde'
greek_b = bytes('αβγδε', encoding='UTF-8')
english_s = 'abcde'
greek_s = 'αβγδε'

The variables can be viewed:

In [118]:
variables(['english_b', 'english_s', 'greek_b', 'greek_s'])

| Instance Name | Type | Size/Shape | Value |
| --- | --- | --- | --- |
| english_b | bytes | 5 | b'abcde' |
| english_s | str | 5 | abcde |
| greek_b | bytes | 10 | b'\xce\xb1\xce\xb2\xce\xb3\xce\xb4\xce\xb5' |
| greek_s | str | 5 | αβγδε |


And examined:

In [119]:
view(english_b)

Index 	 Type                 	 Size   	 Value                         
0 	 int                  	 1      	 97                             	
1 	 int                  	 1      	 98                             	
2 	 int                  	 1      	 99                             	
3 	 int                  	 1      	 100                            	
4 	 int                  	 1      	 101                            	


In [120]:
view(english_s)

Index 	 Type                 	 Size   	 Value                         
0 	 str                  	 1      	 a                              	
1 	 str                  	 1      	 b                              	
2 	 str                  	 1      	 c                              	
3 	 str                  	 1      	 d                              	
4 	 str                  	 1      	 e                              	


In [121]:
view(greek_b)

Index 	 Type                 	 Size   	 Value                         
0 	 int                  	 1      	 206                            	
1 	 int                  	 1      	 177                            	
2 	 int                  	 1      	 206                            	
3 	 int                  	 1      	 178                            	
4 	 int                  	 1      	 206                            	
5 	 int                  	 1      	 179                            	
6 	 int                  	 1      	 206                            	
7 	 int                  	 1      	 180                            	
8 	 int                  	 1      	 206                            	
9 	 int                  	 1      	 181                            	


In [122]:
view(greek_s)

Index 	 Type                 	 Size   	 Value                         
0 	 str                  	 1      	 α                              	
1 	 str                  	 1      	 β                              	
2 	 str                  	 1      	 γ                              	
3 	 str                  	 1      	 δ                              	
4 	 str                  	 1      	 ε                              	


Notice that the ```len``` of both ```english_s``` and ```greek_s``` is ```5``` as each has ```5``` Unicode characters:

In [123]:
len(english_s)

5

In [124]:
len(greek_s)

5

Notice that the ```len``` of ```english_b``` is also ```5``` as each ASCII character spans ```1``` byte, however ```greek_b``` has a length of ```10``` as each Greek character spans ```2``` bytes:

In [125]:
len(english_b)

5

In [126]:
len(greek_b)

10
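The relationship between the number of characters and the number of bytes can be checked for characters of each width. A short sketch, with illustrative variable names:

```python
text = 'aα悤🦒'

# number of bytes each character occupies in 'utf-8'
sizes = [len(bytes(char, encoding='utf-8')) for char in text]
print(sizes)

# len of the str instance counts characters, len of the bytes instance counts bytes
encoded = bytes(text, encoding='utf-8')
print(len(text), len(encoded), sum(sizes))
```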

When a ```str``` instance is indexed, the Unicode character corresponding to that index is returned. When a ```bytes``` instance is indexed, on the other hand, the byte in the form of an ```int``` is returned:

In [127]:
greek_s[0]

'α'

In [128]:
greek_b[0]

206

In [129]:
greek_b.hex()

'ceb1ceb2ceb3ceb4ceb5'

In [130]:
0xce

206

Slicing will instead return a ```bytes``` instance. The difference can be seen when ```1``` byte is selected using a slice:

In [131]:
greek_b[:1]

b'\xce'

The syntax is otherwise consistent with slicing in the ```str``` class:

In [132]:
greek_b[:2:]

b'\xce\xb1'

In [133]:
greek_b[::2]

b'\xce\xce\xce\xce\xce'

In [134]:
greek_b[1::2]

b'\xb1\xb2\xb3\xb4\xb5'
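Because a slice of a ```bytes``` instance counts bytes and not characters, a slice can end part way through a character. The sketch below, with illustrative variable names, shows what may be expected when such a slice is decoded:

```python
greek_b = bytes('αβγδε', encoding='utf-8')

# a slice that ends on a character boundary decodes
first_char = greek_b[:2].decode(encoding='utf-8')
print(first_char)

# a slice that ends part way through a character does not
try:
    greek_b[:3].decode(encoding='utf-8')
    partial_ok = True
except UnicodeDecodeError as error:
    partial_ok = False
    print(error)

# an int obtained by indexing can be placed in a list to get a bytes instance of length 1
single = bytes([greek_b[0]])
print(single, single == greek_b[:1])
```

This is another reason to decode to a ```str``` instance before indexing or slicing text.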

[Return to Python Tutorials](../readme.md)