<a href="https://colab.research.google.com/github/SCS-Technology-and-Innovation/DSLP/blob/main/DSLP_M01.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Encoding


On any kind of computer, text is represented through sequences of *characters* called **strings**. All digital representation occurs through sequences of bits, each of which can take one of two possible values: 1 (true) or 0 (false).




In [63]:
letter = '0'
print(letter)

0


In [64]:
code = ord(letter)
print(code)

48


In [65]:
chr(code) # reverse lookup

'0'

In [12]:
print(ord('A'))
print(ord('ü'))

65
252


In [34]:
start = 30
count = 10
for code in range(start, start + count):
  print(code, chr(code))

30 
31 
32  
33 !
34 "
35 #
36 $
37 %
38 &
39 '


Computer memory is **not** measured in bits but in *bytes*, which are groups of eight bits each. One cannot, in general, reserve a single bit of memory; one or more bytes is the usual way to proceed.

The **shortest** way to make text into bits is therefore assigning *one byte per letter*. Let's look at the bits themselves.


In [33]:
letter = 'x' # pick a letter
number = ord(letter) # get its number (the Unicode code point)
bits = bin(number) # get the bits (binary number)
print(bits) # see the bits

0b1111000


That's only *seven* bits instead of the **eight** we expected. It is because *leading zeroes* are not shown, unless we request that specifically by **formatting** the printout to match a specified pattern.

In [None]:
character = '2'
byte = ord(character)
bits = bin(byte)
print(bits) # no leading zeroes but with that 0b prefix in place
print("{0:08b}".format(byte)) # leading zeroes to a whole byte but no prefix

0b110010
00110010


Using one byte per character gives us eight yes/no options, which means that there are a total of $2^8$ possible bit sequences that fit into one byte, allowing us to represent 256 different characters.
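The count is easy to verify. The following is a minimal sketch using only built-ins (the variable names are chosen here for illustration): it computes the number of patterns and shows that 255 is the largest value Python will pack into a single byte.

```python
# Eight yes/no choices give 2**8 distinct bit patterns.
patterns = 2 ** 8
print(patterns)  # 256

# The patterns are numbered 0..255, so 255 is the largest value in one byte.
largest = bytes([255])
print(f"{largest[0]:08b}")  # 11111111

# One more no longer fits into a single byte.
try:
    bytes([256])
    fits = True
except ValueError:
    fits = False
print(fits)  # False
```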

If we think in English only, this sounds like a lot.

In [None]:
import string # we use a library that contains useful information regarding strings

letters = string.ascii_letters # we will discuss what ASCII means in a bit
print(letters)

digits = string.digits
print(digits)

punctuation = string.punctuation
print(punctuation)

print(len(letters) + len(digits) + len(punctuation))

abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
0123456789
!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
94


In addition to letters, digits, and punctuation, there are also other characters of interest, notably all **whitespace** characters like line breaks, spaces, and tabulators. We do not notice them unless they are before or in between other characters, though.
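One way to make whitespace visible is to ask for its `repr`, which shows escape sequences instead of the characters themselves. A small sketch using the same `string` library:

```python
import string

# repr() shows the escape sequences rather than printing the blank characters.
print(repr(string.whitespace))  # ' \t\n\r\x0b\x0c'
print(len(string.whitespace))   # 6
```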

In [None]:
print('🐱', string.printable, '🐱')
len(string.printable)

🐱 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ 	
 🐱


100

That's a hundred right here and we haven't even begun with accented characters for French and Spanish, let alone languages that use a different alphabet altogether. And then there are all the [emoji](https://en.wikipedia.org/wiki/List_of_emojis) like the cat that was used above to allow us to take note of the whitespace at the end of the printable list of characters.

So what do we get if we look at `ord` for the cat emoji?

In [None]:
ord('🐱')

128049

In [40]:
chr(100000)

'𘚠'

Let's figure out the largest number that still corresponds to a symbol by simply attempting to print characters at increasing codes and watching for where the error pops up.

In [52]:
start = 1000000
end = 2000000
incr = 100000
for code in range(start, end, incr):
  print(code, chr(code))

1000000 󴉀
1100000 􌣠


ValueError: chr() arg not in range(0x110000)
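The error message itself names the limit: `0x110000` is hexadecimal notation for 1114112, so the valid arguments run from 0 to 1114111. A short check, assuming a standard Python 3 interpreter, where `sys.maxunicode` reports the same bound:

```python
import sys

limit = 0x110000           # the bound quoted in the error message
print(limit)               # 1114112
print(sys.maxunicode)      # 1114111, the largest valid code point

# chr() accepts the largest code point but rejects the next integer.
last = chr(sys.maxunicode)
try:
    chr(sys.maxunicode + 1)
    accepted = True
except ValueError:
    accepted = False
print(accepted)            # False
```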

This is way bigger than the "one byte per character" minimal assumption we began with. Who decides, and how, the number of bytes per character (whether that is the same amount for all symbols or varies between symbols) and how the bit sequences within those bytes are mapped onto readable symbols?

There are multiple **standards** that map sequences of bits onto human-readable characters.

*   [ASCII](https://en.wikipedia.org/wiki/ASCII)
*   [Unicode](https://en.wikipedia.org/wiki/Unicode) (UTF formats)
*   [ISO 8859-1](https://en.wikipedia.org/wiki/ISO/IEC_8859-1) (also known as Latin-1 which is extended ASCII)
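These standards differ in how many bytes they spend per character. The sketch below compares encoded lengths for a few sample characters (the samples are picked here purely for illustration): UTF-8 uses between one and four bytes depending on the character, whereas Latin-1 always uses exactly one.

```python
samples = ['A', 'ü', '€', '🐱']

# UTF-8 is a variable-length encoding: the byte count depends on the character.
for ch in samples:
    utf8 = ch.encode('utf-8')
    print(ch, ord(ch), len(utf8), utf8)

utf8_lengths = [len(ch.encode('utf-8')) for ch in samples]
print(utf8_lengths)  # [1, 2, 3, 4]

# Latin-1 spends one byte per character but only covers code points 0 to 255.
print('ü'.encode('iso-8859-1'))  # b'\xfc'
```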

In [None]:
text = 'ça pourrait être difficile de savoir 🤷' # the Python default is Unicode
print(text)

ça pourrait être difficile de savoir 🤷


Not all encodings contain representations for all characters.

In [None]:
plain = text.encode(encoding = 'ascii', errors = 'namereplace')
print(plain)

b'\\N{LATIN SMALL LETTER C WITH CEDILLA}a pourrait \\N{LATIN SMALL LETTER E WITH CIRCUMFLEX}tre difficile de savoir \\N{SHRUG}'


In [None]:
plain = text.encode(encoding = 'ascii', errors = 'ignore')
print(plain)

b'a pourrait tre difficile de savoir '


In [None]:
plain = text.encode(encoding = 'ascii', errors = 'replace')
print(plain)

b'?a pourrait ?tre difficile de savoir ?'
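The three calls above all pass an explicit `errors` strategy. Without one, `encode` uses `errors='strict'` and raises an exception at the first character it cannot represent. A minimal sketch of that default behaviour, plus one further built-in strategy, `backslashreplace`:

```python
text = 'ça pourrait être difficile de savoir 🤷'

try:
    text.encode('ascii')  # errors='strict' is the default
    failed_at = None
except UnicodeEncodeError as err:
    failed_at = err.start  # index of the first character that cannot be encoded
    print(err.reason)

print(failed_at)  # 0, because 'ç' is the very first character

# backslashreplace keeps the information as ASCII-only escape sequences.
escaped = text.encode('ascii', errors='backslashreplace')
print(escaped)
```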


Not all modern-day files, software, operating systems, and computers default to Unicode. Sometimes you will encounter a file that is in ISO 8859-1 (Latin-1) instead.

To include a Latin-1 character that is not plain ASCII in a byte sequence in Python, write `\x` to indicate that a byte value follows and then the two *hexadecimal* digits of that value (leading zeroes included). The Latin-1 codes can be looked up in a [table](https://cs.stanford.edu/people/miles/iso8859.html).

In [53]:
different = b'\xC4k\xE4inen' # a non-unicode string, as a sequence of bytes
print(different)

b'\xc4k\xe4inen'


That's *grumpy* in Finnish. The [table](https://cs.stanford.edu/people/miles/iso8859.html) shows which symbols `C4` (Ä) and `E4` (ä) stand for.

In [55]:
print(type(different))
usual = different.decode('iso-8859-1')
print(type(usual))
print(usual)

<class 'bytes'>
<class 'str'>
Äkäinen
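Choosing the right codec matters. As a sketch of what goes wrong otherwise, the same bytes are decoded below as UTF-8, which fails, and the decoded string is then encoded both ways to show that the two standards use different bytes for the same text.

```python
different = b'\xC4k\xE4inen'  # Latin-1 bytes for "Äkäinen"

# Decoding with the wrong codec fails: 0xC4 followed by 'k' is not valid UTF-8.
try:
    different.decode('utf-8')
    decoded_as_utf8 = True
except UnicodeDecodeError:
    decoded_as_utf8 = False
print(decoded_as_utf8)  # False

# The same text needs different bytes in the two encodings.
usual = different.decode('iso-8859-1')
print(usual.encode('iso-8859-1'))  # b'\xc4k\xe4inen'
print(usual.encode('utf-8'))       # b'\xc3\x84k\xc3\xa4inen'
```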


How can we figure out which encoding is the default?

In [None]:
import locale
locale.getpreferredencoding()

'UTF-8'

UTF-8 is one of the encodings in the Unicode family.
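The locale setting depends on the operating system and its configuration, so it may differ from machine to machine. Python 3 itself uses UTF-8 when `str.encode()` is called without naming an encoding, which the following sketch checks:

```python
import locale
import sys

print(locale.getpreferredencoding())  # depends on the operating system settings
print(sys.getdefaultencoding())       # 'utf-8' in Python 3

# str.encode() without arguments uses UTF-8.
encoded = 'é'.encode()
print(encoded)  # b'\xc3\xa9'
```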

When presented with a file of dubious origin and unknown encoding, one can either guess or employ tools built to deduce the intended encoding. Let's load some mystery files to try this out.

In [56]:
import urllib.request # to access files at known URLs

# a list of URLs
urls = [ 'https://raw.githubusercontent.com/SCS-Technology-and-Innovation/DSLP/main/data/mysteryone.txt',
         'https://raw.githubusercontent.com/SCS-Technology-and-Innovation/DSLP/main/data/mysterytwo.txt',
         'https://raw.githubusercontent.com/SCS-Technology-and-Innovation/DSLP/main/data/mysterythree.txt' ]

# a list of the file contents
inputfiles = [ urllib.request.urlopen(url).read() for url in urls ] # read the contents

# let's see them
for i in inputfiles:
  print(i)

b"C'est un fichier en fran\xc3\xa7ais.\n"
b'This is a file in English.\n'
b'T\xe4m\xe4 tiedosto on Suomeksi.\n'


We could of course just look at the bits themselves.

In [68]:
for i in inputfiles:
  print(''.join(f'{z:08b}' for z in i))

01000011001001110110010101110011011101000010000001110101011011100010000001100110011010010110001101101000011010010110010101110010001000000110010101101110001000000110011001110010011000010110111011000011101001110110000101101001011100110010111000001010
010101000110100001101001011100110010000001101001011100110010000001100001001000000110011001101001011011000110010100100000011010010110111000100000010001010110111001100111011011000110100101110011011010000010111000001010
010101001110010001101101111001000010000001110100011010010110010101100100011011110111001101110100011011110010000001101111011011100010000001010011011101010110111101101101011001010110101101110011011010010010111000001010


We can add spaces to see the individual bytes.

In [69]:
for i in inputfiles:
  print(' '.join(f'{z:08b}' for z in i))

01000011 00100111 01100101 01110011 01110100 00100000 01110101 01101110 00100000 01100110 01101001 01100011 01101000 01101001 01100101 01110010 00100000 01100101 01101110 00100000 01100110 01110010 01100001 01101110 11000011 10100111 01100001 01101001 01110011 00101110 00001010
01010100 01101000 01101001 01110011 00100000 01101001 01110011 00100000 01100001 00100000 01100110 01101001 01101100 01100101 00100000 01101001 01101110 00100000 01000101 01101110 01100111 01101100 01101001 01110011 01101000 00101110 00001010
01010100 11100100 01101101 11100100 00100000 01110100 01101001 01100101 01100100 01101111 01110011 01110100 01101111 00100000 01101111 01101110 00100000 01010011 01110101 01101111 01101101 01100101 01101011 01110011 01101001 00101110 00001010
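The space-separated form is also convenient for going in the other direction. As a sketch, each group of eight bits can be read back as a base-2 integer with `int(group, 2)`, and the integers collected into a `bytes` object; the English line from above is used as the example input.

```python
raw = b'This is a file in English.\n'

# Forward: one group of eight bits per byte, separated by spaces.
bits = ' '.join(f'{z:08b}' for z in raw)

# Backward: read each group as a base-2 integer and collect them into bytes.
restored = bytes(int(group, 2) for group in bits.split(' '))
print(restored == raw)  # True
print(restored.decode('ascii'), end='')
```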


We have the contents, but as byte sequences. Work remains to be done.

In order to use a tool to detect encodings, we need to ask our Colab virtual machine to install that tool, since it is not part of the standard configuration.

In [57]:
!apt install python3-magic # this tool is not included by default, we have to fetch it

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following NEW packages will be installed:
  python3-magic
0 upgraded, 1 newly installed, 0 to remove and 45 not upgraded.
Need to get 12.6 kB of archives.
After this operation, 52.2 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy/main amd64 python3-magic all 2:0.4.24-2 [12.6 kB]
Fetched 12.6 kB in 0s (122 kB/s)
Selecting previously unselected package python3-magic.
(Reading database ... 121920 files and directories currently installed.)
Preparing to unpack .../python3-magic_2%3a0.4.24-2_all.deb ...
Unpacking python3-magic (2:0.4.24-2) ...
Setting up python3-magic (2:0.4.24-2) ...


In [58]:
import magic # one tool to determine encodings

m = magic.open(magic.MAGIC_MIME_ENCODING)
m.load()
for i in inputfiles:
  enc = m.buffer(i)
  print(enc)

utf-8
us-ascii
iso-8859-1


With this knowledge, we can now decode the contents into regular Python strings, which are Unicode.

In [59]:
for i in inputfiles:
  enc = m.buffer(i)
  print(i.decode(enc))

C'est un fichier en français.

This is a file in English.

Tämä tiedosto on Suomeksi.
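The "guessing" alternative mentioned earlier can be sketched with the standard library alone: try candidate encodings in order and keep the first one that decodes without error. The helper `guess_decode` and its candidate list are hypothetical names introduced here, not part of any library. ISO 8859-1 accepts every possible byte, so it must come last, and a match on it is only a guess rather than proof.

```python
def guess_decode(data, candidates=('ascii', 'utf-8', 'iso-8859-1')):
    """Return (encoding, text) for the first candidate that decodes without error."""
    for enc in candidates:
        try:
            return enc, data.decode(enc)
        except UnicodeDecodeError:
            continue  # this candidate does not fit, try the next one
    raise ValueError('none of the candidate encodings fit')

# The three mystery files, copied from the output above.
samples = [b"C'est un fichier en fran\xc3\xa7ais.\n",
           b'This is a file in English.\n',
           b'T\xe4m\xe4 tiedosto on Suomeksi.\n']

guesses = [guess_decode(data) for data in samples]
for enc, text in guesses:
    print(enc, text, end='')
```

For these three files the guesses agree with what `magic` reported, but short or unusual inputs can easily fool such a simple approach.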



For more details on this topic, Brad Solomon has an excellent [tutorial](https://realpython.com/python-encodings-guide/#encoding-and-decoding-in-python-3) available online for free.