<a href="https://colab.research.google.com/github/SCS-Technology-and-Innovation/DSLP/blob/main/DSLP_M01.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Encoding


On any kind of computer, text is represented as sequences of *characters* called **strings**. All digital representation, in turn, relies on sequences of *bits*. Each bit can take one of two possible values: 1 (true) or 0 (false).

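Computer memory is **not** measured in bits but in *bytes*, which are groups of eight bits each. One cannot, in general, reserve a single bit of memory; reserving one or more bytes is the usual way to proceed.

Each character is stored as a number. In Python, the built-in function `ord` returns the number that stands for a given character.
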
In [23]:
letter = 'a'
print(letter, ord(letter))

a 97


In [26]:
print(ord('A'))
print(ord('ü'))

65
252
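
The built-in `chr` goes the other way from `ord`: it turns a number back into the character it stands for. The short sketch below (the variable names are only illustrative) round-trips the three characters seen above and checks that each of their numbers is below 256, so that each one would fit into a single byte.

```python
# Round-trip a few characters through ord() and chr().
samples = ['a', 'A', 'ü']  # illustrative sample characters

codes = [ord(c) for c in samples]     # characters -> numbers
restored = [chr(n) for n in codes]    # numbers -> characters

print(codes)
print(restored)

# A byte holds values from 0 to 255, so check whether each number fits.
fits_in_one_byte = all(n < 256 for n in codes)
print(fits_in_one_byte)
```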


The most **compact** way to turn text into bits is therefore to assign *one byte per letter*. Let's look at the bits themselves.


In [29]:
bits = ord('x')
print(bin(bits))

0b1111000


That's only *seven* bits instead of the **eight** we expected. This is because *leading zeroes* are not shown unless we request them specifically by **formatting** the printout to match a specified pattern.

In [37]:
character = '2'
byte = ord(character)
bits = bin(byte)
print(bits) # no leading zeroes but with that 0b prefix in place
print("{0:08b}".format(byte)) # leading zeroes to a whole byte but no prefix

0b110010
00110010
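
The same format pattern can be applied to every character of a short word. The following sketch uses the illustrative word `'cat'` and two hypothetical helpers, `to_bits` and `from_bits`; it assumes that every character has a number below 256, so that eight bits are enough.

```python
def to_bits(text):
    """Return the 8-bit pattern of each character, separated by spaces.

    Hypothetical helper; assumes ord() of every character is below 256.
    """
    return ' '.join('{0:08b}'.format(ord(c)) for c in text)


def from_bits(bits):
    """Turn space-separated 8-bit patterns back into text."""
    return ''.join(chr(int(b, 2)) for b in bits.split())


encoded = to_bits('cat')
print(encoded)
print(from_bits(encoded))
```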


Using one byte per letter gives us eight yes/no choices, which means that there are a total of $2^8$ possible bit sequences that fit into one byte, allowing us to represent 256 different characters.
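
The count can be checked directly in Python. The sketch below also uses the built-in `bytes` type, which only accepts values that fit into one byte, to show where the range ends.

```python
# Number of distinct bit patterns in one byte (8 bits).
patterns = 2 ** 8
print(patterns)

# The largest value is therefore 255, whose bits are all ones.
largest = patterns - 1
print('{0:08b}'.format(largest))

# bytes() rejects values outside 0..255.
try:
    bytes([256])
    accepted = True
except ValueError as error:
    accepted = False
    print('rejected:', error)
```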

If we think in English only, this sounds like a lot.

In [17]:
import string # we use a library that contains useful information regarding strings

letters = string.ascii_letters # we will discuss what ASCII means in a bit
print(letters)

digits = string.digits
print(digits)

punctuation = string.punctuation
print(punctuation)

print(len(letters) + len(digits) + len(punctuation))

abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
0123456789
!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
94


In addition to letters, digits, and punctuation, there are also other characters of interest, notably all **whitespace** characters like line breaks, spaces, and tabs. We do not notice them, though, unless they come before or in between other characters.
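
One way to make whitespace visible is to print its `repr` rather than the characters themselves. The sketch below does this for `string.whitespace`, the standard-library constant that lists the ASCII characters considered whitespace.

```python
import string

# repr() shows escape sequences instead of the invisible characters.
print(repr(string.whitespace))
print(len(string.whitespace))

# str.isspace() tells whether a character counts as whitespace.
for c in ['a', ' ', '\t', '\n']:
    print(repr(c), c.isspace())
```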

In [22]:
print('🐱', string.printable, '🐱')
len(string.printable)

🐱 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ 	
 🐱


100

That's a hundred right here and we haven't even begun with accented characters for French and Spanish, let alone languages that use a different alphabet altogether. And then there are all the [emoji](https://en.wikipedia.org/wiki/List_of_emojis) like the cat that was used above to allow us to take note of the whitespace at the end of the printable list of characters.

So what do we get if we look at `ord` for the cat emoji?

In [28]:
ord('🐱')

128049

This is way bigger than the "one byte per character" minimal assumption we began with. Who decides, and how, the number of bytes per character (whether that is the same amount for all symbols or varies between symbols) and how the bit sequences within those bytes are mapped onto readable symbols?

There are multiple **standards** that map sequences of bits onto human-readable characters.

*   [ASCII](https://en.wikipedia.org/wiki/ASCII)
*   [Unicode](https://en.wikipedia.org/wiki/Unicode) (UTF formats)
*   [ISO 8859-1](https://en.wikipedia.org/wiki/ISO/IEC_8859-1) (also known as Latin1)
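
How many bytes a character takes depends on the chosen standard. As a rough illustration, the sketch below encodes three characters used above with a few encodings that ship with Python. The selection of characters and encodings is only an example, and `None` marks a character that the encoding cannot represent.

```python
characters = ['a', 'ü', '🐱']
encodings = ['ascii', 'iso-8859-1', 'utf-8', 'utf-16-le', 'utf-32-le']

sizes = {}
for enc in encodings:
    row = []
    for c in characters:
        try:
            row.append(len(c.encode(enc)))  # bytes needed for this character
        except UnicodeEncodeError:
            row.append(None)  # not representable in this encoding
    sizes[enc] = row
    print(enc, row)
```

UTF-8 and UTF-16 use a variable number of bytes per character, whereas UTF-32 always uses four.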

In [6]:
text = 'ça pourrait être difficile de savoir 🤷' # the Python default is Unicode
print(text)

ça pourrait être difficile de savoir 🤷


Not all encodings contain representations for all characters.

In [7]:
plain = text.encode(encoding = 'ascii', errors = 'namereplace')
print(plain)

b'\\N{LATIN SMALL LETTER C WITH CEDILLA}a pourrait \\N{LATIN SMALL LETTER E WITH CIRCUMFLEX}tre difficile de savoir \\N{SHRUG}'


In [8]:
plain = text.encode(encoding = 'ascii', errors = 'ignore')
print(plain)

b'a pourrait tre difficile de savoir '


In [9]:
plain = text.encode(encoding = 'ascii', errors = 'replace')
print(plain)

b'?a pourrait ?tre difficile de savoir ?'
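
When no `errors` argument is given, Python uses the strict mode and raises an exception instead of altering the text. A minimal sketch, reusing the sentence from above:

```python
text = 'ça pourrait être difficile de savoir 🤷'

try:
    text.encode(encoding='ascii')  # errors='strict' is the default
    failed = False
except UnicodeEncodeError as error:
    failed = True
    print(error)

# UTF-8 can represent every character, so the round trip is lossless.
roundtrip = text.encode('utf-8').decode('utf-8')
print(roundtrip == text)
```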


Not all modern-day files, software, operating systems, and computers default to Unicode. Sometimes you will encounter a file that is in ISO 8859-1 instead.

In [48]:
different = b'\xC4k\xE4inen' # a non-Unicode string, as a sequence of bytes
print(different)

b'\xc4k\xe4inen'


That's *grumpy* in Finnish. There is a [table](https://cs.stanford.edu/people/miles/iso8859.html) for looking up which symbols `C4` and `E4` stand for.

In [47]:
print(type(different))
usual = different.decode('iso-8859-1')
print(type(usual))
print(usual)

<class 'bytes'>
<class 'str'>
Äkäinen
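
Decoding with the wrong encoding either fails outright or silently produces the wrong characters. The sketch below illustrates both outcomes with the word `être` from earlier; the variable names are illustrative.

```python
word = 'être'
as_utf8 = word.encode('utf-8')         # b'\xc3\xaatre'
as_latin1 = word.encode('iso-8859-1')  # b'\xeatre'

# UTF-8 bytes read as ISO 8859-1: no error, but garbled text.
garbled = as_utf8.decode('iso-8859-1')
print(garbled)

# ISO 8859-1 bytes read as UTF-8: the byte sequence is invalid.
try:
    as_latin1.decode('utf-8')
    decoded_ok = True
except UnicodeDecodeError as error:
    decoded_ok = False
    print(error)
```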


How can we find out which encoding is the default?

In [49]:
import locale
locale.getpreferredencoding()

'UTF-8'
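
Because the preferred encoding varies between systems, it is usually safer to state the encoding explicitly when reading or writing files. The sketch below writes and reads a temporary file with `encoding='utf-8'`; the file name is a placeholder.

```python
import os
import sys
import tempfile

print(sys.getdefaultencoding())  # default used by str.encode() and bytes.decode()

text = 'ça pourrait être difficile'
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'example.txt')  # placeholder file name
    with open(path, 'w', encoding='utf-8') as target:
        target.write(text)
    with open(path, 'r', encoding='utf-8') as source:
        recovered = source.read()
    with open(path, 'rb') as raw:
        size = len(raw.read())  # size in bytes on disk

# The two accented letters take two bytes each in UTF-8.
print(recovered == text, len(text), size)
```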

When presented with a file of dubious origin with an unknown encoding, one can either guess or employ tools built to deduce the intended encoding. Let's load some mystery files to try this out.

In [53]:
import urllib.request # to access files at known URLs

urls = [ 'https://raw.githubusercontent.com/SCS-Technology-and-Innovation/DSLP/main/data/mysteryone.txt',
         'https://raw.githubusercontent.com/SCS-Technology-and-Innovation/DSLP/main/data/mysterytwo.txt',
         'https://raw.githubusercontent.com/SCS-Technology-and-Innovation/DSLP/main/data/mysterythree.txt' ]

inputfiles = [ urllib.request.urlopen(url).read() for url in urls ] # read the contents
for i in inputfiles:
  print(i)

b"C'est un fichier en fran\xc3\xa7ais.\n"
b'This is a file in English.\n'
b'T\xe4m\xe4 tiedosto on Suomeksi.\n'


We have the contents, but as byte sequences. Work remains to be done.
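
The "guess" option can be sketched as trying a list of candidate encodings in order and keeping the first one that decodes without an error. The helper `guess_decode` and its candidate list are hypothetical, and the byte strings are copied from the output above so that the example runs without network access. Note that ISO 8859-1 accepts any byte sequence, so it only works as a last resort and can still return the wrong text.

```python
def guess_decode(data, candidates=('ascii', 'utf-8', 'iso-8859-1')):
    """Return (encoding, text) for the first candidate that decodes."""
    for enc in candidates:
        try:
            return enc, data.decode(enc)
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError('none of the candidate encodings worked')


samples = [b"C'est un fichier en fran\xc3\xa7ais.\n",
           b'This is a file in English.\n',
           b'T\xe4m\xe4 tiedosto on Suomeksi.\n']

guesses = [guess_decode(s) for s in samples]
for enc, text in guesses:
    print(enc, text.strip())
```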

In [56]:
!apt install python3-magic # this tool is not included by default, we have to fetch it

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following NEW packages will be installed:
  python3-magic
0 upgraded, 1 newly installed, 0 to remove and 45 not upgraded.
Need to get 12.6 kB of archives.
After this operation, 52.2 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy/main amd64 python3-magic all 2:0.4.24-2 [12.6 kB]
Fetched 12.6 kB in 0s (59.0 kB/s)
Selecting previously unselected package python3-magic.
(Reading database ... 121752 files and directories currently installed.)
Preparing to unpack .../python3-magic_2%3a0.4.24-2_all.deb ...
Unpacking python3-magic (2:0.4.24-2) ...
Setting up python3-magic (2:0.4.24-2) ...


In [59]:
import magic # one tool to determine encodings

m = magic.open(magic.MAGIC_MIME_ENCODING)
m.load()
for i in inputfiles:
  enc = m.buffer(i)
  print(enc)

utf-8
us-ascii
iso-8859-1


With this knowledge, we can now decode the byte sequences into Python strings, which are Unicode.

In [60]:
for i in inputfiles:
  enc = m.buffer(i)
  print(i.decode(enc))

C'est un fichier en français.

This is a file in English.

Tämä tiedosto on Suomeksi.
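
Once each file has been decoded, it can be stored in a single common encoding. As a sketch, the code below re-encodes the three byte strings as UTF-8, with the encodings hard-coded from the detection output above rather than detected again.

```python
# (bytes, encoding) pairs; the encodings are taken from the output above.
detected = [(b"C'est un fichier en fran\xc3\xa7ais.\n", 'utf-8'),
            (b'This is a file in English.\n', 'us-ascii'),
            (b'T\xe4m\xe4 tiedosto on Suomeksi.\n', 'iso-8859-1')]

converted = []
for data, enc in detected:
    text = data.decode(enc)                 # bytes -> str
    converted.append(text.encode('utf-8'))  # str -> UTF-8 bytes

# Only the ISO 8859-1 file changes at the byte level.
for before, after in zip(detected, converted):
    print(before[0] == after, after)
```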



For more details on this topic, Brad Solomon has an excellent [tutorial](https://realpython.com/python-encodings-guide/#encoding-and-decoding-in-python-3) available online for free.