# Character Encodings

As a data scientist, you work with data that comes in different formats and encodings. This can cause problems in tasks such as loading the data into a Pandas DataFrame or simply reading a file, such as a CSV, from disk. For this reason, we need to know how to encode and decode the data into a format that we can use.

## Text Encoding

To understand this in depth, we need to understand how text is encoded and decoded in Python. It's important to keep in mind that computers only work with numbers, so for a computer to store text, it first needs to convert the characters into numbers.

A computer's basic unit of storage is the **byte**. A computer stores numbers, which can be converted back into other formats to work with. This is also true for text: computers store text in the form of numbers. To see how this works, let's look at a quick example.

In [27]:
msg = "Hello world"

In [28]:
encoded_msg = msg.encode('utf-8')

In [29]:
encoded_msg = str.encode(msg)

In [30]:
type(encoded_msg)

bytes

In [31]:
encoded_msg

b'Hello world'

In [32]:
list(encoded_msg)

[72, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100]

In [33]:
len(msg)

11

In [34]:
len(list(encoded_msg))

11

From the example, you can see that each character in the text is converted into a number. Every character in this message is a plain ASCII character, and UTF-8 stores each of those in a single byte, which is why the **len** of the text and the length of the list of encoded bytes are the same. Where are these numbers coming from, you might ask? They are the UTF-8 character codes; take a look [here](https://www.smashingmagazine.com/2012/06/all-about-unicode-utf8-character-sets/) for more info.

In this video we are not going to jump into the nitty-gritty details of how UTF-8 works; for more info, read [this](https://en.wikipedia.org/wiki/UTF-8) article from Wikipedia.

The first element of the list is **72**, which is the code for **H** in UTF-8. To confirm this in Python, we can use **ord**, which returns a character's Unicode code point; for ASCII characters such as **H**, that is the same number as its UTF-8 byte...

In [35]:
ord("H")

72

For a quick test, you can pause the video and try this out with other characters.

#### Demo

In [36]:
for char in "Hello world":
    print(ord(char))

72
101
108
108
111
32
119
111
114
108
100


The code above shows how we arrive at these values.
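
The match between **ord** and the encoded bytes only holds for ASCII characters. As a minimal sketch (the sample characters and the `rows` dictionary are just illustrative choices), the loop below compares the Unicode code point of a character with the bytes UTF-8 actually stores for it:

```python
# Compare each character's Unicode code point with its UTF-8 bytes.
rows = {}
for ch in ["H", "é", "😉"]:
    code_point = ord(ch)                   # a single number per character
    utf8_bytes = list(ch.encode("utf-8"))  # one to four bytes per character
    rows[ch] = (code_point, utf8_bytes)
    print(ch, code_point, utf8_bytes)

# H 72 [72]
# é 233 [195, 169]
# 😉 128521 [240, 159, 152, 137]
```

Only **H** has a byte value equal to its code point. The other two characters are spread across several bytes, so for non-ASCII text the number of bytes is larger than the number of characters.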

## UTF-8 support

UTF-8 supports many characters, including those from other languages as well as emojis. To try this out...

In [37]:
text = "Hello world 😉"
encoded_text = text.encode()

In [38]:
type(encoded_text)

bytes

In [39]:
len(encoded_text)

16

In [40]:
list(encoded_text)

[72, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100, 32, 240, 159, 152, 137]

In [41]:
encoded_text

b'Hello world \xf0\x9f\x98\x89'

Notice that the emoji takes four bytes (240, 159, 152, 137), so the encoded text is 16 bytes long even though the string has only 13 characters. One more important thing to notice is that UTF-8 was used to encode the text, so UTF-8 must also be used to decode it. Let's try to decode **encoded_text** using another encoding format.

In [42]:
encoded_text.decode("ascii")

UnicodeDecodeError: 'ascii' codec can't decode byte 0xf0 in position 12: ordinal not in range(128)

In [43]:
encoded_text.decode()

'Hello world 😉'
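
If you only need to inspect the data and can accept losing the characters that fail, `bytes.decode` also takes an `errors` argument. The sketch below, which rebuilds `encoded_text` so it runs on its own, shows the two most common handlers. Treat this as a way to peek at the data rather than a fix, because the original characters are lost:

```python
encoded_text = "Hello world 😉".encode("utf-8")

# errors="replace" swaps every undecodable byte for U+FFFD, the replacement character.
replaced = encoded_text.decode("ascii", errors="replace")

# errors="ignore" silently drops every undecodable byte.
ignored = encoded_text.decode("ascii", errors="ignore")

print(repr(replaced))
print(repr(ignored))
```

With `replace`, the four emoji bytes become four replacement characters, and with `ignore` they disappear entirely. Neither gives back the emoji, which is why decoding with the correct encoding is still the real solution.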

## How Do I Know Which To Use To Decode My Data?

This can be difficult, since you can't always tell which encoding was used to encode the data, but you can try to find out. Decoding data with a different encoding from the one it was written in is what leads to errors. One option is the **chardet** library, which guesses the encoding from the raw bytes.

In [46]:
import chardet

chardet.detect(encoded_text)

{'encoding': 'Windows-1254',
 'confidence': 0.3036647364631782,
 'language': 'Turkish'}

In [54]:
test_text = "Hello"
encoded_test = test_text.encode("utf-8")

In [55]:
chardet.detect(encoded_test)

{'encoding': 'ascii', 'confidence': 1.0, 'language': ''}

From these examples, you can agree with me that this is not a reliable approach. In the first case, chardet guessed Windows-1254 with only about 30% confidence for text that was actually UTF-8. In the second, it reported ascii for UTF-8 data, which happens to be harmless because plain ASCII text has the same bytes in both encodings. This kind of guessing is what leads to errors when reading data whose encoding we cannot tell. It's better to know the encoding than to predict or guess it. By default, Python uses **UTF-8**.

If you encode or decode text without specifying the encoding, Python will automatically use **UTF-8**. Try this out on your own.

Sometimes reading data with a different encoding might not give you an error, but the output will be wrong. Example...
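
When the encoding is unknown but you have a short list of likely candidates, one alternative to guessing is to try each candidate in turn. This is a minimal sketch, not a standard recipe: `decode_with_fallback` and its default candidate list are hypothetical names and choices made for illustration.

```python
def decode_with_fallback(raw, candidates=("utf-8", "cp1252", "latin1")):
    """Return (text, encoding) for the first candidate that decodes raw without error."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # this candidate cannot decode the bytes, try the next one
    raise ValueError("none of the candidate encodings could decode the data")


# UTF-8 data is decoded by the first candidate.
print(decode_with_fallback("Hello world 😉".encode("utf-8")))

# cp1252 data is not valid UTF-8, so the helper falls through to cp1252.
print(decode_with_fallback("café".encode("cp1252")))
```

The order matters: strict encodings such as UTF-8 go first, and **latin1** goes last because it accepts any byte sequence. A decode that succeeds is still not guaranteed to be the right one, so check the result by eye.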

In [59]:
test_text = "Hello 😉"
encoded_test = test_text.encode("utf-8")

In [60]:
encoded_test.decode()

'Hello 😉'

In [61]:
encoded_test.decode("latin1")

'Hello ð\x9f\x98\x89'

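This is because **latin1** maps every one of the 256 possible byte values to some character, so decoding never fails, but it has no way to represent the emoji, so the emoji's four bytes come out as four unrelated characters. Proof that **latin1** cannot handle the emoji...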

In [62]:
encoded_test = test_text.encode("latin1")

UnicodeEncodeError: 'latin-1' codec can't encode character '\U0001f609' in position 6: ordinal not in range(256)
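
Because **latin1** keeps every byte when decoding, text garbled this way can usually be recovered by reversing the mistake. The sketch below assumes you already know the data was UTF-8 that got decoded as latin1; the variable names are illustrative.

```python
original = "Hello 😉"

# The mistake: UTF-8 bytes decoded with the wrong encoding.
garbled = original.encode("utf-8").decode("latin1")

# The repair: turn the garbled text back into the same bytes, then decode as UTF-8.
repaired = garbled.encode("latin1").decode("utf-8")

print(repr(garbled))
print(repaired)
```

This round trip only works because no bytes were lost along the way. If the text had been decoded with `errors="replace"` or `errors="ignore"`, the original bytes would be gone and the repair would not be possible.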

# Working With Pandas

Let's go ahead and apply the same knowledge to reading data in Pandas.

In [63]:
import pandas as pd

In [71]:
df = pd.read_csv("parsing_dates.csv", encoding = "ascii")

In [72]:
df

Unnamed: 0,2018-02-20,Dinning,200
0,2019-03-23,Shopping,330
1,2020-09-20,Jogging,400
2,2021-10-04,Partying,500
3,2022-11-09,Coding,20


In this case, all the encodings we tried worked perfectly. This may not be the case for some datasets, so to be on the safe side, test it out.

In [73]:
with open('parsing_dates.csv', 'rb') as rawdata:
    result = chardet.detect(rawdata.read(10000))

In [75]:
result

{'encoding': 'ascii', 'confidence': 1.0, 'language': ''}

In [77]:
df = pd.read_csv("parsing_dates.csv", encoding = "ascii")
df

Unnamed: 0,2018-02-20,Dinning,200
0,2019-03-23,Shopping,330
1,2020-09-20,Jogging,400
2,2021-10-04,Partying,500
3,2022-11-09,Coding,20
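
A likely reason every encoding works on this file is that it contains only ASCII characters, and ASCII bytes mean the same thing in many common encodings. The sketch below illustrates this with two rows copied from the output above; the `raw` bytes stand in for the file contents, and the list of encodings is just an example.

```python
# Two rows from the CSV, as raw bytes (a stand-in for the file contents).
raw = b"2019-03-23,Shopping,330\n2020-09-20,Jogging,400\n"

# Decode the same bytes with several encodings.
decoded = {enc: raw.decode(enc) for enc in ("ascii", "utf-8", "latin1", "cp1252")}

# ASCII-only data decodes to exactly the same text under all of them.
all_same = len(set(decoded.values())) == 1
print(raw.isascii(), all_same)
```

As soon as the file contains a non-ASCII character, such as an accented name or an emoji, these encodings stop agreeing and the `encoding` argument passed to `pd.read_csv` starts to matter.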


## Conclusion

When you get an encoding error while reading a CSV file, it may be because someone wrote the file in another encoding. What matters is that you decode it using the same encoding that was used to encode it. Thanks for watching.