## Character Encoding

In [30]:
import pandas as pd
import numpy as np

# helpful character encoding module
import chardet

# set seed for reproducibility
np.random.seed(0)

Character encodings are specific sets of rules for mapping from raw binary byte strings (that look like this: 0110100001101001) to characters that make up human-readable text (like "hi"). There are many different encodings, and if you try to read text with a different encoding than the one it was originally written in, you end up with scrambled text called "mojibake" (pronounced mo-gee-bah-kay). Here's an example of mojibake:

æ–‡å—åŒ–ã??

You might also end up with "unknown" characters. These are printed when there is no mapping between a particular byte and a character in the encoding you are using to read the byte string, and they look like this:

����������

[source](https://www.kaggle.com/alexisbcook/character-encodings)
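Both effects are easy to reproduce. The following is a minimal sketch; the word "café" and the choice of Latin-1 as the mismatched encoding are arbitrary assumptions made for illustration, not taken from the source above.

```python
# A word with one non-ASCII character (an arbitrary example).
text = "café"

# Mojibake: the bytes were written as UTF-8 but are read as Latin-1.
utf8_bytes = text.encode("utf-8")
mojibake = utf8_bytes.decode("latin-1")
print(mojibake)  # cafÃ©

# "Unknown" characters: the bytes were written as Latin-1 but are read as
# UTF-8, with undecodable bytes replaced by U+FFFD instead of raising.
latin1_bytes = text.encode("latin-1")
unknown = latin1_bytes.decode("utf-8", errors="replace")
print(unknown)  # caf�
```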

When loading data, we may occasionally run into an encoding mismatch. This is less of a problem these days but can still occur. The most widely used encoding is UTF-8.

Python 3 source files are UTF-8 by default and Python strings are Unicode, so it is simplest if your data is UTF-8 as well. If it is not, you need to tell Python which encoding to use when reading it, otherwise you will get errors or scrambled text.

Let's see an example.

In [31]:
before = "Hello world 😉"
before

'Hello world 😉'

In [32]:
type(before)

str

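We can encode the string into bytes. Passing errors = 'replace' substitutes a placeholder for any character the target encoding cannot represent instead of raising an error. UTF-8 can represent every character in this string, so nothing is replaced here.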

In [33]:
encoded = before.encode("utf-8", errors = 'replace')

We have now converted a string to bytes.

The two types you will most often meet when working with text in Python are

1. strings
2. bytes

A bytes object is just a sequence of integers between 0 and 255. We can convert a string to bytes by specifying the encoding.

In [34]:
type(encoded)

bytes

In [35]:
encoded

b'Hello world \xf0\x9f\x98\x89'
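To make the "sequence of integers" point concrete, here is a small sketch that inspects the same string and its UTF-8 bytes. It only uses built-in operations; the variable names repeat the ones from the cells above.

```python
before = "Hello world 😉"
encoded = before.encode("utf-8")

# The string has 13 characters, but the emoji needs 4 bytes in UTF-8,
# so the encoded form is 16 bytes long.
n_chars = len(before)
n_bytes = len(encoded)
print(n_chars, n_bytes)  # 13 16

# Indexing a bytes object returns an integer, not a character.
first_byte = encoded[0]
print(first_byte)  # 72, the byte for "H"

# The last four bytes are the emoji: 0xf0 0x9f 0x98 0x89.
emoji_bytes = list(encoded[-4:])
print(emoji_bytes)  # [240, 159, 152, 137]
```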

Now that we have converted the string to bytes, let's convert it back into a string. This can be done as follows:

In [36]:
encoded.decode("utf-8")

'Hello world 😉'

When we specify the wrong encoding, we get an error, because the encoding we specified cannot make sense of the data.

In [37]:
print(encoded.decode("ascii"))

UnicodeDecodeError: 'ascii' codec can't decode byte 0xf0 in position 12: ordinal not in range(128)
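If raising an error is not what you want, the errors argument changes how bad bytes or characters are handled. The sketch below shows the built-in "replace" and "ignore" handlers on the same data. Note that both lose information, so they are a way to get past an error rather than a fix for a wrong encoding.

```python
before = "Hello world 😉"
encoded = before.encode("utf-8")

# Decoding with the wrong codec: each undecodable byte becomes U+FFFD
# with "replace", or is silently dropped with "ignore".
replaced = encoded.decode("ascii", errors="replace")
ignored = encoded.decode("ascii", errors="ignore")
print(repr(ignored))  # 'Hello world '

# Encoding: a character that ASCII cannot represent becomes "?".
ascii_bytes = before.encode("ascii", errors="replace")
print(ascii_bytes)  # b'Hello world ?'
```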

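ASCII can decode plain English letters, digits, and common punctuation even when they were encoded with another encoding, as long as that encoding is ASCII-compatible (UTF-8 is). Any byte outside the ASCII range, such as the bytes of the emoji above, raises an error.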
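A quick check of when ASCII decoding works, as a sketch: plain English text encoded as UTF-8 or Latin-1 decodes fine, while the same text encoded as UTF-16 does not, because UTF-16 uses a different byte layout. The helper name `decodes_as_ascii` is made up for this example.

```python
plain = "Hello world"

def decodes_as_ascii(raw):
    """Return True if the bytes can be decoded with the ASCII codec."""
    try:
        raw.decode("ascii")
        return True
    except UnicodeDecodeError:
        return False

# UTF-8 and Latin-1 store ASCII characters as the same single bytes.
utf8_ok = decodes_as_ascii(plain.encode("utf-8"))
latin1_ok = decodes_as_ascii(plain.encode("latin-1"))

# UTF-16 starts with a byte order mark whose bytes are outside ASCII.
utf16_ok = decodes_as_ascii(plain.encode("utf-16"))

print(utf8_ok, latin1_ok, utf16_ok)  # True True False
```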

Sometimes we do not know which encoding was used to write the data. In that case we can use chardet to guess it for us automatically, although 100% accuracy is not guaranteed.

In [41]:
with open('datasets/housing.csv', 'rb') as rawdata:
    result = chardet.detect(rawdata.read(10000))

In [42]:
print(result)

{'encoding': 'ascii', 'confidence': 1.0, 'language': ''}


This shows that chardet is 100% confident that the first 10,000 bytes of the file are ASCII.
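Keep in mind that a detection run on a sample only describes that sample. The sketch below uses made-up data (not the housing file) to show how the first 10,000 bytes can be pure ASCII while a later byte is not. It uses the built-in bytes.isascii() method, available from Python 3.7, rather than chardet.

```python
# Hypothetical file contents: 10,000 ASCII bytes followed by one "é" in UTF-8.
data = b"a" * 10000 + "é".encode("utf-8")

# The sample looks like ASCII, but the full data is not.
sample_is_ascii = data[:10000].isascii()
whole_is_ascii = data.isascii()
print(sample_is_ascii, whole_is_ascii)  # True False
```

If reading the full file later fails with a decode error, running the detection on a larger sample is a reasonable next step.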

In [44]:
df = pd.read_csv("datasets/housing.csv", encoding = "ascii")

In [45]:
df

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY
...,...,...,...,...,...,...,...,...,...,...
20635,-121.09,39.48,25.0,1665.0,374.0,845.0,330.0,1.5603,78100.0,INLAND
20636,-121.21,39.49,18.0,697.0,150.0,356.0,114.0,2.5568,77100.0,INLAND
20637,-121.22,39.43,17.0,2254.0,485.0,1007.0,433.0,1.7000,92300.0,INLAND
20638,-121.32,39.43,18.0,1860.0,409.0,741.0,349.0,1.8672,84700.0,INLAND


Sometimes the detected encoding is not right, and you need to try other encodings until you find one that works. Once the file opens correctly, it is best to save it back to a CSV file. You can do this with the DataFrame's to_csv() method, which writes UTF-8 by default.
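The trial-and-error step can be sketched with the standard library alone. Everything here is an assumption made for illustration: the helper name `first_working_encoding`, the candidate list, and the temporary example file. With pandas, the encoding found this way would be passed as the encoding argument of pd.read_csv.

```python
import tempfile
from pathlib import Path

def first_working_encoding(path, candidates=("utf-8", "cp1252", "latin-1")):
    """Return the first candidate that decodes the whole file, else None."""
    raw = Path(path).read_bytes()
    for name in candidates:
        try:
            raw.decode(name)
            return name
        except UnicodeDecodeError:
            continue
    return None

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical input: a tiny CSV written in Windows-1252.
    src = Path(tmp) / "example.csv"
    src.write_bytes("city\nZürich\n".encode("cp1252"))

    found = first_working_encoding(src)
    print(found)  # cp1252

    # Re-save as UTF-8 so later reads need no special handling.
    text = src.read_text(encoding=found)
    out = Path(tmp) / "example_utf8.csv"
    out.write_text(text, encoding="utf-8")
    roundtrip = out.read_text(encoding="utf-8")
```

A successful decode does not prove the encoding is correct. Latin-1 in particular accepts every byte sequence, which is why it is listed last, so it is worth inspecting the resulting text before saving it.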