# Character Encoding 

## What are encodings?
Encodings are specific sets of rules for mapping raw binary byte strings such as "01001010" to human-readable text such as "Hi, my name is John". There are many different encodings, and to get the original text back you need to decode with the same encoding that was used to encode it.
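
As a small illustration of that mapping, the sketch below turns the bit string from the paragraph above into a single byte and decodes it, then decodes one set of bytes under two different encodings. The variable names are made up for this example.

```python
# The bit string from the text, interpreted as one byte
bits = "01001010"
raw = bytes([int(bits, 2)])

# Decoding applies an encoding's rules to turn bytes into text
letter = raw.decode("ascii")
print(raw, letter)

# The same bytes can mean different text under different encodings
euro_bytes = "€".encode("utf-8")
right = euro_bytes.decode("utf-8")   # decoded with the encoding that produced it
wrong = euro_bytes.decode("cp1252")  # decoded with a mismatched encoding
print(right)
print(wrong)
```

The second half is why the matching encoding matters: the mismatched decode does not fail, it just quietly produces different characters.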

There are also 'unknown' characters, which appear when there's no mapping between a particular byte and a character in the encoding being used. Your output will then look like this:

����������
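
One way to reproduce this is to decode bytes that are not valid in the chosen encoding while asking Python to substitute rather than raise. A minimal sketch, with made-up example bytes:

```python
# 0xff and 0xfe can never appear in valid UTF-8 (example bytes, not real data)
bad_bytes = b"caf\xff\xfe"

# errors="replace" swaps each undecodable byte for U+FFFD, the 'unknown' character
text = bad_bytes.decode("utf-8", errors="replace")
print(text)

# Count how many bytes had to be substituted
unknown_count = text.count("\ufffd")
print(unknown_count)
```

Once a byte has been replaced this way, the original value is gone from the string, so it is worth fixing the encoding rather than living with the placeholders.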

There are lots of different character encodings, but the main one is 'UTF-8', which is the standard text encoding.

In [10]:
# modules we'll use
import pandas as pd
import numpy as np

# helpful character encoding module
import charset_normalizer

# set seed for reproducibility
np.random.seed(0)

In [5]:
before = "This is the euro symbol: €"
type(before) # str

after = before.encode("utf-8", errors="replace") 
type(after) # bytes

'''
Bytes are printed as if they were ASCII text, with a 'b' at the front 
of the output. Any byte outside the printable ASCII range is shown as 
a hexadecimal escape instead. The euro symbol appears as '\xe2\x82\xac' 
because UTF-8 encodes it as three bytes (0xE2, 0x82, 0xAC), none of 
which is an ASCII character. Nothing has gone wrong here: this is 
simply how Python displays a bytes object.
'''
print(after)
print(after.decode("utf-8"))

# Decoding with a different encoding, such as ASCII, raises a UnicodeDecodeError.
# print(after.decode("ascii"))


b'This is the euro symbol: \xe2\x82\xac'
This is the euro symbol: €
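
The commented-out ASCII decode above fails because the euro symbol's bytes fall outside ASCII's 0-127 range. Here is a short sketch of that failure, catching the exception instead of letting it stop the notebook. The variable names are chosen for this example so they don't overwrite `before` and `after`.

```python
euro = "€"
encoded = euro.encode("utf-8")

# One character in the string, but three bytes once encoded as UTF-8
print(len(euro), len(encoded))

try:
    encoded.decode("ascii")
except UnicodeDecodeError as err:
    # err.start is the offset of the first byte that could not be decoded
    problem_offset = err.start
    print(type(err).__name__, "at byte offset", problem_offset)
```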


In [9]:
'''
Remember that encoding maps from strings to bytes. ASCII only covers 
128 characters (basic English letters, digits, punctuation and control 
codes), so a symbol it doesn't support, such as the euro symbol, is lost. 
Here, with errors="replace", any character that can't be encoded is 
swapped for a question mark, and the original can't be recovered.

We want to avoid this. Just encode and decode in UTF-8 so that you're 
able to keep all of your data. The best time to convert to UTF-8 is when 
you read in files.
'''
# start with a string
before = "This is the euro symbol: €"

# encode it to a different encoding, replacing characters that raise errors
after = before.encode("ascii", errors="replace")

# decode the bytes back to a string; the euro symbol is already gone
print(after.decode("ascii"))

This is the euro symbol: ?
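
"replace" is only one of several error handlers that `str.encode` accepts. The sketch below, using a made-up sample string, encodes the same text to ASCII under a few of them to show how much information each one keeps.

```python
text = "Price: 5 €"

# Error handlers that encode() accepts besides the default "strict"
handlers = ("replace", "ignore", "backslashreplace", "xmlcharrefreplace")

# Encode the same string to ASCII under each handler and compare
results = {h: text.encode("ascii", errors=h) for h in handlers}
for name, encoded in results.items():
    print(f"{name:>18}: {encoded!r}")
```

"replace" and "ignore" throw the character away, while the other two keep an escaped form that could be converted back later. None of them is a substitute for staying in UTF-8.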


In [17]:
'''
+ Reading in files with encoding problems:
Most files you'll encounter will be encoded with UTF-8. pandas expects 
this by default, but sometimes you may get a 'UnicodeDecodeError' when 
reading in a csv file:

kickstarter_2016 = pd.read_csv("kick-starter-utf-8.csv")

When that happens, it indicates that the file isn't actually in UTF-8.
To figure out the encoding, we can either test a bunch of different 
encodings or use the 'charset_normalizer' module, which gives us a good 
guess. Looking through, say, the first 10000 bytes gives it enough info. 
Also, if you check the error message from reading in the .csv, the first 
problem occurs at the 11th character, so we don't need to look too far.

Here it's pretty confident that the csv is in utf-8, so we'll
try to load it in like that. And yeah, it works!
'''

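# path to the raw file
csv_path = "../data/kick-starter-utf-8.csv"

# read the first 10,000 bytes in binary mode and let charset_normalizer guess
with open(csv_path, "rb") as raw_data:
    result = charset_normalizer.detect(raw_data.read(10000))
print(result)

# load the file using the encoding reported above
encoding_type = "utf-8"
kickstarter_2016 = pd.read_csv(csv_path, encoding=encoding_type)
kickstarter_2016.head(5)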

{'encoding': 'utf-8', 'language': 'Spanish', 'confidence': 0.991}




Unnamed: 0.1,Unnamed: 0,ID,name,category,main_category,currency,deadline,goal,launched,pledged,state,backers,country,usd pledged,Unnamed: 13,Unnamed: 14,Unnamed: 15,Unnamed: 16
0,0,1000002330,The Songs of Adelaide & Abullah,Poetry,Publishing,GBP,2015-10-09 11:36:00,1000,2015-08-11 12:12:28,0,failed,0,GB,0,,,,
1,1,1000004038,Where is Hank?,Narrative Film,Film & Video,USD,2013-02-26 00:20:50,45000,2013-01-12 00:20:50,220,failed,3,US,220,,,,
2,2,1000007540,ToshiCapital Rekordz Needs Help to Complete Album,Music,Music,USD,2012-04-16 04:24:11,5000,2012-03-17 03:24:11,1,failed,1,US,1,,,,
3,3,1000011046,Community Film Project: The Art of Neighborhoo...,Film & Video,Film & Video,USD,2015-08-29 01:00:00,19500,2015-07-04 08:35:03,1283,canceled,14,US,1283,,,,
4,4,1000014025,Monarch Espresso Bar,Restaurants,Food,USD,2016-04-01 13:38:27,50000,2016-02-26 13:38:27,52375,successful,224,US,52375,,,,
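
The "test a bunch of different encodings" route mentioned above can be sketched with the standard library alone. The helper name `read_with_fallback`, the candidate list, and the sample file are all made up for illustration. Keep in mind that single-byte encodings such as cp1252 rarely fail to decode, so a clean decode doesn't prove the guess is right; it is still worth eyeballing the result.

```python
import os
import tempfile

def read_with_fallback(path, candidates=("utf-8", "cp1252")):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    with open(path, "rb") as f:
        raw = f.read()
    for enc in candidates:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue  # this candidate does not fit, try the next one
    raise ValueError(f"none of {candidates} could decode {path}")

# Build a small sample file that is NOT UTF-8 (hypothetical data)
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.csv")
with open(sample_path, "w", encoding="cp1252", newline="") as f:
    f.write("name,price\ncafé,5 €\n")

text, used = read_with_fallback(sample_path)
print(used)

# Re-save as UTF-8 so later reads need no special handling
utf8_path = os.path.join(tmp_dir, "sample-utf-8.csv")
with open(utf8_path, "w", encoding="utf-8", newline="") as f:
    f.write(text)
```

Putting UTF-8 first in the candidate list is deliberate: it is strict enough that non-UTF-8 files with accented or special characters usually fail fast, which makes it a useful first test.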
