## Character Encodings


In [1]:
# Environment set up
import pandas as pd
import numpy as np

# helpful character encoding module
import chardet

# set seed for reproducibility
np.random.seed(0)

**Character encodings** are specific sets of rules for mapping from raw binary byte strings (that look like this: 0111010101010110)
to characters that make up human-readable text (like 'hi'). There are many different encodings, and if you try to read in text with a
different encoding than the one it was originally written in, you end up with scrambled text called "mojibake". Here's an example
of mojibake:

æ–‡å—åŒ–ã??
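As a minimal sketch of how mojibake arises, the snippet below writes a string out as UTF-8 and then reads the same bytes back as Latin-1. The sample word and the choice of Latin-1 as the "wrong" encoding are assumptions made for illustration.

```python
# Assumed example string; any text with non-ASCII characters would do.
original = 'café'

# Write the text out as UTF-8 bytes...
raw = original.encode('utf-8')

# ...then read those bytes back with the wrong encoding (Latin-1 here).
garbled = raw.decode('latin-1')
print(garbled)  # cafÃ©

# Latin-1 maps every byte to a character, so this particular mix-up can be undone.
restored = garbled.encode('latin-1').decode('utf-8')
print(restored)  # café
```

The two-byte UTF-8 sequence for `é` shows up as two separate Latin-1 characters, which is the typical look of mojibake.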


You might also end up with "unknown" characters. These are what gets printed when there's no mapping between a particular byte and a character
in the encoding you're using to read your byte string, and they look like this:

����������
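A small sketch of where these characters come from: decoding with `errors='replace'` substitutes the Unicode replacement character (U+FFFD) for every byte the chosen encoding cannot map. The euro sign is used here only as a convenient non-ASCII example.

```python
# The euro sign takes three bytes in UTF-8: b'\xe2\x82\xac'
raw = '€'.encode('utf-8')

# ASCII has no mapping for any of those bytes, so each one is replaced with U+FFFD.
shown = raw.decode('ascii', errors='replace')
print(shown)
print(len(shown))  # 3
```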

Character encoding mismatches are less common today than they used to be, but they're definitely still a problem. There are lots
of different character encodings, but the main one you need to know is UTF-8.

> UTF-8 is **the** standard text encoding. All Python code is in UTF-8 and, ideally, all your data should be as well.
> It's when things aren't in UTF-8 that you run into trouble.

It was pretty hard to deal with encodings in Python 2, but thankfully in Python 3 it's a lot simpler. (Kaggle Notebooks only use
Python 3). There are two main data types you'll encounter when working with text in Python 3. One is the string, which is what
text is by default.

In [2]:
# start with a string
before = 'THis is the euro symbol: €'

type(before)

str

The other is the bytes data type, which is a sequence of integers. You can convert a string into bytes by specifying which
encoding it's in:




In [3]:
#encode it to a different encoding, replacing characters that raise errors
after = before.encode('utf-8', errors='replace')

type(after)

bytes

In [4]:
after

b'THis is the euro symbol: \xe2\x82\xac'
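To make the "sequence of integers" point concrete, here is a short sketch that looks at the individual byte values behind the euro sign. The variable names are illustrative only.

```python
euro = '€'
euro_bytes = euro.encode('utf-8')

# One character in the string, three bytes once encoded as UTF-8.
print(len(euro), len(euro_bytes))  # 1 3

# Iterating over (or indexing into) a bytes object yields plain integers from 0 to 255.
print(list(euro_bytes))  # [226, 130, 172]
print(euro_bytes[0])     # 226
```

The `\xe2\x82\xac` in the printed bytes literal is just the hexadecimal spelling of those same three integers.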

In [5]:
# decode the bytes back into a string (decode() assumes UTF-8 by default)
after.decode()


'THis is the euro symbol: €'

In [6]:
print(after.decode('ascii'))


UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 25: ordinal not in range(128)

In [7]:
#encode the string to a different encoding, replacing characters that raise errors
after = before.encode('ascii', errors='replace')
print(after.decode('ascii'))

THis is the euro symbol: ?


This is bad and we want to avoid doing it! The euro symbol has been replaced with a question mark, and there is no way to recover
it from the resulting bytes. It's far better to convert all our text to UTF-8 as soon as we can and keep it
in that encoding. The best time to convert non-UTF-8 input into UTF-8 is when you read in files, which we'll talk about next.
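The difference between a lossy and a lossless conversion can be checked directly. This is a minimal sketch that round-trips an example sentence through ASCII (with replacement) and through UTF-8.

```python
text = 'This is the euro symbol: €'

# ASCII cannot represent the euro sign, so errors='replace' swaps it for b'?'.
lossy = text.encode('ascii', errors='replace')
print(lossy.decode('ascii') == text)  # False: the original character is gone

# UTF-8 can represent every character, so the round trip is exact.
lossless = text.encode('utf-8')
print(lossless.decode('utf-8') == text)  # True
```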




In [9]:
DATA_URL = '/Users/dravik/Downloads/ks-projects-201612.csv'
kickstarter_2016 = pd.read_csv(DATA_URL)

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x99 in position 11: invalid start byte

Notice that we get the same `UnicodeDecodeError` we got when we tried to decode UTF-8 bytes as if they were ASCII.
This tells us that the file isn't actually UTF-8, but not which encoding it does use.
One way to figure that out is to try a bunch of different character encodings and see if any of them work. A better way,
though, is to use the chardet module to try to automatically guess what the right encoding is. It's not 100% guaranteed to be
right, but it's usually faster than just trying to guess.
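For comparison, the manual "try a bunch of encodings" approach might look like the sketch below. The helper `first_working_encoding`, the candidate list, and the temporary demo file are all hypothetical and only illustrate the idea.

```python
import os
import tempfile

def first_working_encoding(path, candidates):
    """Return the first encoding in `candidates` that decodes the whole file, or None."""
    with open(path, 'rb') as f:
        raw = f.read()
    for enc in candidates:
        try:
            raw.decode(enc)
        except UnicodeDecodeError:
            continue  # this encoding cannot read the file, try the next one
        return enc
    return None

# Hypothetical demo file: a tiny CSV saved as Windows-1252.
# In that encoding the trademark sign is the single byte 0x99, which is invalid as a UTF-8 start byte.
with tempfile.NamedTemporaryFile(delete=False, suffix='.csv') as tmp:
    tmp.write('name\nWidget™\n'.encode('cp1252'))
    demo_path = tmp.name

guess = first_working_encoding(demo_path, ['utf-8', 'cp1252'])
print(guess)  # cp1252
os.remove(demo_path)
```

A decode that succeeds is not proof that the encoding is correct. Single-byte encodings such as Latin-1 accept any byte sequence, so the text still needs a visual check afterwards.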




In [11]:
# look at the first ten thousand bytes to guess the character encoding
with open(DATA_URL, 'rb') as rawdata:
    result = chardet.detect(rawdata.read(10000))

print(result)

{'encoding': 'Windows-1252', 'confidence': 0.73, 'language': ''}


In [14]:
kickstarter_2016 = pd.read_csv(DATA_URL, encoding='Windows-1252')
kickstarter_2016.head()



Index(['ID ', 'name ', 'category ', 'main_category ', 'currency ', 'deadline ',
       'goal ', 'launched ', 'pledged ', 'state ', 'backers ', 'country ',
       'usd pledged ', 'Unnamed: 13', 'Unnamed: 14', 'Unnamed: 15',
       'Unnamed: 16'],
      dtype='object')

In [15]:
# save the file again (pandas writes CSV files as UTF-8 by default)
kickstarter_2016.to_csv('/Users/dravik/Downloads/ks-projects-201612_UTF-8.csv')
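The same read-in-one-encoding, write-as-UTF-8 step can also be sketched with plain `open()`, which makes it easy to confirm that the new file really is UTF-8. The file names and contents below are made up for the demo and live in a temporary directory.

```python
import os
import tempfile

workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, 'legacy.csv')       # hypothetical Windows-1252 input
dst_path = os.path.join(workdir, 'legacy_utf8.csv')  # hypothetical UTF-8 output

# Create a small Windows-1252 file to convert.
with open(src_path, 'wb') as f:
    f.write('name,price\nWidget™,€5\n'.encode('cp1252'))

# Decode with the source encoding while reading, encode as UTF-8 while writing.
# newline='' keeps the line endings exactly as they are in the source file.
with open(src_path, 'r', encoding='cp1252', newline='') as src, \
        open(dst_path, 'w', encoding='utf-8', newline='') as dst:
    for line in src:
        dst.write(line)

# Reading the result back as UTF-8 now works without a UnicodeDecodeError.
with open(dst_path, 'r', encoding='utf-8', newline='') as f:
    converted = f.read()
print(converted == 'name,price\nWidget™,€5\n')  # True
```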


### Exercises



In [17]:
sample_entry = b'\xa7A\xa6n'
print(sample_entry)
print('data type:', type(sample_entry))

b'\xa7A\xa6n'
data type: <class 'bytes'>


In [22]:
new_entry = sample_entry.decode('big5-tw').encode('UTF-8')
print(new_entry)
print(type(new_entry))

b'\xe4\xbd\xa0\xe5\xa5\xbd'
<class 'bytes'>
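It can help to inspect the intermediate string rather than only the re-encoded bytes. The sketch below splits the conversion into its two steps, using the codec name `big5`, for which Python treats `big5-tw` as an alias.

```python
sample_entry = b'\xa7A\xa6n'

# Step 1: decode the Big5 bytes into a Python string.
decoded = sample_entry.decode('big5')
print(decoded)  # 你好

# Step 2: encode that string as UTF-8 bytes.
utf8_entry = decoded.encode('utf-8')
print(utf8_entry)  # b'\xe4\xbd\xa0\xe5\xa5\xbd'
```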


In [29]:
PK_DATA = '/Users/dravik/Downloads/police_killings.csv'
police_killings = pd.read_csv(PK_DATA)

AttributeError: 'str' object has no attribute 'read'

In [30]:
# using chardet to detect encoding of the file
with open(PK_DATA, 'rb') as raw_police_killings:
    result = chardet.detect(raw_police_killings.read())

print(result)

{'encoding': 'ISO-8859-1', 'confidence': 0.7281409812209826, 'language': ''}


In [34]:
pd.reset_option('all')
police_killings = pd.read_csv(PK_DATA, encoding='ISO-8859-1')
police_killings.head()


: boolean
    use_inf_as_null had been deprecated and will be removed in a future
    version. Use `use_inf_as_na` instead.






Unnamed: 0,name,age,gender,raceethnicity,month,day,year,streetaddress,city,state,...,share_hispanic,p_income,h_income,county_income,comp_income,county_bucket,nat_bucket,pov,urate,college
0,A'donte Washington,16,Male,Black,February,23,2015,Clearview Ln,Millbrook,AL,...,5.6,28375,51367.0,54766,0.937936,3.0,3.0,14.1,0.097686,0.16851
1,Aaron Rutledge,27,Male,White,April,2,2015,300 block Iris Park Dr,Pineville,LA,...,0.5,14678,27972.0,40930,0.683411,2.0,1.0,28.8,0.065724,0.111402
2,Aaron Siler,26,Male,White,March,14,2015,22nd Ave and 56th St,Kenosha,WI,...,16.8,25286,45365.0,54930,0.825869,2.0,3.0,14.6,0.166293,0.147312
3,Aaron Valdez,25,Male,Hispanic/Latino,March,11,2015,3000 Seminole Ave,South Gate,CA,...,98.8,17194,48295.0,55909,0.863814,3.0,3.0,11.7,0.124827,0.050133
4,Adam Jovicic,29,Male,White,March,19,2015,364 Hiwood Ave,Munroe Falls,OH,...,1.7,33954,68785.0,49669,1.384868,5.0,4.0,1.9,0.06355,0.403954
