# Character Encodings

This notebook collects solutions for working with character encodings.

Encoding issues commonly arise when working with data received over HTTP or stored in HTML and XML documents.

UTF-8 is the standard character encoding. In Python 3 it is the default for source files and for str.encode() and bytes.decode().
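The default can be checked directly. The following is a minimal sketch using only the standard library. It also prints the locale's preferred encoding, which open() in text mode typically falls back to when no encoding is given, and which may differ from UTF-8 on some platforms.

```python
import locale
import sys

# Default used by str.encode() and bytes.decode() when no encoding is passed
default_codec = sys.getdefaultencoding()
print(default_codec)

# Locale-dependent encoding; the value varies by platform and settings
print(locale.getpreferredencoding(False))

# Encoding without an argument gives the same bytes as naming utf-8 explicitly
pound_bytes = "£".encode()
print(pound_bytes == "£".encode("utf-8"))
```

Passing encoding="utf-8" explicitly when opening files avoids depending on the locale value.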

In [1]:
# Module Importations
import chardet      # Useful character encoding module
import numpy as np
import pandas as pd

In [2]:
# Notebook Constants
np.random.seed(0)

In [3]:
# String character data type
before = "This is the pound symbol: £"

type(before)

str

In [4]:
# Bytes character data type

# Encode it to bytes data type
after = before.encode("utf-8", errors = "replace")      # Replace characters that raise errors

type(after)

bytes

In [5]:
# Peek at 'after'
after

b'This is the pound symbol: \xc2\xa3'

In [6]:
# Decode the utf-8 bytes back to a string
print(after.decode("utf-8"))

This is the pound symbol: £


In [7]:
# Try to decode our bytes with the ascii encoding
print(after.decode("ascii"))

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 26: ordinal not in range(128)
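The error message above can be unpacked programmatically. This sketch catches the exception and inspects its standard attributes to show which byte failed and where. The variable names bad_byte and failure are only illustrative.

```python
before = "This is the pound symbol: £"
after = before.encode("utf-8")

try:
    after.decode("ascii")
except UnicodeDecodeError as exc:
    # exc.start is the index of the first byte that could not be decoded
    bad_byte = after[exc.start]
    failure = (exc.encoding, exc.start, bad_byte, exc.reason)
    print(failure)

# ASCII only covers byte values 0-127; the first byte of the encoded £ is 0xc2 (194)
print(bad_byte > 127)
```

Position 26 is the first of the two bytes that UTF-8 uses for the pound symbol, which is why the encoded bytes are one element longer than the original string.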

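Be careful when handling encoding errors: errors = "replace" silently substitutes a placeholder for every character the target encoding cannot represent, and the original character cannot be recovered afterwards.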

In [8]:
# Start with a string
before = "This is the pound symbol: £"

# Encode it to a different encoding, replacing characters that raise errors
after = before.encode("ascii", errors = "replace")

# Decode the ascii bytes back to a string; the pound symbol has already been lost
print(after.decode("ascii"))

This is the pound symbol: ?
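The result above depends on the error handler passed to encode(). As a rough comparison, this sketch runs the same character through several of the standard handlers. The names handlers, results and strict_raised are only illustrative.

```python
symbol = "£"

# Each handler decides what happens to characters ASCII cannot represent
handlers = ["replace", "ignore", "backslashreplace", "xmlcharrefreplace"]
results = {name: symbol.encode("ascii", errors=name) for name in handlers}

for name, encoded in results.items():
    print(f"{name:>18}: {encoded!r}")

# The default handler, "strict", raises instead of altering the data
try:
    symbol.encode("ascii")
    strict_raised = False
except UnicodeEncodeError:
    strict_raised = True

print(strict_raised)
```

Only backslashreplace and xmlcharrefreplace keep enough information to reconstruct the original character; replace and ignore discard it.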


# Reading in Files with Encoding Problems

Use chardet to guess a file's encoding when it is not known in advance.

In [9]:
# Try to read a file not in utf-8
file_string = r'C:/Developer/scratch-pad-python/Datasets/ks-projects-201612.csv'

kickstarter_2016_df = pd.read_csv(file_string)

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x99 in position 11: invalid start byte

In [10]:
# Use chardet to parse the same file
with open(file_string, 'rb') as rawdata:
    result = chardet.detect(rawdata.read(10000))    # Use the first 10000 bytes to detect encoding

# Print potential encoding
print(result)

{'encoding': 'Windows-1252', 'confidence': 0.73, 'language': ''}
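A confidence of 0.73 is a guess rather than a guarantee, so it can help to check candidate encodings against the raw bytes. The following is a minimal sketch using only the standard library, not part of chardet. The helper decode_with_fallback, its candidate list and the sample bytes are hypothetical; cp1252 is Python's codec name for Windows-1252.

```python
def decode_with_fallback(raw, candidates=("utf-8", "cp1252")):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


# 0x99 is not a valid utf-8 start byte, but it is the trademark sign in cp1252
sample = b"Trademark \x99"
text, used = decode_with_fallback(sample)
print(used, text)
```

A clean decode does not prove the encoding is right, since many single-byte encodings accept almost any input, so it is still worth inspecting the decoded text.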


In [11]:
# Read in the file with the encoding detected by chardet
kickstarter_2016_df = pd.read_csv(file_string, encoding = 'Windows-1252')

# Look at the first few lines
kickstarter_2016_df.head()


Unnamed: 0,ID,name,category,main_category,currency,deadline,goal,launched,pledged,state,backers,country,usd pledged,Unnamed: 13,Unnamed: 14,Unnamed: 15,Unnamed: 16
0,1000002330,The Songs of Adelaide & Abullah,Poetry,Publishing,GBP,2015-10-09 11:36:00,1000,2015-08-11 12:12:28,0,failed,0,GB,0,,,,
1,1000004038,Where is Hank?,Narrative Film,Film & Video,USD,2013-02-26 00:20:50,45000,2013-01-12 00:20:50,220,failed,3,US,220,,,,
2,1000007540,ToshiCapital Rekordz Needs Help to Complete Album,Music,Music,USD,2012-04-16 04:24:11,5000,2012-03-17 03:24:11,1,failed,1,US,1,,,,
3,1000011046,Community Film Project: The Art of Neighborhoo...,Film & Video,Film & Video,USD,2015-08-29 01:00:00,19500,2015-07-04 08:35:03,1283,canceled,14,US,1283,,,,
4,1000014025,Monarch Espresso Bar,Restaurants,Food,USD,2016-04-01 13:38:27,50000,2016-02-26 13:38:27,52375,successful,224,US,52375,,,,


# Saving Files with UTF-8 Encoding