# Character Encodings

A notebook of techniques for working with character encodings.

Encoding problems commonly surface when working with data from HTTP responses or in HTML and XML documents.

UTF-8 is the standard character encoding, and Python 3 uses it by default for source files and for `str.encode()` and `bytes.decode()`.

In [1]:
# Module imports
import chardet      # Useful character encoding module
import numpy as np
import pandas as pd

In [None]:
# Notebook Constants
np.random.seed(0)

In [2]:
# String character data type
before = "This is the pound symbol: £"

type(before)

str

In [3]:
# Bytes character data type

# Encode it to bytes data type
after = before.encode("utf-8", errors = "replace")      # Replace characters that raise errors

type(after)

bytes

In [4]:
# Peek at 'after'
after

b'This is the pound symbol: \xc2\xa3'
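The output above shows that "£" takes two bytes (`\xc2\xa3`) in UTF-8, so the encoded value is one element longer than the original text. A minimal sketch of that difference, reusing the notebook's example string (the variable names `text` and `encoded` are illustrative, not part of the original notebook):

```python
# Compare the length of a string (characters) with its UTF-8 encoding (bytes)
text = "This is the pound symbol: £"
encoded = text.encode("utf-8")

print(len(text))              # number of characters in the string
print(len(encoded))           # number of bytes after encoding
print("£".encode("utf-8"))    # the two bytes that represent the pound sign
```

Every ASCII character in the string still maps to a single byte; only the pound sign needs a multi-byte sequence.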

In [5]:
# Decode the bytes back to a string using UTF-8
print(after.decode("utf-8"))

This is the pound symbol: £


In [6]:
# Try to decode our bytes with the ascii encoding
print(after.decode("ascii"))

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 26: ordinal not in range(128)

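Be careful when handling encoding errors: with `errors="replace"`, any character the target encoding cannot represent is swapped for `?`, and the original character cannot be recovered.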

In [7]:
# Start with a string
before = "This is the pound symbol: £"

# Encode it to a different encoding, replacing characters that raise errors
after = before.encode("ascii", errors = "replace")

# Decode the ASCII bytes back to a string (the pound symbol is already lost)
print(after.decode("ascii"))

This is the pound symbol: ?
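The `replace` handler is only one of several that Python's codecs accept. As a rough comparison, the sketch below encodes the same string to ASCII with four of the standard handlers and then with the default, `strict`. The `results` dictionary is an illustrative name, not something from the original notebook.

```python
# Encode the same text to ASCII with several standard error handlers
text = "This is the pound symbol: £"

results = {}
for handler in ("ignore", "replace", "backslashreplace", "xmlcharrefreplace"):
    results[handler] = text.encode("ascii", errors=handler)
    print(f"{handler:>18}: {results[handler]}")

# The default handler, "strict", raises instead of altering the data
try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    print("strict raised:", exc.reason)
```

`ignore` and `replace` discard the pound sign, while `backslashreplace` and `xmlcharrefreplace` keep its code point in an escaped form, which may be preferable when the original text needs to stay recoverable.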


# Reading in Files with Encoding Problems

In [9]:
# Try to read a file that is not UTF-8 encoded
file_string = r'C:/Developer/scratch-pad-python/Datasets/ks-projects-201612.csv'

kickstarter_2016_df = pd.read_csv(file_string)

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x99 in position 11: invalid start byte
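The error names byte 0x99, which cannot start a UTF-8 sequence, so the file was probably saved in some other encoding. One stdlib-only way to narrow it down is to try decoding a sample of the raw bytes against a short list of candidates. The sketch below is an illustration under stated assumptions: the helper `first_working_encoding`, the candidate list, and the sample bytes are all hypothetical and are not taken from the dataset. Whatever name it returns could then be tried as `pd.read_csv(file_string, encoding=...)`.

```python
def first_working_encoding(raw, candidates=("utf-8", "cp1252", "latin-1")):
    """Return the first candidate encoding that decodes `raw` without error.

    Hypothetical helper: success only means the bytes are *valid* in that
    encoding, not that it is the encoding the file was written in.
    """
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


# Made-up sample containing byte 0x99, which is invalid as a UTF-8 start byte
sample = b"Brand\x99 new project"

encoding = first_working_encoding(sample)
print(encoding)
print(sample.decode(encoding))
```

Because `latin-1` maps every possible byte to a character, it never fails, so it belongs last in the list as a fallback. A successful decode should still be checked by eye, since the wrong encoding can produce plausible-looking but incorrect characters.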