## What are encodings?
**Character encodings** are specific sets of rules for mapping from raw binary byte strings (that look like this: 0110100001101001) to the characters that make up human-readable text (like "hi"). There are many different encodings, and if you try to read in text with a different encoding than the one it was originally written in, you end up with scrambled text called "mojibake" (said like mo-gee-bah-kay). Here's an example of mojibake:

æ–‡å—åŒ–ã??
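As a minimal sketch of how text like this can arise, assume a string was written out as UTF-8 and then read back as Latin-1. The sample string and variable names here are illustrative and not part of the original notebook:

```python
# Illustrative sample text (the Japanese word "mojibake").
original = "文字化け"

# Writing the text: characters -> bytes, using UTF-8.
raw = original.encode("utf-8")

# Reading it back with the wrong encoding: every byte is a valid Latin-1
# character, so no error is raised, but the characters are wrong.
garbled = raw.decode("latin-1")
print(repr(garbled))

# Undoing the wrong step recovers the text, as long as no bytes were lost.
recovered = garbled.encode("latin-1").decode("utf-8")
print(recovered)
```

Note that the wrong decode step did not fail. Mojibake is often silent, which is why it tends to be noticed only when someone reads the text.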

You might also end up with "unknown" characters. These are what gets printed when there's no mapping between a particular byte and a character in the encoding you're using to read your byte string in, and they look like this:

����������
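The following sketch shows one way these characters can appear in Python, assuming the bytes were saved as Latin-1 and are then decoded as UTF-8 with `errors="replace"`. The sample word is an illustrative choice:

```python
# Illustrative bytes: "café" saved as Latin-1, where "é" is the single byte 0xE9.
raw = "café".encode("latin-1")

# Strict decoding (the default) raises an error, because a lone 0xE9 byte
# is not a complete UTF-8 sequence.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# errors="replace" substitutes U+FFFD, the Unicode replacement character,
# for the bytes that cannot be decoded.
replaced = raw.decode("utf-8", errors="replace")
print(strict_failed)
print(replaced)
```

The replacement is lossy: once a byte has been turned into U+FFFD, the original character cannot be recovered from the decoded string.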

Character encoding mismatches are less common today than they used to be, but they're definitely still a problem. There are lots of different character encodings, but the main one you need to know is UTF-8.

```
UTF-8 is the standard text encoding. Python 3 source code is UTF-8 by default and, ideally, all your data should be as well. It's when things aren't in UTF-8 that you run into trouble.
```
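One reason UTF-8 works well as a default is that it can represent any Unicode character, using between one and four bytes per character. A small sketch with illustrative sample characters:

```python
# Illustrative characters: plain ASCII, an accented Latin letter, the euro sign.
samples = ["a", "é", "€"]

# Each character becomes a different number of bytes in UTF-8.
byte_lengths = [len(ch.encode("utf-8")) for ch in samples]

for ch, n in zip(samples, byte_lengths):
    print(ch, n, ch.encode("utf-8"))
```

Plain ASCII text is byte-for-byte identical in UTF-8, which is part of why mismatches often go unnoticed until a non-ASCII character shows up.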

In [2]:
!pip install chardet

Collecting chardet
  Using cached chardet-4.0.0-py2.py3-none-any.whl (178 kB)
Installing collected packages: chardet
Successfully installed chardet-4.0.0


In [3]:
# Modules
import pandas as pd
import numpy as np

# Helpful character encoding module
import chardet

# Set seed for reproducibility
np.random.seed(0)

In [4]:
before = "This is a letter with a macron: ē"
type(before)

str

In [5]:
# Encode the string to UTF-8 bytes, replacing characters that raise errors
after = before.encode('utf-8', errors='replace')

# Check the type
type(after)

bytes

In [6]:
after

b'This is a letter with a macron: \xc4\x93'

In [7]:
# Decode the bytes back to a string using UTF-8
print(after.decode('utf-8'))

This is a letter with a macron: ē
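The round trip above is lossless because UTF-8 can represent every character in the string, so `errors='replace'` never has anything to replace. As a hedged sketch of what that argument does when the target encoding cannot represent a character, here is the same idea with ASCII (the sample string is illustrative):

```python
# Illustrative string containing one character that ASCII cannot represent.
sample = "price: 5 €"

# With errors="replace", the unencodable character becomes a question mark.
ascii_bytes = sample.encode("ascii", errors="replace")
print(ascii_bytes)

# Decoding gives back valid text, but the original character is gone for good.
lossy = ascii_bytes.decode("ascii")
print(lossy)
```

This is why it is usually safer to keep text in UTF-8 than to convert it to a narrower encoding.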
