# String encoding in Python 3

In Python we have strings, also known as unicode strings:

In [1]:
'hello'  # or u'hello'; the leading "u" is optional

'hello'

In Python we also have byte strings:

In [2]:
bytes([1,2,3])

b'\x01\x02\x03'

Bytes are how computers store data! They are numbers from 0 to 255.

Numbers don't mean anything on their own, but there's a simple mapping called ASCII that assigns half of these numbers (0 to 127) to English letters, digits, punctuation, and control characters. ASCII is so ingrained in Python that when you display a bytestring, any bytes from 32 to 126 are shown as their corresponding characters!

In [3]:
bytes([65, 66, 67, 3, 4, 5])

b'ABC\x03\x04\x05'

In [4]:
b'Hello'  # we can write bytestrings like this too

b'Hello'
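To make the byte-to-character mapping concrete, here is a small sketch using the built-ins `ord` and `chr`. The variable names are just for illustration.

```python
data = b'ABC'

first = data[0]         # indexing a bytestring gives an int, not a 1-byte string
as_numbers = list(data) # the same bytes viewed as plain numbers
letter = chr(first)     # number -> character
code = ord('A')         # character -> number

# Every byte from 32 to 126 has a printable ASCII character.
printable = bytes(range(32, 127))

print(first, as_numbers, letter, code)
print(printable)
```

Bytes outside that range, like the `3, 4, 5` above, have no printable character, so Python falls back to the `\x..` escape form.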

Sockets can be used to send any kind of information - text, jpegs, pdfs, or lists of numbers; so they send and receive bytes! But if you're a human reading or writing text, you want to use strings so you can use characters not in the ASCII set.

In [5]:
type(input())  # input() returns a string

Hello, I’m ”Sidney San Martín.“ 私はユニコードを使いたい and 😄


str

If you have a string from input or anywhere else, you'll need to encode it into bytes in order to send it. Strings have an `encode` method, and the most useful encoding is `utf-8`.

![](http://eli.thegreenplace.net/images/2012/01/py3_string_bytes.png)

In [6]:
message = "Hello, I’m ”Sidney San Martín.“ 私はユニコードを使いたい and 😄"
encoded_message = message.encode('utf-8')
encoded_message

b'Hello, I\xe2\x80\x99m \xe2\x80\x9dSidney San Mart\xc3\xadn.\xe2\x80\x9c \xe7\xa7\x81\xe3\x81\xaf\xe3\x83\xa6\xe3\x83\x8b\xe3\x82\xb3\xe3\x83\xbc\xe3\x83\x89\xe3\x82\x92\xe4\xbd\xbf\xe3\x81\x84\xe3\x81\x9f\xe3\x81\x84 and \xf0\x9f\x98\x84'
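Notice that the encoded message is longer than the original string: in UTF-8 a character can take anywhere from one to four bytes. The sketch below (with made-up sample characters) counts the bytes per character, and also shows what happens if you pick an encoding like `ascii` that can't represent the text.

```python
samples = ['a', 'í', '私', '😄']

# How many bytes does each character need in UTF-8?
sizes = {ch: len(ch.encode('utf-8')) for ch in samples}
print(sizes)

text = 'Martín'
print(len(text), len(text.encode('utf-8')))  # characters vs. bytes

# ASCII only covers 0 to 127, so it cannot encode 'í'.
try:
    text.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError as err:
    ascii_failed = True
    print('could not encode:', err.reason)
```

This is one reason to count bytes (not characters) whenever a protocol asks for a message length.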

When you receive bytes from a socket or a file opened in binary mode, the bytes might represent encoded text:

In [7]:
data = b'\xe7\xa7\x81\xe3\x81\xaf\xe3\x83\xa6'
data.decode('utf-8')

'私はユ'
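Decoding only works if you have all the bytes and guess the right encoding. Here is a rough sketch of two ways it can go wrong, reusing the same `data`. The `errors='replace'` argument is a standard option of `bytes.decode`; the variable names are illustrative.

```python
data = b'\xe7\xa7\x81\xe3\x81\xaf\xe3\x83\xa6'

# 1. A chunk that was cut off mid-character (say, a short read from a socket).
truncated = data[:-1]
try:
    truncated.decode('utf-8')
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# errors='replace' substitutes U+FFFD for the bytes it can't decode.
lenient = truncated.decode('utf-8', errors='replace')
print(strict_failed, lenient)

# 2. The wrong encoding: latin-1 maps every byte to some character,
#    so it never raises, it just gives you nonsense.
wrong = data.decode('latin-1')
print(len(wrong), repr(wrong))
```

The second case is the sneakier one, since nothing raises and the garbage only shows up when a human reads it.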

But the bytes might also represent numbers, or an image file, or anything else! There are many ways to interpret a series of bytes.

In [9]:
import struct
data = b'\xe7\xa7\x81\xe3\x81\xaf\xe3\x83\xa6'
# '<' = little-endian, no padding; i = int32, h = int16, b = int8
struct.unpack('<ihhb', data)

(-478042137, -20607, -31773, -90)
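Even "these four bytes are an integer" leaves choices open: byte order and signedness. A minimal sketch using the first four bytes from above, with `struct` format prefixes (`<` little-endian, `>` big-endian) and the built-in `int.from_bytes`:

```python
import struct

chunk = b'\xe7\xa7\x81\xe3'

little_signed = struct.unpack('<i', chunk)[0]
big_signed = struct.unpack('>i', chunk)[0]
little_unsigned = int.from_bytes(chunk, 'little')  # unsigned by default

print(little_signed, big_signed, little_unsigned)

# Packing with the same format gives back the original bytes.
round_trip = struct.pack('<i', little_signed)
print(round_trip == chunk)
```

Same bytes, three different numbers, so the sender and receiver have to agree on the format ahead of time.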

Suggested reading:

* https://nedbatchelder.com/text/unipain.html
* http://eli.thegreenplace.net/2012/01/30/the-bytesstr-dichotomy-in-python-3