# Encoding and Decoding
- Encoding
    - String -> Raw Byte
- Decoding
    - Raw Byte -> String

In [1]:
s = "嗨"

In [2]:
se = s.encode("utf8")
se

b'\xe5\x97\xa8'

In [3]:
se.decode("utf-8")

'嗨'
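
As a rough illustration of why the codec name matters, the sketch below encodes the same string under a few encodings and shows what happens when the codec cannot represent the text or does not match the bytes. The variable names are illustrative only.

```python
s = "嗨"

# The same text becomes different byte sequences under different codecs.
utf8_bytes = s.encode("utf-8")       # 3 bytes
utf16_bytes = s.encode("utf-16-le")  # 2 bytes (no byte-order mark)
print(len(utf8_bytes), len(utf16_bytes))

# ASCII cannot represent the character, so encoding fails...
try:
    s.encode("ascii")
except UnicodeEncodeError as exc:
    print("encode failed:", exc.reason)

# ...unless an error handler such as "replace" is supplied.
ascii_bytes = s.encode("ascii", errors="replace")
print(ascii_bytes)  # b'?'

# Decoding with the wrong codec does not recover the original text.
wrong = utf8_bytes.decode("latin-1")
print(wrong == s)   # False
```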

- To see the system default encoding

In [4]:
import sys

sys.getdefaultencoding()

'utf-8'

---

# Python's String Type

## Python 2.X
- str
    - 8-bit text
    - binary data
- unicode
    - decoded Unicode text
    
## Python 3.X
- str
    - unicode
- bytes
    - binary data
- bytearray
    - mutable **bytes** (support in-place change)
    

## Usage in Different Versions
- Python 2.X
    - str -> simple text and binary data
    - unicode -> text whose character sets don't map to 8-bit bytes
- Python 3.X
    - str -> any kind of text
    - bytes / bytearray -> binary data
    
## Mapping Between the Two Versions
- Unicode text
    - Python3: str
    - Python2: unicode
- Byte Based Data
    - Python3: bytes, bytearray
    - Python2: str

---

# Coding Basics
## Python3 bytes Objects
### In Python 2.6, 2.7, **bytes(x)** is the same as **str(x)**

In [5]:
%%python2

x = 'a'

print(bytes(x) == str(x))

True


### A sequence of small integers
- Range 0 through 255
- Though it prints its content as characters whenever possible

In [6]:
b = b"abcd"
print(b[0])
print(b[0:])
print(list(b))

97
b'abcd'
[97, 98, 99, 100]
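
The integer view of **bytes** can be made more concrete with a short sketch. It is only an illustration; the variable names are not from the original notes.

```python
# Build bytes from an iterable of integers in range(256).
b = bytes([97, 98, 99, 100])
print(b)            # b'abcd'

# Indexing yields an int; slicing yields a new bytes object.
first = b[0]
head = b[:1]
print(first, head)  # 97 b'a'

# Values outside 0..255 are rejected.
try:
    bytes([256])
except ValueError as exc:
    print("rejected:", exc)

# Non-printable values are shown as hex escapes.
raw = bytes([0, 65, 255])
print(raw)          # b'\x00A\xff'
```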


### **bytes** is immutable

In [7]:
b[0] = b"b"

TypeError: 'bytes' object does not support item assignment

- Supports most **str** operations, but has no **format** method
    - **%** formatting (**`__mod__`** and **`__rmod__`**) was not implemented for **bytes** before Python 3.5; it was added in Python 3.5 (PEP 461)
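
A small sketch of str-like operations on **bytes** follows. The last step assumes Python 3.5 or later, where **%** formatting on bytes is available; on older 3.x releases that line would raise a TypeError.

```python
data = b"spam,eggs"

# Methods mirror str, but arguments and results are bytes.
parts = data.split(b",")
upper = data.upper()
swapped = data.replace(b"spam", b"ham")
print(parts, upper, swapped)

# Passing a str where bytes is expected raises TypeError.
try:
    data.split(",")
except TypeError as exc:
    print("TypeError:", exc)

# bytes has no format() method.
print(hasattr(data, "format"))  # False

# Assumption: Python 3.5+; older 3.x versions raise TypeError here.
formatted = b"%s=%d" % (b"x", 1)
print(formatted)                # b'x=1'
```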


### Byte String Literal: Encoded Text

- Python3 allows special characters to be coded with both ***hex*** and ***Unicode*** escapes in **str**, but only with ***hex*** escapes in **bytes**

In [8]:
s = "\xC4"
print(s)

s = "\u00C4"
print(s)

Ä
Ä


In [9]:
b = b"\xC4"
print(b)

b = b"\u00C4"
print(b)

b'\xc4'
b'\\u00C4'


- **bytes** literals require characters to be ASCII, or to be escaped if their values are greater than 127

In [10]:
b = b'ÄB'

SyntaxError: bytes can only contain ASCII literal characters. (<ipython-input-10-f15e9bcd0436>, line 1)

In [11]:
b = b"\xC4"
print(b.decode("latin-1"))

Ä
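
To show how one escaped byte relates to different codecs, here is a minimal sketch. It assumes nothing beyond the standard codecs named in the code.

```python
b = b"\xC4"

# Latin-1 maps each byte 0..255 directly to the same code point.
print(b.decode("latin-1"))          # Ä

# The same byte alone is not valid UTF-8.
try:
    b.decode("utf-8")
except UnicodeDecodeError as exc:
    print("not UTF-8:", exc.reason)

# Encoding the character shows how the codecs differ.
latin1 = "\u00C4".encode("latin-1")
utf8 = "\u00C4".encode("utf-8")
print(latin1, utf8)                 # b'\xc4' b'\xc3\x84'
```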


## Python 2 Unicode Literals in Python3

In [12]:
s = u"嗨"

- The **u** or **U** prefix was removed in Python3.0 but is available again in Python3.3 and later
- It's mainly for Python2 compatibility
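
A quick check, assuming Python 3.3 or later, suggests the **u** form produces an ordinary **str**:

```python
# Assumption: Python 3.3 or later, where the u prefix is accepted again.
s1 = u"嗨"
s2 = "嗨"

print(type(s1))   # <class 'str'>
print(s1 == s2)   # True
```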

## Mixing String Types
- Mixing **unicode** and **str** is allowed in Python2
    - This works only if the **str** contains only 7-bit (ASCII) bytes, because it is implicitly decoded as ASCII

In [13]:
%%python2

print u'ab' + 'cd'

abcd


In [14]:
%%python2

print u'ab' + '\xc4'

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 0: ordinal not in range(128)


- Mixing **str** and **bytes** is never allowed in Python3

In [15]:
"ab" + b"cd"

TypeError: Can't convert 'bytes' object to str implicitly
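
Since Python3 never converts implicitly, the conversion has to be written out. The sketch below shows the two directions; the variable names are illustrative.

```python
text = "ab"
data = b"cd"

# Decode the bytes to get a str result...
as_text = text + data.decode("ascii")
# ...or encode the str to get a bytes result.
as_bytes = text.encode("ascii") + data
print(as_text, as_bytes)   # abcd b'abcd'

# str(data) does not decode; it returns the repr of the bytes object.
print(str(data))            # b'cd'
# Passing an encoding makes str() decode instead.
print(str(data, "ascii"))   # cd
```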

## Python3 / 2.6+ bytearray Objects
### In Python3, when creating a bytearray, an encoding name or a byte string is required
(Since text and binary strings do not mix)

In [16]:
bytearray("a")

TypeError: string argument without an encoding

In [17]:
bytearray("a", "utf-8")

bytearray(b'a')

In [18]:
bytearray(b"a")

bytearray(b'a')

### Index assignment requires an integer

In [19]:
b = bytearray("abcde", "utf-8")

In [20]:
b[0] = b"b"

TypeError: an integer is required

In [21]:
b[0] = ord("b")
b

bytearray(b'bbcde')

In [22]:
c = bytearray("zxcvb", "utf-8")
b[0] = c[0]
b

bytearray(b'zbcde')
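
Beyond index assignment, a **bytearray** supports other in-place changes. The following is a brief sketch of a few of them, not an exhaustive list.

```python
buf = bytearray(b"abc")

# In-place growth: append takes an int, extend takes bytes-like data.
buf.append(ord("d"))
buf.extend(b"ef")
print(buf)            # bytearray(b'abcdef')

# Slice assignment accepts bytes, unlike single-index assignment.
buf[0:2] = b"XY"
print(buf)            # bytearray(b'XYcdef')

# Convert to immutable bytes, or decode to str, when done.
frozen = bytes(buf)
text = buf.decode("ascii")
print(frozen, text)   # b'XYcdef' XYcdef
```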

## Python3 String Types Summary
- str: textual data
- bytes: binary data
- bytearray: binary data that will be changed in place

---

# Text and Binary Files

## Text Files
- Read: decode to str
- Write: take a str and encode it
- Common Usage
    - Program output
    - HTML
    - email content
    - CSV
    - XML
- Perform ***line-end translations***
    - By default, all line-end forms map to '\n' regardless of the platform

## Binary Files
- Open a file in binary mode by adding **b** to the mode string (e.g. open('file', 'rb'))
- Read: does not decode; returns a **bytes** object
- Write: take **bytes** or **bytearray**
- Common Usage
    - image files
    - data transferred over networks
    - packed binary data
- Do **not** perform encoding, decoding and line-end translations
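
The difference between the two modes can be sketched by writing one file and reading it back both ways. The scratch path is hypothetical, and `newline="\n"` is passed on write only so the bytes on disk are the same on every platform.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.txt")  # hypothetical scratch file

# Text mode: str in, encoded on write with the codec we name.
with open(path, "w", encoding="utf-8", newline="\n") as f:
    f.write("嗨\n")

# Text mode: bytes on disk are decoded back to str.
with open(path, "r", encoding="utf-8") as f:
    text = f.read()

# Binary mode: the raw bytes come back undecoded.
with open(path, "rb") as f:
    raw = f.read()

print(repr(text))  # '嗨\n'
print(raw)         # b'\xe5\x97\xa8\n'
```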
    
## Source File Character Set Encoding 
**`# -*- encoding: latin-1 -*- `**
- When this comment is present on the first or second line of a source file, Python will recognize strings represented natively in the given encoding
    
## Differences Between Text and Binary Files
### Python2
- Almost no major distinction between text and binary files
    - Both accept and return content as **str**
    - The only major difference is ***line-end translations***

In [23]:
%%python2

open('tmp.txt', 'wb').write('abc')
open('tmp.txt', 'w').write(b'abc')

### Python3
- We get a **TypeError** if we try to write **bytes** to a text file or a **str** to a binary file

In [24]:
open("tmp.txt", "wb").write("abc")
open("tmp.txt", "w").write(b"abc")

TypeError: 'str' does not support the buffer interface
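
Because the first failing write stops the cell, only one error is visible above. A hedged sketch that catches each case separately, and then shows the matching pairings, might look like this (the scratch path is hypothetical):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "tmp.txt")  # hypothetical scratch file
errors = []

# Binary mode rejects str.
try:
    with open(path, "wb") as f:
        f.write("abc")
except TypeError as exc:
    errors.append(str(exc))

# Text mode rejects bytes.
try:
    with open(path, "w") as f:
        f.write(b"abc")
except TypeError as exc:
    errors.append(str(exc))

print(len(errors))  # 2

# Matching the data type to the mode works in both cases.
with open(path, "wb") as f:
    f.write(b"abc")
with open(path, "w") as f:
    f.write("abc")
```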

## Unicode Files in Python2
- Replace **str** with **unicode** and **open** with **codecs.open**
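
A minimal sketch of that pattern is shown below. It is written so that it should run on Python 2.6+ as well as Python 3 (where the built-in **open** is the usual choice); the scratch path is hypothetical.

```python
import codecs
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "unicode.txt")  # hypothetical scratch file
text = u"\u55e8"  # u prefix keeps this a unicode object on Python 2

# codecs.open encodes on write and decodes on read.
with codecs.open(path, "w", encoding="utf-8") as f:
    f.write(text)

with codecs.open(path, "r", encoding="utf-8") as f:
    loaded = f.read()

print(loaded == text)  # True
```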

---

# Other String Tool Changes in Python3

## re: Pattern Matching
- **re** has been generalized to
    - **str**, **bytes**, **bytearray** in Python3
        - Still, **str** and **bytes** cannot be mixed
    - **unicode**, **str** in Python2
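
In Python3 this could look like the sketch below: the pattern type has to match the subject type. The sample data is made up for illustration.

```python
import re

# A str pattern searches str text.
nums_text = re.findall(r"\d+", "a1b22")
# A bytes pattern searches bytes data.
nums_bytes = re.findall(rb"\d+", b"a1b22")
print(nums_text, nums_bytes)  # ['1', '22'] [b'1', b'22']

# A bytes pattern also works on a bytearray subject.
found = re.search(rb"\d+", bytearray(b"a1b22")) is not None
print(found)                  # True

# Pattern and subject must be the same kind of string.
try:
    re.findall(r"\d+", b"a1b22")
except TypeError as exc:
    print("TypeError:", exc)
```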
    
## struct: Binary Data
- Create and extract packed binary data from strings
    - Most Python Programmers don't deal with binary bits
- Works the same in Python3, but packed data is represented as **bytes** and **bytearray** only
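
As a small illustration, the sketch below packs three values and unpacks them again. The format string `">i4sh"` (big-endian int, 4-byte string, short) is just an example choice.

```python
import struct

# ">i4sh": big-endian int (4 bytes), 4-byte string, short (2 bytes).
packed = struct.pack(">i4sh", 7, b"spam", 8)
print(packed)                    # b'\x00\x00\x00\x07spam\x00\x08'
print(struct.calcsize(">i4sh"))  # 10

# The string field must be bytes; a str raises struct.error.
try:
    struct.pack(">4s", "spam")
except struct.error as exc:
    print("struct.error:", exc)

# unpack accepts bytes or bytearray and returns a tuple.
values = struct.unpack(">i4sh", bytearray(packed))
print(values)                    # (7, b'spam', 8)
```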

## pickle: Object Serialization
- In Python3, pickle always creates a **bytes** object
    - Files used to store pickled objects must always be opened in binary mode
- If you care about version neutrality, or don't care about protocols or version-specific defaults, always use binary-mode files for pickled data, as follows

In [25]:
import pickle

with open("tmp.pkl", "wb") as f:  # binary mode: pickle writes bytes
    pickle.dump([1, 2, 3], f)
with open("tmp.pkl", "rb") as f:
    data = pickle.load(f)
data

[1, 2, 3]
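
The same point can be checked without a file: a quick sketch using the in-memory functions shows that the pickled form is **bytes**, even for protocol 0.

```python
import pickle

# dumps returns the pickled form directly instead of writing to a file.
payload = pickle.dumps([1, 2, 3])
print(type(payload))          # <class 'bytes'>

restored = pickle.loads(payload)
print(restored)               # [1, 2, 3]

# Even protocol 0, the text-oriented protocol, yields bytes in Python 3.
p0 = pickle.dumps([1, 2, 3], protocol=0)
print(type(p0))               # <class 'bytes'>
```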

## XML Parsing Tools
- Python supports the SAX and DOM parsing models and ElementTree (a Python-specific API)

In [26]:
# DOM

from xml.dom.minidom import parse, Node

xmltree = parse("mybooks.xml")
for node1 in xmltree.getElementsByTagName("title"):
    for node2 in node1.childNodes:
        if node2.nodeType == Node.TEXT_NODE:
            print(node2.data)

First Book
Second Book
Third Book


In [27]:
# SAX

import xml.sax
import xml.sax.handler


class BookHandler(xml.sax.handler.ContentHandler):
    def __init__(self):
        super().__init__()  # initialize ContentHandler state
        self.in_title = False

    def startElement(self, name, attr):
        if name == "title":
            self.in_title = True

    def characters(self, data):
        if self.in_title:
            print(data)

    def endElement(self, name):
        if name == "title":
            self.in_title = False


parser = xml.sax.make_parser()
handler = BookHandler()
parser.setContentHandler(handler)
parser.parse("mybooks.xml")

First Book
Second Book
Third Book


In [28]:
# ElementTree

from xml.etree.ElementTree import parse

tree = parse("mybooks.xml")
for e in tree.findall("title"):
    print(e.text)

First Book
Second Book
Third Book
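
The contents of mybooks.xml are not shown in these notes. As a sketch under that caveat, the hypothetical document below would produce the same titles, and it also shows that ElementTree returns **str** text whether it is given **str** or encoded **bytes**.

```python
import xml.etree.ElementTree as ET

# Hypothetical content for mybooks.xml; the real file is not shown in these notes.
SAMPLE = """<books>
    <title>First Book</title>
    <title>Second Book</title>
    <title>Third Book</title>
</books>"""

# Parsing from str yields str text values.
root = ET.fromstring(SAMPLE)
titles = [e.text for e in root.findall("title")]
print(titles)

# Parsing from UTF-8 bytes gives the same decoded str values.
root_b = ET.fromstring(SAMPLE.encode("utf-8"))
titles_b = [e.text for e in root_b.findall("title")]
print(titles == titles_b)  # True
```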
