# Text versus Bytes

## Character Issues

#### Example 4-1. Encoding and Decoding

In [1]:
s = 'café'
len(s)

In [2]:
b = s.encode('utf8')
b

In [3]:
len(b)  # é takes 2 bytes in UTF-8

In [4]:
b.decode('utf8')

## Byte Essentials

Two built-in types are used for binary sequences. Both store a sequence of one-byte integers.
* bytes: immutable
* bytearray: mutable
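
The difference in mutability can be checked directly. The following is a minimal sketch; the variable names `data`, `buf`, and `error` are illustrative and not part of the original notes. Item assignment works on a bytearray but raises `TypeError` on bytes.

```python
data = bytes([99, 97, 102])      # immutable sequence of ints in range 0-255
buf = bytearray(data)            # mutable copy of the same bytes

buf[0] = 67                      # in-place item assignment is allowed
buf.extend(b'\xc3\xa9')          # a bytearray can also grow

try:
    data[0] = 67                 # bytes does not support item assignment
except TypeError as exc:
    error = type(exc).__name__

print(data, buf, error)
```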

#### Example 4-2. A five-byte sequence as bytes and as bytearray

In [5]:
cafe = bytes('café', encoding='utf_8')
cafe

In [6]:
cafe[0]

In [7]:
cafe[:1]

In [8]:
cafe_arr = bytearray(cafe)
cafe_arr

In [9]:
cafe_arr[-1:]

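**Why cafe[0] returns an integer while cafe[:1] returns bytes**  
cafe[i] returns a single item, whereas cafe[i:i+1] returns a sequence of the same type as the original.  
bytes is a sequence of integers in the range 0-255.  
str is the only sequence type for which s[0] == s[:1] holds.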
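
A short sketch of the same point, comparing bytes with str. The names `raw`, `text`, `item`, and `piece` are illustrative only.

```python
raw = b'caf\xc3\xa9'
text = 'caf\u00e9'

item = raw[0]      # indexing bytes yields an int
piece = raw[:1]    # slicing bytes yields bytes of length 1

print(type(item).__name__, type(piece).__name__)
print(item == piece)          # an int never equals a bytes object
print(text[0] == text[:1])    # for str, both sides are the str 'c'
```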

#### Example 4-3. Initializing bytes from the raw data of an array

In [10]:
import array
numbers = array.array('h', [-2, -1, 0, 1, 2]) # short int(16bit)
octets = bytes(numbers)
octets # 10 bytes

b'\xfe\xff\xff\xff\x00\x00\x01\x00\x02\x00'

### Structs and Memory Views

#### Example 4-4. Using *memoryview* and *struct* to inspect a GIF image header

In [11]:
import struct
fmt = '<3s3sHH'
with open('../data/sample.gif', 'rb') as fp:
  img = memoryview(fp.read())

header = img[:10]
bytes(header)

b'GIF89ah\x01h\x01'

In [12]:
struct.unpack(fmt, header)

(b'GIF', b'89a', 360, 360)

In [13]:
# Slicing a memoryview returns a new memoryview object without copying the bytes.
# Delete the references to release the memory associated with the memoryview objects.
del header
del img
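
That a memoryview slice shares memory instead of copying can be observed with a mutable buffer. This is a sketch under assumptions: the 10-byte `buf` below is made up to mirror the header shown above, and no real GIF file is read.

```python
buf = bytearray(b'GIF89ah\x01h\x01')   # made-up 10-byte header
view = memoryview(buf)

head_view = view[:3]          # new memoryview over the same memory, no copy
head_copy = bytes(buf[:3])    # slicing the bytearray itself makes a copy

buf[0] = ord('g')             # mutate the underlying buffer

seen = bytes(head_view)       # the memoryview slice reflects the change
print(seen, head_copy)

# Explicit alternative to dropping the references with del
head_view.release()
view.release()
```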

## Basic Encoders/Decoders

#### Example 4-5. The String 'El Niño' encoded with three codecs producing very different byte sequences

In [14]:
for codec in ['latin_1', 'utf_8', 'utf_16']:
  print(codec, 'El Niño'.encode(codec), sep='\t')

latin_1	b'El Ni\xf1o'
utf_8	b'El Ni\xc3\xb1o'
utf_16	b'\xff\xfeE\x00l\x00 \x00N\x00i\x00\xf1\x00o\x00'


## Understanding Encode/Decode Problems

### Coping with UnicodeEncodeError

#### Example 4-6.  Encoding to bytes: success and error handling

In [15]:
city = 'São Paulo'
city.encode('utf_8')

b'S\xc3\xa3o Paulo'

In [16]:
city.encode('utf_16')

b'\xff\xfeS\x00\xe3\x00o\x00 \x00P\x00a\x00u\x00l\x00o\x00'

In [17]:
city.encode('iso8859_1')

b'S\xe3o Paulo'

In [18]:
city.encode('cp437')

UnicodeEncodeError: 'charmap' codec can't encode character '\xe3' in position 1: character maps to <undefined>

In [19]:
city.encode('cp437', errors='ignore')

b'So Paulo'

In [20]:
city.encode('cp437', errors='replace')

b'S?o Paulo'

In [21]:
city.encode('cp437', errors='xmlcharrefreplace')

b'S&#227;o Paulo'

### Coping with UnicodeDecodeError

#### Example 4-7. Using the wrong codec may produce gremlins or a UnicodeDecodeError


In [22]:
octets = b'Montr\xe9al'
octets.decode('cp1252')

'Montréal'

In [23]:
octets.decode('iso8859_7')

'Montrιal'

In [24]:
octets.decode('koi8_r')

'MontrИal'

In [25]:
octets.decode('utf_8')

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 5: invalid continuation byte

In [26]:
octets.decode('utf_8', errors='replace')

'Montr�al'

### SyntaxError When Loading Modules with Unexpected Encoding

#### Example 4-8. ola.py: 'Hello, World!' in Portuguese

In [27]:
# coding: cp1252
print('Olá, Mundo!')

Olá, Mundo!


### How to Discover the Encoding of a Byte Sequence  

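The encoding of a byte sequence cannot be determined from the bytes alone; that information has to come from somewhere else. Communication protocols and file formats such as HTTP and XML include headers that state the encoding explicitly.  
The encoding can also be guessed from the range of values or the patterns found in the byte stream.  
The *Chardet* package can identify 30 encodings this way.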
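
As a rough illustration of guessing from byte patterns, here is a standard-library-only sketch. `guess_encoding` and its candidate list are hypothetical names invented for this example; this is not how Chardet works internally, and it ignores UTF-32 and all legacy 8-bit codecs.

```python
import codecs

# Hypothetical helper: look for a BOM first, then try candidate codecs in order.
BOMS = [
    (codecs.BOM_UTF8, 'utf_8_sig'),
    (codecs.BOM_UTF16_LE, 'utf_16'),
    (codecs.BOM_UTF16_BE, 'utf_16'),
]

def guess_encoding(octets, candidates=('ascii', 'utf_8')):
    for bom, codec in BOMS:
        if octets.startswith(bom):
            return codec
    for codec in candidates:
        try:
            octets.decode(codec)
        except UnicodeDecodeError:
            continue
        return codec
    return None  # unknown: the encoding must be obtained from elsewhere

print(guess_encoding(b'abc'))
print(guess_encoding('caf\u00e9'.encode('utf_8')))
print(guess_encoding(b'Montr\xe9al'))
```

A `None` result is the honest answer for `b'Montr\xe9al'`: several 8-bit codecs decode it without error, so the bytes alone do not settle the question.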

### BOM: A Useful Gremlin


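In Example 4-5, the b'\xff\xfe' at the start of the utf_16 encoding is the byte-order mark (BOM). It indicates the 'little-endian' byte order of the Intel CPU on which the encoding was performed.  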

In [28]:
u16 = 'El Niño'.encode('utf_16')
u16

b'\xff\xfeE\x00l\x00 \x00N\x00i\x00\xf1\x00o\x00'

**ZERO WIDTH NO-BREAK SPACE**

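UTF-16 prepends this character (U+FEFF) to the encoded text to avoid confusion between big-endian and little-endian CPUs.  
When the UTF-16LE or UTF-16BE variants are used, which state the byte order explicitly, no BOM is generated.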
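
The decoding side behaves accordingly. In this sketch (variable names are illustrative), the generic `utf_16` codec consumes the BOM, while the endian-specific codec leaves it in the result as U+FEFF.

```python
import codecs

text = 'El Ni\u00f1o'
with_bom = codecs.BOM_UTF16_LE + text.encode('utf_16le')  # explicit little-endian BOM

# The generic codec reads the BOM, picks the byte order, and drops the BOM.
decoded = with_bom.decode('utf_16')

# The endian-specific codec does not interpret the BOM, so U+FEFF stays in the text.
decoded_le = with_bom.decode('utf_16le')

print(repr(decoded))
print(repr(decoded_le))
```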


In [29]:
# little endian
u16le = 'El Niño'.encode('utf_16le')
u16le

b'E\x00l\x00 \x00N\x00i\x00\xf1\x00o\x00'

In [30]:
# big endian
u16be = 'El Niño'.encode('utf_16be')
u16be

b'\x00E\x00l\x00 \x00N\x00i\x00\xf1\x00o'

In [31]:
u16le

b'E\x00l\x00 \x00N\x00i\x00\xf1\x00o\x00'

## Handling Text Files

**Unicode sandwich**: the best practice for handling text

1. Decode bytes to str on input
2. Work only with str objects
3. Encode str to bytes on output
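
The three steps can be sketched as a single function. `shout` is a hypothetical name used only for this illustration, and upper-casing stands in for whatever text processing is needed.

```python
def shout(raw: bytes, encoding: str = 'utf_8') -> bytes:
    text = raw.decode(encoding)      # 1. bytes -> str at the input boundary
    text = text.upper()              # 2. all processing happens on str
    return text.encode(encoding)     # 3. str -> bytes at the output boundary

result = shout('caf\u00e9'.encode('utf_8'))
print(result)
```

For file I/O, `open()` in text mode performs steps 1 and 3 itself, which is why passing `encoding` explicitly matters.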

#### Example 4-9. A platform encoding issue

In [32]:
open('cafe.txt', 'w', encoding='utf_8').write('café')
open('cafe.txt', encoding='cp1252').read()

'cafÃ©'

#### Example 4-10. Closer inspection of Example 4-9 running on Windows reveals the bug and how to fix it

In [33]:
fp = open('cafe.txt', 'w', encoding='utf-8')
fp

<_io.TextIOWrapper name='cafe.txt' mode='w' encoding='utf-8'>

In [34]:
fp.write('café')
fp.close()

In [35]:
import os
os.stat('cafe.txt').st_size

5

In [36]:
fp2 = open('cafe.txt', encoding='cp1252')
fp2

<_io.TextIOWrapper name='cafe.txt' mode='r' encoding='cp1252'>

In [37]:
fp2.read()

'cafÃ©'

In [38]:
fp3 = open('cafe.txt', encoding='utf_8')
fp3

<_io.TextIOWrapper name='cafe.txt' mode='r' encoding='utf_8'>

In [39]:
fp3.read()

'café'

In [40]:
fp4 = open('cafe.txt', 'rb')
fp4.read()

b'caf\xc3\xa9'

### Encoding Defaults: Madhouse

#### Example 4-11. Exploring encoding defaults

In [41]:
import sys, locale

expressions = """
  locale.getpreferredencoding()
  type(my_file)
  my_file.encoding
  sys.stdout.isatty()
  sys.stdout.encoding
  sys.stdin.isatty()
  sys.stdin.encoding
  sys.stderr.isatty()
  sys.stderr.encoding
  sys.getdefaultencoding()
  sys.getfilesystemencoding()
"""

my_file = open('dummy', 'w')

for expression in expressions.split():
  value = eval(expression)
  print(expression.rjust(30), '->', repr(value))

 locale.getpreferredencoding() -> 'cp949'
                 type(my_file) -> <class '_io.TextIOWrapper'>
              my_file.encoding -> 'cp949'
           sys.stdout.isatty() -> False
           sys.stdout.encoding -> 'UTF-8'
            sys.stdin.isatty() -> False
            sys.stdin.encoding -> 'cp949'
           sys.stderr.isatty() -> False
           sys.stderr.encoding -> 'UTF-8'
      sys.getdefaultencoding() -> 'utf-8'
   sys.getfilesystemencoding() -> 'utf-8'


In [42]:
eval("locale.getpreferredencoding()")

'cp949'

## Normalizing Unicode for Saner Comparisons

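Because Unicode has combining characters, string comparison is not straightforward: two strings can render identically yet consist of different code points.  
Unicode normalization with unicodedata.normalize() is needed before comparing.  

Normalization forms  
* NFC: composes code points to produce the shortest equivalent string.  
* NFD: decomposes composed characters into base characters and separate combining characters.
* NFKC, NFKD: additionally affect compatibility characters, i.e. characters (e.g. µ, ℀) included to support round-trip conversion with other standards.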

In [43]:
s1 = 'café'
s2 = 'cafe\u0301'
print(s1, s2)
print(len(s1), len(s2))
print(s1==s2)

café café
4 5
False


In [44]:
# combining characters
from unicodedata import normalize
s1 = 'café'
s2 = 'cafe\u0301'
print(len(normalize('NFC', s1)), len(normalize('NFC', s2)))
print(len(normalize('NFD', s1)), len(normalize('NFD', s2)))
print(normalize('NFC', s1) == normalize('NFC', s2))
print(normalize('NFD', s1) == normalize('NFD', s2))

4 4
5 5
True
True


In [45]:
# single character
from unicodedata import normalize, name
ohm = '\u2126'
print(name(ohm))

ohm_c = normalize('NFC', ohm)
print(name(ohm_c))

print(ohm == ohm_c)
print(normalize('NFC', ohm) == normalize('NFC', ohm_c))

OHM SIGN
GREEK CAPITAL LETTER OMEGA
False
True


In [46]:
# NFKC
from unicodedata import normalize, name

half = '½'
print(normalize('NFKC', half))
four_squared = '4²'
print(normalize('NFKC', four_squared))

micro = 'µ'
micro_kc = normalize('NFKC', micro)
print(micro, micro_kc)
print(ord(micro), ord(micro_kc))
print(name(micro), name(micro_kc))

1⁄2
42
µ μ
181 956
MICRO SIGN GREEK SMALL LETTER MU


### Case Folding

In [47]:
micro = 'µ'
name(micro)

'MICRO SIGN'

In [48]:
micro_cf = micro.casefold()
name(micro_cf)

'GREEK SMALL LETTER MU'

In [49]:
micro, micro_cf

('µ', 'μ')

In [50]:
eszett =  'ß'
name(eszett)

'LATIN SMALL LETTER SHARP S'

In [51]:
eszett_cf = eszett.casefold()
eszett, eszett_cf

('ß', 'ss')

### Utility Functions for Normalized Text Matching
NFC is usually the best normalization form; for case-insensitive comparison, use str.casefold.

#### Example 4-13. normeq.py: normalized Unicode string comparison

In [52]:
from unicodedata import normalize
def nfc_equal(str1, str2):
    return normalize('NFC', str1) == normalize('NFC', str2)
def fold_equal(str1, str2):
    return (normalize('NFC', str1).casefold() == normalize('NFC', str2).casefold())

In [53]:
# Using Normal Form C, case sensitive:
s1 = 'café'
s2 = 'cafe\u0301'
print(s1 == s2)
print(nfc_equal(s1, s2))
print(nfc_equal('A', 'a'))

False
True
False


In [54]:
# Using Normal Form C with case folding:
s3 = 'Straße'
s4 = 'strasse'
s3 == s4
print(nfc_equal(s3, s4))
print(fold_equal(s3, s4))
print(fold_equal(s1, s2))
print(fold_equal('A', 'a'))

False
True
True
True


### Extreme “Normalization”: Taking Out Diacritics

#### Example 4-14. Function to remove all  combining marks (module sanitize.py)

In [55]:
import unicodedata
import string

def shave_marks(txt):
    """Remove all diacritic marks"""
    norm_txt = unicodedata.normalize('NFD', txt)
    shaved = ''.join(c for c in norm_txt if not unicodedata.combining(c))
    return unicodedata.normalize('NFC', shaved)

#### Example 4-15. Two examples using *shave_marks* from Example 4-14

In [56]:
order = '“Herr Voß: • 1⁄2 cup of ŒtkerTM caffè latte • bowl of açaí.”'
Greek =  'Ζέφυρος, Zéfiro'
print(shave_marks(order))
print(shave_marks(Greek))

“Herr Voß: • 1⁄2 cup of ŒtkerTM caffe latte • bowl of acai.”
Ζεφυρος, Zefiro


#### Example 4-16. Function to remove combining marks from Latin characters

In [57]:
def shave_marks_latin(txt):
    """Remove all diacritic marks from Latin base characters"""
    norm_txt = unicodedata.normalize('NFD', txt)
    latin_base = False
    keepers = []
    for c in norm_txt:
        if unicodedata.combining(c) and latin_base:
            continue
        keepers.append(c)
        if not unicodedata.combining(c):
            latin_base = c in string.ascii_letters
    shaved = ''.join(keepers)
    return unicodedata.normalize('NFC', shaved)

#### Example 4-17. Transform some Western typographical symbols into ASCII

In [58]:
# mapping table
single_map = str.maketrans("""‚ƒ„†ˆ‹‘’“”•–—˜›""",
                           """'f"*^<''""---~>""")
multi_map = str.maketrans({
    '€': '<euro>',
    '…': '...',
    'Œ': 'OE',
    '™': '(TM)',
    'œ': 'oe',
    '‰': '<per mille>',
    '‡': '**',
})

multi_map.update(single_map)

# does not affect ASCII or latin1 text, only the Microsoft additions to latin1 in cp1252
def dewinize(txt):
    """Replace Win1252 symbols with ASCII chars or sequences"""
    return txt.translate(multi_map)


def asciize(txt):
    no_marks = shave_marks_latin(dewinize(txt))
    no_marks = no_marks.replace('ß', 'ss')
    return unicodedata.normalize('NFKC', no_marks) 

#### Example 4-18. shows asciize in use

In [59]:
order = '“Herr Voß: • 1⁄2 cup of ŒtkerTM caffè latte • bowl of açaí.”'
print(dewinize(order))
print(asciize(order))

"Herr Voß: - 1⁄2 cup of OEtkerTM caffè latte - bowl of açaí."
"Herr Voss: - 1⁄2 cup of OEtkerTM caffe latte - bowl of acai."


## Sorting Unicode Text

In [60]:
fruits = ['caju', 'atemoia', 'cajá', 'açaí', 'acerola']
sorted(fruits)

['acerola', 'atemoia', 'açaí', 'caju', 'cajá']

Use locale.strxfrm() to transform strings into a form that can be used for locale-aware comparison

#### Example 4-19. Using the *locale.strxfrm()* function as sort key

In [61]:
import locale
print(locale.setlocale(locale.LC_COLLATE, 'pt_BR.UTF-8'))
fruits = ['caju', 'atemoia', 'cajá', 'açaí', 'acerola']
sorted_fruits = sorted(fruits, key=locale.strxfrm)
print(sorted_fruits)

pt_BR.UTF-8
['açaí', 'acerola', 'atemoia', 'cajá', 'caju']


The locale approach has complicated installation and deployment issues -> solved with PyUCA

### Sorting with the Unicode Collation Algorithm

#### Example 4-20. Using the *pyuca.Collator.sort_key* method

In [62]:
import pyuca
coll = pyuca.Collator()
fruits = ['caju', 'atemoia', 'cajá', 'açaí', 'acerola']
sorted_fruits = sorted(fruits, key=coll.sort_key)
sorted_fruits

['açaí', 'acerola', 'atemoia', 'cajá', 'caju']

## The Unicode Database

#### Example 4-21. Demo of Unicode database numerical character metadata

In [63]:
import unicodedata
import re

re_digit = re.compile(r'\d')

sample = '1\xbc\xb2\u0969\u136b\u216b\u2466\u2480\u3285'

for char in sample:
    print('U+%04x' % ord(char), char.center(6), 
          're_dig' if re_digit.match(char) else '-',
         'isdig' if char.isdigit() else '-',
         'isnum' if char.isnumeric() else '-',
         format(unicodedata.numeric(char), '5.2f'),
         unicodedata.name(char),
         sep='\t')

U+0031	  1   	re_dig	isdig	isnum	 1.00	DIGIT ONE
U+00bc	  ¼   	-	-	isnum	 0.25	VULGAR FRACTION ONE QUARTER
U+00b2	  ²   	-	isdig	isnum	 2.00	SUPERSCRIPT TWO
U+0969	  ३   	re_dig	isdig	isnum	 3.00	DEVANAGARI DIGIT THREE
U+136b	  ፫   	-	isdig	isnum	 3.00	ETHIOPIC DIGIT THREE
U+216b	  Ⅻ   	-	-	isnum	12.00	ROMAN NUMERAL TWELVE
U+2466	  ⑦   	-	isdig	isnum	 7.00	CIRCLED DIGIT SEVEN
U+2480	  ⒀   	-	-	isnum	13.00	PARENTHESIZED NUMBER THIRTEEN
U+3285	  ㊅   	-	-	isnum	 6.00	CIRCLED IDEOGRAPH SIX


## Dual-mode str and bytes APIs

### str Versus bytes in Regular Expressions

#### Example 4-22. ramanujan.py: compare behavior of simple str and bytes regular expressions

In [64]:
import re
re_numbers_str = re.compile(r'\d+')
re_words_str = re.compile(r'\w+')
re_numbers_bytes = re.compile(rb'\d+')
re_words_bytes = re.compile(rb'\w+')

text_str = ("Ramanujan saw \u0be7\u0bed\u0be8\u0bef"
            " as 1729 = 13 + 123 = 93 + 103.")

text_bytes = text_str.encode('utf_8')

print('Text', repr(text_str), sep='\n ')
print('Numbers')
print(' str :', re_numbers_str.findall(text_str))
print(' bytes:', re_numbers_bytes.findall(text_bytes))
print('Words')
print(' str :', re_words_str.findall(text_str))
print(' bytes:', re_words_bytes.findall(text_bytes))

Text
 'Ramanujan saw ௧௭௨௯ as 1729 = 13 + 123 = 93 + 103.'
Numbers
 str : ['௧௭௨௯', '1729', '13', '123', '93', '103']
 bytes: [b'1729', b'13', b'123', b'93', b'103']
Words
 str : ['Ramanujan', 'saw', '௧௭௨௯', 'as', '1729', '13', '123', '93', '103']
 bytes: [b'Ramanujan', b'saw', b'as', b'1729', b'13', b'123', b'93', b'103']


bytes: characters outside the ASCII range are not treated as digits or word characters
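
A str pattern can be restricted to the same ASCII-only matching with the `re.ASCII` flag. The sketch below uses a shortened sample string and illustrative variable names to compare the three cases.

```python
import re

text = 'saw \u0be7\u0bed\u0be8\u0bef as 1729'   # Tamil digits, then ASCII digits

unicode_digits = re.findall(r'\d+', text)              # str pattern: Unicode-aware
ascii_digits = re.findall(r'\d+', text, re.ASCII)      # str pattern limited to ASCII
bytes_digits = re.findall(rb'\d+', text.encode('utf_8'))  # bytes pattern: ASCII only

print(unicode_digits)
print(ascii_digits)
print(bytes_digits)
```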

###  str Versus bytes on os Functions

#### Example 4-23. listdir with str and bytes arguments and results

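The operating system treats file names as opaque byte strings rather than converting them to str -> the os module handles this by accepting both str and bytes arguments  
Why the OS does not treat file names as text: [Understanding Unix file name encoding](https://unix.stackexchange.com/questions/39175/understanding-unix-file-name-encoding)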

In [65]:
import os
print(os.listdir('.'))
print(os.listdir(b'.'))

['.ipynb_checkpoints', 'cafe.txt', 'Chap2.ipynb', 'Chap3.ipynb', 'Chap4.ipynb', 'digits-of-π.txt', 'dummy', 'floats.bin']
[b'.ipynb_checkpoints', b'cafe.txt', b'Chap2.ipynb', b'Chap3.ipynb', b'Chap4.ipynb', b'digits-of-\xcf\x80.txt', b'dummy', b'floats.bin']


**Surrogateescape**
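* A codec error handler used to deal with gremlins  
* Replaces each undecodable byte with a code point in the range U+DC00 to U+DCFF -> the low surrogate area  
    * A code space with no characters assigned, reserved for internal use in applications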

In [66]:
pi_name_bytes = os.listdir(b'.')[5]
pi_name_str = pi_name_bytes.decode('ascii', 'surrogateescape')
pi_name_str

'digits-of-\udccf\udc80.txt'

In [67]:
pi_name_str.encode('ascii', 'surrogateescape')

b'digits-of-\xcf\x80.txt'
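
The round trip does not depend on the directory listing. This self-contained sketch hard-codes the bytes shown above (variable names are illustrative) and also shows that the escaped string cannot be encoded without the same handler.

```python
raw_name = b'digits-of-\xcf\x80.txt'   # UTF-8 bytes of the file name shown above

# Undecodable bytes become lone surrogates instead of raising an error.
as_str = raw_name.decode('ascii', 'surrogateescape')

# The same handler turns the surrogates back into the original bytes.
restored = as_str.encode('ascii', 'surrogateescape')

# With the default strict handler, lone surrogates cannot be encoded.
try:
    as_str.encode('utf_8')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

print(ascii(as_str), restored, strict_ok)
```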

## Chapter Summary

* Specifying encoding explicitly is recommended when opening text files
* Normalization is essential when comparing Unicode text
    * unicodedata.normalize
    * str.casefold
    * removing diacritics
    * setting the locale when sorting

[Character encoding and Unicode in Python](https://pyvideo.org/pycon-us-2014/character-encoding-and-unicode-in-python.html)