## Text Processing with Unicode:
<br>

- Unicode supports over a million characters. Each character is assigned a number, called
a code point. In Python, code points up to U+FFFF are written in the form \uXXXX, where XXXX
is the number in four-digit hexadecimal form; larger code points use the eight-digit form
\UXXXXXXXX.

- Within a program, we can manipulate Unicode strings just like normal strings. However,
when Unicode characters are stored in files or displayed on a terminal, they must
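be encoded as a stream of bytes. Some encodings (such as ASCII and Latin-2) use a
single byte per code point, so they can represent only a small subset of Unicode. Others
(such as UTF-8) use multiple bytes and can represent the full range of Unicode characters.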

- From a Unicode perspective, characters are abstract entities that can be realized as one
or more glyphs. Only glyphs can appear on a screen or be printed on paper. A font is
a mapping from characters to glyphs.
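
The points above about code points and encodings can be tried out with a minimal sketch like the one below. It uses only Python built-ins; the variable names are illustrative and not part of the original notebook.

```python
ch = "\u0144"  # the character ń, written as a \uXXXX escape

# ord() gives the code point of a character; chr() goes the other way.
code_point = ord(ch)
assert chr(code_point) == ch
print(hex(code_point))

# The same character takes a different number of bytes in each encoding.
latin2_bytes = ch.encode("latin2")  # single-byte encoding
utf8_bytes = ch.encode("utf-8")     # variable-length encoding
print(len(latin2_bytes), len(utf8_bytes))

# ASCII has no slot for this character, so encoding fails.
try:
    ch.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)
```

Inside the program the string is the same whichever encoding is chosen; only the byte representation written to a file or terminal differs.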

In [2]:
import re 
import nltk
import codecs
from bs4 import BeautifulSoup

In [3]:
PATH = nltk.data.find('corpora/unicode_samples/polish-lat2.txt')

In [4]:
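# Decode the Latin-2 file into Unicode strings; the with-block closes it afterwards.
with codecs.open(PATH, encoding="latin2") as file:
    for line in file:
        line = line.strip()
        # Show non-ASCII characters as escape sequences instead of glyphs.
        print(line.encode('unicode_escape'))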

b'Pruska Biblioteka Pa\\u0144stwowa. Jej dawne zbiory znane pod nazw\\u0105'
b'"Berlinka" to skarb kultury i sztuki niemieckiej. Przewiezione przez'
b'Niemc\\xf3w pod koniec II wojny \\u015bwiatowej na Dolny \\u015al\\u0105sk, zosta\\u0142y'
b'odnalezione po 1945 r. na terytorium Polski. Trafi\\u0142y do Biblioteki'
b'Jagiello\\u0144skiej w Krakowie, obejmuj\\u0105 ponad 500 tys. zabytkowych'
b'archiwali\\xf3w, m.in. manuskrypty Goethego, Mozarta, Beethovena, Bacha.'


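The first line of this output contains the Unicode escape sequence \u0144, introduced by
the \u prefix; it stands for the character ń. In the third line, \xf3 stands for ó: code
points in the range 128-255 are written in the two-digit \xXX form instead. The backslashes
appear doubled because each line is printed as a bytes literal.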
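
The escapes seen in the output can be reproduced on a short string without reading the corpus file. This is a small sketch using only the standard `unicode_escape` codec; `sample` is an illustrative value built from two words of the text above.

```python
# Two words from the sample text: ń is U+0144, ó is U+00F3.
sample = "Pa\u0144stwowa Niemc\u00f3w"

# Encoding with unicode_escape yields ASCII-only bytes.
escaped = sample.encode("unicode_escape")
print(escaped)                  # bytes repr: each backslash is shown doubled
print(escaped.decode("ascii"))  # the escape sequences with single backslashes

# Decoding with the same codec restores the original string.
restored = escaped.decode("unicode_escape")
print(restored == sample)
```

Printing the decoded ASCII form shows that the doubled backslashes are only an artifact of how Python displays a bytes object.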