# **Natural Language Processing with Python**
by [CSpanias](https://cspanias.github.io/aboutme/) - 01/2022

Content based on the [NLTK book](https://www.nltk.org/book/). <br>

You can find Chapter 3 [here](https://www.nltk.org/book/ch03.html).

# CONTENT

1. Language Processing and Python
2. Accessing Text Corpora and Lexical Resources
3. Processing Raw Text
    1. Accessing Text from the Web and from Disk
    2. Strings: Text Processing at the Lowest Level
    3. [Text Processing with Unicode](#Unicode)
        1. [What is Unicode?](#Uni)
        1. [Extracting encoded text from files](#FileExtraction)
        1. [Using your local encoding in Python](#LocalEncoding)

**Install**, **import** and **download NLTK**. <br>

*Uncomment lines 2 and 8 of the cell below if you haven't installed NLTK or downloaded its data yet.*

In [1]:
# install nltk
#!pip install nltk

# load nltk
import nltk

# download nltk
#nltk.download()

<a name="Unicode"></a>
## 3.3 Text Processing with Unicode

<a name="Uni"></a>
###  3.3.1 What is Unicode?

Unicode has room for over a million characters. Each character is assigned a number, called a **code point**. In Python, code points are written in the form `\uXXXX`, where `XXXX` is the number in **4-digit hexadecimal form**; code points above `U+FFFF` need the 8-digit form `\UXXXXXXXX`.

**Within a program**, we can manipulate Unicode strings just like **normal strings**. 

However, when Unicode characters are **stored in files or displayed on a terminal**, they must be encoded as a **stream of bytes**. 

Some encodings (such as **ASCII** and Latin-2) use a **single byte per code point**, so they can only support a small subset of Unicode, enough for a single language. Other encodings (such as **UTF-8**) use **multiple bytes** and can represent the full range of Unicode characters.
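
To make the single-byte versus multi-byte distinction concrete, here is a minimal sketch that encodes one Polish letter with three codecs. The letter and the variable names are chosen purely for illustration.

```python
# Illustrative sketch: the same character under three encodings.
letter = '\u0142'  # 'ł', LATIN SMALL LETTER L WITH STROKE

# Latin-2 is a single-byte encoding, so the letter occupies one byte
latin2_bytes = letter.encode('latin2')

# UTF-8 is a variable-width encoding; this letter needs two bytes
utf8_bytes = letter.encode('utf8')

# ASCII only covers code points 0-127, so encoding this letter fails
try:
    letter.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(len(latin2_bytes), len(utf8_bytes), ascii_ok)
```

The byte counts differ even though the code point (322) is the same in every case; only the encoding changes.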

Text in files will be in a particular encoding, so we need some mechanism for translating it into Unicode — **translation into Unicode is called decoding**. 

Conversely, to write out Unicode to a file or a terminal, we first need to translate it into a suitable encoding — this **translation out of Unicode is called encoding**.
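
The two directions can be sketched as a round trip. This is only an illustration with a word taken from the Polish sample used later; the variable names are assumptions.

```python
# Illustrative sketch of the encode/decode round trip.
text = 'Pa\u0144stwowa'  # 'Państwowa'

# encoding: Unicode string -> bytes
data = text.encode('latin2')

# decoding: bytes -> Unicode string (the same codec restores the text)
restored = data.decode('latin2')

# decoding with a different single-byte codec raises no error here,
# but it produces the wrong characters
garbled = data.decode('latin1')

print(restored == text, garbled == text)
```

Decoding with the wrong codec is the usual cause of garbled accented characters, which is why the encoding of a file has to be known before reading it.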

From a Unicode perspective, **characters are abstract entities which can be realized as one or more glyphs**. Only glyphs can appear on a screen or be printed on paper. A **font** is a **mapping from characters to glyphs**.

![encoding_decoding.PNG](attachment:encoding_decoding.PNG)

<a name="FileExtraction"></a>
###  3.3.2 Extracting encoded text from files

The Python **`open()`** function **can read encoded data into Unicode strings**, and **write out Unicode strings in encoded form**. 

It takes a **parameter to specify the encoding** of the file being read or written.

In [3]:
# a snippet of Polish text
path = nltk.data.find('corpora/unicode_samples/polish-lat2.txt')

# open the file as Latin-2; the with-block closes it afterwards
with open(path, encoding='latin2') as f:
    # for every line in file
    for line in f:
        # remove leading and trailing whitespace
        line = line.strip()
        # print line
        print(line)

Pruska Biblioteka Państwowa. Jej dawne zbiory znane pod nazwą
"Berlinka" to skarb kultury i sztuki niemieckiej. Przewiezione przez
Niemców pod koniec II wojny światowej na Dolny Śląsk, zostały
odnalezione po 1945 r. na terytorium Polski. Trafiły do Biblioteki
Jagiellońskiej w Krakowie, obejmują ponad 500 tys. zabytkowych
archiwaliów, m.in. manuskrypty Goethego, Mozarta, Beethovena, Bacha.
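
The cell above only covers reading. The writing direction might look like the sketch below, which uses a temporary directory and a hypothetical file name (`demo-lat2.txt`) so that nothing is left on disk.

```python
import os
import tempfile

# Illustrative sketch: write a Unicode string out as Latin-2, read it back.
text = 'zosta\u0142y'  # 'zostały'

# a temporary directory keeps the hypothetical demo file out of the way
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo-lat2.txt')

    # mode 'w' encodes the string on the way out
    with open(demo_path, 'w', encoding='latin2') as f:
        f.write(text)

    # binary mode shows the raw bytes stored on disk
    with open(demo_path, 'rb') as f:
        raw = f.read()

    # text mode with the matching encoding decodes them again
    with open(demo_path, encoding='latin2') as f:
        read_back = f.read()

print(len(raw), read_back == text)
```

Seven characters become seven bytes here because Latin-2 stores every character, including `ł`, in a single byte.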


In [5]:
# open the file as Latin-2; the with-block closes it afterwards
with open(path, encoding='latin2') as f:
    # for every line in file
    for line in f:
        # remove leading and trailing whitespace
        line = line.strip()
        # check code points:
        # convert all non-ASCII chars into their \xXX or \uXXXX escape
        print(line.encode('unicode_escape'))

b'Pruska Biblioteka Pa\\u0144stwowa. Jej dawne zbiory znane pod nazw\\u0105'
b'"Berlinka" to skarb kultury i sztuki niemieckiej. Przewiezione przez'
b'Niemc\\xf3w pod koniec II wojny \\u015bwiatowej na Dolny \\u015al\\u0105sk, zosta\\u0142y'
b'odnalezione po 1945 r. na terytorium Polski. Trafi\\u0142y do Biblioteki'
b'Jagiello\\u0144skiej w Krakowie, obejmuj\\u0105 ponad 500 tys. zabytkowych'
b'archiwali\\xf3w, m.in. manuskrypty Goethego, Mozarta, Beethovena, Bacha.'


In Python 3, **source code is encoded using UTF-8** by default, and you can include Unicode characters in strings if you are using IDLE or another program editor that supports Unicode. 

**Arbitrary Unicode characters** can be included using the `\uXXXX` escape sequence. 

We find the **integer ordinal of a character** using **`ord()`**.

We can obtain the **hexadecimal value** of that ordinal using **`hex(number)`**; padded with zeros to 4 digits, it gives us the **escape sequence** with which we can then **define a string**.

In [21]:
# find integer ordinal of non-ASCII letter
print("The integer ordinal of 'ł' is: {}.\n".format(ord('ł')))

# find hex number of integer ordinal
print("The hex number of 'ł' is: {}.\n".format(hex(ord('ł'))))

# assign hex representation to var
pol_letter = '\u0142'

# print the corresponding letter
print("The hex {} corresponds to: {}.".format(hex(322), pol_letter))

The integer ordinal of 'ł' is: 322.

The hex number of 'ł' is: 0x142.

The hex 0x142 corresponds to: ł.
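
Note that `hex()` returns `0x142` rather than the 4-digit form used in the escape sequence. One possible way to get the padded form is a format specification, as in this small sketch (variable names are illustrative):

```python
# Illustrative sketch: build the 4-digit hexadecimal form from the ordinal.
letter = '\u0142'  # 'ł'

# '04x' pads the hexadecimal value with zeros to four digits
four_digits = format(ord(letter), '04x')

# chr() goes the other way, from ordinal back to character
same_letter = chr(int(four_digits, 16))

print(four_digits, same_letter == letter)
```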


We can also see how this character is represented as a **sequence of bytes** inside a UTF-8 encoded text file.

In [22]:
# check byte representation
print("The byte representation of 'ł' is: {}.".format(pol_letter.encode('utf8')))

The byte representation of 'ł' is: b'\xc5\x82'.


The module **`unicodedata`** lets us **inspect the properties of Unicode characters**. 

In [26]:
import unicodedata

# read file line by line
with open(path, encoding='latin2') as f:
    lines = f.readlines()

# get 3rd line
line = lines[2]

# print line with non-ASCII chars escaped
print(line.encode('unicode_escape'), "\n")

# for every character in line
for char in line:
    # if ordinal integer is >127, i.e. the character is non-ASCII
    if ord(char) > 127:
        # UTF-8 byte sequence
        print('{} U+{:04x} {}'.format(char.encode('utf8'),
                                      # code point using standard convention
                                      ord(char),
                                      # unicode name
                                      unicodedata.name(char)))

b'Niemc\\xf3w pod koniec II wojny \\u015bwiatowej na Dolny \\u015al\\u0105sk, zosta\\u0142y\\n' 

b'\xc3\xb3' U+00f3 LATIN SMALL LETTER O WITH ACUTE
b'\xc5\x9b' U+015b LATIN SMALL LETTER S WITH ACUTE
b'\xc5\x9a' U+015a LATIN CAPITAL LETTER S WITH ACUTE
b'\xc4\x85' U+0105 LATIN SMALL LETTER A WITH OGONEK
b'\xc5\x82' U+0142 LATIN SMALL LETTER L WITH STROKE
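
Besides names, `unicodedata` exposes other properties. The sketch below, which is an illustration and not part of the original chapter, looks at the general category of a character and at the reverse lookup from name to character.

```python
import unicodedata

# Illustrative sketch: a few more properties of a single character.
letter = '\u0142'  # 'ł'

# official Unicode name of the character
name = unicodedata.name(letter)

# general category: 'Ll' stands for "Letter, lowercase"
category = unicodedata.category(letter)

# lookup() is the inverse of name()
found = unicodedata.lookup(name)

print(name, category, found == letter)
```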


In [38]:
# find index
print("The index of 'zosta\u0142y' is:\n{}.\n".format(line.find('zosta\u0142y')))

# normalize line (convert to lowercase)
line = line.lower()
print("Normalised line:\n{}\n".format(line))

# escape non-ASCII chars
print("Line with unicode escape:\n{}\n".format(line.encode('unicode_escape')))

import re
# search for this regex pattern in that line (raw string keeps \w intact)
m = re.search(r'\u015b\w*', line)
# print results
print("The regex pattern '\\u015b\\w*' matched:\n{}".format(m.group()))

The index of 'zostały' is:
54.

Normalised line:
niemców pod koniec ii wojny światowej na dolny śląsk, zostały


Line with unicode escape:
b'niemc\\xf3w pod koniec ii wojny \\u015bwiatowej na dolny \\u015bl\\u0105sk, zosta\\u0142y\\n'

The regex pattern '\u015b\w*' matched:
światowej
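
The match works because `\w` is Unicode-aware for string patterns in Python 3. As a small illustration (the word is taken from the output above, the variable names are assumptions), compare the default behaviour with the `re.ASCII` flag:

```python
import re

word = '\u015bwiatowej'  # 'światowej'

# default for str patterns: \w is Unicode-aware and matches 'ś'
unicode_match = re.findall(r'\w+', word)

# with re.ASCII, \w only matches [a-zA-Z0-9_], so 'ś' is skipped
ascii_match = re.findall(r'\w+', word, flags=re.ASCII)

print(unicode_match, ascii_match)
```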


**NLTK tokenizers** allow **Unicode strings as input**, and correspondingly yield **Unicode strings as output**.

In [41]:
from nltk.tokenize import word_tokenize

# tokenize the line
print("Tokenized line:\n{}".format(word_tokenize(line)))

Tokenized line:
['niemców', 'pod', 'koniec', 'ii', 'wojny', 'światowej', 'na', 'dolny', 'śląsk', ',', 'zostały']


<a name="LocalEncoding"></a>
###  3.3.3 Using your local encoding in Python
If you are used to working with **characters in a particular local encoding**, you probably want to be able to use your standard methods for inputting and editing strings in a Python file.

To do this, include the comment line `# -*- coding: <coding> -*-` as the **first or second line of your file**, where `<coding>` is an encoding name such as `latin-1`, `big5` or `utf-8`.
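
As a rough sketch of what the declaration does, the code below saves a tiny script as Latin-2 bytes and runs it. The file name `polish_demo.py` and the variable `word` are hypothetical, and `runpy` is used here only as a convenient way to execute the file.

```python
import os
import runpy
import tempfile

# Illustrative sketch: a source file saved in Latin-2 instead of UTF-8.
# The file name and the variable `word` are hypothetical.
source = "# -*- coding: iso-8859-2 -*-\nword = 'zosta\u0142y'\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    script_path = os.path.join(tmp_dir, 'polish_demo.py')

    # store the source as Latin-2 bytes, as a Latin-2 editor would
    with open(script_path, 'wb') as f:
        f.write(source.encode('iso-8859-2'))

    # Python reads the declaration and decodes the file accordingly
    namespace = runpy.run_path(script_path)

print(namespace['word'] == 'zosta\u0142y')
```

Without the declaration line, Python would try to read the file as UTF-8, and the Latin-2 byte for `ł` would not decode.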

![polish_local.PNG](attachment:polish_local.PNG)