# **Natural Language Processing with Python**
by [CSpanias](https://cspanias.github.io/aboutme/) - 01/2022

Content based on the [NLTK book](https://www.nltk.org/book/). <br>

You can find Chapter 3 [here](https://www.nltk.org/book/ch03.html).

# CONTENT

1. Language Processing and Python
2. Accessing Text Corpora and Lexical Resources
3. Processing Raw Text
    1. Accessing Text from the Web and from Disk
    2. Strings: Text Processing at the Lowest Level
    3. Text Processing with Unicode
    4. Regular Expressions for Detecting Word Patterns
    5. Useful Applications of Regular Expressions
    6. Normalizing Text
    7. Regular Expressions for Tokenizing Text
    8. [Segmentation](#Segmentation)
        1. [Sentence Segmentation](#SentSegm)
        1. [Word Segmentation](#WordSegm)
    9. [Formatting: From Lists to Strings](#Formatting)
        1. [From Lists to Strings](#List2String)
        1. [Strings and Formats](#StringFormats)
        1. [Lining Things Up](#Lining)
        1. [Writing Results to a File](#File)
        1. [Text Wrapping](#Wrapping)


<a name="Segmentation"></a>
## 3.8 Segmentation
1. [Sentence Segmentation](#SentSegm)
1. [Word Segmentation](#WordSegm)

**Tokenization is an instance of a more general problem of segmentation**. 

<a name="SentSegm"></a>
###  3.8.1 Sentence Segmentation

Manipulating texts at the level of individual words often **presupposes the ability to divide a text into individual sentences**. 

As we have seen, **some corpora already provide access at the sentence level**. 

In [15]:
from nltk.corpus import brown

# compute the average number of words per sentence
print("The average number of words per sentence is: {:.2f}".
     format(len(brown.words()) / len(brown.sents()) ))

The average number of words per sentence is: 20.25


In other cases, the text is **only available as a stream of characters**. 

Before tokenizing the text into words, we need to segment it into sentences. 

NLTK facilitates this by including the [**Punkt sentence segmenter**](https://direct.mit.edu/coli/article/32/4/485/1923/Unsupervised-Multilingual-Sentence-Boundary).

*__Data Pretty Print__ [documentation](https://docs.python.org/3/library/pprint.html)*

In [22]:
import pprint

from nltk.corpus import gutenberg
from nltk.tokenize import sent_tokenize

# load text
text = gutenberg.raw("chesterton-thursday.txt")

# tokenize text into sentences
sents = sent_tokenize(text)

# pretty print sample result
pprint.pprint(sents[79:89])

['"Nonsense!"',
 'said Gregory, who was very rational when anyone else\nattempted paradox.',
 '"Why do all the clerks and navvies in the\n'
 'railway trains look so sad and tired, so very sad and tired?',
 'I will\ntell you.',
 'It is because they know that the train is going right.',
 'It\n'
 'is because they know that whatever place they have taken a ticket\n'
 'for that place they will reach.',
 'It is because after they have\n'
 'passed Sloane Square they know that the next station must be\n'
 'Victoria, and nothing but Victoria.',
 'Oh, their wild rapture!',
 'oh,\n'
 'their eyes like stars and their souls again in Eden, if the next\n'
 'station were unaccountably Baker Street!"',
 '"It is you who are unpoetical," replied the poet Syme.']


Notice that this excerpt is really **a single sentence**, reporting the speech of Mr Lucian Gregory. 

However, **the quoted speech contains several sentences**, and these have been split into individual strings. This is reasonable behavior for most applications.

Sentence segmentation is difficult because **periods are also used to mark abbreviations**, and **some periods simultaneously mark an abbreviation and terminate a sentence**, as often happens with acronyms like U.S.A.
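
The difficulty with periods can be seen with a deliberately naive splitter. The sketch below is not how Punkt works: `naive_sent_split` is a hypothetical helper that breaks after every `.`, `!` or `?` followed by whitespace, and the sample text is made up for illustration.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' whenever whitespace follows (hypothetical helper)."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text) if s]

# made-up sample: two real sentences, one abbreviation, one acronym
sample = "Mr. Smith moved to the U.S.A. He likes it there."

pieces = naive_sent_split(sample)
print(pieces)
# ['Mr.', 'Smith moved to the U.S.A.', 'He likes it there.']
```

The period after `Mr` is wrongly treated as a sentence end, while the final period of `U.S.A.` happens to be a real boundary. Telling these two cases apart is the hard part of sentence segmentation.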

<a name="WordSegm"></a>
###  3.8.2 Word Segmentation

For some writing systems, tokenizing text is made more difficult by the fact that there is **no visual representation of word boundaries**. 

For example, in Chinese, the three-character string: `爱国人` (ai4 "love" (verb), guo2 "country", ren2 "person") could be tokenized as: 
1. `爱国 / 人`, "country-loving person" <br>
or, <br>
2. `爱 / 国人`, "love country-person."
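
To make the ambiguity concrete, the following sketch lists every way of cutting the three-character string into contiguous pieces. The function name `segmentations` is an assumption introduced here for illustration; it is plain recursion and uses no NLTK functionality.

```python
def segmentations(text):
    """Return every way of cutting `text` into contiguous pieces."""
    if not text:
        return [[]]
    results = []
    # choose the length of the first piece, then segment the remainder
    for k in range(1, len(text) + 1):
        for rest in segmentations(text[k:]):
            results.append([text[:k]] + rest)
    return results

candidates = segmentations('爱国人')
for words in candidates:
    print(' / '.join(words))
```

Both readings given above appear among the four candidates; choosing between them requires knowledge that the string itself does not carry.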

A similar problem arises in the **processing of spoken language**, where the hearer must **segment a continuous speech stream** into individual words. A particularly challenging version of this problem arises when we **don't know the words in advance**. This is the problem faced by a **language learner**, such as a child hearing utterances from a parent. 

Consider the following artificial example, where word boundaries have been removed:	

1. doyouseethekitty
1. seethedoggy
1. doyoulikethekitty
1. likethedoggy

Our first challenge is simply to **represent the problem**: we need to **find a way to separate text content from the segmentation**. 

We can do this by **annotating each character with a boolean value** to indicate whether or not a word-break appears after the character.

In [23]:
# text without boundaries
text = "doyouseethekittyseethedoggydoyoulikethekittylikethedoggy"

# sentence-level boundaries: '1' means a break follows that character
seg1 = "0000000000000001000000000010000000000000000100000000000"

# word-level boundaries: '1' means a break follows that character
seg2 = "0100100100100001001001000010100100010010000100010010000"

Observe that the **segmentation strings consist of zeros and ones**. 

They are one character shorter than the source text, since **a text of length `n` can only be broken up in `n-1` places**. 

We can **get back to the original segmented text from the above representation** by building a custom function. 

In [29]:
def segment(text, segs):
    """Convert a boundary string back into a list of words."""
    
    # initialize variables
    words = []
    last = 0
    
    # for each position in the boundary string
    for i in range(len(segs)):
        # a '1' means a word ends right after character i
        if segs[i] == '1':
            # slice the word out of the text & append it to the list
            words.append(text[last:i+1])
            # the next word starts right after this boundary
            last = i+1
            
    # extract and append the remaining text to list
    words.append(text[last:])
    return words

print("Original text:\n{}\n".format(text))
print("Sentence-level boolean representation:\n{}\n".format(seg1))
print("Sentence-level segmentation back to text:\n{}\n".format(segment(text, seg1)))
print("Word-level boolean representation:\n{}\n".format(seg2))
print("Word-level segmentation back to text:\n{}\n".format(segment(text, seg2)))

Original text:
doyouseethekittyseethedoggydoyoulikethekittylikethedoggy

Sentence-level boolean representation:
0000000000000001000000000010000000000000000100000000000

Sentence-level segmentation back to text:
['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']

Word-level boolean representation:
0100100100100001001001000010100100010010000100010010000

Word-level segmentation back to text:
['do', 'you', 'see', 'the', 'kitty', 'see', 'the', 'doggy', 'do', 'you', 'like', 'the', 'kitty', 'like', 'the', 'doggy']
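
Going the other way, a boundary string can be built from a list of words. The sketch below is an assumption added for illustration (the name `encode` does not come from the NLTK book); it also confirms that the boundary string is one character shorter than the text.

```python
def encode(words):
    """Return (text, boundary string) for a list of non-empty words."""
    text = ''.join(words)
    bits = []
    for word in words:
        # zeros inside the word, then a one to mark the break after it
        bits.append('0' * (len(word) - 1) + '1')
    # drop the final '1': no break is recorded after the last character
    return text, ''.join(bits)[:-1]

text_part, seg_part = encode(['do', 'you', 'see'])
print(text_part)   # doyousee
print(seg_part)    # 0100100
print(len(text_part) - len(seg_part))   # 1
```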



Now the **segmentation task becomes a search problem**: find the bit string that causes the text string to be correctly segmented into words. 

We assume the learner is acquiring words and storing them in an **internal lexicon**. Given a suitable lexicon, it is possible to **reconstruct the source text as a sequence of lexical items**. 

We can define an **objective function**: a scoring function whose value we will try to optimize, based on the size of the lexicon (number of characters in the words plus an extra delimiter character to mark the end of each word) and the amount of information needed to reconstruct the source text from the lexicon.

![objective_funcntion.PNG](attachment:objective_funcntion.PNG)

In [34]:
def evaluate(text, segs):
    """Score a segmentation; smaller scores are better."""
    
    # obtain the words back from the segmentation string
    words = segment(text, segs)
    # text size: the number of word tokens needed to rebuild the text
    text_size = len(words)
    # lexicon size: characters of each distinct word plus one delimiter per word
    lexicon_size = sum(len(word) + 1 for word in set(words))
    # return the sum of text and lexicon size
    return text_size + lexicon_size

# random segmentation example
seg3 = "0000100100000011001000000110000100010000001100010000001"

print("Original text:\n{}\n".format(text))

print("Sentence-level boolean representation:\n{}\n".format(seg1))
print("Sentence-level segmentation back to text:\n{}\n".format(segment(text, seg1)))
print("Evaluation score of the sentence-level segmentation:\n{}\n".format(evaluate(text, seg1)))

print("Word-level boolean representation:\n{}\n".format(seg2))
print("Word-level segmentation back to text:\n{}\n".format(segment(text, seg2)))
print("Evaluation score of the word-level segmentation:\n{}\n".format(evaluate(text, seg2)))

print("Random boolean representation:\n{}\n".format(seg3))
print("Resulted segmented-text:\n{}\n".format(segment(text, seg3)))
print("Evaluation score of the segmentation:\n{}\n".format(evaluate(text, seg3)))

Original text:
doyouseethekittyseethedoggydoyoulikethekittylikethedoggy

Sentence-level boolean representation:
0000000000000001000000000010000000000000000100000000000

Sentence-level segmentation back to text:
['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']

Evaluation score of the sentence-level segmentation:
64

Word-level boolean representation:
0100100100100001001001000010100100010010000100010010000

Word-level segmentation back to text:
['do', 'you', 'see', 'the', 'kitty', 'see', 'the', 'doggy', 'do', 'you', 'like', 'the', 'kitty', 'like', 'the', 'doggy']

Evaluation score of the word-level segmentation:
48

Random boolean representation:
0000100100000011001000000110000100010000001100010000001

Resulted segmented-text:
['doyou', 'see', 'thekitt', 'y', 'see', 'thedogg', 'y', 'doyou', 'like', 'thekitt', 'y', 'like', 'thedogg', 'y']

Evaluation score of the segmentation:
47
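
To see where these scores come from, the sketch below splits the score into its two parts for the word-level segmentation. `score_breakdown` is a hypothetical helper that takes the word list directly instead of a boundary string; it applies the same formula as `evaluate()`.

```python
def score_breakdown(words):
    """Return (text size, lexicon size, total score) for a list of words."""
    # one unit per word token in the text
    text_size = len(words)
    # each distinct word costs its length plus one delimiter character
    lexicon_size = sum(len(word) + 1 for word in set(words))
    return text_size, lexicon_size, text_size + lexicon_size

words = ("do you see the kitty see the doggy "
         "do you like the kitty like the doggy").split()

breakdown = score_breakdown(words)
print(breakdown)   # (16, 32, 48)
```

The 16 tokens use only 7 distinct words (25 characters plus 7 delimiters), which gives 16 + 32 = 48. The sentence-level segmentation has just 4 tokens, but each is a long, unique "word", so its lexicon costs 60 and the total is 64.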



The final step is to **search for the pattern of zeros and ones that minimizes this objective function**.

**Smaller values** of the score indicate a **better segmentation**.

**Non-Deterministic Search Using Simulated Annealing**: 
1. Begin the search from the phrase-level segmentation only.
1. Randomly flip zeros and ones; the number of flips is proportional to the "temperature".
1. Lower the temperature with each iteration, so that the perturbation of boundaries is reduced.

**Simulated annealing** is a heuristic for finding a good approximation to the optimum value of a function in a large, discrete search space, based on an **analogy with annealing in metallurgy**. 
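
The cooling schedule on its own can be sketched as follows. `flip_schedule` is a hypothetical helper, not part of the notebook's code; the starting temperature 55 and the cooling rate 1.2 mirror the boundary-string length and the rate used in this section.

```python
def flip_schedule(start_temperature, cooling_rate):
    """List how many bits would be flipped per guess in each cooling round."""
    schedule = []
    temperature = float(start_temperature)
    # stop once the temperature is too low to flip even one bit
    while temperature > 0.5:
        schedule.append(round(temperature))
        temperature = temperature / cooling_rate
    return schedule

schedule = flip_schedule(55, 1.2)
print(len(schedule), "rounds")
print(schedule)
```

Early rounds scramble almost the whole boundary string, while the last rounds flip a single bit at a time, so the search moves from broad exploration to local refinement.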

In [51]:
from random import randint

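def flip(segs, pos):
    """Return segs with the bit at position pos flipped."""
    
    # rebuild the string, swapping '0' and '1' at pos
    return segs[:pos] + str(1-int(segs[pos])) + segs[pos+1:]


def flip_n(segs, n):
    """Flip n randomly chosen bits of segs."""
    # repeat n times
    for i in range(n):
        # flip the bit at a random position (a position may be picked more than once)
        segs = flip(segs, randint(0, len(segs)-1))
    
    # return the perturbed segmentation
    return segs


def anneal(text, segs, iterations, cooling_rate):
    """Search for a low-scoring segmentation using simulated annealing."""
    
    # start with a temperature equal to the number of possible boundaries
    temperature = float(len(segs))
    # stop once the temperature is too low to flip even one bit
    while temperature > 0.5:
        # the current segmentation is the best one seen so far in this round
        best_segs, best = segs, evaluate(text, segs)
        
        # try a number of random perturbations of the current segmentation
        for i in range(iterations):
            # flip roughly 'temperature' bits
            guess = flip_n(segs, round(temperature))
            # score the perturbed segmentation
            score = evaluate(text, guess)
            
            # keep it if it beats the best score so far (lower is better)
            if score < best:
                # remember the new best score and segmentation
                best, best_segs = score, guess
                
        # continue the next round from the best segmentation found
        score, segs = best, best_segs
        # cool down: fewer bits are flipped in the next round
        temperature = temperature / cooling_rate
        
        # report progress
        print(evaluate(text, segs), segment(text, segs))
        
    # inform user that segmentation optimization process has finished
    print("\nSegmentation has been optimized.\n")
    
    # show the final segmentation string and return it to the caller
    print("The resulting segmentation is:\n{}\n".format(segs))
    return segs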

# start from the sentence-level segmentation (same bits as seg1); it is not actually random
random_segm = "0000000000000001000000000010000000000000000100000000000"

# print information
print("Original text:\n{}\n".format(text))
print("Random boolean representation:\n{}\n".format(random_segm))

# invoke function
print("Optimizing segmentation:\n")
anneal(text, random_segm, 5000, 1.2)

Original text:
doyouseethekittyseethedoggydoyoulikethekittylikethedoggy

Random boolean representation:
0000000000000001000000000010000000000000000100000000000

Optimizing segmentation:

64 ['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']
64 ['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']
64 ['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']
64 ['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']
64 ['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']
64 ['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']
64 ['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']
63 ['doyouseethekitty', 'seeth', 'edoggy', 'doyoulikethekittyl', 'i', 'keth', 'edoggy']
62 ['doyouseethekitt', 'yseeth', 'edoggy', 'doyoul', 'i', 'keth', 'ekittyl', 'i', 'keth', 'edoggy']
62 ['doyouseethekitt', 'yseeth', 'edoggy', 'doyoul', 'i', 'keth', 'ekittyl', 'i', 'keth', 'ed

Notice that **the best segmentation includes "words" like `thekitty`**, since there's **not enough evidence in the data to split this any further**.

**With enough data**, it is possible to **automatically segment text into words with a reasonable degree of accuracy**. 

Such methods can be applied to **tokenization for writing systems that don't have any visual representation of word boundaries**.

<a name="Formatting"></a>
## 3.9 Formatting: From Lists to Strings
1. [From Lists to Strings](#List2String)
1. [Strings and Formats](#StringFormats)
1. [Lining Things Up](#Lining)
1. [Writing Results to a File](#File)
1. [Text Wrapping](#Wrapping)

When the **results to be presented** are **linguistic**, **textual output** is usually the most natural choice. 

However, when the results are **numerical**, it may be preferable to produce **graphical** output.

<a name="List2String"></a>
###  3.9.1 From Lists to Strings
**`join()`**

The **simplest kind** of structured object we use for text processing is **lists of words**. 

When we want to **output these** to a display or a file, we must **convert these lists into strings**. 

The `join()` method only works on a **list (or other iterable) of strings**; other items, such as numbers, must be converted with `str()` first.

In [53]:
# create a list of tokens
silly = ['We', 'called', 'him', 'Tortoise', 'because', 'he', 'taught', 'us', '.']

# 'glue' them into a string
print(' '.join(silly), "\n")

# 'glue' them into a string with specified delimiter
print(';'.join(silly))

We called him Tortoise because he taught us . 

We;called;him;Tortoise;because;he;taught;us;.
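
The restriction to strings is easy to run into with numeric results. A minimal sketch, using a made-up list of counts:

```python
# made-up counts, for illustration only
counts = [3, 4, 1]

try:
    ', '.join(counts)
except TypeError as err:
    # join() refuses items that are not strings
    print("join() failed:", err)

# convert each item to a string first, then join
joined = ', '.join(str(n) for n in counts)
print(joined)   # 3, 4, 1
```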


<a name="StringFormats"></a>
###  3.9.2 Strings and Formats

There are two ways of **displaying the contents of an object**:

1. **`print()`** command yields Python's attempt to produce **the most human-readable form of an object**. 

2. **naming the variable at a prompt** shows us a string that can be used to recreate this object. 

It is important to keep in mind that **both of these are just strings**, displayed for the benefit of you, the user. They **do not give us any clue as to the actual internal representation of the object**.

In [55]:
# create string
word = 'Hello'

# display with print
print(word, "\n")

# check the var type
print("The type of the variable word is: {}\n".format(type(word)))

# display with just naming
word

Hello 

The type of the variable word is: <class 'str'>



'Hello'
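
The two displays correspond to the built-in functions `str()` and `repr()`. The difference is easier to see with a string that contains a special character; the value below is an arbitrary example.

```python
import ast

word = 'Monty\tPython'

readable = str(word)        # what print() displays
recreatable = repr(word)    # what naming the variable at a prompt displays

print(readable)
print(recreatable)

# the repr() string is valid Python source that rebuilds the original value
rebuilt = ast.literal_eval(recreatable)
print(rebuilt == word)   # True
```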

There are many other useful ways to display an object as a string of characters. 

This may be for the **benefit of a human reader**, or because we want to **export our data to a particular file format for use in an external program**.

**Formatted output** typically contains a **combination of variables and pre-specified strings**.

In [58]:
from nltk import FreqDist

# create a FreqDist
fdist = FreqDist(['dog', 'cat', 'dog', 'cat', 'dog', 'snake', 'dog', 'cat'])

# print each word with a predefined output
for word in sorted(fdist):
    print(word, '->', fdist[word], end='; ')

cat -> 3; dog -> 4; snake -> 1; 

**Print statements** that contain alternating variables and constants can be **difficult to read and maintain**. 

Another solution is to use **string formatting**.

In [59]:
# print each word using string formatting
for word in sorted(fdist):
    print('{}->{}'.format(word, fdist[word]), end=" ")

cat->3 dog->4 snake->1 

The **curly brackets `'{}'`** mark the presence of a **replacement field**.

This acts as a placeholder for the string values of objects that are passed to the **`str.format()`** method. 

A string containing replacement fields is called a **format string**.

The **field name** in a format string can start with a **number**, which refers to a **positional argument of `format()`**. 

In [62]:
# A = index 0, B = index 1
'from {1} to {0}'.format('A', 'B')

'from B to A'
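
Field names can also be keywords, and a positional field can be reused. The short sketch below shows both, along with an f-string (Python 3.6+), which embeds the variables directly; the names `src` and `dst` are arbitrary.

```python
# keyword field names refer to keyword arguments of format()
by_name = 'from {src} to {dst}'.format(src='A', dst='B')

# a positional field can appear more than once
repeated = '{0}, {0} and {1}'.format('A', 'B')

# an f-string evaluates the names in the surrounding scope
src, dst = 'A', 'B'
f_string = f'from {dst} to {src}'

print(by_name)    # from A to B
print(repeated)   # A, A and B
print(f_string)   # from B to A
```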

We can also **provide the values for the placeholders indirectly**.

In [63]:
# create a string with placeholder
template = 'Lee wants a {} right now.'

# create a list of values for the placeholder
menu = ['sandwich', 'spam fritter', 'pancake']

# replace the placeholder with list element
for snack in menu:
    print(template.format(snack))

Lee wants a sandwich right now.
Lee wants a spam fritter right now.
Lee wants a pancake right now.


<a name="Lining"></a>
###  3.9.3 Lining Things Up

So far our format strings generated output of **arbitrary width** on the page (or screen). 

We can add **padding** to obtain output of a given width by inserting into the brackets a colon **`':'` followed by an integer**. 

Numbers are **right-justified by default**, but we can precede the width specifier with a **`'<'`** alignment option to make them **left-justified**.

**Strings are left-justified by default**, but can be **right-justified** with the **`'>'`** alignment option.

Other control characters can be used to specify the **sign and precision of floating point numbers**.

*A list with all the **formatting options** can be found [here](https://www.w3schools.com/python/ref_string_format.asp).*

In [71]:
# print a number with width of 6
print("A number with a width of 6:\n'{:6}'\n".format(41))

# print the same as above but with left-alignment
print("A number with a width of 6 and aligned left:\n'{:<6}'\n".format(41))

# print a floating-point number with a specified number of decimals
print("A floating-point number rounded to 2 decimal places:\n{:.2f}".format(41.123123))

A number with a width of 6:
'    41'

A number with a width of 6 and aligned left:
'41    '

A floating-point number rounded to 2 decimal places:
41.12
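
A few more format specifications, covering sign, zero padding and string alignment. The values are arbitrary and chosen only to make the padding visible.

```python
value = 3.14159

signed = '{:+.2f}'.format(value)        # always show the sign
zero_padded = '{:08.3f}'.format(value)  # pad with zeros to a width of 8
right_text = '{:>8}'.format('cat')      # right-justify a string
centered = '{:^7}'.format('cat')        # center a string

print(repr(signed))        # '+3.14'
print(repr(zero_padded))   # '0003.142'
print(repr(right_text))    # '     cat'
print(repr(centered))      # '  cat  '
```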


The string formatting is **smart enough** to know that if you include a **`%`** in your format specification, then you want to **represent the value as a percentage**; there's **no need to multiply by 100**.

In [74]:
# assign variables
count, total = 3205, 9375

# format number as a % rounded to 2 decimal places
"Accuracy for {} words: {:.2%}".format(total, count / total)

'Accuracy for 9375 words: 34.19%'

An important use of formatting strings is for **tabulating data**. 

This gives us **full control of headings and column widths**.

_Note the **clear separation between the language processing work, and the tabulation of results**._

In [78]:
from nltk import ConditionalFreqDist

def tabulate(cfdist, words, categories):
    
    # column headings
    print('{:16}'.format('Category'), end=' ')
    
    # for each word
    for word in words:
        print('{:>6}'.format(word), end=' ')
    print()
    
    # row heading
    for category in categories:
        print('{:16}'.format(category), end=' ')
        # for each word
        for word in words:
            # print table cell
            print('{:6}'.format(cfdist[category][word]), end=' ')
        print()
            
# create a CFD
cfd = ConditionalFreqDist(
        (genre, word)
        for genre in brown.categories()
        for word in brown.words(categories=genre)
)

# create a list of genres
genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']

# create a list of modals
modals = ['can', 'could', 'may', 'might', 'must', 'will']

# invoke function
tabulate(cfd, modals, genres)

Category            can  could    may  might   must   will 
news                 93     86     66     38     50    389 
religion             82     59     78     12     54     71 
hobbies             268     58    131     22     83    264 
science_fiction      16     49      4     12      8     16 
romance              74    193     11     51     45     43 
humor                16     30      8      8      9     13 


Recall that we can use a format string `'{:{width}}'` and bind a value to the **width parameter** in `format()`. This allows us to **specify the width of a field using a variable**.

In [83]:
'{:{width}}'.format('Monty Python', width=15)

'Monty Python   '

We could use this to **automatically customize the column** to be just wide enough to accommodate all the words, using **`width = max(len(w) for w in words)`**.
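
A minimal sketch of that idea, assuming a plain dictionary of counts in place of the `ConditionalFreqDist` (the figures are copied from the table above). `format_table` is a hypothetical helper that returns the lines instead of printing them.

```python
def format_table(counts, words):
    """Return table lines whose columns are as wide as the longest word."""
    # column width: just wide enough for the longest word
    width = max(len(w) for w in words)
    # label column: wide enough for the longest row label or the heading
    label_width = max(len(label) for label in list(counts) + ['Category'])

    header = '{:{lw}} '.format('Category', lw=label_width)
    header += ' '.join('{:>{w}}'.format(word, w=width) for word in words)
    lines = [header]

    for label, row in counts.items():
        cells = ' '.join('{:{w}}'.format(row.get(word, 0), w=width)
                         for word in words)
        lines.append('{:{lw}} '.format(label, lw=label_width) + cells)
    return lines

counts = {
    'news': {'can': 93, 'could': 86, 'might': 38},
    'humor': {'can': 16, 'could': 30, 'might': 8},
}

lines = format_table(counts, ['can', 'could', 'might'])
print('\n'.join(lines))
```

The sketch sizes columns from the words only; if a count had more digits than the longest word has characters, the columns would no longer line up.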

<a name="File"></a>
###  3.9.4 Writing Results to a File

It is often useful to **write output to files**.

In [85]:
from nltk.corpus import genesis

# get some words
words = set(genesis.words('english-kjv.txt'))

# open the output file for writing ('w' creates it, or overwrites it if it already exists)
with open('output.txt', 'w') as output_file:
    # print one word per line to the output file
    for word in sorted(words):
        print(word, file=output_file)

When we write **non-text data to a file** we must **convert it to a string first**: a file's `write()` method accepts only strings, whereas `print()` performs the conversion itself.

In [92]:
# find the length of words
print("The length of words is: {}.\n".format(len(words)))

# convert number to string
length_str = str(len(words))

# open the output file for writing ('w' creates or overwrites it) and write the string
with open('output1.txt', 'w') as output_file:
    print(length_str, file=output_file)

The length of words is: 2789.
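
The sketch below contrasts `write()` and `print()` on a number. It writes to a temporary directory so that it leaves nothing behind in the working folder; the file name `counts.txt` is arbitrary, and 2789 is the word count reported above.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'counts.txt')

with open(path, 'w', encoding='utf-8') as output_file:
    try:
        # write() accepts only strings in text mode
        output_file.write(2789)
    except TypeError as err:
        print("write() failed:", err)

    # explicit conversion for write(); the newline must be added by hand
    output_file.write(str(2789) + '\n')
    # print() converts the number and appends the newline itself
    print(2789, file=output_file)

# read the file back to check what was written
with open(path, encoding='utf-8') as input_file:
    contents = input_file.read()

print(repr(contents))   # '2789\n2789\n'
```

Using `with` also guarantees that the file is closed, and its contents flushed to disk, as soon as the block ends.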



<a name="Wrapping"></a>
###  3.9.5 Text Wrapping

When the output of our program is **text-like**, instead of tabular, it will usually be **necessary to wrap it** so that it can be displayed conveniently.

In [99]:
# create a list of words
saying = ['After', 'all', 'is', 'said', 'and', 'done', ',', 'more',
         'is', 'said', 'than', 'done']

# print each word with its length in parentheses
for word in saying:
    print(word, '(' + str(len(word)) + '),', end=' ')

After (5), all (3), is (2), said (4), and (3), done (4), , (1), more (4), is (2), said (4), than (4), done (4), 

We can take care of line wrapping with the help of Python's **`textwrap`** module.

It can be used for **wrapping and formatting of plain text**. 

This module provides formatting of text by **adjusting the line breaks in the input paragraph**.

_More info on **`textwrap`** can be found [here](https://www.geeksforgeeks.org/textwrap-text-wrapping-filling-python/)._

In [108]:
from textwrap import wrap, fill

# create a paragraph of text
paragraph = 'Hello there!\nHow are you doing?\nLong time no see!'
print("Original text of type {}:\n{}\n".format(type(paragraph), paragraph))

# wrap() returns a list of output lines (newlines in the input are treated as spaces)
wrapped_text = wrap(paragraph)
print("Wrapped text of type {}:\n{}\n".format(type(wrapped_text), wrapped_text))

# fill() returns the wrapped lines joined into a single string
filled_text = fill(paragraph)
print("Filled text of type {1}:\n{0}\n".format(filled_text, type(filled_text)))

Original text of type <class 'str'>:
Hello there!
How are you doing?
Long time no see!

Wrapped text of type <class 'list'>:
['Hello there! How are you doing? Long time no see!']

Filled text of type <class 'str'>:
Hello there! How are you doing? Long time no see!
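
The paragraph above is shorter than the default width of 70 characters, so it fits on one line. As a sketch of actual wrapping, the word-and-length output from the start of this section can be built as one string and passed to `fill()` with a narrower width; the value 40 is an arbitrary choice.

```python
from textwrap import fill

saying = ['After', 'all', 'is', 'said', 'and', 'done', ',', 'more',
          'is', 'said', 'than', 'done']

# build the pieces first, then join them into a single long string
pieces = ['{} ({}),'.format(word, len(word)) for word in saying]
output = ' '.join(pieces)

# wrap the string so that no line is longer than 40 characters
wrapped = fill(output, width=40)
print(wrapped)
```

Because `fill()` may break at any space, a word can end up on a different line from its length in parentheses.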

