# **Natural Language Processing with Python**
by [CSpanias](https://cspanias.github.io/aboutme/) - 01/2022

Content based on the [NLTK book](https://www.nltk.org/book/). <br>

You can find Chapter 3 [here](https://www.nltk.org/book/ch03.html).

# CONTENT

1. Language Processing and Python
2. Accessing Text Corpora and Lexical Resources
3. Processing Raw Text
    1. Accessing Text from the Web and from Disk
    2. Strings: Text Processing at the Lowest Level
    3. Text Processing with Unicode
    4. Regular Expressions for Detecting Word Patterns
    5. Useful Applications of Regular Expressions
    6. [Normalizing Text](#Normalize)
        1. [Stemmers](#Stemmers)
        2. [Lemmatization](#Lemmatization)
    7. [Regular Expressions for Tokenizing Text](#RegexTokens)
        1. [Simple Approaches to Tokenization](#Simple)
        2. [NLTK's Regular Expression Tokenizer](#Tokenizer)
        3. [Further Issues with Tokenization](#Issues)

**Install**, **import** and **download NLTK**. <br>

*Uncomment the `!pip install nltk` and `nltk.download()` lines if you haven't installed NLTK and downloaded its data yet.*

In [1]:
# install nltk
#!pip install nltk

# load nltk
import nltk

# download nltk
#nltk.download()

<a name="Normalize"></a>
## 3.6  Normalizing Text
1. [Stemmers](#Stemmers)
2. [Lemmatization](#Lemmatization)

In earlier program examples we have often **converted text to lowercase** before doing anything with its words. <br>

Often we want to go further than this, and **strip off any affixes**, a task known as **stemming**. 

A further step is to make sure that **the resulting form is a known word in a dictionary**, a task known as **lemmatization**.

In [2]:
from nltk.tokenize import word_tokenize
# define text
raw = """DENNIS: Listen, strange women lying in ponds distributing swords
is no basis for a system of government.  Supreme executive power derives from
a mandate from the masses, not from some farcical aquatic ceremony."""

# tokenize text
tokens = word_tokenize(raw)

<a name="Stemmers"></a>
###  3.6.1 Stemmers
**`from nltk.stem import PorterStemmer, LancasterStemmer`**

NLTK includes several **off-the-shelf stemmers**, and if you ever need a stemmer you should use one of these in preference to crafting your own using regular expressions, since these handle a wide range of irregular cases. 

The **Porter** and **Lancaster** stemmers follow **their own rules for stripping affixes**.

In [3]:
from nltk.stem import PorterStemmer, LancasterStemmer

# instantiate stemmers
porter = PorterStemmer()
lancaster = LancasterStemmer()

# apply each stemmer to every token
print("PorterStemmer:\n{}\n".format([porter.stem(t) for t in tokens]))
print("LancasterStemmer:\n{}\n".format([lancaster.stem(t) for t in tokens]))

PorterStemmer:
['denni', ':', 'listen', ',', 'strang', 'women', 'lie', 'in', 'pond', 'distribut', 'sword', 'is', 'no', 'basi', 'for', 'a', 'system', 'of', 'govern', '.', 'suprem', 'execut', 'power', 'deriv', 'from', 'a', 'mandat', 'from', 'the', 'mass', ',', 'not', 'from', 'some', 'farcic', 'aquat', 'ceremoni', '.']

LancasterStemmer:
['den', ':', 'list', ',', 'strange', 'wom', 'lying', 'in', 'pond', 'distribut', 'sword', 'is', 'no', 'bas', 'for', 'a', 'system', 'of', 'govern', '.', 'suprem', 'execut', 'pow', 'der', 'from', 'a', 'mand', 'from', 'the', 'mass', ',', 'not', 'from', 'som', 'farc', 'aqu', 'ceremony', '.']



Observe that **the Porter stemmer correctly handles the word lying** (mapping it to lie), while the **Lancaster stemmer does not**.

**Stemming is not a well-defined process**, and we typically pick the stemmer that best suits the application we have in mind. 

The **Porter Stemmer** is a good choice if you are indexing some texts and want to support **search using alternative forms of words**.
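
To make the indexing use case concrete, here is a minimal sketch of an index keyed by stems, so that a query for one form of a word finds the others. The names `build_stem_index` and `toy_stem` are hypothetical; `toy_stem` is only a crude stand-in so the sketch runs without NLTK, and in practice you would pass `porter.stem` instead.

```python
from collections import defaultdict

def build_stem_index(tokens, stem):
    """Map each stem to the positions of the tokens that reduce to it."""
    index = defaultdict(list)
    for position, token in enumerate(tokens):
        index[stem(token)].append(position)
    return index

def toy_stem(word):
    # Hypothetical stand-in for a real stemmer such as porter.stem:
    # lowercases and strips a trailing "ing" or "s". Illustration only.
    word = word.lower()
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

tokens = "Swords are distributed by women distributing a sword".split()
index = build_stem_index(tokens, toy_stem)

# Stem the query the same way as the indexed tokens
hits = index[toy_stem("swords")]
print(hits)  # [0, 7] -> both "Swords" and "sword" are found
```

The key design point is that the query is stemmed with the same function as the indexed text; otherwise the lookup keys would not line up.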

<a name="Lemmatization"></a>
###  3.6.2 Lemmatization
**`from nltk.stem import WordNetLemmatizer`**

The **WordNet lemmatizer only removes affixes if the resulting word is in its dictionary**. 

This additional checking process makes the lemmatizer **slower than the above stemmers**.

In [4]:
from nltk.stem import WordNetLemmatizer

# instantiate Lemmatizer
wnl = WordNetLemmatizer()

# apply lemmatizer to every token
print("Lemmatized tokens:\n{}".format([wnl.lemmatize(t) for t in tokens]))

Lemmatized tokens:
['DENNIS', ':', 'Listen', ',', 'strange', 'woman', 'lying', 'in', 'pond', 'distributing', 'sword', 'is', 'no', 'basis', 'for', 'a', 'system', 'of', 'government', '.', 'Supreme', 'executive', 'power', 'derives', 'from', 'a', 'mandate', 'from', 'the', 'mass', ',', 'not', 'from', 'some', 'farcical', 'aquatic', 'ceremony', '.']


Notice that it **doesn't handle lying**, because `lemmatize()` treats every token as a noun unless told otherwise, but it **converts women to woman**.

The WordNet lemmatizer is a good choice if you want to **compile the vocabulary of some texts and want a list of valid lemmas** (or lexicon headwords).

Another **normalization task** involves **identifying non-standard words including numbers, abbreviations, and dates, and mapping any such tokens to a special vocabulary**. 

For example, every decimal number could be mapped to a single token 0.0, and every acronym could be mapped to AAA. 

This **keeps the vocabulary small** and **improves the accuracy of many language modeling tasks**.
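
A rough sketch of this kind of mapping, using only the `re` module. The two patterns and the function name `normalize_token` are assumptions made for illustration, not part of NLTK; in particular, the acronym pattern would also catch ordinary all-caps words such as `AT`.

```python
import re

# Illustrative patterns (assumptions, not an NLTK API)
DECIMAL = re.compile(r"\d+\.\d+")                    # e.g. 12.40
ACRONYM = re.compile(r"(?:[A-Z]\.){2,}|[A-Z]{2,}")   # e.g. U.S.A. or NASA

def normalize_token(token):
    """Map decimal numbers to '0.0' and acronyms to 'AAA'; keep other tokens."""
    if DECIMAL.fullmatch(token):
        return "0.0"
    if ACRONYM.fullmatch(token):
        return "AAA"
    return token

tokens = ["NASA", "paid", "12.40", "and", "3.5", "to", "U.S.A.", "firms"]
normalized = [normalize_token(t) for t in tokens]
print(normalized)
# ['AAA', 'paid', '0.0', 'and', '0.0', 'to', 'AAA', 'firms']

# The vocabulary shrinks from 8 distinct tokens to 6
print(len(set(tokens)), len(set(normalized)))
```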

<a name="RegexTokens"></a>
## 3.7 Regular Expressions for Tokenizing Text
1. [Simple Approaches to Tokenization](#Simple)
2. [NLTK's Regular Expression Tokenizer](#Tokenizer)
3. [Further Issues with Tokenization](#Issues)

**Tokenization** is the task of **cutting a string into identifiable linguistic units that constitute a piece of language data**. 

Although it is a **fundamental task**, it has been delayed until now because **many corpora are already tokenized**, and because **NLTK includes some tokenizers**. 

Using **regular expressions** for tokenizing text provides **much more control over the process**.

<a name="Simple"></a>
###  3.7.1 Simple Approaches to Tokenization

The very **simplest method** for tokenizing text is to **split on whitespace**.

We could split this raw text on whitespace using **`raw.split()`**. 

To do the same using a regex, it is not enough to match any space characters in the string since this results in **tokens that contain a `\n`** (newline character). 

We need to **match any number of spaces, tabs, or newlines**.

In [5]:
import re

# define text
raw = """'When I'M a Duchess,' she said to herself, (not in a very hopeful tone
though), 'I won't have any pepper in my kitchen AT ALL. Soup does very
well without--Maybe it's always pepper that makes people hot-tempered,'..."""

# split on whitespace
print("Split on whitespace:\n{}\n".format(raw.split()))

# split on single space characters only (newlines stay inside tokens)
print("Split just on whitespace with regex:\n{}\n".format(re.split(r" ", raw)))

# split on regex pattern: one or more spaces, tabs, or newlines
print("Split on regex pattern:\n{}\n".format(re.split(r'[ \t\n]+', raw)))

# split on regex pattern: \s+ = one or more whitespace characters of any kind
print("Split on regex pattern 's+':\n{}\n".format(re.split(r'\s+', raw)))

Split on whitespace:
["'When", "I'M", 'a', "Duchess,'", 'she', 'said', 'to', 'herself,', '(not', 'in', 'a', 'very', 'hopeful', 'tone', 'though),', "'I", "won't", 'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL.', 'Soup', 'does', 'very', 'well', 'without--Maybe', "it's", 'always', 'pepper', 'that', 'makes', 'people', "hot-tempered,'..."]

Split just on whitespace with regex:
["'When", "I'M", 'a', "Duchess,'", 'she', 'said', 'to', 'herself,', '(not', 'in', 'a', 'very', 'hopeful', 'tone\nthough),', "'I", "won't", 'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL.', 'Soup', 'does', 'very\nwell', 'without--Maybe', "it's", 'always', 'pepper', 'that', 'makes', 'people', "hot-tempered,'..."]

Split on regex pattern:
["'When", "I'M", 'a', "Duchess,'", 'she', 'said', 'to', 'herself,', '(not', 'in', 'a', 'very', 'hopeful', 'tone', 'though),', "'I", "won't", 'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL.', 'Soup', 'does', 'very', 'well', 'without--Maybe', "it's", '

The regular expression `[ \t\n]+` matches one or more spaces, tabs (`\t`), or newlines (`\n`). Other whitespace characters, such as **carriage-return** and **form-feed**, should really be included too. 

Thus, a better idea would be to use a **built-in `re` abbreviation**, **`\s`**, which means **any whitespace character**. The above statement can be rewritten as `re.split(r'\s+', raw)`.
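
A small illustration of the difference, using a made-up string that contains a carriage return and a form feed:

```python
import re

# Hypothetical sample with a Windows line ending (\r\n) and a form feed (\f)
sample = "a\r\nb\fc d"

# The hand-written class misses \r and \f, so they stay inside the tokens
narrow = re.split(r'[ \t\n]+', sample)
print(narrow)  # ['a\r', 'b\x0cc', 'd']

# \s covers every whitespace character
broad = re.split(r'\s+', sample)
print(broad)   # ['a', 'b', 'c', 'd']
```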

Splitting on whitespace gives us tokens like **'(not'** and **'herself,'**. 

An alternative is to use the character class **`\w`** for word characters. For ASCII text it is equivalent to `[a-zA-Z0-9_]`; in Python 3 it also matches Unicode letters and digits unless the `re.ASCII` flag is set. 

It also **defines the complement** of this class **`\W`**, i.e. **all characters other than** letters, digits or underscore.

In [17]:
print("Split on regex using '\\W+':\n{}\n".format(re.split(r'\W+', raw)))

Split on regex using '\W+':
['', 'When', 'I', 'M', 'a', 'Duchess', 'she', 'said', 'to', 'herself', 'not', 'in', 'a', 'very', 'hopeful', 'tone', 'though', 'I', 'won', 't', 'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL', 'Soup', 'does', 'very', 'well', 'without', 'Maybe', 'it', 's', 'always', 'pepper', 'that', 'makes', 'people', 'hot', 'tempered', '']



Observe that this resulted in **empty strings** at the start and the end. 

We get the same tokens, but without the empty strings, with `re.findall(r'\w+', raw)`, using a **pattern that matches the words instead of the spaces**. 
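
A short comparison on a made-up string shows the two approaches side by side:

```python
import re

sample = "'Hello, world!'"

# Splitting on non-word runs leaves empty strings at both ends
split_tokens = re.split(r'\W+', sample)
print(split_tokens)  # ['', 'Hello', 'world', '']

# Matching the word runs directly returns only the words
found_tokens = re.findall(r'\w+', sample)
print(found_tokens)  # ['Hello', 'world']
```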

Now that we're matching the words, we're in a position to **extend the regular expression to cover a wider range of cases**.

**Reminder**: *Parentheses have a second function: to select substrings to be extracted.*

*To use parentheses only to **specify the scope of the disjunction**, without **selecting the material to be output**, add **`?:`** immediately after the opening parenthesis, which makes the group non-capturing.*
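
The effect is easiest to see with `re.findall()`, which returns the group contents instead of the whole match whenever the pattern contains a capturing group. A small sketch on a made-up string:

```python
import re

sample = "hot-tempered people"

# Capturing group: findall returns only what the group matched last
captured = re.findall(r"\w+(-\w+)*", sample)
print(captured)      # ['-tempered', '']

# Non-capturing group: findall returns the whole match
non_captured = re.findall(r"\w+(?:-\w+)*", sample)
print(non_captured)  # ['hot-tempered', 'people']
```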

In [29]:
# 1. match any sequence of word characters
# 2. or, match any non-whitespace character followed by zero or more word characters
print("Split on regex using '\\w+|\\S\\w*':\n{}\n".format(re.findall(r'\w+|\S\w*', raw)))

# generalize regex to include word-internal hyphens and apostrophes
# 1. match one or more word characters
# 2. followed by zero or more groups of "-" or "'" plus one or more word characters
# 3. or, match a single "'"
# 4. or, match one or more "-", ".", or "("
# 5. or, match any non-whitespace character
# 6. followed by zero or more word characters
print("Split on extended regex:\n{}\n".
      format(re.findall(r"\w+(?:[-']\w+)*|'|[-.(]+|\S\w*", raw)))

Split on regex using '\w+|\S\w*':
["'When", 'I', "'M", 'a', 'Duchess', ',', "'", 'she', 'said', 'to', 'herself', ',', '(not', 'in', 'a', 'very', 'hopeful', 'tone', 'though', ')', ',', "'I", 'won', "'t", 'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL', '.', 'Soup', 'does', 'very', 'well', 'without', '-', '-Maybe', 'it', "'s", 'always', 'pepper', 'that', 'makes', 'people', 'hot', '-tempered', ',', "'", '.', '.', '.']

Split on extended regex:
["'", 'When', "I'M", 'a', 'Duchess', ',', "'", 'she', 'said', 'to', 'herself', ',', '(', 'not', 'in', 'a', 'very', 'hopeful', 'tone', 'though', ')', ',', "'", 'I', "won't", 'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL', '.', 'Soup', 'does', 'very', 'well', 'without', '--', 'Maybe', "it's", 'always', 'pepper', 'that', 'makes', 'people', 'hot-tempered', ',', "'", '...']



![regex_symbols.PNG](attachment:regex_symbols.PNG)

<a name="Tokenizer"></a>
###  3.7.2 NLTK's Regular Expression Tokenizer
**`from nltk.tokenize import regexp_tokenize`**

The function `nltk.regexp_tokenize()` is similar to `re.findall()` (when used for tokenization), but **more efficient** for this task, and **avoids the need for special treatment of parentheses**. 

In [40]:
from nltk.tokenize import regexp_tokenize

text = 'That U.S.A. poster-print costs $12.40...'

# create pattern
pattern = r'''(?x)     # set flag to allow verbose regexps
    (?:[A-Z]\.)+       # abbreviations, e.g. U.S.A.
  | \w+(?:-\w+)*       # words with optional internal hyphens
  | \$?\d+(?:\.\d+)?%? # currency and percentages, e.g. $12.40, 82%
  | \.\.\.             # ellipsis
  | [][.,;"'?():_`-]   # these are separate tokens; includes ], [
'''

# call function
tokens = regexp_tokenize(text, pattern)
print(tokens)

['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']


The special **`(?x)` "verbose flag"** tells Python to **strip out the embedded whitespace and comments**.

When using the **verbose flag**, a literal **`' '`** in the pattern no longer matches a space character, so **`\s`** must be used instead. 
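
A quick check of this behaviour with the `re` module, on a made-up string:

```python
import re

text = "New York or NewYork"

# In verbose mode the spaces in the pattern are ignored,
# so this pattern is effectively "NewYork"
ignored_space = re.findall(r"(?x) New York", text)
print(ignored_space)   # ['NewYork']

# \s survives verbose mode and matches the space in the text
explicit_space = re.findall(r"(?x) New \s York", text)
print(explicit_space)  # ['New York']
```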

The `regexp_tokenize()` function has an **optional `gaps` parameter**. When set to `True`, the regular expression **specifies the gaps between tokens**, as with `re.split()`.

We can **evaluate a tokenizer by comparing the resulting tokens with a wordlist**, and **reporting any tokens that don't appear in the wordlist**.

In [44]:
from nltk.corpus import words

# lower-case tokens
tokens = [token.lower() for token in tokens]

# check for tokens that are not included in words
print("The tokens that are not included in the wordlist are:\n{}".
      format(set(tokens).difference(words.words())))

The tokens that are not included in the wordlist are:
{'poster-print', '...', 'costs', 'u.s.a.', '$12.40'}


<a name="Issues"></a>
###  3.7.3 Further Issues with Tokenization
Tokenization turns out to be a difficult task. No single solution works well across the board, and **we must decide what counts as a token depending on the application domain**.

When developing a tokenizer it helps to have access to **raw text which has been manually tokenized**, in order to compare the output of your tokenizer with high-quality (or "gold-standard") tokens. 

The NLTK corpus collection includes a sample of **Penn Treebank data**, including the raw **Wall Street Journal text** (`nltk.corpus.treebank_raw.raw()`) and the **tokenized version** (`nltk.corpus.treebank.words()`).
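
One simple way to make such a comparison is to count how many tokens the two versions share. The sketch below is an assumption about how this could be done, not an NLTK facility: `token_overlap` is a hypothetical helper, the gold tokens are written by hand for the example rather than taken from the Treebank sample, and the comparison ignores token order.

```python
from collections import Counter

def token_overlap(predicted, gold):
    """Return (precision, recall) of predicted tokens against gold tokens,
    treating both lists as multisets (token order is ignored)."""
    matched = sum((Counter(predicted) & Counter(gold)).values())
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(gold) if gold else 0.0
    return precision, recall

# Hand-written gold tokens for illustration only
gold = ["I", "did", "n't", "go", "."]

# A whitespace tokenizer produces ['I', "didn't", 'go.']
predicted = "I didn't go.".split()

precision, recall = token_overlap(predicted, gold)
print(precision, recall)  # only "I" matches: 1 of 3 predicted, 1 of 5 gold
```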

A final issue for tokenization is the presence of **contractions**, such as '**didn't**'. If we are analyzing the meaning of a sentence, it would probably be more useful to normalize this form to two separate forms: '**did**' and '**n't**' (or not). 

We can do this work with the help of a **lookup table**.
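
A minimal sketch of such a lookup table, assuming the text has already been tokenized. The table contents and the name `split_contractions` are hypothetical, and a real table would need many more entries. Note that the lookup is done on the lowercased token, so the original casing of a matched contraction is lost.

```python
# Hypothetical lookup table; keys are lowercase contractions
CONTRACTIONS = {
    "didn't": ["did", "n't"],
    "isn't": ["is", "n't"],
    "it's": ["it", "'s"],
}

def split_contractions(tokens):
    """Replace each known contraction with its parts; keep other tokens."""
    result = []
    for token in tokens:
        result.extend(CONTRACTIONS.get(token.lower(), [token]))
    return result

tokens = ["She", "didn't", "say", "it's", "hot"]
expanded = split_contractions(tokens)
print(expanded)
# ['She', 'did', "n't", 'say', 'it', "'s", 'hot']
```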