# Processing Raw Text
http://www.nltk.org/book/ch03.html

The goal of this chapter is to answer the following questions:

1. How can we write programs to access text from local files and from the web, in order to get hold of an unlimited range of language material?
2. How can we split documents up into individual words and punctuation symbols, so we can carry out the same kinds of analysis we did with text corpora in earlier chapters?
3. How can we write programs to produce formatted output and save it in a file?
- In order to address these questions, we will be covering key concepts in NLP, including tokenization and stemming. 
- Along the way you will consolidate your Python knowledge and learn about strings, files, and regular expressions. 
- Since so much text on the web is in HTML format, we will also see how to dispense with markup.

In [1]:
%matplotlib inline
%pprint
import nltk, re, pprint, matplotlib
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.corpus import brown
from nltk.corpus import wordnet as wn
from nltk.corpus import gutenberg
from nltk.book import *
from bs4 import BeautifulSoup
import feedparser
import os

Pretty printing has been turned OFF
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


## Accessing Text from the Web and from Disk

### Electronic Books
- A small sample of texts from Project Gutenberg appears in the NLTK corpus collection. 
- You can browse the catalog of 25,000 free online books at http://www.gutenberg.org/catalog/, and obtain a URL to an ASCII text file. 
- Although 90% of the texts in Project Gutenberg are in English, it includes material in over 50 other languages, including Catalan, Chinese, Dutch, Finnish, French, German, Italian, Portuguese and Spanish (with more than 100 texts each).

    - Text number 2554 is an English translation of Crime and Punishment, and we can access it as follows.

In [2]:
from urllib import request
#url = "http://www.gutenberg.org/files/2554/2554.txt"
url = "http://www.gutenberg.org/files/2554/2554-0.txt"
response = request.urlopen(url)
raw = response.read().decode('utf8')

type(raw)
len(raw)
raw[:75]

'\ufeffThe Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky\r'

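- Notice the \r at the end of the opening line, which is how Python displays the carriage return character (line feeds are displayed as \n); these line endings suggest the file was created on a Windows machine. The leading \ufeff is a Unicode byte order mark, which decoding with 'utf-8-sig' instead of 'utf8' would remove.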
- For our language processing, we want to break up the string into words and punctuation 
    - This step is called tokenization, and it produces our familiar structure, a list of words and punctuation.

In [3]:
tokens = word_tokenize(raw)
print(type(tokens))
print(len(tokens))
print(tokens[:100])

<class 'list'>
257726
['\ufeffThe', 'Project', 'Gutenberg', 'EBook', 'of', 'Crime', 'and', 'Punishment', ',', 'by', 'Fyodor', 'Dostoevsky', 'This', 'eBook', 'is', 'for', 'the', 'use', 'of', 'anyone', 'anywhere', 'at', 'no', 'cost', 'and', 'with', 'almost', 'no', 'restrictions', 'whatsoever', '.', 'You', 'may', 'copy', 'it', ',', 'give', 'it', 'away', 'or', 're-use', 'it', 'under', 'the', 'terms', 'of', 'the', 'Project', 'Gutenberg', 'License', 'included', 'with', 'this', 'eBook', 'or', 'online', 'at', 'www.gutenberg.org', 'Title', ':', 'Crime', 'and', 'Punishment', 'Author', ':', 'Fyodor', 'Dostoevsky', 'Release', 'Date', ':', 'March', '28', ',', '2006', '[', 'EBook', '#', '2554', ']', 'Last', 'Updated', ':', 'October', '27', ',', '2016', 'Language', ':', 'English', 'Character', 'set', 'encoding', ':', 'UTF-8', '***', 'START', 'OF', 'THIS', 'PROJECT', 'GUTENBERG']


- If we now take the further step of creating an NLTK text from this list, we can carry out all of the other linguistic processing we saw in Chapter 1, along with the regular list operations like slicing:

In [4]:
text = nltk.Text(tokens)
print(type(text))
print(text[1024:1062])
text.collocations()

<class 'nltk.text.Text'>
['an', 'exceptionally', 'hot', 'evening', 'early', 'in', 'July', 'a', 'young', 'man', 'came', 'out', 'of', 'the', 'garret', 'in', 'which', 'he', 'lodged', 'in', 'S.', 'Place', 'and', 'walked', 'slowly', ',', 'as', 'though', 'in', 'hesitation', ',', 'towards', 'K.', 'bridge', '.', 'He', 'had', 'successfully']
Katerina Ivanovna; Pyotr Petrovitch; Pulcheria Alexandrovna; Avdotya
Romanovna; Rodion Romanovitch; Marfa Petrovna; Sofya Semyonovna; old
woman; Project Gutenberg-tm; Porfiry Petrovitch; Amalia Ivanovna;
great deal; young man; Nikodim Fomitch; Ilya Petrovitch; Project
Gutenberg; Andrey Semyonovitch; Hay Market; Dmitri Prokofitch; Good
heavens


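- The find() and rfind() ("reverse find") methods help us get the right index values to use for slicing the string. We overwrite raw with this slice so that it runs from "PART I" up to (but not including) the phrase that marks the end of the content. Both methods return -1 when the phrase is not found, as rfind() does in the output below, so hard-coded offsets such as 5338 and 1157743 should be checked against the file that was actually downloaded.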

In [5]:
print(raw.find("PART I"))
print(raw.rfind("End of Project Gutenberg's Crime"))
raw = raw[5338:1157743]
print(raw.find("PART I"))

5336
-1
195767
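Because the hard-coded offsets above depend on the exact file that was downloaded, one option is to compute the slice from the marker positions and to handle the -1 that find() and rfind() return when a phrase is absent. The following is a minimal sketch on a made-up string rather than the real book; the helper name `slice_between` and the sample text are assumptions for illustration only.

```python
def slice_between(text, start_marker, end_marker):
    """Return text from start_marker up to (not including) end_marker.

    A missing marker (find/rfind returning -1) falls back to the
    corresponding end of the string instead of silently mis-slicing.
    """
    start = text.find(start_marker)
    end = text.rfind(end_marker)
    if start == -1:
        start = 0
    if end == -1:
        end = len(text)
    return text[start:end]


# Toy stand-in for the downloaded book, not real Gutenberg content.
sample = "License header. PART I On a hot evening... End of Project footer."
body = slice_between(sample, "PART I", "End of Project")
print(body.find("PART I"))      # position of the marker in the trimmed text

# When the end marker is missing, the slice runs to the end of the string.
missing = slice_between(sample, "PART I", "no such phrase")
print(missing)
```

Falling back to the ends of the string is only one possible policy; raising an error when a marker is missing may be preferable if a silent fallback would hide a changed file format.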


## Dealing with HTML

- Much of the text on the web is in the form of HTML documents. You can use a web browser to save a page as text to a local file, then access this as described in the section on files below.
- However, if you're going to do this often, it's easiest to get Python to do the work directly. 
- The first step is the same as before, using urlopen. 
- For fun we'll pick a BBC News story called Blondes to die out in 200 years, an urban legend passed along by the BBC as established scientific fact:

In [6]:
url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
html = request.urlopen(url).read().decode('utf8')
html[:60]
#html

'<!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN'

To get text out of HTML we will use a Python library called BeautifulSoup, available from http://www.crummy.com/software/BeautifulSoup/:

raw = BeautifulSoup(html, 'html.parser').get_text()
tokens = word_tokenize(raw)
tokens

In [7]:
tokens = tokens[110:390]
text = nltk.Text(tokens)
text.concordance('gene')

no matches
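To make concrete what "getting text out of HTML" involves, here is a small sketch that uses only the standard library's html.parser module instead of BeautifulSoup. It is not how BeautifulSoup works internally and handles far fewer cases; the class name `TextExtractor` and the HTML snippet are made up for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


# Made-up snippet standing in for a downloaded page.
sample_html = ("<html><head><style>p {color: red}</style></head>"
               "<body><h1>Blondes</h1><p>A <b>gene</b> story.</p></body></html>")
extractor = TextExtractor()
extractor.feed(sample_html)
extractor.close()

# Join the text nodes and collapse runs of whitespace.
plain = " ".join(" ".join(extractor.chunks).split())
print(plain)
```

For real pages, BeautifulSoup's get_text() remains the more practical choice, since it copes with malformed markup that this sketch does not attempt to handle.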


## Processing RSS Feeds
- The blogosphere is an important source of text, in both formal and informal registers. 
- With the help of a Python library called the Universal Feed Parser, available from https://pypi.python.org/pypi/feedparser, we can access the content of a blog, as shown below:


In [26]:
llog = feedparser.parse("http://languagelog.ldc.upenn.edu/nll/?feed=atom")
print(llog['feed']['title'])
print(len(llog.entries))
post = llog.entries[2]
print(post.title)
content = post.content[0].value
print(content[:70])

Language Log
13
Gilded Age diglossia
<p><a href="http://languagelog.ldc.upenn.edu/myl/InaCoolbrith1.jpg"><i


## Reading Local Files
- In order to read a local file, we need to use Python's built-in open() function, followed by the read() method. 

In [30]:
f = open('document.txt')
raw = f.read()
print(raw)

This is line 1
This is line 2
This is line 3
Bye


In [32]:
os.listdir('.')

['.ipynb_checkpoints',
 'document.txt',
 'img',
 'nltk_ch01_hw1.ipynb',
 'nltk_ch01_hw2.ipynb',
 'nltk_ch02_hw1.ipynb',
 'nltk_ch02_hw2.ipynb',
 'nltk_ch02_hw3.ipynb',
 'nltk_ch02_hw4.ipynb',
 'nltk_ch02_hw5.ipynb',
 'nltk_chapter02_hw2.ipynb',
 'nltk_chapter03.ipynb',
 'nltk_chapter2.ipynb']

Recall that the '\n' characters are newlines; this is equivalent to pressing Enter on a keyboard and starting a new line.
We can also read a file one line at a time using a for loop:

In [34]:
f = open('document.txt', 'r')
for line in f:
    print(line.strip())

This is line 1
This is line 2
This is line 3
Bye
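The cells above leave the file object open. A common alternative is a with-block, which closes the file automatically. The sketch below writes a throwaway file first so that it is self-contained; the file name, its contents, and the variable names are assumptions for illustration.

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "example.txt")
with open(demo_path, "w", encoding="utf-8") as out:
    out.write("This is line 1\nThis is line 2\nBye\n")

# The with-block closes the file automatically, even if an error occurs.
lines = []
with open(demo_path, encoding="utf-8") as demo_file:
    for line in demo_file:
        lines.append(line.strip())   # strip() removes the trailing '\n'

print(lines)
print(demo_file.closed)
```

Passing encoding explicitly avoids depending on the platform's default encoding.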


In [35]:
path = nltk.data.find('corpora/gutenberg/melville-moby_dick.txt')
raw = open(path, 'r').read()

In [36]:
raw[100:200]

' School)\n\nThe pale Usher--threadbare in coat, heart, body, and brain; I see him\nnow.  He was ever du'

## Capturing User Input

In [37]:
s = input("Enter some text: ")
print("You typed", len(word_tokenize(s)), "words.")

Enter some text: Buna
You typed 1 words.


![title](img/nlp_pipeline.png)
- When we tokenize a string we produce a list of words, which has Python's ```list``` type. 
- Normalizing and sorting lists produces other lists.
- The type of an object determines what operations you can perform on it. 
- So, for example, we can append to a list but not to a string.
- Similarly, we can concatenate strings with strings, and lists with lists, but we cannot concatenate strings with lists.
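The points about types can be checked directly. This short sketch uses made-up values (`tokens_demo` and `title` are illustrative names, not part of the chapter) to show which operations each type supports.

```python
tokens_demo = ["Monty", "Python"]   # a list, as produced by tokenization
title = "Monty Python"              # a string

tokens_demo.append("!")             # lists support append()
print(tokens_demo)

print(hasattr(title, "append"))     # strings have no append() method

try:
    title + tokens_demo             # str + list is not defined
except TypeError as err:
    print("TypeError:", err)

# Converting the list to a string first makes the concatenation well-typed.
combined = title + " " + " ".join(tokens_demo)
print(combined)
```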

In [39]:
raw = open('document.txt').read()
type(raw)

str

In [40]:
tokens = word_tokenize(raw)
print(type(tokens))
words = [w.lower() for w in tokens]
print(type(words))
vocab = sorted(set(words))
print(type(vocab))

<class 'list'>
<class 'list'>
<class 'list'>


In [42]:
raw = gutenberg.raw('melville-moby_dick.txt')
fdist = nltk.FreqDist(ch.lower() for ch in raw if ch.isalpha())
fdist.most_common(5)

[('e', 117092), ('t', 87996), ('a', 77916), ('o', 69326), ('n', 65617)]

![title](img/string_methods.png)

## Text Processing with Unicode
- Our programs will often need to deal with different languages, and different character sets. The concept of "plain text" is a fiction. If you live in the English-speaking world you probably use ASCII, possibly without realizing it. If you live in Europe you might use one of the extended Latin character sets, containing such characters as "ø" for Danish and Norwegian, "ő" for Hungarian, "ñ" for Spanish and Breton, and "ň" for Czech and Slovak. In this section, we will give an overview of how to use Unicode for processing texts that use non-ASCII character sets.

### What is Unicode?
- Unicode supports over a million characters. Each character is assigned a number, called a code point. In Python, code points are written in the form \uXXXX, where XXXX is the number in 4-digit hexadecimal form.

- Within a program, we can manipulate Unicode strings just like normal strings. However, when Unicode characters are stored in files or displayed on a terminal, they must be encoded as a stream of bytes. Some encodings (such as ASCII and Latin-2) use a single byte per code point, so they can only support a small subset of Unicode, enough for a single language. Other encodings (such as UTF-8) use multiple bytes and can represent the full range of Unicode characters.

- Text in files will be in a particular encoding, so we need some mechanism for translating it into Unicode — translation into Unicode is called decoding. Conversely, to write out Unicode to a file or a terminal, we first need to translate it into a suitable encoding — this translation out of Unicode is called encoding, and is illustrated in 3.3.
![title](img/unicode.png)
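The difference between a code point and its encoded bytes can be seen with a single character. The sketch below uses the Polish letter n with acute accent (code point U+0144), which also appears in the sample text later in this section; the variable names are illustrative.

```python
nacute = "\u0144"                  # LATIN SMALL LETTER N WITH ACUTE
print(hex(ord(nacute)))            # the code point as a hexadecimal number

# Encoding: Unicode string -> bytes. The result depends on the encoding.
as_latin2 = nacute.encode("latin2")   # one byte in ISO-8859-2
as_utf8 = nacute.encode("utf-8")      # two bytes in UTF-8
print(as_latin2, as_utf8)

# Decoding: bytes -> Unicode string, using the encoding they were written in.
print(as_latin2.decode("latin2") == nacute)
print(as_utf8.decode("utf-8") == nacute)

# ASCII cannot represent this character at all.
try:
    nacute.encode("ascii")
except UnicodeEncodeError as err:
    print("cannot encode:", err.reason)
```

Decoding bytes with the wrong encoding either raises an error or silently yields the wrong characters, which is why the encoding is passed explicitly to open() in the cells that follow.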

In [43]:
path = nltk.data.find('corpora/unicode_samples/polish-lat2.txt')

In [46]:
f = open(path, encoding='latin2')
for line in f:
    line = line.strip()
    print(line)

Pruska Biblioteka Państwowa. Jej dawne zbiory znane pod nazwą
"Berlinka" to skarb kultury i sztuki niemieckiej. Przewiezione przez
Niemców pod koniec II wojny światowej na Dolny Śląsk, zostały
odnalezione po 1945 r. na terytorium Polski. Trafiły do Biblioteki
Jagiellońskiej w Krakowie, obejmują ponad 500 tys. zabytkowych
archiwaliów, m.in. manuskrypty Goethego, Mozarta, Beethovena, Bacha.


In [47]:
f = open(path, encoding='latin2')
for line in f:
    line = line.strip()
    print(line.encode('unicode_escape'))

b'Pruska Biblioteka Pa\\u0144stwowa. Jej dawne zbiory znane pod nazw\\u0105'
b'"Berlinka" to skarb kultury i sztuki niemieckiej. Przewiezione przez'
b'Niemc\\xf3w pod koniec II wojny \\u015bwiatowej na Dolny \\u015al\\u0105sk, zosta\\u0142y'
b'odnalezione po 1945 r. na terytorium Polski. Trafi\\u0142y do Biblioteki'
b'Jagiello\\u0144skiej w Krakowie, obejmuj\\u0105 ponad 500 tys. zabytkowych'
b'archiwali\\xf3w, m.in. manuskrypty Goethego, Mozarta, Beethovena, Bacha.'


## Regular Expressions for Detecting Word Patterns
- Many linguistic processing tasks involve pattern matching. 
- For example, we can find words ending with ed using endswith('ed'). We saw a variety of such "word tests" in 4.2. 
- Regular expressions give us a more powerful and flexible method for describing the character patterns we are interested in.

In [8]:
wordlist = [w for w in nltk.corpus.words.words('en') if w.islower()]

In [9]:
wordlist[100:200]

['abdicable', 'abdicant', 'abdicate', 'abdication', 'abdicative', 'abdicator', 'abditive', 'abditory', 'abdomen', 'abdominal', 'abdominalian', 'abdominally', 'abdominoanterior', 'abdominocardiac', 'abdominocentesis', 'abdominocystic', 'abdominogenital', 'abdominohysterectomy', 'abdominohysterotomy', 'abdominoposterior', 'abdominoscope', 'abdominoscopy', 'abdominothoracic', 'abdominous', 'abdominovaginal', 'abdominovesical', 'abduce', 'abducens', 'abducent', 'abduct', 'abduction', 'abductor', 'abeam', 'abear', 'abearance', 'abecedarian', 'abecedarium', 'abecedary', 'abed', 'abeigh', 'abele', 'abelite', 'abelmosk', 'abeltree', 'abenteric', 'abepithymia', 'aberdevine', 'aberrance', 'aberrancy', 'aberrant', 'aberrate', 'aberration', 'aberrational', 'aberrator', 'aberrometer', 'aberroscope', 'aberuncator', 'abet', 'abetment', 'abettal', 'abettor', 'abevacuation', 'abey', 'abeyance', 'abeyancy', 'abeyant', 'abfarad', 'abhenry', 'abhiseka', 'abhominable', 'abhor', 'abhorrence', 'abhorrency', 

- Let's find words ending with ed using the regular expression «ed$» 
- We will use the re.search(p, s) function to check whether the pattern p can be found somewhere inside the string s. 
- We need to specify the characters of interest, and use the dollar sign, which has a special behavior in regular expressions: it matches the end of the word.

In [10]:
[w for w in wordlist if re.search('ed$', w)][100:200]

['amiced', 'amphitheatered', 'ampullated', 'amused', 'anchored', 'angled', 'anguiped', 'anguished', 'angulated', 'angulinerved', 'anhungered', 'animated', 'aniseed', 'annodated', 'annulated', 'anomaliped', 'anserated', 'anteflected', 'anteflexed', 'antimoniated', 'antimoniureted', 'antimoniuretted', 'antiquated', 'antired', 'antiweed', 'antlered', 'apertured', 'apexed', 'apicifixed', 'apiculated', 'apocopated', 'apostrophied', 'appearanced', 'appellatived', 'appendaged', 'appendiculated', 'applied', 'appressed', 'aralkylated', 'arbored', 'arched', 'architraved', 'arcked', 'arcuated', 'ared', 'areolated', 'ariled', 'arillated', 'armchaired', 'armed', 'armied', 'armillated', 'armored', 'armoried', 'arpeggiated', 'arpeggioed', 'arrased', 'arrowed', 'arrowheaded', 'arrowweed', 'arseneted', 'arsenetted', 'arseniureted', 'articled', 'articulated', 'ashamed', 'ashlared', 'ashweed', 'aspersed', 'asphyxied', 'assented', 'assessed', 'assigned', 'assistanted', 'associated', 'assonanced', 'assorte

The . wildcard symbol matches any single character. Suppose we have room in a crossword puzzle for an 8-letter word with j as its third letter and t as its sixth letter. In place of each blank cell we use a period:

In [11]:
[w for w in wordlist if re.search('^..j..t..$', w)]

['abjectly', 'adjuster', 'dejected', 'dejectly', 'injector', 'majestic', 'objectee', 'objector', 'rejecter', 'rejector', 'unjilted', 'unjolted', 'unjustly']

- Finally, the ? symbol specifies that the previous character is optional. 
    - Thus «^e-?mail$» will match both email and e-mail. 
    - We could count the total number of occurrences of this word (in either spelling) in a text using  
    ```sum(1 for w in text if re.search('^e-?mail$', w))```.
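As a quick check of the optional-character pattern, the sketch below runs it over a handful of made-up tokens; `sample_tokens` stands in for a tokenized text and is not taken from any corpus.

```python
import re

# Made-up tokens; in the text above this would be a tokenized corpus.
sample_tokens = ["email", "e-mail", "mail", "e--mail", "Email", "emails", "e-mail"]

pattern = r"^e-?mail$"   # the hyphen may appear zero times or once
matches = [w for w in sample_tokens if re.search(pattern, w)]
count = sum(1 for w in sample_tokens if re.search(pattern, w))

print(matches)
print(count)
```

The anchors matter here: without ^ and $ the pattern would also match inside emails, and the match is case-sensitive, so Email is skipped unless the tokens are lowercased first.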

- The T9 system is used for entering text on mobile phones. Two or more words that are entered with the same sequence of keystrokes are known as textonyms. For example, both hole and golf are entered by pressing the sequence 4653. What other words could be produced with the same sequence? Here we use the regular expression «^[ghi][mno][jlk][def]$»:
![title](img/text_keys.png)

In [55]:
[w for w in wordlist if re.search('^[ghi][mno][jlk][def]$', w)]

['gold', 'golf', 'hold', 'hole']
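The same idea can be generalized by building the pattern from any key sequence. This is a sketch only: the helper name `t9_pattern` and the short word list are assumptions, while the letter groups follow the standard phone keypad layout.

```python
import re

# Standard phone keypad letter groups (keys 0 and 1 carry no letters).
T9_KEYS = {
    "2": "abc", "3": "def", "4": "ghi", "5": "jkl",
    "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz",
}


def t9_pattern(digits):
    """Build a regular expression matching words typed with these keys."""
    return "^" + "".join("[" + T9_KEYS[d] + "]" for d in digits) + "$"


# Small made-up word list standing in for the Words Corpus.
sample_words = ["gold", "golf", "good", "hold", "hole", "home", "golfs"]
pattern = t9_pattern("4653")
textonyms = [w for w in sample_words if re.search(pattern, w)]
print(pattern)
print(textonyms)
```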

- Look for some "finger-twisters", by searching for words that only use part of the number-pad.
- For example «^[ghijklmno]+$», or more concisely, «^[g-o]+$», will match words that only use keys 4, 5, 6 in the center row, and «^[a-fj-o]+$» will match words that use keys 2, 3, 5, 6 in the top-right corner. What do - and + mean?

In [16]:
[w for w in wordlist if re.search('^[a-fj-o]+$', w)][:100]

['a', 'aa', 'aal', 'aam', 'aba', 'abac', 'abaca', 'aback', 'abaff', 'abalone', 'abandon', 'abandonable', 'abandoned', 'abandonee', 'abb', 'abdal', 'abdomen', 'abeam', 'abed', 'abele', 'able', 'abloom', 'abode', 'abolla', 'aboma', 'aboon', 'academe', 'acana', 'acca', 'accede', 'accedence', 'accend', 'accolade', 'accoladed', 'accolle', 'accommodable', 'ace', 'ackman', 'acle', 'acme', 'acne', 'acnodal', 'acnode', 'acock', 'acold', 'acoma', 'acone', 'ad', 'adad', 'adance', 'add', 'adda', 'addable', 'added', 'addend', 'addenda', 'addle', 'ade', 'adead', 'adeem', 'adenocele', 'adenoma', 'adman', 'ado', 'adobe', 'ae', 'aefald', 'aenean', 'aeon', 'aface', 'affa', 'affable', 'aflame', 'afoam', 'ajaja', 'ak', 'aka', 'akala', 'ake', 'akeake', 'akee', 'aknee', 'ako', 'al', 'ala', 'alack', 'alada', 'alala', 'alameda', 'alamo', 'alan', 'aland', 'alb', 'alba', 'alban', 'albe', 'albedo', 'albee', 'alcalde', 'alcanna']

Let's explore the + symbol a bit further. Notice that it can be applied to individual letters, or to bracketed sets of letters:

In [17]:
chat_words = sorted(set(w for w in nltk.corpus.nps_chat.words()))
[w for w in chat_words if re.search('^m+i+n+e+$', w)]

['miiiiiiiiiiiiinnnnnnnnnnneeeeeeeeee', 'miiiiiinnnnnnnnnneeeeeeee', 'mine', 'mmmmmmmmiiiiiiiiinnnnnnnnneeeeeeee']

In [18]:
[w for w in chat_words if re.search('^[ha]+$', w)]

['a', 'aaaaaaaaaaaaaaaaa', 'aaahhhh', 'ah', 'ahah', 'ahahah', 'ahh', 'ahhahahaha', 'ahhh', 'ahhhh', 'ahhhhhh', 'ahhhhhhhhhhhhhh', 'h', 'ha', 'haaa', 'hah', 'haha', 'hahaaa', 'hahah', 'hahaha', 'hahahaa', 'hahahah', 'hahahaha', 'hahahahaaa', 'hahahahahaha', 'hahahahahahaha', 'hahahahahahahahahahahahahahahaha', 'hahahhahah', 'hahhahahaha']

- The ```+``` symbol simply means "one or more instances of the preceding item", which could be an individual character like m, a set like [fed] or a range like [d-f].
- Now let's replace ```+``` with ```*```, which means "zero or more instances of the preceding item". 
    - The regular expression ```^m*i*n*e*$``` will match everything that we found using ```^m+i+n+e+$```, but also words where some of the letters don't appear at all, e.g. me, min, and mmmmm. 
    - Note that the + and * symbols are sometimes referred to as Kleene closures, or simply closures.
- The ^ operator has another function when it appears as the first character inside square brackets. 
    - For example «[^aeiouAEIOU]» matches any character other than a vowel. 
    - We can search the NPS Chat Corpus for words that are made up entirely of non-vowel characters using «^[^aeiouAEIOU]+$» to find items like these: :):):), grrr, cyb3r and zzzzzzzz. Notice this includes non-alphabetic characters.
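The three ideas in this list (one or more, zero or more, and a negated character set) can be compared side by side on a few made-up tokens. The list `sample` below is illustrative and is not drawn from the NPS Chat Corpus.

```python
import re

# Made-up tokens in the style of chat text.
sample = ["mine", "miiinne", "me", "min", "mmmmm", "mien",
          "grrr", "cyb3r", ":)", "hello"]

# + requires at least one of each letter, in order.
plus = [w for w in sample if re.search(r"^m+i+n+e+$", w)]

# * also allows a letter to be absent, but the order is still fixed.
star = [w for w in sample if re.search(r"^m*i*n*e*$", w)]

# A leading ^ inside brackets negates the set: no vowels anywhere.
no_vowels = [w for w in sample if re.search(r"^[^aeiouAEIOU]+$", w)]

print(plus)
print(star)
print(no_vowels)
```

Note that mien fails both of the first two patterns, because the letters must appear in the order m, i, n, e.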

- Here are some more examples of regular expressions being used to find tokens that match a particular pattern, illustrating the use of some new symbols: \, {}, (), and |:

In [21]:
wsj = sorted(set(nltk.corpus.treebank.words()))
print(wsj[:100])
print([w for w in wsj if re.search(r'^[0-9]+\.[0-9]+$', w)][:20])                # decimal numbers
print([w for w in wsj if re.search(r'^[A-Z]+\$$', w)][:20])                      # capitals followed by a literal $
print([w for w in wsj if re.search(r'^[0-9]{4}$', w)][:20])                      # exactly four digits
print([w for w in wsj if re.search(r'^[0-9]+-[a-z]{3,5}$', w)][:20])             # number-word, word of 3 to 5 letters
print([w for w in wsj if re.search(r'^[a-z]{5,}-[a-z]{2,3}-[a-z]{,6}$', w)][:20])  # three hyphenated parts
print([w for w in wsj if re.search(r'(ed|ing)$', w)][:20])                       # ends in ed or ing

['!', '#', '$', '%', '&', "'", "''", "'30s", "'40s", "'50s", "'80s", "'82", "'86", "'S", "'d", "'ll", "'m", "'re", "'s", "'ve", '*', '*-1', '*-10', '*-100', '*-101', '*-102', '*-103', '*-104', '*-105', '*-106', '*-107', '*-108', '*-109', '*-11', '*-110', '*-111', '*-112', '*-113', '*-114', '*-115', '*-116', '*-117', '*-118', '*-119', '*-12', '*-120', '*-121', '*-122', '*-123', '*-124', '*-125', '*-126', '*-127', '*-128', '*-129', '*-13', '*-130', '*-131', '*-132', '*-133', '*-134', '*-135', '*-136', '*-137', '*-138', '*-139', '*-14', '*-140', '*-141', '*-142', '*-144', '*-145', '*-146', '*-147', '*-149', '*-15', '*-150', '*-151', '*-152', '*-153', '*-154', '*-155', '*-156', '*-157', '*-158', '*-159', '*-16', '*-160', '*-161', '*-162', '*-163', '*-164', '*-165', '*-166', '*-17', '*-18', '*-19', '*-2', '*-20', '*-21']
['0.0085', '0.05', '0.1', '0.16', '0.2', '0.25', '0.28', '0.3', '0.4', '0.5', '0.50', '0.54', '0.56', '0.60', '0.7', '0.82', '0.84', '0.9', '0.95', '0.99']
['C$', 'US$']
['

- A backslash means that the following character is deprived of its special powers and must literally match a specific character in the word. 
- Thus, while . is special, \. only matches a period. 
- The braced expressions, like {3,5}, specify the number of repeats of the previous item. 
- The pipe character indicates a choice between the material on its left or its right. 
- Parentheses indicate the scope of an operator: they can be used together with the pipe (or disjunction) symbol like this: «w(i|e|ai|oo)t», matching wit, wet, wait, and woot. 
- It is instructive to see what happens when you omit the parentheses from the last expression above, and search for «ed|ing$».
![title](img/regular_exp.png)
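The effect of omitting the parentheses can be seen on a few made-up words (the list `sample` is illustrative). Without grouping, the $ anchor belongs only to the second alternative.

```python
import re

sample = ["walked", "eddy", "reedy", "sing", "singer", "wit", "woot", "wt"]

# Parentheses limit the disjunction, so $ applies to both alternatives.
grouped = [w for w in sample if re.search(r"(ed|ing)$", w)]

# Without them the pattern reads: "ed" anywhere, OR "ing" at the end.
ungrouped = [w for w in sample if re.search(r"ed|ing$", w)]

# A disjunction in the middle of a pattern, as in w(i|e|ai|oo)t.
choices = [w for w in sample if re.search(r"^w(i|e|ai|oo)t$", w)]

print(grouped)
print(ungrouped)
print(choices)
```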
- To the Python interpreter, a regular expression is just like any other string. If the string contains a backslash followed by particular characters, it will interpret these specially. 
    - For example ```\b``` would be interpreted as the backspace character. 
    - In general, when using regular expressions containing backslash, we should instruct the interpreter not to look inside the string at all, but simply to pass it directly to the re library for processing.
    - We do this by prefixing the string with the letter r, to indicate that it is a raw string. 
    - For example, the raw string ```r'\band\b'``` contains two ```\b``` symbols that are interpreted by the re library as matching word boundaries instead of backspace characters. 
    - If you get into the habit of using r'...' for regular expressions — as we will do from now on — you will avoid having to think about these complications.
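The difference between an ordinary string and a raw string is easy to verify. In this sketch the sentence and variable names are made up; the point is that the two patterns differ in length and behave differently.

```python
import re

sentence = "the band played and the sand shifted"

plain_pattern = "\band\b"    # "\b" here is the backspace character
raw_pattern = r"\band\b"     # backslash and 'b' reach the re library intact

print(len(plain_pattern), len(raw_pattern))   # 5 characters versus 7

no_hits = re.findall(plain_pattern, sentence)    # backspace never occurs
whole_word = re.findall(raw_pattern, sentence)   # only the standalone word
anywhere = re.findall(r"and", sentence)          # also inside band and sand

print(no_hits)
print(whole_word)
print(anywhere)
```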

## Useful Applications of Regular Expressions
- The above examples all involved searching for words w that match some regular expression regexp using re.search(regexp, w). Apart from checking if a regular expression matches a word, we can use regular expressions to extract material from words, or to modify words in specific ways.

## Extracting Word Pieces
- The re.findall() ("find all") function finds all (non-overlapping) matches of the given regular expression. Let's find all the vowels in a word, then count them:

In [24]:
word = 'supercalifragilisticexpialidocious'
print(re.findall(r'[aeiou]', word))
print(len(re.findall(r'[aeiou]', word)))

['u', 'e', 'a', 'i', 'a', 'i', 'i', 'i', 'e', 'i', 'a', 'i', 'o', 'i', 'o', 'u']
16


In [25]:
wsj = sorted(set(nltk.corpus.treebank.words()))
fd = nltk.FreqDist(vs for word in wsj
                     for vs in re.findall(r'[aeiou]{2,}', word))
fd.most_common(12)

[('io', 549), ('ea', 476), ('ie', 331), ('ou', 329), ('ai', 261), ('ia', 253), ('ee', 217), ('oo', 174), ('ua', 109), ('au', 106), ('ue', 105), ('ui', 95)]

## Doing More with Word Pieces
- Once we can use re.findall() to extract material from words, there are interesting things to do with the pieces, like gluing them back together or plotting them.

- It is sometimes noted that English text is highly redundant, and it is still easy to read when word-internal vowels are left out. 
- For example, declaration becomes dclrtn, and inalienable becomes inlnble, retaining any initial or final vowel sequences. The regular expression in our next example matches initial vowel sequences, final vowel sequences, and all consonants; everything else is ignored. This three-way disjunction is processed left-to-right: if one of the three parts matches the word, any later parts of the regular expression are ignored. We use re.findall() to extract all the matching pieces, and ''.join() to join them together (see 3.9 for more about the join operation).

In [27]:
regexp = r'^[AEIOUaeiou]+|[AEIOUaeiou]+$|[^AEIOUaeiou]'
def compress(word):
    pieces = re.findall(regexp, word)
    return ''.join(pieces)

english_udhr = nltk.corpus.udhr.words('English-Latin1')
print(nltk.tokenwrap(compress(w) for w in english_udhr[:75]))

Unvrsl Dclrtn of Hmn Rghts Prmble Whrs rcgntn of the inhrnt dgnty and
of the eql and inlnble rghts of all mmbrs of the hmn fmly is the fndtn
of frdm , jstce and pce in the wrld , Whrs dsrgrd and cntmpt fr hmn
rghts hve rsltd in brbrs acts whch hve outrgd the cnscnce of mnknd ,
and the advnt of a wrld in whch hmn bngs shll enjy frdm of spch and


Next, let's combine regular expressions with conditional frequency distributions. Here we will extract all consonant-vowel sequences from the words of Rotokas, such as ka and si. Since each of these is a pair, it can be used to initialize a conditional frequency distribution. We then tabulate the frequency of each pair:

In [28]:
rotokas_words = nltk.corpus.toolbox.words('rotokas.dic')
cvs = [cv for w in rotokas_words for cv in re.findall(r'[ptksvr][aeiou]', w)]
cfd = nltk.ConditionalFreqDist(cvs)
cfd.tabulate()

    a   e   i   o   u 
k 418 148  94 420 173 
p  83  31 105  34  51 
r 187  63  84  89  79 
s   0   0 100   2   1 
t  47   8   0 148  37 
v  93  27 105  48  49 


- Examining the rows for s and t, we see they are in partial "complementary distribution", which is evidence that they are not distinct phonemes in the language. Thus, we could conceivably drop s from the Rotokas alphabet and simply have a pronunciation rule that the letter t is pronounced s when followed by i. (Note that the single entry having su, namely kasuari, 'cassowary' is borrowed from English.)
- If we want to be able to inspect the words behind the numbers in the above table, it would be helpful to have an index, allowing us to quickly find the list of words that contains a given consonant-vowel pair, e.g. cv_index['su'] should give us all words containing su. Here's how we can do this:

In [29]:
cv_word_pairs = [(cv, w) for w in rotokas_words
           for cv in re.findall(r'[ptksvr][aeiou]', w)]
cv_index = nltk.Index(cv_word_pairs)
print(cv_index['su'])
print(cv_index['po'])

['kasuari']
['kaapo', 'kaapopato', 'kaipori', 'kaiporipie', 'kaiporivira', 'kapo', 'kapoa', 'kapokao', 'kapokapo', 'kapokapo', 'kapokapoa', 'kapokapoa', 'kapokapora', 'kapokapora', 'kapokaporo', 'kapokaporo', 'kapokari', 'kapokarito', 'kapokoa', 'kapoo', 'kapooto', 'kapoovira', 'kapopaa', 'kaporo', 'kaporo', 'kaporopa', 'kaporoto', 'kapoto', 'karokaropo', 'karopo', 'kepo', 'kepoi', 'keposi', 'kepoto']


## Finding Word Stems
When we use a web search engine, we usually don't mind (or even notice) if the words in the document differ from our search terms in having different endings. A query for laptops finds documents containing laptop and vice versa. Indeed, laptop and laptops are just two forms of the same dictionary word (or lemma). For some language processing tasks we want to ignore word endings, and just deal with word stems.

There are various ways we can pull out the stem of a word. Here's a simple-minded approach which just strips off anything that looks like a suffix:

In [31]:
def stem(word):
    for suffix in ['ing', 'ly', 'ed', 'ious', 'ies', 'ive', 'es', 's', 'ment']:
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word

In [32]:
re.findall(r'^.*(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing')

['ing']

Here, re.findall() just gave us the suffix even though the regular expression matched the entire word. This is because the parentheses have a second function, to select substrings to be extracted. If we want to use the parentheses to specify the scope of the disjunction, but not to select the material to be output, we have to add ?:, which is just one of many arcane subtleties of regular expressions. Here's the revised version.

In [33]:
re.findall(r'^.*(?:ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing')

['processing']

In [35]:
re.findall(r'^(.*)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing')

[('process', 'ing')]

In [36]:
re.findall(r'^(.*)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processes')

[('processe', 's')]

The regular expression incorrectly found an -s suffix instead of an -es suffix. This demonstrates another subtlety: the star operator is "greedy" and the .* part of the expression tries to consume as much of the input as possible. If we use the "non-greedy" version of the star operator, written *?, we get what we want:

In [37]:
re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processes')

[('process', 'es')]

In [38]:
re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$', 'language')

[('language', '')]

In [39]:
def stem(word):
    regexp = r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$'
    stem, suffix = re.findall(regexp, word)[0]
    return stem

raw = """DENNIS: Listen, strange women lying in ponds distributing swords
     is no basis for a system of government.  Supreme executive power derives from
     a mandate from the masses, not from some farcical aquatic ceremony."""
tokens = word_tokenize(raw)
[stem(t) for t in tokens]

['DENNIS', ':', 'Listen', ',', 'strange', 'women', 'ly', 'in', 'pond', 'distribut', 'sword', 'i', 'no', 'basi', 'for', 'a', 'system', 'of', 'govern', '.', 'Supreme', 'execut', 'power', 'deriv', 'from', 'a', 'mandate', 'from', 'the', 'mass', ',', 'not', 'from', 'some', 'farcical', 'aquatic', 'ceremony', '.']

Notice that our regular expression removed the s from ponds but also from is and basis. It produced some non-words like distribut and deriv, but these are acceptable stems in some applications.

## Searching Tokenized Text
- You can use a special kind of regular expression for searching across multiple words in a text (where a text is a list of tokens). 
- For example, ```"<a> <man>"``` finds all instances of a man in the text. The angle brackets are used to mark token boundaries, and any whitespace between the angle brackets is ignored (behaviors that are unique to NLTK's findall() method for texts). 
- In the following example, we include ```<.*>```, which will match any single token, and enclose it in parentheses so only the matched word (e.g. monied) and not the matched phrase (e.g. a monied man) is produced. 
- The second example finds three-word phrases ending with the word bro. The last example finds sequences of three or more words starting with the letter l.
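The idea behind searching over tokens can be imitated with the plain re module by marking token boundaries explicitly. This is only a sketch of the concept on made-up tokens and should not be read as a description of how NLTK's findall() is implemented; `tokens_demo` and `marked` are illustrative names.

```python
import re

tokens_demo = ["a", "monied", "man", "met", "a", "nervous", "man",
               "and", "a", "woman"]

# Wrap each token in angle brackets so token boundaries are explicit.
marked = "".join("<" + t + ">" for t in tokens_demo)

# [^>]* stays inside one token; the group keeps only the middle word.
middle_words = re.findall(r"<a><([^>]*)><man>", marked)
print(middle_words)
```

Because the brackets delimit whole tokens, the final "a woman" is not reported: woman is a different token from man, even though it ends with the same letters.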

In [40]:
from nltk.corpus import gutenberg, nps_chat
moby = nltk.Text(gutenberg.words('melville-moby_dick.txt'))
moby.findall(r"<a> (<.*>) <man>")
chat = nltk.Text(nps_chat.words())
chat.findall(r"<.*> <.*> <bro>")
chat.findall(r"<l.*>{3,}")

monied; nervous; dangerous; white; white; white; pious; queer; good;
mature; white; Cape; great; wise; wise; butterless; white; fiendish;
pale; furious; better; certain; complete; dismasted; younger; brave;
brave; brave; brave
you rule bro; telling you bro; u twizted bro
lol lol lol; lmao lol lol; lol lol lol; la la la la la; la la la; la
la la; lovely lol lol love; lol lol lol.; la la la; la la la


In [41]:
from nltk.corpus import brown
hobbies_learned = nltk.Text(brown.words(categories=['hobbies', 'learned']))
hobbies_learned.findall(r"<\w*> <and> <other> <\w*s>")

speed and other activities; water and other liquids; tomb and other
landmarks; Statues and other monuments; pearls and other jewels;
charts and other items; roads and other features; figures and other
objects; military and other areas; demands and other factors;
abstracts and other compilations; iron and other metals
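
To make the role of the angle brackets more concrete, here is a minimal sketch of how a token-level search could be emulated with the standard `re` module. The helper name `find_token_sequences` and the sample token lists are hypothetical, and encoding the token list as one bracketed string is an assumption about a workable approach, not a description of NLTK's internals. It also assumes that no token contains `<` or `>`.

```python
import re

def find_token_sequences(tokens, pattern):
    """Search a list of tokens with a "<token> <token>" style pattern.

    Hypothetical helper for illustration only; it returns a list of
    token lists instead of printing them.
    """
    # Encode the tokens as one string: ['a', 'man'] -> '<a><man>'.
    encoded = ''.join('<' + token + '>' for token in tokens)
    # Whitespace between the angle brackets is ignored.
    regexp = re.sub(r'\s', '', pattern)
    # Wrap each <...> unit in a non-capturing group so that quantifiers
    # such as {3,} apply to whole tokens.
    regexp = regexp.replace('<', '(?:<(?:').replace('>', ')>)')
    # Stop an unescaped '.' from matching across a token boundary.
    regexp = re.sub(r'(?<!\\)\.', '[^>]', regexp)
    hits = re.findall(regexp, encoded)
    # Decode each hit back into its tokens: '<a><man>' -> ['a', 'man'].
    return [hit[1:-1].split('><') for hit in hits]

tokens = ['a', 'monied', 'man', 'met', 'a', 'wise', 'man']
print(find_token_sequences(tokens, r"<a> (<.*>) <man>"))   # [['monied'], ['wise']]

chat_tokens = ['lol', 'lol', 'lol', 'ok', 'la', 'la']
print(find_token_sequences(chat_tokens, r"<l.*>{3,}"))     # [['lol', 'lol', 'lol']]
```

Because the parentheses in the first pattern form a capturing group, only the token inside them is returned, which mirrors the behaviour described for `<a> (<.*>) <man>`.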


## Normalizing Text
- In earlier program examples we have often converted text to lowercase before doing anything with its words, e.g. ```set(w.lower() for w in text)```. 
- By using lower(), we have normalized the text to lowercase so that the distinction between The and the is ignored. 
- Often we want to go further than this, and strip off any affixes, a task known as stemming. 
- A further step is to make sure that the resulting form is a known word in a dictionary, a task known as lemmatization. We discuss each of these in turn. 
- First, we need to define the data we will use in this section:

In [42]:
raw = """DENNIS: Listen, strange women lying in ponds distributing swords
 is no basis for a system of government.  Supreme executive power derives from
 a mandate from the masses, not from some farcical aquatic ceremony."""
tokens = word_tokenize(raw)

## Stemmers
- NLTK includes several off-the-shelf stemmers, and if you ever need a stemmer you should use one of these in preference to crafting your own using regular expressions, since these handle a wide range of irregular cases. 
- The Porter and Lancaster stemmers follow their own rules for stripping affixes. Observe that the Porter stemmer correctly handles the word lying (mapping it to lie), while the Lancaster stemmer does not.
- Stemming is not a well-defined process, and we typically pick the stemmer that best suits the application we have in mind. 
- The Porter stemmer is a good choice if you are indexing some texts and want to support search using alternative forms of words. This is illustrated by the IndexedText class below (Example 3.6 in the NLTK book), which uses object-oriented programming techniques that are outside the scope of the book, string formatting techniques covered in its section 3.9, and the enumerate() function explained in its section 4.2.

In [43]:
porter = nltk.PorterStemmer()
lancaster = nltk.LancasterStemmer()
print([porter.stem(t) for t in tokens])
print([lancaster.stem(t) for t in tokens])

['denni', ':', 'listen', ',', 'strang', 'women', 'lie', 'in', 'pond', 'distribut', 'sword', 'is', 'no', 'basi', 'for', 'a', 'system', 'of', 'govern', '.', 'suprem', 'execut', 'power', 'deriv', 'from', 'a', 'mandat', 'from', 'the', 'mass', ',', 'not', 'from', 'some', 'farcic', 'aquat', 'ceremoni', '.']
['den', ':', 'list', ',', 'strange', 'wom', 'lying', 'in', 'pond', 'distribut', 'sword', 'is', 'no', 'bas', 'for', 'a', 'system', 'of', 'govern', '.', 'suprem', 'execut', 'pow', 'der', 'from', 'a', 'mand', 'from', 'the', 'mass', ',', 'not', 'from', 'som', 'farc', 'aqu', 'ceremony', '.']


In [44]:
class IndexedText(object):

    def __init__(self, stemmer, text):
        self._text = text
        self._stemmer = stemmer
        self._index = nltk.Index((self._stem(word), i)
                                 for (i, word) in enumerate(text))

    def concordance(self, word, width=40):
        key = self._stem(word)
        wc = int(width/4)                # words of context
        for i in self._index[key]:
            lcontext = ' '.join(self._text[max(0, i-wc):i])  # clamp so hits near the start keep their left context
            rcontext = ' '.join(self._text[i:i+wc])
            ldisplay = '{:>{width}}'.format(lcontext[-width:], width=width)
            rdisplay = '{:{width}}'.format(rcontext[:width], width=width)
            print(ldisplay, rdisplay)

    def _stem(self, word):
        return self._stemmer.stem(word).lower()

In [45]:
porter = nltk.PorterStemmer()
grail = nltk.corpus.webtext.words('grail.txt')
text = IndexedText(porter, grail)
text.concordance('lie')

r king ! DENNIS : Listen , strange women lying in ponds distributing swords is no
 beat a very brave retreat . ROBIN : All lies ! MINSTREL : [ singing ] Bravest of
       Nay . Nay . Come . Come . You may lie here . Oh , but you are wounded !   
doctors immediately ! No , no , please ! Lie down . [ clap clap ] PIGLET : Well  
ere is much danger , for beyond the cave lies the Gorge of Eternal Peril , which 
   you . Oh ... TIM : To the north there lies a cave -- the cave of Caerbannog --
h it and lived ! Bones of full fifty men lie strewn about its lair . So , brave k
not stop our fight ' til each one of you lies dead , and the Holy Grail returns t
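
The core idea of the class above is a mapping from each stem to the positions where any form of that word occurs. The following is a minimal sketch of that idea using only the standard library. The `toy_stem` function, the `contexts` helper and the sample tokens are made up for illustration; `toy_stem` is far cruder than the Porter stemmer (for example, it does not map lying to lie), and a plain `defaultdict` stands in for the index.

```python
from collections import defaultdict

def toy_stem(word):
    """Lowercase and strip one trailing 's' (a crude stand-in for a real stemmer)."""
    word = word.lower()
    if len(word) > 3 and word.endswith('s'):
        return word[:-1]
    return word

tokens = ['All', 'lies', '!', 'You', 'may', 'lie', 'here', '.', 'Lie', 'down', '.']

# Map each stem to the list of positions where a word with that stem occurs.
index = defaultdict(list)
for position, word in enumerate(tokens):
    index[toy_stem(word)].append(position)

def contexts(word, window=2):
    """Return the tokens around every occurrence of any form of `word`."""
    results = []
    for position in index[toy_stem(word)]:
        start = max(0, position - window)   # avoid a negative slice start
        results.append(tokens[start:position + window + 1])
    return results

print(index['lie'])        # [1, 5, 8]
print(contexts('lies'))
```

Looking up `'lies'` finds lies, lie and Lie alike, because the query is stemmed with the same function that was used to build the index.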


## Lemmatization
The WordNet lemmatizer only removes affixes if the resulting word is in its dictionary. This additional checking process makes the lemmatizer slower than the above stemmers. Notice that it doesn't handle lying, but it converts women to woman.
- The WordNet lemmatizer is a good choice if you want to compile the vocabulary of some texts and want a list of valid lemmas (or lexicon headwords).

In [49]:
wnl = nltk.WordNetLemmatizer()
[wnl.lemmatize(t) for t in tokens]

['DENNIS', ':', 'Listen', ',', 'strange', 'woman', 'lying', 'in', 'pond', 'distributing', 'sword', 'is', 'no', 'basis', 'for', 'a', 'system', 'of', 'government', '.', 'Supreme', 'executive', 'power', 'derives', 'from', 'a', 'mandate', 'from', 'the', 'mass', ',', 'not', 'from', 'some', 'farcical', 'aquatic', 'ceremony', '.']
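
The dictionary check described above can be pictured with a toy lemmatizer. This is only a sketch under stated assumptions: the vocabulary, the irregular-form table and the suffix rules below are invented for illustration and are much smaller and simpler than anything WordNet uses.

```python
# Hypothetical miniature "dictionary" of known lemmas.
VOCAB = {'pond', 'sword', 'mass', 'basis', 'woman'}
# Hypothetical table of irregular forms.
IRREGULAR = {'women': 'woman'}
# Suffixes to try, longest first.
SUFFIXES = ['es', 's']

def toy_lemmatize(word):
    """Strip a suffix only if the result is a known word; otherwise keep the word."""
    if word in IRREGULAR:
        return IRREGULAR[word]
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            candidate = word[:-len(suffix)]
            if candidate in VOCAB:
                return candidate
    return word

words = ['women', 'lying', 'ponds', 'masses', 'basis']
print([toy_lemmatize(w) for w in words])
# ['woman', 'lying', 'pond', 'mass', 'basis']
```

Unlike a pure suffix stripper, this leaves basis alone, because basi is not in the vocabulary.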

## Regular Expressions for Tokenizing Text
- Tokenization is the task of cutting a string into identifiable linguistic units that constitute a piece of language data.
- Although it is a fundamental task, we have been able to delay it until now because many corpora are already tokenized, and because NLTK includes some tokenizers.
- Now that you are familiar with regular expressions, you can learn how to use them to tokenize text, and to have much more control over the process.

### Simple Approaches to Tokenization
- The very simplest method for tokenizing text is to split on whitespace. Consider the text from Alice's Adventures in Wonderland defined in the cell below. Splitting it on a single space character leaves newlines attached to tokens and produces empty strings wherever several spaces occur in a row.
- The regular expression ```«[ \t\n]+»``` matches one or more spaces, tabs (\t) or newlines (\n).
- Other whitespace characters, such as carriage return and form feed, should really be included too.
- Instead of listing them all, we can use a built-in re abbreviation, \s, which means any whitespace character. The split can then be written as re.split(r'\s+', raw).

In [51]:
raw = """'When I'M a Duchess,' she said to herself, (not in a very hopeful tone
    though), 'I won't have any pepper in my kitchen AT ALL. Soup does very
    well without--Maybe it's always pepper that makes people hot-tempered,'..."""
print(re.split(r' ', raw))
print(re.split(r'[ \t\n]+', raw))
print(re.split(r'\s+', raw))

["'When", "I'M", 'a', "Duchess,'", 'she', 'said', 'to', 'herself,', '(not', 'in', 'a', 'very', 'hopeful', 'tone\n', '', '', '', 'though),', "'I", "won't", 'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL.', 'Soup', 'does', 'very\n', '', '', '', 'well', 'without--Maybe', "it's", 'always', 'pepper', 'that', 'makes', 'people', "hot-tempered,'..."]
["'When", "I'M", 'a', "Duchess,'", 'she', 'said', 'to', 'herself,', '(not', 'in', 'a', 'very', 'hopeful', 'tone', 'though),', "'I", "won't", 'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL.', 'Soup', 'does', 'very', 'well', 'without--Maybe', "it's", 'always', 'pepper', 'that', 'makes', 'people', "hot-tempered,'..."]
["'When", "I'M", 'a', "Duchess,'", 'she', 'said', 'to', 'herself,', '(not', 'in', 'a', 'very', 'hopeful', 'tone', 'though),', "'I", "won't", 'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL.', 'Soup', 'does', 'very', 'well', 'without--Maybe', "it's", 'always', 'pepper', 'that', 'makes', 'people', "hot-t
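
The difference between the hand-written class `[ \t\n]` and `\s` only shows up when the text contains other whitespace characters. A small sketch with a made-up sample string containing a carriage return and a form feed:

```python
import re

# Hypothetical sample with a Windows line ending (\r\n) and a form feed (\f).
sample = "one\r\ntwo\fthree four"

# The explicit class misses \r and \f, so they stay inside the tokens.
print(re.split(r'[ \t\n]+', sample))   # ['one\r', 'two\x0cthree', 'four']

# \s covers all whitespace characters.
print(re.split(r'\s+', sample))        # ['one', 'two', 'three', 'four']
```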

Splitting on whitespace gives us tokens like '(not' and 'herself,'. An alternative is to use the fact that Python provides us with a character class \w for word characters, equivalent to [a-zA-Z0-9_] for ASCII text (in Python 3, \w also matches non-ASCII letters and digits unless the re.ASCII flag is set). It also defines the complement of this class, \W, i.e. all characters other than letters, digits or underscore. We can use \W in a simple regular expression to split the input on anything other than a word character:

In [52]:
re.split(r'\W+', raw)

['', 'When', 'I', 'M', 'a', 'Duchess', 'she', 'said', 'to', 'herself', 'not', 'in', 'a', 'very', 'hopeful', 'tone', 'though', 'I', 'won', 't', 'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL', 'Soup', 'does', 'very', 'well', 'without', 'Maybe', 'it', 's', 'always', 'pepper', 'that', 'makes', 'people', 'hot', 'tempered', '']

- Observe that this gives us empty strings at the start and the end (to understand why, try doing 'xx'.split('x')). 
- We get the same tokens, but without the empty strings, with ```re.findall(r'\w+', raw)```, using a pattern that matches the words instead of the spaces. 
- Now that we're matching the words, we're in a position to extend the regular expression to cover a wider range of cases.
- The regular expression «\w+|\S\w*» will first try to match any sequence of word characters. If no match is found, it will try to match any non-whitespace character (\S is the complement of \s) followed by further word characters. This means that punctuation is grouped with any following letters (e.g. 's) but that sequences of two or more punctuation characters are separated.

In [53]:
re.findall(r'\w+', raw)

['When', 'I', 'M', 'a', 'Duchess', 'she', 'said', 'to', 'herself', 'not', 'in', 'a', 'very', 'hopeful', 'tone', 'though', 'I', 'won', 't', 'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL', 'Soup', 'does', 'very', 'well', 'without', 'Maybe', 'it', 's', 'always', 'pepper', 'that', 'makes', 'people', 'hot', 'tempered']
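
The points made in the list above can be checked on a short string. This is a small sketch with a made-up sample; it shows where the empty strings come from when splitting, and how the extended pattern `\w+|\S\w*` treats punctuation.

```python
import re

# A separator at the very start or end leaves an empty string on that side.
print('xx'.split('x'))                    # ['', '', '']

# Hypothetical sample that starts and ends with punctuation.
sample = "'It's (not) hot-tempered,'..."

print(re.split(r'\W+', sample))
# ['', 'It', 's', 'not', 'hot', 'tempered', '']

print(re.findall(r'\w+', sample))
# ['It', 's', 'not', 'hot', 'tempered']

# Punctuation is attached to the letters that follow it, while runs of
# punctuation are split into single characters.
print(re.findall(r'\w+|\S\w*', sample))
# ["'It", "'s", '(not', ')', 'hot', '-tempered', ',', "'", '.', '.', '.']
```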

- Let's generalize the \w+ in the above expression to permit word-internal hyphens and apostrophes: ```«\w+([-']\w+)*»```. 
- This expression means \w+ followed by zero or more instances of [-']\w+; it would match hot-tempered and it's. 
- (We need to include ?: in this expression for reasons discussed earlier.) 
- We'll also add a pattern to match quote characters so these are kept separate from the text they enclose.

In [54]:
print(re.findall(r"\w+(?:[-']\w+)*|'|[-.(]+|\S\w*", raw))

["'", 'When', "I'M", 'a', 'Duchess', ',', "'", 'she', 'said', 'to', 'herself', ',', '(', 'not', 'in', 'a', 'very', 'hopeful', 'tone', 'though', ')', ',', "'", 'I', "won't", 'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL', '.', 'Soup', 'does', 'very', 'well', 'without', '--', 'Maybe', "it's", 'always', 'pepper', 'that', 'makes', 'people', 'hot-tempered', ',', "'", '...']
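
The reason for the `?:` becomes visible when the same pattern is run with and without it. With a capturing group, `re.findall()` returns the content of the group instead of the whole match. A short sketch on a made-up sample:

```python
import re

sample = "it's hot-tempered"

# Capturing group: findall() returns only what the group matched last.
print(re.findall(r"\w+([-']\w+)*", sample))     # ["'s", '-tempered']

# Non-capturing group: findall() returns the whole match.
print(re.findall(r"\w+(?:[-']\w+)*", sample))   # ["it's", 'hot-tempered']
```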


![title](img\regular_exp_symbols.png)
## NLTK's Regular Expression Tokenizer
- The function nltk.regexp_tokenize() is similar to re.findall() (as we've been using it for tokenization).
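- The NLTK book describes nltk.regexp_tokenize() as more efficient for this task and as avoiding the need for special treatment of parentheses. With NLTK 3, however, capturing groups in the pattern make it return tuples of groups, just as re.findall() does, so groups should be written in the non-capturing form (?:...).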
- For readability we break up the regular expression over several lines and add a comment about each line. The special (?x) "verbose flag" tells Python to strip out the embedded whitespace and comments.
- When using the verbose flag, you can no longer use ' ' to match a space character; use \s instead. 
- The regexp_tokenize() function has an optional gaps parameter. When set to True, the regular expression specifies the gaps between tokens, as with re.split().

In [59]:
text = 'That U.S.A. poster-print costs $12.40...'
# Groups are non-capturing (?:...) so that whole tokens are returned, not tuples of groups
pattern = r'''(?x)          # set flag to allow verbose regexps
     (?:[A-Z]\.)+           # abbreviations, e.g. U.S.A.
   | \w+(?:-\w+)*           # words with optional internal hyphens
   | \$?\d+(?:\.\d+)?%?     # currency and percentages, e.g. $12.40, 82%
   | \.\.\.                 # ellipsis
   | [][.,;"'?():-_`]       # these are separate tokens; includes ], [
 '''
nltk.regexp_tokenize(text, pattern)

['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']
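
The remarks above about the verbose flag and the gaps parameter can be tried out with the standard `re` module alone. This is a sketch rather than NLTK's own code: `split_on_gaps` is a hypothetical helper that approximates tokenizing with gaps, under the assumption that empty strings are discarded.

```python
import re

sample = 'a b ab'

# Under (?x) the literal spaces in the pattern are ignored, so this is just 'ab'.
print(re.findall(r'(?x) a b', sample))      # ['ab']

# \s still matches a whitespace character in verbose mode.
print(re.findall(r'(?x) a \s b', sample))   # ['a b']

def split_on_gaps(text, gap_pattern):
    """Treat the pattern as the gaps between tokens and drop empty strings."""
    return [piece for piece in re.split(gap_pattern, text) if piece]

print(split_on_gaps('That U.S.A. poster-print costs $12.40...', r'\s+'))
# ['That', 'U.S.A.', 'poster-print', 'costs', '$12.40...']
```

When the pattern describes gaps, the ellipsis stays attached to $12.40, because nothing in the gap pattern separates them.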