# [3. Processing Raw Text](https://www.nltk.org/book/ch03.html) - Notes

* [NLTK-Book-Resource Repository](https://github.com/BetoBob/NLTK-Book-Resource)
* [NLTK-Book-Resource Table of Contents](https://github.com/BetoBob/NLTK-Book-Resource#table-of-contents)

Run the cell below before running the examples.

In [None]:
import nltk, re, pprint
from nltk import word_tokenize

## 3.1 - Accessing Text from the Web and from Disk

### Electronic Books

* **Note:** `utf-8-sig` is the better codec for this web document: it strips the byte order mark (BOM) that would otherwise show up as an extra `\ufeff` character at the start of the text
* [Forum Post on codecs](https://stackoverflow.com/questions/17912307/u-ufeff-in-python-string)
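
A minimal sketch of the difference between the two codecs. The byte string below is a hypothetical sample (not the actual Gutenberg download) that starts with the three BOM bytes `EF BB BF`:

```python
# Hypothetical sample: a UTF-8 byte order mark followed by some text.
sample = b'\xef\xbb\xbfThe Project Gutenberg EBook'

plain = sample.decode('utf-8')        # keeps the BOM as the character '\ufeff'
cleaned = sample.decode('utf-8-sig')  # strips the BOM if one is present

print(repr(plain[:4]))    # '\ufeffThe'
print(repr(cleaned[:3]))  # 'The'
```

`utf-8-sig` also decodes text that has no BOM, so it is a safe choice when you are not sure whether one is present.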

In [None]:
from urllib import request
url = "http://www.gutenberg.org/files/2554/2554-0.txt"
response = request.urlopen(url)
raw = response.read().decode('utf-8-sig')
type(raw)

In [None]:
len(raw)

In [None]:
raw[:75]

In [None]:
tokens = word_tokenize(raw)
type(tokens)

In [None]:
len(tokens)

In [None]:
tokens[:10]

In [None]:
text = nltk.Text(tokens)

In [None]:
type(text)

In [None]:
# updated to match 'utf-8-sig' encoding
text[1019:1059]

In [None]:
text.collocation_list()

In [None]:
raw.find("PART I")

In [None]:
raw.rfind("End of Project Gutenberg’s Crime")

In [None]:
raw = raw[5335:1157811]

In [None]:
raw.find("PART I")

### Dealing with HTML

* **Note:** BeautifulSoup (`bs4`) is a Python library bundled with Anaconda and Google Colab

In [None]:
url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
html = request.urlopen(url).read().decode('utf8')
html[:60]

In [None]:
print(html)

In [None]:
from bs4 import BeautifulSoup
raw = BeautifulSoup(html, 'html.parser').get_text()
tokens = word_tokenize(raw)

In [None]:
tokens = tokens[110:390]
text = nltk.Text(tokens)
text.concordance('gene')

### Processing Search Engine Results

**Note:** Search engine scraping is a constantly changing process. In recent years Google has deliberately made it difficult to scrape search results, both to keep bot searches from affecting its search algorithm and to maintain the value of its [analytics services](https://developers.google.com/custom-search/v1/overview). Scraping data from Google can breach its [terms of service](https://policies.google.com/terms). Use the official [Custom Search JSON API](https://developers.google.com/custom-search/v1/overview) instead, which provides 100 free search queries per day.

* [Search engine scraping](https://en.wikipedia.org/wiki/Search_engine_scraping)
* [Google terms of service](https://policies.google.com/terms)
* [Google Search API](https://developers.google.com/custom-search/v1/overview)
    * 100 free searches a day
* [Bing Search API](https://azure.microsoft.com/en-us/services/cognitive-services/bing-web-search-api/)
    * 3 searches per second, 1,000 searches per month
* [DuckDuckGo Instant Answer API](https://api.duckduckgo.com/api?t=h_)
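
A rough sketch of how a query to the Custom Search JSON API could be built with the standard library. The helper names `build_search_url` and `search_titles` are made up for these notes, the credentials are placeholders, and the response layout (an `items` list whose entries have a `title`) is an assumption to check against the API documentation:

```python
import json
from urllib import parse, request

# Endpoint of the Custom Search JSON API.
API_ENDPOINT = "https://www.googleapis.com/customsearch/v1"

def build_search_url(query, api_key, engine_id):
    """Return the request URL for one search query (hypothetical helper)."""
    params = {'key': api_key, 'cx': engine_id, 'q': query}
    return API_ENDPOINT + '?' + parse.urlencode(params)

def search_titles(query, api_key, engine_id):
    """Fetch one page of results and return the result titles (hypothetical helper)."""
    with request.urlopen(build_search_url(query, api_key, engine_id)) as response:
        results = json.loads(response.read().decode('utf-8'))
    # 'items' is assumed to be missing when a query has no results
    return [item['title'] for item in results.get('items', [])]

# Placeholder credentials: replace them with your own before calling search_titles().
print(build_search_url('the of', 'YOUR_API_KEY', 'YOUR_ENGINE_ID'))
```

Only the URL is built here; `search_titles` needs a real API key and search engine ID, and each call counts against the daily quota.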

### Processing RSS Feeds

Run the `pip install` command below to install the `feedparser` library.

In [None]:
!pip install feedparser

* **Note:** This RSS feed changes over time, so the "He's my BF" entry is no longer available. The code will still work for any entry in the feed.

In [None]:
import feedparser
llog = feedparser.parse("http://languagelog.ldc.upenn.edu/nll/?feed=atom")
llog['feed']['title']

In [None]:
len(llog.entries)

In [None]:
post = llog.entries[4]
post.title

In [None]:
content = post.content[0].value

In [None]:
content[:70]

In [None]:
raw = BeautifulSoup(content, 'html.parser').get_text()

In [None]:
word_tokenize(raw)

In [None]:
[l.title for l in llog.entries]

### Reading Local Files

**Note:** 
* Always close a file with its `.close()` method once you are done with it.
* In Google Colab, click the folder icon, right-click in the Files pane, and select `New File` to create a text file.
    * You can create a folder the same way. To run the code below, create a folder named `data` and move an `example.txt` file into it.
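
One way to make sure a file always gets closed is a `with` block, which closes the file when the block ends, even if an error occurs inside it. A minimal sketch that writes its own temporary file (a stand-in for `data/example.txt`) so it runs anywhere:

```python
import os
import tempfile

# Create a small throwaway file to read back.
tmp_dir = tempfile.mkdtemp()
example_path = os.path.join(tmp_dir, 'example.txt')
with open(example_path, 'w') as f:
    f.write('Time flies like an arrow.\nFruit flies like a banana.\n')

# The file is closed automatically at the end of the with block.
with open(example_path) as f:
    example_raw = f.read()

print(f.closed)  # True
print(example_raw)
```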

In [None]:
f = open('data/example.txt')
raw = f.read()
f.close()

raw

In [None]:
f.close()

**Your Turn:** Create a file called document.txt using a text editor, type in a few lines of text, and save it as plain text. Next, in the Python interpreter, open the file using `f = open('document.txt')`, then inspect its contents using `print(f.read())`.

In [None]:
f = open('document.txt')
print(f.read())
f.close()

In [None]:
import os
os.listdir('.')

In [None]:
f = open('data/example.txt', 'r')
for line in f:
    print(line.strip())
f.close()

In [None]:
path = nltk.data.find('corpora/gutenberg/melville-moby_dick.txt')

file = open(path, 'r')
raw = file.read()
file.close()

raw

### Capturing User Input

In [None]:
s = input("Enter some text: ")

print("You typed", len(word_tokenize(s)), "words.")

### The NLP Pipeline

In [None]:
file = open('data/example.txt')

raw = file.read()

file.close()

type(raw)

In [None]:
tokens = word_tokenize(raw)
type(tokens)

In [None]:
words = [w.lower() for w in tokens]
type(words)

In [None]:
vocab = sorted(set(words))
type(vocab)

In [None]:
vocab.append('blog')
raw.append('blog')

In [None]:
query = 'Who knows?'
beatles = ['john', 'paul', 'george', 'ringo']
query + beatles

## 3.2 - Strings: Text Processing at the Lowest Level

### Basic Operations with Strings

In [None]:
monty = 'Monty Python'
monty

In [None]:
circus = "Monty Python's Flying Circus"
circus

In [None]:
circus = 'Monty Python's Flying Circus'

In [None]:
couplet = "Shall I compare thee to a Summer's day?"\
          "Thou are more lovely and more temperate:"

In [None]:
print(couplet)

In [None]:
couplet = ("Rough winds do shake the darling buds of May,"
           "And Summer's lease hath all too short a date:")

In [None]:
print(couplet)

In [None]:
couplet = """Shall I compare thee to a Summer's day?
Thou are more lovely and more temperate:"""

In [None]:
print(couplet)

In [None]:
couplet = '''Rough winds do shake the darling buds of May,
And Summer's lease hath all too short a date:'''

In [None]:
print(couplet)

In [None]:
'very' + 'very' + 'very'

In [None]:
'very' * 3

**Your Turn:** Try running the following code, then try to use your understanding of the string `+` and `*` operations to figure out how it works. Be careful to distinguish between the string `' '`, which is a single whitespace character, and `''`, which is the empty string.

In [None]:
a = [1, 2, 3, 4, 5, 6, 7, 6, 5, 4, 3, 2, 1]
b = [' ' * 2 * (7 - i) + 'very' * i for i in a]

for line in b:
    print(line)

In [None]:
'very' - 'y'

In [None]:
'very' / 2

### Printing Strings

In [None]:
monty = 'Monty Python'
print(monty)

In [None]:
grail = 'Holy Grail'
print(monty + grail)

In [None]:
print(monty, grail)

In [None]:
print(monty, "and the", grail)

### Accessing Individual Characters

In [None]:
monty = 'Monty Python'

In [None]:
monty[0]

In [None]:
monty[3]

In [None]:
monty[20]

In [None]:
monty[-1]

In [None]:
monty[5]

In [None]:
monty[-7]

In [None]:
sent = 'colorless green ideas sleep furiously'

for char in sent:
    print(char, end=' ')

In [None]:
from nltk.corpus import gutenberg
raw = gutenberg.raw('melville-moby_dick.txt')
fdist = nltk.FreqDist(ch.lower() for ch in raw if ch.isalpha())

fdist.most_common(5)

In [None]:
[char for (char, count) in fdist.most_common()]

### Accessing Substrings

In [None]:
monty = 'Monty Python'

In [None]:
monty[6:10]

In [None]:
monty[-12:-7]

In [None]:
monty[:5]

In [None]:
monty[6:]

In [None]:
phrase = 'And now for something completely different'

In [None]:
if 'thing' in phrase:
    print('found "thing"')

In [None]:
monty.find('Python')

**Your Turn:** Make up a sentence and assign it to a variable, e.g. `sent = 'my sentence...'`. Now write slice expressions to pull out individual words. (This is obviously not a convenient way to process the words of a text!)
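
One possible answer, as a sketch with a made-up sentence. The slice positions were counted by hand, which is exactly why this approach does not scale:

```python
sent = 'colorless green ideas'

first = sent[:9]      # characters 0-8
second = sent[10:15]  # characters 10-14, skipping the space at index 9
third = sent[16:]     # from index 16 to the end

print(first, second, third, sep=' | ')

# str.split() gives the same words without counting characters
print(sent.split())
```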

### The Difference between Lists and Strings

In [None]:
query = 'Who knows?'
beatles = ['John', 'Paul', 'George', 'Ringo']

In [None]:
query[2]

In [None]:
beatles[2]

In [None]:
query[:2]

In [None]:
beatles[:2]

In [None]:
query + " I don't"

In [None]:
beatles + 'Brian'

In [None]:
beatles + ['Brian']

In [None]:
beatles[0] = "John Lennon"

In [None]:
del beatles[-1]

In [None]:
beatles

In [None]:
query[0] = 'F'

## 3.3 - Text Processing with Unicode

* [List of Unicode Characters](https://en.wikipedia.org/wiki/List_of_Unicode_characters)

In [None]:
path = nltk.data.find('corpora/unicode_samples/polish-lat2.txt')

In [None]:
f = open(path, encoding='latin2')

In [None]:
for line in f:
    line = line.strip()
    print(line)

In [None]:
f.close()

In [None]:
f = open(path, encoding='latin2')
for line in f:
    line = line.strip()
    print(line.encode('unicode_escape'))

In [None]:
ord('ń')

In [None]:
nacute = '\u0144'

In [None]:
nacute

In [None]:
nacute.encode('utf8')

In [None]:
import unicodedata
lines = open(path, encoding='latin2').readlines()
line = lines[2]

print(line.encode('unicode_escape'))

In [None]:
for c in line:
    if ord(c) > 127:
        print('{} U+{:04x} {}'.format(c.encode('utf8'), ord(c), unicodedata.name(c)))

In [None]:
line.find('zosta\u0142y')

In [None]:
line = line.lower()
line

In [None]:
line.encode('unicode_escape')

In [None]:
import re
# raw string, so that \w reaches the re module unchanged (re also understands \u015b)
m = re.search(r'\u015b\w*', line)
m.group()

In [None]:
word_tokenize(line)

## 3.4 - Regular Expressions for Detecting Word Patterns

* [Official Documentation on Python Regular Expressions](https://docs.python.org/3/howto/regex.html)

In [None]:
import re
wordlist = [w for w in nltk.corpus.words.words('en') if w.islower()]

### Using Basic Meta-Characters

In [None]:
[w for w in wordlist if re.search('ed$', w)]

In [None]:
[w for w in wordlist if re.search('^..j..t..$', w)]

**Your Turn:** The caret symbol `^` matches the start of a string, just like the `$` matches the end. What results do we get with the above example if we leave out both of these, and search for `«..j..t..»`?

### Ranges and Closures

In [None]:
[w for w in wordlist if re.search('^[ghi][mno][jlk][def]$', w)]

**Your Turn:** Look for some "finger-twisters", by searching for words that only use part of the number-pad. For example `«^[ghijklmno]+$»`, or more concisely, `«^[g-o]+$»`, will match words that only use keys 4, 5, 6 in the center row, and `«^[a-fj-o]+$»` will match words that use keys 2, 3, 5, 6 in the top-right corner. What do `-` and `+` mean?

In [None]:
chat_words = sorted(set(w for w in nltk.corpus.nps_chat.words()))

In [None]:
[w for w in chat_words if re.search('^m+i+n+e+$', w)]

In [None]:
[w for w in chat_words if re.search('^[ha]+$', w)]

In [None]:
wsj = sorted(set(nltk.corpus.treebank.words()))

In [None]:
# decimal numbers
[w for w in wsj if re.search(r'^[0-9]+\.[0-9]+$', w)]

In [None]:
# uppercase letters followed by a literal '$' (the backslash escapes the first $)
[w for w in wsj if re.search(r'^[A-Z]+\$$', w)]

In [None]:
# four digit integers
[w for w in wsj if re.search('^[0-9]{4}$', w)]

In [None]:
# (integer)-(3-5 letter word)
[w for w in wsj if re.search('^[0-9]+-[a-z]{3,5}$', w)]

In [None]:
# (>=5 letter word)-(2-3 letter word)-(<=6 letter word)
[w for w in wsj if re.search('^[a-z]{5,}-[a-z]{2,3}-[a-z]{,6}$', w)]

In [None]:
# words ending in 'ed' or 'ing'
# Note: without the parentheses the regular expression means:
#    words containing 'ed' anywhere, or ending in 'ing'
[w for w in wsj if re.search('(ed|ing)$', w)]
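
The effect of the parentheses can be checked on a small hand-made word list (made up for illustration, not taken from the corpus):

```python
import re

words = ['jumped', 'running', 'editor', 'ring', 'red', 'tree']

# With parentheses, $ applies to both alternatives.
grouped = [w for w in words if re.search(r'(ed|ing)$', w)]

# Without them, the pattern reads as: 'ed' anywhere, or 'ing' at the end.
ungrouped = [w for w in words if re.search(r'ed|ing$', w)]

print(grouped)    # ['jumped', 'running', 'ring', 'red']
print(ungrouped)  # ['jumped', 'running', 'editor', 'ring', 'red']
```

Only `editor` differs: it contains `ed` but does not end in it.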

## 3.5 - Useful Applications of Regular Expressions

### Extracting Word Pieces

In [None]:
word = 'supercalifragilisticexpialidocious'

In [None]:
re.findall(r'[aeiou]', word)

In [None]:
len(re.findall(r'[aeiou]', word))

In [None]:
wsj = sorted(set(nltk.corpus.treebank.words()))
fd = nltk.FreqDist(vs for word in wsj 
                   for vs in re.findall(r'[aeiou]{2,}', word))

In [None]:
fd.most_common(12)

**Your Turn:** In the W3C Date Time Format, dates are represented like this: 2009-12-31. 

Replace the `?` in the following Python code with a regular expression, in order to convert the string `'2009-12-31'` to a list of integers `[2009, 12, 31]`:

In [None]:
[int(n) for n in re.findall(?, '2009-12-31')]

### Doing More with Word Pieces

In [None]:
# function that removes word-internal vowels
# (vowel sequences at the start or end of the word are kept)

regexp = r'^[AEIOUaeiou]+|[AEIOUaeiou]+$|[^AEIOUaeiou]'

def compress(word):
    pieces = re.findall(regexp, word)
    return ''.join(pieces)
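
A quick check of what the pattern keeps, as a self-contained sketch (it repeats the definition above; the sample words are arbitrary). The three alternatives are tried in order at each position: a vowel sequence at the start of the word, a vowel sequence at the end, or any single non-vowel character. Word-internal vowels match none of them and are skipped:

```python
import re

regexp = r'^[AEIOUaeiou]+|[AEIOUaeiou]+$|[^AEIOUaeiou]'

def compress(word):
    # join the pieces that findall keeps; everything else is dropped
    return ''.join(re.findall(regexp, word))

for sample in ['Universal', 'declaration', 'area']:
    print(sample, '->', compress(sample))
# Universal -> Unvrsl
# declaration -> dclrtn
# area -> area
```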

In [None]:
# compress the first 75 words of the UDHR

english_udhr = nltk.corpus.udhr.words('English-Latin1')
print(nltk.tokenwrap(compress(w) for w in english_udhr[:75]))

In [None]:
rotokas_words = nltk.corpus.toolbox.words('rotokas.dic')
cvs = [cv for w in rotokas_words for cv in re.findall(r'[ptksvr][aeiou]', w)]
cfd = nltk.ConditionalFreqDist(cvs)
cfd.tabulate()

In [None]:
cv_word_pairs = [(cv, w) for w in rotokas_words
                 for cv in re.findall(r'[ptksvr][aeiou]', w)]

In [None]:
cv_index = nltk.Index(cv_word_pairs)

In [None]:
cv_index['su']

In [None]:
cv_index['po']

### Finding Word Stems

In [None]:
def stem(word):
    for suffix in ['ing', 'ly', 'ed', 'ious', 'ies', 'ive', 'es', 's', 'ment']:
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word

In [None]:
re.findall(r'^.*(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing')

In [None]:
re.findall(r'^.*(?:ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing')

In [None]:
re.findall(r'^(.*)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing')

In [None]:
re.findall(r'^(.*)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processes')

In [None]:
re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processes')

In [None]:
re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$', 'language')

In [None]:
def stem(word):
    regexp = r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$'
    stem, suffix = re.findall(regexp, word)[0]
    return stem

raw = """DENNIS: Listen, strange women lying in ponds distributing swords
is no basis for a system of government.  Supreme executive power derives from
a mandate from the masses, not from some farcical aquatic ceremony."""

tokens = word_tokenize(raw)

[stem(t) for t in tokens]

### Searching Tokenized Text

In [None]:
from nltk.corpus import gutenberg, nps_chat

moby = nltk.Text(gutenberg.words('melville-moby_dick.txt'))
moby.findall(r"<a> (<.*>) <man>")

In [None]:
chat = nltk.Text(nps_chat.words())

In [None]:
chat.findall(r"<.*> <.*> <bro>") 

In [None]:
chat.findall(r"<l.*>{3,}")

**Your Turn:** Consolidate your understanding of regular expression patterns and substitutions using `nltk.re_show(p, s)` which annotates the string s to show every place where pattern p was matched, and `nltk.app.nemo()` which provides a graphical interface for exploring regular expressions. For more practice, try some of the exercises on regular expressions at the end of this chapter.

* **Note:** `nltk.app.nemo()` does not work in Google Colab. You can only run this on a local version of Jupyter Notebooks.

In [None]:
from nltk.corpus import brown

hobbies_learned = nltk.Text(brown.words(categories=['hobbies', 'learned']))
hobbies_learned.findall(r"<\w*> <and> <other> <\w*s>")

## 3.6 - Normalizing Text

## 3.7

## 3.8

## 3.9

## Your Turn Solutions

### 3.4

**Your Turn:** The caret symbol `^` matches the start of a string, just like the `$` matches the end. What results do we get with the above example if we leave out both of these, and search for `«..j..t..»`?

#### Solution

Without the anchors, the regular expression matches any word that contains, anywhere inside it, an 8-character substring with `j` as the third character and `t` as the sixth character.

In [None]:
[w for w in wordlist if re.search('..j..t..', w)]

**Your Turn:** Look for some "finger-twisters", by searching for words that only use part of the number-pad. For example `«^[ghijklmno]+$»`, or more concisely, `«^[g-o]+$»`, will match words that only use keys 4, 5, 6 in the center row, and `«^[a-fj-o]+$»` will match words that use keys 2, 3, 5, 6 in the top-right corner. What do `-` and `+` mean?

#### Solution

* the `-` represents a range of characters (ordered by Unicode code point), where the letter on the left is the first letter of the range and the letter on the right is the last
    * for example, the expression `[a-zA-Z]` matches any single ASCII letter
* the `+` symbol means "match one or more of the preceding item"
    * for example, `a+` will match the strings "a", "aaaaa", etc.
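
Both operators can be tried on a small hand-made word list (made up for illustration):

```python
import re

words = ['ghi', 'hello', 'milk', 'a', 'aaaaa', 'banana']

# [g-o] is a range: any single letter from 'g' through 'o'.
# + repeats it one or more times; ^ and $ make it cover the whole word.
center_row = [w for w in words if re.search(r'^[g-o]+$', w)]

# a+ matches one or more consecutive 'a' characters.
only_a = [w for w in words if re.search(r'^a+$', w)]

print(center_row)  # ['ghi', 'milk']
print(only_a)      # ['a', 'aaaaa']
```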

### 3.5

**Your Turn:** In the W3C Date Time Format, dates are represented like this: 2009-12-31. 

Replace the `?` in the following Python code with a regular expression, in order to convert the string `'2009-12-31'` to a list of integers `[2009, 12, 31]`:

In [None]:
[int(n) for n in re.findall(r"[0-9]+", '2009-12-31')]

In [None]:
# Alternative Solution
[int(n) for n in re.findall(r"\d+", '2009-12-31')]

**Your Turn:** Consolidate your understanding of regular expression patterns and substitutions using `nltk.re_show(p, s)` which annotates the string s to show every place where pattern p was matched.

**Solution:**

A regular expression will be used to capture all three-letter words. Notice that the input is a single string rather than a list of word tokens. This changes how the regular expression must be written, because the pattern is no longer applied to one word at a time. Use `\b` (word boundary) to match the beginning and end of each word.

In [None]:
string = "The cow jumped over the moon."

In [None]:
nltk.re_show(r"\b\w{3}\b", string)
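
To see why the boundaries matter, the same pattern can be compared with and without `\b` using `re.findall`. This is a small sketch using only the standard library:

```python
import re

string = "The cow jumped over the moon."

# \b matches the empty position between a word character and a non-word character
with_boundaries = re.findall(r'\b\w{3}\b', string)
without_boundaries = re.findall(r'\w{3}', string)

print(with_boundaries)     # ['The', 'cow', 'the']
print(without_boundaries)  # ['The', 'cow', 'jum', 'ped', 'ove', 'the', 'moo']
```

Without `\b`, the pattern also picks three-letter chunks out of longer words.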