# NLTK Book: Ch. 3. Processing raw text

Code-only version of Chapter 3 of the [NLTK Book](https://www.nltk.org/book/ch03.html), for use in the TAHLR course.

## 3 Processing Raw Text

> The goal of this chapter is to answer the following questions:
> - How can we write programs to access text from local files and from the web, in order to get hold of an unlimited range of language material?
> - How can we split documents up into individual words and punctuation symbols, so we can carry out the same kinds of analysis we did with text corpora in earlier chapters?
> - How can we write programs to produce formatted output and save it in a file?


### 3.1 Accessing text from the web and from disk

In [None]:
import nltk
import re
from pprint import pprint
from nltk import word_tokenize

In [None]:
from urllib import request

url = "http://www.gutenberg.org/files/2554/2554-0.txt"
response = request.urlopen(url)
raw = response.read().decode('utf-8-sig').strip()

In [None]:
type(raw)

In [None]:
len(raw)

In [None]:
raw[:75]

In [None]:
# tokenization

tokens = word_tokenize(raw)

In [None]:
type(tokens)

In [None]:
len(tokens)

In [None]:
print(tokens[:10])

In [None]:
text = nltk.Text(tokens)

In [None]:
type(text)

In [None]:
print(text[1075:1113])

In [None]:
text.collocations()

In [None]:
raw.find("PART I")

In [None]:
raw.rfind("*** END OF THE PROJECT GUTENBERG EBOOK CRIME AND PUNISHMENT ***")

In [None]:
start = raw.find("PART I")
end = raw.rfind("*** END OF THE PROJECT GUTENBERG EBOOK CRIME AND PUNISHMENT ***")
raw_ = raw[start:end]

In [None]:
print(raw_[:100])

In [None]:
print(raw_[-100:])
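
One caveat with the slicing above: `str.find` and `str.rfind` return `-1` when the marker is missing, and slicing with `-1` silently yields the wrong text instead of raising an error. The following is a minimal sketch of a guard around that step; the helper name `slice_between` and the sample string are hypothetical and used only for illustration.

```python
def slice_between(text, start_marker, end_marker):
    """Return text from the first start_marker up to (not including) the last end_marker.

    Hypothetical helper: raises ValueError instead of slicing with -1.
    """
    start = text.find(start_marker)   # -1 if start_marker is absent
    end = text.rfind(end_marker)      # -1 if end_marker is absent
    if start == -1 or end == -1 or end < start:
        raise ValueError('markers not found, or found in the wrong order')
    return text[start:end]


# Made-up sample standing in for a downloaded Project Gutenberg file
sample = "header\nPART I\nbody text\n*** END ***\nlicense"
body = slice_between(sample, "PART I", "*** END ***")
print(repr(body))
```

The same call should work on `raw` with the two marker strings used above, assuming the downloaded file still contains them.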

In [None]:
# Dealing with HTML, Processing Search-Engine Results, Processing RSS Feeds

# Consult book
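
For readers who want a quick look at the HTML case before consulting the book: the book uses the third-party BeautifulSoup package to strip markup. The sketch below swaps in the standard library's `html.parser` so that it runs without extra installs. The class name `TextExtractor`, the helper `html_to_text`, and the sample markup are hypothetical, and this approach is cruder than a full HTML library.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return ' '.join(parser.chunks)


# Made-up markup, used only for illustration
sample_html = ("<html><head><style>p {color: red}</style></head>"
               "<body><h1>Title</h1><p>Some <b>bold</b> text.</p></body></html>")
html_text = html_to_text(sample_html)
print(html_text)
```

The resulting string can be passed to `word_tokenize` in the same way as `raw` above.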

In [None]:
# Reading local files
# https://sa.wikisource.org/wiki/%E0%A4%8B%E0%A4%97%E0%A5%8D%E0%A4%B5%E0%A5%87%E0%A4%A6%E0%A4%83_%E0%A4%B8%E0%A5%82%E0%A4%95%E0%A5%8D%E0%A4%A4%E0%A4%82_%E0%A5%A7%E0%A5%A6.%E0%A5%A7%E0%A5%A8%E0%A5%AF

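# Assumes the file is UTF-8 encoded; the platform default may differ (e.g. on Windows)
f = open('../data/texts/document.txt', encoding='utf-8')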
rv_raw = f.read()
f.close()

In [None]:
# Preferred method

with open('../data/texts/document.txt', encoding='utf-8') as f:  # assumes a UTF-8 file
    rv_raw = f.read()

In [None]:
rv_raw.split('\n')[:4]

In [None]:
import os
os.listdir('../data/texts/')

In [None]:
[file for file in os.listdir('../data/texts/') if file.endswith('.txt')]

In [None]:
from glob import glob

print(glob('../data/texts/*.txt'))

In [None]:
with open('../data/texts/document.txt', 'r', encoding='utf-8') as f:  # assumes a UTF-8 file
    for line in f:
        print(line.strip())

### The NLP Pipeline

In [None]:
tokens = word_tokenize(rv_raw)
print(tokens)

In [None]:
vocab = sorted(set(tokens))
print(vocab)

In [None]:
raw = 'It was the best of times. It was the worst of times.'
tokens = word_tokenize(raw)
print(tokens)

In [None]:
vocab = sorted(set(tokens))
print(vocab)

## 3.2 Strings: Text Processing at the Lowest Level

In [None]:
monty = 'Monty Python'
circus = "Monty Python's Flying Circus"
circus = 'Monty Python\'s Flying Circus'

In [None]:
couplet = "Shall I compare thee to a Summer's day?"\
            "Thou art more lovely and more temperate:"
print(couplet)

In [None]:
couplet = ("Rough winds do shake the darling buds of May,"
            "And Summer's lease hath all too short a date:")
print(couplet)

In [None]:
print("""Shall I compare thee to a Summer's day?
Thou art more lovely and more temperate:""")

In [None]:
'very' + 'very' + 'very'

In [None]:
'very' * 3

In [None]:
''.join(['very', 'very', 'very'])

In [None]:
' '.join(['very', 'very', 'very'])

In [None]:
monty

In [None]:
print(monty)

In [None]:
monty[0]

In [None]:
monty[3]

In [None]:
monty[5]

In [None]:
# monty[20]  # uncomment to see: IndexError: string index out of range

In [None]:
monty[-1]

In [None]:
monty[len(monty) - 7]

In [None]:
monty[-7]

In [None]:
sent = 'colorless green ideas sleep furiously'
for char in sent:
    print(char, end=' ')

In [None]:
from nltk.corpus import gutenberg
raw = gutenberg.raw('melville-moby_dick.txt')
fdist = nltk.FreqDist(ch.lower() for ch in raw if ch.isalpha())

In [None]:
fdist.most_common(5)

In [None]:
print([char for (char, count) in fdist.most_common()])

In [None]:
print(len([char for (char, count) in fdist.most_common()]))

In [None]:
#  Monty Python
#  012345678901
#- 210987654321

monty[6:10]

In [None]:
monty[-12:-7]

In [None]:
monty[:5]

In [None]:
monty[6:]

In [None]:
phrase = 'And now for something completely different'
if 'thing' in phrase:
    print('found "thing"')

In [None]:
if 'nothing' not in phrase:
    print('found nothing')

In [None]:
monty.find('Python')

## 3.4 Regular expressions for detecting word patterns

In [None]:
wordlist = [w for w in nltk.corpus.words.words('en') if w.islower()]

In [None]:
# $ = end of string
print([w for w in wordlist if re.search('ed$', w)][:5])


In [None]:
# ^ = start of string
# $ = end of string
# . = any character

print([w for w in wordlist if re.search('^..j..t..$', w)][:5])

In [None]:
print([w for w in wordlist if re.search('^[ghi][mno][jlk][def]$', w)])

In [None]:
chat_words = sorted(set(w for w in nltk.corpus.nps_chat.words()))

In [None]:
print([w for w in chat_words if re.search('^m+i+n+e+$', w)])

In [None]:
import random

haha = [w for w in chat_words if re.search('^[ha]+$', w)]
haha_sample = random.sample(haha, 5)
print(haha_sample)

In [None]:
wsj = sorted(set(nltk.corpus.treebank.words()))
print(random.sample([w for w in wsj if re.search('^[0-9]{4}$', w)], 5))

In [None]:
print(random.sample([w for w in wsj if re.search('(ed|ing)$', w)], 5))

## 3.5 Useful Applications of Regular Expressions

In [None]:
word = 'supercalifragilisticexpialidocious'
print(re.findall(r'[aeiou]', word))

In [None]:
len(re.findall(r'[aeiou]', word))

In [None]:
wsj = sorted(set(nltk.corpus.treebank.words()))
fd = nltk.FreqDist(vs for word in wsj
                    for vs in re.findall(r'[aeiou]{2,}', word))
pprint(fd.most_common(12))
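
When the pattern contains groups, `re.findall` returns tuples of the captured groups, which is enough for a rough suffix stripper. The following is a minimal sketch based on the suffix pattern discussed in this section of the book; the function name `stem` and the word list are chosen here for illustration, and the result is far cruder than the stemmers in section 3.6.

```python
import re

# Non-greedy stem followed by one optional suffix, anchored at both ends
SUFFIX_PATTERN = r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$'


def stem(word):
    """Strip one common English suffix, if present."""
    stem_part, suffix = re.findall(SUFFIX_PATTERN, word)[0]
    return stem_part


words = ['processing', 'processes', 'language', 'is']
stems = [stem(w) for w in words]
print(stems)  # ['process', 'process', 'language', 'i']
```

The last item shows the limitation: the pattern strips the `s` from `is`, because it has no notion of which strings are real suffixes in context.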

## 3.6 Normalizing text

In [None]:
raw = """DENNIS: Listen, strange women lying in ponds distributing swords is no basis for a system of government.  Supreme executive power derives from a mandate from the masses, not from some farcical aquatic ceremony."""
tokens = word_tokenize(raw)

In [None]:
porter = nltk.PorterStemmer()
lancaster = nltk.LancasterStemmer()

In [None]:
print([porter.stem(t) for t in tokens])

In [None]:
print([lancaster.stem(t) for t in tokens])

In [None]:
# Lemmatization

wnl = nltk.WordNetLemmatizer()
print([wnl.lemmatize(t) for t in tokens])

## 3.7 Regular expressions for tokenizing text

"Tokenization turns out to be a far more difficult task than you might have expected."

In [None]:
raw = """'When I'M a Duchess,' she said to herself, (not in a very hopeful tone
though), 'I won't have any pepper in my kitchen AT ALL. Soup does very
well without--Maybe it's always pepper that makes people hot-tempered,'..."""

In [None]:
print(re.split(r' ', raw))

In [None]:
# cf. the built-in str.split, which gives the same result for a single-space separator
print(raw.split(' '))
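
Splitting on spaces leaves punctuation glued to the words. A common alternative is to describe the tokens themselves with a pattern and collect them with `re.findall`. The sketch below uses the pattern the book arrives at in this section; the sample sentence is made up for illustration, and the pattern is a teaching example rather than a replacement for `word_tokenize`.

```python
import re

# Alternatives, tried left to right at each position:
#   \w+(?:[-']\w+)*   words, allowing internal hyphens and apostrophes
#   '                 a lone apostrophe or single quote
#   [-.(]+            runs of hyphens, periods, or opening parentheses
#   \S\w*             any other non-space character, plus trailing word characters
TOKEN_PATTERN = r"\w+(?:[-']\w+)*|'|[-.(]+|\S\w*"

sample = "'I won't eat hot-tempered soup--ever,' she said."
regex_tokens = re.findall(TOKEN_PATTERN, sample)
print(regex_tokens)
```

Here `won't` and `hot-tempered` stay whole, while the double hyphen, the comma, the quotes, and the final period become separate tokens.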

In [None]:
text = nltk.corpus.gutenberg.raw('chesterton-thursday.txt')
sents = nltk.sent_tokenize(text)
for i, sent in enumerate(sents[79:89]):
    print(f'{i}: {sent}')

In [None]:
print(rv_raw)

In [None]:
units = re.split(r'[।॥]\n', rv_raw)
print(units[:4])

## Formatting: From lists to strings

In [None]:
silly = ['We', 'called', 'him', 'Tortoise', 'because', 'he', 'taught', 'us', '.']

In [None]:
' '.join(silly)

In [None]:
'|'.join(silly)

In [None]:
''.join(silly)

In [None]:
# String formatting

fdist = nltk.FreqDist(['dog', 'cat', 'dog', 'cat', 'dog', 'snake', 'dog', 'cat'])

for word in sorted(fdist):
    print(f'{word} -> {fdist[word]}', end='; ')

In [None]:
import math
f'{math.pi:.4f}'

In [None]:
count, total = 3205, 9375
f'accuracy for {total} words: {100 * count/total:.4f}%'
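
Format specifiers can also set a field width and alignment, which helps when lining up columns of words and counts. This is a minimal sketch with made-up counts; the widths 6 and 3 are arbitrary choices for this example.

```python
# Hypothetical counts, chosen only for illustration
counts = {'dog': 4, 'cat': 3, 'snake': 1}

# '<6' left-aligns the word in a 6-character field;
# '>3' right-aligns the count in a 3-character field
lines = [f'{word:<6}{count:>3}' for word, count in counts.items()]
print('\n'.join(lines))
```

Every line comes out 9 characters wide, so the counts line up in a column regardless of word length (as long as no word exceeds the field width).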

In [None]:
# !pip install tabulate

In [None]:
from tabulate import tabulate

from nltk.corpus import brown
cfd = nltk.ConditionalFreqDist(
    (genre, word)
    for genre in brown.categories()
    for word in brown.words(categories=genre))

genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
modals = ['can', 'could', 'may', 'might', 'must', 'will']

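# cfd.tabulate() prints its own table and returns None, so passing it to tabulate() fails.
# Build the rows explicitly instead.
rows = [[genre] + [cfd[genre][modal] for modal in modals] for genre in genres]
print(tabulate(rows, headers=['genre'] + modals))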

In [None]:
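import os

# Create the output directory first; open() does not create missing directories
os.makedirs('../data/texts/output', exist_ok=True)
words = set(nltk.corpus.genesis.words('english-kjv.txt'))
with open('../data/texts/output/kjv-words.txt', 'w') as f:
    for word in sorted(words, key=str.lower):
        if word.isalpha():
            f.write(f'{word}\n')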
