# Chapter 3

In [30]:
import nltk, re, pprint
from nltk import word_tokenize
from urllib.request import urlopen

## Accessing Text from the Web and from Disk

A small sample of texts from Project Gutenberg appears in the NLTK corpus collection. However, you may be interested in analyzing other texts from Project Gutenberg. You can browse the catalog of 25,000 free online books at http://www.gutenberg.org/catalog/, and obtain a URL to an ASCII text file.

Text number 2554 is an English translation of Crime and Punishment, and we can access it as follows:

In [32]:
from urllib.request import urlopen  # in Python 3, urlopen lives in urllib.request

# The URL of the plain-text edition of Crime and Punishment
url = "http://www.gutenberg.org/files/2554/2554-0.txt"
response = urlopen(url)
# read() returns bytes; decode them into a Python string
raw = response.read().decode('utf8')
# What is the type of the decoded data? In Python 3 it is str (a Unicode string)
type(raw)

str

In [33]:
len(raw)
# number of characters (including spaces) from this text file from the web

1176812

In [34]:
raw[:75]
# what are the first 75 characters from this text file?

'\ufeffThe Project Gutenberg eBook of Crime and Punishment, by Fyodor Dostoevsky\r'
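The '\ufeff' at the start of the output above is a byte order mark (BOM) that some UTF-8 files begin with. The following is a minimal sketch, using a made-up byte string rather than the real download, of how Python's 'utf-8-sig' codec drops a leading BOM while plain 'utf8' keeps it:

```python
# Hypothetical bytes standing in for the start of a downloaded file.
data = b'\xef\xbb\xbfThe Project Gutenberg eBook'

with_bom = data.decode('utf8')          # keeps the BOM as '\ufeff'
without_bom = data.decode('utf-8-sig')  # strips a leading BOM if one is present

print(repr(with_bom[:4]))
print(repr(without_bom[:3]))
```

Decoding with 'utf-8-sig' would also keep the BOM out of the first token produced later by word_tokenize.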

For our language processing, we want to break up the string into words and punctuation, as we saw in Chapter 1. This step is called tokenization, and it produces our familiar structure, a list of words and punctuation.

In [35]:
tokens = word_tokenize(raw)
# word_tokenize converts raw string data into word tokens
type(tokens)
# Shows that these tokens are placed in a list

list

In [36]:
len(tokens)
# Return the number of word tokens

257058

In [37]:
tokens[:10]
# let's return the first 10 words/tokens

['\ufeffThe',
 'Project',
 'Gutenberg',
 'eBook',
 'of',
 'Crime',
 'and',
 'Punishment',
 ',',
 'by']

Notice that NLTK was needed for tokenization, but not for any of the earlier tasks of opening a URL and reading it into a string. If we now take the further step of creating an NLTK text from this list, we can carry out all of the other linguistic processing we saw in Chapter 1, along with the regular list operations like slicing:

In [38]:
text = nltk.Text(tokens)
# Convert tokens list into a format NLTK can understand and process

In [39]:
type(text)
# Now we have an NLTK text

nltk.text.Text

In [40]:
text[1024:1062]
# return this subset of words/tokens

['insight',
 'impresses',
 'us',
 'as',
 'wisdom',
 '...',
 'that',
 'wisdom',
 'of',
 'the',
 'heart',
 'which',
 'we',
 'seek',
 'that',
 'we',
 'may',
 'learn',
 'from',
 'it',
 'how',
 'to',
 'live',
 '.',
 'All',
 'his',
 'other',
 'gifts',
 'came',
 'to',
 'him',
 'from',
 'nature',
 ',',
 'this',
 'he',
 'won',
 'for']

In [41]:
text.collocations()
# Remember, "Collocations are expressions of multiple words which commonly co-occur."

Katerina Ivanovna; Pyotr Petrovitch; Pulcheria Alexandrovna; Avdotya
Romanovna; Rodion Romanovitch; Marfa Petrovna; Sofya Semyonovna; old
woman; Project Gutenberg-tm; Porfiry Petrovitch; Amalia Ivanovna;
great deal; young man; Nikodim Fomitch; Project Gutenberg; Ilya
Petrovitch; Andrey Semyonovitch; Hay Market; Dmitri Prokofitch; Good
heavens


Notice that Project Gutenberg appears as a collocation. This is because each text downloaded from Project Gutenberg contains a 
header with the name of the text, the author, the names of people who scanned and corrected the text, a license, and so on. 
Sometimes this information appears in a footer at the end of the file. We cannot reliably detect where the content begins and ends, 
and so have to resort to manual inspection of the file, to discover unique strings that mark the beginning and the end, before 
trimming raw to be just the content and nothing else:


In [42]:
raw.find("PART I")

5575

In [43]:
raw.rfind("End of Project Gutenberg's Crime")

-1

In [44]:
# Here we slice raw down to the content, dropping the Project Gutenberg header and footer.
# find() returns the index of the first match and rfind() ("reverse find") the index of the last match.
# Both return -1 when the string is absent, as rfind() did above for this edition of the file,
# so the slice bounds below were chosen by manual inspection.

raw = raw[5338:1157743]
raw.find("PART I")

# The slice starts at index 5338, a little before the first "PART I" (index 5575), so "PART I" now sits at index 237.

# This was our first brush with the reality of the web: texts found on the web may contain unwanted material, and there may not be an 
# automatic way to remove it. But with a small amount of extra work we can extract the material we need.

237
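The trimming idea above can be sketched on a toy string. The text and marker strings here are made up for illustration; with a real file you would pick markers after inspecting it by hand:

```python
# Made-up miniature "file": header, content, footer.
sample = "HEADER junk\nPART I\nThe story itself.\nEND OF BOOK\nlicense text"

start = sample.find("PART I")       # index of the first match, or -1 if absent
end = sample.rfind("END OF BOOK")   # index of the last match, or -1 if absent

# Guard against a missing marker, since slicing with -1 would silently misbehave.
if start == -1 or end == -1:
    raise ValueError("marker not found; inspect the file by hand")

content = sample[start:end]
print(content)
```

Checking for -1 before slicing matters because a slice such as sample[start:-1] is still valid Python and would quietly keep the footer.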

## Dealing with HTML

Much of the text on the web is in the form of HTML documents. You can use a web browser to save a page as text to a local file, 
then access this as described in the section on files below. However, if you're going to do this often, it's easiest to get Python 
to do the work directly. The first step is the same as before, using urlopen. For fun we'll pick a BBC News story called Blondes to 
die out in 200 years, an urban legend passed along by the BBC as established scientific fact:

In [45]:
url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"

# Remember, we use urlopen instead of request.urlopen
html = urlopen(url).read().decode('utf8')
html[:60]

'<!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN'

You can type print(html) to see the HTML content in all its glory, including meta tags, an image map, JavaScript, forms, and tables.
To get text out of HTML we will use a Python library called BeautifulSoup, available from 
http://www.crummy.com/software/BeautifulSoup/:

In [46]:
from bs4 import BeautifulSoup
raw = BeautifulSoup(html, "lxml").get_text()  # "lxml" names the parser explicitly
tokens = word_tokenize(raw)
tokens

['BBC',
 'NEWS',
 '|',
 'Health',
 '|',
 'Blondes',
 "'to",
 'die',
 'out',
 'in',
 '200',
 "years'",
 'NEWS',
 'SPORT',
 'WEATHER',
 'WORLD',
 'SERVICE',
 'A-Z',
 'INDEX',
 'SEARCH',
 'You',
 'are',
 'in',
 ':',
 'Health',
 'News',
 'Front',
 'Page',
 'Africa',
 'Americas',
 'Asia-Pacific',
 'Europe',
 'Middle',
 'East',
 'South',
 'Asia',
 'UK',
 'Business',
 'Entertainment',
 'Science/Nature',
 'Technology',
 'Health',
 'Medical',
 'notes',
 '--',
 '--',
 '--',
 '--',
 '--',
 '--',
 '-',
 'Talking',
 'Point',
 '--',
 '--',
 '--',
 '--',
 '--',
 '--',
 '-',
 'Country',
 'Profiles',
 'In',
 'Depth',
 '--',
 '--',
 '--',
 '--',
 '--',
 '--',
 '-',
 'Programmes',
 '--',
 '--',
 '--',
 '--',
 '--',
 '--',
 '-',
 'SERVICES',
 'Daily',
 'E-mail',
 'News',
 'Ticker',
 'Mobile/PDAs',
 '--',
 '--',
 '--',
 '--',
 '--',
 '--',
 '-',
 'Text',
 'Only',
 'Feedback',
 'Help',
 'EDITIONS',
 'Change',
 'to',
 'UK',
 'Friday',
 ',',
 '27',
 'September',
 ',',
 '2002',
 ',',
 '11:51',
 'GMT',
 '12:51'

In [47]:
# This still contains unwanted material concerning site navigation and related stories. With some trial and error you can find the 
# start and end indexes of the content and select the tokens of interest, and initialize a text as before.

tokens = tokens[110:390]
text = nltk.Text(tokens)
text.concordance('gene')

Displaying 5 of 5 matches:
hey say too few people now carry the gene for blondes to last beyond the next 
blonde hair is caused by a recessive gene . In order for a child to have blond
 have blonde hair , it must have the gene on both sides of the family in the g
ere is a disadvantage of having that gene or by chance . They do n't disappear
des would disappear is if having the gene was a disadvantage and I do not thin


## Processing Search Engine Results

The web can be thought of as a huge corpus of unannotated text. Web search engines provide an efficient means of searching this large 
quantity of text for relevant linguistic examples. The main advantage of search engines is size: since you are searching such a large 
set of documents, you are more likely to find any linguistic pattern you are interested in. Furthermore, you can make use of very 
specific patterns, which would only match one or two examples on a smaller corpus, but which might match tens of thousands of 
examples when run on the web. A second advantage of web search engines is that they are very easy to use. Thus, they provide a very 
convenient tool for quickly checking a theory, to see if it is reasonable.

Unfortunately, search engines have some significant shortcomings. First, the allowable range of search patterns is severely 
restricted. Unlike local corpora, where you write programs to search for arbitrarily complex patterns, search engines generally only 
allow you to search for individual words or strings of words, sometimes with wildcards. Second, search engines give inconsistent 
results, and can give widely different figures when used at different times or in different geographical regions. When content has 
been duplicated across multiple sites, search results may be boosted. Finally, the markup in the result returned by a search engine 
may change unpredictably, breaking any pattern-based method of locating particular content (a problem which is ameliorated by the use 
of search engine APIs).

## Processing RSS Feeds

The blogosphere is an important source of text, in both formal and informal registers. With the help of a Python library called the 
Universal Feed Parser, available from https://pypi.python.org/pypi/feedparser, we can access the content of a blog, as shown below:

In [48]:
# feedparser is a third-party package; I installed it from the Windows Command prompt with "pip install feedparser"
import feedparser

llog = feedparser.parse("http://languagelog.ldc.upenn.edu/nll/?feed=atom")
llog['feed']['title']

'Language Log'

In [49]:
len(llog.entries)

13

In [50]:
post = llog.entries[2]
post.title

'White tongue'

In [51]:
content = post.content[0].value
content[:70]

'<p>Two days ago, I met a person who had a thick white coating on their'

In [52]:
raw = BeautifulSoup(content, "lxml").get_text()  # name the parser, as in the BBC example
word_tokenize(raw)

# With some further work, we can write programs to create a small corpus of blog posts, and use this as the basis for our NLP work.

['Two',
 'days',
 'ago',
 ',',
 'I',
 'met',
 'a',
 'person',
 'who',
 'had',
 'a',
 'thick',
 'white',
 'coating',
 'on',
 'their',
 'tongue',
 '.',
 'Wondering',
 'what',
 'it',
 'was',
 'called',
 'and',
 'its',
 'implications',
 'for',
 'health',
 ',',
 'I',
 'asked',
 'members',
 'of',
 'the',
 'e-Mair',
 'list',
 'about',
 'it',
 '.',
 'Here',
 'are',
 'some',
 'of',
 'the',
 'answers',
 'I',
 'received',
 ':',
 'Denis',
 '(',
 'Sinologist',
 ')',
 ':',
 'Thick',
 'tongue',
 'coating',
 ',',
 'often',
 'due',
 'to',
 'lengthening',
 'of',
 'the',
 'keratinous',
 'papillae',
 'on',
 'the',
 'tongue',
 "'s",
 'surface',
 '.',
 'Heidi',
 '(',
 'Yoga',
 'teacher',
 'and',
 'Ayurveda',
 'specialist',
 ')',
 ':',
 'We',
 'call',
 'it',
 '``',
 'ama',
 "''",
 'in',
 'Ayurveda',
 '–',
 'accumulated',
 'toxins',
 'from',
 'undigested',
 'foods',
 '.',
 'The',
 'person',
 'who',
 'has',
 'it',
 'might',
 'be',
 'ill',
 '.',
 'I',
 'scrape',
 'my',
 'tongue',
 'every',
 'day',
 'From',
 'Pr

## Reading Local Files

In order to read a local file, we need to use Python's built-in open() function, followed by the read() method. If you have a 
file document.txt, you can load its contents like this:

Note: first create a document.txt in c:\cgraph.

In [53]:
f = open('document.txt')
raw = f.read()

FileNotFoundError: [Errno 2] No such file or directory: 'document.txt'

To check that the file that you are trying to open is really in the right directory, use IDLE's Open command in the File menu; this
will display a list of all the files in the directory where IDLE is running. An alternative is to examine the current directory 
from within Python:

In [None]:
import os
os.listdir('.')

Another possible problem you might have encountered when accessing a text file is the newline conventions, which are different for 
different operating systems. In Python 3, open() in text mode (the default, 'r') already translates the different newline 
conventions to '\n' when reading, so no extra flag is needed. Older code often passes 'rU' as the second argument, where 'U' stands 
for "Universal" newlines; that flag is deprecated in Python 3 and was removed in Python 3.11, so plain open('document.txt') is the safer choice.

Assuming that you can open the file, there are several methods for reading it. The read() method creates a string with the contents
of the entire file:

- f.read()


Recall that the '\n' characters are newlines; this is equivalent to pressing Enter on a keyboard and starting a new line.

In [None]:
# We can also read a file one line at a time using a for loop:

with open('document.txt') as f:
    for line in f:
        # strip() removes the newline character at the end of the input line
        print(line.strip())
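The two reading styles described above can be tried without preparing a file by hand. This sketch writes a throwaway file first; the directory, file contents, and variable names are assumptions made for the example:

```python
import os
import tempfile

# Create a throwaway file so the example runs anywhere.
tmp_dir = tempfile.mkdtemp()
doc_path = os.path.join(tmp_dir, 'document.txt')
with open(doc_path, 'w', encoding='utf8') as f:
    f.write('Time flies like an arrow.\nFruit flies like a banana.\n')

# Read the whole file into one string; the '\n' newlines are kept.
with open(doc_path, encoding='utf8') as f:
    raw_text = f.read()

# Read it one line at a time, stripping the trailing newline from each line.
with open(doc_path, encoding='utf8') as f:
    lines = [line.strip() for line in f]

print(repr(raw_text))
print(lines)
```

Using with closes the file automatically, and passing encoding explicitly avoids depending on the platform default.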

In [None]:
# NLTK's corpus files can also be accessed using these methods. We simply have to use nltk.data.find() to get the filename for any 
# corpus item. Then we can open and read it in the way we just demonstrated above:

path = nltk.data.find('corpora/gutenberg/melville-moby_dick.txt')
raw = open(path).read()  # text mode already handles the different newline conventions

## Tokenization
Tokenization splits a phrase, sentence, paragraph, or an entire text document into smaller units, such as individual words or terms. Each of these smaller units is called a token.

NLTK provides the functions word_tokenize() and sent_tokenize(), which we can import from the nltk library:

In [1]:
from nltk import word_tokenize
text = "There are several fishes in the sea, can you see them ?"
word_token = word_tokenize(text)
print(word_token)

['There', 'are', 'several', 'fishes', 'in', 'the', 'sea', ',', 'can', 'you', 'see', 'them', '?']


In [2]:
from nltk import sent_tokenize
text = "There are several fishes in the sea, can you see them ?"
sent_token = sent_tokenize(text)
print(sent_token)

['There are several fishes in the sea, can you see them ?']


Here is what each function does:
- Word tokenize: We use the word_tokenize() function to split a sentence into words. The output of word tokenization can be converted to a DataFrame for better text understanding in machine learning applications.

- Sentence tokenize: We use the sent_tokenize() function to split a document or paragraph into sentences.

## White Space Tokenization
The simplest way to tokenize text is to use whitespace within a string as the “delimiter” of words. This can be accomplished with Python’s split function, which is available on all string object instances as well as on the string built-in class itself. You can change the separator any way you need.

In [3]:
Sentence = "I was born in jordan in 2002, Therefore i am 19 years old"
Sentence.split()

['I',
 'was',
 'born',
 'in',
 'jordan',
 'in',
 '2002,',
 'Therefore',
 'i',
 'am',
 '19',
 'years',
 'old']

As you can see, this built-in Python method already does a good job of tokenizing a simple sentence. Its “mistake” was on the token “2002,”, where it kept the trailing comma attached to the number. We need the tokens to be separated from neighboring punctuation and other significant tokens in a sentence.

In the example below, we’ll perform sentence tokenization using the comma as a separator.

In [4]:
Sentence.split(',')

['I was born in jordan in 2002', ' Therefore i am 19 years old']

NLTK also provides several other tokenizers, for example:

## Punctuation-based tokenizer
This tokenizer splits sentences into words based on whitespace and punctuation.

In [5]:
from nltk import wordpunct_tokenize
print(wordpunct_tokenize(Sentence))

['I', 'was', 'born', 'in', 'jordan', 'in', '2002', ',', 'Therefore', 'i', 'am', '19', 'years', 'old']
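The idea behind punctuation-based tokenization can be sketched with the standard re module alone. The pattern below is an assumption chosen for illustration (runs of word characters, or runs of punctuation), and is not presented as NLTK's exact implementation:

```python
import re

sentence = "I was born in jordan in 2002, Therefore i am 19 years old"

# Either a run of word characters, or a run of characters that are
# neither word characters nor whitespace (that is, punctuation).
pattern = r"\w+|[^\w\s]+"
pieces = re.findall(pattern, sentence)

print(pieces)
```

Unlike str.split(), this separates the comma from "2002", at the cost of also splitting things like decimal numbers at the point.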


## Treebank Word tokenizer
This tokenizer incorporates a variety of common rules for English word tokenization. It separates phrase-terminating punctuation like (?!.;,) from adjacent tokens and retains decimal numbers as a single token. It also contains rules for English contractions.

For example “don’t” is tokenized as [“do”, “n’t”].

In [11]:
from nltk import TreebankWordTokenizer
Tokenizer = TreebankWordTokenizer()
print(Tokenizer.tokenize(Sentence))

['I', 'was', 'born', 'in', 'jordan', 'in', '2002', ',', 'Therefore', 'i', 'am', '19', 'years', 'old']


## Extracting Text from PDF, MSWord and other Binary Formats

ASCII text and HTML text are human readable formats. Text often comes in binary formats — like PDF and MSWord — that can only be 
opened using specialized software. Third-party libraries such as pypdf and pywin32 provide access to these formats. Extracting 
text from multi-column documents is particularly challenging. For once-off conversion of a few documents, it is simpler to open the 
document with a suitable application, then save it as text to your local drive, and access it as described above. If the document 
is already on the web, you can enter its URL in Google's search box. The search result often includes a link to an HTML version of 
the document, which you can save as text.

### Capturing User Input

Sometimes we want to capture the text that a user inputs when she is interacting with our program. To prompt the user to type a 
line of input, call the Python function input(). After saving the input to a variable, we can manipulate it just as we have done 
for other strings.

- s = input("Enter some text: ")

Note: input() is the Python 3 function, as mentioned earlier in the NLTK book. In Python 2, use raw_input() instead.

In [None]:
s = input("Enter some text: ")

In [None]:
print("You typed", len(word_tokenize(s)), "words.")

When we tokenize a string we produce a list (of words), and this is Python's <list> type. Normalizing and sorting lists produces 
other lists:
Note that word_tokenize is from the NLTK library (as imported above)

In [None]:
tokens = word_tokenize(raw)

print(type(tokens))
print(tokens)

In [None]:
# Here we make all tokens lower case and turn into a new list words

words = [w.lower() for w in tokens]
print(type(words))
print (words)

In [None]:
# We take our lower case words, apply set (to get the "set" of words or vocabulary)
# Then we sort it and save to a new variable, vocab

vocab = sorted(set(words))
print(type(vocab))
print(vocab)

In [None]:
# The type of an object determines what operations you can perform on it. So, for example, we can append to a list but not to a 
# string:

vocab.append('blog')
print (vocab)

# note, every time I rerun this code, I add "blog" to the end of it...

In [None]:
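# Strings have no append() method, so this raises an AttributeError:
raw.append('blog')

In [None]:
# The line above is an error. However, strings can be concatenated with +:

raw = raw + " blog"
print(raw)

# This works, and raw now ends with " blog".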

In [58]:
# Similarly, we can concatenate strings with strings, and lists with lists, but we cannot concatenate strings with lists:

# query is a string
query = 'Who knows?'

# beatles is a list

beatles = ['john', 'paul', 'george', 'ringo']

# but we cannot concatenate a string to a list...
query + beatles

TypeError: can only concatenate str (not "list") to str
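A short sketch of the ways around this TypeError, depending on whether a list or a string is wanted as the result. The variable names as_list and as_string are made up for the example:

```python
query = 'Who knows?'
beatles = ['john', 'paul', 'george', 'ringo']

# list + list: wrap the string in a one-element list first.
as_list = [query] + beatles

# str + str: join the list into a single string first.
as_string = query + ' ' + ' '.join(beatles)

# Mixing the two types directly is what fails.
try:
    query + beatles
except TypeError as err:
    print('TypeError:', err)

print(as_list)
print(as_string)
```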

In [54]:
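# We can, however, append the string to the list:

beatles.append(query)
print(beatles)

# This works once the cell above (which defines query and beatles) has been run.
# The NameError recorded below came from running this cell before that one.

NameError: name 'beatles' is not defined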

# 3.3 Text Processing with Unicode

Our programs will often need to deal with different languages, and different character sets. The concept
of "plain text" is a fiction. If you live in the English-speaking world you probably use ASCII, possibly 
without realizing it. If you live in Europe you might use one of the extended Latin character sets, 
containing such characters as "ø" for Danish and Norwegian, "ő" for Hungarian, "ñ" for Spanish and 
Breton, and "ň" for Czech and Slovak. In this section, we will give an overview of how to use Unicode 
for processing texts that use non-ASCII character sets.

## What is Unicode?

Unicode supports over a million characters. Each character is assigned a number, called a code point. 
In Python, code points are written in the form \uXXXX, where XXXX is the number in 4-digit hexadecimal 
form.
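As a small illustration of code points, the built-in functions ord() and chr() convert between a character and its number. The character 'ń' is picked here only as an example:

```python
ch = 'ń'  # LATIN SMALL LETTER N WITH ACUTE

code_point = ord(ch)           # the code point as an integer
print(code_point)
print(hex(code_point))         # the same number in hexadecimal

# The \uXXXX escape and chr() both name the same character.
print('\u0144' == ch)
print(chr(0x144) == ch)
```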

Within a program, we can manipulate Unicode strings just like normal strings. However, when Unicode 
characters are stored in files or displayed on a terminal, they must be encoded as a stream of bytes. 
Some encodings (such as ASCII and Latin-2) use a single byte per code point, so they can only support 
a small subset of Unicode, enough for a single language. Other encodings (such as UTF-8) use multiple 
bytes and can represent the full range of Unicode characters.

Text in files will be in a particular encoding, so we need some mechanism for translating it into 
Unicode — translation into Unicode is called decoding. Conversely, to write out Unicode to a file or a 
terminal, we first need to translate it into a suitable encoding — this translation out of Unicode is 
called encoding, and is illustrated in 3.3.
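A minimal sketch of encoding and decoding with the str.encode() and bytes.decode() methods. The sample word is an assumption chosen because it contains one non-ASCII character:

```python
text = 'Państwowa'

latin2_bytes = text.encode('latin2')  # one byte per character
utf8_bytes = text.encode('utf8')      # 'ń' needs two bytes in UTF-8

print(len(text), len(latin2_bytes), len(utf8_bytes))

# Decoding with the same encoding recovers the original string.
print(latin2_bytes.decode('latin2') == text)
print(utf8_bytes.decode('utf8') == text)

# Decoding with the wrong encoding fails (or, in other cases, yields garbage).
try:
    latin2_bytes.decode('utf8')
except UnicodeDecodeError as err:
    print('wrong codec:', err.reason)
```

This is why we need to know, or correctly guess, the encoding of a file before reading it.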

From a Unicode perspective, characters are abstract entities which can be realized as one or more 
glyphs. Only glyphs can appear on a screen or be printed on paper. A font is a mapping from characters 
to glyphs.

### Extracting encoded text from files

Let's assume that we have a small text file, and that we know how it is encoded. For example, 
polish-lat2.txt, as the name suggests, is a snippet of Polish text (from the Polish Wikipedia; see 
http://pl.wikipedia.org/wiki/Biblioteka_Pruska). This file is encoded as Latin-2, also known as 
ISO-8859-2. The function nltk.data.find() locates the file for us:

-> path = nltk.data.find('corpora/unicode_samples/polish-lat2.txt')

In [55]:
# The Python open() function can read encoded data into Unicode strings, and write out Unicode strings 
# in encoded form. It takes a parameter to specify the encoding of the file being read or written. So 
# let's open our Polish file with the encoding 'latin2' and inspect the contents of the file:

# Locate the sample file first, so that path is defined before we use it
path = nltk.data.find('corpora/unicode_samples/polish-lat2.txt')

with open(path, encoding='latin2') as f:
    for line in f:
        line = line.strip()
        print(line)

# In Python 2.X, open() has no encoding parameter, so codecs.open() was needed instead.

In [56]:
# If this does not display correctly on your terminal, or if we want to see the underlying numerical 
# values (or "codepoints") of the characters, then we can convert all non-ASCII characters into their 
# two-digit \xXX and four-digit \uXXXX representations:

# path is the location of polish-lat2.txt found with nltk.data.find() above
with open(path, encoding='latin2') as f:
    for line in f:
        line = line.strip()
        print(line.encode('unicode_escape'))
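The effect of the 'unicode_escape' codec can be seen without the corpus file. This sketch uses a short Polish phrase typed in directly, which is an assumption standing in for a line of the file:

```python
line = 'Pruska Biblioteka Państwowa'

# ASCII characters pass through; 'ń' (code point 0x144) becomes \u0144.
escaped = line.encode('unicode_escape')
print(escaped)

# Characters below 0x100 get the shorter two-digit \xXX form.
print('ó'.encode('unicode_escape'))
```

Note that the result is a bytes object, which is why Python prints it with a b prefix.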

## Using your local encoding in Python
If you are used to working with characters in a particular local encoding, you probably want to be able to use your standard methods for inputting and editing strings in a Python file. To do this, include the string '# -*- coding: <coding> -*-' as the first or second line of your file, where <coding> is a string like 'latin-1', 'big5' or 'utf-8'. Without such a line, Python 3 assumes the source file is encoded as UTF-8.

## Finding Word Stems Using Regular Expressions
When we use a web search engine, we usually don't mind (or even notice) if the words in the document differ from our search terms in having different endings. A query for laptops finds documents containing laptop and vice versa. Indeed, laptop and laptops are just two forms of the same dictionary word (or lemma). For some language processing tasks we want to ignore word endings, and just deal with word stems.

There are various ways we can pull out the stem of a word. Here's a simple-minded approach which just strips off anything that looks like a suffix:

In [12]:
def stem(word):
    # Strip the first suffix in the list that the word ends with
    for suffix in ['ing', 'ly', 'ed', 'ious', 'ies', 'ive', 'es', 's', 'ment']:
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word

Although we will ultimately use NLTK's built-in stemmers, it's interesting to see how we can use regular expressions for this task. Our first step is to build up a disjunction of all the suffixes. We need to enclose it in parentheses in order to limit the scope of the disjunction.

In [15]:
import re
re.findall(r'^.*(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing')

['ing']

Here, re.findall() just gave us the suffix even though the regular expression matched the entire word. This is because the parentheses have a second function, to select substrings to be extracted. If we want to use the parentheses to specify the scope of the disjunction, but not to select the material to be output, we have to add ?: after the opening parenthesis, which makes the group non-capturing:

In [16]:
re.findall(r'^.*(?:ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing')

['processing']

However, we'd actually like to split the word into stem and suffix. So we should just parenthesize both parts of the regular expression:

In [17]:
re.findall(r'^(.*)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing')

[('process', 'ing')]

This looks promising, but still has a problem. Let's look at a different word, processes:

In [18]:
re.findall(r'^(.*)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processes')

[('processe', 's')]
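The incorrect split above comes from the star operator being "greedy": .* grabs as much of the word as it can, leaving only s for the suffix. A minimal sketch of two common adjustments follows: the non-greedy *? and an optional suffix group. The variable names are made up for the example:

```python
import re

suffixes = r'(ing|ly|ed|ious|ies|ive|es|s|ment)'

# Greedy: the stem takes as many characters as possible.
greedy = re.findall(r'^(.*)' + suffixes + r'$', 'processes')

# Non-greedy: the stem takes as few characters as possible.
lazy = re.findall(r'^(.*?)' + suffixes + r'$', 'processes')

# A trailing ? makes the suffix optional, so words without one still match.
optional = re.findall(r'^(.*?)' + suffixes + r'?$', 'language')

print(greedy)
print(lazy)
print(optional)
```

The non-greedy pattern with an optional suffix is the form used in the stem() function defined next.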

In [19]:
# This approach still has many problems (can you spot them?) but we will move on to define a function to perform stemming, and apply it to a whole text:

def stem(word):
    # Non-greedy stem followed by an optional suffix
    regexp = r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$'
    stem, suffix = re.findall(regexp, word)[0]
    return stem

raw = """DENNIS: Listen, strange women lying in ponds distributing swords
is no basis for a system of government.  Supreme executive power derives from
a mandate from the masses, not from some farcical aquatic ceremony."""
tokens = word_tokenize(raw)
[stem(t) for t in tokens]


['DENNIS',
 ':',
 'Listen',
 ',',
 'strange',
 'women',
 'ly',
 'in',
 'pond',
 'distribut',
 'sword',
 'i',
 'no',
 'basi',
 'for',
 'a',
 'system',
 'of',
 'govern',
 '.',
 'Supreme',
 'execut',
 'power',
 'deriv',
 'from',
 'a',
 'mandate',
 'from',
 'the',
 'mass',
 ',',
 'not',
 'from',
 'some',
 'farcical',
 'aquatic',
 'ceremony',
 '.']

Notice that our regular expression removed the s from ponds but also from is and basis. It produced some non-words like distribut and deriv, but these are acceptable stems in some applications.

## Summary
- We view a text as a list of words. A "raw text" is a potentially long string containing words and whitespace formatting, and is how we typically store and visualize a text.
- A string is specified in Python using single or double quotes: 'Monty Python', "Monty Python".
- The characters of a string are accessed using indexes, counting from zero: 'Monty Python'[0] gives the value M. The length of a string is found using len().
- Substrings are accessed using slice notation: 'Monty Python'[1:5] gives the value onty. If the start index is omitted, the substring begins at the start of the string; if the end index is omitted, the slice continues to the end of the string.
- Strings can be split into lists: 'Monty Python'.split() gives ['Monty', 'Python']. Lists can be joined into strings: '/'.join(['Monty', 'Python']) gives 'Monty/Python'.
- We can read text from a file input.txt using text = open('input.txt').read(). We can read text from a URL using text = urlopen(url).read().decode('utf8'), after from urllib.request import urlopen. We can iterate over the lines of a text file using for line in open(f).
- We can write text to a file by opening the file for writing output_file = open('output.txt', 'w'), then adding content to the file print("Monty Python", file=output_file).
- Texts found on the web may contain unwanted material (such as headers, footers, markup), that need to be removed before we do any linguistic processing.
- Tokenization is the segmentation of a text into basic units — or tokens — such as words and punctuation. Tokenization based on whitespace is inadequate for many applications because it bundles punctuation together with words. NLTK provides an off-the-shelf tokenizer nltk.word_tokenize().
- Lemmatization is a process that maps the various forms of a word (such as appeared, appears) to the canonical or citation form of the word, also known as the lexeme or lemma (e.g. appear).
- Regular expressions are a powerful and flexible method of specifying patterns. Once we have imported the re module, we can use re.findall() to find all substrings in a string that match a pattern.
- If a regular expression string includes a backslash, you should tell Python not to preprocess the string, by using a raw string with an r prefix: r'regexp'.
- When backslash is used before certain characters, e.g. \n, this takes on a special meaning (newline character); however, when backslash is used before regular expression wildcards and operators, e.g. \., \|, \$, these characters lose their special meaning and are matched literally.
- A string formatting expression template % arg_tuple consists of a format string template that contains conversion specifiers like %-6s and %0.2d.

## Exercises
☼ Define a string s = 'colorless'. Write a Python statement that changes this to "colourless" using only the slice and concatenation operations.

☼ We can use the slice notation to remove morphological endings on words. For example, 'dogs'[:-1] removes the last character of dogs, leaving dog. Use slice notation to remove the affixes from these words (we've inserted a hyphen to indicate the affix boundary, but omit this from your strings): dish-es, run-ning, nation-ality, un-do, pre-heat.

☼ We saw how we can generate an IndexError by indexing beyond the end of a string. Is it possible to construct an index that goes too far to the left, before the start of the string?

☼ We can specify a "step" size for the slice. The following returns every second character within the slice: monty[6:11:2]. It also works in the reverse direction: monty[10:5:-2]. Try these for yourself, then experiment with different step values.
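
As a starting point, the two slices named in this exercise can be run as follows. This small sketch assumes monty holds the string 'Monty Python', as elsewhere in the chapter.

```python
monty = 'Monty Python'

forward = monty[6:11:2]    # takes positions 6, 8, 10
backward = monty[10:5:-2]  # takes positions 10, 8, 6

print(forward)   # Pto
print(backward)  # otP
```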

☼ What happens if you ask the interpreter to evaluate monty[::-1]? Explain why this is a reasonable result.

☼ Describe the class of strings matched by the following regular expressions.

- [a-zA-Z]+
- [A-Z][a-z]*
- p[aeiou]{,2}t
- \d+(\.\d+)?
- ([^aeiou][aeiou][^aeiou])*
- \w+|[^\w\s]+
Test your answers using nltk.re_show().

☼ Write regular expressions to match the following classes of strings:

- A single determiner (assume that a, an, and the are the only determiners).
- An arithmetic expression using integers, addition, and multiplication, such as 2*3+8.

☼ Write a utility function that takes a URL as its argument, and returns the contents of the URL, with all HTML markup removed. Use from urllib import request and then request.urlopen('http://nltk.org/').read().decode('utf8') to access the contents of the URL.

☼ Save some text into a file corpus.txt. Define a function load(f) that reads from the file named in its sole argument, and returns a string containing the text of the file.

- Use nltk.regexp_tokenize() to create a tokenizer that tokenizes the various kinds of punctuation in this text. Use one multi-line regular expression, with inline comments, using the verbose flag (?x).
- Use nltk.regexp_tokenize() to create a tokenizer that tokenizes the following kinds of expression: monetary amounts; dates; names of people and organizations.

☼ Rewrite the following loop as a list comprehension:
(See Code 1.1 in the cells below the exercises.)

☼ Write a for loop to print out the characters of a string, one per line.

☼ What is the difference between calling split on a string with no argument or with ' ' as the argument, e.g. sent.split() versus sent.split(' ')? What happens when the string being split contains tab characters, consecutive space characters, or a sequence of tabs and spaces? (In IDLE you will need to use '\t' to enter a tab character.)

☼ Create a variable words containing a list of words. Experiment with words.sort() and sorted(words). What is the difference?

☼ Explore the difference between strings and integers by typing the following at a Python prompt: "3" * 7 and 3 * 7. Try converting between strings and integers using int("3") and str(3).

☼ Use a text editor to create a file called prog.py containing the single line monty = 'Monty Python'. Next, start up a new session with the Python interpreter, and enter the expression monty at the prompt. You will get an error from the interpreter. Now, try the following (note that you have to leave off the .py part of the filename):
(See Code 1.2 in the cells below the exercises.)

☼ What happens when the formatting strings %6s and %-6s are used to display strings that are longer than six characters?

◑ Read in some text from a corpus, tokenize it, and print the list of all wh-word types that occur. (wh-words in English are used in questions, relative clauses and exclamations: who, which, what, and so on.) Print them in order. Are any words duplicated in this list, because of the presence of case distinctions or punctuation?

◑ Create a file consisting of words and (made up) frequencies, where each line consists of a word, the space character, and a positive integer, e.g. fuzzy 53. Read the file into a Python list using open(filename).readlines(). Next, break each line into its two fields using split(), and convert the number into an integer using int(). The result should be a list of the form: [['fuzzy', 53], ...].

◑ Write code to access a favorite webpage and extract some text from it. For example, access a weather site and extract the forecast top temperature for your town or city today.

◑ Write a function unknown() that takes a URL as its argument, and returns a list of unknown words that occur on that webpage. In order to do this, extract all substrings consisting of lowercase letters (using re.findall()) and remove any items from this set that occur in the Words Corpus (nltk.corpus.words). Try to categorize these words manually and discuss your findings.

◑ Examine the results of processing the URL http://news.bbc.co.uk/ using the regular expressions suggested above. You will see that there is still a fair amount of non-textual data there, particularly Javascript commands. You may also find that sentence breaks have not been properly preserved. Define further regular expressions that improve the extraction of text from this web page.

◑ Are you able to write a regular expression to tokenize text in such a way that the word don't is tokenized into do and n't? Explain why this regular expression won't work: «n't|\w+».

◑ Try to write code to convert text into hAck3r, using regular expressions and substitution, where e → 3, i → 1, o → 0, l → |, s → 5, . → 5w33t!, ate → 8. Normalize the text to lowercase before converting it. Add more substitutions of your own. Now try to map s to two different values: $ for word-initial s, and 5 for word-internal s.

◑ Pig Latin is a simple transformation of English text. Each word of the text is converted as follows: move any consonant (or consonant cluster) that appears at the start of the word to the end, then append ay, e.g. string → ingstray, idle → idleay. http://en.wikipedia.org/wiki/Pig_Latin

- Write a function to convert a word to Pig Latin.
- Write code that converts text, instead of individual words.
- Extend it further to preserve capitalization, to keep qu together (i.e. so that quiet becomes ietquay), and to detect when y is used as a consonant (e.g. yellow) vs a vowel (e.g. style).
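
One possible starting point for the first part is sketched below. It is only an illustration under simplifying assumptions: the function name pig_latin is hypothetical, the input is a single lowercase word, only a, e, i, o, u count as vowels, and neither qu nor y receives special treatment.

```python
import re

def pig_latin(word):
    """Move any leading consonant cluster to the end of the word, then append 'ay'."""
    # Assumption: a, e, i, o, u are the only vowels, so y always counts as a consonant here.
    onset = re.match(r'[^aeiou]*', word).group()
    return word[len(onset):] + onset + 'ay'

print(pig_latin('string'))  # ingstray
print(pig_latin('idle'))    # idleay
```

The pattern [^aeiou]* can match the empty string, so words that begin with a vowel simply gain the ay suffix.
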
◑ Download some text from a language that has vowel harmony (e.g. Hungarian), extract the vowel sequences of words, and create a vowel bigram table.

◑ Python's random module includes a function choice() which randomly chooses an item from a sequence, e.g. choice("aehh ") will produce one of four possible characters, with the letter h being twice as frequent as the others. Write a generator expression that produces a sequence of 500 randomly chosen letters drawn from the string "aehh ", and put this expression inside a call to the ''.join() function, to concatenate them into one long string. You should get a result that looks like uncontrolled sneezing or maniacal laughter: he  haha ee  heheeh eha. Use split() and join() again to normalize the whitespace in this string.
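
The steps described in this exercise might be sketched as follows. The variable names and the fixed seed are assumptions added for illustration; the seed only makes repeated runs produce the same string.

```python
import random

random.seed(0)  # arbitrary seed, used only so that repeated runs agree

# Generator expression: 500 independent draws from "aehh ".
# The letter h appears twice in the string, so it is twice as likely as the others.
laughter = ''.join(random.choice("aehh ") for _ in range(500))

# split() with no argument discards runs of whitespace; ' '.join() puts back single spaces.
normalized = ' '.join(laughter.split())

print(laughter[:40])
print(normalized[:40])
```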

◑ Consider the numeric expressions in the following sentence from the MedLine Corpus: The corresponding free cortisol fractions in these sera were 4.53 +/- 0.15% and 8.16 +/- 0.23%, respectively. Should we say that the numeric expression 4.53 +/- 0.15% is three words? Or should we say that it's a single compound word? Or should we say that it is actually nine words, since it's read "four point five three, plus or minus zero point fifteen percent"? Or should we say that it's not a "real" word at all, since it wouldn't appear in any dictionary? Discuss these different possibilities. Can you think of application domains that motivate at least two of these answers?

◑ Readability measures are used to score the reading difficulty of a text, for the purposes of selecting texts of appropriate difficulty for language learners. Let us define μw to be the average number of letters per word, and μs to be the average number of words per sentence, in a given text. The Automated Readability Index (ARI) of the text is defined to be: 4.71 μw + 0.5 μs - 21.43. Compute the ARI score for various sections of the Brown Corpus, including section f (lore) and j (learned). Make use of the fact that nltk.corpus.brown.words() produces a sequence of words, while nltk.corpus.brown.sents() produces a sequence of sentences.
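
To make the formula concrete, here is a small worked sketch on a hand-made two-sentence text. The function name ari and the toy sentences are assumptions for illustration, and every token is counted as a word, including any punctuation tokens a corpus might contain.

```python
def ari(sents):
    """Automated Readability Index for a text given as a list of tokenized sentences."""
    words = [word for sent in sents for word in sent]
    mu_w = sum(len(word) for word in words) / len(words)  # average letters per word
    mu_s = len(words) / len(sents)                        # average words per sentence
    return 4.71 * mu_w + 0.5 * mu_s - 21.43

toy_sents = [['The', 'cat', 'sat'], ['It', 'purred', 'loudly']]
score = ari(toy_sents)
print(score)
```

Here the six words contain 23 letters in total, so μw = 23/6 and μs = 3, giving a score of about -1.875. The same function should accept the output of nltk.corpus.brown.sents(categories='lore'), though filtering out punctuation tokens first may be worth considering.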

◑ Use the Porter Stemmer to normalize some tokenized text, calling the stemmer on each word. Do the same thing with the Lancaster Stemmer and see if you observe any differences.

◑ Define the variable saying to contain the list ['After', 'all', 'is', 'said', 'and', 'done', ',', 'more', 'is', 'said', 'than', 'done', '.']. Process this list using a for loop, and store the length of each word in a new list lengths. Hint: begin by assigning the empty list to lengths, using lengths = []. Then each time through the loop, use append() to add another length value to the list. Now do the same thing using a list comprehension.

◑ Define a variable silly to contain the string: 'newly formed bland ideas are inexpressible in an infuriating way'. (According to Wikipedia, this happens to be the legitimate interpretation that bilingual English-Spanish speakers can assign to Chomsky's famous nonsense phrase, colorless green ideas sleep furiously.) Now write code to perform the following tasks:

- Split silly into a list of strings, one per word, using Python's split() operation, and save this to a variable called bland.
- Extract the second letter of each word in silly and join them into a string, to get 'eoldrnnnna'.
- Combine the words in bland back into a single string, using join(). Make sure the words in the resulting string are separated with whitespace.
- Print the words of silly in alphabetical order, one per line.

◑ The index() function can be used to look up items in sequences. For example, 'inexpressible'.index('e') tells us the index of the first position of the letter e.

- What happens when you look up a substring, e.g. 'inexpressible'.index('re')?
- Define a variable words containing a list of words. Now use words.index() to look up the position of an individual word.
- Define a variable silly as in the exercise above. Use the index() function in combination with list slicing to build a list phrase consisting of all the words up to (but not including) 'in' in silly.

◑ Write code to convert nationality adjectives like Canadian and Australian to their corresponding nouns Canada and Australia (see http://en.wikipedia.org/wiki/List_of_adjectival_forms_of_place_names).

◑ Read the LanguageLog post on phrases of the form as best as p can and as best p can, where p is a pronoun. Investigate this phenomenon with the help of a corpus and the findall() method for searching tokenized text described in 3.5. http://itre.cis.upenn.edu/~myl/languagelog/archives/002733.html

◑ Study the lolcat version of the book of Genesis, accessible as nltk.corpus.genesis.words('lolcat.txt'), and the rules for converting text into lolspeak at http://www.lolcatbible.com/index.php?title=How_to_speak_lolcat. Define regular expressions to convert English words into corresponding lolspeak words.

◑ Read about the re.sub() function for string substitution using regular expressions, using help(re.sub) and by consulting the further readings for this chapter. Use re.sub in writing code to remove HTML tags from an HTML file, and to normalize whitespace.
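
A minimal sketch of the two substitutions is shown below, applied to an in-memory string rather than a file. The sample HTML and the variable names are assumptions, and matching tags with a regular expression is a crude approach that will not cope with scripts, comments, or malformed markup.

```python
import re

html = ("<html><body><h1>Monty   Python</h1>\n"
        "<p>And now for\tsomething <b>completely</b> different.</p></body></html>")

# Replace anything that looks like a tag with a space, so adjacent words do not run together.
no_tags = re.sub(r'<[^>]+>', ' ', html)

# Collapse every run of whitespace (spaces, tabs, newlines) into a single space.
clean = re.sub(r'\s+', ' ', no_tags).strip()

print(clean)
```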

★ An interesting challenge for tokenization is words that have been split across a line-break. E.g. if long-term is split, then we have the string long-\nterm.

- Write a regular expression that identifies words that are hyphenated at a line-break. The expression will need to include the \n character.
- Use re.sub() to remove the \n character from these words.
- How might you identify words that should not remain hyphenated once the newline is removed, e.g. 'encyclo-\npedia'?
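
The first two parts could be approached along these lines. This is a sketch under assumptions: the sample text and variable names are invented for illustration, and a word is taken to be a run of \w characters.

```python
import re

text = "We need a long-\nterm plan,\nnot a short-\nterm fix."

# A word hyphenated at a line break: word characters, a hyphen, a newline, more word characters.
pattern = r'(\w+)-\n(\w+)'
broken = re.findall(pattern, text)

# Remove only the newline after the hyphen; other newlines are left alone.
joined = re.sub(pattern, r'\1-\2', text)

print(broken)
print(repr(joined))
```

This keeps every hyphen, so 'encyclo-\npedia' would come out as 'encyclo-pedia'; deciding when the hyphen itself should go is the subject of the third part.
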
★ Read the Wikipedia entry on Soundex. Implement this algorithm in Python.

★ Obtain raw texts from two or more genres and compute their respective reading difficulty scores as in the earlier exercise on reading difficulty. E.g. compare ABC Rural News and ABC Science News (nltk.corpus.abc). Use Punkt to perform sentence segmentation.

★ Rewrite the following nested loop as a nested list comprehension:
(See Code 1.3 in the cells below the exercises.)
 	
★ Use WordNet to create a semantic index for a text collection. Extend the concordance search program in 3.6, indexing each word using the offset of its first synset, e.g. wn.synsets('dog')[0].offset() (and optionally the offset of some of its ancestors in the hypernym hierarchy).

★ With the help of a multilingual corpus such as the Universal Declaration of Human Rights Corpus (nltk.corpus.udhr), and NLTK's frequency distribution and rank correlation functionality (nltk.FreqDist, nltk.spearman_correlation), develop a system that guesses the language of a previously unseen text. For simplicity, work with a single character encoding and just a few languages.

★ Write a program that processes a text and discovers cases where a word has been used with a novel sense. For each word, compute the WordNet similarity between all synsets of the word and all synsets of the words in its context. (Note that this is a crude approach; doing it well is a difficult, open research problem.)

★ Read the article on normalization of non-standard words (Sproat et al, 2001), and implement a similar system for text normalization.

In [21]:
# Code 1.1
>>> sent = ['The', 'dog', 'gave', 'John', 'the', 'newspaper']
>>> result = []
>>> for word in sent:
...     word_len = (word, len(word))
...     result.append(word_len)
>>> result
[('The', 3), ('dog', 3), ('gave', 4), ('John', 4), ('the', 3), ('newspaper', 9)]
# Next exercise: define a string raw containing a sentence of your own choosing. Now, split raw on some character other than space, such as 's'.

[('The', 3),
 ('dog', 3),
 ('gave', 4),
 ('John', 4),
 ('the', 3),
 ('newspaper', 9)]

In [22]:
# Code 1.2
>>> from prog import monty
>>> monty
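# This time, Python should return a value, provided prog.py is on Python's module search path (for example, in the current working directory); otherwise the import fails. You can also try import prog, in which case Python should be able to evaluate the expression prog.monty at the prompt.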

ModuleNotFoundError: No module named 'prog'

In [None]:
# Code 1.3
>>> words = ['attribution', 'confabulation', 'elocution',
...          'sequoia', 'tenacious', 'unidirectional']
>>> vsequences = set()
>>> for word in words:
...     vowels = []
...     for char in word:
...         if char in 'aeiou':
...             vowels.append(char)
...     vsequences.add(''.join(vowels))
>>> sorted(vsequences)
['aiuio', 'eaiou', 'eouio', 'euoia', 'oauaio', 'uiieioa']