# 1. Simple Statistics and NLTK

The following exercises use a portion of the Gutenberg corpus that is distributed with NLTK's corpus data. [Project Gutenberg](http://www.gutenberg.org/) is a large collection of electronic books that are out of copyright. These books are free to download for reading or, in our case, for doing a little corpus analysis.

To obtain the list of files of NLTK's Gutenberg corpus, type the following commands:

In [1]:
import nltk
nltk.corpus.gutenberg.fileids()

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

To obtain all words in the entire Gutenberg corpus of NLTK, type the following:

In [2]:
gutenbergwords = nltk.corpus.gutenberg.words()

Now you can find the total number of words, and the first 10 words (do not attempt to display all the words or your computer will freeze!):

In [3]:
len(gutenbergwords)

2621613

In [4]:
gutenbergwords[:10]

['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']', 'VOLUME', 'I', 'CHAPTER']

You can also obtain the words of just a selection of documents by passing one or more file ids to `words()`, for example `nltk.corpus.gutenberg.words('austen-emma.txt')`. For more details of what information you can extract from this corpus, read the "Gutenberg corpus" part of section 2.1 of the [NLTK book chapter 2](http://www.nltk.org/book_1ed/ch02.html).

### Exercise 1.1
*Find the words that are longer than 7 characters in the complete Gutenberg corpus and display the first 10.*

['Woodhouse',
 'handsome',
 'comfortable',
 'disposition',
 'blessings',
 'existence',
 'distress',
 'youngest',
 'daughters',
 'affectionate']
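
One possible way to approach this exercise is a list comprehension over the token list. The sketch below is an illustration rather than the official solution: the helper name `long_words` and the toy token list are made up here, and the default threshold keeps words longer than 7 characters, which is what the sample output above corresponds to (the 7-letter token `CHAPTER` is absent from it).

```python
def long_words(words, min_length=8):
    """Return the words with at least `min_length` characters, in corpus order."""
    return [w for w in words if len(w) >= min_length]

# Toy token list standing in for nltk.corpus.gutenberg.words()
sample = ['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']', 'VOLUME', 'I',
          'CHAPTER', 'I', 'Emma', 'Woodhouse', ',', 'handsome', ',', 'clever']

print(long_words(sample))                # ['Woodhouse', 'handsome']
print(long_words(sample, min_length=7))  # ['CHAPTER', 'Woodhouse', 'handsome']
```

With the corpus loaded as above, `long_words(gutenbergwords)[:10]` would display only the first 10 matches.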

### Exercise 1.2
*Find the 5 most frequent words that are longer than 7 characters and occur more than 7 times in the complete Gutenberg corpus.*


[('children', 2223), ('therefore', 1146), ('together', 958), ('something', 881), ('Jerusalem', 821)]
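
A frequency count is one way to combine the two conditions of this exercise. The following is a minimal sketch using `collections.Counter` from the standard library (NLTK's `FreqDist` could be used in its place); the function name `frequent_long_words`, its parameters and the toy data are assumptions made for illustration.

```python
from collections import Counter

def frequent_long_words(words, longer_than=7, more_than=7, n=5):
    """Return the n most frequent words that have more than `longer_than`
    characters and occur more than `more_than` times."""
    counts = Counter(w for w in words if len(w) > longer_than)
    # most_common() sorts by decreasing count, so filtering keeps that order
    frequent = [(w, c) for (w, c) in counts.most_common() if c > more_than]
    return frequent[:n]

# Toy data standing in for the corpus tokens
sample = ['children'] * 9 + ['therefore'] * 8 + ['together'] * 3 + ['and'] * 20
print(frequent_long_words(sample))  # [('children', 9), ('therefore', 8)]
```

In the toy run, `together` is long enough but too rare, and `and` is frequent but too short, so neither is returned.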


### Exercise 1.3
*Find the average number of words per document in the NLTK Gutenberg corpus.*


145645.16666666666
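
The figure above equals the total of 2621613 tokens divided by the 18 files of the corpus. A small sketch of the same computation on toy data follows; the function name `average_words_per_document` and the dictionary layout (file id mapped to its token list) are assumptions for illustration.

```python
def average_words_per_document(docs):
    """docs maps a file id to its list of tokens."""
    return sum(len(tokens) for tokens in docs.values()) / len(docs)

# Toy documents standing in for the Gutenberg files
toy = {'a.txt': ['one', 'two', 'three'], 'b.txt': ['four']}
print(average_words_per_document(toy))  # 2.0
```

With NLTK, such a dictionary could be built as `{f: nltk.corpus.gutenberg.words(f) for f in nltk.corpus.gutenberg.fileids()}`.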

### Exercise 1.4
*Find the Gutenberg document that has the longest average word length.*


Document with largest average word length is milton-paradise.txt with word length 4.835734572682675
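
One way to organise this computation is to average the token lengths per document and then take the maximum. The sketch below makes two assumptions that the exercise does not state: every token counts as a word, including punctuation tokens, and the documents are given as a dictionary from file id to token list. The function names are hypothetical.

```python
def average_word_length(tokens):
    """Mean number of characters per token."""
    return sum(len(w) for w in tokens) / len(tokens)

def longest_average_word_length(docs):
    """Return (file id, average) for the document with the largest average word length."""
    averages = {fileid: average_word_length(tokens) for fileid, tokens in docs.items()}
    best = max(averages, key=averages.get)
    return best, averages[best]

# Toy documents standing in for the Gutenberg files
toy = {'short.txt': ['a', 'bb', 'ccc'], 'long.txt': ['dddd', 'eeeeee']}
print(longest_average_word_length(toy))  # ('long.txt', 5.0)
```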


### Exercise 1.5
*Find the 10 most frequent bigrams in the entire Gutenberg corpus.*


[((',', 'and'), 41294),
 (('of', 'the'), 18912),
 (('in', 'the'), 9793),
 (("'", 's'), 9781),
 ((';', 'and'), 7559),
 (('and', 'the'), 6432),
 (('the', 'LORD'), 5964),
 ((',', 'the'), 5957),
 ((',', 'I'), 5677),
 ((',', 'that'), 5352)]
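
Bigrams can be formed by pairing each token with its successor. The sketch below does this with `zip` and counts the pairs with `collections.Counter`; `nltk.bigrams` could produce the pairs instead. The function name `most_frequent_bigrams` and the toy tokens are made up for illustration, and applying it to the whole corpus at once would also count pairs that straddle two documents.

```python
from collections import Counter

def most_frequent_bigrams(words, n=10):
    """Return the n most frequent pairs of adjacent tokens."""
    words = list(words)
    return Counter(zip(words, words[1:])).most_common(n)

# Toy tokens standing in for the corpus
sample = ['of', 'the', 'king', 'of', 'the', 'land', ',', 'and', 'of', 'the', 'sea']
print(most_frequent_bigrams(sample, n=1))  # [(('of', 'the'), 3)]
```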

### Exercise 1.6
*Find the most frequent bigram that begins with "Moby" in Herman Melville's "Moby Dick".*

[(('Moby', 'Dick'), 83), (('Moby', '-'), 1)]
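
The output above lists every bigram whose first token is `Moby`, most frequent first. A sketch of that filter on toy data is shown here; the function name `bigrams_starting_with` is hypothetical.

```python
from collections import Counter

def bigrams_starting_with(words, first):
    """Count the bigrams whose first token equals `first`, most frequent first."""
    words = list(words)
    pairs = (pair for pair in zip(words, words[1:]) if pair[0] == first)
    return Counter(pairs).most_common()

# Toy tokens standing in for the novel
sample = ['Moby', 'Dick', ';', 'Moby', '-', 'Dick', ';', 'Moby', 'Dick']
print(bigrams_starting_with(sample, 'Moby'))
# [(('Moby', 'Dick'), 2), (('Moby', '-'), 1)]
```

For the exercise itself, the tokens would come from `nltk.corpus.gutenberg.words('melville-moby_dick.txt')`.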

# 2. Regular Expressions
The aim of this section is to develop a regular expression that detects all the numerical expressions in the Brown corpus. The Brown corpus is described in the [NLTK book chapter 2](http://www.nltk.org/book_1ed/ch02.html), section 2.1.

The Brown corpus is annotated with parts of speech. In the following exercises, you will concentrate on the words whose tags begin with `CD`. This tag stands for "cardinal numeral" and indicates that the token is a number. For the full list of tags in the Brown corpus, see the [Wikipedia entry](https://en.wikipedia.org/wiki/Brown_Corpus#Part-of-speech_tags_used).

The following Python code shows the first 5 words of the "news" category of the Brown corpus. You'll see that it is a list of pairs, where each pair is a word and a tag. Some tags have two components separated with `-`, as in `NP-TL`. We will focus on the tags that begin with `'CD'`.


In [11]:
import nltk 
tagged = nltk.corpus.brown.tagged_words(categories='news')
tagged[:5]

[('The', 'AT'),
 ('Fulton', 'NP-TL'),
 ('County', 'NN-TL'),
 ('Grand', 'JJ-TL'),
 ('Jury', 'NN-TL')]

The following code stores the list of unannotated tokens in the variable `tokens`. Note that we could have used `nltk.corpus.brown.words` to build this list, but the code below makes sure that the list mirrors the list `tagged`.

In [12]:
tokens = [w for (w,t) in tagged]
tokens[:5]

['The', 'Fulton', 'County', 'Grand', 'Jury']

### Exercise 2.1
*Write code that finds all the numbers (items tagged with `'CD'`) from the list `tagged` and stores them in a new variable `numbers`. How many numbers are there? How many different numbers are there?*


There are 2020 numbers
There are 460 distinct numbers
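
A possible shape for this exercise is sketched below on toy data. The exercise text says "tagged with `'CD'`", so the default here is an exact match on the tag; setting `exact=False` also accepts compound tags that begin with `CD`, which is how the evaluation code later in this section compares tags. The function name `cardinal_numbers`, the `exact` parameter and the toy pairs are assumptions.

```python
def cardinal_numbers(tagged, exact=True):
    """Return the words tagged as cardinal numerals."""
    if exact:
        return [w for (w, t) in tagged if t == 'CD']
    return [w for (w, t) in tagged if t.startswith('CD')]

# Toy (word, tag) pairs standing in for the tagged corpus
toy_tagged = [('The', 'AT'), ('two', 'CD'), ('of', 'IN'),
              ('two', 'CD'), ('1962', 'CD-TL'), ('3', 'OD')]

numbers = cardinal_numbers(toy_tagged)
print("There are", len(numbers), "numbers")                 # There are 2 numbers
print("There are", len(set(numbers)), "distinct numbers")   # There are 1 distinct numbers
```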


### Exercise 2.2
*Find all the items in `tokens` that match the following regular expression: `^[0-9]+$`. Write the result as a list of pairs `(word,annotation)` so that, if the word is a number, the annotation is `'CD'`, and if it is not a number, the annotation is `''`.*

In [14]:
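import re
def annotateNum(listtokens):
    """Annotate the list of tokens to identify the numbers.
    Example of run:
    >>> annotateNum(['the','number','5'])
    [('the', ''), ('number', ''), ('5', 'CD')]
    """
    # A token is annotated 'CD' only if it consists entirely of digits
    pattern = re.compile(r'^[0-9]+$')
    return [(w, 'CD' if pattern.match(w) else '') for w in listtokens]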

In [15]:
annotateNum(['the','number','5'])

[('the', ''), ('number', ''), ('5', 'CD')]

The following code computes the recall and precision of the annotations.

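1. The *recall* is the proportion of the numbers in the corpus (tokens whose tag begins with `CD`) that your annotation also marks as `CD`.
2. The *precision* is the proportion of the tokens that your annotation marks as `CD` that are numbers in the corpus.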

In [16]:
def evaluate(result,tagged):
    assert len(result) == len(tagged) # This is a check that the length of the result and tagged are equal
    correct = [result[i][0] for i in range(len(result)) if result[i][1][:2] == 'CD' and tagged[i][1][:2] == 'CD']
    numbers_result = [result[i][0] for i in range(len(result)) if result[i][1][:2] == 'CD']
    numbers_tagged = [tagged[i][0] for i in range(len(tagged)) if tagged[i][1][:2] == 'CD']
    print("Recall:",len(correct)/len(numbers_tagged))
    print("Precision:",len(correct)/len(numbers_result))

In [17]:
evaluate(annotateNum(tokens),tagged)

Recall: 0.5304428044280443
Precision: 0.9982638888888888
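
To make the two measures concrete, here is a small worked example on invented data. The function `precision_recall` is a hypothetical variant of the evaluation above that returns the two values instead of printing them, and it assumes at least one token is marked `CD` on each side.

```python
def precision_recall(result, tagged):
    """Return (precision, recall) for 'CD' annotations.
    Both arguments are lists of (word, tag) pairs of equal length."""
    assert len(result) == len(tagged)
    predicted = [r[1].startswith('CD') for r in result]
    actual = [t[1].startswith('CD') for t in tagged]
    correct = sum(p and a for (p, a) in zip(predicted, actual))
    return correct / sum(predicted), correct / sum(actual)

# Invented gold-standard tags and system output for five tokens
gold = [('two', 'CD'), ('men', 'NNS'), ('5', 'CD'), ('3', 'OD'), ('one', 'CD')]
system = [('two', ''), ('men', ''), ('5', 'CD'), ('3', 'CD'), ('one', '')]

precision, recall = precision_recall(system, gold)
print(precision)  # 0.5: one of the two tokens marked 'CD' is a number
print(recall)     # one of the three numbers was found
```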


The following functions return the mistakes made by your annotation: the false positives and the false negatives.

In [18]:
def false_positives(result, tagged):
    "Return the non-numbers that were tagged as numbers"
    return [tagged[i] for i in range(len(result))
            if result[i][1] == 'CD' and tagged[i][1][:2] != 'CD']

def false_negatives(result, tagged):
    "Return the numbers that were not tagged as numbers"
    return [result[i] for i in range(len(result))
            if result[i][1] != 'CD' and tagged[i][1][:2] == 'CD']

In [19]:
false_positives(annotateNum(tokens),tagged)[:10]

[('3', 'OD'), ('3', 'OD-TL')]

In [20]:
false_negatives(annotateNum(tokens),tagged)[:10]

[('two', ''),
 ('one', ''),
 ('two', ''),
 ('Four', ''),
 ('one', ''),
 ('one', ''),
 ('one', ''),
 ('two', ''),
 ('Five', ''),
 ('three', '')]

### Exercise 2.3
**This exercise is optional for this workshop but it will be very useful for the sort of work that you will do in the assignments.**

*Concentrate on your worst figure, either recall or precision, and try to improve it:*

1. *If recall is low, list all the numbers that your system failed to detect (the false negatives). With those numbers in mind, update your regular expression.*
2. *If precision is low, list all the tokens that your system erroneously classified as numbers (the false positives). Then update your regular expression so that it no longer matches those words.*

*Test your new regular expression on the full Brown corpus, compute recall and precision, and see what you can improve. Do this several times and try to get better results.*

Solution: There's no single best pattern. The main goal of this activity is to
help the students develop their analytical skills and to find a good
methodology to develop rules. They could do this exercise in groups
of two, and they should *really* test their patterns by computing
recall and precision. I haven't introduced the concepts of recall and
precision in the lectures yet, but hopefully they can see the usefulness
of these measures for the development of rules.
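
As an illustration of a single iteration of this process, and not a recommended final pattern: the false negatives listed earlier are spelled-out numbers such as `two` and `Four`, so a first refinement could add a short list of number words and allow `,` or `.` inside digit strings. The names `NUMBER_WORDS`, `PATTERN` and `annotateNum2` and the choice of words are hypothetical.

```python
import re

# Hypothetical refinement: digit strings (optionally with , or . inside)
# or one of a few spelled-out numbers, matched case-insensitively
NUMBER_WORDS = ['one', 'two', 'three', 'four', 'five',
                'six', 'seven', 'eight', 'nine', 'ten']
PATTERN = re.compile(
    r'^(?:[0-9]+(?:[.,][0-9]+)*|' + '|'.join(NUMBER_WORDS) + r')$',
    re.IGNORECASE)

def annotateNum2(listtokens):
    """Annotate tokens matched by the refined pattern with 'CD'."""
    return [(w, 'CD' if PATTERN.match(w) else '') for w in listtokens]

print(annotateNum2(['Four', 'of', '1,500', 'paid', '2.5', 'once']))
# [('Four', 'CD'), ('of', ''), ('1,500', 'CD'), ('paid', ''), ('2.5', 'CD'), ('once', '')]
```

A change like this is likely to trade precision for recall, since a word such as `one` is not always used as a numeral, so each new pattern should be checked again with the recall and precision computation.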