# 1. Simple Statistics and NLTK

The following exercises use a portion of the Gutenberg corpus that is stored in the corpus dataset of NLTK. [The Project Gutenberg](http://www.gutenberg.org/) is a large collection of electronic books that are out of copyright. These books are free to download for reading, or for our case, for doing a little of corpus analysis.

To obtain the list of files of NLTK's Gutenberg corpus, type the following commands:

In [13]:
import nltk
nltk.download('gutenberg')
gutenberg=nltk.corpus.gutenberg
gut_fileids=nltk.corpus.gutenberg.fileids()
gut_fileids

[nltk_data] Downloading package gutenberg to
[nltk_data]     C:\Users\jakev_000\AppData\Roaming\nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!


['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

To obtain all words in the entire Gutenberg corpus of NLTK, type the following:

In [2]:
gutenbergwords = nltk.corpus.gutenberg.words()

Now you can find the total number of words, and the first 10 words (do not attempt to display all the words or your computer will freeze!):

In [3]:
len(gutenbergwords)

2621613

In [4]:
gutenbergwords[:10]

['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']', 'VOLUME', 'I', 'CHAPTER']

You can also find the words of just a selection of documents, as shown below. For more details of what information you can extract from this corpus, read the "Gutenberg corpus" section of the [NLTK book chapter 2](http://www.nltk.org/book_1ed/ch02.html), section 2.1. 

In [5]:
emma = nltk.corpus.gutenberg.words('austen-emma.txt')
len(emma)

192427

In [6]:
emma[:10]

['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']', 'VOLUME', 'I', 'CHAPTER']

As we have seen in the lectures, we can use Python's `collections.Counter` to find the most frequent words of a document from NLTK's Gutenberg collection. Below you can see how you can find the 5 most frequent words of the word list stored in the variable `emma`:

In [7]:
import collections
emma_counter = collections.Counter(emma)
print(emma_counter.most_common(5))

[(',', 11454), ('.', 6928), ('to', 5183), ('the', 4844), ('and', 4672)]


### Exercise 1.1
*Write Python code that prints the 10 most frequent words in each of the documents of the Gutenberg corpus. Can you identify any similarities among these list of most frequent words?*

In [21]:
for doc in gut_fileids:
    words=gutenberg.words(doc)
    counter=collections.Counter(words)
    print("10 most common words in",str(doc)[:-4]+":")
    print(counter.most_common(5),"\n")

10 most common words in austen-emma:
[(',', 11454), ('.', 6928), ('to', 5183), ('the', 4844), ('and', 4672)] 

10 most common words in austen-persuasion:
[(',', 6750), ('the', 3120), ('to', 2775), ('.', 2741), ('and', 2739)] 

10 most common words in austen-sense:
[(',', 9397), ('to', 4063), ('.', 3975), ('the', 3861), ('of', 3565)] 

10 most common words in bible-kjv:
[(',', 70509), ('the', 62103), (':', 43766), ('and', 38847), ('of', 34480)] 

10 most common words in blake-poems:
[(',', 680), ('the', 351), ('.', 201), ('And', 176), ('and', 169)] 

10 most common words in bryant-stories:
[(',', 3481), ('the', 3086), ('and', 1873), ('.', 1817), ('to', 1165)] 

10 most common words in burgess-busterbrown:
[('.', 823), (',', 822), ('the', 639), ('he', 562), ('and', 484)] 

10 most common words in carroll-alice:
[(',', 1993), ("'", 1731), ('the', 1527), ('and', 802), ('.', 764)] 

10 most common words in chesterton-ball:
[(',', 4547), ('the', 4523), ('.', 3589), ('of', 2529), ('and', 2488

### Exercise 1.2
*Find the unique words with length of more than 17 characters in the complete Gutenberg corpus.*

*Hint: to find the distinct items of a Python list you can convert it into a set:*

In [8]:
my_list = ['a','b','c','a','c']
my_set = set(my_list)
print(my_set)
print(len(my_set))

{'a', 'c', 'b'}
3


### Exercise 1.3
*Find the words that are longer than 5 characters and occur more than 2000 times in the complete Gutenberg corpus.*


### Exercise 1.4
*Find the average number of words in the documents of the NLTK Gutenberg corpus.*


### (Optional) Exercise 1.5
*Find the Gutenberg document that has the longest average word length.*


### Exercise 1.6
*Find the 10 most frequent bigrams in the entire Gutenberg corpus.*


### Exercise 1.7
*Find the most frequent bigram that begins with "Moby" in Herman Melville's "Moby Dick".*

# 2. Text Preprocessing with NLTK
The following exercises will ask questions about tokens, stems, and parts of speech.

### Exercise 2.1
*What is the sentence with the largest number of tokens in Austen's "Emma"?*

### Exercise 2.2
*What are the 5 most frequent parts of speech in Austen's "Emma"? Use the universal tag set*

### Exercise 2.3
*What is the number of distinct stems in Austen's "Emma"?* 

### (Optional) Exercise 2.4
*What is the most ambiguous stem in Austen's "Emma"? (meaning, which stem in Austen's "Emma" is realised in the largest number of distinct tokens?)*