# 1. Simple Statistics and NLTK

The following exercises use a portion of the Gutenberg corpus that is stored in the corpus dataset of NLTK. [Project Gutenberg](http://www.gutenberg.org/) is a large collection of electronic books that are out of copyright. These books are free to download for reading or, in our case, for a little corpus analysis.

To obtain the list of files of NLTK's Gutenberg corpus, type the following commands:

In [1]:
import nltk
nltk.download('gutenberg')
nltk.corpus.gutenberg.fileids()

[nltk_data] Downloading package gutenberg to /Users/jakob/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!


['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

In [2]:
import nltk
from nltk import sent_tokenize, word_tokenize, pos_tag_sents 
import numpy as np
nltk.download('punkt')
nltk.download('gutenberg')
nltk.download('universal_tagset')
from sklearn.feature_extraction.text import TfidfVectorizer
from collections import Counter

gutenberg = nltk.corpus.gutenberg

def count_pos(document, pos):
    sents = [word_tokenize(s) for s in sent_tokenize(gutenberg.raw(document))]
    tagged_sents = pos_tag_sents(sents, tagset="universal")

    pos_counts = Counter()

    for s in tagged_sents:
        for (w, p) in s:
            # Count one occurrence of tag p; update((p, 1)) would also add the integer 1 as a key
            pos_counts[p] += 1

    return pos_counts[pos]

count_pos('austen-emma.txt', 'NOUN')

[nltk_data] Downloading package punkt to /Users/jakob/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package gutenberg to /Users/jakob/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!
[nltk_data] Downloading package universal_tagset to
[nltk_data]     /Users/jakob/nltk_data...
[nltk_data]   Package universal_tagset is already up-to-date!


32000

To obtain all words in the entire Gutenberg corpus of NLTK, type the following:

In [2]:
gutenbergwords = nltk.corpus.gutenberg.words()
print(len(gutenbergwords))

2621613


Now you can find the total number of words, and the first 25 words (do not attempt to display all the words or your computer will freeze!):

In [3]:
len(gutenbergwords)

2621613

In [4]:
gutenbergwords[:25]

['[',
 'Emma',
 'by',
 'Jane',
 'Austen',
 '1816',
 ']',
 'VOLUME',
 'I',
 'CHAPTER',
 'I',
 'Emma',
 'Woodhouse',
 ',',
 'handsome',
 ',',
 'clever',
 ',',
 'and',
 'rich',
 ',',
 'with',
 'a',
 'comfortable',
 'home']

You can also find the words of just a selection of documents, as shown below. For more details of what information you can extract from this corpus, read the "Gutenberg corpus" section of the [NLTK book chapter 2](http://www.nltk.org/book_1ed/ch02.html), section 2.1. 

In [5]:
emma = nltk.corpus.gutenberg.words('austen-emma.txt')
len(emma)

192427

In [6]:
emma[:10]

['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']', 'VOLUME', 'I', 'CHAPTER']

As we have seen in the lectures, we can use Python's `collections.Counter` to find the most frequent words of a document from NLTK's Gutenberg collection. Below you can see how you can find the 5 most frequent words of the word list stored in the variable `emma`:

In [7]:
import collections
emma_counter = collections.Counter(emma)
print(emma_counter.most_common(5))

[(',', 11454), ('.', 6928), ('to', 5183), ('the', 4844), ('and', 4672)]
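Punctuation marks and function words dominate the raw counts above. A minimal sketch of one way to restrict the count to alphabetic tokens and merge case variants follows. The token list `sample_tokens` is a small made-up stand-in for `emma`, and the helper name `most_common_words` is hypothetical.

```python
import collections

def most_common_words(tokens, n):
    """Return the n most frequent alphabetic tokens, ignoring case."""
    # str.isalpha() drops punctuation tokens such as ',' and '.'
    words = [t.lower() for t in tokens if t.isalpha()]
    return collections.Counter(words).most_common(n)

# Made-up sample standing in for the `emma` word list
sample_tokens = ['The', 'cat', ',', 'the', 'cat', ',', 'and', 'the', 'bird', '.']
print(most_common_words(sample_tokens, 2))  # [('the', 3), ('cat', 2)]
```

Whether to lowercase or to drop punctuation depends on the question being asked; the exercises below keep the tokens exactly as NLTK returns them.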


### Exercise 1.1
*Write Python code that prints the 10 most frequent words in each of the documents of the Gutenberg corpus. Can you identify any similarities among these lists of most frequent words?*

In [8]:
textlist = nltk.corpus.gutenberg.fileids()
for name in textlist:
    text = nltk.corpus.gutenberg.words(name)
    counter = collections.Counter(text)
    print('10 most common words in '+name+":")
    print(counter.most_common(10))
    print('\n')

10 most common words in austen-emma.txt:
[(',', 11454), ('.', 6928), ('to', 5183), ('the', 4844), ('and', 4672), ('of', 4279), ('I', 3178), ('a', 3004), ('was', 2385), ('her', 2381)]


10 most common words in austen-persuasion.txt:
[(',', 6750), ('the', 3120), ('to', 2775), ('.', 2741), ('and', 2739), ('of', 2564), ('a', 1529), ('in', 1346), ('was', 1330), (';', 1290)]


10 most common words in austen-sense.txt:
[(',', 9397), ('to', 4063), ('.', 3975), ('the', 3861), ('of', 3565), ('and', 3350), ('her', 2436), ('a', 2043), ('I', 2004), ('in', 1904)]


10 most common words in bible-kjv.txt:
[(',', 70509), ('the', 62103), (':', 43766), ('and', 38847), ('of', 34480), ('.', 26160), ('to', 13396), ('And', 12846), ('that', 12576), ('in', 12331)]


10 most common words in blake-poems.txt:
[(',', 680), ('the', 351), ('.', 201), ('And', 176), ('and', 169), ('of', 131), ('I', 130), ('in', 116), ('a', 108), ("'", 104)]


10 most common words in bryant-stories.txt:
[(',', 3481), ('the', 3086), ('a

In [49]:
bib_kjv=nltk.corpus.gutenberg.words('bible-kjv.txt')
c=collections.Counter(bib_kjv)
print(c['Jesus'])

977


### Exercise 1.2
*Find the unique words with length of more than 17 characters in the complete Gutenberg corpus.*

*Hint: to find the distinct items of a Python list you can convert it into a set:*

In [9]:
my_list = ['a','b','c','a','c']
my_set = set(my_list)
print(my_set)
print(len(my_set))

{'a', 'b', 'c'}
3


In [24]:
word_set = set()
for word in gutenbergwords:
    if len(word) > 17:
        word_set.add(word)

In [25]:
word_set2= set([word for word in gutenbergwords if len(word) > 17])
print(len(word_set2))

3


In [23]:
print(len(word_set))

3
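The cells above report only how many such words exist. To see the words themselves, the set can be sorted and printed. The sketch below uses a made-up token list rather than `gutenbergwords`, and `long_words` is a hypothetical helper name.

```python
def long_words(tokens, min_length):
    """Return the distinct tokens longer than min_length characters, sorted."""
    return sorted({t for t in tokens if len(t) > min_length})

# Made-up sample; 'uncompromisedness' has exactly 17 characters and is excluded
sample_tokens = ['uncompromisedness', 'cat', 'characteristically', 'cat',
                 'characteristically', 'incomprehensibleness']
print(long_words(sample_tokens, 17))
```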


### Exercise 1.3
*Find the words that are longer than 5 characters and occur more than 2000 times in the complete Gutenberg corpus.*


In [13]:
words = [w for w in gutenbergwords if len(w) > 5]
frequent_words=[]
count = collections.Counter(words)

for w in count:
    if count[w] > 2000:
        frequent_words.append((w,count[w]))

In [14]:
print(frequent_words)

[('little', 2825), ('before', 3335), ('people', 2773), ('children', 2223), ('should', 2496), ('against', 2255), ('Israel', 2591)]


### Exercise 1.4
*Find the average number of words in the documents of the NLTK Gutenberg corpus.*


In [15]:
avg_words = int(len(gutenbergwords)/len(textlist))

print("Average number of words per document: "+str(avg_words))

Average number of words per document: 145645
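Dividing the total token count by the number of documents gives the same value as averaging the per-document counts, and `int()` simply truncates the fractional part. A small sketch with hypothetical counts (not the real Gutenberg figures) illustrates this:

```python
import statistics

# Hypothetical per-document token counts (not the real Gutenberg figures)
doc_lengths = {'doc_a.txt': 1200, 'doc_b.txt': 800, 'doc_c.txt': 1001}

# Total number of tokens divided by the number of documents
average = sum(doc_lengths.values()) / len(doc_lengths)

# The same quantity, computed as the mean of the per-document counts
mean_of_counts = statistics.mean(list(doc_lengths.values()))

print(f"Average: {average:.2f}, truncated: {int(average)}")
```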


### (Optional) Exercise 1.5
*Find the Gutenberg document that has the longest average word length.*


In [16]:
average_word_lengths= dict()

for f in textlist:
    nchars=len(nltk.corpus.gutenberg.raw(f))
    nwords=len(nltk.corpus.gutenberg.words(f))
    average_word_lengths[f]=nchars/nwords
    print('Document:',f, " Average word length:",average_word_lengths[f])

sorted_keys = sorted(average_word_lengths.keys(), key = lambda x: average_word_lengths[x], reverse=True)

print("\n Document with largest average word length is", sorted_keys[0])


Document: austen-emma.txt  Average word length: 4.609909212324673
Document: austen-persuasion.txt  Average word length: 4.749793727271801
Document: austen-sense.txt  Average word length: 4.753785952421314
Document: bible-kjv.txt  Average word length: 4.286881563819072
Document: blake-poems.txt  Average word length: 4.567033756284415
Document: bryant-stories.txt  Average word length: 4.489300433741879
Document: burgess-busterbrown.txt  Average word length: 4.464641670621737
Document: carroll-alice.txt  Average word length: 4.233216065669891
Document: chesterton-ball.txt  Average word length: 4.716173862839705
Document: chesterton-brown.txt  Average word length: 4.724783007796614
Document: chesterton-thursday.txt  Average word length: 4.63099417739442
Document: edgeworth-parents.txt  Average word length: 4.4391184023772565
Document: melville-moby_dick.txt  Average word length: 4.76571875515204
Document: milton-paradise.txt  Average word length: 4.835734572682675
Document: shakespeare-cae
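The cell above divides the number of characters in the raw text, which includes whitespace, by the number of tokens, which includes punctuation. A possible alternative, sketched here under the assumption that only alphabetic tokens should count as words, averages the token lengths directly. The sample token lists and the helper name `average_word_length` are made up for illustration.

```python
def average_word_length(tokens):
    """Mean length of the alphabetic tokens; punctuation is skipped."""
    lengths = [len(t) for t in tokens if t.isalpha()]
    return sum(lengths) / len(lengths) if lengths else 0.0

# Made-up token lists standing in for nltk.corpus.gutenberg.words(f)
samples = {
    'short.txt': ['A', 'cat', 'sat', '.'],
    'long.txt': ['Magnificent', 'creatures', 'wander', ',', 'slowly', '.'],
}
averages = {name: average_word_length(toks) for name, toks in samples.items()}

# Pick the document whose average is largest
longest = max(averages, key=averages.get)
print(longest, averages[longest])  # long.txt 8.0
```

The two definitions need not rank the documents in the same order, so it is worth stating which one is used when reporting a result.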

### Exercise 1.6
*Find the 10 most frequent bigrams in the entire Gutenberg corpus.*


In [17]:
words = gutenbergwords

bigrams = nltk.bigrams(words)
bigrams_counter = collections.Counter(bigrams)
bigrams_counter.most_common(10)

[((',', 'and'), 41294),
 (('of', 'the'), 18912),
 (('in', 'the'), 9793),
 (("'", 's'), 9781),
 ((';', 'and'), 7559),
 (('and', 'the'), 6432),
 (('the', 'LORD'), 5964),
 ((',', 'the'), 5957),
 ((',', 'I'), 5677),
 ((',', 'that'), 5352)]
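A bigram is simply a pair of adjacent tokens. As a rough illustration of what `nltk.bigrams` produces, the sketch below builds the pairs with `zip` on a made-up token list; the local function `bigrams` is a stand-in written for this example, not NLTK's implementation.

```python
import collections

def bigrams(tokens):
    """Return an iterator over each pair of adjacent tokens in a list."""
    return zip(tokens, tokens[1:])

# Made-up sample token list
sample_tokens = ['the', 'cat', 'saw', 'the', 'cat', 'and', 'the', 'dog']
counts = collections.Counter(bigrams(sample_tokens))
print(counts.most_common(1))  # [(('the', 'cat'), 2)]
```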

### Exercise 1.7
*Find the most frequent bigram that begins with "Moby" in Herman Melville's "Moby Dick".*

In [18]:
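moby_bigrams = nltk.bigrams(nltk.corpus.gutenberg.words('melville-moby_dick.txt'))
collections.Counter(b for b in moby_bigrams if b[0] == 'Moby').most_common(1)  # keep only bigrams starting with 'Moby'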

# 2. Text Preprocessing with NLTK
The following exercises will ask questions about tokens, stems, and parts of speech.

### Exercise 2.1
*What is the sentence with the largest number of tokens in Austen's "Emma"?*

In [None]:
emma_sent = nltk.tokenize.sent_tokenize(nltk.corpus.gutenberg.raw('austen-emma.txt'))
longest = ""
max_w = 0
for s in emma_sent:
    num_w = len(nltk.word_tokenize(s))
    if num_w > max_w:
        longest = s
        max_w = num_w


print('Longest sentence:')
print(longest)
print(str(max_w)," words")

Longest sentence:
While he lived, it must be only an engagement;
but she flattered herself, that if divested of the danger of
drawing her away, it might become an increase of comfort to him.--
How to do her best by Harriet, was of more difficult decision;--
how to spare her from any unnecessary pain; how to make
her any possible atonement; how to appear least her enemy?--
On these subjects, her perplexity and distress were very great--
and her mind had to pass again and again through every bitter
reproach and sorrowful regret that had ever surrounded it.--
She could only resolve at last, that she would still avoid a
meeting with her, and communicate all that need be told by letter;
that it would be inexpressibly desirable to have her removed just
now for a time from Highbury, and--indulging in one scheme more--
nearly resolve, that it might be practicable to get an invitation
for her to Brunswick Square.--Isabella had been pleased with Harriet;
and a few weeks spent in London must give

### Exercise 2.2
*What are the 5 most frequent parts of speech in Austen's "Emma"? Use the universal tag set.*

In [None]:
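emma_sents = [nltk.word_tokenize(s) for s in nltk.sent_tokenize(nltk.corpus.gutenberg.raw('austen-emma.txt'))]
collections.Counter(tag for sent in nltk.pos_tag_sents(emma_sents, tagset='universal') for _, tag in sent).most_common(5)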

[('VERB', 35723),
 ('NOUN', 31998),
 ('.', 30304),
 ('PRON', 21263),
 ('ADP', 17880)]

### Exercise 2.3
*What is the number of distinct stems in Austen's "Emma"?* 

Austen's Emma has 5394 distinct stems


### (Optional) Exercise 2.4
*What is the most ambiguous stem in Austen's "Emma"? (meaning, which stem in Austen's "Emma" is realised in the largest number of distinct tokens?)*

The most ambiguous stem is 'respect' with words {'Respect', 'respecting', 'respects', 'respectful', 'respectable', 'respective', 'respectfully', 'respectably', 'respected', 'respectability', 'respect'}
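The code behind the answers to Exercises 2.3 and 2.4 is not shown, so the following is only a sketch of one possible approach: group the distinct tokens by their stem, then count the groups and pick the largest one. The helper `group_by_stem` and the toy stemmer `toy_stem` are hypothetical; the toy stemmer exists only so that the example runs without NLTK and is not a real stemming algorithm.

```python
import collections

def group_by_stem(tokens, stem):
    """Map each stem to the set of distinct tokens that reduce to it."""
    groups = collections.defaultdict(set)
    for token in tokens:
        groups[stem(token)].add(token)
    return groups

def toy_stem(token):
    """Hypothetical stand-in stemmer: lowercase and strip a few suffixes."""
    token = token.lower()
    for suffix in ('ing', 'ed', 's'):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[:-len(suffix)]
    return token

# Made-up sample token list
sample_tokens = ['Respect', 'respects', 'respected', 'respecting', 'walk', 'walked']
groups = group_by_stem(sample_tokens, toy_stem)

print(len(groups))  # number of distinct stems
most_ambiguous = max(groups, key=lambda s: len(groups[s]))
print(most_ambiguous, sorted(groups[most_ambiguous]))
```

With NLTK installed, a real stemmer such as `nltk.stem.PorterStemmer().stem` could be passed as the `stem` argument instead of `toy_stem`. The exact figures depend on the stemmer and tokenizer chosen, and the source does not say which ones produced the results above.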
