# 1. Simple Statistics and NLTK

The following exercises use a portion of the Gutenberg corpus that ships with NLTK's corpus collection. [Project Gutenberg](http://www.gutenberg.org/) is a large collection of electronic books that are out of copyright. These books are free to download for reading or, in our case, for a little corpus analysis.

To obtain the list of files of NLTK's Gutenberg corpus, type the following commands:

In [1]:
import nltk
nltk.download('gutenberg')
nltk.corpus.gutenberg.fileids()

[nltk_data] Downloading package gutenberg to
[nltk_data]     C:\Users\ohaiy\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\gutenberg.zip.


['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

To obtain all words in the entire Gutenberg corpus of NLTK, type the following:

In [2]:
gutenbergwords = nltk.corpus.gutenberg.words()

Now you can find the total number of words, and the first 10 words (do not attempt to display all the words or your computer will freeze!):

In [3]:
len(gutenbergwords)

2621613

In [4]:
gutenbergwords[:10]

['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']', 'VOLUME', 'I', 'CHAPTER']

You can also retrieve the words of a single document, as shown below. For more details on what you can extract from this corpus, read the "Gutenberg corpus" part of section 2.1 in [chapter 2 of the NLTK book](http://www.nltk.org/book_1ed/ch02.html).

In [5]:
emma = nltk.corpus.gutenberg.words('austen-emma.txt')
len(emma)

192427

In [6]:
emma[:10]

['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']', 'VOLUME', 'I', 'CHAPTER']

As we have seen in the lectures, Python's `collections.Counter` can find the most frequent words of a document from NLTK's Gutenberg collection. The cell below finds the 5 most frequent words of the word list stored in the variable `emma`:

In [7]:
import collections
emma_counter = collections.Counter(emma)
print(emma_counter.most_common(5))

[(',', 11454), ('.', 6928), ('to', 5183), ('the', 4844), ('and', 4672)]


### Exercise 1.1
*Write Python code that prints the 10 most frequent words in each of the documents of the Gutenberg corpus. Can you identify any similarities among these lists of most frequent words?*

In [13]:
for fileid in nltk.corpus.gutenberg.fileids():
    print(fileid)
    print(collections.Counter(nltk.corpus.gutenberg.words(fileid)).most_common(10))

austen-emma.txt
[(',', 11454), ('.', 6928), ('to', 5183), ('the', 4844), ('and', 4672), ('of', 4279), ('I', 3178), ('a', 3004), ('was', 2385), ('her', 2381)]
austen-persuasion.txt
[(',', 6750), ('the', 3120), ('to', 2775), ('.', 2741), ('and', 2739), ('of', 2564), ('a', 1529), ('in', 1346), ('was', 1330), (';', 1290)]
austen-sense.txt
[(',', 9397), ('to', 4063), ('.', 3975), ('the', 3861), ('of', 3565), ('and', 3350), ('her', 2436), ('a', 2043), ('I', 2004), ('in', 1904)]
bible-kjv.txt
[(',', 70509), ('the', 62103), (':', 43766), ('and', 38847), ('of', 34480), ('.', 26160), ('to', 13396), ('And', 12846), ('that', 12576), ('in', 12331)]
blake-poems.txt
[(',', 680), ('the', 351), ('.', 201), ('And', 176), ('and', 169), ('of', 131), ('I', 130), ('in', 116), ('a', 108), ("'", 104)]
bryant-stories.txt
[(',', 3481), ('the', 3086), ('and', 1873), ('.', 1817), ('to', 1165), ('a', 988), ('"', 900), ('he', 872), ('of', 801), ('was', 706)]
burgess-busterbrown.txt
[('.', 823), (',', 822), ('the', 

### Exercise 1.2
*Find the unique words with length of more than 17 characters in the complete Gutenberg corpus.*

*Hint: to find the distinct items of a Python list you can convert it into a set:*

In [15]:
my_list = ['a','b','c','a','c']
my_set = set(my_list)
print(my_set)
print(len(my_set))

{'b', 'c', 'a'}
3


In [17]:
# Build the set of distinct words that are longer than 17 characters
long_words = {w for w in gutenbergwords if len(w) > 17}
print(long_words)

### Exercise 1.3
*Find the words that are longer than 5 characters and occur more than 2000 times in the complete Gutenberg corpus.*


[('little', 2825), ('before', 3335), ('people', 2773), ('children', 2223), ('should', 2496), ('against', 2255), ('Israel', 2591)]
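
One way to arrive at a list like the one above is to count every token once and then filter the counts. The sketch below is a minimal version under that assumption; the function name `frequent_long_words` and its parameters are illustrative and not taken from the original notebook, and the toy list stands in for `gutenbergwords`.

```python
import collections

def frequent_long_words(words, min_length=5, min_count=2000):
    """Return (word, count) pairs for words longer than min_length
    characters that occur more than min_count times."""
    counts = collections.Counter(words)
    return [(w, c) for w, c in counts.items()
            if len(w) > min_length and c > min_count]

# Toy data standing in for the corpus (an assumption for illustration).
toy = ['little'] * 3 + ['the'] * 5 + ['before'] * 2 + ['people']
print(frequent_long_words(toy, min_length=5, min_count=1))
```

On the full corpus the call would presumably be `frequent_long_words(gutenbergwords)`, using the default thresholds from the exercise.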


### Exercise 1.4
*Find the average number of words in the documents of the NLTK Gutenberg corpus.*


145645.16666666666
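
The figure above equals the corpus total of 2621613 tokens divided by the 18 file ids listed earlier. A minimal sketch of that computation follows; `average_document_length` is a hypothetical helper name, and the toy dictionary stands in for a mapping from each file id to its word list.

```python
def average_document_length(documents):
    """documents maps a document name to its sequence of tokens."""
    lengths = [len(tokens) for tokens in documents.values()]
    return sum(lengths) / len(lengths)

# Toy data standing in for the corpus (an assumption for illustration).
toy_docs = {'a.txt': ['one', 'two', 'three'], 'b.txt': ['four']}
print(average_document_length(toy_docs))
```

With NLTK, the mapping could be built as `{f: nltk.corpus.gutenberg.words(f) for f in nltk.corpus.gutenberg.fileids()}`.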

### (Optional) Exercise 1.5
*Find the Gutenberg document that has the longest average word length.*


Document with largest average word length is milton-paradise.txt with word length 4.835734572682675
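
A possible approach is to average the token lengths per document and take the maximum. The sketch below assumes every token counts, punctuation included; whether the original solution filtered punctuation is not known, so the exact figure above may depend on that choice. The helper name and the toy data are illustrative.

```python
def longest_average_word_length(documents):
    """Return (name, average) for the document whose tokens are longest
    on average. documents maps a name to a non-empty token sequence."""
    averages = {name: sum(len(t) for t in tokens) / len(tokens)
                for name, tokens in documents.items()}
    best = max(averages, key=averages.get)
    return best, averages[best]

# Toy data standing in for the corpus (an assumption for illustration).
toy_docs = {'short.txt': ['a', 'bb'], 'long.txt': ['cccc', 'dddddd']}
print(longest_average_word_length(toy_docs))
```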


### Exercise 1.6
*Find the 10 most frequent bigrams in the entire Gutenberg corpus.*


[((',', 'and'), 41294),
 (('of', 'the'), 18912),
 (('in', 'the'), 9793),
 (("'", 's'), 9781),
 ((';', 'and'), 7559),
 (('and', 'the'), 6432),
 (('the', 'LORD'), 5964),
 ((',', 'the'), 5957),
 ((',', 'I'), 5677),
 ((',', 'that'), 5352)]
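
A bigram count like the one above can be sketched by pairing each token with its successor and feeding the pairs to `collections.Counter`. The helper name `most_common_bigrams` is an assumption for illustration; `nltk.bigrams` yields the same pairs and could replace the `zip` call.

```python
import collections

def most_common_bigrams(tokens, n=10):
    """Return the n most frequent adjacent token pairs."""
    tokens = list(tokens)               # materialise so we can slice
    bigrams = zip(tokens, tokens[1:])   # (t0, t1), (t1, t2), ...
    return collections.Counter(bigrams).most_common(n)

# Toy data standing in for the corpus (an assumption for illustration).
toy = ['of', 'the', 'sea', ',', 'and', 'of', 'the', 'land']
print(most_common_bigrams(toy, n=1))
```

On the full corpus this would be called as `most_common_bigrams(gutenbergwords)`.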

### Exercise 1.7
*Find the most frequent bigram that begins with "Moby" in Herman Melville's "Moby Dick".*

[(('Moby', 'Dick'), 83), (('Moby', '-'), 1)]
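
The same pairing idea works here if only the bigrams whose first element matches are counted. This is a sketch under that assumption, with a hypothetical helper name and toy tokens in place of the novel.

```python
import collections

def bigrams_starting_with(tokens, first_word):
    """Count adjacent pairs whose first token equals first_word,
    most frequent first."""
    tokens = list(tokens)
    counts = collections.Counter(
        (a, b) for a, b in zip(tokens, tokens[1:]) if a == first_word)
    return counts.most_common()

# Toy data standing in for the novel (an assumption for illustration).
toy = ['Moby', 'Dick', 'saw', 'Moby', 'Dick', 'and', 'Moby', '-', 'like']
print(bigrams_starting_with(toy, 'Moby'))
```

For the exercise, the tokens would come from `nltk.corpus.gutenberg.words('melville-moby_dick.txt')`.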

# 2. Text Preprocessing with NLTK
The following exercises will ask questions about tokens, stems, and parts of speech.

### Exercise 2.1
*What is the sentence with the largest number of tokens in Austen's "Emma"?*

In [22]:
nltk.download('punkt')
 
emmatext = nltk.corpus.gutenberg.raw('austen-emma.txt')

# Split the raw text into sentences, then pick the one with the most tokens
sentences = nltk.sent_tokenize(emmatext)
longest = max(sentences, key=lambda s: len(nltk.word_tokenize(s)))
print(len(nltk.word_tokenize(longest)))
print(longest)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ohaiy\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!



### Exercise 2.2
*What are the 5 most frequent parts of speech in Austen's "Emma"? Use the universal tag set.*

[('VERB', 35723),
 ('NOUN', 31998),
 ('.', 30304),
 ('PRON', 21263),
 ('ADP', 17880)]
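
Counts like these can be obtained by tagging the tokens and counting only the tag of each `(word, tag)` pair. The sketch below covers the counting step; it assumes the pairs come from something like `nltk.pos_tag(emma, tagset='universal')`, which needs the tagger model and the universal tag set mapping to be downloaded first (the resource names vary between NLTK versions). Whether the original solution used this tagger is not known, and `most_common_tags` is a hypothetical helper name.

```python
import collections

def most_common_tags(tagged_tokens, n=5):
    """Return the n most frequent tags in a sequence of (word, tag) pairs."""
    return collections.Counter(tag for _, tag in tagged_tokens).most_common(n)

# Hand-tagged toy data (an assumption for illustration, not tagger output).
toy_tagged = [('Emma', 'NOUN'), ('was', 'VERB'), ('walking', 'VERB'),
              (',', '.'), ('smiling', 'VERB'), ('.', '.')]
print(most_common_tags(toy_tagged, n=2))
```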

### Exercise 2.3
*What is the number of distinct stems in Austen's "Emma"?* 

Austen's Emma has 5394 distinct stems
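
Counting distinct stems amounts to stemming every token and collecting the results in a set. The sketch below takes the stemmer as a parameter, because the exact count depends on which stemmer is used and on how case and punctuation are handled, none of which is stated here. `toy_stem` is a crude hypothetical stand-in so the example runs without NLTK.

```python
def distinct_stems(tokens, stem):
    """Return the set of stems produced by applying stem() to each token."""
    return {stem(token) for token in tokens}

def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: lowercases the word and
    strips one common suffix."""
    word = word.lower()
    for suffix in ('ing', 'ed', 's'):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

toy = ['Respect', 'respects', 'respected', 'walking', 'walk']
print(len(distinct_stems(toy, toy_stem)))
```

With NLTK, one option would be `distinct_stems(emma, nltk.stem.PorterStemmer().stem)`.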


### (Optional) Exercise 2.4
*What is the most ambiguous stem in Austen's "Emma"? (meaning, which stem in Austen's "Emma" is realised in the largest number of distinct tokens?)*

The most ambiguous stem is 'respect' with words {'Respect', 'respecting', 'respects', 'respectful', 'respectable', 'respective', 'respectfully', 'respectably', 'respected', 'respectability', 'respect'}
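
One way to find such a stem is to group the distinct tokens by their stem and pick the largest group. The sketch below assumes that approach; the helper name is illustrative, and the prefix-based `toy_stem` is a hypothetical stand-in for a real stemmer such as `nltk.stem.PorterStemmer().stem`.

```python
import collections

def most_ambiguous_stem(tokens, stem):
    """Return (stem, forms) for the stem realised by the most distinct tokens."""
    forms = collections.defaultdict(set)
    for token in tokens:
        forms[stem(token)].add(token)
    best = max(forms, key=lambda s: len(forms[s]))
    return best, forms[best]

def toy_stem(word):
    """Hypothetical stand-in: the lowercased first four characters."""
    return word.lower()[:4]

toy = ['Respect', 'respects', 'respected', 'walk', 'walking']
print(most_ambiguous_stem(toy, toy_stem))
```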
