## Chapter 5

#### 1. What are the most common adverbs in the brown corpus (categories="news")? Please output the 10 most frequent ones (Please use the universal tagset).

In [1]:
import nltk
from nltk.corpus import brown

In [3]:
brown_vocab = nltk.corpus.brown.tagged_words(categories = "news", tagset='universal')

In [4]:
cfd = nltk.ConditionalFreqDist((tag, word) for (word, tag) in brown_vocab)

In [5]:
cfd['ADV'].most_common(10)

[('not', 254),
 ('when', 128),
 ('also', 120),
 ('now', 76),
 ('as', 75),
 ('here', 67),
 ('where', 58),
 ('then', 56),
 ('back', 55),
 ('about', 49)]

#### 2. What are the part-of-speech tags before the word “religion” in the brown corpus (categories="religion")? (Please use the universal tagset)

In [30]:
brown_religion_vocab = nltk.corpus.brown.tagged_words(categories = "religion", tagset='universal')

In [31]:
tags = [a[1] for (a, b) in nltk.bigrams(brown_religion_vocab) if b[0] == 'religion']  

In [32]:
fd = nltk.FreqDist(tags)

In [33]:
fd.tabulate()

NOUN    .  ADJ  ADP CONJ  DET 
   3    2    2    1    1    1 


#### 3. Which words are highly ambiguous as to their part-of-speech tags (i.e. the word has more than 3 POS tags) in the brown corpus (categories="adventure")? (Please use the universal tagset.)

In [11]:
brown_adventure_tagged = brown.tagged_words(categories = "adventure", tagset = "universal")
data = nltk.ConditionalFreqDist((word.lower(), tag)
                                for (word, tag) in brown_adventure_tagged)

In [29]:
for word in sorted(data.conditions()):
    if len(data[word]) > 3:
        tags = [tag for (tag,_) in data[word].most_common()]
        print(word, ' '.join(tags))

back ADV NOUN VERB ADJ
last ADJ ADV VERB NOUN
outside ADV ADP ADJ NOUN
past ADP ADV ADJ NOUN
that ADP DET PRON ADV


#### 4. Train a unigram tagger on the brown corpus (categories="humor"). a) Split the data into training and test sets, training on 95% of the data and testing on the remaining 5%. b) Evaluate the performance of this tagger on the test set. c) Use this tagger to tag some new text ['this','is','a','NLP','class']. d) Observe that some words are not assigned a tag. Why not? (Please do not use the universal tagset.)

In [41]:
brown_humor_tagged = brown.tagged_sents(categories = 'humor')
brown_sents = brown.sents(categories = 'humor')
unigram_tagger = nltk.UnigramTagger(brown_humor_tagged)
unigram_tagger.tag(brown_sents[10])

[('Thus', 'RB'),
 ('it', 'PPS'),
 ('was', 'BEDZ'),
 ('that', 'CS'),
 ('Barco', 'NP'),
 (',', ','),
 ('apprehended', 'VBN'),
 ('for', 'IN'),
 ('mere', 'JJ'),
 ('larceny', 'NN'),
 (',', ','),
 ('now', 'RB'),
 ('began', 'VBD'),
 ('to', 'TO'),
 ('suspect', 'VB'),
 ('that', 'CS'),
 ('one', 'CD'),
 ('or', 'CC'),
 ('another', 'DT'),
 ('of', 'IN'),
 ('his', 'PP$'),
 ('murders', 'NNS'),
 ('had', 'HVD'),
 ('been', 'BEN'),
 ('uncovered', 'VBN'),
 ('.', '.')]

In [42]:
## a
size = int(len(brown_humor_tagged)*.95)
# Slice the tagged sentences loaded above (brown_humor_sents was never defined).
train_humor = brown_humor_tagged[:size]
test_humor = brown_humor_tagged[size:]

In [43]:
## b
unigram_tagger = nltk.UnigramTagger(train_humor)
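## In recent NLTK releases this method is also available as accuracy(), and evaluate() is deprecated.
unigram_tagger.evaluate(test_humor)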

0.7032967032967034
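
The score above is a token-level accuracy: the share of test tokens whose predicted tag equals the gold tag. A minimal sketch of that computation follows, on toy data. The names `tagging_accuracy`, `toy_tag`, `lookup`, and `gold` are illustrative and not part of NLTK or of the original notebook.

```python
def tagging_accuracy(tag_fn, gold_sents):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = total = 0
    for sent in gold_sents:
        words = [word for (word, _) in sent]
        predicted = tag_fn(words)
        for (_, guess), (_, gold_tag) in zip(predicted, sent):
            correct += (guess == gold_tag)
            total += 1
    return correct / total if total else 0.0


# Toy gold-standard sentences (hypothetical), standing in for test_humor.
gold = [[('the', 'AT'), ('dog', 'NN'), ('barks', 'VBZ')],
        [('a', 'AT'), ('cat', 'NN')]]

# Hypothetical stand-in tagger: knows three words, returns None otherwise.
lookup = {'the': 'AT', 'dog': 'NN', 'a': 'AT'}


def toy_tag(words):
    return [(w, lookup.get(w)) for w in words]


score = tagging_accuracy(toy_tag, gold)
print(score)  # 3 of 5 tokens are tagged correctly
```

Passing `unigram_tagger.tag` and `test_humor` to the same function should reproduce the figure reported by the tagger, assuming the tagger compares tokens in the same way.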

In [44]:
## c
unigram_tagger.tag(['this','is','a','NLP','class'])
## d
## 'NLP' gets the tag None because it never occurs in the training data. A UnigramTagger
## only stores the most frequent tag for each word it saw during training, and with no
## backoff tagger it has nothing to fall back on for unseen words.

[('this', 'DT'), ('is', 'BEZ'), ('a', 'AT'), ('NLP', None), ('class', 'NN')]
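
To make the `None` concrete, here is a rough pure-Python sketch of what a unigram tagger does: it memorises the most frequent tag per training word and looks words up at tagging time. The helpers `train_unigram` and `tag_with`, and the toy training sentence, are assumptions for illustration only; they are not NLTK's implementation.

```python
from collections import Counter, defaultdict


def train_unigram(tagged_sents):
    """Map each training word to its most frequent tag."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag_with(model, words, default=None):
    """Tag words by dictionary lookup; unseen words get `default`."""
    return [(word, model.get(word, default)) for word in words]


# Toy training data (hypothetical), standing in for train_humor.
toy_train = [[('this', 'DT'), ('is', 'BEZ'), ('a', 'AT'), ('class', 'NN')]]
model = train_unigram(toy_train)

new_text = ['this', 'is', 'a', 'NLP', 'class']
no_backoff = tag_with(model, new_text)                   # 'NLP' was never seen
with_default = tag_with(model, new_text, default='NN')   # fall back to 'NN'
print(no_backoff)
print(with_default)
```

In NLTK itself the usual remedy is a backoff tagger, for example `nltk.UnigramTagger(train_humor, backoff=nltk.DefaultTagger('NN'))`.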

#### 5. Explore the nps_chat corpus and find out what part-of-speech tags occur before a noun, with the most frequent ones first (Please use the universal tagset).

In [47]:
chat_tagged_words = nltk.corpus.nps_chat.tagged_words(tagset = 'universal')
chat_tag_pairs = list(nltk.bigrams(chat_tagged_words))

In [48]:
noun_preceders = [a[1] for (a,b) in chat_tag_pairs if b[1] == 'NOUN']
fdist = nltk.FreqDist(noun_preceders)

In [52]:
fdist.most_common()

[('X', 2558),
 ('DET', 1308),
 ('NOUN', 1262),
 ('VERB', 947),
 ('ADJ', 823),
 ('PRON', 676),
 ('.', 636),
 ('ADP', 567),
 ('CONJ', 238),
 ('NUM', 223),
 ('ADV', 218),
 ('PRT', 189)]

#### 6. Explore the brown corpus (categories="romance") to find out all tags starting with VB and its associated (word, frequency) pairs (no more than 6 pairs). (Please do not use the universal tagset)
#### For example, one of the outputs should look like:
#### VBG [('going', 59), ('looking', 36), ('trying', 23), ('thinking', 21), ('watching', 20), ('taking', 19)]

In [57]:
def findtags(tag_prefix, tagged_text):
    cfd = nltk.ConditionalFreqDist((tag, word) for (word, tag) in tagged_text
                              if tag.startswith(tag_prefix))
    return dict((tag, cfd[tag].most_common(6)) for tag in cfd.conditions())

tag_dict = findtags('VB', nltk.corpus.brown.tagged_words(categories =  'romance'))

In [58]:
for tag in sorted(tag_dict):
    print(tag, tag_dict[tag])

VB [('get', 92), ('know', 88), ('go', 76), ('see', 74), ('take', 62), ('say', 59)]
VB+PPO [("Let's", 10), ("let's", 5)]
VBD [('said', 318), ('went', 82), ('thought', 80), ('came', 75), ('knew', 69), ('looked', 68)]
VBG [('going', 59), ('looking', 36), ('trying', 23), ('thinking', 21), ('watching', 20), ('taking', 19)]
VBG+TO [('gonna', 4)]
VBG-TL [("Racin'", 1), ('Dancing', 1), ('Surviving', 1)]
VBN [('got', 36), ('come', 29), ('done', 29), ('gone', 25), ('seen', 20), ('made', 20)]
VBN+TO [('gotta', 1)]
VBN-TL [('United', 3), ('Armed', 1), ('Forked', 1)]
VBZ [('says', 7), ('wants', 7), ('goes', 5), ('gets', 4), ('thinks', 4), ('makes', 4)]


#### 7. Write programs to process the brown corpus (categories="editorial") and find answers to the following questions (Please do not use the universal tagset):
#### a. Which nouns are more common in their plural form (e.g. tag='NNS'), rather than their singular form (e.g. tag='NN')? (Only consider regular plurals, formed with the -s suffix.)
#### b. What do the 10 most frequent tags represent in the Brown Corpus? Please output the tags and explain the meaning for each tag.


In [100]:
# a
brown_edit_tagged = nltk.corpus.brown.tagged_words(categories = 'editorial')
cfd = nltk.ConditionalFreqDist((tag, word) for (word, tag) in brown_edit_tagged
                                  if tag == 'NN' or tag == 'NNS')

In [120]:
cfd['NN'].most_common()

[('time', 72),
 ('world', 65),
 ('man', 54),
 ('war', 53),
 ('year', 52),
 ('government', 49),
 ('state', 45),
 ('way', 43),
 ('city', 39),
 ('program', 37),
 ('fact', 35),
 ('life', 34),
 ('part', 33),
 ('course', 33),
 ('day', 32),
 ('country', 32),
 ('school', 30),
 ('week', 29),
 ('party', 29),
 ('power', 28),
 ('peace', 28),
 ('question', 26),
 ('public', 24),
 ('action', 24),
 ('law', 23),
 ('work', 23),
 ('reason', 23),
 ('budget', 22),
 ('tax', 22),
 ('right', 22),
 ('policy', 22),
 ('business', 21),
 ('service', 21),
 ('editorial', 21),
 ('choice', 20),
 ('hand', 20),
 ('point', 20),
 ('matter', 20),
 ('crisis', 19),
 ('county', 19),
 ('use', 19),
 ('group', 19),
 ('area', 19),
 ('history', 19),
 ('death', 19),
 ('cent', 19),
 ('book', 19),
 ('issue', 18),
 ('board', 18),
 ('end', 18),
 ('problem', 18),
 ('system', 17),
 ('article', 17),
 ('land', 17),
 ('case', 17),
 ('vote', 17),
 ('example', 17),
 ('child', 17),
 ('labor', 16),
 ('moment', 16),
 ('sense', 16),
 ('name', 16)

In [121]:
cfd['NNS'].most_common()

[('people', 73),
 ('years', 63),
 ('men', 37),
 ('children', 31),
 ('days', 26),
 ('members', 25),
 ('schools', 24),
 ('leaders', 24),
 ('problems', 23),
 ('tests', 21),
 ('things', 20),
 ('others', 19),
 ('words', 19),
 ('areas', 18),
 ('months', 16),
 ('bombs', 16),
 ('efforts', 15),
 ('nations', 14),
 ('teachers', 14),
 ('troops', 13),
 ('citizens', 13),
 ('dollars', 12),
 ('friends', 12),
 ('programs', 11),
 ('conditions', 11),
 ('effects', 11),
 ('students', 11),
 ('powers', 10),
 ('jobs', 10),
 ('stations', 10),
 ('generations', 10),
 ('miles', 10),
 ('sides', 10),
 ('girls', 10),
 ('megatons', 10),
 ('questions', 9),
 ('forces', 9),
 ('services', 9),
 ('arms', 9),
 ('facts', 9),
 ('taxpayers', 9),
 ('weapons', 9),
 ('cities', 9),
 ('eyes', 8),
 ('officials', 8),
 ('hours', 8),
 ('countries', 8),
 ('groups', 8),
 ('candidates', 8),
 ('times', 8),
 ('feet', 8),
 ('ways', 7),
 ('issues', 7),
 ('plans', 7),
 ('grounds', 7),
 ('points', 7),
 ('unions', 7),
 ('allies', 7),
 ('states',

In [122]:
## Not sure how to completely do the output for a.
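
One possible way to finish part a is to count `NN` and `NNS` forms separately and then compare each singular noun with the same word plus `-s`. The sketch below is written under some assumptions that the question does not spell out: words are lower-cased, a noun is only reported if its singular form is attested at least once, and only plurals that are exactly singular + "s" are considered. The function name `plural_dominant_nouns` and the toy data are illustrative.

```python
from collections import Counter


def plural_dominant_nouns(tagged_words):
    """Return (singular, n_singular, n_plural) for nouns whose regular
    -s plural is more frequent than the singular form."""
    singular, plural = Counter(), Counter()
    for word, tag in tagged_words:
        if tag == 'NN':
            singular[word.lower()] += 1
        elif tag == 'NNS':
            plural[word.lower()] += 1
    pairs = []
    for noun, n_sing in singular.items():
        n_plur = plural.get(noun + 's', 0)  # regular plural only
        if n_plur > n_sing:
            pairs.append((noun, n_sing, n_plur))
    # Most frequent plurals first.
    return sorted(pairs, key=lambda p: p[2], reverse=True)


# Toy data (hypothetical), standing in for brown_edit_tagged.
toy_words = [('year', 'NN'), ('years', 'NNS'), ('Years', 'NNS'),
             ('time', 'NN'), ('time', 'NN'), ('times', 'NNS'),
             ('children', 'NNS')]  # irregular plural, ignored
result = plural_dominant_nouns(toy_words)
print(result)
```

Calling `plural_dominant_nouns(brown_edit_tagged)` would apply the same comparison to the editorial category.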

In [139]:
## b
brown_edit_tagged = nltk.corpus.brown.tagged_words(categories = 'editorial')
tag_fd = nltk.FreqDist(tag for (word,tag) in brown_edit_tagged)
tag_fd.most_common(10)

[('NN', 7675),
 ('IN', 6204),
 ('AT', 5311),
 ('JJ', 3593),
 ('.', 2988),
 ('NNS', 2972),
 (',', 2741),
 ('VB', 2129),
 ('NP', 1884),
 ('CC', 1835)]

In [None]:
## b
## NN = singular or mass noun
## IN = preposition
## AT = article
## JJ = adjective
## . = sentence-closing punctuation (. ? ! ;)
## NNS = plural noun
## , = comma
## VB = verb, base form
## NP = proper noun (singular)
## CC = coordinating conjunction

#### 8. Write code to search the brown corpus (categories="hobbies") for particular words and phrases according to tags, to answer the following questions (please do not use the universal tagset):
#### a. Produce an alphabetically sorted list of the distinct words tagged as MD.
#### b. Identify three-word prepositional phrases of the form IN + AT + NN (e.g. in the lab).

In [36]:
brown_hobbies_tagged = nltk.corpus.brown.tagged_words(categories = 'hobbies')
hobbies_sents_tagged = nltk.corpus.brown.tagged_sents(categories = 'hobbies')

In [12]:
## a
## The question asks for the words only, so the (constant) tag is dropped.
sorted(set(word.lower() for (word, tag) in brown_hobbies_tagged if tag == 'MD'))

['can',
 'could',
 'dare',
 'may',
 'might',
 'must',
 'need',
 'shall',
 'should',
 'will',
 'would']

In [67]:
## b
## Identify three-word prepositional phrases of the form IN + AT + NN (e.g. in the lab).

def process(sentence):
    # Slide a three-token window over one tagged sentence.
    for (w1, t1), (w2, t2), (w3, t3) in nltk.trigrams(sentence):
        if (t1 == 'IN' and t2 == 'AT' and t3 == 'NN'):
            print(w1, w2, w3)

In [68]:
for tagged_sent in hobbies_sents_tagged:
    process(tagged_sent)

at the time
from the sterno-cleido
of the neck
of the leg
to the value
of a muscle
of the wide-grip
on the chest
with the barbell
in the pecs
to the Aj
of the pin-point
with a bit
to the pecs
into the serratus
of every muscle
with the knowledge
in the limbo
at the hipline
of the champion
under the skin
on the leg
for the bodybuilder
to the height
against the back
to the rear
to the front
in a nutshell
at the back
of the neck
with the bar
of the neck
to the front
from a pansy
from a dealer
in a week
of the year
with a fog
with a board
with the plant
over the bed
of a compost
to the color
through the winter
in a flat
but a mat
over the glass
in a border
at the expense
to the earth
for the rest
of the season
from the mother
at a time
over the winter
in the year
to a meal
in the light
throughout the world
to the season
in the world
on the tree
into a sort
of the fruit
of the avocado
to the consumer
of the blood
at the pit
of a weapon
With a nation
to the advantage
to the missile
in the air

on the water
in a tree
to the letter
in the family
of the job
with no chance
in the know
across the country
at the latitude
in a mortgage
with a minimum
in a day
in the size
in the cost
of the house
in a closet
From the coil
in the yard
in a mild-winter
above the cost
in the basement
in the attic
to a point
than the price
on a variety
besides the nature
from the outside
of a conditioner
in an hour
to the cooling
for the horsepower
of the compressor
of the unit
With a unit
on the outside
of the house
in the roof
of the house
of a gas
to the moisture
in the ceiling
in the side
to a minimum
in the installation
on the basis
above a bedroom
of a site
through the work
of the site
from the county
in the field
during the time
of the year
of the climate
to the sun
in the field
at the site
at the office
in the area
regarding the site
to the site
by a group
on the character
of the site
of the investigator
of the area
for the future
to the public
of the recreation
at the site
on a body
for a park


#### 9. Use a default dictionary and itemgetter (n) to sort the most frequent tags used in the brown corpus (categories="reviews"). Please first convert the tags into the universal tags.

In [70]:
from collections import defaultdict
from operator import itemgetter
from nltk.corpus import brown

frequency = defaultdict(int)
for (word,tag) in brown.tagged_words(categories = 'reviews', tagset = 'universal'):
    frequency[tag] += 1

In [72]:
sorted(frequency.items(), key = itemgetter(1), reverse = True)

[('NOUN', 10528),
 ('VERB', 5478),
 ('.', 5354),
 ('ADP', 4832),
 ('DET', 4720),
 ('ADJ', 3554),
 ('ADV', 2083),
 ('CONJ', 1453),
 ('PRON', 1246),
 ('PRT', 870),
 ('NUM', 477),
 ('X', 109)]

#### 10. Explore the brown corpus (categories="learned") to find the 200 most frequent words and store their most likely tags. We can then use this information as the model for a "lookup tagger" (an NLTK UnigramTagger). Words that are not among the 200 most frequent should be assigned the default tag "NN". Then use this lookup tagger to tag a new sentence of your own.

In [124]:
from nltk.corpus import brown

fd = nltk.FreqDist(brown.words(categories = 'learned'))
cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories = 'learned'))

most_frequent200 = fd.most_common(200)

In [125]:
likely_tags = dict((word,cfd[word].max()) for (word,_) in most_frequent200)

In [126]:
baseline_tagger = nltk.UnigramTagger(model = likely_tags, backoff = nltk.DefaultTagger('NN'))

In [127]:
baseline_tagger.tag(["When","the","Mets","win", "the","World", "Series", ",","I", "will", "go",
                    "to", "the", "parade", "for", "the", "whole", "day", "."])

[('When', 'WRB'),
 ('the', 'AT'),
 ('Mets', 'NN'),
 ('win', 'NN'),
 ('the', 'AT'),
 ('World', 'NN'),
 ('Series', 'NN'),
 (',', ','),
 ('I', 'PPSS'),
 ('will', 'MD'),
 ('go', 'NN'),
 ('to', 'TO'),
 ('the', 'AT'),
 ('parade', 'NN'),
 ('for', 'IN'),
 ('the', 'AT'),
 ('whole', 'NN'),
 ('day', 'NN'),
 ('.', '.')]