In [2]:
import nltk
import datetime
from nltk.corpus import brown

**1. Search the web for "spoof newspaper headlines", to find such gems as: ** *British Left Waffles on Falkland Islands* **, and ** *Juvenile Court to Try Shooting Defendant* **. Manually tag these headlines to see if knowledge of the part-of-speech tags removes the ambiguity.**

In [3]:
# Here "Try" means to examine evidence in court and decide whether somebody is innocent or guilty.
headline = 'Juvenile/NOUN Court/NOUN to/PRT Try/VERB Shooting/ADJ Defendant/NOUN'
[nltk.tag.str2tuple(t) for t in headline.split()]

[('Juvenile', 'NOUN'),
 ('Court', 'NOUN'),
 ('to', 'PRT'),
 ('Try', 'VERB'),
 ('Shooting', 'ADJ'),
 ('Defendant', 'NOUN')]

**2. Working with someone else, take turns to pick a word that can be either a noun or a verb (e.g. ** *contest* **); the opponent has to predict which one is likely to be the most frequent in the Brown corpus; check the opponent's prediction, and tally the score over several turns.**

Omitted.

**3. Tokenize and tag the following sentence: ** *They wind back the clock, while we chase after the wind.* ** What different pronunciations and parts of speech are involved?**

In [4]:
sent = 'They wind back the clock, while we chase after the wind.'
nltk.pos_tag(nltk.word_tokenize(sent))

[('They', 'PRP'),
 ('wind', 'VBP'),
 ('back', 'RB'),
 ('the', 'DT'),
 ('clock', 'NN'),
 (',', ','),
 ('while', 'IN'),
 ('we', 'PRP'),
 ('chase', 'VBP'),
 ('after', 'IN'),
 ('the', 'DT'),
 ('wind', 'NN'),
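 ('.', '.')]

The first *wind* is a verb (`VBP`), pronounced /waɪnd/; the second is a noun (`NN`), pronounced /wɪnd/.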

**4. Review the mappings in 3.1. Discuss any other examples of mappings you can think of. What type of information do they map from and to?**

| Linguistic Object | Maps From | Maps To |
| --- | --- | --- |
| Word Frequency | Word | Number of occurrences in a text |
| Word Pronunciation | Word | List of the word's pronunciations |
| Abbreviation | Acronym | Full name(s) the acronym stands for |

**5. Using the Python interpreter in interactive mode, experiment with the dictionary examples in this chapter. Create a dictionary ** `d` **, and add some entries. What happens if you try to access a non-existent entry, e.g. ** `d['xyz']` **?** 

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'xyz'
```
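
Looking up a missing key with square brackets raises `KeyError`, as the traceback shows. A minimal sketch of how to reproduce this and of the usual ways to avoid it (the entries in `d` are made up for illustration):

```python
from collections import defaultdict

d = {'colorless': 'ADJ', 'ideas': 'N'}   # illustrative entries

try:
    d['xyz']
except KeyError as err:
    missing = err.args[0]                # the key that was not found

has_key = 'xyz' in d                     # membership test, no exception
fallback = d.get('xyz', 'UNK')           # returns the default instead of raising

# A defaultdict creates an entry on first access instead of raising.
pos = defaultdict(lambda: 'NOUN')
pos.update(d)
guess = pos['xyz']
```

Note that `d.get()` leaves the dictionary unchanged, whereas reading `pos['xyz']` adds the key to `pos`.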

**6. Try deleting an element from a dictionary d, using the syntax ** `del d['abc']` **. Check that the item was deleted.**

Omitted.
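
A minimal sketch of the check, with made-up entries in `d`:

```python
d = {'abc': 1, 'def': 2}   # illustrative entries

del d['abc']               # remove the key and its value

deleted = 'abc' not in d   # True once the entry is gone
remaining = dict(d)        # snapshot of what is left

# Deleting a key that is absent raises KeyError, just like a lookup does.
try:
    del d['abc']
    second_delete_failed = False
except KeyError:
    second_delete_failed = True
```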

**7. Create two dictionaries, ** `d1` ** and ** `d2` **, and add some entries to each. Now issue the command ** `d1.update(d2)` **. What did this do? What might it be useful for?**

In [6]:
d1 = {'hello': 1, 'world': 2, 'natural': 0}
d2 = {'natural': 3, 'language': 4, 'processing': 5}
d1.update(d2)
d1

{'hello': 1, 'language': 4, 'natural': 3, 'processing': 5, 'world': 2}

`d1.update(d2)` copies every key/value pair from `d2` into `d1`, overwriting the value of any key that already exists (here `natural` changes from 0 to 3).  
This is useful for merging two dictionaries.
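
One common use is layering overrides on top of defaults. The sketch below assumes two hypothetical settings dictionaries; the key names are invented for illustration.

```python
defaults = {'tagset': 'brown', 'backoff': 'NN', 'cutoff': 0}   # hypothetical settings
overrides = {'tagset': 'universal', 'cutoff': 2}

settings = dict(defaults)     # copy, so the defaults stay untouched
settings.update(overrides)    # overrides win where keys collide

merged = {**defaults, **overrides}   # same result without mutating either input
```

Because `update()` modifies the dictionary in place, copying first keeps the original available.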

**8. Create a dictionary ** `e` **, to represent a single lexical entry for some word of your choice. Define keys like ** `headword, part-of-speech, sense, ` ** and ** `example` **, and assign them suitable values.**

In [7]:
e = {}
e['headword'] = ['NOUN', 'a word or term placed at the beginning (as of a chapter or an entry in an encyclopedia)']
e['part-of-speech'] = ['PHRASE', 'a traditional class of words distinguished according to the kind of idea denoted and the function performed in a sentence']
e['sense'] = ['NOUN', 'a meaning conveyed or intended']
e['example'] = ['NOUN', 'one that serves as a pattern to be imitated or not to be imitated']
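
The entry above stores a definition for each key name. Another reading of the exercise is that `e` describes one word, with each key holding that word's headword, part of speech, sense, and example. A sketch under that reading (the word and its gloss are chosen here purely for illustration):

```python
# A single lexical entry for the noun "wind"; all values are illustrative.
e = {
    'headword': 'wind',
    'part-of-speech': 'NOUN',
    'sense': 'air moving across the surface of the earth',
    'example': 'We chase after the wind.',
}

# The dictionary can feed a formatting string directly.
summary = '%(headword)s (%(part-of-speech)s): %(sense)s' % e
```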

**9. Satisfy yourself that there are restrictions on the distribution of ** *go* ** and ** *went* **, in the sense that they cannot be freely interchanged in the kinds of contexts illustrated in (3d) in 7.**

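'We *went* on the excursion.' is in the past tense, and *went* cannot appear where a base form is required: 'We want to *go*' is fine, but 'We want to *went*' is not.  
'We *go* on the excursion.' is acceptable only as a present-tense statement (for example, a habitual one); it cannot describe a past event, so 'We *go* on the excursion yesterday' is not acceptable.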

**10. Train a unigram tagger and run it on some new text. Observe that some words are not assigned a tag. Why not?**

In [6]:
brown_tagged_sents = brown.tagged_sents(categories='news')
unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)

test_text = ['hello', 'world', 'natural', 'language', 'processing']
unigram_tagger.tag(test_text)

[('hello', None),
 ('world', 'NN'),
 ('natural', 'JJ'),
 ('language', 'NN'),
 ('processing', 'NN')]

Words such as *hello* do not appear in the training text, so the unigram tagger has no tag counts for them and assigns `None`.
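
A backoff tagger is the usual remedy for unseen words. The sketch below uses a tiny hand-made training set instead of the Brown corpus so that it runs without downloading data; the sentences and the choice of `'NN'` as the default tag are assumptions made for illustration.

```python
import nltk

# Tiny hand-made training set; tags follow the Brown style used above.
train = [[('the', 'AT'), ('world', 'NN'), ('is', 'BEZ'), ('natural', 'JJ')]]

plain = nltk.UnigramTagger(train)
with_backoff = nltk.UnigramTagger(train, backoff=nltk.DefaultTagger('NN'))

words = ['hello', 'world']
plain_tags = plain.tag(words)           # 'hello' was never seen, so its tag is None
backoff_tags = with_backoff.tag(words)  # 'hello' falls through to the default 'NN'
```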

**11. Learn about the affix tagger (type ** `help(nltk.AffixTagger)` **). Train an affix tagger and run it on some new text. Experiment with different settings for the affix length and the minimum word length. Discuss your findings.**

In [14]:
# help(nltk.AffixTagger)
affix_tagger = nltk.AffixTagger(brown_tagged_sents, affix_length=3, min_stem_length=4)
test_text = 'Experiment with different settings for the affix length and the minimum word length'.split()
affix_tagger.tag(test_text)

[('Experiment', 'NN-TL'),
 ('with', None),
 ('different', 'JJ'),
 ('settings', 'NN'),
 ('for', None),
 ('the', None),
 ('affix', None),
 ('length', None),
 ('and', None),
 ('the', None),
 ('minimum', 'NNS'),
 ('word', None),
 ('length', None)]
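
Two settings explain this output. In NLTK's `AffixTagger`, a positive `affix_length` selects a prefix and a negative one a suffix, and any word shorter than `min_stem_length + abs(affix_length)` is skipped. With `affix_length=3` and `min_stem_length=4`, every word under seven characters gets `None`, and the remaining tags are guessed from prefixes such as `Exp` and `min`. The sketch below contrasts suffix and prefix settings on a tiny hand-made training set (an assumption for illustration, not the Brown data used above):

```python
import nltk

# Hand-made training data: two -ing verbs and two -ion nouns.
train = [[('running', 'VBG'), ('walking', 'VBG'),
          ('nation', 'NN'), ('station', 'NN')]]

suffix_tagger = nltk.AffixTagger(train, affix_length=-3, min_stem_length=2)
prefix_tagger = nltk.AffixTagger(train, affix_length=3, min_stem_length=2)

words = ['jumping', 'motion', 'runner', 'run']
suffix_tags = suffix_tagger.tag(words)   # uses the last three characters
prefix_tags = prefix_tagger.tag(words)   # uses the first three characters
```

Here the suffix tagger generalizes to *jumping* and *motion*, while the prefix tagger only recognizes *runner* (sharing `run` with *running*), and *run* is too short for either.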

**12. Train a bigram tagger with no backoff tagger, and run it on some of the training data. Next, run it on some new data. What happens to the performance of the tagger? Why?**

Temporarily omitted.
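
A small sketch of the effect, using hand-made training sentences rather than the Brown corpus (an assumption made so the example runs without corpus data). A bigram tagger keys on the pair (previous tag, current word). On training data every such pair has been seen, so accuracy is high. On new data, the first unseen pair yields `None`, and the following word then has `None` as its previous tag, which was also never seen, so the rest of the sentence tends to stay untagged.

```python
import nltk

train = [[('the', 'AT'), ('dog', 'NN'), ('barks', 'VBZ')],
         [('the', 'AT'), ('cat', 'NN'), ('sleeps', 'VBZ')]]

bigram_tagger = nltk.BigramTagger(train)   # no backoff tagger

seen = bigram_tagger.tag(['the', 'dog', 'barks'])     # a training sentence
unseen = bigram_tagger.tag(['the', 'bird', 'barks'])  # 'bird' is new
```

In `unseen`, *barks* is left untagged even though it occurred in training, because its context `(None, 'barks')` did not.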

**13. We can use a dictionary to specify the values to be substituted into a formatting string. Read Python's library documentation for formatting strings http://docs.python.org/lib/typesseq-strings.html** *(404 NOT FOUND)* ** and use this method to display today's date in two different formats.**

In [2]:
datetime.datetime.today().strftime("%Y-%m-%d")

'2018-08-31'
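
The exercise asks for a dictionary-driven formatting string and two formats. A minimal sketch using `%(name)s`-style substitution; the date is fixed here so the result is reproducible, and the key names in `fields` are arbitrary choices.

```python
import datetime

# Fixed date for illustration; use datetime.date.today() in practice.
day = datetime.date(2018, 8, 31)

fields = {'year': day.year, 'month': day.month, 'day': day.day}

iso_style = '%(year)d-%(month)02d-%(day)02d' % fields     # year first
slash_style = '%(day)02d/%(month)02d/%(year)d' % fields   # day first
```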

**14. Use ** `sorted()` ** and ** `set()` ** to get a sorted list of tags used in the Brown corpus, removing duplicates.**

In [18]:
list_of_tags = sorted(set([tag for (_, tag) in brown.tagged_words()]))

**15. Write programs to process the Brown Corpus and find answers to the following questions:**  
a. **Which nouns are more common in their plural form, rather than their singular form? (Only consider regular plurals, formed with the ** *-s* ** suffix.)**  
b. **Which word has the greatest number of distinct tags. What are they, and what do they represent?**  
c. **List tags in order of decreasing frequency. What do the 20 most frequent tags represent?**  
d. **Which tags are nouns most commonly found after? What do these tags represent?**

In [89]:
brown_tagged = brown.tagged_words()
cfd = nltk.ConditionalFreqDist(brown_tagged)

In [34]:
# Which nouns are more common in their plural form, rather than their singular form? 
# (Only consider regular plurals, formed with the -s suffix.)

common_plural = set()
for word in set(brown.words()):
    if cfd[word+'s']['NNS'] > cfd[word]['NN']:
        common_plural.add(word)

In [77]:
# Which word has the greatest number of distinct tags. What are they, and what do they represent?

tag_dict = {k:len(cfd[k]) for k in cfd}
greatest = max(tag_dict, key=lambda key: tag_dict[key])

*that*  
CS, CS-HL, CS-NC, DT, DT-NC, NIL, QL, WPO, WPO-NC, WPS, WPS-HL, WPS-NC

In [88]:
# List tags in order of decreasing frequency. What do the 20 most frequent tags represent?

helper_list = [t for (_, t) in brown_tagged]    # extract the tags to a list 
fd = nltk.FreqDist(helper_list)
fd.most_common(20)

[('NN', 152470),
 ('IN', 120557),
 ('AT', 97959),
 ('JJ', 64028),
 ('.', 60638),
 (',', 58156),
 ('NNS', 55110),
 ('CC', 37718),
 ('RB', 36464),
 ('NP', 34476),
 ('VB', 33693),
 ('VBN', 29186),
 ('VBD', 26167),
 ('CS', 22143),
 ('PPS', 18253),
 ('VBG', 17893),
 ('PP$', 16872),
 ('TO', 14918),
 ('PPSS', 13802),
 ('CD', 13510)]

In [93]:
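# Which tags are nouns most commonly found after? What do these tags represent?
# Note: the comprehension below keeps b's tag when a is a noun, i.e. the tags that FOLLOW nouns.
# To list the tags that nouns are found after (the tags that precede them),
# keep a[1] where b[1].startswith('NN') instead.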

word_tag_pairs = nltk.bigrams(brown_tagged)
noun_after = [b[1] for (a, b) in word_tag_pairs if a[1].startswith('NN')]
fdist = nltk.FreqDist(noun_after)
[tag for (tag, _) in fdist.most_common(10)]

['IN', '.', ',', 'CC', 'NN', 'NNS', 'VBD', 'CS', 'MD', 'BEZ']

**16. Explore the following issues that arise in connection with the lookup tagger:**  
a. **What happens to the tagger performance for the various model sizes when a backoff tagger is omitted?**  
b. **Consider the curve in 4.2; suggest a good size for a lookup tagger that balances memory and performance. Can you come up with scenarios where it would be preferable to minimize memory usage, or to maximize performance with no regard for memory usage?**

When the backoff tagger is omitted, performance still improves as the model size grows, because fewer words are left untagged, but it stays at or below the accuracy obtained with a backoff tagger, since every word missing from the table receives `None` and counts as an error.  
If memory is limited, a model of about 8,000 words, which reaches roughly 90% accuracy in Figure 4.2, is a reasonable compromise. With no regard for memory usage, use as large a model as possible (though overfitting and computation time are still worth considering).


**17. What is the upper limit of performance for a lookup tagger, assuming no limit to the size of its table? (Hint: write a program to work out what percentage of tokens of a word are assigned the most likely tag for that word, on average.)**

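The upper limit is the share of tokens that carry the most frequent tag for their word: for each word, count the tokens tagged with its most likely tag, sum these counts over all words, and divide by the total number of tokens.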
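
The bound can be computed directly from any sequence of `(word, tag)` pairs. The function name `lookup_upper_bound` and the toy data below are hypothetical; the sketch measures the bound on the same data the table would be built from.

```python
from collections import Counter, defaultdict

def lookup_upper_bound(tagged_words):
    """Fraction of tokens that carry their word's most frequent tag."""
    counts = defaultdict(Counter)
    total = 0
    for word, tag in tagged_words:
        counts[word][tag] += 1
        total += 1
    # For each word, only tokens with its most frequent tag can be tagged correctly.
    best = sum(max(tag_counts.values()) for tag_counts in counts.values())
    return best / total

toy = [('the', 'AT'), ('wind', 'NN'), ('wind', 'NN'), ('wind', 'VB'), ('blows', 'VBZ')]
bound = lookup_upper_bound(toy)   # 4 of the 5 tokens carry their word's top tag
```

For the corpus itself, the same function could be called as `lookup_upper_bound(brown.tagged_words(categories='news'))`.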

**18. Generate some statistics for tagged data to answer the following questions:**  
a. **What proportion of word types are always assigned the same part-of-speech tag?**  
b. **How many words are ambiguous, in the sense that they appear with at least two tags?**  
c. **What percentage of word ** *tokens* ** in the Brown Corpus involve these ambiguous words?**

In [3]:
brown_tag = brown.tagged_words(tagset='universal')
cfd = nltk.ConditionalFreqDist(brown_tag)

In [14]:
proportion = sum(1 for word in cfd if len(cfd[word]) == 1) / len(cfd)

In [16]:
ambiguous = sum(1 for word in cfd if len(cfd[word]) > 1)
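
Part (c) needs token counts rather than type counts. The sketch below computes all three statistics from a sequence of `(word, tag)` pairs; the function name `ambiguity_stats`, the result keys, and the toy data are hypothetical.

```python
from collections import Counter, defaultdict

def ambiguity_stats(tagged_words):
    """Type- and token-level ambiguity figures for (word, tag) pairs."""
    tags_per_word = defaultdict(Counter)
    for word, tag in tagged_words:
        tags_per_word[word][tag] += 1
    ambiguous = {w for w, tags in tags_per_word.items() if len(tags) > 1}
    total_tokens = sum(sum(tags.values()) for tags in tags_per_word.values())
    ambiguous_tokens = sum(sum(tags_per_word[w].values()) for w in ambiguous)
    return {
        'unambiguous_type_share': 1 - len(ambiguous) / len(tags_per_word),  # part a
        'ambiguous_types': len(ambiguous),                                  # part b
        'ambiguous_token_pct': 100 * ambiguous_tokens / total_tokens,       # part c
    }

toy = [('they', 'PRON'), ('wind', 'VERB'), ('the', 'DET'), ('clock', 'NOUN'),
       ('the', 'DET'), ('wind', 'NOUN')]
stats = ambiguity_stats(toy)
```

On the corpus, `ambiguity_stats(brown.tagged_words(tagset='universal'))` would give the three answers in one pass.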

**19. The ** `evaluate()` ** method works out how accurately the tagger performs on this text. For example, if the supplied tagged text was ** `[('the', 'DT'), ('dog', 'NN')]` ** and the tagger produced the output ** `[('the', 'NN'), ('dog', 'NN')]` **, then the score would be ** `0.5` **. Let's try to figure out how the evaluation method works:**  
a. **A tagger ** `t` ** takes a list of words as input, and produces a list of tagged words as output. However,** ` t.evaluate()` ** is given correctly tagged text as its only parameter. What must it do with this input before performing the tagging?**