# Parts of Speech Tagging Tutorial

The purpose of this tutorial is to experiment with Part-of-Speech tagging, using the tools provided by NLTK. This tutorial is taken from [here](https://www.cs.bgu.ac.il/~elhadad/nlp18/NLTKPOSTagging.html) as part of an NLP course.

We will make use of the contents of [Chapter 5](http://www.nltk.org/book/ch05.html) of the 
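[Natural Language Processing with Python --- Analyzing Text with the Natural Language Toolkit](http://www.nltk.org/book). As experimental dataset, we will use the [Brown Corpus](http://en.wikipedia.org/wiki/Brown_Corpus). The Brown Corpus defines a tagset (a specific collection of part-of-speech labels) that has been reused in many other annotated resources in English. Rather than the original Brown tags, we will work with a coarser universal tagset. The current [universal tagset](http://universaldependencies.org/u/pos/) includes 17 tags, most of which are listed below. Note that NLTK's `tagset='universal'` mapping uses an older 12-tag version of this set, which is why the outputs in this notebook contain `PRT` (particle) and `.` (punctuation) and never `PROPN`, `AUX`, `INTJ`, `PART`, `SCONJ` or `SYM`: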

Tag | Meaning | Examples
----|---------|---------
ADJ | adjective | new, good, high, special, big, local
ADV | adverb | really, already, still, early, now
CONJ | conjunction | and, or, but, if, while, although
DET | determiner | the, a, some, most, every, no
X | other, foreign words | dolce, ersatz, esprit, quo, maitre
NOUN | noun | year, home, costs, time, education
PROPN | proper noun | Alison, Africa, April, Washington
NUM | numeral | twenty-four, fourth, 1991, 14:24
PRON | pronoun | he, their, her, its, my, I, us
ADP | adposition, preposition | on, of, at, with, by, into, under
AUX | auxiliary verb | has (done), is (doing), will (do), should (do), must (do), can (do)
INTJ | interjection | ah, bang, ha, whee, hmpf, oops
VERB | verb | is, has, get, do, make, see, run
PART | particle | possessive marker 's, negation 'not'
SCONJ | subordinating conjunction: complementizer, adverbial clause introducer | I believe 'that' he will come, if, while
SYM | symbol | $, (C), +, *, /, =, :), john.doe@example.com



Note that, without more information, the decision on how to tag a word is ambiguous for several reasons:

- The same string can be understood as a `noun` or a `verb` (e.g., **book**).
- Some POS tags have a systematically ambiguous definition: a present participle can be used in progressive verb constructions (I am going:VERB), but it can also appear in an adjectival position modifying a noun (A striking:ADJ comparison). In other words, the definition of the tag itself leaves unclear whether it refers to a syntactic function or to a morphological property of the word.
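
The first kind of ambiguity can be made concrete by counting how many distinct tags each word receives in annotated data. The following is a minimal sketch on a tiny hand-made sample; the sentences and the `tags_per_word` name are illustrative assumptions, not Brown Corpus data:

```python
from collections import defaultdict

# Tiny hand-tagged sample (illustrative only, not corpus data).
tagged_words = [
    ("I", "PRON"), ("read", "VERB"), ("a", "DET"), ("book", "NOUN"),
    ("We", "PRON"), ("book", "VERB"), ("a", "DET"), ("flight", "NOUN"),
]

# Collect the set of tags observed for each (lower-cased) word.
tags_per_word = defaultdict(set)
for word, tag in tagged_words:
    tags_per_word[word.lower()].add(tag)

# Words seen with more than one tag cannot be tagged without context.
ambiguous = {w: sorted(t) for w, t in tags_per_word.items() if len(t) > 1}
print(ambiguous)  # {'book': ['NOUN', 'VERB']}
```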


## 0. Working on the Brown Corpus with NLTK

NLTK contains a collection of tagged corpora, arranged as convenient Python objects. We will use the **Brown corpus** in this experiment:

* The `tagged_sents` version of the corpus is a list of sentences. Each sentence is a list of tuples `(word, tag)`. 

* With `tagged_words`, one can access the corpus as a flat list of tagged words.

In [None]:
import nltk
nltk.download('brown')

from nltk.corpus import brown

brown_news_tagged = brown.tagged_sents(categories='news', tagset='universal')
brown_news_words = brown.tagged_words(categories='news',  tagset='universal')

[nltk_data] Downloading package brown to /Users/AlexB/nltk_data...
[nltk_data]   Package brown is already up-to-date!


In [None]:
nltk.download('universal_tagset')

[nltk_data] Downloading package universal_tagset to
[nltk_data]     /Users/AlexB/nltk_data...
[nltk_data]   Package universal_tagset is already up-to-date!


True

The corpus as a list of sentences, each sentence being a **list of tuples**:

In [None]:
type(brown_news_tagged)

nltk.corpus.reader.util.ConcatenatedCorpusView

In [None]:
brown_news_tagged

[[('The', 'DET'), ('Fulton', 'NOUN'), ('County', 'NOUN'), ('Grand', 'ADJ'), ('Jury', 'NOUN'), ('said', 'VERB'), ('Friday', 'NOUN'), ('an', 'DET'), ('investigation', 'NOUN'), ('of', 'ADP'), ("Atlanta's", 'NOUN'), ('recent', 'ADJ'), ('primary', 'NOUN'), ('election', 'NOUN'), ('produced', 'VERB'), ('``', '.'), ('no', 'DET'), ('evidence', 'NOUN'), ("''", '.'), ('that', 'ADP'), ('any', 'DET'), ('irregularities', 'NOUN'), ('took', 'VERB'), ('place', 'NOUN'), ('.', '.')], [('The', 'DET'), ('jury', 'NOUN'), ('further', 'ADV'), ('said', 'VERB'), ('in', 'ADP'), ('term-end', 'NOUN'), ('presentments', 'NOUN'), ('that', 'ADP'), ('the', 'DET'), ('City', 'NOUN'), ('Executive', 'ADJ'), ('Committee', 'NOUN'), (',', '.'), ('which', 'DET'), ('had', 'VERB'), ('over-all', 'ADJ'), ('charge', 'NOUN'), ('of', 'ADP'), ('the', 'DET'), ('election', 'NOUN'), (',', '.'), ('``', '.'), ('deserves', 'VERB'), ('the', 'DET'), ('praise', 'NOUN'), ('and', 'CONJ'), ('thanks', 'NOUN'), ('of', 'ADP'), ('the', 'DET'), ('City

Entire corpus as a flat list of tagged words (each tagged word a tuple):

In [None]:
type(brown_news_words)

nltk.corpus.reader.util.ConcatenatedCorpusView

In [None]:
brown_news_words

[('The', 'DET'), ('Fulton', 'NOUN'), ...]

### Measuring success: Accuracy, Training Dataset, Test Dataset

Assume we develop a tagger. How do we know how successful it is? Can we trust the decisions the tagger makes? In order to evaluate the tagger, we are going to split the dataset into training and testing:

In [None]:
len(brown_news_tagged)

4623

4623 sentences, i.e. 4623 lists of tuples, to be divided into:

* 4523 sentences (training)
* 100 sentences (test)

In [None]:
brown_train = brown_news_tagged[100:]
brown_test = brown_news_tagged[:100]

from nltk.tag import untag
test_sent = untag(brown_test[0])
print("Tagged: ", brown_test[0])
print()
print("Untagged: ", test_sent)

Tagged:  [('The', 'DET'), ('Fulton', 'NOUN'), ('County', 'NOUN'), ('Grand', 'ADJ'), ('Jury', 'NOUN'), ('said', 'VERB'), ('Friday', 'NOUN'), ('an', 'DET'), ('investigation', 'NOUN'), ('of', 'ADP'), ("Atlanta's", 'NOUN'), ('recent', 'ADJ'), ('primary', 'NOUN'), ('election', 'NOUN'), ('produced', 'VERB'), ('``', '.'), ('no', 'DET'), ('evidence', 'NOUN'), ("''", '.'), ('that', 'ADP'), ('any', 'DET'), ('irregularities', 'NOUN'), ('took', 'VERB'), ('place', 'NOUN'), ('.', '.')]

Untagged:  ['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.']


## 1. Baseline Tagger: Default Tag

In the absence of any knowledge, the most basic tagging approach is to assign the same tag to all the words.
It can be done with the `DefaultTagger` class, which takes a tag and assigns it to all the words.

In [None]:
from nltk import DefaultTagger

default_tagger = DefaultTagger('default_tag')
default_tagger.tag('This is a test'.split())

[('This', 'default_tag'),
 ('is', 'default_tag'),
 ('a', 'default_tag'),
 ('test', 'default_tag')]

### Exercise 1.1

Using the `DefaultTagger`, try different tags (see the available options in the table at the beginning of the notebook).

**Which one offers the best performance? Why?**

To measure success in this task, we will measure accuracy. Tagger objects in NLTK include a method called `evaluate` that measures the accuracy of a tagger on a given test set (our `brown_test` object). In recent NLTK releases, `evaluate` is deprecated in favor of the equivalent `accuracy` method.


Let's try different tags:

In [None]:
universal_tags = ["ADJ",
                  "ADV",
                  "CONJ",
                  "DET",
                  "X",
                  "NOUN",
                  "PROPN",
                  "NUM",
                  "PRON",
                  "ADP",
                  "AUX",
                  "INTJ",
                  "VERB",
                  "PART",
                  "SCONJ",
                  "SYM"]

In [None]:
tagger_acc = list()

for ut in universal_tags:
    dt = nltk.DefaultTagger(ut)
    tagger_acc.append((ut, dt.evaluate(brown_test)))

In [None]:
sorted(tagger_acc, key=lambda x:x[1], reverse=True)

[('NOUN', 0.31790123456790126),
 ('VERB', 0.1693121693121693),
 ('DET', 0.12301587301587301),
 ('ADP', 0.1128747795414462),
 ('ADJ', 0.05291005291005291),
 ('ADV', 0.025573192239858905),
 ('PRON', 0.021164021164021163),
 ('CONJ', 0.02072310405643739),
 ('NUM', 0.017195767195767195),
 ('X', 0.0),
 ('PROPN', 0.0),
 ('AUX', 0.0),
 ('INTJ', 0.0),
 ('PART', 0.0),
 ('SCONJ', 0.0),
 ('SYM', 0.0)]

In [None]:
tagger_acc = sorted(tagger_acc, key=lambda x:x[1], reverse=True)

print("'{0}' is the tag with the best performance, \
attaining an accuracy score of {1:.2f}%".format(tagger_acc[0][0],
                                                tagger_acc[0][1] * 100))

'NOUN' is the tag with the best performance, attaining an accuracy score of 31.79%


The performance of each tag simply corresponds to its relative frequency in the test set:

In [None]:
test_tags = [tup[1] for sentence in brown_test for tup in sentence]

In [None]:
dict(nltk.FreqDist(test_tags))

{'DET': 279,
 'NOUN': 721,
 'ADJ': 120,
 'VERB': 384,
 'ADP': 256,
 '.': 259,
 'ADV': 58,
 'CONJ': 47,
 'PRT': 57,
 'PRON': 48,
 'NUM': 39}

In [None]:
test_tags_freq = dict(nltk.FreqDist(test_tags))
total_freq = sum(test_tags_freq.values())

In [None]:
test_tags_freq.update({k:round(test_tags_freq[k]*100/total_freq,2) for k in test_tags_freq.keys()})

In [None]:
test_tags_freq

{'DET': 12.3,
 'NOUN': 31.79,
 'ADJ': 5.29,
 'VERB': 16.93,
 'ADP': 11.29,
 '.': 11.42,
 'ADV': 2.56,
 'CONJ': 2.07,
 'PRT': 2.51,
 'PRON': 2.12,
 'NUM': 1.72}

Note that the relative frequency of the `'NOUN'` tag in the test set matches the accuracy of the `NOUN` default tagger (31.79%).
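
The arithmetic behind this observation can be checked directly from the tag counts printed above: a tagger that always answers `NOUN` is right exactly on the `NOUN` tokens. A small sketch, in which the variable and function names are illustrative:

```python
# Tag counts of the 100 test sentences, copied from the FreqDist output above.
test_tag_counts = {
    'DET': 279, 'NOUN': 721, 'ADJ': 120, 'VERB': 384, 'ADP': 256, '.': 259,
    'ADV': 58, 'CONJ': 47, 'PRT': 57, 'PRON': 48, 'NUM': 39,
}
total_tokens = sum(test_tag_counts.values())

def constant_tagger_accuracy(tag):
    """Accuracy of a tagger that assigns `tag` to every token."""
    return test_tag_counts.get(tag, 0) / total_tokens

print(total_tokens)                                       # 2268
print(round(constant_tagger_accuracy('NOUN') * 100, 2))   # 31.79
print(round(constant_tagger_accuracy('PROPN') * 100, 2))  # 0.0
```

The same computation explains the zero scores for `PROPN`, `AUX`, `INTJ`, `PART`, `SCONJ` and `SYM`: these tags never occur in the test set.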

Before moving on, let us inspect the steps implemented inside the `evaluate` method source code:

        def evaluate(self, gold):
            tagged_sents = self.tag_sents(untag(sent) for sent in gold)
            gold_tokens = list(chain(*gold))
            test_tokens = list(chain(*tagged_sents))
            return accuracy(gold_tokens, test_tokens)
            
        def accuracy(reference, test):
            if len(reference) != len(test):
                raise ValueError("Lists must have the same length.")
            return sum(x == y for x, y in zip(reference, test)) / len(test)
            
        class nltk.chain
        chain(*iterables) --> chain object

        Return a chain object whose .next() method returns elements from the first iterable until it is
        exhausted, then elements from the next iterable, until all of the iterables are exhausted.

**Based on the above, define a function** `retrieve_incorrect_tags` **that returns a frequency distribution over the (reference tag, assigned tag) pairs of the words that were assigned a tag different from the reference:**

In [None]:
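from itertools import chain

def retrieve_incorrect_tags(reference_tags, test_tagger):
    """Count (reference tag, assigned tag) pairs for every mis-tagged word."""
    # Flatten the gold sentences and the re-tagged sentences into token lists
    reference = list(chain.from_iterable(reference_tags))
    tagged_sents = test_tagger.tag_sents(untag(sent) for sent in reference_tags)
    test = list(chain.from_iterable(tagged_sents))

    if len(reference) != len(test):
        raise ValueError("Lists must have the same length.")

    # Keep only the (word, tag) pairs where gold and predicted tags differ
    reftest_wordtags = [(r, t) for r, t in zip(reference, test) if r != t]

    # Drop the words and keep the (gold tag, predicted tag) pairs
    reftest_tags = [(r[1], t[1]) for r, t in reftest_wordtags]

    return nltk.FreqDist(reftest_tags)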

In [None]:
retrieve_incorrect_tags(brown_test, nltk.DefaultTagger('NOUN')).most_common(10)

[(('VERB', 'NOUN'), 384),
 (('DET', 'NOUN'), 279),
 (('.', 'NOUN'), 259),
 (('ADP', 'NOUN'), 256),
 (('ADJ', 'NOUN'), 120),
 (('ADV', 'NOUN'), 58),
 (('PRT', 'NOUN'), 57),
 (('PRON', 'NOUN'), 48),
 (('CONJ', 'NOUN'), 47),
 (('NUM', 'NOUN'), 39)]

## 2. Sources of Knowledge to Improve Tagging Accuracy

Intuitively, the sources of knowledge that can help us decide what is the tag of a word include:
- A dictionary that lists the possible parts of speech for each word
- The context of the word in a sentence (neighboring words)
- The morphological form of the word (suffixes, prefixes)


### 2.1 Lookup Tagger: Using Dictionary Knowledge

Assume we have a dictionary that lists the possible tags for each word in English. Could we use this information to perform better tagging?

The intuition is that we would only assign to a word a tag that it can have in the dictionary. For example, if `box` can only be a `Verb` or a `Noun`, when we have to tag an instance of the word `box`, we only choose between 2 options - and not between 17 options.

There are 3 issues we must address to turn this into working code:

- Where do we get the dictionary?
- How do we choose between the various tags associated with a word in the dictionary? (For example, how do we choose between `VERB` and `NOUN` for `box`?)
- What do we do for words that do not appear in the dictionary?

The simple solutions we will test are the following (for each question, other strategies exist that we will investigate later):

- Where do we get the dictionary? We will learn it from a sample dataset.
- How do we choose between the various tags associated with a word in the dictionary? We will choose the most likely tag as observed in the sample dataset.
- What do we do for words that do not appear in the dictionary? We will pass unknown words to a backoff tagger (tag all unknown words as `NOUN`).
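
To make these three choices concrete, here is a minimal plain-Python sketch of a lookup tagger. It is not how NLTK implements its taggers; the toy training sentences and the `train_lookup` and `lookup_tag` names are assumptions made for illustration:

```python
from collections import Counter, defaultdict

# Toy training data (illustrative, not from the Brown Corpus).
train_sents = [
    [("the", "DET"), ("box", "NOUN"), ("fell", "VERB")],
    [("they", "PRON"), ("box", "VERB"), ("daily", "ADV")],
    [("a", "DET"), ("box", "NOUN"), ("arrived", "VERB")],
]

def train_lookup(tagged_sents):
    """Learn the dictionary: map each word to its most frequent tag."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}

def lookup_tag(words, table, default="NOUN"):
    """Tag known words from the table; unknown words get the default tag."""
    return [(w, table.get(w, default)) for w in words]

table = train_lookup(train_sents)
print(lookup_tag(["the", "box", "vanished"], table))
# [('the', 'DET'), ('box', 'NOUN'), ('vanished', 'NOUN')]
```

Here `box` is seen twice as `NOUN` and once as `VERB`, so the table stores `NOUN`, and the unseen word `vanished` falls through to the default.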

The `nltk.UnigramTagger` implements this overall strategy. It must be trained on a dataset, from which it builds a model of "unigrams". The following code shows how it is used:

### Exercise 2.1.1

Use the `UnigramTagger` class and the `brown_train` object to create a unigram tagger.

**Which tag does it select to annotate each word?**

**What's happening with unknown words?**

In [None]:
from nltk import UnigramTagger

In [None]:
ungTagger = UnigramTagger(train = brown_train)

In [None]:
ungTagger.tag(test_sent)

[('The', 'DET'),
 ('Fulton', None),
 ('County', 'NOUN'),
 ('Grand', 'ADJ'),
 ('Jury', 'NOUN'),
 ('said', 'VERB'),
 ('Friday', 'NOUN'),
 ('an', 'DET'),
 ('investigation', 'NOUN'),
 ('of', 'ADP'),
 ("Atlanta's", 'NOUN'),
 ('recent', 'ADJ'),
 ('primary', 'NOUN'),
 ('election', 'NOUN'),
 ('produced', 'VERB'),
 ('``', '.'),
 ('no', 'DET'),
 ('evidence', 'NOUN'),
 ("''", '.'),
 ('that', 'ADP'),
 ('any', 'DET'),
 ('irregularities', None),
 ('took', 'VERB'),
 ('place', 'NOUN'),
 ('.', '.')]

Each word of the training set is considered individually (constituting its own context), and in the presence of multiple tags, the most frequent one is assigned to it if encountered in the test set. Take `that` as an illustrative example:

In [None]:
train_tags_that = [tup[1] for sentence in brown_train for tup in sentence if tup[0].lower() == 'that']

In [None]:
nltk.FreqDist(train_tags_that)

FreqDist({'ADP': 520, 'DET': 150, 'PRON': 126, 'ADV': 5})

In [None]:
nltk.FreqDist(train_tags_that).most_common(1)[0][0]

'ADP'

Amongst the four different POS tags assigned to `that` in the training set, `'ADP'` is the most frequent one. Hence, this is the tag assigned when `that` is encountered in `test_sent`. Following this approach iteratively for all words in `test_sent` should reproduce the tag list returned by the `UnigramTagger` for that sentence:

In [None]:
ungTagger_list = list()

for w in test_sent:
    train_tags_w = [tup[1] for sentence in brown_train for tup in sentence if tup[0] == w]
    if not train_tags_w:
        ungTagger_list.append((w,None))
        continue
    
    ungTagger_list.append((w, nltk.FreqDist(train_tags_w).most_common(1)[0][0]))

In [None]:
ungTagger_list

[('The', 'DET'),
 ('Fulton', None),
 ('County', 'NOUN'),
 ('Grand', 'ADJ'),
 ('Jury', 'NOUN'),
 ('said', 'VERB'),
 ('Friday', 'NOUN'),
 ('an', 'DET'),
 ('investigation', 'NOUN'),
 ('of', 'ADP'),
 ("Atlanta's", 'NOUN'),
 ('recent', 'ADJ'),
 ('primary', 'NOUN'),
 ('election', 'NOUN'),
 ('produced', 'VERB'),
 ('``', '.'),
 ('no', 'DET'),
 ('evidence', 'NOUN'),
 ("''", '.'),
 ('that', 'ADP'),
 ('any', 'DET'),
 ('irregularities', None),
 ('took', 'VERB'),
 ('place', 'NOUN'),
 ('.', '.')]

Note that this list coincides with `ungTagger.tag(test_sent)`, as expected. Words that are encountered in the test set without precedents in the training set are tagged as `None`.

### Exercise 2.1.2

Use the `evaluate` method to measure how successful this tagger is.

**Are we improving the performance of the tagger?**
**Do you find the new performance sufficient for an NLP system?**

In [None]:
round(ungTagger.evaluate(brown_test) * 100, 2)

88.89

There is a significant improvement compared to the baseline accuracy score, from 31.79% to 88.89%, though it is still insufficient for most NLP application purposes (1 in every 10 words is incorrectly tagged on average).

In [None]:
retrieve_incorrect_tags(brown_test, ungTagger).most_common(21)

[(('NOUN', None), 127),
 (('VERB', None), 23),
 (('VERB', 'NOUN'), 20),
 (('ADJ', None), 18),
 (('ADP', 'PRT'), 14),
 (('NOUN', 'VERB'), 10),
 (('NOUN', 'ADJ'), 7),
 (('ADV', 'ADJ'), 6),
 (('ADJ', 'NOUN'), 6),
 (('NUM', None), 5),
 (('ADV', None), 3),
 (('PRON', 'ADP'), 2),
 (('ADV', 'ADP'), 2),
 (('ADV', 'PRT'), 2),
 (('ADP', 'VERB'), 1),
 (('ADP', 'ADV'), 1),
 (('ADJ', 'VERB'), 1),
 (('ADP', None), 1),
 (('ADV', 'DET'), 1),
 (('VERB', 'ADV'), 1),
 (('VERB', 'ADP'), 1)]

### Exercise 2.1.3

If we analyze the tagger annotation, we will see that it assigns `None` to unknown words. A good way to improve this is to tag unknown words as `NOUN` (the most common tag). This is known as a backoff tagger (i.e., a second tagger that applies where the original one cannot identify the tag for a word).

NLTK provides a simple way to implement this backoff tagging. All the constructors for the Tagger classes (e.g., `UnigramTagger`) have a parameter `backoff` where you can set the backoff tagger that will apply. In this case, our backoff tagger will be the `DefaultTagger` that annotates `NOUN`, which we developed in Exercise 1.1 above.
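
The backoff idea itself is simple enough to sketch without NLTK: each tagger in a chain either returns a tag or `None`, and the first answer wins. The function names and the toy dictionary below are hypothetical and only illustrate the control flow, not NLTK's internals:

```python
def make_lookup(table):
    """Return a tagger function that knows only the words in `table`."""
    return lambda word: table.get(word)  # None when the word is unknown

def make_default(tag):
    """Return a tagger function that always answers `tag`."""
    return lambda word: tag

def tag_with_backoff(words, taggers):
    """Ask each tagger in order; keep the first non-None answer."""
    tagged = []
    for word in words:
        tag = None
        for tagger in taggers:
            tag = tagger(word)
            if tag is not None:
                break
        tagged.append((word, tag))
    return tagged

lookup = make_lookup({"the": "DET", "said": "VERB"})  # toy dictionary
with_backoff = [lookup, make_default("NOUN")]

print(tag_with_backoff(["the", "jury", "said"], [lookup]))
# [('the', 'DET'), ('jury', None), ('said', 'VERB')]
print(tag_with_backoff(["the", "jury", "said"], with_backoff))
# [('the', 'DET'), ('jury', 'NOUN'), ('said', 'VERB')]
```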

**Using the `DefaultTagger` and `UnigramTagger` classes, create a tagger that assigns the most common tag to each word and, for unknown words, assigns a backoff tag of `NOUN`.**

**What's the accuracy of this tagger? Did we improve our performance?**

In [None]:
ungTaggerBckoff = UnigramTagger(train = brown_train, backoff = nltk.DefaultTagger('NOUN'))

In [None]:
round(ungTaggerBckoff.evaluate(brown_test) * 100, 2)

94.49

Accuracy improved from 88.89% to 94.49% by adding the backoff tagger.

In [None]:
retrieve_incorrect_tags(brown_test, ungTaggerBckoff).most_common(18)

[(('VERB', 'NOUN'), 43),
 (('ADJ', 'NOUN'), 24),
 (('ADP', 'PRT'), 14),
 (('NOUN', 'VERB'), 10),
 (('NOUN', 'ADJ'), 7),
 (('ADV', 'ADJ'), 6),
 (('NUM', 'NOUN'), 5),
 (('ADV', 'NOUN'), 3),
 (('PRON', 'ADP'), 2),
 (('ADV', 'ADP'), 2),
 (('ADV', 'PRT'), 2),
 (('ADP', 'VERB'), 1),
 (('ADP', 'ADV'), 1),
 (('ADJ', 'VERB'), 1),
 (('ADP', 'NOUN'), 1),
 (('ADV', 'DET'), 1),
 (('VERB', 'ADV'), 1),
 (('VERB', 'ADP'), 1)]

### 2.2 Using Morphological Clues

As mentioned above, another knowledge source to perform tagging is to look at the letter structure of the words. 
We will look at 2 different methods to use this knowledge. 
First, we will use `nltk.RegexpTagger` to recognize specific regular expressions in words.

In [None]:
from nltk import RegexpTagger

regexp_tagger = RegexpTagger(
     [(r'^-?[0-9]+(.[0-9]+)?$', 'NUM'),   # cardinal numbers
      (r'(The|the|A|a|An|an)$', 'DET'),   # articles
      (r'.*able$', 'ADJ'),                # adjectives
      (r'.*ness$', 'NOUN'),               # nouns formed from adjectives
      (r'.*ly$', 'ADV'),                  # adverbs
      (r'.*s$', 'NOUN'),                  # plural nouns
      (r'.*ing$', 'VERB'),                # gerunds
      (r'.*ed$', 'VERB'),                 # past tense verbs
      (r'.*', 'NOUN')                     # nouns (default)
])

print('Regexp accuracy %4.1f%%' % (100.0 * regexp_tagger.evaluate(brown_test)))

Regexp accuracy 48.2%


The regular expressions are tested in order. The first one that matches decides the tag; otherwise, the next expression is tried.
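
The first-match-wins behaviour, and why the order of the rules matters, can be sketched with the standard `re` module. This is a simplified illustration, assuming patterns are matched from the start of the word; `regexp_tag` is a hypothetical helper, not part of NLTK:

```python
import re

def regexp_tag(word, patterns):
    """Return the tag of the first pattern that matches `word` from its start."""
    for pattern, tag in patterns:
        if re.match(pattern, word):
            return tag
    return None

patterns = [
    (r'.*ing$', 'VERB'),  # gerunds
    (r'.*ly$', 'ADV'),    # adverbs
    (r'.*', 'NOUN'),      # catch-all default, must come last
]

words = ['running', 'quickly', 'table']
in_order = [regexp_tag(w, patterns) for w in words]
print(in_order)  # ['VERB', 'ADV', 'NOUN']

# Moving the catch-all rule to the front shadows every other rule.
shadowed_patterns = [patterns[-1]] + patterns[:-1]
shadowed = [regexp_tag(w, shadowed_patterns) for w in words]
print(shadowed)  # ['NOUN', 'NOUN', 'NOUN']
```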


In [None]:
retrieve_incorrect_tags(brown_test, regexp_tagger).most_common(16)

[(('VERB', 'NOUN'), 259),
 (('.', 'NOUN'), 259),
 (('ADP', 'NOUN'), 252),
 (('ADJ', 'NOUN'), 111),
 (('DET', 'NOUN'), 63),
 (('PRT', 'NOUN'), 57),
 (('PRON', 'NOUN'), 48),
 (('CONJ', 'NOUN'), 47),
 (('ADV', 'NOUN'), 45),
 (('NUM', 'NOUN'), 16),
 (('NOUN', 'VERB'), 6),
 (('ADP', 'VERB'), 4),
 (('ADJ', 'ADV'), 3),
 (('ADJ', 'VERB'), 2),
 (('VERB', 'ADJ'), 1),
 (('NOUN', 'ADV'), 1)]

The questions we face when we see such a "rule-based" tagger are:

- How do we find the most successful regular expressions?
- In which order should we try the regular expressions?

A typical answer to such questions is: 

- let's learn these parameters from a training corpus. 

The `nltk.AffixTagger` is a trainable tagger that attempts to learn word patterns. 
It only looks at the last letters in the words in the training corpus, and counts how often a word suffix 
can predict the word tag. 
In other words, we only learn rules of the form ('.*xyz' , POS). 
This is how the affix tagger is used:

In [None]:
from nltk import AffixTagger

affix_tagger = AffixTagger(brown_train, backoff=DefaultTagger('NOUN'))
print('Affix tagger accuracy: %4.2f%%' % (100.0 * affix_tagger.evaluate(brown_test)))

Affix tagger accuracy: 42.28%
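
To make the idea of learned suffix rules concrete, the following plain-Python sketch counts, for a toy word list, which tag each three-letter suffix most often predicts. It is a simplification and not NLTK's implementation; the data and the `suffix_len` and `min_stem_len` parameter names are assumptions for illustration:

```python
from collections import Counter, defaultdict

# Toy training words (illustrative, not corpus data).
train_words = [
    ("quickly", "ADV"), ("slowly", "ADV"), ("family", "NOUN"),
    ("walking", "VERB"), ("talking", "VERB"), ("morning", "NOUN"),
    ("running", "VERB"),
]

def train_suffix_rules(tagged_words, suffix_len=3, min_stem_len=2):
    """Map each suffix to the tag it most often co-occurs with."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        # Skip words too short to leave a stem before the suffix.
        if len(word) < suffix_len + min_stem_len:
            continue
        counts[word[-suffix_len:]][tag] += 1
    return {suffix: c.most_common(1)[0][0] for suffix, c in counts.items()}

def suffix_tag(word, rules, suffix_len=3, default="NOUN"):
    """Apply the learned suffix rule, or fall back to the default tag."""
    return rules.get(word[-suffix_len:], default)

rules = train_suffix_rules(train_words)
print(rules)
# {'kly': 'ADV', 'wly': 'ADV', 'ily': 'NOUN', 'ing': 'VERB'}
print(suffix_tag("jumping", rules), suffix_tag("badly", rules))
# VERB NOUN
```

In this toy run the two `-ly` adverbs end up under different keys (`kly`, `wly`), so neither rule generalizes to a new adverb such as `badly`, which falls back to the default tag.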


In [None]:
retrieve_incorrect_tags(brown_test, affix_tagger).most_common(26)

[(('DET', 'NOUN'), 260),
 (('.', 'NOUN'), 259),
 (('ADP', 'NOUN'), 239),
 (('VERB', 'NOUN'), 221),
 (('ADJ', 'NOUN'), 48),
 (('PRT', 'NOUN'), 48),
 (('CONJ', 'NOUN'), 47),
 (('PRON', 'NOUN'), 45),
 (('NUM', 'NOUN'), 38),
 (('ADV', 'NOUN'), 35),
 (('NOUN', 'ADJ'), 20),
 (('NOUN', 'VERB'), 10),
 (('NOUN', 'ADV'), 6),
 (('ADJ', 'ADV'), 5),
 (('ADP', 'ADJ'), 5),
 (('ADP', 'VERB'), 4),
 (('ADV', 'ADJ'), 3),
 (('ADJ', 'VERB'), 3),
 (('ADV', 'PRT'), 3),
 (('ADV', 'VERB'), 2),
 (('VERB', 'ADJ'), 2),
 (('VERB', 'ADP'), 2),
 (('NOUN', 'ADP'), 1),
 (('NOUN', 'DET'), 1),
 (('ADJ', 'ADP'), 1),
 (('ADV', 'ADP'), 1)]

Should we be disappointed that the "data-based approach" performs worse than the hand-written rules (42% vs. 48%)? 

Not necessarily: note that our hand-written rules include cases that the AffixTagger cannot learn - we match cardinal numbers and suffixes with more than 3 letters. 

Let us see whether the combination of the 2 taggers helps.

### Exercise 2.2.1

**Using the `AffixTagger` class, create a tagger that learns from word patterns and that uses the previous `RegexpTagger` as backoff.**

**Evaluate and analyze the performance of the model**

In [None]:
afTaggerBckoff = AffixTagger(brown_train, backoff=regexp_tagger)

In [None]:
round(afTaggerBckoff.evaluate(brown_test) * 100, 2)

52.87

The accuracy score of the `AffixTagger` improves by about 10 percentage points (from 42.28% to 52.87%) by setting the `RegexpTagger` as backoff instead of `DefaultTagger('NOUN')`. Note that the regular expression tagger still resorts to `'NOUN'` as default tag when no pattern match is found.

In [None]:
retrieve_incorrect_tags(brown_test, afTaggerBckoff).most_common(26)

[(('.', 'NOUN'), 259),
 (('ADP', 'NOUN'), 239),
 (('VERB', 'NOUN'), 221),
 (('ADJ', 'NOUN'), 48),
 (('PRT', 'NOUN'), 48),
 (('CONJ', 'NOUN'), 47),
 (('PRON', 'NOUN'), 45),
 (('DET', 'NOUN'), 44),
 (('ADV', 'NOUN'), 34),
 (('NOUN', 'ADJ'), 20),
 (('NUM', 'NOUN'), 15),
 (('NOUN', 'VERB'), 10),
 (('NOUN', 'ADV'), 6),
 (('ADJ', 'ADV'), 5),
 (('ADP', 'ADJ'), 5),
 (('ADP', 'VERB'), 4),
 (('ADV', 'ADJ'), 3),
 (('ADJ', 'VERB'), 3),
 (('ADV', 'PRT'), 3),
 (('ADV', 'VERB'), 2),
 (('VERB', 'ADJ'), 2),
 (('VERB', 'ADP'), 2),
 (('NOUN', 'ADP'), 1),
 (('NOUN', 'DET'), 1),
 (('ADJ', 'ADP'), 1),
 (('ADV', 'ADP'), 1)]

### Exercise 2.2.2

In the previous exercise we created an `AffixTagger` that is able to learn the annotation from the word patterns. Perhaps, we could apply this tagger to annotate the unknown words (i.e., to use it as a backoff tagger). In the previous section, we used a NOUN-default tagger for that. How much does this tagger help the `UnigramTagger` if we use it as a backoff instead of the NOUN-default tagger?

**Use the `AffixTagger` that we created above as a backoff tagger for the `UnigramTagger` from the previous section.**

**Are we improving our tagger?**

In [None]:
ungTaggerBckoff2 = UnigramTagger(train = brown_train, backoff = afTaggerBckoff)

In [None]:
round(ungTaggerBckoff2.evaluate(brown_test) * 100, 2)

95.41

By changing the backoff tagger from `DefaultTagger('NOUN')` to the `AffixTagger` (itself backed off to the `RegexpTagger`), the accuracy score of the `UnigramTagger` improves by about 1 percentage point (from 94.49% to 95.41%).

In [None]:
retrieve_incorrect_tags(brown_test, ungTaggerBckoff2).most_common(18)

[(('VERB', 'NOUN'), 27),
 (('NOUN', 'VERB'), 14),
 (('ADP', 'PRT'), 14),
 (('ADJ', 'NOUN'), 11),
 (('NOUN', 'ADJ'), 10),
 (('ADV', 'ADJ'), 6),
 (('NOUN', 'ADV'), 4),
 (('VERB', 'ADP'), 3),
 (('ADP', 'VERB'), 2),
 (('ADJ', 'VERB'), 2),
 (('PRON', 'ADP'), 2),
 (('ADV', 'ADP'), 2),
 (('ADV', 'PRT'), 2),
 (('ADP', 'ADV'), 1),
 (('VERB', 'ADJ'), 1),
 (('ADV', 'NOUN'), 1),
 (('ADV', 'DET'), 1),
 (('VERB', 'ADV'), 1)]

### 2.3 Looking at the Context

At this point, we have combined 2 major sources of information, dictionary and morphology, and obtained about 95.4% accuracy. The last source of knowledge we want to exploit is the context of the word to be tagged: **the words that appear around the word to be tagged**. 

The intuition is that if we have to decide between `book` as a verb or a noun, the word/s preceding `book` can give us strong cues: for example, if it is an article (`the` or `a`) then we would be sure that `book` is a noun; if it is `to`, then we would be sure it is a verb.

How can we turn this intuition into working code? The easiest way to detect predictive contexts is to construct a list of contexts - and for each context, keep track of the distribution of tags that follow it. Luckily for us, this procedure is already implemented into the `NgramTagger`, which takes as parameter a number setting the length of the context.

As usual, if the tagger cannot make a decision (because the observed context was never seen at training time), 
the decision is delegated to a backoff tagger.
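
The procedure described above can be sketched in plain Python for the bigram case. The sketch below is a simplified illustration on toy data, not NLTK's implementation: it uses the previous tag together with the current word as context, then backs off to a unigram table and finally to a default tag. All names in it are hypothetical:

```python
from collections import Counter, defaultdict

# Toy training data (illustrative, not from the Brown Corpus).
train_sents = [
    [("the", "DET"), ("book", "NOUN"), ("sold", "VERB")],
    [("a", "DET"), ("book", "NOUN"), ("fell", "VERB")],
    [("to", "PRT"), ("book", "VERB"), ("a", "DET"), ("flight", "NOUN")],
    [("the", "DET"), ("flight", "NOUN"), ("left", "VERB")],
]

unigram = defaultdict(Counter)   # word -> tag counts
bigram = defaultdict(Counter)    # (previous tag, word) -> tag counts
for sent in train_sents:
    prev_tag = None              # None marks the start of a sentence
    for word, tag in sent:
        unigram[word][tag] += 1
        bigram[(prev_tag, word)][tag] += 1
        prev_tag = tag

def best(counter):
    """Most frequent tag in a Counter, or None if it is empty."""
    return counter.most_common(1)[0][0] if counter else None

def bigram_tag(words, default="NOUN"):
    """Tag left to right: bigram context first, then unigram, then default."""
    tagged, prev_tag = [], None
    for word in words:
        tag = (best(bigram.get((prev_tag, word), Counter()))
               or best(unigram.get(word, Counter()))
               or default)
        tagged.append((word, tag))
        prev_tag = tag
    return tagged

print(bigram_tag(["the", "book"]))  # [('the', 'DET'), ('book', 'NOUN')]
print(bigram_tag(["to", "book"]))   # [('to', 'PRT'), ('book', 'VERB')]
```

On its own, `book` would be tagged `NOUN` by the unigram table (two `NOUN` occurrences against one `VERB`); the preceding `to` is what flips the decision to `VERB`.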

### Exercise 2.3.1

**Use the `NgramTagger` to create a context-based tagger. For the cases that this tagger cannot annotate anything, use the previous `UnigramTagger` as backoff.**

**Try different context sizes (you can set that as a parameter when you create the `NgramTagger`) and analyze how it affects the final performance of the model.**

In [None]:
from nltk import NgramTagger

Initially try with a bigram tagger as `NgramTagger(n=2)`:

In [None]:
n2Tagger = NgramTagger(2, brown_train, backoff=ungTaggerBckoff2)

In [None]:
round(n2Tagger.evaluate(brown_test) * 100, 2)

96.12

In [None]:
retrieve_incorrect_tags(brown_test, n2Tagger).most_common(20)

[(('VERB', 'NOUN'), 15),
 (('NOUN', 'VERB'), 12),
 (('ADP', 'PRT'), 12),
 (('ADJ', 'NOUN'), 11),
 (('NOUN', 'ADJ'), 7),
 (('NOUN', 'ADV'), 4),
 (('ADV', 'ADJ'), 4),
 (('PRT', 'ADP'), 3),
 (('VERB', 'ADP'), 3),
 (('ADP', 'VERB'), 2),
 (('ADJ', 'VERB'), 2),
 (('ADP', 'ADV'), 2),
 (('PRON', 'ADP'), 2),
 (('ADV', 'ADP'), 2),
 (('ADV', 'PRT'), 2),
 (('ADJ', 'X'), 1),
 (('VERB', 'ADJ'), 1),
 (('ADV', 'NOUN'), 1),
 (('ADJ', 'ADV'), 1),
 (('ADV', 'DET'), 1)]

Looping over a range of values (2, 3, 4, 5, 6) for the size of the n-grams:

In [None]:
nTagger_acc_list = list()

for n in range(2,7):
    nTagger = NgramTagger(n, brown_train, backoff=ungTaggerBckoff2)
    nTagger_acc_list.append((n, round(nTagger.evaluate(brown_test) * 100, 2)))

In [None]:
nTagger_acc_list

[(2, 96.12), (3, 95.81), (4, 95.19), (5, 94.89), (6, 94.93)]

The accuracy score of the `NgramTagger` drops as the size of the n-grams grows, to the point that the 4-gram tagger (95.19%) performs worse than the `UnigramTagger` set as backoff (95.41%). This is expected: larger n-grams carry more information, but each specific context is also less likely to have been seen in the training set, so the statistics behind each decision become sparser and less reliable.
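
The sparsity effect can be illustrated on toy data by measuring which fraction of the n-grams in a test sentence was also seen in training. This sketch uses plain word n-grams for simplicity, which is an assumption of the example (the `NgramTagger` context combines preceding tags with the current word); the sentences and names are made up:

```python
def ngrams(tokens, n):
    """All contiguous n-token windows of `tokens`, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

train_tokens = "the cat sat on the mat and the dog sat on the rug".split()
test_tokens = "the dog sat on the cat".split()

coverage_by_n = {}
for n in range(1, 5):
    seen = set(ngrams(train_tokens, n))
    test_grams = ngrams(test_tokens, n)
    # Fraction of test n-grams that also occur in the training tokens.
    coverage = sum(g in seen for g in test_grams) / len(test_grams)
    coverage_by_n[n] = round(coverage, 2)

print(coverage_by_n)  # {1: 1.0, 2: 1.0, 3: 0.75, 4: 0.67}
```

Every word and every word pair of the test sentence was seen in training, but coverage already falls for trigrams and 4-grams, even in this tiny example.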

## Summary

This practice introduced tools to tag parts of speech in free text. The key point of the approach we investigated is that it is **data-driven**:

- We first define possible knowledge sources that can help us solve the task. Specifically, we investigated the following as possible sources:
  * dictionary
  * morphology
  * context

- We tested simple machine learning methods: data is acquired by inspecting a training dataset, then evaluated by testing on a test dataset.

- We investigated one method to combine several systems into a combined system: backoff models.


# Additional Materials: Practical Tagging

In this practice we have played with the development of new Taggers. You can refer back to this Notebook if and when you need to create your own Taggers. Nevertheless, most of the time the Taggers already included in the different libraries will do the trick for you.

In particular, NLTK provides you a way to tag your dataset with just a couple of lines of code by using the `pos_tag` function.

The first thing we need to do is to tokenize the sentence to be tagged. To that end, we can make use of the `word_tokenize` function in NLTK.

In [None]:
text = nltk.word_tokenize("And now for something completely different")
text

['And', 'now', 'for', 'something', 'completely', 'different']

Then we feed our tokenized text to the POS tagging function:

In [None]:
nltk.pos_tag(text)

[('And', 'CC'),
 ('now', 'RB'),
 ('for', 'IN'),
 ('something', 'NN'),
 ('completely', 'RB'),
 ('different', 'JJ')]
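
The tags returned here (`CC`, `RB`, `IN`, ...) belong to the Penn Treebank tagset, not to the universal tagset used in the rest of this notebook. `nltk.pos_tag` accepts a `tagset='universal'` argument to convert them. The sketch below shows the idea with a small hand-written mapping that covers only the tags in the output above; the `PENN_TO_UNIVERSAL` dictionary is an illustrative assumption, not NLTK's full mapping table:

```python
# Partial Penn Treebank -> universal mapping (illustrative subset only).
PENN_TO_UNIVERSAL = {
    'CC': 'CONJ', 'RB': 'ADV', 'IN': 'ADP', 'NN': 'NOUN', 'JJ': 'ADJ',
}

# Output of nltk.pos_tag(text), copied from the cell above.
penn_tagged = [('And', 'CC'), ('now', 'RB'), ('for', 'IN'),
               ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ')]

# Penn tags missing from the partial mapping fall back to 'X' in this sketch.
universal_tagged = [(w, PENN_TO_UNIVERSAL.get(t, 'X')) for w, t in penn_tagged]
print(universal_tagged)
# [('And', 'CONJ'), ('now', 'ADV'), ('for', 'ADP'), ('something', 'NOUN'), ('completely', 'ADV'), ('different', 'ADJ')]
```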

If we have more than one sentence to parse, we can make use of one of the sentence tokenizers that NLTK provides (e.g. `sent_tokenize`) to split the text into sentences, and then the word tokenizer to split each sentence into words.

In [None]:
sentences = nltk.sent_tokenize("And now for something completely different. This is just another sentence")
print("Sentences:", sentences)

Sentences: ['And now for something completely different.', 'This is just another sentence']


In [None]:
text = [nltk.word_tokenize(sentence) for sentence in sentences]
print("Text:", text)
print()

Text: [['And', 'now', 'for', 'something', 'completely', 'different', '.'], ['This', 'is', 'just', 'another', 'sentence']]



In [None]:
for tagging in [nltk.pos_tag(t) for t in text]:
    print("Tagging:",tagging)

Tagging: [('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ'), ('.', '.')]
Tagging: [('This', 'DT'), ('is', 'VBZ'), ('just', 'RB'), ('another', 'DT'), ('sentence', 'NN')]


Let's see a full example with a proper corpus. NLTK provides many corpora that can be used for research or for the training of our NLP system. To find a comprehensive list of all the corpora and how to use them, please refer to the [2nd Chapter of the NLTK book](https://www.nltk.org/book/ch02.html).

We will use the corpus `state_union` including the texts of the State of the Union addresses since 1945. Let's load one of these speeches.

In [None]:
from nltk.corpus import state_union

text = state_union.raw("1946-Truman.txt")
text[:1000]

"PRESIDENT HARRY S. TRUMAN'S MESSAGE TO THE CONGRESS ON THE STATE OF THE UNION AND ON THE BUDGET FOR 1946.\n \nJanuary 21, 1946. Dated January 14, 1946 \n\nTo the Congress of the United States:\nA quarter century ago the Congress decided that it could no longer consider the financial programs of the various departments on a piecemeal basis. Instead it has called on the President to present a comprehensive Executive Budget. The Congress has shown its satisfaction with that method by extending the budget system and tightening its controls. The bigger and more complex the Federal Program, the more necessary it is for the Chief Executive to submit a single budget for action by the Congress.\nAt the same time, it is clear that the budgetary program and the general program of the Government are actually inseparable. The President bears the responsibility for recommending to the Congress a comprehensive set of proposals on all Government activities and their financing. In formulating policies

We now define a function `tag_corpus` that takes care of the tagging process. First, it splits the text into sentences with the `sent_tokenize` function. Then, it iterates over the first `end` sentences, tokenizes each one with the `word_tokenize` function and applies the `pos_tag` function to the tokens.

In [None]:
def tag_corpus(corpus_text, end):
    try:
        for sentence in nltk.sent_tokenize(corpus_text)[:end]:
            words = nltk.word_tokenize(sentence)
            tagged = nltk.pos_tag(words)
            print("Sentence:", sentence, "\nTagging:", tagged)
            print()

    except Exception as e:
        print(str(e))

In [None]:
tag_corpus(text, end=5)

Sentence: PRESIDENT HARRY S. TRUMAN'S MESSAGE TO THE CONGRESS ON THE STATE OF THE UNION AND ON THE BUDGET FOR 1946. 
Tagging: [('PRESIDENT', 'NNP'), ('HARRY', 'NNP'), ('S.', 'NNP'), ('TRUMAN', 'NNP'), ("'S", 'POS'), ('MESSAGE', 'NN'), ('TO', 'VBD'), ('THE', 'NNP'), ('CONGRESS', 'NNP'), ('ON', 'NNP'), ('THE', 'NNP'), ('STATE', 'NNP'), ('OF', 'IN'), ('THE', 'NNP'), ('UNION', 'NNP'), ('AND', 'NNP'), ('ON', 'NNP'), ('THE', 'NNP'), ('BUDGET', 'NNP'), ('FOR', 'NNP'), ('1946', 'CD'), ('.', '.')]

Sentence: January 21, 1946. 
Tagging: [('January', 'NNP'), ('21', 'CD'), (',', ','), ('1946', 'CD'), ('.', '.')]

Sentence: Dated January 14, 1946 

To the Congress of the United States:
A quarter century ago the Congress decided that it could no longer consider the financial programs of the various departments on a piecemeal basis. 
Tagging: [('Dated', 'VBN'), ('January', 'NNP'), ('14', 'CD'), (',', ','), ('1946', 'CD'), ('To', 'TO'), ('the', 'DT'), ('Congress', 'NNP'), ('of', 'IN'), ('the', 'DT'), 

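The same function written more compactly, as a single list comprehension (note that using a comprehension only for its side effects is often discouraged in Python style guides):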

In [None]:
def pythonized_tag_corpus(corpus_text, end):
    try:
        [print("Sentence:", sentence, "\nTagging:", nltk.pos_tag(nltk.word_tokenize(sentence)), "\n") for sentence in nltk.sent_tokenize(corpus_text)[:end]]
    
    except Exception as e:
        print(str(e))

In [None]:
pythonized_tag_corpus(text, end=5)    

Sentence: PRESIDENT HARRY S. TRUMAN'S MESSAGE TO THE CONGRESS ON THE STATE OF THE UNION AND ON THE BUDGET FOR 1946. 
Tagging: [('PRESIDENT', 'NNP'), ('HARRY', 'NNP'), ('S.', 'NNP'), ('TRUMAN', 'NNP'), ("'S", 'POS'), ('MESSAGE', 'NN'), ('TO', 'VBD'), ('THE', 'NNP'), ('CONGRESS', 'NNP'), ('ON', 'NNP'), ('THE', 'NNP'), ('STATE', 'NNP'), ('OF', 'IN'), ('THE', 'NNP'), ('UNION', 'NNP'), ('AND', 'NNP'), ('ON', 'NNP'), ('THE', 'NNP'), ('BUDGET', 'NNP'), ('FOR', 'NNP'), ('1946', 'CD'), ('.', '.')] 

Sentence: January 21, 1946. 
Tagging: [('January', 'NNP'), ('21', 'CD'), (',', ','), ('1946', 'CD'), ('.', '.')] 

Sentence: Dated January 14, 1946 

To the Congress of the United States:
A quarter century ago the Congress decided that it could no longer consider the financial programs of the various departments on a piecemeal basis. 
Tagging: [('Dated', 'VBN'), ('January', 'NNP'), ('14', 'CD'), (',', ','), ('1946', 'CD'), ('To', 'TO'), ('the', 'DT'), ('Congress', 'NNP'), ('of', 'IN'), ('the', 'DT')