# **Natural Language Processing with Python**
by [CSpanias](https://cspanias.github.io/aboutme/) - 02/2022

Content based on the [NLTK book](https://www.nltk.org/book/).

You can find Chapter 5 [here](https://www.nltk.org/book/ch05.html).

# CONTENT

1. Language Processing and Python
2. Accessing Text Corpora and Lexical Resources
3. Processing Raw Text
4. Writing Structured Programs
5. Categorizing and Tagging Words
    1. Using a Tagger
    1. Tagged Corpora
    1. Mapping Words to Properties Using Python Dictionaries
    1. Automatic Tagging
    1. N-Gram Tagging
    1. [Transformation-Based Tagging](#tbtagging)
    1. [How to Determine the Category of a Word](#category)
        1. [Morphological Clues](#morpho)
        1. [Syntactic Clues](#syntantic)
        1. [Semantic Clues](#semantic)
        1. [New Words](#newwords)
        1. [Morphology in POS Tagsets](#tagsetmorpho)

<a name="tbtagging"></a>
# 5.6 Transformation-Based Tagging

A potential issue with n-gram taggers is the __size of their n-gram table (or language model)__. 

If tagging is to be employed in a variety of language technologies deployed on mobile computing devices, it is important to strike a __balance between model size and tagger performance__. 

An n-gram tagger with backoff may store trigram and bigram tables, __large sparse arrays__ which may have hundreds of millions of entries.

A second issue concerns __context__. The only information an n-gram tagger considers from prior context is tags, even though words themselves might be a useful source of information. It is simply impractical for n-gram models to be conditioned on the identities of words in the context. 

In this section we examine Brill tagging, an inductive tagging method which performs very well using models that are only a tiny fraction of the size of n-gram taggers.

__Brill tagging__ is a kind of __transformation-based learning__. The general idea is very simple:
1. __guess the tag__ of each word
2. then go back and __fix the mistakes__

In this way, a Brill tagger successively transforms a bad tagging of a text into a better one. As with n-gram tagging, this is a __supervised learning method__, since we need annotated training data to figure out whether the tagger's guess is a mistake or not. 

However, unlike n-gram tagging, it does not count observations but compiles a list of transformational correction rules.

Let's look at an example involving the following sentence:

> The President said he will ask Congress to increase grants to states for vocational rehabilitation

We will examine the operation of two rules: 
1. Replace `NN` with `VB` when the previous word is tagged `TO`
2. Replace `TO` with `IN` when the next tag is `NNS`

The figure below illustrates this process:
1. first tagging with the unigram tagger
2. then applying the rules to fix the errors.

![brill.PNG](attachment:brill.PNG)
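The two rules can be sketched in plain Python. This is a minimal illustration rather than NLTK's implementation: the helpers `apply_rule`, `previous_tag_is` and `next_tag_is` are hypothetical, and the initial tags are an assumed unigram-style tagging of the last part of the sentence.

```python
# Assumed initial tagging of part of the sentence (illustrative only,
# standing in for the output of a unigram tagger).
phrase = [
    ("to", "TO"), ("increase", "NN"), ("grants", "NNS"), ("to", "TO"),
    ("states", "NNS"), ("for", "IN"), ("vocational", "JJ"),
    ("rehabilitation", "NN"),
]


def apply_rule(tagged, old_tag, new_tag, condition):
    """Return a copy of `tagged` with old_tag replaced by new_tag wherever
    `condition` holds. Conditions are checked against the tagging as it
    was before this rule ran."""
    return [
        (word, new_tag if tag == old_tag and condition(tagged, i) else tag)
        for i, (word, tag) in enumerate(tagged)
    ]


def previous_tag_is(target):
    # Context: the token immediately to the left carries the tag `target`.
    return lambda tagged, i: i > 0 and tagged[i - 1][1] == target


def next_tag_is(target):
    # Context: the token immediately to the right carries the tag `target`.
    return lambda tagged, i: i + 1 < len(tagged) and tagged[i + 1][1] == target


# Rule 1: NN -> VB when the previous word is tagged TO
after_rule_1 = apply_rule(phrase, "NN", "VB", previous_tag_is("TO"))
# Rule 2: TO -> IN when the next tag is NNS
after_rule_2 = apply_rule(after_rule_1, "TO", "IN", next_tag_is("NNS"))

print(" ".join(f"{word}/{tag}" for word, tag in after_rule_2))
```

Rule 1 changes `increase` to `VB`, and rule 2 changes only the second `to` to `IN`, because the first `to` is followed by a verb rather than a plural noun.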

All such rules are generated from a template of the following form:
> "replace `T1` with `T2` in the context `C`". 

Typical contexts are the identity or the tag of the preceding or following word, or the appearance of a specific tag within 2-3 words of the current word. 

During its training phase, the tagger guesses values for `T1`, `T2` and `C`, to create thousands of candidate rules. Each rule is __scored according to its net benefit__: the number of incorrect tags that it corrects, less the number of correct tags it incorrectly modifies.
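The scoring step can be sketched as follows. The function `net_benefit` and the toy tag sequences are hypothetical and only illustrate the arithmetic; a real trainer computes this over the whole training corpus for every candidate rule.

```python
def net_benefit(current, proposed, gold):
    """Score a candidate rule as (tags it fixes) - (tags it breaks).

    current:  tags before the rule is applied
    proposed: tags after the rule is applied
    gold:     correct tags from the annotated training data
    """
    triples = list(zip(current, proposed, gold))
    fixed = sum(1 for c, p, g in triples if c != g and p == g)
    broken = sum(1 for c, p, g in triples if c == g and p != g)
    return fixed - broken


# Toy data: the candidate rule "NN -> VB after TO" fires three times.
current = ["TO", "NN", "TO", "NN", "TO", "NN"]
proposed = ["TO", "VB", "TO", "VB", "TO", "VB"]
gold = ["TO", "VB", "TO", "VB", "TO", "NN"]

score = net_benefit(current, proposed, gold)
print(score)  # two tags fixed, one broken
```

Here the rule corrects two tags and damages one that was already right, so its net benefit is 1.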

Brill taggers have another interesting property: the __rules are linguistically interpretable__. 

Compare this with the n-gram taggers, which employ a potentially massive table of n-grams. We cannot learn much from direct inspection of such a table, in comparison to the rules learned by the Brill tagger.

```python
from nltk.tbl import demo as brill_demo

# Train and evaluate a Brill tagger on a sample of the treebank corpus
brill_demo.demo()

# print(open('errors.out').read())
```

```text
Loading tagged data from treebank... 
Read testing data (200 sents/5251 wds)
Read training data (800 sents/19933 wds)
Read baseline data (800 sents/19933 wds) [reused the training set]
Trained baseline tagger
    Accuracy on test set: 0.8366
Training tbl tagger...
TBL train (fast) (seqs: 800; tokens: 19933; tpls: 24; min score: 3; min acc: None)
Finding initial useful rules...
    Found 12799 useful rules.

           B      |
   S   F   r   O  |        Score = Fixed - Broken
   c   i   o   t  |  R     Fixed = num tags changed incorrect -> correct
   o   x   k   h  |  u     Broken = num tags changed correct -> incorrect
   r   e   e   e  |  l     Other = num tags changed incorrect -> incorrect
   e   d   n   r  |  e
------------------+-------------------------------------------------------
  23  23   0   0  | POS->VBZ if Pos:PRP@[-2,-1]
  18  19   1   0  | NN->VB if Pos:-NONE-@[-2] & Pos:TO@[-1]
  14  14   0   0  | VBP->VB if Pos:MD@[-2,-1]
  12  12   0   0  | VBP->VB if Pos:TO@[-1]
```

<a name="category"></a>
# 5.7 How to Determine the Category of a Word

Now that we have examined word classes in detail, we turn to a more basic question: __how do we decide what category a word belongs to in the first place?__ 

In general, linguists use morphological, syntactic, and semantic clues to determine the category of a word.

<a name="morpho"></a>
## 5.7.1 Morphological Clues

The __internal structure__ of a word may give useful clues as to the word's category. 

For example, `-ness` is a suffix that combines with an adjective to produce a noun, e.g. happy → happiness, ill → illness. 

So if we encounter a word that ends in `-ness`, this is __very likely to be a noun__. 

Similarly, `-ment` is a suffix that combines with some verbs to produce a noun, e.g. govern → government and establish → establishment.

English verbs can also be morphologically complex.

For instance, the present participle of a verb ends in `-ing`, and expresses the idea of ongoing, incomplete action (e.g. falling, eating). 

The `-ing` suffix also appears on nouns derived from verbs, e.g. the falling of the leaves (this is known as the __gerund__).
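As a rough sketch, these suffix clues can be turned into a guessing function. The function name `guess_from_suffix` and its return labels are hypothetical, and the rule is only a heuristic: words such as `king` or `cement` end in these letters without containing the suffix.

```python
def guess_from_suffix(word):
    """Guess a word's category from its ending (heuristic only)."""
    word = word.lower()
    if word.endswith(("ness", "ment")):
        # -ness and -ment typically derive nouns from adjectives and verbs
        return "noun"
    if word.endswith("ing"):
        # -ing marks a present participle, or a gerund when used as a noun
        return "present participle or gerund"
    return "unknown"


for w in ["happiness", "government", "falling", "cake"]:
    print(w, "->", guess_from_suffix(w))
```

Deciding between the participle and the gerund reading of an `-ing` word needs the surrounding context, which morphology alone does not supply.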

<a name="syntantic"></a>
## 5.7.2 Syntactic Clues

Another source of information is the __typical contexts__ in which a word can occur. 

For example, assume that we have already determined the category of nouns. Then we might say that a __syntactic criterion__ for an adjective in English is that it can occur immediately before a noun, or immediately following the words `be` or `very`.

According to these tests, `near` should be categorized as an adjective:
	
1. the `near` window
2. The end is (very) `near`.
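The two tests can be sketched as a search over tagged sentences. Everything here is an assumption made for illustration: the function `adjective_contexts`, the `BE_FORMS` set, and the hand-assigned tags, with `?` standing for the word whose category is still unknown.

```python
# Forms of "be" accepted by the second test (assumed list).
BE_FORMS = {"be", "am", "is", "are", "was", "were", "been", "being"}


def adjective_contexts(word, tagged_sents):
    """Return which of the two adjective tests `word` passes."""
    found = set()
    for sent in tagged_sents:
        for i, (token, _tag) in enumerate(sent):
            if token.lower() != word:
                continue
            # Test 1: occurs immediately before a noun
            if i + 1 < len(sent) and sent[i + 1][1] == "NN":
                found.add("before noun")
            # Test 2: occurs immediately after a form of "be" or after "very"
            if i > 0 and sent[i - 1][0].lower() in BE_FORMS | {"very"}:
                found.add("after be/very")
    return found


sents = [
    [("the", "AT"), ("near", "?"), ("window", "NN")],
    [("The", "AT"), ("end", "NN"), ("is", "BEZ"), ("very", "QL"), ("near", "?")],
]
print(adjective_contexts("near", sents))
```

On this toy data `near` passes both tests, which is the evidence for treating it as an adjective.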

<a name="semantic"></a>
## 5.7.3 Semantic Clues

Finally, the __meaning of a word__ is a useful clue as to its lexical category. 

For example, __the best-known definition of a `noun` is semantic__: 

> "the name of a person, place or thing". 

Within modern linguistics, semantic criteria for word classes are treated with suspicion, mainly because they are __hard to formalize__. 

Nevertheless, semantic criteria underpin many of our intuitions about word classes, and enable us to make a __good guess about the categorization of words in languages that we are unfamiliar with__. 

For example, if all we know about the Dutch word `verjaardag` is that it means the same as the English word `birthday`, then we can guess that `verjaardag` is a noun in Dutch. 

However, some care is needed: although we might translate `zij is vandaag jarig` as `it's her birthday today`, the word `jarig` is in fact an adjective in Dutch, and has no exact equivalent in English.

<a name="newwords"></a>
## 5.7.4 New Words

__All languages acquire new lexical items__. 

A list of words recently added to the Oxford Dictionary of English includes cyberslacker, fatoush, blamestorm, SARS, cantopop, bupkis, noughties, muggle, and robata. 

Notice that all these new words are `nouns`, and this is reflected in calling nouns an __open class__.

By contrast, prepositions are regarded as a __closed class__. 

That is, there is a limited set of words belonging to the class (e.g., above, along, at, below, beside, between, during, for, from, in, near, on, outside, over, past, through, towards, under, up, with), and __membership of the set only changes very gradually over time__.

<a name="tagsetmorpho"></a>
## 5.7.5 Morphology in POS Tagsets

__Common tagsets often capture some morpho-syntactic information__; that is, information about the kind of morphological markings that words receive by virtue of their syntactic role.

Consider, for example, the selection of distinct grammatical forms of the word `go` illustrated in the following sentences:
		
* Go away!
* He sometimes goes to the cafe.
* All the cakes have gone.
* We went on the excursion.

Each of these forms — `go`, `goes`, `gone`, and `went` — is __morphologically distinct__ from the others. 

Consider the form, `goes`. This occurs in a restricted set of grammatical contexts, and requires a third person singular subject. Thus, the following sentences are ungrammatical.
	
* *They sometimes goes to the cafe.
* *I sometimes goes to the cafe.


By contrast, `gone` is the past participle form; it is required after `have` (and cannot be replaced in this context by `goes`), and cannot occur as the main verb of a clause.
	
* *All the cakes have goes.
* *He sometimes gone to the cafe.

We can easily imagine a tagset in which the four distinct grammatical forms just discussed were all tagged as `VB`. Although this would be adequate for some purposes, a more fine-grained tagset provides useful information about these forms that can help other processors that try to detect patterns in tag sequences. The Brown tagset captures these distinctions, as summarized in the figure below.

![tagset_morphology.PNG](attachment:tagset_morphology.PNG)
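A small sketch shows what a downstream processor gains from the finer distinctions. It assumes the usual Brown tags `VB`, `VBZ`, `VBN` and `VBD` for the four forms; the dictionaries and the helper `valid_after_have` are hypothetical.

```python
# Fine-grained tagging: each form of "go" gets its own tag (assumed Brown tags).
FINE_GRAINED = {"go": "VB", "goes": "VBZ", "gone": "VBN", "went": "VBD"}

# Coarse tagging: every form collapses to the same tag.
COARSE = {form: "VB" for form in FINE_GRAINED}


def valid_after_have(form, tagset):
    """A pattern check: only a past participle (VBN) may follow 'have'."""
    return tagset[form] == "VBN"


print(valid_after_have("gone", FINE_GRAINED))  # "have gone" is accepted
print(valid_after_have("goes", FINE_GRAINED))  # "have goes" is rejected
print(valid_after_have("gone", COARSE))        # the coarse tagset cannot tell
```

With the coarse tagset the check cannot separate `gone` from `goes`, because the information it needs has been discarded.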

In addition to this set of verb tags, the various forms of the verb `to be` have special tags: `be/BE`, `being/BEG`, `am/BEM`, `are/BER`, `is/BEZ`, `been/BEN`, `were/BED` and `was/BEDZ` (plus extra tags for negative forms of the verb).

All told, this fine-grained tagging of verbs means that an automatic tagger that uses this tagset is effectively carrying out a limited amount of morphological analysis.

Most POS tagsets make use of the same basic categories, such as __noun__, __verb__, __adjective__, and __preposition__.

However, __tagsets differ both in how finely they divide words into categories, and in how they define their categories__. 

For example, `is` might be tagged simply as a __verb__ in one tagset; but as a __distinct form of the lexeme `be`__ in another tagset (as in the Brown Corpus). 
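The difference in granularity can be sketched as a mapping from fine-grained tags to a coarser category. The function `coarse_tag` and its prefix rule are hypothetical simplifications, not part of NLTK or of any standard tagset mapping.

```python
def coarse_tag(brown_tag):
    """Collapse Brown-style verb tags into a single VERB category.

    Assumes that tags starting with "VB" or "BE" are verb forms, which
    holds for the tags discussed here but is a simplification.
    """
    if brown_tag.startswith(("VB", "BE")):
        return "VERB"
    return brown_tag


tagged = [("is", "BEZ"), ("goes", "VBZ"), ("was", "BEDZ"), ("cafe", "NN")]
simplified = [(word, coarse_tag(tag)) for word, tag in tagged]
print(simplified)
```

The mapping only works in one direction: once `is` has been tagged `VERB`, the fact that it is a form of `be` cannot be recovered from the tag.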

This variation in tagsets is unavoidable, since POS tags are used in different ways for different tasks. In other words, there is no one 'right way' to assign tags, only more or less useful ways depending on one's goals.