[NLTK Book, Chapter 5: Categorizing and Tagging Words](https://www.nltk.org/book/ch05.html)

Chapter Goals:
1. What are lexical categories and how are they used in natural language processing?
2. What is a good Python data structure for storing words and their categories?
3. How can we automatically tag each word of a text with its word class?

# 1 - Using a Tagger

* [POS Tags (Penn Treebank)](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html)

In [4]:
import nltk

text = nltk.word_tokenize("And now for something completely different")
nltk.pos_tag(text)

[('And', 'CC'),
 ('now', 'RB'),
 ('for', 'IN'),
 ('something', 'NN'),
 ('completely', 'RB'),
 ('different', 'JJ')]

In [6]:
nltk.help.upenn_tagset('RB')

RB: adverb
    occasionally unabatingly maddeningly adventurously professedly
    stirringly prominently technologically magisterially predominately
    swiftly fiscally pitilessly ...


In [7]:
nltk.help.upenn_tagset('NN.*')

NN: noun, common, singular or mass
    common-carrier cabbage knuckle-duster Casino afghan shed thermostat
    investment slide humour falloff slick wind hyena override subhumanity
    machinist ...
NNP: noun, proper, singular
    Motown Venneboerger Czestochwa Ranzer Conchita Trumplane Christos
    Oceanside Escobar Kreisler Sawyer Cougar Yvette Ervin ODI Darryl CTCA
    Shannon A.K.C. Meltex Liverpool ...
NNPS: noun, proper, plural
    Americans Americas Amharas Amityvilles Amusements Anarcho-Syndicalists
    Andalusians Andes Andruses Angels Animals Anthony Antilles Antiques
    Apache Apaches Apocrypha ...
NNS: noun, common, plural
    undergraduates scotches bric-a-brac products bodyguards facets coasts
    divestitures storehouses designs clubs fragrances averages
    subjectivists apprehensions muses factory-jobs ...


Notice that a word's POS tag depends on the context in which it is used. For example, in the following sentence 'refuse' first appears as a VBP (*verb, non-3rd person singular present*) and then as an NN (*noun*) in the compound *refuse permit*. The same holds for 'permit', which is tagged VB and then NN:

    They refuse to permit us to obtain the refuse permit

In [11]:
text = nltk.word_tokenize("They refuse to permit us to obtain the refuse permit")

nltk.pos_tag(text)

[('They', 'PRP'),
 ('refuse', 'VBP'),
 ('to', 'TO'),
 ('permit', 'VB'),
 ('us', 'PRP'),
 ('to', 'TO'),
 ('obtain', 'VB'),
 ('the', 'DT'),
 ('refuse', 'NN'),
 ('permit', 'NN')]
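
One way to make this ambiguity visible is to collect the set of tags seen for each word. The sketch below is plain Python that hard-codes the tagger output shown above; the variable names `tags_by_word` and `ambiguous` are illustrative, and the approach is just one possible use of a dictionary to store words and their categories, not an NLTK API.

```python
from collections import defaultdict

# Tagger output copied from the cell above.
tagged = [('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'),
          ('us', 'PRP'), ('to', 'TO'), ('obtain', 'VB'), ('the', 'DT'),
          ('refuse', 'NN'), ('permit', 'NN')]

# Map each word to the set of tags it received (illustrative name).
tags_by_word = defaultdict(set)
for word, tag in tagged:
    tags_by_word[word].add(tag)

# Words that received more than one tag are ambiguous in this sentence.
ambiguous = {w: sorted(t) for w, t in tags_by_word.items() if len(t) > 1}
print(ambiguous)
```

Here only 'refuse' and 'permit' end up with more than one tag; 'to' occurs twice but receives the same tag both times.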

# 2 - Tagged Corpora

## Penn Treebank Tags

<img src='https://www.researchgate.net/profile/Han_Van_Der_Aa/publication/320858849/figure/tbl3/AS:631618876235795@1527601083401/4-Overview-of-the-Penn-Treebank-tagset-from-135-p131.png' width=75%>

## 2.3 Universal Part-of-Speech Tagset

|Tag |Meaning            |English Examples                      |
|----|-------------------|--------------------------------------|
|ADJ |adjective          |new, good, high, special, big, local  |
|ADP |adposition         |on, of, at, with, by, into, under     |
|ADV |adverb             |really, already, still, early, now    |
|CONJ|conjunction        |and, or, but, if, while, although     |
|DET |determiner, article|the, a, some, most, every, no, which  |
|NOUN|noun               |year, home, costs, time, Africa       |
|NUM |numeral            |twenty-four, fourth, 1991, 14:24      |
|PRT |particle           |at, on, out, over, per, that, up, with|
|PRON|pronoun            |he, their, her, its, my, I, us        |
|VERB|verb               |is, say, told, given, playing, would  |
|.   |punctuation marks  |. , ; !                               |
|X   |other              |ersatz, esprit, dunno, gr8, univeristy|
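
To relate the two tagsets, here is a small sketch that collapses the Penn Treebank tags from the first tagging example into the universal tags in the table. The `PENN_TO_UNIVERSAL` dictionary is a hand-written, partial mapping assumed for illustration (it covers only the five tags that occur in that sentence), and falling back to `X` for unlisted tags is likewise an assumption of this sketch.

```python
# Hand-written, partial mapping (an assumption for illustration only).
PENN_TO_UNIVERSAL = {
    'CC': 'CONJ',
    'RB': 'ADV',
    'IN': 'ADP',
    'NN': 'NOUN',
    'JJ': 'ADJ',
}

# Penn Treebank output from the first tagging example.
tagged = [('And', 'CC'), ('now', 'RB'), ('for', 'IN'),
          ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ')]

def to_universal(tagged_words, mapping=PENN_TO_UNIVERSAL):
    """Replace each fine-grained tag with its coarse tag; unknown tags become 'X'."""
    return [(word, mapping.get(tag, 'X')) for word, tag in tagged_words]

simplified = to_universal(tagged)
print(simplified)
```

NLTK's `pos_tag` also accepts a `tagset='universal'` argument, which relies on the library's own complete mapping rather than a hand-written table like this one.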

# 4 - Automatic Tagging

* **4.1** Default tagger
    * assigns the same tag to every word
    * the default tag is usually a noun tag (`NN`), because most previously unseen words are nouns
* **4.2** Regular expression tagger
    * tags words based on their endings, e.g. `.*ing$` is a VBG, `.*ed$` is a VBD, and `.*s$` is an NNS
* **4.3** Lookup tagger
    * take the top-n most frequent words and find each one's most frequent tag in a corpus
    * build a `UnigramTagger` from this word-to-tag model and use it to tag text
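
The first two strategies can be sketched in a few lines of plain Python. NLTK provides `nltk.DefaultTagger` and `nltk.RegexpTagger` for these; the functions below (`default_tag`, `regexp_tag`) and the short pattern list are hypothetical stand-ins written to show the idea, not NLTK's implementation.

```python
import re

def default_tag(tokens, tag='NN'):
    """4.1: give every token the same tag."""
    return [(tok, tag) for tok in tokens]

# 4.2: (pattern, tag) pairs, tried in order; the first match wins.
# This short list is an illustrative assumption, not a complete rule set.
PATTERNS = [
    (r'.*ing$', 'VBG'),   # gerunds
    (r'.*ed$', 'VBD'),    # simple past
    (r'.*s$', 'NNS'),     # plural nouns
    (r'.*', 'NN'),        # everything else falls back to noun
]

def regexp_tag(tokens, patterns=PATTERNS):
    """Tag each token with the tag of the first pattern it matches."""
    tagged = []
    for tok in tokens:
        for pattern, tag in patterns:
            if re.match(pattern, tok):
                tagged.append((tok, tag))
                break
    return tagged

tokens = 'the dogs barked while running home'.split()
print(default_tag(tokens))
print(regexp_tag(tokens))
```

Suffix rules like these are crude: 'while' falls through to the catch-all pattern and is tagged as a noun, which is why the lookup tagger in 4.3 consults corpus frequencies instead.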

In [46]:
from nltk.corpus import brown

fd = nltk.FreqDist(brown.words(categories='news'))
cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))
most_freq_words = fd.most_common(100)
likely_tags = dict((word, cfd[word].max()) for (word, _) in most_freq_words)
baseline_tagger = nltk.UnigramTagger(model=likely_tags)

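# For evaluation: brown_tagged_sents serves as the gold standard,
# i.e. the human-annotated tags that the tagger's output is compared against.
brown_tagged_sents = brown.tagged_sents(categories='news')
# Note: newer NLTK releases deprecate evaluate() in favor of accuracy().
baseline_tagger.evaluate(brown_tagged_sents)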

0.45578495136941344
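
The score is the fraction of tokens whose predicted tag matches the gold-standard tag. A minimal sketch of that computation follows, using a made-up four-token example and a hypothetical `token_accuracy` helper; neither comes from NLTK or from the Brown corpus.

```python
def token_accuracy(predicted, gold):
    """Fraction of positions where the predicted tag equals the gold tag."""
    if len(predicted) != len(gold):
        raise ValueError('predicted and gold must have the same length')
    correct = sum(p_tag == g_tag
                  for (_, p_tag), (_, g_tag) in zip(predicted, gold))
    return correct / len(gold)

# Toy data invented for illustration (Brown-style tags).
gold = [('the', 'AT'), ('state', 'NN'), ('said', 'VBD'), ('eggs', 'NNS')]
# A lookup tagger returns None for words missing from its model.
predicted = [('the', 'AT'), ('state', 'NN'), ('said', 'VBD'), ('eggs', None)]

score = token_accuracy(predicted, gold)
print(score)
```

A `None` tag never matches the gold tag, so every word outside the model counts as an error in this sketch.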

In [22]:
cfd['the']

FreqDist({'AT': 5558, 'AT-TL': 18, 'AT-HL': 4})

In [23]:
cfd['the'].max()

'AT'

In [40]:
str(likely_tags).replace("\'","")

'{the: AT, ,: ,, .: ., of: IN, and: CC, to: TO, a: AT, in: IN, for: IN, The: AT, that: CS, ``: ``, is: BEZ, was: BEDZ, "": "", on: IN, at: IN, with: IN, be: BE, by: IN, as: CS, he: PPS, said: VBD, his: PP$, will: MD, it: PPS, from: IN, are: BER, ;: ., an: AT, has: HVZ, --: --, had: HVD, who: WPS, have: HV, not: *, Mrs.: NP, were: BED, this: DT, which: WDT, would: MD, their: PP$, been: BEN, they: PPSS, He: PPS, one: CD, I: PPSS, but: CC, its: PP$, or: CC, ): ), more: AP, Mr.: NP, (: (, up: RP, all: ABN, out: RP, last: AP, two: CD, other: AP, :: :, new: JJ, first: OD, than: IN, year: NN, A: AT, about: IN, there: EX, when: WRB, home: NN, after: IN, In: IN, also: RB, It: PPS, over: IN, into: IN, no: AT, But: CC, made: VBN, only: AP, her: PP$, years: NNS, time: NN, three: CD, them: PPO, some: DTI, can: MD, him: PPO, New: JJ-TL, any: DTI, state: NN, ?: ., President: NN-TL, before: IN, week: NN, could: MD, under: IN, against: IN, we: PPSS, what: WDT}'

In [45]:
raw = 'I do not like green eggs and ham, I do not like them Sam I am!'
baseline_tagger.tag(nltk.word_tokenize(raw))

[('I', 'PPSS'),
 ('do', None),
 ('not', '*'),
 ('like', None),
 ('green', None),
 ('eggs', None),
 ('and', 'CC'),
 ('ham', None),
 (',', ','),
 ('I', 'PPSS'),
 ('do', None),
 ('not', '*'),
 ('like', None),
 ('them', 'PPO'),
 ('Sam', None),
 ('I', 'PPSS'),
 ('am', None),
 ('!', None)]
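
Words outside the 100-word model are tagged `None`. A common remedy is to back off to a default tag for unknown words; NLTK taggers accept a `backoff` argument for this, for example `nltk.UnigramTagger(model=likely_tags, backoff=nltk.DefaultTagger('NN'))`. The plain-Python sketch below mimics that behaviour with a hypothetical `lookup_tag` function and a small excerpt of `likely_tags`; the tokens are listed by hand rather than produced by `nltk.word_tokenize`.

```python
# Excerpt of the likely_tags model printed above.
likely_tags = {'I': 'PPSS', 'not': '*', 'and': 'CC', ',': ',', 'them': 'PPO'}

def lookup_tag(tokens, model, default='NN'):
    """Use the model's tag when the word is known, else the default tag."""
    return [(tok, model.get(tok, default)) for tok in tokens]

tokens = ['I', 'do', 'not', 'like', 'green', 'eggs', 'and', 'ham']
result = lookup_tag(tokens, likely_tags)
print(result)
```

Every token now receives a tag, although 'do' and 'like' are verbs in this sentence, so the noun default is only a rough guess.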

# 5 - N-Gram Tagging