# Part of Speech Tags

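In this notebook, we take a closer look at part-of-speech (POS) tags: the tagsets available in NLTK, how to tag a sentence, and where the tagger gets things wrong.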


## Tagsets and Examples

The universal tagset (table generated with http://www.tablesgenerator.com/markdown_tables):

| Tag  | Meaning             | English Examples                       |
|------|---------------------|----------------------------------------|
| ADJ  | adjective           | new, good, high, special, big, local   |
| ADP  | adposition          | on, of, at, with, by, into, under      |
| ADV  | adverb              | really, already, still, early, now     |
| CONJ | conjunction         | and, or, but, if, while, although      |
| DET  | determiner, article | the, a, some, most, every, no, which   |
| NOUN | noun                | year, home, costs, time, Africa        |
| NUM  | numeral             | twenty-four, fourth, 1991, 14:24       |
| PRT  | particle            | at, on, out, over, per, that, up, with |
| PRON | pronoun             | he, their, her, its, my, I, us         |
| VERB | verb                | is, say, told, given, playing, would   |
| .    | punctuation marks   | . , ; !                                |
| X    | other               | ersatz, esprit, dunno, gr8, univeristy |


We list the `upenn` (a.k.a. `treebank`) tagset below. NLTK also documents two other tagsets:
* brown: use `nltk.help.brown_tagset()`
* claws5: use `nltk.help.claws5_tagset()`

In [2]:
import nltk

In [3]:
nltk.help.upenn_tagset()

$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or

In [4]:
nltk.help.upenn_tagset('WP$')

WP$: WH-pronoun, possessive
    whose


In [5]:
nltk.help.upenn_tagset('PDT')

PDT: pre-determiner
    all both half many quite such sure this


In [6]:
nltk.help.upenn_tagset('DT')

DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those


In [7]:
nltk.help.upenn_tagset('POS')

POS: genitive marker
    ' 's


In [8]:
nltk.help.upenn_tagset('RBR')

RBR: adverb, comparative
    further gloomier grander graver greater grimmer harder harsher
    healthier heavier higher however larger later leaner lengthier less-
    perfectly lesser lonelier longer louder lower more ...


In [9]:
nltk.help.upenn_tagset('RBS')

RBS: adverb, superlative
    best biggest bluntest earliest farthest first furthest hardest
    heartiest highest largest least less most nearest second tightest worst


In [10]:
nltk.help.upenn_tagset('MD')

MD: modal auxiliary
    can cannot could couldn't dare may might must need ought shall should
    shouldn't will would


The same tagset is summarized in the table below (cf. https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html):

| Tag | Meaning                                  | Tag  | Meaning               | Tag | Meaning                               |
|-----|------------------------------------------|------|-----------------------|-----|---------------------------------------|
| CC  | Coordinating conjunction                 | NNP  | Proper noun, singular | VB  | Verb, base form                       |
| CD  | Cardinal number                          | NNPS | Proper noun, plural   | VBD | Verb, past tense                      |
| DT  | Determiner                               | PDT  | Predeterminer         | VBG | Verb, gerund or present participle    |
| EX  | Existential there                        | POS  | Possessive ending     | VBN | Verb, past participle                 |
| FW  | Foreign word                             | PRP  | Personal pronoun      | VBP | Verb, non-3rd person singular present |
| IN  | Preposition or subordinating conjunction | PRP\$ | Possessive pronoun    | VBZ | Verb, 3rd person singular present     |
| JJ  | Adjective                                | RB   | Adverb                | WDT | Wh-determiner                         |
| JJR | Adjective, comparative                   | RBR  | Adverb, comparative   | WP  | Wh-pronoun                            |
| JJS | Adjective, superlative                   | RBS  | Adverb, superlative   | WP\$ | Possessive wh-pronoun                 |
| LS  | List item marker                         | RP   | Particle              | WRB | Wh-adverb                             |
| MD  | Modal                                    | SYM  | Symbol                |     |                                       |
| NN  | Noun, singular or mass                   | TO   | to                    |     |                                       |
| NNS | Noun, plural                             | UH   | Interjection          |     |                                       |
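The two tables are related: each fine-grained Penn Treebank tag collapses into one coarse universal tag. The sketch below illustrates the idea with a hand-written, partial mapping. The names `PENN_TO_UNIVERSAL_EXACT`, `PENN_TO_UNIVERSAL_PREFIXES`, and `to_universal` are hypothetical helpers written for this example, not part of NLTK; NLTK can produce universal tags itself via `nltk.pos_tag(tokens, tagset='universal')`.

```python
# Hypothetical, partial Penn Treebank -> universal mapping (illustration only,
# not NLTK's own mapping table). Unknown tags fall back to 'X' ("other").
PENN_TO_UNIVERSAL_EXACT = {
    'MD': 'VERB', 'IN': 'ADP', 'CD': 'NUM', 'CC': 'CONJ',
    'DT': 'DET', 'PDT': 'DET', 'WDT': 'DET', 'EX': 'DET',
    'PRP': 'PRON', 'PRP$': 'PRON', 'WP': 'PRON', 'WP$': 'PRON',
    'RP': 'PRT', 'TO': 'PRT', 'POS': 'PRT',
    'WRB': 'ADV',
    '.': '.', ',': '.', ':': '.',
}

# Tag families that share a prefix, e.g. NN, NNS, NNP, NNPS are all nouns.
PENN_TO_UNIVERSAL_PREFIXES = [
    ('NN', 'NOUN'),
    ('VB', 'VERB'),
    ('JJ', 'ADJ'),
    ('RB', 'ADV'),
]


def to_universal(penn_tag):
    """Map one Penn Treebank tag to a universal tag (best effort)."""
    if penn_tag in PENN_TO_UNIVERSAL_EXACT:
        return PENN_TO_UNIVERSAL_EXACT[penn_tag]
    for prefix, universal_tag in PENN_TO_UNIVERSAL_PREFIXES:
        if penn_tag.startswith(prefix):
            return universal_tag
    return 'X'


sample_tags = ['NNP', 'VBZ', 'JJR', 'IN', 'PRP$', 'UH']
print([(tag, to_universal(tag)) for tag in sample_tags])
# [('NNP', 'NOUN'), ('VBZ', 'VERB'), ('JJR', 'ADJ'), ('IN', 'ADP'), ('PRP$', 'PRON'), ('UH', 'X')]
```

Exact matches are checked before prefixes so that tags such as `RP` (particle) are never confused with the `RB` adverb family.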

## Tagging a sentence

In [11]:
from pprint import pprint

sent = 'Beautiful is better than ugly.'
tokens = nltk.tokenize.word_tokenize(sent)
pos_tags = nltk.pos_tag(tokens)
pprint(pos_tags)

[('Beautiful', 'NNP'),
 ('is', 'VBZ'),
 ('better', 'JJR'),
 ('than', 'IN'),
 ('ugly', 'RB'),
 ('.', '.')]


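Many algorithms can be used for POS tagging, and their accuracy is generally high (state-of-the-art taggers reach approximately 97%). Errors still occur, though: in the output above, `Beautiful` is tagged `NNP` and `ugly` is tagged `RB`, although both are used as adjectives. Below, we compare NLTK's tags with hand-annotated tags for a few sentences.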

In [14]:
truths = [[(u'Pierre', u'NNP'), (u'Vinken', u'NNP'), (u',', u','), (u'61', u'CD'),
            (u'years', u'NNS'), (u'old', u'JJ'), (u',', u','), (u'will', u'MD'),
            (u'join', u'VB'), (u'the', u'DT'), (u'board', u'NN'), (u'as', u'IN'),
            (u'a', u'DT'), (u'nonexecutive', u'JJ'), (u'director', u'NN'),
            (u'Nov.', u'NNP'), (u'29', u'CD'), (u'.', u'.')],
        [(u'Mr.', u'NNP'), (u'Vinken', u'NNP'), (u'is', u'VBZ'), (u'chairman', u'NN'),
            (u'of', u'IN'), (u'Elsevier', u'NNP'), (u'N.V.', u'NNP'), (u',', u','),
            (u'the', u'DT'), (u'Dutch', u'NNP'), (u'publishing', u'VBG'),
            (u'group', u'NN'), (u'.', u'.'), (u'Rudolph', u'NNP'), (u'Agnew', u'NNP'),
            (u',', u','), (u'55', u'CD'), (u'years', u'NNS'), (u'old', u'JJ'),
            (u'and', u'CC'), (u'former', u'JJ'), (u'chairman', u'NN'), (u'of', u'IN'),
            (u'Consolidated', u'NNP'), (u'Gold', u'NNP'), (u'Fields', u'NNP'),
            (u'PLC', u'NNP'), (u',', u','), (u'was', u'VBD'), (u'named', u'VBN'),
            (u'a', u'DT'), (u'nonexecutive', u'JJ'), (u'director', u'NN'), (u'of', u'IN'),
            (u'this', u'DT'), (u'British', u'JJ'), (u'industrial', u'JJ'),
            (u'conglomerate', u'NN'), (u'.', u'.')],
        [(u'A', u'DT'), (u'form', u'NN'),
            (u'of', u'IN'), (u'asbestos', u'NN'), (u'once', u'RB'), (u'used', u'VBN'),
            (u'to', u'TO'), (u'make', u'VB'), (u'Kent', u'NNP'), (u'cigarette', u'NN'),
            (u'filters', u'NNS'), (u'has', u'VBZ'), (u'caused', u'VBN'), (u'a', u'DT'),
            (u'high', u'JJ'), (u'percentage', u'NN'), (u'of', u'IN'),
            (u'cancer', u'NN'), (u'deaths', u'NNS'),
            (u'among', u'IN'), (u'a', u'DT'), (u'group', u'NN'), (u'of', u'IN'),
            (u'workers', u'NNS'), (u'exposed', u'VBN'), (u'to', u'TO'), (u'it', u'PRP'),
            (u'more', u'RBR'), (u'than', u'IN'), (u'30', u'CD'), (u'years', u'NNS'),
            (u'ago', u'IN'), (u',', u','), (u'researchers', u'NNS'),
            (u'reported', u'VBD'), (u'.', u'.')]]

In [21]:
import pandas as pd

def proj(pair_list, idx):
    """Return the idx-th element of every pair in pair_list."""
    return [p[idx] for p in pair_list]

# Collect one (token, true tag, NLTK tag) row per token.
data = []
for truth in truths:
    sent_toks = proj(truth, 0)
    true_tags = proj(truth, 1)
    nltk_tags = proj(nltk.pos_tag(sent_toks), 1)
    for token, true_tag, nltk_tag in zip(sent_toks, true_tags, nltk_tags):
        # If you do not want to use a DataFrame:
        # print('{}\t{}\t{}'.format(token, true_tag, nltk_tag))
        data.append((token, true_tag, nltk_tag))

headers = ['token', 'true_tag', 'nltk_tag']
df = pd.DataFrame(data, columns=headers)
df

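|   | token  | true_tag | nltk_tag |
|---|--------|----------|----------|
| 0 | Pierre | NNP      | NNP      |
| 1 | Vinken | NNP      | NNP      |
| 2 | ,      | ,        | ,        |
| 3 | 61     | CD       | CD       |
| 4 | years  | NNS      | NNS      |
| 5 | old    | JJ       | JJ       |
| 6 | ,      | ,        | ,        |
| 7 | will   | MD       | MD       |
| 8 | join   | VB       | VB       |
| 9 | the    | DT       | DT       |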


In [23]:
# Select the tokens whose true_tag and nltk_tag differ.
df[df.true_tag != df.nltk_tag]

|    | token      | true_tag | nltk_tag |
|----|------------|----------|----------|
| 28 | publishing | VBG      | NN       |
| 62 | used       | VBN      | VBD      |
| 84 | more       | RBR      | JJR      |
| 88 | ago        | IN       | RB       |
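To turn the mismatches into a number, we can compute per-token accuracy and count which tag pairs get confused. The following is a minimal sketch in plain Python; `tag_accuracy` and `confusions` are hypothetical helper names introduced for this example, and the `sample` rows are rows 26 to 29 of the comparison above.

```python
from collections import Counter


def tag_accuracy(rows):
    """Fraction of (token, true_tag, predicted_tag) rows whose tags agree."""
    rows = list(rows)
    if not rows:
        raise ValueError('need at least one row')
    correct = sum(1 for _, true_tag, pred_tag in rows if true_tag == pred_tag)
    return correct / len(rows)


def confusions(rows):
    """Count each (true_tag, predicted_tag) pair among the mismatches."""
    return Counter((true_tag, pred_tag)
                   for _, true_tag, pred_tag in rows
                   if true_tag != pred_tag)


# Rows 26-29 of the comparison: one of the four tokens is mis-tagged.
sample = [('the', 'DT', 'DT'),
          ('Dutch', 'NNP', 'NNP'),
          ('publishing', 'VBG', 'NN'),
          ('group', 'NN', 'NN')]

print(tag_accuracy(sample))  # 0.75
print(confusions(sample))    # Counter({('VBG', 'NN'): 1})
```

Over all three annotated sentences there are 93 tokens and 4 mismatches, so the accuracy on this small sample is 89/93, or about 95.7%. With the DataFrame, the same figure can be obtained with `(df.true_tag == df.nltk_tag).mean()`.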
