# Exercise: POS Tagging

## Part-of-Speech Tagging – Do it Yourself
#### Select two sentences of your choice (with a sentence length ≥ 8) from the play corpus wiki-en-flower.txt. Label the words in the sentences with the Penn Treebank part-of-speech tags provided in the handout of the lecture. Document decisions that you found difficult.

In [1]:
s1 = "The state tree is the Longleaf Pine , the state flower is the Camellia ."

In [2]:
s2 = "The film largely consisted of a stick figure moving about and encountering all manner of morphing objects , such as a wine bottle that transforms into a flower ." 

In [3]:
# Whitespace tokenizer: s1 and s2 are already written with spaces between tokens
def tokenize(s):
    return s.split()

In [4]:
tokenize(s1)

['The',
 'state',
 'tree',
 'is',
 'the',
 'Longleaf',
 'Pine',
 ',',
 'the',
 'state',
 'flower',
 'is',
 'the',
 'Camellia',
 '.']

In [5]:
tokenize(s2)

['The',
 'film',
 'largely',
 'consisted',
 'of',
 'a',
 'stick',
 'figure',
 'moving',
 'about',
 'and',
 'encountering',
 'all',
 'manner',
 'of',
 'morphing',
 'objects',
 ',',
 'such',
 'as',
 'a',
 'wine',
 'bottle',
 'that',
 'transforms',
 'into',
 'a',
 'flower',
 '.']

In [6]:
import nltk

In [7]:
def tag(tokens):
    # nltk.pos_tag returns a list of (token, Penn Treebank tag) pairs
    return nltk.pos_tag(tokens)

In [8]:
tag(tokenize(s1))

[('The', 'DT'),
 ('state', 'NN'),
 ('tree', 'NN'),
 ('is', 'VBZ'),
 ('the', 'DT'),
 ('Longleaf', 'NNP'),
 ('Pine', 'NNP'),
 (',', ','),
 ('the', 'DT'),
 ('state', 'NN'),
 ('flower', 'NN'),
 ('is', 'VBZ'),
 ('the', 'DT'),
 ('Camellia', 'NNP'),
 ('.', '.')]

![pen_paper_for_s1](IMG_0115.JPG)

In [9]:
tag(tokenize(s2))

[('The', 'DT'),
 ('film', 'NN'),
 ('largely', 'RB'),
 ('consisted', 'VBN'),
 ('of', 'IN'),
 ('a', 'DT'),
 ('stick', 'JJ'),
 ('figure', 'NN'),
 ('moving', 'VBG'),
 ('about', 'RB'),
 ('and', 'CC'),
 ('encountering', 'VBG'),
 ('all', 'DT'),
 ('manner', 'NN'),
 ('of', 'IN'),
 ('morphing', 'VBG'),
 ('objects', 'NNS'),
 (',', ','),
 ('such', 'JJ'),
 ('as', 'IN'),
 ('a', 'DT'),
 ('wine', 'NN'),
 ('bottle', 'NN'),
 ('that', 'WDT'),
 ('transforms', 'VBZ'),
 ('into', 'IN'),
 ('a', 'DT'),
 ('flower', 'NN'),
 ('.', '.')]

![pen_paper_for_s2](IMG_0116.JPG)

#### Suggest a non-lexicalised transformation based on one of the rule templates of the lecture handout, to correct the part-of-speech error in the following word/tag sequence:
a/DT very/RB difficult/RB question/NN
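
One possible answer, sketched under the assumption that the handout includes the Brill-style template "change tag a to b when the following word is tagged z": change `RB` to `JJ` if the next tag is `NN`. The rule refers to tags only, so it is non-lexicalised. The helper `apply_rule` below is a hypothetical illustration, not code from the lecture.

```python
def apply_rule(tagged, from_tag, to_tag, next_tag):
    """Change from_tag to to_tag when the following token is tagged next_tag.

    Hypothetical helper illustrating one transformation-based rule template.
    """
    result = list(tagged)
    for i in range(len(result) - 1):
        word, current = result[i]
        if current == from_tag and result[i + 1][1] == next_tag:
            result[i] = (word, to_tag)
    return result


seq = [("a", "DT"), ("very", "RB"), ("difficult", "RB"), ("question", "NN")]

# Rule: RB -> JJ if the next tag is NN
corrected = apply_rule(seq, "RB", "JJ", "NN")
print(corrected)
```

Only `difficult` is retagged, because `very` is followed by `RB` and not by `NN`.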

#### Explain step-by-step and in your own words why we use the simplified way to calculate ....

![pen_paper_for_simplification_reason](IMG_0117.JPG)
![pen_paper_for_simplification_reason](IMG_0118.JPG)

#### Calculate the probabilities of the two part-of-speech sequences

In [10]:
# P(race|VB) * P(VB|TO)  vs.  P(race|NN) * P(NN|TO)
(0.00012 * 0.83) > (0.00057 * 0.00047)

True

So `PPSS VB TO VB` is the more probable of the two tag sequences.
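
To see how large the gap is, the same comparison can be written out with named factors. This is a minimal sketch that reuses the four probabilities from the cell above; the variable names are my own.

```python
# Probabilities copied from the comparison above
p_race_given_vb = 0.00012  # P(race | VB)
p_vb_given_to = 0.83       # P(VB | TO)
p_race_given_nn = 0.00057  # P(race | NN)
p_nn_given_to = 0.00047    # P(NN | TO)

# Only the factors that differ between the two sequences are needed
score_vb = p_race_given_vb * p_vb_given_to
score_nn = p_race_given_nn * p_nn_given_to

print(f"VB reading: {score_vb:.3e}")
print(f"NN reading: {score_nn:.3e}")
print(f"ratio VB/NN: {score_vb / score_nn:.0f}")
```

With these values the verb reading comes out roughly 370 times more probable than the noun reading.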

## Part-of-Speech Tagging with the Tree Tagger

#### Describe the output of the Tree Tagger. What information is provided, and what is the output format?
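The output has one token per line with three tab-separated columns: `word tag lemma`. Words the tagger does not know get the lemma `<unknown>`.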

#### How many different part-of-speech tags did the Tree Tagger assign to your corpus words? Which are the two most frequent part-of-speech types?

In [11]:
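tags = []
unknowns = []

with open("tagged.txt", encoding="utf-8") as f:
    for line in f:
        # Each Tree Tagger line is expected to be: word <TAB> tag <TAB> lemma
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 3:
            word, pos, lemma = fields
            tags.append(pos)
            if lemma == "<unknown>":
                unknowns.append(word)
        else:
            # Lines without exactly three columns are also collected as unknown
            unknowns.append(fields[0])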

In [12]:
uniq_tags = set(tags)

In [13]:
print("# of tags : ", len(tags))
print("# of unique tags: ", len(uniq_tags))

# of tags :  32815
# of unique tags:  43


In [14]:
from collections import Counter

tag_freq = Counter(tags)
tag_freq

Counter({'DT': 2858,
         'NN': 4779,
         'VBZ': 636,
         'NP': 2372,
         ',': 2330,
         'SENT': 1038,
         'RB': 957,
         'VBD': 611,
         'IN': 3447,
         'VBG': 460,
         'CC': 1211,
         'NNS': 2622,
         'JJ': 2691,
         'WDT': 177,
         'PP': 304,
         'POS': 185,
         'VBN': 850,
         'TO': 483,
         'PP$': 267,
         '``': 559,
         "''": 690,
         'UH': 4,
         'VBP': 492,
         'VB': 426,
         ':': 209,
         'CD': 579,
         '(': 455,
         ')': 458,
         'RP': 40,
         'MD': 139,
         'JJR': 56,
         'NPS': 76,
         'JJS': 57,
         'RBS': 50,
         'WRB': 85,
         'RBR': 37,
         'FW': 34,
         'EX': 31,
         'WP': 38,
         'WP$': 2,
         'LS': 5,
         'PDT': 8,
         'SYM': 7})

The two most frequent tags are `NN` (4779) and `IN` (3447); `DT` comes third with 2858.
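
The counter is displayed in insertion order, not by frequency, so the ranking is easier to read with `Counter.most_common`. The sketch below copies only the five largest counts from the output above so that it runs on its own.

```python
from collections import Counter

# Excerpt of the tag counts shown above (the five largest entries)
tag_freq = Counter({"DT": 2858, "NN": 4779, "IN": 3447, "NNS": 2622, "JJ": 2691})

# most_common(n) returns the n highest counts in descending order
top_two = tag_freq.most_common(2)
print(top_two)  # [('NN', 4779), ('IN', 3447)]
```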

#### How many unknown words are in the corpus (types and tokens)? Can you think of an alternative way to deal with unknown words, in comparison to what the Tree Tagger does?

In [15]:
print("# of unknown words : ", len(unknowns))

# of unknown words :  1091
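
The figure above counts tokens: a word such as `Asteraceae` is counted every time it occurs. The number of types is the number of distinct entries. A minimal sketch, using a short excerpt of the `unknowns` list so that it is self-contained:

```python
from collections import Counter

# Short excerpt of the unknown-word list, for illustration only
unknowns = ["Longleaf", "morphing", "Asteraceae", "Asteraceae",
            "annuus", "Lactuca", "sativa", "Lactuca"]

unknown_counts = Counter(unknowns)

n_tokens = len(unknowns)        # every occurrence counts
n_types = len(unknown_counts)   # each distinct word counts once

print("unknown tokens:", n_tokens)
print("unknown types: ", n_types)
```

Running the same two lines on the full list gives the type count for the corpus.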


In [16]:
unknowns

['Longleaf',
 'morphing',
 'Hyacinthus',
 'subcomponent',
 'spring/',
 'summer/',
 'anthophytes',
 'clade',
 '-',
 '-',
 'Gnetales',
 'Bennettitales',
 'Boddhisatvas',
 'Mahayana',
 'Asparagales',
 'Alismatales',
 'Apiales',
 'Asterales',
 'Asteraceae',
 'Asteraceae',
 'annuus',
 'Lactuca',
 'sativa',
 'Cichorium',
 'Jänner',
 'Januar',
 'heuer',
 'Erdäpfel',
 'Kartoffeln',
 'Aardappel',
 'Schlagobers',
 'Schlagsahne',
 'Faschiertes',
 'Hackfleisch',
 'Fisolen',
 'Gartenbohne',
 'Karfiol',
 'Blumenkohl',
 'Karotte',
 'Möhre',
 'Kohlsprossen',
 'Rosenkohl',
 'Marillen',
 'Aprikosen',
 'Paradeiser',
 'Tomaten',
 'Palatschinken',
 'Pfannkuchen',
 'Topfen',
 'semi-sweet',
 'Kren',
 'Meerrettich',
 's',
 'Även',
 'blomma',
 'de-Nazification',
 'Alliaceae',
 'Asteraceae',
 'Compositae',
 'Asteraceae',
 'calathid',
 'calathidium',
 'Asteraceae',
 'pseudanthia',
 'Asteraceae',
 'Lactuca',
 'sativa',
 'Cichorium',
 'scolymus',
 'annuus',
 'Smallanthus',
 'sonchifolius',
 'yacón',
 'Carthamus',


Some of the unknown words come from languages other than English, and some are symbols. There are a few ways to deal with them:

 - Tag foreign words using the Tree Tagger parameter files for those languages
 - Assign tags based on the tags of the surrounding words (probabilistic method)
 - Use the morphology of the unknown word
 - Use the probability distribution over the available tags to predict tags for unknown words
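
The morphology idea can be sketched as a small rule-based guesser. The suffix rules and the function `guess_tag` are hypothetical and deliberately crude. They only illustrate the approach, using tag names from the Tree Tagger tagset seen above.

```python
import re

# Hypothetical suffix rules, checked in order (first match wins)
SUFFIX_RULES = [
    (re.compile(r".*ing$"), "VBG"),
    (re.compile(r".*ly$"), "RB"),
    (re.compile(r".*ed$"), "VBN"),
    (re.compile(r".*s$"), "NNS"),
]


def guess_tag(word, default="NN"):
    """Guess a tag for an unknown word from its surface form."""
    # Capitalised unknown words are treated as proper nouns
    if word[:1].isupper():
        return "NP"
    for pattern, tag in SUFFIX_RULES:
        if pattern.match(word):
            return tag
    # Fall back to the most frequent tag in the corpus
    return default


for w in ["morphing", "anthophytes", "Asteraceae", "clade"]:
    print(w, guess_tag(w))
```

A guesser like this will still mislabel words such as `semi-sweet`, so in practice it would be combined with the contextual tag probabilities from the second bullet.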