# Lab 2 - Tagging

TDT4310, Spring 2022\
Lab date: February 1, 2022\
Prof. Björn Gambäck\
TA. Tollef Jørgensen

Student: Mathias Bjørgum, mathigb

## Instructions

This lab uses ``python 3.10.2`` running in a ``conda`` environment with `nltk version: 3.6.7`. The following modules will be imported and used:

In [1]:
import nltk
import matplotlib.pyplot as plt
from collections import defaultdict, OrderedDict
from sklearn.model_selection import train_test_split as split
from nltk.tag import DefaultTagger, UnigramTagger, RegexpTagger, BigramTagger, TrigramTagger

## Exercises

### 1 - Ambiguity

Use the *Brown Corpus* and...

#### a) Print the 5 most frequent tags in the corpus.

In [2]:
brown_corpus = nltk.corpus.brown

nltk.FreqDist([tag for word, tag in brown_corpus.tagged_words()]).most_common(5)

[('NN', 152470), ('IN', 120557), ('AT', 97959), ('JJ', 64028), ('.', 60638)]

#### b) How many words are ambiguous, in the sense that they appear with more than two tags?

In [3]:
tags = defaultdict(set)

# Each word gets its own key in a dictionary
for word, tag in brown_corpus.tagged_words():
    tags[word].add(tag)

count = 0

# Counts all words that have more than two distinct tags
for word, tag_list in tags.items():
    if len(tag_list) > 2:
        count +=1

print(count)

1543


The number of words that appear with more than two tags is 1543.
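The count depends on where the ambiguity threshold is set. The sketch below shows the same counting approach on a small, made-up list of tagged words (a stand-in for `brown_corpus.tagged_words()`); the function name `count_ambiguous` and its `min_tags` parameter are assumptions introduced for illustration.

```python
from collections import defaultdict

# Hypothetical toy data standing in for brown_corpus.tagged_words().
toy_tagged_words = [
    ("still", "RB"), ("still", "JJ"), ("still", "NN"),
    ("run", "VB"), ("run", "NN"),
    ("the", "AT"),
]

def count_ambiguous(tagged_words, min_tags=2):
    """Count word types that occur with more than `min_tags` distinct tags."""
    tags_per_word = defaultdict(set)
    for word, tag in tagged_words:
        tags_per_word[word].add(tag)
    return sum(1 for tag_set in tags_per_word.values() if len(tag_set) > min_tags)

more_than_two = count_ambiguous(toy_tagged_words, min_tags=2)  # only "still"
more_than_one = count_ambiguous(toy_tagged_words, min_tags=1)  # "still" and "run"
print(more_than_two, more_than_one)
```

With `min_tags=1` the function counts every word that has at least two tags, which is the looser reading of "ambiguous".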

#### c) Print the percentage of ambiguous words in the corpus.

In [4]:
# Creates a set of all the words
words = set([word for word, tag in brown_corpus.tagged_words()])

print(f"{count} / {len(words)} = {count / len(words)}")

1543 / 56057 = 0.027525554346468774


We can see that about 2.753% of the words are ambiguous.

#### d) Find the top 5 words (longer than 4 characters) with the highest number of distinct tags. Select one of them and print out a sentence with the word in its different forms

In [5]:
words_longer_than_four = [(str(word), tag_list) for word, tag_list in tags.items() if len(str(word)) > 4]
ordered_based_on_value_length = OrderedDict(sorted(words_longer_than_four, key=lambda item: len(item[1]), reverse=True))

top_five = [ordered_based_on_value_length.popitem(False) for i in range(5)]

print(top_five)

[('Little', {'JJ-HL', 'JJ', 'AP-HL', 'QL', 'NP', 'JJ-TL', 'AP'}), ('Chinese', {'JJ-HL', 'NP-NC', 'JJ-NC', 'JJ', 'NP', 'NPS', 'JJ-TL'}), ('still', {'JJ', 'NN', 'QL', 'VB', 'QLP', 'RB'}), ('right', {'NN', 'JJ', 'NN-HL', 'QL', 'NR', 'RB'}), ('Light', {'JJ-HL', 'NN', 'JJ', 'NN-TL', 'VB', 'JJ-TL'})]


We can see that the top five words with the highest number of distinct tags are:

`[('Little', {'AP-HL', 'AP', 'NP', 'QL', 'JJ-HL', 'JJ-TL', 'JJ'}), ('Chinese', {'JJ-NC', 'NPS', 'NP-NC', 'NP', 'JJ-HL', 'JJ-TL', 'JJ'}), ('still', {'RB', 'QL', 'VB', 'QLP', 'NN', 'JJ'}), ('right', {'NN-HL', 'NR', 'RB', 'QL', 'NN', 'JJ'}), ('Light', {'JJ-HL', 'VB', 'JJ-TL', 'NN', 'JJ', 'NN-TL'})]`

I select the word *"still"* and find one sentence for each of its tags.

In [6]:
def find_one_sentence(target_word: str, target_tag: str):
    '''Returns the first tagged sentence in which `target_word` carries `target_tag`.'''
    for sent_tags in brown_corpus.tagged_sents():
        for word, word_tag in sent_tags:
            if word == target_word and word_tag == target_tag:
                return sent_tags

In [7]:
target_word = "still"
target_tags = tags[target_word]
sents = []

for tag in target_tags:
    sents.append([find_one_sentence(target_word, tag), tag])

In [8]:
untagged_sents = []

for tagged_sent in sents:
    sent = ""
    for word, tag in tagged_sent[0]:
        sent += word
        sent += " "
    
    untagged_sents.append([sent, tagged_sent[1]])

(untagged_sents)

[['Similarly , in presenting still photographs of early jazz groups , the program allowed no time for a close perusal . ',
  'JJ'],
 ["I aim to keep a little whisky still back in the ridge for my pleasure '' . ",
  'NN'],
 ['Further improvements in earnings of the Kansas Turnpike are expected late in 1961 , with the opening of a new bypass at Wichita , and still later when the turnpike gets downtown connections in both Kansas City , Kans. , and Kansas City , Mo. . ',
  'QL'],
 ['Recent statements by well-known scientists regarding the destructive power of the newest nuclear bombs and the deadly fall-outs should be sufficient to still the voices of those who advocate nuclear warfare instead of negotiations . ',
  'VB'],
 ['In the future , quantitative demand will be greater because of the expansion of the economy , and the qualitative need will be greater still . ',
  'QLP'],
 ["While details are still to be worked out , Ratcliff said he expects to tell home folks in Dallas why he think

This gives us six sentences where *"still"* has different roles. The sentences are as follows:

```
"While details are still to be worked out , Ratcliff said he expects to tell home folks in Dallas why he thinks Berry's proposed constitutional amendment should be rejected . ", role:  'RB'

'Further improvements in earnings of the Kansas Turnpike are expected late in 1961 , with the opening of a new bypass at Wichita , and still later when the turnpike gets downtown connections in both Kansas City , Kans. , and Kansas City , Mo. . ',role:  'QL'

'Recent statements by well-known scientists regarding the destructive power of the newest nuclear bombs and the deadly fall-outs should be sufficient to still the voices of those who advocate nuclear warfare instead of negotiations . ', role:  'VB'

'In the future , quantitative demand will be greater because of the expansion of the economy , and the qualitative need will be greater still . ', role:  'QLP'

"I aim to keep a little whisky still back in the ridge for my pleasure '' . ", role:  'NN'

'Similarly , in presenting still photographs of early jazz groups , the program allowed no time for a close perusal . ', role:  'JJ'
```
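The search above scans the corpus once per tag. As a possible alternative, the sketch below collects the first example sentence for every tag of a word in a single pass. The toy sentences and the function name `first_sentence_per_tag` are hypothetical and only illustrate the idea.

```python
def first_sentence_per_tag(tagged_sents, target_word):
    """Map each tag of `target_word` to the first sentence where it occurs with that tag."""
    examples = {}
    for sent in tagged_sents:
        for word, tag in sent:
            if word == target_word and tag not in examples:
                # Join the words only; the tags are dropped for readability.
                examples[tag] = " ".join(w for w, _ in sent)
    return examples

# Made-up tagged sentences standing in for brown_corpus.tagged_sents().
toy_sents = [
    [("He", "PPS"), ("is", "BEZ"), ("still", "RB"), ("here", "RB")],
    [("A", "AT"), ("still", "JJ"), ("photograph", "NN")],
    [("still", "RB"), ("waiting", "VBG")],
]

examples = first_sentence_per_tag(toy_sents, "still")
for tag, sentence in examples.items():
    print(f"{tag}: {sentence}")
```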

### 2 - Training a tagger

Explore the performance of a tagger using the Brown Corpus and NPS Chat Corpus as data sources, with
different ratios of train/test data. Use the following ratios:

* Brown 90% / NPS 10%
* Brown 50% / NPS 50%
* NPS 90% / Brown 10%
* NPS 50% / Brown 50%


In [9]:
brown_corpus = nltk.corpus.brown
NPS_corpus = nltk.corpus.nps_chat

I define the train/test splits and add them to a list to use later. I choose to use the universal tagset in this exercise.

In [10]:
def get_train_test_permuatations(brown_corpus, NPS_corpus, tagset_universal = True):
    random_state = 4310
    if not tagset_universal:
        brown_train_90, brown_test_10 = split(brown_corpus.tagged_sents(), train_size=0.9, random_state=random_state)
        brown_train_50, brown_test_50 = split(brown_corpus.tagged_sents(), train_size=0.5, random_state=random_state)
        NPS_train_90, NPS_test_10 = split(NPS_corpus.tagged_posts(), train_size=0.9, random_state=random_state)
        NPS_train_50, NPS_test_50 = split(NPS_corpus.tagged_posts(), train_size=0.5, random_state=random_state)

    if tagset_universal:
        brown_train_90, brown_test_10 = split(brown_corpus.tagged_sents(tagset="universal"), train_size=0.9, random_state=random_state)
        brown_train_50, brown_test_50 = split(brown_corpus.tagged_sents(tagset="universal"), train_size=0.5, random_state=random_state)
        NPS_train_90, NPS_test_10 = split(NPS_corpus.tagged_posts(tagset="universal"), train_size=0.9, random_state=random_state)
        NPS_train_50, NPS_test_50 = split(NPS_corpus.tagged_posts(tagset="universal"), train_size=0.5, random_state=random_state)
    
    train_test_permutations = [(brown_train_90, NPS_test_10),
        (brown_train_50, NPS_test_50),
        (NPS_train_90, brown_test_10),
        (NPS_train_50, brown_test_50)]
    return train_test_permutations


In [11]:
train_test_permutations = get_train_test_permuatations(
    brown_corpus, NPS_corpus, True)

In [12]:
from typing import List, Tuple

def find_most_common_tag(dataset: List[List[Tuple[str, str]]]) -> str:
    '''Assumes that the input is a list of tagged sentences. Returns the most common tag.'''
    tags = []
    for sent in dataset:
        for word, tag in sent:
            tags.append(tag)

    return nltk.FreqDist(tags).most_common(1)[0][0]

In [13]:
def get_default_tagger_from_train_split(train_split):
    return DefaultTagger(find_most_common_tag(train_split))

#### a) Create a *DefaultTagger* using the most common tag in each corpus as the default tag

I assume that we are supposed to create a new *DefaultTagger* for each permutation of train/test data.

In [14]:
default_taggers = [get_default_tagger_from_train_split(train_split) for train_split, test_split in train_test_permutations]

#### b) Create a combined tagger with the RegEx tagger (see Ch. 5, sec. 4.2) with an initial backoff using the most common default tag. Then, use n-gram taggers as backoff taggers (e.g., UnigramTagger, BigramTagger, TrigramTagger). The ordering is up to you, but justify your choice. Calculate the accuracy of each of the four train/test permutations.

In [15]:
def print_accuracy_of_taggers(taggers: List, train_test_permutations: List) -> None:
    '''
    Prints the accuracy of each tagger and its corresponding test split.

    If the length of taggers is equal to 1, it will only use the one tagger.
    '''

    for i, permutation in enumerate(train_test_permutations):
        permutation_no = i
        if len(taggers) == 1: i = 0
        print(f"Accuracy of permuatation {permutation_no+1}: {taggers[i].accuracy(permutation[1]):2f}")

In [16]:
def get_combined_reg_ex_tagger(train_split, n_gram: int = 0, tagset_universal: bool = True):
    '''
    Takes a training dataset `train_split` and a number of n-gram `n_gram` to create a `RegexpTagger`.
    
    `n_gram` defaults to zero, i.e a default tagger. `tagset_universal` defaults to True.
    '''
    if tagset_universal:
        patterns = [
        (r'.*ing$', 'VERB'), # gerunds
        (r'.*ed$', 'VERB'), # simple past
        (r'.*es$', 'VERB'), # 3rd singular present
        (r'.*ould$', 'VERB'), # modals
        (r'.*\'s$', 'NOUN'), # possessive nouns
        (r'.*s$', 'NOUN'), # plural nouns
        (r'^-?[0-9]+(\.[0-9]+)?$', 'NUM'), # cardinal numbers
        ]
    if not tagset_universal:
        patterns = [
        (r'.*ing$', 'VBG'), # gerunds
        (r'.*ed$', 'VBD'), # simple past
        (r'.*es$', 'VBZ'), # 3rd singular present
        (r'.*ould$', 'MD'), # modals
        (r'.*\'s$', 'NN$'), # possessive nouns
        (r'.*s$', 'NNS'), # plural nouns
        (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'), # cardinal numbers
        ]

    t0 = get_default_tagger_from_train_split(train_split)
    t1 = UnigramTagger(train_split, backoff=t0)
    t2 = BigramTagger(train_split, backoff=t1)
    t3 = TrigramTagger(train_split, backoff=t2)

    if n_gram == 1:
        return RegexpTagger(patterns, backoff = t1)
    if n_gram == 2:
        return RegexpTagger(patterns, backoff = t2)
    if n_gram == 3:
        return RegexpTagger(patterns, backoff = t3)

    return RegexpTagger(patterns, backoff = t0)
    

In [17]:
for i in range(4):
    print(f"N-gram tagger = {i}")
    print_accuracy_of_taggers(
        taggers = [get_combined_reg_ex_tagger(split[0], n_gram = i, tagset_universal=True)
        for split in train_test_permutations],
        train_test_permutations=train_test_permutations
    )

N-gram tagger = 0
Accuracy of permuatation 1: 0.231729
Accuracy of permuatation 2: 0.239749
Accuracy of permuatation 3: 0.280140
Accuracy of permuatation 4: 0.279242
N-gram tagger = 1
Accuracy of permuatation 1: 0.602626
Accuracy of permuatation 2: 0.598718
Accuracy of permuatation 3: 0.784156
Accuracy of permuatation 4: 0.774193
N-gram tagger = 2
Accuracy of permuatation 1: 0.602845
Accuracy of permuatation 2: 0.598805
Accuracy of permuatation 3: 0.784019
Accuracy of permuatation 4: 0.772226
N-gram tagger = 3
Accuracy of permuatation 1: 0.600656
Accuracy of permuatation 2: 0.596449
Accuracy of permuatation 3: 0.777352
Accuracy of permuatation 4: 0.758848


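**NOTE**: This takes around 2 minutes to run. A likely reason is that `get_combined_reg_ex_tagger` trains the unigram, bigram and trigram taggers on every call, regardless of `n_gram`, and it is called 16 times here (4 values of `n_gram` × 4 permutations).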

Based on this, it looks like training on the **NPS dataset** and testing on the **Brown dataset** gave the best accuracy. The accuracy increases drastically from the default tagger to the unigram tagger. Adding bigram and trigram backoffs changes it by less than two percentage points and mostly lowers it slightly, so the unigram tagger gives the best or near-best result for every permutation.
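The order of the backoff chain matters, because a tagger only hands a word to its backoff when it cannot tag the word itself. With the `RegexpTagger` first, any word that matches a pattern never reaches the n-gram taggers, which may be one reason the accuracy stays modest. The sketch below illustrates this in plain Python; the patterns, the lookup table and both function names are assumptions made up for the example, not the trained taggers from the lab.

```python
import re

# Illustrative patterns and lookup table (assumptions, not the lab's trained model).
PATTERNS = [(re.compile(r".*ing$"), "VERB"), (re.compile(r".*s$"), "NOUN")]
LOOKUP = {"is": "VERB", "this": "DET", "dogs": "NOUN"}
DEFAULT_TAG = "NOUN"

def tag_regex_first(word):
    """Regex patterns are tried first; the lookup table is only a backoff."""
    for pattern, tag in PATTERNS:
        if pattern.match(word):
            return tag
    return LOOKUP.get(word, DEFAULT_TAG)

def tag_lookup_first(word):
    """The lookup table is tried first; regex patterns handle unknown words."""
    if word in LOOKUP:
        return LOOKUP[word]
    for pattern, tag in PATTERNS:
        if pattern.match(word):
            return tag
    return DEFAULT_TAG

words = ["this", "is", "running", "dogs"]
regex_first = [tag_regex_first(w) for w in words]
lookup_first = [tag_lookup_first(w) for w in words]
print(regex_first)
print(lookup_first)
```

In the regex-first order, "this" and "is" are caught by the `.*s$` pattern and tagged as nouns, even though the lookup table knows better.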

#### c) Select a dataset split of your choice and print a table containing the precision, recall and f-measure for the top 5 most common tags (look up truncate in the documentation) and sort each score by count. Do this for all your chosen variations of backoffs (e.g., DefaultTagger, UnigramTagger and BigramTagger).

I choose dataset split no. 3, since it gave the best performance in the previous task. I also strip the tags of the test set.

In [18]:
dataset = train_test_permutations[2]
taggers = [get_combined_reg_ex_tagger(dataset[0], n_gram = i) for i in range(4)]

tags = []
for post in dataset[0]:
    for word, tag in post:
        tags.append(tag)

top_five_tags = nltk.FreqDist(tags).most_common(5)
top_five_tags

[('NOUN', 8695), ('VERB', 8077), ('X', 5901), ('PRON', 4202), ('.', 3828)]

In [19]:
from typing import List

def strip_tag_or_word(tagged_sents: List[List[Tuple[str, str]]], strip_words: bool = False):
    '''Strips words or tags of tagged sentences depending on a boolean. `False` = strip tag'''
    stripped_sents = []
    for sent in tagged_sents:
        sentence = []
        for word, tag in sent:
            if strip_words:
                sentence.append(tag)
            else:
                sentence.append(word)
        stripped_sents.append(sentence)

    return stripped_sents

In [20]:
test_set_no_labels = strip_tag_or_word(dataset[1], False)

test_tags = [taggers[i].tag_sents(test_set_no_labels) for i in range(len(taggers))]

In [21]:
from nltk.metrics import ConfusionMatrix

ref = [x for l in strip_tag_or_word(dataset[1], True) for x in l]
for i, test_set in enumerate(test_tags):
    tst = [x for l in strip_tag_or_word(test_set, True) for x in l]
    cm = ConfusionMatrix(ref,tst)
    # Maybe a bit cheesy to do it this way, but it is the best way I can think of.
    cm._values = [tag for tag, _ in top_five_tags]
    print(f"N-gram tagger: {i}")
    print(cm.evaluate(sort_by_count=True))

N-gram tagger: 0
 Tag | Prec.  | Recall | F-measure
-----+--------+--------+-----------
NOUN | 0.4435 | 0.9101 | 0.5964
VERB | 0.7293 | 0.3677 | 0.4889
   . | 0.0000 | 0.0000 | 0.0000
PRON | 0.0000 | 0.0000 | 0.0000
   X | 0.0000 | 0.0000 | 0.0000

N-gram tagger: 1
 Tag | Prec.  | Recall | F-measure
-----+--------+--------+-----------
NOUN | 0.8047 | 0.8812 | 0.8413
VERB | 0.8127 | 0.7397 | 0.7745
   . | 0.9999 | 0.9412 | 0.9696
PRON | 0.9983 | 0.9724 | 0.9852
   X | 0.0523 | 0.0692 | 0.0596

N-gram tagger: 2
 Tag | Prec.  | Recall | F-measure
-----+--------+--------+-----------
NOUN | 0.8057 | 0.8866 | 0.8442
VERB | 0.8150 | 0.7380 | 0.7746
   . | 0.9999 | 0.9409 | 0.9695
PRON | 0.9900 | 0.9724 | 0.9811
   X | 0.1125 | 0.0692 | 0.0857

N-gram tagger: 3
 Tag | Prec.  | Recall | F-measure
-----+--------+--------+-----------
NOUN | 0.8051 | 0.8867 | 0.8439
VERB | 0.8151 | 0.7377 | 0.7745
   . | 0.9999 | 0.9248 | 0.9609
PRON | 0.9899 | 0.9680 | 0.9788
   X | 0.0281 | 0.0692 | 0.0400
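For reference, the three scores in these tables can be computed directly from the flat lists of reference and predicted tags. The following is a minimal sketch, assuming the balanced F-measure (the harmonic mean of precision and recall); the function name `per_tag_scores` and the toy tag lists are made up for illustration.

```python
from collections import Counter

def per_tag_scores(reference, predicted):
    """Return {tag: (precision, recall, f_measure)} for flat tag lists of equal length."""
    true_pos, pred_count, ref_count = Counter(), Counter(), Counter()
    for ref_tag, pred_tag in zip(reference, predicted):
        ref_count[ref_tag] += 1
        pred_count[pred_tag] += 1
        if ref_tag == pred_tag:
            true_pos[ref_tag] += 1

    scores = {}
    for tag in ref_count:
        # Precision is 0 when the tag was never predicted.
        precision = true_pos[tag] / pred_count[tag] if pred_count[tag] else 0.0
        recall = true_pos[tag] / ref_count[tag]
        denom = precision + recall
        f_measure = 2 * precision * recall / denom if denom else 0.0
        scores[tag] = (precision, recall, f_measure)
    return scores

# A tagger that always answers NOUN, similar in spirit to a default tagger.
reference = ["NOUN", "NOUN", "VERB", "PRON"]
predicted = ["NOUN", "NOUN", "NOUN", "NOUN"]
scores = per_tag_scores(reference, predicted)
print(scores["NOUN"])
```

The toy example shows the same pattern as the first table: the default tag gets high recall but low precision, and tags that are never predicted score zero.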



#### d) Using the *Brown Corpus*, create a baseline tagger (e.g. Unigram) with a lookup model (see Ch. 5, sec. 4.3). The model should handle the 200 most common words and store the tags. Evaluate the accuracy on the above permutations of train/test data.

In [22]:
fd = nltk.FreqDist(brown_corpus.words())
cfd = nltk.ConditionalFreqDist(brown_corpus.tagged_words(tagset="universal"))
most_common_200 = fd.most_common(200)
likely_tags = dict((word, cfd[word].max()) for (word, _) in most_common_200)

In [23]:
baseline_tagger = nltk.UnigramTagger(model = likely_tags, 
    backoff=get_default_tagger_from_train_split(brown_corpus.tagged_sents()))
    
print_accuracy_of_taggers([baseline_tagger], 
    train_test_permutations=train_test_permutations)

Accuracy of permuatation 1: 0.321007
Accuracy of permuatation 2: 0.319883
Accuracy of permuatation 3: 0.547006
Accuracy of permuatation 4: 0.547247


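Not surprisingly, this model performs significantly better when testing on the *Brown* dataset, since that is the one it has been trained on. Note that the backoff default tag is computed from `brown_corpus.tagged_sents()` without `tagset="universal"`, so it is `NN`, while the test sets use universal tags. The backoff therefore never matches a reference tag, which lowers all four accuracies.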
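The lookup model itself is just a dictionary from the most frequent words to their most likely tag, with a default tag for everything else. A minimal plain-Python sketch of that idea follows; the toy data, the function names `build_lookup_model` and `lookup_tag`, and the choice of `NOUN` as default tag are assumptions for illustration.

```python
from collections import Counter, defaultdict

def build_lookup_model(tagged_words, n_most_common):
    """Map each of the n most frequent words to its most frequent tag."""
    word_counts = Counter(word for word, _ in tagged_words)
    tag_counts = defaultdict(Counter)
    for word, tag in tagged_words:
        tag_counts[word][tag] += 1
    return {
        word: tag_counts[word].most_common(1)[0][0]
        for word, _ in word_counts.most_common(n_most_common)
    }

def lookup_tag(words, model, default_tag):
    """Tag known words from the model and fall back to `default_tag` otherwise."""
    return [(word, model.get(word, default_tag)) for word in words]

# Made-up tagged words standing in for the Brown Corpus.
toy = [("the", "DET"), ("dog", "NOUN"), ("the", "DET"), ("run", "VERB"),
       ("run", "NOUN"), ("run", "VERB"), ("cat", "NOUN")]

model = build_lookup_model(toy, n_most_common=2)
tagged = lookup_tag(["the", "cat", "run"], model, default_tag="NOUN")
print(tagged)
```

Here only "run" and "the" make it into the model, so "cat" receives the default tag.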

#### e) With an arbitrary text from another corpus (or an article you scraped in Lab 1), use the tagger you just created and print a few tagged sentences.

In [24]:
import random

gutenberg_corpus = nltk.corpus.gutenberg
bible = gutenberg_corpus.sents('bible-kjv.txt')
tagged_bible = baseline_tagger.tag_sents(bible)

for i in range(2):
    print(f"sentence {i+1}")
    randint = random.randrange(len(tagged_bible))  # randint(0, len(...)) could index past the end
    print(tagged_bible[randint])
    print("\n")

sentence 1
[('30', 'NN'), (':', '.'), ('22', 'NN'), ('Moreover', 'NN'), ('the', 'DET'), ('LORD', 'NN'), ('spake', 'NN'), ('unto', 'NN'), ('Moses', 'NN'), (',', '.'), ('saying', 'NN'), (',', '.'), ('30', 'NN'), (':', '.'), ('23', 'NN'), ('Take', 'NN'), ('thou', 'NN'), ('also', 'ADV'), ('unto', 'NN'), ('thee', 'NN'), ('principal', 'NN'), ('spices', 'NN'), (',', '.'), ('of', 'ADP'), ('pure', 'NN'), ('myrrh', 'NN'), ('five', 'NN'), ('hundred', 'NN'), ('shekels', 'NN'), (',', '.'), ('and', 'CONJ'), ('of', 'ADP'), ('sweet', 'NN'), ('cinnamon', 'NN'), ('half', 'NN'), ('so', 'ADV'), ('much', 'ADJ'), (',', '.'), ('even', 'ADV'), ('two', 'NUM'), ('hundred', 'NN'), ('and', 'CONJ'), ('fifty', 'NN'), ('shekels', 'NN'), (',', '.'), ('and', 'CONJ'), ('of', 'ADP'), ('sweet', 'NN'), ('calamus', 'NN'), ('two', 'NUM'), ('hundred', 'NN'), ('and', 'CONJ'), ('fifty', 'NN'), ('shekels', 'NN'), (',', '.'), ('30', 'NN'), (':', '.'), ('24', 'NN'), ('And', 'CONJ'), ('of', 'ADP'), ('cassia', 'NN'), ('five', 'NN')

Example output:

```
sentence 1
[('39', 'NN'), (':', '.'), ('23', 'NN'), ('The', 'DET'), ('quiver', 'NN'), ('rattleth', 'NN'), ('against', 'ADP'), ('him', 'PRON'), (',', '.'), ('the', 'DET'), ('glittering', 'NN'), ('spear', 'NN'), ('and', 'CONJ'), ('the', 'DET'), ('shield', 'NN'), ('.', '.')]

sentence 2
[('5', 'NN'), (':', '.'), ('11', 'NN'), ('And', 'CONJ'), ('the', 'DET'), ('children', 'NN'), ('of', 'ADP'), ('Gad', 'NN'), ('dwelt', 'NN'), ('over', 'ADP'), ('against', 'ADP'), ('them', 'PRON'), (',', '.'), ('in', 'ADP'), ('the', 'DET'), ('land', 'NN'), ('of', 'ADP'), ('Bashan', 'NN'), ('unto', 'NN'), ('Salcah', 'NN'), (':', '.'), ('5', 'NN'), (':', '.'), ('12', 'NN'), ('Joel', 'NN'), ('the', 'DET'), ('chief', 'NN'), (',', '.'), ('and', 'CONJ'), ('Shapham', 'NN'), ('the', 'DET'), ('next', 'NN'), (',', '.'), ('and', 'CONJ'), ('Jaanai', 'NN'), (',', '.'), ('and', 'CONJ'), ('Shaphat', 'NN'), ('in', 'ADP'), ('Bashan', 'NN'), ('.', '.')]
```

#### f) Experiment with different ratios and using only one dataset with a train/test split. Explain your findings.

In [35]:
def get_train_test_split(dataset, train_split: float):
    return split(dataset, train_size=train_split)

tagger = taggers[1]
dataset = brown_corpus.tagged_sents()

train_sizes = [0.25, 0.5, 0.75, 0.9]


for i in train_sizes:
    print_accuracy_of_taggers([tagger], [split(dataset, train_size=i)])

Accuracy of permuatation 1: 0.052273
Accuracy of permuatation 1: 0.052179
Accuracy of permuatation 1: 0.052044
Accuracy of permuatation 1: 0.052403


This gives the following output:

```
Accuracy: 0.052273
Accuracy: 0.052179
Accuracy: 0.052044
Accuracy: 0.052403
```

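As we can see, the accuracy barely changes between the different ratios. This follows from the setup: `tagger` is `taggers[1]`, which was trained once on the NPS split with universal tags and is not retrained here, so only the test portion of each Brown split varies. The accuracy is most likely this low (about 5%) because `dataset` uses the original Brown tagset while the tagger predicts universal tags.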
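To measure the effect of the training ratio, the tagger would need to be retrained on each new training split and evaluated on the matching test split. The sketch below shows that loop with a small plain-Python unigram model as a stand-in for `nltk.UnigramTagger`; all function names and the toy sentences are hypothetical, and no results for the Brown Corpus are implied.

```python
import random
from collections import Counter, defaultdict

def train_unigram(train_sents):
    """Return (word -> most frequent tag, overall most frequent tag)."""
    per_word, overall = defaultdict(Counter), Counter()
    for sent in train_sents:
        for word, tag in sent:
            per_word[word][tag] += 1
            overall[tag] += 1
    model = {w: c.most_common(1)[0][0] for w, c in per_word.items()}
    return model, overall.most_common(1)[0][0]

def accuracy(model, default_tag, test_sents):
    """Share of test tokens whose predicted tag equals the reference tag."""
    pairs = [(w, t) for sent in test_sents for w, t in sent]
    correct = sum(model.get(w, default_tag) == t for w, t in pairs)
    return correct / len(pairs)

def accuracy_by_train_size(tagged_sents, train_sizes, seed=4310):
    """Shuffle, split, retrain and evaluate once per training ratio."""
    results = {}
    for size in train_sizes:
        shuffled = tagged_sents[:]
        random.Random(seed).shuffle(shuffled)
        cut = int(len(shuffled) * size)
        train, test = shuffled[:cut], shuffled[cut:]
        model, default_tag = train_unigram(train)  # refit for every ratio
        results[size] = accuracy(model, default_tag, test)
    return results

# Made-up tagged sentences; train and test share one tagset.
toy_sents = [[("the", "DET"), ("dog", "NOUN"), ("runs", "VERB")],
             [("a", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
             [("the", "DET"), ("cat", "NOUN"), ("runs", "VERB")],
             [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")]]

results = accuracy_by_train_size(toy_sents, [0.25, 0.5, 0.75])
for size, acc in results.items():
    print(f"train size {size}: accuracy {acc:.3f}")
```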

### 3 - Tagging with probabilities

Hidden Markov Models (HMMs) can be used to solve Part-of-Speech (POS) tagging. Use HMMs to calculate probabilities for words and tags, using the appended code.

The methods are implemented in a separate file, and the output is shown here.

#### a) Implement the missing pieces of the function task3a() found in the appended code. Also found on the next page for reference.

Implemented in a separate file.

In [25]:
import sys
sys.path.append(".")

from lab2_task3_helper import lab2_helper

#### b) Print the probability of...
* a verb (VB) being "run"
* a preposition (PP) being followed by a verb

In [26]:
lab2_helper.task3b()

Prob. of a Verb(VB) being 'run' is 0.1329%
Prob. of a Preposition(PP) being followed by a Verb(VB) is 25.1591%


#### c) Print the 10 most common words for each of the tags NN, VB, JJ

In [27]:
tagwords, tags = lab2_helper.task3a()

In [28]:
target_tags = ["NN", "VB", "JJ"]

for target in target_tags:
    print(f"Target tag: {target}")
    print(tagwords[target].freqdist().most_common(10))

Target tag: NN
[('time', 1555), ('man', 1148), ('Af', 994), ('years', 942), ('way', 883), ('people', 809), ('men', 736), ('world', 684), ('life', 676), ('year', 647)]
Target tag: VB
[('said', 1943), ('made', 1119), ('make', 765), ('see', 727), ('get', 719), ('know', 676), ('came', 621), ('used', 610), ('go', 604), ('come', 589)]
Target tag: JJ
[('new', 1060), ('such', 903), ('own', 750), ('good', 693), ('great', 592), ('New', 575), ('old', 568), ('American', 535), ('small', 517), ('long', 515)]


#### d) Print the probability of the tag sequence PP VB VB DT NN for the sentence “I can code some code”

In [29]:
probability = tagwords["START"].prob("START") * tags["START"].prob("PP") * \
    tagwords["PP"].prob("I") *  tags["PP"].prob("VB") * \
    tagwords["VB"].prob("can") * tags["VB"].prob("VB") * \
    tagwords["VB"].prob("code") * tags["VB"].prob("DT") * \
    tagwords["DT"].prob("some") * tags["DT"].prob("NN") * \
    tagwords["NN"].prob("code") * \
    tagwords["END"].prob("END")

print(probability)
lab2_helper.prettify(probability)

6.171716745411479e-22


'0.0%'

I get a very low probability of about 6.17e-22, which `prettify` displays as `0.0%`.
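Products of many small probabilities shrink quickly, so summing log probabilities is a common way to keep the computation readable and numerically stable. The sketch below assumes the usual bigram HMM factorization, in which each step multiplies a transition probability by an emission probability and the last factor is the transition from the final tag to `END`. How the lab's helper defines `START` and `END` is not shown here, so that part is an assumption, and the probability tables are made up.

```python
import math

def sequence_log_prob(words, tags, transition, emission):
    """log P(words, tags) under a bigram HMM with START and END markers."""
    log_p = 0.0
    previous = "START"
    for word, tag in zip(words, tags):
        log_p += math.log(transition[previous][tag])  # P(tag | previous tag)
        log_p += math.log(emission[tag][word])        # P(word | tag)
        previous = tag
    log_p += math.log(transition[previous]["END"])    # P(END | last tag)
    return log_p

# Made-up probabilities for illustration only.
transition = {"START": {"PP": 0.5}, "PP": {"VB": 0.4}, "VB": {"END": 0.2}}
emission = {"PP": {"I": 0.1}, "VB": {"code": 0.01}}

log_p = sequence_log_prob(["I", "code"], ["PP", "VB"], transition, emission)
probability = math.exp(log_p)
print(log_p, probability)
```

The toy sequence multiplies 0.5, 0.1, 0.4, 0.01 and 0.2, which gives 4e-5.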