## Document Analysis: Computational Methods - Summer Term 2025
### Lectures: Jun.-Prof. Dr. Andreas Spitz
### Tutorials: Julian Schelb

Buket Sak, Anna Werner, Yu Zeyuan

# Exercise 02

You will learn about:
    
- The Brown Corpus
- Part of Speech (POS) tagging
- Unigram and Bigram tagger

---

## Task 1 - The Brown Corpus:

---

### Part 1

In the following, we will use the _Brown Corpus_. In one or two sentences, describe what the _Brown Corpus_ is and how it can be used for POS tagging.

\# TEXT SUBMISSION ANSWER HERE (Double click to edit) - expected approx. 50-100 words
The Brown Corpus is a collection of English texts from 15 different categories, such as news and fiction, in which every word is annotated with a POS tag from the Brown tagset. Because it contains manually tagged sentences, it can be used to train and evaluate POS taggers.

### Part 2

We start by analyzing which tags occur in the Brown Corpus. For this, you should extract the `tagged_words` first. Then

1. List the first 20 entries and
2. then list the ten most common tags in the category `news`.

In the lecture, we use the Brown Corpus POS tags (default, i.e., `tagset=None`).

In [1]:
import nltk
nltk.download('brown')
from nltk.corpus import brown

[nltk_data] Downloading package brown to /Users/buketsak/nltk_data...
[nltk_data]   Package brown is already up-to-date!


In [2]:
# CODE SUBMISSION ANSWER HERE (Double click to edit)
from collections import Counter
tagged_words = brown.tagged_words()
print(tagged_words[0:20])

# Get tagged words only from the 'news' category
news_tagged_words = brown.tagged_words(categories='news')
# Extract only the tags
tags = [tag for (word, tag) in news_tagged_words]
# Count frequency of tags
tag_counts = Counter(tags)
most_common_tags = tag_counts.most_common(10)
print("\n", most_common_tags)

[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'), ('of', 'IN'), ("Atlanta's", 'NP$'), ('recent', 'JJ'), ('primary', 'NN'), ('election', 'NN'), ('produced', 'VBD'), ('``', '``'), ('no', 'AT'), ('evidence', 'NN'), ("''", "''"), ('that', 'CS')]

 [('NN', 13162), ('IN', 10616), ('AT', 8893), ('NP', 6866), (',', 5133), ('NNS', 5066), ('.', 4452), ('JJ', 4392), ('CC', 2664), ('VBD', 2524)]
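
The counting step can be tried without downloading the corpus. The following is a minimal sketch that hard-codes the 20 tagged words printed above as a stand-in for `brown.tagged_words()` and counts their tags with `Counter`; the variable name `sample` is illustrative only.

```python
from collections import Counter

# The first 20 (word, tag) pairs from the output above, used as a small stand-in corpus
sample = [
    ('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'),
    ('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT'),
    ('investigation', 'NN'), ('of', 'IN'), ("Atlanta's", 'NP$'), ('recent', 'JJ'),
    ('primary', 'NN'), ('election', 'NN'), ('produced', 'VBD'), ('``', '``'),
    ('no', 'AT'), ('evidence', 'NN'), ("''", "''"), ('that', 'CS'),
]

# Keep only the tags and count how often each one occurs
tag_counts = Counter(tag for _, tag in sample)

# The two most frequent tags in this small sample
print(tag_counts.most_common(2))  # [('NN', 4), ('AT', 3)]
```

On the full `news` category the same two lines of counting logic produce the ranking shown in the output above.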


### Part 3

In the previous part, you should get ten different POS tags. For each tag, what does it stand for?

\# TEXT SUBMISSION ANSWER HERE (Double click to edit) - expected: explanation for each tag

NN: Noun, singular or mass

IN: Preposition

AT: Article

NP: Proper noun, singular

,: Comma

NNS: Noun, plural

.: Sentence-closing punctuation

JJ: Adjective

CC: Coordinating conjunction

VBD: Verb, past tense

---

## Task 2 - POS Tagging:

### Part 1

Use a Unigram tagger, trained on the Brown corpus, to tag the example sentence from the Penn treebank (see also https://www.nltk.org/_modules/nltk/corpus/reader/tagged.html)

For which words does it completely fail?

In [3]:
import nltk
nltk.download('brown')
nltk.download('treebank')
from nltk.corpus import brown
from nltk.tag import UnigramTagger

[nltk_data] Downloading package brown to /Users/buketsak/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package treebank to
[nltk_data]     /Users/buketsak/nltk_data...
[nltk_data]   Package treebank is already up-to-date!


In [4]:
treebank_test = list(nltk.corpus.treebank.words()[0:20])
print(treebank_test)

['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'will', 'join', 'the', 'board', 'as', 'a', 'nonexecutive', 'director', 'Nov.', '29', '.', 'Mr.', 'Vinken']


In [5]:
# CODE SUBMISSION ANSWER HERE (Double click to edit)
trained_data = brown.tagged_sents()
unigram_tagger = UnigramTagger(trained_data)
tagged_sent = unigram_tagger.tag(treebank_test)

print(tagged_sent)

[('Pierre', 'NP'), ('Vinken', None), (',', ','), ('61', 'CD'), ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB'), ('the', 'AT'), ('board', 'NN'), ('as', 'CS'), ('a', 'AT'), ('nonexecutive', None), ('director', 'NN'), ('Nov.', 'NP'), ('29', 'CD'), ('.', '.'), ('Mr.', 'NP'), ('Vinken', None)]


\# TEXT SUBMISSION ANSWER HERE (Double click to edit) - expected list of words

The UnigramTagger fails completely for the proper noun "Vinken" (both occurrences) and for the word "nonexecutive". It returns None for them because these words do not occur in the training data.
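
Why unseen words end up as None can be sketched with a toy lookup tagger. This is not NLTK's implementation, only a simplified model of the idea: the training sentences in `toy_train` are made up for illustration, and `train_unigram` and `tag` are hypothetical helper names.

```python
from collections import Counter, defaultdict

def train_unigram(tagged_sents):
    """Map every training word to the tag it received most often."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}

def tag(words, lexicon, default=None):
    """Look each word up; unseen words get `default` (None means no backoff)."""
    return [(word, lexicon.get(word, default)) for word in words]

# Made-up training data (assumption, not taken from the Brown Corpus)
toy_train = [
    [("the", "AT"), ("board", "NN"), ("will", "MD"), ("join", "VB"), (".", ".")],
    [("a", "AT"), ("director", "NN"), ("will", "MD"), ("join", "VB"),
     ("the", "AT"), ("board", "NN"), (".", ".")],
]
lexicon = train_unigram(toy_train)

words = ["Vinken", "will", "join", "the", "board", "."]
no_backoff = tag(words, lexicon)                   # "Vinken" -> None
with_backoff = tag(words, lexicon, default="NN")   # "Vinken" -> "NN"
print(no_backoff)
print(with_backoff)
```

With a default tag, the unseen word receives a guess instead of None, which is the role a backoff tagger plays in NLTK.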

### Part 2

Compare the tags with the reference tags from the Penn treebank.

In [6]:
# CODE SUBMISSION ANSWER HERE (Double click to edit)
from nltk.corpus import treebank
from nltk.tag import DefaultTagger 
# Get the first 20 words and gold tags
treebank_test = list(treebank.words()[0:20])
treebank_gold = list(treebank.tagged_words()[0:20])

# Train UnigramTagger on the Brown corpus with Backoff
backoff = DefaultTagger('NN')  # fallback tagger
unigram_tagger = UnigramTagger(trained_data, backoff=backoff)
# Predict tags
tagged_sent = unigram_tagger.tag(treebank_test)

print(f"{'Word':15} {'Predicted':10} {'Gold'}")
print("-" * 32)
for (word_pred, pred_tag), (word_gold, gold_tag) in zip(tagged_sent, treebank_gold):
    print(f"{word_pred:15} {pred_tag:10} {gold_tag}")

Word            Predicted  Gold
--------------------------------
Pierre          NP         NNP
Vinken          NN         NNP
,               ,          ,
61              CD         CD
years           NNS        NNS
old             JJ         JJ
,               ,          ,
will            MD         MD
join            VB         VB
the             AT         DT
board           NN         NN
as              CS         IN
a               AT         DT
nonexecutive    NN         JJ
director        NN         NN
Nov.            NP         NNP
29              CD         CD
.               .          .
Mr.             NP         NNP
Vinken          NN         NNP


\# TEXT SUBMISSION ANSWER HERE (Double click to edit) - expected comparison for each tag

Here we evaluate how well a UnigramTagger trained on the Brown corpus tags a sentence from the Penn Treebank. Unseen words fall back to the default tag 'NN', which leads to errors: "nonexecutive" is tagged 'NN' although its reference tag is 'JJ', and the proper noun "Vinken" is tagged 'NN' instead of 'NNP'.
Several other mismatches are not tagging errors but differences between the Brown and Penn Treebank tagsets. For instance:

AT in Brown, DT in Treebank (articles and determiners).

NP in Brown, NNP in Treebank (proper nouns).

CS in Brown, IN in Treebank (Brown tags "as" as a subordinating conjunction; the Penn tagset uses IN for both prepositions and subordinating conjunctions).
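
The effect of the tagset differences can be quantified on the 20 tokens from the table above. The following sketch is an illustration under an assumption: `BROWN_TO_PENN` is a hypothetical, deliberately tiny mapping that covers only the three differences named here, not a complete or official conversion between the two tagsets.

```python
# (word, predicted Brown tag, Penn Treebank reference tag), copied from the table above
rows = [
    ("Pierre", "NP", "NNP"), ("Vinken", "NN", "NNP"), (",", ",", ","),
    ("61", "CD", "CD"), ("years", "NNS", "NNS"), ("old", "JJ", "JJ"),
    (",", ",", ","), ("will", "MD", "MD"), ("join", "VB", "VB"),
    ("the", "AT", "DT"), ("board", "NN", "NN"), ("as", "CS", "IN"),
    ("a", "AT", "DT"), ("nonexecutive", "NN", "JJ"), ("director", "NN", "NN"),
    ("Nov.", "NP", "NNP"), ("29", "CD", "CD"), (".", ".", "."),
    ("Mr.", "NP", "NNP"), ("Vinken", "NN", "NNP"),
]

# Hypothetical mapping for the three tagset differences discussed in the text
BROWN_TO_PENN = {"AT": "DT", "NP": "NNP", "CS": "IN"}

# Agreement when comparing the tag strings directly
raw_matches = sum(pred == gold for _, pred, gold in rows)

# Agreement after translating the predicted Brown tags
mapped_matches = sum(BROWN_TO_PENN.get(pred, pred) == gold for _, pred, gold in rows)

# Tokens that are still wrong after the translation
remaining = [word for word, pred, gold in rows if BROWN_TO_PENN.get(pred, pred) != gold]

print(f"direct comparison: {raw_matches}/{len(rows)}")    # 11/20
print(f"after mapping:     {mapped_matches}/{len(rows)}")  # 17/20
print("remaining errors:", remaining)
```

The remaining mismatches are exactly the unseen words that received the default tag.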

### Part 3

Now train 
 1. a Unigram tagger,
 2. a Bigram tagger,
 3. and a Brill tagger (using rule brill24)
 
with a subset of the Brown Corpus. This might take 1-2 minutes.

Then, validate and compare their performance on a different subset of the Brown corpus.

In [7]:
from nltk.tag import UnigramTagger, BigramTagger
from nltk.tag.brill_trainer import BrillTaggerTrainer
from nltk.tag.brill import brill24

In [8]:
n_cutoff = 20000
brown_sents_train = brown.tagged_sents()[0:n_cutoff] # training corpus
brown_sents_test = brown.tagged_sents()[n_cutoff:] # reference corpus

In [9]:
# CODE SUBMISSION ANSWER HERE (Double click to edit)
unigram_tagger = UnigramTagger(brown_sents_train, backoff=backoff)
bigram_tagger = BigramTagger(brown_sents_train, backoff=backoff)

templates = brill24()
brill_trainer = BrillTaggerTrainer(initial_tagger=bigram_tagger,
                                                 templates=templates,
                                                 trace=3)
brill_tagger = brill_trainer.train(brown_sents_train)

TBL train (fast) (seqs: 20000; tokens: 430477; tpls: 24; min score: 2; min acc: None)
Finding initial useful rules...
    Found 247221 useful rules.

           B      |
   S   F   r   O  |        Score = Fixed - Broken
   c   i   o   t  |  R     Fixed = num tags changed incorrect -> correct
   o   x   k   h  |  u     Broken = num tags changed correct -> incorrect
   r   e   e   e  |  l     Other = num tags changed incorrect -> incorrect
   e   d   n   r  |  e
------------------+-------------------------------------------------------
32083209   1  13  | TO->IN if Pos:NN@[1]
10801080   0   6  | NN->AT if Word:the@[0]
 252 252   0   1  | NN->AT if Word:a@[0]
 239 249  10   6  | CS->QL if Word:as@[2]
 136 195  59  70  | CS->WPS if Pos:NN@[-1] & Pos:NN@[1]
  72  72   0   1  | NN->PP$ if Word:his@[0]
  68  69   1   0  | EX->RB if Pos:NN@[1]
  62  72  10   0  | IN-TL->IN if Pos:NN@[1,2]
  61  62   1   2  | QL->AP if Word:than@[1]
  60  60   0   0  | NN->. if Word:.@[-1,0]
  60  97  37 157  |

In [24]:
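from nltk.tag import untag

# The test sentences are already tagged, so strip the gold tags before tagging
test_words = [untag(sentence) for sentence in brown_sents_test]
unigram_tags = [unigram_tagger.tag(words) for words in test_words]
bigram_tags = [bigram_tagger.tag(words) for words in test_words]
brill_tags = [brill_tagger.tag(words) for words in test_words]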

print("Tagger performance on test set:")
print("Unigram Tagger's accuracy", unigram_tagger.accuracy(brown_sents_test))
print("Bigram Tagger's accuracy", bigram_tagger.accuracy(brown_sents_test))
print("Brill Tagger's accuracy", brill_tagger.accuracy(brown_sents_test))

Tagger performance on test set:
Unigram Tagger's accuracy 0.8754534941803576
Bigram Tagger's accuracy 0.8067550276099437
Brill Tagger's accuracy 0.8271857016757559


### Part 4

Discuss the scores of your taggers. Which one performs better, and why?

\# TEXT SUBMISSION ANSWER HERE (Double click to edit) - expected approx. 100-150 words

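The UnigramTagger performs best (accuracy of about 0.875), followed by the Brill tagger (about 0.827) and the BigramTagger (about 0.807). The BigramTagger conditions on the tag of the previous word together with the current word, so it can only use combinations it has seen during training. Many such combinations in the test set are rare or absent in the training data, and because our BigramTagger backs off directly to the DefaultTagger, every unseen context is tagged 'NN'. The UnigramTagger only needs to have seen the word itself, which is a far less sparse condition, so it falls back to 'NN' much less often. The Brill tagger starts from the BigramTagger's output and repairs some of these errors with learned rules (for example, NN->AT if the word is "the"), which is a likely reason why it scores higher than the BigramTagger but still lower than the UnigramTagger.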
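
How much the backoff choice matters can be sketched on toy data. The code below is a simplified model, not NLTK's implementation: it reuses the three training sentences from Task 3 of this sheet, and `train_table` and `tag` are hypothetical helpers. The bigram table is keyed by (previous tag, word); an unseen context falls through either directly to the default tag or first to a unigram table.

```python
from collections import Counter, defaultdict

# Task 3 training sentences from this sheet, written as word/TAG tokens
RAW = [
    "His/PRP raise/NN was/VB five/CD dollars/NN ./SYM",
    "We/PRP usually/RB get/VB a/DT raise/NN at/IN the/DT start/NN of/IN the/DT year/NN ./SYM",
    "A/DT major/JJ success/NN helped/VB to/TO raise/VB our/PRP spirits/NN ./SYM",
]
TRAIN = [[tuple(tok.rsplit("/", 1)) for tok in line.split()] for line in RAW]

def train_table(sents, use_prev_tag):
    """Map each context to the tag seen most often for it in training."""
    counts = defaultdict(Counter)
    for sent in sents:
        prev = None  # no previous tag at the start of a sentence
        for word, tag in sent:
            key = (prev, word) if use_prev_tag else word
            counts[key][tag] += 1
            prev = tag
    return {key: c.most_common(1)[0][0] for key, c in counts.items()}

def tag(words, bigram, unigram=None, default="NN"):
    """Try the bigram table, then the optional unigram table, then the default."""
    tagged, prev = [], None
    for word in words:
        t = bigram.get((prev, word))
        if t is None and unigram is not None:
            t = unigram.get(word)
        if t is None:
            t = default
        tagged.append((word, t))
        prev = t
    return tagged

bigram_table = train_table(TRAIN, use_prev_tag=True)
unigram_table = train_table(TRAIN, use_prev_tag=False)

words = ["His", "raise", "helped", "."]
print(tag(words, bigram_table))                 # "." follows VB, an unseen context -> NN
print(tag(words, bigram_table, unigram_table))  # unigram backoff recovers SYM for "."
```

In this toy setting, the full stop is mis-tagged as 'NN' only when the bigram table backs off straight to the default tag.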

### Part 5

Discuss ideas for improving the implementations and the quality of the taggers. You are not required to implement the improvement ideas.

\# TEXT SUBMISSION ANSWER HERE (Double click to edit) - expected approx. 100-250 words



---

## Task 3 - Unigram and Bigram Taggers (pen and paper):

**Training data:**

His [PRP] raise [NN] was [VB] five [CD] dollars [NN] . [SYM]
We [PRP] usually [RB] get [VB] a [DT] raise [NN] at [IN] the [DT] start [NN] of [IN] the [DT] year [NN] . [SYM]
A [DT] major [JJ] success [NN] helped [VB] to [TO] raise [VB] our [PRP] spirits [NN] . [SYM]



**Test sentence:**

It [PRP] looks [VB] like [CC] a [DT] fine [JJ] place [NN] to [TO] raise [NN or VB?] children [NN] . [SYM]


### Part 1: Unigram Tagger

Given the training data, determine the most likely tag for the word "raise" in the test sentence, using Unigram tagging method:


<font color='ff000000'>\# TEXT SUBMISSION ANSWER HERE (Double click to edit) - write down the steps for calculating the most likely POS tag</font>

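A unigram tagger assigns each word the tag it received most often in the training data and ignores the context. The word "raise" occurs three times in the training data: twice as NN (sentences 1 and 2) and once as VB (sentence 3).

P(NN | raise) = 2/3 and P(VB | raise) = 1/3

The unigram tagger therefore assigns NN to "raise" in the test sentence, which is incorrect.

To get the correct tag (VB after "to"), the context has to be taken into account.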
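
The counts behind these fractions can be checked with a few lines of code. This is a small sketch over the three training sentences of this task, written as word/TAG tokens; the variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction

# Training data of Task 3 as word/TAG tokens
RAW = [
    "His/PRP raise/NN was/VB five/CD dollars/NN ./SYM",
    "We/PRP usually/RB get/VB a/DT raise/NN at/IN the/DT start/NN of/IN the/DT year/NN ./SYM",
    "A/DT major/JJ success/NN helped/VB to/TO raise/VB our/PRP spirits/NN ./SYM",
]
tokens = [tuple(tok.rsplit("/", 1)) for line in RAW for tok in line.split()]

# Count how often "raise" appears with each tag
raise_tags = Counter(tag for word, tag in tokens if word == "raise")
total = sum(raise_tags.values())

# Relative frequencies P(tag | raise)
probs = {tag: Fraction(count, total) for tag, count in raise_tags.items()}
best_tag = max(probs, key=probs.get)

for tag, p in probs.items():
    print(f"P({tag} | raise) = {p}")
print("unigram choice:", best_tag)  # NN
```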


### Part 2 - Bigram Tagger:

Given the training data (in Task 3), determine the most likely tag for the word "raise" in the test sentence, using Bigram tagging method:



<font color='ff000000'>\# TEXT SUBMISSION ANSWER HERE (Double click to edit) - write down the steps for calculating the most likely POS tag</font>

A bigram tagger also takes the tag of the preceding word into account. In the test sentence, "raise" is preceded by "to" [TO]. For each candidate tag we multiply the probability of the tag following TO by the probability of the tag producing the word "raise":

VB: P(VB | TO) * P(raise | VB) = 1/1 * 1/4 = 1/4 (TO occurs once in the training data and is followed by VB; 1 of the 4 VB tokens is "raise")

NN: P(NN | TO) * P(raise | NN) = 0/1 * 2/7 = 0 (TO is never followed by NN; 2 of the 7 NN tokens are "raise")

Since 1/4 > 0, the bigram tagger assigns VB to "raise", which is correct.
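
The same calculation can be sketched in code. The helper `score` below is a hypothetical name for the product used in this answer, P(tag | previous tag) * P(word | tag), estimated by relative frequencies from the three training sentences.

```python
from collections import Counter
from fractions import Fraction

# Training data of Task 3 as word/TAG tokens
RAW = [
    "His/PRP raise/NN was/VB five/CD dollars/NN ./SYM",
    "We/PRP usually/RB get/VB a/DT raise/NN at/IN the/DT start/NN of/IN the/DT year/NN ./SYM",
    "A/DT major/JJ success/NN helped/VB to/TO raise/VB our/PRP spirits/NN ./SYM",
]
TRAIN = [[tuple(tok.rsplit("/", 1)) for tok in line.split()] for line in RAW]

tag_count = Counter()   # how often each tag occurs
emit = Counter()        # how often a tag is assigned to a given word
trans = Counter()       # how often a tag follows a given previous tag
prev_count = Counter()  # how often a tag is followed by any other tag

for sent in TRAIN:
    prev = None
    for word, tag in sent:
        tag_count[tag] += 1
        emit[(tag, word)] += 1
        if prev is not None:
            trans[(prev, tag)] += 1
            prev_count[prev] += 1
        prev = tag

def score(word, tag, prev_tag):
    """P(tag | prev_tag) * P(word | tag), both as relative frequencies."""
    p_trans = Fraction(trans[(prev_tag, tag)], prev_count[prev_tag])
    p_emit = Fraction(emit[(tag, word)], tag_count[tag])
    return p_trans * p_emit

print("VB:", score("raise", "VB", "TO"))  # 1/4
print("NN:", score("raise", "NN", "TO"))  # 0
```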

---

#### Submitting your results:

To submit your results, please:

- save this file, i.e., `ex??_assignment.ipynb`.
- if you reference any external files (e.g., images), please create a zip or rar archive and put the notebook files and all referenced files in there.
- login to ILIAS and submit the `*.ipynb` or archive for the corresponding assignment.

**Remarks:**
    
- Do not copy any code from the Internet. In case you want to use publicly available code, please add the reference to the respective code snippet.
- Check that your code compiles and executes, even after you have restarted the Kernel.
- Submit your written solutions and the coding exercises within the provided spaces and not otherwise.
- Write the names of your partner and your name in the top section.