## Document Analysis: Computational Methods - Summer Term 2025
### Lectures: Jun.-Prof. Dr. Andreas Spitz
### Tutorials: Julian Schelb

# Exercise 02

You will learn about:
    
- The Brown Corpus
- Part of Speech (POS) tagging
- Unigram and Bigram tagger

---

## Task 1 - The Brown Corpus:

---

### Part 1

In the following, we will use the _Brown Corpus_. In one or two sentences, describe what the _Brown Corpus_ is and how it can be used for POS tagging.

**EXAMPLE SOLUTION**

The Brown Corpus (Brown University Standard Corpus of Present-Day American English) is a general corpus (a reference text collection) originally compiled by Henry Kučera and W. Nelson Francis in the 1960s for linguistics research. It consists of 500 American English texts totaling roughly one million words, all of which have been manually tagged and can therefore serve as ground truth for POS tagging. Compared to contemporary corpora (e.g., the Corpus of Contemporary American English or the International Corpus of English), it is very small.

### Part 2

We start by analyzing which tags occur in the Brown Corpus. For this, you should first extract the `tagged_words`. Then

1. list the first 20 entries and
2. list the ten most common tags in the category `news`.

In the lecture, we use the Brown Corpus POS tags (default, i.e., `tagset=None`).

In [1]:
import nltk
nltk.download('brown')
from nltk.corpus import brown

[nltk_data] Downloading package brown to
[nltk_data]     /Users/julianschelb/nltk_data...
[nltk_data]   Package brown is already up-to-date!


In [2]:
# EXAMPLE SOLUTION

brown_news_tagged = brown.tagged_words(categories='news')

In [None]:
brown_news_tagged[0:20]

[('The', 'AT'),
 ('Fulton', 'NP-TL'),
 ('County', 'NN-TL'),
 ('Grand', 'JJ-TL'),
 ('Jury', 'NN-TL'),
 ('said', 'VBD'),
 ('Friday', 'NR'),
 ('an', 'AT'),
 ('investigation', 'NN'),
 ('of', 'IN'),
 ("Atlanta's", 'NP$'),
 ('recent', 'JJ'),
 ('primary', 'NN'),
 ('election', 'NN'),
 ('produced', 'VBD'),
 ('``', '``'),
 ('no', 'AT'),
 ('evidence', 'NN'),
 ("''", "''"),
 ('that', 'CS')]

In [None]:
tags = nltk.FreqDist(tag for (word, tag) in brown_news_tagged)
tags.most_common(10)

[('NN', 13162),
 ('IN', 10616),
 ('AT', 8893),
 ('NP', 6866),
 (',', 5133),
 ('NNS', 5066),
 ('.', 4452),
 ('JJ', 4392),
 ('CC', 2664),
 ('VBD', 2524)]

### Part 3

In the previous part, you should have obtained ten different POS tags. What does each tag stand for?

**EXAMPLE SOLUTION**

- **NN** - Singular or mass noun: *apple*
- **IN** - Preposition: *under*
- **AT** - Article (a, the, no): *the*
- **NP** - Proper noun or part of name phrase: *London*
- **`,`** - Comma (separates elements in a sentence, such as clauses or items in a list): *`,`*
- **NNS** - Plural noun: *apples*
- **`.`** - Sentence closer (marks the end of a sentence, like a period): *`.`*
- **JJ** - Adjective: *green*
- **CC** - Coordinating conjunction (and, or): *and*
- **VBD** - Verb, past tense: *walked*



See also http://korpus.uib.no/icame/manuals/BROWN/INDEX.HTM#bc6 or https://varieng.helsinki.fi/CoRD/corpora/BROWN/tags.html

---

## Task 2 - POS Tagging:

### Part 1

Use a unigram tagger trained on the Brown Corpus to tag the example sentence from the Penn Treebank (see also https://www.nltk.org/_modules/nltk/corpus/reader/tagged.html).

For which words does it completely fail?

In [1]:
import nltk
nltk.download('brown')
nltk.download('treebank')
from nltk.corpus import brown
from nltk.tag import UnigramTagger

[nltk_data] Downloading package brown to
[nltk_data]     /Users/julianschelb/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package treebank to
[nltk_data]     /Users/julianschelb/nltk_data...
[nltk_data]   Package treebank is already up-to-date!


In [2]:
treebank_test = list(nltk.corpus.treebank.words()[0:20])

In [3]:
# EXAMPLE SOLUTION

brown_sents_tagged = brown.tagged_sents()
tagger_unigram = UnigramTagger(brown_sents_tagged)

for tok, tag in tagger_unigram.tag(treebank_test):
    print(f"({tok}, {tag}), ")

(Pierre, NP), 
(Vinken, None), 
(,, ,), 
(61, CD), 
(years, NNS), 
(old, JJ), 
(,, ,), 
(will, MD), 
(join, VB), 
(the, AT), 
(board, NN), 
(as, CS), 
(a, AT), 
(nonexecutive, None), 
(director, NN), 
(Nov., NP), 
(29, CD), 
(., .), 
(Mr., NP), 
(Vinken, None), 


**EXAMPLE SOLUTION**

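It fails for `Vinken` (both occurrences) and `nonexecutive`. Neither word occurs in the training data, so the unigram tagger has no tag to look up and returns `None`.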
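
The `None` tags follow from how a unigram tagger works: it is essentially a lookup table from each word to its most frequent tag in the training data. The following is a minimal sketch in plain Python, not NLTK's implementation. The toy training sentences and the helper names `train_unigram` and `tag` are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def train_unigram(tagged_sents):
    """Map every word to its most frequent tag in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag_ in sent:
            counts[word][tag_] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag(model, tokens, default=None):
    """Look up each token; tokens unseen in training get the default tag."""
    return [(tok, model.get(tok, default)) for tok in tokens]

# Toy training data (hypothetical, not taken from the Brown Corpus).
toy_train = [
    [("the", "AT"), ("board", "NN"), ("met", "VBD")],
    [("the", "AT"), ("director", "NN"), ("will", "MD"), ("join", "VB")],
]
model = train_unigram(toy_train)

# "nonexecutive" never occurs in the toy training data, so it gets None.
tagged = tag(model, ["the", "nonexecutive", "director"])
print(tagged)

# A default tag avoids None, at the price of guessing.
tagged_with_default = tag(model, ["the", "nonexecutive", "director"], default="NN")
print(tagged_with_default)
```

In NLTK, a comparable fallback can be configured through the `backoff` parameter of `UnigramTagger`, for example with a `DefaultTagger('NN')`.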

### Part 2

Compare the tags with the reference tags from the Penn treebank.

In [4]:
# EXAMPLE SOLUTION

treebank_reference = list(nltk.corpus.treebank.tagged_words()[0:20])
for tok, tag in treebank_reference:
    print(f"({tok}, {tag}), ")

(Pierre, NNP), 
(Vinken, NNP), 
(,, ,), 
(61, CD), 
(years, NNS), 
(old, JJ), 
(,, ,), 
(will, MD), 
(join, VB), 
(the, DT), 
(board, NN), 
(as, IN), 
(a, DT), 
(nonexecutive, JJ), 
(director, NN), 
(Nov., NNP), 
(29, CD), 
(., .), 
(Mr., NNP), 
(Vinken, NNP), 


**EXAMPLE SOLUTION**

The differences are:

```
Pierre (NNP instead of NP)
Vinken (NNP instead of None)
the (DT instead of AT)
as (IN instead of CS) 
a (DT instead of AT)
nonexecutive (JJ instead of None)
Nov. (NNP instead of NP)
Mr. (NNP instead of NP)
Vinken (NNP instead of None)
```

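Several of these differences are not real errors but stem from the two corpora using different tagsets: the tagger outputs Brown tags (`NP`, `AT`), whereas the reference uses Penn Treebank tags (`NNP`, `DT`) for the same word classes. The genuine problems are unknown words such as the name `Vinken` and the rare word `nonexecutive`, which receive no tag at all, and `as`, which gets its most frequent Brown tag `CS` (subordinating conjunction) although it is used as a preposition here.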
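
To separate tagset differences from actual tagging problems, one option is to map Brown tags to Penn Treebank tags before comparing. The sketch below uses the two outputs shown above. The mapping `BROWN_TO_PENN` and the helper `real_mismatches` are assumptions for illustration, and the mapping is deliberately partial, covering only the two correspondences that occur in this sentence.

```python
# Tagger output (Brown tagset) and reference (Penn Treebank tagset),
# copied from the outputs above.
predicted = [
    ("Pierre", "NP"), ("Vinken", None), (",", ","), ("61", "CD"),
    ("years", "NNS"), ("old", "JJ"), (",", ","), ("will", "MD"),
    ("join", "VB"), ("the", "AT"), ("board", "NN"), ("as", "CS"),
    ("a", "AT"), ("nonexecutive", None), ("director", "NN"),
    ("Nov.", "NP"), ("29", "CD"), (".", "."), ("Mr.", "NP"),
    ("Vinken", None),
]
reference = [
    ("Pierre", "NNP"), ("Vinken", "NNP"), (",", ","), ("61", "CD"),
    ("years", "NNS"), ("old", "JJ"), (",", ","), ("will", "MD"),
    ("join", "VB"), ("the", "DT"), ("board", "NN"), ("as", "IN"),
    ("a", "DT"), ("nonexecutive", "JJ"), ("director", "NN"),
    ("Nov.", "NNP"), ("29", "CD"), (".", "."), ("Mr.", "NNP"),
    ("Vinken", "NNP"),
]

# Assumed correspondences: proper noun and article/determiner.
BROWN_TO_PENN = {"NP": "NNP", "AT": "DT"}

def real_mismatches(predicted, reference, mapping):
    """Return (token, predicted, gold) triples that still differ after mapping."""
    mismatches = []
    for (tok, pred), (_, gold) in zip(predicted, reference):
        if mapping.get(pred, pred) != gold:
            mismatches.append((tok, pred, gold))
    return mismatches

remaining = real_mismatches(predicted, reference, BROWN_TO_PENN)
for tok, pred, gold in remaining:
    print(f"{tok}: predicted {pred}, reference {gold}")
```

With this mapping, only the unknown words and `as` remain as disagreements.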

### Part 3

Now train 
 1. a Unigram tagger,
 2. a Bigram tagger,
 3. and a Brill tagger (using the `brill24` rule templates)
 
with a subset of the Brown Corpus. This might take 1-2 minutes.

Then, validate and compare their performance on a different subset of the Brown Corpus.

In [8]:
from nltk.tag import UnigramTagger, BigramTagger
from nltk.tag.brill_trainer import BrillTaggerTrainer
from nltk.tag.brill import brill24

In [9]:
n_cutoff = 20000
brown_sents_train = brown.tagged_sents()[0:n_cutoff] # training corpus
brown_sents_test = brown.tagged_sents()[n_cutoff:] # reference corpus

In [10]:
# EXAMPLE SOLUTION

# Unigram
tagger_unigram = UnigramTagger(brown_sents_train)

# Bigram
tagger_bigram = BigramTagger(brown_sents_train, backoff=tagger_unigram)

# Brill
trainer_brill = BrillTaggerTrainer(tagger_bigram, brill24(), trace=1)
tagger_brill = trainer_brill.train(brown_sents_train, max_rules=20)

#
# Evaluate
#
print('unigram', tagger_unigram.accuracy(brown_sents_test))
print('bigram', tagger_bigram.accuracy(brown_sents_test))
print('brill', tagger_brill.accuracy(brown_sents_test))

TBL train (fast) (seqs: 20000; tokens: 430477; tpls: 24; min score: 2; min acc: None)
Finding initial useful rules...
    Found 172575 useful rules.
Selecting rules...

unigram 0.8615109858152631
bigram 0.8776540785395127
brill 0.8854122332236234


### Part 4

Discuss the scores of your taggers. Which one performs better, and why?

**EXAMPLE SOLUTION**

The Unigram tagger achieves 86.2% accuracy, the Bigram tagger 87.8%, and the Brill tagger 88.5%. These results match our expectations: the more context a tagger takes into account, the better it performs. Nevertheless, one might expect a larger gap between the unigram tagger and the other two, since the unigram tagger ignores context entirely while the others do not. This observation might change with a larger and more varied training set, because more of the tag sequences in the test text would then also occur in the training text.

**Brill Tagger:**
1. **Initial Tagging**: Start with a simple baseline model, such as a unigram tagger (in this exercise, the bigram tagger with unigram backoff), to provide a preliminary tagging based on the most frequent tags.
2. **Learning Transformation Rules**: Analyze tagging errors to automatically generate potential transformation rules that suggest modifying one part-of-speech tag to another under specific conditions.
3. **Rule Selection and Optimization**: Evaluate and rank the generated rules by their effectiveness in improving tagging accuracy, selecting the most beneficial rules for application.
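
The steps above can be illustrated with a single transformation rule of the form "change tag A to tag B if the previous tag is C". The sketch below is a simplified illustration, not NLTK's `BrillTaggerTrainer`. The helpers `apply_rule` and `rule_score` are hypothetical, and the score (corrections minus newly introduced errors) is one common way to rank rules.

```python
def apply_rule(tags, from_tag, to_tag, prev_tag):
    """Change from_tag to to_tag wherever the preceding tag is prev_tag.

    The condition is checked against the input tags, not the partially
    rewritten ones (a simplifying assumption of this sketch).
    """
    new_tags = list(tags)
    for i in range(1, len(tags)):
        if tags[i] == from_tag and tags[i - 1] == prev_tag:
            new_tags[i] = to_tag
    return new_tags

def rule_score(tags, gold, **rule):
    """Number of errors fixed minus number of errors introduced by the rule."""
    new_tags = apply_rule(tags, **rule)
    fixed = sum(old != g and new == g for old, new, g in zip(tags, new_tags, gold))
    broken = sum(old == g and new != g for old, new, g in zip(tags, new_tags, gold))
    return fixed - broken

tokens = ["to", "raise", "children"]
initial = ["TO", "NN", "NN"]   # baseline tagging: "raise" wrongly tagged NN
gold = ["TO", "VB", "NN"]      # reference tagging

rule = {"from_tag": "NN", "to_tag": "VB", "prev_tag": "TO"}
corrected = apply_rule(initial, **rule)
score = rule_score(initial, gold, **rule)
print(list(zip(tokens, corrected)), score)
```

During training, many candidate rules are scored this way, and the best ones are kept and applied in order.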


### Part 5

Discuss ideas for improving the implementations and the quality of the taggers. You are not required to implement the improvement ideas.

**EXAMPLE SOLUTION**

- In terms of quality, using a larger training corpus will almost certainly improve the accuracy, although the increase might only be marginal.
- A different approach would be to add hard-coded rules that account for specific cases, which often produce errors when using the statistical n-gram approach.
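
As a sketch of the second idea, hand-written patterns can guess a tag for words the statistical tagger does not know. The patterns, their order, and the helper `guess_tag` below are assumptions for illustration. The tags follow the Brown tagset.

```python
import re

# Hypothetical hand-written rules, checked in order.
PATTERNS = [
    (re.compile(r"^\d+(\.\d+)?$"), "CD"),  # numbers such as 61 or 3.5
    (re.compile(r"^[A-Z]"), "NP"),         # capitalised words: likely proper nouns
    (re.compile(r"ed$"), "VBD"),           # -ed suffix: likely past-tense verbs
    (re.compile(r"s$"), "NNS"),            # -s suffix: likely plural nouns
]

def guess_tag(token, default="NN"):
    """Return the tag of the first matching pattern, else the default tag."""
    for pattern, tag in PATTERNS:
        if pattern.search(token):
            return tag
    return default

for token in ["Vinken", "nonexecutive", "61", "walked", "years"]:
    print(token, guess_tag(token))
```

Such rules are crude: `nonexecutive` falls through to the default `NN`, although the reference tag is `JJ`. NLTK offers a `RegexpTagger` that takes a list of (pattern, tag) pairs and can be used as a `backoff` for an n-gram tagger.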

---

## Task 3 - Unigram and Bigram Taggers (pen and paper):

**Training data:**

His [PRP] raise [NN] was [VB] five [CD] dollars [NN] . [SYM]
We [PRP] usually [RB] get [VB] a [DT] raise [NN] at [IN] the [DT] start [NN] of [IN] the [DT] year [NN] . [SYM]
A [DT] major [JJ] success [NN] helped [VB] to [TO] raise [VB] our [PRP] spirits [NN] . [SYM]



**Test sentence:**

It [PRP] looks [VB] like [CC] a [DT] fine [JJ] place [NN] to [TO] raise [NN or VB?] children [NN] . [SYM]


### Part 1: Unigram Tagger

Given the training data, determine the most likely tag for the word "raise" in the test sentence above, using the unigram tagging method.


**EXAMPLE SOLUTION**

**Naive solution**

p(k=NN|raise) = 2 / 3

p(k=VB|raise) = 1 / 3

**Taking tag frequency into account (Bayes' rule), we apply the adapted formula:**


p(NN) * p(raise|NN) = 7/27 * 2/7 = 2/27 -- most likely tag, but it would be wrong

p(VB) * p(raise|VB) = 4/27 * 1/4 = 1/27


**Granted: if we also divide by the marginal probability p(raise) = 3/27, we arrive at exactly the naive probabilities from the top (2/3 and 1/3).**


**Bayes' Theorem for Part-of-Speech Tagging**:
`p(k=Tag|word) = p(word|k=Tag) * p(Tag) / p(word)`

**Components**:
1. **Posterior Probability, `p(k=Tag|word)`**:
   - Probability of the tag given the word.
2. **Likelihood, `p(word|k=Tag)`**:
   - Probability of observing the word given the tag.
3. **Prior Probability, `p(Tag)`**:
   - Overall probability of the tag in the corpus.
4. **Marginal Probability, `p(word)`**:
   - Overall probability of observing the word across all tags.
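
The numbers in this solution can be checked by counting over the training data. The sketch below uses exact fractions. The `word/TAG` string encoding and the helper names `joint` and `posterior` are assumptions made for illustration.

```python
from collections import Counter
from fractions import Fraction

# Training data from Task 3, encoded as word/TAG tokens.
raw = [
    "His/PRP raise/NN was/VB five/CD dollars/NN ./SYM",
    "We/PRP usually/RB get/VB a/DT raise/NN at/IN the/DT start/NN "
    "of/IN the/DT year/NN ./SYM",
    "A/DT major/JJ success/NN helped/VB to/TO raise/VB our/PRP spirits/NN ./SYM",
]
pairs = [tuple(tok.rsplit("/", 1)) for line in raw for tok in line.split()]

n_tokens = len(pairs)                              # 27 tokens in total
tag_counts = Counter(tag for _, tag in pairs)      # e.g. NN: 7, VB: 4
word_counts = Counter(word for word, _ in pairs)   # e.g. raise: 3
pair_counts = Counter(pairs)                       # e.g. (raise, NN): 2

def joint(word, tag):
    """p(tag) * p(word | tag), the numerator of Bayes' theorem."""
    prior = Fraction(tag_counts[tag], n_tokens)
    likelihood = Fraction(pair_counts[(word, tag)], tag_counts[tag])
    return prior * likelihood

def posterior(word, tag):
    """p(tag | word), obtained by dividing by the marginal p(word)."""
    return joint(word, tag) / Fraction(word_counts[word], n_tokens)

for tag in ("NN", "VB"):
    print(tag, joint("raise", tag), posterior("raise", tag))
```

Dividing by `p(raise)` rescales both candidates by the same factor, so the ranking of the tags does not change.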



### Part 2 - Bigram Tagger:

Given the training data (from Task 3), determine the most likely tag for the word "raise" in the following sentence, using the bigram tagging method:

It [PRP] looks [VB] like [CC] a [DT] fine [JJ] place [NN] to [TO] raise [NN or VB?] children [NN] . [SYM]


**EXAMPLE SOLUTION**

p(NN|TO) * p(raise|NN) = 0/1 * 2/7 = 0

p(VB|TO) * p(raise|VB) = 1/1 * 1/4 = 1/4 -- most likely tag, correct
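
The same calculation can be reproduced by counting tag transitions and word emissions in the training data. This is a sketch of the pen-and-paper computation, not of NLTK's `BigramTagger`. The `word/TAG` encoding and the helper `score` are assumptions for illustration.

```python
from collections import Counter
from fractions import Fraction

# Training data from Task 3, encoded as word/TAG tokens.
raw = [
    "His/PRP raise/NN was/VB five/CD dollars/NN ./SYM",
    "We/PRP usually/RB get/VB a/DT raise/NN at/IN the/DT start/NN "
    "of/IN the/DT year/NN ./SYM",
    "A/DT major/JJ success/NN helped/VB to/TO raise/VB our/PRP spirits/NN ./SYM",
]
training = [[tuple(tok.rsplit("/", 1)) for tok in line.split()] for line in raw]

tag_counts = Counter(tag for sent in training for _, tag in sent)
emissions = Counter(pair for sent in training for pair in sent)

# Count how often each tag follows each other tag within a sentence.
transitions = Counter()
prev_counts = Counter()
for sent in training:
    tags = [tag for _, tag in sent]
    for prev, cur in zip(tags, tags[1:]):
        transitions[(prev, cur)] += 1
        prev_counts[prev] += 1

def score(word, tag, prev_tag):
    """p(tag | prev_tag) * p(word | tag)."""
    transition = Fraction(transitions[(prev_tag, tag)], prev_counts[prev_tag])
    emission = Fraction(emissions[(word, tag)], tag_counts[tag])
    return transition * emission

for tag in ("NN", "VB"):
    print(tag, score("raise", tag, prev_tag="TO"))
```

Because `TO` occurs only once in the training data and is followed by `VB`, the transition probability alone decides the outcome here.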

---

#### Submitting your results:

To submit your results, please:

- save this file, i.e., `ex??_assignment.ipynb`.
- if you reference any external files (e.g., images), please create a zip or rar archive and put the notebook files and all referenced files in there.
- log in to ILIAS and submit the `*.ipynb` or archive for the corresponding assignment.

**Remarks:**
    
- Do not copy any code from the Internet. If you want to use publicly available code, please add a reference to the respective code snippet.
- Check that your code runs without errors, even after you have restarted the kernel.
- Submit your written solutions and the coding exercises only within the provided spaces.