## Document Analysis: Computational Methods - Summer Term 2025
### Lectures: Jun.-Prof. Dr. Andreas Spitz
### Tutorials: Julian Schelb

# Exercise 02

You will learn about:
    
- The Brown Corpus
- Part of Speech (POS) tagging
- Unigram and bigram taggers

---

## Task 1 - The Brown Corpus:

---

### Part 1

In the following, we will use the _Brown Corpus_. In one or two sentences, describe what the _Brown Corpus_ is and how it can be used for POS tagging.

\# TEXT SUBMISSION ANSWER HERE (Double click to edit) - expected approx. 50-100 words

### Part 2

We start by analyzing which tags occur in the Brown Corpus. To do so, first extract the `tagged_words`. Then:

1. List the first 20 entries.
2. List the ten most common tags in the category `news`.

In the lecture, we use the Brown Corpus POS tags (default, i.e., `tagset=None`).

In [2]:
import nltk
nltk.download('brown')
from nltk.corpus import brown

[nltk_data] Downloading package brown to
[nltk_data]     C:\Users\bagim\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\brown.zip.


In [11]:
print(brown.tagged_words()[:20])
# `categories` selects the text category; `tagset` only selects a tag mapping.
news_tags = nltk.FreqDist(tag for _, tag in brown.tagged_words(categories='news'))
print(news_tags.most_common(10))

[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'), ('of', 'IN'), ("Atlanta's", 'NP$'), ('recent', 'JJ'), ('primary', 'NN'), ('election', 'NN'), ('produced', 'VBD'), ('``', '``'), ('no', 'AT'), ('evidence', 'NN'), ("''", "''"), ('that', 'CS')]
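
Finding the most common tags amounts to building a frequency table over the second element of each `(word, tag)` pair. The following is a minimal sketch using only the standard library, with the 20 tagged words printed above as a tiny stand-in for the corpus. The variable names are illustrative; on the real data, the same counting would be applied to the tags from `brown.tagged_words(categories='news')`.

```python
from collections import Counter

# The 20 (word, tag) pairs printed above, used here as stand-in data.
sample = [
    ('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'),
    ('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT'),
    ('investigation', 'NN'), ('of', 'IN'), ("Atlanta's", 'NP$'), ('recent', 'JJ'),
    ('primary', 'NN'), ('election', 'NN'), ('produced', 'VBD'), ('``', '``'),
    ('no', 'AT'), ('evidence', 'NN'), ("''", "''"), ('that', 'CS'),
]

# Count how often each tag occurs, ignoring the words themselves.
tag_counts = Counter(tag for _, tag in sample)

# The two most frequent tags in this small sample.
top_two = tag_counts.most_common(2)
print(top_two)
```

In this sample, `NN` occurs four times and `AT` three times; the ranking over the full `news` category may of course look different.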


### Part 3

In the previous part, you should get ten different POS tags. For each tag, what does it stand for?

\# TEXT SUBMISSION ANSWER HERE (Double click to edit) - expected: explanation for each tag

---

## Task 2 - POS Tagging:

### Part 1

Use a unigram tagger, trained on the Brown Corpus, to tag the example sentence from the Penn Treebank (see also https://www.nltk.org/_modules/nltk/corpus/reader/tagged.html).

For which words does it completely fail?

In [12]:
import nltk
nltk.download('brown')
nltk.download('treebank')
from nltk.corpus import brown
from nltk.tag import UnigramTagger

[nltk_data] Downloading package brown to
[nltk_data]     C:\Users\bagim\AppData\Roaming\nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package treebank to
[nltk_data]     C:\Users\bagim\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\treebank.zip.


In [28]:
treebank_test = list(nltk.corpus.treebank.words()[0:20])
print(treebank_test)

['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'will', 'join', 'the', 'board', 'as', 'a', 'nonexecutive', 'director', 'Nov.', '29', '.', 'Mr.', 'Vinken']


In [26]:
# Train on tagged Brown sentences first; the constructor expects (word, tag) pairs, not bare words.
print(UnigramTagger(brown.tagged_sents()).tag(treebank_test))

\# TEXT SUBMISSION ANSWER HERE (Double click to edit) - expected list of words

### Part 2

Compare the tags with the reference tags from the Penn treebank.

In [None]:
# CODE SUBMISSION ANSWER HERE (Double click to edit)

\# TEXT SUBMISSION ANSWER HERE (Double click to edit) - expected comparison for each tag
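
One way to structure the comparison is to line up the predicted and reference tags position by position and list the differences. The sketch below uses a hypothetical helper, `compare_tags`, and three placeholder tokens rather than real tagger output. Keep in mind that the Brown and Penn Treebank tagsets use different labels for the same category (for example `AT` versus `DT` for articles), so not every mismatch is a tagging error.

```python
def compare_tags(words, predicted, reference):
    """Return (word, predicted_tag, reference_tag) wherever the two tags differ."""
    if not (len(words) == len(predicted) == len(reference)):
        raise ValueError("all three sequences must have the same length")
    return [(w, p, r) for w, p, r in zip(words, predicted, reference) if p != r]


# Placeholder data for illustration only, NOT real tagger output.
words = ['the', 'board', 'Vinken']
predicted = ['AT', 'NN', None]   # Brown-style tags; None means "no tag assigned"
reference = ['DT', 'NN', 'NNP']  # Penn Treebank-style reference tags

mismatches = compare_tags(words, predicted, reference)
for word, pred, ref in mismatches:
    print(f"{word!r}: predicted {pred}, reference {ref}")
```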

### Part 3

Now train 
 1. a Unigram tagger,
 2. a Bigram tagger,
 3. and a Brill tagger (using the `brill24` rule templates)
 
with a subset of the Brown Corpus. This might take 1-2 minutes.

Then, validate and compare their performance on a different subset of the Brown corpus.

In [None]:
from nltk.tag import UnigramTagger, BigramTagger
from nltk.tag.brill_trainer import BrillTaggerTrainer
from nltk.tag.brill import brill24

In [None]:
n_cutoff = 20000
brown_sents_train = brown.tagged_sents()[0:n_cutoff] # training corpus
brown_sents_test = brown.tagged_sents()[n_cutoff:] # reference corpus

In [None]:
# CODE SUBMISSION ANSWER HERE (Double click to edit)
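
Validation on held-out data usually means token-level accuracy: strip the tags from the reference sentences, re-tag the bare words, and count how many predicted tags match the reference. The following is a minimal standard-library sketch of that idea. The names `accuracy` and `always_nn` are made up for illustration, and the single gold sentence is taken from the training data in Task 3; this is not the NLTK implementation.

```python
def accuracy(tag_fn, gold_sents):
    """Fraction of tokens for which tag_fn reproduces the gold tag."""
    correct = total = 0
    for sent in gold_sents:
        words = [word for word, _ in sent]   # remove the reference tags
        predicted = tag_fn(words)            # re-tag the bare words
        for (_, gold_tag), (_, pred_tag) in zip(sent, predicted):
            correct += int(gold_tag == pred_tag)
            total += 1
    return correct / total if total else 0.0


def always_nn(words):
    """Hypothetical baseline tagger that labels every token as a noun."""
    return [(word, 'NN') for word in words]


gold = [[('His', 'PRP'), ('raise', 'NN'), ('was', 'VB'),
         ('five', 'CD'), ('dollars', 'NN'), ('.', 'SYM')]]

score = accuracy(always_nn, gold)
print(f"{score:.3f}")
```

The baseline gets two of six tokens right here. A score like this is a useful reference point when comparing the trained taggers against each other.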

### Part 4

Discuss the scores of your taggers. Which one performs better, and why?

\# TEXT SUBMISSION ANSWER HERE (Double click to edit) - expected approx. 100-150 words

### Part 5

Discuss ideas for improving the implementations and the quality of the taggers. You are not required to implement the improvement ideas.

\# TEXT SUBMISSION ANSWER HERE (Double click to edit) - expected approx. 100-250 words

---

## Task 3 - Unigram and Bigram Taggers (pen and paper):

**Training data:**

His [PRP] raise [NN] was [VB] five [CD] dollars [NN] . [SYM]
We [PRP] usually [RB] get [VB] a [DT] raise [NN] at [IN] the [DT] start [NN] of [IN] the [DT] year [NN] . [SYM]
A [DT] major [JJ] success [NN] helped [VB] to [TO] raise [VB] our [PRP] spirits [NN] . [SYM]



**Test sentence:**

It [PRP] looks [VB] like [CC] a [DT] fine [JJ] place [NN] to [TO] raise [NN or VB?] children [NN] . [SYM]


### Part 1: Unigram Tagger

Given the training data, determine the most likely tag for the word "raise" in the test sentence, using the unigram tagging method:


<font color='ff000000'>\# TEXT SUBMISSION ANSWER HERE (Double click to edit) - write down the steps for calculating the most likely POS tag</font>
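
A hand calculation can be checked with a few lines of code. The sketch below assumes that a unigram tagger assigns each word the tag it carried most often in the training data, independent of context, and returns `None` for unseen words. The function name `unigram_tag` is illustrative.

```python
from collections import Counter, defaultdict

# Training data from Task 3 as (word, tag) pairs.
train = [
    [('His', 'PRP'), ('raise', 'NN'), ('was', 'VB'), ('five', 'CD'),
     ('dollars', 'NN'), ('.', 'SYM')],
    [('We', 'PRP'), ('usually', 'RB'), ('get', 'VB'), ('a', 'DT'), ('raise', 'NN'),
     ('at', 'IN'), ('the', 'DT'), ('start', 'NN'), ('of', 'IN'), ('the', 'DT'),
     ('year', 'NN'), ('.', 'SYM')],
    [('A', 'DT'), ('major', 'JJ'), ('success', 'NN'), ('helped', 'VB'), ('to', 'TO'),
     ('raise', 'VB'), ('our', 'PRP'), ('spirits', 'NN'), ('.', 'SYM')],
]

# counts[word][tag] = number of times `word` was seen with `tag`.
counts = defaultdict(Counter)
for sent in train:
    for word, tag in sent:
        counts[word][tag] += 1


def unigram_tag(word):
    """Most frequent training tag for `word`, or None if the word was never seen."""
    if word not in counts:
        return None
    return counts[word].most_common(1)[0][0]


print(dict(counts['raise']))
print(unigram_tag('raise'))
print(unigram_tag('children'))
```

Under this assumption, the decision for "raise" depends only on how often each tag was seen with that word, and "children" receives no tag because it never occurs in the training data.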





### Part 2: Bigram Tagger

Given the training data (in Task 3), determine the most likely tag for the word "raise" in the test sentence, using the bigram tagging method:



<font color='ff000000'>\# TEXT SUBMISSION ANSWER HERE (Double click to edit) - write down the steps for calculating the most likely POS tag</font>
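
The bigram case can be checked in the same way. This sketch assumes the formulation in which the tagger conditions on the tag of the previous word together with the current word, and picks the tag seen most often in that context. Other formulations exist (for example, combining tag transition and word emission probabilities), so the definition from the lecture takes precedence. The function name `bigram_tag` is illustrative.

```python
from collections import Counter, defaultdict

# Training data from Task 3 as (word, tag) pairs.
train = [
    [('His', 'PRP'), ('raise', 'NN'), ('was', 'VB'), ('five', 'CD'),
     ('dollars', 'NN'), ('.', 'SYM')],
    [('We', 'PRP'), ('usually', 'RB'), ('get', 'VB'), ('a', 'DT'), ('raise', 'NN'),
     ('at', 'IN'), ('the', 'DT'), ('start', 'NN'), ('of', 'IN'), ('the', 'DT'),
     ('year', 'NN'), ('.', 'SYM')],
    [('A', 'DT'), ('major', 'JJ'), ('success', 'NN'), ('helped', 'VB'), ('to', 'TO'),
     ('raise', 'VB'), ('our', 'PRP'), ('spirits', 'NN'), ('.', 'SYM')],
]

# context_counts[(previous_tag, word)][tag] = number of times seen in training.
context_counts = defaultdict(Counter)
for sent in train:
    prev_tag = None  # no previous tag at the start of a sentence
    for word, tag in sent:
        context_counts[(prev_tag, word)][tag] += 1
        prev_tag = tag


def bigram_tag(prev_tag, word):
    """Most frequent tag for `word` after `prev_tag`, or None for an unseen context."""
    seen = context_counts.get((prev_tag, word))
    return seen.most_common(1)[0][0] if seen else None


# In the test sentence, "raise" follows "to", which is tagged TO.
print(bigram_tag('TO', 'raise'))
print(bigram_tag('DT', 'raise'))
print(bigram_tag('JJ', 'raise'))
```

Contexts that never occurred in training, such as "raise" after an adjective, yield `None` here, which illustrates how quickly a bigram model runs into sparse data.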

---

#### Submitting your results:

To submit your results, please:

- save this file, i.e., `ex??_assignment.ipynb`.
- if you reference any external files (e.g., images), please create a zip or rar archive and put the notebook file and all referenced files in it.
- log in to ILIAS and submit the `*.ipynb` file or archive for the corresponding assignment.

**Remarks:**
    
- Do not copy code from the Internet. If you want to use publicly available code, please add a reference to the respective code snippet.
- Check that your code compiles and executes, even after you have restarted the kernel.
- Submit your written solutions and coding exercises only within the provided spaces.
- Write your name and your partner's name in the top section.