## Document Analysis: Computational Methods - Summer Term 2025
### Lectures: Jun.-Prof. Dr. Andreas Spitz
### Tutorials: Julian Schelb

# Exercise 04

- NLP Recap
- Entity Recognition

---

## Task 1 - NLP Recap:

## Q & A

Answer the following questions. If you like, you can treat this as exam preparation, e.g., first try to answer each question without the help of the slides ;) But you are of course allowed to use the slides at any time.


(1) Name three reasons why natural language processing (NLP) is challenging

**EXAMPLE SOLUTION**

1. Comprehensively understanding the human language requires **understanding both the words and** how the **concepts** are connected to deliver the intended message -> Syntax & Semantics

2. While humans can easily master a language, the **ambiguity** and **imprecision** of natural languages make NLP difficult for machines -> Ambiguity, Imprecision, Context Dependence

3. There are many **different languages (hundreds) with individual rules**, syntax etc. -> Multilingualism

4. Language is **different across cultures, age groups and individuals** -> External context dependence

5. **Technical aspects** like text encodings, formatting, special character treatment etc. -> Handling of Text in Software

...

---

(2) What is tokenization?

**EXAMPLE SOLUTION**

Tokenization describes segmenting an input stream into an ordered sequence of tokens.
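
As a minimal sketch of this definition, the following regex-based tokenizer turns a string into an ordered list of tokens. The pattern and the function name `tokenize` are assumptions made for illustration; real tokenizers (e.g. in NLTK or spaCy) use considerably more elaborate rules.

```python
import re

# Toy pattern (an assumption, not a standard): runs of word characters,
# optionally joined by internal apostrophes, or any single non-space symbol.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)*|[^\w\s]")


def tokenize(text):
    """Split text into an ordered list of word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


tokens = tokenize("The crowd didn't believe the man.")
print(tokens)
```

Whether a contraction such as "didn't" should be one token or several is a design decision that differs between tokenizers.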

---

(3) What is a word stem? Give the stem of the word "undoes"

**EXAMPLE SOLUTION**

The part common to all derived or inflected forms of a word, e.g. undoes/undoing -> undo
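
The idea of stripping affixes to reach a common stem can be sketched with a toy suffix stripper. The suffix list and the name `toy_stem` are hypothetical and chosen only for this example; this is not the Porter algorithm or any other established stemmer.

```python
# Hypothetical suffix list, ordered longest first; real stemmers use far richer rules.
SUFFIXES = ("ing", "es", "ed", "s")


def toy_stem(word, min_stem_length=3):
    """Strip the first matching suffix, keeping at least min_stem_length characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[: -len(suffix)]
    return word


for word in ("undoes", "undoing", "undo"):
    print(word, "->", toy_stem(word))
```

Because such rules look only at the surface form, the resulting stem does not have to be a valid word, which is one difference to a lemma.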

---

(4) What is a word lemma? Give the lemma of the word "undoes"

**EXAMPLE SOLUTION**

The base (dictionary) form of a word, e.g. undoes -> (to) undo

---

(5) Why should we typically extract word stems/lemmas before proceeding with text analysis?

**EXAMPLE SOLUTION**

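To condense related word forms into one representation, so that there is less variability in the vocabulary.
For instance, in an information retrieval setting this boosts an algorithm's recall, because a query for one form also matches documents containing the other forms.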

---

(6) What are stop-words? Why does it make sense to remove stop-words before proceeding with text analysis?

**EXAMPLE SOLUTION**

Stop-words are words in natural language that carry very little meaning on their own, such as "is", "an", "the", etc.
They are often removed from the text because they occur in abundance and therefore provide little to no distinguishing information for classification or clustering.

---

(7) If you had access to frequency statistics for a language, how could you create a list of stop words?

**EXAMPLE SOLUTION**

One way: Determine a cut-off, e.g. the 500 most frequent words, and mark these as stop-words for the subsequent analysis steps.
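
The cut-off approach can be sketched with a frequency counter. The function name `frequency_stop_words` and the tiny corpus are assumptions for illustration; on such a small sample, content words are frequent as well, so a meaningful cut-off requires statistics from a large corpus.

```python
from collections import Counter


def frequency_stop_words(tokens, cutoff):
    """Return the `cutoff` most frequent lower-cased tokens as a stop-word set."""
    counts = Counter(token.lower() for token in tokens)
    return {word for word, _ in counts.most_common(cutoff)}


corpus = (
    "the crowd believed the man . the man believed in the crowd . "
    "the crowd believed that the man was lying ."
).split()
# Keep alphabetic tokens only so punctuation is not counted as a word.
words = [token for token in corpus if token.isalpha()]
print(frequency_stop_words(words, cutoff=1))
```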

---

(8) Name a use-case in which we should NOT remove stop-words prior to text analysis.

**EXAMPLE SOLUTION**

See the following example:
```
 The crowd believed the man.
 The crowd believed in the man.
 The crowd believed that the man was lying.
```
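With a typical stop-word list, the first two sentences collapse to the same sequence, "crowd believed man", and the third differs only by the additional word "lying". The function words that distinguish the sentences are lost, even though their meanings are clearly different and, in one case, opposite.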
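
The effect can be reproduced in a few lines. The stop-word set below is hand-picked for this example (an assumption); how much of the distinction survives depends on the list that is used.

```python
# Hand-picked stop-word list for this example only (an assumption).
STOP_WORDS = {"the", "in", "that", "was"}


def remove_stop_words(sentence):
    """Lower-case, strip the final period, and drop stop-words."""
    words = sentence.lower().rstrip(".").split()
    return [word for word in words if word not in STOP_WORDS]


sentences = [
    "The crowd believed the man.",
    "The crowd believed in the man.",
    "The crowd believed that the man was lying.",
]
reduced = [remove_stop_words(sentence) for sentence in sentences]
for original, kept in zip(sentences, reduced):
    print(f"{original!r} -> {kept}")
```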

---

(9) What is part-of-speech (POS) tagging? Why is it useful?

**EXAMPLE SOLUTION**

POS tagging assigns each word a tag that indicates its part of speech (e.g. noun, verb, adjective).
POS tags are used in a variety of NLP tasks; they provide a linguistic signal on how a word is being used within the scope of a phrase, sentence, or document.

---

(10) Explain how n-gram POS tagging works. What is the limitation of this method?

**EXAMPLE SOLUTION**

An n-gram tagger is a simple statistical tagging algorithm.
A unigram tagger assigns each token the tag that is most likely for that token's text, ignoring context.
Bigram and trigram taggers additionally look at the tags of the preceding words to guess the part-of-speech tag of the current word:
a bigram tagger uses the previous tag as part of its context,
a trigram tagger uses the previous two tags as part of its context.

A potential issue with n-gram taggers is their size.
An n-gram tagger with backoff may store trigram and bigram tables, which are large sparse arrays that may have hundreds of millions of entries.
A consequence of the size of the models is that it is simply impractical for n-gram models to be conditioned on the identities of the words in the context.
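
A bigram tagger with backoff can be sketched as follows. The training sentences, the tag names, the default tag and the function names `train` and `tag` are all assumptions made for this illustration; it mirrors the idea described above rather than any specific library implementation.

```python
from collections import Counter, defaultdict


def train(tagged_sentences):
    """Count tags per word (unigram) and per (previous tag, word) pair (bigram)."""
    unigram = defaultdict(Counter)
    bigram = defaultdict(Counter)
    for sentence in tagged_sentences:
        previous_tag = "<s>"  # sentence-start marker
        for word, word_tag in sentence:
            unigram[word][word_tag] += 1
            bigram[(previous_tag, word)][word_tag] += 1
            previous_tag = word_tag
    return unigram, bigram


def tag(words, unigram, bigram, default="NOUN"):
    """Tag left to right: bigram context first, then unigram, then a default tag."""
    tagged = []
    previous_tag = "<s>"
    for word in words:
        if (previous_tag, word) in bigram:
            best = bigram[(previous_tag, word)].most_common(1)[0][0]
        elif word in unigram:
            best = unigram[word].most_common(1)[0][0]
        else:
            best = default
        tagged.append((word, best))
        previous_tag = best
    return tagged


# Tiny hand-made training set (illustrative only).
training = [
    [("the", "DET"), ("duck", "NOUN"), ("swims", "VERB")],
    [("they", "PRON"), ("duck", "VERB")],
    [("we", "PRON"), ("duck", "VERB")],
]
unigram, bigram = train(training)
print(tag(["the", "duck"], unigram, bigram))
print(tag(["duck"], unigram, bigram))
```

In this toy data, "duck" is most often a verb, so the unigram fallback chooses VERB, while the bigram context (previous tag DET) selects NOUN. The bigram table has one entry per observed (tag, word) pair, which hints at why such tables grow quickly with larger contexts.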

---

(11) What is parsing? Which two main types of parsing exist?

**EXAMPLE SOLUTION**

Parsing means to analyze (a sentence) in terms of grammatical constituents, identifying the parts of speech, syntactic relations, etc.

The two main approaches are:

Top-down approach vs. bottom-up approach.

alternatively:

Constituency parsing vs. dependency parsing.

---

(12) What is the difference between a context-free and a context-sensitive language? What is the difference between a context-free and a regular language?

**EXAMPLE SOLUTION**

In a context-free grammar, each production rule has the form `A -> α` (with α being a string of non-terminals and terminal-symbols, e.g. `A B` or `A terminal`)

The set of context-free languages is a subset of the set of context-sensitive languages. In a context-sensitive grammar, rules may have the form `αAβ → αγβ` (with A being a non-terminal, α and β being arbitrary strings of terminals and non-terminals, and γ being a non-empty such string).
Here, a production rule may depend on the surrounding context of A. This is not possible in a context-free grammar.

Finally, a regular language has an even more restricted grammar: each right-hand side can contain at most one non-terminal, and that non-terminal has to be placed consistently at the start (left-regular) or at the end (right-regular) of the right-hand side.
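
The restrictions on rule shapes can be checked mechanically. The sketch below represents each rule as a pair of tuples (left-hand side, right-hand side); this representation and the function names are assumptions for illustration, and only the right-regular variant is checked.

```python
def is_context_free(rules, nonterminals):
    """Every left-hand side consists of exactly one non-terminal."""
    return all(len(lhs) == 1 and lhs[0] in nonterminals for lhs, _ in rules)


def is_right_regular(rules, nonterminals):
    """Context-free, and a non-terminal may only appear as the last symbol
    of a right-hand side."""
    if not is_context_free(rules, nonterminals):
        return False
    return all(
        all(symbol not in nonterminals for symbol in rhs[:-1])
        for _, rhs in rules
    )


NONTERMINALS = {"S", "B"}
regular_rules = [(("S",), ("a", "S")), (("S",), ("b",))]       # a* b
context_free_rules = [(("S",), ("a", "S", "b")), (("S",), ())]  # a^n b^n
context_sensitive_rules = [(("a", "B"), ("a", "b"))]            # B -> b only after a

for name, rules in [
    ("regular", regular_rules),
    ("context-free", context_free_rules),
    ("context-sensitive", context_sensitive_rules),
]:
    print(name, is_context_free(rules, NONTERMINALS), is_right_regular(rules, NONTERMINALS))
```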

---

(13) Give an equivalent grammar in Chomsky Normal form for the Grammar G=(N,T,P,S) with N={S,A,B,C}, T={a,b,c}, P={S->ABC, A->a, B->b, C->c}

**EXAMPLE SOLUTION**

Production rule set:
- S -> A B C
- A -> a
- B -> b
- C -> c

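Chomsky normal form only allows rules of the form `A -> B C` (two non-terminals) or `A -> a` (one terminal). The rule `S -> A B C` has three non-terminals on its right-hand side, so we split it by introducing a new non-terminal X (which is also added to N):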

- S -> A X
- X -> B C
- A -> a
- B -> b
- C -> c
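
The splitting step can be sketched in code. The helper names `binarize` and `is_cnf_rule` and the naming scheme for fresh non-terminals (`X1`, `X2`, ...) are assumptions; the sketch only handles right-hand sides that are too long and ignores the other steps of a full conversion (removing empty rules, unit rules, and terminals inside longer rules).

```python
def is_cnf_rule(rhs, nonterminals):
    """CNF allows A -> B C (two non-terminals) or A -> a (one terminal)."""
    if len(rhs) == 2:
        return all(symbol in nonterminals for symbol in rhs)
    return len(rhs) == 1 and rhs[0] not in nonterminals


def binarize(rules):
    """Split right-hand sides longer than two symbols using fresh non-terminals."""
    result = []
    fresh = 0
    for lhs, rhs in rules:
        while len(rhs) > 2:
            fresh += 1
            new_symbol = f"X{fresh}"  # hypothetical naming scheme, assumed unused
            result.append((lhs, (rhs[0], new_symbol)))
            lhs, rhs = new_symbol, rhs[1:]
        result.append((lhs, rhs))
    return result


rules = [
    ("S", ("A", "B", "C")),
    ("A", ("a",)),
    ("B", ("b",)),
    ("C", ("c",)),
]
cnf_rules = binarize(rules)
cnf_nonterminals = {lhs for lhs, _ in cnf_rules}
for lhs, rhs in cnf_rules:
    print(lhs, "->", " ".join(rhs))
```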

---

(14) Explain how shift-reduce parsing works.

**EXAMPLE SOLUTION**

Shift-reduce parsing uses a stack as its data structure and has two distinct operations:

SHIFT: Push the next word from the input sentence onto the stack.

REDUCE: Check whether the top n symbols on the stack match the right-hand side of a production rule. If this is the case, pop these n symbols and replace them by the left-hand side of the production rule.

The process stops when the input sentence has been processed completely (meaning every word has been pushed onto the stack) and no further reduction is possible. The sentence is accepted if the start symbol S is the only element left on the stack.
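
A greedy variant of this procedure can be sketched as follows, using the small grammar with the rules S -> A X and X -> B C from question 13. The function name `shift_reduce` and the strategy "reduce whenever possible, otherwise shift" are assumptions; this simple strategy needs no backtracking for this grammar, but it can fail on grammars where a shift would have been the better choice, and it assumes non-empty right-hand sides without cyclic unit rules.

```python
def shift_reduce(words, rules, start="S"):
    """Greedy shift-reduce recognizer: reduce whenever possible, else shift."""
    stack, queue, trace = [], list(words), []
    while True:
        for lhs, rhs in rules:
            n = len(rhs)
            if n <= len(stack) and tuple(stack[-n:]) == rhs:
                # REDUCE: replace the matched symbols by the left-hand side.
                del stack[-n:]
                stack.append(lhs)
                trace.append(f"REDUCE {lhs} -> {' '.join(rhs)}")
                break
        else:
            # No rule matched the top of the stack.
            if not queue:
                break  # input consumed and nothing left to reduce
            # SHIFT: move the next input word onto the stack.
            stack.append(queue.pop(0))
            trace.append(f"SHIFT {stack[-1]}")
    return stack == [start], trace


RULES = [
    ("S", ("A", "X")),
    ("X", ("B", "C")),
    ("A", ("a",)),
    ("B", ("b",)),
    ("C", ("c",)),
]
accepted, trace = shift_reduce("a b c".split(), RULES)
print(accepted)
print("\n".join(trace))
```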


---

(15) What is the grammar ambiguity problem?

**EXAMPLE SOLUTION**

An ambiguous grammar is a context-free grammar for which there exists a string that can have more than one leftmost derivation or parse tree, while an unambiguous grammar is a context-free grammar for which every valid string has a unique leftmost derivation or parse tree.
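
Ambiguity can be made visible by counting parse trees. The sketch below is a CYK-style chart for grammars in Chomsky normal form that stores the number of trees per span instead of a yes/no value. The function name `count_parses` and the example grammar `S -> S S | a` are assumptions chosen for illustration.

```python
from collections import defaultdict


def count_parses(words, binary_rules, lexical_rules, start="S"):
    """Count the parse trees of `words` for a grammar in Chomsky normal form."""
    n = len(words)
    # chart[(i, j)][A] = number of parse trees deriving words[i:j] from A
    chart = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(words):
        for lhs, terminal in lexical_rules:
            if terminal == word:
                chart[(i, i + 1)][lhs] += 1
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):  # split point between the two children
                for lhs, (left, right) in binary_rules:
                    chart[(i, j)][lhs] += chart[(i, k)][left] * chart[(k, j)][right]
    return chart[(0, n)][start]


# Ambiguous grammar: S -> S S | a
ambiguous_binary = [("S", ("S", "S"))]
ambiguous_lexical = [("S", "a")]
print(count_parses("a a a".split(), ambiguous_binary, ambiguous_lexical))

# Unambiguous grammar: S -> A X, X -> B C, A -> a, B -> b, C -> c
plain_binary = [("S", ("A", "X")), ("X", ("B", "C"))]
plain_lexical = [("A", "a"), ("B", "b"), ("C", "c")]
print(count_parses("a b c".split(), plain_binary, plain_lexical))
```

A count greater than one for some string shows that the grammar is ambiguous; a count of one for a single string does not prove the opposite, since ambiguity is a property over all strings of the language.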


---

## Task 2 - Named Entity Recognition:

### Part 1: Automated Annotations

Use spaCy to annotate entities in the debates dataset (available as part of the JSON in the data directory). [Depending on your computer, the extraction may take some time. Therefore, you are allowed to restrict the size of the text, e.g. to the first 250 sentences. Feel free to use a larger share of the data to get better insights.]

Tip: Use the en_core_web_sm model

Display the results in readable form, e.g. show the tagged entities for a reasonably sized part of the data, ideally alongside the original text.

In [1]:
# Tip for spaCy: To load a model, you might first need to download it via the command line.
# Inside a notebook, you can run shell commands by prefixing them with an exclamation mark (!).

# Similar to: !python ...

In [2]:
# read debates
import json
with open('data/texts.json', 'r') as infile:
    data = json.load(infile)

content_debates = data['debates']

In [3]:
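# split the text into sentences and keep the first 1000
# (sent_tokenize needs NLTK's Punkt tokenizer data; if it is missing,
# NLTK raises a LookupError that names the resource to download)
import nltk
sentences = nltk.sent_tokenize(content_debates)[:1000]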

In [4]:
import spacy
nlp = spacy.load("en_core_web_sm")

In [5]:
# nlp.pipe processes the sentences as a stream, which is usually faster than calling nlp() one by one
annotated = list(nlp.pipe(sentences))

In [6]:
from spacy import displacy
# render the entities of the first ten sentences inline in the notebook
displacy.render(annotated[:10], jupyter=True, style='ent')
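
Besides the displaCy rendering, a plain-text listing can be easier to scan or to export. The following sketch relies only on the `ents`, `text` and `label_` attributes of spaCy's `Doc` and `Span` objects; the helper names `entity_rows` and `format_entities` are hypothetical, and the stand-in objects (with a made-up entity) exist only so that the snippet runs without spaCy.

```python
from types import SimpleNamespace


def entity_rows(docs):
    """Yield (sentence index, entity text, entity label) for each recognized entity."""
    for index, doc in enumerate(docs):
        for ent in doc.ents:
            yield index, ent.text, ent.label_


def format_entities(docs):
    """Render the rows as aligned plain-text lines."""
    return "\n".join(
        f"{index:>4}  {label:<10}  {text}" for index, text, label in entity_rows(docs)
    )


# Stand-in objects so the sketch runs without spaCy; the entity shown is made up.
demo_docs = [
    SimpleNamespace(ents=[SimpleNamespace(text="Washington", label_="GPE")]),
    SimpleNamespace(ents=[]),
]
print(format_entities(demo_docs))
```

In the notebook, `print(format_entities(annotated[:10]))` would list the entities of the first ten annotated sentences.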




### Part 2: Analysis

In real-world use cases, NER can be difficult. Analyze the following challenging examples and identify all cases in which the automated NER failed. For the tagging, you can use the method from Part 1.

In [7]:
# read the challenging test sentences
import json
with open('data/hard_data.json', 'r') as infile:
    data = json.load(infile)

test_sentences = [s["sentence"] for s in data['test_sentences']]

Write the failure cases down here, then try to classify them into types of errors that the model makes. Discuss the results.
You can use a mixture of markdown and code if it suits your analysis.

In [8]:
test_annotated = [nlp(sentence) for sentence in test_sentences]
# Your Code Submission goes here

In [9]:
from spacy import displacy
displacy.render(test_annotated[:10], jupyter=True, style='ent')


---

#### Submitting your results:

To submit your results, please:

- save this file, i.e., `ex??_assignment.ipynb`.
- if you reference any external files (e.g., images), please create a zip or rar archive and put the notebook file and all referenced files in there.
- log in to ILIAS and submit the `*.ipynb` or archive for the corresponding assignment.

**Remarks:**
    
- Do not copy any code from the Internet. In case you want to use publicly available code, please add a reference to the respective code snippet.
- Check that your code runs without errors, even after you have restarted the kernel.
- Submit your written solutions and the coding exercises within the provided spaces and not otherwise.
- Write your name and the name of your partner in the top section.