## Document Analysis: Computational Methods - Summer Term 2025
### Lectures: Jun.-Prof. Dr. Andreas Spitz
### Tutorials: Julian Schelb

# Exercise 04

- NLP Recap
- Entity Recognition

---

## Task 1 - NLP Recap:

## Q & A

Give answers to the following questions. If you like, you can treat this as exam preparation, e.g., first try to answer each question without the help of the slides ;) But you are obviously allowed to use the slides at any time.


(1) Name three reasons why natural language processing (NLP) is challenging

1. Ambiguity: many words can take more than one POS tag (e.g., heteronyms).
2. Simple statistical algorithms are not enough to predict POS tags reliably; more advanced, context-aware methods are needed.
3. Data sparsity: if the training corpus does not contain enough words, statistical models (e.g., unigram or bigram models) are poorly trained.

---

(2) What is tokenization?

Tokenization is the process of breaking text into smaller units called tokens (typically words, subwords, or punctuation marks), which serve as the basic elements for further NLP tasks.
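As a rough illustration of the definition above, here is a minimal regex-based tokenizer. It is only a sketch: the pattern and the function name `tokenize` are assumptions made for this example, and real tokenizers (such as the one in spaCy) handle many more cases.

```python
import re

# A word (optionally with internal apostrophes or hyphens) or a single punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+(?:['-]\w+)*|[^\w\s]")

def tokenize(text):
    """Split text into word and punctuation tokens (toy rule, not spaCy's)."""
    return TOKEN_PATTERN.findall(text)

print(tokenize("Don't panic: NLP isn't magic."))
```

Whitespace is dropped, while punctuation marks are kept as separate tokens.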

---

(3) What is a word stem? Give the stem of the word "undoes"

A word stem is the core part of a word that remains after removing inflectional or derivational affixes.

Undoes -> Undo

---

(4) What is a word lemma? Give the lemma of the word "undoes"

A lemma is the canonical form of a word, i.e., the form under which it is listed in a dictionary.

Undoes -> Undo
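The difference between the two becomes visible for irregular forms. The sketch below contrasts a toy suffix-stripping stemmer with a toy dictionary-based lemmatizer. The suffix rules, the small lemma dictionary, and the names `toy_stem` and `toy_lemma` are all assumptions made for illustration; they are not the rules of any real stemmer or lemmatizer.

```python
# Toy suffix rules, tried in order; a real stemmer has many more rules.
SUFFIX_RULES = [("sses", "ss"), ("ies", "i"), ("es", ""), ("s", "")]

# Toy lemma dictionary; a real lemmatizer uses a full lexicon and POS information.
LEMMA_DICT = {"undoes": "undo", "undid": "undo", "undone": "undo"}

def toy_stem(word):
    """Strip the first matching suffix, leaving at least two characters."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: len(word) - len(suffix)] + replacement
    return word

def toy_lemma(word):
    """Look the word up in the dictionary; unknown words are returned unchanged."""
    word = word.lower()
    return LEMMA_DICT.get(word, word)

for w in ["undoes", "undid", "undone"]:
    print(w, toy_stem(w), toy_lemma(w))
```

For "undoes" both approaches agree, but the suffix rules leave "undid" and "undone" untouched, while the dictionary lookup maps them to the same lemma.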

---

(5) Why should we typically extract word stems/lemmas before proceeding with text analysis?

They help us filter and match texts or documents, because different inflected forms of a word are mapped to the same term.

They reduce the vocabulary size and help algorithms work more efficiently.

They improve accuracy in tasks like information retrieval, classification, and clustering, because semantically related forms are grouped together.

---

(6) What are stop-words? Why does it make sense to remove stop-words before proceeding with text analysis?

Stop-words are frequent function words (like the, is, and). We remove them to reduce noise and vocabulary size, and to improve the efficiency and accuracy of text analysis.


---

(7) If you had access to frequency statistics for a language, how could you create a list of stop words?

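We can build a list of stop words from word ranks. By Zipf's law, a word's frequency is roughly inversely proportional to its frequency rank, so a small number of top-ranked words (mostly function words) account for a large share of all tokens. Ranking all words by frequency and taking the top-ranked words up to a chosen cutoff yields a stop-word list.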
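A minimal sketch of this frequency-based approach, assuming the corpus is already tokenized. The function name `frequency_stop_words` and the cutoff parameter `top_k` are hypothetical, and the tiny corpus is made up for illustration.

```python
from collections import Counter

def frequency_stop_words(tokens, top_k=3):
    """Return the top_k most frequent (lowercased) tokens as stop-word candidates."""
    counts = Counter(token.lower() for token in tokens)
    return [word for word, _ in counts.most_common(top_k)]

corpus = "the cat sat on the mat and the dog sat on the rug".split()
print(frequency_stop_words(corpus, top_k=1))
```

On such a small corpus, a larger cutoff would quickly pick up content words like "sat", which is why a frequency-based list needs a reasonably large corpus and a carefully chosen cutoff.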

---

(8) Name a use-case in which we should NOT remove stop-words prior to text analysis.

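POS tagging: function words such as determiners and auxiliaries provide the context that the tagger needs to disambiguate the surrounding words.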

---

(9) What is part-of-speech (POS) tagging? Why is it useful?

Part-of-speech tagging is the process of assigning tags (grammatical categories) to words. It is useful for subsequent text and sentence processing steps such as named entity recognition.

---

(10) Explain how n-gram POS tagging works. What is the limitation of this method?

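An n-gram tagger is a statistical tagger: for each word, it picks the tag that occurred most often in the training corpus for that word in the context of the preceding n-1 tags. Its limitation is data sparsity: the larger n is, the more contexts were never seen in training, so without a backoff strategy (e.g., falling back to a bigram and then a unigram tagger) it performs poorly.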
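The idea can be sketched with a tiny bigram tagger that backs off to a unigram tagger and finally to a default tag. This is an illustrative sketch under simplifying assumptions: the training sentences are made up, the names `train` and `tag` are hypothetical, and the default tag `NOUN` is an arbitrary choice.

```python
from collections import Counter, defaultdict

def train(tagged_sentences):
    """Count tags per word (unigram) and per (previous tag, word) context (bigram)."""
    unigram = defaultdict(Counter)
    bigram = defaultdict(Counter)
    for sentence in tagged_sentences:
        prev_tag = "<s>"  # sentence-start marker
        for word, tag_ in sentence:
            unigram[word][tag_] += 1
            bigram[(prev_tag, word)][tag_] += 1
            prev_tag = tag_
    return unigram, bigram

def tag(words, unigram, bigram, default="NOUN"):
    """Tag words with the bigram model, backing off to unigram, then to a default."""
    tagged, prev_tag = [], "<s>"
    for word in words:
        if (prev_tag, word) in bigram:      # context seen in training
            best = bigram[(prev_tag, word)].most_common(1)[0][0]
        elif word in unigram:               # backoff: ignore the context
            best = unigram[word].most_common(1)[0][0]
        else:                               # unknown word
            best = default
        tagged.append((word, best))
        prev_tag = best
    return tagged

training = [
    [("the", "DET"), ("duck", "NOUN"), ("swims", "VERB")],
    [("they", "PRON"), ("duck", "VERB"), ("quickly", "ADV")],
    [("we", "PRON"), ("duck", "VERB")],
]
unigram, bigram = train(training)
print(tag(["the", "duck"], unigram, bigram))
print(tag(["birds", "duck"], unigram, bigram))
```

In the second call, "birds" is unknown and the context (NOUN, "duck") was never seen, so the tagger has to back off twice; without the backoff steps it could not assign any tag.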

---

(11) What is parsing? Which two main types of parsing exist?

Parsing is the process of building the syntactic structure of a sentence with respect to a given grammar. The two main types are bottom-up and top-down parsing.

---

(12) What is the difference between a context-free and a context-sensitive language? What is the difference between a context-free and a regular language?

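CFL vs CSL: In a context-free grammar, every rule has a single nonterminal on its left-hand side, so it can be applied regardless of context. Context-sensitive rules allow rewriting depending on the surrounding symbols; they generate a larger class of languages. \
CFL vs Regular: Regular languages are more restricted: they can be recognized by finite automata and cannot express unbounded nesting (e.g., a^n b^n). All regular languages are CFLs, but not all CFLs are regular.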
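A small sketch of the gap between regular and context-free languages, using the language a^n b^n as an example. The regular expression `a*b*` is the closest simple regular pattern but cannot enforce equal counts, whereas a recursive recognizer that follows the context-free rules S -> a S b and S -> (empty) can. The function names are hypothetical.

```python
import re

# Regular approximation: any number of a's followed by any number of b's.
REGULAR_APPROXIMATION = re.compile(r"a*b*")

def matches_regular(s):
    """True if s consists of a's followed by b's, with no constraint on the counts."""
    return REGULAR_APPROXIMATION.fullmatch(s) is not None

def in_anbn(s):
    """Recognize a^n b^n (n >= 0) by following S -> a S b | empty."""
    if s == "":
        return True
    return s[0] == "a" and s[-1] == "b" and in_anbn(s[1:-1])

for s in ["aabb", "aab"]:
    print(s, matches_regular(s), in_anbn(s))
```

The string "aab" is accepted by the regular pattern but rejected by the recursive recognizer, because matching each a with a b requires nesting.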

---

(13) Give an equivalent grammar in Chomsky Normal form for the Grammar G=(N,T,P,S) with N={S,A,B,C}, T={a,b,c}, P={S->ABC, A->a, B->b, C->c}

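The rule S->ABC has three nonterminals on its right-hand side, which Chomsky Normal Form does not allow. Introducing a new nonterminal D splits it into two binary rules:

G' = (N', T, P', S) \
N' = {S, A, B, C, D} \
T = {a,b,c} \
P' = {S->AD, D->BC, A->a, B->b, C->c}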
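The conversion step needed for this grammar, splitting a right-hand side with more than two nonterminals, can be sketched as follows. This is not a full CNF conversion: it assumes there are no epsilon rules, unit rules, or terminals mixed into long rules. The function names and the fresh nonterminal names (`X1`, `X2`, ...) are hypothetical and assumed not to clash with existing symbols.

```python
def binarize(productions):
    """Split right-hand sides longer than two symbols using fresh nonterminals."""
    result, counter = [], 0
    for lhs, rhs in productions:
        rhs = list(rhs)
        while len(rhs) > 2:
            counter += 1
            fresh = f"X{counter}"  # hypothetical name for the new nonterminal
            result.append((lhs, (rhs[0], fresh)))
            lhs, rhs = fresh, rhs[1:]
        result.append((lhs, tuple(rhs)))
    return result

def is_cnf(productions, nonterminals, terminals):
    """Check that every rule has the form A -> B C or A -> a."""
    for _, rhs in productions:
        binary = len(rhs) == 2 and all(s in nonterminals for s in rhs)
        lexical = len(rhs) == 1 and rhs[0] in terminals
        if not (binary or lexical):
            return False
    return True

N, T = {"S", "A", "B", "C"}, {"a", "b", "c"}
P = [("S", ("A", "B", "C")), ("A", ("a",)), ("B", ("b",)), ("C", ("c",))]

cnf = binarize(P)
print(is_cnf(P, N, T))            # the original grammar
print(cnf)
print(is_cnf(cnf, N | {"X1"}, T)) # the binarized grammar
```

The original rule set fails the check because of S -> A B C, while the binarized rule set passes it.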

---

(14) Explain how shift-reduce parsing works.

Algorithm outline:

- Shift operation: Push the next word from the input onto the stack.
- Reduce operation: If the top n symbols on the stack match the right-hand side of a production rule, they are popped and replaced by the left-hand side of that production.
- Stopping condition: The process stops when the input has been processed and S has been popped from the stack.
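The outline above can be sketched as a small recognizer. This is a simplified, greedy variant that reduces whenever a rule matches and never backtracks, so it can fail on grammars where the choice between shifting and reducing matters. The function name `shift_reduce`, the rule representation, and the example grammar (a CNF version of the grammar from question 13, with an assumed extra nonterminal `D`) are choices made for this illustration.

```python
def shift_reduce(tokens, rules, start="S"):
    """Greedy shift-reduce recognizer; returns (accepted, trace of operations)."""
    stack, buffer, trace = [], list(tokens), []
    while True:
        # Reduce as long as the top of the stack matches some right-hand side.
        reduced = True
        while reduced:
            reduced = False
            for lhs, rhs in rules:
                n = len(rhs)
                if n <= len(stack) and tuple(stack[-n:]) == rhs:
                    del stack[-n:]
                    stack.append(lhs)
                    trace.append(("reduce", f"{lhs} -> {' '.join(rhs)}"))
                    reduced = True
                    break
        if not buffer:
            break
        # Shift: move the next input word onto the stack.
        stack.append(buffer.pop(0))
        trace.append(("shift", stack[-1]))
    # Accept if the whole input was reduced to the start symbol.
    return stack == [start], trace

RULES = [
    ("S", ("A", "D")), ("D", ("B", "C")),
    ("A", ("a",)), ("B", ("b",)), ("C", ("c",)),
]

accepted, trace = shift_reduce("a b c".split(), RULES)
print(accepted)
for step in trace:
    print(step)
```

For the input "a b c" the trace alternates shifts and lexical reductions until the stack holds A B C, which is then reduced to A D and finally to S.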

---

(15) What is the grammar ambiguity problem?

A grammar is ambiguous if there exists at least one string in the language that can be generated by the grammar in more than one way (i.e., it has more than one valid parse tree, or equivalently more than one leftmost derivation).
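To make the definition concrete, the sketch below counts the parse trees of a string under a small grammar in Chomsky Normal Form, using a CYK-style recursion over spans. The example grammar (E -> E R | a, R -> O E, O -> +) and the function name `count_parses` are assumptions made for this illustration.

```python
from functools import lru_cache

def count_parses(tokens, binary_rules, lexical_rules, start="E"):
    """Count the parse trees of tokens for a grammar in Chomsky Normal Form."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(i, j, symbol):
        # Number of trees rooted in `symbol` that cover tokens[i:j].
        if j - i == 1:
            return 1 if (symbol, tokens[i]) in lexical_rules else 0
        total = 0
        for lhs, left, right in binary_rules:
            if lhs != symbol:
                continue
            for k in range(i + 1, j):  # every split point of the span
                total += count(i, k, left) * count(k, j, right)
        return total

    return count(0, len(tokens), start) if tokens else 0

BINARY = {("E", "E", "R"), ("R", "O", "E")}
LEXICAL = {("E", "a"), ("O", "+")}

print(count_parses("a + a".split(), BINARY, LEXICAL))
print(count_parses("a + a + a".split(), BINARY, LEXICAL))
```

The string "a + a" has exactly one parse tree, but "a + a + a" has two, one grouping to the left and one to the right, so the grammar is ambiguous.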

---

## Task 2 - Named Entity Recognition:

### Part 1: Automated Annotations

Use spaCy to annotate entities in the debates dataset (available as part of the JSON in the data directory). [Depending on your computer, the extraction may take some time. Therefore, you are allowed to restrict the size of the text, e.g., to the first 250 sentences. Feel free to use a larger share of the data to get better insights.]

Tip: Use the en_core_web_sm model

Display the results in readable form, e.g., show the tagged entities for a reasonably sized part of the data, ideally alongside the original text.

In [8]:
# Tip for spaCy: to load a model, you might need to download it first via the command line.
# Inside a notebook, you can run shell commands by prefixing them with "!",
# similar to: !python ...
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Using cached https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [6]:
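# Read the debates text and run the spaCy pipeline on it
import json
import spacy

with open('data/texts.json', 'r') as infile:
    data = json.load(infile)

content_debates = data['debates']
nlp = spacy.load("en_core_web_sm")  # small English pipeline
doc = nlp(content_debates)          # runs the full pipeline, including NER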


In [17]:
for entity in doc.ents:
    print(entity.text, entity.label_)

evening TIME
Hofstra University ORG
Hempstead GPE
New York GPE
Lester Holt PERSON
NBC Nightly News ORG
first ORDINAL
tonight TIME
Donald Trump PERSON
Hillary Clinton PERSON
the Commission on Presidential Debates ORG
tonight TIME
90-minute TIME
six CARDINAL
15 minutes TIME
three CARDINAL
tonight TIME
America GPE
America GPE
two minutes TIME
Democratic NORP
the United States GPE
Hillary Clinton PERSON
Republican NORP
the United States GPE
Donald J. Trump PERSON
Donald PERSON
tonight TIME
two CARDINAL
this evening TIME
American NORP
Achieving Prosperity WORK_OF_ART
two CARDINAL
America GPE
today DATE
six straight years DATE
years DATE
nearly half CARDINAL
Americans NORP
Clinton PERSON
American NORP
Lester PERSON
Hofstra NORP
Today DATE
second ORDINAL
First ORDINAL
tonight TIME
Donald Trump PERSON
I. Donald PERSON
November 8th DATE
Clinton PERSON
Trump PERSON
American NORP
two minutes TIME
Lester GPE
Mexico GPE
China GPE
China GPE
Mexico GPE
eighth ORDINAL
the United States GPE
Ford ORG
Th
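To show entities alongside the original text rather than as a bare list, the labels can be inserted inline. The helper below is a minimal sketch that works on plain character spans, so it runs without spaCy. The function name `annotate` and the example sentence with its hand-written spans are assumptions for illustration and are not taken from the dataset.

```python
def annotate(text, spans):
    """Insert [entity](LABEL) markers into text.

    spans: iterable of (start_char, end_char, label), assumed non-overlapping.
    """
    pieces, cursor = [], 0
    for start, end, label in sorted(spans):
        pieces.append(text[cursor:start])                # text before the entity
        pieces.append(f"[{text[start:end]}]({label})")   # the marked entity
        cursor = end
    pieces.append(text[cursor:])                         # remaining text
    return "".join(pieces)

example_text = "Lester Holt moderates at Hofstra University."
example_spans = [(0, 11, "PERSON"), (25, 43, "ORG")]  # hand-written example spans
print(annotate(example_text, example_spans))
```

With spaCy, the spans could be collected from a processed document as `[(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]`. Inside a notebook, `spacy.displacy.render(doc, style="ent")` is an alternative that highlights the entities directly.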

### Part 2: Analysis

In real-world use cases, NER can be difficult. Analyze the following challenging examples and identify all cases in which the automated NER failed. For the tagging, you can use the method from Part 1.

In [4]:
# read the challenging test sentences
import json
with open('data/hard_data.json', 'r', encoding='utf-8-sig') as infile:
    data = json.load(infile)
    
test_sentences = [s["sentence"] for s in data['test_sentences']]

Write the failure cases down here, then try to classify them into types of errors that the model makes. Discuss the results.
You can use a mixture of markdown and code if it suits your analysis.

In [11]:
for sentence in test_sentences:
    doc = nlp(sentence)
    for entity in doc.ents:
        print(entity.text, entity.label_)



1 CARDINAL
tonight TIME
4 days DATE
Denver GPE
San Francisco GPE
2 CARDINAL
United Airlines ORG
9:20 p.m. TIME
2 hours and 28 minutes TIME
337 MONEY
3 CARDINAL
Seattle GPE
Monday DATE
Tampa GPE
4:10 p.m. TIME
1 CARDINAL
Park Hyatt Aviara PERSON
Golf Club ORG
279 MONEY
4.8 CARDINAL
18 CARDINAL
2 CARDINAL
Staybridge Suites Carlsbad WORK_OF_ART
145 MONEY
4.5 CARDINAL
BBQ ORG
1 CARDINAL
Mummy PERSON
4:30 pm this afternoon TIME
Davis Stadium 5 PERSON
2 CARDINAL
Chips PRODUCT
9:50 PM TIME
10:15 PM TIME
Snaps GPE
10:25 PM TIME
1 CARDINAL
Rihanna Cover PERSON
One CARDINAL
IKEA ORG
California GPE
Second Floor ORG
Kitchen GPE
4.3 CARDINAL
5 CARDINAL
New American NORP
1 CARDINAL
last Saturday, September 9th DATE
LA Dodgers ORG
New York Red Bulls ORG


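Failure cases:

- Park Hyatt Aviara PERSON and Golf Club ORG: what looks like a single venue name is split into two entities (boundary error), and the first part is labeled as a person (type error).
- Staybridge Suites Carlsbad WORK_OF_ART: a hotel name labeled as a work of art (type error).
- BBQ ORG: most likely a common noun tagged as an organization (false positive).
- Second Floor ORG: a location inside a building tagged as an organization (false positive).

Judging by the labels alone, Mummy PERSON, Davis Stadium 5 PERSON, Snaps GPE, Rihanna Cover PERSON, and Kitchen GPE also look doubtful and should be checked against the sentences.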
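Classifying failure cases into error types can be supported by a small helper that compares predicted entities with manually annotated ones. The sketch below is one possible scheme, assuming entities are given as (start, end, label) token spans. The function name `classify_errors`, the four error categories, and the example spans are all hypothetical; the gold spans are made up and not taken from `hard_data.json`.

```python
from collections import Counter

def classify_errors(predicted, gold):
    """Compare predicted and gold entities, each a (start, end, label) span."""
    errors, matched_gold = [], set()
    for pred in predicted:
        p_start, p_end, p_label = pred
        overlaps = [g for g in gold if p_start < g[1] and g[0] < p_end]
        if not overlaps:
            errors.append(("spurious", pred))   # no gold entity at this position
            continue
        matched_gold.update(overlaps)
        g_start, g_end, g_label = overlaps[0]
        if (p_start, p_end) != (g_start, g_end):
            errors.append(("boundary", pred))   # overlaps, but wrong extent
        if p_label != g_label:
            errors.append(("type", pred))       # wrong entity label
    for g in gold:
        if g not in matched_gold:
            errors.append(("missed", g))        # gold entity not found at all
    return errors

# Made-up example: one venue name split in two, one spurious and one missed entity.
gold = [(0, 5, "ORG"), (12, 13, "GPE")]
predicted = [(0, 3, "PERSON"), (3, 5, "ORG"), (8, 9, "ORG")]

errors = classify_errors(predicted, gold)
print(Counter(kind for kind, _ in errors))
```

A prediction can receive both a boundary and a type error, as in the case of the split venue name whose first part is also mislabeled.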

---

#### Submitting your results:

To submit your results, please:

- save this file, i.e., `ex??_assignment.ipynb`.
- if you reference any external files (e.g., images), please create a zip or rar archive and put the notebook files and all referenced files in there.
- login to ILIAS and submit the `*.ipynb` or archive for the corresponding assignment.

**Remarks:**
    
- Do not copy any code from the Internet. In case you want to use publicly available code, please, add the reference to the respective code snippet.
- Check that your code compiles and executes, even after you have restarted the kernel.
- Submit your written solutions and the coding exercises within the provided spaces and not otherwise.
- Write the names of your partner and your name in the top section.