## Document Analysis: Computational Methods - Summer Term 2025
### Lectures: Jun.-Prof. Dr. Andreas Spitz
### Tutorials: Julian Schelb

# Exercise 04

- NLP Recap
- Entity Recognition

---

## Task 1 - NLP Recap:

## Q & A

Answer the following questions. If you like, you can treat this as exam preparation, e.g., first try to answer each question without the help of the slides ;) But you are of course allowed to use the slides at any time.


(1) Name three reasons why natural language processing (NLP) is challenging

- Ambiguity
- Differing syntactic and semantic rules across languages
- 

---

(2) What is tokenization?

Splitting text into smaller units (tokens), usually words, but sometimes also sentences.
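
As a minimal sketch of this idea, assuming simple regular-expression rules rather than spaCy's actual tokenizer, word and sentence tokenization could look like the following. The function names `word_tokenize` and `sent_tokenize` are made up for illustration.

```python
import re

def word_tokenize(text):
    # A token is a word (optionally with an internal apostrophe)
    # or a single punctuation character.
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

def sent_tokenize(text):
    # Naive rule: a sentence ends at ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

text = "I am Lester Holt. We'll explore three topic areas tonight!"
print(sent_tokenize(text))
print(word_tokenize(text))
```

Rules this simple break on abbreviations such as "p.m.", which is one reason why trained tokenizers are usually preferred.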

---

(3) What is a word stem? Give the stem of the word "undoes"

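A word stem is the part of a word that remains after its affixes are removed. For "undoes", dropping the inflectional ending "-es" and the prefix "un-" leaves "do" (a stemmer that only strips suffixes would keep the prefix).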

---

(4) What is a word lemma? Give the lemma of the word "undoes"

The lemma is the dictionary (base) form of a word. The lemma of "undoes" is "undo".

---

(5) Why should we typically extract word stems/lemmas before proceeding with text analysis?

To normalize the words: inflection, declension, and other word-form variation is removed, so that all forms of a word are treated as the same term.
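
The effect of normalization on word counts can be illustrated with a small sketch. The lookup table `LEMMAS` is hand-written and purely hypothetical; a real pipeline would use a lemmatizer or stemmer instead.

```python
from collections import Counter

# Hypothetical, hand-written lemma table for illustration only.
LEMMAS = {"undoes": "undo", "undid": "undo", "undone": "undo", "undoing": "undo"}

def normalize(tokens):
    # Lowercase each token and map it to its lemma if the table knows it.
    return [LEMMAS.get(t.lower(), t.lower()) for t in tokens]

tokens = ["Undoes", "undid", "undone", "undo", "debate"]
print(Counter(t.lower() for t in tokens))   # every surface form counted separately
print(Counter(normalize(tokens)))           # all forms of "undo" counted together
```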

---

(6) What are stop-words? Why does it make sense to remove stop-words before proceeding with text analysis?

Stop-words are mostly function words that carry little semantic meaning and/or words that occur with very high frequency in a language. If they are kept, they drown out the content words we are actually interested in.

---

(7) If you had access to frequency statistics for a language, how could you create a list of stop words?

The most frequent words of the language are most likely stop-words, so the list can be built from the top of the frequency ranking.
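
A minimal sketch of this approach, assuming the frequency statistics are computed from a token list. The function name `stop_word_candidates` and the cut-off `top_k` are illustrative choices, not part of the exercise.

```python
from collections import Counter

def stop_word_candidates(tokens, top_k=3):
    # Rank words by frequency and keep the top_k most frequent ones.
    counts = Counter(t.lower() for t in tokens)
    return [word for word, _ in counts.most_common(top_k)]

tokens = ("the debate is sponsored by the commission and the rules "
          "of the debate have been agreed to by the campaigns").split()
print(stop_word_candidates(tokens, top_k=3))
```

On a sample this small, topical words such as "debate" can also rank highly, so the statistics should come from a large, general corpus, and the resulting list is best reviewed by hand.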

---

(8) Name a use-case in which we should NOT remove stop-words prior to text analysis.

When we need the full context or sentence structure, e.g., when we want to parse the text and build parse trees.

---

(9) What is part-of-speech (POS) tagging? Why is it useful?

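POS tagging labels each word with its word class (noun, verb, adjective, ...). This helps to disambiguate words that can belong to several classes and supports later steps such as parsing.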

---

(10) Explain how n-gram POS tagging works. What is the limitation of this method?

<font color='ff000000'>\# TEXT SUBMISSION ANSWER HERE (Double click to edit)</font>

---

(11) What is parsing? Which two main types of parsing exist?

<font color='ff000000'>\# TEXT SUBMISSION ANSWER HERE (Double click to edit)</font>

---

(12) What is the difference between a context-free and a context-sensitive language? What is the difference between a context-free and a regular language?

<font color='ff000000'>\# TEXT SUBMISSION ANSWER HERE (Double click to edit)</font>

---

(13) Give an equivalent grammar in Chomsky Normal form for the Grammar G=(N,T,P,S) with N={S,A,B,C}, T={a,b,c}, P={S->ABC, A->a, B->b, C->c}

G' = (N', T, P', S) with N' = {S, A, B, C, D}, T = {a, b, c}, P' = {S -> AD, D -> BC, A -> a, B -> b, C -> c}
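
The converted grammar can be sanity-checked with a short sketch: one function tests the Chomsky Normal Form shape of every rule, and a small CYK recognizer confirms that the word "abc" is still derivable. The dictionary encoding and the function names `is_cnf` and `cyk_accepts` are assumptions made for this illustration.

```python
from itertools import product

# Productions of the converted grammar, encoded as lhs -> list of right-hand sides.
CNF = {"S": [("A", "D")], "D": [("B", "C")],
       "A": [("a",)], "B": [("b",)], "C": [("c",)]}
TERMINALS = {"a", "b", "c"}

def is_cnf(grammar, terminals):
    # Every right-hand side must be one terminal or exactly two nonterminals.
    for bodies in grammar.values():
        for body in bodies:
            single_terminal = len(body) == 1 and body[0] in terminals
            two_nonterminals = len(body) == 2 and all(s in grammar for s in body)
            if not (single_terminal or two_nonterminals):
                return False
    return True

def cyk_accepts(grammar, word, start="S"):
    # table[i][j] holds the nonterminals that derive word[i:i+j+1].
    n = len(word)
    if n == 0:
        return False
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, ch in enumerate(word):
        table[i][0] = {lhs for lhs, bodies in grammar.items() if (ch,) in bodies}
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            for split in range(1, length):
                left = table[i][split - 1]
                right = table[i + split][length - split - 1]
                for lhs, bodies in grammar.items():
                    if any((l, r) in bodies for l, r in product(left, right)):
                        table[i][length - 1].add(lhs)
    return start in table[0][n - 1]

print(is_cnf(CNF, TERMINALS))
print(cyk_accepts(CNF, "abc"), cyk_accepts(CNF, "ab"))
```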

---

(14) Explain how shift-reduce parsing works.

<font color='ff000000'>\# TEXT SUBMISSION ANSWER HERE (Double click to edit)</font>

---

(15) What is the grammar ambiguity problem?

<font color='ff000000'>\# TEXT SUBMISSION ANSWER HERE (Double click to edit)</font>

---

## Task 2 - Named Entity Recognition:

### Part 1: Automated Annotations

Use spaCy to annotate entities in the debates dataset (available as part of the JSON in the data directory). [Depending on your computer, the extraction may take some time. Therefore, you are allowed to restrict the size of the text, e.g. to the first 250 sentences. Feel free to use a larger share of the data to get better insights.]

Tip: Use the en_core_web_sm model

Display the results in readable form, e.g. show the tagged entities for a reasonably sized part of the data, ideally alongside the original text.

In [1]:
# Tip for spaCy: to load a model, you may first need to download it via the command line.
# Inside a notebook, you can run shell commands by prefixing them with "!".

# For example (python3, because plain `python` is not found on this machine):
!python3 -m spacy download en_core_web_sm


In [2]:
# read debates
import json
with open('data/texts.json', 'r') as infile:
    data = json.load(infile)

content_debates = data['debates']

In [3]:
# Your Code Submission goes here
import spacy
from spacy import displacy
# If the model is missing, download it first:
#!python3 -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")  # English tokenizer, tagger, parser and NER
doc = nlp(content_debates)

# Restrict the analysis to the first 250 sentences to keep the runtime manageable.
sent = list(doc.sents)[:250]

text_sub = " ".join([sentence.text for sentence in sent])

doc_sub = nlp(text_sub)

print("Noun phrases: ", [(chunk.text, chunk.label_) for chunk in doc_sub.noun_chunks], "\n")
print("Verbs: ", [(token.text, token.pos_) for token in doc_sub if token.pos_ == "VERB"], "\n")

# Uncomment to list the named entities as well:
#for entity in doc_sub.ents:
#    print(entity.text, entity.label_)

Noun phrases:  [(' Good evening', 'NP'), ('Hofstra University', 'NP'), ('Hempstead', 'NP'), ('I', 'NP'), ('Lester Holt', 'NP'), ('anchor', 'NP'), ('"NBC Nightly News', 'NP'), ('I', 'NP'), ('you', 'NP'), ('the first presidential debate', 'NP'), ('The participants', 'NP'), ('Donald Trump', 'NP'), ('Hillary Clinton', 'NP'), ('This debate', 'NP'), ('the Commission', 'NP'), ('Presidential Debates', 'NP'), ('a nonpartisan, nonprofit organization', 'NP'), ('The commission', 'NP'), ("tonight's format", 'NP'), ('the rules', 'NP'), ('the campaigns', 'NP'), ('The 90-minute debate', 'NP'), ('six segments', 'NP'), ('We', 'NP'), ('three topic areas', 'NP'), ('prosperity', 'NP'), ("America's direction", 'NP'), ('America', 'NP'), ('the start', 'NP'), ('each segment', 'NP'), ('I', 'NP'), ('the same lead-off question', 'NP'), ('both candidates', 'NP'), ('they', 'NP'), ('up to two minutes', 'NP'), ('that point', 'NP'), ('the end', 'NP'), ('the segment', 'NP'), ('we', 'NP'), ('an open discussion', 'NP'), 

In [14]:
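def mark_entities(text, doc):
    """Return text with a [LABEL] marker inserted after each entity in doc.ents."""
    offset = 0  # number of characters inserted so far
    marked = text
    for ent in doc.ents:
        # Shift the entity's end position by the labels already inserted.
        end = ent.end_char + offset
        label = f"[{ent.label_}]"
        marked = marked[:end] + label + marked[end:]
        offset += len(label)
    return marked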


for sent in list(doc.sents)[:10]:
    sent_doc = nlp(sent.text)
    if sent_doc.ents:
        print(mark_entities(sent.text, sent_doc))
        print("-" * 20)

 Good evening[TIME] from Hofstra University[ORG] in Hempstead[GPE], New York[GPE].
--------------------
I am Lester Holt[PERSON], anchor of "NBC Nightly News[ORG].”
--------------------
I want to welcome you to the first[ORDINAL] presidential debate.

--------------------
The participants tonight[TIME] are Donald Trump[PERSON] and Hillary Clinton[PERSON].
--------------------
This debate is sponsored by the Commission on Presidential Debates[ORG], a nonpartisan, nonprofit organization.
--------------------
The commission drafted tonight[TIME]'s format, and the rules have been agreed to by the campaigns.

--------------------
The 90-minute[TIME] debate is divided into six[CARDINAL] segments, each 15 minutes[TIME] long.
--------------------
We'll explore three[CARDINAL] topic areas tonight[TIME]:
--------------------
Achieving prosperity; America[GPE]'s direction; and securing America[GPE].
--------------------
At the start of each segment, I will ask the same lead-off question to both c
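
The offset bookkeeping in `mark_entities` can be checked without loading a spaCy model. The sketch below repeats the function so that it runs on its own and feeds it a stand-in object, `fake_doc`, that only mimics the `ents`, `end_char` and `label_` attributes; the stand-in and its character offsets are assumptions for this illustration, not spaCy output.

```python
from types import SimpleNamespace

def mark_entities(text, doc):
    # Same logic as the notebook cell, repeated so the sketch runs alone.
    offset = 0
    marked = text
    for ent in doc.ents:
        end = ent.end_char + offset
        label = f"[{ent.label_}]"
        marked = marked[:end] + label + marked[end:]
        offset += len(label)
    return marked

text = "Donald Trump and Hillary Clinton"
# Stand-in for a spaCy Doc: only the attributes used by mark_entities.
fake_doc = SimpleNamespace(ents=[
    SimpleNamespace(end_char=12, label_="PERSON"),
    SimpleNamespace(end_char=32, label_="PERSON"),
])
print(mark_entities(text, fake_doc))
```

The function relies on the entities arriving in document order, which is how `doc.ents` is ordered in spaCy.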

### Part 2: Analysis

In real-world use cases, NER can be difficult. Analyze the following challenging examples and identify all cases in which the automated NER failed. For the tagging, you can use the method from part 1.

In [8]:
# read the challenging test sentences
import json
with open('data/hard_data.json', 'r') as infile:
    data = json.load(infile)

test_sentences = [s["sentence"] for s in data['test_sentences']]

In [11]:
print(len(test_sentences))

11


Write the failure cases down here, then try to classify them into types of errors that the model makes. Discuss the results.
You can use a mixture of markdown and code if it suits your analysis.

In [13]:
# Your Code Submission goes here
for i, sentence in enumerate(test_sentences[:11], 1):
    sent_doc = nlp(sentence)
    print(f"Sentence {i}:")
    if sent_doc.ents:
        print(mark_entities(sentence, sent_doc))
    else:
        print(sentence)
        print("[No entities detected]")
    print("-" * 20)

Sentence 1:
Flight booking 1[CARDINAL]: “So, I would like to fly out sometime tonight[TIME] and fly back in the evening in 4 days[DATE]. From I’m looking to go to Denver[GPE]. I’m flying out of San Francisco[GPE].”
--------------------
Sentence 2:
Flight booking 2[CARDINAL]: “Okay, you got it so it looks like United Airlines[ORG] leaves at 9:20 p.m.[TIME] that is nonstop the flight duration is 2 hours and 28 minutes[TIME] and is priced at $337[MONEY].”
--------------------
Sentence 3:
Flight booking 3[CARDINAL]: “I found a flight that leaves Seattle[GPE] coming Monday[DATE] at 7:35 a.m and arrives in Tampa[GPE] at 4:10 p.m.[TIME]”
--------------------
Sentence 4:
Hotel booking 1[CARDINAL]: “Park Hyatt Aviara[PERSON] resort Golf Club[ORG] and Spa, it’s $279[MONEY] per night. It‘s rated 4.8[CARDINAL] stars. Resort offering an 18[CARDINAL]-hole golf course, an outdoor pool & tennis courts plus a spa & fine dining.”
--------------------
Sentence 5:
Hotel booking 2[CARDINAL]: “Staybridge Su

Sentence 1:

Missed: the whole phrase "in the evening in 4 days" should be one DATE. The model tags only "4 days" as DATE and leaves "the evening" (TIME) untagged.

Sentence 3:

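Missed: "7:35 a.m" is not tagged as TIME, possibly because it directly follows another entity ("Monday", DATE), or because the abbreviation lacks its final period (compare the correctly tagged "4:10 p.m.").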

Sentence 4:

"Park Hyatt Aviara" is tagged as PERSON, but it is part of one ORG that also covers "resort Golf Club and Spa"; the model tags only "Golf Club" as ORG. Open question: should "per night" be tagged as a time expression?

Likely error type: boundary error.

Sentence 5: 

Type error: wrong entity type. "Staybridge Suites Carlsbad" is tagged as WORK_OF_ART rather than ORG (it is a hotel).

False positive: "BBQ" is tagged as ORG. The model probably mistakes it for a proper noun and, since it does not recognize it as a person, falls back to ORG.

Sentence 6:

Type error (and possibly boundary error):

"The Mummy" is tagged as PERSON, but it is a movie title, i.e. a WORK_OF_ART or PRODUCT, not a person.

"Regal Davis Stadium 5" is an ORG, not a PERSON.

Open question: is "at 4:30 pm this afternoon" tagged as one TIME entity, or only "afternoon"?

Sentence 7:

Missed entity: "Get out" (a WORK_OF_ART) is not tagged.

Misclassification: "Snaps" is tagged as GPE, but it is a WORK_OF_ART, too.

Sentence 8:

Missed entity: "diamonds" is not recognized as WORK_OF_ART.

Misclassification / missed entity: only "One" is tagged (as CARDINAL), although "One Voice Children’s Choir" as a whole is a single named entity (an ORG).

False positive: "Cover" is tagged as PERSON. The sentence itself is poorly formed, which makes it hard to tell what "Cover" refers to.

Sentence 10:

Misclassification: "Second Floor[ORG] in Kitchen[GPE]" is actually one single ORG, not an ORG followed by a GPE.

Open question: what does the NORP tag in "New American[NORP]" refer to? NORP stands for nationalities or religious or political groups.

### NER is particularly challenged by:

- Ambiguous names: like "The Mummy" or "Snaps" (movie/product vs. person/location).
- Named locations vs. generic words: "Kitchen" (room) vs. "Kitchen" (place name).
- Business/facility names: often misclassified as PERSON or WORK_OF_ART when they are hotels or organizations.
- Cultural/artistic references: songs, bands, or movie names are not well represented in spaCy’s default en_core_web_sm model.
- Multi-token names: long ORG names like "One Voice Children’s Choir" are missed unless the model was trained on similar data.
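
The manual classification above could be supported by a small helper that compares predicted spans against hand-annotated gold spans. This is only a sketch: the function `classify_errors`, the helper `span`, the tuple format, and the gold annotation in the example are assumptions made for illustration, and any overlap with a different extent is counted as a boundary error even if the label is wrong as well.

```python
def classify_errors(gold, predicted):
    # gold / predicted: lists of (start_char, end_char, label) tuples.
    errors = []
    matched_gold = set()
    for p_start, p_end, p_label in predicted:
        # Find the first gold span that overlaps the predicted span.
        overlap = next((g for g in gold
                        if g[0] < p_end and p_start < g[1]), None)
        if overlap is None:
            errors.append(("false positive", (p_start, p_end, p_label)))
            continue
        matched_gold.add(overlap)
        same_span = (overlap[0], overlap[1]) == (p_start, p_end)
        if same_span and overlap[2] != p_label:
            errors.append(("type error", (p_start, p_end, p_label)))
        elif not same_span:
            errors.append(("boundary error", (p_start, p_end, p_label)))
    # Gold entities that no prediction touched were missed entirely.
    for g in gold:
        if g not in matched_gold:
            errors.append(("missed entity", g))
    return errors

text = "Park Hyatt Aviara resort Golf Club and Spa"

def span(sub, label):
    # Build a (start, end, label) tuple from a substring of text.
    start = text.index(sub)
    return (start, start + len(sub), label)

gold = [span(text, "ORG")]  # hand-made gold annotation (assumption)
predicted = [span("Park Hyatt Aviara", "PERSON"), span("Golf Club", "ORG")]
for kind, entity in classify_errors(gold, predicted):
    print(kind, entity)
```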

## TAG OVERVIEW

**PERSON:**      People, including fictional.

**NORP:**       Nationalities or religious or political groups.

**FAC:**         Buildings, airports, highways, bridges, etc.

**ORG:**         Companies, agencies, institutions, etc.

**GPE:**        Countries, cities, states.

**LOC:**         Non-GPE locations, mountain ranges, bodies of water.

**PRODUCT:**     Objects, vehicles, foods, etc. (Not services.)

**EVENT:**       Named hurricanes, battles, wars, sports events, etc.

**WORK_OF_ART:** Titles of books, songs, etc.

**LAW:**         Named documents made into laws.

**LANGUAGE:**    Any named language.

**DATE:**        Absolute or relative dates or periods.

**TIME:**        Times smaller than a day.

**PERCENT:**     Percentage, including “%”.

**MONEY:**       Monetary values, including unit.

**QUANTITY:**    Measurements, as of weight or distance.

**ORDINAL:**     “first”, “second”, etc.

**CARDINAL:**    Numerals that do not fall under another type.

---

#### Submitting your results:

To submit your results, please:

- save this file, i.e., `ex??_assignment.ipynb`.
- if you reference any external files (e.g., images), please create a zip or rar archive and put the notebook files and all referenced files in there.
- login to ILIAS and submit the `*.ipynb` or archive for the corresponding assignment.

**Remarks:**
    
- Do not copy any code from the Internet. In case you want to use publicly available code, please, add the reference to the respective code snippet.
- Check your code compiles and executes, even after you have restarted the Kernel.
- Submit your written solutions and the coding exercises within the provided spaces and not otherwise.
- Write the names of your partner and your name in the top section.