## Document Analysis: Computational Methods - Summer Term 2025
### Lectures: Jun.-Prof. Dr. Andreas Spitz
### Tutorials: Julian Schelb
#### Submitted by: Buket Sak, Anna Werner, Yu Zeyuan

# Exercise 04

- NLP Recap
- Entity Recognition

---

## Task 1 - NLP Recap:

## Q & A

Give answers to the following questions. If you like, you can treat this as exam preparation, e.g., first try to answer the questions without the help of the slides ;) You are of course allowed to use the slides at any time.


(1) Name three reasons why natural language processing (NLP) is challenging

One of the reasons NLP is challenging is that language is ambiguous; the meaning of a sentence can vary depending on the context. Another challenge is interpretation and world knowledge: since everyone has different knowledge and experiences, the same sentence can be interpreted differently by different people. A third challenge is anaphora and coreference resolution. To understand what a pronoun or reference word refers to, a system must often rely on context and sometimes even real-world knowledge.

---

(2) What is tokenization?

Tokenization is the process of breaking down text into smaller units, such as words, subwords, characters, or even sentences. It is a crucial step for any type of analysis or processing in NLP applications.
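
As a minimal sketch of the idea, the following uses only regular expressions to split text into sentences and word tokens. The function names `tokenize_words` and `tokenize_sentences` are made up for this illustration, and the rules are deliberately naive (abbreviations such as "e.g." would break the sentence splitter); real tokenizers such as those in NLTK or spaCy handle many more cases.

```python
import re

def tokenize_words(text):
    # A token is a word (optionally with internal apostrophes or hyphens)
    # or a single punctuation character.
    return re.findall(r"\w+(?:['-]\w+)*|[^\w\s]", text)

def tokenize_sentences(text):
    # Naive rule: split at whitespace that follows '.', '!' or '?'.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

text = ("We'll explore three topic areas tonight. "
        "The 90-minute debate is divided into six segments.")
for sentence in tokenize_sentences(text):
    print(tokenize_words(sentence))
```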

---

(3) What is a word stem? Give the stem of the word "undoes"

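A word stem is the base form of a word that remains after affixes have been stripped; it is sometimes loosely called the "root". Stemmers usually work by heuristically cutting off suffixes, so the result does not have to be a valid dictionary word.
For example, removing the inflectional ending of “undoes” gives the stem “undo”; stripping the prefix “un-” as well would leave the root “do”.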

---

(4) What is a word lemma? Give the lemma of the word "undoes"

A word lemma is the dictionary form of a word, the base form that one could look up in a dictionary. Although there may be multiple word forms, they all share a single lemma, which corresponds to the dictionary entry.

undoes --> undo 
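
To make the difference between stemming and lemmatization concrete, here is a small sketch under simplifying assumptions: `toy_stem` is a hypothetical suffix stripper (not the Porter algorithm), and `LEMMAS` is a hand-written mini-dictionary standing in for a real lexicon. Because the toy stemmer only removes endings, it leaves the prefix "un-" in place and cannot handle irregular forms, while the dictionary lookup can.

```python
SUFFIXES = ("ing", "es", "ed", "s")  # checked in this order

def toy_stem(word):
    # Strip the first matching suffix, keeping at least three characters.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical mini-dictionary mapping inflected forms to their lemma.
LEMMAS = {"undoes": "undo", "undoing": "undo", "undid": "undo", "undone": "undo"}

def toy_lemmatize(word):
    # Fall back to the word itself if it is not in the dictionary.
    return LEMMAS.get(word, word)

for w in ["undoes", "undoing", "undid", "undone"]:
    print(f"{w:8} stem={toy_stem(w):8} lemma={toy_lemmatize(w)}")
```

The irregular forms "undid" and "undone" pass through the stemmer unchanged, whereas the lemma lookup maps all four forms to "undo".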

---

(5) Why should we typically extract word stems/lemmas before proceeding with text analysis?

By performing stemming or lemmatization, we reduce all words that share the same core meaning to a common base form. This simplifies the analysis by reducing the dimensionality of the data and can improve the accuracy of the results. It’s especially helpful in tasks like text classification, indexing, or search engines, because when a user types a word, they typically expect to retrieve results that include all inflected forms of that word.

---

(6) What are stop-words? Why does it make sense to remove stop-words before proceeding with text analysis?

Stop words are function words that appear frequently in a text but often do not carry meaningful information for certain types of analysis. Examples include auxiliaries and other common words such as "is", "are", "or", "the", and "not". Removing stop words helps us focus on the more important words in a text and reduces noise.

---

(7) If you had access to frequency statistics for a language, how could you create a list of stop words?

I would start from the most frequent words, but blindly removing the top 100 words could result in the loss of important words, as not all frequent words are stop words. To avoid this, we could use POS tagging: if a word is a noun, verb, or adjective, it likely carries meaning and should not be included in the stop word list. Additionally, depending on the task, I would avoid including words like “not” or its variations, since they can significantly affect the meaning of a sentence, especially in tasks like sentiment analysis.
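
The frequency-based part of this procedure can be sketched as follows. The function `stop_word_candidates` and its parameters `top_k` and `keep` are illustrative names, and the three-sentence corpus is invented; the POS-based filter mentioned above is left out because it would need a tagger.

```python
from collections import Counter
import re

def stop_word_candidates(texts, top_k=3, keep=frozenset({"not", "no", "never"})):
    # Count lower-cased word frequencies over all texts.
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z']+", text.lower()))
    # Most frequent words first, skipping protected words such as negations.
    ranked = [word for word, _ in counts.most_common() if word not in keep]
    return ranked[:top_k]

corpus = [
    "the cat is not on the mat",
    "the dog is on the mat",
    "the bird is not in the tree",
]
print(stop_word_candidates(corpus))
```

With a larger `top_k`, content words such as "mat" start to appear in the list, which is exactly why a purely frequency-based cut-off needs an additional check.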

---

(8) Name a use-case in which we should NOT remove stop-words prior to text analysis.

One specific use-case is sentiment analysis, where we shouldn’t remove stop words because functional words—especially words like “not”—determine the sentiment and meaning of a sentence.

Also, in general, when using transformer models like BERT, we should not remove stop words, because these models are pre-trained on large amounts of natural language text that includes stop words. BERT relies on the full context of a sentence, including stop words, to understand the meaning and relationships between words.

---

(9) What is part-of-speech (POS) tagging? Why is it useful?

Part-of-speech tagging is the process of assigning each word in a sentence its grammatical category, such as noun, verb, adjective, etc.
It is useful because it provides information about the sentence structure and supports many NLP tasks, such as named entity recognition (NER) and machine translation.

---

(10) Explain how n-gram POS tagging works. What is the limitation of this method?

N-gram POS tagging is a probabilistic method used to assign the most likely POS tags to the words in a sentence. It works by considering the current word together with the POS tags of the previous n−1 words. Depending on the value of n, the model is a unigram model (each word is tagged independently), a bigram model (the tag of the current word depends only on the tag of the previous word), a trigram model (the tag depends on the two preceding tags), and so on. The probabilities are estimated from a tagged training corpus, and the tagger selects the most likely sequence of tags for a given sentence. One major limitation is that it struggles with long-distance dependencies between words, as it only considers a limited context window. Additionally, because it relies on previously seen data, it may fail to accurately tag unknown words that do not appear in the training set.
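
A bigram tagger can be sketched in a few lines. The corpus `TRAIN` is a tiny invented example, the add-one smoothing is crude, and the tagger decides greedily from left to right instead of searching for the best overall tag sequence (as a Viterbi decoder would), so this is an illustration of the principle rather than a usable tagger.

```python
from collections import Counter, defaultdict

# Tiny hand-made training corpus (hypothetical): sentences of (word, tag) pairs.
TRAIN = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
    [("dogs", "NOUN"), ("bark", "VERB")],
]

transitions = defaultdict(Counter)  # previous tag -> counts of the next tag
emissions = defaultdict(Counter)    # tag -> counts of words carrying that tag
for sentence in TRAIN:
    prev = "<s>"  # sentence-start marker
    for word, tag in sentence:
        transitions[prev][tag] += 1
        emissions[tag][word] += 1
        prev = tag

TAGS = sorted(emissions)

def tag_greedy(words):
    """Greedy bigram tagging: choose each tag given only the previous tag."""
    prev, result = "<s>", []
    for word in words:
        def score(tag):
            # Crude add-one smoothing keeps unseen events above zero.
            p_trans = (transitions[prev][tag] + 1) / (sum(transitions[prev].values()) + len(TAGS))
            p_emit = (emissions[tag][word] + 1) / (sum(emissions[tag].values()) + 1)
            return p_trans * p_emit
        best = max(TAGS, key=score)
        result.append((word, best))
        prev = best
    return result

print(tag_greedy(["the", "dog", "sleeps"]))
print(tag_greedy(["the", "wug", "sleeps"]))  # "wug" never occurs in TRAIN
```

For the unknown word "wug", the emission counts carry no information, so the decision rests almost entirely on the transition from the previous tag. This illustrates both the unknown-word problem and the limited context window.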

---

(11) What is parsing? Which two main types of parsing exist?

Parsing is the process of analyzing the structure of a given sentence. In this way, we determine and examine its syntactic structure. There are two main types of parsing: constituency parsing and dependency parsing.
Dependency parsing, as the name suggests, relies on the relationships between words in a sentence. Constituency parsing, on the other hand, is based on context-free grammars and involves creating a tree structure that represents nested phrases.


---

(12) What is the difference between a context-free and a context-sensitive language? What is the difference between a context-free and a regular language?

<font color='ff000000'>\# TEXT SUBMISSION ANSWER HERE (Double click to edit)</font>

---

(13) Give an equivalent grammar in Chomsky Normal form for the Grammar G=(N,T,P,S) with N={S,A,B,C}, T={a,b,c}, P={S->ABC, A->a, B->b, C->c}

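In CNF, every rule must have the form X -> YZ (two nonterminals) or X -> a (a single terminal), so three nonterminal symbols on the right-hand side, as in S->ABC, are not allowed. Introducing a new nonterminal X1 splits this rule into two binary rules:

G'=(N',T,P',S) with N'={S,A,B,C,X1}, T={a,b,c}, P'={S->AX1, X1->BC, A->a, B->b, C->c}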
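
The restriction on the right-hand side can be resolved mechanically by introducing helper nonterminals. The sketch below covers only this binarization step; a full CNF conversion also has to deal with empty rules, unit rules, and terminals inside long rules, none of which occur in G. The function name `binarize` and the helper names `X1`, `X2`, ... are choices made for this illustration.

```python
def binarize(productions):
    """Split rules with more than two right-hand-side symbols into binary rules."""
    result, counter = [], 0
    for lhs, rhs in productions:
        rhs = list(rhs)
        while len(rhs) > 2:
            counter += 1
            helper = f"X{counter}"  # fresh helper nonterminal
            # Keep the first symbol and delegate the rest to the helper.
            result.append((lhs, [rhs[0], helper]))
            lhs, rhs = helper, rhs[1:]
        result.append((lhs, rhs))
    return result

P = [("S", ["A", "B", "C"]), ("A", ["a"]), ("B", ["b"]), ("C", ["c"])]
for lhs, rhs in binarize(P):
    print(lhs, "->", " ".join(rhs))
```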

---

(14) Explain how shift-reduce parsing works.

Shift-reduce parsing is a bottom-up parsing method that uses a stack to construct the parse tree. We read the input from left to right and shift words onto the stack one at a time. Whenever the symbols on top of the stack match the right-hand side of a production rule, we reduce: the matched symbols are popped and replaced by the left-hand side of that rule. When no rule matches, we shift the next input word onto the stack and repeat. Parsing succeeds when the input is fully consumed and the stack contains only the start symbol S.
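
The procedure can be traced with a toy recognizer. The grammar `GRAMMAR` and the example sentence are invented, and the strategy is the simplest possible one: always reduce when some rule matches the top of the stack, otherwise shift. Real shift-reduce parsers use lookahead or a learned policy to choose between competing actions, so this greedy version can fail on grammars where reducing too early is a mistake.

```python
# Toy grammar as (left-hand side, right-hand side) pairs; hypothetical example.
GRAMMAR = [
    ("S", ("NP", "VP")),
    ("NP", ("Det", "N")),
    ("VP", ("V", "NP")),
    ("Det", ("the",)),
    ("N", ("dog",)),
    ("N", ("cat",)),
    ("V", ("chased",)),
]

def shift_reduce(tokens, grammar=GRAMMAR, start="S"):
    stack, buffer, trace = [], list(tokens), []
    while True:
        for lhs, rhs in grammar:
            n = len(rhs)
            if tuple(stack[-n:]) == rhs:
                # Reduce: replace the matched symbols with the rule's left-hand side.
                stack[-n:] = [lhs]
                trace.append(("reduce", lhs, list(stack)))
                break
        else:
            if not buffer:
                break  # nothing left to reduce or shift
            # Shift: move the next input word onto the stack.
            stack.append(buffer.pop(0))
            trace.append(("shift", stack[-1], list(stack)))
    return stack == [start], trace

accepted, trace = shift_reduce("the dog chased the cat".split())
for action, symbol, stack in trace:
    print(f"{action:6} {symbol:7} stack={stack}")
print("accepted:", accepted)
```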

---

(15) What is the grammar ambiguity problem?

Grammar ambiguity occurs when a sentence can be parsed in more than one way, resulting in more than one possible parse tree structure. The most common example is: "I saw the man with the telescope." In this case, the resulting tree structure could attach "with the telescope" either to the verb "saw" or to the noun "the man", leading to two different interpretations.
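
The two readings can be counted with a small CYK-style chart. The grammar below is a hand-written toy grammar in Chomsky Normal Form chosen for this one sentence (the rule set and the name `count_parses` are assumptions of this sketch); each chart cell stores how many distinct trees derive a span from a given nonterminal.

```python
from collections import defaultdict

# Binary rules (parent, left child, right child) and a lexicon; toy grammar.
BINARY = [
    ("S", "NP", "VP"),
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),   # PP attaches to the verb phrase
    ("NP", "NP", "PP"),   # PP attaches to the noun phrase
    ("NP", "Det", "N"),
    ("PP", "P", "NP"),
]
LEXICON = {
    "I": ["NP"], "saw": ["V"], "the": ["Det"],
    "man": ["N"], "telescope": ["N"], "with": ["P"],
}

def count_parses(words, start="S"):
    n = len(words)
    # chart[i][j][X] = number of distinct trees deriving words[i:j] from X
    chart = [[defaultdict(int) for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        for symbol in LEXICON.get(word, []):
            chart[i][i + 1][symbol] += 1
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):  # split point
                for parent, left, right in BINARY:
                    chart[i][j][parent] += chart[i][k][left] * chart[k][j][right]
    return chart[0][n][start]

print(count_parses("I saw the man with the telescope".split()))
print(count_parses("I saw the man".split()))
```

The first call reports two trees, one for each attachment of "with the telescope", while the shorter sentence has a single tree.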

---

## Task 2 - Named Entity Recognition:

### Part 1: Automated Annotations

Use Spacy to annotate entities in the debates dataset (available as part of the JSON in the data directory). [Depending on your computer, the extraction may take some time. Therefore, you are allowed to restrict the size of the text, e.g. to the first 250 sentences. Feel free to use a larger share of data to get better insights.]

Tip: Use the en_core_web_sm model

Display the results in readable form, e.g. show the tagged entities for a reasonably sized part of the data, ideally alongside the original text.

In [1]:
# Tip for spaCy: before a model can be loaded, you may need to download it via the command line.
# Inside a notebook, you can run shell commands with a leading "!"
# (similar to: !python ...)
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.5.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.5.0/en_core_web_sm-3.5.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [2]:
# read debates
import json
with open('data/texts.json', 'r') as infile:
    data = json.load(infile)

content_debates = data['debates']

In [3]:
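import nltk
from nltk.tokenize import sent_tokenize
# Note: sent_tokenize requires the NLTK Punkt tokenizer data, which has to be downloaded once via nltk.download(...)
# Tokenize into sentences and keep the first 250
sentences = sent_tokenize(content_debates)
first_sent = sentences[:250]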
first_sent

[' Good evening from Hofstra University in Hempstead, New York.',
 'I am Lester Holt, anchor of "NBC Nightly News.” I want to welcome you to the first presidential debate.',
 'The participants tonight are Donald Trump and Hillary Clinton.',
 'This debate is sponsored by the Commission on Presidential Debates, a nonpartisan, nonprofit organization.',
 "The commission drafted tonight's format, and the rules have been agreed to by the campaigns.",
 'The 90-minute debate is divided into six segments, each 15 minutes long.',
 "We'll explore three topic areas tonight: Achieving prosperity; America's direction; and securing America.",
 'At the start of each segment, I will ask the same lead-off question to both candidates, and they will each have up to two minutes to respond.',
 "From that point until the end of the segment, we'll have an open discussion.",
 'The questions are mine and have not been shared with the commission or the campaigns.',
 'The audience here in the room has agreed to r

In [15]:
import spacy
nlp = spacy.load("en_core_web_sm")

# Join the sentences back into a single string, because nlp() expects a string, not a list
result = ' '.join(first_sent)

# Apply the model
doc = nlp(result)

# Print detected named entities
for ent in doc.ents:
    print(ent.text, ent.label_)

evening TIME
Hofstra University ORG
Hempstead GPE
New York GPE
Lester Holt PERSON
NBC Nightly News ORG
first ORDINAL
tonight TIME
Donald Trump PERSON
Hillary Clinton PERSON
the Commission on Presidential Debates ORG
tonight TIME
90-minute TIME
six CARDINAL
15 minutes TIME
three CARDINAL
tonight TIME
America GPE
America GPE
up to two minutes TIME
Democratic NORP
the United States GPE
Hillary Clinton PERSON
Republican NORP
the United States GPE
Donald J. Trump PERSON
Donald PERSON
tonight TIME
two CARDINAL
this evening TIME
American NORP
Achieving Prosperity WORK_OF_ART
two CARDINAL
America GPE
today DATE
a record six straight years DATE
years DATE
nearly half CARDINAL
Americans NORP
Clinton PERSON
American NORP
Lester PERSON
Hofstra NORP
Today DATE
second ORDINAL
First ORDINAL
sick days DATE
tonight TIME
Donald Trump PERSON
I. Donald PERSON
November 8th DATE
Clinton PERSON
Trump PERSON
American NORP
up to two minutes TIME
Lester PERSON
Mexico GPE
China GPE
China GPE
Mexico GPE
eighth OR

In [16]:
from spacy import displacy
displacy.render(doc, style="ent")

### Part 2: Analysis

In real-world use cases, NER can be difficult. Analyze the following challenging examples and identify all cases in which the automated NER failed. For the tagging, you can use the method from part 1.

In [20]:
# read the challenging test sentences
import json
with open('data/hard_data.json', 'r') as infile:
    data = json.load(infile)

test_sentences = [s["sentence"] for s in data['test_sentences']]
result1 = ' '.join(test_sentences)
# Apply the model
doc1 = nlp(result1)

# Print detected named entities
for ent in doc1.ents:
    print(ent.text, ent.label_)

1 CARDINAL
tonight TIME
4 days DATE
Denver GPE
San Francisco GPE
2 CARDINAL
United Airlines ORG
9:20 p.m. TIME
2 hours and 28 minutes TIME
337 MONEY
3 CARDINAL
Seattle GPE
coming Monday DATE
Tampa GPE
4:10 p.m. TIME
1 CARDINAL
Golf Club ORG
279 MONEY
4.8 CARDINAL
18 CARDINAL
2 CARDINAL
Staybridge Suites Carlsbad WORK_OF_ART
145 MONEY
4.5 CARDINAL
1 CARDINAL
Mummy PERSON
4:30 pm this afternoon TIME
Regal Davis Stadium FAC
2 CARDINAL
9:50 PM TIME
Get Out EVENT
10:15 PM TIME
Snaps GPE
10:25 PM TIME
Music 1: “Here’s WORK_OF_ART
Rihanna Cover PERSON
One Voice Children’s Choir ORG
California GPE
Second Floor ORG
Kitchen GPE
4.3 CARDINAL
5 CARDINAL
New American NORP
last Saturday, September 9th DATE
LA Dodgers ORG
New York Red Bulls ORG


In [22]:
#nice visualization
displacy.render(doc1, style="ent")

Write the failure cases down here, then try to classify them into types of errors that the model makes. Discuss the results.
You can use a mixture of markdown and code if it suits your analysis.

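evening -> Missed entity (expected TIME)

Rihanna Cover -> Boundary error: tagged as PERSON, but "Cover" is not part of the name and should not be tagged

Rihanna -> As a consequence, the correct PERSON span "Rihanna" is missing

Music 1: “Here’s -> Boundary error: tagged as WORK_OF_ART, but the span does not correspond to a complete entity

Staybridge Suites Carlsbad -> Wrong label: tagged as WORK_OF_ART although it is a hotel

IKEA -> Missed entity (expected ORG)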
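
The error types used above (boundary error, wrong label, missed entity) can also be assigned automatically once gold annotations are available. The sketch below is a possible approach, not part of the exercise code: the sentence, the gold labels, and the helpers `span` and `classify_errors` are all invented for illustration. With spaCy, the predicted tuples could be built from `ent.start_char`, `ent.end_char` and `ent.label_`.

```python
def span(text, phrase, label):
    # Build a (start, end, label) tuple from the first occurrence of phrase.
    start = text.index(phrase)
    return (start, start + len(phrase), label)

def classify_errors(predicted, gold):
    """Compare predicted and gold entities given as (start, end, label) tuples."""
    errors, matched_gold = [], set()
    for p in predicted:
        p_start, p_end, _ = p
        if p in gold:
            matched_gold.add(p)  # exact match, no error
            continue
        overlapping = [g for g in gold if g[0] < p_end and p_start < g[1]]
        if not overlapping:
            errors.append(("spurious", p))
            continue
        g = overlapping[0]
        matched_gold.add(g)
        same_span = (g[0], g[1]) == (p_start, p_end)
        errors.append(("wrong label" if same_span else "boundary", p))
    # Gold entities that no prediction touched were missed entirely.
    errors.extend(("missed", g) for g in gold if g not in matched_gold)
    return errors

# Hypothetical sentence with assumed gold labels.
text = "Rihanna Cover at IKEA near Staybridge Suites Carlsbad"
predicted = [span(text, "Rihanna Cover", "PERSON"),
             span(text, "Staybridge Suites Carlsbad", "WORK_OF_ART")]
gold = [span(text, "Rihanna", "PERSON"),
        span(text, "IKEA", "ORG"),
        span(text, "Staybridge Suites Carlsbad", "ORG")]

for kind, (start, end, label) in classify_errors(predicted, gold):
    print(f"{kind:12} {text[start:end]!r} ({label})")
```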

---

#### Submitting your results:

To submit your results, please:

- save this file, i.e., `ex??_assignment.ipynb`.
- if you reference any external files (e.g., images), please create a zip or rar archive and put the notebook files and all referenced files in there.
- login to ILIAS and submit the `*.ipynb` or archive for the corresponding assignment.

**Remarks:**
    
- Do not copy any code from the Internet. In case you want to use publicly available code, please, add the reference to the respective code snippet.
- Check your code compiles and executes, even after you have restarted the Kernel.
- Submit your written solutions and the coding exercises within the provided spaces and not otherwise.
- Write the names of your partner and your name in the top section.