## Document Analysis: Computational Methods - Summer Term 2024
### Lectures: Jun.-Prof. Dr. Andreas Spitz
### Tutorials: Julian Schelb

# Exercise 04

- NLP Recap
- Entity Recognition

---

## Task 1 - NLP Recap:

## Q & A

Give answers to the following questions. If you like, you can treat this as exam preparation, e.g., first try to answer the questions without the help of the slides ;) But you are of course allowed to use the slides at any time.


(1) Name three reasons why natural language processing (NLP) is challenging

**EXAMPLE SOLUTION**

1. Comprehensively understanding the human language requires **understanding both the words and** how the **concepts** are connected to deliver the intended message -> Syntax & Semantics. Example: In the sentence, "The bank is closed," a model needs to understand that "bank" refers to a financial institution and not a river bank based on the context.

2. While humans can easily master a language, the **ambiguity** and **imprecision** of natural languages are what make NLP difficult for machines -> Ambiguity, Imprecision, Context-Dependence. The sentence "I saw her duck" can be interpreted either as "I observed her duck (the animal)" or as "I saw her quickly lower her head (the action)."

3. There are many **different languages (hundreds) with individual rules**, syntax etc. -> multilingualism

4. Language is **different across cultures, age groups and individuals** -> External context dependence. Example: In Western culture "That's cool!" might be used to express approval, admiration, or appreciation.

5. **Technical aspects** like text encodings, formatting, special character treatment etc. -> Handling of Text in Software

...

---

(2) What is tokenization?

**EXAMPLE SOLUTION**

Tokenization is the process of breaking down a text into individual units called tokens in NLP. These tokens are typically words, but they can also be subwords or even characters, depending on the tokenization strategy used. Tokenization serves as a fundamental step in NLP tasks to enable further analysis and processing of text.

There are different tokenization strategies and levels:

1. **Text Segmentation:** The input text is divided into smaller segments or units, which can be sentences, paragraphs, or documents. This step helps in managing the text structure and applying tokenization at an appropriate level.

2. **Word Tokenization:** The segmented text is further split into individual words or word-like units. Each word becomes a separate token. For example, the sentence "Tokenization is important" would be tokenized into the tokens ["Tokenization", "is", "important"].

3. **Subword Tokenization:** In some cases, especially when dealing with morphologically rich languages or for tasks like machine translation, subword tokenization is used. It breaks words into smaller meaningful units, such as subword units or morphemes. This allows capturing the morphological variations and improves the coverage of rare or unseen words.

4. **Character Tokenization**: At the most granular level, text can be tokenized into individual characters. This approach is useful in tasks that require character-level analysis, such as text generation or language modeling.
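The levels listed above can be illustrated with a minimal sketch that uses only Python's `re` module. The regular expressions are simplifying assumptions for this example and not what NLTK or spaCy do internally; real tokenizers handle abbreviations, contractions, and many other cases.

```python
import re

text = "Tokenization is important. It splits text into units!"

# Text segmentation: split after ., ! or ? when followed by whitespace (naive rule)
sentences = re.split(r"(?<=[.!?])\s+", text)

# Word tokenization: runs of word characters, or single punctuation marks
words = re.findall(r"\w+|[^\w\s]", sentences[0])

# Character tokenization: every character becomes its own token
chars = list("token")

print(sentences)
print(words)
print(chars)
```

Subword tokenization is left out here because it usually requires a vocabulary learned from data.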

---

(3) What is a word stem? Give the stem of the word "undoes"

**EXAMPLE SOLUTION**

The stem is the part that is common to all inflected forms of a word, e.g. undoes/undoing -> undo


Please note that there is a difference between word root and word stem:

* **Root:** The root of a word represents its core meaning or semantic base. It is the part of the word that remains when all affixes (prefixes, suffixes, and infixes) are removed. The root carries the fundamental lexical or morphological content of the word. For example, the root of the word "studying" is "stud."

* **Stem:** The stem of a word is a form to which affixes can be attached. It is obtained by removing any inflectional affixes, but it may or may not represent the core meaning of the word. The stem may contain some changes or modifications compared to the root due to morphological rules. For example, the stem of the word "running" is "run." The stem of "unhappiness" is "unhappi." When we remove the suffix "-ness" from the word, we are left with the stem "unhappi." The stem represents the base form to which affixes can be added, and in this case, it captures the essential meaning of the word "unhappy."

---

(4) What is a word lemma? Give the lemma of the word "undoes"

**EXAMPLE SOLUTION**

A word lemma, also known as a base form or dictionary form, refers to the canonical or uninflected form of a word. It serves as a common, generic representation from which the different inflected or derived forms can be formed.

The lemma of the word "undoes" is "undo." "Undo" is the base form or lemma, and "undoes" is the inflected form of the verb "undo" in the third-person singular present tense.



|   Term   | Definition                                         | Example              |
|----------|---------------------------------------------------|----------------------|
| Original Word | The word in its original form without any modifications or affixes applied to it. | studying            |
| Root     | The base form of a word to which prefixes and suffixes are added. | stud                |
| Stem     | The core part of a word that remains after removing inflectional affixes. | stud(i)               |
| Lemma    | The canonical or dictionary form of a word, often used as a base form for linguistic analysis. | (to) study              |
| Affix    | A morphological element attached to a root or stem to create a new word or alter its meaning or function. | -ing                  |
| Inflection | The modification of a word to express grammatical features such as tense, number, or case. | studied            |
| Derivation | The formation of new words by adding affixes to a root or stem, resulting in a word with a different meaning or part of speech. | (the) studies          |



In [2]:
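word = "untouchables"

# Stemming: rule-based suffix stripping with the Porter algorithm
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
print("Stem:", stemmer.stem(word))

# Lemmatization: dictionary lookup in WordNet (requires the NLTK "wordnet" data)
# Note: pos="v" treats the word as a verb, so the plural noun is returned unchanged
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print("Lemma:", lemmatizer.lemmatize(word, pos="v"))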

Stem: untouch
Lemma: untouchables


---

(5) Why should we typically extract word stems/lemmas before proceeding with text analysis?

**EXAMPLE SOLUTION**

To condense related words so that we don’t have as much variability. Extracting word stems or lemmas before proceeding with text analysis has several benefits:

1. **Normalization:** Word stemming or lemmatization helps to normalize the text data by reducing words to their base or root forms. This process reduces the variation in word forms and helps to consolidate words with the same meaning. It can be particularly useful in cases where different inflections or derivatives of a word appear in the text.

2. **Vocabulary reduction:** By reducing words to their base forms, the vocabulary size is reduced. This helps to simplify the analysis and make it more computationally efficient. Instead of treating different forms of the same word as separate entities, they are mapped to a common lemma, allowing for better data representation and analysis.

3. **Improved semantic analysis:** Word stems or lemmas often carry the core meaning of a word. By using lemmas, we can focus on the essential semantic content of the text while removing inflectional variations. This can enhance tasks like sentiment analysis, topic modeling, or language understanding by capturing the underlying meaning more effectively.

4. **Better information retrieval:** When searching or matching text documents, using lemmas instead of exact word forms can improve the recall of relevant documents. By considering the base form of words, the search system can match documents that contain different inflections or variants of the same word, increasing the chances of retrieving relevant results.
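The vocabulary reduction effect can be seen in a small sketch. The function `toy_stem` is a hypothetical, deliberately naive suffix stripper written for this example; it is not the Porter algorithm and only serves to show how several word forms map to one entry.

```python
def toy_stem(word):
    """Strip a few common English suffixes (toy rule, NOT the Porter algorithm)."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip if at least three characters remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

tokens = ["connect", "connects", "connected", "connecting", "walk", "walked", "walks"]

vocabulary = set(tokens)
stemmed_vocabulary = {toy_stem(token) for token in tokens}

print(len(vocabulary), "distinct word forms")
print(len(stemmed_vocabulary), "distinct stems:", sorted(stemmed_vocabulary))
```

Seven surface forms collapse to two stems, which is the kind of consolidation that the points above describe.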


---

(6) What are stop-words? Why does it make sense to remove stop-words before proceeding with text analysis?

**EXAMPLE SOLUTION**

Stop-words are commonly used words that typically do not carry significant meaning in a given language. Examples of stop-words in English include "the," "is," "and," "a," "in," and so on. These words are often functional words that help to connect and structure sentences but do not contribute much to the overall understanding or interpretation of the text.

It makes sense to remove stop-words before proceeding with text analysis for several reasons:

1. **Focus on content-bearing words and noise reduction:** By excluding stop-words, the analysis can prioritize content-bearing words that carry more meaningful information. This can lead to a better understanding of the underlying themes, patterns, or sentiments within the text.

2. **Memory and processing efficiency:** Stop-words tend to appear in a large number of documents, resulting in a high frequency of occurrence. Removing stop-words reduces the size of the vocabulary, which can significantly reduce memory usage and processing time in text analysis tasks. This is particularly important when dealing with large text corpora or when resources are limited.




---

(7) If you had access to frequency statistics for a language, how could you create a list of stop words?

**EXAMPLE SOLUTION**

One Way: Determine a cut-off, e.g. the 500 most frequent words, and mark these as stop words for following analysis steps.
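A minimal sketch of this cut-off approach, assuming the frequency statistics are computed from a tokenized corpus with `collections.Counter`. The function name `build_stop_list`, the toy corpus, and the cut-off value are made up for illustration.

```python
from collections import Counter

def build_stop_list(tokens, top_k):
    """Return the top_k most frequent (lower-cased) tokens as stop-word candidates."""
    counts = Counter(token.lower() for token in tokens)
    return [word for word, _count in counts.most_common(top_k)]

# Toy corpus; a real list would be derived from a much larger collection
corpus = "the cat sat on the mat and the dog lay on the rug".split()

stop_words = build_stop_list(corpus, top_k=2)
print(stop_words)
```

On small or domain-specific corpora, frequent content words can end up above the cut-off, so the resulting list should be inspected before it is used.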

---

(8) Name a use-case in which we should NOT remove stop-words prior to text analysis.

**EXAMPLE SOLUTION**

See the following example:
```
 The crowd believed the man.
 The crowd believed in the man.
 The crowd believed that the man was lying.
```
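With stop-words removed, these sentences collapse to (almost) the same sequence: the first two become "crowd believed man", and the third differs only by the remaining word "lying". The distinctions in meaning carried by "in" and "that ... was" are lost, even though the last sentence indicates the opposite of the first.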
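The effect can be reproduced with a small sketch. `STOP_WORDS` is a hypothetical mini stop list chosen for this example; real lists, such as the one shipped with NLTK, are much longer.

```python
# Hypothetical mini stop list for illustration only
STOP_WORDS = {"the", "in", "that", "was"}

def remove_stop_words(sentence):
    """Lower-case, drop the final period, split on whitespace, filter stop-words."""
    tokens = sentence.lower().rstrip(".").split()
    return [token for token in tokens if token not in STOP_WORDS]

sentences = [
    "The crowd believed the man.",
    "The crowd believed in the man.",
    "The crowd believed that the man was lying.",
]

reduced = [remove_stop_words(sentence) for sentence in sentences]
for original, tokens in zip(sentences, reduced):
    print(original, "->", tokens)
```

With this list, the first two sentences become identical, and the third differs only by the remaining token "lying".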

For example, in sentiment analysis tasks, stop words can provide important context and contribute to the sentiment expressed in a sentence. Words like "not", "no", or "but" can completely change the meaning and sentiment of a sentence. Removing these stop words could result in the loss of critical information for sentiment analysis. See the list of English stopwords used by sklearn & nltk: 


In [3]:
from nltk.corpus import stopwords
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

---

(9) What is part-of-speech (POS) tagging? Why is it useful?

**EXAMPLE SOLUTION**

POS tagging is the process of assigning each word (token) a tag that indicates its part of speech, e.g. noun, verb, or adjective.
POS tags are used in a variety of NLP tasks because they provide a linguistic signal on how a word is used within the scope of a phrase, sentence, or document.

**Input:** "I love eating pizza."

**Output:** 
- "I" - Pronoun (PRP)
- "love" - Verb, non-3rd person singular present (VBP)
- "eating" - Verb, gerund (VBG)
- "pizza" - Noun (NN)


---

(10) Explain how n-gram POS tagging works. What is the limitation of this method?

**EXAMPLE SOLUTION**

The n-gram tagger is a simple statistical tagging algorithm. 
A unigram tagger assigns each token the tag that is most likely for that token’s text.
Bigram and trigram taggers additionally look at the preceding words and their POS tags to guess the part-of-speech tag of the current word:

- A bigram tagger uses the previous tag as part of its context.
- A trigram tagger uses the previous two tags as part of its context.

A potential issue with nth-order taggers is their size. 
An nth-order tagger with backoff may store trigram and bigram tables, i.e. large sparse arrays that may have hundreds of millions of entries. 
As a consequence of this size, it is simply impractical to condition nth-order models on the identities of the words in the context.

Another problem with n-gram POS taggers is data sparsity. An n-gram tagger predicts the tag of a word from its context; a trigram tagger, for example, considers the current word together with the two preceding positions. As n increases, the number of distinct n-grams grows, so many n-grams occur rarely or never in the training data. The tagger then struggles to assign tags to unseen or rare combinations, which leads to poor performance and limited generalization, particularly for new or previously unseen words.

A backoff tagger addresses this with a hierarchy of taggers: when a more specific tagger (e.g. a trigram tagger) has no evidence for the current context because of data sparsity or missing contextual information, it falls back to a less specific one (e.g. a bigram tagger, then a unigram tagger).
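The following is a minimal sketch of a bigram tagger with backoff, written in plain Python for illustration. It is not NLTK's implementation; the functions `train` and `tag`, the two training sentences, and the default tag `NN` are assumptions made for this example. The context is the previous tag plus the current word, with backoff to a unigram table and then to a default tag.

```python
from collections import Counter, defaultdict

def train(tagged_sentences):
    """Count tags per word (unigram) and per (previous tag, word) context (bigram)."""
    unigram = defaultdict(Counter)
    bigram = defaultdict(Counter)
    for sentence in tagged_sentences:
        prev_tag = "<s>"  # sentence-start marker
        for word, tag_ in sentence:
            unigram[word][tag_] += 1
            bigram[(prev_tag, word)][tag_] += 1
            prev_tag = tag_
    return unigram, bigram

def tag(words, unigram, bigram, default="NN"):
    """Tag left to right: bigram context first, then unigram, then the default tag."""
    tags = []
    prev_tag = "<s>"
    for word in words:
        if (prev_tag, word) in bigram:
            best = bigram[(prev_tag, word)].most_common(1)[0][0]
        elif word in unigram:
            best = unigram[word].most_common(1)[0][0]  # backoff step 1
        else:
            best = default  # backoff step 2: unseen word
        tags.append(best)
        prev_tag = best
    return tags

# Toy training data (made up for this example)
training = [
    [("I", "PRP"), ("saw", "VBD"), ("her", "PRP$"), ("duck", "NN")],
    [("they", "PRP"), ("duck", "VBP"), ("quickly", "RB")],
]
unigram, bigram = train(training)

print(tag(["they", "duck"], unigram, bigram))
print(tag(["her", "duck"], unigram, bigram))
print(tag(["we", "duck"], unigram, bigram))
```

The first two calls show how the previous tag disambiguates "duck". In the third call, the unseen word "we" receives the default tag, and the tagger has to back off for "duck" because the context was never observed, which is the sparsity problem in miniature.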

---

(11) What is parsing? Which two main types of parsing exist?

**EXAMPLE SOLUTION**

Parsing means to analyze (a sentence) in terms of grammatical constituents, identifying the parts of speech, syntactic relations, etc.

The two main approaches are:

> Top-down approach vs. bottom-up approach.

alternatively:

> Constituency parsing vs. dependency parsing.

---

(12) What is the difference between a context-free and a context-sensitive language? What is the difference between a context-free and a regular language?

**EXAMPLE SOLUTION**

In a context-free grammar, each production rule has the form `A -> α`, where A is a single non-terminal and α is a string of non-terminal and terminal symbols (e.g. `A B` or `A terminal`).

The set of context-free languages is a subset of the context-sensitive languages. In a context-sensitive grammar, rules may have the form `αAβ → αγβ` (with A being a non-terminal and α, β, γ being strings of terminals and non-terminals).
Here, a production rule may depend on the surrounding context of A. This is not possible in context-free grammars.

Finally, a regular language is even more restricted by its grammar: each right-hand side can contain at most one non-terminal, and that non-terminal has to be placed consistently at the start or at the end of the string.
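The restriction on regular grammars can be checked mechanically. The sketch below tests the right-linear variant (non-terminal at the end of the string); the function name `is_right_linear` and the convention that upper-case symbols are non-terminals are assumptions made for this example.

```python
def is_right_linear(rules):
    """Check that every right-hand side has at most one non-terminal, placed last.

    rules: list of (lhs, rhs) pairs, where rhs is a list of symbols.
    Assumption for this sketch: upper-case symbols are non-terminals.
    """
    for _lhs, rhs in rules:
        positions = [i for i, symbol in enumerate(rhs) if symbol.isupper()]
        if len(positions) > 1:
            return False
        if positions and positions[0] != len(rhs) - 1:
            return False
    return True

# Rules S -> a S | b generate the regular language a*b
regular = [("S", ["a", "S"]), ("S", ["b"])]

# Rules S -> a S b | a b generate a^n b^n, which needs a context-free grammar
context_free = [("S", ["a", "S", "b"]), ("S", ["a", "b"])]

print(is_right_linear(regular))
print(is_right_linear(context_free))
```

Note that the check classifies a grammar, not a language: a grammar that fails the test may still describe a regular language if an equivalent right-linear grammar exists.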

---

(13) Give an equivalent grammar in Chomsky Normal form for the Grammar G=(N,T,P,S) with N={S,A,B,C}, T={a,b,c}, P={S->ABC, A->a, B->b, C->c}

**EXAMPLE SOLUTION**

Production rule set:
- S -> A B C
- A -> a
- B -> b
- C -> c

Chomsky Normal Form only allows rules of the form `A -> B C` or `A -> a`. The rule `S -> A B C` has three non-terminals on its right-hand side, so we split it by introducing a new non-terminal X:

- S -> A X
- X -> B C
- A -> a
- B -> b
- C -> c
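As a sanity check, the two rule sets can be compared with a short sketch. The helper names `is_cnf` and `generate` are made up for this example, and `generate` assumes exactly one rule per non-terminal, which holds for this grammar but not in general.

```python
def is_cnf(rules, nonterminals):
    """True if every rule has the form A -> B C or A -> a."""
    for _lhs, rhs in rules:
        two_nonterminals = len(rhs) == 2 and all(s in nonterminals for s in rhs)
        one_terminal = len(rhs) == 1 and rhs[0] not in nonterminals
        if not (two_nonterminals or one_terminal):
            return False
    return True

def generate(symbol, rules):
    """Expand a symbol into its terminal string (assumes one rule per non-terminal)."""
    table = dict(rules)
    if symbol not in table:
        return symbol  # terminal symbol
    return "".join(generate(s, rules) for s in table[symbol])

original = [("S", ["A", "B", "C"]), ("A", ["a"]), ("B", ["b"]), ("C", ["c"])]
converted = [("S", ["A", "X"]), ("X", ["B", "C"]),
             ("A", ["a"]), ("B", ["b"]), ("C", ["c"])]

print(is_cnf(original, {"S", "A", "B", "C"}))
print(is_cnf(converted, {"S", "A", "B", "C", "X"}))
print(generate("S", original), generate("S", converted))
```

Only the converted rule set passes the CNF check, and both grammars derive the same single string, so the conversion did not change the language.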

---

(14) Explain how shift-reduce parsing works.

**EXAMPLE SOLUTION**

Shift-reduce parsing uses a stack as its data structure and has two distinct operations:

* **SHIFT:** Push the next word from the input sentence onto the stack.

* **REDUCE:** Check whether the top n symbols on the stack match the right-hand side of a production rule. If so, pop these n symbols and replace them with the left-hand side of that rule.

* The process **stops** successfully when the input sentence has been processed completely (meaning every word has been pushed onto the stack) and the start symbol S is the only element left on the stack.
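The two operations can be sketched in a few lines of Python, here applied to the CNF grammar from question 13. This is a simplified, greedy variant (reduce whenever possible, otherwise shift, no backtracking), which is an assumption of this example; practical parsers need a strategy for shift/reduce conflicts.

```python
RULES = [
    ("S", ["A", "X"]),
    ("X", ["B", "C"]),
    ("A", ["a"]),
    ("B", ["b"]),
    ("C", ["c"]),
]

def shift_reduce(words, rules, start="S"):
    """Greedy shift-reduce recognizer; returns (accepted, trace of stack states)."""
    stack, trace = [], []
    remaining = list(words)
    while True:
        # REDUCE: replace the top of the stack if it matches a right-hand side
        for lhs, rhs in rules:
            if stack[-len(rhs):] == rhs:
                del stack[-len(rhs):]
                stack.append(lhs)
                trace.append(("reduce", list(stack)))
                break
        else:
            # SHIFT: no reduction applies, so push the next word (if any is left)
            if not remaining:
                break
            stack.append(remaining.pop(0))
            trace.append(("shift", list(stack)))
    return stack == [start], trace

accepted, trace = shift_reduce(["a", "b", "c"], RULES)
for action, stack_state in trace:
    print(action, stack_state)
print("accepted:", accepted)
```

For the input `a b c`, the trace ends with the stack `['S']`. An input such as `a c b` leaves other symbols on the stack and is rejected.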


---

(15) What is the grammar ambiguity problem?

**EXAMPLE SOLUTION**

An ambiguous grammar is a context-free grammar for which there exists a string that can have more than one leftmost derivation or parse tree, while an unambiguous grammar is a context-free grammar for which every valid string has a unique leftmost derivation or parse tree.
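Ambiguity can be made concrete by counting parse trees. The sketch below uses a CYK-style chart that stores counts instead of booleans and assumes the grammar is given in Chomsky Normal Form. The function `count_parses` and the toy grammar `S -> S S | a` are chosen for this example.

```python
from collections import defaultdict

def count_parses(words, binary_rules, lexical_rules, start="S"):
    """Count parse trees for `words` under a CNF grammar (CYK chart with counts)."""
    n = len(words)
    # chart[(i, j)][A] = number of parse trees deriving words[i:j] from A
    chart = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(words):
        for lhs, terminal in lexical_rules:
            if terminal == word:
                chart[(i, i + 1)][lhs] += 1
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            for k in range(i + 1, j):  # split point between left and right child
                for lhs, (left, right) in binary_rules:
                    chart[(i, j)][lhs] += chart[(i, k)][left] * chart[(k, j)][right]
    return chart[(0, n)][start]

# Ambiguous grammar: S -> S S | a
ambiguous_binary = [("S", ("S", "S"))]
ambiguous_lexical = [("S", "a")]
print(count_parses(list("aaa"), ambiguous_binary, ambiguous_lexical))

# Unambiguous grammar from question 13: S -> A X, X -> B C, A -> a, B -> b, C -> c
cnf_binary = [("S", ("A", "X")), ("X", ("B", "C"))]
cnf_lexical = [("A", "a"), ("B", "b"), ("C", "c")]
print(count_parses(list("abc"), cnf_binary, cnf_lexical))
```

The string `aaa` has two parse trees under the first grammar, `(a a) a` and `a (a a)`, while `abc` has exactly one under the second.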


---

## Task 2 - Named Entity Recognition:

### Part 1: Automated Annotations

Use spaCy to annotate entities in the debates dataset (available as part of the JSON in the data directory). [Depending on your computer, the extraction may take some time. Therefore, you are allowed to restrict the size of the text, e.g. to the first 250 sentences. Feel free to use a larger share of the data to get better insights.]

Tip: Use the `en_core_web_sm` model

Display the results in readable form, e.g. show the tagged entities for a reasonably sized part of the data, ideally alongside the original text.

In [1]:
# Tip for spaCy: before you can load a model, you might need to download it via the command line.
# Inside a notebook, you can run shell commands by prefixing them with "!"

# Similar to: !python ...



In [5]:
# Read debates
import json
with open('data/texts.json', 'r') as infile:
    data = json.load(infile)

content_debates = data['debates']

In [6]:
# Split sentences
import nltk
sentences = nltk.sent_tokenize(content_debates)[:1000]

# Load model and annotate sentences
import spacy
pipeline = spacy.load("en_core_web_sm")
annotated = [pipeline(sentence) for sentence in sentences]

In [7]:
# Display NER results
from spacy import displacy
displacy.render(annotated[:10], jupyter=True, style='ent')

### Part 2: Analysis

In real-world use cases, NER can be difficult. Analyze the following challenging examples and identify all cases in which the automated NER failed. For the tagging, you can use the method from part 1.

In [8]:
# Read challenging test sentences
import json
with open('data/hard_data.json', 'r') as infile:
    data = json.load(infile)

test_sentences = [s["sentence"] for s in data['test_sentences']]

Write the failure cases down here, then try to classify them into types of errors that the model makes. Discuss the results.
You can use a mixture of markdown and code if it suits your analysis.

In [9]:
# Reuse the spaCy pipeline loaded in Part 1 (it is named `pipeline`, not `nlp`)
test_annotated = [pipeline(sentence) for sentence in test_sentences]

In [10]:
from spacy import displacy
displacy.render(test_annotated[:10], jupyter=True, style='ent')

NER models can encounter several common mistakes. Some of these mistakes include:

1. **Incorrect entity boundaries:** NER models may sometimes fail to accurately identify the boundaries of named entities, leading to incomplete or overlapping entity recognition. For example, a model may recognize "New York" as two separate entities instead of a single entity representing the location.

2. **Out-of-vocabulary entities:** NER models are trained on a specific set of entities, and they may struggle to recognize entities that were not present in the training data. This can result in entities being unrecognized or misclassified.

3. **Ambiguous entities:** Some named entities can have multiple possible interpretations or can be ambiguous in certain contexts. NER models may struggle to disambiguate such entities and assign the correct entity type. For example, the word "Apple" can refer to a fruit or a technology company, and the correct entity type depends on the context. 

4. **Handling of nested or overlapping entities:** NER models may struggle to handle cases where named entities overlap or are nested within each other. Determining the correct boundaries and entity types in such scenarios can be challenging. For instance, a model may tag only "Paris" as a location and fail to recognize the enclosing entity "Hotel Paris".
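To classify failure cases systematically, predicted entity spans can be compared against manually annotated ones. The sketch below assumes entities are given as `(start, end, label)` token spans with an exclusive end. The function `classify_errors`, the gold annotations, and the predictions are all made up for illustration and are not actual spaCy output.

```python
def classify_errors(gold, predicted):
    """Compare gold and predicted (start, end, label) spans and name each error."""
    errors = []
    for g_start, g_end, g_label in gold:
        exact = [p for p in predicted if (p[0], p[1]) == (g_start, g_end)]
        overlapping = [p for p in predicted if p[0] < g_end and g_start < p[1]]
        if exact and exact[0][2] == g_label:
            continue  # correctly recognized
        if exact:
            errors.append(("wrong type", (g_start, g_end, g_label)))
        elif overlapping:
            errors.append(("wrong boundary", (g_start, g_end, g_label)))
        else:
            errors.append(("missed", (g_start, g_end, g_label)))
    for p in predicted:
        # Predictions that touch no gold entity at all
        if not any(p[0] < g[1] and g[0] < p[1] for g in gold):
            errors.append(("spurious", p))
    return errors

tokens = ["Apple", "opened", "a", "store", "in", "New", "York",
          "near", "Hotel", "Paris", "."]

# Hypothetical manual annotation and hypothetical model output
gold = [(0, 1, "ORG"), (5, 7, "GPE"), (8, 10, "FAC")]
predicted = [(0, 1, "PERSON"), (5, 6, "GPE"), (3, 4, "ORG")]

for kind, (start, end, label) in classify_errors(gold, predicted):
    print(f"{kind:15s} {' '.join(tokens[start:end])!r} ({label})")
```

With real spaCy documents, the predicted spans could be read from `doc.ents` (via `ent.start`, `ent.end`, and `ent.label_`), while the gold spans have to come from manual annotation.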



---

#### Submitting your results:

To submit your results, please:

- save this file, i.e., `ex??_assignment.ipynb`.
- if you reference any external files (e.g., images), please create a zip or rar archive and put the notebook file and all referenced files in there.
- login to ILIAS and submit the `*.ipynb` or archive for the corresponding assignment.

**Remarks:**
    
- Do not copy any code from the Internet. In case you want to use publicly available code, please, add the reference to the respective code snippet.
- Check your code compiles and executes, even after you have restarted the Kernel.
- Submit your written solutions and the coding exercises within the provided spaces and not otherwise.
- Write the names of your partner and your name in the top section.