# Worksheet 2

## Task 1

Write a regular expression that can:

(a) Validate German ZIP codes: `^[0-9]{5}$`

(b) Detect academic titles (Prof., Prof, Dr., Dr, B.Sc., BSc, M.Sc., MSc, B.A., ...): `(Prof|Dr|((B|M)\.?(Sc|A)))\.?`

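(c) Test whether a binary string (i.e. a string that just contains 0s and 1s) contains the substring "01" and "10" an equal number of times: this is possible. Every "01" is a switch from 0 to 1 and every "10" a switch from 1 to 0, and the two kinds of switch alternate along the string, so the counts are equal exactly when the string is empty or starts and ends with the same character: `^(0([01]*0)?|1([01]*1)?)?$`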
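
The patterns in this task can be exercised with Python's `re` module. The following is a minimal sketch rather than part of the worksheet: it uses `re.fullmatch`, so each pattern has to cover the whole string, and for (c) it assumes that the counts of "01" and "10" are equal exactly when the string is empty or begins and ends with the same character. The brute-force loop checks that assumption on all binary strings up to length 10. The names `count_overlapping` and `EQUAL_01_10` are illustrative.

```python
import re
from itertools import product

ZIP = re.compile(r"[0-9]{5}")
TITLE = re.compile(r"(Prof|Dr|((B|M)\.?(Sc|A)))\.?")
# Assumption for (c): the counts are equal iff the string is empty or its
# first and last characters agree.
EQUAL_01_10 = re.compile(r"(0([01]*0)?|1([01]*1)?)?")


def count_overlapping(text, pattern):
    """Count occurrences of a two-character pattern, overlaps included."""
    return sum(text[i:i + 2] == pattern for i in range(len(text) - 1))


# (a) exactly five digits
print(bool(ZIP.fullmatch("35037")), bool(ZIP.fullmatch("3503")))

# (b) titles with and without dots
titles = ["Prof.", "Prof", "Dr.", "Dr", "B.Sc.", "BSc", "M.Sc.", "MSc", "B.A."]
print(all(TITLE.fullmatch(t) for t in titles))

# (c) compare the pattern against direct counting for all short strings
mismatches = []
for length in range(11):
    for bits in product("01", repeat=length):
        s = "".join(bits)
        equal_counts = count_overlapping(s, "01") == count_overlapping(s, "10")
        if equal_counts != bool(EQUAL_01_10.fullmatch(s)):
            mismatches.append(s)
print(len(mismatches))
```

An empty `mismatches` list means the pattern and the direct count agree on every string tested; it does not replace the alternation argument, it only supports it.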

## Task 2

Calculate the minimum edit distance between "color" and "colour" with the algorithm introduced in the lecture (provide the full table).

(a) Assume the cost of all operations to be 1.

|   |   | c | o | l | o | u | r |
|---|---|---|---|---|---|---|---|
|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
| c | 1 | 0 | 1 | 2 | 3 | 4 | 5 |
| o | 2 | 1 | 0 | 1 | 2 | 3 | 4 |
| l | 3 | 2 | 1 | 0 | 1 | 2 | 3 |
| o | 4 | 3 | 2 | 1 | 0 | 1 | 2 |
| r | 5 | 4 | 3 | 2 | 1 | 1 | 1 |

(b) How does the table change if the cost of a substitution is 2 (i.e. equal to the cost of a deletion plus an insertion)?

|   |   | c | o | l | o | u | r |
|---|---|---|---|---|---|---|---|
|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
| c | 1 | 0 | 1 | 2 | 3 | 4 | 5 |
| o | 2 | 1 | 0 | 1 | 2 | 3 | 4 |
| l | 3 | 2 | 1 | 0 | 1 | 2 | 3 |
| o | 4 | 3 | 2 | 1 | 0 | 1 | 2 |
| r | 5 | 4 | 3 | 2 | 1 | 2 | 1 |

Only the cell in row r, column u changes (from 1 to 2); the final result is not affected.

## Task 3

Implement the minimum edit distance algorithm introduced in the lecture in Python.

In [18]:
import numpy as np

def min_edit_distance(source, target, ins_cost=1, del_cost=1, sub_cost=1):
    # d[i, j] holds the cost of turning source[:i] into target[:j]
    n = len(source) + 1
    m = len(target) + 1
    d = np.zeros((n, m), dtype=int)

    # first column: delete every character of the source prefix
    for i in range(1, n):
        d[i, 0] = d[i - 1, 0] + del_cost

    # first row: insert every character of the target prefix
    for j in range(1, m):
        d[0, j] = d[0, j - 1] + ins_cost

    for j in range(1, m):
        for i in range(1, n):
            if source[i - 1] == target[j - 1]:
                # matching characters cost nothing
                d[i, j] = d[i - 1, j - 1]
            else:
                d[i, j] = min(d[i, j - 1] + ins_cost, d[i - 1, j] + del_cost, d[i - 1, j - 1] + sub_cost)

    return d[n - 1, m - 1]

print(min_edit_distance("color", "colour"))
print(min_edit_distance("color", "colour", sub_cost=2))

1
1
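
To cross-check the hand-filled tables from Task 2, the same recurrence can return the whole table instead of only its last cell. This is a sketch in plain Python (lists instead of NumPy); the function name `edit_distance_table` is made up for this illustration.

```python
def edit_distance_table(source, target, ins_cost=1, del_cost=1, sub_cost=1):
    """Return the full table; rows follow source, columns follow target."""
    rows, cols = len(source) + 1, len(target) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = d[i - 1][0] + del_cost
    for j in range(1, cols):
        d[0][j] = d[0][j - 1] + ins_cost
    for i in range(1, rows):
        for j in range(1, cols):
            if source[i - 1] == target[j - 1]:
                d[i][j] = d[i - 1][j - 1]
            else:
                d[i][j] = min(
                    d[i][j - 1] + ins_cost,
                    d[i - 1][j] + del_cost,
                    d[i - 1][j - 1] + sub_cost,
                )
    return d


unit = edit_distance_table("color", "colour")
double = edit_distance_table("color", "colour", sub_cost=2)

for row in unit:
    print(row)

# cells (row, column) whose value depends on the substitution cost
changed = [
    (i, j)
    for i in range(len(unit))
    for j in range(len(unit[0]))
    if unit[i][j] != double[i][j]
]
print(changed)
```

For this word pair, `changed` contains a single index pair, `(5, 5)`, which is row r, column u in the tables of Task 2.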


## Task 4

Use the full rule set of the [Porter Stemmer](https://www.inf.fu-berlin.de/lehre/WS01/netbasedIS/vortraege/stem_porter.html) to manually stem the following words. Show your work (i.e. write down the intermediate steps).

(a) enshittification

- Step 2: (m>0) ATION -> ATE: enshittificate
- Step 3: (m>0) ICATE -> IC: enshittific
- Step 4: (m>1) IC -> : enshittif

(b) congratulations

- Step 1a: S -> : congratulation
- Step 2: (m>0) ATION -> ATE: congratulate
- Step 4: (m>1) ATE -> : congratul

(c) necessity

- Step 1c: (\*v\*) Y -> I: necessiti
- Step 4: (m>1) ITI -> : necess

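(d) bare: already the final stem. No rule changes it; in Step 5a the stem "bar" has m = 1 but ends in consonant-vowel-consonant (\*o), so the final E is kept.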
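
The conditions `(m>0)` and `(m>1)` used above refer to Porter's measure m, the number of vowel-consonant sequences in the stem that precedes the suffix. The sketch below computes m so the conditions can be checked mechanically. It is not a full stemmer, and the function names `is_consonant` and `measure` are illustrative.

```python
def is_consonant(word, i):
    """Porter's definition: not a, e, i, o, u, and not a y that follows a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # y counts as a consonant at the start of the word or after a vowel
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Count the VC sequences in the form [C](VC)^m[V]."""
    m = 0
    previous_was_vowel = False
    for i in range(len(stem)):
        consonant = is_consonant(stem, i)
        if consonant and previous_was_vowel:
            m += 1  # a run of vowels has just been closed by a consonant
        previous_was_vowel = not consonant
    return m


for stem in ["enshittif", "congratul", "necess", "bar"]:
    print(stem, measure(stem))
```

The stems `enshittif`, `congratul` and `necess` all have m > 1, which is what the Step 4 rules require, while `bar` has m = 1.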

## Task 5

For the following sentences: "I own multiple cats.", "Marburg is a nice city.", "It’s fairly cold in here.":

In [75]:
# setup
import nltk
from nltk.stem import PorterStemmer, SnowballStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
from nltk.corpus import wordnet

nltk.download('punkt_tab')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger_eng')

sentences = ["I own multiple cats.", "Marburg is a nice city.", "It's fairly cold in here."]
tokens = [word_tokenize(sentence) for sentence in sentences]

porter = PorterStemmer()
snowball = SnowballStemmer('english')
wnl = WordNetLemmatizer()


(a) Use the popular NLP library NLTK to stem the sentences with the [Porter and the Snowball stemmer](https://www.nltk.org/howto/stem.html). What do you notice? Hint: You will need to tokenize the sentences, you can use the [word_tokenize function](https://www.nltk.org/api/nltk.tokenize.word_tokenize.html) of NLTK.

In [76]:
# a)
result_porter = [[porter.stem(token) for token in sentence] for sentence in tokens]
result_snowball = [[snowball.stem(token) for token in sentence] for sentence in tokens]

print('Porter:   ', result_porter)
print('Snowball: ', result_snowball)

Porter:    [['i', 'own', 'multipl', 'cat', '.'], ['marburg', 'is', 'a', 'nice', 'citi', '.'], ['it', "'s", 'fairli', 'cold', 'in', 'here', '.']]
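Snowball:  [['i', 'own', 'multipl', 'cat', '.'], ['marburg', 'is', 'a', 'nice', 'citi', '.'], ['it', "'s", 'fair', 'cold', 'in', 'here', '.']]

Both stemmers lowercase the tokens and produce stems that are not dictionary words (multipl, citi). On these sentences they differ only for "fairly": Porter yields fairli, Snowball yields fair.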


(b) Find three additional sentences which are stemmed differently by the Porter and Snowball stemmer.

In [None]:
words = ["quickly", "generously", "relativity"]
tokens_words = [word_tokenize(word) for word in words]

result_porter = [[porter.stem(token) for token in word_tokens] for word_tokens in tokens_words]
result_snowball = [[snowball.stem(token) for token in word_tokens] for word_tokens in tokens_words]

print('Porter:   ', result_porter)
print('Snowball: ', result_snowball)

Porter:    [['quickli'], ['gener'], ['rel']]
Snowball:  [['quick'], ['generous'], ['relat']]


(c) NLTK also supports lemmatization using the [WordNetLemmatizer](https://www.nltk.org/api/nltk.stem.WordNetLemmatizer.html). Use the lemmatize function of the WordNetLemmatizer and pass each token as the only parameter. What do you observe?

In [None]:
print("Original:   ", tokens)
print("Lemmatized: ", [[wnl.lemmatize(token) for token in sentence] for sentence in tokens])

Original:    [['I', 'own', 'multiple', 'cats', '.'], ['Marburg', 'is', 'a', 'nice', 'city', '.'], ['It', "'s", 'fairly', 'cold', 'in', 'here', '.']]
Lemmatized:  [['I', 'own', 'multiple', 'cat', '.'], ['Marburg', 'is', 'a', 'nice', 'city', '.'], ['It', "'s", 'fairly', 'cold', 'in', 'here', '.']]


Without a PoS argument, `lemmatize` treats every token as a noun (its default is `pos='n'`), so only the plural noun changes: cats -> cat.

(d) Now use NLTK’s [PoS Tagger](https://www.nltk.org/api/nltk.tag.pos_tag.html) to use the lemmatize function with two parameters, the token and its PoS. Hint: You will have to convert the PoS tags to fit the tags expected by the lemmatizer function. What changes do you observe?

In [None]:
# d)
tags = [pos_tag(sentence) for sentence in tokens]

def get_wordnet_pos(tag):
    # map Penn Treebank tag prefixes to WordNet PoS constants
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        # fall back to noun, the lemmatizer's default
        return wordnet.NOUN

print("Original:   ", tokens)
print("Lemmatized: ", [[wnl.lemmatize(word, get_wordnet_pos(tag)) for (word, tag) in sentence] for sentence in tags])

Original:    [['I', 'own', 'multiple', 'cats', '.'], ['Marburg', 'is', 'a', 'nice', 'city', '.'], ['It', "'s", 'fairly', 'cold', 'in', 'here', '.']]
Lemmatized:  [['I', 'own', 'multiple', 'cat', '.'], ['Marburg', 'be', 'a', 'nice', 'city', '.'], ['It', "'s", 'fairly', 'cold', 'in', 'here', '.']]


Plurals are reduced to singulars (cats -> cat) and verbs to their base form (is -> be). The contracted "'s" is left unchanged.

## Task 6

Analyse the sentiment of the sentence "We waited one hour in line to check in for our flight. Juuust great, exactly what you need to start of your vacation!" using:

(a) The Python library [TextBlob](https://textblob.readthedocs.io/en/dev/index.html)

In [84]:
from textblob import TextBlob

text = 'We waited one hour in line to check in for our flight. Juuust great, exactly what you need to start of your vacation!'

blob = TextBlob(text)
for sentence in blob.sentences:
    print(sentence.sentiment.polarity)

0.0
0.55625


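Interpretation of results: the printed values are polarity scores on a scale from -1 (negative) to 1 (positive). 0.0 -> neutral sentiment for the first sentence, 0.55625 -> fairly positive sentiment for the second sentence, i.e. the sarcastic "Juuust great" is taken literally.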
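
TextBlob reports polarity on a scale from -1.0 (negative) to 1.0 (positive). One simple way to turn such a score into a label is a threshold. The helper `label_polarity` and its cut-off of 0.1 are assumptions made for this sketch and are not part of TextBlob.

```python
def label_polarity(score, threshold=0.1):
    """Map a polarity score in [-1.0, 1.0] to a coarse label.

    The threshold is an arbitrary choice for illustration.
    """
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"


# polarity values printed by the cell above
for score in (0.0, 0.55625):
    print(score, label_polarity(score))
```

Under this assumed threshold the first sentence is labelled neutral and the second positive.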

(b) Two different LLMs (provide your prompt and the results)

*Prompt*: Analyse the sentiment of the sentence "We waited one hour in line to check in for our flight. Juuust great, exactly what you need to start of your vacation!"

*ChatGPT 4o*:

The sentiment of the sentence: "We waited one hour in line to check in for our flight. Juuust great, exactly what you need to start of your vacation!" is sarcastic and negative.

Analysis:

- Literal content: Waiting an hour in line to check in for a flight is generally an unpleasant experience.
- Tone and phrasing: The exaggerated spelling of “Juuust great” and the phrase “exactly what you need to start off your vacation” are classic indicators of sarcasm. These expressions are often used ironically to imply the opposite of the literal words.
- Implied sentiment: Despite the superficially positive words (“great”, “what you need”), the context and delivery suggest frustration or dissatisfaction.

Conclusion: The speaker is expressing frustration and displeasure with a sarcastic tone, making the overall sentiment negative.

*DeepSeek*:

The sentiment of the sentence is negative and expresses sarcasm and frustration.

Breakdown:

- "We waited one hour in line to check in for our flight.": This sets a negative tone, as long waiting times are generally frustrating.
- "Juuust great, exactly what you need to start off your vacation!": The exaggerated spelling of "Juuust" and the phrase "exactly what you need" are heavily sarcastic, implying the opposite of what is stated. The speaker is clearly unhappy about the situation, as this is not an ideal way to begin a vacation.

Conclusion: The overall sentiment is negative, with a tone of sarcastic disappointment.

(c) What do you observe?

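TextBlob's lexicon-based scoring takes words like "great" literally and rates the sarcastic second sentence as positive (0.55625), so it misses the sarcasm. Both LLMs recognise the sarcasm and classify the overall sentiment as negative.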