# Deep Learning for Natural Language and Code - Exercise 2

**Prof. Dr. Steffen Herbold**
**SoSe 2025**
**Due on 2025/05/15**

## General information for all exercises (read carefully!)

Within the "Deep Learning for Natural Language and Code" exercises, you will carry out tasks that relate to various NLP concepts. The main goal of these exercises is to teach you *how* to develop approaches. Once you have gained this knowledge, you will have the opportunity to use your own solution and compare it with existing solutions from popular libraries. These exercises are therefore not just about knowing how to use the libraries.

## Problem description

A common task in NLP is sentiment analysis. In this exercise, we focus on developing models that classify reviews by sentiment. We assume that you have an initial implementation of the bag-of-words (BOW) model from Exercise 1. We will work with a dataset of movie reviews:

*   Dataset: [https://ai.stanford.edu/~amaas/data/sentiment/](https://ai.stanford.edu/~amaas/data/sentiment/)

In StudIP, a Jupyter Notebook is provided as a template for this exercise. The template contains a series of sections to help you organize and structure your code. Some code blocks are also provided to facilitate some of the tasks.

## Data set description

For this exercise, we are going to work with the Large Movie Review Dataset [1]. This dataset was built for binary sentiment classification. It is composed of 25,000 highly polar movie reviews for training and 25,000 for testing; reviews with middle scores (i.e., 5 or 6) are excluded. The dataset also contains unlabeled data, which is not used in this exercise.

In [1]:
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from pathlib import Path
import re
import string 
from collections import Counter, defaultdict

import preprocessing

## Setup & Data Loading

We will load the Large Movie Review Dataset (IMDb).
The dataset is divided into `train` and `test` sets, and each of these has `pos` (positive) and `neg` (negative) review folders.
Filenames in the `pos` and `neg` folders have the form `<id>_<rating>.txt`, e.g., `1_9.txt` (id 1, score 9).
- Positive reviews: score >= 7
- Negative reviews: score <= 4
- Neutral reviews (score 5 or 6) are ignored in the main sentiment classification task.
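
The thresholds above can be turned into a binary label with a small helper. This is only a minimal sketch: the name `score_to_label` is hypothetical and is not part of the provided template.

```python
from typing import Optional


def score_to_label(score: int) -> Optional[int]:
    """Map a 1-10 rating to a binary sentiment label (hypothetical helper)."""
    if score >= 7:
        return 1   # positive review
    if score <= 4:
        return 0   # negative review
    return None    # neutral ratings (5 or 6) carry no label


print([score_to_label(s) for s in (10, 7, 6, 5, 4, 1)])
# -> [1, 1, None, None, 0, 0]
```

Returning `None` for neutral ratings keeps the helper total, so a caller can filter such reviews out explicitly if they ever appear.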

In [2]:
# Define base path to the dataset
BASE_DATA_PATH = Path("../exercise_01/aclImdb") 

TRAIN_PATH = BASE_DATA_PATH / "train"
TEST_PATH = BASE_DATA_PATH / "test"

TRAIN_POS_PATH = TRAIN_PATH / "pos"
TRAIN_NEG_PATH = TRAIN_PATH / "neg"
TEST_POS_PATH = TEST_PATH / "pos"
TEST_NEG_PATH = TEST_PATH / "neg"

assert TRAIN_POS_PATH.exists() and TRAIN_NEG_PATH.exists() and TEST_POS_PATH.exists() and TEST_NEG_PATH.exists()

In [3]:
def load_imdb_data(data_path: Path) -> tuple[list[str], list[int]]:
    """
    Loads movie reviews and their ratings from the specified path.

    Returns the review texts and, for each review, the 1-10 rating parsed
    from its filename (>= 7 is positive, <= 4 is negative).
    """
    texts: list[str] = []
    labels: list[int] = []

    for folder_path in (data_path / "pos", data_path / "neg"):
        if not folder_path.exists():
            raise FileNotFoundError(f"Path {folder_path} does not exist.")

        for file_path in folder_path.glob("*.txt"):
            # Filenames follow "<id>_<rating>.txt", e.g., "123_7.txt" -> rating 7
            score = int(file_path.stem.split("_")[1])

            labels.append(score)
            texts.append(file_path.read_text(encoding="utf-8"))
    return texts, labels

# Load training data
train_texts_raw, train_labels = load_imdb_data(TRAIN_PATH)
train_pos = sum(score >= 7 for score in train_labels)  # labels are ratings, not 0/1
print(f"Loaded {len(train_texts_raw)} training reviews.")
print(f"Training labels distribution: Positive (>= 7): {train_pos}, Negative (<= 4): {len(train_labels) - train_pos}")

# Load test data
test_texts_raw, test_labels = load_imdb_data(TEST_PATH)
test_pos = sum(score >= 7 for score in test_labels)
print(f"Loaded {len(test_texts_raw)} test reviews.")
print(f"Test labels distribution: Positive (>= 7): {test_pos}, Negative (<= 4): {len(test_labels) - test_pos}")

Loaded 25000 training reviews.
Training labels distribution: Positive (>= 7): 12500, Negative (<= 4): 12500
Loaded 25000 test reviews.
Test labels distribution: Positive (>= 7): 12500, Negative (<= 4): 12500


In [4]:
print("Train :")
for text, label in zip(train_texts_raw[:10], train_labels):
    print(f"{text[:50]}... Label : {label}")

print("\n\nTest :")
for text, label in zip(test_texts_raw[:10], test_labels):
    print(f"{text[:50]}... Label : {label}")

Train :
For a movie that gets no respect there sure are a ... Label : 9
Bizarre horror movie filled with famous faces but ... Label : 8
A solid, if unremarkable film. Matthau, as Einstei... Label : 7
It's a strange feeling to sit alone in a theater o... Label : 8
You probably all already know this by now, but 5 a... Label : 10
I saw the movie with two grown children. Although ... Label : 8
You're using the IMDb.<br /><br />You've given som... Label : 10
This was a good film with a powerful message of lo... Label : 10
Made after QUARTET was, TRIO continued the quality... Label : 10
For a mature man, to admit that he shed a tear ove... Label : 10


Test :
Based on an actual story, John Boorman shows the s... Label : 9
This is a gem. As a Film Four production - the ant... Label : 9
I really like this show. It has drama, romance, an... Label : 9
This is the best 3-D experience Disney has at thei... Label : 10
Of the Korean movies I've seen, only three had rea... Label : 10
this mov

# Data Preprocessing

In [None]:
string.punctuation.replace("-", "")

In [None]:
from typing import Literal
from pydantic import BaseModel, Field

class PreprocessingConfig(BaseModel):
    """
    Configuration for the text preprocessing pipeline.
    Defines various steps and their parameters for cleaning and preparing raw text data.
    The order of application in a pipeline would typically be:
    1. Lowercasing (if enabled)
    2. Number replacement (if enabled)
    3. Tokenization
    4. Hyphenated word splitting (if enabled)
    5. Punctuation removal (if enabled)
    6. (External: Frequent term removal, after vocab construction)
    """

    # PT#1 (a) Tokenization strategy
    # No default is defined for this option, so it must be set explicitly.
    tokenize_on_punctuation: list[str] | Literal[False] = Field(
        description="If a list of punctuation characters, tokenize on spaces and on those characters "
                    "(e.g., 'word.' -> ['word', '.']). "
                    "If False, tokenize only on spaces (e.g., 'word.' -> ['word.'])."
    )

    # PT#1 (b) Case conversion
    lowercase: bool = Field(
        default=True,
        description="If True, convert all text to lowercase. If False, keep original capitalization."
    )

    # PT#1 (c) Punctuation removal
    remove_punctuation: str | None = Field(
        default=string.punctuation,
        description="If a string, remove every character it contains; if None, keep punctuation. "
                    "If 'tokenize_on_punctuation' is enabled, punctuation tokens are removed. "
                    "If 'tokenize_on_punctuation' is False, punctuation is stripped from token ends. "
                    "Care should be taken not to remove structural parts of special tokens like '<NUM>'."
    )

    # PT#1 (d) High-frequency term removal
    max_df_percentage: float | None = Field(
        default=None,
        gt=0.0,
        lt=1.0,
        description="Remove terms that appear in more than this fraction of documents (e.g., 0.8 means 80%). "
                    "Value must be between 0.0 and 1.0 (exclusive). If None, no terms are removed based on high frequency."
    )

    # PT#1 (e) Number replacement
    number_replacement_token: str | None = Field(
        default="<NUM>",
        min_length=1,
        description="The special token to use when replacing numbers. If None, numbers are not replaced."
    )

    # split_hyphen
    split_hyphenated_words: bool = Field(
        default=False, # Often, keeping hyphenated words as single units is beneficial
        description="If True, attempts to split hyphenated words into components. "
                    "E.g., 'state-of-the-art' could become ['state', 'of', 'the', 'art'] or ['state', '-', 'of', '-', 'the', '-', 'art'] "
                    "depending on the splitting logic and whether hyphens themselves are kept. "
                    "This is typically applied after initial tokenization. "
                    "If 'tokenize_on_punctuation' is True and hyphens are delimiters, this might be redundant or refine further."
    )

### Remove HTML tags

In [56]:
def clean_html_tags(text: str) -> str:
    """
    Remove HTML tags from text while preserving content.
    Special case: preserves "<3" (heart symbol).
    """
    # Temporarily replace "<3" with a placeholder
    text_with_placeholder = text.replace("<3", "HEART_SYMBOL_PLACEHOLDER_XYZ")
    # Remove HTML tags
    cleaned_text = re.sub(r"<[^>]*>", "", text_with_placeholder)
    # Restore the heart symbols
    cleaned_text_final = cleaned_text.replace("HEART_SYMBOL_PLACEHOLDER_XYZ", "<3")
    return cleaned_text_final

# Test basic HTML tag removal
assert clean_html_tags("<p>Hello world</p>") == "Hello world"

# Test nested tags
assert clean_html_tags("<div><p>Nested content</p></div>") == "Nested content"

# Test with attributes
assert clean_html_tags('<a href="https://example.com">Link text</a>') == "Link text"

# Test with multiple tags and text
assert clean_html_tags("<h1>Title</h1><p>Paragraph</p>") == "TitleParagraph"

# Test with the special case of "<3" (heart symbol)
assert (
    clean_html_tags(
        "I LUVED IT SO MUCH <3 <br /><br />its about a women...<br /><br /> her<br /><br />"
    )
    == "I LUVED IT SO MUCH <3 its about a women... her"
)
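
To see why the `<3` placeholder is needed, the following sketch applies the bare tag pattern without it. `strip_tags_naive` is a hypothetical name used only for this comparison.

```python
import re

TAG_RE = re.compile(r"<[^>]*>")  # same tag pattern as in clean_html_tags


def strip_tags_naive(text: str) -> str:
    """Remove everything between '<' and the next '>' (no special cases)."""
    return TAG_RE.sub("", text)


sample = "Loved it <3 <br />so much"
# The pattern reads "<3 <br />" as one tag, so the heart is lost with it.
naive = strip_tags_naive(sample)
print(repr(naive))
# -> 'Loved it so much'
```

Because `[^>]*` also matches `<`, the heart and the following `<br />` are consumed as a single match, which is what the placeholder step prevents.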

## Programming Tasks (PT)

### PT1: Build a pipeline for cleaning the raw reviews.

In this pipeline you should be able to activate or deactivate the following options:

- (a) Tokenize based on spaces and punctuation or only spaces.
- (b) Convert everything to lower case or keep the capitalization as it is.
- (c) Remove punctuation.
- (d) Remove the terms that appear more often than a certain percentage. You should be able to vary this percentage (i.e., it should be a parameter).
- (e) Replace the numbers with a token (i.e., a special token `<NUM>`).

We will implement a function `pipeline_clean_review` that takes a raw text plus flags for these options.
Option (d) is a corpus-level operation and is handled after all documents have been tokenized.
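
One possible shape for such a function is sketched below. It is a simplified, self-contained stand-in rather than the reference solution: the parameter names and the inline regular expressions are assumptions, and option (d) is left out because it needs the whole corpus.

```python
import re
import string
from typing import Optional

_TOKEN_RE = re.compile(r"\w+|[^\w\s]")   # words OR single punctuation marks
_NUM_RE = re.compile(r"\d+(?:\.\d+)?")   # integers and simple decimals
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def pipeline_clean_review(
    text: str,
    *,
    tokenize_on_punct: bool = True,
    lowercase: bool = True,
    remove_punct: bool = True,
    number_token: Optional[str] = "<NUM>",
) -> list:
    """Apply options (a), (b), (c) and (e) to one raw review (sketch)."""
    if lowercase:                                  # (b)
        text = text.lower()
    # (a) split on whitespace only, or also emit punctuation as tokens
    tokens = _TOKEN_RE.findall(text) if tokenize_on_punct else text.split()

    cleaned = []
    for tok in tokens:
        if remove_punct:                           # (c)
            tok = tok.translate(_STRIP_PUNCT)
            if not tok:
                continue                           # token was only punctuation
        if number_token is not None and _NUM_RE.fullmatch(tok):
            tok = number_token                     # (e)
        cleaned.append(tok)
    return cleaned


print(pipeline_clean_review("A GREAT movie from 2023!! Cost: $10."))
# -> ['a', 'great', 'movie', 'from', '<NUM>', 'cost', '<NUM>']
```

Punctuation is stripped before the number check, so a token such as `$10` still becomes `<NUM>`, while the angle brackets of the replacement token itself are never exposed to the punctuation filter.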

# Data cleaning

In [43]:
_NUMBER_RE = re.compile(r"^\d+(?:\.\d+)?$")          # integers & simple floats
_PUNCT_SPLIT_RE = re.compile(r"\w+|[^\w\s]")         # words OR punctuation
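
A quick check of what the number pattern accepts may help when deciding which tokens end up as `<NUM>`. The sketch below redefines the pattern under the name `NUMBER_RE` so that it runs on its own; the sample tokens are illustrative.

```python
import re

NUMBER_RE = re.compile(r"^\d+(?:\.\d+)?$")  # same pattern as _NUMBER_RE

samples = ["42", "3.14", "2023", "1,000", "-5", "1e3", "10th", ".5"]
for tok in samples:
    verdict = "number" if NUMBER_RE.fullmatch(tok) else "kept as-is"
    print(f"{tok!r:>8} -> {verdict}")
```

Only plain integers and decimals with digits on both sides of the point match; thousands separators, signs, exponents, and ordinals such as `10th` are left unchanged.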


In [None]:
from typing import Literal


def tokenize(
    text: str,
    mode: Literal["space", "space_punct"] = "space_punct",
    split_hyphen: bool = True
) -> list[str]:
    """
    Tokenise *text* according to *mode* and optionally split on hyphens.

    Parameters
    ----------
    text : str
        Raw input string.
    mode : {"space", "space_punct"}, default="space_punct"
        ``"space"`` – classic ``str.split()`` (split on all whitespace).
        ``"space_punct"`` – split on whitespace **and** return punctuation as
        standalone tokens.
    split_hyphen : bool, default=True
        If True, split tokens on hyphens and drop the hyphens
        (e.g., 'a-composed-word' -> ['a', 'composed', 'word']).
        If False, keep hyphenated words as single tokens.
        This only has an effect in ``"space"`` mode: in ``"space_punct"`` mode
        the hyphen is punctuation and is always returned as a standalone token.

    Returns
    -------
    list[str]
        Sequence of tokens.

    Examples
    --------
    >>> tokenize("Hello, world!", mode="space")
    ['Hello,', 'world!']
    >>> tokenize("Hello, world!", mode="space_punct")
    ['Hello', ',', 'world', '!']
    >>> tokenize("a-composed-word", mode="space", split_hyphen=True)
    ['a', 'composed', 'word']
    >>> tokenize("a-composed-word", mode="space", split_hyphen=False)
    ['a-composed-word']
    >>> tokenize("a-composed-word", mode="space_punct")
    ['a', '-', 'composed', '-', 'word']
    """
    if mode == "space":
        tokens = text.split()
    elif mode == "space_punct":
        tokens = _PUNCT_SPLIT_RE.findall(text)
    else:
        raise ValueError(f"Unknown tokenisation mode: {mode}")

    if split_hyphen:
        split_tokens: list[str] = []
        for tok in tokens:
            if '-' in tok and len(tok) > 1:
                split_tokens.extend([t for t in tok.split('-') if t])
            else:
                split_tokens.append(tok)
        tokens = split_tokens
    return tokens

assert tokenize("Hi there", "space") == ["Hi", "there"]
assert tokenize("Hi, there!", "space_punct") == ["Hi", ",", "there", "!"]
sample_review_text = "This is a GREAT movie  from 2023!! Loved it <3. Cost: $10. What's up? It's well-done."
tokenize(sample_review_text)

# Hyphen splitting assertions: in "space" mode, split_hyphen decides
assert tokenize("a-composed-word", mode="space", split_hyphen=True) == ["a", "composed", "word"]
assert tokenize("a-composed-word", mode="space", split_hyphen=False) == ["a-composed-word"]
# In "space_punct" mode the hyphen is always its own token, whatever split_hyphen is
assert tokenize("well-done!", mode="space_punct", split_hyphen=True) == ["well", "-", "done", "!"]
assert tokenize("well-done!", mode="space_punct", split_hyphen=False) == ["well", "-", "done", "!"]

In [None]:
tokenize("a-composed-word", mode="space_punct", split_hyphen=True)

In [45]:
def clean_token(
    token: str,
    *,
    lowercase: bool,
    remove_punct: bool,
    replace_numbers: bool,
) -> str | None:
    """
    Clean a single token according to the options provided.

    Parameters
    ----------
    token : str
        The input token (a single word or symbol).
    lowercase : bool
        If True, convert the token to lower case (e.g., 'Hello' -> 'hello').
    remove_punct : bool
        If True, remove any punctuation characters from the token (e.g., 'hello!' -> 'hello').
        If the token becomes empty after punctuation removal, returns None.
    replace_numbers : bool
        If True, any token consisting only of digits (or a simple decimal number) is replaced
        with the string '<NUM>'.

    Returns
    -------
    str or None
        The cleaned token, or None if the token is entirely removed (e.g., only punctuation).

    Examples
    --------
    >>> clean_token("Hello,", lowercase=True, remove_punct=True, replace_numbers=False)
    'hello'
    >>> clean_token("123", lowercase=False, remove_punct=False, replace_numbers=True)
    '<NUM>'
    >>> clean_token("!!!", lowercase=False, remove_punct=True, replace_numbers=False)
    None
    >>> clean_token("Test", lowercase=False, remove_punct=False, replace_numbers=False)
    'Test'
    """
    if lowercase:
        token = token.lower()

    if remove_punct:
        token = token.translate(str.maketrans("", "", string.punctuation))
        if token == "":
            return None

    if replace_numbers and _NUMBER_RE.fullmatch(token):
        token = "<NUM>"

    return token

assert clean_token("Hello,", lowercase=True, remove_punct=True, replace_numbers=False) == "hello"
assert clean_token("!!,", lowercase=False, remove_punct=True, replace_numbers=False) is None
assert clean_token("42", lowercase=False, remove_punct=False, replace_numbers=True) == "<NUM>"


In [46]:
def apply_cleaning(
    tokens: list[str],
    *,
    lowercase: bool,
    remove_punct: bool,
    replace_numbers: bool,
) -> list[str]:
    """List-level wrapper around :func:`clean_token`; tokens that are removed entirely are dropped.

    Example
    -------
    >>> apply_cleaning(["Hello,", "World!", "123"], lowercase=True,
    ...               remove_punct=True, replace_numbers=True)
    ['hello', 'world', '<NUM>']
    """
    cleaned: list[str] = []
    for tok in tokens:
        new_tok = clean_token(
            tok,
            lowercase=lowercase,
            remove_punct=remove_punct,
            replace_numbers=replace_numbers,
        )
        if new_tok:
            cleaned.append(new_tok)
    return cleaned


assert apply_cleaning([
    "Hello,", "World!", "123"
], lowercase=True, remove_punct=True, replace_numbers=True) == [
    "hello", "world", "<NUM>"
]

assert apply_cleaning([
    "Hello,", "!", "123"
], lowercase=True, remove_punct=True, replace_numbers=False) == [
    "hello", "123"
]


assert apply_cleaning([
    "Hello,", "!", "123"
], lowercase=False, remove_punct=False, replace_numbers=False) == [
'Hello,', '!', '123'
]






In [47]:
from typing import Sequence

def remove_high_df_tokens(
    docs: list[list[str]],
    max_df: float,
) -> list[list[str]]:
    """
    Remove tokens that appear in more than a given proportion of documents.

    Document frequency (DF) is defined as the number of documents in which a token appears
    at least once, divided by the total number of documents. This method removes any token
    whose document frequency is greater than `max_df`.

    Parameters
    ----------
    docs : list[list[str]]
        Corpus represented as a list of tokenized documents (each document is a list of tokens).
    max_df : float
        Upper threshold for document frequency (0 < max_df < 1). For example,
        max_df=0.8 will remove any token appearing in more than 80% of the documents.

    Returns
    -------
    cleaned_docs : list[list[str]]
        Corpus with high-DF tokens removed from each document.

    Example
    -------
    'the' and 'movie' appear in all 3 documents (DF = 1.0 > 0.7), so both are removed:

    >>> docs = [
    ...   ["good", "movie", "the"],
    ...   ["the", "bad", "movie"],
    ...   ["the", "movie", "average"]
    ... ]
    >>> remove_high_df_tokens(docs, max_df=0.7)
    [['good'], ['bad'], ['average']]
    """
    if not 0 < max_df < 1:
        raise ValueError("max_df must be between 0 and 1 (exclusive)")

    num_docs = len(docs)
    # Count in how many documents each token appears (DF)
    df_counter = Counter()
    for doc in docs:
        df_counter.update(set(doc))  # set(doc) ensures each token counted once per doc

    # Find tokens exceeding the DF threshold
    high_df_tokens = {
        token for token, df in df_counter.items()
        if df / num_docs > max_df
    }

    # Remove high-DF tokens from each document
    cleaned_docs = [
        [tok for tok in doc if tok not in high_df_tokens]
        for doc in docs
    ]

    return cleaned_docs

docs = [
    ["good", "movie", "the"],
    ["the", "bad", "movie"],
    ["the", "movie", "average"]
]
out = remove_high_df_tokens(docs, max_df=0.7)
assert out == [["good"], ["bad"], ["average"]]
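
To make the threshold concrete, the document frequencies of the toy corpus can be listed explicitly. This sketch recomputes them with `collections.Counter`, independently of `remove_high_df_tokens`; the variable names are illustrative.

```python
from collections import Counter

docs = [
    ["good", "movie", "the"],
    ["the", "bad", "movie"],
    ["the", "movie", "average"],
]

# Count each token once per document, then normalise by the corpus size.
df_counts = Counter(tok for doc in docs for tok in set(doc))
df = {tok: count / len(docs) for tok, count in df_counts.items()}

# Highest document frequency first, ties in alphabetical order.
for tok, value in sorted(df.items(), key=lambda kv: (-kv[1], kv[0])):
    flag = "removed" if value > 0.7 else "kept"
    print(f"{tok:<8} DF={value:.2f} -> {flag} at max_df=0.7")
```

Because the comparison is strict (`>`), a token whose document frequency equals `max_df` exactly is kept.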

In [None]:
sample_review_text = "This is a GREAT <br>movie</br>  from 2023!! Loved it <3. Cost: $10. What's up? It's well-done."

assert tokenize(sample_review_text) == [
    "This",
    "is",
    "a",
    "GREAT",
    "<",
    "br",
    ">",
    "movie",
    "<",
    "/",
    "br",
    ">",
    "from",
    "2023",
    "!",
    "!",
    "Loved",
    "it",
    "<",
    "3",
    ".",
    "Cost",
    ":",
    "$",
    "10",
    ".",
    "What",
    "'",
    "s",
    "up",
    "?",
    "It",
    "'",
    "s",
    "well",
    "-",
    "done",
    ".",
]

# V2