# Stage Direction Modifier for XML Plays

This notebook processes XML files containing plays to modify the `<stage>` elements based on the speaker and context. The goal is to ensure that each stage direction is appropriately annotated with the verb indicating speech, such as "sagt" (for singular speakers) or "sagen" (for plural speakers), based on the surrounding context.

### Key Features:
1. **Finite Verb Detection**: The script detects finite verbs within the `<stage>` direction using spaCy's part-of-speech (POS) tagging.
2. **Verb Insertion**: If a stage direction contains no finite verb, the script prepends "sagt" or "sagen". If it contains exactly one finite verb, the speech verb is prepended and linked with "und" (for example, "lacht" becomes "sagt und lacht"). If it contains several finite verbs, they are moved to the front and joined with "und".
3. **Plurality Handling**: The script checks if the speaker is plural (based on the `who` attribute) and adjusts the verb to "sagen" when needed.
4. **Proper Sentence Handling**: The algorithm avoids modifying stage directions that already start with uppercase letters (typically indicating proper sentences).
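
The plurality check in item 3 can be illustrated without spaCy. The following is a minimal sketch, assuming TEI-style `who` attributes in which several whitespace-separated IDs mark a group of speakers. The `<sp>` snippets and the helper name `speech_verb` are made up for illustration.

```python
import xml.etree.ElementTree as ET

def is_plural_speaker(sp):
    """Treat an <sp> as plural when 'who' lists more than one ID."""
    return len(sp.attrib.get("who", "").split()) > 1

def speech_verb(sp):
    """Hypothetical helper: pick the speech verb that agrees with the speaker."""
    return "sagen" if is_plural_speaker(sp) else "sagt"

# Invented examples: one speaker ID versus two speaker IDs
single = ET.fromstring('<sp who="#faust"><speaker>Faust</speaker></sp>')
group = ET.fromstring('<sp who="#faust #wagner"><speaker>Beide</speaker></sp>')

print(speech_verb(single))  # sagt
print(speech_verb(group))   # sagen
```

Note that an `<sp>` without a `who` attribute is treated as singular, because the empty string splits into zero IDs.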

### How it Works:
- The script iterates over each `<sp>` element in the XML file, looks up its `<stage>` child and its first line of speech (the first `<l>`, or `<p>` if there are no verse lines), and skips the element if either is missing or empty.
- Otherwise it strips trailing punctuation from the stage direction and rewrites it so that it contains a speech act verb ("sagt" or "sagen"), depending on how many finite verbs it already has.
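
The rewriting rule can be sketched independently of spaCy. The example below is a simplified illustration, not the notebook's implementation: it replaces POS tagging with a small hand-written set of finite verbs (`FINITE_VERBS`, an assumption made only so that the snippet runs without a language model) and covers the skip, no-verb, and one-verb cases. The multi-verb case is left out.

```python
# Hypothetical stand-in for spaCy's finite-verb detection.
FINITE_VERBS = {"lacht", "geht", "setzt", "nimmt"}

def rewrite_stage(text, plural=False):
    """Return the rewritten stage direction, or the input unchanged if skipped."""
    original = text.strip()
    if not original or original[0].isupper():
        return text  # empty, or already a full sentence: leave untouched
    content = original.rstrip(".!?").strip()
    verb = "sagen" if plural else "sagt"
    has_finite_verb = any(word in FINITE_VERBS for word in content.split())
    if has_finite_verb:
        # One finite verb: link the speech verb with "und"
        return f"{verb} und {content}"
    # No finite verb: prepend the speech verb and close with a period
    return f"{verb} {content}."

print(rewrite_stage("leise."))               # sagt leise.
print(rewrite_stage("lacht."))               # sagt und lacht
print(rewrite_stage("leise.", plural=True))  # sagen leise.
print(rewrite_stage("Er geht ab."))          # Er geht ab.
```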

This notebook helps automate the process of transforming and annotating stage directions in narrative plays, making them more suitable for further linguistic processing or analysis.

### Requirements:
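- **spaCy** (with the German model `de_core_news_md`): For part-of-speech tagging and verb detection.
- **lxml**: For XML parsing in the extraction step (the rewriting step uses the standard library's `xml.etree.ElementTree`).
- **pandas** and **openpyxl**: For writing the extracted data to `.xlsx` files.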

Make sure to install the necessary dependencies before running the notebook:




In [13]:
#bash
#!pip install spacy lxml
#!pip install pandas openpyxl
#!python -m spacy download de_core_news_md

In [48]:
import os
import xml.etree.ElementTree as ET
import spacy
import re

# Load spaCy German model
nlp = spacy.load("de_core_news_md")  # or "de_core_news_lg"

In [49]:
def process_folder(input_dir, output_dir):
    """
    Process a folder of XML files, modifying the stage elements based on the logic.
    """
    os.makedirs(output_dir, exist_ok=True)

    for filename in os.listdir(input_dir):
        if filename.endswith(".xml"):
            file_path = os.path.join(input_dir, filename)
            try:
                tree = ET.parse(file_path)
                strip_namespace(tree)
                modify_stage_elements(tree)

                output_path = os.path.join(output_dir, filename)
                tree.write(output_path, encoding='utf-8', xml_declaration=True)
                print(f"Processed: {filename}")
            except ET.ParseError as e:
                print(f"Error parsing {filename}: {e}")


In [50]:


# Helper functions to check POS tags

def is_action_like(text):
    """
    Check if the stage direction starts with an action verb (finite verb).
    """
    doc = nlp(text)
    if not doc:
        return False

    # Heuristic: the direction opens with a verb, optionally followed by a
    # pronoun or determiner (e.g. 'küßt ihn'). Only the first token decides.
    return doc[0].pos_ == "VERB"

def is_tone_modifier(text):
    """
    Check if the stage direction starts with a tone modifier (adjective or adverb).
    """
    doc = nlp(text)
    if not doc:
        return False
    return doc[0].pos_ in {"ADJ", "ADV"}  # e.g., "leise", "ekstatisch"

def is_plural_speaker(sp_element):
    """
    Heuristic to check if the speaker refers to multiple characters (plural).
    """
    who = sp_element.attrib.get('who', '')
    return len(who.strip().split()) > 1  # multiple IDs means plural

def strip_namespace(tree):
    """
    Strip XML namespaces to make processing easier.
    """
    for elem in tree.iter():
        if '}' in elem.tag:
            elem.tag = elem.tag.split('}', 1)[1]
    return tree
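
TEI files usually declare the namespace `http://www.tei-c.org/ns/1.0`, which ElementTree folds into every tag name as `{namespace}tag`. The following small sketch shows why `strip_namespace` is needed before calling `root.iter('sp')`. The XML snippet is invented for illustration.

```python
import xml.etree.ElementTree as ET

def strip_namespace(tree):
    """Drop the '{namespace}' prefix from every element tag."""
    for elem in tree.iter():
        if "}" in elem.tag:
            elem.tag = elem.tag.split("}", 1)[1]
    return tree

xml = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0">'
    '<sp who="#a"><speaker>A</speaker><stage>leise</stage><p>Ja.</p></sp>'
    "</TEI>"
)
root = ET.fromstring(xml)

before = len(list(root.iter("sp")))  # namespaced tags do not match 'sp'
strip_namespace(root)                # Element.iter() works like ElementTree.iter()
after = len(list(root.iter("sp")))

print(before, after)  # 0 1
```

One consequence of this approach is that files written back out no longer carry the TEI namespace declaration.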


In [51]:
# Generalized approach: decide from the finite verbs found in the stage direction

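def is_finite_verb(token):
    """
    Check if the token is a finite verb (STTS tags VVFIN, VAFIN, VMFIN).
    """
    # Test the fine-grained tag only: spaCy's German models usually assign
    # pos_ == 'AUX' to auxiliaries and modals, so also requiring
    # pos_ == 'VERB' would miss VAFIN (and VMFIN) tokens.
    return token.tag_ in {'VVFIN', 'VAFIN', 'VMFIN'}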

def is_plural_speaker(sp):
    """
    Check if the speaker is plural by looking at the 'who' attribute.
    """
    return len(sp.attrib.get('who', '').split()) > 1

def modify_stage_elements(tree):
    """
    Modify the <stage> elements based on the POS-tagging logic.
    """
    root = tree.getroot()

    for sp in root.iter('sp'):
        # Get the <stage> element and the first line of speech:
        # the first <l> (verse) if present, otherwise <p> (prose)
        stage = sp.find('stage')
        paragraph = sp.find('l')
        if paragraph is None:
            paragraph = sp.find('p')

        # Skip if the stage element is missing or has no text content
        if stage is None or stage.text is None or not stage.text.strip():
            #print(f"Warning: Missing or empty <stage> in <sp> with speaker {sp.find('speaker').text if sp.find('speaker') is not None else 'Unknown'}")
            continue  # Skip to the next <sp>

        # Skip if the first <l>/<p> is missing or has no text content
        if paragraph is None or paragraph.text is None or not paragraph.text.strip():
            #print(f"Warning: Missing or empty <l>/<p> in <sp> with speaker {sp.find('speaker').text if sp.find('speaker') is not None else 'Unknown'}")
            continue  # Skip to the next <sp>

        # Safe processing of <stage> content after confirming it exists
        original = stage.text.strip()

        # Skip the stage direction if it starts with an uppercase letter (likely already a proper sentence)
        if original[0].isupper():
            continue  # Skip further processing if stage direction starts with uppercase

        # Remove trailing punctuation like '.' or '!' to make processing easier
        content = original.rstrip('.!?').strip()

        # Parse the stage direction text with spaCy to check for multiple verbs
        doc = nlp(content)
        verbs = [token for token in doc if is_finite_verb(token)]

        # Choose the speech verb that agrees with the speaker
        verb = "sagen" if is_plural_speaker(sp) else "sagt"

        if not verbs:
            # No finite verb: prepend the speech verb and close with a period
            stage.text = f"{verb} {content}."
        elif len(verbs) == 1:
            # One finite verb: prepend the speech verb and link it with "und"
            stage.text = f"{verb} und {content}"
        else:
            # Several finite verbs: move them to the front, joined by "und",
            # followed by the remaining tokens. No speech verb is added here.
            verbs_str = " und ".join(token.text for token in verbs)
            remaining_content = " ".join(token.text for token in doc if token not in verbs)
            stage.text = f"{verbs_str} {remaining_content}"



In [52]:
# Example usage
input_directory = "gdc-tei/all"
output_directory = "./manipulated_texts_generalized_approach"

process_folder(input_directory, output_directory)

Processed: benedix-unerschuetterlich.xml
Processed: fouque-sigurd-der-schlangentoedter.xml
Processed: anzengruber-heimgfunden.xml
Processed: anzengruber-der-gwissenswurm.xml
Processed: kotzebue-die-deutschen-kleinstaedter.xml
Processed: scheerbart-ruebezahl.xml
Processed: nestroy-freiheit-in-kraehwinkel.xml
Processed: grillparzer-des-meeres-und-der-liebe-wellen.xml
Processed: beer-struensee.xml
Processed: andre-der-comoedienfeind.xml
Processed: zedlitz-liebe-findet-ihre-wege.xml
Processed: richter-eumenides-duester.xml
Processed: ball-die-nase-des-michelangelo.xml
Processed: kotzebue-menschenhass-und-reue.xml
Processed: ruederer-die-fahnenweihe.xml
Processed: kurz-prinzessin-pumphia.xml
Processed: wedekind-fritz-schwigerling.xml
Processed: babo-das-winterquartier-in-amerika.xml
Processed: klingemann-faust.xml
Processed: auenbrugger-der-rauchfangkehrer.xml
Processed: wagner-tannhaeuser.xml
Processed: hafner-der-furchtsame.xml
Processed: braun-von-braunthal-das-nachtlager-von-granada.xml

In [None]:
# Extract the stage directions for manual annotation or further analysis.

In [55]:
import pandas as pd
from lxml import etree

def extract_sp_stage_paragraph(xml_tree):
    """
    Extract speaker, stage direction, and paragraph from each <sp> element in the XML.

    Returns:
        List of dictionaries with keys: 'speaker', 'stage', 'paragraph'
    """
    results = []
    root = xml_tree.getroot()

    for sp in root.iter('sp'):
        speaker_el = sp.find('speaker')
        stage_el = sp.find('stage')
        # First <l> (verse) if present, otherwise <p> (prose)
        paragraph_el = sp.find('l')
        if paragraph_el is None:
            paragraph_el = sp.find('p')

        speaker = speaker_el.text.strip() if speaker_el is not None and speaker_el.text else ''
        stage = stage_el.text.strip() if stage_el is not None and stage_el.text else ''
        paragraph = paragraph_el.text.strip() if paragraph_el is not None and paragraph_el.text else ''

        if speaker and stage and paragraph:
            results.append({
                'speaker': speaker,
                'stage': stage,
                'paragraph': paragraph
            })

    return results

def save_to_excel(data, output_path='extracted_stage-dialogue-data.xlsx'):
    """
    Save extracted data to an Excel file (.xlsx).

    pandas needs an Excel writer engine such as openpyxl for this.
    """
    if not data:
        print("No data to write.")
        return

    df = pd.DataFrame(data)
    df.to_excel(output_path, index=False)  # No need for encoding param

    print(f"Saved {len(data)} entries to {output_path}.")
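
The extraction keeps only `<sp>` elements that have a speaker, a stage direction, and a first line of speech. The following is a compact sketch of that filter, using the standard library's `xml.etree.ElementTree` (its `find` and `iter` calls behave like the lxml ones used above) and an invented snippet. The names `text_of` and `extract_rows` are hypothetical.

```python
import xml.etree.ElementTree as ET

def text_of(element):
    """Return the element's stripped leading text, or '' if absent."""
    return element.text.strip() if element is not None and element.text else ""

def extract_rows(root):
    rows = []
    for sp in root.iter("sp"):
        first_line = sp.find("l")
        if first_line is None:  # prose speech: fall back to <p>
            first_line = sp.find("p")
        row = {
            "speaker": text_of(sp.find("speaker")),
            "stage": text_of(sp.find("stage")),
            "paragraph": text_of(first_line),
        }
        if all(row.values()):  # drop rows with any empty field
            rows.append(row)
    return rows

root = ET.fromstring(
    "<body>"
    "<sp><speaker>Anna</speaker><stage>sagt leise.</stage><p>Komm!</p></sp>"
    "<sp><speaker>Karl</speaker><p>Nein.</p></sp>"
    "</body>"
)
rows = extract_rows(root)
print(rows)
# [{'speaker': 'Anna', 'stage': 'sagt leise.', 'paragraph': 'Komm!'}]
```

The second `<sp>` has no `<stage>` and is therefore dropped, which is why the saved entry counts can be lower than the number of speeches in a play.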


In [56]:
input_dir = "./gdc-tei/all"
texts = [filename[:-4] for filename in os.listdir(input_dir) if filename.endswith(".xml")]
for text in texts:
    print(text)
    tree = etree.parse(f'./manipulated_texts_generalized_approach/{text}.xml')
    data = extract_sp_stage_paragraph(tree)
    save_to_excel(data, output_path=f"./manipulated_texts_generalized_approach/{text}.xlsx")



benedix-unerschuetterlich
Saved 34 entries to ./manipulated_texts_generalized_approach/benedix-unerschuetterlich.xlsx.
fouque-sigurd-der-schlangentoedter
Saved 66 entries to ./manipulated_texts_generalized_approach/fouque-sigurd-der-schlangentoedter.xlsx.
anzengruber-heimgfunden
Saved 294 entries to ./manipulated_texts_generalized_approach/anzengruber-heimgfunden.xlsx.
anzengruber-der-gwissenswurm
Saved 174 entries to ./manipulated_texts_generalized_approach/anzengruber-der-gwissenswurm.xlsx.
kotzebue-die-deutschen-kleinstaedter
Saved 162 entries to ./manipulated_texts_generalized_approach/kotzebue-die-deutschen-kleinstaedter.xlsx.
scheerbart-ruebezahl
Saved 166 entries to ./manipulated_texts_generalized_approach/scheerbart-ruebezahl.xlsx.
nestroy-freiheit-in-kraehwinkel
Saved 209 entries to ./manipulated_texts_generalized_approach/nestroy-freiheit-in-kraehwinkel.xlsx.
grillparzer-des-meeres-und-der-liebe-wellen
Saved 133 entries to ./manipulated_texts_generalized_approach/grillparzer-