# 02g – 8. Experiment: LLM Postprocessing v2: Qwen3-8B with a Refined Pipeline

## Motivation

`02f` showed that Qwen3-8B yields the marginally best improvement (Δ WER −0.0006).
This notebook refines the postprocessing pipeline for Qwen3-8B with several
improvements over `02f` that were identified from the results produced there. The code comments in this notebook focus on the changes relative to the code in `02f`:

| Component | `02f` | `02g` (this version) |
|-----------|-------|----------------------|
| SEP token | `<<<SEP>>>` | `\|\|\|SEP\|\|\|` (more robust against LLM escaping) |
| SEP parsing | plain `split(SEP)` | regex that normalizes separator variants |
| First-letter rule | always enforced | only for words with ≥4 characters |
| Insertions | always forbidden | allowed for function words (a, the, is, ...) |
| Deletions | always forbidden | allowed for filler words (um, uh, erm, ...) |
| Fallback on format failure | return the original | line-by-line fallback |
| Unicode normalization | none | `unicodedata.normalize('NFKC')` |
| WER guard | enabled | disabled (block_size=4 is more conservative) |
| block_size | 8 | 4 (less context per call, more precise corrections) |

## Result (Preview)

E42 worsens the results: WER +0.028, Joint Error +0.014.
The refined pipeline does not help in this configuration.

**Note on the bugfix:** This run was carried out **before the bugfix** in `segmentation.py` (`min_duration_off` incorrectly read the value of `min_duration_on`). This is **intentional**: the bug was only discovered after the LLM and hyperparameter experiments had been completed. Because the bugfix alone initially worsened the WER, the combination of bugfix + `min_duration` optimization that ultimately produced the best result was only worked out in `02j_`/`02k_`.

## 1 – GPU Check & Selection

In [1]:
!nvidia-smi

Fri Jan 30 10:17:24 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA A100-SXM4-80GB          On  |   00000000:01:00.0 Off |                    0 |
| N/A   42C    P0            164W /  500W |   52989MiB /  81920MiB |     66%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB          On  |   00

In [2]:
import os

# Physical GPU selection: GPU 2 here (see nvidia-smi)
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "2"  # Adjust depending on availability

## 2 – CUDA Verification

In [3]:
import torch

In [4]:
print("CUDA available:", torch.cuda.is_available())
print("CUDA devices:", torch.cuda.device_count())
if torch.cuda.is_available():
    print("Device 0 name:", torch.cuda.get_device_name(0))
    print("Memory allocated:", torch.cuda.memory_allocated(0) / 1024**3, "GB")

CUDA available: True
CUDA devices: 1
Device 0 name: NVIDIA A100-SXM4-80GB
Memory allocated: 0.0 GB


## 3 – Setup: Imports & Working Directory

In [5]:
import os, sys, re, gc, shutil, subprocess
from pathlib import Path
import pandas as pd
import torch
import webvtt
import difflib
import unicodedata

project_baseline_path = "/home/josch080/Projektgruppe/mcorec_baseline"
os.chdir(project_baseline_path)

from script.pg_utils_experiments import append_eval_results_for_experiments


## 4 – Configuration

In [6]:
BASE_DATA_DIR = Path("data-bin/dev_without_central_videos/dev")  # path shown in the recorded cell output

SESSION_IDS = ["session_40", "session_43", "session_49", "session_50", "session_54"]

INPUT_PREFIX = "output_E09_bs12_len20" # best BL4 configuration as input

RESULTS_CSV = "results_dev_subset_by_session.csv"

In [7]:
print("CWD:", os.getcwd())
print("BASE_DATA_DIR:", BASE_DATA_DIR)
print("INPUT_PREFIX:", INPUT_PREFIX)
print("Sessions:", SESSION_IDS)

CWD: /home/josch080/Projektgruppe/mcorec_baseline
BASE_DATA_DIR: data-bin/dev_without_central_videos/dev
INPUT_PREFIX: output_E09_bs12_len20
Sessions: ['session_40', 'session_43', 'session_49', 'session_50', 'session_54']


## 5 – LLM Configuration & Experiment

In [8]:
LLM_CONFIGS = {
    "qwen3_8b": {
        "model_name": "Qwen/Qwen3-8B",
        "dtype": "bfloat16",
        "description": "Qwen3 8B"
    }
}

In [9]:
# LLM experiments

EXPERIMENTS = {
    "E42_qwen3_8b_v2": {
        "llm_model": "qwen3_8b",
        "description": "Qwen3 8B"
    },
}
print("LLM experiments:", list(EXPERIMENTS.keys()))

LLM experiments: ['E42_qwen3_8b_v2']


## 6 – Loading / Unloading the LLM

Identical to `02f`: lazy loading with a cache.

In [10]:
from transformers import AutoTokenizer, AutoModelForCausalLM

loaded_models = {}
loaded_tokenizers = {}

def _dtype_from_cfg(dtype_str: str):
    s = (dtype_str or "auto").lower()
    if s == "bfloat16":
        return torch.bfloat16
    if s == "float16":
        return torch.float16
    return "auto"

def load_llm(llm_key: str):
    if llm_key in loaded_models:
        return loaded_models[llm_key], loaded_tokenizers[llm_key]

    cfg = LLM_CONFIGS[llm_key]
    model_name = cfg["model_name"]
    dtype = _dtype_from_cfg(cfg.get("dtype", "auto"))

    print(f"\nLoading LLM: {llm_key} -> {model_name} (dtype={cfg.get('dtype')})")

    tok = AutoTokenizer.from_pretrained(model_name, use_fast=True, trust_remote_code=True)
    if tok.pad_token_id is None:
        tok.pad_token = tok.eos_token

    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=dtype,
        device_map="auto",
        trust_remote_code=True
    )
    model.eval()

    loaded_models[llm_key] = model
    loaded_tokenizers[llm_key] = tok
    return model, tok

def unload_llm(llm_key: str):
    # Drop the cached references first, then collect garbage
    # before releasing the cached CUDA memory
    loaded_models.pop(llm_key, None)
    loaded_tokenizers.pop(llm_key, None)
    gc.collect()
    torch.cuda.empty_cache()


## 7 – Text Helper Functions & SEP Token

**Changes compared to `02f`:**
- SEP token switched from `<<<SEP>>>` to `|||SEP|||`
  → some LLMs interpret `<<<` and `>>>` as HTML/template syntax
  and escape or alter them; `|||` is more neutral
- New `SEP_SPLIT_RE` regex: normalizes separator variants emitted by the LLM
  (e.g. `|| SEP ||`, `|||SEPARATOR|||`, `|| - SEP - ||`) back to the canonical token
- `ALLOWED_INS`/`ALLOWED_DEL`: word lists for permitted insertions (function words)
  and deletions (filler words) in the token guard
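
The separator normalization described above can be tried in isolation. The following is a minimal sketch, not the notebook's exact pattern: `SEP_VARIANT_RE` and `split_on_sep` are illustrative names, and the regex is a simplified variant that requires at least two leading pipes and treats the `ARATOR` suffix as optional.

```python
import re

SEP = "|||SEP|||"

# Simplified, assumed variant of the separator pattern (not the notebook's regex):
# two or more pipes, optional decoration, SEP or SEPARATOR, optional decoration, pipes.
SEP_VARIANT_RE = re.compile(
    r"(?:\|\s*){2,}[.\-_\s]*"
    r"S\s*E\s*P(?:\s*A\s*R\s*A\s*T\s*O\s*R)?"
    r"[.\-_\s]*(?:\|\s*)*",
    flags=re.IGNORECASE,
)

def split_on_sep(text: str) -> list[str]:
    """Normalize separator variants, split, and drop empty parts at the edges."""
    parts = [p.strip() for p in SEP_VARIANT_RE.sub(SEP, text).split(SEP)]
    while parts and parts[0] == "":
        parts.pop(0)
    while parts and parts[-1] == "":
        parts.pop()
    return parts

print(split_on_sep("YEAH||| SEP |||OKAY|||SEPARATOR|||RIGHT || - sep - || FINE"))
# A bare '|||' without the SEP keyword is not a separator and stays in the text
print(split_on_sep("A|||B"))
```

A bare `|||` is deliberately left untouched in this sketch, so a model output that drops the `SEP` keyword still yields too few parts and is reported as a format failure.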

In [11]:
THINK_RE = re.compile(r"<think>.*?</think>", flags=re.DOTALL | re.IGNORECASE)
SPEAKER_TAG_RE = re.compile(r"^\s*(Human|Assistant|System|User)\s*:\s*", flags=re.IGNORECASE)

def strip_thinking(text: str) -> str:
    # Removes <think> blocks (reasoning trace of Qwen3/DeepSeek-R1)
    if not text:
        return text
    text = THINK_RE.sub("", text)
    text = re.sub(r"</?think>\s*", "", text, flags=re.IGNORECASE)
    return text.strip()

def clean_caption_text(t: str) -> str:
    # Cleans caption text: BOM, speaker tags, repeated whitespace
    if t is None:
        return ""
    t = t.replace("\ufeff", "").strip()
    t = SPEAKER_TAG_RE.sub("", t).strip()
    t = " ".join(t.split())
    return t

# New SEP token: '|||SEP|||' instead of '<<<SEP>>>' from 02f
# Reason: Qwen3 sometimes treats '<<<' / '>>>' as template syntax and escapes them
SEP = "|||SEP|||"

# Regex for separator variants: normalizes deviating LLM output to the canonical SEP
# Matches e.g.: '|| SEP ||', '|||SEPARATOR|||', '|| . SEP . ||', '|| - SEP - ||'
# (at least two leading pipes are required)
SEP_SPLIT_RE = re.compile(
    r"(?:\|\s*){2,}\s*[\.\-_]*\s*_?\s*"
    # The optional 'ARATOR' suffix is tried first, so 'SEPARATOR' is consumed as a whole
    r"S\s*E\s*P(?:\s*A\s*R\s*A\s*T\s*O\s*R)?\s*_?\s*"
    r"[\.\-_'\"`]*\s*(?:\s*\|)*",
    flags=re.IGNORECASE
)

# Permitted insertions: short function words that the ASR frequently omits
ALLOWED_INS = {
    "a","an","the","to","of","in","on","at","for","with","from","by","as",
    "and","or","but","is","are","was","were","am","be","been","do","does","did",
    "have","has","had"
}

# Permitted deletions: filler words that the ASR frequently transcribes incorrectly
ALLOWED_DEL = {"um","uh","erm","hmm"}

def build_chat_prompt(tokenizer, system_msg: str, user_msg: str) -> str:
    messages = [
        {"role": "system", "content": system_msg},
        {"role": "user", "content": user_msg},
    ]
    try:
        return tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True,
            enable_thinking=False,
        )
    except TypeError:
        return tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)


## 8 – System Prompts (2-Stage)

**Changes compared to `02f`:**
- SEP token updated to `|||SEP|||`
- Explicit instructions `plain ASCII characters only` and `do NOT join words`
  to counter Qwen3's tendency to run words together or insert Unicode characters
- The relaxed prompt now explicitly allows deleting filler words

In [12]:
# f-strings: SEP is embedded directly into the prompts → consistent with the runtime variable
SYSTEM_PROMPT_STRICT = (
    f"You are an ASR transcript post-processor for subtitle LINES.\n"
    f"Input lines are separated by the token {SEP}.\n"
    f"Output MUST contain the EXACT same number of lines separated ONLY by {SEP}.\n"
    f"Output ONLY the corrected lines (no explanations, no extra text).\n"
    f"Do NOT reorder lines.\n"
    f"Preserve casing style per line (ALL CAPS stays ALL CAPS).\n"
    f"Preserve punctuation style (do not add new punctuation).\n"
    f"IMPORTANT: keep normal spaces between words; do NOT join words; do NOT use underscores.\n"
    f"Use plain ASCII characters only.\n" # discourages Unicode special characters in the output
    f"STRICT: Do NOT add/remove words. Only replace when very confident.\n"
)

SYSTEM_PROMPT_RELAXED = (
    f"You are an ASR transcript post-processor for subtitle LINES.\n"
    f"Input lines are separated by the token {SEP}.\n"
    f"Output MUST contain the EXACT same number of lines separated ONLY by {SEP}.\n"
    f"Output ONLY the corrected lines (no explanations, no extra text).\n"
    f"Do NOT reorder lines. Preserve casing and punctuation style.\n"
    f"IMPORTANT: keep normal spaces between words; do NOT join words; do NOT use underscores.\n"
    f"Use plain ASCII characters only.\n"
    f"RELAXED: you may delete standalone fillers (UM/UH/ERM/HMM) and fix obvious ASR errors.\n"
)

## 9 – Token Guard v2

**Changes compared to `02f`:**
- **Softer first-letter rule:** enforced only when both the original and the replacement word have ≥4 characters.
  Short words (1–3 characters) may now be replaced by a word starting with a different letter,
  e.g. `'in' → 'on'`
- **Permitted insertions:** function words from `ALLOWED_INS`
  may be inserted
- **Permitted deletions:** filler words from `ALLOWED_DEL`
  may be deleted
- `ins_del_used`: shared counter for insertions and deletions; with `max_ins_del=2`,
  at most 2 such structural changes in total are accepted per line
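
To make the shared insertion/deletion budget concrete, here is a simplified sketch of the alignment idea using `difflib.SequenceMatcher` on word lists. `guard_sketch`, `FUNCTION_WORDS` and `FILLERS` are illustrative names with shortened word lists, and the per-word similarity check for replacements is left out on purpose, so this is not the notebook's guard itself.

```python
import difflib

FUNCTION_WORDS = {"a", "an", "the", "to", "of", "is"}  # illustrative subset
FILLERS = {"um", "uh", "erm", "hmm"}

def guard_sketch(orig: str, cand: str, max_ins_del: int = 2) -> str:
    """Accept a candidate line only where its edits fit the guard policy."""
    o, c = orig.split(), cand.split()
    out, used = [], 0  # 'used' is the shared insert/delete budget for this line
    for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(a=o, b=c).get_opcodes():
        if tag == "equal":
            out.extend(o[i1:i2])
        elif tag == "replace":
            # Simplification: same-length replacements are accepted unchecked,
            # unequal-length replacements fall back to the original words.
            same_len = (i2 - i1) == (j2 - j1)
            out.extend(c[j1:j2] if same_len else o[i1:i2])
        elif tag == "delete":
            for w in o[i1:i2]:
                if used < max_ins_del and w.lower() in FILLERS:
                    used += 1       # filler dropped, budget consumed
                else:
                    out.append(w)   # deletion rejected, original word kept
        elif tag == "insert":
            for w in c[j1:j2]:
                if used < max_ins_del and w.lower() in FUNCTION_WORDS:
                    used += 1
                    out.append(w)   # function word accepted
    return " ".join(out)

print(guard_sketch("UM I WENT STORE", "I WENT TO THE STORE"))
print(guard_sketch("HE WENT HOME", "HE REALLY WENT HOME"))
print(guard_sketch("I DO NOT KNOW", "I DO KNOW"))
```

In the first call the deleted filler and the inserted `TO` use up the budget of two, so the further insertion `THE` is dropped even though it is a function word. The other two calls return the original line because `REALLY` is not a function word and `NOT` is not a filler.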

In [13]:
def char_sim(a: str, b: str) -> float:
    # Character-level similarity (0 = completely different, 1 = identical)
    return difflib.SequenceMatcher(a=a.lower(), b=b.lower()).ratio()

def word_ok(old: str, new: str, max_extra_chars: int, min_char_sim: float) -> bool:
    # Checks whether the replacement old→new is acceptable
    if old == new:
        return True
    if len(new) > len(old) + max_extra_chars:
        return False

    # First-letter rule: enforced ONLY when both words are longer (≥4 characters).
    # In 02f it was active for all words, which blocked sensible corrections
    # of short words such as 'in'→'on'
    if len(old) >= 4 and len(new) >= 4 and old[0].isalpha() and new[0].isalpha():
        if old[0].lower() != new[0].lower():
            return False
    if char_sim(old, new) < min_char_sim:
        return False
    return True


def token_guard_line(orig: str, cand: str,
                     max_extra_chars: int = 6,
                     min_char_sim: float = 0.20,
                     max_ins_del: int = 2) -> str:

    # Token guard with permitted insertions (function words) and deletions (filler words)
    o = orig.split()
    c = cand.split()
    sm = difflib.SequenceMatcher(a=o, b=c)

    out = []
    ins_del_used = 0 # shared counter: limits structural changes per line

    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "equal":
            out.extend(o[i1:i2])

        elif tag == "replace":
            o_seg = o[i1:i2]
            c_seg = c[j1:j2]
            if len(o_seg) == len(c_seg):
                for ow, nw in zip(o_seg, c_seg):
                    out.append(nw if word_ok(ow, nw, max_extra_chars, min_char_sim) else ow)
            else:
                out.extend(o_seg) # Blocks of unequal length: keep the original

        elif tag == "delete":
            # Filler words from ALLOWED_DEL may be deleted (up to max_ins_del)
            for ow in o[i1:i2]:
                if ins_del_used < max_ins_del and ow.lower() in ALLOWED_DEL:
                    ins_del_used += 1
                    continue  # Drop the word (delete)
                out.append(ow) # Otherwise: keep the original

        elif tag == "insert":
            # Function words from ALLOWED_INS may be inserted (up to max_ins_del)
            for nw in c[j1:j2]:
                if ins_del_used < max_ins_del and nw.lower() in ALLOWED_INS:
                    ins_del_used += 1
                    out.append(nw) # Insertion permitted
                # else: ignore the insertion

    return " ".join(out)


## 10 – WER Computation & VTT Helper Functions

Identical to `02f`.
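
As a quick illustration of the metric, the sketch below computes the word-level edit distance with a rolling row and divides it by the reference length. `word_error_rate` is an illustrative name; unlike the notebook's helper, this sketch only lowercases and splits on whitespace and does not strip punctuation.

```python
def word_error_rate(ref: str, hyp: str) -> float:
    """WER = word-level Levenshtein distance / number of reference words."""
    r, h = ref.lower().split(), hyp.lower().split()
    prev = list(range(len(h) + 1))  # distances for the empty reference prefix
    for i, rw in enumerate(r, start=1):
        cur = [i]
        for j, hw in enumerate(h, start=1):
            cur.append(min(
                prev[j] + 1,               # deletion of a reference word
                cur[j - 1] + 1,            # insertion of a hypothesis word
                prev[j - 1] + (rw != hw),  # substitution or match
            ))
        prev = cur
    return prev[-1] / max(1, len(r))

# One substitution (sat -> sit) and one deletion (the) over 6 reference words
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because the denominator is the reference length, inserted words can push the value above 1.0, which is worth keeping in mind when comparing WER deltas as small as the ones reported here.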

In [14]:
def normalize_words(s: str):
    # Lowercase, alphanumeric characters + apostrophe only, normalized whitespace
    s = s.lower()
    s = re.sub(r"[^a-z0-9'\s]+", " ", s)
    s = re.sub(r"\s+", " ", s).strip()
    return s.split()

def wer(ref: str, hyp: str) -> float:
    # Word error rate via Levenshtein DP. WER = edit distance / len(ref)
    r = normalize_words(ref)
    h = normalize_words(hyp)
    dp = [[0]*(len(h)+1) for _ in range(len(r)+1)]
    for i in range(len(r)+1):
        dp[i][0] = i
    for j in range(len(h)+1):
        dp[0][j] = j
    for i in range(1, len(r)+1):
        for j in range(1, len(h)+1):
            cost = 0 if r[i-1] == h[j-1] else 1
            dp[i][j] = min(dp[i-1][j] + 1, dp[i][j-1] + 1, dp[i-1][j-1] + cost)
    return dp[len(r)][len(h)] / max(1, len(r))

def flatten_vtt_text(path: Path) -> str:
    # Joins all caption lines of a VTT file into a single string.
    v = webvtt.read(str(path))
    return " ".join([clean_caption_text(c.text) for c in v.captions])

## 11 – Core Postprocessing Functions v2

**Changes compared to `02f`:**
- `max_new_tokens`: computed dynamically (`min(2048, max(256, int(in_len * 0.9)))`) instead of a fixed 256
- `repetition_penalty=1.1` and `no_repeat_ngram_size=3`: reduce repetitive output
- `unicodedata.normalize('NFKC')`: normalizes Unicode compatibility variants (e.g. fullwidth characters)
- More robust SEP parsing: `SEP_SPLIT_RE` normalizes variants before the split
- **New fallback `line_by_line()`**: if block processing fails with a format error in both stages,
  each line is sent to the LLM individually (more expensive but more robust)
- `correct_block`: a Stage 1 failure now triggers Stage 2 (instead of returning the original directly)
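
Two of the points above are easy to check in isolation: the clamped token budget and what NFKC normalization does and does not fold. The sketch below uses a hypothetical helper `token_budget` and made-up sample strings; only `unicodedata.normalize` is a real library call.

```python
import unicodedata

def token_budget(in_len: int, floor: int = 256, cap: int = 2048, ratio: float = 0.9) -> int:
    """Hypothetical helper mirroring min(cap, max(floor, int(in_len * ratio)))."""
    return min(cap, max(floor, int(in_len * ratio)))

for n in (100, 1000, 4096):
    print(n, "->", token_budget(n))

fullwidth = "\uff2f\uff2b\uff21\uff39 \ufb01ne"  # fullwidth "OKAY" plus the "fi" ligature
lookalike = "G\u041e\u041dNA"                     # Cyrillic letters inside a Latin word

folded = unicodedata.normalize("NFKC", fullwidth)
unchanged = unicodedata.normalize("NFKC", lookalike)
print(folded, folded.isascii())
print(unchanged == lookalike, unchanged.isascii())
```

NFKC folds compatibility forms such as fullwidth letters and ligatures, but Cyrillic or Greek look-alike letters are distinct characters and pass through unchanged, so an explicit `str.isascii()` check would be needed to detect them.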

In [15]:
MAX_INPUT_TOKENS = 4096
CONTEXT_BEFORE = 0   # No context: simplifies the prompt, reduces format failures
CONTEXT_AFTER  = 0

def _run_one_prompt(lines, model, tokenizer, system_prompt: str,
                    context_before=None, context_after=None):
    # Sends a block to the LLM and returns the corrected lines, or None on a format failure.
    context_before = context_before or []
    context_after  = context_after or []

    user_parts = []
    user_parts.append(
        f"You will receive {len(lines)} subtitle lines separated by the token `{SEP}`.\n"
        f"Return exactly {len(lines)} lines separated by `{SEP}`, no extra text.\n"
        f"Example: A{SEP}B -> A{SEP}B\n" # More concrete example than in 02f
    )

    if context_before:
        user_parts.append("Context (previous subtitle lines, DO NOT edit):\n")
        user_parts.append(SEP.join(context_before) + "\n")

    user_parts.append("Lines to correct:\n")
    user_parts.append(SEP.join(lines) + "\n")

    if context_after:
        user_parts.append("Context (next subtitle lines, DO NOT edit):\n")
        user_parts.append(SEP.join(context_after) + "\n")

    user_msg = "\n".join(user_parts)

    prompt = build_chat_prompt(tokenizer, system_prompt, user_msg)
    inputs = tokenizer(
        [prompt],
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=MAX_INPUT_TOKENS
    )
    if torch.cuda.is_available():
        inputs = {k: v.cuda() for k, v in inputs.items()}

    with torch.inference_mode():
        in_len = inputs["input_ids"].shape[1]
        out = model.generate(
            **inputs,
            do_sample=False,
            # Dynamic token budget: proportional to the input length, at least 256, at most 2048
            max_new_tokens=min(2048, max(256, int(in_len * 0.9))),
            pad_token_id=tokenizer.eos_token_id,
            eos_token_id=tokenizer.eos_token_id,
            repetition_penalty=1.1, # NEW: reduces repetitions in the output
            no_repeat_ngram_size=3, # NEW: blocks repeated 3-grams
        )

    # Single prompt without padding: the generated tokens start right after the input tokens
    gen_ids = out[0][in_len:]
    text = tokenizer.decode(gen_ids, skip_special_tokens=True)
    text = strip_thinking(text).replace("\ufeff", "").strip()

    # NEW: NFKC normalization folds Unicode compatibility variants,
    # e.g. fullwidth letters (Ａ→A) and ligatures (ﬁ→fi).
    # It does not map Cyrillic or Greek look-alike letters to ASCII.
    text = unicodedata.normalize("NFKC", text)

    # Remove speaker tags that some models prepend
    text = re.sub(
        r"^\s*(?:<\|?(?:user|assistant|system|human)\|?>|user|assistant|system|human)\s*[:\n-]*\s*",
        "",
        text,
        flags=re.IGNORECASE
    ).strip()

    # NEW: normalize separator variants, then split on the canonical SEP
    text_norm = SEP_SPLIT_RE.sub(SEP, text)
    raw_parts = text_norm.split(SEP)

    # Remove empty parts at the edges (they occur when the LLM starts/ends with SEP)
    while raw_parts and raw_parts[0].strip() == "":
        raw_parts = raw_parts[1:]
    while raw_parts and raw_parts[-1].strip() == "":
        raw_parts = raw_parts[:-1]
    
    parts = [clean_caption_text(p) for p in raw_parts]

    if len(parts) != len(lines):
        print("⚠ FORMAT FAIL expected", len(lines), "got", len(parts))
        print("⚠ RAW OUTPUT (first 400 chars):", repr(text[:400]))
        return None

    return parts

def correct_block(lines, model, tokenizer,
                  context_before=None, context_after=None,
                  guard_max_extra_chars=6,
                  guard_min_char_sim=0.20):
    # 2-stage postprocessing with an extended fallback.
    # The guard_* parameters are accepted but not used here;
    # the token guard is applied afterwards in postprocess_vtt_file.

    # Fallback cascade:
    # 1. Stage 1 (strict) → if it changed something: return the result
    # 2. Stage 1 fails or makes no change → Stage 2 (relaxed)
    # 3. Stage 2 fails → line_by_line() (each line individually)

    def line_by_line():
        # Fallback: send each line to the LLM individually (more expensive, but more robust)
        fixed = []
        for line in lines:
            one = _run_one_prompt([line], model, tokenizer, SYSTEM_PROMPT_RELAXED)
            fixed.append(one[0] if one is not None and len(one) == 1 else line)
        return fixed

    out1 = _run_one_prompt(lines, model, tokenizer, SYSTEM_PROMPT_STRICT,
                           context_before=context_before, context_after=context_after)

    # Stage 1 format failure or no change: try Stage 2, then the line-by-line fallback
    if out1 is None or out1 == lines:
        out2 = _run_one_prompt(lines, model, tokenizer, SYSTEM_PROMPT_RELAXED,
                               context_before=context_before, context_after=context_after)
        return out2 if out2 is not None else line_by_line()

    return out1 # Stage 1 made changes → return them directly

## 12 – VTT File Postprocessing

**Change compared to `02f`:** The token guard is now called explicitly
in `postprocess_vtt_file` (with `max_ins_del=2`)
instead of being embedded in `correct_block`. This separates the responsibilities more cleanly.

In [16]:
def copy_input_to_output(session_dir: Path, input_prefix: str, output_prefix: str):
    # Copies the input directory for postprocessing
    src = session_dir / input_prefix
    dst = session_dir / output_prefix
    if not src.exists():
        raise FileNotFoundError(f"Input dir missing: {src}")
    if dst.exists():
        shutil.rmtree(dst)
    shutil.copytree(src, dst)
    return src, dst

def postprocess_vtt_file(vtt_in: Path, vtt_out: Path,
                         model, tokenizer,
                         block_size=8,
                         guard_max_extra_chars=6,
                         guard_min_char_sim=0.20,
                         quick_wer_guard=False,
                         label_vtt: Path | None = None,
                         context_before_n: int = 0,
                         context_after_n: int = 0):
    v = webvtt.read(str(vtt_in))
    caps = v.captions

    orig_texts = [clean_caption_text(c.text) for c in caps]
    fixed = []

    for i in range(0, len(orig_texts), block_size):
        block = orig_texts[i:i + block_size]

        ctx_before = fixed[-context_before_n:] if context_before_n > 0 else []

        ctx_after_start = i + block_size
        ctx_after = orig_texts[ctx_after_start:ctx_after_start + context_after_n] if context_after_n > 0 else []

        # LLM correction
        fixed_block = correct_block(
            block, model, tokenizer,
            context_before=ctx_before,
            context_after=ctx_after,
            guard_max_extra_chars=guard_max_extra_chars,
            guard_min_char_sim=guard_min_char_sim
        )

        # Token guard AFTER correct_block: prevents unwanted structural changes
        # max_ins_del=2 allows up to 2 insertions/deletions per line
        # Uses the function parameters instead of the module-level GUARD_* constants
        fixed_block = [
            token_guard_line(o, c, max_extra_chars=guard_max_extra_chars, min_char_sim=guard_min_char_sim, max_ins_del=2)
            for o, c in zip(block, fixed_block)
        ]
        fixed.extend(fixed_block)

    # Optional WER guard (disabled here)
    if quick_wer_guard and label_vtt is not None and label_vtt.exists():
        ref = flatten_vtt_text(label_vtt)
        hyp_before = " ".join(orig_texts)
        hyp_after  = " ".join(fixed)
        w_before = wer(ref, hyp_before)
        w_after  = wer(ref, hyp_after)
        if w_after > w_before:
            fixed = orig_texts

    for c, new_t in zip(caps, fixed):
        c.text = new_t

    v.save(str(vtt_out))

## 13 – Running the Experiment

**Changes compared to `02f`:**
- `block_size=4` instead of 8: smaller blocks, fewer format failures
- `USE_QUICK_WER_GUARD=False`: WER guard disabled (block_size=4 is considered conservative enough)
- `context_before_n=0`: no context (simplifies the prompt)
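
The effect of `block_size` on the number of LLM calls can be sketched as follows. `iter_blocks` is an illustrative helper and the captions are made up; the count shown is the minimum per file, since Stage 2 and the line-by-line fallback add further calls.

```python
import math

def iter_blocks(lines, block_size):
    """Yield consecutive slices of at most block_size lines."""
    for start in range(0, len(lines), block_size):
        yield lines[start:start + block_size]

captions = [f"LINE {i}" for i in range(10)]  # made-up caption lines

for size in (8, 4):
    blocks = list(iter_blocks(captions, size))
    # At least one Stage 1 call per block
    assert len(blocks) == math.ceil(len(captions) / size)
    print(size, [len(b) for b in blocks])
```

Smaller blocks mean the model has to reproduce fewer separators per call, at the cost of more calls per file.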

In [17]:
BLOCK_SIZE = 4 # Smaller than in 02f (8): fewer format failures thanks to shorter prompts
GUARD_MAX_EXTRA_CHARS = 6
GUARD_MIN_CHAR_SIM    = 0.20
USE_QUICK_WER_GUARD = False    # Disabled: block_size=4 is considered conservative enough
LABEL_DIR_NAME = "labels"

for exp_key, exp_cfg in EXPERIMENTS.items():
    llm_key = exp_cfg["llm_model"]

    try:
        model, tok = load_llm(llm_key)
    except Exception as e:
        print(f"✗ Could not load {llm_key}: {e}")
        continue

    print(f"\n### Running Experiment {exp_key} with LLM {llm_key} ###")

    for sid in SESSION_IDS:
        session_dir = BASE_DATA_DIR / sid
        out_prefix = f"output_{exp_key}"

        print(f"\nSession {sid}: copy {INPUT_PREFIX} -> {out_prefix}")
        try:
            _, out_dir = copy_input_to_output(session_dir, INPUT_PREFIX, out_prefix)
        except Exception as e:
            print("✗ Skip session:", e)
            continue

        vtts = sorted(out_dir.glob("*.vtt"))
        if not vtts:
            print("⚠ No VTTs found in", out_dir)
            continue

        label_dir = session_dir / LABEL_DIR_NAME

        for vtt in vtts:
            label_vtt = (label_dir / vtt.name) if label_dir.exists() else None
            postprocess_vtt_file(
                vtt, vtt,
                model, tok,
                block_size=BLOCK_SIZE,
                guard_max_extra_chars=GUARD_MAX_EXTRA_CHARS,
                guard_min_char_sim=GUARD_MIN_CHAR_SIM,
                quick_wer_guard=USE_QUICK_WER_GUARD,
                label_vtt=label_vtt,
                context_before_n=0,   # No context: simpler prompt
                context_after_n=0
            )

        print(f"✓ {len(vtts)} VTTs postprocessed in {out_dir}")

    unload_llm(llm_key)

print("\nDone: all outputs generated.")


Loading LLM: qwen3_8b -> Qwen/Qwen3-8B (dtype=bfloat16)


Loading checkpoint shards: 100%|██████████| 5/5 [00:03<00:00,  1.31it/s]
The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.



### Running Experiment E42_qwen3_8b_v2 with LLM qwen3_8b ###

Session session_40: copy output_E09_bs12_len20 -> output_E42_qwen3_8b_v2


⚠ FORMAT FAIL expected 4 got 3
⚠ RAW OUTPUT (first 400 chars): 'YEAH||| SEP |||OKAY SO DO YOU KNOW WHAT THE FIRST PERIOD IS CALLED|||SEP|||RICE IT I S It AS Iᴛ Is It回|||YEAH ThE NEXT ONE BEGINS GOOD  WAY  UNDERSТАND WY OKEY ?'


⚠ FORMAT FAIL expected 4 got 3
⚠ RAW OUTPUT (first 400 chars): "I'VE BEEN WATCHING ||| SEP ||| I'M GОНNA TEXT iT MYSELF |||_SEP||| NO ONE'S STOПPING ME IF YОU CAN'T DО IT I'll DO IT FОR YОu SWEΕT THЕN YЕAH HА ||| SEР ||| BUT YОUR UP ON APRIl SO YОUr APRIL AT HITTIΝG PEOPLE UP APRIЛ AT VISIΤING PEOPLES UP"


⚠ FORMAT FAIL expected 4 got 2
⚠ RAW OUTPUT (first 400 chars): "FROM YOU, OH, IS THERE ANYTHING ACTUALL Y GOING.ON AND YOU.WILL LOVE THAT.OHHHHH YOU ARE.TOO||| SEP||| IT'S HARD.TO ANSWER.OHHH HHH MOST.SMILING|| |OH.IT LOOKS.LIKE ANYWAYS |||NOT.LOSER OR.A BUT THE.DEEP INSID E KIND OF.THING.SO MAYBE.THAT'S.TRUE"


✓ 6 VTTs postprocessed in data-bin/dev_without_central_videos/dev/session_40/output_E42_qwen3_8b_v2

Session session_43: copy output_E09_bs12_len20 -> output_E42_qwen3_8b_v2


✓ 6 VTTs postprocessed in data-bin/dev_without_central_videos/dev/session_43/output_E42_qwen3_8b_v2

Session session_49: copy output_E09_bs12_len20 -> output_E42_qwen3_8b_v2


⚠ FORMAT FAIL expected 4 got 3
⚠ RAW OUTPUT (first 400 chars): "OKAY ||| SEP ||| THAT'S RIGHT |||_SEP_||| BUT YOU WANT_TO_DO CHORDs RIGHT |||\nTHAT' S KIND OF_SCARY"


✓ 6 VTTs postprocessed in data-bin/dev_without_central_videos/dev/session_49/output_E42_qwen3_8b_v2

Session session_50: copy output_E09_bs12_len20 -> output_E42_qwen3_8b_v2


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.

✓ 6 VTTs postprocessed in data-bin/dev_without_central_videos/dev/session_50/output_E42_qwen3_8b_v2

Session session_54: copy output_E09_bs12_len20 -> output_E42_qwen3_8b_v2


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.

⚠ FORMAT FAIL expected 4 got 1
⚠ RAW OUTPUT (first 400 chars): "IT'SYOURAUDIENCE||**SEP**||IDON'TKNOWIDON' TKNOW||**SE**P||IDONTKNOW||** SEP**||HMMMIWASWORKINGONONEOFTHEFUSEBUTIWASWORKINGLESONTHATONEBEFOREICAMEHEREBUTINEVERLIKEPERFORMEDITISWEARI STILLHAVEITUUNDERMYFEET"


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


⚠ FORMAT FAIL expected 4 got 3
⚠ RAW OUTPUT (first 400 chars): "IT'SYOURAUDIENCE|| |SEP|| |I DON' TKNOW I ACTUALY DON' TKNOW||| SEP|| | I DON' TNOW|| | HMMM I WASE WORKINGONONEOFTHFUSE BUT IWASWORKINGONTHATONEBEFOREICAMETHRE BUTINEVERLIKEPERFORMEDIT I SWARE I STILLHAVEITU NDERMYFEET"


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.

✓ 5 VTTs postprocessed in data-bin/dev_without_central_videos/dev/session_54/output_E42_qwen3_8b_v2

Fertig: Alle Outputs generiert.
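
The `FORMAT FAIL` lines in the log above show how the model distorts the separator: `||| SEP |||`, `|||_SEP_|||`, `||**SEP**||`, `|| |SEP|| |`, and bare pipe runs with no `SEP` at all. The following is a minimal sketch of a more tolerant split, not the regex this notebook actually uses. The names `SEP_PATTERN` and `split_segments` are hypothetical, and treating a bare pipe run as a separator is an assumption that could over-split text containing literal pipes.

```python
import re

# Hypothetical tolerant separator (assumption, not the notebook's regex):
# a run of two or more pipes, optionally wrapping a decorated "SEP" whose
# letters may be separated by spaces, underscores, or asterisks.
_PIPES = r"(?:\|\s*){2,}"
_DECO = r"[\s_*]*"
SEP_PATTERN = re.compile(
    rf"{_PIPES}(?:{_DECO}S{_DECO}E{_DECO}P{_DECO}{_PIPES})?",
    flags=re.IGNORECASE,
)


def split_segments(raw: str) -> list[str]:
    """Split raw LLM output on separator variants and drop empty parts."""
    parts = [part.strip() for part in SEP_PATTERN.split(raw)]
    return [part for part in parts if part]


raw = "OKAY ||| SEP ||| THAT'S RIGHT |||_SEP_||| BUT YOU WANT_TO_DO CHORDs RIGHT |||\nTHAT' S KIND OF_SCARY"
for segment in split_segments(raw):
    print(repr(segment))
```

Recovering the expected number of segments does not repair their content: in the `session_54` failures the model also removed the spaces between words, so a length check alone would still let damaged text through.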


## 14 – Evaluation & Aggregation

In [18]:
df_dev = append_eval_results_for_experiments(
    experiments=EXPERIMENTS,
    session_ids=SESSION_IDS,
    target_csv="results_dev_subset_by_session.csv",
)


########## Evaluate für session_40 ##########
Starte Evaluate: /home/josch080/Projektgruppe/mcorec_train/bin/python script/evaluate.py --session_dir data-bin/dev_without_central_videos/dev/session_40 --output_dir_name output_ --label_dir_name labels
Evaluating 1 sessions

=== Evaluating session session_40 ===

--- Evaluating output dir: output_E01_bs4_len15 ---
Conversation clustering F1 score: 1.0
Speaker to WER: {'spk_0': 0.564, 'spk_1': 0.4281, 'spk_2': 0.5576, 'spk_3': 0.4283, 'spk_4': 0.4793, 'spk_5': 0.4189}
Speaker clustering F1 score: {'spk_0': 1.0, 'spk_1': 1.0, 'spk_2': 1.0, 'spk_3': 1.0, 'spk_4': 1.0, 'spk_5': 1.0}
Joint ASR-Clustering Error Rate: {'spk_0': 0.282, 'spk_1': 0.21405, 'spk_2': 0.2788, 'spk_3': 0.21415, 'spk_4': 0.23965, 'spk_5': 0.20945}

--- Evaluating output dir: output_E02_bs8_len15 ---
Conversation clustering F1 score: 1.0
Speaker to WER: {'spk_0': 0.561, 'spk_1': 0.4312, 'spk_2': 0.5506, 'spk_3': 0.4283, 'spk_4': 0.5041, 'spk_5': 0.4189}
Speaker clusterin

## 15 – Results analysis: E42 vs. all LLM experiments

E42 is compared against the E09 baseline and all LLM experiments from `02f`.

In [21]:
import pandas as pd
import numpy as np

try:
    from IPython.display import display
except ImportError:
    display = print

dev_df = pd.read_csv("results_dev_subset_by_session.csv")

# All LLM experiments incl. E42 for the overall comparison
EXPERIMENTS = {
    "E38_qwen3_8b": {"llm_model": "qwen3_8b", "description": "Qwen3 8B"},
    "E39_qwen2.5_7b": {"llm_model": "qwen2.5_7b", "description": "Qwen2.5 7B"},
    "E40_qwen2.5_coder_7b": {"llm_model": "qwen2.5_coder_7b", "description": "Qwen2.5 Coder 7B"},
    "E41_deepseek_r1": {"llm_model": "deepseek_r1_distill_qwen_7b", "description": "DeepSeek R1 Distill 7B"},
    "E42_qwen3_8b_v2": {"llm_model": "qwen3_8b", "description": "Qwen3 8B (E42)"},
}
llm_models = list(EXPERIMENTS.keys())

# Fixed baseline
BASELINE_MODEL = "E09_bs12_len20"

# Check that the required metric columns exist
required_cols = ["avg_speaker_wer", "avg_joint_error"]
missing_cols = [c for c in required_cols if c not in dev_df.columns]
if missing_cols:
    raise ValueError(f"These columns are missing from the CSV: {missing_cols}")

present_models = set(dev_df["model"].unique())

missing_models = [m for m in (llm_models + [BASELINE_MODEL]) if m not in present_models]
if missing_models:
    print("WARNING: These models were not found in the CSV:", missing_models)

# Aggregation
baseline_agg = (
    dev_df[dev_df["model"] == BASELINE_MODEL]
    .groupby("model")[required_cols]
    .mean()
    .reset_index()
)
if baseline_agg.empty:
    raise ValueError(f"Baseline '{BASELINE_MODEL}' not found in the CSV.")

baseline_wer = float(baseline_agg.loc[0, "avg_speaker_wer"])
baseline_joint = float(baseline_agg.loc[0, "avg_joint_error"])

llm_agg = (
    dev_df[dev_df["model"].isin(llm_models)]
    .groupby("model")[required_cols]
    .mean()
    .reset_index()
)


# Comparison table
comp = pd.concat([baseline_agg, llm_agg], ignore_index=True)

comp = comp.rename(columns={
    "avg_speaker_wer": "wer",
    "avg_joint_error": "joint_error",
})

comp["delta_wer"] = comp["wer"] - baseline_wer
comp["delta_joint_error"] = comp["joint_error"] - baseline_joint

# Baseline first, the rest sorted by WER
comp["__is_baseline"] = (comp["model"] == BASELINE_MODEL).astype(int)
comp = comp.sort_values(["__is_baseline", "wer"], ascending=[False, True]).drop(columns="__is_baseline")

comp = comp[["model", "wer", "joint_error", "delta_wer", "delta_joint_error"]].reset_index(drop=True)

print(f"Baseline fixed to: {BASELINE_MODEL}")
print("Interpretation: negative deltas = improvement (lower is better).")
display(comp)


Baseline fixed to: E09_bs12_len20
Interpretation: negative deltas = improvement (lower is better).


|   | model | wer | joint_error | delta_wer | delta_joint_error |
|---|-------|-----|-------------|-----------|-------------------|
| 0 | E09_bs12_len20 | 0.495416 | 0.3239 | 0.0 | 0.0 |
| 1 | E38_qwen3_8b | 0.494813 | 0.323598 | -0.000603 | -0.0003016667 |
| 2 | E39_qwen2.5_7b | 0.495116 | 0.32375 | -0.0003 | -0.00015 |
| 3 | E41_deepseek_r1 | 0.495416 | 0.3239 | 0.0 | 5.551115e-17 |
| 4 | E40_qwen2.5_coder_7b | 0.495723 | 0.324053 | 0.000307 | 0.0001533333 |
| 5 | E42_qwen3_8b_v2 | 0.523491 | 0.337938 | 0.028075 | 0.01403767 |


## 16 – Interpretation

| Experiment | Configuration | WER | Δ WER |
|------------|---------------|-----|-------|
| E09 (baseline) | BL4 beam=12 len=20 | 0.4954 | – |
| E38 (02f) | Qwen3-8B, block=8, WER guard | 0.4948 | −0.0006 |
| **E42 (02g)** | **Qwen3-8B v2, block=4, no guard** | **0.5235** | **+0.0281** |

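The refined pipeline **degrades** the results markedly.
The WER guard from `02f` therefore appears to have been decisive for E38 achieving any improvement, however marginal.
Because `02g` also changed the block size, the separator handling, and the insertion and deletion rules in the same run, this experiment does not isolate the guard's effect.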

Without the WER guard, the LLM can introduce corrections that look plausible locally
but raise the WER overall; the token guard alone is not sufficient.
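
The idea of such a guard can be sketched as follows. This is an illustration, not the `02f` implementation. It assumes the guard compares the LLM output with the original ASR hypothesis (no reference transcript is available at inference time) and rejects corrections whose word-level edit distance is too large. The function names and the threshold `max_wer=0.3` are hypothetical.

```python
def word_edit_distance(ref_words: list[str], hyp_words: list[str]) -> int:
    """Levenshtein distance on word level (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def apply_wer_guard(original: str, corrected: str, max_wer: float = 0.3) -> str:
    """Keep the LLM correction only if it stays close to the ASR hypothesis.

    max_wer is an assumed threshold, not a value taken from the notebooks.
    """
    orig_words = original.split()
    if not orig_words:
        return original
    wer = word_edit_distance(orig_words, corrected.split()) / len(orig_words)
    return corrected if wer <= max_wer else original


# A single-word fix passes, a collapsed line like the ones in the log is rejected.
print(apply_wer_guard("I DONT KNOW WHAT TO DO", "I DON'T KNOW WHAT TO DO"))
print(apply_wer_guard("IT'S YOUR AUDIENCE", "IT'SYOURAUDIENCE"))
```

A guard of this kind caps how far any single correction can move away from the ASR hypothesis, which limits the damage from outputs such as the space-stripped lines seen in the `FORMAT FAIL` logs.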

**Interim conclusion:** Without the WER guard, the WER rises by 0.028 absolute
relative to the E09 baseline. The LLM postprocessing experiments nevertheless
continue: notebook `02h_` tests a code-specialized LLM (DeepSeek Coder).
