# 02f – Experiment 7: LLM Post-Processing of the AVSR Transcriptions

## Motivation

Since fine-tuning and alternative models brought no improvement over BL4,
this notebook takes a different approach: the existing BL4 transcriptions
(experiment E09: `beam_size=12`, `max_length=20`) are **post-corrected** by an LLM.

**Idea:** ASR systems often produce words that are phonetically similar but wrong in content.
An LLM can detect such errors and replace them with words that fit the context,
without changing the structure of the transcription.

## Pipeline

```
BL4 VTT files → LLM post-processing → corrected VTT files → evaluation
```

Safeguards against degradation:
- **2-stage prompting:** first strict (only very confident corrections), then relaxed as a fallback
- **Token guard:** filters out replacements that are too long, too dissimilar, or start with a different letter
- **WER guard:** reverts the entire file to the original if the LLM worsens the WER

## LLMs Tested

| Experiment | Model | Δ WER vs. E09 |
|------------|-------|---------------|
| E38 | Qwen3-8B | **−0.0006** (best improvement) |
| E39 | Qwen2.5-7B-Instruct | slightly better |
| E40 | Qwen2.5-Coder-7B-Instruct | slightly worse |
| E41 | DeepSeek-R1-Distill-Qwen-7B | at baseline level |

The improvements are marginal. Qwen3-8B is examined further in `02g_`.

**Note on the bugfix:** This run was carried out **before the bugfix** in `segmentation.py` (`min_duration_off` incorrectly read the value of `min_duration_on`). This is **intentional**: the bug was only discovered after the LLM and hyperparameter experiments had been completed. Because the bugfix alone initially worsened the WER, the combination of bugfix and `min_duration` optimization that finally gave the best result was only worked out in `02j_`/`02k_`.

## 1 – GPU Check & Selection

In [1]:
!nvidia-smi

Thu Jan 29 19:29:00 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA A100-SXM4-80GB          On  |   00000000:01:00.0 Off |                    0 |
| N/A   34C    P0             93W /  500W |    9073MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB          On  |   00

In [2]:
import os

# Physical GPU selection: GPU 2 here (see nvidia-smi)
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "2"  # adjust depending on availability

## 2 – CUDA Verification

In [3]:
import torch

In [4]:
print("CUDA available:", torch.cuda.is_available())
print("CUDA devices:", torch.cuda.device_count())
if torch.cuda.is_available():
    print("Device 0 name:", torch.cuda.get_device_name(0))
    print("Memory allocated:", torch.cuda.memory_allocated(0) / 1024**3, "GB")

CUDA available: True
CUDA devices: 1
Device 0 name: NVIDIA A100-SXM4-80GB
Memory allocated: 0.0 GB


## 3 – Setup: Imports & Working Directory

In [5]:
import os, sys, re, gc, shutil, subprocess
from pathlib import Path
import pandas as pd
import torch
import webvtt
import difflib

project_baseline_path = "/home/josch080/Projektgruppe/mcorec_baseline"
os.chdir(project_baseline_path)

from script.pg_utils_experiments import append_eval_results_for_experiments



## 4 – Configuration

`INPUT_PREFIX` specifies which BL4 experiment serves as input.
These are the VTT files from E09 (`beam_size=12`, `max_length=20`).

In [6]:
BASE_DATA_DIR = Path("data-bin/dev_without_central_videos/dev")

# Same 5-session subset as in all previous notebooks
SESSION_IDS = ["session_40", "session_43", "session_49", "session_50", "session_54"]

# Input prefix: E09 is the best BL4 experiment (beam=12, len=20)
INPUT_PREFIX = "output_E09_bs12_len20"

# Target CSV for aggregated results (shared with all other experiments)
RESULTS_CSV = "results_dev_subset_by_session.csv"

In [7]:
# Print the configuration for verification
print("CWD:", os.getcwd())
print("BASE_DATA_DIR:", BASE_DATA_DIR)
print("INPUT_PREFIX:", INPUT_PREFIX)
print("Sessions:", SESSION_IDS)

CWD: /home/josch080/Projektgruppe/mcorec_baseline
BASE_DATA_DIR: data-bin/dev_without_central_videos/dev
INPUT_PREFIX: output_E09_bs12_len20
Sessions: ['session_40', 'session_43', 'session_49', 'session_50', 'session_54']


## 5 – LLM Configurations

Four models are compared, all with roughly 7–8B parameters and loaded in bfloat16 for VRAM efficiency.
LLaMA was excluded because of restricted Hugging Face access and
the associated usage requirements.

In [8]:
LLM_CONFIGS = {
    # Qwen3-8B: latest Qwen generation (at the time of the experiment), no Instruct suffix
    # uses an internal thinking mode, which is disabled via enable_thinking=False
    "qwen3_8b": {
        "model_name": "Qwen/Qwen3-8B",
        "dtype": "bfloat16",
        "description": "Qwen3 8B"
    },
    # Qwen2.5-7B-Instruct: instruction-finetuned, good at following structured instructions
    "qwen2.5_7b": {
        "model_name": "Qwen/Qwen2.5-7B-Instruct",
        "dtype": "bfloat16",
        "description": "Qwen2.5 7B Instruct"
    },
    # Qwen2.5-Coder: specialized on code; hypothesis: precise token-level instruction following
    "qwen2.5_coder_7b": {
        "model_name": "Qwen/Qwen2.5-Coder-7B-Instruct",
        "dtype": "bfloat16",
        "description": "Qwen2.5 Coder 7B Instruct"
    },
    # DeepSeek-R1-Distill: distilled for reasoning; hypothesis: better understanding of context
    "deepseek_r1_distill_qwen_7b": {
        "model_name": "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
        "dtype": "bfloat16",
        "description": "DeepSeek R1 Distill Qwen 7B"
    },
}

## 6 – Experiment Definitions

In [9]:
EXPERIMENTS = {
    "E38_qwen3_8b": {
        "llm_model": "qwen3_8b", 
        "description": "Qwen3 8B"
    },
    
    "E39_qwen2.5_7b": {
        "llm_model": "qwen2.5_7b", 
        "description": "Qwen2.5 7B"
    },
    
    "E40_qwen2.5_coder_7b": {
        "llm_model": "qwen2.5_coder_7b",
        "description": "Qwen2.5 Coder 7B"
    },
    "E41_deepseek_r1": {
        "llm_model": "deepseek_r1_distill_qwen_7b", 
        "description": "DeepSeek R1 Distill 7B"
    },
}

print("LLM experiments:", list(EXPERIMENTS.keys()))

LLM experiments: ['E38_qwen3_8b', 'E39_qwen2.5_7b', 'E40_qwen2.5_coder_7b', 'E41_deepseek_r1']


## 7 – Loading / Unloading LLMs

Models are loaded lazily (only when needed) and explicitly unloaded after each
experiment to free VRAM for the next model. Without this step, out-of-memory errors occurred.

In [10]:
from transformers import AutoTokenizer, AutoModelForCausalLM

# Caches for loaded models/tokenizers: prevents loading the same model twice
loaded_models = {}
loaded_tokenizers = {}

def _dtype_from_cfg(dtype_str: str):
    # Converts the configuration string into a torch.dtype
    s = (dtype_str or "auto").lower()
    if s == "bfloat16":
        return torch.bfloat16
    if s == "float16":
        return torch.float16
    return "auto" # let Hugging Face choose

def load_llm(llm_key: str):
    # Loads model + tokenizer unless they are already cached
    if llm_key in loaded_models:
        return loaded_models[llm_key], loaded_tokenizers[llm_key]

    cfg = LLM_CONFIGS[llm_key]
    model_name = cfg["model_name"]
    dtype = _dtype_from_cfg(cfg.get("dtype", "auto"))

    print(f"\nLoading LLM: {llm_key} -> {model_name} (dtype={cfg.get('dtype')})")

    # Load the tokenizer; use_fast=True selects the fast Rust implementation
    tok = AutoTokenizer.from_pretrained(model_name, use_fast=True, trust_remote_code=True)

    # If no padding token is defined, use the EOS token as pad token
    if tok.pad_token_id is None:
        tok.pad_token = tok.eos_token

    # device_map='auto': Hugging Face distributes the model over the visible GPUs/CPU
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=dtype,
        device_map="auto",
        trust_remote_code=True # allows custom model code from the repo (recent transformers versions support these models natively)
    )
    model.eval() # disable dropout, inference only

    loaded_models[llm_key] = model
    loaded_tokenizers[llm_key] = tok
    return model, tok

def unload_llm(llm_key: str):
    # Drop the cached references so the objects can be garbage-collected
    loaded_models.pop(llm_key, None)
    loaded_tokenizers.pop(llm_key, None)
    gc.collect() # collect first so that unreferenced tensors are actually released
    torch.cuda.empty_cache() # then hand the freed blocks of the CUDA cache back to the driver
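
Removing the cache entry only releases a model once no other name refers to it. In the experiment loop of this notebook the variable `model` still points to the weights until it is reassigned. The following standard-library sketch illustrates that effect under stated assumptions: `FakeModel` and `cache` are hypothetical stand-ins, not part of the notebook, and no GPU is involved.

```python
import gc
import weakref


class FakeModel:
    """Hypothetical stand-in for a large model object."""


cache = {"demo": FakeModel()}
model = cache["demo"]              # second reference, like the loop variable
probe = weakref.ref(model)         # lets us observe whether the object is alive

del cache["demo"]                  # analogous to removing the cache entry
gc.collect()
still_alive = probe() is not None  # the name `model` keeps the object alive

del model                          # drop the last strong reference
gc.collect()
released = probe() is None         # now the object has been freed

print(still_alive, released)
```

If the same holds for the real model object, deleting or reassigning `model` before emptying the CUDA cache is what makes the VRAM reclaimable.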



## 8 – Text Helper Functions

Preprocessing before text is sent to the LLM and post-processing of the LLM output.

In [11]:
# Regex for <think>...</think> blocks (DeepSeek-R1 and Qwen3 emit their reasoning)
THINK_RE = re.compile(r"<think>.*?</think>", flags=re.DOTALL | re.IGNORECASE)

# Regex for unwanted speaker tags that some models emit ("Human: ...", "Assistant: ...")
SPEAKER_TAG_RE = re.compile(r"^\s*(Human|Assistant|System|User)\s*:\s*", flags=re.IGNORECASE)

def strip_thinking(text: str) -> str:
    # Removes <think> blocks from the LLM output (reasoning trace, not content)
    if not text:
        return text
    text = THINK_RE.sub("", text)
    text = re.sub(r"</?think>\s*", "", text, flags=re.IGNORECASE) # leftover unpaired tags
    return text.strip()

def clean_caption_text(t: str) -> str:
    if t is None:
        return ""
    t = t.replace("\ufeff", "").strip() # remove BOM (byte order mark)
    t = SPEAKER_TAG_RE.sub("", t).strip() # remove 'Human: ' and similar prefixes
    t = " ".join(t.split()) # collapse runs of whitespace
    return t

# Separator token between subtitle lines in the LLM prompt
# Rare enough not to occur in real text
SEP = "<<<SEP>>>"

def build_chat_prompt(tokenizer, system_msg: str, user_msg: str) -> str:
    messages = [
        {"role": "system", "content": system_msg},
        {"role": "user", "content": user_msg},
    ]
    try:
        # Qwen3's chat template supports enable_thinking; fall back if a tokenizer rejects the keyword
        return tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True,
            enable_thinking=False,
        )
    except TypeError:
        return tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
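
To make the effect of the two cleanup helpers concrete, here is a small self-contained sketch. It repeats condensed versions of `strip_thinking` and `clean_caption_text` so that it runs on its own; the sample string `raw` is made up for illustration and is not taken from the data.

```python
import re

THINK_RE = re.compile(r"<think>.*?</think>", flags=re.DOTALL | re.IGNORECASE)
SPEAKER_TAG_RE = re.compile(r"^\s*(Human|Assistant|System|User)\s*:\s*", flags=re.IGNORECASE)


def strip_thinking(text: str) -> str:
    text = THINK_RE.sub("", text)                                  # paired blocks
    text = re.sub(r"</?think>\s*", "", text, flags=re.IGNORECASE)  # unpaired leftovers
    return text.strip()


def clean_caption_text(t: str) -> str:
    t = t.replace("\ufeff", "").strip()      # drop a byte order mark
    t = SPEAKER_TAG_RE.sub("", t).strip()    # drop a leading speaker tag
    return " ".join(t.split())               # collapse whitespace


# Made-up model reply with a reasoning trace, a speaker tag and double spaces
raw = "<think>The word 'wether' looks wrong.</think>\nAssistant:  I  wonder whether it rains"
cleaned = clean_caption_text(strip_thinking(raw))
print(cleaned)
```

Only the caption text survives; the reasoning trace and the `Assistant:` prefix are discarded before the line count is checked.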


## 9 – System Prompts (2-Stage)

**Stage 1 (strict):** Only corrections made with very high confidence.
**Stage 2 (relaxed):** Slightly more liberal, used as a fallback when stage 1 proposes no changes.

Both prompts explicitly forbid paraphrasing, reordering words, and
adding or deleting words. Only replacements of individual tokens are allowed.

In [12]:
# Stage 1: very conservative; rather no change than a wrong one
SYSTEM_PROMPT_STRICT = (
    "You are a transcript post-processor for ASR cleanup.\n"
    "Goal: reduce ASR word errors while keeping wording as close as possible.\n"
    "Rules:\n"
    "- You receive multiple subtitle lines separated by the token <<<SEP>>>.\n"
    "- Output MUST contain the EXACT same number of lines, separated by the EXACT same token <<<SEP>>>.\n"
    "- Do NOT add any other text.\n"
    "- Do NOT paraphrase or reorder words.\n"
    "- Do NOT add or remove words.\n"
    "- Only replace words when you are VERY confident.\n"
    "- Prefer common conversational words.\n"
    "- Strongly avoid replacing a short/unclear token with a much longer/formal word.\n"
    "- Preserve casing (if input is ALL CAPS, output ALL CAPS).\n"
    "- Preserve punctuation style (do not introduce new punctuation).\n"
)

# Stage 2: relaxed; allows replacing clearly wrong tokens with short, common words
SYSTEM_PROMPT_RELAXED = (
    "You are a transcript post-processor for ASR cleanup.\n"
    "Goal: reduce ASR word errors while keeping structure unchanged.\n"
    "Rules:\n"
    "- You receive multiple subtitle lines separated by the token <<<SEP>>>.\n"
    "- Output MUST contain the EXACT same number of lines, separated by the EXACT same token <<<SEP>>>.\n"
    "- Output ONLY the corrected lines; no explanations.\n"
    "- Do NOT paraphrase or reorder words.\n"
    "- Do NOT add or remove words.\n"
    "- You MAY replace a clearly garbled/non-word token with a short common word that fits context.\n"
    "- Prefer short common conversational words over rare/formal words.\n"
    "- Preserve casing and punctuation style.\n"
)
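
Both prompts rely on the same line protocol: lines are joined with the separator token, and a reply is only usable if it splits back into the same number of lines. A minimal sketch of that round trip follows; the helper `split_reply` and the two example replies are hypothetical and only illustrate the check.

```python
SEP = "<<<SEP>>>"


def split_reply(reply: str, expected: int):
    """Return the lines of a reply, or None if the line count is off."""
    parts = [" ".join(p.split()) for p in reply.split(SEP)]
    return parts if len(parts) == expected else None


lines = ["i think so", "wear are you", "yeah"]
payload = SEP.join(lines)  # what is sent as "Lines to correct"

good_reply = "i think so<<<SEP>>>where are you<<<SEP>>>yeah"
bad_reply = "Sure! Here are the lines: i think so where are you yeah"

print(split_reply(good_reply, len(lines)))  # three lines, accepted
print(split_reply(bad_reply, len(lines)))   # separator missing, rejected
```

A rejected reply means the original lines are kept, so a chatty or malformed answer cannot shift text between captions.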

## 10 – Token Guard

Filters out replacements that the LLM proposed but that are probably
wrong. Three criteria:
1. **Length brake:** the new word must not be much longer than the original
2. **First-letter check:** if both words start with a letter, the first letter must be the same
3. **Similarity brake:** the character-level similarity must be above a threshold


In [13]:
def char_sim(a: str, b: str) -> float:
    # Character-level similarity of two strings (0.0 = completely different, 1.0 = identical).
    # Uses difflib.SequenceMatcher on the lowercased strings
    return difflib.SequenceMatcher(a=a.lower(), b=b.lower()).ratio()

def word_ok(old: str, new: str, max_extra_chars: int, min_char_sim: float) -> bool:
    if old == new:
        return True # no change, always ok

    # Length brake: expanding to a much longer word is often wrong (e.g. 'um' -> 'understand')
    if len(new) > len(old) + max_extra_chars:
        return False

    # First-letter check: ASR errors are often phonetically similar, so the word start should match
    if old and new and old[0].isalpha() and new[0].isalpha() and old[0].lower() != new[0].lower():
        return False

    # Similarity brake: very dissimilar words are probably hallucinated
    if char_sim(old, new) < min_char_sim:
        return False
    return True

def token_guard_line(orig: str, cand: str,
                     max_extra_chars: int = 6,
                     min_char_sim: float = 0.20) -> str:

    # Applies word_ok to every replacement within a line.
    # Insertions and deletions (the word count changes) are always rejected,
    # because they can affect the WER computation unpredictably.
    o = orig.split()
    c = cand.split()

    # SequenceMatcher on word lists (not characters): finds replacement blocks
    sm = difflib.SequenceMatcher(a=o, b=c)

    out = []
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "equal":
            out.extend(o[i1:i2]) # keep unchanged words
        elif tag == "replace":
            o_seg = o[i1:i2]
            c_seg = c[j1:j2]

            # Only check replacement blocks of equal length (1:1, 2:2, ...)
            # Blocks of unequal length (split/join) are always rejected
            if len(o_seg) == len(c_seg):
                for ow, nw in zip(o_seg, c_seg):
                    out.append(nw if word_ok(ow, nw, max_extra_chars, min_char_sim) else ow)
            else:
                out.extend(o_seg) # keep the original
        elif tag == "delete":
            out.extend(o[i1:i2]) # reject deletions: keep the original words
        elif tag == "insert":
            pass # reject insertions: ignore the new words
    return " ".join(out)
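
The three criteria can be tried on a few hand-picked word pairs. The sketch below repeats a condensed version of the guard so that it runs on its own; `guard_line` is a shortened stand-in for `token_guard_line` with the default thresholds (6 extra characters, similarity 0.20), and the example sentences are invented.

```python
import difflib


def char_sim(a, b):
    return difflib.SequenceMatcher(a=a.lower(), b=b.lower()).ratio()


def word_ok(old, new, max_extra_chars=6, min_char_sim=0.20):
    if old == new:
        return True
    if len(new) > len(old) + max_extra_chars:          # length brake
        return False
    if (old and new and old[0].isalpha() and new[0].isalpha()
            and old[0].lower() != new[0].lower()):     # first-letter check
        return False
    return char_sim(old, new) >= min_char_sim          # similarity brake


def guard_line(orig, cand):
    o, c = orig.split(), cand.split()
    out = []
    for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(a=o, b=c).get_opcodes():
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            out.extend(n if word_ok(w, n) else w for w, n in zip(o[i1:i2], c[j1:j2]))
        else:
            # equal, delete and unequal replace keep the original words;
            # for insert the slice is empty, so nothing is added
            out.extend(o[i1:i2])
    return " ".join(out)


print(word_ok("wether", "whether"))   # similar word, same first letter
print(word_ok("um", "understand"))    # stopped by the length brake
print(word_ok("cat", "hat"))          # stopped by the first-letter check
print(guard_line("i wonder wether the cat sat", "i wonder whether the hat sat"))
print(guard_line("i wonder", "i really wonder"))
```

In the first sentence only `wether -> whether` passes; `cat -> hat` is reverted, and the inserted word `really` in the second sentence is dropped.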

## 11 – WER Computation & VTT Helper Functions

Simple local WER implementation (Levenshtein distance on word lists)
for the optional WER guard in `postprocess_vtt_file`.

In [14]:
def normalize_words(s: str):
    # Normalizes text for the WER computation: lowercase, alphanumeric characters + apostrophe only
    s = s.lower()
    s = re.sub(r"[^a-z0-9'\s]+", " ", s) # remove punctuation
    s = re.sub(r"\s+", " ", s).strip()
    return s.split()

def wer(ref: str, hyp: str) -> float:
    # Computes the word error rate via Levenshtein distance on word lists.
    # WER = edit distance / number of reference words (value in [0, ∞), not capped at 1)
    r = normalize_words(ref)
    h = normalize_words(hyp)

    # Dynamic programming: dp[i][j] = edit distance between r[:i] and h[:j]
    dp = [[0]*(len(h)+1) for _ in range(len(r)+1)]
    for i in range(len(r)+1):
        dp[i][0] = i # deletions
    for j in range(len(h)+1):
        dp[0][j] = j # insertions
    for i in range(1, len(r)+1):
        for j in range(1, len(h)+1):
            cost = 0 if r[i-1] == h[j-1] else 1
            dp[i][j] = min(dp[i-1][j] + 1, dp[i][j-1] + 1, dp[i-1][j-1] + cost)
    return dp[len(r)][len(h)] / max(1, len(r))

def flatten_vtt_text(path: Path) -> str:
    # Reads all caption lines of a VTT file and joins them into one string
    v = webvtt.read(str(path))
    return " ".join([clean_caption_text(c.text) for c in v.captions])
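
A short worked example shows how the WER guard decides. The sketch uses a compact, row-based variant of the same Levenshtein recurrence so that it runs on its own; the reference and the two hypotheses are invented for illustration.

```python
import re


def normalize_words(s):
    s = re.sub(r"[^a-z0-9'\s]+", " ", s.lower())
    return s.split()


def wer(ref, hyp):
    r, h = normalize_words(ref), normalize_words(hyp)
    prev = list(range(len(h) + 1))            # distances for the empty reference prefix
    for i in range(1, len(r) + 1):
        cur = [i] + [0] * len(h)
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost)
        prev = cur
    return prev[len(h)] / max(1, len(r))


ref = "Where are you going today?"
hyp_before = "wear are you going"    # 1 substitution + 1 deletion
hyp_after = "where are you going"    # 1 deletion

w_before = wer(ref, hyp_before)      # 2 edits / 5 reference words
w_after = wer(ref, hyp_after)        # 1 edit / 5 reference words
keep_llm_output = w_after <= w_before
print(w_before, w_after, keep_llm_output)
```

The guard only reverts a file when the WER after correction is strictly higher, so ties keep the LLM output.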

## 12 – Post-Processing Parameters

- `MAX_INPUT_TOKENS`: maximum prompt length in tokens; longer prompts are truncated
- `GEN_MAX_NEW_TOKENS`: maximum output length in new tokens (should scale with `block_size`)
- `CONTEXT_BEFORE/AFTER`: how many neighboring lines are passed along as context

In [None]:
MAX_INPUT_TOKENS = 4096 # the prompt is truncated to this token length if necessary
GEN_MAX_NEW_TOKENS = 256 # enough for 8 corrected lines (about 30 words)

CONTEXT_BEFORE = 2   # 2 already corrected lines as context (before the block)
CONTEXT_AFTER  = 0   # no lookahead lines (would increase runtime)

## 13 – Core Post-Processing Functions

Two functions:
- `_run_one_prompt`: sends a block of subtitle lines to the LLM and returns the corrected lines
- `correct_block`: 2-stage logic + token guard

In [15]:
def _run_one_prompt(lines, model, tokenizer, system_prompt: str,
                    context_before=None, context_after=None):
    # Sends `lines` (list of strings) to the LLM and returns the corrected lines
    # Returns None if the LLM does not output the expected number of lines
    context_before = context_before or []
    context_after  = context_after or []

    # Build the user message: line-count hint, optional context, lines to correct
    user_parts = []
    user_parts.append(
        f"You will receive {len(lines)} subtitle lines separated by the token {SEP}.\n"
        f"Return exactly {len(lines)} lines separated by {SEP}, no extra text.\n"
    )

    if context_before:
        user_parts.append("Context (previous subtitle lines, DO NOT edit):\n")
        user_parts.append(SEP.join(context_before) + "\n")

    user_parts.append("Lines to correct:\n")
    user_parts.append(SEP.join(lines) + "\n")

    if context_after:
        user_parts.append("Context (next subtitle lines, DO NOT edit):\n")
        user_parts.append(SEP.join(context_after) + "\n")

    user_msg = "\n".join(user_parts)

    prompt = build_chat_prompt(tokenizer, system_prompt, user_msg)

    # Tokenize and move to the GPU
    inputs = tokenizer(
        [prompt],
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=MAX_INPUT_TOKENS
    )
    if torch.cuda.is_available():
        inputs = {k: v.cuda() for k, v in inputs.items()}

    # Greedy decoding (do_sample=False): deterministic, no sampling noise
    with torch.inference_mode(): # no gradient tracking needed, saves memory
        out = model.generate(
            **inputs,
            do_sample=False,
            max_new_tokens=GEN_MAX_NEW_TOKENS,
            pad_token_id=tokenizer.eos_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )

    # Decode only the generated part (skip the prompt tokens)
    input_len = inputs["input_ids"].shape[1]
    gen_ids = out[0][input_len:]
    text = tokenizer.decode(gen_ids, skip_special_tokens=True)

    # Clean the output: thinking blocks, BOM, speaker tags
    text = strip_thinking(text).replace("\ufeff", "").strip()

    text = re.sub(
        r"^\s*(?:<\|?(?:user|assistant|system|human)\|?>|user|assistant|system|human)\s*[:\n-]*\s*",
        "",
        text,
        flags=re.IGNORECASE
    ).strip()

    # Split at SEP and clean each part
    parts = [clean_caption_text(p) for p in text.split(SEP)]
    if len(parts) != len(lines):
        return None # wrong number of lines, discard the result
    return parts

def correct_block(lines, model, tokenizer,
                  context_before=None, context_after=None,
                  guard_max_extra_chars=6,
                  guard_min_char_sim=0.20):

    # 2-stage post-processing + token guard for one block of subtitle lines
    # Stage 1: strict
    out1 = _run_one_prompt(
        lines, model, tokenizer, SYSTEM_PROMPT_STRICT,
        context_before=context_before,
        context_after=context_after
    )
    if out1 is None:
        return lines # LLM failure: return the original unchanged

    # Stage 2: relaxed, only if stage 1 made no changes
    if out1 == lines:
        out2 = _run_one_prompt(
            lines, model, tokenizer, SYSTEM_PROMPT_RELAXED,
            context_before=context_before,
            context_after=context_after
        )
        if out2 is not None:
            out1 = out2

    # Token guard: check every proposed replacement individually
    guarded = []
    for orig, cand in zip(lines, out1):
        guarded.append(token_guard_line(
            orig, cand,
            max_extra_chars=guard_max_extra_chars,
            min_char_sim=guard_min_char_sim
        ))
    return guarded
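
The control flow of the two stages can be exercised without a GPU by swapping the model call for a stub. In the sketch below, `two_stage` mirrors the branching of `correct_block` (without the token guard), while `fake_llm` and `broken_llm` are hypothetical stand-ins for `_run_one_prompt`; the returned stage label is added only for illustration.

```python
def two_stage(lines, run_prompt):
    """run_prompt(lines, stage) returns a list of lines, or None on failure."""
    out = run_prompt(lines, "strict")
    if out is None:
        return lines, "none"            # stage 1 failed: keep the original
    if out == lines:                    # stage 1 changed nothing: try relaxed
        relaxed = run_prompt(lines, "relaxed")
        if relaxed is not None:
            return relaxed, "relaxed"
    return out, "strict"


def fake_llm(lines, stage):
    # Hypothetical stub: strict changes nothing, relaxed fixes one word
    if stage == "strict":
        return list(lines)
    return [ln.replace("wear", "where") for ln in lines]


def broken_llm(lines, stage):
    # Hypothetical stub for a reply with the wrong number of lines
    return None


result, stage = two_stage(["wear are you"], fake_llm)
fallback, failed_stage = two_stage(["wear are you"], broken_llm)
print(result, stage)
print(fallback, failed_stage)
```

Note that the relaxed stage is only reached when the strict output is identical to the input; a strict reply with any change, good or bad, goes straight to the token guard.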

## 14 – VTT File Post-Processing

Processes a complete VTT file block by block and writes the result back.
Optional: the WER guard reverts to the original if the LLM worsens the WER.

In [16]:
def copy_input_to_output(session_dir: Path, input_prefix: str, output_prefix: str):
    # Copies the input directory as the basis for post-processing
    # Deletes an existing output directory if present
    src = session_dir / input_prefix
    dst = session_dir / output_prefix
    if not src.exists():
        raise FileNotFoundError(f"Input dir missing: {src}")
    if dst.exists():
        shutil.rmtree(dst) # delete old output for a clean restart
    shutil.copytree(src, dst)  # copy the complete directory
    return src, dst

def postprocess_vtt_file(vtt_in: Path, vtt_out: Path,
                         model, tokenizer,
                         block_size=8,
                         guard_max_extra_chars=6,
                         guard_min_char_sim=0.20,
                         quick_wer_guard=False,
                         label_vtt: Path | None = None,
                         context_before_n: int = 0,
                         context_after_n: int = 0):
    # Post-processes a VTT file block by block
    # block_size=8: 8 subtitle lines per LLM call (trade-off: context vs. speed)
    
    v = webvtt.read(str(vtt_in))
    caps = v.captions

    orig_texts = [clean_caption_text(c.text) for c in caps]
    fixed = []

    for i in range(0, len(orig_texts), block_size):
        block = orig_texts[i:i + block_size]

        # Preceding context comes from already corrected lines (not from the original)
        ctx_before = fixed[-context_before_n:] if context_before_n > 0 else []

        # Following context comes from the original (still uncorrected) lines; no lookahead bias
        ctx_after_start = i + block_size
        ctx_after = orig_texts[ctx_after_start:ctx_after_start + context_after_n] if context_after_n > 0 else []

        fixed_block = correct_block(
            block, model, tokenizer,
            context_before=ctx_before,
            context_after=ctx_after,
            guard_max_extra_chars=guard_max_extra_chars,
            guard_min_char_sim=guard_min_char_sim
        )
        fixed.extend(fixed_block)

    # Optional WER guard: revert the entire file if the LLM output is worse
    if quick_wer_guard and label_vtt is not None and label_vtt.exists():
        ref = flatten_vtt_text(label_vtt)
        hyp_before = " ".join(orig_texts)
        hyp_after  = " ".join(fixed)
        w_before = wer(ref, hyp_before)
        w_after  = wer(ref, hyp_after)
        if w_after > w_before: 
            fixed = orig_texts # revert to the original

    # Write the corrected texts back into the VTT captions and save
    for c, new_t in zip(caps, fixed):
        c.text = new_t

    v.save(str(vtt_out))
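
How the loop slices blocks and context is easiest to see on a toy list. The generator `iter_blocks` below is a hypothetical helper that reproduces only the slicing logic; it carries the unmodified block forward as "corrected" text, whereas the notebook carries the LLM output forward.

```python
def iter_blocks(texts, block_size, n_before, n_after):
    """Yield (context_before, block, context_after) for each block."""
    done = []
    for i in range(0, len(texts), block_size):
        block = texts[i:i + block_size]
        before = done[-n_before:] if n_before > 0 else []
        nxt = i + block_size
        after = texts[nxt:nxt + n_after] if n_after > 0 else []
        yield before, block, after
        done.extend(block)  # stand-in for appending the corrected block


texts = [f"line{k}" for k in range(5)]
windows = list(iter_blocks(texts, block_size=2, n_before=2, n_after=1))
for before, block, after in windows:
    print(before, block, after)
```

The first block has no preceding context and the last block has no following context, which is why the last call can be shorter than `block_size`.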

## 15 – Running the Experiments

For each LLM: load the model → post-process all sessions → unload the model.

In [17]:
BLOCK_SIZE = 8 # lines per LLM call

GUARD_MAX_EXTRA_CHARS = 6     # token guard: max. number of extra characters in a replacement
GUARD_MIN_CHAR_SIM    = 0.20  # token guard: minimum character similarity


USE_QUICK_WER_GUARD = True    # WER guard: revert the file if the LLM output is worse
LABEL_DIR_NAME = "labels"

for exp_key, exp_cfg in EXPERIMENTS.items():
    llm_key = exp_cfg["llm_model"]

    try:
        model, tok = load_llm(llm_key)
    except Exception as e:
        print(f"✗ Could not load {llm_key}: {e}")
        continue

    print(f"\n### Running Experiment {exp_key} with LLM {llm_key} ###")

    for sid in SESSION_IDS:
        session_dir = BASE_DATA_DIR / sid
        out_prefix = f"output_{exp_key}" # e.g. 'output_E38_qwen3_8b'

        print(f"\nSession {sid}: copy {INPUT_PREFIX} -> {out_prefix}")
        try:
            _, out_dir = copy_input_to_output(session_dir, INPUT_PREFIX, out_prefix)
        except Exception as e:
            print("✗ Skip session:", e)
            continue

        # Post-process all VTT files in the output directory
        vtts = sorted(out_dir.glob("*.vtt"))
        if not vtts:
            print("⚠ No VTTs found in", out_dir)
            continue

        label_dir = session_dir / LABEL_DIR_NAME

        for vtt in vtts:
            # Label VTT for the WER guard: same file name in the labels/ directory
            label_vtt = (label_dir / vtt.name) if label_dir.exists() else None
            postprocess_vtt_file(
                vtt, vtt, # in place: vtt_in == vtt_out
                model, tok,
                block_size=BLOCK_SIZE,
                guard_max_extra_chars=GUARD_MAX_EXTRA_CHARS,
                guard_min_char_sim=GUARD_MIN_CHAR_SIM,
                quick_wer_guard=USE_QUICK_WER_GUARD,
                label_vtt=label_vtt,
                context_before_n=CONTEXT_BEFORE, # 2 previously corrected lines
                context_after_n=CONTEXT_AFTER    # no lookahead lines
            )

        print(f"✓ {len(vtts)} VTTs postprocessed in {out_dir}")
        
    # Unload the model from VRAM after the experiment
    unload_llm(llm_key)

print("\nDone: all outputs generated.")


Loading LLM: qwen3_8b -> Qwen/Qwen3-8B (dtype=bfloat16)


Loading checkpoint shards: 100%|██████████| 5/5 [00:03<00:00,  1.39it/s]
The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.



### Running Experiment E38_qwen3_8b with LLM qwen3_8b ###

Session session_40: copy output_E09_bs12_len20 -> output_E38_qwen3_8b



✓ 6 VTTs postprocessed in data-bin/dev_without_central_videos/dev/session_40/output_E38_qwen3_8b

Session session_43: copy output_E09_bs12_len20 -> output_E38_qwen3_8b



✓ 6 VTTs postprocessed in data-bin/dev_without_central_videos/dev/session_43/output_E38_qwen3_8b

Session session_49: copy output_E09_bs12_len20 -> output_E38_qwen3_8b



✓ 6 VTTs postprocessed in data-bin/dev_without_central_videos/dev/session_49/output_E38_qwen3_8b

Session session_50: copy output_E09_bs12_len20 -> output_E38_qwen3_8b



✓ 6 VTTs postprocessed in data-bin/dev_without_central_videos/dev/session_50/output_E38_qwen3_8b

Session session_54: copy output_E09_bs12_len20 -> output_E38_qwen3_8b


✓ 5 VTTs postprocessed in data-bin/dev_without_central_videos/dev/session_54/output_E38_qwen3_8b

Loading LLM: qwen2.5_7b -> Qwen/Qwen2.5-7B-Instruct (dtype=bfloat16)


Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00,  1.29it/s]
The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.



### Running Experiment E39_qwen2.5_7b with LLM qwen2.5_7b ###

Session session_40: copy output_E09_bs12_len20 -> output_E39_qwen2.5_7b


✓ 6 VTTs postprocessed in data-bin/dev_without_central_videos/dev/session_40/output_E39_qwen2.5_7b

Session session_43: copy output_E09_bs12_len20 -> output_E39_qwen2.5_7b


✓ 6 VTTs postprocessed in data-bin/dev_without_central_videos/dev/session_43/output_E39_qwen2.5_7b

Session session_49: copy output_E09_bs12_len20 -> output_E39_qwen2.5_7b


✓ 6 VTTs postprocessed in data-bin/dev_without_central_videos/dev/session_49/output_E39_qwen2.5_7b

Session session_50: copy output_E09_bs12_len20 -> output_E39_qwen2.5_7b


✓ 6 VTTs postprocessed in data-bin/dev_without_central_videos/dev/session_50/output_E39_qwen2.5_7b

Session session_54: copy output_E09_bs12_len20 -> output_E39_qwen2.5_7b


✓ 5 VTTs postprocessed in data-bin/dev_without_central_videos/dev/session_54/output_E39_qwen2.5_7b

Loading LLM: qwen2.5_coder_7b -> Qwen/Qwen2.5-Coder-7B-Instruct (dtype=bfloat16)


Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00,  1.31it/s]
The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.



### Running Experiment E40_qwen2.5_coder_7b with LLM qwen2.5_coder_7b ###

Session session_40: copy output_E09_bs12_len20 -> output_E40_qwen2.5_coder_7b


✓ 6 VTTs postprocessed in data-bin/dev_without_central_videos/dev/session_40/output_E40_qwen2.5_coder_7b

Session session_43: copy output_E09_bs12_len20 -> output_E40_qwen2.5_coder_7b


✓ 6 VTTs postprocessed in data-bin/dev_without_central_videos/dev/session_43/output_E40_qwen2.5_coder_7b

Session session_49: copy output_E09_bs12_len20 -> output_E40_qwen2.5_coder_7b


✓ 6 VTTs postprocessed in data-bin/dev_without_central_videos/dev/session_49/output_E40_qwen2.5_coder_7b

Session session_50: copy output_E09_bs12_len20 -> output_E40_qwen2.5_coder_7b


✓ 6 VTTs postprocessed in data-bin/dev_without_central_videos/dev/session_50/output_E40_qwen2.5_coder_7b

Session session_54: copy output_E09_bs12_len20 -> output_E40_qwen2.5_coder_7b


✓ 5 VTTs postprocessed in data-bin/dev_without_central_videos/dev/session_54/output_E40_qwen2.5_coder_7b

Loading LLM: deepseek_r1_distill_qwen_7b -> deepseek-ai/DeepSeek-R1-Distill-Qwen-7B (dtype=bfloat16)


Loading checkpoint shards: 100%|██████████| 2/2 [00:02<00:00,  1.50s/it]
The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.



### Running Experiment E41_deepseek_r1 with LLM deepseek_r1_distill_qwen_7b ###

Session session_40: copy output_E09_bs12_len20 -> output_E41_deepseek_r1


✓ 6 VTTs postprocessed in data-bin/dev_without_central_videos/dev/session_40/output_E41_deepseek_r1

Session session_43: copy output_E09_bs12_len20 -> output_E41_deepseek_r1


✓ 6 VTTs postprocessed in data-bin/dev_without_central_videos/dev/session_43/output_E41_deepseek_r1

Session session_49: copy output_E09_bs12_len20 -> output_E41_deepseek_r1


✓ 6 VTTs postprocessed in data-bin/dev_without_central_videos/dev/session_49/output_E41_deepseek_r1

Session session_50: copy output_E09_bs12_len20 -> output_E41_deepseek_r1


✓ 6 VTTs postprocessed in data-bin/dev_without_central_videos/dev/session_50/output_E41_deepseek_r1

Session session_54: copy output_E09_bs12_len20 -> output_E41_deepseek_r1


✓ 5 VTTs postprocessed in data-bin/dev_without_central_videos/dev/session_54/output_E41_deepseek_r1

Fertig: Alle Outputs generiert.
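
The warning that recurs throughout the log above is typically emitted by `transformers` when sampling-only parameters (`temperature`, `top_p`, `top_k`) are set in the generation config while decoding runs without sampling. Assuming that is the case here (the generation call itself is not shown in this part of the notebook), one way to quiet it is to set those fields to `None` before calling `generate`. The helper below is a minimal sketch: `clear_sampling_flags` is a hypothetical name, and it works on any object exposing these attributes, for example `model.generation_config`.

```python
from types import SimpleNamespace

SAMPLING_FLAGS = ("temperature", "top_p", "top_k")


def clear_sampling_flags(generation_config, flags=SAMPLING_FLAGS):
    """Set sampling-only fields to None and return the names that were cleared."""
    cleared = []
    for name in flags:
        if getattr(generation_config, name, None) is not None:
            setattr(generation_config, name, None)
            cleared.append(name)
    return cleared


# Stand-in for `model.generation_config`
# (assumption: decoding runs with do_sample=False).
cfg = SimpleNamespace(do_sample=False, temperature=0.7, top_p=0.8, top_k=20)
print(clear_sampling_flags(cfg))
```

Clearing the flags once after loading each model keeps the cell output readable without changing the decoding behavior, since the flags are ignored anyway when sampling is off.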


## 16 – Evaluation & Aggregation

In [18]:
df_dev = append_eval_results_for_experiments(
    experiments=EXPERIMENTS,
    session_ids=SESSION_IDS,
    target_csv="results_dev_subset_by_session.csv",
)


########## Evaluate für session_40 ##########
Starte Evaluate: /home/josch080/Projektgruppe/mcorec_train/bin/python script/evaluate.py --session_dir data-bin/dev_without_central_videos/dev/session_40 --output_dir_name output_ --label_dir_name labels
Evaluating 1 sessions

=== Evaluating session session_40 ===

--- Evaluating output dir: output_E01_bs4_len15 ---
Conversation clustering F1 score: 1.0
Speaker to WER: {'spk_0': 0.564, 'spk_1': 0.4281, 'spk_2': 0.5576, 'spk_3': 0.4283, 'spk_4': 0.4793, 'spk_5': 0.4189}
Speaker clustering F1 score: {'spk_0': 1.0, 'spk_1': 1.0, 'spk_2': 1.0, 'spk_3': 1.0, 'spk_4': 1.0, 'spk_5': 1.0}
Joint ASR-Clustering Error Rate: {'spk_0': 0.282, 'spk_1': 0.21405, 'spk_2': 0.2788, 'spk_3': 0.21415, 'spk_4': 0.23965, 'spk_5': 0.20945}

--- Evaluating output dir: output_E02_bs8_len15 ---
Conversation clustering F1 score: 1.0
Speaker to WER: {'spk_0': 0.561, 'spk_1': 0.4312, 'spk_2': 0.5506, 'spk_3': 0.4283, 'spk_4': 0.5041, 'spk_5': 0.4189}
Speaker clusterin
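
The logged values are consistent with the joint error being the mean of a speaker's WER and their clustering error (1 − F1): for `spk_0` in `output_E01_bs4_len15`, (0.564 + 0) / 2 = 0.282. The sketch below reproduces that relation. The formula is inferred from the log rather than taken from `script/evaluate.py`, and `joint_error` is a hypothetical helper.

```python
def joint_error(wer: float, clustering_f1: float) -> float:
    """Mean of WER and clustering error (1 - F1).

    Inferred from the logged numbers, not copied from evaluate.py.
    """
    return 0.5 * wer + 0.5 * (1.0 - clustering_f1)


# Values copied from the session_40 / output_E01_bs4_len15 log lines.
speaker_wer = {"spk_0": 0.564, "spk_1": 0.4281, "spk_2": 0.5576}
speaker_f1 = {"spk_0": 1.0, "spk_1": 1.0, "spk_2": 1.0}

for spk, wer in speaker_wer.items():
    print(spk, round(joint_error(wer, speaker_f1[spk]), 5))
```

With a clustering F1 of 1.0 for every speaker, the joint error is simply half the WER, which is why the WER and joint-error deltas in the later comparison move in lockstep.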

## 17 – Results Analysis

The LLM results are compared against the BL4 baseline (E09).
Negative delta values indicate an improvement over the baseline.

In [19]:
import pandas as pd

try:
    from IPython.display import display
except ImportError:
    display = print  # fallback outside Jupyter

dev_df = pd.read_csv("results_dev_subset_by_session.csv")

# Redefine EXPERIMENTS in case this cell is run in isolation
EXPERIMENTS = {
    "E38_qwen3_8b": {"llm_model": "qwen3_8b", "description": "Qwen3 8B"},
    "E39_qwen2.5_7b": {"llm_model": "qwen2.5_7b", "description": "Qwen2.5 7B"},
    "E40_qwen2.5_coder_7b": {"llm_model": "qwen2.5_coder_7b", "description": "Qwen2.5 Coder 7B"},
    "E41_deepseek_r1": {"llm_model": "deepseek_r1_distill_qwen_7b", "description": "DeepSeek R1 Distill 7B"},
}
llm_models = list(EXPERIMENTS.keys())
BASELINE_MODEL = "E09_bs12_len20"  # best BL4 configuration

required_cols = ["avg_speaker_wer", "avg_joint_error"]

# Fail early if the CSV lacks the model column or one of the metric columns
missing_cols = [c for c in ["model", *required_cols] if c not in dev_df.columns]
if missing_cols:
    raise ValueError(f"These columns are missing from the CSV: {missing_cols}")

present_models = set(dev_df["model"].unique())

# Warn if models are missing from the CSV (e.g. experiment not run yet)
missing_models = [m for m in (llm_models + [BASELINE_MODEL]) if m not in present_models]
if missing_models:
    print("WARNING: These models were not found in the CSV:", missing_models)

# Average the baseline over all sessions
baseline_agg = (
    dev_df[dev_df["model"] == BASELINE_MODEL]
    .groupby("model")[required_cols]
    .mean()
    .reset_index()
)

if baseline_agg.empty:
    raise ValueError(f"Baseline '{BASELINE_MODEL}' not found in the CSV.")

baseline_wer = float(baseline_agg.loc[0, "avg_speaker_wer"])
baseline_joint = float(baseline_agg.loc[0, "avg_joint_error"])

# Average the LLM experiments over all sessions
llm_agg = (
    dev_df[dev_df["model"].isin(llm_models)]
    .groupby("model")[required_cols]
    .mean()
    .reset_index()
)

# Comparison table: baseline plus all LLM experiments
comp = pd.concat([baseline_agg, llm_agg], ignore_index=True)

comp = comp.rename(columns={
    "avg_speaker_wer": "wer",
    "avg_joint_error": "joint_error",
})

comp["delta_wer"] = comp["wer"] - baseline_wer
comp["delta_joint_error"] = comp["joint_error"] - baseline_joint

# Baseline always on top, the rest sorted by ascending WER
comp["__is_baseline"] = (comp["model"] == BASELINE_MODEL).astype(int)
comp = comp.sort_values(["__is_baseline", "wer"], ascending=[False, True]).drop(columns="__is_baseline")
comp = comp[["model", "wer", "joint_error", "delta_wer", "delta_joint_error"]].reset_index(drop=True)

print(f"Baseline fixed to: {BASELINE_MODEL}")
print("Interpretation: negative deltas = improvement (lower is better).")
display(comp)


Baseline fixed to: E09_bs12_len20
Interpretation: negative deltas = improvement (lower is better).


| | model | wer | joint_error | delta_wer | delta_joint_error |
|---|-------|-----|-------------|-----------|-------------------|
| 0 | E09_bs12_len20 | 0.495416 | 0.323900 | 0.000000 | 0.000000e+00 |
| 1 | E38_qwen3_8b | 0.494813 | 0.323598 | -0.000603 | -3.016667e-04 |
| 2 | E39_qwen2.5_7b | 0.495116 | 0.323750 | -0.000300 | -1.500000e-04 |
| 3 | E41_deepseek_r1 | 0.495416 | 0.323900 | 0.000000 | 5.551115e-17 |
| 4 | E40_qwen2.5_coder_7b | 0.495723 | 0.324053 | 0.000307 | 1.533333e-04 |
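
To put the absolute deltas in perspective, they can be expressed relative to the baseline WER. The following is a small sketch using the aggregated values from the table above; `relative_change` is a hypothetical helper, not part of the notebook.

```python
BASELINE_WER = 0.495416  # E09_bs12_len20, from the table above

wer_by_model = {
    "E38_qwen3_8b": 0.494813,
    "E39_qwen2.5_7b": 0.495116,
    "E41_deepseek_r1": 0.495416,
    "E40_qwen2.5_coder_7b": 0.495723,
}


def relative_change(wer: float, baseline: float = BASELINE_WER) -> float:
    """Relative WER change versus the baseline (negative = improvement)."""
    return (wer - baseline) / baseline


for model, wer in wer_by_model.items():
    print(f"{model}: {relative_change(wer):+.3%}")
```

Even the best run (E38) changes WER by only about 0.12 % relative to the baseline.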


## 18 – Interpretation

| Experiment | Model | WER | Δ WER | Δ Joint Error |
|------------|-------|-----|-------|---------------|
| E09 (baseline) | BL4, beam=12, len=20 | 0.4954 | – | – |
| **E38** | **Qwen3-8B** | **0.4948** | **−0.0006** | **−0.00030** |
| E39 | Qwen2.5-7B-Instruct | 0.4951 | −0.0003 | −0.00015 |
| E41 | DeepSeek-R1-Distill-Qwen-7B | 0.4954 | 0.0000 | 0.00000 |
| E40 | Qwen2.5-Coder-7B-Instruct | 0.4957 | +0.0003 | +0.00015 |

The improvements from LLM postprocessing are **marginal**: the best model lowers the WER by 0.0006 absolute (0.4954 → 0.4948).
The token guard and the WER guard prevent larger regressions,
but they also limit the potential for improvement.
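
How the two guards constrain the LLM can be illustrated with a minimal sketch. It is not the notebook's implementation: the thresholds (`max_len_ratio`, `min_similarity`), the function names, and the use of `difflib.SequenceMatcher` as the similarity measure are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def token_guard(original: str, replacement: str,
                max_len_ratio: float = 1.5,
                min_similarity: float = 0.6) -> bool:
    """Accept a replacement only if first letter, length and spelling stay close."""
    if not original or not replacement:
        return False
    if original[0].lower() != replacement[0].lower():
        return False  # different first letter
    if len(replacement) > max_len_ratio * len(original):
        return False  # too long
    similarity = SequenceMatcher(None, original.lower(), replacement.lower()).ratio()
    return similarity >= min_similarity  # reject if too dissimilar


def wer_guard(original_text: str, corrected_text: str,
              wer_original: float, wer_corrected: float) -> str:
    """Revert the whole file to the original if the correction worsens the WER."""
    return corrected_text if wer_corrected <= wer_original else original_text


print(token_guard("their", "there"))
print(token_guard("cat", "dog"))
print(wer_guard("orig", "fixed", wer_original=0.50, wer_corrected=0.52))
```

Both checks are conservative by design: any rewrite that strays far from the ASR output is rejected, which explains why the net effect on WER stays small in either direction. A WER guard also requires reference transcripts, so it can only be applied where labels are available.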

Qwen3-8B shows the best (albeit small) improvement and is investigated
further in `02g_` with refined parameters.
