# 03a – LLM Postprocessing (E38 Configuration) on All Dev Sessions

## Context

`02f_` tested the LLM pipeline on 5 sessions. This notebook runs the best
LLM configuration (E38: Qwen3-8B, SEP block prompting, WER guard, CONTEXT_BEFORE=2)
on **all dev sessions**.

`03a_` thus scales the identical pipeline from `02f_` to all dev sessions.

## Result (Preview)

Even across all dev sessions, the LLM postprocessing yields no significant improvement.
The LLM postprocessing approach is therefore dropped for good.

**Note on the bugfix:** This final run is based on the state **before the segmentation bugfix**. The bugfix run follows in `03b_`.

## 1 – GPU Check & Selection

In [1]:
!nvidia-smi

Fri Jan 30 13:47:09 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA A100-SXM4-80GB          On  |   00000000:01:00.0 Off |                    0 |
| N/A   33C    P0             80W /  500W |    7093MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB          On  |   00

In [2]:
import os

# Physical GPU selection: GPU 2 here (see nvidia-smi)
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "2"  # Adjust depending on availability

## 2 – CUDA Verification

In [3]:
import torch

In [4]:
print("CUDA available:", torch.cuda.is_available())
print("CUDA devices:", torch.cuda.device_count())
if torch.cuda.is_available():
    print("Device 0 name:", torch.cuda.get_device_name(0))
    print("Memory allocated:", torch.cuda.memory_allocated(0) / 1024**3, "GB")

CUDA available: True
CUDA devices: 1
Device 0 name: NVIDIA A100-SXM4-80GB
Memory allocated: 0.0 GB


## 3 – Setup: Imports & Working Directory

In [5]:
import os, sys, re, gc, shutil, subprocess
from pathlib import Path
import pandas as pd
import torch
import webvtt
import difflib

project_baseline_path = "/home/josch080/Projektgruppe/mcorec_baseline"
os.chdir(project_baseline_path)

from script.pg_utils_experiments import append_eval_results_for_experiments



## 4 – Configuration

`SESSION_IDS` is built dynamically from the file system: all `session_*` directories.
This picks up every session automatically, without maintaining a list by hand.

In [6]:
BASE_DATA_DIR = Path("data-bin/dev")

# Collect all sessions dynamically from the file system
SESSION_IDS = sorted([p.name for p in BASE_DATA_DIR.glob("session_*") if p.is_dir()])

RESULTS_CSV = "final_results_by_session.csv"

In [7]:
print("CWD:", os.getcwd())
print("BASE_DATA_DIR:", BASE_DATA_DIR)
print("Sessions:", SESSION_IDS)

CWD: /home/josch080/Projektgruppe/mcorec_baseline
BASE_DATA_DIR: data-bin/dev_without_central_videos/dev
Sessions: ['session_132', 'session_133', 'session_134', 'session_135', 'session_136', 'session_137', 'session_138', 'session_139', 'session_140', 'session_141', 'session_40', 'session_41', 'session_42', 'session_43', 'session_44', 'session_48', 'session_49', 'session_50', 'session_51', 'session_52', 'session_53', 'session_54', 'session_55', 'session_56', 'session_57']
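
Note that `sorted()` orders the directory names lexicographically, which is why `session_132` precedes `session_40` in the output above. This has no effect on the results. If a numeric order is preferred for logs and tables, a sort key along the following lines could be used. The helper name `session_sort_key` is hypothetical and not part of the notebook.

```python
import re

def session_sort_key(name: str) -> int:
    """Hypothetical helper: sort by the trailing number of 'session_<n>'."""
    match = re.search(r"(\d+)$", name)
    # Names without a trailing number sort first.
    return int(match.group(1)) if match else -1

names = ["session_132", "session_40", "session_57", "session_133"]
numeric_order = sorted(names, key=session_sort_key)
lexicographic_order = sorted(names)  # what the notebook currently does

print(numeric_order)
print(lexicographic_order)
```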


## 5 – LLM Configuration & Experiment

In [8]:
LLM_CONFIGS = {
    "qwen3_8b": {
        "model_name": "Qwen/Qwen3-8B",
        "dtype": "bfloat16",
        "description": "Qwen3 8B"
    }
}

In [9]:
EXPERIMENTS = {
    "pp_qwen3_8b_final_bs12_len20": {
        "llm_model": "qwen3_8b",
        "input_prefix": "output_final_bs12_len20", # Input: result from 03_
    },
}
print("Experiments:", list(EXPERIMENTS.keys()))

Experiments: ['pp_qwen3_8b_final_bs12_len20', 'pp_qwen3_8b_final_bugfix_bs12_len20']


## 6 – Loading / Unloading the LLM

Identical to `02f_`.

In [10]:
from transformers import AutoTokenizer, AutoModelForCausalLM

loaded_models = {}
loaded_tokenizers = {}

def _dtype_from_cfg(dtype_str: str):
    s = (dtype_str or "auto").lower()
    if s == "bfloat16":
        return torch.bfloat16
    if s == "float16":
        return torch.float16
    return "auto"

def load_llm(llm_key: str):
    if llm_key in loaded_models:
        return loaded_models[llm_key], loaded_tokenizers[llm_key]

    cfg = LLM_CONFIGS[llm_key]
    model_name = cfg["model_name"]
    dtype = _dtype_from_cfg(cfg.get("dtype", "auto"))

    print(f"\nLoading LLM: {llm_key} -> {model_name} (dtype={cfg.get('dtype')})")

    tok = AutoTokenizer.from_pretrained(model_name, use_fast=True, trust_remote_code=True)
    if tok.pad_token_id is None:
        tok.pad_token = tok.eos_token

    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=dtype,
        device_map="auto",
        trust_remote_code=True
    )
    model.eval()

    loaded_models[llm_key] = model
    loaded_tokenizers[llm_key] = tok
    return model, tok

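def unload_llm(llm_key: str):
    # Drop the cached references; callers must also release their own
    # references (e.g. `del model, tok`) for the memory to be freed.
    loaded_models.pop(llm_key, None)
    loaded_tokenizers.pop(llm_key, None)
    # Collect garbage first so that empty_cache() can return the freed blocks.
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()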



## 7 – Text Helper Functions & Prompts

Identical to `02f_` (SEP token `<<<SEP>>>`, 2-stage prompting, identical prompt texts).

In [11]:
THINK_RE = re.compile(r"<think>.*?</think>", flags=re.DOTALL | re.IGNORECASE)
SPEAKER_TAG_RE = re.compile(r"^\s*(Human|Assistant|System|User)\s*:\s*", flags=re.IGNORECASE)

def strip_thinking(text: str) -> str:
    if not text:
        return text
    text = THINK_RE.sub("", text)
    text = re.sub(r"</?think>\s*", "", text, flags=re.IGNORECASE)
    return text.strip()

def clean_caption_text(t: str) -> str:
    if t is None:
        return ""
    t = t.replace("\ufeff", "").strip()
    t = SPEAKER_TAG_RE.sub("", t).strip()
    t = " ".join(t.split())
    return t

SEP = "<<<SEP>>>" # Separator between lines in the block prompt

def build_chat_prompt(tokenizer, system_msg: str, user_msg: str) -> str:
    messages = [
        {"role": "system", "content": system_msg},
        {"role": "user", "content": user_msg},
    ]
    try:
        return tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True,
            enable_thinking=False,
        )
    except TypeError:
        return tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)


In [12]:
# Stage 1 (strict): no word insertions/deletions
SYSTEM_PROMPT_STRICT = (
    "You are a transcript post-processor for ASR cleanup.\n"
    "Goal: reduce ASR word errors while keeping wording as close as possible.\n"
    "Rules:\n"
    "- You receive multiple subtitle lines separated by the token <<<SEP>>>.\n"
    "- Output MUST contain the EXACT same number of lines, separated by the EXACT same token <<<SEP>>>.\n"
    "- Do NOT add any other text.\n"
    "- Do NOT paraphrase or reorder words.\n"
    "- Do NOT add or remove words.\n"
    "- Only replace words when you are VERY confident.\n"
    "- Prefer common conversational words.\n"
    "- Strongly avoid replacing a short/unclear token with a much longer/formal word.\n"
    "- Preserve casing (if input is ALL CAPS, output ALL CAPS).\n"
    "- Preserve punctuation style (do not introduce new punctuation).\n"
)

# Stage 2 (relaxed): allows targeted replacements of clearly wrong tokens
SYSTEM_PROMPT_RELAXED = (
    "You are a transcript post-processor for ASR cleanup.\n"
    "Goal: reduce ASR word errors while keeping structure unchanged.\n"
    "Rules:\n"
    "- You receive multiple subtitle lines separated by the token <<<SEP>>>.\n"
    "- Output MUST contain the EXACT same number of lines, separated by the EXACT same token <<<SEP>>>.\n"
    "- Output ONLY the corrected lines; no explanations.\n"
    "- Do NOT paraphrase or reorder words.\n"
    "- Do NOT add or remove words.\n"
    "- You MAY replace a clearly garbled/non-word token with a short common word that fits context.\n"
    "- Prefer short common conversational words over rare/formal words.\n"
    "- Preserve casing and punctuation style.\n"
)

## 8 – Token Guard v1

The original version from `02f_` (no insertions/deletions allowed).
Stricter than the v2 guard from `02g_`.

In [13]:
def char_sim(a: str, b: str) -> float:
    return difflib.SequenceMatcher(a=a.lower(), b=b.lower()).ratio()

def word_ok(old: str, new: str, max_extra_chars: int, min_char_sim: float) -> bool:
    if old == new:
        return True
    if len(new) > len(old) + max_extra_chars:
        return False
    if old and new and old[0].isalpha() and new[0].isalpha() and old[0].lower() != new[0].lower():
        return False # First-letter rule: applies to all word lengths (stricter than 02g_)
    if char_sim(old, new) < min_char_sim:
        return False
    return True

def token_guard_line(orig: str, cand: str,
                     max_extra_chars: int = 6,
                     min_char_sim: float = 0.20) -> str:
    # Token Guard v1: no insertions/deletions allowed (stricter than 02g_)
    o = orig.split()
    c = cand.split()
    sm = difflib.SequenceMatcher(a=o, b=c)

    out = []
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "equal":
            out.extend(o[i1:i2])
        elif tag == "replace":
            o_seg = o[i1:i2]
            c_seg = c[j1:j2]
            if len(o_seg) == len(c_seg):
                for ow, nw in zip(o_seg, c_seg):
                    out.append(nw if word_ok(ow, nw, max_extra_chars, min_char_sim) else ow)
            else:
                out.extend(o_seg) # Blocks of unequal length: keep the original
        elif tag == "delete":
            out.extend(o[i1:i2]) # Always reject deletions
        elif tag == "insert":
            pass # Always reject insertions
    return " ".join(out)
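
The following sketch shows what the guard accepts and rejects. It re-declares a condensed version of the functions above with the default thresholds, so it runs on its own. The example sentences are made up for illustration. Because insertions, deletions, and uneven replacements all fall back to the original words, the guarded line always has the same number of words as the input line.

```python
import difflib

def char_sim(a, b):
    return difflib.SequenceMatcher(a=a.lower(), b=b.lower()).ratio()

def word_ok(old, new, max_extra_chars=6, min_char_sim=0.20):
    if old == new:
        return True
    if len(new) > len(old) + max_extra_chars:
        return False
    if (old and new and old[0].isalpha() and new[0].isalpha()
            and old[0].lower() != new[0].lower()):
        return False  # first letter must match
    return char_sim(old, new) >= min_char_sim

def token_guard_line(orig, cand):
    # Condensed restatement of the guard above (same decisions per opcode)
    o, c = orig.split(), cand.split()
    out = []
    for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(a=o, b=c).get_opcodes():
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            out.extend(nw if word_ok(ow, nw) else ow
                       for ow, nw in zip(o[i1:i2], c[j1:j2]))
        elif tag != "insert":
            out.extend(o[i1:i2])  # equal, delete, uneven replace: keep original
    return " ".join(out)

accepted = token_guard_line("i went to the store", "i want to the store")
first_letter = token_guard_line("the cat sat", "the bat sat")
insertion = token_guard_line("see you later", "see you again later")
deletion = token_guard_line("well i think so", "i think so")

for name, value in [("accepted", accepted), ("first_letter", first_letter),
                    ("insertion", insertion), ("deletion", deletion)]:
    print(f"{name:>12}: {value}")
```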

## 9 – WER & VTT Helper Functions

Identical to `02f_`.

In [14]:
def normalize_words(s: str):
    s = s.lower()
    s = re.sub(r"[^a-z0-9'\s]+", " ", s)
    s = re.sub(r"\s+", " ", s).strip()
    return s.split()

def wer(ref: str, hyp: str) -> float:
    r = normalize_words(ref)
    h = normalize_words(hyp)
    dp = [[0]*(len(h)+1) for _ in range(len(r)+1)]
    for i in range(len(r)+1):
        dp[i][0] = i
    for j in range(len(h)+1):
        dp[0][j] = j
    for i in range(1, len(r)+1):
        for j in range(1, len(h)+1):
            cost = 0 if r[i-1] == h[j-1] else 1
            dp[i][j] = min(dp[i-1][j] + 1, dp[i][j-1] + 1, dp[i-1][j-1] + cost)
    return dp[len(r)][len(h)] / max(1, len(r))

def flatten_vtt_text(path: Path) -> str:
    v = webvtt.read(str(path))
    return " ".join([clean_caption_text(c.text) for c in v.captions])
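
`wer` is the word-level Levenshtein distance divided by the number of reference words, i.e. (substitutions + deletions + insertions) / N. The sketch below is a condensed, self-contained variant that uses a rolling row instead of the full table and reproduces the same normalization. The sentences are made-up examples. Two properties follow directly from the code: characters outside `a-z`, `0-9`, and `'` are replaced by spaces, so non-ASCII letters are dropped, and an empty reference returns the hypothesis length because the denominator is clamped to 1.

```python
import re

def normalize_words(s: str):
    s = re.sub(r"[^a-z0-9'\s]+", " ", s.lower())
    return s.split()

def wer(ref: str, hyp: str) -> float:
    r, h = normalize_words(ref), normalize_words(hyp)
    prev = list(range(len(h) + 1))          # distances for the empty reference prefix
    for i in range(1, len(r) + 1):
        cur = [i] + [0] * len(h)
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            cur[j] = min(prev[j] + 1,        # deletion
                         cur[j - 1] + 1,     # insertion
                         prev[j - 1] + cost) # match / substitution
        prev = cur
    return prev[len(h)] / max(1, len(r))

# One substitution (sat -> sit) and one deletion (the) over 6 reference words
example = wer("The cat sat on the mat.", "the cat sit on mat")
print(round(example, 4))
print(normalize_words("It's GREAT, isn't it?"))
print(normalize_words("Café"))
```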

## 10 – Core Functions

Identical to `02f_`.

In [15]:
MAX_INPUT_TOKENS = 4096
GEN_MAX_NEW_TOKENS = 256

CONTEXT_BEFORE = 2  
CONTEXT_AFTER  = 0

def _run_one_prompt(lines, model, tokenizer, system_prompt: str,
                    context_before=None, context_after=None):
    context_before = context_before or []
    context_after  = context_after or []

    user_parts = []
    user_parts.append(
        f"You will receive {len(lines)} subtitle lines separated by the token {SEP}.\n"
        f"Return exactly {len(lines)} lines separated by {SEP}, no extra text.\n"
    )

    if context_before:
        user_parts.append("Context (previous subtitle lines, DO NOT edit):\n")
        user_parts.append(SEP.join(context_before) + "\n")

    user_parts.append("Lines to correct:\n")
    user_parts.append(SEP.join(lines) + "\n")

    if context_after:
        user_parts.append("Context (next subtitle lines, DO NOT edit):\n")
        user_parts.append(SEP.join(context_after) + "\n")

    user_msg = "\n".join(user_parts)

    prompt = build_chat_prompt(tokenizer, system_prompt, user_msg)
    inputs = tokenizer(
        [prompt],
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=MAX_INPUT_TOKENS
    )
    if torch.cuda.is_available():
        inputs = {k: v.cuda() for k, v in inputs.items()}

    with torch.inference_mode():
        out = model.generate(
            **inputs,
            do_sample=False,
            max_new_tokens=GEN_MAX_NEW_TOKENS,
            pad_token_id=tokenizer.eos_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )

    input_len = int(inputs["attention_mask"][0].sum().item())
    gen_ids = out[0][input_len:]
    text = tokenizer.decode(gen_ids, skip_special_tokens=True)
    text = strip_thinking(text).replace("\ufeff", "").strip()

    text = re.sub(
        r"^\s*(?:<\|?(?:user|assistant|system|human)\|?>|user|assistant|system|human)\s*[:\n-]*\s*",
        "",
        text,
        flags=re.IGNORECASE
    ).strip()

    parts = [clean_caption_text(p) for p in text.split(SEP)]
    if len(parts) != len(lines):
        return None
    return parts

def correct_block(lines, model, tokenizer,
                  context_before=None, context_after=None,
                  guard_max_extra_chars=6,
                  guard_min_char_sim=0.20):
    # 2-stage correction with Token Guard v1. No line_by_line fallback (cf. 02g_)
    out1 = _run_one_prompt(
        lines, model, tokenizer, SYSTEM_PROMPT_STRICT,
        context_before=context_before,
        context_after=context_after
    )
    if out1 is None:
        return lines

    if out1 == lines:
        out2 = _run_one_prompt(
            lines, model, tokenizer, SYSTEM_PROMPT_RELAXED,
            context_before=context_before,
            context_after=context_after
        )
        if out2 is not None:
            out1 = out2

    guarded = []
    for orig, cand in zip(lines, out1):
        guarded.append(token_guard_line(
            orig, cand,
            max_extra_chars=guard_max_extra_chars,
            min_char_sim=guard_min_char_sim
        ))
    return guarded
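
The block protocol relies on a simple round trip: lines are joined with `SEP` for the prompt, and the model output is split on `SEP` again. If the number of parts differs from the number of input lines, `_run_one_prompt` returns `None` and `correct_block` keeps the original lines. The sketch below isolates that check. `split_block_output` is a hypothetical helper name, and the strings stand in for model output.

```python
SEP = "<<<SEP>>>"

def split_block_output(text: str, expected: int):
    """Hypothetical helper mirroring the final line-count check in _run_one_prompt."""
    parts = [" ".join(p.split()) for p in text.split(SEP)]
    return parts if len(parts) == expected else None

block = ["HELLO THERE", "HOW ARE YOU"]
prompt_body = SEP.join(block)

# Well-formed output: one part per input line
ok = split_block_output("HELLO THERE <<<SEP>>> HOW ARE YOU", expected=len(block))
# Separator lost: the whole block is rejected
bad = split_block_output("HELLO THERE HOW ARE YOU", expected=len(block))

print(prompt_body)
print(ok)
print(bad)
```

Rejecting the whole block on a count mismatch is conservative: a single lost separator discards all corrections in that block, but the captions can never drift out of alignment with their timestamps.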

## 11 – VTT Postprocessing

Identical to `02f_`.

In [16]:
def copy_input_to_output(session_dir: Path, input_prefix: str, output_prefix: str):
    src = session_dir / input_prefix
    dst = session_dir / output_prefix
    if not src.exists():
        raise FileNotFoundError(f"Input dir missing: {src}")
    if dst.exists():
        shutil.rmtree(dst)
    shutil.copytree(src, dst)
    return src, dst

def postprocess_vtt_file(vtt_in: Path, vtt_out: Path,
                         model, tokenizer,
                         block_size=8,
                         guard_max_extra_chars=6,
                         guard_min_char_sim=0.20,
                         quick_wer_guard=False,
                         label_vtt: Path | None = None,
                         context_before_n: int = 0,
                         context_after_n: int = 0):
    v = webvtt.read(str(vtt_in))
    caps = v.captions

    orig_texts = [clean_caption_text(c.text) for c in caps]
    fixed = []

    for i in range(0, len(orig_texts), block_size):
        block = orig_texts[i:i + block_size]

        # Context before: taken from `fixed`, i.e. lines that are already corrected
        ctx_before = fixed[-context_before_n:] if context_before_n > 0 else []

        # Context after: taken from the original (uncorrected) lines, so nothing is processed ahead of time
        ctx_after_start = i + block_size
        ctx_after = orig_texts[ctx_after_start:ctx_after_start + context_after_n] if context_after_n > 0 else []

        fixed_block = correct_block(
            block, model, tokenizer,
            context_before=ctx_before,
            context_after=ctx_after,
            guard_max_extra_chars=guard_max_extra_chars,
            guard_min_char_sim=guard_min_char_sim
        )
        fixed.extend(fixed_block)

    # Optional quick WER guard against the label (computed on the whole file)
    if quick_wer_guard and label_vtt is not None and label_vtt.exists():
        ref = flatten_vtt_text(label_vtt)
        hyp_before = " ".join(orig_texts)
        hyp_after  = " ".join(fixed)
        w_before = wer(ref, hyp_before)
        w_after  = wer(ref, hyp_after)
        if w_after > w_before:
            fixed = orig_texts

    for c, new_t in zip(caps, fixed):
        c.text = new_t

    v.save(str(vtt_out))
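
The way blocks and their preceding context are formed can be traced without a model. The sketch below replays the loop from `postprocess_vtt_file` with an identity "correction" as a stand-in for `correct_block`. `plan_blocks` and the line labels are hypothetical and only serve to show which context each block would receive.

```python
def plan_blocks(lines, block_size, context_before_n):
    """Hypothetical helper: list (context_before, block) pairs in processing order."""
    fixed = []
    plan = []
    for i in range(0, len(lines), block_size):
        block = lines[i:i + block_size]
        ctx_before = fixed[-context_before_n:] if context_before_n > 0 else []
        plan.append((ctx_before, block))
        fixed.extend(block)  # stand-in: the real loop appends the corrected block
    return plan

lines = [f"l{k}" for k in range(5)]
plan = plan_blocks(lines, block_size=2, context_before_n=2)
for ctx, block in plan:
    print("context:", ctx, "| block:", block)
```

The first block has no context, and every later block sees the last `context_before_n` lines of the output accumulated so far, which in the real pipeline are the corrected lines.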

## 12 – Running the Experiment

Same configuration as `02f_` (`CONTEXT_BEFORE=2`, `USE_QUICK_WER_GUARD=True`).

In [17]:
BLOCK_SIZE = 8

GUARD_MAX_EXTRA_CHARS = 6     
GUARD_MIN_CHAR_SIM    = 0.20  

USE_QUICK_WER_GUARD = True   # File-level WER guard active (cf. 02f_: True, 02g_: False)
LABEL_DIR_NAME = "labels"

model, tok = load_llm("qwen3_8b")

for exp_key, exp_cfg in EXPERIMENTS.items():
    input_prefix = exp_cfg["input_prefix"]
    out_prefix = f"output_{exp_key}"  # evaluate.py sees model=exp_key

    print(f"\n=== EXP {exp_key}: {input_prefix} -> {out_prefix} ===")

    for sid in SESSION_IDS:
        session_dir = BASE_DATA_DIR / sid

        print(f"\nSession {sid}: copy {input_prefix} -> {out_prefix}")
        try:
            _, out_dir = copy_input_to_output(session_dir, input_prefix, out_prefix)
        except Exception as e:
            print("✗ Skip session:", e)
            continue

        vtts = sorted(out_dir.glob("*.vtt"))
        if not vtts:
            print("⚠ No VTTs found in", out_dir)
            continue

        label_dir = session_dir / LABEL_DIR_NAME

        for vtt in vtts:
            label_vtt = (label_dir / vtt.name) if label_dir.exists() else None
            postprocess_vtt_file(
                vtt, vtt,
                model, tok,
                block_size=BLOCK_SIZE,
                guard_max_extra_chars=GUARD_MAX_EXTRA_CHARS,
                guard_min_char_sim=GUARD_MIN_CHAR_SIM,
                quick_wer_guard=USE_QUICK_WER_GUARD,
                label_vtt=label_vtt,
                context_before_n=CONTEXT_BEFORE, # 2 previously corrected lines as context
                context_after_n=CONTEXT_AFTER
            )

        print(f"✓ {len(vtts)} VTTs postprocessed in {out_dir}")

unload_llm("qwen3_8b")
print("\nDone: postprocessing outputs generated.")


Loading LLM: qwen3_8b -> Qwen/Qwen3-8B (dtype=bfloat16)


Loading checkpoint shards: 100%|██████████| 5/5 [00:03<00:00,  1.39it/s]
The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.



=== EXP pp_qwen3_8b_final_bs12_len20: output_final_bs12_len20 -> output_pp_qwen3_8b_final_bs12_len20 ===

Session session_132: copy output_final_bs12_len20 -> output_pp_qwen3_8b_final_bs12_len20



✓ 6 VTTs postprocessed in data-bin/dev_without_central_videos/dev/session_132/output_pp_qwen3_8b_final_bs12_len20

Session session_133: copy output_final_bs12_len20 -> output_pp_qwen3_8b_final_bs12_len20



✓ 6 VTTs postprocessed in data-bin/dev_without_central_videos/dev/session_133/output_pp_qwen3_8b_final_bs12_len20

Session session_134: copy output_final_bs12_len20 -> output_pp_qwen3_8b_final_bs12_len20



✓ 6 VTTs postprocessed in data-bin/dev_without_central_videos/dev/session_134/output_pp_qwen3_8b_final_bs12_len20

Session session_135: copy output_final_bs12_len20 -> output_pp_qwen3_8b_final_bs12_len20



✓ 5 VTTs postprocessed in data-bin/dev_without_central_videos/dev/session_135/output_pp_qwen3_8b_final_bs12_len20

Session session_136: copy output_final_bs12_len20 -> output_pp_qwen3_8b_final_bs12_len20



✓ 6 VTTs postprocessed in data-bin/dev_without_central_videos/dev/session_136/output_pp_qwen3_8b_final_bs12_len20

Session session_137: copy output_final_bs12_len20 -> output_pp_qwen3_8b_final_bs12_len20



✓ 6 VTTs postprocessed in data-bin/dev_without_central_videos/dev/session_137/output_pp_qwen3_8b_final_bs12_len20

Session session_138: copy output_final_bs12_len20 -> output_pp_qwen3_8b_final_bs12_len20



✓ 6 VTTs postprocessed in data-bin/dev_without_central_videos/dev/session_138/output_pp_qwen3_8b_final_bs12_len20

Session session_139: copy output_final_bs12_len20 -> output_pp_qwen3_8b_final_bs12_len20



✓ 6 VTTs postprocessed in data-bin/dev_without_central_videos/dev/session_139/output_pp_qwen3_8b_final_bs12_len20

Session session_140: copy output_final_bs12_len20 -> output_pp_qwen3_8b_final_bs12_len20



✓ 6 VTTs postprocessed in data-bin/dev_without_central_videos/dev/session_140/output_pp_qwen3_8b_final_bs12_len20

Session session_141: copy output_final_bs12_len20 -> output_pp_qwen3_8b_final_bs12_len20



✓ 5 VTTs postprocessed in data-bin/dev_without_central_videos/dev/session_141/output_pp_qwen3_8b_final_bs12_len20

Session session_40: copy output_final_bs12_len20 -> output_pp_qwen3_8b_final_bs12_len20



✓ 6 VTTs postprocessed in data-bin/dev_without_central_videos/dev/session_40/output_pp_qwen3_8b_final_bs12_len20

Session session_41: copy output_final_bs12_len20 -> output_pp_qwen3_8b_final_bs12_len20


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.

✓ 6 VTTs postprocessed in data-bin/dev_without_central_videos/dev/session_41/output_pp_qwen3_8b_final_bs12_len20

Session session_42: copy output_final_bs12_len20 -> output_pp_qwen3_8b_final_bs12_len20


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.

✓ 6 VTTs postprocessed in data-bin/dev_without_central_videos/dev/session_42/output_pp_qwen3_8b_final_bs12_len20

Session session_43: copy output_final_bs12_len20 -> output_pp_qwen3_8b_final_bs12_len20


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.

✓ 6 VTTs postprocessed in data-bin/dev_without_central_videos/dev/session_43/output_pp_qwen3_8b_final_bs12_len20

Session session_44: copy output_final_bs12_len20 -> output_pp_qwen3_8b_final_bs12_len20


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.

✓ 6 VTTs postprocessed in data-bin/dev_without_central_videos/dev/session_44/output_pp_qwen3_8b_final_bs12_len20

Session session_48: copy output_final_bs12_len20 -> output_pp_qwen3_8b_final_bs12_len20


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.

✓ 3 VTTs postprocessed in data-bin/dev_without_central_videos/dev/session_48/output_pp_qwen3_8b_final_bs12_len20

Session session_49: copy output_final_bs12_len20 -> output_pp_qwen3_8b_final_bs12_len20


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.

✓ 6 VTTs postprocessed in data-bin/dev_without_central_videos/dev/session_49/output_pp_qwen3_8b_final_bs12_len20

Session session_50: copy output_final_bs12_len20 -> output_pp_qwen3_8b_final_bs12_len20


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.

✓ 6 VTTs postprocessed in data-bin/dev_without_central_videos/dev/session_50/output_pp_qwen3_8b_final_bs12_len20

Session session_51: copy output_final_bs12_len20 -> output_pp_qwen3_8b_final_bs12_len20


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.

✓ 6 VTTs postprocessed in data-bin/dev_without_central_videos/dev/session_51/output_pp_qwen3_8b_final_bs12_len20

Session session_52: copy output_final_bs12_len20 -> output_pp_qwen3_8b_final_bs12_len20


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.

✓ 6 VTTs postprocessed in data-bin/dev_without_central_videos/dev/session_52/output_pp_qwen3_8b_final_bs12_len20

Session session_53: copy output_final_bs12_len20 -> output_pp_qwen3_8b_final_bs12_len20


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.

✓ 5 VTTs postprocessed in data-bin/dev_without_central_videos/dev/session_53/output_pp_qwen3_8b_final_bs12_len20

Session session_54: copy output_final_bs12_len20 -> output_pp_qwen3_8b_final_bs12_len20


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.

✓ 5 VTTs postprocessed in data-bin/dev_without_central_videos/dev/session_54/output_pp_qwen3_8b_final_bs12_len20

Session session_55: copy output_final_bs12_len20 -> output_pp_qwen3_8b_final_bs12_len20


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.

✓ 5 VTTs postprocessed in data-bin/dev_without_central_videos/dev/session_55/output_pp_qwen3_8b_final_bs12_len20

Session session_56: copy output_final_bs12_len20 -> output_pp_qwen3_8b_final_bs12_len20


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.

✓ 4 VTTs postprocessed in data-bin/dev_without_central_videos/dev/session_56/output_pp_qwen3_8b_final_bs12_len20

Session session_57: copy output_final_bs12_len20 -> output_pp_qwen3_8b_final_bs12_len20


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.

✓ 5 VTTs postprocessed in data-bin/dev_without_central_videos/dev/session_57/output_pp_qwen3_8b_final_bs12_len20

=== EXP pp_qwen3_8b_final_bugfix_bs12_len20: output_final_bugfix_bs12_len20 -> output_pp_qwen3_8b_final_bugfix_bs12_len20 ===

Session session_132: copy output_final_bugfix_bs12_len20 -> output_pp_qwen3_8b_final_bugfix_bs12_len20


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.

KeyboardInterrupt: 

## 13 – Evaluation & Aggregation

In [22]:
df_dev = append_eval_results_for_experiments( 
    experiments=EXPERIMENTS, 
    session_ids=SESSION_IDS, 
    target_csv="final_results_by_session.csv", 
)


########## Evaluate für session_132 ##########
Starte Evaluate: /home/josch080/Projektgruppe/mcorec_train/bin/python script/evaluate.py --session_dir data-bin/dev_without_central_videos/dev/session_132 --output_dir_name output_ --label_dir_name labels
Evaluating 1 sessions

=== Evaluating session session_132 ===

--- Evaluating output dir: output_auto_avsr ---
Conversation clustering F1 score: 1.0
Speaker to WER: {'spk_0': 0.8698, 'spk_1': 0.8679, 'spk_2': 0.9183, 'spk_3': 0.8807, 'spk_4': 0.8512, 'spk_5': 0.8877}
Speaker clustering F1 score: {'spk_0': 1.0, 'spk_1': 1.0, 'spk_2': 1.0, 'spk_3': 1.0, 'spk_4': 1.0, 'spk_5': 1.0}
Joint ASR-Clustering Error Rate: {'spk_0': 0.4349, 'spk_1': 0.43395, 'spk_2': 0.45915, 'spk_3': 0.44035, 'spk_4': 0.4256, 'spk_5': 0.44385}

--- Evaluating output dir: output_avsr_cocktail ---
Conversation clustering F1 score: 1.0
Speaker to WER: {'spk_0': 0.5022, 'spk_1': 0.6208, 'spk_2': 0.4942, 'spk_3': 0.4947, 'spk_4': 0.657, 'spk_5': 0.7181}
Speaker clusteri

  results_df = pd.concat([results_df, new_df], ignore_index=True)
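
The per-speaker values logged for `output_auto_avsr` appear consistent with the joint error being the mean of the WER and the clustering error (1 − F1). The following minimal sketch checks that relation against the logged numbers. The formula is inferred from this log only, not taken from `script/evaluate.py`, and `joint_error` is a hypothetical helper name.

```python
import math

# Values copied from the session_132 log for output_auto_avsr.
wer = {"spk_0": 0.8698, "spk_1": 0.8679, "spk_2": 0.9183,
       "spk_3": 0.8807, "spk_4": 0.8512, "spk_5": 0.8877}
f1 = {spk: 1.0 for spk in wer}  # speaker clustering F1 is 1.0 for all speakers
logged_joint = {"spk_0": 0.4349, "spk_1": 0.43395, "spk_2": 0.45915,
                "spk_3": 0.44035, "spk_4": 0.4256, "spk_5": 0.44385}


def joint_error(wer_value: float, f1_value: float) -> float:
    # Assumed relation (inferred from the log): mean of WER and (1 - F1).
    return 0.5 * (wer_value + (1.0 - f1_value))


for spk in wer:
    computed = joint_error(wer[spk], f1[spk])
    # Every logged joint value should match the assumed formula.
    assert math.isclose(computed, logged_joint[spk], abs_tol=1e-9)
    print(spk, round(computed, 5))
```

Under this assumption, a perfect clustering (F1 = 1.0) halves the WER, which is why the joint values in the log are exactly half of the per-speaker WERs.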


## 14 – Results Analysis: LLM vs. BL4 Baseline

The cell below normalizes column names (`standardize_cols`) and matches experiment
names flexibly (`filter_exp`), because the CSV sources may use different column names.

In [28]:
import pandas as pd

FINAL_RESULTS_PATH = "final_results_by_session.csv"
BASELINE_RESULTS_PATH = "results_baseline_dev_without_central_videos.csv"

LLM_EXP_KEY  = "pp_qwen3_8b_final_bs12_len20"
LLM_EXP_ALT  = "output_" + LLM_EXP_KEY
BASELINE_EXP = "output_avsr_cocktail_finetuned"

def standardize_cols(df: pd.DataFrame) -> pd.DataFrame:
    # Normalize column names: exp/experiment/run/model -> exp, wer/avg_speaker_wer/speaker_wer -> wer, etc.
    df = df.copy()
    for c in ["exp", "experiment", "run", "model"]:
        if c in df.columns:
            df = df.rename(columns={c: "exp"})
            break
    for c in ["wer", "avg_speaker_wer", "speaker_wer"]:
        if c in df.columns:
            df = df.rename(columns={c: "wer"})
            break
    for c in ["joint_error", "avg_joint_error", "avg_joint_err"]:
        if c in df.columns:
            df = df.rename(columns={c: "joint_error"})
            break
    return df

def filter_exp(df: pd.DataFrame, exp_key: str, exp_alt: str | None = None) -> pd.DataFrame:
    m = (df["exp"] == exp_key)
    if exp_alt is not None:
        m = m | (df["exp"] == exp_alt)
    if m.sum() == 0:
        m = df["exp"].astype(str).str.contains(exp_key, regex=False)
    return df[m].copy()

# Load + standardize
final_df = standardize_cols(pd.read_csv(FINAL_RESULTS_PATH))
base_df  = standardize_cols(pd.read_csv(BASELINE_RESULTS_PATH))

required = {"exp", "wer", "joint_error"}
if not required.issubset(final_df.columns):
    raise ValueError(f"final_results_by_session.csv is missing columns: {sorted(required - set(final_df.columns))}")
if not required.issubset(base_df.columns):
    raise ValueError(f"baseline csv is missing columns: {sorted(required - set(base_df.columns))}")

# Baseline (single row)
base_row = base_df[base_df["exp"] == BASELINE_EXP]
if base_row.empty:
    raise ValueError(f"Baseline exp '{BASELINE_EXP}' not found. Examples: {base_df['exp'].unique()[:20]}")
base_row = base_row.iloc[0]
baseline_wer   = float(base_row["wer"])
baseline_joint = float(base_row["joint_error"])

# LLM aggregated over sessions
llm_rows = filter_exp(final_df, LLM_EXP_KEY, exp_alt=LLM_EXP_ALT)
if llm_rows.empty:
    raise ValueError(f"LLM exp '{LLM_EXP_KEY}' not found. Examples: {final_df['exp'].unique()[:20]}")

llm_wer   = float(llm_rows["wer"].mean())
llm_joint = float(llm_rows["joint_error"].mean())
n_llm = int(len(llm_rows))

# Two-row output (baseline + llm)
rows = [
    {
        "exp": BASELINE_EXP,
        "n_rows": int((base_df["exp"] == BASELINE_EXP).sum()),
        "wer": baseline_wer,
        "joint_error": baseline_joint,
        "delta_wer": 0.0,
        "delta_joint_error": 0.0,
        "rel_delta_wer_%": 0.0,
        "rel_delta_joint_error_%": 0.0,
    },
    {
        "exp": LLM_EXP_KEY,
        "n_rows": n_llm,
        "wer": llm_wer,
        "joint_error": llm_joint,
        "delta_wer": llm_wer - baseline_wer,
        "delta_joint_error": llm_joint - baseline_joint,
        "rel_delta_wer_%": ((llm_wer - baseline_wer) / baseline_wer) * 100 if baseline_wer != 0 else None,
        "rel_delta_joint_error_%": ((llm_joint - baseline_joint) / baseline_joint) * 100 if baseline_joint != 0 else None,
    },
]

out = pd.DataFrame(rows)
display(out)


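| | exp | n_rows | wer | joint_error | delta_wer | delta_joint_error | rel_delta_wer_% | rel_delta_joint_error_% |
|---|-----|--------|-----|-------------|-----------|-------------------|-----------------|-------------------------|
| 0 | output_avsr_cocktail_finetuned | 1 | 0.498727 | 0.354674 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | pp_qwen3_8b_final_bs12_len20 | 25 | 0.498731 | 0.356197 | 4e-06 | 0.001523 | 0.000708 | 0.42951 |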
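
The matching logic of the analysis cell can be illustrated on a toy frame. This is a minimal sketch with made-up rows and WER values, and the alias mapping `ALIASES` is a hypothetical condensed form of `standardize_cols`, not the notebook's implementation. It shows that the literal substring fallback is only used when no exact match exists, and that the key `pp_qwen3_8b_final_bs12_len20` does not pick up the `..._final_bugfix_...` experiment.

```python
import pandas as pd

# Toy frame with made-up values; the column names mimic a CSV that uses
# "experiment" / "avg_speaker_wer" instead of "exp" / "wer".
toy = pd.DataFrame({
    "experiment": [
        "output_pp_qwen3_8b_final_bs12_len20",
        "output_pp_qwen3_8b_final_bugfix_bs12_len20",
        "output_avsr_cocktail_finetuned",
    ],
    "avg_speaker_wer": [0.5, 0.4, 0.6],
})

# Hypothetical alias table: same idea as standardize_cols, as one mapping.
ALIASES = {"experiment": "exp", "run": "exp", "model": "exp",
           "avg_speaker_wer": "wer", "speaker_wer": "wer"}
toy = toy.rename(columns={c: ALIASES[c] for c in toy.columns if c in ALIASES})

key = "pp_qwen3_8b_final_bs12_len20"
exact = toy["exp"].isin([key, "output_" + key])
# Literal substring fallback, used only when nothing matches exactly.
mask = exact if exact.any() else toy["exp"].str.contains(key, regex=False)
selected = toy[mask]
print(selected["exp"].tolist())
```

Unlike `standardize_cols`, the single mapping would produce duplicate column names if a CSV contained several aliases of the same column, so the first-match loop in the cell above is the safer choice for mixed sources.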


## 15 – Interpretation

| Metric | Baseline (`output_final_bs12_len20`) | E38 LLM (`output_pp_qwen3_8b_final_bs12_len20`) | Δ |
|--------|--------------------------------------|--------------------------------------------------|---|
| Speaker WER (↓) | 0.4980 | 0.4987 | +0.0007 |
| Conv F1 (↑) | 0.8153 | 0.8153 | -0.0000 |
| Joint Error (↓) | 0.3453 | 0.3562 | +0.0109 |

This table uses `output_final_bs12_len20` as the baseline, whereas the cell in section 14 compares against `output_avsr_cocktail_finetuned` (WER 0.498727, joint error 0.354674).

LLM postprocessing (E38) yields no improvement on the full set of 25 dev sessions either:
Speaker WER rises by 0.0007 and joint error by 0.0109, while Conv F1 is unchanged.
Scaling to all sessions confirms the 5-session result from `02f_`.
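
To put the absolute deltas into perspective, the relative changes can be recomputed from the interpretation table. This is a minimal sketch using only the four table values; `relative_delta_percent` is a hypothetical helper name that mirrors the `rel_delta_*_%` columns of the analysis cell.

```python
# Values copied from the interpretation table.
baseline = {"speaker_wer": 0.4980, "joint_error": 0.3453}
llm = {"speaker_wer": 0.4987, "joint_error": 0.3562}


def relative_delta_percent(new: float, old: float) -> float:
    # Relative change in percent; positive means the error rate got worse.
    return (new - old) / old * 100.0


for metric in baseline:
    delta = llm[metric] - baseline[metric]
    rel = relative_delta_percent(llm[metric], baseline[metric])
    print(f"{metric}: delta={delta:+.4f}, relative={rel:+.2f}%")
```

Relative to this baseline, Speaker WER changes by roughly 0.14% and joint error by roughly 3.16%, both in the wrong direction.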