# 02e – Experiment 6: Whisper-Flamingo MCoRec Fine-Tuning

## Motivation

`02d` showed that Whisper-Flamingo without in-domain training performs markedly worse than BL4.
One possible reason: the pretrained checkpoint has never seen the MCoRec domain.
This notebook tests whether fine-tuning the Whisper-Flamingo model on MCoRec
can close the gap to BL4.

**Training:** 2,000 steps on the MCoRec training data, LR 1·10⁻⁵,
encoder and video model frozen (only the decoder is meant to be trained).

## Result (Preview)

| Model | WER | Joint Error |
|-------|-----|-------------|
| Whisper-Flamingo (no FT) | ~0.92 | ~0.54 |
| **Whisper-Flamingo fine-tuned** | **~7.10 – 8.65** | **~3.63 – 4.40** |
| BL4 AV-HuBERT (beam=12, len=20) | ~0.495 | ~0.324 |

Fine-tuning **degrades performance**: after training, the model is effectively unusable.
Something may have gone wrong in the training process. The fine-tuning attempt is not pursued further.

**Note on the bugfix:** This run was performed **before the bugfix** in `segmentation.py` (`min_duration_off` mistakenly read the value of `min_duration_on`). This is **intentional**: the bug was only discovered after the LLM and hyperparameter experiments had been completed. Because the bugfix alone initially worsened the WER, the combination of bugfix and `min_duration` tuning that eventually gave the best result was only worked out in `02j_`/`02k_`.

## 1 – GPU Selection

In [3]:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "2"  # makes "our" GPU device 0 inside the process
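
`CUDA_VISIBLE_DEVICES` only takes effect if it is set before CUDA is initialised in the process, so the safe pattern is to set it before `torch` is imported. The following is a minimal sketch of a guard for that; the helper name `select_gpu` is hypothetical and not part of the project code.

```python
import os
import sys
import warnings


def select_gpu(physical_index: int) -> str:
    """Expose exactly one physical GPU to this process (hypothetical helper)."""
    if "torch" in sys.modules:
        # torch may already have initialised CUDA, in which case the
        # new value might be ignored for the rest of the process.
        warnings.warn("torch is already imported; set CUDA_VISIBLE_DEVICES earlier.")
    value = str(physical_index)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


visible = select_gpu(2)  # physical GPU 2 becomes cuda:0 inside the process
print("CUDA_VISIBLE_DEVICES =", visible)
```

Running this as the very first cell keeps the ordering explicit even when cells are later re-executed out of order.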

## 2 – CUDA Verification

In [4]:
import torch
print("n_gpu:", torch.cuda.device_count())
print("current device:", torch.cuda.current_device())
print("device name:", torch.cuda.get_device_name(0))


n_gpu: 1
current device: 0
device name: NVIDIA A100-SXM4-80GB


## 3 – Setup: Working Directory & Paths

In [2]:
import os, sys
project_baseline_path = "/home/josch080/Projektgruppe/mcorec_baseline"
os.chdir(project_baseline_path)
if project_baseline_path not in sys.path:
    sys.path.append(project_baseline_path)

## 4 – Re-check CUDA After Changing the Working Directory

In [6]:
import torch
print("CUDA visible devices:", os.environ.get("CUDA_VISIBLE_DEVICES", "<not set>"))
print("n_gpu:", torch.cuda.device_count())

CUDA visible devices: 2
n_gpu: 1


## 5 – Training Imports

All required classes and functions come from
`script/train_whisper_flamingo_mcorec.py`, a project-internal module
written specifically for this fine-tuning experiment.

In [5]:
# Project-internal fine-tuning module for Whisper-Flamingo on MCoRec:
#   LocalMCoRecVTTDataset    – loads clips from the local MCoRec directory
#   FlamingoCollatorWrapper  – prepares batches (audio, video, tokens)
#   load_whisper_flamingo    – loads model + tokenizer with pretrained weights
#   compute_mels             – converts raw audio into a mel spectrogram (Whisper input format)
#   compute_loss             – computes the cross-entropy loss for decoder tokens
#   device                   – preconfigured torch.device (cuda:0 or cpu)
from script.train_whisper_flamingo_mcorec import (
    LocalMCoRecVTTDataset,
    FlamingoCollatorWrapper,
    load_whisper_flamingo,
    compute_mels,
    compute_loss,
    device,
)



## 6 – Fine-Tuning Loop

Only the decoder is meant to be trained; the encoder and the video model are frozen
via `requires_grad = False`. This reduces memory requirements and prevents already
learned audio-visual representations from being overwritten.

In [6]:
from torch.optim import AdamW
from torch.utils.data import DataLoader


# Dataset
# Load clips from the MCoRec training split (without 'central videos')
train_root = "data-bin/train"

train_ds = LocalMCoRecVTTDataset(
    train_root=train_root,
    max_samples=None,        # None = use all clips
)

print("Anzahl Clips im Dataset:", len(train_ds))
print("Beispiel 0:", train_ds[0]["text"], train_ds[0]["start_time"], train_ds[0]["end_time"], train_ds[0]["video"])

# Model + tokenizer
model, tokenizer = load_whisper_flamingo()

# pad_id: token ID used for padding; falls back to the EOT token if no explicit padding token is defined
pad_id = getattr(tokenizer, "pad_id", tokenizer.eot)

# Freeze encoder and video model: requires_grad=False disables gradient computation
# → only decoder weights are supposed to be updated during training
for name, p in model.named_parameters():
    if name.startswith("encoder.") or "video_model" in name:
        p.requires_grad = False

trainable_params = [p for p in model.parameters() if p.requires_grad]
print("Trainierbare Parameter:", sum(p.numel() for p in trainable_params))

# Collator + DataLoader
# FlamingoCollatorWrapper packs clips into batches: audio → tensor, video → tensor, text → token IDs
collator = FlamingoCollatorWrapper(tokenizer)

train_loader = DataLoader(
    train_ds,
    batch_size=1,     # clips vary strongly in length → no sensible padding
    shuffle=True,     # random order for better generalisation
    num_workers=4,    # parallel preprocessing on CPU cores
    collate_fn=collator,
    pin_memory=True,  # faster CPU→GPU transfer through pinned memory
)

# Optimizer
# AdamW: Adam with decoupled weight decay (acts as regularisation)
# Only the trainable parameters are passed in (encoder/video model are frozen)
optimizer = AdamW(trainable_params, lr=1e-5, weight_decay=0.01)

# Training loop
max_steps = 2000  # number of actual gradient updates (skipped batches do not count)
log_every = 50    # print the loss every 50 steps

step = 0
model.train()  # switch dropout and BatchNorm to training mode

# Outer epoch loop: runs until max_steps is reached
for epoch in range(999999):  # effectively unbounded – terminated by the break below
    for batch in train_loader:
        if step >= max_steps:
            break

        # Move tensors to the GPU; non_blocking=True allows asynchronous transfer
        audios = batch.audios.to(device, non_blocking=True)
        videos = batch.videos.to(device, non_blocking=True)
        tokens = batch.tokens.to(device, non_blocking=True)

        # Skip clips whose transcription is too short (e.g. empty labels)
        # tokens.size(1) <= 1 means: only the start token, no actual text
        if tokens.size(1) <= 1:
            # Optional for debugging:
            # print(f"skip step {step} – nur {tokens.size(1)} Token, text={batch.text[0]!r}")
            continue

        # Convert audio into a mel spectrogram (Whisper input format: 80 mel bands)
        mel = compute_mels(audios)
        if mel is None:
            # Segment without usable audio (e.g. silence) → skip it
            # without incrementing step, so that max_steps counts real updates
            continue

        # Cross-entropy loss over all decoder output tokens
        loss = compute_loss(model, mel, videos, tokens, pad_id)

        # Standard backpropagation:
        optimizer.zero_grad(set_to_none=True)  # set_to_none=True saves a little memory compared to zeroing
        loss.backward()  # compute gradients
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clipping: guards against exploding gradients
        optimizer.step()  # update weights

        if step % log_every == 0:
            print(f"step {step} | loss {loss.item():.4f} | text[0]={batch.text[0]!r}")

        step += 1

    if step >= max_steps:
        break

# Save checkpoint
# state_dict(): model weights only (no optimizer state) – sufficient for inference
save_dir = os.path.join(project_baseline_path, "model-bin", "whisper_flamingo_mcorec_finetune")
os.makedirs(save_dir, exist_ok=True)  # exist_ok=True: no error if the folder already exists
ckpt_path = os.path.join(save_dir, f"pytorch_model_step{step}.bin")
torch.save(model.state_dict(), ckpt_path)
print("Checkpoint gespeichert unter:", ckpt_path)


[LocalMCoRecVTTDataset] Built 19007 clip-items from data-bin/dev_without_central_videos/train
Anzahl Clips im Dataset: 19007
Beispiel 0: Okay. 28.843 29.581 data-bin/dev_without_central_videos/train/session_00/speakers/spk_1/ego_video.mp4
Loading base Whisper model (large-v2) with AV frontend...


  checkpoint = torch.load(fp, map_location=device)


Whisper dropout rate : 0.0
Loading AV-HuBERT encoder


  state = torch.load(f, map_location=torch.device("cpu"))
2025-12-22 19:39:08 | INFO | avhubert.hubert_pretraining | current directory is /home/josch080/Projektgruppe/mcorec_baseline
2025-12-22 19:39:08 | INFO | avhubert.hubert_pretraining | AVHubertPretrainingTask Config {'_name': 'av_hubert_pretraining', 'data': '/checkpoint/bshi/data/lrs3//video/wav/all_tsv/', 'input_modality': '???', 'labels': ['wrd'], 'label_dir': '/checkpoint/bshi/data/lrs3//exp/ls-hubert/tune-modality/all_bpe/unigram1000/', 'label_rate': -1, 'sample_rate': 16000, 'normalize': True, 'enable_padding': False, 'max_sample_size': 500, 'min_sample_size': None, 'max_trim_sample_size': '${task.max_sample_size}', 'single_target': True, 'random_crop': False, 'pad_audio': True, 'pdb': False, 'stack_order_audio': 4, 'skip_verify': False, 'image_aug': True, 'image_crop_size': 88, 'image_mean': 0.421, 'image_std': 0.165, 'modalities': ['audio', 'video'], 'is_s2s': True, 'tokenizer_bpe_name': 'sentencepiece', 'tokenizer_bpe_mo

Using AV-HuBERT encoder with parameters: 325136104
Adding gated x attn layers  [×32]
Loading Flamingo checkpoint: /home/josch080/Projektgruppe/mcorec_baseline/model-bin/w

  sd = torch.load(ckpt, map_location="cpu")


Trainierbare Parameter: 1536157504
step 0 | loss 22.2507 | text[0]='Huhahahuh'
step 50 | loss 6.0098 | text[0]='So they did.'
step 100 | loss 7.0774 | text[0]="The bigger the noodles, the smaller, the smaller, the smaller. And that's all they had. You know."
step 150 | loss 1.8102 | text[0]='Yeah.'
step 200 | loss 8.2846 | text[0]="We chose the wrong topic. I'm getting hungry."
step 250 | loss 5.8487 | text[0]='Yeah yeah, He, he only came out like in the middle of the movie. No, no, like middle of the movie'
step 300 | loss 5.0868 | text[0]='No, actually. Yeah, lucky. Yeah. Absolutely.'
step 350 | loss 2.6313 | text[0]='Yeah, yeah.'
step 400 | loss 9.0179 | text[0]='In Sendai area.'
step 450 | loss 5.9199 | text[0]="I'm vaccinated for a lot of stuff, but I was out."
step 500 | loss 5.3767 | text[0]='before then yeah, I would take my... YouTube. YouTube was my source of like'
step 550 | loss 6.5959 | text[0]='And I like that scene. And ohhh'
step 600 | loss 2.7084 | text[0]='Oh...'
step
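
The log above reports 1,536,157,504 trainable parameters after the freeze. That is close to the roughly 1.55 B parameters that Whisper large models are documented with in total, so it is worth checking whether the freeze actually matched anything. If the parameter names in the wrapped model carry a different prefix than `encoder.`, the test `name.startswith("encoder.")` matches nothing and the encoder stays trainable without any error. Whether that happened here cannot be determined from the log alone. The sketch below shows one way to check; `FakeParam` and the parameter names are hypothetical stand-ins so that the example runs without loading the model.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class FakeParam:
    """Stand-in for torch.nn.Parameter so the sketch runs without torch."""
    n: int
    requires_grad: bool = True

    def numel(self) -> int:
        return self.n


def trainable_by_prefix(named_params, depth=1):
    """Sum (trainable, total) element counts per leading name component(s)."""
    stats = defaultdict(lambda: [0, 0])  # prefix -> [trainable, total]
    for name, p in named_params:
        prefix = ".".join(name.split(".")[:depth])
        stats[prefix][1] += p.numel()
        if p.requires_grad:
            stats[prefix][0] += p.numel()
    return {k: tuple(v) for k, v in stats.items()}


# Hypothetical names: the real model may or may not nest modules like this.
params = [
    ("model.encoder.conv1.weight", FakeParam(100)),
    ("model.decoder.blocks.0.attn.weight", FakeParam(200)),
]

# Same freezing rule as in the training cell.
for name, p in params:
    if name.startswith("encoder.") or "video_model" in name:
        p.requires_grad = False

summary = trainable_by_prefix(params, depth=2)
for prefix, (trainable, total) in summary.items():
    print(f"{prefix}: {trainable}/{total} trainable")
```

In this constructed case the encoder remains fully trainable because its name starts with `model.` rather than `encoder.`. With the real model, `trainable_by_prefix(model.named_parameters())` would give the same kind of per-module breakdown, since PyTorch parameters expose `requires_grad` and `numel()`.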

## 7 – Model & Experiment Definitions for Inference

In [6]:
# Fine-tuned checkpoint – step 2000 is the final checkpoint of the training run above
MODELS = {
    "whisper_flamingo_large_ft": {
        "model_type": "whisper_flamingo",
        "chkpt": "model-bin/whisper_flamingo_mcorec_finetune/pytorch_model_step2000.bin",
    }
}

## 8 – Sessions: Stepwise Narrowing (Debugging)

During development, the inference run was narrowed step by step to fewer and fewer
sessions in order to keep the runtime under control and to produce results without triggering an out-of-memory error. The same 5 sessions as in the previous experiments were used.

The four `SESSION_IDS` definitions overwrite one another; whichever cell was executed last is active. For the full run, execute cell 8a.

In [11]:
# 8a – Full 5-session subset (default for comparability)
SESSION_IDS = ["session_40", "session_43", "session_49", "session_50", "session_54"]

In [6]:
SESSION_IDS = ["session_49", "session_50", "session_54"]

In [7]:
SESSION_IDS = ["session_50"]

In [7]:
SESSION_IDS = ["session_54"]
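
Because the active subset depends on which of the four cells ran last, a single cell with named subsets may be less error-prone. This is only a sketch of that idea; the names `SESSION_SUBSETS`, `ACTIVE_SUBSET` and the subset keys are made up for illustration.

```python
# Named subsets instead of four cells that overwrite each other.
SESSION_SUBSETS = {
    "full5": ["session_40", "session_43", "session_49", "session_50", "session_54"],
    "last3": ["session_49", "session_50", "session_54"],
    "only50": ["session_50"],
    "only54": ["session_54"],
}

ACTIVE_SUBSET = "full5"  # change this one value to switch subsets
SESSION_IDS = SESSION_SUBSETS[ACTIVE_SUBSET]
print(ACTIVE_SUBSET, "->", SESSION_IDS)
```

The active choice is then visible in one place and can be logged alongside the results.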

## 9 – Experiment Grid (E28–E37)

The grid mirrors the one used in `02d_` for Whisper-Flamingo without fine-tuning,
which allows a direct FT vs. non-FT comparison under identical hyperparameters.

In [9]:
EXPERIMENTS = {
    # 15 seconds
    "E28_whisper_flamingo_ft_bs4_len15": {
        "base_model": "whisper_flamingo_large_ft",
        "beam_size": 4,
        "max_length": 15,
        "comment": "Whisper-Flamingo Finetune, beam=4, len=15",
    },

    "E29_whisper_flamingo_ft_bs8_len15": {
        "base_model": "whisper_flamingo_large_ft",
        "beam_size": 8,
        "max_length": 15,
        "comment": "Whisper-Flamingo Finetune, beam=8, len=15",
    },

    "E30_whisper_flamingo_ft_bs12_len15": {
        "base_model": "whisper_flamingo_large_ft",
        "beam_size": 12,
        "max_length": 15,
        "comment": "Whisper-Flamingo Finetune, beam=12, len=15",
    },

    # 18 seconds
    "E31_whisper_flamingo_ft_bs4_len18": {
        "base_model": "whisper_flamingo_large_ft",
        "beam_size": 4,
        "max_length": 18,
        "comment": "Whisper-Flamingo Finetune, beam=4, len=18",
    },

    "E32_whisper_flamingo_ft_bs8_len18": {
        "base_model": "whisper_flamingo_large_ft",
        "beam_size": 8,
        "max_length": 18,
        "comment": "Whisper-Flamingo Finetune, beam=8, len=18",
    },

    "E33_whisper_flamingo_ft_bs12_len18": {
        "base_model": "whisper_flamingo_large_ft",
        "beam_size": 12,
        "max_length": 18,
        "comment": "Whisper-Flamingo Finetune, beam=12, len=18",
    },

    # 20 seconds
    "E34_whisper_flamingo_ft_bs4_len20": {
        "base_model": "whisper_flamingo_large_ft",
        "beam_size": 4,
        "max_length": 20,
        "comment": "Whisper-Flamingo Finetune, beam=4, len=20",
    },

    "E35_whisper_flamingo_ft_bs8_len20": {
        "base_model": "whisper_flamingo_large_ft",
        "beam_size": 8,
        "max_length": 20,
        "comment": "Whisper-Flamingo Finetune, beam=8, len=20",
    },

    "E36_whisper_flamingo_ft_bs12_len20": {
        "base_model": "whisper_flamingo_large_ft",
        "beam_size": 12,
        "max_length": 20,
        "comment": "Whisper-Flamingo Finetune, beam=12, len=20",
    },

    "E37_whisper_flamingo_ft_bs8_len10": {
        "base_model": "whisper_flamingo_large_ft",
        "beam_size": 8,
        "max_length": 10,
        "comment": "Whisper-Flamingo Finetune, beam=8, len=10",
    },
}
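
The grid is a regular 3 × 3 sweep over `max_length` and `beam_size` plus one extra run, so it could also be generated instead of written out by hand. The sketch below reproduces the same keys and values; the names `make_entry`, `FIRST_ID` and `EXPERIMENTS_GENERATED` are hypothetical and chosen so that the hand-written `EXPERIMENTS` dict is not overwritten.

```python
from itertools import product

BASE_MODEL = "whisper_flamingo_large_ft"
FIRST_ID = 28  # the grid starts at E28


def make_entry(beam_size, max_length):
    """Build one experiment entry in the same shape as the EXPERIMENTS dict."""
    return {
        "base_model": BASE_MODEL,
        "beam_size": beam_size,
        "max_length": max_length,
        "comment": f"Whisper-Flamingo Finetune, beam={beam_size}, len={max_length}",
    }


# 3 x 3 grid (length varies slowest, beam fastest) plus the single len=10 run.
configs = [
    (beam, length) for length, beam in product((15, 18, 20), (4, 8, 12))
] + [(8, 10)]

EXPERIMENTS_GENERATED = {
    f"E{FIRST_ID + i}_whisper_flamingo_ft_bs{beam}_len{length}": make_entry(beam, length)
    for i, (beam, length) in enumerate(configs)
}

for name in EXPERIMENTS_GENERATED:
    print(name)
```

Generating the dict keeps the experiment name, `beam_size`, `max_length` and the `comment` string from drifting apart when entries are copied and edited.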


## 10 – Load Inference Utilities & Start Inference

10 experiments × active `SESSION_IDS`.

In [7]:
from script.pg_utils_experiments import run_inference_for_experiment, run_eval_and_log, append_eval_results_for_experiments



In [None]:
for sid in SESSION_IDS:
    session_dir = f"data-bin/dev/{sid}"
    print(f"\n########## Starte Experimente für {sid} ##########")

    for exp_name in EXPERIMENTS:
        run_inference_for_experiment(
            exp_name=exp_name,
            base_models=MODELS,
            experiments=EXPERIMENTS,
            session_dir=session_dir,
        )


########## Starte Experimente für session_54 ##########

Starte Inference für Experiment: E28_whisper_flamingo_ft_bs4_len15
  base_model      = whisper_flamingo_large_ft
  model_type      = whisper_flamingo
  checkpoint_path = model-bin/whisper_flamingo_mcorec_finetune/pytorch_model_step2000.bin
  beam_size       = 4
  max_length      = 15
  output_dir_name = output_E28_whisper_flamingo_ft_bs4_len15
  session_dir     = data-bin/dev_without_central_videos/dev/session_54
  comment         = Whisper-Flamingo Finetune, beam=4, len=15
Loading whisper_flamingo model...
Loading Whisper-Flamingo AV model (large-v2) on cuda


  checkpoint = torch.load(fp, map_location=device)


Whisper dropout rate : 0.0
Loading AV-HuBERT encoder


  state = torch.load(f, map_location=torch.device("cpu"))
2025-12-28 12:23:11 | INFO | avhubert.hubert_pretraining | current directory is /home/josch080/Projektgruppe/mcorec_baseline
2025-12-28 12:23:11 | INFO | avhubert.hubert_pretraining | AVHubertPretrainingTask Config {'_name': 'av_hubert_pretraining', 'data': '/checkpoint/bshi/data/lrs3//video/wav/all_tsv/', 'input_modality': '???', 'labels': ['wrd'], 'label_dir': '/checkpoint/bshi/data/lrs3//exp/ls-hubert/tune-modality/all_bpe/unigram1000/', 'label_rate': -1, 'sample_rate': 16000, 'normalize': True, 'enable_padding': False, 'max_sample_size': 500, 'min_sample_size': None, 'max_trim_sample_size': '${task.max_sample_size}', 'single_target': True, 'random_crop': False, 'pad_audio': True, 'pdb': False, 'stack_order_audio': 4, 'skip_verify': False, 'image_aug': True, 'image_crop_size': 88, 'image_mean': 0.421, 'image_std': 0.165, 'modalities': ['audio', 'video'], 'is_s2s': True, 'tokenizer_bpe_name': 'sentencepiece', 'tokenizer_bpe_mo

Using AV-HuBERT encoder with parameters: 325136104
Adding gated x attn layers  [×32]
Loading Whisper-Flamingo checkpoint weights from model-bin/whisper_flamingo_mcorec_fi

  sd = torch.load(ckpt, map_location="cpu")


whisper_flamingo model loaded successfully!
Inferring 1 sessions using whisper_flamingo model
Processing session session_54


Processing speakers:   0%|          | 0/5 [00:00<?, ?it/s]





[Acessing speaker spk_0 track 1 of 2:   0%|          | 0/27 [00:00<?, ?it/s]
[Acessing speaker spk_0 track 1 of 2:   4%|▎         | 1/27 [00:30<13:14, 30.57s/it]
[Acessing speaker spk_0 track 1 of 2:   7%|▋         | 2/27 [00:57<11:47, 28.29s/it]
[Acessing speaker spk_0 track 1 of 2:  11%|█         | 3/27 [01:22<10:41, 26.73s/it]
[Acessing speaker spk_0 track 1 of 2:  15%|█▍        | 4/27 [01:48<10:08, 26.46s/it]
[Acessing speaker spk_0 track 1 of 2:  19%|█▊        | 5/27 [02:16<09:55, 27.08s/it]
[Acessing speaker spk_0 track 1 of 2:  22%|██▏       | 6/27 [02:44<09:37, 27.48s/it]
[Acessing speaker spk_0 track 1 of 2:  26%|██▌       | 7/27 [03:13<09:15, 27.77s/it]
[Acessing speaker spk_0 track 1 of 2:  30%|██▉       | 8/27 [03:40<08:45, 27.63s/it]
[Acessing speaker spk_0 track 1 of 2:  33%|███▎      | 9/27 [04:06<08:08, 27.15s/it]
[Acessing speaker spk_0 track 1 of 2:  37%|███▋      | 10/27 [04:32<07:37, 26.93s/it]
[Acessing speaker spk_0 track 1 of 2:  41%|████      | 11/2





[Acessing speaker spk_1 track 1 of 1:   0%|          | 0/29 [00:00<?, ?it/s]
[Acessing speaker spk_1 track 1 of 1:   3%|▎         | 1/29 [00:29<13:35, 29.12s/it]
[Acessing speaker spk_1 track 1 of 1:   7%|▋         | 2/29 [00:59<13:23, 29.78s/it]

## 11 – Evaluation & Aggregation

In [12]:
df_dev = append_eval_results_for_experiments(
    experiments=EXPERIMENTS,
    session_ids=SESSION_IDS,
    target_csv="results_dev_subset_by_session.csv",
)



########## Evaluate für session_40 ##########
Starte Evaluate: /home/josch080/Projektgruppe/mcorec_wf/bin/python script/evaluate.py --session_dir data-bin/dev_without_central_videos/dev/session_40 --output_dir_name output_ --label_dir_name labels
Evaluating 1 sessions

=== Evaluating session session_40 ===

--- Evaluating output dir: output_E01_bs4_len15 ---
Conversation clustering F1 score: 1.0
Speaker to WER: {'spk_0': 0.564, 'spk_1': 0.4281, 'spk_2': 0.5576, 'spk_3': 0.4283, 'spk_4': 0.4793, 'spk_5': 0.4189}
Speaker clustering F1 score: {'spk_0': 1.0, 'spk_1': 1.0, 'spk_2': 1.0, 'spk_3': 1.0, 'spk_4': 1.0, 'spk_5': 1.0}
Joint ASR-Clustering Error Rate: {'spk_0': 0.282, 'spk_1': 0.21405, 'spk_2': 0.2788, 'spk_3': 0.21415, 'spk_4': 0.23965, 'spk_5': 0.20945}

--- Evaluating output dir: output_E02_bs8_len15 ---
Conversation clustering F1 score: 1.0
Speaker to WER: {'spk_0': 0.561, 'spk_1': 0.4312, 'spk_2': 0.5506, 'spk_3': 0.4283, 'spk_4': 0.5041, 'spk_5': 0.4189}
Speaker clustering F
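
The logged values hint at how the joint error relates to WER and speaker clustering F1: for `session_40` every speaker has a clustering F1 of 1.0, and the joint error is exactly half the WER (for example 0.564 → 0.282). One formula consistent with this is `0.5 · WER + 0.5 · (1 − F1)`. This is inferred from the log, not read from `script/evaluate.py`, so it should be treated as an assumption. A small check against the logged numbers:

```python
def joint_error(wer: float, clustering_f1: float) -> float:
    """Assumed form: equal-weight mean of WER and clustering error (1 - F1)."""
    return 0.5 * wer + 0.5 * (1.0 - clustering_f1)


# (WER, clustering F1, logged joint error), copied from the
# session_40 / output_E01_bs4_len15 log above.
logged = {
    "spk_0": (0.564, 1.0, 0.282),
    "spk_1": (0.4281, 1.0, 0.21405),
    "spk_4": (0.4793, 1.0, 0.23965),
}

for spk, (wer, f1, expected) in logged.items():
    got = joint_error(wer, f1)
    print(f"{spk}: computed={got:.5f} logged={expected:.5f}")
```

If the assumption holds, the joint error can never fall below half the WER. The aggregated tables in section 12 fit this picture: the gap between `avg_joint_error` and half of `avg_speaker_wer` is about 0.076 for the fine-tuned model, the non-fine-tuned model and BL4 alike, which would be expected if the clustering term does not depend on the ASR model.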

## 12 – Result Analysis: FT vs. Non-FT vs. Baselines

Four levels of comparison:
1. Whisper-Flamingo FT internally (E28–E37)
2. Whisper-Flamingo without FT (E18–E27) from `02d_`
3. Reference baselines (Whisper audio-only + BL4)
4. Overall comparison of all four model families

In [5]:
import pandas as pd

# Read the combined results
dev_df = pd.read_csv("results_dev_subset_by_session.csv")

# Whisper-Flamingo-FT (E28–E37)
whisper_flamingo_ft_models = [
    "E28_whisper_flamingo_ft_bs4_len15",
    "E29_whisper_flamingo_ft_bs8_len15",
    "E30_whisper_flamingo_ft_bs12_len15",
    "E31_whisper_flamingo_ft_bs4_len18",
    "E32_whisper_flamingo_ft_bs8_len18",
    "E33_whisper_flamingo_ft_bs12_len18",
    "E34_whisper_flamingo_ft_bs4_len20",
    "E35_whisper_flamingo_ft_bs8_len20",
    "E36_whisper_flamingo_ft_bs12_len20",
    "E37_whisper_flamingo_ft_bs8_len10",
]

wf_ft_df = (
    dev_df[dev_df["model"].isin(whisper_flamingo_ft_models)]
    .groupby("model")[["avg_speaker_wer", "avg_joint_error"]]
    .mean()
    .reset_index()
)

print("Whisper-Flamingo *Finetune* (aggregiert über dev-Subset):")
display(wf_ft_df.sort_values("avg_joint_error"))

# Whisper-Flamingo without FT (E18–E27) – for direct comparison
whisper_flamingo_models = [
    "E18_whisper_flamingo_bs4_len15",
    "E19_whisper_flamingo_bs8_len15",
    "E20_whisper_flamingo_bs12_len15",
    "E21_whisper_flamingo_bs4_len18",
    "E22_whisper_flamingo_bs8_len18",
    "E23_whisper_flamingo_bs12_len18",
    "E24_whisper_flamingo_bs4_len20",
    "E25_whisper_flamingo_bs8_len20",
    "E26_whisper_flamingo_bs12_len20",
    "E27_whisper_flamingo_bs8_len10",
]

wf_plain_df = (
    dev_df[dev_df["model"].isin(whisper_flamingo_models)]
    .groupby("model")[["avg_speaker_wer", "avg_joint_error"]]
    .mean()
    .reset_index()
)

print("Whisper-Flamingo (ohne Finetune, aggregiert über dev-Subset):")
display(wf_plain_df.sort_values("avg_joint_error"))

# Reference baselines
whisper_models = ["E16_whisper_bs8_len20", "E17_whisper_bs12_len20"]
best_avsr_models = ["E08_bs8_len20", "E09_bs12_len20"]

baseline_models = whisper_models + best_avsr_models

baseline_df = (
    dev_df[dev_df["model"].isin(baseline_models)]
    .groupby("model")[["avg_speaker_wer", "avg_joint_error"]]
    .mean()
    .reset_index()
)

print("Bisherige Baselines (Whisper + AVSR-Finetune):")
display(baseline_df.sort_values("avg_joint_error"))

# Overall comparison with the model family as a label
wf_ft_df2 = wf_ft_df.copy()
wf_ft_df2["family"] = "Whisper-Flamingo-FT"

wf_plain_df2 = wf_plain_df.copy()
wf_plain_df2["family"] = "Whisper-Flamingo"

baseline_df2 = baseline_df.copy()
baseline_df2["family"] = baseline_df2["model"].apply(
    lambda m: "Whisper (Audio)" if m.startswith("E16") or m.startswith("E17") else "AVSR-Finetune"
)

comparison_df = (
    pd.concat([wf_ft_df2, wf_plain_df2, baseline_df2], ignore_index=True)
    .sort_values("avg_joint_error")
    .reset_index(drop=True)
)

print("Gesamtvergleich (inkl. Finetune):")
display(comparison_df)


Whisper-Flamingo *Finetune* (aggregiert über dev-Subset):


Unnamed: 0,model,avg_speaker_wer,avg_joint_error
6,E34_whisper_flamingo_ft_bs4_len20,7.099739,3.626061
3,E31_whisper_flamingo_ft_bs4_len18,7.179875,3.666129
0,E28_whisper_flamingo_ft_bs4_len15,7.332131,3.742258
7,E35_whisper_flamingo_ft_bs8_len20,7.563768,3.858076
4,E32_whisper_flamingo_ft_bs8_len18,7.652895,3.90264
8,E36_whisper_flamingo_ft_bs12_len20,7.784863,3.968624
1,E29_whisper_flamingo_ft_bs8_len15,7.801377,3.97688
5,E33_whisper_flamingo_ft_bs12_len18,7.875273,4.013829
2,E30_whisper_flamingo_ft_bs12_len15,8.03691,4.094647
9,E37_whisper_flamingo_ft_bs8_len10,8.64881,4.400597


Whisper-Flamingo (ohne Finetune, aggregiert über dev-Subset):


Unnamed: 0,model,avg_speaker_wer,avg_joint_error
1,E19_whisper_flamingo_bs8_len15,0.921862,0.537123
4,E22_whisper_flamingo_bs8_len18,0.923613,0.537999
9,E27_whisper_flamingo_bs8_len10,0.923795,0.53809
7,E25_whisper_flamingo_bs8_len20,0.925969,0.539176
5,E23_whisper_flamingo_bs12_len18,0.931475,0.54193
2,E20_whisper_flamingo_bs12_len15,0.93206,0.542222
0,E18_whisper_flamingo_bs4_len15,0.933546,0.542965
8,E26_whisper_flamingo_bs12_len20,0.933904,0.543144
3,E21_whisper_flamingo_bs4_len18,0.935684,0.544034
6,E24_whisper_flamingo_bs4_len20,0.938264,0.545324


Bisherige Baselines (Whisper + AVSR-Finetune):


Unnamed: 0,model,avg_speaker_wer,avg_joint_error
1,E09_bs12_len20,0.495416,0.3239
0,E08_bs8_len20,0.495798,0.324091
3,E17_whisper_bs12_len20,1.025096,0.58874
2,E16_whisper_bs8_len20,1.031613,0.591999


Gesamtvergleich (inkl. Finetune):


Unnamed: 0,model,avg_speaker_wer,avg_joint_error,family
0,E09_bs12_len20,0.495416,0.3239,AVSR-Finetune
1,E08_bs8_len20,0.495798,0.324091,AVSR-Finetune
2,E19_whisper_flamingo_bs8_len15,0.921862,0.537123,Whisper-Flamingo
3,E22_whisper_flamingo_bs8_len18,0.923613,0.537999,Whisper-Flamingo
4,E27_whisper_flamingo_bs8_len10,0.923795,0.53809,Whisper-Flamingo
5,E25_whisper_flamingo_bs8_len20,0.925969,0.539176,Whisper-Flamingo
6,E23_whisper_flamingo_bs12_len18,0.931475,0.54193,Whisper-Flamingo
7,E20_whisper_flamingo_bs12_len15,0.93206,0.542222,Whisper-Flamingo
8,E18_whisper_flamingo_bs4_len15,0.933546,0.542965,Whisper-Flamingo
9,E26_whisper_flamingo_bs12_len20,0.933904,0.543144,Whisper-Flamingo


## 13 – Interpretation

| Model | WER | Joint Error |
|-------|-----|-------------|
| Whisper-Flamingo (no FT, best config) | ~0.92 | ~0.54 |
| **Whisper-Flamingo FT (all configs)** | **~7.10 – 8.65** | **~3.63 – 4.40** |
| BL4 AV-HuBERT (beam=12, len=20) | ~0.495 | ~0.324 |

Fine-tuning **drastically degraded** performance: the WER is roughly 8 to 9 times higher
than without fine-tuning (7.10 – 8.65 vs. ~0.92). A WER this far above 1.0 means the hypotheses contain many more words than the references, since only insertions can push the WER beyond 1. After training, the model is practically unusable. Something may have gone wrong in the training process.
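
To make a WER of around 7 more tangible, the sketch below computes the word error rate for a synthetic hypothesis that repeats the reference eight times. The example is constructed for illustration only; it is not taken from the model's output, and the actual failure mode of the fine-tuned checkpoint was not inspected here.

```python
def word_edit_distance(ref, hyp):
    """Levenshtein distance over word lists (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference word
                cur[j - 1] + 1,          # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # match or substitution
            ))
        prev = cur
    return prev[-1]


def wer(ref_text, hyp_text):
    """Word error rate: edit distance divided by the number of reference words."""
    ref, hyp = ref_text.split(), hyp_text.split()
    return word_edit_distance(ref, hyp) / len(ref)


reference = "so they did"
looping_hypothesis = " ".join(["so they did"] * 8)  # synthetic, not model output

print(wer(reference, reference))           # 0.0
print(wer(reference, looping_hypothesis))  # 7.0
```

A three-word reference against a 24-word hypothesis needs 21 insertions, giving a WER of 7.0. Inspecting a few raw hypotheses of the fine-tuned model would show whether it produces output of this kind.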

**Possible further causes:**
- **Catastrophic forgetting:** Only the encoder and the video model were frozen,
  not the entire model, so decoder weights that matter for basic language
  modelling may have been overwritten
- **Too little training:** With batch size 1, 2,000 steps cover at most 2,000 of the
  19,007 training clips (about a tenth of one epoch), which may not be enough to reach stable decoder weights
- **Architecture mismatch:** The Whisper-Flamingo decoder may expect
  specific input formats that `compute_loss` does not deliver correctly

**Conclusion:** Whisper-Flamingo fine-tuning is not pursued further.
All subsequent experiments build on BL4.
