# 02c – Experiment 4: Model Comparison: Whisper Large-v3 (Audio-Only) vs. AV-HuBERT

## Motivation

Since the fine-tuning approaches (02a, 02b) brought no improvement over BL4,
this experiment tests a fundamentally different model: **Whisper Large-v3** as a
pure audio ASR system (no visual input).

**Hypothesis:** Whisper is a state-of-the-art general-purpose ASR model. Can it
keep up with BL4 despite the missing visual input?

## Result (Preview)

| Model | WER | Joint Error |
|-------|-----|-------------|
| Whisper Large-v3 | ~1.03 | ~0.59 |
| BL4 (beam=12, len=20) | ~0.495 | ~0.324 |

Whisper's WER is roughly twice that of BL4 (~108 % higher; put the other way, BL4's WER is **~52 % lower**), and its Joint Error is ~82 % higher (BL4's is ~45 % lower).
In the overlapping multi-speaker setting of MCoRec, visual input is decisive.
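Percentages like these depend on which system is taken as the reference. The following minimal sketch computes both directions from the rounded table values only, so the results are approximate; `relative_change` is a helper name introduced here for illustration.

```python
def relative_change(value: float, reference: float) -> float:
    """Return (value - reference) / reference."""
    return (value - reference) / reference


# Rounded values copied from the preview table.
whisper = {"WER": 1.03, "Joint Error": 0.59}
bl4 = {"WER": 0.495, "Joint Error": 0.324}

for metric in whisper:
    whisper_vs_bl4 = relative_change(whisper[metric], bl4[metric])
    bl4_vs_whisper = relative_change(bl4[metric], whisper[metric])
    print(
        f"{metric}: Whisper is {whisper_vs_bl4:+.0%} relative to BL4, "
        f"BL4 is {bl4_vs_whisper:+.0%} relative to Whisper"
    )
```

Stating the reference explicitly ("higher than BL4" or "lower than Whisper") avoids ambiguity when the two error rates differ by about a factor of two.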

**Note on the bugfix:** This run was performed **before the bugfix** in `segmentation.py` (`min_duration_off` erroneously read the value of `min_duration_on`). This is **intentional**: the bug was only discovered after the LLM and hyperparameter experiments had been completed. Because the bugfix on its own initially worsened the WER, the combination of bugfix + `min_duration` tuning that ultimately gave the best result was only worked out in `02j_`/`02k_`.
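The kind of bug described in this note can be sketched as follows. This is a hypothetical reconstruction: the actual code in `segmentation.py` is not shown in this notebook, and the function names and example values are assumptions made purely for illustration.

```python
def read_min_durations_buggy(cfg: dict) -> tuple[float, float]:
    """Hypothetical pre-bugfix behaviour: both values come from the same key."""
    min_duration_on = cfg["min_duration_on"]
    min_duration_off = cfg["min_duration_on"]  # bug: the wrong key is read
    return min_duration_on, min_duration_off


def read_min_durations_fixed(cfg: dict) -> tuple[float, float]:
    """Hypothetical post-bugfix behaviour: each value comes from its own key."""
    return cfg["min_duration_on"], cfg["min_duration_off"]


# Example values are made up; they are not the project's settings.
example_cfg = {"min_duration_on": 0.5, "min_duration_off": 1.0}
print(read_min_durations_buggy(example_cfg))
print(read_min_durations_fixed(example_cfg))
```

In a setup like this sketch, changing `min_duration_off` would have no effect until the key is corrected, so any tuning of that parameter only becomes meaningful after the fix.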

## 1 – Setup: Working Directory & Imports

In [1]:
import os
import sys

import pandas as pd

# Set the working directory to the repo root (required for all relative paths)
project_baseline_path = "/home/josch080/Projektgruppe/mcorec_baseline"
os.chdir(project_baseline_path)

# Add the repo root to sys.path so that project-internal modules can be imported
if project_baseline_path not in sys.path:
    sys.path.append(project_baseline_path)

from script.pg_utils_experiments import (
    run_inference_for_experiment,
    run_eval_and_log,
    append_eval_results_for_experiments,
)

## 2 – Model Definition

`whisper_audio` is a model class added to `inference.py` that runs
Whisper purely on audio, without lip-crop input.
The checkpoint name is passed directly to the openai-whisper library.

In [2]:
MODELS = {
    "whisper_large_v3": {
        "model_type": "whisper_audio",
        # Checkpoint name following the openai-whisper convention,
        # e.g. "tiny", "base", "small", "medium", "large-v2", "large-v3"
        "chkpt": "large-v3",
    }
}
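How a `whisper_audio` model type might hand the checkpoint name to openai-whisper can be sketched as below. The class name `WhisperAudioSketch`, the `load_fn` hook, and the way `beam_size` is forwarded are assumptions for illustration; the real class in `inference.py` is not shown in this notebook. `whisper.load_model` and `model.transcribe` are the library's public entry points, and the import is deferred so the sketch can also be exercised with a stand-in loader.

```python
from typing import Any, Callable, Optional


class WhisperAudioSketch:
    """Illustrative audio-only wrapper; not the project's actual class."""

    def __init__(
        self,
        chkpt: str,
        device: str = "cpu",
        load_fn: Optional[Callable[..., Any]] = None,
    ) -> None:
        if load_fn is None:
            import whisper  # openai-whisper; only needed for the real model

            load_fn = whisper.load_model
        # The checkpoint name (e.g. "large-v3") is passed through unchanged.
        self.model = load_fn(chkpt, device=device)

    def transcribe(self, audio_path: str, beam_size: int) -> str:
        # Extra keyword arguments to transcribe() are treated as decoding options.
        result = self.model.transcribe(audio_path, beam_size=beam_size)
        return result["text"].strip()
```

With the real library installed, `WhisperAudioSketch(MODELS["whisper_large_v3"]["chkpt"], device="cuda")` would load the `large-v3` checkpoint, downloading it on first use. How `max_length` maps onto Whisper's decoding options is not covered by this sketch.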

## 3 – Sessions & Experiments

Same 5-session subset as before, for direct comparability with all previous experiments.

In [3]:
SESSION_IDS = ["session_40", "session_43", "session_49", "session_50", "session_54"]

In [4]:
EXPERIMENTS = {
    # E16/E17: Whisper with the best hyperparameters found so far (from 02_),
    # to allow a fair comparison with BL4 under the same decoding parameters
    "E16_whisper_bs8_len20": {
        "base_model": "whisper_large_v3",
        "beam_size": 8,
        "max_length": 20,
        "comment": "Whisper Large-v3, beam=8, len=20",
    },
    "E17_whisper_bs12_len20": {
        "base_model": "whisper_large_v3",
        "beam_size": 12,
        "max_length": 20,
        "comment": "Whisper Large-v3, beam=12, len=20",
    },
}
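Each experiment references its model by key, so the runner has to merge the two dictionaries before decoding. Below is a minimal sketch of such a merge, assuming the field names printed in the inference log (`checkpoint_path`, `output_dir_name`) and the `output_<experiment name>` naming pattern seen there. `resolve_experiment` is a hypothetical helper, not the implementation of `run_inference_for_experiment`.

```python
def resolve_experiment(exp_name: str, base_models: dict, experiments: dict) -> dict:
    """Merge an experiment entry with the base model it refers to."""
    exp = experiments[exp_name]
    model = base_models[exp["base_model"]]  # raises KeyError for an unknown model key
    return {
        "base_model": exp["base_model"],
        "model_type": model["model_type"],
        "checkpoint_path": model["chkpt"],
        "beam_size": exp["beam_size"],
        "max_length": exp["max_length"],
        "output_dir_name": f"output_{exp_name}",
        "comment": exp.get("comment", ""),
    }


# Small copies of the notebook's dictionaries so the sketch runs on its own.
models = {"whisper_large_v3": {"model_type": "whisper_audio", "chkpt": "large-v3"}}
experiments = {
    "E16_whisper_bs8_len20": {
        "base_model": "whisper_large_v3",
        "beam_size": 8,
        "max_length": 20,
        "comment": "Whisper Large-v3, beam=8, len=20",
    }
}

resolved = resolve_experiment("E16_whisper_bs8_len20", models, experiments)
for key, value in resolved.items():
    print(f"  {key:<15} = {value}")
```

Keeping decoding parameters in the experiment entry and model details in a separate dictionary means that E16 and E17 differ only in `beam_size`, which makes the comparison easy to audit.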


## 4 – Inference

2 experiments × 5 sessions = 10 runs.
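The run count follows from pairing every session with every experiment. This short sketch enumerates the pairs without starting any inference; the names are repeated so the block runs on its own, and `run_plan` is an illustrative variable.

```python
from itertools import product

session_ids = ["session_40", "session_43", "session_49", "session_50", "session_54"]
experiment_names = ["E16_whisper_bs8_len20", "E17_whisper_bs12_len20"]

# Sessions form the outer loop, experiments the inner loop.
run_plan = list(product(session_ids, experiment_names))

for sid, exp_name in run_plan:
    print(f"data-bin/dev/{sid} -> {exp_name}")

print(f"{len(run_plan)} runs in total")
```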

In [5]:
for sid in SESSION_IDS:
    session_dir = f"data-bin/dev/{sid}"
    print(f"\n########## Starte Experimente für {sid} ##########")

    for exp_name in EXPERIMENTS:
        run_inference_for_experiment(
            exp_name=exp_name,
            base_models=MODELS,
            experiments=EXPERIMENTS,
            session_dir=session_dir,
        )


########## Starte Experimente für session_40 ##########

Starte Inference für Experiment: E16_whisper_bs8_len20
  base_model      = whisper_large_v3
  model_type      = whisper_audio
  checkpoint_path = large-v3
  beam_size       = 8
  max_length      = 20
  output_dir_name = output_E16_whisper_bs8_len20
  session_dir     = data-bin/dev_without_central_videos/dev/session_40
  comment         = Whisper Large-v3, beam=8, len=20
Loading whisper_audio model...
Loading Whisper model 'large-v3' on device cuda


whisper_audio model loaded successfully!
Inferring 1 sessions using whisper_audio model
Processing session session_40



Starte Inference für Experiment: E17_whisper_bs12_len20
  base_model      = whisper_large_v3
  model_type      = whisper_audio
  checkpoint_path = large-v3
  beam_size       = 12
  max_length      = 20
  output_dir_name = output_E17_whisper_bs12_len20
  session_dir     = data-bin/dev_without_central_videos/dev/session_40
  comment         = Whisper Large-v3, beam=12, len=20
Loading whisper_audio model...
Loading Whisper model 'large-v3' on device cuda
whisper_audio model loaded successfully!
Inferring 1 sessions using whisper_audio model
Processing session session_40



########## Starte Experimente für session_43 ##########

Starte Inference für Experiment: E16_whisper_bs8_len20
  base_model      = whisper_large_v3
  model_type      = whisper_audio
  checkpoint_path = large-v3
  beam_size       = 8
  max_length      = 20
  output_dir_name = output_E16_whisper_bs8_len20
  session_dir     = data-bin/dev_without_central_videos/dev/session_43
  comment         = Whisper Large-v3, beam=8, len=20
Loading whisper_audio model...
Loading Whisper model 'large-v3' on device cuda
whisper_audio model loaded successfully!
Inferring 1 sessions using whisper_audio model
Processing session session_43



Starte Inference für Experiment: E17_whisper_bs12_len20
  base_model      = whisper_large_v3
  model_type      = whisper_audio
  checkpoint_path = large-v3
  beam_size       = 12
  max_length      = 20
  output_dir_name = output_E17_whisper_bs12_len20
  session_dir     = data-bin/dev_without_central_videos/dev/session_43
  comment         = Whisper Large-v3, beam=12, len=20
Loading whisper_audio model...
Loading Whisper model 'large-v3' on device cuda
whisper_audio model loaded successfully!
Inferring 1 sessions using whisper_audio model
Processing session session_43



########## Starte Experimente für session_49 ##########

Starte Inference für Experiment: E16_whisper_bs8_len20
  base_model      = whisper_large_v3
  model_type      = whisper_audio
  checkpoint_path = large-v3
  beam_size       = 8
  max_length      = 20
  output_dir_name = output_E16_whisper_bs8_len20
  session_dir     = data-bin/dev_without_central_videos/dev/session_49
  comment         = Whisper Large-v3, beam=8, len=20
Loading whisper_audio model...
Loading Whisper model 'large-v3' on device cuda
whisper_audio model loaded successfully!
Inferring 1 sessions using whisper_audio model
Processing session session_49



Starting inference for experiment: E17_whisper_bs12_len20
  base_model      = whisper_large_v3
  model_type      = whisper_audio
  checkpoint_path = large-v3
  beam_size       = 12
  max_length      = 20
  output_dir_name = output_E17_whisper_bs12_len20
  session_dir     = data-bin/dev_without_central_videos/dev/session_49
  comment         = Whisper Large-v3, beam=12, len=20
Loading whisper_audio model...
Loading Whisper model 'large-v3' on device cuda
whisper_audio model loaded successfully!
Inferring 1 sessions using whisper_audio model
Processing session session_49



########## Starting experiments for session_50 ##########

Starting inference for experiment: E16_whisper_bs8_len20
  base_model      = whisper_large_v3
  model_type      = whisper_audio
  checkpoint_path = large-v3
  beam_size       = 8
  max_length      = 20
  output_dir_name = output_E16_whisper_bs8_len20
  session_dir     = data-bin/dev_without_central_videos/dev/session_50
  comment         = Whisper Large-v3, beam=8, len=20
Loading whisper_audio model...
Loading Whisper model 'large-v3' on device cuda
whisper_audio model loaded successfully!
Inferring 1 sessions using whisper_audio model
Processing session session_50



Starting inference for experiment: E17_whisper_bs12_len20
  base_model      = whisper_large_v3
  model_type      = whisper_audio
  checkpoint_path = large-v3
  beam_size       = 12
  max_length      = 20
  output_dir_name = output_E17_whisper_bs12_len20
  session_dir     = data-bin/dev_without_central_videos/dev/session_50
  comment         = Whisper Large-v3, beam=12, len=20
Loading whisper_audio model...
Loading Whisper model 'large-v3' on device cuda
whisper_audio model loaded successfully!
Inferring 1 sessions using whisper_audio model
Processing session session_50



########## Starting experiments for session_54 ##########

Starting inference for experiment: E16_whisper_bs8_len20
  base_model      = whisper_large_v3
  model_type      = whisper_audio
  checkpoint_path = large-v3
  beam_size       = 8
  max_length      = 20
  output_dir_name = output_E16_whisper_bs8_len20
  session_dir     = data-bin/dev_without_central_videos/dev/session_54
  comment         = Whisper Large-v3, beam=8, len=20
Loading whisper_audio model...
Loading Whisper model 'large-v3' on device cuda
whisper_audio model loaded successfully!
Inferring 1 sessions using whisper_audio model
Processing session session_54



Starting inference for experiment: E17_whisper_bs12_len20
  base_model      = whisper_large_v3
  model_type      = whisper_audio
  checkpoint_path = large-v3
  beam_size       = 12
  max_length      = 20
  output_dir_name = output_E17_whisper_bs12_len20
  session_dir     = data-bin/dev_without_central_videos/dev/session_54
  comment         = Whisper Large-v3, beam=12, len=20
Loading whisper_audio model...
Loading Whisper model 'large-v3' on device cuda
whisper_audio model loaded successfully!
Inferring 1 sessions using whisper_audio model
Processing session session_54



## 5 – Evaluation & Aggregation

In [6]:
# Evaluate the results and append them to the shared CSV
df_dev = append_eval_results_for_experiments(
    experiments=EXPERIMENTS,
    session_ids=SESSION_IDS,
    target_csv="results_dev_subset_by_session.csv",
)



########## Evaluate für session_40 ##########
Starte Evaluate: /home/josch080/Projektgruppe/mcorec_train/bin/python script/evaluate.py --session_dir data-bin/dev_without_central_videos/dev/session_40 --output_dir_name output_ --label_dir_name labels
Evaluating 1 sessions

=== Evaluating session session_40 ===

--- Evaluating output dir: output_E01_bs4_len15 ---
Conversation clustering F1 score: 1.0
Speaker to WER: {'spk_0': 0.564, 'spk_1': 0.4281, 'spk_2': 0.5576, 'spk_3': 0.4283, 'spk_4': 0.4793, 'spk_5': 0.4189}
Speaker clustering F1 score: {'spk_0': 1.0, 'spk_1': 1.0, 'spk_2': 1.0, 'spk_3': 1.0, 'spk_4': 1.0, 'spk_5': 1.0}
Joint ASR-Clustering Error Rate: {'spk_0': 0.282, 'spk_1': 0.21405, 'spk_2': 0.2788, 'spk_3': 0.21415, 'spk_4': 0.23965, 'spk_5': 0.20945}

--- Evaluating output dir: output_E02_bs8_len15 ---
Conversation clustering F1 score: 1.0
Speaker to WER: {'spk_0': 0.561, 'spk_1': 0.4312, 'spk_2': 0.5506, 'spk_3': 0.4283, 'spk_4': 0.5041, 'spk_5': 0.4189}
Speaker clusterin
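
The per-speaker numbers in the log above are consistent with the joint error being the mean of the WER and the clustering error (1 - F1). For `spk_0`, for example, 0.5 · 0.564 + 0.5 · (1 - 1.0) = 0.282. The sketch below checks this against the `output_E01_bs4_len15` block. The formula is inferred from the printed values, not taken from `script/evaluate.py`, and the function name `joint_error` is hypothetical.

```python
def joint_error(wer: float, cluster_f1: float) -> float:
    """Mean of WER and clustering error (1 - F1).

    Assumption: inferred from the logged numbers, not read from evaluate.py.
    """
    return 0.5 * wer + 0.5 * (1.0 - cluster_f1)


# Values copied from the session_40 log for output_E01_bs4_len15
speaker_wer = {
    "spk_0": 0.564, "spk_1": 0.4281, "spk_2": 0.5576,
    "spk_3": 0.4283, "spk_4": 0.4793, "spk_5": 0.4189,
}
speaker_f1 = {spk: 1.0 for spk in speaker_wer}
logged_joint = {
    "spk_0": 0.282, "spk_1": 0.21405, "spk_2": 0.2788,
    "spk_3": 0.21415, "spk_4": 0.23965, "spk_5": 0.20945,
}

for spk, wer in speaker_wer.items():
    computed = joint_error(wer, speaker_f1[spk])
    # Print computed vs. logged value side by side for a manual check
    print(spk, round(computed, 5), logged_joint[spk])
```

If this reading is correct, a perfect clustering F1 of 1.0 means the joint error is simply half the WER, so differences in joint error between models with identical clustering come entirely from the ASR output.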

## 6 – Comparison: Whisper vs. BL4

The Whisper results (E16, E17) are compared with the best BL4 configurations (E08, E09).

In [7]:
import pandas as pd

dev_df = pd.read_csv("results_dev_subset_by_session.csv")

# Average the Whisper results over all sessions
whisper_models = ["E16_whisper_bs8_len20", "E17_whisper_bs12_len20"]
whisper_df = (
    dev_df[dev_df["model"].isin(whisper_models)]
    .groupby("model")[["avg_speaker_wer", "avg_joint_error"]]
    .mean()
    .reset_index()
)

# BL4 reference: best configurations from 02_ (beam=8/12, len=20)
best_avsr_models = ["E08_bs8_len20", "E09_bs12_len20"]
avsr_df = (
    dev_df[dev_df["model"].isin(best_avsr_models)]
    .groupby("model")[["avg_speaker_wer", "avg_joint_error"]]
    .mean()
    .reset_index()
)

print("Whisper:")
display(whisper_df)
print("AVSR finetuned:")
display(avsr_df)


Whisper:


|   | model | avg_speaker_wer | avg_joint_error |
|---|-------|-----------------|-----------------|
| 0 | E16_whisper_bs8_len20 | 1.031613 | 0.591999 |
| 1 | E17_whisper_bs12_len20 | 1.025096 | 0.58874 |


AVSR finetuned:


|   | model | avg_speaker_wer | avg_joint_error |
|---|-------|-----------------|-----------------|
| 0 | E08_bs8_len20 | 0.495798 | 0.324091 |
| 1 | E09_bs12_len20 | 0.495416 | 0.3239 |


## 7 – Interpretation

| Model | WER | Joint Error | rel. Δ WER vs. BL4 |
|-------|-----|-------------|--------------------|
| Whisper Large-v3 (beam=8/12) | ~1.03 | ~0.59 | ~+108 % |
| BL4 (beam=12, len=20) | ~0.495 | ~0.324 | – |

With a WER of about 1.03, Whisper makes more than twice as many word errors as BL4 (about 0.495).
A WER above 1.0 means that errors outnumber reference words, which can only happen when the hypothesis contains inserted words.
The joint error rate is also clearly higher (about 0.59 vs. 0.324, roughly +82 %).
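
The relative differences can be recomputed from the session-averaged values printed in section 6. The sketch below does this for the beam=12 pair (E17 vs. E09). The helper name `relative_increase` is hypothetical, and the numbers are copied from the output above rather than read from the CSV. With the unrounded values the WER increase comes out at roughly 107 %, while the ~108 % in the table follows from the rounded values 1.03 and 0.495.

```python
def relative_increase(value: float, reference: float) -> float:
    """Relative increase of `value` over `reference`, in percent."""
    return (value - reference) / reference * 100.0


# Session-averaged values copied from the output of cell 7
whisper = {"wer": 1.025096, "joint": 0.58874}  # E17_whisper_bs12_len20
bl4 = {"wer": 0.495416, "joint": 0.3239}       # E09_bs12_len20

wer_delta = relative_increase(whisper["wer"], bl4["wer"])
joint_delta = relative_increase(whisper["joint"], bl4["joint"])

print(f"rel. WER increase:         {wer_delta:.1f} %")
print(f"rel. joint error increase: {joint_delta:.1f} %")
```

The absolute gap in joint error (about 0.265) is almost exactly half the absolute gap in WER (about 0.530). That would be expected if the joint error weights WER by 0.5 and both systems share the same clustering result, though this weighting is inferred from the numbers here and not confirmed from the evaluation script.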

**Why is Whisper so much worse?**

1. **No visual input:** In the overlapping multi-speaker setting of MCoRec, the
   per-speaker lip crops provide decisive context information that Whisper lacks.
2. **No in-domain fine-tuning:** Whisper was trained on generic audio data,
   not on MCoRec-specific speech patterns and vocabulary.
3. **Missing speaker segmentation:** Without speaker-specific input, Whisper produces
   many insertions and words that differ in content from the reference, which strongly increases both WER and joint error rate.
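
To illustrate how insertions alone can push the WER above 1.0, here is a minimal word-level WER sketch based on edit distance. It is not the metric implementation used by `script/evaluate.py` (which may normalise text differently), and the example sentences are invented, not MCoRec data. The hypothesis contains every reference word in order but also picks up five words from other speakers.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level WER: (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Invented example: target speaker says three words, overlapping speech adds five more
reference = "see you tomorrow"
hypothesis = "yeah okay see you then tomorrow right sure"

wer = word_error_rate(reference, hypothesis)
print(f"WER = {wer:.3f}")
```

Here all three reference words are recognised correctly, yet the five inserted words give a WER of 5/3, well above 1.0. This is the pattern one would expect when an audio-only model transcribes overlapping speakers without knowing which one is the target.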

**Conclusion:** Audio-only ASR is unsuitable for the MCoRec challenge.
The audio-visual approach of BL4 remains the basis for all further experiments.
