# 02b – Experiment 3: Stage-2-Light Fine-Tuning with an MCoRec-Focused Training Mix

## Motivation

`02a` showed that a second fine-tuning pass with a learning rate of 5·10⁻⁵ and 30k steps
overfits the model. This experiment tests an even more conservative variant:

| Parameter | Stage-2 (`02a`) | Stage-2-Light (`02b`) |
|-----------|-----------------|----------------------|
| Learning rate | 5·10⁻⁵ | **1·10⁻⁵** (5× smaller) |
| Steps | 30,000 | **5,000** |
| MCoRec share | ~20% | **~70%** (`--mcorec_mode heavy`) |
| Warmup | 3,000 steps | **500 steps** |

Hypothesis: fewer steps combined with a stronger MCoRec focus could prevent overfitting
while improving domain adaptation.
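The ratios in the table can be checked with a few lines of arithmetic. The sketch below only uses the values from the table; the dictionary layout and variable names are illustrative and not part of the project code. It also shows that the warmup covers the same fraction of the run in both setups.

```python
# Values copied from the comparison table; the dictionary layout is illustrative only.
stage2 = {"learning_rate": 5e-5, "max_steps": 30_000, "warmup_steps": 3_000}
stage2_light = {"learning_rate": 1e-5, "max_steps": 5_000, "warmup_steps": 500}

# How much smaller or shorter Stage-2-Light is compared with Stage-2.
lr_ratio = stage2["learning_rate"] / stage2_light["learning_rate"]
step_ratio = stage2["max_steps"] / stage2_light["max_steps"]

# Fraction of the run spent in warmup for each setup.
warmup_fraction = {
    "stage2": stage2["warmup_steps"] / stage2["max_steps"],
    "stage2_light": stage2_light["warmup_steps"] / stage2_light["max_steps"],
}

print(f"learning rate: {lr_ratio:.1f}x smaller")
print(f"steps: {step_ratio:.1f}x fewer")
print(f"warmup fraction: {warmup_fraction}")
```

Although the absolute warmup drops from 3,000 to 500 steps, it stays at 10% of the total run in both configurations.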

## Result (Preview)

Stage-2-Light is also **worse** than BL4 (WER +0.032 to +0.033, JER +0.016).
BL4 remains the best model, and further fine-tuning is not pursued.

**Note on the bugfix:** This run was performed **before the bugfix** in `segmentation.py` (`min_duration_off` mistakenly read the value of `min_duration_on`). This is **intentional**: the bug was only discovered after the LLM and hyperparameter experiments had been completed. Because the bugfix alone initially made the WER worse, the combination of bugfix and `min_duration` tuning that ultimately produced the best result was only worked out in `02j_`/`02k_`.
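To make the effect of such a mix-up concrete, here is a minimal, purely hypothetical sketch. It is not the code from `segmentation.py`. It assumes the common convention that gaps shorter than `min_duration_off` are filled and segments shorter than `min_duration_on` are dropped; the function name `postprocess`, the threshold values, and the example segments are all invented for illustration.

```python
def postprocess(segments, min_duration_on, min_duration_off):
    """Fill gaps shorter than min_duration_off, then drop segments shorter than min_duration_on.

    Hypothetical helper for illustration; not taken from segmentation.py.
    """
    merged = []
    for start, end in sorted(segments):
        if merged and start - merged[-1][1] < min_duration_off:
            # Gap to the previous segment is short enough: merge both segments.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return [(s, e) for s, e in merged if e - s >= min_duration_on]


# Invented example values (seconds).
cfg = {"min_duration_on": 0.5, "min_duration_off": 0.1}
segments = [(0.0, 1.0), (1.3, 2.0), (5.0, 5.2)]

# Intended behaviour: each threshold is read from its own key.
fixed = postprocess(segments, cfg["min_duration_on"], cfg["min_duration_off"])

# Mix-up of the kind described in the note: min_duration_off receives min_duration_on.
buggy = postprocess(segments, cfg["min_duration_on"], cfg["min_duration_on"])

print("fixed:", fixed)
print("buggy:", buggy)
```

In this toy example the mix-up merges two speech segments that the intended configuration would keep apart, which illustrates why the bugfix changes the segmentation and therefore the WER.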

## 1 – Setup: Working Directory & Imports

In [1]:
import os, sys
from pathlib import Path

# Path object for convenient path handling
project_baseline_path = Path("/home/josch080/Projektgruppe/mcorec_baseline")
os.chdir(project_baseline_path)
print("CWD:", os.getcwd())

# Add the repo root to sys.path so that project-internal modules can be imported
if str(project_baseline_path) not in sys.path:
    sys.path.append(str(project_baseline_path))


CWD: /home/josch080/Projektgruppe/mcorec_baseline


## 2 – GPU Selection

In [2]:
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
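`CUDA_VISIBLE_DEVICES` is read when CUDA is initialized, so the safe order is to set it before `torch` is imported, as this notebook does. The following sketch wraps that in a small helper; `select_gpu` is a hypothetical name, not part of the project.

```python
import os
import sys


def select_gpu(index: str = "0") -> bool:
    """Set CUDA_VISIBLE_DEVICES and report whether this happened before torch was imported.

    Hypothetical helper for illustration. Returns False if torch is already loaded,
    in which case the setting may come too late if CUDA was already initialized.
    """
    set_before_torch_import = "torch" not in sys.modules
    os.environ["CUDA_VISIBLE_DEVICES"] = index
    return set_before_torch_import


if not select_gpu("0"):
    print("Warning: torch was imported before CUDA_VISIBLE_DEVICES was set.")
```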

## 3 – CUDA Verification

In [3]:
import torch
print("n_gpu:", torch.cuda.device_count())
# should print 1

n_gpu: 1


## 4 – Stage-2-Light Training

Compared with `02a`, three main parameters change: the learning rate (5× smaller),
the number of steps (6× fewer), and the MCoRec weighting (from ~20% to ~70%). The warmup
is shortened accordingly, from 3,000 to 500 steps.

**Note:** `subprocess.run(cmd)` has already been executed; the checkpoint is located at
`model-bin/avsr_cocktail_mcorec_stage2_light_lr1e-5_5k/checkpoint5000`.
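Because the training has already been run, re-executing the notebook would repeat it. One way to avoid that is to skip the call when the checkpoint directory already exists. The sketch below is an assumption about how such a guard could look; `run_training_once` is a hypothetical helper, and the demo uses a harmless placeholder command instead of the real training call.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_training_once(cmd, checkpoint_dir):
    """Run cmd only if checkpoint_dir does not exist yet; return True if the command ran."""
    if Path(checkpoint_dir).exists():
        print(f"Checkpoint found at {checkpoint_dir}, skipping training.")
        return False
    # check=True raises CalledProcessError if the command exits with a non-zero status.
    subprocess.run(cmd, check=True)
    return True


# Demo with a placeholder command so that the sketch does not start a real training run.
with tempfile.TemporaryDirectory() as tmp:
    placeholder_cmd = [sys.executable, "-c", "print('placeholder for training')"]
    ran_first = run_training_once(placeholder_cmd, Path(tmp) / "checkpoint_missing")
    ran_second = run_training_once(placeholder_cmd, tmp)  # tmp exists, so this is skipped
```

In this notebook, the call would take the `cmd` list from the next cell and the checkpoint path `model-bin/avsr_cocktail_mcorec_stage2_light_lr1e-5_5k/checkpoint5000` given in the note above.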

In [4]:
cmd = [
    sys.executable, "script/train.py",
    "--streaming_dataset",                 # load data on the fly
    "--include_mcorec",                    # include MCoRec in the training mix
    "--mcorec_mode", "heavy",              # weight MCoRec heavily: ~70% instead of ~20% of the mix
    "--batch_size", "4",
    "--max_steps", "5000",                 # strongly reduced compared with Stage-2 (30k): minimal additional training
    "--gradient_accumulation_steps", "2",  # effective batch size = 4 × 2 = 8
    "--save_steps", "1000",                # checkpoint every 1000 steps
    "--eval_steps", "1000",                # validation every 1000 steps
    "--log_interval", "25",
    "--learning_rate", "1e-5",             # 5× smaller than Stage-2 (5e-5)
    "--warmup_steps", "500",               # short warmup to match the small number of steps
    "--checkpoint_name", "avsr_cocktail_mcorec_stage2_light_lr1e-5_5k",
    "--model_name_or_path", "./model-bin/avsr_cocktail_mcorec_finetune",  # starting point: BL4
    "--output_dir", "./model-bin",
    "--report_to", "none",                 # no external logging (e.g. W&B)
]
print(" ".join(cmd))  # print the full command for verification


/home/josch080/Projektgruppe/mcorec_train/bin/python script/train.py --streaming_dataset --include_mcorec --mcorec_mode heavy --batch_size 4 --max_steps 5000 --gradient_accumulation_steps 2 --save_steps 1000 --eval_steps 1000 --log_interval 25 --learning_rate 1e-5 --warmup_steps 500 --checkpoint_name avsr_cocktail_mcorec_stage2_light_lr1e-5_5k --model_name_or_path ./model-bin/avsr_cocktail_mcorec_finetune --output_dir ./model-bin --report_to none
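The command fixes both the sample budget and the learning-rate schedule. The sketch below works these out under two assumptions: that the schedule is a linear warmup followed by a linear decay to zero (which is what the `learning_rate` values in the training log suggest), and that the MCoRec sampling probability of 0.7 translates directly into its share of the training samples. The function name `linear_schedule_lr` is illustrative and not taken from `script/train.py`.

```python
def linear_schedule_lr(step, peak_lr=1e-5, warmup_steps=500, max_steps=5000):
    """Linear warmup to peak_lr, then linear decay to zero at max_steps (assumed schedule)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * max(0, max_steps - step) / (max_steps - warmup_steps)


# Sample budget implied by the command-line arguments.
effective_batch = 4 * 2                      # batch_size × gradient_accumulation_steps
total_samples = 5000 * effective_batch       # max_steps × effective batch size
expected_mcorec = round(0.7 * total_samples) # assumes sampling probability equals sample share

print("effective batch size:", effective_batch)
print("total samples:", total_samples)
print("expected MCoRec samples:", expected_mcorec)

for step in (250, 500, 2750, 5000):
    print(step, linear_schedule_lr(step))
```

The logged values follow this shape but trail the nominal step count slightly: at step 550 the log reports 9.9e-06, which the formula yields at step 545. The log itself does not show the cause of this offset.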


In [5]:
import subprocess

# Start training; the output appears directly in the notebook
subprocess.run(cmd)

Loading pretrained model from ./model-bin/avsr_cocktail_mcorec_finetune
Loading MCoRec dataset
map_datasets
 {'lrs2': {'probabilities': 0.1, 'dataset': {'train': IterableDataset({
    features: ['label', 'length', 'sample_id', 'video'],
    num_shards: 10
}), 'valid': None}}, 'vox2': {'probabilities': 0.05, 'dataset': {'train': IterableDataset({
    features: ['label', 'length', 'sample_id', 'video'],
    num_shards: 53
}), 'valid': None}}, 'avyt': {'probabilities': 0.1, 'dataset': {'train': IterableDataset({
    features: ['label', 'length', 'sample_id', 'video'],
    num_shards: 16
}), 'valid': None}}, 'avyt-mix': {'probabilities': 0.05, 'dataset': {'train': IterableDataset({
    features: ['label', 'length', 'sample_id', 'video'],
    num_shards: 664
}), 'valid': None}}, 'mcorec': {'probabilities': 0.7, 'dataset': {'train': IterableDataset({
    features: ['label', 'length', 'sample_id', 'video'],
    num_shards: 48
}), 'valid': IterableDataset({
    features: ['label', 'length', 's

  super().__init__(
Could not estimate the number of tokens of the input, floating-point operations will not be computed
  0%|          | 25/5000 [00:24<37:58,  2.18it/s]  

{'loss': 9.1956, 'grad_norm': 22.156843185424805, 'learning_rate': 4.4e-07, 'epoch': 0.01}


  1%|          | 50/5000 [00:38<40:36,  2.03it/s]

{'loss': 12.3982, 'grad_norm': 22.16065216064453, 'learning_rate': 9.200000000000001e-07, 'epoch': 0.01}


  2%|▏         | 75/5000 [00:49<36:23,  2.26it/s]

{'loss': 11.3531, 'grad_norm': 14.029118537902832, 'learning_rate': 1.4000000000000001e-06, 'epoch': 0.01}


  2%|▏         | 100/5000 [01:01<37:38,  2.17it/s]

{'loss': 12.5311, 'grad_norm': 24.898277282714844, 'learning_rate': 1.9000000000000002e-06, 'epoch': 0.02}


  2%|▎         | 125/5000 [01:13<36:04,  2.25it/s]

{'loss': 9.5582, 'grad_norm': 24.308225631713867, 'learning_rate': 2.4000000000000003e-06, 'epoch': 0.03}


  3%|▎         | 150/5000 [01:25<38:47,  2.08it/s]

{'loss': 11.2654, 'grad_norm': 23.608436584472656, 'learning_rate': 2.9e-06, 'epoch': 0.03}


  4%|▎         | 175/5000 [01:36<34:47,  2.31it/s]

{'loss': 10.0444, 'grad_norm': 44.15377426147461, 'learning_rate': 3.4000000000000005e-06, 'epoch': 0.04}


  4%|▍         | 200/5000 [01:49<41:54,  1.91it/s]

{'loss': 15.5317, 'grad_norm': 24.779939651489258, 'learning_rate': 3.900000000000001e-06, 'epoch': 0.04}


  4%|▍         | 225/5000 [02:03<46:55,  1.70it/s]

{'loss': 9.6219, 'grad_norm': 50.59184265136719, 'learning_rate': 4.4e-06, 'epoch': 0.04}


  5%|▌         | 250/5000 [02:14<32:45,  2.42it/s]

{'loss': 7.7752, 'grad_norm': 21.73801040649414, 'learning_rate': 4.9000000000000005e-06, 'epoch': 0.05}


  6%|▌         | 275/5000 [02:26<34:53,  2.26it/s]

{'loss': 8.9794, 'grad_norm': 13.741610527038574, 'learning_rate': 5.400000000000001e-06, 'epoch': 0.06}


  6%|▌         | 300/5000 [02:39<36:19,  2.16it/s]

{'loss': 10.8057, 'grad_norm': 22.244972229003906, 'learning_rate': 5.9e-06, 'epoch': 0.06}


  6%|▋         | 325/5000 [02:50<37:04,  2.10it/s]

{'loss': 8.0935, 'grad_norm': 22.367286682128906, 'learning_rate': 6.4000000000000006e-06, 'epoch': 0.07}


  7%|▋         | 350/5000 [03:02<36:04,  2.15it/s]

{'loss': 9.2336, 'grad_norm': 20.070228576660156, 'learning_rate': 6.9e-06, 'epoch': 0.07}


  8%|▊         | 375/5000 [03:13<33:11,  2.32it/s]

{'loss': 9.1382, 'grad_norm': 4.022035121917725, 'learning_rate': 7.4e-06, 'epoch': 0.07}


  8%|▊         | 400/5000 [03:24<35:40,  2.15it/s]

{'loss': 7.6042, 'grad_norm': 32.135169982910156, 'learning_rate': 7.9e-06, 'epoch': 0.08}


  8%|▊         | 425/5000 [03:35<33:11,  2.30it/s]

{'loss': 9.7831, 'grad_norm': 9.99147891998291, 'learning_rate': 8.400000000000001e-06, 'epoch': 0.09}


  9%|▉         | 450/5000 [03:47<36:04,  2.10it/s]

{'loss': 9.4056, 'grad_norm': 30.385099411010742, 'learning_rate': 8.900000000000001e-06, 'epoch': 0.09}


 10%|▉         | 475/5000 [03:59<32:15,  2.34it/s]

{'loss': 11.4209, 'grad_norm': 65.0094985961914, 'learning_rate': 9.4e-06, 'epoch': 0.1}


 10%|█         | 500/5000 [04:11<32:24,  2.31it/s]

{'loss': 11.8324, 'grad_norm': 41.487213134765625, 'learning_rate': 9.9e-06, 'epoch': 0.1}


 10%|█         | 525/5000 [04:23<34:15,  2.18it/s]

{'loss': 7.5882, 'grad_norm': 13.352439880371094, 'learning_rate': 9.955555555555556e-06, 'epoch': 0.1}


 11%|█         | 550/5000 [04:35<44:49,  1.65it/s]

{'loss': 9.2561, 'grad_norm': 96.8046646118164, 'learning_rate': 9.9e-06, 'epoch': 0.11}


 12%|█▏        | 575/5000 [04:47<39:49,  1.85it/s]

{'loss': 10.017, 'grad_norm': 33.21139144897461, 'learning_rate': 9.844444444444446e-06, 'epoch': 0.12}


 12%|█▏        | 600/5000 [04:59<36:28,  2.01it/s]

{'loss': 9.533, 'grad_norm': 35.55388259887695, 'learning_rate': 9.78888888888889e-06, 'epoch': 0.12}


 12%|█▎        | 625/5000 [05:10<31:07,  2.34it/s]

{'loss': 8.2852, 'grad_norm': 12.032537460327148, 'learning_rate': 9.733333333333334e-06, 'epoch': 0.12}


 13%|█▎        | 650/5000 [05:20<26:14,  2.76it/s]

{'loss': 9.8267, 'grad_norm': 39.97711181640625, 'learning_rate': 9.677777777777778e-06, 'epoch': 0.13}


 14%|█▎        | 675/5000 [05:31<34:42,  2.08it/s]

{'loss': 7.9212, 'grad_norm': 28.514780044555664, 'learning_rate': 9.622222222222222e-06, 'epoch': 0.14}


 14%|█▍        | 700/5000 [05:43<35:48,  2.00it/s]

{'loss': 12.6468, 'grad_norm': 22.436687469482422, 'learning_rate': 9.566666666666668e-06, 'epoch': 0.14}


 14%|█▍        | 725/5000 [05:56<36:27,  1.95it/s]

{'loss': 8.7631, 'grad_norm': 9.642409324645996, 'learning_rate': 9.511111111111112e-06, 'epoch': 0.14}


 15%|█▌        | 750/5000 [06:08<34:25,  2.06it/s]

{'loss': 8.1837, 'grad_norm': 22.359825134277344, 'learning_rate': 9.455555555555557e-06, 'epoch': 0.15}


 16%|█▌        | 775/5000 [06:20<37:06,  1.90it/s]

{'loss': 13.5508, 'grad_norm': 39.59782028198242, 'learning_rate': 9.4e-06, 'epoch': 0.15}


 16%|█▌        | 800/5000 [06:32<29:45,  2.35it/s]

{'loss': 8.1569, 'grad_norm': 13.374761581420898, 'learning_rate': 9.344444444444446e-06, 'epoch': 0.16}


 16%|█▋        | 825/5000 [06:44<41:34,  1.67it/s]

{'loss': 13.8071, 'grad_norm': 29.352235794067383, 'learning_rate': 9.28888888888889e-06, 'epoch': 0.17}


 17%|█▋        | 850/5000 [06:57<38:37,  1.79it/s]

{'loss': 13.2481, 'grad_norm': 45.65077209472656, 'learning_rate': 9.233333333333334e-06, 'epoch': 0.17}


 18%|█▊        | 875/5000 [07:10<33:16,  2.07it/s]

{'loss': 13.3083, 'grad_norm': 21.65273666381836, 'learning_rate': 9.17777777777778e-06, 'epoch': 0.17}


 18%|█▊        | 900/5000 [07:21<31:10,  2.19it/s]

{'loss': 7.5191, 'grad_norm': 23.185470581054688, 'learning_rate': 9.122222222222223e-06, 'epoch': 0.18}


 18%|█▊        | 925/5000 [07:34<34:04,  1.99it/s]

{'loss': 10.0215, 'grad_norm': 12.512810707092285, 'learning_rate': 9.066666666666667e-06, 'epoch': 0.18}


 19%|█▉        | 950/5000 [07:45<35:46,  1.89it/s]

{'loss': 9.9387, 'grad_norm': 35.78242874145508, 'learning_rate': 9.011111111111111e-06, 'epoch': 0.19}


 20%|█▉        | 975/5000 [07:58<33:42,  1.99it/s]

{'loss': 9.4487, 'grad_norm': 33.85115051269531, 'learning_rate': 8.955555555555555e-06, 'epoch': 0.2}


 20%|██        | 1000/5000 [08:10<29:29,  2.26it/s]

{'loss': 10.9505, 'grad_norm': 20.87000846862793, 'learning_rate': 8.900000000000001e-06, 'epoch': 0.2}


Too many dataloader workers: 10 (max is dataset.num_shards=3). Stopping 7 dataloader workers.
 20%|██        | 1000/5000 [11:08<29:29,  2.26it/s]

{'eval_loss': 27.92057228088379, 'eval_runtime': 177.9086, 'eval_samples_per_second': 21.876, 'eval_steps_per_second': 5.469, 'epoch': 0.2}


 20%|██        | 1025/5000 [11:29<29:22,  2.25it/s]   

{'loss': 9.6124, 'grad_norm': 20.516700744628906, 'learning_rate': 8.844444444444445e-06, 'epoch': 0.2}


 21%|██        | 1050/5000 [11:41<29:18,  2.25it/s]

{'loss': 8.9763, 'grad_norm': 12.408702850341797, 'learning_rate': 8.788888888888891e-06, 'epoch': 0.21}


 22%|██▏       | 1075/5000 [11:55<32:19,  2.02it/s]

{'loss': 11.6714, 'grad_norm': 43.32182312011719, 'learning_rate': 8.733333333333333e-06, 'epoch': 0.21}


 22%|██▏       | 1100/5000 [12:07<28:35,  2.27it/s]

{'loss': 10.1149, 'grad_norm': 32.629486083984375, 'learning_rate': 8.677777777777779e-06, 'epoch': 0.22}


 22%|██▎       | 1125/5000 [12:19<27:28,  2.35it/s]

{'loss': 13.1181, 'grad_norm': 35.43832015991211, 'learning_rate': 8.622222222222223e-06, 'epoch': 0.23}


 23%|██▎       | 1150/5000 [12:31<29:28,  2.18it/s]

{'loss': 9.5095, 'grad_norm': 30.83103370666504, 'learning_rate': 8.566666666666667e-06, 'epoch': 0.23}


 24%|██▎       | 1175/5000 [12:42<32:07,  1.98it/s]

{'loss': 7.2187, 'grad_norm': 21.28726577758789, 'learning_rate': 8.511111111111113e-06, 'epoch': 0.23}


 24%|██▍       | 1200/5000 [12:54<29:15,  2.16it/s]

{'loss': 10.467, 'grad_norm': 46.20471954345703, 'learning_rate': 8.455555555555555e-06, 'epoch': 0.24}


 24%|██▍       | 1225/5000 [13:06<30:35,  2.06it/s]

{'loss': 9.919, 'grad_norm': 28.64875602722168, 'learning_rate': 8.400000000000001e-06, 'epoch': 0.24}


 25%|██▌       | 1250/5000 [13:17<28:35,  2.19it/s]

{'loss': 6.8976, 'grad_norm': 22.253583908081055, 'learning_rate': 8.344444444444445e-06, 'epoch': 0.25}


 26%|██▌       | 1275/5000 [13:29<33:57,  1.83it/s]

{'loss': 8.0384, 'grad_norm': 34.93832778930664, 'learning_rate': 8.288888888888889e-06, 'epoch': 0.26}


 26%|██▌       | 1300/5000 [13:41<32:21,  1.91it/s]

{'loss': 9.6958, 'grad_norm': 53.498252868652344, 'learning_rate': 8.233333333333335e-06, 'epoch': 0.26}


 26%|██▋       | 1325/5000 [13:54<28:01,  2.19it/s]

{'loss': 11.8991, 'grad_norm': 25.14459991455078, 'learning_rate': 8.177777777777779e-06, 'epoch': 0.27}


 27%|██▋       | 1350/5000 [14:05<28:09,  2.16it/s]

{'loss': 8.2285, 'grad_norm': 22.195329666137695, 'learning_rate': 8.122222222222223e-06, 'epoch': 0.27}


 28%|██▊       | 1375/5000 [14:18<34:42,  1.74it/s]

{'loss': 10.7342, 'grad_norm': 19.79163932800293, 'learning_rate': 8.066666666666667e-06, 'epoch': 0.28}


 28%|██▊       | 1400/5000 [14:29<23:14,  2.58it/s]

{'loss': 8.8996, 'grad_norm': 24.90903091430664, 'learning_rate': 8.011111111111113e-06, 'epoch': 0.28}


 28%|██▊       | 1425/5000 [14:39<24:53,  2.39it/s]

{'loss': 11.3002, 'grad_norm': 19.91614532470703, 'learning_rate': 7.955555555555557e-06, 'epoch': 0.28}


 29%|██▉       | 1450/5000 [14:50<25:12,  2.35it/s]

{'loss': 7.7587, 'grad_norm': 21.977266311645508, 'learning_rate': 7.9e-06, 'epoch': 0.29}


 30%|██▉       | 1475/5000 [15:01<29:59,  1.96it/s]

{'loss': 7.2277, 'grad_norm': 21.85486602783203, 'learning_rate': 7.844444444444446e-06, 'epoch': 0.29}


 30%|███       | 1500/5000 [15:13<26:47,  2.18it/s]

{'loss': 12.9917, 'grad_norm': 45.723655700683594, 'learning_rate': 7.788888888888889e-06, 'epoch': 0.3}


 30%|███       | 1525/5000 [15:26<30:01,  1.93it/s]

{'loss': 11.9718, 'grad_norm': 36.46078109741211, 'learning_rate': 7.733333333333334e-06, 'epoch': 0.3}


 31%|███       | 1550/5000 [15:37<23:38,  2.43it/s]

{'loss': 9.2663, 'grad_norm': 19.235685348510742, 'learning_rate': 7.677777777777778e-06, 'epoch': 0.31}


 32%|███▏      | 1575/5000 [15:49<24:28,  2.33it/s]

{'loss': 11.1068, 'grad_norm': 24.90699577331543, 'learning_rate': 7.622222222222223e-06, 'epoch': 0.32}


 32%|███▏      | 1600/5000 [16:02<28:11,  2.01it/s]

{'loss': 12.7329, 'grad_norm': 19.131244659423828, 'learning_rate': 7.566666666666667e-06, 'epoch': 0.32}


 32%|███▎      | 1625/5000 [16:14<23:12,  2.42it/s]

{'loss': 11.802, 'grad_norm': 32.598976135253906, 'learning_rate': 7.511111111111111e-06, 'epoch': 0.33}


 33%|███▎      | 1650/5000 [16:26<32:34,  1.71it/s]

{'loss': 11.511, 'grad_norm': 36.98774337768555, 'learning_rate': 7.455555555555556e-06, 'epoch': 0.33}


 34%|███▎      | 1675/5000 [16:38<26:20,  2.10it/s]

{'loss': 9.7183, 'grad_norm': 4.880810260772705, 'learning_rate': 7.4e-06, 'epoch': 0.34}


 34%|███▍      | 1700/5000 [16:50<29:54,  1.84it/s]

{'loss': 11.0284, 'grad_norm': 50.105995178222656, 'learning_rate': 7.344444444444445e-06, 'epoch': 0.34}


 34%|███▍      | 1725/5000 [17:01<23:43,  2.30it/s]

{'loss': 9.69, 'grad_norm': 5.6285881996154785, 'learning_rate': 7.28888888888889e-06, 'epoch': 0.34}


 35%|███▌      | 1750/5000 [17:13<23:48,  2.27it/s]

{'loss': 10.7506, 'grad_norm': 8.49683952331543, 'learning_rate': 7.233333333333334e-06, 'epoch': 0.35}


 36%|███▌      | 1775/5000 [17:26<22:36,  2.38it/s]

{'loss': 11.9686, 'grad_norm': 26.27542495727539, 'learning_rate': 7.177777777777778e-06, 'epoch': 0.35}


 36%|███▌      | 1800/5000 [17:39<29:51,  1.79it/s]

{'loss': 7.9553, 'grad_norm': 25.539291381835938, 'learning_rate': 7.122222222222222e-06, 'epoch': 0.36}


 36%|███▋      | 1825/5000 [17:51<26:18,  2.01it/s]

{'loss': 14.5006, 'grad_norm': 22.876995086669922, 'learning_rate': 7.066666666666667e-06, 'epoch': 0.36}


 37%|███▋      | 1850/5000 [18:04<28:18,  1.85it/s]

{'loss': 13.3197, 'grad_norm': 99.7592544555664, 'learning_rate': 7.011111111111112e-06, 'epoch': 0.37}


 38%|███▊      | 1875/5000 [18:15<21:28,  2.43it/s]

{'loss': 8.9323, 'grad_norm': 21.663768768310547, 'learning_rate': 6.955555555555557e-06, 'epoch': 0.38}


 38%|███▊      | 1900/5000 [18:26<24:07,  2.14it/s]

{'loss': 10.4985, 'grad_norm': 15.355202674865723, 'learning_rate': 6.9e-06, 'epoch': 0.38}


 38%|███▊      | 1925/5000 [18:38<28:00,  1.83it/s]

{'loss': 8.4687, 'grad_norm': 44.62416458129883, 'learning_rate': 6.844444444444445e-06, 'epoch': 0.39}


 39%|███▉      | 1939/5000 [18:44<25:09,  2.03it/s]'(ProtocolError('Connection aborted.', BrokenPipeError(32, 'Broken pipe')), '(Request ID: 0c031e22-f82b-4ff1-b312-6276c15d2329)')' thrown while requesting GET https://huggingface.co/datasets/nguyenvulebinh/AVYT/resolve/e6c6bf6f40e698b82215d269cfc0a0d65a7a2372/vox2/vox2-dev-000009.tar
Retrying in 1s [Retry 1/5].
 39%|███▉      | 1950/5000 [18:50<23:00,  2.21it/s]

{'loss': 10.7728, 'grad_norm': 35.69119644165039, 'learning_rate': 6.788888888888889e-06, 'epoch': 0.39}


 40%|███▉      | 1975/5000 [19:02<27:40,  1.82it/s]

{'loss': 10.4086, 'grad_norm': 45.720706939697266, 'learning_rate': 6.733333333333334e-06, 'epoch': 0.4}


 40%|████      | 2000/5000 [19:13<20:14,  2.47it/s]

{'loss': 11.9201, 'grad_norm': 39.35735321044922, 'learning_rate': 6.677777777777779e-06, 'epoch': 0.4}


Too many dataloader workers: 10 (max is dataset.num_shards=3). Stopping 7 dataloader workers.
 40%|████      | 2000/5000 [22:19<20:14,  2.47it/s]

{'eval_loss': 27.953811645507812, 'eval_runtime': 185.9282, 'eval_samples_per_second': 20.933, 'eval_steps_per_second': 5.233, 'epoch': 0.4}


 40%|████      | 2025/5000 [22:39<24:29,  2.02it/s]   

{'loss': 7.8158, 'grad_norm': 19.53037452697754, 'learning_rate': 6.6222222222222236e-06, 'epoch': 0.41}


 41%|████      | 2050/5000 [22:51<23:07,  2.13it/s]

{'loss': 9.9482, 'grad_norm': 15.331071853637695, 'learning_rate': 6.566666666666667e-06, 'epoch': 0.41}


 42%|████▏     | 2075/5000 [23:02<21:11,  2.30it/s]

{'loss': 7.7086, 'grad_norm': 27.6015567779541, 'learning_rate': 6.513333333333333e-06, 'epoch': 0.41}


 42%|████▏     | 2100/5000 [23:15<27:41,  1.75it/s]

{'loss': 11.1768, 'grad_norm': 24.74627685546875, 'learning_rate': 6.457777777777778e-06, 'epoch': 0.42}


 42%|████▎     | 2125/5000 [23:27<20:17,  2.36it/s]

{'loss': 9.4251, 'grad_norm': 20.74651336669922, 'learning_rate': 6.402222222222223e-06, 'epoch': 0.42}


 43%|████▎     | 2150/5000 [23:40<24:55,  1.91it/s]

{'loss': 9.6363, 'grad_norm': 23.97159767150879, 'learning_rate': 6.346666666666668e-06, 'epoch': 0.43}


 44%|████▎     | 2175/5000 [23:51<20:24,  2.31it/s]

{'loss': 9.814, 'grad_norm': 29.916763305664062, 'learning_rate': 6.291111111111111e-06, 'epoch': 0.43}


 44%|████▍     | 2200/5000 [24:04<22:35,  2.06it/s]

{'loss': 15.1163, 'grad_norm': 40.98516845703125, 'learning_rate': 6.235555555555556e-06, 'epoch': 0.44}


 44%|████▍     | 2225/5000 [24:16<26:58,  1.71it/s]

{'loss': 8.4664, 'grad_norm': 36.101715087890625, 'learning_rate': 6.18e-06, 'epoch': 0.45}


 45%|████▌     | 2250/5000 [24:29<22:27,  2.04it/s]

{'loss': 12.0627, 'grad_norm': 34.62934494018555, 'learning_rate': 6.124444444444445e-06, 'epoch': 0.45}


 46%|████▌     | 2275/5000 [24:41<24:58,  1.82it/s]

{'loss': 15.8163, 'grad_norm': 46.762794494628906, 'learning_rate': 6.06888888888889e-06, 'epoch': 0.46}


 46%|████▌     | 2300/5000 [24:53<20:41,  2.17it/s]

{'loss': 8.1186, 'grad_norm': 21.6552677154541, 'learning_rate': 6.013333333333335e-06, 'epoch': 0.46}


 46%|████▋     | 2325/5000 [25:05<21:07,  2.11it/s]

{'loss': 8.0795, 'grad_norm': 18.05217933654785, 'learning_rate': 5.957777777777778e-06, 'epoch': 0.47}


 47%|████▋     | 2350/5000 [25:16<21:12,  2.08it/s]

{'loss': 11.2263, 'grad_norm': 15.04200267791748, 'learning_rate': 5.902222222222223e-06, 'epoch': 0.47}


 48%|████▊     | 2375/5000 [25:28<18:38,  2.35it/s]

{'loss': 11.5261, 'grad_norm': 24.241744995117188, 'learning_rate': 5.846666666666667e-06, 'epoch': 0.47}


 48%|████▊     | 2400/5000 [25:41<19:19,  2.24it/s]

{'loss': 15.375, 'grad_norm': 45.321075439453125, 'learning_rate': 5.791111111111112e-06, 'epoch': 0.48}


 48%|████▊     | 2425/5000 [25:54<23:29,  1.83it/s]

{'loss': 9.4973, 'grad_norm': 36.32010269165039, 'learning_rate': 5.735555555555557e-06, 'epoch': 0.48}


 49%|████▉     | 2450/5000 [26:05<19:24,  2.19it/s]

{'loss': 8.794, 'grad_norm': 30.05166244506836, 'learning_rate': 5.68e-06, 'epoch': 0.49}


 50%|████▉     | 2475/5000 [26:16<20:20,  2.07it/s]

{'loss': 8.7011, 'grad_norm': 12.657560348510742, 'learning_rate': 5.624444444444445e-06, 'epoch': 0.49}


 50%|█████     | 2500/5000 [26:28<19:58,  2.09it/s]

{'loss': 13.7538, 'grad_norm': 22.12327003479004, 'learning_rate': 5.56888888888889e-06, 'epoch': 0.5}


 50%|█████     | 2525/5000 [26:41<23:21,  1.77it/s]

{'loss': 9.7759, 'grad_norm': 44.21232986450195, 'learning_rate': 5.513333333333334e-06, 'epoch': 0.51}


 51%|█████     | 2550/5000 [26:52<19:04,  2.14it/s]

{'loss': 10.1073, 'grad_norm': 18.708431243896484, 'learning_rate': 5.4577777777777785e-06, 'epoch': 0.51}


 52%|█████▏    | 2575/5000 [27:03<15:00,  2.69it/s]

{'loss': 10.3904, 'grad_norm': 48.635292053222656, 'learning_rate': 5.402222222222223e-06, 'epoch': 0.52}


 52%|█████▏    | 2600/5000 [27:12<13:57,  2.87it/s]

{'loss': 10.694, 'grad_norm': 5.699521064758301, 'learning_rate': 5.346666666666667e-06, 'epoch': 0.52}


 52%|█████▎    | 2625/5000 [27:23<18:57,  2.09it/s]

{'loss': 9.8554, 'grad_norm': 45.705177307128906, 'learning_rate': 5.2911111111111115e-06, 'epoch': 0.53}


 53%|█████▎    | 2650/5000 [27:35<17:18,  2.26it/s]

{'loss': 12.0868, 'grad_norm': 19.206205368041992, 'learning_rate': 5.235555555555556e-06, 'epoch': 0.53}


 54%|█████▎    | 2675/5000 [27:46<17:48,  2.18it/s]

{'loss': 10.8772, 'grad_norm': 15.50216007232666, 'learning_rate': 5.18e-06, 'epoch': 0.54}


 54%|█████▍    | 2700/5000 [27:57<18:47,  2.04it/s]

{'loss': 10.7653, 'grad_norm': 19.047964096069336, 'learning_rate': 5.124444444444445e-06, 'epoch': 0.54}


 55%|█████▍    | 2725/5000 [28:09<14:46,  2.57it/s]

{'loss': 9.8149, 'grad_norm': 32.18979263305664, 'learning_rate': 5.06888888888889e-06, 'epoch': 0.55}


 55%|█████▌    | 2750/5000 [28:21<16:52,  2.22it/s]

{'loss': 8.5833, 'grad_norm': 39.65501403808594, 'learning_rate': 5.013333333333333e-06, 'epoch': 0.55}


 56%|█████▌    | 2775/5000 [28:34<17:04,  2.17it/s]

{'loss': 10.5767, 'grad_norm': 34.602294921875, 'learning_rate': 4.957777777777778e-06, 'epoch': 0.56}


 56%|█████▌    | 2800/5000 [28:47<21:26,  1.71it/s]

{'loss': 9.416, 'grad_norm': 33.097679138183594, 'learning_rate': 4.902222222222222e-06, 'epoch': 0.56}


 56%|█████▋    | 2825/5000 [29:01<17:00,  2.13it/s]

{'loss': 9.6653, 'grad_norm': 31.39249610900879, 'learning_rate': 4.846666666666667e-06, 'epoch': 0.56}


 57%|█████▋    | 2850/5000 [29:13<16:55,  2.12it/s]

{'loss': 10.3131, 'grad_norm': 25.149261474609375, 'learning_rate': 4.791111111111111e-06, 'epoch': 0.57}


 57%|█████▊    | 2875/5000 [29:25<15:22,  2.30it/s]

{'loss': 8.1825, 'grad_norm': 22.12935447692871, 'learning_rate': 4.735555555555556e-06, 'epoch': 0.57}


 58%|█████▊    | 2900/5000 [29:36<15:00,  2.33it/s]

{'loss': 11.0661, 'grad_norm': 33.05979537963867, 'learning_rate': 4.680000000000001e-06, 'epoch': 0.58}


 58%|█████▊    | 2925/5000 [29:48<16:11,  2.14it/s]

{'loss': 10.5915, 'grad_norm': 23.03805160522461, 'learning_rate': 4.624444444444445e-06, 'epoch': 0.58}


 59%|█████▉    | 2950/5000 [30:00<20:49,  1.64it/s]

{'loss': 15.6464, 'grad_norm': 45.680320739746094, 'learning_rate': 4.568888888888889e-06, 'epoch': 0.59}


 60%|█████▉    | 2975/5000 [30:13<17:25,  1.94it/s]

{'loss': 12.2959, 'grad_norm': 24.538644790649414, 'learning_rate': 4.513333333333333e-06, 'epoch': 0.59}


 60%|██████    | 3000/5000 [30:25<17:24,  1.92it/s]

{'loss': 9.399, 'grad_norm': 25.193845748901367, 'learning_rate': 4.457777777777778e-06, 'epoch': 0.6}


Too many dataloader workers: 10 (max is dataset.num_shards=3). Stopping 7 dataloader workers.
 60%|██████    | 3000/5000 [33:26<17:24,  1.92it/s]

{'eval_loss': 28.117063522338867, 'eval_runtime': 180.8624, 'eval_samples_per_second': 21.519, 'eval_steps_per_second': 5.38, 'epoch': 0.6}


 60%|██████    | 3025/5000 [33:48<16:07,  2.04it/s]   

{'loss': 11.3766, 'grad_norm': 26.64322853088379, 'learning_rate': 4.402222222222223e-06, 'epoch': 0.6}


 61%|██████    | 3050/5000 [34:00<15:38,  2.08it/s]

{'loss': 11.8795, 'grad_norm': 18.54511260986328, 'learning_rate': 4.346666666666667e-06, 'epoch': 0.61}


 62%|██████▏   | 3075/5000 [34:12<14:18,  2.24it/s]

{'loss': 12.2321, 'grad_norm': 42.12863540649414, 'learning_rate': 4.291111111111112e-06, 'epoch': 0.61}


 62%|██████▏   | 3100/5000 [34:23<14:33,  2.18it/s]

{'loss': 8.2434, 'grad_norm': 15.509504318237305, 'learning_rate': 4.235555555555556e-06, 'epoch': 0.62}


 62%|██████▎   | 3125/5000 [34:35<16:54,  1.85it/s]

{'loss': 10.0488, 'grad_norm': 15.417184829711914, 'learning_rate': 4.18e-06, 'epoch': 0.62}


 63%|██████▎   | 3150/5000 [34:47<15:24,  2.00it/s]

{'loss': 10.9935, 'grad_norm': 13.360200881958008, 'learning_rate': 4.124444444444445e-06, 'epoch': 0.63}


 64%|██████▎   | 3175/5000 [34:59<14:57,  2.03it/s]

{'loss': 8.8275, 'grad_norm': 19.79775619506836, 'learning_rate': 4.0688888888888896e-06, 'epoch': 0.64}


 64%|██████▍   | 3200/5000 [35:12<15:41,  1.91it/s]

{'loss': 15.2279, 'grad_norm': 28.95928192138672, 'learning_rate': 4.013333333333334e-06, 'epoch': 0.64}


 64%|██████▍   | 3225/5000 [35:24<13:52,  2.13it/s]

{'loss': 11.5378, 'grad_norm': 69.25008392333984, 'learning_rate': 3.9577777777777785e-06, 'epoch': 0.65}


 65%|██████▌   | 3250/5000 [35:36<17:04,  1.71it/s]

{'loss': 11.3898, 'grad_norm': 52.62971496582031, 'learning_rate': 3.9022222222222225e-06, 'epoch': 0.65}


 66%|██████▌   | 3275/5000 [35:48<16:46,  1.71it/s]

{'loss': 12.5239, 'grad_norm': 44.13243865966797, 'learning_rate': 3.8466666666666665e-06, 'epoch': 0.66}


 66%|██████▌   | 3300/5000 [35:59<12:04,  2.35it/s]

{'loss': 8.2287, 'grad_norm': 33.39620590209961, 'learning_rate': 3.7911111111111114e-06, 'epoch': 0.66}


 66%|██████▋   | 3325/5000 [36:10<11:10,  2.50it/s]

{'loss': 9.9597, 'grad_norm': 54.274024963378906, 'learning_rate': 3.7355555555555555e-06, 'epoch': 0.67}


 67%|██████▋   | 3350/5000 [36:21<11:57,  2.30it/s]

{'loss': 8.3817, 'grad_norm': 23.37487030029297, 'learning_rate': 3.6800000000000003e-06, 'epoch': 0.67}


 68%|██████▊   | 3375/5000 [36:33<11:31,  2.35it/s]

{'loss': 10.2802, 'grad_norm': 71.12081146240234, 'learning_rate': 3.624444444444445e-06, 'epoch': 0.68}


 68%|██████▊   | 3400/5000 [36:45<11:51,  2.25it/s]

{'loss': 13.5395, 'grad_norm': 18.166717529296875, 'learning_rate': 3.568888888888889e-06, 'epoch': 0.68}


 68%|██████▊   | 3425/5000 [36:57<11:29,  2.28it/s]

{'loss': 7.9126, 'grad_norm': 15.723886489868164, 'learning_rate': 3.5133333333333337e-06, 'epoch': 0.69}


 69%|██████▉   | 3450/5000 [37:10<13:31,  1.91it/s]

{'loss': 12.4674, 'grad_norm': 41.52882766723633, 'learning_rate': 3.457777777777778e-06, 'epoch': 0.69}


 70%|██████▉   | 3475/5000 [37:21<11:08,  2.28it/s]

{'loss': 7.5438, 'grad_norm': 21.051822662353516, 'learning_rate': 3.4022222222222222e-06, 'epoch': 0.69}


 70%|███████   | 3500/5000 [37:33<11:31,  2.17it/s]

{'loss': 6.1567, 'grad_norm': 37.188636779785156, 'learning_rate': 3.346666666666667e-06, 'epoch': 0.7}


 70%|███████   | 3525/5000 [37:44<10:15,  2.40it/s]

{'loss': 9.058, 'grad_norm': 25.32170867919922, 'learning_rate': 3.2911111111111116e-06, 'epoch': 0.7}


 71%|███████   | 3550/5000 [37:57<11:01,  2.19it/s]

{'loss': 7.9293, 'grad_norm': 13.827238082885742, 'learning_rate': 3.2355555555555556e-06, 'epoch': 0.71}


 72%|███████▏  | 3575/5000 [38:09<10:14,  2.32it/s]

{'loss': 13.31, 'grad_norm': 29.573589324951172, 'learning_rate': 3.1800000000000005e-06, 'epoch': 0.71}


 72%|███████▏  | 3600/5000 [38:20<11:28,  2.03it/s]

{'loss': 7.8392, 'grad_norm': 11.936644554138184, 'learning_rate': 3.124444444444445e-06, 'epoch': 0.72}


 72%|███████▎  | 3625/5000 [38:32<10:54,  2.10it/s]

{'loss': 8.8927, 'grad_norm': 42.13713455200195, 'learning_rate': 3.068888888888889e-06, 'epoch': 0.72}


 73%|███████▎  | 3650/5000 [38:44<10:33,  2.13it/s]

{'loss': 9.542, 'grad_norm': 4.396058559417725, 'learning_rate': 3.013333333333334e-06, 'epoch': 0.73}


 74%|███████▎  | 3675/5000 [38:57<11:09,  1.98it/s]

{'loss': 10.3824, 'grad_norm': 125.1960678100586, 'learning_rate': 2.957777777777778e-06, 'epoch': 0.73}


 74%|███████▍  | 3700/5000 [39:08<09:23,  2.31it/s]

{'loss': 8.355, 'grad_norm': 36.30138397216797, 'learning_rate': 2.9022222222222223e-06, 'epoch': 0.74}


 74%|███████▍  | 3725/5000 [39:19<08:15,  2.57it/s]

{'loss': 7.0096, 'grad_norm': 28.381624221801758, 'learning_rate': 2.8466666666666672e-06, 'epoch': 0.74}


 75%|███████▌  | 3750/5000 [39:32<11:09,  1.87it/s]

{'loss': 11.4334, 'grad_norm': 25.482097625732422, 'learning_rate': 2.7911111111111113e-06, 'epoch': 0.75}


 76%|███████▌  | 3775/5000 [39:43<10:50,  1.88it/s]

{'loss': 8.9277, 'grad_norm': 26.384965896606445, 'learning_rate': 2.7355555555555557e-06, 'epoch': 0.76}


 76%|███████▌  | 3800/5000 [39:56<11:42,  1.71it/s]

{'loss': 13.194, 'grad_norm': 40.55134201049805, 'learning_rate': 2.68e-06, 'epoch': 0.76}


 76%|███████▋  | 3825/5000 [40:07<09:33,  2.05it/s]

{'loss': 10.9129, 'grad_norm': 13.829072952270508, 'learning_rate': 2.6244444444444446e-06, 'epoch': 0.77}


 77%|███████▋  | 3850/5000 [40:19<09:23,  2.04it/s]

{'loss': 9.6096, 'grad_norm': 24.043458938598633, 'learning_rate': 2.568888888888889e-06, 'epoch': 0.77}


 78%|███████▊  | 3875/5000 [40:31<08:53,  2.11it/s]

{'loss': 7.4753, 'grad_norm': 59.92928695678711, 'learning_rate': 2.5133333333333336e-06, 'epoch': 0.78}


 78%|███████▊  | 3900/5000 [40:42<07:53,  2.32it/s]

{'loss': 8.8431, 'grad_norm': 36.01270294189453, 'learning_rate': 2.457777777777778e-06, 'epoch': 0.78}


 78%|███████▊  | 3925/5000 [40:53<07:24,  2.42it/s]

{'loss': 13.4599, 'grad_norm': 29.087406158447266, 'learning_rate': 2.4022222222222225e-06, 'epoch': 0.79}


 79%|███████▉  | 3950/5000 [41:03<07:33,  2.32it/s]

{'loss': 7.6836, 'grad_norm': 23.203397750854492, 'learning_rate': 2.346666666666667e-06, 'epoch': 0.79}


 80%|███████▉  | 3975/5000 [41:16<08:07,  2.10it/s]

{'loss': 13.7593, 'grad_norm': 35.90630340576172, 'learning_rate': 2.2911111111111114e-06, 'epoch': 0.8}


 80%|████████  | 4000/5000 [41:28<06:48,  2.45it/s]

{'loss': 10.5936, 'grad_norm': 16.189184188842773, 'learning_rate': 2.235555555555556e-06, 'epoch': 0.8}


Too many dataloader workers: 10 (max is dataset.num_shards=3). Stopping 7 dataloader workers.
 80%|████████  | 4000/5000 [44:32<06:48,  2.45it/s]

{'eval_loss': 27.893360137939453, 'eval_runtime': 184.3944, 'eval_samples_per_second': 21.107, 'eval_steps_per_second': 5.277, 'epoch': 0.8}


 80%|████████  | 4025/5000 [44:54<08:34,  1.90it/s]   

{'loss': 11.5433, 'grad_norm': 38.4270133972168, 'learning_rate': 2.1800000000000003e-06, 'epoch': 0.81}


 81%|████████  | 4050/5000 [45:05<06:38,  2.38it/s]

{'loss': 7.4458, 'grad_norm': 31.571823120117188, 'learning_rate': 2.1244444444444443e-06, 'epoch': 0.81}


 82%|████████▏ | 4075/5000 [45:17<08:43,  1.77it/s]

{'loss': 10.8207, 'grad_norm': 42.33230972290039, 'learning_rate': 2.0688888888888892e-06, 'epoch': 0.81}


 82%|████████▏ | 4100/5000 [45:28<07:11,  2.08it/s]

{'loss': 8.7983, 'grad_norm': 67.53534698486328, 'learning_rate': 2.0133333333333337e-06, 'epoch': 0.82}


 82%|████████▎ | 4125/5000 [45:39<06:19,  2.30it/s]

{'loss': 8.5271, 'grad_norm': 23.134328842163086, 'learning_rate': 1.9577777777777777e-06, 'epoch': 0.82}


 83%|████████▎ | 4150/5000 [45:50<06:31,  2.17it/s]

{'loss': 7.6108, 'grad_norm': 30.07275390625, 'learning_rate': 1.9022222222222222e-06, 'epoch': 0.83}


 84%|████████▎ | 4175/5000 [46:02<05:56,  2.31it/s]

{'loss': 10.108, 'grad_norm': 10.205194473266602, 'learning_rate': 1.8466666666666668e-06, 'epoch': 0.83}


 84%|████████▍ | 4200/5000 [46:14<06:16,  2.13it/s]

{'loss': 9.0522, 'grad_norm': 33.88395690917969, 'learning_rate': 1.7911111111111113e-06, 'epoch': 0.84}


 84%|████████▍ | 4225/5000 [46:26<05:58,  2.16it/s]

{'loss': 12.5181, 'grad_norm': 44.63100814819336, 'learning_rate': 1.7355555555555555e-06, 'epoch': 0.84}


 85%|████████▌ | 4250/5000 [46:38<05:50,  2.14it/s]

{'loss': 11.1912, 'grad_norm': 9.855303764343262, 'learning_rate': 1.6800000000000002e-06, 'epoch': 0.85}


 86%|████████▌ | 4275/5000 [46:49<05:01,  2.40it/s]

{'loss': 4.8685, 'grad_norm': 21.444812774658203, 'learning_rate': 1.6244444444444447e-06, 'epoch': 0.85}


 86%|████████▌ | 4300/5000 [47:01<04:51,  2.40it/s]

{'loss': 7.7701, 'grad_norm': 18.94683265686035, 'learning_rate': 1.568888888888889e-06, 'epoch': 0.86}


 86%|████████▋ | 4325/5000 [47:14<06:40,  1.69it/s]

{'loss': 11.3297, 'grad_norm': 17.289413452148438, 'learning_rate': 1.5133333333333334e-06, 'epoch': 0.86}


 87%|████████▋ | 4350/5000 [47:25<04:03,  2.66it/s]

{'loss': 11.0205, 'grad_norm': 45.83953857421875, 'learning_rate': 1.457777777777778e-06, 'epoch': 0.87}


 88%|████████▊ | 4375/5000 [47:35<03:53,  2.67it/s]

{'loss': 8.4238, 'grad_norm': 42.81150817871094, 'learning_rate': 1.4022222222222223e-06, 'epoch': 0.88}


 88%|████████▊ | 4400/5000 [47:47<04:58,  2.01it/s]

{'loss': 7.8131, 'grad_norm': 25.17094612121582, 'learning_rate': 1.3466666666666668e-06, 'epoch': 0.88}


 88%|████████▊ | 4425/5000 [48:00<04:10,  2.29it/s]

{'loss': 10.4077, 'grad_norm': 4.785533905029297, 'learning_rate': 1.2911111111111112e-06, 'epoch': 0.89}


 89%|████████▉ | 4450/5000 [48:10<03:53,  2.36it/s]

{'loss': 8.8586, 'grad_norm': 33.33085250854492, 'learning_rate': 1.2355555555555557e-06, 'epoch': 0.89}


 90%|████████▉ | 4475/5000 [48:23<04:13,  2.07it/s]

{'loss': 15.447, 'grad_norm': 36.584999084472656, 'learning_rate': 1.1800000000000001e-06, 'epoch': 0.9}


 90%|█████████ | 4500/5000 [48:36<04:39,  1.79it/s]

{'loss': 9.1598, 'grad_norm': 24.8271484375, 'learning_rate': 1.1266666666666667e-06, 'epoch': 0.9}


 90%|█████████ | 4525/5000 [48:49<04:51,  1.63it/s]

{'loss': 10.164, 'grad_norm': 26.300390243530273, 'learning_rate': 1.0711111111111112e-06, 'epoch': 0.91}


 91%|█████████ | 4550/5000 [49:01<03:34,  2.10it/s]

{'loss': 11.4717, 'grad_norm': 32.084049224853516, 'learning_rate': 1.0155555555555557e-06, 'epoch': 0.91}


 92%|█████████▏| 4575/5000 [49:11<03:24,  2.08it/s]

{'loss': 9.8428, 'grad_norm': 47.64494323730469, 'learning_rate': 9.600000000000001e-07, 'epoch': 0.92}


 92%|█████████▏| 4600/5000 [49:23<03:31,  1.89it/s]

{'loss': 8.7087, 'grad_norm': 64.92591094970703, 'learning_rate': 9.044444444444445e-07, 'epoch': 0.92}


 92%|█████████▎| 4625/5000 [49:38<03:41,  1.69it/s]

{'loss': 12.7356, 'grad_norm': 42.56081771850586, 'learning_rate': 8.488888888888889e-07, 'epoch': 0.93}


 93%|█████████▎| 4650/5000 [49:50<03:00,  1.94it/s]

{'loss': 10.7616, 'grad_norm': 47.15546798706055, 'learning_rate': 7.933333333333335e-07, 'epoch': 0.93}


 94%|█████████▎| 4675/5000 [50:03<02:25,  2.23it/s]

{'loss': 11.3828, 'grad_norm': 23.036575317382812, 'learning_rate': 7.377777777777779e-07, 'epoch': 0.94}


 94%|█████████▍| 4700/5000 [50:15<02:08,  2.34it/s]

{'loss': 12.025, 'grad_norm': 15.881168365478516, 'learning_rate': 6.822222222222223e-07, 'epoch': 0.94}


 94%|█████████▍| 4725/5000 [50:27<02:08,  2.13it/s]

{'loss': 11.195, 'grad_norm': 29.00819206237793, 'learning_rate': 6.266666666666667e-07, 'epoch': 0.94}


 95%|█████████▍| 4737/5000 [50:33<02:03,  2.14it/s]
'(ProtocolError('Connection aborted.', BrokenPipeError(32, 'Broken pipe')), '(Request ID: 951f2610-e320-4d1c-99be-d370877ec447)')' thrown while requesting GET https://huggingface.co/datasets/nguyenvulebinh/AVYT/resolve/e6c6bf6f40e698b82215d269cfc0a0d65a7a2372/vox2/vox2-dev-000005.tar
Retrying in 1s [Retry 1/5].
 95%|█████████▌| 4750/5000 [50:40<01:54,  2.18it/s]

{'loss': 10.238, 'grad_norm': 43.60195541381836, 'learning_rate': 5.711111111111111e-07, 'epoch': 0.95}


 96%|█████████▌| 4775/5000 [50:52<01:40,  2.25it/s]

{'loss': 9.2801, 'grad_norm': 35.99544143676758, 'learning_rate': 5.155555555555556e-07, 'epoch': 0.95}


 96%|█████████▌| 4800/5000 [51:04<01:28,  2.25it/s]

{'loss': 10.504, 'grad_norm': 29.351713180541992, 'learning_rate': 4.6000000000000004e-07, 'epoch': 0.96}


 96%|█████████▋| 4825/5000 [51:16<01:44,  1.68it/s]

{'loss': 9.7053, 'grad_norm': 45.090911865234375, 'learning_rate': 4.0444444444444445e-07, 'epoch': 0.96}


 97%|█████████▋| 4850/5000 [51:28<01:08,  2.19it/s]

{'loss': 8.1993, 'grad_norm': 2.6401584148406982, 'learning_rate': 3.488888888888889e-07, 'epoch': 0.97}


 98%|█████████▊| 4875/5000 [51:41<01:09,  1.80it/s]

{'loss': 10.7727, 'grad_norm': 26.89510154724121, 'learning_rate': 2.9333333333333337e-07, 'epoch': 0.97}


 98%|█████████▊| 4900/5000 [51:52<00:43,  2.31it/s]

{'loss': 7.9399, 'grad_norm': 14.220874786376953, 'learning_rate': 2.3777777777777777e-07, 'epoch': 0.98}


 98%|█████████▊| 4925/5000 [52:05<00:35,  2.12it/s]

{'loss': 11.5709, 'grad_norm': 34.67255783081055, 'learning_rate': 1.8222222222222226e-07, 'epoch': 0.98}


 99%|█████████▉| 4950/5000 [52:16<00:21,  2.32it/s]

{'loss': 10.8846, 'grad_norm': 19.015058517456055, 'learning_rate': 1.2666666666666666e-07, 'epoch': 0.99}


100%|█████████▉| 4975/5000 [52:28<00:11,  2.20it/s]

{'loss': 8.9909, 'grad_norm': 23.241195678710938, 'learning_rate': 7.111111111111112e-08, 'epoch': 0.99}


100%|██████████| 5000/5000 [52:40<00:00,  2.38it/s]

{'loss': 10.3017, 'grad_norm': 36.679786682128906, 'learning_rate': 1.5555555555555557e-08, 'epoch': 1.0}


Too many dataloader workers: 10 (max is dataset.num_shards=3). Stopping 7 dataloader workers.
100%|██████████| 5000/5000 [55:47<00:00,  2.38it/s]

{'eval_loss': 27.72670555114746, 'eval_runtime': 187.3396, 'eval_samples_per_second': 20.775, 'eval_steps_per_second': 5.194, 'epoch': 1.0}


100%|██████████| 5000/5000 [55:56<00:00,  1.49it/s]


{'train_runtime': 3356.0327, 'train_samples_per_second': 11.919, 'train_steps_per_second': 1.49, 'train_loss': 10.243281088256836, 'epoch': 1.0}


CompletedProcess(args=['/home/josch080/Projektgruppe/mcorec_train/bin/python', 'script/train.py', '--streaming_dataset', '--include_mcorec', '--mcorec_mode', 'heavy', '--batch_size', '4', '--max_steps', '5000', '--gradient_accumulation_steps', '2', '--save_steps', '1000', '--eval_steps', '1000', '--log_interval', '25', '--learning_rate', '1e-5', '--warmup_steps', '500', '--checkpoint_name', 'avsr_cocktail_mcorec_stage2_light_lr1e-5_5k', '--model_name_or_path', './model-bin/avsr_cocktail_mcorec_finetune', '--output_dir', './model-bin', '--report_to', 'none'], returncode=0)
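The summary line and the command-line arguments above can be cross-checked with a little arithmetic. The sketch below assumes a single GPU (so one optimizer step consumes `batch_size × gradient_accumulation_steps` samples) and assumes a linear warmup followed by a linear decay to zero; the schedule type is inferred from the logged values and is not confirmed by `script/train.py`. The helper `linear_schedule_lr` is hypothetical and exists only for this check.

```python
# Values copied from the CompletedProcess args and the final summary line above.
BATCH_SIZE = 4
GRAD_ACCUM = 2
MAX_STEPS = 5000
WARMUP_STEPS = 500
PEAK_LR = 1e-5
TRAIN_RUNTIME_S = 3356.0327


def linear_schedule_lr(step, peak_lr=PEAK_LR, warmup=WARMUP_STEPS, total=MAX_STEPS):
    """Assumed schedule: linear warmup to peak_lr, then linear decay to 0."""
    if step < warmup:
        return peak_lr * step / warmup
    return peak_lr * max(0, total - step) / (total - warmup)


# Samples consumed per optimizer step (single-device assumption).
samples_per_step = BATCH_SIZE * GRAD_ACCUM
total_samples = samples_per_step * MAX_STEPS

print("samples per optimizer step:", samples_per_step)
print("samples/s:", round(total_samples / TRAIN_RUNTIME_S, 3))
print("steps/s:", round(MAX_STEPS / TRAIN_RUNTIME_S, 2))

# Scheduler steps whose formula value matches two logged learning rates.
print("lr at scheduler step 3744:", linear_schedule_lr(3744))
print("lr at scheduler step 4993:", linear_schedule_lr(4993))
```

Under these assumptions the throughput figures agree with the logged `train_samples_per_second` (11.919) and `train_steps_per_second` (1.49). The first learning rate in this excerpt (about 2.7911e-06) matches scheduler step 3744 and the last one (about 1.5556e-08) matches step 4993, so the logged rate trails the progress counter by a few steps; the log itself does not show why.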

## 5 – Inference setup

In [6]:
import os, sys
import pandas as pd

# Set the working directory to the repo root (required for all relative paths)
project_baseline_path = "/home/josch080/Projektgruppe/mcorec_baseline"
os.chdir(project_baseline_path)

# Add the repo root to sys.path so that project-internal modules can be imported
if project_baseline_path not in sys.path:
    sys.path.append(project_baseline_path)

from script.pg_utils_experiments import run_inference_for_experiment, run_eval_and_log, append_eval_results_for_experiments

## 6 – Model definitions

BL4 serves as the reference; the new Stage-2-Light checkpoint is compared against it.

In [11]:
MODELS = {
    # BL4: reference model
    "cocktail_finetuned": {
        "model_type": "avsr_cocktail",
        "chkpt": "model-bin/avsr_cocktail_mcorec_finetune",
        "out": "output_avsr_cocktail_finetuned",
    },
    # Stage-2-Light: BL4 + 5k additional steps with the MCoRec-heavy mix
    "cocktail_stage2_light": {
        "model_type": "avsr_cocktail",
        "chkpt": "model-bin/avsr_cocktail_mcorec_stage2_light_lr1e-5_5k/checkpoint-5000",
        "out": "output_avsr_cocktail_stage2_light",
    },
}
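Before starting a long inference loop it can help to confirm that every checkpoint path in the model table exists. The following is a minimal sketch, assuming each `chkpt` entry is a directory relative to the repo root; `missing_checkpoints` is a hypothetical helper and not part of `script.pg_utils_experiments`.

```python
from pathlib import Path


def missing_checkpoints(models, root="."):
    """Return {model_name: checkpoint_path} for checkpoints not found under root."""
    return {
        name: cfg["chkpt"]
        for name, cfg in models.items()
        if not (Path(root) / cfg["chkpt"]).is_dir()
    }


# Checkpoint paths copied from the MODELS cell above.
MODELS = {
    "cocktail_finetuned": {
        "chkpt": "model-bin/avsr_cocktail_mcorec_finetune",
    },
    "cocktail_stage2_light": {
        "chkpt": "model-bin/avsr_cocktail_mcorec_stage2_light_lr1e-5_5k/checkpoint-5000",
    },
}

# An empty dict means every checkpoint directory was found.
print(missing_checkpoints(MODELS))
```

Run from the repo root, an empty result indicates that both checkpoints are in place; any entry in the result names a model whose path should be checked before inference.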

## 7 – Sessions & experiments

The same 5-session subset as in the previous experiments is used so that results are directly comparable.
Two experiments use the best hyperparameters found so far (beam=8/12, len=20).

In [12]:
SESSION_IDS = ["session_40", "session_43", "session_49", "session_50", "session_54"]

In [18]:
EXPERIMENTS = {
    "E14_stage2_light_bs8_len20": {
        "base_model": "cocktail_stage2_light",
        "beam_size": 8,
        "max_length": 20,
        "comment": "Stage-2-Light, beam=8, len=20",
    },

    "E15_stage2_light_bs12_len20": {
        "base_model": "cocktail_stage2_light",
        "beam_size": 12,
        "max_length": 20,
        "comment": "Stage-2-Light, beam=12, len=20",
    },
}
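Each experiment refers to a model by name, so a typo in `base_model` only surfaces once inference starts. The sketch below checks the experiment table against the model table up front. The set of required keys is an assumption inferred from the cells above, and `validate_experiments` is a hypothetical helper.

```python
# Keys every experiment is assumed to need (inferred from the cells above).
REQUIRED_KEYS = {"base_model", "beam_size", "max_length"}


def validate_experiments(experiments, models):
    """Return a list of human-readable problems; an empty list means all is well."""
    problems = []
    for name, cfg in experiments.items():
        missing = REQUIRED_KEYS - cfg.keys()
        if missing:
            problems.append(f"{name}: missing keys {sorted(missing)}")
        base = cfg.get("base_model")
        if base is not None and base not in models:
            problems.append(f"{name}: unknown base_model {base!r}")
    return problems


# Reduced copies of the tables defined in this notebook.
MODELS = {"cocktail_finetuned": {}, "cocktail_stage2_light": {}}
EXPERIMENTS = {
    "E14_stage2_light_bs8_len20": {
        "base_model": "cocktail_stage2_light", "beam_size": 8, "max_length": 20,
    },
    "E15_stage2_light_bs12_len20": {
        "base_model": "cocktail_stage2_light", "beam_size": 12, "max_length": 20,
    },
}

print(validate_experiments(EXPERIMENTS, MODELS))
```

Returning a list of messages instead of raising on the first problem makes it possible to fix all configuration errors in one pass.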


## 8 – Inference

2 experiments × 5 sessions = 10 runs.
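The run count can be reproduced by enumerating the session/experiment grid explicitly. This is only an illustrative sketch: it lists the combinations in the same order as a session-outer, experiment-inner loop and does not call the project's inference helper. The IDs and names are copied from the cells above.

```python
from itertools import product

# Session IDs and experiment names as defined earlier in this notebook.
SESSION_IDS = ["session_40", "session_43", "session_49", "session_50", "session_54"]
EXPERIMENT_NAMES = ["E14_stage2_light_bs8_len20", "E15_stage2_light_bs12_len20"]

# product() varies the last argument fastest: all experiments for one session first.
runs = list(product(SESSION_IDS, EXPERIMENT_NAMES))

print(len(runs), "runs")
for session_id, exp_name in runs[:3]:
    print(session_id, exp_name)
```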

In [14]:
for sid in SESSION_IDS:
    session_dir = f"data-bin/dev/{sid}"
    print(f"\n########## Starte Experimente für {sid} ##########")

    for exp_name in EXPERIMENTS:
        run_inference_for_experiment(
            exp_name=exp_name,
            base_models=MODELS,
            experiments=EXPERIMENTS,
            session_dir=session_dir,
        )


########## Starte Experimente für session_40 ##########

Starte Inference für Experiment: E14_stage2_light_bs8_len20
  base_model      = cocktail_stage2_light
  model_type      = avsr_cocktail
  checkpoint_path = model-bin/avsr_cocktail_mcorec_stage2_light_lr1e-5_5k/checkpoint-5000
  beam_size       = 8
  max_length      = 20
  output_dir_name = output_E14_stage2_light_bs8_len20
  session_dir     = data-bin/dev_without_central_videos/dev/session_40
  comment         = Stage-2-Light, beam=8, len=20
Loading avsr_cocktail model...
Loading model from model-bin/avsr_cocktail_mcorec_stage2_light_lr1e-5_5k/checkpoint-5000
avsr_cocktail model loaded successfully!
Inferring 1 sessions using avsr_cocktail model
Processing session session_40


Processing speakers:   0%|          | 0/6 [00:00<?, ?it/s]





[Acessing speaker spk_0 track 1 of 1:   0%|          | 0/35 [00:00<?, ?it/s]
[Acessing speaker spk_0 track 1 of 1:   3%|▎         | 1/35 [00:02<01:30,  2.67s/it]
[Acessing speaker spk_0 track 1 of 1:   6%|▌         | 2/35 [00:03<00:43,  1.32s/it]
[Acessing speaker spk_0 track 1 of 1:   9%|▊         | 3/35 [00:03<00:29,  1.07it/s]
[Acessing speaker spk_0 track 1 of 1:  11%|█▏        | 4/35 [00:03<00:21,  1.41it/s]
[Acessing speaker spk_0 track 1 of 1:  14%|█▍        | 5/35 [00:04<00:17,  1.72it/s]
[Acessing speaker spk_0 track 1 of 1:  17%|█▋        | 6/35 [00:05<00:22,  1.28it/s]
[Acessing speaker spk_0 track 1 of 1:  20%|██        | 7/35 [00:09<00:50,  1.79s/it]
[Acessing speaker spk_0 track 1 of 1:  23%|██▎       | 8/35 [00:09<00:38,  1.43s/it]
[Acessing speaker spk_0 track 1 of 1:  26%|██▌       | 9/35 [00:11<00:35,  1.36s/it]
[Acessing speaker spk_0 track 1 of 1:  29%|██▊       | 10/35 [00:11<00:29,  1.18s/it]
[Acessing speaker spk_0 track 1 of 1:  31%|███▏      | 11/3





[Acessing speaker spk_1 track 1 of 1:   0%|          | 0/40 [00:00<?, ?it/s]
[Acessing speaker spk_1 track 1 of 1:   2%|▎         | 1/40 [00:00<00:23,  1.67it/s]
[Acessing speaker spk_1 track 1 of 1:   5%|▌         | 2/40 [00:01<00:30,  1.24it/s]
[Acessing speaker spk_1 track 1 of 1:   8%|▊         | 3/40 [00:02<00:27,  1.35it/s]
[Acessing speaker spk_1 track 1 of 1:  10%|█         | 4/40 [00:02<00:24,  1.45it/s]
[Acessing speaker spk_1 track 1 of 1:  12%|█▎        | 5/40 [00:03<00:29,  1.19it/s]
[Acessing speaker spk_1 track 1 of 1:  15%|█▌        | 6/40 [00:04<00:28,  1.19it/s]
[Acessing speaker spk_1 track 1 of 1:  18%|█▊        | 7/40 [00:05<00:24,  1.37it/s]
[Acessing speaker spk_1 track 1 of 1:  20%|██        | 8/40 [00:05<00:21,  1.52it/s]
[Acessing speaker spk_1 track 1 of 1:  22%|██▎       | 9/40 [00:06<00:18,  1.66it/s]
[Acessing speaker spk_1 track 1 of 1:  25%|██▌       | 10/40 [00:06<00:19,  1.57it/s]
[Acessing speaker spk_1 track 1 of 1:  28%|██▊       | 11/4





Processing speaker spk_2 track 1 of 3: 0it [00:00, ?it/s]

[Acessing speaker spk_2 track 2 of 3:   0%|          | 0/13 [00:00<?, ?it/s]
[Acessing speaker spk_2 track 2 of 3:   8%|▊         | 1/13 [00:00<00:06,  1.79it/s]
[Acessing speaker spk_2 track 2 of 3:  15%|█▌        | 2/13 [00:01<00:08,  1.25it/s]
[Acessing speaker spk_2 track 2 of 3:  23%|██▎       | 3/13 [00:02<00:09,  1.05it/s]
[Acessing speaker spk_2 track 2 of 3:  31%|███       | 4/13 [00:06<00:19,  2.21s/it]
[Acessing speaker spk_2 track 2 of 3:  38%|███▊      | 5/13 [00:07<00:14,  1.78s/it]
[Acessing speaker spk_2 track 2 of 3:  46%|████▌     | 6/13 [00:11<00:16,  2.32s/it]
[Acessing speaker spk_2 track 2 of 3:  54%|█████▍    | 7/13 [00:15<00:17,  2.86s/it]
[Acessing speaker spk_2 track 2 of 3:  62%|██████▏   | 8/13 [00:20<00:17,  3.54s/it]
[Acessing speaker spk_2 track 2 of 3:  69%|██████▉   | 9/13 [00:20<00:10,  2.68s/it]
[Acessing speaker spk_2 track 2 of 3:  77%|███████▋  | 10/13 [00:26<00:10,  3.45s/it]






[Acessing speaker spk_3 track 1 of 2:   0%|          | 0/18 [00:00<?, ?it/s]
[Acessing speaker spk_3 track 1 of 2:   6%|▌         | 1/18 [00:00<00:11,  1.54it/s]
[Acessing speaker spk_3 track 1 of 2:  11%|█         | 2/18 [00:01<00:08,  1.82it/s]
[Acessing speaker spk_3 track 1 of 2:  17%|█▋        | 3/18 [00:07<00:50,  3.36s/it]
[Acessing speaker spk_3 track 1 of 2:  22%|██▏       | 4/18 [00:09<00:37,  2.71s/it]
[Acessing speaker spk_3 track 1 of 2:  28%|██▊       | 5/18 [00:14<00:43,  3.34s/it]
[Acessing speaker spk_3 track 1 of 2:  33%|███▎      | 6/18 [00:15<00:30,  2.57s/it]
[Acessing speaker spk_3 track 1 of 2:  39%|███▉      | 7/18 [00:19<00:33,  3.08s/it]
[Acessing speaker spk_3 track 1 of 2:  44%|████▍     | 8/18 [00:21<00:29,  2.93s/it]
[Acessing speaker spk_3 track 1 of 2:  50%|█████     | 9/18 [00:25<00:29,  3.25s/it]
[Acessing speaker spk_3 track 1 of 2:  56%|█████▌    | 10/18 [00:26<00:19,  2.49s/it]
[Acessing speaker spk_3 track 1 of 2:  61%|██████    | 11/1





[Acessing speaker spk_4 track 1 of 1:   0%|          | 0/26 [00:00<?, ?it/s]
[Acessing speaker spk_4 track 1 of 1:   4%|▍         | 1/26 [00:02<01:01,  2.47s/it]
[Acessing speaker spk_4 track 1 of 1:   8%|▊         | 2/26 [00:04<00:53,  2.21s/it]
[Acessing speaker spk_4 track 1 of 1:  12%|█▏        | 3/26 [00:12<01:52,  4.87s/it]
[Acessing speaker spk_4 track 1 of 1:  15%|█▌        | 4/26 [00:22<02:27,  6.70s/it]
[Acessing speaker spk_4 track 1 of 1:  19%|█▉        | 5/26 [00:24<01:47,  5.12s/it]
[Acessing speaker spk_4 track 1 of 1:  23%|██▎       | 6/26 [00:26<01:19,  3.98s/it]
[Acessing speaker spk_4 track 1 of 1:  27%|██▋       | 7/26 [00:27<01:01,  3.22s/it]
[Acessing speaker spk_4 track 1 of 1:  31%|███       | 8/26 [00:29<00:51,  2.88s/it]
[Acessing speaker spk_4 track 1 of 1:  35%|███▍      | 9/26 [00:31<00:42,  2.53s/it]
[Acessing speaker spk_4 track 1 of 1:  38%|███▊      | 10/26 [00:38<01:02,  3.93s/it]
[Acessing speaker spk_4 track 1 of 1:  42%|████▏     | 11/2





[Acessing speaker spk_5 track 1 of 1:   0%|          | 0/32 [00:00<?, ?it/s]
[Acessing speaker spk_5 track 1 of 1:   3%|▎         | 1/32 [00:00<00:30,  1.03it/s]
[Acessing speaker spk_5 track 1 of 1:   6%|▋         | 2/32 [00:05<01:36,  3.22s/it]
[Acessing speaker spk_5 track 1 of 1:   9%|▉         | 3/32 [00:07<01:09,  2.39s/it]
[Acessing speaker spk_5 track 1 of 1:  12%|█▎        | 4/32 [00:09<01:10,  2.53s/it]
[Acessing speaker spk_5 track 1 of 1:  16%|█▌        | 5/32 [00:10<00:51,  1.92s/it]
[Acessing speaker spk_5 track 1 of 1:  19%|█▉        | 6/32 [00:12<00:52,  2.00s/it]
[Acessing speaker spk_5 track 1 of 1:  22%|██▏       | 7/32 [00:13<00:41,  1.66s/it]
[Acessing speaker spk_5 track 1 of 1:  25%|██▌       | 8/32 [00:15<00:42,  1.79s/it]
[Acessing speaker spk_5 track 1 of 1:  28%|██▊       | 9/32 [00:17<00:40,  1.75s/it]
[Acessing speaker spk_5 track 1 of 1:  31%|███▏      | 10/32 [00:19<00:41,  1.87s/it]
[Acessing speaker spk_5 track 1 of 1:  34%|███▍      | 11/3


Starte Inference für Experiment: E15_stage2_light_bs8_len20
  base_model      = cocktail_stage2_light
  model_type      = avsr_cocktail
  checkpoint_path = model-bin/avsr_cocktail_mcorec_stage2_light_lr1e-5_5k/checkpoint-5000
  beam_size       = 12
  max_length      = 20
  output_dir_name = output_E15_stage2_light_bs8_len20
  session_dir     = data-bin/dev_without_central_videos/dev/session_40
  comment         = Stage-2-Light, beam=12, len=20
Loading avsr_cocktail model...
Loading model from model-bin/avsr_cocktail_mcorec_stage2_light_lr1e-5_5k/checkpoint-5000
avsr_cocktail model loaded successfully!
Inferring 1 sessions using avsr_cocktail model
Processing session session_40


Processing speakers:   0%|          | 0/6 [00:00<?, ?it/s]





[Acessing speaker spk_0 track 1 of 1:   0%|          | 0/35 [00:00<?, ?it/s]
[Acessing speaker spk_0 track 1 of 1:   3%|▎         | 1/35 [00:00<00:20,  1.64it/s]
[Acessing speaker spk_0 track 1 of 1:   6%|▌         | 2/35 [00:01<00:16,  2.05it/s]
[Acessing speaker spk_0 track 1 of 1:   9%|▊         | 3/35 [00:01<00:16,  1.97it/s]
[Acessing speaker spk_0 track 1 of 1:  11%|█▏        | 4/35 [00:01<00:14,  2.08it/s]
[Acessing speaker spk_0 track 1 of 1:  14%|█▍        | 5/35 [00:02<00:13,  2.22it/s]
[Acessing speaker spk_0 track 1 of 1:  17%|█▋        | 6/35 [00:03<00:19,  1.47it/s]
[Acessing speaker spk_0 track 1 of 1:  20%|██        | 7/35 [00:07<00:49,  1.77s/it]
[Acessing speaker spk_0 track 1 of 1:  23%|██▎       | 8/35 [00:08<00:38,  1.43s/it]
[Acessing speaker spk_0 track 1 of 1:  26%|██▌       | 9/35 [00:09<00:35,  1.36s/it]
[Acessing speaker spk_0 track 1 of 1:  29%|██▊       | 10/35 [00:10<00:29,  1.19s/it]
[Acessing speaker spk_0 track 1 of 1:  31%|███▏      | 11/3





[Acessing speaker spk_1 track 1 of 1:   0%|          | 0/40 [00:00<?, ?it/s]
[Acessing speaker spk_1 track 1 of 1:   2%|▎         | 1/40 [00:00<00:16,  2.36it/s]
[Acessing speaker spk_1 track 1 of 1:   5%|▌         | 2/40 [00:01<00:29,  1.30it/s]
[Acessing speaker spk_1 track 1 of 1:   8%|▊         | 3/40 [00:02<00:27,  1.35it/s]
[Acessing speaker spk_1 track 1 of 1:  10%|█         | 4/40 [00:02<00:25,  1.41it/s]
[Acessing speaker spk_1 track 1 of 1:  12%|█▎        | 5/40 [00:04<00:32,  1.08it/s]
[Acessing speaker spk_1 track 1 of 1:  15%|█▌        | 6/40 [00:04<00:30,  1.10it/s]
[Acessing speaker spk_1 track 1 of 1:  18%|█▊        | 7/40 [00:05<00:25,  1.28it/s]
[Acessing speaker spk_1 track 1 of 1:  20%|██        | 8/40 [00:06<00:22,  1.43it/s]
[Acessing speaker spk_1 track 1 of 1:  22%|██▎       | 9/40 [00:06<00:20,  1.55it/s]
[Acessing speaker spk_1 track 1 of 1:  25%|██▌       | 10/40 [00:07<00:20,  1.44it/s]
[Acessing speaker spk_1 track 1 of 1:  28%|██▊       | 11/4





Processing speaker spk_2 track 1 of 3: 0it [00:00, ?it/s]

[Acessing speaker spk_2 track 2 of 3:   0%|          | 0/13 [00:00<?, ?it/s]
[Acessing speaker spk_2 track 2 of 3:   8%|▊         | 1/13 [00:00<00:05,  2.11it/s]
[Acessing speaker spk_2 track 2 of 3:  15%|█▌        | 2/13 [00:01<00:08,  1.23it/s]
[Acessing speaker spk_2 track 2 of 3:  23%|██▎       | 3/13 [00:02<00:10,  1.07s/it]
[Acessing speaker spk_2 track 2 of 3:  31%|███       | 4/13 [00:07<00:21,  2.39s/it]
[Acessing speaker spk_2 track 2 of 3:  38%|███▊      | 5/13 [00:08<00:16,  2.01s/it]
[Acessing speaker spk_2 track 2 of 3:  46%|████▌     | 6/13 [00:12<00:17,  2.53s/it]
[Acessing speaker spk_2 track 2 of 3:  54%|█████▍    | 7/13 [00:16<00:18,  3.10s/it]
[Acessing speaker spk_2 track 2 of 3:  62%|██████▏   | 8/13 [00:21<00:18,  3.69s/it]
[Acessing speaker spk_2 track 2 of 3:  69%|██████▉   | 9/13 [00:22<00:11,  2.83s/it]
[Acessing speaker spk_2 track 2 of 3:  77%|███████▋  | 10/13 [00:27<00:10,  3.66s/it]






[Acessing speaker spk_3 track 1 of 2:   0%|          | 0/18 [00:00<?, ?it/s]
[Acessing speaker spk_3 track 1 of 2:   6%|▌         | 1/18 [00:00<00:09,  1.71it/s]
[Acessing speaker spk_3 track 1 of 2:  11%|█         | 2/18 [00:01<00:08,  1.84it/s]
[Acessing speaker spk_3 track 1 of 2:  17%|█▋        | 3/18 [00:06<00:40,  2.72s/it]
[Acessing speaker spk_3 track 1 of 2:  22%|██▏       | 4/18 [00:08<00:32,  2.36s/it]
[Acessing speaker spk_3 track 1 of 2:  28%|██▊       | 5/18 [00:12<00:39,  3.08s/it]
[Acessing speaker spk_3 track 1 of 2:  33%|███▎      | 6/18 [00:13<00:29,  2.44s/it]
[Acessing speaker spk_3 track 1 of 2:  39%|███▉      | 7/18 [00:19<00:39,  3.60s/it]
[Acessing speaker spk_3 track 1 of 2:  44%|████▍     | 8/18 [00:22<00:33,  3.39s/it]
[Acessing speaker spk_3 track 1 of 2:  50%|█████     | 9/18 [00:26<00:31,  3.53s/it]
[Acessing speaker spk_3 track 1 of 2:  56%|█████▌    | 10/18 [00:27<00:21,  2.71s/it]
[Acessing speaker spk_3 track 1 of 2:  61%|██████    | 11/1





[Acessing speaker spk_4 track 1 of 1:   0%|          | 0/26 [00:00<?, ?it/s]
[Acessing speaker spk_4 track 1 of 1:   4%|▍         | 1/26 [00:02<01:01,  2.46s/it]
[Acessing speaker spk_4 track 1 of 1:   8%|▊         | 2/26 [00:03<00:38,  1.62s/it]
[Acessing speaker spk_4 track 1 of 1:  12%|█▏        | 3/26 [00:13<02:01,  5.30s/it]
[Acessing speaker spk_4 track 1 of 1:  15%|█▌        | 4/26 [00:23<02:43,  7.41s/it]
[Acessing speaker spk_4 track 1 of 1:  19%|█▉        | 5/26 [00:26<01:57,  5.61s/it]
[Acessing speaker spk_4 track 1 of 1:  23%|██▎       | 6/26 [00:28<01:26,  4.31s/it]
[Acessing speaker spk_4 track 1 of 1:  27%|██▋       | 7/26 [00:29<01:05,  3.46s/it]
[Acessing speaker spk_4 track 1 of 1:  31%|███       | 8/26 [00:31<00:55,  3.08s/it]
[Acessing speaker spk_4 track 1 of 1:  35%|███▍      | 9/26 [00:33<00:45,  2.68s/it]
[Acessing speaker spk_4 track 1 of 1:  38%|███▊      | 10/26 [00:40<01:04,  4.01s/it]
[Acessing speaker spk_4 track 1 of 1:  42%|████▏     | 11/2





[Acessing speaker spk_5 track 1 of 1:   0%|          | 0/32 [00:00<?, ?it/s]
[Acessing speaker spk_5 track 1 of 1:   3%|▎         | 1/32 [00:00<00:25,  1.21it/s]
[Acessing speaker spk_5 track 1 of 1:   6%|▋         | 2/32 [00:04<01:13,  2.45s/it]
[Acessing speaker spk_5 track 1 of 1:   9%|▉         | 3/32 [00:05<00:59,  2.04s/it]
[Acessing speaker spk_5 track 1 of 1:  12%|█▎        | 4/32 [00:10<01:26,  3.09s/it]
[Acessing speaker spk_5 track 1 of 1:  16%|█▌        | 5/32 [00:11<01:02,  2.31s/it]
[Acessing speaker spk_5 track 1 of 1:  19%|█▉        | 6/32 [00:13<01:00,  2.31s/it]
[Acessing speaker spk_5 track 1 of 1:  22%|██▏       | 7/32 [00:14<00:47,  1.89s/it]
[Acessing speaker spk_5 track 1 of 1:  25%|██▌       | 8/32 [00:17<00:47,  1.98s/it]
[Acessing speaker spk_5 track 1 of 1:  28%|██▊       | 9/32 [00:18<00:43,  1.89s/it]
[Acessing speaker spk_5 track 1 of 1:  31%|███▏      | 10/32 [00:21<00:43,  1.99s/it]
[Acessing speaker spk_5 track 1 of 1:  34%|███▍      | 11/3


########## Starte Experimente für session_43 ##########

Starte Inference für Experiment: E14_stage2_light_bs8_len20
  base_model      = cocktail_stage2_light
  model_type      = avsr_cocktail
  checkpoint_path = model-bin/avsr_cocktail_mcorec_stage2_light_lr1e-5_5k/checkpoint-5000
  beam_size       = 8
  max_length      = 20
  output_dir_name = output_E14_stage2_light_bs8_len20
  session_dir     = data-bin/dev_without_central_videos/dev/session_43
  comment         = Stage-2-Light, beam=8, len=20
Loading avsr_cocktail model...
Loading model from model-bin/avsr_cocktail_mcorec_stage2_light_lr1e-5_5k/checkpoint-5000
avsr_cocktail model loaded successfully!
Inferring 1 sessions using avsr_cocktail model
Processing session session_43


Processing speakers:   0%|          | 0/6 [00:00<?, ?it/s]





[Acessing speaker spk_0 track 1 of 2:   0%|          | 0/27 [00:00<?, ?it/s]
[Acessing speaker spk_0 track 1 of 2:   4%|▎         | 1/27 [00:00<00:21,  1.24it/s]
[Acessing speaker spk_0 track 1 of 2:   7%|▋         | 2/27 [00:01<00:18,  1.37it/s]
[Acessing speaker spk_0 track 1 of 2:  11%|█         | 3/27 [00:02<00:22,  1.08it/s]
[Acessing speaker spk_0 track 1 of 2:  15%|█▍        | 4/27 [00:03<00:16,  1.36it/s]
[Acessing speaker spk_0 track 1 of 2:  19%|█▊        | 5/27 [00:03<00:15,  1.44it/s]
[Acessing speaker spk_0 track 1 of 2:  22%|██▏       | 6/27 [00:06<00:26,  1.26s/it]
[Acessing speaker spk_0 track 1 of 2:  26%|██▌       | 7/27 [00:14<01:13,  3.68s/it]
[Acessing speaker spk_0 track 1 of 2:  30%|██▉       | 8/27 [00:21<01:28,  4.67s/it]
[Acessing speaker spk_0 track 1 of 2:  33%|███▎      | 9/27 [00:22<01:01,  3.41s/it]
[Acessing speaker spk_0 track 1 of 2:  37%|███▋      | 10/27 [00:24<00:52,  3.10s/it]
[Acessing speaker spk_0 track 1 of 2:  41%|████      | 11/2





[Acessing speaker spk_1 track 1 of 1:   0%|          | 0/32 [00:00<?, ?it/s]
[Acessing speaker spk_1 track 1 of 1:   3%|▎         | 1/32 [00:00<00:30,  1.02it/s]
[Acessing speaker spk_1 track 1 of 1:   6%|▋         | 2/32 [00:01<00:23,  1.26it/s]
[Acessing speaker spk_1 track 1 of 1:   9%|▉         | 3/32 [00:02<00:19,  1.50it/s]
[Acessing speaker spk_1 track 1 of 1:  12%|█▎        | 4/32 [00:05<00:44,  1.60s/it]
[Acessing speaker spk_1 track 1 of 1:  16%|█▌        | 5/32 [00:05<00:34,  1.27s/it]
[Acessing speaker spk_1 track 1 of 1:  19%|█▉        | 6/32 [00:06<00:26,  1.02s/it]
[Acessing speaker spk_1 track 1 of 1:  22%|██▏       | 7/32 [00:07<00:22,  1.09it/s]
[Acessing speaker spk_1 track 1 of 1:  25%|██▌       | 8/32 [00:08<00:26,  1.10s/it]
[Acessing speaker spk_1 track 1 of 1:  28%|██▊       | 9/32 [00:10<00:29,  1.26s/it]
[Acessing speaker spk_1 track 1 of 1:  31%|███▏      | 10/32 [00:10<00:24,  1.10s/it]
[Acessing speaker spk_1 track 1 of 1:  34%|███▍      | 11/3





[Acessing speaker spk_2 track 1 of 1:   0%|          | 0/29 [00:00<?, ?it/s]
[Acessing speaker spk_2 track 1 of 1:   3%|▎         | 1/29 [00:00<00:23,  1.18it/s]
[Acessing speaker spk_2 track 1 of 1:   7%|▋         | 2/29 [00:01<00:16,  1.65it/s]
[Acessing speaker spk_2 track 1 of 1:  10%|█         | 3/29 [00:01<00:14,  1.76it/s]
[Acessing speaker spk_2 track 1 of 1:  14%|█▍        | 4/29 [00:02<00:15,  1.63it/s]
[Acessing speaker spk_2 track 1 of 1:  17%|█▋        | 5/29 [00:03<00:14,  1.68it/s]
[Acessing speaker spk_2 track 1 of 1:  21%|██        | 6/29 [00:03<00:12,  1.81it/s]
[Acessing speaker spk_2 track 1 of 1:  24%|██▍       | 7/29 [00:04<00:13,  1.57it/s]
[Acessing speaker spk_2 track 1 of 1:  28%|██▊       | 8/29 [00:04<00:11,  1.78it/s]
[Acessing speaker spk_2 track 1 of 1:  31%|███       | 9/29 [00:05<00:10,  1.83it/s]
[Acessing speaker spk_2 track 1 of 1:  34%|███▍      | 10/29 [00:12<00:47,  2.52s/it]
[Acessing speaker spk_2 track 1 of 1:  38%|███▊      | 11/2





[Acessing speaker spk_3 track 1 of 1:   0%|          | 0/31 [00:00<?, ?it/s]
[Acessing speaker spk_3 track 1 of 1:   3%|▎         | 1/31 [00:00<00:20,  1.47it/s]
[Acessing speaker spk_3 track 1 of 1:   6%|▋         | 2/31 [00:01<00:18,  1.60it/s]
[Acessing speaker spk_3 track 1 of 1:  10%|▉         | 3/31 [00:02<00:19,  1.46it/s]
[Acessing speaker spk_3 track 1 of 1:  13%|█▎        | 4/31 [00:02<00:16,  1.65it/s]
[Acessing speaker spk_3 track 1 of 1:  16%|█▌        | 5/31 [00:03<00:20,  1.27it/s]
[Acessing speaker spk_3 track 1 of 1:  19%|█▉        | 6/31 [00:05<00:32,  1.32s/it]
[Acessing speaker spk_3 track 1 of 1:  23%|██▎       | 7/31 [00:06<00:25,  1.07s/it]
[Acessing speaker spk_3 track 1 of 1:  26%|██▌       | 8/31 [00:15<01:19,  3.48s/it]
[Acessing speaker spk_3 track 1 of 1:  29%|██▉       | 9/31 [00:26<02:11,  6.00s/it]
[Acessing speaker spk_3 track 1 of 1:  32%|███▏      | 10/31 [00:27<01:35,  4.55s/it]
[Acessing speaker spk_3 track 1 of 1:  35%|███▌      | 11/3





[Acessing speaker spk_4 track 1 of 2:   0%|          | 0/16 [00:00<?, ?it/s]
[Acessing speaker spk_4 track 1 of 2:   6%|▋         | 1/16 [00:00<00:12,  1.20it/s]
[Acessing speaker spk_4 track 1 of 2:  12%|█▎        | 2/16 [00:02<00:18,  1.34s/it]
[Acessing speaker spk_4 track 1 of 2:  19%|█▉        | 3/16 [00:06<00:30,  2.33s/it]
[Acessing speaker spk_4 track 1 of 2:  25%|██▌       | 4/16 [00:08<00:28,  2.36s/it]
[Acessing speaker spk_4 track 1 of 2:  31%|███▏      | 5/16 [00:09<00:19,  1.76s/it]
[Acessing speaker spk_4 track 1 of 2:  38%|███▊      | 6/16 [00:09<00:13,  1.39s/it]
[Acessing speaker spk_4 track 1 of 2:  44%|████▍     | 7/16 [00:10<00:10,  1.15s/it]
[Acessing speaker spk_4 track 1 of 2:  50%|█████     | 8/16 [00:10<00:07,  1.06it/s]
[Acessing speaker spk_4 track 1 of 2:  56%|█████▋    | 9/16 [00:11<00:06,  1.11it/s]
[Acessing speaker spk_4 track 1 of 2:  62%|██████▎   | 10/16 [00:12<00:04,  1.24it/s]
[Acessing speaker spk_4 track 1 of 2:  69%|██████▉   | 11/1





[Acessing speaker spk_5 track 1 of 1:   0%|          | 0/37 [00:00<?, ?it/s]
[Acessing speaker spk_5 track 1 of 1:   3%|▎         | 1/37 [00:00<00:23,  1.51it/s]
[Acessing speaker spk_5 track 1 of 1:   5%|▌         | 2/37 [00:01<00:18,  1.91it/s]
[Acessing speaker spk_5 track 1 of 1:   8%|▊         | 3/37 [00:05<01:21,  2.39s/it]
[Acessing speaker spk_5 track 1 of 1:  11%|█         | 4/37 [00:07<01:09,  2.12s/it]
[Acessing speaker spk_5 track 1 of 1:  14%|█▎        | 5/37 [00:08<00:59,  1.86s/it]
[Acessing speaker spk_5 track 1 of 1:  16%|█▌        | 6/37 [00:09<00:47,  1.53s/it]
[Acessing speaker spk_5 track 1 of 1:  19%|█▉        | 7/37 [00:11<00:44,  1.50s/it]
[Acessing speaker spk_5 track 1 of 1:  22%|██▏       | 8/37 [00:11<00:35,  1.22s/it]
[Acessing speaker spk_5 track 1 of 1:  24%|██▍       | 9/37 [00:12<00:34,  1.22s/it]
[Acessing speaker spk_5 track 1 of 1:  27%|██▋       | 10/37 [00:13<00:26,  1.02it/s]
[Acessing speaker spk_5 track 1 of 1:  30%|██▉       | 11/3


Starte Inference für Experiment: E15_stage2_light_bs8_len20
  base_model      = cocktail_stage2_light
  model_type      = avsr_cocktail
  checkpoint_path = model-bin/avsr_cocktail_mcorec_stage2_light_lr1e-5_5k/checkpoint-5000
  beam_size       = 12
  max_length      = 20
  output_dir_name = output_E15_stage2_light_bs8_len20
  session_dir     = data-bin/dev_without_central_videos/dev/session_43
  comment         = Stage-2-Light, beam=12, len=20
Loading avsr_cocktail model...
Loading model from model-bin/avsr_cocktail_mcorec_stage2_light_lr1e-5_5k/checkpoint-5000
avsr_cocktail model loaded successfully!
Inferring 1 sessions using avsr_cocktail model
Processing session session_43


Processing speakers:   0%|          | 0/6 [00:00<?, ?it/s]
Processing speaker spk_0 track 1 of 2:  37%|███▋      | 10/27 [00:25<00:54,  3.22s/it]
Processing speaker spk_1 track 1 of 1:  31%|███▏      | 10/32 [00:11<00:25,  1.16s/it]
Processing speaker spk_2 track 1 of 1:  34%|███▍      | 10/29 [00:12<00:51,  2.71s/it]
Processing speaker spk_3 track 1 of 1:  32%|███▏      | 10/31 [00:29<01:40,  4.79s/it]
Processing speaker spk_4 track 1 of 2:  62%|██████▎   | 10/16 [00:12<00:05,  1.15it/s]
Processing speaker spk_5 track 1 of 1:  27%|██▋       | 10/37 [00:13<00:27,  1.02s/it]


########## Starting experiments for session_49 ##########

Starting inference for experiment: E14_stage2_light_bs8_len20
  base_model      = cocktail_stage2_light
  model_type      = avsr_cocktail
  checkpoint_path = model-bin/avsr_cocktail_mcorec_stage2_light_lr1e-5_5k/checkpoint-5000
  beam_size       = 8
  max_length      = 20
  output_dir_name = output_E14_stage2_light_bs8_len20
  session_dir     = data-bin/dev_without_central_videos/dev/session_49
  comment         = Stage-2-Light, beam=8, len=20
Loading avsr_cocktail model...
Loading model from model-bin/avsr_cocktail_mcorec_stage2_light_lr1e-5_5k/checkpoint-5000
avsr_cocktail model loaded successfully!
Inferring 1 sessions using avsr_cocktail model
Processing session session_49


Processing speakers:   0%|          | 0/6 [00:00<?, ?it/s]
Processing speaker spk_0 track 1 of 1:  83%|████████▎ | 10/12 [00:08<00:02,  1.16s/it]
Processing speaker spk_1 track 1 of 1:  71%|███████▏  | 10/14 [00:26<00:18,  4.61s/it]
Processing speaker spk_2 track 1 of 8: 100%|██████████| 1/1 [00:00<00:00,  2.48it/s]
Processing speaker spk_2 track 2 of 8: 100%|██████████| 1/1 [00:00<00:00,  1.94it/s]
Processing speaker spk_2 track 3 of 8: 100%|██████████| 1/1 [00:00<00:00,  2.20it/s]
Processing speaker spk_2 track 4 of 8: 100%|██████████| 3/3 [00:15<00:00,  5.06s/it]
Processing speaker spk_2 track 5 of 8:  50%|█████     | 1/2 [00:03<00:03,  3.69s/it]
Processing speaker spk_3 track 1 of 1:  48%|████▊     | 10/21 [00:20<00:22,  2.04s/it]
Processing speaker spk_4 track 1 of 1:  45%|████▌     | 10/22 [00:34<00:17,  1.45s/it]
Processing speaker spk_5 track 1 of 2:  48%|████▊     | 10/21 [00:09<00:13,  1.18s/it]


Starting inference for experiment: E15_stage2_light_bs8_len20
  base_model      = cocktail_stage2_light
  model_type      = avsr_cocktail
  checkpoint_path = model-bin/avsr_cocktail_mcorec_stage2_light_lr1e-5_5k/checkpoint-5000
  beam_size       = 12
  max_length      = 20
  output_dir_name = output_E15_stage2_light_bs8_len20
  session_dir     = data-bin/dev_without_central_videos/dev/session_49
  comment         = Stage-2-Light, beam=12, len=20
Loading avsr_cocktail model...
Loading model from model-bin/avsr_cocktail_mcorec_stage2_light_lr1e-5_5k/checkpoint-5000
avsr_cocktail model loaded successfully!
Inferring 1 sessions using avsr_cocktail model
Processing session session_49


Processing speakers:   0%|          | 0/6 [00:00<?, ?it/s]
Processing speaker spk_0 track 1 of 1:  83%|████████▎ | 10/12 [00:10<00:03,  1.50s/it]
Processing speaker spk_1 track 1 of 1:  71%|███████▏  | 10/14 [00:25<00:17,  4.40s/it]
Processing speaker spk_2 track 1 of 8: 100%|██████████| 1/1 [00:00<00:00,  1.78it/s]
Processing speaker spk_2 track 2 of 8: 100%|██████████| 1/1 [00:00<00:00,  2.02it/s]
Processing speaker spk_2 track 3 of 8: 100%|██████████| 1/1 [00:00<00:00,  2.58it/s]
Processing speaker spk_2 track 4 of 8: 100%|██████████| 3/3 [00:15<00:00,  5.30s/it]
Processing speaker spk_2 track 5 of 8:  50%|█████     | 1/2 [00:04<00:04,  4.88s/it]
Processing speaker spk_3 track 1 of 1:  48%|████▊     | 10/21 [00:23<00:27,  2.46s/it]
Processing speaker spk_4 track 1 of 1:  45%|████▌     | 10/22 [00:35<00:19,  1.62s/it]
Processing speaker spk_5 track 1 of 2:  48%|████▊     | 10/21 [00:09<00:13,  1.26s/it]


########## Starting experiments for session_50 ##########

Starting inference for experiment: E14_stage2_light_bs8_len20
  base_model      = cocktail_stage2_light
  model_type      = avsr_cocktail
  checkpoint_path = model-bin/avsr_cocktail_mcorec_stage2_light_lr1e-5_5k/checkpoint-5000
  beam_size       = 8
  max_length      = 20
  output_dir_name = output_E14_stage2_light_bs8_len20
  session_dir     = data-bin/dev_without_central_videos/dev/session_50
  comment         = Stage-2-Light, beam=8, len=20
Loading avsr_cocktail model...
Loading model from model-bin/avsr_cocktail_mcorec_stage2_light_lr1e-5_5k/checkpoint-5000
avsr_cocktail model loaded successfully!
Inferring 1 sessions using avsr_cocktail model
Processing session session_50


Processing speakers:   0%|          | 0/6 [00:00<?, ?it/s]
Processing speaker spk_0 track 1 of 1:  40%|████      | 10/25 [00:07<00:09,  1.51it/s]
Processing speaker spk_1 track 1 of 1:  42%|████▏     | 10/24 [00:22<00:23,  1.67s/it]
Processing speaker spk_2 track 1 of 2:  56%|█████▌    | 10/18 [00:15<00:07,  1.01it/s]
Processing speaker spk_3 track 1 of 3:  62%|██████▎   | 10/16 [00:09<00:05,  1.09it/s]
Processing speaker spk_4 track 1 of 1:  37%|███▋      | 10/27 [00:07<00:12,  1.33it/s]
Processing speaker spk_5 track 1 of 1:  34%|███▍      | 10/29 [00:23<01:34,  4.96s/it]


Starting inference for experiment: E15_stage2_light_bs8_len20
  base_model      = cocktail_stage2_light
  model_type      = avsr_cocktail
  checkpoint_path = model-bin/avsr_cocktail_mcorec_stage2_light_lr1e-5_5k/checkpoint-5000
  beam_size       = 12
  max_length      = 20
  output_dir_name = output_E15_stage2_light_bs8_len20
  session_dir     = data-bin/dev_without_central_videos/dev/session_50
  comment         = Stage-2-Light, beam=12, len=20
Loading avsr_cocktail model...
Loading model from model-bin/avsr_cocktail_mcorec_stage2_light_lr1e-5_5k/checkpoint-5000
avsr_cocktail model loaded successfully!
Inferring 1 sessions using avsr_cocktail model
Processing session session_50


Processing speakers:   0%|          | 0/6 [00:00<?, ?it/s]
Processing speaker spk_0 track 1 of 1:  40%|████      | 10/25 [00:07<00:10,  1.43it/s]
Processing speaker spk_1 track 1 of 1:  42%|████▏     | 10/24 [00:23<00:24,  1.74s/it]
Processing speaker spk_2 track 1 of 2:  56%|█████▌    | 10/18 [00:14<00:07,  1.06it/s]
Processing speaker spk_3 track 1 of 3:  62%|██████▎   | 10/16 [00:10<00:06,  1.06s/it]
Processing speaker spk_4 track 1 of 1:  37%|███▋      | 10/27 [00:08<00:14,  1.21it/s]
Processing speaker spk_5 track 1 of 1:  34%|███▍      | 10/29 [00:23<01:31,  4.81s/it]


########## Starte Experimente für session_54 ##########

Starte Inference für Experiment: E14_stage2_light_bs8_len20
  base_model      = cocktail_stage2_light
  model_type      = avsr_cocktail
  checkpoint_path = model-bin/avsr_cocktail_mcorec_stage2_light_lr1e-5_5k/checkpoint-5000
  beam_size       = 8
  max_length      = 20
  output_dir_name = output_E14_stage2_light_bs8_len20
  session_dir     = data-bin/dev_without_central_videos/dev/session_54
  comment         = Stage-2-Light, beam=8, len=20
Loading avsr_cocktail model...
Loading model from model-bin/avsr_cocktail_mcorec_stage2_light_lr1e-5_5k/checkpoint-5000
avsr_cocktail model loaded successfully!
Inferring 1 sessions using avsr_cocktail model
Processing session session_54




Starte Inference für Experiment: E15_stage2_light_bs8_len20
  base_model      = cocktail_stage2_light
  model_type      = avsr_cocktail
  checkpoint_path = model-bin/avsr_cocktail_mcorec_stage2_light_lr1e-5_5k/checkpoint-5000
  beam_size       = 12
  max_length      = 20
  output_dir_name = output_E15_stage2_light_bs8_len20
  session_dir     = data-bin/dev_without_central_videos/dev/session_54
  comment         = Stage-2-Light, beam=12, len=20
Loading avsr_cocktail model...
Loading model from model-bin/avsr_cocktail_mcorec_stage2_light_lr1e-5_5k/checkpoint-5000
avsr_cocktail model loaded successfully!
Inferring 1 sessions using avsr_cocktail model
Processing session session_54



## 9 – Evaluation & Aggregation

In [15]:
# Evaluate the results and append them to the shared CSV
df_dev = append_eval_results_for_experiments(
    experiments=EXPERIMENTS,
    session_ids=SESSION_IDS,
    target_csv="results_dev_subset_by_session.csv",
)



########## Evaluate für session_40 ##########
Starte Evaluate: /home/josch080/Projektgruppe/mcorec_train/bin/python script/evaluate.py --session_dir data-bin/dev_without_central_videos/dev/session_40 --output_dir_name output_ --label_dir_name labels
Evaluating 1 sessions

=== Evaluating session session_40 ===

--- Evaluating output dir: output_E01_bs4_len15 ---
Conversation clustering F1 score: 1.0
Speaker to WER: {'spk_0': 0.564, 'spk_1': 0.4281, 'spk_2': 0.5576, 'spk_3': 0.4283, 'spk_4': 0.4793, 'spk_5': 0.4189}
Speaker clustering F1 score: {'spk_0': 1.0, 'spk_1': 1.0, 'spk_2': 1.0, 'spk_3': 1.0, 'spk_4': 1.0, 'spk_5': 1.0}
Joint ASR-Clustering Error Rate: {'spk_0': 0.282, 'spk_1': 0.21405, 'spk_2': 0.2788, 'spk_3': 0.21415, 'spk_4': 0.23965, 'spk_5': 0.20945}

--- Evaluating output dir: output_E02_bs8_len15 ---
Conversation clustering F1 score: 1.0
Speaker to WER: {'spk_0': 0.561, 'spk_1': 0.4312, 'spk_2': 0.5506, 'spk_3': 0.4283, 'spk_4': 0.5041, 'spk_5': 0.4189}
Speaker clusterin
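
The per-speaker numbers printed above are consistent with the joint error being the mean of the WER and the clustering error (1 − F1). This relation is inferred from the logged values only; it has not been checked against `script/evaluate.py`, and the helper name `joint_error` is hypothetical. A minimal sketch that reproduces the logged values for `output_E01_bs4_len15` under that assumption:

```python
def joint_error(wer: float, clustering_f1: float) -> float:
    """Assumed relation: mean of WER and clustering error (1 - F1)."""
    return 0.5 * wer + 0.5 * (1.0 - clustering_f1)


# Per-speaker WER as printed for output_E01_bs4_len15 (session_40)
speaker_wer = {
    "spk_0": 0.564, "spk_1": 0.4281, "spk_2": 0.5576,
    "spk_3": 0.4283, "spk_4": 0.4793, "spk_5": 0.4189,
}
# Speaker clustering F1 was 1.0 for every speaker in this run
speaker_f1 = {spk: 1.0 for spk in speaker_wer}

jer = {
    spk: round(joint_error(wer, speaker_f1[spk]), 5)
    for spk, wer in speaker_wer.items()
}
print(jer)
```

With a perfect clustering F1 the joint error reduces to half the WER, which matches the `Joint ASR-Clustering Error Rate` line in the log.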

## 10 – Comparison: Stage-2-Light vs. BL4

The Stage-2-Light results (E1x) are compared with the BL4 results (E0x) at identical
decoding hyperparameters (beam size and maximum length). Positive Δ values mean Stage-2-Light is worse than BL4.

In [24]:
import pandas as pd

dev_df = pd.read_csv("results_dev_subset_by_session.csv")

# Stage-2-style experiments (E1*) and fine-tuned BL4 experiments (E0*)
stage2_like = dev_df[dev_df["model"].str.startswith("E1")].copy()
ft_df = dev_df[dev_df["model"].str.startswith("E0")].copy()

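def parse_beam_len(model_name: str):
    """Extract (beam_size, max_length) from a name such as 'E14_stage2_light_bs8_len20'."""
    parts = model_name.split("_")
    # Each name is expected to contain exactly one 'bs<N>' and one 'len<N>' token
    beam_part = next(p for p in parts if p.startswith("bs"))
    len_part = next(p for p in parts if p.startswith("len"))
    beam = int(beam_part.replace("bs", ""))
    length = int(len_part.replace("len", ""))
    return beam, length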

# Populate the hyperparameter columns for both DataFrames
for df in [stage2_like, ft_df]:
    df[["beam_size", "max_length"]] = df["model"].apply(
        lambda m: pd.Series(parse_beam_len(m))
    )

# Average over sessions
stage2_agg = (
    stage2_like
    .groupby(["model", "beam_size", "max_length"])[["avg_speaker_wer", "avg_joint_error"]]
    .mean()
    .reset_index()
)

ft_agg = (
    ft_df
    .groupby(["model", "beam_size", "max_length"])[["avg_speaker_wer", "avg_joint_error"]]
    .mean()
    .reset_index()
)

# Join Stage-2-Light and BL4 on (beam_size, max_length)
comparison = pd.merge(
    stage2_agg,
    ft_agg,
    on=["beam_size", "max_length"],
    how="left",
    suffixes=("_stage2", "_ft"),
)

# Compute Δ: positive = Stage-2-Light worse than BL4
comparison["Δ_WER_stage2_minus_ft"] = (
    comparison["avg_speaker_wer_stage2"] - comparison["avg_speaker_wer_ft"]
)
comparison["Δ_JER_stage2_minus_ft"] = (
    comparison["avg_joint_error_stage2"] - comparison["avg_joint_error_ft"]
)

# Keep only E14 and E15 (the CSV may contain further E1x entries from 02a)
light_mask = comparison["model_stage2"].isin([
    "E14_stage2_light_bs8_len20",
    "E15_stage2_light_bs12_len20",
])

comparison_light = comparison[light_mask].copy()

# More readable variant labels for the output
comparison_light["variant"] = comparison_light["model_stage2"].map({
    "E14_stage2_light_bs8_len20":  "Stage-2-light (beam=8, len=20)",
    "E15_stage2_light_bs12_len20": "Stage-2-light (beam=12, len=20)",
})

# Relevant columns in a sensible order
comparison_light = comparison_light[
    [
        "variant",
        "beam_size",
        "max_length",
        "avg_speaker_wer_stage2",
        "avg_joint_error_stage2",
        "avg_speaker_wer_ft",
        "avg_joint_error_ft",
        "Δ_WER_stage2_minus_ft",
        "Δ_JER_stage2_minus_ft",
    ]
]

display(comparison_light)


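|   | variant | beam_size | max_length | avg_speaker_wer_stage2 | avg_joint_error_stage2 | avg_speaker_wer_ft | avg_joint_error_ft | Δ_WER_stage2_minus_ft | Δ_JER_stage2_minus_ft |
|---|---------|-----------|------------|------------------------|------------------------|--------------------|--------------------|-----------------------|-----------------------|
| 4 | Stage-2-light (beam=8, len=20) | 8 | 20 | 0.528553 | 0.340469 | 0.495798 | 0.324091 | 0.032755 | 0.016378 |
| 5 | Stage-2-light (beam=12, len=20) | 12 | 20 | 0.527217 | 0.3398 | 0.495416 | 0.3239 | 0.031801 | 0.0159 |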
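
The comparison above relies on each (beam_size, max_length) pair occurring exactly once on both sides of the join. A minimal sketch of how that assumption could be checked uses the `validate` and `indicator` parameters of `pd.merge`. The WER values are copied from the output above; the BL4 model names are placeholders, since the actual E0x names for these settings are not shown here.

```python
import pandas as pd

# Aggregated Stage-2-Light rows (WER values taken from the output above)
stage2_agg = pd.DataFrame({
    "model": ["E14_stage2_light_bs8_len20", "E15_stage2_light_bs12_len20"],
    "beam_size": [8, 12],
    "max_length": [20, 20],
    "avg_speaker_wer": [0.528553, 0.527217],
})

# Aggregated BL4 rows; the model names are placeholders, not the real E0x names
ft_agg = pd.DataFrame({
    "model": ["BL4_placeholder_bs8_len20", "BL4_placeholder_bs12_len20"],
    "beam_size": [8, 12],
    "max_length": [20, 20],
    "avg_speaker_wer": [0.495798, 0.495416],
})

comparison = pd.merge(
    stage2_agg,
    ft_agg,
    on=["beam_size", "max_length"],
    how="left",
    suffixes=("_stage2", "_ft"),
    validate="one_to_one",  # raises MergeError if a key pair is duplicated
    indicator=True,         # adds a "_merge" column: both / left_only / right_only
)

# Rows without a BL4 counterpart would otherwise yield NaN deltas silently
unmatched = comparison[comparison["_merge"] != "both"]
assert unmatched.empty, f"No BL4 match for: {unmatched['model_stage2'].tolist()}"

comparison["delta_wer"] = (
    comparison["avg_speaker_wer_stage2"] - comparison["avg_speaker_wer_ft"]
)
print(comparison[["beam_size", "max_length", "delta_wer"]].round(6))
```

If the shared CSV ever contains two BL4 experiments with the same beam size and maximum length, `validate="one_to_one"` fails loudly instead of duplicating Stage-2-Light rows.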


## 11 – Interpretation

| Configuration | WER BL4 | WER Stage-2-Light | Δ WER | JER BL4 | JER Stage-2-Light | Δ JER |
|---------------|---------|-------------------|-------|---------|-------------------|-------|
| beam=8,  len=20 | 0.4958 | 0.5286 | +0.033 | 0.3241 | 0.3405 | +0.016 |
| beam=12, len=20 | 0.4954 | 0.5272 | +0.032 | 0.3239 | 0.3398 | +0.016 |

Stage-2-Light is marginally better than Stage-2 from `02a` (Δ WER ≈ +0.032 to +0.033 vs. ≈ +0.036),
but remains **worse than BL4** in both configurations.

**Conclusion:** Even with a stronger MCoRec focus and a very low learning rate,
further fine-tuning does not improve on BL4.
BL4 (`avsr_cocktail_mcorec_finetune`) remains the reference model.
Fine-tuning experiments are not pursued further.
