# DiCoW: Diarization-Conditioned Whisper for Target Speaker ASR - JSALT 25 Competition

## 🏆 Competition Information
Welcome to the DiCoW Target Speaker ASR Challenge!

**Prizes:**
- 🥇 **1st Place:** 3 beers 🍺🍺🍺
- 🥈 **2nd Place:** 2 beers 🍺🍺
- 🥉 **3rd Place:** 1 beer 🍺

---
**Submission:** Submit your best performing systems to the [EMMA Leaderboard](https://huggingface.co/spaces/BUT-FIT/EMMA_leaderboard)

**Deadline:** 17.6. 2025, 16:00 CET

**SUBMISSION_TOKEN:** emmA2025

---

## 🛠️ Challenge Tasks
1. Clone the DiCoW repository and set up the environment
2. Prepare the Libri2Mix dataset
3. Finetune Whisper tiny model using DiCoW
4. Evaluate on Libri2Mix clean test set
5. Submit results to EMMA leaderboard

---

## 📚 Table of Contents
1. [Introduction to MT-ASR, TS-ASR, and DiCoW](#intro)
    1. [Multi-Talker ASR (MT-ASR)](#mt_asr)
    2. [Target Speaker ASR (TS-ASR)](#ts_asr)
    3. [DiCoW: Diarization-Conditioned Whisper](#dicow)
2. [Environment Setup](#setup)
3. [Data Preparation](#data)
4. [Model Training](#finetuning)
5. [Decoding & Evaluation](#decoding)
6. [Submission Guidelines](#submission)

---

## 📖 Resources
1. Repositories:
    - [DiCoW GitHub Repository](https://github.com/BUTSpeechFIT/TS-ASR-Whisper)
    - [DiCoW Inference Repository](https://github.com/BUTSpeechFIT/DiCoW)
2. Papers:
    - [Target Speaker ASR with Whisper](https://ieeexplore.ieee.org/document/10887683)
    - [DiCoW: Diarization-Conditioned Whisper for Target Speaker ASR](https://arxiv.org/abs/2501.00114)
    - [BUT/JHU System Description for CHiME-8 NOTSOFAR-1 Challenge](https://www.isca-archive.org/chime_2024/polok24_chime.html)
    - BUT System for the MLC-SLM Challenge
3. Demo:
    - [DiCoW Gradio Demo](https://pccnect.fit.vutbr.cz/gradio-demo/)
4. Leaderboard:
    - [EMMA Leaderboard](https://huggingface.co/spaces/BUT-FIT/EMMA_leaderboard)

# 1. INTRODUCTION TO MULTI-TALKER AUTOMATIC SPEECH RECOGNITION <a id='intro'></a>


![mt_asr](img/mt_asr.png)

## The Challenge of Multi-Talker ASR

Automatic Speech Recognition (ASR) systems traditionally work well with single-speaker audio.
However, real-world scenarios often involve multiple speakers talking simultaneously, creating
several challenges:

1. **Overlapping Speech**: Multiple speakers talking at the same time
2. **Speaker Confusion**: Difficulty determining who said what
3. **Acoustic Interference**: Speech from one speaker masks another
4. **Variable Number of Speakers**: Unknown number of active speakers


### Approaches to Multi-Talker ASR <a id='mt_asr'></a>

![mt_asr_approaches](img/mt_asr_approaches.png)


1. **Speech Separation + ASR**: First separate speakers, then apply ASR
2. **E2E MT-ASR (SOT)**: Train one end-to-end model with Serialized Output Training (SOT), where speaker-attributed transcriptions are concatenated in order of emission time
3. **Target Speaker ASR**: Focus on specific speaker of interest
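
To make the second approach more concrete, here is a minimal sketch of how a serialized target could be built from speaker-attributed segments. The `<sc>` speaker-change token, the tuple layout, and the function name `serialize_sot` are illustrative assumptions, not the format of any particular toolkit.

```python
# Hypothetical segments: (start_time_in_seconds, speaker, text)
def serialize_sot(segments, sc_token="<sc>"):
    """Order utterances by start time and join them with a speaker-change token."""
    ordered = sorted(segments, key=lambda seg: seg[0])
    return f" {sc_token} ".join(text for _, _, text in ordered)


segments = [
    (1.2, "spk2", "how are you"),
    (0.0, "spk1", "hello there"),
]
sot_target = serialize_sot(segments)
print(sot_target)  # hello there <sc> how are you
```

The model then learns to emit this single token stream, so speaker attribution is implicit in the position between change tokens.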

### Metrics for Multi-Talker ASR
1. Optimal Reference Combination Word Error Rate (ORC WER)
2. Concatenated minimum-Permutation Word Error Rate (cpWER)
3. Time-Constrained minimum-Permutation Word Error Rate (tcpWER)

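These metrics are implemented in the [MeetEval toolkit](https://github.com/fgnt/meeteval).

An [example alignment visualization](https://groups.uni-paderborn.de/nt/meeteval/icassp2024-demo/poster_example.html?selection=28.6-38.7&minimaps=1) is available from the MeetEval ICASSP 2024 demo.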
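
As an illustration of the idea behind cpWER, the sketch below concatenates each speaker's words, tries every assignment of hypothesis speakers to reference speakers, and keeps the assignment with the fewest word errors. It is a simplified pure-Python version that assumes equal numbers of reference and hypothesis speakers; the function names are hypothetical, and MeetEval should be used for real evaluation.

```python
from itertools import permutations


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (r != h)))    # substitution or match
        prev = curr
    return prev[-1]


def cp_wer(ref_by_speaker, hyp_by_speaker):
    """cpWER over per-speaker concatenated transcripts (equal speaker counts only)."""
    refs = [r.split() for r in ref_by_speaker]
    hyps = [h.split() for h in hyp_by_speaker]
    if len(refs) != len(hyps):
        raise ValueError("this sketch assumes matching speaker counts")
    # Minimum total error count over all speaker permutations.
    best = min(
        sum(edit_distance(r, h) for r, h in zip(refs, perm))
        for perm in permutations(hyps)
    )
    return best / sum(len(r) for r in refs)


refs = ["hello there my friend", "good morning"]
hyps = ["good morning", "hello their my friend"]  # speakers in swapped order
score = cp_wer(refs, hyps)
print(f"cpWER = {score:.3f}")
```

The hypothesis speaker labels are swapped relative to the reference, yet only the single substitution ("there" vs. "their") is counted, because the permutation search removes the label mismatch.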

## Target Speaker ASR: Focus on Speaker of Interest
<a id='ts_asr'></a>
Target Speaker ASR (TS-ASR) addresses a practical scenario: given mixed audio with multiple
speakers, transcribe only the speech from a specific target speaker.


### Traditional TS-ASR Approaches:

1. Randomly initialized model and i-vector based speaker embeddings

![ts_asr_embed](img/ts_asr_embed.png)

2. Pretrained ASR model with (better) speaker embeddings

![ts_asr_enrolment](img/ts_asr_enrolment.png)

3. Pretrained ASR model directly conditioned on speaker enrolment

![whisper_enrolment](img/whisper_enrolment.png)





<a id='dicow'></a>
## DICOW: DIARIZATION-CONDITIONED WHISPER

DiCoW (Diarization-Conditioned Whisper) takes a different route to Target Speaker ASR.
Instead of relying on speaker embeddings, DiCoW conditions the model directly on the frame-level
output of a speaker diarization system.

![dicow](img/dicow.png)

### Advantages of DiCoW

1. **No Speaker Embeddings Required**: Eliminates dependency on embedding quality
2. **Better Generalization**: Works well with unseen speakers
3. **Simplified Workflow**: Direct conditioning on diarization outputs
4. **Maintains Whisper Performance**: Preserves accuracy on (multi-lingual) single-speaker data

### STNO - Silence, Target, Non-Target, and Overlap Masks

Let $\mathbf{D} \in [0,1]^{S \times T}$, where $S$ is the number of speakers in the recording, and $T$ is the number of frames, represent the diarization output, with $d(s, t)$ denoting the probability that speaker $s$ is active in time frame $t$. Let $s_k$ represent the target speaker.
We define a distribution over the following mutually exclusive events for a frame at time $t$:

1. ${\mathcal{S}}$: Time frame $t$ represents silence.
2. ${\mathcal{T}}$: The target speaker, $s_k$, is the only active speaker in time frame $t$.
3. ${\mathcal{N}}$: One or more non-target speakers, $s \neq s_k$ are active and the target speaker, $s_k$, is not active at time frame $t$.
4. ${\mathcal{O}}$: The target speaker $s_k$ is active while at least one non-target speaker $s \neq s_k$ is also active at time frame $t$, denoting an overlap.


The probabilities of these events occurring at time frame $t$ can be calculated as:
1. $p_{\mathcal{S}}^t  = \prod_{s=1}^S (1 - d(s, t))$
2. $p_{\mathcal{T}}^t  = d(s_k, t)  \prod_{\substack{s=1 \\ s \neq s_k}}^S (1 - d(s, t))$
3. $p_{\mathcal{N}}^t  = \left(1 - p_{\mathcal{S}}^t\right) - d\left(s_k, t\right)$
4. $p_{\mathcal{O}}^t  = d(s_k, t) - p_{\mathcal{T}}^t$


This definition allows us to use a fixed-sized STNO (Silence, Target, Non-target, Overlap) mask $\mathbf{M}^t = \begin{bmatrix} p_{\mathcal{S}}^t & p_{\mathcal{T}}^t & p_{\mathcal{N}}^t & p_{\mathcal{O}}^t \end{bmatrix}^{\top}$.
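
The four probabilities can be computed directly from the diarization matrix. The following NumPy sketch mirrors the formulas above on a toy example; the function name `stno_mask` and the toy values are assumptions made for illustration, not code from the DiCoW repository.

```python
import numpy as np


def stno_mask(diar, target):
    """Compute the 4 x T STNO mask from an S x T matrix of speaker activity probabilities."""
    diar = np.asarray(diar, dtype=float)
    inactive = 1.0 - diar                              # (S, T): probability each speaker is inactive
    p_sil = inactive.prod(axis=0)                      # nobody is active
    others_inactive = np.delete(inactive, target, axis=0).prod(axis=0)
    p_tgt = diar[target] * others_inactive             # only the target is active
    p_non = (1.0 - p_sil) - diar[target]               # somebody is active, but not the target
    p_ovl = diar[target] - p_tgt                       # target plus at least one other speaker
    return np.stack([p_sil, p_tgt, p_non, p_ovl])


# Toy diarization output: 2 speakers, 4 frames (three hard decisions and one soft frame).
diar = np.array([[0.0, 1.0, 0.0, 0.8],
                 [0.0, 0.0, 1.0, 0.5]])
mask = stno_mask(diar, target=0)
print(mask.round(2))
```

Each column of the result sums to one, which is what makes the mask usable as mixing weights regardless of how many speakers the recording contains.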

### Frame-Level Diarization Dependent Transformations


Let $\mathbf{Z}^l \in \mathbb{R}^{d_{{m}} \times T}$ represent the frame-by-frame inputs to the $l$-th (Transformer) layer.

We transform these hidden representations by applying four affine STNO layer- and class-specific transformations: $\mathbf{W}_{\mathcal{S}}^l, \mathbf{W}_{\mathcal{T}}^l, \mathbf{W}_{\mathcal{N}}^l, \mathbf{W}_{\mathcal{O}}^l \in \mathbb{R}^{d_{{m}} \times d_{{m}}}$ together with biases $\mathbf{b}_{\mathcal{S}}^l, \mathbf{b}_{\mathcal{T}}^l, \mathbf{b}_{\mathcal{N}}^l, \mathbf{b}_{\mathcal{O}}^l \in \mathbb{R}^{d_{m}}$ to obtain new speaker-specific hidden representations $\hat{\mathbf{Z}}^l = [\hat{\mathbf{z}}^l_1, \ldots, \hat{\mathbf{z}}^l_T]$ as:

$$
\begin{aligned}
\hat{\mathbf{z}}^l_t ={}& \left( \mathbf{W}_{\mathcal{S}}^l \mathbf{z}^l_t + \mathbf{b}_{\mathcal{S}}^l \right) p^t_{\mathcal{S}} +
\left( \mathbf{W}_{\mathcal{T}}^l \mathbf{z}^l_t + \mathbf{b}_{\mathcal{T}}^l \right) p^t_{\mathcal{T}} \\
&+ \left( \mathbf{W}_{\mathcal{N}}^l \mathbf{z}^l_t + \mathbf{b}_{\mathcal{N}}^l \right) p^t_{\mathcal{N}} +
\left( \mathbf{W}_{\mathcal{O}}^l \mathbf{z}^l_t + \mathbf{b}_{\mathcal{O}}^l \right) p^t_{\mathcal{O}}.
\end{aligned}
$$


In other words, the hidden representations $\mathbf{z}^l_t$ are transformed using a convex combination of the four STNO class-specific affine transformations, weighted by the corresponding STNO class probabilities.
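
The transformation can be written in a few lines of NumPy. This is a sketch with toy dimensions and random parameters chosen purely for illustration; it does not reproduce the initialization or the implementation used in the DiCoW code base, and the name `fddt` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, T = 8, 5                                  # toy sizes, not Whisper's real dimensions

Z = rng.normal(size=(d_model, T))                  # hidden states z_t stored as columns
# One weight matrix and one bias per STNO class (order: S, T, N, O); random for illustration.
W = rng.normal(size=(4, d_model, d_model))
b = rng.normal(size=(4, d_model))
# Random STNO mask whose columns sum to one.
M = rng.dirichlet(np.ones(4), size=T).T            # shape (4, T)


def fddt(Z, W, b, M):
    """Mix four class-specific affine transforms per frame, weighted by STNO probabilities."""
    transformed = np.einsum("cij,jt->cit", W, Z) + b[:, :, None]   # (4, d_model, T)
    return (transformed * M[:, None, :]).sum(axis=0)                # (d_model, T)


Z_hat = fddt(Z, W, b, M)
print(Z_hat.shape)
```

When the mask is one-hot for a frame, the output for that frame reduces to a single affine transformation, which is a quick way to sanity-check an implementation.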

![dicow_full](img/target_speaker_whisper_stno.drawio.png)


## 2. Environment Setup <a id="setup"></a>
In this section, we will set up the environment for training and evaluating the DiCoW model on the Libri2Mix dataset: we will install the required dependencies and clone the repository.

### Step 1: Install Dependencies

In [None]:
with open("reqs_collab.txt", "w") as f:
    f.write(
        """accelerate>=0.33.0
          datasets>=2.21.0
          evaluate>=0.4.2
          huggingface-hub==0.24.6
          hydra-core==1.3.2
          intervaltree==3.1.0
          jiwer==2.5.2
          kaldiio==2.18.0
          lhotse==1.28.0
          librosa==0.10.2.post1
          meeteval==0.3.0
          pandas==2.2.2
          pyannote.core==5.0.0
          pyannote.database==5.1.0
          pyannote.metrics==3.2.1
          PyYAML==6.0.2
          transformers==4.41.2
          wandb>=0.19.0
          simplejson==3.20.1
          """)

In [None]:
# Install required packages
!pip install -q -r reqs_collab.txt
# !pip install -q gdown
!pip uninstall peft -y
!pip uninstall tensorflow -y

### Step 2: Clone the Repository


In [None]:
# Clone the DiCoW repository
!git clone https://github.com/BUTSpeechFIT/TS-ASR-Whisper.git
%cd TS-ASR-Whisper

# Initialize and update submodules
!git submodule init
!git submodule update
%cd ..

## 3. Data Preparation <a id="data"></a>
In this section, we will prepare the Libri2Mix dataset for training and evaluation. We will download the dataset, unzip it, and prepare the manifests for training and evaluation.

#### 1. Prepare directories


In [None]:
!mkdir -p data
!mkdir -p data/libri2mix
!mkdir -p data/manifests
!mkdir -p data/libri2mix/train-100
!mkdir -p data/libri2mix/dev
!mkdir -p data/libri2mix/test

#### 2. Download prepared Libri2Mix 100h clean dataset
The prepared Libri2Mix 100h clean dataset is available from the Google Drive folder referenced in the commented-out `gdown` cell below.

Google Drive may deny access once its bandwidth limit is exceeded. In that case, use the bash cell below to download the dataset directly from the Nextcloud server.

The dataset is already preprocessed and ready for use. It contains 100 hours of clean speech, a subset of the original Libri2Mix dataset.

In [None]:
# import gdown
#
# gdown.download_folder("https://drive.google.com/drive/folders/1vZEroIOIa2H8JqAltGxFebBv_ukiKU4j?usp=sharing", use_cookies=False,
#                       quiet=True, output="data/libri2mix")

In [None]:
%%bash
# Cutsets
curl -L -o data/libri2mix/libri2mix_mix_clean_sc_dev_cutset.jsonl.gz https://nextcloud.fit.vutbr.cz/s/MHLxjrd8XWPCieE/download
curl -L -o data/libri2mix/libri2mix_clean_100_train_sc_cutset_30s.jsonl.gz https://nextcloud.fit.vutbr.cz/s/gyPBwcMM3Azqbpk/download
curl -L -o data/libri2mix/libri2mix_mix_clean_sc_test_cutset.jsonl.gz https://nextcloud.fit.vutbr.cz/s/gdDMe2EKdAn4Kx5/download

# Data
curl -L -o data/libri2mix/train_mix_clean.tar.gz https://nextcloud.fit.vutbr.cz/s/oXkxkW59xDKLPgJ/download
curl -L -o data/libri2mix/dev_mix_clean.tar.gz https://nextcloud.fit.vutbr.cz/s/DmiAicG2aWLeLqm/download
curl -L -o data/libri2mix/test_mix_clean.tar.gz https://nextcloud.fit.vutbr.cz/s/WACLidXg78BgesB/download
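
Since `curl` saves whatever the server returns, an interrupted or refused download can leave an empty file or an error page on disk. A quick check of the gzip magic bytes catches that before unpacking. The helper names below are hypothetical; only the file paths come from the cell above.

```python
from pathlib import Path

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def looks_like_gzip(path):
    """Return True if the file exists and starts with the gzip magic bytes."""
    path = Path(path)
    if not path.is_file():
        return False
    with path.open("rb") as f:
        return f.read(2) == GZIP_MAGIC


def find_bad_downloads(paths):
    """Return the paths that are missing or do not look like gzip files."""
    return [str(p) for p in paths if not looks_like_gzip(p)]


expected = [
    "data/libri2mix/libri2mix_mix_clean_sc_dev_cutset.jsonl.gz",
    "data/libri2mix/libri2mix_clean_100_train_sc_cutset_30s.jsonl.gz",
    "data/libri2mix/libri2mix_mix_clean_sc_test_cutset.jsonl.gz",
    "data/libri2mix/train_mix_clean.tar.gz",
    "data/libri2mix/dev_mix_clean.tar.gz",
    "data/libri2mix/test_mix_clean.tar.gz",
]
bad = find_bad_downloads(expected)
print("Files to re-download:", bad or "none")
```

This only checks the header, not the full archive, so a truncated download can still pass; it is meant as a cheap first filter.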

#### 3. Unzip downloaded datasets

In [None]:
!tar -xzf data/libri2mix/train_mix_clean.tar.gz -C data/libri2mix/train-100
!tar -xzf data/libri2mix/dev_mix_clean.tar.gz -C data/libri2mix/dev
!tar -xzf data/libri2mix/test_mix_clean.tar.gz -C data/libri2mix/test

#### 4. Fix paths in the dataset manifests

In [None]:
from lhotse import load_manifest
import os

if __name__ == "__main__":
    # (input cutset, output manifest) pairs for the train, dev, and test splits
    for cutset, out in [
        ("data/libri2mix/libri2mix_clean_100_train_sc_cutset_30s.jsonl.gz",
         "data/manifests/libri2mix_clean_100_train_sc_cutset_30s.jsonl.gz"),
        ("data/libri2mix/libri2mix_mix_clean_sc_dev_cutset.jsonl.gz",
         "data/manifests/libri2mix_mix_clean_sc_dev_cutset.jsonl.gz"),
        ("data/libri2mix/libri2mix_mix_clean_sc_test_cutset.jsonl.gz",
         "data/manifests/libri2mix_mix_clean_sc_test_cutset.jsonl.gz")]:
        cset = load_manifest(cutset)
        for r in cset:
            # Point every audio source at the local copy of the data
            for src in r.recording.sources:
                src.source = src.source.replace(
                    'PATH_TO_BE_REPLACED',
                    os.path.abspath('data/libri2mix'))
            # Drop the alignments from the supervisions
            for s in r.supervisions:
                s.alignment = None

        # Write the fixed manifest to data/manifests
        cset.to_file(out)
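
To confirm that the rewrite took effect, you can scan a written manifest for leftover placeholders. The sketch below reads the gzip-compressed JSON Lines file with the standard library only; the helper name `count_unresolved` is hypothetical.

```python
import gzip
import os


def count_unresolved(manifest_path, placeholder="PATH_TO_BE_REPLACED"):
    """Count manifest lines that still contain the placeholder path."""
    with gzip.open(manifest_path, "rt", encoding="utf-8") as f:
        return sum(placeholder in line for line in f)


manifest = "data/manifests/libri2mix_mix_clean_sc_dev_cutset.jsonl.gz"
if os.path.exists(manifest):
    print(manifest, "->", count_unresolved(manifest), "lines with unresolved paths")
```

A count of zero means every source path was rewritten; any other value suggests the cell above did not run on that split.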

#### 5. Check the dataset
The cell below prints summary statistics of the training cutset. As noted above, the data is the preprocessed 100-hour clean subset of the original Libri2Mix dataset.

In [None]:
from lhotse import load_manifest

train_cutset = load_manifest("data/manifests/libri2mix_clean_100_train_sc_cutset_30s.jsonl.gz")
train_cutset.describe()

Here you can see and hear an example from the cutset. Each cut contains an audio recording and one supervision per speaker in that recording. Each supervision holds the speaker name and the transcription of the speech. The dataset was constructed from LibriSpeech by mixing two samples into one audio recording.

In [None]:
sample = train_cutset[0]
print(f"{sample.supervisions[0].speaker}: {sample.supervisions[0].text}")
print(f"{sample.supervisions[1].speaker}: {sample.supervisions[1].text}")
sample.plot_audio()
sample.play_audio()

#### 6. Try the same file in the DiCoW Gradio app
You can submit the same file to the [DiCoW Gradio app](https://pccnect.fit.vutbr.cz/gradio-demo/) to see how a pretrained DiCoW model handles it. The app transcribes the audio and returns a transcription for every speaker in the recording. You can also upload your own audio file.

#### 7. Prepare small development set for quick testing
How precisely you want to estimate your model's performance determines whether you use the full development set or a smaller subset. The full development set contains 3000 cuts; a subset of 128 cuts should be enough for a rough estimate. Note that the subset is written to a file whose name ends in `_100`, even though it holds 128 cuts.


In [None]:
from lhotse import load_manifest

devset = load_manifest("data/manifests/libri2mix_mix_clean_sc_dev_cutset.jsonl.gz")
devset = devset.subset(first=128)
devset.to_file("data/manifests/libri2mix_mix_clean_sc_dev_cutset_100.jsonl.gz")

In [None]:
import os
MANIFEST_DIR = os.path.abspath("data/manifests")
os.environ["MANIFEST_DIR"] = MANIFEST_DIR
os.environ["TRAIN_CUTSET"] = f"{MANIFEST_DIR}/libri2mix_clean_100_train_sc_cutset_30s.jsonl.gz"
os.environ["DEV_CUTSET"] = f"{MANIFEST_DIR}/libri2mix_mix_clean_sc_dev_cutset.jsonl.gz"
os.environ["TEST_CUTSET"] = f"{MANIFEST_DIR}/libri2mix_mix_clean_sc_test_cutset.jsonl.gz"
os.environ["TOYSET_CUTSET"] = f"{MANIFEST_DIR}/libri2mix_mix_clean_sc_dev_cutset_100.jsonl.gz"

## 4. Model Training <a id="finetuning"></a>
First, let's check that the environment is set up correctly by running the decoding script on a small dataset.

In [None]:
%cd TS-ASR-Whisper


#### Step 1: Prepare the decoding configuration file

In [None]:
with open("configs/decode/toyset_decoding.yaml", "w") as f:
    f.write(
"""
# @package _global_
experiment: libri2mix_decode_both

model:
  whisper_model: "openai/whisper-tiny"
data:
  eval_cutsets: "${oc.env:TOYSET_CUTSET}"
  train_cutsets: "${oc.env:TOYSET_CUTSET}"
  dev_cutsets: "${oc.env:TOYSET_CUTSET}"
  eval_text_norm: "whisper_nsf"
training:
  decode_only: true
  bf16: false
  bf16_full_eval: false
  eval_metrics_list: [ "tcp_wer", "cp_wer"]
  per_device_eval_batch_size: 16
  dataloader_num_workers: 4
  dataloader_prefetch_factor: 1
  dataloader_pin_memory: true
"""
    )

#### Step 2: Export the environment variables for decoding

In [None]:
import os
os.environ["SRC_ROOT"] = os.path.abspath(".")
os.environ["WANDB_ANONYMOUS"] = "allow"
# os.environ["WANDB_ENTITY"] = ""  # Set your Weights & Biases entity if needed
os.environ["WANDB_PROJECT"] = "DiCoW_playground"
os.environ["WANDB_RUN_ID"] = "libri2mix_decode_both"
os.environ["HF_HOME"] = "hf_cache"
os.environ["PYTHONPATH"] = f"{os.environ['SRC_ROOT']}"
os.environ["EXPERIMENT_PATH"] = f"{os.environ['SRC_ROOT']}/exp/{os.environ.get('EXPERIMENT', '')}"
os.environ["LIBRI_TRAIN_CACHED_PATH"] = ""
os.environ["LIBRI_DEV_CACHED_PATH"] = ""
os.environ["AUDIO_PATH_PREFIX"] = ""
os.environ["AUDIO_PATH_PREFIX_REPLACEMENT"] = ""

#### Step 3: Run the decoding script

In [None]:
!python src/main.py +decode=toyset_decoding

#### Step 4: Check the decoding results via meeteval
You should see that the randomly initialized model does not perform well yet. Let's fine-tune it.

In [None]:
import meeteval
from meeteval.viz.visualize import AlignmentVisualization

folder = r'exp/libri2mix_decode_both/test/0/wer/1919-142785-0014_3000-15664-0027'
av = AlignmentVisualization(
    meeteval.io.load(folder + '/ref.json'),
    meeteval.io.load(folder + '/tcp_wer_hyp.json')
)
display(av)

#### Step 5: DiCoW Training
Create a configuration file for training the DiCoW-tiny model on the Libri2Mix dataset.

In [None]:
with open("configs/train/tiny_l2mix.yaml", "w") as f:
    f.write(
"""
# @package _global_
defaults:
  - /train/icassp/table1_model_comparisons/base

experiment: lsmix_tiny
wandb:
  project: jsalt25_dicow_challenge
model:
  whisper_model: openai/whisper-tiny
  reinit_encoder_from: null
data:
  train_text_norm: "whisper_nsf"
  use_timestamps: true
  eval_cutsets: "${oc.env:TEST_CUTSET}"
  train_cutsets: "${oc.env:TRAIN_CUTSET}"
  dev_cutsets: "${oc.env:TOYSET_CUTSET}"
  eval_text_norm: "whisper_nsf"

training:
  warmup_steps: 2000
  remove_timestamps_from_ctc: true
  overall_batch_size: 24
  learning_rate: 1e-5
  per_device_eval_batch_size: 16
  bf16: false
  bf16_full_eval: false
  fp16: true
  fp16_full_eval: true
  eval_metrics_list: [ "tcp_wer", "cp_wer"]
  eval_strategy: steps
  save_strategy: steps
  eval_steps: 200
  save_steps: 200
  use_amplifiers_only_n_epochs: 0
"""
    )

Let's now train the DiCoW model on the Libri2Mix dataset. We will use the tiny Whisper model as a base and fine-tune it with DiCoW. Whenever you are satisfied with the results, you can stop the training and use the latest saved checkpoint for decoding.

In [None]:
!python src/main.py +train=tiny_l2mix

## 5. Decoding & Evaluation <a id="decoding"></a>
Congrats, you have trained the DiCoW model on the Libri2Mix dataset! 🎉

Now let's decode the test set and evaluate the performance of the model.

In [None]:
with open("configs/decode/tiny_test.yaml", "w") as f:
    f.write(
"""
# @package _global_
experiment: libri2mix_clean

model:
  whisper_model: "openai/whisper-tiny"
  reinit_from: "/content/TS-ASR-Whisper/exp/lsmix_tiny/checkpoint-600/model.safetensors"
data:
  eval_cutsets: "${oc.env:TEST_CUTSET}"
  train_cutsets: "${oc.env:TOYSET_CUTSET}"
  dev_cutsets: "${oc.env:TOYSET_CUTSET}"
  eval_text_norm: "whisper_nsf"
training:
  decode_only: true
  bf16: false
  bf16_full_eval: false
  dataloader_num_workers: 4
  dataloader_prefetch_factor: 1
  dataloader_pin_memory: true
  generation_max_length: 64
  eval_metrics_list: [ "tcp_wer", "cp_wer"]
  per_device_eval_batch_size: 64
"""
    )
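
The `reinit_from` path above points at `checkpoint-600`, which only exists if training ran at least that long. The sketch below looks up the checkpoint with the highest step number in an experiment directory, assuming checkpoints are saved as `checkpoint-<step>` subdirectories as in the path above. The helper name `latest_checkpoint` is hypothetical, and the latest checkpoint is not necessarily the best one.

```python
import re
from pathlib import Path


def latest_checkpoint(exp_dir):
    """Return the checkpoint-<step> directory with the highest step, or None."""
    exp_dir = Path(exp_dir)
    if not exp_dir.is_dir():
        return None
    pattern = re.compile(r"^checkpoint-(\d+)$")
    candidates = []
    for path in exp_dir.iterdir():
        match = pattern.match(path.name)
        if match and path.is_dir():
            candidates.append((int(match.group(1)), path))
    if not candidates:
        return None
    # Compare step numbers numerically so that 1000 ranks above 600.
    return max(candidates, key=lambda item: item[0])[1]


ckpt = latest_checkpoint("exp/lsmix_tiny")
if ckpt is not None:
    print("Use in reinit_from:", (ckpt / "model.safetensors").resolve())
else:
    print("No checkpoint found yet")
```

Copy the printed path into the `reinit_from` field of the decoding configuration if it differs from the hard-coded one.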

In [None]:
!python src/main.py +decode=tiny_test


## 6. Generate Submission <a id="submission"></a>

Now collect all hypotheses into a single submission file and upload it to the [EMMA Leaderboard](https://huggingface.co/spaces/BUT-FIT/EMMA_leaderboard).

In [None]:
import meeteval
import glob
hyps = []
for hyp_file in glob.glob("/content/TS-ASR-Whisper/exp/libri2mix_clean/test/0/wer/*/tcp_wer_hyp.json"):
    hyps.append(meeteval.io.load(hyp_file))
hyp = meeteval.io.SegLST.merge(*hyps)
hyp.dump("/content/my_submission.json")
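
Before uploading, it can help to sanity-check the file. The sketch below assumes the dump is a JSON list of SegLST-style segments with the keys `session_id`, `speaker`, `start_time`, `end_time`, and `words`; this key set and the checks are assumptions for illustration, not the leaderboard's official validation.

```python
import json

# Assumed SegLST segment keys; adjust if your dump differs.
REQUIRED_KEYS = {"session_id", "speaker", "start_time", "end_time", "words"}


def check_seglst(segments):
    """Return a list of human-readable problems found in SegLST-style segments."""
    problems = []
    if not segments:
        problems.append("submission is empty")
    for i, seg in enumerate(segments):
        missing = REQUIRED_KEYS - seg.keys()
        if missing:
            problems.append(f"segment {i}: missing keys {sorted(missing)}")
        elif float(seg["end_time"]) < float(seg["start_time"]):
            problems.append(f"segment {i}: end_time before start_time")
    return problems


# Toy segments; the second one is deliberately broken.
example = [
    {"session_id": "mix1", "speaker": "spk1", "start_time": 0.0, "end_time": 2.5, "words": "hello there"},
    {"session_id": "mix1", "speaker": "spk2", "start_time": 3.0, "end_time": 1.0, "words": "oops"},
]
problems = check_seglst(example)
print(problems)
```

To check the real file, load it with `json.load` and pass the resulting list to `check_seglst`; an empty result means none of these basic problems were found.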

Congrats! You have successfully trained and evaluated the DiCoW model on the Libri2Mix dataset. You can now submit your results to the EMMA leaderboard and compete for the prizes. Good luck! I am becoming thirsty already! 🍺🍺🍺