# DiCoW: Diarization-Conditioned Whisper for Target Speaker ASR - JSALT 25 Competition

## 🏆 Competition Information
Welcome to the DiCoW Target Speaker ASR Challenge!

**Prizes:**
- 🥇 **1st Place:** 3 beers 🍺🍺🍺
- 🥈 **2nd Place:** 2 beers 🍺🍺
- 🥉 **3rd Place:** 1 beer 🍺

**Submission:** Submit your best-performing systems to the [EMMA Leaderboard](https://huggingface.co/spaces/BUT-FIT/EMMA_leaderboard).

---

## 📚 Table of Contents
1. [Introduction to MT-ASR, TS-ASR, and DiCoW](#intro)
    1. [Multi-Talker ASR (MT-ASR)](#mt_asr)
    2. [Target Speaker ASR (TS-ASR)](#ts_asr)
    3. [DiCoW: Diarization-Conditioned Whisper](#dicow)
2. [Environment Setup](#setup)
3. [Data Preparation](#data)
4. [Model Finetuning](#finetuning)
5. [Decoding & Evaluation](#decoding)
6. [Submission Guidelines](#submission)

**TASKS:**
1. Clone the DiCoW repository and set up the environment
2. Prepare the Libri2Mix dataset
3. Finetune the Whisper tiny model using DiCoW
4. Evaluate on the Libri2Mix clean test set
5. Submit results to the EMMA leaderboard
6. (Optional) Explore decoding improvements with pretrained models

**DEADLINE:** [TO BE FILLED]


<a id='intro'></a>
# 1. INTRODUCTION TO MULTI-TALKER AUTOMATIC SPEECH RECOGNITION

![mt_asr](img/mt_asr.png)

## The Challenge of Multi-Talker ASR

Automatic Speech Recognition (ASR) systems traditionally work well with single-speaker audio.
However, real-world scenarios often involve multiple speakers talking simultaneously, creating
several challenges:

1. **Overlapping Speech**: Multiple speakers talking at the same time
2. **Speaker Confusion**: Difficulty determining who said what
3. **Acoustic Interference**: Speech from one speaker masks another
4. **Variable Number of Speakers**: Unknown number of active speakers

<a id='mt_asr'></a>
### Approaches to Multi-Talker ASR

![mt_asr_approaches](img/mt_asr_approaches.png)


1. **Speech Separation + ASR**: First separate speakers, then apply ASR
2. **E2E MT-ASR (SOT)**: End-to-end model with Serialized Output Training, which concatenates speaker-attributed transcriptions ordered by emission time
3. **Target Speaker ASR**: Focus on a specific speaker of interest

<a id='ts_asr'></a>
## Target Speaker ASR: Focus on Speaker of Interest

Target Speaker ASR (TS-ASR) addresses a practical scenario: given mixed audio with multiple
speakers, transcribe only the speech from a specific target speaker.


### Traditional TS-ASR Approaches:

1. Randomly initialized model and i-vector based speaker embeddings

![ts_asr_embed](img/ts_asr_embed.png)

2. Pretrained ASR model with (better) speaker embeddings

![ts_asr_enrolment](img/ts_asr_enrolment.png)

3. Pretrained ASR model directly conditioned on speaker enrolment

![whisper_enrolment](img/whisper_enrolment.png)


<a id='dicow'></a>
## DICOW: DIARIZATION-CONDITIONED WHISPER

DiCoW (Diarization-Conditioned Whisper) takes a different route to Target Speaker ASR.
Instead of relying on speaker embeddings, DiCoW conditions the model directly on
frame-level speaker diarization outputs.

![dicow](img/dicow.png)


### STNO - Silence, Target, Non-Target, and Overlap Masks

Let $\mathbf{D} \in [0,1]^{S \times T}$ represent the diarization output, where $S$ is the number of speakers in the recording and $T$ is the number of frames, with $d(s, t)$ denoting the probability that speaker $s$ is active in time frame $t$. Let $s_k$ represent the target speaker.
We define a distribution over the following mutually exclusive events for a frame at time $t$:

1. ${\mathcal{S}}$: Time frame $t$ represents silence.
2. ${\mathcal{T}}$: The target speaker, $s_k$, is the only active speaker in time frame $t$.
3. ${\mathcal{N}}$: One or more non-target speakers $s \neq s_k$ are active and the target speaker $s_k$ is not active at time frame $t$.
4. ${\mathcal{O}}$: The target speaker $s_k$ is active while at least one non-target speaker $s \neq s_k$ is also active at time frame $t$, denoting an overlap.


The probabilities of these events occurring at time frame $t$ can be calculated as:
1. $p_{\mathcal{S}}^t  = \prod_{s=1}^S (1 - d(s, t))$
2. $p_{\mathcal{T}}^t  = d(s_k, t)  \prod_{\substack{s=1 \\ s \neq s_k}}^S (1 - d(s, t))$
3. $p_{\mathcal{N}}^t  = \left(1 - p_{\mathcal{S}}^t\right) - d\left(s_k, t\right)$
4. $p_{\mathcal{O}}^t  = d(s_k, t) - p_{\mathcal{T}}^t$


This definition allows us to use a fixed-sized STNO (Silence, Target, Non-target, Overlap) mask $\mathbf{M}^t = \begin{bmatrix} p_{\mathcal{S}}^t & p_{\mathcal{T}}^t & p_{\mathcal{N}}^t & p_{\mathcal{O}}^t \end{bmatrix}^{\top}$.
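
As a quick check of the four formulas above, here is a minimal sketch that computes the STNO mask for a single frame in plain Python. The function name `stno_mask` and the example probabilities are illustrative assumptions, not part of the DiCoW codebase.

```python
import math


def stno_mask(d_frame, target):
    """Return [p_S, p_T, p_N, p_O] for one time frame.

    d_frame: activity probabilities d(s, t) for every speaker s at frame t.
    target:  index of the target speaker s_k.
    (Hypothetical helper written for this tutorial.)
    """
    d_target = d_frame[target]
    # Silence: no speaker is active.
    p_silence = math.prod(1.0 - d for d in d_frame)
    # Target only: s_k is active and every other speaker is inactive.
    p_target = d_target * math.prod(
        1.0 - d for s, d in enumerate(d_frame) if s != target
    )
    # Non-target: someone is active, but not the target speaker.
    p_non_target = (1.0 - p_silence) - d_target
    # Overlap: the target is active together with at least one other speaker.
    p_overlap = d_target - p_target
    return [p_silence, p_target, p_non_target, p_overlap]


# Two speakers; speaker 0 is the target and is very likely active.
mask = stno_mask([0.9, 0.2], target=0)
print([round(p, 2) for p in mask])  # [0.08, 0.72, 0.02, 0.18]
```

The four entries always sum to one, since $p_{\mathcal{S}}^t + p_{\mathcal{T}}^t + p_{\mathcal{N}}^t + p_{\mathcal{O}}^t = 1$ follows directly from the definitions, and the mask length stays at four regardless of how many speakers the diarization reports.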

### Frame-Level Diarization Dependent Transformations


Let $\mathbf{Z}^l \in \mathbb{R}^{d_{{m}} \times T}$ represent the frame-by-frame inputs to the $l$-th (Transformer) layer.

We transform these hidden representations with four affine transformations, one per STNO class and specific to layer $l$, with weights $\mathbf{W}_{\mathcal{S}}^l, \mathbf{W}_{\mathcal{T}}^l, \mathbf{W}_{\mathcal{N}}^l, \mathbf{W}_{\mathcal{O}}^l \in \mathbb{R}^{d_{{m}} \times d_{{m}}}$ and biases $\mathbf{b}_{\mathcal{S}}^l, \mathbf{b}_{\mathcal{T}}^l, \mathbf{b}_{\mathcal{N}}^l, \mathbf{b}_{\mathcal{O}}^l \in \mathbb{R}^{d_{m}}$, to obtain new speaker-specific hidden representations $\hat{\mathbf{Z}}^l = [\hat{\mathbf{z}}^l_1, \ldots, \hat{\mathbf{z}}^l_T]$ as:

$$
\begin{aligned}
\hat{\mathbf{z}}^l_t ={}& \left( \mathbf{W}_{\mathcal{S}}^l \mathbf{z}^l_t + \mathbf{b}_{\mathcal{S}}^l \right) p^t_{\mathcal{S}} +
\left( \mathbf{W}_{\mathcal{T}}^l \mathbf{z}^l_t + \mathbf{b}_{\mathcal{T}}^l \right) p^t_{\mathcal{T}} \\
&+ \left( \mathbf{W}_{\mathcal{N}}^l \mathbf{z}^l_t + \mathbf{b}_{\mathcal{N}}^l \right) p^t_{\mathcal{N}} +
\left( \mathbf{W}_{\mathcal{O}}^l \mathbf{z}^l_t + \mathbf{b}_{\mathcal{O}}^l \right) p^t_{\mathcal{O}}.
\end{aligned}
$$


In other words, the hidden representations $\mathbf{z}^l_t$ are transformed using a convex combination of the four STNO class-specific affine transformations, weighted by the corresponding STNO class probabilities.
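
The per-frame transformation can be sketched in a few lines of plain Python. This is only an illustration of the formula, not the DiCoW implementation: the helper names `affine` and `fddt_frame` are hypothetical, and the weights below are arbitrary toy values chosen so the result is easy to verify by hand.

```python
def affine(W, b, z):
    """Return W z + b for a square matrix W (list of rows) and vectors b, z."""
    return [sum(w * x for w, x in zip(row, z)) + bias for row, bias in zip(W, b)]


def fddt_frame(z, mask, weights, biases):
    """Combine the four class-specific affine maps of one frame.

    z:       hidden vector z_t of length d_m.
    mask:    STNO probabilities [p_S, p_T, p_N, p_O] for this frame.
    weights: [W_S, W_T, W_N, W_O], each a d_m x d_m matrix.
    biases:  [b_S, b_T, b_N, b_O], each a vector of length d_m.
    """
    out = [0.0] * len(z)
    for p, W, b in zip(mask, weights, biases):
        # Weight each class-specific output by its class probability.
        out = [o + p * v for o, v in zip(out, affine(W, b, z))]
    return out


# Toy setup (assumed values): target and overlap frames pass through
# unchanged, silence and non-target frames are mapped to zero.
identity = [[1.0, 0.0], [0.0, 1.0]]
zero = [[0.0, 0.0], [0.0, 0.0]]
weights = [zero, identity, zero, identity]
biases = [[0.0, 0.0]] * 4

z_hat = fddt_frame([1.0, 2.0], [0.08, 0.72, 0.02, 0.18], weights, biases)
print([round(v, 2) for v in z_hat])  # [0.9, 1.8]
```

With these toy weights the output is the input scaled by $p_{\mathcal{T}}^t + p_{\mathcal{O}}^t = 0.9$, the total probability that the target speaker is active. In a real model the weights and biases are learned per layer and applied to all frames at once as batched matrix operations.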

![dicow_full](img/target_speaker_whisper_stno.drawio.png)



### Advantages of DiCoW

1. **No Speaker Embeddings Required**: Eliminates dependency on embedding quality
2. **Better Generalization**: Works well with unseen speakers
3. **Simplified Workflow**: Direct conditioning on diarization outputs
4. **Maintains Whisper Performance**: Preserves accuracy on (multi-lingual) single-speaker data

## 2. Environment Setup <a id="setup"></a>
### Step 1: Clone the Repository

In [1]:
# Clone the DiCoW repository
!git clone https://github.com/BUTSpeechFIT/TS-ASR-Whisper.git
%cd TS-ASR-Whisper

# Initialize and update submodules
!git submodule init
!git submodule update

Cloning into 'TS-ASR-Whisper'...
remote: Enumerating objects: 309, done.
remote: Counting objects: 100% (309/309), done.
remote: Compressing objects: 100% (199/199), done.
remote: Total 309 (delta 144), reused 254 (delta 107), pack-reused 0 (from 0)
Receiving objects: 100% (309/309), 180.72 KiB | 7.23 MiB/s, done.
Resolving deltas: 100% (144/144), done.
/Users/alexanderpolok/PycharmProjects/JSALT_tutorial/TS-ASR-Whisper
Submodule 'inference_pipeline' (https://github.com/BUTSpeechFIT/DiCoW.git) registered for path 'inference_pipeline'
Cloning into '/Users/alexanderpolok/PycharmProjects/JSALT_tutorial/TS-ASR-Whisper/inference_pipeline'...
Submodule path 'inference_pipeline': checked out 'e9326bd536bf632e823357438b210102903ba620'
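
If the submodule step is skipped or fails, the `inference_pipeline` directory exists but stays empty. A minimal sanity check, assuming it is run from the repository root, might look like the sketch below; the function name `submodule_is_populated` is hypothetical and only the directory name comes from the output above.

```python
from pathlib import Path


def submodule_is_populated(repo_root, submodule="inference_pipeline"):
    """Return True if the submodule directory exists and contains files."""
    path = Path(repo_root) / submodule
    return path.is_dir() and any(path.iterdir())


# Run from the TS-ASR-Whisper root; False suggests re-running
# `git submodule init` and `git submodule update`.
print(submodule_is_populated("."))
```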


### Step 2: Install Dependencies

In [2]:
# Install required packages
!pip install -r requirements.txt

# Optional: Install flash attention for faster training (requires CUDA)
!pip install flash-attn==2.7.2.post1

# Install additional tools
!apt-get update && apt-get install -y ffmpeg sox

Collecting accelerate==0.33.0 (from -r requirements.txt (line 1))
  Obtaining dependency information for accelerate==0.33.0 from https://files.pythonhosted.org/packages/15/33/b6b4ad5efa8b9f4275d4ed17ff8a44c97276171341ba565fdffb0e3dc5e8/accelerate-0.33.0-py3-none-any.whl.metadata
  Downloading accelerate-0.33.0-py3-none-any.whl.metadata (18 kB)
Collecting azure-cli==2.53.1 (from -r requirements.txt (line 2))
  Obtaining dependency information for azure-cli==2.53.1 from https://files.pythonhosted.org/packages/c4/be/248f69a2d0f807d904b0d70f588a056bab61a079f2ce7380d6a9707197c1/azure_cli-2.53.1-py3-none-any.whl.metadata
  Downloading azure_cli-2.53.1-py3-none-any.whl.metadata (8.4 kB)
Collecting datasets==2.21.0 (from -r requirements.txt (line 3))
  Obtaining dependency information for datasets==2.21.0 from https://files.pythonhosted.org/packages/72/b3/33c4ad44fa020e3757e9b2fad8a5de53d9079b501e6bbc45bdd18f82f893/datasets-2.21.0-py3-none-any.whl.metadata
  Downloading datasets-2.21.0