<img src="../Images/DSC_Logo.png" style="width: 400px;">

This notebook applies a workflow built on [faster-whisper](https://github.com/SYSTRAN/faster-whisper) and [pyannote.audio](https://github.com/pyannote/pyannote-audio), with installation procedures based on their official GitHub repositories (accessed September 25, 2025). It is one of several common Whisper-based workflows: Whisper can also be run directly (without faster-whisper) or via more integrated pipelines such as [WhisperX](https://github.com/m-bain/whisperX).

The overall approach that we present in this notebook is: 
1. **Automatic Speech Recognition (ASR)** with Whisper (speech-to-text transcription)
2. **Speaker diarization** with pyannote (who speaks when)
3. **Time-based merging** (assign speaker labels to the transcript)

This is similar to what runs in the background of tools like [noScribe](https://github.com/kaixxx/noScribe). 

If you only need a plain transcript (no speaker labels), you can run Whisper alone and skip pyannote (including the Hugging Face setup) as well as the merging steps.

# 1. One-Time Setup: Install Software & Hugging Face Account

## 1.1 Install Software

The code below installs the **Python packages** listed in "requirements.txt" into the environment your Jupyter notebook is using. It ensures all needed libraries (and versions) are available so the notebook can run without import errors.

In [None]:
%pip install -r ../requirements.txt

## 1.2 Hugging Face Account

Pyannote diarization requires a Hugging Face account and model access. The setup consists of three steps:
1. Create a **Hugging Face account** (if you don’t have one yet): [Hugging Face Account](https://huggingface.co/)
2. Create an **access token** (needed to download and run the diarization model):
Click your profile icon → Settings → Access Tokens → create a new token with read permission (write permission is not needed; the simplest option is the "Read" token type).

>Important: Don't share your token.

3. **Request access** to the model repository: Open [pyannote/speaker-diarization-community-1](https://huggingface.co/pyannote/speaker-diarization-community-1), review the conditions, enter your information if required, and click "Agree and access repository".

If you encounter error messages when running speaker diarization in this notebook because the steps or requirements have changed, please check the latest instructions in the [pyannote.audio](https://github.com/pyannote/pyannote-audio) repository.

In [None]:
own_token = "ENTER_YOUR_TOKEN"
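
To keep the token out of the notebook itself (for example, when the notebook is shared or committed to a repository), it can be read from an environment variable, with a hidden prompt as a fallback. The following is a minimal sketch. The helper name `get_hf_token` and the variable name `HF_TOKEN` are assumptions made for illustration and are not part of this notebook's original setup.

```python
import os
from getpass import getpass


def get_hf_token(env_var="HF_TOKEN"):
    """Return the token from an environment variable, or ask for it interactively."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        # getpass hides the typed input, so the token does not appear in the cell output
        token = getpass("Hugging Face access token: ").strip()
    return token


# Usage (hypothetical replacement for the hard-coded assignment above):
# own_token = get_hf_token()
```

With this approach, the token only lives in the running session and is not saved together with the notebook.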

# 2. Import Packages

In [None]:
# Optional to avoid warnings:
import warnings
warnings.simplefilter("ignore", UserWarning)
warnings.simplefilter("ignore", FutureWarning)

In [None]:
import os
from datetime import timedelta
import subprocess

import imageio_ffmpeg          # ffmpeg is an external program; imageio_ffmpeg provides an executable we can call from Python for audio conversion
import torch                   # PyTorch: required by pyannote and used for tensors / GPU support

from faster_whisper import WhisperModel              # speech-to-text (ASR) model
from faster_whisper import BatchedInferencePipeline  # optional wrapper for faster ASR on GPU (batching audio chunks)

import torchaudio                    # used in this notebook to load audio into memory as a waveform (tensor) + sample rate
from pyannote.audio import Pipeline  # speaker diarization pipeline ("who speaks when")

# Path to the ffmpeg executable (used in subprocess calls for conversion)
ffmpeg = imageio_ffmpeg.get_ffmpeg_exe()


You can check the installed PyTorch version and whether your environment has access to a GPU:

In [None]:
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())

# 3. Setup

## 3.1 Runtime Setup

Automatically set device and compute type depending on **hardware availability**:

In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"

if device == "cuda":
    compute_type = "float16"  # Faster and more memory efficient on GPU
    batch_size = 16           # Adjust based on GPU memory
else:
    compute_type = "int8"     # 8-bit quantization: lower memory use and usually faster on CPU
    batch_size = 1            # Keep at 1 on CPU (larger values don’t help and may cause memory issues)

## 3.2 Select Audio File

With Python, you can easily transcribe **multiple files by looping over a list of paths** (e.g., all files in a folder) and applying the same steps to each file. In this notebook, however, we keep things simple and specify a single audio file. 

Below, we provide the **relative path to one audio file** (going up one folder and then into, e.g., "Data_Raw/File-A_buffy/"). Both .wav and .mp3 files work because faster-whisper decodes audio with PyAV, which bundles the FFmpeg libraries and can read many common audio formats. In addition, in the next step we explicitly convert the audio to a standardized format to ensure consistent processing throughout the notebook.

To switch between example files, uncomment exactly one pair (`file_name` & `audio_file`) and keep all others commented out:

In [None]:
#file_name = "File-A"
#audio_file = "../Data_Raw/File-A_buffy/shortened_Buffy_Seas01-Epis01.en.wav"
#file_name = "File-B"
#audio_file = "../Data_Raw/File-B_moon-landing/CA138clip.mp3"
file_name = "File-C"
audio_file = "../Data_Raw/File-C_qualitative-interview-de/DE_example_2.mp3"
#file_name = "File-D"
#audio_file = "../Data_Raw/File-D_qualitative-interview-en/EN_example_1.mp3"
#file_name = "File-E"
#audio_file = "../Data_Raw/File-E_Bremen-guide-low-saxon/audioguide-2025-platt-01-shortened.wav"
#file_name = "File-F"
#audio_file = "../Data_Raw/File-F_math/math4.mp3"
#file_name = "File-G"
#audio_file = "../Data_Raw/File-G_XX/XX"
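
The loop over multiple files mentioned at the start of this section could be prepared as sketched below. This is a minimal illustration, not part of the notebook's workflow: the helper `find_audio_files` and the set `AUDIO_EXTENSIONS` are hypothetical names, and the folder passed in the usage comment is only an example.

```python
from pathlib import Path

AUDIO_EXTENSIONS = {".wav", ".mp3"}  # assumption: extend with other formats as needed


def find_audio_files(folder):
    """Return all audio files below `folder` (searched recursively), sorted by path."""
    return sorted(
        p for p in Path(folder).rglob("*")
        if p.is_file() and p.suffix.lower() in AUDIO_EXTENSIONS
    )


# Usage: derive file_name from each path and run the steps of Sect. 4-9 inside the loop
# for audio_path in find_audio_files("../Data_Raw"):
#     file_name = audio_path.stem
#     audio_file = str(audio_path)
```

Because the output name of the transcript is built from `file_name`, using the file stem keeps the results of different recordings apart.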

# 4. Preprocess Audio File

Whisper and pyannote can read many audio formats and handle basic resampling internally. In this notebook, we still apply one light **preprocessing** step: we standardize the audio to a 16 kHz mono WAV. This ensures that Whisper and pyannote use the exact same audio file and time base, which makes the later alignment/merging step more reliable (see Sect. 8).

More advanced preprocessing (denoising, volume normalization, echo removal, speech separation) is usually optional and only needed if you notice clear problems, such as strong background noise/echo, very uneven volume, very long silences, or heavy overlapping speech.

## 4.1 Run ffmpeg

The audio file is converted once to a 16 kHz mono WAV. This **standardized audio file** is then used for both Whisper and pyannote so they share the exact same audio time base. This makes it more straightforward to align Whisper’s transcript segments with pyannote’s speaker turns and assign speaker labels (see Sect. 8).

In [None]:
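audio_16k = "../Data_Preprocessed/audio_16k_mono.wav"
os.makedirs(os.path.dirname(audio_16k), exist_ok=True)  # create the output folder if it does not exist yet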

# Convert with ffmpeg (standardize audio for consistent processing):
# -i <input> -> input audio file (e.g., .mp3 or .wav)
# -ac 1      -> convert to mono (1 audio channel)
# -ar 16000  -> resample to 16,000 Hz (common format for speech models)
# -y         -> overwrite output file if it already exists
# -hide_banner       -> hide ffmpeg version banner
# -loglevel error    -> show only errors (no progress/info output)
subprocess.run(
    [ffmpeg, "-y",
     "-hide_banner",
     "-loglevel", "error",
     "-i", audio_file,
     "-ac", "1", "-ar", "16000",
     audio_16k],
    check=True
)

print("Wrote:", audio_16k)
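
To confirm that the conversion produced the intended format, the WAV header can be inspected with Python's standard `wave` module. This is a small sketch under the assumption that the file is a plain PCM WAV (ffmpeg's default for `.wav` output). The helper name `describe_wav` is hypothetical.

```python
import wave


def describe_wav(path):
    """Read channel count, sample rate, and duration from a PCM WAV file header."""
    with wave.open(path, "rb") as wav:
        sample_rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate": sample_rate,
            "duration_s": wav.getnframes() / sample_rate,
        }


# Usage with the file written above:
# wav_info = describe_wav(audio_16k)
# assert wav_info["channels"] == 1 and wav_info["sample_rate"] == 16000
```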



# 5. Load Models

## 5.1 Load Whisper Model for ASR

Load the **Whisper model for ASR** with the given device ("cpu" or "cuda") and precision type ("float16", "int8", etc.). You can select any [Whisper model](https://github.com/openai/whisper) size (e.g., "tiny" to "large-v3") or provide a custom/fine-tuned model.

In [None]:
model = WhisperModel("large-v3",  # "tiny", ...
                     device, 
                     compute_type=compute_type)

With faster-whisper, you can run transcription "normally" or with batched inference. Batched inference is mainly a speed option for GPUs (it processes several audio chunks at once). On CPU, it usually provides little benefit, so the default (non-batched) mode is typically used. To use batched inference instead, uncomment the following line and additionally pass `batch_size=batch_size` to `transcribe()` in Sect. 6:

In [None]:
#model = BatchedInferencePipeline(model=model)

## 5.2 Load pyannote Model for Diarization

Load the **pyannote speaker diarization pipeline from Hugging Face** using your access token. This model will later predict a timeline of "who speaks when" in the audio.

In [None]:
diarization = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-community-1",  # the model for which access was requested in Sect. 1.2
    token=own_token
)

# 6. Run Transcription / ASR

The `transcribe()` function takes an audio input and produces a **transcription as a sequence of time-stamped text segments.**

You can optionally provide a **language code**. It is one of **many optional settings**. Most of the other parameters control how the decoding is done: 

- In the setup below, you can enable or disable **word-level timestamps** (`word_timestamps`). If enabled, per-word start/end times and word probabilities are added, but compute cost increases and timestamps can be less stable for short disfluencies or noisy speech.
- You can also enable or disable **voice activity detection** (`vad_filter`). If enabled, non-speech regions are removed, which typically speeds up transcription and can reduce hallucinations in silent/noisy parts, but it may also cut very short hesitation sounds (e.g., “um”, “ähm”) if they fall below the VAD sensitivity.
- **Beam size** (`beam_size`) controls how many alternative token sequences are considered during decoding. Larger values can slightly improve accuracy, but they also increase runtime.

For details on all available parameters and their defaults, refer to the [faster-whisper](https://github.com/SYSTRAN/faster-whisper) documentation.

>Example: 
>
>In [noScribe](https://github.com/kaixxx/noScribe), **hotwords** from a separate file are passed into `transcribe()` via the `hotwords` parameter to implement the "disfluencies" on/off toggle in the noScribe application. In faster-whisper, these hotwords are inserted as extra prompt tokens before decoding, which slightly biases the decoder toward producing those words when the audio is uncertain (for example, very short filler sounds like "uh" or "ehm" that can be hard to distinguish from breathing or background noise). The original OpenAI [Whisper](https://github.com/openai/whisper) implementation does not provide a `hotwords` parameter under that name, so this behavior is specific to faster-whisper. The closest equivalent is providing a prompt via the `initial_prompt` parameter to bias decoding in a similar direction. 
>
>To try out how hesitation sounds could be accounted for in the `transcribe()` call below, add `hotwords = "Äh, das ist, es ist, ähm, nicht so einfach."` or `hotwords = "Uhm, okay, here's what I'm, like, thinking."` to the parameters (both taken from noScribe's [prompt.yml](https://github.com/kaixxx/noScribe/blob/main/prompt.yml)) and run the transcription with audio file "C" (German example) or "D" (English example), which you can select in Sect. 3.2.

In [None]:
segments, info = model.transcribe(
    audio_16k,
    language="de",  # language code of the audio (here: German, File-C); use None for automatic detection
    word_timestamps=False, 
    vad_filter=True, 
    beam_size=5
)

segments = list(segments)  # The transcription will actually run here

In faster-whisper **transcription results**, each text **segment** includes:
- start and end time, 
- recognized text,
- the underlying token IDs,
- and several scores that can be inspected if needed. These scores are model-internal confidence signals derived from the token probabilities during decoding.
- If word-level timestamps are enabled, each segment also contains a list of words with their own start and end times and (model-derived) probabilities.

In [None]:
print(segments) # Show results
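
Printing the raw list shows every field at once, which can be hard to read. A more compact view can be produced from the `start`, `end`, and `text` attributes described above. The sketch below uses made-up stand-in objects so that it runs on its own, and the helper name `format_segment` is hypothetical.

```python
from types import SimpleNamespace


def format_segment(seg):
    """Format one segment as '[start -> end] text' using its start, end, and text attributes."""
    return f"[{seg.start:7.2f}s -> {seg.end:7.2f}s] {seg.text.strip()}"


# Stand-in objects with made-up values, only to make the sketch runnable on its own
example_segments = [
    SimpleNamespace(start=0.0, end=2.5, text=" Hello and welcome."),
    SimpleNamespace(start=2.5, end=4.0, text=" Thank you."),
]
for seg in example_segments:
    print(format_segment(seg))
```

In the notebook, the same function could be applied to the real results with `for seg in segments: print(format_segment(seg))`.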

# 7. Run Speaker Diarization

Next, we run the **diarization model** on the audio file. This gives us a **timeline of who speaks when**, for example: *SPEAKER_00 speaks from 0–10 seconds, then SPEAKER_01 speaks from 10–15 seconds, and so on.* `min_speakers` and `max_speakers` set a lower and an upper bound on the number of speakers. By setting both to the same value (here 2), we constrain the output to exactly that **number of speaker labels**, even if the real audio contains fewer or more speakers.

In Sect. 4, `imageio_ffmpeg` provides an ffmpeg executable that we call directly to convert audio files. pyannote, however, loads audio files via its own internal decoder (`TorchCodec`; see [pyannote.audio](https://github.com/pyannote/pyannote-audio)). This decoder can fail on some machines even when ffmpeg itself works. To avoid this, we pass audio "from memory", meaning we first load the audio into Python (RAM) as a waveform (the raw audio samples as a numeric array) plus the sample rate (samples per second), and then pass these values directly to the diarization model. This makes the notebook more reliable on Windows and on JupyterHubs.

In [None]:
def load_for_pyannote(path):
    """Load audio into memory (RAM) and output waveform shape: (channels, samples)."""
    waveform, sr = torchaudio.load(path)
    return {"waveform": waveform, "sample_rate": sr}

# Run speaker diarization on the audio:
diarization_result = diarization(
    load_for_pyannote(audio_16k),
    min_speakers=2,
    max_speakers=2
)

In [None]:
print(diarization_result.speaker_diarization) # Show results

# 8. Merge Results

We merge ASR output with diarization results so that **each spoken segment (and optionally each word) is linked to a speaker label** (e.g., SPEAKER_00). If word-level timestamps are available, we assign speakers per word; otherwise we assign one speaker per segment. 

The code in this section illustrates how the transcription and diarization outputs can be **prepared manually** and shows one possible merging strategy. It **can be adapted** if your requirements differ.

## 8.1 ASR Result Formatting

Since the output of faster-whisper is stored in its own custom Python objects, we first convert it into a simple **Python dictionary**. We define a function `to_whisper_result` that extracts only the fields we need. This makes the transcript easier to inspect, save, and process in later steps. Each segment gets a start time, an end time, the transcribed text, and (if word-level timestamps were enabled) a list of words with their own timing information.

In [None]:
def to_whisper_result(segments):
    """Convert faster-whisper segment objects into a Python dictionary."""
    out = {"segments": []}
    for s in segments:
        item = {
            "start": float(s.start),
            "end": float(s.end),
            "text": s.text or "",
        }
        # Add word-level timestamps if they exist (word_timestamps=True)
        if getattr(s, "words", None):
            item["words"] = [
                {"word": w.word, "start": float(w.start), "end": float(w.end)}
                for w in s.words
                if w.start is not None and w.end is not None
            ]
        out["segments"].append(item)
    return out

Run conversion:

In [None]:
asr_result = to_whisper_result(segments)

In [None]:
print(asr_result) # Show results

## 8.2 Diarization Result Formatting

The pyannote diarization output is also stored in a custom object format. Before merging, we convert it into a simple **list of speaker time intervals**. Each entry contains:
- start time
- end time
- speaker label

This makes the overlap comparison with the transcript easier and avoids repeatedly iterating over the pyannote object.

In [None]:
turns = []
for turn, _, spk in diarization_result.speaker_diarization.itertracks(yield_label=True):
    turns.append((float(turn.start), float(turn.end), spk))

In [None]:
print(turns) # Show results
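
Once the turns are available as plain `(start, end, speaker)` tuples, simple summaries become easy to compute. As an illustration, the sketch below sums the speaking time per speaker. The helper name `speaking_time` and the example values are made up for this purpose.

```python
from collections import defaultdict


def speaking_time(turns):
    """Sum the duration (in seconds) of all turns per speaker label."""
    totals = defaultdict(float)
    for start, end, speaker in turns:
        totals[speaker] += end - start
    return dict(totals)


# Made-up turns in the same (start, end, speaker) format as `turns` above
example_turns = [
    (0.0, 10.0, "SPEAKER_00"),
    (10.0, 15.0, "SPEAKER_01"),
    (15.0, 18.5, "SPEAKER_00"),
]
print(speaking_time(example_turns))
```

A quick check like this can reveal implausible results, for instance a speaker label that covers only a fraction of a second.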

## 8.3 Align Results

We **combine the converted ASR output with the diarization output** in the `align` function. For each ASR segment (and for each word, if word-level timestamps are available), we:
- assign speaker labels: per word if word-level timestamps are available; and always a single main speaker for each ASR segment (based on total word duration or, if no words are available, based on overlap with the segment interval).
- compute an overlap flag (True when turns of two or more different speakers intersect the interval, whether they speak simultaneously or one after the other).

The alignment shown here is an example and can be adapted to different needs. For example, you could merge adjacent segments with the same speaker into longer turns, or change the decision rule (e.g., using total duration vs. word count).

>Important: Diarization segments and ASR segments are independent time partitions, so they rarely match perfectly. The merge therefore compares time intervals and links them based on temporal overlap.
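
The overlap rule can be checked by hand on a small made-up example: an ASR segment from 2.0 s to 6.0 s and two speaker turns, 0.0–3.5 s and 3.5–8.0 s. The segment shares 1.5 s with the first turn and 2.5 s with the second, so the second speaker is assigned. The sketch below uses the same overlap formula as the `align` function; the example values are illustrative only.

```python
def overlap_seconds(s, e, ts, te):
    """Overlap duration (in seconds) between [s, e] and [ts, te]; 0.0 if they do not intersect."""
    return max(0.0, min(e, te) - max(s, ts))


# Made-up example: one ASR segment and two diarization turns
segment = (2.0, 6.0)
example_turns = [(0.0, 3.5, "SPEAKER_00"), (3.5, 8.0, "SPEAKER_01")]

# Sum the overlap per speaker (a speaker may have several turns within one segment)
overlaps = {}
for ts, te, spk in example_turns:
    overlaps[spk] = overlaps.get(spk, 0.0) + overlap_seconds(*segment, ts, te)

print(overlaps)
print(max(overlaps, key=overlaps.get))
```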

In [None]:
def align(asr_result, turns):
    """
    Merge ASR output with diarization turns.

    We work with time intervals:
    - ASR segments/words: [start (s), end (e)]
    - diarization speaker turns: [turn_start (ts), turn_end (te)]
    """

    def overlap_seconds(s, e, ts, te):
        """Return overlap duration (in seconds) between [s, e] and [ts, te]."""
        return max(0.0, min(e, te) - max(s, ts))

    def best_speaker(s, e):
        """Return the speaker with the largest total overlap with [s, e]."""
        overlap_by_speaker = {}
        for ts, te, spk in turns:
            ov = overlap_seconds(s, e, ts, te)
            if ov > 0:
                overlap_by_speaker[spk] = overlap_by_speaker.get(spk, 0.0) + ov
        return max(overlap_by_speaker, key=overlap_by_speaker.get) if overlap_by_speaker else None

    def is_overlapped(s, e):
        """True if >= 2 different speakers overlap with [s, e]."""
        seen = set()
        for ts, te, spk in turns:
            if overlap_seconds(s, e, ts, te) > 0:
                seen.add(spk)
                if len(seen) >= 2:
                    return True
        return False

    # Output structure: a dict with a list of segments
    out = {"segments": []}

    # Loop over each ASR segment
    for seg in asr_result["segments"]:

        # 1) Basic info: start/end time and full text
        new_seg = {
            "start": seg["start"],
            "end": seg["end"],
            "text": seg["text"],
        }

        # 2) Segment-level overlap flag
        new_seg["overlap"] = is_overlapped(seg["start"], seg["end"])

        # 3) Speaker labels
        # ---- only needed if your ASR output contains word-level timestamps (seg["words"])
        #      (i.e., you want per-word speaker labels + per-word overlap flags and want
        #      to derive the segment speaker from summed word durations):
        words = seg.get("words", [])
        if words:
            new_words = []
            dur_by_speaker = {}  # total word duration per speaker within this ASR segment

            for w in words:
                # Speaker for this word based on maximum time overlap
                spk = best_speaker(w["start"], w["end"])

                # Copy word info and add speaker + overlap flag
                wd = dict(w)
                wd["speaker"] = spk
                wd["overlap"] = is_overlapped(w["start"], w["end"])
                new_words.append(wd)

                # Accumulate speaking time (word duration) per speaker
                if spk is not None:
                    dur_by_speaker[spk] = dur_by_speaker.get(spk, 0.0) + (w["end"] - w["start"])

            # Save the word list inside the segment
            new_seg["words"] = new_words

            # Segment-level main speaker: pick the one with the most total word duration
            new_seg["speaker"] = max(dur_by_speaker, key=dur_by_speaker.get) if dur_by_speaker else None
        # ---- else (no word-level timestamps): segment-only speaker assignment:
        else:
            # If no word-level timestamps exist, assign a speaker for the entire segment directly
            new_seg["speaker"] = best_speaker(seg["start"], seg["end"])

        out["segments"].append(new_seg)

    return out

Run aligner:

In [None]:
final_result = align(asr_result, turns)

In [None]:
print(final_result) # Show results
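
One adaptation mentioned in Sect. 8.3 is merging adjacent segments of the same speaker into longer turns. The following is a minimal sketch of that idea, assuming segments in the format returned by `align`. The helper name `merge_adjacent` is hypothetical, and word-level entries are dropped here for brevity.

```python
def merge_adjacent(segments):
    """Merge consecutive segments that share the same speaker label into longer turns."""
    merged = []
    for seg in segments:
        if merged and merged[-1]["speaker"] == seg["speaker"]:
            last = merged[-1]
            last["end"] = seg["end"]                                # extend the current turn
            last["text"] = f'{last["text"]} {seg["text"].strip()}'  # append the text
            last["overlap"] = last["overlap"] or seg["overlap"]     # keep any overlap flag
        else:
            merged.append({
                "start": seg["start"],
                "end": seg["end"],
                "text": seg["text"].strip(),
                "speaker": seg["speaker"],
                "overlap": seg["overlap"],
            })
    return {"segments": merged}


# Usage with the result from above:
# merged_result = merge_adjacent(final_result["segments"])
```

The return value has the same `{"segments": [...]}` shape as `final_result`, so it can be saved with the same code.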

# 9. Save Transcript

Finally, we save a **readable transcript as a text file**. For each segment, we write the speaker label, the transcribed text, and the segment start time. If a segment overlaps with another speaker, we add a tag. This produces a *SPEAKER: text [OVERLAP] [time]* format.

You **can adapt the format** depending on what you need for analysis in Python or external tools.

In [None]:
# Save as ...
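output_folder = "../Results/"
os.makedirs(output_folder, exist_ok=True)  # create the folder if it does not exist yet
txt_path = os.path.join(output_folder, f"{file_name}.txt")

# Save
with open(txt_path, "w", encoding="utf-8") as f:
    for seg in final_result["segments"]:
        speaker = seg.get("speaker") or "UNKNOWN"  # None if no diarization turn overlaps the segment
        text = seg["text"].strip()
        # Start time as H:MM:SS.mmm
        ms = round(seg["start"] * 1000)
        start = f"{timedelta(seconds=ms // 1000)}.{ms % 1000:03d}"
        # Tag only segments in which more than one speaker was detected
        overlap_flag = " [OVERLAP]" if seg.get("overlap") else ""
        f.write(f"{speaker}: {text}{overlap_flag} [{start}]\n\n")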

print("Saved transcript to", txt_path)
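
As one example of an alternative format, the segments could also be written to a CSV file for analysis in spreadsheet or statistics software. This is a sketch under the assumption that the segments have the keys produced by `align`. The helper name `save_segments_csv` and the column order are illustrative choices.

```python
import csv


def save_segments_csv(segments, path):
    """Write one row per segment with start, end, speaker, overlap flag, and text."""
    fieldnames = ["start", "end", "speaker", "overlap", "text"]
    with open(path, "w", encoding="utf-8", newline="") as f:
        # extrasaction="ignore" skips keys that are not columns (e.g., "words")
        writer = csv.DictWriter(f, fieldnames=fieldnames, extrasaction="ignore")
        writer.writeheader()
        for seg in segments:
            writer.writerow({**seg, "text": seg["text"].strip()})


# Usage with the result from above (the file name is an example):
# save_segments_csv(final_result["segments"], "../Results/transcript.csv")
```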

If you only run the Whisper transcription (without diarization), you can save the result directly like this:

In [None]:
# Save as ...
#output_folder = "../Results/"
#os.makedirs(output_folder, exist_ok=True)
#txt_path = os.path.join(output_folder, f"{file_name}.txt")

# Save
#with open(txt_path, "w", encoding="utf-8") as f:
#    for s in segments:
#        ms = round(float(s.start) * 1000)
#        start = f"{timedelta(seconds=ms // 1000)}.{ms % 1000:03d}"  # H:MM:SS.mmm
#        f.write(f"[{start}] {s.text.strip()}\n")

#print("Saved transcript to", txt_path)