<img src="../Images/DSC_Logo.png" style="width: 400px;">

This notebook applies [WhisperX](https://github.com/m-bain/whisperX) with installation procedures based on the official GitHub repository (accessed July 29, 2025). See also the faster-whisper-pyannote notebook for additional general information on setup and for comparison.

# 1. One-Time Setup: Install Software & Hugging Face Account

## 1.1 FFmpeg

WhisperX uses FFmpeg to load and preprocess audio (e.g., decode .mp3 files and resample to 16 kHz mono). FFmpeg is required even if the input audio is already in .wav format, because WhisperX always calls it internally.

> 16 kHz mono? The audio is stored with one channel (not left/right stereo) and sampled 16,000 times per second.
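
To make the conversion concrete, here is a minimal sketch of the kind of FFmpeg call that decodes a file to 16 kHz mono. The helper names `build_ffmpeg_command` and `samples_for` and the file name `example.mp3` are made up for illustration; the exact command WhisperX issues internally may differ. The command is only built and printed here, not executed.

```python
def build_ffmpeg_command(path, sample_rate=16000):
    """Build an FFmpeg call that decodes `path` to raw 16-bit mono PCM on stdout."""
    return [
        "ffmpeg", "-nostdin",
        "-i", path,               # input file (any format FFmpeg can read)
        "-f", "s16le",            # raw signed 16-bit little-endian output
        "-ac", "1",               # one channel (mono)
        "-acodec", "pcm_s16le",
        "-ar", str(sample_rate),  # resample to the target rate
        "-",                      # write to stdout instead of a file
    ]


def samples_for(duration_seconds, sample_rate=16000):
    """Number of samples needed to store a mono clip of the given duration."""
    return int(duration_seconds * sample_rate)


cmd = build_ffmpeg_command("example.mp3")  # hypothetical file name
print(" ".join(cmd))
print("Samples in a 30 s clip:", samples_for(30))
```

At 16,000 samples per second, a 30-second mono clip therefore holds 480,000 samples.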

FFmpeg can be installed in Jupyter with conda:

In [None]:
!conda install -c conda-forge ffmpeg -y

Alternatively, install FFmpeg system-wide and add it to your system's PATH so that it is accessible from anywhere.

Once installed, test with:

In [None]:
!ffmpeg -version
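
The same check can be done from Python, which is handy if you want the notebook to fail early with a clear message. This is a small sketch using only the standard library; the function name `ffmpeg_available` is our own.

```python
import shutil


def ffmpeg_available(executable="ffmpeg"):
    """Return True if the executable can be found on the current PATH."""
    return shutil.which(executable) is not None


ffmpeg_found = ffmpeg_available()
print("FFmpeg on PATH:", ffmpeg_found)
```

If this prints `False` although FFmpeg is installed, the kernel's PATH probably differs from the PATH of your terminal.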

## 1.2 WhisperX

The easiest way to install WhisperX, according to its GitHub page, is via PyPI:

In [None]:
!pip install whisperx

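In [None]:
!pip install "pyannote.audio==3.4.0" # pin pyannote.audio to the earlier 3.4.0 release

The first command installs WhisperX along with its core dependencies, including pyannote.audio. The second command then pins pyannote.audio to the earlier 3.4.0 release.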

## 1.3 Hugging Face Account

If you don't have a [Hugging Face account](https://huggingface.co/), you need to create one.

You will need an access token later on for speaker diarization. To create one, click on your profile icon, select "Access Tokens", and create a new access token.

>Important: Don't share your token.

In addition, pyannote requires you to agree to share your contact information to access its models. To do so, go to the [pyannote speaker-diarization model](https://huggingface.co/pyannote/speaker-diarization-3.1) page, enter your information, and click on "Agree and access repository". Do the same for the [pyannote segmentation model](https://huggingface.co/pyannote/segmentation-3.0).

In [None]:
own_token = "ENTER_YOUR_TOKEN"
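
Because the token should not be shared, you may prefer not to type it into the notebook at all. The sketch below reads it from an environment variable instead. The variable name `HF_TOKEN` and the helper `read_hf_token` are assumptions for this example; set the variable in your shell before starting Jupyter.

```python
import os


def read_hf_token(env_var="HF_TOKEN", placeholder="ENTER_YOUR_TOKEN"):
    """Return the token stored in `env_var`, or None if it is missing or still a placeholder."""
    token = os.environ.get(env_var, "").strip()
    if not token or token == placeholder:
        return None
    return token


token_from_env = read_hf_token()
# Print only whether a token was found, never the token itself
print("Token found in environment:", token_from_env is not None)
```

If a token is found, it could be assigned to `own_token` in place of the hard-coded string.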

# 2. Import Packages

In [None]:
import os
from datetime import timedelta

In [None]:
import whisperx

You can check the installed PyTorch version and whether your environment has access to a GPU:

In [None]:
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())

# 3. Setup

## 3.1 Runtime Setup

Automatically set device and compute type depending on hardware availability:

In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"

if device == "cuda":
    compute_type = "float16"  # Faster and more memory efficient on GPU
    batch_size = 16           # Adjust based on GPU memory
else:
    compute_type = "int8"     # Quantized weights: more efficient on CPU
    batch_size = 1            # Keep at 1 on CPU (larger values don’t help and may cause memory issues)

>Running WhisperX on CPU should not noticeably reduce transcription or diarization accuracy (int8 quantization may cause minor differences), but it significantly increases processing time.

## 3.2 Select Audio File

Provide the relative path to one audio file. The example paths below go up one folder and then into, e.g., "Data/buffy/". Both .wav and .mp3 files work because WhisperX uses FFmpeg under the hood, which reads many common audio formats.

In [None]:
audio_file = "../Data/buffy/shortened_Buffy_Seas01-Epis01.en.wav"
#audio_file = "../Data/moon-landing/CA138clip.mp3"
#audio_file = "../Data/qualitative-interview-de/DE_example_2.mp3"
#audio_file = "../Data/qualitative-interview-en/EN_example_1.mp3"
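
A wrong relative path is a common source of errors at this point, so it can help to check the file before loading any model. The following sketch uses only the standard library; the helper names and the set of accepted extensions are choices made for this example (FFmpeg itself reads many more formats).

```python
from pathlib import Path

SUPPORTED_SUFFIXES = {".wav", ".mp3"}  # the two formats used in this notebook


def has_supported_suffix(path):
    """Check the file extension, ignoring upper/lower case."""
    return Path(path).suffix.lower() in SUPPORTED_SUFFIXES


def check_audio_path(path):
    """Return 'ok' or a short description of what is wrong with the path."""
    p = Path(path)
    if not p.is_file():
        return f"not found: {p.resolve()}"
    if not has_supported_suffix(p):
        return f"unexpected extension: {p.suffix}"
    return "ok"


print(check_audio_path("../Data/buffy/shortened_Buffy_Seas01-Epis01.en.wav"))
```

The resolved absolute path in the "not found" message shows where Python actually looked, which depends on the notebook's working directory.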

# 4. Load Whisper Model

If you know the language, set it explicitly. This reduces errors and makes decoding faster. The "language" variable is passed to WhisperX when the model is loaded, so the model knows in advance which language to expect instead of trying to detect it automatically.

In [None]:
language = "en"

Load the model with the given device ("cpu" or "cuda") and precision type ("float16", "int8", etc.). Refer to [Whisper](https://github.com/openai/whisper) for the available model sizes, or use a custom (e.g. fine-tuned) model here.

In [None]:
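# Model size: "tiny" is the fastest; larger models (e.g. "base", "small", "medium") are slower but usually more accurate
model = whisperx.load_model("tiny",
                            device,
                            compute_type=compute_type,
                            language=language)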

# 5. Automatic Speech Recognition (ASR)

Transcription happens after Voice Activity Detection (VAD) and segmentation. The detected speech segments are grouped into chunks of roughly 30 s and passed to the Whisper model (in batches on GPU for faster transcription; a higher batch size means more parallel processing, but also more memory use). For details on available decoding and alignment options, refer to the [WhisperX](https://github.com/m-bain/whisperX) documentation.
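
The grouping of speech regions into chunks of about 30 s can be pictured with a small greedy merge. This is a simplified illustration of the idea only, not WhisperX's actual implementation; the function name and the example regions are invented.

```python
def merge_into_chunks(speech_regions, max_chunk_seconds=30.0):
    """Greedily merge (start, end) speech regions into chunks of at most max_chunk_seconds."""
    chunks = []
    chunk_start = chunk_end = None
    for start, end in speech_regions:
        if chunk_start is None:
            # First region opens the first chunk
            chunk_start, chunk_end = start, end
        elif end - chunk_start <= max_chunk_seconds:
            # Region still fits: extend the current chunk
            chunk_end = end
        else:
            # Region would exceed the limit: close the chunk and start a new one
            chunks.append((chunk_start, chunk_end))
            chunk_start, chunk_end = start, end
    if chunk_start is not None:
        chunks.append((chunk_start, chunk_end))
    return chunks


# Invented VAD output: (start, end) of detected speech in seconds
vad_regions = [(0.0, 8.0), (9.0, 20.0), (21.0, 29.5), (31.0, 40.0), (55.0, 70.0)]
chunks = merge_into_chunks(vad_regions)
print(chunks)
```

Each resulting chunk is one input to the model, which is why a batch size above 1 lets several chunks be transcribed in parallel.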

WhisperX outputs a list of segments, where each segment is a dictionary containing the recognized text along with its start and end times in seconds.

In [None]:
audio = whisperx.load_audio(audio_file)
asr_result = model.transcribe(audio, batch_size=batch_size)

In [None]:
print(asr_result)
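
The raw printout is hard to read. The sketch below shows one way to walk through the segment list, using a small hand-written stand-in (`example_result`, with invented text and times) that has the same structure as described above, so it runs without a model.

```python
# Hand-written stand-in with the same structure as the ASR output (values are invented)
example_result = {
    "segments": [
        {"start": 0.5, "end": 3.2, "text": " Hello there."},
        {"start": 3.9, "end": 7.0, "text": " How are you today?"},
    ],
    "language": "en",
}


def summarize_segments(result):
    """Return one line per segment: start time, duration, and text."""
    lines = []
    for seg in result["segments"]:
        duration = seg["end"] - seg["start"]
        lines.append(f'{seg["start"]:7.2f}s  ({duration:.2f}s)  {seg["text"].strip()}')
    return lines


summary = summarize_segments(example_result)
for line in summary:
    print(line)
```

Replacing `example_result` with `asr_result` applies the same summary to the real transcription.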

# 6. Word-level Forced Alignment

WhisperX realigns the text with forced alignment: a separate phoneme-level model (wav2vec2-based by default) is run on the audio, and the transcript is matched to its phoneme probabilities, which gives more accurate word timings than Whisper's own timestamps. After whisperx.align(), the result is a dictionary whose segments now include word-level alignments. Each segment still has text, start, and end, and gains a words list in which every entry has the word string plus precise start/end times and an alignment score. The score is useful for spotting poorly aligned or uncertain timings; it is not a metric for ASR accuracy.

In [None]:
model_a, metadata = whisperx.load_align_model(language_code=asr_result["language"], device=device)
aligned_result = whisperx.align(asr_result["segments"], model_a, metadata, audio, device, return_char_alignments=False)

In [None]:
print(aligned_result) # Now contains accurate word-level start/end timestamps
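
The alignment score can be used to flag words whose timings deserve a second look. The following sketch works on a hand-written stand-in (`example_aligned`, invented values). It assumes that some word entries may come without timing or score, which is why it uses `.get()`; the threshold of 0.5 is an arbitrary choice for this example.

```python
# Hand-written stand-in for an aligned result (values are invented)
example_aligned = {
    "segments": [
        {
            "start": 0.5, "end": 3.2, "text": " Hello there in 1997.",
            "words": [
                {"word": "Hello", "start": 0.5, "end": 0.9, "score": 0.93},
                {"word": "there", "start": 1.0, "end": 1.4, "score": 0.41},
                {"word": "in", "start": 1.5, "end": 1.6, "score": 0.88},
                {"word": "1997."},  # assumed case: entry without timing or score
            ],
        }
    ]
}


def flag_uncertain_words(result, threshold=0.5):
    """Collect (word, score) pairs whose score is missing or below the threshold."""
    flagged = []
    for seg in result["segments"]:
        for w in seg.get("words", []):
            score = w.get("score")
            if score is None or score < threshold:
                flagged.append((w["word"], score))
    return flagged


uncertain = flag_uncertain_words(example_aligned)
print(uncertain)
```

A flagged word is not necessarily transcribed incorrectly; only its timing is uncertain.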

# 7. Speaker Diarization

## 7.1 Load Model

The diarization pipeline is loaded through WhisperX, which wraps the pyannote model internally, so we do not call pyannote directly.

In [None]:
diarize_model = whisperx.diarize.DiarizationPipeline(use_auth_token=own_token, device=device)

## 7.2 Perform Diarization

We run the diarization model on the audio file. This gives us a timeline of speaker activity, for example: SPEAKER_00 speaks from 0–10 seconds, then SPEAKER_01 speaks from 10–15 seconds, and so on. By setting both min and max speakers to 2, we force the model to split the conversation into exactly two different speakers.

In [None]:
# Store the speaker timeline under the name used in the following cells
diarization_result = diarize_model(audio,
                                   min_speakers=2,
                                   max_speakers=2)

In [None]:
print(diarization_result)
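
A quick sanity check on a diarization timeline is the total speaking time per speaker. The sketch below assumes the speaker turns have been converted to plain dictionaries with `speaker`, `start`, and `end` keys; the real diarization output may be a different, table-like object that needs converting first. The example turns follow the timeline described above, plus one invented extra turn.

```python
from collections import defaultdict

# Invented speaker turns in seconds
example_turns = [
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 10.0},
    {"speaker": "SPEAKER_01", "start": 10.0, "end": 15.0},
    {"speaker": "SPEAKER_00", "start": 15.0, "end": 18.5},
]


def speaking_time(turns):
    """Sum the duration of all turns per speaker label."""
    totals = defaultdict(float)
    for turn in turns:
        totals[turn["speaker"]] += turn["end"] - turn["start"]
    return dict(totals)


totals = speaking_time(example_turns)
print(totals)
```

If one speaker ends up with almost no speaking time, the two-speaker assumption may not fit the recording.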

# 8. Merge Results

We merge ASR output with diarization so that each spoken segment (or each word) is linked to a speaker.

In [None]:
final_result = whisperx.assign_word_speakers(diarization_result, aligned_result)

In [None]:
print(final_result) # segments are now assigned speaker IDs
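
To build intuition for what the merge does, the sketch below assigns each word to the speaker turn it overlaps most in time. This is a simplified illustration of overlap-based assignment under our own assumptions, not the code WhisperX runs; all names and values are invented.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speaker(word, turns):
    """Return the speaker whose turn overlaps the word most, or None if nothing overlaps."""
    best_speaker, best_overlap = None, 0.0
    for turn in turns:
        shared = overlap(word["start"], word["end"], turn["start"], turn["end"])
        if shared > best_overlap:
            best_speaker, best_overlap = turn["speaker"], shared
    return best_speaker


example_turns = [
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 10.0},
    {"speaker": "SPEAKER_01", "start": 10.0, "end": 15.0},
]
example_words = [
    {"word": "Hi", "start": 9.2, "end": 9.8},      # fully inside the first turn
    {"word": "yes", "start": 9.9, "end": 10.6},    # straddles the boundary, mostly in the second turn
    {"word": "later", "start": 20.0, "end": 20.4}, # outside every turn
]

labels = [assign_speaker(w, example_turns) for w in example_words]
print(labels)
```

Words near a speaker change are the ones most likely to receive the wrong label, so those boundaries are worth checking in the final transcript.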

# 9. Save Final Result

Finally, we save a readable transcript as a text file: for each segment, we write out the speaker label, the transcribed text, and the segment’s start time. This produces a simple "Speaker: text [time]" format.

In [None]:
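# Output location and file name
method = "WhisperX"
file = "Buffy"
output_folder = "../Results/"
output_name = file + "_" + method
txt_path = os.path.join(output_folder, f"{output_name}.txt")

# Create the output folder if it does not exist yet
os.makedirs(output_folder, exist_ok=True)

# Write one "Speaker: text [time]" entry per segment
with open(txt_path, "w", encoding="utf-8") as f:
    for seg in final_result["segments"]:
        speaker = seg.get("speaker", "UNKNOWN")  # placeholder if no speaker was assigned
        text = seg["text"].strip()
        # Format the start time as H:MM:SS.mmm, also when it is a whole number of seconds
        td = timedelta(seconds=seg["start"])
        start = str(td)[:-3] if td.microseconds else str(td) + ".000"
        f.write(f"{speaker}: {text} [{start}]\n\n")

print("Saved transcript to", txt_path)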
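
For a more readable transcript, consecutive segments from the same speaker can be merged into one turn before saving. This is an optional sketch on invented example data; the helper name `merge_speaker_turns` and the "UNKNOWN" placeholder are our own choices.

```python
def merge_speaker_turns(segments):
    """Join consecutive segments that share a speaker into a single turn."""
    merged = []
    for seg in segments:
        speaker = seg.get("speaker", "UNKNOWN")  # placeholder if no speaker was assigned
        text = seg["text"].strip()
        if merged and merged[-1]["speaker"] == speaker:
            # Same speaker continues: append the text and extend the end time
            merged[-1]["text"] += " " + text
            merged[-1]["end"] = seg["end"]
        else:
            merged.append({"speaker": speaker, "start": seg["start"],
                           "end": seg["end"], "text": text})
    return merged


# Invented segments in the shape of the merged result
example_segments = [
    {"speaker": "SPEAKER_00", "start": 0.5, "end": 3.2, "text": " Hello there."},
    {"speaker": "SPEAKER_00", "start": 3.9, "end": 7.0, "text": " How are you?"},
    {"speaker": "SPEAKER_01", "start": 7.4, "end": 9.0, "text": " Fine, thanks."},
    {"start": 9.5, "end": 10.0, "text": " Hm."},  # no speaker assigned
]

turns = merge_speaker_turns(example_segments)
for turn in turns:
    print(f'{turn["speaker"]}: {turn["text"]}')
```

Passing the merged turns to the saving loop instead of the raw segments yields one transcript entry per speaker turn.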