# Transcribe a meeting

## Speaker diarization

Speaker diarization (or diarisation) is the task of taking an unlabelled audio input and predicting “who spoke when”.



In [None]:
# install pyannote.audio, which provides the pre-trained speaker diarization model
! pip install pyannote.audio

In [None]:
from pyannote.audio import Pipeline

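# use_auth_token=True reads the locally stored Hugging Face token; the model is
# gated, so its user conditions must be accepted on the Hub beforehand
diarization_pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization@2.1", use_auth_token=True
)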

Load a sample from the LibriSpeech ASR dataset in which recordings of two different speakers have been concatenated into a single audio file.

In [None]:
from datasets import load_dataset

concatenated_librispeech = load_dataset(
    "sanchit-gandhi/concatenated_librispeech", split="train", streaming=True
)
sample = next(iter(concatenated_librispeech))

In [None]:
from IPython.display import Audio

Audio(sample["audio"]["array"], rate=sample["audio"]["sampling_rate"])

Pass this audio to the diarization model to get the start and end times of each speaker's turn.

In [None]:
import torch

input_tensor = torch.from_numpy(sample["audio"]["array"][None, :]).float()
outputs = diarization_pipeline(
    {"waveform": input_tensor, "sample_rate": sample["audio"]["sampling_rate"]}
)

outputs.for_json()["content"]
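
The call above returns one entry per speaker turn. As a minimal sketch of how such a list might be consumed, the snippet below totals the speaking time per speaker. It assumes each entry has the shape `{"segment": {"start": ..., "end": ...}, "label": ...}`; the values in `example_segments` and the helper name `speaking_time_per_speaker` are made up for illustration and are not real model output.

```python
from collections import defaultdict

# Hypothetical diarization output (illustrative values, not real model output).
example_segments = [
    {"segment": {"start": 0.5, "end": 14.5}, "label": "SPEAKER_01"},
    {"segment": {"start": 15.5, "end": 21.5}, "label": "SPEAKER_00"},
    {"segment": {"start": 22.0, "end": 25.0}, "label": "SPEAKER_01"},
]


def speaking_time_per_speaker(segments):
    """Sum the duration (in seconds) of all segments attributed to each speaker."""
    totals = defaultdict(float)
    for entry in segments:
        span = entry["segment"]
        totals[entry["label"]] += span["end"] - span["start"]
    return dict(totals)


print(speaking_time_per_speaker(example_segments))
```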

## Speech transcription

Use the Whisper model as our speech transcription system.

In [None]:
from transformers import pipeline

asr_pipeline = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-base",
)

Get the transcription for our sample audio, also returning segment-level timestamps so that we know the start and end times of each segment.

In [None]:
asr_pipeline(
    sample["audio"].copy(),
    generate_kwargs={"max_new_tokens": 256},
    return_timestamps=True,
)
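
With `return_timestamps=True`, the pipeline is expected to return a dictionary holding the full `"text"` plus a `"chunks"` list, where each chunk carries a `(start, end)` `"timestamp"` tuple and its own `"text"`. The sketch below walks such a structure. `example_result` is a hand-written stand-in rather than real Whisper output, and `chunk_durations` is a hypothetical helper. It also guards against an end time of `None`, which may appear on a final chunk when the audio is cut off mid-sentence.

```python
# Hand-written stand-in for the ASR output (not real Whisper output).
example_result = {
    "text": " Hello there. How are you?",
    "chunks": [
        {"timestamp": (0.0, 1.5), "text": " Hello there."},
        {"timestamp": (1.5, None), "text": " How are you?"},
    ],
}


def chunk_durations(result):
    """Return (text, duration) pairs; duration is None when the end time is missing."""
    durations = []
    for chunk in result["chunks"]:
        start, end = chunk["timestamp"]
        duration = None if end is None else end - start
        durations.append((chunk["text"].strip(), duration))
    return durations


for text, duration in chunk_durations(example_result):
    print(text, duration)
```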

## Speechbox

Speechbox aligns the diarization and transcription outputs by finding, for each speaker segment, the transcription timestamp with the smallest absolute distance to the segment boundary.
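
As a rough illustration of this idea (a simplified sketch, not Speechbox's actual implementation), the snippet below assigns ASR chunks to speaker turns: for each diarization segment it picks the chunk whose end timestamp is closest to the segment's end time, then attributes every chunk up to that one to the speaker. The function name `align_chunks_to_speakers` and all input values are hypothetical.

```python
def align_chunks_to_speakers(speaker_segments, asr_chunks):
    """Assign ASR chunks to speakers by nearest end timestamp.

    speaker_segments: list of (speaker, start, end) tuples, in time order.
    asr_chunks: list of {"timestamp": (start, end), "text": str}, in time order.
    """
    aligned = []
    next_chunk = 0  # index of the first chunk not yet assigned
    for speaker, _, segment_end in speaker_segments:
        remaining = asr_chunks[next_chunk:]
        if not remaining:
            break
        # Chunk whose end time has the smallest absolute distance to the segment end.
        closest = min(
            range(len(remaining)),
            key=lambda i: abs(remaining[i]["timestamp"][1] - segment_end),
        )
        group = remaining[: closest + 1]
        aligned.append(
            {
                "speaker": speaker,
                "timestamp": (group[0]["timestamp"][0], group[-1]["timestamp"][1]),
                "text": "".join(chunk["text"] for chunk in group),
            }
        )
        next_chunk += closest + 1
    return aligned


# Illustrative inputs (made up, not real model output).
speaker_segments = [("SPEAKER_01", 0.5, 14.5), ("SPEAKER_00", 15.4, 21.4)]
asr_chunks = [
    {"timestamp": (0.0, 7.0), "text": " First sentence."},
    {"timestamp": (7.0, 15.0), "text": " Second sentence."},
    {"timestamp": (15.0, 21.0), "text": " Third sentence."},
]

for turn in align_chunks_to_speakers(speaker_segments, asr_chunks):
    print(turn)
```

The sketch assumes every chunk has a numeric end time; a real implementation would also need to handle missing timestamps and overlapping speech.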

In [None]:
! pip install git+https://github.com/huggingface/speechbox

Instantiate the combined diarization and transcription pipeline by passing the diarization model and the ASR model to the `ASRDiarizationPipeline` class.

In [None]:
from speechbox import ASRDiarizationPipeline

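# note: this rebinds the name `pipeline`, which previously referred to the
# function imported from transformers
pipeline = ASRDiarizationPipeline(
    asr_pipeline=asr_pipeline, diarization_pipeline=diarization_pipeline
)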

In [None]:
pipeline(sample["audio"].copy())

Format the speaker labels, timestamps and text into a readable transcript.

In [None]:
# converts a tuple of timestamps to a string, rounded to a set number of decimal places
def tuple_to_string(start_end_tuple, ndigits=1):
    return str((round(start_end_tuple[0], ndigits), round(start_end_tuple[1], ndigits)))

# combines the speaker id, timestamp and text information onto one line, and splits each speaker onto their own line for ease of reading
def format_as_transcription(raw_segments):
    return "\n\n".join(
        [
            chunk["speaker"] + " " + tuple_to_string(chunk["timestamp"]) + chunk["text"]
            for chunk in raw_segments
        ]
    )

In [None]:
outputs = pipeline(sample["audio"].copy())

# print the string so that the blank lines between speakers are rendered
print(format_as_transcription(outputs))
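
To preview the layout without running either model, the formatting helpers can be applied to hand-written segments. The snippet below repeats the two helpers so that it runs on its own; `example_segments` is hypothetical data shaped like the combined pipeline's output (a `"speaker"`, a `"timestamp"` tuple and a `"text"` that starts with a space), not a real transcription.

```python
def tuple_to_string(start_end_tuple, ndigits=1):
    # Round both timestamps and render them as "(start, end)".
    return str((round(start_end_tuple[0], ndigits), round(start_end_tuple[1], ndigits)))


def format_as_transcription(raw_segments):
    # One line per speaker turn, separated by a blank line.
    return "\n\n".join(
        chunk["speaker"] + " " + tuple_to_string(chunk["timestamp"]) + chunk["text"]
        for chunk in raw_segments
    )


# Hypothetical segments (illustrative values, not real model output).
example_segments = [
    {"speaker": "SPEAKER_01", "timestamp": (0.0, 15.48), "text": " First speaker's words."},
    {"speaker": "SPEAKER_00", "timestamp": (15.48, 21.28), "text": " Second speaker's words."},
]

print(format_as_transcription(example_segments))
# SPEAKER_01 (0.0, 15.5) First speaker's words.
#
# SPEAKER_00 (15.5, 21.3) Second speaker's words.
```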