## Diarization for Swiss German meetings

Loosely based on [this tutorial](https://huggingface.co/learn/audio-course/en/chapter7/transcribe-meeting), using [this version of Whisper small](https://huggingface.co/ss0ffii/whisper-small-german-swiss).

In [68]:
%%capture
import locale
locale.getpreferredencoding = lambda: "UTF-8"
!pip install git+https://github.com/huggingface/transformers.git
!pip install --upgrade pyannote.audio
!pip install speechbrain

In [70]:
# @title Function definitions, run once
import datetime
from typing import Any

def canonicalize_whisper(whisper_out) -> list[dict[str, Any]]:
  result = []
  time_stamp_reset = 0
  last_time_stamp = 0
  for chunk in whisper_out["chunks"]:
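    time_stamp = chunk["timestamp"]
    # Whisper may leave the end time-stamp of the last chunk empty; fall back to the start
    if time_stamp[1] is None:
      time_stamp = (time_stamp[0], time_stamp[0])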
    text = chunk["text"]
    if time_stamp[0] < last_time_stamp:
      if time_stamp[0] == 0:
        time_stamp_reset = last_time_stamp
      time_stamp = (time_stamp[0]+time_stamp_reset, time_stamp[1]+time_stamp_reset)
    result.append({
        "start": time_stamp[0],
        "end": time_stamp[1],
        "text": text
    })
    last_time_stamp = time_stamp[1]
  return result

def match_diarization_to_whisper_text(whisper_output, diarization_output):
  # Convert whisper time-stamps to absolute seconds from the start of the audio
  canonical_whisper = canonicalize_whisper(whisper_output)

  # iterate over speaker segments
  result = []
  for segment, _ in diarization_output.itertracks():
    segment_start = segment.start
    segment_end = segment.end
    segment_label = diarization_output.get_labels(segment)
    text = ""
    # Consume every whisper chunk that starts before this speaker segment ends.
    # Checking the list first avoids popping from an empty list and prevents
    # the last chunk from being assigned to several speaker segments.
    while canonical_whisper and canonical_whisper[0]["start"] < segment_end:
      text += canonical_whisper.pop(0)["text"]
    # Drop segment if no text got assigned
    if not text:
      continue
    result.append({
        "speaker" : segment_label.pop(),
        "text" : text,
        "start" : segment_start,
        "end": segment_end
    })
  return result

def pretty_print_time(time_in_sec):
    # Minutes must wrap at 60, otherwise 3700 s would print as 01h:61m:40s
    return f'{int(time_in_sec)//3600:02d}h:{(int(time_in_sec)%3600)//60:02d}m:{int(time_in_sec)%60:02d}s'

def pretty_print_diary(diarized_audio):
  for chunk in diarized_audio:
    chunk_start = chunk["start"]
    print(f'{pretty_print_time(chunk_start)} <{chunk["speaker"]}>: {chunk["text"]}')
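
The time-stamp handling above can be checked without running any model. The following is a minimal sketch, assuming (as the offset logic in `canonicalize_whisper` implies) that Whisper restarts its time-stamps at 0 when a new window begins. The function name `absolute_chunks` and the toy chunks are hypothetical and only mirror the notebook's logic on hand-written data.

```python
from typing import Any

def absolute_chunks(chunks: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Shift window-relative time-stamps so they increase over the whole file."""
    result = []
    offset = 0.0    # absolute time at which the current window started
    last_end = 0.0  # absolute end time of the previous chunk
    for chunk in chunks:
        start, end = chunk["timestamp"]
        if start < last_end:      # time-stamps jumped backwards
            if start == 0:        # a new window began: remember where it starts
                offset = last_end
            start, end = start + offset, end + offset
        result.append({"start": start, "end": end, "text": chunk["text"]})
        last_end = end
    return result

# Hand-written stand-in for whisper_output["chunks"] (hypothetical data)
toy_chunks = [
    {"timestamp": (0.0, 12.0), "text": " first"},
    {"timestamp": (12.0, 28.0), "text": " second"},
    {"timestamp": (0.0, 9.0), "text": " third"},   # time-stamps restart here
    {"timestamp": (9.0, 15.0), "text": " fourth"},
]

absolute = absolute_chunks(toy_chunks)
for chunk in absolute:
    print(chunk["start"], chunk["end"], chunk["text"])
```

In this toy input the third chunk restarts at 0, so it and the fourth chunk are shifted by 28 seconds, the end of the second chunk.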


In [69]:
# @title You will need a Hugging Face token to run the models
from huggingface_hub import login
login()

Set up the models used for transcription and speaker detection.

The available Whisper models are:
* **swiss_german**: fast model fine-tuned for Swiss German; not always precise.
* **base**: generic model with a good trade-off between speed and accuracy; not fine-tuned for Swiss German.
* **large**: slowest, but should be the most accurate.

In [71]:
%%capture
# Set up the selected Whisper model for transcription and the pyannote pipeline for speaker diarization, both running on GPU
from transformers import pipeline
from pyannote.audio import Pipeline
import torch

whisper_type = {
    'swiss_german': 'ss0ffii/whisper-small-german-swiss',
    'base': 'openai/whisper-base',
    'large': 'openai/whisper-large-v3',
}
whisper_model = "swiss_german" # @param ["swiss_german", "base", "large"]
whisper_path = whisper_type[whisper_model]
whisper_pipeline = pipeline("automatic-speech-recognition", model=whisper_path, device=0)
diarization_pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1")
diarization_pipeline.to(torch.device("cuda"))

In [72]:
# @title Upload the files that you would like to process

from google.colab import files

track = files.upload()

Saving Pristini_deutschi_frau.mp3 to Pristini_deutschi_frau (2).mp3


In [74]:
# @title You can play the audio files using this cell
from IPython.display import Audio, display

for k in track:
  sound_file = k
  print(f'Play {sound_file}')
  display(Audio(sound_file, autoplay=True))

Play Pristini_deutschi_frau (2).mp3


In [75]:
# @title Run the diarization pipeline

num_speakers = 1 # @param {type:"integer"}
output = {}
for k in track:
  whisper_output = whisper_pipeline(f'/content/{k}', return_timestamps=True)
  # Pass the expected number of speakers chosen in the form field above
  diarization_output = diarization_pipeline(f'/content/{k}', num_speakers=num_speakers)
  diary = match_diarization_to_whisper_text(whisper_output, diarization_output)
  output[k] = {
      "whisper_output" : whisper_output,
      "diarization_output" : diarization_output,
      "combined" : diary
  }

for k in output:
  # print('-----------------------------')
  print(f'    File: {k}               ')
  pretty_print_diary(output[k]["combined"])
  # print('-----------------------------')

    File: Pristini_deutschi_frau (2).mp3               
00h:00m:01s <SPEAKER_00>: Hallo, mein Name isch Christina, Morn isch Mittwoch und am Mittwoch isch jeweils de dütschi Tag wo de Alessio No dützsch redet.
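
The sample above has a single speaker, so it does not show much of the matching step. The following is a minimal sketch of the same idea on hand-written data with two speakers. It assumes plain `(start, end, speaker)` tuples as a stand-in for the pyannote output, and the name `assign_text` is hypothetical. Unlike the notebook's function, it walks the chunks with an index instead of popping from the list.

```python
def assign_text(chunks, turns):
    """Attach transcript chunks to speaker turns.

    chunks: dicts with absolute "start", "end" and "text", in time order.
    turns:  (start, end, speaker) tuples in time order
            (a stand-in for the diarization output, not the pyannote API).
    """
    result = []
    i = 0  # index of the next chunk that has not been assigned yet
    for turn_start, turn_end, speaker in turns:
        text = ""
        # Consume every chunk that starts before this speaker turn ends
        while i < len(chunks) and chunks[i]["start"] < turn_end:
            text += chunks[i]["text"]
            i += 1
        if text:  # drop turns that received no text
            result.append({"speaker": speaker, "text": text,
                           "start": turn_start, "end": turn_end})
    return result

# Hypothetical data, already in absolute seconds
chunks = [
    {"start": 0.0, "end": 4.0, "text": " Hello."},
    {"start": 4.5, "end": 8.0, "text": " Hi, how are you?"},
    {"start": 8.2, "end": 11.0, "text": " Fine, thanks."},
]
turns = [
    (0.0, 4.2, "SPEAKER_00"),
    (4.4, 8.1, "SPEAKER_01"),
    (8.1, 8.15, "SPEAKER_00"),   # ends before the next chunk starts, gets dropped
    (8.15, 11.2, "SPEAKER_00"),
]

diary = assign_text(chunks, turns)
for entry in diary:
    print(f'<{entry["speaker"]}>:{entry["text"]}')
```

A chunk is assigned to the first turn that ends after the chunk starts, so a chunk that straddles two turns goes entirely to the earlier one. That is a simplification worth keeping in mind when speakers overlap.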
