## Transcribe MS Teams meetings to a text file

As Teams meeting recordings are <strong>huge</strong>, I recommend using an external tool to extract the audio track and placing just that file in the <i>Jupyter notebook</i> folder.

One option is https://cloudconvert.com/mpeg-to-wav

If you only need a <code>wav</code> from <code>mp3</code>, you can use https://cloudconvert.com/mp3-to-wav

Of course, if you have access to the original MS Teams recording (MPEG-4), you can use ffmpeg:<BR>
<code>ffmpeg -i &lt;teams_recording.mp4&gt; &lt;output_file_name.mp3&gt;</code>

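The ffmpeg step can also be scripted from Python. This is a minimal sketch, assuming ffmpeg is installed and on the path. The function names, the file names in the example, and the choice of mono 16 kHz output are assumptions for illustration, not part of the original workflow.

```python
import shutil
import subprocess


def build_ffmpeg_command(video_path, audio_path, sample_rate=16000):
    """Return the ffmpeg argument list that extracts the audio track."""
    return [
        "ffmpeg",
        "-y",                     # overwrite the output file if it exists
        "-i", video_path,         # input recording
        "-vn",                    # drop the video stream
        "-ac", "1",               # downmix to mono
        "-ar", str(sample_rate),  # resample to the given rate
        audio_path,
    ]


def extract_audio(video_path, audio_path):
    """Run ffmpeg; raises CalledProcessError if the conversion fails."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg was not found on the path")
    subprocess.run(build_ffmpeg_command(video_path, audio_path), check=True)


# Example with hypothetical file names:
# extract_audio("teams_recording.mp4", "haastattelu.mp3")
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.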

I am very fed up with Anaconda, so the virtual environment is set up with pip.<br>
For torch, follow the installation instructions at https://pytorch.org/get-started/locally/<br>
For whisper, be sure to install the OpenAI version (<code>pip install openai-whisper</code>).<br>

In [None]:
from pydub import AudioSegment # Pydub requires that ffmpeg is installed and in the path
from pydub.playback import play
from pyannote.audio import Pipeline
import whisper # Be sure to install openai version: pip install openai-whisper

import torch, torchaudio

import io

from IPython.display import Audio

This first step just checks that the audio file is readable and plays its first 10 seconds.

In [None]:
# The file to be transcribed
file_path = 'haastattelu.mp3'

# Load the audio file
audio = AudioSegment.from_file(file_path)

# Resample the audio to 16kHz (required by the model)
audio = audio.set_frame_rate(16000)

# Slice the first 10 seconds (10,000 milliseconds)
audio_10_seconds = audio[:10000]

# Save the slice as wav to see if it works
audio_10_seconds.export('first_10_seconds.wav', format='wav')

audio_file = 'first_10_seconds.wav'

# Play the first 10 seconds
display(Audio(audio_file))


Let's make a generator function that slices the audio file into 20-second chunks. Whisper was trained on 30-second segments, so each 20-second chunk fits inside a single model window. The shorter chunks of course help with memory, too.

In [None]:
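def audio_stream():
    """Export consecutive 20-second slices of `audio` to chunk.wav, one at a time.

    Yields the end position (in ms) of the chunk just written, for progress reporting.
    """
    i = 0
    chunk_size = 20000  # 20,000 milliseconds
    while i < len(audio):
        chunk = audio[i:i+chunk_size]
        chunk.export('chunk.wav', format='wav')  # overwritten on every iteration
        i += chunk_size
        yield min(i, len(audio))  # clamp so the reported progress never exceeds 100%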
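
Because every chunk is transcribed on its own, the timestamps restart from zero in each chunk. A minimal sketch of computing the chunk boundaries up front, so that a chunk's start can be added back to its timestamps. `chunk_bounds` is a hypothetical helper, not part of the notebook.

```python
def chunk_bounds(total_ms, chunk_ms=20000):
    """Return (start_ms, end_ms) pairs covering 0..total_ms in chunk_ms steps."""
    if chunk_ms <= 0:
        raise ValueError("chunk_ms must be positive")
    return [
        (start, min(start + chunk_ms, total_ms))
        for start in range(0, total_ms, chunk_ms)
    ]


# A 45-second recording gives two full chunks and one 5-second remainder.
print(chunk_bounds(45000))
# [(0, 20000), (20000, 40000), (40000, 45000)]
```

With pydub, `audio[start:end]` slices by milliseconds, so each pair can be used directly, and `start / 1000` is the offset in seconds to add to the Whisper segment times of that chunk.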

We will use the OpenAI "large" model, as it seems to perform quite well on Finnish. There are versions fine-tuned for Finnish, but they seem to require special torch versions that are not available without an NVIDIA developer account.

Using CUDA on a GPU makes inference at least 2.5 times faster. I think there should be a <strong>really</strong> good reason not to use CUDA. If you get errors about running out of GPU memory, I suggest reducing the chunk size before falling back to the CPU. An hour of Teams recording takes 23 minutes with a <strong>fast</strong> CPU but only 7 minutes with an NVIDIA RTX 4080 with 16 GB of VRAM.

In [None]:
model = whisper.load_model("large", device="cuda")
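
If the notebook may also be run on a machine without a GPU, the device can be chosen at run time instead of being hard-coded. A minimal sketch: `pick_device` is a hypothetical helper, and it relies only on `torch.cuda.is_available()`.

```python
import importlib.util


def pick_device():
    """Return "cuda" when torch reports a usable GPU, otherwise "cpu"."""
    if importlib.util.find_spec("torch") is None:
        return "cpu"  # torch is not installed, so fall back to the CPU label
    import torch
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(device)
# model = whisper.load_model("large", device=device)
```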

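Each chunk is transcribed and diarized in turn, and the combined result is written to a text file.

Note that the model is forced to the Finnish language. Timestamps in the output are relative to the start of each 20-second chunk, and speaker labels are assigned per chunk, so the same label in two chunks is not guaranteed to be the same person.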

In [None]:
# Load the pipeline and initialize it
from auth_tokens import get_pyannotetoken
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1", use_auth_token=get_pyannotetoken()).to(torch.device("cuda"))

verbose = True
with open("combined_results.txt", "w", encoding="utf-8") as f:
    for position_ms in audio_stream():  # the generator yields a position in ms, not a file name
        print(f"Current progress: {min(position_ms / len(audio), 1) * 100:.2f}%")

        # Transcribing the audio chunk
        transcription = model.transcribe("chunk.wav", language="fi", verbose=verbose)

        # Load the audio chunk for diarization
        chunk_audio_path = 'chunk.wav'  # Assuming you already have this from the audio_stream
        waveform, sample_rate = torchaudio.load(chunk_audio_path)
        diarization = pipeline({"waveform": waveform, "sample_rate": sample_rate})

        # Iterate over each segment from Whisper and find corresponding speaker
        for segment in transcription['segments']:
            start = segment['start']
            end = segment['end']
            text = segment['text']

            # Find the matching speaker from diarization results
            speaker_label = None
            for turn, _, speaker in diarization.itertracks(yield_label=True):
                if turn.start <= start <= turn.end or \
                   turn.start <= end <= turn.end:
                    speaker_label = speaker
                    break

            # Write the combined result to file
            if speaker_label:
                f.write(f"Speaker[{speaker_label}] {start:.1f}-{end:.1f}: {text}\n")
            else:
                f.write(f"Speaker[Unknown] {start:.1f}-{end:.1f}: {text}\n")
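
The matching above takes the first turn that contains the segment's start or end, so it misses a turn that lies entirely inside a segment. One alternative, sketched here under the assumption that the diarization turns are first collected into plain `(start, end, speaker)` tuples, is to pick the speaker with the largest total overlap. `best_speaker` is a hypothetical helper name, and the turns in the example are made up.

```python
from collections import defaultdict


def best_speaker(seg_start, seg_end, turns):
    """Return the speaker whose turns overlap [seg_start, seg_end] the most, or None."""
    overlap = defaultdict(float)
    for turn_start, turn_end, speaker in turns:
        shared = min(seg_end, turn_end) - max(seg_start, turn_start)
        if shared > 0:
            overlap[speaker] += shared
    if not overlap:
        return None
    return max(overlap, key=overlap.get)


# Hypothetical turns, in seconds, for one chunk
turns = [(0.0, 4.0, "SPEAKER_00"), (4.0, 9.0, "SPEAKER_01"), (9.0, 12.0, "SPEAKER_00")]
print(best_speaker(3.0, 8.0, turns))    # SPEAKER_01 (4.0 s of overlap against 1.0 s)
print(best_speaker(15.0, 18.0, turns))  # None
```

In the notebook, the tuples could be built with `[(turn.start, turn.end, speaker) for turn, _, speaker in diarization.itertracks(yield_label=True)]`.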


In [None]:
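# Load the reference (ground-truth) diarization from an RTTM file
from pyannote.database.util import load_rttm
# load_rttm returns a dict of annotations keyed by file URI; take the single entry
_, groundtruth = load_rttm('audio.rttm').popitem()

# Visualize the ground truth
groundtruth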