# **Methods to transcribe speech to text**

This notebook accompanies the Medium blogpost **"Text analytics on Dutch cycling training podcasts, part I Evaluating speech-to-text methods"**. We will use four methods to transcribe a Dutch audio example to text:

1. YouTube transcription API
2. Wav2vec 2.0
3. Vosk
4. Whisper

We will first install the necessary libraries. This is a cell you will need to run in any case. Each method subsequently imports the modules it needs. Also run the cells in **Intermezzo**, because they create the audio files required by every transcription method other than the YouTube transcription API.

Note that we usually export files to a folder called "content". You can of course change that, depending on where you want to export your files.

In [None]:
# Check the available resources (GPU). Use at least a Tesla T4 for inference in the examples

!nvidia-smi

Mon Jan  9 18:40:11 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   60C    P0    29W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

# Install libraries

In [None]:
%%capture
# install libraries (the %%capture cell magic has to be the first line of the cell)
!pip install youtube_transcript_api
!pip install pytube
!pip install pydub
!pip install git+https://github.com/huggingface/transformers
!pip install git+https://github.com/openai/whisper.git
!pip install stable-ts
!pip install vosk

# YouTube transcription API

In [None]:
# import libraries

from youtube_transcript_api import YouTubeTranscriptApi
import json

In [None]:
# function to generate the transcript. It skips the '[Muziek]' caption tag; for English, for example, change '[Muziek]' to '[Music]'

def generate_transcript(id, language):
  transcript = YouTubeTranscriptApi.get_transcript(id, languages=language)
  script = ""

  for text in transcript:
    t = text["text"]
    if t != '[Muziek]':
      script += t + " "
    
  return script

The id is the YouTube video id, which you can find in the video's URL. In this case the URL is https://www.youtube.com/watch?v=ZZsyAxpWgdA
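If you prefer not to copy the id by hand, it can be read from the URL's query string. The sketch below uses only the standard library; the helper name `extract_video_id` is made up for illustration, and it assumes a standard `watch?v=...` URL rather than a shortened link.

```python
from urllib.parse import urlparse, parse_qs


def extract_video_id(url):
    """Return the value of the 'v' query parameter of a YouTube watch URL."""
    query = parse_qs(urlparse(url).query)
    values = query.get("v")
    if not values:
        raise ValueError(f"No 'v' parameter found in URL: {url}")
    return values[0]


print(extract_video_id("https://www.youtube.com/watch?v=ZZsyAxpWgdA"))  # ZZsyAxpWgdA
```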

In [None]:
# Invoke the function to get transcript

id = 'ZZsyAxpWgdA'
language = ['nl']

transcript_youtube = generate_transcript(id, language)

In [None]:
# Example first 100 characters

transcript_youtube[:100]

'deze podcast maken we samen met CC de fiets heb voor alle wielrenners toch gaan jezelf op die fiets '

In [None]:
# write data to a jsonl file for later use, in this case calculating WER/CER

path = '/content/youtube_transcript.jsonl'

with open(path, 'w', encoding='utf-8') as f:
  json.dump(transcript_youtube, f)
  f.write('\n')
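For the later WER/CER step the file has to be read back. A minimal sketch of that round trip follows; `read_transcript` is a hypothetical helper, and the demo writes a stand-in transcript to a temporary folder instead of the `/content` path used in the notebook.

```python
import json
import os
import tempfile


def read_transcript(path):
    """Read the single JSON value stored on the first line of a .jsonl file."""
    with open(path, "r", encoding="utf-8") as f:
        return json.loads(f.readline())


# Round trip with a stand-in transcript in a temporary folder
# (the notebook itself writes to /content/youtube_transcript.jsonl).
demo_path = os.path.join(tempfile.mkdtemp(), "demo_transcript.jsonl")
with open(demo_path, "w", encoding="utf-8") as f:
    json.dump("deze podcast maken we samen", f)
    f.write("\n")

print(read_transcript(demo_path))
```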

# Intermezzo

Before we use the other methods to generate transcripts, we download the sample audio file.

In [None]:
# import libraries

import pytube as pt
from pydub import AudioSegment
from pydub.utils import which 

AudioSegment.converter = which("ffmpeg") 

In [None]:
# function to download the audio file

def download_youtube_audio(video_id):
  yt = pt.YouTube("https://www.youtube.com/watch?v=" + str(video_id))
  stream = yt.streams.filter(only_audio=True)[0]
  return stream.download()

In [None]:
# invoke the function for specific audio from the video

video_id = "ZZsyAxpWgdA"

audio_file = download_youtube_audio(video_id)

Listen to a small sample of, say, 60 seconds in the notebook (pydub slices audio in milliseconds, so `sound[:60000]` is the first 60 seconds). Note that the downloaded file is called "BETER WORDEN 54 SCHAKELEN.mp4".

In [None]:
# play sound in notebook

sound = AudioSegment.from_file("BETER WORDEN 54 SCHAKELEN.mp4", format="mp4")

sound[:60000]

# Wav2vec 2.0

Here we use **Wav2vec 2.0** as the transcription architecture. We start by importing the pipeline object from transformers. You could transcribe the downloaded audio file (.mp4) directly with the pipeline object, which handles decoding and resampling "under the hood", but we first convert the file to .wav and downsample it to 16000 Hz. Converting is also convenient because **Vosk** (another transcription method) uses the same format and sampling rate.

In [None]:
# import libraries

import os
import json
from transformers import pipeline

In [None]:
# define function to convert to a .wav file and return the path of the new file

def convert_mp4_wav(source_filename):
  sound = AudioSegment.from_file(source_filename, format="mp4")
  sound = sound.set_channels(1) # mono
  sound = sound.set_frame_rate(16000) # 16000Hz
  output_path = os.path.splitext(source_filename)[0]+".wav"
  sound.export(output_path, format="wav").close() # export returns an open file handle, so close it
  return output_path

In [None]:
# invoke the function to create a .wav file

wav_file = convert_mp4_wav("BETER WORDEN 54 SCHAKELEN.mp4")
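A quick way to confirm that a converted file really is mono and 16000 Hz is to inspect its header with the standard-library `wave` module. The sketch below is illustrative only: `describe_wav` is a hypothetical helper, and the demo inspects a synthetic one-second tone rather than the podcast file.

```python
import math
import os
import struct
import tempfile
import wave


def describe_wav(path):
    """Return channel count, sample rate, sample width and duration of a .wav file."""
    with wave.open(path, "rb") as wf:
        return {
            "channels": wf.getnchannels(),
            "sample_rate": wf.getframerate(),
            "sample_width_bytes": wf.getsampwidth(),
            "duration_s": wf.getnframes() / wf.getframerate(),
        }


# Stand-in for the converted podcast: one second of a 440 Hz tone, mono, 16-bit, 16000 Hz.
demo_wav = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_wav, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(16000)
    samples = (int(10000 * math.sin(2 * math.pi * 440 * n / 16000)) for n in range(16000))
    wf.writeframes(b"".join(struct.pack("<h", s) for s in samples))

info = describe_wav(demo_wav)
print(info)
```

In the notebook you would call `describe_wav("BETER WORDEN 54 SCHAKELEN.wav")` instead.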

In [None]:
# create a pipeline using the GroNLP/wav2vec2-dutch-large-ft-cgn model, which was found on Hugging Face models. On a GPU, device=0 gives faster inference

pipe = pipeline(model="GroNLP/wav2vec2-dutch-large-ft-cgn", device=0)

In [None]:
# generate transcript using our model and created .wav file

transcript_wav2vec2 = pipe("BETER WORDEN 54 SCHAKELEN.wav", chunk_length_s=10, stride_length_s=(4, 2))
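The `chunk_length_s=10` and `stride_length_s=(4, 2)` arguments split the long recording into overlapping 10-second windows, where 4 seconds on the left and 2 seconds on the right of each window serve only as context. The sketch below illustrates that idea in plain Python. It is a simplified picture, not the pipeline's actual implementation, and `chunk_windows` is a made-up name.

```python
def chunk_windows(total_s, chunk_s=10.0, stride_left_s=4.0, stride_right_s=2.0):
    """List (start, end, keep_start, keep_end) in seconds for overlapping windows.

    [start, end] is the audio fed to the model; [keep_start, keep_end] is the
    part whose transcription is kept once the overlapping context is dropped.
    """
    step = chunk_s - stride_left_s - stride_right_s
    if step <= 0:
        raise ValueError("chunk length must exceed the sum of both strides")

    windows = []
    index = 0
    while True:
        start = index * step
        end = min(start + chunk_s, total_s)
        is_first = index == 0
        is_last = end >= total_s
        # The first window has no left context, the last one no right context.
        keep_start = start if is_first else start + stride_left_s
        keep_end = end if is_last else end - stride_right_s
        windows.append((start, end, keep_start, keep_end))
        if is_last:
            break
        index += 1
    return windows


windows = chunk_windows(20.0)
for window in windows:
    print(window)
```

For a 20-second clip this gives four windows whose kept parts (0 to 8, 8 to 12, 12 to 16, 16 to 20 seconds) tile the recording without gaps.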

In [None]:
# Example first 100 characters from the generated dictionary

print(transcript_wav2vec2['text'][0:100])

DEZE POTKAAST MAKEN WE SAMEN MET JOYN PUNTCC DE FIETSEP VOOR ALLE WIELRENNERS TEPRE WERV TEL W TELV 


In [None]:
# write data to a jsonl file for later use, in this case calculating WER/CER

path = '/content/wav2vec2_transcript.jsonl'

with open(path, 'w', encoding='utf-8') as f:
  json.dump(transcript_wav2vec2, f)
  f.write('\n')

# Vosk

1.   First you need to download a model from https://alphacephei.com/vosk/models
2.   The download is a zip archive. Unzip it and make sure you refer to the path of the extracted folder in the Model object (see below)

I decided to go with vosk-model-small-nl-0.22, as you will see below.
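Unpacking the archive can also be done from the notebook. The sketch below uses the standard-library `zipfile` module; `unpack_model` is a hypothetical helper, and it assumes the archive contains a single top-level folder named after the model. The demo builds a small stand-in archive instead of downloading the real one.

```python
import os
import tempfile
import zipfile


def unpack_model(zip_path, target_dir="."):
    """Extract a model archive and return the path of its top-level folder."""
    with zipfile.ZipFile(zip_path) as archive:
        top_level = {name.split("/")[0] for name in archive.namelist() if name.strip("/")}
        # Assumption: the archive holds exactly one folder, named after the model.
        if len(top_level) != 1:
            raise ValueError("Expected exactly one top-level folder in the archive")
        archive.extractall(target_dir)
    return os.path.join(target_dir, top_level.pop())


# Stand-in archive so the sketch runs without downloading the real model.
work_dir = tempfile.mkdtemp()
demo_zip = os.path.join(work_dir, "vosk-model-small-nl-0.22.zip")
with zipfile.ZipFile(demo_zip, "w") as archive:
    archive.writestr("vosk-model-small-nl-0.22/README", "stand-in for the real model files")

model_dir = unpack_model(demo_zip, work_dir)
print(model_dir)
```

The returned folder path is what you would pass to the `Model` object.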

In [None]:
# import libraries

import json
import wave

from vosk import Model, KaldiRecognizer, SetLogLevel

In [None]:
# open the .wav file we created

wf = wave.open("BETER WORDEN 54 SCHAKELEN.wav", "rb")

In [None]:
# Initialize model. Make sure you refer to the right directory where the model folder is

model = Model("vosk-model-small-nl-0.22/")
rec = KaldiRecognizer(model, wf.getframerate())

In [None]:
# Transcribe the file. Code from "Transcribe large audio files offline with Vosk, K Rink"

transcription = []

while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    if rec.AcceptWaveform(data):
        # Convert json output to dict
        result_dict = json.loads(rec.Result())
        # Extract text values and append them to transcription list
        transcription.append(result_dict.get("text", ""))

# Get final bits of audio and flush the pipeline
final_result = json.loads(rec.FinalResult())
transcription.append(final_result.get("text", ""))

# merge or join all list elements to one big string
transcription_vosk = ' '.join(transcription)

In [None]:
# Example first 100 characters 

transcription_vosk[:100]

'  deze podcast maken we samen met joint punt c de fietsen heb voor alle wielrenners twee uur het nie'

In [None]:
# write data to a jsonl file for later use, in this case calculating WER/CER

path = '/content/vosk_transcript.jsonl'

with open(path, 'w', encoding='utf-8') as f:
  json.dump(transcription_vosk, f)
  f.write('\n')

# Whisper

For **Whisper** we use a slight modification, stable-ts (installed with `!pip install stable-ts`), along with the whisper library. The reason is that we wanted more accurate timestamps than the original model provides. We will use this feature in a later blogpost. Please refer to https://github.com/jianfch/stable-ts for more information.

If you want to use Whisper only, simply don't install stable-ts and load the model with `whisper.load_model` from the whisper library instead of `stable_whisper.load_model`.

Note that when loading the model we use 'large-v1'. While this blogpost was being written, a new version of **Whisper** ('large-v2') was released, which then became the default for inference and showed slightly better results on benchmark datasets. We, however, continue to use 'large-v1' for our transcriptions.

In [None]:
# import libraries

from stable_whisper import load_model
import json

In [None]:
# load the model

whisper_model = load_model('large-v1')

100%|█████████████████████████████████████| 2.87G/2.87G [11:33<00:00, 4.45MiB/s]


In [None]:
# Transcribe our .mp4 file

transcription_whisper = whisper_model.transcribe('BETER WORDEN 54 SCHAKELEN.mp4')

Detected language: dutch


In [None]:
# Example first 120 characters from the generated dictionary

print(transcription_whisper['text'][0:120])

 Deze podcast maken we samen met Join.cc, de fietsapp voor alle wielrenners. Zin om te fietsen? Geen zin om te fietsen. 


In [None]:
# write data to a jsonl file for later use, in this case calculating WER/CER

path = '/content/whisper_transcript.jsonl'

with open(path, 'w', encoding='utf-8') as f:
  json.dump(transcription_whisper, f)
  f.write('\n')

End of the Colab Notebook used in the Medium blogpost **"Text analytics on Dutch cycling training podcasts, part I Evaluating speech-to-text methods"**. Please refer to the Colab Notebook **"Error metrics transcriptions"** for the WER/CER calculations.
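As a rough idea of what WER and CER measure, here is a minimal sketch based on edit distance. It is not the code from the "Error metrics transcriptions" notebook: the helper names `edit_distance`, `wer` and `cer` are made up for illustration, and the only normalization applied is lowercasing, which is an assumption.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, 1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, 1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by the reference word count."""
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance divided by the reference length."""
    ref_chars = list(reference.lower())
    hyp_chars = list(hypothesis.lower())
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


# One of five words differs, so the word error rate is 0.2.
print(wer("deze podcast maken we samen", "DEZE POTKAAST MAKEN WE SAMEN"))
```

Lowercasing matters here because the Wav2vec 2.0 output is upper case while the other transcripts are not. Punctuation, as in the Whisper output, would need similar handling before a fair comparison.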

# Bulk transcribe all podcast audio files

**Transcribing all podcast audio files**

If we want to transcribe all the podcasts from our playlist at once, we can use the code below. It first downloads the audio of all the YouTube videos and then transcribes the files one by one. Note that we remove each audio file after transcription since (at least for this post) we won't need it.

In [None]:
# import libraries

from datetime import datetime
import time
import json
import os
import re

from stable_whisper import load_model
from pytube import YouTube, Playlist

In [None]:
# import all audio files from the playlist

p = Playlist('https://www.youtube.com/playlist?list=PLQ5QZrgeIituFNPb4TFTi3mme_YLNmaUt') # url of the playlist "Beter worden"

print(f'Downloading: {p.title}')

for audio in p.videos:
  audio.streams.filter(only_audio=True).first().download()

Downloading: BETER WORDEN PODCAST


In [None]:
# Define a path with file name to write the transcripts for our audio files

path = '/content/better_worden_podcasts_all.jsonl'

In [None]:
# Search for the .mp4 audio files in the current directory and export to one file specified in "path"

# load the Whisper model in this cell; 'large-v1' is the version used for our transcriptions
whisper_model = load_model('large-v1')

startTime = datetime.now()

with open(path, 'w', encoding='utf-8') as f:
  for file_name in os.listdir():
    if not file_name.endswith('.mp4'):
      continue
    print(f"Processing {file_name}")
    start = time.time()

    try:
      result = whisper_model.transcribe(file_name)
    except Exception as exc:
      # skip files that fail and keep them on disk for a later rerun
      print(f'Exception: {exc}')
      continue

    # keep only the start time, end time and text of each segment
    data = [{'start': seg['start'], 'end': seg['end'], 'text': seg['text']} for seg in result['segments']]

    json.dump({'id': file_name, 'text': result['text'], 'segments': data}, f)
    f.write('\n')

    print(f'Time : {time.time() - start} seconds')
    os.remove(file_name)

print(datetime.now() - startTime)

Although you can run this code as is, be aware that you may have to rerun it several times. I had to, since my instance unfortunately crashed a few times. It is then just a matter of checking what has already been transcribed and merging the transcripts afterwards.
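Checking what is already transcribed can be automated by reading the ids from the output file. The sketch below is one possible approach, not the procedure used for the blogpost. `already_transcribed` is a hypothetical helper, and it assumes each line is a JSON object with an `id` key, as written by the bulk transcription cell.

```python
import json
import os
import tempfile


def already_transcribed(jsonl_path):
    """Return the set of ids found in a transcript .jsonl file."""
    done = set()
    if not os.path.exists(jsonl_path):
        return done
    with open(jsonl_path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                # A crash can leave a half-written last line; ignore it.
                continue
            if isinstance(record, dict) and "id" in record:
                done.add(record["id"])
    return done


# Stand-in output file: two complete records and one truncated line.
demo_jsonl = os.path.join(tempfile.mkdtemp(), "demo_all.jsonl")
with open(demo_jsonl, "w", encoding="utf-8") as f:
    f.write(json.dumps({"id": "a.mp4", "text": "eerste", "segments": []}) + "\n")
    f.write(json.dumps({"id": "b.mp4", "text": "tweede", "segments": []}) + "\n")
    f.write('{"id": "c.mp4", "text": "half')

done = already_transcribed(demo_jsonl)
print(sorted(done))  # ['a.mp4', 'b.mp4']
```

On a rerun, one could open the output file in append mode (`'a'`) and skip every file whose name is already in this set, so that earlier transcripts are not overwritten.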