## Speed Comparison between WhisperSegmenterFast and faster-whisper (https://github.com/guillaumekln/faster-whisper)

### Performance on GPU

Tested on NVIDIA A100 GPU 40 GB

#### Speed of faster-whisper

In [1]:
## This code comes from the github repo of faster-whisper: 
## https://github.com/guillaumekln/faster-whisper#transcription
import librosa
import pandas as pd
import numpy as np
import time
import os
from tqdm import tqdm

from faster_whisper import WhisperModel
model_size = "large-v2"
faster_whisper_model = WhisperModel(model_size, device="cuda", compute_type="float16")

In [2]:
audio_name = "data/speed_test/test_audio.mp3"
audio, _ = librosa.load(audio_name, sr = 16000)
total_audio_length = len(audio)/16000
print("Total length of audio: %.2f min"%(total_audio_length/60))

tic = time.time()

segments, info  = faster_whisper_model.transcribe(audio_name, beam_size=5)
"""
    If you comment out the loop below, you will see an apparent 8x speedup.
    However, that speedup is not meaningful, because `segments` is a generator: the transcription
    only runs when the generator is iterated. The loop is slow but necessary to obtain the actual results.
    Therefore, the loop must be counted in the time spent by faster-whisper.
"""
res = []
for segment in segments:
    res.append((segment.start, segment.end, segment.text))
    
tac = time.time()
print("Segmentation time: %f s for segmenting %.2f minutes audio"%(tac - tic, total_audio_length/60))

Estimating duration from bitrate, this may be inaccurate


Total length of audio: 13.32 min
Segmentation time: 56.542324 s for segmenting 13.32 minutes audio


The file data/speed_test/test_audio.mp3 is the same file used in the faster-whisper benchmark (https://github.com/guillaumekln/faster-whisper#benchmark), where the authors report that it took **54 s** to transcribe this 13-min audio. Our measurement of 56.5 s is consistent with that figure.
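
The comment in the timing cell above explains why the loop over `segments` has to be inside the timed region. The following minimal sketch illustrates the same effect without faster-whisper. `fake_transcribe` is a hypothetical stand-in that simulates per-segment work with `time.sleep`; it is not part of any library.

```python
import time


def fake_transcribe(n_segments, seconds_per_segment=0.01):
    """Hypothetical lazy transcriber: does its work only while being iterated."""
    for i in range(n_segments):
        time.sleep(seconds_per_segment)  # simulated decoding work for one segment
        yield (float(i), float(i + 1), f"segment {i}")


# Creating the generator returns immediately; no segment has been computed yet.
tic = time.perf_counter()
segments = fake_transcribe(20)
creation_time = time.perf_counter() - tic

# Consuming the generator is where the work actually happens.
tic = time.perf_counter()
res = list(segments)
consumption_time = time.perf_counter() - tic

print("Creating the generator: %.4f s" % creation_time)
print("Consuming the generator: %.4f s for %d segments" % (consumption_time, len(res)))
```

Timing only the call that creates the generator would therefore report a duration that excludes most of the work.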

#### Speed of WhisperSegmenterFast

For a fair comparison, we let WhisperSegmenterFast segment a birdsong recording that is also 13 min long.

This 13-min birdsong recording was created by merging multiple birdsong audio files.

We do not let WhisperSegmenterFast segment data/speed_test/test_audio.mp3 because that file contains human speech. WhisperSegmenterFast would extract no birdsong syllables from it, so the segmentation would finish very quickly and we would overestimate the speed of WhisperSegmenterFast.

In [3]:
from model import WhisperSegmenterFast
import librosa
import pandas as pd
import numpy as np
import time
import os
from tqdm import tqdm

In [4]:
segmenter_fast = WhisperSegmenterFast( "model/vocal-segment-zebra-finch-whisper-large-ct2", device="cuda" )

In [14]:
audio_name = "data/speed_test/test_birdsong_audio.wav"
audio, _ = librosa.load(audio_name, sr = 16000)
total_audio_length = len(audio)/16000
print("Total length of audio: %.2f min"%(total_audio_length/60))

tic = time.time()
audio, _ = librosa.load(audio_name, sr = 16000)
prediction = segmenter_fast.segment(audio, num_trials= 3)
tac = time.time()
print("Segmentation time: %f s for segmenting %.2f minutes audio"%(tac - tic, total_audio_length/60))

Total length of audio: 13.32 min
Segmentation time: 39.355548 s for segmenting 13.32 minutes audio


The segmentation looks reasonable, as shown by visualization.

In [13]:
segmenter_fast.visualize(audio = audio, prediction=prediction)

**Conclusion: On GPU, the speeds of WhisperSegmenterFast and faster-whisper are comparable (about 39 s vs. 57 s for 13.32 min of audio).**
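
To make the GPU comparison easier to read, the timings printed above can be converted into a speed-up relative to real time (audio duration divided by processing time). This is only a sketch over the numbers reported in this notebook; the helper name `realtime_speedup` is introduced here for illustration.

```python
def realtime_speedup(audio_minutes, processing_seconds):
    """Return how many times faster than real time the audio was processed."""
    return audio_minutes * 60.0 / processing_seconds


audio_minutes = 13.32  # length printed by both GPU cells above

# Processing times in seconds, copied from the GPU cell outputs above.
gpu_timings_s = {
    "faster-whisper (large-v2, float16, beam_size=5)": 56.542324,
    "WhisperSegmenterFast (num_trials=3)": 39.355548,
}

gpu_speedups = {
    name: realtime_speedup(audio_minutes, seconds)
    for name, seconds in gpu_timings_s.items()
}

for name, speedup in gpu_speedups.items():
    print("%s: %.1fx real time" % (name, speedup))
```

Note that the two models process different audio files of the same length, so the ratio between them is indicative rather than exact.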

### Performance on CPU

Tested on AMD EPYC 7H12 64-Core Processor

#### Speed of faster-whisper -- the small Whisper model

In [1]:
## This code comes from the github repo of faster-whisper: 
## https://github.com/guillaumekln/faster-whisper#transcription
import librosa
import pandas as pd
import numpy as np
import time
import os
from tqdm import tqdm

from faster_whisper import WhisperModel
model_size = "small"
faster_whisper_model = WhisperModel(model_size, device="cpu", compute_type="float32")

audio_name = "data/speed_test/test_audio.mp3"
audio, _ = librosa.load(audio_name, sr = 16000)
total_audio_length = len(audio)/16000
print("Total length of audio: %.2f min"%(total_audio_length/60))

tic = time.time()

segments, info  = faster_whisper_model.transcribe(audio_name, beam_size=5)
"""
    If you comment out the loop below, you will see an apparent 8x speedup.
    However, that speedup is not meaningful, because `segments` is a generator: the transcription
    only runs when the generator is iterated. The loop is slow but necessary to obtain the actual results.
    Therefore, the loop must be counted in the time spent by faster-whisper.
"""
res = []
for segment in segments:
    res.append((segment.start, segment.end, segment.text))
    
tac = time.time()
print("Segmentation time: %.2f min for segmenting %.2f minutes audio"%((tac - tic)/60, total_audio_length/60))

Estimating duration from bitrate, this may be inaccurate


Total length of audio: 13.32 min
Segmentation time: 2.84 min for segmenting 13.32 minutes audio


In the faster-whisper benchmark (https://github.com/guillaumekln/faster-whisper#benchmark), the authors report that it took **2 min 44 s** to transcribe this 13-min audio with the **small Whisper model on CPU**.

#### Speed of faster-whisper -- the large Whisper model

In [1]:
## This code comes from the github repo of faster-whisper: 
## https://github.com/guillaumekln/faster-whisper#transcription
import librosa
import pandas as pd
import numpy as np
import time
import os
from tqdm import tqdm

from faster_whisper import WhisperModel
model_size = "large-v2"
faster_whisper_model = WhisperModel(model_size, device="cpu", compute_type="float32")

audio_name = "data/speed_test/test_audio.mp3"
audio, _ = librosa.load(audio_name, sr = 16000)
total_audio_length = len(audio)/16000
print("Total length of audio: %.2f min"%(total_audio_length/60))

tic = time.time()

segments, info  = faster_whisper_model.transcribe(audio_name, beam_size=5)
"""
    If you comment out the loop below, you will see an apparent 8x speedup.
    However, that speedup is not meaningful, because `segments` is a generator: the transcription
    only runs when the generator is iterated. The loop is slow but necessary to obtain the actual results.
    Therefore, the loop must be counted in the time spent by faster-whisper.
"""
res = []
for segment in segments:
    res.append((segment.start, segment.end, segment.text))
    
tac = time.time()
print("Segmentation time: %.2f min for segmenting %.2f minutes audio"%((tac - tic)/60, total_audio_length/60))

Estimating duration from bitrate, this may be inaccurate


Total length of audio: 13.32 min
Segmentation time: 14.41 min for segmenting 13.32 minutes audio


#### Speed of WhisperSegmenterFast

In [30]:
from model import WhisperSegmenterFast
import librosa
import pandas as pd
import numpy as np
import time
import os
from tqdm import tqdm

segmenter_fast = WhisperSegmenterFast( "model/vocal-segment-zebra-finch-whisper-large-ct2", device="cpu", compute_type="float32" )

audio_name = "data/speed_test/test_birdsong_audio.wav"
audio, _ = librosa.load(audio_name, sr = 16000)
total_audio_length = len(audio)/16000
print("Total length of audio: %.2f min"%(total_audio_length/60))

tic = time.time()
audio, _ = librosa.load(audio_name, sr = 16000)
prediction = segmenter_fast.segment(audio, num_trials= 1)
tac = time.time()
print("Segmentation time: %f s for segmenting %.2f minutes audio"%(tac - tic, total_audio_length/60))

Total length of audio: 13.32 min
Segmentation time: 580.828187 s for segmenting 13.32 minutes audio
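
The CPU cells above report times in different units (minutes for faster-whisper, seconds for WhisperSegmenterFast). The sketch below puts the three CPU measurements on a common scale and relates each to the audio duration. It only re-uses numbers printed in this notebook; the variable names are illustrative.

```python
audio_minutes = 13.32  # length printed by all CPU cells above

# Processing times converted to minutes, taken from the CPU cell outputs above.
cpu_minutes = {
    "faster-whisper small (float32)": 2.84,
    "faster-whisper large-v2 (float32)": 14.41,
    "WhisperSegmenterFast (float32, num_trials=1)": 580.828187 / 60.0,
}

# Fraction of the audio duration spent on processing (below 1.0 means faster than real time).
cpu_ratio = {name: minutes / audio_minutes for name, minutes in cpu_minutes.items()}

for name, minutes in cpu_minutes.items():
    print("%s: %.2f min (%.2fx the audio duration)" % (name, minutes, cpu_ratio[name]))
```

As in the GPU test, WhisperSegmenterFast processes the birdsong file while faster-whisper processes the speech file, so the figures are comparable only in terms of audio length.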


## GPU Usage of WhisperSegmenterFast

GPU memory usage when idle: 3.8 GB <br>
GPU memory usage when segmenting (with an internal batch size of 16): up to 6 GB