# Speech Transcription Evaluation

- Talking is your mouth shaping sounds while your vocal cords vibrate.
- Humming is your vocal cords vibrating without your mouth shaping sounds.
- Each distinct speech sound in a language is called a phoneme, and every phoneme (consonant or vowel) has, in theory, a voiced and an unvoiced variant.
- Your voice is produced in the voice box (larynx), which holds the vocal cords. When you talk, those cords vibrate and create sound.

After all, every sound is just a disturbance in the air around you, caused by vibrations.

🗣️ VAD preprocessing reduces hallucination and enables batching with no WER degradation

When you whisper, you prevent the vocal cords from vibrating, so you still speak (all the actual talking sounds are shaped in your mouth) but do not use your voice.

🎯 Accurate word-level timestamps using wav2vec2 alignment

When you whisper, you don't "voice" any of the sounds you make. 

In [None]:
# !pip install -q --upgrade torch torchvision torchaudio
# !pip install -q git+https://github.com/huggingface/transformers
# !pip install -q accelerate optimum
!pip install -q ipython-autotime
# !sudo apt install ffmpeg
%load_ext autotime

## Test Data

In [10]:
## small samples
# use /raw/ rather than /blob/ so wget fetches the audio file, not the HTML page
# !wget https://github.com/petewarden/openai-whisper-webapp/raw/main/mary.mp3
# !wget https://github.com/petewarden/openai-whisper-webapp/raw/main/two_cities.mp3
# # hard
# !wget https://github.com/petewarden/openai-whisper-webapp/raw/main/daisy_HAL_9000.mp3
sample_1="mary.mp3"
sample_2="two_cities.mp3"
sample_3="daisy_HAL_9000.mp3"

# ## large files
# !wget https://huggingface.co/datasets/reach-vb/random-audios/resolve/main/4469669.mp3
# !wget https://huggingface.co/datasets/reach-vb/random-audios/resolve/main/ted_60.wav
# !wget https://huggingface.co/datasets/reach-vb/random-audios/resolve/main/sam_altman_lex_podcast_367.flac

audiofile1_60s="ted_60.wav"
audiofile2_2hr30min="sam_altman_lex_podcast_367.flac" 
audiofile2_2hr07min="4469669.mp3" 

%load_ext autotime

The autotime extension is already loaded. To reload it, use:
  %reload_ext autotime
time: 625 µs (started: 2024-01-16 16:24:16 -05:00)


### Whisper 
https://github.com/openai/whisper

### Whisper HF

In [2]:
import torch
import optimum
import transformers
from transformers import pipeline

print(transformers.__version__)
print(torch.__version__)



4.37.0.dev0
2.1.2+cu121
time: 2.22 s (started: 2024-01-16 14:21:58 -05:00)


#### Test-1

In [3]:
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model_id = "openai/whisper-large-v3"
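# Note: torch_dtype is defined above but not passed to pipeline(), so the model loads in its default precision.
pipe = pipeline("automatic-speech-recognition", model=model_id, device=device)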
print(pipe.model.device)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


cuda:0
time: 3.57 s (started: 2024-01-16 13:09:05 -05:00)


In [5]:
# test-1 (wav)
outputs = pipe(audiofile1_60s,chunk_length_s=30)
print(len(outputs["text"]))
print(outputs["text"])

915
 So in college, I was a government major, which means I had to write a lot of papers. Now, when a normal student writes a paper, they might spread the work out a little like this. So, you know, you get started maybe a little slowly, but you get enough done in the first week that with some heavier days later on, everything gets done and things stay civil. And I would want to do that, like that. That would be the plan. I would have it all ready to go, but then actually the paper would come along, and then I would kind of do this. And that would happen to every single paper. But then came my 90-page senior thesis, a paper you're supposed to spend a year on. I knew for a paper like that, my normal workflow was not an option. It was way too big a project. So I planned things out, and I decided it kind of had to go something like this. This is how the year would go. So I'd start off light, and I'd bump it up
time: 5.96 s (started: 2024-01-16 13:09:29 -05:00)


#### Test-2

In [11]:
# test-2 (mp3)  - vram usage 13 GB
outputs = pipe(audiofile2_2hr07min,chunk_length_s=30)

time: 9min 16s (started: 2024-01-16 13:11:14 -05:00)


In [12]:
print(len(outputs["text"]))
print(outputs["text"])

91150
 Thank you. to 5 o'clock, we will be presenting from our side and followed by 30-minute question session for the members of the media. The questions from analysts and investors will be accepted from 5.30 to 6 o'clock Japan time. Please be aware of that. Now, we will be collecting questions via telephone conferencing system. As is informed to you beforehand, the conference call system will require the pre-registration beforehand. Let me introduce the presenter today. Corporate Senior Executive Vice President Mamoru Hatazawa. Representative Executive Officer, Corporate Executive Vice President and CFO, Masayoshi Hirata. We have a chairperson of the Strategic Review Committee outside director, Paul Brough. He is joining from Hong Kong online. My name is Hara of Corporate Communication Department. We are providing simultaneous translation, so if you are watching the live streaming in Japanese, you will be able to hear translation voice. Please be aware of that. First, before going in

#### Test-3

In [13]:
# test-3 (flac)
outputs = pipe(audiofile2_2hr30min,chunk_length_s=30)

time: 13min 51s (started: 2024-01-16 13:27:39 -05:00)


In [19]:
print(len(outputs["text"]))
print(outputs["text"])

128758
 We have been a misunderstood and badly mocked org for a long time. When we started, we announced the org at the end of 2015 and said we were going to work on AGI. People thought we were batshit insane. I remember at the time, an eminent AI scientist at a large industrial AI lab was like DMing individual reporters being like, you know, these people aren't very good and it's ridiculous to talk about AGI and I can't believe you're giving them time of day. And it's like, that was the level of like pettiness and rancor in the field at a new group of people saying we're going to try to build AGI. So OpenAI and DeepMind was a small collection of folks who were brave enough to talk about AGI in the face of mockery. We don't get mocked as much now. Don't get mocked as much now. of OpenAI, the company behind GPT-4, JAD-GPT, DALI, Codex, and many other AI technologies, which both individually and together constitute some of the greatest breakthroughs in the history of artificial intellige

#### Whisper Large-v3 (Baseline) Result

In [20]:
## audiofile1_60s ##
## ------------------------
# time: 5.96 s (started: 2024-01-16 13:09:29 -05:00)

## audiofile2_2hr07min ##
## ------------------------
# time: 9min 16s (started: 2024-01-16 13:11:14 -05:00)

## audiofile2_2hr30min ##
## ------------------------
# time: 13min 51s (started: 2024-01-16 13:27:39 -05:00)


time: 207 µs (started: 2024-01-16 13:42:18 -05:00)
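
To make these timings comparable across files, one option is to express them as a real-time factor (processing time divided by audio duration). The sketch below assumes the audio durations implied by the variable names (60 s, 2 h 07 min, 2 h 30 min); the helpers `parse_autotime` and `real_time_factor` are illustrative names, not part of autotime or any other library.

```python
import re

# Seconds per unit, for the unit strings that autotime prints.
UNIT_SECONDS = {"h": 3600.0, "min": 60.0, "s": 1.0, "ms": 1e-3, "µs": 1e-6}


def parse_autotime(text: str) -> float:
    """Convert an autotime string such as '9min 16s' or '5.96 s' to seconds."""
    total = 0.0
    for value, unit in re.findall(r"([\d.]+)\s*(h|min|ms|µs|s)\b", text):
        total += float(value) * UNIT_SECONDS[unit]
    return total


def real_time_factor(processing_s: float, audio_s: float) -> float:
    """Processing time per second of audio; below 1.0 means faster than real time."""
    return processing_s / audio_s


# Assumed durations, taken from the variable names rather than measured.
AUDIO_SECONDS = {
    "audiofile1_60s": 60,
    "audiofile2_2hr07min": 2 * 3600 + 7 * 60,
    "audiofile2_2hr30min": 2 * 3600 + 30 * 60,
}

# Timings copied from the result cell above.
TIMINGS = {
    "audiofile1_60s": "5.96 s",
    "audiofile2_2hr07min": "9min 16s",
    "audiofile2_2hr30min": "13min 51s",
}

for name, timing in TIMINGS.items():
    rtf = real_time_factor(parse_autotime(timing), AUDIO_SECONDS[name])
    print(f"{name}: RTF = {rtf:.3f} ({1 / rtf:.1f}x faster than real time)")
```

The same two helpers can be reused for the result cells of the other backends by swapping in their timing strings.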


With the model loaded, each test above passes an audio file path to the pipeline and gets back the recognized text.

## Distilled Whisper
https://huggingface.co/distil-whisper/distil-large-v2

Distil-Whisper is a distilled version of the Whisper model that is 6 times faster, 49% smaller, and performs within 1% WER of the original on out-of-distribution evaluation sets. The tests below use distil-large-v2, a distilled variant of Whisper large-v2.
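
For reference, WER (word error rate) is the word-level edit distance between a hypothesis and a reference transcript, divided by the number of reference words. Below is a minimal sketch; the lowercase-and-strip-punctuation normalization is an assumption made here for brevity, and published benchmarks typically use a more elaborate normalizer. The example treats the Whisper large-v3 output as the reference purely for illustration, since this notebook has no ground-truth transcripts.

```python
import re
from typing import List


def normalize(text: str) -> List[str]:
    """Lowercase, replace punctuation (except apostrophes) with spaces, split into words."""
    return re.sub(r"[^\w\s']", " ", text.lower()).split()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed ref words and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One sentence from the two transcripts in this notebook.
large_v3 = "And that would happen to every single paper."
distilled = "And that would happen every single paper."
print(word_error_rate(large_v3, distilled))  # 0.125 (one missing word out of eight)
```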

```
pip install --upgrade pip
pip install --upgrade transformers accelerate datasets[audio]
```

In [6]:
#!pip install --upgrade pip
#!pip install --upgrade transformers accelerate

### large audio files
# !wget https://huggingface.co/datasets/reach-vb/random-audios/resolve/main/4469669.mp3
# !wget https://huggingface.co/datasets/reach-vb/random-audios/resolve/main/ted_60.wav
# !wget https://huggingface.co/datasets/reach-vb/random-audios/resolve/main/sam_altman_lex_podcast_367.flac

audiofile1_60s="ted_60.wav"
audiofile2_2hr30min="sam_altman_lex_podcast_367.flac" 
audiofile2_2hr07min="4469669.mp3" 


%load_ext autotime

The autotime extension is already loaded. To reload it, use:
  %reload_ext autotime
time: 453 µs (started: 2024-01-16 14:22:57 -05:00)


In [42]:
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "distil-whisper/distil-large-v2"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    torch_dtype=torch_dtype,
    device=device,
)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


time: 4.1 s (started: 2024-01-16 14:01:51 -05:00)


In [52]:
##TEST-1
result = pipe(audiofile1_60s)

time: 1.04 s (started: 2024-01-16 14:05:19 -05:00)


In [53]:
print(len(result["text"]), result["text"])

921  So in college, I was a government major, which means I had to write a lot of papers. Now, when a normal student writes a paper, they might spread the work out a little like this. So, you know, you get started maybe a little slowly, but you get enough done in the first week that, with some heavier days later on, everything gets done, and things stay civil. And I would want to do that like that. That would be the plan. I would have it all ready to go, but then actually the paper would come along, and then I would kind of do this. And that would happen every single paper. But then came my 90-page senior thesis. A paper you were supposed to spend a year on. And I knew for a paper like that, my normal workflow was not an option, it was way too big a project. So I planned things out, and I decided, it kind of had to go something like this. This is how the year would go. So I'd start off light, And I'd bump it up.
time: 246 µs (started: 2024-01-16 14:05:21 -05:00)


In [54]:
##TEST-2
result = pipe(audiofile2_2hr07min)

time: 1min 22s (started: 2024-01-16 14:05:32 -05:00)


In [55]:
print(len(result["text"]), result["text"])

92453  Now, it's time. May I start the presentation on transferring Toshibato to enhance shareholders' value and FY21 second quarter consolidated business results. We are organized this presentation session on online basis. From 4 to 5 o'clock, we will be presenting from our side, and followed by 30 minutes question session for the media. The questions from analysts and investors will be accepted from 530 to 6 o'clock Japan time, please be aware of that. Now, we will be collecting questions via telephone conferencing system. As is informed to you beforehand, the conference call system will require the pre-registration beforehand. Let me introduce the presenter today. and CEO Satoshi Tuna Kawa. Corporate Senior Executive Vice President, Mamur Hatazawa. Representative this is an officer, Corporate Executive Vice President, and CFO Masayoshi Hirata. We have a chairperson of Strategic Review Committee outside director, Paul Broff. He is joining from Hong Kong on online. the chairperson of 

In [56]:
##TEST-3
result = pipe(audiofile2_2hr30min)

time: 1min 49s (started: 2024-01-16 14:06:55 -05:00)


In [57]:
print(len(result["text"]), result["text"])

127487  We have been a misunderstood and badly mocked org for a long time. Like when we started, we like announced the org at the end of 2015 and said we were going to work on AGI. Like people thought we were bad shit insane. Yeah. You know, like I remember at the time a eminent AI scientist at a large industrial AI lab was like DMing individual reporters being like, you know, you know, these people aren't very good and. and it's ridiculous to talk about AGI, and I can't believe you're giving them the time of day, and it's like, that was the level of like, pettiness and rancor in the field at a new group of people saying, we're gonna try to build AGI. So open AI and deep mind was a small collection of folks who are brave enough to talk about AGI, in the face of mockery. We don't get mocked as much now. Don't get mocked as much now. The following is a conversation with Sam Altman, CEO of Open AI, the company behind GPT, JADGPT, Dolly, Codex, and many other AI which both individually and

#### Distilled Whisper Result

In [3]:
## audiofile1_60s ##
## ------------------------
# time: 1.04 s (started: 2024-01-16 14:05:19 -05:00)

## audiofile2_2hr07min ##
## ------------------------
# time: 1min 22s (started: 2024-01-16 14:05:32 -05:00)

## audiofile2_2hr30min ##
## ------------------------
# time: 1min 49s (started: 2024-01-16 14:06:55 -05:00)


time: 263 µs (started: 2024-01-16 14:22:21 -05:00)


## Speculative Decoding
Distil-Whisper can be used as an assistant model to Whisper for speculative decoding. Speculative decoding mathematically guarantees the same outputs as Whisper while being about 2 times faster, which makes it a drop-in replacement for existing Whisper pipelines.

For speculative decoding, we need to load both the teacher, openai/whisper-large-v2, and the assistant (a.k.a. the student), distil-whisper/distil-large-v2.
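
To see why the outputs match the teacher's, here is a toy sketch of the greedy variant of speculative decoding. It is not the transformers implementation: `teacher_next` and `assistant_next` are made-up deterministic functions standing in for the two models, and there is no end-of-sequence handling. The assistant drafts up to `k` tokens, the teacher checks them, and only teacher-approved tokens are kept.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def teacher_next(seq: List[int]) -> int:
    """Hypothetical 'large' model: a deterministic next-token rule."""
    return (7 * seq[-1] + len(seq)) % 13


def assistant_next(seq: List[int]) -> int:
    """Hypothetical 'small' model: copies the teacher except at every fourth position."""
    if len(seq) % 4 == 0:
        return 0
    return teacher_next(seq)


def greedy_decode(next_token: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Plain greedy decoding: one model call per generated token."""
    seq = list(prompt)
    for _ in range(max_new):
        seq.append(next_token(seq))
    return seq


def speculative_decode(
    teacher: NextToken, assistant: NextToken, prompt: List[int], max_new: int, k: int = 4
) -> Tuple[List[int], int]:
    """Greedy speculative decoding; returns the sequence and the number of verification rounds."""
    seq = list(prompt)
    limit = len(prompt) + max_new
    verify_rounds = 0
    while len(seq) < limit:
        # 1. The assistant cheaply drafts up to k tokens.
        draft: List[int] = []
        for _ in range(min(k, limit - len(seq))):
            draft.append(assistant(seq + draft))
        # 2. The teacher verifies the draft. A real model scores all drafted
        #    positions in one forward pass; here we call it position by position.
        verify_rounds += 1
        for drafted in draft:
            verified = teacher(seq)
            seq.append(verified)  # always keep the teacher's own token
            if verified != drafted:
                break  # first mismatch: discard the rest of the draft
    return seq, verify_rounds


prompt = [1]
reference = greedy_decode(teacher_next, prompt, max_new=20)
output, rounds = speculative_decode(teacher_next, assistant_next, prompt, max_new=20)
assert output == reference  # identical to teacher-only decoding
print(f"{len(output) - len(prompt)} tokens in {rounds} verification rounds")
```

Every token appended is the teacher's own choice, so the result cannot differ from teacher-only greedy decoding. Any speed-up comes from needing fewer teacher passes than tokens, which depends on how often the assistant's draft is accepted.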

In [7]:
### large audio files
# !wget https://huggingface.co/datasets/reach-vb/random-audios/resolve/main/4469669.mp3
# !wget https://huggingface.co/datasets/reach-vb/random-audios/resolve/main/ted_60.wav
# !wget https://huggingface.co/datasets/reach-vb/random-audios/resolve/main/sam_altman_lex_podcast_367.flac

audiofile1_60s="ted_60.wav"
audiofile2_2hr30min="sam_altman_lex_podcast_367.flac" 
audiofile2_2hr07min="4469669.mp3" 

%load_ext autotime

The autotime extension is already loaded. To reload it, use:
  %reload_ext autotime
time: 454 µs (started: 2024-01-16 14:23:04 -05:00)


In [8]:
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor,pipeline
import torch

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v2"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

time: 2.1 s (started: 2024-01-16 14:23:12 -05:00)


Now let's load the assistant. Since Distil-Whisper shares exactly the same encoder as the teacher model, we only need to load the 2-layer decoder as a "Decoder-only" model:

In [9]:
from transformers import AutoModelForCausalLM
assistant_model_id = "distil-whisper/distil-large-v2"

assistant_model = AutoModelForCausalLM.from_pretrained(
    assistant_model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
assistant_model.to(device)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    generate_kwargs={"assistant_model": assistant_model},
    torch_dtype=torch_dtype,
    device=device,
)

time: 549 ms (started: 2024-01-16 14:23:17 -05:00)


In [10]:
##TEST-1
result = pipe(audiofile1_60s)

time: 4.06 s (started: 2024-01-16 14:23:21 -05:00)


In [11]:
print(len(result["text"]), result["text"])

911  So in college, I was a government major, which means I had to write a lot of papers. Now, when a normal student writes a paper, they might spread the work out a little like this. So, you know. You get started maybe a little slowly, but you get enough done in the first week that with some heavier days later on, everything gets done and things stay civil. And I would want to do that like that. That would be the plan. I would have it all ready to go, but then actually the paper would come along, and then I would kind of do this. And that would happen every single paper. But then came my 90-page senior thesis, a paper you're supposed to spend a year on. I knew for a paper like that, my normal workflow was not an option, it was way too big a project. So I planned things out, and I decided I kind of had to go something like this. This is how the year would go. So I'd start off light, And I'd bump it up.
time: 214 µs (started: 2024-01-16 14:23:25 -05:00)


In [12]:
##TEST-2
result = pipe(audiofile2_2hr07min)

time: 3min 59s (started: 2024-01-16 14:23:27 -05:00)


In [13]:
print(len(result["text"]), result["text"])

91447  Now it is time. May I start the presentation on Transforming Toshiba to Enhance Shareholder's Value and FY21 Second Quarter Consolidated Business Results. We are organizing this presentation session on an online basis. From four to five o'clock we will be presenting from our side and followed by a 30-minute question session for the media. The questions from analysts and investors will be accepted from 5.30 to 6 o'clock Japan time. Please be aware of that. Now we will be collecting questions via telephone conferencing system. As is informed to you beforehand the conference call system will require the pre-registration beforehand. Let me introduce the presenter today. and CEO Satoshi Tsunakawa. Corporate Senior Executive Vice President Mamoru Hatazawa. Representative Executive Officer Corporate Executive Vice President and CFO Masayoshi Hirata. We have a chairperson of the Strategic Review Committee Outside Director, Paul Brough. He is joining from Hong Kong online. My name is Har

In [14]:
##TEST-3
result = pipe(audiofile2_2hr30min)

time: 5min 45s (started: 2024-01-16 14:27:26 -05:00)


In [15]:
print(len(result["text"]), result["text"])

126431  We have been a misunderstood and badly mocked org for a long time. Like when we started, we like announced the org at the end of 2015 and said we were going to work on AGI, like people thought we were batshit insane. You know, like I remember at the time a eminent AI scientist at a large industrial AI lab was like DMing individual reporters being like, you know, these people aren't very good, and it's ridiculous to talk about AGI, and I can't believe you're giving them time of day, and it's like, that was the level of like pettiness and rancor in the field at a new group of people saying we're going to try to build AGI. So OpenAI and DeepMind was a small collection of folks who were brave enough to talk about AGI in the face of mockery. We don't get mocked as much now. And don't get mocked as much now. The following is a conversation with Sam Altman, CEO of OpenAI, the company behind GPT-4, JAD-GPT, DALI, Codex, and many other AI technologies, which both individually and togeth

#### Speculative Decoding Result

In [16]:
## audiofile1_60s ##
## ------------------------
#time: 4.06 s (started: 2024-01-16 14:23:21 -05:00)

## audiofile2_2hr07min ##
## ------------------------
#time: 3min 59s (started: 2024-01-16 14:23:27 -05:00)

## audiofile2_2hr30min ##
## ------------------------
#time: 5min 45s (started: 2024-01-16 14:27:26 -05:00)


time: 198 µs (started: 2024-01-16 14:37:08 -05:00)


## Whisper.cpp
https://github.com/ggerganov/whisper.cpp

High-performance inference of OpenAI's Whisper automatic speech recognition (ASR) model.


In [23]:
!git clone https://github.com/ggerganov/whisper.cpp
%cd whisper.cpp

fatal: destination path 'whisper.cpp' already exists and is not an empty directory.
/home/pop/development/colab/InsightSolver-Colab/whisper.cpp
time: 357 ms (started: 2024-01-16 14:57:25 -05:00)


##### Install g++ (C++ compiler)

In [24]:
#!apt-get install g++

time: 175 µs (started: 2024-01-16 14:57:30 -05:00)


##### Download the model

In [25]:
!bash ./models/download-ggml-model.sh large-v3

Downloading ggml model large-v3 from 'https://huggingface.co/ggerganov/whisper.cpp' ...
Model large-v3 already exists. Skipping download.
time: 366 ms (started: 2024-01-16 14:57:32 -05:00)


##### make with GPU (using CUBLAS)

In [26]:
!make clean
!WHISPER_CUBLAS=1 make -j

I whisper.cpp build info: 
I UNAME_S:  Linux
I UNAME_P:  x86_64
I UNAME_M:  x86_64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -pthread -mavx -mavx2 -mfma -mf16c -msse3 -mssse3
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -pthread -mavx -mavx2 -mfma -mf16c -msse3 -mssse3
I LDFLAGS:  
I CC:       cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
I CXX:      g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

rm -f *.o main stream command talk talk-llama bench quantize server lsp libwhisper.a libwhisper.so
I whisper.cpp build info: 
I UNAME_S:  Linux
I UNAME_P:  x86_64
I UNAME_M:  x86_64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -pthread -mavx -mavx2 -mfma -mf16c -msse3 -mssse3 -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -D_XOPEN_SOURCE=600 -D_GNU_SOU

#### Testing

#### Prepare data (16 kHz, 16-bit PCM WAV)

In [27]:
## convert to 16 kHz 16-bit PCM WAV (add `-ac 1` to also downmix to mono)
!ffmpeg -i ../ted_60.wav -acodec pcm_s16le -ar 16000 ted_60_2.wav -y
!ffmpeg -i ../sam_altman_lex_podcast_367.flac -acodec pcm_s16le -ar 16000 sam_altman_lex_podcast_367_2.wav -y
!ffmpeg -i ../4469669.mp3 -acodec pcm_s16le -ar 16000 4469669.wav -y

ffmpeg version 4.4.2-0ubuntu0.22.04.1 Copyright (c) 2000-2021 the FFmpeg developers
  built with gcc 11 (Ubuntu 11.2.0-19ubuntu1)
  configuration: --prefix=/usr --extra-version=0ubuntu0.22.04.1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --enable-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libdav1d --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libjack --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librabbitmq --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libsrt --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvorbis --enable-libvpx --enab
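
The whisper.cpp `main` example expects 16-bit WAV input sampled at 16 kHz, so it can be worth checking the converted files before running the tests. A minimal sketch using only the standard library follows; `describe_wav`, `is_whisper_cpp_ready`, and the demo helper `write_silence` are names made up for this example, and the demo uses a generated silent file so it runs without the real audio.

```python
import os
import tempfile
import wave


def describe_wav(path: str) -> dict:
    """Read the header fields of a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "sample_rate": wav.getframerate(),
            "bits_per_sample": wav.getsampwidth() * 8,
            "duration_s": wav.getnframes() / wav.getframerate(),
        }


def is_whisper_cpp_ready(info: dict) -> bool:
    """True if the file is 16-bit PCM at 16 kHz (the format produced by the ffmpeg calls above)."""
    return info["sample_rate"] == 16000 and info["bits_per_sample"] == 16


def write_silence(path: str, sample_rate: int, seconds: float = 1.0, channels: int = 1) -> None:
    """Demo helper: write a silent 16-bit PCM WAV file."""
    frames = int(sample_rate * seconds)
    with wave.open(path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 2 bytes = 16 bits per sample
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * frames * channels)


with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "demo_16k.wav")
    write_silence(demo, 16000)
    info = describe_wav(demo)
    print(info, "ready:", is_whisper_cpp_ready(info))
```

In the notebook, the same check would be `is_whisper_cpp_ready(describe_wav("ted_60_2.wav"))`, and likewise for the other two converted files.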

In [28]:
## TEST-1
!./main -m models/ggml-large-v3.bin -f ted_60_2.wav

whisper_init_from_file_with_params_no_state: loading model from 'models/ggml-large-v3.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51866
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head  = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1280
whisper_model_load: n_text_head   = 20
whisper_model_load: n_text_layer  = 32
whisper_model_load: n_mels        = 128
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 5 (large v3)
whisper_model_load: adding 1609 extra tokens
whisper_model_load: n_langs       = 100
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
whisper_backend_init: using CUDA backend
whisper_model_load:    C

In [29]:
## TEST-2
!./main -m models/ggml-large-v3.bin -f 4469669.wav

whisper_init_from_file_with_params_no_state: loading model from 'models/ggml-large-v3.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51866
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head  = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1280
whisper_model_load: n_text_head   = 20
whisper_model_load: n_text_layer  = 32
whisper_model_load: n_mels        = 128
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 5 (large v3)
whisper_model_load: adding 1609 extra tokens
whisper_model_load: n_langs       = 100
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
whisper_backend_init: using CUDA backend
whisper_model_load:    C

In [30]:
## TEST-3
!./main -m models/ggml-large-v3.bin -f sam_altman_lex_podcast_367_2.wav

whisper_init_from_file_with_params_no_state: loading model from 'models/ggml-large-v3.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51866
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head  = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1280
whisper_model_load: n_text_head   = 20
whisper_model_load: n_text_layer  = 32
whisper_model_load: n_mels        = 128
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 5 (large v3)
whisper_model_load: adding 1609 extra tokens
whisper_model_load: n_langs       = 100
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
whisper_backend_init: using CUDA backend
whisper_model_load:    C

#### Whisper.cpp Result

In [None]:
## audiofile1_60s ## ted_60_2.wav
## ------------------------
# time: 5.05 s (started: 2024-01-16 14:58:35 -05:00)

## audiofile2_2hr07min ## 4469669.wav
## ------------------------
#time: 4min 27s (started: 2024-01-16 14:59:04 -05:00)

## audiofile2_2hr30min ## sam_altman_lex_podcast_367_2.wav
## ------------------------
# time: 8min 57s (started: 2024-01-16 13:49:51 -05:00)


#### whisper.cpp Python binding

In [None]:
#!pip uninstall whispercpp -y  (not working)

# Cython binding - works
#!pip install -q git+https://github.com/stlukey/whispercpp.py
#https://github.com/stlukey/whispercpp.py/blob/main/whispercpp.pyx

In [None]:
# from whispercpp import Whisper
# w = Whisper('medium')

In [None]:
# ## Why so slow? This took about one minute.
# result = w.transcribe(audiofile1_60s)
# text = w.extract_text(result)
# text

## Faster Whisper

In [2]:
!pip install -q ipython-autotime
%load_ext autotime
#!pip install -q faster-whisper

The autotime extension is already loaded. To reload it, use:
  %reload_ext autotime
time: 2.43 s (started: 2024-01-16 15:38:22 -05:00)


In [3]:
from faster_whisper import WhisperModel

model_size = "large-v3"
# Run on GPU with FP16
model = WhisperModel(model_size, device="cuda", compute_type="float16")



time: 4.97 s (started: 2024-01-16 15:38:26 -05:00)


In [4]:
##TEST-1 
segments, info = model.transcribe("ted_60.wav", beam_size=5)
print("Detected language '%s' with probability %f" % (info.language, info.language_probability))
for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))

Detected language 'en' with probability 0.997559
[0.00s -> 4.48s]  So in college, I was a government major,
[4.90s -> 6.64s]  which means I had to write a lot of papers.
[7.42s -> 8.86s]  Now, when a normal student writes a paper,
[8.94s -> 10.60s]  they might spread the work out a little like this.
[11.74s -> 16.30s]  So, you know, you get started maybe a little slowly,
[16.36s -> 17.86s]  but you get enough done in the first week
[17.86s -> 19.76s]  that with some heavier days later on,
[20.28s -> 21.98s]  everything gets done and things stay civil.
[23.64s -> 25.80s]  And I would want to do that like that.
[26.12s -> 26.94s]  That would be the plan.
[26.94s -> 29.84s]  I would have it all ready to go,
[29.84s -> 32.42s]  but then actually the paper would come along,
[32.46s -> 33.62s]  and then I would kind of do this.
[36.52s -> 38.46s]  And that would happen to every single paper.
[39.36s -> 43.04s]  But then came my 90-page senior thesis,
[43.54s -> 45.20s]  a paper you're suppos
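
In faster-whisper, `segments` is a generator, so the transcription itself runs while the `for` loop iterates. Since each segment carries `start`, `end`, and `text`, the same loop can produce a subtitle file instead of printed lines. Below is a sketch of that idea; `Segment` is a stand-in type defined here so the example runs without a model, and `srt_timestamp` and `to_srt` are illustrative helpers, not part of faster-whisper.

```python
from typing import Iterable, NamedTuple


class Segment(NamedTuple):
    """Stand-in for a faster-whisper segment (only the fields used here)."""
    start: float
    end: float
    text: str


def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(segments: Iterable[Segment]) -> str:
    """Build SRT text from objects that expose .start, .end and .text."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n{srt_timestamp(seg.start)} --> {srt_timestamp(seg.end)}\n{seg.text.strip()}\n"
        )
    return "\n".join(blocks)


# The first two segments from the output above.
demo_segments = [
    Segment(0.00, 4.48, " So in college, I was a government major,"),
    Segment(4.90, 6.64, " which means I had to write a lot of papers."),
]
print(to_srt(demo_segments))
```

With a real model, `to_srt(segments)` would consume the generator returned by `model.transcribe(...)` directly. Because the generator can only be iterated once, it should either be printed or converted, not both.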

In [6]:
##TEST-2
segments, info = model.transcribe("4469669.mp3", beam_size=5)
print("Detected language '%s' with probability %f" % (info.language, info.language_probability))
for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))

Estimating duration from bitrate, this may be inaccurate


Detected language 'en' with probability 0.993652
[0.00s -> 12.00s]  Now, it's time. May I start the presentation on Transforming Toshiba to Enhance Shares
[12.00s -> 17.36s]  Value and FY21 Second Quarter Consultative Business Results. We are organizing this presentation
[17.36s -> 24.12s]  session on an online basis. From 4 to 5 o'clock, we will be presenting from our side and followed
[24.12s -> 29.78s]  by a 30-minute question session for the members of the media. The questions from analysts
[29.78s -> 35.54s]  and investors will be accepted from 5.30 to 6 o'clock Japan time. Please be aware of that.
[36.04s -> 41.16s]  Now, we will be collecting questions via telephone conferencing system. As is informed to you
[41.16s -> 47.06s]  beforehand, the conference call system will require the pre-registration beforehand.
[47.06s -> 54.10s]  Let me introduce the presenter today. President Ntse.
[54.12s -> 64.12s]  CEO, Satoshi Tsunakawa. Corporate Senior Executive Vice President, Mamoru Ha

In [5]:
##TEST-3
segments, info = model.transcribe("sam_altman_lex_podcast_367.flac", beam_size=5)
print("Detected language '%s' with probability %f" % (info.language, info.language_probability))
for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))

Detected language 'en' with probability 1.000000
[0.00s -> 3.60s]  We have been a misunderstood and badly mocked org for a long time.
[3.60s -> 11.90s]  When we started, we announced the org at the end of 2015 and said we were going to work on AGI.
[12.48s -> 14.32s]  People thought we were batshit insane.
[15.38s -> 26.46s]  I remember at the time, an eminent AI scientist at a large industrial AI lab was DMing individual reporters,
[26.46s -> 31.30s]  being like, you know, these people aren't very good, and it's ridiculous to talk about AGI,
[31.30s -> 33.04s]  and I can't believe you're giving them time of day.
[33.18s -> 37.22s]  And it's like, that was the level of, like, pettiness and rancor in the field
[37.22s -> 39.42s]  at a new group of people saying we're going to try to build AGI.
[40.12s -> 50.10s]  So OpenAI and DeepMind was a small collection of folks who were brave enough to talk about AGI in the face of mockery.
[50.74s -> 52.12s]  We don't get mocked as much now.
[52.

In [None]:
# Run on GPU with FP16
# time: 44.7 s (started: 2024-01-16 11:59:01 -05:00)
# or run on GPU with INT8
# time: 50.2 s (started: 2024-01-16 12:01:22 -05:00)


#### Faster-Whisper Result

In [None]:
## audiofile1_60s ## ted_60.wav
## ------------------------
#time: 4.25 s (started: 2024-01-16 15:38:34 -05:00)

## audiofile2_2hr07min ## 4469669.mp3
## ------------------------
## time: 7min 45s (started: 2024-01-16 15:48:48 -05:00)

## audiofile2_2hr30min ## sam_altman_lex_podcast_367.flac
## ------------------------
## time: 10min 5s (started: 2024-01-16 15:38:43 -05:00)
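
With all five result cells recorded, the timings can be put side by side. The sketch below hard-codes the wall-clock times from the result cells in this notebook and prints each backend's speed-up relative to the Hugging Face Whisper large-v3 pipeline; the `speedup` and `format_row` helpers are illustrative names.

```python
# Wall-clock seconds from the result cells: (60 s file, 2 h 07 min file, 2 h 30 min file).
FILES = ("60 s", "2 h 07 min", "2 h 30 min")
TIMINGS_S = {
    "Whisper large-v3 (HF pipeline)": (5.96, 9 * 60 + 16, 13 * 60 + 51),
    "Distil-Whisper distil-large-v2": (1.04, 1 * 60 + 22, 1 * 60 + 49),
    "Speculative decoding (large-v2)": (4.06, 3 * 60 + 59, 5 * 60 + 45),
    "whisper.cpp large-v3": (5.05, 4 * 60 + 27, 8 * 60 + 57),
    "faster-whisper large-v3": (4.25, 7 * 60 + 45, 10 * 60 + 5),
}
BASELINE = "Whisper large-v3 (HF pipeline)"


def speedup(baseline_s: float, other_s: float) -> float:
    """How many times faster than the baseline a run was."""
    return baseline_s / other_s


def format_row(name: str, times: tuple, baseline: tuple) -> str:
    """One table row: elapsed seconds and speed-up per file."""
    cells = [f"{t:7.1f}s ({speedup(b, t):4.1f}x)" for t, b in zip(times, baseline)]
    return f"{name:<32}" + "  ".join(cells)


print(f"{'backend':<32}" + "  ".join(f"{label:>16}" for label in FILES))
for name, times in TIMINGS_S.items():
    print(format_row(name, times, TIMINGS_S[BASELINE]))
```

These numbers are not a like-for-like benchmark: the backends run different checkpoints (large-v3, large-v2, distil-large-v2), the faster-whisper runs use `beam_size=5`, and the whisper.cpp cells include process start-up and model loading.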


## WhisperX (skipped due to a library conflict)

In [None]:
# first-time install
#!pip install git+https://github.com/m-bain/whisperx.git
# upgrade install
#!pip install git+https://github.com/m-bain/whisperx.git --upgrade

In [None]:
# import whisperx
# import gc 

# device = "cuda" 
# audio_file = "audio.mp3"
# batch_size = 16 # reduce if low on GPU mem
# compute_type = "float16" # change to "int8" if low on GPU mem (may reduce accuracy)

# # 1. Transcribe with original whisper (batched)
# model = whisperx.load_model("large-v2", device, compute_type=compute_type)

# # save model to local path (optional)
# # model_dir = "/path/"
# # model = whisperx.load_model("large-v2", device, compute_type=compute_type, download_root=model_dir)

# audio = whisperx.load_audio(audio_file)
# result = model.transcribe(audio, batch_size=batch_size)
# print(result["segments"]) # before alignment

# # delete model if low on GPU resources
# # import gc, torch; del model; gc.collect(); torch.cuda.empty_cache()

# # 2. Align whisper output
# model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
# result = whisperx.align(result["segments"], model_a, metadata, audio, device, return_char_alignments=False)

# print(result["segments"]) # after alignment

# # delete model if low on GPU resources
# # import gc, torch; del model_a; gc.collect(); torch.cuda.empty_cache()

# # 3. Assign speaker labels
# diarize_model = whisperx.DiarizationPipeline(use_auth_token=YOUR_HF_TOKEN, device=device)

# # add min/max number of speakers if known
# diarize_segments = diarize_model(audio)
# # diarize_model(audio, min_speakers=min_speakers, max_speakers=max_speakers)

# result = whisperx.assign_word_speakers(diarize_segments, result)
# print(diarize_segments)
# print(result["segments"]) # segments are now assigned speaker IDs