In [6]:
import numpy as np
from IPython.display import Audio
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

import pprint

import torch

In [7]:
pp = pprint.PrettyPrinter(indent=2)

In [8]:
device = "cuda:0" if torch.cuda.is_available() else "cpu"
device

'cuda:0'

In [9]:
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
torch_dtype

torch.float16

In [11]:
# let's see what we're working with
audio_fpath = "../data/sample_user.wav"
audio = Audio(audio_fpath)
display(audio)

In [12]:
model_id = "openai/whisper-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

asr_pipeline = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)


In [13]:
transcription = asr_pipeline(audio_fpath)

Due to a bug fix in https://github.com/huggingface/transformers/pull/28687 transcription using a multilingual Whisper will default to language detection followed by transcription instead of translation to English.This might be a breaking change for your use case. If you want to instead always translate your audio to English, make sure to pass `language='en'`.
Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.43.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.


In [14]:
pp.pprint(transcription)

{ 'text': " Hmm, that's interesting. Maybe I can use a for loop to iterate "
          'through the elements. Then I might want to try and store some sort '
          'of cache to see what elements I can match later on.'}


## Thoughts
- Decent speed: ~2.5 seconds to process a 24-second `wav` file (roughly 10x faster than real time)
- Look into [WhisperX](https://github.com/m-bain/whisperX) for synchronous, real-time transcription
- Speech-to-text seems to be a lot more developed in open source than text-to-speech
- Maybe consider a different runtime, such as ONNX Runtime, for even better performance
- Will need to stream in audio somehow, though
    * Similar to how Zoom does live audio transcription
    * [Real Time Whisper](https://github.com/davabase/whisper_real_time)
    * [Wav2Vec2-Live](https://github.com/oliverguhr/wav2vec2-live)
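
One simple way to approximate the streaming idea above is to cut the incoming audio into short, overlapping windows and transcribe each window as it arrives. The sketch below is only an illustration of that windowing step, not how any of the linked projects work: `iter_chunks`, `chunk_s`, and `overlap_s` are hypothetical names, and the 5 second window with 1 second of overlap is an arbitrary assumption.

```python
from typing import Iterator, Sequence, Tuple


def iter_chunks(
    samples: Sequence,
    sample_rate: int,
    chunk_s: float = 5.0,    # assumed window length, not tuned
    overlap_s: float = 1.0,  # assumed overlap between neighbouring windows
) -> Iterator[Tuple[float, Sequence]]:
    """Yield (start_time_in_seconds, window) pairs over `samples`.

    Works on anything sliceable with a length (a list, a NumPy array, ...).
    The overlap is there so that a word cut at one window boundary is fully
    contained in the next window.
    """
    chunk_len = int(chunk_s * sample_rate)
    step = chunk_len - int(overlap_s * sample_rate)
    if chunk_len <= 0 or step <= 0:
        raise ValueError("chunk_s must be positive and larger than overlap_s")

    for start in range(0, len(samples), step):
        yield start / sample_rate, samples[start:start + chunk_len]
        # Stop once a window has reached the end of the audio.
        if start + chunk_len >= len(samples):
            break


# Toy stand-in for a 24-second clip at 10 samples per second.
fake_audio = list(range(24 * 10))
windows = list(iter_chunks(fake_audio, sample_rate=10))
print(len(windows))  # 6 windows, starting at 0, 4, 8, 12, 16 and 20 seconds
```

With real audio, each window could probably be handed to `asr_pipeline` as a dict of the form `{"raw": window, "sampling_rate": sample_rate}`, with `window` as a float NumPy array. Stitching the overlapping transcripts back together is a separate problem that this sketch does not address.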