In [1]:
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset


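# Use the first GPU when available, otherwise fall back to the CPU
device = "cuda:0" if torch.cuda.is_available() else "cpu"
# Half precision on GPU, full precision on CPU
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3"

# Load the checkpoint from safetensors while keeping peak CPU memory low
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

# The processor bundles the tokenizer and the audio feature extractor
processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,  # cap on generated tokens per chunk
    chunk_length_s=25,  # split long audio into 25-second chunks
    batch_size=16,  # number of chunks transcribed per batch
    torch_dtype=torch_dtype,
    # return_timestamps=True,
    device=device,
)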

# dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
# sample = dataset[0]["audio"]

# result = pipe("./audios/¿Cómo Podría 1 Trillón de Leones Ganarle al Sol？ [NRAbqNtUgdU].m4a")
# print(result["text"])




Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [2]:
result = pipe("audios/You blew your budget on WHAT？？ - Intel $5,000 Extreme Tech Upgrade [RUI1k-KHXNk].m4a")

Due to a bug fix in https://github.com/huggingface/transformers/pull/28687 transcription using a multilingual Whisper will default to language detection followed by transcription instead of translation to English.This might be a breaking change for your use case. If you want to instead always translate your audio to English, make sure to pass `language='en'`.
  attn_output = torch.nn.functional.scaled_dot_product_attention(
The attention mask is not set and cannot be inferred from input because pad token is same as eos token.As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
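
The first warning above says a multilingual Whisper checkpoint now detects the language before transcribing. A minimal sketch of pinning the language through the pipeline's `generate_kwargs` argument follows. The helper `whisper_generate_kwargs` is a hypothetical name introduced here for illustration, and the commented call assumes the `pipe` object from the first cell and a placeholder `audio_path`.

```python
def whisper_generate_kwargs(language=None, task=None):
    """Build a generate_kwargs dict for the ASR pipeline call.

    Hypothetical helper, not part of transformers. Whisper supports the
    tasks "transcribe" and "translate".
    """
    if task not in (None, "transcribe", "translate"):
        raise ValueError(f"unsupported task: {task!r}")
    kwargs = {}
    if language is not None:
        kwargs["language"] = language
    if task is not None:
        kwargs["task"] = task
    return kwargs


english_kwargs = whisper_generate_kwargs(language="english", task="transcribe")
print(english_kwargs)

# Assumed usage with the pipeline built in the first cell:
# result = pipe(audio_path, generate_kwargs=english_kwargs)
```

Leaving both arguments unset yields an empty dict, which keeps the default language-detection behavior described in the warning.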


In [5]:
result

{'text': " No! No! This is going great. We're back again with Intel Extreme Tech Upgrade where Intel has given $5,000 tech makeovers to Anthony, Riley, Dennis, tons of members of the team so far. And we've seen some great themes like VR gaming, extreme comfort, and minimalism. But's is unlike anything we've seen so far. James has what I would describe as the most honest theme. Hey, it's $5,000. I'm gonna take advantage of this to the greatest extent that I can. Don't you knock? Hi, my name is James and I'm head of writing at LMG. So I make sure the videos get made. Describing my current setup in one word, adequate or suboptimal. If you're a glass half full, half empty person, you could pick which way to go there. Not gonna lie, your setup is pretty sick already. What exactly are the pain points we're trying to address today? I have these giant tower speakers. My speakers are pretty good. I didn't think that it was worth changing them out with this budget. These are way, way, way bigger

In [7]:
# Rough word-count estimate, assuming about 5 characters per word
len(result["text"]) / 5

5685.4
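
The cell above divides the character count by 5, which looks like a rough word-count estimate under the assumption of about five characters per word. As a sketch, the estimate can be compared with a direct whitespace split. The function names are illustrative, and `sample_text` is just the opening of the transcript shown earlier.

```python
def estimate_words(text, chars_per_word=5):
    """Approximate the word count from the character count."""
    return len(text) / chars_per_word


def count_words(text):
    """Count whitespace-separated tokens."""
    return len(text.split())


sample_text = " No! No! This is going great."
print(estimate_words(sample_text))
print(count_words(sample_text))

# Assumed usage on the full transcript:
# print(count_words(result["text"]))
```

The split-based count needs no assumption about average word length, so it is likely the more reliable figure for a transcript.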

In [4]:
# import torch
# import gc

# # Free GPU memory
# torch.cuda.empty_cache()

# # Free all unreferenced objects
# gc.collect()
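
The commented-out cleanup above can be wrapped in a small function. This is a sketch under a few assumptions: `free_memory` is a hypothetical name, the garbage collector runs before the CUDA cache is emptied so that memory held by unreachable tensors can be released too, and the torch import is guarded so the function also runs where PyTorch is not installed.

```python
import gc


def free_memory():
    """Run the garbage collector, then release cached CUDA memory if possible.

    Returns the number of unreachable objects found by gc.collect().
    """
    collected = gc.collect()
    try:
        import torch
    except ImportError:
        return collected
    if torch.cuda.is_available():
        # Returns unused cached blocks to the driver; tensors that are
        # still referenced stay allocated.
        torch.cuda.empty_cache()
    return collected


print(free_memory())

# Assumed usage in this notebook: drop the large objects first.
# del pipe, model
# free_memory()
```

Memory held by objects that are still referenced, such as `pipe` or `model`, is not released until those references are deleted.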