> **Date:** 03/09/23
# Speech-to-text model

#### Goal:
Find a suitable model for speech-to-text transcription.


#### Resources:
- Creating YouTube Captions with Wav2Vec: [Colab notebook](https://colab.research.google.com/github/Muennighoff/ytclipcc/blob/main/wav2vec_youtube_captions.ipynb)
- Whisper Large V3: [Hugging Face model card](https://huggingface.co/openai/whisper-large-v3)

## Installing dependencies

In [None]:
!pip install transformers datasets moviepy torch librosa accelerate

## Testing Whisper Large V3:

In [None]:
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from IPython.display import Audio, display
from pathlib import Path
from datasets import load_dataset


device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=30,
    batch_size=16,
    return_timestamps=True,
    torch_dtype=torch_dtype,
    device=device,
)

In [None]:
# Load your own mp4 file, or fall back to a public audio sample
my_audio = "marti_test.mp4"

if not Path(my_audio).exists():
    print("Your file does not exist. Loading a public audio file instead: librispeech_long")
    public_file = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
    sample = public_file[0]["audio"]
    # Keep the waveform and sampling rate so the same audio can be played back below
    audio_array, sampling_rate = sample["array"], sample["sampling_rate"]
    result = pipe({"array": audio_array, "sampling_rate": sampling_rate})
    print(result["text"])
    display(Audio(audio_array, rate=sampling_rate))
else:
    result = pipe(my_audio)
    print(result["text"])
    display(Audio(my_audio))
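Because the pipeline is created with `return_timestamps=True`, the result should also carry a `chunks` list whose entries hold a `timestamp` pair and a `text` field. The sketch below shows one way those chunks could be turned into SRT-style captions. The helper names `format_timestamp` and `chunks_to_srt` are hypothetical, the example chunks are hand-written rather than real model output, and the handling of a missing end time is an assumption.

```python
def format_timestamp(seconds: float) -> str:
    """Convert seconds to the SRT timestamp format HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def chunks_to_srt(chunks: list[dict]) -> str:
    """Build an SRT caption string from pipeline-style chunks (hypothetical helper)."""
    entries = []
    for index, chunk in enumerate(chunks, start=1):
        start, end = chunk["timestamp"]
        if end is None:
            # Assumption: if the end time is missing, reuse the start time
            end = start
        entries.append(
            f"{index}\n"
            f"{format_timestamp(start)} --> {format_timestamp(end)}\n"
            f"{chunk['text'].strip()}\n"
        )
    return "\n".join(entries)


# Hand-written chunks in the same shape as result["chunks"]
example_chunks = [
    {"timestamp": (0.0, 2.5), "text": " Hello there."},
    {"timestamp": (2.5, 61.04), "text": " General Kenobi."},
]
srt_text = chunks_to_srt(example_chunks)
print(srt_text)
```

With a real transcription, `chunks_to_srt(result["chunks"])` would be the corresponding call, and the returned string could be written to a `.srt` file.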

## Testing Wav2Vec

#### Imports

In [1]:
from transformers import Wav2Vec2Tokenizer, Wav2Vec2ForCTC
import moviepy.editor as mp
import torch
import librosa
import os


#### Load models

In [2]:
tokenizer = Wav2Vec2Tokenizer.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'Wav2Vec2CTCTokenizer'. 
The class this function is called from is 'Wav2Vec2Tokenizer'.
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


#### Extract Audio

In [3]:
clip = mp.VideoFileClip("marti_test.mp4")

# Time range to transcribe, in seconds (assumed here: the whole clip)
start = 0
end = clip.duration

# Save the paths for later
clip_paths = []

# Extract the audio track from the mp4 in 10-second chunks
for i in range(start, int(end), 10):
    sub_end = min(i + 10, end)
    sub_clip = clip.subclip(i, sub_end)

    audio_path = "audio_" + str(i) + ".mp3"
    sub_clip.audio.write_audiofile(audio_path)
    clip_paths.append(audio_path)

KeyError: 'video_fps'
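The extraction loop cuts the clip into 10-second windows, with a shorter final window when the duration is not a multiple of 10. The boundary arithmetic can be checked without any video file. This is a minimal sketch: `split_windows` is a hypothetical helper, not part of moviepy.

```python
def split_windows(start: float, end: float, step: int = 10) -> list[tuple[float, float]]:
    """Return (window_start, window_end) pairs of at most `step` seconds each."""
    windows = []
    for window_start in range(int(start), int(end), step):
        # The last window is clipped to the end of the range
        windows.append((window_start, min(window_start + step, end)))
    return windows


windows = split_windows(0, 25.5)
print(windows)  # [(0, 10), (10, 20), (20, 25.5)]
```

Each pair could then be passed to `clip.subclip(window_start, window_end)`. As for the `KeyError: 'video_fps'` shown above, it is consistent with moviepy not finding a video stream in the file. If `marti_test.mp4` contains only audio, opening it with `mp.AudioFileClip` instead of `mp.VideoFileClip` may avoid the error, though this has not been verified here.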

#### Transcribe Audio

In [None]:
cc = ""

for path in clip_paths:
    # Load the audio with the librosa library
    input_audio, _ = librosa.load(path, sr=16000)

    # Tokenize the audio
    input_values = tokenizer(input_audio, return_tensors="pt", padding="longest").input_values

    # Feed it through Wav2Vec & choose the most probable tokens
    with torch.no_grad():
        logits = model(input_values).logits
        predicted_ids = torch.argmax(logits, dim=-1)

    # Decode & add to our caption string
    transcription = tokenizer.batch_decode(predicted_ids)[0]
    cc += transcription + " "

# Here's your caption!
# Note that there may be mistakes, especially if the audio is noisy or contains uncommon words.
# If you picked the default video and changed start to 0, you will see that the model gets confused by the word "Anakin".
print(cc)
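The transcription loop takes the argmax over the logits for each audio frame and lets the tokenizer decode the resulting ids. For a CTC model this amounts to greedy CTC decoding: pick the best token per frame, merge consecutive repeats, then drop the blank token. The toy sketch below illustrates the idea in plain Python. The vocabulary, the per-frame scores, and the helper names are all hypothetical, and the real tokenizer may differ in details.

```python
from itertools import groupby

# Hypothetical toy vocabulary; the real checkpoint defines its own
BLANK = "<pad>"
WORD_DELIMITER = "|"
vocab = [BLANK, WORD_DELIMITER, "H", "E", "L", "O"]


def greedy_ctc_decode(frame_scores: list[list[float]]) -> str:
    """Greedy CTC decoding: argmax per frame, merge repeats, drop blanks."""
    # Most probable token id for each frame
    ids = [max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores]
    # Merge runs of the same id into a single id
    collapsed = [token_id for token_id, _ in groupby(ids)]
    # Drop blanks, then turn word delimiters into spaces
    chars = [vocab[i] for i in collapsed if vocab[i] != BLANK]
    return "".join(chars).replace(WORD_DELIMITER, " ").strip()


def one_hot(token: str) -> list[float]:
    """Score vector that puts all weight on a single token."""
    return [1.0 if entry == token else 0.0 for entry in vocab]


# Ten frames; the blank between the two "L" frames keeps the double letter
frames = [one_hot(t) for t in ["H", "H", "<pad>", "E", "L", "<pad>", "L", "O", "O", "|"]]
decoded = greedy_ctc_decode(frames)
print(decoded)  # HELLO
```

The blank token is what allows a repeated letter to survive the merge step: without the `<pad>` frame between the two `L` frames, they would collapse into one.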