## Transcribing Audio Files with Wav2Vec2

Transcribing audio into text by hand is tedious and time-consuming, so I was excited to try out [Hugging Face's Wav2Vec2 model](https://huggingface.co/transformers/model_doc/wav2vec2.html).

### My Experiments

I tested the model on different kinds of speech and accents, including:

- **A short audio snippet (62s)**
- **A poetry recital (5m 34s)**
- **A longer political speech (12+ minutes)**

The results were impressive. With a usable transcript in hand, it becomes much easier to run NLP tasks on material that starts out as audio.

### Challenges

Longer audio clips (over 90 seconds) were trickier: processing them in one pass tended to exhaust memory and crash an ordinary work machine.
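One common workaround for the memory problem is to cut a long recording into shorter pieces and transcribe them one at a time. The sketch below is a minimal illustration of that idea, not something taken from this notebook: `split_into_chunks`, `transcribe_long`, and the `transcribe_chunk` callback are hypothetical names, and the 30-second default is an assumption chosen only to stay well under the 90-second mark.

```python
def split_into_chunks(samples, sample_rate=16_000, chunk_seconds=30):
    """Split a 1-D sequence of audio samples into consecutive chunks.

    Works with anything that supports len() and slicing (lists, NumPy arrays).
    The last chunk may be shorter than the others.
    """
    chunk_size = int(sample_rate * chunk_seconds)
    if chunk_size < 1:
        raise ValueError("sample_rate * chunk_seconds must be at least 1 sample")
    return [samples[i:i + chunk_size] for i in range(0, len(samples), chunk_size)]


def transcribe_long(samples, transcribe_chunk, sample_rate=16_000, chunk_seconds=30):
    """Transcribe a long clip chunk by chunk and join the partial transcripts.

    `transcribe_chunk` is a hypothetical callable (assumption): it takes one
    chunk of samples and returns that chunk's transcript as a string.
    """
    pieces = [
        transcribe_chunk(chunk)
        for chunk in split_into_chunks(samples, sample_rate, chunk_seconds)
    ]
    # Drop empty pieces (e.g. silent chunks) before joining
    return " ".join(piece.strip() for piece in pieces if piece.strip())
```

In this notebook, `transcribe_chunk` would wrap the tokenizer, model, and decode calls used for the short clip. Cutting at fixed positions can split a word across two chunks, so transcripts near chunk boundaries may be less accurate than elsewhere.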

### References and Resources

- Check out the [Wav2Vec2 documentation](https://huggingface.co/transformers/model_doc/wav2vec2.html).
- Use the [inference API on Hugging Face](https://huggingface.co/facebook/wav2vec2-base-960h).
- Read the paper on [wav2vec 2.0](https://arxiv.org/abs/2006.11477).

### Requirements

- **transformers** >= 4.3
- **librosa**

If you want to use your own audio clips, make sure to resample them to 16 kHz, as Wav2Vec2 was trained on 16 kHz audio. I used [Audacity](https://www.audacityteam.org/) for this.
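As an alternative to resampling in Audacity, librosa can resample a file while loading it. The sketch below assumes librosa is installed; `load_at_16k` and `expected_length` are hypothetical helper names, and the path you pass in is your own file.

```python
import math

TARGET_SR = 16_000  # sampling rate Wav2Vec2 expects


def expected_length(n_samples, orig_sr, target_sr=TARGET_SR):
    """Approximate number of samples a clip will have after resampling."""
    return math.ceil(n_samples * target_sr / orig_sr)


def load_at_16k(path):
    """Load an audio file as mono and resample it to 16 kHz in one step."""
    import librosa  # imported here so the helper above works without librosa

    # Passing sr=... makes librosa resample to that rate while loading
    speech, rate = librosa.load(path, sr=TARGET_SR)
    return speech, rate
```

For example, one second of 44.1 kHz audio (44,100 samples) comes out as roughly 16,000 samples, which is a quick sanity check that the resampling happened.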

### Models

There are various Wav2Vec2 models on Hugging Face's model hub. This project used the [wav2vec2-base-960h](https://huggingface.co/facebook/wav2vec2-base-960h) model. You can explore other models [here](https://huggingface.co/models?search=wav2ve).

# 1. TRANSCRIBE SHORT AUDIO CLIP

The included file is a 62-second clip from John F. Kennedy's famous 1961 inaugural address. You can swap it for any other English speech you like.

It works only for English speech, since the wav2vec2-base-960h checkpoint was fine-tuned on 960 hours of English LibriSpeech audio.

If the audio clip is longer than 90 seconds, the notebook might crash due to memory issues.

In [1]:
import librosa
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Tokenizer


In [2]:
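# Load the tokenizer and the pre-trained model.
# Note: Wav2Vec2Tokenizer is deprecated in newer transformers releases
# in favor of Wav2Vec2Processor; it is kept here to match the original run.
tokenizer = Wav2Vec2Tokenizer.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")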

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'Wav2Vec2CTCTokenizer'. 
The class this function is called from is 'Wav2Vec2Tokenizer'.
Some weights of the model checkpoint at facebook/wav2vec2-base-960h were not used when initializing Wav2Vec2ForCTC: ['wav2vec2.encoder.pos_conv_embed.conv.weight_g', 'wav2vec2.encoder.pos_conv_embed.conv.weight_v']
- This IS expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraini

In [3]:
# Load the audio file from the folder of your choice
file_path = "../audio/jfk.flac"

# sr=16000 makes librosa resample to 16 kHz on load if the file uses another rate
speech, rate = librosa.load(file_path, sr=16000)

In [4]:
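%%time
# Convert the raw waveform into a batch of model inputs (PyTorch tensors)
input_values = tokenizer(speech, return_tensors="pt").input_values

# Run inference without tracking gradients (saves memory) and
# store the logits (unnormalized per-frame scores over the vocabulary)
with torch.no_grad():
    logits = model(input_values).logits

# Pick the most likely token ID for each frame
predicted_ids = torch.argmax(logits, dim=-1)

# Decode the token IDs into text
transcript = tokenizer.decode(predicted_ids[0])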

CPU times: total: 14.5 s
Wall time: 4.69 s
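
The argmax and decode steps above amount to greedy CTC decoding: take the most likely token for each frame, collapse runs of repeated tokens, drop the blank token, and turn word delimiters into spaces. The sketch below illustrates that logic on a toy frame sequence. `greedy_ctc_decode` is a hypothetical helper, not the library's implementation; it assumes `<pad>` acts as the CTC blank and `|` as the word delimiter, as in the Wav2Vec2 CTC tokenizer.

```python
from itertools import groupby

BLANK = "<pad>"        # assumed CTC blank token
WORD_DELIMITER = "|"   # assumed word-boundary token


def greedy_ctc_decode(frame_tokens):
    """Turn per-frame token predictions into text.

    1. Collapse consecutive duplicates (one spoken letter spans many frames).
    2. Remove blank tokens (they separate genuine double letters).
    3. Replace the word delimiter with a space.
    """
    collapsed = [token for token, _ in groupby(frame_tokens)]
    chars = [token for token in collapsed if token != BLANK]
    return "".join(chars).replace(WORD_DELIMITER, " ").strip()


# The blank between the two L runs is what preserves the double letter in "ALL"
frames = ["A", "L", "L", "<pad>", "L", "|", "|", "I", "N"]
print(greedy_ctc_decode(frames))  # ALL IN
```

Because each frame is decided independently, with no language model to correct it, this kind of decoding can produce near-miss spellings when the acoustic evidence is ambiguous.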


In [5]:
print(transcript)

IN THE LONG HISTORY OF THE WORLD ONLY A FEW GENERATIONS HAVE BEEN GRANDED THE ROLE OF DEFENDING FREEDOM IN ITS OUR MAXIMUM DANGER I DO NOT SHRINK FROM THIS RESPONSIBILITY I WELCOME IT    I DO NOT BELIEVE THAT ANY OF US WOULD EXCHANGE PLACES WITH ANY OTHER PEOPLE OR ANY OTHER GENERATION THE ENERGY THE FAITH THE DEVOTION WHICH WANGE BRANG TO THIS AND OF UP WILL NOT OUR COUNTRY AND ALL WHO SERVE IT AND THE GLOW FROM THAT FIRE CAND TRULY LIKE THE WORLD AND SO MY FELLOW AMERICA ASK NOT WHAT YOUR COUNTRY CAN DO FOR YOU ASK WHAT YOU CAN DO FOR YOUR COUNTRY
