# TRANSCRIBING AUDIO FILES WITH WAV2VEC2


*This is the simplest trial of the Wav2Vec2 model, using a 62-second clip of John F. Kennedy's famous 1961 inaugural speech.*

*If you want to use your own audio clips, make sure to downsample them to 16 kHz, as the Wav2Vec2 model used here was pretrained and fine-tuned on speech audio sampled at 16 kHz.*
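Before transcribing your own clip, it can help to confirm its sample rate. The sketch below is a minimal check using only the standard library, and it assumes the clip is a WAV file (the clips in this notebook are FLAC, which `wave` cannot read; for those, `librosa.load(path, sr=16000)` resamples on load). The names `wav_sample_rate`, `needs_resampling` and `make_silent_wav` are hypothetical helpers, not part of any library.

```python
import io
import wave

TARGET_RATE = 16000  # rate the Wav2Vec2 checkpoints in this notebook expect


def wav_sample_rate(source):
    """Return the sample rate stored in a WAV file's header."""
    with wave.open(source, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(source, target_rate=TARGET_RATE):
    """True when the clip's rate differs from the target rate."""
    return wav_sample_rate(source) != target_rate


def make_silent_wav(rate, seconds=1):
    """Hypothetical helper: build an in-memory mono 16-bit WAV of silence."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(rate)
        wav_file.writeframes(b"\x00\x00" * rate * seconds)
    buffer.seek(0)
    return buffer


# A 44.1 kHz clip should be flagged, a 16 kHz clip should not.
print(needs_resampling(make_silent_wav(44100)))
print(needs_resampling(make_silent_wav(16000)))
```

`source` can be a file path or an open binary file object, so the same check works on files saved from Audacity.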

I used Audacity to split up the audio files in this repo.
https://www.audacityteam.org/

In [None]:
!pip install -q transformers

In [None]:
import librosa
import pandas as pd
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Tokenizer

In [None]:
tokenizer = Wav2Vec2Tokenizer.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'Wav2Vec2CTCTokenizer'. 
The class this function is called from is 'Wav2Vec2Tokenizer'.


Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
# Make sure the numbering of your split audio files starts at "1", not "01" or "001".

def split(x, y):
    """Transcribe clips numbered x..y (inclusive); return {clip number: text}."""
    transcribe = {}
    for i in range(x, y + 1):
        # Load the clip, resampling to the 16 kHz rate the model expects.
        speech, rate = librosa.load(
            "/content/drive/MyDrive/Transcription-Suite/audio/poet/amanda_gorman-%d.flac" % i, sr=16000
        )
        input_values = tokenizer(speech, return_tensors="pt").input_values
        # Inference only, so skip gradient tracking to save memory.
        with torch.no_grad():
            logits = model(input_values).logits
        # Greedy CTC decoding: most likely token per frame, then collapse to text.
        predicted_ids = torch.argmax(logits, dim=-1)
        transcribe[i] = tokenizer.decode(predicted_ids[0])
    return transcribe
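
The last two steps of `split` (taking the `argmax` per frame, then calling `tokenizer.decode`) amount to greedy CTC decoding: repeated frame predictions are collapsed, blank tokens are dropped, and the word delimiter becomes a space. The sketch below illustrates that idea in plain Python. The tiny vocabulary and the function name `ctc_greedy_decode` are hypothetical and only stand in for the real tokenizer.

```python
from itertools import groupby


def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0, word_delimiter="|"):
    """Collapse repeated ids, drop blanks, and turn delimiters into spaces."""
    # Step 1: merge runs of identical consecutive predictions.
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # Step 2: remove the blank token that separates genuine repeats.
    chars = [id_to_char[token_id] for token_id in collapsed if token_id != blank_id]
    # Step 3: the delimiter token marks word boundaries.
    return "".join(chars).replace(word_delimiter, " ").strip()


# Hypothetical toy vocabulary; the real one comes from the checkpoint.
toy_vocab = {0: "<pad>", 1: "|", 2: "H", 3: "E", 4: "L", 5: "O", 6: "I"}

# Per-frame argmax ids; the blank (0) between the two 4s keeps the double "L".
frame_ids = [2, 2, 0, 3, 4, 4, 0, 4, 5, 1, 1, 2, 6, 6]
print(ctc_greedy_decode(frame_ids, toy_vocab))
```

Without the blank between the two runs of `4`, the double "L" would collapse into a single letter, which is why CTC models emit blanks between repeated characters.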

In [None]:
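def transcript(num_clips):
    """Transcribe clips 1..num_clips and return them as a one-column DataFrame."""
    trans = {}
    # Each window covers clips j and j + 1, so consecutive windows overlap by one clip.
    for j in range(1, num_clips):
        trans[j] = pd.DataFrame.from_dict(
            split(j, j + 1), orient="index"
        ).rename(columns={0: "Transcribed_Text"})
    # The overlap produces repeated rows, which drop_duplicates removes.
    return (
        pd.concat(trans)
        .drop_duplicates(subset=["Transcribed_Text"])
        .reset_index(drop=True)
    )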
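
Because `transcript` calls `split(j, j + 1)` for every `j`, most clips are transcribed twice before the duplicates are dropped. The sketch below compares that schedule with a single pass over the clips. It is an illustration only: `transcribe_clip` is a hypothetical stand-in for the model call, so no audio or model is needed to run it.

```python
def overlapping_schedule(num_clips):
    """Clip numbers visited by windows (j, j + 1) for j = 1..num_clips - 1."""
    return [clip for j in range(1, num_clips) for clip in (j, j + 1)]


def single_pass_schedule(num_clips):
    """Clip numbers visited when each clip is transcribed exactly once."""
    return list(range(1, num_clips + 1))


def transcribe_all(num_clips, transcribe_clip):
    """Transcribe each clip once; transcribe_clip is a hypothetical callable."""
    return {clip: transcribe_clip(clip) for clip in single_pass_schedule(num_clips)}


# With 10 clips the overlapping windows make 18 model calls instead of 10.
print(len(overlapping_schedule(10)), len(single_pass_schedule(10)))

# Stand-in transcriber so the sketch runs without a model.
print(transcribe_all(3, lambda clip: "text of clip %d" % clip))
```

In the notebook, the same effect could be had by calling `split(1, num_clips)` once, which would also make `drop_duplicates` unnecessary.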

In [None]:
%%time
df = transcript(num_clips = 10)

CPU times: user 3min 3s, sys: 6.95 s, total: 3min 10s
Wall time: 3min 13s


In [None]:
# results
df.shape

(10, 1)

In [None]:
df.head(10)

                                    Transcribed_Text
0  MISTER PRESIDENT DOCTOR BYDEN MADAM VICE PRESI...
1  IS ISN'T ALWAYS JUST IS AND YET THE DAWN IS HO...
2  FOR ONE AND YES WE ARE FAR FROM POLISHED FAR F...
3  PUT OUR FUTSARE FIRST WE MUST FIRST PUT OUR DI...
4  EVER AGAIN SO DIVISION SKIPSER TELLS US TO INV...
5  HAR IT WE'VE SEEN A FOREST THAT WOULD SHATTER ...
6  CEPTION WE DID NOT FEEL PREPARED TO BE THE EIR...
7  VOLENCE BUT BOLD FIERCE AND FREE WE WILL NOT B...
8  BETTER THAN ONE WE WERE LEFT WITHEVERY BREATH ...
9  AND BEAUTIFUL WHEN DAY COMES WE STEP OUT OF TH...


In [None]:
df["Transcribed_Text"]

0    MISTER PRESIDENT DOCTOR BYDEN MADAM VICE PRESI...
1    IS ISN'T ALWAYS JUST IS AND YET THE DAWN IS HO...
2    FOR ONE AND YES WE ARE FAR FROM POLISHED FAR F...
3    PUT OUR FUTSARE FIRST WE MUST FIRST PUT OUR DI...
4    EVER AGAIN SO DIVISION SKIPSER TELLS US TO INV...
5    HAR IT WE'VE SEEN A FOREST THAT WOULD SHATTER ...
6    CEPTION WE DID NOT FEEL PREPARED TO BE THE EIR...
7    VOLENCE BUT BOLD FIERCE AND FREE WE WILL NOT B...
8    BETTER THAN ONE WE WERE LEFT WITHEVERY BREATH ...
9    AND BEAUTIFUL WHEN DAY COMES WE STEP OUT OF TH...
Name: Transcribed_Text, dtype: object

In [None]:
# Output the transcript to a text file

#poet = df["Transcribed_Text"].apply(''.join)
#poet.to_csv("../transcripts/amanda_gorman1.txt", sep="\t", index=False)

*This is the second trial, transcribing longer audio clips with Wav2Vec2, including clips beyond 10 minutes (I've tried clips of up to around 21 minutes).*

Longer audio clips tend to crash notebooks using the Wav2Vec2 model, so I used a workaround to transcribe Amanda Gorman's evocative inauguration poem (5 minutes 34 seconds): splitting the audio into shorter numbered clips and transcribing them one at a time.
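
The clips in this notebook were cut by hand in Audacity, but the same splitting can be expressed as sample offsets. The sketch below computes chunk boundaries for a clip of a given length. The 30-second chunk size and the name `chunk_bounds` are assumptions for illustration, not values taken from the notebook.

```python
def chunk_bounds(num_samples, sample_rate=16000, chunk_seconds=30):
    """Return (start, end) sample offsets covering the clip in fixed-size chunks."""
    if chunk_seconds <= 0:
        raise ValueError("chunk_seconds must be positive")
    chunk_len = sample_rate * chunk_seconds
    return [
        (start, min(start + chunk_len, num_samples))
        for start in range(0, num_samples, chunk_len)
    ]


# A 5 min 34 s clip at 16 kHz, cut into 30 s pieces (assumed chunk size).
poem_samples = (5 * 60 + 34) * 16000
bounds = chunk_bounds(poem_samples)
print(len(bounds), bounds[0], bounds[-1])
```

Each `(start, end)` pair can then be used to slice the loaded waveform, so that only one short chunk is passed through the model at a time.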



In [None]:
# from google.colab import drive
# drive.mount('/content/drive/')


In [None]:
import librosa
import pandas as pd
import os
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Tokenizer

In [None]:
os.chdir("/content/drive/My Drive/Colab Notebooks")

In [None]:
# Load the tokenizer and pre-trained model
tokenizer = Wav2Vec2Tokenizer.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")



In [None]:
def split(x, y):
    """Transcribe clips numbered x..y (inclusive); return {clip number: text}."""
    transcribe = {}
    for i in range(x, y + 1):
        # Load the clip, resampling to the 16 kHz rate the model expects.
        speech, rate = librosa.load(
            "/content/drive/MyDrive/Transcription-Suite/audio/lhl_wef/lhl_wef-%d.flac" % i, sr=16000
        )
        input_values = tokenizer(speech, return_tensors="pt").input_values
        # Inference only, so skip gradient tracking to save memory.
        with torch.no_grad():
            logits = model(input_values).logits
        # Greedy CTC decoding: most likely token per frame, then collapse to text.
        predicted_ids = torch.argmax(logits, dim=-1)
        transcribe[i] = tokenizer.decode(predicted_ids[0])
    return transcribe

In [None]:
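def transcript(num_clips):
    """Transcribe clips 1..num_clips and return them as a one-column DataFrame."""
    trans = {}
    # Each window covers clips j and j + 1, so consecutive windows overlap by one clip.
    for j in range(1, num_clips):
        trans[j] = pd.DataFrame.from_dict(
            split(j, j + 1), orient="index"
        ).rename(columns={0: "Transcribed_Text"})
    # The overlap produces repeated rows, which drop_duplicates removes.
    return (
        pd.concat(trans)
        .drop_duplicates(subset=["Transcribed_Text"])
        .reset_index(drop=True)
    )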

In [None]:
%%time
df = transcript(num_clips = 13)

*Transcribe a 12.5-minute speech by the Singapore Prime Minister to see how the model deals with an Asian accent.*

# TRANSCRIBING LONG AUDIO FILES WITH WAV2VEC2 - ALTERNATIVE, OPTIMIZED VERSION


Requirements:

*   transformers >= 4.3
*   librosa



I switched to the wav2vec2-large-960h-lv60-self model. In earlier notebooks/codeblocks, I used the base version.

In [None]:
!pip install -q transformers
import librosa
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Tokenizer

In [None]:
wer_model1=[]

In [None]:
!pip install -q transformers
wer_model1=[]
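def model(audio_file):
  # Imports live inside the function so it can run on its own.
  import librosa
  import torch
  from transformers import Wav2Vec2ForCTC, Wav2Vec2Tokenizer
  def asr_transcript(tokenizer, model, audio_file):
      transcript = ""

      # Stream over 20-second chunks: 20 frames of 16000 samples each.
      # librosa.stream reads at the file's native rate, so the file must already be 16 kHz.
      stream = librosa.stream(
          audio_file, block_length=20, frame_length=16000, hop_length=16000
      )

      for speech in stream:
          # Blocks are mono by default; if not, average the channels (first axis).
          if len(speech.shape) > 1:
              speech = speech.mean(axis=0)

          input_values = tokenizer(speech, return_tensors="pt").input_values
          # Inference only, so skip gradient tracking to save memory.
          with torch.no_grad():
              logits = model(input_values).logits

          # Greedy CTC decoding: most likely token per frame, then collapse to text.
          predicted_ids = torch.argmax(logits, dim=-1)
          transcription = tokenizer.decode(predicted_ids[0])
          transcript += transcription.lower() + " "

      return transcript

  # Note: the tokenizer and model are reloaded on every call to this function.
  tokenizer_transcribe = Wav2Vec2Tokenizer.from_pretrained("facebook/wav2vec2-large-960h-lv60-self")

  model_transcribe = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-960h-lv60-self")

  poet = asr_transcript(tokenizer_transcribe, model_transcribe, audio_file)

  # Collect the transcript in the module-level list.
  wer_model1.append(poet)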
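
The "20 seconds" in the streaming call only holds for 16 kHz audio, because `block_length`, `frame_length` and `hop_length` are counted in frames and samples rather than seconds. The sketch below works through that arithmetic, assuming each block holds at most `(block_length - 1) * hop_length + frame_length` samples, as described in the librosa documentation for `librosa.stream`. The function name `block_seconds` is hypothetical.

```python
def block_seconds(block_length, frame_length, hop_length, sample_rate):
    """Duration in seconds of one full streamed block at the given sample rate."""
    samples = (block_length - 1) * hop_length + frame_length
    return samples / sample_rate


# Settings used in the notebook, on a 16 kHz file: 20-second blocks.
print(block_seconds(20, 16000, 16000, 16000))

# The same settings on a 44.1 kHz file give much shorter blocks,
# and the model would also receive audio at the wrong rate.
print(round(block_seconds(20, 16000, 16000, 44100), 3))
```

This is one reason to resample files to 16 kHz before streaming them, rather than relying on the streaming call to do it.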

In [None]:
# import IPython.display as display
# display.Audio(audio_file , autoplay=True)

In [None]:
model("/content/audio0.flac")

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'Wav2Vec2CTCTokenizer'. 
The class this function is called from is 'Wav2Vec2Tokenizer'.
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-large-960h-lv60-self and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
model("/content/audio1.flac")



In [None]:
# model("/content/audio2.flac")

In [None]:
wer_model1

['i will look for you i will find you and i will kill you ',
 'a year ago weere hit with a virus that was met with silence and spread unchecked denials for days weeks then months that led to more deaths more infections more stress and more loneliness photos and viteos from ']

In [None]:
wer_model1.append(
  "london has a diverse range of people and cultures and more than three hundred languages are spoken in the region the world makes it easy to create vidios and audio files with life like audio from text get started with british english text to speech free select from one of our text to speech british english male and female voices below and enter some text to create the audio ",
  )

In [None]:
wer_model1

['i will look for you i will find you and i will kill you ',
 'a year ago weere hit with a virus that was met with silence and spread unchecked denials for days weeks then months that led to more deaths more infections more stress and more loneliness photos and viteos from ',
 "mister president doctor byden madam vice president mister mhof americans and the world when day comes we ask ourselves where can we find light in this never ending shade the loss we carry a sea we must wade we've braved the belly of the beast we've learned that quiet isn't always peace in the norms and notions of what just is isn't always just is and yet the on as ours before we knew it somehow we do it somehow we've weathered and witnessed a nation that isn't broken but simply unfinished we the successors of a country and a time were a skinny black girl descended from slaves and raised by a single mother can dream of becoming president only to find herself reciting for one and yes we are far from polished far f
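
The list name `wer_model1` suggests these transcripts are being collected for a word error rate (WER) comparison, although the notebook does not show that step. As a rough sketch under that assumption, WER can be computed as the word-level edit distance between a reference text and the model output, divided by the number of reference words. The function name and the reference sentence below are hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    # previous[j] holds the distance between the reference prefix so far
    # and the first j hypothesis words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref_words)


# Hypothetical reference versus a transcript with one misspelled word.
print(word_error_rate("mister president doctor biden",
                      "mister president doctor byden"))
```

Both texts should be normalised the same way (for example lower-cased, with punctuation removed) before comparison, since the model output contains no punctuation.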

In [None]:
model("/content/audio5.flac")



In [None]:
wer_model1


In [None]:
model("/content/audio6.flac")



In [None]:
model("/content/audio7.flac")



In [None]:
wer_model1
