## ASR (Automatic Speech Recognition) using wav2vec

This notebook uses a pre-trained HuBERT model from HuggingFace to convert audio to a transcript. A YouTube download snippet is included because it's convenient :)

**The biggest challenge, however, is improving the transcription quality:**

1. We remove silence from the audio before transcribing it.
2. Large audio chunks don't transcribe well, so we split long segments with a sliding window.
3. We would also like to apply some audio filtering/pre-processing, but haven't settled on which filters to use (suggestions welcome).
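
For point 3, two simple candidates are a pre-emphasis filter (which boosts the higher frequencies that carry consonants) and peak normalization (which brings quiet recordings up to a consistent level). The sketch below is illustrative only: `pre_emphasis`, `peak_normalize`, and the 0.97 coefficient are assumptions, not part of this notebook, and whether either step helps this particular model would need to be checked on real audio.

```python
def pre_emphasis(samples, coefficient=0.97):
    """First-order high-pass filter: y[n] = x[n] - coefficient * x[n-1]."""
    if not samples:
        return []
    filtered = [samples[0]]  # the first sample has no predecessor
    for i in range(1, len(samples)):
        filtered.append(samples[i] - coefficient * samples[i - 1])
    return filtered


def peak_normalize(samples, target_peak=1.0):
    """Scale the signal so its largest absolute value equals target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0:
        return list(samples)  # silent input: nothing to scale
    return [s * target_peak / peak for s in samples]


# hypothetical toy signal standing in for real audio samples
toy_signal = [0.0, 0.2, 0.4, 0.2, 0.0, -0.2]
print(peak_normalize(pre_emphasis(toy_signal)))
```

In this notebook the samples would come from `librosa.load`, which returns a NumPy float array; the same arithmetic applies there.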

In [1]:
#!pip install -r requirements.txt
!pip install librosa pydub torch transformers pytube

Collecting pytube
  Downloading pytube-12.1.2-py3-none-any.whl (57 kB)
Collecting soundfile>=0.12.1
  Downloading soundfile-0.12.1-py2.py3-none-manylinux_2_31_x86_64.whl (1.2 MB)
Installing collected packages: pytube, soundfile
  Attempting uninstall: soundfile
    Found existing installation: soundfile 0.11.0
    Uninstalling soundfile-0.11.0:
      Successfully uninstalled soundfile-0.11.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
wfdb 4.1.0 requires SoundFile<0.12.0,>=0.10.0, but you have soundfile 0.12.1 which is incompatible.
Successfully installed pytube-12.1.2 soundfile-0.12.1

In [2]:
from pytube import YouTube
import os,shutil

import audioread
from IPython.display import Audio
import librosa
from pydub import AudioSegment, silence

import torch
from transformers import pipeline, T5Tokenizer, T5ForConditionalGeneration, Wav2Vec2Processor, HubertForCTC


In [29]:
## Load a pre-trained HuBERT model (fine-tuned with Connectionist Temporal Classification/CTC loss) from HuggingFace
## This particular model works for English; there are models that support other languages too

model_name = "facebook/hubert-large-ls960-ft"
tokenizer = Wav2Vec2Processor.from_pretrained(model_name)
model = HubertForCTC.from_pretrained(model_name)

In [18]:
def download_audio(url=None):
    # remove any previously downloaded .mp4 files from the working directory
    for file in os.listdir():
        if file.endswith("mp4"):
            os.remove(file)
    if url is not None:
        yt=YouTube(url)
        print(yt.title)
        # audio-only streams in an mp4 container; download the first one
        stream=list(yt.streams.filter(only_audio=True, file_extension='mp4'))
        stream[0].download()
    else:
        print("No url given, can't download")

In [17]:
url="https://www.youtube.com/watch?v=MihlCysVWNs"
#url="https://www.youtube.com/watch?v=YVQzFCPkgt4&list=PLreVlKwe2Z0QIdDwvVoa_3QSMifIF1w1A&index=7"
download_audio(url)

Wake up to Reality - Madara Uchiha's words


In [20]:
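def convert_to_wav(input_filename):
    # remove stale .wav files from previous runs
    for file in os.listdir():
        if file.endswith("wav"):
            os.remove(file)
    # splitext handles extensions of any length (mp3, mp4, webm, ...)
    base,ext=os.path.splitext(input_filename)
    ext=ext.lstrip(".")
    output_filename=base+".wav"  # derive the name from the argument, not a global
    if ext=="mp3":
        sound = AudioSegment.from_mp3(input_filename)
    else:
        sound = AudioSegment.from_file(input_filename,format=ext)
    sound = sound.set_frame_rate(16000)  # the model expects 16 kHz audio
    sound.export(output_filename,format="wav")
    os.remove(input_filename)
    return output_filename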

In [21]:
## the model needs audio in wav format with a 16 kHz sample rate
filename=""
for file in os.listdir():
    if file.endswith("mp4"):
        filename=file
print("old filename ",filename)
filename = convert_to_wav(filename)
print("new filename ", filename)

old filename  Wake up to Reality - Madara Uchihas words.mp4
new filename  Wake up to Reality - Madara Uchihas words.wav


In [22]:
## create a temporary directory to store the audio chunks
tmp_dir="audio_chunks"

shutil.rmtree(f"{tmp_dir}/",ignore_errors=True)
os.makedirs(tmp_dir)


In [24]:
audio = AudioSegment.from_file(filename)
dBs=audio.dBFS # average loudness of the whole clip in dBFS
silence_list=silence.detect_silence(audio,min_silence_len=750,silence_thresh=dBs-14)
silence_list

[[0, 2812], [51695, 52943], [54775, 55690]]
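
Each pair above is a `[start, end]` silence interval in milliseconds. The speech segments are the gaps between them. As a minimal sketch of that inversion (the helper name `invert_intervals` is hypothetical, and 96000 ms is the audio length that appears in the chunk output further down):

```python
def invert_intervals(silence_list, audio_length):
    """Return the [start, end] gaps (in ms) that lie between silence intervals."""
    non_silent = []
    cursor = 0  # end of the last silence seen so far
    for silence_start, silence_end in silence_list:
        if silence_start > cursor:
            non_silent.append([cursor, silence_start])
        cursor = max(cursor, silence_end)
    if cursor < audio_length:
        non_silent.append([cursor, audio_length])  # speech after the last silence
    return non_silent


silence_list = [[0, 2812], [51695, 52943], [54775, 55690]]
print(invert_intervals(silence_list, 96000))
# [[2812, 51695], [52943, 54775], [55690, 96000]]
```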

In [None]:
# test = audio[54775:55690]
# path = "/content/test_3.wav"
# test.export(path,format="wav") # write the slice to `path` as wav (pydub defaults to mp3)
    
# Audio(path, autoplay=False)

In [25]:
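## Break the audio into chunks in two steps:
## 1. take the gaps between silence intervals as non-silent chunks
## 2. split any chunk longer than max_interval (ms) with a sliding window whose pieces overlap by threshold (ms)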

def create_chunk(audio,silence_list,threshold=14,max_interval=20000):

    audio_length = int(audio.duration_seconds)*1000 ## length in ms, truncated to whole seconds (len(audio) gives the exact ms)
    non_silent_chunk=[]
    if len(silence_list)>0:
        ## for 1st chunk 
        if silence_list[0][0]!=0:
            nss=0 # non-silence chunk start
            nse=silence_list[0][0] # non-silence chunk end
            non_silent_chunk.append([nss,nse])
        for idx in range(1,len(silence_list)):
            nss=silence_list[idx-1][1]  # end of previous silence-chunk
            nse=silence_list[idx][0]  # start of current silence-chunk
            non_silent_chunk.append([nss,nse])

        # after last silence chunk 
        if silence_list[-1][1]!=audio_length:
            nss=silence_list[-1][1]
            nse=audio_length
            non_silent_chunk.append([nss,nse])
    else:
        non_silent_chunk.append([0,audio_length])

    print("non_silent_chunk : ",non_silent_chunk)
    new_non_silent_chunk = [] # we break larger non-silence chunk to smaller sub-chunks
    # using sliding window to get audio without silence
    for idx in range(len(non_silent_chunk)):
        start=non_silent_chunk[idx][0]
        end=non_silent_chunk[idx][1]
        interval = end-start
        if interval>max_interval:
            s=start
            while interval>max_interval:
                e=s+max_interval+threshold
                interval=interval-max_interval
                new_non_silent_chunk.append([s,e])
                s=e-threshold
                start=s 
            if interval<=max_interval:
                end=start+interval
                new_non_silent_chunk.append([start,end])
        else:
            new_non_silent_chunk.append([start,end])

    return new_non_silent_chunk




In [None]:
# test = audio[75690:95704]
# path = "/content/test_4.wav"
# test.export(path,format="wav") # write the slice to `path` as wav (pydub defaults to mp3)
# Audio(path, autoplay=False)

In [26]:
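def transcribe_audio(path,model,tokenizer,audio,start,end,overlap=15):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    try:
        with torch.no_grad():
            new_audio = audio[start:end]  # slice the chunk (positions in ms)
            new_audio.export(path,format="wav")  # pydub exports mp3 unless format is given
            print(path)
            input_audio,sr=librosa.load(path,sr=16000)
            # pass sampling_rate explicitly, as the processor's warning recommends
            input_values = tokenizer(input_audio,sampling_rate=sr,return_tensors="pt").to(device).input_values
            logits = model.to(device)(input_values).logits
            prediction = torch.argmax(logits, dim=-1)  # greedy CTC decoding
            transcription = tokenizer.batch_decode(prediction)[0].lower()
            # keep the first/last `overlap` characters to inspect chunk boundaries
            transcription_start=transcription[:overlap]
            transcription_end=transcription[-overlap:]
            return transcription,transcription_start,transcription_end
    except audioread.NoBackendError:
        # the exported chunk could not be decoded, e.g. an empty chunk where start >= end
        raise ValueError(f"could not decode chunk [{start}, {end}]: check that start < end")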
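
The `torch.argmax` followed by `batch_decode` in `transcribe_audio` amounts to greedy CTC decoding: pick the most likely token per frame, collapse consecutive repeats, then drop the blank token. A toy sketch of that collapse step, using a made-up four-letter vocabulary and a blank id of 0 (both are assumptions for illustration, not the real model vocabulary):

```python
def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Collapse repeated frame predictions and remove CTC blanks."""
    chars = []
    previous = None
    for token_id in frame_ids:
        # emit a token only when it differs from the previous frame and is not blank
        if token_id != previous and token_id != blank_id:
            chars.append(id_to_char[token_id])
        previous = token_id
    return "".join(chars)


# hypothetical vocabulary; 0 is reserved for the blank token
id_to_char = {1: "h", 2: "e", 3: "l", 4: "o"}
frames = [1, 1, 0, 2, 2, 3, 0, 3, 3, 4, 0]
print(ctc_greedy_decode(frames, id_to_char))  # hello
```

The blank between the two runs of `3` is what lets the decoder keep the double "l" instead of merging it into one.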


In [27]:
# max_interval - a chunk longer than max_interval (ms) is broken into sub-chunks
# threshold - overlap (ms) between two consecutive sub-chunks, giving a soft boundary at the transition
new_non_silent_chunk = create_chunk(audio,silence_list,threshold=14,max_interval=20000)
print("new_non_silent_chunk",new_non_silent_chunk)

non_silent_chunk :  [[2812, 51695], [52943, 54775], [55690, 96000]]
new_non_silent_chunk [[2812, 22826], [22812, 42826], [42812, 51695], [52943, 54775], [55690, 75704], [75690, 95704], [95690, 96000]]
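
The sliding-window step of `create_chunk` can be written more compactly. The sketch below is a simplified restatement, not the notebook's code: the name `split_with_overlap` is hypothetical, and clamping each window to `end` is an addition the original does not have. On the intervals printed above it yields the same sub-chunks.

```python
def split_with_overlap(start, end, max_interval=20000, threshold=14):
    """Split [start, end] (ms) into windows of max_interval that overlap by threshold."""
    chunks = []
    window_start = start
    while end - window_start > max_interval:
        # extend each window by `threshold` ms so neighbours share a soft boundary
        window_end = min(window_start + max_interval + threshold, end)
        chunks.append([window_start, window_end])
        window_start += max_interval
    chunks.append([window_start, end])  # remainder, at most max_interval long
    return chunks


print(split_with_overlap(2812, 51695))
# [[2812, 22826], [22812, 42826], [42812, 51695]]
print(split_with_overlap(55690, 96000))
# [[55690, 75704], [75690, 95704], [95690, 96000]]
```

Note that the last window of the second interval is only 310 ms long; very short leftovers like this tend to transcribe to an empty string.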


In [30]:
overlap=20 ## number of characters kept from each chunk's start and end; change as needed
overlapping_transcription=[]
transcription= ""
root_path="audio_chunks"
for idx in range(len(new_non_silent_chunk)):
    start=new_non_silent_chunk[idx][0]
    end=new_non_silent_chunk[idx][1]
    path=f"{root_path}/chunk_{idx}.wav"
    original_trans,trans_start,trans_end=transcribe_audio(path,model,tokenizer,audio,start,end,overlap)
    transcription+=original_trans+" "
    overlapping_transcription.append([trans_start,trans_end])

audio_chunks/chunk_0.wav


It is strongly recommended to pass the ``sampling_rate`` argument to this function. Failing to do so can result in silent errors that might be hard to debug.


audio_chunks/chunk_1.wav




audio_chunks/chunk_2.wav




audio_chunks/chunk_3.wav




audio_chunks/chunk_4.wav




audio_chunks/chunk_5.wav




audio_chunks/chunk_6.wav


In [31]:
transcription

'wake up to reality nothing ever goes as planned in this accursed world the longer you live the more you will realize that the only things that truly exist in this reality are merely pain suffering and futility listen everywhere you look in this world wherever there is light there will always be shadowys to be found as well as long as there is a conceft of victors the vanquished will also exist the selfish intent of wanting to preserve peace initiates wars and hatred is born in order to protect love there mexases causal relations ships that cannot be separated i want to sever the fate of this world a world of only victors world of only peace a world of only love i will create such a world i am the ghostt of the uchiha laa a p oa paportrul this reality esia eliaa pa aapo p n epa  '

In [32]:
overlapping_transcription

[['wake up to reality n', 'ere you look in this'],
 ['world wherever there', 'o protect love there'],
 ['mexases causal relat', 'e fate of this world'],
 ['a world of only vict', 'orld of only victors'],
 ['world of only peace ', ' of the uchiha laa a'],
 ['p oa paportrul this ', 'liaa pa aapo p n epa'],
 ['', '']]
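
The start/end snippets above are collected but not used further in this notebook. One possible use, sketched here under the assumption that adjacent chunks might repeat a few characters at the boundary, is to merge two transcriptions on their longest shared suffix/prefix. `merge_overlap` is a hypothetical helper, not part of the notebook.

```python
def merge_overlap(left, right, max_overlap=20):
    """Join two strings, dropping the longest prefix of `right` that ends `left`."""
    limit = min(max_overlap, len(left), len(right))
    for size in range(limit, 0, -1):
        if left.endswith(right[:size]):
            return left + right[size:]
    return left + " " + right  # no shared text: plain concatenation


# made-up example with a repeated word at the boundary
print(merge_overlap("look in this world", "world wherever there"))
# look in this world wherever there

# the first two boundary snippets from the output above share nothing
print(merge_overlap("ere you look in this", "world wherever there"))
# ere you look in this world wherever there
```

With a 14 ms audio overlap, repeated text at chunk boundaries is unlikely, so in practice this mostly falls through to plain concatenation; it becomes more relevant if the overlap is widened.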

In [33]:
model_name="flexudy/t5-small-wav2vec2-grammar-fixer"
t5_tokenizer=T5Tokenizer.from_pretrained(model_name)
t5_model=T5ForConditionalGeneration.from_pretrained(model_name)



In [34]:
def add_punctuation(t5_model,t5_tokenizer,transcription):
    input_text="fix:{"+transcription+"}</s>"
    input_ids=t5_tokenizer.encode(input_text,return_tensors="pt",max_length=10000,truncation=True,add_special_tokens=True)
    outputs=t5_model.generate(input_ids=input_ids,max_length=256,num_beams=4,repetition_penalty=1.0,
                              length_penalty=1.0,early_stopping=True)
    transcription=t5_tokenizer.decode(outputs[0],skip_special_tokens=True,clean_up_tokenization_spaces=True)
    return transcription

In [35]:
def split_text(transcription,max_size):
    # Split the text into pieces of at most max_size characters, cutting after the
    # last period in each window, or after the last space when there is no period.
    split_text_list=[]
    cut1=0
    length=len(transcription)
    while cut1<length:
        cut2=min(cut1+max_size,length)
        window=transcription[cut1:cut2]
        if cut2==length:      # the remainder fits in one piece
            nearest_idx=length
        elif "." in window:   # split after the last period
            nearest_idx=cut1+window.rfind(".")+1
        elif " " in window:   # otherwise split after the last space
            nearest_idx=cut1+window.rfind(" ")+1
        else:                 # no natural break point: hard cut
            nearest_idx=cut2
        split_text_list.append(transcription[cut1:nearest_idx])
        cut1=nearest_idx

    return split_text_list


In [36]:
tmp_transcription=transcription
split_text_list=split_text(tmp_transcription+" ",512)
punctuated_text=""
#gf = Gramformer(models = 1, use_gpu=True) # 1=corrector, 2=detector
for text_piece in split_text_list:  # a loop variable named split_text would shadow the function
    tmp_text=add_punctuation(t5_model,t5_tokenizer,text_piece)
    #corrected_sentence = gf.correct(tmp_text,max_candidates=1)
    #punctuated_text+=str(corrected_sentence)
    punctuated_text+=tmp_text

  f"This sequence already has {self.eos_token}. In future versions this behavior may lead to duplicated"


In [37]:
punctuated_text

'wake up to reality nothing ever goes as planned in this accursed world. The longer you live, the more you will realize that the only things that truly exist in this reality are: pain, suffering, and futility. Listen everywhere you look in this world. Where there is light, there will always be shadowys to be found. As long as there is a conceft ofvictors, the vanquished will also exist: the selfish intention of wanting peace initiates wars, and hatred is born in order to protect love there.Mexases causal relations ships that cannot be separated (i want to severe the fate of this world, a world of only victories, world of only peace, a world of only love, i will create such a world, i am the ghostt of theuchiha laa, a p oa paportrul, this reality, esiaelia pa aapo, p n epa).'
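
Since improving transcription quality is the goal, it helps to measure it. A common metric is word error rate (WER): the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The sketch below is a minimal pure-Python version; the function name is hypothetical, and the reference sentence is a guess at the intended wording, not a verified transcript of the video.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # previous[j] = edit distance between the ref words seen so far and hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1] / max(len(ref), 1)


reference = "as long as there is a concept of victors"   # assumed wording
hypothesis = "as long as there is a conceft of victors"  # taken from the output above
print(word_error_rate(reference, hypothesis))
```

Tracking this number on a hand-corrected excerpt before and after a change (chunk size, silence threshold, filtering) gives a concrete way to tell whether the change helped.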
