

---



In [None]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"

In [None]:
!pip install pytube
!pip install moviepy
!pip install noisereduce
!pip install playsound
!pip install pydub



In [None]:
from pytube import YouTube
from moviepy.editor import VideoFileClip

from transformers import Wav2Vec2ForCTC, Wav2Vec2Tokenizer
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
import librosa
import torch
import tqdm
# import torchaudio

# from scipy.io import wavfile as wav
# from scipy.io.wavfile import write
# import sounddevice as sd
# import soundfile as sf
from playsound import playsound
# import noisereduce as nr

from scipy.io import wavfile
import soundfile as sf
import noisereduce as nr
from pydub import AudioSegment, silence, effects



---


### Data Pre-processing
- *Downloaded the video and extracted the full audio track*
- *Applied some clean-up steps to the raw audio*
---
#### Issues Faced
  1. Had to convert the audio from **stereo to mono**.
  2. **Reduced background noise** (people chattering and clapping), since the audio was a recording of an announcement over the PA system.
  3. Because the speakers were on a mic, the loudness was uneven, so I had to **normalise the audio**.
  4. (Commented out) Tried **removing silences**, as they were interfering with the transcriptions, but this was later handled by the model itself.
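
As a rough illustration of issues 1 and 3, the sketch below downmixes interleaved stereo samples to mono by averaging the two channels, then peak-normalises the result. It works on plain float samples in [-1, 1]. The function names and the 0.1 dB headroom default are assumptions made for this example; pydub's `set_channels` and `effects.normalize` work on integer PCM and may differ in detail.

```python
def stereo_to_mono(interleaved):
    """Average left/right pairs from an interleaved [L, R, L, R, ...] list."""
    if len(interleaved) % 2:
        raise ValueError("expected an even number of samples")
    return [(interleaved[i] + interleaved[i + 1]) / 2.0
            for i in range(0, len(interleaved), 2)]


def peak_normalise(samples, headroom_db=0.1):
    """Scale samples so the loudest one sits headroom_db below full scale (1.0)."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # pure silence: nothing to scale
    target = 10 ** (-headroom_db / 20.0)  # convert dB to linear amplitude
    gain = target / peak
    return [s * gain for s in samples]


# Toy input: three stereo frames (assumed values, not taken from the real audio)
mono = stereo_to_mono([0.2, 0.4, -0.5, -0.1, 0.1, 0.1])
normalised = peak_normalise(mono)
```

Peak normalisation only applies one global gain, so it evens out the overall level but does not compress loud and quiet passages relative to each other.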


---


*Python modules used:*
1. moviepy
2. noisereduce
3. pydub
4. librosa
5. tqdm
6. scipy.io (`wavfile`)


---



In [None]:
# Download the video
yt = YouTube('https://www.youtube.com/watch?v=Sby1uJ_NFIY')
stream = yt.streams.first()
stream.download(filename='video.mp4')

# Extract audio
video = VideoFileClip("video.mp4")
audio = video.audio.subclip(0, 1575.06)
audio.write_audiofile("audio.wav")
print("Length of audio =", video.audio.duration / 60, "minutes")

MoviePy - Writing audio in audio.wav

MoviePy - Done.
Length of audio = 26.250999999999998 minutes




In [None]:
# Convert from stereo to mono
print("Converting to mono...")
sound = AudioSegment.from_wav("audio.wav")
sound = sound.set_channels(1)
# export() returns the open file handle, so close it before reading the file back
sound.export("audio_mono.wav", format="wav").close()
print("---Done")

# load data
rate, data = wavfile.read("audio_mono.wav")

# perform noise reduction
print("Reducing Noise...")
reduced_noise = nr.reduce_noise(y=data, sr=rate, thresh_n_mult_nonstationary=2, stationary=False, use_torch=True)
wavfile.write("noise_reduced_audio.wav", rate, reduced_noise)
print("---Done")

# Normalise loudness (peak normalisation)
print("Normalising Noise...")
rawsound = AudioSegment.from_file("noise_reduced_audio.wav", "wav")
normalizedsound = effects.normalize(rawsound)
normalizedsound.export("normalised_audio.wav", format="wav")
print("---Done")

# # Split on silence
# print("Splitting on Silence and removing the Silence...")
# sound = AudioSegment.from_wav("audio.wav")
# chunks = silence.split_on_silence(sound, min_silence_len=1000, silence_thresh=-16,keep_silence=100)
# print("---Done")

# # Combine chunks
# combined = AudioSegment.empty()
# for chunk in chunks:
#     combined += chunk
# print("---Done chunking")


# # Save the processed audio
# combined.export("silence_removed_audio.wav", format="wav")

# # Save the processed audio
# sf.write("finalised_audio.wav", normalizedsound, rate)
# print("----Audio successfully processed and saved.----")

Converting to mono...
---Done
Reducing Noise...
---Done
Normalising Noise...
---Done


In [None]:
speech, rate = librosa.load("normalised_audio.wav",sr=16000)
# print(type(speech))

## **My First Experimental Model - Wav2Vec2**

- Used for a basic implementation.
- Dropped because the code was lengthy and error-prone. I got it running without errors, but the next model gave better results, so I moved on.

I followed these references for the implementation:
https://huggingface.co/docs/transformers/en/model_doc/wav2vec2
https://www.analyticsvidhya.com/blog/2021/02/hugging-face-introduces-the-first-automatic-speech-recognition-model-wav2vec2/

* Used the ***`trellis matrix`*** approach to align the transcriptions - https://pytorch.org/audio/stable/tutorials/forced_alignment_tutorial.html

* For tackling misrecognition of Indian accents -
https://medium.com/nerd-for-tech/indian-accent-speech-recognition-2d433eb7edac

* Also tried fine-tuning the pretrained Wav2Vec2 model, but didn't have enough resources or time -
    https://www.iitm.ac.in/donlab/tts/database.php
    https://catalog.ldc.upenn.edu/LDC2019S11
* Studied Connectionist Temporal Classification (CTC) - https://distill.pub/2017/ctc/
* For Perceptual Linear Prediction (PLP) and Mel-frequency cepstral coefficients (MFCC) - https://jonathan-hui.medium.com/speech-recognition-feature-extraction-mfcc-plp-5455f5a69dd9
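
To make the CTC step concrete, here is a minimal sketch of greedy CTC decoding, which is roughly what `torch.argmax` followed by the tokenizer's `decode` does: take the most likely id per frame, collapse consecutive repeats, drop the blank token, and turn the `|` word separator into spaces. The toy vocabulary and `blank_id=0` are assumptions for illustration; the real Wav2Vec2 vocabulary is larger.

```python
def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0, word_sep="|"):
    """Collapse repeated ids, drop blanks, and map the rest to characters."""
    chars = []
    prev = None
    for idx in frame_ids:
        # Emit a character only when the id changes and is not the blank token
        if idx != prev and idx != blank_id:
            chars.append(id_to_char[idx])
        prev = idx
    return "".join(chars).replace(word_sep, " ").strip()


# Hypothetical toy vocabulary (not the real checkpoint's vocabulary)
vocab = {0: "<pad>", 1: "|", 2: "H", 3: "I", 4: "O"}

# One predicted id per audio frame, e.g. the argmax over the logits
frames = [2, 2, 0, 3, 3, 1, 1, 0, 2, 0, 4, 4]
text = ctc_greedy_decode(frames, vocab)
```

A blank between two identical ids keeps them as two separate characters, which is how CTC represents doubled letters.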

For inspiration -

https://github.com/AdroitAnandAI/Indian-Accent-Speech-Recognition/tree/master
https://discuss.huggingface.co/t/how-to-create-wav2vec2-with-language-model/12703
https://github.com/flashlight/flashlight/blob/main/flashlight/app/asr/README.md


---



In [None]:
# Load pre-trained Wav2Vec2 model and tokenizer
tokenizer1 = Wav2Vec2Tokenizer.from_pretrained("facebook/wav2vec2-large-960h")
model1 = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-960h")

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'Wav2Vec2CTCTokenizer'. 
The class this function is called from is 'Wav2Vec2Tokenizer'.

Some weights of the model checkpoint at facebook/wav2vec2-large-960h were not used when initializing Wav2Vec2ForCTC: ['wav2vec2.encoder.pos_conv_embed.conv.weight_g', 'wav2vec2.encoder.pos_conv_embed.conv.weight_v']
- This IS expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Wav2Ve

In [None]:
# Define the chunk duration
chunk_duration = 15  # in seconds
chunk_samples = chunk_duration * rate

# Split audio into chunks
chunks1 = [speech[i:i + chunk_samples] for i in range(0, len(speech), chunk_samples)]

In [None]:
# Transcribe each chunk
transcriptions1 = []
for chunk in chunks1:
    input_values = tokenizer1(chunk, return_tensors='pt', padding='longest').input_values
    with torch.no_grad():
        logits = model1(input_values).logits
    predicted_ids = torch.argmax(logits, dim=-1)
    transcription1 = tokenizer1.decode(predicted_ids[0])
    transcriptions1.append(transcription1)


In [None]:
# Combine the transcriptions

full_transcription_WAV2VEC2 = " ".join(transcriptions1)
print(full_transcription_WAV2VEC2)

CONGRATULATIONS TO YOU MISTER ROCKERBAN FOR THAT AKS OBOSTE TORNING UT OBERTE  AY EVERYBODY HOW ARE YOU AY NOT HEARING THIS AT ALL LIKE A POST LIE SHA AN EMPTY DOWNET OR SOMETHING LET'S HEAR IT ARE REGUISE THE WIG ALL RIGHT BETTER BE BECAUSE SA BE YOU HAVE A SUPA STAR GUESTIE AH YOU HERTH OF FORTY ONE MILLION DOLLARS A I DID GE ONSE IN IGNA SHE SAID AFTER THAT A SO O RIGNU ASK QUORTER WHAT FORTY MILLION DOLLARS SOHI BY THE END OF THIS CONVERSATION FOUQUET AHM BUT LET GET STARTED AH I WANT INTRODUCE THE VIC AND PRATHIUCHISCOPONDE WAS NAT HER AH WE WANTED TO STARD WITH A FLANG OF VIDIL AH OFF WHAT AM OPEN HATI DUS I ENCOURAGE ALL OF YOU TO GO AH TO THE WEBSIDE'S AROMBADIAI AND CHECK IT OUT AM BUT LET ME START BY INTRODUCING LE VIK AH WE MAKE AS A DEAR FRIEND AND A HE IS VERY VERY MODEST ON O OF THE MOST MODEST GUIDES THAT I KNOW BUT HIS PERSONAL TURN TO THE QUVINA YOU GOT A PIACTI FROM KANIMELEN YOU SATTRED AND SOLD A COMPANY TO MAGMA AND THE OBLECANDIPUT BAKTITIA FROM ROBOT IN THE VALLE

In [None]:
def chunk_transcription(transcription, max_chunk_length=512):
    words = transcription.split()
    chunks = []
    current_chunk = []

    for word in words:
        current_chunk.append(word)
        # If adding the next word exceeds the max length, start a new chunk
        if len(" ".join(current_chunk)) > max_chunk_length:
            # Join the current chunk into a string and add it to the list of chunks
            chunks.append(" ".join(current_chunk[:-1]))
            # Start a new chunk with the last word
            current_chunk = [current_chunk[-1]]

    # Add any remaining words as the last chunk
    if current_chunk:
        chunks.append(" ".join(current_chunk))

    return chunks

chunk_transcription(full_transcription_WAV2VEC2, max_chunk_length=512)

["CONGRATULATIONS TO YOU MISTER ROCKERBAN FOR THAT AKS OBOSTE TORNING UT OBERTE AY EVERYBODY HOW ARE YOU AY NOT HEARING THIS AT ALL LIKE A POST LIE SHA AN EMPTY DOWNET OR SOMETHING LET'S HEAR IT ARE REGUISE THE WIG ALL RIGHT BETTER BE BECAUSE SA BE YOU HAVE A SUPA STAR GUESTIE AH YOU HERTH OF FORTY ONE MILLION DOLLARS A I DID GE ONSE IN IGNA SHE SAID AFTER THAT A SO O RIGNU ASK QUORTER WHAT FORTY MILLION DOLLARS SOHI BY THE END OF THIS CONVERSATION FOUQUET AHM BUT LET GET STARTED AH I WANT INTRODUCE THE VIC",
 "AND PRATHIUCHISCOPONDE WAS NAT HER AH WE WANTED TO STARD WITH A FLANG OF VIDIL AH OFF WHAT AM OPEN HATI DUS I ENCOURAGE ALL OF YOU TO GO AH TO THE WEBSIDE'S AROMBADIAI AND CHECK IT OUT AM BUT LET ME START BY INTRODUCING LE VIK AH WE MAKE AS A DEAR FRIEND AND A HE IS VERY VERY MODEST ON O OF THE MOST MODEST GUIDES THAT I KNOW BUT HIS PERSONAL TURN TO THE QUVINA YOU GOT A PIACTI FROM KANIMELEN YOU SATTRED AND SOLD A COMPANY TO MAGMA AND THE OBLECANDIPUT BAKTITIA FROM ROBOT IN THE 

In [None]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Load the GPT-2 tokenizer and model
lm_tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
lm_model = GPT2LMHeadModel.from_pretrained("gpt2")

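# Function to "enhance" a transcription using a language model.
# Note: generate() only appends up to 50 continuation tokens to the input;
# it does not rewrite the transcription itself.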
def enhance_transcription(transcription, lm_tokenizer, lm_model):
    # Encode the transcription with attention mask
    inputs = lm_tokenizer.encode(transcription, return_tensors='pt', padding=False, truncation=True)
    attention_mask = torch.ones(inputs.shape, dtype=torch.long)

    # Generate output from the language model
    with torch.no_grad():
        outputs = lm_model.generate(
            inputs,
            attention_mask=attention_mask,
            max_length=len(inputs[0]) + 50,
            num_return_sequences=1,
            pad_token_id=lm_tokenizer.eos_token_id
        )
    # Decode the output
    enhanced_transcription = lm_tokenizer.decode(outputs[0], skip_special_tokens=True)
    return enhanced_transcription

# Enhance each transcription chunk
chunks = chunk_transcription(full_transcription_WAV2VEC2, max_chunk_length=512)
enhanced_chunks = [enhance_transcription(chunk, lm_tokenizer, lm_model) for chunk in chunks]

# Print each chunk next to its enhanced version.
# Zip over `chunks`, not the full string: iterating a str yields single characters.
for original, enhanced in zip(chunks, enhanced_chunks):
    print(f"Original: {original}")
    print(f"Enhanced: {enhanced}")
    print("---------")


Original: C
Enhanced: CONGRATULATIONS TO YOU MISTER ROCKERBAN FOR THAT AKS OBOSTE TORNING UT OBERTE AY EVERYBODY HOW ARE YOU AY NOT HEARING THIS AT ALL LIKE A POST LIE SHA AN EMPTY DOWNET OR SOMETHING LET'S HEAR IT ARE REGUISE THE WIG ALL RIGHT BETTER BE BECAUSE SA BE YOU HAVE A SUPA STAR GUESTIE AH YOU HERTH OF FORTY ONE MILLION DOLLARS A I DID GE ONSE IN IGNA SHE SAID AFTER THAT A SO O RIGNU ASK QUORTER WHAT FORTY MILLION DOLLARS SOHI BY THE END OF THIS CONVERSATION FOUQUET AH BUT LET GET STARTED AH I WANT INTRODUCE THE VICELAND OF THE WIG AND THE WIG IS A WIG AND I WANT TO BE A WIG AND I WANT TO BE A WIG AND I WANT TO BE A WIG AND I WANT TO BE A WIG AND I WANT TO
---------
Original: O
Enhanced: AND PRATHIUCHISCOPONDE AS NOT HERE AH WE WANTED TO STARD WITH A FLANG OF VIDIOL AH OFF WHAT AM OPEN HATI DUS I ENCOURAGE ALL OF YOU TO GO AH TO THE WHEBSIDE'S AROMBADIAI AND CHECK IT OUT AM BUT LET ME START BY INTRODUCING LE VIK AH WE MAKE AS A DEAR FRIEND AND A HE IS VERY VERY MODEST ON O OF

In [None]:
'''
Generate frame-wise label probability (DONE ABOVE)

Generate alignment probability (trellis):
From the emission matrix, we next generate the trellis, which represents the probability of each transcript label occurring at each time frame.
'''
# We enclose the transcript with space tokens, which represent SOS and EOS.
# transcript = "SAMPLE|TEXT|"
# dictionary = {c: i for i, c in enumerate(labels)}

# tokens = [dictionary[c] for c in transcript]
# print(list(zip(transcript, tokens)))


# def get_trellis(emission, tokens, blank_id=0):
#     num_frame = emission.size(0)
#     num_tokens = len(tokens)

#     trellis = torch.zeros((num_frame, num_tokens))
#     trellis[1:, 0] = torch.cumsum(emission[1:, blank_id], 0)
#     trellis[0, 1:] = -float("inf")
#     trellis[-num_tokens + 1 :, 0] = float("inf")

#     for t in range(num_frame - 1):
#         trellis[t + 1, 1:] = torch.maximum(
#             # Score for staying at the same token
#             trellis[t, 1:] + emission[t, blank_id],
#             # Score for changing to the next token
#             trellis[t, :-1] + emission[t, tokens[1:]],
#         )
#     return trellis


# trellis = get_trellis(emission, tokens)
'''
BACKTRACKING :
Once the trellis is generated, we will traverse it following the elements with high probability.
The trellis matrix is used for path-finding, but for the final probability of each segment, we take the frame-wise probability from emission matrix.
'''
# class Point:
#     token_index: int
#     time_index: int
#     score: float


# def backtrack(trellis, emission, tokens, blank_id=0):
#     t, j = trellis.size(0) - 1, trellis.size(1) - 1

#     path = [Point(j, t, emission[t, blank_id].exp().item())]
#     while j > 0:
#         # Should not happen but just in case
#         assert t > 0

#         # 1. Figure out if the current position was stay or change
#         # Frame-wise score of stay vs change
#         p_stay = emission[t - 1, blank_id]
#         p_change = emission[t - 1, tokens[j]]

#         # Context-aware score for stay vs change
#         stayed = trellis[t - 1, j] + p_stay
#         changed = trellis[t - 1, j - 1] + p_change

#         # Update position
#         t -= 1
#         if changed > stayed:
#             j -= 1

#         # Store the path with frame-wise probability.
#         prob = (p_change if changed > stayed else p_stay).exp().item()
#         path.append(Point(j, t, prob))

#     # Now j == 0, which means, it reached the SoS.
#     # Fill up the rest for the sake of visualization
#     while t > 0:
#         prob = emission[t - 1, blank_id].exp().item()
#         path.append(Point(j, t - 1, prob))
#         t -= 1

#     return path[::-1]


# path = backtrack(trellis, emission, tokens)
# for p in path:
#     print(p)

'''
MERGING THE WORDS
The Wav2Vec2 model uses '|' as the word boundary, so we merge the segments before each occurrence of '|'.

'''
#  # Merge words
# def merge_words(segments, separator="|"):
#     words = []
#     i1, i2 = 0, 0
#     while i1 < len(segments):
#         if i2 >= len(segments) or segments[i2].label == separator:
#             if i1 != i2:
#                 segs = segments[i1:i2]
#                 word = "".join([seg.label for seg in segs])
#                 score = sum(seg.score * seg.length for seg in segs) / sum(seg.length for seg in segs)
#                 words.append(Segment(word, segments[i1].start, segments[i2 - 1].end, score))
#             i1 = i2 + 1
#             i2 = i1
#         else:
#             i2 += 1
#     return words


# word_segments = merge_words(segments)
# for word in word_segments:
#     print(word)
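
The commented-out `merge_words` above depends on a `Segment` type that is not defined in this notebook. The sketch below is a small self-contained version of the same idea: character-level segments are grouped into words at each `|`, and each word's score is the length-weighted average of its characters' scores. The `Segment` dataclass and the toy segments are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    label: str
    start: int    # first frame index (inclusive)
    end: int      # last frame index (exclusive)
    score: float  # average frame-wise probability

    @property
    def length(self):
        return self.end - self.start


def merge_words(segments, separator="|"):
    """Group character segments into word segments at each separator."""
    words, current = [], []
    # A trailing separator acts as a sentinel that flushes the last word
    for seg in list(segments) + [Segment(separator, 0, 0, 0.0)]:
        if seg.label != separator:
            current.append(seg)
            continue
        if current:
            total = sum(s.length for s in current)
            score = sum(s.score * s.length for s in current) / total
            label = "".join(s.label for s in current)
            words.append(Segment(label, current[0].start, current[-1].end, score))
            current = []
    return words


# Toy character segments (assumed values, not real alignment output)
chars = [
    Segment("H", 0, 2, 0.9),
    Segment("I", 2, 3, 0.6),
    Segment("|", 3, 4, 1.0),
    Segment("O", 4, 6, 0.8),
    Segment("K", 6, 8, 0.4),
]
word_segments = merge_words(chars)
```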

"\nMERGING THE WORDS\nThe Wav2Vec2 model uses '|' as the word boundary, so we merge the segments before each occurance of '|'.\n\n"

## My Final Implementations


In [None]:
!pip install faster-whisper



In [None]:
from faster_whisper import WhisperModel

model_size = "large-v3"

# Run on GPU with FP16
# model = WhisperModel(model_size, device="cuda", compute_type="float16")
# Run on GPU with INT8
# model = WhisperModel(model_size, device="cuda", compute_type="int8_float16")
# Run on CPU with INT8
# model = WhisperModel(model_size, device="cpu", compute_type="int8")
model = WhisperModel(model_size, device="auto", compute_type="default")

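segments, info = model.transcribe("normalised_audio.wav", beam_size=5, vad_filter=True, vad_parameters=dict(min_silence_duration_ms=50))
# transcribe() returns a lazy generator; gather it into a list so it can be iterated more than once
segments = list(segments)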

print("Detected language '%s' with probability %f" % (info.language, info.language_probability))

for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))

Detected language 'en' with probability 0.932617
[0.00s -> 2.10s]  Congratulations to you, Mr. Raghavan, for that.
[2.28s -> 3.46s]  Thanks so much for joining us.
[3.76s -> 4.22s]  Over to you.
[8.44s -> 9.28s]  Hi, everybody.
[9.64s -> 10.06s]  How are you?
[10.82s -> 13.56s]  I'm not hearing this at all.
[13.90s -> 16.86s]  It's like a post-lunch energy downer or something.
[17.98s -> 18.76s]  Let's hear it.
[19.42s -> 20.52s]  Are you guys awake?
[21.68s -> 22.34s]  All right.
[23.02s -> 26.90s]  You better be because we have a superstar guest here.
[27.54s -> 28.90s]  You heard the $41 million.
[29.40s -> 31.56s]  I didn't hear, honestly, anything that she said after that.
[32.66s -> 38.34s]  So, we're going to ask for about $40 million by the end of this conversation, okay?
[39.48s -> 40.54s]  But let's get started.
[41.46s -> 44.94s]  I want to introduce Vivek and Pratyush, his co-founder, who's not here.
[45.68s -> 50.88s]  We wanted to start with playing a video of what Open H

In [None]:
sample_output_list = []

# Process each segment and store the information in a dictionary
for idx, segment in enumerate(segments):
    chunk_dict = {
        "chunk_id": idx + 1,
        "chunk_length": segment.end - segment.start,
        "text": segment.text,
        "start_time": segment.start,
        "end_time": segment.end,
    }
    sample_output_list.append(chunk_dict)

# Print the result
for chunk in sample_output_list:
    print(chunk)
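
One thing to watch with the loop above: faster-whisper documents `segments` as a generator, so a second pass over it yields nothing unless the segments were first gathered into a list. The sketch below shows the effect with a stand-in generator; `FakeSegment`, `fake_transcribe`, and `segments_to_dicts` are hypothetical names used only for this illustration.

```python
from collections import namedtuple

# Stand-in for a transcription segment; only the fields used above are modelled
FakeSegment = namedtuple("FakeSegment", ["start", "end", "text"])


def fake_transcribe():
    """Yield segments lazily, the way a generator-based API would."""
    yield FakeSegment(0.00, 2.10, " Congratulations to you, Mr. Raghavan, for that.")
    yield FakeSegment(2.28, 3.46, " Thanks so much for joining us.")


def segments_to_dicts(segments):
    """Build the same per-chunk dictionaries as the cell above."""
    return [
        {
            "chunk_id": idx + 1,
            "chunk_length": seg.end - seg.start,
            "text": seg.text,
            "start_time": seg.start,
            "end_time": seg.end,
        }
        for idx, seg in enumerate(segments)
    ]


lazy = fake_transcribe()
first_pass = segments_to_dicts(lazy)   # consumes the generator
second_pass = segments_to_dicts(lazy)  # generator is already exhausted

materialised = list(fake_transcribe())  # a list can be iterated repeatedly
again = segments_to_dicts(materialised)
```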

### TODO: Speaker Diarization

Detect the speakers and label each segment with its speaker.
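
One possible way to do the labelling, assuming a separate diarization step has already produced speaker turns as `(start, end, speaker)` tuples, is to give each transcript chunk the speaker whose turn overlaps it the most. The turns below are made up for illustration, since no diarization was run on this audio, and `label_speakers` is a hypothetical helper.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def label_speakers(chunks, turns, unknown="UNKNOWN"):
    """Attach to each chunk the speaker whose turn overlaps it most."""
    labelled = []
    for chunk in chunks:
        best_speaker, best_overlap = unknown, 0.0
        for turn_start, turn_end, speaker in turns:
            ov = overlap(chunk["start_time"], chunk["end_time"], turn_start, turn_end)
            if ov > best_overlap:
                best_speaker, best_overlap = speaker, ov
        labelled.append({**chunk, "speaker": best_speaker})
    return labelled


# Two chunks in the same shape as sample_output_list
chunks = [
    {"chunk_id": 1, "text": " Congratulations to you, Mr. Raghavan, for that.",
     "start_time": 0.00, "end_time": 2.10},
    {"chunk_id": 4, "text": " Hi, everybody.",
     "start_time": 8.44, "end_time": 9.28},
]

# Hypothetical diarization output: (start, end, speaker)
turns = [(0.0, 4.5, "SPEAKER_00"), (8.0, 60.0, "SPEAKER_01")]

labelled_chunks = label_speakers(chunks, turns)
```

Chunks that overlap no turn keep the `UNKNOWN` label, which makes gaps in the diarization easy to spot.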