*Note* - Before running this notebook, please make sure that there is a folder 'data' in the working '/' directory. The 'data' folder should contain a subdirectory 'video_lecture' with the 'speech_recording.mp4' file for the lecture video to be transcribed.


In [1]:
!pip install openai-whisper deep-translator ffmpeg-python langdetect --quiet

In [2]:
!pip install git+https://github.com/huggingface/parler-tts.git --quiet

  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
  Preparing metadata (setup.py) ... [?25l[?25hdone


## Transcription

In [3]:
import os
import re
import gdown
import requests
import whisper
from pathlib import Path
from tqdm import tqdm
import torch
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer
import soundfile as sf
import warnings
import ffmpeg
warnings.filterwarnings("ignore")



In [None]:
def transcribe_video(video_path, output_path=None, model_size="base", language="hi"):
    """
    Transcribe a video file using OpenAI's Whisper model.

    Args:
        video_path (str): Path to the video file to transcribe
        output_path (str, optional): Path to save the transcript. If None, uses the video filename with .txt extension
        model_size (str, optional): Whisper model size: "tiny", "base", "small", "medium", or "large"
        language (str, optional): Language hint for the transcription model (e.g., "hi" for Hindi, "en" for English)

    Returns:
        str: Path to the saved transcript file
    """
    if not os.path.exists(video_path):
        print(FileNotFoundError(f"Video file not found: {video_path}"))

    if output_path is None:
        output_path = os.path.splitext(video_path)[0] + ".txt"

    print(f"Loading Whisper {model_size} model...")
    model = whisper.load_model(model_size)

    print(f"Transcribing {os.path.basename(video_path)}...")
    with tqdm(total=10, desc="Transcribing", bar_format='{l_bar}{bar}| {elapsed}') as pbar:
        result = model.transcribe(
            video_path,
            task="transcribe",
            language=language,
            verbose=False
        )
        pbar.update(1000)

    return result['text']




def preprocess_transcript(text, file_path):
    """
    Removes filler words like "um", "uh", etc.

    Args:
        text (str): Text string containing the transcript generated by model

    Returns:
        cleaned_text (str): Text string after removing all the filler words
    """
    filler_words = [r"\bum\b", r"\buh\b", r"\blike\b", r"\buhm\b", r"\buhhmm\b", r"\ba\b", r"\bhmm\b"]
    pattern = re.compile("|".join(filler_words), flags=re.IGNORECASE)

    cleaned_text = pattern.sub("", text)
    cleaned_text = re.sub(r"\s+", " ", cleaned_text).strip()

    os.makedirs(file_path, exist_ok=True)

    with open(file_path+'cleaned_transcript.txt', 'w', encoding='utf-8') as f:
        f.write(cleaned_text)
    print(f"Transcript cleaned and saved to: {file_path}cleaned_transcript.txt")

    return file_path



if __name__ == "__main__":

    video_path = '/data/video_lecture/speech_recording.mp4'
    target_dir = 'data/transcripts/'
    model = "base"

    if os.path.exists(video_path):

        transcript = transcribe_video(video_path, model_size=model)
        cleaned_transcript_path = preprocess_transcript(transcript, target_dir)
        print(transcript)
        print(f"Process complete.")


Loading Whisper base model...
Transcribing speech_recording.mp4...


Transcribing:   0%|          | 00:00
  0%|          | 0/169072 [00:00<?, ?frames/s][A
  1%|          | 1400/169072 [00:01<03:57, 705.00frames/s][A
  2%|▏         | 3400/169072 [00:02<01:47, 1543.76frames/s][A
  4%|▍         | 6400/169072 [00:03<01:04, 2539.32frames/s][A
  6%|▌         | 9300/169072 [00:04<00:59, 2693.97frames/s][A
  7%|▋         | 11800/169072 [00:04<00:55, 2850.88frames/s][A
  9%|▊         | 14600/169072 [00:05<00:53, 2901.69frames/s][A
 10%|█         | 17500/169072 [00:06<00:53, 2853.08frames/s][A
 12%|█▏        | 20100/169072 [00:07<00:54, 2749.08frames/s][A
 13%|█▎        | 22500/169072 [00:08<00:47, 3054.61frames/s][A
 15%|█▍        | 25200/169072 [00:09<00:45, 3190.63frames/s][A
 17%|█▋        | 27900/169072 [00:10<00:45, 3098.02frames/s][A
 18%|█▊        | 30000/169072 [00:11<00:46, 2962.47frames/s][A
 19%|█▉        | 32700/169072 [00:11<00:41, 3309.34frames/s][A
 21%|██        | 35300/169072 [00:12<00:41, 3240.21frames/s][A
 22%|██▏       | 37900

 We have been talking about this audio processing with respect to speaker recognition, speech recognition or any such related task. But you also know that with any of these security technologies, there can be attacks or there can be people who have any kind of ill intention who would like to defraud the system. Right? Who would like to fool the system? Have you heard of any such examples, any such real-world examples where any kind of security system is in place, be it biometrics, face, voice, any of those. And there have been cases where these systems have been fooled. Anybody remembers any such instance? Would like to share. And when you have to speak up. Is it going to be on the... Okay, all right. So there have been several such instances, not only in... I'm audible, right? Somebody please speak up. I'm not able to hear you guys. Yes, please. To see the text and all. Okay, so there have been several such instances with respect to different kinds of tasks, automation tasks that we h




## Audio Generation

In [14]:
# ENGLISH TO MARATHI TRANSLATION

from deep_translator import GoogleTranslator
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import re
from langdetect import detect

device = "cuda:0" if torch.cuda.is_available() else "cpu"

model = ParlerTTSForConditionalGeneration.from_pretrained("ai4bharat/indic-parler-tts").to(device)
tokenizer = AutoTokenizer.from_pretrained("ai4bharat/indic-parler-tts")
description_tokenizer = AutoTokenizer.from_pretrained(model.config.text_encoder._name_or_path)

description = "Sanjay who speaks fluent Marathi language delivers a slightly expressive and animated speech with a moderate speed and pitch. The recording is of very high quality, with the speaker's voice sounding clear and very close up."
description_input_ids = description_tokenizer(description, return_tensors="pt").to(device)

  "_name_or_path": "google/flan-t5-large",
  "architectures": [
    "T5ForConditionalGeneration"
  ],
  "classifier_dropout": 0.0,
  "d_ff": 2816,
  "d_kv": 64,
  "d_model": 1024,
  "decoder_start_token_id": 0,
  "dense_act_fn": "gelu_new",
  "dropout_rate": 0.1,
  "eos_token_id": 1,
  "feed_forward_proj": "gated-gelu",
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "is_gated_act": true,
  "layer_norm_epsilon": 1e-06,
  "model_type": "t5",
  "n_positions": 512,
  "num_decoder_layers": 24,
  "num_heads": 16,
  "num_layers": 24,
  "output_past": true,
  "pad_token_id": 0,
  "relative_attention_max_distance": 128,
  "relative_attention_num_buckets": 32,
  "tie_word_embeddings": false,
  "transformers_version": "4.46.1",
  "use_cache": true,
  "vocab_size": 32128
}

  "_name_or_path": "ylacombe/dac_44khz",
  "architectures": [
    "DacModel"
  ],
  "codebook_dim": 8,
  "codebook_loss_weight": 1.0,
  "codebook_size": 1024,
  "commitment_loss_weight": 0.25,
  "decoder_hidden_si

In [12]:
del model, chunks, translation, tokenizer, description_tokenizer, description_input_ids

In [None]:
def split_text(text, num_chunks=4):
    """Splits text into roughly equal chunks without breaking sentences or words."""

    sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?|\!)\s', text)
    target_length = 1000

    chunks = []
    current_chunk = ""
    current_length = 0

    for sentence in sentences:
        if current_length + len(sentence) <= target_length:
            current_chunk += sentence
            current_length += len(sentence)
        else:
            chunks.append(current_chunk.strip())
            current_chunk = sentence
            current_length = len(sentence)

    chunks.append(current_chunk.strip())

    return chunks


def translate_to_marathi(english_text):
    """
    Translates English text to Marathi using the Deep Translate model.

    Args:
        english_text: The English text to translate.

    Returns:
        The translated Marathi text.
    """
    translator = GoogleTranslator(source='en', target='mr')
    translated_text = translator.translate(english_text)
    return translated_text


with open('/data/transcripts/cleaned_transcript.txt', 'r') as file:
    lines = file.readlines()

chunks = split_text(lines[0], num_chunks=4)
marathi_translation = ""

for chunk in chunks:
    translation = translate_to_marathi(chunk)
    if detect(translation) != 'en':
        marathi_translation += translation

print(marathi_translation)

आणि विशेषत: भाषण. ही सर्व बातम्या आहेत ज्या आपण दस्तऐवजीकृत प्रकरणांची पाहता. विशेष आवश्यकतांसह, ते ऑडिओ कॅप्चर बोलू शकतात, बरोबर? म्हणून ... परंतु यापैकी काही तंत्रज्ञानाचा वापर करून हे फसवले जाऊ शकते. आणि यापैकी काही प्रकरणांमध्ये असेही घडले आहे. उदाहरणार्थ, या अ‍ॅडोबोकोने आवाजासाठी फोटोशॉप केला आहे ज्यामुळे रोबोट भाषण सिम्युलेटर आणि अशा अनेक गोष्टींचा वापर केला गेला आहे.यापूर्वी आम्ही याला स्पूफिंग म्हणायचो पण नंतर मी चाचणी सादरीकरणाचे हल्ले देखील म्हटले होते. बायोमेट्रिक सिस्टमच्या समोर सादर केलेले काहीही, त्यावर हल्ला केला जाऊ शकतो. फेस. म्हणून आपण वास्तविक चेहरा नोंदवाल, वास्तविक चेहरा क्वेरी म्हणून येत आहे. म्हणून हे अस्सल म्हणून अंदाज आहे, बरोबर? बायोमेट्रिक. लेट असे म्हणतात की बायोमेट्रिक चेहरा आहे किंवा बायोमेट्रिक ऑडिओ आहे.मी फक्त त्या बायोमेट्रिकला देतो की आपण पाहण्याचा प्रयत्न केला नाही किंवा आपला आवाज किंवा फिंगरप्रिंटची तोतयागिरी करण्याचा प्रयत्न केला नाही. जे काही बदल आहे, मी फक्त माझ्या वैशिष्ट्यांचा वापर करून आपल्या खात्यात प्रवेश मिळविण्याचा प्रयत्न करतो. शून्य प्रयत

In [None]:
# TEXT TO AUDIO CONVERSION

max_length = 500
chunks = [marathi_translation[i:i + max_length] for i in range(0, len(marathi_translation), max_length)]

audio_arr_full = []

for i, chunk in enumerate(chunks):

    prompt_input_ids = tokenizer(chunk, return_tensors="pt").to(device)
    generation = model.generate(
        input_ids=description_input_ids.input_ids,
        attention_mask=description_input_ids.attention_mask,
        prompt_input_ids=prompt_input_ids.input_ids,
        prompt_attention_mask=prompt_input_ids.attention_mask
    )
    audio_arr_full.extend(generation.cpu().numpy().squeeze())

    if i%5 == 0:
        sf.write("/data/marathi_tts_out.wav", audio_arr_full, model.config.sampling_rate)
        print('Output file translated from English to Marathi is saved at /data/marathi_tts_out.wav')

Output file translated from English to Marathi is saved at /content/data/marathi_tts_out.wav
