## Audio Translation Pipeline: Phase 1

### Installing Necessary Libraries

In [2]:
!pip install speechrecognition
!pip install pydub
!apt-get update
!apt-get install -y portaudio19-dev

!pip install pyaudio

Collecting speechrecognition
  Downloading speechrecognition-3.14.2-py3-none-any.whl.metadata (30 kB)
Downloading speechrecognition-3.14.2-py3-none-any.whl (32.9 MB)
Installing collected packages: speechrecognition
Successfully installed speechrecognition-3.14.2
Collecting pydub
  Downloading pydub-0.25.1-py2.py3-none-any.whl.metadata (1.4 kB)
Downloading pydub-0.25.1-py2.py3-none-any.whl (32 kB)
Installing collected packages: pydub
Successfully installed pydub-0.25.1
Get:1 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease [3,632 B]
Get:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease [1,581 B]
Get:3 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ Packages [75.2 kB]
Get:4 http://security.ubuntu.com/ubuntu jammy-security InRelease [129 kB]
Hit:5 http://archive.ubuntu.com/ubuntu jammy InRelease

### Mount Google Drive

In [3]:
# Mount gdrive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Audio Processing and Translation

Converts video to audio, enhances it, and performs speech recognition.

---

#### 1. Video to Audio (MP4 → WAV)

- **Source Check:** Tries `LOCAL_VIDEO_PATH`, then `DRIVE_VIDEO_PATH`.
- **Conversion:** Uses `moviepy.editor` to extract and save audio as WAV.
- **Error Handling:** Raises `FileNotFoundError` if video is missing.
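
The source check described above amounts to returning the first candidate path that exists. A minimal sketch of that idea, assuming a hypothetical helper `first_existing_path` (it is not part of the notebook) and using a temporary file in place of the real video:

```python
import os
import tempfile


def first_existing_path(candidates):
    """Return the first path in `candidates` that exists on disk.

    Hypothetical helper for illustration only; raises FileNotFoundError
    when none of the candidates exist.
    """
    for path in candidates:
        if os.path.exists(path):
            return path
    raise FileNotFoundError(f"None of these paths exist: {candidates}")


# Demo: an empty temporary file stands in for the video.
with tempfile.TemporaryDirectory() as tmp:
    present = os.path.join(tmp, "clip.mp4")
    missing = os.path.join(tmp, "missing.mp4")
    open(present, "wb").close()

    chosen = first_existing_path([missing, present])
    # splitext swaps only the final extension, unlike str.replace,
    # which would also rewrite ".mp4" appearing elsewhere in the path.
    wav_path = os.path.splitext(chosen)[0] + ".wav"
    print(chosen, "->", wav_path)
```

With this shape the conversion code only has to be written once, after the path has been resolved, rather than once per candidate location.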


In [7]:
# === AUDIO PROCESSING AND TRANSLATION SCRIPT ===

import os
import speech_recognition as sr
import moviepy.editor as mp
from pydub import AudioSegment
from pydub.silence import detect_nonsilent

# ====== CONFIGURATION ======
RECOGNITION_LANGUAGE = "en-IN"  # Change to "en-US", "hi-IN", etc. as needed
LOCAL_VIDEO_PATH = "/content/my_name_is_gora.mp4"
DRIVE_VIDEO_PATH = "/content/drive/MyDrive/NLP_Proj/my_name_is_gora.mp4"
PROCESSED_AUDIO_PATH = "processed_input.wav"
# ===========================

# Step 1: Convert video to audio (mp4 → wav)
try:
    input_video_file = LOCAL_VIDEO_PATH
    if not os.path.exists(input_video_file):
        raise FileNotFoundError

    input_audio_file = input_video_file.replace('.mp4', '.wav')
    video = mp.VideoFileClip(input_video_file)
    video.audio.write_audiofile(input_audio_file)
    print(f"Loaded audio file from video and saved as {input_audio_file}")
except FileNotFoundError:
    try:
        input_video_file = DRIVE_VIDEO_PATH
        if not os.path.exists(input_video_file):
            raise FileNotFoundError

        input_audio_file = input_video_file.replace('.mp4', '.wav')
        video = mp.VideoFileClip(input_video_file)
        video.audio.write_audiofile(input_audio_file)
        print(f"Loaded audio file from video and saved as {input_audio_file}")
    except FileNotFoundError:
        print("Error: video file not found in either /content or Google Drive.")
        raise

# Step 2: Trim silence at the beginning and boost volume
audio = AudioSegment.from_wav(input_audio_file)
nonsilent_ranges = detect_nonsilent(audio, min_silence_len=300, silence_thresh=-40)

if nonsilent_ranges:
    start_trim = max(nonsilent_ranges[0][0] - 200, 0)  # buffer before speech
    trimmed_audio = audio[start_trim:]
    louder_audio = trimmed_audio + 5  # boost volume by 5 dB
    input_audio_file = PROCESSED_AUDIO_PATH
    louder_audio.export(input_audio_file, format="wav")
    print(f"Trimmed and boosted audio saved as {input_audio_file}")
else:
    print("No speech detected, using original audio.")

# Step 3: Speech Recognition
recognizer = sr.Recognizer()

with sr.AudioFile(input_audio_file) as source:
    print("Recording the file...")
    recorded_audio = recognizer.record(source)
    print("Done recording")

try:
    print("Recognizing the text...")
    speech_to_text = recognizer.recognize_google(recorded_audio, language=RECOGNITION_LANGUAGE)
    print("Decoded Text:", speech_to_text)
except Exception as ex:
    print("Speech Recognition Error:", ex)


MoviePy - Writing audio in /content/drive/MyDrive/NLP_Proj/my_name_is_gora.wav




MoviePy - Done.
Loaded audio file from video and saved as /content/drive/MyDrive/NLP_Proj/my_name_is_gora.wav
Trimmed and boosted audio saved as processed_input.wav
Recording the file...
Done recording
Recognizing the text...
Decoded Text: my name is Gora I haven't met Benoit after that day but I know why he had come we were in the village traveling by the Night Train the Army had organised a sadhana tooth for the students at the station we saw the message on the notice board it said tasks are both online and on ground discussions enable participants to share ideas some students are reading the guitar and talking about National integration later we visited a bookstore I found a biography and the Telugu dictionary A teacher told us in order to succeed you must say it's a great time to invest in India we return home with memories and books that day I felt happy and hopeful


### Punctuation Restoration
The speech recognizer returns a raw transcript with no punctuation.
We restore punctuation here so that the text can be split into sentences afterwards.

---

#### 1. Silero Model (Preferred)

- Uses the Silero Text Enhancement model (`silero_te`) via `torch.hub.load`.
- Applies model for automatic punctuation.
- Falls back to rule-based method if loading fails.

---

#### 2. Rule-Based Fallback

- Uses regex and string ops to insert periods and commas.
- Capitalizes first words and identifies sentence boundaries.
- Cleans up redundant punctuation.
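
The comma and clean-up rules can be tried in isolation. The sketch below reuses the same three regular expressions as the notebook's fallback, wrapped in a hypothetical function name (`add_commas_and_clean`) chosen for this example:

```python
import re


def add_commas_and_clean(text):
    """Insert a comma before selected conjunctions, then collapse repeats."""
    # Put a comma before a conjunction that is surrounded by whitespace.
    text = re.sub(r"\s+(and|but|or|nor|for|so|yet)\s+", r", \1 ", text)
    # Collapse runs of periods or commas left over from earlier steps.
    text = re.sub(r"\.\.+", ".", text)
    text = re.sub(r",,+", ",", text)
    return text


print(add_commas_and_clean("I felt happy and hopeful.."))
print(add_commas_and_clean("happy, and sad"))
```

The second call shows why the clean-up pass matters: a conjunction that already has a comma briefly gets a second one, which the `,,+` rule then removes. Note that the rule is purely lexical, so it also fires on words such as "for" when they are used as prepositions.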

---

#### 3. Application

- `restore_punctuation()` handles both methods.
- Stores output in `punctuated_speech_to_text`.
- Includes example test case.


In [8]:


import re
import torch
# Step 2: Punctuate the text

def restore_punctuation_with_silero(text):
    """
    Use the Silero Text Enhancement model (`silero_te`) for punctuation restoration.
    Falls back to the rule-based method if the model cannot be loaded or applied.
    """
    try:
        model, example_texts, languages, punct, apply_te = torch.hub.load(
            repo_or_dir='snakers4/silero-models',
            model='silero_te'
        )
        return apply_te(text, lan='en')
    except Exception as e:
        print(f"Error with Silero model: {e}")
        return basic_punctuation_rules(text)

def basic_punctuation_rules(text):
    """
    Ultra-simple rule-based punctuation with no ML dependencies.
    Just uses regex and basic Python string operations.
    """
    # Split text into words
    words = text.split()
    result = []

    # Simple capitalization rules
    for i, word in enumerate(words):
        # Capitalize first word
        if i == 0:
            word = word[0].upper() + word[1:] if len(word) > 0 else word

        # Look for sentence boundaries based on common patterns
        if i > 0 and i < len(words) - 1:
            prev_word = words[i-1].lower()
            next_word = words[i+1]

            # Likely sentence end if next word is capitalized proper noun or common sentence starter
            sentence_starters = ["i", "we", "they", "he", "she", "it", "the", "this", "that", "these", "those"]
            if (next_word and next_word[0].isupper()) or next_word.lower() in sentence_starters:
                if not prev_word.endswith(('.', '!', '?')):
                    result[-1] = result[-1] + '.'
                word = word[0].upper() + word[1:] if len(word) > 0 else word

        result.append(word)

    # Add final period if missing
    if result and not result[-1].endswith(('.', '?', '!')):
        result[-1] += '.'

    # Join back into text
    punctuated_text = ' '.join(result)

    # Add commas using simple rules (before coordinating conjunctions etc.)
    punctuated_text = re.sub(r'\s+(and|but|or|nor|for|so|yet)\s+', ', \\1 ', punctuated_text)

    # Clean up any double commas or periods
    punctuated_text = re.sub(r'\.\.+', '.', punctuated_text)
    punctuated_text = re.sub(r',,+', ',', punctuated_text)

    return punctuated_text

def restore_punctuation(text):
    """
    Main function to restore punctuation - tries Silero first, falls back to rules.
    """
    try:
        return restore_punctuation_with_silero(text)
    except Exception:
        return basic_punctuation_rules(text)

# Testing with this example
text = "in this video I will show you how to download shortcut on windows shortcut is a free and open source video editor"
print(restore_punctuation(text))
punctuated_speech_to_text = restore_punctuation(speech_to_text)
print(punctuated_speech_to_text)


Downloading: "https://github.com/snakers4/silero-models/zipball/master" to /root/.cache/torch/hub/master.zip
 72%|███████▏  | 63.2M/87.5M [00:04<00:01, 15.2MB/s]


Error with Silero model: PytorchStreamReader failed reading zip archive: failed finding central directory
In this. Video I will show you how to download shortcut on windows shortcut is a free, and open source video editor.
Error with Silero model: PytorchStreamReader failed reading zip archive: failed finding central directory
My name. Is. Gora I haven't. Met Benoit. After that day. But I know. Why he had. Come we were. In the village traveling. By. The. Night. Train. The Army had organised a sadhana tooth. For the students. At the. Station we. Saw the message. On the notice. Board it said tasks are both online, and on ground discussions enable participants to share ideas some students are. Reading the guitar, and talking. About National integration. Later we visited a. Bookstore I found a biography. And. The Telugu. Dictionary A teacher told us in order to succeed you must say it's a great time to invest. In. India we return home with memories and. Books that. Day I felt happy, and ho

Using cache found in /root/.cache/torch/hub/snakers4_silero-models_master


### Sentence Tokenization Overview

Splits punctuated text into individual sentences for downstream processing.

---

#### 1. NLTK Setup

- Downloads `punkt` tokenizer via NLTK.
- Wrapped in `try-except` for graceful failure and logging.
- Confirms download path.

---

#### 2. Enhanced Splitter

- `enhanced_sentence_split()`:
  - Handles NLTK edge cases (e.g., transcribed speech).
  - Uses regex on punctuation + capital letters/transition words.
  - Ensures clean, punctuated, capitalized sentences.
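
As an illustration of the first rule, the boundary regex used by the splitter only breaks after `.`, `!`, or `?` when the next word starts with a capital letter. A small sketch with a made-up sample string:

```python
import re

# Split after ., ! or ? when followed by whitespace and a capital letter.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")

text = "We saw the message on the notice board. It said tasks are online. later we left."
sentences = SENTENCE_BOUNDARY.split(text)

for s in sentences:
    print(s)
```

The lowercase "later" after the second period does not trigger a split, so the last two clauses stay together. This is the kind of case the later clean-up and transition-word steps are meant to handle.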

---

#### 3. Tokenization Logic

- `get_sentences()`:
  - Uses NLTK’s `sent_tokenize`; falls back to enhanced splitter on failure.
  - Handles exceptions and logs issues.
  - Returns clean sentence list.

---

#### 4. Application

- Input: `punctuated_speech_to_text`.
- Output: `punctuated_sentences_list` with tokenized sentences.


In [9]:
# Fix for NLTK punkt_tab error and reliable sentence tokenization

# More comprehensive NLTK data download
import nltk
import re
import os

# First, try to download punkt resources more comprehensively
try:
    nltk.download('punkt', quiet=False)
    nltk.download('punkt_tab', quiet=False)
    # Explicitly check the data path
    nltk_data_path = nltk.data.path[0]
    print(f"NLTK data path: {nltk_data_path}")
except Exception as e:
    print(f"NLTK download error: {e}")

# Enhanced sentence splitter that doesn't rely on punkt_tab
def enhanced_sentence_split(text):
    """
    Enhanced sentence splitter that combines regex patterns with
    special handling for common patterns in transcribed speech.
    """
    # Step 1: Split on clear sentence boundaries
    # Look for period, question mark, or exclamation mark followed by space and capital letter
    initial_splits = re.split(r'(?<=[.!?])\s+(?=[A-Z])', text)

    final_sentences = []
    for segment in initial_splits:
        # Step 2: Handle run-on sentences with transition words
        # Common transition patterns that often indicate new sentences in speech
        transition_patterns = [
            r'(.*?)\s+(?:however|moreover|furthermore|therefore|thus|consequently|nevertheless|in addition,|for example,)\s+(.*)',
            r'(.*?),\s+(?:and|but|or|so|because)\s+(.*)'
        ]

        current_segment = segment
        for pattern in transition_patterns:
            match = re.match(pattern, current_segment, re.IGNORECASE)
            if match:
                groups = match.groups()
                if len(groups[0].split()) > 3:  # Only split if first part is substantial
                    final_sentences.append(groups[0].strip() + ".")
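                    # Upper-case only the first letter; str.capitalize() would
                    # lower-case the rest and destroy proper nouns.
                    current_segment = groups[1].strip()
                    current_segment = current_segment[:1].upper() + current_segment[1:]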
                    break

        # Add the remaining segment
        if current_segment:
            final_sentences.append(current_segment)

    # Step 3: Clean up sentences
    cleaned_sentences = []
    for sentence in final_sentences:
        # Make sure sentences end with punctuation
        if not sentence.endswith(('.', '!', '?')):
            sentence += '.'

        # Ensure first letter is capitalized
        if sentence and sentence[0].isalpha() and not sentence[0].isupper():
            sentence = sentence[0].upper() + sentence[1:]

        cleaned_sentences.append(sentence)

    return cleaned_sentences

# Use this function to tokenize your text
def get_sentences(text):
    try:
        # First try native NLTK tokenizer
        from nltk.tokenize import sent_tokenize
        return sent_tokenize(text)
    except Exception as e:
        print(f"NLTK tokenization failed: {e}")
        print("Using enhanced fallback tokenizer")
        return enhanced_sentence_split(text)

# Apply to your text
punctuated_sentences_list = get_sentences(punctuated_speech_to_text)

# Print results
print("\nSentence Tokenization Results:")
for i, sentence in enumerate(punctuated_sentences_list, 1):
    print(f"{i}. {sentence}")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


NLTK data path: /root/nltk_data

Sentence Tokenization Results:
1. My name.
2. Is.
3. Gora I haven't.
4. Met Benoit.
5. After that day.
6. But I know.
7. Why he had.
8. Come we were.
9. In the village traveling.
10. By.
11. The.
12. Night.
13. Train.
14. The Army had organised a sadhana tooth.
15. For the students.
16. At the.
17. Station we.
18. Saw the message.
19. On the notice.
20. Board it said tasks are both online, and on ground discussions enable participants to share ideas some students are.
21. Reading the guitar, and talking.
22. About National integration.
23. Later we visited a.
24. Bookstore I found a biography.
25. And.
26. The Telugu.
27. Dictionary A teacher told us in order to succeed you must say it's a great time to invest.
28. In.
29. India we return home with memories and.
30. Books that.
31. Day I felt happy, and hopeful.


### Saving Speech Data

This section saves the processed speech data (raw transcript, punctuated transcript, and sentences) to a JSON file for storage and retrieval.

* **Data Storage:** A dictionary (`speech_json`) is created to hold the raw transcript, punctuated transcript, and list of sentences.
* **File Path:** The output file path (`output_json_path`) is defined to save the JSON file in Google Drive.
* **JSON Serialization:** The `speech_json` dictionary is saved to the specified file path using `json.dump` with indentation for readability.
* **Confirmation:** A message confirms the file path where the JSON data has been saved.
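
A minimal round-trip sketch of the same idea, writing to a temporary directory instead of Google Drive. The record below is illustrative sample data; the Hindi sentence is included only to show the effect of `ensure_ascii=False`, which is an optional setting not used in the notebook cell:

```python
import json
import os
import tempfile

# Illustrative record with the same three keys as `speech_json`.
record = {
    "raw_transcript": "my name is Gora",
    "punctuated_transcript": "My name is Gora.",
    "sentences": ["My name is Gora.", "हमने नोटिस बोर्ड पर संदेश देखा ।"],
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "speech_output.json")

    # ensure_ascii=False keeps non-ASCII text human-readable in the file.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(record, f, indent=4, ensure_ascii=False)

    with open(path, "r", encoding="utf-8") as f:
        loaded = json.load(f)

    with open(path, "r", encoding="utf-8") as f:
        raw_text = f.read()

print(loaded == record)
```

With the default `ensure_ascii=True` the file would still load correctly, but Devanagari characters would be stored as `\uXXXX` escapes.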

In [10]:
import json

# Create the dict to store everything
speech_json = {
    "raw_transcript": speech_to_text,
    "punctuated_transcript": punctuated_speech_to_text,
    "sentences": punctuated_sentences_list
}

# Define the path to save JSON in Google Drive
output_json_path = "/content/drive/MyDrive/NLP_Proj/speech_output.json"

# Save to JSON
with open(output_json_path, "w", encoding="utf-8") as f:
    json.dump(speech_json, f, indent=4)

print(f"✅ Speech data saved to: {output_json_path}")


✅ Speech data saved to: /content/drive/MyDrive/NLP_Proj/speech_output.json


## Audio Translation Pipeline: Phase 2

This is the next stage of the pipeline. Run it after obtaining the machine translation from the transformer model.

In [11]:
import json

with open('/content/drive/MyDrive/NLP_Proj/translated_pun_sentences.json', 'r', encoding='utf-8') as f:
    translated_pun_sentences_list = json.load(f)
translated_pun_sentences_list

['गोरा मेरा नाम है , मैं उस दिन के बाद <unk> से मिला हूं , लेकिन मैं जानता हूं कि वह क्यों आया था ।',
 'हम गांव में <unk> ट्रेन से <unk> थे ।',
 'सेना ने स्टेशन पर विद्यार्थियों के लिए एक <unk> दांत का आयोजन किया था ।',
 'हमने नोटिस बोर्ड पर संदेश देखा ।',
 'यह कहा गया कि कार्य ऑनलाइन और जमीनी चर्चा पर प्रतिभागियों को विचारों को साझा करने में सक्षम बनाता है ।',
 'कुछ छात्र <unk> को पढ़ते हैं और राष्ट्रीय एकता के बारे में बात कर रहे हैं ।',
 'बाद में हमने एक <unk> देखी और तेलुगु कोश मिला ।',
 'एक शिक्षक ने हमें सफल होने के लिए कहा ।',
 'आपको कहना चाहिए कि भारत में निवेश करना एक बहुत बड़ा समय है ।',
 'हम उस दिन <unk> और किताबें के साथ घर लौट आए हैं ।',
 'मुझे खुशी और उल्लास महसूस हुआ ।']

### Hindi Text-to-Speech (TTS) Generation

Generates Hindi audio from translated text using the Facebook MMS TTS model (`facebook/mms-tts-hin`).

---

#### 1. Model Initialization

- Loads `VitsTokenizer` and `VitsModel` from Hugging Face.
- Used for tokenizing Hindi text and generating speech audio.

---

#### 2. Text Cleaning

- `clean_hindi_text()`:
  - Replaces `<unk>` with "कुछ".
  - Removes special tokens/brackets.
  - Ensures proper Hindi punctuation.
  - Defaults to "नमस्ते।" for empty input.
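
The cleaning rules can be checked on a few inputs without loading any model. This is a standalone sketch of the behaviour the bullets describe, under the assumption that the empty-input check runs before punctuation is appended; the function name `clean_hindi_text_sketch` is hypothetical:

```python
import re


def clean_hindi_text_sketch(text, placeholder="कुछ", default="नमस्ते।"):
    """Replace <unk>, drop other <...> tokens, and ensure final punctuation."""
    text = text.replace("<unk>", placeholder)
    # Remove any remaining angle-bracket tokens, then trim whitespace.
    text = re.sub(r"<[^>]+>", "", text).strip()
    if not text:
        return default
    if not text.endswith(("।", ".", "!", "?")):
        text += "।"
    return text


print(clean_hindi_text_sketch("हम गांव में <unk> ट्रेन से <unk> थे ।"))
print(clean_hindi_text_sketch("<pad>"))
print(clean_hindi_text_sketch("नमस्ते"))
```

Replacing every `<unk>` with the same filler word keeps the synthesizer from failing, but the spoken sentence then contains a word that was never in the source, so the count of `<unk>` tokens is worth tracking as a quality signal.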

---

#### 3. Safe TTS Generation

- `safe_tts_generation()`:
  - Cleans and tokenizes text.
  - Uses fallback if tokenization fails.
  - Generates waveform with `set_seed(555)` for consistency.
  - Saves output as WAV; includes exception handling.

---

#### 4. Batch Processing

- Iterates over translated sentences.
- Saves each as `output {i}.wav` (with a space before the index, as written by the code).
- Tracks and reports success count.


In [12]:
from transformers import VitsTokenizer, VitsModel, set_seed
import torch
import soundfile as sf
import re

# Load the Hindi TTS model
tts_tokenizer = VitsTokenizer.from_pretrained("facebook/mms-tts-hin")
tts_model = VitsModel.from_pretrained("facebook/mms-tts-hin")

def clean_hindi_text(text):
    """
    Clean Hindi text by removing or replacing <unk> tokens and other problematic characters
    """
    # Replace <unk> tokens with a common Hindi word meaning "something" (कुछ)
    text = text.replace("<unk>", "कुछ")

    # Remove any remaining special tokens or brackets that might cause issues
    text = re.sub(r'<[^>]+>', '', text)

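    # Fall back to a default greeting if nothing is left after cleaning.
    # This check must come first: appending punctuation to an empty string
    # would otherwise make it look non-empty.
    text = text.strip()
    if not text:
        return "नमस्ते।"  # "Hello" in Hindi

    # Ensure the text ends with sentence-final punctuation
    if not text.endswith(('।', '.', '!', '?')):
        text += '।'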

    return text

def safe_tts_generation(hindi_text, output_path, fallback_text="नमस्ते।"):
    """
    Safely generate TTS audio with error handling
    """
    try:
        # Clean the text first
        cleaned_text = clean_hindi_text(hindi_text)
        print(f"Cleaned text: {cleaned_text}")

        # Tokenize
        inputs = tts_tokenizer(text=cleaned_text, return_tensors="pt")

        # Check for empty inputs
        if inputs["input_ids"].size(1) == 0:
            print("Warning: Empty input after tokenization. Using fallback text.")
            inputs = tts_tokenizer(text=fallback_text, return_tensors="pt")

        # Generate audio
        set_seed(555)  # For deterministic results
        with torch.no_grad():
            outputs = tts_model(inputs["input_ids"].long())

        # Save the waveform
        waveform = outputs.waveform.squeeze().cpu().numpy()
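        # Write at the model's own sampling rate; a hard-coded 22050 Hz
        # changes playback speed if the model outputs a different rate.
        sf.write(output_path, waveform, tts_model.config.sampling_rate)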
        print(f"Audio saved to {output_path}")
        return True

    except Exception as e:
        print(f"Error generating TTS for '{hindi_text}': {e}")

        # Try with fallback text if original text failed
        if hindi_text != fallback_text:
            print(f"Attempting with fallback text: {fallback_text}")
            try:
                inputs = tts_tokenizer(text=fallback_text, return_tensors="pt")
                with torch.no_grad():
                    outputs = tts_model(inputs["input_ids"].long())

                waveform = outputs.waveform.squeeze().cpu().numpy()
                # Write at the model's own sampling rate rather than a hard-coded one
                sf.write(output_path, waveform, tts_model.config.sampling_rate)
                print(f"Fallback audio saved to {output_path}")
                return True
            except Exception as e2:
                print(f"Fallback also failed: {e2}")

        return False

# Process each sentence with the new safe method
success_count = 0
for i, hindi_sentence in enumerate(translated_pun_sentences_list, 1):
    print(f"\nProcessing sentence {i}: {hindi_sentence}")

    # Skip empty sentences
    if not hindi_sentence.strip():
        print(f"Skipping empty sentence {i}")
        continue

    # Generate audio
    output_path = f"output {i}.wav"
    if safe_tts_generation(hindi_sentence, output_path):
        success_count += 1

print(f"\nSuccessfully generated {success_count} of {len(translated_pun_sentences_list)} audio files.")


Processing sentence 1: गोरा मेरा नाम है , मैं उस दिन के बाद <unk> से मिला हूं , लेकिन मैं जानता हूं कि वह क्यों आया था ।
Cleaned text: गोरा मेरा नाम है , मैं उस दिन के बाद कुछ से मिला हूं , लेकिन मैं जानता हूं कि वह क्यों आया था ।
Audio saved to output 1.wav

Processing sentence 2: हम गांव में <unk> ट्रेन से <unk> थे ।
Cleaned text: हम गांव में कुछ ट्रेन से कुछ थे ।
Audio saved to output 2.wav

Processing sentence 3: सेना ने स्टेशन पर विद्यार्थियों के लिए एक <unk> दांत का आयोजन किया था ।
Cleaned text: सेना ने स्टेशन पर विद्यार्थियों के लिए एक कुछ दांत का आयोजन किया था ।
Audio saved to output 3.wav

Processing sentence 4: हमने नोटिस बोर्ड पर संदेश देखा ।
Cleaned text: हमने नोटिस बोर्ड पर संदेश देखा ।
Audio saved to output 4.wav

Processing sentence 5: यह कहा गया कि कार्य ऑनलाइन और जमीनी चर्चा पर प्रतिभागियों को विचारों को साझा करने में सक्षम बनाता है ।
Cleaned text: यह कहा गया कि कार्य ऑनलाइन और जमीनी चर्चा पर प्रतिभागियों को विचारों को साझा करने में सक्षम बनाता है ।
Audio saved to out

### Audio Segmentation and Silence Detection

This section segments the input audio using silence detection to align with translated content.

#### 1. Input Audio Handling

- **Path Selection:** Checks both `LOCAL_VIDEO_PATH` and `DRIVE_VIDEO_PATH` to locate the audio file.
- **Fallback:** Prints an error if the file is not found.

#### 2. Translated Sentences Fallback

- **Placeholder List:** If `translated_pun_sentences_list` is missing, generates a list from available translated audio files in the directory.

#### 3. Silence Detection

- **Custom Function:** `detect_silence` identifies silent regions based on a threshold, ensuring compatibility with various pydub versions.
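
The threshold logic can be illustrated without any audio library. The sketch below converts a dB threshold to a linear amplitude and scans a list of per-chunk loudness values for silent runs; the loudness values are made up, and 32768 is assumed as the full-scale amplitude of 16-bit audio:

```python
def db_to_amplitude_ratio(db):
    """Convert decibels to a linear amplitude ratio."""
    return 10 ** (db / 20)


def find_silent_runs(levels, threshold, min_len):
    """Return [start, end) index pairs of runs where level < threshold.

    `levels` holds one loudness value per fixed-size chunk; only runs of
    at least `min_len` chunks are reported.
    """
    runs = []
    start = None
    for i, level in enumerate(levels):
        if level < threshold:
            if start is None:
                start = i  # a new silent run begins here
        else:
            if start is not None and i - start >= min_len:
                runs.append([start, i])
            start = None
    # Close a run that extends to the end of the signal.
    if start is not None and len(levels) - start >= min_len:
        runs.append([start, len(levels)])
    return runs


max_amplitude = 32768  # assumed full scale for 16-bit PCM
threshold = db_to_amplitude_ratio(-40) * max_amplitude
levels = [900, 100, 120, 90, 800, 50, 700, 10, 20, 30]

print(round(threshold, 2))
print(find_silent_runs(levels, threshold, min_len=3))
```

A threshold of -40 dB corresponds to 1% of full scale. The single quiet chunk at index 5 is ignored because it is shorter than `min_len`, which is how brief dips inside speech are kept from counting as pauses.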

#### 4. Audio Processing

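- **`process_audio_files`:** Concatenates the translated audio clips, inserting after each one a silence whose duration is taken from the silences detected in the original audio (clamped to between 300 and 1500 ms, 500 ms by default), then exports a normal-speed and a slowed-down version.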
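
The pause inserted after each clip follows a simple clamping rule. A sketch of that rule on made-up durations, using a hypothetical helper `clamp_silence`:

```python
def clamp_silence(duration_ms, low=300, high=1500, default=500):
    """Clamp a detected pause to [low, high]; use `default` when none exists."""
    if duration_ms is None:
        return default
    return min(max(duration_ms, low), high)


detected = [770, 120, 2400]  # pauses found in the original audio, in ms
clip_count = 4               # more clips than detected pauses

gaps = [
    clamp_silence(detected[i] if i < len(detected) else None)
    for i in range(clip_count)
]
print(gaps)
```

Pairing the i-th clip with the i-th detected pause is positional only. It assumes the pauses in the original roughly line up with sentence boundaries, which may not hold when the number of detected silences differs from the number of sentences.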


In [13]:
from pydub import AudioSegment
import os
import numpy as np

# Defined earlier during configuration
# LOCAL_VIDEO_PATH = "/content/my_name_is_gora.mp4"
# DRIVE_VIDEO_PATH = "/content/drive/MyDrive/NLP_Proj/my_name_is_gora.mp4"

# Try to convert the LOCAL_VIDEO_PATH to audio file path (replace .mp4 with .wav)
try:
    input_audio_file = LOCAL_VIDEO_PATH.replace('.mp4', '.wav')
    # Check if the audio file exists
    if not os.path.exists(input_audio_file):
        raise FileNotFoundError(f"Audio file not found: {input_audio_file}")
    print(f"Audio file found: {input_audio_file}")
except FileNotFoundError as e:
    # If the audio file doesn't exist in the LOCAL path, fall back to DRIVE path
    try:
        input_audio_file = DRIVE_VIDEO_PATH.replace('.mp4', '.wav')
        if not os.path.exists(input_audio_file):
            raise FileNotFoundError(f"Audio file not found: {input_audio_file}")
        print(f"Audio file found: {input_audio_file}")
    except FileNotFoundError as e2:
        print(f"Error: {e2}")
        # Handle the case where the file isn't found in either location



# Make sure translated_pun_sentences_list is defined
try:
    test = translated_pun_sentences_list
except NameError:
    print("WARNING: translated_pun_sentences_list not defined, using detected output files")
    wav_dir = "./"
    output_files = [f for f in os.listdir(wav_dir) if f.startswith("output ") and f.endswith(".wav")]
    output_files.sort(key=lambda x: int(x.split(" ")[1].split(".")[0]))
    translated_pun_sentences_list = ["Sentence " + str(i+1) for i in range(len(output_files))]
    print(f"Created fallback list with {len(translated_pun_sentences_list)} entries")

def db_to_float(db):
    """Convert dB to float amplitude"""
    return 10 ** (db / 20)

def detect_silence(audio_segment, min_silence_len=1000, silence_thresh=-16, seek_step=1):
    """
    Custom implementation of silence detection that works with all pydub versions
    """
    seg_len = len(audio_segment)

    # You can't have a silent portion of a sound that is longer than the sound itself
    if seg_len < min_silence_len:
        return []

    # Convert silence threshold to a float value
    silence_thresh_amp = db_to_float(silence_thresh) * audio_segment.max_possible_amplitude

    # Find silence periods
    silence_ranges = []
    is_silent = False
    current_start = None

    # Check audio in chunks
    for i in range(0, seg_len, seek_step):
        end = min(i + seek_step, seg_len)
        chunk = audio_segment[i:end]

        # Check if this chunk is silent
        if chunk.rms < silence_thresh_amp:
            if not is_silent:
                # Start of a new silent period
                is_silent = True
                current_start = i
        else:
            if is_silent:
                # End of a silent period
                is_silent = False
                if i - current_start >= min_silence_len:
                    silence_ranges.append([current_start, i])
                current_start = None

    # Don't forget to add the last silent period if we ended in silence
    if is_silent and current_start is not None and seg_len - current_start >= min_silence_len:
        silence_ranges.append([current_start, seg_len])

    return silence_ranges

def process_audio_files(input_audio_file, translated_sentences_list):
    """Process audio files and combine them with appropriate silences"""
    print(f"\nProcessing audio file: {input_audio_file}")

    # Check if input file exists
    if not os.path.exists(input_audio_file):
        print(f"ERROR: Input audio file not found: {input_audio_file}")
        return False

    print("Detecting silences in original audio...")
    try:
        # Load the original audio file
        audio = AudioSegment.from_wav(input_audio_file)

        # Detect silence periods with our custom function
        silence_threshold = -40  # dBFS
        min_silence_len = 500    # milliseconds
        silences = detect_silence(
            audio,
            min_silence_len=min_silence_len,
            silence_thresh=silence_threshold,
            seek_step=10  # Use a larger step size for faster processing
        )

        silence_durations = [(end - start) for start, end in silences]
        print(f"Detected {len(silences)} silence periods")

        # Combine all audio files with appropriate silences
        print("\nCombining audio files with silences...")

        # Get all output files that exist
        wav_dir = "./"
        expected_wav_files = [f"output {i}.wav" for i in range(1, len(translated_sentences_list)+1)]
        wav_files = [f for f in expected_wav_files if os.path.exists(os.path.join(wav_dir, f))]

        print(f"Found {len(wav_files)} of {len(expected_wav_files)} expected audio files")

        # Initialize the final output audio
        final_output = AudioSegment.silent(duration=0)

        # Insert appropriate silence between segments
        for i, wav_file in enumerate(wav_files):
            try:
                print(f"Adding {wav_file} to output")
                sentence_audio = AudioSegment.from_wav(os.path.join(wav_dir, wav_file))
                final_output += sentence_audio

                # Add appropriate silence after each sentence
                silence_duration = 500  # Default 500ms silence
                if i < len(silence_durations):
                    silence_duration = min(max(silence_durations[i], 300), 1500)

                silence_segment = AudioSegment.silent(duration=silence_duration)
                final_output += silence_segment
                print(f"Added {silence_duration}ms silence")

            except Exception as e:
                print(f"Error processing {wav_file}: {e}")

        # Export the final output to a WAV file
        final_output.export("end_result.wav", format="wav")
        print("Final output saved as end_result.wav")

        # Create a slowed-down version
        print("\nCreating slowed-down version...")
        try:
            slow_audio_segment = final_output._spawn(final_output.raw_data, overrides={
                "frame_rate": int(final_output.frame_rate * 0.75)
            }).set_frame_rate(final_output.frame_rate)

            slow_audio_segment.export("end_result_slow.wav", format="wav")
            print("Slowed audio saved to end_result_slow.wav")
        except Exception as e:
            print(f"Error creating slowed version: {e}")

        return True

    except Exception as e:
        print(f"Error in audio processing: {e}")
        import traceback
        traceback.print_exc()
        return False

# Execute the audio processing function
print(f"Starting audio processing with {len(translated_pun_sentences_list)} sentences")
success = process_audio_files(input_audio_file, translated_pun_sentences_list)
if success:
    print("\nAudio translation process complete!")
else:
    print("\nAudio processing encountered errors. Check the output files.")

Starting audio processing with 11 sentences

Processing audio file: processed_input.wav
Detecting silences in original audio...
Detected 13 silence periods

Combining audio files with silences...
Found 11 of 11 expected audio files
Adding output 1.wav to output
Added 770ms silence
Adding output 2.wav to output
Added 550ms silence
Adding output 3.wav to output
Added 530ms silence
Adding output 4.wav to output
Added 550ms silence
Adding output 5.wav to output
Added 500ms silence
Adding output 6.wav to output
Added 560ms silence
Adding output 7.wav to output
Added 610ms silence
Adding output 8.wav to output
Added 650ms silence
Adding output 9.wav to output
Added 640ms silence
Adding output 10.wav to output
Added 670ms silence
Adding output 11.wav to output
Added 1050ms silence
Final output saved as end_result.wav

Creating slowed-down version...
Slowed audio saved to end_result_slow.wav

Creating videos with new audio...
Loading video from /content/drive/MyDrive/NLP_Proj/my_name_is_gora.m



MoviePy - Done.
Moviepy - Writing video end_result_video.mp4






Moviepy - Done !
Moviepy - video ready end_result_video.mp4
Video with replaced audio saved to end_result_video.mp4
Loading video from /content/drive/MyDrive/NLP_Proj/my_name_is_gora.mp4
Loading audio from end_result_slow.wav
Audio (57.17s) is longer than video (54.76s). Trimming audio.
Replacing video audio...
Writing output video to end_result_slow_video.mp4
Moviepy - Building video end_result_slow_video.mp4.
MoviePy - Writing audio in end_result_slow_videoTEMP_MPY_wvf_snd.mp4




MoviePy - Done.
Moviepy - Writing video end_result_slow_video.mp4





Moviepy - Done !
Moviepy - video ready end_result_slow_video.mp4
Video with replaced audio saved to end_result_slow_video.mp4

Audio and video translation process complete!


## Translation with Facebook's mBART-50 one-to-many Transformer model
This section translates the same sentences with mBART so that our custom translation can be contrasted with an established transformer baseline.

### Enhanced Translation with mBART Model

This section translates the English sentences into Hindi with the mBART model, synthesizes Hindi speech with Facebook MMS TTS, and joins the resulting clips using silences taken from the original audio.

---

#### 1. Loading English Sentences

- **Primary Source:** Loads from `speech_output.json`.
- **Fallback:** Uses `punctuated_sentences_list` if the JSON is unavailable.

---

#### 2. Translation with mBART

- **Model:** Uses `facebook/mbart-large-50-one-to-many-mmt`.
- **Language Configuration:** Source - English (`en_XX`), Target - Hindi (`hi_IN`).
- **Translation Features:**
  - Beam search (`num_beams=5`)
  - Handles empty or failed translations with fallback Hindi placeholders.

---

#### 3. Saving Translations

- Translations are saved to `mbart_translated_sentences.json` in Google Drive.

---

#### 4. TTS Generation using MMS

- **Model:** `facebook/mms-tts-hin`.
- **Cleaning:** Removes special tokens, ensures terminal punctuation, and falls back to "नमस्ते।" if the cleaned text is empty.
- **Output:** Audio files are saved under `mbart_translated/output_mbart_{i}.wav`.

---

#### 5. Silence Detection

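- **Function:** `detect_silence()`, a custom implementation written for compatibility across pydub versions.
- **Thresholds:** A chunk counts as silent when its RMS is below -40 dBFS; a silent run is kept when it lasts ≥500 ms.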
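
The -40 dBFS threshold is compared against chunk RMS on a linear amplitude scale, so it has to be converted first. A minimal sketch of that conversion, assuming 16-bit audio (full scale 32768, the value pydub reports as `max_possible_amplitude` for 2-byte samples); the helper name `dbfs_to_amplitude` is hypothetical and not part of the notebook.

```python
def dbfs_to_amplitude(threshold_dbfs: float, full_scale: float = 32768.0) -> float:
    """Convert a dBFS level into a linear amplitude relative to full scale.

    Assumption: full_scale defaults to 32768, i.e. 16-bit PCM audio.
    """
    return (10 ** (threshold_dbfs / 20)) * full_scale


# -40 dBFS is 1% of full scale, so about 327.68 for 16-bit audio
threshold = dbfs_to_amplitude(-40)
print(round(threshold, 2))
```

Any chunk whose RMS falls below this value would be treated as silent under the stated threshold.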

---

#### 6. Audio Combination

- **Function:** `process_mbart_audio_files()` merges sentence audios and inserts silences.
- **Final Outputs:**
  - `mbart_translated_result.wav` (normal speed)
  - `mbart_translated_result_slow.wav` (slowed to 0.75x speed by lowering the frame rate, which also lowers the pitch)
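
The gap selection and the slowdown arithmetic can be sketched in plain Python. This is an illustration under stated assumptions, not the notebook's code: the helper names `pick_gap_ms` and `slowed_duration_ms` are hypothetical, the 300/500/1500 ms limits mirror the values used in the cell below, and the sample silence durations are made up.

```python
def pick_gap_ms(index, silence_durations, default_ms=500, lo_ms=300, hi_ms=1500):
    """Choose the pause after sentence `index`.

    Uses the detected silence at the same position, clamped to [lo_ms, hi_ms];
    falls back to default_ms when no detected silence is left.
    """
    if index < len(silence_durations):
        return min(max(silence_durations[index], lo_ms), hi_ms)
    return default_ms


def slowed_duration_ms(duration_ms, speed=0.75):
    """Duration after replaying the same samples at `speed` times the frame rate."""
    return duration_ms / speed


detected = [770, 120, 2400]  # hypothetical detected silences, in ms
gaps = [pick_gap_ms(i, detected) for i in range(4)]
print(gaps)                       # [770, 300, 1500, 500]
print(slowed_duration_ms(42000))  # 56000.0
```

Clamping keeps very short pauses audible and stops long pauses in the source from stretching the translated track, while the 1/0.75 factor explains why the slowed file can end up longer than the source video.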

---

#### 7. Summary Output

- Lists the custom model output (`end_result.wav`) alongside the mBART output so the two can be compared by listening.
- Both normal and slowed versions are created for better comprehension.


In [14]:
# ===== ENHANCED TRANSLATION WITH MBART MODEL =====
import json
import torch
import re
import os
import time
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast
from transformers import VitsTokenizer, VitsModel, set_seed
import soundfile as sf
from pydub import AudioSegment

# Create output directory for mBART translated files
os.makedirs("mbart_translated", exist_ok=True)

# Step 1: Load the saved punctuated English sentences
try:
    # Try to load from the speech output JSON
    with open("/content/drive/MyDrive/NLP_Proj/speech_output.json", "r", encoding="utf-8") as f:
        speech_data = json.load(f)
        english_sentences = speech_data["sentences"]
    print(f"Loaded {len(english_sentences)} English sentences from speech_output.json")
except Exception as e:
    # Fallback: reuse the list created earlier in this session.
    # Catching Exception (not a bare except) lets Ctrl+C still interrupt the cell.
    print(f"Could not load speech_output.json: {e}")
    english_sentences = punctuated_sentences_list
    print(f"Using {len(english_sentences)} sentences from current session")

# Step 2: Translate sentences using mBART model
print("\n===== mBART Model Translation Processing =====")

# Load the mBART model and tokenizer
print("Loading mBART model and tokenizer...")
model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-one-to-many-mmt")
tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-one-to-many-mmt", src_lang="en_XX")

# Translate sentences
mbart_translated_sentences = []
print(f"Starting translation of {len(english_sentences)} sentences...")

for i, sentence in enumerate(english_sentences, 1):
    try:
        print(f"Translating sentence {i}/{len(english_sentences)}: {sentence[:50]}...")

        # Skip empty sentences
        if not sentence or sentence.strip() == "":
            mbart_translated_sentences.append("नमस्ते।")
            continue

        # Convert sentence to tensor
        model_inputs = tokenizer(sentence, return_tensors="pt", padding=True)

        # Translate from English to Hindi
        generated_tokens = model.generate(
            **model_inputs,
            forced_bos_token_id=tokenizer.lang_code_to_id["hi_IN"],
            num_beams=5,  # Use beam search for better quality
            max_length=150  # Set maximum length to avoid excessive outputs
        )

        # Decode the generated tokens
        translation = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]

        # Add proper Hindi punctuation if missing
        if not translation.endswith(('।', '.', '!', '?')):
            translation += '।'

        mbart_translated_sentences.append(translation)
        print(f"mBART translated: {translation}")

    except Exception as e:
        print(f"Error translating sentence {i}: {e}")
        # Add a placeholder if translation fails
        mbart_translated_sentences.append("अनुवाद उपलब्ध नहीं है।")  # "Translation not available" in Hindi

# Save mBART translations to JSON for reference
mbart_output_path = "/content/drive/MyDrive/NLP_Proj/mbart_translated_sentences.json"
with open(mbart_output_path, "w", encoding="utf-8") as f:
    json.dump(mbart_translated_sentences, f, ensure_ascii=False, indent=4)
print(f"mBART translations saved to: {mbart_output_path}")

# Step 3: Generate TTS for mBART translations
print("\n===== Generating TTS for mBART Translations =====")

# Load the Hindi TTS model if not already loaded
if 'tts_model' not in locals():
    tts_tokenizer = VitsTokenizer.from_pretrained("facebook/mms-tts-hin")
    tts_model = VitsModel.from_pretrained("facebook/mms-tts-hin")

def clean_hindi_text(text):
    """Clean Hindi text by removing or replacing problematic characters"""
    # Replace unknown-token markers with a neutral word ("कुछ" = "something")
    text = text.replace("<unk>", "कुछ")
    # Remove any remaining special tokens or brackets
    text = re.sub(r'<[^>]+>', '', text).strip()
    # Check for emptiness before adding punctuation; otherwise an empty
    # string becomes "।" and the fallback below is never reached
    if not text:
        return "नमस्ते।"  # Default to "Hello" in Hindi
    # Ensure proper punctuation
    if not text.endswith(('।', '.', '!', '?')):
        text += '।'
    return text

# Generate audio for each mBART translated sentence
success_count = 0
for i, hindi_sentence in enumerate(mbart_translated_sentences, 1):
    print(f"\nProcessing mBART translated sentence {i}: {hindi_sentence}")

    # Skip empty sentences
    if not hindi_sentence.strip():
        print(f"Skipping empty sentence {i}")
        continue

    # Generate audio
    output_path = f"mbart_translated/output_mbart_{i}.wav"
    try:
        # Clean the text
        cleaned_text = clean_hindi_text(hindi_sentence)

        # Tokenize
        inputs = tts_tokenizer(text=cleaned_text, return_tensors="pt")

        # Check for empty inputs
        if inputs["input_ids"].size(1) == 0:
            print("Warning: Empty input after tokenization. Using fallback.")
            inputs = tts_tokenizer(text="नमस्ते।", return_tensors="pt")

        # Generate audio
        set_seed(555)  # For deterministic results
        with torch.no_grad():
            outputs = tts_model(inputs["input_ids"].long())

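        # Save the waveform at the model's own sampling rate
        # (a hard-coded rate that differs from it changes playback speed and pitch)
        waveform = outputs.waveform.squeeze().cpu().numpy()
        sf.write(output_path, waveform, tts_model.config.sampling_rate)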
        print(f"Audio saved to {output_path}")
        success_count += 1
    except Exception as e:
        print(f"Error generating TTS: {e}")

print(f"\nSuccessfully generated {success_count} of {len(mbart_translated_sentences)} mBART translated audio files.")

# Step 4: Combine mBART translated audio with silences
print("\n===== Combining mBART Translated Audio =====")

def detect_silence(audio_segment, min_silence_len=1000, silence_thresh=-16, seek_step=1):
    """
    Custom implementation of silence detection that works with all pydub versions
    """
    seg_len = len(audio_segment)

    # You can't have a silent portion of a sound that is longer than the sound itself
    if seg_len < min_silence_len:
        return []

    # Convert the dBFS threshold to a linear amplitude so it can be compared with chunk.rms
    silence_thresh_amp = (10 ** (silence_thresh / 20)) * audio_segment.max_possible_amplitude

    # Find silence periods
    silence_ranges = []
    is_silent = False
    current_start = None

    # Check audio in chunks
    for i in range(0, seg_len, seek_step):
        end = min(i + seek_step, seg_len)
        chunk = audio_segment[i:end]

        # Check if this chunk is silent
        if chunk.rms < silence_thresh_amp:
            if not is_silent:
                # Start of a new silent period
                is_silent = True
                current_start = i
        else:
            if is_silent:
                # End of a silent period
                is_silent = False
                if i - current_start >= min_silence_len:
                    silence_ranges.append([current_start, i])
                current_start = None

    # Don't forget to add the last silent period if we ended in silence
    if is_silent and current_start is not None and seg_len - current_start >= min_silence_len:
        silence_ranges.append([current_start, seg_len])

    return silence_ranges

def process_mbart_audio_files(input_audio_file, mbart_sentences):
    """Process mBART translated audio files and combine with silences"""
    print(f"\nProcessing using silence from: {input_audio_file}")

    try:
        # Load the original audio file for silence detection
        audio = AudioSegment.from_wav(input_audio_file)

        # Detect silence periods
        silence_threshold = -40  # dBFS
        min_silence_len = 500    # milliseconds

        # Use the helper function for silence detection
        silences = detect_silence(
            audio,
            min_silence_len=min_silence_len,
            silence_thresh=silence_threshold,
            seek_step=10
        )

        silence_durations = [(end - start) for start, end in silences]
        print(f"Using {len(silences)} silence periods from original audio")

        # Get all mBART translated audio files
        wav_dir = "mbart_translated"
        expected_wav_files = [f"output_mbart_{i}.wav" for i in range(1, len(mbart_sentences)+1)]
        wav_files = [f for f in expected_wav_files if os.path.exists(os.path.join(wav_dir, f))]

        print(f"Found {len(wav_files)} of {len(expected_wav_files)} mBART translated audio files")

        # Initialize the final output audio
        final_output = AudioSegment.silent(duration=0)

        # Insert appropriate silence between segments
        for i, wav_file in enumerate(wav_files):
            try:
                print(f"Adding {wav_file} to output")
                sentence_audio = AudioSegment.from_wav(os.path.join(wav_dir, wav_file))
                final_output += sentence_audio

                # Add appropriate silence after each sentence
                silence_duration = 500  # Default 500ms silence
                if i < len(silence_durations):
                    silence_duration = min(max(silence_durations[i], 300), 1500)

                silence_segment = AudioSegment.silent(duration=silence_duration)
                final_output += silence_segment
                print(f"Added {silence_duration}ms silence")

            except Exception as e:
                print(f"Error processing {wav_file}: {e}")

        # Export the final mBART translated output
        final_output.export("mbart_translated_result.wav", format="wav")
        print("mBART translated output saved as mbart_translated_result.wav")

        # Create a slowed-down version: reinterpret the same samples at 0.75x the
        # frame rate (longer and lower-pitched), then resample to the original rate
        print("\nCreating slowed-down version of mBART translation...")
        slow_audio_segment = final_output._spawn(final_output.raw_data, overrides={
            "frame_rate": int(final_output.frame_rate * 0.75)
        }).set_frame_rate(final_output.frame_rate)

        slow_audio_segment.export("mbart_translated_result_slow.wav", format="wav")
        print("Slowed mBART translation saved as mbart_translated_result_slow.wav")

        return True

    except Exception as e:
        print(f"Error in mBART audio processing: {e}")
        import traceback
        traceback.print_exc()
        return False

# Execute the mBART translated audio processing
print(f"Starting mBART translated audio processing with {len(mbart_translated_sentences)} sentences")
mbart_success = process_mbart_audio_files(input_audio_file, mbart_translated_sentences)

if mbart_success:
    print("\n===== Comparison Files Created Successfully =====")
    print("1. Your custom model translation: end_result.wav")
    print("2. mBART translation: mbart_translated_result.wav")
    print("\nSlowed versions for better comprehension:")
    print("1. Your custom model (slow): end_result_slow.wav")
    print("2. mBART (slow): mbart_translated_result_slow.wav")
else:
    print("\nmBART translated audio processing encountered errors.")

Loaded 31 English sentences from speech_output.json

===== mBART Model Translation Processing =====
Loading mBART model and tokenizer...



Starting translation of 31 sentences...
Translating sentence 1/31: My name....
mBART translated: मेरा नाम।
Translating sentence 2/31: Is....
mBART translated: है.
Translating sentence 3/31: Gora I haven't....
mBART translated: गोरा नहीं।
Translating sentence 4/31: Met Benoit....
mBART translated: बेनोइट से मिला।
Translating sentence 5/31: After that day....
mBART translated: उस दिन के बाद।
Translating sentence 6/31: But I know....
mBART translated: लेकिन मैं जानता हूं।
Translating sentence 7/31: Why he had....
mBART translated: क्यों?
Translating sentence 8/31: Come we were....
mBART translated: आओ हम थे।
Translating sentence 9/31: In the village traveling....
mBART translated: गांव में यात्रा करते हुए।
Translating sentence 10/31: By....
mBART translated: द्वारा.
Translating sentence 11/31: The....
mBART translated: द.
Translating sentence 12/31: Night....
mBART translated: रात।
Translating sentence 13/31: Train....
mBART translated: ट्रेन।
Translating sentence 14/31: The Army had orga