<a href="https://colab.research.google.com/github/DebasishTripathy13/unimeds/blob/main/Whisper.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Hindi to English Speech-to-Text using Faster-Whisper on Google Colab
# This notebook transcribes Hindi audio and translates it to English with improved accuracy and large file handling.

# Install required packages in a single command
# --quiet reduces pip's output verbosity in Colab
# librosa (resampling) and soundfile (writing WAV files) are both used below
!pip install --quiet faster-whisper gradio soundfile librosa gtts pydub

import faster_whisper
import gradio as gr
import torch
import librosa
import numpy as np
import soundfile as sf # Provides sf.write(), used when saving Gradio audio to a temporary WAV
import os
from IPython.display import Audio, display
from google.colab import files
import io
from gtts import gTTS
from pydub import AudioSegment # For handling large audio files

# --- Configuration ---
# Choose a model size. "large-v2" or "large-v3" are recommended for best accuracy
# on Hindi. "large-v2" is generally sufficient and widely used.
# "large-v3" has a similar footprint and may offer marginal improvements; compare both on your own audio.
MODEL_SIZE = "large-v2"

# Determine device for inference (GPU if available, else CPU)
if torch.cuda.is_available():
    # Use "float16" for faster inference on GPU, "float32" for higher precision (slower)
    # If you encounter issues, try "int8" or "int8_float16" for lower VRAM usage
    COMPUTE_TYPE = "float16"
    DEVICE = "cuda"
    print(f"Using GPU ({torch.cuda.get_device_name(0)}) with {COMPUTE_TYPE} precision.")
else:
    COMPUTE_TYPE = "int8" # "int8" is good for CPU for performance
    DEVICE = "cpu"
    print(f"Using CPU with {COMPUTE_TYPE} precision.")

# Load Faster-Whisper model (this will download if not cached)
print(f"Loading Faster-Whisper model: {MODEL_SIZE}...")
try:
    # A fine-tuned Hindi model can be used instead of OpenAI's large-v2 by passing its Hugging Face repo ID or a local path.
    # Note that faster-whisper loads CTranslate2-format weights, so a repo that only ships Transformers weights
    # has to be converted first (for example with ct2-transformers-converter):
    # model = faster_whisper.WhisperModel("vasista22/whisper-hindi-large-v2", device=DEVICE, compute_type=COMPUTE_TYPE)
    # For general performance and translation capability, OpenAI's large-v2/v3 is usually a good choice.
    model = faster_whisper.WhisperModel(MODEL_SIZE, device=DEVICE, compute_type=COMPUTE_TYPE)
    print(f"Successfully loaded {MODEL_SIZE} model on {DEVICE}!")
except Exception as e:
    print(f"Error loading model: {e}")
    print("Attempting to load on CPU as a fallback...")
    model = faster_whisper.WhisperModel(MODEL_SIZE, device="cpu", compute_type="int8")
    DEVICE = "cpu" # Update device to CPU
    COMPUTE_TYPE = "int8" # Keep the reported compute type in sync with the fallback
    print(f"Successfully loaded {MODEL_SIZE} model on CPU!")

# --- Transcription Function ---
def transcribe_hindi_to_english(audio_path_or_array, language="hi"):
    """
    Transcribe Hindi audio and translate it to English using Faster-Whisper.
    Handles both file paths and numpy arrays (from Gradio).
    Automatically chunks large audio files for processing.

    Args:
        audio_path_or_array: Path to audio file (string) or audio array (tuple from Gradio).
        language (str): Source language (default: "hi" for Hindi).

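    Returns:
        dict: "english_translation", "detected_language", and "confidence" (the
        language-detection probability, not a translation-quality score); on
        failure the dict also carries an "error" message.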
    """
    try:
        if isinstance(audio_path_or_array, tuple):
            # Gradio provides (sample_rate, audio_data)
            sample_rate, audio_data = audio_path_or_array
            # Create a temporary file for Faster-Whisper, as it prefers file paths
            temp_audio_file = "temp_gradio_audio.wav"
            # Convert integer PCM to float32 in [-1.0, 1.0), as librosa and soundfile expect
            if audio_data.dtype == np.int16:
                audio_data = audio_data.astype(np.float32) / 32768.0
            elif audio_data.dtype == np.int32:
                audio_data = audio_data.astype(np.float32) / 2147483648.0

            # Downmix stereo to mono first: Gradio returns (samples, channels),
            # while librosa.resample works along the last axis by default
            if audio_data.ndim > 1:
                audio_data = np.mean(audio_data, axis=1)

            # Resample if necessary (Whisper models expect 16 kHz input)
            if sample_rate != 16000:
                audio_data = librosa.resample(audio_data, orig_sr=sample_rate, target_sr=16000)
                sample_rate = 16000

            sf.write(temp_audio_file, audio_data, sample_rate)
            audio_source = temp_audio_file
        else:
            # It's already a file path
            audio_source = audio_path_or_array

        print(f"Starting transcription for: {audio_source}")

        # model.transcribe() accepts long files: it decodes the audio in 30-second windows,
        # and vad_filter=True (below) skips silent stretches before decoding.
        segments, info = model.transcribe(
            audio=audio_source,
            language=language,
            task="translate", # This translates to English
            beam_size=5, # Number of beams for beam search, higher means potentially better but slower
            vad_filter=True, # Enable Voice Activity Detection for better segmentation
            vad_parameters={"min_silence_duration_ms": 500}, # Customize VAD if needed
            without_timestamps=False # Keep segment-level timestamps; word-level timing additionally needs word_timestamps=True
        )

        full_translation = ""
        # Word-level timestamps are available via segment.words, but only when
        # word_timestamps=True is passed to model.transcribe()
        word_level_transcription = []

        print(f"Detected language: {info.language} with probability {info.language_probability:.4f}")

        for segment in segments:
            full_translation += segment.text.strip() + " "
            # If you want more detailed word-level output for "word to word" understanding:
            # for word in segment.words:
            #    word_level_transcription.append(f"{word.word} (start: {word.start:.2f}s, end: {word.end:.2f}s)")

        # Clean up temporary audio file if created
        if 'temp_audio_file' in locals() and os.path.exists(temp_audio_file):
            os.remove(temp_audio_file)

        return {
            "detected_language": info.language,
            "confidence": f"{info.language_probability:.2%}",
            "english_translation": full_translation.strip(),
            # "word_level_transcription": "\n".join(word_level_transcription) # Uncomment to see word-level
        }

    except Exception as e:
        print(f"Error during transcription: {e}")
        # Clean up temporary audio file if an error occurs
        if 'temp_audio_file' in locals() and os.path.exists(temp_audio_file):
            os.remove(temp_audio_file)
        return {
            "error": f"Transcription failed: {str(e)}",
            "detected_language": "Unknown",
            "confidence": "0%",
            "english_translation": ""
        }

# --- Gradio Interface Setup ---
def create_gradio_interface():
    """Create a web interface using Gradio for Hindi to English translation."""

    def process_audio_for_gradio(audio_input):
        if audio_input is None:
            return "Please upload an audio file or record something.", "", ""

        # `audio_input` from gr.Audio(type="numpy") is a tuple (sample_rate, audio_array)
        result = transcribe_hindi_to_english(audio_input)

        if 'error' in result:
            return result['error'], "N/A", "N/A"

        return (
            result.get('english_translation', 'No translation available'),
            result.get('detected_language', 'Unknown'),
            result.get('confidence', '0%')
            # result.get('word_level_transcription', 'N/A') # Uncomment if you enable word-level output
        )

    interface = gr.Interface(
        fn=process_audio_for_gradio,
        inputs=gr.Audio(
            sources=["microphone", "upload"],
            type="numpy", # Get audio as numpy array (sample_rate, data)
            label="Upload Hindi Audio or Record"
        ),
        outputs=[
            gr.Textbox(label="English Translation", lines=7, show_copy_button=True),
            gr.Textbox(label="Detected Language", max_lines=1),
            gr.Textbox(label="Confidence", max_lines=1),
            # gr.Textbox(label="Word-Level Details", lines=10, show_copy_button=True) # Uncomment
        ],
        title="🎙️ Hindi to English Speech Translator (Powered by Faster-Whisper)",
        description=f"""
        This tool uses the **Faster-Whisper** (`{MODEL_SIZE}`) model for highly accurate Hindi to English speech translation.
        It's optimized for large audio files and provides robust performance.
        Upload your Hindi audio (.mp3, .wav, .m4a, etc.) or record directly.
        """,
        theme=gr.themes.Soft(),
        allow_flagging="never"
    )
    return interface

# --- Helper Functions for Direct Testing in Colab ---

def create_sample_hindi_audio(filename="sample_hindi.mp3", text="नमस्ते, मैं एक परीक्षण संदेश हूं। यह लंबी ऑडियो फाइल के लिए है।"):
    """
    Creates a sample Hindi audio file using gTTS for testing.
    """
    try:
        tts = gTTS(text=text, lang='hi')
        tts.save(filename)
        print(f"✅ Sample Hindi audio created: '{filename}'")
        display(Audio(filename, autoplay=False))
        return filename
    except ImportError:
        print("❌ gTTS not installed. Please install with: !pip install gtts pydub")
        print("Could not create sample audio.")
        return None
    except Exception as e:
        print(f"❌ Error creating sample audio: {e}")
        return None

def upload_and_test_audio():
    """
    Allows user to upload an audio file via Colab's file uploader and tests transcription.
    """
    print("\n📁 Please select your Hindi audio file...")
    uploaded = files.upload()

    if not uploaded:
        print("❌ No file uploaded.")
        return

    # Process the first uploaded file
    filename = list(uploaded.keys())[0]
    print(f"\n🎯 Processing uploaded file: '{filename}'")

    # Call the core transcription function
    result = transcribe_hindi_to_english(filename)

    print("\n" + "="*60)
    print("🎯 TRANSCRIPTION RESULTS")
    print("="*60)
    if 'error' in result:
        print(f"Error: {result['error']}")
    else:
        print(f"📁 File: {filename}")
        print(f"🗣️ Detected Language: {result.get('detected_language', 'N/A')}")
        print(f"📊 Confidence: {result.get('confidence', 'N/A')}")
        print(f"\n🌐 ENGLISH TRANSLATION:")
        print("-" * 40)
        print(f"'{result.get('english_translation', 'No translation available')}'")
        # if 'word_level_transcription' in result and result['word_level_transcription']:
        #     print("\n✨ WORD-LEVEL DETAILS:")
        #     print("-" * 40)
        #     print(result['word_level_transcription'])
    print("="*60)
    return result

def test_audio_file_directly(file_path):
    """
    Tests transcription for a specific audio file path.
    Usage: test_audio_file_directly("my_long_hindi_audio.mp3")
    """
    if not os.path.exists(file_path):
        print(f"❌ Error: File '{file_path}' not found!")
        print("Make sure the file is uploaded to Colab or check the path.")
        return

    print(f"\n🎯 Processing file: '{file_path}'")
    # Call the core transcription function
    result = transcribe_hindi_to_english(file_path)

    print("\n" + "="*60)
    print("🎯 TRANSCRIPTION RESULTS")
    print("="*60)
    if 'error' in result:
        print(f"Error: {result['error']}")
    else:
        print(f"📁 File: {file_path}")
        print(f"🗣️ Detected Language: {result.get('detected_language', 'N/A')}")
        print(f"📊 Confidence: {result.get('confidence', 'N/A')}")
        print(f"\n🌐 ENGLISH TRANSLATION:")
        print("-" * 40)
        print(f"'{result.get('english_translation', 'No translation available')}'")
        # if 'word_level_transcription' in result and result['word_level_transcription']:
        #     print("\n✨ WORD-LEVEL DETAILS:")
        #     print("-" * 40)
        #     print(result['word_level_transcription'])
    print("="*60)
    return result

# --- Main Execution Block ---
if __name__ == "__main__":
    print("\n" + "="*70)
    print("🚀 Hindi to English Speech Translator Setup Complete!")
    print("="*70)
    print(f"Model: {MODEL_SIZE}, Device: {DEVICE}, Compute Type: {COMPUTE_TYPE}")

    print("\n🌐 Launching Gradio Web Interface...")
    print("   You will see a public URL below which you can share.")
    print("   This interface allows uploading audio files or recording directly.")
    interface = create_gradio_interface()
    interface.launch(share=True, debug=False, quiet=True)

    print("\n" + "="*70)
    print("💡 Additional Testing Options (run these in new cells if preferred):")
    print("="*70)
    print("1. To create a sample Hindi audio file for testing:")
    print("   create_sample_hindi_audio()")
    print("   (Then you can use test_audio_file_directly('sample_hindi.mp3'))")
    print("\n2. To upload an audio file from your local machine and test:")
    print("   upload_and_test_audio()")
    print("\n3. To test a specific audio file already in your Colab environment:")
    print("   test_audio_file_directly('your_audio_file.mp3')")
    print("\nSupported audio formats: MP3, WAV, M4A, FLAC, OGG, etc.")
    print("\nFor long audio files, Faster-Whisper automatically handles chunking for efficient processing.")
    print("Strict word-to-word alignment between Hindi and English is not exposed: the translation is produced at the segment (sentence) level.")
    print("If you pass word_timestamps=True to model.transcribe() and uncomment the 'word_level_transcription' lines, you'll see each word of the output text with its timestamps, which shows roughly where in the audio it was spoken.")
    print("="*70)

Using CPU with int8 precision.
Loading Faster-Whisper model: large-v2...


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Successfully loaded large-v2 model on cpu!

🚀 Hindi to English Speech Translator Setup Complete!
Model: large-v2, Device: cpu, Compute Type: int8

🌐 Launching Gradio Web Interface...
   You will see a public URL below which you can share.
   This interface allows uploading audio files or recording directly.




* Running on public URL: https://c801b9693631df642f.gradio.live



💡 Additional Testing Options (run these in new cells if preferred):
1. To create a sample Hindi audio file for testing:
   create_sample_hindi_audio()
   (Then you can use test_audio_file_directly('sample_hindi.mp3'))

2. To upload an audio file from your local machine and test:
   upload_and_test_audio()

3. To test a specific audio file already in your Colab environment:
   test_audio_file_directly('your_audio_file.mp3')

Supported audio formats: MP3, WAV, M4A, FLAC, OGG, etc.

For long audio files, Faster-Whisper automatically handles chunking for efficient processing.
Word-to-word conversion in a strict sense (like alignment) is not directly exposed as individual words for translation, but the translation itself is sentence-level.
If you uncomment the 'word_level_transcription' lines, you'll see the transcribed Hindi words with timestamps, which helps understand the 'word-to-word' aspect of the original speech.
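
The commented-out loop over `segment.words` in the transcription function can be sketched in isolation. The snippet uses stand-in dataclasses, `Word` and `Segment`, which are hypothetical and mimic only the attributes the notebook reads (`text`, `words`, `word`, `start`, `end`), so it runs without loading a model. With faster-whisper, `segment.words` is only populated when `word_timestamps=True` is passed to `model.transcribe()`, so the helper treats a missing word list as empty.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Word:
    """Stand-in for a word entry (hypothetical, mirrors word/start/end only)."""
    word: str
    start: float
    end: float


@dataclass
class Segment:
    """Stand-in for a transcription segment (hypothetical, mirrors text/words only)."""
    text: str
    words: Optional[List[Word]] = None


def collect_translation(segments):
    """Join segment texts and list per-word timings when they are present."""
    texts = []
    word_lines = []
    for segment in segments:
        texts.append(segment.text.strip())
        # words is None unless word-level timestamps were requested
        for word in segment.words or []:
            word_lines.append(
                f"{word.word.strip()} (start: {word.start:.2f}s, end: {word.end:.2f}s)"
            )
    return " ".join(texts), word_lines


if __name__ == "__main__":
    demo = [
        Segment(" Hello,", [Word(" Hello,", 0.0, 0.42)]),
        Segment(" this is a test.", None),
    ]
    full_text, lines = collect_translation(demo)
    print(full_text)
    print("\n".join(lines))
```

Collecting the texts in a list and joining once avoids the trailing space that string concatenation leaves behind, and the returned word lines could be placed under a `word_level_transcription` key in the result dict.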

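
The transcription function writes Gradio audio to a fixed file name (`temp_gradio_audio.wav`) and deletes it in two places by checking `'temp_audio_file' in locals()`. A minimal sketch of one alternative, using only the standard library: a context manager that hands out a uniquely named path and always removes it, even if transcription raises. The name `temporary_wav_path` is hypothetical and not part of the notebook.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def temporary_wav_path(suffix=".wav"):
    """Yield a unique temporary file path and delete the file on exit.

    Hypothetical helper, not part of the original notebook.
    """
    # mkstemp creates the file and returns an open descriptor plus its path
    fd, path = tempfile.mkstemp(suffix=suffix)
    os.close(fd)  # close our handle so another library can write to the path
    try:
        yield path
    finally:
        # Runs on normal exit and when the body raises
        if os.path.exists(path):
            os.remove(path)


if __name__ == "__main__":
    with temporary_wav_path() as wav_path:
        # In the notebook, sf.write(...) and model.transcribe(...) would go here
        with open(wav_path, "wb") as handle:
            handle.write(b"placeholder audio bytes")
        print("exists inside block:", os.path.exists(wav_path))
    print("exists after block:", os.path.exists(wav_path))
```

A unique name per call also avoids two concurrent Gradio requests overwriting each other's audio, which a single fixed file name cannot rule out.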