# DubAI: Automated Video Dubbing System

DubAI creates dubbed versions of YouTube videos in other languages. It automates the dubbing pipeline with AI models for source separation (Demucs), transcription (Whisper), translation (Gemini), and text-to-speech synthesis (gTTS).

## Overview
DubAI takes a YouTube video link and a target language as input, processes the video to extract audio, transcribes the speech, translates it into the desired language, and generates a dubbed version of the video. This project is particularly useful for content creators, educators, and businesses aiming to reach a global audience by providing multilingual video content.

## Install Required Libraries

This cell installs all the necessary Python libraries required for the project, including libraries for video processing, audio manipulation, transcription, translation, and text-to-speech synthesis.

In [98]:
!pip install moviepy
!pip install pydub
!pip install gtts
!pip install pytubefix
!pip install torch torchaudio
!pip install transformers



In [99]:
!pip install openai-whisper
!pip install demucs
!pip install google-generativeai
!pip install python-dotenv
!pip install numpy
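# Note: `import whisper` is provided by openai-whisper above; the separate PyPI package named `whisper` is unrelated and is not installed here.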



In [100]:
!apt-get install -y ffmpeg

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
ffmpeg is already the newest version (7:4.4.2-0ubuntu0.22.04.1).
0 upgraded, 0 newly installed, 0 to remove and 122 not upgraded.


## Download the Video

This cell defines a function to download a YouTube video from the provided URL. The video is saved to a temporary folder for further processing.

In [101]:
from pytubefix import YouTube
from pytubefix.cli import on_progress

def video_download(url):
    yt = YouTube(url, on_progress_callback=on_progress)

    # Save under a fixed file name so the later steps can find the video
    ys = yt.streams.get_highest_resolution()
    ys.download(output_path="/kaggle/working/tempfile/", filename="vid.mp4")
    print("Download complete!")

## Extract Audio

This cell defines a function to extract audio from the downloaded video. The extracted audio is saved as a separate file for further processing.

In [102]:
from moviepy.video.io.VideoFileClip import VideoFileClip


def extract_audio_from_video(video_path = "vid.mp4", audio_path="aud.wav"):
    try:
        # Load the video file using VideoFileClip
        video_clip = VideoFileClip(video_path)
        
        # Check if the video clip has audio
        if video_clip.audio is None:
            print(f"Warning: The video file '{video_path}' does not contain any audio.")
            video_clip.close()  # Close the video clip to release resources
            return None # Exit the function
        
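        # Extract the audio and save it to a file. With no codec given, MoviePy
        # picks one from the file extension (uncompressed PCM for .wav).
        video_clip.audio.write_audiofile(audio_path)
        print(f"Audio extracted and saved to {audio_path}")
        
        # Return the path: the clip (and its audio reader) is closed in `finally`
        return audio_path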
    except FileNotFoundError as e:
        print(f"File not found: {e}")
    except OSError as e:
        print(f"OS error: {e}")
        # If it's related to ffmpeg, provide more detailed guidance
        if "ffmpeg" in str(e).lower():
            print("\nThis appears to be an ffmpeg-related error.")
    except Exception as e:
        print(f"An error occurred: {e}")
    finally:
        # Always release the video clip, whether extraction succeeded or failed
        if 'video_clip' in locals() and video_clip is not None:
            video_clip.close()

## Extract Music and Vocals

This cell defines a function to separate the audio into vocals and background music using the Demucs model. This separation is crucial for accurate transcription and dubbing.

In [103]:
import os
import torch
from demucs.pretrained import get_model
from demucs.audio import AudioFile, save_audio
from demucs.apply import apply_model

def separate_audio(input_file, output_dir):
    os.makedirs(output_dir, exist_ok=True)
    
    # Load model (htdemucs is the latest model with good separation quality)
    model = get_model("htdemucs")
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device)
    
    # Load audio file
    wav = AudioFile(input_file).read(streams=0, samplerate=model.samplerate, channels=model.audio_channels)
    ref = wav.mean(0)
    wav = (wav - ref.mean()) / ref.std()
    
    # Apply separation - use apply_model instead of model.forward
    with torch.no_grad():
        sources = apply_model(model, wav[None], device=device)
    sources = sources * ref.std() + ref.mean()
    
    # Save each source in the output directory (already created above)
    track_dir = output_dir
    
    # Get the index of vocals from the model sources
    sources_list = model.sources
    vocals_idx = sources_list.index('vocals')
    
    # Save vocals track
    vocals = sources[0][vocals_idx]
    vocals_path = os.path.join(track_dir, "vocals.wav")
    save_audio(vocals, vocals_path, model.samplerate)
    
    # Create and save music track (everything except vocals)
    # Start with zeros, then add all non-vocal sources
    music = torch.zeros_like(vocals)
    for i, source_name in enumerate(sources_list):
        if source_name != 'vocals':
            music += sources[0][i]
    
    music_path = os.path.join(track_dir, "music.wav")
    save_audio(music, music_path, model.samplerate)
    
    print(f"Separation complete! Files saved to {track_dir}")
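
The function normalizes the waveform before separation and undoes the normalization afterwards. The sketch below illustrates only that round trip, using plain Python lists instead of tensors; the helper names `normalize` and `denormalize` and the sample values are made up for illustration, and the real code applies the inverse step to the separated stems rather than to the mixture itself.

```python
import statistics


def normalize(samples):
    """Shift to zero mean and scale to unit standard deviation."""
    mean = statistics.fmean(samples)
    std = statistics.stdev(samples)
    return [(s - mean) / std for s in samples], mean, std


def denormalize(samples, mean, std):
    """Undo normalize() using the statistics it returned."""
    return [s * std + mean for s in samples]


mix = [0.10, -0.20, 0.40, 0.05]  # illustrative sample values
normalized, mean, std = normalize(mix)
restored = denormalize(normalized, mean, std)
print(restored)
```

Keeping the reference mean and standard deviation around is what lets the separated stems be written back at the original loudness.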

## Transcribe the Vocal

This cell defines functions to transcribe the vocal audio into text using the Whisper model. The transcription includes timestamps for accurate synchronization.

In [104]:
import json
import os
import wave
from typing import Dict, List, Tuple
import numpy as np
import torch
import whisper

def read_wav_file(file_path: str) -> Tuple[np.ndarray, int]:
    """Read a WAV file and return audio data and sample rate"""
    with wave.open(file_path, 'rb') as wf:
        frames = wf.getnframes()
        rate = wf.getframerate()
        audio_data = np.frombuffer(wf.readframes(frames), dtype=np.int16).astype(np.float32) / 32768.0
        return audio_data, rate

def transcribe_with_local_whisper(audio_file: str, model_size: str = "small") -> List[Dict]:
    """Transcribe audio using local Whisper model with timestamps"""
    print(f"Transcribing {audio_file} with Whisper {model_size} model...")
    
    # Check if CUDA is available
    device = "cuda" if torch.cuda.is_available() else "cpu"
    print(f"Using device: {device}")
    
    # Load the Whisper model
    model = whisper.load_model(model_size, device=device)
    
    # Transcribe the audio
    print("Running transcription...")
    result = model.transcribe(
        audio_file, 
        verbose=True,   # Show progress
        word_timestamps=True  # Enable word-level timestamps
    )
    
    # Format segments to match our expected structure
    segments = []
    for segment in result.get("segments", []):
        segments.append({
            "text": segment["text"],
            "timestamp": [segment["start"], segment["end"]]
        })
    
    # If no segments are returned, fall back to full text
    if not segments and "text" in result:
        segments.append({
            "text": result["text"],
            "timestamp": [0.0, 30.0]
        })
    
    return segments

def save_transcription(segments: List[Dict], output_file: str):
    """Save transcription with timestamps to file"""
    file_extension = os.path.splitext(output_file)[1].lower()
    
    if file_extension == '.json':
        with open(output_file, 'w') as f:
            json.dump(segments, f, indent=2)
    elif file_extension == '.txt':
        with open(output_file, 'w') as f:
            for segment in segments:
                start = segment['timestamp'][0]
                end = segment['timestamp'][1]
                text = segment['text']
                # Add four spaces after timestamp to match your desired format
                f.write(f"[{start:.2f} - {end:.2f}]    {text}\n")
    else:
        raise ValueError("Unsupported file extension. Use .json or .txt")
    
    print(f"Transcription saved to {output_file}")
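
To make the `.txt` layout concrete, the sketch below formats one segment the same way `save_transcription` does. The helper name `format_segment` is hypothetical and not part of the notebook; the sample values are taken from the run shown later in this notebook.

```python
def format_segment(segment: dict) -> str:
    """Render one segment as '[start - end]    text' (hypothetical helper)."""
    start, end = segment["timestamp"]
    # Two decimals per timestamp, then four spaces before the text
    return f"[{start:.2f} - {end:.2f}]    {segment['text']}"


sample = {"text": "Welcome to English in a Minute.", "timestamp": [3.42, 5.32]}
line = format_segment(sample)
print(line)
```

The later parsing steps rely on this bracketed `start - end` prefix, so any change to the format has to be mirrored in the regular expressions that read it back.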

## Translate the Transcript

This cell defines functions to translate the transcribed text into the target language using Google Generative AI (Gemini). The prompt asks the model to preserve the original meaning and tone.

In [105]:
import re
import google.generativeai as genai
import os
from kaggle_secrets import UserSecretsClient

def parse_transcript(transcript_file):
    """
    Format: [start_time - end_time] text
    """
    segments = []
    with open(transcript_file, 'r') as file:
        for line in file:
            # Skip empty lines or file path comments
            if not line.strip() or line.strip().startswith('//'):
                continue
                
            # Parse timestamp and text
            match = re.match(r'\[([\d.]+) - ([\d.]+)\]\s+(.*)', line)
            if match:
                start_time = float(match.group(1))
                end_time = float(match.group(2))
                text = match.group(3).strip()
                segments.append({
                    'start_time': start_time,
                    'end_time': end_time,
                    'text': text
                })
    
    return segments

def translate_text(text, target_language, model):
    prompt = f"Translate the following English text to {target_language}. Keep the same meaning, tone and native spoken style:\n\n{text}\n\nOnly return the translated text, no options or any other text needed."
    
    try:
        response = model.generate_content(prompt)
        # Strip surrounding whitespace so each segment stays on one transcript line
        return response.text.strip()
    except Exception as e:
        print(f"Translation error: {e}")
        return text  # Return original text if translation fails

def translate_transcript(transcript_file, target_language):
    """
    Translate a transcript file to the specified language and save the result
    """
    user_secrets = UserSecretsClient()
    
    # Configure Gemini AI
    try:
        api_key = user_secrets.get_secret("dubb")
        if not api_key:
            raise ValueError("Kaggle secret 'dubb' (the Gemini API key) is empty or missing")
        
        genai.configure(api_key=api_key)
        model = genai.GenerativeModel('gemini-2.0-flash')
    except Exception as e:
        print(f"Error initializing Gemini: {e}")
        return
    
    # Parse the transcript
    segments = parse_transcript(transcript_file)
    if not segments:
        print(f"No valid segments found in {transcript_file}")
        return
    
    # Translate each segment
    translated_segments = []
    for i, segment in enumerate(segments):
        translated_text = translate_text(segment['text'], target_language, model)
        translated_segments.append({
            'start_time': segment['start_time'],
            'end_time': segment['end_time'],
            'text': translated_text
        })
    
    # Generate output filename
    base_name = os.path.splitext(transcript_file)[0]
    output_file = f"{base_name}_{target_language.lower().replace(' ', '_')}.txt"
    
    # Write translated transcript
    with open(output_file, 'w', encoding='utf-8') as file:
        for segment in translated_segments:
            file.write(f"[{segment['start_time']:.2f} - {segment['end_time']:.2f}]  {segment['text']}\n")
    
    print(f"Translation complete. Output saved to {output_file}")
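
The output file name is derived from the input path and the language name, and any later step has to rebuild the same name to find the file. A minimal sketch of that rule follows; `translated_path` is a hypothetical helper that mirrors the two lines above, not a function from the notebook.

```python
import os


def translated_path(transcript_file: str, target_language: str) -> str:
    """Mirror the naming rule used when saving the translated transcript."""
    base_name = os.path.splitext(transcript_file)[0]
    # Lowercase the language name and replace spaces with underscores
    suffix = target_language.lower().replace(" ", "_")
    return f"{base_name}_{suffix}.txt"


path = translated_path("/kaggle/working/tempfile/transcription.txt", "German")
print(path)
```

Note that a multi-word language name such as "Haitian Creole" produces an underscore in the file name, so code that reconstructs the path must apply the same replacement.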

## Transcript to Vocal

This cell defines functions to convert the translated text into speech using Google Text-to-Speech (gTTS). Each generated clip is time-adjusted, then trimmed or padded, so that it fits the time slot given by the original timestamps.

In [106]:
import re
import os
from pathlib import Path
from gtts import gTTS
from pydub import AudioSegment

def parse_timestamped_text(file_path):
    """
    Parse a file containing timestamped text segments.
    Expected format: [start_time - end_time] text
    Returns a list of (start_time, end_time, text) tuples.
    """
    with open(file_path, 'r', encoding='utf-8') as f:
        content = f.read()
    
    # Regular expression to match timestamps and text
    pattern = r'\[([\d.]+) - ([\d.]+)\]\s+(.*?)(?=\n\[|$)'
    matches = re.findall(pattern, content, re.DOTALL)
    
    result = []
    for start_time, end_time, text in matches:
        result.append((float(start_time), float(end_time), text.strip()))
    
    return result
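
Unlike the line-by-line parser in the translation cell, this pattern uses `re.DOTALL` with a lookahead, so a segment whose text spills onto a second line is still captured as one segment. The sketch below runs the same pattern on a made-up three-line input; the sample text is illustrative only.

```python
import re

# Same pattern as in parse_timestamped_text
PATTERN = r'\[([\d.]+) - ([\d.]+)\]\s+(.*?)(?=\n\[|$)'

content = (
    "[3.42 - 5.32]  first line\n"
    "[5.96 - 9.30]  second line\n"
    "continued on a new line\n"
)

# Each match runs until the next "[..." line or the end of the input
segments = [
    (float(start), float(end), text.strip())
    for start, end, text in re.findall(PATTERN, content, re.DOTALL)
]
print(segments)
```

This matters for translated transcripts, because a model response may contain line breaks that the line-by-line format did not anticipate.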

def text_to_speech(text, lang, output_file):
    """Generate speech from text and save to file"""
    tts = gTTS(text=text, lang=lang, slow=False)
    tts.save(output_file)
    return output_file

def adjust_audio_duration(audio_segment, target_duration_ms):
    """
    Adjust audio segment to match target duration. Clips that are too long
    are sped up without changing pitch; the result is then trimmed or padded
    with silence to the exact length.
    """
    current_duration_ms = len(audio_segment)
    
    if current_duration_ms == 0:
        return AudioSegment.silent(duration=target_duration_ms)
    
    # Prevent division by zero
    if target_duration_ms == 0:
        return AudioSegment.silent(duration=1)  # Return minimal silence
    
    # Calculate the necessary speed factor
    speed_factor = current_duration_ms / target_duration_ms
    
    # Cap the speed-up so the speech stays intelligible
    speed_factor = min(2.0, speed_factor)
    
    # pydub's speedup() works by dropping short chunks, so it can only shorten
    # audio. Skip it when the clip already fits (within 5%) and rely on the
    # trim/pad step below instead.
    if speed_factor > 1.05:
        adjusted_audio = audio_segment.speedup(playback_speed=speed_factor)
    else:
        adjusted_audio = audio_segment
    
    # If still not exact, trim or pad
    final_duration = len(adjusted_audio)
    if final_duration > target_duration_ms:
        # Trim end
        adjusted_audio = adjusted_audio[:target_duration_ms]
    elif final_duration < target_duration_ms:
        # Pad with silence
        silence_ms = target_duration_ms - final_duration
        adjusted_audio = adjusted_audio + AudioSegment.silent(duration=silence_ms)
    
    return adjusted_audio
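
A short worked example of the arithmetic may help. The sketch below covers only the case where the synthesized clip is longer than its slot, and it assumes an idealized speed change (the real output length can differ by a few milliseconds). `plan_speedup` is a hypothetical helper, not part of the notebook.

```python
def plan_speedup(current_ms: int, target_ms: int, max_factor: float = 2.0):
    """Return (speed_factor, overflow_ms) for a clip longer than its slot."""
    # Ratio of clip length to slot length, capped at max_factor
    speed_factor = min(max_factor, current_ms / target_ms)
    # Idealized clip length after the speed change
    sped_up_ms = current_ms / speed_factor
    # Whatever still does not fit would be trimmed from the end
    overflow_ms = max(0.0, sped_up_ms - target_ms)
    return speed_factor, overflow_ms


fits = plan_speedup(3000, 2000)    # 1.5x is enough, nothing is trimmed
capped = plan_speedup(5000, 2000)  # needs 2.5x, capped at 2.0x
print(fits, capped)
```

In the capped case roughly half a second of speech would be cut off, which is a reason to keep translated segments close in length to the original.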

def dub_text_with_timestamps(input_file, lang, output_dir, final_output):
    """
    Main function to dub text segments according to timestamps and merge them.
    Ensures output audio maintains original timestamps.
    """
    # Create output directory if it doesn't exist
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    
    # Parse the timestamped text
    segments = parse_timestamped_text(input_file)
    print(f"Found {len(segments)} text segments to process")
    
    # Create a blank audio file that will hold our final output
    final_audio = AudioSegment.silent(duration=0)
    last_end_time = 0
    
    # Process each segment
    for i, (start_time, end_time, text) in enumerate(segments):
        # Generate output filename for this segment
        segment_file = os.path.join(output_dir, f"segment_{i+1:03d}.mp3")
        
        # Convert text to speech
        text_to_speech(text, lang, segment_file)
        
        # Load the generated speech
        speech = AudioSegment.from_mp3(segment_file)
        
        # Calculate desired duration in milliseconds
        target_duration_ms = int((end_time - start_time) * 1000)
        
        # Adjust speech to match target duration
        adjusted_speech = adjust_audio_duration(speech, target_duration_ms)
        
        # Add silence gap if needed
        silence_duration_ms = int((start_time - last_end_time) * 1000)
        if silence_duration_ms > 0:
            final_audio += AudioSegment.silent(duration=silence_duration_ms)
        
        # Add the adjusted speech
        final_audio += adjusted_speech
        
        # Update last_end_time
        last_end_time = end_time
        
        print(f"Added segment with duration: {target_duration_ms/1000:.2f} seconds")
    
    # Export the final audio
    print(f"\nExporting final audio to {final_output}...")
    final_audio.export(final_output, format="mp3")
    print(f"Successfully created audio file: {final_output}")
    print(f"Total duration: {len(final_audio)/1000:.2f} seconds")
    
    return final_output
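
The timeline logic can be checked without generating any audio. The following sketch computes, for each segment, the silence inserted before it and the slot its speech must fill, using the same millisecond arithmetic as the loop above. `plan_timeline` and the sample timestamps are illustrative assumptions.

```python
def plan_timeline(segments):
    """Return (silence_ms, speech_ms) pairs for (start, end, text) segments."""
    plan = []
    last_end_time = 0.0
    for start_time, end_time, _text in segments:
        # Gap between the previous segment's end and this segment's start
        silence_ms = max(0, int((start_time - last_end_time) * 1000))
        # Slot that the synthesized speech must fill
        speech_ms = int((end_time - start_time) * 1000)
        plan.append((silence_ms, speech_ms))
        last_end_time = end_time
    return plan


plan = plan_timeline([(0.5, 2.0, "a"), (2.5, 4.0, "b")])
total_ms = sum(silence + speech for silence, speech in plan)
print(plan, total_ms)
```

When segments do not overlap, the total equals the end time of the last segment, which is what keeps the dubbed track aligned with the video.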

## Merge Audios with Video

This cell defines functions to merge the dubbed audio with the background music and add the combined audio to the original video. The final output is a fully dubbed video.

In [107]:
from moviepy.video.io.VideoFileClip import VideoFileClip
from moviepy.audio.io.AudioFileClip import AudioFileClip
from moviepy.audio.AudioClip import CompositeAudioClip

def merge_audio_tracks(dubbed_audio_path, music_path, output_audio_path):
    """
    Merge dubbed audio and background music into a single audio file.
    
    Args:
        dubbed_audio_path (str): Path to the main dubbed audio
        music_path (str): Path to the background music
        output_audio_path (str): Path for the output combined audio file
    """
    try:
        # Load audio clips
        dubbed_audio = AudioFileClip(dubbed_audio_path)
        music_audio = AudioFileClip(music_path)
        
        # Combine the audio tracks
        combined_audio = CompositeAudioClip([dubbed_audio, music_audio])
        combined_audio.fps = 44100 
        
        # Write the combined audio to file
        combined_audio.write_audiofile(output_audio_path, codec='aac')
        
        # Close clips to free resources
        dubbed_audio.close()
        music_audio.close()
        combined_audio.close()
        
        print(f"Successfully merged audio tracks to {output_audio_path}")
        return output_audio_path
        
    except Exception as e:
        print(f"Error merging audio tracks: {e}")
        return None

def add_audio_to_video(video_path, audio_path, output_path):
    """
    Add audio to a muted video.
    
    Args:
        video_path (str): Path to the video file
        audio_path (str): Path to the audio file
        output_path (str): Path for the output video file
    """
    try:
        # Load video clip and mute it
        video_clip = VideoFileClip(video_path)
        muted_video = video_clip.without_audio()
        
        # Load the combined audio
        audio_clip = AudioFileClip(audio_path)
        
        # Set the audio to the muted video
        final_video = muted_video
        final_video.audio = audio_clip
        
        # Write the result to file with proper codecs
        final_video.write_videofile(output_path, 
                                   codec='libx264',
                                   audio_codec='aac', 
                                   temp_audiofile="temp-audio.m4a", 
                                   remove_temp=True)
        
        # Close all clips
        video_clip.close()
        muted_video.close()
        audio_clip.close()
        final_video.close()
        
        print(f"Successfully added audio to video at {output_path}")
        
    except Exception as e:
        print(f"Error adding audio to video: {e}")

## Driver Executions

This cell contains the main execution flow of the project. It orchestrates all the steps, from downloading the video to generating the final dubbed video.

In [108]:
import sys

# Original mapping from code to name
_LANGUAGE_CODE_TO_NAME_MAP = {'af': 'Afrikaans', 'sq': 'Albanian', 'ar': 'Arabic', 'hy': 'Armenian', 'ca': 'Catalan', 'zh': 'Chinese', 'zh-cn': 'Chinese (Mandarin/China)', 'zh-tw': 'Chinese (Mandarin/Taiwan)', 'zh-yue': 'Chinese (Cantonese)', 'hr': 'Croatian', 'cs': 'Czech', 'da': 'Danish', 'nl': 'Dutch', 'en': 'English', 'en-au': 'English (Australia)', 'en-uk': 'English (United Kingdom)', 'en-us': 'English (United States)', 'eo': 'Esperanto', 'fi': 'Finnish', 'fr': 'French', 'de': 'German', 'el': 'Greek', 'ht': 'Haitian Creole', 'hi': 'Hindi', 'hu': 'Hungarian', 'is': 'Icelandic', 'id': 'Indonesian', 'it': 'Italian', 'ja': 'Japanese', 'ko': 'Korean', 'la': 'Latin', 'lv': 'Latvian', 'mk': 'Macedonian', 'no': 'Norwegian', 'pl': 'Polish', 'pt': 'Portuguese', 'pt-br': 'Portuguese (Brazil)', 'ro': 'Romanian', 'ru': 'Russian', 'sr': 'Serbian', 'sk': 'Slovak', 'es': 'Spanish', 'es-es': 'Spanish (Spain)', 'es-us': 'Spanish (United States)', 'sw': 'Swahili', 'sv': 'Swedish', 'ta': 'Tamil', 'th': 'Thai', 'tr': 'Turkish', 'vi': 'Vietnamese', 'cy': 'Welsh'}

# This allows for case-insensitive lookup of the language name
_LANGUAGE_NAME_TO_CODE_MAP = {
    name.lower(): code for code, name in _LANGUAGE_CODE_TO_NAME_MAP.items()
}

def get_language_code(language_name: str) -> str | None:
  name_lower = language_name.lower()
  return _LANGUAGE_NAME_TO_CODE_MAP.get(name_lower)
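
As a quick illustration of the lookup, the sketch below rebuilds the reverse map from a three-entry subset of the table above. The names `code_to_name`, `name_to_code`, and `lookup` are stand-ins chosen for this example.

```python
# Subset of the code-to-name table, for illustration only
code_to_name = {"de": "German", "pt-br": "Portuguese (Brazil)", "fr": "French"}

# Reverse map keyed by lowercased name, so lookups ignore case
name_to_code = {name.lower(): code for code, name in code_to_name.items()}


def lookup(language_name: str):
    """Return the language code, or None if the name is not in the table."""
    return name_to_code.get(language_name.lower())


print(lookup("German"))
print(lookup("PORTUGUESE (BRAZIL)"))
print(lookup("Klingon"))
```

Because an unknown name yields `None`, callers should check the result before passing it on as a gTTS language code.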

In [109]:
# 🎥 Step 0: Download the YouTube Video
print("📥 Enter the URL of the YouTube video to download:")
video_download(input().strip())
print("🌍 Enter the target language for translation:")
translation_lan = input().strip().lower()
print("\n\n")

# 🎵 Step 1: Extract Audio from Video
print("🎞️ Extracting audio from video...")
extract_audio_from_video("/kaggle/working/tempfile/vid.mp4", "/kaggle/working/tempfile/aud.wav")
print("✅ Audio extracted successfully!\n")


# 🎤 Step 2: Separate Vocals and Music
print("🎧 Wait a little longer... Audio separation in progress...")
separate_audio("/kaggle/working/tempfile/aud.wav", "/kaggle/working/tempfile/aud")
print("✅ Audio separation complete!\n")


# 📝 Step 3: Transcribe Audio to Text
print("📝 Transcribing vocals...")
audio_file = "/kaggle/working/tempfile/aud/vocals.wav"
output_file = "/kaggle/working/tempfile/transcription.txt"
# Choose model size: 'tiny', 'base', 'small', 'medium', or 'large'
segments = transcribe_with_local_whisper(audio_file, "medium")
save_transcription(segments, output_file)
print(f"✅ Transcription complete: {audio_file} → {output_file}\n")


# 🌐 Step 4: Translate Transcription
translate_transcript("/kaggle/working/tempfile/transcription.txt", translation_lan)
print("✅ Translation complete!\n")


# 🗣️ Step 5: Generate Dubbed Audio with Translation
print("🔊 Generating dubbed audio with translated text...")
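# Rebuild the file name exactly as translate_transcript wrote it (spaces become underscores)
input_file = f"/kaggle/working/tempfile/transcription_{translation_lan.replace(' ', '_')}.txt"
language = get_language_code(translation_lan)
if language is None:
    raise ValueError(f"Unsupported target language: {translation_lan}")
temp_directory = "/kaggle/working/tempfile/dubbed_temp"
final_output = "/kaggle/working/tempfile/complete_dubbed.mp3"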
dub_text_with_timestamps(input_file, language, temp_directory, final_output)
print("✅ Dubbed audio generation complete!\n")


# 🎬 Step 6: Merge Audio Tracks and Add to Video
print("🎵 Merging dubbed audio and background music...")
video_path = "/kaggle/working/tempfile/vid.mp4"
dubbed_audio_path = "/kaggle/working/tempfile/complete_dubbed.mp3"
music_path = "/kaggle/working/tempfile/aud/music.wav"
combined_audio_path = "/kaggle/working/tempfile/temp-audio.m4a"
output_video_path = "output.mp4"
merged_audio = merge_audio_tracks(dubbed_audio_path, music_path, combined_audio_path)
if merged_audio:
    print("🎥 Adding merged audio to the video...")
    add_audio_to_video(video_path, merged_audio, output_video_path)
    print(f"✅ Video with dubbed audio created successfully: {output_video_path}\n")
else:
    print("❌ Failed to merge audio tracks!\n")

📥 Enter the URL of the YouTube video to download:
 https://www.youtube.com/watch?v=KHEzudV202s
Download complete!
🌍 Enter the target language for translation:
 German

🎞️ Extracting audio from video...
MoviePy - Writing audio in /kaggle/working/tempfile/aud.wav


                                                                      

MoviePy - Done.
Audio extracted and saved to /kaggle/working/tempfile/aud.wav
✅ Audio extracted successfully!

🎧 Wait a little longer... Audio separation in progress...
Separation complete! Files saved to /kaggle/working/tempfile/aud
✅ Audio separation complete!

📝 Transcribing vocals...
Transcribing /kaggle/working/tempfile/aud/vocals.wav with Whisper medium model...
Using device: cuda
Running transcription...
Detecting language using up to the first 30 seconds. Use `--language` to specify the language
Detected language: English
[00:03.420 --> 00:05.320]  Welcome to English in a Minute.
[00:05.960 --> 00:09.300]  Reading is fun and a great way to learn.
[00:09.860 --> 00:14.200]  So is reading into something part of the learning process?
[00:15.840 --> 00:22.720]  Andrew, so my friend who hasn't contacted me in two years just texted, but she wrote
[00:22.720 --> 00:23.740]  almost nothing.
[00:24.440 --> 00:25.640]  What did she write?
[00:25.640 --> 00:29.480]  She wrote, hello, and 

                                                                     

MoviePy - Done.
Successfully merged audio tracks to /kaggle/working/tempfile/temp-audio.m4a
🎥 Adding merged audio to the video...
Moviepy - Building video output.mp4.
MoviePy - Writing audio in temp-audio.m4a
MoviePy - Done.
Moviepy - Writing video output.mp4
Moviepy - Done !
Moviepy - video ready output.mp4
Successfully added audio to video at output.mp4
✅ Video with dubbed audio created successfully: output.mp4



In [110]:
# Cleaning
!rm -rf /kaggle/working/tempfile