# 🎥 Download and transcribe videos from Twitter/X

Welcome to this fun and interactive notebook! In this project, we'll implement a local UI that downloads videos from Twitter/X, extracts their audio, transcribes it to text, and translates the transcript to English if needed.


---

## Importance

Imagine you stumble upon a fascinating video on Twitter/X. Maybe it's a speech, a podcast clip, or someone sharing their thoughts in another language. You want to understand it fully—get the transcript, maybe translate it to English, and save the results for later.

That's exactly what this code does! We're building a pipeline that:

✅ Downloads a video from a Twitter/X URL  
✅ Extracts high-quality audio from it  
✅ Transcribes the audio to text using the Whisper AI model  
✅ Optionally translates the text to English if it's in another language  
✅ Saves the results and cleans up temporary files

This notebook will guide you through each step, explain what's happening, and let you run the code yourself. Let's get started!

---

## Required libraries

Here are the required libraries and what each does:

- `yt_dlp`: Downloads videos from Twitter/X
- `whisper`: Transcribes audio to text with the Whisper model
- `torch`: Runs Whisper, with GPU acceleration if available
- `deep_translator`: Translates the transcript through Google Translate
- `gradio`: Creates the web interface
- `ffmpeg`: Extracts audio from videos. It is a command-line tool, not a Python package, and must be installed separately
- Optional:
  - `transformers`: For an alternative transcription method
  - `langdetect`: For detecting the language of the transcription


###### NB: Before you begin, it's advisable to set up a virtual environment and install all required libraries. Refer to the `README`.
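
Before running the cells below, it can help to confirm that everything is in place. The following is a minimal sketch that is not part of the original notebook: `check_environment` is a hypothetical helper that uses only the standard library to report which Python modules and command-line tools cannot be found.

```python
import importlib.util
import shutil


def check_environment(modules, tools):
    """Return (missing_modules, missing_tools) without importing the modules."""
    # find_spec returns None when a top-level module is not installed
    missing_modules = [m for m in modules if importlib.util.find_spec(m) is None]
    # shutil.which returns None when an executable is not on PATH
    missing_tools = [t for t in tools if shutil.which(t) is None]
    return missing_modules, missing_tools


missing_modules, missing_tools = check_environment(
    ["yt_dlp", "whisper", "torch", "deep_translator", "gradio"],
    ["ffmpeg"],
)
print("Missing Python packages:", missing_modules or "none")
print("Missing command-line tools:", missing_tools or "none")
```

If either list is non-empty, install the missing pieces before continuing.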



In [1]:
import os
import tempfile
import subprocess
import yt_dlp
import whisper
import torch
from deep_translator import GoogleTranslator
import argparse
import gradio as gr

# Conditionally import transformers if available
try:
    from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
    transformers_available = True
except ImportError:
    transformers_available = False

try:
    from langdetect import detect as detect_lang
    langdetect_available = True
except ImportError:
    langdetect_available = False

## Downloading the Video

Our first task is to grab the video from Twitter/X. The `download_twitter_video` function uses `yt_dlp` to download the video (or just its audio if specified). It creates a temporary file if no output path is provided and ensures we get the best quality available.

In [2]:
def download_twitter_video(tweet_url, output_path=None, audio_only=False):
    if output_path is None:
        temp_dir = tempfile.mkdtemp()
        output_path = os.path.join(temp_dir, "twitter_video.mp4")
    
    ydl_opts = {
        'outtmpl': output_path,
        'format': 'bestaudio/best' if audio_only else 'best',
        'quiet': True,
        'no_warnings': True,
        # Audio extraction is handled by the FFmpegExtractAudio postprocessor
        'postprocessors': [{'key': 'FFmpegExtractAudio', 'preferredcodec': 'wav', 'preferredquality': '192'}] if audio_only else [],
    }
    
    try:
        with yt_dlp.YoutubeDL(ydl_opts) as ydl:
            ydl.download([tweet_url])
        # The postprocessor swaps the file extension for .wav
        return os.path.splitext(output_path)[0] + '.wav' if audio_only else output_path
    except Exception as e:
        print(f"Error downloading video: {e}")
        raise

#### What's Happening?

- We pass a Twitter/X URL (e.g., `https://twitter.com/username/status/123456789`).
- If no `output_path` is specified, the video is saved as `twitter_video.mp4` in a new temporary directory.
- The `ydl_opts` dictionary configures `yt_dlp` to download the best available video or audio quality.
- If `audio_only` is `True`, the audio is extracted to a WAV file. WAV is uncompressed, so the `192` quality setting has little practical effect.
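
Since the function hands whatever it receives to `yt_dlp`, a quick sanity check on the URL can give a friendlier error before any download starts. The sketch below is an illustration only: `looks_like_tweet_url`, the host list, and the path pattern are assumptions, not part of the notebook or of `yt_dlp`.

```python
import re
from urllib.parse import urlparse

# Hosts accepted by this sketch (an assumption; extend as needed)
TWEET_HOSTS = {"twitter.com", "www.twitter.com", "mobile.twitter.com", "x.com", "www.x.com"}
# Expected path shape: /<user>/status/<numeric id>
STATUS_PATH = re.compile(r"^/[A-Za-z0-9_]{1,15}/status/\d+")


def looks_like_tweet_url(url):
    """Return True if `url` has the shape https://<host>/<user>/status/<id>."""
    parsed = urlparse(url.strip())
    if parsed.scheme not in ("http", "https"):
        return False
    host = (parsed.hostname or "").lower()
    return host in TWEET_HOSTS and bool(STATUS_PATH.match(parsed.path))


print(looks_like_tweet_url("https://twitter.com/username/status/123456789"))
print(looks_like_tweet_url("https://example.com/username/status/123456789"))
```

A check like this only validates the shape of the link. Whether the tweet actually contains a video is still decided by the download itself.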

### Extracting Audio

Next, we need the audio from the video for transcription. The `extract_audio` function uses `ffmpeg` to pull out the audio as a 48 kHz stereo WAV file. Whisper resamples its input to 16 kHz mono when it loads audio, so these settings preserve the source quality rather than improve transcription accuracy.

- The function takes the video file and creates a WAV file next to it (e.g., `video.mp4` becomes `video.wav`).
- The `ffmpeg` command requests a 48 kHz sample rate, stereo channels, and a 192 kbps bitrate. WAV audio is uncompressed, so the bitrate flag has little practical effect.
- If `ffmpeg` exits with an error (e.g., the file has no audio stream), the function returns the video path as a fallback.

In [3]:
def extract_audio(video_path, output_audio_path=None):
    if output_audio_path is None:
        output_audio_path = os.path.splitext(video_path)[0] + ".wav"
    
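    # Build the ffmpeg command for audio extraction
    command = [
        "ffmpeg", "-y",        # overwrite the output file if it already exists
        "-i", video_path,
        "-vn",                 # drop the video stream
        "-ar", "48000",        # 48 kHz sample rate
        "-ac", "2",            # stereo
        "-ab", "192k",         # requested audio bitrate
        "-f", "wav",           # WAV container
        output_audio_path,
    ]
    
    try:
        subprocess.run(command, check=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        return output_audio_path
    except (subprocess.CalledProcessError, FileNotFoundError) as e:
        # CalledProcessError: ffmpeg ran but failed; FileNotFoundError: ffmpeg is not installed
        print(f"Error extracting audio: {e}")
        # Fall back to the video path so Whisper can try to read it directly
        return video_path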
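
Because Whisper's own audio loader converts its input to 16 kHz mono, a smaller file in that format should transcribe the same way. The sketch below shows one way to build such a command. `build_ffmpeg_command` is a hypothetical helper, and the 16 kHz mono defaults are a design choice for illustration, not what the notebook uses.

```python
import os


def build_ffmpeg_command(video_path, audio_path=None, sample_rate=16000, channels=1):
    """Build an ffmpeg argument list that writes PCM WAV audio from `video_path`."""
    if audio_path is None:
        audio_path = os.path.splitext(video_path)[0] + ".wav"
    return [
        "ffmpeg",
        "-y",                      # overwrite the output file if it exists
        "-i", video_path,
        "-vn",                     # drop the video stream
        "-ar", str(sample_rate),   # sample rate in Hz
        "-ac", str(channels),      # number of audio channels
        "-c:a", "pcm_s16le",       # 16-bit PCM, the usual WAV encoding
        audio_path,
    ]


print(build_ffmpeg_command("clip.mp4"))
```

The returned list can be passed to `subprocess.run(..., check=True)` in the same way as the command in `extract_audio`.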

### Transcribing with Whisper

Now for the exciting part: turning audio into text! The `transcribe_with_whisper` function uses OpenAI's Whisper model to transcribe the audio. Whisper is a powerful AI model that can handle multiple languages and noisy audio.

- We choose a Whisper model size (`tiny`, `base`, `small`, `medium`, `large`). Larger models are more accurate but slower and need more memory.
- The `beam_size` and `best_of` options make decoding consider several candidates: `beam_size` sets the beam-search width, and `best_of` sets the number of sampled candidates when decoding falls back to a non-zero temperature.
- The result is the transcribed text, ready for further processing.

In [4]:
def transcribe_with_whisper(audio_path, model_size="medium", language=None):
    """
    Transcribe audio using Whisper with optimized settings.
    
    Args:
        audio_path: Path to audio file
        model_size: Whisper model size (tiny, base, small, medium, large)
        language: Source language code if known (improves accuracy)
    
    Returns:
        Transcribed text
    """
    # Check for GPU availability and set device
    device = "cuda" if torch.cuda.is_available() else "cpu"
    print(f"Using {device} for Whisper transcription with {model_size} model")
    
    # Load model with specified size
    model = whisper.load_model(model_size, device=device)
    
    # Configure transcription options for better accuracy
    transcribe_options = {
        "fp16": device == "cuda",  
        "language": language,  
        "task": "transcribe",
        "beam_size": 5,  
        "best_of": 5
    }
    
    # Remove None values
    transcribe_options = {k: v for k, v in transcribe_options.items() if v is not None}
    
    # Perform transcription
    result = model.transcribe(audio_path, **transcribe_options)
    return result["text"]
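
The function above returns only `result["text"]`, but the dictionary returned by `model.transcribe` also contains a `segments` list whose entries carry `start`, `end`, and `text`. As a rough sketch, the hypothetical helpers below turn such a dictionary into timestamped lines. The `example_result` data is hand-written for illustration and is not real model output.

```python
def format_timestamp(seconds):
    """Format a duration in seconds as MM:SS."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def format_segments(result):
    """Turn a Whisper-style result dict into one '[start - end] text' line per segment."""
    lines = []
    for segment in result.get("segments", []):
        start = format_timestamp(segment["start"])
        end = format_timestamp(segment["end"])
        lines.append(f"[{start} - {end}] {segment['text'].strip()}")
    return "\n".join(lines)


# Hand-written stand-in for the dict returned by model.transcribe(...)
example_result = {
    "text": " Hello there. How are you?",
    "segments": [
        {"start": 0.0, "end": 2.4, "text": " Hello there."},
        {"start": 2.4, "end": 65.0, "text": " How are you?"},
    ],
}
print(format_segments(example_result))
```

To use this with the notebook, `transcribe_with_whisper` would need to return the whole `result` dictionary instead of only its text.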

### Alternative Transcription with Transformers

As a cross-check, we can use the `transformers` library to transcribe with Whisper's `large-v2` model. This step is optional and only runs if `transformers` is installed and the model size is `large`.

In [5]:
def transcribe_with_transformers(audio_path):
    """Use Transformers pipeline for an alternative transcription option (Whisper large-v2)."""
    if not transformers_available:
        print("Transformers library not available. Install with: pip install transformers")
        return None
    
    try:
        # Initialize the Whisper model through transformers (provides different implementation)
        device = "cuda:0" if torch.cuda.is_available() else "cpu"
        torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
        
        model_id = "openai/whisper-large-v2"
        
        model = AutoModelForSpeechSeq2Seq.from_pretrained(
            model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
        )
        model.to(device)
        
        processor = AutoProcessor.from_pretrained(model_id)
        
        pipe = pipeline(
            "automatic-speech-recognition",
            model=model,
            tokenizer=processor.tokenizer,
            feature_extractor=processor.feature_extractor,
            max_new_tokens=128,
            chunk_length_s=30,
            batch_size=16,
            return_timestamps=True,
            torch_dtype=torch_dtype,
            device=device,
        )
        
        result = pipe(audio_path)
        return result["text"]
    except Exception as e:
        print(f"Error in transformers transcription: {e}")
        return None
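
When both transcripts are available, a rough agreement score can show how much the two implementations differ. The sketch below uses the standard library's `difflib` for a word-level comparison. `transcript_similarity` is a hypothetical helper and the score is only a heuristic, not a word error rate.

```python
from difflib import SequenceMatcher


def transcript_similarity(first, second):
    """Return a word-level similarity ratio between 0.0 and 1.0."""
    first_words = first.lower().split()
    second_words = second.lower().split()
    # autojunk=False keeps frequent words from being ignored in long transcripts
    return SequenceMatcher(None, first_words, second_words, autojunk=False).ratio()


score = transcript_similarity("the cat sat on the mat", "the cat sat on a mat")
print(f"Similarity: {score:.2f}")
```

A low score suggests that the two transcripts disagree enough to be worth reading side by side.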

### Detecting and Translating the Language

If the transcript isn't in English, we want to translate it. First, we detect the language with `langdetect` (if available). Then, we use `GoogleTranslator` to translate the text.

- `detect_language` identifies the language of the transcript (e.g., `es` for Spanish) and returns `auto` if detection is unavailable or fails.
- `translate_text` translates the text to the target language (default: English). Long texts (more than 500 characters) are split into chunks that are translated separately.
- `split_text_into_chunks` groups whole sentences into chunks of at most `max_chunk_size` characters, so no sentence is cut in the middle.

In [6]:
def detect_language(text):
    """Attempt to detect the language of the text for better translation."""
    if not langdetect_available:
        return "auto"
    
    try:
        return detect_lang(text)
    except Exception:
        # Detection can fail on empty or ambiguous text; let the translator auto-detect
        return "auto"

def translate_text(text, source='auto', target='en', use_advanced=True):
    """
    Translate text with enhanced accuracy.
    
    Args:
        text: Text to translate
        source: Source language code or 'auto' for auto-detection
        target: Target language code
        use_advanced: Whether to use advanced techniques
    
    Returns:
        Translated text
    """
    if not text or text.strip() == "":
        return ""
    
    translator = GoogleTranslator(source=source, target=target)
    
    try:
        # Long text: translate sentence-aligned chunks separately
        if use_advanced and len(text) > 500:
            chunks = split_text_into_chunks(text, 500)
            return " ".join(translator.translate(chunk) for chunk in chunks)
        
        # Short text (or advanced mode off): translate in a single request
        return translator.translate(text)
    except Exception as e:
        # On any translation failure, fall back to the untranslated text
        print(f"Translation error: {e}")
        return text

def split_text_into_chunks(text, max_chunk_size):
    """Split text into chunks at sentence boundaries."""
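    # Normalize full-width punctuation, then cut after every ., ! or ?
    normalized = text.replace("。", ".").replace("！", "!").replace("？", "?")
    sentences = []
    start = 0
    for i, char in enumerate(normalized):
        if char in ".!?":
            sentences.append(normalized[start:i + 1])
            start = i + 1
    sentences.append(normalized[start:])  # trailing text without end punctuation
    
    chunks = []
    current_chunk = ""
    
    for sentence in sentences:
        sentence = sentence.strip()
        if not sentence:
            continue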
        
        if len(current_chunk) + len(sentence) <= max_chunk_size:
            current_chunk += " " + sentence
        else:
            if current_chunk:
                chunks.append(current_chunk.strip())
            current_chunk = sentence
    
    if current_chunk:
        chunks.append(current_chunk.strip())
        
    return chunks
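
When chunks are translated one request at a time, a single failed request does not have to discard the whole translation. The following sketch keeps the original text for any chunk that fails. `translate_chunks` and `fake_translate` are hypothetical names introduced for illustration, and the translator is passed in as a function so the example runs without network access.

```python
def translate_chunks(chunks, translate_fn):
    """Translate each chunk with `translate_fn`, keeping the original chunk on failure."""
    translated = []
    failed = 0
    for chunk in chunks:
        try:
            result = translate_fn(chunk)
        except Exception as error:  # translation backends raise many error types
            print(f"Chunk failed, keeping original text: {error}")
            result = None
        if not result:
            # Count the failure and fall back to the untranslated chunk
            failed += 1
            result = chunk
        translated.append(result)
    return " ".join(translated), failed


def fake_translate(text):
    """Stand-in translator for the demo: upper-cases text, fails on 'boom'."""
    if "boom" in text:
        raise RuntimeError("simulated network error")
    return text.upper()


joined, failures = translate_chunks(["hola.", "boom.", "adios."], fake_translate)
print(joined, failures)
```

With `deep_translator`, `translate_fn` could be `GoogleTranslator(source="auto", target="en").translate`.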

### Putting It All Together

The `process_twitter_video` function ties everything together. It takes a Twitter/X URL, downloads the video, extracts the audio, transcribes it, translates the transcript if needed, and saves the results.

In [7]:
def process_twitter_video(tweet_url, model_size="medium", output_dir=None, target_language="en"):
    """
    Process Twitter video without using argparse (for Jupyter compatibility).
    
    Args:
        tweet_url: URL of the Twitter/X video
        model_size: Whisper model size to use
        output_dir: Directory to save output files
        target_language: Target language for translation
    """
    if output_dir:
        os.makedirs(output_dir, exist_ok=True)
    
    output_file = os.path.join(output_dir, "twitter_video.mp4") if output_dir else None
    
    print("\n=== Processing Twitter Video ===")
    print(f"Downloading video from: {tweet_url}")
    video_path = download_twitter_video(tweet_url, output_path=output_file)
    
    print("Extracting high-quality audio...")
    audio_path = extract_audio(video_path)
    
    print(f"Transcribing audio using Whisper {model_size} model...")
    whisper_transcript = transcribe_with_whisper(audio_path, model_size=model_size)
    
    # Optional: Try alternative transcription as well for comparison
    transformer_transcript = None
    if model_size == "large" and transformers_available:
        print("Performing alternative transcription with Transformers...")
        transformer_transcript = transcribe_with_transformers(audio_path)
    
    # Detect source language for better translation
    source_language = detect_language(whisper_transcript)
    # Translate only when the detected language differs from the target
    needs_translation = source_language != target_language
    if needs_translation:
        target_name = "English" if target_language == "en" else target_language
        print(f"Detected source language: {source_language}")
        print(f"Translating to {target_name}...")
        translated_text = translate_text(
            whisper_transcript, 
            source=source_language, 
            target=target_language, 
            use_advanced=True
        )
    else:
        translated_text = whisper_transcript
    
    # Save results to files if output directory specified
    if output_dir:
        with open(os.path.join(output_dir, "transcript.txt"), "w", encoding="utf-8") as f:
            f.write(whisper_transcript)
        
        with open(os.path.join(output_dir, "translated.txt"), "w", encoding="utf-8") as f:
            f.write(translated_text)
        
        if transformer_transcript:
            with open(os.path.join(output_dir, "alt_transcript.txt"), "w", encoding="utf-8") as f:
                f.write(transformer_transcript)
    
    # Print results
    print("\n=== Original Transcript ===")
    print(whisper_transcript)
    
    if needs_translation:
        print("\n=== Translated Transcript ===")
        print(translated_text)
    
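    # Clean up temporary files (only when no output directory was requested)
    if not output_dir:
        # audio_path equals video_path when audio extraction fell back to the video
        for path in {audio_path, video_path}:
            try:
                os.remove(path)
            except OSError:
                pass
        try:
            os.rmdir(os.path.dirname(video_path))  # remove the temp directory once empty
        except OSError:
            pass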
    
    return {
        "transcript": whisper_transcript,
        "translated": translated_text,
        "alt_transcript": transformer_transcript
    }
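
An alternative to removing temporary files by hand is to let `tempfile.TemporaryDirectory` own them, so the directory and everything in it is deleted when the work finishes, even if an exception is raised. This is a sketch of that pattern rather than the notebook's approach. `run_in_temp_dir` and `demo_work` are hypothetical names, and the placeholder file stands in for the downloaded video and extracted audio.

```python
import os
import tempfile


def run_in_temp_dir(work):
    """Call work(temp_dir) and delete the directory and its contents afterwards."""
    with tempfile.TemporaryDirectory(prefix="tweet_video_") as temp_dir:
        return work(temp_dir)


def demo_work(temp_dir):
    """Stand-in for download + extraction: writes a file and reports its path."""
    path = os.path.join(temp_dir, "twitter_video.wav")
    with open(path, "wb") as handle:
        handle.write(b"placeholder")
    return path, os.path.exists(path)


leftover_path, existed_during_work = run_in_temp_dir(demo_work)
print(existed_during_work, os.path.exists(leftover_path))
```

In the pipeline, `work` would download into `temp_dir`, extract the audio, and return the transcript, so nothing needs to be cleaned up afterwards.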

### Adding Gradio Interface

We've added a Gradio interface! Gradio lets you create a web-based UI where users can paste a Twitter/X video URL and instantly see the transcribed and translated text. 

In [8]:
def gradio_interface(tweet_url):
    try:
        result = process_twitter_video(tweet_url, model_size="medium", output_dir="output")
        return result["transcript"], result["translated"]
    except Exception as e:
        return f"Error: {str(e)}", ""

demo = gr.Interface(
    fn=gradio_interface,
    inputs=gr.Textbox(label="Enter Twitter/X Video URL"),
    outputs=[
        gr.Textbox(label="Original Transcript"),
        gr.Textbox(label="Translated Transcript (English)")
    ],
    title="Twitter Video English Transcriber",
    description="Paste a Twitter/X video link and get the transcribed and translated speech using Whisper."
)

demo.launch(inbrowser=True) 


* Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.





=== Processing Twitter Video ===
Downloading video from: https://x.com/partidazocope/status/1907843479113240654
Extracting high-quality audio...
Transcribing audio using Whisper medium model...
Using cpu for Whisper transcription with medium model
Detected source language: ca
Translating to English...

=== Original Transcript ===
 Tornant a aquella versió, doncs el Barça sí que ha millorat. Gràcies. Sí. També et diria que crec que cap de les dues pannes ha sigut prou bona. I llavors, per molt que el gol hagi vingut en aquesta relació, que argumentes que va ser igual el dia de Wolfsburg, penso que vol dir que és una panna que ha sigut prou bona com per deixar el Madrid tancat a la seva àrea i a més. Avui buscàvem un pèl diferent. El Madrid ens defensa la seva banda dreta bastant tancada i llavors volíem estirar bastant a la frida allà per tenir algun avantatge i trobar bones seqüeles per davant, que és un dels més escriptus que té Frido. Com he dit, no ha sigut perfecte el nostre parti