# YouTube Transcription Pipeline with Self-Hosted Whisper

This notebook implements a pipeline that:
1. Downloads audio from YouTube videos
2. Uses locally hosted Whisper model for transcription and translation (no API costs)
3. Converts the output to PDF

## Setup and Requirements

- Python 3.8+
- FFmpeg (for audio processing)
- Required packages: yt-dlp, openai-whisper, torch, fpdf
- GPU recommended but not required (CPU will be slower)

## Paste the Youtube video link in the last cell


# Download the dependencies

In [None]:
# Install required packages in Jupyter Notebook
!pip install yt-dlp
!pip install -U openai-whisper
!pip install fpdf
!pip install torch  # Skip this if torch is already installed or if you're using GPU and need a specific version


# Install required dependencies
!pip install yt_dlp fpdf requests python-dotenv torch transformers openai-whisper tqdm


In [1]:
# Import required libraries
import os
import yt_dlp
import whisper
from fpdf import FPDF
import argparse
from pathlib import Path
import torch
from IPython.display import FileLink, display, Markdown


In [2]:
# Create output directories if they don't exist
os.makedirs("downloads", exist_ok=True)
os.makedirs("transcripts", exist_ok=True)


In [3]:
# Language options dictionary (language code: language name)
LANGUAGE_OPTIONS = {
    "en": "English",
    "es": "Spanish",
    "fr": "French",
    "de": "German",
    "it": "Italian",
    "pt": "Portuguese",
    "nl": "Dutch",
    "ru": "Russian",
    "zh": "Chinese",
    "ja": "Japanese",
    "ko": "Korean",
    "ar": "Arabic",
    "hi": "Hindi",
    "bn": "Bengali",
    "tr": "Turkish",
    "vi": "Vietnamese",
    "th": "Thai",
    "id": "Indonesian",
    "ms": "Malay",
    "fa": "Persian",
    "he": "Hebrew",
    "pl": "Polish",
    "cs": "Czech",
    "sv": "Swedish",
    "da": "Danish",
    "no": "Norwegian",
    "fi": "Finnish",
    "hu": "Hungarian",
    "el": "Greek",
    "ro": "Romanian",
    "uk": "Ukrainian"
}

print("Language dictionary loaded with", len(LANGUAGE_OPTIONS), "languages")


Language dictionary loaded with 31 languages


In [4]:
import re

def clean_repeated_phrases(text):
    """Clean up repeated phrases in the transcript text
    
    This function removes repetitive phrases like "Why not? Why not? Why not?"
    that might appear in the transcript, making it more readable.
    """
    # Common repeated phrases to look for
    common_phrases = [
        "Why not?", 
        "I don't know.", 
        "I love you.", 
        "I'm not sure.",
        "What is that?", 
        "I like you.", 
        "I'm a big fan of this movie.",
        "You are so smart."
    ]
    
    cleaned_text = text
    
    # Clean up common repeated phrases
    for phrase in common_phrases:
        # Replace 3+ consecutive occurrences with just one
        pattern = f"({re.escape(phrase)}\\s*){{3,}}"
        cleaned_text = re.sub(pattern, phrase + " ", cleaned_text)
        
        # Replace 2 consecutive occurrences with just one
        pattern = f"({re.escape(phrase)}\\s*){{2}}"
        cleaned_text = re.sub(pattern, phrase + " ", cleaned_text)
    
    return cleaned_text

# Test the function with a sample text
sample_text = "Why not? Why not? Why not? Why not? I don't know. I don't know. Something else. Why not?"
print("Original text:")
print(sample_text)
print("\nCleaned text:")
print(clean_repeated_phrases(sample_text))


Original text:
Why not? Why not? Why not? Why not? I don't know. I don't know. Something else. Why not?

Cleaned text:
Why not? I don't know. Something else. Why not?


## Select Whisper Model Size

Choose the Whisper model size to use for transcription. Larger models are more accurate but require more memory and computational resources:


In [5]:
# Select Whisper model size
# Available models: "tiny", "base", "small", "medium", "large"
# Larger models are more accurate but require more memory and compute
# Recommendation: Start with "base" or "small" for a balance of speed and accuracy
WHISPER_MODEL_SIZE = "tiny"  # Change this to your desired model size

# Print available devices for running the model
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

if device == "cuda":
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Available GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")

# Display model size info
model_sizes = {
    "tiny": {"parameters": "39M", "required_vram": "~1 GB", "english_only": False},
    "base": {"parameters": "74M", "required_vram": "~1 GB", "english_only": False},
    "small": {"parameters": "244M", "required_vram": "~2 GB", "english_only": False},
    "medium": {"parameters": "769M", "required_vram": "~5 GB", "english_only": False},
    "large": {"parameters": "1550M", "required_vram": "~10 GB", "english_only": False},
}

print(f"\nSelected model: {WHISPER_MODEL_SIZE}")
print(f"Model parameters: {model_sizes[WHISPER_MODEL_SIZE]['parameters']}")
print(f"Approx. required VRAM: {model_sizes[WHISPER_MODEL_SIZE]['required_vram']}")

# Check if selected model is reasonable for your device
if device == "cpu" and WHISPER_MODEL_SIZE in ["medium", "large"]:
    print("\nWARNING: You've selected a large model to run on CPU. This may be very slow.")
    print("Consider using a smaller model like 'base' or 'small' for better performance on CPU.")

# Load the model (this will download the model first time)
print(f"\nLoading Whisper model '{WHISPER_MODEL_SIZE}'...")
model = whisper.load_model(WHISPER_MODEL_SIZE, device=device)
print("Model loaded successfully!")


Using device: cpu

Selected model: tiny
Model parameters: 39M
Approx. required VRAM: ~1 GB

Loading Whisper model 'tiny'...
Model loaded successfully!


## Step 1: Define TranscriptionPipeline Class

Now we'll implement the main pipeline class with methods for downloading, transcribing, and creating PDFs:


In [6]:
class TranscriptionPipeline:
    def __init__(self, whisper_model=None):
        """Initialize the transcription pipeline with local Whisper model"""
        self.whisper_model = whisper_model or model  # Use the globally loaded model
        
        # Create output directories if they don't exist
        os.makedirs("downloads", exist_ok=True)
        os.makedirs("transcripts", exist_ok=True)
        
    def remove_consecutive_duplicates(self, text):
        """Remove consecutive duplicate lines from transcription text
        
        This function removes repeated lines that appear consecutively in the transcript.
        It preserves the timestamps and only removes content that's exactly the same.
        
        Example:
        [00:01.000 --> 00:02.000]  Hello world.
        [00:02.000 --> 00:03.000]  Hello world.
        [00:03.000 --> 00:04.000]  Hello world.
        
        Becomes:
        [00:01.000 --> 00:02.000]  Hello world.
        """
        if not text:
            return text
            
        lines = text.split('\n')
        if len(lines) <= 1:
            return text
            
        # Extract content after timestamps (the actual spoken text)
        def extract_content(line):
            parts = line.split(']  ')
            if len(parts) > 1:
                return parts[1].strip()
            return line.strip()
        
        result = [lines[0]]  # Always keep the first line
        
        for i in range(1, len(lines)):
            current_content = extract_content(lines[i])
            prev_content = extract_content(result[-1])
            
            # Only add the line if it's not a duplicate of the previous line
            if current_content != prev_content or current_content == "":
                result.append(lines[i])
                
        return '\n'.join(result)
    
    def download_audio(self, youtube_url):
        """Download audio from YouTube video"""
        print(f"Downloading audio from: {youtube_url}")
        
        def sanitize_filename(s):
            # Replace characters that are problematic in filenames
            chars_to_replace = ['<', '>', ':', '"', '/', '\\', '|', '?', '*']
            for char in chars_to_replace:
                s = s.replace(char, '_')
            return s
        
        ydl_opts = {
            'format': 'bestaudio/best',
            'outtmpl': 'downloads/%(title)s.%(ext)s',
            'restrictfilenames': True,  # Restrict filenames to ASCII chars
            'postprocessors': [{
                'key': 'FFmpegExtractAudio',
                'preferredcodec': 'mp3',
                'preferredquality': '192',
            }],
        }
        
        with yt_dlp.YoutubeDL(ydl_opts) as ydl:
            info = ydl.extract_info(youtube_url, download=True)
            # Create a sanitized version of the title for the file path
            safe_title = sanitize_filename(info['title'])
            audio_file = f"downloads/{safe_title}.mp3"
            print(f"Audio downloaded: {audio_file}")
            return audio_file, info['title']
    
    def detect_language(self, audio_file):
        """Detect the language of the audio file"""
        print("Detecting original language...")
        
        # Check if file exists
        if not os.path.exists(audio_file):
            # Try to find the actual file that might have been renamed by yt-dlp
            possible_files = [f for f in os.listdir("downloads") if f.endswith(".mp3")]
            if possible_files:
                print(f"Original file path '{audio_file}' not found. Using '{os.path.join('downloads', possible_files[0])}'")
                audio_file = os.path.join("downloads", possible_files[0])
            else:
                raise FileNotFoundError(f"Audio file not found: {audio_file}")
        
        # Load audio and detect language
        audio = whisper.load_audio(audio_file)
        audio = whisper.pad_or_trim(audio)
        mel = whisper.log_mel_spectrogram(audio).to(self.whisper_model.device)
        _, probs = self.whisper_model.detect_language(mel)
        detected_lang = max(probs, key=probs.get)
        
        # Get language name if available
        lang_name = LANGUAGE_OPTIONS.get(detected_lang, "Unknown")
        print(f"Detected language: {detected_lang} ({lang_name}) - confidence: {probs[detected_lang]:.2f}")
        
        return detected_lang, probs[detected_lang]
    
    def transcribe_audio(self, audio_file, target_language):
        """Transcribe audio using local Whisper model and translate to target language"""
        print(f"Transcribing audio to {target_language} using local Whisper model...")
        
        # Check if file exists
        if not os.path.exists(audio_file):
            # Try to find the actual file that might have been renamed by yt-dlp
            possible_files = [f for f in os.listdir("downloads") if f.endswith(".mp3")]
            if possible_files:
                print(f"Original file path '{audio_file}' not found. Using '{os.path.join('downloads', possible_files[0])}'")
                audio_file = os.path.join("downloads", possible_files[0])
            else:
                raise FileNotFoundError(f"Audio file not found: {audio_file}")
        
        # First detect the language
        detected_lang, confidence = self.detect_language(audio_file)
        
        # Use this for translation to target language
        if target_language != "en":
            print(f"Translating from {detected_lang} to {target_language}...")
            # whisper.decode can translate to different languages
            result = self.whisper_model.transcribe(
                audio_file,
                task="translate",
                language=target_language,
                verbose=True
            )
        else:
            # Regular transcription in original language
            result = self.whisper_model.transcribe(
                audio_file,
                language=target_language,
                verbose=True
            )
        
        # Process the transcription text to remove consecutive duplicates
        cleaned_text = self.remove_consecutive_duplicates(result["text"])
        
        # Add detected language info to the result
        result_with_lang = {
            "text": cleaned_text,
            "detected_language": detected_lang,
            "confidence": confidence
        }
            
        # Return the text transcription and detected language
        return result_with_lang
    
    def create_pdf(self, transcript_result, title, target_language, output_file=None):
        """Generate PDF from transcription text"""
        # Extract text and language info
        text = transcript_result["text"]
        detected_lang = transcript_result["detected_language"]
        
        # Apply additional cleaning to remove repeated phrases
        text = clean_repeated_phrases(text)
        
        # Create a sanitized version of the title for the filename
        safe_title = "".join([c if c.isalnum() else "_" for c in title])
        
        if output_file is None:
            # Use the video title for the PDF filename
            output_file = f"transcripts/{safe_title}_{target_language}.pdf"
        
        print(f"Creating PDF: {output_file}")
        
        # Get language names
        source_lang_name = LANGUAGE_OPTIONS.get(detected_lang, "Unknown")
        target_lang_name = LANGUAGE_OPTIONS.get(target_language, "Unknown")
        
        try:
            # Clean the text to remove problematic characters
            clean_text = ""
            for char in text:
                try:
                    # Test if character can be encoded
                    char.encode('latin-1')
                    clean_text += char
                except UnicodeEncodeError:
                    # Replace characters that can't be encoded with a space
                    clean_text += " "
            
            # Initialize PDF
            pdf = FPDF()
            pdf.add_page()
            pdf.set_auto_page_break(auto=True, margin=15)
            
            # Add title with proper formatting for long titles
            pdf.set_font("Arial", "B", 16)
            
            # Format the title to handle long titles
            full_title = f"Transcript: {title}"
            # Calculate maximum width for title (page width minus margins)
            max_width = pdf.w - 40  # 20mm margin on each side
            
            # If title is too long, use multi_cell instead of cell
            if pdf.get_string_width(full_title) > max_width:
                pdf.multi_cell(0, 10, full_title, align="C")
            else:
                pdf.cell(0, 10, full_title, ln=True, align="C")
            
            pdf.ln(10)
            
            # Add language info
            pdf.set_font("Arial", "I", 12)
            pdf.cell(0, 10, f"Original language: {detected_lang} ({source_lang_name})", ln=True)
            pdf.cell(0, 10, f"Output language: {target_language} ({target_lang_name})", ln=True)
            pdf.ln(5)
            
            # Add transcript text
            pdf.set_font("Arial", "", 12)
            
            # Split text into lines and add to PDF
            pdf.multi_cell(0, 10, clean_text)
            
            # Save PDF
            pdf.output(output_file)
            
            return output_file
            
        except Exception as e:
            print(f"Error creating PDF: {str(e)}")
            
            # Fallback: Save as text file instead
            txt_file = output_file.replace(".pdf", ".txt")
            print(f"Saving as text file instead: {txt_file}")
            
            with open(txt_file, "w", encoding="utf-8") as f:
                f.write(f"Transcript: {title}\n\n")
                f.write(f"Original language: {detected_lang} ({source_lang_name})\n")
                f.write(f"Output language: {target_language} ({target_lang_name})\n\n")
                f.write(text)
            
            return txt_file
    
    def process_video(self, youtube_url, target_language):
        """Process a YouTube video: download, transcribe, and create PDF"""
        try:
            # Step 1: Download audio
            print("Step 1/3: Downloading audio...")
            audio_file, title = self.download_audio(youtube_url)
            
            # Step 2: Transcribe and translate audio
            print("Step 2/3: Transcribing audio...")
            transcript_result = self.transcribe_audio(audio_file, target_language)
            
            # Step 3: Create PDF
            print("Step 3/3: Creating PDF...")
            pdf_file = self.create_pdf(transcript_result, title, target_language)
            
            # Display language information
            detected_lang = transcript_result["detected_language"]
            source_lang_name = LANGUAGE_OPTIONS.get(detected_lang, "Unknown")
            target_lang_name = LANGUAGE_OPTIONS.get(target_language, "Unknown")
            
            print(f"Video language: {detected_lang} ({source_lang_name})")
            print(f"Output language: {target_language} ({target_lang_name})")
            print(f"Process completed successfully! PDF saved at: {pdf_file}")
            
            return pdf_file
            
        except Exception as e:
            print(f"Error in pipeline: {str(e)}")
            return None


## Step 2: Run the Pipeline

Now let's use our pipeline class to process a YouTube video:


In [7]:
# Initialize the transcription pipeline with the local Whisper model
pipeline = TranscriptionPipeline(whisper_model=model)

# Set the YouTube URL and target language
youtube_url = "https://www.youtube.com/watch?v=3dKSBfRMmdU&t=1s"  # Replace with your desired YouTube video

# Use the selected language from the dropdown above
try:
    # Use the selected language from the dropdown if available
    target_language = selected_language
except NameError:
    # Fall back to default if dropdown wasn't used
    target_language = "en"  # Language code (e.g., 'en', 'es', 'fr', 'de', 'ja', etc.)

print(f"Using language: {target_language} ({LANGUAGE_OPTIONS.get(target_language, 'Unknown')})")

# Run the pipeline
print("Starting transcription pipeline...")
pdf_path = pipeline.process_video(youtube_url, target_language)

# Display link to the generated PDF if successful
if pdf_path:
    display(Markdown(f"**PDF generated successfully!** [Open PDF]({pdf_path})"))


Deprecated Feature: Support for Python version 3.8 has been deprecated. Please update to Python 3.9 or above


Using language: en (English)
Starting transcription pipeline...
Step 1/3: Downloading audio...
Downloading audio from: https://www.youtube.com/watch?v=3dKSBfRMmdU&t=1s
[youtube] Extracting URL: https://www.youtube.com/watch?v=3dKSBfRMmdU&t=1s
[youtube] 3dKSBfRMmdU: Downloading webpage
[youtube] 3dKSBfRMmdU: Downloading ios player API JSON
[youtube] 3dKSBfRMmdU: Downloading mweb player API JSON
[youtube] 3dKSBfRMmdU: Downloading player 94f771d8


         player = https://www.youtube.com/s/player/94f771d8/player_ias.vflset/en_US/base.js
         n = Roynya_o9oQ7C4hxc ; player = https://www.youtube.com/s/player/94f771d8/player_ias.vflset/en_US/base.js
         player = https://www.youtube.com/s/player/94f771d8/player_ias.vflset/en_US/base.js
         n = ead6d5-LpEgnd-zmH ; player = https://www.youtube.com/s/player/94f771d8/player_ias.vflset/en_US/base.js


[youtube] 3dKSBfRMmdU: Downloading m3u8 information
[info] 3dKSBfRMmdU: Downloading 1 format(s): 251
[download] Destination: downloads\MOST_CRINGE_FEED_EVARIDI_FIRST_VIDEO_II_KAKARAKAYTALKS.webm
[download] 100% of   12.63MiB in 00:00:03 at 3.85MiB/s     
[ExtractAudio] Destination: downloads\MOST_CRINGE_FEED_EVARIDI_FIRST_VIDEO_II_KAKARAKAYTALKS.mp3
Deleting original file downloads\MOST_CRINGE_FEED_EVARIDI_FIRST_VIDEO_II_KAKARAKAYTALKS.webm (pass -k to keep)
Audio downloaded: downloads/MOST CRINGE FEED EVARIDI___ __ FIRST VIDEO II KAKARAKAYTALKS.mp3
Step 2/3: Transcribing audio...
Transcribing audio to en using local Whisper model...
Original file path 'downloads/MOST CRINGE FEED EVARIDI___ __ FIRST VIDEO II KAKARAKAYTALKS.mp3' not found. Using 'downloads\MOST CRINGE FEED EVARIDI？？？ ｜｜ FIRST VIDEO II KAKARAKAYTALKS.mp3'
Detecting original language...
Detected language: te (Unknown) - confidence: 0.73




[00:00.000 --> 00:05.000]  Hi friends welcome to our channel, get up,
[00:05.000 --> 00:07.000]  What are you doing here?
[00:07.000 --> 00:09.000]  I am going to do the first thing I do.
[00:09.000 --> 00:14.000]  I am going to do my first youtube video of French couple 9 coverage of the star and dates.
[00:14.000 --> 00:17.000]  I was going to do a lot of things.
[00:17.000 --> 00:20.000]  I was going to do a lot of things.
[00:20.000 --> 00:23.000]  I have a lot of things to do.
[00:23.000 --> 00:25.000]  Why not? Why not?
[00:26.000 --> 00:28.000]  Why not? Why not?
[00:28.000 --> 00:30.000]  Why not? Why not?
[00:30.000 --> 00:32.000]  Why not?
[00:32.000 --> 00:34.000]  Why not?
[00:34.000 --> 00:35.000]  Why not?
[00:35.000 --> 00:36.000]  Why not?
[00:36.000 --> 00:37.000]  Why not?
[00:37.000 --> 00:38.000]  Why not?
[00:38.000 --> 00:39.000]  Why not?
[00:39.000 --> 00:40.000]  Why not?
[00:40.000 --> 00:41.000]  I did a mistake.
[00:41.000 --> 00:43.000]  Okay, I will see.
[

**PDF generated successfully!** [Open PDF](transcripts/MOST_CRINGE_FEED_EVARIDI_______FIRST_VIDEO_II_KAKARAKAYTALKS_en.pdf)

## Optional: Batch Processing Multiple Videos

If you want to process multiple videos at once:


In [8]:
# # List of YouTube URLs and their target languages
# # Use our language options dictionary to select desired languages
# videos_to_process = [
#     {"url": "https://www.youtube.com/watch?v=dQw4w9WgXcQ", "language": "en"},  # English
#     {"url": "https://www.youtube.com/watch?v=VIDEO_ID_2", "language": "es"},   # Spanish
#     # Add more videos as needed
#     # You can use any language code from LANGUAGE_OPTIONS dictionary
# ]

# # Print available languages as a reminder
# print("Available languages for batch processing:")
# print(", ".join([f"{code}: {name}" for code, name in list(LANGUAGE_OPTIONS.items())[:10]]))
# print(f"... and {len(LANGUAGE_OPTIONS) - 10} more languages (see language options cell above)")

# # Process each video
# results = []
# for video in videos_to_process:
#     print(f"\nProcessing video: {video['url']} in {video['language']}")
#     pdf_path = pipeline.process_video(video["url"], video["language"])
#     results.append({
#         "url": video["url"],
#         "language": video["language"],
#         "success": pdf_path is not None,
#         "pdf_path": pdf_path
#     })

# # Display results
# print("\nProcessing Results:")
# for i, result in enumerate(results, 1):
#     status = "✅ Success" if result["success"] else "❌ Failed"
#     print(f"{i}. {status} - {result['url']} ({result['language']})")
#     if result["success"]:
#         print(f"   PDF: {result['pdf_path']}")
